Commit d618a27c7808608e376de803a4fd3940f33776c2

Authored by Mel Gorman
Committed by Jiri Slaby
1 parent 967e64285a

mm: non-atomically mark page accessed during page cache allocation where possible

commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream.

aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after.  Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage.  The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page.  This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.

The primary APIs of concern in this care are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not.  Then
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job.  There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted.  This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change.  It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations.  The size of the
file is 1/10th physical memory to avoid dirty page balancing.  In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO.  The sync results are expected to be
more stable.  The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts.  Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison.  As async results were variable
do to writback timings, I'm only reporting the maximum figures.  The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

        samples percentage
ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

Showing 17 changed files with 219 additions and 162 deletions Inline Diff

fs/btrfs/extent_io.c
1 #include <linux/bitops.h> 1 #include <linux/bitops.h>
2 #include <linux/slab.h> 2 #include <linux/slab.h>
3 #include <linux/bio.h> 3 #include <linux/bio.h>
4 #include <linux/mm.h> 4 #include <linux/mm.h>
5 #include <linux/pagemap.h> 5 #include <linux/pagemap.h>
6 #include <linux/page-flags.h> 6 #include <linux/page-flags.h>
7 #include <linux/spinlock.h> 7 #include <linux/spinlock.h>
8 #include <linux/blkdev.h> 8 #include <linux/blkdev.h>
9 #include <linux/swap.h> 9 #include <linux/swap.h>
10 #include <linux/writeback.h> 10 #include <linux/writeback.h>
11 #include <linux/pagevec.h> 11 #include <linux/pagevec.h>
12 #include <linux/prefetch.h> 12 #include <linux/prefetch.h>
13 #include <linux/cleancache.h> 13 #include <linux/cleancache.h>
14 #include "extent_io.h" 14 #include "extent_io.h"
15 #include "extent_map.h" 15 #include "extent_map.h"
16 #include "compat.h" 16 #include "compat.h"
17 #include "ctree.h" 17 #include "ctree.h"
18 #include "btrfs_inode.h" 18 #include "btrfs_inode.h"
19 #include "volumes.h" 19 #include "volumes.h"
20 #include "check-integrity.h" 20 #include "check-integrity.h"
21 #include "locking.h" 21 #include "locking.h"
22 #include "rcu-string.h" 22 #include "rcu-string.h"
23 23
24 static struct kmem_cache *extent_state_cache; 24 static struct kmem_cache *extent_state_cache;
25 static struct kmem_cache *extent_buffer_cache; 25 static struct kmem_cache *extent_buffer_cache;
26 static struct bio_set *btrfs_bioset; 26 static struct bio_set *btrfs_bioset;
27 27
28 #ifdef CONFIG_BTRFS_DEBUG 28 #ifdef CONFIG_BTRFS_DEBUG
29 static LIST_HEAD(buffers); 29 static LIST_HEAD(buffers);
30 static LIST_HEAD(states); 30 static LIST_HEAD(states);
31 31
32 static DEFINE_SPINLOCK(leak_lock); 32 static DEFINE_SPINLOCK(leak_lock);
33 33
34 static inline 34 static inline
35 void btrfs_leak_debug_add(struct list_head *new, struct list_head *head) 35 void btrfs_leak_debug_add(struct list_head *new, struct list_head *head)
36 { 36 {
37 unsigned long flags; 37 unsigned long flags;
38 38
39 spin_lock_irqsave(&leak_lock, flags); 39 spin_lock_irqsave(&leak_lock, flags);
40 list_add(new, head); 40 list_add(new, head);
41 spin_unlock_irqrestore(&leak_lock, flags); 41 spin_unlock_irqrestore(&leak_lock, flags);
42 } 42 }
43 43
44 static inline 44 static inline
45 void btrfs_leak_debug_del(struct list_head *entry) 45 void btrfs_leak_debug_del(struct list_head *entry)
46 { 46 {
47 unsigned long flags; 47 unsigned long flags;
48 48
49 spin_lock_irqsave(&leak_lock, flags); 49 spin_lock_irqsave(&leak_lock, flags);
50 list_del(entry); 50 list_del(entry);
51 spin_unlock_irqrestore(&leak_lock, flags); 51 spin_unlock_irqrestore(&leak_lock, flags);
52 } 52 }
53 53
54 static inline 54 static inline
55 void btrfs_leak_debug_check(void) 55 void btrfs_leak_debug_check(void)
56 { 56 {
57 struct extent_state *state; 57 struct extent_state *state;
58 struct extent_buffer *eb; 58 struct extent_buffer *eb;
59 59
60 while (!list_empty(&states)) { 60 while (!list_empty(&states)) {
61 state = list_entry(states.next, struct extent_state, leak_list); 61 state = list_entry(states.next, struct extent_state, leak_list);
62 printk(KERN_ERR "btrfs state leak: start %llu end %llu " 62 printk(KERN_ERR "btrfs state leak: start %llu end %llu "
63 "state %lu in tree %p refs %d\n", 63 "state %lu in tree %p refs %d\n",
64 state->start, state->end, state->state, state->tree, 64 state->start, state->end, state->state, state->tree,
65 atomic_read(&state->refs)); 65 atomic_read(&state->refs));
66 list_del(&state->leak_list); 66 list_del(&state->leak_list);
67 kmem_cache_free(extent_state_cache, state); 67 kmem_cache_free(extent_state_cache, state);
68 } 68 }
69 69
70 while (!list_empty(&buffers)) { 70 while (!list_empty(&buffers)) {
71 eb = list_entry(buffers.next, struct extent_buffer, leak_list); 71 eb = list_entry(buffers.next, struct extent_buffer, leak_list);
72 printk(KERN_ERR "btrfs buffer leak start %llu len %lu " 72 printk(KERN_ERR "btrfs buffer leak start %llu len %lu "
73 "refs %d\n", 73 "refs %d\n",
74 eb->start, eb->len, atomic_read(&eb->refs)); 74 eb->start, eb->len, atomic_read(&eb->refs));
75 list_del(&eb->leak_list); 75 list_del(&eb->leak_list);
76 kmem_cache_free(extent_buffer_cache, eb); 76 kmem_cache_free(extent_buffer_cache, eb);
77 } 77 }
78 } 78 }
79 79
80 #define btrfs_debug_check_extent_io_range(inode, start, end) \ 80 #define btrfs_debug_check_extent_io_range(inode, start, end) \
81 __btrfs_debug_check_extent_io_range(__func__, (inode), (start), (end)) 81 __btrfs_debug_check_extent_io_range(__func__, (inode), (start), (end))
82 static inline void __btrfs_debug_check_extent_io_range(const char *caller, 82 static inline void __btrfs_debug_check_extent_io_range(const char *caller,
83 struct inode *inode, u64 start, u64 end) 83 struct inode *inode, u64 start, u64 end)
84 { 84 {
85 u64 isize = i_size_read(inode); 85 u64 isize = i_size_read(inode);
86 86
87 if (end >= PAGE_SIZE && (end % 2) == 0 && end != isize - 1) { 87 if (end >= PAGE_SIZE && (end % 2) == 0 && end != isize - 1) {
88 printk_ratelimited(KERN_DEBUG 88 printk_ratelimited(KERN_DEBUG
89 "btrfs: %s: ino %llu isize %llu odd range [%llu,%llu]\n", 89 "btrfs: %s: ino %llu isize %llu odd range [%llu,%llu]\n",
90 caller, btrfs_ino(inode), isize, start, end); 90 caller, btrfs_ino(inode), isize, start, end);
91 } 91 }
92 } 92 }
93 #else 93 #else
94 #define btrfs_leak_debug_add(new, head) do {} while (0) 94 #define btrfs_leak_debug_add(new, head) do {} while (0)
95 #define btrfs_leak_debug_del(entry) do {} while (0) 95 #define btrfs_leak_debug_del(entry) do {} while (0)
96 #define btrfs_leak_debug_check() do {} while (0) 96 #define btrfs_leak_debug_check() do {} while (0)
97 #define btrfs_debug_check_extent_io_range(c, s, e) do {} while (0) 97 #define btrfs_debug_check_extent_io_range(c, s, e) do {} while (0)
98 #endif 98 #endif
99 99
100 #define BUFFER_LRU_MAX 64 100 #define BUFFER_LRU_MAX 64
101 101
102 struct tree_entry { 102 struct tree_entry {
103 u64 start; 103 u64 start;
104 u64 end; 104 u64 end;
105 struct rb_node rb_node; 105 struct rb_node rb_node;
106 }; 106 };
107 107
108 struct extent_page_data { 108 struct extent_page_data {
109 struct bio *bio; 109 struct bio *bio;
110 struct extent_io_tree *tree; 110 struct extent_io_tree *tree;
111 get_extent_t *get_extent; 111 get_extent_t *get_extent;
112 unsigned long bio_flags; 112 unsigned long bio_flags;
113 113
114 /* tells writepage not to lock the state bits for this range 114 /* tells writepage not to lock the state bits for this range
115 * it still does the unlocking 115 * it still does the unlocking
116 */ 116 */
117 unsigned int extent_locked:1; 117 unsigned int extent_locked:1;
118 118
119 /* tells the submit_bio code to use a WRITE_SYNC */ 119 /* tells the submit_bio code to use a WRITE_SYNC */
120 unsigned int sync_io:1; 120 unsigned int sync_io:1;
121 }; 121 };
122 122
123 static noinline void flush_write_bio(void *data); 123 static noinline void flush_write_bio(void *data);
124 static inline struct btrfs_fs_info * 124 static inline struct btrfs_fs_info *
125 tree_fs_info(struct extent_io_tree *tree) 125 tree_fs_info(struct extent_io_tree *tree)
126 { 126 {
127 return btrfs_sb(tree->mapping->host->i_sb); 127 return btrfs_sb(tree->mapping->host->i_sb);
128 } 128 }
129 129
130 int __init extent_io_init(void) 130 int __init extent_io_init(void)
131 { 131 {
132 extent_state_cache = kmem_cache_create("btrfs_extent_state", 132 extent_state_cache = kmem_cache_create("btrfs_extent_state",
133 sizeof(struct extent_state), 0, 133 sizeof(struct extent_state), 0,
134 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL); 134 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
135 if (!extent_state_cache) 135 if (!extent_state_cache)
136 return -ENOMEM; 136 return -ENOMEM;
137 137
138 extent_buffer_cache = kmem_cache_create("btrfs_extent_buffer", 138 extent_buffer_cache = kmem_cache_create("btrfs_extent_buffer",
139 sizeof(struct extent_buffer), 0, 139 sizeof(struct extent_buffer), 0,
140 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL); 140 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
141 if (!extent_buffer_cache) 141 if (!extent_buffer_cache)
142 goto free_state_cache; 142 goto free_state_cache;
143 143
144 btrfs_bioset = bioset_create(BIO_POOL_SIZE, 144 btrfs_bioset = bioset_create(BIO_POOL_SIZE,
145 offsetof(struct btrfs_io_bio, bio)); 145 offsetof(struct btrfs_io_bio, bio));
146 if (!btrfs_bioset) 146 if (!btrfs_bioset)
147 goto free_buffer_cache; 147 goto free_buffer_cache;
148 148
149 if (bioset_integrity_create(btrfs_bioset, BIO_POOL_SIZE)) 149 if (bioset_integrity_create(btrfs_bioset, BIO_POOL_SIZE))
150 goto free_bioset; 150 goto free_bioset;
151 151
152 return 0; 152 return 0;
153 153
154 free_bioset: 154 free_bioset:
155 bioset_free(btrfs_bioset); 155 bioset_free(btrfs_bioset);
156 btrfs_bioset = NULL; 156 btrfs_bioset = NULL;
157 157
158 free_buffer_cache: 158 free_buffer_cache:
159 kmem_cache_destroy(extent_buffer_cache); 159 kmem_cache_destroy(extent_buffer_cache);
160 extent_buffer_cache = NULL; 160 extent_buffer_cache = NULL;
161 161
162 free_state_cache: 162 free_state_cache:
163 kmem_cache_destroy(extent_state_cache); 163 kmem_cache_destroy(extent_state_cache);
164 extent_state_cache = NULL; 164 extent_state_cache = NULL;
165 return -ENOMEM; 165 return -ENOMEM;
166 } 166 }
167 167
168 void extent_io_exit(void) 168 void extent_io_exit(void)
169 { 169 {
170 btrfs_leak_debug_check(); 170 btrfs_leak_debug_check();
171 171
172 /* 172 /*
173 * Make sure all delayed rcu free are flushed before we 173 * Make sure all delayed rcu free are flushed before we
174 * destroy caches. 174 * destroy caches.
175 */ 175 */
176 rcu_barrier(); 176 rcu_barrier();
177 if (extent_state_cache) 177 if (extent_state_cache)
178 kmem_cache_destroy(extent_state_cache); 178 kmem_cache_destroy(extent_state_cache);
179 if (extent_buffer_cache) 179 if (extent_buffer_cache)
180 kmem_cache_destroy(extent_buffer_cache); 180 kmem_cache_destroy(extent_buffer_cache);
181 if (btrfs_bioset) 181 if (btrfs_bioset)
182 bioset_free(btrfs_bioset); 182 bioset_free(btrfs_bioset);
183 } 183 }
184 184
185 void extent_io_tree_init(struct extent_io_tree *tree, 185 void extent_io_tree_init(struct extent_io_tree *tree,
186 struct address_space *mapping) 186 struct address_space *mapping)
187 { 187 {
188 tree->state = RB_ROOT; 188 tree->state = RB_ROOT;
189 INIT_RADIX_TREE(&tree->buffer, GFP_ATOMIC); 189 INIT_RADIX_TREE(&tree->buffer, GFP_ATOMIC);
190 tree->ops = NULL; 190 tree->ops = NULL;
191 tree->dirty_bytes = 0; 191 tree->dirty_bytes = 0;
192 spin_lock_init(&tree->lock); 192 spin_lock_init(&tree->lock);
193 spin_lock_init(&tree->buffer_lock); 193 spin_lock_init(&tree->buffer_lock);
194 tree->mapping = mapping; 194 tree->mapping = mapping;
195 } 195 }
196 196
197 static struct extent_state *alloc_extent_state(gfp_t mask) 197 static struct extent_state *alloc_extent_state(gfp_t mask)
198 { 198 {
199 struct extent_state *state; 199 struct extent_state *state;
200 200
201 state = kmem_cache_alloc(extent_state_cache, mask); 201 state = kmem_cache_alloc(extent_state_cache, mask);
202 if (!state) 202 if (!state)
203 return state; 203 return state;
204 state->state = 0; 204 state->state = 0;
205 state->private = 0; 205 state->private = 0;
206 state->tree = NULL; 206 state->tree = NULL;
207 btrfs_leak_debug_add(&state->leak_list, &states); 207 btrfs_leak_debug_add(&state->leak_list, &states);
208 atomic_set(&state->refs, 1); 208 atomic_set(&state->refs, 1);
209 init_waitqueue_head(&state->wq); 209 init_waitqueue_head(&state->wq);
210 trace_alloc_extent_state(state, mask, _RET_IP_); 210 trace_alloc_extent_state(state, mask, _RET_IP_);
211 return state; 211 return state;
212 } 212 }
213 213
214 void free_extent_state(struct extent_state *state) 214 void free_extent_state(struct extent_state *state)
215 { 215 {
216 if (!state) 216 if (!state)
217 return; 217 return;
218 if (atomic_dec_and_test(&state->refs)) { 218 if (atomic_dec_and_test(&state->refs)) {
219 WARN_ON(state->tree); 219 WARN_ON(state->tree);
220 btrfs_leak_debug_del(&state->leak_list); 220 btrfs_leak_debug_del(&state->leak_list);
221 trace_free_extent_state(state, _RET_IP_); 221 trace_free_extent_state(state, _RET_IP_);
222 kmem_cache_free(extent_state_cache, state); 222 kmem_cache_free(extent_state_cache, state);
223 } 223 }
224 } 224 }
225 225
226 static struct rb_node *tree_insert(struct rb_root *root, u64 offset, 226 static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
227 struct rb_node *node) 227 struct rb_node *node)
228 { 228 {
229 struct rb_node **p = &root->rb_node; 229 struct rb_node **p = &root->rb_node;
230 struct rb_node *parent = NULL; 230 struct rb_node *parent = NULL;
231 struct tree_entry *entry; 231 struct tree_entry *entry;
232 232
233 while (*p) { 233 while (*p) {
234 parent = *p; 234 parent = *p;
235 entry = rb_entry(parent, struct tree_entry, rb_node); 235 entry = rb_entry(parent, struct tree_entry, rb_node);
236 236
237 if (offset < entry->start) 237 if (offset < entry->start)
238 p = &(*p)->rb_left; 238 p = &(*p)->rb_left;
239 else if (offset > entry->end) 239 else if (offset > entry->end)
240 p = &(*p)->rb_right; 240 p = &(*p)->rb_right;
241 else 241 else
242 return parent; 242 return parent;
243 } 243 }
244 244
245 rb_link_node(node, parent, p); 245 rb_link_node(node, parent, p);
246 rb_insert_color(node, root); 246 rb_insert_color(node, root);
247 return NULL; 247 return NULL;
248 } 248 }
249 249
250 static struct rb_node *__etree_search(struct extent_io_tree *tree, u64 offset, 250 static struct rb_node *__etree_search(struct extent_io_tree *tree, u64 offset,
251 struct rb_node **prev_ret, 251 struct rb_node **prev_ret,
252 struct rb_node **next_ret) 252 struct rb_node **next_ret)
253 { 253 {
254 struct rb_root *root = &tree->state; 254 struct rb_root *root = &tree->state;
255 struct rb_node *n = root->rb_node; 255 struct rb_node *n = root->rb_node;
256 struct rb_node *prev = NULL; 256 struct rb_node *prev = NULL;
257 struct rb_node *orig_prev = NULL; 257 struct rb_node *orig_prev = NULL;
258 struct tree_entry *entry; 258 struct tree_entry *entry;
259 struct tree_entry *prev_entry = NULL; 259 struct tree_entry *prev_entry = NULL;
260 260
261 while (n) { 261 while (n) {
262 entry = rb_entry(n, struct tree_entry, rb_node); 262 entry = rb_entry(n, struct tree_entry, rb_node);
263 prev = n; 263 prev = n;
264 prev_entry = entry; 264 prev_entry = entry;
265 265
266 if (offset < entry->start) 266 if (offset < entry->start)
267 n = n->rb_left; 267 n = n->rb_left;
268 else if (offset > entry->end) 268 else if (offset > entry->end)
269 n = n->rb_right; 269 n = n->rb_right;
270 else 270 else
271 return n; 271 return n;
272 } 272 }
273 273
274 if (prev_ret) { 274 if (prev_ret) {
275 orig_prev = prev; 275 orig_prev = prev;
276 while (prev && offset > prev_entry->end) { 276 while (prev && offset > prev_entry->end) {
277 prev = rb_next(prev); 277 prev = rb_next(prev);
278 prev_entry = rb_entry(prev, struct tree_entry, rb_node); 278 prev_entry = rb_entry(prev, struct tree_entry, rb_node);
279 } 279 }
280 *prev_ret = prev; 280 *prev_ret = prev;
281 prev = orig_prev; 281 prev = orig_prev;
282 } 282 }
283 283
284 if (next_ret) { 284 if (next_ret) {
285 prev_entry = rb_entry(prev, struct tree_entry, rb_node); 285 prev_entry = rb_entry(prev, struct tree_entry, rb_node);
286 while (prev && offset < prev_entry->start) { 286 while (prev && offset < prev_entry->start) {
287 prev = rb_prev(prev); 287 prev = rb_prev(prev);
288 prev_entry = rb_entry(prev, struct tree_entry, rb_node); 288 prev_entry = rb_entry(prev, struct tree_entry, rb_node);
289 } 289 }
290 *next_ret = prev; 290 *next_ret = prev;
291 } 291 }
292 return NULL; 292 return NULL;
293 } 293 }
294 294
295 static inline struct rb_node *tree_search(struct extent_io_tree *tree, 295 static inline struct rb_node *tree_search(struct extent_io_tree *tree,
296 u64 offset) 296 u64 offset)
297 { 297 {
298 struct rb_node *prev = NULL; 298 struct rb_node *prev = NULL;
299 struct rb_node *ret; 299 struct rb_node *ret;
300 300
301 ret = __etree_search(tree, offset, &prev, NULL); 301 ret = __etree_search(tree, offset, &prev, NULL);
302 if (!ret) 302 if (!ret)
303 return prev; 303 return prev;
304 return ret; 304 return ret;
305 } 305 }
306 306
307 static void merge_cb(struct extent_io_tree *tree, struct extent_state *new, 307 static void merge_cb(struct extent_io_tree *tree, struct extent_state *new,
308 struct extent_state *other) 308 struct extent_state *other)
309 { 309 {
310 if (tree->ops && tree->ops->merge_extent_hook) 310 if (tree->ops && tree->ops->merge_extent_hook)
311 tree->ops->merge_extent_hook(tree->mapping->host, new, 311 tree->ops->merge_extent_hook(tree->mapping->host, new,
312 other); 312 other);
313 } 313 }
314 314
315 /* 315 /*
316 * utility function to look for merge candidates inside a given range. 316 * utility function to look for merge candidates inside a given range.
317 * Any extents with matching state are merged together into a single 317 * Any extents with matching state are merged together into a single
318 * extent in the tree. Extents with EXTENT_IO in their state field 318 * extent in the tree. Extents with EXTENT_IO in their state field
319 * are not merged because the end_io handlers need to be able to do 319 * are not merged because the end_io handlers need to be able to do
320 * operations on them without sleeping (or doing allocations/splits). 320 * operations on them without sleeping (or doing allocations/splits).
321 * 321 *
322 * This should be called with the tree lock held. 322 * This should be called with the tree lock held.
323 */ 323 */
324 static void merge_state(struct extent_io_tree *tree, 324 static void merge_state(struct extent_io_tree *tree,
325 struct extent_state *state) 325 struct extent_state *state)
326 { 326 {
327 struct extent_state *other; 327 struct extent_state *other;
328 struct rb_node *other_node; 328 struct rb_node *other_node;
329 329
330 if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) 330 if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY))
331 return; 331 return;
332 332
333 other_node = rb_prev(&state->rb_node); 333 other_node = rb_prev(&state->rb_node);
334 if (other_node) { 334 if (other_node) {
335 other = rb_entry(other_node, struct extent_state, rb_node); 335 other = rb_entry(other_node, struct extent_state, rb_node);
336 if (other->end == state->start - 1 && 336 if (other->end == state->start - 1 &&
337 other->state == state->state) { 337 other->state == state->state) {
338 merge_cb(tree, state, other); 338 merge_cb(tree, state, other);
339 state->start = other->start; 339 state->start = other->start;
340 other->tree = NULL; 340 other->tree = NULL;
341 rb_erase(&other->rb_node, &tree->state); 341 rb_erase(&other->rb_node, &tree->state);
342 free_extent_state(other); 342 free_extent_state(other);
343 } 343 }
344 } 344 }
345 other_node = rb_next(&state->rb_node); 345 other_node = rb_next(&state->rb_node);
346 if (other_node) { 346 if (other_node) {
347 other = rb_entry(other_node, struct extent_state, rb_node); 347 other = rb_entry(other_node, struct extent_state, rb_node);
348 if (other->start == state->end + 1 && 348 if (other->start == state->end + 1 &&
349 other->state == state->state) { 349 other->state == state->state) {
350 merge_cb(tree, state, other); 350 merge_cb(tree, state, other);
351 state->end = other->end; 351 state->end = other->end;
352 other->tree = NULL; 352 other->tree = NULL;
353 rb_erase(&other->rb_node, &tree->state); 353 rb_erase(&other->rb_node, &tree->state);
354 free_extent_state(other); 354 free_extent_state(other);
355 } 355 }
356 } 356 }
357 } 357 }
358 358
359 static void set_state_cb(struct extent_io_tree *tree, 359 static void set_state_cb(struct extent_io_tree *tree,
360 struct extent_state *state, unsigned long *bits) 360 struct extent_state *state, unsigned long *bits)
361 { 361 {
362 if (tree->ops && tree->ops->set_bit_hook) 362 if (tree->ops && tree->ops->set_bit_hook)
363 tree->ops->set_bit_hook(tree->mapping->host, state, bits); 363 tree->ops->set_bit_hook(tree->mapping->host, state, bits);
364 } 364 }
365 365
366 static void clear_state_cb(struct extent_io_tree *tree, 366 static void clear_state_cb(struct extent_io_tree *tree,
367 struct extent_state *state, unsigned long *bits) 367 struct extent_state *state, unsigned long *bits)
368 { 368 {
369 if (tree->ops && tree->ops->clear_bit_hook) 369 if (tree->ops && tree->ops->clear_bit_hook)
370 tree->ops->clear_bit_hook(tree->mapping->host, state, bits); 370 tree->ops->clear_bit_hook(tree->mapping->host, state, bits);
371 } 371 }
372 372
373 static void set_state_bits(struct extent_io_tree *tree, 373 static void set_state_bits(struct extent_io_tree *tree,
374 struct extent_state *state, unsigned long *bits); 374 struct extent_state *state, unsigned long *bits);
375 375
376 /* 376 /*
377 * insert an extent_state struct into the tree. 'bits' are set on the 377 * insert an extent_state struct into the tree. 'bits' are set on the
378 * struct before it is inserted. 378 * struct before it is inserted.
379 * 379 *
380 * This may return -EEXIST if the extent is already there, in which case the 380 * This may return -EEXIST if the extent is already there, in which case the
381 * state struct is freed. 381 * state struct is freed.
382 * 382 *
383 * The tree lock is not taken internally. This is a utility function and 383 * The tree lock is not taken internally. This is a utility function and
384 * probably isn't what you want to call (see set/clear_extent_bit). 384 * probably isn't what you want to call (see set/clear_extent_bit).
385 */ 385 */
386 static int insert_state(struct extent_io_tree *tree, 386 static int insert_state(struct extent_io_tree *tree,
387 struct extent_state *state, u64 start, u64 end, 387 struct extent_state *state, u64 start, u64 end,
388 unsigned long *bits) 388 unsigned long *bits)
389 { 389 {
390 struct rb_node *node; 390 struct rb_node *node;
391 391
392 if (end < start) 392 if (end < start)
393 WARN(1, KERN_ERR "btrfs end < start %llu %llu\n", 393 WARN(1, KERN_ERR "btrfs end < start %llu %llu\n",
394 end, start); 394 end, start);
395 state->start = start; 395 state->start = start;
396 state->end = end; 396 state->end = end;
397 397
398 set_state_bits(tree, state, bits); 398 set_state_bits(tree, state, bits);
399 399
400 node = tree_insert(&tree->state, end, &state->rb_node); 400 node = tree_insert(&tree->state, end, &state->rb_node);
401 if (node) { 401 if (node) {
402 struct extent_state *found; 402 struct extent_state *found;
403 found = rb_entry(node, struct extent_state, rb_node); 403 found = rb_entry(node, struct extent_state, rb_node);
404 printk(KERN_ERR "btrfs found node %llu %llu on insert of " 404 printk(KERN_ERR "btrfs found node %llu %llu on insert of "
405 "%llu %llu\n", 405 "%llu %llu\n",
406 found->start, found->end, start, end); 406 found->start, found->end, start, end);
407 return -EEXIST; 407 return -EEXIST;
408 } 408 }
409 state->tree = tree; 409 state->tree = tree;
410 merge_state(tree, state); 410 merge_state(tree, state);
411 return 0; 411 return 0;
412 } 412 }
413 413
414 static void split_cb(struct extent_io_tree *tree, struct extent_state *orig, 414 static void split_cb(struct extent_io_tree *tree, struct extent_state *orig,
415 u64 split) 415 u64 split)
416 { 416 {
417 if (tree->ops && tree->ops->split_extent_hook) 417 if (tree->ops && tree->ops->split_extent_hook)
418 tree->ops->split_extent_hook(tree->mapping->host, orig, split); 418 tree->ops->split_extent_hook(tree->mapping->host, orig, split);
419 } 419 }
420 420
421 /* 421 /*
422 * split a given extent state struct in two, inserting the preallocated 422 * split a given extent state struct in two, inserting the preallocated
423 * struct 'prealloc' as the newly created second half. 'split' indicates an 423 * struct 'prealloc' as the newly created second half. 'split' indicates an
424 * offset inside 'orig' where it should be split. 424 * offset inside 'orig' where it should be split.
425 * 425 *
426 * Before calling, 426 * Before calling,
427 * the tree has 'orig' at [orig->start, orig->end]. After calling, there 427 * the tree has 'orig' at [orig->start, orig->end]. After calling, there
428 * are two extent state structs in the tree: 428 * are two extent state structs in the tree:
429 * prealloc: [orig->start, split - 1] 429 * prealloc: [orig->start, split - 1]
430 * orig: [ split, orig->end ] 430 * orig: [ split, orig->end ]
431 * 431 *
432 * The tree locks are not taken by this function. They need to be held 432 * The tree locks are not taken by this function. They need to be held
433 * by the caller. 433 * by the caller.
434 */ 434 */
435 static int split_state(struct extent_io_tree *tree, struct extent_state *orig, 435 static int split_state(struct extent_io_tree *tree, struct extent_state *orig,
436 struct extent_state *prealloc, u64 split) 436 struct extent_state *prealloc, u64 split)
437 { 437 {
438 struct rb_node *node; 438 struct rb_node *node;
439 439
440 split_cb(tree, orig, split); 440 split_cb(tree, orig, split);
441 441
442 prealloc->start = orig->start; 442 prealloc->start = orig->start;
443 prealloc->end = split - 1; 443 prealloc->end = split - 1;
444 prealloc->state = orig->state; 444 prealloc->state = orig->state;
445 orig->start = split; 445 orig->start = split;
446 446
447 node = tree_insert(&tree->state, prealloc->end, &prealloc->rb_node); 447 node = tree_insert(&tree->state, prealloc->end, &prealloc->rb_node);
448 if (node) { 448 if (node) {
449 free_extent_state(prealloc); 449 free_extent_state(prealloc);
450 return -EEXIST; 450 return -EEXIST;
451 } 451 }
452 prealloc->tree = tree; 452 prealloc->tree = tree;
453 return 0; 453 return 0;
454 } 454 }
455 455
456 static struct extent_state *next_state(struct extent_state *state) 456 static struct extent_state *next_state(struct extent_state *state)
457 { 457 {
458 struct rb_node *next = rb_next(&state->rb_node); 458 struct rb_node *next = rb_next(&state->rb_node);
459 if (next) 459 if (next)
460 return rb_entry(next, struct extent_state, rb_node); 460 return rb_entry(next, struct extent_state, rb_node);
461 else 461 else
462 return NULL; 462 return NULL;
463 } 463 }
464 464
465 /* 465 /*
466 * utility function to clear some bits in an extent state struct. 466 * utility function to clear some bits in an extent state struct.
467 * it will optionally wake up any one waiting on this state (wake == 1). 467 * it will optionally wake up any one waiting on this state (wake == 1).
468 * 468 *
469 * If no bits are set on the state struct after clearing things, the 469 * If no bits are set on the state struct after clearing things, the
470 * struct is freed and removed from the tree 470 * struct is freed and removed from the tree
471 */ 471 */
472 static struct extent_state *clear_state_bit(struct extent_io_tree *tree, 472 static struct extent_state *clear_state_bit(struct extent_io_tree *tree,
473 struct extent_state *state, 473 struct extent_state *state,
474 unsigned long *bits, int wake) 474 unsigned long *bits, int wake)
475 { 475 {
476 struct extent_state *next; 476 struct extent_state *next;
477 unsigned long bits_to_clear = *bits & ~EXTENT_CTLBITS; 477 unsigned long bits_to_clear = *bits & ~EXTENT_CTLBITS;
478 478
479 if ((bits_to_clear & EXTENT_DIRTY) && (state->state & EXTENT_DIRTY)) { 479 if ((bits_to_clear & EXTENT_DIRTY) && (state->state & EXTENT_DIRTY)) {
480 u64 range = state->end - state->start + 1; 480 u64 range = state->end - state->start + 1;
481 WARN_ON(range > tree->dirty_bytes); 481 WARN_ON(range > tree->dirty_bytes);
482 tree->dirty_bytes -= range; 482 tree->dirty_bytes -= range;
483 } 483 }
484 clear_state_cb(tree, state, bits); 484 clear_state_cb(tree, state, bits);
485 state->state &= ~bits_to_clear; 485 state->state &= ~bits_to_clear;
486 if (wake) 486 if (wake)
487 wake_up(&state->wq); 487 wake_up(&state->wq);
488 if (state->state == 0) { 488 if (state->state == 0) {
489 next = next_state(state); 489 next = next_state(state);
490 if (state->tree) { 490 if (state->tree) {
491 rb_erase(&state->rb_node, &tree->state); 491 rb_erase(&state->rb_node, &tree->state);
492 state->tree = NULL; 492 state->tree = NULL;
493 free_extent_state(state); 493 free_extent_state(state);
494 } else { 494 } else {
495 WARN_ON(1); 495 WARN_ON(1);
496 } 496 }
497 } else { 497 } else {
498 merge_state(tree, state); 498 merge_state(tree, state);
499 next = next_state(state); 499 next = next_state(state);
500 } 500 }
501 return next; 501 return next;
502 } 502 }
503 503
504 static struct extent_state * 504 static struct extent_state *
505 alloc_extent_state_atomic(struct extent_state *prealloc) 505 alloc_extent_state_atomic(struct extent_state *prealloc)
506 { 506 {
507 if (!prealloc) 507 if (!prealloc)
508 prealloc = alloc_extent_state(GFP_ATOMIC); 508 prealloc = alloc_extent_state(GFP_ATOMIC);
509 509
510 return prealloc; 510 return prealloc;
511 } 511 }
512 512
513 static void extent_io_tree_panic(struct extent_io_tree *tree, int err) 513 static void extent_io_tree_panic(struct extent_io_tree *tree, int err)
514 { 514 {
515 btrfs_panic(tree_fs_info(tree), err, "Locking error: " 515 btrfs_panic(tree_fs_info(tree), err, "Locking error: "
516 "Extent tree was modified by another " 516 "Extent tree was modified by another "
517 "thread while locked."); 517 "thread while locked.");
518 } 518 }
519 519
520 /* 520 /*
521 * clear some bits on a range in the tree. This may require splitting 521 * clear some bits on a range in the tree. This may require splitting
522 * or inserting elements in the tree, so the gfp mask is used to 522 * or inserting elements in the tree, so the gfp mask is used to
523 * indicate which allocations or sleeping are allowed. 523 * indicate which allocations or sleeping are allowed.
524 * 524 *
525 * pass 'wake' == 1 to kick any sleepers, and 'delete' == 1 to remove 525 * pass 'wake' == 1 to kick any sleepers, and 'delete' == 1 to remove
526 * the given range from the tree regardless of state (ie for truncate). 526 * the given range from the tree regardless of state (ie for truncate).
527 * 527 *
528 * the range [start, end] is inclusive. 528 * the range [start, end] is inclusive.
529 * 529 *
530 * This takes the tree lock, and returns 0 on success and < 0 on error. 530 * This takes the tree lock, and returns 0 on success and < 0 on error.
531 */ 531 */
532 int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, 532 int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
533 unsigned long bits, int wake, int delete, 533 unsigned long bits, int wake, int delete,
534 struct extent_state **cached_state, 534 struct extent_state **cached_state,
535 gfp_t mask) 535 gfp_t mask)
536 { 536 {
537 struct extent_state *state; 537 struct extent_state *state;
538 struct extent_state *cached; 538 struct extent_state *cached;
539 struct extent_state *prealloc = NULL; 539 struct extent_state *prealloc = NULL;
540 struct rb_node *node; 540 struct rb_node *node;
541 u64 last_end; 541 u64 last_end;
542 int err; 542 int err;
543 int clear = 0; 543 int clear = 0;
544 544
545 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end); 545 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end);
546 546
547 if (bits & EXTENT_DELALLOC) 547 if (bits & EXTENT_DELALLOC)
548 bits |= EXTENT_NORESERVE; 548 bits |= EXTENT_NORESERVE;
549 549
550 if (delete) 550 if (delete)
551 bits |= ~EXTENT_CTLBITS; 551 bits |= ~EXTENT_CTLBITS;
552 bits |= EXTENT_FIRST_DELALLOC; 552 bits |= EXTENT_FIRST_DELALLOC;
553 553
554 if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY)) 554 if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
555 clear = 1; 555 clear = 1;
556 again: 556 again:
557 if (!prealloc && (mask & __GFP_WAIT)) { 557 if (!prealloc && (mask & __GFP_WAIT)) {
558 prealloc = alloc_extent_state(mask); 558 prealloc = alloc_extent_state(mask);
559 if (!prealloc) 559 if (!prealloc)
560 return -ENOMEM; 560 return -ENOMEM;
561 } 561 }
562 562
563 spin_lock(&tree->lock); 563 spin_lock(&tree->lock);
564 if (cached_state) { 564 if (cached_state) {
565 cached = *cached_state; 565 cached = *cached_state;
566 566
567 if (clear) { 567 if (clear) {
568 *cached_state = NULL; 568 *cached_state = NULL;
569 cached_state = NULL; 569 cached_state = NULL;
570 } 570 }
571 571
572 if (cached && cached->tree && cached->start <= start && 572 if (cached && cached->tree && cached->start <= start &&
573 cached->end > start) { 573 cached->end > start) {
574 if (clear) 574 if (clear)
575 atomic_dec(&cached->refs); 575 atomic_dec(&cached->refs);
576 state = cached; 576 state = cached;
577 goto hit_next; 577 goto hit_next;
578 } 578 }
579 if (clear) 579 if (clear)
580 free_extent_state(cached); 580 free_extent_state(cached);
581 } 581 }
582 /* 582 /*
583 * this search will find the extents that end after 583 * this search will find the extents that end after
584 * our range starts 584 * our range starts
585 */ 585 */
586 node = tree_search(tree, start); 586 node = tree_search(tree, start);
587 if (!node) 587 if (!node)
588 goto out; 588 goto out;
589 state = rb_entry(node, struct extent_state, rb_node); 589 state = rb_entry(node, struct extent_state, rb_node);
590 hit_next: 590 hit_next:
591 if (state->start > end) 591 if (state->start > end)
592 goto out; 592 goto out;
593 WARN_ON(state->end < start); 593 WARN_ON(state->end < start);
594 last_end = state->end; 594 last_end = state->end;
595 595
596 /* the state doesn't have the wanted bits, go ahead */ 596 /* the state doesn't have the wanted bits, go ahead */
597 if (!(state->state & bits)) { 597 if (!(state->state & bits)) {
598 state = next_state(state); 598 state = next_state(state);
599 goto next; 599 goto next;
600 } 600 }
601 601
602 /* 602 /*
603 * | ---- desired range ---- | 603 * | ---- desired range ---- |
604 * | state | or 604 * | state | or
605 * | ------------- state -------------- | 605 * | ------------- state -------------- |
606 * 606 *
607 * We need to split the extent we found, and may flip 607 * We need to split the extent we found, and may flip
608 * bits on second half. 608 * bits on second half.
609 * 609 *
610 * If the extent we found extends past our range, we 610 * If the extent we found extends past our range, we
611 * just split and search again. It'll get split again 611 * just split and search again. It'll get split again
612 * the next time though. 612 * the next time though.
613 * 613 *
614 * If the extent we found is inside our range, we clear 614 * If the extent we found is inside our range, we clear
615 * the desired bit on it. 615 * the desired bit on it.
616 */ 616 */
617 617
618 if (state->start < start) { 618 if (state->start < start) {
619 prealloc = alloc_extent_state_atomic(prealloc); 619 prealloc = alloc_extent_state_atomic(prealloc);
620 BUG_ON(!prealloc); 620 BUG_ON(!prealloc);
621 err = split_state(tree, state, prealloc, start); 621 err = split_state(tree, state, prealloc, start);
622 if (err) 622 if (err)
623 extent_io_tree_panic(tree, err); 623 extent_io_tree_panic(tree, err);
624 624
625 prealloc = NULL; 625 prealloc = NULL;
626 if (err) 626 if (err)
627 goto out; 627 goto out;
628 if (state->end <= end) { 628 if (state->end <= end) {
629 state = clear_state_bit(tree, state, &bits, wake); 629 state = clear_state_bit(tree, state, &bits, wake);
630 goto next; 630 goto next;
631 } 631 }
632 goto search_again; 632 goto search_again;
633 } 633 }
634 /* 634 /*
635 * | ---- desired range ---- | 635 * | ---- desired range ---- |
636 * | state | 636 * | state |
637 * We need to split the extent, and clear the bit 637 * We need to split the extent, and clear the bit
638 * on the first half 638 * on the first half
639 */ 639 */
640 if (state->start <= end && state->end > end) { 640 if (state->start <= end && state->end > end) {
641 prealloc = alloc_extent_state_atomic(prealloc); 641 prealloc = alloc_extent_state_atomic(prealloc);
642 BUG_ON(!prealloc); 642 BUG_ON(!prealloc);
643 err = split_state(tree, state, prealloc, end + 1); 643 err = split_state(tree, state, prealloc, end + 1);
644 if (err) 644 if (err)
645 extent_io_tree_panic(tree, err); 645 extent_io_tree_panic(tree, err);
646 646
647 if (wake) 647 if (wake)
648 wake_up(&state->wq); 648 wake_up(&state->wq);
649 649
650 clear_state_bit(tree, prealloc, &bits, wake); 650 clear_state_bit(tree, prealloc, &bits, wake);
651 651
652 prealloc = NULL; 652 prealloc = NULL;
653 goto out; 653 goto out;
654 } 654 }
655 655
656 state = clear_state_bit(tree, state, &bits, wake); 656 state = clear_state_bit(tree, state, &bits, wake);
657 next: 657 next:
658 if (last_end == (u64)-1) 658 if (last_end == (u64)-1)
659 goto out; 659 goto out;
660 start = last_end + 1; 660 start = last_end + 1;
661 if (start <= end && state && !need_resched()) 661 if (start <= end && state && !need_resched())
662 goto hit_next; 662 goto hit_next;
663 goto search_again; 663 goto search_again;
664 664
665 out: 665 out:
666 spin_unlock(&tree->lock); 666 spin_unlock(&tree->lock);
667 if (prealloc) 667 if (prealloc)
668 free_extent_state(prealloc); 668 free_extent_state(prealloc);
669 669
670 return 0; 670 return 0;
671 671
672 search_again: 672 search_again:
673 if (start > end) 673 if (start > end)
674 goto out; 674 goto out;
675 spin_unlock(&tree->lock); 675 spin_unlock(&tree->lock);
676 if (mask & __GFP_WAIT) 676 if (mask & __GFP_WAIT)
677 cond_resched(); 677 cond_resched();
678 goto again; 678 goto again;
679 } 679 }
680 680
681 static void wait_on_state(struct extent_io_tree *tree, 681 static void wait_on_state(struct extent_io_tree *tree,
682 struct extent_state *state) 682 struct extent_state *state)
683 __releases(tree->lock) 683 __releases(tree->lock)
684 __acquires(tree->lock) 684 __acquires(tree->lock)
685 { 685 {
686 DEFINE_WAIT(wait); 686 DEFINE_WAIT(wait);
687 prepare_to_wait(&state->wq, &wait, TASK_UNINTERRUPTIBLE); 687 prepare_to_wait(&state->wq, &wait, TASK_UNINTERRUPTIBLE);
688 spin_unlock(&tree->lock); 688 spin_unlock(&tree->lock);
689 schedule(); 689 schedule();
690 spin_lock(&tree->lock); 690 spin_lock(&tree->lock);
691 finish_wait(&state->wq, &wait); 691 finish_wait(&state->wq, &wait);
692 } 692 }
693 693
694 /* 694 /*
695 * waits for one or more bits to clear on a range in the state tree. 695 * waits for one or more bits to clear on a range in the state tree.
696 * The range [start, end] is inclusive. 696 * The range [start, end] is inclusive.
697 * The tree lock is taken by this function 697 * The tree lock is taken by this function
698 */ 698 */
699 static void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, 699 static void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
700 unsigned long bits) 700 unsigned long bits)
701 { 701 {
702 struct extent_state *state; 702 struct extent_state *state;
703 struct rb_node *node; 703 struct rb_node *node;
704 704
705 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end); 705 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end);
706 706
707 spin_lock(&tree->lock); 707 spin_lock(&tree->lock);
708 again: 708 again:
709 while (1) { 709 while (1) {
710 /* 710 /*
711 * this search will find all the extents that end after 711 * this search will find all the extents that end after
712 * our range starts 712 * our range starts
713 */ 713 */
714 node = tree_search(tree, start); 714 node = tree_search(tree, start);
715 if (!node) 715 if (!node)
716 break; 716 break;
717 717
718 state = rb_entry(node, struct extent_state, rb_node); 718 state = rb_entry(node, struct extent_state, rb_node);
719 719
720 if (state->start > end) 720 if (state->start > end)
721 goto out; 721 goto out;
722 722
723 if (state->state & bits) { 723 if (state->state & bits) {
724 start = state->start; 724 start = state->start;
725 atomic_inc(&state->refs); 725 atomic_inc(&state->refs);
726 wait_on_state(tree, state); 726 wait_on_state(tree, state);
727 free_extent_state(state); 727 free_extent_state(state);
728 goto again; 728 goto again;
729 } 729 }
730 start = state->end + 1; 730 start = state->end + 1;
731 731
732 if (start > end) 732 if (start > end)
733 break; 733 break;
734 734
735 cond_resched_lock(&tree->lock); 735 cond_resched_lock(&tree->lock);
736 } 736 }
737 out: 737 out:
738 spin_unlock(&tree->lock); 738 spin_unlock(&tree->lock);
739 } 739 }
740 740
741 static void set_state_bits(struct extent_io_tree *tree, 741 static void set_state_bits(struct extent_io_tree *tree,
742 struct extent_state *state, 742 struct extent_state *state,
743 unsigned long *bits) 743 unsigned long *bits)
744 { 744 {
745 unsigned long bits_to_set = *bits & ~EXTENT_CTLBITS; 745 unsigned long bits_to_set = *bits & ~EXTENT_CTLBITS;
746 746
747 set_state_cb(tree, state, bits); 747 set_state_cb(tree, state, bits);
748 if ((bits_to_set & EXTENT_DIRTY) && !(state->state & EXTENT_DIRTY)) { 748 if ((bits_to_set & EXTENT_DIRTY) && !(state->state & EXTENT_DIRTY)) {
749 u64 range = state->end - state->start + 1; 749 u64 range = state->end - state->start + 1;
750 tree->dirty_bytes += range; 750 tree->dirty_bytes += range;
751 } 751 }
752 state->state |= bits_to_set; 752 state->state |= bits_to_set;
753 } 753 }
754 754
755 static void cache_state(struct extent_state *state, 755 static void cache_state(struct extent_state *state,
756 struct extent_state **cached_ptr) 756 struct extent_state **cached_ptr)
757 { 757 {
758 if (cached_ptr && !(*cached_ptr)) { 758 if (cached_ptr && !(*cached_ptr)) {
759 if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) { 759 if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) {
760 *cached_ptr = state; 760 *cached_ptr = state;
761 atomic_inc(&state->refs); 761 atomic_inc(&state->refs);
762 } 762 }
763 } 763 }
764 } 764 }
765 765
766 /* 766 /*
767 * set some bits on a range in the tree. This may require allocations or 767 * set some bits on a range in the tree. This may require allocations or
768 * sleeping, so the gfp mask is used to indicate what is allowed. 768 * sleeping, so the gfp mask is used to indicate what is allowed.
769 * 769 *
770 * If any of the exclusive bits are set, this will fail with -EEXIST if some 770 * If any of the exclusive bits are set, this will fail with -EEXIST if some
771 * part of the range already has the desired bits set. The start of the 771 * part of the range already has the desired bits set. The start of the
772 * existing range is returned in failed_start in this case. 772 * existing range is returned in failed_start in this case.
773 * 773 *
774 * [start, end] is inclusive This takes the tree lock. 774 * [start, end] is inclusive This takes the tree lock.
775 */ 775 */
776 776
777 static int __must_check 777 static int __must_check
778 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, 778 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
779 unsigned long bits, unsigned long exclusive_bits, 779 unsigned long bits, unsigned long exclusive_bits,
780 u64 *failed_start, struct extent_state **cached_state, 780 u64 *failed_start, struct extent_state **cached_state,
781 gfp_t mask) 781 gfp_t mask)
782 { 782 {
783 struct extent_state *state; 783 struct extent_state *state;
784 struct extent_state *prealloc = NULL; 784 struct extent_state *prealloc = NULL;
785 struct rb_node *node; 785 struct rb_node *node;
786 int err = 0; 786 int err = 0;
787 u64 last_start; 787 u64 last_start;
788 u64 last_end; 788 u64 last_end;
789 789
790 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end); 790 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end);
791 791
792 bits |= EXTENT_FIRST_DELALLOC; 792 bits |= EXTENT_FIRST_DELALLOC;
793 again: 793 again:
794 if (!prealloc && (mask & __GFP_WAIT)) { 794 if (!prealloc && (mask & __GFP_WAIT)) {
795 prealloc = alloc_extent_state(mask); 795 prealloc = alloc_extent_state(mask);
796 BUG_ON(!prealloc); 796 BUG_ON(!prealloc);
797 } 797 }
798 798
799 spin_lock(&tree->lock); 799 spin_lock(&tree->lock);
800 if (cached_state && *cached_state) { 800 if (cached_state && *cached_state) {
801 state = *cached_state; 801 state = *cached_state;
802 if (state->start <= start && state->end > start && 802 if (state->start <= start && state->end > start &&
803 state->tree) { 803 state->tree) {
804 node = &state->rb_node; 804 node = &state->rb_node;
805 goto hit_next; 805 goto hit_next;
806 } 806 }
807 } 807 }
808 /* 808 /*
809 * this search will find all the extents that end after 809 * this search will find all the extents that end after
810 * our range starts. 810 * our range starts.
811 */ 811 */
812 node = tree_search(tree, start); 812 node = tree_search(tree, start);
813 if (!node) { 813 if (!node) {
814 prealloc = alloc_extent_state_atomic(prealloc); 814 prealloc = alloc_extent_state_atomic(prealloc);
815 BUG_ON(!prealloc); 815 BUG_ON(!prealloc);
816 err = insert_state(tree, prealloc, start, end, &bits); 816 err = insert_state(tree, prealloc, start, end, &bits);
817 if (err) 817 if (err)
818 extent_io_tree_panic(tree, err); 818 extent_io_tree_panic(tree, err);
819 819
820 prealloc = NULL; 820 prealloc = NULL;
821 goto out; 821 goto out;
822 } 822 }
823 state = rb_entry(node, struct extent_state, rb_node); 823 state = rb_entry(node, struct extent_state, rb_node);
824 hit_next: 824 hit_next:
825 last_start = state->start; 825 last_start = state->start;
826 last_end = state->end; 826 last_end = state->end;
827 827
828 /* 828 /*
829 * | ---- desired range ---- | 829 * | ---- desired range ---- |
830 * | state | 830 * | state |
831 * 831 *
832 * Just lock what we found and keep going 832 * Just lock what we found and keep going
833 */ 833 */
834 if (state->start == start && state->end <= end) { 834 if (state->start == start && state->end <= end) {
835 if (state->state & exclusive_bits) { 835 if (state->state & exclusive_bits) {
836 *failed_start = state->start; 836 *failed_start = state->start;
837 err = -EEXIST; 837 err = -EEXIST;
838 goto out; 838 goto out;
839 } 839 }
840 840
841 set_state_bits(tree, state, &bits); 841 set_state_bits(tree, state, &bits);
842 cache_state(state, cached_state); 842 cache_state(state, cached_state);
843 merge_state(tree, state); 843 merge_state(tree, state);
844 if (last_end == (u64)-1) 844 if (last_end == (u64)-1)
845 goto out; 845 goto out;
846 start = last_end + 1; 846 start = last_end + 1;
847 state = next_state(state); 847 state = next_state(state);
848 if (start < end && state && state->start == start && 848 if (start < end && state && state->start == start &&
849 !need_resched()) 849 !need_resched())
850 goto hit_next; 850 goto hit_next;
851 goto search_again; 851 goto search_again;
852 } 852 }
853 853
854 /* 854 /*
855 * | ---- desired range ---- | 855 * | ---- desired range ---- |
856 * | state | 856 * | state |
857 * or 857 * or
858 * | ------------- state -------------- | 858 * | ------------- state -------------- |
859 * 859 *
860 * We need to split the extent we found, and may flip bits on 860 * We need to split the extent we found, and may flip bits on
861 * second half. 861 * second half.
862 * 862 *
863 * If the extent we found extends past our 863 * If the extent we found extends past our
864 * range, we just split and search again. It'll get split 864 * range, we just split and search again. It'll get split
865 * again the next time though.
866 *
867 * If the extent we found is inside our range, we set the
868 * desired bit on it.
869 */
870 if (state->start < start) {
871 if (state->state & exclusive_bits) {
872 *failed_start = start;
873 err = -EEXIST;
874 goto out;
875 }
876
877 prealloc = alloc_extent_state_atomic(prealloc);
878 BUG_ON(!prealloc);
879 err = split_state(tree, state, prealloc, start);
880 if (err)
881 extent_io_tree_panic(tree, err);
882
883 prealloc = NULL;
884 if (err)
885 goto out;
886 if (state->end <= end) {
887 set_state_bits(tree, state, &bits);
888 cache_state(state, cached_state);
889 merge_state(tree, state);
890 if (last_end == (u64)-1)
891 goto out;
892 start = last_end + 1;
893 state = next_state(state);
894 if (start < end && state && state->start == start &&
895 !need_resched())
896 goto hit_next;
897 }
898 goto search_again;
899 }
900 /*
901 * | ---- desired range ---- |
902 * | state | or | state |
903 *
904 * There's a hole, we need to insert something in it and
905 * ignore the extent we found.
906 */
907 if (state->start > start) {
908 u64 this_end;
909 if (end < last_start)
910 this_end = end;
911 else
912 this_end = last_start - 1;
913
914 prealloc = alloc_extent_state_atomic(prealloc);
915 BUG_ON(!prealloc);
916
917 /*
918 * Avoid to free 'prealloc' if it can be merged with
919 * the later extent.
920 */
921 err = insert_state(tree, prealloc, start, this_end,
922 &bits);
923 if (err)
924 extent_io_tree_panic(tree, err);
925
926 cache_state(prealloc, cached_state);
927 prealloc = NULL;
928 start = this_end + 1;
929 goto search_again;
930 }
931 /*
932 * | ---- desired range ---- |
933 * | state |
934 * We need to split the extent, and set the bit
935 * on the first half
936 */
937 if (state->start <= end && state->end > end) {
938 if (state->state & exclusive_bits) {
939 *failed_start = start;
940 err = -EEXIST;
941 goto out;
942 }
943
944 prealloc = alloc_extent_state_atomic(prealloc);
945 BUG_ON(!prealloc);
946 err = split_state(tree, state, prealloc, end + 1);
947 if (err)
948 extent_io_tree_panic(tree, err);
949
950 set_state_bits(tree, prealloc, &bits);
951 cache_state(prealloc, cached_state);
952 merge_state(tree, prealloc);
953 prealloc = NULL;
954 goto out;
955 }
956
957 goto search_again;
958
959 out:
960 spin_unlock(&tree->lock);
961 if (prealloc)
962 free_extent_state(prealloc);
963
964 return err;
965
966 search_again:
967 if (start > end)
968 goto out;
969 spin_unlock(&tree->lock);
970 if (mask & __GFP_WAIT)
971 cond_resched();
972 goto again;
973 }
974
975 int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
976 unsigned long bits, u64 * failed_start,
977 struct extent_state **cached_state, gfp_t mask)
978 {
979 return __set_extent_bit(tree, start, end, bits, 0, failed_start,
980 cached_state, mask);
981 }
982
983
984 /**
985 * convert_extent_bit - convert all bits in a given range from one bit to
986 * another
987 * @tree: the io tree to search
988 * @start: the start offset in bytes
989 * @end: the end offset in bytes (inclusive)
990 * @bits: the bits to set in this range
991 * @clear_bits: the bits to clear in this range
992 * @cached_state: state that we're going to cache
993 * @mask: the allocation mask
994 *
995 * This will go through and set bits for the given range. If any states exist
996 * already in this range they are set with the given bit and cleared of the
997 * clear_bits. This is only meant to be used by things that are mergeable, ie
998 * converting from say DELALLOC to DIRTY. This is not meant to be used with
999 * boundary bits like LOCK.
1000 */
1001 int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
1002 unsigned long bits, unsigned long clear_bits,
1003 struct extent_state **cached_state, gfp_t mask)
1004 {
1005 struct extent_state *state;
1006 struct extent_state *prealloc = NULL;
1007 struct rb_node *node;
1008 int err = 0;
1009 u64 last_start;
1010 u64 last_end;
1011
1012 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end);
1013
1014 again:
1015 if (!prealloc && (mask & __GFP_WAIT)) {
1016 prealloc = alloc_extent_state(mask);
1017 if (!prealloc)
1018 return -ENOMEM;
1019 }
1020
1021 spin_lock(&tree->lock);
1022 if (cached_state && *cached_state) {
1023 state = *cached_state;
1024 if (state->start <= start && state->end > start &&
1025 state->tree) {
1026 node = &state->rb_node;
1027 goto hit_next;
1028 }
1029 }
1030
1031 /*
1032 * this search will find all the extents that end after
1033 * our range starts.
1034 */
1035 node = tree_search(tree, start);
1036 if (!node) {
1037 prealloc = alloc_extent_state_atomic(prealloc);
1038 if (!prealloc) {
1039 err = -ENOMEM;
1040 goto out;
1041 }
1042 err = insert_state(tree, prealloc, start, end, &bits);
1043 prealloc = NULL;
1044 if (err)
1045 extent_io_tree_panic(tree, err);
1046 goto out;
1047 }
1048 state = rb_entry(node, struct extent_state, rb_node);
1049 hit_next:
1050 last_start = state->start;
1051 last_end = state->end;
1052
1053 /*
1054 * | ---- desired range ---- |
1055 * | state |
1056 *
1057 * Just lock what we found and keep going
1058 */
1059 if (state->start == start && state->end <= end) {
1060 set_state_bits(tree, state, &bits);
1061 cache_state(state, cached_state);
1062 state = clear_state_bit(tree, state, &clear_bits, 0);
1063 if (last_end == (u64)-1)
1064 goto out;
1065 start = last_end + 1;
1066 if (start < end && state && state->start == start &&
1067 !need_resched())
1068 goto hit_next;
1069 goto search_again;
1070 }
1071
1072 /*
1073 * | ---- desired range ---- |
1074 * | state |
1075 * or
1076 * | ------------- state -------------- |
1077 *
1078 * We need to split the extent we found, and may flip bits on
1079 * second half.
1080 *
1081 * If the extent we found extends past our
1082 * range, we just split and search again. It'll get split
1083 * again the next time though.
1084 *
1085 * If the extent we found is inside our range, we set the
1086 * desired bit on it.
1087 */
1088 if (state->start < start) {
1089 prealloc = alloc_extent_state_atomic(prealloc);
1090 if (!prealloc) {
1091 err = -ENOMEM;
1092 goto out;
1093 }
1094 err = split_state(tree, state, prealloc, start);
1095 if (err)
1096 extent_io_tree_panic(tree, err);
1097 prealloc = NULL;
1098 if (err)
1099 goto out;
1100 if (state->end <= end) {
1101 set_state_bits(tree, state, &bits);
1102 cache_state(state, cached_state);
1103 state = clear_state_bit(tree, state, &clear_bits, 0);
1104 if (last_end == (u64)-1)
1105 goto out;
1106 start = last_end + 1;
1107 if (start < end && state && state->start == start &&
1108 !need_resched())
1109 goto hit_next;
1110 }
1111 goto search_again;
1112 }
1113 /*
1114 * | ---- desired range ---- |
1115 * | state | or | state |
1116 *
1117 * There's a hole, we need to insert something in it and
1118 * ignore the extent we found.
1119 */
1120 if (state->start > start) {
1121 u64 this_end;
1122 if (end < last_start)
1123 this_end = end;
1124 else
1125 this_end = last_start - 1;
1126
1127 prealloc = alloc_extent_state_atomic(prealloc);
1128 if (!prealloc) {
1129 err = -ENOMEM;
1130 goto out;
1131 }
1132
1133 /*
1134 * Avoid to free 'prealloc' if it can be merged with
1135 * the later extent.
1136 */
1137 err = insert_state(tree, prealloc, start, this_end,
1138 &bits);
1139 if (err)
1140 extent_io_tree_panic(tree, err);
1141 cache_state(prealloc, cached_state);
1142 prealloc = NULL;
1143 start = this_end + 1;
1144 goto search_again;
1145 }
1146 /*
1147 * | ---- desired range ---- |
1148 * | state |
1149 * We need to split the extent, and set the bit
1150 * on the first half
1151 */
1152 if (state->start <= end && state->end > end) {
1153 prealloc = alloc_extent_state_atomic(prealloc);
1154 if (!prealloc) {
1155 err = -ENOMEM;
1156 goto out;
1157 }
1158
1159 err = split_state(tree, state, prealloc, end + 1);
1160 if (err)
1161 extent_io_tree_panic(tree, err);
1162
1163 set_state_bits(tree, prealloc, &bits);
1164 cache_state(prealloc, cached_state);
1165 clear_state_bit(tree, prealloc, &clear_bits, 0);
1166 prealloc = NULL;
1167 goto out;
1168 }
1169
1170 goto search_again;
1171
1172 out:
1173 spin_unlock(&tree->lock);
1174 if (prealloc)
1175 free_extent_state(prealloc);
1176
1177 return err;
1178
1179 search_again:
1180 if (start > end)
1181 goto out;
1182 spin_unlock(&tree->lock);
1183 if (mask & __GFP_WAIT)
1184 cond_resched();
1185 goto again;
1186 }
1187
1188 /* wrappers around set/clear extent bit */
1189 int set_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
1190 gfp_t mask)
1191 {
1192 return set_extent_bit(tree, start, end, EXTENT_DIRTY, NULL,
1193 NULL, mask);
1194 }
1195
1196 int set_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
1197 unsigned long bits, gfp_t mask)
1198 {
1199 return set_extent_bit(tree, start, end, bits, NULL,
1200 NULL, mask);
1201 }
1202
1203 int clear_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
1204 unsigned long bits, gfp_t mask)
1205 {
1206 return clear_extent_bit(tree, start, end, bits, 0, 0, NULL, mask);
1207 }
1208
1209 int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
1210 struct extent_state **cached_state, gfp_t mask)
1211 {
1212 return set_extent_bit(tree, start, end,
1213 EXTENT_DELALLOC | EXTENT_UPTODATE,
1214 NULL, cached_state, mask);
1215 }
1216
1217 int set_extent_defrag(struct extent_io_tree *tree, u64 start, u64 end,
1218 struct extent_state **cached_state, gfp_t mask)
1219 {
1220 return set_extent_bit(tree, start, end,
1221 EXTENT_DELALLOC | EXTENT_UPTODATE | EXTENT_DEFRAG,
1222 NULL, cached_state, mask);
1223 }
1224
1225 int clear_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
1226 gfp_t mask)
1227 {
1228 return clear_extent_bit(tree, start, end,
1229 EXTENT_DIRTY | EXTENT_DELALLOC |
1230 EXTENT_DO_ACCOUNTING, 0, 0, NULL, mask);
1231 }
1232
1233 int set_extent_new(struct extent_io_tree *tree, u64 start, u64 end,
1234 gfp_t mask)
1235 {
1236 return set_extent_bit(tree, start, end, EXTENT_NEW, NULL,
1237 NULL, mask);
1238 }
1239
1240 int set_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
1241 struct extent_state **cached_state, gfp_t mask)
1242 {
1243 return set_extent_bit(tree, start, end, EXTENT_UPTODATE, NULL,
1244 cached_state, mask);
1245 }
1246
1247 int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
1248 struct extent_state **cached_state, gfp_t mask)
1249 {
1250 return clear_extent_bit(tree, start, end, EXTENT_UPTODATE, 0, 0,
1251 cached_state, mask);
1252 }
1253
1254 /*
1255 * either insert or lock state struct between start and end use mask to tell
1256 * us if waiting is desired.
1257 */
1258 int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
1259 unsigned long bits, struct extent_state **cached_state)
1260 {
1261 int err;
1262 u64 failed_start;
1263 while (1) {
1264 err = __set_extent_bit(tree, start, end, EXTENT_LOCKED | bits,
1265 EXTENT_LOCKED, &failed_start,
1266 cached_state, GFP_NOFS);
1267 if (err == -EEXIST) {
1268 wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED);
1269 start = failed_start;
1270 } else
1271 break;
1272 WARN_ON(start > end);
1273 }
1274 return err;
1275 }
1276
1277 int lock_extent(struct extent_io_tree *tree, u64 start, u64 end)
1278 {
1279 return lock_extent_bits(tree, start, end, 0, NULL);
1280 }
1281
1282 int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end)
1283 {
1284 int err;
1285 u64 failed_start;
1286
1287 err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, EXTENT_LOCKED,
1288 &failed_start, NULL, GFP_NOFS);
1289 if (err == -EEXIST) {
1290 if (failed_start > start)
1291 clear_extent_bit(tree, start, failed_start - 1,
1292 EXTENT_LOCKED, 1, 0, NULL, GFP_NOFS);
1293 return 0;
1294 }
1295 return 1;
1296 }
1297
1298 int unlock_extent_cached(struct extent_io_tree *tree, u64 start, u64 end,
1299 struct extent_state **cached, gfp_t mask)
1300 {
1301 return clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, cached,
1302 mask);
1303 }
1304
1305 int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end)
1306 {
1307 return clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, NULL,
1308 GFP_NOFS);
1309 }
1310
1311 int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
1312 {
1313 unsigned long index = start >> PAGE_CACHE_SHIFT;
1314 unsigned long end_index = end >> PAGE_CACHE_SHIFT;
1315 struct page *page;
1316
1317 while (index <= end_index) {
1318 page = find_get_page(inode->i_mapping, index);
1319 BUG_ON(!page); /* Pages should be in the extent_io_tree */
1320 clear_page_dirty_for_io(page);
1321 page_cache_release(page);
1322 index++;
1323 }
1324 return 0;
1325 }
1326
1327 int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
1328 {
1329 unsigned long index = start >> PAGE_CACHE_SHIFT;
1330 unsigned long end_index = end >> PAGE_CACHE_SHIFT;
1331 struct page *page;
1332
1333 while (index <= end_index) {
1334 page = find_get_page(inode->i_mapping, index);
1335 BUG_ON(!page); /* Pages should be in the extent_io_tree */
1336 account_page_redirty(page);
1337 __set_page_dirty_nobuffers(page);
1338 page_cache_release(page);
1339 index++;
1340 }
1341 return 0;
1342 }
1343
1344 /*
1345 * helper function to set both pages and extents in the tree writeback
1346 */
1347 static int set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
1348 {
1349 unsigned long index = start >> PAGE_CACHE_SHIFT;
1350 unsigned long end_index = end >> PAGE_CACHE_SHIFT;
1351 struct page *page;
1352
1353 while (index <= end_index) {
1354 page = find_get_page(tree->mapping, index);
1355 BUG_ON(!page); /* Pages should be in the extent_io_tree */
1356 set_page_writeback(page);
1357 page_cache_release(page);
1358 index++;
1359 }
1360 return 0;
1361 }
1362
1363 /* find the first state struct with 'bits' set after 'start', and
1364 * return it. tree->lock must be held. NULL will returned if
1365 * nothing was found after 'start'
1366 */
1367 static struct extent_state *
1368 find_first_extent_bit_state(struct extent_io_tree *tree,
1369 u64 start, unsigned long bits)
1370 {
1371 struct rb_node *node;
1372 struct extent_state *state;
1373
1374 /*
1375 * this search will find all the extents that end after
1376 * our range starts.
1377 */
1378 node = tree_search(tree, start);
1379 if (!node)
1380 goto out;
1381
1382 while (1) {
1383 state = rb_entry(node, struct extent_state, rb_node);
1384 if (state->end >= start && (state->state & bits))
1385 return state;
1386
1387 node = rb_next(node);
1388 if (!node)
1389 break;
1390 }
1391 out:
1392 return NULL;
1393 }
1394
1395 /*
1396 * find the first offset in the io tree with 'bits' set. zero is
1397 * returned if we find something, and *start_ret and *end_ret are
1398 * set to reflect the state struct that was found.
1399 *
1400 * If nothing was found, 1 is returned. If found something, return 0.
1401 */
1402 int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
1403 u64 *start_ret, u64 *end_ret, unsigned long bits,
1404 struct extent_state **cached_state)
1405 {
1406 struct extent_state *state;
1407 struct rb_node *n;
1408 int ret = 1;
1409
1410 spin_lock(&tree->lock);
1411 if (cached_state && *cached_state) {
1412 state = *cached_state;
1413 if (state->end == start - 1 && state->tree) {
1414 n = rb_next(&state->rb_node);
1415 while (n) {
1416 state = rb_entry(n, struct extent_state,
1417 rb_node);
1418 if (state->state & bits)
1419 goto got_it;
1420 n = rb_next(n);
1421 }
1422 free_extent_state(*cached_state);
1423 *cached_state = NULL;
1424 goto out;
1425 }
1426 free_extent_state(*cached_state);
1427 *cached_state = NULL;
1428 }
1429
1430 state = find_first_extent_bit_state(tree, start, bits);
1431 got_it:
1432 if (state) {
1433 cache_state(state, cached_state);
1434 *start_ret = state->start;
1435 *end_ret = state->end;
1436 ret = 0;
1437 }
1438 out:
1439 spin_unlock(&tree->lock);
1440 return ret;
1441 }
1442
1443 /*
1444 * find a contiguous range of bytes in the file marked as delalloc, not
1445 * more than 'max_bytes'. start and end are used to return the range,
1446 *
1447 * 1 is returned if we find something, 0 if nothing was in the tree
1448 */
1449 static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
1450 u64 *start, u64 *end, u64 max_bytes,
1451 struct extent_state **cached_state)
1452 {
1453 struct rb_node *node;
1454 struct extent_state *state;
1455 u64 cur_start = *start;
1456 u64 found = 0;
1457 u64 total_bytes = 0;
1458
1459 spin_lock(&tree->lock);
1460
1461 /*
1462 * this search will find all the extents that end after
1463 * our range starts.
1464 */
1465 node = tree_search(tree, cur_start);
1466 if (!node) {
1467 if (!found)
1468 *end = (u64)-1;
1469 goto out;
1470 }
1471
1472 while (1) {
1473 state = rb_entry(node, struct extent_state, rb_node);
1474 if (found && (state->start != cur_start ||
1475 (state->state & EXTENT_BOUNDARY))) {
1476 goto out;
1477 }
1478 if (!(state->state & EXTENT_DELALLOC)) {
1479 if (!found)
1480 *end = state->end;
1481 goto out;
1482 }
1483 if (!found) {
1484 *start = state->start;
1485 *cached_state = state;
1486 atomic_inc(&state->refs);
1487 }
1488 found++;
1489 *end = state->end;
1490 cur_start = state->end + 1;
1491 node = rb_next(node);
1492 total_bytes += state->end - state->start + 1;
1493 if (total_bytes >= max_bytes)
1494 break;
1495 if (!node)
1496 break;
1497 }
1498 out:
1499 spin_unlock(&tree->lock);
1500 return found;
1501 }
1502
1503 static noinline void __unlock_for_delalloc(struct inode *inode,
1504 struct page *locked_page,
1505 u64 start, u64 end)
1506 {
1507 int ret;
1508 struct page *pages[16];
1509 unsigned long index = start >> PAGE_CACHE_SHIFT;
1510 unsigned long end_index = end >> PAGE_CACHE_SHIFT;
1511 unsigned long nr_pages = end_index - index + 1;
1512 int i;
1513
1514 if (index == locked_page->index && end_index == index)
1515 return;
1516
1517 while (nr_pages > 0) {
1518 ret = find_get_pages_contig(inode->i_mapping, index,
1519 min_t(unsigned long, nr_pages,
1520 ARRAY_SIZE(pages)), pages);
1521 for (i = 0; i < ret; i++) {
1522 if (pages[i] != locked_page)
1523 unlock_page(pages[i]);
1524 page_cache_release(pages[i]);
1525 }
1526 nr_pages -= ret;
1527 index += ret;
1528 cond_resched();
1529 }
1530 }
1531
1532 static noinline int lock_delalloc_pages(struct inode *inode,
1533 struct page *locked_page,
1534 u64 delalloc_start,
1535 u64 delalloc_end)
1536 {
1537 unsigned long index = delalloc_start >> PAGE_CACHE_SHIFT;
1538 unsigned long start_index = index;
1539 unsigned long end_index = delalloc_end >> PAGE_CACHE_SHIFT;
1540 unsigned long pages_locked = 0;
1541 struct page *pages[16];
1542 unsigned long nrpages;
1543 int ret;
1544 int i;
1545
1546 /* the caller is responsible for locking the start index */
1547 if (index == locked_page->index && index == end_index)
1548 return 0;
1549
1550 /* skip the page at the start index */
1551 nrpages = end_index - index + 1;
1552 while (nrpages > 0) {
1553 ret = find_get_pages_contig(inode->i_mapping, index,
1554 min_t(unsigned long,
1555 nrpages, ARRAY_SIZE(pages)), pages);
1556 if (ret == 0) {
1557 ret = -EAGAIN;
1558 goto done;
1559 }
1560 /* now we have an array of pages, lock them all */
1561 for (i = 0; i < ret; i++) {
1562 /*
1563 * the caller is taking responsibility for
1564 * locked_page
1565 */
1566 if (pages[i] != locked_page) {
1567 lock_page(pages[i]);
1568 if (!PageDirty(pages[i]) ||
1569 pages[i]->mapping != inode->i_mapping) {
1570 ret = -EAGAIN;
1571 unlock_page(pages[i]);
1572 page_cache_release(pages[i]);
1573 goto done;
1574 }
1575 }
1576 page_cache_release(pages[i]);
1577 pages_locked++;
1578 }
1579 nrpages -= ret;
1580 index += ret;
1581 cond_resched();
1582 }
1583 ret = 0;
1584 done:
1585 if (ret && pages_locked) {
1586 __unlock_for_delalloc(inode, locked_page,
1587 delalloc_start,
1588 ((u64)(start_index + pages_locked - 1)) <<
1589 PAGE_CACHE_SHIFT);
1590 }
1591 return ret;
1592 }
1593
1594 /*
1595 * find a contiguous range of bytes in the file marked as delalloc, not
1596 * more than 'max_bytes'. start and end are used to return the range,
1597 *
1598 * 1 is returned if we find something, 0 if nothing was in the tree
1599 */
1600 static noinline u64 find_lock_delalloc_range(struct inode *inode,
1601 struct extent_io_tree *tree,
1602 struct page *locked_page,
1603 u64 *start, u64 *end,
1604 u64 max_bytes)
1605 {
1606 u64 delalloc_start;
1607 u64 delalloc_end;
1608 u64 found;
1609 struct extent_state *cached_state = NULL;
1610 int ret;
1611 int loops = 0;
1612
1613 again:
1614 /* step one, find a bunch of delalloc bytes starting at start */
1615 delalloc_start = *start;
1616 delalloc_end = 0;
1617 found = find_delalloc_range(tree, &delalloc_start, &delalloc_end,
1618 max_bytes, &cached_state);
1619 if (!found || delalloc_end <= *start) {
1620 *start = delalloc_start;
1621 *end = delalloc_end;
1622 free_extent_state(cached_state);
1623 return 0;
1624 }
1625
1626 /*
1627 * start comes from the offset of locked_page. We have to lock
1628 * pages in order, so we can't process delalloc bytes before
1629 * locked_page
1630 */
1631 if (delalloc_start < *start)
1632 delalloc_start = *start;
1633
1634 /*
1635 * make sure to limit the number of pages we try to lock down
1636 */
1637 if (delalloc_end + 1 - delalloc_start > max_bytes)
1638 delalloc_end = delalloc_start + max_bytes - 1;
1639
1640 /* step two, lock all the pages after the page that has start */
1641 ret = lock_delalloc_pages(inode, locked_page,
1642 delalloc_start, delalloc_end);
1643 if (ret == -EAGAIN) {
1644 /* some of the pages are gone, lets avoid looping by
1645 * shortening the size of the delalloc range we're searching
1646 */
1647 free_extent_state(cached_state);
1648 cached_state = NULL;
1649 if (!loops) {
1650 max_bytes = PAGE_CACHE_SIZE;
1651 loops = 1;
1652 goto again;
1653 } else {
1654 found = 0;
1655 goto out_failed;
1656 }
1657 }
1658 BUG_ON(ret); /* Only valid values are 0 and -EAGAIN */
1659
1660 /* step three, lock the state bits for the whole range */
1661 lock_extent_bits(tree, delalloc_start, delalloc_end, 0, &cached_state);
1662
1663 /* then test to make sure it is all still delalloc */
1664 ret = test_range_bit(tree, delalloc_start, delalloc_end,
1665 EXTENT_DELALLOC, 1, cached_state);
1666 if (!ret) {
1667 unlock_extent_cached(tree, delalloc_start, delalloc_end,
1668 &cached_state, GFP_NOFS);
1669 __unlock_for_delalloc(inode, locked_page,
1670 delalloc_start, delalloc_end);
1671 cond_resched();
1672 goto again;
1673 }
1674 free_extent_state(cached_state);
1675 *start = delalloc_start;
1676 *end = delalloc_end;
1677 out_failed:
1678 return found;
1679 }
1680
1681 int extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
1682 struct page *locked_page,
1683 unsigned long clear_bits,
1684 unsigned long page_ops)
1685 {
1686 struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
1687 int ret;
1688 struct page *pages[16];
1689 unsigned long index = start >> PAGE_CACHE_SHIFT;
1690 unsigned long end_index = end >> PAGE_CACHE_SHIFT;
1691 unsigned long nr_pages = end_index - index + 1;
1692 int i;
1693
1694 clear_extent_bit(tree, start, end, clear_bits, 1, 0, NULL, GFP_NOFS);
1695 if (page_ops == 0)
1696 return 0;
1697
1698 while (nr_pages > 0) {
1699 ret = find_get_pages_contig(inode->i_mapping, index,
1700 min_t(unsigned long,
1701 nr_pages, ARRAY_SIZE(pages)), pages);
1702 for (i = 0; i < ret; i++) {
1703
1704 if (page_ops & PAGE_SET_PRIVATE2)
1705 SetPagePrivate2(pages[i]);
1706
1707 if (pages[i] == locked_page) {
1708 page_cache_release(pages[i]);
1709 continue;
1710 }
1711 if (page_ops & PAGE_CLEAR_DIRTY)
1712 clear_page_dirty_for_io(pages[i]);
1713 if (page_ops & PAGE_SET_WRITEBACK)
1714 set_page_writeback(pages[i]);
1715 if (page_ops & PAGE_END_WRITEBACK)
1716 end_page_writeback(pages[i]);
1717 if (page_ops & PAGE_UNLOCK)
1718 unlock_page(pages[i]);
1719 page_cache_release(pages[i]);
1720 }
1721 nr_pages -= ret;
1722 index += ret;
1723 cond_resched();
1724 }
1725 return 0;
1726 }
1727
1728 /*
1729 * count the number of bytes in the tree that have a given bit(s)
1730 * set. This can be fairly slow, except for EXTENT_DIRTY which is
1731 * cached. The total number found is returned.
1732 */
1733 u64 count_range_bits(struct extent_io_tree *tree,
1734 u64 *start, u64 search_end, u64 max_bytes,
1735 unsigned long bits, int contig)
1736 {
1737 struct rb_node *node;
1738 struct extent_state *state;
1739 u64 cur_start = *start;
1740 u64 total_bytes = 0;
1741 u64 last = 0;
1742 int found = 0;
1743
1744 if (search_end <= cur_start) {
1745 WARN_ON(1);
1746 return 0;
1747 }
1748
1749 spin_lock(&tree->lock);
1750 if (cur_start == 0 && bits == EXTENT_DIRTY) {
1751 total_bytes = tree->dirty_bytes;
1752 goto out;
1753 }
1754 /*
1755 * this search will find all the extents that end after
1756 * our range starts.
1757 */
1758 node = tree_search(tree, cur_start);
1759 if (!node)
1760 goto out;
1761
1762 while (1) {
1763 state = rb_entry(node, struct extent_state, rb_node);
1764 if (state->start > search_end)
1765 break;
1766 if (contig && found && state->start > last + 1)
1767 break;
1768 if (state->end >= cur_start && (state->state & bits) == bits) {
1769 total_bytes += min(search_end, state->end) + 1 -
1770 max(cur_start, state->start);
1771 if (total_bytes >= max_bytes)
1772 break;
1773 if (!found) {
1774 *start = max(cur_start, state->start);
1775 found = 1;
1776 }
1777 last = state->end;
1778 } else if (contig && found) {
1779 break;
1780 }
1781 node = rb_next(node);
1782 if (!node)
1783 break;
1784 }
1785 out:
1786 spin_unlock(&tree->lock);
1787 return total_bytes;
1788 }
1789
1790 /*
1791 * set the private field for a given byte offset in the tree. If there isn't
1792 * an extent_state there already, this does nothing.
1793 */
1794 static int set_state_private(struct extent_io_tree *tree, u64 start, u64 private)
1795 {
1796 struct rb_node *node;
1797 struct extent_state *state;
1798 int ret = 0;
1799
1800 spin_lock(&tree->lock);
1801 /*
1802 * this search will find all the extents that end after
1803 * our range starts.
1804 */
1805 node = tree_search(tree, start);
1806 if (!node) {
1807 ret = -ENOENT;
1808 goto out;
1809 }
1810 state = rb_entry(node, struct extent_state, rb_node);
1811 if (state->start != start) {
1812 ret = -ENOENT;
1813 goto out;
1814 }
1815 state->private = private;
1816 out:
1817 spin_unlock(&tree->lock);
1818 return ret;
1819 }
1820
1821 int get_state_private(struct extent_io_tree *tree, u64 start, u64 *private)
1822 {
1823 struct rb_node *node;
1824 struct extent_state *state;
1825 int ret = 0;
1826
1827 spin_lock(&tree->lock);
1828 /*
1829 * this search will find all the extents that end after
1830 * our range starts.
1831 */
1832 node = tree_search(tree, start);
1833 if (!node) {
1834 ret = -ENOENT;
1835 goto out;
1836 }
1837 state = rb_entry(node, struct extent_state, rb_node);
1838 if (state->start != start) {
1839 ret = -ENOENT;
1840 goto out;
1841 }
1842 *private = state->private;
1843 out:
1844 spin_unlock(&tree->lock);
1845 return ret;
1846 }
1847
1848 /*
1849 * searches a range in the state tree for a given mask.
1850 * If 'filled' == 1, this returns 1 only if every extent in the tree
1851 * has the bits set. Otherwise, 1 is returned if any bit in the
1852 * range is found set.
1853 */
1854 int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
1855 unsigned long bits, int filled, struct extent_state *cached)
1856 {
1857 struct extent_state *state = NULL;
1858 struct rb_node *node;
1859 int bitset = 0;
1860
1861 spin_lock(&tree->lock);
1862 if (cached && cached->tree && cached->start <= start &&
1863 cached->end > start)
1864 node = &cached->rb_node;
1865 else
1866 node = tree_search(tree, start);
1867 while (node && start <= end) {
1868 state = rb_entry(node, struct extent_state, rb_node);
1869
1870 if (filled && state->start > start) {
1871 bitset = 0;
1872 break;
1873 }
1874
1875 if (state->start > end)
1876 break;
1877
1878 if (state->state & bits) {
1879 bitset = 1;
1880 if (!filled)
1881 break;
1882 } else if (filled) {
1883 bitset = 0;
1884 break;
1885 }
1886
1887 if (state->end == (u64)-1)
1888 break;
1889
1890 start = state->end + 1;
1891 if (start > end)
1892 break;
1893 node = rb_next(node);
1894 if (!node) {
1895 if (filled)
1896 bitset = 0;
1897 break;
1898 }
1899 }
1900 spin_unlock(&tree->lock);
1901 return bitset;
1902 }
1903
1904 /*
1905 * helper function to set a given page up to date if all the
1906 * extents in the tree for that page are up to date
1907 */
1908 static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
1909 {
1910 u64 start = page_offset(page);
1911 u64 end = start + PAGE_CACHE_SIZE - 1;
1912 if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
1913 SetPageUptodate(page);
1914 }
1915
1916 /*
1917 * When IO fails, either with EIO or csum verification fails, we
1918 * try other mirrors that might have a good copy of the data. This
1919 * io_failure_record is used to record state as we go through all the
1920 * mirrors. If another mirror has good data, the page is set up to date
1921 * and things continue. If a good mirror can't be found, the original
1922 * bio end_io callback is called to indicate things have failed.
1923 */
1924 struct io_failure_record {
1925 struct page *page;
1926 u64 start;
1927 u64 len;
1928 u64 logical;
1929 unsigned long bio_flags;
1930 int this_mirror;
1931 int failed_mirror;
1932 int in_validation;
1933 };

static int free_io_failure(struct inode *inode, struct io_failure_record *rec,
				int did_repair)
{
	int ret;
	int err = 0;
	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;

	set_state_private(failure_tree, rec->start, 0);
	ret = clear_extent_bits(failure_tree, rec->start,
				rec->start + rec->len - 1,
				EXTENT_LOCKED | EXTENT_DIRTY, GFP_NOFS);
	if (ret)
		err = ret;

	ret = clear_extent_bits(&BTRFS_I(inode)->io_tree, rec->start,
				rec->start + rec->len - 1,
				EXTENT_DAMAGED, GFP_NOFS);
	if (ret && !err)
		err = ret;

	kfree(rec);
	return err;
}

static void repair_io_failure_callback(struct bio *bio, int err)
{
	complete(bio->bi_private);
}

/*
 * this bypasses the standard btrfs submit functions deliberately, as
 * the standard behavior is to write all copies in a raid setup. here we only
 * want to write the one bad copy. so we do the mapping for ourselves and issue
 * submit_bio directly.
 * to avoid any synchronization issues, wait for the data after writing, which
 * actually prevents the read that triggered the error from finishing.
 * currently, there can be no more than two copies of every data bit. thus,
 * exactly one rewrite is required.
 */
int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
			u64 length, u64 logical, struct page *page,
			int mirror_num)
{
	struct bio *bio;
	struct btrfs_device *dev;
	DECLARE_COMPLETION_ONSTACK(compl);
	u64 map_length = 0;
	u64 sector;
	struct btrfs_bio *bbio = NULL;
	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
	int ret;

	BUG_ON(!mirror_num);

	/* we can't repair anything in raid56 yet */
	if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
		return 0;

	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
	if (!bio)
		return -EIO;
	bio->bi_private = &compl;
	bio->bi_end_io = repair_io_failure_callback;
	bio->bi_size = 0;
	map_length = length;

	ret = btrfs_map_block(fs_info, WRITE, logical,
			      &map_length, &bbio, mirror_num);
	if (ret) {
		bio_put(bio);
		return -EIO;
	}
	BUG_ON(mirror_num != bbio->mirror_num);
	sector = bbio->stripes[mirror_num-1].physical >> 9;
	bio->bi_sector = sector;
	dev = bbio->stripes[mirror_num-1].dev;
	kfree(bbio);
	if (!dev || !dev->bdev || !dev->writeable) {
		bio_put(bio);
		return -EIO;
	}
	bio->bi_bdev = dev->bdev;
	bio_add_page(bio, page, length, start - page_offset(page));
	btrfsic_submit_bio(WRITE_SYNC, bio);
	wait_for_completion(&compl);

	if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
		/* try to remap that extent elsewhere? */
		bio_put(bio);
		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS);
		return -EIO;
	}

	printk_ratelimited_in_rcu(KERN_INFO "btrfs read error corrected: ino %lu off %llu "
		      "(dev %s sector %llu)\n", page->mapping->host->i_ino,
		      start, rcu_str_deref(dev->name), sector);

	bio_put(bio);
	return 0;
}
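/*
 * Note that the physical byte offset from btrfs_map_block() is shifted
 * by 9 to become a 512-byte sector number before it is handed to the
 * block layer, and that the page is added at start - page_offset(page),
 * so only the failed range within the page is rewritten.
 */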

int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
			 int mirror_num)
{
	u64 start = eb->start;
	unsigned long i, num_pages = num_extent_pages(eb->start, eb->len);
	int ret = 0;

	for (i = 0; i < num_pages; i++) {
		struct page *p = extent_buffer_page(eb, i);
		ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE,
					start, p, mirror_num);
		if (ret)
			break;
		start += PAGE_CACHE_SIZE;
	}

	return ret;
}

/*
 * each time an IO finishes, we do a fast check in the IO failure tree
 * to see if we need to process or clean up an io_failure_record
 */
static int clean_io_failure(u64 start, struct page *page)
{
	u64 private;
	u64 private_failure;
	struct io_failure_record *failrec;
	struct btrfs_fs_info *fs_info;
	struct extent_state *state;
	int num_copies;
	int did_repair = 0;
	int ret;
	struct inode *inode = page->mapping->host;

	private = 0;
	ret = count_range_bits(&BTRFS_I(inode)->io_failure_tree, &private,
				(u64)-1, 1, EXTENT_DIRTY, 0);
	if (!ret)
		return 0;

	ret = get_state_private(&BTRFS_I(inode)->io_failure_tree, start,
				&private_failure);
	if (ret)
		return 0;

	failrec = (struct io_failure_record *)(unsigned long) private_failure;
	BUG_ON(!failrec->this_mirror);

	if (failrec->in_validation) {
		/* there was no real error, just free the record */
		pr_debug("clean_io_failure: freeing dummy error at %llu\n",
			 failrec->start);
		did_repair = 1;
		goto out;
	}

	spin_lock(&BTRFS_I(inode)->io_tree.lock);
	state = find_first_extent_bit_state(&BTRFS_I(inode)->io_tree,
					    failrec->start,
					    EXTENT_LOCKED);
	spin_unlock(&BTRFS_I(inode)->io_tree.lock);

	if (state && state->start <= failrec->start &&
	    state->end >= failrec->start + failrec->len - 1) {
		fs_info = BTRFS_I(inode)->root->fs_info;
		num_copies = btrfs_num_copies(fs_info, failrec->logical,
					      failrec->len);
		if (num_copies > 1) {
			ret = repair_io_failure(fs_info, start, failrec->len,
						failrec->logical, page,
						failrec->failed_mirror);
			did_repair = !ret;
		}
		ret = 0;
	}

out:
	if (!ret)
		ret = free_io_failure(inode, failrec, did_repair);

	return ret;
}
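/*
 * clean_io_failure() is invoked from the read end_io path below once a
 * page's checksum verifies, so a range that previously failed and was
 * recorded in the io_failure_tree gets repaired (when more than one
 * copy exists) and its record released.
 */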

/*
 * this is a generic handler for readpage errors (default
 * readpage_io_failed_hook). if other copies exist, read those and write back
 * good data to the failed position. does not investigate in remapping the
 * failed extent elsewhere, hoping the device will be smart enough to do this as
 * needed
 */

static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
			      struct page *page, u64 start, u64 end,
			      int failed_mirror)
{
	struct io_failure_record *failrec = NULL;
	u64 private;
	struct extent_map *em;
	struct inode *inode = page->mapping->host;
	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
	struct bio *bio;
	struct btrfs_io_bio *btrfs_failed_bio;
	struct btrfs_io_bio *btrfs_bio;
	int num_copies;
	int ret;
	int read_mode;
	u64 logical;

	BUG_ON(failed_bio->bi_rw & REQ_WRITE);

	ret = get_state_private(failure_tree, start, &private);
	if (ret) {
		failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
		if (!failrec)
			return -ENOMEM;
		failrec->start = start;
		failrec->len = end - start + 1;
		failrec->this_mirror = 0;
		failrec->bio_flags = 0;
		failrec->in_validation = 0;

		read_lock(&em_tree->lock);
		em = lookup_extent_mapping(em_tree, start, failrec->len);
		if (!em) {
			read_unlock(&em_tree->lock);
			kfree(failrec);
			return -EIO;
		}

		if (em->start > start || em->start + em->len < start) {
			free_extent_map(em);
			em = NULL;
		}
		read_unlock(&em_tree->lock);

		if (!em) {
			kfree(failrec);
			return -EIO;
		}
		logical = start - em->start;
		logical = em->block_start + logical;
		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
			logical = em->block_start;
			failrec->bio_flags = EXTENT_BIO_COMPRESSED;
			extent_set_compress_type(&failrec->bio_flags,
						 em->compress_type);
		}
		pr_debug("bio_readpage_error: (new) logical=%llu, start=%llu, "
			 "len=%llu\n", logical, start, failrec->len);
		failrec->logical = logical;
		free_extent_map(em);

		/* set the bits in the private failure tree */
		ret = set_extent_bits(failure_tree, start, end,
					EXTENT_LOCKED | EXTENT_DIRTY, GFP_NOFS);
		if (ret >= 0)
			ret = set_state_private(failure_tree, start,
						(u64)(unsigned long)failrec);
		/* set the bits in the inode's tree */
		if (ret >= 0)
			ret = set_extent_bits(tree, start, end, EXTENT_DAMAGED,
						GFP_NOFS);
		if (ret < 0) {
			kfree(failrec);
			return ret;
		}
	} else {
		failrec = (struct io_failure_record *)(unsigned long)private;
		pr_debug("bio_readpage_error: (found) logical=%llu, "
			 "start=%llu, len=%llu, validation=%d\n",
			 failrec->logical, failrec->start, failrec->len,
			 failrec->in_validation);
		/*
		 * when data can be on disk more than twice, add to failrec here
		 * (e.g. with a list for failed_mirror) to make
		 * clean_io_failure() clean all those errors at once.
		 */
	}
	num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info,
				      failrec->logical, failrec->len);
	if (num_copies == 1) {
		/*
		 * we only have a single copy of the data, so don't bother with
		 * all the retry and error correction code that follows. no
		 * matter what the error is, it is very likely to persist.
		 */
		pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
			 num_copies, failrec->this_mirror, failed_mirror);
		free_io_failure(inode, failrec, 0);
		return -EIO;
	}

	/*
	 * there are two premises:
	 * a) deliver good data to the caller
	 * b) correct the bad sectors on disk
	 */
	if (failed_bio->bi_vcnt > 1) {
		/*
		 * to fulfill b), we need to know the exact failing sectors, as
		 * we don't want to rewrite any more than the failed ones. thus,
		 * we need separate read requests for the failed bio
		 *
		 * if the following BUG_ON triggers, our validation request got
		 * merged. we need separate requests for our algorithm to work.
		 */
		BUG_ON(failrec->in_validation);
		failrec->in_validation = 1;
		failrec->this_mirror = failed_mirror;
		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
	} else {
		/*
		 * we're ready to fulfill a) and b) alongside. get a good copy
		 * of the failed sector and if we succeed, we have setup
		 * everything for repair_io_failure to do the rest for us.
		 */
		if (failrec->in_validation) {
			BUG_ON(failrec->this_mirror != failed_mirror);
			failrec->in_validation = 0;
			failrec->this_mirror = 0;
		}
		failrec->failed_mirror = failed_mirror;
		failrec->this_mirror++;
		if (failrec->this_mirror == failed_mirror)
			failrec->this_mirror++;
		read_mode = READ_SYNC;
	}

	if (failrec->this_mirror > num_copies) {
		pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
			 num_copies, failrec->this_mirror, failed_mirror);
		free_io_failure(inode, failrec, 0);
		return -EIO;
	}

	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
	if (!bio) {
		free_io_failure(inode, failrec, 0);
		return -EIO;
	}
	bio->bi_end_io = failed_bio->bi_end_io;
	bio->bi_sector = failrec->logical >> 9;
	bio->bi_bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
	bio->bi_size = 0;

	btrfs_failed_bio = btrfs_io_bio(failed_bio);
	if (btrfs_failed_bio->csum) {
		struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
		u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);

		btrfs_bio = btrfs_io_bio(bio);
		btrfs_bio->csum = btrfs_bio->csum_inline;
		phy_offset >>= inode->i_sb->s_blocksize_bits;
		phy_offset *= csum_size;
		memcpy(btrfs_bio->csum, btrfs_failed_bio->csum + phy_offset,
		       csum_size);
	}

	bio_add_page(bio, page, failrec->len, start - page_offset(page));

	pr_debug("bio_readpage_error: submitting new read[%#x] to "
		 "this_mirror=%d, num_copies=%d, in_validation=%d\n", read_mode,
		 failrec->this_mirror, num_copies, failrec->in_validation);

	ret = tree->ops->submit_bio_hook(inode, read_mode, bio,
					 failrec->this_mirror,
					 failrec->bio_flags, 0);
	return ret;
}
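/*
 * Mirror selection above in short: with a multi-vec bio the failed
 * mirror is re-read first purely to isolate the failing sector
 * (in_validation); otherwise this_mirror simply walks 1..num_copies,
 * skipping failed_mirror, and the whole attempt is abandoned with -EIO
 * once this_mirror exceeds num_copies.
 */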

/* lots and lots of room for performance fixes in the end_bio funcs */

int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
{
	int uptodate = (err == 0);
	struct extent_io_tree *tree;
	int ret = 0;

	tree = &BTRFS_I(page->mapping->host)->io_tree;

	if (tree->ops && tree->ops->writepage_end_io_hook) {
		ret = tree->ops->writepage_end_io_hook(page, start,
						end, NULL, uptodate);
		if (ret)
			uptodate = 0;
	}

	if (!uptodate) {
		ClearPageUptodate(page);
		SetPageError(page);
		ret = ret < 0 ? ret : -EIO;
		mapping_set_error(page->mapping, ret);
	}
	return 0;
}

/*
 * after a writepage IO is done, we need to:
 * clear the uptodate bits on error
 * clear the writeback bits in the extent tree for this IO
 * end_page_writeback if the page has no more pending IO
 *
 * Scheduling is not allowed, so the extent state tree is expected
 * to have one and only one object corresponding to this IO.
 */
static void end_bio_extent_writepage(struct bio *bio, int err)
{
	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
	struct extent_io_tree *tree;
	u64 start;
	u64 end;

	do {
		struct page *page = bvec->bv_page;
		tree = &BTRFS_I(page->mapping->host)->io_tree;

		/* We always issue full-page reads, but if some block
		 * in a page fails to read, blk_update_request() will
		 * advance bv_offset and adjust bv_len to compensate.
		 * Print a warning for nonzero offsets, and an error
		 * if they don't add up to a full page. */
		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE)
			printk("%s page write in btrfs with offset %u and length %u\n",
			       bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE
			       ? KERN_ERR "partial" : KERN_INFO "incomplete",
			       bvec->bv_offset, bvec->bv_len);

		start = page_offset(page);
		end = start + bvec->bv_offset + bvec->bv_len - 1;

		if (--bvec >= bio->bi_io_vec)
			prefetchw(&bvec->bv_page->flags);

		if (end_extent_writepage(page, err, start, end))
			continue;

		end_page_writeback(page);
	} while (bvec >= bio->bi_io_vec);

	bio_put(bio);
}
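/*
 * The write completion above walks the bio_vec array from the last
 * entry backwards (bvec starts at bi_io_vec + bi_vcnt - 1 and is
 * decremented), prefetching the next page's flags before finishing
 * writeback on the current one.
 */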

static void
endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len,
			      int uptodate)
{
	struct extent_state *cached = NULL;
	u64 end = start + len - 1;

	if (uptodate && tree->track_uptodate)
		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
	unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC);
}

/*
 * after a readpage IO is done, we need to:
 * clear the uptodate bits on error
 * set the uptodate bits if things worked
 * set the page up to date if all extents in the tree are uptodate
 * clear the lock bit in the extent tree
 * unlock the page if there are no other extents locked for it
 *
 * Scheduling is not allowed, so the extent state tree is expected
 * to have one and only one object corresponding to this IO.
 */
static void end_bio_extent_readpage(struct bio *bio, int err)
{
	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
	struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
	struct bio_vec *bvec = bio->bi_io_vec;
	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
	struct extent_io_tree *tree;
	u64 offset = 0;
	u64 start;
	u64 end;
	u64 len;
	u64 extent_start = 0;
	u64 extent_len = 0;
	int mirror;
	int ret;

	if (err)
		uptodate = 0;

	do {
		struct page *page = bvec->bv_page;
		struct inode *inode = page->mapping->host;

		pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
			 "mirror=%lu\n", (u64)bio->bi_sector, err,
			 io_bio->mirror_num);
		tree = &BTRFS_I(inode)->io_tree;

		/* We always issue full-page reads, but if some block
		 * in a page fails to read, blk_update_request() will
		 * advance bv_offset and adjust bv_len to compensate.
		 * Print a warning for nonzero offsets, and an error
		 * if they don't add up to a full page. */
		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE)
			printk("%s page read in btrfs with offset %u and length %u\n",
			       bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE
			       ? KERN_ERR "partial" : KERN_INFO "incomplete",
			       bvec->bv_offset, bvec->bv_len);

		start = page_offset(page);
		end = start + bvec->bv_offset + bvec->bv_len - 1;
		len = bvec->bv_len;

		if (++bvec <= bvec_end)
			prefetchw(&bvec->bv_page->flags);

		mirror = io_bio->mirror_num;
		if (likely(uptodate && tree->ops &&
			   tree->ops->readpage_end_io_hook)) {
			ret = tree->ops->readpage_end_io_hook(io_bio, offset,
							      page, start, end,
							      mirror);
			if (ret)
				uptodate = 0;
			else
				clean_io_failure(start, page);
		}

		if (likely(uptodate))
			goto readpage_ok;

		if (tree->ops && tree->ops->readpage_io_failed_hook) {
			ret = tree->ops->readpage_io_failed_hook(page, mirror);
			if (!ret && !err &&
			    test_bit(BIO_UPTODATE, &bio->bi_flags))
				uptodate = 1;
		} else {
			/*
			 * The generic bio_readpage_error handles errors the
			 * following way: If possible, new read requests are
			 * created and submitted and will end up in
			 * end_bio_extent_readpage as well (if we're lucky, not
			 * in the !uptodate case). In that case it returns 0 and
			 * we just go on with the next page in our bio. If it
			 * can't handle the error it will return -EIO and we
			 * remain responsible for that page.
			 */
			ret = bio_readpage_error(bio, offset, page, start, end,
						 mirror);
			if (ret == 0) {
				uptodate =
					test_bit(BIO_UPTODATE, &bio->bi_flags);
				if (err)
					uptodate = 0;
				offset += len;
				continue;
			}
		}
readpage_ok:
		if (likely(uptodate)) {
			loff_t i_size = i_size_read(inode);
			pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
			unsigned offset;

			/* Zero out the end if this page straddles i_size */
			offset = i_size & (PAGE_CACHE_SIZE-1);
			if (page->index == end_index && offset)
				zero_user_segment(page, offset, PAGE_CACHE_SIZE);
			SetPageUptodate(page);
		} else {
			ClearPageUptodate(page);
			SetPageError(page);
		}
		unlock_page(page);
		offset += len;

		if (unlikely(!uptodate)) {
			if (extent_len) {
				endio_readpage_release_extent(tree,
							      extent_start,
							      extent_len, 1);
				extent_start = 0;
				extent_len = 0;
			}
			endio_readpage_release_extent(tree, start,
						      end - start + 1, 0);
		} else if (!extent_len) {
			extent_start = start;
			extent_len = end + 1 - start;
		} else if (extent_start + extent_len == start) {
			extent_len += end + 1 - start;
		} else {
			endio_readpage_release_extent(tree, extent_start,
						      extent_len, uptodate);
			extent_start = start;
			extent_len = end + 1 - start;
		}
	} while (bvec <= bvec_end);

	if (extent_len)
		endio_readpage_release_extent(tree, extent_start, extent_len,
					      uptodate);
	if (io_bio->end_io)
		io_bio->end_io(io_bio, err);
	bio_put(bio);
}

/*
 * this allocates from the btrfs_bioset. We're returning a bio right now
 * but you can call btrfs_io_bio for the appropriate container_of magic
 */
struct bio *
btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
		gfp_t gfp_flags)
{
	struct btrfs_io_bio *btrfs_bio;
	struct bio *bio;

	bio = bio_alloc_bioset(gfp_flags, nr_vecs, btrfs_bioset);

	if (bio == NULL && (current->flags & PF_MEMALLOC)) {
		while (!bio && (nr_vecs /= 2)) {
			bio = bio_alloc_bioset(gfp_flags,
					       nr_vecs, btrfs_bioset);
		}
	}

	if (bio) {
		bio->bi_size = 0;
		bio->bi_bdev = bdev;
		bio->bi_sector = first_sector;
		btrfs_bio = btrfs_io_bio(bio);
		btrfs_bio->csum = NULL;
		btrfs_bio->csum_allocated = NULL;
		btrfs_bio->end_io = NULL;
	}
	return bio;
}
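/*
 * A sketch of the "container_of magic" mentioned above, assuming the
 * btrfs_io_bio definition elsewhere embeds its struct bio as the last
 * member (the exact field layout here is illustrative, not quoted):
 *
 *	struct btrfs_io_bio {
 *		unsigned int mirror_num;
 *		u8 *csum;
 *		u8 *csum_allocated;
 *		btrfs_io_bio_end_io_t *end_io;
 *		struct bio bio;
 *	};
 *
 *	static inline struct btrfs_io_bio *btrfs_io_bio(struct bio *bio)
 *	{
 *		return container_of(bio, struct btrfs_io_bio, bio);
 *	}
 *
 * Because every bio handed out here comes from btrfs_bioset, it really is
 * the bio member of a larger btrfs_io_bio, which is why btrfs_io_bio(bio)
 * can recover the wrapper without any extra allocation.
 */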

struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
{
	return bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
}


/* this also allocates from the btrfs_bioset */
struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
{
	struct btrfs_io_bio *btrfs_bio;
	struct bio *bio;

	bio = bio_alloc_bioset(gfp_mask, nr_iovecs, btrfs_bioset);
	if (bio) {
		btrfs_bio = btrfs_io_bio(bio);
		btrfs_bio->csum = NULL;
		btrfs_bio->csum_allocated = NULL;
		btrfs_bio->end_io = NULL;
	}
	return bio;
}


static int __must_check submit_one_bio(int rw, struct bio *bio,
				       int mirror_num, unsigned long bio_flags)
{
	int ret = 0;
	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
	struct page *page = bvec->bv_page;
	struct extent_io_tree *tree = bio->bi_private;
	u64 start;

	start = page_offset(page) + bvec->bv_offset;

	bio->bi_private = NULL;

	bio_get(bio);

	if (tree->ops && tree->ops->submit_bio_hook)
		ret = tree->ops->submit_bio_hook(page->mapping->host, rw, bio,
						 mirror_num, bio_flags, start);
	else
		btrfsic_submit_bio(rw, bio);

	if (bio_flagged(bio, BIO_EOPNOTSUPP))
		ret = -EOPNOTSUPP;
	bio_put(bio);
	return ret;
}

static int merge_bio(int rw, struct extent_io_tree *tree, struct page *page,
		     unsigned long offset, size_t size, struct bio *bio,
		     unsigned long bio_flags)
{
	int ret = 0;
	if (tree->ops && tree->ops->merge_bio_hook)
		ret = tree->ops->merge_bio_hook(rw, page, offset, size, bio,
						bio_flags);
	BUG_ON(ret < 0);
	return ret;

}

static int submit_extent_page(int rw, struct extent_io_tree *tree,
			      struct page *page, sector_t sector,
			      size_t size, unsigned long offset,
			      struct block_device *bdev,
			      struct bio **bio_ret,
			      unsigned long max_pages,
			      bio_end_io_t end_io_func,
			      int mirror_num,
			      unsigned long prev_bio_flags,
			      unsigned long bio_flags)
{
	int ret = 0;
	struct bio *bio;
	int nr;
	int contig = 0;
	int this_compressed = bio_flags & EXTENT_BIO_COMPRESSED;
	int old_compressed = prev_bio_flags & EXTENT_BIO_COMPRESSED;
	size_t page_size = min_t(size_t, size, PAGE_CACHE_SIZE);

	if (bio_ret && *bio_ret) {
		bio = *bio_ret;
		if (old_compressed)
			contig = bio->bi_sector == sector;
		else
			contig = bio_end_sector(bio) == sector;

		if (prev_bio_flags != bio_flags || !contig ||
		    merge_bio(rw, tree, page, offset, page_size, bio, bio_flags) ||
		    bio_add_page(bio, page, page_size, offset) < page_size) {
			ret = submit_one_bio(rw, bio, mirror_num,
					     prev_bio_flags);
			if (ret < 0)
				return ret;
			bio = NULL;
		} else {
			return 0;
		}
	}
	if (this_compressed)
		nr = BIO_MAX_PAGES;
	else
		nr = bio_get_nr_vecs(bdev);

	bio = btrfs_bio_alloc(bdev, sector, nr, GFP_NOFS | __GFP_HIGH);
	if (!bio)
		return -ENOMEM;

	bio_add_page(bio, page, page_size, offset);
	bio->bi_end_io = end_io_func;
	bio->bi_private = tree;

	if (bio_ret)
		*bio_ret = bio;
	else
		ret = submit_one_bio(rw, bio, mirror_num, bio_flags);

	return ret;
}
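/*
 * In other words: submit_extent_page() keeps appending contiguous pages
 * to the bio cached in *bio_ret and only submits it when the flags
 * change, the new sector is not adjacent, the merge hook refuses, or
 * bio_add_page() cannot take the full page; the freshly allocated
 * replacement bio is then either cached again or submitted immediately
 * when no bio_ret is supplied.
 */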

static void attach_extent_buffer_page(struct extent_buffer *eb,
				      struct page *page)
{
	if (!PagePrivate(page)) {
		SetPagePrivate(page);
		page_cache_get(page);
		set_page_private(page, (unsigned long)eb);
	} else {
		WARN_ON(page->private != (unsigned long)eb);
	}
}

void set_page_extent_mapped(struct page *page)
{
	if (!PagePrivate(page)) {
		SetPagePrivate(page);
		page_cache_get(page);
		set_page_private(page, EXTENT_PAGE_PRIVATE);
	}
}

static struct extent_map *
__get_extent_map(struct inode *inode, struct page *page, size_t pg_offset,
		 u64 start, u64 len, get_extent_t *get_extent,
		 struct extent_map **em_cached)
{
	struct extent_map *em;

	if (em_cached && *em_cached) {
		em = *em_cached;
		if (em->in_tree && start >= em->start &&
		    start < extent_map_end(em)) {
			atomic_inc(&em->refs);
			return em;
		}

		free_extent_map(em);
		*em_cached = NULL;
	}

	em = get_extent(inode, page, pg_offset, start, len, 0);
	if (em_cached && !IS_ERR_OR_NULL(em)) {
		BUG_ON(*em_cached);
		atomic_inc(&em->refs);
		*em_cached = em;
	}
	return em;
}
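/*
 * __get_extent_map() is a tiny one-entry cache: if *em_cached still
 * covers 'start' it is reused with an extra reference, otherwise the
 * stale entry is dropped and the freshly looked-up mapping becomes the
 * cached entry for subsequent calls on neighbouring pages.
 */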
/*
 * basic readpage implementation. Locked extent state structs are inserted
 * into the tree that are removed when the IO is done (by the end_io
 * handlers)
 * XXX JDM: This needs looking at to ensure proper page locking
 */
static int __do_readpage(struct extent_io_tree *tree,
			 struct page *page,
			 get_extent_t *get_extent,
			 struct extent_map **em_cached,
			 struct bio **bio, int mirror_num,
			 unsigned long *bio_flags, int rw)
{
	struct inode *inode = page->mapping->host;
	u64 start = page_offset(page);
	u64 page_end = start + PAGE_CACHE_SIZE - 1;
	u64 end;
	u64 cur = start;
	u64 extent_offset;
	u64 last_byte = i_size_read(inode);
	u64 block_start;
	u64 cur_end;
	sector_t sector;
	struct extent_map *em;
	struct block_device *bdev;
	int ret;
	int nr = 0;
	int parent_locked = *bio_flags & EXTENT_BIO_PARENT_LOCKED;
	size_t pg_offset = 0;
	size_t iosize;
	size_t disk_io_size;
	size_t blocksize = inode->i_sb->s_blocksize;
	unsigned long this_bio_flag = *bio_flags & EXTENT_BIO_PARENT_LOCKED;

	set_page_extent_mapped(page);

	end = page_end;
	if (!PageUptodate(page)) {
		if (cleancache_get_page(page) == 0) {
			BUG_ON(blocksize != PAGE_SIZE);
			unlock_extent(tree, start, end);
			goto out;
		}
	}

	if (page->index == last_byte >> PAGE_CACHE_SHIFT) {
		char *userpage;
		size_t zero_offset = last_byte & (PAGE_CACHE_SIZE - 1);

		if (zero_offset) {
			iosize = PAGE_CACHE_SIZE - zero_offset;
			userpage = kmap_atomic(page);
			memset(userpage + zero_offset, 0, iosize);
			flush_dcache_page(page);
			kunmap_atomic(userpage);
		}
	}
	while (cur <= end) {
		unsigned long pnr = (last_byte >> PAGE_CACHE_SHIFT) + 1;

		if (cur >= last_byte) {
			char *userpage;
			struct extent_state *cached = NULL;

			iosize = PAGE_CACHE_SIZE - pg_offset;
			userpage = kmap_atomic(page);
			memset(userpage + pg_offset, 0, iosize);
			flush_dcache_page(page);
			kunmap_atomic(userpage);
			set_extent_uptodate(tree, cur, cur + iosize - 1,
					    &cached, GFP_NOFS);
			if (!parent_locked)
				unlock_extent_cached(tree, cur,
						     cur + iosize - 1,
						     &cached, GFP_NOFS);
			break;
		}
		em = __get_extent_map(inode, page, pg_offset, cur,
				      end - cur + 1, get_extent, em_cached);
		if (IS_ERR_OR_NULL(em)) {
			SetPageError(page);
			if (!parent_locked)
				unlock_extent(tree, cur, end);
			break;
		}
		extent_offset = cur - em->start;
		BUG_ON(extent_map_end(em) <= cur);
		BUG_ON(end < cur);

		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
			this_bio_flag |= EXTENT_BIO_COMPRESSED;
			extent_set_compress_type(&this_bio_flag,
						 em->compress_type);
		}

		iosize = min(extent_map_end(em) - cur, end - cur + 1);
		cur_end = min(extent_map_end(em) - 1, end);
		iosize = ALIGN(iosize, blocksize);
		if (this_bio_flag & EXTENT_BIO_COMPRESSED) {
			disk_io_size = em->block_len;
			sector = em->block_start >> 9;
		} else {
			sector = (em->block_start + extent_offset) >> 9;
			disk_io_size = iosize;
		}
		bdev = em->bdev;
		block_start = em->block_start;
		if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
			block_start = EXTENT_MAP_HOLE;
		free_extent_map(em);
		em = NULL;

		/* we've found a hole, just zero and go on */
		if (block_start == EXTENT_MAP_HOLE) {
			char *userpage;
			struct extent_state *cached = NULL;

			userpage = kmap_atomic(page);
			memset(userpage + pg_offset, 0, iosize);
			flush_dcache_page(page);
			kunmap_atomic(userpage);

			set_extent_uptodate(tree, cur, cur + iosize - 1,
					    &cached, GFP_NOFS);
			unlock_extent_cached(tree, cur, cur + iosize - 1,
					     &cached, GFP_NOFS);
			cur = cur + iosize;
			pg_offset += iosize;
			continue;
		}
		/* the get_extent function already copied into the page */
		if (test_range_bit(tree, cur, cur_end,
				   EXTENT_UPTODATE, 1, NULL)) {
			check_page_uptodate(tree, page);
			if (!parent_locked)
				unlock_extent(tree, cur, cur + iosize - 1);
			cur = cur + iosize;
			pg_offset += iosize;
			continue;
		}
		/* we have an inline extent but it didn't get marked up
		 * to date. Error out
		 */
		if (block_start == EXTENT_MAP_INLINE) {
			SetPageError(page);
			if (!parent_locked)
				unlock_extent(tree, cur, cur + iosize - 1);
			cur = cur + iosize;
			pg_offset += iosize;
			continue;
		}

		pnr -= page->index;
		ret = submit_extent_page(rw, tree, page,
					 sector, disk_io_size, pg_offset,
					 bdev, bio, pnr,
					 end_bio_extent_readpage, mirror_num,
					 *bio_flags,
					 this_bio_flag);
		if (!ret) {
			nr++;
			*bio_flags = this_bio_flag;
		} else {
			SetPageError(page);
			if (!parent_locked)
				unlock_extent(tree, cur, cur + iosize - 1);
		}
		cur = cur + iosize;
		pg_offset += iosize;
	}
out:
	if (!nr) {
		if (!PageError(page))
			SetPageUptodate(page);
		unlock_page(page);
	}
	return 0;
}
2921 2921
static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
					     struct page *pages[], int nr_pages,
					     u64 start, u64 end,
					     get_extent_t *get_extent,
					     struct extent_map **em_cached,
					     struct bio **bio, int mirror_num,
					     unsigned long *bio_flags, int rw)
{
	struct inode *inode;
	struct btrfs_ordered_extent *ordered;
	int index;

	inode = pages[0]->mapping->host;
	while (1) {
		lock_extent(tree, start, end);
		ordered = btrfs_lookup_ordered_range(inode, start,
						     end - start + 1);
		if (!ordered)
			break;
		unlock_extent(tree, start, end);
		btrfs_start_ordered_extent(inode, ordered, 1);
		btrfs_put_ordered_extent(ordered);
	}

	for (index = 0; index < nr_pages; index++) {
		__do_readpage(tree, pages[index], get_extent, em_cached, bio,
			      mirror_num, bio_flags, rw);
		page_cache_release(pages[index]);
	}
}

static void __extent_readpages(struct extent_io_tree *tree,
			       struct page *pages[],
			       int nr_pages, get_extent_t *get_extent,
			       struct extent_map **em_cached,
			       struct bio **bio, int mirror_num,
			       unsigned long *bio_flags, int rw)
{
	u64 start = 0;
	u64 end = 0;
	u64 page_start;
	int index;
	int first_index = 0;

	for (index = 0; index < nr_pages; index++) {
		page_start = page_offset(pages[index]);
		if (!end) {
			start = page_start;
			end = start + PAGE_CACHE_SIZE - 1;
			first_index = index;
		} else if (end + 1 == page_start) {
			end += PAGE_CACHE_SIZE;
		} else {
			__do_contiguous_readpages(tree, &pages[first_index],
						  index - first_index, start,
						  end, get_extent, em_cached,
						  bio, mirror_num, bio_flags,
						  rw);
			start = page_start;
			end = start + PAGE_CACHE_SIZE - 1;
			first_index = index;
		}
	}

	if (end)
		__do_contiguous_readpages(tree, &pages[first_index],
					  index - first_index, start,
					  end, get_extent, em_cached, bio,
					  mirror_num, bio_flags, rw);
}

static int __extent_read_full_page(struct extent_io_tree *tree,
				   struct page *page,
				   get_extent_t *get_extent,
				   struct bio **bio, int mirror_num,
				   unsigned long *bio_flags, int rw)
{
	struct inode *inode = page->mapping->host;
	struct btrfs_ordered_extent *ordered;
	u64 start = page_offset(page);
	u64 end = start + PAGE_CACHE_SIZE - 1;
	int ret;

	while (1) {
		lock_extent(tree, start, end);
		ordered = btrfs_lookup_ordered_extent(inode, start);
		if (!ordered)
			break;
		unlock_extent(tree, start, end);
		btrfs_start_ordered_extent(inode, ordered, 1);
		btrfs_put_ordered_extent(ordered);
	}

	ret = __do_readpage(tree, page, get_extent, NULL, bio, mirror_num,
			    bio_flags, rw);
	return ret;
}

int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
			  get_extent_t *get_extent, int mirror_num)
{
	struct bio *bio = NULL;
	unsigned long bio_flags = 0;
	int ret;

	ret = __extent_read_full_page(tree, page, get_extent, &bio, mirror_num,
				      &bio_flags, READ);
	if (bio)
		ret = submit_one_bio(READ, bio, mirror_num, bio_flags);
	return ret;
}

int extent_read_full_page_nolock(struct extent_io_tree *tree, struct page *page,
				 get_extent_t *get_extent, int mirror_num)
{
	struct bio *bio = NULL;
	unsigned long bio_flags = EXTENT_BIO_PARENT_LOCKED;
	int ret;

	ret = __do_readpage(tree, page, get_extent, NULL, &bio, mirror_num,
			    &bio_flags, READ);
	if (bio)
		ret = submit_one_bio(READ, bio, mirror_num, bio_flags);
	return ret;
}

static noinline void update_nr_written(struct page *page,
				       struct writeback_control *wbc,
				       unsigned long nr_written)
{
	wbc->nr_to_write -= nr_written;
	if (wbc->range_cyclic || (wbc->nr_to_write > 0 &&
	    wbc->range_start == 0 && wbc->range_end == LLONG_MAX))
		page->mapping->writeback_index = page->index + nr_written;
}

/*
 * the writepage semantics are similar to regular writepage.  extent
 * records are inserted to lock ranges in the tree, and as dirty areas
 * are found, they are marked writeback.  Then the lock bits are removed
 * and the end_io handler clears the writeback ranges
 */
static int __extent_writepage(struct page *page, struct writeback_control *wbc,
			      void *data)
{
	struct inode *inode = page->mapping->host;
	struct extent_page_data *epd = data;
	struct extent_io_tree *tree = epd->tree;
	u64 start = page_offset(page);
	u64 delalloc_start;
	u64 page_end = start + PAGE_CACHE_SIZE - 1;
	u64 end;
	u64 cur = start;
	u64 extent_offset;
	u64 last_byte = i_size_read(inode);
	u64 block_start;
	u64 iosize;
	sector_t sector;
	struct extent_state *cached_state = NULL;
	struct extent_map *em;
	struct block_device *bdev;
	int ret;
	int nr = 0;
	size_t pg_offset = 0;
	size_t blocksize;
	loff_t i_size = i_size_read(inode);
	unsigned long end_index = i_size >> PAGE_CACHE_SHIFT;
	u64 nr_delalloc;
	u64 delalloc_end;
	int page_started;
	int compressed;
	int write_flags;
	unsigned long nr_written = 0;
	bool fill_delalloc = true;

	if (wbc->sync_mode == WB_SYNC_ALL)
		write_flags = WRITE_SYNC;
	else
		write_flags = WRITE;

	trace___extent_writepage(page, inode, wbc);

	WARN_ON(!PageLocked(page));

	ClearPageError(page);

	pg_offset = i_size & (PAGE_CACHE_SIZE - 1);
	if (page->index > end_index ||
	   (page->index == end_index && !pg_offset)) {
		page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE);
		unlock_page(page);
		return 0;
	}

	if (page->index == end_index) {
		char *userpage;

		userpage = kmap_atomic(page);
		memset(userpage + pg_offset, 0,
		       PAGE_CACHE_SIZE - pg_offset);
		kunmap_atomic(userpage);
		flush_dcache_page(page);
	}
	pg_offset = 0;

	set_page_extent_mapped(page);

	if (!tree->ops || !tree->ops->fill_delalloc)
		fill_delalloc = false;

	delalloc_start = start;
	delalloc_end = 0;
	page_started = 0;
	if (!epd->extent_locked && fill_delalloc) {
		u64 delalloc_to_write = 0;
		/*
		 * make sure the wbc mapping index is at least updated
		 * to this page.
		 */
		update_nr_written(page, wbc, 0);

		while (delalloc_end < page_end) {
			nr_delalloc = find_lock_delalloc_range(inode, tree,
						       page,
						       &delalloc_start,
						       &delalloc_end,
						       128 * 1024 * 1024);
			if (nr_delalloc == 0) {
				delalloc_start = delalloc_end + 1;
				continue;
			}
			ret = tree->ops->fill_delalloc(inode, page,
						       delalloc_start,
						       delalloc_end,
						       &page_started,
						       &nr_written);
			/* File system has been set read-only */
			if (ret) {
				SetPageError(page);
				goto done;
			}
			/*
			 * delalloc_end is already one less than the total
			 * length, so we don't subtract one from
			 * PAGE_CACHE_SIZE
			 */
			delalloc_to_write += (delalloc_end - delalloc_start +
					      PAGE_CACHE_SIZE) >>
					      PAGE_CACHE_SHIFT;
			delalloc_start = delalloc_end + 1;
		}
		if (wbc->nr_to_write < delalloc_to_write) {
			int thresh = 8192;

			if (delalloc_to_write < thresh * 2)
				thresh = delalloc_to_write;
			wbc->nr_to_write = min_t(u64, delalloc_to_write,
						 thresh);
		}

		/* did the fill delalloc function already unlock and start
		 * the IO?
		 */
		if (page_started) {
			ret = 0;
			/*
			 * we've unlocked the page, so we can't update
			 * the mapping's writeback index, just update
			 * nr_to_write.
			 */
			wbc->nr_to_write -= nr_written;
			goto done_unlocked;
		}
	}
	if (tree->ops && tree->ops->writepage_start_hook) {
		ret = tree->ops->writepage_start_hook(page, start,
						      page_end);
		if (ret) {
			/* Fixup worker will requeue */
			if (ret == -EBUSY)
				wbc->pages_skipped++;
			else
				redirty_page_for_writepage(wbc, page);
			update_nr_written(page, wbc, nr_written);
			unlock_page(page);
			ret = 0;
			goto done_unlocked;
		}
	}

	/*
	 * we don't want to touch the inode after unlocking the page,
	 * so we update the mapping writeback index now
	 */
	update_nr_written(page, wbc, nr_written + 1);

	end = page_end;
	if (last_byte <= start) {
		if (tree->ops && tree->ops->writepage_end_io_hook)
			tree->ops->writepage_end_io_hook(page, start,
							 page_end, NULL, 1);
		goto done;
	}

	blocksize = inode->i_sb->s_blocksize;

	while (cur <= end) {
		if (cur >= last_byte) {
			if (tree->ops && tree->ops->writepage_end_io_hook)
				tree->ops->writepage_end_io_hook(page, cur,
							 page_end, NULL, 1);
			break;
		}
		em = epd->get_extent(inode, page, pg_offset, cur,
				     end - cur + 1, 1);
		if (IS_ERR_OR_NULL(em)) {
			SetPageError(page);
			break;
		}

		extent_offset = cur - em->start;
		BUG_ON(extent_map_end(em) <= cur);
		BUG_ON(end < cur);
		iosize = min(extent_map_end(em) - cur, end - cur + 1);
		iosize = ALIGN(iosize, blocksize);
		sector = (em->block_start + extent_offset) >> 9;
		bdev = em->bdev;
		block_start = em->block_start;
		compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
		free_extent_map(em);
		em = NULL;

		/*
		 * compressed and inline extents are written through other
		 * paths in the FS
		 */
		if (compressed || block_start == EXTENT_MAP_HOLE ||
		    block_start == EXTENT_MAP_INLINE) {
			/*
			 * end_io notification does not happen here for
			 * compressed extents
			 */
			if (!compressed && tree->ops &&
			    tree->ops->writepage_end_io_hook)
				tree->ops->writepage_end_io_hook(page, cur,
							 cur + iosize - 1,
							 NULL, 1);
			else if (compressed) {
				/* we don't want to end_page_writeback on
				 * a compressed extent.  this happens
				 * elsewhere
				 */
				nr++;
			}

			cur += iosize;
			pg_offset += iosize;
			continue;
		}
		/* leave this out until we have a page_mkwrite call */
		if (0 && !test_range_bit(tree, cur, cur + iosize - 1,
					 EXTENT_DIRTY, 0, NULL)) {
			cur = cur + iosize;
			pg_offset += iosize;
			continue;
		}

		if (tree->ops && tree->ops->writepage_io_hook) {
			ret = tree->ops->writepage_io_hook(page, cur,
						cur + iosize - 1);
		} else {
			ret = 0;
		}
		if (ret) {
			SetPageError(page);
		} else {
			unsigned long max_nr = end_index + 1;

			set_range_writeback(tree, cur, cur + iosize - 1);
			if (!PageWriteback(page)) {
				printk(KERN_ERR "btrfs warning page %lu not "
				       "writeback, cur %llu end %llu\n",
				       page->index, cur, end);
			}

			ret = submit_extent_page(write_flags, tree, page,
						 sector, iosize, pg_offset,
						 bdev, &epd->bio, max_nr,
						 end_bio_extent_writepage,
						 0, 0, 0);
			if (ret)
				SetPageError(page);
		}
		cur = cur + iosize;
		pg_offset += iosize;
		nr++;
	}
done:
	if (nr == 0) {
		/* make sure the mapping tag for page dirty gets cleared */
		set_page_writeback(page);
		end_page_writeback(page);
	}
	unlock_page(page);

done_unlocked:

	/* drop our reference on any cached states */
	free_extent_state(cached_state);
	return 0;
}

static int eb_wait(void *word)
{
	io_schedule();
	return 0;
}

void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
{
	wait_on_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK, eb_wait,
		    TASK_UNINTERRUPTIBLE);
}

static int lock_extent_buffer_for_io(struct extent_buffer *eb,
				     struct btrfs_fs_info *fs_info,
				     struct extent_page_data *epd)
{
	unsigned long i, num_pages;
	int flush = 0;
	int ret = 0;

	if (!btrfs_try_tree_write_lock(eb)) {
		flush = 1;
		flush_write_bio(epd);
		btrfs_tree_lock(eb);
	}

	if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
		btrfs_tree_unlock(eb);
		if (!epd->sync_io)
			return 0;
		if (!flush) {
			flush_write_bio(epd);
			flush = 1;
		}
		while (1) {
			wait_on_extent_buffer_writeback(eb);
			btrfs_tree_lock(eb);
			if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags))
				break;
			btrfs_tree_unlock(eb);
		}
	}

	/*
	 * We need to do this to prevent races in people who check if the eb is
	 * under IO since we can end up having no IO bits set for a short period
	 * of time.
	 */
	spin_lock(&eb->refs_lock);
	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
		set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
		spin_unlock(&eb->refs_lock);
		btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
		__percpu_counter_add(&fs_info->dirty_metadata_bytes,
				     -eb->len,
				     fs_info->dirty_metadata_batch);
		ret = 1;
	} else {
		spin_unlock(&eb->refs_lock);
	}

	btrfs_tree_unlock(eb);

	if (!ret)
		return ret;

	num_pages = num_extent_pages(eb->start, eb->len);
	for (i = 0; i < num_pages; i++) {
		struct page *p = extent_buffer_page(eb, i);

		if (!trylock_page(p)) {
			if (!flush) {
				flush_write_bio(epd);
				flush = 1;
			}
			lock_page(p);
		}
	}

	return ret;
}

static void end_extent_buffer_writeback(struct extent_buffer *eb)
{
	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
	smp_mb__after_clear_bit();
	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
}

static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
{
	int uptodate = err == 0;
	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
	struct extent_buffer *eb;
	int done;

	do {
		struct page *page = bvec->bv_page;

		bvec--;
		eb = (struct extent_buffer *)page->private;
		BUG_ON(!eb);
		done = atomic_dec_and_test(&eb->io_pages);

		if (!uptodate || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
			ClearPageUptodate(page);
			SetPageError(page);
		}

		end_page_writeback(page);

		if (!done)
			continue;

		end_extent_buffer_writeback(eb);
	} while (bvec >= bio->bi_io_vec);

	bio_put(bio);

}

static int write_one_eb(struct extent_buffer *eb,
			struct btrfs_fs_info *fs_info,
			struct writeback_control *wbc,
			struct extent_page_data *epd)
{
	struct block_device *bdev = fs_info->fs_devices->latest_bdev;
	u64 offset = eb->start;
	unsigned long i, num_pages;
	unsigned long bio_flags = 0;
	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
	int ret = 0;

	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
	num_pages = num_extent_pages(eb->start, eb->len);
	atomic_set(&eb->io_pages, num_pages);
	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
		bio_flags = EXTENT_BIO_TREE_LOG;

	for (i = 0; i < num_pages; i++) {
		struct page *p = extent_buffer_page(eb, i);

		clear_page_dirty_for_io(p);
		set_page_writeback(p);
		ret = submit_extent_page(rw, eb->tree, p, offset >> 9,
					 PAGE_CACHE_SIZE, 0, bdev, &epd->bio,
					 -1, end_bio_extent_buffer_writepage,
					 0, epd->bio_flags, bio_flags);
		epd->bio_flags = bio_flags;
		if (ret) {
			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
			SetPageError(p);
			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
				end_extent_buffer_writeback(eb);
			ret = -EIO;
			break;
		}
		offset += PAGE_CACHE_SIZE;
		update_nr_written(p, wbc, 1);
		unlock_page(p);
	}

	if (unlikely(ret)) {
		for (; i < num_pages; i++) {
			struct page *p = extent_buffer_page(eb, i);
			unlock_page(p);
		}
	}

	return ret;
}

int btree_write_cache_pages(struct address_space *mapping,
			    struct writeback_control *wbc)
{
	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
	struct btrfs_fs_info *fs_info = BTRFS_I(mapping->host)->root->fs_info;
	struct extent_buffer *eb, *prev_eb = NULL;
	struct extent_page_data epd = {
		.bio = NULL,
		.tree = tree,
		.extent_locked = 0,
		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
		.bio_flags = 0,
	};
	int ret = 0;
	int done = 0;
	int nr_to_write_done = 0;
	struct pagevec pvec;
	int nr_pages;
	pgoff_t index;
	pgoff_t end;		/* Inclusive */
	int scanned = 0;
	int tag;

	pagevec_init(&pvec, 0);
	if (wbc->range_cyclic) {
		index = mapping->writeback_index; /* Start from prev offset */
		end = -1;
	} else {
		index = wbc->range_start >> PAGE_CACHE_SHIFT;
		end = wbc->range_end >> PAGE_CACHE_SHIFT;
		scanned = 1;
	}
	if (wbc->sync_mode == WB_SYNC_ALL)
		tag = PAGECACHE_TAG_TOWRITE;
	else
		tag = PAGECACHE_TAG_DIRTY;
retry:
	if (wbc->sync_mode == WB_SYNC_ALL)
		tag_pages_for_writeback(mapping, index, end);
	while (!done && !nr_to_write_done && (index <= end) &&
	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
		unsigned i;

		scanned = 1;
		for (i = 0; i < nr_pages; i++) {
			struct page *page = pvec.pages[i];

			if (!PagePrivate(page))
				continue;

			if (!wbc->range_cyclic && page->index > end) {
				done = 1;
				break;
			}

			spin_lock(&mapping->private_lock);
			if (!PagePrivate(page)) {
				spin_unlock(&mapping->private_lock);
				continue;
			}

			eb = (struct extent_buffer *)page->private;

			/*
			 * Shouldn't happen and normally this would be a BUG_ON
			 * but no sense in crashing the users box for something
			 * we can survive anyway.
			 */
			if (!eb) {
				spin_unlock(&mapping->private_lock);
				WARN_ON(1);
				continue;
			}

			if (eb == prev_eb) {
				spin_unlock(&mapping->private_lock);
				continue;
			}

			ret = atomic_inc_not_zero(&eb->refs);
			spin_unlock(&mapping->private_lock);
			if (!ret)
				continue;

			prev_eb = eb;
			ret = lock_extent_buffer_for_io(eb, fs_info, &epd);
			if (!ret) {
				free_extent_buffer(eb);
				continue;
			}

			ret = write_one_eb(eb, fs_info, wbc, &epd);
			if (ret) {
				done = 1;
				free_extent_buffer(eb);
				break;
			}
			free_extent_buffer(eb);

			/*
			 * the filesystem may choose to bump up nr_to_write.
			 * We have to make sure to honor the new nr_to_write
			 * at any time
			 */
			nr_to_write_done = wbc->nr_to_write <= 0;
		}
		pagevec_release(&pvec);
		cond_resched();
	}
	if (!scanned && !done) {
		/*
		 * We hit the last page and there is more work to be done: wrap
		 * back to the start of the file
		 */
		scanned = 1;
		index = 0;
		goto retry;
	}
	flush_write_bio(&epd);
	return ret;
}

/**
 * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
 * @mapping: address space structure to write
 * @wbc: subtract the number of written pages from *@wbc->nr_to_write
 * @writepage: function called for each page
 * @data: data passed to writepage function
 *
 * If a page is already under I/O, write_cache_pages() skips it, even
 * if it's dirty.  This is desirable behaviour for memory-cleaning writeback,
 * but it is INCORRECT for data-integrity system calls such as fsync().  fsync()
 * and msync() need to guarantee that all the data which was dirty at the time
 * the call was made get new I/O started against them.  If wbc->sync_mode is
 * WB_SYNC_ALL then we were called for data integrity and we must wait for
 * existing IO to complete.
 */
static int extent_write_cache_pages(struct extent_io_tree *tree,
				    struct address_space *mapping,
				    struct writeback_control *wbc,
				    writepage_t writepage, void *data,
				    void (*flush_fn)(void *))
{
	struct inode *inode = mapping->host;
	int ret = 0;
	int done = 0;
	int nr_to_write_done = 0;
	struct pagevec pvec;
	int nr_pages;
	pgoff_t index;
	pgoff_t end;		/* Inclusive */
	int scanned = 0;
	int tag;

	/*
	 * We have to hold onto the inode so that ordered extents can do their
	 * work when the IO finishes. The alternative to this is failing to add
	 * an ordered extent if the igrab() fails there and that is a huge pain
	 * to deal with, so instead just hold onto the inode throughout the
	 * writepages operation. If it fails here we are freeing up the inode
	 * anyway and we'd rather not waste our time writing out stuff that is
	 * going to be truncated anyway.
	 */
	if (!igrab(inode))
		return 0;

	pagevec_init(&pvec, 0);
	if (wbc->range_cyclic) {
		index = mapping->writeback_index; /* Start from prev offset */
		end = -1;
	} else {
		index = wbc->range_start >> PAGE_CACHE_SHIFT;
		end = wbc->range_end >> PAGE_CACHE_SHIFT;
		scanned = 1;
	}
	if (wbc->sync_mode == WB_SYNC_ALL)
		tag = PAGECACHE_TAG_TOWRITE;
	else
		tag = PAGECACHE_TAG_DIRTY;
retry:
	if (wbc->sync_mode == WB_SYNC_ALL)
		tag_pages_for_writeback(mapping, index, end);
	while (!done && !nr_to_write_done && (index <= end) &&
	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
		unsigned i;

		scanned = 1;
		for (i = 0; i < nr_pages; i++) {
			struct page *page = pvec.pages[i];

			/*
			 * At this point we hold neither mapping->tree_lock nor
			 * lock on the page itself: the page may be truncated or
			 * invalidated (changing page->mapping to NULL), or even
			 * swizzled back from swapper_space to tmpfs file
			 * mapping
			 */
			if (!trylock_page(page)) {
				flush_fn(data);
				lock_page(page);
			}

			if (unlikely(page->mapping != mapping)) {
				unlock_page(page);
				continue;
			}

			if (!wbc->range_cyclic && page->index > end) {
				done = 1;
				unlock_page(page);
				continue;
			}

			if (wbc->sync_mode != WB_SYNC_NONE) {
				if (PageWriteback(page))
					flush_fn(data);
				wait_on_page_writeback(page);
			}

			if (PageWriteback(page) ||
			    !clear_page_dirty_for_io(page)) {
				unlock_page(page);
				continue;
			}

			ret = (*writepage)(page, wbc, data);

			if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
				unlock_page(page);
				ret = 0;
			}
			if (ret)
				done = 1;

			/*
			 * the filesystem may choose to bump up nr_to_write.
			 * We have to make sure to honor the new nr_to_write
			 * at any time
			 */
			nr_to_write_done = wbc->nr_to_write <= 0;
		}
		pagevec_release(&pvec);
		cond_resched();
	}
	if (!scanned && !done) {
		/*
		 * We hit the last page and there is more work to be done: wrap
		 * back to the start of the file
		 */
		scanned = 1;
		index = 0;
		goto retry;
	}
	btrfs_add_delayed_iput(inode);
	return ret;
}

static void flush_epd_write_bio(struct extent_page_data *epd)
{
	if (epd->bio) {
		int rw = WRITE;
		int ret;

		if (epd->sync_io)
			rw = WRITE_SYNC;

		ret = submit_one_bio(rw, epd->bio, 0, epd->bio_flags);
		BUG_ON(ret < 0); /* -ENOMEM */
		epd->bio = NULL;
	}
}

static noinline void flush_write_bio(void *data)
{
	struct extent_page_data *epd = data;
	flush_epd_write_bio(epd);
}

int extent_write_full_page(struct extent_io_tree *tree, struct page *page,
			   get_extent_t *get_extent,
			   struct writeback_control *wbc)
{
	int ret;
	struct extent_page_data epd = {
		.bio = NULL,
		.tree = tree,
		.get_extent = get_extent,
		.extent_locked = 0,
		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
		.bio_flags = 0,
	};

	ret = __extent_writepage(page, wbc, &epd);

	flush_epd_write_bio(&epd);
	return ret;
}

int extent_write_locked_range(struct extent_io_tree *tree, struct inode *inode,
			      u64 start, u64 end, get_extent_t *get_extent,
			      int mode)
{
	int ret = 0;
	struct address_space *mapping = inode->i_mapping;
	struct page *page;
	unsigned long nr_pages = (end - start + PAGE_CACHE_SIZE) >>
		PAGE_CACHE_SHIFT;

	struct extent_page_data epd = {
		.bio = NULL,
		.tree = tree,
		.get_extent = get_extent,
		.extent_locked = 1,
		.sync_io = mode == WB_SYNC_ALL,
		.bio_flags = 0,
	};
	struct writeback_control wbc_writepages = {
		.sync_mode	= mode,
		.nr_to_write	= nr_pages * 2,
		.range_start	= start,
		.range_end	= end + 1,
	};

	while (start <= end) {
		page = find_get_page(mapping, start >> PAGE_CACHE_SHIFT);
		if (clear_page_dirty_for_io(page))
			ret = __extent_writepage(page, &wbc_writepages, &epd);
		else {
			if (tree->ops && tree->ops->writepage_end_io_hook)
				tree->ops->writepage_end_io_hook(page, start,
						 start + PAGE_CACHE_SIZE - 1,
						 NULL, 1);
			unlock_page(page);
		}
		page_cache_release(page);
		start += PAGE_CACHE_SIZE;
	}

	flush_epd_write_bio(&epd);
	return ret;
}

int extent_writepages(struct extent_io_tree *tree,
		      struct address_space *mapping,
		      get_extent_t *get_extent,
		      struct writeback_control *wbc)
{
	int ret = 0;
	struct extent_page_data epd = {
		.bio = NULL,
		.tree = tree,
		.get_extent = get_extent,
		.extent_locked = 0,
		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
		.bio_flags = 0,
	};

	ret = extent_write_cache_pages(tree, mapping, wbc,
				       __extent_writepage, &epd,
				       flush_write_bio);
	flush_epd_write_bio(&epd);
	return ret;
}

int extent_readpages(struct extent_io_tree *tree,
		     struct address_space *mapping,
		     struct list_head *pages, unsigned nr_pages,
		     get_extent_t get_extent)
{
	struct bio *bio = NULL;
	unsigned page_idx;
	unsigned long bio_flags = 0;
	struct page *pagepool[16];
	struct page *page;
	struct extent_map *em_cached = NULL;
	int nr = 0;

	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
		page = list_entry(pages->prev, struct page, lru);

		prefetchw(&page->flags);
		list_del(&page->lru);
		if (add_to_page_cache_lru(page, mapping,
					page->index, GFP_NOFS)) {
			page_cache_release(page);
			continue;
		}

		pagepool[nr++] = page;
		if (nr < ARRAY_SIZE(pagepool))
			continue;
		__extent_readpages(tree, pagepool, nr, get_extent, &em_cached,
				   &bio, 0, &bio_flags, READ);
		nr = 0;
	}
	if (nr)
		__extent_readpages(tree, pagepool, nr, get_extent, &em_cached,
				   &bio, 0, &bio_flags, READ);

	if (em_cached)
		free_extent_map(em_cached);

	BUG_ON(!list_empty(pages));
	if (bio)
		return submit_one_bio(READ, bio, 0, bio_flags);
	return 0;
}

/*
 * basic invalidatepage code, this waits on any locked or writeback
 * ranges corresponding to the page, and then deletes any extent state
 * records from the tree
 */
int extent_invalidatepage(struct extent_io_tree *tree,
			  struct page *page, unsigned long offset)
{
	struct extent_state *cached_state = NULL;
	u64 start = page_offset(page);
	u64 end = start + PAGE_CACHE_SIZE - 1;
	size_t blocksize = page->mapping->host->i_sb->s_blocksize;

	start += ALIGN(offset, blocksize);
	if (start > end)
		return 0;

	lock_extent_bits(tree, start, end, 0, &cached_state);
	wait_on_page_writeback(page);
	clear_extent_bit(tree, start, end,
			 EXTENT_LOCKED | EXTENT_DIRTY | EXTENT_DELALLOC |
			 EXTENT_DO_ACCOUNTING,
			 1, 1, &cached_state, GFP_NOFS);
	return 0;
}

/*
 * a helper for releasepage, this tests for areas of the page that
 * are locked or under IO and drops the related state bits if it is safe
 * to drop the page.
 */
static int try_release_extent_state(struct extent_map_tree *map,
				    struct extent_io_tree *tree,
				    struct page *page, gfp_t mask)
{
	u64 start = page_offset(page);
	u64 end = start + PAGE_CACHE_SIZE - 1;
	int ret = 1;

	if (test_range_bit(tree, start, end,
			   EXTENT_IOBITS, 0, NULL))
		ret = 0;
	else {
		if ((mask & GFP_NOFS) == GFP_NOFS)
			mask = GFP_NOFS;
		/*
		 * at this point we can safely clear everything except the
		 * locked bit and the nodatasum bit
		 */
		ret = clear_extent_bit(tree, start, end,
				 ~(EXTENT_LOCKED | EXTENT_NODATASUM),
				 0, 0, NULL, mask);

		/* if clear_extent_bit failed for enomem reasons,
		 * we can't allow the release to continue.
		 */
		if (ret < 0)
			ret = 0;
		else
			ret = 1;
3977 } 3977 }
3978 return ret; 3978 return ret;
3979 } 3979 }
3980 3980
3981 /* 3981 /*
3982 * a helper for releasepage. As long as there are no locked extents 3982 * a helper for releasepage. As long as there are no locked extents
3983 * in the range corresponding to the page, both state records and extent 3983 * in the range corresponding to the page, both state records and extent
3984 * map records are removed 3984 * map records are removed
3985 */ 3985 */
3986 int try_release_extent_mapping(struct extent_map_tree *map, 3986 int try_release_extent_mapping(struct extent_map_tree *map,
3987 struct extent_io_tree *tree, struct page *page, 3987 struct extent_io_tree *tree, struct page *page,
3988 gfp_t mask) 3988 gfp_t mask)
3989 { 3989 {
3990 struct extent_map *em; 3990 struct extent_map *em;
3991 u64 start = page_offset(page); 3991 u64 start = page_offset(page);
3992 u64 end = start + PAGE_CACHE_SIZE - 1; 3992 u64 end = start + PAGE_CACHE_SIZE - 1;
3993 3993
3994 if ((mask & __GFP_WAIT) && 3994 if ((mask & __GFP_WAIT) &&
3995 page->mapping->host->i_size > 16 * 1024 * 1024) { 3995 page->mapping->host->i_size > 16 * 1024 * 1024) {
3996 u64 len; 3996 u64 len;
3997 while (start <= end) { 3997 while (start <= end) {
3998 len = end - start + 1; 3998 len = end - start + 1;
3999 write_lock(&map->lock); 3999 write_lock(&map->lock);
4000 em = lookup_extent_mapping(map, start, len); 4000 em = lookup_extent_mapping(map, start, len);
4001 if (!em) { 4001 if (!em) {
4002 write_unlock(&map->lock); 4002 write_unlock(&map->lock);
4003 break; 4003 break;
4004 } 4004 }
4005 if (test_bit(EXTENT_FLAG_PINNED, &em->flags) || 4005 if (test_bit(EXTENT_FLAG_PINNED, &em->flags) ||
4006 em->start != start) { 4006 em->start != start) {
4007 write_unlock(&map->lock); 4007 write_unlock(&map->lock);
4008 free_extent_map(em); 4008 free_extent_map(em);
4009 break; 4009 break;
4010 } 4010 }
4011 if (!test_range_bit(tree, em->start, 4011 if (!test_range_bit(tree, em->start,
4012 extent_map_end(em) - 1, 4012 extent_map_end(em) - 1,
4013 EXTENT_LOCKED | EXTENT_WRITEBACK, 4013 EXTENT_LOCKED | EXTENT_WRITEBACK,
4014 0, NULL)) { 4014 0, NULL)) {
4015 remove_extent_mapping(map, em); 4015 remove_extent_mapping(map, em);
4016 /* once for the rb tree */ 4016 /* once for the rb tree */
4017 free_extent_map(em); 4017 free_extent_map(em);
4018 } 4018 }
4019 start = extent_map_end(em); 4019 start = extent_map_end(em);
4020 write_unlock(&map->lock); 4020 write_unlock(&map->lock);
4021 4021
4022 /* once for us */ 4022 /* once for us */
4023 free_extent_map(em); 4023 free_extent_map(em);
4024 } 4024 }
4025 } 4025 }
4026 return try_release_extent_state(map, tree, page, mask); 4026 return try_release_extent_state(map, tree, page, mask);
4027 } 4027 }
4028 4028
4029 /* 4029 /*
4030 * helper function for fiemap, which doesn't want to see any holes. 4030 * helper function for fiemap, which doesn't want to see any holes.
4031 * This maps until we find something past 'last' 4031 * This maps until we find something past 'last'
4032 */ 4032 */
4033 static struct extent_map *get_extent_skip_holes(struct inode *inode, 4033 static struct extent_map *get_extent_skip_holes(struct inode *inode,
4034 u64 offset, 4034 u64 offset,
4035 u64 last, 4035 u64 last,
4036 get_extent_t *get_extent) 4036 get_extent_t *get_extent)
4037 { 4037 {
4038 u64 sectorsize = BTRFS_I(inode)->root->sectorsize; 4038 u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
4039 struct extent_map *em; 4039 struct extent_map *em;
4040 u64 len; 4040 u64 len;
4041 4041
4042 if (offset >= last) 4042 if (offset >= last)
4043 return NULL; 4043 return NULL;
4044 4044
4045 while(1) { 4045 while(1) {
4046 len = last - offset; 4046 len = last - offset;
4047 if (len == 0) 4047 if (len == 0)
4048 break; 4048 break;
4049 len = ALIGN(len, sectorsize); 4049 len = ALIGN(len, sectorsize);
4050 em = get_extent(inode, NULL, 0, offset, len, 0); 4050 em = get_extent(inode, NULL, 0, offset, len, 0);
4051 if (IS_ERR_OR_NULL(em)) 4051 if (IS_ERR_OR_NULL(em))
4052 return em; 4052 return em;
4053 4053
4054 /* if this isn't a hole return it */ 4054 /* if this isn't a hole return it */
4055 if (!test_bit(EXTENT_FLAG_VACANCY, &em->flags) && 4055 if (!test_bit(EXTENT_FLAG_VACANCY, &em->flags) &&
4056 em->block_start != EXTENT_MAP_HOLE) { 4056 em->block_start != EXTENT_MAP_HOLE) {
4057 return em; 4057 return em;
4058 } 4058 }
4059 4059
4060 /* this is a hole, advance to the next extent */ 4060 /* this is a hole, advance to the next extent */
4061 offset = extent_map_end(em); 4061 offset = extent_map_end(em);
4062 free_extent_map(em); 4062 free_extent_map(em);
4063 if (offset >= last) 4063 if (offset >= last)
4064 break; 4064 break;
4065 } 4065 }
4066 return NULL; 4066 return NULL;
4067 } 4067 }
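get_extent_skip_holes() keeps re-querying the mapping and stepping over hole extents until it finds real data or walks past 'last'. A small self-contained sketch of the same skip-the-holes walk, using a toy lookup table in place of the get_extent callback:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct ext { size_t start, end; bool hole; };

/* toy mapping: [0,10) hole, [10,20) data, [20,30) hole */
static const struct ext *lookup(size_t offset)
{
	static const struct ext map[] = {
		{  0, 10, true  },
		{ 10, 20, false },
		{ 20, 30, true  },
	};
	size_t i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
		if (offset >= map[i].start && offset < map[i].end)
			return &map[i];
	return NULL;
}

static const struct ext *skip_holes(size_t offset, size_t last)
{
	while (offset < last) {
		const struct ext *e = lookup(offset);

		if (!e)
			return NULL;
		if (!e->hole)			/* first non-hole mapping: done */
			return e;
		offset = e->end;		/* advance past the hole */
	}
	return NULL;
}

int main(void)
{
	const struct ext *e = skip_holes(0, 30);

	if (e)
		printf("first data extent: [%zu, %zu)\n", e->start, e->end);
	return 0;
}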
4068 4068
4069 int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, 4069 int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
4070 __u64 start, __u64 len, get_extent_t *get_extent) 4070 __u64 start, __u64 len, get_extent_t *get_extent)
4071 { 4071 {
4072 int ret = 0; 4072 int ret = 0;
4073 u64 off = start; 4073 u64 off = start;
4074 u64 max = start + len; 4074 u64 max = start + len;
4075 u32 flags = 0; 4075 u32 flags = 0;
4076 u32 found_type; 4076 u32 found_type;
4077 u64 last; 4077 u64 last;
4078 u64 last_for_get_extent = 0; 4078 u64 last_for_get_extent = 0;
4079 u64 disko = 0; 4079 u64 disko = 0;
4080 u64 isize = i_size_read(inode); 4080 u64 isize = i_size_read(inode);
4081 struct btrfs_key found_key; 4081 struct btrfs_key found_key;
4082 struct extent_map *em = NULL; 4082 struct extent_map *em = NULL;
4083 struct extent_state *cached_state = NULL; 4083 struct extent_state *cached_state = NULL;
4084 struct btrfs_path *path; 4084 struct btrfs_path *path;
4085 struct btrfs_file_extent_item *item; 4085 struct btrfs_file_extent_item *item;
4086 int end = 0; 4086 int end = 0;
4087 u64 em_start = 0; 4087 u64 em_start = 0;
4088 u64 em_len = 0; 4088 u64 em_len = 0;
4089 u64 em_end = 0; 4089 u64 em_end = 0;
4090 unsigned long emflags; 4090 unsigned long emflags;
4091 4091
4092 if (len == 0) 4092 if (len == 0)
4093 return -EINVAL; 4093 return -EINVAL;
4094 4094
4095 path = btrfs_alloc_path(); 4095 path = btrfs_alloc_path();
4096 if (!path) 4096 if (!path)
4097 return -ENOMEM; 4097 return -ENOMEM;
4098 path->leave_spinning = 1; 4098 path->leave_spinning = 1;
4099 4099
4100 start = ALIGN(start, BTRFS_I(inode)->root->sectorsize); 4100 start = ALIGN(start, BTRFS_I(inode)->root->sectorsize);
4101 len = ALIGN(len, BTRFS_I(inode)->root->sectorsize); 4101 len = ALIGN(len, BTRFS_I(inode)->root->sectorsize);
4102 4102
4103 /* 4103 /*
4104 * lookup the last file extent. We're not using i_size here 4104 * lookup the last file extent. We're not using i_size here
4105 * because there might be preallocation past i_size 4105 * because there might be preallocation past i_size
4106 */ 4106 */
4107 ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, 4107 ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root,
4108 path, btrfs_ino(inode), -1, 0); 4108 path, btrfs_ino(inode), -1, 0);
4109 if (ret < 0) { 4109 if (ret < 0) {
4110 btrfs_free_path(path); 4110 btrfs_free_path(path);
4111 return ret; 4111 return ret;
4112 } 4112 }
4113 WARN_ON(!ret); 4113 WARN_ON(!ret);
4114 path->slots[0]--; 4114 path->slots[0]--;
4115 item = btrfs_item_ptr(path->nodes[0], path->slots[0], 4115 item = btrfs_item_ptr(path->nodes[0], path->slots[0],
4116 struct btrfs_file_extent_item); 4116 struct btrfs_file_extent_item);
4117 btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]); 4117 btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
4118 found_type = btrfs_key_type(&found_key); 4118 found_type = btrfs_key_type(&found_key);
4119 4119
4120 /* No extents, but there might be delalloc bits */ 4120 /* No extents, but there might be delalloc bits */
4121 if (found_key.objectid != btrfs_ino(inode) || 4121 if (found_key.objectid != btrfs_ino(inode) ||
4122 found_type != BTRFS_EXTENT_DATA_KEY) { 4122 found_type != BTRFS_EXTENT_DATA_KEY) {
4123 /* have to trust i_size as the end */ 4123 /* have to trust i_size as the end */
4124 last = (u64)-1; 4124 last = (u64)-1;
4125 last_for_get_extent = isize; 4125 last_for_get_extent = isize;
4126 } else { 4126 } else {
4127 /* 4127 /*
4128 * remember the start of the last extent. There are a 4128 * remember the start of the last extent. There are a
4129 * bunch of different factors that go into the length of the 4129 * bunch of different factors that go into the length of the
4130 * extent, so its much less complex to remember where it started 4130 * extent, so its much less complex to remember where it started
4131 */ 4131 */
4132 last = found_key.offset; 4132 last = found_key.offset;
4133 last_for_get_extent = last + 1; 4133 last_for_get_extent = last + 1;
4134 } 4134 }
4135 btrfs_free_path(path); 4135 btrfs_free_path(path);
4136 4136
4137 /* 4137 /*
4138 * we might have some extents allocated but more delalloc past those 4138 * we might have some extents allocated but more delalloc past those
4139 * extents. so, we trust isize unless the start of the last extent is 4139 * extents. so, we trust isize unless the start of the last extent is
4140 * beyond isize 4140 * beyond isize
4141 */ 4141 */
4142 if (last < isize) { 4142 if (last < isize) {
4143 last = (u64)-1; 4143 last = (u64)-1;
4144 last_for_get_extent = isize; 4144 last_for_get_extent = isize;
4145 } 4145 }
4146 4146
4147 lock_extent_bits(&BTRFS_I(inode)->io_tree, start, start + len - 1, 0, 4147 lock_extent_bits(&BTRFS_I(inode)->io_tree, start, start + len - 1, 0,
4148 &cached_state); 4148 &cached_state);
4149 4149
4150 em = get_extent_skip_holes(inode, start, last_for_get_extent, 4150 em = get_extent_skip_holes(inode, start, last_for_get_extent,
4151 get_extent); 4151 get_extent);
4152 if (!em) 4152 if (!em)
4153 goto out; 4153 goto out;
4154 if (IS_ERR(em)) { 4154 if (IS_ERR(em)) {
4155 ret = PTR_ERR(em); 4155 ret = PTR_ERR(em);
4156 goto out; 4156 goto out;
4157 } 4157 }
4158 4158
4159 while (!end) { 4159 while (!end) {
4160 u64 offset_in_extent = 0; 4160 u64 offset_in_extent = 0;
4161 4161
4162 /* break if the extent we found is outside the range */ 4162 /* break if the extent we found is outside the range */
4163 if (em->start >= max || extent_map_end(em) < off) 4163 if (em->start >= max || extent_map_end(em) < off)
4164 break; 4164 break;
4165 4165
4166 /* 4166 /*
4167 * get_extent may return an extent that starts before our 4167 * get_extent may return an extent that starts before our
4168 * requested range. We have to make sure the ranges 4168 * requested range. We have to make sure the ranges
4169 * we return to fiemap always move forward and don't 4169 * we return to fiemap always move forward and don't
4170 * overlap, so adjust the offsets here 4170 * overlap, so adjust the offsets here
4171 */ 4171 */
4172 em_start = max(em->start, off); 4172 em_start = max(em->start, off);
4173 4173
4174 /* 4174 /*
4175 * record the offset from the start of the extent 4175 * record the offset from the start of the extent
4176 * for adjusting the disk offset below. Only do this if the 4176 * for adjusting the disk offset below. Only do this if the
4177 * extent isn't compressed since our in ram offset may be past 4177 * extent isn't compressed since our in ram offset may be past
4178 * what we have actually allocated on disk. 4178 * what we have actually allocated on disk.
4179 */ 4179 */
4180 if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) 4180 if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
4181 offset_in_extent = em_start - em->start; 4181 offset_in_extent = em_start - em->start;
4182 em_end = extent_map_end(em); 4182 em_end = extent_map_end(em);
4183 em_len = em_end - em_start; 4183 em_len = em_end - em_start;
4184 emflags = em->flags; 4184 emflags = em->flags;
4185 disko = 0; 4185 disko = 0;
4186 flags = 0; 4186 flags = 0;
4187 4187
4188 /* 4188 /*
4189 * bump off for our next call to get_extent 4189 * bump off for our next call to get_extent
4190 */ 4190 */
4191 off = extent_map_end(em); 4191 off = extent_map_end(em);
4192 if (off >= max) 4192 if (off >= max)
4193 end = 1; 4193 end = 1;
4194 4194
4195 if (em->block_start == EXTENT_MAP_LAST_BYTE) { 4195 if (em->block_start == EXTENT_MAP_LAST_BYTE) {
4196 end = 1; 4196 end = 1;
4197 flags |= FIEMAP_EXTENT_LAST; 4197 flags |= FIEMAP_EXTENT_LAST;
4198 } else if (em->block_start == EXTENT_MAP_INLINE) { 4198 } else if (em->block_start == EXTENT_MAP_INLINE) {
4199 flags |= (FIEMAP_EXTENT_DATA_INLINE | 4199 flags |= (FIEMAP_EXTENT_DATA_INLINE |
4200 FIEMAP_EXTENT_NOT_ALIGNED); 4200 FIEMAP_EXTENT_NOT_ALIGNED);
4201 } else if (em->block_start == EXTENT_MAP_DELALLOC) { 4201 } else if (em->block_start == EXTENT_MAP_DELALLOC) {
4202 flags |= (FIEMAP_EXTENT_DELALLOC | 4202 flags |= (FIEMAP_EXTENT_DELALLOC |
4203 FIEMAP_EXTENT_UNKNOWN); 4203 FIEMAP_EXTENT_UNKNOWN);
4204 } else { 4204 } else {
4205 disko = em->block_start + offset_in_extent; 4205 disko = em->block_start + offset_in_extent;
4206 } 4206 }
4207 if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) 4207 if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
4208 flags |= FIEMAP_EXTENT_ENCODED; 4208 flags |= FIEMAP_EXTENT_ENCODED;
4209 4209
4210 free_extent_map(em); 4210 free_extent_map(em);
4211 em = NULL; 4211 em = NULL;
4212 if ((em_start >= last) || em_len == (u64)-1 || 4212 if ((em_start >= last) || em_len == (u64)-1 ||
4213 (last == (u64)-1 && isize <= em_end)) { 4213 (last == (u64)-1 && isize <= em_end)) {
4214 flags |= FIEMAP_EXTENT_LAST; 4214 flags |= FIEMAP_EXTENT_LAST;
4215 end = 1; 4215 end = 1;
4216 } 4216 }
4217 4217
4218 /* now scan forward to see if this is really the last extent. */ 4218 /* now scan forward to see if this is really the last extent. */
4219 em = get_extent_skip_holes(inode, off, last_for_get_extent, 4219 em = get_extent_skip_holes(inode, off, last_for_get_extent,
4220 get_extent); 4220 get_extent);
4221 if (IS_ERR(em)) { 4221 if (IS_ERR(em)) {
4222 ret = PTR_ERR(em); 4222 ret = PTR_ERR(em);
4223 goto out; 4223 goto out;
4224 } 4224 }
4225 if (!em) { 4225 if (!em) {
4226 flags |= FIEMAP_EXTENT_LAST; 4226 flags |= FIEMAP_EXTENT_LAST;
4227 end = 1; 4227 end = 1;
4228 } 4228 }
4229 ret = fiemap_fill_next_extent(fieinfo, em_start, disko, 4229 ret = fiemap_fill_next_extent(fieinfo, em_start, disko,
4230 em_len, flags); 4230 em_len, flags);
4231 if (ret) 4231 if (ret)
4232 goto out_free; 4232 goto out_free;
4233 } 4233 }
4234 out_free: 4234 out_free:
4235 free_extent_map(em); 4235 free_extent_map(em);
4236 out: 4236 out:
4237 unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, start + len - 1, 4237 unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, start + len - 1,
4238 &cached_state, GFP_NOFS); 4238 &cached_state, GFP_NOFS);
4239 return ret; 4239 return ret;
4240 } 4240 }
4241 4241
4242 static void __free_extent_buffer(struct extent_buffer *eb) 4242 static void __free_extent_buffer(struct extent_buffer *eb)
4243 { 4243 {
4244 btrfs_leak_debug_del(&eb->leak_list); 4244 btrfs_leak_debug_del(&eb->leak_list);
4245 kmem_cache_free(extent_buffer_cache, eb); 4245 kmem_cache_free(extent_buffer_cache, eb);
4246 } 4246 }
4247 4247
4248 static int extent_buffer_under_io(struct extent_buffer *eb) 4248 static int extent_buffer_under_io(struct extent_buffer *eb)
4249 { 4249 {
4250 return (atomic_read(&eb->io_pages) || 4250 return (atomic_read(&eb->io_pages) ||
4251 test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) || 4251 test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
4252 test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)); 4252 test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
4253 } 4253 }
4254 4254
4255 /* 4255 /*
4256 * Helper for releasing extent buffer page. 4256 * Helper for releasing extent buffer page.
4257 */ 4257 */
4258 static void btrfs_release_extent_buffer_page(struct extent_buffer *eb, 4258 static void btrfs_release_extent_buffer_page(struct extent_buffer *eb,
4259 unsigned long start_idx) 4259 unsigned long start_idx)
4260 { 4260 {
4261 unsigned long index; 4261 unsigned long index;
4262 unsigned long num_pages; 4262 unsigned long num_pages;
4263 struct page *page; 4263 struct page *page;
4264 int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags); 4264 int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
4265 4265
4266 BUG_ON(extent_buffer_under_io(eb)); 4266 BUG_ON(extent_buffer_under_io(eb));
4267 4267
4268 num_pages = num_extent_pages(eb->start, eb->len); 4268 num_pages = num_extent_pages(eb->start, eb->len);
4269 index = start_idx + num_pages; 4269 index = start_idx + num_pages;
4270 if (start_idx >= index) 4270 if (start_idx >= index)
4271 return; 4271 return;
4272 4272
4273 do { 4273 do {
4274 index--; 4274 index--;
4275 page = extent_buffer_page(eb, index); 4275 page = extent_buffer_page(eb, index);
4276 if (page && mapped) { 4276 if (page && mapped) {
4277 spin_lock(&page->mapping->private_lock); 4277 spin_lock(&page->mapping->private_lock);
4278 /* 4278 /*
4279 * We do this since we'll remove the pages after we've 4279 * We do this since we'll remove the pages after we've
4280 * removed the eb from the radix tree, so we could race 4280 * removed the eb from the radix tree, so we could race
4281 * and have this page now attached to the new eb. So 4281 * and have this page now attached to the new eb. So
4282 * only clear page_private if it's still connected to 4282 * only clear page_private if it's still connected to
4283 * this eb. 4283 * this eb.
4284 */ 4284 */
4285 if (PagePrivate(page) && 4285 if (PagePrivate(page) &&
4286 page->private == (unsigned long)eb) { 4286 page->private == (unsigned long)eb) {
4287 BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)); 4287 BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
4288 BUG_ON(PageDirty(page)); 4288 BUG_ON(PageDirty(page));
4289 BUG_ON(PageWriteback(page)); 4289 BUG_ON(PageWriteback(page));
4290 /* 4290 /*
4291 * We need to make sure we haven't be attached 4291 * We need to make sure we haven't be attached
4292 * to a new eb. 4292 * to a new eb.
4293 */ 4293 */
4294 ClearPagePrivate(page); 4294 ClearPagePrivate(page);
4295 set_page_private(page, 0); 4295 set_page_private(page, 0);
4296 /* One for the page private */ 4296 /* One for the page private */
4297 page_cache_release(page); 4297 page_cache_release(page);
4298 } 4298 }
4299 spin_unlock(&page->mapping->private_lock); 4299 spin_unlock(&page->mapping->private_lock);
4300 4300
4301 } 4301 }
4302 if (page) { 4302 if (page) {
4303 /* One for when we alloced the page */ 4303 /* One for when we alloced the page */
4304 page_cache_release(page); 4304 page_cache_release(page);
4305 } 4305 }
4306 } while (index != start_idx); 4306 } while (index != start_idx);
4307 } 4307 }
4308 4308
4309 /* 4309 /*
4310 * Helper for releasing the extent buffer. 4310 * Helper for releasing the extent buffer.
4311 */ 4311 */
4312 static inline void btrfs_release_extent_buffer(struct extent_buffer *eb) 4312 static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
4313 { 4313 {
4314 btrfs_release_extent_buffer_page(eb, 0); 4314 btrfs_release_extent_buffer_page(eb, 0);
4315 __free_extent_buffer(eb); 4315 __free_extent_buffer(eb);
4316 } 4316 }
4317 4317
4318 static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree, 4318 static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
4319 u64 start, 4319 u64 start,
4320 unsigned long len, 4320 unsigned long len,
4321 gfp_t mask) 4321 gfp_t mask)
4322 { 4322 {
4323 struct extent_buffer *eb = NULL; 4323 struct extent_buffer *eb = NULL;
4324 4324
4325 eb = kmem_cache_zalloc(extent_buffer_cache, mask); 4325 eb = kmem_cache_zalloc(extent_buffer_cache, mask);
4326 if (eb == NULL) 4326 if (eb == NULL)
4327 return NULL; 4327 return NULL;
4328 eb->start = start; 4328 eb->start = start;
4329 eb->len = len; 4329 eb->len = len;
4330 eb->tree = tree; 4330 eb->tree = tree;
4331 eb->bflags = 0; 4331 eb->bflags = 0;
4332 rwlock_init(&eb->lock); 4332 rwlock_init(&eb->lock);
4333 atomic_set(&eb->write_locks, 0); 4333 atomic_set(&eb->write_locks, 0);
4334 atomic_set(&eb->read_locks, 0); 4334 atomic_set(&eb->read_locks, 0);
4335 atomic_set(&eb->blocking_readers, 0); 4335 atomic_set(&eb->blocking_readers, 0);
4336 atomic_set(&eb->blocking_writers, 0); 4336 atomic_set(&eb->blocking_writers, 0);
4337 atomic_set(&eb->spinning_readers, 0); 4337 atomic_set(&eb->spinning_readers, 0);
4338 atomic_set(&eb->spinning_writers, 0); 4338 atomic_set(&eb->spinning_writers, 0);
4339 eb->lock_nested = 0; 4339 eb->lock_nested = 0;
4340 init_waitqueue_head(&eb->write_lock_wq); 4340 init_waitqueue_head(&eb->write_lock_wq);
4341 init_waitqueue_head(&eb->read_lock_wq); 4341 init_waitqueue_head(&eb->read_lock_wq);
4342 4342
4343 btrfs_leak_debug_add(&eb->leak_list, &buffers); 4343 btrfs_leak_debug_add(&eb->leak_list, &buffers);
4344 4344
4345 spin_lock_init(&eb->refs_lock); 4345 spin_lock_init(&eb->refs_lock);
4346 atomic_set(&eb->refs, 1); 4346 atomic_set(&eb->refs, 1);
4347 atomic_set(&eb->io_pages, 0); 4347 atomic_set(&eb->io_pages, 0);
4348 4348
4349 /* 4349 /*
4350 * Sanity checks, currently the maximum is 64k covered by 16x 4k pages 4350 * Sanity checks, currently the maximum is 64k covered by 16x 4k pages
4351 */ 4351 */
4352 BUILD_BUG_ON(BTRFS_MAX_METADATA_BLOCKSIZE 4352 BUILD_BUG_ON(BTRFS_MAX_METADATA_BLOCKSIZE
4353 > MAX_INLINE_EXTENT_BUFFER_SIZE); 4353 > MAX_INLINE_EXTENT_BUFFER_SIZE);
4354 BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE); 4354 BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE);
4355 4355
4356 return eb; 4356 return eb;
4357 } 4357 }
4358 4358
4359 struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src) 4359 struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
4360 { 4360 {
4361 unsigned long i; 4361 unsigned long i;
4362 struct page *p; 4362 struct page *p;
4363 struct extent_buffer *new; 4363 struct extent_buffer *new;
4364 unsigned long num_pages = num_extent_pages(src->start, src->len); 4364 unsigned long num_pages = num_extent_pages(src->start, src->len);
4365 4365
4366 new = __alloc_extent_buffer(NULL, src->start, src->len, GFP_NOFS); 4366 new = __alloc_extent_buffer(NULL, src->start, src->len, GFP_NOFS);
4367 if (new == NULL) 4367 if (new == NULL)
4368 return NULL; 4368 return NULL;
4369 4369
4370 for (i = 0; i < num_pages; i++) { 4370 for (i = 0; i < num_pages; i++) {
4371 p = alloc_page(GFP_NOFS); 4371 p = alloc_page(GFP_NOFS);
4372 if (!p) { 4372 if (!p) {
4373 btrfs_release_extent_buffer(new); 4373 btrfs_release_extent_buffer(new);
4374 return NULL; 4374 return NULL;
4375 } 4375 }
4376 attach_extent_buffer_page(new, p); 4376 attach_extent_buffer_page(new, p);
4377 WARN_ON(PageDirty(p)); 4377 WARN_ON(PageDirty(p));
4378 SetPageUptodate(p); 4378 SetPageUptodate(p);
4379 new->pages[i] = p; 4379 new->pages[i] = p;
4380 } 4380 }
4381 4381
4382 copy_extent_buffer(new, src, 0, 0, src->len); 4382 copy_extent_buffer(new, src, 0, 0, src->len);
4383 set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags); 4383 set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
4384 set_bit(EXTENT_BUFFER_DUMMY, &new->bflags); 4384 set_bit(EXTENT_BUFFER_DUMMY, &new->bflags);
4385 4385
4386 return new; 4386 return new;
4387 } 4387 }
4388 4388
4389 struct extent_buffer *alloc_dummy_extent_buffer(u64 start, unsigned long len) 4389 struct extent_buffer *alloc_dummy_extent_buffer(u64 start, unsigned long len)
4390 { 4390 {
4391 struct extent_buffer *eb; 4391 struct extent_buffer *eb;
4392 unsigned long num_pages = num_extent_pages(0, len); 4392 unsigned long num_pages = num_extent_pages(0, len);
4393 unsigned long i; 4393 unsigned long i;
4394 4394
4395 eb = __alloc_extent_buffer(NULL, start, len, GFP_NOFS); 4395 eb = __alloc_extent_buffer(NULL, start, len, GFP_NOFS);
4396 if (!eb) 4396 if (!eb)
4397 return NULL; 4397 return NULL;
4398 4398
4399 for (i = 0; i < num_pages; i++) { 4399 for (i = 0; i < num_pages; i++) {
4400 eb->pages[i] = alloc_page(GFP_NOFS); 4400 eb->pages[i] = alloc_page(GFP_NOFS);
4401 if (!eb->pages[i]) 4401 if (!eb->pages[i])
4402 goto err; 4402 goto err;
4403 } 4403 }
4404 set_extent_buffer_uptodate(eb); 4404 set_extent_buffer_uptodate(eb);
4405 btrfs_set_header_nritems(eb, 0); 4405 btrfs_set_header_nritems(eb, 0);
4406 set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags); 4406 set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
4407 4407
4408 return eb; 4408 return eb;
4409 err: 4409 err:
4410 for (; i > 0; i--) 4410 for (; i > 0; i--)
4411 __free_page(eb->pages[i - 1]); 4411 __free_page(eb->pages[i - 1]);
4412 __free_extent_buffer(eb); 4412 __free_extent_buffer(eb);
4413 return NULL; 4413 return NULL;
4414 } 4414 }
4415 4415
4416 static void check_buffer_tree_ref(struct extent_buffer *eb) 4416 static void check_buffer_tree_ref(struct extent_buffer *eb)
4417 { 4417 {
4418 int refs; 4418 int refs;
4419 /* the ref bit is tricky. We have to make sure it is set 4419 /* the ref bit is tricky. We have to make sure it is set
4420 * if we have the buffer dirty. Otherwise the 4420 * if we have the buffer dirty. Otherwise the
4421 * code to free a buffer can end up dropping a dirty 4421 * code to free a buffer can end up dropping a dirty
4422 * page 4422 * page
4423 * 4423 *
4424 * Once the ref bit is set, it won't go away while the 4424 * Once the ref bit is set, it won't go away while the
4425 * buffer is dirty or in writeback, and it also won't 4425 * buffer is dirty or in writeback, and it also won't
4426 * go away while we have the reference count on the 4426 * go away while we have the reference count on the
4427 * eb bumped. 4427 * eb bumped.
4428 * 4428 *
4429 * We can't just set the ref bit without bumping the 4429 * We can't just set the ref bit without bumping the
4430 * ref on the eb because free_extent_buffer might 4430 * ref on the eb because free_extent_buffer might
4431 * see the ref bit and try to clear it. If this happens 4431 * see the ref bit and try to clear it. If this happens
4432 * free_extent_buffer might end up dropping our original 4432 * free_extent_buffer might end up dropping our original
4433 * ref by mistake and freeing the page before we are able 4433 * ref by mistake and freeing the page before we are able
4434 * to add one more ref. 4434 * to add one more ref.
4435 * 4435 *
4436 * So bump the ref count first, then set the bit. If someone 4436 * So bump the ref count first, then set the bit. If someone
4437 * beat us to it, drop the ref we added. 4437 * beat us to it, drop the ref we added.
4438 */ 4438 */
4439 refs = atomic_read(&eb->refs); 4439 refs = atomic_read(&eb->refs);
4440 if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) 4440 if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
4441 return; 4441 return;
4442 4442
4443 spin_lock(&eb->refs_lock); 4443 spin_lock(&eb->refs_lock);
4444 if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) 4444 if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
4445 atomic_inc(&eb->refs); 4445 atomic_inc(&eb->refs);
4446 spin_unlock(&eb->refs_lock); 4446 spin_unlock(&eb->refs_lock);
4447 } 4447 }
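The comment above describes a subtle ordering requirement: the extra tree reference must be added at most once, and only by the thread that wins the TREE_REF test-and-set, so a concurrent free_extent_buffer() cannot drop the caller's reference by mistake. A rough userspace model of that guarded-extra-reference idea, using C11 atomics and a mutex in place of the kernel primitives:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct obj {
	atomic_int refs;
	atomic_bool tree_ref;		/* stands in for EXTENT_BUFFER_TREE_REF */
	pthread_mutex_t lock;		/* stands in for eb->refs_lock */
};

static void check_tree_ref(struct obj *o)
{
	/* fast path: the extra reference is already accounted for */
	if (atomic_load(&o->refs) >= 2 && atomic_load(&o->tree_ref))
		return;

	pthread_mutex_lock(&o->lock);
	/* only the thread that flips the flag adds the extra reference */
	if (!atomic_exchange(&o->tree_ref, true))
		atomic_fetch_add(&o->refs, 1);
	pthread_mutex_unlock(&o->lock);
}

int main(void)
{
	struct obj o = { .refs = 1, .tree_ref = false,
			 .lock = PTHREAD_MUTEX_INITIALIZER };

	check_tree_ref(&o);
	check_tree_ref(&o);		/* second call is a no-op */
	printf("refs=%d tree_ref=%d\n", atomic_load(&o.refs),
	       atomic_load(&o.tree_ref));
	return 0;
}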
4448 4448
-static void mark_extent_buffer_accessed(struct extent_buffer *eb)
+static void mark_extent_buffer_accessed(struct extent_buffer *eb,
+					struct page *accessed)
 {
 	unsigned long num_pages, i;
 
 	check_buffer_tree_ref(eb);
 
 	num_pages = num_extent_pages(eb->start, eb->len);
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
-		mark_page_accessed(p);
+		if (p != accessed)
+			mark_page_accessed(p);
 	}
 }
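The new 'accessed' argument lets a caller that has already handled mark_page_accessed() for one of the buffer's pages (such as the page p it just looked up) tell this helper to skip that page rather than mark it a second time. A toy illustration of the skip-one-page behaviour; the types and names here are invented for the example:

#include <stdbool.h>
#include <stdio.h>

struct toy_page { bool accessed; };

static void mark_accessed(struct toy_page *p)
{
	p->accessed = true;
}

static void mark_all_accessed(struct toy_page *pages, int nr,
			      struct toy_page *already)
{
	int i;

	for (i = 0; i < nr; i++)
		if (&pages[i] != already)	/* skip the page the caller handled */
			mark_accessed(&pages[i]);
}

int main(void)
{
	struct toy_page pages[4] = {{ false }};
	int i;

	mark_accessed(&pages[2]);		/* already marked by the lookup path */
	mark_all_accessed(pages, 4, &pages[2]);
	for (i = 0; i < 4; i++)
		printf("page %d accessed=%d\n", i, pages[i].accessed);
	return 0;
}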
4461 4463
4462 struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree, 4464 struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree,
4463 u64 start, unsigned long len) 4465 u64 start, unsigned long len)
4464 { 4466 {
4465 unsigned long num_pages = num_extent_pages(start, len); 4467 unsigned long num_pages = num_extent_pages(start, len);
4466 unsigned long i; 4468 unsigned long i;
4467 unsigned long index = start >> PAGE_CACHE_SHIFT; 4469 unsigned long index = start >> PAGE_CACHE_SHIFT;
4468 struct extent_buffer *eb; 4470 struct extent_buffer *eb;
4469 struct extent_buffer *exists = NULL; 4471 struct extent_buffer *exists = NULL;
4470 struct page *p; 4472 struct page *p;
4471 struct address_space *mapping = tree->mapping; 4473 struct address_space *mapping = tree->mapping;
4472 int uptodate = 1; 4474 int uptodate = 1;
4473 int ret; 4475 int ret;
4474 4476
 	rcu_read_lock();
 	eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT);
 	if (eb && atomic_inc_not_zero(&eb->refs)) {
 		rcu_read_unlock();
-		mark_extent_buffer_accessed(eb);
+		mark_extent_buffer_accessed(eb, NULL);
 		return eb;
 	}
 	rcu_read_unlock();
4483 4485
4484 eb = __alloc_extent_buffer(tree, start, len, GFP_NOFS); 4486 eb = __alloc_extent_buffer(tree, start, len, GFP_NOFS);
4485 if (!eb) 4487 if (!eb)
4486 return NULL; 4488 return NULL;
4487 4489
4488 for (i = 0; i < num_pages; i++, index++) { 4490 for (i = 0; i < num_pages; i++, index++) {
4489 p = find_or_create_page(mapping, index, GFP_NOFS); 4491 p = find_or_create_page(mapping, index, GFP_NOFS);
4490 if (!p) 4492 if (!p)
4491 goto free_eb; 4493 goto free_eb;
4492 4494
4493 spin_lock(&mapping->private_lock); 4495 spin_lock(&mapping->private_lock);
4494 if (PagePrivate(p)) { 4496 if (PagePrivate(p)) {
4495 /* 4497 /*
4496 * We could have already allocated an eb for this page 4498 * We could have already allocated an eb for this page
4497 * and attached one so lets see if we can get a ref on 4499 * and attached one so lets see if we can get a ref on
4498 * the existing eb, and if we can we know it's good and 4500 * the existing eb, and if we can we know it's good and
4499 * we can just return that one, else we know we can just 4501 * we can just return that one, else we know we can just
4500 * overwrite page->private. 4502 * overwrite page->private.
4501 */ 4503 */
4502 exists = (struct extent_buffer *)p->private; 4504 exists = (struct extent_buffer *)p->private;
 			if (atomic_inc_not_zero(&exists->refs)) {
 				spin_unlock(&mapping->private_lock);
 				unlock_page(p);
 				page_cache_release(p);
-				mark_extent_buffer_accessed(exists);
+				mark_extent_buffer_accessed(exists, p);
 				goto free_eb;
 			}
4510 4512
4511 /* 4513 /*
4512 * Do this so attach doesn't complain and we need to 4514 * Do this so attach doesn't complain and we need to
4513 * drop the ref the old guy had. 4515 * drop the ref the old guy had.
4514 */ 4516 */
4515 ClearPagePrivate(p); 4517 ClearPagePrivate(p);
4516 WARN_ON(PageDirty(p)); 4518 WARN_ON(PageDirty(p));
4517 page_cache_release(p); 4519 page_cache_release(p);
4518 } 4520 }
 		attach_extent_buffer_page(eb, p);
 		spin_unlock(&mapping->private_lock);
 		WARN_ON(PageDirty(p));
-		mark_page_accessed(p);
 		eb->pages[i] = p;
 		if (!PageUptodate(p))
 			uptodate = 0;
4526 4527
4527 /* 4528 /*
4528 * see below about how we avoid a nasty race with release page 4529 * see below about how we avoid a nasty race with release page
4529 * and why we unlock later 4530 * and why we unlock later
4530 */ 4531 */
4531 } 4532 }
4532 if (uptodate) 4533 if (uptodate)
4533 set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); 4534 set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
4534 again: 4535 again:
4535 ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); 4536 ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
4536 if (ret) 4537 if (ret)
4537 goto free_eb; 4538 goto free_eb;
4538 4539
4539 spin_lock(&tree->buffer_lock); 4540 spin_lock(&tree->buffer_lock);
4540 ret = radix_tree_insert(&tree->buffer, start >> PAGE_CACHE_SHIFT, eb); 4541 ret = radix_tree_insert(&tree->buffer, start >> PAGE_CACHE_SHIFT, eb);
4541 if (ret == -EEXIST) { 4542 if (ret == -EEXIST) {
4542 exists = radix_tree_lookup(&tree->buffer, 4543 exists = radix_tree_lookup(&tree->buffer,
4543 start >> PAGE_CACHE_SHIFT); 4544 start >> PAGE_CACHE_SHIFT);
4544 if (!atomic_inc_not_zero(&exists->refs)) { 4545 if (!atomic_inc_not_zero(&exists->refs)) {
4545 spin_unlock(&tree->buffer_lock); 4546 spin_unlock(&tree->buffer_lock);
4546 radix_tree_preload_end(); 4547 radix_tree_preload_end();
4547 exists = NULL; 4548 exists = NULL;
4548 goto again; 4549 goto again;
4549 } 4550 }
 		spin_unlock(&tree->buffer_lock);
 		radix_tree_preload_end();
-		mark_extent_buffer_accessed(exists);
+		mark_extent_buffer_accessed(exists, NULL);
 		goto free_eb;
4554 } 4555 }
4555 /* add one reference for the tree */ 4556 /* add one reference for the tree */
4556 check_buffer_tree_ref(eb); 4557 check_buffer_tree_ref(eb);
4557 spin_unlock(&tree->buffer_lock); 4558 spin_unlock(&tree->buffer_lock);
4558 radix_tree_preload_end(); 4559 radix_tree_preload_end();
4559 4560
4560 /* 4561 /*
4561 * there is a race where release page may have 4562 * there is a race where release page may have
4562 * tried to find this extent buffer in the radix 4563 * tried to find this extent buffer in the radix
4563 * but failed. It will tell the VM it is safe to 4564 * but failed. It will tell the VM it is safe to
4564 * reclaim the, and it will clear the page private bit. 4565 * reclaim the, and it will clear the page private bit.
4565 * We must make sure to set the page private bit properly 4566 * We must make sure to set the page private bit properly
4566 * after the extent buffer is in the radix tree so 4567 * after the extent buffer is in the radix tree so
4567 * it doesn't get lost 4568 * it doesn't get lost
4568 */ 4569 */
4569 SetPageChecked(eb->pages[0]); 4570 SetPageChecked(eb->pages[0]);
4570 for (i = 1; i < num_pages; i++) { 4571 for (i = 1; i < num_pages; i++) {
4571 p = extent_buffer_page(eb, i); 4572 p = extent_buffer_page(eb, i);
4572 ClearPageChecked(p); 4573 ClearPageChecked(p);
4573 unlock_page(p); 4574 unlock_page(p);
4574 } 4575 }
4575 unlock_page(eb->pages[0]); 4576 unlock_page(eb->pages[0]);
4576 return eb; 4577 return eb;
4577 4578
4578 free_eb: 4579 free_eb:
4579 for (i = 0; i < num_pages; i++) { 4580 for (i = 0; i < num_pages; i++) {
4580 if (eb->pages[i]) 4581 if (eb->pages[i])
4581 unlock_page(eb->pages[i]); 4582 unlock_page(eb->pages[i]);
4582 } 4583 }
4583 4584
4584 WARN_ON(!atomic_dec_and_test(&eb->refs)); 4585 WARN_ON(!atomic_dec_and_test(&eb->refs));
4585 btrfs_release_extent_buffer(eb); 4586 btrfs_release_extent_buffer(eb);
4586 return exists; 4587 return exists;
4587 } 4588 }
4588 4589
4589 struct extent_buffer *find_extent_buffer(struct extent_io_tree *tree, 4590 struct extent_buffer *find_extent_buffer(struct extent_io_tree *tree,
4590 u64 start, unsigned long len) 4591 u64 start, unsigned long len)
4591 { 4592 {
4592 struct extent_buffer *eb; 4593 struct extent_buffer *eb;
4593 4594
 	rcu_read_lock();
 	eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT);
 	if (eb && atomic_inc_not_zero(&eb->refs)) {
 		rcu_read_unlock();
-		mark_extent_buffer_accessed(eb);
+		mark_extent_buffer_accessed(eb, NULL);
 		return eb;
 	}
 	rcu_read_unlock();
4602 4603
4603 return NULL; 4604 return NULL;
4604 } 4605 }
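Both lookups above rely on atomic_inc_not_zero(): under rcu_read_lock() the buffer may already be on its way to being freed, so a reference is taken only if the count is still non-zero, otherwise the lookup allocates a new buffer (or, here, returns NULL). A standalone sketch of that conditional ref-grab, modelling atomic_inc_not_zero() with a C11 compare-exchange loop (RCU itself is not modelled):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static bool inc_not_zero(atomic_int *refs)
{
	int old = atomic_load(refs);

	while (old != 0) {
		/* only bump the count if it is still non-zero */
		if (atomic_compare_exchange_weak(refs, &old, old + 1))
			return true;
		/* old was reloaded by the failed CAS; retry */
	}
	return false;		/* object already being torn down */
}

int main(void)
{
	atomic_int live = 2, dying = 0;

	printf("live:  %d\n", inc_not_zero(&live));	/* 1: reference taken */
	printf("dying: %d\n", inc_not_zero(&dying));	/* 0: lookup must fall back */
	return 0;
}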
4605 4606
4606 static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head) 4607 static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head)
4607 { 4608 {
4608 struct extent_buffer *eb = 4609 struct extent_buffer *eb =
4609 container_of(head, struct extent_buffer, rcu_head); 4610 container_of(head, struct extent_buffer, rcu_head);
4610 4611
4611 __free_extent_buffer(eb); 4612 __free_extent_buffer(eb);
4612 } 4613 }
4613 4614
4614 /* Expects to have eb->eb_lock already held */ 4615 /* Expects to have eb->eb_lock already held */
4615 static int release_extent_buffer(struct extent_buffer *eb) 4616 static int release_extent_buffer(struct extent_buffer *eb)
4616 { 4617 {
4617 WARN_ON(atomic_read(&eb->refs) == 0); 4618 WARN_ON(atomic_read(&eb->refs) == 0);
4618 if (atomic_dec_and_test(&eb->refs)) { 4619 if (atomic_dec_and_test(&eb->refs)) {
4619 if (test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags)) { 4620 if (test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags)) {
4620 spin_unlock(&eb->refs_lock); 4621 spin_unlock(&eb->refs_lock);
4621 } else { 4622 } else {
4622 struct extent_io_tree *tree = eb->tree; 4623 struct extent_io_tree *tree = eb->tree;
4623 4624
4624 spin_unlock(&eb->refs_lock); 4625 spin_unlock(&eb->refs_lock);
4625 4626
4626 spin_lock(&tree->buffer_lock); 4627 spin_lock(&tree->buffer_lock);
4627 radix_tree_delete(&tree->buffer, 4628 radix_tree_delete(&tree->buffer,
4628 eb->start >> PAGE_CACHE_SHIFT); 4629 eb->start >> PAGE_CACHE_SHIFT);
4629 spin_unlock(&tree->buffer_lock); 4630 spin_unlock(&tree->buffer_lock);
4630 } 4631 }
4631 4632
4632 /* Should be safe to release our pages at this point */ 4633 /* Should be safe to release our pages at this point */
4633 btrfs_release_extent_buffer_page(eb, 0); 4634 btrfs_release_extent_buffer_page(eb, 0);
4634 call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu); 4635 call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu);
4635 return 1; 4636 return 1;
4636 } 4637 }
4637 spin_unlock(&eb->refs_lock); 4638 spin_unlock(&eb->refs_lock);
4638 4639
4639 return 0; 4640 return 0;
4640 } 4641 }
4641 4642
4642 void free_extent_buffer(struct extent_buffer *eb) 4643 void free_extent_buffer(struct extent_buffer *eb)
4643 { 4644 {
4644 int refs; 4645 int refs;
4645 int old; 4646 int old;
4646 if (!eb) 4647 if (!eb)
4647 return; 4648 return;
4648 4649
4649 while (1) { 4650 while (1) {
4650 refs = atomic_read(&eb->refs); 4651 refs = atomic_read(&eb->refs);
4651 if (refs <= 3) 4652 if (refs <= 3)
4652 break; 4653 break;
4653 old = atomic_cmpxchg(&eb->refs, refs, refs - 1); 4654 old = atomic_cmpxchg(&eb->refs, refs, refs - 1);
4654 if (old == refs) 4655 if (old == refs)
4655 return; 4656 return;
4656 } 4657 }
4657 4658
4658 spin_lock(&eb->refs_lock); 4659 spin_lock(&eb->refs_lock);
4659 if (atomic_read(&eb->refs) == 2 && 4660 if (atomic_read(&eb->refs) == 2 &&
4660 test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags)) 4661 test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))
4661 atomic_dec(&eb->refs); 4662 atomic_dec(&eb->refs);
4662 4663
4663 if (atomic_read(&eb->refs) == 2 && 4664 if (atomic_read(&eb->refs) == 2 &&
4664 test_bit(EXTENT_BUFFER_STALE, &eb->bflags) && 4665 test_bit(EXTENT_BUFFER_STALE, &eb->bflags) &&
4665 !extent_buffer_under_io(eb) && 4666 !extent_buffer_under_io(eb) &&
4666 test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) 4667 test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
4667 atomic_dec(&eb->refs); 4668 atomic_dec(&eb->refs);
4668 4669
4669 /* 4670 /*
4670 * I know this is terrible, but it's temporary until we stop tracking 4671 * I know this is terrible, but it's temporary until we stop tracking
4671 * the uptodate bits and such for the extent buffers. 4672 * the uptodate bits and such for the extent buffers.
4672 */ 4673 */
4673 release_extent_buffer(eb); 4674 release_extent_buffer(eb);
4674 } 4675 }
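free_extent_buffer() first tries a lockless fast path: while the reference count is comfortably above the small threshold it drops a reference with a bare cmpxchg loop and never touches refs_lock; only low counts fall through to the locked checks. A self-contained sketch of that fast path with C11 atomics (the threshold value is only illustrative):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define SLOW_PATH_THRESHOLD 3

static bool put_ref_fast(atomic_int *refs)
{
	int old = atomic_load(refs);

	while (old > SLOW_PATH_THRESHOLD) {
		/* try to move old -> old - 1; on failure old is reloaded */
		if (atomic_compare_exchange_weak(refs, &old, old - 1))
			return true;	/* dropped without taking any lock */
	}
	return false;			/* caller must take the lock and re-check */
}

int main(void)
{
	atomic_int refs = 5;

	while (put_ref_fast(&refs))
		;
	printf("refs left for the slow path: %d\n", atomic_load(&refs));	/* 3 */
	return 0;
}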
4675 4676
4676 void free_extent_buffer_stale(struct extent_buffer *eb) 4677 void free_extent_buffer_stale(struct extent_buffer *eb)
4677 { 4678 {
4678 if (!eb) 4679 if (!eb)
4679 return; 4680 return;
4680 4681
4681 spin_lock(&eb->refs_lock); 4682 spin_lock(&eb->refs_lock);
4682 set_bit(EXTENT_BUFFER_STALE, &eb->bflags); 4683 set_bit(EXTENT_BUFFER_STALE, &eb->bflags);
4683 4684
4684 if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) && 4685 if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) &&
4685 test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) 4686 test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
4686 atomic_dec(&eb->refs); 4687 atomic_dec(&eb->refs);
4687 release_extent_buffer(eb); 4688 release_extent_buffer(eb);
4688 } 4689 }
4689 4690
4690 void clear_extent_buffer_dirty(struct extent_buffer *eb) 4691 void clear_extent_buffer_dirty(struct extent_buffer *eb)
4691 { 4692 {
4692 unsigned long i; 4693 unsigned long i;
4693 unsigned long num_pages; 4694 unsigned long num_pages;
4694 struct page *page; 4695 struct page *page;
4695 4696
4696 num_pages = num_extent_pages(eb->start, eb->len); 4697 num_pages = num_extent_pages(eb->start, eb->len);
4697 4698
4698 for (i = 0; i < num_pages; i++) { 4699 for (i = 0; i < num_pages; i++) {
4699 page = extent_buffer_page(eb, i); 4700 page = extent_buffer_page(eb, i);
4700 if (!PageDirty(page)) 4701 if (!PageDirty(page))
4701 continue; 4702 continue;
4702 4703
4703 lock_page(page); 4704 lock_page(page);
4704 WARN_ON(!PagePrivate(page)); 4705 WARN_ON(!PagePrivate(page));
4705 4706
4706 clear_page_dirty_for_io(page); 4707 clear_page_dirty_for_io(page);
4707 spin_lock_irq(&page->mapping->tree_lock); 4708 spin_lock_irq(&page->mapping->tree_lock);
4708 if (!PageDirty(page)) { 4709 if (!PageDirty(page)) {
4709 radix_tree_tag_clear(&page->mapping->page_tree, 4710 radix_tree_tag_clear(&page->mapping->page_tree,
4710 page_index(page), 4711 page_index(page),
4711 PAGECACHE_TAG_DIRTY); 4712 PAGECACHE_TAG_DIRTY);
4712 } 4713 }
4713 spin_unlock_irq(&page->mapping->tree_lock); 4714 spin_unlock_irq(&page->mapping->tree_lock);
4714 ClearPageError(page); 4715 ClearPageError(page);
4715 unlock_page(page); 4716 unlock_page(page);
4716 } 4717 }
4717 WARN_ON(atomic_read(&eb->refs) == 0); 4718 WARN_ON(atomic_read(&eb->refs) == 0);
4718 } 4719 }
4719 4720
4720 int set_extent_buffer_dirty(struct extent_buffer *eb) 4721 int set_extent_buffer_dirty(struct extent_buffer *eb)
4721 { 4722 {
4722 unsigned long i; 4723 unsigned long i;
4723 unsigned long num_pages; 4724 unsigned long num_pages;
4724 int was_dirty = 0; 4725 int was_dirty = 0;
4725 4726
4726 check_buffer_tree_ref(eb); 4727 check_buffer_tree_ref(eb);
4727 4728
4728 was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags); 4729 was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
4729 4730
4730 num_pages = num_extent_pages(eb->start, eb->len); 4731 num_pages = num_extent_pages(eb->start, eb->len);
4731 WARN_ON(atomic_read(&eb->refs) == 0); 4732 WARN_ON(atomic_read(&eb->refs) == 0);
4732 WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)); 4733 WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
4733 4734
4734 for (i = 0; i < num_pages; i++) 4735 for (i = 0; i < num_pages; i++)
4735 set_page_dirty(extent_buffer_page(eb, i)); 4736 set_page_dirty(extent_buffer_page(eb, i));
4736 return was_dirty; 4737 return was_dirty;
4737 } 4738 }
4738 4739
4739 int clear_extent_buffer_uptodate(struct extent_buffer *eb) 4740 int clear_extent_buffer_uptodate(struct extent_buffer *eb)
4740 { 4741 {
4741 unsigned long i; 4742 unsigned long i;
4742 struct page *page; 4743 struct page *page;
4743 unsigned long num_pages; 4744 unsigned long num_pages;
4744 4745
4745 clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); 4746 clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
4746 num_pages = num_extent_pages(eb->start, eb->len); 4747 num_pages = num_extent_pages(eb->start, eb->len);
4747 for (i = 0; i < num_pages; i++) { 4748 for (i = 0; i < num_pages; i++) {
4748 page = extent_buffer_page(eb, i); 4749 page = extent_buffer_page(eb, i);
4749 if (page) 4750 if (page)
4750 ClearPageUptodate(page); 4751 ClearPageUptodate(page);
4751 } 4752 }
4752 return 0; 4753 return 0;
4753 } 4754 }
4754 4755
4755 int set_extent_buffer_uptodate(struct extent_buffer *eb) 4756 int set_extent_buffer_uptodate(struct extent_buffer *eb)
4756 { 4757 {
4757 unsigned long i; 4758 unsigned long i;
4758 struct page *page; 4759 struct page *page;
4759 unsigned long num_pages; 4760 unsigned long num_pages;
4760 4761
4761 set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); 4762 set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
4762 num_pages = num_extent_pages(eb->start, eb->len); 4763 num_pages = num_extent_pages(eb->start, eb->len);
4763 for (i = 0; i < num_pages; i++) { 4764 for (i = 0; i < num_pages; i++) {
4764 page = extent_buffer_page(eb, i); 4765 page = extent_buffer_page(eb, i);
4765 SetPageUptodate(page); 4766 SetPageUptodate(page);
4766 } 4767 }
4767 return 0; 4768 return 0;
4768 } 4769 }
4769 4770
4770 int extent_buffer_uptodate(struct extent_buffer *eb) 4771 int extent_buffer_uptodate(struct extent_buffer *eb)
4771 { 4772 {
4772 return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); 4773 return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
4773 } 4774 }
4774 4775
4775 int read_extent_buffer_pages(struct extent_io_tree *tree, 4776 int read_extent_buffer_pages(struct extent_io_tree *tree,
4776 struct extent_buffer *eb, u64 start, int wait, 4777 struct extent_buffer *eb, u64 start, int wait,
4777 get_extent_t *get_extent, int mirror_num) 4778 get_extent_t *get_extent, int mirror_num)
4778 { 4779 {
4779 unsigned long i; 4780 unsigned long i;
4780 unsigned long start_i; 4781 unsigned long start_i;
4781 struct page *page; 4782 struct page *page;
4782 int err; 4783 int err;
4783 int ret = 0; 4784 int ret = 0;
4784 int locked_pages = 0; 4785 int locked_pages = 0;
4785 int all_uptodate = 1; 4786 int all_uptodate = 1;
4786 unsigned long num_pages; 4787 unsigned long num_pages;
4787 unsigned long num_reads = 0; 4788 unsigned long num_reads = 0;
4788 struct bio *bio = NULL; 4789 struct bio *bio = NULL;
4789 unsigned long bio_flags = 0; 4790 unsigned long bio_flags = 0;
4790 4791
4791 if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) 4792 if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
4792 return 0; 4793 return 0;
4793 4794
4794 if (start) { 4795 if (start) {
4795 WARN_ON(start < eb->start); 4796 WARN_ON(start < eb->start);
4796 start_i = (start >> PAGE_CACHE_SHIFT) - 4797 start_i = (start >> PAGE_CACHE_SHIFT) -
4797 (eb->start >> PAGE_CACHE_SHIFT); 4798 (eb->start >> PAGE_CACHE_SHIFT);
4798 } else { 4799 } else {
4799 start_i = 0; 4800 start_i = 0;
4800 } 4801 }
4801 4802
4802 num_pages = num_extent_pages(eb->start, eb->len); 4803 num_pages = num_extent_pages(eb->start, eb->len);
4803 for (i = start_i; i < num_pages; i++) { 4804 for (i = start_i; i < num_pages; i++) {
4804 page = extent_buffer_page(eb, i); 4805 page = extent_buffer_page(eb, i);
4805 if (wait == WAIT_NONE) { 4806 if (wait == WAIT_NONE) {
4806 if (!trylock_page(page)) 4807 if (!trylock_page(page))
4807 goto unlock_exit; 4808 goto unlock_exit;
4808 } else { 4809 } else {
4809 lock_page(page); 4810 lock_page(page);
4810 } 4811 }
4811 locked_pages++; 4812 locked_pages++;
4812 if (!PageUptodate(page)) { 4813 if (!PageUptodate(page)) {
4813 num_reads++; 4814 num_reads++;
4814 all_uptodate = 0; 4815 all_uptodate = 0;
4815 } 4816 }
4816 } 4817 }
4817 if (all_uptodate) { 4818 if (all_uptodate) {
4818 if (start_i == 0) 4819 if (start_i == 0)
4819 set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); 4820 set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
4820 goto unlock_exit; 4821 goto unlock_exit;
4821 } 4822 }
4822 4823
4823 clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags); 4824 clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
4824 eb->read_mirror = 0; 4825 eb->read_mirror = 0;
4825 atomic_set(&eb->io_pages, num_reads); 4826 atomic_set(&eb->io_pages, num_reads);
4826 for (i = start_i; i < num_pages; i++) { 4827 for (i = start_i; i < num_pages; i++) {
4827 page = extent_buffer_page(eb, i); 4828 page = extent_buffer_page(eb, i);
4828 if (!PageUptodate(page)) { 4829 if (!PageUptodate(page)) {
4829 ClearPageError(page); 4830 ClearPageError(page);
4830 err = __extent_read_full_page(tree, page, 4831 err = __extent_read_full_page(tree, page,
4831 get_extent, &bio, 4832 get_extent, &bio,
4832 mirror_num, &bio_flags, 4833 mirror_num, &bio_flags,
4833 READ | REQ_META); 4834 READ | REQ_META);
4834 if (err) 4835 if (err)
4835 ret = err; 4836 ret = err;
4836 } else { 4837 } else {
4837 unlock_page(page); 4838 unlock_page(page);
4838 } 4839 }
4839 } 4840 }
4840 4841
4841 if (bio) { 4842 if (bio) {
4842 err = submit_one_bio(READ | REQ_META, bio, mirror_num, 4843 err = submit_one_bio(READ | REQ_META, bio, mirror_num,
4843 bio_flags); 4844 bio_flags);
4844 if (err) 4845 if (err)
4845 return err; 4846 return err;
4846 } 4847 }
4847 4848
4848 if (ret || wait != WAIT_COMPLETE) 4849 if (ret || wait != WAIT_COMPLETE)
4849 return ret; 4850 return ret;
4850 4851
4851 for (i = start_i; i < num_pages; i++) { 4852 for (i = start_i; i < num_pages; i++) {
4852 page = extent_buffer_page(eb, i); 4853 page = extent_buffer_page(eb, i);
4853 wait_on_page_locked(page); 4854 wait_on_page_locked(page);
4854 if (!PageUptodate(page)) 4855 if (!PageUptodate(page))
4855 ret = -EIO; 4856 ret = -EIO;
4856 } 4857 }
4857 4858
4858 return ret; 4859 return ret;
4859 4860
4860 unlock_exit: 4861 unlock_exit:
4861 i = start_i; 4862 i = start_i;
4862 while (locked_pages > 0) { 4863 while (locked_pages > 0) {
4863 page = extent_buffer_page(eb, i); 4864 page = extent_buffer_page(eb, i);
4864 i++; 4865 i++;
4865 unlock_page(page); 4866 unlock_page(page);
4866 locked_pages--; 4867 locked_pages--;
4867 } 4868 }
4868 return ret; 4869 return ret;
4869 } 4870 }
4870 4871
4871 void read_extent_buffer(struct extent_buffer *eb, void *dstv, 4872 void read_extent_buffer(struct extent_buffer *eb, void *dstv,
4872 unsigned long start, 4873 unsigned long start,
4873 unsigned long len) 4874 unsigned long len)
4874 { 4875 {
4875 size_t cur; 4876 size_t cur;
4876 size_t offset; 4877 size_t offset;
4877 struct page *page; 4878 struct page *page;
4878 char *kaddr; 4879 char *kaddr;
4879 char *dst = (char *)dstv; 4880 char *dst = (char *)dstv;
4880 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); 4881 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
4881 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; 4882 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
4882 4883
4883 WARN_ON(start > eb->len); 4884 WARN_ON(start > eb->len);
4884 WARN_ON(start + len > eb->start + eb->len); 4885 WARN_ON(start + len > eb->start + eb->len);
4885 4886
4886 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); 4887 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
4887 4888
4888 while (len > 0) { 4889 while (len > 0) {
4889 page = extent_buffer_page(eb, i); 4890 page = extent_buffer_page(eb, i);
4890 4891
4891 cur = min(len, (PAGE_CACHE_SIZE - offset)); 4892 cur = min(len, (PAGE_CACHE_SIZE - offset));
4892 kaddr = page_address(page); 4893 kaddr = page_address(page);
4893 memcpy(dst, kaddr + offset, cur); 4894 memcpy(dst, kaddr + offset, cur);
4894 4895
4895 dst += cur; 4896 dst += cur;
4896 len -= cur; 4897 len -= cur;
4897 offset = 0; 4898 offset = 0;
4898 i++; 4899 i++;
4899 } 4900 }
4900 } 4901 }
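read_extent_buffer() copies an arbitrary byte range out of a buffer scattered across fixed-size pages: the first iteration honours the offset within the starting page, and every later page is copied from offset 0. A small userspace sketch of the same cross-page copy loop, with a tiny PAGE_SZ so the arithmetic is easy to follow; the flat pages[] array stands in for the kernel's page machinery:

#include <stdio.h>
#include <string.h>

#define PAGE_SZ 8	/* tiny page size to keep the example readable */

static void read_paged(char pages[][PAGE_SZ], void *dstv,
		       size_t start, size_t len)
{
	char *dst = dstv;
	size_t i = start / PAGE_SZ;		/* first page index */
	size_t offset = start % PAGE_SZ;	/* offset within that page */

	while (len > 0) {
		size_t cur = len < PAGE_SZ - offset ? len : PAGE_SZ - offset;

		memcpy(dst, pages[i] + offset, cur);
		dst += cur;
		len -= cur;
		offset = 0;	/* every page after the first starts at 0 */
		i++;
	}
}

int main(void)
{
	char pages[3][PAGE_SZ] = { "01234567", "89abcdef", "ghijklmn" };
	char out[11] = {0};

	read_paged(pages, out, 5, 10);	/* spans pages 0 and 1 */
	printf("%s\n", out);		/* prints "56789abcde" */
	return 0;
}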
4901 4902
4902 int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start, 4903 int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start,
4903 unsigned long min_len, char **map, 4904 unsigned long min_len, char **map,
4904 unsigned long *map_start, 4905 unsigned long *map_start,
4905 unsigned long *map_len) 4906 unsigned long *map_len)
4906 { 4907 {
4907 size_t offset = start & (PAGE_CACHE_SIZE - 1); 4908 size_t offset = start & (PAGE_CACHE_SIZE - 1);
4908 char *kaddr; 4909 char *kaddr;
4909 struct page *p; 4910 struct page *p;
4910 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); 4911 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
4911 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; 4912 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
4912 unsigned long end_i = (start_offset + start + min_len - 1) >> 4913 unsigned long end_i = (start_offset + start + min_len - 1) >>
4913 PAGE_CACHE_SHIFT; 4914 PAGE_CACHE_SHIFT;
4914 4915
4915 if (i != end_i) 4916 if (i != end_i)
4916 return -EINVAL; 4917 return -EINVAL;
4917 4918
4918 if (i == 0) { 4919 if (i == 0) {
4919 offset = start_offset; 4920 offset = start_offset;
4920 *map_start = 0; 4921 *map_start = 0;
4921 } else { 4922 } else {
4922 offset = 0; 4923 offset = 0;
4923 *map_start = ((u64)i << PAGE_CACHE_SHIFT) - start_offset; 4924 *map_start = ((u64)i << PAGE_CACHE_SHIFT) - start_offset;
4924 } 4925 }
4925 4926
4926 if (start + min_len > eb->len) { 4927 if (start + min_len > eb->len) {
4927 WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, " 4928 WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, "
4928 "wanted %lu %lu\n", 4929 "wanted %lu %lu\n",
4929 eb->start, eb->len, start, min_len); 4930 eb->start, eb->len, start, min_len);
4930 return -EINVAL; 4931 return -EINVAL;
4931 } 4932 }
4932 4933
4933 p = extent_buffer_page(eb, i); 4934 p = extent_buffer_page(eb, i);
4934 kaddr = page_address(p); 4935 kaddr = page_address(p);
4935 *map = kaddr + offset; 4936 *map = kaddr + offset;
4936 *map_len = PAGE_CACHE_SIZE - offset; 4937 *map_len = PAGE_CACHE_SIZE - offset;
4937 return 0; 4938 return 0;
4938 } 4939 }
4939 4940
4940 int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv, 4941 int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv,
4941 unsigned long start, 4942 unsigned long start,
4942 unsigned long len) 4943 unsigned long len)
4943 { 4944 {
4944 size_t cur; 4945 size_t cur;
4945 size_t offset; 4946 size_t offset;
4946 struct page *page; 4947 struct page *page;
4947 char *kaddr; 4948 char *kaddr;
4948 char *ptr = (char *)ptrv; 4949 char *ptr = (char *)ptrv;
4949 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); 4950 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
4950 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; 4951 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
4951 int ret = 0; 4952 int ret = 0;
4952 4953
4953 WARN_ON(start > eb->len); 4954 WARN_ON(start > eb->len);
4954 WARN_ON(start + len > eb->start + eb->len); 4955 WARN_ON(start + len > eb->start + eb->len);
4955 4956
4956 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); 4957 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
4957 4958
4958 while (len > 0) { 4959 while (len > 0) {
4959 page = extent_buffer_page(eb, i); 4960 page = extent_buffer_page(eb, i);
4960 4961
4961 cur = min(len, (PAGE_CACHE_SIZE - offset)); 4962 cur = min(len, (PAGE_CACHE_SIZE - offset));
4962 4963
4963 kaddr = page_address(page); 4964 kaddr = page_address(page);
4964 ret = memcmp(ptr, kaddr + offset, cur); 4965 ret = memcmp(ptr, kaddr + offset, cur);
4965 if (ret) 4966 if (ret)
4966 break; 4967 break;
4967 4968
4968 ptr += cur; 4969 ptr += cur;
4969 len -= cur; 4970 len -= cur;
4970 offset = 0; 4971 offset = 0;
4971 i++; 4972 i++;
4972 } 4973 }
4973 return ret; 4974 return ret;
4974 } 4975 }
4975 4976
4976 void write_extent_buffer(struct extent_buffer *eb, const void *srcv, 4977 void write_extent_buffer(struct extent_buffer *eb, const void *srcv,
4977 unsigned long start, unsigned long len) 4978 unsigned long start, unsigned long len)
4978 { 4979 {
4979 size_t cur; 4980 size_t cur;
4980 size_t offset; 4981 size_t offset;
4981 struct page *page; 4982 struct page *page;
4982 char *kaddr; 4983 char *kaddr;
4983 char *src = (char *)srcv; 4984 char *src = (char *)srcv;
4984 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); 4985 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
4985 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; 4986 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
4986 4987
4987 WARN_ON(start > eb->len); 4988 WARN_ON(start > eb->len);
4988 WARN_ON(start + len > eb->start + eb->len); 4989 WARN_ON(start + len > eb->start + eb->len);
4989 4990
4990 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); 4991 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
4991 4992
4992 while (len > 0) { 4993 while (len > 0) {
4993 page = extent_buffer_page(eb, i); 4994 page = extent_buffer_page(eb, i);
4994 WARN_ON(!PageUptodate(page)); 4995 WARN_ON(!PageUptodate(page));
4995 4996
4996 cur = min(len, PAGE_CACHE_SIZE - offset); 4997 cur = min(len, PAGE_CACHE_SIZE - offset);
4997 kaddr = page_address(page); 4998 kaddr = page_address(page);
4998 memcpy(kaddr + offset, src, cur); 4999 memcpy(kaddr + offset, src, cur);
4999 5000
5000 src += cur; 5001 src += cur;
5001 len -= cur; 5002 len -= cur;
5002 offset = 0; 5003 offset = 0;
5003 i++; 5004 i++;
5004 } 5005 }
5005 } 5006 }
5006 5007
5007 void memset_extent_buffer(struct extent_buffer *eb, char c, 5008 void memset_extent_buffer(struct extent_buffer *eb, char c,
5008 unsigned long start, unsigned long len) 5009 unsigned long start, unsigned long len)
5009 { 5010 {
5010 size_t cur; 5011 size_t cur;
5011 size_t offset; 5012 size_t offset;
5012 struct page *page; 5013 struct page *page;
5013 char *kaddr; 5014 char *kaddr;
5014 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); 5015 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
5015 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; 5016 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
5016 5017
5017 WARN_ON(start > eb->len); 5018 WARN_ON(start > eb->len);
5018 WARN_ON(start + len > eb->start + eb->len); 5019 WARN_ON(start + len > eb->start + eb->len);
5019 5020
5020 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); 5021 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
5021 5022
5022 while (len > 0) { 5023 while (len > 0) {
5023 page = extent_buffer_page(eb, i); 5024 page = extent_buffer_page(eb, i);
5024 WARN_ON(!PageUptodate(page)); 5025 WARN_ON(!PageUptodate(page));
5025 5026
5026 cur = min(len, PAGE_CACHE_SIZE - offset); 5027 cur = min(len, PAGE_CACHE_SIZE - offset);
5027 kaddr = page_address(page); 5028 kaddr = page_address(page);
5028 memset(kaddr + offset, c, cur); 5029 memset(kaddr + offset, c, cur);
5029 5030
5030 len -= cur; 5031 len -= cur;
5031 offset = 0; 5032 offset = 0;
5032 i++; 5033 i++;
5033 } 5034 }
5034 } 5035 }
5035 5036
5036 void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src, 5037 void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
5037 unsigned long dst_offset, unsigned long src_offset, 5038 unsigned long dst_offset, unsigned long src_offset,
5038 unsigned long len) 5039 unsigned long len)
5039 { 5040 {
5040 u64 dst_len = dst->len; 5041 u64 dst_len = dst->len;
5041 size_t cur; 5042 size_t cur;
5042 size_t offset; 5043 size_t offset;
5043 struct page *page; 5044 struct page *page;
5044 char *kaddr; 5045 char *kaddr;
5045 size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1); 5046 size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1);
5046 unsigned long i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT; 5047 unsigned long i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT;
5047 5048
5048 WARN_ON(src->len != dst_len); 5049 WARN_ON(src->len != dst_len);
5049 5050
5050 offset = (start_offset + dst_offset) & 5051 offset = (start_offset + dst_offset) &
5051 (PAGE_CACHE_SIZE - 1); 5052 (PAGE_CACHE_SIZE - 1);
5052 5053
5053 while (len > 0) { 5054 while (len > 0) {
5054 page = extent_buffer_page(dst, i); 5055 page = extent_buffer_page(dst, i);
5055 WARN_ON(!PageUptodate(page)); 5056 WARN_ON(!PageUptodate(page));
5056 5057
5057 cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - offset)); 5058 cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - offset));
5058 5059
5059 kaddr = page_address(page); 5060 kaddr = page_address(page);
5060 read_extent_buffer(src, kaddr + offset, src_offset, cur); 5061 read_extent_buffer(src, kaddr + offset, src_offset, cur);
5061 5062
5062 src_offset += cur; 5063 src_offset += cur;
5063 len -= cur; 5064 len -= cur;
5064 offset = 0; 5065 offset = 0;
5065 i++; 5066 i++;
5066 } 5067 }
5067 } 5068 }
5068 5069
5069 static void move_pages(struct page *dst_page, struct page *src_page, 5070 static void move_pages(struct page *dst_page, struct page *src_page,
5070 unsigned long dst_off, unsigned long src_off, 5071 unsigned long dst_off, unsigned long src_off,
5071 unsigned long len) 5072 unsigned long len)
5072 { 5073 {
5073 char *dst_kaddr = page_address(dst_page); 5074 char *dst_kaddr = page_address(dst_page);
5074 if (dst_page == src_page) { 5075 if (dst_page == src_page) {
5075 memmove(dst_kaddr + dst_off, dst_kaddr + src_off, len); 5076 memmove(dst_kaddr + dst_off, dst_kaddr + src_off, len);
5076 } else { 5077 } else {
5077 char *src_kaddr = page_address(src_page); 5078 char *src_kaddr = page_address(src_page);
5078 char *p = dst_kaddr + dst_off + len; 5079 char *p = dst_kaddr + dst_off + len;
5079 char *s = src_kaddr + src_off + len; 5080 char *s = src_kaddr + src_off + len;
5080 5081
5081 while (len--) 5082 while (len--)
5082 *--p = *--s; 5083 *--p = *--s;
5083 } 5084 }
5084 } 5085 }
5085 5086
5086 static inline bool areas_overlap(unsigned long src, unsigned long dst, unsigned long len) 5087 static inline bool areas_overlap(unsigned long src, unsigned long dst, unsigned long len)
5087 { 5088 {
5088 unsigned long distance = (src > dst) ? src - dst : dst - src; 5089 unsigned long distance = (src > dst) ? src - dst : dst - src;
5089 return distance < len; 5090 return distance < len;
5090 } 5091 }
5091 5092
5092 static void copy_pages(struct page *dst_page, struct page *src_page, 5093 static void copy_pages(struct page *dst_page, struct page *src_page,
5093 unsigned long dst_off, unsigned long src_off, 5094 unsigned long dst_off, unsigned long src_off,
5094 unsigned long len) 5095 unsigned long len)
5095 { 5096 {
5096 char *dst_kaddr = page_address(dst_page); 5097 char *dst_kaddr = page_address(dst_page);
5097 char *src_kaddr; 5098 char *src_kaddr;
5098 int must_memmove = 0; 5099 int must_memmove = 0;
5099 5100
5100 if (dst_page != src_page) { 5101 if (dst_page != src_page) {
5101 src_kaddr = page_address(src_page); 5102 src_kaddr = page_address(src_page);
5102 } else { 5103 } else {
5103 src_kaddr = dst_kaddr; 5104 src_kaddr = dst_kaddr;
5104 if (areas_overlap(src_off, dst_off, len)) 5105 if (areas_overlap(src_off, dst_off, len))
5105 must_memmove = 1; 5106 must_memmove = 1;
5106 } 5107 }
5107 5108
5108 if (must_memmove) 5109 if (must_memmove)
5109 memmove(dst_kaddr + dst_off, src_kaddr + src_off, len); 5110 memmove(dst_kaddr + dst_off, src_kaddr + src_off, len);
5110 else 5111 else
5111 memcpy(dst_kaddr + dst_off, src_kaddr + src_off, len); 5112 memcpy(dst_kaddr + dst_off, src_kaddr + src_off, len);
5112 } 5113 }
5113 5114
5114 void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset, 5115 void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
5115 unsigned long src_offset, unsigned long len) 5116 unsigned long src_offset, unsigned long len)
5116 { 5117 {
5117 size_t cur; 5118 size_t cur;
5118 size_t dst_off_in_page; 5119 size_t dst_off_in_page;
5119 size_t src_off_in_page; 5120 size_t src_off_in_page;
5120 size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1); 5121 size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1);
5121 unsigned long dst_i; 5122 unsigned long dst_i;
5122 unsigned long src_i; 5123 unsigned long src_i;
5123 5124
5124 if (src_offset + len > dst->len) { 5125 if (src_offset + len > dst->len) {
5125 printk(KERN_ERR "btrfs memmove bogus src_offset %lu move " 5126 printk(KERN_ERR "btrfs memmove bogus src_offset %lu move "
5126 "len %lu dst len %lu\n", src_offset, len, dst->len); 5127 "len %lu dst len %lu\n", src_offset, len, dst->len);
5127 BUG_ON(1); 5128 BUG_ON(1);
5128 } 5129 }
5129 if (dst_offset + len > dst->len) { 5130 if (dst_offset + len > dst->len) {
5130 printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move " 5131 printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move "
5131 "len %lu dst len %lu\n", dst_offset, len, dst->len); 5132 "len %lu dst len %lu\n", dst_offset, len, dst->len);
5132 BUG_ON(1); 5133 BUG_ON(1);
5133 } 5134 }
5134 5135
5135 while (len > 0) { 5136 while (len > 0) {
5136 dst_off_in_page = (start_offset + dst_offset) & 5137 dst_off_in_page = (start_offset + dst_offset) &
5137 (PAGE_CACHE_SIZE - 1); 5138 (PAGE_CACHE_SIZE - 1);
5138 src_off_in_page = (start_offset + src_offset) & 5139 src_off_in_page = (start_offset + src_offset) &
5139 (PAGE_CACHE_SIZE - 1); 5140 (PAGE_CACHE_SIZE - 1);
5140 5141
5141 dst_i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT; 5142 dst_i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT;
5142 src_i = (start_offset + src_offset) >> PAGE_CACHE_SHIFT; 5143 src_i = (start_offset + src_offset) >> PAGE_CACHE_SHIFT;
5143 5144
5144 cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - 5145 cur = min(len, (unsigned long)(PAGE_CACHE_SIZE -
5145 src_off_in_page)); 5146 src_off_in_page));
5146 cur = min_t(unsigned long, cur, 5147 cur = min_t(unsigned long, cur,
5147 (unsigned long)(PAGE_CACHE_SIZE - dst_off_in_page)); 5148 (unsigned long)(PAGE_CACHE_SIZE - dst_off_in_page));
5148 5149
5149 copy_pages(extent_buffer_page(dst, dst_i), 5150 copy_pages(extent_buffer_page(dst, dst_i),
5150 extent_buffer_page(dst, src_i), 5151 extent_buffer_page(dst, src_i),
5151 dst_off_in_page, src_off_in_page, cur); 5152 dst_off_in_page, src_off_in_page, cur);
5152 5153
5153 src_offset += cur; 5154 src_offset += cur;
5154 dst_offset += cur; 5155 dst_offset += cur;
5155 len -= cur; 5156 len -= cur;
5156 } 5157 }
5157 } 5158 }
5158 5159
5159 void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset, 5160 void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
5160 unsigned long src_offset, unsigned long len) 5161 unsigned long src_offset, unsigned long len)
5161 { 5162 {
5162 size_t cur; 5163 size_t cur;
5163 size_t dst_off_in_page; 5164 size_t dst_off_in_page;
5164 size_t src_off_in_page; 5165 size_t src_off_in_page;
5165 unsigned long dst_end = dst_offset + len - 1; 5166 unsigned long dst_end = dst_offset + len - 1;
5166 unsigned long src_end = src_offset + len - 1; 5167 unsigned long src_end = src_offset + len - 1;
5167 size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1); 5168 size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1);
5168 unsigned long dst_i; 5169 unsigned long dst_i;
5169 unsigned long src_i; 5170 unsigned long src_i;
5170 5171
5171 if (src_offset + len > dst->len) { 5172 if (src_offset + len > dst->len) {
5172 printk(KERN_ERR "btrfs memmove bogus src_offset %lu move " 5173 printk(KERN_ERR "btrfs memmove bogus src_offset %lu move "
5173 "len %lu len %lu\n", src_offset, len, dst->len); 5174 "len %lu len %lu\n", src_offset, len, dst->len);
5174 BUG_ON(1); 5175 BUG_ON(1);
5175 } 5176 }
5176 if (dst_offset + len > dst->len) { 5177 if (dst_offset + len > dst->len) {
5177 printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move " 5178 printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move "
5178 "len %lu len %lu\n", dst_offset, len, dst->len); 5179 "len %lu len %lu\n", dst_offset, len, dst->len);
5179 BUG_ON(1); 5180 BUG_ON(1);
5180 } 5181 }
5181 if (dst_offset < src_offset) { 5182 if (dst_offset < src_offset) {
5182 memcpy_extent_buffer(dst, dst_offset, src_offset, len); 5183 memcpy_extent_buffer(dst, dst_offset, src_offset, len);
5183 return; 5184 return;
5184 } 5185 }
5185 while (len > 0) { 5186 while (len > 0) {
5186 dst_i = (start_offset + dst_end) >> PAGE_CACHE_SHIFT; 5187 dst_i = (start_offset + dst_end) >> PAGE_CACHE_SHIFT;
5187 src_i = (start_offset + src_end) >> PAGE_CACHE_SHIFT; 5188 src_i = (start_offset + src_end) >> PAGE_CACHE_SHIFT;
5188 5189
5189 dst_off_in_page = (start_offset + dst_end) & 5190 dst_off_in_page = (start_offset + dst_end) &
5190 (PAGE_CACHE_SIZE - 1); 5191 (PAGE_CACHE_SIZE - 1);
5191 src_off_in_page = (start_offset + src_end) & 5192 src_off_in_page = (start_offset + src_end) &
5192 (PAGE_CACHE_SIZE - 1); 5193 (PAGE_CACHE_SIZE - 1);
5193 5194
5194 cur = min_t(unsigned long, len, src_off_in_page + 1); 5195 cur = min_t(unsigned long, len, src_off_in_page + 1);
5195 cur = min(cur, dst_off_in_page + 1); 5196 cur = min(cur, dst_off_in_page + 1);
5196 move_pages(extent_buffer_page(dst, dst_i), 5197 move_pages(extent_buffer_page(dst, dst_i),
5197 extent_buffer_page(dst, src_i), 5198 extent_buffer_page(dst, src_i),
5198 dst_off_in_page - cur + 1, 5199 dst_off_in_page - cur + 1,
5199 src_off_in_page - cur + 1, cur); 5200 src_off_in_page - cur + 1, cur);
5200 5201
5201 dst_end -= cur; 5202 dst_end -= cur;
5202 src_end -= cur; 5203 src_end -= cur;
5203 len -= cur; 5204 len -= cur;
5204 } 5205 }
5205 } 5206 }
5206 5207
5207 int try_release_extent_buffer(struct page *page) 5208 int try_release_extent_buffer(struct page *page)
5208 { 5209 {
5209 struct extent_buffer *eb; 5210 struct extent_buffer *eb;
5210 5211
5211 /* 5212 /*
5212 * We need to make sure nobody is attaching this page to an eb right 5213 * We need to make sure nobody is attaching this page to an eb right
5213 * now. 5214 * now.
5214 */ 5215 */
5215 spin_lock(&page->mapping->private_lock); 5216 spin_lock(&page->mapping->private_lock);
5216 if (!PagePrivate(page)) { 5217 if (!PagePrivate(page)) {
5217 spin_unlock(&page->mapping->private_lock); 5218 spin_unlock(&page->mapping->private_lock);
5218 return 1; 5219 return 1;
5219 } 5220 }
5220 5221
5221 eb = (struct extent_buffer *)page->private; 5222 eb = (struct extent_buffer *)page->private;
5222 BUG_ON(!eb); 5223 BUG_ON(!eb);
5223 5224
5224 /* 5225 /*
5225 * This is a little awful but should be ok, we need to make sure that 5226 * This is a little awful but should be ok, we need to make sure that
5226 * the eb doesn't disappear out from under us while we're looking at 5227 * the eb doesn't disappear out from under us while we're looking at
5227 * this page. 5228 * this page.
5228 */ 5229 */
5229 spin_lock(&eb->refs_lock); 5230 spin_lock(&eb->refs_lock);
5230 if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) { 5231 if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
5231 spin_unlock(&eb->refs_lock); 5232 spin_unlock(&eb->refs_lock);
5232 spin_unlock(&page->mapping->private_lock); 5233 spin_unlock(&page->mapping->private_lock);
5233 return 0; 5234 return 0;
5234 } 5235 }
5235 spin_unlock(&page->mapping->private_lock); 5236 spin_unlock(&page->mapping->private_lock);
5236 5237
5237 /* 5238 /*
5238 * If tree ref isn't set then we know the ref on this eb is a real ref, 5239 * If tree ref isn't set then we know the ref on this eb is a real ref,
5239 * so just return, this page will likely be freed soon anyway. 5240 * so just return, this page will likely be freed soon anyway.
5240 */ 5241 */
5241 if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) { 5242 if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
5242 spin_unlock(&eb->refs_lock); 5243 spin_unlock(&eb->refs_lock);
5243 return 0; 5244 return 0;
5244 } 5245 }
5245 5246
5246 return release_extent_buffer(eb); 5247 return release_extent_buffer(eb);
5247 } 5248 }
1 /* 1 /*
2 * Copyright (C) 2007 Oracle. All rights reserved. 2 * Copyright (C) 2007 Oracle. All rights reserved.
3 * 3 *
4 * This program is free software; you can redistribute it and/or 4 * This program is free software; you can redistribute it and/or
5 * modify it under the terms of the GNU General Public 5 * modify it under the terms of the GNU General Public
6 * License v2 as published by the Free Software Foundation. 6 * License v2 as published by the Free Software Foundation.
7 * 7 *
8 * This program is distributed in the hope that it will be useful, 8 * This program is distributed in the hope that it will be useful,
9 * but WITHOUT ANY WARRANTY; without even the implied warranty of 9 * but WITHOUT ANY WARRANTY; without even the implied warranty of
10 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 10 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
11 * General Public License for more details. 11 * General Public License for more details.
12 * 12 *
13 * You should have received a copy of the GNU General Public 13 * You should have received a copy of the GNU General Public
14 * License along with this program; if not, write to the 14 * License along with this program; if not, write to the
15 * Free Software Foundation, Inc., 59 Temple Place - Suite 330, 15 * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
16 * Boston, MA 021110-1307, USA. 16 * Boston, MA 021110-1307, USA.
17 */ 17 */
18 18
19 #include <linux/fs.h> 19 #include <linux/fs.h>
20 #include <linux/pagemap.h> 20 #include <linux/pagemap.h>
21 #include <linux/highmem.h> 21 #include <linux/highmem.h>
22 #include <linux/time.h> 22 #include <linux/time.h>
23 #include <linux/init.h> 23 #include <linux/init.h>
24 #include <linux/string.h> 24 #include <linux/string.h>
25 #include <linux/backing-dev.h> 25 #include <linux/backing-dev.h>
26 #include <linux/mpage.h> 26 #include <linux/mpage.h>
27 #include <linux/aio.h> 27 #include <linux/aio.h>
28 #include <linux/falloc.h> 28 #include <linux/falloc.h>
29 #include <linux/swap.h> 29 #include <linux/swap.h>
30 #include <linux/writeback.h> 30 #include <linux/writeback.h>
31 #include <linux/statfs.h> 31 #include <linux/statfs.h>
32 #include <linux/compat.h> 32 #include <linux/compat.h>
33 #include <linux/slab.h> 33 #include <linux/slab.h>
34 #include <linux/btrfs.h> 34 #include <linux/btrfs.h>
35 #include "ctree.h" 35 #include "ctree.h"
36 #include "disk-io.h" 36 #include "disk-io.h"
37 #include "transaction.h" 37 #include "transaction.h"
38 #include "btrfs_inode.h" 38 #include "btrfs_inode.h"
39 #include "print-tree.h" 39 #include "print-tree.h"
40 #include "tree-log.h" 40 #include "tree-log.h"
41 #include "locking.h" 41 #include "locking.h"
42 #include "compat.h" 42 #include "compat.h"
43 #include "volumes.h" 43 #include "volumes.h"
44 44
45 static struct kmem_cache *btrfs_inode_defrag_cachep; 45 static struct kmem_cache *btrfs_inode_defrag_cachep;
46 /* 46 /*
47 * when auto defrag is enabled we 47 * when auto defrag is enabled we
48 * queue up these defrag structs to remember which 48 * queue up these defrag structs to remember which
49 * inodes need defragging passes 49 * inodes need defragging passes
50 */ 50 */
51 struct inode_defrag { 51 struct inode_defrag {
52 struct rb_node rb_node; 52 struct rb_node rb_node;
53 /* objectid */ 53 /* objectid */
54 u64 ino; 54 u64 ino;
55 /* 55 /*
56 * transid where the defrag was added, we search for 56 * transid where the defrag was added, we search for
57 * extents newer than this 57 * extents newer than this
58 */ 58 */
59 u64 transid; 59 u64 transid;
60 60
61 /* root objectid */ 61 /* root objectid */
62 u64 root; 62 u64 root;
63 63
64 /* last offset we were able to defrag */ 64 /* last offset we were able to defrag */
65 u64 last_offset; 65 u64 last_offset;
66 66
67 /* if we've wrapped around back to zero once already */ 67 /* if we've wrapped around back to zero once already */
68 int cycled; 68 int cycled;
69 }; 69 };
70 70
71 static int __compare_inode_defrag(struct inode_defrag *defrag1, 71 static int __compare_inode_defrag(struct inode_defrag *defrag1,
72 struct inode_defrag *defrag2) 72 struct inode_defrag *defrag2)
73 { 73 {
74 if (defrag1->root > defrag2->root) 74 if (defrag1->root > defrag2->root)
75 return 1; 75 return 1;
76 else if (defrag1->root < defrag2->root) 76 else if (defrag1->root < defrag2->root)
77 return -1; 77 return -1;
78 else if (defrag1->ino > defrag2->ino) 78 else if (defrag1->ino > defrag2->ino)
79 return 1; 79 return 1;
80 else if (defrag1->ino < defrag2->ino) 80 else if (defrag1->ino < defrag2->ino)
81 return -1; 81 return -1;
82 else 82 else
83 return 0; 83 return 0;
84 } 84 }
85 85
86 /* pop a record for an inode into the defrag tree. The lock 86 /* pop a record for an inode into the defrag tree. The lock
87 * must be held already 87 * must be held already
88 * 88 *
89 * If you're inserting a record for an older transid than an 89 * If you're inserting a record for an older transid than an
90 * existing record, the transid already in the tree is lowered 90 * existing record, the transid already in the tree is lowered
91 * 91 *
92 * If an existing record is found the defrag item you 92 * If an existing record is found the defrag item you
93 * pass in is freed 93 * pass in is freed
94 */ 94 */
95 static int __btrfs_add_inode_defrag(struct inode *inode, 95 static int __btrfs_add_inode_defrag(struct inode *inode,
96 struct inode_defrag *defrag) 96 struct inode_defrag *defrag)
97 { 97 {
98 struct btrfs_root *root = BTRFS_I(inode)->root; 98 struct btrfs_root *root = BTRFS_I(inode)->root;
99 struct inode_defrag *entry; 99 struct inode_defrag *entry;
100 struct rb_node **p; 100 struct rb_node **p;
101 struct rb_node *parent = NULL; 101 struct rb_node *parent = NULL;
102 int ret; 102 int ret;
103 103
104 p = &root->fs_info->defrag_inodes.rb_node; 104 p = &root->fs_info->defrag_inodes.rb_node;
105 while (*p) { 105 while (*p) {
106 parent = *p; 106 parent = *p;
107 entry = rb_entry(parent, struct inode_defrag, rb_node); 107 entry = rb_entry(parent, struct inode_defrag, rb_node);
108 108
109 ret = __compare_inode_defrag(defrag, entry); 109 ret = __compare_inode_defrag(defrag, entry);
110 if (ret < 0) 110 if (ret < 0)
111 p = &parent->rb_left; 111 p = &parent->rb_left;
112 else if (ret > 0) 112 else if (ret > 0)
113 p = &parent->rb_right; 113 p = &parent->rb_right;
114 else { 114 else {
115 /* if we're reinserting an entry for 115 /* if we're reinserting an entry for
116 * an old defrag run, make sure to 116 * an old defrag run, make sure to
117 * lower the transid of our existing record 117 * lower the transid of our existing record
118 */ 118 */
119 if (defrag->transid < entry->transid) 119 if (defrag->transid < entry->transid)
120 entry->transid = defrag->transid; 120 entry->transid = defrag->transid;
121 if (defrag->last_offset > entry->last_offset) 121 if (defrag->last_offset > entry->last_offset)
122 entry->last_offset = defrag->last_offset; 122 entry->last_offset = defrag->last_offset;
123 return -EEXIST; 123 return -EEXIST;
124 } 124 }
125 } 125 }
126 set_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags); 126 set_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
127 rb_link_node(&defrag->rb_node, parent, p); 127 rb_link_node(&defrag->rb_node, parent, p);
128 rb_insert_color(&defrag->rb_node, &root->fs_info->defrag_inodes); 128 rb_insert_color(&defrag->rb_node, &root->fs_info->defrag_inodes);
129 return 0; 129 return 0;
130 } 130 }
131 131
132 static inline int __need_auto_defrag(struct btrfs_root *root) 132 static inline int __need_auto_defrag(struct btrfs_root *root)
133 { 133 {
134 if (!btrfs_test_opt(root, AUTO_DEFRAG)) 134 if (!btrfs_test_opt(root, AUTO_DEFRAG))
135 return 0; 135 return 0;
136 136
137 if (btrfs_fs_closing(root->fs_info)) 137 if (btrfs_fs_closing(root->fs_info))
138 return 0; 138 return 0;
139 139
140 return 1; 140 return 1;
141 } 141 }
142 142
143 /* 143 /*
144 * insert a defrag record for this inode if auto defrag is 144 * insert a defrag record for this inode if auto defrag is
145 * enabled 145 * enabled
146 */ 146 */
147 int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans, 147 int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
148 struct inode *inode) 148 struct inode *inode)
149 { 149 {
150 struct btrfs_root *root = BTRFS_I(inode)->root; 150 struct btrfs_root *root = BTRFS_I(inode)->root;
151 struct inode_defrag *defrag; 151 struct inode_defrag *defrag;
152 u64 transid; 152 u64 transid;
153 int ret; 153 int ret;
154 154
155 if (!__need_auto_defrag(root)) 155 if (!__need_auto_defrag(root))
156 return 0; 156 return 0;
157 157
158 if (test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags)) 158 if (test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags))
159 return 0; 159 return 0;
160 160
161 if (trans) 161 if (trans)
162 transid = trans->transid; 162 transid = trans->transid;
163 else 163 else
164 transid = BTRFS_I(inode)->root->last_trans; 164 transid = BTRFS_I(inode)->root->last_trans;
165 165
166 defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS); 166 defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
167 if (!defrag) 167 if (!defrag)
168 return -ENOMEM; 168 return -ENOMEM;
169 169
170 defrag->ino = btrfs_ino(inode); 170 defrag->ino = btrfs_ino(inode);
171 defrag->transid = transid; 171 defrag->transid = transid;
172 defrag->root = root->root_key.objectid; 172 defrag->root = root->root_key.objectid;
173 173
174 spin_lock(&root->fs_info->defrag_inodes_lock); 174 spin_lock(&root->fs_info->defrag_inodes_lock);
175 if (!test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags)) { 175 if (!test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags)) {
176 /* 176 /*
177 * If we set IN_DEFRAG flag and evict the inode from memory, 177 * If we set IN_DEFRAG flag and evict the inode from memory,
178 * and then re-read this inode, this new inode doesn't have 178 * and then re-read this inode, this new inode doesn't have
179 * IN_DEFRAG flag. In that case, we may find an existing defrag record. 179 * IN_DEFRAG flag. In that case, we may find an existing defrag record.
180 */ 180 */
181 ret = __btrfs_add_inode_defrag(inode, defrag); 181 ret = __btrfs_add_inode_defrag(inode, defrag);
182 if (ret) 182 if (ret)
183 kmem_cache_free(btrfs_inode_defrag_cachep, defrag); 183 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
184 } else { 184 } else {
185 kmem_cache_free(btrfs_inode_defrag_cachep, defrag); 185 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
186 } 186 }
187 spin_unlock(&root->fs_info->defrag_inodes_lock); 187 spin_unlock(&root->fs_info->defrag_inodes_lock);
188 return 0; 188 return 0;
189 } 189 }
190 190
191 /* 191 /*
192 * Requeue the defrag object. If there is a defrag object that points to 192 * Requeue the defrag object. If there is a defrag object that points to
193 * the same inode in the tree, we will merge them together (by 193 * the same inode in the tree, we will merge them together (by
194 * __btrfs_add_inode_defrag()) and free the one that we want to requeue. 194 * __btrfs_add_inode_defrag()) and free the one that we want to requeue.
195 */ 195 */
196 static void btrfs_requeue_inode_defrag(struct inode *inode, 196 static void btrfs_requeue_inode_defrag(struct inode *inode,
197 struct inode_defrag *defrag) 197 struct inode_defrag *defrag)
198 { 198 {
199 struct btrfs_root *root = BTRFS_I(inode)->root; 199 struct btrfs_root *root = BTRFS_I(inode)->root;
200 int ret; 200 int ret;
201 201
202 if (!__need_auto_defrag(root)) 202 if (!__need_auto_defrag(root))
203 goto out; 203 goto out;
204 204
205 /* 205 /*
206 * Here we don't check the IN_DEFRAG flag, because we need merge 206 * Here we don't check the IN_DEFRAG flag, because we need merge
207 * them together. 207 * them together.
208 */ 208 */
209 spin_lock(&root->fs_info->defrag_inodes_lock); 209 spin_lock(&root->fs_info->defrag_inodes_lock);
210 ret = __btrfs_add_inode_defrag(inode, defrag); 210 ret = __btrfs_add_inode_defrag(inode, defrag);
211 spin_unlock(&root->fs_info->defrag_inodes_lock); 211 spin_unlock(&root->fs_info->defrag_inodes_lock);
212 if (ret) 212 if (ret)
213 goto out; 213 goto out;
214 return; 214 return;
215 out: 215 out:
216 kmem_cache_free(btrfs_inode_defrag_cachep, defrag); 216 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
217 } 217 }
218 218
219 /* 219 /*
220 * pick the defraggable inode that we want; if it doesn't exist, we will get 220 * pick the defraggable inode that we want; if it doesn't exist, we will get
221 * the next one. 221 * the next one.
222 */ 222 */
223 static struct inode_defrag * 223 static struct inode_defrag *
224 btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino) 224 btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino)
225 { 225 {
226 struct inode_defrag *entry = NULL; 226 struct inode_defrag *entry = NULL;
227 struct inode_defrag tmp; 227 struct inode_defrag tmp;
228 struct rb_node *p; 228 struct rb_node *p;
229 struct rb_node *parent = NULL; 229 struct rb_node *parent = NULL;
230 int ret; 230 int ret;
231 231
232 tmp.ino = ino; 232 tmp.ino = ino;
233 tmp.root = root; 233 tmp.root = root;
234 234
235 spin_lock(&fs_info->defrag_inodes_lock); 235 spin_lock(&fs_info->defrag_inodes_lock);
236 p = fs_info->defrag_inodes.rb_node; 236 p = fs_info->defrag_inodes.rb_node;
237 while (p) { 237 while (p) {
238 parent = p; 238 parent = p;
239 entry = rb_entry(parent, struct inode_defrag, rb_node); 239 entry = rb_entry(parent, struct inode_defrag, rb_node);
240 240
241 ret = __compare_inode_defrag(&tmp, entry); 241 ret = __compare_inode_defrag(&tmp, entry);
242 if (ret < 0) 242 if (ret < 0)
243 p = parent->rb_left; 243 p = parent->rb_left;
244 else if (ret > 0) 244 else if (ret > 0)
245 p = parent->rb_right; 245 p = parent->rb_right;
246 else 246 else
247 goto out; 247 goto out;
248 } 248 }
249 249
250 if (parent && __compare_inode_defrag(&tmp, entry) > 0) { 250 if (parent && __compare_inode_defrag(&tmp, entry) > 0) {
251 parent = rb_next(parent); 251 parent = rb_next(parent);
252 if (parent) 252 if (parent)
253 entry = rb_entry(parent, struct inode_defrag, rb_node); 253 entry = rb_entry(parent, struct inode_defrag, rb_node);
254 else 254 else
255 entry = NULL; 255 entry = NULL;
256 } 256 }
257 out: 257 out:
258 if (entry) 258 if (entry)
259 rb_erase(parent, &fs_info->defrag_inodes); 259 rb_erase(parent, &fs_info->defrag_inodes);
260 spin_unlock(&fs_info->defrag_inodes_lock); 260 spin_unlock(&fs_info->defrag_inodes_lock);
261 return entry; 261 return entry;
262 } 262 }
263 263
264 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info) 264 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
265 { 265 {
266 struct inode_defrag *defrag; 266 struct inode_defrag *defrag;
267 struct rb_node *node; 267 struct rb_node *node;
268 268
269 spin_lock(&fs_info->defrag_inodes_lock); 269 spin_lock(&fs_info->defrag_inodes_lock);
270 node = rb_first(&fs_info->defrag_inodes); 270 node = rb_first(&fs_info->defrag_inodes);
271 while (node) { 271 while (node) {
272 rb_erase(node, &fs_info->defrag_inodes); 272 rb_erase(node, &fs_info->defrag_inodes);
273 defrag = rb_entry(node, struct inode_defrag, rb_node); 273 defrag = rb_entry(node, struct inode_defrag, rb_node);
274 kmem_cache_free(btrfs_inode_defrag_cachep, defrag); 274 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
275 275
276 if (need_resched()) { 276 if (need_resched()) {
277 spin_unlock(&fs_info->defrag_inodes_lock); 277 spin_unlock(&fs_info->defrag_inodes_lock);
278 cond_resched(); 278 cond_resched();
279 spin_lock(&fs_info->defrag_inodes_lock); 279 spin_lock(&fs_info->defrag_inodes_lock);
280 } 280 }
281 281
282 node = rb_first(&fs_info->defrag_inodes); 282 node = rb_first(&fs_info->defrag_inodes);
283 } 283 }
284 spin_unlock(&fs_info->defrag_inodes_lock); 284 spin_unlock(&fs_info->defrag_inodes_lock);
285 } 285 }
286 286
287 #define BTRFS_DEFRAG_BATCH 1024 287 #define BTRFS_DEFRAG_BATCH 1024
288 288
289 static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info, 289 static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
290 struct inode_defrag *defrag) 290 struct inode_defrag *defrag)
291 { 291 {
292 struct btrfs_root *inode_root; 292 struct btrfs_root *inode_root;
293 struct inode *inode; 293 struct inode *inode;
294 struct btrfs_key key; 294 struct btrfs_key key;
295 struct btrfs_ioctl_defrag_range_args range; 295 struct btrfs_ioctl_defrag_range_args range;
296 int num_defrag; 296 int num_defrag;
297 int index; 297 int index;
298 int ret; 298 int ret;
299 299
300 /* get the inode */ 300 /* get the inode */
301 key.objectid = defrag->root; 301 key.objectid = defrag->root;
302 btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY); 302 btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY);
303 key.offset = (u64)-1; 303 key.offset = (u64)-1;
304 304
305 index = srcu_read_lock(&fs_info->subvol_srcu); 305 index = srcu_read_lock(&fs_info->subvol_srcu);
306 306
307 inode_root = btrfs_read_fs_root_no_name(fs_info, &key); 307 inode_root = btrfs_read_fs_root_no_name(fs_info, &key);
308 if (IS_ERR(inode_root)) { 308 if (IS_ERR(inode_root)) {
309 ret = PTR_ERR(inode_root); 309 ret = PTR_ERR(inode_root);
310 goto cleanup; 310 goto cleanup;
311 } 311 }
312 312
313 key.objectid = defrag->ino; 313 key.objectid = defrag->ino;
314 btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY); 314 btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY);
315 key.offset = 0; 315 key.offset = 0;
316 inode = btrfs_iget(fs_info->sb, &key, inode_root, NULL); 316 inode = btrfs_iget(fs_info->sb, &key, inode_root, NULL);
317 if (IS_ERR(inode)) { 317 if (IS_ERR(inode)) {
318 ret = PTR_ERR(inode); 318 ret = PTR_ERR(inode);
319 goto cleanup; 319 goto cleanup;
320 } 320 }
321 srcu_read_unlock(&fs_info->subvol_srcu, index); 321 srcu_read_unlock(&fs_info->subvol_srcu, index);
322 322
323 /* do a chunk of defrag */ 323 /* do a chunk of defrag */
324 clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags); 324 clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
325 memset(&range, 0, sizeof(range)); 325 memset(&range, 0, sizeof(range));
326 range.len = (u64)-1; 326 range.len = (u64)-1;
327 range.start = defrag->last_offset; 327 range.start = defrag->last_offset;
328 328
329 sb_start_write(fs_info->sb); 329 sb_start_write(fs_info->sb);
330 num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid, 330 num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
331 BTRFS_DEFRAG_BATCH); 331 BTRFS_DEFRAG_BATCH);
332 sb_end_write(fs_info->sb); 332 sb_end_write(fs_info->sb);
333 /* 333 /*
334 * if we filled the whole defrag batch, there 334 * if we filled the whole defrag batch, there
335 * must be more work to do. Queue this defrag 335 * must be more work to do. Queue this defrag
336 * again 336 * again
337 */ 337 */
338 if (num_defrag == BTRFS_DEFRAG_BATCH) { 338 if (num_defrag == BTRFS_DEFRAG_BATCH) {
339 defrag->last_offset = range.start; 339 defrag->last_offset = range.start;
340 btrfs_requeue_inode_defrag(inode, defrag); 340 btrfs_requeue_inode_defrag(inode, defrag);
341 } else if (defrag->last_offset && !defrag->cycled) { 341 } else if (defrag->last_offset && !defrag->cycled) {
342 /* 342 /*
343 * we didn't fill our defrag batch, but 343 * we didn't fill our defrag batch, but
344 * we didn't start at zero. Make sure we loop 344 * we didn't start at zero. Make sure we loop
345 * around to the start of the file. 345 * around to the start of the file.
346 */ 346 */
347 defrag->last_offset = 0; 347 defrag->last_offset = 0;
348 defrag->cycled = 1; 348 defrag->cycled = 1;
349 btrfs_requeue_inode_defrag(inode, defrag); 349 btrfs_requeue_inode_defrag(inode, defrag);
350 } else { 350 } else {
351 kmem_cache_free(btrfs_inode_defrag_cachep, defrag); 351 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
352 } 352 }
353 353
354 iput(inode); 354 iput(inode);
355 return 0; 355 return 0;
356 cleanup: 356 cleanup:
357 srcu_read_unlock(&fs_info->subvol_srcu, index); 357 srcu_read_unlock(&fs_info->subvol_srcu, index);
358 kmem_cache_free(btrfs_inode_defrag_cachep, defrag); 358 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
359 return ret; 359 return ret;
360 } 360 }
361 361
362 /* 362 /*
363 * run through the list of inodes in the FS that need 363 * run through the list of inodes in the FS that need
364 * defragging 364 * defragging
365 */ 365 */
366 int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info) 366 int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
367 { 367 {
368 struct inode_defrag *defrag; 368 struct inode_defrag *defrag;
369 u64 first_ino = 0; 369 u64 first_ino = 0;
370 u64 root_objectid = 0; 370 u64 root_objectid = 0;
371 371
372 atomic_inc(&fs_info->defrag_running); 372 atomic_inc(&fs_info->defrag_running);
373 while(1) { 373 while(1) {
374 /* Pause the auto defragger. */ 374 /* Pause the auto defragger. */
375 if (test_bit(BTRFS_FS_STATE_REMOUNTING, 375 if (test_bit(BTRFS_FS_STATE_REMOUNTING,
376 &fs_info->fs_state)) 376 &fs_info->fs_state))
377 break; 377 break;
378 378
379 if (!__need_auto_defrag(fs_info->tree_root)) 379 if (!__need_auto_defrag(fs_info->tree_root))
380 break; 380 break;
381 381
382 /* find an inode to defrag */ 382 /* find an inode to defrag */
383 defrag = btrfs_pick_defrag_inode(fs_info, root_objectid, 383 defrag = btrfs_pick_defrag_inode(fs_info, root_objectid,
384 first_ino); 384 first_ino);
385 if (!defrag) { 385 if (!defrag) {
386 if (root_objectid || first_ino) { 386 if (root_objectid || first_ino) {
387 root_objectid = 0; 387 root_objectid = 0;
388 first_ino = 0; 388 first_ino = 0;
389 continue; 389 continue;
390 } else { 390 } else {
391 break; 391 break;
392 } 392 }
393 } 393 }
394 394
395 first_ino = defrag->ino + 1; 395 first_ino = defrag->ino + 1;
396 root_objectid = defrag->root; 396 root_objectid = defrag->root;
397 397
398 __btrfs_run_defrag_inode(fs_info, defrag); 398 __btrfs_run_defrag_inode(fs_info, defrag);
399 } 399 }
400 atomic_dec(&fs_info->defrag_running); 400 atomic_dec(&fs_info->defrag_running);
401 401
402 /* 402 /*
403 * during unmount, we use the transaction_wait queue to 403 * during unmount, we use the transaction_wait queue to
404 * wait for the defragger to stop 404 * wait for the defragger to stop
405 */ 405 */
406 wake_up(&fs_info->transaction_wait); 406 wake_up(&fs_info->transaction_wait);
407 return 0; 407 return 0;
408 } 408 }
409 409
410 /* simple helper to fault in pages and copy. This should go away 410 /* simple helper to fault in pages and copy. This should go away
411 * and be replaced with calls into generic code. 411 * and be replaced with calls into generic code.
412 */ 412 */
413 static noinline int btrfs_copy_from_user(loff_t pos, int num_pages, 413 static noinline int btrfs_copy_from_user(loff_t pos, int num_pages,
414 size_t write_bytes, 414 size_t write_bytes,
415 struct page **prepared_pages, 415 struct page **prepared_pages,
416 struct iov_iter *i) 416 struct iov_iter *i)
417 { 417 {
418 size_t copied = 0; 418 size_t copied = 0;
419 size_t total_copied = 0; 419 size_t total_copied = 0;
420 int pg = 0; 420 int pg = 0;
421 int offset = pos & (PAGE_CACHE_SIZE - 1); 421 int offset = pos & (PAGE_CACHE_SIZE - 1);
422 422
423 while (write_bytes > 0) { 423 while (write_bytes > 0) {
424 size_t count = min_t(size_t, 424 size_t count = min_t(size_t,
425 PAGE_CACHE_SIZE - offset, write_bytes); 425 PAGE_CACHE_SIZE - offset, write_bytes);
426 struct page *page = prepared_pages[pg]; 426 struct page *page = prepared_pages[pg];
427 /* 427 /*
428 * Copy data from userspace to the current page 428 * Copy data from userspace to the current page
429 */ 429 */
430 copied = iov_iter_copy_from_user_atomic(page, i, offset, count); 430 copied = iov_iter_copy_from_user_atomic(page, i, offset, count);
431 431
432 /* Flush processor's dcache for this page */ 432 /* Flush processor's dcache for this page */
433 flush_dcache_page(page); 433 flush_dcache_page(page);
434 434
435 /* 435 /*
436 * if we get a partial write, we can end up with 436 * if we get a partial write, we can end up with
437 * partially up to date pages. These add 437 * partially up to date pages. These add
438 * a lot of complexity, so make sure they don't 438 * a lot of complexity, so make sure they don't
439 * happen by forcing this copy to be retried. 439 * happen by forcing this copy to be retried.
440 * 440 *
441 * The rest of the btrfs_file_write code will fall 441 * The rest of the btrfs_file_write code will fall
442 * back to page at a time copies after we return 0. 442 * back to page at a time copies after we return 0.
443 */ 443 */
444 if (!PageUptodate(page) && copied < count) 444 if (!PageUptodate(page) && copied < count)
445 copied = 0; 445 copied = 0;
446 446
447 iov_iter_advance(i, copied); 447 iov_iter_advance(i, copied);
448 write_bytes -= copied; 448 write_bytes -= copied;
449 total_copied += copied; 449 total_copied += copied;
450 450
451 /* Return to btrfs_file_aio_write to fault page */ 451 /* Return to btrfs_file_aio_write to fault page */
452 if (unlikely(copied == 0)) 452 if (unlikely(copied == 0))
453 break; 453 break;
454 454
455 if (unlikely(copied < PAGE_CACHE_SIZE - offset)) { 455 if (unlikely(copied < PAGE_CACHE_SIZE - offset)) {
456 offset += copied; 456 offset += copied;
457 } else { 457 } else {
458 pg++; 458 pg++;
459 offset = 0; 459 offset = 0;
460 } 460 }
461 } 461 }
462 return total_copied; 462 return total_copied;
463 } 463 }
464 464
465 /* 465 /*
466 * unlocks pages after btrfs_file_write is done with them 466 * unlocks pages after btrfs_file_write is done with them
467 */ 467 */
 468  468	static void btrfs_drop_pages(struct page **pages, size_t num_pages)
 469  469	{
 470  470		size_t i;
 471  471		for (i = 0; i < num_pages; i++) {
 472  472			/* page checked is some magic around finding pages that
 473  473			 * have been modified without going through btrfs_set_page_dirty
-474			 * clear it here
+     474		 * clear it here. There should be no need to mark the pages
+     475		 * accessed as prepare_pages should have marked them accessed
+     476		 * in prepare_pages via find_or_create_page()
 475  477			 */
 476  478			ClearPageChecked(pages[i]);
 477  479			unlock_page(pages[i]);
-478			mark_page_accessed(pages[i]);
 479  480			page_cache_release(pages[i]);
 480  481		}
 481  482	}
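
The hunk above is the part of this listing the commit actually changes: the explicit mark_page_accessed() on release is dropped because, per the new comment, the pages handed to btrfs_drop_pages() were already marked accessed when prepare_pages obtained them through find_or_create_page(). A minimal sketch of that pairing follows; it is illustrative only -- grab_one_page()/drop_one_page() and their mapping/index arguments are invented for the example and are not btrfs's actual prepare_pages code.

/*
 * Illustrative sketch only, not the btrfs prepare_pages() implementation.
 * It shows the pairing the new comment in btrfs_drop_pages() relies on:
 * the page is marked accessed once, when it is looked up or allocated,
 * so the release path no longer calls mark_page_accessed() itself.
 */
static struct page *grab_one_page(struct address_space *mapping, pgoff_t index)
{
	/*
	 * find_or_create_page() returns a locked, referenced page; after
	 * this series it is expected to handle the "accessed" marking as
	 * well, non-atomically when the page had to be freshly allocated.
	 */
	return find_or_create_page(mapping, index, GFP_NOFS);
}

static void drop_one_page(struct page *page)
{
	unlock_page(page);
	page_cache_release(page);	/* no mark_page_accessed() here */
}
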
482 483
483 /* 484 /*
484 * after copy_from_user, pages need to be dirtied and we need to make 485 * after copy_from_user, pages need to be dirtied and we need to make
485 * sure holes are created between the current EOF and the start of 486 * sure holes are created between the current EOF and the start of
486 * any next extents (if required). 487 * any next extents (if required).
487 * 488 *
488 * this also makes the decision about creating an inline extent vs 489 * this also makes the decision about creating an inline extent vs
489 * doing real data extents, marking pages dirty and delalloc as required. 490 * doing real data extents, marking pages dirty and delalloc as required.
490 */ 491 */
491 int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode, 492 int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
492 struct page **pages, size_t num_pages, 493 struct page **pages, size_t num_pages,
493 loff_t pos, size_t write_bytes, 494 loff_t pos, size_t write_bytes,
494 struct extent_state **cached) 495 struct extent_state **cached)
495 { 496 {
496 int err = 0; 497 int err = 0;
497 int i; 498 int i;
498 u64 num_bytes; 499 u64 num_bytes;
499 u64 start_pos; 500 u64 start_pos;
500 u64 end_of_last_block; 501 u64 end_of_last_block;
501 u64 end_pos = pos + write_bytes; 502 u64 end_pos = pos + write_bytes;
502 loff_t isize = i_size_read(inode); 503 loff_t isize = i_size_read(inode);
503 504
504 start_pos = pos & ~((u64)root->sectorsize - 1); 505 start_pos = pos & ~((u64)root->sectorsize - 1);
505 num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize); 506 num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize);
506 507
507 end_of_last_block = start_pos + num_bytes - 1; 508 end_of_last_block = start_pos + num_bytes - 1;
508 err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block, 509 err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
509 cached); 510 cached);
510 if (err) 511 if (err)
511 return err; 512 return err;
512 513
513 for (i = 0; i < num_pages; i++) { 514 for (i = 0; i < num_pages; i++) {
514 struct page *p = pages[i]; 515 struct page *p = pages[i];
515 SetPageUptodate(p); 516 SetPageUptodate(p);
516 ClearPageChecked(p); 517 ClearPageChecked(p);
517 set_page_dirty(p); 518 set_page_dirty(p);
518 } 519 }
519 520
520 /* 521 /*
521 * we've only changed i_size in ram, and we haven't updated 522 * we've only changed i_size in ram, and we haven't updated
522 * the disk i_size. There is no need to log the inode 523 * the disk i_size. There is no need to log the inode
523 * at this time. 524 * at this time.
524 */ 525 */
525 if (end_pos > isize) 526 if (end_pos > isize)
526 i_size_write(inode, end_pos); 527 i_size_write(inode, end_pos);
527 return 0; 528 return 0;
528 } 529 }
529 530
530 /* 531 /*
531 * this drops all the extents in the cache that intersect the range 532 * this drops all the extents in the cache that intersect the range
532 * [start, end]. Existing extents are split as required. 533 * [start, end]. Existing extents are split as required.
533 */ 534 */
534 void btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end, 535 void btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end,
535 int skip_pinned) 536 int skip_pinned)
536 { 537 {
537 struct extent_map *em; 538 struct extent_map *em;
538 struct extent_map *split = NULL; 539 struct extent_map *split = NULL;
539 struct extent_map *split2 = NULL; 540 struct extent_map *split2 = NULL;
540 struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; 541 struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
541 u64 len = end - start + 1; 542 u64 len = end - start + 1;
542 u64 gen; 543 u64 gen;
543 int ret; 544 int ret;
544 int testend = 1; 545 int testend = 1;
545 unsigned long flags; 546 unsigned long flags;
546 int compressed = 0; 547 int compressed = 0;
547 bool modified; 548 bool modified;
548 549
549 WARN_ON(end < start); 550 WARN_ON(end < start);
550 if (end == (u64)-1) { 551 if (end == (u64)-1) {
551 len = (u64)-1; 552 len = (u64)-1;
552 testend = 0; 553 testend = 0;
553 } 554 }
554 while (1) { 555 while (1) {
555 int no_splits = 0; 556 int no_splits = 0;
556 557
557 modified = false; 558 modified = false;
558 if (!split) 559 if (!split)
559 split = alloc_extent_map(); 560 split = alloc_extent_map();
560 if (!split2) 561 if (!split2)
561 split2 = alloc_extent_map(); 562 split2 = alloc_extent_map();
562 if (!split || !split2) 563 if (!split || !split2)
563 no_splits = 1; 564 no_splits = 1;
564 565
565 write_lock(&em_tree->lock); 566 write_lock(&em_tree->lock);
566 em = lookup_extent_mapping(em_tree, start, len); 567 em = lookup_extent_mapping(em_tree, start, len);
567 if (!em) { 568 if (!em) {
568 write_unlock(&em_tree->lock); 569 write_unlock(&em_tree->lock);
569 break; 570 break;
570 } 571 }
571 flags = em->flags; 572 flags = em->flags;
572 gen = em->generation; 573 gen = em->generation;
573 if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) { 574 if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) {
574 if (testend && em->start + em->len >= start + len) { 575 if (testend && em->start + em->len >= start + len) {
575 free_extent_map(em); 576 free_extent_map(em);
576 write_unlock(&em_tree->lock); 577 write_unlock(&em_tree->lock);
577 break; 578 break;
578 } 579 }
579 start = em->start + em->len; 580 start = em->start + em->len;
580 if (testend) 581 if (testend)
581 len = start + len - (em->start + em->len); 582 len = start + len - (em->start + em->len);
582 free_extent_map(em); 583 free_extent_map(em);
583 write_unlock(&em_tree->lock); 584 write_unlock(&em_tree->lock);
584 continue; 585 continue;
585 } 586 }
586 compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags); 587 compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
587 clear_bit(EXTENT_FLAG_PINNED, &em->flags); 588 clear_bit(EXTENT_FLAG_PINNED, &em->flags);
588 clear_bit(EXTENT_FLAG_LOGGING, &flags); 589 clear_bit(EXTENT_FLAG_LOGGING, &flags);
589 modified = !list_empty(&em->list); 590 modified = !list_empty(&em->list);
590 remove_extent_mapping(em_tree, em); 591 remove_extent_mapping(em_tree, em);
591 if (no_splits) 592 if (no_splits)
592 goto next; 593 goto next;
593 594
594 if (em->start < start) { 595 if (em->start < start) {
595 split->start = em->start; 596 split->start = em->start;
596 split->len = start - em->start; 597 split->len = start - em->start;
597 598
598 if (em->block_start < EXTENT_MAP_LAST_BYTE) { 599 if (em->block_start < EXTENT_MAP_LAST_BYTE) {
599 split->orig_start = em->orig_start; 600 split->orig_start = em->orig_start;
600 split->block_start = em->block_start; 601 split->block_start = em->block_start;
601 602
602 if (compressed) 603 if (compressed)
603 split->block_len = em->block_len; 604 split->block_len = em->block_len;
604 else 605 else
605 split->block_len = split->len; 606 split->block_len = split->len;
606 split->orig_block_len = max(split->block_len, 607 split->orig_block_len = max(split->block_len,
607 em->orig_block_len); 608 em->orig_block_len);
608 split->ram_bytes = em->ram_bytes; 609 split->ram_bytes = em->ram_bytes;
609 } else { 610 } else {
610 split->orig_start = split->start; 611 split->orig_start = split->start;
611 split->block_len = 0; 612 split->block_len = 0;
612 split->block_start = em->block_start; 613 split->block_start = em->block_start;
613 split->orig_block_len = 0; 614 split->orig_block_len = 0;
614 split->ram_bytes = split->len; 615 split->ram_bytes = split->len;
615 } 616 }
616 617
617 split->generation = gen; 618 split->generation = gen;
618 split->bdev = em->bdev; 619 split->bdev = em->bdev;
619 split->flags = flags; 620 split->flags = flags;
620 split->compress_type = em->compress_type; 621 split->compress_type = em->compress_type;
621 ret = add_extent_mapping(em_tree, split, modified); 622 ret = add_extent_mapping(em_tree, split, modified);
622 BUG_ON(ret); /* Logic error */ 623 BUG_ON(ret); /* Logic error */
623 free_extent_map(split); 624 free_extent_map(split);
624 split = split2; 625 split = split2;
625 split2 = NULL; 626 split2 = NULL;
626 } 627 }
627 if (testend && em->start + em->len > start + len) { 628 if (testend && em->start + em->len > start + len) {
628 u64 diff = start + len - em->start; 629 u64 diff = start + len - em->start;
629 630
630 split->start = start + len; 631 split->start = start + len;
631 split->len = em->start + em->len - (start + len); 632 split->len = em->start + em->len - (start + len);
632 split->bdev = em->bdev; 633 split->bdev = em->bdev;
633 split->flags = flags; 634 split->flags = flags;
634 split->compress_type = em->compress_type; 635 split->compress_type = em->compress_type;
635 split->generation = gen; 636 split->generation = gen;
636 637
637 if (em->block_start < EXTENT_MAP_LAST_BYTE) { 638 if (em->block_start < EXTENT_MAP_LAST_BYTE) {
638 split->orig_block_len = max(em->block_len, 639 split->orig_block_len = max(em->block_len,
639 em->orig_block_len); 640 em->orig_block_len);
640 641
641 split->ram_bytes = em->ram_bytes; 642 split->ram_bytes = em->ram_bytes;
642 if (compressed) { 643 if (compressed) {
643 split->block_len = em->block_len; 644 split->block_len = em->block_len;
644 split->block_start = em->block_start; 645 split->block_start = em->block_start;
645 split->orig_start = em->orig_start; 646 split->orig_start = em->orig_start;
646 } else { 647 } else {
647 split->block_len = split->len; 648 split->block_len = split->len;
648 split->block_start = em->block_start 649 split->block_start = em->block_start
649 + diff; 650 + diff;
650 split->orig_start = em->orig_start; 651 split->orig_start = em->orig_start;
651 } 652 }
652 } else { 653 } else {
653 split->ram_bytes = split->len; 654 split->ram_bytes = split->len;
654 split->orig_start = split->start; 655 split->orig_start = split->start;
655 split->block_len = 0; 656 split->block_len = 0;
656 split->block_start = em->block_start; 657 split->block_start = em->block_start;
657 split->orig_block_len = 0; 658 split->orig_block_len = 0;
658 } 659 }
659 660
660 ret = add_extent_mapping(em_tree, split, modified); 661 ret = add_extent_mapping(em_tree, split, modified);
661 BUG_ON(ret); /* Logic error */ 662 BUG_ON(ret); /* Logic error */
662 free_extent_map(split); 663 free_extent_map(split);
663 split = NULL; 664 split = NULL;
664 } 665 }
665 next: 666 next:
666 write_unlock(&em_tree->lock); 667 write_unlock(&em_tree->lock);
667 668
668 /* once for us */ 669 /* once for us */
669 free_extent_map(em); 670 free_extent_map(em);
670 /* once for the tree*/ 671 /* once for the tree*/
671 free_extent_map(em); 672 free_extent_map(em);
672 } 673 }
673 if (split) 674 if (split)
674 free_extent_map(split); 675 free_extent_map(split);
675 if (split2) 676 if (split2)
676 free_extent_map(split2); 677 free_extent_map(split2);
677 } 678 }
678 679
679 /* 680 /*
680 * this is very complex, but the basic idea is to drop all extents 681 * this is very complex, but the basic idea is to drop all extents
681 * in the range start - end. hint_block is filled in with a block number 682 * in the range start - end. hint_block is filled in with a block number
682 * that would be a good hint to the block allocator for this file. 683 * that would be a good hint to the block allocator for this file.
683 * 684 *
684 * If an extent intersects the range but is not entirely inside the range 685 * If an extent intersects the range but is not entirely inside the range
685 * it is either truncated or split. Anything entirely inside the range 686 * it is either truncated or split. Anything entirely inside the range
686 * is deleted from the tree. 687 * is deleted from the tree.
687 */ 688 */
int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
			 struct btrfs_root *root, struct inode *inode,
			 struct btrfs_path *path, u64 start, u64 end,
			 u64 *drop_end, int drop_cache)
{
	struct extent_buffer *leaf;
	struct btrfs_file_extent_item *fi;
	struct btrfs_key key;
	struct btrfs_key new_key;
	u64 ino = btrfs_ino(inode);
	u64 search_start = start;
	u64 disk_bytenr = 0;
	u64 num_bytes = 0;
	u64 extent_offset = 0;
	u64 extent_end = 0;
	int del_nr = 0;
	int del_slot = 0;
	int extent_type;
	int recow;
	int ret;
	int modify_tree = -1;
	int update_refs = (root->ref_cows || root == root->fs_info->tree_root);
	int found = 0;

	if (drop_cache)
		btrfs_drop_extent_cache(inode, start, end - 1, 0);

	if (start >= BTRFS_I(inode)->disk_i_size)
		modify_tree = 0;

	while (1) {
		recow = 0;
		ret = btrfs_lookup_file_extent(trans, root, path, ino,
					       search_start, modify_tree);
		if (ret < 0)
			break;
		if (ret > 0 && path->slots[0] > 0 && search_start == start) {
			leaf = path->nodes[0];
			btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
			if (key.objectid == ino &&
			    key.type == BTRFS_EXTENT_DATA_KEY)
				path->slots[0]--;
		}
		ret = 0;
next_slot:
		leaf = path->nodes[0];
		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
			BUG_ON(del_nr > 0);
			ret = btrfs_next_leaf(root, path);
			if (ret < 0)
				break;
			if (ret > 0) {
				ret = 0;
				break;
			}
			leaf = path->nodes[0];
			recow = 1;
		}

		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
		if (key.objectid > ino ||
		    key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
			break;

		fi = btrfs_item_ptr(leaf, path->slots[0],
				    struct btrfs_file_extent_item);
		extent_type = btrfs_file_extent_type(leaf, fi);

		if (extent_type == BTRFS_FILE_EXTENT_REG ||
		    extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
			disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
			num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
			extent_offset = btrfs_file_extent_offset(leaf, fi);
			extent_end = key.offset +
				btrfs_file_extent_num_bytes(leaf, fi);
		} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
			extent_end = key.offset +
				btrfs_file_extent_inline_len(leaf, fi);
		} else {
			WARN_ON(1);
			extent_end = search_start;
		}

		if (extent_end <= search_start) {
			path->slots[0]++;
			goto next_slot;
		}

		found = 1;
		search_start = max(key.offset, start);
		if (recow || !modify_tree) {
			modify_tree = -1;
			btrfs_release_path(path);
			continue;
		}

		/*
		 *     | - range to drop - |
		 *  | -------- extent -------- |
		 */
		if (start > key.offset && end < extent_end) {
			BUG_ON(del_nr > 0);
			BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE);

			memcpy(&new_key, &key, sizeof(new_key));
			new_key.offset = start;
			ret = btrfs_duplicate_item(trans, root, path,
						   &new_key);
			if (ret == -EAGAIN) {
				btrfs_release_path(path);
				continue;
			}
			if (ret < 0)
				break;

			leaf = path->nodes[0];
			fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
					    struct btrfs_file_extent_item);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							start - key.offset);

			fi = btrfs_item_ptr(leaf, path->slots[0],
					    struct btrfs_file_extent_item);

			extent_offset += start - key.offset;
			btrfs_set_file_extent_offset(leaf, fi, extent_offset);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							extent_end - start);
			btrfs_mark_buffer_dirty(leaf);

			if (update_refs && disk_bytenr > 0) {
				ret = btrfs_inc_extent_ref(trans, root,
						disk_bytenr, num_bytes, 0,
						root->root_key.objectid,
						new_key.objectid,
						start - extent_offset, 0);
				BUG_ON(ret); /* -ENOMEM */
			}
			key.offset = start;
		}
		/*
		 *  | ---- range to drop ----- |
		 *      | -------- extent -------- |
		 */
		if (start <= key.offset && end < extent_end) {
			BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE);

			memcpy(&new_key, &key, sizeof(new_key));
			new_key.offset = end;
			btrfs_set_item_key_safe(root, path, &new_key);

			extent_offset += end - key.offset;
			btrfs_set_file_extent_offset(leaf, fi, extent_offset);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							extent_end - end);
			btrfs_mark_buffer_dirty(leaf);
			if (update_refs && disk_bytenr > 0)
				inode_sub_bytes(inode, end - key.offset);
			break;
		}

		search_start = extent_end;
		/*
		 *       | ---- range to drop ----- |
		 *  | -------- extent -------- |
		 */
		if (start > key.offset && end >= extent_end) {
			BUG_ON(del_nr > 0);
			BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE);

			btrfs_set_file_extent_num_bytes(leaf, fi,
							start - key.offset);
			btrfs_mark_buffer_dirty(leaf);
			if (update_refs && disk_bytenr > 0)
				inode_sub_bytes(inode, extent_end - start);
			if (end == extent_end)
				break;

			path->slots[0]++;
			goto next_slot;
		}

		/*
		 *  | ---- range to drop ----- |
		 *    | ------ extent ------ |
		 */
		if (start <= key.offset && end >= extent_end) {
			if (del_nr == 0) {
				del_slot = path->slots[0];
				del_nr = 1;
			} else {
				BUG_ON(del_slot + del_nr != path->slots[0]);
				del_nr++;
			}

			if (update_refs &&
			    extent_type == BTRFS_FILE_EXTENT_INLINE) {
				inode_sub_bytes(inode,
						extent_end - key.offset);
				extent_end = ALIGN(extent_end,
						   root->sectorsize);
			} else if (update_refs && disk_bytenr > 0) {
				ret = btrfs_free_extent(trans, root,
						disk_bytenr, num_bytes, 0,
						root->root_key.objectid,
						key.objectid, key.offset -
						extent_offset, 0);
				BUG_ON(ret); /* -ENOMEM */
				inode_sub_bytes(inode,
						extent_end - key.offset);
			}

			if (end == extent_end)
				break;

			if (path->slots[0] + 1 < btrfs_header_nritems(leaf)) {
				path->slots[0]++;
				goto next_slot;
			}

			ret = btrfs_del_items(trans, root, path, del_slot,
					      del_nr);
			if (ret) {
				btrfs_abort_transaction(trans, root, ret);
				break;
			}

			del_nr = 0;
			del_slot = 0;

			btrfs_release_path(path);
			continue;
		}

		BUG_ON(1);
	}

	if (!ret && del_nr > 0) {
		ret = btrfs_del_items(trans, root, path, del_slot, del_nr);
		if (ret)
			btrfs_abort_transaction(trans, root, ret);
	}

	if (drop_end)
		*drop_end = found ? min(end, extent_end) : end;
	btrfs_release_path(path);
	return ret;
}

int btrfs_drop_extents(struct btrfs_trans_handle *trans,
		       struct btrfs_root *root, struct inode *inode, u64 start,
		       u64 end, int drop_cache)
{
	struct btrfs_path *path;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
	ret = __btrfs_drop_extents(trans, root, inode, path, start, end, NULL,
				   drop_cache);
	btrfs_free_path(path);
	return ret;
}

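/*
 * Helper for btrfs_mark_extent_written(): returns 1 if the file extent
 * item in @slot refers to the same on-disk extent (@bytenr/@orig_offset),
 * is a plain regular extent with no compression, encryption or other
 * encoding, and lines up with the *start/*end window the caller passes
 * in, so the two items can be merged; returns 0 otherwise.
 */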
static int extent_mergeable(struct extent_buffer *leaf, int slot,
			    u64 objectid, u64 bytenr, u64 orig_offset,
			    u64 *start, u64 *end)
{
	struct btrfs_file_extent_item *fi;
	struct btrfs_key key;
	u64 extent_end;

	if (slot < 0 || slot >= btrfs_header_nritems(leaf))
		return 0;

	btrfs_item_key_to_cpu(leaf, &key, slot);
	if (key.objectid != objectid || key.type != BTRFS_EXTENT_DATA_KEY)
		return 0;

	fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
	if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG ||
	    btrfs_file_extent_disk_bytenr(leaf, fi) != bytenr ||
	    btrfs_file_extent_offset(leaf, fi) != key.offset - orig_offset ||
	    btrfs_file_extent_compression(leaf, fi) ||
	    btrfs_file_extent_encryption(leaf, fi) ||
	    btrfs_file_extent_other_encoding(leaf, fi))
		return 0;

	extent_end = key.offset + btrfs_file_extent_num_bytes(leaf, fi);
	if ((*start && *start != key.offset) || (*end && *end != extent_end))
		return 0;

	*start = key.offset;
	*end = extent_end;
	return 1;
}

/*
 * Mark the extent in the range start - end as written.
 *
 * This changes the extent type from 'pre-allocated' to 'regular'. If only
 * part of the extent is marked as written, the extent will be split into
 * two or three.
 */
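/*
 * Where the written range butts up against a neighbouring extent item
 * that points at the same on-disk extent, the items are merged (see
 * extent_mergeable() above) instead of leaving extra splits behind.
 */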
int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
			      struct inode *inode, u64 start, u64 end)
{
	struct btrfs_root *root = BTRFS_I(inode)->root;
	struct extent_buffer *leaf;
	struct btrfs_path *path;
	struct btrfs_file_extent_item *fi;
	struct btrfs_key key;
	struct btrfs_key new_key;
	u64 bytenr;
	u64 num_bytes;
	u64 extent_end;
	u64 orig_offset;
	u64 other_start;
	u64 other_end;
	u64 split;
	int del_nr = 0;
	int del_slot = 0;
	int recow;
	int ret;
	u64 ino = btrfs_ino(inode);

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
again:
	recow = 0;
	split = start;
	key.objectid = ino;
	key.type = BTRFS_EXTENT_DATA_KEY;
	key.offset = split;

	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
	if (ret < 0)
		goto out;
	if (ret > 0 && path->slots[0] > 0)
		path->slots[0]--;

	leaf = path->nodes[0];
	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
	BUG_ON(key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY);
	fi = btrfs_item_ptr(leaf, path->slots[0],
			    struct btrfs_file_extent_item);
	BUG_ON(btrfs_file_extent_type(leaf, fi) !=
	       BTRFS_FILE_EXTENT_PREALLOC);
	extent_end = key.offset + btrfs_file_extent_num_bytes(leaf, fi);
	BUG_ON(key.offset > start || extent_end < end);

	bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
	num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
	orig_offset = key.offset - btrfs_file_extent_offset(leaf, fi);
	memcpy(&new_key, &key, sizeof(new_key));

	if (start == key.offset && end < extent_end) {
		other_start = 0;
		other_end = start;
		if (extent_mergeable(leaf, path->slots[0] - 1,
				     ino, bytenr, orig_offset,
				     &other_start, &other_end)) {
			new_key.offset = end;
			btrfs_set_item_key_safe(root, path, &new_key);
			fi = btrfs_item_ptr(leaf, path->slots[0],
					    struct btrfs_file_extent_item);
			btrfs_set_file_extent_generation(leaf, fi,
							 trans->transid);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							extent_end - end);
			btrfs_set_file_extent_offset(leaf, fi,
						     end - orig_offset);
			fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
					    struct btrfs_file_extent_item);
			btrfs_set_file_extent_generation(leaf, fi,
							 trans->transid);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							end - other_start);
			btrfs_mark_buffer_dirty(leaf);
			goto out;
		}
	}

	if (start > key.offset && end == extent_end) {
		other_start = end;
		other_end = 0;
		if (extent_mergeable(leaf, path->slots[0] + 1,
				     ino, bytenr, orig_offset,
				     &other_start, &other_end)) {
			fi = btrfs_item_ptr(leaf, path->slots[0],
					    struct btrfs_file_extent_item);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							start - key.offset);
			btrfs_set_file_extent_generation(leaf, fi,
							 trans->transid);
			path->slots[0]++;
			new_key.offset = start;
			btrfs_set_item_key_safe(root, path, &new_key);

			fi = btrfs_item_ptr(leaf, path->slots[0],
					    struct btrfs_file_extent_item);
			btrfs_set_file_extent_generation(leaf, fi,
							 trans->transid);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							other_end - start);
			btrfs_set_file_extent_offset(leaf, fi,
						     start - orig_offset);
			btrfs_mark_buffer_dirty(leaf);
			goto out;
		}
	}

	while (start > key.offset || end < extent_end) {
		if (key.offset == start)
			split = end;

		new_key.offset = split;
		ret = btrfs_duplicate_item(trans, root, path, &new_key);
		if (ret == -EAGAIN) {
			btrfs_release_path(path);
			goto again;
		}
		if (ret < 0) {
			btrfs_abort_transaction(trans, root, ret);
			goto out;
		}

		leaf = path->nodes[0];
		fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
				    struct btrfs_file_extent_item);
		btrfs_set_file_extent_generation(leaf, fi, trans->transid);
		btrfs_set_file_extent_num_bytes(leaf, fi,
						split - key.offset);

		fi = btrfs_item_ptr(leaf, path->slots[0],
				    struct btrfs_file_extent_item);

		btrfs_set_file_extent_generation(leaf, fi, trans->transid);
		btrfs_set_file_extent_offset(leaf, fi, split - orig_offset);
		btrfs_set_file_extent_num_bytes(leaf, fi,
						extent_end - split);
		btrfs_mark_buffer_dirty(leaf);

		ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0,
					   root->root_key.objectid,
					   ino, orig_offset, 0);
		BUG_ON(ret); /* -ENOMEM */

		if (split == start) {
			key.offset = start;
		} else {
			BUG_ON(start != key.offset);
			path->slots[0]--;
			extent_end = end;
		}
		recow = 1;
	}

	other_start = end;
	other_end = 0;
	if (extent_mergeable(leaf, path->slots[0] + 1,
			     ino, bytenr, orig_offset,
			     &other_start, &other_end)) {
		if (recow) {
			btrfs_release_path(path);
			goto again;
		}
		extent_end = other_end;
		del_slot = path->slots[0] + 1;
		del_nr++;
		ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
					0, root->root_key.objectid,
					ino, orig_offset, 0);
		BUG_ON(ret); /* -ENOMEM */
	}
	other_start = 0;
	other_end = start;
	if (extent_mergeable(leaf, path->slots[0] - 1,
			     ino, bytenr, orig_offset,
			     &other_start, &other_end)) {
		if (recow) {
			btrfs_release_path(path);
			goto again;
		}
		key.offset = other_start;
		del_slot = path->slots[0];
		del_nr++;
		ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
					0, root->root_key.objectid,
					ino, orig_offset, 0);
		BUG_ON(ret); /* -ENOMEM */
	}
	if (del_nr == 0) {
		fi = btrfs_item_ptr(leaf, path->slots[0],
				    struct btrfs_file_extent_item);
		btrfs_set_file_extent_type(leaf, fi,
					   BTRFS_FILE_EXTENT_REG);
		btrfs_set_file_extent_generation(leaf, fi, trans->transid);
		btrfs_mark_buffer_dirty(leaf);
	} else {
		fi = btrfs_item_ptr(leaf, del_slot - 1,
				    struct btrfs_file_extent_item);
		btrfs_set_file_extent_type(leaf, fi,
					   BTRFS_FILE_EXTENT_REG);
		btrfs_set_file_extent_generation(leaf, fi, trans->transid);
		btrfs_set_file_extent_num_bytes(leaf, fi,
						extent_end - key.offset);
		btrfs_mark_buffer_dirty(leaf);

		ret = btrfs_del_items(trans, root, path, del_slot, del_nr);
		if (ret < 0) {
			btrfs_abort_transaction(trans, root, ret);
			goto out;
		}
	}
out:
	btrfs_free_path(path);
	return 0;
}

/*
 * on error we return an unlocked page and the error value
 * on success we return a locked page and 0
 */
static int prepare_uptodate_page(struct page *page, u64 pos,
				 bool force_uptodate)
{
	int ret = 0;

	if (((pos & (PAGE_CACHE_SIZE - 1)) || force_uptodate) &&
	    !PageUptodate(page)) {
		ret = btrfs_readpage(NULL, page);
		if (ret)
			return ret;
		lock_page(page);
		if (!PageUptodate(page)) {
			unlock_page(page);
			return -EIO;
		}
	}
	return 0;
}

/*
 * this gets pages into the page cache and locks them down; it also properly
 * waits for data=ordered extents to finish before allowing the pages to be
 * modified.
 */
static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
				  struct page **pages, size_t num_pages,
				  loff_t pos, unsigned long first_index,
				  size_t write_bytes, bool force_uptodate)
{
	struct extent_state *cached_state = NULL;
	int i;
	unsigned long index = pos >> PAGE_CACHE_SHIFT;
	struct inode *inode = file_inode(file);
	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
	int err = 0;
	int faili = 0;
	u64 start_pos;
	u64 last_pos;

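	/*
	 * Round the requested byte range out to whole sectors and whole
	 * pages.  As an illustration (assuming 4K pages and a 4K
	 * sectorsize): pos = 6000 and num_pages = 2 give index = 1,
	 * start_pos = 4096 and last_pos = 12288, i.e. [4096, 12288) is the
	 * range that gets locked and waited on further down.
	 */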
	start_pos = pos & ~((u64)root->sectorsize - 1);
	last_pos = ((u64)index + num_pages) << PAGE_CACHE_SHIFT;

again:
	for (i = 0; i < num_pages; i++) {
		pages[i] = find_or_create_page(inode->i_mapping, index + i,
					       mask | __GFP_WRITE);
		if (!pages[i]) {
			faili = i - 1;
			err = -ENOMEM;
			goto fail;
		}

		if (i == 0)
			err = prepare_uptodate_page(pages[i], pos,
						    force_uptodate);
		if (i == num_pages - 1)
			err = prepare_uptodate_page(pages[i],
						    pos + write_bytes, false);
		if (err) {
			page_cache_release(pages[i]);
			faili = i - 1;
			goto fail;
		}
		wait_on_page_writeback(pages[i]);
	}
	err = 0;
	if (start_pos < inode->i_size) {
		struct btrfs_ordered_extent *ordered;
		lock_extent_bits(&BTRFS_I(inode)->io_tree,
				 start_pos, last_pos - 1, 0, &cached_state);
		ordered = btrfs_lookup_first_ordered_extent(inode,
							    last_pos - 1);
		if (ordered &&
		    ordered->file_offset + ordered->len > start_pos &&
		    ordered->file_offset < last_pos) {
			btrfs_put_ordered_extent(ordered);
			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
					     start_pos, last_pos - 1,
					     &cached_state, GFP_NOFS);
			for (i = 0; i < num_pages; i++) {
				unlock_page(pages[i]);
				page_cache_release(pages[i]);
			}
			btrfs_wait_ordered_range(inode, start_pos,
						 last_pos - start_pos);
			goto again;
		}
		if (ordered)
			btrfs_put_ordered_extent(ordered);

		clear_extent_bit(&BTRFS_I(inode)->io_tree, start_pos,
				 last_pos - 1, EXTENT_DIRTY | EXTENT_DELALLOC |
				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
				 0, 0, &cached_state, GFP_NOFS);
		unlock_extent_cached(&BTRFS_I(inode)->io_tree,
				     start_pos, last_pos - 1, &cached_state,
				     GFP_NOFS);
	}
	for (i = 0; i < num_pages; i++) {
		if (clear_page_dirty_for_io(pages[i]))
			account_page_redirty(pages[i]);
		set_page_extent_mapped(pages[i]);
		WARN_ON(!PageLocked(pages[i]));
	}
	return 0;
fail:
	while (faili >= 0) {
		unlock_page(pages[faili]);
		page_cache_release(pages[faili]);
		faili--;
	}
	return err;

}

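/*
 * See whether the write at @pos for *@write_bytes bytes can skip COW:
 * wait out any ordered extents that overlap the range, then ask
 * can_nocow_extent().  On success the delalloc-related extent bits for
 * the range are cleared, *@write_bytes is clamped to the length that can
 * actually be written in place, and a positive value is returned;
 * otherwise 0 is returned and the caller must take the normal COW path.
 */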
static noinline int check_can_nocow(struct inode *inode, loff_t pos,
				    size_t *write_bytes)
{
	struct btrfs_root *root = BTRFS_I(inode)->root;
	struct btrfs_ordered_extent *ordered;
	u64 lockstart, lockend;
	u64 num_bytes;
	int ret;

	lockstart = round_down(pos, root->sectorsize);
	lockend = lockstart + round_up(*write_bytes, root->sectorsize) - 1;

	while (1) {
		lock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend);
		ordered = btrfs_lookup_ordered_range(inode, lockstart,
						     lockend - lockstart + 1);
		if (!ordered) {
			break;
		}
		unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend);
		btrfs_start_ordered_extent(inode, ordered, 1);
		btrfs_put_ordered_extent(ordered);
	}

	num_bytes = lockend - lockstart + 1;
	ret = can_nocow_extent(inode, lockstart, &num_bytes, NULL, NULL, NULL);
	if (ret <= 0) {
		ret = 0;
	} else {
		clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
				 EXTENT_DIRTY | EXTENT_DELALLOC |
				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 0, 0,
				 NULL, GFP_NOFS);
		*write_bytes = min_t(size_t, *write_bytes, num_bytes);
	}

	unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend);

	return ret;
}

static noinline ssize_t __btrfs_buffered_write(struct file *file,
					       struct iov_iter *i,
					       loff_t pos)
{
	struct inode *inode = file_inode(file);
	struct btrfs_root *root = BTRFS_I(inode)->root;
	struct page **pages = NULL;
	u64 release_bytes = 0;
	unsigned long first_index;
	size_t num_written = 0;
	int nrptrs;
	int ret = 0;
	bool only_release_metadata = false;
	bool force_page_uptodate = false;

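	/*
	 * Size the pages[] array: enough pointers to cover the whole write,
	 * but never more than fit in one page of pointers and never more
	 * than this task's dirty-throttling headroom, with a floor of 8.
	 * For example (assuming 4K pages and 8-byte pointers) that is at
	 * most 512 pages, i.e. up to 2MB copied per pass of the loop below.
	 */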
	nrptrs = min((iov_iter_count(i) + PAGE_CACHE_SIZE - 1) /
		     PAGE_CACHE_SIZE, PAGE_CACHE_SIZE /
		     (sizeof(struct page *)));
	nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
	nrptrs = max(nrptrs, 8);
	pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	first_index = pos >> PAGE_CACHE_SHIFT;

	while (iov_iter_count(i) > 0) {
		size_t offset = pos & (PAGE_CACHE_SIZE - 1);
		size_t write_bytes = min(iov_iter_count(i),
					 nrptrs * (size_t)PAGE_CACHE_SIZE -
					 offset);
		size_t num_pages = (write_bytes + offset +
				    PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
		size_t reserve_bytes;
		size_t dirty_pages;
		size_t copied;

		WARN_ON(num_pages > nrptrs);

		/*
		 * Fault pages before locking them in prepare_pages
		 * to avoid recursive lock
		 */
		if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
			ret = -EFAULT;
			break;
		}

		reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
		ret = btrfs_check_data_free_space(inode, reserve_bytes);
		if (ret == -ENOSPC &&
		    (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
					      BTRFS_INODE_PREALLOC))) {
			ret = check_can_nocow(inode, pos, &write_bytes);
			if (ret > 0) {
				only_release_metadata = true;
				/*
				 * our prealloc extent may be smaller than
				 * write_bytes, so scale down.
				 */
				num_pages = (write_bytes + offset +
					     PAGE_CACHE_SIZE - 1) >>
					     PAGE_CACHE_SHIFT;
				reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
				ret = 0;
			} else {
				ret = -ENOSPC;
			}
		}

		if (ret)
			break;

		ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes);
		if (ret) {
			if (!only_release_metadata)
				btrfs_free_reserved_data_space(inode,
							       reserve_bytes);
			break;
		}

		release_bytes = reserve_bytes;

		/*
		 * This is going to setup the pages array with the number of
		 * pages we want, so we don't really need to worry about the
		 * contents of pages from loop to loop
		 */
		ret = prepare_pages(root, file, pages, num_pages,
				    pos, first_index, write_bytes,
				    force_page_uptodate);
		if (ret)
			break;

		copied = btrfs_copy_from_user(pos, num_pages,
					      write_bytes, pages, i);

		/*
		 * if we have trouble faulting in the pages, fall
		 * back to one page at a time
		 */
		if (copied < write_bytes)
			nrptrs = 1;

		if (copied == 0) {
			force_page_uptodate = true;
			dirty_pages = 0;
		} else {
			force_page_uptodate = false;
			dirty_pages = (copied + offset +
				       PAGE_CACHE_SIZE - 1) >>
				       PAGE_CACHE_SHIFT;
		}

		/*
		 * If we had a short copy we need to release the excess
		 * delalloc bytes we reserved. We need to increment
		 * outstanding_extents because btrfs_delalloc_release_space
		 * will decrement it, but we still have an outstanding extent
		 * for the chunk we actually managed to copy.
		 */
		if (num_pages > dirty_pages) {
			release_bytes = (num_pages - dirty_pages) <<
					PAGE_CACHE_SHIFT;
			if (copied > 0) {
				spin_lock(&BTRFS_I(inode)->lock);
				BTRFS_I(inode)->outstanding_extents++;
				spin_unlock(&BTRFS_I(inode)->lock);
			}
			if (only_release_metadata)
				btrfs_delalloc_release_metadata(inode,
								release_bytes);
			else
				btrfs_delalloc_release_space(inode,
							     release_bytes);
		}

		release_bytes = dirty_pages << PAGE_CACHE_SHIFT;
		if (copied > 0) {
			ret = btrfs_dirty_pages(root, inode, pages,
						dirty_pages, pos, copied,
						NULL);
			if (ret) {
				btrfs_drop_pages(pages, num_pages);
				break;
			}
		}

		release_bytes = 0;
		btrfs_drop_pages(pages, num_pages);

		if (only_release_metadata && copied > 0) {
			u64 lockstart = round_down(pos, root->sectorsize);
			u64 lockend = lockstart +
				(dirty_pages << PAGE_CACHE_SHIFT) - 1;

			set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
				       lockend, EXTENT_NORESERVE, NULL,
				       NULL, GFP_NOFS);
			only_release_metadata = false;
		}

		cond_resched();

		balance_dirty_pages_ratelimited(inode->i_mapping);
		if (dirty_pages < (root->leafsize >> PAGE_CACHE_SHIFT) + 1)
			btrfs_btree_balance_dirty(root);

		pos += copied;
		num_written += copied;
	}

	kfree(pages);

	if (release_bytes) {
		if (only_release_metadata)
			btrfs_delalloc_release_metadata(inode, release_bytes);
		else
			btrfs_delalloc_release_space(inode, release_bytes);
	}

	return num_written ? num_written : ret;
}

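/*
 * O_DIRECT write path.  If the direct write comes up short, fall back to
 * buffered writes for whatever is left, then write back and invalidate
 * that part of the page cache so a later direct read sees the new data.
 */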
static ssize_t __btrfs_direct_write(struct kiocb *iocb,
				    const struct iovec *iov,
				    unsigned long nr_segs, loff_t pos,
				    loff_t *ppos, size_t count, size_t ocount)
{
	struct file *file = iocb->ki_filp;
	struct iov_iter i;
	ssize_t written;
	ssize_t written_buffered;
	loff_t endbyte;
	int err;

	written = generic_file_direct_write(iocb, iov, &nr_segs, pos, ppos,
					    count, ocount);

	if (written < 0 || written == count)
		return written;

	pos += written;
	count -= written;
	iov_iter_init(&i, iov, nr_segs, count, written);
	written_buffered = __btrfs_buffered_write(file, &i, pos);
	if (written_buffered < 0) {
		err = written_buffered;
		goto out;
	}
	endbyte = pos + written_buffered - 1;
	err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
	if (err)
		goto out;
	written += written_buffered;
	*ppos = pos + written_buffered;
	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_CACHE_SHIFT,
				 endbyte >> PAGE_CACHE_SHIFT);
out:
	return written ? written : err;
}

static void update_time_for_write(struct inode *inode)
{
	struct timespec now;

	if (IS_NOCMTIME(inode))
		return;

	now = current_fs_time(inode->i_sb);
	if (!timespec_equal(&inode->i_mtime, &now))
		inode->i_mtime = now;

	if (!timespec_equal(&inode->i_ctime, &now))
		inode->i_ctime = now;

	if (IS_I_VERSION(inode))
		inode_inc_iversion(inode);
}

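/*
 * Common entry point for buffered and O_DIRECT writes: validate and trim
 * the request, update the timestamps, extend the file if the write starts
 * beyond i_size, hand off to the direct or buffered path, and finally let
 * generic_write_sync() take care of O_SYNC/O_DSYNC semantics.
 */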
1610 static ssize_t btrfs_file_aio_write(struct kiocb *iocb, 1611 static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
1611 const struct iovec *iov, 1612 const struct iovec *iov,
1612 unsigned long nr_segs, loff_t pos) 1613 unsigned long nr_segs, loff_t pos)
1613 { 1614 {
1614 struct file *file = iocb->ki_filp; 1615 struct file *file = iocb->ki_filp;
1615 struct inode *inode = file_inode(file); 1616 struct inode *inode = file_inode(file);
1616 struct btrfs_root *root = BTRFS_I(inode)->root; 1617 struct btrfs_root *root = BTRFS_I(inode)->root;
1617 loff_t *ppos = &iocb->ki_pos; 1618 loff_t *ppos = &iocb->ki_pos;
1618 u64 start_pos; 1619 u64 start_pos;
1619 ssize_t num_written = 0; 1620 ssize_t num_written = 0;
1620 ssize_t err = 0; 1621 ssize_t err = 0;
1621 size_t count, ocount; 1622 size_t count, ocount;
1622 bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host); 1623 bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
1623 1624
1624 mutex_lock(&inode->i_mutex); 1625 mutex_lock(&inode->i_mutex);
1625 1626
1626 err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ); 1627 err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
1627 if (err) { 1628 if (err) {
1628 mutex_unlock(&inode->i_mutex); 1629 mutex_unlock(&inode->i_mutex);
1629 goto out; 1630 goto out;
1630 } 1631 }
1631 count = ocount; 1632 count = ocount;
1632 1633
1633 current->backing_dev_info = inode->i_mapping->backing_dev_info; 1634 current->backing_dev_info = inode->i_mapping->backing_dev_info;
1634 err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); 1635 err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
1635 if (err) { 1636 if (err) {
1636 mutex_unlock(&inode->i_mutex); 1637 mutex_unlock(&inode->i_mutex);
1637 goto out; 1638 goto out;
1638 } 1639 }
1639 1640
1640 if (count == 0) { 1641 if (count == 0) {
1641 mutex_unlock(&inode->i_mutex); 1642 mutex_unlock(&inode->i_mutex);
1642 goto out; 1643 goto out;
1643 } 1644 }
1644 1645
1645 err = file_remove_suid(file); 1646 err = file_remove_suid(file);
1646 if (err) { 1647 if (err) {
1647 mutex_unlock(&inode->i_mutex); 1648 mutex_unlock(&inode->i_mutex);
1648 goto out; 1649 goto out;
1649 } 1650 }
1650 1651
1651 /* 1652 /*
1652 * If BTRFS flips readonly due to some impossible error 1653 * If BTRFS flips readonly due to some impossible error
1653 * (fs_info->fs_state now has BTRFS_SUPER_FLAG_ERROR), 1654 * (fs_info->fs_state now has BTRFS_SUPER_FLAG_ERROR),
1654 * although we have opened a file as writable, we have 1655 * although we have opened a file as writable, we have
1655 * to stop this write operation to ensure FS consistency. 1656 * to stop this write operation to ensure FS consistency.
1656 */ 1657 */
1657 if (test_bit(BTRFS_FS_STATE_ERROR, &root->fs_info->fs_state)) { 1658 if (test_bit(BTRFS_FS_STATE_ERROR, &root->fs_info->fs_state)) {
1658 mutex_unlock(&inode->i_mutex); 1659 mutex_unlock(&inode->i_mutex);
1659 err = -EROFS; 1660 err = -EROFS;
1660 goto out; 1661 goto out;
1661 } 1662 }
1662 1663
1663 /* 1664 /*
1664 * We reserve space for updating the inode when we reserve space for the 1665 * We reserve space for updating the inode when we reserve space for the
1665 * extent we are going to write, so we will enospc out there. We don't 1666 * extent we are going to write, so we will enospc out there. We don't
1666 * need to start yet another transaction to update the inode as we will 1667 * need to start yet another transaction to update the inode as we will
1667 * update the inode when we finish writing whatever data we write. 1668 * update the inode when we finish writing whatever data we write.
1668 */ 1669 */
1669 update_time_for_write(inode); 1670 update_time_for_write(inode);
1670 1671
1671 start_pos = round_down(pos, root->sectorsize); 1672 start_pos = round_down(pos, root->sectorsize);
1672 if (start_pos > i_size_read(inode)) { 1673 if (start_pos > i_size_read(inode)) {
1673 err = btrfs_cont_expand(inode, i_size_read(inode), start_pos); 1674 err = btrfs_cont_expand(inode, i_size_read(inode), start_pos);
1674 if (err) { 1675 if (err) {
1675 mutex_unlock(&inode->i_mutex); 1676 mutex_unlock(&inode->i_mutex);
1676 goto out; 1677 goto out;
1677 } 1678 }
1678 } 1679 }
1679 1680
1680 if (sync) 1681 if (sync)
1681 atomic_inc(&BTRFS_I(inode)->sync_writers); 1682 atomic_inc(&BTRFS_I(inode)->sync_writers);
1682 1683
1683 if (unlikely(file->f_flags & O_DIRECT)) { 1684 if (unlikely(file->f_flags & O_DIRECT)) {
1684 num_written = __btrfs_direct_write(iocb, iov, nr_segs, 1685 num_written = __btrfs_direct_write(iocb, iov, nr_segs,
1685 pos, ppos, count, ocount); 1686 pos, ppos, count, ocount);
1686 } else { 1687 } else {
1687 struct iov_iter i; 1688 struct iov_iter i;
1688 1689
1689 iov_iter_init(&i, iov, nr_segs, count, num_written); 1690 iov_iter_init(&i, iov, nr_segs, count, num_written);
1690 1691
1691 num_written = __btrfs_buffered_write(file, &i, pos); 1692 num_written = __btrfs_buffered_write(file, &i, pos);
1692 if (num_written > 0) 1693 if (num_written > 0)
1693 *ppos = pos + num_written; 1694 *ppos = pos + num_written;
1694 } 1695 }
1695 1696
1696 mutex_unlock(&inode->i_mutex); 1697 mutex_unlock(&inode->i_mutex);
1697 1698
1698 /* 1699 /*
1699 * we want to make sure fsync finds this change 1700 * we want to make sure fsync finds this change
1700 * but we haven't joined a transaction running right now. 1701 * but we haven't joined a transaction running right now.
1701 * 1702 *
1702 * Later on, someone is sure to update the inode and get the 1703 * Later on, someone is sure to update the inode and get the
1703 * real transid recorded. 1704 * real transid recorded.
1704 * 1705 *
1705 * We set last_trans now to the fs_info generation + 1, 1706 * We set last_trans now to the fs_info generation + 1,
1706 * this will either be one more than the running transaction 1707 * this will either be one more than the running transaction
1707 * or the generation used for the next transaction if there isn't 1708 * or the generation used for the next transaction if there isn't
1708 * one running right now. 1709 * one running right now.
1709 * 1710 *
1710 * We also have to set last_sub_trans to the current log transid, 1711 * We also have to set last_sub_trans to the current log transid,
1711 * otherwise subsequent syncs to a file that's been synced in this 1712 * otherwise subsequent syncs to a file that's been synced in this
1712 * transaction will appear to have already occurred. 1713 * transaction will appear to have already occurred.
1713 */ 1714 */
1714 BTRFS_I(inode)->last_trans = root->fs_info->generation + 1; 1715 BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
1715 BTRFS_I(inode)->last_sub_trans = root->log_transid; 1716 BTRFS_I(inode)->last_sub_trans = root->log_transid;
1716 if (num_written > 0) { 1717 if (num_written > 0) {
1717 err = generic_write_sync(file, pos, num_written); 1718 err = generic_write_sync(file, pos, num_written);
1718 if (err < 0 && num_written > 0) 1719 if (err < 0 && num_written > 0)
1719 num_written = err; 1720 num_written = err;
1720 } 1721 }
1721 1722
1722 if (sync) 1723 if (sync)
1723 atomic_dec(&BTRFS_I(inode)->sync_writers); 1724 atomic_dec(&BTRFS_I(inode)->sync_writers);
1724 out: 1725 out:
1725 current->backing_dev_info = NULL; 1726 current->backing_dev_info = NULL;
1726 return num_written ? num_written : err; 1727 return num_written ? num_written : err;
1727 } 1728 }
1728 1729
1729 int btrfs_release_file(struct inode *inode, struct file *filp) 1730 int btrfs_release_file(struct inode *inode, struct file *filp)
1730 { 1731 {
1731 /* 1732 /*
1732 * ordered_data_close is set by setattr when we are about to truncate 1733 * ordered_data_close is set by setattr when we are about to truncate
1733 * a file from a non-zero size to a zero size. This tries to 1734 * a file from a non-zero size to a zero size. This tries to
1734 * flush down new bytes that may have been written if the 1735 * flush down new bytes that may have been written if the
1735 * application were using truncate to replace a file in place. 1736 * application were using truncate to replace a file in place.
1736 */ 1737 */
1737 if (test_and_clear_bit(BTRFS_INODE_ORDERED_DATA_CLOSE, 1738 if (test_and_clear_bit(BTRFS_INODE_ORDERED_DATA_CLOSE,
1738 &BTRFS_I(inode)->runtime_flags)) { 1739 &BTRFS_I(inode)->runtime_flags)) {
1739 struct btrfs_trans_handle *trans; 1740 struct btrfs_trans_handle *trans;
1740 struct btrfs_root *root = BTRFS_I(inode)->root; 1741 struct btrfs_root *root = BTRFS_I(inode)->root;
1741 1742
1742 /* 1743 /*
1743 * We need to block on a committing transaction to keep us from 1744 * We need to block on a committing transaction to keep us from
1744 * throwing an ordered operation onto the list and causing 1745 * throwing an ordered operation onto the list and causing
1745 * something like sync to deadlock trying to flush out this 1746 * something like sync to deadlock trying to flush out this
1746 * inode. 1747 * inode.
1747 */ 1748 */
1748 trans = btrfs_start_transaction(root, 0); 1749 trans = btrfs_start_transaction(root, 0);
1749 if (IS_ERR(trans)) 1750 if (IS_ERR(trans))
1750 return PTR_ERR(trans); 1751 return PTR_ERR(trans);
1751 btrfs_add_ordered_operation(trans, BTRFS_I(inode)->root, inode); 1752 btrfs_add_ordered_operation(trans, BTRFS_I(inode)->root, inode);
1752 btrfs_end_transaction(trans, root); 1753 btrfs_end_transaction(trans, root);
1753 if (inode->i_size > BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT) 1754 if (inode->i_size > BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT)
1754 filemap_flush(inode->i_mapping); 1755 filemap_flush(inode->i_mapping);
1755 } 1756 }
1756 if (filp->private_data) 1757 if (filp->private_data)
1757 btrfs_ioctl_trans_end(filp); 1758 btrfs_ioctl_trans_end(filp);
1758 return 0; 1759 return 0;
1759 } 1760 }
1760 1761
1761 /* 1762 /*
1762 * fsync call for both files and directories. This logs the inode into 1763 * fsync call for both files and directories. This logs the inode into
1763 * the tree log instead of forcing full commits whenever possible. 1764 * the tree log instead of forcing full commits whenever possible.
1764 * 1765 *
1765 * It needs to call filemap_fdatawait so that all ordered extent updates 1766 * It needs to call filemap_fdatawait so that all ordered extent updates
1766 * in the metadata btree are up to date for copying to the log. 1767 * in the metadata btree are up to date for copying to the log.
1767 * 1768 *
1768 * It drops the inode mutex before doing the tree log commit. This is an 1769 * It drops the inode mutex before doing the tree log commit. This is an
1769 * important optimization for directories because holding the mutex prevents 1770 * important optimization for directories because holding the mutex prevents
1770 * new operations on the dir while we write to disk. 1771 * new operations on the dir while we write to disk.
1771 */ 1772 */
1772 int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) 1773 int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
1773 { 1774 {
1774 struct dentry *dentry = file->f_path.dentry; 1775 struct dentry *dentry = file->f_path.dentry;
1775 struct inode *inode = dentry->d_inode; 1776 struct inode *inode = dentry->d_inode;
1776 struct btrfs_root *root = BTRFS_I(inode)->root; 1777 struct btrfs_root *root = BTRFS_I(inode)->root;
1777 int ret = 0; 1778 int ret = 0;
1778 struct btrfs_trans_handle *trans; 1779 struct btrfs_trans_handle *trans;
1779 bool full_sync = 0; 1780 bool full_sync = 0;
1780 1781
1781 trace_btrfs_sync_file(file, datasync); 1782 trace_btrfs_sync_file(file, datasync);
1782 1783
1783 /* 1784 /*
1784 * We write the dirty pages in the range and wait until they complete 1785 * We write the dirty pages in the range and wait until they complete
1785 * outside of the ->i_mutex, so that multiple tasks can flush dirty 1786 * outside of the ->i_mutex, so that multiple tasks can flush dirty
1786 * pages concurrently and improve performance. See 1787 * pages concurrently and improve performance. See
1787 * btrfs_wait_ordered_range for an explanation of the ASYNC check. 1788 * btrfs_wait_ordered_range for an explanation of the ASYNC check.
1788 */ 1789 */
1789 atomic_inc(&BTRFS_I(inode)->sync_writers); 1790 atomic_inc(&BTRFS_I(inode)->sync_writers);
1790 ret = filemap_fdatawrite_range(inode->i_mapping, start, end); 1791 ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
1791 if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, 1792 if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
1792 &BTRFS_I(inode)->runtime_flags)) 1793 &BTRFS_I(inode)->runtime_flags))
1793 ret = filemap_fdatawrite_range(inode->i_mapping, start, end); 1794 ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
1794 atomic_dec(&BTRFS_I(inode)->sync_writers); 1795 atomic_dec(&BTRFS_I(inode)->sync_writers);
1795 if (ret) 1796 if (ret)
1796 return ret; 1797 return ret;
1797 1798
1798 mutex_lock(&inode->i_mutex); 1799 mutex_lock(&inode->i_mutex);
1799 1800
1800 /* 1801 /*
1801 * We flush the dirty pages again to avoid some dirty pages in the 1802 * We flush the dirty pages again to avoid some dirty pages in the
1802 * range being left. 1803 * range being left.
1803 */ 1804 */
1804 atomic_inc(&root->log_batch); 1805 atomic_inc(&root->log_batch);
1805 full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, 1806 full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
1806 &BTRFS_I(inode)->runtime_flags); 1807 &BTRFS_I(inode)->runtime_flags);
1807 if (full_sync) 1808 if (full_sync)
1808 btrfs_wait_ordered_range(inode, start, end - start + 1); 1809 btrfs_wait_ordered_range(inode, start, end - start + 1);
1809 atomic_inc(&root->log_batch); 1810 atomic_inc(&root->log_batch);
1810 1811
1811 /* 1812 /*
1812 * check the transaction that last modified this inode 1813 * check the transaction that last modified this inode
1813 * and see if it's already been committed 1814 * and see if it's already been committed
1814 */ 1815 */
1815 if (!BTRFS_I(inode)->last_trans) { 1816 if (!BTRFS_I(inode)->last_trans) {
1816 mutex_unlock(&inode->i_mutex); 1817 mutex_unlock(&inode->i_mutex);
1817 goto out; 1818 goto out;
1818 } 1819 }
1819 1820
1820 /* 1821 /*
1821 * if the last transaction that changed this file was before 1822 * if the last transaction that changed this file was before
1822 * the current transaction, we can bail out now without any 1823 * the current transaction, we can bail out now without any
1823 * syncing 1824 * syncing
1824 */ 1825 */
1825 smp_mb(); 1826 smp_mb();
1826 if (btrfs_inode_in_log(inode, root->fs_info->generation) || 1827 if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
1827 BTRFS_I(inode)->last_trans <= 1828 BTRFS_I(inode)->last_trans <=
1828 root->fs_info->last_trans_committed) { 1829 root->fs_info->last_trans_committed) {
1829 BTRFS_I(inode)->last_trans = 0; 1830 BTRFS_I(inode)->last_trans = 0;
1830 1831
1831 /* 1832 /*
1832 * We've had everything committed since the last time we were 1833 * We've had everything committed since the last time we were
1833 * modified so clear this flag in case it was set for whatever 1834 * modified so clear this flag in case it was set for whatever
1834 * reason, it's no longer relevant. 1835 * reason, it's no longer relevant.
1835 */ 1836 */
1836 clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC, 1837 clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
1837 &BTRFS_I(inode)->runtime_flags); 1838 &BTRFS_I(inode)->runtime_flags);
1838 mutex_unlock(&inode->i_mutex); 1839 mutex_unlock(&inode->i_mutex);
1839 goto out; 1840 goto out;
1840 } 1841 }
1841 1842
1842 /* 1843 /*
1843 * ok we haven't committed the transaction yet, let's do a commit 1844 * ok we haven't committed the transaction yet, let's do a commit
1844 */ 1845 */
1845 if (file->private_data) 1846 if (file->private_data)
1846 btrfs_ioctl_trans_end(file); 1847 btrfs_ioctl_trans_end(file);
1847 1848
1848 trans = btrfs_start_transaction(root, 0); 1849 trans = btrfs_start_transaction(root, 0);
1849 if (IS_ERR(trans)) { 1850 if (IS_ERR(trans)) {
1850 ret = PTR_ERR(trans); 1851 ret = PTR_ERR(trans);
1851 mutex_unlock(&inode->i_mutex); 1852 mutex_unlock(&inode->i_mutex);
1852 goto out; 1853 goto out;
1853 } 1854 }
1854 1855
1855 ret = btrfs_log_dentry_safe(trans, root, dentry); 1856 ret = btrfs_log_dentry_safe(trans, root, dentry);
1856 if (ret < 0) { 1857 if (ret < 0) {
1857 /* Fallthrough and commit/free transaction. */ 1858 /* Fallthrough and commit/free transaction. */
1858 ret = 1; 1859 ret = 1;
1859 } 1860 }
1860 1861
1861 /* we've logged all the items and now have a consistent 1862 /* we've logged all the items and now have a consistent
1862 * version of the file in the log. It is possible that 1863 * version of the file in the log. It is possible that
1863 * someone will come in and modify the file, but that's 1864 * someone will come in and modify the file, but that's
1864 * fine because the log is consistent on disk, and we 1865 * fine because the log is consistent on disk, and we
1865 * have references to all of the file's extents 1866 * have references to all of the file's extents
1866 * 1867 *
1867 * It is possible that someone will come in and log the 1868 * It is possible that someone will come in and log the
1868 * file again, but that will end up using the synchronization 1869 * file again, but that will end up using the synchronization
1869 * inside btrfs_sync_log to keep things safe. 1870 * inside btrfs_sync_log to keep things safe.
1870 */ 1871 */
1871 mutex_unlock(&inode->i_mutex); 1872 mutex_unlock(&inode->i_mutex);
1872 1873
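	/*
	 * Unless logging was skipped entirely (BTRFS_NO_LOG_SYNC), either sync
	 * the tree log (ret == 0) or fall back to a full transaction commit
	 * (ret > 0, e.g. when logging the inode failed).
	 */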
1873 if (ret != BTRFS_NO_LOG_SYNC) { 1874 if (ret != BTRFS_NO_LOG_SYNC) {
1874 if (ret > 0) { 1875 if (ret > 0) {
1875 /* 1876 /*
1876 * If we didn't already wait for ordered extents we need 1877 * If we didn't already wait for ordered extents we need
1877 * to do that now. 1878 * to do that now.
1878 */ 1879 */
1879 if (!full_sync) 1880 if (!full_sync)
1880 btrfs_wait_ordered_range(inode, start, 1881 btrfs_wait_ordered_range(inode, start,
1881 end - start + 1); 1882 end - start + 1);
1882 ret = btrfs_commit_transaction(trans, root); 1883 ret = btrfs_commit_transaction(trans, root);
1883 } else { 1884 } else {
1884 ret = btrfs_sync_log(trans, root); 1885 ret = btrfs_sync_log(trans, root);
1885 if (ret == 0) { 1886 if (ret == 0) {
1886 ret = btrfs_end_transaction(trans, root); 1887 ret = btrfs_end_transaction(trans, root);
1887 } else { 1888 } else {
1888 if (!full_sync) 1889 if (!full_sync)
1889 btrfs_wait_ordered_range(inode, start, 1890 btrfs_wait_ordered_range(inode, start,
1890 end - 1891 end -
1891 start + 1); 1892 start + 1);
1892 ret = btrfs_commit_transaction(trans, root); 1893 ret = btrfs_commit_transaction(trans, root);
1893 } 1894 }
1894 } 1895 }
1895 } else { 1896 } else {
1896 ret = btrfs_end_transaction(trans, root); 1897 ret = btrfs_end_transaction(trans, root);
1897 } 1898 }
1898 out: 1899 out:
1899 return ret > 0 ? -EIO : ret; 1900 return ret > 0 ? -EIO : ret;
1900 } 1901 }
1901 1902
1902 static const struct vm_operations_struct btrfs_file_vm_ops = { 1903 static const struct vm_operations_struct btrfs_file_vm_ops = {
1903 .fault = filemap_fault, 1904 .fault = filemap_fault,
1904 .page_mkwrite = btrfs_page_mkwrite, 1905 .page_mkwrite = btrfs_page_mkwrite,
1905 .remap_pages = generic_file_remap_pages, 1906 .remap_pages = generic_file_remap_pages,
1906 }; 1907 };
1907 1908
1908 static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma) 1909 static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma)
1909 { 1910 {
1910 struct address_space *mapping = filp->f_mapping; 1911 struct address_space *mapping = filp->f_mapping;
1911 1912
1912 if (!mapping->a_ops->readpage) 1913 if (!mapping->a_ops->readpage)
1913 return -ENOEXEC; 1914 return -ENOEXEC;
1914 1915
1915 file_accessed(filp); 1916 file_accessed(filp);
1916 vma->vm_ops = &btrfs_file_vm_ops; 1917 vma->vm_ops = &btrfs_file_vm_ops;
1917 1918
1918 return 0; 1919 return 0;
1919 } 1920 }
1920 1921
1921 static int hole_mergeable(struct inode *inode, struct extent_buffer *leaf, 1922 static int hole_mergeable(struct inode *inode, struct extent_buffer *leaf,
1922 int slot, u64 start, u64 end) 1923 int slot, u64 start, u64 end)
1923 { 1924 {
1924 struct btrfs_file_extent_item *fi; 1925 struct btrfs_file_extent_item *fi;
1925 struct btrfs_key key; 1926 struct btrfs_key key;
1926 1927
1927 if (slot < 0 || slot >= btrfs_header_nritems(leaf)) 1928 if (slot < 0 || slot >= btrfs_header_nritems(leaf))
1928 return 0; 1929 return 0;
1929 1930
1930 btrfs_item_key_to_cpu(leaf, &key, slot); 1931 btrfs_item_key_to_cpu(leaf, &key, slot);
1931 if (key.objectid != btrfs_ino(inode) || 1932 if (key.objectid != btrfs_ino(inode) ||
1932 key.type != BTRFS_EXTENT_DATA_KEY) 1933 key.type != BTRFS_EXTENT_DATA_KEY)
1933 return 0; 1934 return 0;
1934 1935
1935 fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item); 1936 fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
1936 1937
1937 if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG) 1938 if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG)
1938 return 0; 1939 return 0;
1939 1940
1940 if (btrfs_file_extent_disk_bytenr(leaf, fi)) 1941 if (btrfs_file_extent_disk_bytenr(leaf, fi))
1941 return 0; 1942 return 0;
1942 1943
1943 if (key.offset == end) 1944 if (key.offset == end)
1944 return 1; 1945 return 1;
1945 if (key.offset + btrfs_file_extent_num_bytes(leaf, fi) == start) 1946 if (key.offset + btrfs_file_extent_num_bytes(leaf, fi) == start)
1946 return 1; 1947 return 1;
1947 return 0; 1948 return 0;
1948 } 1949 }
1949 1950
1950 static int fill_holes(struct btrfs_trans_handle *trans, struct inode *inode, 1951 static int fill_holes(struct btrfs_trans_handle *trans, struct inode *inode,
1951 struct btrfs_path *path, u64 offset, u64 end) 1952 struct btrfs_path *path, u64 offset, u64 end)
1952 { 1953 {
1953 struct btrfs_root *root = BTRFS_I(inode)->root; 1954 struct btrfs_root *root = BTRFS_I(inode)->root;
1954 struct extent_buffer *leaf; 1955 struct extent_buffer *leaf;
1955 struct btrfs_file_extent_item *fi; 1956 struct btrfs_file_extent_item *fi;
1956 struct extent_map *hole_em; 1957 struct extent_map *hole_em;
1957 struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; 1958 struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
1958 struct btrfs_key key; 1959 struct btrfs_key key;
1959 int ret; 1960 int ret;
1960 1961
1961 key.objectid = btrfs_ino(inode); 1962 key.objectid = btrfs_ino(inode);
1962 key.type = BTRFS_EXTENT_DATA_KEY; 1963 key.type = BTRFS_EXTENT_DATA_KEY;
1963 key.offset = offset; 1964 key.offset = offset;
1964 1965
1965 1966
1966 ret = btrfs_search_slot(trans, root, &key, path, 0, 1); 1967 ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
1967 if (ret < 0) 1968 if (ret < 0)
1968 return ret; 1969 return ret;
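	/*
	 * An exact key match (ret == 0) would mean a file extent item already
	 * starts at 'offset'; the caller has just dropped every extent in this
	 * range, so that should be impossible.
	 */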
1969 BUG_ON(!ret); 1970 BUG_ON(!ret);
1970 1971
1971 leaf = path->nodes[0]; 1972 leaf = path->nodes[0];
1972 if (hole_mergeable(inode, leaf, path->slots[0]-1, offset, end)) { 1973 if (hole_mergeable(inode, leaf, path->slots[0]-1, offset, end)) {
1973 u64 num_bytes; 1974 u64 num_bytes;
1974 1975
1975 path->slots[0]--; 1976 path->slots[0]--;
1976 fi = btrfs_item_ptr(leaf, path->slots[0], 1977 fi = btrfs_item_ptr(leaf, path->slots[0],
1977 struct btrfs_file_extent_item); 1978 struct btrfs_file_extent_item);
1978 num_bytes = btrfs_file_extent_num_bytes(leaf, fi) + 1979 num_bytes = btrfs_file_extent_num_bytes(leaf, fi) +
1979 end - offset; 1980 end - offset;
1980 btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes); 1981 btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes);
1981 btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes); 1982 btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes);
1982 btrfs_set_file_extent_offset(leaf, fi, 0); 1983 btrfs_set_file_extent_offset(leaf, fi, 0);
1983 btrfs_mark_buffer_dirty(leaf); 1984 btrfs_mark_buffer_dirty(leaf);
1984 goto out; 1985 goto out;
1985 } 1986 }
1986 1987
1987 if (hole_mergeable(inode, leaf, path->slots[0]+1, offset, end)) { 1988 if (hole_mergeable(inode, leaf, path->slots[0]+1, offset, end)) {
1988 u64 num_bytes; 1989 u64 num_bytes;
1989 1990
1990 path->slots[0]++; 1991 path->slots[0]++;
1991 key.offset = offset; 1992 key.offset = offset;
1992 btrfs_set_item_key_safe(root, path, &key); 1993 btrfs_set_item_key_safe(root, path, &key);
1993 fi = btrfs_item_ptr(leaf, path->slots[0], 1994 fi = btrfs_item_ptr(leaf, path->slots[0],
1994 struct btrfs_file_extent_item); 1995 struct btrfs_file_extent_item);
1995 num_bytes = btrfs_file_extent_num_bytes(leaf, fi) + end - 1996 num_bytes = btrfs_file_extent_num_bytes(leaf, fi) + end -
1996 offset; 1997 offset;
1997 btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes); 1998 btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes);
1998 btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes); 1999 btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes);
1999 btrfs_set_file_extent_offset(leaf, fi, 0); 2000 btrfs_set_file_extent_offset(leaf, fi, 0);
2000 btrfs_mark_buffer_dirty(leaf); 2001 btrfs_mark_buffer_dirty(leaf);
2001 goto out; 2002 goto out;
2002 } 2003 }
2003 btrfs_release_path(path); 2004 btrfs_release_path(path);
2004 2005
2005 ret = btrfs_insert_file_extent(trans, root, btrfs_ino(inode), offset, 2006 ret = btrfs_insert_file_extent(trans, root, btrfs_ino(inode), offset,
2006 0, 0, end - offset, 0, end - offset, 2007 0, 0, end - offset, 0, end - offset,
2007 0, 0, 0); 2008 0, 0, 0);
2008 if (ret) 2009 if (ret)
2009 return ret; 2010 return ret;
2010 2011
2011 out: 2012 out:
2012 btrfs_release_path(path); 2013 btrfs_release_path(path);
2013 2014
2014 hole_em = alloc_extent_map(); 2015 hole_em = alloc_extent_map();
2015 if (!hole_em) { 2016 if (!hole_em) {
2016 btrfs_drop_extent_cache(inode, offset, end - 1, 0); 2017 btrfs_drop_extent_cache(inode, offset, end - 1, 0);
2017 set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, 2018 set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
2018 &BTRFS_I(inode)->runtime_flags); 2019 &BTRFS_I(inode)->runtime_flags);
2019 } else { 2020 } else {
2020 hole_em->start = offset; 2021 hole_em->start = offset;
2021 hole_em->len = end - offset; 2022 hole_em->len = end - offset;
2022 hole_em->ram_bytes = hole_em->len; 2023 hole_em->ram_bytes = hole_em->len;
2023 hole_em->orig_start = offset; 2024 hole_em->orig_start = offset;
2024 2025
2025 hole_em->block_start = EXTENT_MAP_HOLE; 2026 hole_em->block_start = EXTENT_MAP_HOLE;
2026 hole_em->block_len = 0; 2027 hole_em->block_len = 0;
2027 hole_em->orig_block_len = 0; 2028 hole_em->orig_block_len = 0;
2028 hole_em->bdev = root->fs_info->fs_devices->latest_bdev; 2029 hole_em->bdev = root->fs_info->fs_devices->latest_bdev;
2029 hole_em->compress_type = BTRFS_COMPRESS_NONE; 2030 hole_em->compress_type = BTRFS_COMPRESS_NONE;
2030 hole_em->generation = trans->transid; 2031 hole_em->generation = trans->transid;
2031 2032
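		/*
		 * A racing read can re-insert a cached extent for this range,
		 * so keep dropping the extent cache until the hole extent can
		 * be added (add_extent_mapping returns -EEXIST on overlap).
		 */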
2032 do { 2033 do {
2033 btrfs_drop_extent_cache(inode, offset, end - 1, 0); 2034 btrfs_drop_extent_cache(inode, offset, end - 1, 0);
2034 write_lock(&em_tree->lock); 2035 write_lock(&em_tree->lock);
2035 ret = add_extent_mapping(em_tree, hole_em, 1); 2036 ret = add_extent_mapping(em_tree, hole_em, 1);
2036 write_unlock(&em_tree->lock); 2037 write_unlock(&em_tree->lock);
2037 } while (ret == -EEXIST); 2038 } while (ret == -EEXIST);
2038 free_extent_map(hole_em); 2039 free_extent_map(hole_em);
2039 if (ret) 2040 if (ret)
2040 set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, 2041 set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
2041 &BTRFS_I(inode)->runtime_flags); 2042 &BTRFS_I(inode)->runtime_flags);
2042 } 2043 }
2043 2044
2044 return 0; 2045 return 0;
2045 } 2046 }
2046 2047
2047 static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) 2048 static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
2048 { 2049 {
2049 struct btrfs_root *root = BTRFS_I(inode)->root; 2050 struct btrfs_root *root = BTRFS_I(inode)->root;
2050 struct extent_state *cached_state = NULL; 2051 struct extent_state *cached_state = NULL;
2051 struct btrfs_path *path; 2052 struct btrfs_path *path;
2052 struct btrfs_block_rsv *rsv; 2053 struct btrfs_block_rsv *rsv;
2053 struct btrfs_trans_handle *trans; 2054 struct btrfs_trans_handle *trans;
2054 u64 lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize); 2055 u64 lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize);
2055 u64 lockend = round_down(offset + len, 2056 u64 lockend = round_down(offset + len,
2056 BTRFS_I(inode)->root->sectorsize) - 1; 2057 BTRFS_I(inode)->root->sectorsize) - 1;
2057 u64 cur_offset = lockstart; 2058 u64 cur_offset = lockstart;
2058 u64 min_size = btrfs_calc_trunc_metadata_size(root, 1); 2059 u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
2059 u64 drop_end; 2060 u64 drop_end;
2060 int ret = 0; 2061 int ret = 0;
2061 int err = 0; 2062 int err = 0;
2062 bool same_page = ((offset >> PAGE_CACHE_SHIFT) == 2063 bool same_page = ((offset >> PAGE_CACHE_SHIFT) ==
2063 ((offset + len - 1) >> PAGE_CACHE_SHIFT)); 2064 ((offset + len - 1) >> PAGE_CACHE_SHIFT));
2064 2065
2065 btrfs_wait_ordered_range(inode, offset, len); 2066 btrfs_wait_ordered_range(inode, offset, len);
2066 2067
2067 mutex_lock(&inode->i_mutex); 2068 mutex_lock(&inode->i_mutex);
2068 /* 2069 /*
2069 * We needn't truncate any page which is beyond the end of the file 2070 * We needn't truncate any page which is beyond the end of the file
2070 * because we are sure there is no data there. 2071 * because we are sure there is no data there.
2071 */ 2072 */
2072 /* 2073 /*
2073 * Only do this if we are in the same page and we aren't doing the 2074 * Only do this if we are in the same page and we aren't doing the
2074 * entire page. 2075 * entire page.
2075 */ 2076 */
2076 if (same_page && len < PAGE_CACHE_SIZE) { 2077 if (same_page && len < PAGE_CACHE_SIZE) {
2077 if (offset < round_up(inode->i_size, PAGE_CACHE_SIZE)) 2078 if (offset < round_up(inode->i_size, PAGE_CACHE_SIZE))
2078 ret = btrfs_truncate_page(inode, offset, len, 0); 2079 ret = btrfs_truncate_page(inode, offset, len, 0);
2079 mutex_unlock(&inode->i_mutex); 2080 mutex_unlock(&inode->i_mutex);
2080 return ret; 2081 return ret;
2081 } 2082 }
2082 2083
2083 /* zero back part of the first page */ 2084 /* zero back part of the first page */
2084 if (offset < round_up(inode->i_size, PAGE_CACHE_SIZE)) { 2085 if (offset < round_up(inode->i_size, PAGE_CACHE_SIZE)) {
2085 ret = btrfs_truncate_page(inode, offset, 0, 0); 2086 ret = btrfs_truncate_page(inode, offset, 0, 0);
2086 if (ret) { 2087 if (ret) {
2087 mutex_unlock(&inode->i_mutex); 2088 mutex_unlock(&inode->i_mutex);
2088 return ret; 2089 return ret;
2089 } 2090 }
2090 } 2091 }
2091 2092
2092 /* zero the front end of the last page */ 2093 /* zero the front end of the last page */
2093 if (offset + len < round_up(inode->i_size, PAGE_CACHE_SIZE)) { 2094 if (offset + len < round_up(inode->i_size, PAGE_CACHE_SIZE)) {
2094 ret = btrfs_truncate_page(inode, offset + len, 0, 1); 2095 ret = btrfs_truncate_page(inode, offset + len, 0, 1);
2095 if (ret) { 2096 if (ret) {
2096 mutex_unlock(&inode->i_mutex); 2097 mutex_unlock(&inode->i_mutex);
2097 return ret; 2098 return ret;
2098 } 2099 }
2099 } 2100 }
2100 2101
2101 if (lockend < lockstart) { 2102 if (lockend < lockstart) {
2102 mutex_unlock(&inode->i_mutex); 2103 mutex_unlock(&inode->i_mutex);
2103 return 0; 2104 return 0;
2104 } 2105 }
2105 2106
2106 while (1) { 2107 while (1) {
2107 struct btrfs_ordered_extent *ordered; 2108 struct btrfs_ordered_extent *ordered;
2108 2109
2109 truncate_pagecache_range(inode, lockstart, lockend); 2110 truncate_pagecache_range(inode, lockstart, lockend);
2110 2111
2111 lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, 2112 lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
2112 0, &cached_state); 2113 0, &cached_state);
2113 ordered = btrfs_lookup_first_ordered_extent(inode, lockend); 2114 ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
2114 2115
2115 /* 2116 /*
2116 * We need to make sure we have no ordered extents in this range 2117 * We need to make sure we have no ordered extents in this range
2117 * and that nobody raced in and read a page in this range; if they did, 2118 * and that nobody raced in and read a page in this range; if they did,
2118 * we need to try again. 2119 * we need to try again.
2119 */ 2120 */
2120 if ((!ordered || 2121 if ((!ordered ||
2121 (ordered->file_offset + ordered->len < lockstart || 2122 (ordered->file_offset + ordered->len < lockstart ||
2122 ordered->file_offset > lockend)) && 2123 ordered->file_offset > lockend)) &&
2123 !test_range_bit(&BTRFS_I(inode)->io_tree, lockstart, 2124 !test_range_bit(&BTRFS_I(inode)->io_tree, lockstart,
2124 lockend, EXTENT_UPTODATE, 0, 2125 lockend, EXTENT_UPTODATE, 0,
2125 cached_state)) { 2126 cached_state)) {
2126 if (ordered) 2127 if (ordered)
2127 btrfs_put_ordered_extent(ordered); 2128 btrfs_put_ordered_extent(ordered);
2128 break; 2129 break;
2129 } 2130 }
2130 if (ordered) 2131 if (ordered)
2131 btrfs_put_ordered_extent(ordered); 2132 btrfs_put_ordered_extent(ordered);
2132 unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, 2133 unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
2133 lockend, &cached_state, GFP_NOFS); 2134 lockend, &cached_state, GFP_NOFS);
2134 btrfs_wait_ordered_range(inode, lockstart, 2135 btrfs_wait_ordered_range(inode, lockstart,
2135 lockend - lockstart + 1); 2136 lockend - lockstart + 1);
2136 } 2137 }
2137 2138
2138 path = btrfs_alloc_path(); 2139 path = btrfs_alloc_path();
2139 if (!path) { 2140 if (!path) {
2140 ret = -ENOMEM; 2141 ret = -ENOMEM;
2141 goto out; 2142 goto out;
2142 } 2143 }
2143 2144
2144 rsv = btrfs_alloc_block_rsv(root, BTRFS_BLOCK_RSV_TEMP); 2145 rsv = btrfs_alloc_block_rsv(root, BTRFS_BLOCK_RSV_TEMP);
2145 if (!rsv) { 2146 if (!rsv) {
2146 ret = -ENOMEM; 2147 ret = -ENOMEM;
2147 goto out_free; 2148 goto out_free;
2148 } 2149 }
2149 rsv->size = btrfs_calc_trunc_metadata_size(root, 1); 2150 rsv->size = btrfs_calc_trunc_metadata_size(root, 1);
2150 rsv->failfast = 1; 2151 rsv->failfast = 1;
2151 2152
2152 /* 2153 /*
2153 * 1 - update the inode 2154 * 1 - update the inode
2154 * 1 - removing the extents in the range 2155 * 1 - removing the extents in the range
2155 * 1 - adding the hole extent 2156 * 1 - adding the hole extent
2156 */ 2157 */
2157 trans = btrfs_start_transaction(root, 3); 2158 trans = btrfs_start_transaction(root, 3);
2158 if (IS_ERR(trans)) { 2159 if (IS_ERR(trans)) {
2159 err = PTR_ERR(trans); 2160 err = PTR_ERR(trans);
2160 goto out_free; 2161 goto out_free;
2161 } 2162 }
2162 2163
2163 ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, rsv, 2164 ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, rsv,
2164 min_size); 2165 min_size);
2165 BUG_ON(ret); 2166 BUG_ON(ret);
2166 trans->block_rsv = rsv; 2167 trans->block_rsv = rsv;
2167 2168
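	/*
	 * Drop the extents in the range in chunks: if __btrfs_drop_extents
	 * runs out of reserved metadata space it stops at drop_end, so fill
	 * the hole covered so far and restart with a fresh transaction.
	 */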
2168 while (cur_offset < lockend) { 2169 while (cur_offset < lockend) {
2169 ret = __btrfs_drop_extents(trans, root, inode, path, 2170 ret = __btrfs_drop_extents(trans, root, inode, path,
2170 cur_offset, lockend + 1, 2171 cur_offset, lockend + 1,
2171 &drop_end, 1); 2172 &drop_end, 1);
2172 if (ret != -ENOSPC) 2173 if (ret != -ENOSPC)
2173 break; 2174 break;
2174 2175
2175 trans->block_rsv = &root->fs_info->trans_block_rsv; 2176 trans->block_rsv = &root->fs_info->trans_block_rsv;
2176 2177
2177 ret = fill_holes(trans, inode, path, cur_offset, drop_end); 2178 ret = fill_holes(trans, inode, path, cur_offset, drop_end);
2178 if (ret) { 2179 if (ret) {
2179 err = ret; 2180 err = ret;
2180 break; 2181 break;
2181 } 2182 }
2182 2183
2183 cur_offset = drop_end; 2184 cur_offset = drop_end;
2184 2185
2185 ret = btrfs_update_inode(trans, root, inode); 2186 ret = btrfs_update_inode(trans, root, inode);
2186 if (ret) { 2187 if (ret) {
2187 err = ret; 2188 err = ret;
2188 break; 2189 break;
2189 } 2190 }
2190 2191
2191 btrfs_end_transaction(trans, root); 2192 btrfs_end_transaction(trans, root);
2192 btrfs_btree_balance_dirty(root); 2193 btrfs_btree_balance_dirty(root);
2193 2194
2194 trans = btrfs_start_transaction(root, 3); 2195 trans = btrfs_start_transaction(root, 3);
2195 if (IS_ERR(trans)) { 2196 if (IS_ERR(trans)) {
2196 ret = PTR_ERR(trans); 2197 ret = PTR_ERR(trans);
2197 trans = NULL; 2198 trans = NULL;
2198 break; 2199 break;
2199 } 2200 }
2200 2201
2201 ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, 2202 ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv,
2202 rsv, min_size); 2203 rsv, min_size);
2203 BUG_ON(ret); /* shouldn't happen */ 2204 BUG_ON(ret); /* shouldn't happen */
2204 trans->block_rsv = rsv; 2205 trans->block_rsv = rsv;
2205 } 2206 }
2206 2207
2207 if (ret) { 2208 if (ret) {
2208 err = ret; 2209 err = ret;
2209 goto out_trans; 2210 goto out_trans;
2210 } 2211 }
2211 2212
2212 trans->block_rsv = &root->fs_info->trans_block_rsv; 2213 trans->block_rsv = &root->fs_info->trans_block_rsv;
2213 ret = fill_holes(trans, inode, path, cur_offset, drop_end); 2214 ret = fill_holes(trans, inode, path, cur_offset, drop_end);
2214 if (ret) { 2215 if (ret) {
2215 err = ret; 2216 err = ret;
2216 goto out_trans; 2217 goto out_trans;
2217 } 2218 }
2218 2219
2219 out_trans: 2220 out_trans:
2220 if (!trans) 2221 if (!trans)
2221 goto out_free; 2222 goto out_free;
2222 2223
2223 inode_inc_iversion(inode); 2224 inode_inc_iversion(inode);
2224 inode->i_mtime = inode->i_ctime = CURRENT_TIME; 2225 inode->i_mtime = inode->i_ctime = CURRENT_TIME;
2225 2226
2226 trans->block_rsv = &root->fs_info->trans_block_rsv; 2227 trans->block_rsv = &root->fs_info->trans_block_rsv;
2227 ret = btrfs_update_inode(trans, root, inode); 2228 ret = btrfs_update_inode(trans, root, inode);
2228 btrfs_end_transaction(trans, root); 2229 btrfs_end_transaction(trans, root);
2229 btrfs_btree_balance_dirty(root); 2230 btrfs_btree_balance_dirty(root);
2230 out_free: 2231 out_free:
2231 btrfs_free_path(path); 2232 btrfs_free_path(path);
2232 btrfs_free_block_rsv(root, rsv); 2233 btrfs_free_block_rsv(root, rsv);
2233 out: 2234 out:
2234 unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, 2235 unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend,
2235 &cached_state, GFP_NOFS); 2236 &cached_state, GFP_NOFS);
2236 mutex_unlock(&inode->i_mutex); 2237 mutex_unlock(&inode->i_mutex);
2237 if (ret && !err) 2238 if (ret && !err)
2238 err = ret; 2239 err = ret;
2239 return err; 2240 return err;
2240 } 2241 }
2241 2242
2242 static long btrfs_fallocate(struct file *file, int mode, 2243 static long btrfs_fallocate(struct file *file, int mode,
2243 loff_t offset, loff_t len) 2244 loff_t offset, loff_t len)
2244 { 2245 {
2245 struct inode *inode = file_inode(file); 2246 struct inode *inode = file_inode(file);
2246 struct extent_state *cached_state = NULL; 2247 struct extent_state *cached_state = NULL;
2247 struct btrfs_root *root = BTRFS_I(inode)->root; 2248 struct btrfs_root *root = BTRFS_I(inode)->root;
2248 u64 cur_offset; 2249 u64 cur_offset;
2249 u64 last_byte; 2250 u64 last_byte;
2250 u64 alloc_start; 2251 u64 alloc_start;
2251 u64 alloc_end; 2252 u64 alloc_end;
2252 u64 alloc_hint = 0; 2253 u64 alloc_hint = 0;
2253 u64 locked_end; 2254 u64 locked_end;
2254 struct extent_map *em; 2255 struct extent_map *em;
2255 int blocksize = BTRFS_I(inode)->root->sectorsize; 2256 int blocksize = BTRFS_I(inode)->root->sectorsize;
2256 int ret; 2257 int ret;
2257 2258
2258 alloc_start = round_down(offset, blocksize); 2259 alloc_start = round_down(offset, blocksize);
2259 alloc_end = round_up(offset + len, blocksize); 2260 alloc_end = round_up(offset + len, blocksize);
2260 2261
2261 /* Make sure we aren't being given some crap mode */ 2262 /* Make sure we aren't being given some crap mode */
2262 if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) 2263 if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
2263 return -EOPNOTSUPP; 2264 return -EOPNOTSUPP;
2264 2265
2265 if (mode & FALLOC_FL_PUNCH_HOLE) 2266 if (mode & FALLOC_FL_PUNCH_HOLE)
2266 return btrfs_punch_hole(inode, offset, len); 2267 return btrfs_punch_hole(inode, offset, len);
2267 2268
2268 /* 2269 /*
2269 * Make sure we have enough space before we do the 2270 * Make sure we have enough space before we do the
2270 * allocation. 2271 * allocation.
2271 */ 2272 */
2272 ret = btrfs_check_data_free_space(inode, alloc_end - alloc_start); 2273 ret = btrfs_check_data_free_space(inode, alloc_end - alloc_start);
2273 if (ret) 2274 if (ret)
2274 return ret; 2275 return ret;
2275 if (root->fs_info->quota_enabled) { 2276 if (root->fs_info->quota_enabled) {
2276 ret = btrfs_qgroup_reserve(root, alloc_end - alloc_start); 2277 ret = btrfs_qgroup_reserve(root, alloc_end - alloc_start);
2277 if (ret) 2278 if (ret)
2278 goto out_reserve_fail; 2279 goto out_reserve_fail;
2279 } 2280 }
2280 2281
2281 mutex_lock(&inode->i_mutex); 2282 mutex_lock(&inode->i_mutex);
2282 ret = inode_newsize_ok(inode, alloc_end); 2283 ret = inode_newsize_ok(inode, alloc_end);
2283 if (ret) 2284 if (ret)
2284 goto out; 2285 goto out;
2285 2286
2286 if (alloc_start > inode->i_size) { 2287 if (alloc_start > inode->i_size) {
2287 ret = btrfs_cont_expand(inode, i_size_read(inode), 2288 ret = btrfs_cont_expand(inode, i_size_read(inode),
2288 alloc_start); 2289 alloc_start);
2289 if (ret) 2290 if (ret)
2290 goto out; 2291 goto out;
2291 } else { 2292 } else {
2292 /* 2293 /*
2293 * If we are fallocating from the end of the file onward we 2294 * If we are fallocating from the end of the file onward we
2294 * need to zero out the end of the page if i_size lands in the 2295 * need to zero out the end of the page if i_size lands in the
2295 * middle of a page. 2296 * middle of a page.
2296 */ 2297 */
2297 ret = btrfs_truncate_page(inode, inode->i_size, 0, 0); 2298 ret = btrfs_truncate_page(inode, inode->i_size, 0, 0);
2298 if (ret) 2299 if (ret)
2299 goto out; 2300 goto out;
2300 } 2301 }
2301 2302
2302 /* 2303 /*
2303 * wait for ordered IO before we have any locks. We'll loop again 2304 * wait for ordered IO before we have any locks. We'll loop again
2304 * below with the locks held. 2305 * below with the locks held.
2305 */ 2306 */
2306 btrfs_wait_ordered_range(inode, alloc_start, alloc_end - alloc_start); 2307 btrfs_wait_ordered_range(inode, alloc_start, alloc_end - alloc_start);
2307 2308
2308 locked_end = alloc_end - 1; 2309 locked_end = alloc_end - 1;
2309 while (1) { 2310 while (1) {
2310 struct btrfs_ordered_extent *ordered; 2311 struct btrfs_ordered_extent *ordered;
2311 2312
2312 /* the extent lock is ordered inside the running 2313 /* the extent lock is ordered inside the running
2313 * transaction 2314 * transaction
2314 */ 2315 */
2315 lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, 2316 lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start,
2316 locked_end, 0, &cached_state); 2317 locked_end, 0, &cached_state);
2317 ordered = btrfs_lookup_first_ordered_extent(inode, 2318 ordered = btrfs_lookup_first_ordered_extent(inode,
2318 alloc_end - 1); 2319 alloc_end - 1);
2319 if (ordered && 2320 if (ordered &&
2320 ordered->file_offset + ordered->len > alloc_start && 2321 ordered->file_offset + ordered->len > alloc_start &&
2321 ordered->file_offset < alloc_end) { 2322 ordered->file_offset < alloc_end) {
2322 btrfs_put_ordered_extent(ordered); 2323 btrfs_put_ordered_extent(ordered);
2323 unlock_extent_cached(&BTRFS_I(inode)->io_tree, 2324 unlock_extent_cached(&BTRFS_I(inode)->io_tree,
2324 alloc_start, locked_end, 2325 alloc_start, locked_end,
2325 &cached_state, GFP_NOFS); 2326 &cached_state, GFP_NOFS);
2326 /* 2327 /*
2327 * we can't wait on the range with the transaction 2328 * we can't wait on the range with the transaction
2328 * running or with the extent lock held 2329 * running or with the extent lock held
2329 */ 2330 */
2330 btrfs_wait_ordered_range(inode, alloc_start, 2331 btrfs_wait_ordered_range(inode, alloc_start,
2331 alloc_end - alloc_start); 2332 alloc_end - alloc_start);
2332 } else { 2333 } else {
2333 if (ordered) 2334 if (ordered)
2334 btrfs_put_ordered_extent(ordered); 2335 btrfs_put_ordered_extent(ordered);
2335 break; 2336 break;
2336 } 2337 }
2337 } 2338 }
2338 2339
2339 cur_offset = alloc_start; 2340 cur_offset = alloc_start;
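	/*
	 * Walk the range one extent map at a time and preallocate any holes
	 * (or anything past i_size that isn't already preallocated).
	 */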
2340 while (1) { 2341 while (1) {
2341 u64 actual_end; 2342 u64 actual_end;
2342 2343
2343 em = btrfs_get_extent(inode, NULL, 0, cur_offset, 2344 em = btrfs_get_extent(inode, NULL, 0, cur_offset,
2344 alloc_end - cur_offset, 0); 2345 alloc_end - cur_offset, 0);
2345 if (IS_ERR_OR_NULL(em)) { 2346 if (IS_ERR_OR_NULL(em)) {
2346 if (!em) 2347 if (!em)
2347 ret = -ENOMEM; 2348 ret = -ENOMEM;
2348 else 2349 else
2349 ret = PTR_ERR(em); 2350 ret = PTR_ERR(em);
2350 break; 2351 break;
2351 } 2352 }
2352 last_byte = min(extent_map_end(em), alloc_end); 2353 last_byte = min(extent_map_end(em), alloc_end);
2353 actual_end = min_t(u64, extent_map_end(em), offset + len); 2354 actual_end = min_t(u64, extent_map_end(em), offset + len);
2354 last_byte = ALIGN(last_byte, blocksize); 2355 last_byte = ALIGN(last_byte, blocksize);
2355 2356
2356 if (em->block_start == EXTENT_MAP_HOLE || 2357 if (em->block_start == EXTENT_MAP_HOLE ||
2357 (cur_offset >= inode->i_size && 2358 (cur_offset >= inode->i_size &&
2358 !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) { 2359 !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
2359 ret = btrfs_prealloc_file_range(inode, mode, cur_offset, 2360 ret = btrfs_prealloc_file_range(inode, mode, cur_offset,
2360 last_byte - cur_offset, 2361 last_byte - cur_offset,
2361 1 << inode->i_blkbits, 2362 1 << inode->i_blkbits,
2362 offset + len, 2363 offset + len,
2363 &alloc_hint); 2364 &alloc_hint);
2364 2365
2365 if (ret < 0) { 2366 if (ret < 0) {
2366 free_extent_map(em); 2367 free_extent_map(em);
2367 break; 2368 break;
2368 } 2369 }
2369 } else if (actual_end > inode->i_size && 2370 } else if (actual_end > inode->i_size &&
2370 !(mode & FALLOC_FL_KEEP_SIZE)) { 2371 !(mode & FALLOC_FL_KEEP_SIZE)) {
2371 /* 2372 /*
2372 * We didn't need to allocate any more space, but we 2373 * We didn't need to allocate any more space, but we
2373 * still extended the size of the file so we need to 2374 * still extended the size of the file so we need to
2374 * update i_size. 2375 * update i_size.
2375 */ 2376 */
2376 inode->i_ctime = CURRENT_TIME; 2377 inode->i_ctime = CURRENT_TIME;
2377 i_size_write(inode, actual_end); 2378 i_size_write(inode, actual_end);
2378 btrfs_ordered_update_i_size(inode, actual_end, NULL); 2379 btrfs_ordered_update_i_size(inode, actual_end, NULL);
2379 } 2380 }
2380 free_extent_map(em); 2381 free_extent_map(em);
2381 2382
2382 cur_offset = last_byte; 2383 cur_offset = last_byte;
2383 if (cur_offset >= alloc_end) { 2384 if (cur_offset >= alloc_end) {
2384 ret = 0; 2385 ret = 0;
2385 break; 2386 break;
2386 } 2387 }
2387 } 2388 }
2388 unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, 2389 unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
2389 &cached_state, GFP_NOFS); 2390 &cached_state, GFP_NOFS);
2390 out: 2391 out:
2391 mutex_unlock(&inode->i_mutex); 2392 mutex_unlock(&inode->i_mutex);
2392 if (root->fs_info->quota_enabled) 2393 if (root->fs_info->quota_enabled)
2393 btrfs_qgroup_free(root, alloc_end - alloc_start); 2394 btrfs_qgroup_free(root, alloc_end - alloc_start);
2394 out_reserve_fail: 2395 out_reserve_fail:
2395 /* Let go of our reservation. */ 2396 /* Let go of our reservation. */
2396 btrfs_free_reserved_data_space(inode, alloc_end - alloc_start); 2397 btrfs_free_reserved_data_space(inode, alloc_end - alloc_start);
2397 return ret; 2398 return ret;
2398 } 2399 }
2399 2400
2400 static int find_desired_extent(struct inode *inode, loff_t *offset, int whence) 2401 static int find_desired_extent(struct inode *inode, loff_t *offset, int whence)
2401 { 2402 {
2402 struct btrfs_root *root = BTRFS_I(inode)->root; 2403 struct btrfs_root *root = BTRFS_I(inode)->root;
2403 struct extent_map *em; 2404 struct extent_map *em;
2404 struct extent_state *cached_state = NULL; 2405 struct extent_state *cached_state = NULL;
2405 u64 lockstart = *offset; 2406 u64 lockstart = *offset;
2406 u64 lockend = i_size_read(inode); 2407 u64 lockend = i_size_read(inode);
2407 u64 start = *offset; 2408 u64 start = *offset;
2408 u64 orig_start = *offset; 2409 u64 orig_start = *offset;
2409 u64 len = i_size_read(inode); 2410 u64 len = i_size_read(inode);
2410 u64 last_end = 0; 2411 u64 last_end = 0;
2411 int ret = 0; 2412 int ret = 0;
2412 2413
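	/*
	 * Make sure the locked range covers at least one sector and is
	 * expressed as an inclusive [lockstart, lockend].
	 */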
2413 lockend = max_t(u64, root->sectorsize, lockend); 2414 lockend = max_t(u64, root->sectorsize, lockend);
2414 if (lockend <= lockstart) 2415 if (lockend <= lockstart)
2415 lockend = lockstart + root->sectorsize; 2416 lockend = lockstart + root->sectorsize;
2416 2417
2417 lockend--; 2418 lockend--;
2418 len = lockend - lockstart + 1; 2419 len = lockend - lockstart + 1;
2419 2420
2420 len = max_t(u64, len, root->sectorsize); 2421 len = max_t(u64, len, root->sectorsize);
2421 if (inode->i_size == 0) 2422 if (inode->i_size == 0)
2422 return -ENXIO; 2423 return -ENXIO;
2423 2424
2424 lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, 0, 2425 lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, 0,
2425 &cached_state); 2426 &cached_state);
2426 2427
2427 /* 2428 /*
2428 * Delalloc is such a pain. If we have a hole and we have pending 2429 * Delalloc is such a pain. If we have a hole and we have pending
2429 * delalloc for a portion of the hole we will get back a hole that 2430 * delalloc for a portion of the hole we will get back a hole that
2430 * exists for the entire range since it hasn't been actually written 2431 * exists for the entire range since it hasn't been actually written
2431 * yet. So to take care of this case we need to look for an extent just 2432 * yet. So to take care of this case we need to look for an extent just
2432 * before the position we want in case there is outstanding delalloc 2433 * before the position we want in case there is outstanding delalloc
2433 * going on here. 2434 * going on here.
2434 */ 2435 */
2435 if (whence == SEEK_HOLE && start != 0) { 2436 if (whence == SEEK_HOLE && start != 0) {
2436 if (start <= root->sectorsize) 2437 if (start <= root->sectorsize)
2437 em = btrfs_get_extent_fiemap(inode, NULL, 0, 0, 2438 em = btrfs_get_extent_fiemap(inode, NULL, 0, 0,
2438 root->sectorsize, 0); 2439 root->sectorsize, 0);
2439 else 2440 else
2440 em = btrfs_get_extent_fiemap(inode, NULL, 0, 2441 em = btrfs_get_extent_fiemap(inode, NULL, 0,
2441 start - root->sectorsize, 2442 start - root->sectorsize,
2442 root->sectorsize, 0); 2443 root->sectorsize, 0);
2443 if (IS_ERR(em)) { 2444 if (IS_ERR(em)) {
2444 ret = PTR_ERR(em); 2445 ret = PTR_ERR(em);
2445 goto out; 2446 goto out;
2446 } 2447 }
2447 last_end = em->start + em->len; 2448 last_end = em->start + em->len;
2448 if (em->block_start == EXTENT_MAP_DELALLOC) 2449 if (em->block_start == EXTENT_MAP_DELALLOC)
2449 last_end = min_t(u64, last_end, inode->i_size); 2450 last_end = min_t(u64, last_end, inode->i_size);
2450 free_extent_map(em); 2451 free_extent_map(em);
2451 } 2452 }
2452 2453
2453 while (1) { 2454 while (1) {
2454 em = btrfs_get_extent_fiemap(inode, NULL, 0, start, len, 0); 2455 em = btrfs_get_extent_fiemap(inode, NULL, 0, start, len, 0);
2455 if (IS_ERR(em)) { 2456 if (IS_ERR(em)) {
2456 ret = PTR_ERR(em); 2457 ret = PTR_ERR(em);
2457 break; 2458 break;
2458 } 2459 }
2459 2460
2460 if (em->block_start == EXTENT_MAP_HOLE) { 2461 if (em->block_start == EXTENT_MAP_HOLE) {
2461 if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) { 2462 if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) {
2462 if (last_end <= orig_start) { 2463 if (last_end <= orig_start) {
2463 free_extent_map(em); 2464 free_extent_map(em);
2464 ret = -ENXIO; 2465 ret = -ENXIO;
2465 break; 2466 break;
2466 } 2467 }
2467 } 2468 }
2468 2469
2469 if (whence == SEEK_HOLE) { 2470 if (whence == SEEK_HOLE) {
2470 *offset = start; 2471 *offset = start;
2471 free_extent_map(em); 2472 free_extent_map(em);
2472 break; 2473 break;
2473 } 2474 }
2474 } else { 2475 } else {
2475 if (whence == SEEK_DATA) { 2476 if (whence == SEEK_DATA) {
2476 if (em->block_start == EXTENT_MAP_DELALLOC) { 2477 if (em->block_start == EXTENT_MAP_DELALLOC) {
2477 if (start >= inode->i_size) { 2478 if (start >= inode->i_size) {
2478 free_extent_map(em); 2479 free_extent_map(em);
2479 ret = -ENXIO; 2480 ret = -ENXIO;
2480 break; 2481 break;
2481 } 2482 }
2482 } 2483 }
2483 2484
2484 if (!test_bit(EXTENT_FLAG_PREALLOC, 2485 if (!test_bit(EXTENT_FLAG_PREALLOC,
2485 &em->flags)) { 2486 &em->flags)) {
2486 *offset = start; 2487 *offset = start;
2487 free_extent_map(em); 2488 free_extent_map(em);
2488 break; 2489 break;
2489 } 2490 }
2490 } 2491 }
2491 } 2492 }
2492 2493
2493 start = em->start + em->len; 2494 start = em->start + em->len;
2494 last_end = em->start + em->len; 2495 last_end = em->start + em->len;
2495 2496
2496 if (em->block_start == EXTENT_MAP_DELALLOC) 2497 if (em->block_start == EXTENT_MAP_DELALLOC)
2497 last_end = min_t(u64, last_end, inode->i_size); 2498 last_end = min_t(u64, last_end, inode->i_size);
2498 2499
2499 if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) { 2500 if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) {
2500 free_extent_map(em); 2501 free_extent_map(em);
2501 ret = -ENXIO; 2502 ret = -ENXIO;
2502 break; 2503 break;
2503 } 2504 }
2504 free_extent_map(em); 2505 free_extent_map(em);
2505 cond_resched(); 2506 cond_resched();
2506 } 2507 }
2507 if (!ret) 2508 if (!ret)
2508 *offset = min(*offset, inode->i_size); 2509 *offset = min(*offset, inode->i_size);
2509 out: 2510 out:
2510 unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, 2511 unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend,
2511 &cached_state, GFP_NOFS); 2512 &cached_state, GFP_NOFS);
2512 return ret; 2513 return ret;
2513 } 2514 }
2514 2515
2515 static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence) 2516 static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
2516 { 2517 {
2517 struct inode *inode = file->f_mapping->host; 2518 struct inode *inode = file->f_mapping->host;
2518 int ret; 2519 int ret;
2519 2520
2520 mutex_lock(&inode->i_mutex); 2521 mutex_lock(&inode->i_mutex);
2521 switch (whence) { 2522 switch (whence) {
2522 case SEEK_END: 2523 case SEEK_END:
2523 case SEEK_CUR: 2524 case SEEK_CUR:
2524 offset = generic_file_llseek(file, offset, whence); 2525 offset = generic_file_llseek(file, offset, whence);
2525 goto out; 2526 goto out;
2526 case SEEK_DATA: 2527 case SEEK_DATA:
2527 case SEEK_HOLE: 2528 case SEEK_HOLE:
2528 if (offset >= i_size_read(inode)) { 2529 if (offset >= i_size_read(inode)) {
2529 mutex_unlock(&inode->i_mutex); 2530 mutex_unlock(&inode->i_mutex);
2530 return -ENXIO; 2531 return -ENXIO;
2531 } 2532 }
2532 2533
2533 ret = find_desired_extent(inode, &offset, whence); 2534 ret = find_desired_extent(inode, &offset, whence);
2534 if (ret) { 2535 if (ret) {
2535 mutex_unlock(&inode->i_mutex); 2536 mutex_unlock(&inode->i_mutex);
2536 return ret; 2537 return ret;
2537 } 2538 }
2538 } 2539 }
2539 2540
2540 offset = vfs_setpos(file, offset, inode->i_sb->s_maxbytes); 2541 offset = vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
2541 out: 2542 out:
2542 mutex_unlock(&inode->i_mutex); 2543 mutex_unlock(&inode->i_mutex);
2543 return offset; 2544 return offset;
2544 } 2545 }
2545 2546
2546 const struct file_operations btrfs_file_operations = { 2547 const struct file_operations btrfs_file_operations = {
2547 .llseek = btrfs_file_llseek, 2548 .llseek = btrfs_file_llseek,
2548 .read = do_sync_read, 2549 .read = do_sync_read,
2549 .write = do_sync_write, 2550 .write = do_sync_write,
2550 .aio_read = generic_file_aio_read, 2551 .aio_read = generic_file_aio_read,
2551 .splice_read = generic_file_splice_read, 2552 .splice_read = generic_file_splice_read,
2552 .aio_write = btrfs_file_aio_write, 2553 .aio_write = btrfs_file_aio_write,
2553 .mmap = btrfs_file_mmap, 2554 .mmap = btrfs_file_mmap,
2554 .open = generic_file_open, 2555 .open = generic_file_open,
2555 .release = btrfs_release_file, 2556 .release = btrfs_release_file,
2556 .fsync = btrfs_sync_file, 2557 .fsync = btrfs_sync_file,
2557 .fallocate = btrfs_fallocate, 2558 .fallocate = btrfs_fallocate,
2558 .unlocked_ioctl = btrfs_ioctl, 2559 .unlocked_ioctl = btrfs_ioctl,
2559 #ifdef CONFIG_COMPAT 2560 #ifdef CONFIG_COMPAT
2560 .compat_ioctl = btrfs_ioctl, 2561 .compat_ioctl = btrfs_ioctl,
2561 #endif 2562 #endif
2562 }; 2563 };
2563 2564
2564 void btrfs_auto_defrag_exit(void) 2565 void btrfs_auto_defrag_exit(void)
2565 { 2566 {
2566 if (btrfs_inode_defrag_cachep) 2567 if (btrfs_inode_defrag_cachep)
2567 kmem_cache_destroy(btrfs_inode_defrag_cachep); 2568 kmem_cache_destroy(btrfs_inode_defrag_cachep);
2568 } 2569 }
2569 2570
2570 int btrfs_auto_defrag_init(void) 2571 int btrfs_auto_defrag_init(void)
2571 { 2572 {
2572 btrfs_inode_defrag_cachep = kmem_cache_create("btrfs_inode_defrag", 2573 btrfs_inode_defrag_cachep = kmem_cache_create("btrfs_inode_defrag",
2573 sizeof(struct inode_defrag), 0, 2574 sizeof(struct inode_defrag), 0,
2574 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, 2575 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
2575 NULL); 2576 NULL);
2576 if (!btrfs_inode_defrag_cachep) 2577 if (!btrfs_inode_defrag_cachep)
2577 return -ENOMEM; 2578 return -ENOMEM;
2578 2579
2579 return 0; 2580 return 0;
2580 } 2581 }
1 /* 1 /*
2 * linux/fs/buffer.c 2 * linux/fs/buffer.c
3 * 3 *
4 * Copyright (C) 1991, 1992, 2002 Linus Torvalds 4 * Copyright (C) 1991, 1992, 2002 Linus Torvalds
5 */ 5 */
6 6
7 /* 7 /*
8 * Start bdflush() with kernel_thread not syscall - Paul Gortmaker, 12/95 8 * Start bdflush() with kernel_thread not syscall - Paul Gortmaker, 12/95
9 * 9 *
10 * Removed a lot of unnecessary code and simplified things now that 10 * Removed a lot of unnecessary code and simplified things now that
11 * the buffer cache isn't our primary cache - Andrew Tridgell 12/96 11 * the buffer cache isn't our primary cache - Andrew Tridgell 12/96
12 * 12 *
13 * Speed up hash, lru, and free list operations. Use gfp() for allocating 13 * Speed up hash, lru, and free list operations. Use gfp() for allocating
14 * hash table, use SLAB cache for buffer heads. SMP threading. -DaveM 14 * hash table, use SLAB cache for buffer heads. SMP threading. -DaveM
15 * 15 *
16 * Added 32k buffer block sizes - these are required for older ARM systems. - RMK 16 * Added 32k buffer block sizes - these are required for older ARM systems. - RMK
17 * 17 *
18 * async buffer flushing, 1999 Andrea Arcangeli <andrea@suse.de> 18 * async buffer flushing, 1999 Andrea Arcangeli <andrea@suse.de>
19 */ 19 */
20 20
21 #include <linux/kernel.h> 21 #include <linux/kernel.h>
22 #include <linux/syscalls.h> 22 #include <linux/syscalls.h>
23 #include <linux/fs.h> 23 #include <linux/fs.h>
24 #include <linux/mm.h> 24 #include <linux/mm.h>
25 #include <linux/percpu.h> 25 #include <linux/percpu.h>
26 #include <linux/slab.h> 26 #include <linux/slab.h>
27 #include <linux/capability.h> 27 #include <linux/capability.h>
28 #include <linux/blkdev.h> 28 #include <linux/blkdev.h>
29 #include <linux/file.h> 29 #include <linux/file.h>
30 #include <linux/quotaops.h> 30 #include <linux/quotaops.h>
31 #include <linux/highmem.h> 31 #include <linux/highmem.h>
32 #include <linux/export.h> 32 #include <linux/export.h>
33 #include <linux/writeback.h> 33 #include <linux/writeback.h>
34 #include <linux/hash.h> 34 #include <linux/hash.h>
35 #include <linux/suspend.h> 35 #include <linux/suspend.h>
36 #include <linux/buffer_head.h> 36 #include <linux/buffer_head.h>
37 #include <linux/task_io_accounting_ops.h> 37 #include <linux/task_io_accounting_ops.h>
38 #include <linux/bio.h> 38 #include <linux/bio.h>
39 #include <linux/notifier.h> 39 #include <linux/notifier.h>
40 #include <linux/cpu.h> 40 #include <linux/cpu.h>
41 #include <linux/bitops.h> 41 #include <linux/bitops.h>
42 #include <linux/mpage.h> 42 #include <linux/mpage.h>
43 #include <linux/bit_spinlock.h> 43 #include <linux/bit_spinlock.h>
44 #include <trace/events/block.h> 44 #include <trace/events/block.h>
45 45
46 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); 46 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
47 47
48 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers) 48 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
49 49
50 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private) 50 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
51 { 51 {
52 bh->b_end_io = handler; 52 bh->b_end_io = handler;
53 bh->b_private = private; 53 bh->b_private = private;
54 } 54 }
55 EXPORT_SYMBOL(init_buffer); 55 EXPORT_SYMBOL(init_buffer);
56 56
57 inline void touch_buffer(struct buffer_head *bh) 57 inline void touch_buffer(struct buffer_head *bh)
58 { 58 {
59 trace_block_touch_buffer(bh); 59 trace_block_touch_buffer(bh);
60 mark_page_accessed(bh->b_page); 60 mark_page_accessed(bh->b_page);
61 } 61 }
62 EXPORT_SYMBOL(touch_buffer); 62 EXPORT_SYMBOL(touch_buffer);
63 63
64 static int sleep_on_buffer(void *word) 64 static int sleep_on_buffer(void *word)
65 { 65 {
66 io_schedule(); 66 io_schedule();
67 return 0; 67 return 0;
68 } 68 }
69 69
70 void __lock_buffer(struct buffer_head *bh) 70 void __lock_buffer(struct buffer_head *bh)
71 { 71 {
72 wait_on_bit_lock(&bh->b_state, BH_Lock, sleep_on_buffer, 72 wait_on_bit_lock(&bh->b_state, BH_Lock, sleep_on_buffer,
73 TASK_UNINTERRUPTIBLE); 73 TASK_UNINTERRUPTIBLE);
74 } 74 }
75 EXPORT_SYMBOL(__lock_buffer); 75 EXPORT_SYMBOL(__lock_buffer);
76 76
77 void unlock_buffer(struct buffer_head *bh) 77 void unlock_buffer(struct buffer_head *bh)
78 { 78 {
79 clear_bit_unlock(BH_Lock, &bh->b_state); 79 clear_bit_unlock(BH_Lock, &bh->b_state);
80 smp_mb__after_clear_bit(); 80 smp_mb__after_clear_bit();
81 wake_up_bit(&bh->b_state, BH_Lock); 81 wake_up_bit(&bh->b_state, BH_Lock);
82 } 82 }
83 EXPORT_SYMBOL(unlock_buffer); 83 EXPORT_SYMBOL(unlock_buffer);
84 84
85 /* 85 /*
86 * Returns if the page has dirty or writeback buffers. If all the buffers 86 * Returns if the page has dirty or writeback buffers. If all the buffers
87 * are unlocked and clean then the PageDirty information is stale. If 87 * are unlocked and clean then the PageDirty information is stale. If
88 * any of the pages are locked, it is assumed they are locked for IO. 88 * any of the pages are locked, it is assumed they are locked for IO.
89 */ 89 */
90 void buffer_check_dirty_writeback(struct page *page, 90 void buffer_check_dirty_writeback(struct page *page,
91 bool *dirty, bool *writeback) 91 bool *dirty, bool *writeback)
92 { 92 {
93 struct buffer_head *head, *bh; 93 struct buffer_head *head, *bh;
94 *dirty = false; 94 *dirty = false;
95 *writeback = false; 95 *writeback = false;
96 96
97 BUG_ON(!PageLocked(page)); 97 BUG_ON(!PageLocked(page));
98 98
99 if (!page_has_buffers(page)) 99 if (!page_has_buffers(page))
100 return; 100 return;
101 101
102 if (PageWriteback(page)) 102 if (PageWriteback(page))
103 *writeback = true; 103 *writeback = true;
104 104
105 head = page_buffers(page); 105 head = page_buffers(page);
106 bh = head; 106 bh = head;
107 do { 107 do {
108 if (buffer_locked(bh)) 108 if (buffer_locked(bh))
109 *writeback = true; 109 *writeback = true;
110 110
111 if (buffer_dirty(bh)) 111 if (buffer_dirty(bh))
112 *dirty = true; 112 *dirty = true;
113 113
114 bh = bh->b_this_page; 114 bh = bh->b_this_page;
115 } while (bh != head); 115 } while (bh != head);
116 } 116 }
117 EXPORT_SYMBOL(buffer_check_dirty_writeback); 117 EXPORT_SYMBOL(buffer_check_dirty_writeback);
118 118
119 /* 119 /*
120 * Block until a buffer comes unlocked. This doesn't stop it 120 * Block until a buffer comes unlocked. This doesn't stop it
121 * from becoming locked again - you have to lock it yourself 121 * from becoming locked again - you have to lock it yourself
122 * if you want to preserve its state. 122 * if you want to preserve its state.
123 */ 123 */
124 void __wait_on_buffer(struct buffer_head * bh) 124 void __wait_on_buffer(struct buffer_head * bh)
125 { 125 {
126 wait_on_bit(&bh->b_state, BH_Lock, sleep_on_buffer, TASK_UNINTERRUPTIBLE); 126 wait_on_bit(&bh->b_state, BH_Lock, sleep_on_buffer, TASK_UNINTERRUPTIBLE);
127 } 127 }
128 EXPORT_SYMBOL(__wait_on_buffer); 128 EXPORT_SYMBOL(__wait_on_buffer);
129 129
130 static void 130 static void
131 __clear_page_buffers(struct page *page) 131 __clear_page_buffers(struct page *page)
132 { 132 {
133 ClearPagePrivate(page); 133 ClearPagePrivate(page);
134 set_page_private(page, 0); 134 set_page_private(page, 0);
135 page_cache_release(page); 135 page_cache_release(page);
136 } 136 }
137 137
138 138
139 static int quiet_error(struct buffer_head *bh) 139 static int quiet_error(struct buffer_head *bh)
140 { 140 {
141 if (!test_bit(BH_Quiet, &bh->b_state) && printk_ratelimit()) 141 if (!test_bit(BH_Quiet, &bh->b_state) && printk_ratelimit())
142 return 0; 142 return 0;
143 return 1; 143 return 1;
144 } 144 }
145 145
146 146
147 static void buffer_io_error(struct buffer_head *bh) 147 static void buffer_io_error(struct buffer_head *bh)
148 { 148 {
149 char b[BDEVNAME_SIZE]; 149 char b[BDEVNAME_SIZE];
150 printk(KERN_ERR "Buffer I/O error on device %s, logical block %Lu\n", 150 printk(KERN_ERR "Buffer I/O error on device %s, logical block %Lu\n",
151 bdevname(bh->b_bdev, b), 151 bdevname(bh->b_bdev, b),
152 (unsigned long long)bh->b_blocknr); 152 (unsigned long long)bh->b_blocknr);
153 } 153 }
154 154
155 /* 155 /*
156 * End-of-IO handler helper function which does not touch the bh after 156 * End-of-IO handler helper function which does not touch the bh after
157 * unlocking it. 157 * unlocking it.
158 * Note: unlock_buffer() sort-of does touch the bh after unlocking it, but 158 * Note: unlock_buffer() sort-of does touch the bh after unlocking it, but
159 * a race there is benign: unlock_buffer() only use the bh's address for 159 * a race there is benign: unlock_buffer() only use the bh's address for
160 * hashing after unlocking the buffer, so it doesn't actually touch the bh 160 * hashing after unlocking the buffer, so it doesn't actually touch the bh
161 * itself. 161 * itself.
162 */ 162 */
163 static void __end_buffer_read_notouch(struct buffer_head *bh, int uptodate) 163 static void __end_buffer_read_notouch(struct buffer_head *bh, int uptodate)
164 { 164 {
165 if (uptodate) { 165 if (uptodate) {
166 set_buffer_uptodate(bh); 166 set_buffer_uptodate(bh);
167 } else { 167 } else {
168 /* This happens, due to failed READA attempts. */ 168 /* This happens, due to failed READA attempts. */
169 clear_buffer_uptodate(bh); 169 clear_buffer_uptodate(bh);
170 } 170 }
171 unlock_buffer(bh); 171 unlock_buffer(bh);
172 } 172 }
173 173
174 /* 174 /*
175 * Default synchronous end-of-IO handler.. Just mark it up-to-date and 175 * Default synchronous end-of-IO handler.. Just mark it up-to-date and
176 * unlock the buffer. This is what ll_rw_block uses too. 176 * unlock the buffer. This is what ll_rw_block uses too.
177 */ 177 */
178 void end_buffer_read_sync(struct buffer_head *bh, int uptodate) 178 void end_buffer_read_sync(struct buffer_head *bh, int uptodate)
179 { 179 {
180 __end_buffer_read_notouch(bh, uptodate); 180 __end_buffer_read_notouch(bh, uptodate);
181 put_bh(bh); 181 put_bh(bh);
182 } 182 }
183 EXPORT_SYMBOL(end_buffer_read_sync); 183 EXPORT_SYMBOL(end_buffer_read_sync);
184 184
185 void end_buffer_write_sync(struct buffer_head *bh, int uptodate) 185 void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
186 { 186 {
187 char b[BDEVNAME_SIZE]; 187 char b[BDEVNAME_SIZE];
188 188
189 if (uptodate) { 189 if (uptodate) {
190 set_buffer_uptodate(bh); 190 set_buffer_uptodate(bh);
191 } else { 191 } else {
192 if (!quiet_error(bh)) { 192 if (!quiet_error(bh)) {
193 buffer_io_error(bh); 193 buffer_io_error(bh);
194 printk(KERN_WARNING "lost page write due to " 194 printk(KERN_WARNING "lost page write due to "
195 "I/O error on %s\n", 195 "I/O error on %s\n",
196 bdevname(bh->b_bdev, b)); 196 bdevname(bh->b_bdev, b));
197 } 197 }
198 set_buffer_write_io_error(bh); 198 set_buffer_write_io_error(bh);
199 clear_buffer_uptodate(bh); 199 clear_buffer_uptodate(bh);
200 } 200 }
201 unlock_buffer(bh); 201 unlock_buffer(bh);
202 put_bh(bh); 202 put_bh(bh);
203 } 203 }
204 EXPORT_SYMBOL(end_buffer_write_sync); 204 EXPORT_SYMBOL(end_buffer_write_sync);
205 205
206 /* 206 /*
207 * Various filesystems appear to want __find_get_block to be non-blocking. 207 * Various filesystems appear to want __find_get_block to be non-blocking.
208 * But it's the page lock which protects the buffers. To get around this, 208 * But it's the page lock which protects the buffers. To get around this,
209 * we get exclusion from try_to_free_buffers with the blockdev mapping's 209 * we get exclusion from try_to_free_buffers with the blockdev mapping's
210 * private_lock. 210 * private_lock.
211 * 211 *
212 * Hack idea: for the blockdev mapping, i_bufferlist_lock contention 212 * Hack idea: for the blockdev mapping, i_bufferlist_lock contention
213 * may be quite high. This code could TryLock the page, and if that 213 * may be quite high. This code could TryLock the page, and if that
214 * succeeds, there is no need to take private_lock. (But if 214 * succeeds, there is no need to take private_lock. (But if
215 * private_lock is contended then so is mapping->tree_lock). 215 * private_lock is contended then so is mapping->tree_lock).
216 */ 216 */
217 static struct buffer_head * 217 static struct buffer_head *
218 __find_get_block_slow(struct block_device *bdev, sector_t block) 218 __find_get_block_slow(struct block_device *bdev, sector_t block)
219 { 219 {
220 struct inode *bd_inode = bdev->bd_inode; 220 struct inode *bd_inode = bdev->bd_inode;
221 struct address_space *bd_mapping = bd_inode->i_mapping; 221 struct address_space *bd_mapping = bd_inode->i_mapping;
222 struct buffer_head *ret = NULL; 222 struct buffer_head *ret = NULL;
223 pgoff_t index; 223 pgoff_t index;
224 struct buffer_head *bh; 224 struct buffer_head *bh;
225 struct buffer_head *head; 225 struct buffer_head *head;
226 struct page *page; 226 struct page *page;
227 int all_mapped = 1; 227 int all_mapped = 1;
228 228
229 index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits); 229 index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
230 -	page = find_get_page(bd_mapping, index);
230 +	page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED);
231 	if (!page)
232 		goto out;
233
234 	spin_lock(&bd_mapping->private_lock);
235 	if (!page_has_buffers(page))
236 		goto out_unlock;
237 	head = page_buffers(page);
238 	bh = head;
239 	do {
240 		if (!buffer_mapped(bh))
241 			all_mapped = 0;
242 		else if (bh->b_blocknr == block) {
243 			ret = bh;
244 			get_bh(bh);
245 			goto out_unlock;
246 		}
247 		bh = bh->b_this_page;
248 	} while (bh != head);
249
250 	/* we might be here because some of the buffers on this page are
251 	 * not mapped. This is due to various races between
252 	 * file io on the block device and getblk. It gets dealt with
253 	 * elsewhere, don't buffer_error if we had some unmapped buffers
254 	 */
255 	if (all_mapped) {
256 		char b[BDEVNAME_SIZE];
257
258 		printk("__find_get_block_slow() failed. "
259 			"block=%llu, b_blocknr=%llu\n",
260 			(unsigned long long)block,
261 			(unsigned long long)bh->b_blocknr);
262 		printk("b_state=0x%08lx, b_size=%zu\n",
263 			bh->b_state, bh->b_size);
264 		printk("device %s blocksize: %d\n", bdevname(bdev, b),
265 			1 << bd_inode->i_blkbits);
266 	}
267 out_unlock:
268 	spin_unlock(&bd_mapping->private_lock);
269 	page_cache_release(page);
270 out:
271 	return ret;
272 }
273
274 /*
275  * Kick the writeback threads then try to free up some ZONE_NORMAL memory.
276  */
277 static void free_more_memory(void)
278 {
279 	struct zone *zone;
280 	int nid;
281
282 	wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
283 	yield();
284
285 	for_each_online_node(nid) {
286 		(void)first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
287 						gfp_zone(GFP_NOFS), NULL,
288 						&zone);
289 		if (zone)
290 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
291 						GFP_NOFS, NULL);
292 	}
293 }
294
295 /*
296  * I/O completion handler for block_read_full_page() - pages
297  * which come unlocked at the end of I/O.
298  */
299 static void end_buffer_async_read(struct buffer_head *bh, int uptodate)
300 {
301 	unsigned long flags;
302 	struct buffer_head *first;
303 	struct buffer_head *tmp;
304 	struct page *page;
305 	int page_uptodate = 1;
306
307 	BUG_ON(!buffer_async_read(bh));
308
309 	page = bh->b_page;
310 	if (uptodate) {
311 		set_buffer_uptodate(bh);
312 	} else {
313 		clear_buffer_uptodate(bh);
314 		if (!quiet_error(bh))
315 			buffer_io_error(bh);
316 		SetPageError(page);
317 	}
318
319 	/*
320 	 * Be _very_ careful from here on. Bad things can happen if
321 	 * two buffer heads end IO at almost the same time and both
322 	 * decide that the page is now completely done.
323 	 */
324 	first = page_buffers(page);
325 	local_irq_save(flags);
326 	bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
327 	clear_buffer_async_read(bh);
328 	unlock_buffer(bh);
329 	tmp = bh;
330 	do {
331 		if (!buffer_uptodate(tmp))
332 			page_uptodate = 0;
333 		if (buffer_async_read(tmp)) {
334 			BUG_ON(!buffer_locked(tmp));
335 			goto still_busy;
336 		}
337 		tmp = tmp->b_this_page;
338 	} while (tmp != bh);
339 	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
340 	local_irq_restore(flags);
341
342 	/*
343 	 * If none of the buffers had errors and they are all
344 	 * uptodate then we can set the page uptodate.
345 	 */
346 	if (page_uptodate && !PageError(page))
347 		SetPageUptodate(page);
348 	unlock_page(page);
349 	return;
350
351 still_busy:
352 	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
353 	local_irq_restore(flags);
354 	return;
355 }
356
357 /*
358  * Completion handler for block_write_full_page() - pages which are unlocked
359  * during I/O, and which have PageWriteback cleared upon I/O completion.
360  */
361 void end_buffer_async_write(struct buffer_head *bh, int uptodate)
362 {
363 	char b[BDEVNAME_SIZE];
364 	unsigned long flags;
365 	struct buffer_head *first;
366 	struct buffer_head *tmp;
367 	struct page *page;
368
369 	BUG_ON(!buffer_async_write(bh));
370
371 	page = bh->b_page;
372 	if (uptodate) {
373 		set_buffer_uptodate(bh);
374 	} else {
375 		if (!quiet_error(bh)) {
376 			buffer_io_error(bh);
377 			printk(KERN_WARNING "lost page write due to "
378 					"I/O error on %s\n",
379 					bdevname(bh->b_bdev, b));
380 		}
381 		set_bit(AS_EIO, &page->mapping->flags);
382 		set_buffer_write_io_error(bh);
383 		clear_buffer_uptodate(bh);
384 		SetPageError(page);
385 	}
386
387 	first = page_buffers(page);
388 	local_irq_save(flags);
389 	bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
390
391 	clear_buffer_async_write(bh);
392 	unlock_buffer(bh);
393 	tmp = bh->b_this_page;
394 	while (tmp != bh) {
395 		if (buffer_async_write(tmp)) {
396 			BUG_ON(!buffer_locked(tmp));
397 			goto still_busy;
398 		}
399 		tmp = tmp->b_this_page;
400 	}
401 	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
402 	local_irq_restore(flags);
403 	end_page_writeback(page);
404 	return;
405
406 still_busy:
407 	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
408 	local_irq_restore(flags);
409 	return;
410 }
411 EXPORT_SYMBOL(end_buffer_async_write);
412
413 /*
414  * If a page's buffers are under async readin (end_buffer_async_read
415  * completion) then there is a possibility that another thread of
416  * control could lock one of the buffers after it has completed
417  * but while some of the other buffers have not completed. This
418  * locked buffer would confuse end_buffer_async_read() into not unlocking
419  * the page. So the absence of BH_Async_Read tells end_buffer_async_read()
420  * that this buffer is not under async I/O.
421  *
422  * The page comes unlocked when it has no locked buffer_async buffers
423  * left.
424  *
425  * PageLocked prevents anyone starting new async I/O reads any of
426  * the buffers.
427  *
428  * PageWriteback is used to prevent simultaneous writeout of the same
429  * page.
430  *
431  * PageLocked prevents anyone from starting writeback of a page which is
432  * under read I/O (PageWriteback is only ever set against a locked page).
433  */
434 static void mark_buffer_async_read(struct buffer_head *bh)
435 {
436 	bh->b_end_io = end_buffer_async_read;
437 	set_buffer_async_read(bh);
438 }
439
440 static void mark_buffer_async_write_endio(struct buffer_head *bh,
441 					  bh_end_io_t *handler)
442 {
443 	bh->b_end_io = handler;
444 	set_buffer_async_write(bh);
445 }
446
447 void mark_buffer_async_write(struct buffer_head *bh)
448 {
449 	mark_buffer_async_write_endio(bh, end_buffer_async_write);
450 }
451 EXPORT_SYMBOL(mark_buffer_async_write);
452
453
454 /*
455  * fs/buffer.c contains helper functions for buffer-backed address space's
456  * fsync functions. A common requirement for buffer-based filesystems is
457  * that certain data from the backing blockdev needs to be written out for
458  * a successful fsync(). For example, ext2 indirect blocks need to be
459  * written back and waited upon before fsync() returns.
460  *
461  * The functions mark_buffer_inode_dirty(), fsync_inode_buffers(),
462  * inode_has_buffers() and invalidate_inode_buffers() are provided for the
463  * management of a list of dependent buffers at ->i_mapping->private_list.
464  *
465  * Locking is a little subtle: try_to_free_buffers() will remove buffers
466  * from their controlling inode's queue when they are being freed. But
467  * try_to_free_buffers() will be operating against the *blockdev* mapping
468  * at the time, not against the S_ISREG file which depends on those buffers.
469  * So the locking for private_list is via the private_lock in the address_space
470  * which backs the buffers. Which is different from the address_space
471  * against which the buffers are listed. So for a particular address_space,
472  * mapping->private_lock does *not* protect mapping->private_list! In fact,
473  * mapping->private_list will always be protected by the backing blockdev's
474  * ->private_lock.
475  *
476  * Which introduces a requirement: all buffers on an address_space's
477  * ->private_list must be from the same address_space: the blockdev's.
478  *
479  * address_spaces which do not place buffers at ->private_list via these
480  * utility functions are free to use private_lock and private_list for
481  * whatever they want. The only requirement is that list_empty(private_list)
482  * be true at clear_inode() time.
483  *
484  * FIXME: clear_inode should not call invalidate_inode_buffers(). The
485  * filesystems should do that. invalidate_inode_buffers() should just go
486  * BUG_ON(!list_empty).
487  *
488  * FIXME: mark_buffer_dirty_inode() is a data-plane operation. It should
489  * take an address_space, not an inode. And it should be called
490  * mark_buffer_dirty_fsync() to clearly define why those buffers are being
491  * queued up.
492  *
493  * FIXME: mark_buffer_dirty_inode() doesn't need to add the buffer to the
494  * list if it is already on a list. Because if the buffer is on a list,
495  * it *must* already be on the right one. If not, the filesystem is being
496  * silly. This will save a ton of locking. But first we have to ensure
497  * that buffers are taken *off* the old inode's list when they are freed
498  * (presumably in truncate). That requires careful auditing of all
499  * filesystems (do it inside bforget()). It could also be done by bringing
500  * b_inode back.
501  */
502
503 /*
504  * The buffer's backing address_space's private_lock must be held
505  */
506 static void __remove_assoc_queue(struct buffer_head *bh)
507 {
508 	list_del_init(&bh->b_assoc_buffers);
509 	WARN_ON(!bh->b_assoc_map);
510 	if (buffer_write_io_error(bh))
511 		set_bit(AS_EIO, &bh->b_assoc_map->flags);
512 	bh->b_assoc_map = NULL;
513 }
514
515 int inode_has_buffers(struct inode *inode)
516 {
517 	return !list_empty(&inode->i_data.private_list);
518 }
519
520 /*
521  * osync is designed to support O_SYNC io. It waits synchronously for
522  * all already-submitted IO to complete, but does not queue any new
523  * writes to the disk.
524  *
525  * To do O_SYNC writes, just queue the buffer writes with ll_rw_block as
526  * you dirty the buffers, and then use osync_inode_buffers to wait for
527  * completion. Any other dirty buffers which are not yet queued for
528  * write will not be flushed to disk by the osync.
529  */
530 static int osync_buffers_list(spinlock_t *lock, struct list_head *list)
531 {
532 	struct buffer_head *bh;
533 	struct list_head *p;
534 	int err = 0;
535
536 	spin_lock(lock);
537 repeat:
538 	list_for_each_prev(p, list) {
539 		bh = BH_ENTRY(p);
540 		if (buffer_locked(bh)) {
541 			get_bh(bh);
542 			spin_unlock(lock);
543 			wait_on_buffer(bh);
544 			if (!buffer_uptodate(bh))
545 				err = -EIO;
546 			brelse(bh);
547 			spin_lock(lock);
548 			goto repeat;
549 		}
550 	}
551 	spin_unlock(lock);
552 	return err;
553 }
554
555 static void do_thaw_one(struct super_block *sb, void *unused)
556 {
557 	char b[BDEVNAME_SIZE];
558 	while (sb->s_bdev && !thaw_bdev(sb->s_bdev, sb))
559 		printk(KERN_WARNING "Emergency Thaw on %s\n",
560 			bdevname(sb->s_bdev, b));
561 }
562
563 static void do_thaw_all(struct work_struct *work)
564 {
565 	iterate_supers(do_thaw_one, NULL);
566 	kfree(work);
567 	printk(KERN_WARNING "Emergency Thaw complete\n");
568 }
569
570 /**
571  * emergency_thaw_all -- forcibly thaw every frozen filesystem
572  *
573  * Used for emergency unfreeze of all filesystems via SysRq
574  */
575 void emergency_thaw_all(void)
576 {
577 	struct work_struct *work;
578
579 	work = kmalloc(sizeof(*work), GFP_ATOMIC);
580 	if (work) {
581 		INIT_WORK(work, do_thaw_all);
582 		schedule_work(work);
583 	}
584 }
585
586 /**
587  * sync_mapping_buffers - write out & wait upon a mapping's "associated" buffers
588  * @mapping: the mapping which wants those buffers written
589  *
590  * Starts I/O against the buffers at mapping->private_list, and waits upon
591  * that I/O.
592  *
593  * Basically, this is a convenience function for fsync().
594  * @mapping is a file or directory which needs those buffers to be written for
595  * a successful fsync().
596  */
597 int sync_mapping_buffers(struct address_space *mapping)
598 {
599 	struct address_space *buffer_mapping = mapping->private_data;
600
601 	if (buffer_mapping == NULL || list_empty(&mapping->private_list))
602 		return 0;
603
604 	return fsync_buffers_list(&buffer_mapping->private_lock,
605 					&mapping->private_list);
606 }
607 EXPORT_SYMBOL(sync_mapping_buffers);
608
609 /*
610  * Called when we've recently written block `bblock', and it is known that
611  * `bblock' was for a buffer_boundary() buffer. This means that the block at
612  * `bblock + 1' is probably a dirty indirect block. Hunt it down and, if it's
613  * dirty, schedule it for IO. So that indirects merge nicely with their data.
614  */
615 void write_boundary_block(struct block_device *bdev,
616 			sector_t bblock, unsigned blocksize)
617 {
618 	struct buffer_head *bh = __find_get_block(bdev, bblock + 1, blocksize);
619 	if (bh) {
620 		if (buffer_dirty(bh))
621 			ll_rw_block(WRITE, 1, &bh);
622 		put_bh(bh);
623 	}
624 }
625
626 void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
627 {
628 	struct address_space *mapping = inode->i_mapping;
629 	struct address_space *buffer_mapping = bh->b_page->mapping;
630
631 	mark_buffer_dirty(bh);
632 	if (!mapping->private_data) {
633 		mapping->private_data = buffer_mapping;
634 	} else {
635 		BUG_ON(mapping->private_data != buffer_mapping);
636 	}
637 	if (!bh->b_assoc_map) {
638 		spin_lock(&buffer_mapping->private_lock);
639 		list_move_tail(&bh->b_assoc_buffers,
640 				&mapping->private_list);
641 		bh->b_assoc_map = mapping;
642 		spin_unlock(&buffer_mapping->private_lock);
643 	}
644 }
645 EXPORT_SYMBOL(mark_buffer_dirty_inode);
646
647 /*
648  * Mark the page dirty, and set it dirty in the radix tree, and mark the inode
649  * dirty.
650  *
651  * If warn is true, then emit a warning if the page is not uptodate and has
652  * not been truncated.
653  */
654 static void __set_page_dirty(struct page *page,
655 		struct address_space *mapping, int warn)
656 {
657 	unsigned long flags;
658
659 	spin_lock_irqsave(&mapping->tree_lock, flags);
660 	if (page->mapping) {	/* Race with truncate? */
661 		WARN_ON_ONCE(warn && !PageUptodate(page));
662 		account_page_dirtied(page, mapping);
663 		radix_tree_tag_set(&mapping->page_tree,
664 				page_index(page), PAGECACHE_TAG_DIRTY);
665 	}
666 	spin_unlock_irqrestore(&mapping->tree_lock, flags);
667 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
668 }
669
670 /*
671  * Add a page to the dirty page list.
672  *
673  * It is a sad fact of life that this function is called from several places
674  * deeply under spinlocking. It may not sleep.
675  *
676  * If the page has buffers, the uptodate buffers are set dirty, to preserve
677  * dirty-state coherency between the page and the buffers. It the page does
678  * not have buffers then when they are later attached they will all be set
679  * dirty.
680  *
681  * The buffers are dirtied before the page is dirtied. There's a small race
682  * window in which a writepage caller may see the page cleanness but not the
683  * buffer dirtiness. That's fine. If this code were to set the page dirty
684  * before the buffers, a concurrent writepage caller could clear the page dirty
685  * bit, see a bunch of clean buffers and we'd end up with dirty buffers/clean
686  * page on the dirty page list.
687  *
688  * We use private_lock to lock against try_to_free_buffers while using the
689  * page's buffer list. Also use this to protect against clean buffers being
690  * added to the page after it was set dirty.
691  *
692  * FIXME: may need to call ->reservepage here as well. That's rather up to the
693  * address_space though.
694  */
695 int __set_page_dirty_buffers(struct page *page)
696 {
697 	int newly_dirty;
698 	struct address_space *mapping = page_mapping(page);
699
700 	if (unlikely(!mapping))
701 		return !TestSetPageDirty(page);
702
703 	spin_lock(&mapping->private_lock);
704 	if (page_has_buffers(page)) {
705 		struct buffer_head *head = page_buffers(page);
706 		struct buffer_head *bh = head;
707
708 		do {
709 			set_buffer_dirty(bh);
710 			bh = bh->b_this_page;
711 		} while (bh != head);
712 	}
713 	newly_dirty = !TestSetPageDirty(page);
714 	spin_unlock(&mapping->private_lock);
715
716 	if (newly_dirty)
717 		__set_page_dirty(page, mapping, 1);
718 	return newly_dirty;
719 }
720 EXPORT_SYMBOL(__set_page_dirty_buffers);
721
722 /*
723  * Write out and wait upon a list of buffers.
724  *
725  * We have conflicting pressures: we want to make sure that all
726  * initially dirty buffers get waited on, but that any subsequently
727  * dirtied buffers don't. After all, we don't want fsync to last
728  * forever if somebody is actively writing to the file.
729  *
730  * Do this in two main stages: first we copy dirty buffers to a
731  * temporary inode list, queueing the writes as we go. Then we clean
732  * up, waiting for those writes to complete.
733  *
734  * During this second stage, any subsequent updates to the file may end
735  * up refiling the buffer on the original inode's dirty list again, so
736  * there is a chance we will end up with a buffer queued for write but
737  * not yet completed on that list. So, as a final cleanup we go through
738  * the osync code to catch these locked, dirty buffers without requeuing
739  * any newly dirty buffers for write.
740  */
741 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
742 {
743 	struct buffer_head *bh;
744 	struct list_head tmp;
745 	struct address_space *mapping;
746 	int err = 0, err2;
747 	struct blk_plug plug;
748
749 	INIT_LIST_HEAD(&tmp);
750 	blk_start_plug(&plug);
751
752 	spin_lock(lock);
753 	while (!list_empty(list)) {
754 		bh = BH_ENTRY(list->next);
755 		mapping = bh->b_assoc_map;
756 		__remove_assoc_queue(bh);
757 		/* Avoid race with mark_buffer_dirty_inode() which does
758 		 * a lockless check and we rely on seeing the dirty bit */
759 		smp_mb();
760 		if (buffer_dirty(bh) || buffer_locked(bh)) {
761 			list_add(&bh->b_assoc_buffers, &tmp);
762 			bh->b_assoc_map = mapping;
763 			if (buffer_dirty(bh)) {
764 				get_bh(bh);
765 				spin_unlock(lock);
766 				/*
767 				 * Ensure any pending I/O completes so that
768 				 * write_dirty_buffer() actually writes the
769 				 * current contents - it is a noop if I/O is
770 				 * still in flight on potentially older
771 				 * contents.
772 				 */
773 				write_dirty_buffer(bh, WRITE_SYNC);
774
775 				/*
776 				 * Kick off IO for the previous mapping. Note
777 				 * that we will not run the very last mapping,
778 				 * wait_on_buffer() will do that for us
779 				 * through sync_buffer().
780 				 */
781 				brelse(bh);
782 				spin_lock(lock);
783 			}
784 		}
785 	}
786
787 	spin_unlock(lock);
788 	blk_finish_plug(&plug);
789 	spin_lock(lock);
790
791 	while (!list_empty(&tmp)) {
792 		bh = BH_ENTRY(tmp.prev);
793 		get_bh(bh);
794 		mapping = bh->b_assoc_map;
795 		__remove_assoc_queue(bh);
796 		/* Avoid race with mark_buffer_dirty_inode() which does
797 		 * a lockless check and we rely on seeing the dirty bit */
798 		smp_mb();
799 		if (buffer_dirty(bh)) {
800 			list_add(&bh->b_assoc_buffers,
801 				&mapping->private_list);
802 			bh->b_assoc_map = mapping;
803 		}
804 		spin_unlock(lock);
805 		wait_on_buffer(bh);
806 		if (!buffer_uptodate(bh))
807 			err = -EIO;
808 		brelse(bh);
809 		spin_lock(lock);
810 	}
811
812 	spin_unlock(lock);
813 	err2 = osync_buffers_list(lock, list);
814 	if (err)
815 		return err;
816 	else
817 		return err2;
818 }
819
820 /*
821  * Invalidate any and all dirty buffers on a given inode. We are
822  * probably unmounting the fs, but that doesn't mean we have already
823  * done a sync(). Just drop the buffers from the inode list.
824  *
825  * NOTE: we take the inode's blockdev's mapping's private_lock. Which
826  * assumes that all the buffers are against the blockdev. Not true
827  * for reiserfs.
828  */
829 void invalidate_inode_buffers(struct inode *inode)
830 {
831 	if (inode_has_buffers(inode)) {
832 		struct address_space *mapping = &inode->i_data;
833 		struct list_head *list = &mapping->private_list;
834 		struct address_space *buffer_mapping = mapping->private_data;
835
836 		spin_lock(&buffer_mapping->private_lock);
837 		while (!list_empty(list))
838 			__remove_assoc_queue(BH_ENTRY(list->next));
839 		spin_unlock(&buffer_mapping->private_lock);
840 	}
841 }
842 EXPORT_SYMBOL(invalidate_inode_buffers);
843
844 /*
845  * Remove any clean buffers from the inode's buffer list. This is called
846  * when we're trying to free the inode itself. Those buffers can pin it.
847  *
848  * Returns true if all buffers were removed.
849  */
850 int remove_inode_buffers(struct inode *inode)
851 {
852 	int ret = 1;
853
854 	if (inode_has_buffers(inode)) {
855 		struct address_space *mapping = &inode->i_data;
856 		struct list_head *list = &mapping->private_list;
857 		struct address_space *buffer_mapping = mapping->private_data;
858
859 		spin_lock(&buffer_mapping->private_lock);
860 		while (!list_empty(list)) {
861 			struct buffer_head *bh = BH_ENTRY(list->next);
862 			if (buffer_dirty(bh)) {
863 				ret = 0;
864 				break;
865 			}
866 			__remove_assoc_queue(bh);
867 		}
868 		spin_unlock(&buffer_mapping->private_lock);
869 	}
870 	return ret;
871 }
872
873 /*
874  * Create the appropriate buffers when given a page for data area and
875  * the size of each buffer.. Use the bh->b_this_page linked list to
876  * follow the buffers created. Return NULL if unable to create more
877  * buffers.
878  *
879  * The retry flag is used to differentiate async IO (paging, swapping)
880  * which may not fail from ordinary buffer allocations.
881  */
882 struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
883 		int retry)
884 {
885 	struct buffer_head *bh, *head;
886 	long offset;
887
888 try_again:
889 	head = NULL;
890 	offset = PAGE_SIZE;
891 	while ((offset -= size) >= 0) {
892 		bh = alloc_buffer_head(GFP_NOFS);
893 		if (!bh)
894 			goto no_grow;
895
896 		bh->b_this_page = head;
897 		bh->b_blocknr = -1;
898 		head = bh;
899
900 		bh->b_size = size;
901
902 		/* Link the buffer to its page */
903 		set_bh_page(bh, page, offset);
904 	}
905 	return head;
906 /*
907  * In case anything failed, we just free everything we got.
908  */
909 no_grow:
910 	if (head) {
911 		do {
912 			bh = head;
913 			head = head->b_this_page;
914 			free_buffer_head(bh);
915 		} while (head);
916 	}
917
918 	/*
919 	 * Return failure for non-async IO requests. Async IO requests
920 	 * are not allowed to fail, so we have to wait until buffer heads
921 	 * become available. But we don't want tasks sleeping with
922 	 * partially complete buffers, so all were released above.
923 	 */
924 	if (!retry)
925 		return NULL;
926
927 	/* We're _really_ low on memory. Now we just
928 	 * wait for old buffer heads to become free due to
929 	 * finishing IO. Since this is an async request and
930 	 * the reserve list is empty, we're sure there are
931 	 * async buffer heads in use.
932 	 */
933 	free_more_memory();
934 	goto try_again;
935 }
936 EXPORT_SYMBOL_GPL(alloc_page_buffers);
937
938 static inline void
939 link_dev_buffers(struct page *page, struct buffer_head *head)
940 {
941 	struct buffer_head *bh, *tail;
942
943 	bh = head;
944 	do {
945 		tail = bh;
946 		bh = bh->b_this_page;
947 	} while (bh);
948 	tail->b_this_page = head;
949 	attach_page_buffers(page, head);
950 }
951
952 static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
953 {
954 	sector_t retval = ~((sector_t)0);
955 	loff_t sz = i_size_read(bdev->bd_inode);
956
957 	if (sz) {
958 		unsigned int sizebits = blksize_bits(size);
959 		retval = (sz >> sizebits);
960 	}
961 	return retval;
962 }
963
964 /*
965  * Initialise the state of a blockdev page's buffers.
966  */
967 static sector_t
968 init_page_buffers(struct page *page, struct block_device *bdev,
969 			sector_t block, int size)
970 {
971 	struct buffer_head *head = page_buffers(page);
972 	struct buffer_head *bh = head;
973 	int uptodate = PageUptodate(page);
974 	sector_t end_block = blkdev_max_block(I_BDEV(bdev->bd_inode), size);
975
976 	do {
977 		if (!buffer_mapped(bh)) {
978 			init_buffer(bh, NULL, NULL);
979 			bh->b_bdev = bdev;
980 			bh->b_blocknr = block;
981 			if (uptodate)
982 				set_buffer_uptodate(bh);
983 			if (block < end_block)
984 				set_buffer_mapped(bh);
985 		}
986 		block++;
987 		bh = bh->b_this_page;
988 	} while (bh != head);
989
990 	/*
991 	 * Caller needs to validate requested block against end of device.
992 	 */
993 	return end_block;
994 }
995
996 /*
997  * Create the page-cache page that contains the requested block.
998  *
999  * This is used purely for blockdev mappings.
1000  */
1001 static int
1002 grow_dev_page(struct block_device *bdev, sector_t block,
1003 		pgoff_t index, int size, int sizebits)
1004 {
1005 	struct inode *inode = bdev->bd_inode;
1006 	struct page *page;
1007 	struct buffer_head *bh;
1008 	sector_t end_block;
1009 	int ret = 0;		/* Will call free_more_memory() */
1010 	gfp_t gfp_mask;
1011
1012 	gfp_mask = mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS;
1013 	gfp_mask |= __GFP_MOVABLE;
1014 	/*
1015 	 * XXX: __getblk_slow() can not really deal with failure and
1016 	 * will endlessly loop on improvised global reclaim. Prefer
1017 	 * looping in the allocator rather than here, at least that
1018 	 * code knows what it's doing.
1019 	 */
1020 	gfp_mask |= __GFP_NOFAIL;
1021
1022 	page = find_or_create_page(inode->i_mapping, index, gfp_mask);
1023 	if (!page)
1024 		return ret;
1025
1026 	BUG_ON(!PageLocked(page));
1027
1028 	if (page_has_buffers(page)) {
1029 		bh = page_buffers(page);
1030 		if (bh->b_size == size) {
1031 			end_block = init_page_buffers(page, bdev,
1032 						index << sizebits, size);
1033 			goto done;
1034 		}
1035 		if (!try_to_free_buffers(page))
1036 			goto failed;
1037 	}
1038
1039 	/*
1040 	 * Allocate some buffers for this page
1041 	 */
1042 	bh = alloc_page_buffers(page, size, 0);
1043 	if (!bh)
1044 		goto failed;
1045
1046 	/*
1047 	 * Link the page to the buffers and initialise them. Take the
1048 	 * lock to be atomic wrt __find_get_block(), which does not
1049 	 * run under the page lock.
1050 	 */
1051 	spin_lock(&inode->i_mapping->private_lock);
1052 	link_dev_buffers(page, bh);
1053 	end_block = init_page_buffers(page, bdev, index << sizebits, size);
1054 	spin_unlock(&inode->i_mapping->private_lock);
1055 done:
1056 	ret = (block < end_block) ? 1 : -ENXIO;
1057 failed:
1058 	unlock_page(page);
1059 	page_cache_release(page);
1060 	return ret;
1061 }
1062
1063 /*
1064  * Create buffers for the specified block device block's page. If
1065  * that page was dirty, the buffers are set dirty also.
1066  */
1067 static int
1068 grow_buffers(struct block_device *bdev, sector_t block, int size) 1068 grow_buffers(struct block_device *bdev, sector_t block, int size)
1069 { 1069 {
1070 pgoff_t index; 1070 pgoff_t index;
1071 int sizebits; 1071 int sizebits;
1072 1072
1073 sizebits = -1; 1073 sizebits = -1;
1074 do { 1074 do {
1075 sizebits++; 1075 sizebits++;
1076 } while ((size << sizebits) < PAGE_SIZE); 1076 } while ((size << sizebits) < PAGE_SIZE);
1077 1077
1078 index = block >> sizebits; 1078 index = block >> sizebits;
1079 1079
1080 /* 1080 /*
1081 * Check for a block which wants to lie outside our maximum possible 1081 * Check for a block which wants to lie outside our maximum possible
1082 * pagecache index. (this comparison is done using sector_t types). 1082 * pagecache index. (this comparison is done using sector_t types).
1083 */ 1083 */
1084 if (unlikely(index != block >> sizebits)) { 1084 if (unlikely(index != block >> sizebits)) {
1085 char b[BDEVNAME_SIZE]; 1085 char b[BDEVNAME_SIZE];
1086 1086
1087 printk(KERN_ERR "%s: requested out-of-range block %llu for " 1087 printk(KERN_ERR "%s: requested out-of-range block %llu for "
1088 "device %s\n", 1088 "device %s\n",
1089 __func__, (unsigned long long)block, 1089 __func__, (unsigned long long)block,
1090 bdevname(bdev, b)); 1090 bdevname(bdev, b));
1091 return -EIO; 1091 return -EIO;
1092 } 1092 }
1093 1093
1094 /* Create a page with the proper size buffers.. */ 1094 /* Create a page with the proper size buffers.. */
1095 return grow_dev_page(bdev, block, index, size, sizebits); 1095 return grow_dev_page(bdev, block, index, size, sizebits);
1096 } 1096 }
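To make the sizebits/index computation above concrete, here is a standalone sketch with hypothetical numbers: with 512-byte blocks on a 4096-byte page, sizebits ends up as 3, block 12345 lands in page index 1543, and (index << sizebits) = 12344 is the first block that grow_dev_page()/init_page_buffers() start from.

	/* Sketch of the block -> page-index mapping used by grow_buffers().
	 * With 512-byte blocks and 4096-byte pages, eight blocks share one
	 * page-cache page. Hypothetical values only. */
	#include <stdio.h>
	#include <stdint.h>

	#define PAGE_SIZE_MODEL 4096u

	int main(void)
	{
		unsigned int size = 512;	/* block size */
		uint64_t block = 12345;		/* requested block number */
		int sizebits = -1;

		do {
			sizebits++;
		} while ((size << sizebits) < PAGE_SIZE_MODEL);

		/* index 1543 holds blocks 12344..12351; (index << sizebits) is
		 * the first block of that page */
		printf("sizebits=%d index=%llu first_block=%llu\n", sizebits,
		       (unsigned long long)(block >> sizebits),
		       (unsigned long long)((block >> sizebits) << sizebits));
		return 0;
	}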
1097 1097
1098 static struct buffer_head * 1098 static struct buffer_head *
1099 __getblk_slow(struct block_device *bdev, sector_t block, int size) 1099 __getblk_slow(struct block_device *bdev, sector_t block, int size)
1100 { 1100 {
1101 /* Size must be multiple of hard sectorsize */ 1101 /* Size must be multiple of hard sectorsize */
1102 if (unlikely(size & (bdev_logical_block_size(bdev)-1) || 1102 if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
1103 (size < 512 || size > PAGE_SIZE))) { 1103 (size < 512 || size > PAGE_SIZE))) {
1104 printk(KERN_ERR "getblk(): invalid block size %d requested\n", 1104 printk(KERN_ERR "getblk(): invalid block size %d requested\n",
1105 size); 1105 size);
1106 printk(KERN_ERR "logical block size: %d\n", 1106 printk(KERN_ERR "logical block size: %d\n",
1107 bdev_logical_block_size(bdev)); 1107 bdev_logical_block_size(bdev));
1108 1108
1109 dump_stack(); 1109 dump_stack();
1110 return NULL; 1110 return NULL;
1111 } 1111 }
1112 1112
1113 for (;;) { 1113 for (;;) {
1114 struct buffer_head *bh; 1114 struct buffer_head *bh;
1115 int ret; 1115 int ret;
1116 1116
1117 bh = __find_get_block(bdev, block, size); 1117 bh = __find_get_block(bdev, block, size);
1118 if (bh) 1118 if (bh)
1119 return bh; 1119 return bh;
1120 1120
1121 ret = grow_buffers(bdev, block, size); 1121 ret = grow_buffers(bdev, block, size);
1122 if (ret < 0) 1122 if (ret < 0)
1123 return NULL; 1123 return NULL;
1124 if (ret == 0) 1124 if (ret == 0)
1125 free_more_memory(); 1125 free_more_memory();
1126 } 1126 }
1127 } 1127 }
1128 1128
1129 /* 1129 /*
1130 * The relationship between dirty buffers and dirty pages: 1130 * The relationship between dirty buffers and dirty pages:
1131 * 1131 *
1132 * Whenever a page has any dirty buffers, the page's dirty bit is set, and 1132 * Whenever a page has any dirty buffers, the page's dirty bit is set, and
1133 * the page is tagged dirty in its radix tree. 1133 * the page is tagged dirty in its radix tree.
1134 * 1134 *
1135 * At all times, the dirtiness of the buffers represents the dirtiness of 1135 * At all times, the dirtiness of the buffers represents the dirtiness of
1136 * subsections of the page. If the page has buffers, the page dirty bit is 1136 * subsections of the page. If the page has buffers, the page dirty bit is
1137 * merely a hint about the true dirty state. 1137 * merely a hint about the true dirty state.
1138 * 1138 *
1139 * When a page is set dirty in its entirety, all its buffers are marked dirty 1139 * When a page is set dirty in its entirety, all its buffers are marked dirty
1140 * (if the page has buffers). 1140 * (if the page has buffers).
1141 * 1141 *
1142 * When a buffer is marked dirty, its page is dirtied, but the page's other 1142 * When a buffer is marked dirty, its page is dirtied, but the page's other
1143 * buffers are not. 1143 * buffers are not.
1144 * 1144 *
1145 * Also. When blockdev buffers are explicitly read with bread(), they 1145 * Also. When blockdev buffers are explicitly read with bread(), they
1146 * individually become uptodate. But their backing page remains not 1146 * individually become uptodate. But their backing page remains not
1147 * uptodate - even if all of its buffers are uptodate. A subsequent 1147 * uptodate - even if all of its buffers are uptodate. A subsequent
1148 * block_read_full_page() against that page will discover all the uptodate 1148 * block_read_full_page() against that page will discover all the uptodate
1149 * buffers, will set the page uptodate and will perform no I/O. 1149 * buffers, will set the page uptodate and will perform no I/O.
1150 */ 1150 */
1151 1151
1152 /** 1152 /**
1153 * mark_buffer_dirty - mark a buffer_head as needing writeout 1153 * mark_buffer_dirty - mark a buffer_head as needing writeout
1154 * @bh: the buffer_head to mark dirty 1154 * @bh: the buffer_head to mark dirty
1155 * 1155 *
1156 * mark_buffer_dirty() will set the dirty bit against the buffer, then set its 1156 * mark_buffer_dirty() will set the dirty bit against the buffer, then set its
1157 * backing page dirty, then tag the page as dirty in its address_space's radix 1157 * backing page dirty, then tag the page as dirty in its address_space's radix
1158 * tree and then attach the address_space's inode to its superblock's dirty 1158 * tree and then attach the address_space's inode to its superblock's dirty
1159 * inode list. 1159 * inode list.
1160 * 1160 *
1161 * mark_buffer_dirty() is atomic. It takes bh->b_page->mapping->private_lock, 1161 * mark_buffer_dirty() is atomic. It takes bh->b_page->mapping->private_lock,
1162 * mapping->tree_lock and mapping->host->i_lock. 1162 * mapping->tree_lock and mapping->host->i_lock.
1163 */ 1163 */
1164 void mark_buffer_dirty(struct buffer_head *bh) 1164 void mark_buffer_dirty(struct buffer_head *bh)
1165 { 1165 {
1166 WARN_ON_ONCE(!buffer_uptodate(bh)); 1166 WARN_ON_ONCE(!buffer_uptodate(bh));
1167 1167
1168 trace_block_dirty_buffer(bh); 1168 trace_block_dirty_buffer(bh);
1169 1169
1170 /* 1170 /*
1171 * Very *carefully* optimize the it-is-already-dirty case. 1171 * Very *carefully* optimize the it-is-already-dirty case.
1172 * 1172 *
1173 * Don't let the final "is it dirty" escape to before we 1173 * Don't let the final "is it dirty" escape to before we
1174 * perhaps modified the buffer. 1174 * perhaps modified the buffer.
1175 */ 1175 */
1176 if (buffer_dirty(bh)) { 1176 if (buffer_dirty(bh)) {
1177 smp_mb(); 1177 smp_mb();
1178 if (buffer_dirty(bh)) 1178 if (buffer_dirty(bh))
1179 return; 1179 return;
1180 } 1180 }
1181 1181
1182 if (!test_set_buffer_dirty(bh)) { 1182 if (!test_set_buffer_dirty(bh)) {
1183 struct page *page = bh->b_page; 1183 struct page *page = bh->b_page;
1184 if (!TestSetPageDirty(page)) { 1184 if (!TestSetPageDirty(page)) {
1185 struct address_space *mapping = page_mapping(page); 1185 struct address_space *mapping = page_mapping(page);
1186 if (mapping) 1186 if (mapping)
1187 __set_page_dirty(page, mapping, 0); 1187 __set_page_dirty(page, mapping, 0);
1188 } 1188 }
1189 } 1189 }
1190 } 1190 }
1191 EXPORT_SYMBOL(mark_buffer_dirty); 1191 EXPORT_SYMBOL(mark_buffer_dirty);
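A hedged sketch of the usual caller pattern, read-modify-dirty on a metadata block: only the buffer-head calls are real kernel APIs here, while the on-disk structure, helper name and block number are invented for illustration.

	/* Hedged sketch of a typical mark_buffer_dirty() caller: a filesystem
	 * updates an on-disk structure through bh->b_data and lets writeback
	 * (or an explicit sync) push it out later. */
	#include <linux/buffer_head.h>
	#include <linux/blkdev.h>

	typedef __le32 my_fs_counter_t;

	static int my_fs_bump_counter(struct block_device *bdev, sector_t block)
	{
		struct buffer_head *bh = __bread(bdev, block, 4096);
		my_fs_counter_t *ctr;

		if (!bh)
			return -EIO;

		ctr = (my_fs_counter_t *)bh->b_data;
		le32_add_cpu(ctr, 1);		/* modify the in-memory copy ... */
		mark_buffer_dirty(bh);		/* ... then flag buffer and page dirty */
		/* sync_dirty_buffer(bh) could be used here for a synchronous write */
		brelse(bh);
		return 0;
	}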
1192 1192
1193 /* 1193 /*
1194 * Decrement a buffer_head's reference count. If all buffers against a page 1194 * Decrement a buffer_head's reference count. If all buffers against a page
1195 * have zero reference count, are clean and unlocked, and if the page is clean 1195 * have zero reference count, are clean and unlocked, and if the page is clean
1196 * and unlocked then try_to_free_buffers() may strip the buffers from the page 1196 * and unlocked then try_to_free_buffers() may strip the buffers from the page
1197 * in preparation for freeing it (sometimes, rarely, buffers are removed from 1197 * in preparation for freeing it (sometimes, rarely, buffers are removed from
1198 * a page but it ends up not being freed, and buffers may later be reattached). 1198 * a page but it ends up not being freed, and buffers may later be reattached).
1199 */ 1199 */
1200 void __brelse(struct buffer_head * buf) 1200 void __brelse(struct buffer_head * buf)
1201 { 1201 {
1202 if (atomic_read(&buf->b_count)) { 1202 if (atomic_read(&buf->b_count)) {
1203 put_bh(buf); 1203 put_bh(buf);
1204 return; 1204 return;
1205 } 1205 }
1206 WARN(1, KERN_ERR "VFS: brelse: Trying to free free buffer\n"); 1206 WARN(1, KERN_ERR "VFS: brelse: Trying to free free buffer\n");
1207 } 1207 }
1208 EXPORT_SYMBOL(__brelse); 1208 EXPORT_SYMBOL(__brelse);
1209 1209
1210 /* 1210 /*
1211 * bforget() is like brelse(), except it discards any 1211 * bforget() is like brelse(), except it discards any
1212 * potentially dirty data. 1212 * potentially dirty data.
1213 */ 1213 */
1214 void __bforget(struct buffer_head *bh) 1214 void __bforget(struct buffer_head *bh)
1215 { 1215 {
1216 clear_buffer_dirty(bh); 1216 clear_buffer_dirty(bh);
1217 if (bh->b_assoc_map) { 1217 if (bh->b_assoc_map) {
1218 struct address_space *buffer_mapping = bh->b_page->mapping; 1218 struct address_space *buffer_mapping = bh->b_page->mapping;
1219 1219
1220 spin_lock(&buffer_mapping->private_lock); 1220 spin_lock(&buffer_mapping->private_lock);
1221 list_del_init(&bh->b_assoc_buffers); 1221 list_del_init(&bh->b_assoc_buffers);
1222 bh->b_assoc_map = NULL; 1222 bh->b_assoc_map = NULL;
1223 spin_unlock(&buffer_mapping->private_lock); 1223 spin_unlock(&buffer_mapping->private_lock);
1224 } 1224 }
1225 __brelse(bh); 1225 __brelse(bh);
1226 } 1226 }
1227 EXPORT_SYMBOL(__bforget); 1227 EXPORT_SYMBOL(__bforget);
1228 1228
1229 static struct buffer_head *__bread_slow(struct buffer_head *bh) 1229 static struct buffer_head *__bread_slow(struct buffer_head *bh)
1230 { 1230 {
1231 lock_buffer(bh); 1231 lock_buffer(bh);
1232 if (buffer_uptodate(bh)) { 1232 if (buffer_uptodate(bh)) {
1233 unlock_buffer(bh); 1233 unlock_buffer(bh);
1234 return bh; 1234 return bh;
1235 } else { 1235 } else {
1236 get_bh(bh); 1236 get_bh(bh);
1237 bh->b_end_io = end_buffer_read_sync; 1237 bh->b_end_io = end_buffer_read_sync;
1238 submit_bh(READ, bh); 1238 submit_bh(READ, bh);
1239 wait_on_buffer(bh); 1239 wait_on_buffer(bh);
1240 if (buffer_uptodate(bh)) 1240 if (buffer_uptodate(bh))
1241 return bh; 1241 return bh;
1242 } 1242 }
1243 brelse(bh); 1243 brelse(bh);
1244 return NULL; 1244 return NULL;
1245 } 1245 }
1246 1246
1247 /* 1247 /*
1248 * Per-cpu buffer LRU implementation. To reduce the cost of __find_get_block(). 1248 * Per-cpu buffer LRU implementation. To reduce the cost of __find_get_block().
1249 * The bhs[] array is sorted - newest buffer is at bhs[0]. Buffers have their 1249 * The bhs[] array is sorted - newest buffer is at bhs[0]. Buffers have their
1250 * refcount elevated by one when they're in an LRU. A buffer can only appear 1250 * refcount elevated by one when they're in an LRU. A buffer can only appear
1251 * once in a particular CPU's LRU. A single buffer can be present in multiple 1251 * once in a particular CPU's LRU. A single buffer can be present in multiple
1252 * CPU's LRUs at the same time. 1252 * CPU's LRUs at the same time.
1253 * 1253 *
1254 * This is a transparent caching front-end to sb_bread(), sb_getblk() and 1254 * This is a transparent caching front-end to sb_bread(), sb_getblk() and
1255 * sb_find_get_block(). 1255 * sb_find_get_block().
1256 * 1256 *
1257 * The LRUs themselves only need locking against invalidate_bh_lrus. We use 1257 * The LRUs themselves only need locking against invalidate_bh_lrus. We use
1258 * a local interrupt disable for that. 1258 * a local interrupt disable for that.
1259 */ 1259 */
1260 1260
1261 #define BH_LRU_SIZE 8 1261 #define BH_LRU_SIZE 8
1262 1262
1263 struct bh_lru { 1263 struct bh_lru {
1264 struct buffer_head *bhs[BH_LRU_SIZE]; 1264 struct buffer_head *bhs[BH_LRU_SIZE];
1265 }; 1265 };
1266 1266
1267 static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }}; 1267 static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }};
1268 1268
1269 #ifdef CONFIG_SMP 1269 #ifdef CONFIG_SMP
1270 #define bh_lru_lock() local_irq_disable() 1270 #define bh_lru_lock() local_irq_disable()
1271 #define bh_lru_unlock() local_irq_enable() 1271 #define bh_lru_unlock() local_irq_enable()
1272 #else 1272 #else
1273 #define bh_lru_lock() preempt_disable() 1273 #define bh_lru_lock() preempt_disable()
1274 #define bh_lru_unlock() preempt_enable() 1274 #define bh_lru_unlock() preempt_enable()
1275 #endif 1275 #endif
1276 1276
1277 static inline void check_irqs_on(void) 1277 static inline void check_irqs_on(void)
1278 { 1278 {
1279 #ifdef irqs_disabled 1279 #ifdef irqs_disabled
1280 BUG_ON(irqs_disabled()); 1280 BUG_ON(irqs_disabled());
1281 #endif 1281 #endif
1282 } 1282 }
1283 1283
1284 /* 1284 /*
1285 * The LRU management algorithm is dopey-but-simple. Sorry. 1285 * The LRU management algorithm is dopey-but-simple. Sorry.
1286 */ 1286 */
1287 static void bh_lru_install(struct buffer_head *bh) 1287 static void bh_lru_install(struct buffer_head *bh)
1288 { 1288 {
1289 struct buffer_head *evictee = NULL; 1289 struct buffer_head *evictee = NULL;
1290 1290
1291 check_irqs_on(); 1291 check_irqs_on();
1292 bh_lru_lock(); 1292 bh_lru_lock();
1293 if (__this_cpu_read(bh_lrus.bhs[0]) != bh) { 1293 if (__this_cpu_read(bh_lrus.bhs[0]) != bh) {
1294 struct buffer_head *bhs[BH_LRU_SIZE]; 1294 struct buffer_head *bhs[BH_LRU_SIZE];
1295 int in; 1295 int in;
1296 int out = 0; 1296 int out = 0;
1297 1297
1298 get_bh(bh); 1298 get_bh(bh);
1299 bhs[out++] = bh; 1299 bhs[out++] = bh;
1300 for (in = 0; in < BH_LRU_SIZE; in++) { 1300 for (in = 0; in < BH_LRU_SIZE; in++) {
1301 struct buffer_head *bh2 = 1301 struct buffer_head *bh2 =
1302 __this_cpu_read(bh_lrus.bhs[in]); 1302 __this_cpu_read(bh_lrus.bhs[in]);
1303 1303
1304 if (bh2 == bh) { 1304 if (bh2 == bh) {
1305 __brelse(bh2); 1305 __brelse(bh2);
1306 } else { 1306 } else {
1307 if (out >= BH_LRU_SIZE) { 1307 if (out >= BH_LRU_SIZE) {
1308 BUG_ON(evictee != NULL); 1308 BUG_ON(evictee != NULL);
1309 evictee = bh2; 1309 evictee = bh2;
1310 } else { 1310 } else {
1311 bhs[out++] = bh2; 1311 bhs[out++] = bh2;
1312 } 1312 }
1313 } 1313 }
1314 } 1314 }
1315 while (out < BH_LRU_SIZE) 1315 while (out < BH_LRU_SIZE)
1316 bhs[out++] = NULL; 1316 bhs[out++] = NULL;
1317 memcpy(__this_cpu_ptr(&bh_lrus.bhs), bhs, sizeof(bhs)); 1317 memcpy(__this_cpu_ptr(&bh_lrus.bhs), bhs, sizeof(bhs));
1318 } 1318 }
1319 bh_lru_unlock(); 1319 bh_lru_unlock();
1320 1320
1321 if (evictee) 1321 if (evictee)
1322 __brelse(evictee); 1322 __brelse(evictee);
1323 } 1323 }
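The install routine above is just a move-to-front insert into a fixed eight-slot array. A userspace model of the same idea (per-CPU storage, locking and refcounting deliberately omitted, so this is an illustration rather than the kernel algorithm verbatim):

	/* Userspace model of the "dopey-but-simple" move-to-front LRU in
	 * bh_lru_install(): slot 0 is the most recent entry, the oldest
	 * entry falls off the end. */
	#include <stdio.h>
	#include <string.h>

	#define LRU_SIZE 8

	static void *lru[LRU_SIZE];

	static void lru_install(void *item)
	{
		void *tmp[LRU_SIZE] = { item };
		int in, out = 1;

		if (lru[0] == item)
			return;			/* already most recent */

		for (in = 0; in < LRU_SIZE && out < LRU_SIZE; in++) {
			if (lru[in] && lru[in] != item)
				tmp[out++] = lru[in];	/* keep, shifted down */
			/* a duplicate of item is dropped; the overflow entry
			 * (the evictee) simply never gets copied */
		}
		memcpy(lru, tmp, sizeof(lru));
	}

	int main(void)
	{
		int a, b, c;

		lru_install(&a);
		lru_install(&b);
		lru_install(&a);	/* moves &a back to slot 0 */
		lru_install(&c);
		printf("%d\n", lru[0] == &c && lru[1] == &a && lru[2] == &b);
		return 0;
	}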
1324 1324
1325 /* 1325 /*
1326 * Look up the bh in this cpu's LRU. If it's there, move it to the head. 1326 * Look up the bh in this cpu's LRU. If it's there, move it to the head.
1327 */ 1327 */
1328 static struct buffer_head * 1328 static struct buffer_head *
1329 lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size) 1329 lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
1330 { 1330 {
1331 struct buffer_head *ret = NULL; 1331 struct buffer_head *ret = NULL;
1332 unsigned int i; 1332 unsigned int i;
1333 1333
1334 check_irqs_on(); 1334 check_irqs_on();
1335 bh_lru_lock(); 1335 bh_lru_lock();
1336 for (i = 0; i < BH_LRU_SIZE; i++) { 1336 for (i = 0; i < BH_LRU_SIZE; i++) {
1337 struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]); 1337 struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
1338 1338
1339 if (bh && bh->b_bdev == bdev && 1339 if (bh && bh->b_bdev == bdev &&
1340 bh->b_blocknr == block && bh->b_size == size) { 1340 bh->b_blocknr == block && bh->b_size == size) {
1341 if (i) { 1341 if (i) {
1342 while (i) { 1342 while (i) {
1343 __this_cpu_write(bh_lrus.bhs[i], 1343 __this_cpu_write(bh_lrus.bhs[i],
1344 __this_cpu_read(bh_lrus.bhs[i - 1])); 1344 __this_cpu_read(bh_lrus.bhs[i - 1]));
1345 i--; 1345 i--;
1346 } 1346 }
1347 __this_cpu_write(bh_lrus.bhs[0], bh); 1347 __this_cpu_write(bh_lrus.bhs[0], bh);
1348 } 1348 }
1349 get_bh(bh); 1349 get_bh(bh);
1350 ret = bh; 1350 ret = bh;
1351 break; 1351 break;
1352 } 1352 }
1353 } 1353 }
1354 bh_lru_unlock(); 1354 bh_lru_unlock();
1355 return ret; 1355 return ret;
1356 } 1356 }
1357 1357
1358 /* 1358 /*
1359 * Perform a pagecache lookup for the matching buffer. If it's there, refresh 1359 * Perform a pagecache lookup for the matching buffer. If it's there, refresh
1360 * it in the LRU and mark it as accessed. If it is not present then return 1360 * it in the LRU and mark it as accessed. If it is not present then return
1361 * NULL 1361 * NULL
1362 */ 1362 */
1363 struct buffer_head * 1363 struct buffer_head *
1364 __find_get_block(struct block_device *bdev, sector_t block, unsigned size) 1364 __find_get_block(struct block_device *bdev, sector_t block, unsigned size)
1365 { 1365 {
1366 struct buffer_head *bh = lookup_bh_lru(bdev, block, size); 1366 struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
1367 1367
1368 if (bh == NULL) { 1368 if (bh == NULL) {
1369 /* __find_get_block_slow will mark the page accessed */
1369 bh = __find_get_block_slow(bdev, block); 1370 bh = __find_get_block_slow(bdev, block);
1370 if (bh) 1371 if (bh)
1371 bh_lru_install(bh); 1372 bh_lru_install(bh);
1372 } 1373 } else
1373 if (bh)
1374 touch_buffer(bh); 1374 touch_buffer(bh);
1375
1375 return bh; 1376 return bh;
1376 } 1377 }
1377 EXPORT_SYMBOL(__find_get_block); 1378 EXPORT_SYMBOL(__find_get_block);
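Note the behaviour change in this hunk: touch_buffer() now runs only on an LRU hit, because on a miss __find_get_block_slow() goes through a page-cache lookup that already marks the page accessed, so repeating it would only add a redundant atomic operation. A hedged sketch of a caller, a lookup that must not allocate anything; the helper name and its use are hypothetical:

	/* Hedged sketch of a __find_get_block() caller: check whether a
	 * metadata block is already cached and dirty, without creating it. */
	#include <linux/buffer_head.h>

	static bool my_fs_block_is_cached_dirty(struct block_device *bdev,
						sector_t block, unsigned size)
	{
		struct buffer_head *bh = __find_get_block(bdev, block, size);
		bool ret = false;

		if (bh) {
			ret = buffer_dirty(bh);
			brelse(bh);	/* drop the reference the lookup took */
		}
		return ret;
	}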
1378 1379
1379 /* 1380 /*
1380 * __getblk will locate (and, if necessary, create) the buffer_head 1381 * __getblk will locate (and, if necessary, create) the buffer_head
1381 * which corresponds to the passed block_device, block and size. The 1382 * which corresponds to the passed block_device, block and size. The
1382 * returned buffer has its reference count incremented. 1383 * returned buffer has its reference count incremented.
1383 * 1384 *
1384 * __getblk() will lock up the machine if grow_dev_page's try_to_free_buffers() 1385 * __getblk() will lock up the machine if grow_dev_page's try_to_free_buffers()
1385 * attempt is failing. FIXME, perhaps? 1386 * attempt is failing. FIXME, perhaps?
1386 */ 1387 */
1387 struct buffer_head * 1388 struct buffer_head *
1388 __getblk(struct block_device *bdev, sector_t block, unsigned size) 1389 __getblk(struct block_device *bdev, sector_t block, unsigned size)
1389 { 1390 {
1390 struct buffer_head *bh = __find_get_block(bdev, block, size); 1391 struct buffer_head *bh = __find_get_block(bdev, block, size);
1391 1392
1392 might_sleep(); 1393 might_sleep();
1393 if (bh == NULL) 1394 if (bh == NULL)
1394 bh = __getblk_slow(bdev, block, size); 1395 bh = __getblk_slow(bdev, block, size);
1395 return bh; 1396 return bh;
1396 } 1397 }
1397 EXPORT_SYMBOL(__getblk); 1398 EXPORT_SYMBOL(__getblk);
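A hedged sketch of the classic __getblk() pattern for a block that will be completely overwritten, so nothing needs to be read before it is marked uptodate; the helper name and block contents are hypothetical:

	/* Hedged sketch: allocate (or find) the buffer, fill it entirely,
	 * then declare it uptodate and dirty so writeback will store it. */
	#include <linux/buffer_head.h>
	#include <linux/string.h>

	static int my_fs_zero_block(struct block_device *bdev, sector_t block,
				    unsigned size)
	{
		struct buffer_head *bh = __getblk(bdev, block, size);

		if (!bh)
			return -ENOMEM;

		lock_buffer(bh);
		memset(bh->b_data, 0, bh->b_size);
		set_buffer_uptodate(bh);	/* contents are now fully defined */
		unlock_buffer(bh);
		mark_buffer_dirty(bh);		/* queue it for writeback */
		brelse(bh);
		return 0;
	}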
1398 1399
1399 /* 1400 /*
1400 * Do async read-ahead on a buffer.. 1401 * Do async read-ahead on a buffer..
1401 */ 1402 */
1402 void __breadahead(struct block_device *bdev, sector_t block, unsigned size) 1403 void __breadahead(struct block_device *bdev, sector_t block, unsigned size)
1403 { 1404 {
1404 struct buffer_head *bh = __getblk(bdev, block, size); 1405 struct buffer_head *bh = __getblk(bdev, block, size);
1405 if (likely(bh)) { 1406 if (likely(bh)) {
1406 ll_rw_block(READA, 1, &bh); 1407 ll_rw_block(READA, 1, &bh);
1407 brelse(bh); 1408 brelse(bh);
1408 } 1409 }
1409 } 1410 }
1410 EXPORT_SYMBOL(__breadahead); 1411 EXPORT_SYMBOL(__breadahead);
1411 1412
1412 /** 1413 /**
1413 * __bread() - reads a specified block and returns the bh 1414 * __bread() - reads a specified block and returns the bh
1414 * @bdev: the block_device to read from 1415 * @bdev: the block_device to read from
1415 * @block: number of block 1416 * @block: number of block
1416 * @size: size (in bytes) to read 1417 * @size: size (in bytes) to read
1417 * 1418 *
1418 * Reads a specified block, and returns buffer head that contains it. 1419 * Reads a specified block, and returns buffer head that contains it.
1419 * It returns NULL if the block was unreadable. 1420 * It returns NULL if the block was unreadable.
1420 */ 1421 */
1421 struct buffer_head * 1422 struct buffer_head *
1422 __bread(struct block_device *bdev, sector_t block, unsigned size) 1423 __bread(struct block_device *bdev, sector_t block, unsigned size)
1423 { 1424 {
1424 struct buffer_head *bh = __getblk(bdev, block, size); 1425 struct buffer_head *bh = __getblk(bdev, block, size);
1425 1426
1426 if (likely(bh) && !buffer_uptodate(bh)) 1427 if (likely(bh) && !buffer_uptodate(bh))
1427 bh = __bread_slow(bh); 1428 bh = __bread_slow(bh);
1428 return bh; 1429 return bh;
1429 } 1430 }
1430 EXPORT_SYMBOL(__bread); 1431 EXPORT_SYMBOL(__bread);
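A hedged example of the synchronous __bread()/brelse() pattern; the superblock layout, magic value and block location are invented for illustration:

	/* Hedged sketch: read one block synchronously and inspect it. */
	#include <linux/buffer_head.h>

	struct my_fs_super { __le32 s_magic; };
	#define MY_FS_MAGIC 0x4d594653

	static int my_fs_check_super(struct block_device *bdev)
	{
		struct buffer_head *bh = __bread(bdev, 1, 1024);
		int ret = -EINVAL;

		if (!bh)
			return -EIO;	/* block was unreadable */
		if (le32_to_cpu(((struct my_fs_super *)bh->b_data)->s_magic) ==
		    MY_FS_MAGIC)
			ret = 0;
		brelse(bh);
		return ret;
	}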
1431 1432
1432 /* 1433 /*
1433 * invalidate_bh_lrus() is called rarely - but not only at unmount. 1434 * invalidate_bh_lrus() is called rarely - but not only at unmount.
1434 * This doesn't race because it runs in each cpu either in irq 1435 * This doesn't race because it runs in each cpu either in irq
1435 * or with preempt disabled. 1436 * or with preempt disabled.
1436 */ 1437 */
1437 static void invalidate_bh_lru(void *arg) 1438 static void invalidate_bh_lru(void *arg)
1438 { 1439 {
1439 struct bh_lru *b = &get_cpu_var(bh_lrus); 1440 struct bh_lru *b = &get_cpu_var(bh_lrus);
1440 int i; 1441 int i;
1441 1442
1442 for (i = 0; i < BH_LRU_SIZE; i++) { 1443 for (i = 0; i < BH_LRU_SIZE; i++) {
1443 brelse(b->bhs[i]); 1444 brelse(b->bhs[i]);
1444 b->bhs[i] = NULL; 1445 b->bhs[i] = NULL;
1445 } 1446 }
1446 put_cpu_var(bh_lrus); 1447 put_cpu_var(bh_lrus);
1447 } 1448 }
1448 1449
1449 static bool has_bh_in_lru(int cpu, void *dummy) 1450 static bool has_bh_in_lru(int cpu, void *dummy)
1450 { 1451 {
1451 struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu); 1452 struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
1452 int i; 1453 int i;
1453 1454
1454 for (i = 0; i < BH_LRU_SIZE; i++) { 1455 for (i = 0; i < BH_LRU_SIZE; i++) {
1455 if (b->bhs[i]) 1456 if (b->bhs[i])
1456 return 1; 1457 return 1;
1457 } 1458 }
1458 1459
1459 return 0; 1460 return 0;
1460 } 1461 }
1461 1462
1462 void invalidate_bh_lrus(void) 1463 void invalidate_bh_lrus(void)
1463 { 1464 {
1464 on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1, GFP_KERNEL); 1465 on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1, GFP_KERNEL);
1465 } 1466 }
1466 EXPORT_SYMBOL_GPL(invalidate_bh_lrus); 1467 EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
1467 1468
1468 void set_bh_page(struct buffer_head *bh, 1469 void set_bh_page(struct buffer_head *bh,
1469 struct page *page, unsigned long offset) 1470 struct page *page, unsigned long offset)
1470 { 1471 {
1471 bh->b_page = page; 1472 bh->b_page = page;
1472 BUG_ON(offset >= PAGE_SIZE); 1473 BUG_ON(offset >= PAGE_SIZE);
1473 if (PageHighMem(page)) 1474 if (PageHighMem(page))
1474 /* 1475 /*
1475 * This catches illegal uses and preserves the offset: 1476 * This catches illegal uses and preserves the offset:
1476 */ 1477 */
1477 bh->b_data = (char *)(0 + offset); 1478 bh->b_data = (char *)(0 + offset);
1478 else 1479 else
1479 bh->b_data = page_address(page) + offset; 1480 bh->b_data = page_address(page) + offset;
1480 } 1481 }
1481 EXPORT_SYMBOL(set_bh_page); 1482 EXPORT_SYMBOL(set_bh_page);
1482 1483
1483 /* 1484 /*
1484 * Called when truncating a buffer on a page completely. 1485 * Called when truncating a buffer on a page completely.
1485 */ 1486 */
1486 1487
1487 /* Bits that are cleared during an invalidate */ 1488 /* Bits that are cleared during an invalidate */
1488 #define BUFFER_FLAGS_DISCARD \ 1489 #define BUFFER_FLAGS_DISCARD \
1489 (1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \ 1490 (1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
1490 1 << BH_Delay | 1 << BH_Unwritten) 1491 1 << BH_Delay | 1 << BH_Unwritten)
1491 1492
1492 static void discard_buffer(struct buffer_head * bh) 1493 static void discard_buffer(struct buffer_head * bh)
1493 { 1494 {
1494 unsigned long b_state, b_state_old; 1495 unsigned long b_state, b_state_old;
1495 1496
1496 lock_buffer(bh); 1497 lock_buffer(bh);
1497 clear_buffer_dirty(bh); 1498 clear_buffer_dirty(bh);
1498 bh->b_bdev = NULL; 1499 bh->b_bdev = NULL;
1499 b_state = bh->b_state; 1500 b_state = bh->b_state;
1500 for (;;) { 1501 for (;;) {
1501 b_state_old = cmpxchg(&bh->b_state, b_state, 1502 b_state_old = cmpxchg(&bh->b_state, b_state,
1502 (b_state & ~BUFFER_FLAGS_DISCARD)); 1503 (b_state & ~BUFFER_FLAGS_DISCARD));
1503 if (b_state_old == b_state) 1504 if (b_state_old == b_state)
1504 break; 1505 break;
1505 b_state = b_state_old; 1506 b_state = b_state_old;
1506 } 1507 }
1507 unlock_buffer(bh); 1508 unlock_buffer(bh);
1508 } 1509 }
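discard_buffer() clears a set of state bits with a cmpxchg() retry loop so that concurrent updates to other bits are never lost. A userspace model of that loop, using C11 atomics in place of the kernel's cmpxchg() and an arbitrary flag mask:

	/* Model of the lock-free flag clearing in discard_buffer(): retry the
	 * compare-and-swap until the masked value is installed on top of
	 * whatever the current state happens to be. */
	#include <stdatomic.h>
	#include <stdio.h>

	#define FLAGS_DISCARD (1u << 1 | 1u << 3)	/* arbitrary bits to clear */

	static _Atomic unsigned long state;

	static void clear_discard_flags(void)
	{
		unsigned long old = atomic_load(&state);

		/* compare_exchange_weak reloads "old" on failure, so each retry
		 * re-applies the mask to the value that is current right now */
		while (!atomic_compare_exchange_weak(&state, &old,
						     old & ~FLAGS_DISCARD))
			;
	}

	int main(void)
	{
		atomic_store(&state, 0xffUL);
		clear_discard_flags();
		printf("%#lx\n", atomic_load(&state));	/* prints 0xf5 */
		return 0;
	}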
1509 1510
1510 /** 1511 /**
1511 * block_invalidatepage - invalidate part or all of a buffer-backed page 1512 * block_invalidatepage - invalidate part or all of a buffer-backed page
1512 * 1513 *
1513 * @page: the page which is affected 1514 * @page: the page which is affected
1514 * @offset: start of the range to invalidate 1515 * @offset: start of the range to invalidate
1515 * @length: length of the range to invalidate 1516 * @length: length of the range to invalidate
1516 * 1517 *
1517 * block_invalidatepage() is called when all or part of the page has become 1518 * block_invalidatepage() is called when all or part of the page has become
1518 * invalidated by a truncate operation. 1519 * invalidated by a truncate operation.
1519 * 1520 *
1520 * block_invalidatepage() does not have to release all buffers, but it must 1521 * block_invalidatepage() does not have to release all buffers, but it must
1521 * ensure that no dirty buffer is left outside @offset and that no I/O 1522 * ensure that no dirty buffer is left outside @offset and that no I/O
1522 * is underway against any of the blocks which are outside the truncation 1523 * is underway against any of the blocks which are outside the truncation
1523 * point. Because the caller is about to free (and possibly reuse) those 1524 * point. Because the caller is about to free (and possibly reuse) those
1524 * blocks on-disk. 1525 * blocks on-disk.
1525 */ 1526 */
1526 void block_invalidatepage(struct page *page, unsigned int offset, 1527 void block_invalidatepage(struct page *page, unsigned int offset,
1527 unsigned int length) 1528 unsigned int length)
1528 { 1529 {
1529 struct buffer_head *head, *bh, *next; 1530 struct buffer_head *head, *bh, *next;
1530 unsigned int curr_off = 0; 1531 unsigned int curr_off = 0;
1531 unsigned int stop = length + offset; 1532 unsigned int stop = length + offset;
1532 1533
1533 BUG_ON(!PageLocked(page)); 1534 BUG_ON(!PageLocked(page));
1534 if (!page_has_buffers(page)) 1535 if (!page_has_buffers(page))
1535 goto out; 1536 goto out;
1536 1537
1537 /* 1538 /*
1538 * Check for overflow 1539 * Check for overflow
1539 */ 1540 */
1540 BUG_ON(stop > PAGE_CACHE_SIZE || stop < length); 1541 BUG_ON(stop > PAGE_CACHE_SIZE || stop < length);
1541 1542
1542 head = page_buffers(page); 1543 head = page_buffers(page);
1543 bh = head; 1544 bh = head;
1544 do { 1545 do {
1545 unsigned int next_off = curr_off + bh->b_size; 1546 unsigned int next_off = curr_off + bh->b_size;
1546 next = bh->b_this_page; 1547 next = bh->b_this_page;
1547 1548
1548 /* 1549 /*
1549 * Are we still fully in range ? 1550 * Are we still fully in range ?
1550 */ 1551 */
1551 if (next_off > stop) 1552 if (next_off > stop)
1552 goto out; 1553 goto out;
1553 1554
1554 /* 1555 /*
1555 * is this block fully invalidated? 1556 * is this block fully invalidated?
1556 */ 1557 */
1557 if (offset <= curr_off) 1558 if (offset <= curr_off)
1558 discard_buffer(bh); 1559 discard_buffer(bh);
1559 curr_off = next_off; 1560 curr_off = next_off;
1560 bh = next; 1561 bh = next;
1561 } while (bh != head); 1562 } while (bh != head);
1562 1563
1563 /* 1564 /*
1564 * We release buffers only if the entire page is being invalidated. 1565 * We release buffers only if the entire page is being invalidated.
1565 * The get_block cached value has been unconditionally invalidated, 1566 * The get_block cached value has been unconditionally invalidated,
1566 * so real IO is not possible anymore. 1567 * so real IO is not possible anymore.
1567 */ 1568 */
1568 if (offset == 0) 1569 if (offset == 0)
1569 try_to_release_page(page, 0); 1570 try_to_release_page(page, 0);
1570 out: 1571 out:
1571 return; 1572 return;
1572 } 1573 }
1573 EXPORT_SYMBOL(block_invalidatepage); 1574 EXPORT_SYMBOL(block_invalidatepage);
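A hedged sketch of how a buffer-backed filesystem typically wires this in: its ->invalidatepage does any private bookkeeping and then forwards to block_invalidatepage(). Everything here other than the aops fields and block_invalidatepage() itself is hypothetical.

	#include <linux/fs.h>
	#include <linux/buffer_head.h>

	static void my_fs_invalidatepage(struct page *page, unsigned int offset,
					 unsigned int length)
	{
		/* e.g. drop filesystem-private per-page state here first */
		block_invalidatepage(page, offset, length);
	}

	static const struct address_space_operations my_fs_aops = {
		.invalidatepage	= my_fs_invalidatepage,
		/* .readpage, .writepage, .write_begin, ... */
	};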
1574 1575
1575 1576
1576 /* 1577 /*
1577 * We attach and possibly dirty the buffers atomically wrt 1578 * We attach and possibly dirty the buffers atomically wrt
1578 * __set_page_dirty_buffers() via private_lock. try_to_free_buffers 1579 * __set_page_dirty_buffers() via private_lock. try_to_free_buffers
1579 * is already excluded via the page lock. 1580 * is already excluded via the page lock.
1580 */ 1581 */
1581 void create_empty_buffers(struct page *page, 1582 void create_empty_buffers(struct page *page,
1582 unsigned long blocksize, unsigned long b_state) 1583 unsigned long blocksize, unsigned long b_state)
1583 { 1584 {
1584 struct buffer_head *bh, *head, *tail; 1585 struct buffer_head *bh, *head, *tail;
1585 1586
1586 head = alloc_page_buffers(page, blocksize, 1); 1587 head = alloc_page_buffers(page, blocksize, 1);
1587 bh = head; 1588 bh = head;
1588 do { 1589 do {
1589 bh->b_state |= b_state; 1590 bh->b_state |= b_state;
1590 tail = bh; 1591 tail = bh;
1591 bh = bh->b_this_page; 1592 bh = bh->b_this_page;
1592 } while (bh); 1593 } while (bh);
1593 tail->b_this_page = head; 1594 tail->b_this_page = head;
1594 1595
1595 spin_lock(&page->mapping->private_lock); 1596 spin_lock(&page->mapping->private_lock);
1596 if (PageUptodate(page) || PageDirty(page)) { 1597 if (PageUptodate(page) || PageDirty(page)) {
1597 bh = head; 1598 bh = head;
1598 do { 1599 do {
1599 if (PageDirty(page)) 1600 if (PageDirty(page))
1600 set_buffer_dirty(bh); 1601 set_buffer_dirty(bh);
1601 if (PageUptodate(page)) 1602 if (PageUptodate(page))
1602 set_buffer_uptodate(bh); 1603 set_buffer_uptodate(bh);
1603 bh = bh->b_this_page; 1604 bh = bh->b_this_page;
1604 } while (bh != head); 1605 } while (bh != head);
1605 } 1606 }
1606 attach_page_buffers(page, head); 1607 attach_page_buffers(page, head);
1607 spin_unlock(&page->mapping->private_lock); 1608 spin_unlock(&page->mapping->private_lock);
1608 } 1609 }
1609 EXPORT_SYMBOL(create_empty_buffers); 1610 EXPORT_SYMBOL(create_empty_buffers);
1610 1611
1611 /* 1612 /*
1612 * We are taking a block for data and we don't want any output from any 1613 * We are taking a block for data and we don't want any output from any
1613 * buffer-cache aliases starting from return from that function and 1614 * buffer-cache aliases starting from return from that function and
1614 * until the moment when something will explicitly mark the buffer 1615 * until the moment when something will explicitly mark the buffer
1615 dirty (hopefully that will not happen until we free that block ;-) 1616 dirty (hopefully that will not happen until we free that block ;-)

1616 * We don't even need to mark it not-uptodate - nobody can expect 1617 * We don't even need to mark it not-uptodate - nobody can expect
1617 anything from a newly allocated buffer anyway. We used to use 1618 anything from a newly allocated buffer anyway. We used to use
1618 * unmap_buffer() for such invalidation, but that was wrong. We definitely 1619 * unmap_buffer() for such invalidation, but that was wrong. We definitely
1619 * don't want to mark the alias unmapped, for example - it would confuse 1620 * don't want to mark the alias unmapped, for example - it would confuse
1620 * anyone who might pick it with bread() afterwards... 1621 * anyone who might pick it with bread() afterwards...
1621 * 1622 *
1622 * Also.. Note that bforget() doesn't lock the buffer. So there can 1623 * Also.. Note that bforget() doesn't lock the buffer. So there can
1623 * be writeout I/O going on against recently-freed buffers. We don't 1624 * be writeout I/O going on against recently-freed buffers. We don't
1624 * wait on that I/O in bforget() - it's more efficient to wait on the I/O 1625 * wait on that I/O in bforget() - it's more efficient to wait on the I/O
1625 * only if we really need to. That happens here. 1626 * only if we really need to. That happens here.
1626 */ 1627 */
1627 void unmap_underlying_metadata(struct block_device *bdev, sector_t block) 1628 void unmap_underlying_metadata(struct block_device *bdev, sector_t block)
1628 { 1629 {
1629 struct buffer_head *old_bh; 1630 struct buffer_head *old_bh;
1630 1631
1631 might_sleep(); 1632 might_sleep();
1632 1633
1633 old_bh = __find_get_block_slow(bdev, block); 1634 old_bh = __find_get_block_slow(bdev, block);
1634 if (old_bh) { 1635 if (old_bh) {
1635 clear_buffer_dirty(old_bh); 1636 clear_buffer_dirty(old_bh);
1636 wait_on_buffer(old_bh); 1637 wait_on_buffer(old_bh);
1637 clear_buffer_req(old_bh); 1638 clear_buffer_req(old_bh);
1638 __brelse(old_bh); 1639 __brelse(old_bh);
1639 } 1640 }
1640 } 1641 }
1641 EXPORT_SYMBOL(unmap_underlying_metadata); 1642 EXPORT_SYMBOL(unmap_underlying_metadata);
1642 1643
1643 /* 1644 /*
1644 * Size is a power-of-two in the range 512..PAGE_SIZE, 1645 * Size is a power-of-two in the range 512..PAGE_SIZE,
1645 * and the case we care about most is PAGE_SIZE. 1646 * and the case we care about most is PAGE_SIZE.
1646 * 1647 *
1647 * So this *could* possibly be written with those 1648 * So this *could* possibly be written with those
1648 * constraints in mind (relevant mostly if some 1649 * constraints in mind (relevant mostly if some
1649 * architecture has a slow bit-scan instruction) 1650 * architecture has a slow bit-scan instruction)
1650 */ 1651 */
1651 static inline int block_size_bits(unsigned int blocksize) 1652 static inline int block_size_bits(unsigned int blocksize)
1652 { 1653 {
1653 return ilog2(blocksize); 1654 return ilog2(blocksize);
1654 } 1655 }
1655 1656
1656 static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state) 1657 static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state)
1657 { 1658 {
1658 BUG_ON(!PageLocked(page)); 1659 BUG_ON(!PageLocked(page));
1659 1660
1660 if (!page_has_buffers(page)) 1661 if (!page_has_buffers(page))
1661 create_empty_buffers(page, 1 << ACCESS_ONCE(inode->i_blkbits), b_state); 1662 create_empty_buffers(page, 1 << ACCESS_ONCE(inode->i_blkbits), b_state);
1662 return page_buffers(page); 1663 return page_buffers(page);
1663 } 1664 }
1664 1665
1665 /* 1666 /*
1666 * NOTE! All mapped/uptodate combinations are valid: 1667 * NOTE! All mapped/uptodate combinations are valid:
1667 * 1668 *
1668 * Mapped Uptodate Meaning 1669 * Mapped Uptodate Meaning
1669 * 1670 *
1670 * No No "unknown" - must do get_block() 1671 * No No "unknown" - must do get_block()
1671 * No Yes "hole" - zero-filled 1672 * No Yes "hole" - zero-filled
1672 * Yes No "allocated" - allocated on disk, not read in 1673 * Yes No "allocated" - allocated on disk, not read in
1673 * Yes Yes "valid" - allocated and up-to-date in memory. 1674 * Yes Yes "valid" - allocated and up-to-date in memory.
1674 * 1675 *
1675 * "Dirty" is valid only with the last case (mapped+uptodate). 1676 * "Dirty" is valid only with the last case (mapped+uptodate).
1676 */ 1677 */
1677 1678
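A hedged sketch of how a read path consumes the table above (the real logic lives in block_read_full_page()); the helper name is hypothetical and error handling is kept minimal:

	#include <linux/buffer_head.h>
	#include <linux/highmem.h>

	/* Returns 0 if the buffer needs no I/O, 1 if the caller must submit a
	 * read, or a negative error from get_block(). */
	static int my_fs_prepare_bh_for_read(struct inode *inode, struct page *page,
					     struct buffer_head *bh, sector_t iblock,
					     get_block_t *get_block)
	{
		int err;

		if (buffer_uptodate(bh))
			return 0;		/* "hole" or "valid": nothing to read */

		if (!buffer_mapped(bh)) {	/* "unknown": ask the filesystem */
			err = get_block(inode, iblock, bh, 0);
			if (err)
				return err;
			if (!buffer_mapped(bh)) {	/* still unmapped: a hole */
				zero_user(page, bh_offset(bh), bh->b_size);
				set_buffer_uptodate(bh);
				return 0;
			}
		}
		/* "allocated": mapped but not uptodate, must be read from disk */
		return 1;
	}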
1678 /* 1679 /*
1679 * While block_write_full_page is writing back the dirty buffers under 1680 * While block_write_full_page is writing back the dirty buffers under
1680 * the page lock, whoever dirtied the buffers may decide to clean them 1681 * the page lock, whoever dirtied the buffers may decide to clean them
1681 * again at any time. We handle that by only looking at the buffer 1682 * again at any time. We handle that by only looking at the buffer
1682 * state inside lock_buffer(). 1683 * state inside lock_buffer().
1683 * 1684 *
1684 * If block_write_full_page() is called for regular writeback 1685 * If block_write_full_page() is called for regular writeback
1685 * (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a 1686 * (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a
1686 * locked buffer. This only can happen if someone has written the buffer 1687 * locked buffer. This only can happen if someone has written the buffer
1687 * directly, with submit_bh(). At the address_space level PageWriteback 1688 * directly, with submit_bh(). At the address_space level PageWriteback
1688 * prevents this contention from occurring. 1689 * prevents this contention from occurring.
1689 * 1690 *
1690 * If block_write_full_page() is called with wbc->sync_mode == 1691 * If block_write_full_page() is called with wbc->sync_mode ==
1691 * WB_SYNC_ALL, the writes are posted using WRITE_SYNC; this 1692 * WB_SYNC_ALL, the writes are posted using WRITE_SYNC; this
1692 * causes the writes to be flagged as synchronous writes. 1693 * causes the writes to be flagged as synchronous writes.
1693 */ 1694 */
1694 static int __block_write_full_page(struct inode *inode, struct page *page, 1695 static int __block_write_full_page(struct inode *inode, struct page *page,
1695 get_block_t *get_block, struct writeback_control *wbc, 1696 get_block_t *get_block, struct writeback_control *wbc,
1696 bh_end_io_t *handler) 1697 bh_end_io_t *handler)
1697 { 1698 {
1698 int err; 1699 int err;
1699 sector_t block; 1700 sector_t block;
1700 sector_t last_block; 1701 sector_t last_block;
1701 struct buffer_head *bh, *head; 1702 struct buffer_head *bh, *head;
1702 unsigned int blocksize, bbits; 1703 unsigned int blocksize, bbits;
1703 int nr_underway = 0; 1704 int nr_underway = 0;
1704 int write_op = (wbc->sync_mode == WB_SYNC_ALL ? 1705 int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
1705 WRITE_SYNC : WRITE); 1706 WRITE_SYNC : WRITE);
1706 1707
1707 head = create_page_buffers(page, inode, 1708 head = create_page_buffers(page, inode,
1708 (1 << BH_Dirty)|(1 << BH_Uptodate)); 1709 (1 << BH_Dirty)|(1 << BH_Uptodate));
1709 1710
1710 /* 1711 /*
1711 * Be very careful. We have no exclusion from __set_page_dirty_buffers 1712 * Be very careful. We have no exclusion from __set_page_dirty_buffers
1712 * here, and the (potentially unmapped) buffers may become dirty at 1713 * here, and the (potentially unmapped) buffers may become dirty at
1713 * any time. If a buffer becomes dirty here after we've inspected it 1714 * any time. If a buffer becomes dirty here after we've inspected it
1714 * then we just miss that fact, and the page stays dirty. 1715 * then we just miss that fact, and the page stays dirty.
1715 * 1716 *
1716 * Buffers outside i_size may be dirtied by __set_page_dirty_buffers; 1717 * Buffers outside i_size may be dirtied by __set_page_dirty_buffers;
1717 * handle that here by just cleaning them. 1718 * handle that here by just cleaning them.
1718 */ 1719 */
1719 1720
1720 bh = head; 1721 bh = head;
1721 blocksize = bh->b_size; 1722 blocksize = bh->b_size;
1722 bbits = block_size_bits(blocksize); 1723 bbits = block_size_bits(blocksize);
1723 1724
1724 block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits); 1725 block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
1725 last_block = (i_size_read(inode) - 1) >> bbits; 1726 last_block = (i_size_read(inode) - 1) >> bbits;
1726 1727
1727 /* 1728 /*
1728 * Get all the dirty buffers mapped to disk addresses and 1729 * Get all the dirty buffers mapped to disk addresses and
1729 * handle any aliases from the underlying blockdev's mapping. 1730 * handle any aliases from the underlying blockdev's mapping.
1730 */ 1731 */
1731 do { 1732 do {
1732 if (block > last_block) { 1733 if (block > last_block) {
1733 /* 1734 /*
1734 * mapped buffers outside i_size will occur, because 1735 * mapped buffers outside i_size will occur, because
1735 * this page can be outside i_size when there is a 1736 * this page can be outside i_size when there is a
1736 * truncate in progress. 1737 * truncate in progress.
1737 */ 1738 */
1738 /* 1739 /*
1739 * The buffer was zeroed by block_write_full_page() 1740 * The buffer was zeroed by block_write_full_page()
1740 */ 1741 */
1741 clear_buffer_dirty(bh); 1742 clear_buffer_dirty(bh);
1742 set_buffer_uptodate(bh); 1743 set_buffer_uptodate(bh);
1743 } else if ((!buffer_mapped(bh) || buffer_delay(bh)) && 1744 } else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
1744 buffer_dirty(bh)) { 1745 buffer_dirty(bh)) {
1745 WARN_ON(bh->b_size != blocksize); 1746 WARN_ON(bh->b_size != blocksize);
1746 err = get_block(inode, block, bh, 1); 1747 err = get_block(inode, block, bh, 1);
1747 if (err) 1748 if (err)
1748 goto recover; 1749 goto recover;
1749 clear_buffer_delay(bh); 1750 clear_buffer_delay(bh);
1750 if (buffer_new(bh)) { 1751 if (buffer_new(bh)) {
1751 /* blockdev mappings never come here */ 1752 /* blockdev mappings never come here */
1752 clear_buffer_new(bh); 1753 clear_buffer_new(bh);
1753 unmap_underlying_metadata(bh->b_bdev, 1754 unmap_underlying_metadata(bh->b_bdev,
1754 bh->b_blocknr); 1755 bh->b_blocknr);
1755 } 1756 }
1756 } 1757 }
1757 bh = bh->b_this_page; 1758 bh = bh->b_this_page;
1758 block++; 1759 block++;
1759 } while (bh != head); 1760 } while (bh != head);
1760 1761
1761 do { 1762 do {
1762 if (!buffer_mapped(bh)) 1763 if (!buffer_mapped(bh))
1763 continue; 1764 continue;
1764 /* 1765 /*
1765 * If it's a fully non-blocking write attempt and we cannot 1766 * If it's a fully non-blocking write attempt and we cannot
1766 * lock the buffer then redirty the page. Note that this can 1767 * lock the buffer then redirty the page. Note that this can
1767 * potentially cause a busy-wait loop from writeback threads 1768 * potentially cause a busy-wait loop from writeback threads
1768 * and kswapd activity, but those code paths have their own 1769 * and kswapd activity, but those code paths have their own
1769 * higher-level throttling. 1770 * higher-level throttling.
1770 */ 1771 */
1771 if (wbc->sync_mode != WB_SYNC_NONE) { 1772 if (wbc->sync_mode != WB_SYNC_NONE) {
1772 lock_buffer(bh); 1773 lock_buffer(bh);
1773 } else if (!trylock_buffer(bh)) { 1774 } else if (!trylock_buffer(bh)) {
1774 redirty_page_for_writepage(wbc, page); 1775 redirty_page_for_writepage(wbc, page);
1775 continue; 1776 continue;
1776 } 1777 }
1777 if (test_clear_buffer_dirty(bh)) { 1778 if (test_clear_buffer_dirty(bh)) {
1778 mark_buffer_async_write_endio(bh, handler); 1779 mark_buffer_async_write_endio(bh, handler);
1779 } else { 1780 } else {
1780 unlock_buffer(bh); 1781 unlock_buffer(bh);
1781 } 1782 }
1782 } while ((bh = bh->b_this_page) != head); 1783 } while ((bh = bh->b_this_page) != head);
1783 1784
1784 /* 1785 /*
1785 * The page and its buffers are protected by PageWriteback(), so we can 1786 * The page and its buffers are protected by PageWriteback(), so we can
1786 * drop the bh refcounts early. 1787 * drop the bh refcounts early.
1787 */ 1788 */
1788 BUG_ON(PageWriteback(page)); 1789 BUG_ON(PageWriteback(page));
1789 set_page_writeback(page); 1790 set_page_writeback(page);
1790 1791
1791 do { 1792 do {
1792 struct buffer_head *next = bh->b_this_page; 1793 struct buffer_head *next = bh->b_this_page;
1793 if (buffer_async_write(bh)) { 1794 if (buffer_async_write(bh)) {
1794 submit_bh(write_op, bh); 1795 submit_bh(write_op, bh);
1795 nr_underway++; 1796 nr_underway++;
1796 } 1797 }
1797 bh = next; 1798 bh = next;
1798 } while (bh != head); 1799 } while (bh != head);
1799 unlock_page(page); 1800 unlock_page(page);
1800 1801
1801 err = 0; 1802 err = 0;
1802 done: 1803 done:
1803 if (nr_underway == 0) { 1804 if (nr_underway == 0) {
1804 /* 1805 /*
1805 * The page was marked dirty, but the buffers were 1806 * The page was marked dirty, but the buffers were
1806 * clean. Someone wrote them back by hand with 1807 * clean. Someone wrote them back by hand with
1807 * ll_rw_block/submit_bh. A rare case. 1808 * ll_rw_block/submit_bh. A rare case.
1808 */ 1809 */
1809 end_page_writeback(page); 1810 end_page_writeback(page);
1810 1811
1811 /* 1812 /*
1812 * The page and buffer_heads can be released at any time from 1813 * The page and buffer_heads can be released at any time from
1813 * here on. 1814 * here on.
1814 */ 1815 */
1815 } 1816 }
1816 return err; 1817 return err;
1817 1818
1818 recover: 1819 recover:
1819 /* 1820 /*
1820 * ENOSPC, or some other error. We may already have added some 1821 * ENOSPC, or some other error. We may already have added some
1821 * blocks to the file, so we need to write these out to avoid 1822 * blocks to the file, so we need to write these out to avoid
1822 * exposing stale data. 1823 * exposing stale data.
1823 * The page is currently locked and not marked for writeback 1824 * The page is currently locked and not marked for writeback
1824 */ 1825 */
1825 bh = head; 1826 bh = head;
1826 /* Recovery: lock and submit the mapped buffers */ 1827 /* Recovery: lock and submit the mapped buffers */
1827 do { 1828 do {
1828 if (buffer_mapped(bh) && buffer_dirty(bh) && 1829 if (buffer_mapped(bh) && buffer_dirty(bh) &&
1829 !buffer_delay(bh)) { 1830 !buffer_delay(bh)) {
1830 lock_buffer(bh); 1831 lock_buffer(bh);
1831 mark_buffer_async_write_endio(bh, handler); 1832 mark_buffer_async_write_endio(bh, handler);
1832 } else { 1833 } else {
1833 /* 1834 /*
1834 * The buffer may have been set dirty during 1835 * The buffer may have been set dirty during
1835 * attachment to a dirty page. 1836 * attachment to a dirty page.
1836 */ 1837 */
1837 clear_buffer_dirty(bh); 1838 clear_buffer_dirty(bh);
1838 } 1839 }
1839 } while ((bh = bh->b_this_page) != head); 1840 } while ((bh = bh->b_this_page) != head);
1840 SetPageError(page); 1841 SetPageError(page);
1841 BUG_ON(PageWriteback(page)); 1842 BUG_ON(PageWriteback(page));
1842 mapping_set_error(page->mapping, err); 1843 mapping_set_error(page->mapping, err);
1843 set_page_writeback(page); 1844 set_page_writeback(page);
1844 do { 1845 do {
1845 struct buffer_head *next = bh->b_this_page; 1846 struct buffer_head *next = bh->b_this_page;
1846 if (buffer_async_write(bh)) { 1847 if (buffer_async_write(bh)) {
1847 clear_buffer_dirty(bh); 1848 clear_buffer_dirty(bh);
1848 submit_bh(write_op, bh); 1849 submit_bh(write_op, bh);
1849 nr_underway++; 1850 nr_underway++;
1850 } 1851 }
1851 bh = next; 1852 bh = next;
1852 } while (bh != head); 1853 } while (bh != head);
1853 unlock_page(page); 1854 unlock_page(page);
1854 goto done; 1855 goto done;
1855 } 1856 }
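The exported entry point for the function above is block_write_full_page(), and a filesystem's ->writepage is usually a thin wrapper around it. A hedged sketch; my_fs_get_block() stands in for the filesystem's real mapping callback:

	#include <linux/buffer_head.h>
	#include <linux/writeback.h>

	static int my_fs_get_block(struct inode *inode, sector_t iblock,
				   struct buffer_head *bh_result, int create);

	static int my_fs_writepage(struct page *page, struct writeback_control *wbc)
	{
		/* maps dirty buffers via my_fs_get_block(), marks them for async
		 * write and submits them; redirties the page instead of blocking
		 * when a buffer is contended under WB_SYNC_NONE */
		return block_write_full_page(page, my_fs_get_block, wbc);
	}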
1856 1857
1857 /* 1858 /*
1858 * If a page has any new buffers, zero them out here, and mark them uptodate 1859 * If a page has any new buffers, zero them out here, and mark them uptodate
1859 * and dirty so they'll be written out (in order to prevent uninitialised 1860 * and dirty so they'll be written out (in order to prevent uninitialised
1860 * block data from leaking). And clear the new bit. 1861 * block data from leaking). And clear the new bit.
1861 */ 1862 */
1862 void page_zero_new_buffers(struct page *page, unsigned from, unsigned to) 1863 void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
1863 { 1864 {
1864 unsigned int block_start, block_end; 1865 unsigned int block_start, block_end;
1865 struct buffer_head *head, *bh; 1866 struct buffer_head *head, *bh;
1866 1867
1867 BUG_ON(!PageLocked(page)); 1868 BUG_ON(!PageLocked(page));
1868 if (!page_has_buffers(page)) 1869 if (!page_has_buffers(page))
1869 return; 1870 return;
1870 1871
1871 bh = head = page_buffers(page); 1872 bh = head = page_buffers(page);
1872 block_start = 0; 1873 block_start = 0;
1873 do { 1874 do {
1874 block_end = block_start + bh->b_size; 1875 block_end = block_start + bh->b_size;
1875 1876
1876 if (buffer_new(bh)) { 1877 if (buffer_new(bh)) {
1877 if (block_end > from && block_start < to) { 1878 if (block_end > from && block_start < to) {
1878 if (!PageUptodate(page)) { 1879 if (!PageUptodate(page)) {
1879 unsigned start, size; 1880 unsigned start, size;
1880 1881
1881 start = max(from, block_start); 1882 start = max(from, block_start);
1882 size = min(to, block_end) - start; 1883 size = min(to, block_end) - start;
1883 1884
1884 zero_user(page, start, size); 1885 zero_user(page, start, size);
1885 set_buffer_uptodate(bh); 1886 set_buffer_uptodate(bh);
1886 } 1887 }
1887 1888
1888 clear_buffer_new(bh); 1889 clear_buffer_new(bh);
1889 mark_buffer_dirty(bh); 1890 mark_buffer_dirty(bh);
1890 } 1891 }
1891 } 1892 }
1892 1893
1893 block_start = block_end; 1894 block_start = block_end;
1894 bh = bh->b_this_page; 1895 bh = bh->b_this_page;
1895 } while (bh != head); 1896 } while (bh != head);
1896 } 1897 }
1897 EXPORT_SYMBOL(page_zero_new_buffers); 1898 EXPORT_SYMBOL(page_zero_new_buffers);
1898 1899
1899 int __block_write_begin(struct page *page, loff_t pos, unsigned len, 1900 int __block_write_begin(struct page *page, loff_t pos, unsigned len,
1900 get_block_t *get_block) 1901 get_block_t *get_block)
1901 { 1902 {
1902 unsigned from = pos & (PAGE_CACHE_SIZE - 1); 1903 unsigned from = pos & (PAGE_CACHE_SIZE - 1);
1903 unsigned to = from + len; 1904 unsigned to = from + len;
1904 struct inode *inode = page->mapping->host; 1905 struct inode *inode = page->mapping->host;
1905 unsigned block_start, block_end; 1906 unsigned block_start, block_end;
1906 sector_t block; 1907 sector_t block;
1907 int err = 0; 1908 int err = 0;
	unsigned blocksize, bbits;
	struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;

	BUG_ON(!PageLocked(page));
	BUG_ON(from > PAGE_CACHE_SIZE);
	BUG_ON(to > PAGE_CACHE_SIZE);
	BUG_ON(from > to);

	head = create_page_buffers(page, inode, 0);
	blocksize = head->b_size;
	bbits = block_size_bits(blocksize);

	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);

	for(bh = head, block_start = 0; bh != head || !block_start;
	    block++, block_start=block_end, bh = bh->b_this_page) {
		block_end = block_start + blocksize;
		if (block_end <= from || block_start >= to) {
			if (PageUptodate(page)) {
				if (!buffer_uptodate(bh))
					set_buffer_uptodate(bh);
			}
			continue;
		}
		if (buffer_new(bh))
			clear_buffer_new(bh);
		if (!buffer_mapped(bh)) {
			WARN_ON(bh->b_size != blocksize);
			err = get_block(inode, block, bh, 1);
			if (err)
				break;
			if (buffer_new(bh)) {
				unmap_underlying_metadata(bh->b_bdev,
							bh->b_blocknr);
				if (PageUptodate(page)) {
					clear_buffer_new(bh);
					set_buffer_uptodate(bh);
					mark_buffer_dirty(bh);
					continue;
				}
				if (block_end > to || block_start < from)
					zero_user_segments(page,
						to, block_end,
						block_start, from);
				continue;
			}
		}
		if (PageUptodate(page)) {
			if (!buffer_uptodate(bh))
				set_buffer_uptodate(bh);
			continue;
		}
		if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
		    !buffer_unwritten(bh) &&
		     (block_start < from || block_end > to)) {
			ll_rw_block(READ, 1, &bh);
			*wait_bh++=bh;
		}
	}
	/*
	 * If we issued read requests - let them complete.
	 */
	while(wait_bh > wait) {
		wait_on_buffer(*--wait_bh);
		if (!buffer_uptodate(*wait_bh))
			err = -EIO;
	}
	if (unlikely(err))
		page_zero_new_buffers(page, from, to);
	return err;
}
EXPORT_SYMBOL(__block_write_begin);

static int __block_commit_write(struct inode *inode, struct page *page,
		unsigned from, unsigned to)
{
	unsigned block_start, block_end;
	int partial = 0;
	unsigned blocksize;
	struct buffer_head *bh, *head;

	bh = head = page_buffers(page);
	blocksize = bh->b_size;

	block_start = 0;
	do {
		block_end = block_start + blocksize;
		if (block_end <= from || block_start >= to) {
			if (!buffer_uptodate(bh))
				partial = 1;
		} else {
			set_buffer_uptodate(bh);
			mark_buffer_dirty(bh);
		}
		clear_buffer_new(bh);

		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);

	/*
	 * If this is a partial write which happened to make all buffers
	 * uptodate then we can optimize away a bogus readpage() for
	 * the next read(). Here we 'discover' whether the page went
	 * uptodate as a result of this (potentially partial) write.
	 */
	if (!partial)
		SetPageUptodate(page);
	return 0;
}

/*
 * block_write_begin takes care of the basic task of block allocation and
 * bringing partial write blocks uptodate first.
 *
 * The filesystem needs to handle block truncation upon failure.
 */
int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
		unsigned flags, struct page **pagep, get_block_t *get_block)
{
	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
	struct page *page;
	int status;

	page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;

	status = __block_write_begin(page, pos, len, get_block);
	if (unlikely(status)) {
		unlock_page(page);
		page_cache_release(page);
		page = NULL;
	}

	*pagep = page;
	return status;
}
EXPORT_SYMBOL(block_write_begin);
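
A minimal sketch (not part of this commit) of how a block-based filesystem
typically wires the helper above into its address_space_operations: the
->write_begin method just forwards to block_write_begin() with the
filesystem's own get_block callback. "myfs_write_begin" and "myfs_get_block"
are hypothetical names used only for illustration; a real implementation
must also truncate any blocks allocated beyond i_size when this returns an
error, as the comment above block_write_begin() notes.

	static int myfs_write_begin(struct file *file,
			struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata)
	{
		/* allocate/map blocks and read in partially-written ones */
		return block_write_begin(mapping, pos, len, flags, pagep,
					 myfs_get_block);
	}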

int block_write_end(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned copied,
			struct page *page, void *fsdata)
{
	struct inode *inode = mapping->host;
	unsigned start;

	start = pos & (PAGE_CACHE_SIZE - 1);

	if (unlikely(copied < len)) {
		/*
		 * The buffers that were written will now be uptodate, so we
		 * don't have to worry about a readpage reading them and
		 * overwriting a partial write. However if we have encountered
		 * a short write and only partially written into a buffer, it
		 * will not be marked uptodate, so a readpage might come in and
		 * destroy our partial write.
		 *
		 * Do the simplest thing, and just treat any short write to a
		 * non uptodate page as a zero-length write, and force the
		 * caller to redo the whole thing.
		 */
		if (!PageUptodate(page))
			copied = 0;

		page_zero_new_buffers(page, start+copied, start+len);
	}
	flush_dcache_page(page);

	/* This could be a short (even 0-length) commit */
	__block_commit_write(inode, page, start, start+copied);

	return copied;
}
EXPORT_SYMBOL(block_write_end);

int generic_write_end(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned copied,
			struct page *page, void *fsdata)
{
	struct inode *inode = mapping->host;
	int i_size_changed = 0;

	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);

	/*
	 * No need to use i_size_read() here, the i_size
	 * cannot change under us because we hold i_mutex.
	 *
	 * But it's important to update i_size while still holding page lock:
	 * page writeout could otherwise come in and zero beyond i_size.
	 */
	if (pos+copied > inode->i_size) {
		i_size_write(inode, pos+copied);
		i_size_changed = 1;
	}

	unlock_page(page);
	page_cache_release(page);

	/*
	 * Don't mark the inode dirty under page lock. First, it unnecessarily
	 * makes the holding time of page lock longer. Second, it forces lock
	 * ordering of page lock and transaction start for journaling
	 * filesystems.
	 */
	if (i_size_changed)
		mark_inode_dirty(inode);

	return copied;
}
EXPORT_SYMBOL(generic_write_end);

/*
 * block_is_partially_uptodate checks whether buffers within a page are
 * uptodate or not.
 *
 * Returns true if all buffers which correspond to a file portion
 * we want to read are uptodate.
 */
int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
					unsigned long from)
{
	unsigned block_start, block_end, blocksize;
	unsigned to;
	struct buffer_head *bh, *head;
	int ret = 1;

	if (!page_has_buffers(page))
		return 0;

	head = page_buffers(page);
	blocksize = head->b_size;
	to = min_t(unsigned, PAGE_CACHE_SIZE - from, desc->count);
	to = from + to;
	if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
		return 0;

	bh = head;
	block_start = 0;
	do {
		block_end = block_start + blocksize;
		if (block_end > from && block_start < to) {
			if (!buffer_uptodate(bh)) {
				ret = 0;
				break;
			}
			if (block_end >= to)
				break;
		}
		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);

	return ret;
}
EXPORT_SYMBOL(block_is_partially_uptodate);
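
A hedged sketch (illustration only, not part of this commit) of where the
two exported helpers above usually end up: generic_write_end() as the
->write_end method and block_is_partially_uptodate() as the
->is_partially_uptodate method of a filesystem's address_space_operations.
All "myfs_*" names are hypothetical.

	static const struct address_space_operations myfs_aops = {
		.readpage		= myfs_readpage,
		.writepage		= myfs_writepage,
		.write_begin		= myfs_write_begin,
		.write_end		= generic_write_end,
		.is_partially_uptodate	= block_is_partially_uptodate,
	};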

/*
 * Generic "read page" function for block devices that have the normal
 * get_block functionality. This is most of the block device filesystems.
 * Reads the page asynchronously --- the unlock_buffer() and
 * set/clear_buffer_uptodate() functions propagate buffer state into the
 * page struct once IO has completed.
 */
int block_read_full_page(struct page *page, get_block_t *get_block)
{
	struct inode *inode = page->mapping->host;
	sector_t iblock, lblock;
	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
	unsigned int blocksize, bbits;
	int nr, i;
	int fully_mapped = 1;

	head = create_page_buffers(page, inode, 0);
	blocksize = head->b_size;
	bbits = block_size_bits(blocksize);

	iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
	lblock = (i_size_read(inode)+blocksize-1) >> bbits;
	bh = head;
	nr = 0;
	i = 0;

	do {
		if (buffer_uptodate(bh))
			continue;

		if (!buffer_mapped(bh)) {
			int err = 0;

			fully_mapped = 0;
			if (iblock < lblock) {
				WARN_ON(bh->b_size != blocksize);
				err = get_block(inode, iblock, bh, 0);
				if (err)
					SetPageError(page);
			}
			if (!buffer_mapped(bh)) {
				zero_user(page, i * blocksize, blocksize);
				if (!err)
					set_buffer_uptodate(bh);
				continue;
			}
			/*
			 * get_block() might have updated the buffer
			 * synchronously
			 */
			if (buffer_uptodate(bh))
				continue;
		}
		arr[nr++] = bh;
	} while (i++, iblock++, (bh = bh->b_this_page) != head);

	if (fully_mapped)
		SetPageMappedToDisk(page);

	if (!nr) {
		/*
		 * All buffers are uptodate - we can set the page uptodate
		 * as well. But not if get_block() returned an error.
		 */
		if (!PageError(page))
			SetPageUptodate(page);
		unlock_page(page);
		return 0;
	}

	/* Stage two: lock the buffers */
	for (i = 0; i < nr; i++) {
		bh = arr[i];
		lock_buffer(bh);
		mark_buffer_async_read(bh);
	}

	/*
	 * Stage 3: start the IO. Check for uptodateness
	 * inside the buffer lock in case another process reading
	 * the underlying blockdev brought it uptodate (the sct fix).
	 */
	for (i = 0; i < nr; i++) {
		bh = arr[i];
		if (buffer_uptodate(bh))
			end_buffer_async_read(bh, 1);
		else
			submit_bh(READ, bh);
	}
	return 0;
}
EXPORT_SYMBOL(block_read_full_page);

/* utility function for filesystems that need to do work on expanding
 * truncates. Uses filesystem pagecache writes to allow the filesystem to
 * deal with the hole.
 */
int generic_cont_expand_simple(struct inode *inode, loff_t size)
{
	struct address_space *mapping = inode->i_mapping;
	struct page *page;
	void *fsdata;
	int err;

	err = inode_newsize_ok(inode, size);
	if (err)
		goto out;

	err = pagecache_write_begin(NULL, mapping, size, 0,
				AOP_FLAG_UNINTERRUPTIBLE|AOP_FLAG_CONT_EXPAND,
				&page, &fsdata);
	if (err)
		goto out;

	err = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata);
	BUG_ON(err > 0);

out:
	return err;
}
EXPORT_SYMBOL(generic_cont_expand_simple);
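
A rough sketch (not part of this commit) of the expanding-truncate case the
comment above describes: a hypothetical ->setattr path could call
generic_cont_expand_simple() to page in and zero the new tail before
publishing the larger size. "myfs_setattr_size" is an invented helper name.

	static int myfs_setattr_size(struct inode *inode, loff_t newsize)
	{
		int err = 0;

		if (newsize > i_size_read(inode))
			/* zero-fill the gap through the page cache */
			err = generic_cont_expand_simple(inode, newsize);
		if (!err)
			truncate_setsize(inode, newsize);
		return err;
	}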

static int cont_expand_zero(struct file *file, struct address_space *mapping,
			loff_t pos, loff_t *bytes)
{
	struct inode *inode = mapping->host;
	unsigned blocksize = 1 << inode->i_blkbits;
	struct page *page;
	void *fsdata;
	pgoff_t index, curidx;
	loff_t curpos;
	unsigned zerofrom, offset, len;
	int err = 0;

	index = pos >> PAGE_CACHE_SHIFT;
	offset = pos & ~PAGE_CACHE_MASK;

	while (index > (curidx = (curpos = *bytes)>>PAGE_CACHE_SHIFT)) {
		zerofrom = curpos & ~PAGE_CACHE_MASK;
		if (zerofrom & (blocksize-1)) {
			*bytes |= (blocksize-1);
			(*bytes)++;
		}
		len = PAGE_CACHE_SIZE - zerofrom;

		err = pagecache_write_begin(file, mapping, curpos, len,
						AOP_FLAG_UNINTERRUPTIBLE,
						&page, &fsdata);
		if (err)
			goto out;
		zero_user(page, zerofrom, len);
		err = pagecache_write_end(file, mapping, curpos, len, len,
						page, fsdata);
		if (err < 0)
			goto out;
		BUG_ON(err != len);
		err = 0;

		balance_dirty_pages_ratelimited(mapping);
	}

	/* page covers the boundary, find the boundary offset */
	if (index == curidx) {
		zerofrom = curpos & ~PAGE_CACHE_MASK;
		/* if we will expand the thing last block will be filled */
		if (offset <= zerofrom) {
			goto out;
		}
		if (zerofrom & (blocksize-1)) {
			*bytes |= (blocksize-1);
			(*bytes)++;
		}
		len = offset - zerofrom;

		err = pagecache_write_begin(file, mapping, curpos, len,
						AOP_FLAG_UNINTERRUPTIBLE,
						&page, &fsdata);
		if (err)
			goto out;
		zero_user(page, zerofrom, len);
		err = pagecache_write_end(file, mapping, curpos, len, len,
						page, fsdata);
		if (err < 0)
			goto out;
		BUG_ON(err != len);
		err = 0;
	}
out:
	return err;
}

/*
 * For moronic filesystems that do not allow holes in file.
 * We may have to extend the file.
 */
int cont_write_begin(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata,
			get_block_t *get_block, loff_t *bytes)
{
	struct inode *inode = mapping->host;
	unsigned blocksize = 1 << inode->i_blkbits;
	unsigned zerofrom;
	int err;

	err = cont_expand_zero(file, mapping, pos, bytes);
	if (err)
		return err;

	zerofrom = *bytes & ~PAGE_CACHE_MASK;
	if (pos+len > *bytes && zerofrom & (blocksize-1)) {
		*bytes |= (blocksize-1);
		(*bytes)++;
	}

	return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}
EXPORT_SYMBOL(cont_write_begin);

int block_commit_write(struct page *page, unsigned from, unsigned to)
{
	struct inode *inode = page->mapping->host;
	__block_commit_write(inode,page,from,to);
	return 0;
}
EXPORT_SYMBOL(block_commit_write);

/*
 * block_page_mkwrite() is not allowed to change the file size as it gets
 * called from a page fault handler when a page is first dirtied. Hence we must
 * be careful to check for EOF conditions here. We set the page up correctly
 * for a written page which means we get ENOSPC checking when writing into
 * holes and correct delalloc and unwritten extent mapping on filesystems that
 * support these features.
 *
 * We are not allowed to take the i_mutex here so we have to play games to
 * protect against truncate races as the page could now be beyond EOF. Because
 * truncate writes the inode size before removing pages, once we have the
 * page lock we can determine safely if the page is beyond EOF. If it is not
 * beyond EOF, then the page is guaranteed safe against truncation until we
 * unlock the page.
 *
 * Direct callers of this function should protect against filesystem freezing
 * using sb_start_write() - sb_end_write() functions.
 */
int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
			 get_block_t get_block)
{
	struct page *page = vmf->page;
	struct inode *inode = file_inode(vma->vm_file);
	unsigned long end;
	loff_t size;
	int ret;

	lock_page(page);
	size = i_size_read(inode);
	if ((page->mapping != inode->i_mapping) ||
	    (page_offset(page) > size)) {
		/* We overload EFAULT to mean page got truncated */
		ret = -EFAULT;
		goto out_unlock;
	}

	/* page is wholly or partially inside EOF */
	if (((page->index + 1) << PAGE_CACHE_SHIFT) > size)
		end = size & ~PAGE_CACHE_MASK;
	else
		end = PAGE_CACHE_SIZE;

	ret = __block_write_begin(page, 0, end, get_block);
	if (!ret)
		ret = block_commit_write(page, 0, end);

	if (unlikely(ret < 0))
		goto out_unlock;
	set_page_dirty(page);
	wait_for_stable_page(page);
	return 0;
out_unlock:
	unlock_page(page);
	return ret;
}
EXPORT_SYMBOL(__block_page_mkwrite);

int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
		   get_block_t get_block)
{
	int ret;
	struct super_block *sb = file_inode(vma->vm_file)->i_sb;

	sb_start_pagefault(sb);

	/*
	 * Update file times before taking page lock. We may end up failing the
	 * fault so this update may be superfluous but who really cares...
	 */
	file_update_time(vma->vm_file);

	ret = __block_page_mkwrite(vma, vmf, get_block);
	sb_end_pagefault(sb);
	return block_page_mkwrite_return(ret);
}
EXPORT_SYMBOL(block_page_mkwrite);
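
A hedged illustration (not part of this commit) of the common way the
wrapper above is consumed: a filesystem with no special journaling needs can
point its vm_operations ->page_mkwrite at a thin wrapper around
block_page_mkwrite(). "myfs_page_mkwrite", "myfs_get_block" and
"myfs_vm_ops" are hypothetical names.

	static int myfs_page_mkwrite(struct vm_area_struct *vma,
				     struct vm_fault *vmf)
	{
		/* freeze protection is handled inside block_page_mkwrite() */
		return block_page_mkwrite(vma, vmf, myfs_get_block);
	}

	static const struct vm_operations_struct myfs_vm_ops = {
		.fault		= filemap_fault,
		.page_mkwrite	= myfs_page_mkwrite,
	};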

/*
 * nobh_write_begin()'s prereads are special: the buffer_heads are freed
 * immediately, while under the page lock. So it needs a special end_io
 * handler which does not touch the bh after unlocking it.
 */
static void end_buffer_read_nobh(struct buffer_head *bh, int uptodate)
{
	__end_buffer_read_notouch(bh, uptodate);
}

/*
 * Attach the singly-linked list of buffers created by nobh_write_begin, to
 * the page (converting it to circular linked list and taking care of page
 * dirty races).
 */
static void attach_nobh_buffers(struct page *page, struct buffer_head *head)
{
	struct buffer_head *bh;

	BUG_ON(!PageLocked(page));

	spin_lock(&page->mapping->private_lock);
	bh = head;
	do {
		if (PageDirty(page))
			set_buffer_dirty(bh);
		if (!bh->b_this_page)
			bh->b_this_page = head;
		bh = bh->b_this_page;
	} while (bh != head);
	attach_page_buffers(page, head);
	spin_unlock(&page->mapping->private_lock);
}

/*
 * On entry, the page is fully not uptodate.
 * On exit the page is fully uptodate in the areas outside (from,to)
 * The filesystem needs to handle block truncation upon failure.
 */
int nobh_write_begin(struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata,
			get_block_t *get_block)
{
	struct inode *inode = mapping->host;
	const unsigned blkbits = inode->i_blkbits;
	const unsigned blocksize = 1 << blkbits;
	struct buffer_head *head, *bh;
	struct page *page;
	pgoff_t index;
	unsigned from, to;
	unsigned block_in_page;
	unsigned block_start, block_end;
	sector_t block_in_file;
	int nr_reads = 0;
	int ret = 0;
	int is_mapped_to_disk = 1;

	index = pos >> PAGE_CACHE_SHIFT;
	from = pos & (PAGE_CACHE_SIZE - 1);
	to = from + len;

	page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;
	*pagep = page;
	*fsdata = NULL;

	if (page_has_buffers(page)) {
		ret = __block_write_begin(page, pos, len, get_block);
		if (unlikely(ret))
			goto out_release;
		return ret;
	}

	if (PageMappedToDisk(page))
		return 0;

	/*
	 * Allocate buffers so that we can keep track of state, and potentially
	 * attach them to the page if an error occurs. In the common case of
	 * no error, they will just be freed again without ever being attached
	 * to the page (which is all OK, because we're under the page lock).
	 *
	 * Be careful: the buffer linked list is a NULL terminated one, rather
	 * than the circular one we're used to.
	 */
	head = alloc_page_buffers(page, blocksize, 0);
	if (!head) {
		ret = -ENOMEM;
		goto out_release;
	}

	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);

	/*
	 * We loop across all blocks in the page, whether or not they are
	 * part of the affected region. This is so we can discover if the
	 * page is fully mapped-to-disk.
	 */
	for (block_start = 0, block_in_page = 0, bh = head;
		  block_start < PAGE_CACHE_SIZE;
		  block_in_page++, block_start += blocksize, bh = bh->b_this_page) {
		int create;

		block_end = block_start + blocksize;
		bh->b_state = 0;
		create = 1;
		if (block_start >= to)
			create = 0;
		ret = get_block(inode, block_in_file + block_in_page,
					bh, create);
		if (ret)
			goto failed;
		if (!buffer_mapped(bh))
			is_mapped_to_disk = 0;
		if (buffer_new(bh))
			unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
		if (PageUptodate(page)) {
			set_buffer_uptodate(bh);
			continue;
		}
		if (buffer_new(bh) || !buffer_mapped(bh)) {
			zero_user_segments(page, block_start, from,
							to, block_end);
			continue;
		}
		if (buffer_uptodate(bh))
			continue;	/* reiserfs does this */
		if (block_start < from || block_end > to) {
			lock_buffer(bh);
			bh->b_end_io = end_buffer_read_nobh;
			submit_bh(READ, bh);
			nr_reads++;
		}
	}

	if (nr_reads) {
		/*
		 * The page is locked, so these buffers are protected from
		 * any VM or truncate activity. Hence we don't need to care
		 * for the buffer_head refcounts.
		 */
		for (bh = head; bh; bh = bh->b_this_page) {
			wait_on_buffer(bh);
			if (!buffer_uptodate(bh))
				ret = -EIO;
		}
		if (ret)
			goto failed;
	}

	if (is_mapped_to_disk)
		SetPageMappedToDisk(page);

	*fsdata = head; /* to be released by nobh_write_end */

	return 0;

failed:
	BUG_ON(!ret);
	/*
	 * Error recovery is a bit difficult. We need to zero out blocks that
	 * were newly allocated, and dirty them to ensure they get written out.
	 * Buffers need to be attached to the page at this point, otherwise
	 * the handling of potential IO errors during writeout would be hard
	 * (could try doing synchronous writeout, but what if that fails too?)
	 */
	attach_nobh_buffers(page, head);
	page_zero_new_buffers(page, from, to);

out_release:
	unlock_page(page);
	page_cache_release(page);
	*pagep = NULL;

	return ret;
}
EXPORT_SYMBOL(nobh_write_begin);

int nobh_write_end(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned copied,
			struct page *page, void *fsdata)
{
	struct inode *inode = page->mapping->host;
	struct buffer_head *head = fsdata;
	struct buffer_head *bh;
	BUG_ON(fsdata != NULL && page_has_buffers(page));

	if (unlikely(copied < len) && head)
		attach_nobh_buffers(page, head);
	if (page_has_buffers(page))
		return generic_write_end(file, mapping, pos, len,
					copied, page, fsdata);

	SetPageUptodate(page);
	set_page_dirty(page);
	if (pos+copied > inode->i_size) {
		i_size_write(inode, pos+copied);
		mark_inode_dirty(inode);
	}

	unlock_page(page);
	page_cache_release(page);

	while (head) {
		bh = head;
		head = head->b_this_page;
		free_buffer_head(bh);
	}

	return copied;
}
EXPORT_SYMBOL(nobh_write_end);

/*
 * nobh_writepage() - based on block_full_write_page() except
 * that it tries to operate without attaching bufferheads to
 * the page.
 */
int nobh_writepage(struct page *page, get_block_t *get_block,
			struct writeback_control *wbc)
{
	struct inode * const inode = page->mapping->host;
	loff_t i_size = i_size_read(inode);
	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
	unsigned offset;
	int ret;

	/* Is the page fully inside i_size? */
	if (page->index < end_index)
		goto out;

	/* Is the page fully outside i_size? (truncate in progress) */
	offset = i_size & (PAGE_CACHE_SIZE-1);
	if (page->index >= end_index+1 || !offset) {
		/*
		 * The page may have dirty, unmapped buffers. For example,
		 * they may have been added in ext3_writepage(). Make them
		 * freeable here, so the page does not leak.
		 */
#if 0
		/* Not really sure about this - do we need this ? */
		if (page->mapping->a_ops->invalidatepage)
			page->mapping->a_ops->invalidatepage(page, offset);
#endif
		unlock_page(page);
		return 0; /* don't care */
	}

	/*
	 * The page straddles i_size. It must be zeroed out on each and every
	 * writepage invocation because it may be mmapped. "A file is mapped
	 * in multiples of the page size. For a file that is not a multiple of
	 * the page size, the remaining memory is zeroed when mapped, and
	 * writes to that region are not written out to the file."
	 */
	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
out:
	ret = mpage_writepage(page, get_block, wbc);
	if (ret == -EAGAIN)
		ret = __block_write_full_page(inode, page, get_block, wbc,
					      end_buffer_async_write);
	return ret;
}
EXPORT_SYMBOL(nobh_writepage);

int nobh_truncate_page(struct address_space *mapping,
			loff_t from, get_block_t *get_block)
{
	pgoff_t index = from >> PAGE_CACHE_SHIFT;
	unsigned offset = from & (PAGE_CACHE_SIZE-1);
	unsigned blocksize;
	sector_t iblock;
	unsigned length, pos;
	struct inode *inode = mapping->host;
	struct page *page;
	struct buffer_head map_bh;
	int err;

	blocksize = 1 << inode->i_blkbits;
	length = offset & (blocksize - 1);

	/* Block boundary? Nothing to do */
	if (!length)
		return 0;

	length = blocksize - length;
	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);

	page = grab_cache_page(mapping, index);
	err = -ENOMEM;
	if (!page)
		goto out;

	if (page_has_buffers(page)) {
has_buffers:
		unlock_page(page);
		page_cache_release(page);
		return block_truncate_page(mapping, from, get_block);
	}

	/* Find the buffer that contains "offset" */
	pos = blocksize;
	while (offset >= pos) {
		iblock++;
		pos += blocksize;
	}

	map_bh.b_size = blocksize;
	map_bh.b_state = 0;
	err = get_block(inode, iblock, &map_bh, 0);
	if (err)
		goto unlock;
	/* unmapped? It's a hole - nothing to do */
	if (!buffer_mapped(&map_bh))
		goto unlock;

	/* Ok, it's mapped. Make sure it's up-to-date */
	if (!PageUptodate(page)) {
		err = mapping->a_ops->readpage(NULL, page);
		if (err) {
			page_cache_release(page);
			goto out;
		}
		lock_page(page);
		if (!PageUptodate(page)) {
			err = -EIO;
			goto unlock;
		}
		if (page_has_buffers(page))
			goto has_buffers;
	}
	zero_user(page, offset, length);
	set_page_dirty(page);
	err = 0;

unlock:
	unlock_page(page);
	page_cache_release(page);
out:
	return err;
}
EXPORT_SYMBOL(nobh_truncate_page);

int block_truncate_page(struct address_space *mapping,
			loff_t from, get_block_t *get_block)
{
	pgoff_t index = from >> PAGE_CACHE_SHIFT;
	unsigned offset = from & (PAGE_CACHE_SIZE-1);
	unsigned blocksize;
	sector_t iblock;
	unsigned length, pos;
	struct inode *inode = mapping->host;
	struct page *page;
	struct buffer_head *bh;
	int err;

	blocksize = 1 << inode->i_blkbits;
	length = offset & (blocksize - 1);

	/* Block boundary? Nothing to do */
	if (!length)
		return 0;

	length = blocksize - length;
	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);

	page = grab_cache_page(mapping, index);
	err = -ENOMEM;
	if (!page)
		goto out;

	if (!page_has_buffers(page))
		create_empty_buffers(page, blocksize, 0);

	/* Find the buffer that contains "offset" */
	bh = page_buffers(page);
	pos = blocksize;
	while (offset >= pos) {
		bh = bh->b_this_page;
		iblock++;
		pos += blocksize;
	}

	err = 0;
	if (!buffer_mapped(bh)) {
		WARN_ON(bh->b_size != blocksize);
		err = get_block(inode, iblock, bh, 0);
		if (err)
			goto unlock;
		/* unmapped? It's a hole - nothing to do */
		if (!buffer_mapped(bh))
			goto unlock;
	}

	/* Ok, it's mapped. Make sure it's up-to-date */
	if (PageUptodate(page))
		set_buffer_uptodate(bh);

	if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) {
		err = -EIO;
		ll_rw_block(READ, 1, &bh);
		wait_on_buffer(bh);
		/* Uhhuh. Read error. Complain and punt. */
		if (!buffer_uptodate(bh))
			goto unlock;
	}

	zero_user(page, offset, length);
	mark_buffer_dirty(bh);
	err = 0;

unlock:
	unlock_page(page);
	page_cache_release(page);
out:
	return err;
}
EXPORT_SYMBOL(block_truncate_page);
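
A brief, hedged sketch (not part of this commit) of how block_truncate_page()
is usually invoked: a shrinking truncate in a hypothetical filesystem zeroes
the partial tail block through the page cache before it frees the on-disk
blocks past the new size. "myfs_truncate" and "myfs_get_block" are invented
names.

	static void myfs_truncate(struct inode *inode)
	{
		/* zero the tail of the last (partial) block */
		block_truncate_page(inode->i_mapping, inode->i_size,
				    myfs_get_block);
		/* ...then release on-disk blocks beyond the new i_size... */
	}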

/*
 * The generic ->writepage function for buffer-backed address_spaces
 * this form passes in the end_io handler used to finish the IO.
 */
int block_write_full_page_endio(struct page *page, get_block_t *get_block,
			struct writeback_control *wbc, bh_end_io_t *handler)
{
	struct inode * const inode = page->mapping->host;
	loff_t i_size = i_size_read(inode);
	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
	unsigned offset;

	/* Is the page fully inside i_size? */
	if (page->index < end_index)
		return __block_write_full_page(inode, page, get_block, wbc,
					       handler);

	/* Is the page fully outside i_size? (truncate in progress) */
	offset = i_size & (PAGE_CACHE_SIZE-1);
	if (page->index >= end_index+1 || !offset) {
		/*
		 * The page may have dirty, unmapped buffers. For example,
		 * they may have been added in ext3_writepage(). Make them
		 * freeable here, so the page does not leak.
		 */
		do_invalidatepage(page, 0, PAGE_CACHE_SIZE);
		unlock_page(page);
		return 0; /* don't care */
	}

	/*
	 * The page straddles i_size. It must be zeroed out on each and every
	 * writepage invocation because it may be mmapped. "A file is mapped
	 * in multiples of the page size. For a file that is not a multiple of
	 * the page size, the remaining memory is zeroed when mapped, and
	 * writes to that region are not written out to the file."
	 */
	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
	return __block_write_full_page(inode, page, get_block, wbc, handler);
}
EXPORT_SYMBOL(block_write_full_page_endio);

/*
 * The generic ->writepage function for buffer-backed address_spaces
 */
int block_write_full_page(struct page *page, get_block_t *get_block,
			struct writeback_control *wbc)
{
	return block_write_full_page_endio(page, get_block, wbc,
					   end_buffer_async_write);
}
EXPORT_SYMBOL(block_write_full_page);
2943 2944
2944 sector_t generic_block_bmap(struct address_space *mapping, sector_t block, 2945 sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
2945 get_block_t *get_block) 2946 get_block_t *get_block)
2946 { 2947 {
2947 struct buffer_head tmp; 2948 struct buffer_head tmp;
2948 struct inode *inode = mapping->host; 2949 struct inode *inode = mapping->host;
2949 tmp.b_state = 0; 2950 tmp.b_state = 0;
2950 tmp.b_blocknr = 0; 2951 tmp.b_blocknr = 0;
2951 tmp.b_size = 1 << inode->i_blkbits; 2952 tmp.b_size = 1 << inode->i_blkbits;
2952 get_block(inode, block, &tmp, 0); 2953 get_block(inode, block, &tmp, 0);
2953 return tmp.b_blocknr; 2954 return tmp.b_blocknr;
2954 } 2955 }
2955 EXPORT_SYMBOL(generic_block_bmap); 2956 EXPORT_SYMBOL(generic_block_bmap);
2956 2957
2957 static void end_bio_bh_io_sync(struct bio *bio, int err) 2958 static void end_bio_bh_io_sync(struct bio *bio, int err)
2958 { 2959 {
2959 struct buffer_head *bh = bio->bi_private; 2960 struct buffer_head *bh = bio->bi_private;
2960 2961
2961 if (err == -EOPNOTSUPP) { 2962 if (err == -EOPNOTSUPP) {
2962 set_bit(BIO_EOPNOTSUPP, &bio->bi_flags); 2963 set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
2963 } 2964 }
2964 2965
2965 if (unlikely (test_bit(BIO_QUIET,&bio->bi_flags))) 2966 if (unlikely (test_bit(BIO_QUIET,&bio->bi_flags)))
2966 set_bit(BH_Quiet, &bh->b_state); 2967 set_bit(BH_Quiet, &bh->b_state);
2967 2968
2968 bh->b_end_io(bh, test_bit(BIO_UPTODATE, &bio->bi_flags)); 2969 bh->b_end_io(bh, test_bit(BIO_UPTODATE, &bio->bi_flags));
2969 bio_put(bio); 2970 bio_put(bio);
2970 } 2971 }
2971 2972
2972 /* 2973 /*
2973 * This allows us to do IO even on the odd last sectors 2974 * This allows us to do IO even on the odd last sectors
2974 * of a device, even if the bh block size is some multiple 2975 * of a device, even if the bh block size is some multiple
2975 * of the physical sector size. 2976 * of the physical sector size.
2976 * 2977 *
2977 * We'll just truncate the bio to the size of the device, 2978 * We'll just truncate the bio to the size of the device,
2978 * and clear the end of the buffer head manually. 2979 * and clear the end of the buffer head manually.
2979 * 2980 *
2980 * Truly out-of-range accesses will turn into actual IO 2981 * Truly out-of-range accesses will turn into actual IO
2981 * errors, this only handles the "we need to be able to 2982 * errors, this only handles the "we need to be able to
2982 * do IO at the final sector" case. 2983 * do IO at the final sector" case.
2983 */ 2984 */
2984 static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh) 2985 static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
2985 { 2986 {
2986 sector_t maxsector; 2987 sector_t maxsector;
2987 unsigned bytes; 2988 unsigned bytes;
2988 2989
2989 maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9; 2990 maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9;
2990 if (!maxsector) 2991 if (!maxsector)
2991 return; 2992 return;
2992 2993
2993 /* 2994 /*
2994 * If the *whole* IO is past the end of the device, 2995 * If the *whole* IO is past the end of the device,
2995 * let it through, and the IO layer will turn it into 2996 * let it through, and the IO layer will turn it into
2996 * an EIO. 2997 * an EIO.
2997 */ 2998 */
2998 if (unlikely(bio->bi_sector >= maxsector)) 2999 if (unlikely(bio->bi_sector >= maxsector))
2999 return; 3000 return;
3000 3001
3001 maxsector -= bio->bi_sector; 3002 maxsector -= bio->bi_sector;
3002 bytes = bio->bi_size; 3003 bytes = bio->bi_size;
3003 if (likely((bytes >> 9) <= maxsector)) 3004 if (likely((bytes >> 9) <= maxsector))
3004 return; 3005 return;
3005 3006
3006 /* Uhhuh. We've got a bh that straddles the device size! */ 3007 /* Uhhuh. We've got a bh that straddles the device size! */
3007 bytes = maxsector << 9; 3008 bytes = maxsector << 9;
3008 3009
3009 /* Truncate the bio.. */ 3010 /* Truncate the bio.. */
3010 bio->bi_size = bytes; 3011 bio->bi_size = bytes;
3011 bio->bi_io_vec[0].bv_len = bytes; 3012 bio->bi_io_vec[0].bv_len = bytes;
3012 3013
3013 /* ..and clear the end of the buffer for reads */ 3014 /* ..and clear the end of the buffer for reads */
3014 if ((rw & RW_MASK) == READ) { 3015 if ((rw & RW_MASK) == READ) {
3015 void *kaddr = kmap_atomic(bh->b_page); 3016 void *kaddr = kmap_atomic(bh->b_page);
3016 memset(kaddr + bh_offset(bh) + bytes, 0, bh->b_size - bytes); 3017 memset(kaddr + bh_offset(bh) + bytes, 0, bh->b_size - bytes);
3017 kunmap_atomic(kaddr); 3018 kunmap_atomic(kaddr);
3018 flush_dcache_page(bh->b_page); 3019 flush_dcache_page(bh->b_page);
3019 } 3020 }
3020 } 3021 }
3021 3022
3022 int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags) 3023 int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
3023 { 3024 {
3024 struct bio *bio; 3025 struct bio *bio;
3025 int ret = 0; 3026 int ret = 0;
3026 3027
3027 BUG_ON(!buffer_locked(bh)); 3028 BUG_ON(!buffer_locked(bh));
3028 BUG_ON(!buffer_mapped(bh)); 3029 BUG_ON(!buffer_mapped(bh));
3029 BUG_ON(!bh->b_end_io); 3030 BUG_ON(!bh->b_end_io);
3030 BUG_ON(buffer_delay(bh)); 3031 BUG_ON(buffer_delay(bh));
3031 BUG_ON(buffer_unwritten(bh)); 3032 BUG_ON(buffer_unwritten(bh));
3032 3033
3033 /* 3034 /*
3034 * Only clear out a write error when rewriting 3035 * Only clear out a write error when rewriting
3035 */ 3036 */
3036 if (test_set_buffer_req(bh) && (rw & WRITE)) 3037 if (test_set_buffer_req(bh) && (rw & WRITE))
3037 clear_buffer_write_io_error(bh); 3038 clear_buffer_write_io_error(bh);
3038 3039
3039 /* 3040 /*
3040 * from here on down, it's all bio -- do the initial mapping, 3041 * from here on down, it's all bio -- do the initial mapping,
3041 * submit_bio -> generic_make_request may further map this bio around 3042 * submit_bio -> generic_make_request may further map this bio around
3042 */ 3043 */
3043 bio = bio_alloc(GFP_NOIO, 1); 3044 bio = bio_alloc(GFP_NOIO, 1);
3044 3045
3045 bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9); 3046 bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
3046 bio->bi_bdev = bh->b_bdev; 3047 bio->bi_bdev = bh->b_bdev;
3047 bio->bi_io_vec[0].bv_page = bh->b_page; 3048 bio->bi_io_vec[0].bv_page = bh->b_page;
3048 bio->bi_io_vec[0].bv_len = bh->b_size; 3049 bio->bi_io_vec[0].bv_len = bh->b_size;
3049 bio->bi_io_vec[0].bv_offset = bh_offset(bh); 3050 bio->bi_io_vec[0].bv_offset = bh_offset(bh);
3050 3051
3051 bio->bi_vcnt = 1; 3052 bio->bi_vcnt = 1;
3052 bio->bi_size = bh->b_size; 3053 bio->bi_size = bh->b_size;
3053 3054
3054 bio->bi_end_io = end_bio_bh_io_sync; 3055 bio->bi_end_io = end_bio_bh_io_sync;
3055 bio->bi_private = bh; 3056 bio->bi_private = bh;
3056 bio->bi_flags |= bio_flags; 3057 bio->bi_flags |= bio_flags;
3057 3058
3058 /* Take care of bh's that straddle the end of the device */ 3059 /* Take care of bh's that straddle the end of the device */
3059 guard_bh_eod(rw, bio, bh); 3060 guard_bh_eod(rw, bio, bh);
3060 3061
3061 if (buffer_meta(bh)) 3062 if (buffer_meta(bh))
3062 rw |= REQ_META; 3063 rw |= REQ_META;
3063 if (buffer_prio(bh)) 3064 if (buffer_prio(bh))
3064 rw |= REQ_PRIO; 3065 rw |= REQ_PRIO;
3065 3066
3066 bio_get(bio); 3067 bio_get(bio);
3067 submit_bio(rw, bio); 3068 submit_bio(rw, bio);
3068 3069
3069 if (bio_flagged(bio, BIO_EOPNOTSUPP)) 3070 if (bio_flagged(bio, BIO_EOPNOTSUPP))
3070 ret = -EOPNOTSUPP; 3071 ret = -EOPNOTSUPP;
3071 3072
3072 bio_put(bio); 3073 bio_put(bio);
3073 return ret; 3074 return ret;
3074 } 3075 }
3075 EXPORT_SYMBOL_GPL(_submit_bh); 3076 EXPORT_SYMBOL_GPL(_submit_bh);
3076 3077
3077 int submit_bh(int rw, struct buffer_head *bh) 3078 int submit_bh(int rw, struct buffer_head *bh)
3078 { 3079 {
3079 return _submit_bh(rw, bh, 0); 3080 return _submit_bh(rw, bh, 0);
3080 } 3081 }
3081 EXPORT_SYMBOL(submit_bh); 3082 EXPORT_SYMBOL(submit_bh);
3082 3083
3083 /** 3084 /**
3084 * ll_rw_block: low-level access to block devices (DEPRECATED) 3085 * ll_rw_block: low-level access to block devices (DEPRECATED)
3085 * @rw: whether to %READ or %WRITE or maybe %READA (readahead) 3086 * @rw: whether to %READ or %WRITE or maybe %READA (readahead)
3086 * @nr: number of &struct buffer_heads in the array 3087 * @nr: number of &struct buffer_heads in the array
3087 * @bhs: array of pointers to &struct buffer_head 3088 * @bhs: array of pointers to &struct buffer_head
3088 * 3089 *
3089 * ll_rw_block() takes an array of pointers to &struct buffer_heads, and 3090 * ll_rw_block() takes an array of pointers to &struct buffer_heads, and
3090 * requests an I/O operation on them, either a %READ or a %WRITE. The third 3091 * requests an I/O operation on them, either a %READ or a %WRITE. The third
3091 * %READA option is described in the documentation for generic_make_request() 3092 * %READA option is described in the documentation for generic_make_request()
3092 * which ll_rw_block() calls. 3093 * which ll_rw_block() calls.
3093 * 3094 *
3094 * This function drops any buffer that it cannot get a lock on (with the 3095 * This function drops any buffer that it cannot get a lock on (with the
3095 * BH_Lock state bit), any buffer that appears to be clean when doing a write 3096 * BH_Lock state bit), any buffer that appears to be clean when doing a write
3096 * request, and any buffer that appears to be up-to-date when doing read 3097 * request, and any buffer that appears to be up-to-date when doing read
3097 * request. Further it marks as clean buffers that are processed for 3098 * request. Further it marks as clean buffers that are processed for
3098 * writing (the buffer cache won't assume that they are actually clean 3099 * writing (the buffer cache won't assume that they are actually clean
3099 * until the buffer gets unlocked). 3100 * until the buffer gets unlocked).
3100 * 3101 *
3101 * ll_rw_block sets b_end_io to simple completion handler that marks 3102 * ll_rw_block sets b_end_io to simple completion handler that marks
3102 * the buffer up-to-date (if approriate), unlocks the buffer and wakes 3103 * the buffer up-to-date (if approriate), unlocks the buffer and wakes
3103 * any waiters. 3104 * any waiters.
3104 * 3105 *
3105 * All of the buffers must be for the same device, and must also be a 3106 * All of the buffers must be for the same device, and must also be a
3106 * multiple of the current approved size for the device. 3107 * multiple of the current approved size for the device.
3107 */ 3108 */
3108 void ll_rw_block(int rw, int nr, struct buffer_head *bhs[]) 3109 void ll_rw_block(int rw, int nr, struct buffer_head *bhs[])
3109 { 3110 {
3110 int i; 3111 int i;
3111 3112
3112 for (i = 0; i < nr; i++) { 3113 for (i = 0; i < nr; i++) {
3113 struct buffer_head *bh = bhs[i]; 3114 struct buffer_head *bh = bhs[i];
3114 3115
3115 if (!trylock_buffer(bh)) 3116 if (!trylock_buffer(bh))
3116 continue; 3117 continue;
3117 if (rw == WRITE) { 3118 if (rw == WRITE) {
3118 if (test_clear_buffer_dirty(bh)) { 3119 if (test_clear_buffer_dirty(bh)) {
3119 bh->b_end_io = end_buffer_write_sync; 3120 bh->b_end_io = end_buffer_write_sync;
3120 get_bh(bh); 3121 get_bh(bh);
3121 submit_bh(WRITE, bh); 3122 submit_bh(WRITE, bh);
3122 continue; 3123 continue;
3123 } 3124 }
3124 } else { 3125 } else {
3125 if (!buffer_uptodate(bh)) { 3126 if (!buffer_uptodate(bh)) {
3126 bh->b_end_io = end_buffer_read_sync; 3127 bh->b_end_io = end_buffer_read_sync;
3127 get_bh(bh); 3128 get_bh(bh);
3128 submit_bh(rw, bh); 3129 submit_bh(rw, bh);
3129 continue; 3130 continue;
3130 } 3131 }
3131 } 3132 }
3132 unlock_buffer(bh); 3133 unlock_buffer(bh);
3133 } 3134 }
3134 } 3135 }
3135 EXPORT_SYMBOL(ll_rw_block); 3136 EXPORT_SYMBOL(ll_rw_block);
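
/*
 * Illustrative sketch, not part of this file: per the kernel-doc for
 * ll_rw_block() above, buffers that cannot be locked (or that look
 * clean/uptodate for the requested direction) are silently skipped, so a
 * caller typically batches the submissions and then waits on and
 * re-checks only the buffer it actually needs.  "example_read_pair" and
 * its arguments are hypothetical names.
 */
static int example_read_pair(struct buffer_head *bh, struct buffer_head *bh2)
{
        struct buffer_head *bhs[2] = { bh, bh2 };

        ll_rw_block(READ, 2, bhs);      /* bh2 is effectively read-ahead */
        wait_on_buffer(bh);             /* only wait for the one we need */
        if (!buffer_uptodate(bh))
                return -EIO;
        return 0;
}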

void write_dirty_buffer(struct buffer_head *bh, int rw)
{
        lock_buffer(bh);
        if (!test_clear_buffer_dirty(bh)) {
                unlock_buffer(bh);
                return;
        }
        bh->b_end_io = end_buffer_write_sync;
        get_bh(bh);
        submit_bh(rw, bh);
}
EXPORT_SYMBOL(write_dirty_buffer);

/*
 * For a data-integrity writeout, we need to wait upon any in-progress I/O
 * and then start new I/O and then wait upon it.  The caller must have a ref on
 * the buffer_head.
 */
int __sync_dirty_buffer(struct buffer_head *bh, int rw)
{
        int ret = 0;

        WARN_ON(atomic_read(&bh->b_count) < 1);
        lock_buffer(bh);
        if (test_clear_buffer_dirty(bh)) {
                get_bh(bh);
                bh->b_end_io = end_buffer_write_sync;
                ret = submit_bh(rw, bh);
                wait_on_buffer(bh);
                if (!ret && !buffer_uptodate(bh))
                        ret = -EIO;
        } else {
                unlock_buffer(bh);
        }
        return ret;
}
EXPORT_SYMBOL(__sync_dirty_buffer);

int sync_dirty_buffer(struct buffer_head *bh)
{
        return __sync_dirty_buffer(bh, WRITE_SYNC);
}
EXPORT_SYMBOL(sync_dirty_buffer);
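
/*
 * Illustrative sketch, not part of this file: as the comment above
 * __sync_dirty_buffer() says, the caller must already hold a reference on
 * the buffer_head, and a failed data-integrity writeout is reported as
 * -EIO.  "example_write_bh_sync" is a hypothetical name.
 */
static int example_write_bh_sync(struct buffer_head *bh)
{
        int err;

        get_bh(bh);                     /* the reference the API requires */
        err = sync_dirty_buffer(bh);    /* write, wait, -EIO on failure   */
        brelse(bh);
        return err;
}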

/*
 * try_to_free_buffers() checks if all the buffers on this particular page
 * are unused, and releases them if so.
 *
 * Exclusion against try_to_free_buffers may be obtained by either
 * locking the page or by holding its mapping's private_lock.
 *
 * If the page is dirty but all the buffers are clean then we need to
 * be sure to mark the page clean as well.  This is because the page
 * may be against a block device, and a later reattachment of buffers
 * to a dirty page will set *all* buffers dirty.  Which would corrupt
 * filesystem data on the same device.
 *
 * The same applies to regular filesystem pages: if all the buffers are
 * clean then we set the page clean and proceed.  To do that, we require
 * total exclusion from __set_page_dirty_buffers().  That is obtained with
 * private_lock.
 *
 * try_to_free_buffers() is non-blocking.
 */
static inline int buffer_busy(struct buffer_head *bh)
{
        return atomic_read(&bh->b_count) |
                (bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
}

static int
drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
{
        struct buffer_head *head = page_buffers(page);
        struct buffer_head *bh;

        bh = head;
        do {
                if (buffer_write_io_error(bh) && page->mapping)
                        set_bit(AS_EIO, &page->mapping->flags);
                if (buffer_busy(bh))
                        goto failed;
                bh = bh->b_this_page;
        } while (bh != head);

        do {
                struct buffer_head *next = bh->b_this_page;

                if (bh->b_assoc_map)
                        __remove_assoc_queue(bh);
                bh = next;
        } while (bh != head);
        *buffers_to_free = head;
        __clear_page_buffers(page);
        return 1;
failed:
        return 0;
}

int try_to_free_buffers(struct page *page)
{
        struct address_space * const mapping = page->mapping;
        struct buffer_head *buffers_to_free = NULL;
        int ret = 0;

        BUG_ON(!PageLocked(page));
        if (PageWriteback(page))
                return 0;

        if (mapping == NULL) {          /* can this still happen? */
                ret = drop_buffers(page, &buffers_to_free);
                goto out;
        }

        spin_lock(&mapping->private_lock);
        ret = drop_buffers(page, &buffers_to_free);

        /*
         * If the filesystem writes its buffers by hand (eg ext3)
         * then we can have clean buffers against a dirty page.  We
         * clean the page here; otherwise the VM will never notice
         * that the filesystem did any IO at all.
         *
         * Also, during truncate, discard_buffer will have marked all
         * the page's buffers clean.  We discover that here and clean
         * the page also.
         *
         * private_lock must be held over this entire operation in order
         * to synchronise against __set_page_dirty_buffers and prevent the
         * dirty bit from being lost.
         */
        if (ret)
                cancel_dirty_page(page, PAGE_CACHE_SIZE);
        spin_unlock(&mapping->private_lock);
out:
        if (buffers_to_free) {
                struct buffer_head *bh = buffers_to_free;

                do {
                        struct buffer_head *next = bh->b_this_page;
                        free_buffer_head(bh);
                        bh = next;
                } while (bh != buffers_to_free);
        }
        return ret;
}
EXPORT_SYMBOL(try_to_free_buffers);

/*
 * There are no bdflush tunables left.  But distributions are
 * still running obsolete flush daemons, so we terminate them here.
 *
 * Use of bdflush() is deprecated and will be removed in a future kernel.
 * The `flush-X' kernel threads fully replace bdflush daemons and this call.
 */
SYSCALL_DEFINE2(bdflush, int, func, long, data)
{
        static int msg_count;

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;

        if (msg_count < 5) {
                msg_count++;
                printk(KERN_INFO
                        "warning: process `%s' used the obsolete bdflush"
                        " system call\n", current->comm);
                printk(KERN_INFO "Fix your initscripts?\n");
        }

        if (func == 1)
                do_exit(0);
        return 0;
}

/*
 * Buffer-head allocation
 */
static struct kmem_cache *bh_cachep __read_mostly;

/*
 * Once the number of bh's in the machine exceeds this level, we start
 * stripping them in writeback.
 */
static unsigned long max_buffer_heads;

int buffer_heads_over_limit;

struct bh_accounting {
        int nr;                 /* Number of live bh's */
        int ratelimit;          /* Limit cacheline bouncing */
};

static DEFINE_PER_CPU(struct bh_accounting, bh_accounting) = {0, 0};

static void recalc_bh_state(void)
{
        int i;
        int tot = 0;

        if (__this_cpu_inc_return(bh_accounting.ratelimit) - 1 < 4096)
                return;
        __this_cpu_write(bh_accounting.ratelimit, 0);
        for_each_online_cpu(i)
                tot += per_cpu(bh_accounting, i).nr;
        buffer_heads_over_limit = (tot > max_buffer_heads);
}

struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
{
        struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags);
        if (ret) {
                INIT_LIST_HEAD(&ret->b_assoc_buffers);
                preempt_disable();
                __this_cpu_inc(bh_accounting.nr);
                recalc_bh_state();
                preempt_enable();
        }
        return ret;
}
EXPORT_SYMBOL(alloc_buffer_head);

void free_buffer_head(struct buffer_head *bh)
{
        BUG_ON(!list_empty(&bh->b_assoc_buffers));
        kmem_cache_free(bh_cachep, bh);
        preempt_disable();
        __this_cpu_dec(bh_accounting.nr);
        recalc_bh_state();
        preempt_enable();
}
EXPORT_SYMBOL(free_buffer_head);

static void buffer_exit_cpu(int cpu)
{
        int i;
        struct bh_lru *b = &per_cpu(bh_lrus, cpu);

        for (i = 0; i < BH_LRU_SIZE; i++) {
                brelse(b->bhs[i]);
                b->bhs[i] = NULL;
        }
        this_cpu_add(bh_accounting.nr, per_cpu(bh_accounting, cpu).nr);
        per_cpu(bh_accounting, cpu).nr = 0;
}

static int buffer_cpu_notify(struct notifier_block *self,
                              unsigned long action, void *hcpu)
{
        if (action == CPU_DEAD || action == CPU_DEAD_FROZEN)
                buffer_exit_cpu((unsigned long)hcpu);
        return NOTIFY_OK;
}

/**
 * bh_uptodate_or_lock - Test whether the buffer is uptodate
 * @bh: struct buffer_head
 *
 * Return true if the buffer is up-to-date and false,
 * with the buffer locked, if not.
 */
int bh_uptodate_or_lock(struct buffer_head *bh)
{
        if (!buffer_uptodate(bh)) {
                lock_buffer(bh);
                if (!buffer_uptodate(bh))
                        return 0;
                unlock_buffer(bh);
        }
        return 1;
}
EXPORT_SYMBOL(bh_uptodate_or_lock);

/**
 * bh_submit_read - Submit a locked buffer for reading
 * @bh: struct buffer_head
 *
 * Returns zero on success and -EIO on error.
 */
int bh_submit_read(struct buffer_head *bh)
{
        BUG_ON(!buffer_locked(bh));

        if (buffer_uptodate(bh)) {
                unlock_buffer(bh);
                return 0;
        }

        get_bh(bh);
        bh->b_end_io = end_buffer_read_sync;
        submit_bh(READ, bh);
        wait_on_buffer(bh);
        if (buffer_uptodate(bh))
                return 0;
        return -EIO;
}
EXPORT_SYMBOL(bh_submit_read);
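
/*
 * Illustrative sketch, not part of this file: bh_uptodate_or_lock() and
 * bh_submit_read() are meant to be paired, so the buffer is only read
 * from disk when it is not already uptodate.  "example_ensure_uptodate"
 * is a hypothetical name.
 */
static int example_ensure_uptodate(struct buffer_head *bh)
{
        if (!bh_uptodate_or_lock(bh)) {
                /* not uptodate and now locked: read it in and wait */
                if (bh_submit_read(bh))
                        return -EIO;
        }
        return 0;
}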

void __init buffer_init(void)
{
        unsigned long nrpages;

        bh_cachep = kmem_cache_create("buffer_head",
                        sizeof(struct buffer_head), 0,
                                (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
                                SLAB_MEM_SPREAD),
                                NULL);

        /*
         * Limit the bh occupancy to 10% of ZONE_NORMAL
         */
        nrpages = (nr_free_buffer_pages() * 10) / 100;
        max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
        hotcpu_notifier(buffer_cpu_notify, 0);
}
/*
 * Copyright (c) 2003-2006, Cluster File Systems, Inc, info@clusterfs.com
 * Written by Alex Tomas <alex@clusterfs.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public Licens
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-
 */


/*
 * mballoc.c contains the multiblocks allocation routines
 */

#include "ext4_jbd2.h"
#include "mballoc.h"
#include <linux/log2.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <trace/events/ext4.h>

#ifdef CONFIG_EXT4_DEBUG
ushort ext4_mballoc_debug __read_mostly;

module_param_named(mballoc_debug, ext4_mballoc_debug, ushort, 0644);
MODULE_PARM_DESC(mballoc_debug, "Debugging level for ext4's mballoc");
#endif

/*
 * MUSTDO:
 *   - test ext4_ext_search_left() and ext4_ext_search_right()
 *   - search for metadata in few groups
 *
 * TODO v4:
 *   - normalization should take into account whether file is still open
 *   - discard preallocations if no free space left (policy?)
 *   - don't normalize tails
 *   - quota
 *   - reservation for superuser
 *
 * TODO v3:
 *   - bitmap read-ahead (proposed by Oleg Drokin aka green)
 *   - track min/max extents in each group for better group selection
 *   - mb_mark_used() may allocate chunk right after splitting buddy
 *   - tree of groups sorted by number of free blocks
 *   - error handling
 */

/*
 * The allocation request involve request for multiple number of blocks
 * near to the goal(block) value specified.
 *
 * During initialization phase of the allocator we decide to use the
 * group preallocation or inode preallocation depending on the size of
 * the file. The size of the file could be the resulting file size we
 * would have after allocation, or the current file size, which ever
 * is larger. If the size is less than sbi->s_mb_stream_request we
 * select to use the group preallocation. The default value of
 * s_mb_stream_request is 16 blocks. This can also be tuned via
 * /sys/fs/ext4/<partition>/mb_stream_req. The value is represented in
 * terms of number of blocks.
 *
 * The main motivation for having small file use group preallocation is to
 * ensure that we have small files closer together on the disk.
 *
 * First stage the allocator looks at the inode prealloc list,
 * ext4_inode_info->i_prealloc_list, which contains list of prealloc
 * spaces for this particular inode. The inode prealloc space is
 * represented as:
 *
 * pa_lstart -> the logical start block for this prealloc space
 * pa_pstart -> the physical start block for this prealloc space
 * pa_len    -> length for this prealloc space (in clusters)
 * pa_free   -> free space available in this prealloc space (in clusters)
 *
 * The inode preallocation space is used looking at the _logical_ start
 * block. If only the logical file block falls within the range of prealloc
 * space we will consume the particular prealloc space. This makes sure that
 * we have contiguous physical blocks representing the file blocks
 *
 * The important thing to be noted in case of inode prealloc space is that
 * we don't modify the values associated to inode prealloc space except
 * pa_free.
 *
 * If we are not able to find blocks in the inode prealloc space and if we
 * have the group allocation flag set then we look at the locality group
 * prealloc space. These are per CPU prealloc list represented as
 *
 * ext4_sb_info.s_locality_groups[smp_processor_id()]
 *
 * The reason for having a per cpu locality group is to reduce the contention
 * between CPUs. It is possible to get scheduled at this point.
 *
 * The locality group prealloc space is used looking at whether we have
 * enough free space (pa_free) within the prealloc space.
 *
 * If we can't allocate blocks via inode prealloc or/and locality group
 * prealloc then we look at the buddy cache. The buddy cache is represented
 * by ext4_sb_info.s_buddy_cache (struct inode) whose file offset gets
 * mapped to the buddy and bitmap information regarding different
 * groups. The buddy information is attached to buddy cache inode so that
 * we can access them through the page cache. The information regarding
 * each group is loaded via ext4_mb_load_buddy.  The information involve
 * block bitmap and buddy information. The information are stored in the
 * inode as:
 *
 *  {                        page                        }
 *  [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]...
 *
 *
 * one block each for bitmap and buddy information.  So for each group we
 * take up 2 blocks. A page can contain blocks_per_page (PAGE_CACHE_SIZE /
 * blocksize) blocks.  So it can have information regarding groups_per_page
 * which is blocks_per_page/2
 *
 * The buddy cache inode is not stored on disk. The inode is thrown
 * away when the filesystem is unmounted.
 *
 * We look for count number of blocks in the buddy cache. If we were able
 * to locate that many free blocks we return with additional information
 * regarding rest of the contiguous physical block available
 *
 * Before allocating blocks via buddy cache we normalize the request
 * blocks. This ensure we ask for more blocks that we needed. The extra
 * blocks that we get after allocation is added to the respective prealloc
 * list. In case of inode preallocation we follow a list of heuristics
 * based on file size. This can be found in ext4_mb_normalize_request. If
 * we are doing a group prealloc we try to normalize the request to
 * sbi->s_mb_group_prealloc.  The default value of s_mb_group_prealloc is
 * dependent on the cluster size; for non-bigalloc file systems, it is
 * 512 blocks. This can be tuned via
 * /sys/fs/ext4/<partition>/mb_group_prealloc. The value is represented in
 * terms of number of blocks. If we have mounted the file system with -O
 * stripe=<value> option the group prealloc request is normalized to the
 * the smallest multiple of the stripe value (sbi->s_stripe) which is
 * greater than the default mb_group_prealloc.
 *
 * The regular allocator (using the buddy cache) supports a few tunables.
 *
 * /sys/fs/ext4/<partition>/mb_min_to_scan
 * /sys/fs/ext4/<partition>/mb_max_to_scan
 * /sys/fs/ext4/<partition>/mb_order2_req
 *
 * The regular allocator uses buddy scan only if the request len is power of
 * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
 * value of s_mb_order2_reqs can be tuned via
 * /sys/fs/ext4/<partition>/mb_order2_req.  If the request len is equal to
 * stripe size (sbi->s_stripe), we try to search for contiguous block in
 * stripe size. This should result in better allocation on RAID setups. If
 * not, we search in the specific group using bitmap for best extents. The
 * tunable min_to_scan and max_to_scan control the behaviour here.
 * min_to_scan indicate how long the mballoc __must__ look for a best
 * extent and max_to_scan indicates how long the mballoc __can__ look for a
 * best extent in the found extents. Searching for the blocks starts with
 * the group specified as the goal value in allocation context via
 * ac_g_ex. Each group is first checked based on the criteria whether it
 * can be used for allocation. ext4_mb_good_group explains how the groups are
 * checked.
 *
 * Both the prealloc space are getting populated as above. So for the first
 * request we will hit the buddy cache which will result in this prealloc
 * space getting filled. The prealloc space is then later used for the
 * subsequent request.
 */

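/*
 * Illustrative sketch, not part of this file: the policy described in the
 * comment above reduces to comparing the file size in blocks (the larger
 * of the current and the projected size) against sbi->s_mb_stream_request;
 * small files use the per-CPU locality group preallocation, larger files
 * use per-inode preallocation.  "example_use_group_prealloc" is a
 * hypothetical helper.
 */
static inline bool example_use_group_prealloc(struct ext4_sb_info *sbi,
                                              ext4_fsblk_t size_in_blocks)
{
        return size_in_blocks < sbi->s_mb_stream_request;
}
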
/*
 * mballoc operates on the following data:
 *  - on-disk bitmap
 *  - in-core buddy (actually includes buddy and bitmap)
 *  - preallocation descriptors (PAs)
 *
 * there are two types of preallocations:
 *  - inode
 *    assiged to specific inode and can be used for this inode only.
 *    it describes part of inode's space preallocated to specific
 *    physical blocks. any block from that preallocated can be used
 *    independent. the descriptor just tracks number of blocks left
 *    unused. so, before taking some block from descriptor, one must
 *    make sure corresponded logical block isn't allocated yet. this
 *    also means that freeing any block within descriptor's range
 *    must discard all preallocated blocks.
 *  - locality group
 *    assigned to specific locality group which does not translate to
 *    permanent set of inodes: inode can join and leave group. space
 *    from this type of preallocation can be used for any inode. thus
 *    it's consumed from the beginning to the end.
 *
 * relation between them can be expressed as:
 *    in-core buddy = on-disk bitmap + preallocation descriptors
 *
 * this mean blocks mballoc considers used are:
 *  - allocated blocks (persistent)
 *  - preallocated blocks (non-persistent)
 *
 * consistency in mballoc world means that at any time a block is either
 * free or used in ALL structures. notice: "any time" should not be read
 * literally -- time is discrete and delimited by locks.
 *
 * to keep it simple, we don't use block numbers, instead we count number of
 * blocks: how many blocks marked used/free in on-disk bitmap, buddy and PA.
 *
 * all operations can be expressed as:
 *  - init buddy:                       buddy = on-disk + PAs
 *  - new PA:                           buddy += N; PA = N
 *  - use inode PA:                     on-disk += N; PA -= N
 *  - discard inode PA                  buddy -= on-disk - PA; PA = 0
 *  - use locality group PA             on-disk += N; PA -= N
 *  - discard locality group PA         buddy -= PA; PA = 0
 *  note: 'buddy -= on-disk - PA' is used to show that on-disk bitmap
 *        is used in real operation because we can't know actual used
 *        bits from PA, only from on-disk bitmap
 *
 * if we follow this strict logic, then all operations above should be atomic.
 * given some of them can block, we'd have to use something like semaphores
 * killing performance on high-end SMP hardware. let's try to relax it using
 * the following knowledge:
 *  1) if buddy is referenced, it's already initialized
 *  2) while block is used in buddy and the buddy is referenced,
 *     nobody can re-allocate that block
 *  3) we work on bitmaps and '+' actually means 'set bits'. if on-disk has
 *     bit set and PA claims same block, it's OK. IOW, one can set bit in
 *     on-disk bitmap if buddy has same bit set or/and PA covers corresponded
 *     block
 *
 * so, now we're building a concurrency table:
 *  - init buddy vs.
 *    - new PA
 *      blocks for PA are allocated in the buddy, buddy must be referenced
 *      until PA is linked to allocation group to avoid concurrent buddy init
 *    - use inode PA
 *      we need to make sure that either on-disk bitmap or PA has uptodate data
 *      given (3) we care that PA-=N operation doesn't interfere with init
 *    - discard inode PA
 *      the simplest way would be to have buddy initialized by the discard
 *    - use locality group PA
 *      again PA-=N must be serialized with init
 *    - discard locality group PA
 *      the simplest way would be to have buddy initialized by the discard
 *  - new PA vs.
 *    - use inode PA
 *      i_data_sem serializes them
 *    - discard inode PA
 *      discard process must wait until PA isn't used by another process
 *    - use locality group PA
 *      some mutex should serialize them
 *    - discard locality group PA
 *      discard process must wait until PA isn't used by another process
 *  - use inode PA
 *    - use inode PA
 *      i_data_sem or another mutex should serializes them
 *    - discard inode PA
 *      discard process must wait until PA isn't used by another process
 *    - use locality group PA
 *      nothing wrong here -- they're different PAs covering different blocks
 *    - discard locality group PA
 *      discard process must wait until PA isn't used by another process
 *
 * now we're ready to make few consequences:
 *  - PA is referenced and while it is no discard is possible
 *  - PA is referenced until block isn't marked in on-disk bitmap
 *  - PA changes only after on-disk bitmap
 *  - discard must not compete with init. either init is done before
 *    any discard or they're serialized somehow
 *  - buddy init as sum of on-disk bitmap and PAs is done atomically
 *
 * a special case when we've used PA to emptiness. no need to modify buddy
 * in this case, but we should care about concurrent init
 *
 */
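
/*
 * Illustrative sketch, not part of this file: the relation above,
 * "in-core buddy = on-disk bitmap + preallocation descriptors", amounts to
 * setting the bits covered by each PA in a copy of the on-disk bitmap;
 * per rule (3), '+' simply means 'set bits'.  The helper and its
 * parameters are hypothetical; bitmap_copy()/bitmap_set() come from
 * <linux/bitmap.h>.
 */
static void example_add_pa_to_bitmap(unsigned long *incore,
                                     const unsigned long *ondisk,
                                     unsigned int pa_start,
                                     unsigned int pa_len,
                                     unsigned int nbits)
{
        bitmap_copy(incore, ondisk, nbits);     /* persistent (allocated) blocks */
        bitmap_set(incore, pa_start, pa_len);   /* non-persistent PA blocks      */
}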

/*
 * Logic in few words:
 *
 *  - allocation:
 *    load group
 *    find blocks
 *    mark bits in on-disk bitmap
 *    release group
 *
 *  - use preallocation:
 *    find proper PA (per-inode or group)
 *    load group
 *    mark bits in on-disk bitmap
 *    release group
 *    release PA
 *
 *  - free:
 *    load group
 *    mark bits in on-disk bitmap
 *    release group
 *
 *  - discard preallocations in group:
 *    mark PAs deleted
 *    move them onto local list
 *    load on-disk bitmap
 *    load group
 *    remove PA from object (inode or locality group)
 *    mark free blocks in-core
 *
 *  - discard inode's preallocations:
 */

/*
 * Locking rules
 *
 * Locks:
 *  - bitlock on a group        (group)
 *  - object (inode/locality)   (object)
 *  - per-pa lock               (pa)
 *
 * Paths:
 *  - new pa
 *    object
 *    group
 *
 *  - find and use pa:
 *    pa
 *
 *  - release consumed pa:
 *    pa
 *    group
 *    object
 *
 *  - generate in-core bitmap:
 *    group
 *    pa
 *
 *  - discard all for given object (inode, locality group):
 *    object
 *    pa
 *    group
 *
 *  - discard all for given group:
 *    group
 *    pa
 *    group
 *    object
 *
 */
static struct kmem_cache *ext4_pspace_cachep;
static struct kmem_cache *ext4_ac_cachep;
static struct kmem_cache *ext4_free_data_cachep;

/* We create slab caches for groupinfo data structures based on the
 * superblock block size.  There will be one per mounted filesystem for
 * each unique s_blocksize_bits */
#define NR_GRPINFO_CACHES 8
static struct kmem_cache *ext4_groupinfo_caches[NR_GRPINFO_CACHES];

static const char *ext4_groupinfo_slab_names[NR_GRPINFO_CACHES] = {
        "ext4_groupinfo_1k", "ext4_groupinfo_2k", "ext4_groupinfo_4k",
        "ext4_groupinfo_8k", "ext4_groupinfo_16k", "ext4_groupinfo_32k",
        "ext4_groupinfo_64k", "ext4_groupinfo_128k"
};

static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
                                        ext4_group_t group);
static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
368 ext4_group_t group); 368 ext4_group_t group);
369 static void ext4_free_data_callback(struct super_block *sb, 369 static void ext4_free_data_callback(struct super_block *sb,
370 struct ext4_journal_cb_entry *jce, int rc); 370 struct ext4_journal_cb_entry *jce, int rc);
371 371
372 static inline void *mb_correct_addr_and_bit(int *bit, void *addr) 372 static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
373 { 373 {
374 #if BITS_PER_LONG == 64 374 #if BITS_PER_LONG == 64
375 *bit += ((unsigned long) addr & 7UL) << 3; 375 *bit += ((unsigned long) addr & 7UL) << 3;
376 addr = (void *) ((unsigned long) addr & ~7UL); 376 addr = (void *) ((unsigned long) addr & ~7UL);
377 #elif BITS_PER_LONG == 32 377 #elif BITS_PER_LONG == 32
378 *bit += ((unsigned long) addr & 3UL) << 3; 378 *bit += ((unsigned long) addr & 3UL) << 3;
379 addr = (void *) ((unsigned long) addr & ~3UL); 379 addr = (void *) ((unsigned long) addr & ~3UL);
380 #else 380 #else
381 #error "how many bits you are?!" 381 #error "how many bits you are?!"
382 #endif 382 #endif
383 return addr; 383 return addr;
384 } 384 }
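mb_correct_addr_and_bit() exists because ext4_test_bit() and friends need a long-aligned address on some architectures (see the powerpc comment just below): the helper rounds the pointer down to a long boundary and folds the skipped bytes into the bit index, 8 bits per byte, hence the << 3. A standalone sketch of the same arithmetic, assuming a 64-bit long (the function name here is illustrative, not the kernel helper):

#include <stdio.h>
#include <stdint.h>

/* Round addr down to an 8-byte boundary and grow the bit index so that
 * (aligned, bit) still names the same bit as (addr, bit). */
static void *align_addr_and_bit(int *bit, void *addr)
{
	uintptr_t p = (uintptr_t)addr;

	*bit += (int)(p & 7UL) << 3;	/* 8 bits per skipped byte */
	return (void *)(p & ~7UL);
}

int main(void)
{
	unsigned char buf[32];
	int bit = 2;
	void *addr = buf + 5;
	void *aligned = align_addr_and_bit(&bit, addr);

	/* the pointer moved back by (addr - aligned) bytes and the bit
	 * index grew by 8 bits per byte, so both still name one spot */
	printf("moved back %td bytes, bit index is now %d\n",
	       (char *)addr - (char *)aligned, bit);
	return 0;
}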
385 385
386 static inline int mb_test_bit(int bit, void *addr) 386 static inline int mb_test_bit(int bit, void *addr)
387 { 387 {
388 /* 388 /*
389 * ext4_test_bit on architecture like powerpc 389 * ext4_test_bit on architecture like powerpc
390 * needs unsigned long aligned address 390 * needs unsigned long aligned address
391 */ 391 */
392 addr = mb_correct_addr_and_bit(&bit, addr); 392 addr = mb_correct_addr_and_bit(&bit, addr);
393 return ext4_test_bit(bit, addr); 393 return ext4_test_bit(bit, addr);
394 } 394 }
395 395
396 static inline void mb_set_bit(int bit, void *addr) 396 static inline void mb_set_bit(int bit, void *addr)
397 { 397 {
398 addr = mb_correct_addr_and_bit(&bit, addr); 398 addr = mb_correct_addr_and_bit(&bit, addr);
399 ext4_set_bit(bit, addr); 399 ext4_set_bit(bit, addr);
400 } 400 }
401 401
402 static inline void mb_clear_bit(int bit, void *addr) 402 static inline void mb_clear_bit(int bit, void *addr)
403 { 403 {
404 addr = mb_correct_addr_and_bit(&bit, addr); 404 addr = mb_correct_addr_and_bit(&bit, addr);
405 ext4_clear_bit(bit, addr); 405 ext4_clear_bit(bit, addr);
406 } 406 }
407 407
408 static inline int mb_test_and_clear_bit(int bit, void *addr) 408 static inline int mb_test_and_clear_bit(int bit, void *addr)
409 { 409 {
410 addr = mb_correct_addr_and_bit(&bit, addr); 410 addr = mb_correct_addr_and_bit(&bit, addr);
411 return ext4_test_and_clear_bit(bit, addr); 411 return ext4_test_and_clear_bit(bit, addr);
412 } 412 }
413 413
414 static inline int mb_find_next_zero_bit(void *addr, int max, int start) 414 static inline int mb_find_next_zero_bit(void *addr, int max, int start)
415 { 415 {
416 int fix = 0, ret, tmpmax; 416 int fix = 0, ret, tmpmax;
417 addr = mb_correct_addr_and_bit(&fix, addr); 417 addr = mb_correct_addr_and_bit(&fix, addr);
418 tmpmax = max + fix; 418 tmpmax = max + fix;
419 start += fix; 419 start += fix;
420 420
421 ret = ext4_find_next_zero_bit(addr, tmpmax, start) - fix; 421 ret = ext4_find_next_zero_bit(addr, tmpmax, start) - fix;
422 if (ret > max) 422 if (ret > max)
423 return max; 423 return max;
424 return ret; 424 return ret;
425 } 425 }
426 426
427 static inline int mb_find_next_bit(void *addr, int max, int start) 427 static inline int mb_find_next_bit(void *addr, int max, int start)
428 { 428 {
429 int fix = 0, ret, tmpmax; 429 int fix = 0, ret, tmpmax;
430 addr = mb_correct_addr_and_bit(&fix, addr); 430 addr = mb_correct_addr_and_bit(&fix, addr);
431 tmpmax = max + fix; 431 tmpmax = max + fix;
432 start += fix; 432 start += fix;
433 433
434 ret = ext4_find_next_bit(addr, tmpmax, start) - fix; 434 ret = ext4_find_next_bit(addr, tmpmax, start) - fix;
435 if (ret > max) 435 if (ret > max)
436 return max; 436 return max;
437 return ret; 437 return ret;
438 } 438 }
439 439
440 static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) 440 static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
441 { 441 {
442 char *bb; 442 char *bb;
443 443
444 BUG_ON(e4b->bd_bitmap == e4b->bd_buddy); 444 BUG_ON(e4b->bd_bitmap == e4b->bd_buddy);
445 BUG_ON(max == NULL); 445 BUG_ON(max == NULL);
446 446
447 if (order > e4b->bd_blkbits + 1) { 447 if (order > e4b->bd_blkbits + 1) {
448 *max = 0; 448 *max = 0;
449 return NULL; 449 return NULL;
450 } 450 }
451 451
452 /* at order 0 we see each particular block */ 452 /* at order 0 we see each particular block */
453 if (order == 0) { 453 if (order == 0) {
454 *max = 1 << (e4b->bd_blkbits + 3); 454 *max = 1 << (e4b->bd_blkbits + 3);
455 return e4b->bd_bitmap; 455 return e4b->bd_bitmap;
456 } 456 }
457 457
458 bb = e4b->bd_buddy + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order]; 458 bb = e4b->bd_buddy + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order];
459 *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order]; 459 *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order];
460 460
461 return bb; 461 return bb;
462 } 462 }
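mb_find_buddy() only indexes precomputed per-order tables: order 0 is the block bitmap itself with 8 * blocksize bits, and each order k >= 1 occupies a slice of the separate buddy block with half as many bits as order k - 1, packed one after another. A hedged sketch of how such offsets and limits can be derived for 4 KiB blocks (the arrays mimic the role of s_mb_offsets/s_mb_maxs, but the real initialisation lives in ext4_mb_init(), outside this hunk, and the last couple of orders round slightly differently there):

#include <stdio.h>

int main(void)
{
	int blkbits = 12;			/* 4 KiB blocks */
	int max_order = blkbits + 1;
	int offsets[16], maxs[16];
	int order, off = 0;

	/* order 0: the block bitmap itself, one bit per cluster */
	offsets[0] = 0;
	maxs[0] = 8 << blkbits;			/* 32768 bits */

	/* orders 1..blkbits+1 share the buddy block; order k has half
	 * the bits of order k-1 and is packed right after it */
	for (order = 1; order <= max_order; order++) {
		offsets[order] = off;
		maxs[order] = maxs[order - 1] / 2;
		off += maxs[order] / 8;		/* bytes consumed */
	}

	for (order = 0; order <= max_order; order++)
		printf("order %2d: byte offset %5d, %6d bits\n",
		       order, offsets[order], maxs[order]);
	return 0;
}

The slices for orders 1 and up sum to just under one block (2048 + 1024 + ... bytes), which is why one bitmap block plus one buddy block per group is enough.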
463 463
464 #ifdef DOUBLE_CHECK 464 #ifdef DOUBLE_CHECK
465 static void mb_free_blocks_double(struct inode *inode, struct ext4_buddy *e4b, 465 static void mb_free_blocks_double(struct inode *inode, struct ext4_buddy *e4b,
466 int first, int count) 466 int first, int count)
467 { 467 {
468 int i; 468 int i;
469 struct super_block *sb = e4b->bd_sb; 469 struct super_block *sb = e4b->bd_sb;
470 470
471 if (unlikely(e4b->bd_info->bb_bitmap == NULL)) 471 if (unlikely(e4b->bd_info->bb_bitmap == NULL))
472 return; 472 return;
473 assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group)); 473 assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group));
474 for (i = 0; i < count; i++) { 474 for (i = 0; i < count; i++) {
475 if (!mb_test_bit(first + i, e4b->bd_info->bb_bitmap)) { 475 if (!mb_test_bit(first + i, e4b->bd_info->bb_bitmap)) {
476 ext4_fsblk_t blocknr; 476 ext4_fsblk_t blocknr;
477 477
478 blocknr = ext4_group_first_block_no(sb, e4b->bd_group); 478 blocknr = ext4_group_first_block_no(sb, e4b->bd_group);
479 blocknr += EXT4_C2B(EXT4_SB(sb), first + i); 479 blocknr += EXT4_C2B(EXT4_SB(sb), first + i);
480 ext4_grp_locked_error(sb, e4b->bd_group, 480 ext4_grp_locked_error(sb, e4b->bd_group,
481 inode ? inode->i_ino : 0, 481 inode ? inode->i_ino : 0,
482 blocknr, 482 blocknr,
483 "freeing block already freed " 483 "freeing block already freed "
484 "(bit %u)", 484 "(bit %u)",
485 first + i); 485 first + i);
486 } 486 }
487 mb_clear_bit(first + i, e4b->bd_info->bb_bitmap); 487 mb_clear_bit(first + i, e4b->bd_info->bb_bitmap);
488 } 488 }
489 } 489 }
490 490
491 static void mb_mark_used_double(struct ext4_buddy *e4b, int first, int count) 491 static void mb_mark_used_double(struct ext4_buddy *e4b, int first, int count)
492 { 492 {
493 int i; 493 int i;
494 494
495 if (unlikely(e4b->bd_info->bb_bitmap == NULL)) 495 if (unlikely(e4b->bd_info->bb_bitmap == NULL))
496 return; 496 return;
497 assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group)); 497 assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
498 for (i = 0; i < count; i++) { 498 for (i = 0; i < count; i++) {
499 BUG_ON(mb_test_bit(first + i, e4b->bd_info->bb_bitmap)); 499 BUG_ON(mb_test_bit(first + i, e4b->bd_info->bb_bitmap));
500 mb_set_bit(first + i, e4b->bd_info->bb_bitmap); 500 mb_set_bit(first + i, e4b->bd_info->bb_bitmap);
501 } 501 }
502 } 502 }
503 503
504 static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap) 504 static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
505 { 505 {
506 if (memcmp(e4b->bd_info->bb_bitmap, bitmap, e4b->bd_sb->s_blocksize)) { 506 if (memcmp(e4b->bd_info->bb_bitmap, bitmap, e4b->bd_sb->s_blocksize)) {
507 unsigned char *b1, *b2; 507 unsigned char *b1, *b2;
508 int i; 508 int i;
509 b1 = (unsigned char *) e4b->bd_info->bb_bitmap; 509 b1 = (unsigned char *) e4b->bd_info->bb_bitmap;
510 b2 = (unsigned char *) bitmap; 510 b2 = (unsigned char *) bitmap;
511 for (i = 0; i < e4b->bd_sb->s_blocksize; i++) { 511 for (i = 0; i < e4b->bd_sb->s_blocksize; i++) {
512 if (b1[i] != b2[i]) { 512 if (b1[i] != b2[i]) {
513 ext4_msg(e4b->bd_sb, KERN_ERR, 513 ext4_msg(e4b->bd_sb, KERN_ERR,
514 "corruption in group %u " 514 "corruption in group %u "
515 "at byte %u(%u): %x in copy != %x " 515 "at byte %u(%u): %x in copy != %x "
516 "on disk/prealloc", 516 "on disk/prealloc",
517 e4b->bd_group, i, i * 8, b1[i], b2[i]); 517 e4b->bd_group, i, i * 8, b1[i], b2[i]);
518 BUG(); 518 BUG();
519 } 519 }
520 } 520 }
521 } 521 }
522 } 522 }
523 523
524 #else 524 #else
525 static inline void mb_free_blocks_double(struct inode *inode, 525 static inline void mb_free_blocks_double(struct inode *inode,
526 struct ext4_buddy *e4b, int first, int count) 526 struct ext4_buddy *e4b, int first, int count)
527 { 527 {
528 return; 528 return;
529 } 529 }
530 static inline void mb_mark_used_double(struct ext4_buddy *e4b, 530 static inline void mb_mark_used_double(struct ext4_buddy *e4b,
531 int first, int count) 531 int first, int count)
532 { 532 {
533 return; 533 return;
534 } 534 }
535 static inline void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap) 535 static inline void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
536 { 536 {
537 return; 537 return;
538 } 538 }
539 #endif 539 #endif
540 540
541 #ifdef AGGRESSIVE_CHECK 541 #ifdef AGGRESSIVE_CHECK
542 542
543 #define MB_CHECK_ASSERT(assert) \ 543 #define MB_CHECK_ASSERT(assert) \
544 do { \ 544 do { \
545 if (!(assert)) { \ 545 if (!(assert)) { \
546 printk(KERN_EMERG \ 546 printk(KERN_EMERG \
547 "Assertion failure in %s() at %s:%d: \"%s\"\n", \ 547 "Assertion failure in %s() at %s:%d: \"%s\"\n", \
548 function, file, line, # assert); \ 548 function, file, line, # assert); \
549 BUG(); \ 549 BUG(); \
550 } \ 550 } \
551 } while (0) 551 } while (0)
552 552
553 static int __mb_check_buddy(struct ext4_buddy *e4b, char *file, 553 static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
554 const char *function, int line) 554 const char *function, int line)
555 { 555 {
556 struct super_block *sb = e4b->bd_sb; 556 struct super_block *sb = e4b->bd_sb;
557 int order = e4b->bd_blkbits + 1; 557 int order = e4b->bd_blkbits + 1;
558 int max; 558 int max;
559 int max2; 559 int max2;
560 int i; 560 int i;
561 int j; 561 int j;
562 int k; 562 int k;
563 int count; 563 int count;
564 struct ext4_group_info *grp; 564 struct ext4_group_info *grp;
565 int fragments = 0; 565 int fragments = 0;
566 int fstart; 566 int fstart;
567 struct list_head *cur; 567 struct list_head *cur;
568 void *buddy; 568 void *buddy;
569 void *buddy2; 569 void *buddy2;
570 570
571 { 571 {
572 static int mb_check_counter; 572 static int mb_check_counter;
573 if (mb_check_counter++ % 100 != 0) 573 if (mb_check_counter++ % 100 != 0)
574 return 0; 574 return 0;
575 } 575 }
576 576
577 while (order > 1) { 577 while (order > 1) {
578 buddy = mb_find_buddy(e4b, order, &max); 578 buddy = mb_find_buddy(e4b, order, &max);
579 MB_CHECK_ASSERT(buddy); 579 MB_CHECK_ASSERT(buddy);
580 buddy2 = mb_find_buddy(e4b, order - 1, &max2); 580 buddy2 = mb_find_buddy(e4b, order - 1, &max2);
581 MB_CHECK_ASSERT(buddy2); 581 MB_CHECK_ASSERT(buddy2);
582 MB_CHECK_ASSERT(buddy != buddy2); 582 MB_CHECK_ASSERT(buddy != buddy2);
583 MB_CHECK_ASSERT(max * 2 == max2); 583 MB_CHECK_ASSERT(max * 2 == max2);
584 584
585 count = 0; 585 count = 0;
586 for (i = 0; i < max; i++) { 586 for (i = 0; i < max; i++) {
587 587
588 if (mb_test_bit(i, buddy)) { 588 if (mb_test_bit(i, buddy)) {
589 /* only single bit in buddy2 may be 1 */ 589 /* only single bit in buddy2 may be 1 */
590 if (!mb_test_bit(i << 1, buddy2)) { 590 if (!mb_test_bit(i << 1, buddy2)) {
591 MB_CHECK_ASSERT( 591 MB_CHECK_ASSERT(
592 mb_test_bit((i<<1)+1, buddy2)); 592 mb_test_bit((i<<1)+1, buddy2));
593 } else if (!mb_test_bit((i << 1) + 1, buddy2)) { 593 } else if (!mb_test_bit((i << 1) + 1, buddy2)) {
594 MB_CHECK_ASSERT( 594 MB_CHECK_ASSERT(
595 mb_test_bit(i << 1, buddy2)); 595 mb_test_bit(i << 1, buddy2));
596 } 596 }
597 continue; 597 continue;
598 } 598 }
599 599
600 /* both bits in buddy2 must be 1 */ 600 /* both bits in buddy2 must be 1 */
601 MB_CHECK_ASSERT(mb_test_bit(i << 1, buddy2)); 601 MB_CHECK_ASSERT(mb_test_bit(i << 1, buddy2));
602 MB_CHECK_ASSERT(mb_test_bit((i << 1) + 1, buddy2)); 602 MB_CHECK_ASSERT(mb_test_bit((i << 1) + 1, buddy2));
603 603
604 for (j = 0; j < (1 << order); j++) { 604 for (j = 0; j < (1 << order); j++) {
605 k = (i * (1 << order)) + j; 605 k = (i * (1 << order)) + j;
606 MB_CHECK_ASSERT( 606 MB_CHECK_ASSERT(
607 !mb_test_bit(k, e4b->bd_bitmap)); 607 !mb_test_bit(k, e4b->bd_bitmap));
608 } 608 }
609 count++; 609 count++;
610 } 610 }
611 MB_CHECK_ASSERT(e4b->bd_info->bb_counters[order] == count); 611 MB_CHECK_ASSERT(e4b->bd_info->bb_counters[order] == count);
612 order--; 612 order--;
613 } 613 }
614 614
615 fstart = -1; 615 fstart = -1;
616 buddy = mb_find_buddy(e4b, 0, &max); 616 buddy = mb_find_buddy(e4b, 0, &max);
617 for (i = 0; i < max; i++) { 617 for (i = 0; i < max; i++) {
618 if (!mb_test_bit(i, buddy)) { 618 if (!mb_test_bit(i, buddy)) {
619 MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free); 619 MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
620 if (fstart == -1) { 620 if (fstart == -1) {
621 fragments++; 621 fragments++;
622 fstart = i; 622 fstart = i;
623 } 623 }
624 continue; 624 continue;
625 } 625 }
626 fstart = -1; 626 fstart = -1;
627 /* check used bits only */ 627 /* check used bits only */
628 for (j = 0; j < e4b->bd_blkbits + 1; j++) { 628 for (j = 0; j < e4b->bd_blkbits + 1; j++) {
629 buddy2 = mb_find_buddy(e4b, j, &max2); 629 buddy2 = mb_find_buddy(e4b, j, &max2);
630 k = i >> j; 630 k = i >> j;
631 MB_CHECK_ASSERT(k < max2); 631 MB_CHECK_ASSERT(k < max2);
632 MB_CHECK_ASSERT(mb_test_bit(k, buddy2)); 632 MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
633 } 633 }
634 } 634 }
635 MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info)); 635 MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
636 MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments); 636 MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
637 637
638 grp = ext4_get_group_info(sb, e4b->bd_group); 638 grp = ext4_get_group_info(sb, e4b->bd_group);
639 list_for_each(cur, &grp->bb_prealloc_list) { 639 list_for_each(cur, &grp->bb_prealloc_list) {
640 ext4_group_t groupnr; 640 ext4_group_t groupnr;
641 struct ext4_prealloc_space *pa; 641 struct ext4_prealloc_space *pa;
642 pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list); 642 pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
643 ext4_get_group_no_and_offset(sb, pa->pa_pstart, &groupnr, &k); 643 ext4_get_group_no_and_offset(sb, pa->pa_pstart, &groupnr, &k);
644 MB_CHECK_ASSERT(groupnr == e4b->bd_group); 644 MB_CHECK_ASSERT(groupnr == e4b->bd_group);
645 for (i = 0; i < pa->pa_len; i++) 645 for (i = 0; i < pa->pa_len; i++)
646 MB_CHECK_ASSERT(mb_test_bit(k + i, buddy)); 646 MB_CHECK_ASSERT(mb_test_bit(k + i, buddy));
647 } 647 }
648 return 0; 648 return 0;
649 } 649 }
650 #undef MB_CHECK_ASSERT 650 #undef MB_CHECK_ASSERT
651 #define mb_check_buddy(e4b) __mb_check_buddy(e4b, \ 651 #define mb_check_buddy(e4b) __mb_check_buddy(e4b, \
652 __FILE__, __func__, __LINE__) 652 __FILE__, __func__, __LINE__)
653 #else 653 #else
654 #define mb_check_buddy(e4b) 654 #define mb_check_buddy(e4b)
655 #endif 655 #endif
656 656
657 /* 657 /*
658 * Divide blocks started from @first with length @len into 658 * Divide blocks started from @first with length @len into
659 * smaller chunks with power of 2 blocks. 659 * smaller chunks with power of 2 blocks.
660 * Clear the bits in bitmap which the blocks of the chunk(s) covered, 660 * Clear the bits in bitmap which the blocks of the chunk(s) covered,
661 * then increase bb_counters[] for corresponded chunk size. 661 * then increase bb_counters[] for corresponded chunk size.
662 */ 662 */
663 static void ext4_mb_mark_free_simple(struct super_block *sb, 663 static void ext4_mb_mark_free_simple(struct super_block *sb,
664 void *buddy, ext4_grpblk_t first, ext4_grpblk_t len, 664 void *buddy, ext4_grpblk_t first, ext4_grpblk_t len,
665 struct ext4_group_info *grp) 665 struct ext4_group_info *grp)
666 { 666 {
667 struct ext4_sb_info *sbi = EXT4_SB(sb); 667 struct ext4_sb_info *sbi = EXT4_SB(sb);
668 ext4_grpblk_t min; 668 ext4_grpblk_t min;
669 ext4_grpblk_t max; 669 ext4_grpblk_t max;
670 ext4_grpblk_t chunk; 670 ext4_grpblk_t chunk;
671 unsigned short border; 671 unsigned short border;
672 672
673 BUG_ON(len > EXT4_CLUSTERS_PER_GROUP(sb)); 673 BUG_ON(len > EXT4_CLUSTERS_PER_GROUP(sb));
674 674
675 border = 2 << sb->s_blocksize_bits; 675 border = 2 << sb->s_blocksize_bits;
676 676
677 while (len > 0) { 677 while (len > 0) {
678 /* find how many blocks can be covered since this position */ 678 /* find how many blocks can be covered since this position */
679 max = ffs(first | border) - 1; 679 max = ffs(first | border) - 1;
680 680
681 /* find how many blocks of power 2 we need to mark */ 681 /* find how many blocks of power 2 we need to mark */
682 min = fls(len) - 1; 682 min = fls(len) - 1;
683 683
684 if (max < min) 684 if (max < min)
685 min = max; 685 min = max;
686 chunk = 1 << min; 686 chunk = 1 << min;
687 687
688 /* mark multiblock chunks only */ 688 /* mark multiblock chunks only */
689 grp->bb_counters[min]++; 689 grp->bb_counters[min]++;
690 if (min > 0) 690 if (min > 0)
691 mb_clear_bit(first >> min, 691 mb_clear_bit(first >> min,
692 buddy + sbi->s_mb_offsets[min]); 692 buddy + sbi->s_mb_offsets[min]);
693 693
694 len -= chunk; 694 len -= chunk;
695 first += chunk; 695 first += chunk;
696 } 696 }
697 } 697 }
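ext4_mb_mark_free_simple() splits a free extent into the largest power-of-two chunks that are both naturally aligned at the current position and no longer than what remains: ffs(first | border) - 1 is the alignment order of first (capped by border), and fls(len) - 1 is the largest order that still fits. A standalone trace of the same chunking for the extent [5, 16), with ffs/fls built from compiler builtins since fls() is not in the C library (the kernel version additionally clears buddy bits and bumps bb_counters):

#include <stdio.h>

static int ffs_(unsigned int x) { return x ? __builtin_ctz(x) + 1 : 0; }
static int fls_(unsigned int x) { return x ? 32 - __builtin_clz(x) : 0; }

int main(void)
{
	unsigned int first = 5, len = 11;	/* free extent [5, 16) */
	unsigned int border = 1u << 13;		/* 2 << blkbits for 4 KiB blocks */

	while (len > 0) {
		int max = ffs_(first | border) - 1;	/* alignment of 'first' */
		int min = fls_(len) - 1;		/* largest order that fits */
		unsigned int chunk;

		if (max < min)
			min = max;
		chunk = 1u << min;

		printf("chunk of order %d: blocks [%u, %u)\n",
		       min, first, first + chunk);
		first += chunk;
		len -= chunk;
	}
	return 0;
}

For [5, 16) this prints chunks of order 0, 1 and 3, i.e. [5,6), [6,8) and [8,16), exactly the pieces whose buddy bits the kernel function would clear.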
698 698
699 /* 699 /*
700 * Cache the order of the largest free extent we have available in this block 700 * Cache the order of the largest free extent we have available in this block
701 * group. 701 * group.
702 */ 702 */
703 static void 703 static void
704 mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp) 704 mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
705 { 705 {
706 int i; 706 int i;
707 int bits; 707 int bits;
708 708
709 grp->bb_largest_free_order = -1; /* uninit */ 709 grp->bb_largest_free_order = -1; /* uninit */
710 710
711 bits = sb->s_blocksize_bits + 1; 711 bits = sb->s_blocksize_bits + 1;
712 for (i = bits; i >= 0; i--) { 712 for (i = bits; i >= 0; i--) {
713 if (grp->bb_counters[i] > 0) { 713 if (grp->bb_counters[i] > 0) {
714 grp->bb_largest_free_order = i; 714 grp->bb_largest_free_order = i;
715 break; 715 break;
716 } 716 }
717 } 717 }
718 } 718 }
719 719
720 static noinline_for_stack 720 static noinline_for_stack
721 void ext4_mb_generate_buddy(struct super_block *sb, 721 void ext4_mb_generate_buddy(struct super_block *sb,
722 void *buddy, void *bitmap, ext4_group_t group) 722 void *buddy, void *bitmap, ext4_group_t group)
723 { 723 {
724 struct ext4_group_info *grp = ext4_get_group_info(sb, group); 724 struct ext4_group_info *grp = ext4_get_group_info(sb, group);
725 ext4_grpblk_t max = EXT4_CLUSTERS_PER_GROUP(sb); 725 ext4_grpblk_t max = EXT4_CLUSTERS_PER_GROUP(sb);
726 ext4_grpblk_t i = 0; 726 ext4_grpblk_t i = 0;
727 ext4_grpblk_t first; 727 ext4_grpblk_t first;
728 ext4_grpblk_t len; 728 ext4_grpblk_t len;
729 unsigned free = 0; 729 unsigned free = 0;
730 unsigned fragments = 0; 730 unsigned fragments = 0;
731 unsigned long long period = get_cycles(); 731 unsigned long long period = get_cycles();
732 732
733 /* initialize buddy from bitmap which is aggregation 733 /* initialize buddy from bitmap which is aggregation
734 * of on-disk bitmap and preallocations */ 734 * of on-disk bitmap and preallocations */
735 i = mb_find_next_zero_bit(bitmap, max, 0); 735 i = mb_find_next_zero_bit(bitmap, max, 0);
736 grp->bb_first_free = i; 736 grp->bb_first_free = i;
737 while (i < max) { 737 while (i < max) {
738 fragments++; 738 fragments++;
739 first = i; 739 first = i;
740 i = mb_find_next_bit(bitmap, max, i); 740 i = mb_find_next_bit(bitmap, max, i);
741 len = i - first; 741 len = i - first;
742 free += len; 742 free += len;
743 if (len > 1) 743 if (len > 1)
744 ext4_mb_mark_free_simple(sb, buddy, first, len, grp); 744 ext4_mb_mark_free_simple(sb, buddy, first, len, grp);
745 else 745 else
746 grp->bb_counters[0]++; 746 grp->bb_counters[0]++;
747 if (i < max) 747 if (i < max)
748 i = mb_find_next_zero_bit(bitmap, max, i); 748 i = mb_find_next_zero_bit(bitmap, max, i);
749 } 749 }
750 grp->bb_fragments = fragments; 750 grp->bb_fragments = fragments;
751 751
752 if (free != grp->bb_free) { 752 if (free != grp->bb_free) {
753 ext4_grp_locked_error(sb, group, 0, 0, 753 ext4_grp_locked_error(sb, group, 0, 0,
754 "block bitmap and bg descriptor " 754 "block bitmap and bg descriptor "
755 "inconsistent: %u vs %u free clusters", 755 "inconsistent: %u vs %u free clusters",
756 free, grp->bb_free); 756 free, grp->bb_free);
757 /* 757 /*
758 * If we intend to continue, we consider group descriptor 758 * If we intend to continue, we consider group descriptor
759 * corrupt and update bb_free using bitmap value 759 * corrupt and update bb_free using bitmap value
760 */ 760 */
761 grp->bb_free = free; 761 grp->bb_free = free;
762 set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); 762 set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
763 } 763 }
764 mb_set_largest_free_order(sb, grp); 764 mb_set_largest_free_order(sb, grp);
765 765
766 clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state)); 766 clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));
767 767
768 period = get_cycles() - period; 768 period = get_cycles() - period;
769 spin_lock(&EXT4_SB(sb)->s_bal_lock); 769 spin_lock(&EXT4_SB(sb)->s_bal_lock);
770 EXT4_SB(sb)->s_mb_buddies_generated++; 770 EXT4_SB(sb)->s_mb_buddies_generated++;
771 EXT4_SB(sb)->s_mb_generation_time += period; 771 EXT4_SB(sb)->s_mb_generation_time += period;
772 spin_unlock(&EXT4_SB(sb)->s_bal_lock); 772 spin_unlock(&EXT4_SB(sb)->s_bal_lock);
773 } 773 }
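ext4_mb_generate_buddy() walks the combined bitmap as alternating runs: find the next zero bit (start of a free extent), then the next set bit (its end), add the length to free, count one fragment, and hand runs longer than one block to ext4_mb_mark_free_simple(). The same run detection over a byte-per-bit toy bitmap, just to make the free/fragments accounting concrete (the real code works on packed bitmaps through mb_find_next_bit()/mb_find_next_zero_bit()):

#include <stdio.h>

int main(void)
{
	/* 1 = used, 0 = free, one byte per "cluster" for simplicity */
	int bitmap[16] = { 1,1,0,0,0,1,0,1,1,0,0,0,0,1,1,1 };
	int max = 16, i = 0, free = 0, fragments = 0;

	while (i < max && bitmap[i])		/* find first free cluster */
		i++;
	while (i < max) {
		int first = i;

		fragments++;
		while (i < max && !bitmap[i])	/* end of this free run */
			i++;
		free += i - first;
		printf("free extent [%d, %d), length %d\n", first, i, i - first);
		while (i < max && bitmap[i])	/* skip to the next free run */
			i++;
	}
	printf("free = %d, fragments = %d\n", free, fragments);
	return 0;
}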
774 774
775 static void mb_regenerate_buddy(struct ext4_buddy *e4b) 775 static void mb_regenerate_buddy(struct ext4_buddy *e4b)
776 { 776 {
777 int count; 777 int count;
778 int order = 1; 778 int order = 1;
779 void *buddy; 779 void *buddy;
780 780
781 while ((buddy = mb_find_buddy(e4b, order++, &count))) { 781 while ((buddy = mb_find_buddy(e4b, order++, &count))) {
782 ext4_set_bits(buddy, 0, count); 782 ext4_set_bits(buddy, 0, count);
783 } 783 }
784 e4b->bd_info->bb_fragments = 0; 784 e4b->bd_info->bb_fragments = 0;
785 memset(e4b->bd_info->bb_counters, 0, 785 memset(e4b->bd_info->bb_counters, 0,
786 sizeof(*e4b->bd_info->bb_counters) * 786 sizeof(*e4b->bd_info->bb_counters) *
787 (e4b->bd_sb->s_blocksize_bits + 2)); 787 (e4b->bd_sb->s_blocksize_bits + 2));
788 788
789 ext4_mb_generate_buddy(e4b->bd_sb, e4b->bd_buddy, 789 ext4_mb_generate_buddy(e4b->bd_sb, e4b->bd_buddy,
790 e4b->bd_bitmap, e4b->bd_group); 790 e4b->bd_bitmap, e4b->bd_group);
791 } 791 }
792 792
793 /* The buddy information is attached the buddy cache inode 793 /* The buddy information is attached the buddy cache inode
794 * for convenience. The information regarding each group 794 * for convenience. The information regarding each group
795 * is loaded via ext4_mb_load_buddy. The information involve 795 * is loaded via ext4_mb_load_buddy. The information involve
796 * block bitmap and buddy information. The information are 796 * block bitmap and buddy information. The information are
797 * stored in the inode as 797 * stored in the inode as
798 * 798 *
799 * { page } 799 * { page }
800 * [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]... 800 * [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]...
801 * 801 *
802 * 802 *
803 * one block each for bitmap and buddy information. 803 * one block each for bitmap and buddy information.
804 * So for each group we take up 2 blocks. A page can 804 * So for each group we take up 2 blocks. A page can
805 * contain blocks_per_page (PAGE_CACHE_SIZE / blocksize) blocks. 805 * contain blocks_per_page (PAGE_CACHE_SIZE / blocksize) blocks.
806 * So it can have information regarding groups_per_page which 806 * So it can have information regarding groups_per_page which
807 * is blocks_per_page/2 807 * is blocks_per_page/2
808 * 808 *
809 * Locking note: This routine takes the block group lock of all groups 809 * Locking note: This routine takes the block group lock of all groups
810 * for this page; do not hold this lock when calling this routine! 810 * for this page; do not hold this lock when calling this routine!
811 */ 811 */
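The layout described above reduces to index arithmetic: group g owns blocks 2*g (bitmap) and 2*g + 1 (buddy) of the buddy-cache inode, and each page of that inode holds PAGE_CACHE_SIZE / blocksize such blocks. A small sketch of the mapping, assuming 4 KiB pages and 1 KiB blocks so that one page covers two groups:

#include <stdio.h>

int main(void)
{
	int page_size = 4096, blocksize = 1024;
	int blocks_per_page = page_size / blocksize;	/* 4 */
	int group;

	for (group = 0; group < 4; group++) {
		int bitmap_block = group * 2;
		int buddy_block  = group * 2 + 1;

		printf("group %d: bitmap in page %d at byte %d, "
		       "buddy in page %d at byte %d\n",
		       group,
		       bitmap_block / blocks_per_page,
		       (bitmap_block % blocks_per_page) * blocksize,
		       buddy_block / blocks_per_page,
		       (buddy_block % blocks_per_page) * blocksize);
	}
	return 0;
}

When the block size equals the page size there is only one block per page, so a group's bitmap and buddy land in two different pages; that is the case ext4_mb_get_buddy_page_lock() further down handles by locking a second page.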
812 812
813 static int ext4_mb_init_cache(struct page *page, char *incore) 813 static int ext4_mb_init_cache(struct page *page, char *incore)
814 { 814 {
815 ext4_group_t ngroups; 815 ext4_group_t ngroups;
816 int blocksize; 816 int blocksize;
817 int blocks_per_page; 817 int blocks_per_page;
818 int groups_per_page; 818 int groups_per_page;
819 int err = 0; 819 int err = 0;
820 int i; 820 int i;
821 ext4_group_t first_group, group; 821 ext4_group_t first_group, group;
822 int first_block; 822 int first_block;
823 struct super_block *sb; 823 struct super_block *sb;
824 struct buffer_head *bhs; 824 struct buffer_head *bhs;
825 struct buffer_head **bh = NULL; 825 struct buffer_head **bh = NULL;
826 struct inode *inode; 826 struct inode *inode;
827 char *data; 827 char *data;
828 char *bitmap; 828 char *bitmap;
829 struct ext4_group_info *grinfo; 829 struct ext4_group_info *grinfo;
830 830
831 mb_debug(1, "init page %lu\n", page->index); 831 mb_debug(1, "init page %lu\n", page->index);
832 832
833 inode = page->mapping->host; 833 inode = page->mapping->host;
834 sb = inode->i_sb; 834 sb = inode->i_sb;
835 ngroups = ext4_get_groups_count(sb); 835 ngroups = ext4_get_groups_count(sb);
836 blocksize = 1 << inode->i_blkbits; 836 blocksize = 1 << inode->i_blkbits;
837 blocks_per_page = PAGE_CACHE_SIZE / blocksize; 837 blocks_per_page = PAGE_CACHE_SIZE / blocksize;
838 838
839 groups_per_page = blocks_per_page >> 1; 839 groups_per_page = blocks_per_page >> 1;
840 if (groups_per_page == 0) 840 if (groups_per_page == 0)
841 groups_per_page = 1; 841 groups_per_page = 1;
842 842
843 /* allocate buffer_heads to read bitmaps */ 843 /* allocate buffer_heads to read bitmaps */
844 if (groups_per_page > 1) { 844 if (groups_per_page > 1) {
845 i = sizeof(struct buffer_head *) * groups_per_page; 845 i = sizeof(struct buffer_head *) * groups_per_page;
846 bh = kzalloc(i, GFP_NOFS); 846 bh = kzalloc(i, GFP_NOFS);
847 if (bh == NULL) { 847 if (bh == NULL) {
848 err = -ENOMEM; 848 err = -ENOMEM;
849 goto out; 849 goto out;
850 } 850 }
851 } else 851 } else
852 bh = &bhs; 852 bh = &bhs;
853 853
854 first_group = page->index * blocks_per_page / 2; 854 first_group = page->index * blocks_per_page / 2;
855 855
856 /* read all groups the page covers into the cache */ 856 /* read all groups the page covers into the cache */
857 for (i = 0, group = first_group; i < groups_per_page; i++, group++) { 857 for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
858 if (group >= ngroups) 858 if (group >= ngroups)
859 break; 859 break;
860 860
861 grinfo = ext4_get_group_info(sb, group); 861 grinfo = ext4_get_group_info(sb, group);
862 /* 862 /*
863 * If page is uptodate then we came here after online resize 863 * If page is uptodate then we came here after online resize
864 * which added some new uninitialized group info structs, so 864 * which added some new uninitialized group info structs, so
865 * we must skip all initialized uptodate buddies on the page, 865 * we must skip all initialized uptodate buddies on the page,
866 * which may be currently in use by an allocating task. 866 * which may be currently in use by an allocating task.
867 */ 867 */
868 if (PageUptodate(page) && !EXT4_MB_GRP_NEED_INIT(grinfo)) { 868 if (PageUptodate(page) && !EXT4_MB_GRP_NEED_INIT(grinfo)) {
869 bh[i] = NULL; 869 bh[i] = NULL;
870 continue; 870 continue;
871 } 871 }
872 if (!(bh[i] = ext4_read_block_bitmap_nowait(sb, group))) { 872 if (!(bh[i] = ext4_read_block_bitmap_nowait(sb, group))) {
873 err = -ENOMEM; 873 err = -ENOMEM;
874 goto out; 874 goto out;
875 } 875 }
876 mb_debug(1, "read bitmap for group %u\n", group); 876 mb_debug(1, "read bitmap for group %u\n", group);
877 } 877 }
878 878
879 /* wait for I/O completion */ 879 /* wait for I/O completion */
880 for (i = 0, group = first_group; i < groups_per_page; i++, group++) { 880 for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
881 if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i])) { 881 if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i])) {
882 err = -EIO; 882 err = -EIO;
883 goto out; 883 goto out;
884 } 884 }
885 } 885 }
886 886
887 first_block = page->index * blocks_per_page; 887 first_block = page->index * blocks_per_page;
888 for (i = 0; i < blocks_per_page; i++) { 888 for (i = 0; i < blocks_per_page; i++) {
889 group = (first_block + i) >> 1; 889 group = (first_block + i) >> 1;
890 if (group >= ngroups) 890 if (group >= ngroups)
891 break; 891 break;
892 892
893 if (!bh[group - first_group]) 893 if (!bh[group - first_group])
894 /* skip initialized uptodate buddy */ 894 /* skip initialized uptodate buddy */
895 continue; 895 continue;
896 896
897 /* 897 /*
898 * data carry information regarding this 898 * data carry information regarding this
899 * particular group in the format specified 899 * particular group in the format specified
900 * above 900 * above
901 * 901 *
902 */ 902 */
903 data = page_address(page) + (i * blocksize); 903 data = page_address(page) + (i * blocksize);
904 bitmap = bh[group - first_group]->b_data; 904 bitmap = bh[group - first_group]->b_data;
905 905
906 /* 906 /*
907 * We place the buddy block and bitmap block 907 * We place the buddy block and bitmap block
908 * close together 908 * close together
909 */ 909 */
910 if ((first_block + i) & 1) { 910 if ((first_block + i) & 1) {
911 /* this is block of buddy */ 911 /* this is block of buddy */
912 BUG_ON(incore == NULL); 912 BUG_ON(incore == NULL);
913 mb_debug(1, "put buddy for group %u in page %lu/%x\n", 913 mb_debug(1, "put buddy for group %u in page %lu/%x\n",
914 group, page->index, i * blocksize); 914 group, page->index, i * blocksize);
915 trace_ext4_mb_buddy_bitmap_load(sb, group); 915 trace_ext4_mb_buddy_bitmap_load(sb, group);
916 grinfo = ext4_get_group_info(sb, group); 916 grinfo = ext4_get_group_info(sb, group);
917 grinfo->bb_fragments = 0; 917 grinfo->bb_fragments = 0;
918 memset(grinfo->bb_counters, 0, 918 memset(grinfo->bb_counters, 0,
919 sizeof(*grinfo->bb_counters) * 919 sizeof(*grinfo->bb_counters) *
920 (sb->s_blocksize_bits+2)); 920 (sb->s_blocksize_bits+2));
921 /* 921 /*
922 * incore got set to the group block bitmap below 922 * incore got set to the group block bitmap below
923 */ 923 */
924 ext4_lock_group(sb, group); 924 ext4_lock_group(sb, group);
925 /* init the buddy */ 925 /* init the buddy */
926 memset(data, 0xff, blocksize); 926 memset(data, 0xff, blocksize);
927 ext4_mb_generate_buddy(sb, data, incore, group); 927 ext4_mb_generate_buddy(sb, data, incore, group);
928 ext4_unlock_group(sb, group); 928 ext4_unlock_group(sb, group);
929 incore = NULL; 929 incore = NULL;
930 } else { 930 } else {
931 /* this is block of bitmap */ 931 /* this is block of bitmap */
932 BUG_ON(incore != NULL); 932 BUG_ON(incore != NULL);
933 mb_debug(1, "put bitmap for group %u in page %lu/%x\n", 933 mb_debug(1, "put bitmap for group %u in page %lu/%x\n",
934 group, page->index, i * blocksize); 934 group, page->index, i * blocksize);
935 trace_ext4_mb_bitmap_load(sb, group); 935 trace_ext4_mb_bitmap_load(sb, group);
936 936
937 /* see comments in ext4_mb_put_pa() */ 937 /* see comments in ext4_mb_put_pa() */
938 ext4_lock_group(sb, group); 938 ext4_lock_group(sb, group);
939 memcpy(data, bitmap, blocksize); 939 memcpy(data, bitmap, blocksize);
940 940
941 /* mark all preallocated blks used in in-core bitmap */ 941 /* mark all preallocated blks used in in-core bitmap */
942 ext4_mb_generate_from_pa(sb, data, group); 942 ext4_mb_generate_from_pa(sb, data, group);
943 ext4_mb_generate_from_freelist(sb, data, group); 943 ext4_mb_generate_from_freelist(sb, data, group);
944 ext4_unlock_group(sb, group); 944 ext4_unlock_group(sb, group);
945 945
946 /* set incore so that the buddy information can be 946 /* set incore so that the buddy information can be
947 * generated using this 947 * generated using this
948 */ 948 */
949 incore = data; 949 incore = data;
950 } 950 }
951 } 951 }
952 SetPageUptodate(page); 952 SetPageUptodate(page);
953 953
954 out: 954 out:
955 if (bh) { 955 if (bh) {
956 for (i = 0; i < groups_per_page; i++) 956 for (i = 0; i < groups_per_page; i++)
957 brelse(bh[i]); 957 brelse(bh[i]);
958 if (bh != &bhs) 958 if (bh != &bhs)
959 kfree(bh); 959 kfree(bh);
960 } 960 }
961 return err; 961 return err;
962 } 962 }
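Inside ext4_mb_init_cache() the blocks of one page alternate by parity: even absolute block numbers carry a group's bitmap copy, odd ones carry the buddy generated from the bitmap handled just before, which is why incore is filled by the bitmap branch and consumed by the buddy branch. A sketch of just that slot-to-group mapping, assuming four blocks per page:

#include <stdio.h>

int main(void)
{
	int blocks_per_page = 4;
	int page_index = 3;	/* some page of the buddy-cache inode */
	int first_block = page_index * blocks_per_page;
	int i;

	for (i = 0; i < blocks_per_page; i++) {
		int block = first_block + i;
		int group = block >> 1;

		printf("page %d, slot %d -> group %d %s\n",
		       page_index, i, group,
		       (block & 1) ? "buddy" : "bitmap");
	}
	return 0;
}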
963 963
964 /* 964 /*
965 * Lock the buddy and bitmap pages. This make sure other parallel init_group 965 * Lock the buddy and bitmap pages. This make sure other parallel init_group
966 * on the same buddy page doesn't happen whild holding the buddy page lock. 966 * on the same buddy page doesn't happen whild holding the buddy page lock.
967 * Return locked buddy and bitmap pages on e4b struct. If buddy and bitmap 967 * Return locked buddy and bitmap pages on e4b struct. If buddy and bitmap
968 * are on the same page e4b->bd_buddy_page is NULL and return value is 0. 968 * are on the same page e4b->bd_buddy_page is NULL and return value is 0.
969 */ 969 */
970 static int ext4_mb_get_buddy_page_lock(struct super_block *sb, 970 static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
971 ext4_group_t group, struct ext4_buddy *e4b) 971 ext4_group_t group, struct ext4_buddy *e4b)
972 { 972 {
973 struct inode *inode = EXT4_SB(sb)->s_buddy_cache; 973 struct inode *inode = EXT4_SB(sb)->s_buddy_cache;
974 int block, pnum, poff; 974 int block, pnum, poff;
975 int blocks_per_page; 975 int blocks_per_page;
976 struct page *page; 976 struct page *page;
977 977
978 e4b->bd_buddy_page = NULL; 978 e4b->bd_buddy_page = NULL;
979 e4b->bd_bitmap_page = NULL; 979 e4b->bd_bitmap_page = NULL;
980 980
981 blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; 981 blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
982 /* 982 /*
983 * the buddy cache inode stores the block bitmap 983 * the buddy cache inode stores the block bitmap
984 * and buddy information in consecutive blocks. 984 * and buddy information in consecutive blocks.
985 * So for each group we need two blocks. 985 * So for each group we need two blocks.
986 */ 986 */
987 block = group * 2; 987 block = group * 2;
988 pnum = block / blocks_per_page; 988 pnum = block / blocks_per_page;
989 poff = block % blocks_per_page; 989 poff = block % blocks_per_page;
990 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); 990 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
991 if (!page) 991 if (!page)
992 return -EIO; 992 return -EIO;
993 BUG_ON(page->mapping != inode->i_mapping); 993 BUG_ON(page->mapping != inode->i_mapping);
994 e4b->bd_bitmap_page = page; 994 e4b->bd_bitmap_page = page;
995 e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize); 995 e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize);
996 996
997 if (blocks_per_page >= 2) { 997 if (blocks_per_page >= 2) {
998 /* buddy and bitmap are on the same page */ 998 /* buddy and bitmap are on the same page */
999 return 0; 999 return 0;
1000 } 1000 }
1001 1001
1002 block++; 1002 block++;
1003 pnum = block / blocks_per_page; 1003 pnum = block / blocks_per_page;
1004 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); 1004 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
1005 if (!page) 1005 if (!page)
1006 return -EIO; 1006 return -EIO;
1007 BUG_ON(page->mapping != inode->i_mapping); 1007 BUG_ON(page->mapping != inode->i_mapping);
1008 e4b->bd_buddy_page = page; 1008 e4b->bd_buddy_page = page;
1009 return 0; 1009 return 0;
1010 } 1010 }
1011 1011
1012 static void ext4_mb_put_buddy_page_lock(struct ext4_buddy *e4b) 1012 static void ext4_mb_put_buddy_page_lock(struct ext4_buddy *e4b)
1013 { 1013 {
1014 if (e4b->bd_bitmap_page) { 1014 if (e4b->bd_bitmap_page) {
1015 unlock_page(e4b->bd_bitmap_page); 1015 unlock_page(e4b->bd_bitmap_page);
1016 page_cache_release(e4b->bd_bitmap_page); 1016 page_cache_release(e4b->bd_bitmap_page);
1017 } 1017 }
1018 if (e4b->bd_buddy_page) { 1018 if (e4b->bd_buddy_page) {
1019 unlock_page(e4b->bd_buddy_page); 1019 unlock_page(e4b->bd_buddy_page);
1020 page_cache_release(e4b->bd_buddy_page); 1020 page_cache_release(e4b->bd_buddy_page);
1021 } 1021 }
1022 } 1022 }
1023 1023
1024 /* 1024 /*
1025 * Locking note: This routine calls ext4_mb_init_cache(), which takes the 1025 * Locking note: This routine calls ext4_mb_init_cache(), which takes the
1026 * block group lock of all groups for this page; do not hold the BG lock when 1026 * block group lock of all groups for this page; do not hold the BG lock when
1027 * calling this routine! 1027 * calling this routine!
1028 */ 1028 */
1029 static noinline_for_stack 1029 static noinline_for_stack
1030 int ext4_mb_init_group(struct super_block *sb, ext4_group_t group) 1030 int ext4_mb_init_group(struct super_block *sb, ext4_group_t group)
1031 { 1031 {
1032 1032
1033 struct ext4_group_info *this_grp; 1033 struct ext4_group_info *this_grp;
1034 struct ext4_buddy e4b; 1034 struct ext4_buddy e4b;
1035 struct page *page; 1035 struct page *page;
1036 int ret = 0; 1036 int ret = 0;
1037 1037
1038 might_sleep(); 1038 might_sleep();
1039 mb_debug(1, "init group %u\n", group); 1039 mb_debug(1, "init group %u\n", group);
1040 this_grp = ext4_get_group_info(sb, group); 1040 this_grp = ext4_get_group_info(sb, group);
1041 /* 1041 /*
1042 * This ensures that we don't reinit the buddy cache 1042 * This ensures that we don't reinit the buddy cache
1043 * page which map to the group from which we are already 1043 * page which map to the group from which we are already
1044 * allocating. If we are looking at the buddy cache we would 1044 * allocating. If we are looking at the buddy cache we would
1045 * have taken a reference using ext4_mb_load_buddy and that 1045 * have taken a reference using ext4_mb_load_buddy and that
1046 * would have pinned buddy page to page cache. 1046 * would have pinned buddy page to page cache.
1047 * The call to ext4_mb_get_buddy_page_lock will mark the
1048 * page accessed.
1047 */ 1049 */
1048 ret = ext4_mb_get_buddy_page_lock(sb, group, &e4b); 1050 ret = ext4_mb_get_buddy_page_lock(sb, group, &e4b);
1049 if (ret || !EXT4_MB_GRP_NEED_INIT(this_grp)) { 1051 if (ret || !EXT4_MB_GRP_NEED_INIT(this_grp)) {
1050 /* 1052 /*
1051 * somebody initialized the group 1053 * somebody initialized the group
1052 * return without doing anything 1054 * return without doing anything
1053 */ 1055 */
1054 goto err; 1056 goto err;
1055 } 1057 }
1056 1058
1057 page = e4b.bd_bitmap_page; 1059 page = e4b.bd_bitmap_page;
1058 ret = ext4_mb_init_cache(page, NULL); 1060 ret = ext4_mb_init_cache(page, NULL);
1059 if (ret) 1061 if (ret)
1060 goto err; 1062 goto err;
1061 if (!PageUptodate(page)) { 1063 if (!PageUptodate(page)) {
1062 ret = -EIO; 1064 ret = -EIO;
1063 goto err; 1065 goto err;
1064 } 1066 }
1065 mark_page_accessed(page);
1066 1067
1067 if (e4b.bd_buddy_page == NULL) { 1068 if (e4b.bd_buddy_page == NULL) {
1068 /* 1069 /*
1069 * If both the bitmap and buddy are in 1070 * If both the bitmap and buddy are in
1070 * the same page we don't need to force 1071 * the same page we don't need to force
1071 * init the buddy 1072 * init the buddy
1072 */ 1073 */
1073 ret = 0; 1074 ret = 0;
1074 goto err; 1075 goto err;
1075 } 1076 }
1076 /* init buddy cache */ 1077 /* init buddy cache */
1077 page = e4b.bd_buddy_page; 1078 page = e4b.bd_buddy_page;
1078 ret = ext4_mb_init_cache(page, e4b.bd_bitmap); 1079 ret = ext4_mb_init_cache(page, e4b.bd_bitmap);
1079 if (ret) 1080 if (ret)
1080 goto err; 1081 goto err;
1081 if (!PageUptodate(page)) { 1082 if (!PageUptodate(page)) {
1082 ret = -EIO; 1083 ret = -EIO;
1083 goto err; 1084 goto err;
1084 } 1085 }
1085 mark_page_accessed(page);
1086 err: 1086 err:
1087 ext4_mb_put_buddy_page_lock(&e4b); 1087 ext4_mb_put_buddy_page_lock(&e4b);
1088 return ret; 1088 return ret;
1089 } 1089 }
1090 1090
1091 /* 1091 /*
1092 * Locking note: This routine calls ext4_mb_init_cache(), which takes the 1092 * Locking note: This routine calls ext4_mb_init_cache(), which takes the
1093 * block group lock of all groups for this page; do not hold the BG lock when 1093 * block group lock of all groups for this page; do not hold the BG lock when
1094 * calling this routine! 1094 * calling this routine!
1095 */ 1095 */
1096 static noinline_for_stack int 1096 static noinline_for_stack int
1097 ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group, 1097 ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
1098 struct ext4_buddy *e4b) 1098 struct ext4_buddy *e4b)
1099 { 1099 {
1100 int blocks_per_page; 1100 int blocks_per_page;
1101 int block; 1101 int block;
1102 int pnum; 1102 int pnum;
1103 int poff; 1103 int poff;
1104 struct page *page; 1104 struct page *page;
1105 int ret; 1105 int ret;
1106 struct ext4_group_info *grp; 1106 struct ext4_group_info *grp;
1107 struct ext4_sb_info *sbi = EXT4_SB(sb); 1107 struct ext4_sb_info *sbi = EXT4_SB(sb);
1108 struct inode *inode = sbi->s_buddy_cache; 1108 struct inode *inode = sbi->s_buddy_cache;
1109 1109
1110 might_sleep(); 1110 might_sleep();
1111 mb_debug(1, "load group %u\n", group); 1111 mb_debug(1, "load group %u\n", group);
1112 1112
1113 blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; 1113 blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
1114 grp = ext4_get_group_info(sb, group); 1114 grp = ext4_get_group_info(sb, group);
1115 1115
1116 e4b->bd_blkbits = sb->s_blocksize_bits; 1116 e4b->bd_blkbits = sb->s_blocksize_bits;
1117 e4b->bd_info = grp; 1117 e4b->bd_info = grp;
1118 e4b->bd_sb = sb; 1118 e4b->bd_sb = sb;
1119 e4b->bd_group = group; 1119 e4b->bd_group = group;
1120 e4b->bd_buddy_page = NULL; 1120 e4b->bd_buddy_page = NULL;
1121 e4b->bd_bitmap_page = NULL; 1121 e4b->bd_bitmap_page = NULL;
1122 1122
1123 if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { 1123 if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
1124 /* 1124 /*
1125 * we need full data about the group 1125 * we need full data about the group
1126 * to make a good selection 1126 * to make a good selection
1127 */ 1127 */
1128 ret = ext4_mb_init_group(sb, group); 1128 ret = ext4_mb_init_group(sb, group);
1129 if (ret) 1129 if (ret)
1130 return ret; 1130 return ret;
1131 } 1131 }
1132 1132
1133 /* 1133 /*
1134 * the buddy cache inode stores the block bitmap 1134 * the buddy cache inode stores the block bitmap
1135 * and buddy information in consecutive blocks. 1135 * and buddy information in consecutive blocks.
1136 * So for each group we need two blocks. 1136 * So for each group we need two blocks.
1137 */ 1137 */
1138 block = group * 2; 1138 block = group * 2;
1139 pnum = block / blocks_per_page; 1139 pnum = block / blocks_per_page;
1140 poff = block % blocks_per_page; 1140 poff = block % blocks_per_page;
1141 1141
1142 /* we could use find_or_create_page(), but it locks page 1142 /* we could use find_or_create_page(), but it locks page
1143 * what we'd like to avoid in fast path ... */ 1143 * what we'd like to avoid in fast path ... */
1144 page = find_get_page(inode->i_mapping, pnum);
1144 page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
1145 if (page == NULL || !PageUptodate(page)) { 1145 if (page == NULL || !PageUptodate(page)) {
1146 if (page) 1146 if (page)
1147 /* 1147 /*
1148 * drop the page reference and try 1148 * drop the page reference and try
1149 * to get the page with lock. If we 1149 * to get the page with lock. If we
1150 * are not uptodate that implies 1150 * are not uptodate that implies
1151 * somebody just created the page but 1151 * somebody just created the page but
1152 * is yet to initialize the same. So 1152 * is yet to initialize the same. So
1153 * wait for it to initialize. 1153 * wait for it to initialize.
1154 */ 1154 */
1155 page_cache_release(page); 1155 page_cache_release(page);
1156 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); 1156 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
1157 if (page) { 1157 if (page) {
1158 BUG_ON(page->mapping != inode->i_mapping); 1158 BUG_ON(page->mapping != inode->i_mapping);
1159 if (!PageUptodate(page)) { 1159 if (!PageUptodate(page)) {
1160 ret = ext4_mb_init_cache(page, NULL); 1160 ret = ext4_mb_init_cache(page, NULL);
1161 if (ret) { 1161 if (ret) {
1162 unlock_page(page); 1162 unlock_page(page);
1163 goto err; 1163 goto err;
1164 } 1164 }
1165 mb_cmp_bitmaps(e4b, page_address(page) + 1165 mb_cmp_bitmaps(e4b, page_address(page) +
1166 (poff * sb->s_blocksize)); 1166 (poff * sb->s_blocksize));
1167 } 1167 }
1168 unlock_page(page); 1168 unlock_page(page);
1169 } 1169 }
1170 } 1170 }
1171 if (page == NULL || !PageUptodate(page)) { 1171 if (page == NULL || !PageUptodate(page)) {
1172 ret = -EIO; 1172 ret = -EIO;
1173 goto err; 1173 goto err;
1174 } 1174 }
1175
1176 /* Pages marked accessed already */
1175 e4b->bd_bitmap_page = page; 1177 e4b->bd_bitmap_page = page;
1176 e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize); 1178 e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize);
1177 mark_page_accessed(page);
1178 1179
1179 block++; 1180 block++;
1180 pnum = block / blocks_per_page; 1181 pnum = block / blocks_per_page;
1181 poff = block % blocks_per_page; 1182 poff = block % blocks_per_page;
1182 1183
1183 page = find_get_page(inode->i_mapping, pnum);
1184 page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
1184 if (page == NULL || !PageUptodate(page)) { 1185 if (page == NULL || !PageUptodate(page)) {
1185 if (page) 1186 if (page)
1186 page_cache_release(page); 1187 page_cache_release(page);
1187 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); 1188 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
1188 if (page) { 1189 if (page) {
1189 BUG_ON(page->mapping != inode->i_mapping); 1190 BUG_ON(page->mapping != inode->i_mapping);
1190 if (!PageUptodate(page)) { 1191 if (!PageUptodate(page)) {
1191 ret = ext4_mb_init_cache(page, e4b->bd_bitmap); 1192 ret = ext4_mb_init_cache(page, e4b->bd_bitmap);
1192 if (ret) { 1193 if (ret) {
1193 unlock_page(page); 1194 unlock_page(page);
1194 goto err; 1195 goto err;
1195 } 1196 }
1196 } 1197 }
1197 unlock_page(page); 1198 unlock_page(page);
1198 } 1199 }
1199 } 1200 }
1200 if (page == NULL || !PageUptodate(page)) { 1201 if (page == NULL || !PageUptodate(page)) {
1201 ret = -EIO; 1202 ret = -EIO;
1202 goto err; 1203 goto err;
1203 } 1204 }
1205
1206 /* Pages marked accessed already */
1204 e4b->bd_buddy_page = page; 1207 e4b->bd_buddy_page = page;
1205 e4b->bd_buddy = page_address(page) + (poff * sb->s_blocksize); 1208 e4b->bd_buddy = page_address(page) + (poff * sb->s_blocksize);
1206 mark_page_accessed(page);
1207 1209
1208 BUG_ON(e4b->bd_bitmap_page == NULL); 1210 BUG_ON(e4b->bd_bitmap_page == NULL);
1209 BUG_ON(e4b->bd_buddy_page == NULL); 1211 BUG_ON(e4b->bd_buddy_page == NULL);
1210 1212
1211 return 0; 1213 return 0;
1212 1214
1213 err: 1215 err:
1214 if (page) 1216 if (page)
1215 page_cache_release(page); 1217 page_cache_release(page);
1216 if (e4b->bd_bitmap_page) 1218 if (e4b->bd_bitmap_page)
1217 page_cache_release(e4b->bd_bitmap_page); 1219 page_cache_release(e4b->bd_bitmap_page);
1218 if (e4b->bd_buddy_page) 1220 if (e4b->bd_buddy_page)
1219 page_cache_release(e4b->bd_buddy_page); 1221 page_cache_release(e4b->bd_buddy_page);
1220 e4b->bd_buddy = NULL; 1222 e4b->bd_buddy = NULL;
1221 e4b->bd_bitmap = NULL; 1223 e4b->bd_bitmap = NULL;
1222 return ret; 1224 return ret;
1223 } 1225 }
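This function carries the mballoc side of the patch: the fast-path lookups now pass FGP_ACCESSED to find_get_page_flags() so the accessed state is handled inside the lookup, and the find_or_create_page() slow path is trusted to have marked a newly created page accessed already, which is what the two "Pages marked accessed already" comments and the dropped mark_page_accessed() calls reflect. As a rough userspace model of that calling-pattern change only (toy types and names, not the page-cache API):

#include <stdio.h>
#include <stdbool.h>

/* Toy model: a "page" with a referenced flag, and a lookup that can
 * optionally mark it accessed in the same step. */
struct toy_page {
	int index;
	bool referenced;
};

#define TOY_ACCESSED 0x01

static struct toy_page cache[4] = {
	{ 0, false }, { 1, false }, { 2, false }, { 3, false },
};

static struct toy_page *toy_find_get_page_flags(int index, int flags)
{
	if (index < 0 || index >= 4)
		return NULL;
	if (flags & TOY_ACCESSED)
		cache[index].referenced = true;	/* folded into the lookup */
	return &cache[index];
}

int main(void)
{
	/* old style: look the page up, then mark it accessed separately;
	 * new style: fold the accessed hint into the lookup itself */
	struct toy_page *p = toy_find_get_page_flags(1, TOY_ACCESSED);

	printf("page %d referenced: %s\n", p->index,
	       p->referenced ? "yes" : "no");
	return 0;
}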
1224 1226
1225 static void ext4_mb_unload_buddy(struct ext4_buddy *e4b) 1227 static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
1226 { 1228 {
1227 if (e4b->bd_bitmap_page) 1229 if (e4b->bd_bitmap_page)
1228 page_cache_release(e4b->bd_bitmap_page); 1230 page_cache_release(e4b->bd_bitmap_page);
1229 if (e4b->bd_buddy_page) 1231 if (e4b->bd_buddy_page)
1230 page_cache_release(e4b->bd_buddy_page); 1232 page_cache_release(e4b->bd_buddy_page);
1231 } 1233 }
1232 1234
1233 1235
1234 static int mb_find_order_for_block(struct ext4_buddy *e4b, int block) 1236 static int mb_find_order_for_block(struct ext4_buddy *e4b, int block)
1235 { 1237 {
1236 int order = 1; 1238 int order = 1;
1237 void *bb; 1239 void *bb;
1238 1240
1239 BUG_ON(e4b->bd_bitmap == e4b->bd_buddy); 1241 BUG_ON(e4b->bd_bitmap == e4b->bd_buddy);
1240 BUG_ON(block >= (1 << (e4b->bd_blkbits + 3))); 1242 BUG_ON(block >= (1 << (e4b->bd_blkbits + 3)));
1241 1243
1242 bb = e4b->bd_buddy; 1244 bb = e4b->bd_buddy;
1243 while (order <= e4b->bd_blkbits + 1) { 1245 while (order <= e4b->bd_blkbits + 1) {
1244 block = block >> 1; 1246 block = block >> 1;
1245 if (!mb_test_bit(block, bb)) { 1247 if (!mb_test_bit(block, bb)) {
1246 /* this block is part of buddy of order 'order' */ 1248 /* this block is part of buddy of order 'order' */
1247 return order; 1249 return order;
1248 } 1250 }
1249 bb += 1 << (e4b->bd_blkbits - order); 1251 bb += 1 << (e4b->bd_blkbits - order);
1250 order++; 1252 order++;
1251 } 1253 }
1252 return 0; 1254 return 0;
1253 } 1255 }
1254 1256
1255 static void mb_clear_bits(void *bm, int cur, int len) 1257 static void mb_clear_bits(void *bm, int cur, int len)
1256 { 1258 {
1257 __u32 *addr; 1259 __u32 *addr;
1258 1260
1259 len = cur + len; 1261 len = cur + len;
1260 while (cur < len) { 1262 while (cur < len) {
1261 if ((cur & 31) == 0 && (len - cur) >= 32) { 1263 if ((cur & 31) == 0 && (len - cur) >= 32) {
1262 /* fast path: clear whole word at once */ 1264 /* fast path: clear whole word at once */
1263 addr = bm + (cur >> 3); 1265 addr = bm + (cur >> 3);
1264 *addr = 0; 1266 *addr = 0;
1265 cur += 32; 1267 cur += 32;
1266 continue; 1268 continue;
1267 } 1269 }
1268 mb_clear_bit(cur, bm); 1270 mb_clear_bit(cur, bm);
1269 cur++; 1271 cur++;
1270 } 1272 }
1271 } 1273 }
1272 1274
1273 /* clear bits in given range 1275 /* clear bits in given range
1274 * will return first found zero bit if any, -1 otherwise 1276 * will return first found zero bit if any, -1 otherwise
1275 */ 1277 */
1276 static int mb_test_and_clear_bits(void *bm, int cur, int len) 1278 static int mb_test_and_clear_bits(void *bm, int cur, int len)
1277 { 1279 {
1278 __u32 *addr; 1280 __u32 *addr;
1279 int zero_bit = -1; 1281 int zero_bit = -1;
1280 1282
1281 len = cur + len; 1283 len = cur + len;
1282 while (cur < len) { 1284 while (cur < len) {
1283 if ((cur & 31) == 0 && (len - cur) >= 32) { 1285 if ((cur & 31) == 0 && (len - cur) >= 32) {
1284 /* fast path: clear whole word at once */ 1286 /* fast path: clear whole word at once */
1285 addr = bm + (cur >> 3); 1287 addr = bm + (cur >> 3);
1286 if (*addr != (__u32)(-1) && zero_bit == -1) 1288 if (*addr != (__u32)(-1) && zero_bit == -1)
1287 zero_bit = cur + mb_find_next_zero_bit(addr, 32, 0); 1289 zero_bit = cur + mb_find_next_zero_bit(addr, 32, 0);
1288 *addr = 0; 1290 *addr = 0;
1289 cur += 32; 1291 cur += 32;
1290 continue; 1292 continue;
1291 } 1293 }
1292 if (!mb_test_and_clear_bit(cur, bm) && zero_bit == -1) 1294 if (!mb_test_and_clear_bit(cur, bm) && zero_bit == -1)
1293 zero_bit = cur; 1295 zero_bit = cur;
1294 cur++; 1296 cur++;
1295 } 1297 }
1296 1298
1297 return zero_bit; 1299 return zero_bit;
1298 } 1300 }
1299 1301
1300 void ext4_set_bits(void *bm, int cur, int len) 1302 void ext4_set_bits(void *bm, int cur, int len)
1301 { 1303 {
1302 __u32 *addr; 1304 __u32 *addr;
1303 1305
1304 len = cur + len; 1306 len = cur + len;
1305 while (cur < len) { 1307 while (cur < len) {
1306 if ((cur & 31) == 0 && (len - cur) >= 32) { 1308 if ((cur & 31) == 0 && (len - cur) >= 32) {
1307 /* fast path: set whole word at once */ 1309 /* fast path: set whole word at once */
1308 addr = bm + (cur >> 3); 1310 addr = bm + (cur >> 3);
1309 *addr = 0xffffffff; 1311 *addr = 0xffffffff;
1310 cur += 32; 1312 cur += 32;
1311 continue; 1313 continue;
1312 } 1314 }
1313 mb_set_bit(cur, bm); 1315 mb_set_bit(cur, bm);
1314 cur++; 1316 cur++;
1315 } 1317 }
1316 } 1318 }
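mb_clear_bits(), mb_test_and_clear_bits() and ext4_set_bits() share one shape: when the cursor is 32-bit aligned and at least 32 bits remain, touch the whole word with a single plain store, otherwise fall back to per-bit operations for the ragged edges (the test-and-clear variant also remembers the first bit that was already clear). A standalone version of the set variant over a uint32_t array (ordinary, non-atomic stores, like the originals):

#include <stdio.h>
#include <stdint.h>

static void set_bit32(uint32_t *bm, int nr)
{
	bm[nr >> 5] |= 1u << (nr & 31);
}

static void set_bits(uint32_t *bm, int cur, int len)
{
	len = cur + len;
	while (cur < len) {
		if ((cur & 31) == 0 && (len - cur) >= 32) {
			/* fast path: set a whole word at once */
			bm[cur >> 5] = 0xffffffff;
			cur += 32;
			continue;
		}
		set_bit32(bm, cur);	/* slow path: one bit at a time */
		cur++;
	}
}

int main(void)
{
	uint32_t bm[4] = { 0, 0, 0, 0 };
	int i;

	set_bits(bm, 20, 60);	/* bits 20..79: tail of word 0, all of
				 * word 1, head of word 2 */
	for (i = 0; i < 4; i++)
		printf("word %d: %08x\n", i, bm[i]);
	return 0;
}

Here set_bits(bm, 20, 60) leaves word 0 as fff00000, word 1 as ffffffff and word 2 as 0000ffff: one partial word, one fast-path word, one partial word.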
1317 1319
1318 /* 1320 /*
1319 * _________________________________________________________________ */ 1321 * _________________________________________________________________ */
1320 1322
1321 static inline int mb_buddy_adjust_border(int* bit, void* bitmap, int side) 1323 static inline int mb_buddy_adjust_border(int* bit, void* bitmap, int side)
1322 { 1324 {
1323 if (mb_test_bit(*bit + side, bitmap)) { 1325 if (mb_test_bit(*bit + side, bitmap)) {
1324 mb_clear_bit(*bit, bitmap); 1326 mb_clear_bit(*bit, bitmap);
1325 (*bit) -= side; 1327 (*bit) -= side;
1326 return 1; 1328 return 1;
1327 } 1329 }
1328 else { 1330 else {
1329 (*bit) += side; 1331 (*bit) += side;
1330 mb_set_bit(*bit, bitmap); 1332 mb_set_bit(*bit, bitmap);
1331 return -1; 1333 return -1;
1332 } 1334 }
1333 } 1335 }

static void mb_buddy_mark_free(struct ext4_buddy *e4b, int first, int last)
{
    int max;
    int order = 1;
    void *buddy = mb_find_buddy(e4b, order, &max);

    while (buddy) {
        void *buddy2;

        /* Bits in range [first; last] are known to be set since
         * corresponding blocks were allocated. Bits in range
         * (first; last) will stay set because they form buddies on
         * upper layer. We just deal with borders if they don't
         * align with upper layer and then go up.
         * Releasing entire group is all about clearing
         * single bit of highest order buddy.
         */

        /* Example:
         * ---------------------------------
         * |   1   |   1   |   1   |   1   |
         * ---------------------------------
         * | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
         * ---------------------------------
         *   0   1   2   3   4   5   6   7
         *      \_____________________/
         *
         * Neither [1] nor [6] is aligned to above layer.
         * Left neighbour [0] is free, so mark it busy,
         * decrease bb_counters and extend range to
         * [0; 6]
         * Right neighbour [7] is busy. It can't be coalesced with [6], so
         * mark [6] free, increase bb_counters and shrink range to
         * [0; 5].
         * Then shift range to [0; 2], go up and do the same.
         */


        if (first & 1)
            e4b->bd_info->bb_counters[order] += mb_buddy_adjust_border(&first, buddy, -1);
        if (!(last & 1))
            e4b->bd_info->bb_counters[order] += mb_buddy_adjust_border(&last, buddy, 1);
        if (first > last)
            break;
        order++;

        if (first == last || !(buddy2 = mb_find_buddy(e4b, order, &max))) {
            mb_clear_bits(buddy, first, last - first + 1);
            e4b->bd_info->bb_counters[order - 1] += last - first + 1;
            break;
        }
        first >>= 1;
        last >>= 1;
        buddy = buddy2;
    }
}

static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
               int first, int count)
{
    int left_is_free = 0;
    int right_is_free = 0;
    int block;
    int last = first + count - 1;
    struct super_block *sb = e4b->bd_sb;

    if (WARN_ON(count == 0))
        return;
    BUG_ON(last >= (sb->s_blocksize << 3));
    assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group));
    /* Don't bother if the block group is corrupt. */
    if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)))
        return;

    mb_check_buddy(e4b);
    mb_free_blocks_double(inode, e4b, first, count);

    e4b->bd_info->bb_free += count;
    if (first < e4b->bd_info->bb_first_free)
        e4b->bd_info->bb_first_free = first;

    /* access memory sequentially: check left neighbour,
     * clear range and then check right neighbour
     */
    if (first != 0)
        left_is_free = !mb_test_bit(first - 1, e4b->bd_bitmap);
    block = mb_test_and_clear_bits(e4b->bd_bitmap, first, count);
    if (last + 1 < EXT4_SB(sb)->s_mb_maxs[0])
        right_is_free = !mb_test_bit(last + 1, e4b->bd_bitmap);

    if (unlikely(block != -1)) {
        ext4_fsblk_t blocknr;

        blocknr = ext4_group_first_block_no(sb, e4b->bd_group);
        blocknr += EXT4_C2B(EXT4_SB(sb), block);
        ext4_grp_locked_error(sb, e4b->bd_group,
                      inode ? inode->i_ino : 0,
                      blocknr,
                      "freeing already freed block "
                      "(bit %u); block bitmap corrupt.",
                      block);
        /* Mark the block group as corrupt. */
        set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT,
            &e4b->bd_info->bb_state);
        mb_regenerate_buddy(e4b);
        goto done;
    }

    /* let's maintain fragments counter */
    if (left_is_free && right_is_free)
        e4b->bd_info->bb_fragments--;
    else if (!left_is_free && !right_is_free)
        e4b->bd_info->bb_fragments++;

    /* buddy[0] == bd_bitmap is a special case, so handle
     * it right away and let mb_buddy_mark_free stay free of
     * zero order checks.
     * Check if neighbours are to be coalesced,
     * adjust bitmap bb_counters and borders appropriately.
     */
    if (first & 1) {
        first += !left_is_free;
        e4b->bd_info->bb_counters[0] += left_is_free ? -1 : 1;
    }
    if (!(last & 1)) {
        last -= !right_is_free;
        e4b->bd_info->bb_counters[0] += right_is_free ? -1 : 1;
    }

    if (first <= last)
        mb_buddy_mark_free(e4b, first >> 1, last >> 1);

done:
    mb_set_largest_free_order(sb, e4b->bd_info);
    mb_check_buddy(e4b);
}

static int mb_find_extent(struct ext4_buddy *e4b, int block,
              int needed, struct ext4_free_extent *ex)
{
    int next = block;
    int max, order;
    void *buddy;

    assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
    BUG_ON(ex == NULL);

    buddy = mb_find_buddy(e4b, 0, &max);
    BUG_ON(buddy == NULL);
    BUG_ON(block >= max);
    if (mb_test_bit(block, buddy)) {
        ex->fe_len = 0;
        ex->fe_start = 0;
        ex->fe_group = 0;
        return 0;
    }

    /* find actual order */
    order = mb_find_order_for_block(e4b, block);
    block = block >> order;

    ex->fe_len = 1 << order;
    ex->fe_start = block << order;
    ex->fe_group = e4b->bd_group;

    /* calc difference from given start */
    next = next - ex->fe_start;
    ex->fe_len -= next;
    ex->fe_start += next;

    while (needed > ex->fe_len &&
           mb_find_buddy(e4b, order, &max)) {

        if (block + 1 >= max)
            break;

        next = (block + 1) * (1 << order);
        if (mb_test_bit(next, e4b->bd_bitmap))
            break;

        order = mb_find_order_for_block(e4b, next);

        block = next >> order;
        ex->fe_len += 1 << order;
    }

    BUG_ON(ex->fe_start + ex->fe_len > (1 << (e4b->bd_blkbits + 3)));
    return ex->fe_len;
}

static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
{
    int ord;
    int mlen = 0;
    int max = 0;
    int cur;
    int start = ex->fe_start;
    int len = ex->fe_len;
    unsigned ret = 0;
    int len0 = len;
    void *buddy;

    BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));
    BUG_ON(e4b->bd_group != ex->fe_group);
    assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
    mb_check_buddy(e4b);
    mb_mark_used_double(e4b, start, len);

    e4b->bd_info->bb_free -= len;
    if (e4b->bd_info->bb_first_free == start)
        e4b->bd_info->bb_first_free += len;

    /* let's maintain fragments counter */
    if (start != 0)
        mlen = !mb_test_bit(start - 1, e4b->bd_bitmap);
    if (start + len < EXT4_SB(e4b->bd_sb)->s_mb_maxs[0])
        max = !mb_test_bit(start + len, e4b->bd_bitmap);
    if (mlen && max)
        e4b->bd_info->bb_fragments++;
    else if (!mlen && !max)
        e4b->bd_info->bb_fragments--;

    /* let's maintain buddy itself */
    while (len) {
        ord = mb_find_order_for_block(e4b, start);

        if (((start >> ord) << ord) == start && len >= (1 << ord)) {
            /* the whole chunk may be allocated at once! */
            mlen = 1 << ord;
            buddy = mb_find_buddy(e4b, ord, &max);
            BUG_ON((start >> ord) >= max);
            mb_set_bit(start >> ord, buddy);
            e4b->bd_info->bb_counters[ord]--;
            start += mlen;
            len -= mlen;
            BUG_ON(len < 0);
            continue;
        }

        /* store for history */
        if (ret == 0)
            ret = len | (ord << 16);

        /* we have to split large buddy */
        BUG_ON(ord <= 0);
        buddy = mb_find_buddy(e4b, ord, &max);
        mb_set_bit(start >> ord, buddy);
        e4b->bd_info->bb_counters[ord]--;

        ord--;
        cur = (start >> ord) & ~1U;
        buddy = mb_find_buddy(e4b, ord, &max);
        mb_clear_bit(cur, buddy);
        mb_clear_bit(cur + 1, buddy);
        e4b->bd_info->bb_counters[ord]++;
        e4b->bd_info->bb_counters[ord]++;
    }
    mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);

    ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
    mb_check_buddy(e4b);

    return ret;
}

/*
 * Must be called under group lock!
 */
static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
                    struct ext4_buddy *e4b)
{
    struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
    int ret;

    BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
    BUG_ON(ac->ac_status == AC_STATUS_FOUND);

    ac->ac_b_ex.fe_len = min(ac->ac_b_ex.fe_len, ac->ac_g_ex.fe_len);
    ac->ac_b_ex.fe_logical = ac->ac_g_ex.fe_logical;
    ret = mb_mark_used(e4b, &ac->ac_b_ex);

    /* preallocation can change ac_b_ex, thus we store actually
     * allocated blocks for history */
    ac->ac_f_ex = ac->ac_b_ex;

    ac->ac_status = AC_STATUS_FOUND;
    ac->ac_tail = ret & 0xffff;
    ac->ac_buddy = ret >> 16;

    /*
     * take the page reference. We want the page to be pinned
     * so that we don't get an ext4_mb_init_cache() call for this
     * group until we update the bitmap. That would mean we
     * double allocate blocks. The reference is dropped
     * in ext4_mb_release_context
     */
    ac->ac_bitmap_page = e4b->bd_bitmap_page;
    get_page(ac->ac_bitmap_page);
    ac->ac_buddy_page = e4b->bd_buddy_page;
    get_page(ac->ac_buddy_page);
    /* store last allocated for subsequent stream allocation */
    if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
        spin_lock(&sbi->s_md_lock);
        sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
        sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
        spin_unlock(&sbi->s_md_lock);
    }
}

/*
 * regular allocator, for general purposes allocation
 */

static void ext4_mb_check_limits(struct ext4_allocation_context *ac,
                    struct ext4_buddy *e4b,
                    int finish_group)
{
    struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
    struct ext4_free_extent *bex = &ac->ac_b_ex;
    struct ext4_free_extent *gex = &ac->ac_g_ex;
    struct ext4_free_extent ex;
    int max;

    if (ac->ac_status == AC_STATUS_FOUND)
        return;
    /*
     * We don't want to scan for a whole year
     */
    if (ac->ac_found > sbi->s_mb_max_to_scan &&
        !(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
        ac->ac_status = AC_STATUS_BREAK;
        return;
    }

    /*
     * Haven't found good chunk so far, let's continue
     */
    if (bex->fe_len < gex->fe_len)
        return;

    if ((finish_group || ac->ac_found > sbi->s_mb_min_to_scan)
        && bex->fe_group == e4b->bd_group) {
        /* recheck chunk's availability - we don't know
         * when it was found (within this lock-unlock
         * period or not) */
        max = mb_find_extent(e4b, bex->fe_start, gex->fe_len, &ex);
        if (max >= gex->fe_len) {
            ext4_mb_use_best_found(ac, e4b);
            return;
        }
    }
}

/*
 * The routine checks whether the found extent is good enough. If it is,
 * then the extent gets marked used and flag is set to the context
 * to stop scanning. Otherwise, the extent is compared with the
 * previously found extent and if the new one is better, then it's stored
 * in the context. Later, the best found extent will be used, if
 * mballoc can't find good enough extent.
 *
 * FIXME: real allocation policy is to be designed yet!
 */
static void ext4_mb_measure_extent(struct ext4_allocation_context *ac,
                    struct ext4_free_extent *ex,
                    struct ext4_buddy *e4b)
{
    struct ext4_free_extent *bex = &ac->ac_b_ex;
    struct ext4_free_extent *gex = &ac->ac_g_ex;

    BUG_ON(ex->fe_len <= 0);
    BUG_ON(ex->fe_len > EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
    BUG_ON(ex->fe_start >= EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
    BUG_ON(ac->ac_status != AC_STATUS_CONTINUE);

    ac->ac_found++;

    /*
     * The special case - take what you catch first
     */
    if (unlikely(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
        *bex = *ex;
        ext4_mb_use_best_found(ac, e4b);
        return;
    }

    /*
     * Let's check whether the chunk is good enough
     */
    if (ex->fe_len == gex->fe_len) {
        *bex = *ex;
        ext4_mb_use_best_found(ac, e4b);
        return;
    }

    /*
     * If this is first found extent, just store it in the context
     */
    if (bex->fe_len == 0) {
        *bex = *ex;
        return;
    }

    /*
     * If new found extent is better, store it in the context
     */
    if (bex->fe_len < gex->fe_len) {
        /* if the request isn't satisfied, any found extent
         * larger than previous best one is better */
        if (ex->fe_len > bex->fe_len)
            *bex = *ex;
    } else if (ex->fe_len > gex->fe_len) {
        /* if the request is satisfied, then we try to find
         * an extent that still satisfies the request, but is
         * smaller than previous one */
        if (ex->fe_len < bex->fe_len)
            *bex = *ex;
    }

    ext4_mb_check_limits(ac, e4b, 0);
}

static noinline_for_stack
int ext4_mb_try_best_found(struct ext4_allocation_context *ac,
                    struct ext4_buddy *e4b)
{
    struct ext4_free_extent ex = ac->ac_b_ex;
    ext4_group_t group = ex.fe_group;
    int max;
    int err;

    BUG_ON(ex.fe_len <= 0);
    err = ext4_mb_load_buddy(ac->ac_sb, group, e4b);
    if (err)
        return err;

    ext4_lock_group(ac->ac_sb, group);
    max = mb_find_extent(e4b, ex.fe_start, ex.fe_len, &ex);

    if (max > 0) {
        ac->ac_b_ex = ex;
        ext4_mb_use_best_found(ac, e4b);
    }

    ext4_unlock_group(ac->ac_sb, group);
    ext4_mb_unload_buddy(e4b);

    return 0;
}

static noinline_for_stack
int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
                struct ext4_buddy *e4b)
{
    ext4_group_t group = ac->ac_g_ex.fe_group;
    int max;
    int err;
    struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
    struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
    struct ext4_free_extent ex;

    if (!(ac->ac_flags & EXT4_MB_HINT_TRY_GOAL))
        return 0;
    if (grp->bb_free == 0)
        return 0;

    err = ext4_mb_load_buddy(ac->ac_sb, group, e4b);
    if (err)
        return err;

    if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) {
        ext4_mb_unload_buddy(e4b);
        return 0;
    }

    ext4_lock_group(ac->ac_sb, group);
    max = mb_find_extent(e4b, ac->ac_g_ex.fe_start,
                 ac->ac_g_ex.fe_len, &ex);

    if (max >= ac->ac_g_ex.fe_len && ac->ac_g_ex.fe_len == sbi->s_stripe) {
        ext4_fsblk_t start;

        start = ext4_group_first_block_no(ac->ac_sb, e4b->bd_group) +
            ex.fe_start;
        /* use do_div to get remainder (would be 64-bit modulo) */
        if (do_div(start, sbi->s_stripe) == 0) {
            ac->ac_found++;
            ac->ac_b_ex = ex;
            ext4_mb_use_best_found(ac, e4b);
        }
    } else if (max >= ac->ac_g_ex.fe_len) {
        BUG_ON(ex.fe_len <= 0);
        BUG_ON(ex.fe_group != ac->ac_g_ex.fe_group);
        BUG_ON(ex.fe_start != ac->ac_g_ex.fe_start);
        ac->ac_found++;
        ac->ac_b_ex = ex;
        ext4_mb_use_best_found(ac, e4b);
    } else if (max > 0 && (ac->ac_flags & EXT4_MB_HINT_MERGE)) {
        /* Sometimes, caller may want to merge even small
         * number of blocks to an existing extent */
        BUG_ON(ex.fe_len <= 0);
        BUG_ON(ex.fe_group != ac->ac_g_ex.fe_group);
        BUG_ON(ex.fe_start != ac->ac_g_ex.fe_start);
        ac->ac_found++;
        ac->ac_b_ex = ex;
        ext4_mb_use_best_found(ac, e4b);
    }
    ext4_unlock_group(ac->ac_sb, group);
    ext4_mb_unload_buddy(e4b);

    return 0;
}

/*
 * The routine scans buddy structures (not bitmap!) from given order
 * to max order and tries to find big enough chunk to satisfy the req
 */
static noinline_for_stack
void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac,
                    struct ext4_buddy *e4b)
{
    struct super_block *sb = ac->ac_sb;
    struct ext4_group_info *grp = e4b->bd_info;
    void *buddy;
    int i;
    int k;
    int max;

    BUG_ON(ac->ac_2order <= 0);
    for (i = ac->ac_2order; i <= sb->s_blocksize_bits + 1; i++) {
        if (grp->bb_counters[i] == 0)
            continue;

        buddy = mb_find_buddy(e4b, i, &max);
        BUG_ON(buddy == NULL);

        k = mb_find_next_zero_bit(buddy, max, 0);
        BUG_ON(k >= max);

        ac->ac_found++;

        ac->ac_b_ex.fe_len = 1 << i;
        ac->ac_b_ex.fe_start = k << i;
        ac->ac_b_ex.fe_group = e4b->bd_group;

        ext4_mb_use_best_found(ac, e4b);

        BUG_ON(ac->ac_b_ex.fe_len != ac->ac_g_ex.fe_len);

        if (EXT4_SB(sb)->s_mb_stats)
            atomic_inc(&EXT4_SB(sb)->s_bal_2orders);

        break;
    }
}

/*
 * The routine scans the group and measures all found extents.
 * In order to optimize scanning, caller must pass number of
 * free blocks in the group, so the routine can know upper limit.
 */
static noinline_for_stack
void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
                    struct ext4_buddy *e4b)
{
    struct super_block *sb = ac->ac_sb;
    void *bitmap = e4b->bd_bitmap;
    struct ext4_free_extent ex;
    int i;
    int free;

    free = e4b->bd_info->bb_free;
    BUG_ON(free <= 0);

    i = e4b->bd_info->bb_first_free;

    while (free && ac->ac_status == AC_STATUS_CONTINUE) {
        i = mb_find_next_zero_bit(bitmap,
                        EXT4_CLUSTERS_PER_GROUP(sb), i);
        if (i >= EXT4_CLUSTERS_PER_GROUP(sb)) {
            /*
             * If we have a corrupt bitmap, we won't find any
             * free blocks even though group info says we
             * have free blocks
             */
            ext4_grp_locked_error(sb, e4b->bd_group, 0, 0,
                        "%d free clusters as per "
                        "group info. But bitmap says 0",
                        free);
            break;
        }

        mb_find_extent(e4b, i, ac->ac_g_ex.fe_len, &ex);
        BUG_ON(ex.fe_len <= 0);
        if (free < ex.fe_len) {
            ext4_grp_locked_error(sb, e4b->bd_group, 0, 0,
                        "%d free clusters as per "
                        "group info. But got %d blocks",
                        free, ex.fe_len);
            /*
             * The number of free blocks differs. This mostly
             * indicates that the bitmap is corrupt. So exit
             * without claiming the space.
             */
            break;
        }

        ext4_mb_measure_extent(ac, &ex, e4b);

        i += ex.fe_len;
        free -= ex.fe_len;
    }

    ext4_mb_check_limits(ac, e4b, 1);
}

/*
 * This is a special case for storages like raid5
 * we try to find stripe-aligned chunks for stripe-size-multiple requests
 */
static noinline_for_stack
void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
                    struct ext4_buddy *e4b)
{
    struct super_block *sb = ac->ac_sb;
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    void *bitmap = e4b->bd_bitmap;
    struct ext4_free_extent ex;
    ext4_fsblk_t first_group_block;
    ext4_fsblk_t a;
    ext4_grpblk_t i;
    int max;

    BUG_ON(sbi->s_stripe == 0);

    /* find first stripe-aligned block in group */
    first_group_block = ext4_group_first_block_no(sb, e4b->bd_group);

    a = first_group_block + sbi->s_stripe - 1;
    do_div(a, sbi->s_stripe);
    i = (a * sbi->s_stripe) - first_group_block;

    while (i < EXT4_CLUSTERS_PER_GROUP(sb)) {
        if (!mb_test_bit(i, bitmap)) {
            max = mb_find_extent(e4b, i, sbi->s_stripe, &ex);
            if (max >= sbi->s_stripe) {
                ac->ac_found++;
                ac->ac_b_ex = ex;
                ext4_mb_use_best_found(ac, e4b);
                break;
            }
        }
        i += sbi->s_stripe;
    }
}
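
For illustration only, the round-up arithmetic ext4_mb_scan_aligned() uses to locate the first stripe-aligned block in a group, shown as a standalone sketch with plain 64-bit division in place of the kernel's do_div(); the stripe size and group start used here are made-up values.

/* Illustrative sketch (not kernel code): round the group's first block up to
 * the next stripe boundary and express it as an offset within the group. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t first_group_block = 32768 + 5;  /* assumed group start (not aligned) */
    uint64_t stripe = 16;                    /* assumed stripe size in clusters */

    uint64_t a = (first_group_block + stripe - 1) / stripe; /* round up */
    uint64_t i = a * stripe - first_group_block;            /* offset in group */

    printf("first stripe-aligned cluster is at group offset %llu\n",
           (unsigned long long)i);
    /* The scan then probes offsets i, i + stripe, i + 2*stripe, ... */
    return 0;
}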

/* This is now called BEFORE we load the buddy bitmap. */
static int ext4_mb_good_group(struct ext4_allocation_context *ac,
                ext4_group_t group, int cr)
{
    unsigned free, fragments;
    int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb));
    struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);

    BUG_ON(cr < 0 || cr >= 4);

    free = grp->bb_free;
    if (free == 0)
        return 0;
    if (cr <= 2 && free < ac->ac_g_ex.fe_len)
        return 0;

    if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
        return 0;

    /* We only do this if the grp has never been initialized */
    if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
        int ret = ext4_mb_init_group(ac->ac_sb, group);
        if (ret)
            return 0;
    }

    fragments = grp->bb_fragments;
    if (fragments == 0)
        return 0;

    switch (cr) {
    case 0:
        BUG_ON(ac->ac_2order == 0);

        /* Avoid using the first bg of a flexgroup for data files */
        if ((ac->ac_flags & EXT4_MB_HINT_DATA) &&
            (flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) &&
            ((group % flex_size) == 0))
            return 0;

        if ((ac->ac_2order > ac->ac_sb->s_blocksize_bits+1) ||
            (free / fragments) >= ac->ac_g_ex.fe_len)
            return 1;

        if (grp->bb_largest_free_order < ac->ac_2order)
            return 0;

        return 1;
    case 1:
        if ((free / fragments) >= ac->ac_g_ex.fe_len)
            return 1;
        break;
    case 2:
        if (free >= ac->ac_g_ex.fe_len)
            return 1;
        break;
    case 3:
        return 1;
    default:
        BUG();
    }

    return 0;
}
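
For illustration only, a simplified standalone mirror of the cr-level checks above: it ignores the flex_bg and lazy-init details and just shows what each pass demands of a group before it is worth scanning. The demo_good_group() name and the sample numbers are assumptions, not the kernel function.

/* Illustrative sketch (not kernel code): cr == 0 wants a big enough buddy
 * order or a large average fragment, cr == 1 wants the average free extent
 * to cover the request, cr == 2 only needs enough total free space, and
 * cr == 3 takes any group with free space. */
#include <stdio.h>

static int demo_good_group(int cr, unsigned free, unsigned fragments,
                           unsigned goal_len, int largest_free_order,
                           int two_order)
{
    if (free == 0 || fragments == 0)
        return 0;
    if (cr <= 2 && free < goal_len)
        return 0;

    switch (cr) {
    case 0: /* need a free buddy chunk of order two_order (or large fragments) */
        return largest_free_order >= two_order ||
               free / fragments >= goal_len;
    case 1: /* average free-extent size must cover the request */
        return free / fragments >= goal_len;
    case 2: /* total free space must cover the request */
        return free >= goal_len;
    default: /* cr == 3 */
        return 1;
    }
}

int main(void)
{
    /* assumed numbers: 512 free clusters in 8 fragments, goal of 128 clusters */
    for (int cr = 0; cr < 4; cr++)
        printf("cr=%d -> %s\n", cr,
               demo_good_group(cr, 512, 8, 128, 6, 7) ? "scan" : "skip");
    return 0;
}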

static noinline_for_stack int
ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
{
    ext4_group_t ngroups, group, i;
    int cr;
    int err = 0;
    struct ext4_sb_info *sbi;
    struct super_block *sb;
    struct ext4_buddy e4b;

    sb = ac->ac_sb;
    sbi = EXT4_SB(sb);
    ngroups = ext4_get_groups_count(sb);
    /* non-extent files are limited to low blocks/groups */
    if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)))
        ngroups = sbi->s_blockfile_groups;

    BUG_ON(ac->ac_status == AC_STATUS_FOUND);

    /* first, try the goal */
    err = ext4_mb_find_by_goal(ac, &e4b);
    if (err || ac->ac_status == AC_STATUS_FOUND)
        goto out;

    if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
        goto out;

    /*
     * ac->ac_2order is set only if the fe_len is a power of 2;
     * if ac_2order is set we also set criteria to 0 so that we
     * try exact allocation using buddy.
     */
    i = fls(ac->ac_g_ex.fe_len);
    ac->ac_2order = 0;
    /*
     * We search using buddy data only if the order of the request
     * is greater than or equal to sbi->s_mb_order2_reqs.
     * You can tune it via /sys/fs/ext4/<partition>/mb_order2_req
     */
    if (i >= sbi->s_mb_order2_reqs) {
        /*
         * This should tell if fe_len is exactly a power of 2
         */
        if ((ac->ac_g_ex.fe_len & (~(1 << (i - 1)))) == 0)
            ac->ac_2order = i - 1;
    }

    /* if stream allocation is enabled, use global goal */
    if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
        /* TBD: may be hot point */
        spin_lock(&sbi->s_md_lock);
        ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
        ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
        spin_unlock(&sbi->s_md_lock);
    }

    /* Let's just scan groups to find more or less suitable blocks */
    cr = ac->ac_2order ? 0 : 1;
    /*
     * cr == 0 try to get exact allocation,
     * cr == 3 try to get anything
     */
repeat:
    for (; cr < 4 && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
        ac->ac_criteria = cr;
        /*
         * searching for the right group start
         * from the goal value specified
         */
        group = ac->ac_g_ex.fe_group;

        for (i = 0; i < ngroups; group++, i++) {
            cond_resched();
            /*
             * Artificially restricted ngroups for non-extent
             * files makes group > ngroups possible on first loop.
             */
            if (group >= ngroups)
                group = 0;

            /* This now checks without needing the buddy page */
            if (!ext4_mb_good_group(ac, group, cr))
                continue;

            err = ext4_mb_load_buddy(sb, group, &e4b);
            if (err)
                goto out;

            ext4_lock_group(sb, group);

            /*
             * We need to check again after locking the
             * block group
             */
            if (!ext4_mb_good_group(ac, group, cr)) {
                ext4_unlock_group(sb, group);
                ext4_mb_unload_buddy(&e4b);
                continue;
            }

            ac->ac_groups_scanned++;
            if (cr == 0 && ac->ac_2order < sb->s_blocksize_bits+2)
                ext4_mb_simple_scan_group(ac, &e4b);
            else if (cr == 1 && sbi->s_stripe &&
                    !(ac->ac_g_ex.fe_len % sbi->s_stripe))
                ext4_mb_scan_aligned(ac, &e4b);
            else
                ext4_mb_complex_scan_group(ac, &e4b);

            ext4_unlock_group(sb, group);
            ext4_mb_unload_buddy(&e4b);

            if (ac->ac_status != AC_STATUS_CONTINUE)
                break;
        }
    }

    if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND &&
        !(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
        /*
         * We've been searching too long. Let's try to allocate
         * the best chunk we've found so far
         */

        ext4_mb_try_best_found(ac, &e4b);
        if (ac->ac_status != AC_STATUS_FOUND) {
            /*
             * Someone more lucky has already allocated it.
             * The only thing we can do is just take first
             * found block(s)
            printk(KERN_DEBUG "EXT4-fs: someone won our chunk\n");
             */
            ac->ac_b_ex.fe_group = 0;
            ac->ac_b_ex.fe_start = 0;
            ac->ac_b_ex.fe_len = 0;
            ac->ac_status = AC_STATUS_CONTINUE;
            ac->ac_flags |= EXT4_MB_HINT_FIRST;
            cr = 3;
            atomic_inc(&sbi->s_mb_lost_chunks);
            goto repeat;
        }
    }
out:
    return err;
}
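
For illustration only, how the fls() plus mask test in ext4_mb_regular_allocator() classifies a request length as an exact power of two (which enables the cr == 0 buddy scan). This is a standalone sketch; demo_fls() is a portable stand-in for the kernel's fls(), and the sample lengths are made up.

/* Illustrative sketch (not kernel code): i = fls(len) gives the 1-based index
 * of the highest set bit; clearing that bit and checking for zero tells us
 * whether len was exactly 2^(i-1). */
#include <stdio.h>

static int demo_fls(unsigned int x)   /* highest set bit, 1-based; 0 if x == 0 */
{
    int i = 0;
    while (x) {
        i++;
        x >>= 1;
    }
    return i;
}

int main(void)
{
    unsigned int lens[] = { 1, 6, 8, 24, 64, 100 };

    for (int k = 0; k < 6; k++) {
        unsigned int len = lens[k];
        int i = demo_fls(len);
        int pow2 = (len & ~(1U << (i - 1))) == 0;   /* same test as above */
        printf("fe_len=%3u fls=%d power-of-two=%s -> ac_2order=%d\n",
               len, i, pow2 ? "yes" : "no", pow2 ? i - 1 : 0);
    }
    return 0;
}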

static void *ext4_mb_seq_groups_start(struct seq_file *seq, loff_t *pos)
{
    struct super_block *sb = seq->private;
    ext4_group_t group;

    if (*pos < 0 || *pos >= ext4_get_groups_count(sb))
        return NULL;
    group = *pos + 1;
    return (void *) ((unsigned long) group);
}

static void *ext4_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos)
{
    struct super_block *sb = seq->private;
    ext4_group_t group;

    ++*pos;
    if (*pos < 0 || *pos >= ext4_get_groups_count(sb))
        return NULL;
    group = *pos + 1;
    return (void *) ((unsigned long) group);
}

static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v)
{
    struct super_block *sb = seq->private;
    ext4_group_t group = (ext4_group_t) ((unsigned long) v);
    int i;
    int err, buddy_loaded = 0;
    struct ext4_buddy e4b;
    struct ext4_group_info *grinfo;
    struct sg {
        struct ext4_group_info info;
        ext4_grpblk_t counters[16];
    } sg;

    group--;
    if (group == 0)
        seq_printf(seq, "#%-5s: %-5s %-5s %-5s "
                "[ %-5s %-5s %-5s %-5s %-5s %-5s %-5s "
                "%-5s %-5s %-5s %-5s %-5s %-5s %-5s ]\n",
                "group", "free", "frags", "first",
                "2^0", "2^1", "2^2", "2^3", "2^4", "2^5", "2^6",
                "2^7", "2^8", "2^9", "2^10", "2^11", "2^12", "2^13");

    i = (sb->s_blocksize_bits + 2) * sizeof(sg.info.bb_counters[0]) +
        sizeof(struct ext4_group_info);
    grinfo = ext4_get_group_info(sb, group);
    /* Load the group info in memory only if not already loaded. */
    if (unlikely(EXT4_MB_GRP_NEED_INIT(grinfo))) {
        err = ext4_mb_load_buddy(sb, group, &e4b);
        if (err) {
            seq_printf(seq, "#%-5u: I/O error\n", group);
            return 0;
        }
        buddy_loaded = 1;
    }

    memcpy(&sg, ext4_get_group_info(sb, group), i);

    if (buddy_loaded)
        ext4_mb_unload_buddy(&e4b);

    seq_printf(seq, "#%-5u: %-5u %-5u %-5u [", group, sg.info.bb_free,
            sg.info.bb_fragments, sg.info.bb_first_free);
    for (i = 0; i <= 13; i++)
        seq_printf(seq, " %-5u", i <= sb->s_blocksize_bits + 1 ?
                sg.info.bb_counters[i] : 0);
    seq_printf(seq, " ]\n");

    return 0;
}

static void ext4_mb_seq_groups_stop(struct seq_file *seq, void *v)
{
}

static const struct seq_operations ext4_mb_seq_groups_ops = {
    .start = ext4_mb_seq_groups_start,
    .next = ext4_mb_seq_groups_next,
    .stop = ext4_mb_seq_groups_stop,
    .show = ext4_mb_seq_groups_show,
};

static int ext4_mb_seq_groups_open(struct inode *inode, struct file *file)
{
    struct super_block *sb = PDE_DATA(inode);
    int rc;

    rc = seq_open(file, &ext4_mb_seq_groups_ops);
    if (rc == 0) {
        struct seq_file *m = file->private_data;
        m->private = sb;
    }
    return rc;

}

static const struct file_operations ext4_mb_seq_groups_fops = {
    .owner = THIS_MODULE,
    .open = ext4_mb_seq_groups_open,
    .read = seq_read,
    .llseek = seq_lseek,
    .release = seq_release,
};
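
For illustration only, a usage sketch of the seq_file interface defined above: it backs the per-filesystem mb_groups file, which on typical systems is exposed as /proc/fs/ext4/<device>/mb_groups. The exact path and the device name used below are assumptions for the demo.

/* Illustrative sketch (not kernel code): dump the mb_groups statistics that
 * ext4_mb_seq_groups_show() formats - one header line, then per block group
 * the free clusters, fragments, first free cluster and 2^0..2^13 counters. */
#include <stdio.h>

int main(void)
{
    char line[512];
    FILE *f = fopen("/proc/fs/ext4/sda1/mb_groups", "r");   /* assumed device */

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}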

static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
{
    int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
    struct kmem_cache *cachep = ext4_groupinfo_caches[cache_index];

    BUG_ON(!cachep);
    return cachep;
}

/*
 * Allocate the top-level s_group_info array for the specified number
 * of groups
 */
int ext4_mb_alloc_groupinfo(struct super_block *sb, ext4_group_t ngroups)
{
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    unsigned size;
    struct ext4_group_info ***new_groupinfo;

    size = (ngroups + EXT4_DESC_PER_BLOCK(sb) - 1) >>
        EXT4_DESC_PER_BLOCK_BITS(sb);
    if (size <= sbi->s_group_info_size)
        return 0;

    size = roundup_pow_of_two(sizeof(*sbi->s_group_info) * size);
    new_groupinfo = ext4_kvzalloc(size, GFP_KERNEL);
    if (!new_groupinfo) {
        ext4_msg(sb, KERN_ERR, "can't allocate buddy meta group");
        return -ENOMEM;
    }
    if (sbi->s_group_info) {
        memcpy(new_groupinfo, sbi->s_group_info,
            sbi->s_group_info_size * sizeof(*sbi->s_group_info));
        ext4_kvfree(sbi->s_group_info);
    }
    sbi->s_group_info = new_groupinfo;
    sbi->s_group_info_size = size / sizeof(*sbi->s_group_info);
    ext4_debug("allocated s_groupinfo array for %d meta_bg's\n",
           sbi->s_group_info_size);
    return 0;
}

/* Create and initialize ext4_group_info data for the given group. */
int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
            struct ext4_group_desc *desc)
{
    int i;
    int metalen = 0;
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    struct ext4_group_info **meta_group_info;
    struct kmem_cache *cachep = get_groupinfo_cache(sb->s_blocksize_bits);

    /*
     * First check if this group is the first of a reserved block.
     * If it's true, we have to allocate a new table of pointers
     * to ext4_group_info structures
     */
    if (group % EXT4_DESC_PER_BLOCK(sb) == 0) {
        metalen = sizeof(*meta_group_info) <<
            EXT4_DESC_PER_BLOCK_BITS(sb);
        meta_group_info = kmalloc(metalen, GFP_KERNEL);
        if (meta_group_info == NULL) {
            ext4_msg(sb, KERN_ERR, "can't allocate mem "
                 "for a buddy group");
            goto exit_meta_group_info;
        }
        sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)] =
            meta_group_info;
    }

    meta_group_info =
        sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)];
    i = group & (EXT4_DESC_PER_BLOCK(sb) - 1);

    meta_group_info[i] = kmem_cache_zalloc(cachep, GFP_KERNEL);
    if (meta_group_info[i] == NULL) {
        ext4_msg(sb, KERN_ERR, "can't allocate buddy mem");
        goto exit_group_info;
    }
    set_bit(EXT4_GROUP_INFO_NEED_INIT_BIT,
        &(meta_group_info[i]->bb_state));

    /*
     * initialize bb_free to be able to skip
     * empty groups without initialization
     */
    if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
        meta_group_info[i]->bb_free =
            ext4_free_clusters_after_init(sb, group, desc);
    } else {
        meta_group_info[i]->bb_free =
            ext4_free_group_clusters(sb, desc);
    }

    INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
    init_rwsem(&meta_group_info[i]->alloc_sem);
    meta_group_info[i]->bb_free_root = RB_ROOT;
    meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */

#ifdef DOUBLE_CHECK
    {
        struct buffer_head *bh;
        meta_group_info[i]->bb_bitmap =
            kmalloc(sb->s_blocksize, GFP_KERNEL);
        BUG_ON(meta_group_info[i]->bb_bitmap == NULL);
        bh = ext4_read_block_bitmap(sb, group);
        BUG_ON(bh == NULL);
        memcpy(meta_group_info[i]->bb_bitmap, bh->b_data,
            sb->s_blocksize);
        put_bh(bh);
    }
#endif

    return 0;

exit_group_info:
    /* If a meta_group_info table has been allocated, release it now */
    if (group % EXT4_DESC_PER_BLOCK(sb) == 0) {
        kfree(sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)]);
        sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)] = NULL;
    }
exit_meta_group_info:
    return -ENOMEM;
} /* ext4_mb_add_groupinfo */

static int ext4_mb_init_backend(struct super_block *sb)
{
    ext4_group_t ngroups = ext4_get_groups_count(sb);
    ext4_group_t i;
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    int err;
    struct ext4_group_desc *desc;
    struct kmem_cache *cachep;

    err = ext4_mb_alloc_groupinfo(sb, ngroups);
    if (err)
        return err;

    sbi->s_buddy_cache = new_inode(sb);
    if (sbi->s_buddy_cache == NULL) {
        ext4_msg(sb, KERN_ERR, "can't get new inode");
        goto err_freesgi;
    }
    /* To avoid potentially colliding with an valid on-disk inode number,
     * use EXT4_BAD_INO for the buddy cache inode number.  This inode is
     * not in the inode hash, so it should never be found by iget(), but
     * this will avoid confusion if it ever shows up during debugging. */
    sbi->s_buddy_cache->i_ino = EXT4_BAD_INO;
    EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
    for (i = 0; i < ngroups; i++) {
        desc = ext4_get_group_desc(sb, i, NULL);
        if (desc == NULL) {
            ext4_msg(sb, KERN_ERR, "can't read descriptor %u", i);
            goto err_freebuddy;
        }
        if (ext4_mb_add_groupinfo(sb, i, desc) != 0)
            goto err_freebuddy;
    }

    return 0;

err_freebuddy:
    cachep = get_groupinfo_cache(sb->s_blocksize_bits);
    while (i-- > 0)
        kmem_cache_free(cachep, ext4_get_group_info(sb, i));
    i = sbi->s_group_info_size;
    while (i-- > 0)
        kfree(sbi->s_group_info[i]);
    iput(sbi->s_buddy_cache);
err_freesgi:
    ext4_kvfree(sbi->s_group_info);
    return -ENOMEM;
}

static void ext4_groupinfo_destroy_slabs(void)
{
    int i;

    for (i = 0; i < NR_GRPINFO_CACHES; i++) {
        if (ext4_groupinfo_caches[i])
            kmem_cache_destroy(ext4_groupinfo_caches[i]);
        ext4_groupinfo_caches[i] = NULL;
    }
}

static int ext4_groupinfo_create_slab(size_t size)
{
    static DEFINE_MUTEX(ext4_grpinfo_slab_create_mutex);
    int slab_size;
    int blocksize_bits = order_base_2(size);
    int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
    struct kmem_cache *cachep;

    if (cache_index >= NR_GRPINFO_CACHES)
        return -EINVAL;

    if (unlikely(cache_index < 0))
        cache_index = 0;

    mutex_lock(&ext4_grpinfo_slab_create_mutex);
    if (ext4_groupinfo_caches[cache_index]) {
        mutex_unlock(&ext4_grpinfo_slab_create_mutex);
        return 0;   /* Already created */
    }

    slab_size = offsetof(struct ext4_group_info,
                bb_counters[blocksize_bits + 2]);

    cachep = kmem_cache_create(ext4_groupinfo_slab_names[cache_index],
                    slab_size, 0, SLAB_RECLAIM_ACCOUNT,
                    NULL);

    ext4_groupinfo_caches[cache_index] = cachep;

    mutex_unlock(&ext4_grpinfo_slab_create_mutex);
    if (!cachep) {
        printk(KERN_EMERG
               "EXT4-fs: no memory for groupinfo slab cache\n");
        return -ENOMEM;
    }

    return 0;
}

int ext4_mb_init(struct super_block *sb)
{
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    unsigned i, j;
    unsigned offset;
    unsigned max;
    int ret;

    i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_offsets);

    sbi->s_mb_offsets = kmalloc(i, GFP_KERNEL);
    if (sbi->s_mb_offsets == NULL) {
        ret = -ENOMEM;
        goto out;
    }

    i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_maxs);
    sbi->s_mb_maxs = kmalloc(i, GFP_KERNEL);
    if (sbi->s_mb_maxs == NULL) {
        ret = -ENOMEM;
        goto out;
    }

    ret = ext4_groupinfo_create_slab(sb->s_blocksize);
    if (ret < 0)
        goto out;

    /* order 0 is regular bitmap */
    sbi->s_mb_maxs[0] = sb->s_blocksize << 3;
    sbi->s_mb_offsets[0] = 0;

    i = 1;
    offset = 0;
    max = sb->s_blocksize << 2;
    do {
        sbi->s_mb_offsets[i] = offset;
        sbi->s_mb_maxs[i] = max;
        offset += 1 << (sb->s_blocksize_bits - i);
        max = max >> 1;
        i++;
    } while (i <= sb->s_blocksize_bits + 1);

    spin_lock_init(&sbi->s_md_lock);
    spin_lock_init(&sbi->s_bal_lock);

    sbi->s_mb_max_to_scan = MB_DEFAULT_MAX_TO_SCAN;
    sbi->s_mb_min_to_scan = MB_DEFAULT_MIN_TO_SCAN;
    sbi->s_mb_stats = MB_DEFAULT_STATS;
    sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
    sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
    /*
     * The default group preallocation is 512, which for 4k block
     * sizes translates to 2 megabytes.  However for bigalloc file
     * systems, this is probably too big (i.e, if the cluster size
     * is 1 megabyte, then group preallocation size becomes half a
     * gigabyte!).  As a default, we will keep a two megabyte
     * group pralloc size for cluster sizes up to 64k, and after
     * that, we will force a minimum group preallocation size of
     * 32 clusters.  This translates to 8 megs when the cluster
     * size is 256k, and 32 megs when the cluster size is 1 meg,
     * which seems reasonable as a default.
     */
    sbi->s_mb_group_prealloc = max(MB_DEFAULT_GROUP_PREALLOC >>
                       sbi->s_cluster_bits, 32);
    /*
     * If there is a s_stripe > 1, then we set the s_mb_group_prealloc
     * to the lowest multiple of s_stripe which is bigger than
     * the s_mb_group_prealloc as determined above. We want
     * the preallocation size to be an exact multiple of the
     * RAID stripe size so that preallocations don't fragment
     * the stripes.
     */
    if (sbi->s_stripe > 1) {
        sbi->s_mb_group_prealloc = roundup(
            sbi->s_mb_group_prealloc, sbi->s_stripe);
    }

    sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
    if (sbi->s_locality_groups == NULL) {
        ret = -ENOMEM;
        goto out_free_groupinfo_slab;
    }
    for_each_possible_cpu(i) {
        struct ext4_locality_group *lg;
        lg = per_cpu_ptr(sbi->s_locality_groups, i);
        mutex_init(&lg->lg_mutex);
        for (j = 0; j < PREALLOC_TB_SIZE; j++)
            INIT_LIST_HEAD(&lg->lg_prealloc_list[j]);
        spin_lock_init(&lg->lg_prealloc_lock);
    }

    /* init file for buddy data */
    ret = ext4_mb_init_backend(sb);
    if (ret != 0)
        goto out_free_locality_groups;

    if (sbi->s_proc)
        proc_create_data("mb_groups", S_IRUGO, sbi->s_proc,
                 &ext4_mb_seq_groups_fops, sb);

    return 0;

out_free_locality_groups:
    free_percpu(sbi->s_locality_groups);
    sbi->s_locality_groups = NULL;
out_free_groupinfo_slab:
    ext4_groupinfo_destroy_slabs();
out:
    kfree(sbi->s_mb_offsets);
    sbi->s_mb_offsets = NULL;
    kfree(sbi->s_mb_maxs);
    sbi->s_mb_maxs = NULL;
    return ret;
}

/* need to called with the ext4 group lock held */
static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)
{
    struct ext4_prealloc_space *pa;
    struct list_head *cur, *tmp;
    int count = 0;

    list_for_each_safe(cur, tmp, &grp->bb_prealloc_list) {
        pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
        list_del(&pa->pa_group_list);
        count++;
        kmem_cache_free(ext4_pspace_cachep, pa);
    }
    if (count)
        mb_debug(1, "mballoc: %u PAs left\n", count);

}

int ext4_mb_release(struct super_block *sb)
{
    ext4_group_t ngroups = ext4_get_groups_count(sb);
    ext4_group_t i;
    int num_meta_group_infos;
    struct ext4_group_info *grinfo;
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    struct kmem_cache *cachep = get_groupinfo_cache(sb->s_blocksize_bits);

    if (sbi->s_proc)
        remove_proc_entry("mb_groups", sbi->s_proc);

    if (sbi->s_group_info) {
        for (i = 0; i < ngroups; i++) {
            grinfo = ext4_get_group_info(sb, i);
#ifdef DOUBLE_CHECK
            kfree(grinfo->bb_bitmap);
#endif
            ext4_lock_group(sb, i);
            ext4_mb_cleanup_pa(grinfo);
            ext4_unlock_group(sb, i);
            kmem_cache_free(cachep, grinfo);
        }
        num_meta_group_infos = (ngroups +
                EXT4_DESC_PER_BLOCK(sb) - 1) >>
            EXT4_DESC_PER_BLOCK_BITS(sb);
        for (i = 0; i < num_meta_group_infos; i++)
            kfree(sbi->s_group_info[i]);
        ext4_kvfree(sbi->s_group_info);
    }
    kfree(sbi->s_mb_offsets);
    kfree(sbi->s_mb_maxs);
    if (sbi->s_buddy_cache)
        iput(sbi->s_buddy_cache);
    if (sbi->s_mb_stats) {
        ext4_msg(sb, KERN_INFO,
               "mballoc: %u blocks %u reqs (%u success)",
                atomic_read(&sbi->s_bal_allocated),
                atomic_read(&sbi->s_bal_reqs),
                atomic_read(&sbi->s_bal_success));
        ext4_msg(sb, KERN_INFO,
              "mballoc: %u extents scanned, %u goal hits, "
                "%u 2^N hits, %u breaks, %u lost",
                atomic_read(&sbi->s_bal_ex_scanned),
                atomic_read(&sbi->s_bal_goals),
                atomic_read(&sbi->s_bal_2orders),
                atomic_read(&sbi->s_bal_breaks),
                atomic_read(&sbi->s_mb_lost_chunks));
        ext4_msg(sb, KERN_INFO,
               "mballoc: %lu generated and it took %Lu",
                sbi->s_mb_buddies_generated,
                sbi->s_mb_generation_time);
        ext4_msg(sb, KERN_INFO,
               "mballoc: %u preallocated, %u discarded",
                atomic_read(&sbi->s_mb_preallocated),
                atomic_read(&sbi->s_mb_discarded));
    }

    free_percpu(sbi->s_locality_groups);

    return 0;
}

static inline int ext4_issue_discard(struct super_block *sb,
        ext4_group_t block_group, ext4_grpblk_t cluster, int count)
{
    ext4_fsblk_t discard_block;

    discard_block = (EXT4_C2B(EXT4_SB(sb), cluster) +
             ext4_group_first_block_no(sb, block_group));
    count = EXT4_C2B(EXT4_SB(sb), count);
    trace_ext4_discard_blocks(sb,
            (unsigned long long) discard_block, count);
    return sb_issue_discard(sb, discard_block, count, GFP_NOFS, 0);
}

/*
 * This function is called by the jbd2 layer once the commit has finished,
 * so we know we can free the blocks that were released with that commit.
 */
static void ext4_free_data_callback(struct super_block *sb,
                    struct ext4_journal_cb_entry *jce,
                    int rc)
{
    struct ext4_free_data *entry = (struct ext4_free_data *)jce;
    struct ext4_buddy e4b;
    struct ext4_group_info *db;
    int err, count = 0, count2 = 0;

    mb_debug(1, "gonna free %u blocks in group %u (0x%p):",
         entry->efd_count, entry->efd_group, entry);

    if (test_opt(sb, DISCARD)) {
        err = ext4_issue_discard(sb, entry->efd_group,
                     entry->efd_start_cluster,
                     entry->efd_count);
        if (err && err != -EOPNOTSUPP)
            ext4_msg(sb, KERN_WARNING, "discard request in"
                 " group:%d block:%d count:%d failed"
                 " with %d", entry->efd_group,
                 entry->efd_start_cluster,
                 entry->efd_count, err);
    }

    err = ext4_mb_load_buddy(sb, entry->efd_group, &e4b);
    /* we expect to find existing buddy because it's pinned */
    BUG_ON(err != 0);


    db = e4b.bd_info;
    /* there are blocks to put in buddy to make them really free */
    count += entry->efd_count;
    count2++;
    ext4_lock_group(sb, entry->efd_group);
    /* Take it out of per group rb tree */
    rb_erase(&entry->efd_node, &(db->bb_free_root));
    mb_free_blocks(NULL, &e4b, entry->efd_start_cluster, entry->efd_count);

    /*
     * Clear the trimmed flag for the group so that the next
     * ext4_trim_fs can trim it.
     * If the volume is mounted with -o discard, online discard
     * is supported and the free blocks will be trimmed online.
     */
    if (!test_opt(sb, DISCARD))
        EXT4_MB_GRP_CLEAR_TRIMMED(db);

    if (!db->bb_free_root.rb_node) {
        /* No more items in the per group rb tree
         * balance refcounts from ext4_mb_free_metadata()
         */
        page_cache_release(e4b.bd_buddy_page);
        page_cache_release(e4b.bd_bitmap_page);
    }
    ext4_unlock_group(sb, entry->efd_group);
    kmem_cache_free(ext4_free_data_cachep, entry);
    ext4_mb_unload_buddy(&e4b);

    mb_debug(1, "freed %u blocks in %u structures\n", count, count2);
}

int __init ext4_init_mballoc(void)
{
    ext4_pspace_cachep = KMEM_CACHE(ext4_prealloc_space,
                    SLAB_RECLAIM_ACCOUNT);
    if (ext4_pspace_cachep == NULL)
        return -ENOMEM;

    ext4_ac_cachep = KMEM_CACHE(ext4_allocation_context,
                    SLAB_RECLAIM_ACCOUNT);
    if (ext4_ac_cachep == NULL) {
        kmem_cache_destroy(ext4_pspace_cachep);
        return -ENOMEM;
    }

    ext4_free_data_cachep = KMEM_CACHE(ext4_free_data,
                       SLAB_RECLAIM_ACCOUNT);
    if (ext4_free_data_cachep == NULL) {
        kmem_cache_destroy(ext4_pspace_cachep);
        kmem_cache_destroy(ext4_ac_cachep);
        return -ENOMEM;
    }
    return 0;
}

void ext4_exit_mballoc(void)
{
    /*
     * Wait for completion of call_rcu()'s on ext4_pspace_cachep
     * before destroying the slab cache.
     */
    rcu_barrier();
    kmem_cache_destroy(ext4_pspace_cachep);
    kmem_cache_destroy(ext4_ac_cachep);
    kmem_cache_destroy(ext4_free_data_cachep);
    ext4_groupinfo_destroy_slabs();
}


/*
 * Check quota and mark chosen space (ac->ac_b_ex) non-free in bitmaps
 * Returns 0 if success or error code
 */
static noinline_for_stack int
ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
                handle_t *handle, unsigned int reserv_clstrs)
{
    struct buffer_head *bitmap_bh = NULL;
    struct ext4_group_desc *gdp;
    struct buffer_head *gdp_bh;
    struct ext4_sb_info *sbi;
    struct super_block *sb;
    ext4_fsblk_t block;
    int err, len;

    BUG_ON(ac->ac_status != AC_STATUS_FOUND);
    BUG_ON(ac->ac_b_ex.fe_len <= 0);

    sb = ac->ac_sb;
    sbi = EXT4_SB(sb);

    err = -EIO;
    bitmap_bh = ext4_read_block_bitmap(sb, ac->ac_b_ex.fe_group);
    if (!bitmap_bh)
        goto out_err;

    err = ext4_journal_get_write_access(handle, bitmap_bh);
    if (err)
        goto out_err;

    err = -EIO;
    gdp = ext4_get_group_desc(sb, ac->ac_b_ex.fe_group, &gdp_bh);
    if (!gdp)
        goto out_err;

    ext4_debug("using block group %u(%d)\n", ac->ac_b_ex.fe_group,
            ext4_free_group_clusters(sb, gdp));

    err = ext4_journal_get_write_access(handle, gdp_bh);
    if (err)
        goto out_err;

    block = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);

    len = EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
    if (!ext4_data_block_valid(sbi, block, len)) {
        ext4_error(sb, "Allocating blocks %llu-%llu which overlap "
               "fs metadata", block, block+len);
        /* File system mounted not to panic on error
         * Fix the bitmap and repeat the block allocation
         * We leak some of the blocks here.
         */
        ext4_lock_group(sb, ac->ac_b_ex.fe_group);
        ext4_set_bits(bitmap_bh->b_data, ac->ac_b_ex.fe_start,
                  ac->ac_b_ex.fe_len);
        ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
        err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
        if (!err)
            err = -EAGAIN;
        goto out_err;
    }

    ext4_lock_group(sb, ac->ac_b_ex.fe_group);
#ifdef AGGRESSIVE_CHECK
    {
        int i;
        for (i = 0; i < ac->ac_b_ex.fe_len; i++) {
            BUG_ON(mb_test_bit(ac->ac_b_ex.fe_start + i,
                        bitmap_bh->b_data));
        }
    }
#endif
    ext4_set_bits(bitmap_bh->b_data, ac->ac_b_ex.fe_start,
              ac->ac_b_ex.fe_len);
    if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
        gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
        ext4_free_group_clusters_set(sb, gdp,
                         ext4_free_clusters_after_init(sb,
                        ac->ac_b_ex.fe_group, gdp));
    }
    len = ext4_free_group_clusters(sb, gdp) - ac->ac_b_ex.fe_len;
    ext4_free_group_clusters_set(sb, gdp, len);
    ext4_block_bitmap_csum_set(sb, ac->ac_b_ex.fe_group, gdp, bitmap_bh);
    ext4_group_desc_csum_set(sb, ac->ac_b_ex.fe_group, gdp);

    ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
    percpu_counter_sub(&sbi->s_freeclusters_counter, ac->ac_b_ex.fe_len);
    /*
     * Now reduce the dirty block count also. Should not go negative
     */
    if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
        /* release all the reserved blocks if non delalloc */
        percpu_counter_sub(&sbi->s_dirtyclusters_counter,
                   reserv_clstrs);

    if (sbi->s_log_groups_per_flex) {
        ext4_group_t flex_group = ext4_flex_group(sbi,
                              ac->ac_b_ex.fe_group);
        atomic64_sub(ac->ac_b_ex.fe_len,
                 &sbi->s_flex_groups[flex_group].free_clusters);
    }

    err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
    if (err)
        goto out_err;
    err = ext4_handle_dirty_metadata(handle, NULL, gdp_bh);

out_err:
    brelse(bitmap_bh);
    return err;
}

/*
 * here we normalize request for locality group
 * Group request are normalized to s_mb_group_prealloc, which goes to
 * s_strip if we set the same via mount option.
 * s_mb_group_prealloc can be configured via
 * /sys/fs/ext4/<partition>/mb_group_prealloc
 *
 * XXX: should we try to preallocate more than the group has now?
 */
static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
{
    struct super_block *sb = ac->ac_sb;
    struct ext4_locality_group *lg = ac->ac_lg;

    BUG_ON(lg == NULL);
    ac->ac_g_ex.fe_len = EXT4_SB(sb)->s_mb_group_prealloc;
    mb_debug(1, "#%u: goal %u blocks for locality group\n",
        current->pid, ac->ac_g_ex.fe_len);
}

/*
 * Normalization means making request better in terms of
 * size and alignment
 */
static noinline_for_stack void
ext4_mb_normalize_request(struct ext4_allocation_context *ac,
                struct ext4_allocation_request *ar)
{
    struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
    int bsbits, max;
    ext4_lblk_t end;
    loff_t size, start_off;
    loff_t orig_size __maybe_unused;
    ext4_lblk_t start;
    struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
    struct ext4_prealloc_space *pa;

    /* do normalize only data requests, metadata requests
       do not need preallocation */
    if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
        return;

    /* sometime caller may want exact blocks */
    if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
        return;

    /* caller may indicate that preallocation isn't
     * required (it's a tail, for example) */
    if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
        return;

    if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) {
        ext4_mb_normalize_group_request(ac);
        return ;
    }

    bsbits = ac->ac_sb->s_blocksize_bits;

    /* first, let's learn actual file size
     * given current request is allocated */
    size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
    size = size << bsbits;
    if (size < i_size_read(ac->ac_inode))
        size = i_size_read(ac->ac_inode);
    orig_size = size;

    /* max size of free chunks */
    max = 2 << bsbits;

#define NRL_CHECK_SIZE(req, size, max, chunk_size)  \
        (req <= (size) || max <= (chunk_size))

    /* first, try to predict filesize */
    /* XXX: should this table be tunable? */
    start_off = 0;
    if (size <= 16 * 1024) {
        size = 16 * 1024;
    } else if (size <= 32 * 1024) {
        size = 32 * 1024;
    } else if (size <= 64 * 1024) {
        size = 64 * 1024;
    } else if (size <= 128 * 1024) {
        size = 128 * 1024;
    } else if (size <= 256 * 1024) {
        size = 256 * 1024;
    } else if (size <= 512 * 1024) {
        size = 512 * 1024;
    } else if (size <= 1024 * 1024) {
        size = 1024 * 1024;
    } else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, 2 * 1024)) {
        start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
                        (21 - bsbits)) << 21;
        size = 2 * 1024 * 1024;
    } else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, 4 * 1024)) {
        start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
                            (22 - bsbits)) << 22;
        size = 4 * 1024 * 1024;
    } else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
                    (8<<20)>>bsbits, max, 8 * 1024)) {
        start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
                            (23 - bsbits)) << 23;
        size = 8 * 1024 * 1024;
    } else {
        start_off = (loff_t)ac->ac_o_ex.fe_logical << bsbits;
        size      = ac->ac_o_ex.fe_len << bsbits;
    }
    size = size >> bsbits;
    start = start_off >> bsbits;

    /* don't cover already allocated blocks in selected range */
    if (ar->pleft && start <= ar->lleft) {
        size -= ar->lleft + 1 - start;
        start = ar->lleft + 1;
    }
    if (ar->pright && start + size - 1 >= ar->lright)
        size -= start + size - ar->lright;

    end = start + size;

    /* check we don't cross already preallocated blocks */
    rcu_read_lock();
    list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
        ext4_lblk_t pa_end;

        if (pa->pa_deleted)
            continue;
        spin_lock(&pa->pa_lock);
        if (pa->pa_deleted) {
            spin_unlock(&pa->pa_lock);
            continue;
        }

        pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb),
                          pa->pa_len);

        /* PA must not overlap original request */
        BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end ||
            ac->ac_o_ex.fe_logical < pa->pa_lstart));

        /* skip PAs this normalized request doesn't overlap with */
        if (pa->pa_lstart >= end || pa_end <= start) {
            spin_unlock(&pa->pa_lock);
            continue;
        }
        BUG_ON(pa->pa_lstart <= start && pa_end >= end);

        /* adjust start or end to be adjacent to this pa */
        if (pa_end <= ac->ac_o_ex.fe_logical) {
            BUG_ON(pa_end < start);
            start = pa_end;
        } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) {
            BUG_ON(pa->pa_lstart > end);
            end = pa->pa_lstart;
        }
        spin_unlock(&pa->pa_lock);
    }
    rcu_read_unlock();
    size = end - start;

    /* XXX: extra loop to check we really don't overlap preallocations */
    rcu_read_lock();
    list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
        ext4_lblk_t pa_end;

        spin_lock(&pa->pa_lock);
        if (pa->pa_deleted == 0) {
            pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb),
                              pa->pa_len);
            BUG_ON(!(start >= pa_end || end <= pa->pa_lstart));
        }
        spin_unlock(&pa->pa_lock);
    }
    rcu_read_unlock();

    if (start + size <= ac->ac_o_ex.fe_logical &&
            start > ac->ac_o_ex.fe_logical) {
        ext4_msg(ac->ac_sb, KERN_ERR,
             "start %lu, size %lu, fe_logical %lu",
             (unsigned long) start, (unsigned long) size,
             (unsigned long) ac->ac_o_ex.fe_logical);
    }
    BUG_ON(start + size <= ac->ac_o_ex.fe_logical &&
            start > ac->ac_o_ex.fe_logical);
    BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));

    /* now prepare goal request */

    /* XXX: is it better to align blocks WRT to logical
     * placement or satisfy big request as is */
    ac->ac_g_ex.fe_logical = start;
    ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size);

    /* define goal start in order to merge */
    if (ar->pright && (ar->lright == (start + size))) {
        /* merge to the right */
        ext4_get_group_no_and_offset(ac->ac_sb, ar->pright - size,
                        &ac->ac_f_ex.fe_group,
                        &ac->ac_f_ex.fe_start);
        ac->ac_flags |= EXT4_MB_HINT_TRY_GOAL;
    }
    if (ar->pleft && (ar->lleft + 1 == start)) {
        /* merge to the left */
        ext4_get_group_no_and_offset(ac->ac_sb, ar->pleft + 1,
                        &ac->ac_f_ex.fe_group,
                        &ac->ac_f_ex.fe_start);
        ac->ac_flags |= EXT4_MB_HINT_TRY_GOAL;
    }

    mb_debug(1, "goal: %u(was %u) blocks at %u\n", (unsigned) size,
        (unsigned) orig_size, (unsigned) start);
}
3168 3170
3169 static void ext4_mb_collect_stats(struct ext4_allocation_context *ac) 3171 static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
3170 { 3172 {
3171 struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); 3173 struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
3172 3174
3173 if (sbi->s_mb_stats && ac->ac_g_ex.fe_len > 1) { 3175 if (sbi->s_mb_stats && ac->ac_g_ex.fe_len > 1) {
3174 atomic_inc(&sbi->s_bal_reqs); 3176 atomic_inc(&sbi->s_bal_reqs);
3175 atomic_add(ac->ac_b_ex.fe_len, &sbi->s_bal_allocated); 3177 atomic_add(ac->ac_b_ex.fe_len, &sbi->s_bal_allocated);
3176 if (ac->ac_b_ex.fe_len >= ac->ac_o_ex.fe_len) 3178 if (ac->ac_b_ex.fe_len >= ac->ac_o_ex.fe_len)
3177 atomic_inc(&sbi->s_bal_success); 3179 atomic_inc(&sbi->s_bal_success);
3178 atomic_add(ac->ac_found, &sbi->s_bal_ex_scanned); 3180 atomic_add(ac->ac_found, &sbi->s_bal_ex_scanned);
3179 if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start && 3181 if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
3180 ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group) 3182 ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
3181 atomic_inc(&sbi->s_bal_goals); 3183 atomic_inc(&sbi->s_bal_goals);
3182 if (ac->ac_found > sbi->s_mb_max_to_scan) 3184 if (ac->ac_found > sbi->s_mb_max_to_scan)
3183 atomic_inc(&sbi->s_bal_breaks); 3185 atomic_inc(&sbi->s_bal_breaks);
3184 } 3186 }
3185 3187
3186 if (ac->ac_op == EXT4_MB_HISTORY_ALLOC) 3188 if (ac->ac_op == EXT4_MB_HISTORY_ALLOC)
3187 trace_ext4_mballoc_alloc(ac); 3189 trace_ext4_mballoc_alloc(ac);
3188 else 3190 else
3189 trace_ext4_mballoc_prealloc(ac); 3191 trace_ext4_mballoc_prealloc(ac);
3190 } 3192 }
3191 3193
/*
 * Called on failure; free up any blocks from the inode PA for this
 * context. We don't need this for MB_GROUP_PA because we only change
 * pa_free in ext4_mb_release_context(), but on failure, we've already
 * zeroed out ac->ac_b_ex.fe_len, so group_pa->pa_free is not changed.
 */
static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
{
	struct ext4_prealloc_space *pa = ac->ac_pa;
	struct ext4_buddy e4b;
	int err;

	if (pa == NULL) {
		if (ac->ac_f_ex.fe_len == 0)
			return;
		err = ext4_mb_load_buddy(ac->ac_sb, ac->ac_f_ex.fe_group, &e4b);
		if (err) {
			/*
			 * This should never happen since we pin the
			 * pages in the ext4_allocation_context so
			 * ext4_mb_load_buddy() should never fail.
			 */
			WARN(1, "mb_load_buddy failed (%d)", err);
			return;
		}
		ext4_lock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
		mb_free_blocks(ac->ac_inode, &e4b, ac->ac_f_ex.fe_start,
			       ac->ac_f_ex.fe_len);
		ext4_unlock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
		ext4_mb_unload_buddy(&e4b);
		return;
	}
	if (pa->pa_type == MB_INODE_PA)
		pa->pa_free += ac->ac_b_ex.fe_len;
}

/*
 * use blocks preallocated to inode
 */
static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac,
				struct ext4_prealloc_space *pa)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
	ext4_fsblk_t start;
	ext4_fsblk_t end;
	int len;

	/* found preallocated blocks, use them */
	start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart);
	end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len),
		  start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len));
	len = EXT4_NUM_B2C(sbi, end - start);
	ext4_get_group_no_and_offset(ac->ac_sb, start, &ac->ac_b_ex.fe_group,
					&ac->ac_b_ex.fe_start);
	ac->ac_b_ex.fe_len = len;
	ac->ac_status = AC_STATUS_FOUND;
	ac->ac_pa = pa;

	BUG_ON(start < pa->pa_pstart);
	BUG_ON(end > pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len));
	BUG_ON(pa->pa_free < len);
	pa->pa_free -= len;

	mb_debug(1, "use %llu/%u from inode pa %p\n", start, len, pa);
}

/*
 * use blocks preallocated to locality group
 */
static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac,
				struct ext4_prealloc_space *pa)
{
	unsigned int len = ac->ac_o_ex.fe_len;

	ext4_get_group_no_and_offset(ac->ac_sb, pa->pa_pstart,
					&ac->ac_b_ex.fe_group,
					&ac->ac_b_ex.fe_start);
	ac->ac_b_ex.fe_len = len;
	ac->ac_status = AC_STATUS_FOUND;
	ac->ac_pa = pa;

	/* we don't correct pa_pstart or pa_plen here to avoid
	 * possible race when the group is being loaded concurrently
	 * instead we correct pa later, after blocks are marked
	 * in on-disk bitmap -- see ext4_mb_release_context()
	 * Other CPUs are prevented from allocating from this pa by lg_mutex
	 */
	mb_debug(1, "use %u/%u from group pa %p\n", pa->pa_lstart-len, len, pa);
}

/*
 * Return the prealloc space that have minimal distance
 * from the goal block. @cpa is the prealloc
 * space that is having currently known minimal distance
 * from the goal block.
 */
static struct ext4_prealloc_space *
ext4_mb_check_group_pa(ext4_fsblk_t goal_block,
			struct ext4_prealloc_space *pa,
			struct ext4_prealloc_space *cpa)
{
	ext4_fsblk_t cur_distance, new_distance;

	if (cpa == NULL) {
		atomic_inc(&pa->pa_count);
		return pa;
	}
	cur_distance = abs(goal_block - cpa->pa_pstart);
	new_distance = abs(goal_block - pa->pa_pstart);

	if (cur_distance <= new_distance)
		return cpa;

	/* drop the previous reference */
	atomic_dec(&cpa->pa_count);
	atomic_inc(&pa->pa_count);
	return pa;
}

/*
 * search goal blocks in preallocated space
 */
static noinline_for_stack int
ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
	int order, i;
	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
	struct ext4_locality_group *lg;
	struct ext4_prealloc_space *pa, *cpa = NULL;
	ext4_fsblk_t goal_block;

	/* only data can be preallocated */
	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
		return 0;

	/* first, try per-file preallocation */
	rcu_read_lock();
	list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {

		/* all fields in this condition don't change,
		 * so we can skip locking for them */
		if (ac->ac_o_ex.fe_logical < pa->pa_lstart ||
		    ac->ac_o_ex.fe_logical >= (pa->pa_lstart +
					       EXT4_C2B(sbi, pa->pa_len)))
			continue;

		/* non-extent files can't have physical blocks past 2^32 */
		if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
		    (pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) >
		     EXT4_MAX_BLOCK_FILE_PHYS))
			continue;

		/* found preallocated blocks, use them */
		spin_lock(&pa->pa_lock);
		if (pa->pa_deleted == 0 && pa->pa_free) {
			atomic_inc(&pa->pa_count);
			ext4_mb_use_inode_pa(ac, pa);
			spin_unlock(&pa->pa_lock);
			ac->ac_criteria = 10;
			rcu_read_unlock();
			return 1;
		}
		spin_unlock(&pa->pa_lock);
	}
	rcu_read_unlock();

	/* can we use group allocation? */
	if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC))
		return 0;

	/* inode may have no locality group for some reason */
	lg = ac->ac_lg;
	if (lg == NULL)
		return 0;
	order = fls(ac->ac_o_ex.fe_len) - 1;
	if (order > PREALLOC_TB_SIZE - 1)
		/* The max size of hash table is PREALLOC_TB_SIZE */
		order = PREALLOC_TB_SIZE - 1;

	goal_block = ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex);
	/*
	 * search for the prealloc space that is having
	 * minimal distance from the goal block.
	 */
	for (i = order; i < PREALLOC_TB_SIZE; i++) {
		rcu_read_lock();
		list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[i],
					pa_inode_list) {
			spin_lock(&pa->pa_lock);
			if (pa->pa_deleted == 0 &&
					pa->pa_free >= ac->ac_o_ex.fe_len) {

				cpa = ext4_mb_check_group_pa(goal_block,
								pa, cpa);
			}
			spin_unlock(&pa->pa_lock);
		}
		rcu_read_unlock();
	}
	if (cpa) {
		ext4_mb_use_group_pa(ac, cpa);
		ac->ac_criteria = 20;
		return 1;
	}
	return 0;
}

/*
 * the function goes through all block freed in the group
 * but not yet committed and marks them used in in-core bitmap.
 * buddy must be generated from this bitmap
 * Need to be called with the ext4 group lock held
 */
static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
						ext4_group_t group)
{
	struct rb_node *n;
	struct ext4_group_info *grp;
	struct ext4_free_data *entry;

	grp = ext4_get_group_info(sb, group);
	n = rb_first(&(grp->bb_free_root));

	while (n) {
		entry = rb_entry(n, struct ext4_free_data, efd_node);
		ext4_set_bits(bitmap, entry->efd_start_cluster, entry->efd_count);
		n = rb_next(n);
	}
	return;
}

/*
 * the function goes through all preallocation in this group and marks them
 * used in in-core bitmap. buddy must be generated from this bitmap
 * Need to be called with ext4 group lock held
 */
static noinline_for_stack
void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
					ext4_group_t group)
{
	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
	struct ext4_prealloc_space *pa;
	struct list_head *cur;
	ext4_group_t groupnr;
	ext4_grpblk_t start;
	int preallocated = 0;
	int len;

	/* all form of preallocation discards first load group,
	 * so the only competing code is preallocation use.
	 * we don't need any locking here
	 * notice we do NOT ignore preallocations with pa_deleted
	 * otherwise we could leave used blocks available for
	 * allocation in buddy when concurrent ext4_mb_put_pa()
	 * is dropping preallocation
	 */
	list_for_each(cur, &grp->bb_prealloc_list) {
		pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
		spin_lock(&pa->pa_lock);
		ext4_get_group_no_and_offset(sb, pa->pa_pstart,
					     &groupnr, &start);
		len = pa->pa_len;
		spin_unlock(&pa->pa_lock);
		if (unlikely(len == 0))
			continue;
		BUG_ON(groupnr != group);
		ext4_set_bits(bitmap, start, len);
		preallocated += len;
	}
	mb_debug(1, "prellocated %u for group %u\n", preallocated, group);
}

static void ext4_mb_pa_callback(struct rcu_head *head)
{
	struct ext4_prealloc_space *pa;
	pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);

	BUG_ON(atomic_read(&pa->pa_count));
	BUG_ON(pa->pa_deleted == 0);
	kmem_cache_free(ext4_pspace_cachep, pa);
}

/*
 * drops a reference to preallocated space descriptor
 * if this was the last reference and the space is consumed
 */
static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
			struct super_block *sb, struct ext4_prealloc_space *pa)
{
	ext4_group_t grp;
	ext4_fsblk_t grp_blk;

	/* in this short window concurrent discard can set pa_deleted */
	spin_lock(&pa->pa_lock);
	if (!atomic_dec_and_test(&pa->pa_count) || pa->pa_free != 0) {
		spin_unlock(&pa->pa_lock);
		return;
	}

	if (pa->pa_deleted == 1) {
		spin_unlock(&pa->pa_lock);
		return;
	}

	pa->pa_deleted = 1;
	spin_unlock(&pa->pa_lock);

	grp_blk = pa->pa_pstart;
	/*
	 * If doing group-based preallocation, pa_pstart may be in the
	 * next group when pa is used up
	 */
	if (pa->pa_type == MB_GROUP_PA)
		grp_blk--;

	grp = ext4_get_group_number(sb, grp_blk);

	/*
	 * possible race:
	 *
	 *  P1 (buddy init)                     P2 (regular allocation)
	 *                                      find block B in PA
	 *  copy on-disk bitmap to buddy
	 *                                      mark B in on-disk bitmap
	 *                                      drop PA from group
	 *  mark all PAs in buddy
	 *
	 * thus, P1 initializes buddy with B available. to prevent this
	 * we make "copy" and "mark all PAs" atomic and serialize "drop PA"
	 * against that pair
	 */
	ext4_lock_group(sb, grp);
	list_del(&pa->pa_group_list);
	ext4_unlock_group(sb, grp);

	spin_lock(pa->pa_obj_lock);
	list_del_rcu(&pa->pa_inode_list);
	spin_unlock(pa->pa_obj_lock);

	call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}

/*
 * creates new preallocated space for given inode
 */
static noinline_for_stack int
ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
{
	struct super_block *sb = ac->ac_sb;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_prealloc_space *pa;
	struct ext4_group_info *grp;
	struct ext4_inode_info *ei;

	/* preallocate only when found space is larger then requested */
	BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));

	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
	if (pa == NULL)
		return -ENOMEM;

	if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) {
		int winl;
		int wins;
		int win;
		int offs;

		/* we can't allocate as much as normalizer wants.
		 * so, found space must get proper lstart
		 * to cover original request */
		BUG_ON(ac->ac_g_ex.fe_logical > ac->ac_o_ex.fe_logical);
		BUG_ON(ac->ac_g_ex.fe_len < ac->ac_o_ex.fe_len);

		/* we're limited by original request in that
		 * logical block must be covered any way
		 * winl is window we can move our chunk within */
		winl = ac->ac_o_ex.fe_logical - ac->ac_g_ex.fe_logical;

		/* also, we should cover whole original request */
		wins = EXT4_C2B(sbi, ac->ac_b_ex.fe_len - ac->ac_o_ex.fe_len);

		/* the smallest one defines real window */
		win = min(winl, wins);

		offs = ac->ac_o_ex.fe_logical %
			EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
		if (offs && offs < win)
			win = offs;

		ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical -
			EXT4_NUM_B2C(sbi, win);
		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
	}

	/* preallocation can change ac_b_ex, thus we store actually
	 * allocated blocks for history */
	ac->ac_f_ex = ac->ac_b_ex;

	pa->pa_lstart = ac->ac_b_ex.fe_logical;
	pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
	pa->pa_len = ac->ac_b_ex.fe_len;
	pa->pa_free = pa->pa_len;
	atomic_set(&pa->pa_count, 1);
	spin_lock_init(&pa->pa_lock);
	INIT_LIST_HEAD(&pa->pa_inode_list);
	INIT_LIST_HEAD(&pa->pa_group_list);
	pa->pa_deleted = 0;
	pa->pa_type = MB_INODE_PA;

	mb_debug(1, "new inode pa %p: %llu/%u for %u\n", pa,
			pa->pa_pstart, pa->pa_len, pa->pa_lstart);
	trace_ext4_mb_new_inode_pa(ac, pa);

	ext4_mb_use_inode_pa(ac, pa);
	atomic_add(pa->pa_free, &sbi->s_mb_preallocated);

	ei = EXT4_I(ac->ac_inode);
	grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);

	pa->pa_obj_lock = &ei->i_prealloc_lock;
	pa->pa_inode = ac->ac_inode;

	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);

	spin_lock(pa->pa_obj_lock);
	list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
	spin_unlock(pa->pa_obj_lock);

	return 0;
}

/*
 * creates new preallocated space for locality group inodes belongs to
 */
static noinline_for_stack int
ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
{
	struct super_block *sb = ac->ac_sb;
	struct ext4_locality_group *lg;
	struct ext4_prealloc_space *pa;
	struct ext4_group_info *grp;

	/* preallocate only when found space is larger then requested */
	BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));

	BUG_ON(ext4_pspace_cachep == NULL);
	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
	if (pa == NULL)
		return -ENOMEM;

	/* preallocation can change ac_b_ex, thus we store actually
	 * allocated blocks for history */
	ac->ac_f_ex = ac->ac_b_ex;

	pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
	pa->pa_lstart = pa->pa_pstart;
	pa->pa_len = ac->ac_b_ex.fe_len;
	pa->pa_free = pa->pa_len;
	atomic_set(&pa->pa_count, 1);
	spin_lock_init(&pa->pa_lock);
	INIT_LIST_HEAD(&pa->pa_inode_list);
	INIT_LIST_HEAD(&pa->pa_group_list);
	pa->pa_deleted = 0;
	pa->pa_type = MB_GROUP_PA;

	mb_debug(1, "new group pa %p: %llu/%u for %u\n", pa,
			pa->pa_pstart, pa->pa_len, pa->pa_lstart);
	trace_ext4_mb_new_group_pa(ac, pa);

	ext4_mb_use_group_pa(ac, pa);
	atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated);

	grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);
	lg = ac->ac_lg;
	BUG_ON(lg == NULL);

	pa->pa_obj_lock = &lg->lg_prealloc_lock;
	pa->pa_inode = NULL;

	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);

	/*
	 * We will later add the new pa to the right bucket
	 * after updating the pa_free in ext4_mb_release_context
	 */
	return 0;
}

static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
{
	int err;

	if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
		err = ext4_mb_new_group_pa(ac);
	else
		err = ext4_mb_new_inode_pa(ac);
	return err;
}

/*
 * finds all unused blocks in on-disk bitmap, frees them in
 * in-core bitmap and buddy.
 * @pa must be unlinked from inode and group lists, so that
 * nobody else can find/use it.
 * the caller MUST hold group/inode locks.
 * TODO: optimize the case when there are no in-core structures yet
 */
static noinline_for_stack int
ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
			struct ext4_prealloc_space *pa)
{
	struct super_block *sb = e4b->bd_sb;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	unsigned int end;
	unsigned int next;
	ext4_group_t group;
	ext4_grpblk_t bit;
	unsigned long long grp_blk_start;
	int err = 0;
	int free = 0;

	BUG_ON(pa->pa_deleted == 0);
	ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
	grp_blk_start = pa->pa_pstart - EXT4_C2B(sbi, bit);
	BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
	end = bit + pa->pa_len;

	while (bit < end) {
		bit = mb_find_next_zero_bit(bitmap_bh->b_data, end, bit);
		if (bit >= end)
			break;
		next = mb_find_next_bit(bitmap_bh->b_data, end, bit);
		mb_debug(1, " free preallocated %u/%u in group %u\n",
			 (unsigned) ext4_group_first_block_no(sb, group) + bit,
			 (unsigned) next - bit, (unsigned) group);
		free += next - bit;

		trace_ext4_mballoc_discard(sb, NULL, group, bit, next - bit);
		trace_ext4_mb_release_inode_pa(pa, (grp_blk_start +
						    EXT4_C2B(sbi, bit)),
					       next - bit);
		mb_free_blocks(pa->pa_inode, e4b, bit, next - bit);
		bit = next + 1;
	}
	if (free != pa->pa_free) {
		ext4_msg(e4b->bd_sb, KERN_CRIT,
			 "pa %p: logic %lu, phys. %lu, len %lu",
			 pa, (unsigned long) pa->pa_lstart,
			 (unsigned long) pa->pa_pstart,
			 (unsigned long) pa->pa_len);
		ext4_grp_locked_error(sb, group, 0, 0, "free %u, pa_free %u",
					free, pa->pa_free);
		/*
		 * pa is already deleted so we use the value obtained
		 * from the bitmap and continue.
		 */
	}
	atomic_add(free, &sbi->s_mb_discarded);

	return err;
}

static noinline_for_stack int
ext4_mb_release_group_pa(struct ext4_buddy *e4b,
				struct ext4_prealloc_space *pa)
{
	struct super_block *sb = e4b->bd_sb;
	ext4_group_t group;
	ext4_grpblk_t bit;

	trace_ext4_mb_release_group_pa(sb, pa);
	BUG_ON(pa->pa_deleted == 0);
	ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
	BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
	mb_free_blocks(pa->pa_inode, e4b, bit, pa->pa_len);
	atomic_add(pa->pa_len, &EXT4_SB(sb)->s_mb_discarded);
	trace_ext4_mballoc_discard(sb, NULL, group, bit, pa->pa_len);

	return 0;
}

/*
 * releases all preallocations in given group
 *
 * first, we need to decide discard policy:
 * - when do we discard
 *   1) ENOSPC
 * - how many do we discard
 *   1) how many requested
 */
static noinline_for_stack int
ext4_mb_discard_group_preallocations(struct super_block *sb,
					ext4_group_t group, int needed)
{
	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
	struct buffer_head *bitmap_bh = NULL;
	struct ext4_prealloc_space *pa, *tmp;
	struct list_head list;
	struct ext4_buddy e4b;
	int err;
	int busy = 0;
	int free = 0;

	mb_debug(1, "discard preallocation for group %u\n", group);

	if (list_empty(&grp->bb_prealloc_list))
		return 0;

	bitmap_bh = ext4_read_block_bitmap(sb, group);
	if (bitmap_bh == NULL) {
		ext4_error(sb, "Error reading block bitmap for %u", group);
		return 0;
	}

	err = ext4_mb_load_buddy(sb, group, &e4b);
	if (err) {
		ext4_error(sb, "Error loading buddy information for %u", group);
		put_bh(bitmap_bh);
		return 0;
	}

	if (needed == 0)
		needed = EXT4_CLUSTERS_PER_GROUP(sb) + 1;

	INIT_LIST_HEAD(&list);
repeat:
	ext4_lock_group(sb, group);
	list_for_each_entry_safe(pa, tmp,
				&grp->bb_prealloc_list, pa_group_list) {
		spin_lock(&pa->pa_lock);
		if (atomic_read(&pa->pa_count)) {
			spin_unlock(&pa->pa_lock);
			busy = 1;
			continue;
		}
		if (pa->pa_deleted) {
			spin_unlock(&pa->pa_lock);
			continue;
		}

		/* seems this one can be freed ... */
		pa->pa_deleted = 1;

		/* we can trust pa_free ... */
		free += pa->pa_free;

		spin_unlock(&pa->pa_lock);

		list_del(&pa->pa_group_list);
		list_add(&pa->u.pa_tmp_list, &list);
	}

	/* if we still need more blocks and some PAs were used, try again */
	if (free < needed && busy) {
		busy = 0;
		ext4_unlock_group(sb, group);
		cond_resched();
		goto repeat;
	}

	/* found anything to free? */
	if (list_empty(&list)) {
		BUG_ON(free != 0);
		goto out;
	}

	/* now free all selected PAs */
	list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {

		/* remove from object (inode or locality group) */
		spin_lock(pa->pa_obj_lock);
		list_del_rcu(&pa->pa_inode_list);
		spin_unlock(pa->pa_obj_lock);

		if (pa->pa_type == MB_GROUP_PA)
			ext4_mb_release_group_pa(&e4b, pa);
		else
			ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);

		list_del(&pa->u.pa_tmp_list);
		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
	}

out:
	ext4_unlock_group(sb, group);
	ext4_mb_unload_buddy(&e4b);
	put_bh(bitmap_bh);
	return free;
}

/*
 * releases all non-used preallocated blocks for given inode
 *
 * It's important to discard preallocations under i_data_sem
 * We don't want another block to be served from the prealloc
 * space when we are discarding the inode prealloc space.
 *
 * FIXME!! Make sure it is valid at all the call sites
 */
void ext4_discard_preallocations(struct inode *inode)
{
	struct ext4_inode_info *ei = EXT4_I(inode);
	struct super_block *sb = inode->i_sb;
	struct buffer_head *bitmap_bh = NULL;
	struct ext4_prealloc_space *pa, *tmp;
	ext4_group_t group = 0;
	struct list_head list;
	struct ext4_buddy e4b;
	int err;

	if (!S_ISREG(inode->i_mode)) {
		/*BUG_ON(!list_empty(&ei->i_prealloc_list));*/
		return;
	}

	mb_debug(1, "discard preallocation for inode %lu\n", inode->i_ino);
	trace_ext4_discard_preallocations(inode);

	INIT_LIST_HEAD(&list);

repeat:
	/* first, collect all pa's in the inode */
	spin_lock(&ei->i_prealloc_lock);
	while (!list_empty(&ei->i_prealloc_list)) {
		pa = list_entry(ei->i_prealloc_list.next,
				struct ext4_prealloc_space, pa_inode_list);
		BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);
		spin_lock(&pa->pa_lock);
		if (atomic_read(&pa->pa_count)) {
			/* this shouldn't happen often - nobody should
			 * use preallocation while we're discarding it */
			spin_unlock(&pa->pa_lock);
			spin_unlock(&ei->i_prealloc_lock);
			ext4_msg(sb, KERN_ERR,
				 "uh-oh! used pa while discarding");
			WARN_ON(1);
			schedule_timeout_uninterruptible(HZ);
			goto repeat;

		}
		if (pa->pa_deleted == 0) {
			pa->pa_deleted = 1;
			spin_unlock(&pa->pa_lock);
			list_del_rcu(&pa->pa_inode_list);
			list_add(&pa->u.pa_tmp_list, &list);
			continue;
		}

		/* someone is deleting pa right now */
		spin_unlock(&pa->pa_lock);
		spin_unlock(&ei->i_prealloc_lock);

		/* we have to wait here because pa_deleted
		 * doesn't mean pa is already unlinked from
		 * the list. as we might be called from
		 * ->clear_inode() the inode will get freed
		 * and concurrent thread which is unlinking
		 * pa from inode's list may access already
		 * freed memory, bad-bad-bad */

		/* XXX: if this happens too often, we can
		 * add a flag to force wait only in case
		 * of ->clear_inode(), but not in case of
		 * regular truncate */
		schedule_timeout_uninterruptible(HZ);
		goto repeat;
	}
	spin_unlock(&ei->i_prealloc_lock);

	list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
		BUG_ON(pa->pa_type != MB_INODE_PA);
		group = ext4_get_group_number(sb, pa->pa_pstart);

		err = ext4_mb_load_buddy(sb, group, &e4b);
		if (err) {
			ext4_error(sb, "Error loading buddy information for %u",
					group);
			continue;
		}

		bitmap_bh = ext4_read_block_bitmap(sb, group);
		if (bitmap_bh == NULL) {
			ext4_error(sb, "Error reading block bitmap for %u",
					group);
			ext4_mb_unload_buddy(&e4b);
			continue;
		}

		ext4_lock_group(sb, group);
		list_del(&pa->pa_group_list);
		ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
		ext4_unlock_group(sb, group);

		ext4_mb_unload_buddy(&e4b);
		put_bh(bitmap_bh);

		list_del(&pa->u.pa_tmp_list);
		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
	}
}

#ifdef CONFIG_EXT4_DEBUG
static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
{
	struct super_block *sb = ac->ac_sb;
	ext4_group_t ngroups, i;

	if (!ext4_mballoc_debug ||
	    (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED))
		return;

	ext4_msg(ac->ac_sb, KERN_ERR, "Can't allocate:"
			" Allocation context details:");
	ext4_msg(ac->ac_sb, KERN_ERR, "status %d flags %d",
			ac->ac_status, ac->ac_flags);
	ext4_msg(ac->ac_sb, KERN_ERR, "orig %lu/%lu/%lu@%lu, "
			"goal %lu/%lu/%lu@%lu, "
			"best %lu/%lu/%lu@%lu cr %d",
			(unsigned long)ac->ac_o_ex.fe_group,
			(unsigned long)ac->ac_o_ex.fe_start,
			(unsigned long)ac->ac_o_ex.fe_len,
			(unsigned long)ac->ac_o_ex.fe_logical,
			(unsigned long)ac->ac_g_ex.fe_group,
			(unsigned long)ac->ac_g_ex.fe_start,
			(unsigned long)ac->ac_g_ex.fe_len,
			(unsigned long)ac->ac_g_ex.fe_logical,
			(unsigned long)ac->ac_b_ex.fe_group,
			(unsigned long)ac->ac_b_ex.fe_start,
			(unsigned long)ac->ac_b_ex.fe_len,
			(unsigned long)ac->ac_b_ex.fe_logical,
			(int)ac->ac_criteria);
	ext4_msg(ac->ac_sb, KERN_ERR, "%lu scanned, %d found",
		 ac->ac_ex_scanned, ac->ac_found);
	ext4_msg(ac->ac_sb, KERN_ERR, "groups: ");
	ngroups = ext4_get_groups_count(sb);
	for (i = 0; i < ngroups; i++) {
		struct ext4_group_info *grp = ext4_get_group_info(sb, i);
		struct ext4_prealloc_space *pa;
		ext4_grpblk_t start;
		struct list_head *cur;
		ext4_lock_group(sb, i);
		list_for_each(cur, &grp->bb_prealloc_list) {
			pa = list_entry(cur, struct ext4_prealloc_space,
					pa_group_list);
			spin_lock(&pa->pa_lock);
			ext4_get_group_no_and_offset(sb, pa->pa_pstart,
						     NULL, &start);
			spin_unlock(&pa->pa_lock);
			printk(KERN_ERR "PA:%u:%d:%u \n", i,
			       start, pa->pa_len);
		}
		ext4_unlock_group(sb, i);

		if (grp->bb_free == 0)
			continue;
		printk(KERN_ERR "%u: %d/%d \n",
		       i, grp->bb_free, grp->bb_fragments);
	}
	printk(KERN_ERR "\n");
}
#else
static inline void ext4_mb_show_ac(struct ext4_allocation_context *ac)
{
	return;
}
#endif

/*
 * We use locality group preallocation for small size file. The size of the
 * file is determined by the current size or the resulting size after
 * allocation which ever is larger
 *
 * One can tune this size via /sys/fs/ext4/<partition>/mb_stream_req
 */
static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
	int bsbits = ac->ac_sb->s_blocksize_bits;
	loff_t size, isize;

	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
		return;

	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
		return;

	size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
	isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
		>> bsbits;

	if ((size == isize) &&
	    !ext4_fs_is_busy(sbi) &&
	    (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
		ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
		return;
	}

	if (sbi->s_mb_group_prealloc <= 0) {
		ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
		return;
	}

	/* don't use group allocation for large files */
	size = max(size, isize);
	if (size > sbi->s_mb_stream_request) {
		ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
		return;
	}

	BUG_ON(ac->ac_lg != NULL);
	/*
	 * locality group prealloc space are per cpu. The reason for having
	 * per cpu locality group is to reduce the contention between block
	 * request from multiple CPUs.
	 */
	ac->ac_lg = __this_cpu_ptr(sbi->s_locality_groups);

	/* we're going to use group allocation */
	ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;

	/* serialize all allocations in the group */
	mutex_lock(&ac->ac_lg->lg_mutex);
}
4125 4127
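The policy above reduces to a size comparison against the mb_stream_req tunable: if the larger of the current file size and the size after this allocation exceeds the threshold, the request goes down the per-inode stream path, otherwise it uses the per-CPU locality group. The standalone userspace sketch below (illustrative types and values only, not part of this change) restates that decision in isolation.

/* Minimal sketch of the group-vs-stream policy; the 16-block threshold
 * is just an example value standing in for mb_stream_req. */
#include <stdio.h>

enum alloc_policy { ALLOC_STREAM, ALLOC_GROUP };

static enum alloc_policy pick_policy(unsigned long long request_end_blocks,
				     unsigned long long inode_size_blocks,
				     unsigned long long mb_stream_request)
{
	unsigned long long size = request_end_blocks > inode_size_blocks ?
				  request_end_blocks : inode_size_blocks;

	/* large files avoid group (locality-group) allocation */
	return size > mb_stream_request ? ALLOC_STREAM : ALLOC_GROUP;
}

int main(void)
{
	printf("small file -> %s\n",
	       pick_policy(8, 4, 16) == ALLOC_GROUP ? "group" : "stream");
	printf("large file -> %s\n",
	       pick_policy(64, 128, 16) == ALLOC_GROUP ? "group" : "stream");
	return 0;
}
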
static noinline_for_stack int
ext4_mb_initialize_context(struct ext4_allocation_context *ac,
				struct ext4_allocation_request *ar)
{
	struct super_block *sb = ar->inode->i_sb;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_super_block *es = sbi->s_es;
	ext4_group_t group;
	unsigned int len;
	ext4_fsblk_t goal;
	ext4_grpblk_t block;

	/* we can't allocate > group size */
	len = ar->len;

	/* just a dirty hack to filter too big requests */
	if (len >= EXT4_CLUSTERS_PER_GROUP(sb))
		len = EXT4_CLUSTERS_PER_GROUP(sb);

	/* start searching from the goal */
	goal = ar->goal;
	if (goal < le32_to_cpu(es->s_first_data_block) ||
			goal >= ext4_blocks_count(es))
		goal = le32_to_cpu(es->s_first_data_block);
	ext4_get_group_no_and_offset(sb, goal, &group, &block);

	/* set up allocation goals */
	ac->ac_b_ex.fe_logical = EXT4_LBLK_CMASK(sbi, ar->logical);
	ac->ac_status = AC_STATUS_CONTINUE;
	ac->ac_sb = sb;
	ac->ac_inode = ar->inode;
	ac->ac_o_ex.fe_logical = ac->ac_b_ex.fe_logical;
	ac->ac_o_ex.fe_group = group;
	ac->ac_o_ex.fe_start = block;
	ac->ac_o_ex.fe_len = len;
	ac->ac_g_ex = ac->ac_o_ex;
	ac->ac_flags = ar->flags;

	/* we have to define context: we'll we work with a file or
	 * locality group. this is a policy, actually */
	ext4_mb_group_or_file(ac);

	mb_debug(1, "init ac: %u blocks @ %u, goal %u, flags %x, 2^%d, "
			"left: %u/%u, right %u/%u to %swritable\n",
			(unsigned) ar->len, (unsigned) ar->logical,
			(unsigned) ar->goal, ac->ac_flags, ac->ac_2order,
			(unsigned) ar->lleft, (unsigned) ar->pleft,
			(unsigned) ar->lright, (unsigned) ar->pright,
			atomic_read(&ar->inode->i_writecount) ? "" : "non-");
	return 0;

}

static noinline_for_stack void
ext4_mb_discard_lg_preallocations(struct super_block *sb,
					struct ext4_locality_group *lg,
					int order, int total_entries)
{
	ext4_group_t group = 0;
	struct ext4_buddy e4b;
	struct list_head discard_list;
	struct ext4_prealloc_space *pa, *tmp;

	mb_debug(1, "discard locality group preallocation\n");

	INIT_LIST_HEAD(&discard_list);

	spin_lock(&lg->lg_prealloc_lock);
	list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[order],
						pa_inode_list) {
		spin_lock(&pa->pa_lock);
		if (atomic_read(&pa->pa_count)) {
			/*
			 * This is the pa that we just used
			 * for block allocation. So don't
			 * free that
			 */
			spin_unlock(&pa->pa_lock);
			continue;
		}
		if (pa->pa_deleted) {
			spin_unlock(&pa->pa_lock);
			continue;
		}
		/* only lg prealloc space */
		BUG_ON(pa->pa_type != MB_GROUP_PA);

		/* seems this one can be freed ... */
		pa->pa_deleted = 1;
		spin_unlock(&pa->pa_lock);

		list_del_rcu(&pa->pa_inode_list);
		list_add(&pa->u.pa_tmp_list, &discard_list);

		total_entries--;
		if (total_entries <= 5) {
			/*
			 * we want to keep only 5 entries
			 * allowing it to grow to 8. This
			 * mak sure we don't call discard
			 * soon for this list.
			 */
			break;
		}
	}
	spin_unlock(&lg->lg_prealloc_lock);

	list_for_each_entry_safe(pa, tmp, &discard_list, u.pa_tmp_list) {

		group = ext4_get_group_number(sb, pa->pa_pstart);
		if (ext4_mb_load_buddy(sb, group, &e4b)) {
			ext4_error(sb, "Error loading buddy information for %u",
					group);
			continue;
		}
		ext4_lock_group(sb, group);
		list_del(&pa->pa_group_list);
		ext4_mb_release_group_pa(&e4b, pa);
		ext4_unlock_group(sb, group);

		ext4_mb_unload_buddy(&e4b);
		list_del(&pa->u.pa_tmp_list);
		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
	}
}

/*
 * We have incremented pa_count. So it cannot be freed at this
 * point. Also we hold lg_mutex. So no parallel allocation is
 * possible from this lg. That means pa_free cannot be updated.
 *
 * A parallel ext4_mb_discard_group_preallocations is possible.
 * which can cause the lg_prealloc_list to be updated.
 */

static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
{
	int order, added = 0, lg_prealloc_count = 1;
	struct super_block *sb = ac->ac_sb;
	struct ext4_locality_group *lg = ac->ac_lg;
	struct ext4_prealloc_space *tmp_pa, *pa = ac->ac_pa;

	order = fls(pa->pa_free) - 1;
	if (order > PREALLOC_TB_SIZE - 1)
		/* The max size of hash table is PREALLOC_TB_SIZE */
		order = PREALLOC_TB_SIZE - 1;
	/* Add the prealloc space to lg */
	spin_lock(&lg->lg_prealloc_lock);
	list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order],
						pa_inode_list) {
		spin_lock(&tmp_pa->pa_lock);
		if (tmp_pa->pa_deleted) {
			spin_unlock(&tmp_pa->pa_lock);
			continue;
		}
		if (!added && pa->pa_free < tmp_pa->pa_free) {
			/* Add to the tail of the previous entry */
			list_add_tail_rcu(&pa->pa_inode_list,
						&tmp_pa->pa_inode_list);
			added = 1;
			/*
			 * we want to count the total
			 * number of entries in the list
			 */
		}
		spin_unlock(&tmp_pa->pa_lock);
		lg_prealloc_count++;
	}
	if (!added)
		list_add_tail_rcu(&pa->pa_inode_list,
					&lg->lg_prealloc_list[order]);
	spin_unlock(&lg->lg_prealloc_lock);

	/* Now trim the list to be not more than 8 elements */
	if (lg_prealloc_count > 8) {
		ext4_mb_discard_lg_preallocations(sb, lg,
						  order, lg_prealloc_count);
		return;
	}
	return ;
}

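The bucket a preallocation lands in is simply fls(pa_free) - 1 capped at PREALLOC_TB_SIZE - 1, i.e. the log2 of its remaining free size, so same-sized leftovers end up on the same per-order list. The userspace sketch below (the table size and fls helper are illustrative, not the kernel's own) shows the mapping.

/* Sketch of the order-bucket computation used for lg_prealloc_list. */
#include <stdio.h>

#define PREALLOC_TB_SIZE 10	/* example table size */

static int fls_u32(unsigned int x)	/* highest set bit, 1-based; 0 for x == 0 */
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

static int lg_bucket(unsigned int pa_free)
{
	int order = fls_u32(pa_free) - 1;

	if (order > PREALLOC_TB_SIZE - 1)
		order = PREALLOC_TB_SIZE - 1;	/* cap at the last bucket */
	return order;
}

int main(void)
{
	printf("pa_free=1    -> bucket %d\n", lg_bucket(1));	/* 0 */
	printf("pa_free=500  -> bucket %d\n", lg_bucket(500));	/* 8 */
	printf("pa_free=4096 -> bucket %d\n", lg_bucket(4096));	/* capped at 9 */
	return 0;
}
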
/*
 * release all resource we used in allocation
 */
static int ext4_mb_release_context(struct ext4_allocation_context *ac)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
	struct ext4_prealloc_space *pa = ac->ac_pa;
	if (pa) {
		if (pa->pa_type == MB_GROUP_PA) {
			/* see comment in ext4_mb_use_group_pa() */
			spin_lock(&pa->pa_lock);
			pa->pa_pstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
			pa->pa_lstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
			pa->pa_free -= ac->ac_b_ex.fe_len;
			pa->pa_len -= ac->ac_b_ex.fe_len;
			spin_unlock(&pa->pa_lock);
		}
	}
	if (pa) {
		/*
		 * We want to add the pa to the right bucket.
		 * Remove it from the list and while adding
		 * make sure the list to which we are adding
		 * doesn't grow big.
		 */
		if ((pa->pa_type == MB_GROUP_PA) && likely(pa->pa_free)) {
			spin_lock(pa->pa_obj_lock);
			list_del_rcu(&pa->pa_inode_list);
			spin_unlock(pa->pa_obj_lock);
			ext4_mb_add_n_trim(ac);
		}
		ext4_mb_put_pa(ac, ac->ac_sb, pa);
	}
	if (ac->ac_bitmap_page)
		page_cache_release(ac->ac_bitmap_page);
	if (ac->ac_buddy_page)
		page_cache_release(ac->ac_buddy_page);
	if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
		mutex_unlock(&ac->ac_lg->lg_mutex);
	ext4_mb_collect_stats(ac);
	return 0;
}

static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
{
	ext4_group_t i, ngroups = ext4_get_groups_count(sb);
	int ret;
	int freed = 0;

	trace_ext4_mb_discard_preallocations(sb, needed);
	for (i = 0; i < ngroups && needed > 0; i++) {
		ret = ext4_mb_discard_group_preallocations(sb, i, needed);
		freed += ret;
		needed -= ret;
	}

	return freed;
}

/*
 * Main entry point into mballoc to allocate blocks
 * it tries to use preallocation first, then falls back
 * to usual allocation
 */
ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
				struct ext4_allocation_request *ar, int *errp)
{
	int freed;
	struct ext4_allocation_context *ac = NULL;
	struct ext4_sb_info *sbi;
	struct super_block *sb;
	ext4_fsblk_t block = 0;
	unsigned int inquota = 0;
	unsigned int reserv_clstrs = 0;

	might_sleep();
	sb = ar->inode->i_sb;
	sbi = EXT4_SB(sb);

	trace_ext4_request_blocks(ar);

	/* Allow to use superuser reservation for quota file */
	if (IS_NOQUOTA(ar->inode))
		ar->flags |= EXT4_MB_USE_ROOT_BLOCKS;

	/*
	 * For delayed allocation, we could skip the ENOSPC and
	 * EDQUOT check, as blocks and quotas have been already
	 * reserved when data being copied into pagecache.
	 */
	if (ext4_test_inode_state(ar->inode, EXT4_STATE_DELALLOC_RESERVED))
		ar->flags |= EXT4_MB_DELALLOC_RESERVED;
	else {
		/* Without delayed allocation we need to verify
		 * there is enough free blocks to do block allocation
		 * and verify allocation doesn't exceed the quota limits.
		 */
		while (ar->len &&
			ext4_claim_free_clusters(sbi, ar->len, ar->flags)) {

			/* let others to free the space */
			cond_resched();
			ar->len = ar->len >> 1;
		}
		if (!ar->len) {
			*errp = -ENOSPC;
			return 0;
		}
		reserv_clstrs = ar->len;
		if (ar->flags & EXT4_MB_USE_ROOT_BLOCKS) {
			dquot_alloc_block_nofail(ar->inode,
						 EXT4_C2B(sbi, ar->len));
		} else {
			while (ar->len &&
				dquot_alloc_block(ar->inode,
						  EXT4_C2B(sbi, ar->len))) {

				ar->flags |= EXT4_MB_HINT_NOPREALLOC;
				ar->len--;
			}
		}
		inquota = ar->len;
		if (ar->len == 0) {
			*errp = -EDQUOT;
			goto out;
		}
	}

	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_NOFS);
	if (!ac) {
		ar->len = 0;
		*errp = -ENOMEM;
		goto out;
	}

	*errp = ext4_mb_initialize_context(ac, ar);
	if (*errp) {
		ar->len = 0;
		goto out;
	}

	ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
	if (!ext4_mb_use_preallocated(ac)) {
		ac->ac_op = EXT4_MB_HISTORY_ALLOC;
		ext4_mb_normalize_request(ac, ar);
repeat:
		/* allocate space in core */
		*errp = ext4_mb_regular_allocator(ac);
		if (*errp)
			goto discard_and_exit;

		/* as we've just preallocated more space than
		 * user requested originally, we store allocated
		 * space in a special descriptor */
		if (ac->ac_status == AC_STATUS_FOUND &&
		    ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
			*errp = ext4_mb_new_preallocation(ac);
		if (*errp) {
		discard_and_exit:
			ext4_discard_allocated_blocks(ac);
			goto errout;
		}
	}
	if (likely(ac->ac_status == AC_STATUS_FOUND)) {
		*errp = ext4_mb_mark_diskspace_used(ac, handle, reserv_clstrs);
		if (*errp == -EAGAIN) {
			/*
			 * drop the reference that we took
			 * in ext4_mb_use_best_found
			 */
			ext4_mb_release_context(ac);
			ac->ac_b_ex.fe_group = 0;
			ac->ac_b_ex.fe_start = 0;
			ac->ac_b_ex.fe_len = 0;
			ac->ac_status = AC_STATUS_CONTINUE;
			goto repeat;
		} else if (*errp) {
			ext4_discard_allocated_blocks(ac);
			goto errout;
		} else {
			block = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
			ar->len = ac->ac_b_ex.fe_len;
		}
	} else {
		freed  = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
		if (freed)
			goto repeat;
		*errp = -ENOSPC;
	}

errout:
	if (*errp) {
		ac->ac_b_ex.fe_len = 0;
		ar->len = 0;
		ext4_mb_show_ac(ac);
	}
	ext4_mb_release_context(ac);
out:
	if (ac)
		kmem_cache_free(ext4_ac_cachep, ac);
	if (inquota && ar->len < inquota)
		dquot_free_block(ar->inode, EXT4_C2B(sbi, inquota - ar->len));
	if (!ar->len) {
		if (!ext4_test_inode_state(ar->inode,
					   EXT4_STATE_DELALLOC_RESERVED))
			/* release all the reserved blocks if non delalloc */
			percpu_counter_sub(&sbi->s_dirtyclusters_counter,
						reserv_clstrs);
	}

	trace_ext4_allocate_blocks(ar, (unsigned long long)block);

	return block;
}

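When the non-delalloc path cannot claim the requested number of clusters, it keeps halving the request before giving up with -ENOSPC; only a request that shrinks all the way to zero fails. A small standalone model of that back-off follows (claim() is a stand-in for ext4_claim_free_clusters(), not the kernel helper itself).

/* Model of the "halve ar->len until it can be reserved" loop above. */
#include <stdio.h>

static int claim(unsigned int free_clusters, unsigned int want)
{
	return want <= free_clusters ? 0 : -1;	/* 0 on success, like the kernel helper */
}

static unsigned int shrink_request(unsigned int free_clusters, unsigned int len)
{
	while (len && claim(free_clusters, len))
		len >>= 1;	/* let others free space, then retry smaller */
	return len;		/* 0 means the caller sees -ENOSPC */
}

int main(void)
{
	printf("want 64 with 10 free -> get %u\n", shrink_request(10, 64));	/* 8 */
	printf("want 64 with 0 free  -> get %u\n", shrink_request(0, 64));	/* 0 */
	return 0;
}
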
/*
 * We can merge two free data extents only if the physical blocks
 * are contiguous, AND the extents were freed by the same transaction,
 * AND the blocks are associated with the same group.
 */
static int can_merge(struct ext4_free_data *entry1,
			struct ext4_free_data *entry2)
{
	if ((entry1->efd_tid == entry2->efd_tid) &&
	    (entry1->efd_group == entry2->efd_group) &&
	    ((entry1->efd_start_cluster + entry1->efd_count) == entry2->efd_start_cluster))
		return 1;
	return 0;
}

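The merge rule spelled out in the comment is a three-way AND: same freeing transaction, same block group, physically adjacent. The userspace restatement below uses a stripped-down stand-in for struct ext4_free_data (field names are illustrative) to show which pairs would coalesce.

/* Userspace restatement of the can_merge() condition. */
#include <stdio.h>

struct free_extent {
	unsigned int tid;	/* transaction that freed it */
	unsigned int group;	/* block group */
	unsigned int start;	/* first cluster */
	unsigned int count;	/* number of clusters */
};

static int extents_mergeable(const struct free_extent *a, const struct free_extent *b)
{
	return a->tid == b->tid &&
	       a->group == b->group &&
	       a->start + a->count == b->start;
}

int main(void)
{
	struct free_extent left  = { .tid = 7, .group = 3, .start = 100, .count = 8 };
	struct free_extent right = { .tid = 7, .group = 3, .start = 108, .count = 4 };
	struct free_extent other = { .tid = 9, .group = 3, .start = 112, .count = 4 };

	printf("left+right:  %s\n", extents_mergeable(&left, &right) ? "merge" : "keep apart");
	printf("right+other: %s\n", extents_mergeable(&right, &other) ? "merge" : "keep apart");
	return 0;
}
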
static noinline_for_stack int
ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
		      struct ext4_free_data *new_entry)
{
	ext4_group_t group = e4b->bd_group;
	ext4_grpblk_t cluster;
	struct ext4_free_data *entry;
	struct ext4_group_info *db = e4b->bd_info;
	struct super_block *sb = e4b->bd_sb;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct rb_node **n = &db->bb_free_root.rb_node, *node;
	struct rb_node *parent = NULL, *new_node;

	BUG_ON(!ext4_handle_valid(handle));
	BUG_ON(e4b->bd_bitmap_page == NULL);
	BUG_ON(e4b->bd_buddy_page == NULL);

	new_node = &new_entry->efd_node;
	cluster = new_entry->efd_start_cluster;

	if (!*n) {
		/* first free block exent. We need to
		   protect buddy cache from being freed,
		 * otherwise we'll refresh it from
		 * on-disk bitmap and lose not-yet-available
		 * blocks */
		page_cache_get(e4b->bd_buddy_page);
		page_cache_get(e4b->bd_bitmap_page);
	}
	while (*n) {
		parent = *n;
		entry = rb_entry(parent, struct ext4_free_data, efd_node);
		if (cluster < entry->efd_start_cluster)
			n = &(*n)->rb_left;
		else if (cluster >= (entry->efd_start_cluster + entry->efd_count))
			n = &(*n)->rb_right;
		else {
			ext4_grp_locked_error(sb, group, 0,
				ext4_group_first_block_no(sb, group) +
				EXT4_C2B(sbi, cluster),
				"Block already on to-be-freed list");
			return 0;
		}
	}

	rb_link_node(new_node, parent, n);
	rb_insert_color(new_node, &db->bb_free_root);

	/* Now try to see the extent can be merged to left and right */
	node = rb_prev(new_node);
	if (node) {
		entry = rb_entry(node, struct ext4_free_data, efd_node);
		if (can_merge(entry, new_entry) &&
		    ext4_journal_callback_try_del(handle, &entry->efd_jce)) {
			new_entry->efd_start_cluster = entry->efd_start_cluster;
			new_entry->efd_count += entry->efd_count;
			rb_erase(node, &(db->bb_free_root));
			kmem_cache_free(ext4_free_data_cachep, entry);
		}
	}

	node = rb_next(new_node);
	if (node) {
		entry = rb_entry(node, struct ext4_free_data, efd_node);
		if (can_merge(new_entry, entry) &&
		    ext4_journal_callback_try_del(handle, &entry->efd_jce)) {
			new_entry->efd_count += entry->efd_count;
			rb_erase(node, &(db->bb_free_root));
			kmem_cache_free(ext4_free_data_cachep, entry);
		}
	}
	/* Add the extent to transaction's private list */
	ext4_journal_callback_add(handle, ext4_free_data_callback,
				  &new_entry->efd_jce);
	return 0;
}

/**
 * ext4_free_blocks() -- Free given blocks and update quota
 * @handle:		handle for this transaction
 * @inode:		inode
 * @block:		start physical block to free
 * @count:		number of blocks to count
 * @flags:		flags used by ext4_free_blocks
 */
void ext4_free_blocks(handle_t *handle, struct inode *inode,
		      struct buffer_head *bh, ext4_fsblk_t block,
		      unsigned long count, int flags)
{
	struct buffer_head *bitmap_bh = NULL;
	struct super_block *sb = inode->i_sb;
	struct ext4_group_desc *gdp;
	unsigned int overflow;
	ext4_grpblk_t bit;
	struct buffer_head *gd_bh;
	ext4_group_t block_group;
	struct ext4_sb_info *sbi;
	struct ext4_inode_info *ei = EXT4_I(inode);
	struct ext4_buddy e4b;
	unsigned int count_clusters;
	int err = 0;
	int ret;

	might_sleep();
	if (bh) {
		if (block)
			BUG_ON(block != bh->b_blocknr);
		else
			block = bh->b_blocknr;
	}

	sbi = EXT4_SB(sb);
	if (!(flags & EXT4_FREE_BLOCKS_VALIDATED) &&
	    !ext4_data_block_valid(sbi, block, count)) {
		ext4_error(sb, "Freeing blocks not in datazone - "
			   "block = %llu, count = %lu", block, count);
		goto error_return;
	}

	ext4_debug("freeing block %llu\n", block);
	trace_ext4_free_blocks(inode, block, count, flags);

	if (flags & EXT4_FREE_BLOCKS_FORGET) {
		struct buffer_head *tbh = bh;
		int i;

		BUG_ON(bh && (count > 1));

		for (i = 0; i < count; i++) {
			cond_resched();
			if (!bh)
				tbh = sb_find_get_block(inode->i_sb,
							block + i);
			if (!tbh)
				continue;
			ext4_forget(handle, flags & EXT4_FREE_BLOCKS_METADATA,
				    inode, tbh, block + i);
		}
	}

	/*
	 * We need to make sure we don't reuse the freed block until
	 * after the transaction is committed, which we can do by
	 * treating the block as metadata, below. We make an
	 * exception if the inode is to be written in writeback mode
	 * since writeback mode has weak data consistency guarantees.
	 */
	if (!ext4_should_writeback_data(inode))
		flags |= EXT4_FREE_BLOCKS_METADATA;

	/*
	 * If the extent to be freed does not begin on a cluster
	 * boundary, we need to deal with partial clusters at the
	 * beginning and end of the extent. Normally we will free
	 * blocks at the beginning or the end unless we are explicitly
	 * requested to avoid doing so.
	 */
	overflow = EXT4_PBLK_COFF(sbi, block);
	if (overflow) {
		if (flags & EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER) {
			overflow = sbi->s_cluster_ratio - overflow;
			block += overflow;
			if (count > overflow)
				count -= overflow;
			else
				return;
		} else {
			block -= overflow;
			count += overflow;
		}
	}
	overflow = EXT4_LBLK_COFF(sbi, count);
	if (overflow) {
		if (flags & EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER) {
			if (count > overflow)
				count -= overflow;
			else
				return;
		} else
			count += sbi->s_cluster_ratio - overflow;
	}

do_more:
	overflow = 0;
	ext4_get_group_no_and_offset(sb, block, &block_group, &bit);

	if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(
			ext4_get_group_info(sb, block_group))))
		return;

	/*
	 * Check to see if we are freeing blocks across a group
	 * boundary.
	 */
	if (EXT4_C2B(sbi, bit) + count > EXT4_BLOCKS_PER_GROUP(sb)) {
		overflow = EXT4_C2B(sbi, bit) + count -
			EXT4_BLOCKS_PER_GROUP(sb);
		count -= overflow;
	}
	count_clusters = EXT4_NUM_B2C(sbi, count);
	bitmap_bh = ext4_read_block_bitmap(sb, block_group);
	if (!bitmap_bh) {
		err = -EIO;
		goto error_return;
	}
	gdp = ext4_get_group_desc(sb, block_group, &gd_bh);
	if (!gdp) {
		err = -EIO;
		goto error_return;
	}

	if (in_range(ext4_block_bitmap(sb, gdp), block, count) ||
	    in_range(ext4_inode_bitmap(sb, gdp), block, count) ||
	    in_range(block, ext4_inode_table(sb, gdp),
		     EXT4_SB(sb)->s_itb_per_group) ||
	    in_range(block + count - 1, ext4_inode_table(sb, gdp),
		     EXT4_SB(sb)->s_itb_per_group)) {

		ext4_error(sb, "Freeing blocks in system zone - "
			   "Block = %llu, count = %lu", block, count);
		/* err = 0. ext4_std_error should be a no op */
		goto error_return;
	}

	BUFFER_TRACE(bitmap_bh, "getting write access");
	err = ext4_journal_get_write_access(handle, bitmap_bh);
	if (err)
		goto error_return;

	/*
	 * We are about to modify some metadata. Call the journal APIs
	 * to unshare ->b_data if a currently-committing transaction is
	 * using it
	 */
	BUFFER_TRACE(gd_bh, "get_write_access");
	err = ext4_journal_get_write_access(handle, gd_bh);
	if (err)
		goto error_return;
#ifdef AGGRESSIVE_CHECK
	{
		int i;
		for (i = 0; i < count_clusters; i++)
			BUG_ON(!mb_test_bit(bit + i, bitmap_bh->b_data));
	}
#endif
	trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);

	err = ext4_mb_load_buddy(sb, block_group, &e4b);
	if (err)
		goto error_return;

	if ((flags & EXT4_FREE_BLOCKS_METADATA) && ext4_handle_valid(handle)) {
		struct ext4_free_data *new_entry;
		/*
		 * blocks being freed are metadata. these blocks shouldn't
		 * be used until this transaction is committed
		 */
	retry:
		new_entry = kmem_cache_alloc(ext4_free_data_cachep, GFP_NOFS);
		if (!new_entry) {
			/*
			 * We use a retry loop because
			 * ext4_free_blocks() is not allowed to fail.
			 */
			cond_resched();
			congestion_wait(BLK_RW_ASYNC, HZ/50);
			goto retry;
		}
		new_entry->efd_start_cluster = bit;
		new_entry->efd_group = block_group;
		new_entry->efd_count = count_clusters;
		new_entry->efd_tid = handle->h_transaction->t_tid;

		ext4_lock_group(sb, block_group);
		mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
		ext4_mb_free_metadata(handle, &e4b, new_entry);
	} else {
		/* need to update group_info->bb_free and bitmap
		 * with group lock held. generate_buddy look at
		 * them with group lock_held
		 */
		if (test_opt(sb, DISCARD)) {
			err = ext4_issue_discard(sb, block_group, bit, count);
			if (err && err != -EOPNOTSUPP)
				ext4_msg(sb, KERN_WARNING, "discard request in"
					 " group:%d block:%d count:%lu failed"
					 " with %d", block_group, bit, count,
					 err);
		} else
			EXT4_MB_GRP_CLEAR_TRIMMED(e4b.bd_info);

		ext4_lock_group(sb, block_group);
		mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
		mb_free_blocks(inode, &e4b, bit, count_clusters);
	}

	ret = ext4_free_group_clusters(sb, gdp) + count_clusters;
	ext4_free_group_clusters_set(sb, gdp, ret);
	ext4_block_bitmap_csum_set(sb, block_group, gdp, bitmap_bh);
	ext4_group_desc_csum_set(sb, block_group, gdp);
	ext4_unlock_group(sb, block_group);

	if (sbi->s_log_groups_per_flex) {
		ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
		atomic64_add(count_clusters,
			     &sbi->s_flex_groups[flex_group].free_clusters);
	}

	if (flags & EXT4_FREE_BLOCKS_RESERVE && ei->i_reserved_data_blocks) {
		percpu_counter_add(&sbi->s_dirtyclusters_counter,
				   count_clusters);
		spin_lock(&ei->i_block_reservation_lock);
		if (flags & EXT4_FREE_BLOCKS_METADATA)
			ei->i_reserved_meta_blocks += count_clusters;
		else
			ei->i_reserved_data_blocks += count_clusters;
		spin_unlock(&ei->i_block_reservation_lock);
		if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE))
			dquot_reclaim_block(inode,
					EXT4_C2B(sbi, count_clusters));
	} else if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE))
		dquot_free_block(inode, EXT4_C2B(sbi, count_clusters));
	percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters);

	ext4_mb_unload_buddy(&e4b);

	/* We dirtied the bitmap block */
	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
	err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);

	/* And the group descriptor block */
	BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
	ret = ext4_handle_dirty_metadata(handle, NULL, gd_bh);
	if (!err)
		err = ret;

	if (overflow && !err) {
		block += count;
		count = overflow;
		put_bh(bitmap_bh);
		goto do_more;
	}
error_return:
	brelse(bitmap_bh);
	ext4_std_error(sb, err);
	return;
}

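The partial-cluster handling at the top of ext4_free_blocks() is pure arithmetic: when bigalloc packs several blocks per cluster, an extent that does not start or end on a cluster boundary is widened so whole clusters are freed, unless the caller passed the NOFREE_FIRST/LAST_CLUSTER flags. The sketch below models only the widening branch, with a cluster_ratio of 4 as an example; it is an illustration of the math, not the kernel code path with its flag handling.

/* Arithmetic model of rounding a freed extent out to cluster boundaries. */
#include <stdio.h>

static void widen_to_clusters(unsigned long long *block, unsigned long *count,
			      unsigned int cluster_ratio)
{
	unsigned int off = *block % cluster_ratio;	/* like EXT4_PBLK_COFF() */

	if (off) {			/* round the start down to a cluster */
		*block -= off;
		*count += off;
	}
	off = *count % cluster_ratio;	/* like EXT4_LBLK_COFF() on the length */
	if (off)			/* round the length up to whole clusters */
		*count += cluster_ratio - off;
}

int main(void)
{
	unsigned long long block = 1026;	/* starts 2 blocks into a cluster */
	unsigned long count = 5;

	widen_to_clusters(&block, &count, 4);
	printf("freeing blocks %llu..%llu (%lu blocks, whole clusters)\n",
	       block, block + count - 1, count);	/* 1024..1031, 8 blocks */
	return 0;
}
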
/**
 * ext4_group_add_blocks() -- Add given blocks to an existing group
 * @handle:		handle to this transaction
 * @sb:			super block
 * @block:		start physical block to add to the block group
 * @count:		number of blocks to free
 *
 * This marks the blocks as free in the bitmap and buddy.
 */
int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
			 ext4_fsblk_t block, unsigned long count)
{
	struct buffer_head *bitmap_bh = NULL;
	struct buffer_head *gd_bh;
	ext4_group_t block_group;
	ext4_grpblk_t bit;
	unsigned int i;
	struct ext4_group_desc *desc;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_buddy e4b;
	int err = 0, ret, blk_free_count;
	ext4_grpblk_t blocks_freed;

	ext4_debug("Adding block(s) %llu-%llu\n", block, block + count - 1);

	if (count == 0)
		return 0;

	ext4_get_group_no_and_offset(sb, block, &block_group, &bit);
	/*
	 * Check to see if we are freeing blocks across a group
	 * boundary.
	 */
	if (bit + count > EXT4_BLOCKS_PER_GROUP(sb)) {
		ext4_warning(sb, "too much blocks added to group %u\n",
			     block_group);
		err = -EINVAL;
		goto error_return;
	}

	bitmap_bh = ext4_read_block_bitmap(sb, block_group);
	if (!bitmap_bh) {
		err = -EIO;
		goto error_return;
	}

	desc = ext4_get_group_desc(sb, block_group, &gd_bh);
	if (!desc) {
		err = -EIO;
		goto error_return;
	}

	if (in_range(ext4_block_bitmap(sb, desc), block, count) ||
	    in_range(ext4_inode_bitmap(sb, desc), block, count) ||
	    in_range(block, ext4_inode_table(sb, desc), sbi->s_itb_per_group) ||
	    in_range(block + count - 1, ext4_inode_table(sb, desc),
		     sbi->s_itb_per_group)) {
		ext4_error(sb, "Adding blocks in system zones - "
			   "Block = %llu, count = %lu",
			   block, count);
		err = -EINVAL;
		goto error_return;
	}

	BUFFER_TRACE(bitmap_bh, "getting write access");
	err = ext4_journal_get_write_access(handle, bitmap_bh);
	if (err)
		goto error_return;

	/*
	 * We are about to modify some metadata. Call the journal APIs
	 * to unshare ->b_data if a currently-committing transaction is
	 * using it
	 */
	BUFFER_TRACE(gd_bh, "get_write_access");
	err = ext4_journal_get_write_access(handle, gd_bh);
	if (err)
		goto error_return;

	for (i = 0, blocks_freed = 0; i < count; i++) {
		BUFFER_TRACE(bitmap_bh, "clear bit");
		if (!mb_test_bit(bit + i, bitmap_bh->b_data)) {
			ext4_error(sb, "bit already cleared for block %llu",
				   (ext4_fsblk_t)(block + i));
			BUFFER_TRACE(bitmap_bh, "bit already cleared");
		} else {
			blocks_freed++;
		}
	}

	err = ext4_mb_load_buddy(sb, block_group, &e4b);
	if (err)
		goto error_return;

	/*
	 * need to update group_info->bb_free and bitmap
	 * with group lock held. generate_buddy look at
	 * them with group lock_held
	 */
	ext4_lock_group(sb, block_group);
	mb_clear_bits(bitmap_bh->b_data, bit, count);
	mb_free_blocks(NULL, &e4b, bit, count);
	blk_free_count = blocks_freed + ext4_free_group_clusters(sb, desc);
	ext4_free_group_clusters_set(sb, desc, blk_free_count);
	ext4_block_bitmap_csum_set(sb, block_group, desc, bitmap_bh);
	ext4_group_desc_csum_set(sb, block_group, desc);
	ext4_unlock_group(sb, block_group);
	percpu_counter_add(&sbi->s_freeclusters_counter,
			   EXT4_NUM_B2C(sbi, blocks_freed));

	if (sbi->s_log_groups_per_flex) {
		ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
		atomic64_add(EXT4_NUM_B2C(sbi, blocks_freed),
			     &sbi->s_flex_groups[flex_group].free_clusters);
	}

	ext4_mb_unload_buddy(&e4b);

	/* We dirtied the bitmap block */
	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
	err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);

	/* And the group descriptor block */
	BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
	ret = ext4_handle_dirty_metadata(handle, NULL, gd_bh);
	if (!err)
		err = ret;

error_return:
	brelse(bitmap_bh);
	ext4_std_error(sb, err);
	return err;
}

5020 /** 5022 /**
5021 * ext4_trim_extent -- function to TRIM one single free extent in the group 5023 * ext4_trim_extent -- function to TRIM one single free extent in the group
5022 * @sb: super block for the file system 5024 * @sb: super block for the file system
5023 * @start: starting block of the free extent in the alloc. group 5025 * @start: starting block of the free extent in the alloc. group
5024 * @count: number of blocks to TRIM 5026 * @count: number of blocks to TRIM
5025 * @group: alloc. group we are working with 5027 * @group: alloc. group we are working with
5026 * @e4b: ext4 buddy for the group 5028 * @e4b: ext4 buddy for the group
5027 * 5029 *
5028 * Trim "count" blocks starting at "start" in the "group". To assure that no 5030 * Trim "count" blocks starting at "start" in the "group". To assure that no
5029 * one will allocate those blocks, mark them as used in the buddy bitmap. This must 5031 * one will allocate those blocks, mark them as used in the buddy bitmap. This must
5030 * be called under the group lock. 5032 * be called under the group lock.
5031 */ 5033 */
5032 static int ext4_trim_extent(struct super_block *sb, int start, int count, 5034 static int ext4_trim_extent(struct super_block *sb, int start, int count,
5033 ext4_group_t group, struct ext4_buddy *e4b) 5035 ext4_group_t group, struct ext4_buddy *e4b)
5034 { 5036 {
5035 struct ext4_free_extent ex; 5037 struct ext4_free_extent ex;
5036 int ret = 0; 5038 int ret = 0;
5037 5039
5038 trace_ext4_trim_extent(sb, group, start, count); 5040 trace_ext4_trim_extent(sb, group, start, count);
5039 5041
5040 assert_spin_locked(ext4_group_lock_ptr(sb, group)); 5042 assert_spin_locked(ext4_group_lock_ptr(sb, group));
5041 5043
5042 ex.fe_start = start; 5044 ex.fe_start = start;
5043 ex.fe_group = group; 5045 ex.fe_group = group;
5044 ex.fe_len = count; 5046 ex.fe_len = count;
5045 5047
5046 /* 5048 /*
5047 * Mark blocks used, so no one can reuse them while 5049 * Mark blocks used, so no one can reuse them while
5048 * being trimmed. 5050 * being trimmed.
5049 */ 5051 */
5050 mb_mark_used(e4b, &ex); 5052 mb_mark_used(e4b, &ex);
5051 ext4_unlock_group(sb, group); 5053 ext4_unlock_group(sb, group);
5052 ret = ext4_issue_discard(sb, group, start, count); 5054 ret = ext4_issue_discard(sb, group, start, count);
5053 ext4_lock_group(sb, group); 5055 ext4_lock_group(sb, group);
5054 mb_free_blocks(NULL, e4b, start, ex.fe_len); 5056 mb_free_blocks(NULL, e4b, start, ex.fe_len);
5055 return ret; 5057 return ret;
5056 } 5058 }
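A note on the locking pattern above: ext4_trim_extent() marks the extent as in use in the buddy bitmap while holding the group lock (assert_spin_locked shows it is a spinlock), drops the lock around ext4_issue_discard(), which performs blocking I/O, and re-takes it before handing the blocks back via mb_free_blocks(). Concurrent allocators therefore cannot hand out blocks while they are being discarded, yet the discard itself never runs under the spinlock.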
5057 5059
5058 /** 5060 /**
5059 * ext4_trim_all_free -- function to trim all free space in alloc. group 5061 * ext4_trim_all_free -- function to trim all free space in alloc. group
5060 * @sb: super block for file system 5062 * @sb: super block for file system
5061 * @group: group to be trimmed 5063 * @group: group to be trimmed
5062 * @start: first group block to examine 5064 * @start: first group block to examine
5063 * @max: last group block to examine 5065 * @max: last group block to examine
5064 * @minblocks: minimum extent block count 5066 * @minblocks: minimum extent block count
5065 * 5067 *
5066 * ext4_trim_all_free walks through group's buddy bitmap searching for free 5068 * ext4_trim_all_free walks through group's buddy bitmap searching for free
5067 * extents. When a free extent is found, ext4_trim_extent is called to TRIM 5069 * extents. When a free extent is found, ext4_trim_extent is called to TRIM
5068 * the extent. 5070 * the extent.
5069 * 5071 *
5070 * 5072 *
5071 * ext4_trim_all_free walks through group's block bitmap searching for free 5073 * ext4_trim_all_free walks through group's block bitmap searching for free
5072 * extents. When the free extent is found, mark it as used in group buddy 5074 * extents. When the free extent is found, mark it as used in group buddy
5073 * bitmap. Then issue a TRIM command on this extent and free the extent in 5075 * bitmap. Then issue a TRIM command on this extent and free the extent in
5074 * the group buddy bitmap. This is done until whole group is scanned. 5076 * the group buddy bitmap. This is done until whole group is scanned.
5075 */ 5077 */
5076 static ext4_grpblk_t 5078 static ext4_grpblk_t
5077 ext4_trim_all_free(struct super_block *sb, ext4_group_t group, 5079 ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
5078 ext4_grpblk_t start, ext4_grpblk_t max, 5080 ext4_grpblk_t start, ext4_grpblk_t max,
5079 ext4_grpblk_t minblocks) 5081 ext4_grpblk_t minblocks)
5080 { 5082 {
5081 void *bitmap; 5083 void *bitmap;
5082 ext4_grpblk_t next, count = 0, free_count = 0; 5084 ext4_grpblk_t next, count = 0, free_count = 0;
5083 struct ext4_buddy e4b; 5085 struct ext4_buddy e4b;
5084 int ret = 0; 5086 int ret = 0;
5085 5087
5086 trace_ext4_trim_all_free(sb, group, start, max); 5088 trace_ext4_trim_all_free(sb, group, start, max);
5087 5089
5088 ret = ext4_mb_load_buddy(sb, group, &e4b); 5090 ret = ext4_mb_load_buddy(sb, group, &e4b);
5089 if (ret) { 5091 if (ret) {
5090 ext4_error(sb, "Error in loading buddy " 5092 ext4_error(sb, "Error in loading buddy "
5091 "information for %u", group); 5093 "information for %u", group);
5092 return ret; 5094 return ret;
5093 } 5095 }
5094 bitmap = e4b.bd_bitmap; 5096 bitmap = e4b.bd_bitmap;
5095 5097
5096 ext4_lock_group(sb, group); 5098 ext4_lock_group(sb, group);
5097 if (EXT4_MB_GRP_WAS_TRIMMED(e4b.bd_info) && 5099 if (EXT4_MB_GRP_WAS_TRIMMED(e4b.bd_info) &&
5098 minblocks >= atomic_read(&EXT4_SB(sb)->s_last_trim_minblks)) 5100 minblocks >= atomic_read(&EXT4_SB(sb)->s_last_trim_minblks))
5099 goto out; 5101 goto out;
5100 5102
5101 start = (e4b.bd_info->bb_first_free > start) ? 5103 start = (e4b.bd_info->bb_first_free > start) ?
5102 e4b.bd_info->bb_first_free : start; 5104 e4b.bd_info->bb_first_free : start;
5103 5105
5104 while (start <= max) { 5106 while (start <= max) {
5105 start = mb_find_next_zero_bit(bitmap, max + 1, start); 5107 start = mb_find_next_zero_bit(bitmap, max + 1, start);
5106 if (start > max) 5108 if (start > max)
5107 break; 5109 break;
5108 next = mb_find_next_bit(bitmap, max + 1, start); 5110 next = mb_find_next_bit(bitmap, max + 1, start);
5109 5111
5110 if ((next - start) >= minblocks) { 5112 if ((next - start) >= minblocks) {
5111 ret = ext4_trim_extent(sb, start, 5113 ret = ext4_trim_extent(sb, start,
5112 next - start, group, &e4b); 5114 next - start, group, &e4b);
5113 if (ret && ret != -EOPNOTSUPP) 5115 if (ret && ret != -EOPNOTSUPP)
5114 break; 5116 break;
5115 ret = 0; 5117 ret = 0;
5116 count += next - start; 5118 count += next - start;
5117 } 5119 }
5118 free_count += next - start; 5120 free_count += next - start;
5119 start = next + 1; 5121 start = next + 1;
5120 5122
5121 if (fatal_signal_pending(current)) { 5123 if (fatal_signal_pending(current)) {
5122 count = -ERESTARTSYS; 5124 count = -ERESTARTSYS;
5123 break; 5125 break;
5124 } 5126 }
5125 5127
5126 if (need_resched()) { 5128 if (need_resched()) {
5127 ext4_unlock_group(sb, group); 5129 ext4_unlock_group(sb, group);
5128 cond_resched(); 5130 cond_resched();
5129 ext4_lock_group(sb, group); 5131 ext4_lock_group(sb, group);
5130 } 5132 }
5131 5133
5132 if ((e4b.bd_info->bb_free - free_count) < minblocks) 5134 if ((e4b.bd_info->bb_free - free_count) < minblocks)
5133 break; 5135 break;
5134 } 5136 }
5135 5137
5136 if (!ret) { 5138 if (!ret) {
5137 ret = count; 5139 ret = count;
5138 EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info); 5140 EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info);
5139 } 5141 }
5140 out: 5142 out:
5141 ext4_unlock_group(sb, group); 5143 ext4_unlock_group(sb, group);
5142 ext4_mb_unload_buddy(&e4b); 5144 ext4_mb_unload_buddy(&e4b);
5143 5145
5144 ext4_debug("trimmed %d blocks in the group %d\n", 5146 ext4_debug("trimmed %d blocks in the group %d\n",
5145 count, group); 5147 count, group);
5146 5148
5147 return ret; 5149 return ret;
5148 } 5150 }
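Note that the EXT4_MB_GRP_WAS_TRIMMED / s_last_trim_minblks check near the top of ext4_trim_all_free() lets the whole group be skipped when it has already been trimmed with an equal or finer minimum extent length since the flag was last cleared, so repeated FITRIM requests do not re-discard the same free space.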
5149 5151
5150 /** 5152 /**
5151 * ext4_trim_fs() -- trim ioctl handle function 5153 * ext4_trim_fs() -- trim ioctl handle function
5152 * @sb: superblock for filesystem 5154 * @sb: superblock for filesystem
5153 * @range: fstrim_range structure 5155 * @range: fstrim_range structure
5154 * 5156 *
5155 * start: First Byte to trim 5157 * start: First Byte to trim
5156 * len: number of Bytes to trim from start 5158 * len: number of Bytes to trim from start
5157 * minlen: minimum extent length in Bytes 5159 * minlen: minimum extent length in Bytes
5158 * ext4_trim_fs goes through all allocation groups containing Bytes from 5160 * ext4_trim_fs goes through all allocation groups containing Bytes from
5159 * start to start+len. For each such group the ext4_trim_all_free function 5161 * start to start+len. For each such group the ext4_trim_all_free function
5160 * is invoked to trim all free space. 5162 * is invoked to trim all free space.
5161 */ 5163 */
5162 int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range) 5164 int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
5163 { 5165 {
5164 struct ext4_group_info *grp; 5166 struct ext4_group_info *grp;
5165 ext4_group_t group, first_group, last_group; 5167 ext4_group_t group, first_group, last_group;
5166 ext4_grpblk_t cnt = 0, first_cluster, last_cluster; 5168 ext4_grpblk_t cnt = 0, first_cluster, last_cluster;
5167 uint64_t start, end, minlen, trimmed = 0; 5169 uint64_t start, end, minlen, trimmed = 0;
5168 ext4_fsblk_t first_data_blk = 5170 ext4_fsblk_t first_data_blk =
5169 le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block); 5171 le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
5170 ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es); 5172 ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es);
5171 int ret = 0; 5173 int ret = 0;
5172 5174
5173 start = range->start >> sb->s_blocksize_bits; 5175 start = range->start >> sb->s_blocksize_bits;
5174 end = start + (range->len >> sb->s_blocksize_bits) - 1; 5176 end = start + (range->len >> sb->s_blocksize_bits) - 1;
5175 minlen = EXT4_NUM_B2C(EXT4_SB(sb), 5177 minlen = EXT4_NUM_B2C(EXT4_SB(sb),
5176 range->minlen >> sb->s_blocksize_bits); 5178 range->minlen >> sb->s_blocksize_bits);
5177 5179
5178 if (minlen > EXT4_CLUSTERS_PER_GROUP(sb) || 5180 if (minlen > EXT4_CLUSTERS_PER_GROUP(sb) ||
5179 start >= max_blks || 5181 start >= max_blks ||
5180 range->len < sb->s_blocksize) 5182 range->len < sb->s_blocksize)
5181 return -EINVAL; 5183 return -EINVAL;
5182 if (end >= max_blks) 5184 if (end >= max_blks)
5183 end = max_blks - 1; 5185 end = max_blks - 1;
5184 if (end <= first_data_blk) 5186 if (end <= first_data_blk)
5185 goto out; 5187 goto out;
5186 if (start < first_data_blk) 5188 if (start < first_data_blk)
5187 start = first_data_blk; 5189 start = first_data_blk;
5188 5190
5189 /* Determine first and last group to examine based on start and end */ 5191 /* Determine first and last group to examine based on start and end */
5190 ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) start, 5192 ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) start,
5191 &first_group, &first_cluster); 5193 &first_group, &first_cluster);
5192 ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) end, 5194 ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) end,
5193 &last_group, &last_cluster); 5195 &last_group, &last_cluster);
5194 5196
5195 /* end now represents the last cluster to discard in this group */ 5197 /* end now represents the last cluster to discard in this group */
5196 end = EXT4_CLUSTERS_PER_GROUP(sb) - 1; 5198 end = EXT4_CLUSTERS_PER_GROUP(sb) - 1;
5197 5199
5198 for (group = first_group; group <= last_group; group++) { 5200 for (group = first_group; group <= last_group; group++) {
5199 grp = ext4_get_group_info(sb, group); 5201 grp = ext4_get_group_info(sb, group);
5200 /* We only do this if the grp has never been initialized */ 5202 /* We only do this if the grp has never been initialized */
5201 if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { 5203 if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
5202 ret = ext4_mb_init_group(sb, group); 5204 ret = ext4_mb_init_group(sb, group);
5203 if (ret) 5205 if (ret)
5204 break; 5206 break;
5205 } 5207 }
5206 5208
5207 /* 5209 /*
5208 * For all the groups except the last one, last cluster will 5210 * For all the groups except the last one, last cluster will
5209 * always be EXT4_CLUSTERS_PER_GROUP(sb)-1, so we only need to 5211 * always be EXT4_CLUSTERS_PER_GROUP(sb)-1, so we only need to
5210 * change it for the last group, note that last_cluster is 5212 * change it for the last group, note that last_cluster is
5211 * already computed earlier by ext4_get_group_no_and_offset() 5213 * already computed earlier by ext4_get_group_no_and_offset()
5212 */ 5214 */
5213 if (group == last_group) 5215 if (group == last_group)
5214 end = last_cluster; 5216 end = last_cluster;
5215 5217
5216 if (grp->bb_free >= minlen) { 5218 if (grp->bb_free >= minlen) {
5217 cnt = ext4_trim_all_free(sb, group, first_cluster, 5219 cnt = ext4_trim_all_free(sb, group, first_cluster,
5218 end, minlen); 5220 end, minlen);
5219 if (cnt < 0) { 5221 if (cnt < 0) {
5220 ret = cnt; 5222 ret = cnt;
5221 break; 5223 break;
5222 } 5224 }
5223 trimmed += cnt; 5225 trimmed += cnt;
5224 } 5226 }
5225 5227
5226 /* 5228 /*
5227 * For every group except the first one, we are sure 5229 * For every group except the first one, we are sure
5228 * that the first cluster to discard will be cluster #0. 5230 * that the first cluster to discard will be cluster #0.
5229 */ 5231 */
5230 first_cluster = 0; 5232 first_cluster = 0;
5231 } 5233 }
5232 5234
5233 if (!ret) 5235 if (!ret)
5234 atomic_set(&EXT4_SB(sb)->s_last_trim_minblks, minlen); 5236 atomic_set(&EXT4_SB(sb)->s_last_trim_minblks, minlen);
5235 5237
5236 out: 5238 out:
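For reference, a minimal userspace sketch (not part of this patch) of how ext4_trim_fs() is reached through the FITRIM ioctl. The mount point "/mnt" is a hypothetical example, and start/len/minlen are given in bytes as the doc comment above describes:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

	int main(void)
	{
		struct fstrim_range range;
		int fd = open("/mnt", O_RDONLY);	/* hypothetical ext4 mount point */

		if (fd < 0)
			return 1;

		memset(&range, 0, sizeof(range));
		range.start = 0;		/* first byte to trim */
		range.len = (__u64)-1;		/* trim up to the end of the filesystem */
		range.minlen = 0;		/* no minimum free-extent length */

		if (ioctl(fd, FITRIM, &range) < 0)
			perror("FITRIM");
		else	/* on success the kernel reports the trimmed byte count in range.len */
			printf("%llu bytes trimmed\n", (unsigned long long)range.len);

		close(fd);
		return 0;
	}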
fs/f2fs/checkpoint.c
1 /* 1 /*
2 * fs/f2fs/checkpoint.c 2 * fs/f2fs/checkpoint.c
3 * 3 *
4 * Copyright (c) 2012 Samsung Electronics Co., Ltd. 4 * Copyright (c) 2012 Samsung Electronics Co., Ltd.
5 * http://www.samsung.com/ 5 * http://www.samsung.com/
6 * 6 *
7 * This program is free software; you can redistribute it and/or modify 7 * This program is free software; you can redistribute it and/or modify
8 * it under the terms of the GNU General Public License version 2 as 8 * it under the terms of the GNU General Public License version 2 as
9 * published by the Free Software Foundation. 9 * published by the Free Software Foundation.
10 */ 10 */
11 #include <linux/fs.h> 11 #include <linux/fs.h>
12 #include <linux/bio.h> 12 #include <linux/bio.h>
13 #include <linux/mpage.h> 13 #include <linux/mpage.h>
14 #include <linux/writeback.h> 14 #include <linux/writeback.h>
15 #include <linux/blkdev.h> 15 #include <linux/blkdev.h>
16 #include <linux/f2fs_fs.h> 16 #include <linux/f2fs_fs.h>
17 #include <linux/pagevec.h> 17 #include <linux/pagevec.h>
18 #include <linux/swap.h> 18 #include <linux/swap.h>
19 19
20 #include "f2fs.h" 20 #include "f2fs.h"
21 #include "node.h" 21 #include "node.h"
22 #include "segment.h" 22 #include "segment.h"
23 #include <trace/events/f2fs.h> 23 #include <trace/events/f2fs.h>
24 24
25 static struct kmem_cache *orphan_entry_slab; 25 static struct kmem_cache *orphan_entry_slab;
26 static struct kmem_cache *inode_entry_slab; 26 static struct kmem_cache *inode_entry_slab;
27 27
28 /* 28 /*
29 * We guarantee no failure on the returned page. 29 * We guarantee no failure on the returned page.
30 */ 30 */
31 struct page *grab_meta_page(struct f2fs_sb_info *sbi, pgoff_t index) 31 struct page *grab_meta_page(struct f2fs_sb_info *sbi, pgoff_t index)
32 { 32 {
33 struct address_space *mapping = sbi->meta_inode->i_mapping; 33 struct address_space *mapping = sbi->meta_inode->i_mapping;
34 struct page *page = NULL; 34 struct page *page = NULL;
35 repeat: 35 repeat:
36 page = grab_cache_page(mapping, index); 36 page = grab_cache_page(mapping, index);
37 if (!page) { 37 if (!page) {
38 cond_resched(); 38 cond_resched();
39 goto repeat; 39 goto repeat;
40 } 40 }
41 41
42 /* We wait writeback only inside grab_meta_page() */ 42 /* We wait writeback only inside grab_meta_page() */
43 wait_on_page_writeback(page); 43 wait_on_page_writeback(page);
44 SetPageUptodate(page); 44 SetPageUptodate(page);
45 return page; 45 return page;
46 } 46 }
47 47
48 /* 48 /*
49 * We guarantee no failure on the returned page. 49 * We guarantee no failure on the returned page.
50 */ 50 */
51 struct page *get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index) 51 struct page *get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index)
52 { 52 {
53 struct address_space *mapping = sbi->meta_inode->i_mapping; 53 struct address_space *mapping = sbi->meta_inode->i_mapping;
54 struct page *page; 54 struct page *page;
55 repeat: 55 repeat:
56 page = grab_cache_page(mapping, index); 56 page = grab_cache_page(mapping, index);
57 if (!page) { 57 if (!page) {
58 cond_resched(); 58 cond_resched();
59 goto repeat; 59 goto repeat;
60 } 60 }
61 if (PageUptodate(page)) 61 if (PageUptodate(page))
62 goto out; 62 goto out;
63 63
64 if (f2fs_readpage(sbi, page, index, READ_SYNC)) 64 if (f2fs_readpage(sbi, page, index, READ_SYNC))
65 goto repeat; 65 goto repeat;
66 66
67 lock_page(page); 67 lock_page(page);
68 if (page->mapping != mapping) { 68 if (page->mapping != mapping) {
69 f2fs_put_page(page, 1); 69 f2fs_put_page(page, 1);
70 goto repeat; 70 goto repeat;
71 } 71 }
72 out: 72 out:
73 mark_page_accessed(page);
74 return page; 73 return page;
75 } 74 }
76 75
77 static int f2fs_write_meta_page(struct page *page, 76 static int f2fs_write_meta_page(struct page *page,
78 struct writeback_control *wbc) 77 struct writeback_control *wbc)
79 { 78 {
80 struct inode *inode = page->mapping->host; 79 struct inode *inode = page->mapping->host;
81 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 80 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
82 81
83 /* Should not write any meta pages if any IO error occurred */ 82 /* Should not write any meta pages if any IO error occurred */
84 if (wbc->for_reclaim || 83 if (wbc->for_reclaim ||
85 is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ERROR_FLAG)) { 84 is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ERROR_FLAG)) {
86 dec_page_count(sbi, F2FS_DIRTY_META); 85 dec_page_count(sbi, F2FS_DIRTY_META);
87 wbc->pages_skipped++; 86 wbc->pages_skipped++;
88 set_page_dirty(page); 87 set_page_dirty(page);
89 return AOP_WRITEPAGE_ACTIVATE; 88 return AOP_WRITEPAGE_ACTIVATE;
90 } 89 }
91 90
92 wait_on_page_writeback(page); 91 wait_on_page_writeback(page);
93 92
94 write_meta_page(sbi, page); 93 write_meta_page(sbi, page);
95 dec_page_count(sbi, F2FS_DIRTY_META); 94 dec_page_count(sbi, F2FS_DIRTY_META);
96 unlock_page(page); 95 unlock_page(page);
97 return 0; 96 return 0;
98 } 97 }
99 98
100 static int f2fs_write_meta_pages(struct address_space *mapping, 99 static int f2fs_write_meta_pages(struct address_space *mapping,
101 struct writeback_control *wbc) 100 struct writeback_control *wbc)
102 { 101 {
103 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); 102 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
104 struct block_device *bdev = sbi->sb->s_bdev; 103 struct block_device *bdev = sbi->sb->s_bdev;
105 long written; 104 long written;
106 105
107 if (wbc->for_kupdate) 106 if (wbc->for_kupdate)
108 return 0; 107 return 0;
109 108
110 if (get_pages(sbi, F2FS_DIRTY_META) == 0) 109 if (get_pages(sbi, F2FS_DIRTY_META) == 0)
111 return 0; 110 return 0;
112 111
113 /* if mounting failed, skip writing node pages */ 112 /* if mounting failed, skip writing node pages */
114 mutex_lock(&sbi->cp_mutex); 113 mutex_lock(&sbi->cp_mutex);
115 written = sync_meta_pages(sbi, META, bio_get_nr_vecs(bdev)); 114 written = sync_meta_pages(sbi, META, bio_get_nr_vecs(bdev));
116 mutex_unlock(&sbi->cp_mutex); 115 mutex_unlock(&sbi->cp_mutex);
117 wbc->nr_to_write -= written; 116 wbc->nr_to_write -= written;
118 return 0; 117 return 0;
119 } 118 }
120 119
121 long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type, 120 long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type,
122 long nr_to_write) 121 long nr_to_write)
123 { 122 {
124 struct address_space *mapping = sbi->meta_inode->i_mapping; 123 struct address_space *mapping = sbi->meta_inode->i_mapping;
125 pgoff_t index = 0, end = LONG_MAX; 124 pgoff_t index = 0, end = LONG_MAX;
126 struct pagevec pvec; 125 struct pagevec pvec;
127 long nwritten = 0; 126 long nwritten = 0;
128 struct writeback_control wbc = { 127 struct writeback_control wbc = {
129 .for_reclaim = 0, 128 .for_reclaim = 0,
130 }; 129 };
131 130
132 pagevec_init(&pvec, 0); 131 pagevec_init(&pvec, 0);
133 132
134 while (index <= end) { 133 while (index <= end) {
135 int i, nr_pages; 134 int i, nr_pages;
136 nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, 135 nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
137 PAGECACHE_TAG_DIRTY, 136 PAGECACHE_TAG_DIRTY,
138 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); 137 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
139 if (nr_pages == 0) 138 if (nr_pages == 0)
140 break; 139 break;
141 140
142 for (i = 0; i < nr_pages; i++) { 141 for (i = 0; i < nr_pages; i++) {
143 struct page *page = pvec.pages[i]; 142 struct page *page = pvec.pages[i];
144 lock_page(page); 143 lock_page(page);
145 BUG_ON(page->mapping != mapping); 144 BUG_ON(page->mapping != mapping);
146 BUG_ON(!PageDirty(page)); 145 BUG_ON(!PageDirty(page));
147 clear_page_dirty_for_io(page); 146 clear_page_dirty_for_io(page);
148 if (f2fs_write_meta_page(page, &wbc)) { 147 if (f2fs_write_meta_page(page, &wbc)) {
149 unlock_page(page); 148 unlock_page(page);
150 break; 149 break;
151 } 150 }
152 if (nwritten++ >= nr_to_write) 151 if (nwritten++ >= nr_to_write)
153 break; 152 break;
154 } 153 }
155 pagevec_release(&pvec); 154 pagevec_release(&pvec);
156 cond_resched(); 155 cond_resched();
157 } 156 }
158 157
159 if (nwritten) 158 if (nwritten)
160 f2fs_submit_bio(sbi, type, nr_to_write == LONG_MAX); 159 f2fs_submit_bio(sbi, type, nr_to_write == LONG_MAX);
161 160
162 return nwritten; 161 return nwritten;
163 } 162 }
164 163
165 static int f2fs_set_meta_page_dirty(struct page *page) 164 static int f2fs_set_meta_page_dirty(struct page *page)
166 { 165 {
167 struct address_space *mapping = page->mapping; 166 struct address_space *mapping = page->mapping;
168 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); 167 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
169 168
170 SetPageUptodate(page); 169 SetPageUptodate(page);
171 if (!PageDirty(page)) { 170 if (!PageDirty(page)) {
172 __set_page_dirty_nobuffers(page); 171 __set_page_dirty_nobuffers(page);
173 inc_page_count(sbi, F2FS_DIRTY_META); 172 inc_page_count(sbi, F2FS_DIRTY_META);
174 return 1; 173 return 1;
175 } 174 }
176 return 0; 175 return 0;
177 } 176 }
178 177
179 const struct address_space_operations f2fs_meta_aops = { 178 const struct address_space_operations f2fs_meta_aops = {
180 .writepage = f2fs_write_meta_page, 179 .writepage = f2fs_write_meta_page,
181 .writepages = f2fs_write_meta_pages, 180 .writepages = f2fs_write_meta_pages,
182 .set_page_dirty = f2fs_set_meta_page_dirty, 181 .set_page_dirty = f2fs_set_meta_page_dirty,
183 }; 182 };
184 183
185 int acquire_orphan_inode(struct f2fs_sb_info *sbi) 184 int acquire_orphan_inode(struct f2fs_sb_info *sbi)
186 { 185 {
187 unsigned int max_orphans; 186 unsigned int max_orphans;
188 int err = 0; 187 int err = 0;
189 188
190 /* 189 /*
191 * considering 512 blocks in a segment, 5 blocks are needed for the cp 190 * considering 512 blocks in a segment, 5 blocks are needed for the cp
192 * and log segment summaries. The remaining blocks are used to keep 191 * and log segment summaries. The remaining blocks are used to keep
193 * orphan entries. With one segment reserved for the 192 * orphan entries. With one segment reserved for the
194 * cp pack, we can have at most 1020*507 orphan entries 193 * cp pack, we can have at most 1020*507 orphan entries
195 */ 194 */
196 max_orphans = (sbi->blocks_per_seg - 5) * F2FS_ORPHANS_PER_BLOCK; 195 max_orphans = (sbi->blocks_per_seg - 5) * F2FS_ORPHANS_PER_BLOCK;
197 mutex_lock(&sbi->orphan_inode_mutex); 196 mutex_lock(&sbi->orphan_inode_mutex);
198 if (sbi->n_orphans >= max_orphans) 197 if (sbi->n_orphans >= max_orphans)
199 err = -ENOSPC; 198 err = -ENOSPC;
200 else 199 else
201 sbi->n_orphans++; 200 sbi->n_orphans++;
202 mutex_unlock(&sbi->orphan_inode_mutex); 201 mutex_unlock(&sbi->orphan_inode_mutex);
203 return err; 202 return err;
204 } 203 }
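Worked out with the numbers the comment above assumes (512 blocks per segment, 5 of them needed for the cp and summary blocks, F2FS_ORPHANS_PER_BLOCK = 1020), the limit is max_orphans = (512 - 5) * 1020 = 507 * 1020 = 517,140 entries, matching the 1020*507 figure quoted in the comment.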
205 204
206 void release_orphan_inode(struct f2fs_sb_info *sbi) 205 void release_orphan_inode(struct f2fs_sb_info *sbi)
207 { 206 {
208 mutex_lock(&sbi->orphan_inode_mutex); 207 mutex_lock(&sbi->orphan_inode_mutex);
209 sbi->n_orphans--; 208 sbi->n_orphans--;
210 mutex_unlock(&sbi->orphan_inode_mutex); 209 mutex_unlock(&sbi->orphan_inode_mutex);
211 } 210 }
212 211
213 void add_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) 212 void add_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
214 { 213 {
215 struct list_head *head, *this; 214 struct list_head *head, *this;
216 struct orphan_inode_entry *new = NULL, *orphan = NULL; 215 struct orphan_inode_entry *new = NULL, *orphan = NULL;
217 216
218 mutex_lock(&sbi->orphan_inode_mutex); 217 mutex_lock(&sbi->orphan_inode_mutex);
219 head = &sbi->orphan_inode_list; 218 head = &sbi->orphan_inode_list;
220 list_for_each(this, head) { 219 list_for_each(this, head) {
221 orphan = list_entry(this, struct orphan_inode_entry, list); 220 orphan = list_entry(this, struct orphan_inode_entry, list);
222 if (orphan->ino == ino) 221 if (orphan->ino == ino)
223 goto out; 222 goto out;
224 if (orphan->ino > ino) 223 if (orphan->ino > ino)
225 break; 224 break;
226 orphan = NULL; 225 orphan = NULL;
227 } 226 }
228 retry: 227 retry:
229 new = kmem_cache_alloc(orphan_entry_slab, GFP_ATOMIC); 228 new = kmem_cache_alloc(orphan_entry_slab, GFP_ATOMIC);
230 if (!new) { 229 if (!new) {
231 cond_resched(); 230 cond_resched();
232 goto retry; 231 goto retry;
233 } 232 }
234 new->ino = ino; 233 new->ino = ino;
235 234
236 /* add new_oentry into list which is sorted by inode number */ 235 /* add new_oentry into list which is sorted by inode number */
237 if (orphan) 236 if (orphan)
238 list_add(&new->list, this->prev); 237 list_add(&new->list, this->prev);
239 else 238 else
240 list_add_tail(&new->list, head); 239 list_add_tail(&new->list, head);
241 out: 240 out:
242 mutex_unlock(&sbi->orphan_inode_mutex); 241 mutex_unlock(&sbi->orphan_inode_mutex);
243 } 242 }
244 243
245 void remove_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) 244 void remove_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
246 { 245 {
247 struct list_head *head; 246 struct list_head *head;
248 struct orphan_inode_entry *orphan; 247 struct orphan_inode_entry *orphan;
249 248
250 mutex_lock(&sbi->orphan_inode_mutex); 249 mutex_lock(&sbi->orphan_inode_mutex);
251 head = &sbi->orphan_inode_list; 250 head = &sbi->orphan_inode_list;
252 list_for_each_entry(orphan, head, list) { 251 list_for_each_entry(orphan, head, list) {
253 if (orphan->ino == ino) { 252 if (orphan->ino == ino) {
254 list_del(&orphan->list); 253 list_del(&orphan->list);
255 kmem_cache_free(orphan_entry_slab, orphan); 254 kmem_cache_free(orphan_entry_slab, orphan);
256 sbi->n_orphans--; 255 sbi->n_orphans--;
257 break; 256 break;
258 } 257 }
259 } 258 }
260 mutex_unlock(&sbi->orphan_inode_mutex); 259 mutex_unlock(&sbi->orphan_inode_mutex);
261 } 260 }
262 261
263 static void recover_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) 262 static void recover_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
264 { 263 {
265 struct inode *inode = f2fs_iget(sbi->sb, ino); 264 struct inode *inode = f2fs_iget(sbi->sb, ino);
266 BUG_ON(IS_ERR(inode)); 265 BUG_ON(IS_ERR(inode));
267 clear_nlink(inode); 266 clear_nlink(inode);
268 267
269 /* truncate all the data during iput */ 268 /* truncate all the data during iput */
270 iput(inode); 269 iput(inode);
271 } 270 }
272 271
273 int recover_orphan_inodes(struct f2fs_sb_info *sbi) 272 int recover_orphan_inodes(struct f2fs_sb_info *sbi)
274 { 273 {
275 block_t start_blk, orphan_blkaddr, i, j; 274 block_t start_blk, orphan_blkaddr, i, j;
276 275
277 if (!is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG)) 276 if (!is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG))
278 return 0; 277 return 0;
279 278
280 sbi->por_doing = 1; 279 sbi->por_doing = 1;
281 start_blk = __start_cp_addr(sbi) + 1; 280 start_blk = __start_cp_addr(sbi) + 1;
282 orphan_blkaddr = __start_sum_addr(sbi) - 1; 281 orphan_blkaddr = __start_sum_addr(sbi) - 1;
283 282
284 for (i = 0; i < orphan_blkaddr; i++) { 283 for (i = 0; i < orphan_blkaddr; i++) {
285 struct page *page = get_meta_page(sbi, start_blk + i); 284 struct page *page = get_meta_page(sbi, start_blk + i);
286 struct f2fs_orphan_block *orphan_blk; 285 struct f2fs_orphan_block *orphan_blk;
287 286
288 orphan_blk = (struct f2fs_orphan_block *)page_address(page); 287 orphan_blk = (struct f2fs_orphan_block *)page_address(page);
289 for (j = 0; j < le32_to_cpu(orphan_blk->entry_count); j++) { 288 for (j = 0; j < le32_to_cpu(orphan_blk->entry_count); j++) {
290 nid_t ino = le32_to_cpu(orphan_blk->ino[j]); 289 nid_t ino = le32_to_cpu(orphan_blk->ino[j]);
291 recover_orphan_inode(sbi, ino); 290 recover_orphan_inode(sbi, ino);
292 } 291 }
293 f2fs_put_page(page, 1); 292 f2fs_put_page(page, 1);
294 } 293 }
295 /* clear Orphan Flag */ 294 /* clear Orphan Flag */
296 clear_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG); 295 clear_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG);
297 sbi->por_doing = 0; 296 sbi->por_doing = 0;
298 return 0; 297 return 0;
299 } 298 }
300 299
301 static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk) 300 static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk)
302 { 301 {
303 struct list_head *head, *this, *next; 302 struct list_head *head, *this, *next;
304 struct f2fs_orphan_block *orphan_blk = NULL; 303 struct f2fs_orphan_block *orphan_blk = NULL;
305 struct page *page = NULL; 304 struct page *page = NULL;
306 unsigned int nentries = 0; 305 unsigned int nentries = 0;
307 unsigned short index = 1; 306 unsigned short index = 1;
308 unsigned short orphan_blocks; 307 unsigned short orphan_blocks;
309 308
310 orphan_blocks = (unsigned short)((sbi->n_orphans + 309 orphan_blocks = (unsigned short)((sbi->n_orphans +
311 (F2FS_ORPHANS_PER_BLOCK - 1)) / F2FS_ORPHANS_PER_BLOCK); 310 (F2FS_ORPHANS_PER_BLOCK - 1)) / F2FS_ORPHANS_PER_BLOCK);
312 311
313 mutex_lock(&sbi->orphan_inode_mutex); 312 mutex_lock(&sbi->orphan_inode_mutex);
314 head = &sbi->orphan_inode_list; 313 head = &sbi->orphan_inode_list;
315 314
316 /* loop over each orphan inode entry and write them into the journal block */ 315 /* loop over each orphan inode entry and write them into the journal block */
317 list_for_each_safe(this, next, head) { 316 list_for_each_safe(this, next, head) {
318 struct orphan_inode_entry *orphan; 317 struct orphan_inode_entry *orphan;
319 318
320 orphan = list_entry(this, struct orphan_inode_entry, list); 319 orphan = list_entry(this, struct orphan_inode_entry, list);
321 320
322 if (nentries == F2FS_ORPHANS_PER_BLOCK) { 321 if (nentries == F2FS_ORPHANS_PER_BLOCK) {
323 /* 322 /*
324 * an orphan block is full (1020 entries), 323 * an orphan block is full (1020 entries),
325 * so we need to flush the current orphan block 324 * so we need to flush the current orphan block
326 * and bring another one into memory 325 * and bring another one into memory
327 */ 326 */
328 orphan_blk->blk_addr = cpu_to_le16(index); 327 orphan_blk->blk_addr = cpu_to_le16(index);
329 orphan_blk->blk_count = cpu_to_le16(orphan_blocks); 328 orphan_blk->blk_count = cpu_to_le16(orphan_blocks);
330 orphan_blk->entry_count = cpu_to_le32(nentries); 329 orphan_blk->entry_count = cpu_to_le32(nentries);
331 set_page_dirty(page); 330 set_page_dirty(page);
332 f2fs_put_page(page, 1); 331 f2fs_put_page(page, 1);
333 index++; 332 index++;
334 start_blk++; 333 start_blk++;
335 nentries = 0; 334 nentries = 0;
336 page = NULL; 335 page = NULL;
337 } 336 }
338 if (page) 337 if (page)
339 goto page_exist; 338 goto page_exist;
340 339
341 page = grab_meta_page(sbi, start_blk); 340 page = grab_meta_page(sbi, start_blk);
342 orphan_blk = (struct f2fs_orphan_block *)page_address(page); 341 orphan_blk = (struct f2fs_orphan_block *)page_address(page);
343 memset(orphan_blk, 0, sizeof(*orphan_blk)); 342 memset(orphan_blk, 0, sizeof(*orphan_blk));
344 page_exist: 343 page_exist:
345 orphan_blk->ino[nentries++] = cpu_to_le32(orphan->ino); 344 orphan_blk->ino[nentries++] = cpu_to_le32(orphan->ino);
346 } 345 }
347 if (!page) 346 if (!page)
348 goto end; 347 goto end;
349 348
350 orphan_blk->blk_addr = cpu_to_le16(index); 349 orphan_blk->blk_addr = cpu_to_le16(index);
351 orphan_blk->blk_count = cpu_to_le16(orphan_blocks); 350 orphan_blk->blk_count = cpu_to_le16(orphan_blocks);
352 orphan_blk->entry_count = cpu_to_le32(nentries); 351 orphan_blk->entry_count = cpu_to_le32(nentries);
353 set_page_dirty(page); 352 set_page_dirty(page);
354 f2fs_put_page(page, 1); 353 f2fs_put_page(page, 1);
355 end: 354 end:
356 mutex_unlock(&sbi->orphan_inode_mutex); 355 mutex_unlock(&sbi->orphan_inode_mutex);
357 } 356 }
358 357
359 static struct page *validate_checkpoint(struct f2fs_sb_info *sbi, 358 static struct page *validate_checkpoint(struct f2fs_sb_info *sbi,
360 block_t cp_addr, unsigned long long *version) 359 block_t cp_addr, unsigned long long *version)
361 { 360 {
362 struct page *cp_page_1, *cp_page_2 = NULL; 361 struct page *cp_page_1, *cp_page_2 = NULL;
363 unsigned long blk_size = sbi->blocksize; 362 unsigned long blk_size = sbi->blocksize;
364 struct f2fs_checkpoint *cp_block; 363 struct f2fs_checkpoint *cp_block;
365 unsigned long long cur_version = 0, pre_version = 0; 364 unsigned long long cur_version = 0, pre_version = 0;
366 size_t crc_offset; 365 size_t crc_offset;
367 __u32 crc = 0; 366 __u32 crc = 0;
368 367
369 /* Read the 1st cp block in this CP pack */ 368 /* Read the 1st cp block in this CP pack */
370 cp_page_1 = get_meta_page(sbi, cp_addr); 369 cp_page_1 = get_meta_page(sbi, cp_addr);
371 370
372 /* get the version number */ 371 /* get the version number */
373 cp_block = (struct f2fs_checkpoint *)page_address(cp_page_1); 372 cp_block = (struct f2fs_checkpoint *)page_address(cp_page_1);
374 crc_offset = le32_to_cpu(cp_block->checksum_offset); 373 crc_offset = le32_to_cpu(cp_block->checksum_offset);
375 if (crc_offset >= blk_size) 374 if (crc_offset >= blk_size)
376 goto invalid_cp1; 375 goto invalid_cp1;
377 376
378 crc = le32_to_cpu(*((__u32 *)((unsigned char *)cp_block + crc_offset))); 377 crc = le32_to_cpu(*((__u32 *)((unsigned char *)cp_block + crc_offset)));
379 if (!f2fs_crc_valid(crc, cp_block, crc_offset)) 378 if (!f2fs_crc_valid(crc, cp_block, crc_offset))
380 goto invalid_cp1; 379 goto invalid_cp1;
381 380
382 pre_version = cur_cp_version(cp_block); 381 pre_version = cur_cp_version(cp_block);
383 382
384 /* Read the 2nd cp block in this CP pack */ 383 /* Read the 2nd cp block in this CP pack */
385 cp_addr += le32_to_cpu(cp_block->cp_pack_total_block_count) - 1; 384 cp_addr += le32_to_cpu(cp_block->cp_pack_total_block_count) - 1;
386 cp_page_2 = get_meta_page(sbi, cp_addr); 385 cp_page_2 = get_meta_page(sbi, cp_addr);
387 386
388 cp_block = (struct f2fs_checkpoint *)page_address(cp_page_2); 387 cp_block = (struct f2fs_checkpoint *)page_address(cp_page_2);
389 crc_offset = le32_to_cpu(cp_block->checksum_offset); 388 crc_offset = le32_to_cpu(cp_block->checksum_offset);
390 if (crc_offset >= blk_size) 389 if (crc_offset >= blk_size)
391 goto invalid_cp2; 390 goto invalid_cp2;
392 391
393 crc = le32_to_cpu(*((__u32 *)((unsigned char *)cp_block + crc_offset))); 392 crc = le32_to_cpu(*((__u32 *)((unsigned char *)cp_block + crc_offset)));
394 if (!f2fs_crc_valid(crc, cp_block, crc_offset)) 393 if (!f2fs_crc_valid(crc, cp_block, crc_offset))
395 goto invalid_cp2; 394 goto invalid_cp2;
396 395
397 cur_version = cur_cp_version(cp_block); 396 cur_version = cur_cp_version(cp_block);
398 397
399 if (cur_version == pre_version) { 398 if (cur_version == pre_version) {
400 *version = cur_version; 399 *version = cur_version;
401 f2fs_put_page(cp_page_2, 1); 400 f2fs_put_page(cp_page_2, 1);
402 return cp_page_1; 401 return cp_page_1;
403 } 402 }
404 invalid_cp2: 403 invalid_cp2:
405 f2fs_put_page(cp_page_2, 1); 404 f2fs_put_page(cp_page_2, 1);
406 invalid_cp1: 405 invalid_cp1:
407 f2fs_put_page(cp_page_1, 1); 406 f2fs_put_page(cp_page_1, 1);
408 return NULL; 407 return NULL;
409 } 408 }
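Note that validate_checkpoint() only returns a CP pack when both copies of the checkpoint block pass their CRC check and carry the same version: the header at cp_addr and the footer located cp_pack_total_block_count - 1 blocks later. A torn or incomplete checkpoint write therefore fails the version comparison and the whole pack is rejected.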
410 409
411 int get_valid_checkpoint(struct f2fs_sb_info *sbi) 410 int get_valid_checkpoint(struct f2fs_sb_info *sbi)
412 { 411 {
413 struct f2fs_checkpoint *cp_block; 412 struct f2fs_checkpoint *cp_block;
414 struct f2fs_super_block *fsb = sbi->raw_super; 413 struct f2fs_super_block *fsb = sbi->raw_super;
415 struct page *cp1, *cp2, *cur_page; 414 struct page *cp1, *cp2, *cur_page;
416 unsigned long blk_size = sbi->blocksize; 415 unsigned long blk_size = sbi->blocksize;
417 unsigned long long cp1_version = 0, cp2_version = 0; 416 unsigned long long cp1_version = 0, cp2_version = 0;
418 unsigned long long cp_start_blk_no; 417 unsigned long long cp_start_blk_no;
419 418
420 sbi->ckpt = kzalloc(blk_size, GFP_KERNEL); 419 sbi->ckpt = kzalloc(blk_size, GFP_KERNEL);
421 if (!sbi->ckpt) 420 if (!sbi->ckpt)
422 return -ENOMEM; 421 return -ENOMEM;
423 /* 422 /*
424 * Finding the valid cp block involves reading both 423 * Finding the valid cp block involves reading both
425 * sets (cp pack 1 and cp pack 2) 424 * sets (cp pack 1 and cp pack 2)
426 */ 425 */
427 cp_start_blk_no = le32_to_cpu(fsb->cp_blkaddr); 426 cp_start_blk_no = le32_to_cpu(fsb->cp_blkaddr);
428 cp1 = validate_checkpoint(sbi, cp_start_blk_no, &cp1_version); 427 cp1 = validate_checkpoint(sbi, cp_start_blk_no, &cp1_version);
429 428
430 /* The second checkpoint pack should start at the next segment */ 429 /* The second checkpoint pack should start at the next segment */
431 cp_start_blk_no += 1 << le32_to_cpu(fsb->log_blocks_per_seg); 430 cp_start_blk_no += 1 << le32_to_cpu(fsb->log_blocks_per_seg);
432 cp2 = validate_checkpoint(sbi, cp_start_blk_no, &cp2_version); 431 cp2 = validate_checkpoint(sbi, cp_start_blk_no, &cp2_version);
433 432
434 if (cp1 && cp2) { 433 if (cp1 && cp2) {
435 if (ver_after(cp2_version, cp1_version)) 434 if (ver_after(cp2_version, cp1_version))
436 cur_page = cp2; 435 cur_page = cp2;
437 else 436 else
438 cur_page = cp1; 437 cur_page = cp1;
439 } else if (cp1) { 438 } else if (cp1) {
440 cur_page = cp1; 439 cur_page = cp1;
441 } else if (cp2) { 440 } else if (cp2) {
442 cur_page = cp2; 441 cur_page = cp2;
443 } else { 442 } else {
444 goto fail_no_cp; 443 goto fail_no_cp;
445 } 444 }
446 445
447 cp_block = (struct f2fs_checkpoint *)page_address(cur_page); 446 cp_block = (struct f2fs_checkpoint *)page_address(cur_page);
448 memcpy(sbi->ckpt, cp_block, blk_size); 447 memcpy(sbi->ckpt, cp_block, blk_size);
449 448
450 f2fs_put_page(cp1, 1); 449 f2fs_put_page(cp1, 1);
451 f2fs_put_page(cp2, 1); 450 f2fs_put_page(cp2, 1);
452 return 0; 451 return 0;
453 452
454 fail_no_cp: 453 fail_no_cp:
455 kfree(sbi->ckpt); 454 kfree(sbi->ckpt);
456 return -EINVAL; 455 return -EINVAL;
457 } 456 }
458 457
459 static int __add_dirty_inode(struct inode *inode, struct dir_inode_entry *new) 458 static int __add_dirty_inode(struct inode *inode, struct dir_inode_entry *new)
460 { 459 {
461 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 460 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
462 struct list_head *head = &sbi->dir_inode_list; 461 struct list_head *head = &sbi->dir_inode_list;
463 struct list_head *this; 462 struct list_head *this;
464 463
465 list_for_each(this, head) { 464 list_for_each(this, head) {
466 struct dir_inode_entry *entry; 465 struct dir_inode_entry *entry;
467 entry = list_entry(this, struct dir_inode_entry, list); 466 entry = list_entry(this, struct dir_inode_entry, list);
468 if (entry->inode == inode) 467 if (entry->inode == inode)
469 return -EEXIST; 468 return -EEXIST;
470 } 469 }
471 list_add_tail(&new->list, head); 470 list_add_tail(&new->list, head);
472 #ifdef CONFIG_F2FS_STAT_FS 471 #ifdef CONFIG_F2FS_STAT_FS
473 sbi->n_dirty_dirs++; 472 sbi->n_dirty_dirs++;
474 #endif 473 #endif
475 return 0; 474 return 0;
476 } 475 }
477 476
478 void set_dirty_dir_page(struct inode *inode, struct page *page) 477 void set_dirty_dir_page(struct inode *inode, struct page *page)
479 { 478 {
480 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 479 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
481 struct dir_inode_entry *new; 480 struct dir_inode_entry *new;
482 481
483 if (!S_ISDIR(inode->i_mode)) 482 if (!S_ISDIR(inode->i_mode))
484 return; 483 return;
485 retry: 484 retry:
486 new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS); 485 new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS);
487 if (!new) { 486 if (!new) {
488 cond_resched(); 487 cond_resched();
489 goto retry; 488 goto retry;
490 } 489 }
491 new->inode = inode; 490 new->inode = inode;
492 INIT_LIST_HEAD(&new->list); 491 INIT_LIST_HEAD(&new->list);
493 492
494 spin_lock(&sbi->dir_inode_lock); 493 spin_lock(&sbi->dir_inode_lock);
495 if (__add_dirty_inode(inode, new)) 494 if (__add_dirty_inode(inode, new))
496 kmem_cache_free(inode_entry_slab, new); 495 kmem_cache_free(inode_entry_slab, new);
497 496
498 inc_page_count(sbi, F2FS_DIRTY_DENTS); 497 inc_page_count(sbi, F2FS_DIRTY_DENTS);
499 inode_inc_dirty_dents(inode); 498 inode_inc_dirty_dents(inode);
500 SetPagePrivate(page); 499 SetPagePrivate(page);
501 spin_unlock(&sbi->dir_inode_lock); 500 spin_unlock(&sbi->dir_inode_lock);
502 } 501 }
503 502
504 void add_dirty_dir_inode(struct inode *inode) 503 void add_dirty_dir_inode(struct inode *inode)
505 { 504 {
506 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 505 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
507 struct dir_inode_entry *new; 506 struct dir_inode_entry *new;
508 retry: 507 retry:
509 new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS); 508 new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS);
510 if (!new) { 509 if (!new) {
511 cond_resched(); 510 cond_resched();
512 goto retry; 511 goto retry;
513 } 512 }
514 new->inode = inode; 513 new->inode = inode;
515 INIT_LIST_HEAD(&new->list); 514 INIT_LIST_HEAD(&new->list);
516 515
517 spin_lock(&sbi->dir_inode_lock); 516 spin_lock(&sbi->dir_inode_lock);
518 if (__add_dirty_inode(inode, new)) 517 if (__add_dirty_inode(inode, new))
519 kmem_cache_free(inode_entry_slab, new); 518 kmem_cache_free(inode_entry_slab, new);
520 spin_unlock(&sbi->dir_inode_lock); 519 spin_unlock(&sbi->dir_inode_lock);
521 } 520 }
522 521
523 void remove_dirty_dir_inode(struct inode *inode) 522 void remove_dirty_dir_inode(struct inode *inode)
524 { 523 {
525 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 524 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
526 struct list_head *head = &sbi->dir_inode_list; 525 struct list_head *head = &sbi->dir_inode_list;
527 struct list_head *this; 526 struct list_head *this;
528 527
529 if (!S_ISDIR(inode->i_mode)) 528 if (!S_ISDIR(inode->i_mode))
530 return; 529 return;
531 530
532 spin_lock(&sbi->dir_inode_lock); 531 spin_lock(&sbi->dir_inode_lock);
533 if (atomic_read(&F2FS_I(inode)->dirty_dents)) { 532 if (atomic_read(&F2FS_I(inode)->dirty_dents)) {
534 spin_unlock(&sbi->dir_inode_lock); 533 spin_unlock(&sbi->dir_inode_lock);
535 return; 534 return;
536 } 535 }
537 536
538 list_for_each(this, head) { 537 list_for_each(this, head) {
539 struct dir_inode_entry *entry; 538 struct dir_inode_entry *entry;
540 entry = list_entry(this, struct dir_inode_entry, list); 539 entry = list_entry(this, struct dir_inode_entry, list);
541 if (entry->inode == inode) { 540 if (entry->inode == inode) {
542 list_del(&entry->list); 541 list_del(&entry->list);
543 kmem_cache_free(inode_entry_slab, entry); 542 kmem_cache_free(inode_entry_slab, entry);
544 #ifdef CONFIG_F2FS_STAT_FS 543 #ifdef CONFIG_F2FS_STAT_FS
545 sbi->n_dirty_dirs--; 544 sbi->n_dirty_dirs--;
546 #endif 545 #endif
547 break; 546 break;
548 } 547 }
549 } 548 }
550 spin_unlock(&sbi->dir_inode_lock); 549 spin_unlock(&sbi->dir_inode_lock);
551 550
552 /* Only from the recovery routine */ 551 /* Only from the recovery routine */
553 if (is_inode_flag_set(F2FS_I(inode), FI_DELAY_IPUT)) { 552 if (is_inode_flag_set(F2FS_I(inode), FI_DELAY_IPUT)) {
554 clear_inode_flag(F2FS_I(inode), FI_DELAY_IPUT); 553 clear_inode_flag(F2FS_I(inode), FI_DELAY_IPUT);
555 iput(inode); 554 iput(inode);
556 } 555 }
557 } 556 }
558 557
559 struct inode *check_dirty_dir_inode(struct f2fs_sb_info *sbi, nid_t ino) 558 struct inode *check_dirty_dir_inode(struct f2fs_sb_info *sbi, nid_t ino)
560 { 559 {
561 struct list_head *head = &sbi->dir_inode_list; 560 struct list_head *head = &sbi->dir_inode_list;
562 struct list_head *this; 561 struct list_head *this;
563 struct inode *inode = NULL; 562 struct inode *inode = NULL;
564 563
565 spin_lock(&sbi->dir_inode_lock); 564 spin_lock(&sbi->dir_inode_lock);
566 list_for_each(this, head) { 565 list_for_each(this, head) {
567 struct dir_inode_entry *entry; 566 struct dir_inode_entry *entry;
568 entry = list_entry(this, struct dir_inode_entry, list); 567 entry = list_entry(this, struct dir_inode_entry, list);
569 if (entry->inode->i_ino == ino) { 568 if (entry->inode->i_ino == ino) {
570 inode = entry->inode; 569 inode = entry->inode;
571 break; 570 break;
572 } 571 }
573 } 572 }
574 spin_unlock(&sbi->dir_inode_lock); 573 spin_unlock(&sbi->dir_inode_lock);
575 return inode; 574 return inode;
576 } 575 }
577 576
578 void sync_dirty_dir_inodes(struct f2fs_sb_info *sbi) 577 void sync_dirty_dir_inodes(struct f2fs_sb_info *sbi)
579 { 578 {
580 struct list_head *head = &sbi->dir_inode_list; 579 struct list_head *head = &sbi->dir_inode_list;
581 struct dir_inode_entry *entry; 580 struct dir_inode_entry *entry;
582 struct inode *inode; 581 struct inode *inode;
583 retry: 582 retry:
584 spin_lock(&sbi->dir_inode_lock); 583 spin_lock(&sbi->dir_inode_lock);
585 if (list_empty(head)) { 584 if (list_empty(head)) {
586 spin_unlock(&sbi->dir_inode_lock); 585 spin_unlock(&sbi->dir_inode_lock);
587 return; 586 return;
588 } 587 }
589 entry = list_entry(head->next, struct dir_inode_entry, list); 588 entry = list_entry(head->next, struct dir_inode_entry, list);
590 inode = igrab(entry->inode); 589 inode = igrab(entry->inode);
591 spin_unlock(&sbi->dir_inode_lock); 590 spin_unlock(&sbi->dir_inode_lock);
592 if (inode) { 591 if (inode) {
593 filemap_flush(inode->i_mapping); 592 filemap_flush(inode->i_mapping);
594 iput(inode); 593 iput(inode);
595 } else { 594 } else {
596 /* 595 /*
597 * We should submit the bio, since there exist several 596 * We should submit the bio, since there exist several
598 * dentry pages under writeback in the freeing inode. 597 * dentry pages under writeback in the freeing inode.
599 */ 598 */
600 f2fs_submit_bio(sbi, DATA, true); 599 f2fs_submit_bio(sbi, DATA, true);
601 } 600 }
602 goto retry; 601 goto retry;
603 } 602 }
604 603
605 /* 604 /*
606 * Freeze all the FS-operations for checkpoint. 605 * Freeze all the FS-operations for checkpoint.
607 */ 606 */
608 static void block_operations(struct f2fs_sb_info *sbi) 607 static void block_operations(struct f2fs_sb_info *sbi)
609 { 608 {
610 struct writeback_control wbc = { 609 struct writeback_control wbc = {
611 .sync_mode = WB_SYNC_ALL, 610 .sync_mode = WB_SYNC_ALL,
612 .nr_to_write = LONG_MAX, 611 .nr_to_write = LONG_MAX,
613 .for_reclaim = 0, 612 .for_reclaim = 0,
614 }; 613 };
615 struct blk_plug plug; 614 struct blk_plug plug;
616 615
617 blk_start_plug(&plug); 616 blk_start_plug(&plug);
618 617
619 retry_flush_dents: 618 retry_flush_dents:
620 mutex_lock_all(sbi); 619 mutex_lock_all(sbi);
621 620
622 /* write all the dirty dentry pages */ 621 /* write all the dirty dentry pages */
623 if (get_pages(sbi, F2FS_DIRTY_DENTS)) { 622 if (get_pages(sbi, F2FS_DIRTY_DENTS)) {
624 mutex_unlock_all(sbi); 623 mutex_unlock_all(sbi);
625 sync_dirty_dir_inodes(sbi); 624 sync_dirty_dir_inodes(sbi);
626 goto retry_flush_dents; 625 goto retry_flush_dents;
627 } 626 }
628 627
629 /* 628 /*
630 * POR: we should ensure that there are no dirty node pages 629 * POR: we should ensure that there are no dirty node pages
631 * until finishing nat/sit flush. 630 * until finishing nat/sit flush.
632 */ 631 */
633 retry_flush_nodes: 632 retry_flush_nodes:
634 mutex_lock(&sbi->node_write); 633 mutex_lock(&sbi->node_write);
635 634
636 if (get_pages(sbi, F2FS_DIRTY_NODES)) { 635 if (get_pages(sbi, F2FS_DIRTY_NODES)) {
637 mutex_unlock(&sbi->node_write); 636 mutex_unlock(&sbi->node_write);
638 sync_node_pages(sbi, 0, &wbc); 637 sync_node_pages(sbi, 0, &wbc);
639 goto retry_flush_nodes; 638 goto retry_flush_nodes;
640 } 639 }
641 blk_finish_plug(&plug); 640 blk_finish_plug(&plug);
642 } 641 }
643 642
644 static void unblock_operations(struct f2fs_sb_info *sbi) 643 static void unblock_operations(struct f2fs_sb_info *sbi)
645 { 644 {
646 mutex_unlock(&sbi->node_write); 645 mutex_unlock(&sbi->node_write);
647 mutex_unlock_all(sbi); 646 mutex_unlock_all(sbi);
648 } 647 }
649 648
650 static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) 649 static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
651 { 650 {
652 struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); 651 struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
653 nid_t last_nid = 0; 652 nid_t last_nid = 0;
654 block_t start_blk; 653 block_t start_blk;
655 struct page *cp_page; 654 struct page *cp_page;
656 unsigned int data_sum_blocks, orphan_blocks; 655 unsigned int data_sum_blocks, orphan_blocks;
657 __u32 crc32 = 0; 656 __u32 crc32 = 0;
658 void *kaddr; 657 void *kaddr;
659 int i; 658 int i;
660 659
661 /* Flush all the NAT/SIT pages */ 660 /* Flush all the NAT/SIT pages */
662 while (get_pages(sbi, F2FS_DIRTY_META)) 661 while (get_pages(sbi, F2FS_DIRTY_META))
663 sync_meta_pages(sbi, META, LONG_MAX); 662 sync_meta_pages(sbi, META, LONG_MAX);
664 663
665 next_free_nid(sbi, &last_nid); 664 next_free_nid(sbi, &last_nid);
666 665
667 /* 666 /*
668 * modify checkpoint 667 * modify checkpoint
669 * version number is already updated 668 * version number is already updated
670 */ 669 */
671 ckpt->elapsed_time = cpu_to_le64(get_mtime(sbi)); 670 ckpt->elapsed_time = cpu_to_le64(get_mtime(sbi));
672 ckpt->valid_block_count = cpu_to_le64(valid_user_blocks(sbi)); 671 ckpt->valid_block_count = cpu_to_le64(valid_user_blocks(sbi));
673 ckpt->free_segment_count = cpu_to_le32(free_segments(sbi)); 672 ckpt->free_segment_count = cpu_to_le32(free_segments(sbi));
674 for (i = 0; i < 3; i++) { 673 for (i = 0; i < 3; i++) {
675 ckpt->cur_node_segno[i] = 674 ckpt->cur_node_segno[i] =
676 cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_NODE)); 675 cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_NODE));
677 ckpt->cur_node_blkoff[i] = 676 ckpt->cur_node_blkoff[i] =
678 cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_NODE)); 677 cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_NODE));
679 ckpt->alloc_type[i + CURSEG_HOT_NODE] = 678 ckpt->alloc_type[i + CURSEG_HOT_NODE] =
680 curseg_alloc_type(sbi, i + CURSEG_HOT_NODE); 679 curseg_alloc_type(sbi, i + CURSEG_HOT_NODE);
681 } 680 }
682 for (i = 0; i < 3; i++) { 681 for (i = 0; i < 3; i++) {
683 ckpt->cur_data_segno[i] = 682 ckpt->cur_data_segno[i] =
684 cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_DATA)); 683 cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_DATA));
685 ckpt->cur_data_blkoff[i] = 684 ckpt->cur_data_blkoff[i] =
686 cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_DATA)); 685 cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_DATA));
687 ckpt->alloc_type[i + CURSEG_HOT_DATA] = 686 ckpt->alloc_type[i + CURSEG_HOT_DATA] =
688 curseg_alloc_type(sbi, i + CURSEG_HOT_DATA); 687 curseg_alloc_type(sbi, i + CURSEG_HOT_DATA);
689 } 688 }
690 689
691 ckpt->valid_node_count = cpu_to_le32(valid_node_count(sbi)); 690 ckpt->valid_node_count = cpu_to_le32(valid_node_count(sbi));
692 ckpt->valid_inode_count = cpu_to_le32(valid_inode_count(sbi)); 691 ckpt->valid_inode_count = cpu_to_le32(valid_inode_count(sbi));
693 ckpt->next_free_nid = cpu_to_le32(last_nid); 692 ckpt->next_free_nid = cpu_to_le32(last_nid);
694 693
695 /* 2 cp + n data seg summary + orphan inode blocks */ 694 /* 2 cp + n data seg summary + orphan inode blocks */
696 data_sum_blocks = npages_for_summary_flush(sbi); 695 data_sum_blocks = npages_for_summary_flush(sbi);
697 if (data_sum_blocks < 3) 696 if (data_sum_blocks < 3)
698 set_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG); 697 set_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG);
699 else 698 else
700 clear_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG); 699 clear_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG);
701 700
702 orphan_blocks = (sbi->n_orphans + F2FS_ORPHANS_PER_BLOCK - 1) 701 orphan_blocks = (sbi->n_orphans + F2FS_ORPHANS_PER_BLOCK - 1)
703 / F2FS_ORPHANS_PER_BLOCK; 702 / F2FS_ORPHANS_PER_BLOCK;
704 ckpt->cp_pack_start_sum = cpu_to_le32(1 + orphan_blocks); 703 ckpt->cp_pack_start_sum = cpu_to_le32(1 + orphan_blocks);
705 704
706 if (is_umount) { 705 if (is_umount) {
707 set_ckpt_flags(ckpt, CP_UMOUNT_FLAG); 706 set_ckpt_flags(ckpt, CP_UMOUNT_FLAG);
708 ckpt->cp_pack_total_block_count = cpu_to_le32(2 + 707 ckpt->cp_pack_total_block_count = cpu_to_le32(2 +
709 data_sum_blocks + orphan_blocks + NR_CURSEG_NODE_TYPE); 708 data_sum_blocks + orphan_blocks + NR_CURSEG_NODE_TYPE);
710 } else { 709 } else {
711 clear_ckpt_flags(ckpt, CP_UMOUNT_FLAG); 710 clear_ckpt_flags(ckpt, CP_UMOUNT_FLAG);
712 ckpt->cp_pack_total_block_count = cpu_to_le32(2 + 711 ckpt->cp_pack_total_block_count = cpu_to_le32(2 +
713 data_sum_blocks + orphan_blocks); 712 data_sum_blocks + orphan_blocks);
714 } 713 }
715 714
716 if (sbi->n_orphans) 715 if (sbi->n_orphans)
717 set_ckpt_flags(ckpt, CP_ORPHAN_PRESENT_FLAG); 716 set_ckpt_flags(ckpt, CP_ORPHAN_PRESENT_FLAG);
718 else 717 else
719 clear_ckpt_flags(ckpt, CP_ORPHAN_PRESENT_FLAG); 718 clear_ckpt_flags(ckpt, CP_ORPHAN_PRESENT_FLAG);
720 719
721 /* update SIT/NAT bitmap */ 720 /* update SIT/NAT bitmap */
722 get_sit_bitmap(sbi, __bitmap_ptr(sbi, SIT_BITMAP)); 721 get_sit_bitmap(sbi, __bitmap_ptr(sbi, SIT_BITMAP));
723 get_nat_bitmap(sbi, __bitmap_ptr(sbi, NAT_BITMAP)); 722 get_nat_bitmap(sbi, __bitmap_ptr(sbi, NAT_BITMAP));
724 723
725 crc32 = f2fs_crc32(ckpt, le32_to_cpu(ckpt->checksum_offset)); 724 crc32 = f2fs_crc32(ckpt, le32_to_cpu(ckpt->checksum_offset));
726 *((__le32 *)((unsigned char *)ckpt + 725 *((__le32 *)((unsigned char *)ckpt +
727 le32_to_cpu(ckpt->checksum_offset))) 726 le32_to_cpu(ckpt->checksum_offset)))
728 = cpu_to_le32(crc32); 727 = cpu_to_le32(crc32);
729 728
730 start_blk = __start_cp_addr(sbi); 729 start_blk = __start_cp_addr(sbi);
731 730
732 /* write out checkpoint buffer at block 0 */ 731 /* write out checkpoint buffer at block 0 */
733 cp_page = grab_meta_page(sbi, start_blk++); 732 cp_page = grab_meta_page(sbi, start_blk++);
734 kaddr = page_address(cp_page); 733 kaddr = page_address(cp_page);
735 memcpy(kaddr, ckpt, (1 << sbi->log_blocksize)); 734 memcpy(kaddr, ckpt, (1 << sbi->log_blocksize));
736 set_page_dirty(cp_page); 735 set_page_dirty(cp_page);
737 f2fs_put_page(cp_page, 1); 736 f2fs_put_page(cp_page, 1);
738 737
739 if (sbi->n_orphans) { 738 if (sbi->n_orphans) {
740 write_orphan_inodes(sbi, start_blk); 739 write_orphan_inodes(sbi, start_blk);
741 start_blk += orphan_blocks; 740 start_blk += orphan_blocks;
742 } 741 }
743 742
744 write_data_summaries(sbi, start_blk); 743 write_data_summaries(sbi, start_blk);
745 start_blk += data_sum_blocks; 744 start_blk += data_sum_blocks;
746 if (is_umount) { 745 if (is_umount) {
747 write_node_summaries(sbi, start_blk); 746 write_node_summaries(sbi, start_blk);
748 start_blk += NR_CURSEG_NODE_TYPE; 747 start_blk += NR_CURSEG_NODE_TYPE;
749 } 748 }
750 749
751 /* writeout checkpoint block */ 750 /* writeout checkpoint block */
752 cp_page = grab_meta_page(sbi, start_blk); 751 cp_page = grab_meta_page(sbi, start_blk);
753 kaddr = page_address(cp_page); 752 kaddr = page_address(cp_page);
754 memcpy(kaddr, ckpt, (1 << sbi->log_blocksize)); 753 memcpy(kaddr, ckpt, (1 << sbi->log_blocksize));
755 set_page_dirty(cp_page); 754 set_page_dirty(cp_page);
756 f2fs_put_page(cp_page, 1); 755 f2fs_put_page(cp_page, 1);
757 756
758 /* wait for previously submitted node/meta page writeback */ 757 /* wait for previously submitted node/meta page writeback */
759 while (get_pages(sbi, F2FS_WRITEBACK)) 758 while (get_pages(sbi, F2FS_WRITEBACK))
760 congestion_wait(BLK_RW_ASYNC, HZ / 50); 759 congestion_wait(BLK_RW_ASYNC, HZ / 50);
761 760
762 filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX); 761 filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX);
763 filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX); 762 filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX);
764 763
765 /* update user_block_counts */ 764 /* update user_block_counts */
766 sbi->last_valid_block_count = sbi->total_valid_block_count; 765 sbi->last_valid_block_count = sbi->total_valid_block_count;
767 sbi->alloc_valid_block_count = 0; 766 sbi->alloc_valid_block_count = 0;
768 767
769 /* Here, we have only one bio carrying the CP pack */ 768 /* Here, we have only one bio carrying the CP pack */
770 sync_meta_pages(sbi, META_FLUSH, LONG_MAX); 769 sync_meta_pages(sbi, META_FLUSH, LONG_MAX);
771 770
772 if (!is_set_ckpt_flags(ckpt, CP_ERROR_FLAG)) { 771 if (!is_set_ckpt_flags(ckpt, CP_ERROR_FLAG)) {
773 clear_prefree_segments(sbi); 772 clear_prefree_segments(sbi);
774 F2FS_RESET_SB_DIRT(sbi); 773 F2FS_RESET_SB_DIRT(sbi);
775 } 774 }
776 } 775 }
777 776
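do_checkpoint() above seals the checkpoint pack by computing a CRC over the checkpoint structure up to checksum_offset and then storing that CRC, converted to little-endian, at the very offset the structure advertises. The user-space sketch below shows only that self-describing-checksum pattern; struct toy_ckpt and toy_crc() are invented stand-ins, not the on-disk f2fs layout or the kernel's f2fs_crc32().

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct toy_ckpt {
	uint64_t version;
	uint32_t total_blocks;
	uint32_t checksum_offset;	/* where the CRC is stored */
	uint32_t crc;			/* lives at checksum_offset */
};

static uint32_t toy_crc(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = ~0u;

	while (len--)			/* simplistic stand-in for f2fs_crc32() */
		crc = (crc << 5) + crc + *p++;
	return crc;
}

int main(void)
{
	struct toy_ckpt ckpt = {
		.version = 42,
		.total_blocks = 128,
		.checksum_offset = offsetof(struct toy_ckpt, crc),
	};
	/* CRC covers everything before checksum_offset ... */
	uint32_t crc = toy_crc(&ckpt, ckpt.checksum_offset);

	/* ... and is stored at the offset the header itself advertises
	 * (the kernel additionally converts it to little-endian first) */
	memcpy((unsigned char *)&ckpt + ckpt.checksum_offset, &crc, sizeof(crc));
	printf("crc=0x%08x stored at offset %u\n",
	       (unsigned)ckpt.crc, (unsigned)ckpt.checksum_offset);
	return 0;
}

The point of the pattern is that a reader can validate the block with nothing but the block itself: read checksum_offset, recompute the CRC over the bytes before it, and compare.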
778 /* 777 /*
779 * We guarantee that this checkpoint procedure will not fail. 778 * We guarantee that this checkpoint procedure will not fail.
780 */ 779 */
781 void write_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) 780 void write_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
782 { 781 {
783 struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); 782 struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
784 unsigned long long ckpt_ver; 783 unsigned long long ckpt_ver;
785 784
786 trace_f2fs_write_checkpoint(sbi->sb, is_umount, "start block_ops"); 785 trace_f2fs_write_checkpoint(sbi->sb, is_umount, "start block_ops");
787 786
788 mutex_lock(&sbi->cp_mutex); 787 mutex_lock(&sbi->cp_mutex);
789 block_operations(sbi); 788 block_operations(sbi);
790 789
791 trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish block_ops"); 790 trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish block_ops");
792 791
793 f2fs_submit_bio(sbi, DATA, true); 792 f2fs_submit_bio(sbi, DATA, true);
794 f2fs_submit_bio(sbi, NODE, true); 793 f2fs_submit_bio(sbi, NODE, true);
795 f2fs_submit_bio(sbi, META, true); 794 f2fs_submit_bio(sbi, META, true);
796 795
797 /* 796 /*
798 * update checkpoint pack index 797 * update checkpoint pack index
799 * Increase the version number so that 798 * Increase the version number so that
800 * SIT entries and seg summaries are written in the correct place 799 * SIT entries and seg summaries are written in the correct place
801 */ 800 */
802 ckpt_ver = cur_cp_version(ckpt); 801 ckpt_ver = cur_cp_version(ckpt);
803 ckpt->checkpoint_ver = cpu_to_le64(++ckpt_ver); 802 ckpt->checkpoint_ver = cpu_to_le64(++ckpt_ver);
804 803
805 /* write cached NAT/SIT entries to NAT/SIT area */ 804 /* write cached NAT/SIT entries to NAT/SIT area */
806 flush_nat_entries(sbi); 805 flush_nat_entries(sbi);
807 flush_sit_entries(sbi); 806 flush_sit_entries(sbi);
808 807
809 /* unlock all the fs_lock[] in do_checkpoint() */ 808 /* unlock all the fs_lock[] in do_checkpoint() */
810 do_checkpoint(sbi, is_umount); 809 do_checkpoint(sbi, is_umount);
811 810
812 unblock_operations(sbi); 811 unblock_operations(sbi);
813 mutex_unlock(&sbi->cp_mutex); 812 mutex_unlock(&sbi->cp_mutex);
814 813
815 trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish checkpoint"); 814 trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish checkpoint");
816 } 815 }
817 816
818 void init_orphan_info(struct f2fs_sb_info *sbi) 817 void init_orphan_info(struct f2fs_sb_info *sbi)
819 { 818 {
820 mutex_init(&sbi->orphan_inode_mutex); 819 mutex_init(&sbi->orphan_inode_mutex);
821 INIT_LIST_HEAD(&sbi->orphan_inode_list); 820 INIT_LIST_HEAD(&sbi->orphan_inode_list);
822 sbi->n_orphans = 0; 821 sbi->n_orphans = 0;
823 } 822 }
824 823
825 int __init create_checkpoint_caches(void) 824 int __init create_checkpoint_caches(void)
826 { 825 {
827 orphan_entry_slab = f2fs_kmem_cache_create("f2fs_orphan_entry", 826 orphan_entry_slab = f2fs_kmem_cache_create("f2fs_orphan_entry",
828 sizeof(struct orphan_inode_entry), NULL); 827 sizeof(struct orphan_inode_entry), NULL);
829 if (unlikely(!orphan_entry_slab)) 828 if (unlikely(!orphan_entry_slab))
830 return -ENOMEM; 829 return -ENOMEM;
831 inode_entry_slab = f2fs_kmem_cache_create("f2fs_dirty_dir_entry", 830 inode_entry_slab = f2fs_kmem_cache_create("f2fs_dirty_dir_entry",
832 sizeof(struct dir_inode_entry), NULL); 831 sizeof(struct dir_inode_entry), NULL);
833 if (unlikely(!inode_entry_slab)) { 832 if (unlikely(!inode_entry_slab)) {
834 kmem_cache_destroy(orphan_entry_slab); 833 kmem_cache_destroy(orphan_entry_slab);
835 return -ENOMEM; 834 return -ENOMEM;
836 } 835 }
837 return 0; 836 return 0;
838 } 837 }
839 838
840 void destroy_checkpoint_caches(void) 839 void destroy_checkpoint_caches(void)
841 { 840 {
842 kmem_cache_destroy(orphan_entry_slab); 841 kmem_cache_destroy(orphan_entry_slab);
843 kmem_cache_destroy(inode_entry_slab); 842 kmem_cache_destroy(inode_entry_slab);
844 } 843 }
845 844
1 /* 1 /*
2 * fs/f2fs/node.c 2 * fs/f2fs/node.c
3 * 3 *
4 * Copyright (c) 2012 Samsung Electronics Co., Ltd. 4 * Copyright (c) 2012 Samsung Electronics Co., Ltd.
5 * http://www.samsung.com/ 5 * http://www.samsung.com/
6 * 6 *
7 * This program is free software; you can redistribute it and/or modify 7 * This program is free software; you can redistribute it and/or modify
8 * it under the terms of the GNU General Public License version 2 as 8 * it under the terms of the GNU General Public License version 2 as
9 * published by the Free Software Foundation. 9 * published by the Free Software Foundation.
10 */ 10 */
11 #include <linux/fs.h> 11 #include <linux/fs.h>
12 #include <linux/f2fs_fs.h> 12 #include <linux/f2fs_fs.h>
13 #include <linux/mpage.h> 13 #include <linux/mpage.h>
14 #include <linux/backing-dev.h> 14 #include <linux/backing-dev.h>
15 #include <linux/blkdev.h> 15 #include <linux/blkdev.h>
16 #include <linux/pagevec.h> 16 #include <linux/pagevec.h>
17 #include <linux/swap.h> 17 #include <linux/swap.h>
18 18
19 #include "f2fs.h" 19 #include "f2fs.h"
20 #include "node.h" 20 #include "node.h"
21 #include "segment.h" 21 #include "segment.h"
22 #include <trace/events/f2fs.h> 22 #include <trace/events/f2fs.h>
23 23
24 static struct kmem_cache *nat_entry_slab; 24 static struct kmem_cache *nat_entry_slab;
25 static struct kmem_cache *free_nid_slab; 25 static struct kmem_cache *free_nid_slab;
26 26
27 static void clear_node_page_dirty(struct page *page) 27 static void clear_node_page_dirty(struct page *page)
28 { 28 {
29 struct address_space *mapping = page->mapping; 29 struct address_space *mapping = page->mapping;
30 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); 30 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
31 unsigned int long flags; 31 unsigned int long flags;
32 32
33 if (PageDirty(page)) { 33 if (PageDirty(page)) {
34 spin_lock_irqsave(&mapping->tree_lock, flags); 34 spin_lock_irqsave(&mapping->tree_lock, flags);
35 radix_tree_tag_clear(&mapping->page_tree, 35 radix_tree_tag_clear(&mapping->page_tree,
36 page_index(page), 36 page_index(page),
37 PAGECACHE_TAG_DIRTY); 37 PAGECACHE_TAG_DIRTY);
38 spin_unlock_irqrestore(&mapping->tree_lock, flags); 38 spin_unlock_irqrestore(&mapping->tree_lock, flags);
39 39
40 clear_page_dirty_for_io(page); 40 clear_page_dirty_for_io(page);
41 dec_page_count(sbi, F2FS_DIRTY_NODES); 41 dec_page_count(sbi, F2FS_DIRTY_NODES);
42 } 42 }
43 ClearPageUptodate(page); 43 ClearPageUptodate(page);
44 } 44 }
45 45
46 static struct page *get_current_nat_page(struct f2fs_sb_info *sbi, nid_t nid) 46 static struct page *get_current_nat_page(struct f2fs_sb_info *sbi, nid_t nid)
47 { 47 {
48 pgoff_t index = current_nat_addr(sbi, nid); 48 pgoff_t index = current_nat_addr(sbi, nid);
49 return get_meta_page(sbi, index); 49 return get_meta_page(sbi, index);
50 } 50 }
51 51
52 static struct page *get_next_nat_page(struct f2fs_sb_info *sbi, nid_t nid) 52 static struct page *get_next_nat_page(struct f2fs_sb_info *sbi, nid_t nid)
53 { 53 {
54 struct page *src_page; 54 struct page *src_page;
55 struct page *dst_page; 55 struct page *dst_page;
56 pgoff_t src_off; 56 pgoff_t src_off;
57 pgoff_t dst_off; 57 pgoff_t dst_off;
58 void *src_addr; 58 void *src_addr;
59 void *dst_addr; 59 void *dst_addr;
60 struct f2fs_nm_info *nm_i = NM_I(sbi); 60 struct f2fs_nm_info *nm_i = NM_I(sbi);
61 61
62 src_off = current_nat_addr(sbi, nid); 62 src_off = current_nat_addr(sbi, nid);
63 dst_off = next_nat_addr(sbi, src_off); 63 dst_off = next_nat_addr(sbi, src_off);
64 64
65 /* get current nat block page with lock */ 65 /* get current nat block page with lock */
66 src_page = get_meta_page(sbi, src_off); 66 src_page = get_meta_page(sbi, src_off);
67 67
68 /* Dirty src_page means that it is already the new target NAT page. */ 68 /* Dirty src_page means that it is already the new target NAT page. */
69 if (PageDirty(src_page)) 69 if (PageDirty(src_page))
70 return src_page; 70 return src_page;
71 71
72 dst_page = grab_meta_page(sbi, dst_off); 72 dst_page = grab_meta_page(sbi, dst_off);
73 73
74 src_addr = page_address(src_page); 74 src_addr = page_address(src_page);
75 dst_addr = page_address(dst_page); 75 dst_addr = page_address(dst_page);
76 memcpy(dst_addr, src_addr, PAGE_CACHE_SIZE); 76 memcpy(dst_addr, src_addr, PAGE_CACHE_SIZE);
77 set_page_dirty(dst_page); 77 set_page_dirty(dst_page);
78 f2fs_put_page(src_page, 1); 78 f2fs_put_page(src_page, 1);
79 79
80 set_to_next_nat(nm_i, nid); 80 set_to_next_nat(nm_i, nid);
81 81
82 return dst_page; 82 return dst_page;
83 } 83 }
84 84
85 /* 85 /*
86 * Readahead NAT pages 86 * Readahead NAT pages
87 */ 87 */
88 static void ra_nat_pages(struct f2fs_sb_info *sbi, int nid) 88 static void ra_nat_pages(struct f2fs_sb_info *sbi, int nid)
89 { 89 {
90 struct address_space *mapping = sbi->meta_inode->i_mapping; 90 struct address_space *mapping = sbi->meta_inode->i_mapping;
91 struct f2fs_nm_info *nm_i = NM_I(sbi); 91 struct f2fs_nm_info *nm_i = NM_I(sbi);
92 struct blk_plug plug; 92 struct blk_plug plug;
93 struct page *page; 93 struct page *page;
94 pgoff_t index; 94 pgoff_t index;
95 int i; 95 int i;
96 96
97 blk_start_plug(&plug); 97 blk_start_plug(&plug);
98 98
99 for (i = 0; i < FREE_NID_PAGES; i++, nid += NAT_ENTRY_PER_BLOCK) { 99 for (i = 0; i < FREE_NID_PAGES; i++, nid += NAT_ENTRY_PER_BLOCK) {
100 if (nid >= nm_i->max_nid) 100 if (nid >= nm_i->max_nid)
101 nid = 0; 101 nid = 0;
102 index = current_nat_addr(sbi, nid); 102 index = current_nat_addr(sbi, nid);
103 103
104 page = grab_cache_page(mapping, index); 104 page = grab_cache_page(mapping, index);
105 if (!page) 105 if (!page)
106 continue; 106 continue;
107 if (PageUptodate(page)) { 107 if (PageUptodate(page)) {
108 f2fs_put_page(page, 1); 108 f2fs_put_page(page, 1);
109 continue; 109 continue;
110 } 110 }
111 if (f2fs_readpage(sbi, page, index, READ)) 111 if (f2fs_readpage(sbi, page, index, READ))
112 continue; 112 continue;
113 113
114 f2fs_put_page(page, 0); 114 f2fs_put_page(page, 0);
115 } 115 }
116 blk_finish_plug(&plug); 116 blk_finish_plug(&plug);
117 } 117 }
118 118
119 static struct nat_entry *__lookup_nat_cache(struct f2fs_nm_info *nm_i, nid_t n) 119 static struct nat_entry *__lookup_nat_cache(struct f2fs_nm_info *nm_i, nid_t n)
120 { 120 {
121 return radix_tree_lookup(&nm_i->nat_root, n); 121 return radix_tree_lookup(&nm_i->nat_root, n);
122 } 122 }
123 123
124 static unsigned int __gang_lookup_nat_cache(struct f2fs_nm_info *nm_i, 124 static unsigned int __gang_lookup_nat_cache(struct f2fs_nm_info *nm_i,
125 nid_t start, unsigned int nr, struct nat_entry **ep) 125 nid_t start, unsigned int nr, struct nat_entry **ep)
126 { 126 {
127 return radix_tree_gang_lookup(&nm_i->nat_root, (void **)ep, start, nr); 127 return radix_tree_gang_lookup(&nm_i->nat_root, (void **)ep, start, nr);
128 } 128 }
129 129
130 static void __del_from_nat_cache(struct f2fs_nm_info *nm_i, struct nat_entry *e) 130 static void __del_from_nat_cache(struct f2fs_nm_info *nm_i, struct nat_entry *e)
131 { 131 {
132 list_del(&e->list); 132 list_del(&e->list);
133 radix_tree_delete(&nm_i->nat_root, nat_get_nid(e)); 133 radix_tree_delete(&nm_i->nat_root, nat_get_nid(e));
134 nm_i->nat_cnt--; 134 nm_i->nat_cnt--;
135 kmem_cache_free(nat_entry_slab, e); 135 kmem_cache_free(nat_entry_slab, e);
136 } 136 }
137 137
138 int is_checkpointed_node(struct f2fs_sb_info *sbi, nid_t nid) 138 int is_checkpointed_node(struct f2fs_sb_info *sbi, nid_t nid)
139 { 139 {
140 struct f2fs_nm_info *nm_i = NM_I(sbi); 140 struct f2fs_nm_info *nm_i = NM_I(sbi);
141 struct nat_entry *e; 141 struct nat_entry *e;
142 int is_cp = 1; 142 int is_cp = 1;
143 143
144 read_lock(&nm_i->nat_tree_lock); 144 read_lock(&nm_i->nat_tree_lock);
145 e = __lookup_nat_cache(nm_i, nid); 145 e = __lookup_nat_cache(nm_i, nid);
146 if (e && !e->checkpointed) 146 if (e && !e->checkpointed)
147 is_cp = 0; 147 is_cp = 0;
148 read_unlock(&nm_i->nat_tree_lock); 148 read_unlock(&nm_i->nat_tree_lock);
149 return is_cp; 149 return is_cp;
150 } 150 }
151 151
152 static struct nat_entry *grab_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid) 152 static struct nat_entry *grab_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid)
153 { 153 {
154 struct nat_entry *new; 154 struct nat_entry *new;
155 155
156 new = kmem_cache_alloc(nat_entry_slab, GFP_ATOMIC); 156 new = kmem_cache_alloc(nat_entry_slab, GFP_ATOMIC);
157 if (!new) 157 if (!new)
158 return NULL; 158 return NULL;
159 if (radix_tree_insert(&nm_i->nat_root, nid, new)) { 159 if (radix_tree_insert(&nm_i->nat_root, nid, new)) {
160 kmem_cache_free(nat_entry_slab, new); 160 kmem_cache_free(nat_entry_slab, new);
161 return NULL; 161 return NULL;
162 } 162 }
163 memset(new, 0, sizeof(struct nat_entry)); 163 memset(new, 0, sizeof(struct nat_entry));
164 nat_set_nid(new, nid); 164 nat_set_nid(new, nid);
165 list_add_tail(&new->list, &nm_i->nat_entries); 165 list_add_tail(&new->list, &nm_i->nat_entries);
166 nm_i->nat_cnt++; 166 nm_i->nat_cnt++;
167 return new; 167 return new;
168 } 168 }
169 169
170 static void cache_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid, 170 static void cache_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid,
171 struct f2fs_nat_entry *ne) 171 struct f2fs_nat_entry *ne)
172 { 172 {
173 struct nat_entry *e; 173 struct nat_entry *e;
174 retry: 174 retry:
175 write_lock(&nm_i->nat_tree_lock); 175 write_lock(&nm_i->nat_tree_lock);
176 e = __lookup_nat_cache(nm_i, nid); 176 e = __lookup_nat_cache(nm_i, nid);
177 if (!e) { 177 if (!e) {
178 e = grab_nat_entry(nm_i, nid); 178 e = grab_nat_entry(nm_i, nid);
179 if (!e) { 179 if (!e) {
180 write_unlock(&nm_i->nat_tree_lock); 180 write_unlock(&nm_i->nat_tree_lock);
181 goto retry; 181 goto retry;
182 } 182 }
183 nat_set_blkaddr(e, le32_to_cpu(ne->block_addr)); 183 nat_set_blkaddr(e, le32_to_cpu(ne->block_addr));
184 nat_set_ino(e, le32_to_cpu(ne->ino)); 184 nat_set_ino(e, le32_to_cpu(ne->ino));
185 nat_set_version(e, ne->version); 185 nat_set_version(e, ne->version);
186 e->checkpointed = true; 186 e->checkpointed = true;
187 } 187 }
188 write_unlock(&nm_i->nat_tree_lock); 188 write_unlock(&nm_i->nat_tree_lock);
189 } 189 }
190 190
191 static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni, 191 static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni,
192 block_t new_blkaddr) 192 block_t new_blkaddr)
193 { 193 {
194 struct f2fs_nm_info *nm_i = NM_I(sbi); 194 struct f2fs_nm_info *nm_i = NM_I(sbi);
195 struct nat_entry *e; 195 struct nat_entry *e;
196 retry: 196 retry:
197 write_lock(&nm_i->nat_tree_lock); 197 write_lock(&nm_i->nat_tree_lock);
198 e = __lookup_nat_cache(nm_i, ni->nid); 198 e = __lookup_nat_cache(nm_i, ni->nid);
199 if (!e) { 199 if (!e) {
200 e = grab_nat_entry(nm_i, ni->nid); 200 e = grab_nat_entry(nm_i, ni->nid);
201 if (!e) { 201 if (!e) {
202 write_unlock(&nm_i->nat_tree_lock); 202 write_unlock(&nm_i->nat_tree_lock);
203 goto retry; 203 goto retry;
204 } 204 }
205 e->ni = *ni; 205 e->ni = *ni;
206 e->checkpointed = true; 206 e->checkpointed = true;
207 BUG_ON(ni->blk_addr == NEW_ADDR); 207 BUG_ON(ni->blk_addr == NEW_ADDR);
208 } else if (new_blkaddr == NEW_ADDR) { 208 } else if (new_blkaddr == NEW_ADDR) {
209 /* 209 /*
210 * when a nid is reallocated, 210 * when a nid is reallocated,
211 * the previous nat entry may remain in the nat cache. 211 * the previous nat entry may remain in the nat cache.
212 * So, reinitialize it with the new information. 212 * So, reinitialize it with the new information.
213 */ 213 */
214 e->ni = *ni; 214 e->ni = *ni;
215 BUG_ON(ni->blk_addr != NULL_ADDR); 215 BUG_ON(ni->blk_addr != NULL_ADDR);
216 } 216 }
217 217
218 if (new_blkaddr == NEW_ADDR) 218 if (new_blkaddr == NEW_ADDR)
219 e->checkpointed = false; 219 e->checkpointed = false;
220 220
221 /* sanity check */ 221 /* sanity check */
222 BUG_ON(nat_get_blkaddr(e) != ni->blk_addr); 222 BUG_ON(nat_get_blkaddr(e) != ni->blk_addr);
223 BUG_ON(nat_get_blkaddr(e) == NULL_ADDR && 223 BUG_ON(nat_get_blkaddr(e) == NULL_ADDR &&
224 new_blkaddr == NULL_ADDR); 224 new_blkaddr == NULL_ADDR);
225 BUG_ON(nat_get_blkaddr(e) == NEW_ADDR && 225 BUG_ON(nat_get_blkaddr(e) == NEW_ADDR &&
226 new_blkaddr == NEW_ADDR); 226 new_blkaddr == NEW_ADDR);
227 BUG_ON(nat_get_blkaddr(e) != NEW_ADDR && 227 BUG_ON(nat_get_blkaddr(e) != NEW_ADDR &&
228 nat_get_blkaddr(e) != NULL_ADDR && 228 nat_get_blkaddr(e) != NULL_ADDR &&
229 new_blkaddr == NEW_ADDR); 229 new_blkaddr == NEW_ADDR);
230 230
231 /* increment version number as the node is removed */ 231 /* increment version number as the node is removed */
232 if (nat_get_blkaddr(e) != NEW_ADDR && new_blkaddr == NULL_ADDR) { 232 if (nat_get_blkaddr(e) != NEW_ADDR && new_blkaddr == NULL_ADDR) {
233 unsigned char version = nat_get_version(e); 233 unsigned char version = nat_get_version(e);
234 nat_set_version(e, inc_node_version(version)); 234 nat_set_version(e, inc_node_version(version));
235 } 235 }
236 236
237 /* change address */ 237 /* change address */
238 nat_set_blkaddr(e, new_blkaddr); 238 nat_set_blkaddr(e, new_blkaddr);
239 __set_nat_cache_dirty(nm_i, e); 239 __set_nat_cache_dirty(nm_i, e);
240 write_unlock(&nm_i->nat_tree_lock); 240 write_unlock(&nm_i->nat_tree_lock);
241 } 241 }
242 242
243 static int try_to_free_nats(struct f2fs_sb_info *sbi, int nr_shrink) 243 static int try_to_free_nats(struct f2fs_sb_info *sbi, int nr_shrink)
244 { 244 {
245 struct f2fs_nm_info *nm_i = NM_I(sbi); 245 struct f2fs_nm_info *nm_i = NM_I(sbi);
246 246
247 if (nm_i->nat_cnt <= NM_WOUT_THRESHOLD) 247 if (nm_i->nat_cnt <= NM_WOUT_THRESHOLD)
248 return 0; 248 return 0;
249 249
250 write_lock(&nm_i->nat_tree_lock); 250 write_lock(&nm_i->nat_tree_lock);
251 while (nr_shrink && !list_empty(&nm_i->nat_entries)) { 251 while (nr_shrink && !list_empty(&nm_i->nat_entries)) {
252 struct nat_entry *ne; 252 struct nat_entry *ne;
253 ne = list_first_entry(&nm_i->nat_entries, 253 ne = list_first_entry(&nm_i->nat_entries,
254 struct nat_entry, list); 254 struct nat_entry, list);
255 __del_from_nat_cache(nm_i, ne); 255 __del_from_nat_cache(nm_i, ne);
256 nr_shrink--; 256 nr_shrink--;
257 } 257 }
258 write_unlock(&nm_i->nat_tree_lock); 258 write_unlock(&nm_i->nat_tree_lock);
259 return nr_shrink; 259 return nr_shrink;
260 } 260 }
261 261
262 /* 262 /*
263 * This function always returns success. 263 * This function always returns success.
264 */ 264 */
265 void get_node_info(struct f2fs_sb_info *sbi, nid_t nid, struct node_info *ni) 265 void get_node_info(struct f2fs_sb_info *sbi, nid_t nid, struct node_info *ni)
266 { 266 {
267 struct f2fs_nm_info *nm_i = NM_I(sbi); 267 struct f2fs_nm_info *nm_i = NM_I(sbi);
268 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); 268 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
269 struct f2fs_summary_block *sum = curseg->sum_blk; 269 struct f2fs_summary_block *sum = curseg->sum_blk;
270 nid_t start_nid = START_NID(nid); 270 nid_t start_nid = START_NID(nid);
271 struct f2fs_nat_block *nat_blk; 271 struct f2fs_nat_block *nat_blk;
272 struct page *page = NULL; 272 struct page *page = NULL;
273 struct f2fs_nat_entry ne; 273 struct f2fs_nat_entry ne;
274 struct nat_entry *e; 274 struct nat_entry *e;
275 int i; 275 int i;
276 276
277 memset(&ne, 0, sizeof(struct f2fs_nat_entry)); 277 memset(&ne, 0, sizeof(struct f2fs_nat_entry));
278 ni->nid = nid; 278 ni->nid = nid;
279 279
280 /* Check nat cache */ 280 /* Check nat cache */
281 read_lock(&nm_i->nat_tree_lock); 281 read_lock(&nm_i->nat_tree_lock);
282 e = __lookup_nat_cache(nm_i, nid); 282 e = __lookup_nat_cache(nm_i, nid);
283 if (e) { 283 if (e) {
284 ni->ino = nat_get_ino(e); 284 ni->ino = nat_get_ino(e);
285 ni->blk_addr = nat_get_blkaddr(e); 285 ni->blk_addr = nat_get_blkaddr(e);
286 ni->version = nat_get_version(e); 286 ni->version = nat_get_version(e);
287 } 287 }
288 read_unlock(&nm_i->nat_tree_lock); 288 read_unlock(&nm_i->nat_tree_lock);
289 if (e) 289 if (e)
290 return; 290 return;
291 291
292 /* Check current segment summary */ 292 /* Check current segment summary */
293 mutex_lock(&curseg->curseg_mutex); 293 mutex_lock(&curseg->curseg_mutex);
294 i = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 0); 294 i = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 0);
295 if (i >= 0) { 295 if (i >= 0) {
296 ne = nat_in_journal(sum, i); 296 ne = nat_in_journal(sum, i);
297 node_info_from_raw_nat(ni, &ne); 297 node_info_from_raw_nat(ni, &ne);
298 } 298 }
299 mutex_unlock(&curseg->curseg_mutex); 299 mutex_unlock(&curseg->curseg_mutex);
300 if (i >= 0) 300 if (i >= 0)
301 goto cache; 301 goto cache;
302 302
303 /* Fill node_info from nat page */ 303 /* Fill node_info from nat page */
304 page = get_current_nat_page(sbi, start_nid); 304 page = get_current_nat_page(sbi, start_nid);
305 nat_blk = (struct f2fs_nat_block *)page_address(page); 305 nat_blk = (struct f2fs_nat_block *)page_address(page);
306 ne = nat_blk->entries[nid - start_nid]; 306 ne = nat_blk->entries[nid - start_nid];
307 node_info_from_raw_nat(ni, &ne); 307 node_info_from_raw_nat(ni, &ne);
308 f2fs_put_page(page, 1); 308 f2fs_put_page(page, 1);
309 cache: 309 cache:
310 /* cache nat entry */ 310 /* cache nat entry */
311 cache_nat_entry(NM_I(sbi), nid, &ne); 311 cache_nat_entry(NM_I(sbi), nid, &ne);
312 } 312 }
313 313
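get_node_info() above resolves a nid in three tiers: the in-memory nat cache first, then the NAT journal held in the current segment summary, and only then the on-disk NAT block, caching whatever it ends up with. A rough user-space sketch of that fallback order follows, with invented arrays standing in for the real f2fs structures.

#include <stdio.h>

#define EX_NIDS 16

static int cache[EX_NIDS];	/* 0 means "not cached" */
static int journal[EX_NIDS];	/* 0 means "no journalled update" */
static int table[EX_NIDS];	/* stand-in for the on-disk NAT block */

static int lookup_blkaddr(int nid)
{
	int addr;

	if (cache[nid])			/* 1) in-memory cache */
		return cache[nid];
	if (journal[nid])		/* 2) journal of not-yet-flushed updates */
		addr = journal[nid];
	else				/* 3) the backing table itself */
		addr = table[nid];
	cache[nid] = addr;		/* cache the answer for next time */
	return addr;
}

int main(void)
{
	table[3] = 100;
	journal[3] = 120;		/* journalled update wins over the table */
	printf("first lookup : %d\n", lookup_blkaddr(3));
	printf("second lookup: %d (now served from the cache)\n",
	       lookup_blkaddr(3));
	return 0;
}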
314 /* 314 /*
315 * The maximum depth is four. 315 * The maximum depth is four.
316 * Offset[0] will have the raw inode offset. 316 * Offset[0] will have the raw inode offset.
317 */ 317 */
318 static int get_node_path(struct f2fs_inode_info *fi, long block, 318 static int get_node_path(struct f2fs_inode_info *fi, long block,
319 int offset[4], unsigned int noffset[4]) 319 int offset[4], unsigned int noffset[4])
320 { 320 {
321 const long direct_index = ADDRS_PER_INODE(fi); 321 const long direct_index = ADDRS_PER_INODE(fi);
322 const long direct_blks = ADDRS_PER_BLOCK; 322 const long direct_blks = ADDRS_PER_BLOCK;
323 const long dptrs_per_blk = NIDS_PER_BLOCK; 323 const long dptrs_per_blk = NIDS_PER_BLOCK;
324 const long indirect_blks = ADDRS_PER_BLOCK * NIDS_PER_BLOCK; 324 const long indirect_blks = ADDRS_PER_BLOCK * NIDS_PER_BLOCK;
325 const long dindirect_blks = indirect_blks * NIDS_PER_BLOCK; 325 const long dindirect_blks = indirect_blks * NIDS_PER_BLOCK;
326 int n = 0; 326 int n = 0;
327 int level = 0; 327 int level = 0;
328 328
329 noffset[0] = 0; 329 noffset[0] = 0;
330 330
331 if (block < direct_index) { 331 if (block < direct_index) {
332 offset[n] = block; 332 offset[n] = block;
333 goto got; 333 goto got;
334 } 334 }
335 block -= direct_index; 335 block -= direct_index;
336 if (block < direct_blks) { 336 if (block < direct_blks) {
337 offset[n++] = NODE_DIR1_BLOCK; 337 offset[n++] = NODE_DIR1_BLOCK;
338 noffset[n] = 1; 338 noffset[n] = 1;
339 offset[n] = block; 339 offset[n] = block;
340 level = 1; 340 level = 1;
341 goto got; 341 goto got;
342 } 342 }
343 block -= direct_blks; 343 block -= direct_blks;
344 if (block < direct_blks) { 344 if (block < direct_blks) {
345 offset[n++] = NODE_DIR2_BLOCK; 345 offset[n++] = NODE_DIR2_BLOCK;
346 noffset[n] = 2; 346 noffset[n] = 2;
347 offset[n] = block; 347 offset[n] = block;
348 level = 1; 348 level = 1;
349 goto got; 349 goto got;
350 } 350 }
351 block -= direct_blks; 351 block -= direct_blks;
352 if (block < indirect_blks) { 352 if (block < indirect_blks) {
353 offset[n++] = NODE_IND1_BLOCK; 353 offset[n++] = NODE_IND1_BLOCK;
354 noffset[n] = 3; 354 noffset[n] = 3;
355 offset[n++] = block / direct_blks; 355 offset[n++] = block / direct_blks;
356 noffset[n] = 4 + offset[n - 1]; 356 noffset[n] = 4 + offset[n - 1];
357 offset[n] = block % direct_blks; 357 offset[n] = block % direct_blks;
358 level = 2; 358 level = 2;
359 goto got; 359 goto got;
360 } 360 }
361 block -= indirect_blks; 361 block -= indirect_blks;
362 if (block < indirect_blks) { 362 if (block < indirect_blks) {
363 offset[n++] = NODE_IND2_BLOCK; 363 offset[n++] = NODE_IND2_BLOCK;
364 noffset[n] = 4 + dptrs_per_blk; 364 noffset[n] = 4 + dptrs_per_blk;
365 offset[n++] = block / direct_blks; 365 offset[n++] = block / direct_blks;
366 noffset[n] = 5 + dptrs_per_blk + offset[n - 1]; 366 noffset[n] = 5 + dptrs_per_blk + offset[n - 1];
367 offset[n] = block % direct_blks; 367 offset[n] = block % direct_blks;
368 level = 2; 368 level = 2;
369 goto got; 369 goto got;
370 } 370 }
371 block -= indirect_blks; 371 block -= indirect_blks;
372 if (block < dindirect_blks) { 372 if (block < dindirect_blks) {
373 offset[n++] = NODE_DIND_BLOCK; 373 offset[n++] = NODE_DIND_BLOCK;
374 noffset[n] = 5 + (dptrs_per_blk * 2); 374 noffset[n] = 5 + (dptrs_per_blk * 2);
375 offset[n++] = block / indirect_blks; 375 offset[n++] = block / indirect_blks;
376 noffset[n] = 6 + (dptrs_per_blk * 2) + 376 noffset[n] = 6 + (dptrs_per_blk * 2) +
377 offset[n - 1] * (dptrs_per_blk + 1); 377 offset[n - 1] * (dptrs_per_blk + 1);
378 offset[n++] = (block / direct_blks) % dptrs_per_blk; 378 offset[n++] = (block / direct_blks) % dptrs_per_blk;
379 noffset[n] = 7 + (dptrs_per_blk * 2) + 379 noffset[n] = 7 + (dptrs_per_blk * 2) +
380 offset[n - 2] * (dptrs_per_blk + 1) + 380 offset[n - 2] * (dptrs_per_blk + 1) +
381 offset[n - 1]; 381 offset[n - 1];
382 offset[n] = block % direct_blks; 382 offset[n] = block % direct_blks;
383 level = 3; 383 level = 3;
384 goto got; 384 goto got;
385 } else { 385 } else {
386 BUG(); 386 BUG();
387 } 387 }
388 got: 388 got:
389 return level; 389 return level;
390 } 390 }
391 391
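get_node_path() above tiers a file block index into at most four levels: addresses held directly in the inode, then the two direct node blocks, then the two single-indirect blocks, and finally the double-indirect tree. The sketch below reproduces only the depth decision with made-up sizes; the real ADDRS_PER_INODE, ADDRS_PER_BLOCK and NIDS_PER_BLOCK come from the f2fs headers.

#include <stddef.h>
#include <stdio.h>

#define EX_ADDRS_PER_INODE	923	/* stand-in values, not the on-disk layout */
#define EX_ADDRS_PER_BLOCK	1018
#define EX_NIDS_PER_BLOCK	1018

static int ex_node_depth(long block)
{
	const long direct_index = EX_ADDRS_PER_INODE;
	const long direct_blks = EX_ADDRS_PER_BLOCK;
	const long indirect_blks = (long)EX_ADDRS_PER_BLOCK * EX_NIDS_PER_BLOCK;
	const long dindirect_blks = indirect_blks * EX_NIDS_PER_BLOCK;

	if (block < direct_index)
		return 0;		/* addressed from the inode itself */
	block -= direct_index;
	if (block < 2 * direct_blks)
		return 1;		/* one of the two direct node blocks */
	block -= 2 * direct_blks;
	if (block < 2 * indirect_blks)
		return 2;		/* behind a single-indirect node */
	block -= 2 * indirect_blks;
	if (block < dindirect_blks)
		return 3;		/* behind the double-indirect node */
	return -1;			/* out of range, like the BUG() above */
}

int main(void)
{
	long samples[] = { 0, 1000, 500000, 5000000 };
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("block %ld -> depth %d\n",
		       samples[i], ex_node_depth(samples[i]));
	return 0;
}

With the real constants the cutoffs differ, but the shape of the decision is the same as in the kernel function.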
392 /* 392 /*
393 * Caller should call f2fs_put_dnode(dn). 393 * Caller should call f2fs_put_dnode(dn).
394 * Also, it should grab and release a mutex by calling mutex_lock_op() and 394 * Also, it should grab and release a mutex by calling mutex_lock_op() and
395 * mutex_unlock_op() only if ro is not set to RDONLY_NODE. 395 * mutex_unlock_op() only if ro is not set to RDONLY_NODE.
396 * In the case of RDONLY_NODE, we don't need to care about the mutex. 396 * In the case of RDONLY_NODE, we don't need to care about the mutex.
397 */ 397 */
398 int get_dnode_of_data(struct dnode_of_data *dn, pgoff_t index, int mode) 398 int get_dnode_of_data(struct dnode_of_data *dn, pgoff_t index, int mode)
399 { 399 {
400 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); 400 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
401 struct page *npage[4]; 401 struct page *npage[4];
402 struct page *parent; 402 struct page *parent;
403 int offset[4]; 403 int offset[4];
404 unsigned int noffset[4]; 404 unsigned int noffset[4];
405 nid_t nids[4]; 405 nid_t nids[4];
406 int level, i; 406 int level, i;
407 int err = 0; 407 int err = 0;
408 408
409 level = get_node_path(F2FS_I(dn->inode), index, offset, noffset); 409 level = get_node_path(F2FS_I(dn->inode), index, offset, noffset);
410 410
411 nids[0] = dn->inode->i_ino; 411 nids[0] = dn->inode->i_ino;
412 npage[0] = dn->inode_page; 412 npage[0] = dn->inode_page;
413 413
414 if (!npage[0]) { 414 if (!npage[0]) {
415 npage[0] = get_node_page(sbi, nids[0]); 415 npage[0] = get_node_page(sbi, nids[0]);
416 if (IS_ERR(npage[0])) 416 if (IS_ERR(npage[0]))
417 return PTR_ERR(npage[0]); 417 return PTR_ERR(npage[0]);
418 } 418 }
419 parent = npage[0]; 419 parent = npage[0];
420 if (level != 0) 420 if (level != 0)
421 nids[1] = get_nid(parent, offset[0], true); 421 nids[1] = get_nid(parent, offset[0], true);
422 dn->inode_page = npage[0]; 422 dn->inode_page = npage[0];
423 dn->inode_page_locked = true; 423 dn->inode_page_locked = true;
424 424
425 /* get indirect or direct nodes */ 425 /* get indirect or direct nodes */
426 for (i = 1; i <= level; i++) { 426 for (i = 1; i <= level; i++) {
427 bool done = false; 427 bool done = false;
428 428
429 if (!nids[i] && mode == ALLOC_NODE) { 429 if (!nids[i] && mode == ALLOC_NODE) {
430 /* alloc new node */ 430 /* alloc new node */
431 if (!alloc_nid(sbi, &(nids[i]))) { 431 if (!alloc_nid(sbi, &(nids[i]))) {
432 err = -ENOSPC; 432 err = -ENOSPC;
433 goto release_pages; 433 goto release_pages;
434 } 434 }
435 435
436 dn->nid = nids[i]; 436 dn->nid = nids[i];
437 npage[i] = new_node_page(dn, noffset[i], NULL); 437 npage[i] = new_node_page(dn, noffset[i], NULL);
438 if (IS_ERR(npage[i])) { 438 if (IS_ERR(npage[i])) {
439 alloc_nid_failed(sbi, nids[i]); 439 alloc_nid_failed(sbi, nids[i]);
440 err = PTR_ERR(npage[i]); 440 err = PTR_ERR(npage[i]);
441 goto release_pages; 441 goto release_pages;
442 } 442 }
443 443
444 set_nid(parent, offset[i - 1], nids[i], i == 1); 444 set_nid(parent, offset[i - 1], nids[i], i == 1);
445 alloc_nid_done(sbi, nids[i]); 445 alloc_nid_done(sbi, nids[i]);
446 done = true; 446 done = true;
447 } else if (mode == LOOKUP_NODE_RA && i == level && level > 1) { 447 } else if (mode == LOOKUP_NODE_RA && i == level && level > 1) {
448 npage[i] = get_node_page_ra(parent, offset[i - 1]); 448 npage[i] = get_node_page_ra(parent, offset[i - 1]);
449 if (IS_ERR(npage[i])) { 449 if (IS_ERR(npage[i])) {
450 err = PTR_ERR(npage[i]); 450 err = PTR_ERR(npage[i]);
451 goto release_pages; 451 goto release_pages;
452 } 452 }
453 done = true; 453 done = true;
454 } 454 }
455 if (i == 1) { 455 if (i == 1) {
456 dn->inode_page_locked = false; 456 dn->inode_page_locked = false;
457 unlock_page(parent); 457 unlock_page(parent);
458 } else { 458 } else {
459 f2fs_put_page(parent, 1); 459 f2fs_put_page(parent, 1);
460 } 460 }
461 461
462 if (!done) { 462 if (!done) {
463 npage[i] = get_node_page(sbi, nids[i]); 463 npage[i] = get_node_page(sbi, nids[i]);
464 if (IS_ERR(npage[i])) { 464 if (IS_ERR(npage[i])) {
465 err = PTR_ERR(npage[i]); 465 err = PTR_ERR(npage[i]);
466 f2fs_put_page(npage[0], 0); 466 f2fs_put_page(npage[0], 0);
467 goto release_out; 467 goto release_out;
468 } 468 }
469 } 469 }
470 if (i < level) { 470 if (i < level) {
471 parent = npage[i]; 471 parent = npage[i];
472 nids[i + 1] = get_nid(parent, offset[i], false); 472 nids[i + 1] = get_nid(parent, offset[i], false);
473 } 473 }
474 } 474 }
475 dn->nid = nids[level]; 475 dn->nid = nids[level];
476 dn->ofs_in_node = offset[level]; 476 dn->ofs_in_node = offset[level];
477 dn->node_page = npage[level]; 477 dn->node_page = npage[level];
478 dn->data_blkaddr = datablock_addr(dn->node_page, dn->ofs_in_node); 478 dn->data_blkaddr = datablock_addr(dn->node_page, dn->ofs_in_node);
479 return 0; 479 return 0;
480 480
481 release_pages: 481 release_pages:
482 f2fs_put_page(parent, 1); 482 f2fs_put_page(parent, 1);
483 if (i > 1) 483 if (i > 1)
484 f2fs_put_page(npage[0], 0); 484 f2fs_put_page(npage[0], 0);
485 release_out: 485 release_out:
486 dn->inode_page = NULL; 486 dn->inode_page = NULL;
487 dn->node_page = NULL; 487 dn->node_page = NULL;
488 return err; 488 return err;
489 } 489 }
490 490
491 static void truncate_node(struct dnode_of_data *dn) 491 static void truncate_node(struct dnode_of_data *dn)
492 { 492 {
493 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); 493 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
494 struct node_info ni; 494 struct node_info ni;
495 495
496 get_node_info(sbi, dn->nid, &ni); 496 get_node_info(sbi, dn->nid, &ni);
497 if (dn->inode->i_blocks == 0) { 497 if (dn->inode->i_blocks == 0) {
498 BUG_ON(ni.blk_addr != NULL_ADDR); 498 BUG_ON(ni.blk_addr != NULL_ADDR);
499 goto invalidate; 499 goto invalidate;
500 } 500 }
501 BUG_ON(ni.blk_addr == NULL_ADDR); 501 BUG_ON(ni.blk_addr == NULL_ADDR);
502 502
503 /* Deallocate node address */ 503 /* Deallocate node address */
504 invalidate_blocks(sbi, ni.blk_addr); 504 invalidate_blocks(sbi, ni.blk_addr);
505 dec_valid_node_count(sbi, dn->inode, 1); 505 dec_valid_node_count(sbi, dn->inode, 1);
506 set_node_addr(sbi, &ni, NULL_ADDR); 506 set_node_addr(sbi, &ni, NULL_ADDR);
507 507
508 if (dn->nid == dn->inode->i_ino) { 508 if (dn->nid == dn->inode->i_ino) {
509 remove_orphan_inode(sbi, dn->nid); 509 remove_orphan_inode(sbi, dn->nid);
510 dec_valid_inode_count(sbi); 510 dec_valid_inode_count(sbi);
511 } else { 511 } else {
512 sync_inode_page(dn); 512 sync_inode_page(dn);
513 } 513 }
514 invalidate: 514 invalidate:
515 clear_node_page_dirty(dn->node_page); 515 clear_node_page_dirty(dn->node_page);
516 F2FS_SET_SB_DIRT(sbi); 516 F2FS_SET_SB_DIRT(sbi);
517 517
518 f2fs_put_page(dn->node_page, 1); 518 f2fs_put_page(dn->node_page, 1);
519 dn->node_page = NULL; 519 dn->node_page = NULL;
520 trace_f2fs_truncate_node(dn->inode, dn->nid, ni.blk_addr); 520 trace_f2fs_truncate_node(dn->inode, dn->nid, ni.blk_addr);
521 } 521 }
522 522
523 static int truncate_dnode(struct dnode_of_data *dn) 523 static int truncate_dnode(struct dnode_of_data *dn)
524 { 524 {
525 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); 525 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
526 struct page *page; 526 struct page *page;
527 527
528 if (dn->nid == 0) 528 if (dn->nid == 0)
529 return 1; 529 return 1;
530 530
531 /* get direct node */ 531 /* get direct node */
532 page = get_node_page(sbi, dn->nid); 532 page = get_node_page(sbi, dn->nid);
533 if (IS_ERR(page) && PTR_ERR(page) == -ENOENT) 533 if (IS_ERR(page) && PTR_ERR(page) == -ENOENT)
534 return 1; 534 return 1;
535 else if (IS_ERR(page)) 535 else if (IS_ERR(page))
536 return PTR_ERR(page); 536 return PTR_ERR(page);
537 537
538 /* Make dnode_of_data for parameter */ 538 /* Make dnode_of_data for parameter */
539 dn->node_page = page; 539 dn->node_page = page;
540 dn->ofs_in_node = 0; 540 dn->ofs_in_node = 0;
541 truncate_data_blocks(dn); 541 truncate_data_blocks(dn);
542 truncate_node(dn); 542 truncate_node(dn);
543 return 1; 543 return 1;
544 } 544 }
545 545
546 static int truncate_nodes(struct dnode_of_data *dn, unsigned int nofs, 546 static int truncate_nodes(struct dnode_of_data *dn, unsigned int nofs,
547 int ofs, int depth) 547 int ofs, int depth)
548 { 548 {
549 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); 549 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
550 struct dnode_of_data rdn = *dn; 550 struct dnode_of_data rdn = *dn;
551 struct page *page; 551 struct page *page;
552 struct f2fs_node *rn; 552 struct f2fs_node *rn;
553 nid_t child_nid; 553 nid_t child_nid;
554 unsigned int child_nofs; 554 unsigned int child_nofs;
555 int freed = 0; 555 int freed = 0;
556 int i, ret; 556 int i, ret;
557 557
558 if (dn->nid == 0) 558 if (dn->nid == 0)
559 return NIDS_PER_BLOCK + 1; 559 return NIDS_PER_BLOCK + 1;
560 560
561 trace_f2fs_truncate_nodes_enter(dn->inode, dn->nid, dn->data_blkaddr); 561 trace_f2fs_truncate_nodes_enter(dn->inode, dn->nid, dn->data_blkaddr);
562 562
563 page = get_node_page(sbi, dn->nid); 563 page = get_node_page(sbi, dn->nid);
564 if (IS_ERR(page)) { 564 if (IS_ERR(page)) {
565 trace_f2fs_truncate_nodes_exit(dn->inode, PTR_ERR(page)); 565 trace_f2fs_truncate_nodes_exit(dn->inode, PTR_ERR(page));
566 return PTR_ERR(page); 566 return PTR_ERR(page);
567 } 567 }
568 568
569 rn = F2FS_NODE(page); 569 rn = F2FS_NODE(page);
570 if (depth < 3) { 570 if (depth < 3) {
571 for (i = ofs; i < NIDS_PER_BLOCK; i++, freed++) { 571 for (i = ofs; i < NIDS_PER_BLOCK; i++, freed++) {
572 child_nid = le32_to_cpu(rn->in.nid[i]); 572 child_nid = le32_to_cpu(rn->in.nid[i]);
573 if (child_nid == 0) 573 if (child_nid == 0)
574 continue; 574 continue;
575 rdn.nid = child_nid; 575 rdn.nid = child_nid;
576 ret = truncate_dnode(&rdn); 576 ret = truncate_dnode(&rdn);
577 if (ret < 0) 577 if (ret < 0)
578 goto out_err; 578 goto out_err;
579 set_nid(page, i, 0, false); 579 set_nid(page, i, 0, false);
580 } 580 }
581 } else { 581 } else {
582 child_nofs = nofs + ofs * (NIDS_PER_BLOCK + 1) + 1; 582 child_nofs = nofs + ofs * (NIDS_PER_BLOCK + 1) + 1;
583 for (i = ofs; i < NIDS_PER_BLOCK; i++) { 583 for (i = ofs; i < NIDS_PER_BLOCK; i++) {
584 child_nid = le32_to_cpu(rn->in.nid[i]); 584 child_nid = le32_to_cpu(rn->in.nid[i]);
585 if (child_nid == 0) { 585 if (child_nid == 0) {
586 child_nofs += NIDS_PER_BLOCK + 1; 586 child_nofs += NIDS_PER_BLOCK + 1;
587 continue; 587 continue;
588 } 588 }
589 rdn.nid = child_nid; 589 rdn.nid = child_nid;
590 ret = truncate_nodes(&rdn, child_nofs, 0, depth - 1); 590 ret = truncate_nodes(&rdn, child_nofs, 0, depth - 1);
591 if (ret == (NIDS_PER_BLOCK + 1)) { 591 if (ret == (NIDS_PER_BLOCK + 1)) {
592 set_nid(page, i, 0, false); 592 set_nid(page, i, 0, false);
593 child_nofs += ret; 593 child_nofs += ret;
594 } else if (ret < 0 && ret != -ENOENT) { 594 } else if (ret < 0 && ret != -ENOENT) {
595 goto out_err; 595 goto out_err;
596 } 596 }
597 } 597 }
598 freed = child_nofs; 598 freed = child_nofs;
599 } 599 }
600 600
601 if (!ofs) { 601 if (!ofs) {
602 /* remove current indirect node */ 602 /* remove current indirect node */
603 dn->node_page = page; 603 dn->node_page = page;
604 truncate_node(dn); 604 truncate_node(dn);
605 freed++; 605 freed++;
606 } else { 606 } else {
607 f2fs_put_page(page, 1); 607 f2fs_put_page(page, 1);
608 } 608 }
609 trace_f2fs_truncate_nodes_exit(dn->inode, freed); 609 trace_f2fs_truncate_nodes_exit(dn->inode, freed);
610 return freed; 610 return freed;
611 611
612 out_err: 612 out_err:
613 f2fs_put_page(page, 1); 613 f2fs_put_page(page, 1);
614 trace_f2fs_truncate_nodes_exit(dn->inode, ret); 614 trace_f2fs_truncate_nodes_exit(dn->inode, ret);
615 return ret; 615 return ret;
616 } 616 }
617 617
618 static int truncate_partial_nodes(struct dnode_of_data *dn, 618 static int truncate_partial_nodes(struct dnode_of_data *dn,
619 struct f2fs_inode *ri, int *offset, int depth) 619 struct f2fs_inode *ri, int *offset, int depth)
620 { 620 {
621 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); 621 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
622 struct page *pages[2]; 622 struct page *pages[2];
623 nid_t nid[3]; 623 nid_t nid[3];
624 nid_t child_nid; 624 nid_t child_nid;
625 int err = 0; 625 int err = 0;
626 int i; 626 int i;
627 int idx = depth - 2; 627 int idx = depth - 2;
628 628
629 nid[0] = le32_to_cpu(ri->i_nid[offset[0] - NODE_DIR1_BLOCK]); 629 nid[0] = le32_to_cpu(ri->i_nid[offset[0] - NODE_DIR1_BLOCK]);
630 if (!nid[0]) 630 if (!nid[0])
631 return 0; 631 return 0;
632 632
633 /* get indirect nodes in the path */ 633 /* get indirect nodes in the path */
634 for (i = 0; i < depth - 1; i++) { 634 for (i = 0; i < depth - 1; i++) {
635 /* reference count will be increased */ 635 /* reference count will be increased */
636 pages[i] = get_node_page(sbi, nid[i]); 636 pages[i] = get_node_page(sbi, nid[i]);
637 if (IS_ERR(pages[i])) { 637 if (IS_ERR(pages[i])) {
638 depth = i + 1; 638 depth = i + 1;
639 err = PTR_ERR(pages[i]); 639 err = PTR_ERR(pages[i]);
640 goto fail; 640 goto fail;
641 } 641 }
642 nid[i + 1] = get_nid(pages[i], offset[i + 1], false); 642 nid[i + 1] = get_nid(pages[i], offset[i + 1], false);
643 } 643 }
644 644
645 /* free direct nodes linked to a partial indirect node */ 645 /* free direct nodes linked to a partial indirect node */
646 for (i = offset[depth - 1]; i < NIDS_PER_BLOCK; i++) { 646 for (i = offset[depth - 1]; i < NIDS_PER_BLOCK; i++) {
647 child_nid = get_nid(pages[idx], i, false); 647 child_nid = get_nid(pages[idx], i, false);
648 if (!child_nid) 648 if (!child_nid)
649 continue; 649 continue;
650 dn->nid = child_nid; 650 dn->nid = child_nid;
651 err = truncate_dnode(dn); 651 err = truncate_dnode(dn);
652 if (err < 0) 652 if (err < 0)
653 goto fail; 653 goto fail;
654 set_nid(pages[idx], i, 0, false); 654 set_nid(pages[idx], i, 0, false);
655 } 655 }
656 656
657 if (offset[depth - 1] == 0) { 657 if (offset[depth - 1] == 0) {
658 dn->node_page = pages[idx]; 658 dn->node_page = pages[idx];
659 dn->nid = nid[idx]; 659 dn->nid = nid[idx];
660 truncate_node(dn); 660 truncate_node(dn);
661 } else { 661 } else {
662 f2fs_put_page(pages[idx], 1); 662 f2fs_put_page(pages[idx], 1);
663 } 663 }
664 offset[idx]++; 664 offset[idx]++;
665 offset[depth - 1] = 0; 665 offset[depth - 1] = 0;
666 fail: 666 fail:
667 for (i = depth - 3; i >= 0; i--) 667 for (i = depth - 3; i >= 0; i--)
668 f2fs_put_page(pages[i], 1); 668 f2fs_put_page(pages[i], 1);
669 669
670 trace_f2fs_truncate_partial_nodes(dn->inode, nid, depth, err); 670 trace_f2fs_truncate_partial_nodes(dn->inode, nid, depth, err);
671 671
672 return err; 672 return err;
673 } 673 }
674 674
675 /* 675 /*
676 * All the block addresses of data and nodes should be nullified. 676 * All the block addresses of data and nodes should be nullified.
677 */ 677 */
678 int truncate_inode_blocks(struct inode *inode, pgoff_t from) 678 int truncate_inode_blocks(struct inode *inode, pgoff_t from)
679 { 679 {
680 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 680 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
681 struct address_space *node_mapping = sbi->node_inode->i_mapping; 681 struct address_space *node_mapping = sbi->node_inode->i_mapping;
682 int err = 0, cont = 1; 682 int err = 0, cont = 1;
683 int level, offset[4], noffset[4]; 683 int level, offset[4], noffset[4];
684 unsigned int nofs = 0; 684 unsigned int nofs = 0;
685 struct f2fs_node *rn; 685 struct f2fs_node *rn;
686 struct dnode_of_data dn; 686 struct dnode_of_data dn;
687 struct page *page; 687 struct page *page;
688 688
689 trace_f2fs_truncate_inode_blocks_enter(inode, from); 689 trace_f2fs_truncate_inode_blocks_enter(inode, from);
690 690
691 level = get_node_path(F2FS_I(inode), from, offset, noffset); 691 level = get_node_path(F2FS_I(inode), from, offset, noffset);
692 restart: 692 restart:
693 page = get_node_page(sbi, inode->i_ino); 693 page = get_node_page(sbi, inode->i_ino);
694 if (IS_ERR(page)) { 694 if (IS_ERR(page)) {
695 trace_f2fs_truncate_inode_blocks_exit(inode, PTR_ERR(page)); 695 trace_f2fs_truncate_inode_blocks_exit(inode, PTR_ERR(page));
696 return PTR_ERR(page); 696 return PTR_ERR(page);
697 } 697 }
698 698
699 set_new_dnode(&dn, inode, page, NULL, 0); 699 set_new_dnode(&dn, inode, page, NULL, 0);
700 unlock_page(page); 700 unlock_page(page);
701 701
702 rn = F2FS_NODE(page); 702 rn = F2FS_NODE(page);
703 switch (level) { 703 switch (level) {
704 case 0: 704 case 0:
705 case 1: 705 case 1:
706 nofs = noffset[1]; 706 nofs = noffset[1];
707 break; 707 break;
708 case 2: 708 case 2:
709 nofs = noffset[1]; 709 nofs = noffset[1];
710 if (!offset[level - 1]) 710 if (!offset[level - 1])
711 goto skip_partial; 711 goto skip_partial;
712 err = truncate_partial_nodes(&dn, &rn->i, offset, level); 712 err = truncate_partial_nodes(&dn, &rn->i, offset, level);
713 if (err < 0 && err != -ENOENT) 713 if (err < 0 && err != -ENOENT)
714 goto fail; 714 goto fail;
715 nofs += 1 + NIDS_PER_BLOCK; 715 nofs += 1 + NIDS_PER_BLOCK;
716 break; 716 break;
717 case 3: 717 case 3:
718 nofs = 5 + 2 * NIDS_PER_BLOCK; 718 nofs = 5 + 2 * NIDS_PER_BLOCK;
719 if (!offset[level - 1]) 719 if (!offset[level - 1])
720 goto skip_partial; 720 goto skip_partial;
721 err = truncate_partial_nodes(&dn, &rn->i, offset, level); 721 err = truncate_partial_nodes(&dn, &rn->i, offset, level);
722 if (err < 0 && err != -ENOENT) 722 if (err < 0 && err != -ENOENT)
723 goto fail; 723 goto fail;
724 break; 724 break;
725 default: 725 default:
726 BUG(); 726 BUG();
727 } 727 }
728 728
729 skip_partial: 729 skip_partial:
730 while (cont) { 730 while (cont) {
731 dn.nid = le32_to_cpu(rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]); 731 dn.nid = le32_to_cpu(rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]);
732 switch (offset[0]) { 732 switch (offset[0]) {
733 case NODE_DIR1_BLOCK: 733 case NODE_DIR1_BLOCK:
734 case NODE_DIR2_BLOCK: 734 case NODE_DIR2_BLOCK:
735 err = truncate_dnode(&dn); 735 err = truncate_dnode(&dn);
736 break; 736 break;
737 737
738 case NODE_IND1_BLOCK: 738 case NODE_IND1_BLOCK:
739 case NODE_IND2_BLOCK: 739 case NODE_IND2_BLOCK:
740 err = truncate_nodes(&dn, nofs, offset[1], 2); 740 err = truncate_nodes(&dn, nofs, offset[1], 2);
741 break; 741 break;
742 742
743 case NODE_DIND_BLOCK: 743 case NODE_DIND_BLOCK:
744 err = truncate_nodes(&dn, nofs, offset[1], 3); 744 err = truncate_nodes(&dn, nofs, offset[1], 3);
745 cont = 0; 745 cont = 0;
746 break; 746 break;
747 747
748 default: 748 default:
749 BUG(); 749 BUG();
750 } 750 }
751 if (err < 0 && err != -ENOENT) 751 if (err < 0 && err != -ENOENT)
752 goto fail; 752 goto fail;
753 if (offset[1] == 0 && 753 if (offset[1] == 0 &&
754 rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]) { 754 rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]) {
755 lock_page(page); 755 lock_page(page);
756 if (page->mapping != node_mapping) { 756 if (page->mapping != node_mapping) {
757 f2fs_put_page(page, 1); 757 f2fs_put_page(page, 1);
758 goto restart; 758 goto restart;
759 } 759 }
760 wait_on_page_writeback(page); 760 wait_on_page_writeback(page);
761 rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK] = 0; 761 rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK] = 0;
762 set_page_dirty(page); 762 set_page_dirty(page);
763 unlock_page(page); 763 unlock_page(page);
764 } 764 }
765 offset[1] = 0; 765 offset[1] = 0;
766 offset[0]++; 766 offset[0]++;
767 nofs += err; 767 nofs += err;
768 } 768 }
769 fail: 769 fail:
770 f2fs_put_page(page, 0); 770 f2fs_put_page(page, 0);
771 trace_f2fs_truncate_inode_blocks_exit(inode, err); 771 trace_f2fs_truncate_inode_blocks_exit(inode, err);
772 return err > 0 ? 0 : err; 772 return err > 0 ? 0 : err;
773 } 773 }
774 774
775 int truncate_xattr_node(struct inode *inode, struct page *page) 775 int truncate_xattr_node(struct inode *inode, struct page *page)
776 { 776 {
777 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 777 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
778 nid_t nid = F2FS_I(inode)->i_xattr_nid; 778 nid_t nid = F2FS_I(inode)->i_xattr_nid;
779 struct dnode_of_data dn; 779 struct dnode_of_data dn;
780 struct page *npage; 780 struct page *npage;
781 781
782 if (!nid) 782 if (!nid)
783 return 0; 783 return 0;
784 784
785 npage = get_node_page(sbi, nid); 785 npage = get_node_page(sbi, nid);
786 if (IS_ERR(npage)) 786 if (IS_ERR(npage))
787 return PTR_ERR(npage); 787 return PTR_ERR(npage);
788 788
789 F2FS_I(inode)->i_xattr_nid = 0; 789 F2FS_I(inode)->i_xattr_nid = 0;
790 790
791 /* need to do checkpoint during fsync */ 791 /* need to do checkpoint during fsync */
792 F2FS_I(inode)->xattr_ver = cur_cp_version(F2FS_CKPT(sbi)); 792 F2FS_I(inode)->xattr_ver = cur_cp_version(F2FS_CKPT(sbi));
793 793
794 set_new_dnode(&dn, inode, page, npage, nid); 794 set_new_dnode(&dn, inode, page, npage, nid);
795 795
796 if (page) 796 if (page)
797 dn.inode_page_locked = 1; 797 dn.inode_page_locked = 1;
798 truncate_node(&dn); 798 truncate_node(&dn);
799 return 0; 799 return 0;
800 } 800 }
801 801
802 /* 802 /*
803 * Caller should grab and release a mutex by calling mutex_lock_op() and 803 * Caller should grab and release a mutex by calling mutex_lock_op() and
804 * mutex_unlock_op(). 804 * mutex_unlock_op().
805 */ 805 */
806 int remove_inode_page(struct inode *inode) 806 int remove_inode_page(struct inode *inode)
807 { 807 {
808 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 808 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
809 struct page *page; 809 struct page *page;
810 nid_t ino = inode->i_ino; 810 nid_t ino = inode->i_ino;
811 struct dnode_of_data dn; 811 struct dnode_of_data dn;
812 int err; 812 int err;
813 813
814 page = get_node_page(sbi, ino); 814 page = get_node_page(sbi, ino);
815 if (IS_ERR(page)) 815 if (IS_ERR(page))
816 return PTR_ERR(page); 816 return PTR_ERR(page);
817 817
818 err = truncate_xattr_node(inode, page); 818 err = truncate_xattr_node(inode, page);
819 if (err) { 819 if (err) {
820 f2fs_put_page(page, 1); 820 f2fs_put_page(page, 1);
821 return err; 821 return err;
822 } 822 }
823 823
824 /* 0 is possible after f2fs_new_inode() has failed */ 824 /* 0 is possible after f2fs_new_inode() has failed */
825 BUG_ON(inode->i_blocks != 0 && inode->i_blocks != 1); 825 BUG_ON(inode->i_blocks != 0 && inode->i_blocks != 1);
826 set_new_dnode(&dn, inode, page, page, ino); 826 set_new_dnode(&dn, inode, page, page, ino);
827 truncate_node(&dn); 827 truncate_node(&dn);
828 return 0; 828 return 0;
829 } 829 }
830 830
831 struct page *new_inode_page(struct inode *inode, const struct qstr *name) 831 struct page *new_inode_page(struct inode *inode, const struct qstr *name)
832 { 832 {
833 struct dnode_of_data dn; 833 struct dnode_of_data dn;
834 834
835 /* allocate inode page for new inode */ 835 /* allocate inode page for new inode */
836 set_new_dnode(&dn, inode, NULL, NULL, inode->i_ino); 836 set_new_dnode(&dn, inode, NULL, NULL, inode->i_ino);
837 837
838 /* caller should f2fs_put_page(page, 1); */ 838 /* caller should f2fs_put_page(page, 1); */
839 return new_node_page(&dn, 0, NULL); 839 return new_node_page(&dn, 0, NULL);
840 } 840 }
841 841
842 struct page *new_node_page(struct dnode_of_data *dn, 842 struct page *new_node_page(struct dnode_of_data *dn,
843 unsigned int ofs, struct page *ipage) 843 unsigned int ofs, struct page *ipage)
844 { 844 {
845 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); 845 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
846 struct address_space *mapping = sbi->node_inode->i_mapping; 846 struct address_space *mapping = sbi->node_inode->i_mapping;
847 struct node_info old_ni, new_ni; 847 struct node_info old_ni, new_ni;
848 struct page *page; 848 struct page *page;
849 int err; 849 int err;
850 850
851 if (is_inode_flag_set(F2FS_I(dn->inode), FI_NO_ALLOC)) 851 if (is_inode_flag_set(F2FS_I(dn->inode), FI_NO_ALLOC))
852 return ERR_PTR(-EPERM); 852 return ERR_PTR(-EPERM);
853 853
854 page = grab_cache_page(mapping, dn->nid); 854 page = grab_cache_page(mapping, dn->nid);
855 if (!page) 855 if (!page)
856 return ERR_PTR(-ENOMEM); 856 return ERR_PTR(-ENOMEM);
857 857
858 if (!inc_valid_node_count(sbi, dn->inode, 1)) { 858 if (!inc_valid_node_count(sbi, dn->inode, 1)) {
859 err = -ENOSPC; 859 err = -ENOSPC;
860 goto fail; 860 goto fail;
861 } 861 }
862 862
863 get_node_info(sbi, dn->nid, &old_ni); 863 get_node_info(sbi, dn->nid, &old_ni);
864 864
865 /* Reinitialize old_ni with new node page */ 865 /* Reinitialize old_ni with new node page */
866 BUG_ON(old_ni.blk_addr != NULL_ADDR); 866 BUG_ON(old_ni.blk_addr != NULL_ADDR);
867 new_ni = old_ni; 867 new_ni = old_ni;
868 new_ni.ino = dn->inode->i_ino; 868 new_ni.ino = dn->inode->i_ino;
869 set_node_addr(sbi, &new_ni, NEW_ADDR); 869 set_node_addr(sbi, &new_ni, NEW_ADDR);
870 870
871 fill_node_footer(page, dn->nid, dn->inode->i_ino, ofs, true); 871 fill_node_footer(page, dn->nid, dn->inode->i_ino, ofs, true);
872 set_cold_node(dn->inode, page); 872 set_cold_node(dn->inode, page);
873 SetPageUptodate(page); 873 SetPageUptodate(page);
874 set_page_dirty(page); 874 set_page_dirty(page);
875 875
876 if (ofs == XATTR_NODE_OFFSET) 876 if (ofs == XATTR_NODE_OFFSET)
877 F2FS_I(dn->inode)->i_xattr_nid = dn->nid; 877 F2FS_I(dn->inode)->i_xattr_nid = dn->nid;
878 878
879 dn->node_page = page; 879 dn->node_page = page;
880 if (ipage) 880 if (ipage)
881 update_inode(dn->inode, ipage); 881 update_inode(dn->inode, ipage);
882 else 882 else
883 sync_inode_page(dn); 883 sync_inode_page(dn);
884 if (ofs == 0) 884 if (ofs == 0)
885 inc_valid_inode_count(sbi); 885 inc_valid_inode_count(sbi);
886 886
887 return page; 887 return page;
888 888
889 fail: 889 fail:
890 clear_node_page_dirty(page); 890 clear_node_page_dirty(page);
891 f2fs_put_page(page, 1); 891 f2fs_put_page(page, 1);
892 return ERR_PTR(err); 892 return ERR_PTR(err);
893 } 893 }
894 894
895 /* 895 /*
896 * Caller should act as follows depending on the return value: 896 * Caller should act as follows depending on the return value:
897 * 0: f2fs_put_page(page, 0) 897 * 0: f2fs_put_page(page, 0)
898 * LOCKED_PAGE: f2fs_put_page(page, 1) 898 * LOCKED_PAGE: f2fs_put_page(page, 1)
899 * error: nothing 899 * error: nothing
900 */ 900 */
901 static int read_node_page(struct page *page, int type) 901 static int read_node_page(struct page *page, int type)
902 { 902 {
903 struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb); 903 struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb);
904 struct node_info ni; 904 struct node_info ni;
905 905
906 get_node_info(sbi, page->index, &ni); 906 get_node_info(sbi, page->index, &ni);
907 907
908 if (ni.blk_addr == NULL_ADDR) { 908 if (ni.blk_addr == NULL_ADDR) {
909 f2fs_put_page(page, 1); 909 f2fs_put_page(page, 1);
910 return -ENOENT; 910 return -ENOENT;
911 } 911 }
912 912
913 if (PageUptodate(page)) 913 if (PageUptodate(page))
914 return LOCKED_PAGE; 914 return LOCKED_PAGE;
915 915
916 return f2fs_readpage(sbi, page, ni.blk_addr, type); 916 return f2fs_readpage(sbi, page, ni.blk_addr, type);
917 } 917 }
918 918
919 /* 919 /*
920 * Readahead a node page 920 * Readahead a node page
921 */ 921 */
922 void ra_node_page(struct f2fs_sb_info *sbi, nid_t nid) 922 void ra_node_page(struct f2fs_sb_info *sbi, nid_t nid)
923 { 923 {
924 struct address_space *mapping = sbi->node_inode->i_mapping; 924 struct address_space *mapping = sbi->node_inode->i_mapping;
925 struct page *apage; 925 struct page *apage;
926 int err; 926 int err;
927 927
928 apage = find_get_page(mapping, nid); 928 apage = find_get_page(mapping, nid);
929 if (apage && PageUptodate(apage)) { 929 if (apage && PageUptodate(apage)) {
930 f2fs_put_page(apage, 0); 930 f2fs_put_page(apage, 0);
931 return; 931 return;
932 } 932 }
933 f2fs_put_page(apage, 0); 933 f2fs_put_page(apage, 0);
934 934
935 apage = grab_cache_page(mapping, nid); 935 apage = grab_cache_page(mapping, nid);
936 if (!apage) 936 if (!apage)
937 return; 937 return;
938 938
939 err = read_node_page(apage, READA); 939 err = read_node_page(apage, READA);
940 if (err == 0) 940 if (err == 0)
941 f2fs_put_page(apage, 0); 941 f2fs_put_page(apage, 0);
942 else if (err == LOCKED_PAGE) 942 else if (err == LOCKED_PAGE)
943 f2fs_put_page(apage, 1); 943 f2fs_put_page(apage, 1);
944 } 944 }
945 945
946 struct page *get_node_page(struct f2fs_sb_info *sbi, pgoff_t nid) 946 struct page *get_node_page(struct f2fs_sb_info *sbi, pgoff_t nid)
947 { 947 {
948 struct address_space *mapping = sbi->node_inode->i_mapping; 948 struct address_space *mapping = sbi->node_inode->i_mapping;
949 struct page *page; 949 struct page *page;
950 int err; 950 int err;
951 repeat: 951 repeat:
952 page = grab_cache_page(mapping, nid); 952 page = grab_cache_page(mapping, nid);
953 if (!page) 953 if (!page)
954 return ERR_PTR(-ENOMEM); 954 return ERR_PTR(-ENOMEM);
955 955
956 err = read_node_page(page, READ_SYNC); 956 err = read_node_page(page, READ_SYNC);
957 if (err < 0) 957 if (err < 0)
958 return ERR_PTR(err); 958 return ERR_PTR(err);
959 else if (err == LOCKED_PAGE) 959 else if (err == LOCKED_PAGE)
960 goto got_it; 960 goto got_it;
961 961
962 lock_page(page); 962 lock_page(page);
963 if (!PageUptodate(page)) { 963 if (!PageUptodate(page)) {
964 f2fs_put_page(page, 1); 964 f2fs_put_page(page, 1);
965 return ERR_PTR(-EIO); 965 return ERR_PTR(-EIO);
966 } 966 }
967 if (page->mapping != mapping) { 967 if (page->mapping != mapping) {
968 f2fs_put_page(page, 1); 968 f2fs_put_page(page, 1);
969 goto repeat; 969 goto repeat;
970 } 970 }
971 got_it: 971 got_it:
972 BUG_ON(nid != nid_of_node(page)); 972 BUG_ON(nid != nid_of_node(page));
973 mark_page_accessed(page);
974 return page; 973 return page;
975 } 974 }
976 975
977 /* 976 /*
978 * Return a locked page for the desired node page. 977 * Return a locked page for the desired node page.
979 * Also readahead MAX_RA_NODE node pages. 978 * Also readahead MAX_RA_NODE node pages.
980 */ 979 */
981 struct page *get_node_page_ra(struct page *parent, int start) 980 struct page *get_node_page_ra(struct page *parent, int start)
982 { 981 {
983 struct f2fs_sb_info *sbi = F2FS_SB(parent->mapping->host->i_sb); 982 struct f2fs_sb_info *sbi = F2FS_SB(parent->mapping->host->i_sb);
984 struct address_space *mapping = sbi->node_inode->i_mapping; 983 struct address_space *mapping = sbi->node_inode->i_mapping;
985 struct blk_plug plug; 984 struct blk_plug plug;
986 struct page *page; 985 struct page *page;
987 int err, i, end; 986 int err, i, end;
988 nid_t nid; 987 nid_t nid;
989 988
990 /* First, try getting the desired direct node. */ 989 /* First, try getting the desired direct node. */
991 nid = get_nid(parent, start, false); 990 nid = get_nid(parent, start, false);
992 if (!nid) 991 if (!nid)
993 return ERR_PTR(-ENOENT); 992 return ERR_PTR(-ENOENT);
994 repeat: 993 repeat:
995 page = grab_cache_page(mapping, nid); 994 page = grab_cache_page(mapping, nid);
996 if (!page) 995 if (!page)
997 return ERR_PTR(-ENOMEM); 996 return ERR_PTR(-ENOMEM);
998 997
999 err = read_node_page(page, READ_SYNC); 998 err = read_node_page(page, READ_SYNC);
1000 if (err < 0) 999 if (err < 0)
1001 return ERR_PTR(err); 1000 return ERR_PTR(err);
1002 else if (err == LOCKED_PAGE) 1001 else if (err == LOCKED_PAGE)
1003 goto page_hit; 1002 goto page_hit;
1004 1003
1005 blk_start_plug(&plug); 1004 blk_start_plug(&plug);
1006 1005
1007 /* Then, try readahead for siblings of the desired node */ 1006 /* Then, try readahead for siblings of the desired node */
1008 end = start + MAX_RA_NODE; 1007 end = start + MAX_RA_NODE;
1009 end = min(end, NIDS_PER_BLOCK); 1008 end = min(end, NIDS_PER_BLOCK);
1010 for (i = start + 1; i < end; i++) { 1009 for (i = start + 1; i < end; i++) {
1011 nid = get_nid(parent, i, false); 1010 nid = get_nid(parent, i, false);
1012 if (!nid) 1011 if (!nid)
1013 continue; 1012 continue;
1014 ra_node_page(sbi, nid); 1013 ra_node_page(sbi, nid);
1015 } 1014 }
1016 1015
1017 blk_finish_plug(&plug); 1016 blk_finish_plug(&plug);
1018 1017
1019 lock_page(page); 1018 lock_page(page);
1020 if (page->mapping != mapping) { 1019 if (page->mapping != mapping) {
1021 f2fs_put_page(page, 1); 1020 f2fs_put_page(page, 1);
1022 goto repeat; 1021 goto repeat;
1023 } 1022 }
1024 page_hit: 1023 page_hit:
1025 if (!PageUptodate(page)) { 1024 if (!PageUptodate(page)) {
1026 f2fs_put_page(page, 1); 1025 f2fs_put_page(page, 1);
1027 return ERR_PTR(-EIO); 1026 return ERR_PTR(-EIO);
1028 } 1027 }
1029 mark_page_accessed(page);
1030 return page; 1028 return page;
1031 } 1029 }
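The mark_page_accessed() dropped just above follows the same reasoning as in get_node_page(): the page came from grab_cache_page(), so the accessed hint has already been applied by the allocation path and the explicit call would only repeat it with an atomic update.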
1032 1030
1033 void sync_inode_page(struct dnode_of_data *dn) 1031 void sync_inode_page(struct dnode_of_data *dn)
1034 { 1032 {
1035 if (IS_INODE(dn->node_page) || dn->inode_page == dn->node_page) { 1033 if (IS_INODE(dn->node_page) || dn->inode_page == dn->node_page) {
1036 update_inode(dn->inode, dn->node_page); 1034 update_inode(dn->inode, dn->node_page);
1037 } else if (dn->inode_page) { 1035 } else if (dn->inode_page) {
1038 if (!dn->inode_page_locked) 1036 if (!dn->inode_page_locked)
1039 lock_page(dn->inode_page); 1037 lock_page(dn->inode_page);
1040 update_inode(dn->inode, dn->inode_page); 1038 update_inode(dn->inode, dn->inode_page);
1041 if (!dn->inode_page_locked) 1039 if (!dn->inode_page_locked)
1042 unlock_page(dn->inode_page); 1040 unlock_page(dn->inode_page);
1043 } else { 1041 } else {
1044 update_inode_page(dn->inode); 1042 update_inode_page(dn->inode);
1045 } 1043 }
1046 } 1044 }
1047 1045
1048 int sync_node_pages(struct f2fs_sb_info *sbi, nid_t ino, 1046 int sync_node_pages(struct f2fs_sb_info *sbi, nid_t ino,
1049 struct writeback_control *wbc) 1047 struct writeback_control *wbc)
1050 { 1048 {
1051 struct address_space *mapping = sbi->node_inode->i_mapping; 1049 struct address_space *mapping = sbi->node_inode->i_mapping;
1052 pgoff_t index, end; 1050 pgoff_t index, end;
1053 struct pagevec pvec; 1051 struct pagevec pvec;
1054 int step = ino ? 2 : 0; 1052 int step = ino ? 2 : 0;
1055 int nwritten = 0, wrote = 0; 1053 int nwritten = 0, wrote = 0;
1056 1054
1057 pagevec_init(&pvec, 0); 1055 pagevec_init(&pvec, 0);
1058 1056
1059 next_step: 1057 next_step:
1060 index = 0; 1058 index = 0;
1061 end = LONG_MAX; 1059 end = LONG_MAX;
1062 1060
1063 while (index <= end) { 1061 while (index <= end) {
1064 int i, nr_pages; 1062 int i, nr_pages;
1065 nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, 1063 nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
1066 PAGECACHE_TAG_DIRTY, 1064 PAGECACHE_TAG_DIRTY,
1067 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); 1065 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
1068 if (nr_pages == 0) 1066 if (nr_pages == 0)
1069 break; 1067 break;
1070 1068
1071 for (i = 0; i < nr_pages; i++) { 1069 for (i = 0; i < nr_pages; i++) {
1072 struct page *page = pvec.pages[i]; 1070 struct page *page = pvec.pages[i];
1073 1071
1074 /* 1072 /*
1075 * flushing sequence with step: 1073 * flushing sequence with step:
1076 * 0. indirect nodes 1074 * 0. indirect nodes
1077 * 1. dentry dnodes 1075 * 1. dentry dnodes
1078 * 2. file dnodes 1076 * 2. file dnodes
1079 */ 1077 */
1080 if (step == 0 && IS_DNODE(page)) 1078 if (step == 0 && IS_DNODE(page))
1081 continue; 1079 continue;
1082 if (step == 1 && (!IS_DNODE(page) || 1080 if (step == 1 && (!IS_DNODE(page) ||
1083 is_cold_node(page))) 1081 is_cold_node(page)))
1084 continue; 1082 continue;
1085 if (step == 2 && (!IS_DNODE(page) || 1083 if (step == 2 && (!IS_DNODE(page) ||
1086 !is_cold_node(page))) 1084 !is_cold_node(page)))
1087 continue; 1085 continue;
1088 1086
1089 /* 1087 /*
1090 * If in fsync mode, 1088 * If in fsync mode,
1091 * we should not skip writing node pages. 1089 * we should not skip writing node pages.
1092 */ 1090 */
1093 if (ino && ino_of_node(page) == ino) 1091 if (ino && ino_of_node(page) == ino)
1094 lock_page(page); 1092 lock_page(page);
1095 else if (!trylock_page(page)) 1093 else if (!trylock_page(page))
1096 continue; 1094 continue;
1097 1095
1098 if (unlikely(page->mapping != mapping)) { 1096 if (unlikely(page->mapping != mapping)) {
1099 continue_unlock: 1097 continue_unlock:
1100 unlock_page(page); 1098 unlock_page(page);
1101 continue; 1099 continue;
1102 } 1100 }
1103 if (ino && ino_of_node(page) != ino) 1101 if (ino && ino_of_node(page) != ino)
1104 goto continue_unlock; 1102 goto continue_unlock;
1105 1103
1106 if (!PageDirty(page)) { 1104 if (!PageDirty(page)) {
1107 /* someone wrote it for us */ 1105 /* someone wrote it for us */
1108 goto continue_unlock; 1106 goto continue_unlock;
1109 } 1107 }
1110 1108
1111 if (!clear_page_dirty_for_io(page)) 1109 if (!clear_page_dirty_for_io(page))
1112 goto continue_unlock; 1110 goto continue_unlock;
1113 1111
1114 /* called by fsync() */ 1112 /* called by fsync() */
1115 if (ino && IS_DNODE(page)) { 1113 if (ino && IS_DNODE(page)) {
1116 int mark = !is_checkpointed_node(sbi, ino); 1114 int mark = !is_checkpointed_node(sbi, ino);
1117 set_fsync_mark(page, 1); 1115 set_fsync_mark(page, 1);
1118 if (IS_INODE(page)) 1116 if (IS_INODE(page))
1119 set_dentry_mark(page, mark); 1117 set_dentry_mark(page, mark);
1120 nwritten++; 1118 nwritten++;
1121 } else { 1119 } else {
1122 set_fsync_mark(page, 0); 1120 set_fsync_mark(page, 0);
1123 set_dentry_mark(page, 0); 1121 set_dentry_mark(page, 0);
1124 } 1122 }
1125 mapping->a_ops->writepage(page, wbc); 1123 mapping->a_ops->writepage(page, wbc);
1126 wrote++; 1124 wrote++;
1127 1125
1128 if (--wbc->nr_to_write == 0) 1126 if (--wbc->nr_to_write == 0)
1129 break; 1127 break;
1130 } 1128 }
1131 pagevec_release(&pvec); 1129 pagevec_release(&pvec);
1132 cond_resched(); 1130 cond_resched();
1133 1131
1134 if (wbc->nr_to_write == 0) { 1132 if (wbc->nr_to_write == 0) {
1135 step = 2; 1133 step = 2;
1136 break; 1134 break;
1137 } 1135 }
1138 } 1136 }
1139 1137
1140 if (step < 2) { 1138 if (step < 2) {
1141 step++; 1139 step++;
1142 goto next_step; 1140 goto next_step;
1143 } 1141 }
1144 1142
1145 if (wrote) 1143 if (wrote)
1146 f2fs_submit_bio(sbi, NODE, wbc->sync_mode == WB_SYNC_ALL); 1144 f2fs_submit_bio(sbi, NODE, wbc->sync_mode == WB_SYNC_ALL);
1147 1145
1148 return nwritten; 1146 return nwritten;
1149 } 1147 }
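For readability, the three skip checks in the loop above select pages per step as follows (a restatement of the existing conditions, not new behaviour):

	/* step 0: !IS_DNODE(page)                         -> indirect node pages */
	/* step 1:  IS_DNODE(page) && !is_cold_node(page)  -> dentry dnodes       */
	/* step 2:  IS_DNODE(page) &&  is_cold_node(page)  -> file dnodes         */

When wbc->nr_to_write reaches zero, step is forced to 2 so the outer loop exits instead of starting another pass.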
1150 1148
1151 static int f2fs_write_node_page(struct page *page, 1149 static int f2fs_write_node_page(struct page *page,
1152 struct writeback_control *wbc) 1150 struct writeback_control *wbc)
1153 { 1151 {
1154 struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb); 1152 struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb);
1155 nid_t nid; 1153 nid_t nid;
1156 block_t new_addr; 1154 block_t new_addr;
1157 struct node_info ni; 1155 struct node_info ni;
1158 1156
1159 wait_on_page_writeback(page); 1157 wait_on_page_writeback(page);
1160 1158
1161 /* get old block addr of this node page */ 1159 /* get old block addr of this node page */
1162 nid = nid_of_node(page); 1160 nid = nid_of_node(page);
1163 BUG_ON(page->index != nid); 1161 BUG_ON(page->index != nid);
1164 1162
1165 get_node_info(sbi, nid, &ni); 1163 get_node_info(sbi, nid, &ni);
1166 1164
1167 /* This page is already truncated */ 1165 /* This page is already truncated */
1168 if (ni.blk_addr == NULL_ADDR) { 1166 if (ni.blk_addr == NULL_ADDR) {
1169 dec_page_count(sbi, F2FS_DIRTY_NODES); 1167 dec_page_count(sbi, F2FS_DIRTY_NODES);
1170 unlock_page(page); 1168 unlock_page(page);
1171 return 0; 1169 return 0;
1172 } 1170 }
1173 1171
1174 if (wbc->for_reclaim) { 1172 if (wbc->for_reclaim) {
1175 dec_page_count(sbi, F2FS_DIRTY_NODES); 1173 dec_page_count(sbi, F2FS_DIRTY_NODES);
1176 wbc->pages_skipped++; 1174 wbc->pages_skipped++;
1177 set_page_dirty(page); 1175 set_page_dirty(page);
1178 return AOP_WRITEPAGE_ACTIVATE; 1176 return AOP_WRITEPAGE_ACTIVATE;
1179 } 1177 }
1180 1178
1181 mutex_lock(&sbi->node_write); 1179 mutex_lock(&sbi->node_write);
1182 set_page_writeback(page); 1180 set_page_writeback(page);
1183 write_node_page(sbi, page, nid, ni.blk_addr, &new_addr); 1181 write_node_page(sbi, page, nid, ni.blk_addr, &new_addr);
1184 set_node_addr(sbi, &ni, new_addr); 1182 set_node_addr(sbi, &ni, new_addr);
1185 dec_page_count(sbi, F2FS_DIRTY_NODES); 1183 dec_page_count(sbi, F2FS_DIRTY_NODES);
1186 mutex_unlock(&sbi->node_write); 1184 mutex_unlock(&sbi->node_write);
1187 unlock_page(page); 1185 unlock_page(page);
1188 return 0; 1186 return 0;
1189 } 1187 }
1190 1188
1191 /* 1189 /*
1192 * It is very important to gather dirty pages and write at once, so that we can 1190 * It is very important to gather dirty pages and write at once, so that we can
1193 * submit a big bio without interfering with other data writes. 1191 * submit a big bio without interfering with other data writes.
1194 * By default, 512 pages (2MB) * 3 node types is reasonable. 1192 * By default, 512 pages (2MB) * 3 node types is reasonable.
1195 */ 1193 */
1196 #define COLLECT_DIRTY_NODES 1536 1194 #define COLLECT_DIRTY_NODES 1536
1197 static int f2fs_write_node_pages(struct address_space *mapping, 1195 static int f2fs_write_node_pages(struct address_space *mapping,
1198 struct writeback_control *wbc) 1196 struct writeback_control *wbc)
1199 { 1197 {
1200 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); 1198 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
1201 long nr_to_write = wbc->nr_to_write; 1199 long nr_to_write = wbc->nr_to_write;
1202 1200
1203 /* First check balancing cached NAT entries */ 1201 /* First check balancing cached NAT entries */
1204 if (try_to_free_nats(sbi, NAT_ENTRY_PER_BLOCK)) { 1202 if (try_to_free_nats(sbi, NAT_ENTRY_PER_BLOCK)) {
1205 f2fs_sync_fs(sbi->sb, true); 1203 f2fs_sync_fs(sbi->sb, true);
1206 return 0; 1204 return 0;
1207 } 1205 }
1208 1206
1209 /* collect a number of dirty node pages and write together */ 1207 /* collect a number of dirty node pages and write together */
1210 if (get_pages(sbi, F2FS_DIRTY_NODES) < COLLECT_DIRTY_NODES) 1208 if (get_pages(sbi, F2FS_DIRTY_NODES) < COLLECT_DIRTY_NODES)
1211 return 0; 1209 return 0;
1212 1210
1213 /* if mounting failed, skip writing node pages */ 1211 /* if mounting failed, skip writing node pages */
1214 wbc->nr_to_write = 3 * max_hw_blocks(sbi); 1212 wbc->nr_to_write = 3 * max_hw_blocks(sbi);
1215 sync_node_pages(sbi, 0, wbc); 1213 sync_node_pages(sbi, 0, wbc);
1216 wbc->nr_to_write = nr_to_write - (3 * max_hw_blocks(sbi) - 1214 wbc->nr_to_write = nr_to_write - (3 * max_hw_blocks(sbi) -
1217 wbc->nr_to_write); 1215 wbc->nr_to_write);
1218 return 0; 1216 return 0;
1219 } 1217 }
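The threshold arithmetic behind COLLECT_DIRTY_NODES, spelled out: with 4 KiB pages (which the 2MB figure in the comment implies), 512 pages per node type is 2 MiB, and three node types give 3 * 512 = 1536 dirty node pages before a writeback pass is considered worth submitting as one large bio. The final assignment then restores the caller's budget: nr_to_write - (3 * max_hw_blocks(sbi) - wbc->nr_to_write) charges only the pages actually consumed out of the temporary 3 * max_hw_blocks() allowance.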
1220 1218
1221 static int f2fs_set_node_page_dirty(struct page *page) 1219 static int f2fs_set_node_page_dirty(struct page *page)
1222 { 1220 {
1223 struct address_space *mapping = page->mapping; 1221 struct address_space *mapping = page->mapping;
1224 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); 1222 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
1225 1223
1226 SetPageUptodate(page); 1224 SetPageUptodate(page);
1227 if (!PageDirty(page)) { 1225 if (!PageDirty(page)) {
1228 __set_page_dirty_nobuffers(page); 1226 __set_page_dirty_nobuffers(page);
1229 inc_page_count(sbi, F2FS_DIRTY_NODES); 1227 inc_page_count(sbi, F2FS_DIRTY_NODES);
1230 SetPagePrivate(page); 1228 SetPagePrivate(page);
1231 return 1; 1229 return 1;
1232 } 1230 }
1233 return 0; 1231 return 0;
1234 } 1232 }
1235 1233
1236 static void f2fs_invalidate_node_page(struct page *page, unsigned int offset, 1234 static void f2fs_invalidate_node_page(struct page *page, unsigned int offset,
1237 unsigned int length) 1235 unsigned int length)
1238 { 1236 {
1239 struct inode *inode = page->mapping->host; 1237 struct inode *inode = page->mapping->host;
1240 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 1238 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
1241 if (PageDirty(page)) 1239 if (PageDirty(page))
1242 dec_page_count(sbi, F2FS_DIRTY_NODES); 1240 dec_page_count(sbi, F2FS_DIRTY_NODES);
1243 ClearPagePrivate(page); 1241 ClearPagePrivate(page);
1244 } 1242 }
1245 1243
1246 static int f2fs_release_node_page(struct page *page, gfp_t wait) 1244 static int f2fs_release_node_page(struct page *page, gfp_t wait)
1247 { 1245 {
1248 ClearPagePrivate(page); 1246 ClearPagePrivate(page);
1249 return 1; 1247 return 1;
1250 } 1248 }
1251 1249
1252 /* 1250 /*
1253 * Structure of the f2fs node operations 1251 * Structure of the f2fs node operations
1254 */ 1252 */
1255 const struct address_space_operations f2fs_node_aops = { 1253 const struct address_space_operations f2fs_node_aops = {
1256 .writepage = f2fs_write_node_page, 1254 .writepage = f2fs_write_node_page,
1257 .writepages = f2fs_write_node_pages, 1255 .writepages = f2fs_write_node_pages,
1258 .set_page_dirty = f2fs_set_node_page_dirty, 1256 .set_page_dirty = f2fs_set_node_page_dirty,
1259 .invalidatepage = f2fs_invalidate_node_page, 1257 .invalidatepage = f2fs_invalidate_node_page,
1260 .releasepage = f2fs_release_node_page, 1258 .releasepage = f2fs_release_node_page,
1261 }; 1259 };
1262 1260
1263 static struct free_nid *__lookup_free_nid_list(nid_t n, struct list_head *head) 1261 static struct free_nid *__lookup_free_nid_list(nid_t n, struct list_head *head)
1264 { 1262 {
1265 struct list_head *this; 1263 struct list_head *this;
1266 struct free_nid *i; 1264 struct free_nid *i;
1267 list_for_each(this, head) { 1265 list_for_each(this, head) {
1268 i = list_entry(this, struct free_nid, list); 1266 i = list_entry(this, struct free_nid, list);
1269 if (i->nid == n) 1267 if (i->nid == n)
1270 return i; 1268 return i;
1271 } 1269 }
1272 return NULL; 1270 return NULL;
1273 } 1271 }
1274 1272
1275 static void __del_from_free_nid_list(struct free_nid *i) 1273 static void __del_from_free_nid_list(struct free_nid *i)
1276 { 1274 {
1277 list_del(&i->list); 1275 list_del(&i->list);
1278 kmem_cache_free(free_nid_slab, i); 1276 kmem_cache_free(free_nid_slab, i);
1279 } 1277 }
1280 1278
1281 static int add_free_nid(struct f2fs_nm_info *nm_i, nid_t nid, bool build) 1279 static int add_free_nid(struct f2fs_nm_info *nm_i, nid_t nid, bool build)
1282 { 1280 {
1283 struct free_nid *i; 1281 struct free_nid *i;
1284 struct nat_entry *ne; 1282 struct nat_entry *ne;
1285 bool allocated = false; 1283 bool allocated = false;
1286 1284
1287 if (nm_i->fcnt > 2 * MAX_FREE_NIDS) 1285 if (nm_i->fcnt > 2 * MAX_FREE_NIDS)
1288 return -1; 1286 return -1;
1289 1287
1290 /* 0 nid should not be used */ 1288 /* 0 nid should not be used */
1291 if (nid == 0) 1289 if (nid == 0)
1292 return 0; 1290 return 0;
1293 1291
1294 if (!build) 1292 if (!build)
1295 goto retry; 1293 goto retry;
1296 1294
1297 /* do not add allocated nids */ 1295 /* do not add allocated nids */
1298 read_lock(&nm_i->nat_tree_lock); 1296 read_lock(&nm_i->nat_tree_lock);
1299 ne = __lookup_nat_cache(nm_i, nid); 1297 ne = __lookup_nat_cache(nm_i, nid);
1300 if (ne && nat_get_blkaddr(ne) != NULL_ADDR) 1298 if (ne && nat_get_blkaddr(ne) != NULL_ADDR)
1301 allocated = true; 1299 allocated = true;
1302 read_unlock(&nm_i->nat_tree_lock); 1300 read_unlock(&nm_i->nat_tree_lock);
1303 if (allocated) 1301 if (allocated)
1304 return 0; 1302 return 0;
1305 retry: 1303 retry:
1306 i = kmem_cache_alloc(free_nid_slab, GFP_NOFS); 1304 i = kmem_cache_alloc(free_nid_slab, GFP_NOFS);
1307 if (!i) { 1305 if (!i) {
1308 cond_resched(); 1306 cond_resched();
1309 goto retry; 1307 goto retry;
1310 } 1308 }
1311 i->nid = nid; 1309 i->nid = nid;
1312 i->state = NID_NEW; 1310 i->state = NID_NEW;
1313 1311
1314 spin_lock(&nm_i->free_nid_list_lock); 1312 spin_lock(&nm_i->free_nid_list_lock);
1315 if (__lookup_free_nid_list(nid, &nm_i->free_nid_list)) { 1313 if (__lookup_free_nid_list(nid, &nm_i->free_nid_list)) {
1316 spin_unlock(&nm_i->free_nid_list_lock); 1314 spin_unlock(&nm_i->free_nid_list_lock);
1317 kmem_cache_free(free_nid_slab, i); 1315 kmem_cache_free(free_nid_slab, i);
1318 return 0; 1316 return 0;
1319 } 1317 }
1320 list_add_tail(&i->list, &nm_i->free_nid_list); 1318 list_add_tail(&i->list, &nm_i->free_nid_list);
1321 nm_i->fcnt++; 1319 nm_i->fcnt++;
1322 spin_unlock(&nm_i->free_nid_list_lock); 1320 spin_unlock(&nm_i->free_nid_list_lock);
1323 return 1; 1321 return 1;
1324 } 1322 }
1325 1323
1326 static void remove_free_nid(struct f2fs_nm_info *nm_i, nid_t nid) 1324 static void remove_free_nid(struct f2fs_nm_info *nm_i, nid_t nid)
1327 { 1325 {
1328 struct free_nid *i; 1326 struct free_nid *i;
1329 spin_lock(&nm_i->free_nid_list_lock); 1327 spin_lock(&nm_i->free_nid_list_lock);
1330 i = __lookup_free_nid_list(nid, &nm_i->free_nid_list); 1328 i = __lookup_free_nid_list(nid, &nm_i->free_nid_list);
1331 if (i && i->state == NID_NEW) { 1329 if (i && i->state == NID_NEW) {
1332 __del_from_free_nid_list(i); 1330 __del_from_free_nid_list(i);
1333 nm_i->fcnt--; 1331 nm_i->fcnt--;
1334 } 1332 }
1335 spin_unlock(&nm_i->free_nid_list_lock); 1333 spin_unlock(&nm_i->free_nid_list_lock);
1336 } 1334 }
1337 1335
1338 static void scan_nat_page(struct f2fs_nm_info *nm_i, 1336 static void scan_nat_page(struct f2fs_nm_info *nm_i,
1339 struct page *nat_page, nid_t start_nid) 1337 struct page *nat_page, nid_t start_nid)
1340 { 1338 {
1341 struct f2fs_nat_block *nat_blk = page_address(nat_page); 1339 struct f2fs_nat_block *nat_blk = page_address(nat_page);
1342 block_t blk_addr; 1340 block_t blk_addr;
1343 int i; 1341 int i;
1344 1342
1345 i = start_nid % NAT_ENTRY_PER_BLOCK; 1343 i = start_nid % NAT_ENTRY_PER_BLOCK;
1346 1344
1347 for (; i < NAT_ENTRY_PER_BLOCK; i++, start_nid++) { 1345 for (; i < NAT_ENTRY_PER_BLOCK; i++, start_nid++) {
1348 1346
1349 if (start_nid >= nm_i->max_nid) 1347 if (start_nid >= nm_i->max_nid)
1350 break; 1348 break;
1351 1349
1352 blk_addr = le32_to_cpu(nat_blk->entries[i].block_addr); 1350 blk_addr = le32_to_cpu(nat_blk->entries[i].block_addr);
1353 BUG_ON(blk_addr == NEW_ADDR); 1351 BUG_ON(blk_addr == NEW_ADDR);
1354 if (blk_addr == NULL_ADDR) { 1352 if (blk_addr == NULL_ADDR) {
1355 if (add_free_nid(nm_i, start_nid, true) < 0) 1353 if (add_free_nid(nm_i, start_nid, true) < 0)
1356 break; 1354 break;
1357 } 1355 }
1358 } 1356 }
1359 } 1357 }
1360 1358
1361 static void build_free_nids(struct f2fs_sb_info *sbi) 1359 static void build_free_nids(struct f2fs_sb_info *sbi)
1362 { 1360 {
1363 struct f2fs_nm_info *nm_i = NM_I(sbi); 1361 struct f2fs_nm_info *nm_i = NM_I(sbi);
1364 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); 1362 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
1365 struct f2fs_summary_block *sum = curseg->sum_blk; 1363 struct f2fs_summary_block *sum = curseg->sum_blk;
1366 int i = 0; 1364 int i = 0;
1367 nid_t nid = nm_i->next_scan_nid; 1365 nid_t nid = nm_i->next_scan_nid;
1368 1366
1369 /* Enough entries */ 1367 /* Enough entries */
1370 if (nm_i->fcnt > NAT_ENTRY_PER_BLOCK) 1368 if (nm_i->fcnt > NAT_ENTRY_PER_BLOCK)
1371 return; 1369 return;
1372 1370
1373 /* readahead nat pages to be scanned */ 1371 /* readahead nat pages to be scanned */
1374 ra_nat_pages(sbi, nid); 1372 ra_nat_pages(sbi, nid);
1375 1373
1376 while (1) { 1374 while (1) {
1377 struct page *page = get_current_nat_page(sbi, nid); 1375 struct page *page = get_current_nat_page(sbi, nid);
1378 1376
1379 scan_nat_page(nm_i, page, nid); 1377 scan_nat_page(nm_i, page, nid);
1380 f2fs_put_page(page, 1); 1378 f2fs_put_page(page, 1);
1381 1379
1382 nid += (NAT_ENTRY_PER_BLOCK - (nid % NAT_ENTRY_PER_BLOCK)); 1380 nid += (NAT_ENTRY_PER_BLOCK - (nid % NAT_ENTRY_PER_BLOCK));
1383 if (nid >= nm_i->max_nid) 1381 if (nid >= nm_i->max_nid)
1384 nid = 0; 1382 nid = 0;
1385 1383
1386 if (i++ == FREE_NID_PAGES) 1384 if (i++ == FREE_NID_PAGES)
1387 break; 1385 break;
1388 } 1386 }
1389 1387
1390 /* go to the next free nat page to find more free nids */ 1388 /* go to the next free nat page to find more free nids */
1391 nm_i->next_scan_nid = nid; 1389 nm_i->next_scan_nid = nid;
1392 1390
1393 /* find free nids from current sum_pages */ 1391 /* find free nids from current sum_pages */
1394 mutex_lock(&curseg->curseg_mutex); 1392 mutex_lock(&curseg->curseg_mutex);
1395 for (i = 0; i < nats_in_cursum(sum); i++) { 1393 for (i = 0; i < nats_in_cursum(sum); i++) {
1396 block_t addr = le32_to_cpu(nat_in_journal(sum, i).block_addr); 1394 block_t addr = le32_to_cpu(nat_in_journal(sum, i).block_addr);
1397 nid = le32_to_cpu(nid_in_journal(sum, i)); 1395 nid = le32_to_cpu(nid_in_journal(sum, i));
1398 if (addr == NULL_ADDR) 1396 if (addr == NULL_ADDR)
1399 add_free_nid(nm_i, nid, true); 1397 add_free_nid(nm_i, nid, true);
1400 else 1398 else
1401 remove_free_nid(nm_i, nid); 1399 remove_free_nid(nm_i, nid);
1402 } 1400 }
1403 mutex_unlock(&curseg->curseg_mutex); 1401 mutex_unlock(&curseg->curseg_mutex);
1404 } 1402 }
1405 1403
1406 /* 1404 /*
1407 * If this function returns success, the caller can obtain a new nid 1405 * If this function returns success, the caller can obtain a new nid
1408 * from the second parameter of this function. 1406 * from the second parameter of this function.
1409 * The returned nid could be used as an ino as well as a nid when an inode is created. 1407 * The returned nid could be used as an ino as well as a nid when an inode is created.
1410 */ 1408 */
1411 bool alloc_nid(struct f2fs_sb_info *sbi, nid_t *nid) 1409 bool alloc_nid(struct f2fs_sb_info *sbi, nid_t *nid)
1412 { 1410 {
1413 struct f2fs_nm_info *nm_i = NM_I(sbi); 1411 struct f2fs_nm_info *nm_i = NM_I(sbi);
1414 struct free_nid *i = NULL; 1412 struct free_nid *i = NULL;
1415 struct list_head *this; 1413 struct list_head *this;
1416 retry: 1414 retry:
1417 if (sbi->total_valid_node_count + 1 >= nm_i->max_nid) 1415 if (sbi->total_valid_node_count + 1 >= nm_i->max_nid)
1418 return false; 1416 return false;
1419 1417
1420 spin_lock(&nm_i->free_nid_list_lock); 1418 spin_lock(&nm_i->free_nid_list_lock);
1421 1419
1422 /* We should not use stale free nids created by build_free_nids */ 1420 /* We should not use stale free nids created by build_free_nids */
1423 if (nm_i->fcnt && !sbi->on_build_free_nids) { 1421 if (nm_i->fcnt && !sbi->on_build_free_nids) {
1424 BUG_ON(list_empty(&nm_i->free_nid_list)); 1422 BUG_ON(list_empty(&nm_i->free_nid_list));
1425 list_for_each(this, &nm_i->free_nid_list) { 1423 list_for_each(this, &nm_i->free_nid_list) {
1426 i = list_entry(this, struct free_nid, list); 1424 i = list_entry(this, struct free_nid, list);
1427 if (i->state == NID_NEW) 1425 if (i->state == NID_NEW)
1428 break; 1426 break;
1429 } 1427 }
1430 1428
1431 BUG_ON(i->state != NID_NEW); 1429 BUG_ON(i->state != NID_NEW);
1432 *nid = i->nid; 1430 *nid = i->nid;
1433 i->state = NID_ALLOC; 1431 i->state = NID_ALLOC;
1434 nm_i->fcnt--; 1432 nm_i->fcnt--;
1435 spin_unlock(&nm_i->free_nid_list_lock); 1433 spin_unlock(&nm_i->free_nid_list_lock);
1436 return true; 1434 return true;
1437 } 1435 }
1438 spin_unlock(&nm_i->free_nid_list_lock); 1436 spin_unlock(&nm_i->free_nid_list_lock);
1439 1437
1440 /* Let's scan nat pages and their caches to get free nids */ 1438 /* Let's scan nat pages and their caches to get free nids */
1441 mutex_lock(&nm_i->build_lock); 1439 mutex_lock(&nm_i->build_lock);
1442 sbi->on_build_free_nids = 1; 1440 sbi->on_build_free_nids = 1;
1443 build_free_nids(sbi); 1441 build_free_nids(sbi);
1444 sbi->on_build_free_nids = 0; 1442 sbi->on_build_free_nids = 0;
1445 mutex_unlock(&nm_i->build_lock); 1443 mutex_unlock(&nm_i->build_lock);
1446 goto retry; 1444 goto retry;
1447 } 1445 }
1448 1446
1449 /* 1447 /*
1450 * alloc_nid() should be called prior to this function. 1448 * alloc_nid() should be called prior to this function.
1451 */ 1449 */
1452 void alloc_nid_done(struct f2fs_sb_info *sbi, nid_t nid) 1450 void alloc_nid_done(struct f2fs_sb_info *sbi, nid_t nid)
1453 { 1451 {
1454 struct f2fs_nm_info *nm_i = NM_I(sbi); 1452 struct f2fs_nm_info *nm_i = NM_I(sbi);
1455 struct free_nid *i; 1453 struct free_nid *i;
1456 1454
1457 spin_lock(&nm_i->free_nid_list_lock); 1455 spin_lock(&nm_i->free_nid_list_lock);
1458 i = __lookup_free_nid_list(nid, &nm_i->free_nid_list); 1456 i = __lookup_free_nid_list(nid, &nm_i->free_nid_list);
1459 BUG_ON(!i || i->state != NID_ALLOC); 1457 BUG_ON(!i || i->state != NID_ALLOC);
1460 __del_from_free_nid_list(i); 1458 __del_from_free_nid_list(i);
1461 spin_unlock(&nm_i->free_nid_list_lock); 1459 spin_unlock(&nm_i->free_nid_list_lock);
1462 } 1460 }
1463 1461
1464 /* 1462 /*
1465 * alloc_nid() should be called prior to this function. 1463 * alloc_nid() should be called prior to this function.
1466 */ 1464 */
1467 void alloc_nid_failed(struct f2fs_sb_info *sbi, nid_t nid) 1465 void alloc_nid_failed(struct f2fs_sb_info *sbi, nid_t nid)
1468 { 1466 {
1469 struct f2fs_nm_info *nm_i = NM_I(sbi); 1467 struct f2fs_nm_info *nm_i = NM_I(sbi);
1470 struct free_nid *i; 1468 struct free_nid *i;
1471 1469
1472 if (!nid) 1470 if (!nid)
1473 return; 1471 return;
1474 1472
1475 spin_lock(&nm_i->free_nid_list_lock); 1473 spin_lock(&nm_i->free_nid_list_lock);
1476 i = __lookup_free_nid_list(nid, &nm_i->free_nid_list); 1474 i = __lookup_free_nid_list(nid, &nm_i->free_nid_list);
1477 BUG_ON(!i || i->state != NID_ALLOC); 1475 BUG_ON(!i || i->state != NID_ALLOC);
1478 if (nm_i->fcnt > 2 * MAX_FREE_NIDS) { 1476 if (nm_i->fcnt > 2 * MAX_FREE_NIDS) {
1479 __del_from_free_nid_list(i); 1477 __del_from_free_nid_list(i);
1480 } else { 1478 } else {
1481 i->state = NID_NEW; 1479 i->state = NID_NEW;
1482 nm_i->fcnt++; 1480 nm_i->fcnt++;
1483 } 1481 }
1484 spin_unlock(&nm_i->free_nid_list_lock); 1482 spin_unlock(&nm_i->free_nid_list_lock);
1485 } 1483 }
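A sketch of the expected caller pattern for the three helpers above; the function name and the do_the_actual_work() helper are illustrative placeholders, not taken from this diff:

	/* Illustrative only: reserve a nid, then either confirm or return it. */
	static int create_something_sketch(struct f2fs_sb_info *sbi)
	{
		nid_t nid;
		int err;

		if (!alloc_nid(sbi, &nid))		/* no free nid available */
			return -ENOSPC;

		err = do_the_actual_work(sbi, nid);	/* hypothetical helper */
		if (err) {
			alloc_nid_failed(sbi, nid);	/* nid goes back to NID_NEW (or is dropped) */
			return err;
		}

		alloc_nid_done(sbi, nid);		/* remove it from the free nid list for good */
		return 0;
	}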
1486 1484
1487 void recover_node_page(struct f2fs_sb_info *sbi, struct page *page, 1485 void recover_node_page(struct f2fs_sb_info *sbi, struct page *page,
1488 struct f2fs_summary *sum, struct node_info *ni, 1486 struct f2fs_summary *sum, struct node_info *ni,
1489 block_t new_blkaddr) 1487 block_t new_blkaddr)
1490 { 1488 {
1491 rewrite_node_page(sbi, page, sum, ni->blk_addr, new_blkaddr); 1489 rewrite_node_page(sbi, page, sum, ni->blk_addr, new_blkaddr);
1492 set_node_addr(sbi, ni, new_blkaddr); 1490 set_node_addr(sbi, ni, new_blkaddr);
1493 clear_node_page_dirty(page); 1491 clear_node_page_dirty(page);
1494 } 1492 }
1495 1493
1496 int recover_inode_page(struct f2fs_sb_info *sbi, struct page *page) 1494 int recover_inode_page(struct f2fs_sb_info *sbi, struct page *page)
1497 { 1495 {
1498 struct address_space *mapping = sbi->node_inode->i_mapping; 1496 struct address_space *mapping = sbi->node_inode->i_mapping;
1499 struct f2fs_node *src, *dst; 1497 struct f2fs_node *src, *dst;
1500 nid_t ino = ino_of_node(page); 1498 nid_t ino = ino_of_node(page);
1501 struct node_info old_ni, new_ni; 1499 struct node_info old_ni, new_ni;
1502 struct page *ipage; 1500 struct page *ipage;
1503 1501
1504 ipage = grab_cache_page(mapping, ino); 1502 ipage = grab_cache_page(mapping, ino);
1505 if (!ipage) 1503 if (!ipage)
1506 return -ENOMEM; 1504 return -ENOMEM;
1507 1505
1508 /* Should not use this inode from free nid list */ 1506 /* Should not use this inode from free nid list */
1509 remove_free_nid(NM_I(sbi), ino); 1507 remove_free_nid(NM_I(sbi), ino);
1510 1508
1511 get_node_info(sbi, ino, &old_ni); 1509 get_node_info(sbi, ino, &old_ni);
1512 SetPageUptodate(ipage); 1510 SetPageUptodate(ipage);
1513 fill_node_footer(ipage, ino, ino, 0, true); 1511 fill_node_footer(ipage, ino, ino, 0, true);
1514 1512
1515 src = F2FS_NODE(page); 1513 src = F2FS_NODE(page);
1516 dst = F2FS_NODE(ipage); 1514 dst = F2FS_NODE(ipage);
1517 1515
1518 memcpy(dst, src, (unsigned long)&src->i.i_ext - (unsigned long)&src->i); 1516 memcpy(dst, src, (unsigned long)&src->i.i_ext - (unsigned long)&src->i);
1519 dst->i.i_size = 0; 1517 dst->i.i_size = 0;
1520 dst->i.i_blocks = cpu_to_le64(1); 1518 dst->i.i_blocks = cpu_to_le64(1);
1521 dst->i.i_links = cpu_to_le32(1); 1519 dst->i.i_links = cpu_to_le32(1);
1522 dst->i.i_xattr_nid = 0; 1520 dst->i.i_xattr_nid = 0;
1523 1521
1524 new_ni = old_ni; 1522 new_ni = old_ni;
1525 new_ni.ino = ino; 1523 new_ni.ino = ino;
1526 1524
1527 if (!inc_valid_node_count(sbi, NULL, 1)) 1525 if (!inc_valid_node_count(sbi, NULL, 1))
1528 WARN_ON(1); 1526 WARN_ON(1);
1529 set_node_addr(sbi, &new_ni, NEW_ADDR); 1527 set_node_addr(sbi, &new_ni, NEW_ADDR);
1530 inc_valid_inode_count(sbi); 1528 inc_valid_inode_count(sbi);
1531 f2fs_put_page(ipage, 1); 1529 f2fs_put_page(ipage, 1);
1532 return 0; 1530 return 0;
1533 } 1531 }
1534 1532
1535 int restore_node_summary(struct f2fs_sb_info *sbi, 1533 int restore_node_summary(struct f2fs_sb_info *sbi,
1536 unsigned int segno, struct f2fs_summary_block *sum) 1534 unsigned int segno, struct f2fs_summary_block *sum)
1537 { 1535 {
1538 struct f2fs_node *rn; 1536 struct f2fs_node *rn;
1539 struct f2fs_summary *sum_entry; 1537 struct f2fs_summary *sum_entry;
1540 struct page *page; 1538 struct page *page;
1541 block_t addr; 1539 block_t addr;
1542 int i, last_offset; 1540 int i, last_offset;
1543 1541
1544 /* alloc a temporary page for reading node blocks */ 1542 /* alloc a temporary page for reading node blocks */
1545 page = alloc_page(GFP_NOFS | __GFP_ZERO); 1543 page = alloc_page(GFP_NOFS | __GFP_ZERO);
1546 if (!page) 1544 if (!page)
1547 return -ENOMEM; 1545 return -ENOMEM;
1548 lock_page(page); 1546 lock_page(page);
1549 1547
1550 /* scan the node segment */ 1548 /* scan the node segment */
1551 last_offset = sbi->blocks_per_seg; 1549 last_offset = sbi->blocks_per_seg;
1552 addr = START_BLOCK(sbi, segno); 1550 addr = START_BLOCK(sbi, segno);
1553 sum_entry = &sum->entries[0]; 1551 sum_entry = &sum->entries[0];
1554 1552
1555 for (i = 0; i < last_offset; i++, sum_entry++) { 1553 for (i = 0; i < last_offset; i++, sum_entry++) {
1556 /* 1554 /*
1557 * In order to read the next node page, 1555 * In order to read the next node page,
1558 * we must clear the PageUptodate flag. 1556 * we must clear the PageUptodate flag.
1559 */ 1557 */
1560 ClearPageUptodate(page); 1558 ClearPageUptodate(page);
1561 1559
1562 if (f2fs_readpage(sbi, page, addr, READ_SYNC)) 1560 if (f2fs_readpage(sbi, page, addr, READ_SYNC))
1563 goto out; 1561 goto out;
1564 1562
1565 lock_page(page); 1563 lock_page(page);
1566 rn = F2FS_NODE(page); 1564 rn = F2FS_NODE(page);
1567 sum_entry->nid = rn->footer.nid; 1565 sum_entry->nid = rn->footer.nid;
1568 sum_entry->version = 0; 1566 sum_entry->version = 0;
1569 sum_entry->ofs_in_node = 0; 1567 sum_entry->ofs_in_node = 0;
1570 addr++; 1568 addr++;
1571 } 1569 }
1572 unlock_page(page); 1570 unlock_page(page);
1573 out: 1571 out:
1574 __free_pages(page, 0); 1572 __free_pages(page, 0);
1575 return 0; 1573 return 0;
1576 } 1574 }
1577 1575
1578 static bool flush_nats_in_journal(struct f2fs_sb_info *sbi) 1576 static bool flush_nats_in_journal(struct f2fs_sb_info *sbi)
1579 { 1577 {
1580 struct f2fs_nm_info *nm_i = NM_I(sbi); 1578 struct f2fs_nm_info *nm_i = NM_I(sbi);
1581 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); 1579 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
1582 struct f2fs_summary_block *sum = curseg->sum_blk; 1580 struct f2fs_summary_block *sum = curseg->sum_blk;
1583 int i; 1581 int i;
1584 1582
1585 mutex_lock(&curseg->curseg_mutex); 1583 mutex_lock(&curseg->curseg_mutex);
1586 1584
1587 if (nats_in_cursum(sum) < NAT_JOURNAL_ENTRIES) { 1585 if (nats_in_cursum(sum) < NAT_JOURNAL_ENTRIES) {
1588 mutex_unlock(&curseg->curseg_mutex); 1586 mutex_unlock(&curseg->curseg_mutex);
1589 return false; 1587 return false;
1590 } 1588 }
1591 1589
1592 for (i = 0; i < nats_in_cursum(sum); i++) { 1590 for (i = 0; i < nats_in_cursum(sum); i++) {
1593 struct nat_entry *ne; 1591 struct nat_entry *ne;
1594 struct f2fs_nat_entry raw_ne; 1592 struct f2fs_nat_entry raw_ne;
1595 nid_t nid = le32_to_cpu(nid_in_journal(sum, i)); 1593 nid_t nid = le32_to_cpu(nid_in_journal(sum, i));
1596 1594
1597 raw_ne = nat_in_journal(sum, i); 1595 raw_ne = nat_in_journal(sum, i);
1598 retry: 1596 retry:
1599 write_lock(&nm_i->nat_tree_lock); 1597 write_lock(&nm_i->nat_tree_lock);
1600 ne = __lookup_nat_cache(nm_i, nid); 1598 ne = __lookup_nat_cache(nm_i, nid);
1601 if (ne) { 1599 if (ne) {
1602 __set_nat_cache_dirty(nm_i, ne); 1600 __set_nat_cache_dirty(nm_i, ne);
1603 write_unlock(&nm_i->nat_tree_lock); 1601 write_unlock(&nm_i->nat_tree_lock);
1604 continue; 1602 continue;
1605 } 1603 }
1606 ne = grab_nat_entry(nm_i, nid); 1604 ne = grab_nat_entry(nm_i, nid);
1607 if (!ne) { 1605 if (!ne) {
1608 write_unlock(&nm_i->nat_tree_lock); 1606 write_unlock(&nm_i->nat_tree_lock);
1609 goto retry; 1607 goto retry;
1610 } 1608 }
1611 nat_set_blkaddr(ne, le32_to_cpu(raw_ne.block_addr)); 1609 nat_set_blkaddr(ne, le32_to_cpu(raw_ne.block_addr));
1612 nat_set_ino(ne, le32_to_cpu(raw_ne.ino)); 1610 nat_set_ino(ne, le32_to_cpu(raw_ne.ino));
1613 nat_set_version(ne, raw_ne.version); 1611 nat_set_version(ne, raw_ne.version);
1614 __set_nat_cache_dirty(nm_i, ne); 1612 __set_nat_cache_dirty(nm_i, ne);
1615 write_unlock(&nm_i->nat_tree_lock); 1613 write_unlock(&nm_i->nat_tree_lock);
1616 } 1614 }
1617 update_nats_in_cursum(sum, -i); 1615 update_nats_in_cursum(sum, -i);
1618 mutex_unlock(&curseg->curseg_mutex); 1616 mutex_unlock(&curseg->curseg_mutex);
1619 return true; 1617 return true;
1620 } 1618 }
1621 1619
1622 /* 1620 /*
1623 * This function is called during the checkpointing process. 1621 * This function is called during the checkpointing process.
1624 */ 1622 */
1625 void flush_nat_entries(struct f2fs_sb_info *sbi) 1623 void flush_nat_entries(struct f2fs_sb_info *sbi)
1626 { 1624 {
1627 struct f2fs_nm_info *nm_i = NM_I(sbi); 1625 struct f2fs_nm_info *nm_i = NM_I(sbi);
1628 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); 1626 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
1629 struct f2fs_summary_block *sum = curseg->sum_blk; 1627 struct f2fs_summary_block *sum = curseg->sum_blk;
1630 struct list_head *cur, *n; 1628 struct list_head *cur, *n;
1631 struct page *page = NULL; 1629 struct page *page = NULL;
1632 struct f2fs_nat_block *nat_blk = NULL; 1630 struct f2fs_nat_block *nat_blk = NULL;
1633 nid_t start_nid = 0, end_nid = 0; 1631 nid_t start_nid = 0, end_nid = 0;
1634 bool flushed; 1632 bool flushed;
1635 1633
1636 flushed = flush_nats_in_journal(sbi); 1634 flushed = flush_nats_in_journal(sbi);
1637 1635
1638 if (!flushed) 1636 if (!flushed)
1639 mutex_lock(&curseg->curseg_mutex); 1637 mutex_lock(&curseg->curseg_mutex);
1640 1638
1641 /* 1) flush dirty nat caches */ 1639 /* 1) flush dirty nat caches */
1642 list_for_each_safe(cur, n, &nm_i->dirty_nat_entries) { 1640 list_for_each_safe(cur, n, &nm_i->dirty_nat_entries) {
1643 struct nat_entry *ne; 1641 struct nat_entry *ne;
1644 nid_t nid; 1642 nid_t nid;
1645 struct f2fs_nat_entry raw_ne; 1643 struct f2fs_nat_entry raw_ne;
1646 int offset = -1; 1644 int offset = -1;
1647 block_t new_blkaddr; 1645 block_t new_blkaddr;
1648 1646
1649 ne = list_entry(cur, struct nat_entry, list); 1647 ne = list_entry(cur, struct nat_entry, list);
1650 nid = nat_get_nid(ne); 1648 nid = nat_get_nid(ne);
1651 1649
1652 if (nat_get_blkaddr(ne) == NEW_ADDR) 1650 if (nat_get_blkaddr(ne) == NEW_ADDR)
1653 continue; 1651 continue;
1654 if (flushed) 1652 if (flushed)
1655 goto to_nat_page; 1653 goto to_nat_page;
1656 1654
1657 /* if there is room for nat entries in curseg->sumpage */ 1655 /* if there is room for nat entries in curseg->sumpage */
1658 offset = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 1); 1656 offset = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 1);
1659 if (offset >= 0) { 1657 if (offset >= 0) {
1660 raw_ne = nat_in_journal(sum, offset); 1658 raw_ne = nat_in_journal(sum, offset);
1661 goto flush_now; 1659 goto flush_now;
1662 } 1660 }
1663 to_nat_page: 1661 to_nat_page:
1664 if (!page || (start_nid > nid || nid > end_nid)) { 1662 if (!page || (start_nid > nid || nid > end_nid)) {
1665 if (page) { 1663 if (page) {
1666 f2fs_put_page(page, 1); 1664 f2fs_put_page(page, 1);
1667 page = NULL; 1665 page = NULL;
1668 } 1666 }
1669 start_nid = START_NID(nid); 1667 start_nid = START_NID(nid);
1670 end_nid = start_nid + NAT_ENTRY_PER_BLOCK - 1; 1668 end_nid = start_nid + NAT_ENTRY_PER_BLOCK - 1;
1671 1669
1672 /* 1670 /*
1673 * get the nat block with the dirty flag set, its reference 1671 * get the nat block with the dirty flag set, its reference
1674 * count increased, mapped and locked 1672 * count increased, mapped and locked
1675 */ 1673 */
1676 page = get_next_nat_page(sbi, start_nid); 1674 page = get_next_nat_page(sbi, start_nid);
1677 nat_blk = page_address(page); 1675 nat_blk = page_address(page);
1678 } 1676 }
1679 1677
1680 BUG_ON(!nat_blk); 1678 BUG_ON(!nat_blk);
1681 raw_ne = nat_blk->entries[nid - start_nid]; 1679 raw_ne = nat_blk->entries[nid - start_nid];
1682 flush_now: 1680 flush_now:
1683 new_blkaddr = nat_get_blkaddr(ne); 1681 new_blkaddr = nat_get_blkaddr(ne);
1684 1682
1685 raw_ne.ino = cpu_to_le32(nat_get_ino(ne)); 1683 raw_ne.ino = cpu_to_le32(nat_get_ino(ne));
1686 raw_ne.block_addr = cpu_to_le32(new_blkaddr); 1684 raw_ne.block_addr = cpu_to_le32(new_blkaddr);
1687 raw_ne.version = nat_get_version(ne); 1685 raw_ne.version = nat_get_version(ne);
1688 1686
1689 if (offset < 0) { 1687 if (offset < 0) {
1690 nat_blk->entries[nid - start_nid] = raw_ne; 1688 nat_blk->entries[nid - start_nid] = raw_ne;
1691 } else { 1689 } else {
1692 nat_in_journal(sum, offset) = raw_ne; 1690 nat_in_journal(sum, offset) = raw_ne;
1693 nid_in_journal(sum, offset) = cpu_to_le32(nid); 1691 nid_in_journal(sum, offset) = cpu_to_le32(nid);
1694 } 1692 }
1695 1693
1696 if (nat_get_blkaddr(ne) == NULL_ADDR && 1694 if (nat_get_blkaddr(ne) == NULL_ADDR &&
1697 add_free_nid(NM_I(sbi), nid, false) <= 0) { 1695 add_free_nid(NM_I(sbi), nid, false) <= 0) {
1698 write_lock(&nm_i->nat_tree_lock); 1696 write_lock(&nm_i->nat_tree_lock);
1699 __del_from_nat_cache(nm_i, ne); 1697 __del_from_nat_cache(nm_i, ne);
1700 write_unlock(&nm_i->nat_tree_lock); 1698 write_unlock(&nm_i->nat_tree_lock);
1701 } else { 1699 } else {
1702 write_lock(&nm_i->nat_tree_lock); 1700 write_lock(&nm_i->nat_tree_lock);
1703 __clear_nat_cache_dirty(nm_i, ne); 1701 __clear_nat_cache_dirty(nm_i, ne);
1704 ne->checkpointed = true; 1702 ne->checkpointed = true;
1705 write_unlock(&nm_i->nat_tree_lock); 1703 write_unlock(&nm_i->nat_tree_lock);
1706 } 1704 }
1707 } 1705 }
1708 if (!flushed) 1706 if (!flushed)
1709 mutex_unlock(&curseg->curseg_mutex); 1707 mutex_unlock(&curseg->curseg_mutex);
1710 f2fs_put_page(page, 1); 1708 f2fs_put_page(page, 1);
1711 1709
1712 /* 2) shrink nat caches if necessary */ 1710 /* 2) shrink nat caches if necessary */
1713 try_to_free_nats(sbi, nm_i->nat_cnt - NM_WOUT_THRESHOLD); 1711 try_to_free_nats(sbi, nm_i->nat_cnt - NM_WOUT_THRESHOLD);
1714 } 1712 }
1715 1713
1716 static int init_node_manager(struct f2fs_sb_info *sbi) 1714 static int init_node_manager(struct f2fs_sb_info *sbi)
1717 { 1715 {
1718 struct f2fs_super_block *sb_raw = F2FS_RAW_SUPER(sbi); 1716 struct f2fs_super_block *sb_raw = F2FS_RAW_SUPER(sbi);
1719 struct f2fs_nm_info *nm_i = NM_I(sbi); 1717 struct f2fs_nm_info *nm_i = NM_I(sbi);
1720 unsigned char *version_bitmap; 1718 unsigned char *version_bitmap;
1721 unsigned int nat_segs, nat_blocks; 1719 unsigned int nat_segs, nat_blocks;
1722 1720
1723 nm_i->nat_blkaddr = le32_to_cpu(sb_raw->nat_blkaddr); 1721 nm_i->nat_blkaddr = le32_to_cpu(sb_raw->nat_blkaddr);
1724 1722
1725 /* segment_count_nat includes the pair segment, so divide by 2. */ 1723 /* segment_count_nat includes the pair segment, so divide by 2. */
1726 nat_segs = le32_to_cpu(sb_raw->segment_count_nat) >> 1; 1724 nat_segs = le32_to_cpu(sb_raw->segment_count_nat) >> 1;
1727 nat_blocks = nat_segs << le32_to_cpu(sb_raw->log_blocks_per_seg); 1725 nat_blocks = nat_segs << le32_to_cpu(sb_raw->log_blocks_per_seg);
1728 nm_i->max_nid = NAT_ENTRY_PER_BLOCK * nat_blocks; 1726 nm_i->max_nid = NAT_ENTRY_PER_BLOCK * nat_blocks;
1729 nm_i->fcnt = 0; 1727 nm_i->fcnt = 0;
1730 nm_i->nat_cnt = 0; 1728 nm_i->nat_cnt = 0;
1731 1729
1732 INIT_LIST_HEAD(&nm_i->free_nid_list); 1730 INIT_LIST_HEAD(&nm_i->free_nid_list);
1733 INIT_RADIX_TREE(&nm_i->nat_root, GFP_ATOMIC); 1731 INIT_RADIX_TREE(&nm_i->nat_root, GFP_ATOMIC);
1734 INIT_LIST_HEAD(&nm_i->nat_entries); 1732 INIT_LIST_HEAD(&nm_i->nat_entries);
1735 INIT_LIST_HEAD(&nm_i->dirty_nat_entries); 1733 INIT_LIST_HEAD(&nm_i->dirty_nat_entries);
1736 1734
1737 mutex_init(&nm_i->build_lock); 1735 mutex_init(&nm_i->build_lock);
1738 spin_lock_init(&nm_i->free_nid_list_lock); 1736 spin_lock_init(&nm_i->free_nid_list_lock);
1739 rwlock_init(&nm_i->nat_tree_lock); 1737 rwlock_init(&nm_i->nat_tree_lock);
1740 1738
1741 nm_i->next_scan_nid = le32_to_cpu(sbi->ckpt->next_free_nid); 1739 nm_i->next_scan_nid = le32_to_cpu(sbi->ckpt->next_free_nid);
1742 nm_i->bitmap_size = __bitmap_size(sbi, NAT_BITMAP); 1740 nm_i->bitmap_size = __bitmap_size(sbi, NAT_BITMAP);
1743 version_bitmap = __bitmap_ptr(sbi, NAT_BITMAP); 1741 version_bitmap = __bitmap_ptr(sbi, NAT_BITMAP);
1744 if (!version_bitmap) 1742 if (!version_bitmap)
1745 return -EFAULT; 1743 return -EFAULT;
1746 1744
1747 nm_i->nat_bitmap = kmemdup(version_bitmap, nm_i->bitmap_size, 1745 nm_i->nat_bitmap = kmemdup(version_bitmap, nm_i->bitmap_size,
1748 GFP_KERNEL); 1746 GFP_KERNEL);
1749 if (!nm_i->nat_bitmap) 1747 if (!nm_i->nat_bitmap)
1750 return -ENOMEM; 1748 return -ENOMEM;
1751 return 0; 1749 return 0;
1752 } 1750 }
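A worked example of the sizing above, under assumed defaults that are not visible in this hunk (2 MiB segments, i.e. log_blocks_per_seg = 9, and NAT_ENTRY_PER_BLOCK = 455 for 4 KiB blocks holding 9-byte raw NAT entries): with segment_count_nat = 40, nat_segs = 40 / 2 = 20, nat_blocks = 20 << 9 = 10240, and max_nid = 455 * 10240 = 4,659,200 addressable node ids.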
1753 1751
1754 int build_node_manager(struct f2fs_sb_info *sbi) 1752 int build_node_manager(struct f2fs_sb_info *sbi)
1755 { 1753 {
1756 int err; 1754 int err;
1757 1755
1758 sbi->nm_info = kzalloc(sizeof(struct f2fs_nm_info), GFP_KERNEL); 1756 sbi->nm_info = kzalloc(sizeof(struct f2fs_nm_info), GFP_KERNEL);
1759 if (!sbi->nm_info) 1757 if (!sbi->nm_info)
1760 return -ENOMEM; 1758 return -ENOMEM;
1761 1759
1762 err = init_node_manager(sbi); 1760 err = init_node_manager(sbi);
1763 if (err) 1761 if (err)
1764 return err; 1762 return err;
1765 1763
1766 build_free_nids(sbi); 1764 build_free_nids(sbi);
1767 return 0; 1765 return 0;
1768 } 1766 }
1769 1767
1770 void destroy_node_manager(struct f2fs_sb_info *sbi) 1768 void destroy_node_manager(struct f2fs_sb_info *sbi)
1771 { 1769 {
1772 struct f2fs_nm_info *nm_i = NM_I(sbi); 1770 struct f2fs_nm_info *nm_i = NM_I(sbi);
1773 struct free_nid *i, *next_i; 1771 struct free_nid *i, *next_i;
1774 struct nat_entry *natvec[NATVEC_SIZE]; 1772 struct nat_entry *natvec[NATVEC_SIZE];
1775 nid_t nid = 0; 1773 nid_t nid = 0;
1776 unsigned int found; 1774 unsigned int found;
1777 1775
1778 if (!nm_i) 1776 if (!nm_i)
1779 return; 1777 return;
1780 1778
1781 /* destroy free nid list */ 1779 /* destroy free nid list */
1782 spin_lock(&nm_i->free_nid_list_lock); 1780 spin_lock(&nm_i->free_nid_list_lock);
1783 list_for_each_entry_safe(i, next_i, &nm_i->free_nid_list, list) { 1781 list_for_each_entry_safe(i, next_i, &nm_i->free_nid_list, list) {
1784 BUG_ON(i->state == NID_ALLOC); 1782 BUG_ON(i->state == NID_ALLOC);
1785 __del_from_free_nid_list(i); 1783 __del_from_free_nid_list(i);
1786 nm_i->fcnt--; 1784 nm_i->fcnt--;
1787 } 1785 }
1788 BUG_ON(nm_i->fcnt); 1786 BUG_ON(nm_i->fcnt);
1789 spin_unlock(&nm_i->free_nid_list_lock); 1787 spin_unlock(&nm_i->free_nid_list_lock);
1790 1788
1791 /* destroy nat cache */ 1789 /* destroy nat cache */
1792 write_lock(&nm_i->nat_tree_lock); 1790 write_lock(&nm_i->nat_tree_lock);
1793 while ((found = __gang_lookup_nat_cache(nm_i, 1791 while ((found = __gang_lookup_nat_cache(nm_i,
1794 nid, NATVEC_SIZE, natvec))) { 1792 nid, NATVEC_SIZE, natvec))) {
1795 unsigned idx; 1793 unsigned idx;
1796 for (idx = 0; idx < found; idx++) { 1794 for (idx = 0; idx < found; idx++) {
1797 struct nat_entry *e = natvec[idx]; 1795 struct nat_entry *e = natvec[idx];
1798 nid = nat_get_nid(e) + 1; 1796 nid = nat_get_nid(e) + 1;
1799 __del_from_nat_cache(nm_i, e); 1797 __del_from_nat_cache(nm_i, e);
1800 } 1798 }
1801 } 1799 }
1802 BUG_ON(nm_i->nat_cnt); 1800 BUG_ON(nm_i->nat_cnt);
1803 write_unlock(&nm_i->nat_tree_lock); 1801 write_unlock(&nm_i->nat_tree_lock);
1804 1802
1805 kfree(nm_i->nat_bitmap); 1803 kfree(nm_i->nat_bitmap);
1806 sbi->nm_info = NULL; 1804 sbi->nm_info = NULL;
1807 kfree(nm_i); 1805 kfree(nm_i);
1808 } 1806 }
1809 1807
1810 int __init create_node_manager_caches(void) 1808 int __init create_node_manager_caches(void)
1811 { 1809 {
1812 nat_entry_slab = f2fs_kmem_cache_create("nat_entry", 1810 nat_entry_slab = f2fs_kmem_cache_create("nat_entry",
1813 sizeof(struct nat_entry), NULL); 1811 sizeof(struct nat_entry), NULL);
1814 if (!nat_entry_slab) 1812 if (!nat_entry_slab)
1815 return -ENOMEM; 1813 return -ENOMEM;
1816 1814
1817 free_nid_slab = f2fs_kmem_cache_create("free_nid", 1815 free_nid_slab = f2fs_kmem_cache_create("free_nid",
1818 sizeof(struct free_nid), NULL); 1816 sizeof(struct free_nid), NULL);
1819 if (!free_nid_slab) { 1817 if (!free_nid_slab) {
1820 kmem_cache_destroy(nat_entry_slab); 1818 kmem_cache_destroy(nat_entry_slab);
1821 return -ENOMEM; 1819 return -ENOMEM;
1822 } 1820 }
1823 return 0; 1821 return 0;
1824 } 1822 }
1825 1823
1826 void destroy_node_manager_caches(void) 1824 void destroy_node_manager_caches(void)
1827 { 1825 {
1828 kmem_cache_destroy(free_nid_slab); 1826 kmem_cache_destroy(free_nid_slab);
1829 kmem_cache_destroy(nat_entry_slab); 1827 kmem_cache_destroy(nat_entry_slab);
1830 } 1828 }
1831 1829
1 /* 1 /*
2 FUSE: Filesystem in Userspace 2 FUSE: Filesystem in Userspace
3 Copyright (C) 2001-2008 Miklos Szeredi <miklos@szeredi.hu> 3 Copyright (C) 2001-2008 Miklos Szeredi <miklos@szeredi.hu>
4 4
5 This program can be distributed under the terms of the GNU GPL. 5 This program can be distributed under the terms of the GNU GPL.
6 See the file COPYING. 6 See the file COPYING.
7 */ 7 */
8 8
9 #include "fuse_i.h" 9 #include "fuse_i.h"
10 10
11 #include <linux/pagemap.h> 11 #include <linux/pagemap.h>
12 #include <linux/slab.h> 12 #include <linux/slab.h>
13 #include <linux/kernel.h> 13 #include <linux/kernel.h>
14 #include <linux/sched.h> 14 #include <linux/sched.h>
15 #include <linux/module.h> 15 #include <linux/module.h>
16 #include <linux/compat.h> 16 #include <linux/compat.h>
17 #include <linux/swap.h> 17 #include <linux/swap.h>
18 #include <linux/aio.h> 18 #include <linux/aio.h>
19 #include <linux/falloc.h> 19 #include <linux/falloc.h>
20 20
21 static const struct file_operations fuse_direct_io_file_operations; 21 static const struct file_operations fuse_direct_io_file_operations;
22 22
23 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file, 23 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
24 int opcode, struct fuse_open_out *outargp) 24 int opcode, struct fuse_open_out *outargp)
25 { 25 {
26 struct fuse_open_in inarg; 26 struct fuse_open_in inarg;
27 struct fuse_req *req; 27 struct fuse_req *req;
28 int err; 28 int err;
29 29
30 req = fuse_get_req_nopages(fc); 30 req = fuse_get_req_nopages(fc);
31 if (IS_ERR(req)) 31 if (IS_ERR(req))
32 return PTR_ERR(req); 32 return PTR_ERR(req);
33 33
34 memset(&inarg, 0, sizeof(inarg)); 34 memset(&inarg, 0, sizeof(inarg));
35 inarg.flags = file->f_flags & ~(O_CREAT | O_EXCL | O_NOCTTY); 35 inarg.flags = file->f_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
36 if (!fc->atomic_o_trunc) 36 if (!fc->atomic_o_trunc)
37 inarg.flags &= ~O_TRUNC; 37 inarg.flags &= ~O_TRUNC;
38 req->in.h.opcode = opcode; 38 req->in.h.opcode = opcode;
39 req->in.h.nodeid = nodeid; 39 req->in.h.nodeid = nodeid;
40 req->in.numargs = 1; 40 req->in.numargs = 1;
41 req->in.args[0].size = sizeof(inarg); 41 req->in.args[0].size = sizeof(inarg);
42 req->in.args[0].value = &inarg; 42 req->in.args[0].value = &inarg;
43 req->out.numargs = 1; 43 req->out.numargs = 1;
44 req->out.args[0].size = sizeof(*outargp); 44 req->out.args[0].size = sizeof(*outargp);
45 req->out.args[0].value = outargp; 45 req->out.args[0].value = outargp;
46 fuse_request_send(fc, req); 46 fuse_request_send(fc, req);
47 err = req->out.h.error; 47 err = req->out.h.error;
48 fuse_put_request(fc, req); 48 fuse_put_request(fc, req);
49 49
50 return err; 50 return err;
51 } 51 }
52 52
53 struct fuse_file *fuse_file_alloc(struct fuse_conn *fc) 53 struct fuse_file *fuse_file_alloc(struct fuse_conn *fc)
54 { 54 {
55 struct fuse_file *ff; 55 struct fuse_file *ff;
56 56
57 ff = kmalloc(sizeof(struct fuse_file), GFP_KERNEL); 57 ff = kmalloc(sizeof(struct fuse_file), GFP_KERNEL);
58 if (unlikely(!ff)) 58 if (unlikely(!ff))
59 return NULL; 59 return NULL;
60 60
61 ff->fc = fc; 61 ff->fc = fc;
62 ff->reserved_req = fuse_request_alloc(0); 62 ff->reserved_req = fuse_request_alloc(0);
63 if (unlikely(!ff->reserved_req)) { 63 if (unlikely(!ff->reserved_req)) {
64 kfree(ff); 64 kfree(ff);
65 return NULL; 65 return NULL;
66 } 66 }
67 67
68 INIT_LIST_HEAD(&ff->write_entry); 68 INIT_LIST_HEAD(&ff->write_entry);
69 atomic_set(&ff->count, 0); 69 atomic_set(&ff->count, 0);
70 RB_CLEAR_NODE(&ff->polled_node); 70 RB_CLEAR_NODE(&ff->polled_node);
71 init_waitqueue_head(&ff->poll_wait); 71 init_waitqueue_head(&ff->poll_wait);
72 72
73 spin_lock(&fc->lock); 73 spin_lock(&fc->lock);
74 ff->kh = ++fc->khctr; 74 ff->kh = ++fc->khctr;
75 spin_unlock(&fc->lock); 75 spin_unlock(&fc->lock);
76 76
77 return ff; 77 return ff;
78 } 78 }
79 79
80 void fuse_file_free(struct fuse_file *ff) 80 void fuse_file_free(struct fuse_file *ff)
81 { 81 {
82 fuse_request_free(ff->reserved_req); 82 fuse_request_free(ff->reserved_req);
83 kfree(ff); 83 kfree(ff);
84 } 84 }
85 85
86 struct fuse_file *fuse_file_get(struct fuse_file *ff) 86 struct fuse_file *fuse_file_get(struct fuse_file *ff)
87 { 87 {
88 atomic_inc(&ff->count); 88 atomic_inc(&ff->count);
89 return ff; 89 return ff;
90 } 90 }
91 91
92 static void fuse_release_async(struct work_struct *work) 92 static void fuse_release_async(struct work_struct *work)
93 { 93 {
94 struct fuse_req *req; 94 struct fuse_req *req;
95 struct fuse_conn *fc; 95 struct fuse_conn *fc;
96 struct path path; 96 struct path path;
97 97
98 req = container_of(work, struct fuse_req, misc.release.work); 98 req = container_of(work, struct fuse_req, misc.release.work);
99 path = req->misc.release.path; 99 path = req->misc.release.path;
100 fc = get_fuse_conn(path.dentry->d_inode); 100 fc = get_fuse_conn(path.dentry->d_inode);
101 101
102 fuse_put_request(fc, req); 102 fuse_put_request(fc, req);
103 path_put(&path); 103 path_put(&path);
104 } 104 }
105 105
106 static void fuse_release_end(struct fuse_conn *fc, struct fuse_req *req) 106 static void fuse_release_end(struct fuse_conn *fc, struct fuse_req *req)
107 { 107 {
108 if (fc->destroy_req) { 108 if (fc->destroy_req) {
109 /* 109 /*
110 * If this is a fuseblk mount, then it's possible that 110 * If this is a fuseblk mount, then it's possible that
111 * releasing the path will result in releasing the 111 * releasing the path will result in releasing the
112 * super block and sending the DESTROY request. If 112 * super block and sending the DESTROY request. If
113 * the server is single threaded, this would hang. 113 * the server is single threaded, this would hang.
114 * For this reason do the path_put() in a separate 114 * For this reason do the path_put() in a separate
115 * thread. 115 * thread.
116 */ 116 */
117 atomic_inc(&req->count); 117 atomic_inc(&req->count);
118 INIT_WORK(&req->misc.release.work, fuse_release_async); 118 INIT_WORK(&req->misc.release.work, fuse_release_async);
119 schedule_work(&req->misc.release.work); 119 schedule_work(&req->misc.release.work);
120 } else { 120 } else {
121 path_put(&req->misc.release.path); 121 path_put(&req->misc.release.path);
122 } 122 }
123 } 123 }
124 124
125 static void fuse_file_put(struct fuse_file *ff, bool sync) 125 static void fuse_file_put(struct fuse_file *ff, bool sync)
126 { 126 {
127 if (atomic_dec_and_test(&ff->count)) { 127 if (atomic_dec_and_test(&ff->count)) {
128 struct fuse_req *req = ff->reserved_req; 128 struct fuse_req *req = ff->reserved_req;
129 129
130 if (sync) { 130 if (sync) {
131 req->background = 0; 131 req->background = 0;
132 fuse_request_send(ff->fc, req); 132 fuse_request_send(ff->fc, req);
133 path_put(&req->misc.release.path); 133 path_put(&req->misc.release.path);
134 fuse_put_request(ff->fc, req); 134 fuse_put_request(ff->fc, req);
135 } else { 135 } else {
136 req->end = fuse_release_end; 136 req->end = fuse_release_end;
137 req->background = 1; 137 req->background = 1;
138 fuse_request_send_background(ff->fc, req); 138 fuse_request_send_background(ff->fc, req);
139 } 139 }
140 kfree(ff); 140 kfree(ff);
141 } 141 }
142 } 142 }
143 143
144 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file, 144 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
145 bool isdir) 145 bool isdir)
146 { 146 {
147 struct fuse_open_out outarg; 147 struct fuse_open_out outarg;
148 struct fuse_file *ff; 148 struct fuse_file *ff;
149 int err; 149 int err;
150 int opcode = isdir ? FUSE_OPENDIR : FUSE_OPEN; 150 int opcode = isdir ? FUSE_OPENDIR : FUSE_OPEN;
151 151
152 ff = fuse_file_alloc(fc); 152 ff = fuse_file_alloc(fc);
153 if (!ff) 153 if (!ff)
154 return -ENOMEM; 154 return -ENOMEM;
155 155
156 err = fuse_send_open(fc, nodeid, file, opcode, &outarg); 156 err = fuse_send_open(fc, nodeid, file, opcode, &outarg);
157 if (err) { 157 if (err) {
158 fuse_file_free(ff); 158 fuse_file_free(ff);
159 return err; 159 return err;
160 } 160 }
161 161
162 if (isdir) 162 if (isdir)
163 outarg.open_flags &= ~FOPEN_DIRECT_IO; 163 outarg.open_flags &= ~FOPEN_DIRECT_IO;
164 164
165 ff->fh = outarg.fh; 165 ff->fh = outarg.fh;
166 ff->nodeid = nodeid; 166 ff->nodeid = nodeid;
167 ff->open_flags = outarg.open_flags; 167 ff->open_flags = outarg.open_flags;
168 file->private_data = fuse_file_get(ff); 168 file->private_data = fuse_file_get(ff);
169 169
170 return 0; 170 return 0;
171 } 171 }
172 EXPORT_SYMBOL_GPL(fuse_do_open); 172 EXPORT_SYMBOL_GPL(fuse_do_open);
173 173
174 void fuse_finish_open(struct inode *inode, struct file *file) 174 void fuse_finish_open(struct inode *inode, struct file *file)
175 { 175 {
176 struct fuse_file *ff = file->private_data; 176 struct fuse_file *ff = file->private_data;
177 struct fuse_conn *fc = get_fuse_conn(inode); 177 struct fuse_conn *fc = get_fuse_conn(inode);
178 178
179 if (ff->open_flags & FOPEN_DIRECT_IO) 179 if (ff->open_flags & FOPEN_DIRECT_IO)
180 file->f_op = &fuse_direct_io_file_operations; 180 file->f_op = &fuse_direct_io_file_operations;
181 if (!(ff->open_flags & FOPEN_KEEP_CACHE)) 181 if (!(ff->open_flags & FOPEN_KEEP_CACHE))
182 invalidate_inode_pages2(inode->i_mapping); 182 invalidate_inode_pages2(inode->i_mapping);
183 if (ff->open_flags & FOPEN_NONSEEKABLE) 183 if (ff->open_flags & FOPEN_NONSEEKABLE)
184 nonseekable_open(inode, file); 184 nonseekable_open(inode, file);
185 if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) { 185 if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
186 struct fuse_inode *fi = get_fuse_inode(inode); 186 struct fuse_inode *fi = get_fuse_inode(inode);
187 187
188 spin_lock(&fc->lock); 188 spin_lock(&fc->lock);
189 fi->attr_version = ++fc->attr_version; 189 fi->attr_version = ++fc->attr_version;
190 i_size_write(inode, 0); 190 i_size_write(inode, 0);
191 spin_unlock(&fc->lock); 191 spin_unlock(&fc->lock);
192 fuse_invalidate_attr(inode); 192 fuse_invalidate_attr(inode);
193 } 193 }
194 } 194 }
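The open_flags consumed in fuse_finish_open() are chosen by the userspace server in its reply to the open request. A hedged sketch of the libfuse side, assuming the usual struct fuse_file_info bit-fields (direct_io, keep_cache, nonseekable); it is illustrative and not part of this diff:

	/*
	 * Illustrative libfuse (high-level API) open handler: the bits set
	 * here reach the kernel as FOPEN_DIRECT_IO / FOPEN_KEEP_CACHE /
	 * FOPEN_NONSEEKABLE and drive the branches in fuse_finish_open().
	 */
	static int example_open(const char *path, struct fuse_file_info *fi)
	{
		fi->direct_io = 1;	/* -> FOPEN_DIRECT_IO: switch to direct-IO file ops */
		fi->keep_cache = 0;	/* not kept -> kernel invalidates cached pages on open */
		fi->nonseekable = 1;	/* -> FOPEN_NONSEEKABLE */
		return 0;
	}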
195 195
196 int fuse_open_common(struct inode *inode, struct file *file, bool isdir) 196 int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
197 { 197 {
198 struct fuse_conn *fc = get_fuse_conn(inode); 198 struct fuse_conn *fc = get_fuse_conn(inode);
199 int err; 199 int err;
200 200
201 err = generic_file_open(inode, file); 201 err = generic_file_open(inode, file);
202 if (err) 202 if (err)
203 return err; 203 return err;
204 204
205 err = fuse_do_open(fc, get_node_id(inode), file, isdir); 205 err = fuse_do_open(fc, get_node_id(inode), file, isdir);
206 if (err) 206 if (err)
207 return err; 207 return err;
208 208
209 fuse_finish_open(inode, file); 209 fuse_finish_open(inode, file);
210 210
211 return 0; 211 return 0;
212 } 212 }
213 213
214 static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode) 214 static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode)
215 { 215 {
216 struct fuse_conn *fc = ff->fc; 216 struct fuse_conn *fc = ff->fc;
217 struct fuse_req *req = ff->reserved_req; 217 struct fuse_req *req = ff->reserved_req;
218 struct fuse_release_in *inarg = &req->misc.release.in; 218 struct fuse_release_in *inarg = &req->misc.release.in;
219 219
220 spin_lock(&fc->lock); 220 spin_lock(&fc->lock);
221 list_del(&ff->write_entry); 221 list_del(&ff->write_entry);
222 if (!RB_EMPTY_NODE(&ff->polled_node)) 222 if (!RB_EMPTY_NODE(&ff->polled_node))
223 rb_erase(&ff->polled_node, &fc->polled_files); 223 rb_erase(&ff->polled_node, &fc->polled_files);
224 spin_unlock(&fc->lock); 224 spin_unlock(&fc->lock);
225 225
226 wake_up_interruptible_all(&ff->poll_wait); 226 wake_up_interruptible_all(&ff->poll_wait);
227 227
228 inarg->fh = ff->fh; 228 inarg->fh = ff->fh;
229 inarg->flags = flags; 229 inarg->flags = flags;
230 req->in.h.opcode = opcode; 230 req->in.h.opcode = opcode;
231 req->in.h.nodeid = ff->nodeid; 231 req->in.h.nodeid = ff->nodeid;
232 req->in.numargs = 1; 232 req->in.numargs = 1;
233 req->in.args[0].size = sizeof(struct fuse_release_in); 233 req->in.args[0].size = sizeof(struct fuse_release_in);
234 req->in.args[0].value = inarg; 234 req->in.args[0].value = inarg;
235 } 235 }
236 236
237 void fuse_release_common(struct file *file, int opcode) 237 void fuse_release_common(struct file *file, int opcode)
238 { 238 {
239 struct fuse_file *ff; 239 struct fuse_file *ff;
240 struct fuse_req *req; 240 struct fuse_req *req;
241 241
242 ff = file->private_data; 242 ff = file->private_data;
243 if (unlikely(!ff)) 243 if (unlikely(!ff))
244 return; 244 return;
245 245
246 req = ff->reserved_req; 246 req = ff->reserved_req;
247 fuse_prepare_release(ff, file->f_flags, opcode); 247 fuse_prepare_release(ff, file->f_flags, opcode);
248 248
249 if (ff->flock) { 249 if (ff->flock) {
250 struct fuse_release_in *inarg = &req->misc.release.in; 250 struct fuse_release_in *inarg = &req->misc.release.in;
251 inarg->release_flags |= FUSE_RELEASE_FLOCK_UNLOCK; 251 inarg->release_flags |= FUSE_RELEASE_FLOCK_UNLOCK;
252 inarg->lock_owner = fuse_lock_owner_id(ff->fc, 252 inarg->lock_owner = fuse_lock_owner_id(ff->fc,
253 (fl_owner_t) file); 253 (fl_owner_t) file);
254 } 254 }
255 /* Hold vfsmount and dentry until release is finished */ 255 /* Hold vfsmount and dentry until release is finished */
256 path_get(&file->f_path); 256 path_get(&file->f_path);
257 req->misc.release.path = file->f_path; 257 req->misc.release.path = file->f_path;
258 258
259 /* 259 /*
260 * Normally this will send the RELEASE request, however if 260 * Normally this will send the RELEASE request, however if
261 * some asynchronous READ or WRITE requests are outstanding, 261 * some asynchronous READ or WRITE requests are outstanding,
262 * the sending will be delayed. 262 * the sending will be delayed.
263 * 263 *
264 * Make the release synchronous if this is a fuseblk mount, 264 * Make the release synchronous if this is a fuseblk mount,
265 * synchronous RELEASE is allowed (and desirable) in this case 265 * synchronous RELEASE is allowed (and desirable) in this case
266 * because the server can be trusted not to screw up. 266 * because the server can be trusted not to screw up.
267 */ 267 */
268 fuse_file_put(ff, ff->fc->destroy_req != NULL); 268 fuse_file_put(ff, ff->fc->destroy_req != NULL);
269 } 269 }
270 270
271 static int fuse_open(struct inode *inode, struct file *file) 271 static int fuse_open(struct inode *inode, struct file *file)
272 { 272 {
273 return fuse_open_common(inode, file, false); 273 return fuse_open_common(inode, file, false);
274 } 274 }
275 275
276 static int fuse_release(struct inode *inode, struct file *file) 276 static int fuse_release(struct inode *inode, struct file *file)
277 { 277 {
278 fuse_release_common(file, FUSE_RELEASE); 278 fuse_release_common(file, FUSE_RELEASE);
279 279
280 /* return value is ignored by VFS */ 280 /* return value is ignored by VFS */
281 return 0; 281 return 0;
282 } 282 }
283 283
284 void fuse_sync_release(struct fuse_file *ff, int flags) 284 void fuse_sync_release(struct fuse_file *ff, int flags)
285 { 285 {
286 WARN_ON(atomic_read(&ff->count) > 1); 286 WARN_ON(atomic_read(&ff->count) > 1);
287 fuse_prepare_release(ff, flags, FUSE_RELEASE); 287 fuse_prepare_release(ff, flags, FUSE_RELEASE);
288 ff->reserved_req->force = 1; 288 ff->reserved_req->force = 1;
289 ff->reserved_req->background = 0; 289 ff->reserved_req->background = 0;
290 fuse_request_send(ff->fc, ff->reserved_req); 290 fuse_request_send(ff->fc, ff->reserved_req);
291 fuse_put_request(ff->fc, ff->reserved_req); 291 fuse_put_request(ff->fc, ff->reserved_req);
292 kfree(ff); 292 kfree(ff);
293 } 293 }
294 EXPORT_SYMBOL_GPL(fuse_sync_release); 294 EXPORT_SYMBOL_GPL(fuse_sync_release);
295 295
/*
 * Scramble the ID space with XTEA, so that the value of the files_struct
 * pointer is not exposed to userspace.
 */
u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
{
	u32 *k = fc->scramble_key;
	u64 v = (unsigned long) id;
	u32 v0 = v;
	u32 v1 = v >> 32;
	u32 sum = 0;
	int i;

	for (i = 0; i < 32; i++) {
		v0 += ((v1 << 4 ^ v1 >> 5) + v1) ^ (sum + k[sum & 3]);
		sum += 0x9E3779B9;
		v1 += ((v0 << 4 ^ v0 >> 5) + v0) ^ (sum + k[sum>>11 & 3]);
	}

	return (u64) v0 + ((u64) v1 << 32);
}

/*
 * Check if page is under writeback
 *
 * This is currently done by walking the list of writepage requests
 * for the inode, which can be pretty inefficient.
 */
static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
{
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);
	struct fuse_req *req;
	bool found = false;

	spin_lock(&fc->lock);
	list_for_each_entry(req, &fi->writepages, writepages_entry) {
		pgoff_t curr_index;

		BUG_ON(req->inode != inode);
		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
		if (curr_index == index) {
			found = true;
			break;
		}
	}
	spin_unlock(&fc->lock);

	return found;
}

/*
 * Wait for page writeback to be completed.
 *
 * Since fuse doesn't rely on the VM writeback tracking, this has to
 * use some other means.
 */
static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
{
	struct fuse_inode *fi = get_fuse_inode(inode);

	wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
	return 0;
}

static int fuse_flush(struct file *file, fl_owner_t id)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_file *ff = file->private_data;
	struct fuse_req *req;
	struct fuse_flush_in inarg;
	int err;

	if (is_bad_inode(inode))
		return -EIO;

	if (fc->no_flush)
		return 0;

	req = fuse_get_req_nofail_nopages(fc, file);
	memset(&inarg, 0, sizeof(inarg));
	inarg.fh = ff->fh;
	inarg.lock_owner = fuse_lock_owner_id(fc, id);
	req->in.h.opcode = FUSE_FLUSH;
	req->in.h.nodeid = get_node_id(inode);
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(inarg);
	req->in.args[0].value = &inarg;
	req->force = 1;
	fuse_request_send(fc, req);
	err = req->out.h.error;
	fuse_put_request(fc, req);
	if (err == -ENOSYS) {
		fc->no_flush = 1;
		err = 0;
	}
	return err;
}

/*
 * Wait for all pending writepages on the inode to finish.
 *
 * This is currently done by blocking further writes with FUSE_NOWRITE
 * and waiting for all sent writes to complete.
 *
 * This must be called under i_mutex, otherwise the FUSE_NOWRITE usage
 * could conflict with truncation.
 */
static void fuse_sync_writes(struct inode *inode)
{
	fuse_set_nowrite(inode);
	fuse_release_nowrite(inode);
}

int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
		      int datasync, int isdir)
{
	struct inode *inode = file->f_mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_file *ff = file->private_data;
	struct fuse_req *req;
	struct fuse_fsync_in inarg;
	int err;

	if (is_bad_inode(inode))
		return -EIO;

	err = filemap_write_and_wait_range(inode->i_mapping, start, end);
	if (err)
		return err;

	if ((!isdir && fc->no_fsync) || (isdir && fc->no_fsyncdir))
		return 0;

	mutex_lock(&inode->i_mutex);

	/*
	 * Start writeback against all dirty pages of the inode, then
	 * wait for all outstanding writes, before sending the FSYNC
	 * request.
	 */
	err = write_inode_now(inode, 0);
	if (err)
		goto out;

	fuse_sync_writes(inode);

	req = fuse_get_req_nopages(fc);
	if (IS_ERR(req)) {
		err = PTR_ERR(req);
		goto out;
	}

	memset(&inarg, 0, sizeof(inarg));
	inarg.fh = ff->fh;
	inarg.fsync_flags = datasync ? 1 : 0;
	req->in.h.opcode = isdir ? FUSE_FSYNCDIR : FUSE_FSYNC;
	req->in.h.nodeid = get_node_id(inode);
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(inarg);
	req->in.args[0].value = &inarg;
	fuse_request_send(fc, req);
	err = req->out.h.error;
	fuse_put_request(fc, req);
	if (err == -ENOSYS) {
		if (isdir)
			fc->no_fsyncdir = 1;
		else
			fc->no_fsync = 1;
		err = 0;
	}
 out:
	mutex_unlock(&inode->i_mutex);
	return err;
}

static int fuse_fsync(struct file *file, loff_t start, loff_t end,
		      int datasync)
{
	return fuse_fsync_common(file, start, end, datasync, 0);
}

void fuse_read_fill(struct fuse_req *req, struct file *file, loff_t pos,
		    size_t count, int opcode)
{
	struct fuse_read_in *inarg = &req->misc.read.in;
	struct fuse_file *ff = file->private_data;

	inarg->fh = ff->fh;
	inarg->offset = pos;
	inarg->size = count;
	inarg->flags = file->f_flags;
	req->in.h.opcode = opcode;
	req->in.h.nodeid = ff->nodeid;
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(struct fuse_read_in);
	req->in.args[0].value = inarg;
	req->out.argvar = 1;
	req->out.numargs = 1;
	req->out.args[0].size = count;
}

static void fuse_release_user_pages(struct fuse_req *req, int write)
{
	unsigned i;

	for (i = 0; i < req->num_pages; i++) {
		struct page *page = req->pages[i];
		if (write)
			set_page_dirty_lock(page);
		put_page(page);
	}
}

/**
 * In case of short read, the caller sets 'pos' to the position of
 * actual end of fuse request in IO request. Otherwise, if bytes_requested
 * == bytes_transferred or rw == WRITE, the caller sets 'pos' to -1.
 *
 * An example:
 * User requested DIO read of 64K. It was splitted into two 32K fuse requests,
 * both submitted asynchronously. The first of them was ACKed by userspace as
 * fully completed (req->out.args[0].size == 32K) resulting in pos == -1. The
 * second request was ACKed as short, e.g. only 1K was read, resulting in
 * pos == 33K.
 *
 * Thus, when all fuse requests are completed, the minimal non-negative 'pos'
 * will be equal to the length of the longest contiguous fragment of
 * transferred data starting from the beginning of IO request.
 */
static void fuse_aio_complete(struct fuse_io_priv *io, int err, ssize_t pos)
{
	int left;

	spin_lock(&io->lock);
	if (err)
		io->err = io->err ? : err;
	else if (pos >= 0 && (io->bytes < 0 || pos < io->bytes))
		io->bytes = pos;

	left = --io->reqs;
	spin_unlock(&io->lock);

	if (!left) {
		long res;

		if (io->err)
			res = io->err;
		else if (io->bytes >= 0 && io->write)
			res = -EIO;
		else {
			res = io->bytes < 0 ? io->size : io->bytes;

			if (!is_sync_kiocb(io->iocb)) {
				struct inode *inode = file_inode(io->iocb->ki_filp);
				struct fuse_conn *fc = get_fuse_conn(inode);
				struct fuse_inode *fi = get_fuse_inode(inode);

				spin_lock(&fc->lock);
				fi->attr_version = ++fc->attr_version;
				spin_unlock(&fc->lock);
			}
		}

		aio_complete(io->iocb, res, 0);
		kfree(io);
	}
}

static void fuse_aio_complete_req(struct fuse_conn *fc, struct fuse_req *req)
{
	struct fuse_io_priv *io = req->io;
	ssize_t pos = -1;

	fuse_release_user_pages(req, !io->write);

	if (io->write) {
		if (req->misc.write.in.size != req->misc.write.out.size)
			pos = req->misc.write.in.offset - io->offset +
				req->misc.write.out.size;
	} else {
		if (req->misc.read.in.size != req->out.args[0].size)
			pos = req->misc.read.in.offset - io->offset +
				req->out.args[0].size;
	}

	fuse_aio_complete(io, req->out.h.error, pos);
}

static size_t fuse_async_req_send(struct fuse_conn *fc, struct fuse_req *req,
				  size_t num_bytes, struct fuse_io_priv *io)
{
	spin_lock(&io->lock);
	io->size += num_bytes;
	io->reqs++;
	spin_unlock(&io->lock);

	req->io = io;
	req->end = fuse_aio_complete_req;

	__fuse_get_request(req);
	fuse_request_send_background(fc, req);

	return num_bytes;
}

static size_t fuse_send_read(struct fuse_req *req, struct fuse_io_priv *io,
			     loff_t pos, size_t count, fl_owner_t owner)
{
	struct file *file = io->file;
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = ff->fc;

	fuse_read_fill(req, file, pos, count, FUSE_READ);
	if (owner != NULL) {
		struct fuse_read_in *inarg = &req->misc.read.in;

		inarg->read_flags |= FUSE_READ_LOCKOWNER;
		inarg->lock_owner = fuse_lock_owner_id(fc, owner);
	}

	if (io->async)
		return fuse_async_req_send(fc, req, count, io);

	fuse_request_send(fc, req);
	return req->out.args[0].size;
}

static void fuse_read_update_size(struct inode *inode, loff_t size,
				  u64 attr_ver)
{
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);

	spin_lock(&fc->lock);
	if (attr_ver == fi->attr_version && size < inode->i_size &&
	    !test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
		fi->attr_version = ++fc->attr_version;
		i_size_write(inode, size);
	}
	spin_unlock(&fc->lock);
}

static int fuse_readpage(struct file *file, struct page *page)
{
	struct fuse_io_priv io = { .async = 0, .file = file };
	struct inode *inode = page->mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_req *req;
	size_t num_read;
	loff_t pos = page_offset(page);
	size_t count = PAGE_CACHE_SIZE;
	u64 attr_ver;
	int err;

	err = -EIO;
	if (is_bad_inode(inode))
		goto out;

	/*
	 * Page writeback can extend beyond the lifetime of the
	 * page-cache page, so make sure we read a properly synced
	 * page.
	 */
	fuse_wait_on_page_writeback(inode, page->index);

	req = fuse_get_req(fc, 1);
	err = PTR_ERR(req);
	if (IS_ERR(req))
		goto out;

	attr_ver = fuse_get_attr_version(fc);

	req->out.page_zeroing = 1;
	req->out.argpages = 1;
	req->num_pages = 1;
	req->pages[0] = page;
	req->page_descs[0].length = count;
	num_read = fuse_send_read(req, &io, pos, count, NULL);
	err = req->out.h.error;
	fuse_put_request(fc, req);

	if (!err) {
		/*
		 * Short read means EOF. If file size is larger, truncate it
		 */
		if (num_read < count)
			fuse_read_update_size(inode, pos + num_read, attr_ver);

		SetPageUptodate(page);
	}

	fuse_invalidate_attr(inode); /* atime changed */
 out:
	unlock_page(page);
	return err;
}

static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
{
	int i;
	size_t count = req->misc.read.in.size;
	size_t num_read = req->out.args[0].size;
	struct address_space *mapping = NULL;

	for (i = 0; mapping == NULL && i < req->num_pages; i++)
		mapping = req->pages[i]->mapping;

	if (mapping) {
		struct inode *inode = mapping->host;

		/*
		 * Short read means EOF. If file size is larger, truncate it
		 */
		if (!req->out.h.error && num_read < count) {
			loff_t pos;

			pos = page_offset(req->pages[0]) + num_read;
			fuse_read_update_size(inode, pos,
					      req->misc.read.attr_ver);
		}
		fuse_invalidate_attr(inode); /* atime changed */
	}

	for (i = 0; i < req->num_pages; i++) {
		struct page *page = req->pages[i];
		if (!req->out.h.error)
			SetPageUptodate(page);
		else
			SetPageError(page);
		unlock_page(page);
		page_cache_release(page);
	}
	if (req->ff)
		fuse_file_put(req->ff, false);
}

static void fuse_send_readpages(struct fuse_req *req, struct file *file)
{
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = ff->fc;
	loff_t pos = page_offset(req->pages[0]);
	size_t count = req->num_pages << PAGE_CACHE_SHIFT;

	req->out.argpages = 1;
	req->out.page_zeroing = 1;
	req->out.page_replace = 1;
	fuse_read_fill(req, file, pos, count, FUSE_READ);
	req->misc.read.attr_ver = fuse_get_attr_version(fc);
	if (fc->async_read) {
		req->ff = fuse_file_get(ff);
		req->end = fuse_readpages_end;
		fuse_request_send_background(fc, req);
	} else {
		fuse_request_send(fc, req);
		fuse_readpages_end(fc, req);
		fuse_put_request(fc, req);
	}
}

struct fuse_fill_data {
	struct fuse_req *req;
	struct file *file;
	struct inode *inode;
	unsigned nr_pages;
};

static int fuse_readpages_fill(void *_data, struct page *page)
{
	struct fuse_fill_data *data = _data;
	struct fuse_req *req = data->req;
	struct inode *inode = data->inode;
	struct fuse_conn *fc = get_fuse_conn(inode);

	fuse_wait_on_page_writeback(inode, page->index);

	if (req->num_pages &&
	    (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
	     (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_read ||
	     req->pages[req->num_pages - 1]->index + 1 != page->index)) {
		int nr_alloc = min_t(unsigned, data->nr_pages,
				     FUSE_MAX_PAGES_PER_REQ);
		fuse_send_readpages(req, data->file);
		if (fc->async_read)
			req = fuse_get_req_for_background(fc, nr_alloc);
		else
			req = fuse_get_req(fc, nr_alloc);

		data->req = req;
		if (IS_ERR(req)) {
			unlock_page(page);
			return PTR_ERR(req);
		}
	}

	if (WARN_ON(req->num_pages >= req->max_pages)) {
		fuse_put_request(fc, req);
		return -EIO;
	}

	page_cache_get(page);
	req->pages[req->num_pages] = page;
	req->page_descs[req->num_pages].length = PAGE_SIZE;
	req->num_pages++;
	data->nr_pages--;
	return 0;
}

static int fuse_readpages(struct file *file, struct address_space *mapping,
			  struct list_head *pages, unsigned nr_pages)
{
	struct inode *inode = mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_fill_data data;
	int err;
	int nr_alloc = min_t(unsigned, nr_pages, FUSE_MAX_PAGES_PER_REQ);

	err = -EIO;
	if (is_bad_inode(inode))
		goto out;

	data.file = file;
	data.inode = inode;
	if (fc->async_read)
		data.req = fuse_get_req_for_background(fc, nr_alloc);
	else
		data.req = fuse_get_req(fc, nr_alloc);
	data.nr_pages = nr_pages;
	err = PTR_ERR(data.req);
	if (IS_ERR(data.req))
		goto out;

	err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data);
	if (!err) {
		if (data.req->num_pages)
			fuse_send_readpages(data.req, file);
		else
			fuse_put_request(fc, data.req);
	}
 out:
	return err;
}

static ssize_t fuse_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
				  unsigned long nr_segs, loff_t pos)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);

	/*
	 * In auto invalidate mode, always update attributes on read.
	 * Otherwise, only update if we attempt to read past EOF (to ensure
	 * i_size is up to date).
	 */
	if (fc->auto_inval_data ||
	    (pos + iov_length(iov, nr_segs) > i_size_read(inode))) {
		int err;
		err = fuse_update_attributes(inode, NULL, iocb->ki_filp, NULL);
		if (err)
			return err;
	}

	return generic_file_aio_read(iocb, iov, nr_segs, pos);
}

static void fuse_write_fill(struct fuse_req *req, struct fuse_file *ff,
			    loff_t pos, size_t count)
{
	struct fuse_write_in *inarg = &req->misc.write.in;
	struct fuse_write_out *outarg = &req->misc.write.out;

	inarg->fh = ff->fh;
	inarg->offset = pos;
	inarg->size = count;
	req->in.h.opcode = FUSE_WRITE;
	req->in.h.nodeid = ff->nodeid;
	req->in.numargs = 2;
	if (ff->fc->minor < 9)
		req->in.args[0].size = FUSE_COMPAT_WRITE_IN_SIZE;
	else
		req->in.args[0].size = sizeof(struct fuse_write_in);
	req->in.args[0].value = inarg;
	req->in.args[1].size = count;
	req->out.numargs = 1;
	req->out.args[0].size = sizeof(struct fuse_write_out);
	req->out.args[0].value = outarg;
}

static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
			      loff_t pos, size_t count, fl_owner_t owner)
{
	struct file *file = io->file;
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = ff->fc;
	struct fuse_write_in *inarg = &req->misc.write.in;

	fuse_write_fill(req, ff, pos, count);
	inarg->flags = file->f_flags;
	if (owner != NULL) {
		inarg->write_flags |= FUSE_WRITE_LOCKOWNER;
		inarg->lock_owner = fuse_lock_owner_id(fc, owner);
	}

	if (io->async)
		return fuse_async_req_send(fc, req, count, io);

	fuse_request_send(fc, req);
	return req->misc.write.out.size;
}

void fuse_write_update_size(struct inode *inode, loff_t pos)
{
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);

	spin_lock(&fc->lock);
	fi->attr_version = ++fc->attr_version;
	if (pos > inode->i_size)
		i_size_write(inode, pos);
	spin_unlock(&fc->lock);
}

static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
				    struct inode *inode, loff_t pos,
				    size_t count)
{
	size_t res;
	unsigned offset;
	unsigned i;
	struct fuse_io_priv io = { .async = 0, .file = file };

	for (i = 0; i < req->num_pages; i++)
		fuse_wait_on_page_writeback(inode, req->pages[i]->index);

	res = fuse_send_write(req, &io, pos, count, NULL);

	offset = req->page_descs[0].offset;
	count = res;
	for (i = 0; i < req->num_pages; i++) {
		struct page *page = req->pages[i];

		if (!req->out.h.error && !offset && count >= PAGE_CACHE_SIZE)
			SetPageUptodate(page);

		if (count > PAGE_CACHE_SIZE - offset)
			count -= PAGE_CACHE_SIZE - offset;
		else
			count = 0;
		offset = 0;

		unlock_page(page);
		page_cache_release(page);
	}

	return res;
}

static ssize_t fuse_fill_write_pages(struct fuse_req *req,
				     struct address_space *mapping,
				     struct iov_iter *ii, loff_t pos)
{
	struct fuse_conn *fc = get_fuse_conn(mapping->host);
	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
	size_t count = 0;
	int err;

	req->in.argpages = 1;
	req->page_descs[0].offset = offset;

	do {
		size_t tmp;
		struct page *page;
		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
		size_t bytes = min_t(size_t, PAGE_CACHE_SIZE - offset,
				     iov_iter_count(ii));

		bytes = min_t(size_t, bytes, fc->max_write - count);

 again:
		err = -EFAULT;
		if (iov_iter_fault_in_readable(ii, bytes))
			break;

		err = -ENOMEM;
		page = grab_cache_page_write_begin(mapping, index, 0);
		if (!page)
			break;

		if (mapping_writably_mapped(mapping))
			flush_dcache_page(page);

		tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes);
		flush_dcache_page(page);

-		mark_page_accessed(page);
-
		if (!tmp) {
			unlock_page(page);
			page_cache_release(page);
			bytes = min(bytes, iov_iter_single_seg_count(ii));
			goto again;
		}

		err = 0;
		req->pages[req->num_pages] = page;
		req->page_descs[req->num_pages].length = tmp;
		req->num_pages++;

		iov_iter_advance(ii, tmp);
		count += tmp;
		pos += tmp;
		offset += tmp;
		if (offset == PAGE_CACHE_SIZE)
			offset = 0;

		if (!fc->big_writes)
			break;
	} while (iov_iter_count(ii) && count < fc->max_write &&
		 req->num_pages < req->max_pages && offset == 0);

	return count > 0 ? count : err;
}

static inline unsigned fuse_wr_pages(loff_t pos, size_t len)
{
	return min_t(unsigned,
		     ((pos + len - 1) >> PAGE_CACHE_SHIFT) -
		     (pos >> PAGE_CACHE_SHIFT) + 1,
		     FUSE_MAX_PAGES_PER_REQ);
}

static ssize_t fuse_perform_write(struct file *file,
				  struct address_space *mapping,
				  struct iov_iter *ii, loff_t pos)
{
	struct inode *inode = mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);
	int err = 0;
	ssize_t res = 0;

	if (is_bad_inode(inode))
		return -EIO;

	if (inode->i_size < pos + iov_iter_count(ii))
		set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);

	do {
		struct fuse_req *req;
		ssize_t count;
		unsigned nr_pages = fuse_wr_pages(pos, iov_iter_count(ii));

		req = fuse_get_req(fc, nr_pages);
		if (IS_ERR(req)) {
			err = PTR_ERR(req);
			break;
		}

		count = fuse_fill_write_pages(req, mapping, ii, pos);
		if (count <= 0) {
			err = count;
		} else {
			size_t num_written;

			num_written = fuse_send_write_pages(req, file, inode,
							    pos, count);
			err = req->out.h.error;
			if (!err) {
				res += num_written;
				pos += num_written;

				/* break out of the loop on short write */
				if (num_written != count)
					err = -EIO;
			}
		}
		fuse_put_request(fc, req);
	} while (!err && iov_iter_count(ii));

	if (res > 0)
		fuse_write_update_size(inode, pos);

	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
	fuse_invalidate_attr(inode);

	return res > 0 ? res : err;
}

static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
				   unsigned long nr_segs, loff_t pos)
{
	struct file *file = iocb->ki_filp;
	struct address_space *mapping = file->f_mapping;
	size_t count = 0;
	size_t ocount = 0;
	ssize_t written = 0;
	ssize_t written_buffered = 0;
	struct inode *inode = mapping->host;
	ssize_t err;
	struct iov_iter i;
	loff_t endbyte = 0;

	WARN_ON(iocb->ki_pos != pos);

	ocount = 0;
	err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
	if (err)
		return err;

	count = ocount;
	mutex_lock(&inode->i_mutex);

	/* We can write back this queue in page reclaim */
	current->backing_dev_info = mapping->backing_dev_info;

	err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
	if (err)
		goto out;

	if (count == 0)
		goto out;

	err = file_remove_suid(file);
	if (err)
		goto out;

	err = file_update_time(file);
	if (err)
		goto out;

	if (file->f_flags & O_DIRECT) {
		written = generic_file_direct_write(iocb, iov, &nr_segs,
						    pos, &iocb->ki_pos,
						    count, ocount);
		if (written < 0 || written == count)
			goto out;

		pos += written;
		count -= written;

		iov_iter_init(&i, iov, nr_segs, count, written);
		written_buffered = fuse_perform_write(file, mapping, &i, pos);
		if (written_buffered < 0) {
			err = written_buffered;
			goto out;
		}
		endbyte = pos + written_buffered - 1;

		err = filemap_write_and_wait_range(file->f_mapping, pos,
						   endbyte);
		if (err)
			goto out;

		invalidate_mapping_pages(file->f_mapping,
					 pos >> PAGE_CACHE_SHIFT,
					 endbyte >> PAGE_CACHE_SHIFT);

		written += written_buffered;
		iocb->ki_pos = pos + written_buffered;
	} else {
		iov_iter_init(&i, iov, nr_segs, count, 0);
		written = fuse_perform_write(file, mapping, &i, pos);
		if (written >= 0)
			iocb->ki_pos = pos + written;
	}
 out:
	current->backing_dev_info = NULL;
	mutex_unlock(&inode->i_mutex);

	return written ? written : err;
}

1169 static inline void fuse_page_descs_length_init(struct fuse_req *req, 1167 static inline void fuse_page_descs_length_init(struct fuse_req *req,
1170 unsigned index, unsigned nr_pages) 1168 unsigned index, unsigned nr_pages)
1171 { 1169 {
1172 int i; 1170 int i;
1173 1171
1174 for (i = index; i < index + nr_pages; i++) 1172 for (i = index; i < index + nr_pages; i++)
1175 req->page_descs[i].length = PAGE_SIZE - 1173 req->page_descs[i].length = PAGE_SIZE -
1176 req->page_descs[i].offset; 1174 req->page_descs[i].offset;
1177 } 1175 }
1178 1176
1179 static inline unsigned long fuse_get_user_addr(const struct iov_iter *ii) 1177 static inline unsigned long fuse_get_user_addr(const struct iov_iter *ii)
1180 { 1178 {
1181 return (unsigned long)ii->iov->iov_base + ii->iov_offset; 1179 return (unsigned long)ii->iov->iov_base + ii->iov_offset;
1182 } 1180 }
1183 1181
1184 static inline size_t fuse_get_frag_size(const struct iov_iter *ii, 1182 static inline size_t fuse_get_frag_size(const struct iov_iter *ii,
1185 size_t max_size) 1183 size_t max_size)
1186 { 1184 {
	return min(iov_iter_single_seg_count(ii), max_size);
}

static int fuse_get_user_pages(struct fuse_req *req, struct iov_iter *ii,
			       size_t *nbytesp, int write)
{
	size_t nbytes = 0;  /* # bytes already packed in req */

	/* Special case for kernel I/O: can copy directly into the buffer */
	if (segment_eq(get_fs(), KERNEL_DS)) {
		unsigned long user_addr = fuse_get_user_addr(ii);
		size_t frag_size = fuse_get_frag_size(ii, *nbytesp);

		if (write)
			req->in.args[1].value = (void *) user_addr;
		else
			req->out.args[0].value = (void *) user_addr;

		iov_iter_advance(ii, frag_size);
		*nbytesp = frag_size;
		return 0;
	}

	while (nbytes < *nbytesp && req->num_pages < req->max_pages) {
		unsigned npages;
		unsigned long user_addr = fuse_get_user_addr(ii);
		unsigned offset = user_addr & ~PAGE_MASK;
		size_t frag_size = fuse_get_frag_size(ii, *nbytesp - nbytes);
		int ret;

		unsigned n = req->max_pages - req->num_pages;
		frag_size = min_t(size_t, frag_size, n << PAGE_SHIFT);

		npages = (frag_size + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
		npages = clamp(npages, 1U, n);

		ret = get_user_pages_fast(user_addr, npages, !write,
					  &req->pages[req->num_pages]);
		if (ret < 0)
			return ret;

		npages = ret;
		frag_size = min_t(size_t, frag_size,
				  (npages << PAGE_SHIFT) - offset);
		iov_iter_advance(ii, frag_size);

		req->page_descs[req->num_pages].offset = offset;
		fuse_page_descs_length_init(req, req->num_pages, npages);

		req->num_pages += npages;
		req->page_descs[req->num_pages - 1].length -=
			(npages << PAGE_SHIFT) - offset - frag_size;

		nbytes += frag_size;
	}

	if (write)
		req->in.argpages = 1;
	else
		req->out.argpages = 1;

	*nbytesp = nbytes;

	return 0;
}
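
/*
 * [Editor's illustration, not part of the kernel source] A worked example of
 * the page-count arithmetic above, assuming PAGE_SIZE == 4096 (PAGE_SHIFT ==
 * 12) and a hypothetical segment:
 *
 *	user_addr ends in 0x300  ->  offset    = 768 bytes
 *	frag_size = 10000 bytes
 *
 *	npages = (10000 + 768 + 4095) >> 12 = 14863 >> 12 = 3
 *
 * Three pages are requested from get_user_pages_fast(); if fewer are pinned,
 * frag_size is trimmed to (npages << PAGE_SHIFT) - offset, and the length of
 * the last page_desc is reduced so the descriptors cover exactly frag_size
 * bytes starting at "offset" within the first page.
 */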

static inline int fuse_iter_npages(const struct iov_iter *ii_p)
{
	struct iov_iter ii = *ii_p;
	int npages = 0;

	while (iov_iter_count(&ii) && npages < FUSE_MAX_PAGES_PER_REQ) {
		unsigned long user_addr = fuse_get_user_addr(&ii);
		unsigned offset = user_addr & ~PAGE_MASK;
		size_t frag_size = iov_iter_single_seg_count(&ii);

		npages += (frag_size + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
		iov_iter_advance(&ii, frag_size);
	}

	return min(npages, FUSE_MAX_PAGES_PER_REQ);
}

ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
		       unsigned long nr_segs, size_t count, loff_t *ppos,
		       int write)
{
	struct file *file = io->file;
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = ff->fc;
	size_t nmax = write ? fc->max_write : fc->max_read;
	loff_t pos = *ppos;
	ssize_t res = 0;
	struct fuse_req *req;
	struct iov_iter ii;

	iov_iter_init(&ii, iov, nr_segs, count, 0);

	if (io->async)
		req = fuse_get_req_for_background(fc, fuse_iter_npages(&ii));
	else
		req = fuse_get_req(fc, fuse_iter_npages(&ii));
	if (IS_ERR(req))
		return PTR_ERR(req);

	while (count) {
		size_t nres;
		fl_owner_t owner = current->files;
		size_t nbytes = min(count, nmax);
		int err = fuse_get_user_pages(req, &ii, &nbytes, write);
		if (err) {
			res = err;
			break;
		}

		if (write)
			nres = fuse_send_write(req, io, pos, nbytes, owner);
		else
			nres = fuse_send_read(req, io, pos, nbytes, owner);

		if (!io->async)
			fuse_release_user_pages(req, !write);
		if (req->out.h.error) {
			if (!res)
				res = req->out.h.error;
			break;
		} else if (nres > nbytes) {
			res = -EIO;
			break;
		}
		count -= nres;
		res += nres;
		pos += nres;
		if (nres != nbytes)
			break;
		if (count) {
			fuse_put_request(fc, req);
			if (io->async)
				req = fuse_get_req_for_background(fc,
					fuse_iter_npages(&ii));
			else
				req = fuse_get_req(fc, fuse_iter_npages(&ii));
			if (IS_ERR(req))
				break;
		}
	}
	if (!IS_ERR(req))
		fuse_put_request(fc, req);
	if (res > 0)
		*ppos = pos;

	return res;
}
EXPORT_SYMBOL_GPL(fuse_direct_io);
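
/*
 * [Editor's note, not part of the kernel source] fuse_direct_io() splits the
 * transfer into requests of at most min(count, fc->max_write or fc->max_read)
 * bytes: each iteration pins the next batch of user pages, sends one FUSE
 * read or write, and stops early on an error or a short transfer
 * (nres != nbytes).  For synchronous callers the pinned pages are released
 * right after the request completes; for async callers (io->async) they are
 * released later, when the request finishes.
 */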

static ssize_t __fuse_direct_read(struct fuse_io_priv *io,
				  const struct iovec *iov,
				  unsigned long nr_segs, loff_t *ppos,
				  size_t count)
{
	ssize_t res;
	struct file *file = io->file;
	struct inode *inode = file_inode(file);

	if (is_bad_inode(inode))
		return -EIO;

	res = fuse_direct_io(io, iov, nr_segs, count, ppos, 0);

	fuse_invalidate_attr(inode);

	return res;
}

static ssize_t fuse_direct_read(struct file *file, char __user *buf,
				size_t count, loff_t *ppos)
{
	struct fuse_io_priv io = { .async = 0, .file = file };
	struct iovec iov = { .iov_base = buf, .iov_len = count };
	return __fuse_direct_read(&io, &iov, 1, ppos, count);
}

static ssize_t __fuse_direct_write(struct fuse_io_priv *io,
				   const struct iovec *iov,
				   unsigned long nr_segs, loff_t *ppos)
{
	struct file *file = io->file;
	struct inode *inode = file_inode(file);
	size_t count = iov_length(iov, nr_segs);
	ssize_t res;

	res = generic_write_checks(file, ppos, &count, 0);
	if (!res)
		res = fuse_direct_io(io, iov, nr_segs, count, ppos, 1);

	fuse_invalidate_attr(inode);

	return res;
}

static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
				 size_t count, loff_t *ppos)
{
	struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count };
	struct inode *inode = file_inode(file);
	ssize_t res;
	struct fuse_io_priv io = { .async = 0, .file = file };

	if (is_bad_inode(inode))
		return -EIO;

	/* Don't allow parallel writes to the same file */
	mutex_lock(&inode->i_mutex);
	res = __fuse_direct_write(&io, &iov, 1, ppos);
	if (res > 0)
		fuse_write_update_size(inode, *ppos);
	mutex_unlock(&inode->i_mutex);

	return res;
}

static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
{
	__free_page(req->pages[0]);
	fuse_file_put(req->ff, false);
}

static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
{
	struct inode *inode = req->inode;
	struct fuse_inode *fi = get_fuse_inode(inode);
	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;

	list_del(&req->writepages_entry);
	dec_bdi_stat(bdi, BDI_WRITEBACK);
	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
	bdi_writeout_inc(bdi);
	wake_up(&fi->page_waitq);
}

/* Called under fc->lock, may release and reacquire it */
static void fuse_send_writepage(struct fuse_conn *fc, struct fuse_req *req)
__releases(fc->lock)
__acquires(fc->lock)
{
	struct fuse_inode *fi = get_fuse_inode(req->inode);
	loff_t size = i_size_read(req->inode);
	struct fuse_write_in *inarg = &req->misc.write.in;

	if (!fc->connected)
		goto out_free;

	if (inarg->offset + PAGE_CACHE_SIZE <= size) {
		inarg->size = PAGE_CACHE_SIZE;
	} else if (inarg->offset < size) {
		inarg->size = size & (PAGE_CACHE_SIZE - 1);
	} else {
		/* Got truncated off completely */
		goto out_free;
	}

	req->in.args[1].size = inarg->size;
	fi->writectr++;
	fuse_request_send_background_locked(fc, req);
	return;

 out_free:
	fuse_writepage_finish(fc, req);
	spin_unlock(&fc->lock);
	fuse_writepage_free(fc, req);
	fuse_put_request(fc, req);
	spin_lock(&fc->lock);
}

/*
 * If fi->writectr is positive (no truncate or fsync going on) send
 * all queued writepage requests.
 *
 * Called with fc->lock
 */
void fuse_flush_writepages(struct inode *inode)
__releases(fc->lock)
__acquires(fc->lock)
{
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);
	struct fuse_req *req;

	while (fi->writectr >= 0 && !list_empty(&fi->queued_writes)) {
		req = list_entry(fi->queued_writes.next, struct fuse_req, list);
		list_del_init(&req->list);
		fuse_send_writepage(fc, req);
	}
}

static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
{
	struct inode *inode = req->inode;
	struct fuse_inode *fi = get_fuse_inode(inode);

	mapping_set_error(inode->i_mapping, req->out.h.error);
	spin_lock(&fc->lock);
	fi->writectr--;
	fuse_writepage_finish(fc, req);
	spin_unlock(&fc->lock);
	fuse_writepage_free(fc, req);
}

static int fuse_writepage_locked(struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct inode *inode = mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);
	struct fuse_req *req;
	struct fuse_file *ff;
	struct page *tmp_page;

	set_page_writeback(page);

	req = fuse_request_alloc_nofs(1);
	if (!req)
		goto err;

	req->background = 1; /* writeback always goes to bg_queue */
	tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
	if (!tmp_page)
		goto err_free;

	spin_lock(&fc->lock);
	BUG_ON(list_empty(&fi->write_files));
	ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
	req->ff = fuse_file_get(ff);
	spin_unlock(&fc->lock);

	fuse_write_fill(req, ff, page_offset(page), 0);

	copy_highpage(tmp_page, page);
	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
	req->in.argpages = 1;
	req->num_pages = 1;
	req->pages[0] = tmp_page;
	req->page_descs[0].offset = 0;
	req->page_descs[0].length = PAGE_SIZE;
	req->end = fuse_writepage_end;
	req->inode = inode;

	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);

	spin_lock(&fc->lock);
	list_add(&req->writepages_entry, &fi->writepages);
	list_add_tail(&req->list, &fi->queued_writes);
	fuse_flush_writepages(inode);
	spin_unlock(&fc->lock);

	end_page_writeback(page);

	return 0;

err_free:
	fuse_request_free(req);
err:
	end_page_writeback(page);
	return -ENOMEM;
}
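
/*
 * [Editor's note, not part of the kernel source] fuse_writepage_locked()
 * copies the page into a freshly allocated tmp_page, queues the write and
 * then calls end_page_writeback() on the original page right away, before
 * the userspace server has acknowledged anything.  The temporary page is
 * accounted as BDI_WRITEBACK/NR_WRITEBACK_TEMP and is only undone in
 * fuse_writepage_end() (via fuse_writepage_finish() and
 * fuse_writepage_free()), which is why this file tracks in-flight writes
 * itself and callers such as fuse_launder_page() and fuse_page_mkwrite()
 * use fuse_wait_on_page_writeback() instead of the generic writeback bit.
 */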

static int fuse_writepage(struct page *page, struct writeback_control *wbc)
{
	int err;

	err = fuse_writepage_locked(page);
	unlock_page(page);

	return err;
}

static int fuse_launder_page(struct page *page)
{
	int err = 0;
	if (clear_page_dirty_for_io(page)) {
		struct inode *inode = page->mapping->host;
		err = fuse_writepage_locked(page);
		if (!err)
			fuse_wait_on_page_writeback(inode, page->index);
	}
	return err;
}

/*
 * Write back dirty pages now, because there may not be any suitable
 * open files later
 */
static void fuse_vma_close(struct vm_area_struct *vma)
{
	filemap_write_and_wait(vma->vm_file->f_mapping);
}

/*
 * Wait for writeback against this page to complete before allowing it
 * to be marked dirty again, and hence written back again, possibly
 * before the previous writepage completed.
 *
 * Block here, instead of in ->writepage(), so that the userspace fs
 * can only block processes actually operating on the filesystem.
 *
 * Otherwise unprivileged userspace fs would be able to block
 * unrelated:
 *
 * - page migration
 * - sync(2)
 * - try_to_free_pages() with order > PAGE_ALLOC_COSTLY_ORDER
 */
static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	/*
	 * Don't use page->mapping as it may become NULL from a
	 * concurrent truncate.
	 */
	struct inode *inode = vma->vm_file->f_mapping->host;

	fuse_wait_on_page_writeback(inode, page->index);
	return 0;
}

static const struct vm_operations_struct fuse_file_vm_ops = {
	.close		= fuse_vma_close,
	.fault		= filemap_fault,
	.page_mkwrite	= fuse_page_mkwrite,
	.remap_pages	= generic_file_remap_pages,
};

static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
		struct inode *inode = file_inode(file);
		struct fuse_conn *fc = get_fuse_conn(inode);
		struct fuse_inode *fi = get_fuse_inode(inode);
		struct fuse_file *ff = file->private_data;
		/*
		 * file may be written through mmap, so chain it onto the
		 * inodes's write_file list
		 */
		spin_lock(&fc->lock);
		if (list_empty(&ff->write_entry))
			list_add(&ff->write_entry, &fi->write_files);
		spin_unlock(&fc->lock);
	}
	file_accessed(file);
	vma->vm_ops = &fuse_file_vm_ops;
	return 0;
}

static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* Can't provide the coherency needed for MAP_SHARED */
	if (vma->vm_flags & VM_MAYSHARE)
		return -ENODEV;

	invalidate_inode_pages2(file->f_mapping);

	return generic_file_mmap(file, vma);
}

static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
				  struct file_lock *fl)
{
	switch (ffl->type) {
	case F_UNLCK:
		break;

	case F_RDLCK:
	case F_WRLCK:
		if (ffl->start > OFFSET_MAX || ffl->end > OFFSET_MAX ||
		    ffl->end < ffl->start)
			return -EIO;

		fl->fl_start = ffl->start;
		fl->fl_end = ffl->end;
		fl->fl_pid = ffl->pid;
		break;

	default:
		return -EIO;
	}
	fl->fl_type = ffl->type;
	return 0;
}

static void fuse_lk_fill(struct fuse_req *req, struct file *file,
			 const struct file_lock *fl, int opcode, pid_t pid,
			 int flock)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_file *ff = file->private_data;
	struct fuse_lk_in *arg = &req->misc.lk_in;

	arg->fh = ff->fh;
	arg->owner = fuse_lock_owner_id(fc, fl->fl_owner);
	arg->lk.start = fl->fl_start;
	arg->lk.end = fl->fl_end;
	arg->lk.type = fl->fl_type;
	arg->lk.pid = pid;
	if (flock)
		arg->lk_flags |= FUSE_LK_FLOCK;
	req->in.h.opcode = opcode;
	req->in.h.nodeid = get_node_id(inode);
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(*arg);
	req->in.args[0].value = arg;
}

static int fuse_getlk(struct file *file, struct file_lock *fl)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_req *req;
	struct fuse_lk_out outarg;
	int err;

	req = fuse_get_req_nopages(fc);
	if (IS_ERR(req))
		return PTR_ERR(req);

	fuse_lk_fill(req, file, fl, FUSE_GETLK, 0, 0);
	req->out.numargs = 1;
	req->out.args[0].size = sizeof(outarg);
	req->out.args[0].value = &outarg;
	fuse_request_send(fc, req);
	err = req->out.h.error;
	fuse_put_request(fc, req);
	if (!err)
		err = convert_fuse_file_lock(&outarg.lk, fl);

	return err;
}

static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_req *req;
	int opcode = (fl->fl_flags & FL_SLEEP) ? FUSE_SETLKW : FUSE_SETLK;
	pid_t pid = fl->fl_type != F_UNLCK ? current->tgid : 0;
	int err;

	if (fl->fl_lmops && fl->fl_lmops->lm_grant) {
		/* NLM needs asynchronous locks, which we don't support yet */
		return -ENOLCK;
	}

	/* Unlock on close is handled by the flush method */
	if (fl->fl_flags & FL_CLOSE)
		return 0;

	req = fuse_get_req_nopages(fc);
	if (IS_ERR(req))
		return PTR_ERR(req);

	fuse_lk_fill(req, file, fl, opcode, pid, flock);
	fuse_request_send(fc, req);
	err = req->out.h.error;
	/* locking is restartable */
	if (err == -EINTR)
		err = -ERESTARTSYS;
	fuse_put_request(fc, req);
	return err;
}

static int fuse_file_lock(struct file *file, int cmd, struct file_lock *fl)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	int err;

	if (cmd == F_CANCELLK) {
		err = 0;
	} else if (cmd == F_GETLK) {
		if (fc->no_lock) {
			posix_test_lock(file, fl);
			err = 0;
		} else
			err = fuse_getlk(file, fl);
	} else {
		if (fc->no_lock)
			err = posix_lock_file(file, fl, NULL);
		else
			err = fuse_setlk(file, fl, 0);
	}
	return err;
}

static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	int err;

	if (fc->no_flock) {
		err = flock_lock_file_wait(file, fl);
	} else {
		struct fuse_file *ff = file->private_data;

		/* emulate flock with POSIX locks */
		fl->fl_owner = (fl_owner_t) file;
		ff->flock = true;
		err = fuse_setlk(file, fl, 1);
	}

	return err;
}
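
/*
 * [Editor's note, not part of the kernel source] When the server does handle
 * flock (fc->no_flock is clear), flock() is emulated through the POSIX lock
 * path: fl_owner is set to the struct file itself so that each open file
 * description acts as a distinct lock owner, matching flock() semantics, and
 * fuse_lk_fill() tags the request with FUSE_LK_FLOCK so the server can tell
 * the two APIs apart.  ff->flock additionally marks the file as flock-locked
 * for later release handling.
 */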

static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
{
	struct inode *inode = mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_req *req;
	struct fuse_bmap_in inarg;
	struct fuse_bmap_out outarg;
	int err;

	if (!inode->i_sb->s_bdev || fc->no_bmap)
		return 0;

	req = fuse_get_req_nopages(fc);
	if (IS_ERR(req))
		return 0;

	memset(&inarg, 0, sizeof(inarg));
	inarg.block = block;
	inarg.blocksize = inode->i_sb->s_blocksize;
	req->in.h.opcode = FUSE_BMAP;
	req->in.h.nodeid = get_node_id(inode);
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(inarg);
	req->in.args[0].value = &inarg;
	req->out.numargs = 1;
	req->out.args[0].size = sizeof(outarg);
	req->out.args[0].value = &outarg;
	fuse_request_send(fc, req);
	err = req->out.h.error;
	fuse_put_request(fc, req);
	if (err == -ENOSYS)
		fc->no_bmap = 1;

	return err ? 0 : outarg.block;
}

static loff_t fuse_file_llseek(struct file *file, loff_t offset, int whence)
{
	loff_t retval;
	struct inode *inode = file_inode(file);

	/* No i_mutex protection necessary for SEEK_CUR and SEEK_SET */
	if (whence == SEEK_CUR || whence == SEEK_SET)
		return generic_file_llseek(file, offset, whence);

	mutex_lock(&inode->i_mutex);
	retval = fuse_update_attributes(inode, NULL, file, NULL);
	if (!retval)
		retval = generic_file_llseek(file, offset, whence);
	mutex_unlock(&inode->i_mutex);

	return retval;
}

static int fuse_ioctl_copy_user(struct page **pages, struct iovec *iov,
				unsigned int nr_segs, size_t bytes, bool to_user)
{
	struct iov_iter ii;
	int page_idx = 0;

	if (!bytes)
		return 0;

	iov_iter_init(&ii, iov, nr_segs, bytes, 0);

	while (iov_iter_count(&ii)) {
		struct page *page = pages[page_idx++];
		size_t todo = min_t(size_t, PAGE_SIZE, iov_iter_count(&ii));
		void *kaddr;

		kaddr = kmap(page);

		while (todo) {
			char __user *uaddr = ii.iov->iov_base + ii.iov_offset;
			size_t iov_len = ii.iov->iov_len - ii.iov_offset;
			size_t copy = min(todo, iov_len);
			size_t left;

			if (!to_user)
				left = copy_from_user(kaddr, uaddr, copy);
			else
				left = copy_to_user(uaddr, kaddr, copy);

			if (unlikely(left))
				return -EFAULT;

			iov_iter_advance(&ii, copy);
			todo -= copy;
			kaddr += copy;
		}

		kunmap(page);
	}

	return 0;
}

/*
 * CUSE servers compiled on 32bit broke on 64bit kernels because the
 * ABI was defined to be 'struct iovec' which is different on 32bit
 * and 64bit.  Fortunately we can determine which structure the server
 * used from the size of the reply.
 */
static int fuse_copy_ioctl_iovec_old(struct iovec *dst, void *src,
				     size_t transferred, unsigned count,
				     bool is_compat)
{
#ifdef CONFIG_COMPAT
	if (count * sizeof(struct compat_iovec) == transferred) {
		struct compat_iovec *ciov = src;
		unsigned i;

		/*
		 * With this interface a 32bit server cannot support
		 * non-compat (i.e. ones coming from 64bit apps) ioctl
		 * requests
		 */
		if (!is_compat)
			return -EINVAL;

		for (i = 0; i < count; i++) {
			dst[i].iov_base = compat_ptr(ciov[i].iov_base);
			dst[i].iov_len = ciov[i].iov_len;
		}
		return 0;
	}
#endif

	if (count * sizeof(struct iovec) != transferred)
		return -EIO;

	memcpy(dst, src, transferred);
	return 0;
}

/* Make sure iov_length() won't overflow */
static int fuse_verify_ioctl_iov(struct iovec *iov, size_t count)
{
	size_t n;
	u32 max = FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT;

	for (n = 0; n < count; n++, iov++) {
		if (iov->iov_len > (size_t) max)
			return -ENOMEM;
		max -= iov->iov_len;
	}
	return 0;
}
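
/*
 * [Editor's note, not part of the kernel source] The check above bounds the
 * sum of the iov_len values without ever computing that sum: "max" starts at
 * FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT and every segment must fit in what is
 * still left, so the total length can never exceed that limit and the later
 * iov_length() additions in fuse_do_ioctl() cannot overflow.
 */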

static int fuse_copy_ioctl_iovec(struct fuse_conn *fc, struct iovec *dst,
				 void *src, size_t transferred, unsigned count,
				 bool is_compat)
{
	unsigned i;
	struct fuse_ioctl_iovec *fiov = src;

	if (fc->minor < 16) {
		return fuse_copy_ioctl_iovec_old(dst, src, transferred,
						 count, is_compat);
	}

	if (count * sizeof(struct fuse_ioctl_iovec) != transferred)
		return -EIO;

	for (i = 0; i < count; i++) {
		/* Did the server supply an inappropriate value? */
		if (fiov[i].base != (unsigned long) fiov[i].base ||
		    fiov[i].len != (unsigned long) fiov[i].len)
			return -EIO;

		dst[i].iov_base = (void __user *) (unsigned long) fiov[i].base;
		dst[i].iov_len = (size_t) fiov[i].len;

#ifdef CONFIG_COMPAT
		if (is_compat &&
		    (ptr_to_compat(dst[i].iov_base) != fiov[i].base ||
		     (compat_size_t) dst[i].iov_len != fiov[i].len))
			return -EIO;
#endif
	}

	return 0;
}


/*
 * For ioctls, there is no generic way to determine how much memory
 * needs to be read and/or written.  Furthermore, ioctls are allowed
 * to dereference the passed pointer, so the parameter requires deep
 * copying but FUSE has no idea whatsoever about what to copy in or
 * out.
 *
 * This is solved by allowing FUSE server to retry ioctl with
 * necessary in/out iovecs.  Let's assume the ioctl implementation
 * needs to read in the following structure.
 *
 * struct a {
 *	char	*buf;
 *	size_t	buflen;
 * }
 *
 * On the first callout to FUSE server, inarg->in_size and
 * inarg->out_size will be NULL; then, the server completes the ioctl
 * with FUSE_IOCTL_RETRY set in out->flags, out->in_iovs set to 1 and
 * the actual iov array to
 *
 * { { .iov_base = inarg.arg, .iov_len = sizeof(struct a) } }
 *
 * which tells FUSE to copy in the requested area and retry the ioctl.
 * On the second round, the server has access to the structure and
 * from that it can tell what to look for next, so on the invocation,
 * it sets FUSE_IOCTL_RETRY, out->in_iovs to 2 and iov array to
 *
 * { { .iov_base = inarg.arg, .iov_len = sizeof(struct a) },
 *   { .iov_base = a.buf, .iov_len = a.buflen } }
 *
 * FUSE will copy both struct a and the pointed buffer from the
 * process doing the ioctl and retry ioctl with both struct a and the
 * buffer.
 *
 * This time, FUSE server has everything it needs and completes ioctl
 * without FUSE_IOCTL_RETRY which finishes the ioctl call.
 *
 * Copying data out works the same way.
 *
 * Note that if FUSE_IOCTL_UNRESTRICTED is clear, the kernel
 * automatically initializes in and out iovs by decoding @cmd with
 * _IOC_* macros and the server is not allowed to request RETRY.  This
 * limits ioctl data transfers to well-formed ioctls and is the forced
 * behavior for all FUSE servers.
 */
long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
		   unsigned int flags)
{
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = ff->fc;
	struct fuse_ioctl_in inarg = {
		.fh = ff->fh,
		.cmd = cmd,
		.arg = arg,
		.flags = flags
	};
	struct fuse_ioctl_out outarg;
	struct fuse_req *req = NULL;
	struct page **pages = NULL;
	struct iovec *iov_page = NULL;
	struct iovec *in_iov = NULL, *out_iov = NULL;
	unsigned int in_iovs = 0, out_iovs = 0, num_pages = 0, max_pages;
	size_t in_size, out_size, transferred;
	int err;

#if BITS_PER_LONG == 32
	inarg.flags |= FUSE_IOCTL_32BIT;
#else
	if (flags & FUSE_IOCTL_COMPAT)
		inarg.flags |= FUSE_IOCTL_32BIT;
#endif

	/* assume all the iovs returned by client always fits in a page */
	BUILD_BUG_ON(sizeof(struct fuse_ioctl_iovec) * FUSE_IOCTL_MAX_IOV > PAGE_SIZE);

	err = -ENOMEM;
	pages = kcalloc(FUSE_MAX_PAGES_PER_REQ, sizeof(pages[0]), GFP_KERNEL);
	iov_page = (struct iovec *) __get_free_page(GFP_KERNEL);
	if (!pages || !iov_page)
		goto out;

	/*
	 * If restricted, initialize IO parameters as encoded in @cmd.
	 * RETRY from server is not allowed.
	 */
	if (!(flags & FUSE_IOCTL_UNRESTRICTED)) {
		struct iovec *iov = iov_page;

		iov->iov_base = (void __user *)arg;
		iov->iov_len = _IOC_SIZE(cmd);

		if (_IOC_DIR(cmd) & _IOC_WRITE) {
			in_iov = iov;
			in_iovs = 1;
		}

		if (_IOC_DIR(cmd) & _IOC_READ) {
			out_iov = iov;
			out_iovs = 1;
		}
	}

 retry:
	inarg.in_size = in_size = iov_length(in_iov, in_iovs);
	inarg.out_size = out_size = iov_length(out_iov, out_iovs);

	/*
	 * Out data can be used either for actual out data or iovs,
	 * make sure there always is at least one page.
	 */
	out_size = max_t(size_t, out_size, PAGE_SIZE);
	max_pages = DIV_ROUND_UP(max(in_size, out_size), PAGE_SIZE);

	/* make sure there are enough buffer pages and init request with them */
	err = -ENOMEM;
	if (max_pages > FUSE_MAX_PAGES_PER_REQ)
		goto out;
	while (num_pages < max_pages) {
		pages[num_pages] = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);
		if (!pages[num_pages])
			goto out;
		num_pages++;
	}

	req = fuse_get_req(fc, num_pages);
	if (IS_ERR(req)) {
		err = PTR_ERR(req);
		req = NULL;
		goto out;
	}
	memcpy(req->pages, pages, sizeof(req->pages[0]) * num_pages);
	req->num_pages = num_pages;
	fuse_page_descs_length_init(req, 0, req->num_pages);

	/* okay, let's send it to the client */
	req->in.h.opcode = FUSE_IOCTL;
	req->in.h.nodeid = ff->nodeid;
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(inarg);
	req->in.args[0].value = &inarg;
	if (in_size) {
		req->in.numargs++;
		req->in.args[1].size = in_size;
		req->in.argpages = 1;

		err = fuse_ioctl_copy_user(pages, in_iov, in_iovs, in_size,
					   false);
		if (err)
			goto out;
	}

	req->out.numargs = 2;
	req->out.args[0].size = sizeof(outarg);
	req->out.args[0].value = &outarg;
	req->out.args[1].size = out_size;
	req->out.argpages = 1;
	req->out.argvar = 1;

	fuse_request_send(fc, req);
	err = req->out.h.error;
	transferred = req->out.args[1].size;
	fuse_put_request(fc, req);
	req = NULL;
	if (err)
		goto out;

	/* did it ask for retry? */
	if (outarg.flags & FUSE_IOCTL_RETRY) {
		void *vaddr;

		/* no retry if in restricted mode */
		err = -EIO;
		if (!(flags & FUSE_IOCTL_UNRESTRICTED))
			goto out;

		in_iovs = outarg.in_iovs;
		out_iovs = outarg.out_iovs;

		/*
		 * Make sure things are in boundary, separate checks
		 * are to protect against overflow.
		 */
		err = -ENOMEM;
		if (in_iovs > FUSE_IOCTL_MAX_IOV ||
		    out_iovs > FUSE_IOCTL_MAX_IOV ||
		    in_iovs + out_iovs > FUSE_IOCTL_MAX_IOV)
			goto out;

		vaddr = kmap_atomic(pages[0]);
		err = fuse_copy_ioctl_iovec(fc, iov_page, vaddr,
					    transferred, in_iovs + out_iovs,
					    (flags & FUSE_IOCTL_COMPAT) != 0);
		kunmap_atomic(vaddr);
		if (err)
			goto out;

		in_iov = iov_page;
		out_iov = in_iov + in_iovs;

		err = fuse_verify_ioctl_iov(in_iov, in_iovs);
		if (err)
			goto out;

		err = fuse_verify_ioctl_iov(out_iov, out_iovs);
		if (err)
			goto out;

		goto retry;
	}

	err = -EIO;
	if (transferred > inarg.out_size)
		goto out;

	err = fuse_ioctl_copy_user(pages, out_iov, out_iovs, transferred, true);
 out:
	if (req)
		fuse_put_request(fc, req);
	free_page((unsigned long) iov_page);
	while (num_pages)
		__free_page(pages[--num_pages]);
	kfree(pages);

	return err ? err : outarg.result;
}
EXPORT_SYMBOL_GPL(fuse_do_ioctl);
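
/*
 * [Editor's sketch, not part of the kernel source or of any real FUSE/libfuse
 * API] The retry protocol documented above fuse_do_ioctl(), restated as the
 * iovec arrays a server could hand back for the hypothetical "struct a"
 * ioctl from that comment.  All names below (struct a, buf, buflen, arg) are
 * taken from the comment and exist only for illustration; this is
 * userspace-style C, not kernel code.
 */
#if 0	/* illustration only, never compiled */
#include <stddef.h>
#include <sys/uio.h>

struct a {
	char	*buf;
	size_t	buflen;
};

/* Round 1: only the raw ioctl argument pointer is known; ask for struct a. */
static size_t retry_round1(struct iovec *iov, void *arg)
{
	iov[0].iov_base = arg;
	iov[0].iov_len  = sizeof(struct a);
	return 1;			/* becomes out->in_iovs */
}

/* Round 2: struct a has been copied in; also ask for the buffer it names. */
static size_t retry_round2(struct iovec *iov, void *arg, const struct a *a)
{
	iov[0].iov_base = arg;
	iov[0].iov_len  = sizeof(struct a);
	iov[1].iov_base = a->buf;
	iov[1].iov_len  = a->buflen;
	return 2;			/* becomes out->in_iovs */
}

/* Round 3: everything is available; reply without FUSE_IOCTL_RETRY. */
#endif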
2213 2211
2214 long fuse_ioctl_common(struct file *file, unsigned int cmd, 2212 long fuse_ioctl_common(struct file *file, unsigned int cmd,
2215 unsigned long arg, unsigned int flags) 2213 unsigned long arg, unsigned int flags)
2216 { 2214 {
2217 struct inode *inode = file_inode(file); 2215 struct inode *inode = file_inode(file);
2218 struct fuse_conn *fc = get_fuse_conn(inode); 2216 struct fuse_conn *fc = get_fuse_conn(inode);
2219 2217
2220 if (!fuse_allow_current_process(fc)) 2218 if (!fuse_allow_current_process(fc))
2221 return -EACCES; 2219 return -EACCES;
2222 2220
2223 if (is_bad_inode(inode)) 2221 if (is_bad_inode(inode))
2224 return -EIO; 2222 return -EIO;
2225 2223
2226 return fuse_do_ioctl(file, cmd, arg, flags); 2224 return fuse_do_ioctl(file, cmd, arg, flags);
2227 } 2225 }
2228 2226
2229 static long fuse_file_ioctl(struct file *file, unsigned int cmd, 2227 static long fuse_file_ioctl(struct file *file, unsigned int cmd,
2230 unsigned long arg) 2228 unsigned long arg)
2231 { 2229 {
2232 return fuse_ioctl_common(file, cmd, arg, 0); 2230 return fuse_ioctl_common(file, cmd, arg, 0);
2233 } 2231 }
2234 2232
2235 static long fuse_file_compat_ioctl(struct file *file, unsigned int cmd, 2233 static long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
2236 unsigned long arg) 2234 unsigned long arg)
2237 { 2235 {
2238 return fuse_ioctl_common(file, cmd, arg, FUSE_IOCTL_COMPAT); 2236 return fuse_ioctl_common(file, cmd, arg, FUSE_IOCTL_COMPAT);
2239 } 2237 }
2240 2238
2241 /* 2239 /*
2242 * All files which have been polled are linked to RB tree 2240 * All files which have been polled are linked to RB tree
2243 * fuse_conn->polled_files which is indexed by kh. Walk the tree and 2241 * fuse_conn->polled_files which is indexed by kh. Walk the tree and
2244 * find the matching one. 2242 * find the matching one.
2245 */ 2243 */
2246 static struct rb_node **fuse_find_polled_node(struct fuse_conn *fc, u64 kh, 2244 static struct rb_node **fuse_find_polled_node(struct fuse_conn *fc, u64 kh,
2247 struct rb_node **parent_out) 2245 struct rb_node **parent_out)
2248 { 2246 {
2249 struct rb_node **link = &fc->polled_files.rb_node; 2247 struct rb_node **link = &fc->polled_files.rb_node;
2250 struct rb_node *last = NULL; 2248 struct rb_node *last = NULL;
2251 2249
2252 while (*link) { 2250 while (*link) {
2253 struct fuse_file *ff; 2251 struct fuse_file *ff;
2254 2252
2255 last = *link; 2253 last = *link;
2256 ff = rb_entry(last, struct fuse_file, polled_node); 2254 ff = rb_entry(last, struct fuse_file, polled_node);
2257 2255
2258 if (kh < ff->kh) 2256 if (kh < ff->kh)
2259 link = &last->rb_left; 2257 link = &last->rb_left;
2260 else if (kh > ff->kh) 2258 else if (kh > ff->kh)
2261 link = &last->rb_right; 2259 link = &last->rb_right;
2262 else 2260 else
2263 return link; 2261 return link;
2264 } 2262 }
2265 2263
2266 if (parent_out) 2264 if (parent_out)
2267 *parent_out = last; 2265 *parent_out = last;
2268 return link; 2266 return link;
2269 } 2267 }
2270 2268
2271 /* 2269 /*
2272 * The file is about to be polled. Make sure it's on the polled_files 2270 * The file is about to be polled. Make sure it's on the polled_files
2273 * RB tree. Note that files once added to the polled_files tree are 2271 * RB tree. Note that files once added to the polled_files tree are
2274 * not removed before the file is released. This is because a file 2272 * not removed before the file is released. This is because a file
2275 * polled once is likely to be polled again. 2273 * polled once is likely to be polled again.
2276 */ 2274 */
2277 static void fuse_register_polled_file(struct fuse_conn *fc, 2275 static void fuse_register_polled_file(struct fuse_conn *fc,
2278 struct fuse_file *ff) 2276 struct fuse_file *ff)
2279 { 2277 {
2280 spin_lock(&fc->lock); 2278 spin_lock(&fc->lock);
2281 if (RB_EMPTY_NODE(&ff->polled_node)) { 2279 if (RB_EMPTY_NODE(&ff->polled_node)) {
2282 struct rb_node **link, *parent; 2280 struct rb_node **link, *parent;
2283 2281
2284 link = fuse_find_polled_node(fc, ff->kh, &parent); 2282 link = fuse_find_polled_node(fc, ff->kh, &parent);
2285 BUG_ON(*link); 2283 BUG_ON(*link);
2286 rb_link_node(&ff->polled_node, parent, link); 2284 rb_link_node(&ff->polled_node, parent, link);
2287 rb_insert_color(&ff->polled_node, &fc->polled_files); 2285 rb_insert_color(&ff->polled_node, &fc->polled_files);
2288 } 2286 }
2289 spin_unlock(&fc->lock); 2287 spin_unlock(&fc->lock);
2290 } 2288 }
2291 2289
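fuse_find_polled_node() above follows the usual kernel rbtree idiom: walk down from the root while remembering both the parent and the link slot where a new node would hang, so the caller can either return the match or insert at exactly that slot with rb_link_node() and rb_insert_color(), as fuse_register_polled_file() does. A kernel-style sketch of the same idiom for a hypothetical structure keyed by an integer (demo_node and demo_insert are made-up names, not part of this commit):

	#include <linux/rbtree.h>
	#include <linux/types.h>

	struct demo_node {
		struct rb_node	link;
		u64		key;
	};

	/* Insert @new into @root unless a node with the same key already
	 * exists; returns the existing node in that case, NULL on success. */
	static struct demo_node *demo_insert(struct rb_root *root,
					     struct demo_node *new)
	{
		struct rb_node **p = &root->rb_node, *parent = NULL;

		while (*p) {
			struct demo_node *cur = rb_entry(*p, struct demo_node, link);

			parent = *p;
			if (new->key < cur->key)
				p = &(*p)->rb_left;
			else if (new->key > cur->key)
				p = &(*p)->rb_right;
			else
				return cur;	/* already present */
		}
		rb_link_node(&new->link, parent, p);
		rb_insert_color(&new->link, root);
		return NULL;
	}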
2292 unsigned fuse_file_poll(struct file *file, poll_table *wait) 2290 unsigned fuse_file_poll(struct file *file, poll_table *wait)
2293 { 2291 {
2294 struct fuse_file *ff = file->private_data; 2292 struct fuse_file *ff = file->private_data;
2295 struct fuse_conn *fc = ff->fc; 2293 struct fuse_conn *fc = ff->fc;
2296 struct fuse_poll_in inarg = { .fh = ff->fh, .kh = ff->kh }; 2294 struct fuse_poll_in inarg = { .fh = ff->fh, .kh = ff->kh };
2297 struct fuse_poll_out outarg; 2295 struct fuse_poll_out outarg;
2298 struct fuse_req *req; 2296 struct fuse_req *req;
2299 int err; 2297 int err;
2300 2298
2301 if (fc->no_poll) 2299 if (fc->no_poll)
2302 return DEFAULT_POLLMASK; 2300 return DEFAULT_POLLMASK;
2303 2301
2304 poll_wait(file, &ff->poll_wait, wait); 2302 poll_wait(file, &ff->poll_wait, wait);
2305 inarg.events = (__u32)poll_requested_events(wait); 2303 inarg.events = (__u32)poll_requested_events(wait);
2306 2304
2307 /* 2305 /*
2308 * Ask for notification iff there's someone waiting for it. 2306 * Ask for notification iff there's someone waiting for it.
2309 * The client may ignore the flag and always notify. 2307 * The client may ignore the flag and always notify.
2310 */ 2308 */
2311 if (waitqueue_active(&ff->poll_wait)) { 2309 if (waitqueue_active(&ff->poll_wait)) {
2312 inarg.flags |= FUSE_POLL_SCHEDULE_NOTIFY; 2310 inarg.flags |= FUSE_POLL_SCHEDULE_NOTIFY;
2313 fuse_register_polled_file(fc, ff); 2311 fuse_register_polled_file(fc, ff);
2314 } 2312 }
2315 2313
2316 req = fuse_get_req_nopages(fc); 2314 req = fuse_get_req_nopages(fc);
2317 if (IS_ERR(req)) 2315 if (IS_ERR(req))
2318 return POLLERR; 2316 return POLLERR;
2319 2317
2320 req->in.h.opcode = FUSE_POLL; 2318 req->in.h.opcode = FUSE_POLL;
2321 req->in.h.nodeid = ff->nodeid; 2319 req->in.h.nodeid = ff->nodeid;
2322 req->in.numargs = 1; 2320 req->in.numargs = 1;
2323 req->in.args[0].size = sizeof(inarg); 2321 req->in.args[0].size = sizeof(inarg);
2324 req->in.args[0].value = &inarg; 2322 req->in.args[0].value = &inarg;
2325 req->out.numargs = 1; 2323 req->out.numargs = 1;
2326 req->out.args[0].size = sizeof(outarg); 2324 req->out.args[0].size = sizeof(outarg);
2327 req->out.args[0].value = &outarg; 2325 req->out.args[0].value = &outarg;
2328 fuse_request_send(fc, req); 2326 fuse_request_send(fc, req);
2329 err = req->out.h.error; 2327 err = req->out.h.error;
2330 fuse_put_request(fc, req); 2328 fuse_put_request(fc, req);
2331 2329
2332 if (!err) 2330 if (!err)
2333 return outarg.revents; 2331 return outarg.revents;
2334 if (err == -ENOSYS) { 2332 if (err == -ENOSYS) {
2335 fc->no_poll = 1; 2333 fc->no_poll = 1;
2336 return DEFAULT_POLLMASK; 2334 return DEFAULT_POLLMASK;
2337 } 2335 }
2338 return POLLERR; 2336 return POLLERR;
2339 } 2337 }
2340 EXPORT_SYMBOL_GPL(fuse_file_poll); 2338 EXPORT_SYMBOL_GPL(fuse_file_poll);
2341 2339
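The -ENOSYS handling above is the FUSE "probe once, remember forever" idiom: the first time the server reports the opcode as unimplemented, a connection flag (fc->no_poll here, fc->no_fallocate further down) is set and later callers fail fast without another round trip. A stand-alone sketch of the same idiom with a hypothetical connection flag and a stubbed server call (all names below are invented for illustration):

	#include <errno.h>
	#include <stdbool.h>
	#include <stdio.h>

	/* Hypothetical connection state mirroring fc->no_poll / fc->no_fallocate. */
	struct conn {
		bool no_frobnicate;		/* set once the server reports ENOSYS */
	};

	/* Stand-in for the request round trip; pretend the server lacks the op. */
	static int server_frobnicate(struct conn *c)
	{
		(void)c;
		return -ENOSYS;
	}

	static int do_frobnicate(struct conn *c)
	{
		int err;

		if (c->no_frobnicate)
			return -EOPNOTSUPP;	/* cached: skip the round trip */

		err = server_frobnicate(c);
		if (err == -ENOSYS) {		/* learn it once, remember it */
			c->no_frobnicate = true;
			err = -EOPNOTSUPP;
		}
		return err;
	}

	int main(void)
	{
		struct conn c = { .no_frobnicate = false };

		printf("%d %d\n", do_frobnicate(&c), do_frobnicate(&c));
		return 0;
	}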
2342 /* 2340 /*
2343 * This is called from fuse_handle_notify() on FUSE_NOTIFY_POLL and 2341 * This is called from fuse_handle_notify() on FUSE_NOTIFY_POLL and
2344 * wakes up the poll waiters. 2342 * wakes up the poll waiters.
2345 */ 2343 */
2346 int fuse_notify_poll_wakeup(struct fuse_conn *fc, 2344 int fuse_notify_poll_wakeup(struct fuse_conn *fc,
2347 struct fuse_notify_poll_wakeup_out *outarg) 2345 struct fuse_notify_poll_wakeup_out *outarg)
2348 { 2346 {
2349 u64 kh = outarg->kh; 2347 u64 kh = outarg->kh;
2350 struct rb_node **link; 2348 struct rb_node **link;
2351 2349
2352 spin_lock(&fc->lock); 2350 spin_lock(&fc->lock);
2353 2351
2354 link = fuse_find_polled_node(fc, kh, NULL); 2352 link = fuse_find_polled_node(fc, kh, NULL);
2355 if (*link) { 2353 if (*link) {
2356 struct fuse_file *ff; 2354 struct fuse_file *ff;
2357 2355
2358 ff = rb_entry(*link, struct fuse_file, polled_node); 2356 ff = rb_entry(*link, struct fuse_file, polled_node);
2359 wake_up_interruptible_sync(&ff->poll_wait); 2357 wake_up_interruptible_sync(&ff->poll_wait);
2360 } 2358 }
2361 2359
2362 spin_unlock(&fc->lock); 2360 spin_unlock(&fc->lock);
2363 return 0; 2361 return 0;
2364 } 2362 }
2365 2363
2366 static void fuse_do_truncate(struct file *file) 2364 static void fuse_do_truncate(struct file *file)
2367 { 2365 {
2368 struct inode *inode = file->f_mapping->host; 2366 struct inode *inode = file->f_mapping->host;
2369 struct iattr attr; 2367 struct iattr attr;
2370 2368
2371 attr.ia_valid = ATTR_SIZE; 2369 attr.ia_valid = ATTR_SIZE;
2372 attr.ia_size = i_size_read(inode); 2370 attr.ia_size = i_size_read(inode);
2373 2371
2374 attr.ia_file = file; 2372 attr.ia_file = file;
2375 attr.ia_valid |= ATTR_FILE; 2373 attr.ia_valid |= ATTR_FILE;
2376 2374
2377 fuse_do_setattr(inode, &attr, file); 2375 fuse_do_setattr(inode, &attr, file);
2378 } 2376 }
2379 2377
2380 static inline loff_t fuse_round_up(loff_t off) 2378 static inline loff_t fuse_round_up(loff_t off)
2381 { 2379 {
2382 return round_up(off, FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT); 2380 return round_up(off, FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT);
2383 } 2381 }
2384 2382
2385 static ssize_t 2383 static ssize_t
2386 fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, 2384 fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
2387 loff_t offset, unsigned long nr_segs) 2385 loff_t offset, unsigned long nr_segs)
2388 { 2386 {
2389 ssize_t ret = 0; 2387 ssize_t ret = 0;
2390 struct file *file = iocb->ki_filp; 2388 struct file *file = iocb->ki_filp;
2391 struct fuse_file *ff = file->private_data; 2389 struct fuse_file *ff = file->private_data;
2392 bool async_dio = ff->fc->async_dio; 2390 bool async_dio = ff->fc->async_dio;
2393 loff_t pos = 0; 2391 loff_t pos = 0;
2394 struct inode *inode; 2392 struct inode *inode;
2395 loff_t i_size; 2393 loff_t i_size;
2396 size_t count = iov_length(iov, nr_segs); 2394 size_t count = iov_length(iov, nr_segs);
2397 struct fuse_io_priv *io; 2395 struct fuse_io_priv *io;
2398 2396
2399 pos = offset; 2397 pos = offset;
2400 inode = file->f_mapping->host; 2398 inode = file->f_mapping->host;
2401 i_size = i_size_read(inode); 2399 i_size = i_size_read(inode);
2402 2400
2403 /* optimization for short read */ 2401 /* optimization for short read */
2404 if (async_dio && rw != WRITE && offset + count > i_size) { 2402 if (async_dio && rw != WRITE && offset + count > i_size) {
2405 if (offset >= i_size) 2403 if (offset >= i_size)
2406 return 0; 2404 return 0;
2407 count = min_t(loff_t, count, fuse_round_up(i_size - offset)); 2405 count = min_t(loff_t, count, fuse_round_up(i_size - offset));
2408 } 2406 }
2409 2407
2410 io = kmalloc(sizeof(struct fuse_io_priv), GFP_KERNEL); 2408 io = kmalloc(sizeof(struct fuse_io_priv), GFP_KERNEL);
2411 if (!io) 2409 if (!io)
2412 return -ENOMEM; 2410 return -ENOMEM;
2413 spin_lock_init(&io->lock); 2411 spin_lock_init(&io->lock);
2414 io->reqs = 1; 2412 io->reqs = 1;
2415 io->bytes = -1; 2413 io->bytes = -1;
2416 io->size = 0; 2414 io->size = 0;
2417 io->offset = offset; 2415 io->offset = offset;
2418 io->write = (rw == WRITE); 2416 io->write = (rw == WRITE);
2419 io->err = 0; 2417 io->err = 0;
2420 io->file = file; 2418 io->file = file;
2421 /* 2419 /*
2422 * By default, we want to optimize all I/Os with async request 2420 * By default, we want to optimize all I/Os with async request
2423 * submission to the client filesystem if supported. 2421 * submission to the client filesystem if supported.
2424 */ 2422 */
2425 io->async = async_dio; 2423 io->async = async_dio;
2426 io->iocb = iocb; 2424 io->iocb = iocb;
2427 2425
2428 /* 2426 /*
2429 * We cannot asynchronously extend the size of a file. We have no method 2427 * We cannot asynchronously extend the size of a file. We have no method
2430 * to wait on real async I/O requests, so we must submit this request 2428 * to wait on real async I/O requests, so we must submit this request
2431 * synchronously. 2429 * synchronously.
2432 */ 2430 */
2433 if (!is_sync_kiocb(iocb) && (offset + count > i_size) && rw == WRITE) 2431 if (!is_sync_kiocb(iocb) && (offset + count > i_size) && rw == WRITE)
2434 io->async = false; 2432 io->async = false;
2435 2433
2436 if (rw == WRITE) 2434 if (rw == WRITE)
2437 ret = __fuse_direct_write(io, iov, nr_segs, &pos); 2435 ret = __fuse_direct_write(io, iov, nr_segs, &pos);
2438 else 2436 else
2439 ret = __fuse_direct_read(io, iov, nr_segs, &pos, count); 2437 ret = __fuse_direct_read(io, iov, nr_segs, &pos, count);
2440 2438
2441 if (io->async) { 2439 if (io->async) {
2442 fuse_aio_complete(io, ret < 0 ? ret : 0, -1); 2440 fuse_aio_complete(io, ret < 0 ? ret : 0, -1);
2443 2441
2444 /* we have a non-extending, async request, so return */ 2442 /* we have a non-extending, async request, so return */
2445 if (!is_sync_kiocb(iocb)) 2443 if (!is_sync_kiocb(iocb))
2446 return -EIOCBQUEUED; 2444 return -EIOCBQUEUED;
2447 2445
2448 ret = wait_on_sync_kiocb(iocb); 2446 ret = wait_on_sync_kiocb(iocb);
2449 } else { 2447 } else {
2450 kfree(io); 2448 kfree(io);
2451 } 2449 }
2452 2450
2453 if (rw == WRITE) { 2451 if (rw == WRITE) {
2454 if (ret > 0) 2452 if (ret > 0)
2455 fuse_write_update_size(inode, pos); 2453 fuse_write_update_size(inode, pos);
2456 else if (ret < 0 && offset + count > i_size) 2454 else if (ret < 0 && offset + count > i_size)
2457 fuse_do_truncate(file); 2455 fuse_do_truncate(file);
2458 } 2456 }
2459 2457
2460 return ret; 2458 return ret;
2461 } 2459 }
2462 2460
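On the short-read optimization above: the transfer is clamped to fuse_round_up(i_size - offset), i.e. to end-of-file rounded up to a whole request. With 4 KiB pages and 32 pages per request the rounding unit would be 128 KiB, so a 1 MiB read starting 5 KiB before EOF gets clamped to 128 KiB. A small stand-alone sketch of that arithmetic (both constants are assumptions for illustration, not read from the headers):

	#include <stdio.h>

	#define PAGE_SIZE		4096UL	/* assumed */
	#define MAX_PAGES_PER_REQ	32UL	/* assumed */

	static unsigned long round_up_to(unsigned long x, unsigned long unit)
	{
		return ((x + unit - 1) / unit) * unit;
	}

	int main(void)
	{
		unsigned long unit = MAX_PAGES_PER_REQ * PAGE_SIZE;	/* 128 KiB */
		unsigned long count = 1024 * 1024;			/* requested read */
		unsigned long to_eof = 5 * 1024;			/* i_size - offset */
		unsigned long clamped = round_up_to(to_eof, unit);

		if (clamped < count)
			count = clamped;
		printf("count clamped to %lu bytes\n", count);		/* 131072 */
		return 0;
	}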
2463 static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, 2461 static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
2464 loff_t length) 2462 loff_t length)
2465 { 2463 {
2466 struct fuse_file *ff = file->private_data; 2464 struct fuse_file *ff = file->private_data;
2467 struct inode *inode = file->f_inode; 2465 struct inode *inode = file->f_inode;
2468 struct fuse_inode *fi = get_fuse_inode(inode); 2466 struct fuse_inode *fi = get_fuse_inode(inode);
2469 struct fuse_conn *fc = ff->fc; 2467 struct fuse_conn *fc = ff->fc;
2470 struct fuse_req *req; 2468 struct fuse_req *req;
2471 struct fuse_fallocate_in inarg = { 2469 struct fuse_fallocate_in inarg = {
2472 .fh = ff->fh, 2470 .fh = ff->fh,
2473 .offset = offset, 2471 .offset = offset,
2474 .length = length, 2472 .length = length,
2475 .mode = mode 2473 .mode = mode
2476 }; 2474 };
2477 int err; 2475 int err;
2478 bool lock_inode = !(mode & FALLOC_FL_KEEP_SIZE) || 2476 bool lock_inode = !(mode & FALLOC_FL_KEEP_SIZE) ||
2479 (mode & FALLOC_FL_PUNCH_HOLE); 2477 (mode & FALLOC_FL_PUNCH_HOLE);
2480 2478
2481 if (fc->no_fallocate) 2479 if (fc->no_fallocate)
2482 return -EOPNOTSUPP; 2480 return -EOPNOTSUPP;
2483 2481
2484 if (lock_inode) { 2482 if (lock_inode) {
2485 mutex_lock(&inode->i_mutex); 2483 mutex_lock(&inode->i_mutex);
2486 if (mode & FALLOC_FL_PUNCH_HOLE) { 2484 if (mode & FALLOC_FL_PUNCH_HOLE) {
2487 loff_t endbyte = offset + length - 1; 2485 loff_t endbyte = offset + length - 1;
2488 err = filemap_write_and_wait_range(inode->i_mapping, 2486 err = filemap_write_and_wait_range(inode->i_mapping,
2489 offset, endbyte); 2487 offset, endbyte);
2490 if (err) 2488 if (err)
2491 goto out; 2489 goto out;
2492 2490
2493 fuse_sync_writes(inode); 2491 fuse_sync_writes(inode);
2494 } 2492 }
2495 } 2493 }
2496 2494
2497 if (!(mode & FALLOC_FL_KEEP_SIZE)) 2495 if (!(mode & FALLOC_FL_KEEP_SIZE))
2498 set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); 2496 set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
2499 2497
2500 req = fuse_get_req_nopages(fc); 2498 req = fuse_get_req_nopages(fc);
2501 if (IS_ERR(req)) { 2499 if (IS_ERR(req)) {
2502 err = PTR_ERR(req); 2500 err = PTR_ERR(req);
2503 goto out; 2501 goto out;
2504 } 2502 }
2505 2503
2506 req->in.h.opcode = FUSE_FALLOCATE; 2504 req->in.h.opcode = FUSE_FALLOCATE;
2507 req->in.h.nodeid = ff->nodeid; 2505 req->in.h.nodeid = ff->nodeid;
2508 req->in.numargs = 1; 2506 req->in.numargs = 1;
2509 req->in.args[0].size = sizeof(inarg); 2507 req->in.args[0].size = sizeof(inarg);
2510 req->in.args[0].value = &inarg; 2508 req->in.args[0].value = &inarg;
2511 fuse_request_send(fc, req); 2509 fuse_request_send(fc, req);
2512 err = req->out.h.error; 2510 err = req->out.h.error;
2513 if (err == -ENOSYS) { 2511 if (err == -ENOSYS) {
2514 fc->no_fallocate = 1; 2512 fc->no_fallocate = 1;
2515 err = -EOPNOTSUPP; 2513 err = -EOPNOTSUPP;
2516 } 2514 }
2517 fuse_put_request(fc, req); 2515 fuse_put_request(fc, req);
2518 2516
2519 if (err) 2517 if (err)
2520 goto out; 2518 goto out;
2521 2519
2522 /* we could have extended the file */ 2520 /* we could have extended the file */
2523 if (!(mode & FALLOC_FL_KEEP_SIZE)) 2521 if (!(mode & FALLOC_FL_KEEP_SIZE))
2524 fuse_write_update_size(inode, offset + length); 2522 fuse_write_update_size(inode, offset + length);
2525 2523
2526 if (mode & FALLOC_FL_PUNCH_HOLE) 2524 if (mode & FALLOC_FL_PUNCH_HOLE)
2527 truncate_pagecache_range(inode, offset, offset + length - 1); 2525 truncate_pagecache_range(inode, offset, offset + length - 1);
2528 2526
2529 fuse_invalidate_attr(inode); 2527 fuse_invalidate_attr(inode);
2530 2528
2531 out: 2529 out:
2532 if (!(mode & FALLOC_FL_KEEP_SIZE)) 2530 if (!(mode & FALLOC_FL_KEEP_SIZE))
2533 clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); 2531 clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
2534 2532
2535 if (lock_inode) 2533 if (lock_inode)
2536 mutex_unlock(&inode->i_mutex); 2534 mutex_unlock(&inode->i_mutex);
2537 2535
2538 return err; 2536 return err;
2539 } 2537 }
2540 2538
2541 static const struct file_operations fuse_file_operations = { 2539 static const struct file_operations fuse_file_operations = {
2542 .llseek = fuse_file_llseek, 2540 .llseek = fuse_file_llseek,
2543 .read = do_sync_read, 2541 .read = do_sync_read,
2544 .aio_read = fuse_file_aio_read, 2542 .aio_read = fuse_file_aio_read,
2545 .write = do_sync_write, 2543 .write = do_sync_write,
2546 .aio_write = fuse_file_aio_write, 2544 .aio_write = fuse_file_aio_write,
2547 .mmap = fuse_file_mmap, 2545 .mmap = fuse_file_mmap,
2548 .open = fuse_open, 2546 .open = fuse_open,
2549 .flush = fuse_flush, 2547 .flush = fuse_flush,
2550 .release = fuse_release, 2548 .release = fuse_release,
2551 .fsync = fuse_fsync, 2549 .fsync = fuse_fsync,
2552 .lock = fuse_file_lock, 2550 .lock = fuse_file_lock,
2553 .flock = fuse_file_flock, 2551 .flock = fuse_file_flock,
2554 .splice_read = generic_file_splice_read, 2552 .splice_read = generic_file_splice_read,
2555 .unlocked_ioctl = fuse_file_ioctl, 2553 .unlocked_ioctl = fuse_file_ioctl,
2556 .compat_ioctl = fuse_file_compat_ioctl, 2554 .compat_ioctl = fuse_file_compat_ioctl,
2557 .poll = fuse_file_poll, 2555 .poll = fuse_file_poll,
2558 .fallocate = fuse_file_fallocate, 2556 .fallocate = fuse_file_fallocate,
2559 }; 2557 };
2560 2558
2561 static const struct file_operations fuse_direct_io_file_operations = { 2559 static const struct file_operations fuse_direct_io_file_operations = {
2562 .llseek = fuse_file_llseek, 2560 .llseek = fuse_file_llseek,
2563 .read = fuse_direct_read, 2561 .read = fuse_direct_read,
2564 .write = fuse_direct_write, 2562 .write = fuse_direct_write,
2565 .mmap = fuse_direct_mmap, 2563 .mmap = fuse_direct_mmap,
2566 .open = fuse_open, 2564 .open = fuse_open,
2567 .flush = fuse_flush, 2565 .flush = fuse_flush,
2568 .release = fuse_release, 2566 .release = fuse_release,
2569 .fsync = fuse_fsync, 2567 .fsync = fuse_fsync,
2570 .lock = fuse_file_lock, 2568 .lock = fuse_file_lock,
2571 .flock = fuse_file_flock, 2569 .flock = fuse_file_flock,
2572 .unlocked_ioctl = fuse_file_ioctl, 2570 .unlocked_ioctl = fuse_file_ioctl,
2573 .compat_ioctl = fuse_file_compat_ioctl, 2571 .compat_ioctl = fuse_file_compat_ioctl,
2574 .poll = fuse_file_poll, 2572 .poll = fuse_file_poll,
2575 .fallocate = fuse_file_fallocate, 2573 .fallocate = fuse_file_fallocate,
2576 /* no splice_read */ 2574 /* no splice_read */
2577 }; 2575 };
2578 2576
2579 static const struct address_space_operations fuse_file_aops = { 2577 static const struct address_space_operations fuse_file_aops = {
2580 .readpage = fuse_readpage, 2578 .readpage = fuse_readpage,
2581 .writepage = fuse_writepage, 2579 .writepage = fuse_writepage,
2582 .launder_page = fuse_launder_page, 2580 .launder_page = fuse_launder_page,
2583 .readpages = fuse_readpages, 2581 .readpages = fuse_readpages,
2584 .set_page_dirty = __set_page_dirty_nobuffers, 2582 .set_page_dirty = __set_page_dirty_nobuffers,
2585 .bmap = fuse_bmap, 2583 .bmap = fuse_bmap,
2586 .direct_IO = fuse_direct_IO, 2584 .direct_IO = fuse_direct_IO,
2587 }; 2585 };
2588 2586
2589 void fuse_init_file_inode(struct inode *inode) 2587 void fuse_init_file_inode(struct inode *inode)
2590 { 2588 {
2591 inode->i_fop = &fuse_file_operations; 2589 inode->i_fop = &fuse_file_operations;
2592 inode->i_data.a_ops = &fuse_file_aops; 2590 inode->i_data.a_ops = &fuse_file_aops;
2593 } 2591 }
2594 2592
fs/gfs2/aops.c
1 /* 1 /*
2 * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved. 2 * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved.
3 * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved. 3 * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved.
4 * 4 *
5 * This copyrighted material is made available to anyone wishing to use, 5 * This copyrighted material is made available to anyone wishing to use,
6 * modify, copy, or redistribute it subject to the terms and conditions 6 * modify, copy, or redistribute it subject to the terms and conditions
7 * of the GNU General Public License version 2. 7 * of the GNU General Public License version 2.
8 */ 8 */
9 9
10 #include <linux/sched.h> 10 #include <linux/sched.h>
11 #include <linux/slab.h> 11 #include <linux/slab.h>
12 #include <linux/spinlock.h> 12 #include <linux/spinlock.h>
13 #include <linux/completion.h> 13 #include <linux/completion.h>
14 #include <linux/buffer_head.h> 14 #include <linux/buffer_head.h>
15 #include <linux/pagemap.h> 15 #include <linux/pagemap.h>
16 #include <linux/pagevec.h> 16 #include <linux/pagevec.h>
17 #include <linux/mpage.h> 17 #include <linux/mpage.h>
18 #include <linux/fs.h> 18 #include <linux/fs.h>
19 #include <linux/writeback.h> 19 #include <linux/writeback.h>
20 #include <linux/swap.h> 20 #include <linux/swap.h>
21 #include <linux/gfs2_ondisk.h> 21 #include <linux/gfs2_ondisk.h>
22 #include <linux/backing-dev.h> 22 #include <linux/backing-dev.h>
23 #include <linux/aio.h> 23 #include <linux/aio.h>
24 24
25 #include "gfs2.h" 25 #include "gfs2.h"
26 #include "incore.h" 26 #include "incore.h"
27 #include "bmap.h" 27 #include "bmap.h"
28 #include "glock.h" 28 #include "glock.h"
29 #include "inode.h" 29 #include "inode.h"
30 #include "log.h" 30 #include "log.h"
31 #include "meta_io.h" 31 #include "meta_io.h"
32 #include "quota.h" 32 #include "quota.h"
33 #include "trans.h" 33 #include "trans.h"
34 #include "rgrp.h" 34 #include "rgrp.h"
35 #include "super.h" 35 #include "super.h"
36 #include "util.h" 36 #include "util.h"
37 #include "glops.h" 37 #include "glops.h"
38 38
39 39
40 static void gfs2_page_add_databufs(struct gfs2_inode *ip, struct page *page, 40 static void gfs2_page_add_databufs(struct gfs2_inode *ip, struct page *page,
41 unsigned int from, unsigned int to) 41 unsigned int from, unsigned int to)
42 { 42 {
43 struct buffer_head *head = page_buffers(page); 43 struct buffer_head *head = page_buffers(page);
44 unsigned int bsize = head->b_size; 44 unsigned int bsize = head->b_size;
45 struct buffer_head *bh; 45 struct buffer_head *bh;
46 unsigned int start, end; 46 unsigned int start, end;
47 47
48 for (bh = head, start = 0; bh != head || !start; 48 for (bh = head, start = 0; bh != head || !start;
49 bh = bh->b_this_page, start = end) { 49 bh = bh->b_this_page, start = end) {
50 end = start + bsize; 50 end = start + bsize;
51 if (end <= from || start >= to) 51 if (end <= from || start >= to)
52 continue; 52 continue;
53 if (gfs2_is_jdata(ip)) 53 if (gfs2_is_jdata(ip))
54 set_buffer_uptodate(bh); 54 set_buffer_uptodate(bh);
55 gfs2_trans_add_data(ip->i_gl, bh); 55 gfs2_trans_add_data(ip->i_gl, bh);
56 } 56 }
57 } 57 }
58 58
59 /** 59 /**
60 * gfs2_get_block_noalloc - Fills in a buffer head with details about a block 60 * gfs2_get_block_noalloc - Fills in a buffer head with details about a block
61 * @inode: The inode 61 * @inode: The inode
62 * @lblock: The block number to look up 62 * @lblock: The block number to look up
63 * @bh_result: The buffer head to return the result in 63 * @bh_result: The buffer head to return the result in
64 * @create: Non-zero if we may add block to the file 64 * @create: Non-zero if we may add block to the file
65 * 65 *
66 * Returns: errno 66 * Returns: errno
67 */ 67 */
68 68
69 static int gfs2_get_block_noalloc(struct inode *inode, sector_t lblock, 69 static int gfs2_get_block_noalloc(struct inode *inode, sector_t lblock,
70 struct buffer_head *bh_result, int create) 70 struct buffer_head *bh_result, int create)
71 { 71 {
72 int error; 72 int error;
73 73
74 error = gfs2_block_map(inode, lblock, bh_result, 0); 74 error = gfs2_block_map(inode, lblock, bh_result, 0);
75 if (error) 75 if (error)
76 return error; 76 return error;
77 if (!buffer_mapped(bh_result)) 77 if (!buffer_mapped(bh_result))
78 return -EIO; 78 return -EIO;
79 return 0; 79 return 0;
80 } 80 }
81 81
82 static int gfs2_get_block_direct(struct inode *inode, sector_t lblock, 82 static int gfs2_get_block_direct(struct inode *inode, sector_t lblock,
83 struct buffer_head *bh_result, int create) 83 struct buffer_head *bh_result, int create)
84 { 84 {
85 return gfs2_block_map(inode, lblock, bh_result, 0); 85 return gfs2_block_map(inode, lblock, bh_result, 0);
86 } 86 }
87 87
88 /** 88 /**
89 * gfs2_writepage_common - Common bits of writepage 89 * gfs2_writepage_common - Common bits of writepage
90 * @page: The page to be written 90 * @page: The page to be written
91 * @wbc: The writeback control 91 * @wbc: The writeback control
92 * 92 *
93 * Returns: 1 if writepage is ok, otherwise an error code or zero if no error. 93 * Returns: 1 if writepage is ok, otherwise an error code or zero if no error.
94 */ 94 */
95 95
96 static int gfs2_writepage_common(struct page *page, 96 static int gfs2_writepage_common(struct page *page,
97 struct writeback_control *wbc) 97 struct writeback_control *wbc)
98 { 98 {
99 struct inode *inode = page->mapping->host; 99 struct inode *inode = page->mapping->host;
100 struct gfs2_inode *ip = GFS2_I(inode); 100 struct gfs2_inode *ip = GFS2_I(inode);
101 struct gfs2_sbd *sdp = GFS2_SB(inode); 101 struct gfs2_sbd *sdp = GFS2_SB(inode);
102 loff_t i_size = i_size_read(inode); 102 loff_t i_size = i_size_read(inode);
103 pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; 103 pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
104 unsigned offset; 104 unsigned offset;
105 105
106 if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl))) 106 if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
107 goto out; 107 goto out;
108 if (current->journal_info) 108 if (current->journal_info)
109 goto redirty; 109 goto redirty;
110 /* Is the page fully outside i_size? (truncate in progress) */ 110 /* Is the page fully outside i_size? (truncate in progress) */
111 offset = i_size & (PAGE_CACHE_SIZE-1); 111 offset = i_size & (PAGE_CACHE_SIZE-1);
112 if (page->index > end_index || (page->index == end_index && !offset)) { 112 if (page->index > end_index || (page->index == end_index && !offset)) {
113 page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE); 113 page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE);
114 goto out; 114 goto out;
115 } 115 }
116 return 1; 116 return 1;
117 redirty: 117 redirty:
118 redirty_page_for_writepage(wbc, page); 118 redirty_page_for_writepage(wbc, page);
119 out: 119 out:
120 unlock_page(page); 120 unlock_page(page);
121 return 0; 121 return 0;
122 } 122 }
123 123
124 /** 124 /**
125 * gfs2_writepage - Write page for writeback mappings 125 * gfs2_writepage - Write page for writeback mappings
126 * @page: The page 126 * @page: The page
127 * @wbc: The writeback control 127 * @wbc: The writeback control
128 * 128 *
129 */ 129 */
130 130
131 static int gfs2_writepage(struct page *page, struct writeback_control *wbc) 131 static int gfs2_writepage(struct page *page, struct writeback_control *wbc)
132 { 132 {
133 int ret; 133 int ret;
134 134
135 ret = gfs2_writepage_common(page, wbc); 135 ret = gfs2_writepage_common(page, wbc);
136 if (ret <= 0) 136 if (ret <= 0)
137 return ret; 137 return ret;
138 138
139 return nobh_writepage(page, gfs2_get_block_noalloc, wbc); 139 return nobh_writepage(page, gfs2_get_block_noalloc, wbc);
140 } 140 }
141 141
142 /** 142 /**
143 * __gfs2_jdata_writepage - The core of jdata writepage 143 * __gfs2_jdata_writepage - The core of jdata writepage
144 * @page: The page to write 144 * @page: The page to write
145 * @wbc: The writeback control 145 * @wbc: The writeback control
146 * 146 *
147 * This is shared between writepage and writepages and implements the 147 * This is shared between writepage and writepages and implements the
148 * core of the writepage operation. If a transaction is required then 148 * core of the writepage operation. If a transaction is required then
149 * PageChecked will have been set and the transaction will have 149 * PageChecked will have been set and the transaction will have
150 * already been started before this is called. 150 * already been started before this is called.
151 */ 151 */
152 152
153 static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc) 153 static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc)
154 { 154 {
155 struct inode *inode = page->mapping->host; 155 struct inode *inode = page->mapping->host;
156 struct gfs2_inode *ip = GFS2_I(inode); 156 struct gfs2_inode *ip = GFS2_I(inode);
157 struct gfs2_sbd *sdp = GFS2_SB(inode); 157 struct gfs2_sbd *sdp = GFS2_SB(inode);
158 158
159 if (PageChecked(page)) { 159 if (PageChecked(page)) {
160 ClearPageChecked(page); 160 ClearPageChecked(page);
161 if (!page_has_buffers(page)) { 161 if (!page_has_buffers(page)) {
162 create_empty_buffers(page, inode->i_sb->s_blocksize, 162 create_empty_buffers(page, inode->i_sb->s_blocksize,
163 (1 << BH_Dirty)|(1 << BH_Uptodate)); 163 (1 << BH_Dirty)|(1 << BH_Uptodate));
164 } 164 }
165 gfs2_page_add_databufs(ip, page, 0, sdp->sd_vfs->s_blocksize-1); 165 gfs2_page_add_databufs(ip, page, 0, sdp->sd_vfs->s_blocksize-1);
166 } 166 }
167 return block_write_full_page(page, gfs2_get_block_noalloc, wbc); 167 return block_write_full_page(page, gfs2_get_block_noalloc, wbc);
168 } 168 }
169 169
170 /** 170 /**
171 * gfs2_jdata_writepage - Write complete page 171 * gfs2_jdata_writepage - Write complete page
172 * @page: Page to write 172 * @page: Page to write
173 * 173 *
174 * Returns: errno 174 * Returns: errno
175 * 175 *
176 */ 176 */
177 177
178 static int gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc) 178 static int gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc)
179 { 179 {
180 struct inode *inode = page->mapping->host; 180 struct inode *inode = page->mapping->host;
181 struct gfs2_sbd *sdp = GFS2_SB(inode); 181 struct gfs2_sbd *sdp = GFS2_SB(inode);
182 int ret; 182 int ret;
183 int done_trans = 0; 183 int done_trans = 0;
184 184
185 if (PageChecked(page)) { 185 if (PageChecked(page)) {
186 if (wbc->sync_mode != WB_SYNC_ALL) 186 if (wbc->sync_mode != WB_SYNC_ALL)
187 goto out_ignore; 187 goto out_ignore;
188 ret = gfs2_trans_begin(sdp, RES_DINODE + 1, 0); 188 ret = gfs2_trans_begin(sdp, RES_DINODE + 1, 0);
189 if (ret) 189 if (ret)
190 goto out_ignore; 190 goto out_ignore;
191 done_trans = 1; 191 done_trans = 1;
192 } 192 }
193 ret = gfs2_writepage_common(page, wbc); 193 ret = gfs2_writepage_common(page, wbc);
194 if (ret > 0) 194 if (ret > 0)
195 ret = __gfs2_jdata_writepage(page, wbc); 195 ret = __gfs2_jdata_writepage(page, wbc);
196 if (done_trans) 196 if (done_trans)
197 gfs2_trans_end(sdp); 197 gfs2_trans_end(sdp);
198 return ret; 198 return ret;
199 199
200 out_ignore: 200 out_ignore:
201 redirty_page_for_writepage(wbc, page); 201 redirty_page_for_writepage(wbc, page);
202 unlock_page(page); 202 unlock_page(page);
203 return 0; 203 return 0;
204 } 204 }
205 205
206 /** 206 /**
207 * gfs2_writepages - Write a bunch of dirty pages back to disk 207 * gfs2_writepages - Write a bunch of dirty pages back to disk
208 * @mapping: The mapping to write 208 * @mapping: The mapping to write
209 * @wbc: Write-back control 209 * @wbc: Write-back control
210 * 210 *
211 * Used for both ordered and writeback modes. 211 * Used for both ordered and writeback modes.
212 */ 212 */
213 static int gfs2_writepages(struct address_space *mapping, 213 static int gfs2_writepages(struct address_space *mapping,
214 struct writeback_control *wbc) 214 struct writeback_control *wbc)
215 { 215 {
216 return mpage_writepages(mapping, wbc, gfs2_get_block_noalloc); 216 return mpage_writepages(mapping, wbc, gfs2_get_block_noalloc);
217 } 217 }
218 218
219 /** 219 /**
220 * gfs2_write_jdata_pagevec - Write back a pagevec's worth of pages 220 * gfs2_write_jdata_pagevec - Write back a pagevec's worth of pages
221 * @mapping: The mapping 221 * @mapping: The mapping
222 * @wbc: The writeback control 222 * @wbc: The writeback control
223 * @writepage: The writepage function to call for each page 223 * @writepage: The writepage function to call for each page
224 * @pvec: The vector of pages 224 * @pvec: The vector of pages
225 * @nr_pages: The number of pages to write 225 * @nr_pages: The number of pages to write
226 * 226 *
227 * Returns: non-zero if loop should terminate, zero otherwise 227 * Returns: non-zero if loop should terminate, zero otherwise
228 */ 228 */
229 229
230 static int gfs2_write_jdata_pagevec(struct address_space *mapping, 230 static int gfs2_write_jdata_pagevec(struct address_space *mapping,
231 struct writeback_control *wbc, 231 struct writeback_control *wbc,
232 struct pagevec *pvec, 232 struct pagevec *pvec,
233 int nr_pages, pgoff_t end) 233 int nr_pages, pgoff_t end)
234 { 234 {
235 struct inode *inode = mapping->host; 235 struct inode *inode = mapping->host;
236 struct gfs2_sbd *sdp = GFS2_SB(inode); 236 struct gfs2_sbd *sdp = GFS2_SB(inode);
237 loff_t i_size = i_size_read(inode); 237 loff_t i_size = i_size_read(inode);
238 pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; 238 pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
239 unsigned offset = i_size & (PAGE_CACHE_SIZE-1); 239 unsigned offset = i_size & (PAGE_CACHE_SIZE-1);
240 unsigned nrblocks = nr_pages * (PAGE_CACHE_SIZE/inode->i_sb->s_blocksize); 240 unsigned nrblocks = nr_pages * (PAGE_CACHE_SIZE/inode->i_sb->s_blocksize);
241 int i; 241 int i;
242 int ret; 242 int ret;
243 243
244 ret = gfs2_trans_begin(sdp, nrblocks, nrblocks); 244 ret = gfs2_trans_begin(sdp, nrblocks, nrblocks);
245 if (ret < 0) 245 if (ret < 0)
246 return ret; 246 return ret;
247 247
248 for(i = 0; i < nr_pages; i++) { 248 for(i = 0; i < nr_pages; i++) {
249 struct page *page = pvec->pages[i]; 249 struct page *page = pvec->pages[i];
250 250
251 lock_page(page); 251 lock_page(page);
252 252
253 if (unlikely(page->mapping != mapping)) { 253 if (unlikely(page->mapping != mapping)) {
254 unlock_page(page); 254 unlock_page(page);
255 continue; 255 continue;
256 } 256 }
257 257
258 if (!wbc->range_cyclic && page->index > end) { 258 if (!wbc->range_cyclic && page->index > end) {
259 ret = 1; 259 ret = 1;
260 unlock_page(page); 260 unlock_page(page);
261 continue; 261 continue;
262 } 262 }
263 263
264 if (wbc->sync_mode != WB_SYNC_NONE) 264 if (wbc->sync_mode != WB_SYNC_NONE)
265 wait_on_page_writeback(page); 265 wait_on_page_writeback(page);
266 266
267 if (PageWriteback(page) || 267 if (PageWriteback(page) ||
268 !clear_page_dirty_for_io(page)) { 268 !clear_page_dirty_for_io(page)) {
269 unlock_page(page); 269 unlock_page(page);
270 continue; 270 continue;
271 } 271 }
272 272
273 /* Is the page fully outside i_size? (truncate in progress) */ 273 /* Is the page fully outside i_size? (truncate in progress) */
274 if (page->index > end_index || (page->index == end_index && !offset)) { 274 if (page->index > end_index || (page->index == end_index && !offset)) {
275 page->mapping->a_ops->invalidatepage(page, 0, 275 page->mapping->a_ops->invalidatepage(page, 0,
276 PAGE_CACHE_SIZE); 276 PAGE_CACHE_SIZE);
277 unlock_page(page); 277 unlock_page(page);
278 continue; 278 continue;
279 } 279 }
280 280
281 ret = __gfs2_jdata_writepage(page, wbc); 281 ret = __gfs2_jdata_writepage(page, wbc);
282 282
283 if (ret || (--(wbc->nr_to_write) <= 0)) 283 if (ret || (--(wbc->nr_to_write) <= 0))
284 ret = 1; 284 ret = 1;
285 } 285 }
286 gfs2_trans_end(sdp); 286 gfs2_trans_end(sdp);
287 return ret; 287 return ret;
288 } 288 }
289 289
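gfs2_write_jdata_pagevec() opens one journal transaction for the whole pagevec before any page lock is taken, sized as nr_pages * (PAGE_CACHE_SIZE / blocksize), i.e. one reserved block per file-system block the pages cover. A tiny stand-alone sketch of that sizing (page size, block size and pagevec length are assumed values, not taken from the headers):

	#include <stdio.h>

	int main(void)
	{
		unsigned page_size = 4096;	/* PAGE_CACHE_SIZE, assumed */
		unsigned blocksize = 1024;	/* inode->i_sb->s_blocksize, assumed */
		unsigned nr_pages  = 14;	/* one full pagevec, assumed */
		unsigned nrblocks  = nr_pages * (page_size / blocksize);

		printf("reserve %u journal blocks\n", nrblocks);	/* 56 */
		return 0;
	}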
290 /** 290 /**
291 * gfs2_write_cache_jdata - Like write_cache_pages but different 291 * gfs2_write_cache_jdata - Like write_cache_pages but different
292 * @mapping: The mapping to write 292 * @mapping: The mapping to write
293 * @wbc: The writeback control 293 * @wbc: The writeback control
294 * @writepage: The writepage function to call 294 * @writepage: The writepage function to call
295 * @data: The data to pass to writepage 295 * @data: The data to pass to writepage
296 * 296 *
297 * The reason that we use our own function here is that we need to 297 * The reason that we use our own function here is that we need to
298 * start transactions before we grab page locks. This allows us 298 * start transactions before we grab page locks. This allows us
299 * to get the ordering right. 299 * to get the ordering right.
300 */ 300 */
301 301
302 static int gfs2_write_cache_jdata(struct address_space *mapping, 302 static int gfs2_write_cache_jdata(struct address_space *mapping,
303 struct writeback_control *wbc) 303 struct writeback_control *wbc)
304 { 304 {
305 int ret = 0; 305 int ret = 0;
306 int done = 0; 306 int done = 0;
307 struct pagevec pvec; 307 struct pagevec pvec;
308 int nr_pages; 308 int nr_pages;
309 pgoff_t index; 309 pgoff_t index;
310 pgoff_t end; 310 pgoff_t end;
311 int scanned = 0; 311 int scanned = 0;
312 int range_whole = 0; 312 int range_whole = 0;
313 313
314 pagevec_init(&pvec, 0); 314 pagevec_init(&pvec, 0);
315 if (wbc->range_cyclic) { 315 if (wbc->range_cyclic) {
316 index = mapping->writeback_index; /* Start from prev offset */ 316 index = mapping->writeback_index; /* Start from prev offset */
317 end = -1; 317 end = -1;
318 } else { 318 } else {
319 index = wbc->range_start >> PAGE_CACHE_SHIFT; 319 index = wbc->range_start >> PAGE_CACHE_SHIFT;
320 end = wbc->range_end >> PAGE_CACHE_SHIFT; 320 end = wbc->range_end >> PAGE_CACHE_SHIFT;
321 if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) 321 if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
322 range_whole = 1; 322 range_whole = 1;
323 scanned = 1; 323 scanned = 1;
324 } 324 }
325 325
326 retry: 326 retry:
327 while (!done && (index <= end) && 327 while (!done && (index <= end) &&
328 (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, 328 (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
329 PAGECACHE_TAG_DIRTY, 329 PAGECACHE_TAG_DIRTY,
330 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) { 330 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
331 scanned = 1; 331 scanned = 1;
332 ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end); 332 ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end);
333 if (ret) 333 if (ret)
334 done = 1; 334 done = 1;
335 if (ret > 0) 335 if (ret > 0)
336 ret = 0; 336 ret = 0;
337 337
338 pagevec_release(&pvec); 338 pagevec_release(&pvec);
339 cond_resched(); 339 cond_resched();
340 } 340 }
341 341
342 if (!scanned && !done) { 342 if (!scanned && !done) {
343 /* 343 /*
344 * We hit the last page and there is more work to be done: wrap 344 * We hit the last page and there is more work to be done: wrap
345 * back to the start of the file 345 * back to the start of the file
346 */ 346 */
347 scanned = 1; 347 scanned = 1;
348 index = 0; 348 index = 0;
349 goto retry; 349 goto retry;
350 } 350 }
351 351
352 if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)) 352 if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
353 mapping->writeback_index = index; 353 mapping->writeback_index = index;
354 return ret; 354 return ret;
355 } 355 }
356 356
357 357
358 /** 358 /**
359 * gfs2_jdata_writepages - Write a bunch of dirty pages back to disk 359 * gfs2_jdata_writepages - Write a bunch of dirty pages back to disk
360 * @mapping: The mapping to write 360 * @mapping: The mapping to write
361 * @wbc: The writeback control 361 * @wbc: The writeback control
362 * 362 *
363 */ 363 */
364 364
365 static int gfs2_jdata_writepages(struct address_space *mapping, 365 static int gfs2_jdata_writepages(struct address_space *mapping,
366 struct writeback_control *wbc) 366 struct writeback_control *wbc)
367 { 367 {
368 struct gfs2_inode *ip = GFS2_I(mapping->host); 368 struct gfs2_inode *ip = GFS2_I(mapping->host);
369 struct gfs2_sbd *sdp = GFS2_SB(mapping->host); 369 struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
370 int ret; 370 int ret;
371 371
372 ret = gfs2_write_cache_jdata(mapping, wbc); 372 ret = gfs2_write_cache_jdata(mapping, wbc);
373 if (ret == 0 && wbc->sync_mode == WB_SYNC_ALL) { 373 if (ret == 0 && wbc->sync_mode == WB_SYNC_ALL) {
374 gfs2_log_flush(sdp, ip->i_gl); 374 gfs2_log_flush(sdp, ip->i_gl);
375 ret = gfs2_write_cache_jdata(mapping, wbc); 375 ret = gfs2_write_cache_jdata(mapping, wbc);
376 } 376 }
377 return ret; 377 return ret;
378 } 378 }
379 379
380 /** 380 /**
381 * stuffed_readpage - Fill in a Linux page with stuffed file data 381 * stuffed_readpage - Fill in a Linux page with stuffed file data
382 * @ip: the inode 382 * @ip: the inode
383 * @page: the page 383 * @page: the page
384 * 384 *
385 * Returns: errno 385 * Returns: errno
386 */ 386 */
387 387
388 static int stuffed_readpage(struct gfs2_inode *ip, struct page *page) 388 static int stuffed_readpage(struct gfs2_inode *ip, struct page *page)
389 { 389 {
390 struct buffer_head *dibh; 390 struct buffer_head *dibh;
391 u64 dsize = i_size_read(&ip->i_inode); 391 u64 dsize = i_size_read(&ip->i_inode);
392 void *kaddr; 392 void *kaddr;
393 int error; 393 int error;
394 394
395 /* 395 /*
396 * Due to the order of unstuffing files and ->fault(), we can be 396 * Due to the order of unstuffing files and ->fault(), we can be
397 * asked for a zero page in the case of a stuffed file being extended, 397 * asked for a zero page in the case of a stuffed file being extended,
398 * so we need to supply one here. It doesn't happen often. 398 * so we need to supply one here. It doesn't happen often.
399 */ 399 */
400 if (unlikely(page->index)) { 400 if (unlikely(page->index)) {
401 zero_user(page, 0, PAGE_CACHE_SIZE); 401 zero_user(page, 0, PAGE_CACHE_SIZE);
402 SetPageUptodate(page); 402 SetPageUptodate(page);
403 return 0; 403 return 0;
404 } 404 }
405 405
406 error = gfs2_meta_inode_buffer(ip, &dibh); 406 error = gfs2_meta_inode_buffer(ip, &dibh);
407 if (error) 407 if (error)
408 return error; 408 return error;
409 409
410 kaddr = kmap_atomic(page); 410 kaddr = kmap_atomic(page);
411 if (dsize > (dibh->b_size - sizeof(struct gfs2_dinode))) 411 if (dsize > (dibh->b_size - sizeof(struct gfs2_dinode)))
412 dsize = (dibh->b_size - sizeof(struct gfs2_dinode)); 412 dsize = (dibh->b_size - sizeof(struct gfs2_dinode));
413 memcpy(kaddr, dibh->b_data + sizeof(struct gfs2_dinode), dsize); 413 memcpy(kaddr, dibh->b_data + sizeof(struct gfs2_dinode), dsize);
414 memset(kaddr + dsize, 0, PAGE_CACHE_SIZE - dsize); 414 memset(kaddr + dsize, 0, PAGE_CACHE_SIZE - dsize);
415 kunmap_atomic(kaddr); 415 kunmap_atomic(kaddr);
416 flush_dcache_page(page); 416 flush_dcache_page(page);
417 brelse(dibh); 417 brelse(dibh);
418 SetPageUptodate(page); 418 SetPageUptodate(page);
419 419
420 return 0; 420 return 0;
421 } 421 }
422 422
423 423
424 /** 424 /**
425 * __gfs2_readpage - readpage 425 * __gfs2_readpage - readpage
426 * @file: The file to read a page for 426 * @file: The file to read a page for
427 * @page: The page to read 427 * @page: The page to read
428 * 428 *
429 * This is the core of gfs2's readpage. It's used by the internal file 429 * This is the core of gfs2's readpage. It's used by the internal file
430 * reading code as in that case we already hold the glock. Also it's 430 * reading code as in that case we already hold the glock. Also it's
431 * called by gfs2_readpage() once the required lock has been granted. 431 * called by gfs2_readpage() once the required lock has been granted.
432 * 432 *
433 */ 433 */
434 434
435 static int __gfs2_readpage(void *file, struct page *page) 435 static int __gfs2_readpage(void *file, struct page *page)
436 { 436 {
437 struct gfs2_inode *ip = GFS2_I(page->mapping->host); 437 struct gfs2_inode *ip = GFS2_I(page->mapping->host);
438 struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host); 438 struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
439 int error; 439 int error;
440 440
441 if (gfs2_is_stuffed(ip)) { 441 if (gfs2_is_stuffed(ip)) {
442 error = stuffed_readpage(ip, page); 442 error = stuffed_readpage(ip, page);
443 unlock_page(page); 443 unlock_page(page);
444 } else { 444 } else {
445 error = mpage_readpage(page, gfs2_block_map); 445 error = mpage_readpage(page, gfs2_block_map);
446 } 446 }
447 447
448 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) 448 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
449 return -EIO; 449 return -EIO;
450 450
451 return error; 451 return error;
452 } 452 }
453 453
454 /** 454 /**
455 * gfs2_readpage - read a page of a file 455 * gfs2_readpage - read a page of a file
456 * @file: The file to read 456 * @file: The file to read
457 * @page: The page of the file 457 * @page: The page of the file
458 * 458 *
459 * This deals with the locking required. We have to unlock and 459 * This deals with the locking required. We have to unlock and
460 * relock the page in order to get the locking in the right 460 * relock the page in order to get the locking in the right
461 * order. 461 * order.
462 */ 462 */
463 463
464 static int gfs2_readpage(struct file *file, struct page *page) 464 static int gfs2_readpage(struct file *file, struct page *page)
465 { 465 {
466 struct address_space *mapping = page->mapping; 466 struct address_space *mapping = page->mapping;
467 struct gfs2_inode *ip = GFS2_I(mapping->host); 467 struct gfs2_inode *ip = GFS2_I(mapping->host);
468 struct gfs2_holder gh; 468 struct gfs2_holder gh;
469 int error; 469 int error;
470 470
471 unlock_page(page); 471 unlock_page(page);
472 gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); 472 gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
473 error = gfs2_glock_nq(&gh); 473 error = gfs2_glock_nq(&gh);
474 if (unlikely(error)) 474 if (unlikely(error))
475 goto out; 475 goto out;
476 error = AOP_TRUNCATED_PAGE; 476 error = AOP_TRUNCATED_PAGE;
477 lock_page(page); 477 lock_page(page);
478 if (page->mapping == mapping && !PageUptodate(page)) 478 if (page->mapping == mapping && !PageUptodate(page))
479 error = __gfs2_readpage(file, page); 479 error = __gfs2_readpage(file, page);
480 else 480 else
481 unlock_page(page); 481 unlock_page(page);
482 gfs2_glock_dq(&gh); 482 gfs2_glock_dq(&gh);
483 out: 483 out:
484 gfs2_holder_uninit(&gh); 484 gfs2_holder_uninit(&gh);
485 if (error && error != AOP_TRUNCATED_PAGE) 485 if (error && error != AOP_TRUNCATED_PAGE)
486 lock_page(page); 486 lock_page(page);
487 return error; 487 return error;
488 } 488 }
489 489
490 /** 490 /**
491 * gfs2_internal_read - read an internal file 491 * gfs2_internal_read - read an internal file
492 * @ip: The gfs2 inode 492 * @ip: The gfs2 inode
493 * @buf: The buffer to fill 493 * @buf: The buffer to fill
494 * @pos: The file position 494 * @pos: The file position
495 * @size: The amount to read 495 * @size: The amount to read
496 * 496 *
497 */ 497 */
498 498
499 int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos, 499 int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
500 unsigned size) 500 unsigned size)
501 { 501 {
502 struct address_space *mapping = ip->i_inode.i_mapping; 502 struct address_space *mapping = ip->i_inode.i_mapping;
503 unsigned long index = *pos / PAGE_CACHE_SIZE; 503 unsigned long index = *pos / PAGE_CACHE_SIZE;
504 unsigned offset = *pos & (PAGE_CACHE_SIZE - 1); 504 unsigned offset = *pos & (PAGE_CACHE_SIZE - 1);
505 unsigned copied = 0; 505 unsigned copied = 0;
506 unsigned amt; 506 unsigned amt;
507 struct page *page; 507 struct page *page;
508 void *p; 508 void *p;
509 509
510 do { 510 do {
511 amt = size - copied; 511 amt = size - copied;
512 if (offset + size > PAGE_CACHE_SIZE) 512 if (offset + size > PAGE_CACHE_SIZE)
513 amt = PAGE_CACHE_SIZE - offset; 513 amt = PAGE_CACHE_SIZE - offset;
514 page = read_cache_page(mapping, index, __gfs2_readpage, NULL); 514 page = read_cache_page(mapping, index, __gfs2_readpage, NULL);
515 if (IS_ERR(page)) 515 if (IS_ERR(page))
516 return PTR_ERR(page); 516 return PTR_ERR(page);
517 p = kmap_atomic(page); 517 p = kmap_atomic(page);
518 memcpy(buf + copied, p + offset, amt); 518 memcpy(buf + copied, p + offset, amt);
519 kunmap_atomic(p); 519 kunmap_atomic(p);
520 mark_page_accessed(page);
521 page_cache_release(page); 520 page_cache_release(page);
522 copied += amt; 521 copied += amt;
523 index++; 522 index++;
524 offset = 0; 523 offset = 0;
525 } while(copied < size); 524 } while(copied < size);
526 (*pos) += size; 525 (*pos) += size;
527 return size; 526 return size;
528 } 527 }
529 528
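The only functional change in the gfs2_internal_read() hunk above is the dropped mark_page_accessed() call; the per-page slicing is untouched. For reference, a small stand-alone walk-through of that slicing for a 96-byte record straddling a 4 KiB page boundary (the sizes are assumptions chosen for illustration):

	#include <stdio.h>

	#define PAGE_CACHE_SIZE 4096u	/* assumed 4 KiB pages */

	/* Mirror the slicing done by gfs2_internal_read for a small record
	 * that straddles a page boundary: 96 bytes starting at offset 4064. */
	int main(void)
	{
		unsigned size = 96, pos = 4064;
		unsigned copied = 0, offset = pos % PAGE_CACHE_SIZE;
		unsigned long index = pos / PAGE_CACHE_SIZE;

		while (copied < size) {
			unsigned amt = size - copied;

			if (offset + size > PAGE_CACHE_SIZE)
				amt = PAGE_CACHE_SIZE - offset;
			printf("page %lu: copy %u bytes from offset %u\n",
			       index, amt, offset);
			copied += amt;
			index++;
			offset = 0;
		}
		return 0;
	}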
530 /** 529 /**
531 * gfs2_readpages - Read a bunch of pages at once 530 * gfs2_readpages - Read a bunch of pages at once
532 * 531 *
533 * Some notes: 532 * Some notes:
534 * 1. This is only for readahead, so we can simply ignore any things 533 * 1. This is only for readahead, so we can simply ignore any things
535 * which are slightly inconvenient (such as locking conflicts between 534 * which are slightly inconvenient (such as locking conflicts between
536 * the page lock and the glock) and return having done no I/O. It's 535 * the page lock and the glock) and return having done no I/O. It's
537 * obviously not something we'd want to do on too regular a basis. 536 * obviously not something we'd want to do on too regular a basis.
538 * Any I/O we ignore at this time will be done via readpage later. 537 * Any I/O we ignore at this time will be done via readpage later.
539 * 2. We don't handle stuffed files here; we let readpage do the honours. 538 * 2. We don't handle stuffed files here; we let readpage do the honours.
540 * 3. mpage_readpages() does most of the heavy lifting in the common case. 539 * 3. mpage_readpages() does most of the heavy lifting in the common case.
541 * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places. 540 * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places.
542 */ 541 */
543 542
544 static int gfs2_readpages(struct file *file, struct address_space *mapping, 543 static int gfs2_readpages(struct file *file, struct address_space *mapping,
545 struct list_head *pages, unsigned nr_pages) 544 struct list_head *pages, unsigned nr_pages)
546 { 545 {
547 struct inode *inode = mapping->host; 546 struct inode *inode = mapping->host;
548 struct gfs2_inode *ip = GFS2_I(inode); 547 struct gfs2_inode *ip = GFS2_I(inode);
549 struct gfs2_sbd *sdp = GFS2_SB(inode); 548 struct gfs2_sbd *sdp = GFS2_SB(inode);
550 struct gfs2_holder gh; 549 struct gfs2_holder gh;
551 int ret; 550 int ret;
552 551
553 gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); 552 gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
554 ret = gfs2_glock_nq(&gh); 553 ret = gfs2_glock_nq(&gh);
555 if (unlikely(ret)) 554 if (unlikely(ret))
556 goto out_uninit; 555 goto out_uninit;
557 if (!gfs2_is_stuffed(ip)) 556 if (!gfs2_is_stuffed(ip))
558 ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map); 557 ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map);
559 gfs2_glock_dq(&gh); 558 gfs2_glock_dq(&gh);
560 out_uninit: 559 out_uninit:
561 gfs2_holder_uninit(&gh); 560 gfs2_holder_uninit(&gh);
562 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) 561 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
563 ret = -EIO; 562 ret = -EIO;
564 return ret; 563 return ret;
565 } 564 }
566 565
567 /** 566 /**
568 * gfs2_write_begin - Begin to write to a file 567 * gfs2_write_begin - Begin to write to a file
569 * @file: The file to write to 568 * @file: The file to write to
570 * @mapping: The mapping in which to write 569 * @mapping: The mapping in which to write
571 * @pos: The file offset at which to start writing 570 * @pos: The file offset at which to start writing
572 * @len: Length of the write 571 * @len: Length of the write
573 * @flags: Various flags 572 * @flags: Various flags
574 * @pagep: Pointer to return the page 573 * @pagep: Pointer to return the page
575 * @fsdata: Pointer to return fs data (unused by GFS2) 574 * @fsdata: Pointer to return fs data (unused by GFS2)
576 * 575 *
577 * Returns: errno 576 * Returns: errno
578 */ 577 */
579 578
580 static int gfs2_write_begin(struct file *file, struct address_space *mapping, 579 static int gfs2_write_begin(struct file *file, struct address_space *mapping,
581 loff_t pos, unsigned len, unsigned flags, 580 loff_t pos, unsigned len, unsigned flags,
582 struct page **pagep, void **fsdata) 581 struct page **pagep, void **fsdata)
583 { 582 {
584 struct gfs2_inode *ip = GFS2_I(mapping->host); 583 struct gfs2_inode *ip = GFS2_I(mapping->host);
585 struct gfs2_sbd *sdp = GFS2_SB(mapping->host); 584 struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
586 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); 585 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
587 unsigned int data_blocks = 0, ind_blocks = 0, rblocks; 586 unsigned int data_blocks = 0, ind_blocks = 0, rblocks;
588 unsigned requested = 0; 587 unsigned requested = 0;
589 int alloc_required; 588 int alloc_required;
590 int error = 0; 589 int error = 0;
591 pgoff_t index = pos >> PAGE_CACHE_SHIFT; 590 pgoff_t index = pos >> PAGE_CACHE_SHIFT;
592 unsigned from = pos & (PAGE_CACHE_SIZE - 1); 591 unsigned from = pos & (PAGE_CACHE_SIZE - 1);
593 struct page *page; 592 struct page *page;
594 593
595 gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh); 594 gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
596 error = gfs2_glock_nq(&ip->i_gh); 595 error = gfs2_glock_nq(&ip->i_gh);
597 if (unlikely(error)) 596 if (unlikely(error))
598 goto out_uninit; 597 goto out_uninit;
599 if (&ip->i_inode == sdp->sd_rindex) { 598 if (&ip->i_inode == sdp->sd_rindex) {
600 error = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE, 599 error = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE,
601 GL_NOCACHE, &m_ip->i_gh); 600 GL_NOCACHE, &m_ip->i_gh);
602 if (unlikely(error)) { 601 if (unlikely(error)) {
603 gfs2_glock_dq(&ip->i_gh); 602 gfs2_glock_dq(&ip->i_gh);
604 goto out_uninit; 603 goto out_uninit;
605 } 604 }
606 } 605 }
607 606
608 alloc_required = gfs2_write_alloc_required(ip, pos, len); 607 alloc_required = gfs2_write_alloc_required(ip, pos, len);
609 608
610 if (alloc_required || gfs2_is_jdata(ip)) 609 if (alloc_required || gfs2_is_jdata(ip))
611 gfs2_write_calc_reserv(ip, len, &data_blocks, &ind_blocks); 610 gfs2_write_calc_reserv(ip, len, &data_blocks, &ind_blocks);
612 611
613 if (alloc_required) { 612 if (alloc_required) {
614 error = gfs2_quota_lock_check(ip); 613 error = gfs2_quota_lock_check(ip);
615 if (error) 614 if (error)
616 goto out_unlock; 615 goto out_unlock;
617 616
618 requested = data_blocks + ind_blocks; 617 requested = data_blocks + ind_blocks;
619 error = gfs2_inplace_reserve(ip, requested, 0); 618 error = gfs2_inplace_reserve(ip, requested, 0);
620 if (error) 619 if (error)
621 goto out_qunlock; 620 goto out_qunlock;
622 } 621 }
623 622
624 rblocks = RES_DINODE + ind_blocks; 623 rblocks = RES_DINODE + ind_blocks;
625 if (gfs2_is_jdata(ip)) 624 if (gfs2_is_jdata(ip))
626 rblocks += data_blocks ? data_blocks : 1; 625 rblocks += data_blocks ? data_blocks : 1;
627 if (ind_blocks || data_blocks) 626 if (ind_blocks || data_blocks)
628 rblocks += RES_STATFS + RES_QUOTA; 627 rblocks += RES_STATFS + RES_QUOTA;
629 if (&ip->i_inode == sdp->sd_rindex) 628 if (&ip->i_inode == sdp->sd_rindex)
630 rblocks += 2 * RES_STATFS; 629 rblocks += 2 * RES_STATFS;
631 if (alloc_required) 630 if (alloc_required)
632 rblocks += gfs2_rg_blocks(ip, requested); 631 rblocks += gfs2_rg_blocks(ip, requested);
633 632
634 error = gfs2_trans_begin(sdp, rblocks, 633 error = gfs2_trans_begin(sdp, rblocks,
635 PAGE_CACHE_SIZE/sdp->sd_sb.sb_bsize); 634 PAGE_CACHE_SIZE/sdp->sd_sb.sb_bsize);
636 if (error) 635 if (error)
637 goto out_trans_fail; 636 goto out_trans_fail;
638 637
639 error = -ENOMEM; 638 error = -ENOMEM;
640 flags |= AOP_FLAG_NOFS; 639 flags |= AOP_FLAG_NOFS;
641 page = grab_cache_page_write_begin(mapping, index, flags); 640 page = grab_cache_page_write_begin(mapping, index, flags);
642 *pagep = page; 641 *pagep = page;
643 if (unlikely(!page)) 642 if (unlikely(!page))
644 goto out_endtrans; 643 goto out_endtrans;
645 644
646 if (gfs2_is_stuffed(ip)) { 645 if (gfs2_is_stuffed(ip)) {
647 error = 0; 646 error = 0;
648 if (pos + len > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) { 647 if (pos + len > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) {
649 error = gfs2_unstuff_dinode(ip, page); 648 error = gfs2_unstuff_dinode(ip, page);
650 if (error == 0) 649 if (error == 0)
651 goto prepare_write; 650 goto prepare_write;
652 } else if (!PageUptodate(page)) { 651 } else if (!PageUptodate(page)) {
653 error = stuffed_readpage(ip, page); 652 error = stuffed_readpage(ip, page);
654 } 653 }
655 goto out; 654 goto out;
656 } 655 }
657 656
658 prepare_write: 657 prepare_write:
659 error = __block_write_begin(page, from, len, gfs2_block_map); 658 error = __block_write_begin(page, from, len, gfs2_block_map);
660 out: 659 out:
661 if (error == 0) 660 if (error == 0)
662 return 0; 661 return 0;
663 662
664 unlock_page(page); 663 unlock_page(page);
665 page_cache_release(page); 664 page_cache_release(page);
666 665
667 gfs2_trans_end(sdp); 666 gfs2_trans_end(sdp);
668 if (pos + len > ip->i_inode.i_size) 667 if (pos + len > ip->i_inode.i_size)
669 gfs2_trim_blocks(&ip->i_inode); 668 gfs2_trim_blocks(&ip->i_inode);
670 goto out_trans_fail; 669 goto out_trans_fail;
671 670
672 out_endtrans: 671 out_endtrans:
673 gfs2_trans_end(sdp); 672 gfs2_trans_end(sdp);
674 out_trans_fail: 673 out_trans_fail:
675 if (alloc_required) { 674 if (alloc_required) {
676 gfs2_inplace_release(ip); 675 gfs2_inplace_release(ip);
677 out_qunlock: 676 out_qunlock:
678 gfs2_quota_unlock(ip); 677 gfs2_quota_unlock(ip);
679 } 678 }
680 out_unlock: 679 out_unlock:
681 if (&ip->i_inode == sdp->sd_rindex) { 680 if (&ip->i_inode == sdp->sd_rindex) {
682 gfs2_glock_dq(&m_ip->i_gh); 681 gfs2_glock_dq(&m_ip->i_gh);
683 gfs2_holder_uninit(&m_ip->i_gh); 682 gfs2_holder_uninit(&m_ip->i_gh);
684 } 683 }
685 gfs2_glock_dq(&ip->i_gh); 684 gfs2_glock_dq(&ip->i_gh);
686 out_uninit: 685 out_uninit:
687 gfs2_holder_uninit(&ip->i_gh); 686 gfs2_holder_uninit(&ip->i_gh);
688 return error; 687 return error;
689 } 688 }
690 689
691 /** 690 /**
692 * adjust_fs_space - Adjusts the free space available due to gfs2_grow 691 * adjust_fs_space - Adjusts the free space available due to gfs2_grow
693 * @inode: the rindex inode 692 * @inode: the rindex inode
694 */ 693 */
695 static void adjust_fs_space(struct inode *inode) 694 static void adjust_fs_space(struct inode *inode)
696 { 695 {
697 struct gfs2_sbd *sdp = inode->i_sb->s_fs_info; 696 struct gfs2_sbd *sdp = inode->i_sb->s_fs_info;
698 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); 697 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
699 struct gfs2_inode *l_ip = GFS2_I(sdp->sd_sc_inode); 698 struct gfs2_inode *l_ip = GFS2_I(sdp->sd_sc_inode);
700 struct gfs2_statfs_change_host *m_sc = &sdp->sd_statfs_master; 699 struct gfs2_statfs_change_host *m_sc = &sdp->sd_statfs_master;
701 struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local; 700 struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local;
702 struct buffer_head *m_bh, *l_bh; 701 struct buffer_head *m_bh, *l_bh;
703 u64 fs_total, new_free; 702 u64 fs_total, new_free;
704 703
705 /* Total up the file system space, according to the latest rindex. */ 704 /* Total up the file system space, according to the latest rindex. */
706 fs_total = gfs2_ri_total(sdp); 705 fs_total = gfs2_ri_total(sdp);
707 if (gfs2_meta_inode_buffer(m_ip, &m_bh) != 0) 706 if (gfs2_meta_inode_buffer(m_ip, &m_bh) != 0)
708 return; 707 return;
709 708
710 spin_lock(&sdp->sd_statfs_spin); 709 spin_lock(&sdp->sd_statfs_spin);
711 gfs2_statfs_change_in(m_sc, m_bh->b_data + 710 gfs2_statfs_change_in(m_sc, m_bh->b_data +
712 sizeof(struct gfs2_dinode)); 711 sizeof(struct gfs2_dinode));
713 if (fs_total > (m_sc->sc_total + l_sc->sc_total)) 712 if (fs_total > (m_sc->sc_total + l_sc->sc_total))
714 new_free = fs_total - (m_sc->sc_total + l_sc->sc_total); 713 new_free = fs_total - (m_sc->sc_total + l_sc->sc_total);
715 else 714 else
716 new_free = 0; 715 new_free = 0;
717 spin_unlock(&sdp->sd_statfs_spin); 716 spin_unlock(&sdp->sd_statfs_spin);
718 fs_warn(sdp, "File system extended by %llu blocks.\n", 717 fs_warn(sdp, "File system extended by %llu blocks.\n",
719 (unsigned long long)new_free); 718 (unsigned long long)new_free);
720 gfs2_statfs_change(sdp, new_free, new_free, 0); 719 gfs2_statfs_change(sdp, new_free, new_free, 0);
721 720
722 if (gfs2_meta_inode_buffer(l_ip, &l_bh) != 0) 721 if (gfs2_meta_inode_buffer(l_ip, &l_bh) != 0)
723 goto out; 722 goto out;
724 update_statfs(sdp, m_bh, l_bh); 723 update_statfs(sdp, m_bh, l_bh);
725 brelse(l_bh); 724 brelse(l_bh);
726 out: 725 out:
727 brelse(m_bh); 726 brelse(m_bh);
728 } 727 }
729 728
730 /** 729 /**
731 * gfs2_stuffed_write_end - Write end for stuffed files 730 * gfs2_stuffed_write_end - Write end for stuffed files
732 * @inode: The inode 731 * @inode: The inode
733 * @dibh: The buffer_head containing the on-disk inode 732 * @dibh: The buffer_head containing the on-disk inode
734 * @pos: The file position 733 * @pos: The file position
735 * @len: The length of the write 734 * @len: The length of the write
736 * @copied: How much was actually copied by the VFS 735 * @copied: How much was actually copied by the VFS
737 * @page: The page 736 * @page: The page
738 * 737 *
739 * This copies the data from the page into the inode block after 738 * This copies the data from the page into the inode block after
740 * the inode data structure itself. 739 * the inode data structure itself.
741 * 740 *
742 * Returns: errno 741 * Returns: errno
743 */ 742 */
744 static int gfs2_stuffed_write_end(struct inode *inode, struct buffer_head *dibh, 743 static int gfs2_stuffed_write_end(struct inode *inode, struct buffer_head *dibh,
745 loff_t pos, unsigned len, unsigned copied, 744 loff_t pos, unsigned len, unsigned copied,
746 struct page *page) 745 struct page *page)
747 { 746 {
748 struct gfs2_inode *ip = GFS2_I(inode); 747 struct gfs2_inode *ip = GFS2_I(inode);
749 struct gfs2_sbd *sdp = GFS2_SB(inode); 748 struct gfs2_sbd *sdp = GFS2_SB(inode);
750 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); 749 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
751 u64 to = pos + copied; 750 u64 to = pos + copied;
752 void *kaddr; 751 void *kaddr;
753 unsigned char *buf = dibh->b_data + sizeof(struct gfs2_dinode); 752 unsigned char *buf = dibh->b_data + sizeof(struct gfs2_dinode);
754 753
755 BUG_ON((pos + len) > (dibh->b_size - sizeof(struct gfs2_dinode))); 754 BUG_ON((pos + len) > (dibh->b_size - sizeof(struct gfs2_dinode)));
756 kaddr = kmap_atomic(page); 755 kaddr = kmap_atomic(page);
757 memcpy(buf + pos, kaddr + pos, copied); 756 memcpy(buf + pos, kaddr + pos, copied);
758 memset(kaddr + pos + copied, 0, len - copied); 757 memset(kaddr + pos + copied, 0, len - copied);
759 flush_dcache_page(page); 758 flush_dcache_page(page);
760 kunmap_atomic(kaddr); 759 kunmap_atomic(kaddr);
761 760
762 if (!PageUptodate(page)) 761 if (!PageUptodate(page))
763 SetPageUptodate(page); 762 SetPageUptodate(page);
764 unlock_page(page); 763 unlock_page(page);
765 page_cache_release(page); 764 page_cache_release(page);
766 765
767 if (copied) { 766 if (copied) {
768 if (inode->i_size < to) 767 if (inode->i_size < to)
769 i_size_write(inode, to); 768 i_size_write(inode, to);
770 mark_inode_dirty(inode); 769 mark_inode_dirty(inode);
771 } 770 }
772 771
773 if (inode == sdp->sd_rindex) { 772 if (inode == sdp->sd_rindex) {
774 adjust_fs_space(inode); 773 adjust_fs_space(inode);
775 sdp->sd_rindex_uptodate = 0; 774 sdp->sd_rindex_uptodate = 0;
776 } 775 }
777 776
778 brelse(dibh); 777 brelse(dibh);
779 gfs2_trans_end(sdp); 778 gfs2_trans_end(sdp);
780 if (inode == sdp->sd_rindex) { 779 if (inode == sdp->sd_rindex) {
781 gfs2_glock_dq(&m_ip->i_gh); 780 gfs2_glock_dq(&m_ip->i_gh);
782 gfs2_holder_uninit(&m_ip->i_gh); 781 gfs2_holder_uninit(&m_ip->i_gh);
783 } 782 }
784 gfs2_glock_dq(&ip->i_gh); 783 gfs2_glock_dq(&ip->i_gh);
785 gfs2_holder_uninit(&ip->i_gh); 784 gfs2_holder_uninit(&ip->i_gh);
786 return copied; 785 return copied;
787 } 786 }
788 787
789 /** 788 /**
790 * gfs2_write_end 789 * gfs2_write_end
791 * @file: The file to write to 790 * @file: The file to write to
792 * @mapping: The address space to write to 791 * @mapping: The address space to write to
793 * @pos: The file position 792 * @pos: The file position
794 * @len: The length of the data 793 * @len: The length of the data
795 * @copied: How much was actually copied by the VFS 794 * @copied: How much was actually copied by the VFS
796 * @page: The page that has been written 795 * @page: The page that has been written
797 * @fsdata: The fsdata (unused in GFS2) 796 * @fsdata: The fsdata (unused in GFS2)
798 * 797 *
799 * The main write_end function for GFS2. We have a separate one for 798 * The main write_end function for GFS2. We have a separate one for
800 * stuffed files as they are slightly different, otherwise we just 799 * stuffed files as they are slightly different, otherwise we just
801 * put our locking around the VFS provided functions. 800 * put our locking around the VFS provided functions.
802 * 801 *
803 * Returns: errno 802 * Returns: errno
804 */ 803 */
805 804
806 static int gfs2_write_end(struct file *file, struct address_space *mapping, 805 static int gfs2_write_end(struct file *file, struct address_space *mapping,
807 loff_t pos, unsigned len, unsigned copied, 806 loff_t pos, unsigned len, unsigned copied,
808 struct page *page, void *fsdata) 807 struct page *page, void *fsdata)
809 { 808 {
810 struct inode *inode = page->mapping->host; 809 struct inode *inode = page->mapping->host;
811 struct gfs2_inode *ip = GFS2_I(inode); 810 struct gfs2_inode *ip = GFS2_I(inode);
812 struct gfs2_sbd *sdp = GFS2_SB(inode); 811 struct gfs2_sbd *sdp = GFS2_SB(inode);
813 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); 812 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
814 struct buffer_head *dibh; 813 struct buffer_head *dibh;
815 unsigned int from = pos & (PAGE_CACHE_SIZE - 1); 814 unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
816 unsigned int to = from + len; 815 unsigned int to = from + len;
817 int ret; 816 int ret;
818 struct gfs2_trans *tr = current->journal_info; 817 struct gfs2_trans *tr = current->journal_info;
819 BUG_ON(!tr); 818 BUG_ON(!tr);
820 819
821 BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL); 820 BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
822 821
823 ret = gfs2_meta_inode_buffer(ip, &dibh); 822 ret = gfs2_meta_inode_buffer(ip, &dibh);
824 if (unlikely(ret)) { 823 if (unlikely(ret)) {
825 unlock_page(page); 824 unlock_page(page);
826 page_cache_release(page); 825 page_cache_release(page);
827 goto failed; 826 goto failed;
828 } 827 }
829 828
830 if (gfs2_is_stuffed(ip)) 829 if (gfs2_is_stuffed(ip))
831 return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page); 830 return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
832 831
833 if (!gfs2_is_writeback(ip)) 832 if (!gfs2_is_writeback(ip))
834 gfs2_page_add_databufs(ip, page, from, to); 833 gfs2_page_add_databufs(ip, page, from, to);
835 834
836 ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata); 835 ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
837 if (tr->tr_num_buf_new) 836 if (tr->tr_num_buf_new)
838 __mark_inode_dirty(inode, I_DIRTY_DATASYNC); 837 __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
839 else 838 else
840 gfs2_trans_add_meta(ip->i_gl, dibh); 839 gfs2_trans_add_meta(ip->i_gl, dibh);
841 840
842 841
843 if (inode == sdp->sd_rindex) { 842 if (inode == sdp->sd_rindex) {
844 adjust_fs_space(inode); 843 adjust_fs_space(inode);
845 sdp->sd_rindex_uptodate = 0; 844 sdp->sd_rindex_uptodate = 0;
846 } 845 }
847 846
848 brelse(dibh); 847 brelse(dibh);
849 failed: 848 failed:
850 gfs2_trans_end(sdp); 849 gfs2_trans_end(sdp);
851 gfs2_inplace_release(ip); 850 gfs2_inplace_release(ip);
852 if (ip->i_res->rs_qa_qd_num) 851 if (ip->i_res->rs_qa_qd_num)
853 gfs2_quota_unlock(ip); 852 gfs2_quota_unlock(ip);
854 if (inode == sdp->sd_rindex) { 853 if (inode == sdp->sd_rindex) {
855 gfs2_glock_dq(&m_ip->i_gh); 854 gfs2_glock_dq(&m_ip->i_gh);
856 gfs2_holder_uninit(&m_ip->i_gh); 855 gfs2_holder_uninit(&m_ip->i_gh);
857 } 856 }
858 gfs2_glock_dq(&ip->i_gh); 857 gfs2_glock_dq(&ip->i_gh);
859 gfs2_holder_uninit(&ip->i_gh); 858 gfs2_holder_uninit(&ip->i_gh);
860 return ret; 859 return ret;
861 } 860 }
862 861
863 /** 862 /**
864 * gfs2_set_page_dirty - Page dirtying function 863 * gfs2_set_page_dirty - Page dirtying function
865 * @page: The page to dirty 864 * @page: The page to dirty
866 * 865 *
867 * Returns: 1 if it dirtied the page, or 0 otherwise 866 * Returns: 1 if it dirtied the page, or 0 otherwise
868 */ 867 */
869 868
870 static int gfs2_set_page_dirty(struct page *page) 869 static int gfs2_set_page_dirty(struct page *page)
871 { 870 {
872 SetPageChecked(page); 871 SetPageChecked(page);
873 return __set_page_dirty_buffers(page); 872 return __set_page_dirty_buffers(page);
874 } 873 }
875 874
876 /** 875 /**
877 * gfs2_bmap - Block map function 876 * gfs2_bmap - Block map function
878 * @mapping: Address space info 877 * @mapping: Address space info
879 * @lblock: The block to map 878 * @lblock: The block to map
880 * 879 *
881 * Returns: The disk address for the block or 0 on hole or error 880 * Returns: The disk address for the block or 0 on hole or error
882 */ 881 */
883 882
884 static sector_t gfs2_bmap(struct address_space *mapping, sector_t lblock) 883 static sector_t gfs2_bmap(struct address_space *mapping, sector_t lblock)
885 { 884 {
886 struct gfs2_inode *ip = GFS2_I(mapping->host); 885 struct gfs2_inode *ip = GFS2_I(mapping->host);
887 struct gfs2_holder i_gh; 886 struct gfs2_holder i_gh;
888 sector_t dblock = 0; 887 sector_t dblock = 0;
889 int error; 888 int error;
890 889
891 error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &i_gh); 890 error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &i_gh);
892 if (error) 891 if (error)
893 return 0; 892 return 0;
894 893
895 if (!gfs2_is_stuffed(ip)) 894 if (!gfs2_is_stuffed(ip))
896 dblock = generic_block_bmap(mapping, lblock, gfs2_block_map); 895 dblock = generic_block_bmap(mapping, lblock, gfs2_block_map);
897 896
898 gfs2_glock_dq_uninit(&i_gh); 897 gfs2_glock_dq_uninit(&i_gh);
899 898
900 return dblock; 899 return dblock;
901 } 900 }
902 901
903 static void gfs2_discard(struct gfs2_sbd *sdp, struct buffer_head *bh) 902 static void gfs2_discard(struct gfs2_sbd *sdp, struct buffer_head *bh)
904 { 903 {
905 struct gfs2_bufdata *bd; 904 struct gfs2_bufdata *bd;
906 905
907 lock_buffer(bh); 906 lock_buffer(bh);
908 gfs2_log_lock(sdp); 907 gfs2_log_lock(sdp);
909 clear_buffer_dirty(bh); 908 clear_buffer_dirty(bh);
910 bd = bh->b_private; 909 bd = bh->b_private;
911 if (bd) { 910 if (bd) {
912 if (!list_empty(&bd->bd_list) && !buffer_pinned(bh)) 911 if (!list_empty(&bd->bd_list) && !buffer_pinned(bh))
913 list_del_init(&bd->bd_list); 912 list_del_init(&bd->bd_list);
914 else 913 else
915 gfs2_remove_from_journal(bh, current->journal_info, 0); 914 gfs2_remove_from_journal(bh, current->journal_info, 0);
916 } 915 }
917 bh->b_bdev = NULL; 916 bh->b_bdev = NULL;
918 clear_buffer_mapped(bh); 917 clear_buffer_mapped(bh);
919 clear_buffer_req(bh); 918 clear_buffer_req(bh);
920 clear_buffer_new(bh); 919 clear_buffer_new(bh);
921 gfs2_log_unlock(sdp); 920 gfs2_log_unlock(sdp);
922 unlock_buffer(bh); 921 unlock_buffer(bh);
923 } 922 }
924 923
925 static void gfs2_invalidatepage(struct page *page, unsigned int offset, 924 static void gfs2_invalidatepage(struct page *page, unsigned int offset,
926 unsigned int length) 925 unsigned int length)
927 { 926 {
928 struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host); 927 struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
929 unsigned int stop = offset + length; 928 unsigned int stop = offset + length;
930 int partial_page = (offset || length < PAGE_CACHE_SIZE); 929 int partial_page = (offset || length < PAGE_CACHE_SIZE);
931 struct buffer_head *bh, *head; 930 struct buffer_head *bh, *head;
932 unsigned long pos = 0; 931 unsigned long pos = 0;
933 932
934 BUG_ON(!PageLocked(page)); 933 BUG_ON(!PageLocked(page));
935 if (!partial_page) 934 if (!partial_page)
936 ClearPageChecked(page); 935 ClearPageChecked(page);
937 if (!page_has_buffers(page)) 936 if (!page_has_buffers(page))
938 goto out; 937 goto out;
939 938
940 bh = head = page_buffers(page); 939 bh = head = page_buffers(page);
941 do { 940 do {
942 if (pos + bh->b_size > stop) 941 if (pos + bh->b_size > stop)
943 return; 942 return;
944 943
945 if (offset <= pos) 944 if (offset <= pos)
946 gfs2_discard(sdp, bh); 945 gfs2_discard(sdp, bh);
947 pos += bh->b_size; 946 pos += bh->b_size;
948 bh = bh->b_this_page; 947 bh = bh->b_this_page;
949 } while (bh != head); 948 } while (bh != head);
950 out: 949 out:
951 if (!partial_page) 950 if (!partial_page)
952 try_to_release_page(page, 0); 951 try_to_release_page(page, 0);
953 } 952 }
954 953
955 /** 954 /**
956 * gfs2_ok_for_dio - check that dio is valid on this file 955 * gfs2_ok_for_dio - check that dio is valid on this file
957 * @ip: The inode 956 * @ip: The inode
958 * @rw: READ or WRITE 957 * @rw: READ or WRITE
959 * @offset: The offset at which we are reading or writing 958 * @offset: The offset at which we are reading or writing
960 * 959 *
961 * Returns: 0 (to ignore the i/o request and thus fall back to buffered i/o) 960 * Returns: 0 (to ignore the i/o request and thus fall back to buffered i/o)
962 * 1 (to accept the i/o request) 961 * 1 (to accept the i/o request)
963 */ 962 */
964 static int gfs2_ok_for_dio(struct gfs2_inode *ip, int rw, loff_t offset) 963 static int gfs2_ok_for_dio(struct gfs2_inode *ip, int rw, loff_t offset)
965 { 964 {
966 /* 965 /*
967 * Should we return an error here? I can't see that O_DIRECT for 966 * Should we return an error here? I can't see that O_DIRECT for
968 * a stuffed file makes any sense. For now we'll silently fall 967 * a stuffed file makes any sense. For now we'll silently fall
969 * back to buffered I/O 968 * back to buffered I/O
970 */ 969 */
971 if (gfs2_is_stuffed(ip)) 970 if (gfs2_is_stuffed(ip))
972 return 0; 971 return 0;
973 972
974 if (offset >= i_size_read(&ip->i_inode)) 973 if (offset >= i_size_read(&ip->i_inode))
975 return 0; 974 return 0;
976 return 1; 975 return 1;
977 } 976 }
978 977
979 978
980 979
981 static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb, 980 static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,
982 const struct iovec *iov, loff_t offset, 981 const struct iovec *iov, loff_t offset,
983 unsigned long nr_segs) 982 unsigned long nr_segs)
984 { 983 {
985 struct file *file = iocb->ki_filp; 984 struct file *file = iocb->ki_filp;
986 struct inode *inode = file->f_mapping->host; 985 struct inode *inode = file->f_mapping->host;
987 struct address_space *mapping = inode->i_mapping; 986 struct address_space *mapping = inode->i_mapping;
988 struct gfs2_inode *ip = GFS2_I(inode); 987 struct gfs2_inode *ip = GFS2_I(inode);
989 struct gfs2_holder gh; 988 struct gfs2_holder gh;
990 int rv; 989 int rv;
991 990
992 /* 991 /*
993 * Deferred lock, even if it's a write, since we do no allocation 992 * Deferred lock, even if it's a write, since we do no allocation
994 * on this path. All we need to change is atime, and this lock mode 993 * on this path. All we need to change is atime, and this lock mode
995 * ensures that other nodes have flushed their buffered read caches 994 * ensures that other nodes have flushed their buffered read caches
996 * (i.e. their page cache entries for this inode). We do not, 995 * (i.e. their page cache entries for this inode). We do not,
997 * unfortunately have the option of only flushing a range like 996 * unfortunately have the option of only flushing a range like
998 * the VFS does. 997 * the VFS does.
999 */ 998 */
1000 gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, &gh); 999 gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, &gh);
1001 rv = gfs2_glock_nq(&gh); 1000 rv = gfs2_glock_nq(&gh);
1002 if (rv) 1001 if (rv)
1003 return rv; 1002 return rv;
1004 rv = gfs2_ok_for_dio(ip, rw, offset); 1003 rv = gfs2_ok_for_dio(ip, rw, offset);
1005 if (rv != 1) 1004 if (rv != 1)
1006 goto out; /* dio not valid, fall back to buffered i/o */ 1005 goto out; /* dio not valid, fall back to buffered i/o */
1007 1006
1008 /* 1007 /*
1009 * Now since we are holding a deferred (CW) lock at this point, you 1008 * Now since we are holding a deferred (CW) lock at this point, you
1010 * might be wondering why this is ever needed. There is a case however 1009 * might be wondering why this is ever needed. There is a case however
1011 * where we've granted a deferred local lock against a cached exclusive 1010 * where we've granted a deferred local lock against a cached exclusive
1012 * glock. That is ok provided all granted local locks are deferred, but 1011 * glock. That is ok provided all granted local locks are deferred, but
1013 * it also means that it is possible to encounter pages which are 1012 * it also means that it is possible to encounter pages which are
1014 * cached and possibly also mapped. So here we check for that and sort 1013 * cached and possibly also mapped. So here we check for that and sort
1015 * them out ahead of the dio. The glock state machine will take care of 1014 * them out ahead of the dio. The glock state machine will take care of
1016 * everything else. 1015 * everything else.
1017 * 1016 *
1018 * If in fact the cached glock state (gl->gl_state) is deferred (CW) in 1017 * If in fact the cached glock state (gl->gl_state) is deferred (CW) in
1019 * the first place, mapping->nrpages will always be zero. 1018 * the first place, mapping->nrpages will always be zero.
1020 */ 1019 */
1021 if (mapping->nrpages) { 1020 if (mapping->nrpages) {
1022 loff_t lstart = offset & (PAGE_CACHE_SIZE - 1); 1021 loff_t lstart = offset & (PAGE_CACHE_SIZE - 1);
1023 loff_t len = iov_length(iov, nr_segs); 1022 loff_t len = iov_length(iov, nr_segs);
1024 loff_t end = PAGE_ALIGN(offset + len) - 1; 1023 loff_t end = PAGE_ALIGN(offset + len) - 1;
1025 1024
1026 rv = 0; 1025 rv = 0;
1027 if (len == 0) 1026 if (len == 0)
1028 goto out; 1027 goto out;
1029 if (test_and_clear_bit(GIF_SW_PAGED, &ip->i_flags)) 1028 if (test_and_clear_bit(GIF_SW_PAGED, &ip->i_flags))
1030 unmap_shared_mapping_range(ip->i_inode.i_mapping, offset, len); 1029 unmap_shared_mapping_range(ip->i_inode.i_mapping, offset, len);
1031 rv = filemap_write_and_wait_range(mapping, lstart, end); 1030 rv = filemap_write_and_wait_range(mapping, lstart, end);
1032 if (rv) 1031 if (rv)
1033 return rv; 1032 return rv;
1034 truncate_inode_pages_range(mapping, lstart, end); 1033 truncate_inode_pages_range(mapping, lstart, end);
1035 } 1034 }
1036 1035
1037 rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, 1036 rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
1038 offset, nr_segs, gfs2_get_block_direct, 1037 offset, nr_segs, gfs2_get_block_direct,
1039 NULL, NULL, 0); 1038 NULL, NULL, 0);
1040 out: 1039 out:
1041 gfs2_glock_dq(&gh); 1040 gfs2_glock_dq(&gh);
1042 gfs2_holder_uninit(&gh); 1041 gfs2_holder_uninit(&gh);
1043 return rv; 1042 return rv;
1044 } 1043 }
1045 1044
1046 /** 1045 /**
1047 * gfs2_releasepage - free the metadata associated with a page 1046 * gfs2_releasepage - free the metadata associated with a page
1048 * @page: the page that's being released 1047 * @page: the page that's being released
1049 * @gfp_mask: passed from Linux VFS, ignored by us 1048 * @gfp_mask: passed from Linux VFS, ignored by us
1050 * 1049 *
1051 * Call try_to_free_buffers() if the buffers in this page can be 1050 * Call try_to_free_buffers() if the buffers in this page can be
1052 * released. 1051 * released.
1053 * 1052 *
1054 * Returns: 0 1053 * Returns: 0
1055 */ 1054 */
1056 1055
1057 int gfs2_releasepage(struct page *page, gfp_t gfp_mask) 1056 int gfs2_releasepage(struct page *page, gfp_t gfp_mask)
1058 { 1057 {
1059 struct address_space *mapping = page->mapping; 1058 struct address_space *mapping = page->mapping;
1060 struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping); 1059 struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
1061 struct buffer_head *bh, *head; 1060 struct buffer_head *bh, *head;
1062 struct gfs2_bufdata *bd; 1061 struct gfs2_bufdata *bd;
1063 1062
1064 if (!page_has_buffers(page)) 1063 if (!page_has_buffers(page))
1065 return 0; 1064 return 0;
1066 1065
1067 gfs2_log_lock(sdp); 1066 gfs2_log_lock(sdp);
1068 spin_lock(&sdp->sd_ail_lock); 1067 spin_lock(&sdp->sd_ail_lock);
1069 head = bh = page_buffers(page); 1068 head = bh = page_buffers(page);
1070 do { 1069 do {
1071 if (atomic_read(&bh->b_count)) 1070 if (atomic_read(&bh->b_count))
1072 goto cannot_release; 1071 goto cannot_release;
1073 bd = bh->b_private; 1072 bd = bh->b_private;
1074 if (bd && bd->bd_tr) 1073 if (bd && bd->bd_tr)
1075 goto cannot_release; 1074 goto cannot_release;
1076 if (buffer_pinned(bh) || buffer_dirty(bh)) 1075 if (buffer_pinned(bh) || buffer_dirty(bh))
1077 goto not_possible; 1076 goto not_possible;
1078 bh = bh->b_this_page; 1077 bh = bh->b_this_page;
1079 } while(bh != head); 1078 } while(bh != head);
1080 spin_unlock(&sdp->sd_ail_lock); 1079 spin_unlock(&sdp->sd_ail_lock);
1081 gfs2_log_unlock(sdp); 1080 gfs2_log_unlock(sdp);
1082 1081
1083 head = bh = page_buffers(page); 1082 head = bh = page_buffers(page);
1084 do { 1083 do {
1085 gfs2_log_lock(sdp); 1084 gfs2_log_lock(sdp);
1086 bd = bh->b_private; 1085 bd = bh->b_private;
1087 if (bd) { 1086 if (bd) {
1088 gfs2_assert_warn(sdp, bd->bd_bh == bh); 1087 gfs2_assert_warn(sdp, bd->bd_bh == bh);
1089 if (!list_empty(&bd->bd_list)) { 1088 if (!list_empty(&bd->bd_list)) {
1090 if (!buffer_pinned(bh)) 1089 if (!buffer_pinned(bh))
1091 list_del_init(&bd->bd_list); 1090 list_del_init(&bd->bd_list);
1092 else 1091 else
1093 bd = NULL; 1092 bd = NULL;
1094 } 1093 }
1095 if (bd) 1094 if (bd)
1096 bd->bd_bh = NULL; 1095 bd->bd_bh = NULL;
1097 bh->b_private = NULL; 1096 bh->b_private = NULL;
1098 } 1097 }
1099 gfs2_log_unlock(sdp); 1098 gfs2_log_unlock(sdp);
1100 if (bd) 1099 if (bd)
1101 kmem_cache_free(gfs2_bufdata_cachep, bd); 1100 kmem_cache_free(gfs2_bufdata_cachep, bd);
1102 1101
1103 bh = bh->b_this_page; 1102 bh = bh->b_this_page;
1104 } while (bh != head); 1103 } while (bh != head);
1105 1104
1106 return try_to_free_buffers(page); 1105 return try_to_free_buffers(page);
1107 1106
1108 not_possible: /* Should never happen */ 1107 not_possible: /* Should never happen */
1109 WARN_ON(buffer_dirty(bh)); 1108 WARN_ON(buffer_dirty(bh));
1110 WARN_ON(buffer_pinned(bh)); 1109 WARN_ON(buffer_pinned(bh));
1111 cannot_release: 1110 cannot_release:
1112 spin_unlock(&sdp->sd_ail_lock); 1111 spin_unlock(&sdp->sd_ail_lock);
1113 gfs2_log_unlock(sdp); 1112 gfs2_log_unlock(sdp);
1114 return 0; 1113 return 0;
1115 } 1114 }
1116 1115
1117 static const struct address_space_operations gfs2_writeback_aops = { 1116 static const struct address_space_operations gfs2_writeback_aops = {
1118 .writepage = gfs2_writepage, 1117 .writepage = gfs2_writepage,
1119 .writepages = gfs2_writepages, 1118 .writepages = gfs2_writepages,
1120 .readpage = gfs2_readpage, 1119 .readpage = gfs2_readpage,
1121 .readpages = gfs2_readpages, 1120 .readpages = gfs2_readpages,
1122 .write_begin = gfs2_write_begin, 1121 .write_begin = gfs2_write_begin,
1123 .write_end = gfs2_write_end, 1122 .write_end = gfs2_write_end,
1124 .bmap = gfs2_bmap, 1123 .bmap = gfs2_bmap,
1125 .invalidatepage = gfs2_invalidatepage, 1124 .invalidatepage = gfs2_invalidatepage,
1126 .releasepage = gfs2_releasepage, 1125 .releasepage = gfs2_releasepage,
1127 .direct_IO = gfs2_direct_IO, 1126 .direct_IO = gfs2_direct_IO,
1128 .migratepage = buffer_migrate_page, 1127 .migratepage = buffer_migrate_page,
1129 .is_partially_uptodate = block_is_partially_uptodate, 1128 .is_partially_uptodate = block_is_partially_uptodate,
1130 .error_remove_page = generic_error_remove_page, 1129 .error_remove_page = generic_error_remove_page,
1131 }; 1130 };
1132 1131
1133 static const struct address_space_operations gfs2_ordered_aops = { 1132 static const struct address_space_operations gfs2_ordered_aops = {
1134 .writepage = gfs2_writepage, 1133 .writepage = gfs2_writepage,
1135 .writepages = gfs2_writepages, 1134 .writepages = gfs2_writepages,
1136 .readpage = gfs2_readpage, 1135 .readpage = gfs2_readpage,
1137 .readpages = gfs2_readpages, 1136 .readpages = gfs2_readpages,
1138 .write_begin = gfs2_write_begin, 1137 .write_begin = gfs2_write_begin,
1139 .write_end = gfs2_write_end, 1138 .write_end = gfs2_write_end,
1140 .set_page_dirty = gfs2_set_page_dirty, 1139 .set_page_dirty = gfs2_set_page_dirty,
1141 .bmap = gfs2_bmap, 1140 .bmap = gfs2_bmap,
1142 .invalidatepage = gfs2_invalidatepage, 1141 .invalidatepage = gfs2_invalidatepage,
1143 .releasepage = gfs2_releasepage, 1142 .releasepage = gfs2_releasepage,
1144 .direct_IO = gfs2_direct_IO, 1143 .direct_IO = gfs2_direct_IO,
1145 .migratepage = buffer_migrate_page, 1144 .migratepage = buffer_migrate_page,
1146 .is_partially_uptodate = block_is_partially_uptodate, 1145 .is_partially_uptodate = block_is_partially_uptodate,
1147 .error_remove_page = generic_error_remove_page, 1146 .error_remove_page = generic_error_remove_page,
1148 }; 1147 };
1149 1148
1150 static const struct address_space_operations gfs2_jdata_aops = { 1149 static const struct address_space_operations gfs2_jdata_aops = {
1151 .writepage = gfs2_jdata_writepage, 1150 .writepage = gfs2_jdata_writepage,
1152 .writepages = gfs2_jdata_writepages, 1151 .writepages = gfs2_jdata_writepages,
1153 .readpage = gfs2_readpage, 1152 .readpage = gfs2_readpage,
1154 .readpages = gfs2_readpages, 1153 .readpages = gfs2_readpages,
1155 .write_begin = gfs2_write_begin, 1154 .write_begin = gfs2_write_begin,
1156 .write_end = gfs2_write_end, 1155 .write_end = gfs2_write_end,
1157 .set_page_dirty = gfs2_set_page_dirty, 1156 .set_page_dirty = gfs2_set_page_dirty,
1158 .bmap = gfs2_bmap, 1157 .bmap = gfs2_bmap,
1159 .invalidatepage = gfs2_invalidatepage, 1158 .invalidatepage = gfs2_invalidatepage,
1160 .releasepage = gfs2_releasepage, 1159 .releasepage = gfs2_releasepage,
1161 .is_partially_uptodate = block_is_partially_uptodate, 1160 .is_partially_uptodate = block_is_partially_uptodate,
1162 .error_remove_page = generic_error_remove_page, 1161 .error_remove_page = generic_error_remove_page,
1163 }; 1162 };
1164 1163
1165 void gfs2_set_aops(struct inode *inode) 1164 void gfs2_set_aops(struct inode *inode)
1166 { 1165 {
1167 struct gfs2_inode *ip = GFS2_I(inode); 1166 struct gfs2_inode *ip = GFS2_I(inode);
1168 1167
1169 if (gfs2_is_writeback(ip)) 1168 if (gfs2_is_writeback(ip))
1170 inode->i_mapping->a_ops = &gfs2_writeback_aops; 1169 inode->i_mapping->a_ops = &gfs2_writeback_aops;
1171 else if (gfs2_is_ordered(ip)) 1170 else if (gfs2_is_ordered(ip))
1172 inode->i_mapping->a_ops = &gfs2_ordered_aops; 1171 inode->i_mapping->a_ops = &gfs2_ordered_aops;
1173 else if (gfs2_is_jdata(ip)) 1172 else if (gfs2_is_jdata(ip))
1174 inode->i_mapping->a_ops = &gfs2_jdata_aops; 1173 inode->i_mapping->a_ops = &gfs2_jdata_aops;
1175 else 1174 else
1176 BUG(); 1175 BUG();
1177 } 1176 }
1178 1177
1179 1178
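fs/gfs2/meta_io.c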
1 /* 1 /*
2 * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved. 2 * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved.
3 * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved. 3 * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved.
4 * 4 *
5 * This copyrighted material is made available to anyone wishing to use, 5 * This copyrighted material is made available to anyone wishing to use,
6 * modify, copy, or redistribute it subject to the terms and conditions 6 * modify, copy, or redistribute it subject to the terms and conditions
7 * of the GNU General Public License version 2. 7 * of the GNU General Public License version 2.
8 */ 8 */
9 9
10 #include <linux/sched.h> 10 #include <linux/sched.h>
11 #include <linux/slab.h> 11 #include <linux/slab.h>
12 #include <linux/spinlock.h> 12 #include <linux/spinlock.h>
13 #include <linux/completion.h> 13 #include <linux/completion.h>
14 #include <linux/buffer_head.h> 14 #include <linux/buffer_head.h>
15 #include <linux/mm.h> 15 #include <linux/mm.h>
16 #include <linux/pagemap.h> 16 #include <linux/pagemap.h>
17 #include <linux/writeback.h> 17 #include <linux/writeback.h>
18 #include <linux/swap.h> 18 #include <linux/swap.h>
19 #include <linux/delay.h> 19 #include <linux/delay.h>
20 #include <linux/bio.h> 20 #include <linux/bio.h>
21 #include <linux/gfs2_ondisk.h> 21 #include <linux/gfs2_ondisk.h>
22 22
23 #include "gfs2.h" 23 #include "gfs2.h"
24 #include "incore.h" 24 #include "incore.h"
25 #include "glock.h" 25 #include "glock.h"
26 #include "glops.h" 26 #include "glops.h"
27 #include "inode.h" 27 #include "inode.h"
28 #include "log.h" 28 #include "log.h"
29 #include "lops.h" 29 #include "lops.h"
30 #include "meta_io.h" 30 #include "meta_io.h"
31 #include "rgrp.h" 31 #include "rgrp.h"
32 #include "trans.h" 32 #include "trans.h"
33 #include "util.h" 33 #include "util.h"
34 #include "trace_gfs2.h" 34 #include "trace_gfs2.h"
35 35
36 static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wbc) 36 static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wbc)
37 { 37 {
38 struct buffer_head *bh, *head; 38 struct buffer_head *bh, *head;
39 int nr_underway = 0; 39 int nr_underway = 0;
40 int write_op = REQ_META | REQ_PRIO | 40 int write_op = REQ_META | REQ_PRIO |
41 (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); 41 (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
42 42
43 BUG_ON(!PageLocked(page)); 43 BUG_ON(!PageLocked(page));
44 BUG_ON(!page_has_buffers(page)); 44 BUG_ON(!page_has_buffers(page));
45 45
46 head = page_buffers(page); 46 head = page_buffers(page);
47 bh = head; 47 bh = head;
48 48
49 do { 49 do {
50 if (!buffer_mapped(bh)) 50 if (!buffer_mapped(bh))
51 continue; 51 continue;
52 /* 52 /*
53 * If it's a fully non-blocking write attempt and we cannot 53 * If it's a fully non-blocking write attempt and we cannot
54 * lock the buffer then redirty the page. Note that this can 54 * lock the buffer then redirty the page. Note that this can
55 * potentially cause a busy-wait loop from flusher thread and kswapd 55 * potentially cause a busy-wait loop from flusher thread and kswapd
56 * activity, but those code paths have their own higher-level 56 * activity, but those code paths have their own higher-level
57 * throttling. 57 * throttling.
58 */ 58 */
59 if (wbc->sync_mode != WB_SYNC_NONE) { 59 if (wbc->sync_mode != WB_SYNC_NONE) {
60 lock_buffer(bh); 60 lock_buffer(bh);
61 } else if (!trylock_buffer(bh)) { 61 } else if (!trylock_buffer(bh)) {
62 redirty_page_for_writepage(wbc, page); 62 redirty_page_for_writepage(wbc, page);
63 continue; 63 continue;
64 } 64 }
65 if (test_clear_buffer_dirty(bh)) { 65 if (test_clear_buffer_dirty(bh)) {
66 mark_buffer_async_write(bh); 66 mark_buffer_async_write(bh);
67 } else { 67 } else {
68 unlock_buffer(bh); 68 unlock_buffer(bh);
69 } 69 }
70 } while ((bh = bh->b_this_page) != head); 70 } while ((bh = bh->b_this_page) != head);
71 71
72 /* 72 /*
73 * The page and its buffers are protected by PageWriteback(), so we can 73 * The page and its buffers are protected by PageWriteback(), so we can
74 * drop the bh refcounts early. 74 * drop the bh refcounts early.
75 */ 75 */
76 BUG_ON(PageWriteback(page)); 76 BUG_ON(PageWriteback(page));
77 set_page_writeback(page); 77 set_page_writeback(page);
78 78
79 do { 79 do {
80 struct buffer_head *next = bh->b_this_page; 80 struct buffer_head *next = bh->b_this_page;
81 if (buffer_async_write(bh)) { 81 if (buffer_async_write(bh)) {
82 submit_bh(write_op, bh); 82 submit_bh(write_op, bh);
83 nr_underway++; 83 nr_underway++;
84 } 84 }
85 bh = next; 85 bh = next;
86 } while (bh != head); 86 } while (bh != head);
87 unlock_page(page); 87 unlock_page(page);
88 88
89 if (nr_underway == 0) 89 if (nr_underway == 0)
90 end_page_writeback(page); 90 end_page_writeback(page);
91 91
92 return 0; 92 return 0;
93 } 93 }
94 94
95 const struct address_space_operations gfs2_meta_aops = { 95 const struct address_space_operations gfs2_meta_aops = {
96 .writepage = gfs2_aspace_writepage, 96 .writepage = gfs2_aspace_writepage,
97 .releasepage = gfs2_releasepage, 97 .releasepage = gfs2_releasepage,
98 }; 98 };
99 99
100 /** 100 /**
101 * gfs2_getbuf - Get a buffer with a given address space 101 * gfs2_getbuf - Get a buffer with a given address space
102 * @gl: the glock 102 * @gl: the glock
103 * @blkno: the block number (filesystem scope) 103 * @blkno: the block number (filesystem scope)
104 * @create: 1 if the buffer should be created 104 * @create: 1 if the buffer should be created
105 * 105 *
106 * Returns: the buffer 106 * Returns: the buffer
107 */ 107 */
108 108
109 struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create) 109 struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create)
110 { 110 {
111 struct address_space *mapping = gfs2_glock2aspace(gl); 111 struct address_space *mapping = gfs2_glock2aspace(gl);
112 struct gfs2_sbd *sdp = gl->gl_sbd; 112 struct gfs2_sbd *sdp = gl->gl_sbd;
113 struct page *page; 113 struct page *page;
114 struct buffer_head *bh; 114 struct buffer_head *bh;
115 unsigned int shift; 115 unsigned int shift;
116 unsigned long index; 116 unsigned long index;
117 unsigned int bufnum; 117 unsigned int bufnum;
118 118
119 shift = PAGE_CACHE_SHIFT - sdp->sd_sb.sb_bsize_shift; 119 shift = PAGE_CACHE_SHIFT - sdp->sd_sb.sb_bsize_shift;
120 index = blkno >> shift; /* convert block to page */ 120 index = blkno >> shift; /* convert block to page */
121 bufnum = blkno - (index << shift); /* block buf index within page */ 121 bufnum = blkno - (index << shift); /* block buf index within page */
122 122
123 if (create) { 123 if (create) {
124 for (;;) { 124 for (;;) {
125 page = grab_cache_page(mapping, index); 125 page = grab_cache_page(mapping, index);
126 if (page) 126 if (page)
127 break; 127 break;
128 yield(); 128 yield();
129 } 129 }
130 } else { 130 } else {
131 page = find_lock_page(mapping, index); 131 page = find_get_page_flags(mapping, index,
132 FGP_LOCK|FGP_ACCESSED);
132 if (!page) 133 if (!page)
133 return NULL; 134 return NULL;
134 } 135 }
135 136
136 if (!page_has_buffers(page)) 137 if (!page_has_buffers(page))
137 create_empty_buffers(page, sdp->sd_sb.sb_bsize, 0); 138 create_empty_buffers(page, sdp->sd_sb.sb_bsize, 0);
138 139
139 /* Locate header for our buffer within our page */ 140 /* Locate header for our buffer within our page */
140 for (bh = page_buffers(page); bufnum--; bh = bh->b_this_page) 141 for (bh = page_buffers(page); bufnum--; bh = bh->b_this_page)
141 /* Do nothing */; 142 /* Do nothing */;
142 get_bh(bh); 143 get_bh(bh);
143 144
144 if (!buffer_mapped(bh)) 145 if (!buffer_mapped(bh))
145 map_bh(bh, sdp->sd_vfs, blkno); 146 map_bh(bh, sdp->sd_vfs, blkno);
146 147
147 unlock_page(page); 148 unlock_page(page);
148 mark_page_accessed(page);
149 page_cache_release(page); 149 page_cache_release(page);
150 150
151 return bh; 151 return bh;
152 } 152 }
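The hunk above is the functional change to gfs2_getbuf(): the non-create lookup now uses find_get_page_flags() with FGP_LOCK|FGP_ACCESSED, so the page comes back locked and already counted as accessed, and the separate mark_page_accessed() call after unlock_page() goes away. A minimal sketch of that lookup pattern, assuming only the find_get_page_flags()/FGP_* interface shown in the hunk (the helper name is hypothetical):

	static struct page *lookup_locked_accessed(struct address_space *mapping,
						   pgoff_t index)
	{
		struct page *page;

		/* FGP_LOCK returns the page locked; FGP_ACCESSED folds the
		 * former mark_page_accessed() call into the lookup itself. */
		page = find_get_page_flags(mapping, index,
					   FGP_LOCK | FGP_ACCESSED);
		if (!page)
			return NULL;

		/* The caller must unlock_page() and page_cache_release()
		 * when done, exactly as gfs2_getbuf() does above. */
		return page;
	}

The create path is unchanged and still loops on grab_cache_page() until a page is obtained.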
153 153
154 static void meta_prep_new(struct buffer_head *bh) 154 static void meta_prep_new(struct buffer_head *bh)
155 { 155 {
156 struct gfs2_meta_header *mh = (struct gfs2_meta_header *)bh->b_data; 156 struct gfs2_meta_header *mh = (struct gfs2_meta_header *)bh->b_data;
157 157
158 lock_buffer(bh); 158 lock_buffer(bh);
159 clear_buffer_dirty(bh); 159 clear_buffer_dirty(bh);
160 set_buffer_uptodate(bh); 160 set_buffer_uptodate(bh);
161 unlock_buffer(bh); 161 unlock_buffer(bh);
162 162
163 mh->mh_magic = cpu_to_be32(GFS2_MAGIC); 163 mh->mh_magic = cpu_to_be32(GFS2_MAGIC);
164 } 164 }
165 165
166 /** 166 /**
167 * gfs2_meta_new - Get a block 167 * gfs2_meta_new - Get a block
168 * @gl: The glock associated with this block 168 * @gl: The glock associated with this block
169 * @blkno: The block number 169 * @blkno: The block number
170 * 170 *
171 * Returns: The buffer 171 * Returns: The buffer
172 */ 172 */
173 173
174 struct buffer_head *gfs2_meta_new(struct gfs2_glock *gl, u64 blkno) 174 struct buffer_head *gfs2_meta_new(struct gfs2_glock *gl, u64 blkno)
175 { 175 {
176 struct buffer_head *bh; 176 struct buffer_head *bh;
177 bh = gfs2_getbuf(gl, blkno, CREATE); 177 bh = gfs2_getbuf(gl, blkno, CREATE);
178 meta_prep_new(bh); 178 meta_prep_new(bh);
179 return bh; 179 return bh;
180 } 180 }
181 181
182 /** 182 /**
183 * gfs2_meta_read - Read a block from disk 183 * gfs2_meta_read - Read a block from disk
184 * @gl: The glock covering the block 184 * @gl: The glock covering the block
185 * @blkno: The block number 185 * @blkno: The block number
186 * @flags: flags 186 * @flags: flags
187 * @bhp: the place where the buffer is returned (NULL on failure) 187 * @bhp: the place where the buffer is returned (NULL on failure)
188 * 188 *
189 * Returns: errno 189 * Returns: errno
190 */ 190 */
191 191
192 int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags, 192 int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags,
193 struct buffer_head **bhp) 193 struct buffer_head **bhp)
194 { 194 {
195 struct gfs2_sbd *sdp = gl->gl_sbd; 195 struct gfs2_sbd *sdp = gl->gl_sbd;
196 struct buffer_head *bh; 196 struct buffer_head *bh;
197 197
198 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) { 198 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) {
199 *bhp = NULL; 199 *bhp = NULL;
200 return -EIO; 200 return -EIO;
201 } 201 }
202 202
203 *bhp = bh = gfs2_getbuf(gl, blkno, CREATE); 203 *bhp = bh = gfs2_getbuf(gl, blkno, CREATE);
204 204
205 lock_buffer(bh); 205 lock_buffer(bh);
206 if (buffer_uptodate(bh)) { 206 if (buffer_uptodate(bh)) {
207 unlock_buffer(bh); 207 unlock_buffer(bh);
208 return 0; 208 return 0;
209 } 209 }
210 bh->b_end_io = end_buffer_read_sync; 210 bh->b_end_io = end_buffer_read_sync;
211 get_bh(bh); 211 get_bh(bh);
212 submit_bh(READ_SYNC | REQ_META | REQ_PRIO, bh); 212 submit_bh(READ_SYNC | REQ_META | REQ_PRIO, bh);
213 if (!(flags & DIO_WAIT)) 213 if (!(flags & DIO_WAIT))
214 return 0; 214 return 0;
215 215
216 wait_on_buffer(bh); 216 wait_on_buffer(bh);
217 if (unlikely(!buffer_uptodate(bh))) { 217 if (unlikely(!buffer_uptodate(bh))) {
218 struct gfs2_trans *tr = current->journal_info; 218 struct gfs2_trans *tr = current->journal_info;
219 if (tr && tr->tr_touched) 219 if (tr && tr->tr_touched)
220 gfs2_io_error_bh(sdp, bh); 220 gfs2_io_error_bh(sdp, bh);
221 brelse(bh); 221 brelse(bh);
222 *bhp = NULL; 222 *bhp = NULL;
223 return -EIO; 223 return -EIO;
224 } 224 }
225 225
226 return 0; 226 return 0;
227 } 227 }
228 228
229 /** 229 /**
230 * gfs2_meta_wait - Wait for a block read from disk to complete 230 * gfs2_meta_wait - Wait for a block read from disk to complete
231 * @sdp: the filesystem 231 * @sdp: the filesystem
232 * @bh: The block to wait for 232 * @bh: The block to wait for
233 * 233 *
234 * Returns: errno 234 * Returns: errno
235 */ 235 */
236 236
237 int gfs2_meta_wait(struct gfs2_sbd *sdp, struct buffer_head *bh) 237 int gfs2_meta_wait(struct gfs2_sbd *sdp, struct buffer_head *bh)
238 { 238 {
239 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) 239 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
240 return -EIO; 240 return -EIO;
241 241
242 wait_on_buffer(bh); 242 wait_on_buffer(bh);
243 243
244 if (!buffer_uptodate(bh)) { 244 if (!buffer_uptodate(bh)) {
245 struct gfs2_trans *tr = current->journal_info; 245 struct gfs2_trans *tr = current->journal_info;
246 if (tr && tr->tr_touched) 246 if (tr && tr->tr_touched)
247 gfs2_io_error_bh(sdp, bh); 247 gfs2_io_error_bh(sdp, bh);
248 return -EIO; 248 return -EIO;
249 } 249 }
250 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) 250 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
251 return -EIO; 251 return -EIO;
252 252
253 return 0; 253 return 0;
254 } 254 }
255 255
256 void gfs2_remove_from_journal(struct buffer_head *bh, struct gfs2_trans *tr, int meta) 256 void gfs2_remove_from_journal(struct buffer_head *bh, struct gfs2_trans *tr, int meta)
257 { 257 {
258 struct address_space *mapping = bh->b_page->mapping; 258 struct address_space *mapping = bh->b_page->mapping;
259 struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping); 259 struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
260 struct gfs2_bufdata *bd = bh->b_private; 260 struct gfs2_bufdata *bd = bh->b_private;
261 int was_pinned = 0; 261 int was_pinned = 0;
262 262
263 if (test_clear_buffer_pinned(bh)) { 263 if (test_clear_buffer_pinned(bh)) {
264 trace_gfs2_pin(bd, 0); 264 trace_gfs2_pin(bd, 0);
265 atomic_dec(&sdp->sd_log_pinned); 265 atomic_dec(&sdp->sd_log_pinned);
266 list_del_init(&bd->bd_list); 266 list_del_init(&bd->bd_list);
267 if (meta) { 267 if (meta) {
268 gfs2_assert_warn(sdp, sdp->sd_log_num_buf); 268 gfs2_assert_warn(sdp, sdp->sd_log_num_buf);
269 sdp->sd_log_num_buf--; 269 sdp->sd_log_num_buf--;
270 tr->tr_num_buf_rm++; 270 tr->tr_num_buf_rm++;
271 } else { 271 } else {
272 gfs2_assert_warn(sdp, sdp->sd_log_num_databuf); 272 gfs2_assert_warn(sdp, sdp->sd_log_num_databuf);
273 sdp->sd_log_num_databuf--; 273 sdp->sd_log_num_databuf--;
274 tr->tr_num_databuf_rm++; 274 tr->tr_num_databuf_rm++;
275 } 275 }
276 tr->tr_touched = 1; 276 tr->tr_touched = 1;
277 was_pinned = 1; 277 was_pinned = 1;
278 brelse(bh); 278 brelse(bh);
279 } 279 }
280 if (bd) { 280 if (bd) {
281 spin_lock(&sdp->sd_ail_lock); 281 spin_lock(&sdp->sd_ail_lock);
282 if (bd->bd_tr) { 282 if (bd->bd_tr) {
283 gfs2_trans_add_revoke(sdp, bd); 283 gfs2_trans_add_revoke(sdp, bd);
284 } else if (was_pinned) { 284 } else if (was_pinned) {
285 bh->b_private = NULL; 285 bh->b_private = NULL;
286 kmem_cache_free(gfs2_bufdata_cachep, bd); 286 kmem_cache_free(gfs2_bufdata_cachep, bd);
287 } 287 }
288 spin_unlock(&sdp->sd_ail_lock); 288 spin_unlock(&sdp->sd_ail_lock);
289 } 289 }
290 clear_buffer_dirty(bh); 290 clear_buffer_dirty(bh);
291 clear_buffer_uptodate(bh); 291 clear_buffer_uptodate(bh);
292 } 292 }
293 293
294 /** 294 /**
295 * gfs2_meta_wipe - make sure an inode's buffers are no longer dirty/pinned 295 * gfs2_meta_wipe - make sure an inode's buffers are no longer dirty/pinned
296 * @ip: the inode who owns the buffers 296 * @ip: the inode who owns the buffers
297 * @bstart: the first buffer in the run 297 * @bstart: the first buffer in the run
298 * @blen: the number of buffers in the run 298 * @blen: the number of buffers in the run
299 * 299 *
300 */ 300 */
301 301
302 void gfs2_meta_wipe(struct gfs2_inode *ip, u64 bstart, u32 blen) 302 void gfs2_meta_wipe(struct gfs2_inode *ip, u64 bstart, u32 blen)
303 { 303 {
304 struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode); 304 struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
305 struct buffer_head *bh; 305 struct buffer_head *bh;
306 306
307 while (blen) { 307 while (blen) {
308 bh = gfs2_getbuf(ip->i_gl, bstart, NO_CREATE); 308 bh = gfs2_getbuf(ip->i_gl, bstart, NO_CREATE);
309 if (bh) { 309 if (bh) {
310 lock_buffer(bh); 310 lock_buffer(bh);
311 gfs2_log_lock(sdp); 311 gfs2_log_lock(sdp);
312 gfs2_remove_from_journal(bh, current->journal_info, 1); 312 gfs2_remove_from_journal(bh, current->journal_info, 1);
313 gfs2_log_unlock(sdp); 313 gfs2_log_unlock(sdp);
314 unlock_buffer(bh); 314 unlock_buffer(bh);
315 brelse(bh); 315 brelse(bh);
316 } 316 }
317 317
318 bstart++; 318 bstart++;
319 blen--; 319 blen--;
320 } 320 }
321 } 321 }
322 322
323 /** 323 /**
324 * gfs2_meta_indirect_buffer - Get a metadata buffer 324 * gfs2_meta_indirect_buffer - Get a metadata buffer
325 * @ip: The GFS2 inode 325 * @ip: The GFS2 inode
326 * @height: The level of this buf in the metadata (indir addr) tree (if any) 326 * @height: The level of this buf in the metadata (indir addr) tree (if any)
327 * @num: The block number (device relative) of the buffer 327 * @num: The block number (device relative) of the buffer
328 * @bhp: the buffer is returned here 328 * @bhp: the buffer is returned here
329 * 329 *
330 * Returns: errno 330 * Returns: errno
331 */ 331 */
332 332
333 int gfs2_meta_indirect_buffer(struct gfs2_inode *ip, int height, u64 num, 333 int gfs2_meta_indirect_buffer(struct gfs2_inode *ip, int height, u64 num,
334 struct buffer_head **bhp) 334 struct buffer_head **bhp)
335 { 335 {
336 struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode); 336 struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
337 struct gfs2_glock *gl = ip->i_gl; 337 struct gfs2_glock *gl = ip->i_gl;
338 struct buffer_head *bh; 338 struct buffer_head *bh;
339 int ret = 0; 339 int ret = 0;
340 u32 mtype = height ? GFS2_METATYPE_IN : GFS2_METATYPE_DI; 340 u32 mtype = height ? GFS2_METATYPE_IN : GFS2_METATYPE_DI;
341 341
342 ret = gfs2_meta_read(gl, num, DIO_WAIT, &bh); 342 ret = gfs2_meta_read(gl, num, DIO_WAIT, &bh);
343 if (ret == 0 && gfs2_metatype_check(sdp, bh, mtype)) { 343 if (ret == 0 && gfs2_metatype_check(sdp, bh, mtype)) {
344 brelse(bh); 344 brelse(bh);
345 ret = -EIO; 345 ret = -EIO;
346 } 346 }
347 *bhp = bh; 347 *bhp = bh;
348 return ret; 348 return ret;
349 } 349 }
350 350
351 /** 351 /**
352 * gfs2_meta_ra - start readahead on an extent of a file 352 * gfs2_meta_ra - start readahead on an extent of a file
353 * @gl: the glock the blocks belong to 353 * @gl: the glock the blocks belong to
354 * @dblock: the starting disk block 354 * @dblock: the starting disk block
355 * @extlen: the number of blocks in the extent 355 * @extlen: the number of blocks in the extent
356 * 356 *
357 * returns: the first buffer in the extent 357 * returns: the first buffer in the extent
358 */ 358 */
359 359
360 struct buffer_head *gfs2_meta_ra(struct gfs2_glock *gl, u64 dblock, u32 extlen) 360 struct buffer_head *gfs2_meta_ra(struct gfs2_glock *gl, u64 dblock, u32 extlen)
361 { 361 {
362 struct gfs2_sbd *sdp = gl->gl_sbd; 362 struct gfs2_sbd *sdp = gl->gl_sbd;
363 struct buffer_head *first_bh, *bh; 363 struct buffer_head *first_bh, *bh;
364 u32 max_ra = gfs2_tune_get(sdp, gt_max_readahead) >> 364 u32 max_ra = gfs2_tune_get(sdp, gt_max_readahead) >>
365 sdp->sd_sb.sb_bsize_shift; 365 sdp->sd_sb.sb_bsize_shift;
366 366
367 BUG_ON(!extlen); 367 BUG_ON(!extlen);
368 368
369 if (max_ra < 1) 369 if (max_ra < 1)
370 max_ra = 1; 370 max_ra = 1;
371 if (extlen > max_ra) 371 if (extlen > max_ra)
372 extlen = max_ra; 372 extlen = max_ra;
373 373
374 first_bh = gfs2_getbuf(gl, dblock, CREATE); 374 first_bh = gfs2_getbuf(gl, dblock, CREATE);
375 375
376 if (buffer_uptodate(first_bh)) 376 if (buffer_uptodate(first_bh))
377 goto out; 377 goto out;
378 if (!buffer_locked(first_bh)) 378 if (!buffer_locked(first_bh))
379 ll_rw_block(READ_SYNC | REQ_META, 1, &first_bh); 379 ll_rw_block(READ_SYNC | REQ_META, 1, &first_bh);
380 380
381 dblock++; 381 dblock++;
382 extlen--; 382 extlen--;
383 383
384 while (extlen) { 384 while (extlen) {
385 bh = gfs2_getbuf(gl, dblock, CREATE); 385 bh = gfs2_getbuf(gl, dblock, CREATE);
386 386
387 if (!buffer_uptodate(bh) && !buffer_locked(bh)) 387 if (!buffer_uptodate(bh) && !buffer_locked(bh))
388 ll_rw_block(READA | REQ_META, 1, &bh); 388 ll_rw_block(READA | REQ_META, 1, &bh);
389 brelse(bh); 389 brelse(bh);
390 dblock++; 390 dblock++;
391 extlen--; 391 extlen--;
392 if (!buffer_locked(first_bh) && buffer_uptodate(first_bh)) 392 if (!buffer_locked(first_bh) && buffer_uptodate(first_bh))
393 goto out; 393 goto out;
394 } 394 }
395 395
396 wait_on_buffer(first_bh); 396 wait_on_buffer(first_bh);
397 out: 397 out:
398 return first_bh; 398 return first_bh;
399 } 399 }
400 400
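fs/ntfs/attrib.c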
1 /** 1 /**
2 * attrib.c - NTFS attribute operations. Part of the Linux-NTFS project. 2 * attrib.c - NTFS attribute operations. Part of the Linux-NTFS project.
3 * 3 *
4 * Copyright (c) 2001-2012 Anton Altaparmakov and Tuxera Inc. 4 * Copyright (c) 2001-2012 Anton Altaparmakov and Tuxera Inc.
5 * Copyright (c) 2002 Richard Russon 5 * Copyright (c) 2002 Richard Russon
6 * 6 *
7 * This program/include file is free software; you can redistribute it and/or 7 * This program/include file is free software; you can redistribute it and/or
8 * modify it under the terms of the GNU General Public License as published 8 * modify it under the terms of the GNU General Public License as published
9 * by the Free Software Foundation; either version 2 of the License, or 9 * by the Free Software Foundation; either version 2 of the License, or
10 * (at your option) any later version. 10 * (at your option) any later version.
11 * 11 *
12 * This program/include file is distributed in the hope that it will be 12 * This program/include file is distributed in the hope that it will be
13 * useful, but WITHOUT ANY WARRANTY; without even the implied warranty 13 * useful, but WITHOUT ANY WARRANTY; without even the implied warranty
14 * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 14 * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 * GNU General Public License for more details. 15 * GNU General Public License for more details.
16 * 16 *
17 * You should have received a copy of the GNU General Public License 17 * You should have received a copy of the GNU General Public License
18 * along with this program (in the main directory of the Linux-NTFS 18 * along with this program (in the main directory of the Linux-NTFS
19 * distribution in the file COPYING); if not, write to the Free Software 19 * distribution in the file COPYING); if not, write to the Free Software
20 * Foundation,Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 20 * Foundation,Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
21 */ 21 */
22 22
23 #include <linux/buffer_head.h> 23 #include <linux/buffer_head.h>
24 #include <linux/sched.h> 24 #include <linux/sched.h>
25 #include <linux/slab.h> 25 #include <linux/slab.h>
26 #include <linux/swap.h> 26 #include <linux/swap.h>
27 #include <linux/writeback.h> 27 #include <linux/writeback.h>
28 28
29 #include "attrib.h" 29 #include "attrib.h"
30 #include "debug.h" 30 #include "debug.h"
31 #include "layout.h" 31 #include "layout.h"
32 #include "lcnalloc.h" 32 #include "lcnalloc.h"
33 #include "malloc.h" 33 #include "malloc.h"
34 #include "mft.h" 34 #include "mft.h"
35 #include "ntfs.h" 35 #include "ntfs.h"
36 #include "types.h" 36 #include "types.h"
37 37
38 /** 38 /**
39 * ntfs_map_runlist_nolock - map (a part of) a runlist of an ntfs inode 39 * ntfs_map_runlist_nolock - map (a part of) a runlist of an ntfs inode
40 * @ni: ntfs inode for which to map (part of) a runlist 40 * @ni: ntfs inode for which to map (part of) a runlist
41 * @vcn: map runlist part containing this vcn 41 * @vcn: map runlist part containing this vcn
42 * @ctx: active attribute search context if present or NULL if not 42 * @ctx: active attribute search context if present or NULL if not
43 * 43 *
44 * Map the part of a runlist containing the @vcn of the ntfs inode @ni. 44 * Map the part of a runlist containing the @vcn of the ntfs inode @ni.
45 * 45 *
46 * If @ctx is specified, it is an active search context of @ni and its base mft 46 * If @ctx is specified, it is an active search context of @ni and its base mft
47 * record. This is needed when ntfs_map_runlist_nolock() encounters unmapped 47 * record. This is needed when ntfs_map_runlist_nolock() encounters unmapped
48 * runlist fragments and allows their mapping. If you do not have the mft 48 * runlist fragments and allows their mapping. If you do not have the mft
49 * record mapped, you can specify @ctx as NULL and ntfs_map_runlist_nolock() 49 * record mapped, you can specify @ctx as NULL and ntfs_map_runlist_nolock()
50 * will perform the necessary mapping and unmapping. 50 * will perform the necessary mapping and unmapping.
51 * 51 *
52 * Note, ntfs_map_runlist_nolock() saves the state of @ctx on entry and 52 * Note, ntfs_map_runlist_nolock() saves the state of @ctx on entry and
53 * restores it before returning. Thus, @ctx will be left pointing to the same 53 * restores it before returning. Thus, @ctx will be left pointing to the same
54 * attribute on return as on entry. However, the actual pointers in @ctx may 54 * attribute on return as on entry. However, the actual pointers in @ctx may
55 * point to different memory locations on return, so you must remember to reset 55 * point to different memory locations on return, so you must remember to reset
56 * any cached pointers from the @ctx, i.e. after the call to 56 * any cached pointers from the @ctx, i.e. after the call to
57 * ntfs_map_runlist_nolock(), you will probably want to do: 57 * ntfs_map_runlist_nolock(), you will probably want to do:
58 * m = ctx->mrec; 58 * m = ctx->mrec;
59 * a = ctx->attr; 59 * a = ctx->attr;
60 * Assuming you cache ctx->attr in a variable @a of type ATTR_RECORD * and that 60 * Assuming you cache ctx->attr in a variable @a of type ATTR_RECORD * and that
61 * you cache ctx->mrec in a variable @m of type MFT_RECORD *. 61 * you cache ctx->mrec in a variable @m of type MFT_RECORD *.
62 * 62 *
63 * Return 0 on success and -errno on error. There is one special error code 63 * Return 0 on success and -errno on error. There is one special error code
64 * which is not an error as such. This is -ENOENT. It means that @vcn is out 64 * which is not an error as such. This is -ENOENT. It means that @vcn is out
65 * of bounds of the runlist. 65 * of bounds of the runlist.
66 * 66 *
67 * Note the runlist can be NULL after this function returns if @vcn is zero and 67 * Note the runlist can be NULL after this function returns if @vcn is zero and
68 * the attribute has zero allocated size, i.e. there simply is no runlist. 68 * the attribute has zero allocated size, i.e. there simply is no runlist.
69 * 69 *
70 * WARNING: If @ctx is supplied, regardless of whether success or failure is 70 * WARNING: If @ctx is supplied, regardless of whether success or failure is
71 * returned, you need to check IS_ERR(@ctx->mrec) and if 'true' the @ctx 71 * returned, you need to check IS_ERR(@ctx->mrec) and if 'true' the @ctx
72 * is no longer valid, i.e. you need to either call 72 * is no longer valid, i.e. you need to either call
73 * ntfs_attr_reinit_search_ctx() or ntfs_attr_put_search_ctx() on it. 73 * ntfs_attr_reinit_search_ctx() or ntfs_attr_put_search_ctx() on it.
74 * In that case PTR_ERR(@ctx->mrec) will give you the error code for 74 * In that case PTR_ERR(@ctx->mrec) will give you the error code for
75 * why the mapping of the old inode failed. 75 * why the mapping of the old inode failed.
76 * 76 *
77 * Locking: - The runlist described by @ni must be locked for writing on entry 77 * Locking: - The runlist described by @ni must be locked for writing on entry
78 * and is locked on return. Note the runlist will be modified. 78 * and is locked on return. Note the runlist will be modified.
79 * - If @ctx is NULL, the base mft record of @ni must not be mapped on 79 * - If @ctx is NULL, the base mft record of @ni must not be mapped on
80 * entry and it will be left unmapped on return. 80 * entry and it will be left unmapped on return.
81 * - If @ctx is not NULL, the base mft record must be mapped on entry 81 * - If @ctx is not NULL, the base mft record must be mapped on entry
82 * and it will be left mapped on return. 82 * and it will be left mapped on return.
83 */ 83 */
84 int ntfs_map_runlist_nolock(ntfs_inode *ni, VCN vcn, ntfs_attr_search_ctx *ctx) 84 int ntfs_map_runlist_nolock(ntfs_inode *ni, VCN vcn, ntfs_attr_search_ctx *ctx)
85 { 85 {
86 VCN end_vcn; 86 VCN end_vcn;
87 unsigned long flags; 87 unsigned long flags;
88 ntfs_inode *base_ni; 88 ntfs_inode *base_ni;
89 MFT_RECORD *m; 89 MFT_RECORD *m;
90 ATTR_RECORD *a; 90 ATTR_RECORD *a;
91 runlist_element *rl; 91 runlist_element *rl;
92 struct page *put_this_page = NULL; 92 struct page *put_this_page = NULL;
93 int err = 0; 93 int err = 0;
94 bool ctx_is_temporary, ctx_needs_reset; 94 bool ctx_is_temporary, ctx_needs_reset;
95 ntfs_attr_search_ctx old_ctx = { NULL, }; 95 ntfs_attr_search_ctx old_ctx = { NULL, };
96 96
97 ntfs_debug("Mapping runlist part containing vcn 0x%llx.", 97 ntfs_debug("Mapping runlist part containing vcn 0x%llx.",
98 (unsigned long long)vcn); 98 (unsigned long long)vcn);
99 if (!NInoAttr(ni)) 99 if (!NInoAttr(ni))
100 base_ni = ni; 100 base_ni = ni;
101 else 101 else
102 base_ni = ni->ext.base_ntfs_ino; 102 base_ni = ni->ext.base_ntfs_ino;
103 if (!ctx) { 103 if (!ctx) {
104 ctx_is_temporary = ctx_needs_reset = true; 104 ctx_is_temporary = ctx_needs_reset = true;
105 m = map_mft_record(base_ni); 105 m = map_mft_record(base_ni);
106 if (IS_ERR(m)) 106 if (IS_ERR(m))
107 return PTR_ERR(m); 107 return PTR_ERR(m);
108 ctx = ntfs_attr_get_search_ctx(base_ni, m); 108 ctx = ntfs_attr_get_search_ctx(base_ni, m);
109 if (unlikely(!ctx)) { 109 if (unlikely(!ctx)) {
110 err = -ENOMEM; 110 err = -ENOMEM;
111 goto err_out; 111 goto err_out;
112 } 112 }
113 } else { 113 } else {
114 VCN allocated_size_vcn; 114 VCN allocated_size_vcn;
115 115
116 BUG_ON(IS_ERR(ctx->mrec)); 116 BUG_ON(IS_ERR(ctx->mrec));
117 a = ctx->attr; 117 a = ctx->attr;
118 BUG_ON(!a->non_resident); 118 BUG_ON(!a->non_resident);
119 ctx_is_temporary = false; 119 ctx_is_temporary = false;
120 end_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn); 120 end_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn);
121 read_lock_irqsave(&ni->size_lock, flags); 121 read_lock_irqsave(&ni->size_lock, flags);
122 allocated_size_vcn = ni->allocated_size >> 122 allocated_size_vcn = ni->allocated_size >>
123 ni->vol->cluster_size_bits; 123 ni->vol->cluster_size_bits;
124 read_unlock_irqrestore(&ni->size_lock, flags); 124 read_unlock_irqrestore(&ni->size_lock, flags);
125 if (!a->data.non_resident.lowest_vcn && end_vcn <= 0) 125 if (!a->data.non_resident.lowest_vcn && end_vcn <= 0)
126 end_vcn = allocated_size_vcn - 1; 126 end_vcn = allocated_size_vcn - 1;
127 /* 127 /*
128 * If we already have the attribute extent containing @vcn in 128 * If we already have the attribute extent containing @vcn in
129 * @ctx, no need to look it up again. We slightly cheat in 129 * @ctx, no need to look it up again. We slightly cheat in
130 * that if vcn exceeds the allocated size, we will refuse to 130 * that if vcn exceeds the allocated size, we will refuse to
131 * map the runlist below, so there is definitely no need to get 131 * map the runlist below, so there is definitely no need to get
132 * the right attribute extent. 132 * the right attribute extent.
133 */ 133 */
134 if (vcn >= allocated_size_vcn || (a->type == ni->type && 134 if (vcn >= allocated_size_vcn || (a->type == ni->type &&
135 a->name_length == ni->name_len && 135 a->name_length == ni->name_len &&
136 !memcmp((u8*)a + le16_to_cpu(a->name_offset), 136 !memcmp((u8*)a + le16_to_cpu(a->name_offset),
137 ni->name, ni->name_len) && 137 ni->name, ni->name_len) &&
138 sle64_to_cpu(a->data.non_resident.lowest_vcn) 138 sle64_to_cpu(a->data.non_resident.lowest_vcn)
139 <= vcn && end_vcn >= vcn)) 139 <= vcn && end_vcn >= vcn))
140 ctx_needs_reset = false; 140 ctx_needs_reset = false;
141 else { 141 else {
142 /* Save the old search context. */ 142 /* Save the old search context. */
143 old_ctx = *ctx; 143 old_ctx = *ctx;
144 /* 144 /*
145 * If the currently mapped (extent) inode is not the 145 * If the currently mapped (extent) inode is not the
146 * base inode we will unmap it when we reinitialize the 146 * base inode we will unmap it when we reinitialize the
147 * search context which means we need to get a 147 * search context which means we need to get a
148 * reference to the page containing the mapped mft 148 * reference to the page containing the mapped mft
149 * record so we do not accidentally drop changes to the 149 * record so we do not accidentally drop changes to the
150 * mft record when it has not been marked dirty yet. 150 * mft record when it has not been marked dirty yet.
151 */ 151 */
152 if (old_ctx.base_ntfs_ino && old_ctx.ntfs_ino != 152 if (old_ctx.base_ntfs_ino && old_ctx.ntfs_ino !=
153 old_ctx.base_ntfs_ino) { 153 old_ctx.base_ntfs_ino) {
154 put_this_page = old_ctx.ntfs_ino->page; 154 put_this_page = old_ctx.ntfs_ino->page;
155 page_cache_get(put_this_page); 155 page_cache_get(put_this_page);
156 } 156 }
157 /* 157 /*
158 * Reinitialize the search context so we can lookup the 158 * Reinitialize the search context so we can lookup the
159 * needed attribute extent. 159 * needed attribute extent.
160 */ 160 */
161 ntfs_attr_reinit_search_ctx(ctx); 161 ntfs_attr_reinit_search_ctx(ctx);
162 ctx_needs_reset = true; 162 ctx_needs_reset = true;
163 } 163 }
164 } 164 }
165 if (ctx_needs_reset) { 165 if (ctx_needs_reset) {
166 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, 166 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
167 CASE_SENSITIVE, vcn, NULL, 0, ctx); 167 CASE_SENSITIVE, vcn, NULL, 0, ctx);
168 if (unlikely(err)) { 168 if (unlikely(err)) {
169 if (err == -ENOENT) 169 if (err == -ENOENT)
170 err = -EIO; 170 err = -EIO;
171 goto err_out; 171 goto err_out;
172 } 172 }
173 BUG_ON(!ctx->attr->non_resident); 173 BUG_ON(!ctx->attr->non_resident);
174 } 174 }
175 a = ctx->attr; 175 a = ctx->attr;
176 /* 176 /*
177 * Only decompress the mapping pairs if @vcn is inside it. Otherwise 177 * Only decompress the mapping pairs if @vcn is inside it. Otherwise
178 * we get into problems when we try to map an out of bounds vcn because 178 * we get into problems when we try to map an out of bounds vcn because
179 * we then try to map the already mapped runlist fragment and 179 * we then try to map the already mapped runlist fragment and
180 * ntfs_mapping_pairs_decompress() fails. 180 * ntfs_mapping_pairs_decompress() fails.
181 */ 181 */
182 end_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn) + 1; 182 end_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn) + 1;
183 if (unlikely(vcn && vcn >= end_vcn)) { 183 if (unlikely(vcn && vcn >= end_vcn)) {
184 err = -ENOENT; 184 err = -ENOENT;
185 goto err_out; 185 goto err_out;
186 } 186 }
187 rl = ntfs_mapping_pairs_decompress(ni->vol, a, ni->runlist.rl); 187 rl = ntfs_mapping_pairs_decompress(ni->vol, a, ni->runlist.rl);
188 if (IS_ERR(rl)) 188 if (IS_ERR(rl))
189 err = PTR_ERR(rl); 189 err = PTR_ERR(rl);
190 else 190 else
191 ni->runlist.rl = rl; 191 ni->runlist.rl = rl;
192 err_out: 192 err_out:
193 if (ctx_is_temporary) { 193 if (ctx_is_temporary) {
194 if (likely(ctx)) 194 if (likely(ctx))
195 ntfs_attr_put_search_ctx(ctx); 195 ntfs_attr_put_search_ctx(ctx);
196 unmap_mft_record(base_ni); 196 unmap_mft_record(base_ni);
197 } else if (ctx_needs_reset) { 197 } else if (ctx_needs_reset) {
198 /* 198 /*
199 * If there is no attribute list, restoring the search context 199 * If there is no attribute list, restoring the search context
200 * is accomplished simply by copying the saved context back over 200 * is accomplished simply by copying the saved context back over
201 * the caller supplied context. If there is an attribute list, 201 * the caller supplied context. If there is an attribute list,
202 * things are more complicated as we need to deal with mapping 202 * things are more complicated as we need to deal with mapping
203 * of mft records and resulting potential changes in pointers. 203 * of mft records and resulting potential changes in pointers.
204 */ 204 */
205 if (NInoAttrList(base_ni)) { 205 if (NInoAttrList(base_ni)) {
206 /* 206 /*
207 * If the currently mapped (extent) inode is not the 207 * If the currently mapped (extent) inode is not the
208 * one we had before, we need to unmap it and map the 208 * one we had before, we need to unmap it and map the
209 * old one. 209 * old one.
210 */ 210 */
211 if (ctx->ntfs_ino != old_ctx.ntfs_ino) { 211 if (ctx->ntfs_ino != old_ctx.ntfs_ino) {
212 /* 212 /*
213 * If the currently mapped inode is not the 213 * If the currently mapped inode is not the
214 * base inode, unmap it. 214 * base inode, unmap it.
215 */ 215 */
216 if (ctx->base_ntfs_ino && ctx->ntfs_ino != 216 if (ctx->base_ntfs_ino && ctx->ntfs_ino !=
217 ctx->base_ntfs_ino) { 217 ctx->base_ntfs_ino) {
218 unmap_extent_mft_record(ctx->ntfs_ino); 218 unmap_extent_mft_record(ctx->ntfs_ino);
219 ctx->mrec = ctx->base_mrec; 219 ctx->mrec = ctx->base_mrec;
220 BUG_ON(!ctx->mrec); 220 BUG_ON(!ctx->mrec);
221 } 221 }
222 /* 222 /*
223 * If the old mapped inode is not the base 223 * If the old mapped inode is not the base
224 * inode, map it. 224 * inode, map it.
225 */ 225 */
226 if (old_ctx.base_ntfs_ino && 226 if (old_ctx.base_ntfs_ino &&
227 old_ctx.ntfs_ino != 227 old_ctx.ntfs_ino !=
228 old_ctx.base_ntfs_ino) { 228 old_ctx.base_ntfs_ino) {
229 retry_map: 229 retry_map:
230 ctx->mrec = map_mft_record( 230 ctx->mrec = map_mft_record(
231 old_ctx.ntfs_ino); 231 old_ctx.ntfs_ino);
232 /* 232 /*
233 * Something bad has happened. If out 233 * Something bad has happened. If out
234 * of memory retry till it succeeds. 234 * of memory retry till it succeeds.
235 * Any other errors are fatal and we 235 * Any other errors are fatal and we
236 * return the error code in ctx->mrec. 236 * return the error code in ctx->mrec.
237 * Let the caller deal with it... We 237 * Let the caller deal with it... We
238 * just need to fudge things so the 238 * just need to fudge things so the
239 * caller can reinit and/or put the 239 * caller can reinit and/or put the
240 * search context safely. 240 * search context safely.
241 */ 241 */
242 if (IS_ERR(ctx->mrec)) { 242 if (IS_ERR(ctx->mrec)) {
243 if (PTR_ERR(ctx->mrec) == 243 if (PTR_ERR(ctx->mrec) ==
244 -ENOMEM) { 244 -ENOMEM) {
245 schedule(); 245 schedule();
246 goto retry_map; 246 goto retry_map;
247 } else 247 } else
248 old_ctx.ntfs_ino = 248 old_ctx.ntfs_ino =
249 old_ctx. 249 old_ctx.
250 base_ntfs_ino; 250 base_ntfs_ino;
251 } 251 }
252 } 252 }
253 } 253 }
254 /* Update the changed pointers in the saved context. */ 254 /* Update the changed pointers in the saved context. */
255 if (ctx->mrec != old_ctx.mrec) { 255 if (ctx->mrec != old_ctx.mrec) {
256 if (!IS_ERR(ctx->mrec)) 256 if (!IS_ERR(ctx->mrec))
257 old_ctx.attr = (ATTR_RECORD*)( 257 old_ctx.attr = (ATTR_RECORD*)(
258 (u8*)ctx->mrec + 258 (u8*)ctx->mrec +
259 ((u8*)old_ctx.attr - 259 ((u8*)old_ctx.attr -
260 (u8*)old_ctx.mrec)); 260 (u8*)old_ctx.mrec));
261 old_ctx.mrec = ctx->mrec; 261 old_ctx.mrec = ctx->mrec;
262 } 262 }
263 } 263 }
264 /* Restore the search context to the saved one. */ 264 /* Restore the search context to the saved one. */
265 *ctx = old_ctx; 265 *ctx = old_ctx;
266 /* 266 /*
267 * We drop the reference on the page we took earlier. In the 267 * We drop the reference on the page we took earlier. In the
268 * case that IS_ERR(ctx->mrec) is true this means we might lose 268 * case that IS_ERR(ctx->mrec) is true this means we might lose
269 * some changes to the mft record that had been made between 269 * some changes to the mft record that had been made between
270 * the last time it was marked dirty/written out and now. This 270 * the last time it was marked dirty/written out and now. This
271 * at this stage is not a problem as the mapping error is fatal 271 * at this stage is not a problem as the mapping error is fatal
272 * enough that the mft record cannot be written out anyway and 272 * enough that the mft record cannot be written out anyway and
273 * the caller is very likely to shutdown the whole inode 273 * the caller is very likely to shutdown the whole inode
274 * immediately and mark the volume dirty for chkdsk to pick up 274 * immediately and mark the volume dirty for chkdsk to pick up
275 * the pieces anyway. 275 * the pieces anyway.
276 */ 276 */
277 if (put_this_page) 277 if (put_this_page)
278 page_cache_release(put_this_page); 278 page_cache_release(put_this_page);
279 } 279 }
280 return err; 280 return err;
281 } 281 }
282 282
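A minimal caller sketch of the pattern the kernel-doc above asks for when an active @ctx is passed in: re-read any cached pointers after the call and check IS_ERR(@ctx->mrec) whether or not the call succeeded. This is not part of the patch; the example_remap() wrapper and its exact error handling are assumptions.

static int example_remap(ntfs_inode *ni, VCN vcn, ntfs_attr_search_ctx *ctx)
{
        ATTR_RECORD *a;
        int err;

        /* The caller holds ni->runlist.lock for writing. */
        err = ntfs_map_runlist_nolock(ni, vcn, ctx);
        /* Per the WARNING above, check the context even on success. */
        if (IS_ERR(ctx->mrec)) {
                err = PTR_ERR(ctx->mrec);
                /* Make @ctx safe to reuse or to put. */
                ntfs_attr_reinit_search_ctx(ctx);
                return err;
        }
        /* The mft record may have moved; refresh cached pointers. */
        a = ctx->attr;
        ntfs_debug("Attribute type 0x%x after remap.", le32_to_cpu(a->type));
        /* -ENOENT only means @vcn is past the end of the runlist. */
        return err;
}
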
283 /** 283 /**
284 * ntfs_map_runlist - map (a part of) a runlist of an ntfs inode 284 * ntfs_map_runlist - map (a part of) a runlist of an ntfs inode
285 * @ni: ntfs inode for which to map (part of) a runlist 285 * @ni: ntfs inode for which to map (part of) a runlist
286 * @vcn: map runlist part containing this vcn 286 * @vcn: map runlist part containing this vcn
287 * 287 *
288 * Map the part of a runlist containing the @vcn of the ntfs inode @ni. 288 * Map the part of a runlist containing the @vcn of the ntfs inode @ni.
289 * 289 *
290 * Return 0 on success and -errno on error. There is one special error code 290 * Return 0 on success and -errno on error. There is one special error code
291 * which is not an error as such. This is -ENOENT. It means that @vcn is out 291 * which is not an error as such. This is -ENOENT. It means that @vcn is out
292 * of bounds of the runlist. 292 * of bounds of the runlist.
293 * 293 *
294 * Locking: - The runlist must be unlocked on entry and is unlocked on return. 294 * Locking: - The runlist must be unlocked on entry and is unlocked on return.
295 * - This function takes the runlist lock for writing and may modify 295 * - This function takes the runlist lock for writing and may modify
296 * the runlist. 296 * the runlist.
297 */ 297 */
298 int ntfs_map_runlist(ntfs_inode *ni, VCN vcn) 298 int ntfs_map_runlist(ntfs_inode *ni, VCN vcn)
299 { 299 {
300 int err = 0; 300 int err = 0;
301 301
302 down_write(&ni->runlist.lock); 302 down_write(&ni->runlist.lock);
303 /* Make sure someone else didn't do the work while we were sleeping. */ 303 /* Make sure someone else didn't do the work while we were sleeping. */
304 if (likely(ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn) <= 304 if (likely(ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn) <=
305 LCN_RL_NOT_MAPPED)) 305 LCN_RL_NOT_MAPPED))
306 err = ntfs_map_runlist_nolock(ni, vcn, NULL); 306 err = ntfs_map_runlist_nolock(ni, vcn, NULL);
307 up_write(&ni->runlist.lock); 307 up_write(&ni->runlist.lock);
308 return err; 308 return err;
309 } 309 }
310 310
311 /** 311 /**
312 * ntfs_attr_vcn_to_lcn_nolock - convert a vcn into a lcn given an ntfs inode 312 * ntfs_attr_vcn_to_lcn_nolock - convert a vcn into a lcn given an ntfs inode
313 * @ni: ntfs inode of the attribute whose runlist to search 313 * @ni: ntfs inode of the attribute whose runlist to search
314 * @vcn: vcn to convert 314 * @vcn: vcn to convert
315 * @write_locked: true if the runlist is locked for writing 315 * @write_locked: true if the runlist is locked for writing
316 * 316 *
317 * Find the virtual cluster number @vcn in the runlist of the ntfs attribute 317 * Find the virtual cluster number @vcn in the runlist of the ntfs attribute
318 * described by the ntfs inode @ni and return the corresponding logical cluster 318 * described by the ntfs inode @ni and return the corresponding logical cluster
319 * number (lcn). 319 * number (lcn).
320 * 320 *
321 * If the @vcn is not mapped yet, an attempt is made to map the attribute 321 * If the @vcn is not mapped yet, an attempt is made to map the attribute
322 * extent containing the @vcn and the vcn to lcn conversion is retried. 322 * extent containing the @vcn and the vcn to lcn conversion is retried.
323 * 323 *
324 * If @write_locked is true the caller has locked the runlist for writing and 324 * If @write_locked is true the caller has locked the runlist for writing and
325 * if false for reading. 325 * if false for reading.
326 * 326 *
327 * Since lcns must be >= 0, we use negative return codes with special meaning: 327 * Since lcns must be >= 0, we use negative return codes with special meaning:
328 * 328 *
329 * Return code Meaning / Description 329 * Return code Meaning / Description
330 * ========================================== 330 * ==========================================
331 * LCN_HOLE Hole / not allocated on disk. 331 * LCN_HOLE Hole / not allocated on disk.
332 * LCN_ENOENT There is no such vcn in the runlist, i.e. @vcn is out of bounds. 332 * LCN_ENOENT There is no such vcn in the runlist, i.e. @vcn is out of bounds.
333 * LCN_ENOMEM Not enough memory to map runlist. 333 * LCN_ENOMEM Not enough memory to map runlist.
334 * LCN_EIO Critical error (runlist/file is corrupt, i/o error, etc). 334 * LCN_EIO Critical error (runlist/file is corrupt, i/o error, etc).
335 * 335 *
336 * Locking: - The runlist must be locked on entry and is left locked on return. 336 * Locking: - The runlist must be locked on entry and is left locked on return.
337 * - If @write_locked is 'false', i.e. the runlist is locked for reading, 337 * - If @write_locked is 'false', i.e. the runlist is locked for reading,
338 * the lock may be dropped inside the function so you cannot rely on 338 * the lock may be dropped inside the function so you cannot rely on
339 * the runlist still being the same when this function returns. 339 * the runlist still being the same when this function returns.
340 */ 340 */
341 LCN ntfs_attr_vcn_to_lcn_nolock(ntfs_inode *ni, const VCN vcn, 341 LCN ntfs_attr_vcn_to_lcn_nolock(ntfs_inode *ni, const VCN vcn,
342 const bool write_locked) 342 const bool write_locked)
343 { 343 {
344 LCN lcn; 344 LCN lcn;
345 unsigned long flags; 345 unsigned long flags;
346 bool is_retry = false; 346 bool is_retry = false;
347 347
348 BUG_ON(!ni); 348 BUG_ON(!ni);
349 ntfs_debug("Entering for i_ino 0x%lx, vcn 0x%llx, %s_locked.", 349 ntfs_debug("Entering for i_ino 0x%lx, vcn 0x%llx, %s_locked.",
350 ni->mft_no, (unsigned long long)vcn, 350 ni->mft_no, (unsigned long long)vcn,
351 write_locked ? "write" : "read"); 351 write_locked ? "write" : "read");
352 BUG_ON(!NInoNonResident(ni)); 352 BUG_ON(!NInoNonResident(ni));
353 BUG_ON(vcn < 0); 353 BUG_ON(vcn < 0);
354 if (!ni->runlist.rl) { 354 if (!ni->runlist.rl) {
355 read_lock_irqsave(&ni->size_lock, flags); 355 read_lock_irqsave(&ni->size_lock, flags);
356 if (!ni->allocated_size) { 356 if (!ni->allocated_size) {
357 read_unlock_irqrestore(&ni->size_lock, flags); 357 read_unlock_irqrestore(&ni->size_lock, flags);
358 return LCN_ENOENT; 358 return LCN_ENOENT;
359 } 359 }
360 read_unlock_irqrestore(&ni->size_lock, flags); 360 read_unlock_irqrestore(&ni->size_lock, flags);
361 } 361 }
362 retry_remap: 362 retry_remap:
363 /* Convert vcn to lcn. If that fails map the runlist and retry once. */ 363 /* Convert vcn to lcn. If that fails map the runlist and retry once. */
364 lcn = ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn); 364 lcn = ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn);
365 if (likely(lcn >= LCN_HOLE)) { 365 if (likely(lcn >= LCN_HOLE)) {
366 ntfs_debug("Done, lcn 0x%llx.", (long long)lcn); 366 ntfs_debug("Done, lcn 0x%llx.", (long long)lcn);
367 return lcn; 367 return lcn;
368 } 368 }
369 if (lcn != LCN_RL_NOT_MAPPED) { 369 if (lcn != LCN_RL_NOT_MAPPED) {
370 if (lcn != LCN_ENOENT) 370 if (lcn != LCN_ENOENT)
371 lcn = LCN_EIO; 371 lcn = LCN_EIO;
372 } else if (!is_retry) { 372 } else if (!is_retry) {
373 int err; 373 int err;
374 374
375 if (!write_locked) { 375 if (!write_locked) {
376 up_read(&ni->runlist.lock); 376 up_read(&ni->runlist.lock);
377 down_write(&ni->runlist.lock); 377 down_write(&ni->runlist.lock);
378 if (unlikely(ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn) != 378 if (unlikely(ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn) !=
379 LCN_RL_NOT_MAPPED)) { 379 LCN_RL_NOT_MAPPED)) {
380 up_write(&ni->runlist.lock); 380 up_write(&ni->runlist.lock);
381 down_read(&ni->runlist.lock); 381 down_read(&ni->runlist.lock);
382 goto retry_remap; 382 goto retry_remap;
383 } 383 }
384 } 384 }
385 err = ntfs_map_runlist_nolock(ni, vcn, NULL); 385 err = ntfs_map_runlist_nolock(ni, vcn, NULL);
386 if (!write_locked) { 386 if (!write_locked) {
387 up_write(&ni->runlist.lock); 387 up_write(&ni->runlist.lock);
388 down_read(&ni->runlist.lock); 388 down_read(&ni->runlist.lock);
389 } 389 }
390 if (likely(!err)) { 390 if (likely(!err)) {
391 is_retry = true; 391 is_retry = true;
392 goto retry_remap; 392 goto retry_remap;
393 } 393 }
394 if (err == -ENOENT) 394 if (err == -ENOENT)
395 lcn = LCN_ENOENT; 395 lcn = LCN_ENOENT;
396 else if (err == -ENOMEM) 396 else if (err == -ENOMEM)
397 lcn = LCN_ENOMEM; 397 lcn = LCN_ENOMEM;
398 else 398 else
399 lcn = LCN_EIO; 399 lcn = LCN_EIO;
400 } 400 }
401 if (lcn != LCN_ENOENT) 401 if (lcn != LCN_ENOENT)
402 ntfs_error(ni->vol->sb, "Failed with error code %lli.", 402 ntfs_error(ni->vol->sb, "Failed with error code %lli.",
403 (long long)lcn); 403 (long long)lcn);
404 return lcn; 404 return lcn;
405 } 405 }
406 406
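A hedged sketch of consuming the negative LCN_* codes listed in the kernel-doc above, folding them back into -errno values; example_vcn_to_lcn() and its choice of a read lock are assumptions, not part of the patch.

static int example_vcn_to_lcn(ntfs_inode *ni, VCN vcn, LCN *lcn_ret)
{
        LCN lcn;

        /*
         * The caller holds ni->runlist.lock for reading; with a read lock
         * the helper may drop and retake it, so the runlist can change.
         */
        lcn = ntfs_attr_vcn_to_lcn_nolock(ni, vcn, false);
        if (lcn >= 0 || lcn == LCN_HOLE) {
                /* Mapped cluster, or a sparse hole (lcn == LCN_HOLE). */
                *lcn_ret = lcn;
                return 0;
        }
        if (lcn == LCN_ENOENT)
                return -ENOENT; /* @vcn is out of bounds. */
        if (lcn == LCN_ENOMEM)
                return -ENOMEM; /* Could not map the runlist. */
        return -EIO;            /* Corruption or i/o error. */
}
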
407 /** 407 /**
408 * ntfs_attr_find_vcn_nolock - find a vcn in the runlist of an ntfs inode 408 * ntfs_attr_find_vcn_nolock - find a vcn in the runlist of an ntfs inode
409 * @ni: ntfs inode describing the runlist to search 409 * @ni: ntfs inode describing the runlist to search
410 * @vcn: vcn to find 410 * @vcn: vcn to find
411 * @ctx: active attribute search context if present or NULL if not 411 * @ctx: active attribute search context if present or NULL if not
412 * 412 *
413 * Find the virtual cluster number @vcn in the runlist described by the ntfs 413 * Find the virtual cluster number @vcn in the runlist described by the ntfs
414 * inode @ni and return the address of the runlist element containing the @vcn. 414 * inode @ni and return the address of the runlist element containing the @vcn.
415 * 415 *
416 * If the @vcn is not mapped yet, an attempt is made to map the attribute 416 * If the @vcn is not mapped yet, an attempt is made to map the attribute

417 * extent containing the @vcn and the vcn to lcn conversion is retried. 417 * extent containing the @vcn and the vcn to lcn conversion is retried.
418 * 418 *
419 * If @ctx is specified, it is an active search context of @ni and its base mft 419 * If @ctx is specified, it is an active search context of @ni and its base mft
420 * record. This is needed when ntfs_attr_find_vcn_nolock() encounters unmapped 420 * record. This is needed when ntfs_attr_find_vcn_nolock() encounters unmapped
421 * runlist fragments and allows their mapping. If you do not have the mft 421 * runlist fragments and allows their mapping. If you do not have the mft
422 * record mapped, you can specify @ctx as NULL and ntfs_attr_find_vcn_nolock() 422 * record mapped, you can specify @ctx as NULL and ntfs_attr_find_vcn_nolock()
423 * will perform the necessary mapping and unmapping. 423 * will perform the necessary mapping and unmapping.
424 * 424 *
425 * Note, ntfs_attr_find_vcn_nolock() saves the state of @ctx on entry and 425 * Note, ntfs_attr_find_vcn_nolock() saves the state of @ctx on entry and
426 * restores it before returning. Thus, @ctx will be left pointing to the same 426 * restores it before returning. Thus, @ctx will be left pointing to the same
427 * attribute on return as on entry. However, the actual pointers in @ctx may 427 * attribute on return as on entry. However, the actual pointers in @ctx may
428 * point to different memory locations on return, so you must remember to reset 428 * point to different memory locations on return, so you must remember to reset
429 * any cached pointers from the @ctx, i.e. after the call to 429 * any cached pointers from the @ctx, i.e. after the call to
430 * ntfs_attr_find_vcn_nolock(), you will probably want to do: 430 * ntfs_attr_find_vcn_nolock(), you will probably want to do:
431 * m = ctx->mrec; 431 * m = ctx->mrec;
432 * a = ctx->attr; 432 * a = ctx->attr;
433 * Assuming you cache ctx->attr in a variable @a of type ATTR_RECORD * and that 433 * Assuming you cache ctx->attr in a variable @a of type ATTR_RECORD * and that
434 * you cache ctx->mrec in a variable @m of type MFT_RECORD *. 434 * you cache ctx->mrec in a variable @m of type MFT_RECORD *.
435 * Note you need to distinguish between the lcn of the returned runlist element 435 * Note you need to distinguish between the lcn of the returned runlist element
436 * being >= 0 and LCN_HOLE. In the latter case you have to return zeroes on 436 * being >= 0 and LCN_HOLE. In the latter case you have to return zeroes on
437 * read and allocate clusters on write. 437 * read and allocate clusters on write.
438 * 438 *
439 * Return the runlist element containing the @vcn on success and 439 * Return the runlist element containing the @vcn on success and
440 * ERR_PTR(-errno) on error. You need to test the return value with IS_ERR() 440 * ERR_PTR(-errno) on error. You need to test the return value with IS_ERR()
441 * to decide if the return is success or failure and PTR_ERR() to get to the 441 * to decide if the return is success or failure and PTR_ERR() to get to the
442 * error code if IS_ERR() is true. 442 * error code if IS_ERR() is true.
443 * 443 *
444 * The possible error return codes are: 444 * The possible error return codes are:
445 * -ENOENT - No such vcn in the runlist, i.e. @vcn is out of bounds. 445 * -ENOENT - No such vcn in the runlist, i.e. @vcn is out of bounds.
446 * -ENOMEM - Not enough memory to map runlist. 446 * -ENOMEM - Not enough memory to map runlist.
447 * -EIO - Critical error (runlist/file is corrupt, i/o error, etc). 447 * -EIO - Critical error (runlist/file is corrupt, i/o error, etc).
448 * 448 *
449 * WARNING: If @ctx is supplied, regardless of whether success or failure is 449 * WARNING: If @ctx is supplied, regardless of whether success or failure is
450 * returned, you need to check IS_ERR(@ctx->mrec) and if 'true' the @ctx 450 * returned, you need to check IS_ERR(@ctx->mrec) and if 'true' the @ctx
451 * is no longer valid, i.e. you need to either call 451 * is no longer valid, i.e. you need to either call
452 * ntfs_attr_reinit_search_ctx() or ntfs_attr_put_search_ctx() on it. 452 * ntfs_attr_reinit_search_ctx() or ntfs_attr_put_search_ctx() on it.
453 * In that case PTR_ERR(@ctx->mrec) will give you the error code for 453 * In that case PTR_ERR(@ctx->mrec) will give you the error code for
454 * why the mapping of the old inode failed. 454 * why the mapping of the old inode failed.
455 * 455 *
456 * Locking: - The runlist described by @ni must be locked for writing on entry 456 * Locking: - The runlist described by @ni must be locked for writing on entry
457 * and is locked on return. Note the runlist may be modified when 457 * and is locked on return. Note the runlist may be modified when
458 * needed runlist fragments need to be mapped. 458 * needed runlist fragments need to be mapped.
459 * - If @ctx is NULL, the base mft record of @ni must not be mapped on 459 * - If @ctx is NULL, the base mft record of @ni must not be mapped on
460 * entry and it will be left unmapped on return. 460 * entry and it will be left unmapped on return.
461 * - If @ctx is not NULL, the base mft record must be mapped on entry 461 * - If @ctx is not NULL, the base mft record must be mapped on entry
462 * and it will be left mapped on return. 462 * and it will be left mapped on return.
463 */ 463 */
464 runlist_element *ntfs_attr_find_vcn_nolock(ntfs_inode *ni, const VCN vcn, 464 runlist_element *ntfs_attr_find_vcn_nolock(ntfs_inode *ni, const VCN vcn,
465 ntfs_attr_search_ctx *ctx) 465 ntfs_attr_search_ctx *ctx)
466 { 466 {
467 unsigned long flags; 467 unsigned long flags;
468 runlist_element *rl; 468 runlist_element *rl;
469 int err = 0; 469 int err = 0;
470 bool is_retry = false; 470 bool is_retry = false;
471 471
472 BUG_ON(!ni); 472 BUG_ON(!ni);
473 ntfs_debug("Entering for i_ino 0x%lx, vcn 0x%llx, with%s ctx.", 473 ntfs_debug("Entering for i_ino 0x%lx, vcn 0x%llx, with%s ctx.",
474 ni->mft_no, (unsigned long long)vcn, ctx ? "" : "out"); 474 ni->mft_no, (unsigned long long)vcn, ctx ? "" : "out");
475 BUG_ON(!NInoNonResident(ni)); 475 BUG_ON(!NInoNonResident(ni));
476 BUG_ON(vcn < 0); 476 BUG_ON(vcn < 0);
477 if (!ni->runlist.rl) { 477 if (!ni->runlist.rl) {
478 read_lock_irqsave(&ni->size_lock, flags); 478 read_lock_irqsave(&ni->size_lock, flags);
479 if (!ni->allocated_size) { 479 if (!ni->allocated_size) {
480 read_unlock_irqrestore(&ni->size_lock, flags); 480 read_unlock_irqrestore(&ni->size_lock, flags);
481 return ERR_PTR(-ENOENT); 481 return ERR_PTR(-ENOENT);
482 } 482 }
483 read_unlock_irqrestore(&ni->size_lock, flags); 483 read_unlock_irqrestore(&ni->size_lock, flags);
484 } 484 }
485 retry_remap: 485 retry_remap:
486 rl = ni->runlist.rl; 486 rl = ni->runlist.rl;
487 if (likely(rl && vcn >= rl[0].vcn)) { 487 if (likely(rl && vcn >= rl[0].vcn)) {
488 while (likely(rl->length)) { 488 while (likely(rl->length)) {
489 if (unlikely(vcn < rl[1].vcn)) { 489 if (unlikely(vcn < rl[1].vcn)) {
490 if (likely(rl->lcn >= LCN_HOLE)) { 490 if (likely(rl->lcn >= LCN_HOLE)) {
491 ntfs_debug("Done."); 491 ntfs_debug("Done.");
492 return rl; 492 return rl;
493 } 493 }
494 break; 494 break;
495 } 495 }
496 rl++; 496 rl++;
497 } 497 }
498 if (likely(rl->lcn != LCN_RL_NOT_MAPPED)) { 498 if (likely(rl->lcn != LCN_RL_NOT_MAPPED)) {
499 if (likely(rl->lcn == LCN_ENOENT)) 499 if (likely(rl->lcn == LCN_ENOENT))
500 err = -ENOENT; 500 err = -ENOENT;
501 else 501 else
502 err = -EIO; 502 err = -EIO;
503 } 503 }
504 } 504 }
505 if (!err && !is_retry) { 505 if (!err && !is_retry) {
506 /* 506 /*
507 * If the search context is invalid we cannot map the unmapped 507 * If the search context is invalid we cannot map the unmapped
508 * region. 508 * region.
509 */ 509 */
510 if (IS_ERR(ctx->mrec)) 510 if (IS_ERR(ctx->mrec))
511 err = PTR_ERR(ctx->mrec); 511 err = PTR_ERR(ctx->mrec);
512 else { 512 else {
513 /* 513 /*
514 * The @vcn is in an unmapped region, map the runlist 514 * The @vcn is in an unmapped region, map the runlist
515 * and retry. 515 * and retry.
516 */ 516 */
517 err = ntfs_map_runlist_nolock(ni, vcn, ctx); 517 err = ntfs_map_runlist_nolock(ni, vcn, ctx);
518 if (likely(!err)) { 518 if (likely(!err)) {
519 is_retry = true; 519 is_retry = true;
520 goto retry_remap; 520 goto retry_remap;
521 } 521 }
522 } 522 }
523 if (err == -EINVAL) 523 if (err == -EINVAL)
524 err = -EIO; 524 err = -EIO;
525 } else if (!err) 525 } else if (!err)
526 err = -EIO; 526 err = -EIO;
527 if (err != -ENOENT) 527 if (err != -ENOENT)
528 ntfs_error(ni->vol->sb, "Failed with error code %i.", err); 528 ntfs_error(ni->vol->sb, "Failed with error code %i.", err);
529 return ERR_PTR(err); 529 return ERR_PTR(err);
530 } 530 }
531 531
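Similarly, a hedged sketch of a caller of ntfs_attr_find_vcn_nolock() that makes the >= 0 versus LCN_HOLE distinction and honours the WARNING about @ctx; example_find_vcn() is an assumption, not part of the patch.

static int example_find_vcn(ntfs_inode *ni, VCN vcn, ntfs_attr_search_ctx *ctx)
{
        runlist_element *rl;

        /* The caller holds ni->runlist.lock for writing. */
        rl = ntfs_attr_find_vcn_nolock(ni, vcn, ctx);
        if (IS_ERR(rl))
                return PTR_ERR(rl);     /* -ENOENT, -ENOMEM or -EIO. */
        if (rl->lcn == LCN_HOLE) {
                /* Sparse run: return zeroes on read, allocate on write. */
        } else {
                /* rl->lcn >= 0: @vcn maps to cluster rl->lcn + (vcn - rl->vcn). */
        }
        /* Per the WARNING above, check the context even on success. */
        if (ctx && IS_ERR(ctx->mrec))
                return PTR_ERR(ctx->mrec);
        return 0;
}
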
532 /** 532 /**
533 * ntfs_attr_find - find (next) attribute in mft record 533 * ntfs_attr_find - find (next) attribute in mft record
534 * @type: attribute type to find 534 * @type: attribute type to find
535 * @name: attribute name to find (optional, i.e. NULL means don't care) 535 * @name: attribute name to find (optional, i.e. NULL means don't care)
536 * @name_len: attribute name length (only needed if @name present) 536 * @name_len: attribute name length (only needed if @name present)
537 * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present) 537 * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present)
538 * @val: attribute value to find (optional, resident attributes only) 538 * @val: attribute value to find (optional, resident attributes only)
539 * @val_len: attribute value length 539 * @val_len: attribute value length
540 * @ctx: search context with mft record and attribute to search from 540 * @ctx: search context with mft record and attribute to search from
541 * 541 *
542 * You should not need to call this function directly. Use ntfs_attr_lookup() 542 * You should not need to call this function directly. Use ntfs_attr_lookup()
543 * instead. 543 * instead.
544 * 544 *
545 * ntfs_attr_find() takes a search context @ctx as parameter and searches the 545 * ntfs_attr_find() takes a search context @ctx as parameter and searches the
546 * mft record specified by @ctx->mrec, beginning at @ctx->attr, for an 546 * mft record specified by @ctx->mrec, beginning at @ctx->attr, for an
547 * attribute of @type, optionally @name and @val. 547 * attribute of @type, optionally @name and @val.
548 * 548 *
549 * If the attribute is found, ntfs_attr_find() returns 0 and @ctx->attr will 549 * If the attribute is found, ntfs_attr_find() returns 0 and @ctx->attr will
550 * point to the found attribute. 550 * point to the found attribute.
551 * 551 *
552 * If the attribute is not found, ntfs_attr_find() returns -ENOENT and 552 * If the attribute is not found, ntfs_attr_find() returns -ENOENT and
553 * @ctx->attr will point to the attribute before which the attribute being 553 * @ctx->attr will point to the attribute before which the attribute being
554 * searched for would need to be inserted if such an action were to be desired. 554 * searched for would need to be inserted if such an action were to be desired.
555 * 555 *
556 * On actual error, ntfs_attr_find() returns -EIO. In this case @ctx->attr is 556 * On actual error, ntfs_attr_find() returns -EIO. In this case @ctx->attr is
557 * undefined and in particular do not rely on it not changing. 557 * undefined and in particular do not rely on it not changing.
558 * 558 *
559 * If @ctx->is_first is 'true', the search begins with @ctx->attr itself. If it 559 * If @ctx->is_first is 'true', the search begins with @ctx->attr itself. If it
560 * is 'false', the search begins after @ctx->attr. 560 * is 'false', the search begins after @ctx->attr.
561 * 561 *
562 * If @ic is IGNORE_CASE, the @name comparison is not case sensitive and 562 * If @ic is IGNORE_CASE, the @name comparison is not case sensitive and
563 * @ctx->ntfs_ino must be set to the ntfs inode to which the mft record 563 * @ctx->ntfs_ino must be set to the ntfs inode to which the mft record
564 * @ctx->mrec belongs. This is so we can get at the ntfs volume and hence at 564 * @ctx->mrec belongs. This is so we can get at the ntfs volume and hence at
565 * the upcase table. If @ic is CASE_SENSITIVE, the comparison is case 565 * the upcase table. If @ic is CASE_SENSITIVE, the comparison is case
566 * sensitive. When @name is present, @name_len is the @name length in Unicode 566 * sensitive. When @name is present, @name_len is the @name length in Unicode
567 * characters. 567 * characters.
568 * 568 *
569 * If @name is not present (NULL), we assume that the unnamed attribute is 569 * If @name is not present (NULL), we assume that the unnamed attribute is
570 * being searched for. 570 * being searched for.
571 * 571 *
572 * Finally, the resident attribute value @val is looked for, if present. If 572 * Finally, the resident attribute value @val is looked for, if present. If
573 * @val is not present (NULL), @val_len is ignored. 573 * @val is not present (NULL), @val_len is ignored.
574 * 574 *
575 * ntfs_attr_find() only searches the specified mft record and it ignores the 575 * ntfs_attr_find() only searches the specified mft record and it ignores the
576 * presence of an attribute list attribute (unless it is the one being searched 576 * presence of an attribute list attribute (unless it is the one being searched
577 * for, obviously). If you need to take attribute lists into consideration, 577 * for, obviously). If you need to take attribute lists into consideration,
578 * use ntfs_attr_lookup() instead (see below). This also means that you cannot 578 * use ntfs_attr_lookup() instead (see below). This also means that you cannot
579 * use ntfs_attr_find() to search for extent records of non-resident 579 * use ntfs_attr_find() to search for extent records of non-resident
580 * attributes, as extents with lowest_vcn != 0 are usually described by the 580 * attributes, as extents with lowest_vcn != 0 are usually described by the
581 * attribute list attribute only. - Note that it is possible that the first 581 * attribute list attribute only. - Note that it is possible that the first
582 * extent is only in the attribute list while the last extent is in the base 582 * extent is only in the attribute list while the last extent is in the base
583 * mft record, so do not rely on being able to find the first extent in the 583 * mft record, so do not rely on being able to find the first extent in the
584 * base mft record. 584 * base mft record.
585 * 585 *
586 * Warning: Never use @val when looking for attribute types which can be 586 * Warning: Never use @val when looking for attribute types which can be
587 * non-resident as this most likely will result in a crash! 587 * non-resident as this most likely will result in a crash!
588 */ 588 */
589 static int ntfs_attr_find(const ATTR_TYPE type, const ntfschar *name, 589 static int ntfs_attr_find(const ATTR_TYPE type, const ntfschar *name,
590 const u32 name_len, const IGNORE_CASE_BOOL ic, 590 const u32 name_len, const IGNORE_CASE_BOOL ic,
591 const u8 *val, const u32 val_len, ntfs_attr_search_ctx *ctx) 591 const u8 *val, const u32 val_len, ntfs_attr_search_ctx *ctx)
592 { 592 {
593 ATTR_RECORD *a; 593 ATTR_RECORD *a;
594 ntfs_volume *vol = ctx->ntfs_ino->vol; 594 ntfs_volume *vol = ctx->ntfs_ino->vol;
595 ntfschar *upcase = vol->upcase; 595 ntfschar *upcase = vol->upcase;
596 u32 upcase_len = vol->upcase_len; 596 u32 upcase_len = vol->upcase_len;
597 597
598 /* 598 /*
599 * Iterate over attributes in mft record starting at @ctx->attr, or the 599 * Iterate over attributes in mft record starting at @ctx->attr, or the
600 * attribute following that, if @ctx->is_first is 'true'. 600 * attribute following that, if @ctx->is_first is 'true'.
601 */ 601 */
602 if (ctx->is_first) { 602 if (ctx->is_first) {
603 a = ctx->attr; 603 a = ctx->attr;
604 ctx->is_first = false; 604 ctx->is_first = false;
605 } else 605 } else
606 a = (ATTR_RECORD*)((u8*)ctx->attr + 606 a = (ATTR_RECORD*)((u8*)ctx->attr +
607 le32_to_cpu(ctx->attr->length)); 607 le32_to_cpu(ctx->attr->length));
608 for (;; a = (ATTR_RECORD*)((u8*)a + le32_to_cpu(a->length))) { 608 for (;; a = (ATTR_RECORD*)((u8*)a + le32_to_cpu(a->length))) {
609 if ((u8*)a < (u8*)ctx->mrec || (u8*)a > (u8*)ctx->mrec + 609 if ((u8*)a < (u8*)ctx->mrec || (u8*)a > (u8*)ctx->mrec +
610 le32_to_cpu(ctx->mrec->bytes_allocated)) 610 le32_to_cpu(ctx->mrec->bytes_allocated))
611 break; 611 break;
612 ctx->attr = a; 612 ctx->attr = a;
613 if (unlikely(le32_to_cpu(a->type) > le32_to_cpu(type) || 613 if (unlikely(le32_to_cpu(a->type) > le32_to_cpu(type) ||
614 a->type == AT_END)) 614 a->type == AT_END))
615 return -ENOENT; 615 return -ENOENT;
616 if (unlikely(!a->length)) 616 if (unlikely(!a->length))
617 break; 617 break;
618 if (a->type != type) 618 if (a->type != type)
619 continue; 619 continue;
620 /* 620 /*
621 * If @name is present, compare the two names. If @name is 621 * If @name is present, compare the two names. If @name is
622 * missing, assume we want an unnamed attribute. 622 * missing, assume we want an unnamed attribute.
623 */ 623 */
624 if (!name) { 624 if (!name) {
625 /* The search failed if the found attribute is named. */ 625 /* The search failed if the found attribute is named. */
626 if (a->name_length) 626 if (a->name_length)
627 return -ENOENT; 627 return -ENOENT;
628 } else if (!ntfs_are_names_equal(name, name_len, 628 } else if (!ntfs_are_names_equal(name, name_len,
629 (ntfschar*)((u8*)a + le16_to_cpu(a->name_offset)), 629 (ntfschar*)((u8*)a + le16_to_cpu(a->name_offset)),
630 a->name_length, ic, upcase, upcase_len)) { 630 a->name_length, ic, upcase, upcase_len)) {
631 register int rc; 631 register int rc;
632 632
633 rc = ntfs_collate_names(name, name_len, 633 rc = ntfs_collate_names(name, name_len,
634 (ntfschar*)((u8*)a + 634 (ntfschar*)((u8*)a +
635 le16_to_cpu(a->name_offset)), 635 le16_to_cpu(a->name_offset)),
636 a->name_length, 1, IGNORE_CASE, 636 a->name_length, 1, IGNORE_CASE,
637 upcase, upcase_len); 637 upcase, upcase_len);
638 /* 638 /*
639 * If @name collates before a->name, there is no 639 * If @name collates before a->name, there is no
640 * matching attribute. 640 * matching attribute.
641 */ 641 */
642 if (rc == -1) 642 if (rc == -1)
643 return -ENOENT; 643 return -ENOENT;
644 /* If the strings are not equal, continue search. */ 644 /* If the strings are not equal, continue search. */
645 if (rc) 645 if (rc)
646 continue; 646 continue;
647 rc = ntfs_collate_names(name, name_len, 647 rc = ntfs_collate_names(name, name_len,
648 (ntfschar*)((u8*)a + 648 (ntfschar*)((u8*)a +
649 le16_to_cpu(a->name_offset)), 649 le16_to_cpu(a->name_offset)),
650 a->name_length, 1, CASE_SENSITIVE, 650 a->name_length, 1, CASE_SENSITIVE,
651 upcase, upcase_len); 651 upcase, upcase_len);
652 if (rc == -1) 652 if (rc == -1)
653 return -ENOENT; 653 return -ENOENT;
654 if (rc) 654 if (rc)
655 continue; 655 continue;
656 } 656 }
657 /* 657 /*
658 * The names match or @name not present and attribute is 658 * The names match or @name not present and attribute is
659 * unnamed. If no @val specified, we have found the attribute 659 * unnamed. If no @val specified, we have found the attribute
660 * and are done. 660 * and are done.
661 */ 661 */
662 if (!val) 662 if (!val)
663 return 0; 663 return 0;
664 /* @val is present; compare values. */ 664 /* @val is present; compare values. */
665 else { 665 else {
666 register int rc; 666 register int rc;
667 667
668 rc = memcmp(val, (u8*)a + le16_to_cpu( 668 rc = memcmp(val, (u8*)a + le16_to_cpu(
669 a->data.resident.value_offset), 669 a->data.resident.value_offset),
670 min_t(u32, val_len, le32_to_cpu( 670 min_t(u32, val_len, le32_to_cpu(
671 a->data.resident.value_length))); 671 a->data.resident.value_length)));
672 /* 672 /*
673 * If @val collates before the current attribute's 673 * If @val collates before the current attribute's
674 * value, there is no matching attribute. 674 * value, there is no matching attribute.
675 */ 675 */
676 if (!rc) { 676 if (!rc) {
677 register u32 avl; 677 register u32 avl;
678 678
679 avl = le32_to_cpu( 679 avl = le32_to_cpu(
680 a->data.resident.value_length); 680 a->data.resident.value_length);
681 if (val_len == avl) 681 if (val_len == avl)
682 return 0; 682 return 0;
683 if (val_len < avl) 683 if (val_len < avl)
684 return -ENOENT; 684 return -ENOENT;
685 } else if (rc < 0) 685 } else if (rc < 0)
686 return -ENOENT; 686 return -ENOENT;
687 } 687 }
688 } 688 }
689 ntfs_error(vol->sb, "Inode is corrupt. Run chkdsk."); 689 ntfs_error(vol->sb, "Inode is corrupt. Run chkdsk.");
690 NVolSetErrors(vol); 690 NVolSetErrors(vol);
691 return -EIO; 691 return -EIO;
692 } 692 }
693 693
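As the kernel-doc notes, ntfs_attr_find() should be reached through ntfs_attr_lookup(). A hedged sketch of that path, using the search-context helpers seen earlier in this file; example_lookup(), the unnamed and value-less lookup and the lowest_vcn of 0 are illustrative assumptions, not part of the patch.

static int example_lookup(ntfs_inode *base_ni, MFT_RECORD *m, ATTR_TYPE type)
{
        ntfs_attr_search_ctx *ctx;
        int err;

        ctx = ntfs_attr_get_search_ctx(base_ni, m);
        if (!ctx)
                return -ENOMEM;
        /* Unnamed attribute, no value match, first extent (lowest_vcn 0). */
        err = ntfs_attr_lookup(type, NULL, 0, CASE_SENSITIVE, 0, NULL, 0, ctx);
        if (!err) {
                /*
                 * ctx->attr now points at the found attribute inside
                 * ctx->mrec and is only valid while the context is held.
                 */
        }
        /* Releases @ctx and unmaps any extent mft record it mapped. */
        ntfs_attr_put_search_ctx(ctx);
        return err;     /* 0, -ENOENT if not found, -EIO on corruption. */
}
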
694 /** 694 /**
695 * load_attribute_list - load an attribute list into memory 695 * load_attribute_list - load an attribute list into memory
696 * @vol: ntfs volume from which to read 696 * @vol: ntfs volume from which to read
697 * @runlist: runlist of the attribute list 697 * @runlist: runlist of the attribute list
698 * @al_start: destination buffer 698 * @al_start: destination buffer
699 * @size: size of the destination buffer in bytes 699 * @size: size of the destination buffer in bytes
700 * @initialized_size: initialized size of the attribute list 700 * @initialized_size: initialized size of the attribute list
701 * 701 *
702 * Walk the runlist @runlist and load all clusters from it copying them into 702 * Walk the runlist @runlist and load all clusters from it copying them into
703 * the linear buffer @al. The maximum number of bytes copied to @al is @size 703 * the linear buffer @al. The maximum number of bytes copied to @al is @size
704 * bytes. Note, @size does not need to be a multiple of the cluster size. If 704 * bytes. Note, @size does not need to be a multiple of the cluster size. If
705 * @initialized_size is less than @size, the region in @al between 705 * @initialized_size is less than @size, the region in @al between
706 * @initialized_size and @size will be zeroed and not read from disk. 706 * @initialized_size and @size will be zeroed and not read from disk.
707 * 707 *
708 * Return 0 on success or -errno on error. 708 * Return 0 on success or -errno on error.
709 */ 709 */
710 int load_attribute_list(ntfs_volume *vol, runlist *runlist, u8 *al_start, 710 int load_attribute_list(ntfs_volume *vol, runlist *runlist, u8 *al_start,
711 const s64 size, const s64 initialized_size) 711 const s64 size, const s64 initialized_size)
712 { 712 {
713 LCN lcn; 713 LCN lcn;
714 u8 *al = al_start; 714 u8 *al = al_start;
715 u8 *al_end = al + initialized_size; 715 u8 *al_end = al + initialized_size;
716 runlist_element *rl; 716 runlist_element *rl;
717 struct buffer_head *bh; 717 struct buffer_head *bh;
718 struct super_block *sb; 718 struct super_block *sb;
719 unsigned long block_size; 719 unsigned long block_size;
720 unsigned long block, max_block; 720 unsigned long block, max_block;
721 int err = 0; 721 int err = 0;
722 unsigned char block_size_bits; 722 unsigned char block_size_bits;
723 723
724 ntfs_debug("Entering."); 724 ntfs_debug("Entering.");
725 if (!vol || !runlist || !al || size <= 0 || initialized_size < 0 || 725 if (!vol || !runlist || !al || size <= 0 || initialized_size < 0 ||
726 initialized_size > size) 726 initialized_size > size)
727 return -EINVAL; 727 return -EINVAL;
728 if (!initialized_size) { 728 if (!initialized_size) {
729 memset(al, 0, size); 729 memset(al, 0, size);
730 return 0; 730 return 0;
731 } 731 }
732 sb = vol->sb; 732 sb = vol->sb;
733 block_size = sb->s_blocksize; 733 block_size = sb->s_blocksize;
734 block_size_bits = sb->s_blocksize_bits; 734 block_size_bits = sb->s_blocksize_bits;
735 down_read(&runlist->lock); 735 down_read(&runlist->lock);
736 rl = runlist->rl; 736 rl = runlist->rl;
737 if (!rl) { 737 if (!rl) {
738 ntfs_error(sb, "Cannot read attribute list since runlist is " 738 ntfs_error(sb, "Cannot read attribute list since runlist is "
739 "missing."); 739 "missing.");
740 goto err_out; 740 goto err_out;
741 } 741 }
742 /* Read all clusters specified by the runlist one run at a time. */ 742 /* Read all clusters specified by the runlist one run at a time. */
743 while (rl->length) { 743 while (rl->length) {
744 lcn = ntfs_rl_vcn_to_lcn(rl, rl->vcn); 744 lcn = ntfs_rl_vcn_to_lcn(rl, rl->vcn);
745 ntfs_debug("Reading vcn = 0x%llx, lcn = 0x%llx.", 745 ntfs_debug("Reading vcn = 0x%llx, lcn = 0x%llx.",
746 (unsigned long long)rl->vcn, 746 (unsigned long long)rl->vcn,
747 (unsigned long long)lcn); 747 (unsigned long long)lcn);
748 /* The attribute list cannot be sparse. */ 748 /* The attribute list cannot be sparse. */
749 if (lcn < 0) { 749 if (lcn < 0) {
750 ntfs_error(sb, "ntfs_rl_vcn_to_lcn() failed. Cannot " 750 ntfs_error(sb, "ntfs_rl_vcn_to_lcn() failed. Cannot "
751 "read attribute list."); 751 "read attribute list.");
752 goto err_out; 752 goto err_out;
753 } 753 }
754 block = lcn << vol->cluster_size_bits >> block_size_bits; 754 block = lcn << vol->cluster_size_bits >> block_size_bits;
755 /* Read the run from device in chunks of block_size bytes. */ 755 /* Read the run from device in chunks of block_size bytes. */
756 max_block = block + (rl->length << vol->cluster_size_bits >> 756 max_block = block + (rl->length << vol->cluster_size_bits >>
757 block_size_bits); 757 block_size_bits);
758 ntfs_debug("max_block = 0x%lx.", max_block); 758 ntfs_debug("max_block = 0x%lx.", max_block);
759 do { 759 do {
760 ntfs_debug("Reading block = 0x%lx.", block); 760 ntfs_debug("Reading block = 0x%lx.", block);
761 bh = sb_bread(sb, block); 761 bh = sb_bread(sb, block);
762 if (!bh) { 762 if (!bh) {
763 ntfs_error(sb, "sb_bread() failed. Cannot " 763 ntfs_error(sb, "sb_bread() failed. Cannot "
764 "read attribute list."); 764 "read attribute list.");
765 goto err_out; 765 goto err_out;
766 } 766 }
767 if (al + block_size >= al_end) 767 if (al + block_size >= al_end)
768 goto do_final; 768 goto do_final;
769 memcpy(al, bh->b_data, block_size); 769 memcpy(al, bh->b_data, block_size);
770 brelse(bh); 770 brelse(bh);
771 al += block_size; 771 al += block_size;
772 } while (++block < max_block); 772 } while (++block < max_block);
773 rl++; 773 rl++;
774 } 774 }
775 if (initialized_size < size) { 775 if (initialized_size < size) {
776 initialize: 776 initialize:
777 memset(al_start + initialized_size, 0, size - initialized_size); 777 memset(al_start + initialized_size, 0, size - initialized_size);
778 } 778 }
779 done: 779 done:
780 up_read(&runlist->lock); 780 up_read(&runlist->lock);
781 return err; 781 return err;
782 do_final: 782 do_final:
783 if (al < al_end) { 783 if (al < al_end) {
784 /* 784 /*
785 * Partial block. 785 * Partial block.
786 * 786 *
787 * Note: The attribute list can be smaller than its allocation 787 * Note: The attribute list can be smaller than its allocation
788 * by multiple clusters. This has been encountered by at least 788 * by multiple clusters. This has been encountered by at least
789 * two people running Windows XP, thus we cannot do any 789 * two people running Windows XP, thus we cannot do any
790 * truncation sanity checking here. (AIA) 790 * truncation sanity checking here. (AIA)
791 */ 791 */
792 memcpy(al, bh->b_data, al_end - al); 792 memcpy(al, bh->b_data, al_end - al);
793 brelse(bh); 793 brelse(bh);
794 if (initialized_size < size) 794 if (initialized_size < size)
795 goto initialize; 795 goto initialize;
796 goto done; 796 goto done;
797 } 797 }
798 brelse(bh); 798 brelse(bh);
799 /* Real overflow! */ 799 /* Real overflow! */
800 ntfs_error(sb, "Attribute list buffer overflow. Read attribute list " 800 ntfs_error(sb, "Attribute list buffer overflow. Read attribute list "
801 "is truncated."); 801 "is truncated.");
802 err_out: 802 err_out:
803 err = -EIO; 803 err = -EIO;
804 goto done; 804 goto done;
805 } 805 }
806 806
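A hedged sketch of a load_attribute_list() caller following the buffer contract described above; example_read_attr_list() and the use of the allocation helpers from malloc.h (included at the top of this file) are assumptions, not part of the patch.

static u8 *example_read_attr_list(ntfs_volume *vol, runlist *runlist,
                const s64 size, const s64 initialized_size)
{
        u8 *al;
        int err;

        /* @size covers the whole attribute list value in bytes. */
        al = ntfs_malloc_nofs(size);
        if (!al)
                return ERR_PTR(-ENOMEM);
        err = load_attribute_list(vol, runlist, al, size, initialized_size);
        if (err) {
                ntfs_free(al);
                return ERR_PTR(err);    /* -EINVAL or -EIO. */
        }
        /* Bytes between @initialized_size and @size were zeroed, not read. */
        return al;
}
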
807 /** 807 /**
808 * ntfs_external_attr_find - find an attribute in the attribute list of an inode 808 * ntfs_external_attr_find - find an attribute in the attribute list of an inode
809 * @type: attribute type to find 809 * @type: attribute type to find
810 * @name: attribute name to find (optional, i.e. NULL means don't care) 810 * @name: attribute name to find (optional, i.e. NULL means don't care)
811 * @name_len: attribute name length (only needed if @name present) 811 * @name_len: attribute name length (only needed if @name present)
812 * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present) 812 * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present)
813 * @lowest_vcn: lowest vcn to find (optional, non-resident attributes only) 813 * @lowest_vcn: lowest vcn to find (optional, non-resident attributes only)
814 * @val: attribute value to find (optional, resident attributes only) 814 * @val: attribute value to find (optional, resident attributes only)
815 * @val_len: attribute value length 815 * @val_len: attribute value length
816 * @ctx: search context with mft record and attribute to search from 816 * @ctx: search context with mft record and attribute to search from
817 * 817 *
818 * You should not need to call this function directly. Use ntfs_attr_lookup() 818 * You should not need to call this function directly. Use ntfs_attr_lookup()
819 * instead. 819 * instead.
820 * 820 *
821 * Find an attribute by searching the attribute list for the corresponding 821 * Find an attribute by searching the attribute list for the corresponding
822 * attribute list entry. Having found the entry, map the mft record if the 822 * attribute list entry. Having found the entry, map the mft record if the
823 * attribute is in a different mft record/inode, ntfs_attr_find() the attribute
824 * in there and return it.
825 *
826 * On first search @ctx->ntfs_ino must be the base mft record and @ctx must
827 * have been obtained from a call to ntfs_attr_get_search_ctx(). On subsequent
828 * calls @ctx->ntfs_ino can be any extent inode, too (@ctx->base_ntfs_ino is
829 * then the base inode).
830 *
831 * After finishing with the attribute/mft record you need to call
832 * ntfs_attr_put_search_ctx() to cleanup the search context (unmapping any
833 * mapped inodes, etc).
834 *
835 * If the attribute is found, ntfs_external_attr_find() returns 0 and
836 * @ctx->attr will point to the found attribute. @ctx->mrec will point to the
837 * mft record in which @ctx->attr is located and @ctx->al_entry will point to
838 * the attribute list entry for the attribute.
839 *
840 * If the attribute is not found, ntfs_external_attr_find() returns -ENOENT and
841 * @ctx->attr will point to the attribute in the base mft record before which
842 * the attribute being searched for would need to be inserted if such an action
843 * were to be desired. @ctx->mrec will point to the mft record in which
844 * @ctx->attr is located and @ctx->al_entry will point to the attribute list
845 * entry of the attribute before which the attribute being searched for would
846 * need to be inserted if such an action were to be desired.
847 *
848 * Thus to insert the not found attribute, one wants to add the attribute to
849 * @ctx->mrec (the base mft record) and if there is not enough space, the
850 * attribute should be placed in a newly allocated extent mft record. The
851 * attribute list entry for the inserted attribute should be inserted in the
852 * attribute list attribute at @ctx->al_entry.
853 *
854 * On actual error, ntfs_external_attr_find() returns -EIO. In this case
855 * @ctx->attr is undefined and in particular do not rely on it not changing.
856 */
857 static int ntfs_external_attr_find(const ATTR_TYPE type,
858 const ntfschar *name, const u32 name_len,
859 const IGNORE_CASE_BOOL ic, const VCN lowest_vcn,
860 const u8 *val, const u32 val_len, ntfs_attr_search_ctx *ctx)
861 {
862 ntfs_inode *base_ni, *ni;
863 ntfs_volume *vol;
864 ATTR_LIST_ENTRY *al_entry, *next_al_entry;
865 u8 *al_start, *al_end;
866 ATTR_RECORD *a;
867 ntfschar *al_name;
868 u32 al_name_len;
869 int err = 0;
870 static const char *es = " Unmount and run chkdsk.";
871
872 ni = ctx->ntfs_ino;
873 base_ni = ctx->base_ntfs_ino;
874 ntfs_debug("Entering for inode 0x%lx, type 0x%x.", ni->mft_no, type);
875 if (!base_ni) {
876 /* First call happens with the base mft record. */
877 base_ni = ctx->base_ntfs_ino = ctx->ntfs_ino;
878 ctx->base_mrec = ctx->mrec;
879 }
880 if (ni == base_ni)
881 ctx->base_attr = ctx->attr;
882 if (type == AT_END)
883 goto not_found;
884 vol = base_ni->vol;
885 al_start = base_ni->attr_list;
886 al_end = al_start + base_ni->attr_list_size;
887 if (!ctx->al_entry)
888 ctx->al_entry = (ATTR_LIST_ENTRY*)al_start;
889 /*
890 * Iterate over entries in attribute list starting at @ctx->al_entry,
891 * or the entry following that, if @ctx->is_first is 'true'.
892 */
893 if (ctx->is_first) {
894 al_entry = ctx->al_entry;
895 ctx->is_first = false;
896 } else
897 al_entry = (ATTR_LIST_ENTRY*)((u8*)ctx->al_entry +
898 le16_to_cpu(ctx->al_entry->length));
899 for (;; al_entry = next_al_entry) {
900 /* Out of bounds check. */
901 if ((u8*)al_entry < base_ni->attr_list ||
902 (u8*)al_entry > al_end)
903 break; /* Inode is corrupt. */
904 ctx->al_entry = al_entry;
905 /* Catch the end of the attribute list. */
906 if ((u8*)al_entry == al_end)
907 goto not_found;
908 if (!al_entry->length)
909 break;
910 if ((u8*)al_entry + 6 > al_end || (u8*)al_entry +
911 le16_to_cpu(al_entry->length) > al_end)
912 break;
913 next_al_entry = (ATTR_LIST_ENTRY*)((u8*)al_entry +
914 le16_to_cpu(al_entry->length));
915 if (le32_to_cpu(al_entry->type) > le32_to_cpu(type))
916 goto not_found;
917 if (type != al_entry->type)
918 continue;
919 /*
920 * If @name is present, compare the two names. If @name is
921 * missing, assume we want an unnamed attribute.
922 */
923 al_name_len = al_entry->name_length;
924 al_name = (ntfschar*)((u8*)al_entry + al_entry->name_offset);
925 if (!name) {
926 if (al_name_len)
927 goto not_found;
928 } else if (!ntfs_are_names_equal(al_name, al_name_len, name,
929 name_len, ic, vol->upcase, vol->upcase_len)) {
930 register int rc;
931
932 rc = ntfs_collate_names(name, name_len, al_name,
933 al_name_len, 1, IGNORE_CASE,
934 vol->upcase, vol->upcase_len);
935 /*
936 * If @name collates before al_name, there is no
937 * matching attribute.
938 */
939 if (rc == -1)
940 goto not_found;
941 /* If the strings are not equal, continue search. */
942 if (rc)
943 continue;
944 /*
945 * FIXME: Reverse engineering showed 0, IGNORE_CASE but
946 * that is inconsistent with ntfs_attr_find(). The
947 * subsequent rc checks were also different. Perhaps I
948 * made a mistake in one of the two. Need to recheck
949 * which is correct or at least see what is going on...
950 * (AIA)
951 */
952 rc = ntfs_collate_names(name, name_len, al_name,
953 al_name_len, 1, CASE_SENSITIVE,
954 vol->upcase, vol->upcase_len);
955 if (rc == -1)
956 goto not_found;
957 if (rc)
958 continue;
959 }
960 /*
961 * The names match or @name not present and attribute is
962 * unnamed. Now check @lowest_vcn. Continue search if the
963 * next attribute list entry still fits @lowest_vcn. Otherwise
964 * we have reached the right one or the search has failed.
965 */
966 if (lowest_vcn && (u8*)next_al_entry >= al_start &&
967 (u8*)next_al_entry + 6 < al_end &&
968 (u8*)next_al_entry + le16_to_cpu(
969 next_al_entry->length) <= al_end &&
970 sle64_to_cpu(next_al_entry->lowest_vcn) <=
971 lowest_vcn &&
972 next_al_entry->type == al_entry->type &&
973 next_al_entry->name_length == al_name_len &&
974 ntfs_are_names_equal((ntfschar*)((u8*)
975 next_al_entry +
976 next_al_entry->name_offset),
977 next_al_entry->name_length,
978 al_name, al_name_len, CASE_SENSITIVE,
979 vol->upcase, vol->upcase_len))
980 continue;
981 if (MREF_LE(al_entry->mft_reference) == ni->mft_no) {
982 if (MSEQNO_LE(al_entry->mft_reference) != ni->seq_no) {
983 ntfs_error(vol->sb, "Found stale mft "
984 "reference in attribute list "
985 "of base inode 0x%lx.%s",
986 base_ni->mft_no, es);
987 err = -EIO;
988 break;
989 }
990 } else { /* Mft references do not match. */
991 /* If there is a mapped record unmap it first. */
992 if (ni != base_ni)
993 unmap_extent_mft_record(ni);
994 /* Do we want the base record back? */
995 if (MREF_LE(al_entry->mft_reference) ==
996 base_ni->mft_no) {
997 ni = ctx->ntfs_ino = base_ni;
998 ctx->mrec = ctx->base_mrec;
999 } else {
1000 /* We want an extent record. */
1001 ctx->mrec = map_extent_mft_record(base_ni,
1002 le64_to_cpu(
1003 al_entry->mft_reference), &ni);
1004 if (IS_ERR(ctx->mrec)) {
1005 ntfs_error(vol->sb, "Failed to map "
1006 "extent mft record "
1007 "0x%lx of base inode "
1008 "0x%lx.%s",
1009 MREF_LE(al_entry->
1010 mft_reference),
1011 base_ni->mft_no, es);
1012 err = PTR_ERR(ctx->mrec);
1013 if (err == -ENOENT)
1014 err = -EIO;
1015 /* Cause @ctx to be sanitized below. */
1016 ni = NULL;
1017 break;
1018 }
1019 ctx->ntfs_ino = ni;
1020 }
1021 ctx->attr = (ATTR_RECORD*)((u8*)ctx->mrec +
1022 le16_to_cpu(ctx->mrec->attrs_offset));
1023 }
1024 /*
1025 * ctx->vfs_ino, ctx->mrec, and ctx->attr now point to the
1026 * mft record containing the attribute represented by the
1027 * current al_entry.
1028 */
1029 /*
1030 * We could call into ntfs_attr_find() to find the right
1031 * attribute in this mft record but this would be less
1032 * efficient and not quite accurate as ntfs_attr_find() ignores
1033 * the attribute instance numbers for example which become
1034 * important when one plays with attribute lists. Also,
1035 * because a proper match has been found in the attribute list
1036 * entry above, the comparison can now be optimized. So it is
1037 * worth re-implementing a simplified ntfs_attr_find() here.
1038 */
1039 a = ctx->attr;
1040 /*
1041 * Use a manual loop so we can still use break and continue
1042 * with the same meanings as above.
1043 */
1044 do_next_attr_loop:
1045 if ((u8*)a < (u8*)ctx->mrec || (u8*)a > (u8*)ctx->mrec +
1046 le32_to_cpu(ctx->mrec->bytes_allocated))
1047 break;
1048 if (a->type == AT_END)
1049 break;
1050 if (!a->length)
1051 break;
1052 if (al_entry->instance != a->instance)
1053 goto do_next_attr;
1054 /*
1055 * If the type and/or the name are mismatched between the
1056 * attribute list entry and the attribute record, there is
1057 * corruption so we break and return error EIO.
1058 */
1059 if (al_entry->type != a->type)
1060 break;
1061 if (!ntfs_are_names_equal((ntfschar*)((u8*)a +
1062 le16_to_cpu(a->name_offset)), a->name_length,
1063 al_name, al_name_len, CASE_SENSITIVE,
1064 vol->upcase, vol->upcase_len))
1065 break;
1066 ctx->attr = a;
1067 /*
1068 * If no @val specified or @val specified and it matches, we
1069 * have found it!
1070 */
1071 if (!val || (!a->non_resident && le32_to_cpu(
1072 a->data.resident.value_length) == val_len &&
1073 !memcmp((u8*)a +
1074 le16_to_cpu(a->data.resident.value_offset),
1075 val, val_len))) {
1076 ntfs_debug("Done, found.");
1077 return 0;
1078 }
1079 do_next_attr:
1080 /* Proceed to the next attribute in the current mft record. */
1081 a = (ATTR_RECORD*)((u8*)a + le32_to_cpu(a->length));
1082 goto do_next_attr_loop;
1083 }
1084 if (!err) {
1085 ntfs_error(vol->sb, "Base inode 0x%lx contains corrupt "
1086 "attribute list attribute.%s", base_ni->mft_no,
1087 es);
1088 err = -EIO;
1089 }
1090 if (ni != base_ni) {
1091 if (ni)
1092 unmap_extent_mft_record(ni);
1093 ctx->ntfs_ino = base_ni;
1094 ctx->mrec = ctx->base_mrec;
1095 ctx->attr = ctx->base_attr;
1096 }
1097 if (err != -ENOMEM)
1098 NVolSetErrors(vol);
1099 return err;
1100 not_found:
1101 /*
1102 * If we were looking for AT_END, we reset the search context @ctx and
1103 * use ntfs_attr_find() to seek to the end of the base mft record.
1104 */
1105 if (type == AT_END) {
1106 ntfs_attr_reinit_search_ctx(ctx);
1107 return ntfs_attr_find(AT_END, name, name_len, ic, val, val_len,
1108 ctx);
1109 }
1110 /*
1111 * The attribute was not found. Before we return, we want to ensure
1112 * @ctx->mrec and @ctx->attr indicate the position at which the
1113 * attribute should be inserted in the base mft record. Since we also
1114 * want to preserve @ctx->al_entry we cannot reinitialize the search
1115 * context using ntfs_attr_reinit_search_ctx() as this would set
1116 * @ctx->al_entry to NULL. Thus we do the necessary bits manually (see
1117 * ntfs_attr_init_search_ctx() below). Note, we _only_ preserve
1118 * @ctx->al_entry as the remaining fields (base_*) are identical to
1119 * their non base_ counterparts and we cannot set @ctx->base_attr
1120 * correctly yet as we do not know what @ctx->attr will be set to by
1121 * the call to ntfs_attr_find() below.
1122 */
1123 if (ni != base_ni)
1124 unmap_extent_mft_record(ni);
1125 ctx->mrec = ctx->base_mrec;
1126 ctx->attr = (ATTR_RECORD*)((u8*)ctx->mrec +
1127 le16_to_cpu(ctx->mrec->attrs_offset));
1128 ctx->is_first = true;
1129 ctx->ntfs_ino = base_ni;
1130 ctx->base_ntfs_ino = NULL;
1131 ctx->base_mrec = NULL;
1132 ctx->base_attr = NULL;
1133 /*
1134 * In case there are multiple matches in the base mft record, need to
1135 * keep enumerating until we get an attribute not found response (or
1136 * another error), otherwise we would keep returning the same attribute
1137 * over and over again and all programs using us for enumeration would
1138 * lock up in a tight loop.
1139 */
1140 do {
1141 err = ntfs_attr_find(type, name, name_len, ic, val, val_len,
1142 ctx);
1143 } while (!err);
1144 ntfs_debug("Done, not found.");
1145 return err;
1146 }
1147
1148 /**
1149 * ntfs_attr_lookup - find an attribute in an ntfs inode
1150 * @type: attribute type to find
1151 * @name: attribute name to find (optional, i.e. NULL means don't care)
1152 * @name_len: attribute name length (only needed if @name present)
1153 * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present)
1154 * @lowest_vcn: lowest vcn to find (optional, non-resident attributes only)
1155 * @val: attribute value to find (optional, resident attributes only)
1156 * @val_len: attribute value length
1157 * @ctx: search context with mft record and attribute to search from
1158 *
1159 * Find an attribute in an ntfs inode. On first search @ctx->ntfs_ino must
1160 * be the base mft record and @ctx must have been obtained from a call to
1161 * ntfs_attr_get_search_ctx().
1162 *
1163 * This function transparently handles attribute lists and @ctx is used to
1164 * continue searches where they were left off at.
1165 *
1166 * After finishing with the attribute/mft record you need to call
1167 * ntfs_attr_put_search_ctx() to cleanup the search context (unmapping any
1168 * mapped inodes, etc).
1169 *
1170 * Return 0 if the search was successful and -errno if not.
1171 *
1172 * When 0, @ctx->attr is the found attribute and it is in mft record
1173 * @ctx->mrec. If an attribute list attribute is present, @ctx->al_entry is
1174 * the attribute list entry of the found attribute.
1175 *
1176 * When -ENOENT, @ctx->attr is the attribute which collates just after the
1177 * attribute being searched for, i.e. if one wants to add the attribute to the
1178 * mft record this is the correct place to insert it into. If an attribute
1179 * list attribute is present, @ctx->al_entry is the attribute list entry which
1180 * collates just after the attribute list entry of the attribute being searched
1181 * for, i.e. if one wants to add the attribute to the mft record this is the
1182 * correct place to insert its attribute list entry into.
1183 *
1184 * When -errno != -ENOENT, an error occurred during the lookup. @ctx->attr is
1185 * then undefined and in particular you should not rely on it not changing.
1186 */
1187 int ntfs_attr_lookup(const ATTR_TYPE type, const ntfschar *name,
1188 const u32 name_len, const IGNORE_CASE_BOOL ic,
1189 const VCN lowest_vcn, const u8 *val, const u32 val_len,
1190 ntfs_attr_search_ctx *ctx)
1191 {
1192 ntfs_inode *base_ni;
1193
1194 ntfs_debug("Entering.");
1195 BUG_ON(IS_ERR(ctx->mrec));
1196 if (ctx->base_ntfs_ino)
1197 base_ni = ctx->base_ntfs_ino;
1198 else
1199 base_ni = ctx->ntfs_ino;
1200 /* Sanity check, just for debugging really. */
1201 BUG_ON(!base_ni);
1202 if (!NInoAttrList(base_ni) || type == AT_ATTRIBUTE_LIST)
1203 return ntfs_attr_find(type, name, name_len, ic, val, val_len,
1204 ctx);
1205 return ntfs_external_attr_find(type, name, name_len, ic, lowest_vcn,
1206 val, val_len, ctx);
1207 }
1208
1209 /**
1210 * ntfs_attr_init_search_ctx - initialize an attribute search context
1211 * @ctx: attribute search context to initialize
1212 * @ni: ntfs inode with which to initialize the search context
1213 * @mrec: mft record with which to initialize the search context
1214 *
1215 * Initialize the attribute search context @ctx with @ni and @mrec.
1216 */
1217 static inline void ntfs_attr_init_search_ctx(ntfs_attr_search_ctx *ctx,
1218 ntfs_inode *ni, MFT_RECORD *mrec)
1219 {
1220 *ctx = (ntfs_attr_search_ctx) {
1221 .mrec = mrec,
1222 /* Sanity checks are performed elsewhere. */
1223 .attr = (ATTR_RECORD*)((u8*)mrec +
1224 le16_to_cpu(mrec->attrs_offset)),
1225 .is_first = true,
1226 .ntfs_ino = ni,
1227 };
1228 }
1229
1230 /**
1231 * ntfs_attr_reinit_search_ctx - reinitialize an attribute search context
1232 * @ctx: attribute search context to reinitialize
1233 *
1234 * Reinitialize the attribute search context @ctx, unmapping an associated
1235 * extent mft record if present, and initialize the search context again.
1236 *
1237 * This is used when a search for a new attribute is being started to reset
1238 * the search context to the beginning.
1239 */
1240 void ntfs_attr_reinit_search_ctx(ntfs_attr_search_ctx *ctx)
1241 {
1242 if (likely(!ctx->base_ntfs_ino)) {
1243 /* No attribute list. */
1244 ctx->is_first = true;
1245 /* Sanity checks are performed elsewhere. */
1246 ctx->attr = (ATTR_RECORD*)((u8*)ctx->mrec +
1247 le16_to_cpu(ctx->mrec->attrs_offset));
1248 /*
1249 * This needs resetting due to ntfs_external_attr_find() which
1250 * can leave it set despite having zeroed ctx->base_ntfs_ino.
1251 */
1252 ctx->al_entry = NULL;
1253 return;
1254 } /* Attribute list. */
1255 if (ctx->ntfs_ino != ctx->base_ntfs_ino)
1256 unmap_extent_mft_record(ctx->ntfs_ino);
1257 ntfs_attr_init_search_ctx(ctx, ctx->base_ntfs_ino, ctx->base_mrec);
1258 return;
1259 }
1260
1261 /**
1262 * ntfs_attr_get_search_ctx - allocate/initialize a new attribute search context
1263 * @ni: ntfs inode with which to initialize the search context
1264 * @mrec: mft record with which to initialize the search context
1265 *
1266 * Allocate a new attribute search context, initialize it with @ni and @mrec,
1267 * and return it. Return NULL if allocation failed.
1268 */
1269 ntfs_attr_search_ctx *ntfs_attr_get_search_ctx(ntfs_inode *ni, MFT_RECORD *mrec)
1270 {
1271 ntfs_attr_search_ctx *ctx;
1272
1273 ctx = kmem_cache_alloc(ntfs_attr_ctx_cache, GFP_NOFS);
1274 if (ctx)
1275 ntfs_attr_init_search_ctx(ctx, ni, mrec);
1276 return ctx;
1277 }
1278
1279 /**
1280 * ntfs_attr_put_search_ctx - release an attribute search context
1281 * @ctx: attribute search context to free
1282 *
1283 * Release the attribute search context @ctx, unmapping an associated extent
1284 * mft record if present.
1285 */
1286 void ntfs_attr_put_search_ctx(ntfs_attr_search_ctx *ctx)
1287 {
1288 if (ctx->base_ntfs_ino && ctx->ntfs_ino != ctx->base_ntfs_ino)
1289 unmap_extent_mft_record(ctx->ntfs_ino);
1290 kmem_cache_free(ntfs_attr_ctx_cache, ctx);
1291 return;
1292 }
1293
1294 #ifdef NTFS_RW
1295
1296 /**
1297 * ntfs_attr_find_in_attrdef - find an attribute in the $AttrDef system file
1298 * @vol: ntfs volume to which the attribute belongs
1299 * @type: attribute type which to find
1300 *
1301 * Search for the attribute definition record corresponding to the attribute
1302 * @type in the $AttrDef system file.
1303 *
1304 * Return the attribute type definition record if found and NULL if not found.
1305 */
1306 static ATTR_DEF *ntfs_attr_find_in_attrdef(const ntfs_volume *vol,
1307 const ATTR_TYPE type)
1308 {
1309 ATTR_DEF *ad;
1310
1311 BUG_ON(!vol->attrdef);
1312 BUG_ON(!type);
1313 for (ad = vol->attrdef; (u8*)ad - (u8*)vol->attrdef <
1314 vol->attrdef_size && ad->type; ++ad) {
1315 /* We have not found it yet, carry on searching. */
1316 if (likely(le32_to_cpu(ad->type) < le32_to_cpu(type)))
1317 continue;
1318 /* We found the attribute; return it. */
1319 if (likely(ad->type == type))
1320 return ad;
1321 /* We have gone too far already. No point in continuing. */
1322 break;
1323 }
1324 /* Attribute not found. */
1325 ntfs_debug("Attribute type 0x%x not found in $AttrDef.",
1326 le32_to_cpu(type));
1327 return NULL;
1328 }
1329
1330 /**
1331 * ntfs_attr_size_bounds_check - check a size of an attribute type for validity
1332 * @vol: ntfs volume to which the attribute belongs
1333 * @type: attribute type which to check
1334 * @size: size which to check
1335 *
1336 * Check whether the @size in bytes is valid for an attribute of @type on the
1337 * ntfs volume @vol. This information is obtained from $AttrDef system file.
1338 *
1339 * Return 0 if valid, -ERANGE if not valid, or -ENOENT if the attribute is not
1340 * listed in $AttrDef.
1341 */
1342 int ntfs_attr_size_bounds_check(const ntfs_volume *vol, const ATTR_TYPE type,
1343 const s64 size)
1344 {
1345 ATTR_DEF *ad;
1346
1347 BUG_ON(size < 0);
1348 /*
1349 * $ATTRIBUTE_LIST has a maximum size of 256kiB, but this is not
1350 * listed in $AttrDef.
1351 */
1352 if (unlikely(type == AT_ATTRIBUTE_LIST && size > 256 * 1024))
1353 return -ERANGE;
1354 /* Get the $AttrDef entry for the attribute @type. */
1355 ad = ntfs_attr_find_in_attrdef(vol, type);
1356 if (unlikely(!ad))
1357 return -ENOENT;
1358 /* Do the bounds check. */
1359 if (((sle64_to_cpu(ad->min_size) > 0) &&
1360 size < sle64_to_cpu(ad->min_size)) ||
1361 ((sle64_to_cpu(ad->max_size) > 0) && size >
1362 sle64_to_cpu(ad->max_size)))
1363 return -ERANGE;
1364 return 0;
1365 }
1366
1367 /**
1368 * ntfs_attr_can_be_non_resident - check if an attribute can be non-resident
1369 * @vol: ntfs volume to which the attribute belongs
1370 * @type: attribute type which to check
1371 *
1372 * Check whether the attribute of @type on the ntfs volume @vol is allowed to
1373 * be non-resident. This information is obtained from $AttrDef system file.
1374 *
1375 * Return 0 if the attribute is allowed to be non-resident, -EPERM if not, and
1376 * -ENOENT if the attribute is not listed in $AttrDef.
1377 */
1378 int ntfs_attr_can_be_non_resident(const ntfs_volume *vol, const ATTR_TYPE type)
1379 {
1380 ATTR_DEF *ad;
1381
1382 /* Find the attribute definition record in $AttrDef. */
1383 ad = ntfs_attr_find_in_attrdef(vol, type);
1384 if (unlikely(!ad))
1385 return -ENOENT;
1386 /* Check the flags and return the result. */
1387 if (ad->flags & ATTR_DEF_RESIDENT)
1388 return -EPERM;
1389 return 0;
1390 }
1391
1392 /**
1393 * ntfs_attr_can_be_resident - check if an attribute can be resident
1394 * @vol: ntfs volume to which the attribute belongs
1395 * @type: attribute type which to check
1396 *
1397 * Check whether the attribute of @type on the ntfs volume @vol is allowed to
1398 * be resident. This information is derived from our ntfs knowledge and may
1399 * not be completely accurate, especially when user defined attributes are
1400 * present. Basically we allow everything to be resident except for index
1401 * allocation and $EA attributes.
1402 *
1403 * Return 0 if the attribute is allowed to be non-resident and -EPERM if not.
1404 *
1405 * Warning: In the system file $MFT the attribute $Bitmap must be non-resident
1406 * otherwise windows will not boot (blue screen of death)! We cannot
1407 * check for this here as we do not know which inode's $Bitmap is
1408 * being asked about so the caller needs to special case this.
1409 */
1410 int ntfs_attr_can_be_resident(const ntfs_volume *vol, const ATTR_TYPE type)
1411 {
1412 if (type == AT_INDEX_ALLOCATION)
1413 return -EPERM;
1414 return 0;
1415 }
1416
1417 /**
1418 * ntfs_attr_record_resize - resize an attribute record
1419 * @m: mft record containing attribute record
1420 * @a: attribute record to resize
1421 * @new_size: new size in bytes to which to resize the attribute record @a
1422 *
1423 * Resize the attribute record @a, i.e. the resident part of the attribute, in
1424 * the mft record @m to @new_size bytes.
1425 *
1426 * Return 0 on success and -errno on error. The following error codes are
1427 * defined:
1428 * -ENOSPC - Not enough space in the mft record @m to perform the resize.
1429 *
1430 * Note: On error, no modifications have been performed whatsoever.
1431 *
1432 * Warning: If you make a record smaller without having copied all the data you
1433 * are interested in the data may be overwritten.
1434 */
1435 int ntfs_attr_record_resize(MFT_RECORD *m, ATTR_RECORD *a, u32 new_size)
1436 {
1437 ntfs_debug("Entering for new_size %u.", new_size);
1438 /* Align to 8 bytes if it is not already done. */
1439 if (new_size & 7)
1440 new_size = (new_size + 7) & ~7;
1441 /* If the actual attribute length has changed, move things around. */
1442 if (new_size != le32_to_cpu(a->length)) {
1443 u32 new_muse = le32_to_cpu(m->bytes_in_use) -
1444 le32_to_cpu(a->length) + new_size;
1445 /* Not enough space in this mft record. */
1446 if (new_muse > le32_to_cpu(m->bytes_allocated))
1447 return -ENOSPC;
1448 /* Move attributes following @a to their new location. */
1449 memmove((u8*)a + new_size, (u8*)a + le32_to_cpu(a->length),
1450 le32_to_cpu(m->bytes_in_use) - ((u8*)a -
1451 (u8*)m) - le32_to_cpu(a->length));
1452 /* Adjust @m to reflect the change in used space. */
1453 m->bytes_in_use = cpu_to_le32(new_muse);
1454 /* Adjust @a to reflect the new size. */
1455 if (new_size >= offsetof(ATTR_REC, length) + sizeof(a->length))
1456 a->length = cpu_to_le32(new_size);
1457 }
1458 return 0;
1459 }
1460
1461 /**
1462 * ntfs_resident_attr_value_resize - resize the value of a resident attribute
1463 * @m: mft record containing attribute record
1464 * @a: attribute record whose value to resize
1465 * @new_size: new size in bytes to which to resize the attribute value of @a
1466 *
1467 * Resize the value of the attribute @a in the mft record @m to @new_size bytes.
1468 * If the value is made bigger, the newly allocated space is cleared.
1469 *
1470 * Return 0 on success and -errno on error. The following error codes are
1471 * defined:
1472 * -ENOSPC - Not enough space in the mft record @m to perform the resize.
1473 *
1474 * Note: On error, no modifications have been performed whatsoever.
1475 *
1476 * Warning: If you make a record smaller without having copied all the data you
1477 * are interested in the data may be overwritten.
1478 */
1479 int ntfs_resident_attr_value_resize(MFT_RECORD *m, ATTR_RECORD *a,
1480 const u32 new_size)
1481 {
1482 u32 old_size;
1483
1484 /* Resize the resident part of the attribute record. */
1485 if (ntfs_attr_record_resize(m, a,
1486 le16_to_cpu(a->data.resident.value_offset) + new_size))
1487 return -ENOSPC;
1488 /*
1489 * The resize succeeded! If we made the attribute value bigger, clear
1490 * the area between the old size and @new_size.
1491 */
1492 old_size = le32_to_cpu(a->data.resident.value_length);
1493 if (new_size > old_size)
1494 memset((u8*)a + le16_to_cpu(a->data.resident.value_offset) +
1495 old_size, 0, new_size - old_size);
1496 /* Finally update the length of the attribute value. */
1497 a->data.resident.value_length = cpu_to_le32(new_size);
1498 return 0;
1499 }
1500
1501 /**
1502 * ntfs_attr_make_non_resident - convert a resident to a non-resident attribute
1503 * @ni: ntfs inode describing the attribute to convert
1504 * @data_size: size of the resident data to copy to the non-resident attribute
1505 *
1506 * Convert the resident ntfs attribute described by the ntfs inode @ni to a
1507 * non-resident one.
1508 *
1509 * @data_size must be equal to the attribute value size. This is needed since
1510 * we need to know the size before we can map the mft record and our callers
1511 * always know it. The reason we cannot simply read the size from the vfs
1512 * inode i_size is that this is not necessarily uptodate. This happens when
1513 * ntfs_attr_make_non_resident() is called in the ->truncate call path(s).
1514 *
1515 * Return 0 on success and -errno on error. The following error return codes
1516 * are defined:
1517 * -EPERM - The attribute is not allowed to be non-resident.
1518 * -ENOMEM - Not enough memory.
1519 * -ENOSPC - Not enough disk space.
1520 * -EINVAL - Attribute not defined on the volume.
1521 * -EIO - I/o error or other error.
1522 * Note that -ENOSPC is also returned in the case that there is not enough
1523 * space in the mft record to do the conversion. This can happen when the mft
1524 * record is already very full. The caller is responsible for trying to make
1525 * space in the mft record and trying again. FIXME: Do we need a separate
1526 * error return code for this kind of -ENOSPC or is it always worth trying
1527 * again in case the attribute may then fit in a resident state so no need to
1528 * make it non-resident at all? Ho-hum... (AIA)
1529 *
1530 * NOTE to self: No changes in the attribute list are required to move from
1531 * a resident to a non-resident attribute.
1532 *
1533 * Locking: - The caller must hold i_mutex on the inode.
1534 */
1535 int ntfs_attr_make_non_resident(ntfs_inode *ni, const u32 data_size) 1535 int ntfs_attr_make_non_resident(ntfs_inode *ni, const u32 data_size)
1536 { 1536 {
1537 s64 new_size; 1537 s64 new_size;
1538 struct inode *vi = VFS_I(ni); 1538 struct inode *vi = VFS_I(ni);
1539 ntfs_volume *vol = ni->vol; 1539 ntfs_volume *vol = ni->vol;
1540 ntfs_inode *base_ni; 1540 ntfs_inode *base_ni;
1541 MFT_RECORD *m; 1541 MFT_RECORD *m;
1542 ATTR_RECORD *a; 1542 ATTR_RECORD *a;
1543 ntfs_attr_search_ctx *ctx; 1543 ntfs_attr_search_ctx *ctx;
1544 struct page *page; 1544 struct page *page;
1545 runlist_element *rl; 1545 runlist_element *rl;
1546 u8 *kaddr; 1546 u8 *kaddr;
1547 unsigned long flags; 1547 unsigned long flags;
1548 int mp_size, mp_ofs, name_ofs, arec_size, err, err2; 1548 int mp_size, mp_ofs, name_ofs, arec_size, err, err2;
1549 u32 attr_size; 1549 u32 attr_size;
1550 u8 old_res_attr_flags; 1550 u8 old_res_attr_flags;
1551 1551
1552 /* Check that the attribute is allowed to be non-resident. */ 1552 /* Check that the attribute is allowed to be non-resident. */
1553 err = ntfs_attr_can_be_non_resident(vol, ni->type); 1553 err = ntfs_attr_can_be_non_resident(vol, ni->type);
1554 if (unlikely(err)) { 1554 if (unlikely(err)) {
1555 if (err == -EPERM) 1555 if (err == -EPERM)
1556 ntfs_debug("Attribute is not allowed to be " 1556 ntfs_debug("Attribute is not allowed to be "
1557 "non-resident."); 1557 "non-resident.");
1558 else 1558 else
1559 ntfs_debug("Attribute not defined on the NTFS " 1559 ntfs_debug("Attribute not defined on the NTFS "
1560 "volume!"); 1560 "volume!");
1561 return err; 1561 return err;
1562 } 1562 }
1563 /* 1563 /*
1564 * FIXME: Compressed and encrypted attributes are not supported when 1564 * FIXME: Compressed and encrypted attributes are not supported when
1565 * writing and we should never have gotten here for them. 1565 * writing and we should never have gotten here for them.
1566 */ 1566 */
1567 BUG_ON(NInoCompressed(ni)); 1567 BUG_ON(NInoCompressed(ni));
1568 BUG_ON(NInoEncrypted(ni)); 1568 BUG_ON(NInoEncrypted(ni));
1569 /* 1569 /*
1570 * The size needs to be aligned to a cluster boundary for allocation 1570 * The size needs to be aligned to a cluster boundary for allocation
1571 * purposes. 1571 * purposes.
1572 */ 1572 */
1573 new_size = (data_size + vol->cluster_size - 1) & 1573 new_size = (data_size + vol->cluster_size - 1) &
1574 ~(vol->cluster_size - 1); 1574 ~(vol->cluster_size - 1);
1575 if (new_size > 0) { 1575 if (new_size > 0) {
1576 /* 1576 /*
1577 * Will need the page later and since the page lock nests 1577 * Will need the page later and since the page lock nests
1578 * outside all ntfs locks, we need to get the page now. 1578 * outside all ntfs locks, we need to get the page now.
1579 */ 1579 */
1580 page = find_or_create_page(vi->i_mapping, 0, 1580 page = find_or_create_page(vi->i_mapping, 0,
1581 mapping_gfp_mask(vi->i_mapping)); 1581 mapping_gfp_mask(vi->i_mapping));
1582 if (unlikely(!page)) 1582 if (unlikely(!page))
1583 return -ENOMEM; 1583 return -ENOMEM;
1584 /* Start by allocating clusters to hold the attribute value. */ 1584 /* Start by allocating clusters to hold the attribute value. */
1585 rl = ntfs_cluster_alloc(vol, 0, new_size >> 1585 rl = ntfs_cluster_alloc(vol, 0, new_size >>
1586 vol->cluster_size_bits, -1, DATA_ZONE, true); 1586 vol->cluster_size_bits, -1, DATA_ZONE, true);
1587 if (IS_ERR(rl)) { 1587 if (IS_ERR(rl)) {
1588 err = PTR_ERR(rl); 1588 err = PTR_ERR(rl);
1589 ntfs_debug("Failed to allocate cluster%s, error code " 1589 ntfs_debug("Failed to allocate cluster%s, error code "
1590 "%i.", (new_size >> 1590 "%i.", (new_size >>
1591 vol->cluster_size_bits) > 1 ? "s" : "", 1591 vol->cluster_size_bits) > 1 ? "s" : "",
1592 err); 1592 err);
1593 goto page_err_out; 1593 goto page_err_out;
1594 } 1594 }
1595 } else { 1595 } else {
1596 rl = NULL; 1596 rl = NULL;
1597 page = NULL; 1597 page = NULL;
1598 } 1598 }
1599 /* Determine the size of the mapping pairs array. */ 1599 /* Determine the size of the mapping pairs array. */
1600 mp_size = ntfs_get_size_for_mapping_pairs(vol, rl, 0, -1); 1600 mp_size = ntfs_get_size_for_mapping_pairs(vol, rl, 0, -1);
1601 if (unlikely(mp_size < 0)) { 1601 if (unlikely(mp_size < 0)) {
1602 err = mp_size; 1602 err = mp_size;
1603 ntfs_debug("Failed to get size for mapping pairs array, error " 1603 ntfs_debug("Failed to get size for mapping pairs array, error "
1604 "code %i.", err); 1604 "code %i.", err);
1605 goto rl_err_out; 1605 goto rl_err_out;
1606 } 1606 }
1607 down_write(&ni->runlist.lock); 1607 down_write(&ni->runlist.lock);
1608 if (!NInoAttr(ni)) 1608 if (!NInoAttr(ni))
1609 base_ni = ni; 1609 base_ni = ni;
1610 else 1610 else
1611 base_ni = ni->ext.base_ntfs_ino; 1611 base_ni = ni->ext.base_ntfs_ino;
1612 m = map_mft_record(base_ni); 1612 m = map_mft_record(base_ni);
1613 if (IS_ERR(m)) { 1613 if (IS_ERR(m)) {
1614 err = PTR_ERR(m); 1614 err = PTR_ERR(m);
1615 m = NULL; 1615 m = NULL;
1616 ctx = NULL; 1616 ctx = NULL;
1617 goto err_out; 1617 goto err_out;
1618 } 1618 }
1619 ctx = ntfs_attr_get_search_ctx(base_ni, m); 1619 ctx = ntfs_attr_get_search_ctx(base_ni, m);
1620 if (unlikely(!ctx)) { 1620 if (unlikely(!ctx)) {
1621 err = -ENOMEM; 1621 err = -ENOMEM;
1622 goto err_out; 1622 goto err_out;
1623 } 1623 }
1624 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, 1624 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
1625 CASE_SENSITIVE, 0, NULL, 0, ctx); 1625 CASE_SENSITIVE, 0, NULL, 0, ctx);
1626 if (unlikely(err)) { 1626 if (unlikely(err)) {
1627 if (err == -ENOENT) 1627 if (err == -ENOENT)
1628 err = -EIO; 1628 err = -EIO;
1629 goto err_out; 1629 goto err_out;
1630 } 1630 }
1631 m = ctx->mrec; 1631 m = ctx->mrec;
1632 a = ctx->attr; 1632 a = ctx->attr;
1633 BUG_ON(NInoNonResident(ni)); 1633 BUG_ON(NInoNonResident(ni));
1634 BUG_ON(a->non_resident); 1634 BUG_ON(a->non_resident);
1635 /* 1635 /*
1636 * Calculate new offsets for the name and the mapping pairs array. 1636 * Calculate new offsets for the name and the mapping pairs array.
1637 */ 1637 */
1638 if (NInoSparse(ni) || NInoCompressed(ni)) 1638 if (NInoSparse(ni) || NInoCompressed(ni))
1639 name_ofs = (offsetof(ATTR_REC, 1639 name_ofs = (offsetof(ATTR_REC,
1640 data.non_resident.compressed_size) + 1640 data.non_resident.compressed_size) +
1641 sizeof(a->data.non_resident.compressed_size) + 1641 sizeof(a->data.non_resident.compressed_size) +
1642 7) & ~7; 1642 7) & ~7;
1643 else 1643 else
1644 name_ofs = (offsetof(ATTR_REC, 1644 name_ofs = (offsetof(ATTR_REC,
1645 data.non_resident.compressed_size) + 7) & ~7; 1645 data.non_resident.compressed_size) + 7) & ~7;
1646 mp_ofs = (name_ofs + a->name_length * sizeof(ntfschar) + 7) & ~7; 1646 mp_ofs = (name_ofs + a->name_length * sizeof(ntfschar) + 7) & ~7;
1647 /* 1647 /*
1648 * Determine the size of the resident part of the now non-resident 1648 * Determine the size of the resident part of the now non-resident
1649 * attribute record. 1649 * attribute record.
1650 */ 1650 */
1651 arec_size = (mp_ofs + mp_size + 7) & ~7; 1651 arec_size = (mp_ofs + mp_size + 7) & ~7;
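/*
 * For illustration (offsets assumed from the usual 0x40-byte
 * non-resident attribute header, mp_size picked arbitrarily): for an
 * unnamed, non-sparse, non-compressed attribute both name_ofs and
 * mp_ofs come out as 0x40, so with mp_size == 9 the record is sized as
 * arec_size = (0x40 + 9 + 7) & ~7 = 0x50, keeping everything 8-byte
 * aligned.
 */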
1652 /* 1652 /*
1653 * If the page is not uptodate bring it uptodate by copying from the 1653 * If the page is not uptodate bring it uptodate by copying from the
1654 * attribute value. 1654 * attribute value.
1655 */ 1655 */
1656 attr_size = le32_to_cpu(a->data.resident.value_length); 1656 attr_size = le32_to_cpu(a->data.resident.value_length);
1657 BUG_ON(attr_size != data_size); 1657 BUG_ON(attr_size != data_size);
1658 if (page && !PageUptodate(page)) { 1658 if (page && !PageUptodate(page)) {
1659 kaddr = kmap_atomic(page); 1659 kaddr = kmap_atomic(page);
1660 memcpy(kaddr, (u8*)a + 1660 memcpy(kaddr, (u8*)a +
1661 le16_to_cpu(a->data.resident.value_offset), 1661 le16_to_cpu(a->data.resident.value_offset),
1662 attr_size); 1662 attr_size);
1663 memset(kaddr + attr_size, 0, PAGE_CACHE_SIZE - attr_size); 1663 memset(kaddr + attr_size, 0, PAGE_CACHE_SIZE - attr_size);
1664 kunmap_atomic(kaddr); 1664 kunmap_atomic(kaddr);
1665 flush_dcache_page(page); 1665 flush_dcache_page(page);
1666 SetPageUptodate(page); 1666 SetPageUptodate(page);
1667 } 1667 }
1668 /* Backup the attribute flag. */ 1668 /* Backup the attribute flag. */
1669 old_res_attr_flags = a->data.resident.flags; 1669 old_res_attr_flags = a->data.resident.flags;
1670 /* Resize the resident part of the attribute record. */ 1670 /* Resize the resident part of the attribute record. */
1671 err = ntfs_attr_record_resize(m, a, arec_size); 1671 err = ntfs_attr_record_resize(m, a, arec_size);
1672 if (unlikely(err)) 1672 if (unlikely(err))
1673 goto err_out; 1673 goto err_out;
1674 /* 1674 /*
1675 * Convert the resident part of the attribute record to describe a 1675 * Convert the resident part of the attribute record to describe a
1676 * non-resident attribute. 1676 * non-resident attribute.
1677 */ 1677 */
1678 a->non_resident = 1; 1678 a->non_resident = 1;
1679 /* Move the attribute name if it exists and update the offset. */ 1679 /* Move the attribute name if it exists and update the offset. */
1680 if (a->name_length) 1680 if (a->name_length)
1681 memmove((u8*)a + name_ofs, (u8*)a + le16_to_cpu(a->name_offset), 1681 memmove((u8*)a + name_ofs, (u8*)a + le16_to_cpu(a->name_offset),
1682 a->name_length * sizeof(ntfschar)); 1682 a->name_length * sizeof(ntfschar));
1683 a->name_offset = cpu_to_le16(name_ofs); 1683 a->name_offset = cpu_to_le16(name_ofs);
1684 /* Setup the fields specific to non-resident attributes. */ 1684 /* Setup the fields specific to non-resident attributes. */
1685 a->data.non_resident.lowest_vcn = 0; 1685 a->data.non_resident.lowest_vcn = 0;
1686 a->data.non_resident.highest_vcn = cpu_to_sle64((new_size - 1) >> 1686 a->data.non_resident.highest_vcn = cpu_to_sle64((new_size - 1) >>
1687 vol->cluster_size_bits); 1687 vol->cluster_size_bits);
1688 a->data.non_resident.mapping_pairs_offset = cpu_to_le16(mp_ofs); 1688 a->data.non_resident.mapping_pairs_offset = cpu_to_le16(mp_ofs);
1689 memset(&a->data.non_resident.reserved, 0, 1689 memset(&a->data.non_resident.reserved, 0,
1690 sizeof(a->data.non_resident.reserved)); 1690 sizeof(a->data.non_resident.reserved));
1691 a->data.non_resident.allocated_size = cpu_to_sle64(new_size); 1691 a->data.non_resident.allocated_size = cpu_to_sle64(new_size);
1692 a->data.non_resident.data_size = 1692 a->data.non_resident.data_size =
1693 a->data.non_resident.initialized_size = 1693 a->data.non_resident.initialized_size =
1694 cpu_to_sle64(attr_size); 1694 cpu_to_sle64(attr_size);
1695 if (NInoSparse(ni) || NInoCompressed(ni)) { 1695 if (NInoSparse(ni) || NInoCompressed(ni)) {
1696 a->data.non_resident.compression_unit = 0; 1696 a->data.non_resident.compression_unit = 0;
1697 if (NInoCompressed(ni) || vol->major_ver < 3) 1697 if (NInoCompressed(ni) || vol->major_ver < 3)
1698 a->data.non_resident.compression_unit = 4; 1698 a->data.non_resident.compression_unit = 4;
1699 a->data.non_resident.compressed_size = 1699 a->data.non_resident.compressed_size =
1700 a->data.non_resident.allocated_size; 1700 a->data.non_resident.allocated_size;
1701 } else 1701 } else
1702 a->data.non_resident.compression_unit = 0; 1702 a->data.non_resident.compression_unit = 0;
1703 /* Generate the mapping pairs array into the attribute record. */ 1703 /* Generate the mapping pairs array into the attribute record. */
1704 err = ntfs_mapping_pairs_build(vol, (u8*)a + mp_ofs, 1704 err = ntfs_mapping_pairs_build(vol, (u8*)a + mp_ofs,
1705 arec_size - mp_ofs, rl, 0, -1, NULL); 1705 arec_size - mp_ofs, rl, 0, -1, NULL);
1706 if (unlikely(err)) { 1706 if (unlikely(err)) {
1707 ntfs_debug("Failed to build mapping pairs, error code %i.", 1707 ntfs_debug("Failed to build mapping pairs, error code %i.",
1708 err); 1708 err);
1709 goto undo_err_out; 1709 goto undo_err_out;
1710 } 1710 }
1711 /* Setup the in-memory attribute structure to be non-resident. */ 1711 /* Setup the in-memory attribute structure to be non-resident. */
1712 ni->runlist.rl = rl; 1712 ni->runlist.rl = rl;
1713 write_lock_irqsave(&ni->size_lock, flags); 1713 write_lock_irqsave(&ni->size_lock, flags);
1714 ni->allocated_size = new_size; 1714 ni->allocated_size = new_size;
1715 if (NInoSparse(ni) || NInoCompressed(ni)) { 1715 if (NInoSparse(ni) || NInoCompressed(ni)) {
1716 ni->itype.compressed.size = ni->allocated_size; 1716 ni->itype.compressed.size = ni->allocated_size;
1717 if (a->data.non_resident.compression_unit) { 1717 if (a->data.non_resident.compression_unit) {
1718 ni->itype.compressed.block_size = 1U << (a->data. 1718 ni->itype.compressed.block_size = 1U << (a->data.
1719 non_resident.compression_unit + 1719 non_resident.compression_unit +
1720 vol->cluster_size_bits); 1720 vol->cluster_size_bits);
1721 ni->itype.compressed.block_size_bits = 1721 ni->itype.compressed.block_size_bits =
1722 ffs(ni->itype.compressed.block_size) - 1722 ffs(ni->itype.compressed.block_size) -
1723 1; 1723 1;
1724 ni->itype.compressed.block_clusters = 1U << 1724 ni->itype.compressed.block_clusters = 1U <<
1725 a->data.non_resident.compression_unit; 1725 a->data.non_resident.compression_unit;
1726 } else { 1726 } else {
1727 ni->itype.compressed.block_size = 0; 1727 ni->itype.compressed.block_size = 0;
1728 ni->itype.compressed.block_size_bits = 0; 1728 ni->itype.compressed.block_size_bits = 0;
1729 ni->itype.compressed.block_clusters = 0; 1729 ni->itype.compressed.block_clusters = 0;
1730 } 1730 }
1731 vi->i_blocks = ni->itype.compressed.size >> 9; 1731 vi->i_blocks = ni->itype.compressed.size >> 9;
1732 } else 1732 } else
1733 vi->i_blocks = ni->allocated_size >> 9; 1733 vi->i_blocks = ni->allocated_size >> 9;
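/*
 * i_blocks is kept in 512-byte units, hence the ">> 9" above: an
 * allocated_size of 4096 bytes, for example, shows up as i_blocks == 8.
 */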
1734 write_unlock_irqrestore(&ni->size_lock, flags); 1734 write_unlock_irqrestore(&ni->size_lock, flags);
1735 /* 1735 /*
1736 * This needs to be last since the address space operations ->readpage 1736 * This needs to be last since the address space operations ->readpage
1737 * and ->writepage can run concurrently with us as they are not 1737 * and ->writepage can run concurrently with us as they are not
1738 * serialized on i_mutex. Note, we are not allowed to fail once we flip 1738 * serialized on i_mutex. Note, we are not allowed to fail once we flip
1739 * this switch, which is another reason to do this last. 1739 * this switch, which is another reason to do this last.
1740 */ 1740 */
1741 NInoSetNonResident(ni); 1741 NInoSetNonResident(ni);
1742 /* Mark the mft record dirty, so it gets written back. */ 1742 /* Mark the mft record dirty, so it gets written back. */
1743 flush_dcache_mft_record_page(ctx->ntfs_ino); 1743 flush_dcache_mft_record_page(ctx->ntfs_ino);
1744 mark_mft_record_dirty(ctx->ntfs_ino); 1744 mark_mft_record_dirty(ctx->ntfs_ino);
1745 ntfs_attr_put_search_ctx(ctx); 1745 ntfs_attr_put_search_ctx(ctx);
1746 unmap_mft_record(base_ni); 1746 unmap_mft_record(base_ni);
1747 up_write(&ni->runlist.lock); 1747 up_write(&ni->runlist.lock);
1748 if (page) { 1748 if (page) {
1749 set_page_dirty(page); 1749 set_page_dirty(page);
1750 unlock_page(page); 1750 unlock_page(page);
1751 mark_page_accessed(page);
1752 page_cache_release(page); 1751 page_cache_release(page);
1753 } 1752 }
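/*
 * The mark_page_accessed(page) call that used to sit between
 * unlock_page() and page_cache_release() above is intentionally gone:
 * find_or_create_page() now returns the page already marked accessed,
 * so calling mark_page_accessed() again here would be redundant.
 */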
1754 ntfs_debug("Done."); 1753 ntfs_debug("Done.");
1755 return 0; 1754 return 0;
1756 undo_err_out: 1755 undo_err_out:
1757 /* Convert the attribute back into a resident attribute. */ 1756 /* Convert the attribute back into a resident attribute. */
1758 a->non_resident = 0; 1757 a->non_resident = 0;
1759 /* Move the attribute name if it exists and update the offset. */ 1758 /* Move the attribute name if it exists and update the offset. */
1760 name_ofs = (offsetof(ATTR_RECORD, data.resident.reserved) + 1759 name_ofs = (offsetof(ATTR_RECORD, data.resident.reserved) +
1761 sizeof(a->data.resident.reserved) + 7) & ~7; 1760 sizeof(a->data.resident.reserved) + 7) & ~7;
1762 if (a->name_length) 1761 if (a->name_length)
1763 memmove((u8*)a + name_ofs, (u8*)a + le16_to_cpu(a->name_offset), 1762 memmove((u8*)a + name_ofs, (u8*)a + le16_to_cpu(a->name_offset),
1764 a->name_length * sizeof(ntfschar)); 1763 a->name_length * sizeof(ntfschar));
1765 mp_ofs = (name_ofs + a->name_length * sizeof(ntfschar) + 7) & ~7; 1764 mp_ofs = (name_ofs + a->name_length * sizeof(ntfschar) + 7) & ~7;
1766 a->name_offset = cpu_to_le16(name_ofs); 1765 a->name_offset = cpu_to_le16(name_ofs);
1767 arec_size = (mp_ofs + attr_size + 7) & ~7; 1766 arec_size = (mp_ofs + attr_size + 7) & ~7;
1768 /* Resize the resident part of the attribute record. */ 1767 /* Resize the resident part of the attribute record. */
1769 err2 = ntfs_attr_record_resize(m, a, arec_size); 1768 err2 = ntfs_attr_record_resize(m, a, arec_size);
1770 if (unlikely(err2)) { 1769 if (unlikely(err2)) {
1771 /* 1770 /*
1772 * This cannot happen (well if memory corruption is at work it 1771 * This cannot happen (well if memory corruption is at work it
1773 * could happen in theory), but deal with it as well as we can. 1772 * could happen in theory), but deal with it as well as we can.
1774 * If the old size is too small, truncate the attribute, 1773 * If the old size is too small, truncate the attribute,
1775 * otherwise simply give it a larger allocated size. 1774 * otherwise simply give it a larger allocated size.
1776 * FIXME: Should check whether chkdsk complains when the 1775 * FIXME: Should check whether chkdsk complains when the
1777 * allocated size is much bigger than the resident value size. 1776 * allocated size is much bigger than the resident value size.
1778 */ 1777 */
1779 arec_size = le32_to_cpu(a->length); 1778 arec_size = le32_to_cpu(a->length);
1780 if ((mp_ofs + attr_size) > arec_size) { 1779 if ((mp_ofs + attr_size) > arec_size) {
1781 err2 = attr_size; 1780 err2 = attr_size;
1782 attr_size = arec_size - mp_ofs; 1781 attr_size = arec_size - mp_ofs;
1783 ntfs_error(vol->sb, "Failed to undo partial resident " 1782 ntfs_error(vol->sb, "Failed to undo partial resident "
1784 "to non-resident attribute " 1783 "to non-resident attribute "
1785 "conversion. Truncating inode 0x%lx, " 1784 "conversion. Truncating inode 0x%lx, "
1786 "attribute type 0x%x from %i bytes to " 1785 "attribute type 0x%x from %i bytes to "
1787 "%i bytes to maintain metadata " 1786 "%i bytes to maintain metadata "
1788 "consistency. THIS MEANS YOU ARE " 1787 "consistency. THIS MEANS YOU ARE "
1789 "LOSING %i BYTES DATA FROM THIS %s.", 1788 "LOSING %i BYTES DATA FROM THIS %s.",
1790 vi->i_ino, 1789 vi->i_ino,
1791 (unsigned)le32_to_cpu(ni->type), 1790 (unsigned)le32_to_cpu(ni->type),
1792 err2, attr_size, err2 - attr_size, 1791 err2, attr_size, err2 - attr_size,
1793 ((ni->type == AT_DATA) && 1792 ((ni->type == AT_DATA) &&
1794 !ni->name_len) ? "FILE": "ATTRIBUTE"); 1793 !ni->name_len) ? "FILE": "ATTRIBUTE");
1795 write_lock_irqsave(&ni->size_lock, flags); 1794 write_lock_irqsave(&ni->size_lock, flags);
1796 ni->initialized_size = attr_size; 1795 ni->initialized_size = attr_size;
1797 i_size_write(vi, attr_size); 1796 i_size_write(vi, attr_size);
1798 write_unlock_irqrestore(&ni->size_lock, flags); 1797 write_unlock_irqrestore(&ni->size_lock, flags);
1799 } 1798 }
1800 } 1799 }
1801 /* Setup the fields specific to resident attributes. */ 1800 /* Setup the fields specific to resident attributes. */
1802 a->data.resident.value_length = cpu_to_le32(attr_size); 1801 a->data.resident.value_length = cpu_to_le32(attr_size);
1803 a->data.resident.value_offset = cpu_to_le16(mp_ofs); 1802 a->data.resident.value_offset = cpu_to_le16(mp_ofs);
1804 a->data.resident.flags = old_res_attr_flags; 1803 a->data.resident.flags = old_res_attr_flags;
1805 memset(&a->data.resident.reserved, 0, 1804 memset(&a->data.resident.reserved, 0,
1806 sizeof(a->data.resident.reserved)); 1805 sizeof(a->data.resident.reserved));
1807 /* Copy the data from the page back to the attribute value. */ 1806 /* Copy the data from the page back to the attribute value. */
1808 if (page) { 1807 if (page) {
1809 kaddr = kmap_atomic(page); 1808 kaddr = kmap_atomic(page);
1810 memcpy((u8*)a + mp_ofs, kaddr, attr_size); 1809 memcpy((u8*)a + mp_ofs, kaddr, attr_size);
1811 kunmap_atomic(kaddr); 1810 kunmap_atomic(kaddr);
1812 } 1811 }
1813 /* Setup the allocated size in the ntfs inode in case it changed. */ 1812 /* Setup the allocated size in the ntfs inode in case it changed. */
1814 write_lock_irqsave(&ni->size_lock, flags); 1813 write_lock_irqsave(&ni->size_lock, flags);
1815 ni->allocated_size = arec_size - mp_ofs; 1814 ni->allocated_size = arec_size - mp_ofs;
1816 write_unlock_irqrestore(&ni->size_lock, flags); 1815 write_unlock_irqrestore(&ni->size_lock, flags);
1817 /* Mark the mft record dirty, so it gets written back. */ 1816 /* Mark the mft record dirty, so it gets written back. */
1818 flush_dcache_mft_record_page(ctx->ntfs_ino); 1817 flush_dcache_mft_record_page(ctx->ntfs_ino);
1819 mark_mft_record_dirty(ctx->ntfs_ino); 1818 mark_mft_record_dirty(ctx->ntfs_ino);
1820 err_out: 1819 err_out:
1821 if (ctx) 1820 if (ctx)
1822 ntfs_attr_put_search_ctx(ctx); 1821 ntfs_attr_put_search_ctx(ctx);
1823 if (m) 1822 if (m)
1824 unmap_mft_record(base_ni); 1823 unmap_mft_record(base_ni);
1825 ni->runlist.rl = NULL; 1824 ni->runlist.rl = NULL;
1826 up_write(&ni->runlist.lock); 1825 up_write(&ni->runlist.lock);
1827 rl_err_out: 1826 rl_err_out:
1828 if (rl) { 1827 if (rl) {
1829 if (ntfs_cluster_free_from_rl(vol, rl) < 0) { 1828 if (ntfs_cluster_free_from_rl(vol, rl) < 0) {
1830 ntfs_error(vol->sb, "Failed to release allocated " 1829 ntfs_error(vol->sb, "Failed to release allocated "
1831 "cluster(s) in error code path. Run " 1830 "cluster(s) in error code path. Run "
1832 "chkdsk to recover the lost " 1831 "chkdsk to recover the lost "
1833 "cluster(s)."); 1832 "cluster(s).");
1834 NVolSetErrors(vol); 1833 NVolSetErrors(vol);
1835 } 1834 }
1836 ntfs_free(rl); 1835 ntfs_free(rl);
1837 page_err_out: 1836 page_err_out:
1838 unlock_page(page); 1837 unlock_page(page);
1839 page_cache_release(page); 1838 page_cache_release(page);
1840 } 1839 }
1841 if (err == -EINVAL) 1840 if (err == -EINVAL)
1842 err = -EIO; 1841 err = -EIO;
1843 return err; 1842 return err;
1844 } 1843 }
1845 1844
1846 /** 1845 /**
1847 * ntfs_attr_extend_allocation - extend the allocated space of an attribute 1846 * ntfs_attr_extend_allocation - extend the allocated space of an attribute
1848 * @ni: ntfs inode of the attribute whose allocation to extend 1847 * @ni: ntfs inode of the attribute whose allocation to extend
1849 * @new_alloc_size: new size in bytes to which to extend the allocation 1848 * @new_alloc_size: new size in bytes to which to extend the allocation
1850 * @new_data_size: new size in bytes to which to extend the data 1849 * @new_data_size: new size in bytes to which to extend the data
1851 * @data_start: beginning of region which is required to be non-sparse 1850 * @data_start: beginning of region which is required to be non-sparse
1852 * 1851 *
1853 * Extend the allocated space of an attribute described by the ntfs inode @ni 1852 * Extend the allocated space of an attribute described by the ntfs inode @ni
1854 * to @new_alloc_size bytes. If @data_start is -1, the whole extension may be 1853 * to @new_alloc_size bytes. If @data_start is -1, the whole extension may be
1855 * implemented as a hole in the file (as long as both the volume and the ntfs 1854 * implemented as a hole in the file (as long as both the volume and the ntfs
1856 * inode @ni have sparse support enabled). If @data_start is >= 0, then the 1855 * inode @ni have sparse support enabled). If @data_start is >= 0, then the
1857 * region between the old allocated size and @data_start - 1 may be made sparse 1856 * region between the old allocated size and @data_start - 1 may be made sparse
1858 * but the regions between @data_start and @new_alloc_size must be backed by 1857 * but the regions between @data_start and @new_alloc_size must be backed by
1859 * actual clusters. 1858 * actual clusters.
1860 * 1859 *
1861 * If @new_data_size is -1, it is ignored. If it is >= 0, then the data size 1860 * If @new_data_size is -1, it is ignored. If it is >= 0, then the data size
1862 * of the attribute is extended to @new_data_size. Note that the i_size of the 1861 * of the attribute is extended to @new_data_size. Note that the i_size of the
1863 * vfs inode is not updated. Only the data size in the base attribute record 1862 * vfs inode is not updated. Only the data size in the base attribute record
1864 * is updated. The caller has to update i_size separately if this is required. 1863 * is updated. The caller has to update i_size separately if this is required.
1865 * WARNING: It is a BUG() for @new_data_size to be smaller than the old data 1864 * WARNING: It is a BUG() for @new_data_size to be smaller than the old data
1866 * size as well as for @new_data_size to be greater than @new_alloc_size. 1865 * size as well as for @new_data_size to be greater than @new_alloc_size.
1867 * 1866 *
1868 * For resident attributes this involves resizing the attribute record and if 1867 * For resident attributes this involves resizing the attribute record and if
1869 * necessary moving it and/or other attributes into extent mft records and/or 1868 * necessary moving it and/or other attributes into extent mft records and/or
1870 * converting the attribute to a non-resident attribute which in turn involves 1869 * converting the attribute to a non-resident attribute which in turn involves
1871 * extending the allocation of a non-resident attribute as described below. 1870 * extending the allocation of a non-resident attribute as described below.
1872 * 1871 *
1873 * For non-resident attributes this involves allocating clusters in the data 1872 * For non-resident attributes this involves allocating clusters in the data
1874 * zone on the volume (except for regions that are being made sparse) and 1873 * zone on the volume (except for regions that are being made sparse) and
1875 * extending the run list to describe the allocated clusters as well as 1874 * extending the run list to describe the allocated clusters as well as
1876 * updating the mapping pairs array of the attribute. This in turn involves 1875 * updating the mapping pairs array of the attribute. This in turn involves
1877 * resizing the attribute record and if necessary moving it and/or other 1876 * resizing the attribute record and if necessary moving it and/or other
1878 * attributes into extent mft records and/or splitting the attribute record 1877 * attributes into extent mft records and/or splitting the attribute record
1879 * into multiple extent attribute records. 1878 * into multiple extent attribute records.
1880 * 1879 *
1881 * Also, the attribute list attribute is updated if present and in some of the 1880 * Also, the attribute list attribute is updated if present and in some of the
1882 * above cases (the ones where extent mft records/attributes come into play), 1881 * above cases (the ones where extent mft records/attributes come into play),
1883 * an attribute list attribute is created if not already present. 1882 * an attribute list attribute is created if not already present.
1884 * 1883 *
1885 * Return the new allocated size on success and -errno on error. In the case 1884 * Return the new allocated size on success and -errno on error. In the case
1886 * that an error is encountered but a partial extension at least up to 1885 * that an error is encountered but a partial extension at least up to
1887 * @data_start (if present) is possible, the allocation is partially extended 1886 * @data_start (if present) is possible, the allocation is partially extended
1888 * and this is returned. This means the caller must check the returned size to 1887 * and this is returned. This means the caller must check the returned size to
1889 * determine if the extension was partial. If @data_start is -1 then partial 1888 * determine if the extension was partial. If @data_start is -1 then partial
1890 * allocations are not performed. 1889 * allocations are not performed.
1891 * 1890 *
1892 * WARNING: Do not call ntfs_attr_extend_allocation() for $MFT/$DATA. 1891 * WARNING: Do not call ntfs_attr_extend_allocation() for $MFT/$DATA.
1893 * 1892 *
1894 * Locking: This function takes the runlist lock of @ni for writing as well as 1893 * Locking: This function takes the runlist lock of @ni for writing as well as
1895 * locking the mft record of the base ntfs inode. These locks are maintained 1894 * locking the mft record of the base ntfs inode. These locks are maintained
1896 * throughout execution of the function. These locks are required so that the 1895 * throughout execution of the function. These locks are required so that the
1897 * attribute can be resized safely and so that it can for example be converted 1896 * attribute can be resized safely and so that it can for example be converted
1898 * from resident to non-resident safely. 1897 * from resident to non-resident safely.
1899 * 1898 *
1900 * TODO: At present attribute list attribute handling is not implemented. 1899 * TODO: At present attribute list attribute handling is not implemented.
1901 * 1900 *
1902 * TODO: At present it is not safe to call this function for anything other 1901 * TODO: At present it is not safe to call this function for anything other
1903 * than the $DATA attribute(s) of an uncompressed and unencrypted file. 1902 * than the $DATA attribute(s) of an uncompressed and unencrypted file.
1904 */ 1903 */
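/*
 * Hypothetical caller sketch (illustrative only, not code from this
 * driver; "got", "end" and "pos" are made-up names): the return value,
 * not the requested size, tells the caller how far it may write:
 *
 *	s64 got = ntfs_attr_extend_allocation(ni, end, end, pos);
 *	if (got < 0)
 *		return got;	// nothing could be extended
 *	if (got < end)
 *		end = got;	// partial extension: shorten the write
 *
 * since a partial extension reaching at least @data_start is still
 * reported as a (smaller) success.
 */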
1905 s64 ntfs_attr_extend_allocation(ntfs_inode *ni, s64 new_alloc_size, 1904 s64 ntfs_attr_extend_allocation(ntfs_inode *ni, s64 new_alloc_size,
1906 const s64 new_data_size, const s64 data_start) 1905 const s64 new_data_size, const s64 data_start)
1907 { 1906 {
1908 VCN vcn; 1907 VCN vcn;
1909 s64 ll, allocated_size, start = data_start; 1908 s64 ll, allocated_size, start = data_start;
1910 struct inode *vi = VFS_I(ni); 1909 struct inode *vi = VFS_I(ni);
1911 ntfs_volume *vol = ni->vol; 1910 ntfs_volume *vol = ni->vol;
1912 ntfs_inode *base_ni; 1911 ntfs_inode *base_ni;
1913 MFT_RECORD *m; 1912 MFT_RECORD *m;
1914 ATTR_RECORD *a; 1913 ATTR_RECORD *a;
1915 ntfs_attr_search_ctx *ctx; 1914 ntfs_attr_search_ctx *ctx;
1916 runlist_element *rl, *rl2; 1915 runlist_element *rl, *rl2;
1917 unsigned long flags; 1916 unsigned long flags;
1918 int err, mp_size; 1917 int err, mp_size;
1919 u32 attr_len = 0; /* Silence stupid gcc warning. */ 1918 u32 attr_len = 0; /* Silence stupid gcc warning. */
1920 bool mp_rebuilt; 1919 bool mp_rebuilt;
1921 1920
1922 #ifdef DEBUG 1921 #ifdef DEBUG
1923 read_lock_irqsave(&ni->size_lock, flags); 1922 read_lock_irqsave(&ni->size_lock, flags);
1924 allocated_size = ni->allocated_size; 1923 allocated_size = ni->allocated_size;
1925 read_unlock_irqrestore(&ni->size_lock, flags); 1924 read_unlock_irqrestore(&ni->size_lock, flags);
1926 ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, " 1925 ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, "
1927 "old_allocated_size 0x%llx, " 1926 "old_allocated_size 0x%llx, "
1928 "new_allocated_size 0x%llx, new_data_size 0x%llx, " 1927 "new_allocated_size 0x%llx, new_data_size 0x%llx, "
1929 "data_start 0x%llx.", vi->i_ino, 1928 "data_start 0x%llx.", vi->i_ino,
1930 (unsigned)le32_to_cpu(ni->type), 1929 (unsigned)le32_to_cpu(ni->type),
1931 (unsigned long long)allocated_size, 1930 (unsigned long long)allocated_size,
1932 (unsigned long long)new_alloc_size, 1931 (unsigned long long)new_alloc_size,
1933 (unsigned long long)new_data_size, 1932 (unsigned long long)new_data_size,
1934 (unsigned long long)start); 1933 (unsigned long long)start);
1935 #endif 1934 #endif
1936 retry_extend: 1935 retry_extend:
1937 /* 1936 /*
1938 * For non-resident attributes, @start and @new_size need to be aligned 1937 * For non-resident attributes, @start and @new_size need to be aligned
1939 * to cluster boundaries for allocation purposes. 1938 * to cluster boundaries for allocation purposes.
1940 */ 1939 */
1941 if (NInoNonResident(ni)) { 1940 if (NInoNonResident(ni)) {
1942 if (start > 0) 1941 if (start > 0)
1943 start &= ~(s64)vol->cluster_size_mask; 1942 start &= ~(s64)vol->cluster_size_mask;
1944 new_alloc_size = (new_alloc_size + vol->cluster_size - 1) & 1943 new_alloc_size = (new_alloc_size + vol->cluster_size - 1) &
1945 ~(s64)vol->cluster_size_mask; 1944 ~(s64)vol->cluster_size_mask;
1946 } 1945 }
1947 BUG_ON(new_data_size >= 0 && new_data_size > new_alloc_size); 1946 BUG_ON(new_data_size >= 0 && new_data_size > new_alloc_size);
1948 /* Check if new size is allowed in $AttrDef. */ 1947 /* Check if new size is allowed in $AttrDef. */
1949 err = ntfs_attr_size_bounds_check(vol, ni->type, new_alloc_size); 1948 err = ntfs_attr_size_bounds_check(vol, ni->type, new_alloc_size);
1950 if (unlikely(err)) { 1949 if (unlikely(err)) {
1951 /* Only emit errors when the write will fail completely. */ 1950 /* Only emit errors when the write will fail completely. */
1952 read_lock_irqsave(&ni->size_lock, flags); 1951 read_lock_irqsave(&ni->size_lock, flags);
1953 allocated_size = ni->allocated_size; 1952 allocated_size = ni->allocated_size;
1954 read_unlock_irqrestore(&ni->size_lock, flags); 1953 read_unlock_irqrestore(&ni->size_lock, flags);
1955 if (start < 0 || start >= allocated_size) { 1954 if (start < 0 || start >= allocated_size) {
1956 if (err == -ERANGE) { 1955 if (err == -ERANGE) {
1957 ntfs_error(vol->sb, "Cannot extend allocation " 1956 ntfs_error(vol->sb, "Cannot extend allocation "
1958 "of inode 0x%lx, attribute " 1957 "of inode 0x%lx, attribute "
1959 "type 0x%x, because the new " 1958 "type 0x%x, because the new "
1960 "allocation would exceed the " 1959 "allocation would exceed the "
1961 "maximum allowed size for " 1960 "maximum allowed size for "
1962 "this attribute type.", 1961 "this attribute type.",
1963 vi->i_ino, (unsigned) 1962 vi->i_ino, (unsigned)
1964 le32_to_cpu(ni->type)); 1963 le32_to_cpu(ni->type));
1965 } else { 1964 } else {
1966 ntfs_error(vol->sb, "Cannot extend allocation " 1965 ntfs_error(vol->sb, "Cannot extend allocation "
1967 "of inode 0x%lx, attribute " 1966 "of inode 0x%lx, attribute "
1968 "type 0x%x, because this " 1967 "type 0x%x, because this "
1969 "attribute type is not " 1968 "attribute type is not "
1970 "defined on the NTFS volume. " 1969 "defined on the NTFS volume. "
1971 "Possible corruption! You " 1970 "Possible corruption! You "
1972 "should run chkdsk!", 1971 "should run chkdsk!",
1973 vi->i_ino, (unsigned) 1972 vi->i_ino, (unsigned)
1974 le32_to_cpu(ni->type)); 1973 le32_to_cpu(ni->type));
1975 } 1974 }
1976 } 1975 }
1977 /* Translate error code to be POSIX conformant for write(2). */ 1976 /* Translate error code to be POSIX conformant for write(2). */
1978 if (err == -ERANGE) 1977 if (err == -ERANGE)
1979 err = -EFBIG; 1978 err = -EFBIG;
1980 else 1979 else
1981 err = -EIO; 1980 err = -EIO;
1982 return err; 1981 return err;
1983 } 1982 }
1984 if (!NInoAttr(ni)) 1983 if (!NInoAttr(ni))
1985 base_ni = ni; 1984 base_ni = ni;
1986 else 1985 else
1987 base_ni = ni->ext.base_ntfs_ino; 1986 base_ni = ni->ext.base_ntfs_ino;
1988 /* 1987 /*
1989 * We will be modifying both the runlist (if non-resident) and the mft 1988 * We will be modifying both the runlist (if non-resident) and the mft
1990 * record so lock them both down. 1989 * record so lock them both down.
1991 */ 1990 */
1992 down_write(&ni->runlist.lock); 1991 down_write(&ni->runlist.lock);
1993 m = map_mft_record(base_ni); 1992 m = map_mft_record(base_ni);
1994 if (IS_ERR(m)) { 1993 if (IS_ERR(m)) {
1995 err = PTR_ERR(m); 1994 err = PTR_ERR(m);
1996 m = NULL; 1995 m = NULL;
1997 ctx = NULL; 1996 ctx = NULL;
1998 goto err_out; 1997 goto err_out;
1999 } 1998 }
2000 ctx = ntfs_attr_get_search_ctx(base_ni, m); 1999 ctx = ntfs_attr_get_search_ctx(base_ni, m);
2001 if (unlikely(!ctx)) { 2000 if (unlikely(!ctx)) {
2002 err = -ENOMEM; 2001 err = -ENOMEM;
2003 goto err_out; 2002 goto err_out;
2004 } 2003 }
2005 read_lock_irqsave(&ni->size_lock, flags); 2004 read_lock_irqsave(&ni->size_lock, flags);
2006 allocated_size = ni->allocated_size; 2005 allocated_size = ni->allocated_size;
2007 read_unlock_irqrestore(&ni->size_lock, flags); 2006 read_unlock_irqrestore(&ni->size_lock, flags);
2008 /* 2007 /*
2009 * If non-resident, seek to the last extent. If resident, there is 2008 * If non-resident, seek to the last extent. If resident, there is
2010 * only one extent, so seek to that. 2009 * only one extent, so seek to that.
2011 */ 2010 */
2012 vcn = NInoNonResident(ni) ? allocated_size >> vol->cluster_size_bits : 2011 vcn = NInoNonResident(ni) ? allocated_size >> vol->cluster_size_bits :
2013 0; 2012 0;
2014 /* 2013 /*
2015 * Abort if someone did the work whilst we waited for the locks. If we 2014 * Abort if someone did the work whilst we waited for the locks. If we
2016 * just converted the attribute from resident to non-resident it is 2015 * just converted the attribute from resident to non-resident it is
2017 * likely that exactly this has happened already. We cannot quite 2016 * likely that exactly this has happened already. We cannot quite
2018 * abort if we need to update the data size. 2017 * abort if we need to update the data size.
2019 */ 2018 */
2020 if (unlikely(new_alloc_size <= allocated_size)) { 2019 if (unlikely(new_alloc_size <= allocated_size)) {
2021 ntfs_debug("Allocated size already exceeds requested size."); 2020 ntfs_debug("Allocated size already exceeds requested size.");
2022 new_alloc_size = allocated_size; 2021 new_alloc_size = allocated_size;
2023 if (new_data_size < 0) 2022 if (new_data_size < 0)
2024 goto done; 2023 goto done;
2025 /* 2024 /*
2026 * We want the first attribute extent so that we can update the 2025 * We want the first attribute extent so that we can update the
2027 * data size. 2026 * data size.
2028 */ 2027 */
2029 vcn = 0; 2028 vcn = 0;
2030 } 2029 }
2031 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, 2030 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
2032 CASE_SENSITIVE, vcn, NULL, 0, ctx); 2031 CASE_SENSITIVE, vcn, NULL, 0, ctx);
2033 if (unlikely(err)) { 2032 if (unlikely(err)) {
2034 if (err == -ENOENT) 2033 if (err == -ENOENT)
2035 err = -EIO; 2034 err = -EIO;
2036 goto err_out; 2035 goto err_out;
2037 } 2036 }
2038 m = ctx->mrec; 2037 m = ctx->mrec;
2039 a = ctx->attr; 2038 a = ctx->attr;
2040 /* Use goto to reduce indentation. */ 2039 /* Use goto to reduce indentation. */
2041 if (a->non_resident) 2040 if (a->non_resident)
2042 goto do_non_resident_extend; 2041 goto do_non_resident_extend;
2043 BUG_ON(NInoNonResident(ni)); 2042 BUG_ON(NInoNonResident(ni));
2044 /* The total length of the attribute value. */ 2043 /* The total length of the attribute value. */
2045 attr_len = le32_to_cpu(a->data.resident.value_length); 2044 attr_len = le32_to_cpu(a->data.resident.value_length);
2046 /* 2045 /*
2047 * Extend the attribute record to be able to store the new attribute 2046 * Extend the attribute record to be able to store the new attribute
2048 * size. ntfs_attr_record_resize() will not do anything if the size is 2047 * size. ntfs_attr_record_resize() will not do anything if the size is
2049 * not changing. 2048 * not changing.
2050 */ 2049 */
2051 if (new_alloc_size < vol->mft_record_size && 2050 if (new_alloc_size < vol->mft_record_size &&
2052 !ntfs_attr_record_resize(m, a, 2051 !ntfs_attr_record_resize(m, a,
2053 le16_to_cpu(a->data.resident.value_offset) + 2052 le16_to_cpu(a->data.resident.value_offset) +
2054 new_alloc_size)) { 2053 new_alloc_size)) {
2055 /* The resize succeeded! */ 2054 /* The resize succeeded! */
2056 write_lock_irqsave(&ni->size_lock, flags); 2055 write_lock_irqsave(&ni->size_lock, flags);
2057 ni->allocated_size = le32_to_cpu(a->length) - 2056 ni->allocated_size = le32_to_cpu(a->length) -
2058 le16_to_cpu(a->data.resident.value_offset); 2057 le16_to_cpu(a->data.resident.value_offset);
2059 write_unlock_irqrestore(&ni->size_lock, flags); 2058 write_unlock_irqrestore(&ni->size_lock, flags);
2060 if (new_data_size >= 0) { 2059 if (new_data_size >= 0) {
2061 BUG_ON(new_data_size < attr_len); 2060 BUG_ON(new_data_size < attr_len);
2062 a->data.resident.value_length = 2061 a->data.resident.value_length =
2063 cpu_to_le32((u32)new_data_size); 2062 cpu_to_le32((u32)new_data_size);
2064 } 2063 }
2065 goto flush_done; 2064 goto flush_done;
2066 } 2065 }
2067 /* 2066 /*
2068 * We have to drop all the locks so we can call 2067 * We have to drop all the locks so we can call
2069 * ntfs_attr_make_non_resident(). This could be optimised by try- 2068 * ntfs_attr_make_non_resident(). This could be optimised by try-
2070 * locking the first page cache page and only if that fails dropping 2069 * locking the first page cache page and only if that fails dropping
2071 * the locks, locking the page, and redoing all the locking and 2070 * the locks, locking the page, and redoing all the locking and
2072 * lookups. While this would be a huge optimisation, it is not worth 2071 * lookups. While this would be a huge optimisation, it is not worth
2073 * it as this is definitely a slow code path. 2072 * it as this is definitely a slow code path.
2074 */ 2073 */
2075 ntfs_attr_put_search_ctx(ctx); 2074 ntfs_attr_put_search_ctx(ctx);
2076 unmap_mft_record(base_ni); 2075 unmap_mft_record(base_ni);
2077 up_write(&ni->runlist.lock); 2076 up_write(&ni->runlist.lock);
2078 /* 2077 /*
2079 * Not enough space in the mft record, try to make the attribute 2078 * Not enough space in the mft record, try to make the attribute
2080 * non-resident and if successful restart the extension process. 2079 * non-resident and if successful restart the extension process.
2081 */ 2080 */
2082 err = ntfs_attr_make_non_resident(ni, attr_len); 2081 err = ntfs_attr_make_non_resident(ni, attr_len);
2083 if (likely(!err)) 2082 if (likely(!err))
2084 goto retry_extend; 2083 goto retry_extend;
2085 /* 2084 /*
2086 * Could not make non-resident. If this is due to this not being 2085 * Could not make non-resident. If this is due to this not being
2087 * permitted for this attribute type or there not being enough space, 2086 * permitted for this attribute type or there not being enough space,
2088 * try to make other attributes non-resident. Otherwise fail. 2087 * try to make other attributes non-resident. Otherwise fail.
2089 */ 2088 */
2090 if (unlikely(err != -EPERM && err != -ENOSPC)) { 2089 if (unlikely(err != -EPERM && err != -ENOSPC)) {
2091 /* Only emit errors when the write will fail completely. */ 2090 /* Only emit errors when the write will fail completely. */
2092 read_lock_irqsave(&ni->size_lock, flags); 2091 read_lock_irqsave(&ni->size_lock, flags);
2093 allocated_size = ni->allocated_size; 2092 allocated_size = ni->allocated_size;
2094 read_unlock_irqrestore(&ni->size_lock, flags); 2093 read_unlock_irqrestore(&ni->size_lock, flags);
2095 if (start < 0 || start >= allocated_size) 2094 if (start < 0 || start >= allocated_size)
2096 ntfs_error(vol->sb, "Cannot extend allocation of " 2095 ntfs_error(vol->sb, "Cannot extend allocation of "
2097 "inode 0x%lx, attribute type 0x%x, " 2096 "inode 0x%lx, attribute type 0x%x, "
2098 "because the conversion from resident " 2097 "because the conversion from resident "
2099 "to non-resident attribute failed " 2098 "to non-resident attribute failed "
2100 "with error code %i.", vi->i_ino, 2099 "with error code %i.", vi->i_ino,
2101 (unsigned)le32_to_cpu(ni->type), err); 2100 (unsigned)le32_to_cpu(ni->type), err);
2102 if (err != -ENOMEM) 2101 if (err != -ENOMEM)
2103 err = -EIO; 2102 err = -EIO;
2104 goto conv_err_out; 2103 goto conv_err_out;
2105 } 2104 }
2106 /* TODO: Not implemented from here, abort. */ 2105 /* TODO: Not implemented from here, abort. */
2107 read_lock_irqsave(&ni->size_lock, flags); 2106 read_lock_irqsave(&ni->size_lock, flags);
2108 allocated_size = ni->allocated_size; 2107 allocated_size = ni->allocated_size;
2109 read_unlock_irqrestore(&ni->size_lock, flags); 2108 read_unlock_irqrestore(&ni->size_lock, flags);
2110 if (start < 0 || start >= allocated_size) { 2109 if (start < 0 || start >= allocated_size) {
2111 if (err == -ENOSPC) 2110 if (err == -ENOSPC)
2112 ntfs_error(vol->sb, "Not enough space in the mft " 2111 ntfs_error(vol->sb, "Not enough space in the mft "
2113 "record/on disk for the non-resident " 2112 "record/on disk for the non-resident "
2114 "attribute value. This case is not " 2113 "attribute value. This case is not "
2115 "implemented yet."); 2114 "implemented yet.");
2116 else /* if (err == -EPERM) */ 2115 else /* if (err == -EPERM) */
2117 ntfs_error(vol->sb, "This attribute type may not be " 2116 ntfs_error(vol->sb, "This attribute type may not be "
2118 "non-resident. This case is not " 2117 "non-resident. This case is not "
2119 "implemented yet."); 2118 "implemented yet.");
2120 } 2119 }
2121 err = -EOPNOTSUPP; 2120 err = -EOPNOTSUPP;
2122 goto conv_err_out; 2121 goto conv_err_out;
2123 #if 0 2122 #if 0
2124 // TODO: Attempt to make other attributes non-resident. 2123 // TODO: Attempt to make other attributes non-resident.
2125 if (!err) 2124 if (!err)
2126 goto do_resident_extend; 2125 goto do_resident_extend;
2127 /* 2126 /*
2128 * Both the attribute list attribute and the standard information 2127 * Both the attribute list attribute and the standard information
2129 * attribute must remain in the base inode. Thus, if this is one of 2128 * attribute must remain in the base inode. Thus, if this is one of
2130 * these attributes, we have to try to move other attributes out into 2129 * these attributes, we have to try to move other attributes out into
2131 * extent mft records instead. 2130 * extent mft records instead.
2132 */ 2131 */
2133 if (ni->type == AT_ATTRIBUTE_LIST || 2132 if (ni->type == AT_ATTRIBUTE_LIST ||
2134 ni->type == AT_STANDARD_INFORMATION) { 2133 ni->type == AT_STANDARD_INFORMATION) {
2135 // TODO: Attempt to move other attributes into extent mft 2134 // TODO: Attempt to move other attributes into extent mft
2136 // records. 2135 // records.
2137 err = -EOPNOTSUPP; 2136 err = -EOPNOTSUPP;
2138 if (!err) 2137 if (!err)
2139 goto do_resident_extend; 2138 goto do_resident_extend;
2140 goto err_out; 2139 goto err_out;
2141 } 2140 }
2142 // TODO: Attempt to move this attribute to an extent mft record, but 2141 // TODO: Attempt to move this attribute to an extent mft record, but
2143 // only if it is not already the only attribute in an mft record in 2142 // only if it is not already the only attribute in an mft record in
2144 // which case there would be nothing to gain. 2143 // which case there would be nothing to gain.
2145 err = -EOPNOTSUPP; 2144 err = -EOPNOTSUPP;
2146 if (!err) 2145 if (!err)
2147 goto do_resident_extend; 2146 goto do_resident_extend;
2148 /* There is nothing we can do to make enough space. )-: */ 2147 /* There is nothing we can do to make enough space. )-: */
2149 goto err_out; 2148 goto err_out;
2150 #endif 2149 #endif
2151 do_non_resident_extend: 2150 do_non_resident_extend:
2152 BUG_ON(!NInoNonResident(ni)); 2151 BUG_ON(!NInoNonResident(ni));
2153 if (new_alloc_size == allocated_size) { 2152 if (new_alloc_size == allocated_size) {
2154 BUG_ON(vcn); 2153 BUG_ON(vcn);
2155 goto alloc_done; 2154 goto alloc_done;
2156 } 2155 }
2157 /* 2156 /*
2158 * If the data starts after the end of the old allocation, this is a 2157 * If the data starts after the end of the old allocation, this is a
2159 * $DATA attribute and sparse attributes are enabled on the volume and 2158 * $DATA attribute and sparse attributes are enabled on the volume and
2160 * for this inode, then create a sparse region between the old 2159 * for this inode, then create a sparse region between the old
2161 * allocated size and the start of the data. Otherwise simply proceed 2160 * allocated size and the start of the data. Otherwise simply proceed
2162 * with filling the whole space between the old allocated size and the 2161 * with filling the whole space between the old allocated size and the
2163 * new allocated size with clusters. 2162 * new allocated size with clusters.
2164 */ 2163 */
2165 if ((start >= 0 && start <= allocated_size) || ni->type != AT_DATA || 2164 if ((start >= 0 && start <= allocated_size) || ni->type != AT_DATA ||
2166 !NVolSparseEnabled(vol) || NInoSparseDisabled(ni)) 2165 !NVolSparseEnabled(vol) || NInoSparseDisabled(ni))
2167 goto skip_sparse; 2166 goto skip_sparse;
2168 // TODO: This is not implemented yet. We just fill in with real 2167 // TODO: This is not implemented yet. We just fill in with real
2169 // clusters for now... 2168 // clusters for now...
2170 ntfs_debug("Inserting holes is not-implemented yet. Falling back to " 2169 ntfs_debug("Inserting holes is not-implemented yet. Falling back to "
2171 "allocating real clusters instead."); 2170 "allocating real clusters instead.");
2172 skip_sparse: 2171 skip_sparse:
2173 rl = ni->runlist.rl; 2172 rl = ni->runlist.rl;
2174 if (likely(rl)) { 2173 if (likely(rl)) {
2175 /* Seek to the end of the runlist. */ 2174 /* Seek to the end of the runlist. */
2176 while (rl->length) 2175 while (rl->length)
2177 rl++; 2176 rl++;
2178 } 2177 }
2179 /* If this attribute extent is not mapped, map it now. */ 2178 /* If this attribute extent is not mapped, map it now. */
2180 if (unlikely(!rl || rl->lcn == LCN_RL_NOT_MAPPED || 2179 if (unlikely(!rl || rl->lcn == LCN_RL_NOT_MAPPED ||
2181 (rl->lcn == LCN_ENOENT && rl > ni->runlist.rl && 2180 (rl->lcn == LCN_ENOENT && rl > ni->runlist.rl &&
2182 (rl-1)->lcn == LCN_RL_NOT_MAPPED))) { 2181 (rl-1)->lcn == LCN_RL_NOT_MAPPED))) {
2183 if (!rl && !allocated_size) 2182 if (!rl && !allocated_size)
2184 goto first_alloc; 2183 goto first_alloc;
2185 rl = ntfs_mapping_pairs_decompress(vol, a, ni->runlist.rl); 2184 rl = ntfs_mapping_pairs_decompress(vol, a, ni->runlist.rl);
2186 if (IS_ERR(rl)) { 2185 if (IS_ERR(rl)) {
2187 err = PTR_ERR(rl); 2186 err = PTR_ERR(rl);
2188 if (start < 0 || start >= allocated_size) 2187 if (start < 0 || start >= allocated_size)
2189 ntfs_error(vol->sb, "Cannot extend allocation " 2188 ntfs_error(vol->sb, "Cannot extend allocation "
2190 "of inode 0x%lx, attribute " 2189 "of inode 0x%lx, attribute "
2191 "type 0x%x, because the " 2190 "type 0x%x, because the "
2192 "mapping of a runlist " 2191 "mapping of a runlist "
2193 "fragment failed with error " 2192 "fragment failed with error "
2194 "code %i.", vi->i_ino, 2193 "code %i.", vi->i_ino,
2195 (unsigned)le32_to_cpu(ni->type), 2194 (unsigned)le32_to_cpu(ni->type),
2196 err); 2195 err);
2197 if (err != -ENOMEM) 2196 if (err != -ENOMEM)
2198 err = -EIO; 2197 err = -EIO;
2199 goto err_out; 2198 goto err_out;
2200 } 2199 }
2201 ni->runlist.rl = rl; 2200 ni->runlist.rl = rl;
2202 /* Seek to the end of the runlist. */ 2201 /* Seek to the end of the runlist. */
2203 while (rl->length) 2202 while (rl->length)
2204 rl++; 2203 rl++;
2205 } 2204 }
2206 /* 2205 /*
2207 * We now know the runlist of the last extent is mapped and @rl is at 2206 * We now know the runlist of the last extent is mapped and @rl is at
2208 * the end of the runlist. We want to begin allocating clusters 2207 * the end of the runlist. We want to begin allocating clusters
2209 * starting at the last allocated cluster to reduce fragmentation. If 2208 * starting at the last allocated cluster to reduce fragmentation. If
2210 * there are no valid LCNs in the attribute we let the cluster 2209 * there are no valid LCNs in the attribute we let the cluster
2211 * allocator choose the starting cluster. 2210 * allocator choose the starting cluster.
2212 */ 2211 */
2213 /* If the last LCN is a hole or similar, seek back to the last real LCN. */ 2212 /* If the last LCN is a hole or similar, seek back to the last real LCN. */
2214 while (rl->lcn < 0 && rl > ni->runlist.rl) 2213 while (rl->lcn < 0 && rl > ni->runlist.rl)
2215 rl--; 2214 rl--;
2216 first_alloc: 2215 first_alloc:
2217 // FIXME: Need to implement partial allocations so at least part of the 2216 // FIXME: Need to implement partial allocations so at least part of the
2218 // write can be performed when start >= 0. (Needed for POSIX write(2) 2217 // write can be performed when start >= 0. (Needed for POSIX write(2)
2219 // conformance.) 2218 // conformance.)
2220 rl2 = ntfs_cluster_alloc(vol, allocated_size >> vol->cluster_size_bits, 2219 rl2 = ntfs_cluster_alloc(vol, allocated_size >> vol->cluster_size_bits,
2221 (new_alloc_size - allocated_size) >> 2220 (new_alloc_size - allocated_size) >>
2222 vol->cluster_size_bits, (rl && (rl->lcn >= 0)) ? 2221 vol->cluster_size_bits, (rl && (rl->lcn >= 0)) ?
2223 rl->lcn + rl->length : -1, DATA_ZONE, true); 2222 rl->lcn + rl->length : -1, DATA_ZONE, true);
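/*
 * The "rl->lcn + rl->length : -1" argument is only a hint for where the
 * cluster allocator should start searching: if the last real run were,
 * say, LCN 1000 with length 8 (numbers invented for illustration), the
 * search starts at LCN 1008 so the new clusters follow straight on from
 * the old ones; -1 lets the allocator pick its own starting point.
 */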
2224 if (IS_ERR(rl2)) { 2223 if (IS_ERR(rl2)) {
2225 err = PTR_ERR(rl2); 2224 err = PTR_ERR(rl2);
2226 if (start < 0 || start >= allocated_size) 2225 if (start < 0 || start >= allocated_size)
2227 ntfs_error(vol->sb, "Cannot extend allocation of " 2226 ntfs_error(vol->sb, "Cannot extend allocation of "
2228 "inode 0x%lx, attribute type 0x%x, " 2227 "inode 0x%lx, attribute type 0x%x, "
2229 "because the allocation of clusters " 2228 "because the allocation of clusters "
2230 "failed with error code %i.", vi->i_ino, 2229 "failed with error code %i.", vi->i_ino,
2231 (unsigned)le32_to_cpu(ni->type), err); 2230 (unsigned)le32_to_cpu(ni->type), err);
2232 if (err != -ENOMEM && err != -ENOSPC) 2231 if (err != -ENOMEM && err != -ENOSPC)
2233 err = -EIO; 2232 err = -EIO;
2234 goto err_out; 2233 goto err_out;
2235 } 2234 }
2236 rl = ntfs_runlists_merge(ni->runlist.rl, rl2); 2235 rl = ntfs_runlists_merge(ni->runlist.rl, rl2);
2237 if (IS_ERR(rl)) { 2236 if (IS_ERR(rl)) {
2238 err = PTR_ERR(rl); 2237 err = PTR_ERR(rl);
2239 if (start < 0 || start >= allocated_size) 2238 if (start < 0 || start >= allocated_size)
2240 ntfs_error(vol->sb, "Cannot extend allocation of " 2239 ntfs_error(vol->sb, "Cannot extend allocation of "
2241 "inode 0x%lx, attribute type 0x%x, " 2240 "inode 0x%lx, attribute type 0x%x, "
2242 "because the runlist merge failed " 2241 "because the runlist merge failed "
2243 "with error code %i.", vi->i_ino, 2242 "with error code %i.", vi->i_ino,
2244 (unsigned)le32_to_cpu(ni->type), err); 2243 (unsigned)le32_to_cpu(ni->type), err);
2245 if (err != -ENOMEM) 2244 if (err != -ENOMEM)
2246 err = -EIO; 2245 err = -EIO;
2247 if (ntfs_cluster_free_from_rl(vol, rl2)) { 2246 if (ntfs_cluster_free_from_rl(vol, rl2)) {
2248 ntfs_error(vol->sb, "Failed to release allocated " 2247 ntfs_error(vol->sb, "Failed to release allocated "
2249 "cluster(s) in error code path. Run " 2248 "cluster(s) in error code path. Run "
2250 "chkdsk to recover the lost " 2249 "chkdsk to recover the lost "
2251 "cluster(s)."); 2250 "cluster(s).");
2252 NVolSetErrors(vol); 2251 NVolSetErrors(vol);
2253 } 2252 }
2254 ntfs_free(rl2); 2253 ntfs_free(rl2);
2255 goto err_out; 2254 goto err_out;
2256 } 2255 }
2257 ni->runlist.rl = rl; 2256 ni->runlist.rl = rl;
2258 ntfs_debug("Allocated 0x%llx clusters.", (long long)(new_alloc_size - 2257 ntfs_debug("Allocated 0x%llx clusters.", (long long)(new_alloc_size -
2259 allocated_size) >> vol->cluster_size_bits); 2258 allocated_size) >> vol->cluster_size_bits);
2260 /* Find the runlist element with which the attribute extent starts. */ 2259 /* Find the runlist element with which the attribute extent starts. */
2261 ll = sle64_to_cpu(a->data.non_resident.lowest_vcn); 2260 ll = sle64_to_cpu(a->data.non_resident.lowest_vcn);
2262 rl2 = ntfs_rl_find_vcn_nolock(rl, ll); 2261 rl2 = ntfs_rl_find_vcn_nolock(rl, ll);
2263 BUG_ON(!rl2); 2262 BUG_ON(!rl2);
2264 BUG_ON(!rl2->length); 2263 BUG_ON(!rl2->length);
2265 BUG_ON(rl2->lcn < LCN_HOLE); 2264 BUG_ON(rl2->lcn < LCN_HOLE);
2266 mp_rebuilt = false; 2265 mp_rebuilt = false;
2267 /* Get the size for the new mapping pairs array for this extent. */ 2266 /* Get the size for the new mapping pairs array for this extent. */
2268 mp_size = ntfs_get_size_for_mapping_pairs(vol, rl2, ll, -1); 2267 mp_size = ntfs_get_size_for_mapping_pairs(vol, rl2, ll, -1);
2269 if (unlikely(mp_size <= 0)) { 2268 if (unlikely(mp_size <= 0)) {
2270 err = mp_size; 2269 err = mp_size;
2271 if (start < 0 || start >= allocated_size) 2270 if (start < 0 || start >= allocated_size)
2272 ntfs_error(vol->sb, "Cannot extend allocation of " 2271 ntfs_error(vol->sb, "Cannot extend allocation of "
2273 "inode 0x%lx, attribute type 0x%x, " 2272 "inode 0x%lx, attribute type 0x%x, "
2274 "because determining the size for the " 2273 "because determining the size for the "
2275 "mapping pairs failed with error code " 2274 "mapping pairs failed with error code "
2276 "%i.", vi->i_ino, 2275 "%i.", vi->i_ino,
2277 (unsigned)le32_to_cpu(ni->type), err); 2276 (unsigned)le32_to_cpu(ni->type), err);
2278 err = -EIO; 2277 err = -EIO;
2279 goto undo_alloc; 2278 goto undo_alloc;
2280 } 2279 }
2281 /* Extend the attribute record to fit the bigger mapping pairs array. */ 2280 /* Extend the attribute record to fit the bigger mapping pairs array. */
2282 attr_len = le32_to_cpu(a->length); 2281 attr_len = le32_to_cpu(a->length);
2283 err = ntfs_attr_record_resize(m, a, mp_size + 2282 err = ntfs_attr_record_resize(m, a, mp_size +
2284 le16_to_cpu(a->data.non_resident.mapping_pairs_offset)); 2283 le16_to_cpu(a->data.non_resident.mapping_pairs_offset));
2285 if (unlikely(err)) { 2284 if (unlikely(err)) {
2286 BUG_ON(err != -ENOSPC); 2285 BUG_ON(err != -ENOSPC);
2287 // TODO: Deal with this by moving this extent to a new mft 2286 // TODO: Deal with this by moving this extent to a new mft
2288 // record or by starting a new extent in a new mft record, 2287 // record or by starting a new extent in a new mft record,
2289 // possibly by extending this extent partially and filling it 2288 // possibly by extending this extent partially and filling it
2290 // and creating a new extent for the remainder, or by making 2289 // and creating a new extent for the remainder, or by making
2291 // other attributes non-resident and/or by moving other 2290 // other attributes non-resident and/or by moving other
2292 // attributes out of this mft record. 2291 // attributes out of this mft record.
2293 if (start < 0 || start >= allocated_size) 2292 if (start < 0 || start >= allocated_size)
2294 ntfs_error(vol->sb, "Not enough space in the mft " 2293 ntfs_error(vol->sb, "Not enough space in the mft "
2295 "record for the extended attribute " 2294 "record for the extended attribute "
2296 "record. This case is not " 2295 "record. This case is not "
2297 "implemented yet."); 2296 "implemented yet.");
2298 err = -EOPNOTSUPP; 2297 err = -EOPNOTSUPP;
2299 goto undo_alloc; 2298 goto undo_alloc;
2300 } 2299 }
2301 mp_rebuilt = true; 2300 mp_rebuilt = true;
2302 /* Generate the mapping pairs array directly into the attr record. */ 2301 /* Generate the mapping pairs array directly into the attr record. */
2303 err = ntfs_mapping_pairs_build(vol, (u8*)a + 2302 err = ntfs_mapping_pairs_build(vol, (u8*)a +
2304 le16_to_cpu(a->data.non_resident.mapping_pairs_offset), 2303 le16_to_cpu(a->data.non_resident.mapping_pairs_offset),
2305 mp_size, rl2, ll, -1, NULL); 2304 mp_size, rl2, ll, -1, NULL);
2306 if (unlikely(err)) { 2305 if (unlikely(err)) {
2307 if (start < 0 || start >= allocated_size) 2306 if (start < 0 || start >= allocated_size)
2308 ntfs_error(vol->sb, "Cannot extend allocation of " 2307 ntfs_error(vol->sb, "Cannot extend allocation of "
2309 "inode 0x%lx, attribute type 0x%x, " 2308 "inode 0x%lx, attribute type 0x%x, "
2310 "because building the mapping pairs " 2309 "because building the mapping pairs "
2311 "failed with error code %i.", vi->i_ino, 2310 "failed with error code %i.", vi->i_ino,
2312 (unsigned)le32_to_cpu(ni->type), err); 2311 (unsigned)le32_to_cpu(ni->type), err);
2313 err = -EIO; 2312 err = -EIO;
2314 goto undo_alloc; 2313 goto undo_alloc;
2315 } 2314 }
2316 /* Update the highest_vcn. */ 2315 /* Update the highest_vcn. */
2317 a->data.non_resident.highest_vcn = cpu_to_sle64((new_alloc_size >> 2316 a->data.non_resident.highest_vcn = cpu_to_sle64((new_alloc_size >>
2318 vol->cluster_size_bits) - 1); 2317 vol->cluster_size_bits) - 1);
2319 /* 2318 /*
2320 * We now have extended the allocated size of the attribute. Reflect 2319 * We now have extended the allocated size of the attribute. Reflect
2321 * this in the ntfs_inode structure and the attribute record. 2320 * this in the ntfs_inode structure and the attribute record.
2322 */ 2321 */
2323 if (a->data.non_resident.lowest_vcn) { 2322 if (a->data.non_resident.lowest_vcn) {
2324 /* 2323 /*
2325 * We are not in the first attribute extent, switch to it, but 2324 * We are not in the first attribute extent, switch to it, but
2326 * first ensure the changes will make it to disk later. 2325 * first ensure the changes will make it to disk later.
2327 */ 2326 */
2328 flush_dcache_mft_record_page(ctx->ntfs_ino); 2327 flush_dcache_mft_record_page(ctx->ntfs_ino);
2329 mark_mft_record_dirty(ctx->ntfs_ino); 2328 mark_mft_record_dirty(ctx->ntfs_ino);
2330 ntfs_attr_reinit_search_ctx(ctx); 2329 ntfs_attr_reinit_search_ctx(ctx);
2331 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, 2330 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
2332 CASE_SENSITIVE, 0, NULL, 0, ctx); 2331 CASE_SENSITIVE, 0, NULL, 0, ctx);
2333 if (unlikely(err)) 2332 if (unlikely(err))
2334 goto restore_undo_alloc; 2333 goto restore_undo_alloc;
2335 /* @m is not used any more so no need to set it. */ 2334 /* @m is not used any more so no need to set it. */
2336 a = ctx->attr; 2335 a = ctx->attr;
2337 } 2336 }
2338 write_lock_irqsave(&ni->size_lock, flags); 2337 write_lock_irqsave(&ni->size_lock, flags);
2339 ni->allocated_size = new_alloc_size; 2338 ni->allocated_size = new_alloc_size;
2340 a->data.non_resident.allocated_size = cpu_to_sle64(new_alloc_size); 2339 a->data.non_resident.allocated_size = cpu_to_sle64(new_alloc_size);
2341 /* 2340 /*
2342 * FIXME: This would fail if @ni is a directory, $MFT, or an index, 2341 * FIXME: This would fail if @ni is a directory, $MFT, or an index,
2343 * since those can have sparse/compressed set. For example, a directory can be 2342 * since those can have sparse/compressed set. For example, a directory can be
2344 * set compressed even though it is not compressed itself and in that 2343 * set compressed even though it is not compressed itself and in that
2345 * case the bit means that files are to be created compressed in the 2344 * case the bit means that files are to be created compressed in the
2346 * directory... At present this is ok as this code is only called for 2345 * directory... At present this is ok as this code is only called for
2347 * regular files, and only for their $DATA attribute(s). 2346 * regular files, and only for their $DATA attribute(s).
2348 * FIXME: The calculation is wrong if we created a hole above. For now 2347 * FIXME: The calculation is wrong if we created a hole above. For now
2349 * it does not matter as we never create holes. 2348 * it does not matter as we never create holes.
2350 */ 2349 */
2351 if (NInoSparse(ni) || NInoCompressed(ni)) { 2350 if (NInoSparse(ni) || NInoCompressed(ni)) {
2352 ni->itype.compressed.size += new_alloc_size - allocated_size; 2351 ni->itype.compressed.size += new_alloc_size - allocated_size;
2353 a->data.non_resident.compressed_size = 2352 a->data.non_resident.compressed_size =
2354 cpu_to_sle64(ni->itype.compressed.size); 2353 cpu_to_sle64(ni->itype.compressed.size);
2355 vi->i_blocks = ni->itype.compressed.size >> 9; 2354 vi->i_blocks = ni->itype.compressed.size >> 9;
2356 } else 2355 } else
2357 vi->i_blocks = new_alloc_size >> 9; 2356 vi->i_blocks = new_alloc_size >> 9;
2358 write_unlock_irqrestore(&ni->size_lock, flags); 2357 write_unlock_irqrestore(&ni->size_lock, flags);
2359 alloc_done: 2358 alloc_done:
2360 if (new_data_size >= 0) { 2359 if (new_data_size >= 0) {
2361 BUG_ON(new_data_size < 2360 BUG_ON(new_data_size <
2362 sle64_to_cpu(a->data.non_resident.data_size)); 2361 sle64_to_cpu(a->data.non_resident.data_size));
2363 a->data.non_resident.data_size = cpu_to_sle64(new_data_size); 2362 a->data.non_resident.data_size = cpu_to_sle64(new_data_size);
2364 } 2363 }
2365 flush_done: 2364 flush_done:
2366 /* Ensure the changes make it to disk. */ 2365 /* Ensure the changes make it to disk. */
2367 flush_dcache_mft_record_page(ctx->ntfs_ino); 2366 flush_dcache_mft_record_page(ctx->ntfs_ino);
2368 mark_mft_record_dirty(ctx->ntfs_ino); 2367 mark_mft_record_dirty(ctx->ntfs_ino);
2369 done: 2368 done:
2370 ntfs_attr_put_search_ctx(ctx); 2369 ntfs_attr_put_search_ctx(ctx);
2371 unmap_mft_record(base_ni); 2370 unmap_mft_record(base_ni);
2372 up_write(&ni->runlist.lock); 2371 up_write(&ni->runlist.lock);
2373 ntfs_debug("Done, new_allocated_size 0x%llx.", 2372 ntfs_debug("Done, new_allocated_size 0x%llx.",
2374 (unsigned long long)new_alloc_size); 2373 (unsigned long long)new_alloc_size);
2375 return new_alloc_size; 2374 return new_alloc_size;
2376 restore_undo_alloc: 2375 restore_undo_alloc:
2377 if (start < 0 || start >= allocated_size) 2376 if (start < 0 || start >= allocated_size)
2378 ntfs_error(vol->sb, "Cannot complete extension of allocation " 2377 ntfs_error(vol->sb, "Cannot complete extension of allocation "
2379 "of inode 0x%lx, attribute type 0x%x, because " 2378 "of inode 0x%lx, attribute type 0x%x, because "
2380 "lookup of first attribute extent failed with " 2379 "lookup of first attribute extent failed with "
2381 "error code %i.", vi->i_ino, 2380 "error code %i.", vi->i_ino,
2382 (unsigned)le32_to_cpu(ni->type), err); 2381 (unsigned)le32_to_cpu(ni->type), err);
2383 if (err == -ENOENT) 2382 if (err == -ENOENT)
2384 err = -EIO; 2383 err = -EIO;
2385 ntfs_attr_reinit_search_ctx(ctx); 2384 ntfs_attr_reinit_search_ctx(ctx);
2386 if (ntfs_attr_lookup(ni->type, ni->name, ni->name_len, CASE_SENSITIVE, 2385 if (ntfs_attr_lookup(ni->type, ni->name, ni->name_len, CASE_SENSITIVE,
2387 allocated_size >> vol->cluster_size_bits, NULL, 0, 2386 allocated_size >> vol->cluster_size_bits, NULL, 0,
2388 ctx)) { 2387 ctx)) {
2389 ntfs_error(vol->sb, "Failed to find last attribute extent of " 2388 ntfs_error(vol->sb, "Failed to find last attribute extent of "
2390 "attribute in error code path. Run chkdsk to " 2389 "attribute in error code path. Run chkdsk to "
2391 "recover."); 2390 "recover.");
2392 write_lock_irqsave(&ni->size_lock, flags); 2391 write_lock_irqsave(&ni->size_lock, flags);
2393 ni->allocated_size = new_alloc_size; 2392 ni->allocated_size = new_alloc_size;
2394 /* 2393 /*
2395 * FIXME: This would fail if @ni is a directory... See above. 2394 * FIXME: This would fail if @ni is a directory... See above.
2396 * FIXME: The calculation is wrong if we created a hole above. 2395 * FIXME: The calculation is wrong if we created a hole above.
2397 * For now it does not matter as we never create holes. 2396 * For now it does not matter as we never create holes.
2398 */ 2397 */
2399 if (NInoSparse(ni) || NInoCompressed(ni)) { 2398 if (NInoSparse(ni) || NInoCompressed(ni)) {
2400 ni->itype.compressed.size += new_alloc_size - 2399 ni->itype.compressed.size += new_alloc_size -
2401 allocated_size; 2400 allocated_size;
2402 vi->i_blocks = ni->itype.compressed.size >> 9; 2401 vi->i_blocks = ni->itype.compressed.size >> 9;
2403 } else 2402 } else
2404 vi->i_blocks = new_alloc_size >> 9; 2403 vi->i_blocks = new_alloc_size >> 9;
2405 write_unlock_irqrestore(&ni->size_lock, flags); 2404 write_unlock_irqrestore(&ni->size_lock, flags);
2406 ntfs_attr_put_search_ctx(ctx); 2405 ntfs_attr_put_search_ctx(ctx);
2407 unmap_mft_record(base_ni); 2406 unmap_mft_record(base_ni);
2408 up_write(&ni->runlist.lock); 2407 up_write(&ni->runlist.lock);
2409 /* 2408 /*
2410 * The only thing that is now wrong is the allocated size of the 2409 * The only thing that is now wrong is the allocated size of the
2411 * base attribute extent which chkdsk should be able to fix. 2410 * base attribute extent which chkdsk should be able to fix.
2412 */ 2411 */
2413 NVolSetErrors(vol); 2412 NVolSetErrors(vol);
2414 return err; 2413 return err;
2415 } 2414 }
2416 ctx->attr->data.non_resident.highest_vcn = cpu_to_sle64( 2415 ctx->attr->data.non_resident.highest_vcn = cpu_to_sle64(
2417 (allocated_size >> vol->cluster_size_bits) - 1); 2416 (allocated_size >> vol->cluster_size_bits) - 1);
2418 undo_alloc: 2417 undo_alloc:
2419 ll = allocated_size >> vol->cluster_size_bits; 2418 ll = allocated_size >> vol->cluster_size_bits;
2420 if (ntfs_cluster_free(ni, ll, -1, ctx) < 0) { 2419 if (ntfs_cluster_free(ni, ll, -1, ctx) < 0) {
2421 ntfs_error(vol->sb, "Failed to release allocated cluster(s) " 2420 ntfs_error(vol->sb, "Failed to release allocated cluster(s) "
2422 "in error code path. Run chkdsk to recover " 2421 "in error code path. Run chkdsk to recover "
2423 "the lost cluster(s)."); 2422 "the lost cluster(s).");
2424 NVolSetErrors(vol); 2423 NVolSetErrors(vol);
2425 } 2424 }
2426 m = ctx->mrec; 2425 m = ctx->mrec;
2427 a = ctx->attr; 2426 a = ctx->attr;
2428 /* 2427 /*
2429 * If the runlist truncation fails and/or the search context is no 2428 * If the runlist truncation fails and/or the search context is no
2430 * longer valid, we cannot resize the attribute record or build the 2429 * longer valid, we cannot resize the attribute record or build the
2431 * mapping pairs array thus we mark the inode bad so that no access to 2430 * mapping pairs array thus we mark the inode bad so that no access to
2432 * the freed clusters can happen. 2431 * the freed clusters can happen.
2433 */ 2432 */
2434 if (ntfs_rl_truncate_nolock(vol, &ni->runlist, ll) || IS_ERR(m)) { 2433 if (ntfs_rl_truncate_nolock(vol, &ni->runlist, ll) || IS_ERR(m)) {
2435 ntfs_error(vol->sb, "Failed to %s in error code path. Run " 2434 ntfs_error(vol->sb, "Failed to %s in error code path. Run "
2436 "chkdsk to recover.", IS_ERR(m) ? 2435 "chkdsk to recover.", IS_ERR(m) ?
2437 "restore attribute search context" : 2436 "restore attribute search context" :
2438 "truncate attribute runlist"); 2437 "truncate attribute runlist");
2439 NVolSetErrors(vol); 2438 NVolSetErrors(vol);
2440 } else if (mp_rebuilt) { 2439 } else if (mp_rebuilt) {
2441 if (ntfs_attr_record_resize(m, a, attr_len)) { 2440 if (ntfs_attr_record_resize(m, a, attr_len)) {
2442 ntfs_error(vol->sb, "Failed to restore attribute " 2441 ntfs_error(vol->sb, "Failed to restore attribute "
2443 "record in error code path. Run " 2442 "record in error code path. Run "
2444 "chkdsk to recover."); 2443 "chkdsk to recover.");
2445 NVolSetErrors(vol); 2444 NVolSetErrors(vol);
2446 } else /* if (success) */ { 2445 } else /* if (success) */ {
2447 if (ntfs_mapping_pairs_build(vol, (u8*)a + le16_to_cpu( 2446 if (ntfs_mapping_pairs_build(vol, (u8*)a + le16_to_cpu(
2448 a->data.non_resident. 2447 a->data.non_resident.
2449 mapping_pairs_offset), attr_len - 2448 mapping_pairs_offset), attr_len -
2450 le16_to_cpu(a->data.non_resident. 2449 le16_to_cpu(a->data.non_resident.
2451 mapping_pairs_offset), rl2, ll, -1, 2450 mapping_pairs_offset), rl2, ll, -1,
2452 NULL)) { 2451 NULL)) {
2453 ntfs_error(vol->sb, "Failed to restore " 2452 ntfs_error(vol->sb, "Failed to restore "
2454 "mapping pairs array in error " 2453 "mapping pairs array in error "
2455 "code path. Run chkdsk to " 2454 "code path. Run chkdsk to "
2456 "recover."); 2455 "recover.");
2457 NVolSetErrors(vol); 2456 NVolSetErrors(vol);
2458 } 2457 }
2459 flush_dcache_mft_record_page(ctx->ntfs_ino); 2458 flush_dcache_mft_record_page(ctx->ntfs_ino);
2460 mark_mft_record_dirty(ctx->ntfs_ino); 2459 mark_mft_record_dirty(ctx->ntfs_ino);
2461 } 2460 }
2462 } 2461 }
2463 err_out: 2462 err_out:
2464 if (ctx) 2463 if (ctx)
2465 ntfs_attr_put_search_ctx(ctx); 2464 ntfs_attr_put_search_ctx(ctx);
2466 if (m) 2465 if (m)
2467 unmap_mft_record(base_ni); 2466 unmap_mft_record(base_ni);
2468 up_write(&ni->runlist.lock); 2467 up_write(&ni->runlist.lock);
2469 conv_err_out: 2468 conv_err_out:
2470 ntfs_debug("Failed. Returning error code %i.", err); 2469 ntfs_debug("Failed. Returning error code %i.", err);
2471 return err; 2470 return err;
2472 } 2471 }
2473 2472
/**
 * ntfs_attr_set - fill (a part of) an attribute with a byte
 * @ni: ntfs inode describing the attribute to fill
 * @ofs: offset inside the attribute at which to start to fill
 * @cnt: number of bytes to fill
 * @val: the unsigned 8-bit value with which to fill the attribute
 *
 * Fill @cnt bytes of the attribute described by the ntfs inode @ni starting at
 * byte offset @ofs inside the attribute with the constant byte @val.
 *
 * This function is effectively like memset() applied to an ntfs attribute.
 * Note this function actually only operates on the page cache pages belonging
 * to the ntfs attribute and it marks them dirty after doing the memset().
 * Thus it relies on the vm dirty page write code paths to cause the modified
 * pages to be written to the mft record/disk.
 *
 * Return 0 on success and -errno on error. An error code of -ESPIPE means
 * that @ofs + @cnt were outside the end of the attribute and no write was
 * performed.
 */
int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
{
	ntfs_volume *vol = ni->vol;
	struct address_space *mapping;
	struct page *page;
	u8 *kaddr;
	pgoff_t idx, end;
	unsigned start_ofs, end_ofs, size;

	ntfs_debug("Entering for ofs 0x%llx, cnt 0x%llx, val 0x%hx.",
			(long long)ofs, (long long)cnt, val);
	BUG_ON(ofs < 0);
	BUG_ON(cnt < 0);
	if (!cnt)
		goto done;
	/*
	 * FIXME: Compressed and encrypted attributes are not supported when
	 * writing and we should never have gotten here for them.
	 */
	BUG_ON(NInoCompressed(ni));
	BUG_ON(NInoEncrypted(ni));
	mapping = VFS_I(ni)->i_mapping;
	/* Work out the starting index and page offset. */
	idx = ofs >> PAGE_CACHE_SHIFT;
	start_ofs = ofs & ~PAGE_CACHE_MASK;
	/* Work out the ending index and page offset. */
	end = ofs + cnt;
	end_ofs = end & ~PAGE_CACHE_MASK;
	/* If the end is outside the inode size return -ESPIPE. */
	if (unlikely(end > i_size_read(VFS_I(ni)))) {
		ntfs_error(vol->sb, "Request exceeds end of attribute.");
		return -ESPIPE;
	}
	end >>= PAGE_CACHE_SHIFT;
	/* If there is a first partial page, need to do it the slow way. */
	if (start_ofs) {
		page = read_mapping_page(mapping, idx, NULL);
		if (IS_ERR(page)) {
			ntfs_error(vol->sb, "Failed to read first partial "
					"page (error, index 0x%lx).", idx);
			return PTR_ERR(page);
		}
		/*
		 * If the last page is the same as the first page, need to
		 * limit the write to the end offset.
		 */
		size = PAGE_CACHE_SIZE;
		if (idx == end)
			size = end_ofs;
		kaddr = kmap_atomic(page);
		memset(kaddr + start_ofs, val, size - start_ofs);
		flush_dcache_page(page);
		kunmap_atomic(kaddr);
		set_page_dirty(page);
		page_cache_release(page);
		balance_dirty_pages_ratelimited(mapping);
		cond_resched();
		if (idx == end)
			goto done;
		idx++;
	}
	/* Do the whole pages the fast way. */
	for (; idx < end; idx++) {
		/* Find or create the current page. (The page is locked.) */
		page = grab_cache_page(mapping, idx);
		if (unlikely(!page)) {
			ntfs_error(vol->sb, "Insufficient memory to grab "
					"page (index 0x%lx).", idx);
			return -ENOMEM;
		}
		kaddr = kmap_atomic(page);
		memset(kaddr, val, PAGE_CACHE_SIZE);
		flush_dcache_page(page);
		kunmap_atomic(kaddr);
		/*
		 * If the page has buffers, mark them uptodate since buffer
		 * state and not page state is definitive in 2.6 kernels.
		 */
		if (page_has_buffers(page)) {
			struct buffer_head *bh, *head;

			bh = head = page_buffers(page);
			do {
				set_buffer_uptodate(bh);
			} while ((bh = bh->b_this_page) != head);
		}
		/* Now that buffers are uptodate, set the page uptodate, too. */
		SetPageUptodate(page);
		/*
		 * Set the page and all its buffers dirty and mark the inode
		 * dirty, too. The VM will write the page later on.
		 */
		set_page_dirty(page);
		/* Finally unlock and release the page. */
		unlock_page(page);
		page_cache_release(page);
		balance_dirty_pages_ratelimited(mapping);
		cond_resched();
	}
	/* If there is a last partial page, need to do it the slow way. */
	if (end_ofs) {
		page = read_mapping_page(mapping, idx, NULL);
		if (IS_ERR(page)) {
			ntfs_error(vol->sb, "Failed to read last partial page "
					"(error, index 0x%lx).", idx);
			return PTR_ERR(page);
		}
		kaddr = kmap_atomic(page);
		memset(kaddr, val, end_ofs);
		flush_dcache_page(page);
		kunmap_atomic(kaddr);
		set_page_dirty(page);
		page_cache_release(page);
		balance_dirty_pages_ratelimited(mapping);
		cond_resched();
	}
done:
	ntfs_debug("Done.");
	return 0;
}
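
/*
 * Editor's note: an illustrative sketch, not part of the original file. It
 * shows how a caller might use ntfs_attr_set() to zero a byte range of an
 * attribute and handle the -ESPIPE case documented above. The function name
 * is hypothetical.
 */
static inline int ntfs_attr_zero_range_example(ntfs_inode *ni, const s64 ofs,
		const s64 cnt)
{
	int err = ntfs_attr_set(ni, ofs, cnt, 0);

	if (err == -ESPIPE)
		ntfs_debug("Range 0x%llx + 0x%llx lies beyond the end of the "
				"attribute.", (long long)ofs, (long long)cnt);
	return err;
}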

#endif /* NTFS_RW */

/*
 * file.c - NTFS kernel file operations. Part of the Linux-NTFS project.
 *
 * Copyright (c) 2001-2011 Anton Altaparmakov and Tuxera Inc.
 *
 * This program/include file is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as published
 * by the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program/include file is distributed in the hope that it will be
 * useful, but WITHOUT ANY WARRANTY; without even the implied warranty
 * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program (in the main directory of the Linux-NTFS
 * distribution in the file COPYING); if not, write to the Free Software
 * Foundation,Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 */

#include <linux/buffer_head.h>
#include <linux/gfp.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/sched.h>
#include <linux/swap.h>
#include <linux/uio.h>
#include <linux/writeback.h>
#include <linux/aio.h>

#include <asm/page.h>
#include <asm/uaccess.h>

#include "attrib.h"
#include "bitmap.h"
#include "inode.h"
#include "debug.h"
#include "lcnalloc.h"
#include "malloc.h"
#include "mft.h"
#include "ntfs.h"

/**
 * ntfs_file_open - called when an inode is about to be opened
 * @vi: inode to be opened
 * @filp: file structure describing the inode
 *
 * Limit file size to the page cache limit on architectures where unsigned long
 * is 32-bits. This is the most we can do for now without overflowing the page
 * cache page index. Doing it this way means we don't run into problems because
 * of existing too large files. It would be better to allow the user to read
 * the beginning of the file but I doubt very much anyone is going to hit this
 * check on a 32-bit architecture, so there is no point in adding the extra
 * complexity required to support this.
 *
 * On 64-bit architectures, the check is hopefully optimized away by the
 * compiler.
 *
 * After the check passes, just call generic_file_open() to do its work.
 */
static int ntfs_file_open(struct inode *vi, struct file *filp)
{
	if (sizeof(unsigned long) < 8) {
		if (i_size_read(vi) > MAX_LFS_FILESIZE)
			return -EOVERFLOW;
	}
	return generic_file_open(vi, filp);
}
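
/*
 * Editor's note: an illustrative sketch, not part of the original file. It
 * restates the 32-bit guard above as a standalone predicate; the helper name
 * is hypothetical.
 */
static inline bool ntfs_size_fits_page_cache_example(loff_t size)
{
	/* A 32-bit unsigned long means a 32-bit page cache page index. */
	return sizeof(unsigned long) >= 8 || size <= MAX_LFS_FILESIZE;
}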

#ifdef NTFS_RW

/**
 * ntfs_attr_extend_initialized - extend the initialized size of an attribute
 * @ni: ntfs inode of the attribute to extend
 * @new_init_size: requested new initialized size in bytes
 * @cached_page: store any allocated but unused page here
 * @lru_pvec: lru-buffering pagevec of the caller
 *
 * Extend the initialized size of an attribute described by the ntfs inode @ni
 * to @new_init_size bytes. This involves zeroing any non-sparse space between
 * the old initialized size and @new_init_size both in the page cache and on
 * disk (if relevant complete pages are already uptodate in the page cache then
 * these are simply marked dirty).
 *
 * As a side-effect, the file size (vfs inode->i_size) may be incremented as,
 * in the resident attribute case, it is tied to the initialized size and, in
 * the non-resident attribute case, it may not fall below the initialized size.
 *
 * Note that if the attribute is resident, we do not need to touch the page
 * cache at all. This is because if the page cache page is not uptodate we
 * bring it uptodate later, when doing the write to the mft record since we
 * then already have the page mapped. And if the page is uptodate, the
 * non-initialized region will already have been zeroed when the page was
 * brought uptodate and the region may in fact already have been overwritten
 * with new data via mmap() based writes, so we cannot just zero it. And since
 * POSIX specifies that the behaviour of resizing a file whilst it is mmap()ped
 * is unspecified, we choose not to do zeroing and thus we do not need to touch
 * the page at all. For a more detailed explanation see ntfs_truncate() in
 * fs/ntfs/inode.c.
 *
 * Return 0 on success and -errno on error. In the case that an error is
 * encountered it is possible that the initialized size will already have been
 * incremented some way towards @new_init_size but it is guaranteed that if
 * this is the case, the necessary zeroing will also have happened and that all
 * metadata is self-consistent.
 *
 * Locking: i_mutex on the vfs inode corresponding to the ntfs inode @ni must be
 * held by the caller.
 */
static int ntfs_attr_extend_initialized(ntfs_inode *ni, const s64 new_init_size)
{
	s64 old_init_size;
	loff_t old_i_size;
	pgoff_t index, end_index;
	unsigned long flags;
	struct inode *vi = VFS_I(ni);
	ntfs_inode *base_ni;
	MFT_RECORD *m = NULL;
	ATTR_RECORD *a;
	ntfs_attr_search_ctx *ctx = NULL;
	struct address_space *mapping;
	struct page *page = NULL;
	u8 *kattr;
	int err;
	u32 attr_len;

	read_lock_irqsave(&ni->size_lock, flags);
	old_init_size = ni->initialized_size;
	old_i_size = i_size_read(vi);
	BUG_ON(new_init_size > ni->allocated_size);
	read_unlock_irqrestore(&ni->size_lock, flags);
	ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, "
			"old_initialized_size 0x%llx, "
			"new_initialized_size 0x%llx, i_size 0x%llx.",
			vi->i_ino, (unsigned)le32_to_cpu(ni->type),
			(unsigned long long)old_init_size,
			(unsigned long long)new_init_size, old_i_size);
	if (!NInoAttr(ni))
		base_ni = ni;
	else
		base_ni = ni->ext.base_ntfs_ino;
	/* Use goto to reduce indentation and we need the label below anyway. */
	if (NInoNonResident(ni))
		goto do_non_resident_extend;
	BUG_ON(old_init_size != old_i_size);
	m = map_mft_record(base_ni);
	if (IS_ERR(m)) {
		err = PTR_ERR(m);
		m = NULL;
		goto err_out;
	}
	ctx = ntfs_attr_get_search_ctx(base_ni, m);
	if (unlikely(!ctx)) {
		err = -ENOMEM;
		goto err_out;
	}
	err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
			CASE_SENSITIVE, 0, NULL, 0, ctx);
	if (unlikely(err)) {
		if (err == -ENOENT)
			err = -EIO;
		goto err_out;
	}
	m = ctx->mrec;
	a = ctx->attr;
	BUG_ON(a->non_resident);
	/* The total length of the attribute value. */
	attr_len = le32_to_cpu(a->data.resident.value_length);
	BUG_ON(old_i_size != (loff_t)attr_len);
	/*
	 * Do the zeroing in the mft record and update the attribute size in
	 * the mft record.
	 */
	kattr = (u8*)a + le16_to_cpu(a->data.resident.value_offset);
	memset(kattr + attr_len, 0, new_init_size - attr_len);
	a->data.resident.value_length = cpu_to_le32((u32)new_init_size);
	/* Finally, update the sizes in the vfs and ntfs inodes. */
	write_lock_irqsave(&ni->size_lock, flags);
	i_size_write(vi, new_init_size);
	ni->initialized_size = new_init_size;
	write_unlock_irqrestore(&ni->size_lock, flags);
	goto done;
do_non_resident_extend:
	/*
	 * If the new initialized size @new_init_size exceeds the current file
	 * size (vfs inode->i_size), we need to extend the file size to the
	 * new initialized size.
	 */
	if (new_init_size > old_i_size) {
		m = map_mft_record(base_ni);
		if (IS_ERR(m)) {
			err = PTR_ERR(m);
			m = NULL;
			goto err_out;
		}
		ctx = ntfs_attr_get_search_ctx(base_ni, m);
		if (unlikely(!ctx)) {
			err = -ENOMEM;
			goto err_out;
		}
		err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
				CASE_SENSITIVE, 0, NULL, 0, ctx);
		if (unlikely(err)) {
			if (err == -ENOENT)
				err = -EIO;
			goto err_out;
		}
		m = ctx->mrec;
		a = ctx->attr;
		BUG_ON(!a->non_resident);
		BUG_ON(old_i_size != (loff_t)
				sle64_to_cpu(a->data.non_resident.data_size));
		a->data.non_resident.data_size = cpu_to_sle64(new_init_size);
		flush_dcache_mft_record_page(ctx->ntfs_ino);
		mark_mft_record_dirty(ctx->ntfs_ino);
		/* Update the file size in the vfs inode. */
		i_size_write(vi, new_init_size);
		ntfs_attr_put_search_ctx(ctx);
		ctx = NULL;
		unmap_mft_record(base_ni);
		m = NULL;
	}
	mapping = vi->i_mapping;
	index = old_init_size >> PAGE_CACHE_SHIFT;
	end_index = (new_init_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	do {
		/*
		 * Read the page. If the page is not present, this will zero
		 * the uninitialized regions for us.
		 */
		page = read_mapping_page(mapping, index, NULL);
		if (IS_ERR(page)) {
			err = PTR_ERR(page);
			goto init_err_out;
		}
		if (unlikely(PageError(page))) {
			page_cache_release(page);
			err = -EIO;
			goto init_err_out;
		}
		/*
		 * Update the initialized size in the ntfs inode. This is
		 * enough to make ntfs_writepage() work.
		 */
		write_lock_irqsave(&ni->size_lock, flags);
		ni->initialized_size = (s64)(index + 1) << PAGE_CACHE_SHIFT;
		if (ni->initialized_size > new_init_size)
			ni->initialized_size = new_init_size;
		write_unlock_irqrestore(&ni->size_lock, flags);
		/* Set the page dirty so it gets written out. */
		set_page_dirty(page);
		page_cache_release(page);
		/*
		 * Play nice with the vm and the rest of the system. This is
		 * very much needed as we can potentially be modifying the
		 * initialised size from a very small value to a really huge
		 * value, e.g.
		 *	f = open(somefile, O_TRUNC);
		 *	truncate(f, 10GiB);
		 *	seek(f, 10GiB);
		 *	write(f, 1);
		 * And this would mean we would be marking dirty hundreds of
		 * thousands of pages or as in the above example more than
		 * two and a half million pages!
		 *
		 * TODO: For sparse pages could optimize this workload by using
		 * the FsMisc / MiscFs page bit as a "PageIsSparse" bit. This
		 * would be set in readpage for sparse pages and here we would
		 * not need to mark dirty any pages which have this bit set.
		 * The only caveat is that we have to clear the bit everywhere
		 * where we allocate any clusters that lie in the page or that
		 * contain the page.
		 *
		 * TODO: An even greater optimization would be for us to only
		 * call readpage() on pages which are not in sparse regions as
		 * determined from the runlist. This would greatly reduce the
		 * number of pages we read and make dirty in the case of sparse
		 * files.
		 */
		balance_dirty_pages_ratelimited(mapping);
		cond_resched();
	} while (++index < end_index);
	read_lock_irqsave(&ni->size_lock, flags);
	BUG_ON(ni->initialized_size != new_init_size);
	read_unlock_irqrestore(&ni->size_lock, flags);
	/* Now bring in sync the initialized_size in the mft record. */
	m = map_mft_record(base_ni);
	if (IS_ERR(m)) {
		err = PTR_ERR(m);
		m = NULL;
		goto init_err_out;
	}
	ctx = ntfs_attr_get_search_ctx(base_ni, m);
	if (unlikely(!ctx)) {
		err = -ENOMEM;
		goto init_err_out;
	}
	err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
			CASE_SENSITIVE, 0, NULL, 0, ctx);
	if (unlikely(err)) {
		if (err == -ENOENT)
			err = -EIO;
		goto init_err_out;
	}
	m = ctx->mrec;
	a = ctx->attr;
	BUG_ON(!a->non_resident);
	a->data.non_resident.initialized_size = cpu_to_sle64(new_init_size);
done:
	flush_dcache_mft_record_page(ctx->ntfs_ino);
	mark_mft_record_dirty(ctx->ntfs_ino);
	if (ctx)
		ntfs_attr_put_search_ctx(ctx);
	if (m)
		unmap_mft_record(base_ni);
	ntfs_debug("Done, initialized_size 0x%llx, i_size 0x%llx.",
			(unsigned long long)new_init_size, i_size_read(vi));
	return 0;
init_err_out:
	write_lock_irqsave(&ni->size_lock, flags);
	ni->initialized_size = old_init_size;
	write_unlock_irqrestore(&ni->size_lock, flags);
err_out:
	if (ctx)
		ntfs_attr_put_search_ctx(ctx);
	if (m)
		unmap_mft_record(base_ni);
	ntfs_debug("Failed. Returning error code %i.", err);
	return err;
}
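
/*
 * Editor's note: an illustrative sketch, not part of the original file. A
 * buffered-write path would typically extend the initialized size before
 * writing beyond it, roughly as below; the helper name is hypothetical.
 */
static inline int ntfs_extend_initialized_if_needed_example(ntfs_inode *ni,
		loff_t pos, size_t count)
{
	unsigned long flags;
	s64 init_size;

	read_lock_irqsave(&ni->size_lock, flags);
	init_size = ni->initialized_size;
	read_unlock_irqrestore(&ni->size_lock, flags);
	if (pos + count <= init_size)
		return 0;
	return ntfs_attr_extend_initialized(ni, pos + count);
}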

/**
 * ntfs_fault_in_pages_readable -
 *
 * Fault a number of userspace pages into pagetables.
 *
 * Unlike include/linux/pagemap.h::fault_in_pages_readable(), this one copes
 * with more than two userspace pages as well as handling the single page case
 * elegantly.
 *
 * If you find this difficult to understand, then think of the while loop being
 * the following code, except that we do without the integer variable ret:
 *
 *	do {
 *		ret = __get_user(c, uaddr);
 *		uaddr += PAGE_SIZE;
 *	} while (!ret && uaddr < end);
 *
 * Note, the final __get_user() may well run out-of-bounds of the user buffer,
 * but _not_ out-of-bounds of the page the user buffer belongs to, and since
 * this is only a read and not a write, and since it is still in the same page,
 * it should not matter and this makes the code much simpler.
 */
static inline void ntfs_fault_in_pages_readable(const char __user *uaddr,
		int bytes)
{
	const char __user *end;
	volatile char c;

	/* Set @end to the first byte outside the last page we care about. */
	end = (const char __user*)PAGE_ALIGN((unsigned long)uaddr + bytes);

	while (!__get_user(c, uaddr) && (uaddr += PAGE_SIZE, uaddr < end))
		;
}
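
/*
 * Editor's note: an illustrative sketch, not part of the original file. A
 * write path would prefault the source buffer with the helper above before
 * taking page locks, so that the later in-atomic copy does not fault; the
 * wrapper name is hypothetical.
 */
static inline void ntfs_prefault_user_buffer_example(const char __user *buf,
		int bytes)
{
	if (bytes > 0)
		ntfs_fault_in_pages_readable(buf, bytes);
}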

/**
 * ntfs_fault_in_pages_readable_iovec -
 *
 * Same as ntfs_fault_in_pages_readable() but operates on an array of iovecs.
 */
static inline void ntfs_fault_in_pages_readable_iovec(const struct iovec *iov,
		size_t iov_ofs, int bytes)
{
	do {
		const char __user *buf;
		unsigned len;

		buf = iov->iov_base + iov_ofs;
		len = iov->iov_len - iov_ofs;
		if (len > bytes)
			len = bytes;
		ntfs_fault_in_pages_readable(buf, len);
		bytes -= len;
		iov++;
		iov_ofs = 0;
	} while (bytes);
}
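
/*
 * Editor's note: an illustrative sketch, not part of the original file. It
 * shows how a single user buffer can be wrapped in a one-element iovec and
 * prefaulted with the helper above; the wrapper name is hypothetical.
 */
static inline void ntfs_prefault_single_iovec_example(const char __user *buf,
		int bytes)
{
	struct iovec iov = {
		.iov_base = (void __user *)buf,
		.iov_len = bytes,
	};

	if (bytes > 0)
		ntfs_fault_in_pages_readable_iovec(&iov, 0, bytes);
}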

/**
 * __ntfs_grab_cache_pages - obtain a number of locked pages
 * @mapping: address space mapping from which to obtain page cache pages
 * @index: starting index in @mapping at which to begin obtaining pages
 * @nr_pages: number of page cache pages to obtain
 * @pages: array of pages in which to return the obtained page cache pages
 * @cached_page: allocated but as yet unused page
 * @lru_pvec: lru-buffering pagevec of caller
 *
 * Obtain @nr_pages locked page cache pages from the mapping @mapping and
 * starting at index @index.
 *
 * If a page is newly created, add it to lru list
 *
 * Note, the page locks are obtained in ascending page index order.
 */
static inline int __ntfs_grab_cache_pages(struct address_space *mapping,
		pgoff_t index, const unsigned nr_pages, struct page **pages,
		struct page **cached_page)
{
	int err, nr;

	BUG_ON(!nr_pages);
	err = nr = 0;
	do {
		pages[nr] = find_lock_page(mapping, index);
		if (!pages[nr]) {
			if (!*cached_page) {
				*cached_page = page_cache_alloc(mapping);
				if (unlikely(!*cached_page)) {
					err = -ENOMEM;
					goto err_out;
				}
			}
			err = add_to_page_cache_lru(*cached_page, mapping, index,
					GFP_KERNEL);
			if (unlikely(err)) {
				if (err == -EEXIST)
					continue;
				goto err_out;
			}
			pages[nr] = *cached_page;
			*cached_page = NULL;
		}
		index++;
		nr++;
	} while (nr < nr_pages);
out:
	return err;
err_out:
	while (nr > 0) {
		unlock_page(pages[--nr]);
		page_cache_release(pages[nr]);
	}
	goto out;
}
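
/*
 * Editor's note: an illustrative sketch, not part of the original file. It
 * shows the calling convention of __ntfs_grab_cache_pages(): the caller owns
 * @cached_page and must release it if it is left over. The function name and
 * the fixed page count are hypothetical.
 */
static inline int ntfs_grab_two_pages_example(struct address_space *mapping,
		pgoff_t index, struct page **pages)
{
	struct page *cached_page = NULL;
	int err;

	err = __ntfs_grab_cache_pages(mapping, index, 2, pages, &cached_page);
	if (cached_page)
		page_cache_release(cached_page);
	/*
	 * On success, pages[0] and pages[1] are locked and must be unlocked
	 * and released by the caller.
	 */
	return err;
}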

static inline int ntfs_submit_bh_for_read(struct buffer_head *bh)
{
	lock_buffer(bh);
	get_bh(bh);
	bh->b_end_io = end_buffer_read_sync;
	return submit_bh(READ, bh);
}
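
/*
 * Editor's note: an illustrative sketch, not part of the original file. The
 * code below submits reads with ntfs_submit_bh_for_read() and later waits on
 * the buffers; a synchronous equivalent would look roughly like this. The
 * function name is hypothetical.
 */
static inline int ntfs_read_bh_sync_example(struct buffer_head *bh)
{
	int err = ntfs_submit_bh_for_read(bh);

	wait_on_buffer(bh);
	if (!err && unlikely(!buffer_uptodate(bh)))
		err = -EIO;
	return err;
}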
455 455
456 /** 456 /**
457 * ntfs_prepare_pages_for_non_resident_write - prepare pages for receiving data 457 * ntfs_prepare_pages_for_non_resident_write - prepare pages for receiving data
458 * @pages: array of destination pages 458 * @pages: array of destination pages
459 * @nr_pages: number of pages in @pages 459 * @nr_pages: number of pages in @pages
460 * @pos: byte position in file at which the write begins 460 * @pos: byte position in file at which the write begins
461 * @bytes: number of bytes to be written 461 * @bytes: number of bytes to be written
462 * 462 *
463 * This is called for non-resident attributes from ntfs_file_buffered_write() 463 * This is called for non-resident attributes from ntfs_file_buffered_write()
464 * with i_mutex held on the inode (@pages[0]->mapping->host). There are 464 * with i_mutex held on the inode (@pages[0]->mapping->host). There are
465 * @nr_pages pages in @pages which are locked but not kmap()ped. The source 465 * @nr_pages pages in @pages which are locked but not kmap()ped. The source
466 * data has not yet been copied into the @pages. 466 * data has not yet been copied into the @pages.
467 * 467 *
468 * Need to fill any holes with actual clusters, allocate buffers if necessary, 468 * Need to fill any holes with actual clusters, allocate buffers if necessary,
469 * ensure all the buffers are mapped, and bring uptodate any buffers that are 469 * ensure all the buffers are mapped, and bring uptodate any buffers that are
470 * only partially being written to. 470 * only partially being written to.
471 * 471 *
472 * If @nr_pages is greater than one, we are guaranteed that the cluster size is 472 * If @nr_pages is greater than one, we are guaranteed that the cluster size is
473 * greater than PAGE_CACHE_SIZE, that all pages in @pages are entirely inside
474 * the same cluster and that they are the entirety of that cluster, and that
475 * the cluster is sparse, i.e. we need to allocate a cluster to fill the hole.
476 *
477 * i_size is not to be modified yet.
478 *
479 * Return 0 on success or -errno on error.
480 */
481 static int ntfs_prepare_pages_for_non_resident_write(struct page **pages,
482 unsigned nr_pages, s64 pos, size_t bytes)
483 {
484 VCN vcn, highest_vcn = 0, cpos, cend, bh_cpos, bh_cend;
485 LCN lcn;
486 s64 bh_pos, vcn_len, end, initialized_size;
487 sector_t lcn_block;
488 struct page *page;
489 struct inode *vi;
490 ntfs_inode *ni, *base_ni = NULL;
491 ntfs_volume *vol;
492 runlist_element *rl, *rl2;
493 struct buffer_head *bh, *head, *wait[2], **wait_bh = wait;
494 ntfs_attr_search_ctx *ctx = NULL;
495 MFT_RECORD *m = NULL;
496 ATTR_RECORD *a = NULL;
497 unsigned long flags;
498 u32 attr_rec_len = 0;
499 unsigned blocksize, u;
500 int err, mp_size;
501 bool rl_write_locked, was_hole, is_retry;
502 unsigned char blocksize_bits;
503 struct {
504 u8 runlist_merged:1;
505 u8 mft_attr_mapped:1;
506 u8 mp_rebuilt:1;
507 u8 attr_switched:1;
508 } status = { 0, 0, 0, 0 };
509
510 BUG_ON(!nr_pages);
511 BUG_ON(!pages);
512 BUG_ON(!*pages);
513 vi = pages[0]->mapping->host;
514 ni = NTFS_I(vi);
515 vol = ni->vol;
516 ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, start page "
517 "index 0x%lx, nr_pages 0x%x, pos 0x%llx, bytes 0x%zx.",
518 vi->i_ino, ni->type, pages[0]->index, nr_pages,
519 (long long)pos, bytes);
520 blocksize = vol->sb->s_blocksize;
521 blocksize_bits = vol->sb->s_blocksize_bits;
522 u = 0;
523 do {
524 page = pages[u];
525 BUG_ON(!page);
526 /*
527 * create_empty_buffers() will create uptodate/dirty buffers if
528 * the page is uptodate/dirty.
529 */
530 if (!page_has_buffers(page)) {
531 create_empty_buffers(page, blocksize, 0);
532 if (unlikely(!page_has_buffers(page)))
533 return -ENOMEM;
534 }
535 } while (++u < nr_pages);
536 rl_write_locked = false;
537 rl = NULL;
538 err = 0;
539 vcn = lcn = -1;
540 vcn_len = 0;
541 lcn_block = -1;
542 was_hole = false;
543 cpos = pos >> vol->cluster_size_bits;
544 end = pos + bytes;
545 cend = (end + vol->cluster_size - 1) >> vol->cluster_size_bits;
546 /*
547 * Loop over each page and for each page over each buffer. Use goto to
548 * reduce indentation.
549 */
550 u = 0;
551 do_next_page:
552 page = pages[u];
553 bh_pos = (s64)page->index << PAGE_CACHE_SHIFT;
554 bh = head = page_buffers(page);
555 do {
556 VCN cdelta;
557 s64 bh_end;
558 unsigned bh_cofs;
559
560 /* Clear buffer_new on all buffers to reinitialise state. */
561 if (buffer_new(bh))
562 clear_buffer_new(bh);
563 bh_end = bh_pos + blocksize;
564 bh_cpos = bh_pos >> vol->cluster_size_bits;
565 bh_cofs = bh_pos & vol->cluster_size_mask;
566 if (buffer_mapped(bh)) {
567 /*
568 * The buffer is already mapped. If it is uptodate,
569 * ignore it.
570 */
571 if (buffer_uptodate(bh))
572 continue;
573 /*
574 * The buffer is not uptodate. If the page is uptodate
575 * set the buffer uptodate and otherwise ignore it.
576 */
577 if (PageUptodate(page)) {
578 set_buffer_uptodate(bh);
579 continue;
580 }
581 /*
582 * Neither the page nor the buffer are uptodate. If
583 * the buffer is only partially being written to, we
584 * need to read it in before the write, i.e. now.
585 */
586 if ((bh_pos < pos && bh_end > pos) ||
587 (bh_pos < end && bh_end > end)) {
588 /*
589 * If the buffer is fully or partially within
590 * the initialized size, do an actual read.
591 * Otherwise, simply zero the buffer.
592 */
593 read_lock_irqsave(&ni->size_lock, flags);
594 initialized_size = ni->initialized_size;
595 read_unlock_irqrestore(&ni->size_lock, flags);
596 if (bh_pos < initialized_size) {
597 ntfs_submit_bh_for_read(bh);
598 *wait_bh++ = bh;
599 } else {
600 zero_user(page, bh_offset(bh),
601 blocksize);
602 set_buffer_uptodate(bh);
603 }
604 }
605 continue;
606 }
607 /* Unmapped buffer. Need to map it. */
608 bh->b_bdev = vol->sb->s_bdev;
609 /*
610 * If the current buffer is in the same clusters as the map
611 * cache, there is no need to check the runlist again. The
612 * map cache is made up of @vcn, which is the first cached file
613 * cluster, @vcn_len which is the number of cached file
614 * clusters, @lcn is the device cluster corresponding to @vcn,
615 * and @lcn_block is the block number corresponding to @lcn.
616 */
617 cdelta = bh_cpos - vcn;
618 if (likely(!cdelta || (cdelta > 0 && cdelta < vcn_len))) {
619 map_buffer_cached:
620 BUG_ON(lcn < 0);
621 bh->b_blocknr = lcn_block +
622 (cdelta << (vol->cluster_size_bits -
623 blocksize_bits)) +
624 (bh_cofs >> blocksize_bits);
625 set_buffer_mapped(bh);
626 /*
627 * If the page is uptodate so is the buffer. If the
628 * buffer is fully outside the write, we ignore it if
629 * it was already allocated and we mark it dirty so it
630 * gets written out if we allocated it. On the other
631 * hand, if we allocated the buffer but we are not
632 * marking it dirty we set buffer_new so we can do
633 * error recovery.
634 */
635 if (PageUptodate(page)) {
636 if (!buffer_uptodate(bh))
637 set_buffer_uptodate(bh);
638 if (unlikely(was_hole)) {
639 /* We allocated the buffer. */
640 unmap_underlying_metadata(bh->b_bdev,
641 bh->b_blocknr);
642 if (bh_end <= pos || bh_pos >= end)
643 mark_buffer_dirty(bh);
644 else
645 set_buffer_new(bh);
646 }
647 continue;
648 }
649 /* Page is _not_ uptodate. */
650 if (likely(!was_hole)) {
651 /*
652 * Buffer was already allocated. If it is not
653 * uptodate and is only partially being written
654 * to, we need to read it in before the write,
655 * i.e. now.
656 */
657 if (!buffer_uptodate(bh) && bh_pos < end &&
658 bh_end > pos &&
659 (bh_pos < pos ||
660 bh_end > end)) {
661 /*
662 * If the buffer is fully or partially
663 * within the initialized size, do an
664 * actual read. Otherwise, simply zero
665 * the buffer.
666 */
667 read_lock_irqsave(&ni->size_lock,
668 flags);
669 initialized_size = ni->initialized_size;
670 read_unlock_irqrestore(&ni->size_lock,
671 flags);
672 if (bh_pos < initialized_size) {
673 ntfs_submit_bh_for_read(bh);
674 *wait_bh++ = bh;
675 } else {
676 zero_user(page, bh_offset(bh),
677 blocksize);
678 set_buffer_uptodate(bh);
679 }
680 }
681 continue;
682 }
683 /* We allocated the buffer. */
684 unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
685 /*
686 * If the buffer is fully outside the write, zero it,
687 * set it uptodate, and mark it dirty so it gets
688 * written out. If it is partially being written to,
689 * zero region surrounding the write but leave it to
690 * commit write to do anything else. Finally, if the
691 * buffer is fully being overwritten, do nothing.
692 */
693 if (bh_end <= pos || bh_pos >= end) {
694 if (!buffer_uptodate(bh)) {
695 zero_user(page, bh_offset(bh),
696 blocksize);
697 set_buffer_uptodate(bh);
698 }
699 mark_buffer_dirty(bh);
700 continue;
701 }
702 set_buffer_new(bh);
703 if (!buffer_uptodate(bh) &&
704 (bh_pos < pos || bh_end > end)) {
705 u8 *kaddr;
706 unsigned pofs;
707
708 kaddr = kmap_atomic(page);
709 if (bh_pos < pos) {
710 pofs = bh_pos & ~PAGE_CACHE_MASK;
711 memset(kaddr + pofs, 0, pos - bh_pos);
712 }
713 if (bh_end > end) {
714 pofs = end & ~PAGE_CACHE_MASK;
715 memset(kaddr + pofs, 0, bh_end - end);
716 }
717 kunmap_atomic(kaddr);
718 flush_dcache_page(page);
719 }
720 continue;
721 }
722 /*
723 * Slow path: this is the first buffer in the cluster. If it
724 * is outside allocated size and is not uptodate, zero it and
725 * set it uptodate.
726 */
727 read_lock_irqsave(&ni->size_lock, flags);
728 initialized_size = ni->allocated_size;
729 read_unlock_irqrestore(&ni->size_lock, flags);
730 if (bh_pos > initialized_size) {
731 if (PageUptodate(page)) {
732 if (!buffer_uptodate(bh))
733 set_buffer_uptodate(bh);
734 } else if (!buffer_uptodate(bh)) {
735 zero_user(page, bh_offset(bh), blocksize);
736 set_buffer_uptodate(bh);
737 }
738 continue;
739 }
740 is_retry = false;
741 if (!rl) {
742 down_read(&ni->runlist.lock);
743 retry_remap:
744 rl = ni->runlist.rl;
745 }
746 if (likely(rl != NULL)) {
747 /* Seek to element containing target cluster. */
748 while (rl->length && rl[1].vcn <= bh_cpos)
749 rl++;
750 lcn = ntfs_rl_vcn_to_lcn(rl, bh_cpos);
751 if (likely(lcn >= 0)) {
752 /*
753 * Successful remap, setup the map cache and
754 * use that to deal with the buffer.
755 */
756 was_hole = false;
757 vcn = bh_cpos;
758 vcn_len = rl[1].vcn - vcn;
759 lcn_block = lcn << (vol->cluster_size_bits -
760 blocksize_bits);
761 cdelta = 0;
762 /*
763 * If the number of remaining clusters touched
764 * by the write is smaller or equal to the
765 * number of cached clusters, unlock the
766 * runlist as the map cache will be used from
767 * now on.
768 */
769 if (likely(vcn + vcn_len >= cend)) {
770 if (rl_write_locked) {
771 up_write(&ni->runlist.lock);
772 rl_write_locked = false;
773 } else
774 up_read(&ni->runlist.lock);
775 rl = NULL;
776 }
777 goto map_buffer_cached;
778 }
779 } else
780 lcn = LCN_RL_NOT_MAPPED;
781 /*
782 * If it is not a hole and not out of bounds, the runlist is
783 * probably unmapped so try to map it now.
784 */
785 if (unlikely(lcn != LCN_HOLE && lcn != LCN_ENOENT)) {
786 if (likely(!is_retry && lcn == LCN_RL_NOT_MAPPED)) {
787 /* Attempt to map runlist. */
788 if (!rl_write_locked) {
789 /*
790 * We need the runlist locked for
791 * writing, so if it is locked for
792 * reading relock it now and retry in
793 * case it changed whilst we dropped
794 * the lock.
795 */
796 up_read(&ni->runlist.lock);
797 down_write(&ni->runlist.lock);
798 rl_write_locked = true;
799 goto retry_remap;
800 }
801 err = ntfs_map_runlist_nolock(ni, bh_cpos,
802 NULL);
803 if (likely(!err)) {
804 is_retry = true;
805 goto retry_remap;
806 }
807 /*
808 * If @vcn is out of bounds, pretend @lcn is
809 * LCN_ENOENT. As long as the buffer is out
810 * of bounds this will work fine.
811 */
812 if (err == -ENOENT) {
813 lcn = LCN_ENOENT;
814 err = 0;
815 goto rl_not_mapped_enoent;
816 }
817 } else
818 err = -EIO;
819 /* Failed to map the buffer, even after retrying. */
820 bh->b_blocknr = -1;
821 ntfs_error(vol->sb, "Failed to write to inode 0x%lx, "
822 "attribute type 0x%x, vcn 0x%llx, "
823 "vcn offset 0x%x, because its "
824 "location on disk could not be "
825 "determined%s (error code %i).",
826 ni->mft_no, ni->type,
827 (unsigned long long)bh_cpos,
828 (unsigned)bh_pos &
829 vol->cluster_size_mask,
830 is_retry ? " even after retrying" : "",
831 err);
832 break;
833 }
834 rl_not_mapped_enoent:
835 /*
836 * The buffer is in a hole or out of bounds. We need to fill
837 * the hole, unless the buffer is in a cluster which is not
838 * touched by the write, in which case we just leave the buffer
839 * unmapped. This can only happen when the cluster size is
840 * less than the page cache size.
841 */
842 if (unlikely(vol->cluster_size < PAGE_CACHE_SIZE)) {
843 bh_cend = (bh_end + vol->cluster_size - 1) >>
844 vol->cluster_size_bits;
845 if ((bh_cend <= cpos || bh_cpos >= cend)) {
846 bh->b_blocknr = -1;
847 /*
848 * If the buffer is uptodate we skip it. If it
849 * is not but the page is uptodate, we can set
850 * the buffer uptodate. If the page is not
851 * uptodate, we can clear the buffer and set it
852 * uptodate. Whether this is worthwhile is
853 * debatable and this could be removed.
854 */
855 if (PageUptodate(page)) {
856 if (!buffer_uptodate(bh))
857 set_buffer_uptodate(bh);
858 } else if (!buffer_uptodate(bh)) {
859 zero_user(page, bh_offset(bh),
860 blocksize);
861 set_buffer_uptodate(bh);
862 }
863 continue;
864 }
865 }
866 /*
867 * Out of bounds buffer is invalid if it was not really out of
868 * bounds.
869 */
870 BUG_ON(lcn != LCN_HOLE);
871 /*
872 * We need the runlist locked for writing, so if it is locked
873 * for reading relock it now and retry in case it changed
874 * whilst we dropped the lock.
875 */
876 BUG_ON(!rl);
877 if (!rl_write_locked) {
878 up_read(&ni->runlist.lock);
879 down_write(&ni->runlist.lock);
880 rl_write_locked = true;
881 goto retry_remap;
882 }
883 /* Find the previous last allocated cluster. */
884 BUG_ON(rl->lcn != LCN_HOLE);
885 lcn = -1;
886 rl2 = rl;
887 while (--rl2 >= ni->runlist.rl) {
888 if (rl2->lcn >= 0) {
889 lcn = rl2->lcn + rl2->length;
890 break;
891 }
892 }
893 rl2 = ntfs_cluster_alloc(vol, bh_cpos, 1, lcn, DATA_ZONE,
894 false);
895 if (IS_ERR(rl2)) {
896 err = PTR_ERR(rl2);
897 ntfs_debug("Failed to allocate cluster, error code %i.",
898 err);
899 break;
900 }
901 lcn = rl2->lcn;
902 rl = ntfs_runlists_merge(ni->runlist.rl, rl2);
903 if (IS_ERR(rl)) {
904 err = PTR_ERR(rl);
905 if (err != -ENOMEM)
906 err = -EIO;
907 if (ntfs_cluster_free_from_rl(vol, rl2)) {
908 ntfs_error(vol->sb, "Failed to release "
909 "allocated cluster in error "
910 "code path. Run chkdsk to "
911 "recover the lost cluster.");
912 NVolSetErrors(vol);
913 }
914 ntfs_free(rl2);
915 break;
916 }
917 ni->runlist.rl = rl;
918 status.runlist_merged = 1;
919 ntfs_debug("Allocated cluster, lcn 0x%llx.",
920 (unsigned long long)lcn);
921 /* Map and lock the mft record and get the attribute record. */
922 if (!NInoAttr(ni))
923 base_ni = ni;
924 else
925 base_ni = ni->ext.base_ntfs_ino;
926 m = map_mft_record(base_ni);
927 if (IS_ERR(m)) {
928 err = PTR_ERR(m);
929 break;
930 }
931 ctx = ntfs_attr_get_search_ctx(base_ni, m);
932 if (unlikely(!ctx)) {
933 err = -ENOMEM;
934 unmap_mft_record(base_ni);
935 break;
936 }
937 status.mft_attr_mapped = 1;
938 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
939 CASE_SENSITIVE, bh_cpos, NULL, 0, ctx);
940 if (unlikely(err)) {
941 if (err == -ENOENT)
942 err = -EIO;
943 break;
944 }
945 m = ctx->mrec;
946 a = ctx->attr;
947 /*
948 * Find the runlist element with which the attribute extent
949 * starts. Note, we cannot use the _attr_ version because we
950 * have mapped the mft record. That is ok because we know the
951 * runlist fragment must be mapped already to have ever gotten
952 * here, so we can just use the _rl_ version.
953 */
954 vcn = sle64_to_cpu(a->data.non_resident.lowest_vcn);
955 rl2 = ntfs_rl_find_vcn_nolock(rl, vcn);
956 BUG_ON(!rl2);
957 BUG_ON(!rl2->length);
958 BUG_ON(rl2->lcn < LCN_HOLE);
959 highest_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn);
960 /*
961 * If @highest_vcn is zero, calculate the real highest_vcn
962 * (which can really be zero).
963 */
964 if (!highest_vcn)
965 highest_vcn = (sle64_to_cpu(
966 a->data.non_resident.allocated_size) >>
967 vol->cluster_size_bits) - 1;
968 /*
969 * Determine the size of the mapping pairs array for the new
970 * extent, i.e. the old extent with the hole filled.
971 */
972 mp_size = ntfs_get_size_for_mapping_pairs(vol, rl2, vcn,
973 highest_vcn);
974 if (unlikely(mp_size <= 0)) {
975 if (!(err = mp_size))
976 err = -EIO;
977 ntfs_debug("Failed to get size for mapping pairs "
978 "array, error code %i.", err);
979 break;
980 }
981 /*
982 * Resize the attribute record to fit the new mapping pairs
983 * array.
984 */
985 attr_rec_len = le32_to_cpu(a->length);
986 err = ntfs_attr_record_resize(m, a, mp_size + le16_to_cpu(
987 a->data.non_resident.mapping_pairs_offset));
988 if (unlikely(err)) {
989 BUG_ON(err != -ENOSPC);
990 // TODO: Deal with this by using the current attribute
991 // and fill it with as much of the mapping pairs
992 // array as possible. Then loop over each attribute
993 // extent rewriting the mapping pairs arrays as we go
994 // along and if when we reach the end we have not
995 // enough space, try to resize the last attribute
996 // extent and if even that fails, add a new attribute
997 // extent.
998 // We could also try to resize at each step in the hope
999 // that we will not need to rewrite every single extent.
1000 // Note, we may need to decompress some extents to fill
1001 // the runlist as we are walking the extents...
1002 ntfs_error(vol->sb, "Not enough space in the mft "
1003 "record for the extended attribute "
1004 "record. This case is not "
1005 "implemented yet.");
1006 err = -EOPNOTSUPP;
1007 break ;
1008 }
1009 status.mp_rebuilt = 1;
1010 /*
1011 * Generate the mapping pairs array directly into the attribute
1012 * record.
1013 */
1014 err = ntfs_mapping_pairs_build(vol, (u8*)a + le16_to_cpu(
1015 a->data.non_resident.mapping_pairs_offset),
1016 mp_size, rl2, vcn, highest_vcn, NULL);
1017 if (unlikely(err)) {
1018 ntfs_error(vol->sb, "Cannot fill hole in inode 0x%lx, "
1019 "attribute type 0x%x, because building "
1020 "the mapping pairs failed with error "
1021 "code %i.", vi->i_ino,
1022 (unsigned)le32_to_cpu(ni->type), err);
1023 err = -EIO;
1024 break;
1025 }
1026 /* Update the highest_vcn but only if it was not set. */
1027 if (unlikely(!a->data.non_resident.highest_vcn))
1028 a->data.non_resident.highest_vcn =
1029 cpu_to_sle64(highest_vcn);
1030 /*
1031 * If the attribute is sparse/compressed, update the compressed
1032 * size in the ntfs_inode structure and the attribute record.
1033 */
1034 if (likely(NInoSparse(ni) || NInoCompressed(ni))) {
1035 /*
1036 * If we are not in the first attribute extent, switch
1037 * to it, but first ensure the changes will make it to
1038 * disk later.
1039 */
1040 if (a->data.non_resident.lowest_vcn) {
1041 flush_dcache_mft_record_page(ctx->ntfs_ino);
1042 mark_mft_record_dirty(ctx->ntfs_ino);
1043 ntfs_attr_reinit_search_ctx(ctx);
1044 err = ntfs_attr_lookup(ni->type, ni->name,
1045 ni->name_len, CASE_SENSITIVE,
1046 0, NULL, 0, ctx);
1047 if (unlikely(err)) {
1048 status.attr_switched = 1;
1049 break;
1050 }
1051 /* @m is not used any more so do not set it. */
1052 a = ctx->attr;
1053 }
1054 write_lock_irqsave(&ni->size_lock, flags);
1055 ni->itype.compressed.size += vol->cluster_size;
1056 a->data.non_resident.compressed_size =
1057 cpu_to_sle64(ni->itype.compressed.size);
1058 write_unlock_irqrestore(&ni->size_lock, flags);
1059 }
1060 /* Ensure the changes make it to disk. */
1061 flush_dcache_mft_record_page(ctx->ntfs_ino);
1062 mark_mft_record_dirty(ctx->ntfs_ino);
1063 ntfs_attr_put_search_ctx(ctx);
1064 unmap_mft_record(base_ni);
1065 /* Successfully filled the hole. */
1066 status.runlist_merged = 0;
1067 status.mft_attr_mapped = 0;
1068 status.mp_rebuilt = 0;
1069 /* Setup the map cache and use that to deal with the buffer. */
1070 was_hole = true;
1071 vcn = bh_cpos;
1072 vcn_len = 1;
1073 lcn_block = lcn << (vol->cluster_size_bits - blocksize_bits);
1074 cdelta = 0;
1075 /*
1076 * If the number of remaining clusters in the @pages is smaller
1077 * or equal to the number of cached clusters, unlock the
1078 * runlist as the map cache will be used from now on.
1079 */
1080 if (likely(vcn + vcn_len >= cend)) {
1081 up_write(&ni->runlist.lock);
1082 rl_write_locked = false;
1083 rl = NULL;
1084 }
1085 goto map_buffer_cached;
1086 } while (bh_pos += blocksize, (bh = bh->b_this_page) != head);
1087 /* If there are no errors, do the next page. */
1088 if (likely(!err && ++u < nr_pages))
1089 goto do_next_page;
1090 /* If there are no errors, release the runlist lock if we took it. */
1091 if (likely(!err)) {
1092 if (unlikely(rl_write_locked)) {
1093 up_write(&ni->runlist.lock);
1094 rl_write_locked = false;
1095 } else if (unlikely(rl))
1096 up_read(&ni->runlist.lock);
1097 rl = NULL;
1098 }
1099 /* If we issued read requests, let them complete. */
1100 read_lock_irqsave(&ni->size_lock, flags);
1101 initialized_size = ni->initialized_size;
1102 read_unlock_irqrestore(&ni->size_lock, flags);
1103 while (wait_bh > wait) {
1104 bh = *--wait_bh;
1105 wait_on_buffer(bh);
1106 if (likely(buffer_uptodate(bh))) {
1107 page = bh->b_page;
1108 bh_pos = ((s64)page->index << PAGE_CACHE_SHIFT) +
1109 bh_offset(bh);
1110 /*
1111 * If the buffer overflows the initialized size, need
1112 * to zero the overflowing region.
1113 */
1114 if (unlikely(bh_pos + blocksize > initialized_size)) {
1115 int ofs = 0;
1116
1117 if (likely(bh_pos < initialized_size))
1118 ofs = initialized_size - bh_pos;
1119 zero_user_segment(page, bh_offset(bh) + ofs,
1120 blocksize);
1121 }
1122 } else /* if (unlikely(!buffer_uptodate(bh))) */
1123 err = -EIO;
1124 }
1125 if (likely(!err)) {
1126 /* Clear buffer_new on all buffers. */
1127 u = 0;
1128 do {
1129 bh = head = page_buffers(pages[u]);
1130 do {
1131 if (buffer_new(bh))
1132 clear_buffer_new(bh);
1133 } while ((bh = bh->b_this_page) != head);
1134 } while (++u < nr_pages);
1135 ntfs_debug("Done.");
1136 return err;
1137 }
1138 if (status.attr_switched) {
1139 /* Get back to the attribute extent we modified. */
1140 ntfs_attr_reinit_search_ctx(ctx);
1141 if (ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
1142 CASE_SENSITIVE, bh_cpos, NULL, 0, ctx)) {
1143 ntfs_error(vol->sb, "Failed to find required "
1144 "attribute extent of attribute in "
1145 "error code path. Run chkdsk to "
1146 "recover.");
1147 write_lock_irqsave(&ni->size_lock, flags);
1148 ni->itype.compressed.size += vol->cluster_size;
1149 write_unlock_irqrestore(&ni->size_lock, flags);
1150 flush_dcache_mft_record_page(ctx->ntfs_ino);
1151 mark_mft_record_dirty(ctx->ntfs_ino);
1152 /*
1153 * The only thing that is now wrong is the compressed
1154 * size of the base attribute extent which chkdsk
1155 * should be able to fix.
1156 */
1157 NVolSetErrors(vol);
1158 } else {
1159 m = ctx->mrec;
1160 a = ctx->attr;
1161 status.attr_switched = 0;
1162 }
1163 }
1164 /*
1165 * If the runlist has been modified, need to restore it by punching a
1166 * hole into it and we then need to deallocate the on-disk cluster as
1167 * well. Note, we only modify the runlist if we are able to generate a
1168 * new mapping pairs array, i.e. only when the mapped attribute extent
1169 * is not switched.
1170 */
1171 if (status.runlist_merged && !status.attr_switched) {
1172 BUG_ON(!rl_write_locked);
1173 /* Make the file cluster we allocated sparse in the runlist. */
1174 if (ntfs_rl_punch_nolock(vol, &ni->runlist, bh_cpos, 1)) {
1175 ntfs_error(vol->sb, "Failed to punch hole into "
1176 "attribute runlist in error code "
1177 "path. Run chkdsk to recover the "
1178 "lost cluster.");
1179 NVolSetErrors(vol);
1180 } else /* if (success) */ {
1181 status.runlist_merged = 0;
1182 /*
1183 * Deallocate the on-disk cluster we allocated but only
1184 * if we succeeded in punching its vcn out of the
1185 * runlist.
1186 */
1187 down_write(&vol->lcnbmp_lock);
1188 if (ntfs_bitmap_clear_bit(vol->lcnbmp_ino, lcn)) {
1189 ntfs_error(vol->sb, "Failed to release "
1190 "allocated cluster in error "
1191 "code path. Run chkdsk to "
1192 "recover the lost cluster.");
1193 NVolSetErrors(vol);
1194 }
1195 up_write(&vol->lcnbmp_lock);
1196 }
1197 }
1198 /*
1199 * Resize the attribute record to its old size and rebuild the mapping
1200 * pairs array. Note, we only can do this if the runlist has been
1201 * restored to its old state which also implies that the mapped
1202 * attribute extent is not switched.
1203 */
1204 if (status.mp_rebuilt && !status.runlist_merged) {
1205 if (ntfs_attr_record_resize(m, a, attr_rec_len)) {
1206 ntfs_error(vol->sb, "Failed to restore attribute "
1207 "record in error code path. Run "
1208 "chkdsk to recover.");
1209 NVolSetErrors(vol);
1210 } else /* if (success) */ {
1211 if (ntfs_mapping_pairs_build(vol, (u8*)a +
1212 le16_to_cpu(a->data.non_resident.
1213 mapping_pairs_offset), attr_rec_len -
1214 le16_to_cpu(a->data.non_resident.
1215 mapping_pairs_offset), ni->runlist.rl,
1216 vcn, highest_vcn, NULL)) {
1217 ntfs_error(vol->sb, "Failed to restore "
1218 "mapping pairs array in error "
1219 "code path. Run chkdsk to "
1220 "recover.");
1221 NVolSetErrors(vol);
1222 }
1223 flush_dcache_mft_record_page(ctx->ntfs_ino);
1224 mark_mft_record_dirty(ctx->ntfs_ino);
1225 }
1226 }
1227 /* Release the mft record and the attribute. */
1228 if (status.mft_attr_mapped) {
1229 ntfs_attr_put_search_ctx(ctx);
1230 unmap_mft_record(base_ni);
1231 }
1232 /* Release the runlist lock. */
1233 if (rl_write_locked)
1234 up_write(&ni->runlist.lock);
1235 else if (rl)
1236 up_read(&ni->runlist.lock);
1237 /*
1238 * Zero out any newly allocated blocks to avoid exposing stale data.
1239 * If BH_New is set, we know that the block was newly allocated above
1240 * and that it has not been fully zeroed and marked dirty yet.
1241 */
1242 nr_pages = u;
1243 u = 0;
1244 end = bh_cpos << vol->cluster_size_bits;
1245 do {
1246 page = pages[u];
1247 bh = head = page_buffers(page);
1248 do {
1249 if (u == nr_pages &&
1250 ((s64)page->index << PAGE_CACHE_SHIFT) +
1251 bh_offset(bh) >= end)
1252 break;
1253 if (!buffer_new(bh))
1254 continue;
1255 clear_buffer_new(bh);
1256 if (!buffer_uptodate(bh)) {
1257 if (PageUptodate(page))
1258 set_buffer_uptodate(bh);
1259 else {
1260 zero_user(page, bh_offset(bh),
1261 blocksize);
1262 set_buffer_uptodate(bh);
1263 }
1264 }
1265 mark_buffer_dirty(bh);
1266 } while ((bh = bh->b_this_page) != head);
1267 } while (++u <= nr_pages);
1268 ntfs_error(vol->sb, "Failed. Returning error code %i.", err);
1269 return err;
1270 }
1271
1272 /*
1273 * Copy as much as we can into the pages and return the number of bytes which
1274 * were successfully copied. If a fault is encountered then clear the pages
1275 * out to (ofs + bytes) and return the number of bytes which were copied.
1276 */
1277 static inline size_t ntfs_copy_from_user(struct page **pages,
1278 unsigned nr_pages, unsigned ofs, const char __user *buf,
1279 size_t bytes)
1280 {
1281 struct page **last_page = pages + nr_pages;
1282 char *addr;
1283 size_t total = 0;
1284 unsigned len;
1285 int left;
1286
1287 do {
1288 len = PAGE_CACHE_SIZE - ofs;
1289 if (len > bytes)
1290 len = bytes;
1291 addr = kmap_atomic(*pages);
1292 left = __copy_from_user_inatomic(addr + ofs, buf, len);
1293 kunmap_atomic(addr);
1294 if (unlikely(left)) {
1295 /* Do it the slow way. */
1296 addr = kmap(*pages);
1297 left = __copy_from_user(addr + ofs, buf, len);
1298 kunmap(*pages);
1299 if (unlikely(left))
1300 goto err_out;
1301 }
1302 total += len;
1303 bytes -= len;
1304 if (!bytes)
1305 break;
1306 buf += len;
1307 ofs = 0;
1308 } while (++pages < last_page);
1309 out:
1310 return total;
1311 err_out:
1312 total += len - left;
1313 /* Zero the rest of the target like __copy_from_user(). */
1314 while (++pages < last_page) {
1315 bytes -= len;
1316 if (!bytes)
1317 break;
1318 len = PAGE_CACHE_SIZE;
1319 if (len > bytes)
1320 len = bytes;
1321 zero_user(*pages, 0, len);
1322 }
1323 goto out;
1324 }
1325
1326 static size_t __ntfs_copy_from_user_iovec_inatomic(char *vaddr,
1327 const struct iovec *iov, size_t iov_ofs, size_t bytes)
1328 {
1329 size_t total = 0;
1330
1331 while (1) {
1332 const char __user *buf = iov->iov_base + iov_ofs;
1333 unsigned len;
1334 size_t left;
1335
1336 len = iov->iov_len - iov_ofs;
1337 if (len > bytes)
1338 len = bytes;
1339 left = __copy_from_user_inatomic(vaddr, buf, len);
1340 total += len;
1341 bytes -= len;
1342 vaddr += len;
1343 if (unlikely(left)) {
1344 total -= left;
1345 break;
1346 }
1347 if (!bytes)
1348 break;
1349 iov++;
1350 iov_ofs = 0;
1351 }
1352 return total;
1353 }
1354
1355 static inline void ntfs_set_next_iovec(const struct iovec **iovp,
1356 size_t *iov_ofsp, size_t bytes)
1357 {
1358 const struct iovec *iov = *iovp;
1359 size_t iov_ofs = *iov_ofsp;
1360
1361 while (bytes) {
1362 unsigned len;
1363
1364 len = iov->iov_len - iov_ofs;
1365 if (len > bytes)
1366 len = bytes;
1367 bytes -= len;
1368 iov_ofs += len;
1369 if (iov->iov_len == iov_ofs) {
1370 iov++;
1371 iov_ofs = 0;
1372 }
1373 }
1374 *iovp = iov;
1375 *iov_ofsp = iov_ofs;
1376 }
1377
1378 /*
1379 * This has the same side-effects and return value as ntfs_copy_from_user().
1380 * The difference is that on a fault we need to memset the remainder of the
1381 * pages (out to offset + bytes), to emulate ntfs_copy_from_user()'s
1382 * single-segment behaviour.
1383 *
1384 * We call the same helper (__ntfs_copy_from_user_iovec_inatomic()) both when
1385 * atomic and when not atomic. This is ok because it calls
1386 * __copy_from_user_inatomic() and it is ok to call this when non-atomic. In
1387 * fact, the only difference between __copy_from_user_inatomic() and
1388 * __copy_from_user() is that the latter calls might_sleep() and the former
1389 * should not zero the tail of the buffer on error. And on many architectures
1390 * __copy_from_user_inatomic() is just defined to __copy_from_user() so it
1391 * makes no difference at all on those architectures.
1392 */
1393 static inline size_t ntfs_copy_from_user_iovec(struct page **pages,
1394 unsigned nr_pages, unsigned ofs, const struct iovec **iov,
1395 size_t *iov_ofs, size_t bytes)
1396 {
1397 struct page **last_page = pages + nr_pages;
1398 char *addr;
1399 size_t copied, len, total = 0;
1400
1401 do {
1402 len = PAGE_CACHE_SIZE - ofs;
1403 if (len > bytes)
1404 len = bytes;
1405 addr = kmap_atomic(*pages);
1406 copied = __ntfs_copy_from_user_iovec_inatomic(addr + ofs,
1407 *iov, *iov_ofs, len);
1408 kunmap_atomic(addr);
1409 if (unlikely(copied != len)) {
1410 /* Do it the slow way. */
1411 addr = kmap(*pages);
1412 copied = __ntfs_copy_from_user_iovec_inatomic(addr +
1413 ofs, *iov, *iov_ofs, len);
1414 if (unlikely(copied != len))
1415 goto err_out;
1416 kunmap(*pages);
1417 }
1418 total += len;
1419 ntfs_set_next_iovec(iov, iov_ofs, len);
1420 bytes -= len;
1421 if (!bytes)
1422 break;
1423 ofs = 0;
1424 } while (++pages < last_page);
1425 out:
1426 return total;
1427 err_out:
1428 BUG_ON(copied > len);
1429 /* Zero the rest of the target like __copy_from_user(). */
1430 memset(addr + ofs + copied, 0, len - copied);
1431 kunmap(*pages);
1432 total += copied;
1433 ntfs_set_next_iovec(iov, iov_ofs, copied);
1434 while (++pages < last_page) {
1435 bytes -= len;
1436 if (!bytes)
1437 break;
1438 len = PAGE_CACHE_SIZE;
1439 if (len > bytes)
1440 len = bytes;
1441 zero_user(*pages, 0, len);
1442 }
1443 goto out;
1444 }
1445 1445
1446 static inline void ntfs_flush_dcache_pages(struct page **pages, 1446 static inline void ntfs_flush_dcache_pages(struct page **pages,
1447 unsigned nr_pages) 1447 unsigned nr_pages)
1448 { 1448 {
1449 BUG_ON(!nr_pages); 1449 BUG_ON(!nr_pages);
1450 /* 1450 /*
1451 * Warning: Do not do the decrement at the same time as the call to 1451 * Warning: Do not do the decrement at the same time as the call to
1452 * flush_dcache_page() because it is a NULL macro on i386 and hence the 1452 * flush_dcache_page() because it is a NULL macro on i386 and hence the
1453 * decrement never happens so the loop never terminates. 1453 * decrement never happens so the loop never terminates.
1454 */ 1454 */
1455 do { 1455 do {
1456 --nr_pages; 1456 --nr_pages;
1457 flush_dcache_page(pages[nr_pages]); 1457 flush_dcache_page(pages[nr_pages]);
1458 } while (nr_pages > 0); 1458 } while (nr_pages > 0);
1459 } 1459 }
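
The warning in the comment above is worth spelling out: on configurations where flush_dcache_page() expands to an empty macro, its argument expression is discarded at preprocessing time, so any side effect hidden inside the argument disappears with it. The snippet below is a deliberately broken illustration of that hazard, with a no-op macro defined locally purely for demonstration; it is not meant to be built into the kernel.

	/* Deliberately broken illustration of the hazard described above. */
	#define flush_dcache_page(page) do { } while (0)	/* no-op variant */

	static void flush_pages_broken(struct page **pages, unsigned nr_pages)
	{
		do {
			/*
			 * BUG: the no-op macro discards its argument, so
			 * --nr_pages is never evaluated and this loop never
			 * terminates. Keeping the decrement as a separate
			 * statement, as above, makes it unconditional.
			 */
			flush_dcache_page(pages[--nr_pages]);
		} while (nr_pages > 0);
	}
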
1460 1460
1461 /** 1461 /**
1462 * ntfs_commit_pages_after_non_resident_write - commit the received data 1462 * ntfs_commit_pages_after_non_resident_write - commit the received data
1463 * @pages: array of destination pages 1463 * @pages: array of destination pages
1464 * @nr_pages: number of pages in @pages 1464 * @nr_pages: number of pages in @pages
1465 * @pos: byte position in file at which the write begins 1465 * @pos: byte position in file at which the write begins
1466 * @bytes: number of bytes to be written 1466 * @bytes: number of bytes to be written
1467 * 1467 *
1468 * See description of ntfs_commit_pages_after_write(), below. 1468 * See description of ntfs_commit_pages_after_write(), below.
1469 */ 1469 */
1470 static inline int ntfs_commit_pages_after_non_resident_write( 1470 static inline int ntfs_commit_pages_after_non_resident_write(
1471 struct page **pages, const unsigned nr_pages, 1471 struct page **pages, const unsigned nr_pages,
1472 s64 pos, size_t bytes) 1472 s64 pos, size_t bytes)
1473 { 1473 {
1474 s64 end, initialized_size; 1474 s64 end, initialized_size;
1475 struct inode *vi; 1475 struct inode *vi;
1476 ntfs_inode *ni, *base_ni; 1476 ntfs_inode *ni, *base_ni;
1477 struct buffer_head *bh, *head; 1477 struct buffer_head *bh, *head;
1478 ntfs_attr_search_ctx *ctx; 1478 ntfs_attr_search_ctx *ctx;
1479 MFT_RECORD *m; 1479 MFT_RECORD *m;
1480 ATTR_RECORD *a; 1480 ATTR_RECORD *a;
1481 unsigned long flags; 1481 unsigned long flags;
1482 unsigned blocksize, u; 1482 unsigned blocksize, u;
1483 int err; 1483 int err;
1484 1484
1485 vi = pages[0]->mapping->host; 1485 vi = pages[0]->mapping->host;
1486 ni = NTFS_I(vi); 1486 ni = NTFS_I(vi);
1487 blocksize = vi->i_sb->s_blocksize; 1487 blocksize = vi->i_sb->s_blocksize;
1488 end = pos + bytes; 1488 end = pos + bytes;
1489 u = 0; 1489 u = 0;
1490 do { 1490 do {
1491 s64 bh_pos; 1491 s64 bh_pos;
1492 struct page *page; 1492 struct page *page;
1493 bool partial; 1493 bool partial;
1494 1494
1495 page = pages[u]; 1495 page = pages[u];
1496 bh_pos = (s64)page->index << PAGE_CACHE_SHIFT; 1496 bh_pos = (s64)page->index << PAGE_CACHE_SHIFT;
1497 bh = head = page_buffers(page); 1497 bh = head = page_buffers(page);
1498 partial = false; 1498 partial = false;
1499 do { 1499 do {
1500 s64 bh_end; 1500 s64 bh_end;
1501 1501
1502 bh_end = bh_pos + blocksize; 1502 bh_end = bh_pos + blocksize;
1503 if (bh_end <= pos || bh_pos >= end) { 1503 if (bh_end <= pos || bh_pos >= end) {
1504 if (!buffer_uptodate(bh)) 1504 if (!buffer_uptodate(bh))
1505 partial = true; 1505 partial = true;
1506 } else { 1506 } else {
1507 set_buffer_uptodate(bh); 1507 set_buffer_uptodate(bh);
1508 mark_buffer_dirty(bh); 1508 mark_buffer_dirty(bh);
1509 } 1509 }
1510 } while (bh_pos += blocksize, (bh = bh->b_this_page) != head); 1510 } while (bh_pos += blocksize, (bh = bh->b_this_page) != head);
1511 /* 1511 /*
1512 * If all buffers are now uptodate but the page is not, set the 1512 * If all buffers are now uptodate but the page is not, set the
1513 * page uptodate. 1513 * page uptodate.
1514 */ 1514 */
1515 if (!partial && !PageUptodate(page)) 1515 if (!partial && !PageUptodate(page))
1516 SetPageUptodate(page); 1516 SetPageUptodate(page);
1517 } while (++u < nr_pages); 1517 } while (++u < nr_pages);
1518 /* 1518 /*
1519 * Finally, if we do not need to update initialized_size or i_size we 1519 * Finally, if we do not need to update initialized_size or i_size we
1520 * are finished. 1520 * are finished.
1521 */ 1521 */
1522 read_lock_irqsave(&ni->size_lock, flags); 1522 read_lock_irqsave(&ni->size_lock, flags);
1523 initialized_size = ni->initialized_size; 1523 initialized_size = ni->initialized_size;
1524 read_unlock_irqrestore(&ni->size_lock, flags); 1524 read_unlock_irqrestore(&ni->size_lock, flags);
1525 if (end <= initialized_size) { 1525 if (end <= initialized_size) {
1526 ntfs_debug("Done."); 1526 ntfs_debug("Done.");
1527 return 0; 1527 return 0;
1528 } 1528 }
1529 /* 1529 /*
1530 * Update initialized_size/i_size as appropriate, both in the inode and 1530 * Update initialized_size/i_size as appropriate, both in the inode and
1531 * the mft record. 1531 * the mft record.
1532 */ 1532 */
1533 if (!NInoAttr(ni)) 1533 if (!NInoAttr(ni))
1534 base_ni = ni; 1534 base_ni = ni;
1535 else 1535 else
1536 base_ni = ni->ext.base_ntfs_ino; 1536 base_ni = ni->ext.base_ntfs_ino;
1537 /* Map, pin, and lock the mft record. */ 1537 /* Map, pin, and lock the mft record. */
1538 m = map_mft_record(base_ni); 1538 m = map_mft_record(base_ni);
1539 if (IS_ERR(m)) { 1539 if (IS_ERR(m)) {
1540 err = PTR_ERR(m); 1540 err = PTR_ERR(m);
1541 m = NULL; 1541 m = NULL;
1542 ctx = NULL; 1542 ctx = NULL;
1543 goto err_out; 1543 goto err_out;
1544 } 1544 }
1545 BUG_ON(!NInoNonResident(ni)); 1545 BUG_ON(!NInoNonResident(ni));
1546 ctx = ntfs_attr_get_search_ctx(base_ni, m); 1546 ctx = ntfs_attr_get_search_ctx(base_ni, m);
1547 if (unlikely(!ctx)) { 1547 if (unlikely(!ctx)) {
1548 err = -ENOMEM; 1548 err = -ENOMEM;
1549 goto err_out; 1549 goto err_out;
1550 } 1550 }
1551 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, 1551 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
1552 CASE_SENSITIVE, 0, NULL, 0, ctx); 1552 CASE_SENSITIVE, 0, NULL, 0, ctx);
1553 if (unlikely(err)) { 1553 if (unlikely(err)) {
1554 if (err == -ENOENT) 1554 if (err == -ENOENT)
1555 err = -EIO; 1555 err = -EIO;
1556 goto err_out; 1556 goto err_out;
1557 } 1557 }
1558 a = ctx->attr; 1558 a = ctx->attr;
1559 BUG_ON(!a->non_resident); 1559 BUG_ON(!a->non_resident);
1560 write_lock_irqsave(&ni->size_lock, flags); 1560 write_lock_irqsave(&ni->size_lock, flags);
1561 BUG_ON(end > ni->allocated_size); 1561 BUG_ON(end > ni->allocated_size);
1562 ni->initialized_size = end; 1562 ni->initialized_size = end;
1563 a->data.non_resident.initialized_size = cpu_to_sle64(end); 1563 a->data.non_resident.initialized_size = cpu_to_sle64(end);
1564 if (end > i_size_read(vi)) { 1564 if (end > i_size_read(vi)) {
1565 i_size_write(vi, end); 1565 i_size_write(vi, end);
1566 a->data.non_resident.data_size = 1566 a->data.non_resident.data_size =
1567 a->data.non_resident.initialized_size; 1567 a->data.non_resident.initialized_size;
1568 } 1568 }
1569 write_unlock_irqrestore(&ni->size_lock, flags); 1569 write_unlock_irqrestore(&ni->size_lock, flags);
1570 /* Mark the mft record dirty, so it gets written back. */ 1570 /* Mark the mft record dirty, so it gets written back. */
1571 flush_dcache_mft_record_page(ctx->ntfs_ino); 1571 flush_dcache_mft_record_page(ctx->ntfs_ino);
1572 mark_mft_record_dirty(ctx->ntfs_ino); 1572 mark_mft_record_dirty(ctx->ntfs_ino);
1573 ntfs_attr_put_search_ctx(ctx); 1573 ntfs_attr_put_search_ctx(ctx);
1574 unmap_mft_record(base_ni); 1574 unmap_mft_record(base_ni);
1575 ntfs_debug("Done."); 1575 ntfs_debug("Done.");
1576 return 0; 1576 return 0;
1577 err_out: 1577 err_out:
1578 if (ctx) 1578 if (ctx)
1579 ntfs_attr_put_search_ctx(ctx); 1579 ntfs_attr_put_search_ctx(ctx);
1580 if (m) 1580 if (m)
1581 unmap_mft_record(base_ni); 1581 unmap_mft_record(base_ni);
1582 ntfs_error(vi->i_sb, "Failed to update initialized_size/i_size (error " 1582 ntfs_error(vi->i_sb, "Failed to update initialized_size/i_size (error "
1583 "code %i).", err); 1583 "code %i).", err);
1584 if (err != -ENOMEM) 1584 if (err != -ENOMEM)
1585 NVolSetErrors(ni->vol); 1585 NVolSetErrors(ni->vol);
1586 return err; 1586 return err;
1587 } 1587 }
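
The inner buffer loop above decides, per buffer head, whether the buffer lies inside the byte range being written. Expressed positively, a buffer covering [bh_pos, bh_pos + blocksize) is touched by a write to [pos, end) exactly when the two half-open intervals overlap; the loop open-codes the negation of that test. A hypothetical helper making the condition explicit:

	/* Hypothetical helper, for illustration only (not in the patch). */
	static inline bool bh_in_write_range(s64 bh_pos, unsigned blocksize,
			s64 pos, s64 end)
	{
		/* Overlap of [bh_pos, bh_pos + blocksize) with [pos, end). */
		return bh_pos + blocksize > pos && bh_pos < end;
	}
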
1588 1588
1589 /** 1589 /**
1590 * ntfs_commit_pages_after_write - commit the received data 1590 * ntfs_commit_pages_after_write - commit the received data
1591 * @pages: array of destination pages 1591 * @pages: array of destination pages
1592 * @nr_pages: number of pages in @pages 1592 * @nr_pages: number of pages in @pages
1593 * @pos: byte position in file at which the write begins 1593 * @pos: byte position in file at which the write begins
1594 * @bytes: number of bytes to be written 1594 * @bytes: number of bytes to be written
1595 * 1595 *
1596 * This is called from ntfs_file_buffered_write() with i_mutex held on the inode 1596 * This is called from ntfs_file_buffered_write() with i_mutex held on the inode
1597 * (@pages[0]->mapping->host). There are @nr_pages pages in @pages which are 1597 * (@pages[0]->mapping->host). There are @nr_pages pages in @pages which are
1598 * locked but not kmap()ped. The source data has already been copied into the 1598 * locked but not kmap()ped. The source data has already been copied into the
1599 * @page. ntfs_prepare_pages_for_non_resident_write() has been called before 1599 * @page. ntfs_prepare_pages_for_non_resident_write() has been called before
1600 * the data was copied (for non-resident attributes only) and it returned 1600 * the data was copied (for non-resident attributes only) and it returned
1601 * success. 1601 * success.
1602 * 1602 *
1603 * Need to set uptodate and mark dirty all buffers within the boundary of the 1603 * Need to set uptodate and mark dirty all buffers within the boundary of the
1604 * write. If all buffers in a page are uptodate we set the page uptodate, too. 1604 * write. If all buffers in a page are uptodate we set the page uptodate, too.
1605 * 1605 *
1606 * Setting the buffers dirty ensures that they get written out later when 1606 * Setting the buffers dirty ensures that they get written out later when
1607 * ntfs_writepage() is invoked by the VM. 1607 * ntfs_writepage() is invoked by the VM.
1608 * 1608 *
1609 * Finally, we need to update i_size and initialized_size as appropriate both 1609 * Finally, we need to update i_size and initialized_size as appropriate both
1610 * in the inode and the mft record. 1610 * in the inode and the mft record.
1611 * 1611 *
1612 * This is modelled after fs/buffer.c::generic_commit_write(), which marks 1612 * This is modelled after fs/buffer.c::generic_commit_write(), which marks
1613 * buffers uptodate and dirty, sets the page uptodate if all buffers in the 1613 * buffers uptodate and dirty, sets the page uptodate if all buffers in the
1614 * page are uptodate, and updates i_size if the end of io is beyond i_size. In 1614 * page are uptodate, and updates i_size if the end of io is beyond i_size. In
1615 * that case, it also marks the inode dirty. 1615 * that case, it also marks the inode dirty.
1616 * 1616 *
1617 * If things have gone as outlined in 1617 * If things have gone as outlined in
1618 * ntfs_prepare_pages_for_non_resident_write(), we do not need to do any page 1618 * ntfs_prepare_pages_for_non_resident_write(), we do not need to do any page
1619 * content modifications here for non-resident attributes. For resident 1619 * content modifications here for non-resident attributes. For resident
1620 * attributes we need to do the uptodate bringing here which we combine with 1620 * attributes we need to do the uptodate bringing here which we combine with
1621 * the copying into the mft record which means we save one atomic kmap. 1621 * the copying into the mft record which means we save one atomic kmap.
1622 * 1622 *
1623 * Return 0 on success or -errno on error. 1623 * Return 0 on success or -errno on error.
1624 */ 1624 */
1625 static int ntfs_commit_pages_after_write(struct page **pages, 1625 static int ntfs_commit_pages_after_write(struct page **pages,
1626 const unsigned nr_pages, s64 pos, size_t bytes) 1626 const unsigned nr_pages, s64 pos, size_t bytes)
1627 { 1627 {
1628 s64 end, initialized_size; 1628 s64 end, initialized_size;
1629 loff_t i_size; 1629 loff_t i_size;
1630 struct inode *vi; 1630 struct inode *vi;
1631 ntfs_inode *ni, *base_ni; 1631 ntfs_inode *ni, *base_ni;
1632 struct page *page; 1632 struct page *page;
1633 ntfs_attr_search_ctx *ctx; 1633 ntfs_attr_search_ctx *ctx;
1634 MFT_RECORD *m; 1634 MFT_RECORD *m;
1635 ATTR_RECORD *a; 1635 ATTR_RECORD *a;
1636 char *kattr, *kaddr; 1636 char *kattr, *kaddr;
1637 unsigned long flags; 1637 unsigned long flags;
1638 u32 attr_len; 1638 u32 attr_len;
1639 int err; 1639 int err;
1640 1640
1641 BUG_ON(!nr_pages); 1641 BUG_ON(!nr_pages);
1642 BUG_ON(!pages); 1642 BUG_ON(!pages);
1643 page = pages[0]; 1643 page = pages[0];
1644 BUG_ON(!page); 1644 BUG_ON(!page);
1645 vi = page->mapping->host; 1645 vi = page->mapping->host;
1646 ni = NTFS_I(vi); 1646 ni = NTFS_I(vi);
1647 ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, start page " 1647 ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, start page "
1648 "index 0x%lx, nr_pages 0x%x, pos 0x%llx, bytes 0x%zx.", 1648 "index 0x%lx, nr_pages 0x%x, pos 0x%llx, bytes 0x%zx.",
1649 vi->i_ino, ni->type, page->index, nr_pages, 1649 vi->i_ino, ni->type, page->index, nr_pages,
1650 (long long)pos, bytes); 1650 (long long)pos, bytes);
1651 if (NInoNonResident(ni)) 1651 if (NInoNonResident(ni))
1652 return ntfs_commit_pages_after_non_resident_write(pages, 1652 return ntfs_commit_pages_after_non_resident_write(pages,
1653 nr_pages, pos, bytes); 1653 nr_pages, pos, bytes);
1654 BUG_ON(nr_pages > 1); 1654 BUG_ON(nr_pages > 1);
1655 /* 1655 /*
1656 * Attribute is resident, implying it is not compressed, encrypted, or 1656 * Attribute is resident, implying it is not compressed, encrypted, or
1657 * sparse. 1657 * sparse.
1658 */ 1658 */
1659 if (!NInoAttr(ni)) 1659 if (!NInoAttr(ni))
1660 base_ni = ni; 1660 base_ni = ni;
1661 else 1661 else
1662 base_ni = ni->ext.base_ntfs_ino; 1662 base_ni = ni->ext.base_ntfs_ino;
1663 BUG_ON(NInoNonResident(ni)); 1663 BUG_ON(NInoNonResident(ni));
1664 /* Map, pin, and lock the mft record. */ 1664 /* Map, pin, and lock the mft record. */
1665 m = map_mft_record(base_ni); 1665 m = map_mft_record(base_ni);
1666 if (IS_ERR(m)) { 1666 if (IS_ERR(m)) {
1667 err = PTR_ERR(m); 1667 err = PTR_ERR(m);
1668 m = NULL; 1668 m = NULL;
1669 ctx = NULL; 1669 ctx = NULL;
1670 goto err_out; 1670 goto err_out;
1671 } 1671 }
1672 ctx = ntfs_attr_get_search_ctx(base_ni, m); 1672 ctx = ntfs_attr_get_search_ctx(base_ni, m);
1673 if (unlikely(!ctx)) { 1673 if (unlikely(!ctx)) {
1674 err = -ENOMEM; 1674 err = -ENOMEM;
1675 goto err_out; 1675 goto err_out;
1676 } 1676 }
1677 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, 1677 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
1678 CASE_SENSITIVE, 0, NULL, 0, ctx); 1678 CASE_SENSITIVE, 0, NULL, 0, ctx);
1679 if (unlikely(err)) { 1679 if (unlikely(err)) {
1680 if (err == -ENOENT) 1680 if (err == -ENOENT)
1681 err = -EIO; 1681 err = -EIO;
1682 goto err_out; 1682 goto err_out;
1683 } 1683 }
1684 a = ctx->attr; 1684 a = ctx->attr;
1685 BUG_ON(a->non_resident); 1685 BUG_ON(a->non_resident);
1686 /* The total length of the attribute value. */ 1686 /* The total length of the attribute value. */
1687 attr_len = le32_to_cpu(a->data.resident.value_length); 1687 attr_len = le32_to_cpu(a->data.resident.value_length);
1688 i_size = i_size_read(vi); 1688 i_size = i_size_read(vi);
1689 BUG_ON(attr_len != i_size); 1689 BUG_ON(attr_len != i_size);
1690 BUG_ON(pos > attr_len); 1690 BUG_ON(pos > attr_len);
1691 end = pos + bytes; 1691 end = pos + bytes;
1692 BUG_ON(end > le32_to_cpu(a->length) - 1692 BUG_ON(end > le32_to_cpu(a->length) -
1693 le16_to_cpu(a->data.resident.value_offset)); 1693 le16_to_cpu(a->data.resident.value_offset));
1694 kattr = (u8*)a + le16_to_cpu(a->data.resident.value_offset); 1694 kattr = (u8*)a + le16_to_cpu(a->data.resident.value_offset);
1695 kaddr = kmap_atomic(page); 1695 kaddr = kmap_atomic(page);
1696 /* Copy the received data from the page to the mft record. */ 1696 /* Copy the received data from the page to the mft record. */
1697 memcpy(kattr + pos, kaddr + pos, bytes); 1697 memcpy(kattr + pos, kaddr + pos, bytes);
1698 /* Update the attribute length if necessary. */ 1698 /* Update the attribute length if necessary. */
1699 if (end > attr_len) { 1699 if (end > attr_len) {
1700 attr_len = end; 1700 attr_len = end;
1701 a->data.resident.value_length = cpu_to_le32(attr_len); 1701 a->data.resident.value_length = cpu_to_le32(attr_len);
1702 } 1702 }
1703 /* 1703 /*
1704 * If the page is not uptodate, bring the out of bounds area(s) 1704 * If the page is not uptodate, bring the out of bounds area(s)
1705 * uptodate by copying data from the mft record to the page. 1705 * uptodate by copying data from the mft record to the page.
1706 */ 1706 */
1707 if (!PageUptodate(page)) { 1707 if (!PageUptodate(page)) {
1708 if (pos > 0) 1708 if (pos > 0)
1709 memcpy(kaddr, kattr, pos); 1709 memcpy(kaddr, kattr, pos);
1710 if (end < attr_len) 1710 if (end < attr_len)
1711 memcpy(kaddr + end, kattr + end, attr_len - end); 1711 memcpy(kaddr + end, kattr + end, attr_len - end);
1712 /* Zero the region outside the end of the attribute value. */ 1712 /* Zero the region outside the end of the attribute value. */
1713 memset(kaddr + attr_len, 0, PAGE_CACHE_SIZE - attr_len); 1713 memset(kaddr + attr_len, 0, PAGE_CACHE_SIZE - attr_len);
1714 flush_dcache_page(page); 1714 flush_dcache_page(page);
1715 SetPageUptodate(page); 1715 SetPageUptodate(page);
1716 } 1716 }
1717 kunmap_atomic(kaddr); 1717 kunmap_atomic(kaddr);
1718 /* Update initialized_size/i_size if necessary. */ 1718 /* Update initialized_size/i_size if necessary. */
1719 read_lock_irqsave(&ni->size_lock, flags); 1719 read_lock_irqsave(&ni->size_lock, flags);
1720 initialized_size = ni->initialized_size; 1720 initialized_size = ni->initialized_size;
1721 BUG_ON(end > ni->allocated_size); 1721 BUG_ON(end > ni->allocated_size);
1722 read_unlock_irqrestore(&ni->size_lock, flags); 1722 read_unlock_irqrestore(&ni->size_lock, flags);
1723 BUG_ON(initialized_size != i_size); 1723 BUG_ON(initialized_size != i_size);
1724 if (end > initialized_size) { 1724 if (end > initialized_size) {
1725 write_lock_irqsave(&ni->size_lock, flags); 1725 write_lock_irqsave(&ni->size_lock, flags);
1726 ni->initialized_size = end; 1726 ni->initialized_size = end;
1727 i_size_write(vi, end); 1727 i_size_write(vi, end);
1728 write_unlock_irqrestore(&ni->size_lock, flags); 1728 write_unlock_irqrestore(&ni->size_lock, flags);
1729 } 1729 }
1730 /* Mark the mft record dirty, so it gets written back. */ 1730 /* Mark the mft record dirty, so it gets written back. */
1731 flush_dcache_mft_record_page(ctx->ntfs_ino); 1731 flush_dcache_mft_record_page(ctx->ntfs_ino);
1732 mark_mft_record_dirty(ctx->ntfs_ino); 1732 mark_mft_record_dirty(ctx->ntfs_ino);
1733 ntfs_attr_put_search_ctx(ctx); 1733 ntfs_attr_put_search_ctx(ctx);
1734 unmap_mft_record(base_ni); 1734 unmap_mft_record(base_ni);
1735 ntfs_debug("Done."); 1735 ntfs_debug("Done.");
1736 return 0; 1736 return 0;
1737 err_out: 1737 err_out:
1738 if (err == -ENOMEM) { 1738 if (err == -ENOMEM) {
1739 ntfs_warning(vi->i_sb, "Error allocating memory required to " 1739 ntfs_warning(vi->i_sb, "Error allocating memory required to "
1740 "commit the write."); 1740 "commit the write.");
1741 if (PageUptodate(page)) { 1741 if (PageUptodate(page)) {
1742 ntfs_warning(vi->i_sb, "Page is uptodate, setting " 1742 ntfs_warning(vi->i_sb, "Page is uptodate, setting "
1743 "dirty so the write will be retried " 1743 "dirty so the write will be retried "
1744 "later on by the VM."); 1744 "later on by the VM.");
1745 /* 1745 /*
1746 * Put the page on mapping->dirty_pages, but leave its 1746 * Put the page on mapping->dirty_pages, but leave its
1747 * buffers' dirty state as-is. 1747 * buffers' dirty state as-is.
1748 */ 1748 */
1749 __set_page_dirty_nobuffers(page); 1749 __set_page_dirty_nobuffers(page);
1750 err = 0; 1750 err = 0;
1751 } else 1751 } else
1752 ntfs_error(vi->i_sb, "Page is not uptodate. Written " 1752 ntfs_error(vi->i_sb, "Page is not uptodate. Written "
1753 "data has been lost."); 1753 "data has been lost.");
1754 } else { 1754 } else {
1755 ntfs_error(vi->i_sb, "Resident attribute commit write failed " 1755 ntfs_error(vi->i_sb, "Resident attribute commit write failed "
1756 "with error %i.", err); 1756 "with error %i.", err);
1757 NVolSetErrors(ni->vol); 1757 NVolSetErrors(ni->vol);
1758 } 1758 }
1759 if (ctx) 1759 if (ctx)
1760 ntfs_attr_put_search_ctx(ctx); 1760 ntfs_attr_put_search_ctx(ctx);
1761 if (m) 1761 if (m)
1762 unmap_mft_record(base_ni); 1762 unmap_mft_record(base_ni);
1763 return err; 1763 return err;
1764 } 1764 }
1765 1765
1766 static void ntfs_write_failed(struct address_space *mapping, loff_t to) 1766 static void ntfs_write_failed(struct address_space *mapping, loff_t to)
1767 { 1767 {
1768 struct inode *inode = mapping->host; 1768 struct inode *inode = mapping->host;
1769 1769
1770 if (to > inode->i_size) { 1770 if (to > inode->i_size) {
1771 truncate_pagecache(inode, inode->i_size); 1771 truncate_pagecache(inode, inode->i_size);
1772 ntfs_truncate_vfs(inode); 1772 ntfs_truncate_vfs(inode);
1773 } 1773 }
1774 } 1774 }
1775 1775
1776 /** 1776 /**
1777 * ntfs_file_buffered_write - 1777 * ntfs_file_buffered_write -
1778 * 1778 *
1779 * Locking: The vfs is holding ->i_mutex on the inode. 1779 * Locking: The vfs is holding ->i_mutex on the inode.
1780 */ 1780 */
1781 static ssize_t ntfs_file_buffered_write(struct kiocb *iocb, 1781 static ssize_t ntfs_file_buffered_write(struct kiocb *iocb,
1782 const struct iovec *iov, unsigned long nr_segs, 1782 const struct iovec *iov, unsigned long nr_segs,
1783 loff_t pos, loff_t *ppos, size_t count) 1783 loff_t pos, loff_t *ppos, size_t count)
1784 { 1784 {
1785 struct file *file = iocb->ki_filp; 1785 struct file *file = iocb->ki_filp;
1786 struct address_space *mapping = file->f_mapping; 1786 struct address_space *mapping = file->f_mapping;
1787 struct inode *vi = mapping->host; 1787 struct inode *vi = mapping->host;
1788 ntfs_inode *ni = NTFS_I(vi); 1788 ntfs_inode *ni = NTFS_I(vi);
1789 ntfs_volume *vol = ni->vol; 1789 ntfs_volume *vol = ni->vol;
1790 struct page *pages[NTFS_MAX_PAGES_PER_CLUSTER]; 1790 struct page *pages[NTFS_MAX_PAGES_PER_CLUSTER];
1791 struct page *cached_page = NULL; 1791 struct page *cached_page = NULL;
1792 char __user *buf = NULL; 1792 char __user *buf = NULL;
1793 s64 end, ll; 1793 s64 end, ll;
1794 VCN last_vcn; 1794 VCN last_vcn;
1795 LCN lcn; 1795 LCN lcn;
1796 unsigned long flags; 1796 unsigned long flags;
1797 size_t bytes, iov_ofs = 0; /* Offset in the current iovec. */ 1797 size_t bytes, iov_ofs = 0; /* Offset in the current iovec. */
1798 ssize_t status, written; 1798 ssize_t status, written;
1799 unsigned nr_pages; 1799 unsigned nr_pages;
1800 int err; 1800 int err;
1801 1801
1802 ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, " 1802 ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, "
1803 "pos 0x%llx, count 0x%lx.", 1803 "pos 0x%llx, count 0x%lx.",
1804 vi->i_ino, (unsigned)le32_to_cpu(ni->type), 1804 vi->i_ino, (unsigned)le32_to_cpu(ni->type),
1805 (unsigned long long)pos, (unsigned long)count); 1805 (unsigned long long)pos, (unsigned long)count);
1806 if (unlikely(!count)) 1806 if (unlikely(!count))
1807 return 0; 1807 return 0;
1808 BUG_ON(NInoMstProtected(ni)); 1808 BUG_ON(NInoMstProtected(ni));
1809 /* 1809 /*
1810 * If the attribute is not an index root and it is encrypted or 1810 * If the attribute is not an index root and it is encrypted or
1811 * compressed, we cannot write to it yet. Note we need to check for 1811 * compressed, we cannot write to it yet. Note we need to check for
1812 * AT_INDEX_ALLOCATION since this is the type of both directory and 1812 * AT_INDEX_ALLOCATION since this is the type of both directory and
1813 * index inodes. 1813 * index inodes.
1814 */ 1814 */
1815 if (ni->type != AT_INDEX_ALLOCATION) { 1815 if (ni->type != AT_INDEX_ALLOCATION) {
1816 /* If file is encrypted, deny access, just like NT4. */ 1816 /* If file is encrypted, deny access, just like NT4. */
1817 if (NInoEncrypted(ni)) { 1817 if (NInoEncrypted(ni)) {
1818 /* 1818 /*
1819 * Reminder for later: Encrypted files are _always_ 1819 * Reminder for later: Encrypted files are _always_
1820 * non-resident so that the content can always be 1820 * non-resident so that the content can always be
1821 * encrypted. 1821 * encrypted.
1822 */ 1822 */
1823 ntfs_debug("Denying write access to encrypted file."); 1823 ntfs_debug("Denying write access to encrypted file.");
1824 return -EACCES; 1824 return -EACCES;
1825 } 1825 }
1826 if (NInoCompressed(ni)) { 1826 if (NInoCompressed(ni)) {
1827 /* Only unnamed $DATA attribute can be compressed. */ 1827 /* Only unnamed $DATA attribute can be compressed. */
1828 BUG_ON(ni->type != AT_DATA); 1828 BUG_ON(ni->type != AT_DATA);
1829 BUG_ON(ni->name_len); 1829 BUG_ON(ni->name_len);
1830 /* 1830 /*
1831 * Reminder for later: If resident, the data is not 1831 * Reminder for later: If resident, the data is not
1832 * actually compressed. Only on the switch to non- 1832 * actually compressed. Only on the switch to non-
1833 * resident does compression kick in. This is in 1833 * resident does compression kick in. This is in
1834 * contrast to encrypted files (see above). 1834 * contrast to encrypted files (see above).
1835 */ 1835 */
1836 ntfs_error(vi->i_sb, "Writing to compressed files is " 1836 ntfs_error(vi->i_sb, "Writing to compressed files is "
1837 "not implemented yet. Sorry."); 1837 "not implemented yet. Sorry.");
1838 return -EOPNOTSUPP; 1838 return -EOPNOTSUPP;
1839 } 1839 }
1840 } 1840 }
1841 /* 1841 /*
1842 * If a previous ntfs_truncate() failed, repeat it and abort if it 1842 * If a previous ntfs_truncate() failed, repeat it and abort if it
1843 * fails again. 1843 * fails again.
1844 */ 1844 */
1845 if (unlikely(NInoTruncateFailed(ni))) { 1845 if (unlikely(NInoTruncateFailed(ni))) {
1846 inode_dio_wait(vi); 1846 inode_dio_wait(vi);
1847 err = ntfs_truncate(vi); 1847 err = ntfs_truncate(vi);
1848 if (err || NInoTruncateFailed(ni)) { 1848 if (err || NInoTruncateFailed(ni)) {
1849 if (!err) 1849 if (!err)
1850 err = -EIO; 1850 err = -EIO;
1851 ntfs_error(vol->sb, "Cannot perform write to inode " 1851 ntfs_error(vol->sb, "Cannot perform write to inode "
1852 "0x%lx, attribute type 0x%x, because " 1852 "0x%lx, attribute type 0x%x, because "
1853 "ntfs_truncate() failed (error code " 1853 "ntfs_truncate() failed (error code "
1854 "%i).", vi->i_ino, 1854 "%i).", vi->i_ino,
1855 (unsigned)le32_to_cpu(ni->type), err); 1855 (unsigned)le32_to_cpu(ni->type), err);
1856 return err; 1856 return err;
1857 } 1857 }
1858 } 1858 }
1859 /* The first byte after the write. */ 1859 /* The first byte after the write. */
1860 end = pos + count; 1860 end = pos + count;
1861 /* 1861 /*
1862 * If the write goes beyond the allocated size, extend the allocation 1862 * If the write goes beyond the allocated size, extend the allocation
1863 * to cover the whole of the write, rounded up to the nearest cluster. 1863 * to cover the whole of the write, rounded up to the nearest cluster.
1864 */ 1864 */
1865 read_lock_irqsave(&ni->size_lock, flags); 1865 read_lock_irqsave(&ni->size_lock, flags);
1866 ll = ni->allocated_size; 1866 ll = ni->allocated_size;
1867 read_unlock_irqrestore(&ni->size_lock, flags); 1867 read_unlock_irqrestore(&ni->size_lock, flags);
1868 if (end > ll) { 1868 if (end > ll) {
1869 /* Extend the allocation without changing the data size. */ 1869 /* Extend the allocation without changing the data size. */
1870 ll = ntfs_attr_extend_allocation(ni, end, -1, pos); 1870 ll = ntfs_attr_extend_allocation(ni, end, -1, pos);
1871 if (likely(ll >= 0)) { 1871 if (likely(ll >= 0)) {
1872 BUG_ON(pos >= ll); 1872 BUG_ON(pos >= ll);
1873 /* If the extension was partial truncate the write. */ 1873 /* If the extension was partial truncate the write. */
1874 if (end > ll) { 1874 if (end > ll) {
1875 ntfs_debug("Truncating write to inode 0x%lx, " 1875 ntfs_debug("Truncating write to inode 0x%lx, "
1876 "attribute type 0x%x, because " 1876 "attribute type 0x%x, because "
1877 "the allocation was only " 1877 "the allocation was only "
1878 "partially extended.", 1878 "partially extended.",
1879 vi->i_ino, (unsigned) 1879 vi->i_ino, (unsigned)
1880 le32_to_cpu(ni->type)); 1880 le32_to_cpu(ni->type));
1881 end = ll; 1881 end = ll;
1882 count = ll - pos; 1882 count = ll - pos;
1883 } 1883 }
1884 } else { 1884 } else {
1885 err = ll; 1885 err = ll;
1886 read_lock_irqsave(&ni->size_lock, flags); 1886 read_lock_irqsave(&ni->size_lock, flags);
1887 ll = ni->allocated_size; 1887 ll = ni->allocated_size;
1888 read_unlock_irqrestore(&ni->size_lock, flags); 1888 read_unlock_irqrestore(&ni->size_lock, flags);
1889 /* Perform a partial write if possible or fail. */ 1889 /* Perform a partial write if possible or fail. */
1890 if (pos < ll) { 1890 if (pos < ll) {
1891 ntfs_debug("Truncating write to inode 0x%lx, " 1891 ntfs_debug("Truncating write to inode 0x%lx, "
1892 "attribute type 0x%x, because " 1892 "attribute type 0x%x, because "
1893 "extending the allocation " 1893 "extending the allocation "
1894 "failed (error code %i).", 1894 "failed (error code %i).",
1895 vi->i_ino, (unsigned) 1895 vi->i_ino, (unsigned)
1896 le32_to_cpu(ni->type), err); 1896 le32_to_cpu(ni->type), err);
1897 end = ll; 1897 end = ll;
1898 count = ll - pos; 1898 count = ll - pos;
1899 } else { 1899 } else {
1900 ntfs_error(vol->sb, "Cannot perform write to " 1900 ntfs_error(vol->sb, "Cannot perform write to "
1901 "inode 0x%lx, attribute type " 1901 "inode 0x%lx, attribute type "
1902 "0x%x, because extending the " 1902 "0x%x, because extending the "
1903 "allocation failed (error " 1903 "allocation failed (error "
1904 "code %i).", vi->i_ino, 1904 "code %i).", vi->i_ino,
1905 (unsigned) 1905 (unsigned)
1906 le32_to_cpu(ni->type), err); 1906 le32_to_cpu(ni->type), err);
1907 return err; 1907 return err;
1908 } 1908 }
1909 } 1909 }
1910 } 1910 }
1911 written = 0; 1911 written = 0;
1912 /* 1912 /*
1913 * If the write starts beyond the initialized size, extend it up to the 1913 * If the write starts beyond the initialized size, extend it up to the
1914 * beginning of the write and initialize all non-sparse space between 1914 * beginning of the write and initialize all non-sparse space between
1915 * the old initialized size and the new one. This automatically also 1915 * the old initialized size and the new one. This automatically also
1916 * increments the vfs inode->i_size to keep it above or equal to the 1916 * increments the vfs inode->i_size to keep it above or equal to the
1917 * initialized_size. 1917 * initialized_size.
1918 */ 1918 */
1919 read_lock_irqsave(&ni->size_lock, flags); 1919 read_lock_irqsave(&ni->size_lock, flags);
1920 ll = ni->initialized_size; 1920 ll = ni->initialized_size;
1921 read_unlock_irqrestore(&ni->size_lock, flags); 1921 read_unlock_irqrestore(&ni->size_lock, flags);
1922 if (pos > ll) { 1922 if (pos > ll) {
1923 err = ntfs_attr_extend_initialized(ni, pos); 1923 err = ntfs_attr_extend_initialized(ni, pos);
1924 if (err < 0) { 1924 if (err < 0) {
1925 ntfs_error(vol->sb, "Cannot perform write to inode " 1925 ntfs_error(vol->sb, "Cannot perform write to inode "
1926 "0x%lx, attribute type 0x%x, because " 1926 "0x%lx, attribute type 0x%x, because "
1927 "extending the initialized size " 1927 "extending the initialized size "
1928 "failed (error code %i).", vi->i_ino, 1928 "failed (error code %i).", vi->i_ino,
1929 (unsigned)le32_to_cpu(ni->type), err); 1929 (unsigned)le32_to_cpu(ni->type), err);
1930 status = err; 1930 status = err;
1931 goto err_out; 1931 goto err_out;
1932 } 1932 }
1933 } 1933 }
1934 /* 1934 /*
1935 * Determine the number of pages per cluster for non-resident 1935 * Determine the number of pages per cluster for non-resident
1936 * attributes. 1936 * attributes.
1937 */ 1937 */
1938 nr_pages = 1; 1938 nr_pages = 1;
1939 if (vol->cluster_size > PAGE_CACHE_SIZE && NInoNonResident(ni)) 1939 if (vol->cluster_size > PAGE_CACHE_SIZE && NInoNonResident(ni))
1940 nr_pages = vol->cluster_size >> PAGE_CACHE_SHIFT; 1940 nr_pages = vol->cluster_size >> PAGE_CACHE_SHIFT;
1941 /* Finally, perform the actual write. */ 1941 /* Finally, perform the actual write. */
1942 last_vcn = -1; 1942 last_vcn = -1;
1943 if (likely(nr_segs == 1)) 1943 if (likely(nr_segs == 1))
1944 buf = iov->iov_base; 1944 buf = iov->iov_base;
1945 do { 1945 do {
1946 VCN vcn; 1946 VCN vcn;
1947 pgoff_t idx, start_idx; 1947 pgoff_t idx, start_idx;
1948 unsigned ofs, do_pages, u; 1948 unsigned ofs, do_pages, u;
1949 size_t copied; 1949 size_t copied;
1950 1950
1951 start_idx = idx = pos >> PAGE_CACHE_SHIFT; 1951 start_idx = idx = pos >> PAGE_CACHE_SHIFT;
1952 ofs = pos & ~PAGE_CACHE_MASK; 1952 ofs = pos & ~PAGE_CACHE_MASK;
1953 bytes = PAGE_CACHE_SIZE - ofs; 1953 bytes = PAGE_CACHE_SIZE - ofs;
1954 do_pages = 1; 1954 do_pages = 1;
1955 if (nr_pages > 1) { 1955 if (nr_pages > 1) {
1956 vcn = pos >> vol->cluster_size_bits; 1956 vcn = pos >> vol->cluster_size_bits;
1957 if (vcn != last_vcn) { 1957 if (vcn != last_vcn) {
1958 last_vcn = vcn; 1958 last_vcn = vcn;
1959 /* 1959 /*
1960 * Get the lcn of the vcn the write is in. If 1960 * Get the lcn of the vcn the write is in. If
1961 * it is a hole, need to lock down all pages in 1961 * it is a hole, need to lock down all pages in
1962 * the cluster. 1962 * the cluster.
1963 */ 1963 */
1964 down_read(&ni->runlist.lock); 1964 down_read(&ni->runlist.lock);
1965 lcn = ntfs_attr_vcn_to_lcn_nolock(ni, pos >> 1965 lcn = ntfs_attr_vcn_to_lcn_nolock(ni, pos >>
1966 vol->cluster_size_bits, false); 1966 vol->cluster_size_bits, false);
1967 up_read(&ni->runlist.lock); 1967 up_read(&ni->runlist.lock);
1968 if (unlikely(lcn < LCN_HOLE)) { 1968 if (unlikely(lcn < LCN_HOLE)) {
1969 status = -EIO; 1969 status = -EIO;
1970 if (lcn == LCN_ENOMEM) 1970 if (lcn == LCN_ENOMEM)
1971 status = -ENOMEM; 1971 status = -ENOMEM;
1972 else 1972 else
1973 ntfs_error(vol->sb, "Cannot " 1973 ntfs_error(vol->sb, "Cannot "
1974 "perform write to " 1974 "perform write to "
1975 "inode 0x%lx, " 1975 "inode 0x%lx, "
1976 "attribute type 0x%x, " 1976 "attribute type 0x%x, "
1977 "because the attribute " 1977 "because the attribute "
1978 "is corrupt.", 1978 "is corrupt.",
1979 vi->i_ino, (unsigned) 1979 vi->i_ino, (unsigned)
1980 le32_to_cpu(ni->type)); 1980 le32_to_cpu(ni->type));
1981 break; 1981 break;
1982 } 1982 }
1983 if (lcn == LCN_HOLE) { 1983 if (lcn == LCN_HOLE) {
1984 start_idx = (pos & ~(s64) 1984 start_idx = (pos & ~(s64)
1985 vol->cluster_size_mask) 1985 vol->cluster_size_mask)
1986 >> PAGE_CACHE_SHIFT; 1986 >> PAGE_CACHE_SHIFT;
1987 bytes = vol->cluster_size - (pos & 1987 bytes = vol->cluster_size - (pos &
1988 vol->cluster_size_mask); 1988 vol->cluster_size_mask);
1989 do_pages = nr_pages; 1989 do_pages = nr_pages;
1990 } 1990 }
1991 } 1991 }
1992 } 1992 }
1993 if (bytes > count) 1993 if (bytes > count)
1994 bytes = count; 1994 bytes = count;
1995 /* 1995 /*
1996 * Bring in the user page(s) that we will copy from _first_. 1996 * Bring in the user page(s) that we will copy from _first_.
1997 * Otherwise there is a nasty deadlock on copying from the same 1997 * Otherwise there is a nasty deadlock on copying from the same
1998 * page(s) as we are writing to, without it/them being marked 1998 * page(s) as we are writing to, without it/them being marked
1999 * up-to-date. Note, at present there is nothing to stop the 1999 * up-to-date. Note, at present there is nothing to stop the
2000 * pages being swapped out between us bringing them into memory 2000 * pages being swapped out between us bringing them into memory
2001 * and doing the actual copying. 2001 * and doing the actual copying.
2002 */ 2002 */
2003 if (likely(nr_segs == 1)) 2003 if (likely(nr_segs == 1))
2004 ntfs_fault_in_pages_readable(buf, bytes); 2004 ntfs_fault_in_pages_readable(buf, bytes);
2005 else 2005 else
2006 ntfs_fault_in_pages_readable_iovec(iov, iov_ofs, bytes); 2006 ntfs_fault_in_pages_readable_iovec(iov, iov_ofs, bytes);
2007 /* Get and lock @do_pages starting at index @start_idx. */ 2007 /* Get and lock @do_pages starting at index @start_idx. */
2008 status = __ntfs_grab_cache_pages(mapping, start_idx, do_pages, 2008 status = __ntfs_grab_cache_pages(mapping, start_idx, do_pages,
2009 pages, &cached_page); 2009 pages, &cached_page);
2010 if (unlikely(status)) 2010 if (unlikely(status))
2011 break; 2011 break;
2012 /* 2012 /*
2013 * For non-resident attributes, we need to fill any holes with 2013 * For non-resident attributes, we need to fill any holes with
2014 actual clusters and ensure all buffers are mapped. We also 2014 actual clusters and ensure all buffers are mapped. We also
2015 * need to bring uptodate any buffers that are only partially 2015 * need to bring uptodate any buffers that are only partially
2016 * being written to. 2016 * being written to.
2017 */ 2017 */
2018 if (NInoNonResident(ni)) { 2018 if (NInoNonResident(ni)) {
2019 status = ntfs_prepare_pages_for_non_resident_write( 2019 status = ntfs_prepare_pages_for_non_resident_write(
2020 pages, do_pages, pos, bytes); 2020 pages, do_pages, pos, bytes);
2021 if (unlikely(status)) { 2021 if (unlikely(status)) {
2022 loff_t i_size; 2022 loff_t i_size;
2023 2023
2024 do { 2024 do {
2025 unlock_page(pages[--do_pages]); 2025 unlock_page(pages[--do_pages]);
2026 page_cache_release(pages[do_pages]); 2026 page_cache_release(pages[do_pages]);
2027 } while (do_pages); 2027 } while (do_pages);
2028 /* 2028 /*
2029 * The write preparation may have instantiated 2029 * The write preparation may have instantiated
2030 * allocated space outside i_size. Trim this 2030 * allocated space outside i_size. Trim this
2031 * off again. We can ignore any errors in this 2031 * off again. We can ignore any errors in this
2032 case as we will just be wasting a bit of 2032 case as we will just be wasting a bit of
2033 * allocated space, which is not a disaster. 2033 * allocated space, which is not a disaster.
2034 */ 2034 */
2035 i_size = i_size_read(vi); 2035 i_size = i_size_read(vi);
2036 if (pos + bytes > i_size) { 2036 if (pos + bytes > i_size) {
2037 ntfs_write_failed(mapping, pos + bytes); 2037 ntfs_write_failed(mapping, pos + bytes);
2038 } 2038 }
2039 break; 2039 break;
2040 } 2040 }
2041 } 2041 }
2042 u = (pos >> PAGE_CACHE_SHIFT) - pages[0]->index; 2042 u = (pos >> PAGE_CACHE_SHIFT) - pages[0]->index;
2043 if (likely(nr_segs == 1)) { 2043 if (likely(nr_segs == 1)) {
2044 copied = ntfs_copy_from_user(pages + u, do_pages - u, 2044 copied = ntfs_copy_from_user(pages + u, do_pages - u,
2045 ofs, buf, bytes); 2045 ofs, buf, bytes);
2046 buf += copied; 2046 buf += copied;
2047 } else 2047 } else
2048 copied = ntfs_copy_from_user_iovec(pages + u, 2048 copied = ntfs_copy_from_user_iovec(pages + u,
2049 do_pages - u, ofs, &iov, &iov_ofs, 2049 do_pages - u, ofs, &iov, &iov_ofs,
2050 bytes); 2050 bytes);
2051 ntfs_flush_dcache_pages(pages + u, do_pages - u); 2051 ntfs_flush_dcache_pages(pages + u, do_pages - u);
2052 status = ntfs_commit_pages_after_write(pages, do_pages, pos, 2052 status = ntfs_commit_pages_after_write(pages, do_pages, pos,
2053 bytes); 2053 bytes);
2054 if (likely(!status)) { 2054 if (likely(!status)) {
2055 written += copied; 2055 written += copied;
2056 count -= copied; 2056 count -= copied;
2057 pos += copied; 2057 pos += copied;
2058 if (unlikely(copied != bytes)) 2058 if (unlikely(copied != bytes))
2059 status = -EFAULT; 2059 status = -EFAULT;
2060 } 2060 }
2061 do { 2061 do {
2062 unlock_page(pages[--do_pages]); 2062 unlock_page(pages[--do_pages]);
2063 mark_page_accessed(pages[do_pages]);
2064 page_cache_release(pages[do_pages]); 2063 page_cache_release(pages[do_pages]);
2065 } while (do_pages); 2064 } while (do_pages);
2066 if (unlikely(status)) 2065 if (unlikely(status))
2067 break; 2066 break;
2068 balance_dirty_pages_ratelimited(mapping); 2067 balance_dirty_pages_ratelimited(mapping);
2069 cond_resched(); 2068 cond_resched();
2070 } while (count); 2069 } while (count);
2071 err_out: 2070 err_out:
2072 *ppos = pos; 2071 *ppos = pos;
2073 if (cached_page) 2072 if (cached_page)
2074 page_cache_release(cached_page); 2073 page_cache_release(cached_page);
2075 ntfs_debug("Done. Returning %s (written 0x%lx, status %li).", 2074 ntfs_debug("Done. Returning %s (written 0x%lx, status %li).",
2076 written ? "written" : "status", (unsigned long)written, 2075 written ? "written" : "status", (unsigned long)written,
2077 (long)status); 2076 (long)status);
2078 return written ? written : status; 2077 return written ? written : status;
2079 } 2078 }
2080 2079
2081 /** 2080 /**
2082 * ntfs_file_aio_write_nolock - 2081 * ntfs_file_aio_write_nolock -
2083 */ 2082 */
2084 static ssize_t ntfs_file_aio_write_nolock(struct kiocb *iocb, 2083 static ssize_t ntfs_file_aio_write_nolock(struct kiocb *iocb,
2085 const struct iovec *iov, unsigned long nr_segs, loff_t *ppos) 2084 const struct iovec *iov, unsigned long nr_segs, loff_t *ppos)
2086 { 2085 {
2087 struct file *file = iocb->ki_filp; 2086 struct file *file = iocb->ki_filp;
2088 struct address_space *mapping = file->f_mapping; 2087 struct address_space *mapping = file->f_mapping;
2089 struct inode *inode = mapping->host; 2088 struct inode *inode = mapping->host;
2090 loff_t pos; 2089 loff_t pos;
2091 size_t count; /* after file limit checks */ 2090 size_t count; /* after file limit checks */
2092 ssize_t written, err; 2091 ssize_t written, err;
2093 2092
2094 count = 0; 2093 count = 0;
2095 err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ); 2094 err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
2096 if (err) 2095 if (err)
2097 return err; 2096 return err;
2098 pos = *ppos; 2097 pos = *ppos;
2099 /* We can write back this queue in page reclaim. */ 2098 /* We can write back this queue in page reclaim. */
2100 current->backing_dev_info = mapping->backing_dev_info; 2099 current->backing_dev_info = mapping->backing_dev_info;
2101 written = 0; 2100 written = 0;
2102 err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); 2101 err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
2103 if (err) 2102 if (err)
2104 goto out; 2103 goto out;
2105 if (!count) 2104 if (!count)
2106 goto out; 2105 goto out;
2107 err = file_remove_suid(file); 2106 err = file_remove_suid(file);
2108 if (err) 2107 if (err)
2109 goto out; 2108 goto out;
2110 err = file_update_time(file); 2109 err = file_update_time(file);
2111 if (err) 2110 if (err)
2112 goto out; 2111 goto out;
2113 written = ntfs_file_buffered_write(iocb, iov, nr_segs, pos, ppos, 2112 written = ntfs_file_buffered_write(iocb, iov, nr_segs, pos, ppos,
2114 count); 2113 count);
2115 out: 2114 out:
2116 current->backing_dev_info = NULL; 2115 current->backing_dev_info = NULL;
2117 return written ? written : err; 2116 return written ? written : err;
2118 } 2117 }
2119 2118
2120 /** 2119 /**
2121 * ntfs_file_aio_write - 2120 * ntfs_file_aio_write -
2122 */ 2121 */
2123 static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov, 2122 static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
2124 unsigned long nr_segs, loff_t pos) 2123 unsigned long nr_segs, loff_t pos)
2125 { 2124 {
2126 struct file *file = iocb->ki_filp; 2125 struct file *file = iocb->ki_filp;
2127 struct address_space *mapping = file->f_mapping; 2126 struct address_space *mapping = file->f_mapping;
2128 struct inode *inode = mapping->host; 2127 struct inode *inode = mapping->host;
2129 ssize_t ret; 2128 ssize_t ret;
2130 2129
2131 BUG_ON(iocb->ki_pos != pos); 2130 BUG_ON(iocb->ki_pos != pos);
2132 2131
2133 mutex_lock(&inode->i_mutex); 2132 mutex_lock(&inode->i_mutex);
2134 ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos); 2133 ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
2135 mutex_unlock(&inode->i_mutex); 2134 mutex_unlock(&inode->i_mutex);
2136 if (ret > 0) { 2135 if (ret > 0) {
2137 int err = generic_write_sync(file, pos, ret); 2136 int err = generic_write_sync(file, pos, ret);
2138 if (err < 0) 2137 if (err < 0)
2139 ret = err; 2138 ret = err;
2140 } 2139 }
2141 return ret; 2140 return ret;
2142 } 2141 }
2143 2142
2144 /** 2143 /**
2145 * ntfs_file_fsync - sync a file to disk 2144 * ntfs_file_fsync - sync a file to disk
2146 * @filp: file to be synced 2145 * @filp: file to be synced
2147 * @datasync: if non-zero only flush user data and not metadata 2146 * @datasync: if non-zero only flush user data and not metadata
2148 * 2147 *
2149 * Data integrity sync of a file to disk. Used for fsync, fdatasync, and msync 2148 * Data integrity sync of a file to disk. Used for fsync, fdatasync, and msync
2150 * system calls. This function is inspired by fs/buffer.c::file_fsync(). 2149 * system calls. This function is inspired by fs/buffer.c::file_fsync().
2151 * 2150 *
2152 * If @datasync is false, write the mft record and all associated extent mft 2151 * If @datasync is false, write the mft record and all associated extent mft
2153 * records as well as the $DATA attribute and then sync the block device. 2152 * records as well as the $DATA attribute and then sync the block device.
2154 * 2153 *
2155 * If @datasync is true and the attribute is non-resident, we skip the writing 2154 * If @datasync is true and the attribute is non-resident, we skip the writing
2156 * of the mft record and all associated extent mft records (this might still 2155 * of the mft record and all associated extent mft records (this might still
2157 * happen due to the write_inode_now() call). 2156 * happen due to the write_inode_now() call).
2158 * 2157 *
2159 * Also, if @datasync is true, we do not wait on the inode to be written out 2158 * Also, if @datasync is true, we do not wait on the inode to be written out
2160 * but we always wait on the page cache pages to be written out. 2159 * but we always wait on the page cache pages to be written out.
2161 * 2160 *
2162 * Locking: Caller must hold i_mutex on the inode. 2161 * Locking: Caller must hold i_mutex on the inode.
2163 * 2162 *
2164 * TODO: We should probably also write all attribute/index inodes associated 2163 * TODO: We should probably also write all attribute/index inodes associated
2165 * with this inode but since we have no simple way of getting to them we ignore 2164 * with this inode but since we have no simple way of getting to them we ignore
2166 * this problem for now. 2165 * this problem for now.
2167 */ 2166 */
2168 static int ntfs_file_fsync(struct file *filp, loff_t start, loff_t end, 2167 static int ntfs_file_fsync(struct file *filp, loff_t start, loff_t end,
2169 int datasync) 2168 int datasync)
2170 { 2169 {
2171 struct inode *vi = filp->f_mapping->host; 2170 struct inode *vi = filp->f_mapping->host;
2172 int err, ret = 0; 2171 int err, ret = 0;
2173 2172
2174 ntfs_debug("Entering for inode 0x%lx.", vi->i_ino); 2173 ntfs_debug("Entering for inode 0x%lx.", vi->i_ino);
2175 2174
2176 err = filemap_write_and_wait_range(vi->i_mapping, start, end); 2175 err = filemap_write_and_wait_range(vi->i_mapping, start, end);
2177 if (err) 2176 if (err)
2178 return err; 2177 return err;
2179 mutex_lock(&vi->i_mutex); 2178 mutex_lock(&vi->i_mutex);
2180 2179
2181 BUG_ON(S_ISDIR(vi->i_mode)); 2180 BUG_ON(S_ISDIR(vi->i_mode));
2182 if (!datasync || !NInoNonResident(NTFS_I(vi))) 2181 if (!datasync || !NInoNonResident(NTFS_I(vi)))
2183 ret = __ntfs_write_inode(vi, 1); 2182 ret = __ntfs_write_inode(vi, 1);
2184 write_inode_now(vi, !datasync); 2183 write_inode_now(vi, !datasync);
2185 /* 2184 /*
2186 * NOTE: If we were to use mapping->private_list (see ext2 and 2185 * NOTE: If we were to use mapping->private_list (see ext2 and
2187 * fs/buffer.c) for dirty blocks then we could optimize the below to be 2186 * fs/buffer.c) for dirty blocks then we could optimize the below to be
2188 * sync_mapping_buffers(vi->i_mapping). 2187 * sync_mapping_buffers(vi->i_mapping).
2189 */ 2188 */
2190 err = sync_blockdev(vi->i_sb->s_bdev); 2189 err = sync_blockdev(vi->i_sb->s_bdev);
2191 if (unlikely(err && !ret)) 2190 if (unlikely(err && !ret))
2192 ret = err; 2191 ret = err;
2193 if (likely(!ret)) 2192 if (likely(!ret))
2194 ntfs_debug("Done."); 2193 ntfs_debug("Done.");
2195 else 2194 else
2196 ntfs_warning(vi->i_sb, "Failed to f%ssync inode 0x%lx. Error " 2195 ntfs_warning(vi->i_sb, "Failed to f%ssync inode 0x%lx. Error "
2197 "%u.", datasync ? "data" : "", vi->i_ino, -ret); 2196 "%u.", datasync ? "data" : "", vi->i_ino, -ret);
2198 mutex_unlock(&vi->i_mutex); 2197 mutex_unlock(&vi->i_mutex);
2199 return ret; 2198 return ret;
2200 } 2199 }
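
For context on the @datasync distinction documented above: from user space, fsync() reaches this handler with datasync == 0 and fdatasync() with datasync == 1. A minimal user-space sketch, illustrative only ("testfile" is an arbitrary name):

	/* Illustrative user-space example, not part of the patch. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

		if (fd < 0)
			return 1;
		if (write(fd, "data", 4) != 4)
			perror("write");
		if (fsync(fd))		/* flush data and metadata */
			perror("fsync");
		if (fdatasync(fd))	/* flush data (and size updates) only */
			perror("fdatasync");
		close(fd);
		return 0;
	}
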
2201 2200
2202 #endif /* NTFS_RW */ 2201 #endif /* NTFS_RW */
2203 2202
2204 const struct file_operations ntfs_file_ops = { 2203 const struct file_operations ntfs_file_ops = {
2205 .llseek = generic_file_llseek, /* Seek inside file. */ 2204 .llseek = generic_file_llseek, /* Seek inside file. */
2206 .read = do_sync_read, /* Read from file. */ 2205 .read = do_sync_read, /* Read from file. */
2207 .aio_read = generic_file_aio_read, /* Async read from file. */ 2206 .aio_read = generic_file_aio_read, /* Async read from file. */
2208 #ifdef NTFS_RW 2207 #ifdef NTFS_RW
2209 .write = do_sync_write, /* Write to file. */ 2208 .write = do_sync_write, /* Write to file. */
2210 .aio_write = ntfs_file_aio_write, /* Async write to file. */ 2209 .aio_write = ntfs_file_aio_write, /* Async write to file. */
2211 /*.release = ,*/ /* Last file is closed. See 2210 /*.release = ,*/ /* Last file is closed. See
2212 fs/ext2/file.c:: 2211 fs/ext2/file.c::
2213 ext2_release_file() for 2212 ext2_release_file() for
2214 how to use this to discard 2213 how to use this to discard
2215 preallocated space for 2214 preallocated space for
2216 write opened files. */ 2215 write opened files. */
2217 .fsync = ntfs_file_fsync, /* Sync a file to disk. */ 2216 .fsync = ntfs_file_fsync, /* Sync a file to disk. */
2218 /*.aio_fsync = ,*/ /* Sync all outstanding async 2217 /*.aio_fsync = ,*/ /* Sync all outstanding async
2219 i/o operations on a 2218 i/o operations on a
2220 kiocb. */ 2219 kiocb. */
2221 #endif /* NTFS_RW */ 2220 #endif /* NTFS_RW */
2222 /*.ioctl = ,*/ /* Perform function on the 2221 /*.ioctl = ,*/ /* Perform function on the
2223 mounted filesystem. */ 2222 mounted filesystem. */
2224 .mmap = generic_file_mmap, /* Mmap file. */ 2223 .mmap = generic_file_mmap, /* Mmap file. */
2225 .open = ntfs_file_open, /* Open file. */ 2224 .open = ntfs_file_open, /* Open file. */
2226 .splice_read = generic_file_splice_read /* Zero-copy data send with 2225 .splice_read = generic_file_splice_read /* Zero-copy data send with
2227 the data source being on 2226 the data source being on
2228 the ntfs partition. We do 2227 the ntfs partition. We do
2229 not need to care about the 2228 not need to care about the
2230 data destination. */ 2229 data destination. */
2231 /*.sendpage = ,*/ /* Zero-copy data send with 2230 /*.sendpage = ,*/ /* Zero-copy data send with
2232 the data destination being 2231 the data destination being
2233 on the ntfs partition. We 2232 on the ntfs partition. We
2234 do not need to care about 2233 do not need to care about
2235 the data source. */ 2234 the data source. */
2236 }; 2235 };
2237 2236
2238 const struct inode_operations ntfs_file_inode_ops = { 2237 const struct inode_operations ntfs_file_inode_ops = {
2239 #ifdef NTFS_RW 2238 #ifdef NTFS_RW
2240 .setattr = ntfs_setattr, 2239 .setattr = ntfs_setattr,
2241 #endif /* NTFS_RW */ 2240 #endif /* NTFS_RW */
2242 }; 2241 };
2243 2242
2244 const struct file_operations ntfs_empty_file_ops = {}; 2243 const struct file_operations ntfs_empty_file_ops = {};
2245 2244
2246 const struct inode_operations ntfs_empty_inode_ops = {}; 2245 const struct inode_operations ntfs_empty_inode_ops = {};
2247 2246
include/linux/page-flags.h
1 /* 1 /*
2 * Macros for manipulating and testing page->flags 2 * Macros for manipulating and testing page->flags
3 */ 3 */
4 4
5 #ifndef PAGE_FLAGS_H 5 #ifndef PAGE_FLAGS_H
6 #define PAGE_FLAGS_H 6 #define PAGE_FLAGS_H
7 7
8 #include <linux/types.h> 8 #include <linux/types.h>
9 #include <linux/bug.h> 9 #include <linux/bug.h>
10 #include <linux/mmdebug.h> 10 #include <linux/mmdebug.h>
11 #ifndef __GENERATING_BOUNDS_H 11 #ifndef __GENERATING_BOUNDS_H
12 #include <linux/mm_types.h> 12 #include <linux/mm_types.h>
13 #include <generated/bounds.h> 13 #include <generated/bounds.h>
14 #endif /* !__GENERATING_BOUNDS_H */ 14 #endif /* !__GENERATING_BOUNDS_H */
15 15
16 /* 16 /*
17 * Various page->flags bits: 17 * Various page->flags bits:
18 * 18 *
19 * PG_reserved is set for special pages, which can never be swapped out. Some 19 * PG_reserved is set for special pages, which can never be swapped out. Some
20 * of them might not even exist (eg empty_bad_page)... 20 * of them might not even exist (eg empty_bad_page)...
21 * 21 *
22 * The PG_private bitflag is set on pagecache pages if they contain filesystem 22 * The PG_private bitflag is set on pagecache pages if they contain filesystem
23 * specific data (which is normally at page->private). It can be used by 23 * specific data (which is normally at page->private). It can be used by
24 * private allocations for its own usage. 24 * private allocations for its own usage.
25 * 25 *
26 * During initiation of disk I/O, PG_locked is set. This bit is set before I/O 26 * During initiation of disk I/O, PG_locked is set. This bit is set before I/O
27 * and cleared when writeback _starts_ or when read _completes_. PG_writeback 27 * and cleared when writeback _starts_ or when read _completes_. PG_writeback
28 * is set before writeback starts and cleared when it finishes. 28 * is set before writeback starts and cleared when it finishes.
29 * 29 *
30 * PG_locked also pins a page in pagecache, and blocks truncation of the file 30 * PG_locked also pins a page in pagecache, and blocks truncation of the file
31 * while it is held. 31 * while it is held.
32 * 32 *
33 * page_waitqueue(page) is a wait queue of all tasks waiting for the page 33 * page_waitqueue(page) is a wait queue of all tasks waiting for the page
34 * to become unlocked. 34 * to become unlocked.
35 * 35 *
36 * PG_uptodate tells whether the page's contents is valid. When a read 36 * PG_uptodate tells whether the page's contents is valid. When a read
37 * completes, the page becomes uptodate, unless a disk I/O error happened. 37 * completes, the page becomes uptodate, unless a disk I/O error happened.
38 * 38 *
39 * PG_referenced, PG_reclaim are used for page reclaim for anonymous and 39 * PG_referenced, PG_reclaim are used for page reclaim for anonymous and
40 * file-backed pagecache (see mm/vmscan.c). 40 * file-backed pagecache (see mm/vmscan.c).
41 * 41 *
42 * PG_error is set to indicate that an I/O error occurred on this page. 42 * PG_error is set to indicate that an I/O error occurred on this page.
43 * 43 *
44 * PG_arch_1 is an architecture specific page state bit. The generic code 44 * PG_arch_1 is an architecture specific page state bit. The generic code
45 * guarantees that this bit is cleared for a page when it first is entered into 45 * guarantees that this bit is cleared for a page when it first is entered into
46 * the page cache. 46 * the page cache.
47 * 47 *
48 * PG_highmem pages are not permanently mapped into the kernel virtual address 48 * PG_highmem pages are not permanently mapped into the kernel virtual address
49 * space, they need to be kmapped separately for doing IO on the pages. The 49 * space, they need to be kmapped separately for doing IO on the pages. The
50 * struct page (these bits with information) are always mapped into kernel 50 * struct page (these bits with information) are always mapped into kernel
51 * address space... 51 * address space...
52 * 52 *
53 * PG_hwpoison indicates that a page got corrupted in hardware and contains 53 * PG_hwpoison indicates that a page got corrupted in hardware and contains
54 * data with incorrect ECC bits that triggered a machine check. Accessing is 54 * data with incorrect ECC bits that triggered a machine check. Accessing is
55 * not safe since it may cause another machine check. Don't touch! 55 * not safe since it may cause another machine check. Don't touch!
56 */ 56 */
57 57
58 /* 58 /*
59 * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break 59 * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break
60 * locked- and dirty-page accounting. 60 * locked- and dirty-page accounting.
61 * 61 *
62 * The page flags field is split into two parts, the main flags area 62 * The page flags field is split into two parts, the main flags area
63 * which extends from the low bits upwards, and the fields area which 63 * which extends from the low bits upwards, and the fields area which
64 * extends from the high bits downwards. 64 * extends from the high bits downwards.
65 * 65 *
66 * | FIELD | ... | FLAGS | 66 * | FIELD | ... | FLAGS |
67 * N-1 ^ 0 67 * N-1 ^ 0
68 * (NR_PAGEFLAGS) 68 * (NR_PAGEFLAGS)
69 * 69 *
70 * The fields area is reserved for fields mapping zone, node (for NUMA) and 70 * The fields area is reserved for fields mapping zone, node (for NUMA) and
71 * SPARSEMEM section (for variants of SPARSEMEM that require section ids like 71 * SPARSEMEM section (for variants of SPARSEMEM that require section ids like
72 * SPARSEMEM_EXTREME with !SPARSEMEM_VMEMMAP). 72 * SPARSEMEM_EXTREME with !SPARSEMEM_VMEMMAP).
73 */ 73 */
74 enum pageflags { 74 enum pageflags {
75 PG_locked, /* Page is locked. Don't touch. */ 75 PG_locked, /* Page is locked. Don't touch. */
76 PG_error, 76 PG_error,
77 PG_referenced, 77 PG_referenced,
78 PG_uptodate, 78 PG_uptodate,
79 PG_dirty, 79 PG_dirty,
80 PG_lru, 80 PG_lru,
81 PG_active, 81 PG_active,
82 PG_slab, 82 PG_slab,
83 PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ 83 PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
84 PG_arch_1, 84 PG_arch_1,
85 PG_reserved, 85 PG_reserved,
86 PG_private, /* If pagecache, has fs-private data */ 86 PG_private, /* If pagecache, has fs-private data */
87 PG_private_2, /* If pagecache, has fs aux data */ 87 PG_private_2, /* If pagecache, has fs aux data */
88 PG_writeback, /* Page is under writeback */ 88 PG_writeback, /* Page is under writeback */
89 #ifdef CONFIG_PAGEFLAGS_EXTENDED 89 #ifdef CONFIG_PAGEFLAGS_EXTENDED
90 PG_head, /* A head page */ 90 PG_head, /* A head page */
91 PG_tail, /* A tail page */ 91 PG_tail, /* A tail page */
92 #else 92 #else
93 PG_compound, /* A compound page */ 93 PG_compound, /* A compound page */
94 #endif 94 #endif
95 PG_swapcache, /* Swap page: swp_entry_t in private */ 95 PG_swapcache, /* Swap page: swp_entry_t in private */
96 PG_mappedtodisk, /* Has blocks allocated on-disk */ 96 PG_mappedtodisk, /* Has blocks allocated on-disk */
97 PG_reclaim, /* To be reclaimed asap */ 97 PG_reclaim, /* To be reclaimed asap */
98 PG_swapbacked, /* Page is backed by RAM/swap */ 98 PG_swapbacked, /* Page is backed by RAM/swap */
99 PG_unevictable, /* Page is "unevictable" */ 99 PG_unevictable, /* Page is "unevictable" */
100 #ifdef CONFIG_MMU 100 #ifdef CONFIG_MMU
101 PG_mlocked, /* Page is vma mlocked */ 101 PG_mlocked, /* Page is vma mlocked */
102 #endif 102 #endif
103 #ifdef CONFIG_ARCH_USES_PG_UNCACHED 103 #ifdef CONFIG_ARCH_USES_PG_UNCACHED
104 PG_uncached, /* Page has been mapped as uncached */ 104 PG_uncached, /* Page has been mapped as uncached */
105 #endif 105 #endif
106 #ifdef CONFIG_MEMORY_FAILURE 106 #ifdef CONFIG_MEMORY_FAILURE
107 PG_hwpoison, /* hardware poisoned page. Don't touch */ 107 PG_hwpoison, /* hardware poisoned page. Don't touch */
108 #endif 108 #endif
109 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 109 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
110 PG_compound_lock, 110 PG_compound_lock,
111 #endif 111 #endif
112 __NR_PAGEFLAGS, 112 __NR_PAGEFLAGS,
113 113
114 /* Filesystems */ 114 /* Filesystems */
115 PG_checked = PG_owner_priv_1, 115 PG_checked = PG_owner_priv_1,
116 116
117 /* Two page bits are conscripted by FS-Cache to maintain local caching 117 /* Two page bits are conscripted by FS-Cache to maintain local caching
118 * state. These bits are set on pages belonging to the netfs's inodes 118 * state. These bits are set on pages belonging to the netfs's inodes
119 * when those inodes are being locally cached. 119 * when those inodes are being locally cached.
120 */ 120 */
121 PG_fscache = PG_private_2, /* page backed by cache */ 121 PG_fscache = PG_private_2, /* page backed by cache */
122 122
123 /* XEN */ 123 /* XEN */
124 PG_pinned = PG_owner_priv_1, 124 PG_pinned = PG_owner_priv_1,
125 PG_savepinned = PG_dirty, 125 PG_savepinned = PG_dirty,
126 126
127 /* SLOB */ 127 /* SLOB */
128 PG_slob_free = PG_private, 128 PG_slob_free = PG_private,
129 }; 129 };
130 130
131 #ifndef __GENERATING_BOUNDS_H 131 #ifndef __GENERATING_BOUNDS_H
132 132
133 /* 133 /*
134 * Macros to create function definitions for page flags 134 * Macros to create function definitions for page flags
135 */ 135 */
136 #define TESTPAGEFLAG(uname, lname) \ 136 #define TESTPAGEFLAG(uname, lname) \
137 static inline int Page##uname(const struct page *page) \ 137 static inline int Page##uname(const struct page *page) \
138 { return test_bit(PG_##lname, &page->flags); } 138 { return test_bit(PG_##lname, &page->flags); }
139 139
140 #define SETPAGEFLAG(uname, lname) \ 140 #define SETPAGEFLAG(uname, lname) \
141 static inline void SetPage##uname(struct page *page) \ 141 static inline void SetPage##uname(struct page *page) \
142 { set_bit(PG_##lname, &page->flags); } 142 { set_bit(PG_##lname, &page->flags); }
143 143
144 #define CLEARPAGEFLAG(uname, lname) \ 144 #define CLEARPAGEFLAG(uname, lname) \
145 static inline void ClearPage##uname(struct page *page) \ 145 static inline void ClearPage##uname(struct page *page) \
146 { clear_bit(PG_##lname, &page->flags); } 146 { clear_bit(PG_##lname, &page->flags); }
147 147
148 #define __SETPAGEFLAG(uname, lname) \ 148 #define __SETPAGEFLAG(uname, lname) \
149 static inline void __SetPage##uname(struct page *page) \ 149 static inline void __SetPage##uname(struct page *page) \
150 { __set_bit(PG_##lname, &page->flags); } 150 { __set_bit(PG_##lname, &page->flags); }
151 151
152 #define __CLEARPAGEFLAG(uname, lname) \ 152 #define __CLEARPAGEFLAG(uname, lname) \
153 static inline void __ClearPage##uname(struct page *page) \ 153 static inline void __ClearPage##uname(struct page *page) \
154 { __clear_bit(PG_##lname, &page->flags); } 154 { __clear_bit(PG_##lname, &page->flags); }
155 155
156 #define TESTSETFLAG(uname, lname) \ 156 #define TESTSETFLAG(uname, lname) \
157 static inline int TestSetPage##uname(struct page *page) \ 157 static inline int TestSetPage##uname(struct page *page) \
158 { return test_and_set_bit(PG_##lname, &page->flags); } 158 { return test_and_set_bit(PG_##lname, &page->flags); }
159 159
160 #define TESTCLEARFLAG(uname, lname) \ 160 #define TESTCLEARFLAG(uname, lname) \
161 static inline int TestClearPage##uname(struct page *page) \ 161 static inline int TestClearPage##uname(struct page *page) \
162 { return test_and_clear_bit(PG_##lname, &page->flags); } 162 { return test_and_clear_bit(PG_##lname, &page->flags); }
163 163
164 #define __TESTCLEARFLAG(uname, lname) \ 164 #define __TESTCLEARFLAG(uname, lname) \
165 static inline int __TestClearPage##uname(struct page *page) \ 165 static inline int __TestClearPage##uname(struct page *page) \
166 { return __test_and_clear_bit(PG_##lname, &page->flags); } 166 { return __test_and_clear_bit(PG_##lname, &page->flags); }
167 167
168 #define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ 168 #define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \
169 SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname) 169 SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname)
170 170
171 #define __PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ 171 #define __PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \
172 __SETPAGEFLAG(uname, lname) __CLEARPAGEFLAG(uname, lname) 172 __SETPAGEFLAG(uname, lname) __CLEARPAGEFLAG(uname, lname)
173 173
174 #define PAGEFLAG_FALSE(uname) \ 174 #define PAGEFLAG_FALSE(uname) \
175 static inline int Page##uname(const struct page *page) \ 175 static inline int Page##uname(const struct page *page) \
176 { return 0; } 176 { return 0; }
177 177
178 #define TESTSCFLAG(uname, lname) \ 178 #define TESTSCFLAG(uname, lname) \
179 TESTSETFLAG(uname, lname) TESTCLEARFLAG(uname, lname) 179 TESTSETFLAG(uname, lname) TESTCLEARFLAG(uname, lname)
180 180
181 #define SETPAGEFLAG_NOOP(uname) \ 181 #define SETPAGEFLAG_NOOP(uname) \
182 static inline void SetPage##uname(struct page *page) { } 182 static inline void SetPage##uname(struct page *page) { }
183 183
184 #define CLEARPAGEFLAG_NOOP(uname) \ 184 #define CLEARPAGEFLAG_NOOP(uname) \
185 static inline void ClearPage##uname(struct page *page) { } 185 static inline void ClearPage##uname(struct page *page) { }
186 186
187 #define __CLEARPAGEFLAG_NOOP(uname) \ 187 #define __CLEARPAGEFLAG_NOOP(uname) \
188 static inline void __ClearPage##uname(struct page *page) { } 188 static inline void __ClearPage##uname(struct page *page) { }
189 189
190 #define TESTCLEARFLAG_FALSE(uname) \ 190 #define TESTCLEARFLAG_FALSE(uname) \
191 static inline int TestClearPage##uname(struct page *page) { return 0; } 191 static inline int TestClearPage##uname(struct page *page) { return 0; }
192 192
193 #define __TESTCLEARFLAG_FALSE(uname) \ 193 #define __TESTCLEARFLAG_FALSE(uname) \
194 static inline int __TestClearPage##uname(struct page *page) { return 0; } 194 static inline int __TestClearPage##uname(struct page *page) { return 0; }
195 195
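To make the macro output concrete, here is a hand-expanded sketch of what PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) generates (illustration only; the real definitions come from the macros above and are not written out like this):

	/* Hand expansion of PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty),
	 * shown only to illustrate the generated helpers. */
	static inline int PageDirty(const struct page *page)
		{ return test_bit(PG_dirty, &page->flags); }
	static inline void SetPageDirty(struct page *page)
		{ set_bit(PG_dirty, &page->flags); }
	static inline void ClearPageDirty(struct page *page)
		{ clear_bit(PG_dirty, &page->flags); }
	static inline int TestSetPageDirty(struct page *page)
		{ return test_and_set_bit(PG_dirty, &page->flags); }
	static inline int TestClearPageDirty(struct page *page)
		{ return test_and_clear_bit(PG_dirty, &page->flags); }

The __SETPAGEFLAG/__CLEARPAGEFLAG variants expand the same way but use the non-atomic __set_bit()/__clear_bit(), which is exactly what the __SETPAGEFLAG(Referenced, referenced) added by this patch relies on: the bit can be set cheaply while the page is not yet visible to anyone else.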
196 struct page; /* forward declaration */ 196 struct page; /* forward declaration */
197 197
198 TESTPAGEFLAG(Locked, locked) 198 TESTPAGEFLAG(Locked, locked)
199 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error) 199 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
200 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced) 200 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
201 __SETPAGEFLAG(Referenced, referenced)
201 PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty) 202 PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
202 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru) 203 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
203 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active) 204 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
204 TESTCLEARFLAG(Active, active) 205 TESTCLEARFLAG(Active, active)
205 __PAGEFLAG(Slab, slab) 206 __PAGEFLAG(Slab, slab)
206 PAGEFLAG(Checked, checked) /* Used by some filesystems */ 207 PAGEFLAG(Checked, checked) /* Used by some filesystems */
207 PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */ 208 PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */
208 PAGEFLAG(SavePinned, savepinned); /* Xen */ 209 PAGEFLAG(SavePinned, savepinned); /* Xen */
209 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved) 210 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
210 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) 211 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
211 __SETPAGEFLAG(SwapBacked, swapbacked) 212 __SETPAGEFLAG(SwapBacked, swapbacked)
212 213
213 __PAGEFLAG(SlobFree, slob_free) 214 __PAGEFLAG(SlobFree, slob_free)
214 215
215 /* 216 /*
216 * Private page markings that may be used by the filesystem that owns the page 217 * Private page markings that may be used by the filesystem that owns the page
217 * for its own purposes. 218 * for its own purposes.
218 * - PG_private and PG_private_2 cause releasepage() and co to be invoked 219 * - PG_private and PG_private_2 cause releasepage() and co to be invoked
219 */ 220 */
220 PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private) 221 PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private)
221 __CLEARPAGEFLAG(Private, private) 222 __CLEARPAGEFLAG(Private, private)
222 PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2) 223 PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2)
223 PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1) 224 PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1)
224 225
225 /* 226 /*
226 * Only test-and-set exist for PG_writeback. The unconditional operators are 227 * Only test-and-set exist for PG_writeback. The unconditional operators are
227 * risky: they bypass page accounting. 228 * risky: they bypass page accounting.
228 */ 229 */
229 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) 230 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
230 PAGEFLAG(MappedToDisk, mappedtodisk) 231 PAGEFLAG(MappedToDisk, mappedtodisk)
231 232
232 /* PG_readahead is only used for reads; PG_reclaim is only for writes */ 233 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
233 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim) 234 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
234 PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim) 235 PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
235 236
236 #ifdef CONFIG_HIGHMEM 237 #ifdef CONFIG_HIGHMEM
237 /* 238 /*
238 * Must use a macro here due to header dependency issues. page_zone() is not 239 * Must use a macro here due to header dependency issues. page_zone() is not
239 * available at this point. 240 * available at this point.
240 */ 241 */
241 #define PageHighMem(__p) is_highmem(page_zone(__p)) 242 #define PageHighMem(__p) is_highmem(page_zone(__p))
242 #else 243 #else
243 PAGEFLAG_FALSE(HighMem) 244 PAGEFLAG_FALSE(HighMem)
244 #endif 245 #endif
245 246
246 #ifdef CONFIG_SWAP 247 #ifdef CONFIG_SWAP
247 PAGEFLAG(SwapCache, swapcache) 248 PAGEFLAG(SwapCache, swapcache)
248 #else 249 #else
249 PAGEFLAG_FALSE(SwapCache) 250 PAGEFLAG_FALSE(SwapCache)
250 SETPAGEFLAG_NOOP(SwapCache) CLEARPAGEFLAG_NOOP(SwapCache) 251 SETPAGEFLAG_NOOP(SwapCache) CLEARPAGEFLAG_NOOP(SwapCache)
251 #endif 252 #endif
252 253
253 PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable) 254 PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable)
254 TESTCLEARFLAG(Unevictable, unevictable) 255 TESTCLEARFLAG(Unevictable, unevictable)
255 256
256 #ifdef CONFIG_MMU 257 #ifdef CONFIG_MMU
257 PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked) 258 PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked)
258 TESTSCFLAG(Mlocked, mlocked) __TESTCLEARFLAG(Mlocked, mlocked) 259 TESTSCFLAG(Mlocked, mlocked) __TESTCLEARFLAG(Mlocked, mlocked)
259 #else 260 #else
260 PAGEFLAG_FALSE(Mlocked) SETPAGEFLAG_NOOP(Mlocked) 261 PAGEFLAG_FALSE(Mlocked) SETPAGEFLAG_NOOP(Mlocked)
261 TESTCLEARFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked) 262 TESTCLEARFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked)
262 #endif 263 #endif
263 264
264 #ifdef CONFIG_ARCH_USES_PG_UNCACHED 265 #ifdef CONFIG_ARCH_USES_PG_UNCACHED
265 PAGEFLAG(Uncached, uncached) 266 PAGEFLAG(Uncached, uncached)
266 #else 267 #else
267 PAGEFLAG_FALSE(Uncached) 268 PAGEFLAG_FALSE(Uncached)
268 #endif 269 #endif
269 270
270 #ifdef CONFIG_MEMORY_FAILURE 271 #ifdef CONFIG_MEMORY_FAILURE
271 PAGEFLAG(HWPoison, hwpoison) 272 PAGEFLAG(HWPoison, hwpoison)
272 TESTSCFLAG(HWPoison, hwpoison) 273 TESTSCFLAG(HWPoison, hwpoison)
273 #define __PG_HWPOISON (1UL << PG_hwpoison) 274 #define __PG_HWPOISON (1UL << PG_hwpoison)
274 #else 275 #else
275 PAGEFLAG_FALSE(HWPoison) 276 PAGEFLAG_FALSE(HWPoison)
276 #define __PG_HWPOISON 0 277 #define __PG_HWPOISON 0
277 #endif 278 #endif
278 279
279 u64 stable_page_flags(struct page *page); 280 u64 stable_page_flags(struct page *page);
280 281
281 static inline int PageUptodate(struct page *page) 282 static inline int PageUptodate(struct page *page)
282 { 283 {
283 int ret = test_bit(PG_uptodate, &(page)->flags); 284 int ret = test_bit(PG_uptodate, &(page)->flags);
284 285
285 /* 286 /*
286 * Must ensure that the data we read out of the page is loaded 287 * Must ensure that the data we read out of the page is loaded
287 * _after_ we've loaded page->flags to check for PageUptodate. 288 * _after_ we've loaded page->flags to check for PageUptodate.
288 * We can skip the barrier if the page is not uptodate, because 289 * We can skip the barrier if the page is not uptodate, because
289 * we wouldn't be reading anything from it. 290 * we wouldn't be reading anything from it.
290 * 291 *
291 * See SetPageUptodate() for the other side of the story. 292 * See SetPageUptodate() for the other side of the story.
292 */ 293 */
293 if (ret) 294 if (ret)
294 smp_rmb(); 295 smp_rmb();
295 296
296 return ret; 297 return ret;
297 } 298 }
298 299
299 static inline void __SetPageUptodate(struct page *page) 300 static inline void __SetPageUptodate(struct page *page)
300 { 301 {
301 smp_wmb(); 302 smp_wmb();
302 __set_bit(PG_uptodate, &(page)->flags); 303 __set_bit(PG_uptodate, &(page)->flags);
303 } 304 }
304 305
305 static inline void SetPageUptodate(struct page *page) 306 static inline void SetPageUptodate(struct page *page)
306 { 307 {
307 /* 308 /*
308 * Memory barrier must be issued before setting the PG_uptodate bit, 309 * Memory barrier must be issued before setting the PG_uptodate bit,
309 * so that all previous stores issued in order to bring the page 310 * so that all previous stores issued in order to bring the page
310 * uptodate are actually visible before PageUptodate becomes true. 311 * uptodate are actually visible before PageUptodate becomes true.
311 */ 312 */
312 smp_wmb(); 313 smp_wmb();
313 set_bit(PG_uptodate, &(page)->flags); 314 set_bit(PG_uptodate, &(page)->flags);
314 } 315 }
315 316
316 CLEARPAGEFLAG(Uptodate, uptodate) 317 CLEARPAGEFLAG(Uptodate, uptodate)
317 318
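To make the barrier pairing concrete, here is a minimal sketch of the two sides as a writer/reader pair. The example_* helpers are made up purely for illustration; the point is that the smp_wmb() in SetPageUptodate() pairs with the smp_rmb() in PageUptodate():

	/* Writer side (e.g. completion of a read): all stores that fill the
	 * page must be visible before PG_uptodate can be observed as set. */
	static void example_fill_and_mark_uptodate(struct page *page,
						   const void *data, size_t len)
	{
		void *kaddr = kmap_atomic(page);

		memcpy(kaddr, data, len);	/* stores that bring the page uptodate */
		kunmap_atomic(kaddr);
		SetPageUptodate(page);		/* smp_wmb(), then set_bit(PG_uptodate) */
	}

	/* Reader side: only touch the data after PageUptodate() returned true;
	 * the smp_rmb() inside PageUptodate() orders the flag load before the
	 * data loads. */
	static int example_copy_if_uptodate(struct page *page, void *buf, size_t len)
	{
		void *kaddr;

		if (!PageUptodate(page))
			return -EIO;		/* illustrative error handling */

		kaddr = kmap_atomic(page);
		memcpy(buf, kaddr, len);
		kunmap_atomic(kaddr);
		return 0;
	}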
318 extern void cancel_dirty_page(struct page *page, unsigned int account_size); 319 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
319 320
320 int test_clear_page_writeback(struct page *page); 321 int test_clear_page_writeback(struct page *page);
321 int __test_set_page_writeback(struct page *page, bool keep_write); 322 int __test_set_page_writeback(struct page *page, bool keep_write);
322 323
323 #define test_set_page_writeback(page) \ 324 #define test_set_page_writeback(page) \
324 __test_set_page_writeback(page, false) 325 __test_set_page_writeback(page, false)
325 #define test_set_page_writeback_keepwrite(page) \ 326 #define test_set_page_writeback_keepwrite(page) \
326 __test_set_page_writeback(page, true) 327 __test_set_page_writeback(page, true)
327 328
328 static inline void set_page_writeback(struct page *page) 329 static inline void set_page_writeback(struct page *page)
329 { 330 {
330 test_set_page_writeback(page); 331 test_set_page_writeback(page);
331 } 332 }
332 333
333 static inline void set_page_writeback_keepwrite(struct page *page) 334 static inline void set_page_writeback_keepwrite(struct page *page)
334 { 335 {
335 test_set_page_writeback_keepwrite(page); 336 test_set_page_writeback_keepwrite(page);
336 } 337 }
337 338
338 #ifdef CONFIG_PAGEFLAGS_EXTENDED 339 #ifdef CONFIG_PAGEFLAGS_EXTENDED
339 /* 340 /*
340 * System with lots of page flags available. This allows separate 341 * System with lots of page flags available. This allows separate
341 * flags for PageHead() and PageTail() checks of compound pages so that bit 342 * flags for PageHead() and PageTail() checks of compound pages so that bit
342 * tests can be used in performance sensitive paths. PageCompound is 343 * tests can be used in performance sensitive paths. PageCompound is
343 * generally not used in hot code paths. 344 * generally not used in hot code paths.
344 */ 345 */
345 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head) 346 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
346 __PAGEFLAG(Tail, tail) 347 __PAGEFLAG(Tail, tail)
347 348
348 static inline int PageCompound(struct page *page) 349 static inline int PageCompound(struct page *page)
349 { 350 {
350 return page->flags & ((1L << PG_head) | (1L << PG_tail)); 351 return page->flags & ((1L << PG_head) | (1L << PG_tail));
351 352
352 } 353 }
353 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 354 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
354 static inline void ClearPageCompound(struct page *page) 355 static inline void ClearPageCompound(struct page *page)
355 { 356 {
356 BUG_ON(!PageHead(page)); 357 BUG_ON(!PageHead(page));
357 ClearPageHead(page); 358 ClearPageHead(page);
358 } 359 }
359 #endif 360 #endif
360 #else 361 #else
361 /* 362 /*
362 * Reduce page flag use as much as possible by overlapping 363 * Reduce page flag use as much as possible by overlapping
363 * compound page flags with the flags used for page cache pages. Possible 364 * compound page flags with the flags used for page cache pages. Possible
364 * because PageCompound is always set for compound pages and not for 365 * because PageCompound is always set for compound pages and not for
365 * pages on the LRU and/or pagecache. 366 * pages on the LRU and/or pagecache.
366 */ 367 */
367 TESTPAGEFLAG(Compound, compound) 368 TESTPAGEFLAG(Compound, compound)
368 __SETPAGEFLAG(Head, compound) __CLEARPAGEFLAG(Head, compound) 369 __SETPAGEFLAG(Head, compound) __CLEARPAGEFLAG(Head, compound)
369 370
370 /* 371 /*
371 * PG_reclaim is used in combination with PG_compound to mark the 372 * PG_reclaim is used in combination with PG_compound to mark the
372 * head and tail of a compound page. This saves one page flag 373 * head and tail of a compound page. This saves one page flag
373 * but makes it impossible to use compound pages for the page cache. 374 * but makes it impossible to use compound pages for the page cache.
374 * The PG_reclaim bit would have to be used for reclaim or readahead 375 * The PG_reclaim bit would have to be used for reclaim or readahead
375 * if compound pages enter the page cache. 376 * if compound pages enter the page cache.
376 * 377 *
377 * PG_compound & PG_reclaim => Tail page 378 * PG_compound & PG_reclaim => Tail page
378 * PG_compound & ~PG_reclaim => Head page 379 * PG_compound & ~PG_reclaim => Head page
379 */ 380 */
380 #define PG_head_mask ((1L << PG_compound)) 381 #define PG_head_mask ((1L << PG_compound))
381 #define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim)) 382 #define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim))
382 383
383 static inline int PageHead(struct page *page) 384 static inline int PageHead(struct page *page)
384 { 385 {
385 return ((page->flags & PG_head_tail_mask) == PG_head_mask); 386 return ((page->flags & PG_head_tail_mask) == PG_head_mask);
386 } 387 }
387 388
388 static inline int PageTail(struct page *page) 389 static inline int PageTail(struct page *page)
389 { 390 {
390 return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask); 391 return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask);
391 } 392 }
392 393
393 static inline void __SetPageTail(struct page *page) 394 static inline void __SetPageTail(struct page *page)
394 { 395 {
395 page->flags |= PG_head_tail_mask; 396 page->flags |= PG_head_tail_mask;
396 } 397 }
397 398
398 static inline void __ClearPageTail(struct page *page) 399 static inline void __ClearPageTail(struct page *page)
399 { 400 {
400 page->flags &= ~PG_head_tail_mask; 401 page->flags &= ~PG_head_tail_mask;
401 } 402 }
402 403
403 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 404 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
404 static inline void ClearPageCompound(struct page *page) 405 static inline void ClearPageCompound(struct page *page)
405 { 406 {
406 BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound)); 407 BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound));
407 clear_bit(PG_compound, &page->flags); 408 clear_bit(PG_compound, &page->flags);
408 } 409 }
409 #endif 410 #endif
410 411
411 #endif /* !PAGEFLAGS_EXTENDED */ 412 #endif /* !PAGEFLAGS_EXTENDED */
412 413
413 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 414 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
414 /* 415 /*
415 * PageHuge() only returns true for hugetlbfs pages, but not for 416 * PageHuge() only returns true for hugetlbfs pages, but not for
416 * normal or transparent huge pages. 417 * normal or transparent huge pages.
417 * 418 *
418 * PageTransHuge() returns true for both transparent huge and 419 * PageTransHuge() returns true for both transparent huge and
419 * hugetlbfs pages, but not normal pages. PageTransHuge() can only be 420 * hugetlbfs pages, but not normal pages. PageTransHuge() can only be
420 * called in the core VM paths where hugetlbfs pages can't exist. 421 * called in the core VM paths where hugetlbfs pages can't exist.
421 */ 422 */
422 static inline int PageTransHuge(struct page *page) 423 static inline int PageTransHuge(struct page *page)
423 { 424 {
424 VM_BUG_ON(PageTail(page)); 425 VM_BUG_ON(PageTail(page));
425 return PageHead(page); 426 return PageHead(page);
426 } 427 }
427 428
428 /* 429 /*
429 * PageTransCompound returns true for both transparent huge pages 430 * PageTransCompound returns true for both transparent huge pages
430 * and hugetlbfs pages, so it should only be called when it's known 431 * and hugetlbfs pages, so it should only be called when it's known
431 * that hugetlbfs pages aren't involved. 432 * that hugetlbfs pages aren't involved.
432 */ 433 */
433 static inline int PageTransCompound(struct page *page) 434 static inline int PageTransCompound(struct page *page)
434 { 435 {
435 return PageCompound(page); 436 return PageCompound(page);
436 } 437 }
437 438
438 /* 439 /*
439 * PageTransTail returns true for both transparent huge pages 440 * PageTransTail returns true for both transparent huge pages
440 * and hugetlbfs pages, so it should only be called when it's known 441 * and hugetlbfs pages, so it should only be called when it's known
441 * that hugetlbfs pages aren't involved. 442 * that hugetlbfs pages aren't involved.
442 */ 443 */
443 static inline int PageTransTail(struct page *page) 444 static inline int PageTransTail(struct page *page)
444 { 445 {
445 return PageTail(page); 446 return PageTail(page);
446 } 447 }
447 448
448 #else 449 #else
449 450
450 static inline int PageTransHuge(struct page *page) 451 static inline int PageTransHuge(struct page *page)
451 { 452 {
452 return 0; 453 return 0;
453 } 454 }
454 455
455 static inline int PageTransCompound(struct page *page) 456 static inline int PageTransCompound(struct page *page)
456 { 457 {
457 return 0; 458 return 0;
458 } 459 }
459 460
460 static inline int PageTransTail(struct page *page) 461 static inline int PageTransTail(struct page *page)
461 { 462 {
462 return 0; 463 return 0;
463 } 464 }
464 #endif 465 #endif
465 466
466 /* 467 /*
467 * If network-based swap is enabled, sl*b must keep track of whether pages 468 * If network-based swap is enabled, sl*b must keep track of whether pages
468 * were allocated from pfmemalloc reserves. 469 * were allocated from pfmemalloc reserves.
469 */ 470 */
470 static inline int PageSlabPfmemalloc(struct page *page) 471 static inline int PageSlabPfmemalloc(struct page *page)
471 { 472 {
472 VM_BUG_ON(!PageSlab(page)); 473 VM_BUG_ON(!PageSlab(page));
473 return PageActive(page); 474 return PageActive(page);
474 } 475 }
475 476
476 static inline void SetPageSlabPfmemalloc(struct page *page) 477 static inline void SetPageSlabPfmemalloc(struct page *page)
477 { 478 {
478 VM_BUG_ON(!PageSlab(page)); 479 VM_BUG_ON(!PageSlab(page));
479 SetPageActive(page); 480 SetPageActive(page);
480 } 481 }
481 482
482 static inline void __ClearPageSlabPfmemalloc(struct page *page) 483 static inline void __ClearPageSlabPfmemalloc(struct page *page)
483 { 484 {
484 VM_BUG_ON(!PageSlab(page)); 485 VM_BUG_ON(!PageSlab(page));
485 __ClearPageActive(page); 486 __ClearPageActive(page);
486 } 487 }
487 488
488 static inline void ClearPageSlabPfmemalloc(struct page *page) 489 static inline void ClearPageSlabPfmemalloc(struct page *page)
489 { 490 {
490 VM_BUG_ON(!PageSlab(page)); 491 VM_BUG_ON(!PageSlab(page));
491 ClearPageActive(page); 492 ClearPageActive(page);
492 } 493 }
493 494
494 #ifdef CONFIG_MMU 495 #ifdef CONFIG_MMU
495 #define __PG_MLOCKED (1 << PG_mlocked) 496 #define __PG_MLOCKED (1 << PG_mlocked)
496 #else 497 #else
497 #define __PG_MLOCKED 0 498 #define __PG_MLOCKED 0
498 #endif 499 #endif
499 500
500 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 501 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
501 #define __PG_COMPOUND_LOCK (1 << PG_compound_lock) 502 #define __PG_COMPOUND_LOCK (1 << PG_compound_lock)
502 #else 503 #else
503 #define __PG_COMPOUND_LOCK 0 504 #define __PG_COMPOUND_LOCK 0
504 #endif 505 #endif
505 506
506 /* 507 /*
507 * Flags checked when a page is freed. Pages being freed should not have 508 * Flags checked when a page is freed. Pages being freed should not have
508 * these flags set. If they are, there is a problem. 509 * these flags set. If they are, there is a problem.
509 */ 510 */
510 #define PAGE_FLAGS_CHECK_AT_FREE \ 511 #define PAGE_FLAGS_CHECK_AT_FREE \
511 (1 << PG_lru | 1 << PG_locked | \ 512 (1 << PG_lru | 1 << PG_locked | \
512 1 << PG_private | 1 << PG_private_2 | \ 513 1 << PG_private | 1 << PG_private_2 | \
513 1 << PG_writeback | 1 << PG_reserved | \ 514 1 << PG_writeback | 1 << PG_reserved | \
514 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ 515 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
515 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ 516 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
516 __PG_COMPOUND_LOCK) 517 __PG_COMPOUND_LOCK)
517 518
518 /* 519 /*
519 * Flags checked when a page is prepped for return by the page allocator. 520 * Flags checked when a page is prepped for return by the page allocator.
520 * Pages being prepped should not have any flags set. If they are set, 521 * Pages being prepped should not have any flags set. If they are set,
521 * there has been a kernel bug or struct page corruption. 522 * there has been a kernel bug or struct page corruption.
522 */ 523 */
523 #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1) 524 #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
524 525
525 #define PAGE_FLAGS_PRIVATE \ 526 #define PAGE_FLAGS_PRIVATE \
526 (1 << PG_private | 1 << PG_private_2) 527 (1 << PG_private | 1 << PG_private_2)
527 /** 528 /**
528 * page_has_private - Determine if page has private stuff 529 * page_has_private - Determine if page has private stuff
529 * @page: The page to be checked 530 * @page: The page to be checked
530 * 531 *
531 * Determine if a page has private stuff, indicating that release routines 532 * Determine if a page has private stuff, indicating that release routines
532 * should be invoked upon it. 533 * should be invoked upon it.
533 */ 534 */
534 static inline int page_has_private(struct page *page) 535 static inline int page_has_private(struct page *page)
535 { 536 {
536 return !!(page->flags & PAGE_FLAGS_PRIVATE); 537 return !!(page->flags & PAGE_FLAGS_PRIVATE);
537 } 538 }
538 539
539 #endif /* !__GENERATING_BOUNDS_H */ 540 #endif /* !__GENERATING_BOUNDS_H */
540 541
541 #endif /* PAGE_FLAGS_H */ 542 #endif /* PAGE_FLAGS_H */
542 543
include/linux/pagemap.h
1 #ifndef _LINUX_PAGEMAP_H 1 #ifndef _LINUX_PAGEMAP_H
2 #define _LINUX_PAGEMAP_H 2 #define _LINUX_PAGEMAP_H
3 3
4 /* 4 /*
5 * Copyright 1995 Linus Torvalds 5 * Copyright 1995 Linus Torvalds
6 */ 6 */
7 #include <linux/mm.h> 7 #include <linux/mm.h>
8 #include <linux/fs.h> 8 #include <linux/fs.h>
9 #include <linux/list.h> 9 #include <linux/list.h>
10 #include <linux/highmem.h> 10 #include <linux/highmem.h>
11 #include <linux/compiler.h> 11 #include <linux/compiler.h>
12 #include <asm/uaccess.h> 12 #include <asm/uaccess.h>
13 #include <linux/gfp.h> 13 #include <linux/gfp.h>
14 #include <linux/bitops.h> 14 #include <linux/bitops.h>
15 #include <linux/hardirq.h> /* for in_interrupt() */ 15 #include <linux/hardirq.h> /* for in_interrupt() */
16 #include <linux/hugetlb_inline.h> 16 #include <linux/hugetlb_inline.h>
17 17
18 /* 18 /*
19 * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page 19 * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
20 * allocation mode flags. 20 * allocation mode flags.
21 */ 21 */
22 enum mapping_flags { 22 enum mapping_flags {
23 AS_EIO = __GFP_BITS_SHIFT + 0, /* IO error on async write */ 23 AS_EIO = __GFP_BITS_SHIFT + 0, /* IO error on async write */
24 AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ 24 AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */
25 AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ 25 AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
26 AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ 26 AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
27 AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */ 27 AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */
28 }; 28 };
29 29
30 static inline void mapping_set_error(struct address_space *mapping, int error) 30 static inline void mapping_set_error(struct address_space *mapping, int error)
31 { 31 {
32 if (unlikely(error)) { 32 if (unlikely(error)) {
33 if (error == -ENOSPC) 33 if (error == -ENOSPC)
34 set_bit(AS_ENOSPC, &mapping->flags); 34 set_bit(AS_ENOSPC, &mapping->flags);
35 else 35 else
36 set_bit(AS_EIO, &mapping->flags); 36 set_bit(AS_EIO, &mapping->flags);
37 } 37 }
38 } 38 }
39 39
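A hedged sketch of how a writeback completion path typically feeds this: record the error on the mapping so a later fsync()/msync() can report it. The handler name is hypothetical; end_page_writeback() is the normal completion call:

	/* Hypothetical I/O completion handler for a page under writeback. */
	static void example_end_page_write(struct page *page, int err)
	{
		if (err)
			mapping_set_error(page->mapping, err);
		end_page_writeback(page);
	}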
40 static inline void mapping_set_unevictable(struct address_space *mapping) 40 static inline void mapping_set_unevictable(struct address_space *mapping)
41 { 41 {
42 set_bit(AS_UNEVICTABLE, &mapping->flags); 42 set_bit(AS_UNEVICTABLE, &mapping->flags);
43 } 43 }
44 44
45 static inline void mapping_clear_unevictable(struct address_space *mapping) 45 static inline void mapping_clear_unevictable(struct address_space *mapping)
46 { 46 {
47 clear_bit(AS_UNEVICTABLE, &mapping->flags); 47 clear_bit(AS_UNEVICTABLE, &mapping->flags);
48 } 48 }
49 49
50 static inline int mapping_unevictable(struct address_space *mapping) 50 static inline int mapping_unevictable(struct address_space *mapping)
51 { 51 {
52 if (mapping) 52 if (mapping)
53 return test_bit(AS_UNEVICTABLE, &mapping->flags); 53 return test_bit(AS_UNEVICTABLE, &mapping->flags);
54 return !!mapping; 54 return !!mapping;
55 } 55 }
56 56
57 static inline void mapping_set_balloon(struct address_space *mapping) 57 static inline void mapping_set_balloon(struct address_space *mapping)
58 { 58 {
59 set_bit(AS_BALLOON_MAP, &mapping->flags); 59 set_bit(AS_BALLOON_MAP, &mapping->flags);
60 } 60 }
61 61
62 static inline void mapping_clear_balloon(struct address_space *mapping) 62 static inline void mapping_clear_balloon(struct address_space *mapping)
63 { 63 {
64 clear_bit(AS_BALLOON_MAP, &mapping->flags); 64 clear_bit(AS_BALLOON_MAP, &mapping->flags);
65 } 65 }
66 66
67 static inline int mapping_balloon(struct address_space *mapping) 67 static inline int mapping_balloon(struct address_space *mapping)
68 { 68 {
69 return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags); 69 return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
70 } 70 }
71 71
72 static inline gfp_t mapping_gfp_mask(struct address_space * mapping) 72 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
73 { 73 {
74 return (__force gfp_t)mapping->flags & __GFP_BITS_MASK; 74 return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
75 } 75 }
76 76
77 /* 77 /*
78 * This is non-atomic. Only to be used before the mapping is activated. 78 * This is non-atomic. Only to be used before the mapping is activated.
79 * Probably needs a barrier... 79 * Probably needs a barrier...
80 */ 80 */
81 static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask) 81 static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
82 { 82 {
83 m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) | 83 m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) |
84 (__force unsigned long)mask; 84 (__force unsigned long)mask;
85 } 85 }
86 86
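As a usage note, some filesystems restrict the allocation mode for a mapping once at inode-setup time so that later page cache allocations inherit it through mapping_gfp_mask(). A minimal sketch (the helper is hypothetical):

	/* Hypothetical inode-setup helper: clear __GFP_FS for this mapping so
	 * page cache allocations done on its behalf cannot recurse into the
	 * filesystem; later allocations pick this up via mapping_gfp_mask(). */
	static void example_restrict_mapping_gfp(struct inode *inode)
	{
		mapping_set_gfp_mask(inode->i_mapping,
				     mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);
	}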
87 /* 87 /*
88 * The page cache can be done in larger chunks than 88 * The page cache can be done in larger chunks than
89 * one page, because it allows for more efficient 89 * one page, because it allows for more efficient
90 * throughput (it can then be mapped into user 90 * throughput (it can then be mapped into user
91 * space in smaller chunks for the same flexibility). 91 * space in smaller chunks for the same flexibility).
92 * 92 *
93 * Or rather, it _will_ be done in larger chunks. 93 * Or rather, it _will_ be done in larger chunks.
94 */ 94 */
95 #define PAGE_CACHE_SHIFT PAGE_SHIFT 95 #define PAGE_CACHE_SHIFT PAGE_SHIFT
96 #define PAGE_CACHE_SIZE PAGE_SIZE 96 #define PAGE_CACHE_SIZE PAGE_SIZE
97 #define PAGE_CACHE_MASK PAGE_MASK 97 #define PAGE_CACHE_MASK PAGE_MASK
98 #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK) 98 #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
99 99
100 #define page_cache_get(page) get_page(page) 100 #define page_cache_get(page) get_page(page)
101 #define page_cache_release(page) put_page(page) 101 #define page_cache_release(page) put_page(page)
102 void release_pages(struct page **pages, int nr, bool cold); 102 void release_pages(struct page **pages, int nr, bool cold);
103 103
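The PAGE_CACHE_* macros above are all that is needed to translate between file byte offsets and page cache indexes; a minimal sketch (helper names are illustrative):

	/* Byte offset -> page index, page index -> byte offset, and the offset
	 * within the page (useful e.g. for partial writes in ->write_begin()). */
	static inline pgoff_t example_pos_to_index(loff_t pos)
	{
		return pos >> PAGE_CACHE_SHIFT;
	}

	static inline loff_t example_index_to_pos(pgoff_t index)
	{
		return (loff_t)index << PAGE_CACHE_SHIFT;
	}

	static inline unsigned int example_offset_in_page(loff_t pos)
	{
		return pos & (PAGE_CACHE_SIZE - 1);
	}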
104 /* 104 /*
105 * speculatively take a reference to a page. 105 * speculatively take a reference to a page.
106 * If the page is free (_count == 0), then _count is untouched, and 0 106 * If the page is free (_count == 0), then _count is untouched, and 0
107 * is returned. Otherwise, _count is incremented by 1 and 1 is returned. 107 * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
108 * 108 *
109 * This function must be called inside the same rcu_read_lock() section as has 109 * This function must be called inside the same rcu_read_lock() section as has
110 * been used to lookup the page in the pagecache radix-tree (or page table): 110 * been used to lookup the page in the pagecache radix-tree (or page table):
111 * this allows allocators to use a synchronize_rcu() to stabilize _count. 111 * this allows allocators to use a synchronize_rcu() to stabilize _count.
112 * 112 *
113 * Unless an RCU grace period has passed, the count of all pages coming out 113 * Unless an RCU grace period has passed, the count of all pages coming out
114 * of the allocator must be considered unstable. page_count may return higher 114 * of the allocator must be considered unstable. page_count may return higher
115 * than expected, and put_page must be able to do the right thing when the 115 * than expected, and put_page must be able to do the right thing when the
116 * page has been finished with, no matter what it is subsequently allocated 116 * page has been finished with, no matter what it is subsequently allocated
117 * for (because put_page is what is used here to drop an invalid speculative 117 * for (because put_page is what is used here to drop an invalid speculative
118 * reference). 118 * reference).
119 * 119 *
120 * This is the interesting part of the lockless pagecache (and lockless 120 * This is the interesting part of the lockless pagecache (and lockless
121 * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page) 121 * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page)
122 * has the following pattern: 122 * has the following pattern:
123 * 1. find page in radix tree 123 * 1. find page in radix tree
124 * 2. conditionally increment refcount 124 * 2. conditionally increment refcount
125 * 3. check the page is still in pagecache (if no, goto 1) 125 * 3. check the page is still in pagecache (if no, goto 1)
126 * 126 *
127 * Remove-side that cares about stability of _count (eg. reclaim) has the 127 * Remove-side that cares about stability of _count (eg. reclaim) has the
128 * following (with tree_lock held for write): 128 * following (with tree_lock held for write):
129 * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg) 129 * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
130 * B. remove page from pagecache 130 * B. remove page from pagecache
131 * C. free the page 131 * C. free the page
132 * 132 *
133 * There are 2 critical interleavings that matter: 133 * There are 2 critical interleavings that matter:
134 * - 2 runs before A: in this case, A sees elevated refcount and bails out 134 * - 2 runs before A: in this case, A sees elevated refcount and bails out
135 * - A runs before 2: in this case, 2 sees zero refcount and retries; 135 * - A runs before 2: in this case, 2 sees zero refcount and retries;
136 * subsequently, B will complete and 1 will find no page, causing the 136 * subsequently, B will complete and 1 will find no page, causing the
137 * lookup to return NULL. 137 * lookup to return NULL.
138 * 138 *
139 * It is possible that between 1 and 2, the page is removed then the exact same 139 * It is possible that between 1 and 2, the page is removed then the exact same
140 * page is inserted into the same position in pagecache. That's OK: the 140 * page is inserted into the same position in pagecache. That's OK: the
141 * old find_get_page using tree_lock could equally have run before or after 141 * old find_get_page using tree_lock could equally have run before or after
142 * such a re-insertion, depending on order that locks are granted. 142 * such a re-insertion, depending on order that locks are granted.
143 * 143 *
144 * Lookups racing against pagecache insertion isn't a big problem: either 1 144 * Lookups racing against pagecache insertion isn't a big problem: either 1
145 * will find the page or it will not. Likewise, the old find_get_page could run 145 * will find the page or it will not. Likewise, the old find_get_page could run
146 * either before the insertion or afterwards, depending on timing. 146 * either before the insertion or afterwards, depending on timing.
147 */ 147 */
148 static inline int page_cache_get_speculative(struct page *page) 148 static inline int page_cache_get_speculative(struct page *page)
149 { 149 {
150 VM_BUG_ON(in_interrupt()); 150 VM_BUG_ON(in_interrupt());
151 151
152 #ifdef CONFIG_TINY_RCU 152 #ifdef CONFIG_TINY_RCU
153 # ifdef CONFIG_PREEMPT_COUNT 153 # ifdef CONFIG_PREEMPT_COUNT
154 VM_BUG_ON(!in_atomic()); 154 VM_BUG_ON(!in_atomic());
155 # endif 155 # endif
156 /* 156 /*
157 * Preempt must be disabled here - we rely on rcu_read_lock doing 157 * Preempt must be disabled here - we rely on rcu_read_lock doing
158 * this for us. 158 * this for us.
159 * 159 *
160 * Pagecache won't be truncated from interrupt context, so if we have 160 * Pagecache won't be truncated from interrupt context, so if we have
161 * found a page in the radix tree here, we have pinned its refcount by 161 * found a page in the radix tree here, we have pinned its refcount by
162 * disabling preempt, and hence no need for the "speculative get" that 162 * disabling preempt, and hence no need for the "speculative get" that
163 * SMP requires. 163 * SMP requires.
164 */ 164 */
165 VM_BUG_ON(page_count(page) == 0); 165 VM_BUG_ON(page_count(page) == 0);
166 atomic_inc(&page->_count); 166 atomic_inc(&page->_count);
167 167
168 #else 168 #else
169 if (unlikely(!get_page_unless_zero(page))) { 169 if (unlikely(!get_page_unless_zero(page))) {
170 /* 170 /*
171 * Either the page has been freed, or will be freed. 171 * Either the page has been freed, or will be freed.
172 * In either case, retry here and the caller should 172 * In either case, retry here and the caller should
173 * do the right thing (see comments above). 173 * do the right thing (see comments above).
174 */ 174 */
175 return 0; 175 return 0;
176 } 176 }
177 #endif 177 #endif
178 VM_BUG_ON(PageTail(page)); 178 VM_BUG_ON(PageTail(page));
179 179
180 return 1; 180 return 1;
181 } 181 }
182 182
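Putting steps 1-3 of the comment together, here is a hedged sketch of the lookup side, roughly what the radix-tree walk in mm/filemap.c does; exceptional entries and radix_tree_deref_retry() handling are deliberately omitted:

	/* Simplified lockless lookup: find the page in the radix tree, take a
	 * speculative reference, then recheck that the same page still sits in
	 * that slot before trusting it. */
	static struct page *example_lockless_lookup(struct address_space *mapping,
						    pgoff_t offset)
	{
		struct page *page;

	repeat:
		rcu_read_lock();
		page = radix_tree_lookup(&mapping->page_tree, offset);	/* step 1 */
		if (page) {
			if (!page_cache_get_speculative(page)) {	/* step 2 */
				rcu_read_unlock();
				goto repeat;
			}
			/* step 3: was the page removed or replaced meanwhile? */
			if (unlikely(page != radix_tree_lookup(&mapping->page_tree,
							       offset))) {
				page_cache_release(page);
				rcu_read_unlock();
				goto repeat;
			}
		}
		rcu_read_unlock();
		return page;
	}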
183 /* 183 /*
184 * Same as above, but add instead of inc (could just be merged) 184 * Same as above, but add instead of inc (could just be merged)
185 */ 185 */
186 static inline int page_cache_add_speculative(struct page *page, int count) 186 static inline int page_cache_add_speculative(struct page *page, int count)
187 { 187 {
188 VM_BUG_ON(in_interrupt()); 188 VM_BUG_ON(in_interrupt());
189 189
190 #if !defined(CONFIG_SMP) && defined(CONFIG_TREE_RCU) 190 #if !defined(CONFIG_SMP) && defined(CONFIG_TREE_RCU)
191 # ifdef CONFIG_PREEMPT_COUNT 191 # ifdef CONFIG_PREEMPT_COUNT
192 VM_BUG_ON(!in_atomic()); 192 VM_BUG_ON(!in_atomic());
193 # endif 193 # endif
194 VM_BUG_ON(page_count(page) == 0); 194 VM_BUG_ON(page_count(page) == 0);
195 atomic_add(count, &page->_count); 195 atomic_add(count, &page->_count);
196 196
197 #else 197 #else
198 if (unlikely(!atomic_add_unless(&page->_count, count, 0))) 198 if (unlikely(!atomic_add_unless(&page->_count, count, 0)))
199 return 0; 199 return 0;
200 #endif 200 #endif
201 VM_BUG_ON(PageCompound(page) && page != compound_head(page)); 201 VM_BUG_ON(PageCompound(page) && page != compound_head(page));
202 202
203 return 1; 203 return 1;
204 } 204 }
205 205
206 static inline int page_freeze_refs(struct page *page, int count) 206 static inline int page_freeze_refs(struct page *page, int count)
207 { 207 {
208 return likely(atomic_cmpxchg(&page->_count, count, 0) == count); 208 return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
209 } 209 }
210 210
211 static inline void page_unfreeze_refs(struct page *page, int count) 211 static inline void page_unfreeze_refs(struct page *page, int count)
212 { 212 {
213 VM_BUG_ON(page_count(page) != 0); 213 VM_BUG_ON(page_count(page) != 0);
214 VM_BUG_ON(count == 0); 214 VM_BUG_ON(count == 0);
215 215
216 atomic_set(&page->_count, count); 216 atomic_set(&page->_count, count);
217 } 217 }
218 218
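These two helpers are the remove side (step A) of the protocol described above. A hedged sketch of how a reclaim-style removal might use them, loosely modelled on the reclaim path; accounting and error handling are omitted:

	/* Freeze the refcount to 0 while holding tree_lock for write; if that
	 * fails, someone (e.g. a lockless lookup) still holds a reference and
	 * the page cannot be removed right now. Assumes refcount 2 == caller
	 * reference + page cache reference. */
	static int example_try_remove_from_cache(struct address_space *mapping,
						 struct page *page)
	{
		spin_lock_irq(&mapping->tree_lock);
		if (!page_freeze_refs(page, 2)) {
			spin_unlock_irq(&mapping->tree_lock);
			return 0;			/* lost the race, keep the page */
		}
		if (unlikely(PageDirty(page))) {
			/* Became dirty under us: put the refs back and give up. */
			page_unfreeze_refs(page, 2);
			spin_unlock_irq(&mapping->tree_lock);
			return 0;
		}
		radix_tree_delete(&mapping->page_tree, page->index);	/* step B */
		mapping->nrpages--;
		spin_unlock_irq(&mapping->tree_lock);
		/* step C: the caller frees the page (refcount is now 0). */
		return 1;
	}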
219 #ifdef CONFIG_NUMA 219 #ifdef CONFIG_NUMA
220 extern struct page *__page_cache_alloc(gfp_t gfp); 220 extern struct page *__page_cache_alloc(gfp_t gfp);
221 #else 221 #else
222 static inline struct page *__page_cache_alloc(gfp_t gfp) 222 static inline struct page *__page_cache_alloc(gfp_t gfp)
223 { 223 {
224 return alloc_pages(gfp, 0); 224 return alloc_pages(gfp, 0);
225 } 225 }
226 #endif 226 #endif
227 227
228 static inline struct page *page_cache_alloc(struct address_space *x) 228 static inline struct page *page_cache_alloc(struct address_space *x)
229 { 229 {
230 return __page_cache_alloc(mapping_gfp_mask(x)); 230 return __page_cache_alloc(mapping_gfp_mask(x));
231 } 231 }
232 232
233 static inline struct page *page_cache_alloc_cold(struct address_space *x) 233 static inline struct page *page_cache_alloc_cold(struct address_space *x)
234 { 234 {
235 return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD); 235 return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
236 } 236 }
237 237
238 static inline struct page *page_cache_alloc_readahead(struct address_space *x) 238 static inline struct page *page_cache_alloc_readahead(struct address_space *x)
239 { 239 {
240 return __page_cache_alloc(mapping_gfp_mask(x) | 240 return __page_cache_alloc(mapping_gfp_mask(x) |
241 __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN); 241 __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN);
242 } 242 }
243 243
244 typedef int filler_t(void *, struct page *); 244 typedef int filler_t(void *, struct page *);
245 245
246 pgoff_t page_cache_next_hole(struct address_space *mapping, 246 pgoff_t page_cache_next_hole(struct address_space *mapping,
247 pgoff_t index, unsigned long max_scan); 247 pgoff_t index, unsigned long max_scan);
248 pgoff_t page_cache_prev_hole(struct address_space *mapping, 248 pgoff_t page_cache_prev_hole(struct address_space *mapping,
249 pgoff_t index, unsigned long max_scan); 249 pgoff_t index, unsigned long max_scan);
250 250
251 #define FGP_ACCESSED 0x00000001
252 #define FGP_LOCK 0x00000002
253 #define FGP_CREAT 0x00000004
254 #define FGP_WRITE 0x00000008
255 #define FGP_NOFS 0x00000010
256 #define FGP_NOWAIT 0x00000020
257
258 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
259 int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask);
260
261 /**
262 * find_get_page - find and get a page reference
263 * @mapping: the address_space to search
264 * @offset: the page index
265 *
266 * Looks up the page cache slot at @mapping & @offset. If there is a
267 * page cache page, it is returned with an increased refcount.
268 *
269 * Otherwise, %NULL is returned.
270 */
271 static inline struct page *find_get_page(struct address_space *mapping,
272 pgoff_t offset)
273 {
274 return pagecache_get_page(mapping, offset, 0, 0, 0);
275 }
276
277 static inline struct page *find_get_page_flags(struct address_space *mapping,
278 pgoff_t offset, int fgp_flags)
279 {
280 return pagecache_get_page(mapping, offset, fgp_flags, 0, 0);
281 }
282
283 /**
284 * find_lock_page - locate, pin and lock a pagecache page
286 * @mapping: the address_space to search
287 * @offset: the page index
288 *
289 * Looks up the page cache slot at @mapping & @offset. If there is a
290 * page cache page, it is returned locked and with an increased
291 * refcount.
292 *
293 * Otherwise, %NULL is returned.
294 *
295 * find_lock_page() may sleep.
296 */
297 static inline struct page *find_lock_page(struct address_space *mapping,
298 pgoff_t offset)
299 {
300 return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0);
301 }
302
303 /**
304 * find_or_create_page - locate or add a pagecache page
305 * @mapping: the page's address_space
306 * @index: the page's index into the mapping
307 * @gfp_mask: page allocation mode
308 *
309 * Looks up the page cache slot at @mapping & @offset. If there is a
310 * page cache page, it is returned locked and with an increased
311 * refcount.
312 *
313 * If the page is not present, a new page is allocated using @gfp_mask
314 * and added to the page cache and the VM's LRU list. The page is
315 * returned locked and with an increased refcount.
316 *
317 * On memory exhaustion, %NULL is returned.
318 *
319 * find_or_create_page() may sleep, even if @gfp_mask specifies an
320 * atomic allocation!
321 */
322 static inline struct page *find_or_create_page(struct address_space *mapping,
323 pgoff_t offset, gfp_t gfp_mask)
324 {
325 return pagecache_get_page(mapping, offset,
326 FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
327 gfp_mask, gfp_mask & GFP_RECLAIM_MASK);
328 }
329
330 /**
331 * grab_cache_page_nowait - returns locked page at given index in given cache
332 * @mapping: target address_space
333 * @index: the page index
334 *
335 * Same as grab_cache_page(), but do not wait if the page is unavailable.
336 * This is intended for speculative data generators, where the data can
337 * be regenerated if the page couldn't be grabbed. This routine should
338 * be safe to call while holding the lock for another page.
339 *
340 * Clear __GFP_FS when allocating the page to avoid recursion into the fs
341 * and deadlock against the caller's locked page.
342 */
343 static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
344 pgoff_t index)
345 {
346 return pagecache_get_page(mapping, index,
347 FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
348 mapping_gfp_mask(mapping),
349 GFP_NOFS);
350 }
351
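The write path is expected to funnel through the same core function. Below is a hedged sketch of how grab_cache_page_write_begin() (declared further down) can be built on pagecache_get_page() after this change; the real version in mm/filemap.c may differ in details such as stable-page handling:

	/* Sketch of a write_begin-style wrapper: find or create the page,
	 * locked, and let pagecache_get_page() initialise the accessed state
	 * non-atomically (FGP_ACCESSED) before the page becomes visible. */
	static inline struct page *example_grab_page_write_begin(
			struct address_space *mapping, pgoff_t index, unsigned flags)
	{
		int fgp_flags = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_ACCESSED;

		if (flags & AOP_FLAG_NOFS)
			fgp_flags |= FGP_NOFS;

		return pagecache_get_page(mapping, index, fgp_flags,
					  mapping_gfp_mask(mapping), GFP_KERNEL);
	}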
251 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset); 352 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
252 struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
253 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset); 353 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset);
254 struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
255 struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
256 gfp_t gfp_mask);
257 unsigned find_get_entries(struct address_space *mapping, pgoff_t start, 354 unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
258 unsigned int nr_entries, struct page **entries, 355 unsigned int nr_entries, struct page **entries,
259 pgoff_t *indices); 356 pgoff_t *indices);
260 unsigned find_get_pages(struct address_space *mapping, pgoff_t start, 357 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
261 unsigned int nr_pages, struct page **pages); 358 unsigned int nr_pages, struct page **pages);
262 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, 359 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
263 unsigned int nr_pages, struct page **pages); 360 unsigned int nr_pages, struct page **pages);
264 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, 361 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
265 int tag, unsigned int nr_pages, struct page **pages); 362 int tag, unsigned int nr_pages, struct page **pages);
266 363
267 struct page *grab_cache_page_write_begin(struct address_space *mapping, 364 struct page *grab_cache_page_write_begin(struct address_space *mapping,
268 pgoff_t index, unsigned flags); 365 pgoff_t index, unsigned flags);
269 366
270 /* 367 /*
271 * Returns locked page at given index in given cache, creating it if needed. 368 * Returns locked page at given index in given cache, creating it if needed.
272 */ 369 */
273 static inline struct page *grab_cache_page(struct address_space *mapping, 370 static inline struct page *grab_cache_page(struct address_space *mapping,
274 pgoff_t index) 371 pgoff_t index)
275 { 372 {
276 return find_or_create_page(mapping, index, mapping_gfp_mask(mapping)); 373 return find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
277 } 374 }
278 375
279 extern struct page * grab_cache_page_nowait(struct address_space *mapping,
280 pgoff_t index);
281 extern struct page * read_cache_page(struct address_space *mapping, 376 extern struct page * read_cache_page(struct address_space *mapping,
282 pgoff_t index, filler_t *filler, void *data); 377 pgoff_t index, filler_t *filler, void *data);
283 extern struct page * read_cache_page_gfp(struct address_space *mapping, 378 extern struct page * read_cache_page_gfp(struct address_space *mapping,
284 pgoff_t index, gfp_t gfp_mask); 379 pgoff_t index, gfp_t gfp_mask);
285 extern int read_cache_pages(struct address_space *mapping, 380 extern int read_cache_pages(struct address_space *mapping,
286 struct list_head *pages, filler_t *filler, void *data); 381 struct list_head *pages, filler_t *filler, void *data);
287 382
288 static inline struct page *read_mapping_page(struct address_space *mapping, 383 static inline struct page *read_mapping_page(struct address_space *mapping,
289 pgoff_t index, void *data) 384 pgoff_t index, void *data)
290 { 385 {
291 filler_t *filler = (filler_t *)mapping->a_ops->readpage; 386 filler_t *filler = (filler_t *)mapping->a_ops->readpage;
292 return read_cache_page(mapping, index, filler, data); 387 return read_cache_page(mapping, index, filler, data);
293 } 388 }
294 389
295 /* 390 /*
296 * Return byte-offset into filesystem object for page. 391 * Return byte-offset into filesystem object for page.
297 */ 392 */
298 static inline loff_t page_offset(struct page *page) 393 static inline loff_t page_offset(struct page *page)
299 { 394 {
300 return ((loff_t)page->index) << PAGE_CACHE_SHIFT; 395 return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
301 } 396 }
302 397
303 static inline loff_t page_file_offset(struct page *page) 398 static inline loff_t page_file_offset(struct page *page)
304 { 399 {
305 return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT; 400 return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
306 } 401 }
307 402
308 extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma, 403 extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
309 unsigned long address); 404 unsigned long address);
310 405
311 static inline pgoff_t linear_page_index(struct vm_area_struct *vma, 406 static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
312 unsigned long address) 407 unsigned long address)
313 { 408 {
314 pgoff_t pgoff; 409 pgoff_t pgoff;
315 if (unlikely(is_vm_hugetlb_page(vma))) 410 if (unlikely(is_vm_hugetlb_page(vma)))
316 return linear_hugepage_index(vma, address); 411 return linear_hugepage_index(vma, address);
317 pgoff = (address - vma->vm_start) >> PAGE_SHIFT; 412 pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
318 pgoff += vma->vm_pgoff; 413 pgoff += vma->vm_pgoff;
319 return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT); 414 return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
320 } 415 }
321 416
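As a small illustration of the index arithmetic, a fault path for a file-backed VMA turns the faulting address into a page cache index and looks it up; a minimal sketch (helper name is hypothetical, and it assumes vma->vm_file is non-NULL):

	/* Hypothetical fault-path lookup for a non-hugetlb, file-backed VMA. */
	static struct page *example_fault_lookup(struct vm_area_struct *vma,
						 unsigned long address)
	{
		pgoff_t pgoff = linear_page_index(vma, address);

		return find_get_page(vma->vm_file->f_mapping, pgoff);
	}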
322 extern void __lock_page(struct page *page); 417 extern void __lock_page(struct page *page);
323 extern int __lock_page_killable(struct page *page); 418 extern int __lock_page_killable(struct page *page);
324 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm, 419 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
325 unsigned int flags); 420 unsigned int flags);
326 extern void unlock_page(struct page *page); 421 extern void unlock_page(struct page *page);
327 422
328 static inline void __set_page_locked(struct page *page) 423 static inline void __set_page_locked(struct page *page)
329 { 424 {
330 __set_bit(PG_locked, &page->flags); 425 __set_bit(PG_locked, &page->flags);
331 } 426 }
332 427
333 static inline void __clear_page_locked(struct page *page) 428 static inline void __clear_page_locked(struct page *page)
334 { 429 {
335 __clear_bit(PG_locked, &page->flags); 430 __clear_bit(PG_locked, &page->flags);
336 } 431 }
337 432
338 static inline int trylock_page(struct page *page) 433 static inline int trylock_page(struct page *page)
339 { 434 {
340 return (likely(!test_and_set_bit_lock(PG_locked, &page->flags))); 435 return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
341 } 436 }
342 437
343 /* 438 /*
344 * lock_page may only be called if we have the page's inode pinned. 439 * lock_page may only be called if we have the page's inode pinned.
345 */ 440 */
346 static inline void lock_page(struct page *page) 441 static inline void lock_page(struct page *page)
347 { 442 {
348 might_sleep(); 443 might_sleep();
349 if (!trylock_page(page)) 444 if (!trylock_page(page))
350 __lock_page(page); 445 __lock_page(page);
351 } 446 }
352 447
353 /* 448 /*
354 * lock_page_killable is like lock_page but can be interrupted by fatal 449 * lock_page_killable is like lock_page but can be interrupted by fatal
355 * signals. It returns 0 if it locked the page and -EINTR if it was 450 * signals. It returns 0 if it locked the page and -EINTR if it was
356 * killed while waiting. 451 * killed while waiting.
357 */ 452 */
358 static inline int lock_page_killable(struct page *page) 453 static inline int lock_page_killable(struct page *page)
359 { 454 {
360 might_sleep(); 455 might_sleep();
361 if (!trylock_page(page)) 456 if (!trylock_page(page))
362 return __lock_page_killable(page); 457 return __lock_page_killable(page);
363 return 0; 458 return 0;
364 } 459 }
365 460
366 /* 461 /*
367 * lock_page_or_retry - Lock the page, unless this would block and the 462 * lock_page_or_retry - Lock the page, unless this would block and the
368 * caller indicated that it can handle a retry. 463 * caller indicated that it can handle a retry.
369 */ 464 */
370 static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm, 465 static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
371 unsigned int flags) 466 unsigned int flags)
372 { 467 {
373 might_sleep(); 468 might_sleep();
374 return trylock_page(page) || __lock_page_or_retry(page, mm, flags); 469 return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
375 } 470 }
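A minimal usage sketch for the lock helpers above; the wrapper function is illustrative and not part of this patch:

/* Sketch: lock a page cache page, giving up on a fatal signal. */
static int example_with_locked_page(struct page *page)
{
	int err;

	err = lock_page_killable(page);	/* sleeps; returns -EINTR if killed */
	if (err)
		return err;

	/* ... operate on the locked page ... */

	unlock_page(page);
	return 0;
}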
376 471
377 /* 472 /*
378 * This is exported only for wait_on_page_locked/wait_on_page_writeback. 473 * This is exported only for wait_on_page_locked/wait_on_page_writeback.
379 * Never use this directly! 474 * Never use this directly!
380 */ 475 */
381 extern void wait_on_page_bit(struct page *page, int bit_nr); 476 extern void wait_on_page_bit(struct page *page, int bit_nr);
382 477
383 extern int wait_on_page_bit_killable(struct page *page, int bit_nr); 478 extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
384 479
385 static inline int wait_on_page_locked_killable(struct page *page) 480 static inline int wait_on_page_locked_killable(struct page *page)
386 { 481 {
387 if (PageLocked(page)) 482 if (PageLocked(page))
388 return wait_on_page_bit_killable(page, PG_locked); 483 return wait_on_page_bit_killable(page, PG_locked);
389 return 0; 484 return 0;
390 } 485 }
391 486
392 /* 487 /*
393 * Wait for a page to be unlocked. 488 * Wait for a page to be unlocked.
394 * 489 *
395 * This must be called with the caller "holding" the page, 490 * This must be called with the caller "holding" the page,
396 * ie with increased "page->count" so that the page won't 491 * ie with increased "page->count" so that the page won't
397 * go away during the wait.. 492 * go away during the wait..
398 */ 493 */
399 static inline void wait_on_page_locked(struct page *page) 494 static inline void wait_on_page_locked(struct page *page)
400 { 495 {
401 if (PageLocked(page)) 496 if (PageLocked(page))
402 wait_on_page_bit(page, PG_locked); 497 wait_on_page_bit(page, PG_locked);
403 } 498 }
404 499
405 /* 500 /*
406 * Wait for a page to complete writeback 501 * Wait for a page to complete writeback
407 */ 502 */
408 static inline void wait_on_page_writeback(struct page *page) 503 static inline void wait_on_page_writeback(struct page *page)
409 { 504 {
410 if (PageWriteback(page)) 505 if (PageWriteback(page))
411 wait_on_page_bit(page, PG_writeback); 506 wait_on_page_bit(page, PG_writeback);
412 } 507 }
413 508
414 extern void end_page_writeback(struct page *page); 509 extern void end_page_writeback(struct page *page);
415 void wait_for_stable_page(struct page *page); 510 void wait_for_stable_page(struct page *page);
416 511
417 /* 512 /*
418 * Add an arbitrary waiter to a page's wait queue 513 * Add an arbitrary waiter to a page's wait queue
419 */ 514 */
420 extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter); 515 extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
421 516
422 /* 517 /*
423 * Fault a userspace page into pagetables. Return non-zero on a fault. 518 * Fault a userspace page into pagetables. Return non-zero on a fault.
424 * 519 *
425 * This assumes that two userspace pages are always sufficient. That's 520 * This assumes that two userspace pages are always sufficient. That's
426 * not true if PAGE_CACHE_SIZE > PAGE_SIZE. 521 * not true if PAGE_CACHE_SIZE > PAGE_SIZE.
427 */ 522 */
428 static inline int fault_in_pages_writeable(char __user *uaddr, int size) 523 static inline int fault_in_pages_writeable(char __user *uaddr, int size)
429 { 524 {
430 int ret; 525 int ret;
431 526
432 if (unlikely(size == 0)) 527 if (unlikely(size == 0))
433 return 0; 528 return 0;
434 529
435 /* 530 /*
436 * Writing zeroes into userspace here is OK, because we know that if 531 * Writing zeroes into userspace here is OK, because we know that if
437 * the zero gets there, we'll be overwriting it. 532 * the zero gets there, we'll be overwriting it.
438 */ 533 */
439 ret = __put_user(0, uaddr); 534 ret = __put_user(0, uaddr);
440 if (ret == 0) { 535 if (ret == 0) {
441 char __user *end = uaddr + size - 1; 536 char __user *end = uaddr + size - 1;
442 537
443 /* 538 /*
444 * If the page was already mapped, this will get a cache miss 539 * If the page was already mapped, this will get a cache miss
445 * for sure, so try to avoid doing it. 540 * for sure, so try to avoid doing it.
446 */ 541 */
447 if (((unsigned long)uaddr & PAGE_MASK) != 542 if (((unsigned long)uaddr & PAGE_MASK) !=
448 ((unsigned long)end & PAGE_MASK)) 543 ((unsigned long)end & PAGE_MASK))
449 ret = __put_user(0, end); 544 ret = __put_user(0, end);
450 } 545 }
451 return ret; 546 return ret;
452 } 547 }
453 548
454 static inline int fault_in_pages_readable(const char __user *uaddr, int size) 549 static inline int fault_in_pages_readable(const char __user *uaddr, int size)
455 { 550 {
456 volatile char c; 551 volatile char c;
457 int ret; 552 int ret;
458 553
459 if (unlikely(size == 0)) 554 if (unlikely(size == 0))
460 return 0; 555 return 0;
461 556
462 ret = __get_user(c, uaddr); 557 ret = __get_user(c, uaddr);
463 if (ret == 0) { 558 if (ret == 0) {
464 const char __user *end = uaddr + size - 1; 559 const char __user *end = uaddr + size - 1;
465 560
466 if (((unsigned long)uaddr & PAGE_MASK) != 561 if (((unsigned long)uaddr & PAGE_MASK) !=
467 ((unsigned long)end & PAGE_MASK)) { 562 ((unsigned long)end & PAGE_MASK)) {
468 ret = __get_user(c, end); 563 ret = __get_user(c, end);
469 (void)c; 564 (void)c;
470 } 565 }
471 } 566 }
472 return ret; 567 return ret;
473 } 568 }
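These prefault helpers are typically called before taking a page lock in the buffered write path, so that the subsequent usercopy (done with page faults disabled) cannot deadlock on the locked page. A hedged sketch of that pattern, with an illustrative function name and at most PAGE_SIZE of data per the comment above:

static int example_prepare_copy(struct page *page,
				const char __user *buf, int bytes)
{
	if (fault_in_pages_readable(buf, bytes))
		return -EFAULT;		/* user buffer not readable */

	lock_page(page);
	/* ... atomic usercopy from buf into the page goes here ... */
	unlock_page(page);
	return 0;
}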
474 569
475 /* 570 /*
476 * Multipage variants of the above prefault helpers, useful if more than 571 * Multipage variants of the above prefault helpers, useful if more than
477 * PAGE_SIZE of data needs to be prefaulted. These are separate from the above 572 * PAGE_SIZE of data needs to be prefaulted. These are separate from the above
478 * functions (which only handle up to PAGE_SIZE) to avoid clobbering the 573 * functions (which only handle up to PAGE_SIZE) to avoid clobbering the
479 * filemap.c hotpaths. 574 * filemap.c hotpaths.
480 */ 575 */
481 static inline int fault_in_multipages_writeable(char __user *uaddr, int size) 576 static inline int fault_in_multipages_writeable(char __user *uaddr, int size)
482 { 577 {
483 int ret = 0; 578 int ret = 0;
484 char __user *end = uaddr + size - 1; 579 char __user *end = uaddr + size - 1;
485 580
486 if (unlikely(size == 0)) 581 if (unlikely(size == 0))
487 return ret; 582 return ret;
488 583
489 /* 584 /*
490 * Writing zeroes into userspace here is OK, because we know that if 585 * Writing zeroes into userspace here is OK, because we know that if
491 * the zero gets there, we'll be overwriting it. 586 * the zero gets there, we'll be overwriting it.
492 */ 587 */
493 while (uaddr <= end) { 588 while (uaddr <= end) {
494 ret = __put_user(0, uaddr); 589 ret = __put_user(0, uaddr);
495 if (ret != 0) 590 if (ret != 0)
496 return ret; 591 return ret;
497 uaddr += PAGE_SIZE; 592 uaddr += PAGE_SIZE;
498 } 593 }
499 594
500 /* Check whether the range spilled into the next page. */ 595 /* Check whether the range spilled into the next page. */
501 if (((unsigned long)uaddr & PAGE_MASK) == 596 if (((unsigned long)uaddr & PAGE_MASK) ==
502 ((unsigned long)end & PAGE_MASK)) 597 ((unsigned long)end & PAGE_MASK))
503 ret = __put_user(0, end); 598 ret = __put_user(0, end);
504 599
505 return ret; 600 return ret;
506 } 601 }
507 602
508 static inline int fault_in_multipages_readable(const char __user *uaddr, 603 static inline int fault_in_multipages_readable(const char __user *uaddr,
509 int size) 604 int size)
510 { 605 {
511 volatile char c; 606 volatile char c;
512 int ret = 0; 607 int ret = 0;
513 const char __user *end = uaddr + size - 1; 608 const char __user *end = uaddr + size - 1;
514 609
515 if (unlikely(size == 0)) 610 if (unlikely(size == 0))
516 return ret; 611 return ret;
517 612
518 while (uaddr <= end) { 613 while (uaddr <= end) {
519 ret = __get_user(c, uaddr); 614 ret = __get_user(c, uaddr);
520 if (ret != 0) 615 if (ret != 0)
521 return ret; 616 return ret;
522 uaddr += PAGE_SIZE; 617 uaddr += PAGE_SIZE;
523 } 618 }
524 619
525 /* Check whether the range spilled into the next page. */ 620 /* Check whether the range spilled into the next page. */
526 if (((unsigned long)uaddr & PAGE_MASK) == 621 if (((unsigned long)uaddr & PAGE_MASK) ==
527 ((unsigned long)end & PAGE_MASK)) { 622 ((unsigned long)end & PAGE_MASK)) {
528 ret = __get_user(c, end); 623 ret = __get_user(c, end);
529 (void)c; 624 (void)c;
530 } 625 }
531 626
532 return ret; 627 return ret;
533 } 628 }
534 629
535 int add_to_page_cache_locked(struct page *page, struct address_space *mapping, 630 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
536 pgoff_t index, gfp_t gfp_mask); 631 pgoff_t index, gfp_t gfp_mask);
537 int add_to_page_cache_lru(struct page *page, struct address_space *mapping, 632 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
538 pgoff_t index, gfp_t gfp_mask); 633 pgoff_t index, gfp_t gfp_mask);
539 extern void delete_from_page_cache(struct page *page); 634 extern void delete_from_page_cache(struct page *page);
540 extern void __delete_from_page_cache(struct page *page); 635 extern void __delete_from_page_cache(struct page *page);
541 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask); 636 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
542 637
543 /* 638 /*
544 * Like add_to_page_cache_locked, but used to add newly allocated pages: 639 * Like add_to_page_cache_locked, but used to add newly allocated pages:
545 * the page is new, so we can just run __set_page_locked() against it. 640 * the page is new, so we can just run __set_page_locked() against it.
546 */ 641 */
547 static inline int add_to_page_cache(struct page *page, 642 static inline int add_to_page_cache(struct page *page,
548 struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask) 643 struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
549 { 644 {
550 int error; 645 int error;
551 646
552 __set_page_locked(page); 647 __set_page_locked(page);
553 error = add_to_page_cache_locked(page, mapping, offset, gfp_mask); 648 error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
554 if (unlikely(error)) 649 if (unlikely(error))
555 __clear_page_locked(page); 650 __clear_page_locked(page);
556 return error; 651 return error;
557 } 652 }
include/linux/swap.h
1 #ifndef _LINUX_SWAP_H 1 #ifndef _LINUX_SWAP_H
2 #define _LINUX_SWAP_H 2 #define _LINUX_SWAP_H
3 3
4 #include <linux/spinlock.h> 4 #include <linux/spinlock.h>
5 #include <linux/linkage.h> 5 #include <linux/linkage.h>
6 #include <linux/mmzone.h> 6 #include <linux/mmzone.h>
7 #include <linux/list.h> 7 #include <linux/list.h>
8 #include <linux/memcontrol.h> 8 #include <linux/memcontrol.h>
9 #include <linux/sched.h> 9 #include <linux/sched.h>
10 #include <linux/node.h> 10 #include <linux/node.h>
11 #include <linux/fs.h> 11 #include <linux/fs.h>
12 #include <linux/atomic.h> 12 #include <linux/atomic.h>
13 #include <linux/page-flags.h> 13 #include <linux/page-flags.h>
14 #include <asm/page.h> 14 #include <asm/page.h>
15 15
16 struct notifier_block; 16 struct notifier_block;
17 17
18 struct bio; 18 struct bio;
19 19
20 #define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */ 20 #define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */
21 #define SWAP_FLAG_PRIO_MASK 0x7fff 21 #define SWAP_FLAG_PRIO_MASK 0x7fff
22 #define SWAP_FLAG_PRIO_SHIFT 0 22 #define SWAP_FLAG_PRIO_SHIFT 0
23 #define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap */ 23 #define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap */
24 #define SWAP_FLAG_DISCARD_ONCE 0x20000 /* discard swap area at swapon-time */ 24 #define SWAP_FLAG_DISCARD_ONCE 0x20000 /* discard swap area at swapon-time */
25 #define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */ 25 #define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */
26 26
27 #define SWAP_FLAGS_VALID (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \ 27 #define SWAP_FLAGS_VALID (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
28 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \ 28 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
29 SWAP_FLAG_DISCARD_PAGES) 29 SWAP_FLAG_DISCARD_PAGES)
30 30
31 static inline int current_is_kswapd(void) 31 static inline int current_is_kswapd(void)
32 { 32 {
33 return current->flags & PF_KSWAPD; 33 return current->flags & PF_KSWAPD;
34 } 34 }
35 35
36 /* 36 /*
37 * MAX_SWAPFILES defines the maximum number of swaptypes: things which can 37 * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
38 * be swapped to. The swap type and the offset into that swap type are 38 * be swapped to. The swap type and the offset into that swap type are
39 * encoded into pte's and into pgoff_t's in the swapcache. Using five bits 39 * encoded into pte's and into pgoff_t's in the swapcache. Using five bits
40 * for the type means that the maximum number of swapcache pages is 27 bits 40 * for the type means that the maximum number of swapcache pages is 27 bits
41 * on 32-bit-pgoff_t architectures. And that assumes that the architecture packs 41 * on 32-bit-pgoff_t architectures. And that assumes that the architecture packs
42 * the type/offset into the pte as 5/27 as well. 42 * the type/offset into the pte as 5/27 as well.
43 */ 43 */
44 #define MAX_SWAPFILES_SHIFT 5 44 #define MAX_SWAPFILES_SHIFT 5
45 45
46 /* 46 /*
47 * Use some of the swap files numbers for other purposes. This 47 * Use some of the swap files numbers for other purposes. This
48 * is a convenient way to hook into the VM to trigger special 48 * is a convenient way to hook into the VM to trigger special
49 * actions on faults. 49 * actions on faults.
50 */ 50 */
51 51
52 /* 52 /*
53 * NUMA node memory migration support 53 * NUMA node memory migration support
54 */ 54 */
55 #ifdef CONFIG_MIGRATION 55 #ifdef CONFIG_MIGRATION
56 #define SWP_MIGRATION_NUM 2 56 #define SWP_MIGRATION_NUM 2
57 #define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) 57 #define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM)
58 #define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) 58 #define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
59 #else 59 #else
60 #define SWP_MIGRATION_NUM 0 60 #define SWP_MIGRATION_NUM 0
61 #endif 61 #endif
62 62
63 /* 63 /*
64 * Handling of hardware poisoned pages with memory corruption. 64 * Handling of hardware poisoned pages with memory corruption.
65 */ 65 */
66 #ifdef CONFIG_MEMORY_FAILURE 66 #ifdef CONFIG_MEMORY_FAILURE
67 #define SWP_HWPOISON_NUM 1 67 #define SWP_HWPOISON_NUM 1
68 #define SWP_HWPOISON MAX_SWAPFILES 68 #define SWP_HWPOISON MAX_SWAPFILES
69 #else 69 #else
70 #define SWP_HWPOISON_NUM 0 70 #define SWP_HWPOISON_NUM 0
71 #endif 71 #endif
72 72
73 #define MAX_SWAPFILES \ 73 #define MAX_SWAPFILES \
74 ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM) 74 ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
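Plugging the definitions above together, and assuming a configuration with both CONFIG_MIGRATION and CONFIG_MEMORY_FAILURE enabled:

/*
 *	1 << MAX_SWAPFILES_SHIFT	= 32	encodable swap "types"
 *	- SWP_MIGRATION_NUM		-  2	(read + write migration entries)
 *	- SWP_HWPOISON_NUM		-  1	(hwpoison entry)
 *	---------------------------------------
 *	MAX_SWAPFILES			= 29	usable swap areas
 *
 * With both options disabled, all 32 types are available for swap areas.
 */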
75 75
76 /* 76 /*
77 * Magic header for a swap area. The first part of the union is 77 * Magic header for a swap area. The first part of the union is
78 * what the swap magic looks like for the old (limited to 128MB) 78 * what the swap magic looks like for the old (limited to 128MB)
79 * swap area format, the second part of the union adds - in the 79 * swap area format, the second part of the union adds - in the
80 * old reserved area - some extra information. Note that the first 80 * old reserved area - some extra information. Note that the first
81 * kilobyte is reserved for boot loader or disk label stuff... 81 * kilobyte is reserved for boot loader or disk label stuff...
82 * 82 *
83 * Having the magic at the end of the PAGE_SIZE makes detecting swap 83 * Having the magic at the end of the PAGE_SIZE makes detecting swap
84 * areas somewhat tricky on machines that support multiple page sizes. 84 * areas somewhat tricky on machines that support multiple page sizes.
85 * For 2.5 we'll probably want to move the magic to just beyond the 85 * For 2.5 we'll probably want to move the magic to just beyond the
86 * bootbits... 86 * bootbits...
87 */ 87 */
88 union swap_header { 88 union swap_header {
89 struct { 89 struct {
90 char reserved[PAGE_SIZE - 10]; 90 char reserved[PAGE_SIZE - 10];
91 char magic[10]; /* SWAP-SPACE or SWAPSPACE2 */ 91 char magic[10]; /* SWAP-SPACE or SWAPSPACE2 */
92 } magic; 92 } magic;
93 struct { 93 struct {
94 char bootbits[1024]; /* Space for disklabel etc. */ 94 char bootbits[1024]; /* Space for disklabel etc. */
95 __u32 version; 95 __u32 version;
96 __u32 last_page; 96 __u32 last_page;
97 __u32 nr_badpages; 97 __u32 nr_badpages;
98 unsigned char sws_uuid[16]; 98 unsigned char sws_uuid[16];
99 unsigned char sws_volume[16]; 99 unsigned char sws_volume[16];
100 __u32 padding[117]; 100 __u32 padding[117];
101 __u32 badpages[1]; 101 __u32 badpages[1];
102 } info; 102 } info;
103 }; 103 };
104 104
105 /* A swap entry has to fit into a "unsigned long", as 105 /* A swap entry has to fit into a "unsigned long", as
106 * the entry is hidden in the "index" field of the 106 * the entry is hidden in the "index" field of the
107 * swapper address space. 107 * swapper address space.
108 */ 108 */
109 typedef struct { 109 typedef struct {
110 unsigned long val; 110 unsigned long val;
111 } swp_entry_t; 111 } swp_entry_t;
112 112
113 /* 113 /*
114 * current->reclaim_state points to one of these when a task is running 114 * current->reclaim_state points to one of these when a task is running
115 * memory reclaim 115 * memory reclaim
116 */ 116 */
117 struct reclaim_state { 117 struct reclaim_state {
118 unsigned long reclaimed_slab; 118 unsigned long reclaimed_slab;
119 }; 119 };
120 120
121 #ifdef __KERNEL__ 121 #ifdef __KERNEL__
122 122
123 struct address_space; 123 struct address_space;
124 struct sysinfo; 124 struct sysinfo;
125 struct writeback_control; 125 struct writeback_control;
126 struct zone; 126 struct zone;
127 127
128 /* 128 /*
129 * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of 129 * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of
130 * disk blocks. A list of swap extents maps the entire swapfile. (Where the 130 * disk blocks. A list of swap extents maps the entire swapfile. (Where the
131 * term `swapfile' refers to either a blockdevice or an IS_REG file. Apart 131 * term `swapfile' refers to either a blockdevice or an IS_REG file. Apart
132 * from setup, they're handled identically. 132 * from setup, they're handled identically.
133 * 133 *
134 * We always assume that blocks are of size PAGE_SIZE. 134 * We always assume that blocks are of size PAGE_SIZE.
135 */ 135 */
136 struct swap_extent { 136 struct swap_extent {
137 struct list_head list; 137 struct list_head list;
138 pgoff_t start_page; 138 pgoff_t start_page;
139 pgoff_t nr_pages; 139 pgoff_t nr_pages;
140 sector_t start_block; 140 sector_t start_block;
141 }; 141 };
142 142
143 /* 143 /*
144 * Max bad pages in the new format.. 144 * Max bad pages in the new format..
145 */ 145 */
146 #define __swapoffset(x) ((unsigned long)&((union swap_header *)0)->x) 146 #define __swapoffset(x) ((unsigned long)&((union swap_header *)0)->x)
147 #define MAX_SWAP_BADPAGES \ 147 #define MAX_SWAP_BADPAGES \
148 ((__swapoffset(magic.magic) - __swapoffset(info.badpages)) / sizeof(int)) 148 ((__swapoffset(magic.magic) - __swapoffset(info.badpages)) / sizeof(int))
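For the common 4K PAGE_SIZE case, the macro above works out to the following (offsets taken from union swap_header earlier in this file; the figures scale with PAGE_SIZE):

/*
 *	__swapoffset(magic.magic)   = PAGE_SIZE - 10            = 4086
 *	__swapoffset(info.badpages) = 1024 + 3*4 + 2*16 + 117*4 = 1536
 *
 *	MAX_SWAP_BADPAGES = (4086 - 1536) / sizeof(int)
 *			  = 2550 / 4
 *			  = 637 bad pages recordable in the swap header
 */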
149 149
150 enum { 150 enum {
151 SWP_USED = (1 << 0), /* is slot in swap_info[] used? */ 151 SWP_USED = (1 << 0), /* is slot in swap_info[] used? */
152 SWP_WRITEOK = (1 << 1), /* ok to write to this swap? */ 152 SWP_WRITEOK = (1 << 1), /* ok to write to this swap? */
153 SWP_DISCARDABLE = (1 << 2), /* blkdev support discard */ 153 SWP_DISCARDABLE = (1 << 2), /* blkdev support discard */
154 SWP_DISCARDING = (1 << 3), /* now discarding a free cluster */ 154 SWP_DISCARDING = (1 << 3), /* now discarding a free cluster */
155 SWP_SOLIDSTATE = (1 << 4), /* blkdev seeks are cheap */ 155 SWP_SOLIDSTATE = (1 << 4), /* blkdev seeks are cheap */
156 SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */ 156 SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */
157 SWP_BLKDEV = (1 << 6), /* its a block device */ 157 SWP_BLKDEV = (1 << 6), /* its a block device */
158 SWP_FILE = (1 << 7), /* set after swap_activate success */ 158 SWP_FILE = (1 << 7), /* set after swap_activate success */
159 SWP_AREA_DISCARD = (1 << 8), /* single-time swap area discards */ 159 SWP_AREA_DISCARD = (1 << 8), /* single-time swap area discards */
160 SWP_PAGE_DISCARD = (1 << 9), /* freed swap page-cluster discards */ 160 SWP_PAGE_DISCARD = (1 << 9), /* freed swap page-cluster discards */
161 /* add others here before... */ 161 /* add others here before... */
162 SWP_SCANNING = (1 << 10), /* refcount in scan_swap_map */ 162 SWP_SCANNING = (1 << 10), /* refcount in scan_swap_map */
163 }; 163 };
164 164
165 #define SWAP_CLUSTER_MAX 32UL 165 #define SWAP_CLUSTER_MAX 32UL
166 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX 166 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
167 167
168 /* 168 /*
169 * Ratio between the present memory in the zone and the "gap" that 169 * Ratio between the present memory in the zone and the "gap" that
170 * we're allowing kswapd to shrink in addition to the per-zone high 170 * we're allowing kswapd to shrink in addition to the per-zone high
171 * wmark, even for zones that already have the high wmark satisfied, 171 * wmark, even for zones that already have the high wmark satisfied,
172 * in order to provide better per-zone lru behavior. We are ok to 172 * in order to provide better per-zone lru behavior. We are ok to
173 * spend not more than 1% of the memory for this zone balancing "gap". 173 * spend not more than 1% of the memory for this zone balancing "gap".
174 */ 174 */
175 #define KSWAPD_ZONE_BALANCE_GAP_RATIO 100 175 #define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
176 176
177 #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */ 177 #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */
178 #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */ 178 #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */
179 #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ 179 #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */
180 #define SWAP_CONT_MAX 0x7f /* Max count, in each swap_map continuation */ 180 #define SWAP_CONT_MAX 0x7f /* Max count, in each swap_map continuation */
181 #define COUNT_CONTINUED 0x80 /* See swap_map continuation for full count */ 181 #define COUNT_CONTINUED 0x80 /* See swap_map continuation for full count */
182 #define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs, in first swap_map */ 182 #define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs, in first swap_map */
183 183
184 /* 184 /*
185 * We use this to track usage of a cluster. A cluster is a block of swap disk 185 * We use this to track usage of a cluster. A cluster is a block of swap disk
186 * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All 186 * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
187 * free clusters are organized into a list. We fetch an entry from the list to 187 * free clusters are organized into a list. We fetch an entry from the list to
188 * get a free cluster. 188 * get a free cluster.
189 * 189 *
190 * The data field stores next cluster if the cluster is free or cluster usage 190 * The data field stores next cluster if the cluster is free or cluster usage
191 * counter otherwise. The flags field determines if a cluster is free. This is 191 * counter otherwise. The flags field determines if a cluster is free. This is
192 * protected by swap_info_struct.lock. 192 * protected by swap_info_struct.lock.
193 */ 193 */
194 struct swap_cluster_info { 194 struct swap_cluster_info {
195 unsigned int data:24; 195 unsigned int data:24;
196 unsigned int flags:8; 196 unsigned int flags:8;
197 }; 197 };
198 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ 198 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
199 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ 199 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
200 200
201 /* 201 /*
202 * We assign a cluster to each CPU, so each CPU can allocate swap entry from 202 * We assign a cluster to each CPU, so each CPU can allocate swap entry from
203 * its own cluster and swapout sequentially. The purpose is to optimize swapout 203 * its own cluster and swapout sequentially. The purpose is to optimize swapout
204 * throughput. 204 * throughput.
205 */ 205 */
206 struct percpu_cluster { 206 struct percpu_cluster {
207 struct swap_cluster_info index; /* Current cluster index */ 207 struct swap_cluster_info index; /* Current cluster index */
208 unsigned int next; /* Likely next allocation offset */ 208 unsigned int next; /* Likely next allocation offset */
209 }; 209 };
210 210
211 /* 211 /*
212 * The in-memory structure used to track swap areas. 212 * The in-memory structure used to track swap areas.
213 */ 213 */
214 struct swap_info_struct { 214 struct swap_info_struct {
215 unsigned long flags; /* SWP_USED etc: see above */ 215 unsigned long flags; /* SWP_USED etc: see above */
216 signed short prio; /* swap priority of this type */ 216 signed short prio; /* swap priority of this type */
217 struct plist_node list; /* entry in swap_active_head */ 217 struct plist_node list; /* entry in swap_active_head */
218 struct plist_node avail_list; /* entry in swap_avail_head */ 218 struct plist_node avail_list; /* entry in swap_avail_head */
219 signed char type; /* strange name for an index */ 219 signed char type; /* strange name for an index */
220 unsigned int max; /* extent of the swap_map */ 220 unsigned int max; /* extent of the swap_map */
221 unsigned char *swap_map; /* vmalloc'ed array of usage counts */ 221 unsigned char *swap_map; /* vmalloc'ed array of usage counts */
222 struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ 222 struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
223 struct swap_cluster_info free_cluster_head; /* free cluster list head */ 223 struct swap_cluster_info free_cluster_head; /* free cluster list head */
224 struct swap_cluster_info free_cluster_tail; /* free cluster list tail */ 224 struct swap_cluster_info free_cluster_tail; /* free cluster list tail */
225 unsigned int lowest_bit; /* index of first free in swap_map */ 225 unsigned int lowest_bit; /* index of first free in swap_map */
226 unsigned int highest_bit; /* index of last free in swap_map */ 226 unsigned int highest_bit; /* index of last free in swap_map */
227 unsigned int pages; /* total of usable pages of swap */ 227 unsigned int pages; /* total of usable pages of swap */
228 unsigned int inuse_pages; /* number of those currently in use */ 228 unsigned int inuse_pages; /* number of those currently in use */
229 unsigned int cluster_next; /* likely index for next allocation */ 229 unsigned int cluster_next; /* likely index for next allocation */
230 unsigned int cluster_nr; /* countdown to next cluster search */ 230 unsigned int cluster_nr; /* countdown to next cluster search */
231 struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */ 231 struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
232 struct swap_extent *curr_swap_extent; 232 struct swap_extent *curr_swap_extent;
233 struct swap_extent first_swap_extent; 233 struct swap_extent first_swap_extent;
234 struct block_device *bdev; /* swap device or bdev of swap file */ 234 struct block_device *bdev; /* swap device or bdev of swap file */
235 struct file *swap_file; /* seldom referenced */ 235 struct file *swap_file; /* seldom referenced */
236 unsigned int old_block_size; /* seldom referenced */ 236 unsigned int old_block_size; /* seldom referenced */
237 #ifdef CONFIG_FRONTSWAP 237 #ifdef CONFIG_FRONTSWAP
238 unsigned long *frontswap_map; /* frontswap in-use, one bit per page */ 238 unsigned long *frontswap_map; /* frontswap in-use, one bit per page */
239 atomic_t frontswap_pages; /* frontswap pages in-use counter */ 239 atomic_t frontswap_pages; /* frontswap pages in-use counter */
240 #endif 240 #endif
241 spinlock_t lock; /* 241 spinlock_t lock; /*
242 * protect map scan related fields like 242 * protect map scan related fields like
243 * swap_map, lowest_bit, highest_bit, 243 * swap_map, lowest_bit, highest_bit,
244 * inuse_pages, cluster_next, 244 * inuse_pages, cluster_next,
245 * cluster_nr, lowest_alloc, 245 * cluster_nr, lowest_alloc,
246 * highest_alloc, free/discard cluster 246 * highest_alloc, free/discard cluster
247 * list. other fields are only changed 247 * list. other fields are only changed
248 * at swapon/swapoff, so are protected 248 * at swapon/swapoff, so are protected
249 * by swap_lock. changing flags need 249 * by swap_lock. changing flags need
250 * hold this lock and swap_lock. If 250 * hold this lock and swap_lock. If
251 * both locks need hold, hold swap_lock 251 * both locks need hold, hold swap_lock
252 * first. 252 * first.
253 */ 253 */
254 struct work_struct discard_work; /* discard worker */ 254 struct work_struct discard_work; /* discard worker */
255 struct swap_cluster_info discard_cluster_head; /* list head of discard clusters */ 255 struct swap_cluster_info discard_cluster_head; /* list head of discard clusters */
256 struct swap_cluster_info discard_cluster_tail; /* list tail of discard clusters */ 256 struct swap_cluster_info discard_cluster_tail; /* list tail of discard clusters */
257 }; 257 };
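A minimal sketch of the lock ordering documented in the struct above (swap_lock is the global lock in mm/swapfile.c; the flag update shown is purely illustrative):

	/* Given struct swap_info_struct *p: */
	spin_lock(&swap_lock);		/* global swap_lock first ... */
	spin_lock(&p->lock);		/* ... then the per-area lock */
	p->flags |= SWP_WRITEOK;	/* changing flags needs both locks held */
	spin_unlock(&p->lock);
	spin_unlock(&swap_lock);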
258 258
259 /* linux/mm/page_alloc.c */ 259 /* linux/mm/page_alloc.c */
260 extern unsigned long totalram_pages; 260 extern unsigned long totalram_pages;
261 extern unsigned long totalreserve_pages; 261 extern unsigned long totalreserve_pages;
262 extern unsigned long dirty_balance_reserve; 262 extern unsigned long dirty_balance_reserve;
263 extern unsigned long nr_free_buffer_pages(void); 263 extern unsigned long nr_free_buffer_pages(void);
264 extern unsigned long nr_free_pagecache_pages(void); 264 extern unsigned long nr_free_pagecache_pages(void);
265 265
266 /* Definition of global_page_state not available yet */ 266 /* Definition of global_page_state not available yet */
267 #define nr_free_pages() global_page_state(NR_FREE_PAGES) 267 #define nr_free_pages() global_page_state(NR_FREE_PAGES)
268 268
269 269
270 /* linux/mm/swap.c */ 270 /* linux/mm/swap.c */
271 extern void lru_cache_add(struct page *); 271 extern void lru_cache_add(struct page *);
272 extern void lru_cache_add_anon(struct page *page); 272 extern void lru_cache_add_anon(struct page *page);
273 extern void lru_cache_add_file(struct page *page); 273 extern void lru_cache_add_file(struct page *page);
274 extern void lru_add_page_tail(struct page *page, struct page *page_tail, 274 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
275 struct lruvec *lruvec, struct list_head *head); 275 struct lruvec *lruvec, struct list_head *head);
276 extern void activate_page(struct page *); 276 extern void activate_page(struct page *);
277 extern void mark_page_accessed(struct page *); 277 extern void mark_page_accessed(struct page *);
278 extern void init_page_accessed(struct page *page);
278 extern void lru_add_drain(void); 279 extern void lru_add_drain(void);
279 extern void lru_add_drain_cpu(int cpu); 280 extern void lru_add_drain_cpu(int cpu);
280 extern void lru_add_drain_all(void); 281 extern void lru_add_drain_all(void);
281 extern void rotate_reclaimable_page(struct page *page); 282 extern void rotate_reclaimable_page(struct page *page);
282 extern void deactivate_page(struct page *page); 283 extern void deactivate_page(struct page *page);
283 extern void swap_setup(void); 284 extern void swap_setup(void);
284 285
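The init_page_accessed() declaration added a few lines above is meant to be called on a freshly allocated page before it becomes visible in the page cache, so its referenced state can be initialised without atomic bit operations; once the page is published, mark_page_accessed() must be used instead. A hedged sketch of such an allocation path (the wrapper function is illustrative, not code from this patch):

static struct page *example_alloc_cache_page(struct address_space *mapping,
					     pgoff_t index, gfp_t gfp)
{
	struct page *page = __page_cache_alloc(gfp);

	if (!page)
		return NULL;

	init_page_accessed(page);	/* non-atomic: page not yet visible */

	if (add_to_page_cache_lru(page, mapping, index, gfp)) {
		page_cache_release(page);
		return NULL;
	}
	return page;	/* locked, in the cache, already marked accessed */
}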
285 extern void add_page_to_unevictable_list(struct page *page); 286 extern void add_page_to_unevictable_list(struct page *page);
286 287
287 /* linux/mm/vmscan.c */ 288 /* linux/mm/vmscan.c */
288 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, 289 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
289 gfp_t gfp_mask, nodemask_t *mask); 290 gfp_t gfp_mask, nodemask_t *mask);
290 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); 291 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
291 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem, 292 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
292 gfp_t gfp_mask, bool noswap); 293 gfp_t gfp_mask, bool noswap);
293 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, 294 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
294 gfp_t gfp_mask, bool noswap, 295 gfp_t gfp_mask, bool noswap,
295 struct zone *zone, 296 struct zone *zone,
296 unsigned long *nr_scanned); 297 unsigned long *nr_scanned);
297 extern unsigned long shrink_all_memory(unsigned long nr_pages); 298 extern unsigned long shrink_all_memory(unsigned long nr_pages);
298 extern int vm_swappiness; 299 extern int vm_swappiness;
299 extern int remove_mapping(struct address_space *mapping, struct page *page); 300 extern int remove_mapping(struct address_space *mapping, struct page *page);
300 extern unsigned long vm_total_pages; 301 extern unsigned long vm_total_pages;
301 302
302 #ifdef CONFIG_NUMA 303 #ifdef CONFIG_NUMA
303 extern int zone_reclaim_mode; 304 extern int zone_reclaim_mode;
304 extern int sysctl_min_unmapped_ratio; 305 extern int sysctl_min_unmapped_ratio;
305 extern int sysctl_min_slab_ratio; 306 extern int sysctl_min_slab_ratio;
306 extern int zone_reclaim(struct zone *, gfp_t, unsigned int); 307 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
307 #else 308 #else
308 #define zone_reclaim_mode 0 309 #define zone_reclaim_mode 0
309 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) 310 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
310 { 311 {
311 return 0; 312 return 0;
312 } 313 }
313 #endif 314 #endif
314 315
315 extern int page_evictable(struct page *page); 316 extern int page_evictable(struct page *page);
316 extern void check_move_unevictable_pages(struct page **, int nr_pages); 317 extern void check_move_unevictable_pages(struct page **, int nr_pages);
317 318
318 extern unsigned long scan_unevictable_pages; 319 extern unsigned long scan_unevictable_pages;
319 extern int scan_unevictable_handler(struct ctl_table *, int, 320 extern int scan_unevictable_handler(struct ctl_table *, int,
320 void __user *, size_t *, loff_t *); 321 void __user *, size_t *, loff_t *);
321 #ifdef CONFIG_NUMA 322 #ifdef CONFIG_NUMA
322 extern int scan_unevictable_register_node(struct node *node); 323 extern int scan_unevictable_register_node(struct node *node);
323 extern void scan_unevictable_unregister_node(struct node *node); 324 extern void scan_unevictable_unregister_node(struct node *node);
324 #else 325 #else
325 static inline int scan_unevictable_register_node(struct node *node) 326 static inline int scan_unevictable_register_node(struct node *node)
326 { 327 {
327 return 0; 328 return 0;
328 } 329 }
329 static inline void scan_unevictable_unregister_node(struct node *node) 330 static inline void scan_unevictable_unregister_node(struct node *node)
330 { 331 {
331 } 332 }
332 #endif 333 #endif
333 334
334 extern int kswapd_run(int nid); 335 extern int kswapd_run(int nid);
335 extern void kswapd_stop(int nid); 336 extern void kswapd_stop(int nid);
336 #ifdef CONFIG_MEMCG 337 #ifdef CONFIG_MEMCG
337 extern int mem_cgroup_swappiness(struct mem_cgroup *mem); 338 extern int mem_cgroup_swappiness(struct mem_cgroup *mem);
338 #else 339 #else
339 static inline int mem_cgroup_swappiness(struct mem_cgroup *mem) 340 static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
340 { 341 {
341 return vm_swappiness; 342 return vm_swappiness;
342 } 343 }
343 #endif 344 #endif
344 #ifdef CONFIG_MEMCG_SWAP 345 #ifdef CONFIG_MEMCG_SWAP
345 extern void mem_cgroup_uncharge_swap(swp_entry_t ent); 346 extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
346 #else 347 #else
347 static inline void mem_cgroup_uncharge_swap(swp_entry_t ent) 348 static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
348 { 349 {
349 } 350 }
350 #endif 351 #endif
351 #ifdef CONFIG_SWAP 352 #ifdef CONFIG_SWAP
352 /* linux/mm/page_io.c */ 353 /* linux/mm/page_io.c */
353 extern int swap_readpage(struct page *); 354 extern int swap_readpage(struct page *);
354 extern int swap_writepage(struct page *page, struct writeback_control *wbc); 355 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
355 extern void end_swap_bio_write(struct bio *bio, int err); 356 extern void end_swap_bio_write(struct bio *bio, int err);
356 extern int __swap_writepage(struct page *page, struct writeback_control *wbc, 357 extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
357 void (*end_write_func)(struct bio *, int)); 358 void (*end_write_func)(struct bio *, int));
358 extern int swap_set_page_dirty(struct page *page); 359 extern int swap_set_page_dirty(struct page *page);
359 extern void end_swap_bio_read(struct bio *bio, int err); 360 extern void end_swap_bio_read(struct bio *bio, int err);
360 361
361 int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, 362 int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
362 unsigned long nr_pages, sector_t start_block); 363 unsigned long nr_pages, sector_t start_block);
363 int generic_swapfile_activate(struct swap_info_struct *, struct file *, 364 int generic_swapfile_activate(struct swap_info_struct *, struct file *,
364 sector_t *); 365 sector_t *);
365 366
366 /* linux/mm/swap_state.c */ 367 /* linux/mm/swap_state.c */
367 extern struct address_space swapper_spaces[]; 368 extern struct address_space swapper_spaces[];
368 #define swap_address_space(entry) (&swapper_spaces[swp_type(entry)]) 369 #define swap_address_space(entry) (&swapper_spaces[swp_type(entry)])
369 extern unsigned long total_swapcache_pages(void); 370 extern unsigned long total_swapcache_pages(void);
370 extern void show_swap_cache_info(void); 371 extern void show_swap_cache_info(void);
371 extern int add_to_swap(struct page *, struct list_head *list); 372 extern int add_to_swap(struct page *, struct list_head *list);
372 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); 373 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
373 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry); 374 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
374 extern void __delete_from_swap_cache(struct page *); 375 extern void __delete_from_swap_cache(struct page *);
375 extern void delete_from_swap_cache(struct page *); 376 extern void delete_from_swap_cache(struct page *);
376 extern void free_page_and_swap_cache(struct page *); 377 extern void free_page_and_swap_cache(struct page *);
377 extern void free_pages_and_swap_cache(struct page **, int); 378 extern void free_pages_and_swap_cache(struct page **, int);
378 extern struct page *lookup_swap_cache(swp_entry_t); 379 extern struct page *lookup_swap_cache(swp_entry_t);
379 extern struct page *read_swap_cache_async(swp_entry_t, gfp_t, 380 extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
380 struct vm_area_struct *vma, unsigned long addr); 381 struct vm_area_struct *vma, unsigned long addr);
381 extern struct page *swapin_readahead(swp_entry_t, gfp_t, 382 extern struct page *swapin_readahead(swp_entry_t, gfp_t,
382 struct vm_area_struct *vma, unsigned long addr); 383 struct vm_area_struct *vma, unsigned long addr);
383 384
384 /* linux/mm/swapfile.c */ 385 /* linux/mm/swapfile.c */
385 extern atomic_long_t nr_swap_pages; 386 extern atomic_long_t nr_swap_pages;
386 extern long total_swap_pages; 387 extern long total_swap_pages;
387 388
388 /* Swap 50% full? Release swapcache more aggressively.. */ 389 /* Swap 50% full? Release swapcache more aggressively.. */
389 static inline bool vm_swap_full(void) 390 static inline bool vm_swap_full(void)
390 { 391 {
391 return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages; 392 return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
392 } 393 }
393 394
394 static inline long get_nr_swap_pages(void) 395 static inline long get_nr_swap_pages(void)
395 { 396 {
396 return atomic_long_read(&nr_swap_pages); 397 return atomic_long_read(&nr_swap_pages);
397 } 398 }
398 399
399 extern void si_swapinfo(struct sysinfo *); 400 extern void si_swapinfo(struct sysinfo *);
400 extern swp_entry_t get_swap_page(void); 401 extern swp_entry_t get_swap_page(void);
401 extern swp_entry_t get_swap_page_of_type(int); 402 extern swp_entry_t get_swap_page_of_type(int);
402 extern int add_swap_count_continuation(swp_entry_t, gfp_t); 403 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
403 extern void swap_shmem_alloc(swp_entry_t); 404 extern void swap_shmem_alloc(swp_entry_t);
404 extern int swap_duplicate(swp_entry_t); 405 extern int swap_duplicate(swp_entry_t);
405 extern int swapcache_prepare(swp_entry_t); 406 extern int swapcache_prepare(swp_entry_t);
406 extern void swap_free(swp_entry_t); 407 extern void swap_free(swp_entry_t);
407 extern void swapcache_free(swp_entry_t, struct page *page); 408 extern void swapcache_free(swp_entry_t, struct page *page);
408 extern int free_swap_and_cache(swp_entry_t); 409 extern int free_swap_and_cache(swp_entry_t);
409 extern int swap_type_of(dev_t, sector_t, struct block_device **); 410 extern int swap_type_of(dev_t, sector_t, struct block_device **);
410 extern unsigned int count_swap_pages(int, int); 411 extern unsigned int count_swap_pages(int, int);
411 extern sector_t map_swap_page(struct page *, struct block_device **); 412 extern sector_t map_swap_page(struct page *, struct block_device **);
412 extern sector_t swapdev_block(int, pgoff_t); 413 extern sector_t swapdev_block(int, pgoff_t);
413 extern int page_swapcount(struct page *); 414 extern int page_swapcount(struct page *);
414 extern struct swap_info_struct *page_swap_info(struct page *); 415 extern struct swap_info_struct *page_swap_info(struct page *);
415 extern int reuse_swap_page(struct page *); 416 extern int reuse_swap_page(struct page *);
416 extern int try_to_free_swap(struct page *); 417 extern int try_to_free_swap(struct page *);
417 struct backing_dev_info; 418 struct backing_dev_info;
418 419
419 #ifdef CONFIG_MEMCG 420 #ifdef CONFIG_MEMCG
420 extern void 421 extern void
421 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout); 422 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout);
422 #else 423 #else
423 static inline void 424 static inline void
424 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout) 425 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
425 { 426 {
426 } 427 }
427 #endif 428 #endif
428 429
429 #else /* CONFIG_SWAP */ 430 #else /* CONFIG_SWAP */
430 431
431 #define swap_address_space(entry) (NULL) 432 #define swap_address_space(entry) (NULL)
432 #define get_nr_swap_pages() 0L 433 #define get_nr_swap_pages() 0L
433 #define total_swap_pages 0L 434 #define total_swap_pages 0L
434 #define total_swapcache_pages() 0UL 435 #define total_swapcache_pages() 0UL
435 #define vm_swap_full() 0 436 #define vm_swap_full() 0
436 437
437 #define si_swapinfo(val) \ 438 #define si_swapinfo(val) \
438 do { (val)->freeswap = (val)->totalswap = 0; } while (0) 439 do { (val)->freeswap = (val)->totalswap = 0; } while (0)
439 /* only sparc can not include linux/pagemap.h in this file 440 /* only sparc can not include linux/pagemap.h in this file
440 * so leave page_cache_release and release_pages undeclared... */ 441 * so leave page_cache_release and release_pages undeclared... */
441 #define free_page_and_swap_cache(page) \ 442 #define free_page_and_swap_cache(page) \
442 page_cache_release(page) 443 page_cache_release(page)
443 #define free_pages_and_swap_cache(pages, nr) \ 444 #define free_pages_and_swap_cache(pages, nr) \
444 release_pages((pages), (nr), false); 445 release_pages((pages), (nr), false);
445 446
446 static inline void show_swap_cache_info(void) 447 static inline void show_swap_cache_info(void)
447 { 448 {
448 } 449 }
449 450
450 #define free_swap_and_cache(swp) is_migration_entry(swp) 451 #define free_swap_and_cache(swp) is_migration_entry(swp)
451 #define swapcache_prepare(swp) is_migration_entry(swp) 452 #define swapcache_prepare(swp) is_migration_entry(swp)
452 453
453 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) 454 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
454 { 455 {
455 return 0; 456 return 0;
456 } 457 }
457 458
458 static inline void swap_shmem_alloc(swp_entry_t swp) 459 static inline void swap_shmem_alloc(swp_entry_t swp)
459 { 460 {
460 } 461 }
461 462
462 static inline int swap_duplicate(swp_entry_t swp) 463 static inline int swap_duplicate(swp_entry_t swp)
463 { 464 {
464 return 0; 465 return 0;
465 } 466 }
466 467
467 static inline void swap_free(swp_entry_t swp) 468 static inline void swap_free(swp_entry_t swp)
468 { 469 {
469 } 470 }
470 471
471 static inline void swapcache_free(swp_entry_t swp, struct page *page) 472 static inline void swapcache_free(swp_entry_t swp, struct page *page)
472 { 473 {
473 } 474 }
474 475
475 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask, 476 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
476 struct vm_area_struct *vma, unsigned long addr) 477 struct vm_area_struct *vma, unsigned long addr)
477 { 478 {
478 return NULL; 479 return NULL;
479 } 480 }
480 481
481 static inline int swap_writepage(struct page *p, struct writeback_control *wbc) 482 static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
482 { 483 {
483 return 0; 484 return 0;
484 } 485 }
485 486
486 static inline struct page *lookup_swap_cache(swp_entry_t swp) 487 static inline struct page *lookup_swap_cache(swp_entry_t swp)
487 { 488 {
488 return NULL; 489 return NULL;
489 } 490 }
490 491
491 static inline int add_to_swap(struct page *page, struct list_head *list) 492 static inline int add_to_swap(struct page *page, struct list_head *list)
492 { 493 {
493 return 0; 494 return 0;
494 } 495 }
495 496
496 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry, 497 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
497 gfp_t gfp_mask) 498 gfp_t gfp_mask)
498 { 499 {
499 return -1; 500 return -1;
500 } 501 }
501 502
502 static inline void __delete_from_swap_cache(struct page *page) 503 static inline void __delete_from_swap_cache(struct page *page)
503 { 504 {
504 } 505 }
505 506
506 static inline void delete_from_swap_cache(struct page *page) 507 static inline void delete_from_swap_cache(struct page *page)
507 { 508 {
508 } 509 }
509 510
510 static inline int page_swapcount(struct page *page) 511 static inline int page_swapcount(struct page *page)
511 { 512 {
512 return 0; 513 return 0;
513 } 514 }
514 515
515 #define reuse_swap_page(page) (page_mapcount(page) == 1) 516 #define reuse_swap_page(page) (page_mapcount(page) == 1)
516 517
517 static inline int try_to_free_swap(struct page *page) 518 static inline int try_to_free_swap(struct page *page)
518 { 519 {
519 return 0; 520 return 0;
520 } 521 }
521 522
522 static inline swp_entry_t get_swap_page(void) 523 static inline swp_entry_t get_swap_page(void)
523 { 524 {
524 swp_entry_t entry; 525 swp_entry_t entry;
525 entry.val = 0; 526 entry.val = 0;
526 return entry; 527 return entry;
527 } 528 }
528 529
529 static inline void 530 static inline void
530 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent) 531 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
531 { 532 {
532 } 533 }
533 534
534 #endif /* CONFIG_SWAP */ 535 #endif /* CONFIG_SWAP */
535 #endif /* __KERNEL__*/ 536 #endif /* __KERNEL__*/
536 #endif /* _LINUX_SWAP_H */ 537 #endif /* _LINUX_SWAP_H */
537 538
mm/filemap.c
1 /* 1 /*
2 * linux/mm/filemap.c 2 * linux/mm/filemap.c
3 * 3 *
4 * Copyright (C) 1994-1999 Linus Torvalds 4 * Copyright (C) 1994-1999 Linus Torvalds
5 */ 5 */
6 6
7 /* 7 /*
8 * This file handles the generic file mmap semantics used by 8 * This file handles the generic file mmap semantics used by
9 * most "normal" filesystems (but you don't /have/ to use this: 9 * most "normal" filesystems (but you don't /have/ to use this:
10 * the NFS filesystem used to do this differently, for example) 10 * the NFS filesystem used to do this differently, for example)
11 */ 11 */
12 #include <linux/export.h> 12 #include <linux/export.h>
13 #include <linux/compiler.h> 13 #include <linux/compiler.h>
14 #include <linux/fs.h> 14 #include <linux/fs.h>
15 #include <linux/uaccess.h> 15 #include <linux/uaccess.h>
16 #include <linux/aio.h> 16 #include <linux/aio.h>
17 #include <linux/capability.h> 17 #include <linux/capability.h>
18 #include <linux/kernel_stat.h> 18 #include <linux/kernel_stat.h>
19 #include <linux/gfp.h> 19 #include <linux/gfp.h>
20 #include <linux/mm.h> 20 #include <linux/mm.h>
21 #include <linux/swap.h> 21 #include <linux/swap.h>
22 #include <linux/mman.h> 22 #include <linux/mman.h>
23 #include <linux/pagemap.h> 23 #include <linux/pagemap.h>
24 #include <linux/file.h> 24 #include <linux/file.h>
25 #include <linux/uio.h> 25 #include <linux/uio.h>
26 #include <linux/hash.h> 26 #include <linux/hash.h>
27 #include <linux/writeback.h> 27 #include <linux/writeback.h>
28 #include <linux/backing-dev.h> 28 #include <linux/backing-dev.h>
29 #include <linux/pagevec.h> 29 #include <linux/pagevec.h>
30 #include <linux/blkdev.h> 30 #include <linux/blkdev.h>
31 #include <linux/security.h> 31 #include <linux/security.h>
32 #include <linux/cpuset.h> 32 #include <linux/cpuset.h>
33 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */ 33 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
34 #include <linux/memcontrol.h> 34 #include <linux/memcontrol.h>
35 #include <linux/cleancache.h> 35 #include <linux/cleancache.h>
36 #include "internal.h" 36 #include "internal.h"
37 37
38 #define CREATE_TRACE_POINTS 38 #define CREATE_TRACE_POINTS
39 #include <trace/events/filemap.h> 39 #include <trace/events/filemap.h>
40 40
41 /* 41 /*
42 * FIXME: remove all knowledge of the buffer layer from the core VM 42 * FIXME: remove all knowledge of the buffer layer from the core VM
43 */ 43 */
44 #include <linux/buffer_head.h> /* for try_to_free_buffers */ 44 #include <linux/buffer_head.h> /* for try_to_free_buffers */
45 45
46 #include <asm/mman.h> 46 #include <asm/mman.h>
47 47
48 /* 48 /*
49 * Shared mappings implemented 30.11.1994. It's not fully working yet, 49 * Shared mappings implemented 30.11.1994. It's not fully working yet,
50 * though. 50 * though.
51 * 51 *
52 * Shared mappings now work. 15.8.1995 Bruno. 52 * Shared mappings now work. 15.8.1995 Bruno.
53 * 53 *
54 * finished 'unifying' the page and buffer cache and SMP-threaded the 54 * finished 'unifying' the page and buffer cache and SMP-threaded the
55 * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com> 55 * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com>
56 * 56 *
57 * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de> 57 * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de>
58 */ 58 */
59 59
60 /* 60 /*
61 * Lock ordering: 61 * Lock ordering:
62 * 62 *
63 * ->i_mmap_mutex (truncate_pagecache) 63 * ->i_mmap_mutex (truncate_pagecache)
64 * ->private_lock (__free_pte->__set_page_dirty_buffers) 64 * ->private_lock (__free_pte->__set_page_dirty_buffers)
65 * ->swap_lock (exclusive_swap_page, others) 65 * ->swap_lock (exclusive_swap_page, others)
 *  ->mapping->tree_lock
 *
 *  ->i_mutex
 *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
 *
 *  ->mmap_sem
 *    ->i_mmap_mutex
 *      ->page_table_lock or pte_lock	(various, mainly in memory.c)
 *        ->mapping->tree_lock	(arch-dependent flush_dcache_mmap_lock)
 *
 *  ->mmap_sem
 *    ->lock_page		(access_process_vm)
 *
 *  ->i_mutex			(generic_file_buffered_write)
 *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
 *
 *  bdi->wb.list_lock
 *    sb_lock			(fs/fs-writeback.c)
 *    ->mapping->tree_lock	(__sync_single_inode)
 *
 *  ->i_mmap_mutex
 *    ->anon_vma.lock		(vma_adjust)
 *
 *  ->anon_vma.lock
 *    ->page_table_lock or pte_lock	(anon_vma_prepare and various)
 *
 *  ->page_table_lock or pte_lock
 *    ->swap_lock		(try_to_unmap_one)
 *    ->private_lock		(try_to_unmap_one)
 *    ->tree_lock		(try_to_unmap_one)
 *    ->zone.lru_lock		(follow_page->mark_page_accessed)
 *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
 *    ->private_lock		(page_remove_rmap->set_page_dirty)
 *    ->tree_lock		(page_remove_rmap->set_page_dirty)
 *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
 *    ->inode->i_lock		(page_remove_rmap->set_page_dirty)
 *    bdi.wb->list_lock		(zap_pte_range->set_page_dirty)
 *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
 *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
 *
 *  ->i_mmap_mutex
 *    ->tasklist_lock		(memory_failure, collect_procs_ao)
 */

/*
 * Delete a page from the page cache and free it. Caller has to make
 * sure the page is locked and that nobody else uses it - or that usage
 * is safe. The caller must hold the mapping's tree_lock.
 */
void __delete_from_page_cache(struct page *page)
{
	struct address_space *mapping = page->mapping;

	trace_mm_filemap_delete_from_page_cache(page);
	/*
	 * if we're uptodate, flush out into the cleancache, otherwise
	 * invalidate any existing cleancache entries. We can't leave
	 * stale data around in the cleancache once our page is gone
	 */
	if (PageUptodate(page) && PageMappedToDisk(page))
		cleancache_put_page(page);
	else
		cleancache_invalidate_page(mapping, page);

	radix_tree_delete(&mapping->page_tree, page->index);
	page->mapping = NULL;
	/* Leave page->index set: truncation lookup relies upon it */
	mapping->nrpages--;
	__dec_zone_page_state(page, NR_FILE_PAGES);
	if (PageSwapBacked(page))
		__dec_zone_page_state(page, NR_SHMEM);
	BUG_ON(page_mapped(page));

	/*
	 * Some filesystems seem to re-dirty the page even after
	 * the VM has canceled the dirty bit (eg ext3 journaling).
	 *
	 * Fix it up by doing a final dirty accounting check after
	 * having removed the page entirely.
	 */
	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
		dec_zone_page_state(page, NR_FILE_DIRTY);
		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
	}
}

/**
 * delete_from_page_cache - delete page from page cache
 * @page: the page which the kernel is trying to remove from page cache
 *
 * This must be called only on pages that have been verified to be in the page
 * cache and locked. It will never put the page into the free list, the caller
 * has a reference on the page.
 */
void delete_from_page_cache(struct page *page)
{
	struct address_space *mapping = page->mapping;
	void (*freepage)(struct page *);

	BUG_ON(!PageLocked(page));

	freepage = mapping->a_ops->freepage;
	spin_lock_irq(&mapping->tree_lock);
	__delete_from_page_cache(page);
	spin_unlock_irq(&mapping->tree_lock);
	mem_cgroup_uncharge_cache_page(page);

	if (freepage)
		freepage(page);
	page_cache_release(page);
}
EXPORT_SYMBOL(delete_from_page_cache);

static int sleep_on_page(void *word)
{
	io_schedule();
	return 0;
}

static int sleep_on_page_killable(void *word)
{
	sleep_on_page(word);
	return fatal_signal_pending(current) ? -EINTR : 0;
}

static int filemap_check_errors(struct address_space *mapping)
{
	int ret = 0;
	/* Check for outstanding write errors */
	if (test_bit(AS_ENOSPC, &mapping->flags) &&
	    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
		ret = -ENOSPC;
	if (test_bit(AS_EIO, &mapping->flags) &&
	    test_and_clear_bit(AS_EIO, &mapping->flags))
		ret = -EIO;
	return ret;
}

/**
 * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
 * @mapping: address space structure to write
 * @start: offset in bytes where the range starts
 * @end: offset in bytes where the range ends (inclusive)
 * @sync_mode: enable synchronous operation
 *
 * Start writeback against all of a mapping's dirty pages that lie
 * within the byte offsets <start, end> inclusive.
 *
 * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
 * opposed to a regular memory cleansing writeback. The difference between
 * these two operations is that if a dirty page/buffer is encountered, it must
 * be waited upon, and not just skipped over.
 */
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
				loff_t end, int sync_mode)
{
	int ret;
	struct writeback_control wbc = {
		.sync_mode = sync_mode,
		.nr_to_write = LONG_MAX,
		.range_start = start,
		.range_end = end,
	};

	if (!mapping_cap_writeback_dirty(mapping))
		return 0;

	ret = do_writepages(mapping, &wbc);
	return ret;
}

static inline int __filemap_fdatawrite(struct address_space *mapping,
	int sync_mode)
{
	return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
}

int filemap_fdatawrite(struct address_space *mapping)
{
	return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
}
EXPORT_SYMBOL(filemap_fdatawrite);

int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
				loff_t end)
{
	return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
}
EXPORT_SYMBOL(filemap_fdatawrite_range);

/**
 * filemap_flush - mostly a non-blocking flush
 * @mapping: target address_space
 *
 * This is a mostly non-blocking flush. Not suitable for data-integrity
 * purposes - I/O may not be started against all dirty pages.
 */
int filemap_flush(struct address_space *mapping)
{
	return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
}
EXPORT_SYMBOL(filemap_flush);

/**
 * filemap_fdatawait_range - wait for writeback to complete
 * @mapping: address space structure to wait for
 * @start_byte: offset in bytes where the range starts
 * @end_byte: offset in bytes where the range ends (inclusive)
 *
 * Walk the list of under-writeback pages of the given address space
 * in the given range and wait for all of them.
 */
int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
			    loff_t end_byte)
{
	pgoff_t index = start_byte >> PAGE_CACHE_SHIFT;
	pgoff_t end = end_byte >> PAGE_CACHE_SHIFT;
	struct pagevec pvec;
	int nr_pages;
	int ret2, ret = 0;

	if (end_byte < start_byte)
		goto out;

	pagevec_init(&pvec, 0);
	while ((index <= end) &&
			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
			PAGECACHE_TAG_WRITEBACK,
			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
		unsigned i;

		for (i = 0; i < nr_pages; i++) {
			struct page *page = pvec.pages[i];

			/* until radix tree lookup accepts end_index */
			if (page->index > end)
				continue;

			wait_on_page_writeback(page);
			if (TestClearPageError(page))
				ret = -EIO;
		}
		pagevec_release(&pvec);
		cond_resched();
	}
out:
	ret2 = filemap_check_errors(mapping);
	if (!ret)
		ret = ret2;

	return ret;
}
EXPORT_SYMBOL(filemap_fdatawait_range);

/**
 * filemap_fdatawait - wait for all under-writeback pages to complete
 * @mapping: address space structure to wait for
 *
 * Walk the list of under-writeback pages of the given address space
 * and wait for all of them.
 */
int filemap_fdatawait(struct address_space *mapping)
{
	loff_t i_size = i_size_read(mapping->host);

	if (i_size == 0)
		return 0;

	return filemap_fdatawait_range(mapping, 0, i_size - 1);
}
EXPORT_SYMBOL(filemap_fdatawait);

int filemap_write_and_wait(struct address_space *mapping)
{
	int err = 0;

	if (mapping->nrpages) {
		err = filemap_fdatawrite(mapping);
		/*
		 * Even if the above returned error, the pages may be
		 * written partially (e.g. -ENOSPC), so we wait for it.
		 * But the -EIO is special case, it may indicate the worst
		 * thing (e.g. bug) happened, so we avoid waiting for it.
		 */
		if (err != -EIO) {
			int err2 = filemap_fdatawait(mapping);
			if (!err)
				err = err2;
		}
	} else {
		err = filemap_check_errors(mapping);
	}
	return err;
}
EXPORT_SYMBOL(filemap_write_and_wait);

/**
 * filemap_write_and_wait_range - write out & wait on a file range
 * @mapping: the address_space for the pages
 * @lstart: offset in bytes where the range starts
 * @lend: offset in bytes where the range ends (inclusive)
 *
 * Write out and wait upon file offsets lstart->lend, inclusive.
 *
 * Note that `lend' is inclusive (describes the last byte to be written) so
 * that this function can be used to write to the very end-of-file (end = -1).
 */
int filemap_write_and_wait_range(struct address_space *mapping,
				 loff_t lstart, loff_t lend)
{
	int err = 0;

	if (mapping->nrpages) {
		err = __filemap_fdatawrite_range(mapping, lstart, lend,
						 WB_SYNC_ALL);
		/* See comment of filemap_write_and_wait() */
		if (err != -EIO) {
			int err2 = filemap_fdatawait_range(mapping,
						lstart, lend);
			if (!err)
				err = err2;
		}
	} else {
		err = filemap_check_errors(mapping);
	}
	return err;
}
EXPORT_SYMBOL(filemap_write_and_wait_range);

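[Editorial aside, not part of this patch: filemap_write_and_wait_range() is the usual data-integrity entry point for filesystem ->fsync() methods. A minimal sketch of such a caller is below; example_fsync() is a made-up name and a real implementation would also write back metadata.]

	/* Illustrative sketch only -- not kernel source from this commit. */
	static int example_fsync(struct file *file, loff_t start, loff_t end,
				 int datasync)
	{
		/* Flush dirty pages in [start, end] and wait for the I/O. */
		return filemap_write_and_wait_range(file->f_mapping, start, end);
	}
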
/**
 * replace_page_cache_page - replace a pagecache page with a new one
 * @old: page to be replaced
 * @new: page to replace with
 * @gfp_mask: allocation mode
 *
 * This function replaces a page in the pagecache with a new one. On
 * success it acquires the pagecache reference for the new page and
 * drops it for the old page. Both the old and new pages must be
 * locked. This function does not add the new page to the LRU, the
 * caller must do that.
 *
 * The remove + add is atomic. The only way this function can fail is
 * memory allocation failure.
 */
int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
{
	int error;

	VM_BUG_ON(!PageLocked(old));
	VM_BUG_ON(!PageLocked(new));
	VM_BUG_ON(new->mapping);

	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
	if (!error) {
		struct address_space *mapping = old->mapping;
		void (*freepage)(struct page *);

		pgoff_t offset = old->index;
		freepage = mapping->a_ops->freepage;

		page_cache_get(new);
		new->mapping = mapping;
		new->index = offset;

		spin_lock_irq(&mapping->tree_lock);
		__delete_from_page_cache(old);
		error = radix_tree_insert(&mapping->page_tree, offset, new);
		BUG_ON(error);
		mapping->nrpages++;
		__inc_zone_page_state(new, NR_FILE_PAGES);
		if (PageSwapBacked(new))
			__inc_zone_page_state(new, NR_SHMEM);
		spin_unlock_irq(&mapping->tree_lock);
		/* mem_cgroup codes must not be called under tree_lock */
		mem_cgroup_replace_page_cache(old, new);
		radix_tree_preload_end();
		if (freepage)
			freepage(old);
		page_cache_release(old);
	}

	return error;
}
EXPORT_SYMBOL_GPL(replace_page_cache_page);

static int page_cache_tree_insert(struct address_space *mapping,
				  struct page *page)
{
	void **slot;
	int error;

	slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
	if (slot) {
		void *p;

		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
		if (!radix_tree_exceptional_entry(p))
			return -EEXIST;
		radix_tree_replace_slot(slot, page);
		mapping->nrpages++;
		return 0;
	}
	error = radix_tree_insert(&mapping->page_tree, page->index, page);
	if (!error)
		mapping->nrpages++;
	return error;
}

/**
 * add_to_page_cache_locked - add a locked page to the pagecache
 * @page: page to add
 * @mapping: the page's address_space
 * @offset: page index
 * @gfp_mask: page allocation mode
 *
 * This function is used to add a page to the pagecache. It must be locked.
 * This function does not add the page to the LRU. The caller must do that.
 */
int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
		pgoff_t offset, gfp_t gfp_mask)
{
	int error;

	VM_BUG_ON(!PageLocked(page));
	VM_BUG_ON(PageSwapBacked(page));

	error = mem_cgroup_cache_charge(page, current->mm,
					gfp_mask & GFP_RECLAIM_MASK);
	if (error)
		return error;

	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
	if (error) {
		mem_cgroup_uncharge_cache_page(page);
		return error;
	}

	page_cache_get(page);
	page->mapping = mapping;
	page->index = offset;

	spin_lock_irq(&mapping->tree_lock);
	error = page_cache_tree_insert(mapping, page);
	radix_tree_preload_end();
	if (unlikely(error))
		goto err_insert;
	__inc_zone_page_state(page, NR_FILE_PAGES);
	spin_unlock_irq(&mapping->tree_lock);
	trace_mm_filemap_add_to_page_cache(page);
	return 0;
err_insert:
	page->mapping = NULL;
	/* Leave page->index set: truncation relies upon it */
	spin_unlock_irq(&mapping->tree_lock);
	mem_cgroup_uncharge_cache_page(page);
	page_cache_release(page);
	return error;
}
EXPORT_SYMBOL(add_to_page_cache_locked);

int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
				pgoff_t offset, gfp_t gfp_mask)
{
	int ret;

	ret = add_to_page_cache(page, mapping, offset, gfp_mask);
	if (ret == 0)
		lru_cache_add_file(page);
	return ret;
}
EXPORT_SYMBOL_GPL(add_to_page_cache_lru);

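[Editorial aside, not part of this patch: a read-side caller typically pairs __page_cache_alloc() with add_to_page_cache_lru() and then issues the read, loosely as in the readahead path. The sketch below is illustrative; example_read_one() is a made-up name and the error handling is simplified.]

	/* Illustrative sketch only -- not kernel source from this commit. */
	static int example_read_one(struct file *filp,
				    struct address_space *mapping, pgoff_t index)
	{
		struct page *page;
		int err;

		/* Allocate a fresh page for the page cache. */
		page = __page_cache_alloc(mapping_gfp_mask(mapping));
		if (!page)
			return -ENOMEM;

		/* Insert it (locked) into the radix tree and onto the LRU. */
		err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
		if (!err)
			err = mapping->a_ops->readpage(filp, page);
		page_cache_release(page);
		return err;
	}
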
#ifdef CONFIG_NUMA
struct page *__page_cache_alloc(gfp_t gfp)
{
	int n;
	struct page *page;

	if (cpuset_do_page_mem_spread()) {
		unsigned int cpuset_mems_cookie;
		do {
			cpuset_mems_cookie = read_mems_allowed_begin();
			n = cpuset_mem_spread_node();
			page = alloc_pages_exact_node(n, gfp, 0);
		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));

		return page;
	}
	return alloc_pages(gfp, 0);
}
EXPORT_SYMBOL(__page_cache_alloc);
#endif

/*
 * In order to wait for pages to become available there must be
 * waitqueues associated with pages. By using a hash table of
 * waitqueues where the bucket discipline is to maintain all
 * waiters on the same queue and wake all when any of the pages
 * become available, and for the woken contexts to check to be
 * sure the appropriate page became available, this saves space
 * at a cost of "thundering herd" phenomena during rare hash
 * collisions.
 */
static wait_queue_head_t *page_waitqueue(struct page *page)
{
	const struct zone *zone = page_zone(page);

	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}

static inline void wake_up_page(struct page *page, int bit)
{
	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
}

void wait_on_page_bit(struct page *page, int bit_nr)
{
	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);

	if (test_bit(bit_nr, &page->flags))
		__wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
							TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(wait_on_page_bit);

int wait_on_page_bit_killable(struct page *page, int bit_nr)
{
	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);

	if (!test_bit(bit_nr, &page->flags))
		return 0;

	return __wait_on_bit(page_waitqueue(page), &wait,
			     sleep_on_page_killable, TASK_KILLABLE);
}

/**
 * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
 * @page: Page defining the wait queue of interest
 * @waiter: Waiter to add to the queue
 *
 * Add an arbitrary @waiter to the wait queue for the nominated @page.
 */
void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
{
	wait_queue_head_t *q = page_waitqueue(page);
	unsigned long flags;

	spin_lock_irqsave(&q->lock, flags);
	__add_wait_queue(q, waiter);
	spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL_GPL(add_page_wait_queue);

/**
 * unlock_page - unlock a locked page
 * @page: the page
 *
 * Unlocks the page and wakes up sleepers in ___wait_on_page_locked().
 * Also wakes sleepers in wait_on_page_writeback() because the wakeup
 * mechananism between PageLocked pages and PageWriteback pages is shared.
 * But that's OK - sleepers in wait_on_page_writeback() just go back to sleep.
 *
 * The mb is necessary to enforce ordering between the clear_bit and the read
 * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()).
 */
void unlock_page(struct page *page)
{
	VM_BUG_ON(!PageLocked(page));
	clear_bit_unlock(PG_locked, &page->flags);
	smp_mb__after_clear_bit();
	wake_up_page(page, PG_locked);
}
EXPORT_SYMBOL(unlock_page);

/**
 * end_page_writeback - end writeback against a page
 * @page: the page
 */
void end_page_writeback(struct page *page)
{
	if (TestClearPageReclaim(page))
		rotate_reclaimable_page(page);

	if (!test_clear_page_writeback(page))
		BUG();

	smp_mb__after_clear_bit();
	wake_up_page(page, PG_writeback);
}
EXPORT_SYMBOL(end_page_writeback);

/**
 * __lock_page - get a lock on the page, assuming we need to sleep to get it
 * @page: the page to lock
 */
void __lock_page(struct page *page)
{
	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
							TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(__lock_page);

int __lock_page_killable(struct page *page)
{
	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

	return __wait_on_bit_lock(page_waitqueue(page), &wait,
					sleep_on_page_killable, TASK_KILLABLE);
}
EXPORT_SYMBOL_GPL(__lock_page_killable);

int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
			 unsigned int flags)
{
	if (flags & FAULT_FLAG_ALLOW_RETRY) {
		/*
		 * CAUTION! In this case, mmap_sem is not released
		 * even though return 0.
		 */
		if (flags & FAULT_FLAG_RETRY_NOWAIT)
			return 0;

		up_read(&mm->mmap_sem);
		if (flags & FAULT_FLAG_KILLABLE)
			wait_on_page_locked_killable(page);
		else
			wait_on_page_locked(page);
		return 0;
	} else {
		if (flags & FAULT_FLAG_KILLABLE) {
			int ret;

			ret = __lock_page_killable(page);
			if (ret) {
				up_read(&mm->mmap_sem);
				return 0;
			}
		} else
			__lock_page(page);
		return 1;
	}
}

/**
 * page_cache_next_hole - find the next hole (not-present entry)
 * @mapping: mapping
 * @index: index
 * @max_scan: maximum range to search
 *
 * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the
 * lowest indexed hole.
 *
 * Returns: the index of the hole if found, otherwise returns an index
 * outside of the set specified (in which case 'return - index >=
 * max_scan' will be true). In rare cases of index wrap-around, 0 will
 * be returned.
 *
 * page_cache_next_hole may be called under rcu_read_lock. However,
 * like radix_tree_gang_lookup, this will not atomically search a
 * snapshot of the tree at a single point in time. For example, if a
 * hole is created at index 5, then subsequently a hole is created at
 * index 10, page_cache_next_hole covering both indexes may return 10
 * if called under rcu_read_lock.
 */
pgoff_t page_cache_next_hole(struct address_space *mapping,
			     pgoff_t index, unsigned long max_scan)
{
	unsigned long i;

	for (i = 0; i < max_scan; i++) {
		struct page *page;

		page = radix_tree_lookup(&mapping->page_tree, index);
		if (!page || radix_tree_exceptional_entry(page))
			break;
		index++;
		if (index == 0)
			break;
	}

	return index;
}
EXPORT_SYMBOL(page_cache_next_hole);

/**
 * page_cache_prev_hole - find the prev hole (not-present entry)
 * @mapping: mapping
 * @index: index
 * @max_scan: maximum range to search
 *
 * Search backwards in the range [max(index-max_scan+1, 0), index] for
 * the first hole.
 *
 * Returns: the index of the hole if found, otherwise returns an index
 * outside of the set specified (in which case 'index - return >=
 * max_scan' will be true). In rare cases of wrap-around, ULONG_MAX
 * will be returned.
 *
 * page_cache_prev_hole may be called under rcu_read_lock. However,
 * like radix_tree_gang_lookup, this will not atomically search a
 * snapshot of the tree at a single point in time. For example, if a
 * hole is created at index 10, then subsequently a hole is created at
 * index 5, page_cache_prev_hole covering both indexes may return 5 if
 * called under rcu_read_lock.
 */
pgoff_t page_cache_prev_hole(struct address_space *mapping,
			     pgoff_t index, unsigned long max_scan)
{
	unsigned long i;

	for (i = 0; i < max_scan; i++) {
		struct page *page;

		page = radix_tree_lookup(&mapping->page_tree, index);
		if (!page || radix_tree_exceptional_entry(page))
			break;
		index--;
		if (index == ULONG_MAX)
			break;
	}

	return index;
}
EXPORT_SYMBOL(page_cache_prev_hole);

/**
 * find_get_entry - find and get a page cache entry
 * @mapping: the address_space to search
 * @offset: the page cache index
 *
 * Looks up the page cache slot at @mapping & @offset. If there is a
 * page cache page, it is returned with an increased refcount.
 *
 * If the slot holds a shadow entry of a previously evicted page, it
 * is returned.
 *
 * Otherwise, %NULL is returned.
 */
struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
{
	void **pagep;
	struct page *page;

	rcu_read_lock();
repeat:
	page = NULL;
	pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
	if (pagep) {
		page = radix_tree_deref_slot(pagep);
		if (unlikely(!page))
			goto out;
		if (radix_tree_exception(page)) {
			if (radix_tree_deref_retry(page))
				goto repeat;
			/*
			 * Otherwise, shmem/tmpfs must be storing a swap entry
			 * here as an exceptional entry: so return it without
			 * attempting to raise page count.
			 */
			goto out;
		}
		if (!page_cache_get_speculative(page))
			goto repeat;

		/*
		 * Has the page moved?
		 * This is part of the lockless pagecache protocol. See
		 * include/linux/pagemap.h for details.
		 */
		if (unlikely(page != *pagep)) {
			page_cache_release(page);
			goto repeat;
		}
	}
out:
	rcu_read_unlock();

	return page;
}
EXPORT_SYMBOL(find_get_entry);

/**
- * find_get_page - find and get a page reference
- * @mapping: the address_space to search
- * @offset: the page index
- *
- * Looks up the page cache slot at @mapping & @offset. If there is a
- * page cache page, it is returned with an increased refcount.
- *
- * Otherwise, %NULL is returned.
- */
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
-{
-	struct page *page = find_get_entry(mapping, offset);
-
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
-	return page;
-}
-EXPORT_SYMBOL(find_get_page);
-
-/**
 * find_lock_entry - locate, pin and lock a page cache entry
 * @mapping: the address_space to search
 * @offset: the page cache index
 *
 * Looks up the page cache slot at @mapping & @offset. If there is a
 * page cache page, it is returned locked and with an increased
 * refcount.
 *
 * If the slot holds a shadow entry of a previously evicted page, it
 * is returned.
 *
 * Otherwise, %NULL is returned.
 *
 * find_lock_entry() may sleep.
 */
struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)
{
	struct page *page;

repeat:
	page = find_get_entry(mapping, offset);
	if (page && !radix_tree_exception(page)) {
		lock_page(page);
		/* Has the page been truncated? */
		if (unlikely(page->mapping != mapping)) {
			unlock_page(page);
			page_cache_release(page);
			goto repeat;
		}
		VM_BUG_ON(page->index != offset);
	}
	return page;
}
EXPORT_SYMBOL(find_lock_entry);

/**
- * find_lock_page - locate, pin and lock a pagecache page
+ * pagecache_get_page - find and get a page reference
 * @mapping: the address_space to search
 * @offset: the page index
+ * @fgp_flags: PCG flags
+ * @gfp_mask: gfp mask to use if a page is to be allocated
 *
- * Looks up the page cache slot at @mapping & @offset. If there is a
- * page cache page, it is returned locked and with an increased
- * refcount.
+ * Looks up the page cache slot at @mapping & @offset.
 *
- * Otherwise, %NULL is returned.
+ * PCG flags modify how the page is returned
 *
- * find_lock_page() may sleep.
- */
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
-{
-	struct page *page = find_lock_entry(mapping, offset);
-
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
-	return page;
-}
-EXPORT_SYMBOL(find_lock_page);
-
-/**
- * find_or_create_page - locate or add a pagecache page
- * @mapping: the page's address_space
- * @index: the page's index into the mapping
- * @gfp_mask: page allocation mode
+ * FGP_ACCESSED: the page will be marked accessed
+ * FGP_LOCK: Page is return locked
+ * FGP_CREAT: If page is not present then a new page is allocated using
+ *		@gfp_mask and added to the page cache and the VM's LRU
+ *		list. The page is returned locked and with an increased
+ *		refcount. Otherwise, %NULL is returned.
 *
- * Looks up the page cache slot at @mapping & @offset. If there is a
- * page cache page, it is returned locked and with an increased
- * refcount.
+ * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
+ * if the GFP flags specified for FGP_CREAT are atomic.
 *
- * If the page is not present, a new page is allocated using @gfp_mask
- * and added to the page cache and the VM's LRU list. The page is
- * returned locked and with an increased refcount.
- *
- * On memory exhaustion, %NULL is returned.
- *
- * find_or_create_page() may sleep, even if @gfp_flags specifies an
- * atomic allocation!
+ * If there is a page cache page, it is returned with an increased refcount.
 */
-struct page *find_or_create_page(struct address_space *mapping,
-		pgoff_t index, gfp_t gfp_mask)
+struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
+	int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask)
{
	struct page *page;
-	int err;
+
repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp_mask);
+	page = find_get_entry(mapping, offset);
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	if (!page)
+		goto no_page;
+
+	if (fgp_flags & FGP_LOCK) {
+		if (fgp_flags & FGP_NOWAIT) {
+			if (!trylock_page(page)) {
+				page_cache_release(page);
+				return NULL;
+			}
+		} else {
+			lock_page(page);
+		}
+
+		/* Has the page been truncated? */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
+		}
+		VM_BUG_ON(page->index != offset);
+	}
+
+	if (page && (fgp_flags & FGP_ACCESSED))
+		mark_page_accessed(page);
+
+no_page:
+	if (!page && (fgp_flags & FGP_CREAT)) {
+		int err;
+		if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
+			cache_gfp_mask |= __GFP_WRITE;
+		if (fgp_flags & FGP_NOFS) {
+			cache_gfp_mask &= ~__GFP_FS;
+			radix_gfp_mask &= ~__GFP_FS;
+		}
+
+		page = __page_cache_alloc(cache_gfp_mask);
		if (!page)
			return NULL;
-		/*
-		 * We want a regular kernel memory (not highmem or DMA etc)
-		 * allocation for the radix tree nodes, but we need to honour
-		 * the context-specific requirements the caller has asked for.
-		 * GFP_RECLAIM_MASK collects those requirements.
-		 */
-		err = add_to_page_cache_lru(page, mapping, index,
-			(gfp_mask & GFP_RECLAIM_MASK));
+
+		if (WARN_ON_ONCE(!(fgp_flags & FGP_LOCK)))
+			fgp_flags |= FGP_LOCK;
+
+		/* Init accessed so avoid atomic mark_page_accessed later */
+		if (fgp_flags & FGP_ACCESSED)
+			init_page_accessed(page);
+
+		err = add_to_page_cache_lru(page, mapping, offset, radix_gfp_mask);
		if (unlikely(err)) {
			page_cache_release(page);
			page = NULL;
			if (err == -EEXIST)
				goto repeat;
		}
	}
+
	return page;
}
-EXPORT_SYMBOL(find_or_create_page);
+EXPORT_SYMBOL(pagecache_get_page);

979 * find_get_entries - gang pagecache lookup 978 * find_get_entries - gang pagecache lookup
980 * @mapping: The address_space to search 979 * @mapping: The address_space to search
981 * @start: The starting page cache index 980 * @start: The starting page cache index
982 * @nr_entries: The maximum number of entries 981 * @nr_entries: The maximum number of entries
983 * @entries: Where the resulting entries are placed 982 * @entries: Where the resulting entries are placed
984 * @indices: The cache indices corresponding to the entries in @entries 983 * @indices: The cache indices corresponding to the entries in @entries
985 * 984 *
986 * find_get_entries() will search for and return a group of up to 985 * find_get_entries() will search for and return a group of up to
987 * @nr_entries entries in the mapping. The entries are placed at 986 * @nr_entries entries in the mapping. The entries are placed at
988 * @entries. find_get_entries() takes a reference against any actual 987 * @entries. find_get_entries() takes a reference against any actual
989 * pages it returns. 988 * pages it returns.
990 * 989 *
991 * The search returns a group of mapping-contiguous page cache entries 990 * The search returns a group of mapping-contiguous page cache entries
992 * with ascending indexes. There may be holes in the indices due to 991 * with ascending indexes. There may be holes in the indices due to
993 * not-present pages. 992 * not-present pages.
994 * 993 *
995 * Any shadow entries of evicted pages are included in the returned 994 * Any shadow entries of evicted pages are included in the returned
996 * array. 995 * array.
997 * 996 *
998 * find_get_entries() returns the number of pages and shadow entries 997 * find_get_entries() returns the number of pages and shadow entries
999 * which were found. 998 * which were found.
1000 */ 999 */
1001 unsigned find_get_entries(struct address_space *mapping, 1000 unsigned find_get_entries(struct address_space *mapping,
1002 pgoff_t start, unsigned int nr_entries, 1001 pgoff_t start, unsigned int nr_entries,
1003 struct page **entries, pgoff_t *indices) 1002 struct page **entries, pgoff_t *indices)
1004 { 1003 {
1005 void **slot; 1004 void **slot;
1006 unsigned int ret = 0; 1005 unsigned int ret = 0;
1007 struct radix_tree_iter iter; 1006 struct radix_tree_iter iter;
1008 1007
1009 if (!nr_entries) 1008 if (!nr_entries)
1010 return 0; 1009 return 0;
1011 1010
1012 rcu_read_lock(); 1011 rcu_read_lock();
1013 restart: 1012 restart:
1014 radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) { 1013 radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
1015 struct page *page; 1014 struct page *page;
1016 repeat: 1015 repeat:
1017 page = radix_tree_deref_slot(slot); 1016 page = radix_tree_deref_slot(slot);
1018 if (unlikely(!page)) 1017 if (unlikely(!page))
1019 continue; 1018 continue;
1020 if (radix_tree_exception(page)) { 1019 if (radix_tree_exception(page)) {
1021 if (radix_tree_deref_retry(page)) 1020 if (radix_tree_deref_retry(page))
1022 goto restart; 1021 goto restart;
1023 /* 1022 /*
1024 * Otherwise, we must be storing a swap entry 1023 * Otherwise, we must be storing a swap entry
1025 * here as an exceptional entry: so return it 1024 * here as an exceptional entry: so return it
1026 * without attempting to raise page count. 1025 * without attempting to raise page count.
1027 */ 1026 */
1028 goto export; 1027 goto export;
1029 } 1028 }
1030 if (!page_cache_get_speculative(page)) 1029 if (!page_cache_get_speculative(page))
1031 goto repeat; 1030 goto repeat;
1032 1031
1033 /* Has the page moved? */ 1032 /* Has the page moved? */
1034 if (unlikely(page != *slot)) { 1033 if (unlikely(page != *slot)) {
1035 page_cache_release(page); 1034 page_cache_release(page);
1036 goto repeat; 1035 goto repeat;
1037 } 1036 }
1038 export: 1037 export:
1039 indices[ret] = iter.index; 1038 indices[ret] = iter.index;
1040 entries[ret] = page; 1039 entries[ret] = page;
1041 if (++ret == nr_entries) 1040 if (++ret == nr_entries)
1042 break; 1041 break;
1043 } 1042 }
1044 rcu_read_unlock(); 1043 rcu_read_unlock();
1045 return ret; 1044 return ret;
1046 } 1045 }
1047 1046
1048 /** 1047 /**
1049 * find_get_pages - gang pagecache lookup 1048 * find_get_pages - gang pagecache lookup
1050 * @mapping: The address_space to search 1049 * @mapping: The address_space to search
1051 * @start: The starting page index 1050 * @start: The starting page index
1052 * @nr_pages: The maximum number of pages 1051 * @nr_pages: The maximum number of pages
1053 * @pages: Where the resulting pages are placed 1052 * @pages: Where the resulting pages are placed
1054 * 1053 *
1055 * find_get_pages() will search for and return a group of up to 1054 * find_get_pages() will search for and return a group of up to
1056 * @nr_pages pages in the mapping. The pages are placed at @pages. 1055 * @nr_pages pages in the mapping. The pages are placed at @pages.
1057 * find_get_pages() takes a reference against the returned pages. 1056 * find_get_pages() takes a reference against the returned pages.
1058 * 1057 *
1059 * The search returns a group of mapping-contiguous pages with ascending 1058 * The search returns a group of mapping-contiguous pages with ascending
1060 * indexes. There may be holes in the indices due to not-present pages. 1059 * indexes. There may be holes in the indices due to not-present pages.
1061 * 1060 *
1062 * find_get_pages() returns the number of pages which were found. 1061 * find_get_pages() returns the number of pages which were found.
1063 */ 1062 */
1064 unsigned find_get_pages(struct address_space *mapping, pgoff_t start, 1063 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
1065 unsigned int nr_pages, struct page **pages) 1064 unsigned int nr_pages, struct page **pages)
1066 { 1065 {
1067 struct radix_tree_iter iter; 1066 struct radix_tree_iter iter;
1068 void **slot; 1067 void **slot;
1069 unsigned ret = 0; 1068 unsigned ret = 0;
1070 1069
1071 if (unlikely(!nr_pages)) 1070 if (unlikely(!nr_pages))
1072 return 0; 1071 return 0;
1073 1072
1074 rcu_read_lock(); 1073 rcu_read_lock();
1075 restart: 1074 restart:
1076 radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) { 1075 radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
1077 struct page *page; 1076 struct page *page;
1078 repeat: 1077 repeat:
1079 page = radix_tree_deref_slot(slot); 1078 page = radix_tree_deref_slot(slot);
1080 if (unlikely(!page)) 1079 if (unlikely(!page))
1081 continue; 1080 continue;
1082 1081
1083 if (radix_tree_exception(page)) { 1082 if (radix_tree_exception(page)) {
1084 if (radix_tree_deref_retry(page)) { 1083 if (radix_tree_deref_retry(page)) {
1085 /* 1084 /*
1086 * Transient condition which can only trigger 1085 * Transient condition which can only trigger
1087 * when entry at index 0 moves out of or back 1086 * when entry at index 0 moves out of or back
1088 * to root: none yet gotten, safe to restart. 1087 * to root: none yet gotten, safe to restart.
1089 */ 1088 */
1090 WARN_ON(iter.index); 1089 WARN_ON(iter.index);
1091 goto restart; 1090 goto restart;
1092 } 1091 }
1093 /* 1092 /*
1094 * Otherwise, shmem/tmpfs must be storing a swap entry 1093 * Otherwise, shmem/tmpfs must be storing a swap entry
1095 * here as an exceptional entry: so skip over it - 1094 * here as an exceptional entry: so skip over it -
1096 * we only reach this from invalidate_mapping_pages(). 1095 * we only reach this from invalidate_mapping_pages().
1097 */ 1096 */
1098 continue; 1097 continue;
1099 } 1098 }
1100 1099
1101 if (!page_cache_get_speculative(page)) 1100 if (!page_cache_get_speculative(page))
1102 goto repeat; 1101 goto repeat;
1103 1102
1104 /* Has the page moved? */ 1103 /* Has the page moved? */
1105 if (unlikely(page != *slot)) { 1104 if (unlikely(page != *slot)) {
1106 page_cache_release(page); 1105 page_cache_release(page);
1107 goto repeat; 1106 goto repeat;
1108 } 1107 }
1109 1108
1110 pages[ret] = page; 1109 pages[ret] = page;
1111 if (++ret == nr_pages) 1110 if (++ret == nr_pages)
1112 break; 1111 break;
1113 } 1112 }
1114 1113
1115 rcu_read_unlock(); 1114 rcu_read_unlock();
1116 return ret; 1115 return ret;
1117 } 1116 }
1118 1117
1119 /** 1118 /**
1120 * find_get_pages_contig - gang contiguous pagecache lookup 1119 * find_get_pages_contig - gang contiguous pagecache lookup
1121 * @mapping: The address_space to search 1120 * @mapping: The address_space to search
1122 * @index: The starting page index 1121 * @index: The starting page index
1123 * @nr_pages: The maximum number of pages 1122 * @nr_pages: The maximum number of pages
1124 * @pages: Where the resulting pages are placed 1123 * @pages: Where the resulting pages are placed
1125 * 1124 *
1126 * find_get_pages_contig() works exactly like find_get_pages(), except 1125 * find_get_pages_contig() works exactly like find_get_pages(), except
1127 * that the returned number of pages are guaranteed to be contiguous. 1126 * that the returned number of pages are guaranteed to be contiguous.
1128 * 1127 *
1129 * find_get_pages_contig() returns the number of pages which were found. 1128 * find_get_pages_contig() returns the number of pages which were found.
1130 */ 1129 */
1131 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index, 1130 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
1132 unsigned int nr_pages, struct page **pages) 1131 unsigned int nr_pages, struct page **pages)
1133 { 1132 {
1134 struct radix_tree_iter iter; 1133 struct radix_tree_iter iter;
1135 void **slot; 1134 void **slot;
1136 unsigned int ret = 0; 1135 unsigned int ret = 0;
1137 1136
1138 if (unlikely(!nr_pages)) 1137 if (unlikely(!nr_pages))
1139 return 0; 1138 return 0;
1140 1139
1141 rcu_read_lock(); 1140 rcu_read_lock();
1142 restart: 1141 restart:
1143 radix_tree_for_each_contig(slot, &mapping->page_tree, &iter, index) { 1142 radix_tree_for_each_contig(slot, &mapping->page_tree, &iter, index) {
1144 struct page *page; 1143 struct page *page;
1145 repeat: 1144 repeat:
1146 page = radix_tree_deref_slot(slot); 1145 page = radix_tree_deref_slot(slot);
1147 /* The hole, there no reason to continue */ 1146 /* The hole, there no reason to continue */
1148 if (unlikely(!page)) 1147 if (unlikely(!page))
1149 break; 1148 break;
1150 1149
1151 if (radix_tree_exception(page)) { 1150 if (radix_tree_exception(page)) {
1152 if (radix_tree_deref_retry(page)) { 1151 if (radix_tree_deref_retry(page)) {
1153 /* 1152 /*
1154 * Transient condition which can only trigger 1153 * Transient condition which can only trigger
1155 * when entry at index 0 moves out of or back 1154 * when entry at index 0 moves out of or back
1156 * to root: none yet gotten, safe to restart. 1155 * to root: none yet gotten, safe to restart.
1157 */ 1156 */
1158 goto restart; 1157 goto restart;
1159 } 1158 }
1160 /* 1159 /*
1161 * Otherwise, shmem/tmpfs must be storing a swap entry 1160 * Otherwise, shmem/tmpfs must be storing a swap entry
1162 * here as an exceptional entry: so stop looking for 1161 * here as an exceptional entry: so stop looking for
1163 * contiguous pages. 1162 * contiguous pages.
1164 */ 1163 */
1165 break; 1164 break;
1166 } 1165 }
1167 1166
1168 if (!page_cache_get_speculative(page)) 1167 if (!page_cache_get_speculative(page))
1169 goto repeat; 1168 goto repeat;
1170 1169
1171 /* Has the page moved? */ 1170 /* Has the page moved? */
1172 if (unlikely(page != *slot)) { 1171 if (unlikely(page != *slot)) {
1173 page_cache_release(page); 1172 page_cache_release(page);
1174 goto repeat; 1173 goto repeat;
1175 } 1174 }
1176 1175
1177 /* 1176 /*
1178 * must check mapping and index after taking the ref. 1177 * must check mapping and index after taking the ref.
1179 * otherwise we can get both false positives and false 1178 * otherwise we can get both false positives and false
1180 * negatives, which is just confusing to the caller. 1179 * negatives, which is just confusing to the caller.
1181 */ 1180 */
1182 if (page->mapping == NULL || page->index != iter.index) { 1181 if (page->mapping == NULL || page->index != iter.index) {
1183 page_cache_release(page); 1182 page_cache_release(page);
1184 break; 1183 break;
1185 } 1184 }
1186 1185
1187 pages[ret] = page; 1186 pages[ret] = page;
1188 if (++ret == nr_pages) 1187 if (++ret == nr_pages)
1189 break; 1188 break;
1190 } 1189 }
1191 rcu_read_unlock(); 1190 rcu_read_unlock();
1192 return ret; 1191 return ret;
1193 } 1192 }
1194 EXPORT_SYMBOL(find_get_pages_contig); 1193 EXPORT_SYMBOL(find_get_pages_contig);
1195 1194
1196 /** 1195 /**
1197 * find_get_pages_tag - find and return pages that match @tag 1196 * find_get_pages_tag - find and return pages that match @tag
1198 * @mapping: the address_space to search 1197 * @mapping: the address_space to search
1199 * @index: the starting page index 1198 * @index: the starting page index
1200 * @tag: the tag index 1199 * @tag: the tag index
1201 * @nr_pages: the maximum number of pages 1200 * @nr_pages: the maximum number of pages
1202 * @pages: where the resulting pages are placed 1201 * @pages: where the resulting pages are placed
1203 * 1202 *
1204 * Like find_get_pages, except we only return pages which are tagged with 1203 * Like find_get_pages, except we only return pages which are tagged with
1205 * @tag. We update @index to index the next page for the traversal. 1204 * @tag. We update @index to index the next page for the traversal.
1206 */ 1205 */
1207 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, 1206 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
1208 int tag, unsigned int nr_pages, struct page **pages) 1207 int tag, unsigned int nr_pages, struct page **pages)
1209 { 1208 {
1210 struct radix_tree_iter iter; 1209 struct radix_tree_iter iter;
1211 void **slot; 1210 void **slot;
1212 unsigned ret = 0; 1211 unsigned ret = 0;
1213 1212
1214 if (unlikely(!nr_pages)) 1213 if (unlikely(!nr_pages))
1215 return 0; 1214 return 0;
1216 1215
1217 rcu_read_lock(); 1216 rcu_read_lock();
1218 restart: 1217 restart:
1219 radix_tree_for_each_tagged(slot, &mapping->page_tree, 1218 radix_tree_for_each_tagged(slot, &mapping->page_tree,
1220 &iter, *index, tag) { 1219 &iter, *index, tag) {
1221 struct page *page; 1220 struct page *page;
1222 repeat: 1221 repeat:
1223 page = radix_tree_deref_slot(slot); 1222 page = radix_tree_deref_slot(slot);
1224 if (unlikely(!page)) 1223 if (unlikely(!page))
1225 continue; 1224 continue;
1226 1225
1227 if (radix_tree_exception(page)) { 1226 if (radix_tree_exception(page)) {
1228 if (radix_tree_deref_retry(page)) { 1227 if (radix_tree_deref_retry(page)) {
1229 /* 1228 /*
1230 * Transient condition which can only trigger 1229 * Transient condition which can only trigger
1231 * when entry at index 0 moves out of or back 1230 * when entry at index 0 moves out of or back
1232 * to root: none yet gotten, safe to restart. 1231 * to root: none yet gotten, safe to restart.
1233 */ 1232 */
1234 goto restart; 1233 goto restart;
1235 } 1234 }
1236 /* 1235 /*
1237 * This function is never used on a shmem/tmpfs 1236 * This function is never used on a shmem/tmpfs
1238 * mapping, so a swap entry won't be found here. 1237 * mapping, so a swap entry won't be found here.
1239 */ 1238 */
1240 BUG(); 1239 BUG();
1241 } 1240 }
1242 1241
1243 if (!page_cache_get_speculative(page)) 1242 if (!page_cache_get_speculative(page))
1244 goto repeat; 1243 goto repeat;
1245 1244
1246 /* Has the page moved? */ 1245 /* Has the page moved? */
1247 if (unlikely(page != *slot)) { 1246 if (unlikely(page != *slot)) {
1248 page_cache_release(page); 1247 page_cache_release(page);
1249 goto repeat; 1248 goto repeat;
1250 } 1249 }
1251 1250
1252 pages[ret] = page; 1251 pages[ret] = page;
1253 if (++ret == nr_pages) 1252 if (++ret == nr_pages)
1254 break; 1253 break;
1255 } 1254 }
1256 1255
1257 rcu_read_unlock(); 1256 rcu_read_unlock();
1258 1257
1259 if (ret) 1258 if (ret)
1260 *index = pages[ret - 1]->index + 1; 1259 *index = pages[ret - 1]->index + 1;
1261 1260
1262 return ret; 1261 return ret;
1263 } 1262 }
1264 EXPORT_SYMBOL(find_get_pages_tag); 1263 EXPORT_SYMBOL(find_get_pages_tag);
1265 1264
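For context, find_get_pages_tag() is the batched lookup used by writeback-style walkers, and every page it returns carries a reference taken under RCU. A minimal sketch of a caller, not part of this patch and with an illustrative function name, could look like:

	/* Illustrative sketch only: visit every dirty page in a mapping. */
	static void example_visit_dirty(struct address_space *mapping)
	{
		struct page *pages[16];
		pgoff_t index = 0;
		unsigned nr, i;

		while ((nr = find_get_pages_tag(mapping, &index,
						PAGECACHE_TAG_DIRTY,
						ARRAY_SIZE(pages), pages))) {
			for (i = 0; i < nr; i++) {
				/* ... inspect or write back pages[i] ... */
				page_cache_release(pages[i]);
			}
		}
	}
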
/**
 * grab_cache_page_nowait - returns locked page at given index in given cache
 * @mapping: target address_space
 * @index: the page index
 *
 * Same as grab_cache_page(), but do not wait if the page is unavailable.
 * This is intended for speculative data generators, where the data can
 * be regenerated if the page couldn't be grabbed. This routine should
 * be safe to call while holding the lock for another page.
 *
 * Clear __GFP_FS when allocating the page to avoid recursion into the fs
 * and deadlock against the caller's locked page.
 */
struct page *
grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
{
	struct page *page = find_get_page(mapping, index);

	if (page) {
		if (trylock_page(page))
			return page;
		page_cache_release(page);
		return NULL;
	}
	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
		page_cache_release(page);
		page = NULL;
	}
	return page;
}
EXPORT_SYMBOL(grab_cache_page_nowait);

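As the comment above says, grab_cache_page_nowait() is meant for speculative data generators that can regenerate the data if the page cannot be grabbed without blocking. A hedged sketch of that calling pattern, illustrative only and not part of the patch, might be:

	/* Illustrative sketch only: opportunistically fill a page. */
	static void example_try_fill(struct address_space *mapping, pgoff_t index)
	{
		struct page *page = grab_cache_page_nowait(mapping, index);

		if (!page)
			return;	/* could not grab it without blocking; try later */

		/* ... generate the data into the locked page ... */
		SetPageUptodate(page);
		unlock_page(page);
		page_cache_release(page);
	}
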
/*
 * CD/DVDs are error prone. When a medium error occurs, the driver may fail
 * a _large_ part of the i/o request. Imagine the worst scenario:
 *
 *      ---R__________________________________________B__________
 *         ^ reading here                             ^ bad block(assume 4k)
 *
 * read(R) => miss => readahead(R...B) => media error => frustrating retries
 * => failing the whole request => read(R) => read(R+1) =>
 * readahead(R+1...B+1) => bang => read(R+2) => read(R+3) =>
 * readahead(R+3...B+2) => bang => read(R+3) => read(R+4) =>
 * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
 *
 * It is going insane. Fix it by quickly scaling down the readahead size.
 */
static void shrink_readahead_size_eio(struct file *filp,
					struct file_ra_state *ra)
{
	ra->ra_pages /= 4;
}

/**
 * do_generic_file_read - generic file read routine
 * @filp: the file to read
 * @ppos: current file position
 * @desc: read_descriptor
 * @actor: read method
 *
 * This is a generic file read routine, and uses the
 * mapping->a_ops->readpage() function for the actual low-level stuff.
 *
 * This is really ugly. But the goto's actually try to clarify some
 * of the logic when it comes to error handling etc.
 */
static void do_generic_file_read(struct file *filp, loff_t *ppos,
		read_descriptor_t *desc, read_actor_t actor)
{
	struct address_space *mapping = filp->f_mapping;
	struct inode *inode = mapping->host;
	struct file_ra_state *ra = &filp->f_ra;
	pgoff_t index;
	pgoff_t last_index;
	pgoff_t prev_index;
	unsigned long offset;      /* offset into pagecache page */
	unsigned int prev_offset;
	int error;

	index = *ppos >> PAGE_CACHE_SHIFT;
	prev_index = ra->prev_pos >> PAGE_CACHE_SHIFT;
	prev_offset = ra->prev_pos & (PAGE_CACHE_SIZE-1);
	last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
	offset = *ppos & ~PAGE_CACHE_MASK;

	for (;;) {
		struct page *page;
		pgoff_t end_index;
		loff_t isize;
		unsigned long nr, ret;

		cond_resched();
find_page:
		page = find_get_page(mapping, index);
		if (!page) {
			page_cache_sync_readahead(mapping,
					ra, filp,
					index, last_index - index);
			page = find_get_page(mapping, index);
			if (unlikely(page == NULL))
				goto no_cached_page;
		}
		if (PageReadahead(page)) {
			page_cache_async_readahead(mapping,
					ra, filp, page,
					index, last_index - index);
		}
		if (!PageUptodate(page)) {
			if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
					!mapping->a_ops->is_partially_uptodate)
				goto page_not_up_to_date;
			if (!trylock_page(page))
				goto page_not_up_to_date;
			/* Did it get truncated before we got the lock? */
			if (!page->mapping)
				goto page_not_up_to_date_locked;
			if (!mapping->a_ops->is_partially_uptodate(page,
								desc, offset))
				goto page_not_up_to_date_locked;
			unlock_page(page);
		}
page_ok:
		/*
		 * i_size must be checked after we know the page is Uptodate.
		 *
		 * Checking i_size after the check allows us to calculate
		 * the correct value for "nr", which means the zero-filled
		 * part of the page is not copied back to userspace (unless
		 * another truncate extends the file - this is desired though).
		 */

		isize = i_size_read(inode);
		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
		if (unlikely(!isize || index > end_index)) {
			page_cache_release(page);
			goto out;
		}

		/* nr is the maximum number of bytes to copy from this page */
		nr = PAGE_CACHE_SIZE;
		if (index == end_index) {
			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
			if (nr <= offset) {
				page_cache_release(page);
				goto out;
			}
		}
		nr = nr - offset;

		/* If users can be writing to this page using arbitrary
		 * virtual addresses, take care about potential aliasing
		 * before reading the page on the kernel side.
		 */
		if (mapping_writably_mapped(mapping))
			flush_dcache_page(page);

		/*
		 * When a sequential read accesses a page several times,
		 * only mark it as accessed the first time.
		 */
		if (prev_index != index || offset != prev_offset)
			mark_page_accessed(page);
		prev_index = index;

		/*
		 * Ok, we have the page, and it's up-to-date, so
		 * now we can copy it to user space...
		 *
		 * The actor routine returns how many bytes were actually used..
		 * NOTE! This may not be the same as how much of a user buffer
		 * we filled up (we may be padding etc), so we can only update
		 * "pos" here (the actor routine has to update the user buffer
		 * pointers and the remaining count).
		 */
		ret = actor(desc, page, offset, nr);
		offset += ret;
		index += offset >> PAGE_CACHE_SHIFT;
		offset &= ~PAGE_CACHE_MASK;
		prev_offset = offset;

		page_cache_release(page);
		if (ret == nr && desc->count)
			continue;
		goto out;

page_not_up_to_date:
		/* Get exclusive access to the page ... */
		error = lock_page_killable(page);
		if (unlikely(error))
			goto readpage_error;

page_not_up_to_date_locked:
		/* Did it get truncated before we got the lock? */
		if (!page->mapping) {
			unlock_page(page);
			page_cache_release(page);
			continue;
		}

		/* Did somebody else fill it already? */
		if (PageUptodate(page)) {
			unlock_page(page);
			goto page_ok;
		}

readpage:
		/*
		 * A previous I/O error may have been due to temporary
		 * failures, eg. multipath errors.
		 * PG_error will be set again if readpage fails.
		 */
		ClearPageError(page);
		/* Start the actual read. The read will unlock the page. */
		error = mapping->a_ops->readpage(filp, page);

		if (unlikely(error)) {
			if (error == AOP_TRUNCATED_PAGE) {
				page_cache_release(page);
				goto find_page;
			}
			goto readpage_error;
		}

		if (!PageUptodate(page)) {
			error = lock_page_killable(page);
			if (unlikely(error))
				goto readpage_error;
			if (!PageUptodate(page)) {
				if (page->mapping == NULL) {
					/*
					 * invalidate_mapping_pages got it
					 */
					unlock_page(page);
					page_cache_release(page);
					goto find_page;
				}
				unlock_page(page);
				shrink_readahead_size_eio(filp, ra);
				error = -EIO;
				goto readpage_error;
			}
			unlock_page(page);
		}

		goto page_ok;

readpage_error:
		/* UHHUH! A synchronous read error occurred. Report it */
		desc->error = error;
		page_cache_release(page);
		goto out;

no_cached_page:
		/*
		 * Ok, it wasn't cached, so we need to create a new
		 * page..
		 */
		page = page_cache_alloc_cold(mapping);
		if (!page) {
			desc->error = -ENOMEM;
			goto out;
		}
		error = add_to_page_cache_lru(page, mapping,
						index, GFP_KERNEL);
		if (error) {
			page_cache_release(page);
			if (error == -EEXIST)
				goto find_page;
			desc->error = error;
			goto out;
		}
		goto readpage;
	}

out:
	ra->prev_pos = prev_index;
	ra->prev_pos <<= PAGE_CACHE_SHIFT;
	ra->prev_pos |= prev_offset;

	*ppos = ((loff_t)index << PAGE_CACHE_SHIFT) + offset;
	file_accessed(filp);
}

int file_read_actor(read_descriptor_t *desc, struct page *page,
			unsigned long offset, unsigned long size)
{
	char *kaddr;
	unsigned long left, count = desc->count;

	if (size > count)
		size = count;

	/*
	 * Faults on the destination of a read are common, so do it before
	 * taking the kmap.
	 */
	if (!fault_in_pages_writeable(desc->arg.buf, size)) {
		kaddr = kmap_atomic(page);
		left = __copy_to_user_inatomic(desc->arg.buf,
						kaddr + offset, size);
		kunmap_atomic(kaddr);
		if (left == 0)
			goto success;
	}

	/* Do it the slow way */
	kaddr = kmap(page);
	left = __copy_to_user(desc->arg.buf, kaddr + offset, size);
	kunmap(page);

	if (left) {
		size -= left;
		desc->error = -EFAULT;
	}
success:
	desc->count = count - size;
	desc->written += size;
	desc->arg.buf += size;
	return size;
}

/*
 * Performs necessary checks before doing a write
 * @iov: io vector request
 * @nr_segs: number of segments in the iovec
 * @count: number of bytes to write
 * @access_flags: type of access: %VERIFY_READ or %VERIFY_WRITE
 *
 * Adjust number of segments and amount of bytes to write (nr_segs should be
 * properly initialized first). Returns appropriate error code that caller
 * should return or zero in case that write should be allowed.
 */
int generic_segment_checks(const struct iovec *iov,
			unsigned long *nr_segs, size_t *count, int access_flags)
{
	unsigned long seg;
	size_t cnt = 0;
	for (seg = 0; seg < *nr_segs; seg++) {
		const struct iovec *iv = &iov[seg];

		/*
		 * If any segment has a negative length, or the cumulative
		 * length ever wraps negative then return -EINVAL.
		 */
		cnt += iv->iov_len;
		if (unlikely((ssize_t)(cnt|iv->iov_len) < 0))
			return -EINVAL;
		if (access_ok(access_flags, iv->iov_base, iv->iov_len))
			continue;
		if (seg == 0)
			return -EFAULT;
		*nr_segs = seg;
		cnt -= iv->iov_len;	/* This segment is no good */
		break;
	}
	*count = cnt;
	return 0;
}
EXPORT_SYMBOL(generic_segment_checks);

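generic_segment_checks() trims the iovec at the first inaccessible segment and reports the total number of bytes that may be copied. A hedged sketch of how a write path might call it, illustrative only and not part of the patch:

	/* Illustrative sketch only: validate user iovecs before a write. */
	static ssize_t example_prepare_write(const struct iovec *iov,
					     unsigned long nr_segs)
	{
		size_t count;
		int err;

		/* VERIFY_READ: a write copies *from* the user buffers */
		err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
		if (err)
			return err;
		return count;	/* bytes that may safely be copied */
	}
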
/**
 * generic_file_aio_read - generic filesystem read routine
 * @iocb: kernel I/O control block
 * @iov: io vector request
 * @nr_segs: number of segments in the iovec
 * @pos: current file position
 *
 * This is the "read()" routine for all filesystems
 * that can use the page cache directly.
 */
ssize_t
generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
		unsigned long nr_segs, loff_t pos)
{
	struct file *filp = iocb->ki_filp;
	ssize_t retval;
	unsigned long seg = 0;
	size_t count;
	loff_t *ppos = &iocb->ki_pos;

	count = 0;
	retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
	if (retval)
		return retval;

	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
	if (filp->f_flags & O_DIRECT) {
		loff_t size;
		struct address_space *mapping;
		struct inode *inode;

		mapping = filp->f_mapping;
		inode = mapping->host;
		if (!count)
			goto out; /* skip atime */
		size = i_size_read(inode);
		if (pos < size) {
			retval = filemap_write_and_wait_range(mapping, pos,
					pos + iov_length(iov, nr_segs) - 1);
			if (!retval) {
				retval = mapping->a_ops->direct_IO(READ, iocb,
							iov, pos, nr_segs);
			}
			if (retval > 0) {
				*ppos = pos + retval;
				count -= retval;
			}

			/*
			 * Btrfs can have a short DIO read if we encounter
			 * compressed extents, so if there was an error, or if
			 * we've already read everything we wanted to, or if
			 * there was a short read because we hit EOF, go ahead
			 * and return. Otherwise fallthrough to buffered io for
			 * the rest of the read.
			 */
			if (retval < 0 || !count || *ppos >= size) {
				file_accessed(filp);
				goto out;
			}
		}
	}

	count = retval;
	for (seg = 0; seg < nr_segs; seg++) {
		read_descriptor_t desc;
		loff_t offset = 0;

		/*
		 * If we did a short DIO read we need to skip the section of the
		 * iov that we've already read data into.
		 */
		if (count) {
			if (count > iov[seg].iov_len) {
				count -= iov[seg].iov_len;
				continue;
			}
			offset = count;
			count = 0;
		}

		desc.written = 0;
		desc.arg.buf = iov[seg].iov_base + offset;
		desc.count = iov[seg].iov_len - offset;
		if (desc.count == 0)
			continue;
		desc.error = 0;
		do_generic_file_read(filp, ppos, &desc, file_read_actor);
		retval += desc.written;
		if (desc.error) {
			retval = retval ?: desc.error;
			break;
		}
		if (desc.count > 0)
			break;
	}
out:
	return retval;
}
EXPORT_SYMBOL(generic_file_aio_read);

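Filesystems normally reach generic_file_aio_read() through their file_operations rather than calling it directly. A hedged sketch of the usual wiring in a kernel of this vintage, with an illustrative struct name and not part of this patch:

	static const struct file_operations example_file_ops = {
		.llseek		= generic_file_llseek,
		.read		= do_sync_read,
		.aio_read	= generic_file_aio_read,
		.mmap		= generic_file_mmap,
	};
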
#ifdef CONFIG_MMU
/**
 * page_cache_read - adds requested page to the page cache if not already there
 * @file: file to read
 * @offset: page index
 *
 * This adds the requested page to the page cache if it isn't already there,
 * and schedules an I/O to read in its contents from disk.
 */
static int page_cache_read(struct file *file, pgoff_t offset)
{
	struct address_space *mapping = file->f_mapping;
	struct page *page;
	int ret;

	do {
		page = page_cache_alloc_cold(mapping);
		if (!page)
			return -ENOMEM;

		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
		if (ret == 0)
			ret = mapping->a_ops->readpage(file, page);
		else if (ret == -EEXIST)
			ret = 0; /* losing race to add is OK */

		page_cache_release(page);

	} while (ret == AOP_TRUNCATED_PAGE);

	return ret;
}

#define MMAP_LOTSAMISS (100)

/*
 * Synchronous readahead happens when we don't even find
 * a page in the page cache at all.
 */
static void do_sync_mmap_readahead(struct vm_area_struct *vma,
				   struct file_ra_state *ra,
				   struct file *file,
				   pgoff_t offset)
{
	unsigned long ra_pages;
	struct address_space *mapping = file->f_mapping;

	/* If we don't want any read-ahead, don't bother */
	if (vma->vm_flags & VM_RAND_READ)
		return;
	if (!ra->ra_pages)
		return;

	if (vma->vm_flags & VM_SEQ_READ) {
		page_cache_sync_readahead(mapping, ra, file, offset,
					  ra->ra_pages);
		return;
	}

	/* Avoid banging the cache line if not needed */
	if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
		ra->mmap_miss++;

	/*
	 * Do we miss much more than hit in this file? If so,
	 * stop bothering with read-ahead. It will only hurt.
	 */
	if (ra->mmap_miss > MMAP_LOTSAMISS)
		return;

	/*
	 * mmap read-around
	 */
	ra_pages = max_sane_readahead(ra->ra_pages);
	ra->start = max_t(long, 0, offset - ra_pages / 2);
	ra->size = ra_pages;
	ra->async_size = ra_pages / 4;
	ra_submit(ra, mapping, file);
}

/*
 * Asynchronous readahead happens when we find the page and PG_readahead,
 * so we want to possibly extend the readahead further..
 */
static void do_async_mmap_readahead(struct vm_area_struct *vma,
				    struct file_ra_state *ra,
				    struct file *file,
				    struct page *page,
				    pgoff_t offset)
{
	struct address_space *mapping = file->f_mapping;

	/* If we don't want any read-ahead, don't bother */
	if (vma->vm_flags & VM_RAND_READ)
		return;
	if (ra->mmap_miss > 0)
		ra->mmap_miss--;
	if (PageReadahead(page))
		page_cache_async_readahead(mapping, ra, file,
					   page, offset, ra->ra_pages);
}

/**
 * filemap_fault - read in file data for page fault handling
 * @vma: vma in which the fault was taken
 * @vmf: struct vm_fault containing details of the fault
 *
 * filemap_fault() is invoked via the vma operations vector for a
 * mapped memory region to read in file data during a page fault.
 *
 * The goto's are kind of ugly, but this streamlines the normal case of having
 * it in the page cache, and handles the special cases reasonably without
 * having a lot of duplicated code.
 */
int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	int error;
	struct file *file = vma->vm_file;
	struct address_space *mapping = file->f_mapping;
	struct file_ra_state *ra = &file->f_ra;
	struct inode *inode = mapping->host;
	pgoff_t offset = vmf->pgoff;
	struct page *page;
	pgoff_t size;
	int ret = 0;

	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	if (offset >= size)
		return VM_FAULT_SIGBUS;

	/*
	 * Do we have something in the page cache already?
	 */
	page = find_get_page(mapping, offset);
	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
		/*
		 * We found the page, so try async readahead before
		 * waiting for the lock.
		 */
		do_async_mmap_readahead(vma, ra, file, page, offset);
	} else if (!page) {
		/* No page in the page cache at all */
		do_sync_mmap_readahead(vma, ra, file, offset);
		count_vm_event(PGMAJFAULT);
		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
		ret = VM_FAULT_MAJOR;
retry_find:
		page = find_get_page(mapping, offset);
		if (!page)
			goto no_cached_page;
	}

	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
		page_cache_release(page);
		return ret | VM_FAULT_RETRY;
	}

	/* Did it get truncated? */
	if (unlikely(page->mapping != mapping)) {
		unlock_page(page);
		put_page(page);
		goto retry_find;
	}
	VM_BUG_ON(page->index != offset);

	/*
	 * We have a locked page in the page cache, now we need to check
	 * that it's up-to-date. If not, it is going to be due to an error.
	 */
	if (unlikely(!PageUptodate(page)))
		goto page_not_uptodate;

	/*
	 * Found the page and have a reference on it.
	 * We must recheck i_size under page lock.
	 */
	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	if (unlikely(offset >= size)) {
		unlock_page(page);
		page_cache_release(page);
		return VM_FAULT_SIGBUS;
	}

	vmf->page = page;
	return ret | VM_FAULT_LOCKED;

no_cached_page:
	/*
	 * We're only likely to ever get here if MADV_RANDOM is in
	 * effect.
	 */
	error = page_cache_read(file, offset);

	/*
	 * The page we want has now been added to the page cache.
	 * In the unlikely event that someone removed it in the
	 * meantime, we'll just come back here and read it again.
	 */
	if (error >= 0)
		goto retry_find;

	/*
	 * An error return from page_cache_read can result if the
	 * system is low on memory, or a problem occurs while trying
	 * to schedule I/O.
	 */
	if (error == -ENOMEM)
		return VM_FAULT_OOM;
	return VM_FAULT_SIGBUS;

page_not_uptodate:
	/*
	 * Umm, take care of errors if the page isn't up-to-date.
	 * Try to re-read it _once_. We do this synchronously,
	 * because there really aren't any performance issues here
	 * and we need to check for errors.
	 */
	ClearPageError(page);
	error = mapping->a_ops->readpage(file, page);
	if (!error) {
		wait_on_page_locked(page);
		if (!PageUptodate(page))
			error = -EIO;
	}
	page_cache_release(page);

	if (!error || error == AOP_TRUNCATED_PAGE)
		goto retry_find;

	/* Things didn't work out. Return zero to tell the mm layer so. */
	shrink_readahead_size_eio(file, ra);
	return VM_FAULT_SIGBUS;
}
EXPORT_SYMBOL(filemap_fault);

int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct inode *inode = file_inode(vma->vm_file);
	int ret = VM_FAULT_LOCKED;

	sb_start_pagefault(inode->i_sb);
	file_update_time(vma->vm_file);
	lock_page(page);
	if (page->mapping != inode->i_mapping) {
		unlock_page(page);
		ret = VM_FAULT_NOPAGE;
		goto out;
	}
	/*
	 * We mark the page dirty already here so that when freeze is in
	 * progress, we are guaranteed that writeback during freezing will
	 * see the dirty page and writeprotect it again.
	 */
	set_page_dirty(page);
	wait_for_stable_page(page);
out:
	sb_end_pagefault(inode->i_sb);
	return ret;
}
EXPORT_SYMBOL(filemap_page_mkwrite);

const struct vm_operations_struct generic_file_vm_ops = {
	.fault		= filemap_fault,
	.page_mkwrite	= filemap_page_mkwrite,
	.remap_pages	= generic_file_remap_pages,
};

/* This is used for a general mmap of a disk file */

int generic_file_mmap(struct file * file, struct vm_area_struct * vma)
{
	struct address_space *mapping = file->f_mapping;

	if (!mapping->a_ops->readpage)
		return -ENOEXEC;
	file_accessed(file);
	vma->vm_ops = &generic_file_vm_ops;
	return 0;
}

/*
 * This is for filesystems which do not implement ->writepage.
 */
int generic_file_readonly_mmap(struct file *file, struct vm_area_struct *vma)
{
	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
		return -EINVAL;
	return generic_file_mmap(file, vma);
}
#else
int generic_file_mmap(struct file * file, struct vm_area_struct * vma)
{
	return -ENOSYS;
}
int generic_file_readonly_mmap(struct file * file, struct vm_area_struct * vma)
{
	return -ENOSYS;
}
#endif /* CONFIG_MMU */

EXPORT_SYMBOL(generic_file_mmap);
EXPORT_SYMBOL(generic_file_readonly_mmap);

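Filesystems that want their own page_mkwrite handling typically still reuse filemap_fault() instead of taking generic_file_mmap() wholesale. A hedged sketch of that pattern, with made-up names and not part of this patch:

	static const struct vm_operations_struct example_vm_ops = {
		.fault		= filemap_fault,
		.page_mkwrite	= filemap_page_mkwrite,
		.remap_pages	= generic_file_remap_pages,
	};

	static int example_mmap(struct file *file, struct vm_area_struct *vma)
	{
		file_accessed(file);
		vma->vm_ops = &example_vm_ops;
		return 0;
	}
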
static struct page *wait_on_page_read(struct page *page)
{
	if (!IS_ERR(page)) {
		wait_on_page_locked(page);
		if (!PageUptodate(page)) {
			page_cache_release(page);
			page = ERR_PTR(-EIO);
		}
	}
	return page;
}

static struct page *__read_cache_page(struct address_space *mapping,
				pgoff_t index,
				int (*filler)(void *, struct page *),
				void *data,
				gfp_t gfp)
{
	struct page *page;
	int err;
repeat:
	page = find_get_page(mapping, index);
	if (!page) {
		page = __page_cache_alloc(gfp | __GFP_COLD);
		if (!page)
			return ERR_PTR(-ENOMEM);
		err = add_to_page_cache_lru(page, mapping, index, gfp);
		if (unlikely(err)) {
			page_cache_release(page);
			if (err == -EEXIST)
				goto repeat;
			/* Presumably ENOMEM for radix tree node */
			return ERR_PTR(err);
		}
		err = filler(data, page);
		if (err < 0) {
			page_cache_release(page);
			page = ERR_PTR(err);
		} else {
			page = wait_on_page_read(page);
		}
	}
	return page;
}

static struct page *do_read_cache_page(struct address_space *mapping,
				pgoff_t index,
				int (*filler)(void *, struct page *),
				void *data,
				gfp_t gfp)

{
	struct page *page;
	int err;

retry:
	page = __read_cache_page(mapping, index, filler, data, gfp);
	if (IS_ERR(page))
		return page;
	if (PageUptodate(page))
		goto out;

	lock_page(page);
	if (!page->mapping) {
		unlock_page(page);
		page_cache_release(page);
		goto retry;
	}
	if (PageUptodate(page)) {
		unlock_page(page);
		goto out;
	}
	err = filler(data, page);
	if (err < 0) {
		page_cache_release(page);
		return ERR_PTR(err);
	} else {
		page = wait_on_page_read(page);
		if (IS_ERR(page))
			return page;
	}
out:
	mark_page_accessed(page);
	return page;
}

/**
 * read_cache_page - read into page cache, fill it if needed
 * @mapping: the page's address_space
 * @index: the page index
 * @filler: function to perform the read
 * @data: first arg to filler(data, page) function, often left as NULL
 *
 * Read into the page cache. If a page already exists, and PageUptodate() is
 * not set, try to fill the page and wait for it to become unlocked.
 *
 * If the page does not get brought uptodate, return -EIO.
 */
struct page *read_cache_page(struct address_space *mapping,
				pgoff_t index,
				int (*filler)(void *, struct page *),
				void *data)
{
	return do_read_cache_page(mapping, index, filler, data, mapping_gfp_mask(mapping));
}
EXPORT_SYMBOL(read_cache_page);

2139 /** 2105 /**
2140 * read_cache_page_gfp - read into page cache, using specified page allocation flags. 2106 * read_cache_page_gfp - read into page cache, using specified page allocation flags.
2141 * @mapping: the page's address_space 2107 * @mapping: the page's address_space
2142 * @index: the page index 2108 * @index: the page index
2143 * @gfp: the page allocator flags to use if allocating 2109 * @gfp: the page allocator flags to use if allocating
2144 * 2110 *
2145 * This is the same as "read_mapping_page(mapping, index, NULL)", but with 2111 * This is the same as "read_mapping_page(mapping, index, NULL)", but with
2146 * any new page allocations done using the specified allocation flags. 2112 * any new page allocations done using the specified allocation flags.
2147 * 2113 *
2148 * If the page does not get brought uptodate, return -EIO. 2114 * If the page does not get brought uptodate, return -EIO.
2149 */ 2115 */
2150 struct page *read_cache_page_gfp(struct address_space *mapping, 2116 struct page *read_cache_page_gfp(struct address_space *mapping,
2151 pgoff_t index, 2117 pgoff_t index,
2152 gfp_t gfp) 2118 gfp_t gfp)
2153 { 2119 {
2154 filler_t *filler = (filler_t *)mapping->a_ops->readpage; 2120 filler_t *filler = (filler_t *)mapping->a_ops->readpage;
2155 2121
2156 return do_read_cache_page(mapping, index, filler, NULL, gfp); 2122 return do_read_cache_page(mapping, index, filler, NULL, gfp);
2157 } 2123 }
2158 EXPORT_SYMBOL(read_cache_page_gfp); 2124 EXPORT_SYMBOL(read_cache_page_gfp);
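As a usage illustration only (the helper name below is hypothetical), a filesystem can hand its own ->readpage to read_cache_page() the same way read_cache_page_gfp() above does; the page that comes back is uptodate, referenced, and already marked accessed by do_read_cache_page():

	/*
	 * Illustrative sketch: read page @index of @mapping through the page
	 * cache, using the mapping's own ->readpage as the filler.
	 */
	static struct page *example_read_meta_page(struct address_space *mapping,
						   pgoff_t index)
	{
		filler_t *filler = (filler_t *)mapping->a_ops->readpage;
		struct page *page;

		page = read_cache_page(mapping, index, filler, NULL);
		if (IS_ERR(page))
			return page;		/* typically ERR_PTR(-EIO) */

		/* Uptodate and marked accessed; drop with page_cache_release(). */
		return page;
	}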
2159 2125
2160 static size_t __iovec_copy_from_user_inatomic(char *vaddr, 2126 static size_t __iovec_copy_from_user_inatomic(char *vaddr,
2161 const struct iovec *iov, size_t base, size_t bytes) 2127 const struct iovec *iov, size_t base, size_t bytes)
2162 { 2128 {
2163 size_t copied = 0, left = 0; 2129 size_t copied = 0, left = 0;
2164 2130
2165 while (bytes) { 2131 while (bytes) {
2166 char __user *buf = iov->iov_base + base; 2132 char __user *buf = iov->iov_base + base;
2167 int copy = min(bytes, iov->iov_len - base); 2133 int copy = min(bytes, iov->iov_len - base);
2168 2134
2169 base = 0; 2135 base = 0;
2170 left = __copy_from_user_inatomic(vaddr, buf, copy); 2136 left = __copy_from_user_inatomic(vaddr, buf, copy);
2171 copied += copy; 2137 copied += copy;
2172 bytes -= copy; 2138 bytes -= copy;
2173 vaddr += copy; 2139 vaddr += copy;
2174 iov++; 2140 iov++;
2175 2141
2176 if (unlikely(left)) 2142 if (unlikely(left))
2177 break; 2143 break;
2178 } 2144 }
2179 return copied - left; 2145 return copied - left;
2180 } 2146 }
2181 2147
2182 /* 2148 /*
2183 * Copy as much as we can into the page and return the number of bytes which 2149 * Copy as much as we can into the page and return the number of bytes which
2184 * were successfully copied. If a fault is encountered then return the number of 2150 * were successfully copied. If a fault is encountered then return the number of
2185 * bytes which were copied. 2151 * bytes which were copied.
2186 */ 2152 */
2187 size_t iov_iter_copy_from_user_atomic(struct page *page, 2153 size_t iov_iter_copy_from_user_atomic(struct page *page,
2188 struct iov_iter *i, unsigned long offset, size_t bytes) 2154 struct iov_iter *i, unsigned long offset, size_t bytes)
2189 { 2155 {
2190 char *kaddr; 2156 char *kaddr;
2191 size_t copied; 2157 size_t copied;
2192 2158
2193 kaddr = kmap_atomic(page); 2159 kaddr = kmap_atomic(page);
2194 if (likely(i->nr_segs == 1)) { 2160 if (likely(i->nr_segs == 1)) {
2195 int left; 2161 int left;
2196 char __user *buf = i->iov->iov_base + i->iov_offset; 2162 char __user *buf = i->iov->iov_base + i->iov_offset;
2197 left = __copy_from_user_inatomic(kaddr + offset, buf, bytes); 2163 left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
2198 copied = bytes - left; 2164 copied = bytes - left;
2199 } else { 2165 } else {
2200 copied = __iovec_copy_from_user_inatomic(kaddr + offset, 2166 copied = __iovec_copy_from_user_inatomic(kaddr + offset,
2201 i->iov, i->iov_offset, bytes); 2167 i->iov, i->iov_offset, bytes);
2202 } 2168 }
2203 kunmap_atomic(kaddr); 2169 kunmap_atomic(kaddr);
2204 2170
2205 return copied; 2171 return copied;
2206 } 2172 }
2207 EXPORT_SYMBOL(iov_iter_copy_from_user_atomic); 2173 EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
2208 2174
2209 /* 2175 /*
2210 * This has the same side effects and return value as 2176 * This has the same side effects and return value as
2211 * iov_iter_copy_from_user_atomic(). 2177 * iov_iter_copy_from_user_atomic().
2212 * The difference is that it attempts to resolve faults. 2178 * The difference is that it attempts to resolve faults.
2213 * Page must not be locked. 2179 * Page must not be locked.
2214 */ 2180 */
2215 size_t iov_iter_copy_from_user(struct page *page, 2181 size_t iov_iter_copy_from_user(struct page *page,
2216 struct iov_iter *i, unsigned long offset, size_t bytes) 2182 struct iov_iter *i, unsigned long offset, size_t bytes)
2217 { 2183 {
2218 char *kaddr; 2184 char *kaddr;
2219 size_t copied; 2185 size_t copied;
2220 2186
2221 kaddr = kmap(page); 2187 kaddr = kmap(page);
2222 if (likely(i->nr_segs == 1)) { 2188 if (likely(i->nr_segs == 1)) {
2223 int left; 2189 int left;
2224 char __user *buf = i->iov->iov_base + i->iov_offset; 2190 char __user *buf = i->iov->iov_base + i->iov_offset;
2225 left = __copy_from_user(kaddr + offset, buf, bytes); 2191 left = __copy_from_user(kaddr + offset, buf, bytes);
2226 copied = bytes - left; 2192 copied = bytes - left;
2227 } else { 2193 } else {
2228 copied = __iovec_copy_from_user_inatomic(kaddr + offset, 2194 copied = __iovec_copy_from_user_inatomic(kaddr + offset,
2229 i->iov, i->iov_offset, bytes); 2195 i->iov, i->iov_offset, bytes);
2230 } 2196 }
2231 kunmap(page); 2197 kunmap(page);
2232 return copied; 2198 return copied;
2233 } 2199 }
2234 EXPORT_SYMBOL(iov_iter_copy_from_user); 2200 EXPORT_SYMBOL(iov_iter_copy_from_user);
2235 2201
2236 void iov_iter_advance(struct iov_iter *i, size_t bytes) 2202 void iov_iter_advance(struct iov_iter *i, size_t bytes)
2237 { 2203 {
2238 BUG_ON(i->count < bytes); 2204 BUG_ON(i->count < bytes);
2239 2205
2240 if (likely(i->nr_segs == 1)) { 2206 if (likely(i->nr_segs == 1)) {
2241 i->iov_offset += bytes; 2207 i->iov_offset += bytes;
2242 i->count -= bytes; 2208 i->count -= bytes;
2243 } else { 2209 } else {
2244 const struct iovec *iov = i->iov; 2210 const struct iovec *iov = i->iov;
2245 size_t base = i->iov_offset; 2211 size_t base = i->iov_offset;
2246 unsigned long nr_segs = i->nr_segs; 2212 unsigned long nr_segs = i->nr_segs;
2247 2213
2248 /* 2214 /*
2249 * The !iov->iov_len check ensures we skip over unlikely 2215 * The !iov->iov_len check ensures we skip over unlikely
2250 * zero-length segments (without overrunning the iovec). 2216 * zero-length segments (without overrunning the iovec).
2251 */ 2217 */
2252 while (bytes || unlikely(i->count && !iov->iov_len)) { 2218 while (bytes || unlikely(i->count && !iov->iov_len)) {
2253 int copy; 2219 int copy;
2254 2220
2255 copy = min(bytes, iov->iov_len - base); 2221 copy = min(bytes, iov->iov_len - base);
2256 BUG_ON(!i->count || i->count < copy); 2222 BUG_ON(!i->count || i->count < copy);
2257 i->count -= copy; 2223 i->count -= copy;
2258 bytes -= copy; 2224 bytes -= copy;
2259 base += copy; 2225 base += copy;
2260 if (iov->iov_len == base) { 2226 if (iov->iov_len == base) {
2261 iov++; 2227 iov++;
2262 nr_segs--; 2228 nr_segs--;
2263 base = 0; 2229 base = 0;
2264 } 2230 }
2265 } 2231 }
2266 i->iov = iov; 2232 i->iov = iov;
2267 i->iov_offset = base; 2233 i->iov_offset = base;
2268 i->nr_segs = nr_segs; 2234 i->nr_segs = nr_segs;
2269 } 2235 }
2270 } 2236 }
2271 EXPORT_SYMBOL(iov_iter_advance); 2237 EXPORT_SYMBOL(iov_iter_advance);
2272 2238
2273 /* 2239 /*
2274 * Fault in the first iovec of the given iov_iter, to a maximum length 2240 * Fault in the first iovec of the given iov_iter, to a maximum length
2275 * of bytes. Returns 0 on success, or non-zero if the memory could not be 2241 * of bytes. Returns 0 on success, or non-zero if the memory could not be
2276 * accessed (ie. because it is an invalid address). 2242 * accessed (ie. because it is an invalid address).
2277 * 2243 *
2278 * writev-intensive code may want this to prefault several iovecs -- that 2244 * writev-intensive code may want this to prefault several iovecs -- that
2279 * would be possible (callers must not rely on the fact that _only_ the 2245 * would be possible (callers must not rely on the fact that _only_ the
2280 * first iovec will be faulted with the current implementation). 2246 * first iovec will be faulted with the current implementation).
2281 */ 2247 */
2282 int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes) 2248 int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
2283 { 2249 {
2284 char __user *buf = i->iov->iov_base + i->iov_offset; 2250 char __user *buf = i->iov->iov_base + i->iov_offset;
2285 bytes = min(bytes, i->iov->iov_len - i->iov_offset); 2251 bytes = min(bytes, i->iov->iov_len - i->iov_offset);
2286 return fault_in_pages_readable(buf, bytes); 2252 return fault_in_pages_readable(buf, bytes);
2287 } 2253 }
2288 EXPORT_SYMBOL(iov_iter_fault_in_readable); 2254 EXPORT_SYMBOL(iov_iter_fault_in_readable);
2289 2255
2290 /* 2256 /*
2291 * Return the count of just the current iov_iter segment. 2257 * Return the count of just the current iov_iter segment.
2292 */ 2258 */
2293 size_t iov_iter_single_seg_count(const struct iov_iter *i) 2259 size_t iov_iter_single_seg_count(const struct iov_iter *i)
2294 { 2260 {
2295 const struct iovec *iov = i->iov; 2261 const struct iovec *iov = i->iov;
2296 if (i->nr_segs == 1) 2262 if (i->nr_segs == 1)
2297 return i->count; 2263 return i->count;
2298 else 2264 else
2299 return min(i->count, iov->iov_len - i->iov_offset); 2265 return min(i->count, iov->iov_len - i->iov_offset);
2300 } 2266 }
2301 EXPORT_SYMBOL(iov_iter_single_seg_count); 2267 EXPORT_SYMBOL(iov_iter_single_seg_count);
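The iov_iter helpers above are meant to be used together as a prefault/copy/advance sequence. The sketch below (function name made up) is illustrative only and condenses the pattern that generic_perform_write() implements in full further down:

	/*
	 * Illustrative only: prefault the user buffer, copy what we can
	 * atomically, then advance the iterator by the amount copied.
	 */
	static ssize_t example_copy_iov_to_page(struct page *page, struct iov_iter *i,
						unsigned long offset, size_t bytes)
	{
		size_t copied;

		/* Must not be done with the destination page locked (deadlock). */
		if (iov_iter_fault_in_readable(i, bytes))
			return -EFAULT;

		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
		iov_iter_advance(i, copied);

		/*
		 * copied == 0 means the prefaulted data went away again; callers
		 * such as generic_perform_write() retry with a limit of
		 * iov_iter_single_seg_count() to avoid livelocking.
		 */
		return copied;
	}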
2302 2268
2303 /* 2269 /*
2304 * Performs necessary checks before doing a write 2270 * Performs necessary checks before doing a write
2305 * 2271 *
2306 * Can adjust writing position or amount of bytes to write. 2272 * Can adjust writing position or amount of bytes to write.
2307 * Returns the appropriate error code that the caller should return, or 2273 * Returns the appropriate error code that the caller should return, or
2308 * zero if the write should be allowed. 2274 * zero if the write should be allowed.
2309 */ 2275 */
2310 inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk) 2276 inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk)
2311 { 2277 {
2312 struct inode *inode = file->f_mapping->host; 2278 struct inode *inode = file->f_mapping->host;
2313 unsigned long limit = rlimit(RLIMIT_FSIZE); 2279 unsigned long limit = rlimit(RLIMIT_FSIZE);
2314 2280
2315 if (unlikely(*pos < 0)) 2281 if (unlikely(*pos < 0))
2316 return -EINVAL; 2282 return -EINVAL;
2317 2283
2318 if (!isblk) { 2284 if (!isblk) {
2319 /* FIXME: this is for backwards compatibility with 2.4 */ 2285 /* FIXME: this is for backwards compatibility with 2.4 */
2320 if (file->f_flags & O_APPEND) 2286 if (file->f_flags & O_APPEND)
2321 *pos = i_size_read(inode); 2287 *pos = i_size_read(inode);
2322 2288
2323 if (limit != RLIM_INFINITY) { 2289 if (limit != RLIM_INFINITY) {
2324 if (*pos >= limit) { 2290 if (*pos >= limit) {
2325 send_sig(SIGXFSZ, current, 0); 2291 send_sig(SIGXFSZ, current, 0);
2326 return -EFBIG; 2292 return -EFBIG;
2327 } 2293 }
2328 if (*count > limit - (typeof(limit))*pos) { 2294 if (*count > limit - (typeof(limit))*pos) {
2329 *count = limit - (typeof(limit))*pos; 2295 *count = limit - (typeof(limit))*pos;
2330 } 2296 }
2331 } 2297 }
2332 } 2298 }
2333 2299
2334 /* 2300 /*
2335 * LFS rule 2301 * LFS rule
2336 */ 2302 */
2337 if (unlikely(*pos + *count > MAX_NON_LFS && 2303 if (unlikely(*pos + *count > MAX_NON_LFS &&
2338 !(file->f_flags & O_LARGEFILE))) { 2304 !(file->f_flags & O_LARGEFILE))) {
2339 if (*pos >= MAX_NON_LFS) { 2305 if (*pos >= MAX_NON_LFS) {
2340 return -EFBIG; 2306 return -EFBIG;
2341 } 2307 }
2342 if (*count > MAX_NON_LFS - (unsigned long)*pos) { 2308 if (*count > MAX_NON_LFS - (unsigned long)*pos) {
2343 *count = MAX_NON_LFS - (unsigned long)*pos; 2309 *count = MAX_NON_LFS - (unsigned long)*pos;
2344 } 2310 }
2345 } 2311 }
2346 2312
2347 /* 2313 /*
2348 * Are we about to exceed the fs block limit ? 2314 * Are we about to exceed the fs block limit ?
2349 * 2315 *
2350 * If we have written data it becomes a short write. If we have 2316 * If we have written data it becomes a short write. If we have
2351 * exceeded without writing data we send a signal and return EFBIG. 2317 * exceeded without writing data we send a signal and return EFBIG.
2352 * Linus frestrict idea will clean these up nicely.. 2318 * Linus frestrict idea will clean these up nicely..
2353 */ 2319 */
2354 if (likely(!isblk)) { 2320 if (likely(!isblk)) {
2355 if (unlikely(*pos >= inode->i_sb->s_maxbytes)) { 2321 if (unlikely(*pos >= inode->i_sb->s_maxbytes)) {
2356 if (*count || *pos > inode->i_sb->s_maxbytes) { 2322 if (*count || *pos > inode->i_sb->s_maxbytes) {
2357 return -EFBIG; 2323 return -EFBIG;
2358 } 2324 }
2359 /* zero-length writes at ->s_maxbytes are OK */ 2325 /* zero-length writes at ->s_maxbytes are OK */
2360 } 2326 }
2361 2327
2362 if (unlikely(*pos + *count > inode->i_sb->s_maxbytes)) 2328 if (unlikely(*pos + *count > inode->i_sb->s_maxbytes))
2363 *count = inode->i_sb->s_maxbytes - *pos; 2329 *count = inode->i_sb->s_maxbytes - *pos;
2364 } else { 2330 } else {
2365 #ifdef CONFIG_BLOCK 2331 #ifdef CONFIG_BLOCK
2366 loff_t isize; 2332 loff_t isize;
2367 if (bdev_read_only(I_BDEV(inode))) 2333 if (bdev_read_only(I_BDEV(inode)))
2368 return -EPERM; 2334 return -EPERM;
2369 isize = i_size_read(inode); 2335 isize = i_size_read(inode);
2370 if (*pos >= isize) { 2336 if (*pos >= isize) {
2371 if (*count || *pos > isize) 2337 if (*count || *pos > isize)
2372 return -ENOSPC; 2338 return -ENOSPC;
2373 } 2339 }
2374 2340
2375 if (*pos + *count > isize) 2341 if (*pos + *count > isize)
2376 *count = isize - *pos; 2342 *count = isize - *pos;
2377 #else 2343 #else
2378 return -EPERM; 2344 return -EPERM;
2379 #endif 2345 #endif
2380 } 2346 }
2381 return 0; 2347 return 0;
2382 } 2348 }
2383 EXPORT_SYMBOL(generic_write_checks); 2349 EXPORT_SYMBOL(generic_write_checks);
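A hedged sketch of a typical caller (the function name is hypothetical). For a regular file isblk is 0, and the caller must honour both the possibly moved position and the possibly shrunken count:

	/*
	 * Illustrative only: generic_write_checks() may move *pos (O_APPEND),
	 * shrink *count (RLIMIT_FSIZE, MAX_NON_LFS, s_maxbytes) or fail outright.
	 */
	static int example_prepare_write(struct file *file, loff_t *pos, size_t *count)
	{
		int err = generic_write_checks(file, pos, count, 0 /* not a blkdev */);

		if (err)
			return err;	/* e.g. -EINVAL or -EFBIG */
		if (*count == 0)
			return 0;	/* nothing left to write after clamping */

		/* ... continue with the buffered or direct write path ... */
		return 0;
	}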
2384 2350
2385 int pagecache_write_begin(struct file *file, struct address_space *mapping, 2351 int pagecache_write_begin(struct file *file, struct address_space *mapping,
2386 loff_t pos, unsigned len, unsigned flags, 2352 loff_t pos, unsigned len, unsigned flags,
2387 struct page **pagep, void **fsdata) 2353 struct page **pagep, void **fsdata)
2388 { 2354 {
2389 const struct address_space_operations *aops = mapping->a_ops; 2355 const struct address_space_operations *aops = mapping->a_ops;
2390 2356
2391 return aops->write_begin(file, mapping, pos, len, flags, 2357 return aops->write_begin(file, mapping, pos, len, flags,
2392 pagep, fsdata); 2358 pagep, fsdata);
2393 } 2359 }
2394 EXPORT_SYMBOL(pagecache_write_begin); 2360 EXPORT_SYMBOL(pagecache_write_begin);
2395 2361
2396 int pagecache_write_end(struct file *file, struct address_space *mapping, 2362 int pagecache_write_end(struct file *file, struct address_space *mapping,
2397 loff_t pos, unsigned len, unsigned copied, 2363 loff_t pos, unsigned len, unsigned copied,
2398 struct page *page, void *fsdata) 2364 struct page *page, void *fsdata)
2399 { 2365 {
2400 const struct address_space_operations *aops = mapping->a_ops; 2366 const struct address_space_operations *aops = mapping->a_ops;
2401 2367
2402 mark_page_accessed(page);
2403 return aops->write_end(file, mapping, pos, len, copied, page, fsdata); 2368 return aops->write_end(file, mapping, pos, len, copied, page, fsdata);
2404 } 2369 }
2405 EXPORT_SYMBOL(pagecache_write_end); 2370 EXPORT_SYMBOL(pagecache_write_end);
2406 2371
2407 ssize_t 2372 ssize_t
2408 generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov, 2373 generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
2409 unsigned long *nr_segs, loff_t pos, loff_t *ppos, 2374 unsigned long *nr_segs, loff_t pos, loff_t *ppos,
2410 size_t count, size_t ocount) 2375 size_t count, size_t ocount)
2411 { 2376 {
2412 struct file *file = iocb->ki_filp; 2377 struct file *file = iocb->ki_filp;
2413 struct address_space *mapping = file->f_mapping; 2378 struct address_space *mapping = file->f_mapping;
2414 struct inode *inode = mapping->host; 2379 struct inode *inode = mapping->host;
2415 ssize_t written; 2380 ssize_t written;
2416 size_t write_len; 2381 size_t write_len;
2417 pgoff_t end; 2382 pgoff_t end;
2418 2383
2419 if (count != ocount) 2384 if (count != ocount)
2420 *nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count); 2385 *nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count);
2421 2386
2422 write_len = iov_length(iov, *nr_segs); 2387 write_len = iov_length(iov, *nr_segs);
2423 end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT; 2388 end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;
2424 2389
2425 written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1); 2390 written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
2426 if (written) 2391 if (written)
2427 goto out; 2392 goto out;
2428 2393
2429 /* 2394 /*
2430 * After a write we want buffered reads to be sure to go to disk to get 2395 * After a write we want buffered reads to be sure to go to disk to get
2431 * the new data. We invalidate clean cached page from the region we're 2396 * the new data. We invalidate clean cached page from the region we're
2432 * about to write. We do this *before* the write so that we can return 2397 * about to write. We do this *before* the write so that we can return
2433 * without clobbering -EIOCBQUEUED from ->direct_IO(). 2398 * without clobbering -EIOCBQUEUED from ->direct_IO().
2434 */ 2399 */
2435 if (mapping->nrpages) { 2400 if (mapping->nrpages) {
2436 written = invalidate_inode_pages2_range(mapping, 2401 written = invalidate_inode_pages2_range(mapping,
2437 pos >> PAGE_CACHE_SHIFT, end); 2402 pos >> PAGE_CACHE_SHIFT, end);
2438 /* 2403 /*
2439 * If a page can not be invalidated, return 0 to fall back 2404 * If a page can not be invalidated, return 0 to fall back
2440 * to buffered write. 2405 * to buffered write.
2441 */ 2406 */
2442 if (written) { 2407 if (written) {
2443 if (written == -EBUSY) 2408 if (written == -EBUSY)
2444 return 0; 2409 return 0;
2445 goto out; 2410 goto out;
2446 } 2411 }
2447 } 2412 }
2448 2413
2449 written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs); 2414 written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
2450 2415
2451 /* 2416 /*
2452 * Finally, try again to invalidate clean pages which might have been 2417 * Finally, try again to invalidate clean pages which might have been
2453 * cached by non-direct readahead, or faulted in by get_user_pages() 2418 * cached by non-direct readahead, or faulted in by get_user_pages()
2454 * if the source of the write was an mmap'ed region of the file 2419 * if the source of the write was an mmap'ed region of the file
2455 * we're writing. Either one is a pretty crazy thing to do, 2420 * we're writing. Either one is a pretty crazy thing to do,
2456 * so we don't support it 100%. If this invalidation 2421 * so we don't support it 100%. If this invalidation
2457 * fails, tough, the write still worked... 2422 * fails, tough, the write still worked...
2458 */ 2423 */
2459 if (mapping->nrpages) { 2424 if (mapping->nrpages) {
2460 invalidate_inode_pages2_range(mapping, 2425 invalidate_inode_pages2_range(mapping,
2461 pos >> PAGE_CACHE_SHIFT, end); 2426 pos >> PAGE_CACHE_SHIFT, end);
2462 } 2427 }
2463 2428
2464 if (written > 0) { 2429 if (written > 0) {
2465 pos += written; 2430 pos += written;
2466 if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) { 2431 if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
2467 i_size_write(inode, pos); 2432 i_size_write(inode, pos);
2468 mark_inode_dirty(inode); 2433 mark_inode_dirty(inode);
2469 } 2434 }
2470 *ppos = pos; 2435 *ppos = pos;
2471 } 2436 }
2472 out: 2437 out:
2473 return written; 2438 return written;
2474 } 2439 }
2475 EXPORT_SYMBOL(generic_file_direct_write); 2440 EXPORT_SYMBOL(generic_file_direct_write);
2476 2441
2477 /* 2442 /*
2478 * Find or create a page at the given pagecache position. Return the locked 2443 * Find or create a page at the given pagecache position. Return the locked
2479 * page. This function is specifically for buffered writes. 2444 * page. This function is specifically for buffered writes.
2480 */ 2445 */
2481 struct page *grab_cache_page_write_begin(struct address_space *mapping, 2446 struct page *grab_cache_page_write_begin(struct address_space *mapping,
2482 pgoff_t index, unsigned flags) 2447 pgoff_t index, unsigned flags)
2483 { 2448 {
2484 int status;
2485 gfp_t gfp_mask;
2486 struct page *page; 2449 struct page *page;
2487 gfp_t gfp_notmask = 0; 2450 int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;
2488 2451
2489 gfp_mask = mapping_gfp_mask(mapping);
2490 if (mapping_cap_account_dirty(mapping))
2491 gfp_mask |= __GFP_WRITE;
2492 if (flags & AOP_FLAG_NOFS) 2452 if (flags & AOP_FLAG_NOFS)
2493 gfp_notmask = __GFP_FS; 2453 fgp_flags |= FGP_NOFS;
2494 repeat: 2454
2495 page = find_lock_page(mapping, index); 2455 page = pagecache_get_page(mapping, index, fgp_flags,
2456 mapping_gfp_mask(mapping),
2457 GFP_KERNEL);
2496 if (page) 2458 if (page)
2497 goto found; 2459 wait_for_stable_page(page);
2498 2460
2499 page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
2500 if (!page)
2501 return NULL;
2502 status = add_to_page_cache_lru(page, mapping, index,
2503 GFP_KERNEL & ~gfp_notmask);
2504 if (unlikely(status)) {
2505 page_cache_release(page);
2506 if (status == -EEXIST)
2507 goto repeat;
2508 return NULL;
2509 }
2510 found:
2511 wait_for_stable_page(page);
2512 return page; 2461 return page;
2513 } 2462 }
2514 EXPORT_SYMBOL(grab_cache_page_write_begin); 2463 EXPORT_SYMBOL(grab_cache_page_write_begin);
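Because FGP_ACCESSED is now part of the flags passed to pagecache_get_page(), the page returned here is already marked accessed, so ->write_begin implementations no longer need their own mark_page_accessed() call. The sketch below is illustrative, in the spirit of simple_write_begin(); it is not code from this patch:

	/*
	 * Illustrative ->write_begin: the page comes back locked, waited for
	 * stability and already marked accessed via FGP_ACCESSED.
	 */
	static int example_write_begin(struct file *file, struct address_space *mapping,
				       loff_t pos, unsigned len, unsigned flags,
				       struct page **pagep, void **fsdata)
	{
		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
		struct page *page;

		page = grab_cache_page_write_begin(mapping, index, flags);
		if (!page)
			return -ENOMEM;

		if (!PageUptodate(page) && len != PAGE_CACHE_SIZE) {
			unsigned from = pos & (PAGE_CACHE_SIZE - 1);

			/* Zero the parts the copy will not overwrite. */
			zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
		}

		*pagep = page;
		return 0;
	}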
2515 2464
2516 static ssize_t generic_perform_write(struct file *file, 2465 static ssize_t generic_perform_write(struct file *file,
2517 struct iov_iter *i, loff_t pos) 2466 struct iov_iter *i, loff_t pos)
2518 { 2467 {
2519 struct address_space *mapping = file->f_mapping; 2468 struct address_space *mapping = file->f_mapping;
2520 const struct address_space_operations *a_ops = mapping->a_ops; 2469 const struct address_space_operations *a_ops = mapping->a_ops;
2521 long status = 0; 2470 long status = 0;
2522 ssize_t written = 0; 2471 ssize_t written = 0;
2523 unsigned int flags = 0; 2472 unsigned int flags = 0;
2524 2473
2525 /* 2474 /*
2526 * Copies from kernel address space cannot fail (NFSD is a big user). 2475 * Copies from kernel address space cannot fail (NFSD is a big user).
2527 */ 2476 */
2528 if (segment_eq(get_fs(), KERNEL_DS)) 2477 if (segment_eq(get_fs(), KERNEL_DS))
2529 flags |= AOP_FLAG_UNINTERRUPTIBLE; 2478 flags |= AOP_FLAG_UNINTERRUPTIBLE;
2530 2479
2531 do { 2480 do {
2532 struct page *page; 2481 struct page *page;
2533 unsigned long offset; /* Offset into pagecache page */ 2482 unsigned long offset; /* Offset into pagecache page */
2534 unsigned long bytes; /* Bytes to write to page */ 2483 unsigned long bytes; /* Bytes to write to page */
2535 size_t copied; /* Bytes copied from user */ 2484 size_t copied; /* Bytes copied from user */
2536 void *fsdata; 2485 void *fsdata;
2537 2486
2538 offset = (pos & (PAGE_CACHE_SIZE - 1)); 2487 offset = (pos & (PAGE_CACHE_SIZE - 1));
2539 bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, 2488 bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
2540 iov_iter_count(i)); 2489 iov_iter_count(i));
2541 2490
2542 again: 2491 again:
2543 /* 2492 /*
2544 * Bring in the user page that we will copy from _first_. 2493 * Bring in the user page that we will copy from _first_.
2545 * Otherwise there's a nasty deadlock on copying from the 2494 * Otherwise there's a nasty deadlock on copying from the
2546 * same page as we're writing to, without it being marked 2495 * same page as we're writing to, without it being marked
2547 * up-to-date. 2496 * up-to-date.
2548 * 2497 *
2549 * Not only is this an optimisation, but it is also required 2498 * Not only is this an optimisation, but it is also required
2550 * to check that the address is actually valid, when atomic 2499 * to check that the address is actually valid, when atomic
2551 * usercopies are used, below. 2500 * usercopies are used, below.
2552 */ 2501 */
2553 if (unlikely(iov_iter_fault_in_readable(i, bytes))) { 2502 if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
2554 status = -EFAULT; 2503 status = -EFAULT;
2555 break; 2504 break;
2556 } 2505 }
2557 2506
2558 status = a_ops->write_begin(file, mapping, pos, bytes, flags, 2507 status = a_ops->write_begin(file, mapping, pos, bytes, flags,
2559 &page, &fsdata); 2508 &page, &fsdata);
2560 if (unlikely(status)) 2509 if (unlikely(status < 0))
2561 break; 2510 break;
2562 2511
2563 if (mapping_writably_mapped(mapping)) 2512 if (mapping_writably_mapped(mapping))
2564 flush_dcache_page(page); 2513 flush_dcache_page(page);
2565 2514
2566 copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); 2515 copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
2567 flush_dcache_page(page); 2516 flush_dcache_page(page);
2568 2517
2569 mark_page_accessed(page);
2570 status = a_ops->write_end(file, mapping, pos, bytes, copied, 2518 status = a_ops->write_end(file, mapping, pos, bytes, copied,
2571 page, fsdata); 2519 page, fsdata);
2572 if (unlikely(status < 0)) 2520 if (unlikely(status < 0))
2573 break; 2521 break;
2574 copied = status; 2522 copied = status;
2575 2523
2576 cond_resched(); 2524 cond_resched();
2577 2525
2578 iov_iter_advance(i, copied); 2526 iov_iter_advance(i, copied);
2579 if (unlikely(copied == 0)) { 2527 if (unlikely(copied == 0)) {
2580 /* 2528 /*
2581 * If we were unable to copy any data at all, we must 2529 * If we were unable to copy any data at all, we must
2582 * fall back to a single segment length write. 2530 * fall back to a single segment length write.
2583 * 2531 *
2584 * If we didn't fallback here, we could livelock 2532 * If we didn't fallback here, we could livelock
2585 * because not all segments in the iov can be copied at 2533 * because not all segments in the iov can be copied at
2586 * once without a pagefault. 2534 * once without a pagefault.
2587 */ 2535 */
2588 bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, 2536 bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
2589 iov_iter_single_seg_count(i)); 2537 iov_iter_single_seg_count(i));
2590 goto again; 2538 goto again;
2591 } 2539 }
2592 pos += copied; 2540 pos += copied;
2593 written += copied; 2541 written += copied;
2594 2542
2595 balance_dirty_pages_ratelimited(mapping); 2543 balance_dirty_pages_ratelimited(mapping);
2596 if (fatal_signal_pending(current)) { 2544 if (fatal_signal_pending(current)) {
2597 status = -EINTR; 2545 status = -EINTR;
2598 break; 2546 break;
2599 } 2547 }
2600 } while (iov_iter_count(i)); 2548 } while (iov_iter_count(i));
2601 2549
2602 return written ? written : status; 2550 return written ? written : status;
2603 } 2551 }
2604 2552
2605 ssize_t 2553 ssize_t
2606 generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov, 2554 generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
2607 unsigned long nr_segs, loff_t pos, loff_t *ppos, 2555 unsigned long nr_segs, loff_t pos, loff_t *ppos,
2608 size_t count, ssize_t written) 2556 size_t count, ssize_t written)
2609 { 2557 {
2610 struct file *file = iocb->ki_filp; 2558 struct file *file = iocb->ki_filp;
2611 ssize_t status; 2559 ssize_t status;
2612 struct iov_iter i; 2560 struct iov_iter i;
2613 2561
2614 iov_iter_init(&i, iov, nr_segs, count, written); 2562 iov_iter_init(&i, iov, nr_segs, count, written);
2615 status = generic_perform_write(file, &i, pos); 2563 status = generic_perform_write(file, &i, pos);
2616 2564
2617 if (likely(status >= 0)) { 2565 if (likely(status >= 0)) {
2618 written += status; 2566 written += status;
2619 *ppos = pos + status; 2567 *ppos = pos + status;
2620 } 2568 }
2621 2569
2622 return written ? written : status; 2570 return written ? written : status;
2623 } 2571 }
2624 EXPORT_SYMBOL(generic_file_buffered_write); 2572 EXPORT_SYMBOL(generic_file_buffered_write);
2625 2573
2626 /** 2574 /**
2627 * __generic_file_aio_write - write data to a file 2575 * __generic_file_aio_write - write data to a file
2628 * @iocb: IO state structure (file, offset, etc.) 2576 * @iocb: IO state structure (file, offset, etc.)
2629 * @iov: vector with data to write 2577 * @iov: vector with data to write
2630 * @nr_segs: number of segments in the vector 2578 * @nr_segs: number of segments in the vector
2631 * @ppos: position where to write 2579 * @ppos: position where to write
2632 * 2580 *
2633 * This function does all the work needed for actually writing data to a 2581 * This function does all the work needed for actually writing data to a
2634 * file. It does all basic checks, removes SUID from the file, updates 2582 * file. It does all basic checks, removes SUID from the file, updates
2635 * modification times and calls proper subroutines depending on whether we 2583 * modification times and calls proper subroutines depending on whether we
2636 * do direct IO or a standard buffered write. 2584 * do direct IO or a standard buffered write.
2637 * 2585 *
2638 * It expects i_mutex to be grabbed unless we work on a block device or similar 2586 * It expects i_mutex to be grabbed unless we work on a block device or similar
2639 * object which does not need locking at all. 2587 * object which does not need locking at all.
2640 * 2588 *
2641 * This function does *not* take care of syncing data in case of O_SYNC write. 2589 * This function does *not* take care of syncing data in case of O_SYNC write.
2642 * A caller has to handle it. This is mainly due to the fact that we want to 2590 * A caller has to handle it. This is mainly due to the fact that we want to
2643 * avoid syncing under i_mutex. 2591 * avoid syncing under i_mutex.
2644 */ 2592 */
2645 ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, 2593 ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
2646 unsigned long nr_segs, loff_t *ppos) 2594 unsigned long nr_segs, loff_t *ppos)
2647 { 2595 {
2648 struct file *file = iocb->ki_filp; 2596 struct file *file = iocb->ki_filp;
2649 struct address_space * mapping = file->f_mapping; 2597 struct address_space * mapping = file->f_mapping;
2650 size_t ocount; /* original count */ 2598 size_t ocount; /* original count */
2651 size_t count; /* after file limit checks */ 2599 size_t count; /* after file limit checks */
2652 struct inode *inode = mapping->host; 2600 struct inode *inode = mapping->host;
2653 loff_t pos; 2601 loff_t pos;
2654 ssize_t written; 2602 ssize_t written;
2655 ssize_t err; 2603 ssize_t err;
2656 2604
2657 ocount = 0; 2605 ocount = 0;
2658 err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ); 2606 err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
2659 if (err) 2607 if (err)
2660 return err; 2608 return err;
2661 2609
2662 count = ocount; 2610 count = ocount;
2663 pos = *ppos; 2611 pos = *ppos;
2664 2612
2665 /* We can write back this queue in page reclaim */ 2613 /* We can write back this queue in page reclaim */
2666 current->backing_dev_info = mapping->backing_dev_info; 2614 current->backing_dev_info = mapping->backing_dev_info;
2667 written = 0; 2615 written = 0;
2668 2616
2669 err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); 2617 err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
2670 if (err) 2618 if (err)
2671 goto out; 2619 goto out;
2672 2620
2673 if (count == 0) 2621 if (count == 0)
2674 goto out; 2622 goto out;
2675 2623
2676 err = file_remove_suid(file); 2624 err = file_remove_suid(file);
2677 if (err) 2625 if (err)
2678 goto out; 2626 goto out;
2679 2627
2680 err = file_update_time(file); 2628 err = file_update_time(file);
2681 if (err) 2629 if (err)
2682 goto out; 2630 goto out;
2683 2631
2684 /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ 2632 /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
2685 if (unlikely(file->f_flags & O_DIRECT)) { 2633 if (unlikely(file->f_flags & O_DIRECT)) {
2686 loff_t endbyte; 2634 loff_t endbyte;
2687 ssize_t written_buffered; 2635 ssize_t written_buffered;
2688 2636
2689 written = generic_file_direct_write(iocb, iov, &nr_segs, pos, 2637 written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
2690 ppos, count, ocount); 2638 ppos, count, ocount);
2691 if (written < 0 || written == count) 2639 if (written < 0 || written == count)
2692 goto out; 2640 goto out;
2693 /* 2641 /*
2694 * direct-io write to a hole: fall through to buffered I/O 2642 * direct-io write to a hole: fall through to buffered I/O
2695 * for completing the rest of the request. 2643 * for completing the rest of the request.
2696 */ 2644 */
2697 pos += written; 2645 pos += written;
2698 count -= written; 2646 count -= written;
2699 written_buffered = generic_file_buffered_write(iocb, iov, 2647 written_buffered = generic_file_buffered_write(iocb, iov,
2700 nr_segs, pos, ppos, count, 2648 nr_segs, pos, ppos, count,
2701 written); 2649 written);
2702 /* 2650 /*
2703 * If generic_file_buffered_write() returned a synchronous error 2651 * If generic_file_buffered_write() returned a synchronous error
2704 * then we want to return the number of bytes which were 2652 * then we want to return the number of bytes which were
2705 * direct-written, or the error code if that was zero. Note 2653 * direct-written, or the error code if that was zero. Note
2706 * that this differs from normal direct-io semantics, which 2654 * that this differs from normal direct-io semantics, which
2707 * will return -EFOO even if some bytes were written. 2655 * will return -EFOO even if some bytes were written.
2708 */ 2656 */
2709 if (written_buffered < 0) { 2657 if (written_buffered < 0) {
2710 err = written_buffered; 2658 err = written_buffered;
2711 goto out; 2659 goto out;
2712 } 2660 }
2713 2661
2714 /* 2662 /*
2715 * We need to ensure that the page cache pages are written to 2663 * We need to ensure that the page cache pages are written to
2716 * disk and invalidated to preserve the expected O_DIRECT 2664 * disk and invalidated to preserve the expected O_DIRECT
2717 * semantics. 2665 * semantics.
2718 */ 2666 */
2719 endbyte = pos + written_buffered - written - 1; 2667 endbyte = pos + written_buffered - written - 1;
2720 err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte); 2668 err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
2721 if (err == 0) { 2669 if (err == 0) {
2722 written = written_buffered; 2670 written = written_buffered;
2723 invalidate_mapping_pages(mapping, 2671 invalidate_mapping_pages(mapping,
2724 pos >> PAGE_CACHE_SHIFT, 2672 pos >> PAGE_CACHE_SHIFT,
2725 endbyte >> PAGE_CACHE_SHIFT); 2673 endbyte >> PAGE_CACHE_SHIFT);
2726 } else { 2674 } else {
2727 /* 2675 /*
2728 * We don't know how much we wrote, so just return 2676 * We don't know how much we wrote, so just return
2729 * the number of bytes which were direct-written 2677 * the number of bytes which were direct-written
2730 */ 2678 */
2731 } 2679 }
2732 } else { 2680 } else {
2733 written = generic_file_buffered_write(iocb, iov, nr_segs, 2681 written = generic_file_buffered_write(iocb, iov, nr_segs,
2734 pos, ppos, count, written); 2682 pos, ppos, count, written);
2735 } 2683 }
2736 out: 2684 out:
2737 current->backing_dev_info = NULL; 2685 current->backing_dev_info = NULL;
2738 return written ? written : err; 2686 return written ? written : err;
2739 } 2687 }
2740 EXPORT_SYMBOL(__generic_file_aio_write); 2688 EXPORT_SYMBOL(__generic_file_aio_write);
2741 2689
2742 /** 2690 /**
2743 * generic_file_aio_write - write data to a file 2691 * generic_file_aio_write - write data to a file
2744 * @iocb: IO state structure 2692 * @iocb: IO state structure
2745 * @iov: vector with data to write 2693 * @iov: vector with data to write
2746 * @nr_segs: number of segments in the vector 2694 * @nr_segs: number of segments in the vector
2747 * @pos: position in file where to write 2695 * @pos: position in file where to write
2748 * 2696 *
2749 * This is a wrapper around __generic_file_aio_write() to be used by most 2697 * This is a wrapper around __generic_file_aio_write() to be used by most
2750 * filesystems. It takes care of syncing the file in case of O_SYNC file 2698 * filesystems. It takes care of syncing the file in case of O_SYNC file
2751 * and acquires i_mutex as needed. 2699 * and acquires i_mutex as needed.
2752 */ 2700 */
2753 ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, 2701 ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
2754 unsigned long nr_segs, loff_t pos) 2702 unsigned long nr_segs, loff_t pos)
2755 { 2703 {
2756 struct file *file = iocb->ki_filp; 2704 struct file *file = iocb->ki_filp;
2757 struct inode *inode = file->f_mapping->host; 2705 struct inode *inode = file->f_mapping->host;
2758 ssize_t ret; 2706 ssize_t ret;
2759 2707
2760 BUG_ON(iocb->ki_pos != pos); 2708 BUG_ON(iocb->ki_pos != pos);
2761 2709
2762 mutex_lock(&inode->i_mutex); 2710 mutex_lock(&inode->i_mutex);
2763 ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); 2711 ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
2764 mutex_unlock(&inode->i_mutex); 2712 mutex_unlock(&inode->i_mutex);
2765 2713
2766 if (ret > 0) { 2714 if (ret > 0) {
1 /* 1 /*
2 * Resizable virtual memory filesystem for Linux. 2 * Resizable virtual memory filesystem for Linux.
3 * 3 *
4 * Copyright (C) 2000 Linus Torvalds. 4 * Copyright (C) 2000 Linus Torvalds.
5 * 2000 Transmeta Corp. 5 * 2000 Transmeta Corp.
6 * 2000-2001 Christoph Rohland 6 * 2000-2001 Christoph Rohland
7 * 2000-2001 SAP AG 7 * 2000-2001 SAP AG
8 * 2002 Red Hat Inc. 8 * 2002 Red Hat Inc.
9 * Copyright (C) 2002-2011 Hugh Dickins. 9 * Copyright (C) 2002-2011 Hugh Dickins.
10 * Copyright (C) 2011 Google Inc. 10 * Copyright (C) 2011 Google Inc.
11 * Copyright (C) 2002-2005 VERITAS Software Corporation. 11 * Copyright (C) 2002-2005 VERITAS Software Corporation.
12 * Copyright (C) 2004 Andi Kleen, SuSE Labs 12 * Copyright (C) 2004 Andi Kleen, SuSE Labs
13 * 13 *
14 * Extended attribute support for tmpfs: 14 * Extended attribute support for tmpfs:
15 * Copyright (c) 2004, Luke Kenneth Casson Leighton <lkcl@lkcl.net> 15 * Copyright (c) 2004, Luke Kenneth Casson Leighton <lkcl@lkcl.net>
16 * Copyright (c) 2004 Red Hat, Inc., James Morris <jmorris@redhat.com> 16 * Copyright (c) 2004 Red Hat, Inc., James Morris <jmorris@redhat.com>
17 * 17 *
18 * tiny-shmem: 18 * tiny-shmem:
19 * Copyright (c) 2004, 2008 Matt Mackall <mpm@selenic.com> 19 * Copyright (c) 2004, 2008 Matt Mackall <mpm@selenic.com>
20 * 20 *
21 * This file is released under the GPL. 21 * This file is released under the GPL.
22 */ 22 */
23 23
24 #include <linux/fs.h> 24 #include <linux/fs.h>
25 #include <linux/init.h> 25 #include <linux/init.h>
26 #include <linux/vfs.h> 26 #include <linux/vfs.h>
27 #include <linux/mount.h> 27 #include <linux/mount.h>
28 #include <linux/ramfs.h> 28 #include <linux/ramfs.h>
29 #include <linux/pagemap.h> 29 #include <linux/pagemap.h>
30 #include <linux/file.h> 30 #include <linux/file.h>
31 #include <linux/mm.h> 31 #include <linux/mm.h>
32 #include <linux/export.h> 32 #include <linux/export.h>
33 #include <linux/swap.h> 33 #include <linux/swap.h>
34 #include <linux/aio.h> 34 #include <linux/aio.h>
35 35
36 static struct vfsmount *shm_mnt; 36 static struct vfsmount *shm_mnt;
37 37
38 #ifdef CONFIG_SHMEM 38 #ifdef CONFIG_SHMEM
39 /* 39 /*
40 * This virtual memory filesystem is heavily based on the ramfs. It 40 * This virtual memory filesystem is heavily based on the ramfs. It
41 * extends ramfs by the ability to use swap and honor resource limits 41 * extends ramfs by the ability to use swap and honor resource limits
42 * which makes it a completely usable filesystem. 42 * which makes it a completely usable filesystem.
43 */ 43 */
44 44
45 #include <linux/xattr.h> 45 #include <linux/xattr.h>
46 #include <linux/exportfs.h> 46 #include <linux/exportfs.h>
47 #include <linux/posix_acl.h> 47 #include <linux/posix_acl.h>
48 #include <linux/generic_acl.h> 48 #include <linux/generic_acl.h>
49 #include <linux/mman.h> 49 #include <linux/mman.h>
50 #include <linux/string.h> 50 #include <linux/string.h>
51 #include <linux/slab.h> 51 #include <linux/slab.h>
52 #include <linux/backing-dev.h> 52 #include <linux/backing-dev.h>
53 #include <linux/shmem_fs.h> 53 #include <linux/shmem_fs.h>
54 #include <linux/writeback.h> 54 #include <linux/writeback.h>
55 #include <linux/blkdev.h> 55 #include <linux/blkdev.h>
56 #include <linux/pagevec.h> 56 #include <linux/pagevec.h>
57 #include <linux/percpu_counter.h> 57 #include <linux/percpu_counter.h>
58 #include <linux/falloc.h> 58 #include <linux/falloc.h>
59 #include <linux/splice.h> 59 #include <linux/splice.h>
60 #include <linux/security.h> 60 #include <linux/security.h>
61 #include <linux/swapops.h> 61 #include <linux/swapops.h>
62 #include <linux/mempolicy.h> 62 #include <linux/mempolicy.h>
63 #include <linux/namei.h> 63 #include <linux/namei.h>
64 #include <linux/ctype.h> 64 #include <linux/ctype.h>
65 #include <linux/migrate.h> 65 #include <linux/migrate.h>
66 #include <linux/highmem.h> 66 #include <linux/highmem.h>
67 #include <linux/seq_file.h> 67 #include <linux/seq_file.h>
68 #include <linux/magic.h> 68 #include <linux/magic.h>
69 69
70 #include <asm/uaccess.h> 70 #include <asm/uaccess.h>
71 #include <asm/pgtable.h> 71 #include <asm/pgtable.h>
72 72
73 #define BLOCKS_PER_PAGE (PAGE_CACHE_SIZE/512) 73 #define BLOCKS_PER_PAGE (PAGE_CACHE_SIZE/512)
74 #define VM_ACCT(size) (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT) 74 #define VM_ACCT(size) (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
75 75
76 /* Pretend that each entry is of this size in directory's i_size */ 76 /* Pretend that each entry is of this size in directory's i_size */
77 #define BOGO_DIRENT_SIZE 20 77 #define BOGO_DIRENT_SIZE 20
78 78
79 /* Symlink up to this size is kmalloc'ed instead of using a swappable page */ 79 /* Symlink up to this size is kmalloc'ed instead of using a swappable page */
80 #define SHORT_SYMLINK_LEN 128 80 #define SHORT_SYMLINK_LEN 128
81 81
82 /* 82 /*
83 * shmem_fallocate communicates with shmem_fault or shmem_writepage via 83 * shmem_fallocate communicates with shmem_fault or shmem_writepage via
84 * inode->i_private (with i_mutex making sure that it has only one user at 84 * inode->i_private (with i_mutex making sure that it has only one user at
85 * a time): we would prefer not to enlarge the shmem inode just for that. 85 * a time): we would prefer not to enlarge the shmem inode just for that.
86 */ 86 */
87 struct shmem_falloc { 87 struct shmem_falloc {
88 wait_queue_head_t *waitq; /* faults into hole wait for punch to end */ 88 wait_queue_head_t *waitq; /* faults into hole wait for punch to end */
89 pgoff_t start; /* start of range currently being fallocated */ 89 pgoff_t start; /* start of range currently being fallocated */
90 pgoff_t next; /* the next page offset to be fallocated */ 90 pgoff_t next; /* the next page offset to be fallocated */
91 pgoff_t nr_falloced; /* how many new pages have been fallocated */ 91 pgoff_t nr_falloced; /* how many new pages have been fallocated */
92 pgoff_t nr_unswapped; /* how often writepage refused to swap out */ 92 pgoff_t nr_unswapped; /* how often writepage refused to swap out */
93 }; 93 };
94 94
95 /* Flag allocation requirements to shmem_getpage */ 95 /* Flag allocation requirements to shmem_getpage */
96 enum sgp_type { 96 enum sgp_type {
97 SGP_READ, /* don't exceed i_size, don't allocate page */ 97 SGP_READ, /* don't exceed i_size, don't allocate page */
98 SGP_CACHE, /* don't exceed i_size, may allocate page */ 98 SGP_CACHE, /* don't exceed i_size, may allocate page */
99 SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */ 99 SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
100 SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */ 100 SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
101 SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */ 101 SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
102 }; 102 };
103 103
104 #ifdef CONFIG_TMPFS 104 #ifdef CONFIG_TMPFS
105 static unsigned long shmem_default_max_blocks(void) 105 static unsigned long shmem_default_max_blocks(void)
106 { 106 {
107 return totalram_pages / 2; 107 return totalram_pages / 2;
108 } 108 }
109 109
110 static unsigned long shmem_default_max_inodes(void) 110 static unsigned long shmem_default_max_inodes(void)
111 { 111 {
112 return min(totalram_pages - totalhigh_pages, totalram_pages / 2); 112 return min(totalram_pages - totalhigh_pages, totalram_pages / 2);
113 } 113 }
114 #endif 114 #endif
115 115
116 static bool shmem_should_replace_page(struct page *page, gfp_t gfp); 116 static bool shmem_should_replace_page(struct page *page, gfp_t gfp);
117 static int shmem_replace_page(struct page **pagep, gfp_t gfp, 117 static int shmem_replace_page(struct page **pagep, gfp_t gfp,
118 struct shmem_inode_info *info, pgoff_t index); 118 struct shmem_inode_info *info, pgoff_t index);
119 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, 119 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
120 struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type); 120 struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type);
121 121
122 static inline int shmem_getpage(struct inode *inode, pgoff_t index, 122 static inline int shmem_getpage(struct inode *inode, pgoff_t index,
123 struct page **pagep, enum sgp_type sgp, int *fault_type) 123 struct page **pagep, enum sgp_type sgp, int *fault_type)
124 { 124 {
125 return shmem_getpage_gfp(inode, index, pagep, sgp, 125 return shmem_getpage_gfp(inode, index, pagep, sgp,
126 mapping_gfp_mask(inode->i_mapping), fault_type); 126 mapping_gfp_mask(inode->i_mapping), fault_type);
127 } 127 }
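For orientation, a simplified sketch (not the literal function in this tree) of how shmem's own ->write_begin drives shmem_getpage() with SGP_WRITE:

	/* Simplified sketch: tmpfs finds or allocates the page via shmem_getpage(). */
	static int example_shmem_write_begin(struct file *file,
					     struct address_space *mapping,
					     loff_t pos, unsigned len, unsigned flags,
					     struct page **pagep, void **fsdata)
	{
		struct inode *inode = mapping->host;
		pgoff_t index = pos >> PAGE_CACHE_SHIFT;

		/* SGP_WRITE: may extend i_size and may return a !Uptodate page. */
		return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
	}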
128 128
129 static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb) 129 static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
130 { 130 {
131 return sb->s_fs_info; 131 return sb->s_fs_info;
132 } 132 }
133 133
134 /* 134 /*
135 * shmem_file_setup pre-accounts the whole fixed size of a VM object, 135 * shmem_file_setup pre-accounts the whole fixed size of a VM object,
136 * for shared memory and for shared anonymous (/dev/zero) mappings 136 * for shared memory and for shared anonymous (/dev/zero) mappings
137 * (unless MAP_NORESERVE and sysctl_overcommit_memory <= 1), 137 * (unless MAP_NORESERVE and sysctl_overcommit_memory <= 1),
138 * consistent with the pre-accounting of private mappings ... 138 * consistent with the pre-accounting of private mappings ...
139 */ 139 */
140 static inline int shmem_acct_size(unsigned long flags, loff_t size) 140 static inline int shmem_acct_size(unsigned long flags, loff_t size)
141 { 141 {
142 return (flags & VM_NORESERVE) ? 142 return (flags & VM_NORESERVE) ?
143 0 : security_vm_enough_memory_mm(current->mm, VM_ACCT(size)); 143 0 : security_vm_enough_memory_mm(current->mm, VM_ACCT(size));
144 } 144 }
145 145
146 static inline void shmem_unacct_size(unsigned long flags, loff_t size) 146 static inline void shmem_unacct_size(unsigned long flags, loff_t size)
147 { 147 {
148 if (!(flags & VM_NORESERVE)) 148 if (!(flags & VM_NORESERVE))
149 vm_unacct_memory(VM_ACCT(size)); 149 vm_unacct_memory(VM_ACCT(size));
150 } 150 }
151 151
152 /* 152 /*
153 * ... whereas tmpfs objects are accounted incrementally as 153 * ... whereas tmpfs objects are accounted incrementally as
154 * pages are allocated, in order to allow huge sparse files. 154 * pages are allocated, in order to allow huge sparse files.
155 * shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM, 155 * shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
156 * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM. 156 * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
157 */ 157 */
158 static inline int shmem_acct_block(unsigned long flags) 158 static inline int shmem_acct_block(unsigned long flags)
159 { 159 {
160 return (flags & VM_NORESERVE) ? 160 return (flags & VM_NORESERVE) ?
161 security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_CACHE_SIZE)) : 0; 161 security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_CACHE_SIZE)) : 0;
162 } 162 }
163 163
164 static inline void shmem_unacct_blocks(unsigned long flags, long pages) 164 static inline void shmem_unacct_blocks(unsigned long flags, long pages)
165 { 165 {
166 if (flags & VM_NORESERVE) 166 if (flags & VM_NORESERVE)
167 vm_unacct_memory(pages * VM_ACCT(PAGE_CACHE_SIZE)); 167 vm_unacct_memory(pages * VM_ACCT(PAGE_CACHE_SIZE));
168 } 168 }
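A small illustrative pairing of the accounting helpers above (names other than the helpers themselves are made up); as the comment above notes, shmem_getpage reports this failure as -ENOSPC:

	/* Illustrative only: charge one block before allocating, undo on failure. */
	static int example_charge_one_page(struct shmem_inode_info *info)
	{
		struct page *page;

		if (shmem_acct_block(info->flags))
			return -ENOSPC;

		page = alloc_page(GFP_KERNEL);	/* stand-in for shmem's allocator */
		if (!page) {
			shmem_unacct_blocks(info->flags, 1);
			return -ENOMEM;
		}

		/* ... add to the page cache and sbinfo->used_blocks accounting ... */
		return 0;
	}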
169 169
170 static const struct super_operations shmem_ops; 170 static const struct super_operations shmem_ops;
171 static const struct address_space_operations shmem_aops; 171 static const struct address_space_operations shmem_aops;
172 static const struct file_operations shmem_file_operations; 172 static const struct file_operations shmem_file_operations;
173 static const struct inode_operations shmem_inode_operations; 173 static const struct inode_operations shmem_inode_operations;
174 static const struct inode_operations shmem_dir_inode_operations; 174 static const struct inode_operations shmem_dir_inode_operations;
175 static const struct inode_operations shmem_special_inode_operations; 175 static const struct inode_operations shmem_special_inode_operations;
176 static const struct vm_operations_struct shmem_vm_ops; 176 static const struct vm_operations_struct shmem_vm_ops;
177 177
178 static struct backing_dev_info shmem_backing_dev_info __read_mostly = { 178 static struct backing_dev_info shmem_backing_dev_info __read_mostly = {
179 .ra_pages = 0, /* No readahead */ 179 .ra_pages = 0, /* No readahead */
180 .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED, 180 .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED,
181 }; 181 };
182 182
183 static LIST_HEAD(shmem_swaplist); 183 static LIST_HEAD(shmem_swaplist);
184 static DEFINE_MUTEX(shmem_swaplist_mutex); 184 static DEFINE_MUTEX(shmem_swaplist_mutex);
185 185
186 static int shmem_reserve_inode(struct super_block *sb) 186 static int shmem_reserve_inode(struct super_block *sb)
187 { 187 {
188 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 188 struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
189 if (sbinfo->max_inodes) { 189 if (sbinfo->max_inodes) {
190 spin_lock(&sbinfo->stat_lock); 190 spin_lock(&sbinfo->stat_lock);
191 if (!sbinfo->free_inodes) { 191 if (!sbinfo->free_inodes) {
192 spin_unlock(&sbinfo->stat_lock); 192 spin_unlock(&sbinfo->stat_lock);
193 return -ENOSPC; 193 return -ENOSPC;
194 } 194 }
195 sbinfo->free_inodes--; 195 sbinfo->free_inodes--;
196 spin_unlock(&sbinfo->stat_lock); 196 spin_unlock(&sbinfo->stat_lock);
197 } 197 }
198 return 0; 198 return 0;
199 } 199 }
200 200
201 static void shmem_free_inode(struct super_block *sb) 201 static void shmem_free_inode(struct super_block *sb)
202 { 202 {
203 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 203 struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
204 if (sbinfo->max_inodes) { 204 if (sbinfo->max_inodes) {
205 spin_lock(&sbinfo->stat_lock); 205 spin_lock(&sbinfo->stat_lock);
206 sbinfo->free_inodes++; 206 sbinfo->free_inodes++;
207 spin_unlock(&sbinfo->stat_lock); 207 spin_unlock(&sbinfo->stat_lock);
208 } 208 }
209 } 209 }
210 210
/**
 * shmem_recalc_inode - recalculate the block usage of an inode
 * @inode: inode to recalc
 *
 * We have to calculate the free blocks since the mm can drop
 * undirtied hole pages behind our back.
 *
 * But normally info->alloced == inode->i_mapping->nrpages + info->swapped
 * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped)
 *
 * It has to be called with the spinlock held.
 */
static void shmem_recalc_inode(struct inode *inode)
{
	struct shmem_inode_info *info = SHMEM_I(inode);
	long freed;

	freed = info->alloced - info->swapped - inode->i_mapping->nrpages;
	if (freed > 0) {
		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
		if (sbinfo->max_blocks)
			percpu_counter_add(&sbinfo->used_blocks, -freed);
		info->alloced -= freed;
		inode->i_blocks -= freed * BLOCKS_PER_PAGE;
		shmem_unacct_blocks(info->flags, freed);
	}
}

/*
 * Replace item expected in radix tree by a new item, while holding tree lock.
 */
static int shmem_radix_tree_replace(struct address_space *mapping,
			pgoff_t index, void *expected, void *replacement)
{
	void **pslot;
	void *item;

	VM_BUG_ON(!expected);
	VM_BUG_ON(!replacement);
	pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
	if (!pslot)
		return -ENOENT;
	item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock);
	if (item != expected)
		return -ENOENT;
	radix_tree_replace_slot(pslot, replacement);
	return 0;
}

/*
 * Sometimes, before we decide whether to proceed or to fail, we must check
 * that an entry was not already brought back from swap by a racing thread.
 *
 * Checking page is not enough: by the time a SwapCache page is locked, it
 * might be reused, and again be SwapCache, using the same swap as before.
 */
static bool shmem_confirm_swap(struct address_space *mapping,
			       pgoff_t index, swp_entry_t swap)
{
	void *item;

	rcu_read_lock();
	item = radix_tree_lookup(&mapping->page_tree, index);
	rcu_read_unlock();
	return item == swp_to_radix_entry(swap);
}

/*
 * Like add_to_page_cache_locked, but error if expected item has gone.
 */
static int shmem_add_to_page_cache(struct page *page,
				   struct address_space *mapping,
				   pgoff_t index, gfp_t gfp, void *expected)
{
	int error;

	VM_BUG_ON(!PageLocked(page));
	VM_BUG_ON(!PageSwapBacked(page));

	page_cache_get(page);
	page->mapping = mapping;
	page->index = index;

	spin_lock_irq(&mapping->tree_lock);
	if (!expected)
		error = radix_tree_insert(&mapping->page_tree, index, page);
	else
		error = shmem_radix_tree_replace(mapping, index, expected,
						 page);
	if (!error) {
		mapping->nrpages++;
		__inc_zone_page_state(page, NR_FILE_PAGES);
		__inc_zone_page_state(page, NR_SHMEM);
		spin_unlock_irq(&mapping->tree_lock);
	} else {
		page->mapping = NULL;
		spin_unlock_irq(&mapping->tree_lock);
		page_cache_release(page);
	}
	return error;
}

/*
 * Like delete_from_page_cache, but substitutes swap for page.
 */
static void shmem_delete_from_page_cache(struct page *page, void *radswap)
{
	struct address_space *mapping = page->mapping;
	int error;

	spin_lock_irq(&mapping->tree_lock);
	error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
	page->mapping = NULL;
	mapping->nrpages--;
	__dec_zone_page_state(page, NR_FILE_PAGES);
	__dec_zone_page_state(page, NR_SHMEM);
	spin_unlock_irq(&mapping->tree_lock);
	page_cache_release(page);
	BUG_ON(error);
}

/*
 * Remove swap entry from radix tree, free the swap and its page cache.
 */
static int shmem_free_swap(struct address_space *mapping,
			   pgoff_t index, void *radswap)
{
	void *old;

	spin_lock_irq(&mapping->tree_lock);
	old = radix_tree_delete_item(&mapping->page_tree, index, radswap);
	spin_unlock_irq(&mapping->tree_lock);
	if (old != radswap)
		return -ENOENT;
	free_swap_and_cache(radix_to_swp_entry(radswap));
	return 0;
}

/*
 * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
 */
void shmem_unlock_mapping(struct address_space *mapping)
{
	struct pagevec pvec;
	pgoff_t indices[PAGEVEC_SIZE];
	pgoff_t index = 0;

	pagevec_init(&pvec, 0);
	/*
	 * Minor point, but we might as well stop if someone else SHM_LOCKs it.
	 */
	while (!mapping_unevictable(mapping)) {
		/*
		 * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
		 * has finished, if it hits a row of PAGEVEC_SIZE swap entries.
		 */
		pvec.nr = find_get_entries(mapping, index,
					   PAGEVEC_SIZE, pvec.pages, indices);
		if (!pvec.nr)
			break;
		index = indices[pvec.nr - 1] + 1;
		pagevec_remove_exceptionals(&pvec);
		check_move_unevictable_pages(pvec.pages, pvec.nr);
		pagevec_release(&pvec);
		cond_resched();
	}
}

/*
 * Remove range of pages and swap entries from radix tree, and free them.
 * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
 */
static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
							 bool unfalloc)
{
	struct address_space *mapping = inode->i_mapping;
	struct shmem_inode_info *info = SHMEM_I(inode);
	pgoff_t start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	pgoff_t end = (lend + 1) >> PAGE_CACHE_SHIFT;
	unsigned int partial_start = lstart & (PAGE_CACHE_SIZE - 1);
	unsigned int partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
	struct pagevec pvec;
	pgoff_t indices[PAGEVEC_SIZE];
	long nr_swaps_freed = 0;
	pgoff_t index;
	int i;

	if (lend == -1)
		end = -1;	/* unsigned, so actually very big */

	pagevec_init(&pvec, 0);
	index = start;
	while (index < end) {
		pvec.nr = find_get_entries(mapping, index,
			min(end - index, (pgoff_t)PAGEVEC_SIZE),
			pvec.pages, indices);
		if (!pvec.nr)
			break;
		mem_cgroup_uncharge_start();
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			index = indices[i];
			if (index >= end)
				break;

			if (radix_tree_exceptional_entry(page)) {
				if (unfalloc)
					continue;
				nr_swaps_freed += !shmem_free_swap(mapping,
								index, page);
				continue;
			}

			if (!trylock_page(page))
				continue;
			if (!unfalloc || !PageUptodate(page)) {
				if (page->mapping == mapping) {
					VM_BUG_ON(PageWriteback(page));
					truncate_inode_page(mapping, page);
				}
			}
			unlock_page(page);
		}
		pagevec_remove_exceptionals(&pvec);
		pagevec_release(&pvec);
		mem_cgroup_uncharge_end();
		cond_resched();
		index++;
	}

	if (partial_start) {
		struct page *page = NULL;
		shmem_getpage(inode, start - 1, &page, SGP_READ, NULL);
		if (page) {
			unsigned int top = PAGE_CACHE_SIZE;
			if (start > end) {
				top = partial_end;
				partial_end = 0;
			}
			zero_user_segment(page, partial_start, top);
			set_page_dirty(page);
			unlock_page(page);
			page_cache_release(page);
		}
	}
	if (partial_end) {
		struct page *page = NULL;
		shmem_getpage(inode, end, &page, SGP_READ, NULL);
		if (page) {
			zero_user_segment(page, 0, partial_end);
			set_page_dirty(page);
			unlock_page(page);
			page_cache_release(page);
		}
	}
	if (start >= end)
		return;

	index = start;
	while (index < end) {
		cond_resched();

		pvec.nr = find_get_entries(mapping, index,
				min(end - index, (pgoff_t)PAGEVEC_SIZE),
				pvec.pages, indices);
		if (!pvec.nr) {
			/* If all gone or hole-punch or unfalloc, we're done */
			if (index == start || end != -1)
				break;
			/* But if truncating, restart to make sure all gone */
			index = start;
			continue;
		}
		mem_cgroup_uncharge_start();
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			index = indices[i];
			if (index >= end)
				break;

			if (radix_tree_exceptional_entry(page)) {
				if (unfalloc)
					continue;
				if (shmem_free_swap(mapping, index, page)) {
					/* Swap was replaced by page: retry */
					index--;
					break;
				}
				nr_swaps_freed++;
				continue;
			}

			lock_page(page);
			if (!unfalloc || !PageUptodate(page)) {
				if (page->mapping == mapping) {
					VM_BUG_ON(PageWriteback(page));
					truncate_inode_page(mapping, page);
				} else {
					/* Page was replaced by swap: retry */
					unlock_page(page);
					index--;
					break;
				}
			}
			unlock_page(page);
		}
		pagevec_remove_exceptionals(&pvec);
		pagevec_release(&pvec);
		mem_cgroup_uncharge_end();
		index++;
	}

	spin_lock(&info->lock);
	info->swapped -= nr_swaps_freed;
	shmem_recalc_inode(inode);
	spin_unlock(&info->lock);
}

void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
{
	shmem_undo_range(inode, lstart, lend, false);
	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
}
EXPORT_SYMBOL_GPL(shmem_truncate_range);

static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
{
	struct inode *inode = dentry->d_inode;
	int error;

	error = inode_change_ok(inode, attr);
	if (error)
		return error;

	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
		loff_t oldsize = inode->i_size;
		loff_t newsize = attr->ia_size;

		if (newsize != oldsize) {
			i_size_write(inode, newsize);
			inode->i_ctime = inode->i_mtime = CURRENT_TIME;
		}
		if (newsize < oldsize) {
			loff_t holebegin = round_up(newsize, PAGE_SIZE);
			unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);
			shmem_truncate_range(inode, newsize, (loff_t)-1);
			/* unmap again to remove racily COWed private pages */
			unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);
		}
	}

	setattr_copy(inode, attr);
#ifdef CONFIG_TMPFS_POSIX_ACL
	if (attr->ia_valid & ATTR_MODE)
		error = generic_acl_chmod(inode);
#endif
	return error;
}

static void shmem_evict_inode(struct inode *inode)
{
	struct shmem_inode_info *info = SHMEM_I(inode);

	if (inode->i_mapping->a_ops == &shmem_aops) {
		shmem_unacct_size(info->flags, inode->i_size);
		inode->i_size = 0;
		shmem_truncate_range(inode, 0, (loff_t)-1);
		if (!list_empty(&info->swaplist)) {
			mutex_lock(&shmem_swaplist_mutex);
			list_del_init(&info->swaplist);
			mutex_unlock(&shmem_swaplist_mutex);
		}
	} else
		kfree(info->symlink);

	simple_xattrs_free(&info->xattrs);
	WARN_ON(inode->i_blocks);
	shmem_free_inode(inode->i_sb);
	clear_inode(inode);
}

/*
 * If swap found in inode, free it and move page from swapcache to filecache.
 */
static int shmem_unuse_inode(struct shmem_inode_info *info,
			     swp_entry_t swap, struct page **pagep)
{
	struct address_space *mapping = info->vfs_inode.i_mapping;
	void *radswap;
	pgoff_t index;
	gfp_t gfp;
	int error = 0;

	radswap = swp_to_radix_entry(swap);
	index = radix_tree_locate_item(&mapping->page_tree, radswap);
	if (index == -1)
		return 0;

	/*
	 * Move _head_ to start search for next from here.
	 * But be careful: shmem_evict_inode checks list_empty without taking
	 * mutex, and there's an instant in list_move_tail when info->swaplist
	 * would appear empty, if it were the only one on shmem_swaplist.
	 */
	if (shmem_swaplist.next != &info->swaplist)
		list_move_tail(&shmem_swaplist, &info->swaplist);

	gfp = mapping_gfp_mask(mapping);
	if (shmem_should_replace_page(*pagep, gfp)) {
		mutex_unlock(&shmem_swaplist_mutex);
		error = shmem_replace_page(pagep, gfp, info, index);
		mutex_lock(&shmem_swaplist_mutex);
		/*
		 * We needed to drop mutex to make that restrictive page
		 * allocation, but the inode might have been freed while we
		 * dropped it: although a racing shmem_evict_inode() cannot
		 * complete without emptying the radix_tree, our page lock
		 * on this swapcache page is not enough to prevent that -
		 * free_swap_and_cache() of our swap entry will only
		 * trylock_page(), removing swap from radix_tree whatever.
		 *
		 * We must not proceed to shmem_add_to_page_cache() if the
		 * inode has been freed, but of course we cannot rely on
		 * inode or mapping or info to check that. However, we can
		 * safely check if our swap entry is still in use (and here
		 * it can't have got reused for another page): if it's still
		 * in use, then the inode cannot have been freed yet, and we
		 * can safely proceed (if it's no longer in use, that tells
		 * nothing about the inode, but we don't need to unuse swap).
		 */
		if (!page_swapcount(*pagep))
			error = -ENOENT;
	}

	/*
	 * We rely on shmem_swaplist_mutex, not only to protect the swaplist,
	 * but also to hold up shmem_evict_inode(): so inode cannot be freed
	 * beneath us (pagelock doesn't help until the page is in pagecache).
	 */
	if (!error)
		error = shmem_add_to_page_cache(*pagep, mapping, index,
						GFP_NOWAIT, radswap);
	if (error != -ENOMEM) {
		/*
		 * Truncation and eviction use free_swap_and_cache(), which
		 * only does trylock page: if we raced, best clean up here.
		 */
		delete_from_swap_cache(*pagep);
		set_page_dirty(*pagep);
		if (!error) {
			spin_lock(&info->lock);
			info->swapped--;
			spin_unlock(&info->lock);
			swap_free(swap);
		}
		error = 1;	/* not an error, but entry was found */
	}
	return error;
}

/*
 * Search through swapped inodes to find and replace swap by page.
 */
int shmem_unuse(swp_entry_t swap, struct page *page)
{
	struct list_head *this, *next;
	struct shmem_inode_info *info;
	int found = 0;
	int error = 0;

	/*
	 * There's a faint possibility that swap page was replaced before
	 * caller locked it: caller will come back later with the right page.
	 */
	if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val))
		goto out;

	/*
	 * Charge page using GFP_KERNEL while we can wait, before taking
	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
	 * Charged back to the user (not to caller) when swap account is used.
	 */
	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
	if (error)
		goto out;
	/* No radix_tree_preload: swap entry keeps a place for page in tree */

	mutex_lock(&shmem_swaplist_mutex);
	list_for_each_safe(this, next, &shmem_swaplist) {
		info = list_entry(this, struct shmem_inode_info, swaplist);
		if (info->swapped)
			found = shmem_unuse_inode(info, swap, &page);
		else
			list_del_init(&info->swaplist);
		cond_resched();
		if (found)
			break;
	}
	mutex_unlock(&shmem_swaplist_mutex);

	if (found < 0)
		error = found;
out:
	unlock_page(page);
	page_cache_release(page);
	return error;
}

/*
 * Move the page from the page cache to the swap cache.
 */
static int shmem_writepage(struct page *page, struct writeback_control *wbc)
{
	struct shmem_inode_info *info;
	struct address_space *mapping;
	struct inode *inode;
	swp_entry_t swap;
	pgoff_t index;

	BUG_ON(!PageLocked(page));
	mapping = page->mapping;
	index = page->index;
	inode = mapping->host;
	info = SHMEM_I(inode);
	if (info->flags & VM_LOCKED)
		goto redirty;
	if (!total_swap_pages)
		goto redirty;

	/*
	 * shmem_backing_dev_info's capabilities prevent regular writeback or
	 * sync from ever calling shmem_writepage; but a stacking filesystem
	 * might use ->writepage of its underlying filesystem, in which case
	 * tmpfs should write out to swap only in response to memory pressure,
	 * and not for the writeback threads or sync.
	 */
	if (!wbc->for_reclaim) {
		WARN_ON_ONCE(1);	/* Still happens? Tell us about it! */
		goto redirty;
	}

	/*
	 * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
	 * value into swapfile.c, the only way we can correctly account for a
	 * fallocated page arriving here is now to initialize it and write it.
	 *
	 * That's okay for a page already fallocated earlier, but if we have
	 * not yet completed the fallocation, then (a) we want to keep track
	 * of this page in case we have to undo it, and (b) it may not be a
	 * good idea to continue anyway, once we're pushing into swap. So
	 * reactivate the page, and let shmem_fallocate() quit when too many.
	 */
	if (!PageUptodate(page)) {
		if (inode->i_private) {
			struct shmem_falloc *shmem_falloc;
			spin_lock(&inode->i_lock);
			shmem_falloc = inode->i_private;
			if (shmem_falloc &&
			    !shmem_falloc->waitq &&
			    index >= shmem_falloc->start &&
			    index < shmem_falloc->next)
				shmem_falloc->nr_unswapped++;
			else
				shmem_falloc = NULL;
			spin_unlock(&inode->i_lock);
			if (shmem_falloc)
				goto redirty;
		}
		clear_highpage(page);
		flush_dcache_page(page);
		SetPageUptodate(page);
	}

	swap = get_swap_page();
	if (!swap.val)
		goto redirty;

	/*
	 * Add inode to shmem_unuse()'s list of swapped-out inodes,
	 * if it's not already there. Do it now before the page is
	 * moved to swap cache, when its pagelock no longer protects
	 * the inode from eviction. But don't unlock the mutex until
	 * we've incremented swapped, because shmem_unuse_inode() will
	 * prune a !swapped inode from the swaplist under this mutex.
	 */
	mutex_lock(&shmem_swaplist_mutex);
	if (list_empty(&info->swaplist))
		list_add_tail(&info->swaplist, &shmem_swaplist);

	if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
		swap_shmem_alloc(swap);
		shmem_delete_from_page_cache(page, swp_to_radix_entry(swap));

		spin_lock(&info->lock);
		info->swapped++;
		shmem_recalc_inode(inode);
		spin_unlock(&info->lock);

		mutex_unlock(&shmem_swaplist_mutex);
		BUG_ON(page_mapped(page));
		swap_writepage(page, wbc);
		return 0;
	}

	mutex_unlock(&shmem_swaplist_mutex);
	swapcache_free(swap, NULL);
redirty:
	set_page_dirty(page);
	if (wbc->for_reclaim)
		return AOP_WRITEPAGE_ACTIVATE;	/* Return with page locked */
	unlock_page(page);
	return 0;
}

#ifdef CONFIG_NUMA
#ifdef CONFIG_TMPFS
static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
{
	char buffer[64];

	if (!mpol || mpol->mode == MPOL_DEFAULT)
		return;		/* show nothing */

	mpol_to_str(buffer, sizeof(buffer), mpol);

	seq_printf(seq, ",mpol=%s", buffer);
}

static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
{
	struct mempolicy *mpol = NULL;
	if (sbinfo->mpol) {
		spin_lock(&sbinfo->stat_lock);	/* prevent replace/use races */
		mpol = sbinfo->mpol;
		mpol_get(mpol);
		spin_unlock(&sbinfo->stat_lock);
	}
	return mpol;
}
#endif /* CONFIG_TMPFS */

static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
			struct shmem_inode_info *info, pgoff_t index)
{
	struct vm_area_struct pvma;
	struct page *page;

	/* Create a pseudo vma that just contains the policy */
	pvma.vm_start = 0;
	/* Bias interleave by inode number to distribute better across nodes */
	pvma.vm_pgoff = index + info->vfs_inode.i_ino;
	pvma.vm_ops = NULL;
	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);

	page = swapin_readahead(swap, gfp, &pvma, 0);

	/* Drop reference taken by mpol_shared_policy_lookup() */
	mpol_cond_put(pvma.vm_policy);

	return page;
}

static struct page *shmem_alloc_page(gfp_t gfp,
			struct shmem_inode_info *info, pgoff_t index)
{
	struct vm_area_struct pvma;
	struct page *page;

	/* Create a pseudo vma that just contains the policy */
	pvma.vm_start = 0;
	/* Bias interleave by inode number to distribute better across nodes */
	pvma.vm_pgoff = index + info->vfs_inode.i_ino;
	pvma.vm_ops = NULL;
	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);

	page = alloc_page_vma(gfp, &pvma, 0);

	/* Drop reference taken by mpol_shared_policy_lookup() */
	mpol_cond_put(pvma.vm_policy);

	return page;
}
#else /* !CONFIG_NUMA */
#ifdef CONFIG_TMPFS
static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
{
}
#endif /* CONFIG_TMPFS */

static inline struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
			struct shmem_inode_info *info, pgoff_t index)
{
	return swapin_readahead(swap, gfp, NULL, 0);
}

static inline struct page *shmem_alloc_page(gfp_t gfp,
			struct shmem_inode_info *info, pgoff_t index)
{
	return alloc_page(gfp);
}
#endif /* CONFIG_NUMA */

#if !defined(CONFIG_NUMA) || !defined(CONFIG_TMPFS)
static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
{
	return NULL;
}
#endif

/*
 * When a page is moved from swapcache to shmem filecache (either by the
 * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of
 * shmem_unuse_inode()), it may have been read in earlier from swap, in
 * ignorance of the mapping it belongs to. If that mapping has special
 * constraints (like the gma500 GEM driver, which requires RAM below 4GB),
 * we may need to copy to a suitable page before moving to filecache.
 *
 * In a future release, this may well be extended to respect cpuset and
 * NUMA mempolicy, and applied also to anonymous pages in do_swap_page();
 * but for now it is a simple matter of zone.
 */
static bool shmem_should_replace_page(struct page *page, gfp_t gfp)
{
	return page_zonenum(page) > gfp_zone(gfp);
}

static int shmem_replace_page(struct page **pagep, gfp_t gfp,
				struct shmem_inode_info *info, pgoff_t index)
{
	struct page *oldpage, *newpage;
	struct address_space *swap_mapping;
	pgoff_t swap_index;
	int error;

	oldpage = *pagep;
	swap_index = page_private(oldpage);
	swap_mapping = page_mapping(oldpage);

	/*
	 * We have arrived here because our zones are constrained, so don't
	 * limit chance of success by further cpuset and node constraints.
	 */
	gfp &= ~GFP_CONSTRAINT_MASK;
	newpage = shmem_alloc_page(gfp, info, index);
	if (!newpage)
		return -ENOMEM;

	page_cache_get(newpage);
	copy_highpage(newpage, oldpage);
	flush_dcache_page(newpage);

	__set_page_locked(newpage);
	SetPageUptodate(newpage);
	SetPageSwapBacked(newpage);
	set_page_private(newpage, swap_index);
	SetPageSwapCache(newpage);

	/*
	 * Our caller will very soon move newpage out of swapcache, but it's
	 * a nice clean interface for us to replace oldpage by newpage there.
	 */
	spin_lock_irq(&swap_mapping->tree_lock);
	error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
								   newpage);
	if (!error) {
		__inc_zone_page_state(newpage, NR_FILE_PAGES);
		__dec_zone_page_state(oldpage, NR_FILE_PAGES);
	}
	spin_unlock_irq(&swap_mapping->tree_lock);

	if (unlikely(error)) {
		/*
		 * Is this possible? I think not, now that our callers check
		 * both PageSwapCache and page_private after getting page lock;
		 * but be defensive. Reverse old to newpage for clear and free.
		 */
		oldpage = newpage;
	} else {
		mem_cgroup_replace_page_cache(oldpage, newpage);
		lru_cache_add_anon(newpage);
		*pagep = newpage;
	}

	ClearPageSwapCache(oldpage);
	set_page_private(oldpage, 0);

	unlock_page(oldpage);
	page_cache_release(oldpage);
	page_cache_release(oldpage);
	return error;
}

/*
 * shmem_getpage_gfp - find page in cache, or get from swap, or allocate
 *
 * If we allocate a new one we do not mark it dirty. That's up to the
 * vm. If we swap it in we mark it dirty since we also free the swap
 * entry since a page cannot live in both the swap and page cache
 */
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
	struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type)
{
	struct address_space *mapping = inode->i_mapping;
	struct shmem_inode_info *info;
	struct shmem_sb_info *sbinfo;
	struct page *page;
	swp_entry_t swap;
	int error;
	int once = 0;
	int alloced = 0;

	if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
		return -EFBIG;
repeat:
	swap.val = 0;
	page = find_lock_entry(mapping, index);
	if (radix_tree_exceptional_entry(page)) {
		swap = radix_to_swp_entry(page);
		page = NULL;
	}

	if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
	    ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
		error = -EINVAL;
		goto failed;
	}

	/* fallocated page? */
	if (page && !PageUptodate(page)) {
		if (sgp != SGP_READ)
			goto clear;
		unlock_page(page);
		page_cache_release(page);
		page = NULL;
	}
	if (page || (sgp == SGP_READ && !swap.val)) {
		*pagep = page;
		return 0;
	}

	/*
	 * Fast cache lookup did not find it:
	 * bring it back from swap or allocate.
	 */
	info = SHMEM_I(inode);
	sbinfo = SHMEM_SB(inode->i_sb);

	if (swap.val) {
		/* Look it up and read it in.. */
		page = lookup_swap_cache(swap);
		if (!page) {
			/* here we actually do the io */
			if (fault_type)
				*fault_type |= VM_FAULT_MAJOR;
			page = shmem_swapin(swap, gfp, info, index);
			if (!page) {
				error = -ENOMEM;
				goto failed;
			}
		}

		/* We have to do this with page locked to prevent races */
		lock_page(page);
		if (!PageSwapCache(page) || page_private(page) != swap.val ||
		    !shmem_confirm_swap(mapping, index, swap)) {
			error = -EEXIST;	/* try again */
			goto unlock;
		}
		if (!PageUptodate(page)) {
			error = -EIO;
			goto failed;
		}
		wait_on_page_writeback(page);

		if (shmem_should_replace_page(page, gfp)) {
			error = shmem_replace_page(&page, gfp, info, index);
			if (error)
				goto failed;
		}

		error = mem_cgroup_cache_charge(page, current->mm,
						gfp & GFP_RECLAIM_MASK);
		if (!error) {
			error = shmem_add_to_page_cache(page, mapping, index,
						gfp, swp_to_radix_entry(swap));
			/*
			 * We already confirmed swap under page lock, and make
			 * no memory allocation here, so usually no possibility
			 * of error; but free_swap_and_cache() only trylocks a
			 * page, so it is just possible that the entry has been
			 * truncated or holepunched since swap was confirmed.
			 * shmem_undo_range() will have done some of the
			 * unaccounting, now delete_from_swap_cache() will do
			 * the rest (including mem_cgroup_uncharge_swapcache).
			 * Reset swap.val? No, leave it so "failed" goes back to
			 * "repeat": reading a hole and writing should succeed.
			 */
			if (error)
				delete_from_swap_cache(page);
		}
		if (error)
			goto failed;

		spin_lock(&info->lock);
		info->swapped--;
		shmem_recalc_inode(inode);
		spin_unlock(&info->lock);

		delete_from_swap_cache(page);
		set_page_dirty(page);
		swap_free(swap);

	} else {
		if (shmem_acct_block(info->flags)) {
			error = -ENOSPC;
			goto failed;
		}
		if (sbinfo->max_blocks) {
			if (percpu_counter_compare(&sbinfo->used_blocks,
						sbinfo->max_blocks) >= 0) {
				error = -ENOSPC;
				goto unacct;
			}
			percpu_counter_inc(&sbinfo->used_blocks);
		}

		page = shmem_alloc_page(gfp, info, index);
		if (!page) {
			error = -ENOMEM;
			goto decused;
		}

		__SetPageSwapBacked(page);
		__set_page_locked(page);
		error = mem_cgroup_cache_charge(page, current->mm,
						gfp & GFP_RECLAIM_MASK);
		if (error)
			goto decused;
		error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
		if (!error) {
			error = shmem_add_to_page_cache(page, mapping, index,
							gfp, NULL);
			radix_tree_preload_end();
		}
		if (error) {
			mem_cgroup_uncharge_cache_page(page);
			goto decused;
		}
		lru_cache_add_anon(page);

		spin_lock(&info->lock);
		info->alloced++;
		inode->i_blocks += BLOCKS_PER_PAGE;
		shmem_recalc_inode(inode);
		spin_unlock(&info->lock);
		alloced = true;

		/*
		 * Let SGP_FALLOC use the SGP_WRITE optimization on a new page.
		 */
		if (sgp == SGP_FALLOC)
			sgp = SGP_WRITE;
clear:
		/*
		 * Let SGP_WRITE caller clear ends if write does not fill page;
		 * but SGP_FALLOC on a page fallocated earlier must initialize
		 * it now, lest undo on failure cancel our earlier guarantee.
		 */
		if (sgp != SGP_WRITE) {
			clear_highpage(page);
			flush_dcache_page(page);
			SetPageUptodate(page);
		}
		if (sgp == SGP_DIRTY)
			set_page_dirty(page);
	}

	/* Perhaps the file has been truncated since we checked */
	if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
	    ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
		error = -EINVAL;
		if (alloced)
			goto trunc;
		else
			goto failed;
	}
	*pagep = page;
	return 0;

	/*
	 * Error recovery.
	 */
trunc:
	info = SHMEM_I(inode);
	ClearPageDirty(page);
	delete_from_page_cache(page);
	spin_lock(&info->lock);
	info->alloced--;
	inode->i_blocks -= BLOCKS_PER_PAGE;
	spin_unlock(&info->lock);
decused:
	sbinfo = SHMEM_SB(inode->i_sb);
	if (sbinfo->max_blocks)
		percpu_counter_add(&sbinfo->used_blocks, -1);
unacct:
	shmem_unacct_blocks(info->flags, 1);
failed:
	if (swap.val && error != -EINVAL &&
	    !shmem_confirm_swap(mapping, index, swap))
		error = -EEXIST;
unlock:
	if (page) {
		unlock_page(page);
		page_cache_release(page);
	}
	if (error == -ENOSPC && !once++) {
		info = SHMEM_I(inode);
		spin_lock(&info->lock);
		shmem_recalc_inode(inode);
		spin_unlock(&info->lock);
		goto repeat;
	}
	if (error == -EEXIST)	/* from above or from radix_tree_insert */
		goto repeat;
	return error;
}

1240 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) 1240 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
1241 { 1241 {
1242 struct inode *inode = file_inode(vma->vm_file); 1242 struct inode *inode = file_inode(vma->vm_file);
1243 int error; 1243 int error;
1244 int ret = VM_FAULT_LOCKED; 1244 int ret = VM_FAULT_LOCKED;
1245 1245
1246 /* 1246 /*
1247 * Trinity finds that probing a hole which tmpfs is punching can 1247 * Trinity finds that probing a hole which tmpfs is punching can
1248 * prevent the hole-punch from ever completing: which in turn 1248 * prevent the hole-punch from ever completing: which in turn
1249 * locks writers out with its hold on i_mutex. So refrain from 1249 * locks writers out with its hold on i_mutex. So refrain from
1250 * faulting pages into the hole while it's being punched. Although 1250 * faulting pages into the hole while it's being punched. Although
1251 * shmem_undo_range() does remove the additions, it may be unable to 1251 * shmem_undo_range() does remove the additions, it may be unable to
1252 * keep up, as each new page needs its own unmap_mapping_range() call, 1252 * keep up, as each new page needs its own unmap_mapping_range() call,
1253 * and the i_mmap tree grows ever slower to scan if new vmas are added. 1253 * and the i_mmap tree grows ever slower to scan if new vmas are added.
1254 * 1254 *
1255 * It does not matter if we sometimes reach this check just before the 1255 * It does not matter if we sometimes reach this check just before the
1256 * hole-punch begins, so that one fault then races with the punch: 1256 * hole-punch begins, so that one fault then races with the punch:
1257 * we just need to make racing faults a rare case. 1257 * we just need to make racing faults a rare case.
1258 * 1258 *
1259 * The implementation below would be much simpler if we just used a 1259 * The implementation below would be much simpler if we just used a
1260 * standard mutex or completion: but we cannot take i_mutex in fault, 1260 * standard mutex or completion: but we cannot take i_mutex in fault,
1261 * and bloating every shmem inode for this unlikely case would be sad. 1261 * and bloating every shmem inode for this unlikely case would be sad.
1262 */ 1262 */
1263 if (unlikely(inode->i_private)) { 1263 if (unlikely(inode->i_private)) {
1264 struct shmem_falloc *shmem_falloc; 1264 struct shmem_falloc *shmem_falloc;
1265 1265
1266 spin_lock(&inode->i_lock); 1266 spin_lock(&inode->i_lock);
1267 shmem_falloc = inode->i_private; 1267 shmem_falloc = inode->i_private;
1268 if (shmem_falloc && 1268 if (shmem_falloc &&
1269 shmem_falloc->waitq && 1269 shmem_falloc->waitq &&
1270 vmf->pgoff >= shmem_falloc->start && 1270 vmf->pgoff >= shmem_falloc->start &&
1271 vmf->pgoff < shmem_falloc->next) { 1271 vmf->pgoff < shmem_falloc->next) {
1272 wait_queue_head_t *shmem_falloc_waitq; 1272 wait_queue_head_t *shmem_falloc_waitq;
1273 DEFINE_WAIT(shmem_fault_wait); 1273 DEFINE_WAIT(shmem_fault_wait);
1274 1274
1275 ret = VM_FAULT_NOPAGE; 1275 ret = VM_FAULT_NOPAGE;
1276 if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) && 1276 if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
1277 !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) { 1277 !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
1278 /* It's polite to up mmap_sem if we can */ 1278 /* It's polite to up mmap_sem if we can */
1279 up_read(&vma->vm_mm->mmap_sem); 1279 up_read(&vma->vm_mm->mmap_sem);
1280 ret = VM_FAULT_RETRY; 1280 ret = VM_FAULT_RETRY;
1281 } 1281 }
1282 1282
1283 shmem_falloc_waitq = shmem_falloc->waitq; 1283 shmem_falloc_waitq = shmem_falloc->waitq;
1284 prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait, 1284 prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
1285 TASK_UNINTERRUPTIBLE); 1285 TASK_UNINTERRUPTIBLE);
1286 spin_unlock(&inode->i_lock); 1286 spin_unlock(&inode->i_lock);
1287 schedule(); 1287 schedule();
1288 1288
1289 /* 1289 /*
1290 * shmem_falloc_waitq points into the shmem_fallocate() 1290 * shmem_falloc_waitq points into the shmem_fallocate()
1291 * stack of the hole-punching task: shmem_falloc_waitq 1291 * stack of the hole-punching task: shmem_falloc_waitq
1292 * is usually invalid by the time we reach here, but 1292 * is usually invalid by the time we reach here, but
1293 * finish_wait() does not dereference it in that case; 1293 * finish_wait() does not dereference it in that case;
1294 * though i_lock needed lest racing with wake_up_all(). 1294 * though i_lock needed lest racing with wake_up_all().
1295 */ 1295 */
1296 spin_lock(&inode->i_lock); 1296 spin_lock(&inode->i_lock);
1297 finish_wait(shmem_falloc_waitq, &shmem_fault_wait); 1297 finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
1298 spin_unlock(&inode->i_lock); 1298 spin_unlock(&inode->i_lock);
1299 return ret; 1299 return ret;
1300 } 1300 }
1301 spin_unlock(&inode->i_lock); 1301 spin_unlock(&inode->i_lock);
1302 } 1302 }
1303 1303
1304 error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret); 1304 error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
1305 if (error) 1305 if (error)
1306 return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS); 1306 return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
1307 1307
1308 if (ret & VM_FAULT_MAJOR) { 1308 if (ret & VM_FAULT_MAJOR) {
1309 count_vm_event(PGMAJFAULT); 1309 count_vm_event(PGMAJFAULT);
1310 mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); 1310 mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
1311 } 1311 }
1312 return ret; 1312 return ret;
1313 } 1313 }
1314 1314
1315 #ifdef CONFIG_NUMA 1315 #ifdef CONFIG_NUMA
1316 static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol) 1316 static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
1317 { 1317 {
1318 struct inode *inode = file_inode(vma->vm_file); 1318 struct inode *inode = file_inode(vma->vm_file);
1319 return mpol_set_shared_policy(&SHMEM_I(inode)->policy, vma, mpol); 1319 return mpol_set_shared_policy(&SHMEM_I(inode)->policy, vma, mpol);
1320 } 1320 }
1321 1321
1322 static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, 1322 static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
1323 unsigned long addr) 1323 unsigned long addr)
1324 { 1324 {
1325 struct inode *inode = file_inode(vma->vm_file); 1325 struct inode *inode = file_inode(vma->vm_file);
1326 pgoff_t index; 1326 pgoff_t index;
1327 1327
1328 index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; 1328 index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
1329 return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index); 1329 return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index);
1330 } 1330 }
1331 #endif 1331 #endif
1332 1332
1333 int shmem_lock(struct file *file, int lock, struct user_struct *user) 1333 int shmem_lock(struct file *file, int lock, struct user_struct *user)
1334 { 1334 {
1335 struct inode *inode = file_inode(file); 1335 struct inode *inode = file_inode(file);
1336 struct shmem_inode_info *info = SHMEM_I(inode); 1336 struct shmem_inode_info *info = SHMEM_I(inode);
1337 int retval = -ENOMEM; 1337 int retval = -ENOMEM;
1338 1338
1339 spin_lock(&info->lock); 1339 spin_lock(&info->lock);
1340 if (lock && !(info->flags & VM_LOCKED)) { 1340 if (lock && !(info->flags & VM_LOCKED)) {
1341 if (!user_shm_lock(inode->i_size, user)) 1341 if (!user_shm_lock(inode->i_size, user))
1342 goto out_nomem; 1342 goto out_nomem;
1343 info->flags |= VM_LOCKED; 1343 info->flags |= VM_LOCKED;
1344 mapping_set_unevictable(file->f_mapping); 1344 mapping_set_unevictable(file->f_mapping);
1345 } 1345 }
1346 if (!lock && (info->flags & VM_LOCKED) && user) { 1346 if (!lock && (info->flags & VM_LOCKED) && user) {
1347 user_shm_unlock(inode->i_size, user); 1347 user_shm_unlock(inode->i_size, user);
1348 info->flags &= ~VM_LOCKED; 1348 info->flags &= ~VM_LOCKED;
1349 mapping_clear_unevictable(file->f_mapping); 1349 mapping_clear_unevictable(file->f_mapping);
1350 } 1350 }
1351 retval = 0; 1351 retval = 0;
1352 1352
1353 out_nomem: 1353 out_nomem:
1354 spin_unlock(&info->lock); 1354 spin_unlock(&info->lock);
1355 return retval; 1355 return retval;
1356 } 1356 }
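For context, shmem_lock() above is reached from the SysV IPC SHM_LOCK/SHM_UNLOCK path for tmpfs-backed segments, charging the locked size against the caller's limit via user_shm_lock(). A minimal userspace sketch (segment size and mode are arbitrary; SHM_LOCK needs CAP_IPC_LOCK or enough RLIMIT_MEMLOCK headroom):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	int id = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);

	if (id < 0) {
		perror("shmget");
		return 1;
	}
	if (shmctl(id, SHM_LOCK, NULL) != 0)	/* ends up in shmem_lock(file, 1, user) */
		perror("SHM_LOCK");
	else
		shmctl(id, SHM_UNLOCK, NULL);	/* shmem_lock(file, 0, user) */
	shmctl(id, IPC_RMID, NULL);		/* mark the segment for removal */
	return 0;
}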
1357 1357
1358 static int shmem_mmap(struct file *file, struct vm_area_struct *vma) 1358 static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
1359 { 1359 {
1360 file_accessed(file); 1360 file_accessed(file);
1361 vma->vm_ops = &shmem_vm_ops; 1361 vma->vm_ops = &shmem_vm_ops;
1362 return 0; 1362 return 0;
1363 } 1363 }
1364 1364
1365 static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir, 1365 static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir,
1366 umode_t mode, dev_t dev, unsigned long flags) 1366 umode_t mode, dev_t dev, unsigned long flags)
1367 { 1367 {
1368 struct inode *inode; 1368 struct inode *inode;
1369 struct shmem_inode_info *info; 1369 struct shmem_inode_info *info;
1370 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 1370 struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
1371 1371
1372 if (shmem_reserve_inode(sb)) 1372 if (shmem_reserve_inode(sb))
1373 return NULL; 1373 return NULL;
1374 1374
1375 inode = new_inode(sb); 1375 inode = new_inode(sb);
1376 if (inode) { 1376 if (inode) {
1377 inode->i_ino = get_next_ino(); 1377 inode->i_ino = get_next_ino();
1378 inode_init_owner(inode, dir, mode); 1378 inode_init_owner(inode, dir, mode);
1379 inode->i_blocks = 0; 1379 inode->i_blocks = 0;
1380 inode->i_mapping->backing_dev_info = &shmem_backing_dev_info; 1380 inode->i_mapping->backing_dev_info = &shmem_backing_dev_info;
1381 inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; 1381 inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
1382 inode->i_generation = get_seconds(); 1382 inode->i_generation = get_seconds();
1383 info = SHMEM_I(inode); 1383 info = SHMEM_I(inode);
1384 memset(info, 0, (char *)inode - (char *)info); 1384 memset(info, 0, (char *)inode - (char *)info);
1385 spin_lock_init(&info->lock); 1385 spin_lock_init(&info->lock);
1386 info->flags = flags & VM_NORESERVE; 1386 info->flags = flags & VM_NORESERVE;
1387 INIT_LIST_HEAD(&info->swaplist); 1387 INIT_LIST_HEAD(&info->swaplist);
1388 simple_xattrs_init(&info->xattrs); 1388 simple_xattrs_init(&info->xattrs);
1389 cache_no_acl(inode); 1389 cache_no_acl(inode);
1390 1390
1391 switch (mode & S_IFMT) { 1391 switch (mode & S_IFMT) {
1392 default: 1392 default:
1393 inode->i_op = &shmem_special_inode_operations; 1393 inode->i_op = &shmem_special_inode_operations;
1394 init_special_inode(inode, mode, dev); 1394 init_special_inode(inode, mode, dev);
1395 break; 1395 break;
1396 case S_IFREG: 1396 case S_IFREG:
1397 inode->i_mapping->a_ops = &shmem_aops; 1397 inode->i_mapping->a_ops = &shmem_aops;
1398 inode->i_op = &shmem_inode_operations; 1398 inode->i_op = &shmem_inode_operations;
1399 inode->i_fop = &shmem_file_operations; 1399 inode->i_fop = &shmem_file_operations;
1400 mpol_shared_policy_init(&info->policy, 1400 mpol_shared_policy_init(&info->policy,
1401 shmem_get_sbmpol(sbinfo)); 1401 shmem_get_sbmpol(sbinfo));
1402 break; 1402 break;
1403 case S_IFDIR: 1403 case S_IFDIR:
1404 inc_nlink(inode); 1404 inc_nlink(inode);
1405 /* Some things misbehave if size == 0 on a directory */ 1405 /* Some things misbehave if size == 0 on a directory */
1406 inode->i_size = 2 * BOGO_DIRENT_SIZE; 1406 inode->i_size = 2 * BOGO_DIRENT_SIZE;
1407 inode->i_op = &shmem_dir_inode_operations; 1407 inode->i_op = &shmem_dir_inode_operations;
1408 inode->i_fop = &simple_dir_operations; 1408 inode->i_fop = &simple_dir_operations;
1409 break; 1409 break;
1410 case S_IFLNK: 1410 case S_IFLNK:
1411 /* 1411 /*
1412 * Must not load anything in the rbtree, 1412 * Must not load anything in the rbtree,
1413 * mpol_free_shared_policy will not be called. 1413 * mpol_free_shared_policy will not be called.
1414 */ 1414 */
1415 mpol_shared_policy_init(&info->policy, NULL); 1415 mpol_shared_policy_init(&info->policy, NULL);
1416 break; 1416 break;
1417 } 1417 }
1418 } else 1418 } else
1419 shmem_free_inode(sb); 1419 shmem_free_inode(sb);
1420 return inode; 1420 return inode;
1421 } 1421 }
1422 1422
1423 bool shmem_mapping(struct address_space *mapping) 1423 bool shmem_mapping(struct address_space *mapping)
1424 { 1424 {
1425 return mapping->backing_dev_info == &shmem_backing_dev_info; 1425 return mapping->backing_dev_info == &shmem_backing_dev_info;
1426 } 1426 }
1427 1427
1428 #ifdef CONFIG_TMPFS 1428 #ifdef CONFIG_TMPFS
1429 static const struct inode_operations shmem_symlink_inode_operations; 1429 static const struct inode_operations shmem_symlink_inode_operations;
1430 static const struct inode_operations shmem_short_symlink_operations; 1430 static const struct inode_operations shmem_short_symlink_operations;
1431 1431
1432 #ifdef CONFIG_TMPFS_XATTR 1432 #ifdef CONFIG_TMPFS_XATTR
1433 static int shmem_initxattrs(struct inode *, const struct xattr *, void *); 1433 static int shmem_initxattrs(struct inode *, const struct xattr *, void *);
1434 #else 1434 #else
1435 #define shmem_initxattrs NULL 1435 #define shmem_initxattrs NULL
1436 #endif 1436 #endif
1437 1437
1438 1438    static int
1439 1439    shmem_write_begin(struct file *file, struct address_space *mapping,
1440 1440    		loff_t pos, unsigned len, unsigned flags,
1441 1441    		struct page **pagep, void **fsdata)
1442 1442    {
     1443 +  	int ret;
1443 1444    	struct inode *inode = mapping->host;
1444 1445    	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
1445      -  	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
     1446 +  	ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
     1447 +  	if (ret == 0 && *pagep)
     1448 +  		init_page_accessed(*pagep);
     1449 +  	return ret;
1446 1450    }
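The hunk above is the shmem side of the change: the page handed back for SGP_WRITE is marked accessed straight away with the non-atomic helper, so the generic write path no longer needs a later atomic mark_page_accessed() just to record the first reference. init_page_accessed() itself is introduced elsewhere in this series; roughly, as a sketch of its intended shape rather than a copy from this diff:

/*
 * Approximate shape of the helper this hunk relies on: record the first
 * "accessed" hint with a non-atomic flag update, which is only safe while
 * no other user can race on the page flags (a freshly allocated, still
 * locked page).
 */
void init_page_accessed(struct page *page)
{
	if (!PageReferenced(page))
		__SetPageReferenced(page);	/* non-atomic counterpart of SetPageReferenced() */
}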
1447 1451
1448 static int 1452 static int
1449 shmem_write_end(struct file *file, struct address_space *mapping, 1453 shmem_write_end(struct file *file, struct address_space *mapping,
1450 loff_t pos, unsigned len, unsigned copied, 1454 loff_t pos, unsigned len, unsigned copied,
1451 struct page *page, void *fsdata) 1455 struct page *page, void *fsdata)
1452 { 1456 {
1453 struct inode *inode = mapping->host; 1457 struct inode *inode = mapping->host;
1454 1458
1455 if (pos + copied > inode->i_size) 1459 if (pos + copied > inode->i_size)
1456 i_size_write(inode, pos + copied); 1460 i_size_write(inode, pos + copied);
1457 1461
1458 if (!PageUptodate(page)) { 1462 if (!PageUptodate(page)) {
1459 if (copied < PAGE_CACHE_SIZE) { 1463 if (copied < PAGE_CACHE_SIZE) {
1460 unsigned from = pos & (PAGE_CACHE_SIZE - 1); 1464 unsigned from = pos & (PAGE_CACHE_SIZE - 1);
1461 zero_user_segments(page, 0, from, 1465 zero_user_segments(page, 0, from,
1462 from + copied, PAGE_CACHE_SIZE); 1466 from + copied, PAGE_CACHE_SIZE);
1463 } 1467 }
1464 SetPageUptodate(page); 1468 SetPageUptodate(page);
1465 } 1469 }
1466 set_page_dirty(page); 1470 set_page_dirty(page);
1467 unlock_page(page); 1471 unlock_page(page);
1468 page_cache_release(page); 1472 page_cache_release(page);
1469 1473
1470 return copied; 1474 return copied;
1471 } 1475 }
1472 1476
1473 static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_t *desc, read_actor_t actor) 1477 static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_t *desc, read_actor_t actor)
1474 { 1478 {
1475 struct inode *inode = file_inode(filp); 1479 struct inode *inode = file_inode(filp);
1476 struct address_space *mapping = inode->i_mapping; 1480 struct address_space *mapping = inode->i_mapping;
1477 pgoff_t index; 1481 pgoff_t index;
1478 unsigned long offset; 1482 unsigned long offset;
1479 enum sgp_type sgp = SGP_READ; 1483 enum sgp_type sgp = SGP_READ;
1480 1484
1481 /* 1485 /*
1482 * Might this read be for a stacking filesystem? Then when reading 1486 * Might this read be for a stacking filesystem? Then when reading
1483 * holes of a sparse file, we actually need to allocate those pages, 1487 * holes of a sparse file, we actually need to allocate those pages,
1484 * and even mark them dirty, so it cannot exceed the max_blocks limit. 1488 * and even mark them dirty, so it cannot exceed the max_blocks limit.
1485 */ 1489 */
1486 if (segment_eq(get_fs(), KERNEL_DS)) 1490 if (segment_eq(get_fs(), KERNEL_DS))
1487 sgp = SGP_DIRTY; 1491 sgp = SGP_DIRTY;
1488 1492
1489 index = *ppos >> PAGE_CACHE_SHIFT; 1493 index = *ppos >> PAGE_CACHE_SHIFT;
1490 offset = *ppos & ~PAGE_CACHE_MASK; 1494 offset = *ppos & ~PAGE_CACHE_MASK;
1491 1495
1492 for (;;) { 1496 for (;;) {
1493 struct page *page = NULL; 1497 struct page *page = NULL;
1494 pgoff_t end_index; 1498 pgoff_t end_index;
1495 unsigned long nr, ret; 1499 unsigned long nr, ret;
1496 loff_t i_size = i_size_read(inode); 1500 loff_t i_size = i_size_read(inode);
1497 1501
1498 end_index = i_size >> PAGE_CACHE_SHIFT; 1502 end_index = i_size >> PAGE_CACHE_SHIFT;
1499 if (index > end_index) 1503 if (index > end_index)
1500 break; 1504 break;
1501 if (index == end_index) { 1505 if (index == end_index) {
1502 nr = i_size & ~PAGE_CACHE_MASK; 1506 nr = i_size & ~PAGE_CACHE_MASK;
1503 if (nr <= offset) 1507 if (nr <= offset)
1504 break; 1508 break;
1505 } 1509 }
1506 1510
1507 desc->error = shmem_getpage(inode, index, &page, sgp, NULL); 1511 desc->error = shmem_getpage(inode, index, &page, sgp, NULL);
1508 if (desc->error) { 1512 if (desc->error) {
1509 if (desc->error == -EINVAL) 1513 if (desc->error == -EINVAL)
1510 desc->error = 0; 1514 desc->error = 0;
1511 break; 1515 break;
1512 } 1516 }
1513 if (page) 1517 if (page)
1514 unlock_page(page); 1518 unlock_page(page);
1515 1519
1516 /* 1520 /*
1517 * We must evaluate after, since reads (unlike writes) 1521 * We must evaluate after, since reads (unlike writes)
1518 * are called without i_mutex protection against truncate 1522 * are called without i_mutex protection against truncate
1519 */ 1523 */
1520 nr = PAGE_CACHE_SIZE; 1524 nr = PAGE_CACHE_SIZE;
1521 i_size = i_size_read(inode); 1525 i_size = i_size_read(inode);
1522 end_index = i_size >> PAGE_CACHE_SHIFT; 1526 end_index = i_size >> PAGE_CACHE_SHIFT;
1523 if (index == end_index) { 1527 if (index == end_index) {
1524 nr = i_size & ~PAGE_CACHE_MASK; 1528 nr = i_size & ~PAGE_CACHE_MASK;
1525 if (nr <= offset) { 1529 if (nr <= offset) {
1526 if (page) 1530 if (page)
1527 page_cache_release(page); 1531 page_cache_release(page);
1528 break; 1532 break;
1529 } 1533 }
1530 } 1534 }
1531 nr -= offset; 1535 nr -= offset;
1532 1536
1533 if (page) { 1537 if (page) {
1534 /* 1538 /*
1535 * If users can be writing to this page using arbitrary 1539 * If users can be writing to this page using arbitrary
1536 * virtual addresses, take care about potential aliasing 1540 * virtual addresses, take care about potential aliasing
1537 * before reading the page on the kernel side. 1541 * before reading the page on the kernel side.
1538 */ 1542 */
1539 if (mapping_writably_mapped(mapping)) 1543 if (mapping_writably_mapped(mapping))
1540 flush_dcache_page(page); 1544 flush_dcache_page(page);
1541 /* 1545 /*
1542 * Mark the page accessed if we read the beginning. 1546 * Mark the page accessed if we read the beginning.
1543 */ 1547 */
1544 if (!offset) 1548 if (!offset)
1545 mark_page_accessed(page); 1549 mark_page_accessed(page);
1546 } else { 1550 } else {
1547 page = ZERO_PAGE(0); 1551 page = ZERO_PAGE(0);
1548 page_cache_get(page); 1552 page_cache_get(page);
1549 } 1553 }
1550 1554
1551 /* 1555 /*
1552 * Ok, we have the page, and it's up-to-date, so 1556 * Ok, we have the page, and it's up-to-date, so
1553 * now we can copy it to user space... 1557 * now we can copy it to user space...
1554 * 1558 *
1555 * The actor routine returns how many bytes were actually used.. 1559 * The actor routine returns how many bytes were actually used..
1556 * NOTE! This may not be the same as how much of a user buffer 1560 * NOTE! This may not be the same as how much of a user buffer
1557 * we filled up (we may be padding etc), so we can only update 1561 * we filled up (we may be padding etc), so we can only update
1558 * "pos" here (the actor routine has to update the user buffer 1562 * "pos" here (the actor routine has to update the user buffer
1559 * pointers and the remaining count). 1563 * pointers and the remaining count).
1560 */ 1564 */
1561 ret = actor(desc, page, offset, nr); 1565 ret = actor(desc, page, offset, nr);
1562 offset += ret; 1566 offset += ret;
1563 index += offset >> PAGE_CACHE_SHIFT; 1567 index += offset >> PAGE_CACHE_SHIFT;
1564 offset &= ~PAGE_CACHE_MASK; 1568 offset &= ~PAGE_CACHE_MASK;
1565 1569
1566 page_cache_release(page); 1570 page_cache_release(page);
1567 if (ret != nr || !desc->count) 1571 if (ret != nr || !desc->count)
1568 break; 1572 break;
1569 1573
1570 cond_resched(); 1574 cond_resched();
1571 } 1575 }
1572 1576
1573 *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset; 1577 *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
1574 file_accessed(filp); 1578 file_accessed(filp);
1575 } 1579 }
1576 1580
1577 static ssize_t shmem_file_aio_read(struct kiocb *iocb, 1581 static ssize_t shmem_file_aio_read(struct kiocb *iocb,
1578 const struct iovec *iov, unsigned long nr_segs, loff_t pos) 1582 const struct iovec *iov, unsigned long nr_segs, loff_t pos)
1579 { 1583 {
1580 struct file *filp = iocb->ki_filp; 1584 struct file *filp = iocb->ki_filp;
1581 ssize_t retval; 1585 ssize_t retval;
1582 unsigned long seg; 1586 unsigned long seg;
1583 size_t count; 1587 size_t count;
1584 loff_t *ppos = &iocb->ki_pos; 1588 loff_t *ppos = &iocb->ki_pos;
1585 1589
1586 retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE); 1590 retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
1587 if (retval) 1591 if (retval)
1588 return retval; 1592 return retval;
1589 1593
1590 for (seg = 0; seg < nr_segs; seg++) { 1594 for (seg = 0; seg < nr_segs; seg++) {
1591 read_descriptor_t desc; 1595 read_descriptor_t desc;
1592 1596
1593 desc.written = 0; 1597 desc.written = 0;
1594 desc.arg.buf = iov[seg].iov_base; 1598 desc.arg.buf = iov[seg].iov_base;
1595 desc.count = iov[seg].iov_len; 1599 desc.count = iov[seg].iov_len;
1596 if (desc.count == 0) 1600 if (desc.count == 0)
1597 continue; 1601 continue;
1598 desc.error = 0; 1602 desc.error = 0;
1599 do_shmem_file_read(filp, ppos, &desc, file_read_actor); 1603 do_shmem_file_read(filp, ppos, &desc, file_read_actor);
1600 retval += desc.written; 1604 retval += desc.written;
1601 if (desc.error) { 1605 if (desc.error) {
1602 retval = retval ?: desc.error; 1606 retval = retval ?: desc.error;
1603 break; 1607 break;
1604 } 1608 }
1605 if (desc.count > 0) 1609 if (desc.count > 0)
1606 break; 1610 break;
1607 } 1611 }
1608 return retval; 1612 return retval;
1609 } 1613 }
1610 1614
1611 static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos, 1615 static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
1612 struct pipe_inode_info *pipe, size_t len, 1616 struct pipe_inode_info *pipe, size_t len,
1613 unsigned int flags) 1617 unsigned int flags)
1614 { 1618 {
1615 struct address_space *mapping = in->f_mapping; 1619 struct address_space *mapping = in->f_mapping;
1616 struct inode *inode = mapping->host; 1620 struct inode *inode = mapping->host;
1617 unsigned int loff, nr_pages, req_pages; 1621 unsigned int loff, nr_pages, req_pages;
1618 struct page *pages[PIPE_DEF_BUFFERS]; 1622 struct page *pages[PIPE_DEF_BUFFERS];
1619 struct partial_page partial[PIPE_DEF_BUFFERS]; 1623 struct partial_page partial[PIPE_DEF_BUFFERS];
1620 struct page *page; 1624 struct page *page;
1621 pgoff_t index, end_index; 1625 pgoff_t index, end_index;
1622 loff_t isize, left; 1626 loff_t isize, left;
1623 int error, page_nr; 1627 int error, page_nr;
1624 struct splice_pipe_desc spd = { 1628 struct splice_pipe_desc spd = {
1625 .pages = pages, 1629 .pages = pages,
1626 .partial = partial, 1630 .partial = partial,
1627 .nr_pages_max = PIPE_DEF_BUFFERS, 1631 .nr_pages_max = PIPE_DEF_BUFFERS,
1628 .flags = flags, 1632 .flags = flags,
1629 .ops = &page_cache_pipe_buf_ops, 1633 .ops = &page_cache_pipe_buf_ops,
1630 .spd_release = spd_release_page, 1634 .spd_release = spd_release_page,
1631 }; 1635 };
1632 1636
1633 isize = i_size_read(inode); 1637 isize = i_size_read(inode);
1634 if (unlikely(*ppos >= isize)) 1638 if (unlikely(*ppos >= isize))
1635 return 0; 1639 return 0;
1636 1640
1637 left = isize - *ppos; 1641 left = isize - *ppos;
1638 if (unlikely(left < len)) 1642 if (unlikely(left < len))
1639 len = left; 1643 len = left;
1640 1644
1641 if (splice_grow_spd(pipe, &spd)) 1645 if (splice_grow_spd(pipe, &spd))
1642 return -ENOMEM; 1646 return -ENOMEM;
1643 1647
1644 index = *ppos >> PAGE_CACHE_SHIFT; 1648 index = *ppos >> PAGE_CACHE_SHIFT;
1645 loff = *ppos & ~PAGE_CACHE_MASK; 1649 loff = *ppos & ~PAGE_CACHE_MASK;
1646 req_pages = (len + loff + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; 1650 req_pages = (len + loff + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
1647 nr_pages = min(req_pages, pipe->buffers); 1651 nr_pages = min(req_pages, pipe->buffers);
1648 1652
1649 spd.nr_pages = find_get_pages_contig(mapping, index, 1653 spd.nr_pages = find_get_pages_contig(mapping, index,
1650 nr_pages, spd.pages); 1654 nr_pages, spd.pages);
1651 index += spd.nr_pages; 1655 index += spd.nr_pages;
1652 error = 0; 1656 error = 0;
1653 1657
1654 while (spd.nr_pages < nr_pages) { 1658 while (spd.nr_pages < nr_pages) {
1655 error = shmem_getpage(inode, index, &page, SGP_CACHE, NULL); 1659 error = shmem_getpage(inode, index, &page, SGP_CACHE, NULL);
1656 if (error) 1660 if (error)
1657 break; 1661 break;
1658 unlock_page(page); 1662 unlock_page(page);
1659 spd.pages[spd.nr_pages++] = page; 1663 spd.pages[spd.nr_pages++] = page;
1660 index++; 1664 index++;
1661 } 1665 }
1662 1666
1663 index = *ppos >> PAGE_CACHE_SHIFT; 1667 index = *ppos >> PAGE_CACHE_SHIFT;
1664 nr_pages = spd.nr_pages; 1668 nr_pages = spd.nr_pages;
1665 spd.nr_pages = 0; 1669 spd.nr_pages = 0;
1666 1670
1667 for (page_nr = 0; page_nr < nr_pages; page_nr++) { 1671 for (page_nr = 0; page_nr < nr_pages; page_nr++) {
1668 unsigned int this_len; 1672 unsigned int this_len;
1669 1673
1670 if (!len) 1674 if (!len)
1671 break; 1675 break;
1672 1676
1673 this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff); 1677 this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff);
1674 page = spd.pages[page_nr]; 1678 page = spd.pages[page_nr];
1675 1679
1676 if (!PageUptodate(page) || page->mapping != mapping) { 1680 if (!PageUptodate(page) || page->mapping != mapping) {
1677 error = shmem_getpage(inode, index, &page, 1681 error = shmem_getpage(inode, index, &page,
1678 SGP_CACHE, NULL); 1682 SGP_CACHE, NULL);
1679 if (error) 1683 if (error)
1680 break; 1684 break;
1681 unlock_page(page); 1685 unlock_page(page);
1682 page_cache_release(spd.pages[page_nr]); 1686 page_cache_release(spd.pages[page_nr]);
1683 spd.pages[page_nr] = page; 1687 spd.pages[page_nr] = page;
1684 } 1688 }
1685 1689
1686 isize = i_size_read(inode); 1690 isize = i_size_read(inode);
1687 end_index = (isize - 1) >> PAGE_CACHE_SHIFT; 1691 end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
1688 if (unlikely(!isize || index > end_index)) 1692 if (unlikely(!isize || index > end_index))
1689 break; 1693 break;
1690 1694
1691 if (end_index == index) { 1695 if (end_index == index) {
1692 unsigned int plen; 1696 unsigned int plen;
1693 1697
1694 plen = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; 1698 plen = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
1695 if (plen <= loff) 1699 if (plen <= loff)
1696 break; 1700 break;
1697 1701
1698 this_len = min(this_len, plen - loff); 1702 this_len = min(this_len, plen - loff);
1699 len = this_len; 1703 len = this_len;
1700 } 1704 }
1701 1705
1702 spd.partial[page_nr].offset = loff; 1706 spd.partial[page_nr].offset = loff;
1703 spd.partial[page_nr].len = this_len; 1707 spd.partial[page_nr].len = this_len;
1704 len -= this_len; 1708 len -= this_len;
1705 loff = 0; 1709 loff = 0;
1706 spd.nr_pages++; 1710 spd.nr_pages++;
1707 index++; 1711 index++;
1708 } 1712 }
1709 1713
1710 while (page_nr < nr_pages) 1714 while (page_nr < nr_pages)
1711 page_cache_release(spd.pages[page_nr++]); 1715 page_cache_release(spd.pages[page_nr++]);
1712 1716
1713 if (spd.nr_pages) 1717 if (spd.nr_pages)
1714 error = splice_to_pipe(pipe, &spd); 1718 error = splice_to_pipe(pipe, &spd);
1715 1719
1716 splice_shrink_spd(&spd); 1720 splice_shrink_spd(&spd);
1717 1721
1718 if (error > 0) { 1722 if (error > 0) {
1719 *ppos += error; 1723 *ppos += error;
1720 file_accessed(in); 1724 file_accessed(in);
1721 } 1725 }
1722 return error; 1726 return error;
1723 } 1727 }
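shmem_file_splice_read() above feeds tmpfs page cache pages straight into a pipe without copying through a user buffer. A hedged userspace sketch of the interface it serves (the file path is made up; one end of splice() must be a pipe):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int pipefd[2];
	int fd = open("/dev/shm/splice-example", O_RDONLY);

	if (fd < 0 || pipe(pipefd) != 0)
		return 1;
	/* Move up to 64KB of tmpfs pages into the pipe without a user-space copy. */
	splice(fd, NULL, pipefd[1], NULL, 64 * 1024, SPLICE_F_MOVE);
	close(pipefd[0]);
	close(pipefd[1]);
	close(fd);
	return 0;
}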
1724 1728
1725 /* 1729 /*
1726 * llseek SEEK_DATA or SEEK_HOLE through the radix_tree. 1730 * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
1727 */ 1731 */
1728 static pgoff_t shmem_seek_hole_data(struct address_space *mapping, 1732 static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
1729 pgoff_t index, pgoff_t end, int whence) 1733 pgoff_t index, pgoff_t end, int whence)
1730 { 1734 {
1731 struct page *page; 1735 struct page *page;
1732 struct pagevec pvec; 1736 struct pagevec pvec;
1733 pgoff_t indices[PAGEVEC_SIZE]; 1737 pgoff_t indices[PAGEVEC_SIZE];
1734 bool done = false; 1738 bool done = false;
1735 int i; 1739 int i;
1736 1740
1737 pagevec_init(&pvec, 0); 1741 pagevec_init(&pvec, 0);
1738 pvec.nr = 1; /* start small: we may be there already */ 1742 pvec.nr = 1; /* start small: we may be there already */
1739 while (!done) { 1743 while (!done) {
1740 pvec.nr = find_get_entries(mapping, index, 1744 pvec.nr = find_get_entries(mapping, index,
1741 pvec.nr, pvec.pages, indices); 1745 pvec.nr, pvec.pages, indices);
1742 if (!pvec.nr) { 1746 if (!pvec.nr) {
1743 if (whence == SEEK_DATA) 1747 if (whence == SEEK_DATA)
1744 index = end; 1748 index = end;
1745 break; 1749 break;
1746 } 1750 }
1747 for (i = 0; i < pvec.nr; i++, index++) { 1751 for (i = 0; i < pvec.nr; i++, index++) {
1748 if (index < indices[i]) { 1752 if (index < indices[i]) {
1749 if (whence == SEEK_HOLE) { 1753 if (whence == SEEK_HOLE) {
1750 done = true; 1754 done = true;
1751 break; 1755 break;
1752 } 1756 }
1753 index = indices[i]; 1757 index = indices[i];
1754 } 1758 }
1755 page = pvec.pages[i]; 1759 page = pvec.pages[i];
1756 if (page && !radix_tree_exceptional_entry(page)) { 1760 if (page && !radix_tree_exceptional_entry(page)) {
1757 if (!PageUptodate(page)) 1761 if (!PageUptodate(page))
1758 page = NULL; 1762 page = NULL;
1759 } 1763 }
1760 if (index >= end || 1764 if (index >= end ||
1761 (page && whence == SEEK_DATA) || 1765 (page && whence == SEEK_DATA) ||
1762 (!page && whence == SEEK_HOLE)) { 1766 (!page && whence == SEEK_HOLE)) {
1763 done = true; 1767 done = true;
1764 break; 1768 break;
1765 } 1769 }
1766 } 1770 }
1767 pagevec_remove_exceptionals(&pvec); 1771 pagevec_remove_exceptionals(&pvec);
1768 pagevec_release(&pvec); 1772 pagevec_release(&pvec);
1769 pvec.nr = PAGEVEC_SIZE; 1773 pvec.nr = PAGEVEC_SIZE;
1770 cond_resched(); 1774 cond_resched();
1771 } 1775 }
1772 return index; 1776 return index;
1773 } 1777 }
1774 1778
1775 static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence) 1779 static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
1776 { 1780 {
1777 struct address_space *mapping = file->f_mapping; 1781 struct address_space *mapping = file->f_mapping;
1778 struct inode *inode = mapping->host; 1782 struct inode *inode = mapping->host;
1779 pgoff_t start, end; 1783 pgoff_t start, end;
1780 loff_t new_offset; 1784 loff_t new_offset;
1781 1785
1782 if (whence != SEEK_DATA && whence != SEEK_HOLE) 1786 if (whence != SEEK_DATA && whence != SEEK_HOLE)
1783 return generic_file_llseek_size(file, offset, whence, 1787 return generic_file_llseek_size(file, offset, whence,
1784 MAX_LFS_FILESIZE, i_size_read(inode)); 1788 MAX_LFS_FILESIZE, i_size_read(inode));
1785 mutex_lock(&inode->i_mutex); 1789 mutex_lock(&inode->i_mutex);
1786 /* We're holding i_mutex so we can access i_size directly */ 1790 /* We're holding i_mutex so we can access i_size directly */
1787 1791
1788 if (offset < 0) 1792 if (offset < 0)
1789 offset = -EINVAL; 1793 offset = -EINVAL;
1790 else if (offset >= inode->i_size) 1794 else if (offset >= inode->i_size)
1791 offset = -ENXIO; 1795 offset = -ENXIO;
1792 else { 1796 else {
1793 start = offset >> PAGE_CACHE_SHIFT; 1797 start = offset >> PAGE_CACHE_SHIFT;
1794 end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; 1798 end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
1795 new_offset = shmem_seek_hole_data(mapping, start, end, whence); 1799 new_offset = shmem_seek_hole_data(mapping, start, end, whence);
1796 new_offset <<= PAGE_CACHE_SHIFT; 1800 new_offset <<= PAGE_CACHE_SHIFT;
1797 if (new_offset > offset) { 1801 if (new_offset > offset) {
1798 if (new_offset < inode->i_size) 1802 if (new_offset < inode->i_size)
1799 offset = new_offset; 1803 offset = new_offset;
1800 else if (whence == SEEK_DATA) 1804 else if (whence == SEEK_DATA)
1801 offset = -ENXIO; 1805 offset = -ENXIO;
1802 else 1806 else
1803 offset = inode->i_size; 1807 offset = inode->i_size;
1804 } 1808 }
1805 } 1809 }
1806 1810
1807 if (offset >= 0) 1811 if (offset >= 0)
1808 offset = vfs_setpos(file, offset, MAX_LFS_FILESIZE); 1812 offset = vfs_setpos(file, offset, MAX_LFS_FILESIZE);
1809 mutex_unlock(&inode->i_mutex); 1813 mutex_unlock(&inode->i_mutex);
1810 return offset; 1814 return offset;
1811 } 1815 }
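shmem_file_llseek() and shmem_seek_hole_data() above implement SEEK_DATA/SEEK_HOLE for tmpfs by walking the radix tree rather than a block map. A small usage sketch (hypothetical path; lseek() fails with ENXIO when no data follows the offset):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/shm/sparse-example", O_RDONLY);
	off_t data, hole;

	if (fd < 0)
		return 1;
	data = lseek(fd, 0, SEEK_DATA);		/* first page-cache page at or after offset 0 */
	if (data >= 0) {
		hole = lseek(fd, data, SEEK_HOLE);	/* end of that data extent */
		printf("data at %lld, hole at %lld\n", (long long)data, (long long)hole);
	}
	close(fd);
	return 0;
}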
1812 1816
1813 static long shmem_fallocate(struct file *file, int mode, loff_t offset, 1817 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
1814 loff_t len) 1818 loff_t len)
1815 { 1819 {
1816 struct inode *inode = file_inode(file); 1820 struct inode *inode = file_inode(file);
1817 struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); 1821 struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
1818 struct shmem_falloc shmem_falloc; 1822 struct shmem_falloc shmem_falloc;
1819 pgoff_t start, index, end; 1823 pgoff_t start, index, end;
1820 int error; 1824 int error;
1821 1825
1822 mutex_lock(&inode->i_mutex); 1826 mutex_lock(&inode->i_mutex);
1823 1827
1824 if (mode & FALLOC_FL_PUNCH_HOLE) { 1828 if (mode & FALLOC_FL_PUNCH_HOLE) {
1825 struct address_space *mapping = file->f_mapping; 1829 struct address_space *mapping = file->f_mapping;
1826 loff_t unmap_start = round_up(offset, PAGE_SIZE); 1830 loff_t unmap_start = round_up(offset, PAGE_SIZE);
1827 loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1; 1831 loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
1828 DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq); 1832 DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
1829 1833
1830 shmem_falloc.waitq = &shmem_falloc_waitq; 1834 shmem_falloc.waitq = &shmem_falloc_waitq;
1831 shmem_falloc.start = unmap_start >> PAGE_SHIFT; 1835 shmem_falloc.start = unmap_start >> PAGE_SHIFT;
1832 shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; 1836 shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
1833 spin_lock(&inode->i_lock); 1837 spin_lock(&inode->i_lock);
1834 inode->i_private = &shmem_falloc; 1838 inode->i_private = &shmem_falloc;
1835 spin_unlock(&inode->i_lock); 1839 spin_unlock(&inode->i_lock);
1836 1840
1837 if ((u64)unmap_end > (u64)unmap_start) 1841 if ((u64)unmap_end > (u64)unmap_start)
1838 unmap_mapping_range(mapping, unmap_start, 1842 unmap_mapping_range(mapping, unmap_start,
1839 1 + unmap_end - unmap_start, 0); 1843 1 + unmap_end - unmap_start, 0);
1840 shmem_truncate_range(inode, offset, offset + len - 1); 1844 shmem_truncate_range(inode, offset, offset + len - 1);
1841 /* No need to unmap again: hole-punching leaves COWed pages */ 1845 /* No need to unmap again: hole-punching leaves COWed pages */
1842 1846
1843 spin_lock(&inode->i_lock); 1847 spin_lock(&inode->i_lock);
1844 inode->i_private = NULL; 1848 inode->i_private = NULL;
1845 wake_up_all(&shmem_falloc_waitq); 1849 wake_up_all(&shmem_falloc_waitq);
1846 spin_unlock(&inode->i_lock); 1850 spin_unlock(&inode->i_lock);
1847 error = 0; 1851 error = 0;
1848 goto out; 1852 goto out;
1849 } 1853 }
1850 1854
1851 /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */ 1855 /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
1852 error = inode_newsize_ok(inode, offset + len); 1856 error = inode_newsize_ok(inode, offset + len);
1853 if (error) 1857 if (error)
1854 goto out; 1858 goto out;
1855 1859
1856 start = offset >> PAGE_CACHE_SHIFT; 1860 start = offset >> PAGE_CACHE_SHIFT;
1857 end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; 1861 end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
1858 /* Try to avoid a swapstorm if len is impossible to satisfy */ 1862 /* Try to avoid a swapstorm if len is impossible to satisfy */
1859 if (sbinfo->max_blocks && end - start > sbinfo->max_blocks) { 1863 if (sbinfo->max_blocks && end - start > sbinfo->max_blocks) {
1860 error = -ENOSPC; 1864 error = -ENOSPC;
1861 goto out; 1865 goto out;
1862 } 1866 }
1863 1867
1864 shmem_falloc.waitq = NULL; 1868 shmem_falloc.waitq = NULL;
1865 shmem_falloc.start = start; 1869 shmem_falloc.start = start;
1866 shmem_falloc.next = start; 1870 shmem_falloc.next = start;
1867 shmem_falloc.nr_falloced = 0; 1871 shmem_falloc.nr_falloced = 0;
1868 shmem_falloc.nr_unswapped = 0; 1872 shmem_falloc.nr_unswapped = 0;
1869 spin_lock(&inode->i_lock); 1873 spin_lock(&inode->i_lock);
1870 inode->i_private = &shmem_falloc; 1874 inode->i_private = &shmem_falloc;
1871 spin_unlock(&inode->i_lock); 1875 spin_unlock(&inode->i_lock);
1872 1876
1873 for (index = start; index < end; index++) { 1877 for (index = start; index < end; index++) {
1874 struct page *page; 1878 struct page *page;
1875 1879
1876 /* 1880 /*
1877 * Good, the fallocate(2) manpage permits EINTR: we may have 1881 * Good, the fallocate(2) manpage permits EINTR: we may have
1878 * been interrupted because we are using up too much memory. 1882 * been interrupted because we are using up too much memory.
1879 */ 1883 */
1880 if (signal_pending(current)) 1884 if (signal_pending(current))
1881 error = -EINTR; 1885 error = -EINTR;
1882 else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced) 1886 else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced)
1883 error = -ENOMEM; 1887 error = -ENOMEM;
1884 else 1888 else
1885 error = shmem_getpage(inode, index, &page, SGP_FALLOC, 1889 error = shmem_getpage(inode, index, &page, SGP_FALLOC,
1886 NULL); 1890 NULL);
1887 if (error) { 1891 if (error) {
1888 /* Remove the !PageUptodate pages we added */ 1892 /* Remove the !PageUptodate pages we added */
1889 shmem_undo_range(inode, 1893 shmem_undo_range(inode,
1890 (loff_t)start << PAGE_CACHE_SHIFT, 1894 (loff_t)start << PAGE_CACHE_SHIFT,
1891 (loff_t)index << PAGE_CACHE_SHIFT, true); 1895 (loff_t)index << PAGE_CACHE_SHIFT, true);
1892 goto undone; 1896 goto undone;
1893 } 1897 }
1894 1898
1895 /* 1899 /*
1896 * Inform shmem_writepage() how far we have reached. 1900 * Inform shmem_writepage() how far we have reached.
1897 * No need for lock or barrier: we have the page lock. 1901 * No need for lock or barrier: we have the page lock.
1898 */ 1902 */
1899 shmem_falloc.next++; 1903 shmem_falloc.next++;
1900 if (!PageUptodate(page)) 1904 if (!PageUptodate(page))
1901 shmem_falloc.nr_falloced++; 1905 shmem_falloc.nr_falloced++;
1902 1906
1903 /* 1907 /*
1904 * If !PageUptodate, leave it that way so that freeable pages 1908 * If !PageUptodate, leave it that way so that freeable pages
1905 * can be recognized if we need to rollback on error later. 1909 * can be recognized if we need to rollback on error later.
1906 * But set_page_dirty so that memory pressure will swap rather 1910 * But set_page_dirty so that memory pressure will swap rather
1907 * than free the pages we are allocating (and SGP_CACHE pages 1911 * than free the pages we are allocating (and SGP_CACHE pages
1908 * might still be clean: we now need to mark those dirty too). 1912 * might still be clean: we now need to mark those dirty too).
1909 */ 1913 */
1910 set_page_dirty(page); 1914 set_page_dirty(page);
1911 unlock_page(page); 1915 unlock_page(page);
1912 page_cache_release(page); 1916 page_cache_release(page);
1913 cond_resched(); 1917 cond_resched();
1914 } 1918 }
1915 1919
1916 if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) 1920 if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
1917 i_size_write(inode, offset + len); 1921 i_size_write(inode, offset + len);
1918 inode->i_ctime = CURRENT_TIME; 1922 inode->i_ctime = CURRENT_TIME;
1919 undone: 1923 undone:
1920 spin_lock(&inode->i_lock); 1924 spin_lock(&inode->i_lock);
1921 inode->i_private = NULL; 1925 inode->i_private = NULL;
1922 spin_unlock(&inode->i_lock); 1926 spin_unlock(&inode->i_lock);
1923 out: 1927 out:
1924 mutex_unlock(&inode->i_mutex); 1928 mutex_unlock(&inode->i_mutex);
1925 return error; 1929 return error;
1926 } 1930 }
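shmem_fallocate() above serves both preallocation (the SGP_FALLOC loop) and hole punching (the FALLOC_FL_PUNCH_HOLE branch, whose on-stack waitqueue pairs with the fault-side wait in shmem_fault() earlier). A hedged usage sketch with made-up path and sizes; note that hole punching must be combined with FALLOC_FL_KEEP_SIZE:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/shm/falloc-example", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return 1;
	/* Preallocate 4MB: each index goes through the SGP_FALLOC loop above. */
	fallocate(fd, 0, 0, 4 << 20);
	/* Punch a 1MB hole at offset 1MB: unmap, truncate the range, wake waiters. */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 1 << 20, 1 << 20);
	close(fd);
	return 0;
}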
1927 1931
1928 static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf) 1932 static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
1929 { 1933 {
1930 struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb); 1934 struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
1931 1935
1932 buf->f_type = TMPFS_MAGIC; 1936 buf->f_type = TMPFS_MAGIC;
1933 buf->f_bsize = PAGE_CACHE_SIZE; 1937 buf->f_bsize = PAGE_CACHE_SIZE;
1934 buf->f_namelen = NAME_MAX; 1938 buf->f_namelen = NAME_MAX;
1935 if (sbinfo->max_blocks) { 1939 if (sbinfo->max_blocks) {
1936 buf->f_blocks = sbinfo->max_blocks; 1940 buf->f_blocks = sbinfo->max_blocks;
1937 buf->f_bavail = 1941 buf->f_bavail =
1938 buf->f_bfree = sbinfo->max_blocks - 1942 buf->f_bfree = sbinfo->max_blocks -
1939 percpu_counter_sum(&sbinfo->used_blocks); 1943 percpu_counter_sum(&sbinfo->used_blocks);
1940 } 1944 }
1941 if (sbinfo->max_inodes) { 1945 if (sbinfo->max_inodes) {
1942 buf->f_files = sbinfo->max_inodes; 1946 buf->f_files = sbinfo->max_inodes;
1943 buf->f_ffree = sbinfo->free_inodes; 1947 buf->f_ffree = sbinfo->free_inodes;
1944 } 1948 }
1945 /* else leave those fields 0 like simple_statfs */ 1949 /* else leave those fields 0 like simple_statfs */
1946 return 0; 1950 return 0;
1947 } 1951 }
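The f_blocks/f_bfree and f_files/f_ffree values filled in above are what df and statvfs() report for a tmpfs mount. A quick check from userspace (mount point assumed to be /dev/shm):

#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
	struct statvfs sv;

	if (statvfs("/dev/shm", &sv) != 0)
		return 1;
	/* f_frsize corresponds to the f_bsize (PAGE_CACHE_SIZE) set in shmem_statfs(). */
	printf("size %llu bytes, free %llu bytes, inodes %llu free of %llu\n",
	       (unsigned long long)sv.f_blocks * sv.f_frsize,
	       (unsigned long long)sv.f_bfree * sv.f_frsize,
	       (unsigned long long)sv.f_ffree,
	       (unsigned long long)sv.f_files);
	return 0;
}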
1948 1952
1949 /* 1953 /*
1950 * File creation. Allocate an inode, and we're done.. 1954 * File creation. Allocate an inode, and we're done..
1951 */ 1955 */
1952 static int 1956 static int
1953 shmem_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) 1957 shmem_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
1954 { 1958 {
1955 struct inode *inode; 1959 struct inode *inode;
1956 int error = -ENOSPC; 1960 int error = -ENOSPC;
1957 1961
1958 inode = shmem_get_inode(dir->i_sb, dir, mode, dev, VM_NORESERVE); 1962 inode = shmem_get_inode(dir->i_sb, dir, mode, dev, VM_NORESERVE);
1959 if (inode) { 1963 if (inode) {
1960 #ifdef CONFIG_TMPFS_POSIX_ACL 1964 #ifdef CONFIG_TMPFS_POSIX_ACL
1961 error = generic_acl_init(inode, dir); 1965 error = generic_acl_init(inode, dir);
1962 if (error) { 1966 if (error) {
1963 iput(inode); 1967 iput(inode);
1964 return error; 1968 return error;
1965 } 1969 }
1966 #endif 1970 #endif
1967 error = security_inode_init_security(inode, dir, 1971 error = security_inode_init_security(inode, dir,
1968 &dentry->d_name, 1972 &dentry->d_name,
1969 shmem_initxattrs, NULL); 1973 shmem_initxattrs, NULL);
1970 if (error) { 1974 if (error) {
1971 if (error != -EOPNOTSUPP) { 1975 if (error != -EOPNOTSUPP) {
1972 iput(inode); 1976 iput(inode);
1973 return error; 1977 return error;
1974 } 1978 }
1975 } 1979 }
1976 1980
1977 error = 0; 1981 error = 0;
1978 dir->i_size += BOGO_DIRENT_SIZE; 1982 dir->i_size += BOGO_DIRENT_SIZE;
1979 dir->i_ctime = dir->i_mtime = CURRENT_TIME; 1983 dir->i_ctime = dir->i_mtime = CURRENT_TIME;
1980 d_instantiate(dentry, inode); 1984 d_instantiate(dentry, inode);
1981 dget(dentry); /* Extra count - pin the dentry in core */ 1985 dget(dentry); /* Extra count - pin the dentry in core */
1982 } 1986 }
1983 return error; 1987 return error;
1984 } 1988 }
1985 1989
1986 static int 1990 static int
1987 shmem_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) 1991 shmem_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
1988 { 1992 {
1989 struct inode *inode; 1993 struct inode *inode;
1990 int error = -ENOSPC; 1994 int error = -ENOSPC;
1991 1995
1992 inode = shmem_get_inode(dir->i_sb, dir, mode, 0, VM_NORESERVE); 1996 inode = shmem_get_inode(dir->i_sb, dir, mode, 0, VM_NORESERVE);
1993 if (inode) { 1997 if (inode) {
1994 error = security_inode_init_security(inode, dir, 1998 error = security_inode_init_security(inode, dir,
1995 NULL, 1999 NULL,
1996 shmem_initxattrs, NULL); 2000 shmem_initxattrs, NULL);
1997 if (error) { 2001 if (error) {
1998 if (error != -EOPNOTSUPP) { 2002 if (error != -EOPNOTSUPP) {
1999 iput(inode); 2003 iput(inode);
2000 return error; 2004 return error;
2001 } 2005 }
2002 } 2006 }
2003 #ifdef CONFIG_TMPFS_POSIX_ACL 2007 #ifdef CONFIG_TMPFS_POSIX_ACL
2004 error = generic_acl_init(inode, dir); 2008 error = generic_acl_init(inode, dir);
2005 if (error) { 2009 if (error) {
2006 iput(inode); 2010 iput(inode);
2007 return error; 2011 return error;
2008 } 2012 }
2009 #else 2013 #else
2010 error = 0; 2014 error = 0;
2011 #endif 2015 #endif
2012 d_tmpfile(dentry, inode); 2016 d_tmpfile(dentry, inode);
2013 } 2017 }
2014 return error; 2018 return error;
2015 } 2019 }
2016 2020
2017 static int shmem_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) 2021 static int shmem_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
2018 { 2022 {
2019 int error; 2023 int error;
2020 2024
2021 if ((error = shmem_mknod(dir, dentry, mode | S_IFDIR, 0))) 2025 if ((error = shmem_mknod(dir, dentry, mode | S_IFDIR, 0)))
2022 return error; 2026 return error;
2023 inc_nlink(dir); 2027 inc_nlink(dir);
2024 return 0; 2028 return 0;
2025 } 2029 }
2026 2030
2027 static int shmem_create(struct inode *dir, struct dentry *dentry, umode_t mode, 2031 static int shmem_create(struct inode *dir, struct dentry *dentry, umode_t mode,
2028 bool excl) 2032 bool excl)
2029 { 2033 {
2030 return shmem_mknod(dir, dentry, mode | S_IFREG, 0); 2034 return shmem_mknod(dir, dentry, mode | S_IFREG, 0);
2031 } 2035 }
2032 2036
2033 /* 2037 /*
2034 * Link a file.. 2038 * Link a file..
2035 */ 2039 */
2036 static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry) 2040 static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
2037 { 2041 {
2038 struct inode *inode = old_dentry->d_inode; 2042 struct inode *inode = old_dentry->d_inode;
2039 int ret; 2043 int ret;
2040 2044
2041 /* 2045 /*
2042 * No ordinary (disk based) filesystem counts links as inodes; 2046 * No ordinary (disk based) filesystem counts links as inodes;
2043 * but each new link needs a new dentry, pinning lowmem, and 2047 * but each new link needs a new dentry, pinning lowmem, and
2044 * tmpfs dentries cannot be pruned until they are unlinked. 2048 * tmpfs dentries cannot be pruned until they are unlinked.
2045 */ 2049 */
2046 ret = shmem_reserve_inode(inode->i_sb); 2050 ret = shmem_reserve_inode(inode->i_sb);
2047 if (ret) 2051 if (ret)
2048 goto out; 2052 goto out;
2049 2053
2050 dir->i_size += BOGO_DIRENT_SIZE; 2054 dir->i_size += BOGO_DIRENT_SIZE;
2051 inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME; 2055 inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
2052 inc_nlink(inode); 2056 inc_nlink(inode);
2053 ihold(inode); /* New dentry reference */ 2057 ihold(inode); /* New dentry reference */
2054 dget(dentry); /* Extra pinning count for the created dentry */ 2058 dget(dentry); /* Extra pinning count for the created dentry */
2055 d_instantiate(dentry, inode); 2059 d_instantiate(dentry, inode);
2056 out: 2060 out:
2057 return ret; 2061 return ret;
2058 } 2062 }
2059 2063
2060 static int shmem_unlink(struct inode *dir, struct dentry *dentry) 2064 static int shmem_unlink(struct inode *dir, struct dentry *dentry)
2061 { 2065 {
2062 struct inode *inode = dentry->d_inode; 2066 struct inode *inode = dentry->d_inode;
2063 2067
2064 if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)) 2068 if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode))
2065 shmem_free_inode(inode->i_sb); 2069 shmem_free_inode(inode->i_sb);
2066 2070
2067 dir->i_size -= BOGO_DIRENT_SIZE; 2071 dir->i_size -= BOGO_DIRENT_SIZE;
2068 inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME; 2072 inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
2069 drop_nlink(inode); 2073 drop_nlink(inode);
2070 dput(dentry); /* Undo the count from "create" - this does all the work */ 2074 dput(dentry); /* Undo the count from "create" - this does all the work */
2071 return 0; 2075 return 0;
2072 } 2076 }
2073 2077
2074 static int shmem_rmdir(struct inode *dir, struct dentry *dentry) 2078 static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
2075 { 2079 {
2076 if (!simple_empty(dentry)) 2080 if (!simple_empty(dentry))
2077 return -ENOTEMPTY; 2081 return -ENOTEMPTY;
2078 2082
2079 drop_nlink(dentry->d_inode); 2083 drop_nlink(dentry->d_inode);
2080 drop_nlink(dir); 2084 drop_nlink(dir);
2081 return shmem_unlink(dir, dentry); 2085 return shmem_unlink(dir, dentry);
2082 } 2086 }
2083 2087
2084 /* 2088 /*
2085 * The VFS layer already does all the dentry stuff for rename, 2089 * The VFS layer already does all the dentry stuff for rename,
2086 * we just have to decrement the usage count for the target if 2090 * we just have to decrement the usage count for the target if
2087 * it exists so that the VFS layer correctly free's it when it 2091 * it exists so that the VFS layer correctly free's it when it
2088 * gets overwritten. 2092 * gets overwritten.
2089 */ 2093 */
2090 static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry) 2094 static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry)
2091 { 2095 {
2092 struct inode *inode = old_dentry->d_inode; 2096 struct inode *inode = old_dentry->d_inode;
2093 int they_are_dirs = S_ISDIR(inode->i_mode); 2097 int they_are_dirs = S_ISDIR(inode->i_mode);
2094 2098
2095 if (!simple_empty(new_dentry)) 2099 if (!simple_empty(new_dentry))
2096 return -ENOTEMPTY; 2100 return -ENOTEMPTY;
2097 2101
2098 if (new_dentry->d_inode) { 2102 if (new_dentry->d_inode) {
2099 (void) shmem_unlink(new_dir, new_dentry); 2103 (void) shmem_unlink(new_dir, new_dentry);
2100 if (they_are_dirs) 2104 if (they_are_dirs)
2101 drop_nlink(old_dir); 2105 drop_nlink(old_dir);
2102 } else if (they_are_dirs) { 2106 } else if (they_are_dirs) {
2103 drop_nlink(old_dir); 2107 drop_nlink(old_dir);
2104 inc_nlink(new_dir); 2108 inc_nlink(new_dir);
2105 } 2109 }
2106 2110
2107 old_dir->i_size -= BOGO_DIRENT_SIZE; 2111 old_dir->i_size -= BOGO_DIRENT_SIZE;
2108 new_dir->i_size += BOGO_DIRENT_SIZE; 2112 new_dir->i_size += BOGO_DIRENT_SIZE;
2109 old_dir->i_ctime = old_dir->i_mtime = 2113 old_dir->i_ctime = old_dir->i_mtime =
2110 new_dir->i_ctime = new_dir->i_mtime = 2114 new_dir->i_ctime = new_dir->i_mtime =
2111 inode->i_ctime = CURRENT_TIME; 2115 inode->i_ctime = CURRENT_TIME;
2112 return 0; 2116 return 0;
2113 } 2117 }
2114 2118
2115 static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *symname) 2119 static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *symname)
2116 { 2120 {
2117 int error; 2121 int error;
2118 int len; 2122 int len;
2119 struct inode *inode; 2123 struct inode *inode;
2120 struct page *page; 2124 struct page *page;
2121 char *kaddr; 2125 char *kaddr;
2122 struct shmem_inode_info *info; 2126 struct shmem_inode_info *info;
2123 2127
2124 len = strlen(symname) + 1; 2128 len = strlen(symname) + 1;
2125 if (len > PAGE_CACHE_SIZE) 2129 if (len > PAGE_CACHE_SIZE)
2126 return -ENAMETOOLONG; 2130 return -ENAMETOOLONG;
2127 2131
2128 inode = shmem_get_inode(dir->i_sb, dir, S_IFLNK|S_IRWXUGO, 0, VM_NORESERVE); 2132 inode = shmem_get_inode(dir->i_sb, dir, S_IFLNK|S_IRWXUGO, 0, VM_NORESERVE);
2129 if (!inode) 2133 if (!inode)
2130 return -ENOSPC; 2134 return -ENOSPC;
2131 2135
2132 error = security_inode_init_security(inode, dir, &dentry->d_name, 2136 error = security_inode_init_security(inode, dir, &dentry->d_name,
2133 shmem_initxattrs, NULL); 2137 shmem_initxattrs, NULL);
2134 if (error) { 2138 if (error) {
2135 if (error != -EOPNOTSUPP) { 2139 if (error != -EOPNOTSUPP) {
2136 iput(inode); 2140 iput(inode);
2137 return error; 2141 return error;
2138 } 2142 }
2139 error = 0; 2143 error = 0;
2140 } 2144 }
2141 2145
2142 info = SHMEM_I(inode); 2146 info = SHMEM_I(inode);
2143 inode->i_size = len-1; 2147 inode->i_size = len-1;
2144 if (len <= SHORT_SYMLINK_LEN) { 2148 if (len <= SHORT_SYMLINK_LEN) {
2145 info->symlink = kmemdup(symname, len, GFP_KERNEL); 2149 info->symlink = kmemdup(symname, len, GFP_KERNEL);
2146 if (!info->symlink) { 2150 if (!info->symlink) {
2147 iput(inode); 2151 iput(inode);
2148 return -ENOMEM; 2152 return -ENOMEM;
2149 } 2153 }
2150 inode->i_op = &shmem_short_symlink_operations; 2154 inode->i_op = &shmem_short_symlink_operations;
2151 } else { 2155 } else {
2152 error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL); 2156 error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL);
2153 if (error) { 2157 if (error) {
2154 iput(inode); 2158 iput(inode);
2155 return error; 2159 return error;
2156 } 2160 }
2157 inode->i_mapping->a_ops = &shmem_aops; 2161 inode->i_mapping->a_ops = &shmem_aops;
2158 inode->i_op = &shmem_symlink_inode_operations; 2162 inode->i_op = &shmem_symlink_inode_operations;
2159 kaddr = kmap_atomic(page); 2163 kaddr = kmap_atomic(page);
2160 memcpy(kaddr, symname, len); 2164 memcpy(kaddr, symname, len);
2161 kunmap_atomic(kaddr); 2165 kunmap_atomic(kaddr);
2162 SetPageUptodate(page); 2166 SetPageUptodate(page);
2163 set_page_dirty(page); 2167 set_page_dirty(page);
2164 unlock_page(page); 2168 unlock_page(page);
2165 page_cache_release(page); 2169 page_cache_release(page);
2166 } 2170 }
2167 dir->i_size += BOGO_DIRENT_SIZE; 2171 dir->i_size += BOGO_DIRENT_SIZE;
2168 dir->i_ctime = dir->i_mtime = CURRENT_TIME; 2172 dir->i_ctime = dir->i_mtime = CURRENT_TIME;
2169 d_instantiate(dentry, inode); 2173 d_instantiate(dentry, inode);
2170 dget(dentry); 2174 dget(dentry);
2171 return 0; 2175 return 0;
2172 } 2176 }
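
Worth noting about the two branches above: a target short enough for the inline buffer is copied with kmemdup() and served by shmem_short_symlink_operations, while a longer target is written into page 0 of the inode's mapping and later read back through shmem_follow_link(). The cutoff, SHORT_SYMLINK_LEN, is defined earlier in the file. Either way the link behaves identically from user space, as in this small illustrative program (paths and sizes are arbitrary):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* Illustration only: whichever storage path tmpfs picks, readlink()
	 * returns the same target.  /dev/shm is normally a tmpfs mount. */
	int main(void)
	{
		char target[300], buf[4096];
		ssize_t n;

		memset(target, 'x', sizeof(target) - 1);
		target[sizeof(target) - 1] = '\0';

		symlink("short-target", "/dev/shm/short_link");
		symlink(target, "/dev/shm/long_link");

		n = readlink("/dev/shm/long_link", buf, sizeof(buf) - 1);
		if (n > 0)
			printf("long link target is %zd bytes\n", n);

		unlink("/dev/shm/short_link");
		unlink("/dev/shm/long_link");
		return 0;
	}
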
2173 2177
2174 static void *shmem_follow_short_symlink(struct dentry *dentry, struct nameidata *nd) 2178 static void *shmem_follow_short_symlink(struct dentry *dentry, struct nameidata *nd)
2175 { 2179 {
2176 nd_set_link(nd, SHMEM_I(dentry->d_inode)->symlink); 2180 nd_set_link(nd, SHMEM_I(dentry->d_inode)->symlink);
2177 return NULL; 2181 return NULL;
2178 } 2182 }
2179 2183
2180 static void *shmem_follow_link(struct dentry *dentry, struct nameidata *nd) 2184 static void *shmem_follow_link(struct dentry *dentry, struct nameidata *nd)
2181 { 2185 {
2182 struct page *page = NULL; 2186 struct page *page = NULL;
2183 int error = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ, NULL); 2187 int error = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ, NULL);
2184 nd_set_link(nd, error ? ERR_PTR(error) : kmap(page)); 2188 nd_set_link(nd, error ? ERR_PTR(error) : kmap(page));
2185 if (page) 2189 if (page)
2186 unlock_page(page); 2190 unlock_page(page);
2187 return page; 2191 return page;
2188 } 2192 }
2189 2193
2190 static void shmem_put_link(struct dentry *dentry, struct nameidata *nd, void *cookie) 2194 static void shmem_put_link(struct dentry *dentry, struct nameidata *nd, void *cookie)
2191 { 2195 {
2192 if (!IS_ERR(nd_get_link(nd))) { 2196 if (!IS_ERR(nd_get_link(nd))) {
2193 struct page *page = cookie; 2197 struct page *page = cookie;
2194 kunmap(page); 2198 kunmap(page);
2195 mark_page_accessed(page); 2199 mark_page_accessed(page);
2196 page_cache_release(page); 2200 page_cache_release(page);
2197 } 2201 }
2198 } 2202 }
2199 2203
2200 #ifdef CONFIG_TMPFS_XATTR 2204 #ifdef CONFIG_TMPFS_XATTR
2201 /* 2205 /*
2202 * Superblocks without xattr inode operations may get some security.* xattr 2206 * Superblocks without xattr inode operations may get some security.* xattr
2203 * support from the LSM "for free". As soon as we have any other xattrs 2207 * support from the LSM "for free". As soon as we have any other xattrs
2204 * like ACLs, we also need to implement the security.* handlers at 2208 * like ACLs, we also need to implement the security.* handlers at
2205 * filesystem level, though. 2209 * filesystem level, though.
2206 */ 2210 */
2207 2211
2208 /* 2212 /*
2209 * Callback for security_inode_init_security() for acquiring xattrs. 2213 * Callback for security_inode_init_security() for acquiring xattrs.
2210 */ 2214 */
2211 static int shmem_initxattrs(struct inode *inode, 2215 static int shmem_initxattrs(struct inode *inode,
2212 const struct xattr *xattr_array, 2216 const struct xattr *xattr_array,
2213 void *fs_info) 2217 void *fs_info)
2214 { 2218 {
2215 struct shmem_inode_info *info = SHMEM_I(inode); 2219 struct shmem_inode_info *info = SHMEM_I(inode);
2216 const struct xattr *xattr; 2220 const struct xattr *xattr;
2217 struct simple_xattr *new_xattr; 2221 struct simple_xattr *new_xattr;
2218 size_t len; 2222 size_t len;
2219 2223
2220 for (xattr = xattr_array; xattr->name != NULL; xattr++) { 2224 for (xattr = xattr_array; xattr->name != NULL; xattr++) {
2221 new_xattr = simple_xattr_alloc(xattr->value, xattr->value_len); 2225 new_xattr = simple_xattr_alloc(xattr->value, xattr->value_len);
2222 if (!new_xattr) 2226 if (!new_xattr)
2223 return -ENOMEM; 2227 return -ENOMEM;
2224 2228
2225 len = strlen(xattr->name) + 1; 2229 len = strlen(xattr->name) + 1;
2226 new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len, 2230 new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len,
2227 GFP_KERNEL); 2231 GFP_KERNEL);
2228 if (!new_xattr->name) { 2232 if (!new_xattr->name) {
2229 kfree(new_xattr); 2233 kfree(new_xattr);
2230 return -ENOMEM; 2234 return -ENOMEM;
2231 } 2235 }
2232 2236
2233 memcpy(new_xattr->name, XATTR_SECURITY_PREFIX, 2237 memcpy(new_xattr->name, XATTR_SECURITY_PREFIX,
2234 XATTR_SECURITY_PREFIX_LEN); 2238 XATTR_SECURITY_PREFIX_LEN);
2235 memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN, 2239 memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN,
2236 xattr->name, len); 2240 xattr->name, len);
2237 2241
2238 simple_xattr_list_add(&info->xattrs, new_xattr); 2242 simple_xattr_list_add(&info->xattrs, new_xattr);
2239 } 2243 }
2240 2244
2241 return 0; 2245 return 0;
2242 } 2246 }
2243 2247
2244 static const struct xattr_handler *shmem_xattr_handlers[] = { 2248 static const struct xattr_handler *shmem_xattr_handlers[] = {
2245 #ifdef CONFIG_TMPFS_POSIX_ACL 2249 #ifdef CONFIG_TMPFS_POSIX_ACL
2246 &generic_acl_access_handler, 2250 &generic_acl_access_handler,
2247 &generic_acl_default_handler, 2251 &generic_acl_default_handler,
2248 #endif 2252 #endif
2249 NULL 2253 NULL
2250 }; 2254 };
2251 2255
2252 static int shmem_xattr_validate(const char *name) 2256 static int shmem_xattr_validate(const char *name)
2253 { 2257 {
2254 struct { const char *prefix; size_t len; } arr[] = { 2258 struct { const char *prefix; size_t len; } arr[] = {
2255 { XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN }, 2259 { XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN },
2256 { XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN } 2260 { XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN }
2257 }; 2261 };
2258 int i; 2262 int i;
2259 2263
2260 for (i = 0; i < ARRAY_SIZE(arr); i++) { 2264 for (i = 0; i < ARRAY_SIZE(arr); i++) {
2261 size_t preflen = arr[i].len; 2265 size_t preflen = arr[i].len;
2262 if (strncmp(name, arr[i].prefix, preflen) == 0) { 2266 if (strncmp(name, arr[i].prefix, preflen) == 0) {
2263 if (!name[preflen]) 2267 if (!name[preflen])
2264 return -EINVAL; 2268 return -EINVAL;
2265 return 0; 2269 return 0;
2266 } 2270 }
2267 } 2271 }
2268 return -EOPNOTSUPP; 2272 return -EOPNOTSUPP;
2269 } 2273 }
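
shmem_xattr_validate() above admits only the security.* and trusted.* namespaces and insists on a non-empty name after the prefix (-EINVAL otherwise); anything else is rejected with -EOPNOTSUPP so it cannot shadow the generic system.* handling. The same check in a stand-alone sketch, with the usual prefix strings assumed:

	#include <stdio.h>
	#include <string.h>

	/* Mirror of the validation logic above, for illustration only. */
	static int validate(const char *name)
	{
		static const struct { const char *prefix; } arr[] = {
			{ "security." },
			{ "trusted." },
		};
		size_t i;

		for (i = 0; i < sizeof(arr) / sizeof(arr[0]); i++) {
			size_t preflen = strlen(arr[i].prefix);
			if (strncmp(name, arr[i].prefix, preflen) == 0)
				return name[preflen] ? 0 : -22;	/* -EINVAL */
		}
		return -95;					/* -EOPNOTSUPP */
	}

	int main(void)
	{
		printf("%d\n", validate("security.selinux"));	/* 0: accepted */
		printf("%d\n", validate("security."));		/* -22: empty suffix */
		printf("%d\n", validate("user.comment"));	/* -95: unsupported */
		return 0;
	}
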
2270 2274
2271 static ssize_t shmem_getxattr(struct dentry *dentry, const char *name, 2275 static ssize_t shmem_getxattr(struct dentry *dentry, const char *name,
2272 void *buffer, size_t size) 2276 void *buffer, size_t size)
2273 { 2277 {
2274 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); 2278 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode);
2275 int err; 2279 int err;
2276 2280
2277 /* 2281 /*
2278 * If this is a request for a synthetic attribute in the system.* 2282 * If this is a request for a synthetic attribute in the system.*
2279 * namespace use the generic infrastructure to resolve a handler 2283 * namespace use the generic infrastructure to resolve a handler
2280 * for it via sb->s_xattr. 2284 * for it via sb->s_xattr.
2281 */ 2285 */
2282 if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) 2286 if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN))
2283 return generic_getxattr(dentry, name, buffer, size); 2287 return generic_getxattr(dentry, name, buffer, size);
2284 2288
2285 err = shmem_xattr_validate(name); 2289 err = shmem_xattr_validate(name);
2286 if (err) 2290 if (err)
2287 return err; 2291 return err;
2288 2292
2289 return simple_xattr_get(&info->xattrs, name, buffer, size); 2293 return simple_xattr_get(&info->xattrs, name, buffer, size);
2290 } 2294 }
2291 2295
2292 static int shmem_setxattr(struct dentry *dentry, const char *name, 2296 static int shmem_setxattr(struct dentry *dentry, const char *name,
2293 const void *value, size_t size, int flags) 2297 const void *value, size_t size, int flags)
2294 { 2298 {
2295 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); 2299 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode);
2296 int err; 2300 int err;
2297 2301
2298 /* 2302 /*
2299 * If this is a request for a synthetic attribute in the system.* 2303 * If this is a request for a synthetic attribute in the system.*
2300 * namespace use the generic infrastructure to resolve a handler 2304 * namespace use the generic infrastructure to resolve a handler
2301 * for it via sb->s_xattr. 2305 * for it via sb->s_xattr.
2302 */ 2306 */
2303 if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) 2307 if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN))
2304 return generic_setxattr(dentry, name, value, size, flags); 2308 return generic_setxattr(dentry, name, value, size, flags);
2305 2309
2306 err = shmem_xattr_validate(name); 2310 err = shmem_xattr_validate(name);
2307 if (err) 2311 if (err)
2308 return err; 2312 return err;
2309 2313
2310 return simple_xattr_set(&info->xattrs, name, value, size, flags); 2314 return simple_xattr_set(&info->xattrs, name, value, size, flags);
2311 } 2315 }
2312 2316
2313 static int shmem_removexattr(struct dentry *dentry, const char *name) 2317 static int shmem_removexattr(struct dentry *dentry, const char *name)
2314 { 2318 {
2315 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); 2319 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode);
2316 int err; 2320 int err;
2317 2321
2318 /* 2322 /*
2319 * If this is a request for a synthetic attribute in the system.* 2323 * If this is a request for a synthetic attribute in the system.*
2320 * namespace use the generic infrastructure to resolve a handler 2324 * namespace use the generic infrastructure to resolve a handler
2321 * for it via sb->s_xattr. 2325 * for it via sb->s_xattr.
2322 */ 2326 */
2323 if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) 2327 if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN))
2324 return generic_removexattr(dentry, name); 2328 return generic_removexattr(dentry, name);
2325 2329
2326 err = shmem_xattr_validate(name); 2330 err = shmem_xattr_validate(name);
2327 if (err) 2331 if (err)
2328 return err; 2332 return err;
2329 2333
2330 return simple_xattr_remove(&info->xattrs, name); 2334 return simple_xattr_remove(&info->xattrs, name);
2331 } 2335 }
2332 2336
2333 static ssize_t shmem_listxattr(struct dentry *dentry, char *buffer, size_t size) 2337 static ssize_t shmem_listxattr(struct dentry *dentry, char *buffer, size_t size)
2334 { 2338 {
2335 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); 2339 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode);
2336 return simple_xattr_list(&info->xattrs, buffer, size); 2340 return simple_xattr_list(&info->xattrs, buffer, size);
2337 } 2341 }
2338 #endif /* CONFIG_TMPFS_XATTR */ 2342 #endif /* CONFIG_TMPFS_XATTR */
2339 2343
2340 static const struct inode_operations shmem_short_symlink_operations = { 2344 static const struct inode_operations shmem_short_symlink_operations = {
2341 .readlink = generic_readlink, 2345 .readlink = generic_readlink,
2342 .follow_link = shmem_follow_short_symlink, 2346 .follow_link = shmem_follow_short_symlink,
2343 #ifdef CONFIG_TMPFS_XATTR 2347 #ifdef CONFIG_TMPFS_XATTR
2344 .setxattr = shmem_setxattr, 2348 .setxattr = shmem_setxattr,
2345 .getxattr = shmem_getxattr, 2349 .getxattr = shmem_getxattr,
2346 .listxattr = shmem_listxattr, 2350 .listxattr = shmem_listxattr,
2347 .removexattr = shmem_removexattr, 2351 .removexattr = shmem_removexattr,
2348 #endif 2352 #endif
2349 }; 2353 };
2350 2354
2351 static const struct inode_operations shmem_symlink_inode_operations = { 2355 static const struct inode_operations shmem_symlink_inode_operations = {
2352 .readlink = generic_readlink, 2356 .readlink = generic_readlink,
2353 .follow_link = shmem_follow_link, 2357 .follow_link = shmem_follow_link,
2354 .put_link = shmem_put_link, 2358 .put_link = shmem_put_link,
2355 #ifdef CONFIG_TMPFS_XATTR 2359 #ifdef CONFIG_TMPFS_XATTR
2356 .setxattr = shmem_setxattr, 2360 .setxattr = shmem_setxattr,
2357 .getxattr = shmem_getxattr, 2361 .getxattr = shmem_getxattr,
2358 .listxattr = shmem_listxattr, 2362 .listxattr = shmem_listxattr,
2359 .removexattr = shmem_removexattr, 2363 .removexattr = shmem_removexattr,
2360 #endif 2364 #endif
2361 }; 2365 };
2362 2366
2363 static struct dentry *shmem_get_parent(struct dentry *child) 2367 static struct dentry *shmem_get_parent(struct dentry *child)
2364 { 2368 {
2365 return ERR_PTR(-ESTALE); 2369 return ERR_PTR(-ESTALE);
2366 } 2370 }
2367 2371
2368 static int shmem_match(struct inode *ino, void *vfh) 2372 static int shmem_match(struct inode *ino, void *vfh)
2369 { 2373 {
2370 __u32 *fh = vfh; 2374 __u32 *fh = vfh;
2371 __u64 inum = fh[2]; 2375 __u64 inum = fh[2];
2372 inum = (inum << 32) | fh[1]; 2376 inum = (inum << 32) | fh[1];
2373 return ino->i_ino == inum && fh[0] == ino->i_generation; 2377 return ino->i_ino == inum && fh[0] == ino->i_generation;
2374 } 2378 }
2375 2379
2376 static struct dentry *shmem_fh_to_dentry(struct super_block *sb, 2380 static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
2377 struct fid *fid, int fh_len, int fh_type) 2381 struct fid *fid, int fh_len, int fh_type)
2378 { 2382 {
2379 struct inode *inode; 2383 struct inode *inode;
2380 struct dentry *dentry = NULL; 2384 struct dentry *dentry = NULL;
2381 u64 inum; 2385 u64 inum;
2382 2386
2383 if (fh_len < 3) 2387 if (fh_len < 3)
2384 return NULL; 2388 return NULL;
2385 2389
2386 inum = fid->raw[2]; 2390 inum = fid->raw[2];
2387 inum = (inum << 32) | fid->raw[1]; 2391 inum = (inum << 32) | fid->raw[1];
2388 2392
2389 inode = ilookup5(sb, (unsigned long)(inum + fid->raw[0]), 2393 inode = ilookup5(sb, (unsigned long)(inum + fid->raw[0]),
2390 shmem_match, fid->raw); 2394 shmem_match, fid->raw);
2391 if (inode) { 2395 if (inode) {
2392 dentry = d_find_alias(inode); 2396 dentry = d_find_alias(inode);
2393 iput(inode); 2397 iput(inode);
2394 } 2398 }
2395 2399
2396 return dentry; 2400 return dentry;
2397 } 2401 }
2398 2402
2399 static int shmem_encode_fh(struct inode *inode, __u32 *fh, int *len, 2403 static int shmem_encode_fh(struct inode *inode, __u32 *fh, int *len,
2400 struct inode *parent) 2404 struct inode *parent)
2401 { 2405 {
2402 if (*len < 3) { 2406 if (*len < 3) {
2403 *len = 3; 2407 *len = 3;
2404 return FILEID_INVALID; 2408 return FILEID_INVALID;
2405 } 2409 }
2406 2410
2407 if (inode_unhashed(inode)) { 2411 if (inode_unhashed(inode)) {
2408 /* Unfortunately insert_inode_hash is not idempotent, 2412 /* Unfortunately insert_inode_hash is not idempotent,
2409 * so as we hash inodes here rather than at creation 2413 * so as we hash inodes here rather than at creation
2410 * time, we need a lock to ensure we only try 2414 * time, we need a lock to ensure we only try
2411 * to do it once 2415 * to do it once
2412 */ 2416 */
2413 static DEFINE_SPINLOCK(lock); 2417 static DEFINE_SPINLOCK(lock);
2414 spin_lock(&lock); 2418 spin_lock(&lock);
2415 if (inode_unhashed(inode)) 2419 if (inode_unhashed(inode))
2416 __insert_inode_hash(inode, 2420 __insert_inode_hash(inode,
2417 inode->i_ino + inode->i_generation); 2421 inode->i_ino + inode->i_generation);
2418 spin_unlock(&lock); 2422 spin_unlock(&lock);
2419 } 2423 }
2420 2424
2421 fh[0] = inode->i_generation; 2425 fh[0] = inode->i_generation;
2422 fh[1] = inode->i_ino; 2426 fh[1] = inode->i_ino;
2423 fh[2] = ((__u64)inode->i_ino) >> 32; 2427 fh[2] = ((__u64)inode->i_ino) >> 32;
2424 2428
2425 *len = 3; 2429 *len = 3;
2426 return 1; 2430 return 1;
2427 } 2431 }
2428 2432
2429 static const struct export_operations shmem_export_ops = { 2433 static const struct export_operations shmem_export_ops = {
2430 .get_parent = shmem_get_parent, 2434 .get_parent = shmem_get_parent,
2431 .encode_fh = shmem_encode_fh, 2435 .encode_fh = shmem_encode_fh,
2432 .fh_to_dentry = shmem_fh_to_dentry, 2436 .fh_to_dentry = shmem_fh_to_dentry,
2433 }; 2437 };
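
For NFS export, shmem_encode_fh() packs the inode identity into three 32-bit words: fh[0] holds i_generation, fh[1] the low half of i_ino and fh[2] the high half; shmem_match() reassembles the 64-bit inode number in the same order when the handle comes back. A stand-alone round-trip of that packing (plain C, illustrative only):

	#include <assert.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Pack/unpack in the same layout as shmem_encode_fh()/shmem_match(). */
	static void encode(uint32_t fh[3], uint64_t ino, uint32_t generation)
	{
		fh[0] = generation;
		fh[1] = (uint32_t)ino;		/* low 32 bits  */
		fh[2] = (uint32_t)(ino >> 32);	/* high 32 bits */
	}

	static uint64_t decode_ino(const uint32_t fh[3])
	{
		uint64_t inum = fh[2];

		return (inum << 32) | fh[1];
	}

	int main(void)
	{
		uint32_t fh[3];
		uint64_t ino = 0x123456789abcdefULL;

		encode(fh, ino, 42);
		assert(decode_ino(fh) == ino);
		assert(fh[0] == 42);
		printf("round-trip ok: %#llx\n", (unsigned long long)decode_ino(fh));
		return 0;
	}
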
2434 2438
2435 static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, 2439 static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
2436 bool remount) 2440 bool remount)
2437 { 2441 {
2438 char *this_char, *value, *rest; 2442 char *this_char, *value, *rest;
2439 struct mempolicy *mpol = NULL; 2443 struct mempolicy *mpol = NULL;
2440 uid_t uid; 2444 uid_t uid;
2441 gid_t gid; 2445 gid_t gid;
2442 2446
2443 while (options != NULL) { 2447 while (options != NULL) {
2444 this_char = options; 2448 this_char = options;
2445 for (;;) { 2449 for (;;) {
2446 /* 2450 /*
2447 * NUL-terminate this option: unfortunately, 2451 * NUL-terminate this option: unfortunately,
2448 * mount options form a comma-separated list, 2452 * mount options form a comma-separated list,
2449 * but mpol's nodelist may also contain commas. 2453 * but mpol's nodelist may also contain commas.
2450 */ 2454 */
2451 options = strchr(options, ','); 2455 options = strchr(options, ',');
2452 if (options == NULL) 2456 if (options == NULL)
2453 break; 2457 break;
2454 options++; 2458 options++;
2455 if (!isdigit(*options)) { 2459 if (!isdigit(*options)) {
2456 options[-1] = '\0'; 2460 options[-1] = '\0';
2457 break; 2461 break;
2458 } 2462 }
2459 } 2463 }
2460 if (!*this_char) 2464 if (!*this_char)
2461 continue; 2465 continue;
2462 if ((value = strchr(this_char,'=')) != NULL) { 2466 if ((value = strchr(this_char,'=')) != NULL) {
2463 *value++ = 0; 2467 *value++ = 0;
2464 } else { 2468 } else {
2465 printk(KERN_ERR 2469 printk(KERN_ERR
2466 "tmpfs: No value for mount option '%s'\n", 2470 "tmpfs: No value for mount option '%s'\n",
2467 this_char); 2471 this_char);
2468 goto error; 2472 goto error;
2469 } 2473 }
2470 2474
2471 if (!strcmp(this_char,"size")) { 2475 if (!strcmp(this_char,"size")) {
2472 unsigned long long size; 2476 unsigned long long size;
2473 size = memparse(value,&rest); 2477 size = memparse(value,&rest);
2474 if (*rest == '%') { 2478 if (*rest == '%') {
2475 size <<= PAGE_SHIFT; 2479 size <<= PAGE_SHIFT;
2476 size *= totalram_pages; 2480 size *= totalram_pages;
2477 do_div(size, 100); 2481 do_div(size, 100);
2478 rest++; 2482 rest++;
2479 } 2483 }
2480 if (*rest) 2484 if (*rest)
2481 goto bad_val; 2485 goto bad_val;
2482 sbinfo->max_blocks = 2486 sbinfo->max_blocks =
2483 DIV_ROUND_UP(size, PAGE_CACHE_SIZE); 2487 DIV_ROUND_UP(size, PAGE_CACHE_SIZE);
2484 } else if (!strcmp(this_char,"nr_blocks")) { 2488 } else if (!strcmp(this_char,"nr_blocks")) {
2485 sbinfo->max_blocks = memparse(value, &rest); 2489 sbinfo->max_blocks = memparse(value, &rest);
2486 if (*rest) 2490 if (*rest)
2487 goto bad_val; 2491 goto bad_val;
2488 } else if (!strcmp(this_char,"nr_inodes")) { 2492 } else if (!strcmp(this_char,"nr_inodes")) {
2489 sbinfo->max_inodes = memparse(value, &rest); 2493 sbinfo->max_inodes = memparse(value, &rest);
2490 if (*rest) 2494 if (*rest)
2491 goto bad_val; 2495 goto bad_val;
2492 } else if (!strcmp(this_char,"mode")) { 2496 } else if (!strcmp(this_char,"mode")) {
2493 if (remount) 2497 if (remount)
2494 continue; 2498 continue;
2495 sbinfo->mode = simple_strtoul(value, &rest, 8) & 07777; 2499 sbinfo->mode = simple_strtoul(value, &rest, 8) & 07777;
2496 if (*rest) 2500 if (*rest)
2497 goto bad_val; 2501 goto bad_val;
2498 } else if (!strcmp(this_char,"uid")) { 2502 } else if (!strcmp(this_char,"uid")) {
2499 if (remount) 2503 if (remount)
2500 continue; 2504 continue;
2501 uid = simple_strtoul(value, &rest, 0); 2505 uid = simple_strtoul(value, &rest, 0);
2502 if (*rest) 2506 if (*rest)
2503 goto bad_val; 2507 goto bad_val;
2504 sbinfo->uid = make_kuid(current_user_ns(), uid); 2508 sbinfo->uid = make_kuid(current_user_ns(), uid);
2505 if (!uid_valid(sbinfo->uid)) 2509 if (!uid_valid(sbinfo->uid))
2506 goto bad_val; 2510 goto bad_val;
2507 } else if (!strcmp(this_char,"gid")) { 2511 } else if (!strcmp(this_char,"gid")) {
2508 if (remount) 2512 if (remount)
2509 continue; 2513 continue;
2510 gid = simple_strtoul(value, &rest, 0); 2514 gid = simple_strtoul(value, &rest, 0);
2511 if (*rest) 2515 if (*rest)
2512 goto bad_val; 2516 goto bad_val;
2513 sbinfo->gid = make_kgid(current_user_ns(), gid); 2517 sbinfo->gid = make_kgid(current_user_ns(), gid);
2514 if (!gid_valid(sbinfo->gid)) 2518 if (!gid_valid(sbinfo->gid))
2515 goto bad_val; 2519 goto bad_val;
2516 } else if (!strcmp(this_char,"mpol")) { 2520 } else if (!strcmp(this_char,"mpol")) {
2517 mpol_put(mpol); 2521 mpol_put(mpol);
2518 mpol = NULL; 2522 mpol = NULL;
2519 if (mpol_parse_str(value, &mpol)) 2523 if (mpol_parse_str(value, &mpol))
2520 goto bad_val; 2524 goto bad_val;
2521 } else { 2525 } else {
2522 printk(KERN_ERR "tmpfs: Bad mount option %s\n", 2526 printk(KERN_ERR "tmpfs: Bad mount option %s\n",
2523 this_char); 2527 this_char);
2524 goto error; 2528 goto error;
2525 } 2529 }
2526 } 2530 }
2527 sbinfo->mpol = mpol; 2531 sbinfo->mpol = mpol;
2528 return 0; 2532 return 0;
2529 2533
2530 bad_val: 2534 bad_val:
2531 printk(KERN_ERR "tmpfs: Bad value '%s' for mount option '%s'\n", 2535 printk(KERN_ERR "tmpfs: Bad value '%s' for mount option '%s'\n",
2532 value, this_char); 2536 value, this_char);
2533 error: 2537 error:
2534 mpol_put(mpol); 2538 mpol_put(mpol);
2535 return 1; 2539 return 1;
2536 2540
2537 } 2541 }
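
The "size=" option accepts either an absolute byte count (via memparse(), so k/m/g suffixes are understood) or a percentage of RAM; in the percentage case the value is scaled by totalram_pages before being rounded up to a block count. A worked example of that branch, with a hypothetical machine size plugged in (PAGE_CACHE_SIZE equals PAGE_SIZE on this kernel, so the sketch uses PAGE_SIZE throughout):

	#include <stdio.h>

	#define PAGE_SHIFT	12
	#define PAGE_SIZE	(1ULL << PAGE_SHIFT)
	#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

	int main(void)
	{
		/* Assume 2 GiB of RAM, i.e. 524288 4 KiB pages (hypothetical). */
		unsigned long long totalram_pages = 524288;
		unsigned long long size = 50;		/* "size=50%" parses to 50 */

		/* Same arithmetic as the '%' branch above. */
		size <<= PAGE_SHIFT;
		size *= totalram_pages;
		size /= 100;				/* do_div(size, 100) */

		printf("max_blocks = %llu\n", DIV_ROUND_UP(size, PAGE_SIZE));
		/* Prints 262144, i.e. half of the 524288 pages. */
		return 0;
	}
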
2538 2542
2539 static int shmem_remount_fs(struct super_block *sb, int *flags, char *data) 2543 static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
2540 { 2544 {
2541 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 2545 struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
2542 struct shmem_sb_info config = *sbinfo; 2546 struct shmem_sb_info config = *sbinfo;
2543 unsigned long inodes; 2547 unsigned long inodes;
2544 int error = -EINVAL; 2548 int error = -EINVAL;
2545 2549
2546 config.mpol = NULL; 2550 config.mpol = NULL;
2547 if (shmem_parse_options(data, &config, true)) 2551 if (shmem_parse_options(data, &config, true))
2548 return error; 2552 return error;
2549 2553
2550 spin_lock(&sbinfo->stat_lock); 2554 spin_lock(&sbinfo->stat_lock);
2551 inodes = sbinfo->max_inodes - sbinfo->free_inodes; 2555 inodes = sbinfo->max_inodes - sbinfo->free_inodes;
2552 if (percpu_counter_compare(&sbinfo->used_blocks, config.max_blocks) > 0) 2556 if (percpu_counter_compare(&sbinfo->used_blocks, config.max_blocks) > 0)
2553 goto out; 2557 goto out;
2554 if (config.max_inodes < inodes) 2558 if (config.max_inodes < inodes)
2555 goto out; 2559 goto out;
2556 /* 2560 /*
2557 * Those tests disallow limited->unlimited while any are in use; 2561 * Those tests disallow limited->unlimited while any are in use;
2558 * but we must separately disallow unlimited->limited, because 2562 * but we must separately disallow unlimited->limited, because
2559 * in that case we have no record of how much is already in use. 2563 * in that case we have no record of how much is already in use.
2560 */ 2564 */
2561 if (config.max_blocks && !sbinfo->max_blocks) 2565 if (config.max_blocks && !sbinfo->max_blocks)
2562 goto out; 2566 goto out;
2563 if (config.max_inodes && !sbinfo->max_inodes) 2567 if (config.max_inodes && !sbinfo->max_inodes)
2564 goto out; 2568 goto out;
2565 2569
2566 error = 0; 2570 error = 0;
2567 sbinfo->max_blocks = config.max_blocks; 2571 sbinfo->max_blocks = config.max_blocks;
2568 sbinfo->max_inodes = config.max_inodes; 2572 sbinfo->max_inodes = config.max_inodes;
2569 sbinfo->free_inodes = config.max_inodes - inodes; 2573 sbinfo->free_inodes = config.max_inodes - inodes;
2570 2574
2571 /* 2575 /*
2572 * Preserve previous mempolicy unless mpol remount option was specified. 2576 * Preserve previous mempolicy unless mpol remount option was specified.
2573 */ 2577 */
2574 if (config.mpol) { 2578 if (config.mpol) {
2575 mpol_put(sbinfo->mpol); 2579 mpol_put(sbinfo->mpol);
2576 sbinfo->mpol = config.mpol; /* transfers initial ref */ 2580 sbinfo->mpol = config.mpol; /* transfers initial ref */
2577 } 2581 }
2578 out: 2582 out:
2579 spin_unlock(&sbinfo->stat_lock); 2583 spin_unlock(&sbinfo->stat_lock);
2580 return error; 2584 return error;
2581 } 2585 }
2582 2586
2583 static int shmem_show_options(struct seq_file *seq, struct dentry *root) 2587 static int shmem_show_options(struct seq_file *seq, struct dentry *root)
2584 { 2588 {
2585 struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb); 2589 struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb);
2586 2590
2587 if (sbinfo->max_blocks != shmem_default_max_blocks()) 2591 if (sbinfo->max_blocks != shmem_default_max_blocks())
2588 seq_printf(seq, ",size=%luk", 2592 seq_printf(seq, ",size=%luk",
2589 sbinfo->max_blocks << (PAGE_CACHE_SHIFT - 10)); 2593 sbinfo->max_blocks << (PAGE_CACHE_SHIFT - 10));
2590 if (sbinfo->max_inodes != shmem_default_max_inodes()) 2594 if (sbinfo->max_inodes != shmem_default_max_inodes())
2591 seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes); 2595 seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes);
2592 if (sbinfo->mode != (S_IRWXUGO | S_ISVTX)) 2596 if (sbinfo->mode != (S_IRWXUGO | S_ISVTX))
2593 seq_printf(seq, ",mode=%03ho", sbinfo->mode); 2597 seq_printf(seq, ",mode=%03ho", sbinfo->mode);
2594 if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID)) 2598 if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID))
2595 seq_printf(seq, ",uid=%u", 2599 seq_printf(seq, ",uid=%u",
2596 from_kuid_munged(&init_user_ns, sbinfo->uid)); 2600 from_kuid_munged(&init_user_ns, sbinfo->uid));
2597 if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID)) 2601 if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
2598 seq_printf(seq, ",gid=%u", 2602 seq_printf(seq, ",gid=%u",
2599 from_kgid_munged(&init_user_ns, sbinfo->gid)); 2603 from_kgid_munged(&init_user_ns, sbinfo->gid));
2600 shmem_show_mpol(seq, sbinfo->mpol); 2604 shmem_show_mpol(seq, sbinfo->mpol);
2601 return 0; 2605 return 0;
2602 } 2606 }
2603 #endif /* CONFIG_TMPFS */ 2607 #endif /* CONFIG_TMPFS */
2604 2608
2605 static void shmem_put_super(struct super_block *sb) 2609 static void shmem_put_super(struct super_block *sb)
2606 { 2610 {
2607 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 2611 struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
2608 2612
2609 percpu_counter_destroy(&sbinfo->used_blocks); 2613 percpu_counter_destroy(&sbinfo->used_blocks);
2610 mpol_put(sbinfo->mpol); 2614 mpol_put(sbinfo->mpol);
2611 kfree(sbinfo); 2615 kfree(sbinfo);
2612 sb->s_fs_info = NULL; 2616 sb->s_fs_info = NULL;
2613 } 2617 }
2614 2618
2615 int shmem_fill_super(struct super_block *sb, void *data, int silent) 2619 int shmem_fill_super(struct super_block *sb, void *data, int silent)
2616 { 2620 {
2617 struct inode *inode; 2621 struct inode *inode;
2618 struct shmem_sb_info *sbinfo; 2622 struct shmem_sb_info *sbinfo;
2619 int err = -ENOMEM; 2623 int err = -ENOMEM;
2620 2624
2621 /* Round up to L1_CACHE_BYTES to resist false sharing */ 2625 /* Round up to L1_CACHE_BYTES to resist false sharing */
2622 sbinfo = kzalloc(max((int)sizeof(struct shmem_sb_info), 2626 sbinfo = kzalloc(max((int)sizeof(struct shmem_sb_info),
2623 L1_CACHE_BYTES), GFP_KERNEL); 2627 L1_CACHE_BYTES), GFP_KERNEL);
2624 if (!sbinfo) 2628 if (!sbinfo)
2625 return -ENOMEM; 2629 return -ENOMEM;
2626 2630
2627 sbinfo->mode = S_IRWXUGO | S_ISVTX; 2631 sbinfo->mode = S_IRWXUGO | S_ISVTX;
2628 sbinfo->uid = current_fsuid(); 2632 sbinfo->uid = current_fsuid();
2629 sbinfo->gid = current_fsgid(); 2633 sbinfo->gid = current_fsgid();
2630 sb->s_fs_info = sbinfo; 2634 sb->s_fs_info = sbinfo;
2631 2635
2632 #ifdef CONFIG_TMPFS 2636 #ifdef CONFIG_TMPFS
2633 /* 2637 /*
2634 * Per default we only allow half of the physical ram per 2638 * Per default we only allow half of the physical ram per
2635 * tmpfs instance, limiting inodes to one per page of lowmem; 2639 * tmpfs instance, limiting inodes to one per page of lowmem;
2636 * but the internal instance is left unlimited. 2640 * but the internal instance is left unlimited.
2637 */ 2641 */
2638 if (!(sb->s_flags & MS_KERNMOUNT)) { 2642 if (!(sb->s_flags & MS_KERNMOUNT)) {
2639 sbinfo->max_blocks = shmem_default_max_blocks(); 2643 sbinfo->max_blocks = shmem_default_max_blocks();
2640 sbinfo->max_inodes = shmem_default_max_inodes(); 2644 sbinfo->max_inodes = shmem_default_max_inodes();
2641 if (shmem_parse_options(data, sbinfo, false)) { 2645 if (shmem_parse_options(data, sbinfo, false)) {
2642 err = -EINVAL; 2646 err = -EINVAL;
2643 goto failed; 2647 goto failed;
2644 } 2648 }
2645 } else { 2649 } else {
2646 sb->s_flags |= MS_NOUSER; 2650 sb->s_flags |= MS_NOUSER;
2647 } 2651 }
2648 sb->s_export_op = &shmem_export_ops; 2652 sb->s_export_op = &shmem_export_ops;
2649 sb->s_flags |= MS_NOSEC; 2653 sb->s_flags |= MS_NOSEC;
2650 #else 2654 #else
2651 sb->s_flags |= MS_NOUSER; 2655 sb->s_flags |= MS_NOUSER;
2652 #endif 2656 #endif
2653 2657
2654 spin_lock_init(&sbinfo->stat_lock); 2658 spin_lock_init(&sbinfo->stat_lock);
2655 if (percpu_counter_init(&sbinfo->used_blocks, 0)) 2659 if (percpu_counter_init(&sbinfo->used_blocks, 0))
2656 goto failed; 2660 goto failed;
2657 sbinfo->free_inodes = sbinfo->max_inodes; 2661 sbinfo->free_inodes = sbinfo->max_inodes;
2658 2662
2659 sb->s_maxbytes = MAX_LFS_FILESIZE; 2663 sb->s_maxbytes = MAX_LFS_FILESIZE;
2660 sb->s_blocksize = PAGE_CACHE_SIZE; 2664 sb->s_blocksize = PAGE_CACHE_SIZE;
2661 sb->s_blocksize_bits = PAGE_CACHE_SHIFT; 2665 sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
2662 sb->s_magic = TMPFS_MAGIC; 2666 sb->s_magic = TMPFS_MAGIC;
2663 sb->s_op = &shmem_ops; 2667 sb->s_op = &shmem_ops;
2664 sb->s_time_gran = 1; 2668 sb->s_time_gran = 1;
2665 #ifdef CONFIG_TMPFS_XATTR 2669 #ifdef CONFIG_TMPFS_XATTR
2666 sb->s_xattr = shmem_xattr_handlers; 2670 sb->s_xattr = shmem_xattr_handlers;
2667 #endif 2671 #endif
2668 #ifdef CONFIG_TMPFS_POSIX_ACL 2672 #ifdef CONFIG_TMPFS_POSIX_ACL
2669 sb->s_flags |= MS_POSIXACL; 2673 sb->s_flags |= MS_POSIXACL;
2670 #endif 2674 #endif
2671 2675
2672 inode = shmem_get_inode(sb, NULL, S_IFDIR | sbinfo->mode, 0, VM_NORESERVE); 2676 inode = shmem_get_inode(sb, NULL, S_IFDIR | sbinfo->mode, 0, VM_NORESERVE);
2673 if (!inode) 2677 if (!inode)
2674 goto failed; 2678 goto failed;
2675 inode->i_uid = sbinfo->uid; 2679 inode->i_uid = sbinfo->uid;
2676 inode->i_gid = sbinfo->gid; 2680 inode->i_gid = sbinfo->gid;
2677 sb->s_root = d_make_root(inode); 2681 sb->s_root = d_make_root(inode);
2678 if (!sb->s_root) 2682 if (!sb->s_root)
2679 goto failed; 2683 goto failed;
2680 return 0; 2684 return 0;
2681 2685
2682 failed: 2686 failed:
2683 shmem_put_super(sb); 2687 shmem_put_super(sb);
2684 return err; 2688 return err;
2685 } 2689 }
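
As the comment above says, a user-visible tmpfs mount defaults to half of physical RAM and caps inodes at roughly one per page of lowmem, while kernel-internal mounts (MS_KERNMOUNT) stay unlimited and get MS_NOUSER. For a feel of the numbers, a back-of-the-envelope calculation under an assumed 8 GiB / 4 KiB-page configuration:

	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical machine: 8 GiB of RAM, 4 KiB pages. */
		unsigned long long ram_bytes = 8ULL << 30;
		unsigned long long page_size = 4096;
		unsigned long long totalram_pages = ram_bytes / page_size;

		/* Default limit is half of RAM, expressed in pages/blocks. */
		unsigned long long max_blocks = totalram_pages / 2;

		printf("default size: %llu blocks (%llu MiB)\n",
		       max_blocks, max_blocks * page_size >> 20);
		/* Prints 1048576 blocks (4096 MiB). */
		return 0;
	}
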
2686 2690
2687 static struct kmem_cache *shmem_inode_cachep; 2691 static struct kmem_cache *shmem_inode_cachep;
2688 2692
2689 static struct inode *shmem_alloc_inode(struct super_block *sb) 2693 static struct inode *shmem_alloc_inode(struct super_block *sb)
2690 { 2694 {
2691 struct shmem_inode_info *info; 2695 struct shmem_inode_info *info;
2692 info = kmem_cache_alloc(shmem_inode_cachep, GFP_KERNEL); 2696 info = kmem_cache_alloc(shmem_inode_cachep, GFP_KERNEL);
2693 if (!info) 2697 if (!info)
2694 return NULL; 2698 return NULL;
2695 return &info->vfs_inode; 2699 return &info->vfs_inode;
2696 } 2700 }
2697 2701
2698 static void shmem_destroy_callback(struct rcu_head *head) 2702 static void shmem_destroy_callback(struct rcu_head *head)
2699 { 2703 {
2700 struct inode *inode = container_of(head, struct inode, i_rcu); 2704 struct inode *inode = container_of(head, struct inode, i_rcu);
2701 kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode)); 2705 kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
2702 } 2706 }
2703 2707
2704 static void shmem_destroy_inode(struct inode *inode) 2708 static void shmem_destroy_inode(struct inode *inode)
2705 { 2709 {
2706 if (S_ISREG(inode->i_mode)) 2710 if (S_ISREG(inode->i_mode))
2707 mpol_free_shared_policy(&SHMEM_I(inode)->policy); 2711 mpol_free_shared_policy(&SHMEM_I(inode)->policy);
2708 call_rcu(&inode->i_rcu, shmem_destroy_callback); 2712 call_rcu(&inode->i_rcu, shmem_destroy_callback);
2709 } 2713 }
2710 2714
2711 static void shmem_init_inode(void *foo) 2715 static void shmem_init_inode(void *foo)
2712 { 2716 {
2713 struct shmem_inode_info *info = foo; 2717 struct shmem_inode_info *info = foo;
2714 inode_init_once(&info->vfs_inode); 2718 inode_init_once(&info->vfs_inode);
2715 } 2719 }
2716 2720
2717 static int shmem_init_inodecache(void) 2721 static int shmem_init_inodecache(void)
2718 { 2722 {
2719 shmem_inode_cachep = kmem_cache_create("shmem_inode_cache", 2723 shmem_inode_cachep = kmem_cache_create("shmem_inode_cache",
2720 sizeof(struct shmem_inode_info), 2724 sizeof(struct shmem_inode_info),
2721 0, SLAB_PANIC, shmem_init_inode); 2725 0, SLAB_PANIC, shmem_init_inode);
2722 return 0; 2726 return 0;
2723 } 2727 }
2724 2728
2725 static void shmem_destroy_inodecache(void) 2729 static void shmem_destroy_inodecache(void)
2726 { 2730 {
2727 kmem_cache_destroy(shmem_inode_cachep); 2731 kmem_cache_destroy(shmem_inode_cachep);
2728 } 2732 }
2729 2733
2730 static const struct address_space_operations shmem_aops = { 2734 static const struct address_space_operations shmem_aops = {
2731 .writepage = shmem_writepage, 2735 .writepage = shmem_writepage,
2732 .set_page_dirty = __set_page_dirty_no_writeback, 2736 .set_page_dirty = __set_page_dirty_no_writeback,
2733 #ifdef CONFIG_TMPFS 2737 #ifdef CONFIG_TMPFS
2734 .write_begin = shmem_write_begin, 2738 .write_begin = shmem_write_begin,
2735 .write_end = shmem_write_end, 2739 .write_end = shmem_write_end,
2736 #endif 2740 #endif
2737 .migratepage = migrate_page, 2741 .migratepage = migrate_page,
2738 .error_remove_page = generic_error_remove_page, 2742 .error_remove_page = generic_error_remove_page,
2739 }; 2743 };
2740 2744
2741 static const struct file_operations shmem_file_operations = { 2745 static const struct file_operations shmem_file_operations = {
2742 .mmap = shmem_mmap, 2746 .mmap = shmem_mmap,
2743 #ifdef CONFIG_TMPFS 2747 #ifdef CONFIG_TMPFS
2744 .llseek = shmem_file_llseek, 2748 .llseek = shmem_file_llseek,
2745 .read = do_sync_read, 2749 .read = do_sync_read,
2746 .write = do_sync_write, 2750 .write = do_sync_write,
2747 .aio_read = shmem_file_aio_read, 2751 .aio_read = shmem_file_aio_read,
2748 .aio_write = generic_file_aio_write, 2752 .aio_write = generic_file_aio_write,
2749 .fsync = noop_fsync, 2753 .fsync = noop_fsync,
2750 .splice_read = shmem_file_splice_read, 2754 .splice_read = shmem_file_splice_read,
2751 .splice_write = generic_file_splice_write, 2755 .splice_write = generic_file_splice_write,
2752 .fallocate = shmem_fallocate, 2756 .fallocate = shmem_fallocate,
2753 #endif 2757 #endif
2754 }; 2758 };
2755 2759
2756 static const struct inode_operations shmem_inode_operations = { 2760 static const struct inode_operations shmem_inode_operations = {
2757 .setattr = shmem_setattr, 2761 .setattr = shmem_setattr,
2758 #ifdef CONFIG_TMPFS_XATTR 2762 #ifdef CONFIG_TMPFS_XATTR
2759 .setxattr = shmem_setxattr, 2763 .setxattr = shmem_setxattr,
2760 .getxattr = shmem_getxattr, 2764 .getxattr = shmem_getxattr,
2761 .listxattr = shmem_listxattr, 2765 .listxattr = shmem_listxattr,
2762 .removexattr = shmem_removexattr, 2766 .removexattr = shmem_removexattr,
2763 #endif 2767 #endif
2764 }; 2768 };
2765 2769
2766 static const struct inode_operations shmem_dir_inode_operations = { 2770 static const struct inode_operations shmem_dir_inode_operations = {
2767 #ifdef CONFIG_TMPFS 2771 #ifdef CONFIG_TMPFS
2768 .create = shmem_create, 2772 .create = shmem_create,
2769 .lookup = simple_lookup, 2773 .lookup = simple_lookup,
2770 .link = shmem_link, 2774 .link = shmem_link,
2771 .unlink = shmem_unlink, 2775 .unlink = shmem_unlink,
2772 .symlink = shmem_symlink, 2776 .symlink = shmem_symlink,
2773 .mkdir = shmem_mkdir, 2777 .mkdir = shmem_mkdir,
2774 .rmdir = shmem_rmdir, 2778 .rmdir = shmem_rmdir,
2775 .mknod = shmem_mknod, 2779 .mknod = shmem_mknod,
2776 .rename = shmem_rename, 2780 .rename = shmem_rename,
2777 .tmpfile = shmem_tmpfile, 2781 .tmpfile = shmem_tmpfile,
2778 #endif 2782 #endif
2779 #ifdef CONFIG_TMPFS_XATTR 2783 #ifdef CONFIG_TMPFS_XATTR
2780 .setxattr = shmem_setxattr, 2784 .setxattr = shmem_setxattr,
2781 .getxattr = shmem_getxattr, 2785 .getxattr = shmem_getxattr,
2782 .listxattr = shmem_listxattr, 2786 .listxattr = shmem_listxattr,
2783 .removexattr = shmem_removexattr, 2787 .removexattr = shmem_removexattr,
2784 #endif 2788 #endif
2785 #ifdef CONFIG_TMPFS_POSIX_ACL 2789 #ifdef CONFIG_TMPFS_POSIX_ACL
2786 .setattr = shmem_setattr, 2790 .setattr = shmem_setattr,
2787 #endif 2791 #endif
2788 }; 2792 };
2789 2793
2790 static const struct inode_operations shmem_special_inode_operations = { 2794 static const struct inode_operations shmem_special_inode_operations = {
2791 #ifdef CONFIG_TMPFS_XATTR 2795 #ifdef CONFIG_TMPFS_XATTR
2792 .setxattr = shmem_setxattr, 2796 .setxattr = shmem_setxattr,
2793 .getxattr = shmem_getxattr, 2797 .getxattr = shmem_getxattr,
2794 .listxattr = shmem_listxattr, 2798 .listxattr = shmem_listxattr,
2795 .removexattr = shmem_removexattr, 2799 .removexattr = shmem_removexattr,
2796 #endif 2800 #endif
2797 #ifdef CONFIG_TMPFS_POSIX_ACL 2801 #ifdef CONFIG_TMPFS_POSIX_ACL
2798 .setattr = shmem_setattr, 2802 .setattr = shmem_setattr,
2799 #endif 2803 #endif
2800 }; 2804 };
2801 2805
2802 static const struct super_operations shmem_ops = { 2806 static const struct super_operations shmem_ops = {
2803 .alloc_inode = shmem_alloc_inode, 2807 .alloc_inode = shmem_alloc_inode,
2804 .destroy_inode = shmem_destroy_inode, 2808 .destroy_inode = shmem_destroy_inode,
2805 #ifdef CONFIG_TMPFS 2809 #ifdef CONFIG_TMPFS
2806 .statfs = shmem_statfs, 2810 .statfs = shmem_statfs,
2807 .remount_fs = shmem_remount_fs, 2811 .remount_fs = shmem_remount_fs,
2808 .show_options = shmem_show_options, 2812 .show_options = shmem_show_options,
2809 #endif 2813 #endif
2810 .evict_inode = shmem_evict_inode, 2814 .evict_inode = shmem_evict_inode,
2811 .drop_inode = generic_delete_inode, 2815 .drop_inode = generic_delete_inode,
2812 .put_super = shmem_put_super, 2816 .put_super = shmem_put_super,
2813 }; 2817 };
2814 2818
2815 static const struct vm_operations_struct shmem_vm_ops = { 2819 static const struct vm_operations_struct shmem_vm_ops = {
2816 .fault = shmem_fault, 2820 .fault = shmem_fault,
2817 #ifdef CONFIG_NUMA 2821 #ifdef CONFIG_NUMA
2818 .set_policy = shmem_set_policy, 2822 .set_policy = shmem_set_policy,
2819 .get_policy = shmem_get_policy, 2823 .get_policy = shmem_get_policy,
2820 #endif 2824 #endif
2821 .remap_pages = generic_file_remap_pages, 2825 .remap_pages = generic_file_remap_pages,
2822 }; 2826 };
2823 2827
2824 static struct dentry *shmem_mount(struct file_system_type *fs_type, 2828 static struct dentry *shmem_mount(struct file_system_type *fs_type,
2825 int flags, const char *dev_name, void *data) 2829 int flags, const char *dev_name, void *data)
2826 { 2830 {
2827 return mount_nodev(fs_type, flags, data, shmem_fill_super); 2831 return mount_nodev(fs_type, flags, data, shmem_fill_super);
2828 } 2832 }
2829 2833
2830 static struct file_system_type shmem_fs_type = { 2834 static struct file_system_type shmem_fs_type = {
2831 .owner = THIS_MODULE, 2835 .owner = THIS_MODULE,
2832 .name = "tmpfs", 2836 .name = "tmpfs",
2833 .mount = shmem_mount, 2837 .mount = shmem_mount,
2834 .kill_sb = kill_litter_super, 2838 .kill_sb = kill_litter_super,
2835 .fs_flags = FS_USERNS_MOUNT, 2839 .fs_flags = FS_USERNS_MOUNT,
2836 }; 2840 };
2837 2841
2838 int __init shmem_init(void) 2842 int __init shmem_init(void)
2839 { 2843 {
2840 int error; 2844 int error;
2841 2845
2842 /* If rootfs called this, don't re-init */ 2846 /* If rootfs called this, don't re-init */
2843 if (shmem_inode_cachep) 2847 if (shmem_inode_cachep)
2844 return 0; 2848 return 0;
2845 2849
2846 error = bdi_init(&shmem_backing_dev_info); 2850 error = bdi_init(&shmem_backing_dev_info);
2847 if (error) 2851 if (error)
2848 goto out4; 2852 goto out4;
2849 2853
2850 error = shmem_init_inodecache(); 2854 error = shmem_init_inodecache();
2851 if (error) 2855 if (error)
2852 goto out3; 2856 goto out3;
2853 2857
2854 error = register_filesystem(&shmem_fs_type); 2858 error = register_filesystem(&shmem_fs_type);
2855 if (error) { 2859 if (error) {
2856 printk(KERN_ERR "Could not register tmpfs\n"); 2860 printk(KERN_ERR "Could not register tmpfs\n");
2857 goto out2; 2861 goto out2;
2858 } 2862 }
2859 2863
2860 shm_mnt = kern_mount(&shmem_fs_type); 2864 shm_mnt = kern_mount(&shmem_fs_type);
2861 if (IS_ERR(shm_mnt)) { 2865 if (IS_ERR(shm_mnt)) {
2862 error = PTR_ERR(shm_mnt); 2866 error = PTR_ERR(shm_mnt);
2863 printk(KERN_ERR "Could not kern_mount tmpfs\n"); 2867 printk(KERN_ERR "Could not kern_mount tmpfs\n");
2864 goto out1; 2868 goto out1;
2865 } 2869 }
2866 return 0; 2870 return 0;
2867 2871
2868 out1: 2872 out1:
2869 unregister_filesystem(&shmem_fs_type); 2873 unregister_filesystem(&shmem_fs_type);
2870 out2: 2874 out2:
2871 shmem_destroy_inodecache(); 2875 shmem_destroy_inodecache();
2872 out3: 2876 out3:
2873 bdi_destroy(&shmem_backing_dev_info); 2877 bdi_destroy(&shmem_backing_dev_info);
2874 out4: 2878 out4:
2875 shm_mnt = ERR_PTR(error); 2879 shm_mnt = ERR_PTR(error);
2876 return error; 2880 return error;
2877 } 2881 }
2878 2882
2879 #else /* !CONFIG_SHMEM */ 2883 #else /* !CONFIG_SHMEM */
2880 2884
2881 /* 2885 /*
2882 * tiny-shmem: simple shmemfs and tmpfs using ramfs code 2886 * tiny-shmem: simple shmemfs and tmpfs using ramfs code
2883 * 2887 *
2884 * This is intended for small system where the benefits of the full 2888 * This is intended for small system where the benefits of the full
2885 * shmem code (swap-backed and resource-limited) are outweighed by 2889 * shmem code (swap-backed and resource-limited) are outweighed by
2886 * their complexity. On systems without swap this code should be 2890 * their complexity. On systems without swap this code should be
2887 * effectively equivalent, but much lighter weight. 2891 * effectively equivalent, but much lighter weight.
2888 */ 2892 */
2889 2893
2890 static struct file_system_type shmem_fs_type = { 2894 static struct file_system_type shmem_fs_type = {
2891 .name = "tmpfs", 2895 .name = "tmpfs",
2892 .mount = ramfs_mount, 2896 .mount = ramfs_mount,
2893 .kill_sb = kill_litter_super, 2897 .kill_sb = kill_litter_super,
2894 .fs_flags = FS_USERNS_MOUNT, 2898 .fs_flags = FS_USERNS_MOUNT,
2895 }; 2899 };
2896 2900
2897 int __init shmem_init(void) 2901 int __init shmem_init(void)
2898 { 2902 {
2899 BUG_ON(register_filesystem(&shmem_fs_type) != 0); 2903 BUG_ON(register_filesystem(&shmem_fs_type) != 0);
2900 2904
2901 shm_mnt = kern_mount(&shmem_fs_type); 2905 shm_mnt = kern_mount(&shmem_fs_type);
2902 BUG_ON(IS_ERR(shm_mnt)); 2906 BUG_ON(IS_ERR(shm_mnt));
2903 2907
2904 return 0; 2908 return 0;
2905 } 2909 }
2906 2910
2907 int shmem_unuse(swp_entry_t swap, struct page *page) 2911 int shmem_unuse(swp_entry_t swap, struct page *page)
2908 { 2912 {
2909 return 0; 2913 return 0;
2910 } 2914 }
2911 2915
2912 int shmem_lock(struct file *file, int lock, struct user_struct *user) 2916 int shmem_lock(struct file *file, int lock, struct user_struct *user)
2913 { 2917 {
2914 return 0; 2918 return 0;
2915 } 2919 }
2916 2920
2917 void shmem_unlock_mapping(struct address_space *mapping) 2921 void shmem_unlock_mapping(struct address_space *mapping)
2918 { 2922 {
2919 } 2923 }
2920 2924
2921 void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend) 2925 void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
2922 { 2926 {
2923 truncate_inode_pages_range(inode->i_mapping, lstart, lend); 2927 truncate_inode_pages_range(inode->i_mapping, lstart, lend);
2924 } 2928 }
2925 EXPORT_SYMBOL_GPL(shmem_truncate_range); 2929 EXPORT_SYMBOL_GPL(shmem_truncate_range);
2926 2930
2927 #define shmem_vm_ops generic_file_vm_ops 2931 #define shmem_vm_ops generic_file_vm_ops
2928 #define shmem_file_operations ramfs_file_operations 2932 #define shmem_file_operations ramfs_file_operations
2929 #define shmem_get_inode(sb, dir, mode, dev, flags) ramfs_get_inode(sb, dir, mode, dev) 2933 #define shmem_get_inode(sb, dir, mode, dev, flags) ramfs_get_inode(sb, dir, mode, dev)
2930 #define shmem_acct_size(flags, size) 0 2934 #define shmem_acct_size(flags, size) 0
2931 #define shmem_unacct_size(flags, size) do {} while (0) 2935 #define shmem_unacct_size(flags, size) do {} while (0)
2932 2936
2933 #endif /* CONFIG_SHMEM */ 2937 #endif /* CONFIG_SHMEM */
2934 2938
2935 /* common code */ 2939 /* common code */
2936 2940
2937 static struct dentry_operations anon_ops = { 2941 static struct dentry_operations anon_ops = {
2938 .d_dname = simple_dname 2942 .d_dname = simple_dname
2939 }; 2943 };
2940 2944
2941 /** 2945 /**
2942 * shmem_file_setup - get an unlinked file living in tmpfs 2946 * shmem_file_setup - get an unlinked file living in tmpfs
2943 * @name: name for dentry (to be seen in /proc/<pid>/maps 2947 * @name: name for dentry (to be seen in /proc/<pid>/maps
2944 * @size: size to be set for the file 2948 * @size: size to be set for the file
2945 * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size 2949 * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size
2946 */ 2950 */
2947 struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags) 2951 struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags)
2948 { 2952 {
2949 struct file *res; 2953 struct file *res;
2950 struct inode *inode; 2954 struct inode *inode;
2951 struct path path; 2955 struct path path;
2952 struct super_block *sb; 2956 struct super_block *sb;
2953 struct qstr this; 2957 struct qstr this;
2954 2958
2955 if (IS_ERR(shm_mnt)) 2959 if (IS_ERR(shm_mnt))
2956 return ERR_CAST(shm_mnt); 2960 return ERR_CAST(shm_mnt);
2957 2961
2958 if (size < 0 || size > MAX_LFS_FILESIZE) 2962 if (size < 0 || size > MAX_LFS_FILESIZE)
2959 return ERR_PTR(-EINVAL); 2963 return ERR_PTR(-EINVAL);
2960 2964
2961 if (shmem_acct_size(flags, size)) 2965 if (shmem_acct_size(flags, size))
2962 return ERR_PTR(-ENOMEM); 2966 return ERR_PTR(-ENOMEM);
2963 2967
2964 res = ERR_PTR(-ENOMEM); 2968 res = ERR_PTR(-ENOMEM);
2965 this.name = name; 2969 this.name = name;
2966 this.len = strlen(name); 2970 this.len = strlen(name);
2967 this.hash = 0; /* will go */ 2971 this.hash = 0; /* will go */
2968 sb = shm_mnt->mnt_sb; 2972 sb = shm_mnt->mnt_sb;
2969 path.dentry = d_alloc_pseudo(sb, &this); 2973 path.dentry = d_alloc_pseudo(sb, &this);
2970 if (!path.dentry) 2974 if (!path.dentry)
2971 goto put_memory; 2975 goto put_memory;
2972 d_set_d_op(path.dentry, &anon_ops); 2976 d_set_d_op(path.dentry, &anon_ops);
2973 path.mnt = mntget(shm_mnt); 2977 path.mnt = mntget(shm_mnt);
2974 2978
2975 res = ERR_PTR(-ENOSPC); 2979 res = ERR_PTR(-ENOSPC);
2976 inode = shmem_get_inode(sb, NULL, S_IFREG | S_IRWXUGO, 0, flags); 2980 inode = shmem_get_inode(sb, NULL, S_IFREG | S_IRWXUGO, 0, flags);
2977 if (!inode) 2981 if (!inode)
2978 goto put_dentry; 2982 goto put_dentry;
2979 2983
2980 d_instantiate(path.dentry, inode); 2984 d_instantiate(path.dentry, inode);
2981 inode->i_size = size; 2985 inode->i_size = size;
2982 clear_nlink(inode); /* It is unlinked */ 2986 clear_nlink(inode); /* It is unlinked */
2983 res = ERR_PTR(ramfs_nommu_expand_for_mapping(inode, size)); 2987 res = ERR_PTR(ramfs_nommu_expand_for_mapping(inode, size));
2984 if (IS_ERR(res)) 2988 if (IS_ERR(res))
2985 goto put_dentry; 2989 goto put_dentry;
2986 2990
2987 res = alloc_file(&path, FMODE_WRITE | FMODE_READ, 2991 res = alloc_file(&path, FMODE_WRITE | FMODE_READ,
2988 &shmem_file_operations); 2992 &shmem_file_operations);
2989 if (IS_ERR(res)) 2993 if (IS_ERR(res))
2990 goto put_dentry; 2994 goto put_dentry;
2991 2995
2992 return res; 2996 return res;
2993 2997
2994 put_dentry: 2998 put_dentry:
2995 path_put(&path); 2999 path_put(&path);
2996 put_memory: 3000 put_memory:
2997 shmem_unacct_size(flags, size); 3001 shmem_unacct_size(flags, size);
2998 return res; 3002 return res;
2999 } 3003 }
3000 EXPORT_SYMBOL_GPL(shmem_file_setup); 3004 EXPORT_SYMBOL_GPL(shmem_file_setup);
3001 3005
3002 /** 3006 /**
3003 * shmem_zero_setup - setup a shared anonymous mapping 3007 * shmem_zero_setup - setup a shared anonymous mapping
3004 * @vma: the vma to be mmapped is prepared by do_mmap_pgoff 3008 * @vma: the vma to be mmapped is prepared by do_mmap_pgoff
3005 */ 3009 */
3006 int shmem_zero_setup(struct vm_area_struct *vma) 3010 int shmem_zero_setup(struct vm_area_struct *vma)
3007 { 3011 {
3008 struct file *file; 3012 struct file *file;
3009 loff_t size = vma->vm_end - vma->vm_start; 3013 loff_t size = vma->vm_end - vma->vm_start;
3010 3014
3011 file = shmem_file_setup("dev/zero", size, vma->vm_flags); 3015 file = shmem_file_setup("dev/zero", size, vma->vm_flags);
3012 if (IS_ERR(file)) 3016 if (IS_ERR(file))
3013 return PTR_ERR(file); 3017 return PTR_ERR(file);
3014 3018
3015 if (vma->vm_file) 3019 if (vma->vm_file)
3016 fput(vma->vm_file); 3020 fput(vma->vm_file);
3017 vma->vm_file = file; 3021 vma->vm_file = file;
3018 vma->vm_ops = &shmem_vm_ops; 3022 vma->vm_ops = &shmem_vm_ops;
3019 return 0; 3023 return 0;
3020 } 3024 }
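
shmem_zero_setup() is what backs shared anonymous memory: the vma prepared by do_mmap_pgoff() for a MAP_SHARED | MAP_ANONYMOUS mapping is given an unlinked tmpfs file ("dev/zero") and shmem_vm_ops, so faults are served from the page cache. From user space the path can be exercised with an ordinary mmap(), for example:

	#define _GNU_SOURCE	/* for MAP_ANONYMOUS on older glibc */
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 1 << 20;	/* 1 MiB */

		/* A shared anonymous mapping; in the kernel this vma is handed
		 * to shmem_zero_setup() and backed by an unlinked tmpfs file. */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		memset(p, 0xab, len);	/* faults in tmpfs-backed pages */
		printf("first byte: %#x\n", p[0] & 0xff);

		munmap(p, len);
		return 0;
	}
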
3021 3025
3022 /** 3026 /**
3023 * shmem_read_mapping_page_gfp - read into page cache, using specified page allocation flags. 3027 * shmem_read_mapping_page_gfp - read into page cache, using specified page allocation flags.
3024 * @mapping: the page's address_space 3028 * @mapping: the page's address_space
3025 * @index: the page index 3029 * @index: the page index
3026 * @gfp: the page allocator flags to use if allocating 3030 * @gfp: the page allocator flags to use if allocating
3027 * 3031 *
3028 * This behaves as a tmpfs "read_cache_page_gfp(mapping, index, gfp)", 3032 * This behaves as a tmpfs "read_cache_page_gfp(mapping, index, gfp)",
3029 * with any new page allocations done using the specified allocation flags. 3033 * with any new page allocations done using the specified allocation flags.
3030 * But read_cache_page_gfp() uses the ->readpage() method: which does not 3034 * But read_cache_page_gfp() uses the ->readpage() method: which does not
3031 * suit tmpfs, since it may have pages in swapcache, and needs to find those 3035 * suit tmpfs, since it may have pages in swapcache, and needs to find those
3032 * for itself; although drivers/gpu/drm i915 and ttm rely upon this support. 3036 * for itself; although drivers/gpu/drm i915 and ttm rely upon this support.
3033 * 3037 *
3034 * i915_gem_object_get_pages_gtt() mixes __GFP_NORETRY | __GFP_NOWARN in 3038 * i915_gem_object_get_pages_gtt() mixes __GFP_NORETRY | __GFP_NOWARN in
3035 * with the mapping_gfp_mask(), to avoid OOMing the machine unnecessarily. 3039 * with the mapping_gfp_mask(), to avoid OOMing the machine unnecessarily.
3036 */ 3040 */
3037 struct page *shmem_read_mapping_page_gfp(struct address_space *mapping, 3041 struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
3038 pgoff_t index, gfp_t gfp) 3042 pgoff_t index, gfp_t gfp)
3039 { 3043 {
3040 #ifdef CONFIG_SHMEM 3044 #ifdef CONFIG_SHMEM
3041 struct inode *inode = mapping->host; 3045 struct inode *inode = mapping->host;
3042 struct page *page; 3046 struct page *page;
3043 int error; 3047 int error;
3044 3048
3045 BUG_ON(mapping->a_ops != &shmem_aops); 3049 BUG_ON(mapping->a_ops != &shmem_aops);
3046 error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE, gfp, NULL); 3050 error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE, gfp, NULL);
3047 if (error) 3051 if (error)
3048 page = ERR_PTR(error); 3052 page = ERR_PTR(error);
3049 else 3053 else
3050 unlock_page(page); 3054 unlock_page(page);
3051 return page; 3055 return page;
3052 #else 3056 #else
3053 /* 3057 /*
3054 * The tiny !SHMEM case uses ramfs without swap 3058 * The tiny !SHMEM case uses ramfs without swap
3055 */ 3059 */
3056 return read_cache_page_gfp(mapping, index, gfp); 3060 return read_cache_page_gfp(mapping, index, gfp);
3057 #endif 3061 #endif
3058 } 3062 }
3059 EXPORT_SYMBOL_GPL(shmem_read_mapping_page_gfp); 3063 EXPORT_SYMBOL_GPL(shmem_read_mapping_page_gfp);
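
The kernel-doc above describes the intended use: drivers that keep objects in tmpfs (i915, ttm) look pages up through this helper rather than read_cache_page_gfp(), and can relax the gfp mask so a failed allocation falls back gracefully instead of pushing the machine toward OOM. A hedged sketch of such a caller follows; it is not taken from any driver and the helper name is made up:

	#include <linux/err.h>
	#include <linux/gfp.h>
	#include <linux/pagemap.h>
	#include <linux/shmem_fs.h>

	/* Illustrative caller only; error handling trimmed to the essentials. */
	static int sample_touch_object_page(struct address_space *mapping,
					    pgoff_t index)
	{
		gfp_t gfp = mapping_gfp_mask(mapping);
		struct page *page;

		/* Prefer failing the lookup over aggressive reclaim/OOM. */
		gfp |= __GFP_NORETRY | __GFP_NOWARN;

		page = shmem_read_mapping_page_gfp(mapping, index, gfp);
		if (IS_ERR(page))
			return PTR_ERR(page);

		/* ... operate on the uptodate, unlocked page here ... */

		page_cache_release(page);	/* drop the reference shmem_getpage took */
		return 0;
	}
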

mm/swap.c

1 /* 1 /*
2 * linux/mm/swap.c 2 * linux/mm/swap.c
3 * 3 *
4 * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds 4 * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
5 */ 5 */
6 6
7 /* 7 /*
8 * This file contains the default values for the operation of the 8 * This file contains the default values for the operation of the
9 * Linux VM subsystem. Fine-tuning documentation can be found in 9 * Linux VM subsystem. Fine-tuning documentation can be found in
10 * Documentation/sysctl/vm.txt. 10 * Documentation/sysctl/vm.txt.
11 * Started 18.12.91 11 * Started 18.12.91
12 * Swap aging added 23.2.95, Stephen Tweedie. 12 * Swap aging added 23.2.95, Stephen Tweedie.
13 * Buffermem limits added 12.3.98, Rik van Riel. 13 * Buffermem limits added 12.3.98, Rik van Riel.
14 */ 14 */
15 15
16 #include <linux/mm.h> 16 #include <linux/mm.h>
17 #include <linux/sched.h> 17 #include <linux/sched.h>
18 #include <linux/kernel_stat.h> 18 #include <linux/kernel_stat.h>
19 #include <linux/swap.h> 19 #include <linux/swap.h>
20 #include <linux/mman.h> 20 #include <linux/mman.h>
21 #include <linux/pagemap.h> 21 #include <linux/pagemap.h>
22 #include <linux/pagevec.h> 22 #include <linux/pagevec.h>
23 #include <linux/init.h> 23 #include <linux/init.h>
24 #include <linux/export.h> 24 #include <linux/export.h>
25 #include <linux/mm_inline.h> 25 #include <linux/mm_inline.h>
26 #include <linux/percpu_counter.h> 26 #include <linux/percpu_counter.h>
27 #include <linux/percpu.h> 27 #include <linux/percpu.h>
28 #include <linux/cpu.h> 28 #include <linux/cpu.h>
29 #include <linux/notifier.h> 29 #include <linux/notifier.h>
30 #include <linux/backing-dev.h> 30 #include <linux/backing-dev.h>
31 #include <linux/memcontrol.h> 31 #include <linux/memcontrol.h>
32 #include <linux/gfp.h> 32 #include <linux/gfp.h>
33 #include <linux/uio.h> 33 #include <linux/uio.h>
34 #include <linux/hugetlb.h> 34 #include <linux/hugetlb.h>
35 35
36 #include "internal.h" 36 #include "internal.h"
37 37
38 #define CREATE_TRACE_POINTS 38 #define CREATE_TRACE_POINTS
39 #include <trace/events/pagemap.h> 39 #include <trace/events/pagemap.h>
40 40
41 /* How many pages do we try to swap or page in/out together? */ 41 /* How many pages do we try to swap or page in/out together? */
42 int page_cluster; 42 int page_cluster;
43 43
44 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec); 44 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
45 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs); 45 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
46 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs); 46 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
47 47
48 /* 48 /*
49 * This path almost never happens for VM activity - pages are normally 49 * This path almost never happens for VM activity - pages are normally
50 * freed via pagevecs. But it gets used by networking. 50 * freed via pagevecs. But it gets used by networking.
51 */ 51 */
52 static void __page_cache_release(struct page *page) 52 static void __page_cache_release(struct page *page)
53 { 53 {
54 if (PageLRU(page)) { 54 if (PageLRU(page)) {
55 struct zone *zone = page_zone(page); 55 struct zone *zone = page_zone(page);
56 struct lruvec *lruvec; 56 struct lruvec *lruvec;
57 unsigned long flags; 57 unsigned long flags;
58 58
59 spin_lock_irqsave(&zone->lru_lock, flags); 59 spin_lock_irqsave(&zone->lru_lock, flags);
60 lruvec = mem_cgroup_page_lruvec(page, zone); 60 lruvec = mem_cgroup_page_lruvec(page, zone);
61 VM_BUG_ON(!PageLRU(page)); 61 VM_BUG_ON(!PageLRU(page));
62 __ClearPageLRU(page); 62 __ClearPageLRU(page);
63 del_page_from_lru_list(page, lruvec, page_off_lru(page)); 63 del_page_from_lru_list(page, lruvec, page_off_lru(page));
64 spin_unlock_irqrestore(&zone->lru_lock, flags); 64 spin_unlock_irqrestore(&zone->lru_lock, flags);
65 } 65 }
66 } 66 }
67 67
68 static void __put_single_page(struct page *page) 68 static void __put_single_page(struct page *page)
69 { 69 {
70 __page_cache_release(page); 70 __page_cache_release(page);
71 free_hot_cold_page(page, false); 71 free_hot_cold_page(page, false);
72 } 72 }
73 73
74 static void __put_compound_page(struct page *page) 74 static void __put_compound_page(struct page *page)
75 { 75 {
76 compound_page_dtor *dtor; 76 compound_page_dtor *dtor;
77 77
78 __page_cache_release(page); 78 __page_cache_release(page);
79 dtor = get_compound_page_dtor(page); 79 dtor = get_compound_page_dtor(page);
80 (*dtor)(page); 80 (*dtor)(page);
81 } 81 }
82 82
83 static void put_compound_page(struct page *page) 83 static void put_compound_page(struct page *page)
84 { 84 {
85 if (unlikely(PageTail(page))) { 85 if (unlikely(PageTail(page))) {
86 /* __split_huge_page_refcount can run under us */ 86 /* __split_huge_page_refcount can run under us */
87 struct page *page_head = compound_head(page); 87 struct page *page_head = compound_head(page);
88 88
89 if (likely(page != page_head && 89 if (likely(page != page_head &&
90 get_page_unless_zero(page_head))) { 90 get_page_unless_zero(page_head))) {
91 unsigned long flags; 91 unsigned long flags;
92 92
93 /* 93 /*
94 * THP can not break up slab pages so avoid taking 94 * THP can not break up slab pages so avoid taking
95 * compound_lock(). Slab performs non-atomic bit ops 95 * compound_lock(). Slab performs non-atomic bit ops
96 * on page->flags for better performance. In particular 96 * on page->flags for better performance. In particular
97 * slab_unlock() in slub used to be a hot path. It is 97 * slab_unlock() in slub used to be a hot path. It is
98 * still hot on arches that do not support 98 * still hot on arches that do not support
99 * this_cpu_cmpxchg_double(). 99 * this_cpu_cmpxchg_double().
100 */ 100 */
101 if (PageSlab(page_head) || PageHeadHuge(page_head)) { 101 if (PageSlab(page_head) || PageHeadHuge(page_head)) {
102 if (likely(PageTail(page))) { 102 if (likely(PageTail(page))) {
103 /* 103 /*
104 * __split_huge_page_refcount 104 * __split_huge_page_refcount
105 * cannot race here. 105 * cannot race here.
106 */ 106 */
107 VM_BUG_ON(!PageHead(page_head)); 107 VM_BUG_ON(!PageHead(page_head));
108 atomic_dec(&page->_mapcount); 108 atomic_dec(&page->_mapcount);
109 if (put_page_testzero(page_head)) 109 if (put_page_testzero(page_head))
110 VM_BUG_ON(1); 110 VM_BUG_ON(1);
111 if (put_page_testzero(page_head)) 111 if (put_page_testzero(page_head))
112 __put_compound_page(page_head); 112 __put_compound_page(page_head);
113 return; 113 return;
114 } else 114 } else
115 /* 115 /*
116 * __split_huge_page_refcount 116 * __split_huge_page_refcount
117 * run before us, "page" was a 117 * run before us, "page" was a
118 * THP tail. The split 118 * THP tail. The split
119 * page_head has been freed 119 * page_head has been freed
120 * and reallocated as slab or 120 * and reallocated as slab or
121 * hugetlbfs page of smaller 121 * hugetlbfs page of smaller
122 * order (only possible if 122 * order (only possible if
123 * reallocated as slab on 123 * reallocated as slab on
124 * x86). 124 * x86).
125 */ 125 */
126 goto skip_lock; 126 goto skip_lock;
127 } 127 }
128 /* 128 /*
129 * page_head wasn't a dangling pointer but it 129 * page_head wasn't a dangling pointer but it
130 * may not be a head page anymore by the time 130 * may not be a head page anymore by the time
131 * we obtain the lock. That is ok as long as it 131 * we obtain the lock. That is ok as long as it
132 * can't be freed from under us. 132 * can't be freed from under us.
133 */ 133 */
134 flags = compound_lock_irqsave(page_head); 134 flags = compound_lock_irqsave(page_head);
135 if (unlikely(!PageTail(page))) { 135 if (unlikely(!PageTail(page))) {
136 /* __split_huge_page_refcount run before us */ 136 /* __split_huge_page_refcount run before us */
137 compound_unlock_irqrestore(page_head, flags); 137 compound_unlock_irqrestore(page_head, flags);
138 skip_lock: 138 skip_lock:
139 if (put_page_testzero(page_head)) { 139 if (put_page_testzero(page_head)) {
140 /* 140 /*
141 * The head page may have been 141 * The head page may have been
142 * freed and reallocated as a 142 * freed and reallocated as a
143 * compound page of smaller 143 * compound page of smaller
144 * order and then freed again. 144 * order and then freed again.
145 * All we know is that it 145 * All we know is that it
146 * cannot have become: a THP 146 * cannot have become: a THP
147 * page, a compound page of 147 * page, a compound page of
148 * higher order, a tail page. 148 * higher order, a tail page.
149 * That is because we still 149 * That is because we still
150 * hold the refcount of the 150 * hold the refcount of the
151 * split THP tail and 151 * split THP tail and
152 * page_head was the THP head 152 * page_head was the THP head
153 * before the split. 153 * before the split.
154 */ 154 */
155 if (PageHead(page_head)) 155 if (PageHead(page_head))
156 __put_compound_page(page_head); 156 __put_compound_page(page_head);
157 else 157 else
158 __put_single_page(page_head); 158 __put_single_page(page_head);
159 } 159 }
160 out_put_single: 160 out_put_single:
161 if (put_page_testzero(page)) 161 if (put_page_testzero(page))
162 __put_single_page(page); 162 __put_single_page(page);
163 return; 163 return;
164 } 164 }
165 VM_BUG_ON(page_head != page->first_page); 165 VM_BUG_ON(page_head != page->first_page);
166 /* 166 /*
167 * We can release the refcount taken by 167 * We can release the refcount taken by
168 * get_page_unless_zero() now that 168 * get_page_unless_zero() now that
169 * __split_huge_page_refcount() is blocked on 169 * __split_huge_page_refcount() is blocked on
170 * the compound_lock. 170 * the compound_lock.
171 */ 171 */
172 if (put_page_testzero(page_head)) 172 if (put_page_testzero(page_head))
173 VM_BUG_ON(1); 173 VM_BUG_ON(1);
174 /* __split_huge_page_refcount will wait now */ 174 /* __split_huge_page_refcount will wait now */
175 VM_BUG_ON(page_mapcount(page) <= 0); 175 VM_BUG_ON(page_mapcount(page) <= 0);
176 atomic_dec(&page->_mapcount); 176 atomic_dec(&page->_mapcount);
177 VM_BUG_ON(atomic_read(&page_head->_count) <= 0); 177 VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
178 VM_BUG_ON(atomic_read(&page->_count) != 0); 178 VM_BUG_ON(atomic_read(&page->_count) != 0);
179 compound_unlock_irqrestore(page_head, flags); 179 compound_unlock_irqrestore(page_head, flags);
180 180
181 if (put_page_testzero(page_head)) { 181 if (put_page_testzero(page_head)) {
182 if (PageHead(page_head)) 182 if (PageHead(page_head))
183 __put_compound_page(page_head); 183 __put_compound_page(page_head);
184 else 184 else
185 __put_single_page(page_head); 185 __put_single_page(page_head);
186 } 186 }
187 } else { 187 } else {
188 /* page_head is a dangling pointer */ 188 /* page_head is a dangling pointer */
189 VM_BUG_ON(PageTail(page)); 189 VM_BUG_ON(PageTail(page));
190 goto out_put_single; 190 goto out_put_single;
191 } 191 }
192 } else if (put_page_testzero(page)) { 192 } else if (put_page_testzero(page)) {
193 if (PageHead(page)) 193 if (PageHead(page))
194 __put_compound_page(page); 194 __put_compound_page(page);
195 else 195 else
196 __put_single_page(page); 196 __put_single_page(page);
197 } 197 }
198 } 198 }
199 199
200 void put_page(struct page *page) 200 void put_page(struct page *page)
201 { 201 {
202 if (unlikely(PageCompound(page))) 202 if (unlikely(PageCompound(page)))
203 put_compound_page(page); 203 put_compound_page(page);
204 else if (put_page_testzero(page)) 204 else if (put_page_testzero(page))
205 __put_single_page(page); 205 __put_single_page(page);
206 } 206 }
207 EXPORT_SYMBOL(put_page); 207 EXPORT_SYMBOL(put_page);
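As a usage note, not part of this diff: the usual lifetime pattern around put_page() is that a lookup helper hands back the page with its refcount elevated and the caller drops that reference when finished. A minimal sketch, assuming a mapping/index pair that may or may not have a page present (the wrapper name is illustrative):

	static void example_peek_page(struct address_space *mapping, pgoff_t index)
	{
		/* find_get_page() returns the page with an elevated refcount */
		struct page *page = find_get_page(mapping, index);

		if (!page)
			return;
		/* ... inspect the page while the reference pins it ... */
		put_page(page);		/* may free the page via the paths above */
	}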
208 208
209 /* 209 /*
210 * This function is exported but must not be called by anything other 210 * This function is exported but must not be called by anything other
211 * than get_page(). It implements the slow path of get_page(). 211 * than get_page(). It implements the slow path of get_page().
212 */ 212 */
213 bool __get_page_tail(struct page *page) 213 bool __get_page_tail(struct page *page)
214 { 214 {
215 /* 215 /*
216 * This takes care of get_page() if run on a tail page 216 * This takes care of get_page() if run on a tail page
217 * returned by one of the get_user_pages/follow_page variants. 217 * returned by one of the get_user_pages/follow_page variants.
218 * get_user_pages/follow_page itself doesn't need the compound 218 * get_user_pages/follow_page itself doesn't need the compound
219 * lock because it runs __get_page_tail_foll() under the 219 * lock because it runs __get_page_tail_foll() under the
220 * proper PT lock that already serializes against 220 * proper PT lock that already serializes against
221 * split_huge_page(). 221 * split_huge_page().
222 */ 222 */
223 unsigned long flags; 223 unsigned long flags;
224 bool got = false; 224 bool got = false;
225 struct page *page_head = compound_head(page); 225 struct page *page_head = compound_head(page);
226 226
227 if (likely(page != page_head && get_page_unless_zero(page_head))) { 227 if (likely(page != page_head && get_page_unless_zero(page_head))) {
228 /* See the comment in put_compound_page(). */ 228 /* See the comment in put_compound_page(). */
229 if (PageSlab(page_head) || PageHeadHuge(page_head)) { 229 if (PageSlab(page_head) || PageHeadHuge(page_head)) {
230 if (likely(PageTail(page))) { 230 if (likely(PageTail(page))) {
231 /* 231 /*
232 * This is a hugetlbfs page or a slab 232 * This is a hugetlbfs page or a slab
233 * page. __split_huge_page_refcount 233 * page. __split_huge_page_refcount
234 * cannot race here. 234 * cannot race here.
235 */ 235 */
236 VM_BUG_ON(!PageHead(page_head)); 236 VM_BUG_ON(!PageHead(page_head));
237 __get_page_tail_foll(page, false); 237 __get_page_tail_foll(page, false);
238 return true; 238 return true;
239 } else { 239 } else {
240 /* 240 /*
241 * __split_huge_page_refcount run 241 * __split_huge_page_refcount run
242 * before us, "page" was a THP 242 * before us, "page" was a THP
243 * tail. The split page_head has been 243 * tail. The split page_head has been
244 * freed and reallocated as slab or 244 * freed and reallocated as slab or
245 * hugetlbfs page of smaller order 245 * hugetlbfs page of smaller order
246 * (only possible if reallocated as 246 * (only possible if reallocated as
247 * slab on x86). 247 * slab on x86).
248 */ 248 */
249 put_page(page_head); 249 put_page(page_head);
250 return false; 250 return false;
251 } 251 }
252 } 252 }
253 253
254 /* 254 /*
255 * page_head wasn't a dangling pointer but it 255 * page_head wasn't a dangling pointer but it
256 * may not be a head page anymore by the time 256 * may not be a head page anymore by the time
257 * we obtain the lock. That is ok as long as it 257 * we obtain the lock. That is ok as long as it
258 * can't be freed from under us. 258 * can't be freed from under us.
259 */ 259 */
260 flags = compound_lock_irqsave(page_head); 260 flags = compound_lock_irqsave(page_head);
261 /* here __split_huge_page_refcount won't run anymore */ 261 /* here __split_huge_page_refcount won't run anymore */
262 if (likely(PageTail(page))) { 262 if (likely(PageTail(page))) {
263 __get_page_tail_foll(page, false); 263 __get_page_tail_foll(page, false);
264 got = true; 264 got = true;
265 } 265 }
266 compound_unlock_irqrestore(page_head, flags); 266 compound_unlock_irqrestore(page_head, flags);
267 if (unlikely(!got)) 267 if (unlikely(!got))
268 put_page(page_head); 268 put_page(page_head);
269 } 269 }
270 return got; 270 return got;
271 } 271 }
272 EXPORT_SYMBOL(__get_page_tail); 272 EXPORT_SYMBOL(__get_page_tail);
273 273
274 /** 274 /**
275 * put_pages_list() - release a list of pages 275 * put_pages_list() - release a list of pages
276 * @pages: list of pages threaded on page->lru 276 * @pages: list of pages threaded on page->lru
277 * 277 *
278 * Release a list of pages which are strung together on page.lru. Currently 278 * Release a list of pages which are strung together on page.lru. Currently
279 * used by read_cache_pages() and related error recovery code. 279 * used by read_cache_pages() and related error recovery code.
280 */ 280 */
281 void put_pages_list(struct list_head *pages) 281 void put_pages_list(struct list_head *pages)
282 { 282 {
283 while (!list_empty(pages)) { 283 while (!list_empty(pages)) {
284 struct page *victim; 284 struct page *victim;
285 285
286 victim = list_entry(pages->prev, struct page, lru); 286 victim = list_entry(pages->prev, struct page, lru);
287 list_del(&victim->lru); 287 list_del(&victim->lru);
288 page_cache_release(victim); 288 page_cache_release(victim);
289 } 289 }
290 } 290 }
291 EXPORT_SYMBOL(put_pages_list); 291 EXPORT_SYMBOL(put_pages_list);
292 292
293 /* 293 /*
294 * get_kernel_pages() - pin kernel pages in memory 294 * get_kernel_pages() - pin kernel pages in memory
295 * @kiov: An array of struct kvec structures 295 * @kiov: An array of struct kvec structures
296 * @nr_segs: number of segments to pin 296 * @nr_segs: number of segments to pin
297 * @write: pinning for read/write, currently ignored 297 * @write: pinning for read/write, currently ignored
298 * @pages: array that receives pointers to the pages pinned. 298 * @pages: array that receives pointers to the pages pinned.
299 * Should be at least nr_segs long. 299 * Should be at least nr_segs long.
300 * 300 *
301 * Returns number of pages pinned. This may be fewer than the number 301 * Returns number of pages pinned. This may be fewer than the number
302 * requested. If nr_segs is 0 or negative, returns 0. If no pages 302 * requested. If nr_segs is 0 or negative, returns 0. If no pages
303 * were pinned, returns -errno. Each page returned must be released 303 * were pinned, returns -errno. Each page returned must be released
304 * with a put_page() call when it is finished with. 304 * with a put_page() call when it is finished with.
305 */ 305 */
306 int get_kernel_pages(const struct kvec *kiov, int nr_segs, int write, 306 int get_kernel_pages(const struct kvec *kiov, int nr_segs, int write,
307 struct page **pages) 307 struct page **pages)
308 { 308 {
309 int seg; 309 int seg;
310 310
311 for (seg = 0; seg < nr_segs; seg++) { 311 for (seg = 0; seg < nr_segs; seg++) {
312 if (WARN_ON(kiov[seg].iov_len != PAGE_SIZE)) 312 if (WARN_ON(kiov[seg].iov_len != PAGE_SIZE))
313 return seg; 313 return seg;
314 314
315 pages[seg] = kmap_to_page(kiov[seg].iov_base); 315 pages[seg] = kmap_to_page(kiov[seg].iov_base);
316 page_cache_get(pages[seg]); 316 page_cache_get(pages[seg]);
317 } 317 }
318 318
319 return seg; 319 return seg;
320 } 320 }
321 EXPORT_SYMBOL_GPL(get_kernel_pages); 321 EXPORT_SYMBOL_GPL(get_kernel_pages);
322 322
323 /* 323 /*
324 * get_kernel_page() - pin a kernel page in memory 324 * get_kernel_page() - pin a kernel page in memory
325 * @start: starting kernel address 325 * @start: starting kernel address
326 * @write: pinning for read/write, currently ignored 326 * @write: pinning for read/write, currently ignored
327 * @pages: array that receives pointer to the page pinned. 327 * @pages: array that receives pointer to the page pinned.
328 * Must have room for at least one page pointer. 328 * Must have room for at least one page pointer.
329 * 329 *
330 * Returns 1 if page is pinned. If the page was not pinned, returns 330 * Returns 1 if page is pinned. If the page was not pinned, returns
331 * -errno. The page returned must be released with a put_page() call 331 * -errno. The page returned must be released with a put_page() call
332 * when it is finished with. 332 * when it is finished with.
333 */ 333 */
334 int get_kernel_page(unsigned long start, int write, struct page **pages) 334 int get_kernel_page(unsigned long start, int write, struct page **pages)
335 { 335 {
336 const struct kvec kiov = { 336 const struct kvec kiov = {
337 .iov_base = (void *)start, 337 .iov_base = (void *)start,
338 .iov_len = PAGE_SIZE 338 .iov_len = PAGE_SIZE
339 }; 339 };
340 340
341 return get_kernel_pages(&kiov, 1, write, pages); 341 return get_kernel_pages(&kiov, 1, write, pages);
342 } 342 }
343 EXPORT_SYMBOL_GPL(get_kernel_page); 343 EXPORT_SYMBOL_GPL(get_kernel_page);
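A minimal usage sketch for the single-page wrapper, assuming buf points into a page-backed kernel mapping; the wrapper name and error handling are illustrative, not from this file:

	static int example_pin_kernel_buffer(void *buf)
	{
		struct page *page;

		if (get_kernel_page((unsigned long)buf, 0, &page) < 1)
			return -EFAULT;
		/* ... hand "page" to code that expects a struct page ... */
		put_page(page);		/* drop the reference taken above */
		return 0;
	}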
344 344
345 static void pagevec_lru_move_fn(struct pagevec *pvec, 345 static void pagevec_lru_move_fn(struct pagevec *pvec,
346 void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg), 346 void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
347 void *arg) 347 void *arg)
348 { 348 {
349 int i; 349 int i;
350 struct zone *zone = NULL; 350 struct zone *zone = NULL;
351 struct lruvec *lruvec; 351 struct lruvec *lruvec;
352 unsigned long flags = 0; 352 unsigned long flags = 0;
353 353
354 for (i = 0; i < pagevec_count(pvec); i++) { 354 for (i = 0; i < pagevec_count(pvec); i++) {
355 struct page *page = pvec->pages[i]; 355 struct page *page = pvec->pages[i];
356 struct zone *pagezone = page_zone(page); 356 struct zone *pagezone = page_zone(page);
357 357
358 if (pagezone != zone) { 358 if (pagezone != zone) {
359 if (zone) 359 if (zone)
360 spin_unlock_irqrestore(&zone->lru_lock, flags); 360 spin_unlock_irqrestore(&zone->lru_lock, flags);
361 zone = pagezone; 361 zone = pagezone;
362 spin_lock_irqsave(&zone->lru_lock, flags); 362 spin_lock_irqsave(&zone->lru_lock, flags);
363 } 363 }
364 364
365 lruvec = mem_cgroup_page_lruvec(page, zone); 365 lruvec = mem_cgroup_page_lruvec(page, zone);
366 (*move_fn)(page, lruvec, arg); 366 (*move_fn)(page, lruvec, arg);
367 } 367 }
368 if (zone) 368 if (zone)
369 spin_unlock_irqrestore(&zone->lru_lock, flags); 369 spin_unlock_irqrestore(&zone->lru_lock, flags);
370 release_pages(pvec->pages, pvec->nr, pvec->cold); 370 release_pages(pvec->pages, pvec->nr, pvec->cold);
371 pagevec_reinit(pvec); 371 pagevec_reinit(pvec);
372 } 372 }
373 373
374 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec, 374 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
375 void *arg) 375 void *arg)
376 { 376 {
377 int *pgmoved = arg; 377 int *pgmoved = arg;
378 378
379 if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { 379 if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
380 enum lru_list lru = page_lru_base_type(page); 380 enum lru_list lru = page_lru_base_type(page);
381 list_move_tail(&page->lru, &lruvec->lists[lru]); 381 list_move_tail(&page->lru, &lruvec->lists[lru]);
382 (*pgmoved)++; 382 (*pgmoved)++;
383 } 383 }
384 } 384 }
385 385
386 /* 386 /*
387 * pagevec_move_tail() must be called with IRQ disabled. 387 * pagevec_move_tail() must be called with IRQ disabled.
388 * Otherwise this may cause nasty races. 388 * Otherwise this may cause nasty races.
389 */ 389 */
390 static void pagevec_move_tail(struct pagevec *pvec) 390 static void pagevec_move_tail(struct pagevec *pvec)
391 { 391 {
392 int pgmoved = 0; 392 int pgmoved = 0;
393 393
394 pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved); 394 pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved);
395 __count_vm_events(PGROTATED, pgmoved); 395 __count_vm_events(PGROTATED, pgmoved);
396 } 396 }
397 397
398 /* 398 /*
399 * Writeback is about to end against a page which has been marked for immediate 399 * Writeback is about to end against a page which has been marked for immediate
400 * reclaim. If it still appears to be reclaimable, move it to the tail of the 400 * reclaim. If it still appears to be reclaimable, move it to the tail of the
401 * inactive list. 401 * inactive list.
402 */ 402 */
403 void rotate_reclaimable_page(struct page *page) 403 void rotate_reclaimable_page(struct page *page)
404 { 404 {
405 if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) && 405 if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
406 !PageUnevictable(page) && PageLRU(page)) { 406 !PageUnevictable(page) && PageLRU(page)) {
407 struct pagevec *pvec; 407 struct pagevec *pvec;
408 unsigned long flags; 408 unsigned long flags;
409 409
410 page_cache_get(page); 410 page_cache_get(page);
411 local_irq_save(flags); 411 local_irq_save(flags);
412 pvec = &__get_cpu_var(lru_rotate_pvecs); 412 pvec = &__get_cpu_var(lru_rotate_pvecs);
413 if (!pagevec_add(pvec, page)) 413 if (!pagevec_add(pvec, page))
414 pagevec_move_tail(pvec); 414 pagevec_move_tail(pvec);
415 local_irq_restore(flags); 415 local_irq_restore(flags);
416 } 416 }
417 } 417 }
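For context, a hedged sketch of the kind of writeback-completion path that feeds this helper: reclaim sets PG_reclaim on a page it wants back quickly, and when writeback finishes the completion code rotates it (end_page_writeback() does something along these lines; the helper below is illustrative):

	static void example_writeback_done(struct page *page)
	{
		/* PG_reclaim set by reclaim means "rotate me when writeback ends" */
		if (TestClearPageReclaim(page))
			rotate_reclaimable_page(page);
	}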
418 418
419 static void update_page_reclaim_stat(struct lruvec *lruvec, 419 static void update_page_reclaim_stat(struct lruvec *lruvec,
420 int file, int rotated) 420 int file, int rotated)
421 { 421 {
422 struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; 422 struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
423 423
424 reclaim_stat->recent_scanned[file]++; 424 reclaim_stat->recent_scanned[file]++;
425 if (rotated) 425 if (rotated)
426 reclaim_stat->recent_rotated[file]++; 426 reclaim_stat->recent_rotated[file]++;
427 } 427 }
428 428
429 static void __activate_page(struct page *page, struct lruvec *lruvec, 429 static void __activate_page(struct page *page, struct lruvec *lruvec,
430 void *arg) 430 void *arg)
431 { 431 {
432 if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { 432 if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
433 int file = page_is_file_cache(page); 433 int file = page_is_file_cache(page);
434 int lru = page_lru_base_type(page); 434 int lru = page_lru_base_type(page);
435 435
436 del_page_from_lru_list(page, lruvec, lru); 436 del_page_from_lru_list(page, lruvec, lru);
437 SetPageActive(page); 437 SetPageActive(page);
438 lru += LRU_ACTIVE; 438 lru += LRU_ACTIVE;
439 add_page_to_lru_list(page, lruvec, lru); 439 add_page_to_lru_list(page, lruvec, lru);
440 trace_mm_lru_activate(page, page_to_pfn(page)); 440 trace_mm_lru_activate(page, page_to_pfn(page));
441 441
442 __count_vm_event(PGACTIVATE); 442 __count_vm_event(PGACTIVATE);
443 update_page_reclaim_stat(lruvec, file, 1); 443 update_page_reclaim_stat(lruvec, file, 1);
444 } 444 }
445 } 445 }
446 446
447 #ifdef CONFIG_SMP 447 #ifdef CONFIG_SMP
448 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs); 448 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
449 449
450 static void activate_page_drain(int cpu) 450 static void activate_page_drain(int cpu)
451 { 451 {
452 struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu); 452 struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
453 453
454 if (pagevec_count(pvec)) 454 if (pagevec_count(pvec))
455 pagevec_lru_move_fn(pvec, __activate_page, NULL); 455 pagevec_lru_move_fn(pvec, __activate_page, NULL);
456 } 456 }
457 457
458 static bool need_activate_page_drain(int cpu) 458 static bool need_activate_page_drain(int cpu)
459 { 459 {
460 return pagevec_count(&per_cpu(activate_page_pvecs, cpu)) != 0; 460 return pagevec_count(&per_cpu(activate_page_pvecs, cpu)) != 0;
461 } 461 }
462 462
463 void activate_page(struct page *page) 463 void activate_page(struct page *page)
464 { 464 {
465 if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { 465 if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
466 struct pagevec *pvec = &get_cpu_var(activate_page_pvecs); 466 struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);
467 467
468 page_cache_get(page); 468 page_cache_get(page);
469 if (!pagevec_add(pvec, page)) 469 if (!pagevec_add(pvec, page))
470 pagevec_lru_move_fn(pvec, __activate_page, NULL); 470 pagevec_lru_move_fn(pvec, __activate_page, NULL);
471 put_cpu_var(activate_page_pvecs); 471 put_cpu_var(activate_page_pvecs);
472 } 472 }
473 } 473 }
474 474
475 #else 475 #else
476 static inline void activate_page_drain(int cpu) 476 static inline void activate_page_drain(int cpu)
477 { 477 {
478 } 478 }
479 479
480 static bool need_activate_page_drain(int cpu) 480 static bool need_activate_page_drain(int cpu)
481 { 481 {
482 return false; 482 return false;
483 } 483 }
484 484
485 void activate_page(struct page *page) 485 void activate_page(struct page *page)
486 { 486 {
487 struct zone *zone = page_zone(page); 487 struct zone *zone = page_zone(page);
488 488
489 spin_lock_irq(&zone->lru_lock); 489 spin_lock_irq(&zone->lru_lock);
490 __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL); 490 __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
491 spin_unlock_irq(&zone->lru_lock); 491 spin_unlock_irq(&zone->lru_lock);
492 } 492 }
493 #endif 493 #endif
494 494
495 static void __lru_cache_activate_page(struct page *page) 495 static void __lru_cache_activate_page(struct page *page)
496 { 496 {
497 struct pagevec *pvec = &get_cpu_var(lru_add_pvec); 497 struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
498 int i; 498 int i;
499 499
500 /* 500 /*
501 * Search backwards on the optimistic assumption that the page being 501 * Search backwards on the optimistic assumption that the page being
502 * activated has just been added to this pagevec. Note that only 502 * activated has just been added to this pagevec. Note that only
503 * the local pagevec is examined as a !PageLRU page could be in the 503 * the local pagevec is examined as a !PageLRU page could be in the
504 * process of being released, reclaimed, migrated or on a remote 504 * process of being released, reclaimed, migrated or on a remote
505 * pagevec that is currently being drained. Furthermore, marking 505 * pagevec that is currently being drained. Furthermore, marking
506 * a remote pagevec's page PageActive potentially hits a race where 506 * a remote pagevec's page PageActive potentially hits a race where
507 * a page is marked PageActive just after it is added to the inactive 507 * a page is marked PageActive just after it is added to the inactive
508 * list causing accounting errors and BUG_ON checks to trigger. 508 * list causing accounting errors and BUG_ON checks to trigger.
509 */ 509 */
510 for (i = pagevec_count(pvec) - 1; i >= 0; i--) { 510 for (i = pagevec_count(pvec) - 1; i >= 0; i--) {
511 struct page *pagevec_page = pvec->pages[i]; 511 struct page *pagevec_page = pvec->pages[i];
512 512
513 if (pagevec_page == page) { 513 if (pagevec_page == page) {
514 SetPageActive(page); 514 SetPageActive(page);
515 break; 515 break;
516 } 516 }
517 } 517 }
518 518
519 put_cpu_var(lru_add_pvec); 519 put_cpu_var(lru_add_pvec);
520 } 520 }
521 521
522 /* 522 /*
523 * Mark a page as having seen activity. 523 * Mark a page as having seen activity.
524 * 524 *
525 * inactive,unreferenced -> inactive,referenced 525 * inactive,unreferenced -> inactive,referenced
526 * inactive,referenced -> active,unreferenced 526 * inactive,referenced -> active,unreferenced
527 * active,unreferenced -> active,referenced 527 * active,unreferenced -> active,referenced
528 */ 528 */
529 void mark_page_accessed(struct page *page) 529 void mark_page_accessed(struct page *page)
530 { 530 {
531 if (!PageActive(page) && !PageUnevictable(page) && 531 if (!PageActive(page) && !PageUnevictable(page) &&
532 PageReferenced(page)) { 532 PageReferenced(page)) {
533 533
534 /* 534 /*
535 * If the page is on the LRU, queue it for activation via 535 * If the page is on the LRU, queue it for activation via
536 * activate_page_pvecs. Otherwise, assume the page is on a 536 * activate_page_pvecs. Otherwise, assume the page is on a
537 * pagevec, mark it active and it'll be moved to the active 537 * pagevec, mark it active and it'll be moved to the active
538 * LRU on the next drain. 538 * LRU on the next drain.
539 */ 539 */
540 if (PageLRU(page)) 540 if (PageLRU(page))
541 activate_page(page); 541 activate_page(page);
542 else 542 else
543 __lru_cache_activate_page(page); 543 __lru_cache_activate_page(page);
544 ClearPageReferenced(page); 544 ClearPageReferenced(page);
545 } else if (!PageReferenced(page)) { 545 } else if (!PageReferenced(page)) {
546 SetPageReferenced(page); 546 SetPageReferenced(page);
547 } 547 }
548 } 548 }
549 EXPORT_SYMBOL(mark_page_accessed); 549 EXPORT_SYMBOL(mark_page_accessed);
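To make the ladder above concrete, a tiny illustrative sequence (the helper is hypothetical): two accesses take an inactive, unreferenced page all the way to a pending activation.

	static void example_touch_twice(struct page *page)
	{
		mark_page_accessed(page);	/* inactive,unreferenced -> inactive,referenced */
		mark_page_accessed(page);	/* inactive,referenced   -> active,unreferenced  */
	}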
550 550
551 /*
552 * Used in place of mark_page_accessed() on a page that is not yet visible
553 * to others, while it is still safe to use non-atomic bit operations.
554 */
555 void init_page_accessed(struct page *page)
556 {
557 if (!PageReferenced(page))
558 __SetPageReferenced(page);
559 }
560 EXPORT_SYMBOL(init_page_accessed);
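A minimal sketch of the intended calling pattern described in the changelog; the helper name and gfp handling are illustrative. The non-atomic init_page_accessed() is only legal while the page is still private to its allocator; once the page has been published to the page cache, only the atomic mark_page_accessed() may be used.

	static struct page *example_grab_page(struct address_space *mapping,
					      pgoff_t index, gfp_t gfp)
	{
		struct page *page = __page_cache_alloc(gfp);

		if (!page)
			return NULL;

		/* page not yet visible: non-atomic PG_referenced is safe */
		init_page_accessed(page);

		if (add_to_page_cache_lru(page, mapping, index, gfp)) {
			page_cache_release(page);
			return NULL;
		}
		/* now visible: use mark_page_accessed() from here on */
		return page;
	}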
561
551 static void __lru_cache_add(struct page *page) 562 static void __lru_cache_add(struct page *page)
552 { 563 {
553 struct pagevec *pvec = &get_cpu_var(lru_add_pvec); 564 struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
554 565
555 page_cache_get(page); 566 page_cache_get(page);
556 if (!pagevec_space(pvec)) 567 if (!pagevec_space(pvec))
557 __pagevec_lru_add(pvec); 568 __pagevec_lru_add(pvec);
558 pagevec_add(pvec, page); 569 pagevec_add(pvec, page);
559 put_cpu_var(lru_add_pvec); 570 put_cpu_var(lru_add_pvec);
560 } 571 }
561 572
562 /** 573 /**
563 * lru_cache_add_anon - add an anonymous page to the LRU lists 574 * lru_cache_add_anon - add an anonymous page to the LRU lists
564 * @page: the page to add 575 * @page: the page to add
565 */ 576 */
566 void lru_cache_add_anon(struct page *page) 577 void lru_cache_add_anon(struct page *page)
567 { 578 {
568 if (PageActive(page)) 579 if (PageActive(page))
569 ClearPageActive(page); 580 ClearPageActive(page);
570 __lru_cache_add(page); 581 __lru_cache_add(page);
571 } 582 }
572 583
573 void lru_cache_add_file(struct page *page) 584 void lru_cache_add_file(struct page *page)
574 { 585 {
575 if (PageActive(page)) 586 if (PageActive(page))
576 ClearPageActive(page); 587 ClearPageActive(page);
577 __lru_cache_add(page); 588 __lru_cache_add(page);
578 } 589 }
579 EXPORT_SYMBOL(lru_cache_add_file); 590 EXPORT_SYMBOL(lru_cache_add_file);
580 591
581 /** 592 /**
582 * lru_cache_add - add a page to a page list 593 * lru_cache_add - add a page to a page list
583 * @page: the page to be added to the LRU. 594 * @page: the page to be added to the LRU.
584 * 595 *
585 * Queue the page for addition to the LRU via pagevec. The decision on whether 596 * Queue the page for addition to the LRU via pagevec. The decision on whether
586 * to add the page to the [in]active [file|anon] list is deferred until the 597 * to add the page to the [in]active [file|anon] list is deferred until the
587 * pagevec is drained. This gives the caller of lru_cache_add() a chance to 598 * pagevec is drained. This gives the caller of lru_cache_add() a chance to
588 * have the page added to the active list using mark_page_accessed(). 599 * have the page added to the active list using mark_page_accessed().
589 */ 600 */
590 void lru_cache_add(struct page *page) 601 void lru_cache_add(struct page *page)
591 { 602 {
592 VM_BUG_ON(PageActive(page) && PageUnevictable(page)); 603 VM_BUG_ON(PageActive(page) && PageUnevictable(page));
593 VM_BUG_ON(PageLRU(page)); 604 VM_BUG_ON(PageLRU(page));
594 __lru_cache_add(page); 605 __lru_cache_add(page);
595 } 606 }
596 607
597 /** 608 /**
598 * add_page_to_unevictable_list - add a page to the unevictable list 609 * add_page_to_unevictable_list - add a page to the unevictable list
599 * @page: the page to be added to the unevictable list 610 * @page: the page to be added to the unevictable list
600 * 611 *
601 * Add page directly to its zone's unevictable list. To avoid races with 612 * Add page directly to its zone's unevictable list. To avoid races with
602 * tasks that might be making the page evictable, through eg. munlock, 613 * tasks that might be making the page evictable, through eg. munlock,
603 * munmap or exit, while it's not on the lru, we want to add the page 614 * munmap or exit, while it's not on the lru, we want to add the page
604 * while it's locked or otherwise "invisible" to other tasks. This is 615 * while it's locked or otherwise "invisible" to other tasks. This is
605 * difficult to do when using the pagevec cache, so bypass that. 616 * difficult to do when using the pagevec cache, so bypass that.
606 */ 617 */
607 void add_page_to_unevictable_list(struct page *page) 618 void add_page_to_unevictable_list(struct page *page)
608 { 619 {
609 struct zone *zone = page_zone(page); 620 struct zone *zone = page_zone(page);
610 struct lruvec *lruvec; 621 struct lruvec *lruvec;
611 622
612 spin_lock_irq(&zone->lru_lock); 623 spin_lock_irq(&zone->lru_lock);
613 lruvec = mem_cgroup_page_lruvec(page, zone); 624 lruvec = mem_cgroup_page_lruvec(page, zone);
614 ClearPageActive(page); 625 ClearPageActive(page);
615 SetPageUnevictable(page); 626 SetPageUnevictable(page);
616 SetPageLRU(page); 627 SetPageLRU(page);
617 add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE); 628 add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
618 spin_unlock_irq(&zone->lru_lock); 629 spin_unlock_irq(&zone->lru_lock);
619 } 630 }
620 631
621 /* 632 /*
622 * If the page can not be invalidated, it is moved to the 633 * If the page can not be invalidated, it is moved to the
623 * inactive list to speed up its reclaim. It is moved to the 634 * inactive list to speed up its reclaim. It is moved to the
624 * head of the list, rather than the tail, to give the flusher 635 * head of the list, rather than the tail, to give the flusher
625 * threads some time to write it out, as this is much more 636 * threads some time to write it out, as this is much more
626 * effective than the single-page writeout from reclaim. 637 * effective than the single-page writeout from reclaim.
627 * 638 *
628 * If the page isn't mapped and is dirty or under writeback, it can be 639 * If the page isn't mapped and is dirty or under writeback, it can be
629 * reclaimed asap by setting PG_reclaim. 640 * reclaimed asap by setting PG_reclaim.
630 * 641 *
631 * 1. active, mapped page -> none 642 * 1. active, mapped page -> none
632 * 2. active, dirty/writeback page -> inactive, head, PG_reclaim 643 * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
633 * 3. inactive, mapped page -> none 644 * 3. inactive, mapped page -> none
634 * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim 645 * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
635 * 5. inactive, clean -> inactive, tail 646 * 5. inactive, clean -> inactive, tail
636 * 6. Others -> none 647 * 6. Others -> none
637 * 648 *
638 * In case 4 the page is moved to the head of the inactive list because 649 * In case 4 the page is moved to the head of the inactive list because
639 * the VM expects the flusher threads to write it out, which is much more 650 * the VM expects the flusher threads to write it out, which is much more
640 * effective than the single-page writeout from reclaim. 651 * effective than the single-page writeout from reclaim.
641 */ 652 */
642 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec, 653 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
643 void *arg) 654 void *arg)
644 { 655 {
645 int lru, file; 656 int lru, file;
646 bool active; 657 bool active;
647 658
648 if (!PageLRU(page)) 659 if (!PageLRU(page))
649 return; 660 return;
650 661
651 if (PageUnevictable(page)) 662 if (PageUnevictable(page))
652 return; 663 return;
653 664
654 /* Some processes are using the page */ 665 /* Some processes are using the page */
655 if (page_mapped(page)) 666 if (page_mapped(page))
656 return; 667 return;
657 668
658 active = PageActive(page); 669 active = PageActive(page);
659 file = page_is_file_cache(page); 670 file = page_is_file_cache(page);
660 lru = page_lru_base_type(page); 671 lru = page_lru_base_type(page);
661 672
662 del_page_from_lru_list(page, lruvec, lru + active); 673 del_page_from_lru_list(page, lruvec, lru + active);
663 ClearPageActive(page); 674 ClearPageActive(page);
664 ClearPageReferenced(page); 675 ClearPageReferenced(page);
665 add_page_to_lru_list(page, lruvec, lru); 676 add_page_to_lru_list(page, lruvec, lru);
666 677
667 if (PageWriteback(page) || PageDirty(page)) { 678 if (PageWriteback(page) || PageDirty(page)) {
668 /* 679 /*
669 * PG_reclaim could be raced with end_page_writeback 680 * PG_reclaim could be raced with end_page_writeback
670 * It can make readahead confusing. But race window 681 * It can make readahead confusing. But race window
671 * is _really_ small and it's non-critical problem. 682 * is _really_ small and it's non-critical problem.
672 */ 683 */
673 SetPageReclaim(page); 684 SetPageReclaim(page);
674 } else { 685 } else {
675 /* 686 /*
676 * The page's writeback ended while it was on the pagevec, 687 * The page's writeback ended while it was on the pagevec,
677 * so move the page to the tail of the inactive list. 688 * so move the page to the tail of the inactive list.
678 */ 689 */
679 list_move_tail(&page->lru, &lruvec->lists[lru]); 690 list_move_tail(&page->lru, &lruvec->lists[lru]);
680 __count_vm_event(PGROTATED); 691 __count_vm_event(PGROTATED);
681 } 692 }
682 693
683 if (active) 694 if (active)
684 __count_vm_event(PGDEACTIVATE); 695 __count_vm_event(PGDEACTIVATE);
685 update_page_reclaim_stat(lruvec, file, 0); 696 update_page_reclaim_stat(lruvec, file, 0);
686 } 697 }
687 698
688 /* 699 /*
689 * Drain pages out of the cpu's pagevecs. 700 * Drain pages out of the cpu's pagevecs.
690 * Either "cpu" is the current CPU, and preemption has already been 701 * Either "cpu" is the current CPU, and preemption has already been
691 * disabled; or "cpu" is being hot-unplugged, and is already dead. 702 * disabled; or "cpu" is being hot-unplugged, and is already dead.
692 */ 703 */
693 void lru_add_drain_cpu(int cpu) 704 void lru_add_drain_cpu(int cpu)
694 { 705 {
695 struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu); 706 struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu);
696 707
697 if (pagevec_count(pvec)) 708 if (pagevec_count(pvec))
698 __pagevec_lru_add(pvec); 709 __pagevec_lru_add(pvec);
699 710
700 pvec = &per_cpu(lru_rotate_pvecs, cpu); 711 pvec = &per_cpu(lru_rotate_pvecs, cpu);
701 if (pagevec_count(pvec)) { 712 if (pagevec_count(pvec)) {
702 unsigned long flags; 713 unsigned long flags;
703 714
704 /* No harm done if a racing interrupt already did this */ 715 /* No harm done if a racing interrupt already did this */
705 local_irq_save(flags); 716 local_irq_save(flags);
706 pagevec_move_tail(pvec); 717 pagevec_move_tail(pvec);
707 local_irq_restore(flags); 718 local_irq_restore(flags);
708 } 719 }
709 720
710 pvec = &per_cpu(lru_deactivate_pvecs, cpu); 721 pvec = &per_cpu(lru_deactivate_pvecs, cpu);
711 if (pagevec_count(pvec)) 722 if (pagevec_count(pvec))
712 pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); 723 pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
713 724
714 activate_page_drain(cpu); 725 activate_page_drain(cpu);
715 } 726 }
716 727
717 /** 728 /**
718 * deactivate_page - forcefully deactivate a page 729 * deactivate_page - forcefully deactivate a page
719 * @page: page to deactivate 730 * @page: page to deactivate
720 * 731 *
721 * This function hints the VM that @page is a good reclaim candidate, 732 * This function hints the VM that @page is a good reclaim candidate,
722 * for example if its invalidation fails due to the page being dirty 733 * for example if its invalidation fails due to the page being dirty
723 * or under writeback. 734 * or under writeback.
724 */ 735 */
725 void deactivate_page(struct page *page) 736 void deactivate_page(struct page *page)
726 { 737 {
727 /* 738 /*
728 * In a workload with many unevictable pages (such as mprotect'd memory), 739 * In a workload with many unevictable pages (such as mprotect'd memory),
729 * deactivating unevictable pages to accelerate reclaim is pointless. 740 * deactivating unevictable pages to accelerate reclaim is pointless.
730 */ 741 */
731 if (PageUnevictable(page)) 742 if (PageUnevictable(page))
732 return; 743 return;
733 744
734 if (likely(get_page_unless_zero(page))) { 745 if (likely(get_page_unless_zero(page))) {
735 struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs); 746 struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
736 747
737 if (!pagevec_add(pvec, page)) 748 if (!pagevec_add(pvec, page))
738 pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); 749 pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
739 put_cpu_var(lru_deactivate_pvecs); 750 put_cpu_var(lru_deactivate_pvecs);
740 } 751 }
741 } 752 }
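A hedged sketch of the typical caller, modelled on the invalidation path: if an invalidation attempt fails because the page is dirty or under writeback, the page is deactivated instead. invalidate_inode_page() is the existing helper from mm/truncate.c; the wrapper itself is illustrative.

	static void example_try_to_drop(struct page *page)
	{
		if (!trylock_page(page))
			return;
		if (!invalidate_inode_page(page))
			deactivate_page(page);	/* hint: good reclaim candidate */
		unlock_page(page);
	}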
742 753
743 void lru_add_drain(void) 754 void lru_add_drain(void)
744 { 755 {
745 lru_add_drain_cpu(get_cpu()); 756 lru_add_drain_cpu(get_cpu());
746 put_cpu(); 757 put_cpu();
747 } 758 }
748 759
749 static void lru_add_drain_per_cpu(struct work_struct *dummy) 760 static void lru_add_drain_per_cpu(struct work_struct *dummy)
750 { 761 {
751 lru_add_drain(); 762 lru_add_drain();
752 } 763 }
753 764
754 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work); 765 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
755 766
756 void lru_add_drain_all(void) 767 void lru_add_drain_all(void)
757 { 768 {
758 static DEFINE_MUTEX(lock); 769 static DEFINE_MUTEX(lock);
759 static struct cpumask has_work; 770 static struct cpumask has_work;
760 int cpu; 771 int cpu;
761 772
762 mutex_lock(&lock); 773 mutex_lock(&lock);
763 get_online_cpus(); 774 get_online_cpus();
764 cpumask_clear(&has_work); 775 cpumask_clear(&has_work);
765 776
766 for_each_online_cpu(cpu) { 777 for_each_online_cpu(cpu) {
767 struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); 778 struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
768 779
769 if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) || 780 if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
770 pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) || 781 pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
771 pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) || 782 pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
772 need_activate_page_drain(cpu)) { 783 need_activate_page_drain(cpu)) {
773 INIT_WORK(work, lru_add_drain_per_cpu); 784 INIT_WORK(work, lru_add_drain_per_cpu);
774 schedule_work_on(cpu, work); 785 schedule_work_on(cpu, work);
775 cpumask_set_cpu(cpu, &has_work); 786 cpumask_set_cpu(cpu, &has_work);
776 } 787 }
777 } 788 }
778 789
779 for_each_cpu(cpu, &has_work) 790 for_each_cpu(cpu, &has_work)
780 flush_work(&per_cpu(lru_add_drain_work, cpu)); 791 flush_work(&per_cpu(lru_add_drain_work, cpu));
781 792
782 put_online_cpus(); 793 put_online_cpus();
783 mutex_unlock(&lock); 794 mutex_unlock(&lock);
784 } 795 }
785 796
786 /* 797 /*
787 * Batched page_cache_release(). Decrement the reference count on all the 798 * Batched page_cache_release(). Decrement the reference count on all the
788 * passed pages. If it fell to zero then remove the page from the LRU and 799 * passed pages. If it fell to zero then remove the page from the LRU and
789 * free it. 800 * free it.
790 * 801 *
791 * Avoid taking zone->lru_lock if possible, but if it is taken, retain it 802 * Avoid taking zone->lru_lock if possible, but if it is taken, retain it
792 * for the remainder of the operation. 803 * for the remainder of the operation.
793 * 804 *
794 * The locking in this function is against shrink_inactive_list(): we recheck 805 * The locking in this function is against shrink_inactive_list(): we recheck
795 * the page count inside the lock to see whether shrink_inactive_list() 806 * the page count inside the lock to see whether shrink_inactive_list()
796 * grabbed the page via the LRU. If it did, give up: shrink_inactive_list() 807 * grabbed the page via the LRU. If it did, give up: shrink_inactive_list()
797 * will free it. 808 * will free it.
798 */ 809 */
799 void release_pages(struct page **pages, int nr, bool cold) 810 void release_pages(struct page **pages, int nr, bool cold)
800 { 811 {
801 int i; 812 int i;
802 LIST_HEAD(pages_to_free); 813 LIST_HEAD(pages_to_free);
803 struct zone *zone = NULL; 814 struct zone *zone = NULL;
804 struct lruvec *lruvec; 815 struct lruvec *lruvec;
805 unsigned long uninitialized_var(flags); 816 unsigned long uninitialized_var(flags);
806 817
807 for (i = 0; i < nr; i++) { 818 for (i = 0; i < nr; i++) {
808 struct page *page = pages[i]; 819 struct page *page = pages[i];
809 820
810 if (unlikely(PageCompound(page))) { 821 if (unlikely(PageCompound(page))) {
811 if (zone) { 822 if (zone) {
812 spin_unlock_irqrestore(&zone->lru_lock, flags); 823 spin_unlock_irqrestore(&zone->lru_lock, flags);
813 zone = NULL; 824 zone = NULL;
814 } 825 }
815 put_compound_page(page); 826 put_compound_page(page);
816 continue; 827 continue;
817 } 828 }
818 829
819 if (!put_page_testzero(page)) 830 if (!put_page_testzero(page))
820 continue; 831 continue;
821 832
822 if (PageLRU(page)) { 833 if (PageLRU(page)) {
823 struct zone *pagezone = page_zone(page); 834 struct zone *pagezone = page_zone(page);
824 835
825 if (pagezone != zone) { 836 if (pagezone != zone) {
826 if (zone) 837 if (zone)
827 spin_unlock_irqrestore(&zone->lru_lock, 838 spin_unlock_irqrestore(&zone->lru_lock,
828 flags); 839 flags);
829 zone = pagezone; 840 zone = pagezone;
830 spin_lock_irqsave(&zone->lru_lock, flags); 841 spin_lock_irqsave(&zone->lru_lock, flags);
831 } 842 }
832 843
833 lruvec = mem_cgroup_page_lruvec(page, zone); 844 lruvec = mem_cgroup_page_lruvec(page, zone);
834 VM_BUG_ON(!PageLRU(page)); 845 VM_BUG_ON(!PageLRU(page));
835 __ClearPageLRU(page); 846 __ClearPageLRU(page);
836 del_page_from_lru_list(page, lruvec, page_off_lru(page)); 847 del_page_from_lru_list(page, lruvec, page_off_lru(page));
837 } 848 }
838 849
839 /* Clear Active bit in case of parallel mark_page_accessed */ 850 /* Clear Active bit in case of parallel mark_page_accessed */
840 __ClearPageActive(page); 851 __ClearPageActive(page);
841 852
842 list_add(&page->lru, &pages_to_free); 853 list_add(&page->lru, &pages_to_free);
843 } 854 }
844 if (zone) 855 if (zone)
845 spin_unlock_irqrestore(&zone->lru_lock, flags); 856 spin_unlock_irqrestore(&zone->lru_lock, flags);
846 857
847 free_hot_cold_page_list(&pages_to_free, cold); 858 free_hot_cold_page_list(&pages_to_free, cold);
848 } 859 }
849 EXPORT_SYMBOL(release_pages); 860 EXPORT_SYMBOL(release_pages);
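As a usage note, a caller holding an array of page references can drop them in one pass; the zone lru_lock is then taken once per run of same-zone pages rather than once per page. A minimal sketch (the wrapper name is illustrative):

	static void example_put_batch(struct page **pages, int nr)
	{
		/* equivalent to nr page_cache_release() calls, but batched */
		release_pages(pages, nr, false);	/* false: treat the pages as cache-hot */
	}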
850 861
851 /* 862 /*
852 * The pages which we're about to release may be in the deferred lru-addition 863 * The pages which we're about to release may be in the deferred lru-addition
853 * queues. That would prevent them from really being freed right now. That's 864 * queues. That would prevent them from really being freed right now. That's
854 * OK from a correctness point of view but is inefficient - those pages may be 865 * OK from a correctness point of view but is inefficient - those pages may be
855 * cache-warm and we want to give them back to the page allocator ASAP. 866 * cache-warm and we want to give them back to the page allocator ASAP.
856 * 867 *
857 * So __pagevec_release() will drain those queues here. __pagevec_lru_add() 868 * So __pagevec_release() will drain those queues here. __pagevec_lru_add()
858 * and __pagevec_lru_add_active() call release_pages() directly to avoid 869 * and __pagevec_lru_add_active() call release_pages() directly to avoid
859 * mutual recursion. 870 * mutual recursion.
860 */ 871 */
861 void __pagevec_release(struct pagevec *pvec) 872 void __pagevec_release(struct pagevec *pvec)
862 { 873 {
863 lru_add_drain(); 874 lru_add_drain();
864 release_pages(pvec->pages, pagevec_count(pvec), pvec->cold); 875 release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
865 pagevec_reinit(pvec); 876 pagevec_reinit(pvec);
866 } 877 }
867 EXPORT_SYMBOL(__pagevec_release); 878 EXPORT_SYMBOL(__pagevec_release);
868 879
869 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 880 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
870 /* used by __split_huge_page_refcount() */ 881 /* used by __split_huge_page_refcount() */
871 void lru_add_page_tail(struct page *page, struct page *page_tail, 882 void lru_add_page_tail(struct page *page, struct page *page_tail,
872 struct lruvec *lruvec, struct list_head *list) 883 struct lruvec *lruvec, struct list_head *list)
873 { 884 {
874 const int file = 0; 885 const int file = 0;
875 886
876 VM_BUG_ON(!PageHead(page)); 887 VM_BUG_ON(!PageHead(page));
877 VM_BUG_ON(PageCompound(page_tail)); 888 VM_BUG_ON(PageCompound(page_tail));
878 VM_BUG_ON(PageLRU(page_tail)); 889 VM_BUG_ON(PageLRU(page_tail));
879 VM_BUG_ON(NR_CPUS != 1 && 890 VM_BUG_ON(NR_CPUS != 1 &&
880 !spin_is_locked(&lruvec_zone(lruvec)->lru_lock)); 891 !spin_is_locked(&lruvec_zone(lruvec)->lru_lock));
881 892
882 if (!list) 893 if (!list)
883 SetPageLRU(page_tail); 894 SetPageLRU(page_tail);
884 895
885 if (likely(PageLRU(page))) 896 if (likely(PageLRU(page)))
886 list_add_tail(&page_tail->lru, &page->lru); 897 list_add_tail(&page_tail->lru, &page->lru);
887 else if (list) { 898 else if (list) {
888 /* page reclaim is reclaiming a huge page */ 899 /* page reclaim is reclaiming a huge page */
889 get_page(page_tail); 900 get_page(page_tail);
890 list_add_tail(&page_tail->lru, list); 901 list_add_tail(&page_tail->lru, list);
891 } else { 902 } else {
892 struct list_head *list_head; 903 struct list_head *list_head;
893 /* 904 /*
894 * Head page has not yet been counted, as an hpage, 905 * Head page has not yet been counted, as an hpage,
895 * so we must account for each subpage individually. 906 * so we must account for each subpage individually.
896 * 907 *
897 * Use the standard add function to put page_tail on the list, 908 * Use the standard add function to put page_tail on the list,
898 * but then correct its position so they all end up in order. 909 * but then correct its position so they all end up in order.
899 */ 910 */
900 add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail)); 911 add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail));
901 list_head = page_tail->lru.prev; 912 list_head = page_tail->lru.prev;
902 list_move_tail(&page_tail->lru, list_head); 913 list_move_tail(&page_tail->lru, list_head);
903 } 914 }
904 915
905 if (!PageUnevictable(page)) 916 if (!PageUnevictable(page))
906 update_page_reclaim_stat(lruvec, file, PageActive(page_tail)); 917 update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
907 } 918 }
908 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 919 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
909 920
910 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, 921 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
911 void *arg) 922 void *arg)
912 { 923 {
913 int file = page_is_file_cache(page); 924 int file = page_is_file_cache(page);
914 int active = PageActive(page); 925 int active = PageActive(page);
915 enum lru_list lru = page_lru(page); 926 enum lru_list lru = page_lru(page);
916 927
917 VM_BUG_ON(PageLRU(page)); 928 VM_BUG_ON(PageLRU(page));
918 929
919 SetPageLRU(page); 930 SetPageLRU(page);
920 add_page_to_lru_list(page, lruvec, lru); 931 add_page_to_lru_list(page, lruvec, lru);
921 update_page_reclaim_stat(lruvec, file, active); 932 update_page_reclaim_stat(lruvec, file, active);
922 trace_mm_lru_insertion(page, page_to_pfn(page), lru, trace_pagemap_flags(page)); 933 trace_mm_lru_insertion(page, page_to_pfn(page), lru, trace_pagemap_flags(page));
923 } 934 }
924 935
925 /* 936 /*
926 * Add the passed pages to the LRU, then drop the caller's refcount 937 * Add the passed pages to the LRU, then drop the caller's refcount
927 * on them. Reinitialises the caller's pagevec. 938 * on them. Reinitialises the caller's pagevec.
928 */ 939 */
929 void __pagevec_lru_add(struct pagevec *pvec) 940 void __pagevec_lru_add(struct pagevec *pvec)
930 { 941 {
931 pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL); 942 pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
932 } 943 }
933 EXPORT_SYMBOL(__pagevec_lru_add); 944 EXPORT_SYMBOL(__pagevec_lru_add);
934 945
935 /** 946 /**
936 * pagevec_lookup_entries - gang pagecache lookup 947 * pagevec_lookup_entries - gang pagecache lookup
937 * @pvec: Where the resulting entries are placed 948 * @pvec: Where the resulting entries are placed
938 * @mapping: The address_space to search 949 * @mapping: The address_space to search
939 * @start: The starting entry index 950 * @start: The starting entry index
940 * @nr_pages: The maximum number of entries to return 951 * @nr_pages: The maximum number of entries to return
941 * @indices: The cache indices corresponding to the entries in @pvec 952 * @indices: The cache indices corresponding to the entries in @pvec
942 * 953 *
943 * pagevec_lookup_entries() will search for and return a group of up 954 * pagevec_lookup_entries() will search for and return a group of up
944 * to @nr_pages pages and shadow entries in the mapping. All 955 * to @nr_pages pages and shadow entries in the mapping. All
945 * entries are placed in @pvec. pagevec_lookup_entries() takes a 956 * entries are placed in @pvec. pagevec_lookup_entries() takes a
946 * reference against actual pages in @pvec. 957 * reference against actual pages in @pvec.
947 * 958 *
948 * The search returns a group of mapping-contiguous entries with 959 * The search returns a group of mapping-contiguous entries with
949 * ascending indexes. There may be holes in the indices due to 960 * ascending indexes. There may be holes in the indices due to
950 * not-present entries. 961 * not-present entries.
951 * 962 *
952 * pagevec_lookup_entries() returns the number of entries which were 963 * pagevec_lookup_entries() returns the number of entries which were
953 * found. 964 * found.
954 */ 965 */
955 unsigned pagevec_lookup_entries(struct pagevec *pvec, 966 unsigned pagevec_lookup_entries(struct pagevec *pvec,
956 struct address_space *mapping, 967 struct address_space *mapping,
957 pgoff_t start, unsigned nr_pages, 968 pgoff_t start, unsigned nr_pages,
958 pgoff_t *indices) 969 pgoff_t *indices)
959 { 970 {
960 pvec->nr = find_get_entries(mapping, start, nr_pages, 971 pvec->nr = find_get_entries(mapping, start, nr_pages,
961 pvec->pages, indices); 972 pvec->pages, indices);
962 return pagevec_count(pvec); 973 return pagevec_count(pvec);
963 } 974 }
964 975
965 /** 976 /**
966 * pagevec_remove_exceptionals - pagevec exceptionals pruning 977 * pagevec_remove_exceptionals - pagevec exceptionals pruning
967 * @pvec: The pagevec to prune 978 * @pvec: The pagevec to prune
968 * 979 *
969 * pagevec_lookup_entries() fills both pages and exceptional radix 980 * pagevec_lookup_entries() fills both pages and exceptional radix
970 * tree entries into the pagevec. This function prunes all 981 * tree entries into the pagevec. This function prunes all
971 * exceptionals from @pvec without leaving holes, so that it can be 982 * exceptionals from @pvec without leaving holes, so that it can be
972 * passed on to page-only pagevec operations. 983 * passed on to page-only pagevec operations.
973 */ 984 */
974 void pagevec_remove_exceptionals(struct pagevec *pvec) 985 void pagevec_remove_exceptionals(struct pagevec *pvec)
975 { 986 {
976 int i, j; 987 int i, j;
977 988
978 for (i = 0, j = 0; i < pagevec_count(pvec); i++) { 989 for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
979 struct page *page = pvec->pages[i]; 990 struct page *page = pvec->pages[i];
980 if (!radix_tree_exceptional_entry(page)) 991 if (!radix_tree_exceptional_entry(page))
981 pvec->pages[j++] = page; 992 pvec->pages[j++] = page;
982 } 993 }
983 pvec->nr = j; 994 pvec->nr = j;
984 } 995 }
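A minimal sketch of how the two helpers above combine, loosely following the truncate/invalidate pattern; the scanner and its bookkeeping are illustrative. Shadow entries are pruned so that only real pages, which hold references, reach pagevec_release().

	static void example_scan(struct address_space *mapping)
	{
		pgoff_t indices[PAGEVEC_SIZE];
		struct pagevec pvec;
		pgoff_t index = 0;

		pagevec_init(&pvec, 0);
		while (pagevec_lookup_entries(&pvec, mapping, index,
					      PAGEVEC_SIZE, indices)) {
			/* continue after the last entry seen in this batch */
			index = indices[pagevec_count(&pvec) - 1] + 1;

			/* drop shadow entries; only real pages hold a reference */
			pagevec_remove_exceptionals(&pvec);
			pagevec_release(&pvec);
			cond_resched();
		}
	}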
985 996
986 /** 997 /**
987 * pagevec_lookup - gang pagecache lookup 998 * pagevec_lookup - gang pagecache lookup
988 * @pvec: Where the resulting pages are placed 999 * @pvec: Where the resulting pages are placed
989 * @mapping: The address_space to search 1000 * @mapping: The address_space to search
990 * @start: The starting page index 1001 * @start: The starting page index
991 * @nr_pages: The maximum number of pages 1002 * @nr_pages: The maximum number of pages
992 * 1003 *
993 * pagevec_lookup() will search for and return a group of up to @nr_pages pages 1004 * pagevec_lookup() will search for and return a group of up to @nr_pages pages
994 * in the mapping. The pages are placed in @pvec. pagevec_lookup() takes a 1005 * in the mapping. The pages are placed in @pvec. pagevec_lookup() takes a
995 * reference against the pages in @pvec. 1006 * reference against the pages in @pvec.
996 * 1007 *
997 * The search returns a group of mapping-contiguous pages with ascending 1008 * The search returns a group of mapping-contiguous pages with ascending
998 * indexes. There may be holes in the indices due to not-present pages. 1009 * indexes. There may be holes in the indices due to not-present pages.
999 * 1010 *
1000 * pagevec_lookup() returns the number of pages which were found. 1011 * pagevec_lookup() returns the number of pages which were found.
1001 */ 1012 */
1002 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping, 1013 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
1003 pgoff_t start, unsigned nr_pages) 1014 pgoff_t start, unsigned nr_pages)
1004 { 1015 {
1005 pvec->nr = find_get_pages(mapping, start, nr_pages, pvec->pages); 1016 pvec->nr = find_get_pages(mapping, start, nr_pages, pvec->pages);
1006 return pagevec_count(pvec); 1017 return pagevec_count(pvec);
1007 } 1018 }
1008 EXPORT_SYMBOL(pagevec_lookup); 1019 EXPORT_SYMBOL(pagevec_lookup);
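A minimal usage sketch; the walker and example_visit_page() are assumed, not part of this file. Up to PAGEVEC_SIZE pages are gathered per iteration, the index is advanced past the last page seen, and the lookup references are dropped with pagevec_release().

	static void example_walk_mapping(struct address_space *mapping)
	{
		struct pagevec pvec;
		pgoff_t index = 0;
		int i;

		pagevec_init(&pvec, 0);
		while (pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE)) {
			for (i = 0; i < pagevec_count(&pvec); i++) {
				struct page *page = pvec.pages[i];

				index = page->index + 1;
				example_visit_page(page);	/* assumed per-page work */
			}
			pagevec_release(&pvec);	/* drop the lookup references */
			cond_resched();
		}
	}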
1009 1020
1010 unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping, 1021 unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
1011 pgoff_t *index, int tag, unsigned nr_pages) 1022 pgoff_t *index, int tag, unsigned nr_pages)
1012 { 1023 {
1013 pvec->nr = find_get_pages_tag(mapping, index, tag, 1024 pvec->nr = find_get_pages_tag(mapping, index, tag,
1014 nr_pages, pvec->pages); 1025 nr_pages, pvec->pages);
1015 return pagevec_count(pvec); 1026 return pagevec_count(pvec);
1016 } 1027 }
1017 EXPORT_SYMBOL(pagevec_lookup_tag); 1028 EXPORT_SYMBOL(pagevec_lookup_tag);
1018 1029
1019 /* 1030 /*
1020 * Perform any setup for the swap system 1031 * Perform any setup for the swap system
1021 */ 1032 */
1022 void __init swap_setup(void) 1033 void __init swap_setup(void)
1023 { 1034 {
1024 unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT); 1035 unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT);
1025 #ifdef CONFIG_SWAP 1036 #ifdef CONFIG_SWAP
1026 int i; 1037 int i;
1027 1038
1028 bdi_init(swapper_spaces[0].backing_dev_info); 1039 bdi_init(swapper_spaces[0].backing_dev_info);
1029 for (i = 0; i < MAX_SWAPFILES; i++) { 1040 for (i = 0; i < MAX_SWAPFILES; i++) {
1030 spin_lock_init(&swapper_spaces[i].tree_lock); 1041 spin_lock_init(&swapper_spaces[i].tree_lock);
1031 INIT_LIST_HEAD(&swapper_spaces[i].i_mmap_nonlinear); 1042 INIT_LIST_HEAD(&swapper_spaces[i].i_mmap_nonlinear);
1032 } 1043 }
1033 #endif 1044 #endif
1034 1045
1035 /* Use a smaller cluster for small-memory machines */ 1046 /* Use a smaller cluster for small-memory machines */
1036 if (megs < 16) 1047 if (megs < 16)
1037 page_cluster = 2; 1048 page_cluster = 2;
1038 else 1049 else
1039 page_cluster = 3; 1050 page_cluster = 3;
1040 /* 1051 /*
1041 * Right now other parts of the system mean that we 1052 * Right now other parts of the system mean that we
1042 * _really_ don't want to cluster much more 1053 * _really_ don't want to cluster much more
1043 */ 1054 */
1044 } 1055 }
1045 1056