Commit d618a27c7808608e376de803a4fd3940f33776c2
Committed by: Jiri Slaby
Parent: 967e64285a (1 parent)
Exists in: ti-linux-3.12.y and 2 other branches
mm: non-atomically mark page accessed during page cache allocation where possible
commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream.

aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary, which is noticeable overhead when writing to an in-memory filesystem like tmpfs, but it should also be noticeable with fast storage. The objective of the patch is to initialise the accessed information with non-atomic operations before the page is visible.

The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may be called before the page is visible and can be done non-atomically.

The primary APIs of concern in this case are the following and are used by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail, so the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behaviour, such as whether the page should be marked accessed or not. The old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems is then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job.

There is a slight snag in that the timing of the mark_page_accessed() has now changed, so in rare cases it's possible a page gets to the end of the LRU as PageReferenced whereas previously it might have been repromoted. This is expected to be rare, but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not, but it makes sense that filesystems have consistent behaviour in this regard.

The test case used to evaluate this is a simple dd of a large file done multiple times with the file deleted on each iteration. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it is possible that the workload completes without even hitting the disk and will have variable results, but it highlights the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs, where the normal case is for the "IO" to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO; only wall times are shown for async, as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable due to writeback timings, only the maximum figures are reported. The sync results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running.

async dd
                                3.15.0-rc3            3.15.0-rc3
                                   vanilla           accessed-v2
ext3    Max   elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
tmpfs   Max   elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max   elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
ext4    Max   elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
xfs     Max   elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck, but the average figures looked reasonable.
        samples percentage
ext3    86107    0.9783  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
ext3    23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
ext3     5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
ext4    64566    0.8961  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
ext4     5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
ext4     2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
xfs     62126    1.7675  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
xfs      1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
xfs       103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
btrfs   10655    0.1338  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
btrfs    2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
btrfs     587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
tmpfs   59562    3.2628  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
tmpfs    1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
tmpfs      94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
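The pattern the commit message describes is compact enough to sketch. The following is an illustrative, simplified C sketch (not the exact upstream code) of the shape of the change: a single flags-driven page cache lookup helper, a non-atomic accessed-bit initialisation on the allocation path while the page is still invisible to other CPUs, and the old API kept as a thin wrapper. The FGP_* flag names follow the naming introduced by this patch, but the flag values, the *_sketch function names, and the bodies below are assumptions for illustration only.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/swap.h>

#define FGP_ACCESSED	0x00000001	/* mark the page accessed */
#define FGP_LOCK	0x00000002	/* return the page locked */
#define FGP_CREAT	0x00000004	/* allocate the page if not present */

/* Simplified core helper in the spirit of pagecache_get_page(). */
static struct page *pagecache_get_page_sketch(struct address_space *mapping,
					      pgoff_t index, int fgp_flags,
					      gfp_t gfp_mask)
{
	struct page *page;

	page = find_get_page(mapping, index);
	if (page) {
		if (fgp_flags & FGP_LOCK)
			lock_page(page);
		if (fgp_flags & FGP_ACCESSED)
			mark_page_accessed(page);	/* visible page: atomic ops needed */
		return page;
	}

	if (!(fgp_flags & FGP_CREAT))
		return NULL;

	page = __page_cache_alloc(gfp_mask);
	if (!page)
		return NULL;

	/*
	 * The page is not yet in the page cache, so no other CPU can see
	 * it: the accessed information can be set with a plain store.
	 */
	if (fgp_flags & FGP_ACCESSED)
		init_page_accessed(page);

	if (add_to_page_cache_lru(page, mapping, index, gfp_mask)) {
		page_cache_release(page);
		return NULL;
	}
	return page;
}

/* The old entry points remain, as thin wrappers around the core helper. */
static struct page *find_or_create_page_sketch(struct address_space *mapping,
					       pgoff_t index, gfp_t gfp_mask)
{
	return pagecache_get_page_sketch(mapping, index,
					 FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
					 gfp_mask);
}

Because init_page_accessed() runs before add_to_page_cache_lru() publishes the page, the accessed state is set with a non-atomic store, and filesystems whose ->write_begin already goes through the helper can drop their own mark_page_accessed() call, which is where the overhead visible in the profiles above goes away.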
Showing 17 changed files with 219 additions and 162 deletions

Inline diff:
fs/btrfs/extent_io.c
1 | #include <linux/bitops.h> | 1 | #include <linux/bitops.h> |
2 | #include <linux/slab.h> | 2 | #include <linux/slab.h> |
3 | #include <linux/bio.h> | 3 | #include <linux/bio.h> |
4 | #include <linux/mm.h> | 4 | #include <linux/mm.h> |
5 | #include <linux/pagemap.h> | 5 | #include <linux/pagemap.h> |
6 | #include <linux/page-flags.h> | 6 | #include <linux/page-flags.h> |
7 | #include <linux/spinlock.h> | 7 | #include <linux/spinlock.h> |
8 | #include <linux/blkdev.h> | 8 | #include <linux/blkdev.h> |
9 | #include <linux/swap.h> | 9 | #include <linux/swap.h> |
10 | #include <linux/writeback.h> | 10 | #include <linux/writeback.h> |
11 | #include <linux/pagevec.h> | 11 | #include <linux/pagevec.h> |
12 | #include <linux/prefetch.h> | 12 | #include <linux/prefetch.h> |
13 | #include <linux/cleancache.h> | 13 | #include <linux/cleancache.h> |
14 | #include "extent_io.h" | 14 | #include "extent_io.h" |
15 | #include "extent_map.h" | 15 | #include "extent_map.h" |
16 | #include "compat.h" | 16 | #include "compat.h" |
17 | #include "ctree.h" | 17 | #include "ctree.h" |
18 | #include "btrfs_inode.h" | 18 | #include "btrfs_inode.h" |
19 | #include "volumes.h" | 19 | #include "volumes.h" |
20 | #include "check-integrity.h" | 20 | #include "check-integrity.h" |
21 | #include "locking.h" | 21 | #include "locking.h" |
22 | #include "rcu-string.h" | 22 | #include "rcu-string.h" |
23 | 23 | ||
24 | static struct kmem_cache *extent_state_cache; | 24 | static struct kmem_cache *extent_state_cache; |
25 | static struct kmem_cache *extent_buffer_cache; | 25 | static struct kmem_cache *extent_buffer_cache; |
26 | static struct bio_set *btrfs_bioset; | 26 | static struct bio_set *btrfs_bioset; |
27 | 27 | ||
28 | #ifdef CONFIG_BTRFS_DEBUG | 28 | #ifdef CONFIG_BTRFS_DEBUG |
29 | static LIST_HEAD(buffers); | 29 | static LIST_HEAD(buffers); |
30 | static LIST_HEAD(states); | 30 | static LIST_HEAD(states); |
31 | 31 | ||
32 | static DEFINE_SPINLOCK(leak_lock); | 32 | static DEFINE_SPINLOCK(leak_lock); |
33 | 33 | ||
34 | static inline | 34 | static inline |
35 | void btrfs_leak_debug_add(struct list_head *new, struct list_head *head) | 35 | void btrfs_leak_debug_add(struct list_head *new, struct list_head *head) |
36 | { | 36 | { |
37 | unsigned long flags; | 37 | unsigned long flags; |
38 | 38 | ||
39 | spin_lock_irqsave(&leak_lock, flags); | 39 | spin_lock_irqsave(&leak_lock, flags); |
40 | list_add(new, head); | 40 | list_add(new, head); |
41 | spin_unlock_irqrestore(&leak_lock, flags); | 41 | spin_unlock_irqrestore(&leak_lock, flags); |
42 | } | 42 | } |
43 | 43 | ||
44 | static inline | 44 | static inline |
45 | void btrfs_leak_debug_del(struct list_head *entry) | 45 | void btrfs_leak_debug_del(struct list_head *entry) |
46 | { | 46 | { |
47 | unsigned long flags; | 47 | unsigned long flags; |
48 | 48 | ||
49 | spin_lock_irqsave(&leak_lock, flags); | 49 | spin_lock_irqsave(&leak_lock, flags); |
50 | list_del(entry); | 50 | list_del(entry); |
51 | spin_unlock_irqrestore(&leak_lock, flags); | 51 | spin_unlock_irqrestore(&leak_lock, flags); |
52 | } | 52 | } |
53 | 53 | ||
54 | static inline | 54 | static inline |
55 | void btrfs_leak_debug_check(void) | 55 | void btrfs_leak_debug_check(void) |
56 | { | 56 | { |
57 | struct extent_state *state; | 57 | struct extent_state *state; |
58 | struct extent_buffer *eb; | 58 | struct extent_buffer *eb; |
59 | 59 | ||
60 | while (!list_empty(&states)) { | 60 | while (!list_empty(&states)) { |
61 | state = list_entry(states.next, struct extent_state, leak_list); | 61 | state = list_entry(states.next, struct extent_state, leak_list); |
62 | printk(KERN_ERR "btrfs state leak: start %llu end %llu " | 62 | printk(KERN_ERR "btrfs state leak: start %llu end %llu " |
63 | "state %lu in tree %p refs %d\n", | 63 | "state %lu in tree %p refs %d\n", |
64 | state->start, state->end, state->state, state->tree, | 64 | state->start, state->end, state->state, state->tree, |
65 | atomic_read(&state->refs)); | 65 | atomic_read(&state->refs)); |
66 | list_del(&state->leak_list); | 66 | list_del(&state->leak_list); |
67 | kmem_cache_free(extent_state_cache, state); | 67 | kmem_cache_free(extent_state_cache, state); |
68 | } | 68 | } |
69 | 69 | ||
70 | while (!list_empty(&buffers)) { | 70 | while (!list_empty(&buffers)) { |
71 | eb = list_entry(buffers.next, struct extent_buffer, leak_list); | 71 | eb = list_entry(buffers.next, struct extent_buffer, leak_list); |
72 | printk(KERN_ERR "btrfs buffer leak start %llu len %lu " | 72 | printk(KERN_ERR "btrfs buffer leak start %llu len %lu " |
73 | "refs %d\n", | 73 | "refs %d\n", |
74 | eb->start, eb->len, atomic_read(&eb->refs)); | 74 | eb->start, eb->len, atomic_read(&eb->refs)); |
75 | list_del(&eb->leak_list); | 75 | list_del(&eb->leak_list); |
76 | kmem_cache_free(extent_buffer_cache, eb); | 76 | kmem_cache_free(extent_buffer_cache, eb); |
77 | } | 77 | } |
78 | } | 78 | } |
79 | 79 | ||
80 | #define btrfs_debug_check_extent_io_range(inode, start, end) \ | 80 | #define btrfs_debug_check_extent_io_range(inode, start, end) \ |
81 | __btrfs_debug_check_extent_io_range(__func__, (inode), (start), (end)) | 81 | __btrfs_debug_check_extent_io_range(__func__, (inode), (start), (end)) |
82 | static inline void __btrfs_debug_check_extent_io_range(const char *caller, | 82 | static inline void __btrfs_debug_check_extent_io_range(const char *caller, |
83 | struct inode *inode, u64 start, u64 end) | 83 | struct inode *inode, u64 start, u64 end) |
84 | { | 84 | { |
85 | u64 isize = i_size_read(inode); | 85 | u64 isize = i_size_read(inode); |
86 | 86 | ||
87 | if (end >= PAGE_SIZE && (end % 2) == 0 && end != isize - 1) { | 87 | if (end >= PAGE_SIZE && (end % 2) == 0 && end != isize - 1) { |
88 | printk_ratelimited(KERN_DEBUG | 88 | printk_ratelimited(KERN_DEBUG |
89 | "btrfs: %s: ino %llu isize %llu odd range [%llu,%llu]\n", | 89 | "btrfs: %s: ino %llu isize %llu odd range [%llu,%llu]\n", |
90 | caller, btrfs_ino(inode), isize, start, end); | 90 | caller, btrfs_ino(inode), isize, start, end); |
91 | } | 91 | } |
92 | } | 92 | } |
93 | #else | 93 | #else |
94 | #define btrfs_leak_debug_add(new, head) do {} while (0) | 94 | #define btrfs_leak_debug_add(new, head) do {} while (0) |
95 | #define btrfs_leak_debug_del(entry) do {} while (0) | 95 | #define btrfs_leak_debug_del(entry) do {} while (0) |
96 | #define btrfs_leak_debug_check() do {} while (0) | 96 | #define btrfs_leak_debug_check() do {} while (0) |
97 | #define btrfs_debug_check_extent_io_range(c, s, e) do {} while (0) | 97 | #define btrfs_debug_check_extent_io_range(c, s, e) do {} while (0) |
98 | #endif | 98 | #endif |
99 | 99 | ||
100 | #define BUFFER_LRU_MAX 64 | 100 | #define BUFFER_LRU_MAX 64 |
101 | 101 | ||
102 | struct tree_entry { | 102 | struct tree_entry { |
103 | u64 start; | 103 | u64 start; |
104 | u64 end; | 104 | u64 end; |
105 | struct rb_node rb_node; | 105 | struct rb_node rb_node; |
106 | }; | 106 | }; |
107 | 107 | ||
108 | struct extent_page_data { | 108 | struct extent_page_data { |
109 | struct bio *bio; | 109 | struct bio *bio; |
110 | struct extent_io_tree *tree; | 110 | struct extent_io_tree *tree; |
111 | get_extent_t *get_extent; | 111 | get_extent_t *get_extent; |
112 | unsigned long bio_flags; | 112 | unsigned long bio_flags; |
113 | 113 | ||
114 | /* tells writepage not to lock the state bits for this range | 114 | /* tells writepage not to lock the state bits for this range |
115 | * it still does the unlocking | 115 | * it still does the unlocking |
116 | */ | 116 | */ |
117 | unsigned int extent_locked:1; | 117 | unsigned int extent_locked:1; |
118 | 118 | ||
119 | /* tells the submit_bio code to use a WRITE_SYNC */ | 119 | /* tells the submit_bio code to use a WRITE_SYNC */ |
120 | unsigned int sync_io:1; | 120 | unsigned int sync_io:1; |
121 | }; | 121 | }; |
122 | 122 | ||
123 | static noinline void flush_write_bio(void *data); | 123 | static noinline void flush_write_bio(void *data); |
124 | static inline struct btrfs_fs_info * | 124 | static inline struct btrfs_fs_info * |
125 | tree_fs_info(struct extent_io_tree *tree) | 125 | tree_fs_info(struct extent_io_tree *tree) |
126 | { | 126 | { |
127 | return btrfs_sb(tree->mapping->host->i_sb); | 127 | return btrfs_sb(tree->mapping->host->i_sb); |
128 | } | 128 | } |
129 | 129 | ||
130 | int __init extent_io_init(void) | 130 | int __init extent_io_init(void) |
131 | { | 131 | { |
132 | extent_state_cache = kmem_cache_create("btrfs_extent_state", | 132 | extent_state_cache = kmem_cache_create("btrfs_extent_state", |
133 | sizeof(struct extent_state), 0, | 133 | sizeof(struct extent_state), 0, |
134 | SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL); | 134 | SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL); |
135 | if (!extent_state_cache) | 135 | if (!extent_state_cache) |
136 | return -ENOMEM; | 136 | return -ENOMEM; |
137 | 137 | ||
138 | extent_buffer_cache = kmem_cache_create("btrfs_extent_buffer", | 138 | extent_buffer_cache = kmem_cache_create("btrfs_extent_buffer", |
139 | sizeof(struct extent_buffer), 0, | 139 | sizeof(struct extent_buffer), 0, |
140 | SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL); | 140 | SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL); |
141 | if (!extent_buffer_cache) | 141 | if (!extent_buffer_cache) |
142 | goto free_state_cache; | 142 | goto free_state_cache; |
143 | 143 | ||
144 | btrfs_bioset = bioset_create(BIO_POOL_SIZE, | 144 | btrfs_bioset = bioset_create(BIO_POOL_SIZE, |
145 | offsetof(struct btrfs_io_bio, bio)); | 145 | offsetof(struct btrfs_io_bio, bio)); |
146 | if (!btrfs_bioset) | 146 | if (!btrfs_bioset) |
147 | goto free_buffer_cache; | 147 | goto free_buffer_cache; |
148 | 148 | ||
149 | if (bioset_integrity_create(btrfs_bioset, BIO_POOL_SIZE)) | 149 | if (bioset_integrity_create(btrfs_bioset, BIO_POOL_SIZE)) |
150 | goto free_bioset; | 150 | goto free_bioset; |
151 | 151 | ||
152 | return 0; | 152 | return 0; |
153 | 153 | ||
154 | free_bioset: | 154 | free_bioset: |
155 | bioset_free(btrfs_bioset); | 155 | bioset_free(btrfs_bioset); |
156 | btrfs_bioset = NULL; | 156 | btrfs_bioset = NULL; |
157 | 157 | ||
158 | free_buffer_cache: | 158 | free_buffer_cache: |
159 | kmem_cache_destroy(extent_buffer_cache); | 159 | kmem_cache_destroy(extent_buffer_cache); |
160 | extent_buffer_cache = NULL; | 160 | extent_buffer_cache = NULL; |
161 | 161 | ||
162 | free_state_cache: | 162 | free_state_cache: |
163 | kmem_cache_destroy(extent_state_cache); | 163 | kmem_cache_destroy(extent_state_cache); |
164 | extent_state_cache = NULL; | 164 | extent_state_cache = NULL; |
165 | return -ENOMEM; | 165 | return -ENOMEM; |
166 | } | 166 | } |
167 | 167 | ||
168 | void extent_io_exit(void) | 168 | void extent_io_exit(void) |
169 | { | 169 | { |
170 | btrfs_leak_debug_check(); | 170 | btrfs_leak_debug_check(); |
171 | 171 | ||
172 | /* | 172 | /* |
173 | * Make sure all delayed rcu free are flushed before we | 173 | * Make sure all delayed rcu free are flushed before we |
174 | * destroy caches. | 174 | * destroy caches. |
175 | */ | 175 | */ |
176 | rcu_barrier(); | 176 | rcu_barrier(); |
177 | if (extent_state_cache) | 177 | if (extent_state_cache) |
178 | kmem_cache_destroy(extent_state_cache); | 178 | kmem_cache_destroy(extent_state_cache); |
179 | if (extent_buffer_cache) | 179 | if (extent_buffer_cache) |
180 | kmem_cache_destroy(extent_buffer_cache); | 180 | kmem_cache_destroy(extent_buffer_cache); |
181 | if (btrfs_bioset) | 181 | if (btrfs_bioset) |
182 | bioset_free(btrfs_bioset); | 182 | bioset_free(btrfs_bioset); |
183 | } | 183 | } |
184 | 184 | ||
185 | void extent_io_tree_init(struct extent_io_tree *tree, | 185 | void extent_io_tree_init(struct extent_io_tree *tree, |
186 | struct address_space *mapping) | 186 | struct address_space *mapping) |
187 | { | 187 | { |
188 | tree->state = RB_ROOT; | 188 | tree->state = RB_ROOT; |
189 | INIT_RADIX_TREE(&tree->buffer, GFP_ATOMIC); | 189 | INIT_RADIX_TREE(&tree->buffer, GFP_ATOMIC); |
190 | tree->ops = NULL; | 190 | tree->ops = NULL; |
191 | tree->dirty_bytes = 0; | 191 | tree->dirty_bytes = 0; |
192 | spin_lock_init(&tree->lock); | 192 | spin_lock_init(&tree->lock); |
193 | spin_lock_init(&tree->buffer_lock); | 193 | spin_lock_init(&tree->buffer_lock); |
194 | tree->mapping = mapping; | 194 | tree->mapping = mapping; |
195 | } | 195 | } |
196 | 196 | ||
197 | static struct extent_state *alloc_extent_state(gfp_t mask) | 197 | static struct extent_state *alloc_extent_state(gfp_t mask) |
198 | { | 198 | { |
199 | struct extent_state *state; | 199 | struct extent_state *state; |
200 | 200 | ||
201 | state = kmem_cache_alloc(extent_state_cache, mask); | 201 | state = kmem_cache_alloc(extent_state_cache, mask); |
202 | if (!state) | 202 | if (!state) |
203 | return state; | 203 | return state; |
204 | state->state = 0; | 204 | state->state = 0; |
205 | state->private = 0; | 205 | state->private = 0; |
206 | state->tree = NULL; | 206 | state->tree = NULL; |
207 | btrfs_leak_debug_add(&state->leak_list, &states); | 207 | btrfs_leak_debug_add(&state->leak_list, &states); |
208 | atomic_set(&state->refs, 1); | 208 | atomic_set(&state->refs, 1); |
209 | init_waitqueue_head(&state->wq); | 209 | init_waitqueue_head(&state->wq); |
210 | trace_alloc_extent_state(state, mask, _RET_IP_); | 210 | trace_alloc_extent_state(state, mask, _RET_IP_); |
211 | return state; | 211 | return state; |
212 | } | 212 | } |
213 | 213 | ||
214 | void free_extent_state(struct extent_state *state) | 214 | void free_extent_state(struct extent_state *state) |
215 | { | 215 | { |
216 | if (!state) | 216 | if (!state) |
217 | return; | 217 | return; |
218 | if (atomic_dec_and_test(&state->refs)) { | 218 | if (atomic_dec_and_test(&state->refs)) { |
219 | WARN_ON(state->tree); | 219 | WARN_ON(state->tree); |
220 | btrfs_leak_debug_del(&state->leak_list); | 220 | btrfs_leak_debug_del(&state->leak_list); |
221 | trace_free_extent_state(state, _RET_IP_); | 221 | trace_free_extent_state(state, _RET_IP_); |
222 | kmem_cache_free(extent_state_cache, state); | 222 | kmem_cache_free(extent_state_cache, state); |
223 | } | 223 | } |
224 | } | 224 | } |
225 | 225 | ||
226 | static struct rb_node *tree_insert(struct rb_root *root, u64 offset, | 226 | static struct rb_node *tree_insert(struct rb_root *root, u64 offset, |
227 | struct rb_node *node) | 227 | struct rb_node *node) |
228 | { | 228 | { |
229 | struct rb_node **p = &root->rb_node; | 229 | struct rb_node **p = &root->rb_node; |
230 | struct rb_node *parent = NULL; | 230 | struct rb_node *parent = NULL; |
231 | struct tree_entry *entry; | 231 | struct tree_entry *entry; |
232 | 232 | ||
233 | while (*p) { | 233 | while (*p) { |
234 | parent = *p; | 234 | parent = *p; |
235 | entry = rb_entry(parent, struct tree_entry, rb_node); | 235 | entry = rb_entry(parent, struct tree_entry, rb_node); |
236 | 236 | ||
237 | if (offset < entry->start) | 237 | if (offset < entry->start) |
238 | p = &(*p)->rb_left; | 238 | p = &(*p)->rb_left; |
239 | else if (offset > entry->end) | 239 | else if (offset > entry->end) |
240 | p = &(*p)->rb_right; | 240 | p = &(*p)->rb_right; |
241 | else | 241 | else |
242 | return parent; | 242 | return parent; |
243 | } | 243 | } |
244 | 244 | ||
245 | rb_link_node(node, parent, p); | 245 | rb_link_node(node, parent, p); |
246 | rb_insert_color(node, root); | 246 | rb_insert_color(node, root); |
247 | return NULL; | 247 | return NULL; |
248 | } | 248 | } |
249 | 249 | ||
250 | static struct rb_node *__etree_search(struct extent_io_tree *tree, u64 offset, | 250 | static struct rb_node *__etree_search(struct extent_io_tree *tree, u64 offset, |
251 | struct rb_node **prev_ret, | 251 | struct rb_node **prev_ret, |
252 | struct rb_node **next_ret) | 252 | struct rb_node **next_ret) |
253 | { | 253 | { |
254 | struct rb_root *root = &tree->state; | 254 | struct rb_root *root = &tree->state; |
255 | struct rb_node *n = root->rb_node; | 255 | struct rb_node *n = root->rb_node; |
256 | struct rb_node *prev = NULL; | 256 | struct rb_node *prev = NULL; |
257 | struct rb_node *orig_prev = NULL; | 257 | struct rb_node *orig_prev = NULL; |
258 | struct tree_entry *entry; | 258 | struct tree_entry *entry; |
259 | struct tree_entry *prev_entry = NULL; | 259 | struct tree_entry *prev_entry = NULL; |
260 | 260 | ||
261 | while (n) { | 261 | while (n) { |
262 | entry = rb_entry(n, struct tree_entry, rb_node); | 262 | entry = rb_entry(n, struct tree_entry, rb_node); |
263 | prev = n; | 263 | prev = n; |
264 | prev_entry = entry; | 264 | prev_entry = entry; |
265 | 265 | ||
266 | if (offset < entry->start) | 266 | if (offset < entry->start) |
267 | n = n->rb_left; | 267 | n = n->rb_left; |
268 | else if (offset > entry->end) | 268 | else if (offset > entry->end) |
269 | n = n->rb_right; | 269 | n = n->rb_right; |
270 | else | 270 | else |
271 | return n; | 271 | return n; |
272 | } | 272 | } |
273 | 273 | ||
274 | if (prev_ret) { | 274 | if (prev_ret) { |
275 | orig_prev = prev; | 275 | orig_prev = prev; |
276 | while (prev && offset > prev_entry->end) { | 276 | while (prev && offset > prev_entry->end) { |
277 | prev = rb_next(prev); | 277 | prev = rb_next(prev); |
278 | prev_entry = rb_entry(prev, struct tree_entry, rb_node); | 278 | prev_entry = rb_entry(prev, struct tree_entry, rb_node); |
279 | } | 279 | } |
280 | *prev_ret = prev; | 280 | *prev_ret = prev; |
281 | prev = orig_prev; | 281 | prev = orig_prev; |
282 | } | 282 | } |
283 | 283 | ||
284 | if (next_ret) { | 284 | if (next_ret) { |
285 | prev_entry = rb_entry(prev, struct tree_entry, rb_node); | 285 | prev_entry = rb_entry(prev, struct tree_entry, rb_node); |
286 | while (prev && offset < prev_entry->start) { | 286 | while (prev && offset < prev_entry->start) { |
287 | prev = rb_prev(prev); | 287 | prev = rb_prev(prev); |
288 | prev_entry = rb_entry(prev, struct tree_entry, rb_node); | 288 | prev_entry = rb_entry(prev, struct tree_entry, rb_node); |
289 | } | 289 | } |
290 | *next_ret = prev; | 290 | *next_ret = prev; |
291 | } | 291 | } |
292 | return NULL; | 292 | return NULL; |
293 | } | 293 | } |
294 | 294 | ||
295 | static inline struct rb_node *tree_search(struct extent_io_tree *tree, | 295 | static inline struct rb_node *tree_search(struct extent_io_tree *tree, |
296 | u64 offset) | 296 | u64 offset) |
297 | { | 297 | { |
298 | struct rb_node *prev = NULL; | 298 | struct rb_node *prev = NULL; |
299 | struct rb_node *ret; | 299 | struct rb_node *ret; |
300 | 300 | ||
301 | ret = __etree_search(tree, offset, &prev, NULL); | 301 | ret = __etree_search(tree, offset, &prev, NULL); |
302 | if (!ret) | 302 | if (!ret) |
303 | return prev; | 303 | return prev; |
304 | return ret; | 304 | return ret; |
305 | } | 305 | } |
306 | 306 | ||
307 | static void merge_cb(struct extent_io_tree *tree, struct extent_state *new, | 307 | static void merge_cb(struct extent_io_tree *tree, struct extent_state *new, |
308 | struct extent_state *other) | 308 | struct extent_state *other) |
309 | { | 309 | { |
310 | if (tree->ops && tree->ops->merge_extent_hook) | 310 | if (tree->ops && tree->ops->merge_extent_hook) |
311 | tree->ops->merge_extent_hook(tree->mapping->host, new, | 311 | tree->ops->merge_extent_hook(tree->mapping->host, new, |
312 | other); | 312 | other); |
313 | } | 313 | } |
314 | 314 | ||
315 | /* | 315 | /* |
316 | * utility function to look for merge candidates inside a given range. | 316 | * utility function to look for merge candidates inside a given range. |
317 | * Any extents with matching state are merged together into a single | 317 | * Any extents with matching state are merged together into a single |
318 | * extent in the tree. Extents with EXTENT_IO in their state field | 318 | * extent in the tree. Extents with EXTENT_IO in their state field |
319 | * are not merged because the end_io handlers need to be able to do | 319 | * are not merged because the end_io handlers need to be able to do |
320 | * operations on them without sleeping (or doing allocations/splits). | 320 | * operations on them without sleeping (or doing allocations/splits). |
321 | * | 321 | * |
322 | * This should be called with the tree lock held. | 322 | * This should be called with the tree lock held. |
323 | */ | 323 | */ |
324 | static void merge_state(struct extent_io_tree *tree, | 324 | static void merge_state(struct extent_io_tree *tree, |
325 | struct extent_state *state) | 325 | struct extent_state *state) |
326 | { | 326 | { |
327 | struct extent_state *other; | 327 | struct extent_state *other; |
328 | struct rb_node *other_node; | 328 | struct rb_node *other_node; |
329 | 329 | ||
330 | if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) | 330 | if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) |
331 | return; | 331 | return; |
332 | 332 | ||
333 | other_node = rb_prev(&state->rb_node); | 333 | other_node = rb_prev(&state->rb_node); |
334 | if (other_node) { | 334 | if (other_node) { |
335 | other = rb_entry(other_node, struct extent_state, rb_node); | 335 | other = rb_entry(other_node, struct extent_state, rb_node); |
336 | if (other->end == state->start - 1 && | 336 | if (other->end == state->start - 1 && |
337 | other->state == state->state) { | 337 | other->state == state->state) { |
338 | merge_cb(tree, state, other); | 338 | merge_cb(tree, state, other); |
339 | state->start = other->start; | 339 | state->start = other->start; |
340 | other->tree = NULL; | 340 | other->tree = NULL; |
341 | rb_erase(&other->rb_node, &tree->state); | 341 | rb_erase(&other->rb_node, &tree->state); |
342 | free_extent_state(other); | 342 | free_extent_state(other); |
343 | } | 343 | } |
344 | } | 344 | } |
345 | other_node = rb_next(&state->rb_node); | 345 | other_node = rb_next(&state->rb_node); |
346 | if (other_node) { | 346 | if (other_node) { |
347 | other = rb_entry(other_node, struct extent_state, rb_node); | 347 | other = rb_entry(other_node, struct extent_state, rb_node); |
348 | if (other->start == state->end + 1 && | 348 | if (other->start == state->end + 1 && |
349 | other->state == state->state) { | 349 | other->state == state->state) { |
350 | merge_cb(tree, state, other); | 350 | merge_cb(tree, state, other); |
351 | state->end = other->end; | 351 | state->end = other->end; |
352 | other->tree = NULL; | 352 | other->tree = NULL; |
353 | rb_erase(&other->rb_node, &tree->state); | 353 | rb_erase(&other->rb_node, &tree->state); |
354 | free_extent_state(other); | 354 | free_extent_state(other); |
355 | } | 355 | } |
356 | } | 356 | } |
357 | } | 357 | } |
358 | 358 | ||
359 | static void set_state_cb(struct extent_io_tree *tree, | 359 | static void set_state_cb(struct extent_io_tree *tree, |
360 | struct extent_state *state, unsigned long *bits) | 360 | struct extent_state *state, unsigned long *bits) |
361 | { | 361 | { |
362 | if (tree->ops && tree->ops->set_bit_hook) | 362 | if (tree->ops && tree->ops->set_bit_hook) |
363 | tree->ops->set_bit_hook(tree->mapping->host, state, bits); | 363 | tree->ops->set_bit_hook(tree->mapping->host, state, bits); |
364 | } | 364 | } |
365 | 365 | ||
366 | static void clear_state_cb(struct extent_io_tree *tree, | 366 | static void clear_state_cb(struct extent_io_tree *tree, |
367 | struct extent_state *state, unsigned long *bits) | 367 | struct extent_state *state, unsigned long *bits) |
368 | { | 368 | { |
369 | if (tree->ops && tree->ops->clear_bit_hook) | 369 | if (tree->ops && tree->ops->clear_bit_hook) |
370 | tree->ops->clear_bit_hook(tree->mapping->host, state, bits); | 370 | tree->ops->clear_bit_hook(tree->mapping->host, state, bits); |
371 | } | 371 | } |
372 | 372 | ||
373 | static void set_state_bits(struct extent_io_tree *tree, | 373 | static void set_state_bits(struct extent_io_tree *tree, |
374 | struct extent_state *state, unsigned long *bits); | 374 | struct extent_state *state, unsigned long *bits); |
375 | 375 | ||
376 | /* | 376 | /* |
377 | * insert an extent_state struct into the tree. 'bits' are set on the | 377 | * insert an extent_state struct into the tree. 'bits' are set on the |
378 | * struct before it is inserted. | 378 | * struct before it is inserted. |
379 | * | 379 | * |
380 | * This may return -EEXIST if the extent is already there, in which case the | 380 | * This may return -EEXIST if the extent is already there, in which case the |
381 | * state struct is freed. | 381 | * state struct is freed. |
382 | * | 382 | * |
383 | * The tree lock is not taken internally. This is a utility function and | 383 | * The tree lock is not taken internally. This is a utility function and |
384 | * probably isn't what you want to call (see set/clear_extent_bit). | 384 | * probably isn't what you want to call (see set/clear_extent_bit). |
385 | */ | 385 | */ |
386 | static int insert_state(struct extent_io_tree *tree, | 386 | static int insert_state(struct extent_io_tree *tree, |
387 | struct extent_state *state, u64 start, u64 end, | 387 | struct extent_state *state, u64 start, u64 end, |
388 | unsigned long *bits) | 388 | unsigned long *bits) |
389 | { | 389 | { |
390 | struct rb_node *node; | 390 | struct rb_node *node; |
391 | 391 | ||
392 | if (end < start) | 392 | if (end < start) |
393 | WARN(1, KERN_ERR "btrfs end < start %llu %llu\n", | 393 | WARN(1, KERN_ERR "btrfs end < start %llu %llu\n", |
394 | end, start); | 394 | end, start); |
395 | state->start = start; | 395 | state->start = start; |
396 | state->end = end; | 396 | state->end = end; |
397 | 397 | ||
398 | set_state_bits(tree, state, bits); | 398 | set_state_bits(tree, state, bits); |
399 | 399 | ||
400 | node = tree_insert(&tree->state, end, &state->rb_node); | 400 | node = tree_insert(&tree->state, end, &state->rb_node); |
401 | if (node) { | 401 | if (node) { |
402 | struct extent_state *found; | 402 | struct extent_state *found; |
403 | found = rb_entry(node, struct extent_state, rb_node); | 403 | found = rb_entry(node, struct extent_state, rb_node); |
404 | printk(KERN_ERR "btrfs found node %llu %llu on insert of " | 404 | printk(KERN_ERR "btrfs found node %llu %llu on insert of " |
405 | "%llu %llu\n", | 405 | "%llu %llu\n", |
406 | found->start, found->end, start, end); | 406 | found->start, found->end, start, end); |
407 | return -EEXIST; | 407 | return -EEXIST; |
408 | } | 408 | } |
409 | state->tree = tree; | 409 | state->tree = tree; |
410 | merge_state(tree, state); | 410 | merge_state(tree, state); |
411 | return 0; | 411 | return 0; |
412 | } | 412 | } |
413 | 413 | ||
414 | static void split_cb(struct extent_io_tree *tree, struct extent_state *orig, | 414 | static void split_cb(struct extent_io_tree *tree, struct extent_state *orig, |
415 | u64 split) | 415 | u64 split) |
416 | { | 416 | { |
417 | if (tree->ops && tree->ops->split_extent_hook) | 417 | if (tree->ops && tree->ops->split_extent_hook) |
418 | tree->ops->split_extent_hook(tree->mapping->host, orig, split); | 418 | tree->ops->split_extent_hook(tree->mapping->host, orig, split); |
419 | } | 419 | } |
420 | 420 | ||
421 | /* | 421 | /* |
422 | * split a given extent state struct in two, inserting the preallocated | 422 | * split a given extent state struct in two, inserting the preallocated |
423 | * struct 'prealloc' as the newly created second half. 'split' indicates an | 423 | * struct 'prealloc' as the newly created second half. 'split' indicates an |
424 | * offset inside 'orig' where it should be split. | 424 | * offset inside 'orig' where it should be split. |
425 | * | 425 | * |
426 | * Before calling, | 426 | * Before calling, |
427 | * the tree has 'orig' at [orig->start, orig->end]. After calling, there | 427 | * the tree has 'orig' at [orig->start, orig->end]. After calling, there |
428 | * are two extent state structs in the tree: | 428 | * are two extent state structs in the tree: |
429 | * prealloc: [orig->start, split - 1] | 429 | * prealloc: [orig->start, split - 1] |
430 | * orig: [ split, orig->end ] | 430 | * orig: [ split, orig->end ] |
431 | * | 431 | * |
432 | * The tree locks are not taken by this function. They need to be held | 432 | * The tree locks are not taken by this function. They need to be held |
433 | * by the caller. | 433 | * by the caller. |
434 | */ | 434 | */ |
435 | static int split_state(struct extent_io_tree *tree, struct extent_state *orig, | 435 | static int split_state(struct extent_io_tree *tree, struct extent_state *orig, |
436 | struct extent_state *prealloc, u64 split) | 436 | struct extent_state *prealloc, u64 split) |
437 | { | 437 | { |
438 | struct rb_node *node; | 438 | struct rb_node *node; |
439 | 439 | ||
440 | split_cb(tree, orig, split); | 440 | split_cb(tree, orig, split); |
441 | 441 | ||
442 | prealloc->start = orig->start; | 442 | prealloc->start = orig->start; |
443 | prealloc->end = split - 1; | 443 | prealloc->end = split - 1; |
444 | prealloc->state = orig->state; | 444 | prealloc->state = orig->state; |
445 | orig->start = split; | 445 | orig->start = split; |
446 | 446 | ||
447 | node = tree_insert(&tree->state, prealloc->end, &prealloc->rb_node); | 447 | node = tree_insert(&tree->state, prealloc->end, &prealloc->rb_node); |
448 | if (node) { | 448 | if (node) { |
449 | free_extent_state(prealloc); | 449 | free_extent_state(prealloc); |
450 | return -EEXIST; | 450 | return -EEXIST; |
451 | } | 451 | } |
452 | prealloc->tree = tree; | 452 | prealloc->tree = tree; |
453 | return 0; | 453 | return 0; |
454 | } | 454 | } |
455 | 455 | ||
456 | static struct extent_state *next_state(struct extent_state *state) | 456 | static struct extent_state *next_state(struct extent_state *state) |
457 | { | 457 | { |
458 | struct rb_node *next = rb_next(&state->rb_node); | 458 | struct rb_node *next = rb_next(&state->rb_node); |
459 | if (next) | 459 | if (next) |
460 | return rb_entry(next, struct extent_state, rb_node); | 460 | return rb_entry(next, struct extent_state, rb_node); |
461 | else | 461 | else |
462 | return NULL; | 462 | return NULL; |
463 | } | 463 | } |
464 | 464 | ||
465 | /* | 465 | /* |
466 | * utility function to clear some bits in an extent state struct. | 466 | * utility function to clear some bits in an extent state struct. |
467 | * it will optionally wake up any one waiting on this state (wake == 1). | 467 | * it will optionally wake up any one waiting on this state (wake == 1). |
468 | * | 468 | * |
469 | * If no bits are set on the state struct after clearing things, the | 469 | * If no bits are set on the state struct after clearing things, the |
470 | * struct is freed and removed from the tree | 470 | * struct is freed and removed from the tree |
471 | */ | 471 | */ |
472 | static struct extent_state *clear_state_bit(struct extent_io_tree *tree, | 472 | static struct extent_state *clear_state_bit(struct extent_io_tree *tree, |
473 | struct extent_state *state, | 473 | struct extent_state *state, |
474 | unsigned long *bits, int wake) | 474 | unsigned long *bits, int wake) |
475 | { | 475 | { |
476 | struct extent_state *next; | 476 | struct extent_state *next; |
477 | unsigned long bits_to_clear = *bits & ~EXTENT_CTLBITS; | 477 | unsigned long bits_to_clear = *bits & ~EXTENT_CTLBITS; |
478 | 478 | ||
479 | if ((bits_to_clear & EXTENT_DIRTY) && (state->state & EXTENT_DIRTY)) { | 479 | if ((bits_to_clear & EXTENT_DIRTY) && (state->state & EXTENT_DIRTY)) { |
480 | u64 range = state->end - state->start + 1; | 480 | u64 range = state->end - state->start + 1; |
481 | WARN_ON(range > tree->dirty_bytes); | 481 | WARN_ON(range > tree->dirty_bytes); |
482 | tree->dirty_bytes -= range; | 482 | tree->dirty_bytes -= range; |
483 | } | 483 | } |
484 | clear_state_cb(tree, state, bits); | 484 | clear_state_cb(tree, state, bits); |
485 | state->state &= ~bits_to_clear; | 485 | state->state &= ~bits_to_clear; |
486 | if (wake) | 486 | if (wake) |
487 | wake_up(&state->wq); | 487 | wake_up(&state->wq); |
488 | if (state->state == 0) { | 488 | if (state->state == 0) { |
489 | next = next_state(state); | 489 | next = next_state(state); |
490 | if (state->tree) { | 490 | if (state->tree) { |
491 | rb_erase(&state->rb_node, &tree->state); | 491 | rb_erase(&state->rb_node, &tree->state); |
492 | state->tree = NULL; | 492 | state->tree = NULL; |
493 | free_extent_state(state); | 493 | free_extent_state(state); |
494 | } else { | 494 | } else { |
495 | WARN_ON(1); | 495 | WARN_ON(1); |
496 | } | 496 | } |
497 | } else { | 497 | } else { |
498 | merge_state(tree, state); | 498 | merge_state(tree, state); |
499 | next = next_state(state); | 499 | next = next_state(state); |
500 | } | 500 | } |
501 | return next; | 501 | return next; |
502 | } | 502 | } |
503 | 503 | ||
504 | static struct extent_state * | 504 | static struct extent_state * |
505 | alloc_extent_state_atomic(struct extent_state *prealloc) | 505 | alloc_extent_state_atomic(struct extent_state *prealloc) |
506 | { | 506 | { |
507 | if (!prealloc) | 507 | if (!prealloc) |
508 | prealloc = alloc_extent_state(GFP_ATOMIC); | 508 | prealloc = alloc_extent_state(GFP_ATOMIC); |
509 | 509 | ||
510 | return prealloc; | 510 | return prealloc; |
511 | } | 511 | } |
512 | 512 | ||
513 | static void extent_io_tree_panic(struct extent_io_tree *tree, int err) | 513 | static void extent_io_tree_panic(struct extent_io_tree *tree, int err) |
514 | { | 514 | { |
515 | btrfs_panic(tree_fs_info(tree), err, "Locking error: " | 515 | btrfs_panic(tree_fs_info(tree), err, "Locking error: " |
516 | "Extent tree was modified by another " | 516 | "Extent tree was modified by another " |
517 | "thread while locked."); | 517 | "thread while locked."); |
518 | } | 518 | } |
519 | 519 | ||
520 | /* | 520 | /* |
521 | * clear some bits on a range in the tree. This may require splitting | 521 | * clear some bits on a range in the tree. This may require splitting |
522 | * or inserting elements in the tree, so the gfp mask is used to | 522 | * or inserting elements in the tree, so the gfp mask is used to |
523 | * indicate which allocations or sleeping are allowed. | 523 | * indicate which allocations or sleeping are allowed. |
524 | * | 524 | * |
525 | * pass 'wake' == 1 to kick any sleepers, and 'delete' == 1 to remove | 525 | * pass 'wake' == 1 to kick any sleepers, and 'delete' == 1 to remove |
526 | * the given range from the tree regardless of state (ie for truncate). | 526 | * the given range from the tree regardless of state (ie for truncate). |
527 | * | 527 | * |
528 | * the range [start, end] is inclusive. | 528 | * the range [start, end] is inclusive. |
529 | * | 529 | * |
530 | * This takes the tree lock, and returns 0 on success and < 0 on error. | 530 | * This takes the tree lock, and returns 0 on success and < 0 on error. |
531 | */ | 531 | */ |
532 | int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, | 532 | int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, |
533 | unsigned long bits, int wake, int delete, | 533 | unsigned long bits, int wake, int delete, |
534 | struct extent_state **cached_state, | 534 | struct extent_state **cached_state, |
535 | gfp_t mask) | 535 | gfp_t mask) |
536 | { | 536 | { |
537 | struct extent_state *state; | 537 | struct extent_state *state; |
538 | struct extent_state *cached; | 538 | struct extent_state *cached; |
539 | struct extent_state *prealloc = NULL; | 539 | struct extent_state *prealloc = NULL; |
540 | struct rb_node *node; | 540 | struct rb_node *node; |
541 | u64 last_end; | 541 | u64 last_end; |
542 | int err; | 542 | int err; |
543 | int clear = 0; | 543 | int clear = 0; |
544 | 544 | ||
545 | btrfs_debug_check_extent_io_range(tree->mapping->host, start, end); | 545 | btrfs_debug_check_extent_io_range(tree->mapping->host, start, end); |
546 | 546 | ||
547 | if (bits & EXTENT_DELALLOC) | 547 | if (bits & EXTENT_DELALLOC) |
548 | bits |= EXTENT_NORESERVE; | 548 | bits |= EXTENT_NORESERVE; |
549 | 549 | ||
550 | if (delete) | 550 | if (delete) |
551 | bits |= ~EXTENT_CTLBITS; | 551 | bits |= ~EXTENT_CTLBITS; |
552 | bits |= EXTENT_FIRST_DELALLOC; | 552 | bits |= EXTENT_FIRST_DELALLOC; |
553 | 553 | ||
554 | if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY)) | 554 | if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY)) |
555 | clear = 1; | 555 | clear = 1; |
556 | again: | 556 | again: |
557 | if (!prealloc && (mask & __GFP_WAIT)) { | 557 | if (!prealloc && (mask & __GFP_WAIT)) { |
558 | prealloc = alloc_extent_state(mask); | 558 | prealloc = alloc_extent_state(mask); |
559 | if (!prealloc) | 559 | if (!prealloc) |
560 | return -ENOMEM; | 560 | return -ENOMEM; |
561 | } | 561 | } |
562 | 562 | ||
563 | spin_lock(&tree->lock); | 563 | spin_lock(&tree->lock); |
564 | if (cached_state) { | 564 | if (cached_state) { |
565 | cached = *cached_state; | 565 | cached = *cached_state; |
566 | 566 | ||
567 | if (clear) { | 567 | if (clear) { |
568 | *cached_state = NULL; | 568 | *cached_state = NULL; |
569 | cached_state = NULL; | 569 | cached_state = NULL; |
570 | } | 570 | } |
571 | 571 | ||
572 | if (cached && cached->tree && cached->start <= start && | 572 | if (cached && cached->tree && cached->start <= start && |
573 | cached->end > start) { | 573 | cached->end > start) { |
574 | if (clear) | 574 | if (clear) |
575 | atomic_dec(&cached->refs); | 575 | atomic_dec(&cached->refs); |
576 | state = cached; | 576 | state = cached; |
577 | goto hit_next; | 577 | goto hit_next; |
578 | } | 578 | } |
579 | if (clear) | 579 | if (clear) |
580 | free_extent_state(cached); | 580 | free_extent_state(cached); |
581 | } | 581 | } |
582 | /* | 582 | /* |
583 | * this search will find the extents that end after | 583 | * this search will find the extents that end after |
584 | * our range starts | 584 | * our range starts |
585 | */ | 585 | */ |
586 | node = tree_search(tree, start); | 586 | node = tree_search(tree, start); |
587 | if (!node) | 587 | if (!node) |
588 | goto out; | 588 | goto out; |
589 | state = rb_entry(node, struct extent_state, rb_node); | 589 | state = rb_entry(node, struct extent_state, rb_node); |
590 | hit_next: | 590 | hit_next: |
591 | if (state->start > end) | 591 | if (state->start > end) |
592 | goto out; | 592 | goto out; |
593 | WARN_ON(state->end < start); | 593 | WARN_ON(state->end < start); |
594 | last_end = state->end; | 594 | last_end = state->end; |
595 | 595 | ||
596 | /* the state doesn't have the wanted bits, go ahead */ | 596 | /* the state doesn't have the wanted bits, go ahead */ |
597 | if (!(state->state & bits)) { | 597 | if (!(state->state & bits)) { |
598 | state = next_state(state); | 598 | state = next_state(state); |
599 | goto next; | 599 | goto next; |
600 | } | 600 | } |
601 | 601 | ||
602 | /* | 602 | /* |
603 | * | ---- desired range ---- | | 603 | * | ---- desired range ---- | |
604 | * | state | or | 604 | * | state | or |
605 | * | ------------- state -------------- | | 605 | * | ------------- state -------------- | |
606 | * | 606 | * |
607 | * We need to split the extent we found, and may flip | 607 | * We need to split the extent we found, and may flip |
608 | * bits on second half. | 608 | * bits on second half. |
609 | * | 609 | * |
610 | * If the extent we found extends past our range, we | 610 | * If the extent we found extends past our range, we |
611 | * just split and search again. It'll get split again | 611 | * just split and search again. It'll get split again |
612 | * the next time though. | 612 | * the next time though. |
613 | * | 613 | * |
614 | * If the extent we found is inside our range, we clear | 614 | * If the extent we found is inside our range, we clear |
615 | * the desired bit on it. | 615 | * the desired bit on it. |
616 | */ | 616 | */ |
617 | 617 | ||
618 | if (state->start < start) { | 618 | if (state->start < start) { |
619 | prealloc = alloc_extent_state_atomic(prealloc); | 619 | prealloc = alloc_extent_state_atomic(prealloc); |
620 | BUG_ON(!prealloc); | 620 | BUG_ON(!prealloc); |
621 | err = split_state(tree, state, prealloc, start); | 621 | err = split_state(tree, state, prealloc, start); |
622 | if (err) | 622 | if (err) |
623 | extent_io_tree_panic(tree, err); | 623 | extent_io_tree_panic(tree, err); |
624 | 624 | ||
625 | prealloc = NULL; | 625 | prealloc = NULL; |
626 | if (err) | 626 | if (err) |
627 | goto out; | 627 | goto out; |
628 | if (state->end <= end) { | 628 | if (state->end <= end) { |
629 | state = clear_state_bit(tree, state, &bits, wake); | 629 | state = clear_state_bit(tree, state, &bits, wake); |
630 | goto next; | 630 | goto next; |
631 | } | 631 | } |
632 | goto search_again; | 632 | goto search_again; |
633 | } | 633 | } |
634 | /* | 634 | /* |
635 | * | ---- desired range ---- | | 635 | * | ---- desired range ---- | |
636 | * | state | | 636 | * | state | |
637 | * We need to split the extent, and clear the bit | 637 | * We need to split the extent, and clear the bit |
638 | * on the first half | 638 | * on the first half |
639 | */ | 639 | */ |
640 | if (state->start <= end && state->end > end) { | 640 | if (state->start <= end && state->end > end) { |
641 | prealloc = alloc_extent_state_atomic(prealloc); | 641 | prealloc = alloc_extent_state_atomic(prealloc); |
642 | BUG_ON(!prealloc); | 642 | BUG_ON(!prealloc); |
643 | err = split_state(tree, state, prealloc, end + 1); | 643 | err = split_state(tree, state, prealloc, end + 1); |
644 | if (err) | 644 | if (err) |
645 | extent_io_tree_panic(tree, err); | 645 | extent_io_tree_panic(tree, err); |
646 | 646 | ||
647 | if (wake) | 647 | if (wake) |
648 | wake_up(&state->wq); | 648 | wake_up(&state->wq); |
649 | 649 | ||
650 | clear_state_bit(tree, prealloc, &bits, wake); | 650 | clear_state_bit(tree, prealloc, &bits, wake); |
651 | 651 | ||
652 | prealloc = NULL; | 652 | prealloc = NULL; |
653 | goto out; | 653 | goto out; |
654 | } | 654 | } |
655 | 655 | ||
656 | state = clear_state_bit(tree, state, &bits, wake); | 656 | state = clear_state_bit(tree, state, &bits, wake); |
657 | next: | 657 | next: |
658 | if (last_end == (u64)-1) | 658 | if (last_end == (u64)-1) |
659 | goto out; | 659 | goto out; |
660 | start = last_end + 1; | 660 | start = last_end + 1; |
661 | if (start <= end && state && !need_resched()) | 661 | if (start <= end && state && !need_resched()) |
662 | goto hit_next; | 662 | goto hit_next; |
663 | goto search_again; | 663 | goto search_again; |
664 | 664 | ||
665 | out: | 665 | out: |
666 | spin_unlock(&tree->lock); | 666 | spin_unlock(&tree->lock); |
667 | if (prealloc) | 667 | if (prealloc) |
668 | free_extent_state(prealloc); | 668 | free_extent_state(prealloc); |
669 | 669 | ||
670 | return 0; | 670 | return 0; |
671 | 671 | ||
672 | search_again: | 672 | search_again: |
673 | if (start > end) | 673 | if (start > end) |
674 | goto out; | 674 | goto out; |
675 | spin_unlock(&tree->lock); | 675 | spin_unlock(&tree->lock); |
676 | if (mask & __GFP_WAIT) | 676 | if (mask & __GFP_WAIT) |
677 | cond_resched(); | 677 | cond_resched(); |
678 | goto again; | 678 | goto again; |
679 | } | 679 | } |
680 | 680 | ||
681 | static void wait_on_state(struct extent_io_tree *tree, | 681 | static void wait_on_state(struct extent_io_tree *tree, |
682 | struct extent_state *state) | 682 | struct extent_state *state) |
683 | __releases(tree->lock) | 683 | __releases(tree->lock) |
684 | __acquires(tree->lock) | 684 | __acquires(tree->lock) |
685 | { | 685 | { |
686 | DEFINE_WAIT(wait); | 686 | DEFINE_WAIT(wait); |
687 | prepare_to_wait(&state->wq, &wait, TASK_UNINTERRUPTIBLE); | 687 | prepare_to_wait(&state->wq, &wait, TASK_UNINTERRUPTIBLE); |
688 | spin_unlock(&tree->lock); | 688 | spin_unlock(&tree->lock); |
689 | schedule(); | 689 | schedule(); |
690 | spin_lock(&tree->lock); | 690 | spin_lock(&tree->lock); |
691 | finish_wait(&state->wq, &wait); | 691 | finish_wait(&state->wq, &wait); |
692 | } | 692 | } |
693 | 693 | ||
694 | /* | 694 | /* |
695 | * waits for one or more bits to clear on a range in the state tree. | 695 | * waits for one or more bits to clear on a range in the state tree. |
696 | * The range [start, end] is inclusive. | 696 | * The range [start, end] is inclusive. |
697 | * The tree lock is taken by this function | 697 | * The tree lock is taken by this function |
698 | */ | 698 | */ |
699 | static void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, | 699 | static void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, |
700 | unsigned long bits) | 700 | unsigned long bits) |
701 | { | 701 | { |
702 | struct extent_state *state; | 702 | struct extent_state *state; |
703 | struct rb_node *node; | 703 | struct rb_node *node; |
704 | 704 | ||
705 | btrfs_debug_check_extent_io_range(tree->mapping->host, start, end); | 705 | btrfs_debug_check_extent_io_range(tree->mapping->host, start, end); |
706 | 706 | ||
707 | spin_lock(&tree->lock); | 707 | spin_lock(&tree->lock); |
708 | again: | 708 | again: |
709 | while (1) { | 709 | while (1) { |
710 | /* | 710 | /* |
711 | * this search will find all the extents that end after | 711 | * this search will find all the extents that end after |
712 | * our range starts | 712 | * our range starts |
713 | */ | 713 | */ |
714 | node = tree_search(tree, start); | 714 | node = tree_search(tree, start); |
715 | if (!node) | 715 | if (!node) |
716 | break; | 716 | break; |
717 | 717 | ||
718 | state = rb_entry(node, struct extent_state, rb_node); | 718 | state = rb_entry(node, struct extent_state, rb_node); |
719 | 719 | ||
720 | if (state->start > end) | 720 | if (state->start > end) |
721 | goto out; | 721 | goto out; |
722 | 722 | ||
723 | if (state->state & bits) { | 723 | if (state->state & bits) { |
724 | start = state->start; | 724 | start = state->start; |
725 | atomic_inc(&state->refs); | 725 | atomic_inc(&state->refs); |
726 | wait_on_state(tree, state); | 726 | wait_on_state(tree, state); |
727 | free_extent_state(state); | 727 | free_extent_state(state); |
728 | goto again; | 728 | goto again; |
729 | } | 729 | } |
730 | start = state->end + 1; | 730 | start = state->end + 1; |
731 | 731 | ||
732 | if (start > end) | 732 | if (start > end) |
733 | break; | 733 | break; |
734 | 734 | ||
735 | cond_resched_lock(&tree->lock); | 735 | cond_resched_lock(&tree->lock); |
736 | } | 736 | } |
737 | out: | 737 | out: |
738 | spin_unlock(&tree->lock); | 738 | spin_unlock(&tree->lock); |
739 | } | 739 | } |
740 | 740 | ||
741 | static void set_state_bits(struct extent_io_tree *tree, | 741 | static void set_state_bits(struct extent_io_tree *tree, |
742 | struct extent_state *state, | 742 | struct extent_state *state, |
743 | unsigned long *bits) | 743 | unsigned long *bits) |
744 | { | 744 | { |
745 | unsigned long bits_to_set = *bits & ~EXTENT_CTLBITS; | 745 | unsigned long bits_to_set = *bits & ~EXTENT_CTLBITS; |
746 | 746 | ||
747 | set_state_cb(tree, state, bits); | 747 | set_state_cb(tree, state, bits); |
748 | if ((bits_to_set & EXTENT_DIRTY) && !(state->state & EXTENT_DIRTY)) { | 748 | if ((bits_to_set & EXTENT_DIRTY) && !(state->state & EXTENT_DIRTY)) { |
749 | u64 range = state->end - state->start + 1; | 749 | u64 range = state->end - state->start + 1; |
750 | tree->dirty_bytes += range; | 750 | tree->dirty_bytes += range; |
751 | } | 751 | } |
752 | state->state |= bits_to_set; | 752 | state->state |= bits_to_set; |
753 | } | 753 | } |
754 | 754 | ||
755 | static void cache_state(struct extent_state *state, | 755 | static void cache_state(struct extent_state *state, |
756 | struct extent_state **cached_ptr) | 756 | struct extent_state **cached_ptr) |
757 | { | 757 | { |
758 | if (cached_ptr && !(*cached_ptr)) { | 758 | if (cached_ptr && !(*cached_ptr)) { |
759 | if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) { | 759 | if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) { |
760 | *cached_ptr = state; | 760 | *cached_ptr = state; |
761 | atomic_inc(&state->refs); | 761 | atomic_inc(&state->refs); |
762 | } | 762 | } |
763 | } | 763 | } |
764 | } | 764 | } |
765 | 765 | ||
766 | /* | 766 | /* |
767 | * set some bits on a range in the tree. This may require allocations or | 767 | * set some bits on a range in the tree. This may require allocations or |
768 | * sleeping, so the gfp mask is used to indicate what is allowed. | 768 | * sleeping, so the gfp mask is used to indicate what is allowed. |
769 | * | 769 | * |
770 | * If any of the exclusive bits are set, this will fail with -EEXIST if some | 770 | * If any of the exclusive bits are set, this will fail with -EEXIST if some |
771 | * part of the range already has the desired bits set. The start of the | 771 | * part of the range already has the desired bits set. The start of the |
772 | * existing range is returned in failed_start in this case. | 772 | * existing range is returned in failed_start in this case. |
773 | * | 773 | * |
774 | * [start, end] is inclusive This takes the tree lock. | 774 | * [start, end] is inclusive This takes the tree lock. |
775 | */ | 775 | */ |

static int __must_check
__set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
		 unsigned long bits, unsigned long exclusive_bits,
		 u64 *failed_start, struct extent_state **cached_state,
		 gfp_t mask)
{
	struct extent_state *state;
	struct extent_state *prealloc = NULL;
	struct rb_node *node;
	int err = 0;
	u64 last_start;
	u64 last_end;

	btrfs_debug_check_extent_io_range(tree->mapping->host, start, end);

	bits |= EXTENT_FIRST_DELALLOC;
again:
	if (!prealloc && (mask & __GFP_WAIT)) {
		prealloc = alloc_extent_state(mask);
		BUG_ON(!prealloc);
	}

	spin_lock(&tree->lock);
	if (cached_state && *cached_state) {
		state = *cached_state;
		if (state->start <= start && state->end > start &&
		    state->tree) {
			node = &state->rb_node;
			goto hit_next;
		}
	}
	/*
	 * this search will find all the extents that end after
	 * our range starts.
	 */
	node = tree_search(tree, start);
	if (!node) {
		prealloc = alloc_extent_state_atomic(prealloc);
		BUG_ON(!prealloc);
		err = insert_state(tree, prealloc, start, end, &bits);
		if (err)
			extent_io_tree_panic(tree, err);

		prealloc = NULL;
		goto out;
	}
	state = rb_entry(node, struct extent_state, rb_node);
hit_next:
	last_start = state->start;
	last_end = state->end;

	/*
	 * | ---- desired range ---- |
	 * | state |
	 *
	 * Just lock what we found and keep going
	 */
	if (state->start == start && state->end <= end) {
		if (state->state & exclusive_bits) {
			*failed_start = state->start;
			err = -EEXIST;
			goto out;
		}

		set_state_bits(tree, state, &bits);
		cache_state(state, cached_state);
		merge_state(tree, state);
		if (last_end == (u64)-1)
			goto out;
		start = last_end + 1;
		state = next_state(state);
		if (start < end && state && state->start == start &&
		    !need_resched())
			goto hit_next;
		goto search_again;
	}

	/*
	 *     | ---- desired range ---- |
	 * | state |
	 *   or
	 * | ------------- state -------------- |
	 *
	 * We need to split the extent we found, and may flip bits on
	 * second half.
	 *
	 * If the extent we found extends past our
	 * range, we just split and search again. It'll get split
	 * again the next time though.
	 *
	 * If the extent we found is inside our range, we set the
	 * desired bit on it.
	 */
	if (state->start < start) {
		if (state->state & exclusive_bits) {
			*failed_start = start;
			err = -EEXIST;
			goto out;
		}

		prealloc = alloc_extent_state_atomic(prealloc);
		BUG_ON(!prealloc);
		err = split_state(tree, state, prealloc, start);
		if (err)
			extent_io_tree_panic(tree, err);

		prealloc = NULL;
		if (err)
			goto out;
		if (state->end <= end) {
			set_state_bits(tree, state, &bits);
			cache_state(state, cached_state);
			merge_state(tree, state);
			if (last_end == (u64)-1)
				goto out;
			start = last_end + 1;
			state = next_state(state);
			if (start < end && state && state->start == start &&
			    !need_resched())
				goto hit_next;
		}
		goto search_again;
	}
	/*
	 * | ---- desired range ---- |
	 *     | state | or               | state |
	 *
	 * There's a hole, we need to insert something in it and
	 * ignore the extent we found.
	 */
	if (state->start > start) {
		u64 this_end;
		if (end < last_start)
			this_end = end;
		else
			this_end = last_start - 1;

		prealloc = alloc_extent_state_atomic(prealloc);
		BUG_ON(!prealloc);

		/*
		 * Avoid freeing 'prealloc' if it can be merged with
		 * the later extent.
		 */
		err = insert_state(tree, prealloc, start, this_end,
				   &bits);
		if (err)
			extent_io_tree_panic(tree, err);

		cache_state(prealloc, cached_state);
		prealloc = NULL;
		start = this_end + 1;
		goto search_again;
	}
	/*
	 * | ---- desired range ---- |
	 *                        | state |
	 * We need to split the extent, and set the bit
	 * on the first half
	 */
	if (state->start <= end && state->end > end) {
		if (state->state & exclusive_bits) {
			*failed_start = start;
			err = -EEXIST;
			goto out;
		}

		prealloc = alloc_extent_state_atomic(prealloc);
		BUG_ON(!prealloc);
		err = split_state(tree, state, prealloc, end + 1);
		if (err)
			extent_io_tree_panic(tree, err);

		set_state_bits(tree, prealloc, &bits);
		cache_state(prealloc, cached_state);
		merge_state(tree, prealloc);
		prealloc = NULL;
		goto out;
	}

	goto search_again;

out:
	spin_unlock(&tree->lock);
	if (prealloc)
		free_extent_state(prealloc);

	return err;

search_again:
	if (start > end)
		goto out;
	spin_unlock(&tree->lock);
	if (mask & __GFP_WAIT)
		cond_resched();
	goto again;
}

int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
		   unsigned long bits, u64 *failed_start,
		   struct extent_state **cached_state, gfp_t mask)
{
	return __set_extent_bit(tree, start, end, bits, 0, failed_start,
				cached_state, mask);
}


/**
 * convert_extent_bit - convert all bits in a given range from one bit to
 *			another
 * @tree:	the io tree to search
 * @start:	the start offset in bytes
 * @end:	the end offset in bytes (inclusive)
 * @bits:	the bits to set in this range
 * @clear_bits:	the bits to clear in this range
 * @cached_state:	state that we're going to cache
 * @mask:	the allocation mask
 *
 * This will go through and set bits for the given range. If any states exist
 * already in this range they are set with the given bit and cleared of the
 * clear_bits. This is only meant to be used by things that are mergeable, i.e.
 * converting from say DELALLOC to DIRTY. This is not meant to be used with
 * boundary bits like LOCK.
 */
int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
		       unsigned long bits, unsigned long clear_bits,
		       struct extent_state **cached_state, gfp_t mask)
{
	struct extent_state *state;
	struct extent_state *prealloc = NULL;
	struct rb_node *node;
	int err = 0;
	u64 last_start;
	u64 last_end;

	btrfs_debug_check_extent_io_range(tree->mapping->host, start, end);

again:
	if (!prealloc && (mask & __GFP_WAIT)) {
		prealloc = alloc_extent_state(mask);
		if (!prealloc)
			return -ENOMEM;
	}

	spin_lock(&tree->lock);
	if (cached_state && *cached_state) {
		state = *cached_state;
		if (state->start <= start && state->end > start &&
		    state->tree) {
			node = &state->rb_node;
			goto hit_next;
		}
	}

	/*
	 * this search will find all the extents that end after
	 * our range starts.
	 */
	node = tree_search(tree, start);
	if (!node) {
		prealloc = alloc_extent_state_atomic(prealloc);
		if (!prealloc) {
			err = -ENOMEM;
			goto out;
		}
		err = insert_state(tree, prealloc, start, end, &bits);
		prealloc = NULL;
		if (err)
			extent_io_tree_panic(tree, err);
		goto out;
	}
	state = rb_entry(node, struct extent_state, rb_node);
hit_next:
	last_start = state->start;
	last_end = state->end;

	/*
	 * | ---- desired range ---- |
	 * | state |
	 *
	 * Just lock what we found and keep going
	 */
	if (state->start == start && state->end <= end) {
		set_state_bits(tree, state, &bits);
		cache_state(state, cached_state);
		state = clear_state_bit(tree, state, &clear_bits, 0);
		if (last_end == (u64)-1)
			goto out;
		start = last_end + 1;
		if (start < end && state && state->start == start &&
		    !need_resched())
			goto hit_next;
		goto search_again;
	}

	/*
	 *     | ---- desired range ---- |
	 * | state |
	 *   or
	 * | ------------- state -------------- |
	 *
	 * We need to split the extent we found, and may flip bits on
	 * second half.
	 *
	 * If the extent we found extends past our
	 * range, we just split and search again. It'll get split
	 * again the next time though.
	 *
	 * If the extent we found is inside our range, we set the
	 * desired bit on it.
	 */
	if (state->start < start) {
		prealloc = alloc_extent_state_atomic(prealloc);
		if (!prealloc) {
			err = -ENOMEM;
			goto out;
		}
		err = split_state(tree, state, prealloc, start);
		if (err)
			extent_io_tree_panic(tree, err);
		prealloc = NULL;
		if (err)
			goto out;
		if (state->end <= end) {
			set_state_bits(tree, state, &bits);
			cache_state(state, cached_state);
			state = clear_state_bit(tree, state, &clear_bits, 0);
			if (last_end == (u64)-1)
				goto out;
			start = last_end + 1;
			if (start < end && state && state->start == start &&
			    !need_resched())
				goto hit_next;
		}
		goto search_again;
	}
	/*
	 * | ---- desired range ---- |
	 *     | state | or               | state |
	 *
	 * There's a hole, we need to insert something in it and
	 * ignore the extent we found.
	 */
	if (state->start > start) {
		u64 this_end;
		if (end < last_start)
			this_end = end;
		else
			this_end = last_start - 1;

		prealloc = alloc_extent_state_atomic(prealloc);
		if (!prealloc) {
			err = -ENOMEM;
			goto out;
		}

		/*
		 * Avoid freeing 'prealloc' if it can be merged with
		 * the later extent.
		 */
		err = insert_state(tree, prealloc, start, this_end,
				   &bits);
		if (err)
			extent_io_tree_panic(tree, err);
		cache_state(prealloc, cached_state);
		prealloc = NULL;
		start = this_end + 1;
		goto search_again;
	}
	/*
	 * | ---- desired range ---- |
	 *                        | state |
	 * We need to split the extent, and set the bit
	 * on the first half
	 */
	if (state->start <= end && state->end > end) {
		prealloc = alloc_extent_state_atomic(prealloc);
		if (!prealloc) {
			err = -ENOMEM;
			goto out;
		}

		err = split_state(tree, state, prealloc, end + 1);
		if (err)
			extent_io_tree_panic(tree, err);

		set_state_bits(tree, prealloc, &bits);
		cache_state(prealloc, cached_state);
		clear_state_bit(tree, prealloc, &clear_bits, 0);
		prealloc = NULL;
		goto out;
	}

	goto search_again;

out:
	spin_unlock(&tree->lock);
	if (prealloc)
		free_extent_state(prealloc);

	return err;

search_again:
	if (start > end)
		goto out;
	spin_unlock(&tree->lock);
	if (mask & __GFP_WAIT)
		cond_resched();
	goto again;
}
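
/*
 * Editor's note, not part of extent_io.c: a minimal sketch of how a caller
 * could use convert_extent_bit() as documented above, replacing one mergeable
 * bit with another over a byte range in a single pass. The helper name
 * example_convert_range() is hypothetical; tree, start and end are assumed
 * to be supplied by the caller.
 */
static int example_convert_range(struct extent_io_tree *tree, u64 start,
				 u64 end)
{
	/* set EXTENT_DIRTY and clear EXTENT_DELALLOC on [start, end] */
	return convert_extent_bit(tree, start, end, EXTENT_DIRTY,
				  EXTENT_DELALLOC, NULL, GFP_NOFS);
}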

/* wrappers around set/clear extent bit */
int set_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
		     gfp_t mask)
{
	return set_extent_bit(tree, start, end, EXTENT_DIRTY, NULL,
			      NULL, mask);
}

int set_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
		    unsigned long bits, gfp_t mask)
{
	return set_extent_bit(tree, start, end, bits, NULL,
			      NULL, mask);
}

int clear_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
		      unsigned long bits, gfp_t mask)
{
	return clear_extent_bit(tree, start, end, bits, 0, 0, NULL, mask);
}

int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
			struct extent_state **cached_state, gfp_t mask)
{
	return set_extent_bit(tree, start, end,
			      EXTENT_DELALLOC | EXTENT_UPTODATE,
			      NULL, cached_state, mask);
}

int set_extent_defrag(struct extent_io_tree *tree, u64 start, u64 end,
		      struct extent_state **cached_state, gfp_t mask)
{
	return set_extent_bit(tree, start, end,
			      EXTENT_DELALLOC | EXTENT_UPTODATE | EXTENT_DEFRAG,
			      NULL, cached_state, mask);
}

int clear_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
		       gfp_t mask)
{
	return clear_extent_bit(tree, start, end,
				EXTENT_DIRTY | EXTENT_DELALLOC |
				EXTENT_DO_ACCOUNTING, 0, 0, NULL, mask);
}

int set_extent_new(struct extent_io_tree *tree, u64 start, u64 end,
		   gfp_t mask)
{
	return set_extent_bit(tree, start, end, EXTENT_NEW, NULL,
			      NULL, mask);
}

int set_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
			struct extent_state **cached_state, gfp_t mask)
{
	return set_extent_bit(tree, start, end, EXTENT_UPTODATE, NULL,
			      cached_state, mask);
}

int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
			  struct extent_state **cached_state, gfp_t mask)
{
	return clear_extent_bit(tree, start, end, EXTENT_UPTODATE, 0, 0,
				cached_state, mask);
}
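
/*
 * Editor's note, not part of extent_io.c: the thin wrappers above let callers
 * tag a byte range with one call each. A minimal, hypothetical helper that
 * tags a range with EXTENT_NEW and then drops the tag again might look like
 * this; example_tag_range() is illustrative only.
 */
static int example_tag_range(struct extent_io_tree *tree, u64 start, u64 end)
{
	int ret;

	ret = set_extent_bits(tree, start, end, EXTENT_NEW, GFP_NOFS);
	if (ret)
		return ret;
	return clear_extent_bits(tree, start, end, EXTENT_NEW, GFP_NOFS);
}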

/*
 * either insert or lock state struct between start and end; use mask to tell
 * us if waiting is desired.
 */
int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
		     unsigned long bits, struct extent_state **cached_state)
{
	int err;
	u64 failed_start;
	while (1) {
		err = __set_extent_bit(tree, start, end, EXTENT_LOCKED | bits,
				       EXTENT_LOCKED, &failed_start,
				       cached_state, GFP_NOFS);
		if (err == -EEXIST) {
			wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED);
			start = failed_start;
		} else
			break;
		WARN_ON(start > end);
	}
	return err;
}

int lock_extent(struct extent_io_tree *tree, u64 start, u64 end)
{
	return lock_extent_bits(tree, start, end, 0, NULL);
}

int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end)
{
	int err;
	u64 failed_start;

	err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, EXTENT_LOCKED,
			       &failed_start, NULL, GFP_NOFS);
	if (err == -EEXIST) {
		if (failed_start > start)
			clear_extent_bit(tree, start, failed_start - 1,
					 EXTENT_LOCKED, 1, 0, NULL, GFP_NOFS);
		return 0;
	}
	return 1;
}

int unlock_extent_cached(struct extent_io_tree *tree, u64 start, u64 end,
			 struct extent_state **cached, gfp_t mask)
{
	return clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, cached,
				mask);
}

int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end)
{
	return clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, NULL,
				GFP_NOFS);
}
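
/*
 * Editor's note, not part of extent_io.c: the usual pattern with the locking
 * helpers above is lock, operate on the range, unlock, optionally threading a
 * cached extent_state through to avoid a second tree search (the same pattern
 * find_lock_delalloc_range() uses below). example_locked_op() is a
 * hypothetical sketch.
 */
static void example_locked_op(struct extent_io_tree *tree, u64 start, u64 end)
{
	struct extent_state *cached = NULL;

	lock_extent_bits(tree, start, end, 0, &cached);
	/* ... operate on [start, end] while it is EXTENT_LOCKED ... */
	unlock_extent_cached(tree, start, end, &cached, GFP_NOFS);
}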

int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
{
	unsigned long index = start >> PAGE_CACHE_SHIFT;
	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
	struct page *page;

	while (index <= end_index) {
		page = find_get_page(inode->i_mapping, index);
		BUG_ON(!page); /* Pages should be in the extent_io_tree */
		clear_page_dirty_for_io(page);
		page_cache_release(page);
		index++;
	}
	return 0;
}

int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
{
	unsigned long index = start >> PAGE_CACHE_SHIFT;
	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
	struct page *page;

	while (index <= end_index) {
		page = find_get_page(inode->i_mapping, index);
		BUG_ON(!page); /* Pages should be in the extent_io_tree */
		account_page_redirty(page);
		__set_page_dirty_nobuffers(page);
		page_cache_release(page);
		index++;
	}
	return 0;
}

/*
 * helper function to set both pages and extents in the tree writeback
 */
static int set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
{
	unsigned long index = start >> PAGE_CACHE_SHIFT;
	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
	struct page *page;

	while (index <= end_index) {
		page = find_get_page(tree->mapping, index);
		BUG_ON(!page); /* Pages should be in the extent_io_tree */
		set_page_writeback(page);
		page_cache_release(page);
		index++;
	}
	return 0;
}

/* find the first state struct with 'bits' set after 'start', and
 * return it. tree->lock must be held. NULL will be returned if
 * nothing was found after 'start'
 */
static struct extent_state *
find_first_extent_bit_state(struct extent_io_tree *tree,
			    u64 start, unsigned long bits)
{
	struct rb_node *node;
	struct extent_state *state;

	/*
	 * this search will find all the extents that end after
	 * our range starts.
	 */
	node = tree_search(tree, start);
	if (!node)
		goto out;

	while (1) {
		state = rb_entry(node, struct extent_state, rb_node);
		if (state->end >= start && (state->state & bits))
			return state;

		node = rb_next(node);
		if (!node)
			break;
	}
out:
	return NULL;
}

/*
 * find the first offset in the io tree with 'bits' set. zero is
 * returned if we find something, and *start_ret and *end_ret are
 * set to reflect the state struct that was found.
 *
 * If nothing was found, 1 is returned. If something was found, 0 is returned.
 */
int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
			  u64 *start_ret, u64 *end_ret, unsigned long bits,
			  struct extent_state **cached_state)
{
	struct extent_state *state;
	struct rb_node *n;
	int ret = 1;

	spin_lock(&tree->lock);
	if (cached_state && *cached_state) {
		state = *cached_state;
		if (state->end == start - 1 && state->tree) {
			n = rb_next(&state->rb_node);
			while (n) {
				state = rb_entry(n, struct extent_state,
						 rb_node);
				if (state->state & bits)
					goto got_it;
				n = rb_next(n);
			}
			free_extent_state(*cached_state);
			*cached_state = NULL;
			goto out;
		}
		free_extent_state(*cached_state);
		*cached_state = NULL;
	}

	state = find_first_extent_bit_state(tree, start, bits);
got_it:
	if (state) {
		cache_state(state, cached_state);
		*start_ret = state->start;
		*end_ret = state->end;
		ret = 0;
	}
out:
	spin_unlock(&tree->lock);
	return ret;
}
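
/*
 * Editor's note, not part of extent_io.c: a minimal sketch of walking every
 * range with a given bit set by repeatedly calling find_first_extent_bit(),
 * which returns 0 while ranges are found. example_walk_dirty() is
 * hypothetical.
 */
static void example_walk_dirty(struct extent_io_tree *tree)
{
	u64 start = 0;
	u64 found_start;
	u64 found_end;

	while (!find_first_extent_bit(tree, start, &found_start, &found_end,
				      EXTENT_DIRTY, NULL)) {
		/* [found_start, found_end] has EXTENT_DIRTY set */
		if (found_end == (u64)-1)
			break;
		start = found_end + 1;
	}
}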

/*
 * find a contiguous range of bytes in the file marked as delalloc, not
 * more than 'max_bytes'. start and end are used to return the range.
 *
 * 1 is returned if we find something, 0 if nothing was in the tree
 */
static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
					u64 *start, u64 *end, u64 max_bytes,
					struct extent_state **cached_state)
{
	struct rb_node *node;
	struct extent_state *state;
	u64 cur_start = *start;
	u64 found = 0;
	u64 total_bytes = 0;

	spin_lock(&tree->lock);

	/*
	 * this search will find all the extents that end after
	 * our range starts.
	 */
	node = tree_search(tree, cur_start);
	if (!node) {
		if (!found)
			*end = (u64)-1;
		goto out;
	}

	while (1) {
		state = rb_entry(node, struct extent_state, rb_node);
		if (found && (state->start != cur_start ||
			      (state->state & EXTENT_BOUNDARY))) {
			goto out;
		}
		if (!(state->state & EXTENT_DELALLOC)) {
			if (!found)
				*end = state->end;
			goto out;
		}
		if (!found) {
			*start = state->start;
			*cached_state = state;
			atomic_inc(&state->refs);
		}
		found++;
		*end = state->end;
		cur_start = state->end + 1;
		node = rb_next(node);
		total_bytes += state->end - state->start + 1;
		if (total_bytes >= max_bytes)
			break;
		if (!node)
			break;
	}
out:
	spin_unlock(&tree->lock);
	return found;
}

static noinline void __unlock_for_delalloc(struct inode *inode,
					   struct page *locked_page,
					   u64 start, u64 end)
{
	int ret;
	struct page *pages[16];
	unsigned long index = start >> PAGE_CACHE_SHIFT;
	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
	unsigned long nr_pages = end_index - index + 1;
	int i;

	if (index == locked_page->index && end_index == index)
		return;

	while (nr_pages > 0) {
		ret = find_get_pages_contig(inode->i_mapping, index,
					    min_t(unsigned long, nr_pages,
					    ARRAY_SIZE(pages)), pages);
		for (i = 0; i < ret; i++) {
			if (pages[i] != locked_page)
				unlock_page(pages[i]);
			page_cache_release(pages[i]);
		}
		nr_pages -= ret;
		index += ret;
		cond_resched();
	}
}

static noinline int lock_delalloc_pages(struct inode *inode,
					struct page *locked_page,
					u64 delalloc_start,
					u64 delalloc_end)
{
	unsigned long index = delalloc_start >> PAGE_CACHE_SHIFT;
	unsigned long start_index = index;
	unsigned long end_index = delalloc_end >> PAGE_CACHE_SHIFT;
	unsigned long pages_locked = 0;
	struct page *pages[16];
	unsigned long nrpages;
	int ret;
	int i;

	/* the caller is responsible for locking the start index */
	if (index == locked_page->index && index == end_index)
		return 0;

	/* skip the page at the start index */
	nrpages = end_index - index + 1;
	while (nrpages > 0) {
		ret = find_get_pages_contig(inode->i_mapping, index,
					    min_t(unsigned long,
					    nrpages, ARRAY_SIZE(pages)), pages);
		if (ret == 0) {
			ret = -EAGAIN;
			goto done;
		}
		/* now we have an array of pages, lock them all */
		for (i = 0; i < ret; i++) {
			/*
			 * the caller is taking responsibility for
			 * locked_page
			 */
			if (pages[i] != locked_page) {
				lock_page(pages[i]);
				if (!PageDirty(pages[i]) ||
				    pages[i]->mapping != inode->i_mapping) {
					ret = -EAGAIN;
					unlock_page(pages[i]);
					page_cache_release(pages[i]);
					goto done;
				}
			}
			page_cache_release(pages[i]);
			pages_locked++;
		}
		nrpages -= ret;
		index += ret;
		cond_resched();
	}
	ret = 0;
done:
	if (ret && pages_locked) {
		__unlock_for_delalloc(inode, locked_page,
			      delalloc_start,
			      ((u64)(start_index + pages_locked - 1)) <<
			      PAGE_CACHE_SHIFT);
	}
	return ret;
}

/*
 * find a contiguous range of bytes in the file marked as delalloc, not
 * more than 'max_bytes'. start and end are used to return the range.
 *
 * 1 is returned if we find something, 0 if nothing was in the tree
 */
static noinline u64 find_lock_delalloc_range(struct inode *inode,
					     struct extent_io_tree *tree,
					     struct page *locked_page,
					     u64 *start, u64 *end,
					     u64 max_bytes)
{
	u64 delalloc_start;
	u64 delalloc_end;
	u64 found;
	struct extent_state *cached_state = NULL;
	int ret;
	int loops = 0;

again:
	/* step one, find a bunch of delalloc bytes starting at start */
	delalloc_start = *start;
	delalloc_end = 0;
	found = find_delalloc_range(tree, &delalloc_start, &delalloc_end,
				    max_bytes, &cached_state);
	if (!found || delalloc_end <= *start) {
		*start = delalloc_start;
		*end = delalloc_end;
		free_extent_state(cached_state);
		return 0;
	}

	/*
	 * start comes from the offset of locked_page. We have to lock
	 * pages in order, so we can't process delalloc bytes before
	 * locked_page
	 */
	if (delalloc_start < *start)
		delalloc_start = *start;

	/*
	 * make sure to limit the number of pages we try to lock down
	 */
	if (delalloc_end + 1 - delalloc_start > max_bytes)
		delalloc_end = delalloc_start + max_bytes - 1;

	/* step two, lock all the pages after the page that has start */
	ret = lock_delalloc_pages(inode, locked_page,
				  delalloc_start, delalloc_end);
	if (ret == -EAGAIN) {
		/* some of the pages are gone, let's avoid looping by
		 * shortening the size of the delalloc range we're searching
		 */
		free_extent_state(cached_state);
		cached_state = NULL;
		if (!loops) {
			max_bytes = PAGE_CACHE_SIZE;
			loops = 1;
			goto again;
		} else {
			found = 0;
			goto out_failed;
		}
	}
	BUG_ON(ret);	/* Only valid values are 0 and -EAGAIN */

	/* step three, lock the state bits for the whole range */
	lock_extent_bits(tree, delalloc_start, delalloc_end, 0, &cached_state);

	/* then test to make sure it is all still delalloc */
	ret = test_range_bit(tree, delalloc_start, delalloc_end,
			     EXTENT_DELALLOC, 1, cached_state);
	if (!ret) {
		unlock_extent_cached(tree, delalloc_start, delalloc_end,
				     &cached_state, GFP_NOFS);
		__unlock_for_delalloc(inode, locked_page,
			      delalloc_start, delalloc_end);
		cond_resched();
		goto again;
	}
	free_extent_state(cached_state);
	*start = delalloc_start;
	*end = delalloc_end;
out_failed:
	return found;
}

int extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
				 struct page *locked_page,
				 unsigned long clear_bits,
				 unsigned long page_ops)
{
	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
	int ret;
	struct page *pages[16];
	unsigned long index = start >> PAGE_CACHE_SHIFT;
	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
	unsigned long nr_pages = end_index - index + 1;
	int i;

	clear_extent_bit(tree, start, end, clear_bits, 1, 0, NULL, GFP_NOFS);
	if (page_ops == 0)
		return 0;

	while (nr_pages > 0) {
		ret = find_get_pages_contig(inode->i_mapping, index,
				     min_t(unsigned long,
				     nr_pages, ARRAY_SIZE(pages)), pages);
		for (i = 0; i < ret; i++) {

			if (page_ops & PAGE_SET_PRIVATE2)
				SetPagePrivate2(pages[i]);

			if (pages[i] == locked_page) {
				page_cache_release(pages[i]);
				continue;
			}
			if (page_ops & PAGE_CLEAR_DIRTY)
				clear_page_dirty_for_io(pages[i]);
			if (page_ops & PAGE_SET_WRITEBACK)
				set_page_writeback(pages[i]);
			if (page_ops & PAGE_END_WRITEBACK)
				end_page_writeback(pages[i]);
			if (page_ops & PAGE_UNLOCK)
				unlock_page(pages[i]);
			page_cache_release(pages[i]);
		}
		nr_pages -= ret;
		index += ret;
		cond_resched();
	}
	return 0;
}
1727 | 1727 | ||
1728 | /* | 1728 | /* |
1729 | * count the number of bytes in the tree that have a given bit(s) | 1729 | * count the number of bytes in the tree that have a given bit(s) |
1730 | * set. This can be fairly slow, except for EXTENT_DIRTY which is | 1730 | * set. This can be fairly slow, except for EXTENT_DIRTY which is |
1731 | * cached. The total number found is returned. | 1731 | * cached. The total number found is returned. |
1732 | */ | 1732 | */ |
1733 | u64 count_range_bits(struct extent_io_tree *tree, | 1733 | u64 count_range_bits(struct extent_io_tree *tree, |
1734 | u64 *start, u64 search_end, u64 max_bytes, | 1734 | u64 *start, u64 search_end, u64 max_bytes, |
1735 | unsigned long bits, int contig) | 1735 | unsigned long bits, int contig) |
1736 | { | 1736 | { |
1737 | struct rb_node *node; | 1737 | struct rb_node *node; |
1738 | struct extent_state *state; | 1738 | struct extent_state *state; |
1739 | u64 cur_start = *start; | 1739 | u64 cur_start = *start; |
1740 | u64 total_bytes = 0; | 1740 | u64 total_bytes = 0; |
1741 | u64 last = 0; | 1741 | u64 last = 0; |
1742 | int found = 0; | 1742 | int found = 0; |
1743 | 1743 | ||
1744 | if (search_end <= cur_start) { | 1744 | if (search_end <= cur_start) { |
1745 | WARN_ON(1); | 1745 | WARN_ON(1); |
1746 | return 0; | 1746 | return 0; |
1747 | } | 1747 | } |
1748 | 1748 | ||
1749 | spin_lock(&tree->lock); | 1749 | spin_lock(&tree->lock); |
1750 | if (cur_start == 0 && bits == EXTENT_DIRTY) { | 1750 | if (cur_start == 0 && bits == EXTENT_DIRTY) { |
1751 | total_bytes = tree->dirty_bytes; | 1751 | total_bytes = tree->dirty_bytes; |
1752 | goto out; | 1752 | goto out; |
1753 | } | 1753 | } |
1754 | /* | 1754 | /* |
1755 | * this search will find all the extents that end after | 1755 | * this search will find all the extents that end after |
1756 | * our range starts. | 1756 | * our range starts. |
1757 | */ | 1757 | */ |
1758 | node = tree_search(tree, cur_start); | 1758 | node = tree_search(tree, cur_start); |
1759 | if (!node) | 1759 | if (!node) |
1760 | goto out; | 1760 | goto out; |
1761 | 1761 | ||
1762 | while (1) { | 1762 | while (1) { |
1763 | state = rb_entry(node, struct extent_state, rb_node); | 1763 | state = rb_entry(node, struct extent_state, rb_node); |
1764 | if (state->start > search_end) | 1764 | if (state->start > search_end) |
1765 | break; | 1765 | break; |
1766 | if (contig && found && state->start > last + 1) | 1766 | if (contig && found && state->start > last + 1) |
1767 | break; | 1767 | break; |
1768 | if (state->end >= cur_start && (state->state & bits) == bits) { | 1768 | if (state->end >= cur_start && (state->state & bits) == bits) { |
1769 | total_bytes += min(search_end, state->end) + 1 - | 1769 | total_bytes += min(search_end, state->end) + 1 - |
1770 | max(cur_start, state->start); | 1770 | max(cur_start, state->start); |
1771 | if (total_bytes >= max_bytes) | 1771 | if (total_bytes >= max_bytes) |
1772 | break; | 1772 | break; |
1773 | if (!found) { | 1773 | if (!found) { |
1774 | *start = max(cur_start, state->start); | 1774 | *start = max(cur_start, state->start); |
1775 | found = 1; | 1775 | found = 1; |
1776 | } | 1776 | } |
1777 | last = state->end; | 1777 | last = state->end; |
1778 | } else if (contig && found) { | 1778 | } else if (contig && found) { |
1779 | break; | 1779 | break; |
1780 | } | 1780 | } |
1781 | node = rb_next(node); | 1781 | node = rb_next(node); |
1782 | if (!node) | 1782 | if (!node) |
1783 | break; | 1783 | break; |
1784 | } | 1784 | } |
1785 | out: | 1785 | out: |
1786 | spin_unlock(&tree->lock); | 1786 | spin_unlock(&tree->lock); |
1787 | return total_bytes; | 1787 | return total_bytes; |
1788 | } | 1788 | } |
1789 | 1789 | ||
1790 | /* | 1790 | /* |
1791 | * set the private field for a given byte offset in the tree. If there isn't | 1791 | * set the private field for a given byte offset in the tree. If there isn't |
1792 | * an extent_state there already, this does nothing. | 1792 | * an extent_state there already, this does nothing. |
1793 | */ | 1793 | */ |
1794 | static int set_state_private(struct extent_io_tree *tree, u64 start, u64 private) | 1794 | static int set_state_private(struct extent_io_tree *tree, u64 start, u64 private) |
1795 | { | 1795 | { |
1796 | struct rb_node *node; | 1796 | struct rb_node *node; |
1797 | struct extent_state *state; | 1797 | struct extent_state *state; |
1798 | int ret = 0; | 1798 | int ret = 0; |
1799 | 1799 | ||
1800 | spin_lock(&tree->lock); | 1800 | spin_lock(&tree->lock); |
1801 | /* | 1801 | /* |
1802 | * this search will find all the extents that end after | 1802 | * this search will find all the extents that end after |
1803 | * our range starts. | 1803 | * our range starts. |
1804 | */ | 1804 | */ |
1805 | node = tree_search(tree, start); | 1805 | node = tree_search(tree, start); |
1806 | if (!node) { | 1806 | if (!node) { |
1807 | ret = -ENOENT; | 1807 | ret = -ENOENT; |
1808 | goto out; | 1808 | goto out; |
1809 | } | 1809 | } |
1810 | state = rb_entry(node, struct extent_state, rb_node); | 1810 | state = rb_entry(node, struct extent_state, rb_node); |
1811 | if (state->start != start) { | 1811 | if (state->start != start) { |
1812 | ret = -ENOENT; | 1812 | ret = -ENOENT; |
1813 | goto out; | 1813 | goto out; |
1814 | } | 1814 | } |
1815 | state->private = private; | 1815 | state->private = private; |
1816 | out: | 1816 | out: |
1817 | spin_unlock(&tree->lock); | 1817 | spin_unlock(&tree->lock); |
1818 | return ret; | 1818 | return ret; |
1819 | } | 1819 | } |
1820 | 1820 | ||
1821 | int get_state_private(struct extent_io_tree *tree, u64 start, u64 *private) | 1821 | int get_state_private(struct extent_io_tree *tree, u64 start, u64 *private) |
1822 | { | 1822 | { |
1823 | struct rb_node *node; | 1823 | struct rb_node *node; |
1824 | struct extent_state *state; | 1824 | struct extent_state *state; |
1825 | int ret = 0; | 1825 | int ret = 0; |
1826 | 1826 | ||
1827 | spin_lock(&tree->lock); | 1827 | spin_lock(&tree->lock); |
1828 | /* | 1828 | /* |
1829 | * this search will find all the extents that end after | 1829 | * this search will find all the extents that end after |
1830 | * our range starts. | 1830 | * our range starts. |
1831 | */ | 1831 | */ |
1832 | node = tree_search(tree, start); | 1832 | node = tree_search(tree, start); |
1833 | if (!node) { | 1833 | if (!node) { |
1834 | ret = -ENOENT; | 1834 | ret = -ENOENT; |
1835 | goto out; | 1835 | goto out; |
1836 | } | 1836 | } |
1837 | state = rb_entry(node, struct extent_state, rb_node); | 1837 | state = rb_entry(node, struct extent_state, rb_node); |
1838 | if (state->start != start) { | 1838 | if (state->start != start) { |
1839 | ret = -ENOENT; | 1839 | ret = -ENOENT; |
1840 | goto out; | 1840 | goto out; |
1841 | } | 1841 | } |
1842 | *private = state->private; | 1842 | *private = state->private; |
1843 | out: | 1843 | out: |
1844 | spin_unlock(&tree->lock); | 1844 | spin_unlock(&tree->lock); |
1845 | return ret; | 1845 | return ret; |
1846 | } | 1846 | } |
1847 | 1847 | ||
1848 | /* | 1848 | /* |
1849 | * searches a range in the state tree for a given mask. | 1849 | * searches a range in the state tree for a given mask. |
1850 | * If 'filled' == 1, this returns 1 only if every extent in the tree | 1850 | * If 'filled' == 1, this returns 1 only if every extent in the tree |
1851 | * has the bits set. Otherwise, 1 is returned if any bit in the | 1851 | * has the bits set. Otherwise, 1 is returned if any bit in the |
1852 | * range is found set. | 1852 | * range is found set. |
1853 | */ | 1853 | */ |
1854 | int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, | 1854 | int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, |
1855 | unsigned long bits, int filled, struct extent_state *cached) | 1855 | unsigned long bits, int filled, struct extent_state *cached) |
1856 | { | 1856 | { |
1857 | struct extent_state *state = NULL; | 1857 | struct extent_state *state = NULL; |
1858 | struct rb_node *node; | 1858 | struct rb_node *node; |
1859 | int bitset = 0; | 1859 | int bitset = 0; |
1860 | 1860 | ||
1861 | spin_lock(&tree->lock); | 1861 | spin_lock(&tree->lock); |
1862 | if (cached && cached->tree && cached->start <= start && | 1862 | if (cached && cached->tree && cached->start <= start && |
1863 | cached->end > start) | 1863 | cached->end > start) |
1864 | node = &cached->rb_node; | 1864 | node = &cached->rb_node; |
1865 | else | 1865 | else |
1866 | node = tree_search(tree, start); | 1866 | node = tree_search(tree, start); |
1867 | while (node && start <= end) { | 1867 | while (node && start <= end) { |
1868 | state = rb_entry(node, struct extent_state, rb_node); | 1868 | state = rb_entry(node, struct extent_state, rb_node); |
1869 | 1869 | ||
1870 | if (filled && state->start > start) { | 1870 | if (filled && state->start > start) { |
1871 | bitset = 0; | 1871 | bitset = 0; |
1872 | break; | 1872 | break; |
1873 | } | 1873 | } |
1874 | 1874 | ||
1875 | if (state->start > end) | 1875 | if (state->start > end) |
1876 | break; | 1876 | break; |
1877 | 1877 | ||
1878 | if (state->state & bits) { | 1878 | if (state->state & bits) { |
1879 | bitset = 1; | 1879 | bitset = 1; |
1880 | if (!filled) | 1880 | if (!filled) |
1881 | break; | 1881 | break; |
1882 | } else if (filled) { | 1882 | } else if (filled) { |
1883 | bitset = 0; | 1883 | bitset = 0; |
1884 | break; | 1884 | break; |
1885 | } | 1885 | } |
1886 | 1886 | ||
1887 | if (state->end == (u64)-1) | 1887 | if (state->end == (u64)-1) |
1888 | break; | 1888 | break; |
1889 | 1889 | ||
1890 | start = state->end + 1; | 1890 | start = state->end + 1; |
1891 | if (start > end) | 1891 | if (start > end) |
1892 | break; | 1892 | break; |
1893 | node = rb_next(node); | 1893 | node = rb_next(node); |
1894 | if (!node) { | 1894 | if (!node) { |
1895 | if (filled) | 1895 | if (filled) |
1896 | bitset = 0; | 1896 | bitset = 0; |
1897 | break; | 1897 | break; |
1898 | } | 1898 | } |
1899 | } | 1899 | } |
1900 | spin_unlock(&tree->lock); | 1900 | spin_unlock(&tree->lock); |
1901 | return bitset; | 1901 | return bitset; |
1902 | } | 1902 | } |
1903 | 1903 | ||
1904 | /* | 1904 | /* |
1905 | * helper function to set a given page up to date if all the | 1905 | * helper function to set a given page up to date if all the |
1906 | * extents in the tree for that page are up to date | 1906 | * extents in the tree for that page are up to date |
1907 | */ | 1907 | */ |
1908 | static void check_page_uptodate(struct extent_io_tree *tree, struct page *page) | 1908 | static void check_page_uptodate(struct extent_io_tree *tree, struct page *page) |
1909 | { | 1909 | { |
1910 | u64 start = page_offset(page); | 1910 | u64 start = page_offset(page); |
1911 | u64 end = start + PAGE_CACHE_SIZE - 1; | 1911 | u64 end = start + PAGE_CACHE_SIZE - 1; |
1912 | if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL)) | 1912 | if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL)) |
1913 | SetPageUptodate(page); | 1913 | SetPageUptodate(page); |
1914 | } | 1914 | } |
1915 | 1915 | ||
1916 | /* | 1916 | /* |
1917 | * When IO fails, either with EIO or csum verification fails, we | 1917 | * When IO fails, either with EIO or csum verification fails, we |
1918 | * try other mirrors that might have a good copy of the data. This | 1918 | * try other mirrors that might have a good copy of the data. This |
1919 | * io_failure_record is used to record state as we go through all the | 1919 | * io_failure_record is used to record state as we go through all the |
1920 | * mirrors. If another mirror has good data, the page is set up to date | 1920 | * mirrors. If another mirror has good data, the page is set up to date |
1921 | * and things continue. If a good mirror can't be found, the original | 1921 | * and things continue. If a good mirror can't be found, the original |
1922 | * bio end_io callback is called to indicate things have failed. | 1922 | * bio end_io callback is called to indicate things have failed. |
1923 | */ | 1923 | */ |
1924 | struct io_failure_record { | 1924 | struct io_failure_record { |
1925 | struct page *page; | 1925 | struct page *page; |
1926 | u64 start; | 1926 | u64 start; |
1927 | u64 len; | 1927 | u64 len; |
1928 | u64 logical; | 1928 | u64 logical; |
1929 | unsigned long bio_flags; | 1929 | unsigned long bio_flags; |
1930 | int this_mirror; | 1930 | int this_mirror; |
1931 | int failed_mirror; | 1931 | int failed_mirror; |
1932 | int in_validation; | 1932 | int in_validation; |
1933 | }; | 1933 | }; |
1934 | 1934 | ||
1935 | static int free_io_failure(struct inode *inode, struct io_failure_record *rec, | 1935 | static int free_io_failure(struct inode *inode, struct io_failure_record *rec, |
1936 | int did_repair) | 1936 | int did_repair) |
1937 | { | 1937 | { |
1938 | int ret; | 1938 | int ret; |
1939 | int err = 0; | 1939 | int err = 0; |
1940 | struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree; | 1940 | struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree; |
1941 | 1941 | ||
1942 | set_state_private(failure_tree, rec->start, 0); | 1942 | set_state_private(failure_tree, rec->start, 0); |
1943 | ret = clear_extent_bits(failure_tree, rec->start, | 1943 | ret = clear_extent_bits(failure_tree, rec->start, |
1944 | rec->start + rec->len - 1, | 1944 | rec->start + rec->len - 1, |
1945 | EXTENT_LOCKED | EXTENT_DIRTY, GFP_NOFS); | 1945 | EXTENT_LOCKED | EXTENT_DIRTY, GFP_NOFS); |
1946 | if (ret) | 1946 | if (ret) |
1947 | err = ret; | 1947 | err = ret; |
1948 | 1948 | ||
1949 | ret = clear_extent_bits(&BTRFS_I(inode)->io_tree, rec->start, | 1949 | ret = clear_extent_bits(&BTRFS_I(inode)->io_tree, rec->start, |
1950 | rec->start + rec->len - 1, | 1950 | rec->start + rec->len - 1, |
1951 | EXTENT_DAMAGED, GFP_NOFS); | 1951 | EXTENT_DAMAGED, GFP_NOFS); |
1952 | if (ret && !err) | 1952 | if (ret && !err) |
1953 | err = ret; | 1953 | err = ret; |
1954 | 1954 | ||
1955 | kfree(rec); | 1955 | kfree(rec); |
1956 | return err; | 1956 | return err; |
1957 | } | 1957 | } |
1958 | 1958 | ||
1959 | static void repair_io_failure_callback(struct bio *bio, int err) | 1959 | static void repair_io_failure_callback(struct bio *bio, int err) |
1960 | { | 1960 | { |
1961 | complete(bio->bi_private); | 1961 | complete(bio->bi_private); |
1962 | } | 1962 | } |
1963 | 1963 | ||
1964 | /* | 1964 | /* |
1965 | * this bypasses the standard btrfs submit functions deliberately, as | 1965 | * this bypasses the standard btrfs submit functions deliberately, as |
1966 | * the standard behavior is to write all copies in a raid setup. here we only | 1966 | * the standard behavior is to write all copies in a raid setup. here we only |
1967 | * want to write the one bad copy. so we do the mapping for ourselves and issue | 1967 | * want to write the one bad copy. so we do the mapping for ourselves and issue |
1968 | * submit_bio directly. | 1968 | * submit_bio directly. |
1969 | * to avoid any synchronization issues, wait for the data after writing, which | 1969 | * to avoid any synchronization issues, wait for the data after writing, which |
1970 | * actually prevents the read that triggered the error from finishing. | 1970 | * actually prevents the read that triggered the error from finishing. |
1971 | * currently, there can be no more than two copies of every data bit. thus, | 1971 | * currently, there can be no more than two copies of every data bit. thus, |
1972 | * exactly one rewrite is required. | 1972 | * exactly one rewrite is required. |
1973 | */ | 1973 | */ |
1974 | int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start, | 1974 | int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start, |
1975 | u64 length, u64 logical, struct page *page, | 1975 | u64 length, u64 logical, struct page *page, |
1976 | int mirror_num) | 1976 | int mirror_num) |
1977 | { | 1977 | { |
1978 | struct bio *bio; | 1978 | struct bio *bio; |
1979 | struct btrfs_device *dev; | 1979 | struct btrfs_device *dev; |
1980 | DECLARE_COMPLETION_ONSTACK(compl); | 1980 | DECLARE_COMPLETION_ONSTACK(compl); |
1981 | u64 map_length = 0; | 1981 | u64 map_length = 0; |
1982 | u64 sector; | 1982 | u64 sector; |
1983 | struct btrfs_bio *bbio = NULL; | 1983 | struct btrfs_bio *bbio = NULL; |
1984 | struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree; | 1984 | struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree; |
1985 | int ret; | 1985 | int ret; |
1986 | 1986 | ||
1987 | BUG_ON(!mirror_num); | 1987 | BUG_ON(!mirror_num); |
1988 | 1988 | ||
1989 | /* we can't repair anything in raid56 yet */ | 1989 | /* we can't repair anything in raid56 yet */ |
1990 | if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num)) | 1990 | if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num)) |
1991 | return 0; | 1991 | return 0; |
1992 | 1992 | ||
1993 | bio = btrfs_io_bio_alloc(GFP_NOFS, 1); | 1993 | bio = btrfs_io_bio_alloc(GFP_NOFS, 1); |
1994 | if (!bio) | 1994 | if (!bio) |
1995 | return -EIO; | 1995 | return -EIO; |
1996 | bio->bi_private = &compl; | 1996 | bio->bi_private = &compl; |
1997 | bio->bi_end_io = repair_io_failure_callback; | 1997 | bio->bi_end_io = repair_io_failure_callback; |
1998 | bio->bi_size = 0; | 1998 | bio->bi_size = 0; |
1999 | map_length = length; | 1999 | map_length = length; |
2000 | 2000 | ||
2001 | ret = btrfs_map_block(fs_info, WRITE, logical, | 2001 | ret = btrfs_map_block(fs_info, WRITE, logical, |
2002 | &map_length, &bbio, mirror_num); | 2002 | &map_length, &bbio, mirror_num); |
2003 | if (ret) { | 2003 | if (ret) { |
2004 | bio_put(bio); | 2004 | bio_put(bio); |
2005 | return -EIO; | 2005 | return -EIO; |
2006 | } | 2006 | } |
2007 | BUG_ON(mirror_num != bbio->mirror_num); | 2007 | BUG_ON(mirror_num != bbio->mirror_num); |
2008 | sector = bbio->stripes[mirror_num-1].physical >> 9; | 2008 | sector = bbio->stripes[mirror_num-1].physical >> 9; |
2009 | bio->bi_sector = sector; | 2009 | bio->bi_sector = sector; |
2010 | dev = bbio->stripes[mirror_num-1].dev; | 2010 | dev = bbio->stripes[mirror_num-1].dev; |
2011 | kfree(bbio); | 2011 | kfree(bbio); |
2012 | if (!dev || !dev->bdev || !dev->writeable) { | 2012 | if (!dev || !dev->bdev || !dev->writeable) { |
2013 | bio_put(bio); | 2013 | bio_put(bio); |
2014 | return -EIO; | 2014 | return -EIO; |
2015 | } | 2015 | } |
2016 | bio->bi_bdev = dev->bdev; | 2016 | bio->bi_bdev = dev->bdev; |
2017 | bio_add_page(bio, page, length, start - page_offset(page)); | 2017 | bio_add_page(bio, page, length, start - page_offset(page)); |
2018 | btrfsic_submit_bio(WRITE_SYNC, bio); | 2018 | btrfsic_submit_bio(WRITE_SYNC, bio); |
2019 | wait_for_completion(&compl); | 2019 | wait_for_completion(&compl); |
2020 | 2020 | ||
2021 | if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) { | 2021 | if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) { |
2022 | /* try to remap that extent elsewhere? */ | 2022 | /* try to remap that extent elsewhere? */ |
2023 | bio_put(bio); | 2023 | bio_put(bio); |
2024 | btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS); | 2024 | btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS); |
2025 | return -EIO; | 2025 | return -EIO; |
2026 | } | 2026 | } |
2027 | 2027 | ||
2028 | printk_ratelimited_in_rcu(KERN_INFO "btrfs read error corrected: ino %lu off %llu " | 2028 | printk_ratelimited_in_rcu(KERN_INFO "btrfs read error corrected: ino %lu off %llu " |
2029 | "(dev %s sector %llu)\n", page->mapping->host->i_ino, | 2029 | "(dev %s sector %llu)\n", page->mapping->host->i_ino, |
2030 | start, rcu_str_deref(dev->name), sector); | 2030 | start, rcu_str_deref(dev->name), sector); |
2031 | 2031 | ||
2032 | bio_put(bio); | 2032 | bio_put(bio); |
2033 | return 0; | 2033 | return 0; |
2034 | } | 2034 | } |
2035 | 2035 | ||
2036 | int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb, | 2036 | int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb, |
2037 | int mirror_num) | 2037 | int mirror_num) |
2038 | { | 2038 | { |
2039 | u64 start = eb->start; | 2039 | u64 start = eb->start; |
2040 | unsigned long i, num_pages = num_extent_pages(eb->start, eb->len); | 2040 | unsigned long i, num_pages = num_extent_pages(eb->start, eb->len); |
2041 | int ret = 0; | 2041 | int ret = 0; |
2042 | 2042 | ||
2043 | for (i = 0; i < num_pages; i++) { | 2043 | for (i = 0; i < num_pages; i++) { |
2044 | struct page *p = extent_buffer_page(eb, i); | 2044 | struct page *p = extent_buffer_page(eb, i); |
2045 | ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE, | 2045 | ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE, |
2046 | start, p, mirror_num); | 2046 | start, p, mirror_num); |
2047 | if (ret) | 2047 | if (ret) |
2048 | break; | 2048 | break; |
2049 | start += PAGE_CACHE_SIZE; | 2049 | start += PAGE_CACHE_SIZE; |
2050 | } | 2050 | } |
2051 | 2051 | ||
2052 | return ret; | 2052 | return ret; |
2053 | } | 2053 | } |
2054 | 2054 | ||
2055 | /* | 2055 | /* |
2056 | * each time an IO finishes, we do a fast check in the IO failure tree | 2056 | * each time an IO finishes, we do a fast check in the IO failure tree |
2057 | * to see if we need to process or clean up an io_failure_record | 2057 | * to see if we need to process or clean up an io_failure_record |
2058 | */ | 2058 | */ |
2059 | static int clean_io_failure(u64 start, struct page *page) | 2059 | static int clean_io_failure(u64 start, struct page *page) |
2060 | { | 2060 | { |
2061 | u64 private; | 2061 | u64 private; |
2062 | u64 private_failure; | 2062 | u64 private_failure; |
2063 | struct io_failure_record *failrec; | 2063 | struct io_failure_record *failrec; |
2064 | struct btrfs_fs_info *fs_info; | 2064 | struct btrfs_fs_info *fs_info; |
2065 | struct extent_state *state; | 2065 | struct extent_state *state; |
2066 | int num_copies; | 2066 | int num_copies; |
2067 | int did_repair = 0; | 2067 | int did_repair = 0; |
2068 | int ret; | 2068 | int ret; |
2069 | struct inode *inode = page->mapping->host; | 2069 | struct inode *inode = page->mapping->host; |
2070 | 2070 | ||
2071 | private = 0; | 2071 | private = 0; |
2072 | ret = count_range_bits(&BTRFS_I(inode)->io_failure_tree, &private, | 2072 | ret = count_range_bits(&BTRFS_I(inode)->io_failure_tree, &private, |
2073 | (u64)-1, 1, EXTENT_DIRTY, 0); | 2073 | (u64)-1, 1, EXTENT_DIRTY, 0); |
2074 | if (!ret) | 2074 | if (!ret) |
2075 | return 0; | 2075 | return 0; |
2076 | 2076 | ||
2077 | ret = get_state_private(&BTRFS_I(inode)->io_failure_tree, start, | 2077 | ret = get_state_private(&BTRFS_I(inode)->io_failure_tree, start, |
2078 | &private_failure); | 2078 | &private_failure); |
2079 | if (ret) | 2079 | if (ret) |
2080 | return 0; | 2080 | return 0; |
2081 | 2081 | ||
2082 | failrec = (struct io_failure_record *)(unsigned long) private_failure; | 2082 | failrec = (struct io_failure_record *)(unsigned long) private_failure; |
2083 | BUG_ON(!failrec->this_mirror); | 2083 | BUG_ON(!failrec->this_mirror); |
2084 | 2084 | ||
2085 | if (failrec->in_validation) { | 2085 | if (failrec->in_validation) { |
2086 | /* there was no real error, just free the record */ | 2086 | /* there was no real error, just free the record */ |
2087 | pr_debug("clean_io_failure: freeing dummy error at %llu\n", | 2087 | pr_debug("clean_io_failure: freeing dummy error at %llu\n", |
2088 | failrec->start); | 2088 | failrec->start); |
2089 | did_repair = 1; | 2089 | did_repair = 1; |
2090 | goto out; | 2090 | goto out; |
2091 | } | 2091 | } |
2092 | 2092 | ||
2093 | spin_lock(&BTRFS_I(inode)->io_tree.lock); | 2093 | spin_lock(&BTRFS_I(inode)->io_tree.lock); |
2094 | state = find_first_extent_bit_state(&BTRFS_I(inode)->io_tree, | 2094 | state = find_first_extent_bit_state(&BTRFS_I(inode)->io_tree, |
2095 | failrec->start, | 2095 | failrec->start, |
2096 | EXTENT_LOCKED); | 2096 | EXTENT_LOCKED); |
2097 | spin_unlock(&BTRFS_I(inode)->io_tree.lock); | 2097 | spin_unlock(&BTRFS_I(inode)->io_tree.lock); |
2098 | 2098 | ||
2099 | if (state && state->start <= failrec->start && | 2099 | if (state && state->start <= failrec->start && |
2100 | state->end >= failrec->start + failrec->len - 1) { | 2100 | state->end >= failrec->start + failrec->len - 1) { |
2101 | fs_info = BTRFS_I(inode)->root->fs_info; | 2101 | fs_info = BTRFS_I(inode)->root->fs_info; |
2102 | num_copies = btrfs_num_copies(fs_info, failrec->logical, | 2102 | num_copies = btrfs_num_copies(fs_info, failrec->logical, |
2103 | failrec->len); | 2103 | failrec->len); |
2104 | if (num_copies > 1) { | 2104 | if (num_copies > 1) { |
2105 | ret = repair_io_failure(fs_info, start, failrec->len, | 2105 | ret = repair_io_failure(fs_info, start, failrec->len, |
2106 | failrec->logical, page, | 2106 | failrec->logical, page, |
2107 | failrec->failed_mirror); | 2107 | failrec->failed_mirror); |
2108 | did_repair = !ret; | 2108 | did_repair = !ret; |
2109 | } | 2109 | } |
2110 | ret = 0; | 2110 | ret = 0; |
2111 | } | 2111 | } |
2112 | 2112 | ||
2113 | out: | 2113 | out: |
2114 | if (!ret) | 2114 | if (!ret) |
2115 | ret = free_io_failure(inode, failrec, did_repair); | 2115 | ret = free_io_failure(inode, failrec, did_repair); |
2116 | 2116 | ||
2117 | return ret; | 2117 | return ret; |
2118 | } | 2118 | } |
2119 | 2119 | ||
2120 | /* | 2120 | /* |
2121 | * this is a generic handler for readpage errors (default | 2121 | * this is a generic handler for readpage errors (default |
2122 | * readpage_io_failed_hook). if other copies exist, read those and write back | 2122 | * readpage_io_failed_hook). if other copies exist, read those and write back |
2123 | * good data to the failed position. does not investigate in remapping the | 2123 | * good data to the failed position. does not investigate in remapping the |
2124 | * failed extent elsewhere, hoping the device will be smart enough to do this as | 2124 | * failed extent elsewhere, hoping the device will be smart enough to do this as |
2125 | * needed | 2125 | * needed |
2126 | */ | 2126 | */ |
2127 | 2127 | ||
2128 | static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, | 2128 | static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, |
2129 | struct page *page, u64 start, u64 end, | 2129 | struct page *page, u64 start, u64 end, |
2130 | int failed_mirror) | 2130 | int failed_mirror) |
2131 | { | 2131 | { |
2132 | struct io_failure_record *failrec = NULL; | 2132 | struct io_failure_record *failrec = NULL; |
2133 | u64 private; | 2133 | u64 private; |
2134 | struct extent_map *em; | 2134 | struct extent_map *em; |
2135 | struct inode *inode = page->mapping->host; | 2135 | struct inode *inode = page->mapping->host; |
2136 | struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree; | 2136 | struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree; |
2137 | struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree; | 2137 | struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree; |
2138 | struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; | 2138 | struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; |
2139 | struct bio *bio; | 2139 | struct bio *bio; |
2140 | struct btrfs_io_bio *btrfs_failed_bio; | 2140 | struct btrfs_io_bio *btrfs_failed_bio; |
2141 | struct btrfs_io_bio *btrfs_bio; | 2141 | struct btrfs_io_bio *btrfs_bio; |
2142 | int num_copies; | 2142 | int num_copies; |
2143 | int ret; | 2143 | int ret; |
2144 | int read_mode; | 2144 | int read_mode; |
2145 | u64 logical; | 2145 | u64 logical; |
2146 | 2146 | ||
2147 | BUG_ON(failed_bio->bi_rw & REQ_WRITE); | 2147 | BUG_ON(failed_bio->bi_rw & REQ_WRITE); |
2148 | 2148 | ||
2149 | ret = get_state_private(failure_tree, start, &private); | 2149 | ret = get_state_private(failure_tree, start, &private); |
2150 | if (ret) { | 2150 | if (ret) { |
2151 | failrec = kzalloc(sizeof(*failrec), GFP_NOFS); | 2151 | failrec = kzalloc(sizeof(*failrec), GFP_NOFS); |
2152 | if (!failrec) | 2152 | if (!failrec) |
2153 | return -ENOMEM; | 2153 | return -ENOMEM; |
2154 | failrec->start = start; | 2154 | failrec->start = start; |
2155 | failrec->len = end - start + 1; | 2155 | failrec->len = end - start + 1; |
2156 | failrec->this_mirror = 0; | 2156 | failrec->this_mirror = 0; |
2157 | failrec->bio_flags = 0; | 2157 | failrec->bio_flags = 0; |
2158 | failrec->in_validation = 0; | 2158 | failrec->in_validation = 0; |
2159 | 2159 | ||
2160 | read_lock(&em_tree->lock); | 2160 | read_lock(&em_tree->lock); |
2161 | em = lookup_extent_mapping(em_tree, start, failrec->len); | 2161 | em = lookup_extent_mapping(em_tree, start, failrec->len); |
2162 | if (!em) { | 2162 | if (!em) { |
2163 | read_unlock(&em_tree->lock); | 2163 | read_unlock(&em_tree->lock); |
2164 | kfree(failrec); | 2164 | kfree(failrec); |
2165 | return -EIO; | 2165 | return -EIO; |
2166 | } | 2166 | } |
2167 | 2167 | ||
2168 | if (em->start > start || em->start + em->len < start) { | 2168 | if (em->start > start || em->start + em->len < start) { |
2169 | free_extent_map(em); | 2169 | free_extent_map(em); |
2170 | em = NULL; | 2170 | em = NULL; |
2171 | } | 2171 | } |
2172 | read_unlock(&em_tree->lock); | 2172 | read_unlock(&em_tree->lock); |
2173 | 2173 | ||
2174 | if (!em) { | 2174 | if (!em) { |
2175 | kfree(failrec); | 2175 | kfree(failrec); |
2176 | return -EIO; | 2176 | return -EIO; |
2177 | } | 2177 | } |
2178 | logical = start - em->start; | 2178 | logical = start - em->start; |
2179 | logical = em->block_start + logical; | 2179 | logical = em->block_start + logical; |
2180 | if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { | 2180 | if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { |
2181 | logical = em->block_start; | 2181 | logical = em->block_start; |
2182 | failrec->bio_flags = EXTENT_BIO_COMPRESSED; | 2182 | failrec->bio_flags = EXTENT_BIO_COMPRESSED; |
2183 | extent_set_compress_type(&failrec->bio_flags, | 2183 | extent_set_compress_type(&failrec->bio_flags, |
2184 | em->compress_type); | 2184 | em->compress_type); |
2185 | } | 2185 | } |
2186 | pr_debug("bio_readpage_error: (new) logical=%llu, start=%llu, " | 2186 | pr_debug("bio_readpage_error: (new) logical=%llu, start=%llu, " |
2187 | "len=%llu\n", logical, start, failrec->len); | 2187 | "len=%llu\n", logical, start, failrec->len); |
2188 | failrec->logical = logical; | 2188 | failrec->logical = logical; |
2189 | free_extent_map(em); | 2189 | free_extent_map(em); |
2190 | 2190 | ||
2191 | /* set the bits in the private failure tree */ | 2191 | /* set the bits in the private failure tree */ |
2192 | ret = set_extent_bits(failure_tree, start, end, | 2192 | ret = set_extent_bits(failure_tree, start, end, |
2193 | EXTENT_LOCKED | EXTENT_DIRTY, GFP_NOFS); | 2193 | EXTENT_LOCKED | EXTENT_DIRTY, GFP_NOFS); |
2194 | if (ret >= 0) | 2194 | if (ret >= 0) |
2195 | ret = set_state_private(failure_tree, start, | 2195 | ret = set_state_private(failure_tree, start, |
2196 | (u64)(unsigned long)failrec); | 2196 | (u64)(unsigned long)failrec); |
2197 | /* set the bits in the inode's tree */ | 2197 | /* set the bits in the inode's tree */ |
2198 | if (ret >= 0) | 2198 | if (ret >= 0) |
2199 | ret = set_extent_bits(tree, start, end, EXTENT_DAMAGED, | 2199 | ret = set_extent_bits(tree, start, end, EXTENT_DAMAGED, |
2200 | GFP_NOFS); | 2200 | GFP_NOFS); |
2201 | if (ret < 0) { | 2201 | if (ret < 0) { |
2202 | kfree(failrec); | 2202 | kfree(failrec); |
2203 | return ret; | 2203 | return ret; |
2204 | } | 2204 | } |
2205 | } else { | 2205 | } else { |
2206 | failrec = (struct io_failure_record *)(unsigned long)private; | 2206 | failrec = (struct io_failure_record *)(unsigned long)private; |
2207 | pr_debug("bio_readpage_error: (found) logical=%llu, " | 2207 | pr_debug("bio_readpage_error: (found) logical=%llu, " |
2208 | "start=%llu, len=%llu, validation=%d\n", | 2208 | "start=%llu, len=%llu, validation=%d\n", |
2209 | failrec->logical, failrec->start, failrec->len, | 2209 | failrec->logical, failrec->start, failrec->len, |
2210 | failrec->in_validation); | 2210 | failrec->in_validation); |
2211 | /* | 2211 | /* |
2212 | * when data can be on disk more than twice, add to failrec here | 2212 | * when data can be on disk more than twice, add to failrec here |
2213 | * (e.g. with a list for failed_mirror) to make | 2213 | * (e.g. with a list for failed_mirror) to make |
2214 | * clean_io_failure() clean all those errors at once. | 2214 | * clean_io_failure() clean all those errors at once. |
2215 | */ | 2215 | */ |
2216 | } | 2216 | } |
2217 | num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info, | 2217 | num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info, |
2218 | failrec->logical, failrec->len); | 2218 | failrec->logical, failrec->len); |
2219 | if (num_copies == 1) { | 2219 | if (num_copies == 1) { |
2220 | /* | 2220 | /* |
2221 | * we only have a single copy of the data, so don't bother with | 2221 | * we only have a single copy of the data, so don't bother with |
2222 | * all the retry and error correction code that follows. no | 2222 | * all the retry and error correction code that follows. no |
2223 | * matter what the error is, it is very likely to persist. | 2223 | * matter what the error is, it is very likely to persist. |
2224 | */ | 2224 | */ |
2225 | pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n", | 2225 | pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n", |
2226 | num_copies, failrec->this_mirror, failed_mirror); | 2226 | num_copies, failrec->this_mirror, failed_mirror); |
2227 | free_io_failure(inode, failrec, 0); | 2227 | free_io_failure(inode, failrec, 0); |
2228 | return -EIO; | 2228 | return -EIO; |
2229 | } | 2229 | } |
2230 | 2230 | ||
2231 | /* | 2231 | /* |
2232 | * there are two premises: | 2232 | * there are two premises: |
2233 | * a) deliver good data to the caller | 2233 | * a) deliver good data to the caller |
2234 | * b) correct the bad sectors on disk | 2234 | * b) correct the bad sectors on disk |
2235 | */ | 2235 | */ |
2236 | if (failed_bio->bi_vcnt > 1) { | 2236 | if (failed_bio->bi_vcnt > 1) { |
2237 | /* | 2237 | /* |
2238 | * to fulfill b), we need to know the exact failing sectors, as | 2238 | * to fulfill b), we need to know the exact failing sectors, as |
2239 | * we don't want to rewrite any more than the failed ones. thus, | 2239 | * we don't want to rewrite any more than the failed ones. thus, |
2240 | * we need separate read requests for the failed bio | 2240 | * we need separate read requests for the failed bio |
2241 | * | 2241 | * |
2242 | * if the following BUG_ON triggers, our validation request got | 2242 | * if the following BUG_ON triggers, our validation request got |
2243 | * merged. we need separate requests for our algorithm to work. | 2243 | * merged. we need separate requests for our algorithm to work. |
2244 | */ | 2244 | */ |
2245 | BUG_ON(failrec->in_validation); | 2245 | BUG_ON(failrec->in_validation); |
2246 | failrec->in_validation = 1; | 2246 | failrec->in_validation = 1; |
2247 | failrec->this_mirror = failed_mirror; | 2247 | failrec->this_mirror = failed_mirror; |
2248 | read_mode = READ_SYNC | REQ_FAILFAST_DEV; | 2248 | read_mode = READ_SYNC | REQ_FAILFAST_DEV; |
2249 | } else { | 2249 | } else { |
2250 | /* | 2250 | /* |
2251 | * we're ready to fulfill a) and b) alongside. get a good copy | 2251 | * we're ready to fulfill a) and b) alongside. get a good copy |
2252 | * of the failed sector and if we succeed, we have setup | 2252 | * of the failed sector and if we succeed, we have setup |
2253 | * everything for repair_io_failure to do the rest for us. | 2253 | * everything for repair_io_failure to do the rest for us. |
2254 | */ | 2254 | */ |
2255 | if (failrec->in_validation) { | 2255 | if (failrec->in_validation) { |
2256 | BUG_ON(failrec->this_mirror != failed_mirror); | 2256 | BUG_ON(failrec->this_mirror != failed_mirror); |
2257 | failrec->in_validation = 0; | 2257 | failrec->in_validation = 0; |
2258 | failrec->this_mirror = 0; | 2258 | failrec->this_mirror = 0; |
2259 | } | 2259 | } |
2260 | failrec->failed_mirror = failed_mirror; | 2260 | failrec->failed_mirror = failed_mirror; |
2261 | failrec->this_mirror++; | 2261 | failrec->this_mirror++; |
2262 | if (failrec->this_mirror == failed_mirror) | 2262 | if (failrec->this_mirror == failed_mirror) |
2263 | failrec->this_mirror++; | 2263 | failrec->this_mirror++; |
2264 | read_mode = READ_SYNC; | 2264 | read_mode = READ_SYNC; |
2265 | } | 2265 | } |
2266 | 2266 | ||
2267 | if (failrec->this_mirror > num_copies) { | 2267 | if (failrec->this_mirror > num_copies) { |
2268 | pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n", | 2268 | pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n", |
2269 | num_copies, failrec->this_mirror, failed_mirror); | 2269 | num_copies, failrec->this_mirror, failed_mirror); |
2270 | free_io_failure(inode, failrec, 0); | 2270 | free_io_failure(inode, failrec, 0); |
2271 | return -EIO; | 2271 | return -EIO; |
2272 | } | 2272 | } |
2273 | 2273 | ||
2274 | bio = btrfs_io_bio_alloc(GFP_NOFS, 1); | 2274 | bio = btrfs_io_bio_alloc(GFP_NOFS, 1); |
2275 | if (!bio) { | 2275 | if (!bio) { |
2276 | free_io_failure(inode, failrec, 0); | 2276 | free_io_failure(inode, failrec, 0); |
2277 | return -EIO; | 2277 | return -EIO; |
2278 | } | 2278 | } |
2279 | bio->bi_end_io = failed_bio->bi_end_io; | 2279 | bio->bi_end_io = failed_bio->bi_end_io; |
2280 | bio->bi_sector = failrec->logical >> 9; | 2280 | bio->bi_sector = failrec->logical >> 9; |
2281 | bio->bi_bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev; | 2281 | bio->bi_bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev; |
2282 | bio->bi_size = 0; | 2282 | bio->bi_size = 0; |
2283 | 2283 | ||
2284 | btrfs_failed_bio = btrfs_io_bio(failed_bio); | 2284 | btrfs_failed_bio = btrfs_io_bio(failed_bio); |
2285 | if (btrfs_failed_bio->csum) { | 2285 | if (btrfs_failed_bio->csum) { |
2286 | struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; | 2286 | struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; |
2287 | u16 csum_size = btrfs_super_csum_size(fs_info->super_copy); | 2287 | u16 csum_size = btrfs_super_csum_size(fs_info->super_copy); |
2288 | 2288 | ||
2289 | btrfs_bio = btrfs_io_bio(bio); | 2289 | btrfs_bio = btrfs_io_bio(bio); |
2290 | btrfs_bio->csum = btrfs_bio->csum_inline; | 2290 | btrfs_bio->csum = btrfs_bio->csum_inline; |
2291 | phy_offset >>= inode->i_sb->s_blocksize_bits; | 2291 | phy_offset >>= inode->i_sb->s_blocksize_bits; |
2292 | phy_offset *= csum_size; | 2292 | phy_offset *= csum_size; |
2293 | memcpy(btrfs_bio->csum, btrfs_failed_bio->csum + phy_offset, | 2293 | memcpy(btrfs_bio->csum, btrfs_failed_bio->csum + phy_offset, |
2294 | csum_size); | 2294 | csum_size); |
2295 | } | 2295 | } |
2296 | 2296 | ||
2297 | bio_add_page(bio, page, failrec->len, start - page_offset(page)); | 2297 | bio_add_page(bio, page, failrec->len, start - page_offset(page)); |
2298 | 2298 | ||
2299 | pr_debug("bio_readpage_error: submitting new read[%#x] to " | 2299 | pr_debug("bio_readpage_error: submitting new read[%#x] to " |
2300 | "this_mirror=%d, num_copies=%d, in_validation=%d\n", read_mode, | 2300 | "this_mirror=%d, num_copies=%d, in_validation=%d\n", read_mode, |
2301 | failrec->this_mirror, num_copies, failrec->in_validation); | 2301 | failrec->this_mirror, num_copies, failrec->in_validation); |
2302 | 2302 | ||
2303 | ret = tree->ops->submit_bio_hook(inode, read_mode, bio, | 2303 | ret = tree->ops->submit_bio_hook(inode, read_mode, bio, |
2304 | failrec->this_mirror, | 2304 | failrec->this_mirror, |
2305 | failrec->bio_flags, 0); | 2305 | failrec->bio_flags, 0); |
2306 | return ret; | 2306 | return ret; |
2307 | } | 2307 | } |
2308 | 2308 | ||
2309 | /* lots and lots of room for performance fixes in the end_bio funcs */ | 2309 | /* lots and lots of room for performance fixes in the end_bio funcs */ |
2310 | 2310 | ||
2311 | int end_extent_writepage(struct page *page, int err, u64 start, u64 end) | 2311 | int end_extent_writepage(struct page *page, int err, u64 start, u64 end) |
2312 | { | 2312 | { |
2313 | int uptodate = (err == 0); | 2313 | int uptodate = (err == 0); |
2314 | struct extent_io_tree *tree; | 2314 | struct extent_io_tree *tree; |
2315 | int ret = 0; | 2315 | int ret = 0; |
2316 | 2316 | ||
2317 | tree = &BTRFS_I(page->mapping->host)->io_tree; | 2317 | tree = &BTRFS_I(page->mapping->host)->io_tree; |
2318 | 2318 | ||
2319 | if (tree->ops && tree->ops->writepage_end_io_hook) { | 2319 | if (tree->ops && tree->ops->writepage_end_io_hook) { |
2320 | ret = tree->ops->writepage_end_io_hook(page, start, | 2320 | ret = tree->ops->writepage_end_io_hook(page, start, |
2321 | end, NULL, uptodate); | 2321 | end, NULL, uptodate); |
2322 | if (ret) | 2322 | if (ret) |
2323 | uptodate = 0; | 2323 | uptodate = 0; |
2324 | } | 2324 | } |
2325 | 2325 | ||
2326 | if (!uptodate) { | 2326 | if (!uptodate) { |
2327 | ClearPageUptodate(page); | 2327 | ClearPageUptodate(page); |
2328 | SetPageError(page); | 2328 | SetPageError(page); |
2329 | ret = ret < 0 ? ret : -EIO; | 2329 | ret = ret < 0 ? ret : -EIO; |
2330 | mapping_set_error(page->mapping, ret); | 2330 | mapping_set_error(page->mapping, ret); |
2331 | } | 2331 | } |
2332 | return 0; | 2332 | return 0; |
2333 | } | 2333 | } |
2334 | 2334 | ||
2335 | /* | 2335 | /* |
2336 | * after a writepage IO is done, we need to: | 2336 | * after a writepage IO is done, we need to: |
2337 | * clear the uptodate bits on error | 2337 | * clear the uptodate bits on error |
2338 | * clear the writeback bits in the extent tree for this IO | 2338 | * clear the writeback bits in the extent tree for this IO |
2339 | * end_page_writeback if the page has no more pending IO | 2339 | * end_page_writeback if the page has no more pending IO |
2340 | * | 2340 | * |
2341 | * Scheduling is not allowed, so the extent state tree is expected | 2341 | * Scheduling is not allowed, so the extent state tree is expected |
2342 | * to have one and only one object corresponding to this IO. | 2342 | * to have one and only one object corresponding to this IO. |
2343 | */ | 2343 | */ |
2344 | static void end_bio_extent_writepage(struct bio *bio, int err) | 2344 | static void end_bio_extent_writepage(struct bio *bio, int err) |
2345 | { | 2345 | { |
2346 | struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; | 2346 | struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; |
2347 | struct extent_io_tree *tree; | 2347 | struct extent_io_tree *tree; |
2348 | u64 start; | 2348 | u64 start; |
2349 | u64 end; | 2349 | u64 end; |
2350 | 2350 | ||
2351 | do { | 2351 | do { |
2352 | struct page *page = bvec->bv_page; | 2352 | struct page *page = bvec->bv_page; |
2353 | tree = &BTRFS_I(page->mapping->host)->io_tree; | 2353 | tree = &BTRFS_I(page->mapping->host)->io_tree; |
2354 | 2354 | ||
2355 | /* We always issue full-page reads, but if some block | 2355 | /* We always issue full-page reads, but if some block |
2356 | * in a page fails to read, blk_update_request() will | 2356 | * in a page fails to read, blk_update_request() will |
2357 | * advance bv_offset and adjust bv_len to compensate. | 2357 | * advance bv_offset and adjust bv_len to compensate. |
2358 | * Print a warning for nonzero offsets, and an error | 2358 | * Print a warning for nonzero offsets, and an error |
2359 | * if they don't add up to a full page. */ | 2359 | * if they don't add up to a full page. */ |
2360 | if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) | 2360 | if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) |
2361 | printk("%s page write in btrfs with offset %u and length %u\n", | 2361 | printk("%s page write in btrfs with offset %u and length %u\n", |
2362 | bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE | 2362 | bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE |
2363 | ? KERN_ERR "partial" : KERN_INFO "incomplete", | 2363 | ? KERN_ERR "partial" : KERN_INFO "incomplete", |
2364 | bvec->bv_offset, bvec->bv_len); | 2364 | bvec->bv_offset, bvec->bv_len); |
2365 | 2365 | ||
2366 | start = page_offset(page); | 2366 | start = page_offset(page); |
2367 | end = start + bvec->bv_offset + bvec->bv_len - 1; | 2367 | end = start + bvec->bv_offset + bvec->bv_len - 1; |
2368 | 2368 | ||
2369 | if (--bvec >= bio->bi_io_vec) | 2369 | if (--bvec >= bio->bi_io_vec) |
2370 | prefetchw(&bvec->bv_page->flags); | 2370 | prefetchw(&bvec->bv_page->flags); |
2371 | 2371 | ||
2372 | if (end_extent_writepage(page, err, start, end)) | 2372 | if (end_extent_writepage(page, err, start, end)) |
2373 | continue; | 2373 | continue; |
2374 | 2374 | ||
2375 | end_page_writeback(page); | 2375 | end_page_writeback(page); |
2376 | } while (bvec >= bio->bi_io_vec); | 2376 | } while (bvec >= bio->bi_io_vec); |
2377 | 2377 | ||
2378 | bio_put(bio); | 2378 | bio_put(bio); |
2379 | } | 2379 | } |
2380 | 2380 | ||
2381 | static void | 2381 | static void |
2382 | endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len, | 2382 | endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len, |
2383 | int uptodate) | 2383 | int uptodate) |
2384 | { | 2384 | { |
2385 | struct extent_state *cached = NULL; | 2385 | struct extent_state *cached = NULL; |
2386 | u64 end = start + len - 1; | 2386 | u64 end = start + len - 1; |
2387 | 2387 | ||
2388 | if (uptodate && tree->track_uptodate) | 2388 | if (uptodate && tree->track_uptodate) |
2389 | set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC); | 2389 | set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC); |
2390 | unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC); | 2390 | unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC); |
2391 | } | 2391 | } |
2392 | 2392 | ||
2393 | /* | 2393 | /* |
2394 | * after a readpage IO is done, we need to: | 2394 | * after a readpage IO is done, we need to: |
2395 | * clear the uptodate bits on error | 2395 | * clear the uptodate bits on error |
2396 | * set the uptodate bits if things worked | 2396 | * set the uptodate bits if things worked |
2397 | * set the page up to date if all extents in the tree are uptodate | 2397 | * set the page up to date if all extents in the tree are uptodate |
2398 | * clear the lock bit in the extent tree | 2398 | * clear the lock bit in the extent tree |
2399 | * unlock the page if there are no other extents locked for it | 2399 | * unlock the page if there are no other extents locked for it |
2400 | * | 2400 | * |
2401 | * Scheduling is not allowed, so the extent state tree is expected | 2401 | * Scheduling is not allowed, so the extent state tree is expected |
2402 | * to have one and only one object corresponding to this IO. | 2402 | * to have one and only one object corresponding to this IO. |
2403 | */ | 2403 | */ |
2404 | static void end_bio_extent_readpage(struct bio *bio, int err) | 2404 | static void end_bio_extent_readpage(struct bio *bio, int err) |
2405 | { | 2405 | { |
2406 | int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); | 2406 | int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); |
2407 | struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1; | 2407 | struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1; |
2408 | struct bio_vec *bvec = bio->bi_io_vec; | 2408 | struct bio_vec *bvec = bio->bi_io_vec; |
2409 | struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); | 2409 | struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); |
2410 | struct extent_io_tree *tree; | 2410 | struct extent_io_tree *tree; |
2411 | u64 offset = 0; | 2411 | u64 offset = 0; |
2412 | u64 start; | 2412 | u64 start; |
2413 | u64 end; | 2413 | u64 end; |
2414 | u64 len; | 2414 | u64 len; |
2415 | u64 extent_start = 0; | 2415 | u64 extent_start = 0; |
2416 | u64 extent_len = 0; | 2416 | u64 extent_len = 0; |
2417 | int mirror; | 2417 | int mirror; |
2418 | int ret; | 2418 | int ret; |
2419 | 2419 | ||
2420 | if (err) | 2420 | if (err) |
2421 | uptodate = 0; | 2421 | uptodate = 0; |
2422 | 2422 | ||
2423 | do { | 2423 | do { |
2424 | struct page *page = bvec->bv_page; | 2424 | struct page *page = bvec->bv_page; |
2425 | struct inode *inode = page->mapping->host; | 2425 | struct inode *inode = page->mapping->host; |
2426 | 2426 | ||
2427 | pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, " | 2427 | pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, " |
2428 | "mirror=%lu\n", (u64)bio->bi_sector, err, | 2428 | "mirror=%lu\n", (u64)bio->bi_sector, err, |
2429 | io_bio->mirror_num); | 2429 | io_bio->mirror_num); |
2430 | tree = &BTRFS_I(inode)->io_tree; | 2430 | tree = &BTRFS_I(inode)->io_tree; |
2431 | 2431 | ||
2432 | /* We always issue full-page reads, but if some block | 2432 | /* We always issue full-page reads, but if some block |
2433 | * in a page fails to read, blk_update_request() will | 2433 | * in a page fails to read, blk_update_request() will |
2434 | * advance bv_offset and adjust bv_len to compensate. | 2434 | * advance bv_offset and adjust bv_len to compensate. |
2435 | * Print a warning for nonzero offsets, and an error | 2435 | * Print a warning for nonzero offsets, and an error |
2436 | * if they don't add up to a full page. */ | 2436 | * if they don't add up to a full page. */ |
2437 | if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) | 2437 | if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) |
2438 | printk("%s page read in btrfs with offset %u and length %u\n", | 2438 | printk("%s page read in btrfs with offset %u and length %u\n", |
2439 | bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE | 2439 | bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE |
2440 | ? KERN_ERR "partial" : KERN_INFO "incomplete", | 2440 | ? KERN_ERR "partial" : KERN_INFO "incomplete", |
2441 | bvec->bv_offset, bvec->bv_len); | 2441 | bvec->bv_offset, bvec->bv_len); |
2442 | 2442 | ||
2443 | start = page_offset(page); | 2443 | start = page_offset(page); |
2444 | end = start + bvec->bv_offset + bvec->bv_len - 1; | 2444 | end = start + bvec->bv_offset + bvec->bv_len - 1; |
2445 | len = bvec->bv_len; | 2445 | len = bvec->bv_len; |
2446 | 2446 | ||
2447 | if (++bvec <= bvec_end) | 2447 | if (++bvec <= bvec_end) |
2448 | prefetchw(&bvec->bv_page->flags); | 2448 | prefetchw(&bvec->bv_page->flags); |
2449 | 2449 | ||
2450 | mirror = io_bio->mirror_num; | 2450 | mirror = io_bio->mirror_num; |
2451 | if (likely(uptodate && tree->ops && | 2451 | if (likely(uptodate && tree->ops && |
2452 | tree->ops->readpage_end_io_hook)) { | 2452 | tree->ops->readpage_end_io_hook)) { |
2453 | ret = tree->ops->readpage_end_io_hook(io_bio, offset, | 2453 | ret = tree->ops->readpage_end_io_hook(io_bio, offset, |
2454 | page, start, end, | 2454 | page, start, end, |
2455 | mirror); | 2455 | mirror); |
2456 | if (ret) | 2456 | if (ret) |
2457 | uptodate = 0; | 2457 | uptodate = 0; |
2458 | else | 2458 | else |
2459 | clean_io_failure(start, page); | 2459 | clean_io_failure(start, page); |
2460 | } | 2460 | } |
2461 | 2461 | ||
2462 | if (likely(uptodate)) | 2462 | if (likely(uptodate)) |
2463 | goto readpage_ok; | 2463 | goto readpage_ok; |
2464 | 2464 | ||
2465 | if (tree->ops && tree->ops->readpage_io_failed_hook) { | 2465 | if (tree->ops && tree->ops->readpage_io_failed_hook) { |
2466 | ret = tree->ops->readpage_io_failed_hook(page, mirror); | 2466 | ret = tree->ops->readpage_io_failed_hook(page, mirror); |
2467 | if (!ret && !err && | 2467 | if (!ret && !err && |
2468 | test_bit(BIO_UPTODATE, &bio->bi_flags)) | 2468 | test_bit(BIO_UPTODATE, &bio->bi_flags)) |
2469 | uptodate = 1; | 2469 | uptodate = 1; |
2470 | } else { | 2470 | } else { |
2471 | /* | 2471 | /* |
2472 | * The generic bio_readpage_error handles errors the | 2472 | * The generic bio_readpage_error handles errors the |
2473 | * following way: If possible, new read requests are | 2473 | * following way: If possible, new read requests are |
2474 | * created and submitted and will end up in | 2474 | * created and submitted and will end up in |
2475 | * end_bio_extent_readpage as well (if we're lucky, not | 2475 | * end_bio_extent_readpage as well (if we're lucky, not |
2476 | * in the !uptodate case). In that case it returns 0 and | 2476 | * in the !uptodate case). In that case it returns 0 and |
2477 | * we just go on with the next page in our bio. If it | 2477 | * we just go on with the next page in our bio. If it |
2478 | * can't handle the error it will return -EIO and we | 2478 | * can't handle the error it will return -EIO and we |
2479 | * remain responsible for that page. | 2479 | * remain responsible for that page. |
2480 | */ | 2480 | */ |
2481 | ret = bio_readpage_error(bio, offset, page, start, end, | 2481 | ret = bio_readpage_error(bio, offset, page, start, end, |
2482 | mirror); | 2482 | mirror); |
2483 | if (ret == 0) { | 2483 | if (ret == 0) { |
2484 | uptodate = | 2484 | uptodate = |
2485 | test_bit(BIO_UPTODATE, &bio->bi_flags); | 2485 | test_bit(BIO_UPTODATE, &bio->bi_flags); |
2486 | if (err) | 2486 | if (err) |
2487 | uptodate = 0; | 2487 | uptodate = 0; |
2488 | offset += len; | 2488 | offset += len; |
2489 | continue; | 2489 | continue; |
2490 | } | 2490 | } |
2491 | } | 2491 | } |
2492 | readpage_ok: | 2492 | readpage_ok: |
2493 | if (likely(uptodate)) { | 2493 | if (likely(uptodate)) { |
2494 | loff_t i_size = i_size_read(inode); | 2494 | loff_t i_size = i_size_read(inode); |
2495 | pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; | 2495 | pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; |
2496 | unsigned offset; | 2496 | unsigned offset; |
2497 | 2497 | ||
2498 | /* Zero out the end if this page straddles i_size */ | 2498 | /* Zero out the end if this page straddles i_size */ |
2499 | offset = i_size & (PAGE_CACHE_SIZE-1); | 2499 | offset = i_size & (PAGE_CACHE_SIZE-1); |
2500 | if (page->index == end_index && offset) | 2500 | if (page->index == end_index && offset) |
2501 | zero_user_segment(page, offset, PAGE_CACHE_SIZE); | 2501 | zero_user_segment(page, offset, PAGE_CACHE_SIZE); |
2502 | SetPageUptodate(page); | 2502 | SetPageUptodate(page); |
2503 | } else { | 2503 | } else { |
2504 | ClearPageUptodate(page); | 2504 | ClearPageUptodate(page); |
2505 | SetPageError(page); | 2505 | SetPageError(page); |
2506 | } | 2506 | } |
2507 | unlock_page(page); | 2507 | unlock_page(page); |
2508 | offset += len; | 2508 | offset += len; |
2509 | 2509 | ||
2510 | if (unlikely(!uptodate)) { | 2510 | if (unlikely(!uptodate)) { |
2511 | if (extent_len) { | 2511 | if (extent_len) { |
2512 | endio_readpage_release_extent(tree, | 2512 | endio_readpage_release_extent(tree, |
2513 | extent_start, | 2513 | extent_start, |
2514 | extent_len, 1); | 2514 | extent_len, 1); |
2515 | extent_start = 0; | 2515 | extent_start = 0; |
2516 | extent_len = 0; | 2516 | extent_len = 0; |
2517 | } | 2517 | } |
2518 | endio_readpage_release_extent(tree, start, | 2518 | endio_readpage_release_extent(tree, start, |
2519 | end - start + 1, 0); | 2519 | end - start + 1, 0); |
2520 | } else if (!extent_len) { | 2520 | } else if (!extent_len) { |
2521 | extent_start = start; | 2521 | extent_start = start; |
2522 | extent_len = end + 1 - start; | 2522 | extent_len = end + 1 - start; |
2523 | } else if (extent_start + extent_len == start) { | 2523 | } else if (extent_start + extent_len == start) { |
2524 | extent_len += end + 1 - start; | 2524 | extent_len += end + 1 - start; |
2525 | } else { | 2525 | } else { |
2526 | endio_readpage_release_extent(tree, extent_start, | 2526 | endio_readpage_release_extent(tree, extent_start, |
2527 | extent_len, uptodate); | 2527 | extent_len, uptodate); |
2528 | extent_start = start; | 2528 | extent_start = start; |
2529 | extent_len = end + 1 - start; | 2529 | extent_len = end + 1 - start; |
2530 | } | 2530 | } |
2531 | } while (bvec <= bvec_end); | 2531 | } while (bvec <= bvec_end); |
2532 | 2532 | ||
2533 | if (extent_len) | 2533 | if (extent_len) |
2534 | endio_readpage_release_extent(tree, extent_start, extent_len, | 2534 | endio_readpage_release_extent(tree, extent_start, extent_len, |
2535 | uptodate); | 2535 | uptodate); |
2536 | if (io_bio->end_io) | 2536 | if (io_bio->end_io) |
2537 | io_bio->end_io(io_bio, err); | 2537 | io_bio->end_io(io_bio, err); |
2538 | bio_put(bio); | 2538 | bio_put(bio); |
2539 | } | 2539 | } |
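The read-completion handler above batches its unlock work: byte ranges that finish back to back are merged into one pending [extent_start, extent_start + extent_len) run, and the run is only handed to endio_readpage_release_extent() when a non-adjacent range (or an error) shows up, or when the bio is exhausted. A minimal userspace sketch of that coalescing pattern, with release_run() standing in for the btrfs release call:

#include <stdio.h>

typedef unsigned long long u64;

/* Stand-in for endio_readpage_release_extent(): flush one merged run. */
static void release_run(u64 start, u64 len)
{
        printf("release [%llu, %llu)\n", start, start + len);
}

int main(void)
{
        /* Byte ranges in completion order; the first three are adjacent. */
        u64 starts[] = { 0, 4096, 8192, 65536 };
        u64 ends[]   = { 4095, 8191, 16383, 69631 };
        u64 run_start = 0, run_len = 0;
        int i;

        for (i = 0; i < 4; i++) {
                if (!run_len) {
                        run_start = starts[i];
                        run_len = ends[i] + 1 - starts[i];
                } else if (run_start + run_len == starts[i]) {
                        /* Adjacent: grow the pending run instead of flushing. */
                        run_len += ends[i] + 1 - starts[i];
                } else {
                        release_run(run_start, run_len);
                        run_start = starts[i];
                        run_len = ends[i] + 1 - starts[i];
                }
        }
        if (run_len)
                release_run(run_start, run_len);   /* flush the leftover run */
        return 0;
}

The batching keeps the number of release calls proportional to the number of gaps in the bio rather than the number of completed pages.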
2540 | 2540 | ||
2541 | /* | 2541 | /* |
2542 | * this allocates from the btrfs_bioset. We're returning a bio right now | 2542 | * this allocates from the btrfs_bioset. We're returning a bio right now |
2543 | * but you can call btrfs_io_bio for the appropriate container_of magic | 2543 | * but you can call btrfs_io_bio for the appropriate container_of magic |
2544 | */ | 2544 | */ |
2545 | struct bio * | 2545 | struct bio * |
2546 | btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs, | 2546 | btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs, |
2547 | gfp_t gfp_flags) | 2547 | gfp_t gfp_flags) |
2548 | { | 2548 | { |
2549 | struct btrfs_io_bio *btrfs_bio; | 2549 | struct btrfs_io_bio *btrfs_bio; |
2550 | struct bio *bio; | 2550 | struct bio *bio; |
2551 | 2551 | ||
2552 | bio = bio_alloc_bioset(gfp_flags, nr_vecs, btrfs_bioset); | 2552 | bio = bio_alloc_bioset(gfp_flags, nr_vecs, btrfs_bioset); |
2553 | 2553 | ||
2554 | if (bio == NULL && (current->flags & PF_MEMALLOC)) { | 2554 | if (bio == NULL && (current->flags & PF_MEMALLOC)) { |
2555 | while (!bio && (nr_vecs /= 2)) { | 2555 | while (!bio && (nr_vecs /= 2)) { |
2556 | bio = bio_alloc_bioset(gfp_flags, | 2556 | bio = bio_alloc_bioset(gfp_flags, |
2557 | nr_vecs, btrfs_bioset); | 2557 | nr_vecs, btrfs_bioset); |
2558 | } | 2558 | } |
2559 | } | 2559 | } |
2560 | 2560 | ||
2561 | if (bio) { | 2561 | if (bio) { |
2562 | bio->bi_size = 0; | 2562 | bio->bi_size = 0; |
2563 | bio->bi_bdev = bdev; | 2563 | bio->bi_bdev = bdev; |
2564 | bio->bi_sector = first_sector; | 2564 | bio->bi_sector = first_sector; |
2565 | btrfs_bio = btrfs_io_bio(bio); | 2565 | btrfs_bio = btrfs_io_bio(bio); |
2566 | btrfs_bio->csum = NULL; | 2566 | btrfs_bio->csum = NULL; |
2567 | btrfs_bio->csum_allocated = NULL; | 2567 | btrfs_bio->csum_allocated = NULL; |
2568 | btrfs_bio->end_io = NULL; | 2568 | btrfs_bio->end_io = NULL; |
2569 | } | 2569 | } |
2570 | return bio; | 2570 | return bio; |
2571 | } | 2571 | } |
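The comment above btrfs_bio_alloc() mentions converting the returned struct bio back into its btrfs_io_bio wrapper via container_of; that works because the wrapper embeds the bio as a member, so a pointer to the member can be walked back to the containing structure. A standalone sketch of the embedding pattern; the *_sketch type names are illustrative, not the kernel definitions:

#include <stddef.h>
#include <stdio.h>

/* Illustrative stand-ins for the kernel types, not their real layout. */
struct bio_sketch {
        int bi_vcnt;
};

struct io_bio_sketch {
        void *csum;                 /* extra per-bio state carried by the wrapper */
        void (*end_io)(void *);
        struct bio_sketch bio;      /* embedded bio handed out to callers */
};

/* container_of(): walk back from a member pointer to its containing struct. */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

static struct io_bio_sketch *io_bio_of(struct bio_sketch *bio)
{
        return container_of(bio, struct io_bio_sketch, bio);
}

int main(void)
{
        struct io_bio_sketch wrapper = { .csum = NULL, .end_io = NULL };
        struct bio_sketch *bio = &wrapper.bio;   /* what the allocator returns */

        printf("recovered %p, expected %p\n",
               (void *)io_bio_of(bio), (void *)&wrapper);
        return 0;
}

The same trick is what lets the csum and end_io fields initialised above travel with the bio without widening struct bio itself.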
2572 | 2572 | ||
2573 | struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask) | 2573 | struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask) |
2574 | { | 2574 | { |
2575 | return bio_clone_bioset(bio, gfp_mask, btrfs_bioset); | 2575 | return bio_clone_bioset(bio, gfp_mask, btrfs_bioset); |
2576 | } | 2576 | } |
2577 | 2577 | ||
2578 | 2578 | ||
2579 | /* this also allocates from the btrfs_bioset */ | 2579 | /* this also allocates from the btrfs_bioset */ |
2580 | struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs) | 2580 | struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs) |
2581 | { | 2581 | { |
2582 | struct btrfs_io_bio *btrfs_bio; | 2582 | struct btrfs_io_bio *btrfs_bio; |
2583 | struct bio *bio; | 2583 | struct bio *bio; |
2584 | 2584 | ||
2585 | bio = bio_alloc_bioset(gfp_mask, nr_iovecs, btrfs_bioset); | 2585 | bio = bio_alloc_bioset(gfp_mask, nr_iovecs, btrfs_bioset); |
2586 | if (bio) { | 2586 | if (bio) { |
2587 | btrfs_bio = btrfs_io_bio(bio); | 2587 | btrfs_bio = btrfs_io_bio(bio); |
2588 | btrfs_bio->csum = NULL; | 2588 | btrfs_bio->csum = NULL; |
2589 | btrfs_bio->csum_allocated = NULL; | 2589 | btrfs_bio->csum_allocated = NULL; |
2590 | btrfs_bio->end_io = NULL; | 2590 | btrfs_bio->end_io = NULL; |
2591 | } | 2591 | } |
2592 | return bio; | 2592 | return bio; |
2593 | } | 2593 | } |
2594 | 2594 | ||
2595 | 2595 | ||
2596 | static int __must_check submit_one_bio(int rw, struct bio *bio, | 2596 | static int __must_check submit_one_bio(int rw, struct bio *bio, |
2597 | int mirror_num, unsigned long bio_flags) | 2597 | int mirror_num, unsigned long bio_flags) |
2598 | { | 2598 | { |
2599 | int ret = 0; | 2599 | int ret = 0; |
2600 | struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; | 2600 | struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; |
2601 | struct page *page = bvec->bv_page; | 2601 | struct page *page = bvec->bv_page; |
2602 | struct extent_io_tree *tree = bio->bi_private; | 2602 | struct extent_io_tree *tree = bio->bi_private; |
2603 | u64 start; | 2603 | u64 start; |
2604 | 2604 | ||
2605 | start = page_offset(page) + bvec->bv_offset; | 2605 | start = page_offset(page) + bvec->bv_offset; |
2606 | 2606 | ||
2607 | bio->bi_private = NULL; | 2607 | bio->bi_private = NULL; |
2608 | 2608 | ||
2609 | bio_get(bio); | 2609 | bio_get(bio); |
2610 | 2610 | ||
2611 | if (tree->ops && tree->ops->submit_bio_hook) | 2611 | if (tree->ops && tree->ops->submit_bio_hook) |
2612 | ret = tree->ops->submit_bio_hook(page->mapping->host, rw, bio, | 2612 | ret = tree->ops->submit_bio_hook(page->mapping->host, rw, bio, |
2613 | mirror_num, bio_flags, start); | 2613 | mirror_num, bio_flags, start); |
2614 | else | 2614 | else |
2615 | btrfsic_submit_bio(rw, bio); | 2615 | btrfsic_submit_bio(rw, bio); |
2616 | 2616 | ||
2617 | if (bio_flagged(bio, BIO_EOPNOTSUPP)) | 2617 | if (bio_flagged(bio, BIO_EOPNOTSUPP)) |
2618 | ret = -EOPNOTSUPP; | 2618 | ret = -EOPNOTSUPP; |
2619 | bio_put(bio); | 2619 | bio_put(bio); |
2620 | return ret; | 2620 | return ret; |
2621 | } | 2621 | } |
2622 | 2622 | ||
2623 | static int merge_bio(int rw, struct extent_io_tree *tree, struct page *page, | 2623 | static int merge_bio(int rw, struct extent_io_tree *tree, struct page *page, |
2624 | unsigned long offset, size_t size, struct bio *bio, | 2624 | unsigned long offset, size_t size, struct bio *bio, |
2625 | unsigned long bio_flags) | 2625 | unsigned long bio_flags) |
2626 | { | 2626 | { |
2627 | int ret = 0; | 2627 | int ret = 0; |
2628 | if (tree->ops && tree->ops->merge_bio_hook) | 2628 | if (tree->ops && tree->ops->merge_bio_hook) |
2629 | ret = tree->ops->merge_bio_hook(rw, page, offset, size, bio, | 2629 | ret = tree->ops->merge_bio_hook(rw, page, offset, size, bio, |
2630 | bio_flags); | 2630 | bio_flags); |
2631 | BUG_ON(ret < 0); | 2631 | BUG_ON(ret < 0); |
2632 | return ret; | 2632 | return ret; |
2633 | 2633 | ||
2634 | } | 2634 | } |
2635 | 2635 | ||
2636 | static int submit_extent_page(int rw, struct extent_io_tree *tree, | 2636 | static int submit_extent_page(int rw, struct extent_io_tree *tree, |
2637 | struct page *page, sector_t sector, | 2637 | struct page *page, sector_t sector, |
2638 | size_t size, unsigned long offset, | 2638 | size_t size, unsigned long offset, |
2639 | struct block_device *bdev, | 2639 | struct block_device *bdev, |
2640 | struct bio **bio_ret, | 2640 | struct bio **bio_ret, |
2641 | unsigned long max_pages, | 2641 | unsigned long max_pages, |
2642 | bio_end_io_t end_io_func, | 2642 | bio_end_io_t end_io_func, |
2643 | int mirror_num, | 2643 | int mirror_num, |
2644 | unsigned long prev_bio_flags, | 2644 | unsigned long prev_bio_flags, |
2645 | unsigned long bio_flags) | 2645 | unsigned long bio_flags) |
2646 | { | 2646 | { |
2647 | int ret = 0; | 2647 | int ret = 0; |
2648 | struct bio *bio; | 2648 | struct bio *bio; |
2649 | int nr; | 2649 | int nr; |
2650 | int contig = 0; | 2650 | int contig = 0; |
2651 | int this_compressed = bio_flags & EXTENT_BIO_COMPRESSED; | 2651 | int this_compressed = bio_flags & EXTENT_BIO_COMPRESSED; |
2652 | int old_compressed = prev_bio_flags & EXTENT_BIO_COMPRESSED; | 2652 | int old_compressed = prev_bio_flags & EXTENT_BIO_COMPRESSED; |
2653 | size_t page_size = min_t(size_t, size, PAGE_CACHE_SIZE); | 2653 | size_t page_size = min_t(size_t, size, PAGE_CACHE_SIZE); |
2654 | 2654 | ||
2655 | if (bio_ret && *bio_ret) { | 2655 | if (bio_ret && *bio_ret) { |
2656 | bio = *bio_ret; | 2656 | bio = *bio_ret; |
2657 | if (old_compressed) | 2657 | if (old_compressed) |
2658 | contig = bio->bi_sector == sector; | 2658 | contig = bio->bi_sector == sector; |
2659 | else | 2659 | else |
2660 | contig = bio_end_sector(bio) == sector; | 2660 | contig = bio_end_sector(bio) == sector; |
2661 | 2661 | ||
2662 | if (prev_bio_flags != bio_flags || !contig || | 2662 | if (prev_bio_flags != bio_flags || !contig || |
2663 | merge_bio(rw, tree, page, offset, page_size, bio, bio_flags) || | 2663 | merge_bio(rw, tree, page, offset, page_size, bio, bio_flags) || |
2664 | bio_add_page(bio, page, page_size, offset) < page_size) { | 2664 | bio_add_page(bio, page, page_size, offset) < page_size) { |
2665 | ret = submit_one_bio(rw, bio, mirror_num, | 2665 | ret = submit_one_bio(rw, bio, mirror_num, |
2666 | prev_bio_flags); | 2666 | prev_bio_flags); |
2667 | if (ret < 0) | 2667 | if (ret < 0) |
2668 | return ret; | 2668 | return ret; |
2669 | bio = NULL; | 2669 | bio = NULL; |
2670 | } else { | 2670 | } else { |
2671 | return 0; | 2671 | return 0; |
2672 | } | 2672 | } |
2673 | } | 2673 | } |
2674 | if (this_compressed) | 2674 | if (this_compressed) |
2675 | nr = BIO_MAX_PAGES; | 2675 | nr = BIO_MAX_PAGES; |
2676 | else | 2676 | else |
2677 | nr = bio_get_nr_vecs(bdev); | 2677 | nr = bio_get_nr_vecs(bdev); |
2678 | 2678 | ||
2679 | bio = btrfs_bio_alloc(bdev, sector, nr, GFP_NOFS | __GFP_HIGH); | 2679 | bio = btrfs_bio_alloc(bdev, sector, nr, GFP_NOFS | __GFP_HIGH); |
2680 | if (!bio) | 2680 | if (!bio) |
2681 | return -ENOMEM; | 2681 | return -ENOMEM; |
2682 | 2682 | ||
2683 | bio_add_page(bio, page, page_size, offset); | 2683 | bio_add_page(bio, page, page_size, offset); |
2684 | bio->bi_end_io = end_io_func; | 2684 | bio->bi_end_io = end_io_func; |
2685 | bio->bi_private = tree; | 2685 | bio->bi_private = tree; |
2686 | 2686 | ||
2687 | if (bio_ret) | 2687 | if (bio_ret) |
2688 | *bio_ret = bio; | 2688 | *bio_ret = bio; |
2689 | else | 2689 | else |
2690 | ret = submit_one_bio(rw, bio, mirror_num, bio_flags); | 2690 | ret = submit_one_bio(rw, bio, mirror_num, bio_flags); |
2691 | 2691 | ||
2692 | return ret; | 2692 | return ret; |
2693 | } | 2693 | } |
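submit_extent_page() keeps appending to the bio cached in *bio_ret for as long as the new chunk starts exactly where the pending bio ends and carries the same flags (and the merge hook and bio_add_page() agree); otherwise the pending bio is submitted and a fresh one is started. A small sketch of that batching decision, with hypothetical names rather than the kernel API:

#include <stdbool.h>
#include <stdio.h>

struct pending_io {
        unsigned long long start_sector;  /* first sector of the queued run */
        unsigned long long nr_sectors;    /* sectors queued so far          */
        unsigned long flags;              /* e.g. a "compressed" bit        */
};

/* Can a chunk starting at 'sector' with these flags simply be appended? */
static bool can_extend(const struct pending_io *p,
                       unsigned long long sector, unsigned long flags)
{
        if (p->flags != flags)
                return false;
        return p->start_sector + p->nr_sectors == sector;
}

int main(void)
{
        struct pending_io run = { .start_sector = 100, .nr_sectors = 8, .flags = 0 };

        printf("sector 108 extends the run: %d\n", can_extend(&run, 108, 0));
        printf("sector 200 extends the run: %d\n", can_extend(&run, 200, 0));
        return 0;
}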
2694 | 2694 | ||
2695 | static void attach_extent_buffer_page(struct extent_buffer *eb, | 2695 | static void attach_extent_buffer_page(struct extent_buffer *eb, |
2696 | struct page *page) | 2696 | struct page *page) |
2697 | { | 2697 | { |
2698 | if (!PagePrivate(page)) { | 2698 | if (!PagePrivate(page)) { |
2699 | SetPagePrivate(page); | 2699 | SetPagePrivate(page); |
2700 | page_cache_get(page); | 2700 | page_cache_get(page); |
2701 | set_page_private(page, (unsigned long)eb); | 2701 | set_page_private(page, (unsigned long)eb); |
2702 | } else { | 2702 | } else { |
2703 | WARN_ON(page->private != (unsigned long)eb); | 2703 | WARN_ON(page->private != (unsigned long)eb); |
2704 | } | 2704 | } |
2705 | } | 2705 | } |
2706 | 2706 | ||
2707 | void set_page_extent_mapped(struct page *page) | 2707 | void set_page_extent_mapped(struct page *page) |
2708 | { | 2708 | { |
2709 | if (!PagePrivate(page)) { | 2709 | if (!PagePrivate(page)) { |
2710 | SetPagePrivate(page); | 2710 | SetPagePrivate(page); |
2711 | page_cache_get(page); | 2711 | page_cache_get(page); |
2712 | set_page_private(page, EXTENT_PAGE_PRIVATE); | 2712 | set_page_private(page, EXTENT_PAGE_PRIVATE); |
2713 | } | 2713 | } |
2714 | } | 2714 | } |
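Both helpers above use the same claim-once idiom: the first caller sets PagePrivate, takes an extra page reference, and stashes an owner cookie in page->private; later callers leave the existing owner in place. A userspace sketch of the idiom, with illustrative struct and field names:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-in for struct page and its private field. */
struct page_sketch {
        bool private_set;        /* PagePrivate()        */
        unsigned long private;   /* page->private        */
        int refcount;            /* page reference count */
};

/* Claim the page for an owner exactly once, pinning it while claimed. */
static void claim_page(struct page_sketch *page, unsigned long cookie)
{
        if (page->private_set)
                return;          /* already claimed; keep the first owner */
        page->private_set = true;
        page->refcount++;        /* the private cookie holds a reference  */
        page->private = cookie;
}

int main(void)
{
        struct page_sketch page = { .refcount = 1 };

        claim_page(&page, 0xE7);   /* first claim wins        */
        claim_page(&page, 0x42);   /* second claim is a no-op */
        printf("private=%#lx refcount=%d\n", page.private, page.refcount);
        return 0;
}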
2715 | 2715 | ||
2716 | static struct extent_map * | 2716 | static struct extent_map * |
2717 | __get_extent_map(struct inode *inode, struct page *page, size_t pg_offset, | 2717 | __get_extent_map(struct inode *inode, struct page *page, size_t pg_offset, |
2718 | u64 start, u64 len, get_extent_t *get_extent, | 2718 | u64 start, u64 len, get_extent_t *get_extent, |
2719 | struct extent_map **em_cached) | 2719 | struct extent_map **em_cached) |
2720 | { | 2720 | { |
2721 | struct extent_map *em; | 2721 | struct extent_map *em; |
2722 | 2722 | ||
2723 | if (em_cached && *em_cached) { | 2723 | if (em_cached && *em_cached) { |
2724 | em = *em_cached; | 2724 | em = *em_cached; |
2725 | if (em->in_tree && start >= em->start && | 2725 | if (em->in_tree && start >= em->start && |
2726 | start < extent_map_end(em)) { | 2726 | start < extent_map_end(em)) { |
2727 | atomic_inc(&em->refs); | 2727 | atomic_inc(&em->refs); |
2728 | return em; | 2728 | return em; |
2729 | } | 2729 | } |
2730 | 2730 | ||
2731 | free_extent_map(em); | 2731 | free_extent_map(em); |
2732 | *em_cached = NULL; | 2732 | *em_cached = NULL; |
2733 | } | 2733 | } |
2734 | 2734 | ||
2735 | em = get_extent(inode, page, pg_offset, start, len, 0); | 2735 | em = get_extent(inode, page, pg_offset, start, len, 0); |
2736 | if (em_cached && !IS_ERR_OR_NULL(em)) { | 2736 | if (em_cached && !IS_ERR_OR_NULL(em)) { |
2737 | BUG_ON(*em_cached); | 2737 | BUG_ON(*em_cached); |
2738 | atomic_inc(&em->refs); | 2738 | atomic_inc(&em->refs); |
2739 | *em_cached = em; | 2739 | *em_cached = em; |
2740 | } | 2740 | } |
2741 | return em; | 2741 | return em; |
2742 | } | 2742 | } |
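__get_extent_map() keeps a one-entry cache in *em_cached: when the cached mapping is still in the tree and covers start it is reused with an extra reference, otherwise it is dropped and get_extent() supplies a fresh mapping that becomes the new cached entry. A simplified, single-threaded sketch of that cache check (the fresh entry is passed in rather than looked up, and dropping the reference on the old entry is omitted):

#include <stdio.h>

typedef unsigned long long u64;

/* Illustrative stand-in for a cached extent mapping. */
struct mapping_sketch {
        u64 start;
        u64 len;
        int refs;
};

static u64 mapping_end(const struct mapping_sketch *m)
{
        return m->start + m->len;
}

/* Reuse the single cached entry when it still covers 'start'. */
static struct mapping_sketch *lookup(struct mapping_sketch **cached,
                                     struct mapping_sketch *fresh, u64 start)
{
        if (*cached && start >= (*cached)->start && start < mapping_end(*cached)) {
                (*cached)->refs++;   /* cache hit: hand out another reference */
                return *cached;
        }
        *cached = fresh;             /* miss: the fresh mapping becomes the cache */
        fresh->refs++;
        return fresh;
}

int main(void)
{
        struct mapping_sketch a = { .start = 0, .len = 1 << 20, .refs = 1 };
        struct mapping_sketch b = { .start = 1 << 20, .len = 1 << 20, .refs = 1 };
        struct mapping_sketch *cached = &a;

        printf("hit:  mapping starts at %llu\n", lookup(&cached, &b, 4096)->start);
        printf("miss: mapping starts at %llu\n",
               lookup(&cached, &b, (1 << 20) + 4096)->start);
        return 0;
}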
2743 | /* | 2743 | /* |
2744 | * basic readpage implementation. Locked extent state structs are inserted | 2744 | * basic readpage implementation. Locked extent state structs are inserted |
2745 | * into the tree that are removed when the IO is done (by the end_io | 2745 | * into the tree that are removed when the IO is done (by the end_io |
2746 | * handlers) | 2746 | * handlers) |
2747 | * XXX JDM: This needs looking at to ensure proper page locking | 2747 | * XXX JDM: This needs looking at to ensure proper page locking |
2748 | */ | 2748 | */ |
2749 | static int __do_readpage(struct extent_io_tree *tree, | 2749 | static int __do_readpage(struct extent_io_tree *tree, |
2750 | struct page *page, | 2750 | struct page *page, |
2751 | get_extent_t *get_extent, | 2751 | get_extent_t *get_extent, |
2752 | struct extent_map **em_cached, | 2752 | struct extent_map **em_cached, |
2753 | struct bio **bio, int mirror_num, | 2753 | struct bio **bio, int mirror_num, |
2754 | unsigned long *bio_flags, int rw) | 2754 | unsigned long *bio_flags, int rw) |
2755 | { | 2755 | { |
2756 | struct inode *inode = page->mapping->host; | 2756 | struct inode *inode = page->mapping->host; |
2757 | u64 start = page_offset(page); | 2757 | u64 start = page_offset(page); |
2758 | u64 page_end = start + PAGE_CACHE_SIZE - 1; | 2758 | u64 page_end = start + PAGE_CACHE_SIZE - 1; |
2759 | u64 end; | 2759 | u64 end; |
2760 | u64 cur = start; | 2760 | u64 cur = start; |
2761 | u64 extent_offset; | 2761 | u64 extent_offset; |
2762 | u64 last_byte = i_size_read(inode); | 2762 | u64 last_byte = i_size_read(inode); |
2763 | u64 block_start; | 2763 | u64 block_start; |
2764 | u64 cur_end; | 2764 | u64 cur_end; |
2765 | sector_t sector; | 2765 | sector_t sector; |
2766 | struct extent_map *em; | 2766 | struct extent_map *em; |
2767 | struct block_device *bdev; | 2767 | struct block_device *bdev; |
2768 | int ret; | 2768 | int ret; |
2769 | int nr = 0; | 2769 | int nr = 0; |
2770 | int parent_locked = *bio_flags & EXTENT_BIO_PARENT_LOCKED; | 2770 | int parent_locked = *bio_flags & EXTENT_BIO_PARENT_LOCKED; |
2771 | size_t pg_offset = 0; | 2771 | size_t pg_offset = 0; |
2772 | size_t iosize; | 2772 | size_t iosize; |
2773 | size_t disk_io_size; | 2773 | size_t disk_io_size; |
2774 | size_t blocksize = inode->i_sb->s_blocksize; | 2774 | size_t blocksize = inode->i_sb->s_blocksize; |
2775 | unsigned long this_bio_flag = *bio_flags & EXTENT_BIO_PARENT_LOCKED; | 2775 | unsigned long this_bio_flag = *bio_flags & EXTENT_BIO_PARENT_LOCKED; |
2776 | 2776 | ||
2777 | set_page_extent_mapped(page); | 2777 | set_page_extent_mapped(page); |
2778 | 2778 | ||
2779 | end = page_end; | 2779 | end = page_end; |
2780 | if (!PageUptodate(page)) { | 2780 | if (!PageUptodate(page)) { |
2781 | if (cleancache_get_page(page) == 0) { | 2781 | if (cleancache_get_page(page) == 0) { |
2782 | BUG_ON(blocksize != PAGE_SIZE); | 2782 | BUG_ON(blocksize != PAGE_SIZE); |
2783 | unlock_extent(tree, start, end); | 2783 | unlock_extent(tree, start, end); |
2784 | goto out; | 2784 | goto out; |
2785 | } | 2785 | } |
2786 | } | 2786 | } |
2787 | 2787 | ||
2788 | if (page->index == last_byte >> PAGE_CACHE_SHIFT) { | 2788 | if (page->index == last_byte >> PAGE_CACHE_SHIFT) { |
2789 | char *userpage; | 2789 | char *userpage; |
2790 | size_t zero_offset = last_byte & (PAGE_CACHE_SIZE - 1); | 2790 | size_t zero_offset = last_byte & (PAGE_CACHE_SIZE - 1); |
2791 | 2791 | ||
2792 | if (zero_offset) { | 2792 | if (zero_offset) { |
2793 | iosize = PAGE_CACHE_SIZE - zero_offset; | 2793 | iosize = PAGE_CACHE_SIZE - zero_offset; |
2794 | userpage = kmap_atomic(page); | 2794 | userpage = kmap_atomic(page); |
2795 | memset(userpage + zero_offset, 0, iosize); | 2795 | memset(userpage + zero_offset, 0, iosize); |
2796 | flush_dcache_page(page); | 2796 | flush_dcache_page(page); |
2797 | kunmap_atomic(userpage); | 2797 | kunmap_atomic(userpage); |
2798 | } | 2798 | } |
2799 | } | 2799 | } |
2800 | while (cur <= end) { | 2800 | while (cur <= end) { |
2801 | unsigned long pnr = (last_byte >> PAGE_CACHE_SHIFT) + 1; | 2801 | unsigned long pnr = (last_byte >> PAGE_CACHE_SHIFT) + 1; |
2802 | 2802 | ||
2803 | if (cur >= last_byte) { | 2803 | if (cur >= last_byte) { |
2804 | char *userpage; | 2804 | char *userpage; |
2805 | struct extent_state *cached = NULL; | 2805 | struct extent_state *cached = NULL; |
2806 | 2806 | ||
2807 | iosize = PAGE_CACHE_SIZE - pg_offset; | 2807 | iosize = PAGE_CACHE_SIZE - pg_offset; |
2808 | userpage = kmap_atomic(page); | 2808 | userpage = kmap_atomic(page); |
2809 | memset(userpage + pg_offset, 0, iosize); | 2809 | memset(userpage + pg_offset, 0, iosize); |
2810 | flush_dcache_page(page); | 2810 | flush_dcache_page(page); |
2811 | kunmap_atomic(userpage); | 2811 | kunmap_atomic(userpage); |
2812 | set_extent_uptodate(tree, cur, cur + iosize - 1, | 2812 | set_extent_uptodate(tree, cur, cur + iosize - 1, |
2813 | &cached, GFP_NOFS); | 2813 | &cached, GFP_NOFS); |
2814 | if (!parent_locked) | 2814 | if (!parent_locked) |
2815 | unlock_extent_cached(tree, cur, | 2815 | unlock_extent_cached(tree, cur, |
2816 | cur + iosize - 1, | 2816 | cur + iosize - 1, |
2817 | &cached, GFP_NOFS); | 2817 | &cached, GFP_NOFS); |
2818 | break; | 2818 | break; |
2819 | } | 2819 | } |
2820 | em = __get_extent_map(inode, page, pg_offset, cur, | 2820 | em = __get_extent_map(inode, page, pg_offset, cur, |
2821 | end - cur + 1, get_extent, em_cached); | 2821 | end - cur + 1, get_extent, em_cached); |
2822 | if (IS_ERR_OR_NULL(em)) { | 2822 | if (IS_ERR_OR_NULL(em)) { |
2823 | SetPageError(page); | 2823 | SetPageError(page); |
2824 | if (!parent_locked) | 2824 | if (!parent_locked) |
2825 | unlock_extent(tree, cur, end); | 2825 | unlock_extent(tree, cur, end); |
2826 | break; | 2826 | break; |
2827 | } | 2827 | } |
2828 | extent_offset = cur - em->start; | 2828 | extent_offset = cur - em->start; |
2829 | BUG_ON(extent_map_end(em) <= cur); | 2829 | BUG_ON(extent_map_end(em) <= cur); |
2830 | BUG_ON(end < cur); | 2830 | BUG_ON(end < cur); |
2831 | 2831 | ||
2832 | if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { | 2832 | if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { |
2833 | this_bio_flag |= EXTENT_BIO_COMPRESSED; | 2833 | this_bio_flag |= EXTENT_BIO_COMPRESSED; |
2834 | extent_set_compress_type(&this_bio_flag, | 2834 | extent_set_compress_type(&this_bio_flag, |
2835 | em->compress_type); | 2835 | em->compress_type); |
2836 | } | 2836 | } |
2837 | 2837 | ||
2838 | iosize = min(extent_map_end(em) - cur, end - cur + 1); | 2838 | iosize = min(extent_map_end(em) - cur, end - cur + 1); |
2839 | cur_end = min(extent_map_end(em) - 1, end); | 2839 | cur_end = min(extent_map_end(em) - 1, end); |
2840 | iosize = ALIGN(iosize, blocksize); | 2840 | iosize = ALIGN(iosize, blocksize); |
2841 | if (this_bio_flag & EXTENT_BIO_COMPRESSED) { | 2841 | if (this_bio_flag & EXTENT_BIO_COMPRESSED) { |
2842 | disk_io_size = em->block_len; | 2842 | disk_io_size = em->block_len; |
2843 | sector = em->block_start >> 9; | 2843 | sector = em->block_start >> 9; |
2844 | } else { | 2844 | } else { |
2845 | sector = (em->block_start + extent_offset) >> 9; | 2845 | sector = (em->block_start + extent_offset) >> 9; |
2846 | disk_io_size = iosize; | 2846 | disk_io_size = iosize; |
2847 | } | 2847 | } |
2848 | bdev = em->bdev; | 2848 | bdev = em->bdev; |
2849 | block_start = em->block_start; | 2849 | block_start = em->block_start; |
2850 | if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) | 2850 | if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) |
2851 | block_start = EXTENT_MAP_HOLE; | 2851 | block_start = EXTENT_MAP_HOLE; |
2852 | free_extent_map(em); | 2852 | free_extent_map(em); |
2853 | em = NULL; | 2853 | em = NULL; |
2854 | 2854 | ||
2855 | /* we've found a hole, just zero and go on */ | 2855 | /* we've found a hole, just zero and go on */ |
2856 | if (block_start == EXTENT_MAP_HOLE) { | 2856 | if (block_start == EXTENT_MAP_HOLE) { |
2857 | char *userpage; | 2857 | char *userpage; |
2858 | struct extent_state *cached = NULL; | 2858 | struct extent_state *cached = NULL; |
2859 | 2859 | ||
2860 | userpage = kmap_atomic(page); | 2860 | userpage = kmap_atomic(page); |
2861 | memset(userpage + pg_offset, 0, iosize); | 2861 | memset(userpage + pg_offset, 0, iosize); |
2862 | flush_dcache_page(page); | 2862 | flush_dcache_page(page); |
2863 | kunmap_atomic(userpage); | 2863 | kunmap_atomic(userpage); |
2864 | 2864 | ||
2865 | set_extent_uptodate(tree, cur, cur + iosize - 1, | 2865 | set_extent_uptodate(tree, cur, cur + iosize - 1, |
2866 | &cached, GFP_NOFS); | 2866 | &cached, GFP_NOFS); |
2867 | unlock_extent_cached(tree, cur, cur + iosize - 1, | 2867 | unlock_extent_cached(tree, cur, cur + iosize - 1, |
2868 | &cached, GFP_NOFS); | 2868 | &cached, GFP_NOFS); |
2869 | cur = cur + iosize; | 2869 | cur = cur + iosize; |
2870 | pg_offset += iosize; | 2870 | pg_offset += iosize; |
2871 | continue; | 2871 | continue; |
2872 | } | 2872 | } |
2873 | /* the get_extent function already copied into the page */ | 2873 | /* the get_extent function already copied into the page */ |
2874 | if (test_range_bit(tree, cur, cur_end, | 2874 | if (test_range_bit(tree, cur, cur_end, |
2875 | EXTENT_UPTODATE, 1, NULL)) { | 2875 | EXTENT_UPTODATE, 1, NULL)) { |
2876 | check_page_uptodate(tree, page); | 2876 | check_page_uptodate(tree, page); |
2877 | if (!parent_locked) | 2877 | if (!parent_locked) |
2878 | unlock_extent(tree, cur, cur + iosize - 1); | 2878 | unlock_extent(tree, cur, cur + iosize - 1); |
2879 | cur = cur + iosize; | 2879 | cur = cur + iosize; |
2880 | pg_offset += iosize; | 2880 | pg_offset += iosize; |
2881 | continue; | 2881 | continue; |
2882 | } | 2882 | } |
2883 | /* we have an inline extent but it didn't get marked up | 2883 | /* we have an inline extent but it didn't get marked up |
2884 | * to date. Error out | 2884 | * to date. Error out |
2885 | */ | 2885 | */ |
2886 | if (block_start == EXTENT_MAP_INLINE) { | 2886 | if (block_start == EXTENT_MAP_INLINE) { |
2887 | SetPageError(page); | 2887 | SetPageError(page); |
2888 | if (!parent_locked) | 2888 | if (!parent_locked) |
2889 | unlock_extent(tree, cur, cur + iosize - 1); | 2889 | unlock_extent(tree, cur, cur + iosize - 1); |
2890 | cur = cur + iosize; | 2890 | cur = cur + iosize; |
2891 | pg_offset += iosize; | 2891 | pg_offset += iosize; |
2892 | continue; | 2892 | continue; |
2893 | } | 2893 | } |
2894 | 2894 | ||
2895 | pnr -= page->index; | 2895 | pnr -= page->index; |
2896 | ret = submit_extent_page(rw, tree, page, | 2896 | ret = submit_extent_page(rw, tree, page, |
2897 | sector, disk_io_size, pg_offset, | 2897 | sector, disk_io_size, pg_offset, |
2898 | bdev, bio, pnr, | 2898 | bdev, bio, pnr, |
2899 | end_bio_extent_readpage, mirror_num, | 2899 | end_bio_extent_readpage, mirror_num, |
2900 | *bio_flags, | 2900 | *bio_flags, |
2901 | this_bio_flag); | 2901 | this_bio_flag); |
2902 | if (!ret) { | 2902 | if (!ret) { |
2903 | nr++; | 2903 | nr++; |
2904 | *bio_flags = this_bio_flag; | 2904 | *bio_flags = this_bio_flag; |
2905 | } else { | 2905 | } else { |
2906 | SetPageError(page); | 2906 | SetPageError(page); |
2907 | if (!parent_locked) | 2907 | if (!parent_locked) |
2908 | unlock_extent(tree, cur, cur + iosize - 1); | 2908 | unlock_extent(tree, cur, cur + iosize - 1); |
2909 | } | 2909 | } |
2910 | cur = cur + iosize; | 2910 | cur = cur + iosize; |
2911 | pg_offset += iosize; | 2911 | pg_offset += iosize; |
2912 | } | 2912 | } |
2913 | out: | 2913 | out: |
2914 | if (!nr) { | 2914 | if (!nr) { |
2915 | if (!PageError(page)) | 2915 | if (!PageError(page)) |
2916 | SetPageUptodate(page); | 2916 | SetPageUptodate(page); |
2917 | unlock_page(page); | 2917 | unlock_page(page); |
2918 | } | 2918 | } |
2919 | return 0; | 2919 | return 0; |
2920 | } | 2920 | } |
2921 | 2921 | ||
2922 | static inline void __do_contiguous_readpages(struct extent_io_tree *tree, | 2922 | static inline void __do_contiguous_readpages(struct extent_io_tree *tree, |
2923 | struct page *pages[], int nr_pages, | 2923 | struct page *pages[], int nr_pages, |
2924 | u64 start, u64 end, | 2924 | u64 start, u64 end, |
2925 | get_extent_t *get_extent, | 2925 | get_extent_t *get_extent, |
2926 | struct extent_map **em_cached, | 2926 | struct extent_map **em_cached, |
2927 | struct bio **bio, int mirror_num, | 2927 | struct bio **bio, int mirror_num, |
2928 | unsigned long *bio_flags, int rw) | 2928 | unsigned long *bio_flags, int rw) |
2929 | { | 2929 | { |
2930 | struct inode *inode; | 2930 | struct inode *inode; |
2931 | struct btrfs_ordered_extent *ordered; | 2931 | struct btrfs_ordered_extent *ordered; |
2932 | int index; | 2932 | int index; |
2933 | 2933 | ||
2934 | inode = pages[0]->mapping->host; | 2934 | inode = pages[0]->mapping->host; |
2935 | while (1) { | 2935 | while (1) { |
2936 | lock_extent(tree, start, end); | 2936 | lock_extent(tree, start, end); |
2937 | ordered = btrfs_lookup_ordered_range(inode, start, | 2937 | ordered = btrfs_lookup_ordered_range(inode, start, |
2938 | end - start + 1); | 2938 | end - start + 1); |
2939 | if (!ordered) | 2939 | if (!ordered) |
2940 | break; | 2940 | break; |
2941 | unlock_extent(tree, start, end); | 2941 | unlock_extent(tree, start, end); |
2942 | btrfs_start_ordered_extent(inode, ordered, 1); | 2942 | btrfs_start_ordered_extent(inode, ordered, 1); |
2943 | btrfs_put_ordered_extent(ordered); | 2943 | btrfs_put_ordered_extent(ordered); |
2944 | } | 2944 | } |
2945 | 2945 | ||
2946 | for (index = 0; index < nr_pages; index++) { | 2946 | for (index = 0; index < nr_pages; index++) { |
2947 | __do_readpage(tree, pages[index], get_extent, em_cached, bio, | 2947 | __do_readpage(tree, pages[index], get_extent, em_cached, bio, |
2948 | mirror_num, bio_flags, rw); | 2948 | mirror_num, bio_flags, rw); |
2949 | page_cache_release(pages[index]); | 2949 | page_cache_release(pages[index]); |
2950 | } | 2950 | } |
2951 | } | 2951 | } |
2952 | 2952 | ||
2953 | static void __extent_readpages(struct extent_io_tree *tree, | 2953 | static void __extent_readpages(struct extent_io_tree *tree, |
2954 | struct page *pages[], | 2954 | struct page *pages[], |
2955 | int nr_pages, get_extent_t *get_extent, | 2955 | int nr_pages, get_extent_t *get_extent, |
2956 | struct extent_map **em_cached, | 2956 | struct extent_map **em_cached, |
2957 | struct bio **bio, int mirror_num, | 2957 | struct bio **bio, int mirror_num, |
2958 | unsigned long *bio_flags, int rw) | 2958 | unsigned long *bio_flags, int rw) |
2959 | { | 2959 | { |
2960 | u64 start = 0; | 2960 | u64 start = 0; |
2961 | u64 end = 0; | 2961 | u64 end = 0; |
2962 | u64 page_start; | 2962 | u64 page_start; |
2963 | int index; | 2963 | int index; |
2964 | int first_index = 0; | 2964 | int first_index = 0; |
2965 | 2965 | ||
2966 | for (index = 0; index < nr_pages; index++) { | 2966 | for (index = 0; index < nr_pages; index++) { |
2967 | page_start = page_offset(pages[index]); | 2967 | page_start = page_offset(pages[index]); |
2968 | if (!end) { | 2968 | if (!end) { |
2969 | start = page_start; | 2969 | start = page_start; |
2970 | end = start + PAGE_CACHE_SIZE - 1; | 2970 | end = start + PAGE_CACHE_SIZE - 1; |
2971 | first_index = index; | 2971 | first_index = index; |
2972 | } else if (end + 1 == page_start) { | 2972 | } else if (end + 1 == page_start) { |
2973 | end += PAGE_CACHE_SIZE; | 2973 | end += PAGE_CACHE_SIZE; |
2974 | } else { | 2974 | } else { |
2975 | __do_contiguous_readpages(tree, &pages[first_index], | 2975 | __do_contiguous_readpages(tree, &pages[first_index], |
2976 | index - first_index, start, | 2976 | index - first_index, start, |
2977 | end, get_extent, em_cached, | 2977 | end, get_extent, em_cached, |
2978 | bio, mirror_num, bio_flags, | 2978 | bio, mirror_num, bio_flags, |
2979 | rw); | 2979 | rw); |
2980 | start = page_start; | 2980 | start = page_start; |
2981 | end = start + PAGE_CACHE_SIZE - 1; | 2981 | end = start + PAGE_CACHE_SIZE - 1; |
2982 | first_index = index; | 2982 | first_index = index; |
2983 | } | 2983 | } |
2984 | } | 2984 | } |
2985 | 2985 | ||
2986 | if (end) | 2986 | if (end) |
2987 | __do_contiguous_readpages(tree, &pages[first_index], | 2987 | __do_contiguous_readpages(tree, &pages[first_index], |
2988 | index - first_index, start, | 2988 | index - first_index, start, |
2989 | end, get_extent, em_cached, bio, | 2989 | end, get_extent, em_cached, bio, |
2990 | mirror_num, bio_flags, rw); | 2990 | mirror_num, bio_flags, rw); |
2991 | } | 2991 | } |
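__extent_readpages() folds pages whose file offsets are strictly adjacent into a single [start, end] run and submits the run, then starts a new one, whenever a gap appears in the batch. A compilable sketch of the same run detection, with a hypothetical read_run() standing in for __do_contiguous_readpages():

#include <stdio.h>

typedef unsigned long long u64;
#define PAGE_SZ 4096ULL

/* Stand-in for __do_contiguous_readpages(): handle one contiguous run. */
static void read_run(int first, int count, u64 start, u64 end)
{
        printf("pages [%d..%d) cover bytes [%llu, %llu]\n",
               first, first + count, start, end);
}

int main(void)
{
        /* File offsets of the pages in the batch; the last page is not adjacent. */
        u64 offsets[] = { 0, PAGE_SZ, 2 * PAGE_SZ, 10 * PAGE_SZ };
        int nr = 4, i, first = 0;
        u64 start = 0, end = 0;

        for (i = 0; i < nr; i++) {
                if (!end) {
                        start = offsets[i];
                        end = start + PAGE_SZ - 1;
                        first = i;
                } else if (end + 1 == offsets[i]) {
                        end += PAGE_SZ;            /* still contiguous: extend */
                } else {
                        read_run(first, i - first, start, end);
                        start = offsets[i];
                        end = start + PAGE_SZ - 1;
                        first = i;
                }
        }
        if (end)
                read_run(first, i - first, start, end);
        return 0;
}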
2992 | 2992 | ||
2993 | static int __extent_read_full_page(struct extent_io_tree *tree, | 2993 | static int __extent_read_full_page(struct extent_io_tree *tree, |
2994 | struct page *page, | 2994 | struct page *page, |
2995 | get_extent_t *get_extent, | 2995 | get_extent_t *get_extent, |
2996 | struct bio **bio, int mirror_num, | 2996 | struct bio **bio, int mirror_num, |
2997 | unsigned long *bio_flags, int rw) | 2997 | unsigned long *bio_flags, int rw) |
2998 | { | 2998 | { |
2999 | struct inode *inode = page->mapping->host; | 2999 | struct inode *inode = page->mapping->host; |
3000 | struct btrfs_ordered_extent *ordered; | 3000 | struct btrfs_ordered_extent *ordered; |
3001 | u64 start = page_offset(page); | 3001 | u64 start = page_offset(page); |
3002 | u64 end = start + PAGE_CACHE_SIZE - 1; | 3002 | u64 end = start + PAGE_CACHE_SIZE - 1; |
3003 | int ret; | 3003 | int ret; |
3004 | 3004 | ||
3005 | while (1) { | 3005 | while (1) { |
3006 | lock_extent(tree, start, end); | 3006 | lock_extent(tree, start, end); |
3007 | ordered = btrfs_lookup_ordered_extent(inode, start); | 3007 | ordered = btrfs_lookup_ordered_extent(inode, start); |
3008 | if (!ordered) | 3008 | if (!ordered) |
3009 | break; | 3009 | break; |
3010 | unlock_extent(tree, start, end); | 3010 | unlock_extent(tree, start, end); |
3011 | btrfs_start_ordered_extent(inode, ordered, 1); | 3011 | btrfs_start_ordered_extent(inode, ordered, 1); |
3012 | btrfs_put_ordered_extent(ordered); | 3012 | btrfs_put_ordered_extent(ordered); |
3013 | } | 3013 | } |
3014 | 3014 | ||
3015 | ret = __do_readpage(tree, page, get_extent, NULL, bio, mirror_num, | 3015 | ret = __do_readpage(tree, page, get_extent, NULL, bio, mirror_num, |
3016 | bio_flags, rw); | 3016 | bio_flags, rw); |
3017 | return ret; | 3017 | return ret; |
3018 | } | 3018 | } |
3019 | 3019 | ||
3020 | int extent_read_full_page(struct extent_io_tree *tree, struct page *page, | 3020 | int extent_read_full_page(struct extent_io_tree *tree, struct page *page, |
3021 | get_extent_t *get_extent, int mirror_num) | 3021 | get_extent_t *get_extent, int mirror_num) |
3022 | { | 3022 | { |
3023 | struct bio *bio = NULL; | 3023 | struct bio *bio = NULL; |
3024 | unsigned long bio_flags = 0; | 3024 | unsigned long bio_flags = 0; |
3025 | int ret; | 3025 | int ret; |
3026 | 3026 | ||
3027 | ret = __extent_read_full_page(tree, page, get_extent, &bio, mirror_num, | 3027 | ret = __extent_read_full_page(tree, page, get_extent, &bio, mirror_num, |
3028 | &bio_flags, READ); | 3028 | &bio_flags, READ); |
3029 | if (bio) | 3029 | if (bio) |
3030 | ret = submit_one_bio(READ, bio, mirror_num, bio_flags); | 3030 | ret = submit_one_bio(READ, bio, mirror_num, bio_flags); |
3031 | return ret; | 3031 | return ret; |
3032 | } | 3032 | } |
3033 | 3033 | ||
3034 | int extent_read_full_page_nolock(struct extent_io_tree *tree, struct page *page, | 3034 | int extent_read_full_page_nolock(struct extent_io_tree *tree, struct page *page, |
3035 | get_extent_t *get_extent, int mirror_num) | 3035 | get_extent_t *get_extent, int mirror_num) |
3036 | { | 3036 | { |
3037 | struct bio *bio = NULL; | 3037 | struct bio *bio = NULL; |
3038 | unsigned long bio_flags = EXTENT_BIO_PARENT_LOCKED; | 3038 | unsigned long bio_flags = EXTENT_BIO_PARENT_LOCKED; |
3039 | int ret; | 3039 | int ret; |
3040 | 3040 | ||
3041 | ret = __do_readpage(tree, page, get_extent, NULL, &bio, mirror_num, | 3041 | ret = __do_readpage(tree, page, get_extent, NULL, &bio, mirror_num, |
3042 | &bio_flags, READ); | 3042 | &bio_flags, READ); |
3043 | if (bio) | 3043 | if (bio) |
3044 | ret = submit_one_bio(READ, bio, mirror_num, bio_flags); | 3044 | ret = submit_one_bio(READ, bio, mirror_num, bio_flags); |
3045 | return ret; | 3045 | return ret; |
3046 | } | 3046 | } |
3047 | 3047 | ||
3048 | static noinline void update_nr_written(struct page *page, | 3048 | static noinline void update_nr_written(struct page *page, |
3049 | struct writeback_control *wbc, | 3049 | struct writeback_control *wbc, |
3050 | unsigned long nr_written) | 3050 | unsigned long nr_written) |
3051 | { | 3051 | { |
3052 | wbc->nr_to_write -= nr_written; | 3052 | wbc->nr_to_write -= nr_written; |
3053 | if (wbc->range_cyclic || (wbc->nr_to_write > 0 && | 3053 | if (wbc->range_cyclic || (wbc->nr_to_write > 0 && |
3054 | wbc->range_start == 0 && wbc->range_end == LLONG_MAX)) | 3054 | wbc->range_start == 0 && wbc->range_end == LLONG_MAX)) |
3055 | page->mapping->writeback_index = page->index + nr_written; | 3055 | page->mapping->writeback_index = page->index + nr_written; |
3056 | } | 3056 | } |
3057 | 3057 | ||
3058 | /* | 3058 | /* |
3059 | * the writepage semantics are similar to regular writepage. extent | 3059 | * the writepage semantics are similar to regular writepage. extent |
3060 | * records are inserted to lock ranges in the tree, and as dirty areas | 3060 | * records are inserted to lock ranges in the tree, and as dirty areas |
3061 | * are found, they are marked writeback. Then the lock bits are removed | 3061 | * are found, they are marked writeback. Then the lock bits are removed |
3062 | * and the end_io handler clears the writeback ranges | 3062 | * and the end_io handler clears the writeback ranges |
3063 | */ | 3063 | */ |
3064 | static int __extent_writepage(struct page *page, struct writeback_control *wbc, | 3064 | static int __extent_writepage(struct page *page, struct writeback_control *wbc, |
3065 | void *data) | 3065 | void *data) |
3066 | { | 3066 | { |
3067 | struct inode *inode = page->mapping->host; | 3067 | struct inode *inode = page->mapping->host; |
3068 | struct extent_page_data *epd = data; | 3068 | struct extent_page_data *epd = data; |
3069 | struct extent_io_tree *tree = epd->tree; | 3069 | struct extent_io_tree *tree = epd->tree; |
3070 | u64 start = page_offset(page); | 3070 | u64 start = page_offset(page); |
3071 | u64 delalloc_start; | 3071 | u64 delalloc_start; |
3072 | u64 page_end = start + PAGE_CACHE_SIZE - 1; | 3072 | u64 page_end = start + PAGE_CACHE_SIZE - 1; |
3073 | u64 end; | 3073 | u64 end; |
3074 | u64 cur = start; | 3074 | u64 cur = start; |
3075 | u64 extent_offset; | 3075 | u64 extent_offset; |
3076 | u64 last_byte = i_size_read(inode); | 3076 | u64 last_byte = i_size_read(inode); |
3077 | u64 block_start; | 3077 | u64 block_start; |
3078 | u64 iosize; | 3078 | u64 iosize; |
3079 | sector_t sector; | 3079 | sector_t sector; |
3080 | struct extent_state *cached_state = NULL; | 3080 | struct extent_state *cached_state = NULL; |
3081 | struct extent_map *em; | 3081 | struct extent_map *em; |
3082 | struct block_device *bdev; | 3082 | struct block_device *bdev; |
3083 | int ret; | 3083 | int ret; |
3084 | int nr = 0; | 3084 | int nr = 0; |
3085 | size_t pg_offset = 0; | 3085 | size_t pg_offset = 0; |
3086 | size_t blocksize; | 3086 | size_t blocksize; |
3087 | loff_t i_size = i_size_read(inode); | 3087 | loff_t i_size = i_size_read(inode); |
3088 | unsigned long end_index = i_size >> PAGE_CACHE_SHIFT; | 3088 | unsigned long end_index = i_size >> PAGE_CACHE_SHIFT; |
3089 | u64 nr_delalloc; | 3089 | u64 nr_delalloc; |
3090 | u64 delalloc_end; | 3090 | u64 delalloc_end; |
3091 | int page_started; | 3091 | int page_started; |
3092 | int compressed; | 3092 | int compressed; |
3093 | int write_flags; | 3093 | int write_flags; |
3094 | unsigned long nr_written = 0; | 3094 | unsigned long nr_written = 0; |
3095 | bool fill_delalloc = true; | 3095 | bool fill_delalloc = true; |
3096 | 3096 | ||
3097 | if (wbc->sync_mode == WB_SYNC_ALL) | 3097 | if (wbc->sync_mode == WB_SYNC_ALL) |
3098 | write_flags = WRITE_SYNC; | 3098 | write_flags = WRITE_SYNC; |
3099 | else | 3099 | else |
3100 | write_flags = WRITE; | 3100 | write_flags = WRITE; |
3101 | 3101 | ||
3102 | trace___extent_writepage(page, inode, wbc); | 3102 | trace___extent_writepage(page, inode, wbc); |
3103 | 3103 | ||
3104 | WARN_ON(!PageLocked(page)); | 3104 | WARN_ON(!PageLocked(page)); |
3105 | 3105 | ||
3106 | ClearPageError(page); | 3106 | ClearPageError(page); |
3107 | 3107 | ||
3108 | pg_offset = i_size & (PAGE_CACHE_SIZE - 1); | 3108 | pg_offset = i_size & (PAGE_CACHE_SIZE - 1); |
3109 | if (page->index > end_index || | 3109 | if (page->index > end_index || |
3110 | (page->index == end_index && !pg_offset)) { | 3110 | (page->index == end_index && !pg_offset)) { |
3111 | page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE); | 3111 | page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE); |
3112 | unlock_page(page); | 3112 | unlock_page(page); |
3113 | return 0; | 3113 | return 0; |
3114 | } | 3114 | } |
3115 | 3115 | ||
3116 | if (page->index == end_index) { | 3116 | if (page->index == end_index) { |
3117 | char *userpage; | 3117 | char *userpage; |
3118 | 3118 | ||
3119 | userpage = kmap_atomic(page); | 3119 | userpage = kmap_atomic(page); |
3120 | memset(userpage + pg_offset, 0, | 3120 | memset(userpage + pg_offset, 0, |
3121 | PAGE_CACHE_SIZE - pg_offset); | 3121 | PAGE_CACHE_SIZE - pg_offset); |
3122 | kunmap_atomic(userpage); | 3122 | kunmap_atomic(userpage); |
3123 | flush_dcache_page(page); | 3123 | flush_dcache_page(page); |
3124 | } | 3124 | } |
3125 | pg_offset = 0; | 3125 | pg_offset = 0; |
3126 | 3126 | ||
3127 | set_page_extent_mapped(page); | 3127 | set_page_extent_mapped(page); |
3128 | 3128 | ||
3129 | if (!tree->ops || !tree->ops->fill_delalloc) | 3129 | if (!tree->ops || !tree->ops->fill_delalloc) |
3130 | fill_delalloc = false; | 3130 | fill_delalloc = false; |
3131 | 3131 | ||
3132 | delalloc_start = start; | 3132 | delalloc_start = start; |
3133 | delalloc_end = 0; | 3133 | delalloc_end = 0; |
3134 | page_started = 0; | 3134 | page_started = 0; |
3135 | if (!epd->extent_locked && fill_delalloc) { | 3135 | if (!epd->extent_locked && fill_delalloc) { |
3136 | u64 delalloc_to_write = 0; | 3136 | u64 delalloc_to_write = 0; |
3137 | /* | 3137 | /* |
3138 | * make sure the wbc mapping index is at least updated | 3138 | * make sure the wbc mapping index is at least updated |
3139 | * to this page. | 3139 | * to this page. |
3140 | */ | 3140 | */ |
3141 | update_nr_written(page, wbc, 0); | 3141 | update_nr_written(page, wbc, 0); |
3142 | 3142 | ||
3143 | while (delalloc_end < page_end) { | 3143 | while (delalloc_end < page_end) { |
3144 | nr_delalloc = find_lock_delalloc_range(inode, tree, | 3144 | nr_delalloc = find_lock_delalloc_range(inode, tree, |
3145 | page, | 3145 | page, |
3146 | &delalloc_start, | 3146 | &delalloc_start, |
3147 | &delalloc_end, | 3147 | &delalloc_end, |
3148 | 128 * 1024 * 1024); | 3148 | 128 * 1024 * 1024); |
3149 | if (nr_delalloc == 0) { | 3149 | if (nr_delalloc == 0) { |
3150 | delalloc_start = delalloc_end + 1; | 3150 | delalloc_start = delalloc_end + 1; |
3151 | continue; | 3151 | continue; |
3152 | } | 3152 | } |
3153 | ret = tree->ops->fill_delalloc(inode, page, | 3153 | ret = tree->ops->fill_delalloc(inode, page, |
3154 | delalloc_start, | 3154 | delalloc_start, |
3155 | delalloc_end, | 3155 | delalloc_end, |
3156 | &page_started, | 3156 | &page_started, |
3157 | &nr_written); | 3157 | &nr_written); |
3158 | /* File system has been set read-only */ | 3158 | /* File system has been set read-only */ |
3159 | if (ret) { | 3159 | if (ret) { |
3160 | SetPageError(page); | 3160 | SetPageError(page); |
3161 | goto done; | 3161 | goto done; |
3162 | } | 3162 | } |
3163 | /* | 3163 | /* |
3164 | * delalloc_end is already one less than the total | 3164 | * delalloc_end is already one less than the total |
3165 | * length, so we don't subtract one from | 3165 | * length, so we don't subtract one from |
3166 | * PAGE_CACHE_SIZE | 3166 | * PAGE_CACHE_SIZE |
3167 | */ | 3167 | */ |
3168 | delalloc_to_write += (delalloc_end - delalloc_start + | 3168 | delalloc_to_write += (delalloc_end - delalloc_start + |
3169 | PAGE_CACHE_SIZE) >> | 3169 | PAGE_CACHE_SIZE) >> |
3170 | PAGE_CACHE_SHIFT; | 3170 | PAGE_CACHE_SHIFT; |
3171 | delalloc_start = delalloc_end + 1; | 3171 | delalloc_start = delalloc_end + 1; |
3172 | } | 3172 | } |
3173 | if (wbc->nr_to_write < delalloc_to_write) { | 3173 | if (wbc->nr_to_write < delalloc_to_write) { |
3174 | int thresh = 8192; | 3174 | int thresh = 8192; |
3175 | 3175 | ||
3176 | if (delalloc_to_write < thresh * 2) | 3176 | if (delalloc_to_write < thresh * 2) |
3177 | thresh = delalloc_to_write; | 3177 | thresh = delalloc_to_write; |
3178 | wbc->nr_to_write = min_t(u64, delalloc_to_write, | 3178 | wbc->nr_to_write = min_t(u64, delalloc_to_write, |
3179 | thresh); | 3179 | thresh); |
3180 | } | 3180 | } |
3181 | 3181 | ||
3182 | /* did the fill delalloc function already unlock and start | 3182 | /* did the fill delalloc function already unlock and start |
3183 | * the IO? | 3183 | * the IO? |
3184 | */ | 3184 | */ |
3185 | if (page_started) { | 3185 | if (page_started) { |
3186 | ret = 0; | 3186 | ret = 0; |
3187 | /* | 3187 | /* |
3188 | * we've unlocked the page, so we can't update | 3188 | * we've unlocked the page, so we can't update |
3189 | * the mapping's writeback index, just update | 3189 | * the mapping's writeback index, just update |
3190 | * nr_to_write. | 3190 | * nr_to_write. |
3191 | */ | 3191 | */ |
3192 | wbc->nr_to_write -= nr_written; | 3192 | wbc->nr_to_write -= nr_written; |
3193 | goto done_unlocked; | 3193 | goto done_unlocked; |
3194 | } | 3194 | } |
3195 | } | 3195 | } |
3196 | if (tree->ops && tree->ops->writepage_start_hook) { | 3196 | if (tree->ops && tree->ops->writepage_start_hook) { |
3197 | ret = tree->ops->writepage_start_hook(page, start, | 3197 | ret = tree->ops->writepage_start_hook(page, start, |
3198 | page_end); | 3198 | page_end); |
3199 | if (ret) { | 3199 | if (ret) { |
3200 | /* Fixup worker will requeue */ | 3200 | /* Fixup worker will requeue */ |
3201 | if (ret == -EBUSY) | 3201 | if (ret == -EBUSY) |
3202 | wbc->pages_skipped++; | 3202 | wbc->pages_skipped++; |
3203 | else | 3203 | else |
3204 | redirty_page_for_writepage(wbc, page); | 3204 | redirty_page_for_writepage(wbc, page); |
3205 | update_nr_written(page, wbc, nr_written); | 3205 | update_nr_written(page, wbc, nr_written); |
3206 | unlock_page(page); | 3206 | unlock_page(page); |
3207 | ret = 0; | 3207 | ret = 0; |
3208 | goto done_unlocked; | 3208 | goto done_unlocked; |
3209 | } | 3209 | } |
3210 | } | 3210 | } |
3211 | 3211 | ||
3212 | /* | 3212 | /* |
3213 | * we don't want to touch the inode after unlocking the page, | 3213 | * we don't want to touch the inode after unlocking the page, |
3214 | * so we update the mapping writeback index now | 3214 | * so we update the mapping writeback index now |
3215 | */ | 3215 | */ |
3216 | update_nr_written(page, wbc, nr_written + 1); | 3216 | update_nr_written(page, wbc, nr_written + 1); |
3217 | 3217 | ||
3218 | end = page_end; | 3218 | end = page_end; |
3219 | if (last_byte <= start) { | 3219 | if (last_byte <= start) { |
3220 | if (tree->ops && tree->ops->writepage_end_io_hook) | 3220 | if (tree->ops && tree->ops->writepage_end_io_hook) |
3221 | tree->ops->writepage_end_io_hook(page, start, | 3221 | tree->ops->writepage_end_io_hook(page, start, |
3222 | page_end, NULL, 1); | 3222 | page_end, NULL, 1); |
3223 | goto done; | 3223 | goto done; |
3224 | } | 3224 | } |
3225 | 3225 | ||
3226 | blocksize = inode->i_sb->s_blocksize; | 3226 | blocksize = inode->i_sb->s_blocksize; |
3227 | 3227 | ||
3228 | while (cur <= end) { | 3228 | while (cur <= end) { |
3229 | if (cur >= last_byte) { | 3229 | if (cur >= last_byte) { |
3230 | if (tree->ops && tree->ops->writepage_end_io_hook) | 3230 | if (tree->ops && tree->ops->writepage_end_io_hook) |
3231 | tree->ops->writepage_end_io_hook(page, cur, | 3231 | tree->ops->writepage_end_io_hook(page, cur, |
3232 | page_end, NULL, 1); | 3232 | page_end, NULL, 1); |
3233 | break; | 3233 | break; |
3234 | } | 3234 | } |
3235 | em = epd->get_extent(inode, page, pg_offset, cur, | 3235 | em = epd->get_extent(inode, page, pg_offset, cur, |
3236 | end - cur + 1, 1); | 3236 | end - cur + 1, 1); |
3237 | if (IS_ERR_OR_NULL(em)) { | 3237 | if (IS_ERR_OR_NULL(em)) { |
3238 | SetPageError(page); | 3238 | SetPageError(page); |
3239 | break; | 3239 | break; |
3240 | } | 3240 | } |
3241 | 3241 | ||
3242 | extent_offset = cur - em->start; | 3242 | extent_offset = cur - em->start; |
3243 | BUG_ON(extent_map_end(em) <= cur); | 3243 | BUG_ON(extent_map_end(em) <= cur); |
3244 | BUG_ON(end < cur); | 3244 | BUG_ON(end < cur); |
3245 | iosize = min(extent_map_end(em) - cur, end - cur + 1); | 3245 | iosize = min(extent_map_end(em) - cur, end - cur + 1); |
3246 | iosize = ALIGN(iosize, blocksize); | 3246 | iosize = ALIGN(iosize, blocksize); |
3247 | sector = (em->block_start + extent_offset) >> 9; | 3247 | sector = (em->block_start + extent_offset) >> 9; |
3248 | bdev = em->bdev; | 3248 | bdev = em->bdev; |
3249 | block_start = em->block_start; | 3249 | block_start = em->block_start; |
3250 | compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags); | 3250 | compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags); |
3251 | free_extent_map(em); | 3251 | free_extent_map(em); |
3252 | em = NULL; | 3252 | em = NULL; |
3253 | 3253 | ||
3254 | /* | 3254 | /* |
3255 | * compressed and inline extents are written through other | 3255 | * compressed and inline extents are written through other |
3256 | * paths in the FS | 3256 | * paths in the FS |
3257 | */ | 3257 | */ |
3258 | if (compressed || block_start == EXTENT_MAP_HOLE || | 3258 | if (compressed || block_start == EXTENT_MAP_HOLE || |
3259 | block_start == EXTENT_MAP_INLINE) { | 3259 | block_start == EXTENT_MAP_INLINE) { |
3260 | /* | 3260 | /* |
3261 | * end_io notification does not happen here for | 3261 | * end_io notification does not happen here for |
3262 | * compressed extents | 3262 | * compressed extents |
3263 | */ | 3263 | */ |
3264 | if (!compressed && tree->ops && | 3264 | if (!compressed && tree->ops && |
3265 | tree->ops->writepage_end_io_hook) | 3265 | tree->ops->writepage_end_io_hook) |
3266 | tree->ops->writepage_end_io_hook(page, cur, | 3266 | tree->ops->writepage_end_io_hook(page, cur, |
3267 | cur + iosize - 1, | 3267 | cur + iosize - 1, |
3268 | NULL, 1); | 3268 | NULL, 1); |
3269 | else if (compressed) { | 3269 | else if (compressed) { |
3270 | /* we don't want to end_page_writeback on | 3270 | /* we don't want to end_page_writeback on |
3271 | * a compressed extent. this happens | 3271 | * a compressed extent. this happens |
3272 | * elsewhere | 3272 | * elsewhere |
3273 | */ | 3273 | */ |
3274 | nr++; | 3274 | nr++; |
3275 | } | 3275 | } |
3276 | 3276 | ||
3277 | cur += iosize; | 3277 | cur += iosize; |
3278 | pg_offset += iosize; | 3278 | pg_offset += iosize; |
3279 | continue; | 3279 | continue; |
3280 | } | 3280 | } |
3281 | /* leave this out until we have a page_mkwrite call */ | 3281 | /* leave this out until we have a page_mkwrite call */ |
3282 | if (0 && !test_range_bit(tree, cur, cur + iosize - 1, | 3282 | if (0 && !test_range_bit(tree, cur, cur + iosize - 1, |
3283 | EXTENT_DIRTY, 0, NULL)) { | 3283 | EXTENT_DIRTY, 0, NULL)) { |
3284 | cur = cur + iosize; | 3284 | cur = cur + iosize; |
3285 | pg_offset += iosize; | 3285 | pg_offset += iosize; |
3286 | continue; | 3286 | continue; |
3287 | } | 3287 | } |
3288 | 3288 | ||
3289 | if (tree->ops && tree->ops->writepage_io_hook) { | 3289 | if (tree->ops && tree->ops->writepage_io_hook) { |
3290 | ret = tree->ops->writepage_io_hook(page, cur, | 3290 | ret = tree->ops->writepage_io_hook(page, cur, |
3291 | cur + iosize - 1); | 3291 | cur + iosize - 1); |
3292 | } else { | 3292 | } else { |
3293 | ret = 0; | 3293 | ret = 0; |
3294 | } | 3294 | } |
3295 | if (ret) { | 3295 | if (ret) { |
3296 | SetPageError(page); | 3296 | SetPageError(page); |
3297 | } else { | 3297 | } else { |
3298 | unsigned long max_nr = end_index + 1; | 3298 | unsigned long max_nr = end_index + 1; |
3299 | 3299 | ||
3300 | set_range_writeback(tree, cur, cur + iosize - 1); | 3300 | set_range_writeback(tree, cur, cur + iosize - 1); |
3301 | if (!PageWriteback(page)) { | 3301 | if (!PageWriteback(page)) { |
3302 | printk(KERN_ERR "btrfs warning page %lu not " | 3302 | printk(KERN_ERR "btrfs warning page %lu not " |
3303 | "writeback, cur %llu end %llu\n", | 3303 | "writeback, cur %llu end %llu\n", |
3304 | page->index, cur, end); | 3304 | page->index, cur, end); |
3305 | } | 3305 | } |
3306 | 3306 | ||
3307 | ret = submit_extent_page(write_flags, tree, page, | 3307 | ret = submit_extent_page(write_flags, tree, page, |
3308 | sector, iosize, pg_offset, | 3308 | sector, iosize, pg_offset, |
3309 | bdev, &epd->bio, max_nr, | 3309 | bdev, &epd->bio, max_nr, |
3310 | end_bio_extent_writepage, | 3310 | end_bio_extent_writepage, |
3311 | 0, 0, 0); | 3311 | 0, 0, 0); |
3312 | if (ret) | 3312 | if (ret) |
3313 | SetPageError(page); | 3313 | SetPageError(page); |
3314 | } | 3314 | } |
3315 | cur = cur + iosize; | 3315 | cur = cur + iosize; |
3316 | pg_offset += iosize; | 3316 | pg_offset += iosize; |
3317 | nr++; | 3317 | nr++; |
3318 | } | 3318 | } |
3319 | done: | 3319 | done: |
3320 | if (nr == 0) { | 3320 | if (nr == 0) { |
3321 | /* make sure the mapping tag for page dirty gets cleared */ | 3321 | /* make sure the mapping tag for page dirty gets cleared */ |
3322 | set_page_writeback(page); | 3322 | set_page_writeback(page); |
3323 | end_page_writeback(page); | 3323 | end_page_writeback(page); |
3324 | } | 3324 | } |
3325 | unlock_page(page); | 3325 | unlock_page(page); |
3326 | 3326 | ||
3327 | done_unlocked: | 3327 | done_unlocked: |
3328 | 3328 | ||
3329 | /* drop our reference on any cached states */ | 3329 | /* drop our reference on any cached states */ |
3330 | free_extent_state(cached_state); | 3330 | free_extent_state(cached_state); |
3331 | return 0; | 3331 | return 0; |
3332 | } | 3332 | } |
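The write path above and the read-completion path earlier share the same end-of-file arithmetic: i_size >> PAGE_CACHE_SHIFT names the last page index that holds file data and i_size & (PAGE_CACHE_SIZE - 1) the number of valid bytes inside it, which together decide which pages are written whole, which page gets its tail zeroed, and which pages are simply invalidated. A worked example with a hypothetical 10000-byte file and 4 KiB pages:

#include <stdio.h>

typedef unsigned long long u64;
#define PAGE_SZ    4096ULL
#define PAGE_SHIFT 12

int main(void)
{
        u64 i_size = 10000;                    /* hypothetical file size       */
        u64 end_index = i_size >> PAGE_SHIFT;  /* last page index holding data */
        u64 tail = i_size & (PAGE_SZ - 1);     /* valid bytes inside that page */

        printf("end_index=%llu tail=%llu\n", end_index, tail);
        /*
         * pages 0 and 1: entirely inside the file, written whole
         * page 2 (== end_index): bytes [0, tail) are data, [tail, 4096) get zeroed
         * page 3 and up: entirely past EOF, writepage just invalidates them
         */
        return 0;
}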
3333 | 3333 | ||
3334 | static int eb_wait(void *word) | 3334 | static int eb_wait(void *word) |
3335 | { | 3335 | { |
3336 | io_schedule(); | 3336 | io_schedule(); |
3337 | return 0; | 3337 | return 0; |
3338 | } | 3338 | } |
3339 | 3339 | ||
3340 | void wait_on_extent_buffer_writeback(struct extent_buffer *eb) | 3340 | void wait_on_extent_buffer_writeback(struct extent_buffer *eb) |
3341 | { | 3341 | { |
3342 | wait_on_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK, eb_wait, | 3342 | wait_on_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK, eb_wait, |
3343 | TASK_UNINTERRUPTIBLE); | 3343 | TASK_UNINTERRUPTIBLE); |
3344 | } | 3344 | } |
3345 | 3345 | ||
3346 | static int lock_extent_buffer_for_io(struct extent_buffer *eb, | 3346 | static int lock_extent_buffer_for_io(struct extent_buffer *eb, |
3347 | struct btrfs_fs_info *fs_info, | 3347 | struct btrfs_fs_info *fs_info, |
3348 | struct extent_page_data *epd) | 3348 | struct extent_page_data *epd) |
3349 | { | 3349 | { |
3350 | unsigned long i, num_pages; | 3350 | unsigned long i, num_pages; |
3351 | int flush = 0; | 3351 | int flush = 0; |
3352 | int ret = 0; | 3352 | int ret = 0; |
3353 | 3353 | ||
3354 | if (!btrfs_try_tree_write_lock(eb)) { | 3354 | if (!btrfs_try_tree_write_lock(eb)) { |
3355 | flush = 1; | 3355 | flush = 1; |
3356 | flush_write_bio(epd); | 3356 | flush_write_bio(epd); |
3357 | btrfs_tree_lock(eb); | 3357 | btrfs_tree_lock(eb); |
3358 | } | 3358 | } |
3359 | 3359 | ||
3360 | if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) { | 3360 | if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) { |
3361 | btrfs_tree_unlock(eb); | 3361 | btrfs_tree_unlock(eb); |
3362 | if (!epd->sync_io) | 3362 | if (!epd->sync_io) |
3363 | return 0; | 3363 | return 0; |
3364 | if (!flush) { | 3364 | if (!flush) { |
3365 | flush_write_bio(epd); | 3365 | flush_write_bio(epd); |
3366 | flush = 1; | 3366 | flush = 1; |
3367 | } | 3367 | } |
3368 | while (1) { | 3368 | while (1) { |
3369 | wait_on_extent_buffer_writeback(eb); | 3369 | wait_on_extent_buffer_writeback(eb); |
3370 | btrfs_tree_lock(eb); | 3370 | btrfs_tree_lock(eb); |
3371 | if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) | 3371 | if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) |
3372 | break; | 3372 | break; |
3373 | btrfs_tree_unlock(eb); | 3373 | btrfs_tree_unlock(eb); |
3374 | } | 3374 | } |
3375 | } | 3375 | } |
3376 | 3376 | ||
3377 | /* | 3377 | /* |
3378 | * We need to do this to prevent races in people who check if the eb is | 3378 | * We need to do this to prevent races in people who check if the eb is |
3379 | * under IO since we can end up having no IO bits set for a short period | 3379 | * under IO since we can end up having no IO bits set for a short period |
3380 | * of time. | 3380 | * of time. |
3381 | */ | 3381 | */ |
3382 | spin_lock(&eb->refs_lock); | 3382 | spin_lock(&eb->refs_lock); |
3383 | if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) { | 3383 | if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) { |
3384 | set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags); | 3384 | set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags); |
3385 | spin_unlock(&eb->refs_lock); | 3385 | spin_unlock(&eb->refs_lock); |
3386 | btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN); | 3386 | btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN); |
3387 | __percpu_counter_add(&fs_info->dirty_metadata_bytes, | 3387 | __percpu_counter_add(&fs_info->dirty_metadata_bytes, |
3388 | -eb->len, | 3388 | -eb->len, |
3389 | fs_info->dirty_metadata_batch); | 3389 | fs_info->dirty_metadata_batch); |
3390 | ret = 1; | 3390 | ret = 1; |
3391 | } else { | 3391 | } else { |
3392 | spin_unlock(&eb->refs_lock); | 3392 | spin_unlock(&eb->refs_lock); |
3393 | } | 3393 | } |
3394 | 3394 | ||
3395 | btrfs_tree_unlock(eb); | 3395 | btrfs_tree_unlock(eb); |
3396 | 3396 | ||
3397 | if (!ret) | 3397 | if (!ret) |
3398 | return ret; | 3398 | return ret; |
3399 | 3399 | ||
3400 | num_pages = num_extent_pages(eb->start, eb->len); | 3400 | num_pages = num_extent_pages(eb->start, eb->len); |
3401 | for (i = 0; i < num_pages; i++) { | 3401 | for (i = 0; i < num_pages; i++) { |
3402 | struct page *p = extent_buffer_page(eb, i); | 3402 | struct page *p = extent_buffer_page(eb, i); |
3403 | 3403 | ||
3404 | if (!trylock_page(p)) { | 3404 | if (!trylock_page(p)) { |
3405 | if (!flush) { | 3405 | if (!flush) { |
3406 | flush_write_bio(epd); | 3406 | flush_write_bio(epd); |
3407 | flush = 1; | 3407 | flush = 1; |
3408 | } | 3408 | } |
3409 | lock_page(p); | 3409 | lock_page(p); |
3410 | } | 3410 | } |
3411 | } | 3411 | } |
3412 | 3412 | ||
3413 | return ret; | 3413 | return ret; |
3414 | } | 3414 | } |
3415 | 3415 | ||
3416 | static void end_extent_buffer_writeback(struct extent_buffer *eb) | 3416 | static void end_extent_buffer_writeback(struct extent_buffer *eb) |
3417 | { | 3417 | { |
3418 | clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags); | 3418 | clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags); |
3419 | smp_mb__after_clear_bit(); | 3419 | smp_mb__after_clear_bit(); |
3420 | wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK); | 3420 | wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK); |
3421 | } | 3421 | } |

static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
{
	int uptodate = err == 0;
	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
	struct extent_buffer *eb;
	int done;

	do {
		struct page *page = bvec->bv_page;

		bvec--;
		eb = (struct extent_buffer *)page->private;
		BUG_ON(!eb);
		done = atomic_dec_and_test(&eb->io_pages);

		if (!uptodate || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
			ClearPageUptodate(page);
			SetPageError(page);
		}

		end_page_writeback(page);

		if (!done)
			continue;

		end_extent_buffer_writeback(eb);
	} while (bvec >= bio->bi_io_vec);

	bio_put(bio);

}

static int write_one_eb(struct extent_buffer *eb,
			struct btrfs_fs_info *fs_info,
			struct writeback_control *wbc,
			struct extent_page_data *epd)
{
	struct block_device *bdev = fs_info->fs_devices->latest_bdev;
	u64 offset = eb->start;
	unsigned long i, num_pages;
	unsigned long bio_flags = 0;
	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
	int ret = 0;

	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
	num_pages = num_extent_pages(eb->start, eb->len);
	atomic_set(&eb->io_pages, num_pages);
	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
		bio_flags = EXTENT_BIO_TREE_LOG;

	for (i = 0; i < num_pages; i++) {
		struct page *p = extent_buffer_page(eb, i);

		clear_page_dirty_for_io(p);
		set_page_writeback(p);
		ret = submit_extent_page(rw, eb->tree, p, offset >> 9,
					 PAGE_CACHE_SIZE, 0, bdev, &epd->bio,
					 -1, end_bio_extent_buffer_writepage,
					 0, epd->bio_flags, bio_flags);
		epd->bio_flags = bio_flags;
		if (ret) {
			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
			SetPageError(p);
			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
				end_extent_buffer_writeback(eb);
			ret = -EIO;
			break;
		}
		offset += PAGE_CACHE_SIZE;
		update_nr_written(p, wbc, 1);
		unlock_page(p);
	}

	if (unlikely(ret)) {
		for (; i < num_pages; i++) {
			struct page *p = extent_buffer_page(eb, i);
			unlock_page(p);
		}
	}

	return ret;
}

int btree_write_cache_pages(struct address_space *mapping,
			    struct writeback_control *wbc)
{
	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
	struct btrfs_fs_info *fs_info = BTRFS_I(mapping->host)->root->fs_info;
	struct extent_buffer *eb, *prev_eb = NULL;
	struct extent_page_data epd = {
		.bio = NULL,
		.tree = tree,
		.extent_locked = 0,
		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
		.bio_flags = 0,
	};
	int ret = 0;
	int done = 0;
	int nr_to_write_done = 0;
	struct pagevec pvec;
	int nr_pages;
	pgoff_t index;
	pgoff_t end;		/* Inclusive */
	int scanned = 0;
	int tag;

	pagevec_init(&pvec, 0);
	if (wbc->range_cyclic) {
		index = mapping->writeback_index; /* Start from prev offset */
		end = -1;
	} else {
		index = wbc->range_start >> PAGE_CACHE_SHIFT;
		end = wbc->range_end >> PAGE_CACHE_SHIFT;
		scanned = 1;
	}
	if (wbc->sync_mode == WB_SYNC_ALL)
		tag = PAGECACHE_TAG_TOWRITE;
	else
		tag = PAGECACHE_TAG_DIRTY;
retry:
	if (wbc->sync_mode == WB_SYNC_ALL)
		tag_pages_for_writeback(mapping, index, end);
	while (!done && !nr_to_write_done && (index <= end) &&
	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
		unsigned i;

		scanned = 1;
		for (i = 0; i < nr_pages; i++) {
			struct page *page = pvec.pages[i];

			if (!PagePrivate(page))
				continue;

			if (!wbc->range_cyclic && page->index > end) {
				done = 1;
				break;
			}

			spin_lock(&mapping->private_lock);
			if (!PagePrivate(page)) {
				spin_unlock(&mapping->private_lock);
				continue;
			}

			eb = (struct extent_buffer *)page->private;

			/*
			 * Shouldn't happen and normally this would be a BUG_ON
			 * but no sense in crashing the users box for something
			 * we can survive anyway.
			 */
			if (!eb) {
				spin_unlock(&mapping->private_lock);
				WARN_ON(1);
				continue;
			}

			if (eb == prev_eb) {
				spin_unlock(&mapping->private_lock);
				continue;
			}

			ret = atomic_inc_not_zero(&eb->refs);
			spin_unlock(&mapping->private_lock);
			if (!ret)
				continue;

			prev_eb = eb;
			ret = lock_extent_buffer_for_io(eb, fs_info, &epd);
			if (!ret) {
				free_extent_buffer(eb);
				continue;
			}

			ret = write_one_eb(eb, fs_info, wbc, &epd);
			if (ret) {
				done = 1;
				free_extent_buffer(eb);
				break;
			}
			free_extent_buffer(eb);

			/*
			 * the filesystem may choose to bump up nr_to_write.
			 * We have to make sure to honor the new nr_to_write
			 * at any time
			 */
			nr_to_write_done = wbc->nr_to_write <= 0;
		}
		pagevec_release(&pvec);
		cond_resched();
	}
	if (!scanned && !done) {
		/*
		 * We hit the last page and there is more work to be done: wrap
		 * back to the start of the file
		 */
		scanned = 1;
		index = 0;
		goto retry;
	}
	flush_write_bio(&epd);
	return ret;
}

/**
 * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
 * @mapping: address space structure to write
 * @wbc: subtract the number of written pages from *@wbc->nr_to_write
 * @writepage: function called for each page
 * @data: data passed to writepage function
 *
 * If a page is already under I/O, write_cache_pages() skips it, even
 * if it's dirty. This is desirable behaviour for memory-cleaning writeback,
 * but it is INCORRECT for data-integrity system calls such as fsync(). fsync()
 * and msync() need to guarantee that all the data which was dirty at the time
 * the call was made get new I/O started against them. If wbc->sync_mode is
 * WB_SYNC_ALL then we were called for data integrity and we must wait for
 * existing IO to complete.
 */
static int extent_write_cache_pages(struct extent_io_tree *tree,
			     struct address_space *mapping,
			     struct writeback_control *wbc,
			     writepage_t writepage, void *data,
			     void (*flush_fn)(void *))
{
	struct inode *inode = mapping->host;
	int ret = 0;
	int done = 0;
	int nr_to_write_done = 0;
	struct pagevec pvec;
	int nr_pages;
	pgoff_t index;
	pgoff_t end;		/* Inclusive */
	int scanned = 0;
	int tag;

	/*
	 * We have to hold onto the inode so that ordered extents can do their
	 * work when the IO finishes. The alternative to this is failing to add
	 * an ordered extent if the igrab() fails there and that is a huge pain
	 * to deal with, so instead just hold onto the inode throughout the
	 * writepages operation. If it fails here we are freeing up the inode
	 * anyway and we'd rather not waste our time writing out stuff that is
	 * going to be truncated anyway.
	 */
	if (!igrab(inode))
		return 0;

	pagevec_init(&pvec, 0);
	if (wbc->range_cyclic) {
		index = mapping->writeback_index; /* Start from prev offset */
		end = -1;
	} else {
		index = wbc->range_start >> PAGE_CACHE_SHIFT;
		end = wbc->range_end >> PAGE_CACHE_SHIFT;
		scanned = 1;
	}
	if (wbc->sync_mode == WB_SYNC_ALL)
		tag = PAGECACHE_TAG_TOWRITE;
	else
		tag = PAGECACHE_TAG_DIRTY;
retry:
	if (wbc->sync_mode == WB_SYNC_ALL)
		tag_pages_for_writeback(mapping, index, end);
	while (!done && !nr_to_write_done && (index <= end) &&
	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
		unsigned i;

		scanned = 1;
		for (i = 0; i < nr_pages; i++) {
			struct page *page = pvec.pages[i];

			/*
			 * At this point we hold neither mapping->tree_lock nor
			 * lock on the page itself: the page may be truncated or
			 * invalidated (changing page->mapping to NULL), or even
			 * swizzled back from swapper_space to tmpfs file
			 * mapping
			 */
			if (!trylock_page(page)) {
				flush_fn(data);
				lock_page(page);
			}

			if (unlikely(page->mapping != mapping)) {
				unlock_page(page);
				continue;
			}

			if (!wbc->range_cyclic && page->index > end) {
				done = 1;
				unlock_page(page);
				continue;
			}

			if (wbc->sync_mode != WB_SYNC_NONE) {
				if (PageWriteback(page))
					flush_fn(data);
				wait_on_page_writeback(page);
			}

			if (PageWriteback(page) ||
			    !clear_page_dirty_for_io(page)) {
				unlock_page(page);
				continue;
			}

			ret = (*writepage)(page, wbc, data);

			if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
				unlock_page(page);
				ret = 0;
			}
			if (ret)
				done = 1;

			/*
			 * the filesystem may choose to bump up nr_to_write.
			 * We have to make sure to honor the new nr_to_write
			 * at any time
			 */
			nr_to_write_done = wbc->nr_to_write <= 0;
		}
		pagevec_release(&pvec);
		cond_resched();
	}
	if (!scanned && !done) {
		/*
		 * We hit the last page and there is more work to be done: wrap
		 * back to the start of the file
		 */
		scanned = 1;
		index = 0;
		goto retry;
	}
	btrfs_add_delayed_iput(inode);
	return ret;
}

static void flush_epd_write_bio(struct extent_page_data *epd)
{
	if (epd->bio) {
		int rw = WRITE;
		int ret;

		if (epd->sync_io)
			rw = WRITE_SYNC;

		ret = submit_one_bio(rw, epd->bio, 0, epd->bio_flags);
		BUG_ON(ret < 0); /* -ENOMEM */
		epd->bio = NULL;
	}
}

static noinline void flush_write_bio(void *data)
{
	struct extent_page_data *epd = data;
	flush_epd_write_bio(epd);
}

int extent_write_full_page(struct extent_io_tree *tree, struct page *page,
			  get_extent_t *get_extent,
			  struct writeback_control *wbc)
{
	int ret;
	struct extent_page_data epd = {
		.bio = NULL,
		.tree = tree,
		.get_extent = get_extent,
		.extent_locked = 0,
		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
		.bio_flags = 0,
	};

	ret = __extent_writepage(page, wbc, &epd);

	flush_epd_write_bio(&epd);
	return ret;
}

int extent_write_locked_range(struct extent_io_tree *tree, struct inode *inode,
			      u64 start, u64 end, get_extent_t *get_extent,
			      int mode)
{
	int ret = 0;
	struct address_space *mapping = inode->i_mapping;
	struct page *page;
	unsigned long nr_pages = (end - start + PAGE_CACHE_SIZE) >>
		PAGE_CACHE_SHIFT;

	struct extent_page_data epd = {
		.bio = NULL,
		.tree = tree,
		.get_extent = get_extent,
		.extent_locked = 1,
		.sync_io = mode == WB_SYNC_ALL,
		.bio_flags = 0,
	};
	struct writeback_control wbc_writepages = {
		.sync_mode = mode,
		.nr_to_write = nr_pages * 2,
		.range_start = start,
		.range_end = end + 1,
	};

	while (start <= end) {
		page = find_get_page(mapping, start >> PAGE_CACHE_SHIFT);
		if (clear_page_dirty_for_io(page))
			ret = __extent_writepage(page, &wbc_writepages, &epd);
		else {
			if (tree->ops && tree->ops->writepage_end_io_hook)
				tree->ops->writepage_end_io_hook(page, start,
						 start + PAGE_CACHE_SIZE - 1,
						 NULL, 1);
			unlock_page(page);
		}
		page_cache_release(page);
		start += PAGE_CACHE_SIZE;
	}

	flush_epd_write_bio(&epd);
	return ret;
}

int extent_writepages(struct extent_io_tree *tree,
		      struct address_space *mapping,
		      get_extent_t *get_extent,
		      struct writeback_control *wbc)
{
	int ret = 0;
	struct extent_page_data epd = {
		.bio = NULL,
		.tree = tree,
		.get_extent = get_extent,
		.extent_locked = 0,
		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
		.bio_flags = 0,
	};

	ret = extent_write_cache_pages(tree, mapping, wbc,
				       __extent_writepage, &epd,
				       flush_write_bio);
	flush_epd_write_bio(&epd);
	return ret;
}

int extent_readpages(struct extent_io_tree *tree,
		     struct address_space *mapping,
		     struct list_head *pages, unsigned nr_pages,
		     get_extent_t get_extent)
{
	struct bio *bio = NULL;
	unsigned page_idx;
	unsigned long bio_flags = 0;
	struct page *pagepool[16];
	struct page *page;
	struct extent_map *em_cached = NULL;
	int nr = 0;

	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
		page = list_entry(pages->prev, struct page, lru);

		prefetchw(&page->flags);
		list_del(&page->lru);
		if (add_to_page_cache_lru(page, mapping,
					page->index, GFP_NOFS)) {
			page_cache_release(page);
			continue;
		}

		pagepool[nr++] = page;
		if (nr < ARRAY_SIZE(pagepool))
			continue;
		__extent_readpages(tree, pagepool, nr, get_extent, &em_cached,
				   &bio, 0, &bio_flags, READ);
		nr = 0;
	}
	if (nr)
		__extent_readpages(tree, pagepool, nr, get_extent, &em_cached,
				   &bio, 0, &bio_flags, READ);

	if (em_cached)
		free_extent_map(em_cached);

	BUG_ON(!list_empty(pages));
	if (bio)
		return submit_one_bio(READ, bio, 0, bio_flags);
	return 0;
}

/*
 * basic invalidatepage code, this waits on any locked or writeback
 * ranges corresponding to the page, and then deletes any extent state
 * records from the tree
 */
int extent_invalidatepage(struct extent_io_tree *tree,
			  struct page *page, unsigned long offset)
{
	struct extent_state *cached_state = NULL;
	u64 start = page_offset(page);
	u64 end = start + PAGE_CACHE_SIZE - 1;
	size_t blocksize = page->mapping->host->i_sb->s_blocksize;

	start += ALIGN(offset, blocksize);
	if (start > end)
		return 0;

	lock_extent_bits(tree, start, end, 0, &cached_state);
	wait_on_page_writeback(page);
	clear_extent_bit(tree, start, end,
			 EXTENT_LOCKED | EXTENT_DIRTY | EXTENT_DELALLOC |
			 EXTENT_DO_ACCOUNTING,
			 1, 1, &cached_state, GFP_NOFS);
	return 0;
}

/*
 * a helper for releasepage, this tests for areas of the page that
 * are locked or under IO and drops the related state bits if it is safe
 * to drop the page.
 */
static int try_release_extent_state(struct extent_map_tree *map,
				    struct extent_io_tree *tree,
				    struct page *page, gfp_t mask)
{
	u64 start = page_offset(page);
	u64 end = start + PAGE_CACHE_SIZE - 1;
	int ret = 1;

	if (test_range_bit(tree, start, end,
			   EXTENT_IOBITS, 0, NULL))
		ret = 0;
	else {
		if ((mask & GFP_NOFS) == GFP_NOFS)
			mask = GFP_NOFS;
		/*
		 * at this point we can safely clear everything except the
		 * locked bit and the nodatasum bit
		 */
		ret = clear_extent_bit(tree, start, end,
				 ~(EXTENT_LOCKED | EXTENT_NODATASUM),
				 0, 0, NULL, mask);

		/* if clear_extent_bit failed for enomem reasons,
		 * we can't allow the release to continue.
		 */
		if (ret < 0)
			ret = 0;
		else
			ret = 1;
	}
	return ret;
}

/*
 * a helper for releasepage. As long as there are no locked extents
 * in the range corresponding to the page, both state records and extent
 * map records are removed
 */
int try_release_extent_mapping(struct extent_map_tree *map,
			       struct extent_io_tree *tree, struct page *page,
			       gfp_t mask)
{
	struct extent_map *em;
	u64 start = page_offset(page);
	u64 end = start + PAGE_CACHE_SIZE - 1;

	if ((mask & __GFP_WAIT) &&
	    page->mapping->host->i_size > 16 * 1024 * 1024) {
		u64 len;
		while (start <= end) {
			len = end - start + 1;
			write_lock(&map->lock);
			em = lookup_extent_mapping(map, start, len);
			if (!em) {
				write_unlock(&map->lock);
				break;
			}
			if (test_bit(EXTENT_FLAG_PINNED, &em->flags) ||
			    em->start != start) {
				write_unlock(&map->lock);
				free_extent_map(em);
				break;
			}
			if (!test_range_bit(tree, em->start,
					    extent_map_end(em) - 1,
					    EXTENT_LOCKED | EXTENT_WRITEBACK,
					    0, NULL)) {
				remove_extent_mapping(map, em);
				/* once for the rb tree */
				free_extent_map(em);
			}
			start = extent_map_end(em);
			write_unlock(&map->lock);

			/* once for us */
			free_extent_map(em);
		}
	}
	return try_release_extent_state(map, tree, page, mask);
}

/*
 * helper function for fiemap, which doesn't want to see any holes.
 * This maps until we find something past 'last'
 */
static struct extent_map *get_extent_skip_holes(struct inode *inode,
						u64 offset,
						u64 last,
						get_extent_t *get_extent)
{
	u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
	struct extent_map *em;
	u64 len;

	if (offset >= last)
		return NULL;

	while(1) {
		len = last - offset;
		if (len == 0)
			break;
		len = ALIGN(len, sectorsize);
		em = get_extent(inode, NULL, 0, offset, len, 0);
		if (IS_ERR_OR_NULL(em))
			return em;

		/* if this isn't a hole return it */
		if (!test_bit(EXTENT_FLAG_VACANCY, &em->flags) &&
		    em->block_start != EXTENT_MAP_HOLE) {
			return em;
		}

		/* this is a hole, advance to the next extent */
		offset = extent_map_end(em);
		free_extent_map(em);
		if (offset >= last)
			break;
	}
	return NULL;
}

int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
		__u64 start, __u64 len, get_extent_t *get_extent)
{
	int ret = 0;
	u64 off = start;
	u64 max = start + len;
	u32 flags = 0;
	u32 found_type;
	u64 last;
	u64 last_for_get_extent = 0;
	u64 disko = 0;
	u64 isize = i_size_read(inode);
	struct btrfs_key found_key;
	struct extent_map *em = NULL;
	struct extent_state *cached_state = NULL;
	struct btrfs_path *path;
	struct btrfs_file_extent_item *item;
	int end = 0;
	u64 em_start = 0;
	u64 em_len = 0;
	u64 em_end = 0;
	unsigned long emflags;

	if (len == 0)
		return -EINVAL;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
	path->leave_spinning = 1;

	start = ALIGN(start, BTRFS_I(inode)->root->sectorsize);
	len = ALIGN(len, BTRFS_I(inode)->root->sectorsize);

	/*
	 * lookup the last file extent. We're not using i_size here
	 * because there might be preallocation past i_size
	 */
	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root,
				       path, btrfs_ino(inode), -1, 0);
	if (ret < 0) {
		btrfs_free_path(path);
		return ret;
	}
	WARN_ON(!ret);
	path->slots[0]--;
	item = btrfs_item_ptr(path->nodes[0], path->slots[0],
			      struct btrfs_file_extent_item);
	btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
	found_type = btrfs_key_type(&found_key);

	/* No extents, but there might be delalloc bits */
	if (found_key.objectid != btrfs_ino(inode) ||
	    found_type != BTRFS_EXTENT_DATA_KEY) {
		/* have to trust i_size as the end */
		last = (u64)-1;
		last_for_get_extent = isize;
	} else {
		/*
		 * remember the start of the last extent. There are a
		 * bunch of different factors that go into the length of the
		 * extent, so its much less complex to remember where it started
		 */
		last = found_key.offset;
		last_for_get_extent = last + 1;
	}
	btrfs_free_path(path);

	/*
	 * we might have some extents allocated but more delalloc past those
	 * extents. so, we trust isize unless the start of the last extent is
	 * beyond isize
	 */
	if (last < isize) {
		last = (u64)-1;
		last_for_get_extent = isize;
	}

	lock_extent_bits(&BTRFS_I(inode)->io_tree, start, start + len - 1, 0,
			 &cached_state);

	em = get_extent_skip_holes(inode, start, last_for_get_extent,
				   get_extent);
	if (!em)
		goto out;
	if (IS_ERR(em)) {
		ret = PTR_ERR(em);
		goto out;
	}

	while (!end) {
		u64 offset_in_extent = 0;

		/* break if the extent we found is outside the range */
		if (em->start >= max || extent_map_end(em) < off)
			break;

		/*
		 * get_extent may return an extent that starts before our
		 * requested range. We have to make sure the ranges
		 * we return to fiemap always move forward and don't
		 * overlap, so adjust the offsets here
		 */
		em_start = max(em->start, off);

		/*
		 * record the offset from the start of the extent
		 * for adjusting the disk offset below. Only do this if the
		 * extent isn't compressed since our in ram offset may be past
		 * what we have actually allocated on disk.
		 */
		if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
			offset_in_extent = em_start - em->start;
		em_end = extent_map_end(em);
		em_len = em_end - em_start;
		emflags = em->flags;
		disko = 0;
		flags = 0;

		/*
		 * bump off for our next call to get_extent
		 */
		off = extent_map_end(em);
		if (off >= max)
			end = 1;

		if (em->block_start == EXTENT_MAP_LAST_BYTE) {
			end = 1;
			flags |= FIEMAP_EXTENT_LAST;
		} else if (em->block_start == EXTENT_MAP_INLINE) {
			flags |= (FIEMAP_EXTENT_DATA_INLINE |
				  FIEMAP_EXTENT_NOT_ALIGNED);
		} else if (em->block_start == EXTENT_MAP_DELALLOC) {
			flags |= (FIEMAP_EXTENT_DELALLOC |
				  FIEMAP_EXTENT_UNKNOWN);
		} else {
			disko = em->block_start + offset_in_extent;
		}
		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
			flags |= FIEMAP_EXTENT_ENCODED;

		free_extent_map(em);
		em = NULL;
		if ((em_start >= last) || em_len == (u64)-1 ||
		   (last == (u64)-1 && isize <= em_end)) {
			flags |= FIEMAP_EXTENT_LAST;
			end = 1;
		}

		/* now scan forward to see if this is really the last extent. */
		em = get_extent_skip_holes(inode, off, last_for_get_extent,
					   get_extent);
		if (IS_ERR(em)) {
			ret = PTR_ERR(em);
			goto out;
		}
		if (!em) {
			flags |= FIEMAP_EXTENT_LAST;
			end = 1;
		}
		ret = fiemap_fill_next_extent(fieinfo, em_start, disko,
					      em_len, flags);
		if (ret)
			goto out_free;
	}
out_free:
	free_extent_map(em);
out:
	unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, start + len - 1,
			     &cached_state, GFP_NOFS);
	return ret;
}

static void __free_extent_buffer(struct extent_buffer *eb)
{
	btrfs_leak_debug_del(&eb->leak_list);
	kmem_cache_free(extent_buffer_cache, eb);
}

static int extent_buffer_under_io(struct extent_buffer *eb)
{
	return (atomic_read(&eb->io_pages) ||
		test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
		test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
}

/*
 * Helper for releasing extent buffer page.
 */
static void btrfs_release_extent_buffer_page(struct extent_buffer *eb,
						unsigned long start_idx)
{
	unsigned long index;
	unsigned long num_pages;
	struct page *page;
	int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);

	BUG_ON(extent_buffer_under_io(eb));

	num_pages = num_extent_pages(eb->start, eb->len);
	index = start_idx + num_pages;
	if (start_idx >= index)
		return;

	do {
		index--;
		page = extent_buffer_page(eb, index);
		if (page && mapped) {
			spin_lock(&page->mapping->private_lock);
			/*
			 * We do this since we'll remove the pages after we've
			 * removed the eb from the radix tree, so we could race
			 * and have this page now attached to the new eb. So
			 * only clear page_private if it's still connected to
			 * this eb.
			 */
			if (PagePrivate(page) &&
			    page->private == (unsigned long)eb) {
				BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
				BUG_ON(PageDirty(page));
				BUG_ON(PageWriteback(page));
				/*
				 * We need to make sure we haven't be attached
				 * to a new eb.
				 */
				ClearPagePrivate(page);
				set_page_private(page, 0);
				/* One for the page private */
				page_cache_release(page);
			}
			spin_unlock(&page->mapping->private_lock);

		}
		if (page) {
			/* One for when we alloced the page */
			page_cache_release(page);
		}
	} while (index != start_idx);
}

/*
 * Helper for releasing the extent buffer.
 */
static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
{
	btrfs_release_extent_buffer_page(eb, 0);
	__free_extent_buffer(eb);
}

static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
						   u64 start,
						   unsigned long len,
						   gfp_t mask)
{
	struct extent_buffer *eb = NULL;

	eb = kmem_cache_zalloc(extent_buffer_cache, mask);
	if (eb == NULL)
		return NULL;
	eb->start = start;
	eb->len = len;
	eb->tree = tree;
	eb->bflags = 0;
	rwlock_init(&eb->lock);
	atomic_set(&eb->write_locks, 0);
	atomic_set(&eb->read_locks, 0);
	atomic_set(&eb->blocking_readers, 0);
	atomic_set(&eb->blocking_writers, 0);
	atomic_set(&eb->spinning_readers, 0);
	atomic_set(&eb->spinning_writers, 0);
	eb->lock_nested = 0;
	init_waitqueue_head(&eb->write_lock_wq);
	init_waitqueue_head(&eb->read_lock_wq);

	btrfs_leak_debug_add(&eb->leak_list, &buffers);

	spin_lock_init(&eb->refs_lock);
	atomic_set(&eb->refs, 1);
	atomic_set(&eb->io_pages, 0);

	/*
	 * Sanity checks, currently the maximum is 64k covered by 16x 4k pages
	 */
	BUILD_BUG_ON(BTRFS_MAX_METADATA_BLOCKSIZE
		> MAX_INLINE_EXTENT_BUFFER_SIZE);
	BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE);

	return eb;
}

struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
{
	unsigned long i;
	struct page *p;
	struct extent_buffer *new;
	unsigned long num_pages = num_extent_pages(src->start, src->len);

	new = __alloc_extent_buffer(NULL, src->start, src->len, GFP_NOFS);
	if (new == NULL)
		return NULL;

	for (i = 0; i < num_pages; i++) {
		p = alloc_page(GFP_NOFS);
		if (!p) {
			btrfs_release_extent_buffer(new);
			return NULL;
4375 | } | 4375 | } |
4376 | attach_extent_buffer_page(new, p); | 4376 | attach_extent_buffer_page(new, p); |
4377 | WARN_ON(PageDirty(p)); | 4377 | WARN_ON(PageDirty(p)); |
4378 | SetPageUptodate(p); | 4378 | SetPageUptodate(p); |
4379 | new->pages[i] = p; | 4379 | new->pages[i] = p; |
4380 | } | 4380 | } |
4381 | 4381 | ||
4382 | copy_extent_buffer(new, src, 0, 0, src->len); | 4382 | copy_extent_buffer(new, src, 0, 0, src->len); |
4383 | set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags); | 4383 | set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags); |
4384 | set_bit(EXTENT_BUFFER_DUMMY, &new->bflags); | 4384 | set_bit(EXTENT_BUFFER_DUMMY, &new->bflags); |
4385 | 4385 | ||
4386 | return new; | 4386 | return new; |
4387 | } | 4387 | } |
4388 | 4388 | ||
4389 | struct extent_buffer *alloc_dummy_extent_buffer(u64 start, unsigned long len) | 4389 | struct extent_buffer *alloc_dummy_extent_buffer(u64 start, unsigned long len) |
4390 | { | 4390 | { |
4391 | struct extent_buffer *eb; | 4391 | struct extent_buffer *eb; |
4392 | unsigned long num_pages = num_extent_pages(0, len); | 4392 | unsigned long num_pages = num_extent_pages(0, len); |
4393 | unsigned long i; | 4393 | unsigned long i; |
4394 | 4394 | ||
4395 | eb = __alloc_extent_buffer(NULL, start, len, GFP_NOFS); | 4395 | eb = __alloc_extent_buffer(NULL, start, len, GFP_NOFS); |
4396 | if (!eb) | 4396 | if (!eb) |
4397 | return NULL; | 4397 | return NULL; |
4398 | 4398 | ||
4399 | for (i = 0; i < num_pages; i++) { | 4399 | for (i = 0; i < num_pages; i++) { |
4400 | eb->pages[i] = alloc_page(GFP_NOFS); | 4400 | eb->pages[i] = alloc_page(GFP_NOFS); |
4401 | if (!eb->pages[i]) | 4401 | if (!eb->pages[i]) |
4402 | goto err; | 4402 | goto err; |
4403 | } | 4403 | } |
4404 | set_extent_buffer_uptodate(eb); | 4404 | set_extent_buffer_uptodate(eb); |
4405 | btrfs_set_header_nritems(eb, 0); | 4405 | btrfs_set_header_nritems(eb, 0); |
4406 | set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags); | 4406 | set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags); |
4407 | 4407 | ||
4408 | return eb; | 4408 | return eb; |
4409 | err: | 4409 | err: |
4410 | for (; i > 0; i--) | 4410 | for (; i > 0; i--) |
4411 | __free_page(eb->pages[i - 1]); | 4411 | __free_page(eb->pages[i - 1]); |
4412 | __free_extent_buffer(eb); | 4412 | __free_extent_buffer(eb); |
4413 | return NULL; | 4413 | return NULL; |
4414 | } | 4414 | } |
4415 | 4415 | ||
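
The err: unwind in alloc_dummy_extent_buffer() frees only the pages that were actually obtained, walking the index back down before dropping the buffer itself. A minimal user-space sketch of that partial-allocation rollback pattern, with plain malloc()/free() standing in for the kernel page allocator (nothing below is btrfs API):

#include <stdio.h>
#include <stdlib.h>

#define NUM_PAGES 16

/* Allocate n "pages"; on failure release only what was obtained so far. */
static void **alloc_pages_or_rollback(int n)
{
        void **pages = calloc(n, sizeof(*pages));
        int i;

        if (!pages)
                return NULL;
        for (i = 0; i < n; i++) {
                pages[i] = malloc(4096);
                if (!pages[i])
                        goto err;
        }
        return pages;

err:
        /* Mirror the eb error path: unwind in reverse, then drop the array. */
        for (; i > 0; i--)
                free(pages[i - 1]);
        free(pages);
        return NULL;
}

int main(void)
{
        void **pages = alloc_pages_or_rollback(NUM_PAGES);
        int i;

        if (!pages)
                return 1;
        for (i = 0; i < NUM_PAGES; i++)
                free(pages[i]);
        free(pages);
        puts("allocated and released all pages");
        return 0;
}
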
4416 | static void check_buffer_tree_ref(struct extent_buffer *eb) | 4416 | static void check_buffer_tree_ref(struct extent_buffer *eb) |
4417 | { | 4417 | { |
4418 | int refs; | 4418 | int refs; |
4419 | /* the ref bit is tricky. We have to make sure it is set | 4419 | /* the ref bit is tricky. We have to make sure it is set |
4420 | * if we have the buffer dirty. Otherwise the | 4420 | * if we have the buffer dirty. Otherwise the |
4421 | * code to free a buffer can end up dropping a dirty | 4421 | * code to free a buffer can end up dropping a dirty |
4422 | * page | 4422 | * page |
4423 | * | 4423 | * |
4424 | * Once the ref bit is set, it won't go away while the | 4424 | * Once the ref bit is set, it won't go away while the |
4425 | * buffer is dirty or in writeback, and it also won't | 4425 | * buffer is dirty or in writeback, and it also won't |
4426 | * go away while we have the reference count on the | 4426 | * go away while we have the reference count on the |
4427 | * eb bumped. | 4427 | * eb bumped. |
4428 | * | 4428 | * |
4429 | * We can't just set the ref bit without bumping the | 4429 | * We can't just set the ref bit without bumping the |
4430 | * ref on the eb because free_extent_buffer might | 4430 | * ref on the eb because free_extent_buffer might |
4431 | * see the ref bit and try to clear it. If this happens | 4431 | * see the ref bit and try to clear it. If this happens |
4432 | * free_extent_buffer might end up dropping our original | 4432 | * free_extent_buffer might end up dropping our original |
4433 | * ref by mistake and freeing the page before we are able | 4433 | * ref by mistake and freeing the page before we are able |
4434 | * to add one more ref. | 4434 | * to add one more ref. |
4435 | * | 4435 | * |
4436 | * So bump the ref count first, then set the bit. If someone | 4436 | * So bump the ref count first, then set the bit. If someone |
4437 | * beat us to it, drop the ref we added. | 4437 | * beat us to it, drop the ref we added. |
4438 | */ | 4438 | */ |
4439 | refs = atomic_read(&eb->refs); | 4439 | refs = atomic_read(&eb->refs); |
4440 | if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) | 4440 | if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) |
4441 | return; | 4441 | return; |
4442 | 4442 | ||
4443 | spin_lock(&eb->refs_lock); | 4443 | spin_lock(&eb->refs_lock); |
4444 | if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) | 4444 | if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) |
4445 | atomic_inc(&eb->refs); | 4445 | atomic_inc(&eb->refs); |
4446 | spin_unlock(&eb->refs_lock); | 4446 | spin_unlock(&eb->refs_lock); |
4447 | } | 4447 | } |
4448 | 4448 | ||
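
The block comment in check_buffer_tree_ref() describes the ordering requirement: take the extra reference speculatively, then publish the TREE_REF flag, and give the reference back if another thread beat us to it, so free_extent_buffer() can never observe the flag while our reference is still unaccounted for. The user-space sketch below models that ordering with C11 atomics; it is purely illustrative (the kernel function actually serializes the flag update under refs_lock):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct buf {
        atomic_int refs;
        atomic_bool tree_ref;   /* stand-in for EXTENT_BUFFER_TREE_REF */
};

/* Bump the count first, then publish the flag; if someone else already
 * published it, drop the reference we speculatively took. */
static void take_tree_ref(struct buf *b)
{
        bool expected = false;

        atomic_fetch_add(&b->refs, 1);
        if (!atomic_compare_exchange_strong(&b->tree_ref, &expected, true))
                atomic_fetch_sub(&b->refs, 1);
}

int main(void)
{
        struct buf b;

        atomic_init(&b.refs, 1);
        atomic_init(&b.tree_ref, false);
        take_tree_ref(&b);
        take_tree_ref(&b);      /* second caller "loses", count stays at 2 */
        printf("refs=%d tree_ref=%d\n", atomic_load(&b.refs),
               (int)atomic_load(&b.tree_ref));
        return 0;
}
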
4449 | static void mark_extent_buffer_accessed(struct extent_buffer *eb) | 4449 | static void mark_extent_buffer_accessed(struct extent_buffer *eb, |
4450 | struct page *accessed) | ||
4450 | { | 4451 | { |
4451 | unsigned long num_pages, i; | 4452 | unsigned long num_pages, i; |
4452 | 4453 | ||
4453 | check_buffer_tree_ref(eb); | 4454 | check_buffer_tree_ref(eb); |
4454 | 4455 | ||
4455 | num_pages = num_extent_pages(eb->start, eb->len); | 4456 | num_pages = num_extent_pages(eb->start, eb->len); |
4456 | for (i = 0; i < num_pages; i++) { | 4457 | for (i = 0; i < num_pages; i++) { |
4457 | struct page *p = extent_buffer_page(eb, i); | 4458 | struct page *p = extent_buffer_page(eb, i); |
4458 | mark_page_accessed(p); | 4459 | if (p != accessed) |
4460 | mark_page_accessed(p); | ||
4459 | } | 4461 | } |
4460 | } | 4462 | } |
4461 | 4463 | ||
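
With the new accessed argument, mark_extent_buffer_accessed() skips the one page its caller just got back from the page cache, since the lookup path already took care of that page's accessed state; only the buffer's remaining pages pay for an atomic mark_page_accessed(). A stand-alone user-space model of that skip follows; fake_page, fake_mark_page_accessed() and the op counter are invented for illustration and are not VM or btrfs APIs:

#include <stdio.h>

#define NUM_PAGES 4

struct fake_page { int referenced; };

static int atomic_ops;

/* Hypothetical stand-in for mark_page_accessed(): one "atomic" op per call. */
static void fake_mark_page_accessed(struct fake_page *p)
{
        p->referenced = 1;
        atomic_ops++;
}

/* Mirrors the new signature: skip the page the caller already accessed. */
static void mark_buffer_accessed(struct fake_page *pages, int n,
                                 struct fake_page *accessed)
{
        int i;

        for (i = 0; i < n; i++)
                if (&pages[i] != accessed)
                        fake_mark_page_accessed(&pages[i]);
}

int main(void)
{
        struct fake_page pages[NUM_PAGES] = { {0} };

        /* Page 0 came back from the cache lookup, already referenced. */
        pages[0].referenced = 1;
        mark_buffer_accessed(pages, NUM_PAGES, &pages[0]);
        printf("atomic ops performed: %d\n", atomic_ops); /* 3, not 4 */
        return 0;
}

On a four-page buffer the helper now touches three pages instead of four, which is exactly the saving the caller gets when the looked-up page was already initialized as accessed.
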
4462 | struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree, | 4464 | struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree, |
4463 | u64 start, unsigned long len) | 4465 | u64 start, unsigned long len) |
4464 | { | 4466 | { |
4465 | unsigned long num_pages = num_extent_pages(start, len); | 4467 | unsigned long num_pages = num_extent_pages(start, len); |
4466 | unsigned long i; | 4468 | unsigned long i; |
4467 | unsigned long index = start >> PAGE_CACHE_SHIFT; | 4469 | unsigned long index = start >> PAGE_CACHE_SHIFT; |
4468 | struct extent_buffer *eb; | 4470 | struct extent_buffer *eb; |
4469 | struct extent_buffer *exists = NULL; | 4471 | struct extent_buffer *exists = NULL; |
4470 | struct page *p; | 4472 | struct page *p; |
4471 | struct address_space *mapping = tree->mapping; | 4473 | struct address_space *mapping = tree->mapping; |
4472 | int uptodate = 1; | 4474 | int uptodate = 1; |
4473 | int ret; | 4475 | int ret; |
4474 | 4476 | ||
4475 | rcu_read_lock(); | 4477 | rcu_read_lock(); |
4476 | eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT); | 4478 | eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT); |
4477 | if (eb && atomic_inc_not_zero(&eb->refs)) { | 4479 | if (eb && atomic_inc_not_zero(&eb->refs)) { |
4478 | rcu_read_unlock(); | 4480 | rcu_read_unlock(); |
4479 | mark_extent_buffer_accessed(eb); | 4481 | mark_extent_buffer_accessed(eb, NULL); |
4480 | return eb; | 4482 | return eb; |
4481 | } | 4483 | } |
4482 | rcu_read_unlock(); | 4484 | rcu_read_unlock(); |
4483 | 4485 | ||
4484 | eb = __alloc_extent_buffer(tree, start, len, GFP_NOFS); | 4486 | eb = __alloc_extent_buffer(tree, start, len, GFP_NOFS); |
4485 | if (!eb) | 4487 | if (!eb) |
4486 | return NULL; | 4488 | return NULL; |
4487 | 4489 | ||
4488 | for (i = 0; i < num_pages; i++, index++) { | 4490 | for (i = 0; i < num_pages; i++, index++) { |
4489 | p = find_or_create_page(mapping, index, GFP_NOFS); | 4491 | p = find_or_create_page(mapping, index, GFP_NOFS); |
4490 | if (!p) | 4492 | if (!p) |
4491 | goto free_eb; | 4493 | goto free_eb; |
4492 | 4494 | ||
4493 | spin_lock(&mapping->private_lock); | 4495 | spin_lock(&mapping->private_lock); |
4494 | if (PagePrivate(p)) { | 4496 | if (PagePrivate(p)) { |
4495 | /* | 4497 | /* |
4496 | * We could have already allocated an eb for this page | 4498 | * We could have already allocated an eb for this page |
4497 | * and attached one, so let's see if we can get a ref on | 4499 | * and attached one, so let's see if we can get a ref on |
4498 | * the existing eb, and if we can we know it's good and | 4500 | * the existing eb, and if we can we know it's good and |
4499 | * we can just return that one, else we know we can just | 4501 | * we can just return that one, else we know we can just |
4500 | * overwrite page->private. | 4502 | * overwrite page->private. |
4501 | */ | 4503 | */ |
4502 | exists = (struct extent_buffer *)p->private; | 4504 | exists = (struct extent_buffer *)p->private; |
4503 | if (atomic_inc_not_zero(&exists->refs)) { | 4505 | if (atomic_inc_not_zero(&exists->refs)) { |
4504 | spin_unlock(&mapping->private_lock); | 4506 | spin_unlock(&mapping->private_lock); |
4505 | unlock_page(p); | 4507 | unlock_page(p); |
4506 | page_cache_release(p); | 4508 | page_cache_release(p); |
4507 | mark_extent_buffer_accessed(exists); | 4509 | mark_extent_buffer_accessed(exists, p); |
4508 | goto free_eb; | 4510 | goto free_eb; |
4509 | } | 4511 | } |
4510 | 4512 | ||
4511 | /* | 4513 | /* |
4512 | * Do this so attach doesn't complain and we need to | 4514 | * Do this so attach doesn't complain and we need to |
4513 | * drop the ref the old guy had. | 4515 | * drop the ref the old guy had. |
4514 | */ | 4516 | */ |
4515 | ClearPagePrivate(p); | 4517 | ClearPagePrivate(p); |
4516 | WARN_ON(PageDirty(p)); | 4518 | WARN_ON(PageDirty(p)); |
4517 | page_cache_release(p); | 4519 | page_cache_release(p); |
4518 | } | 4520 | } |
4519 | attach_extent_buffer_page(eb, p); | 4521 | attach_extent_buffer_page(eb, p); |
4520 | spin_unlock(&mapping->private_lock); | 4522 | spin_unlock(&mapping->private_lock); |
4521 | WARN_ON(PageDirty(p)); | 4523 | WARN_ON(PageDirty(p)); |
4522 | mark_page_accessed(p); | ||
4523 | eb->pages[i] = p; | 4524 | eb->pages[i] = p; |
4524 | if (!PageUptodate(p)) | 4525 | if (!PageUptodate(p)) |
4525 | uptodate = 0; | 4526 | uptodate = 0; |
4526 | 4527 | ||
4527 | /* | 4528 | /* |
4528 | * see below about how we avoid a nasty race with release page | 4529 | * see below about how we avoid a nasty race with release page |
4529 | * and why we unlock later | 4530 | * and why we unlock later |
4530 | */ | 4531 | */ |
4531 | } | 4532 | } |
4532 | if (uptodate) | 4533 | if (uptodate) |
4533 | set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); | 4534 | set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); |
4534 | again: | 4535 | again: |
4535 | ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); | 4536 | ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); |
4536 | if (ret) | 4537 | if (ret) |
4537 | goto free_eb; | 4538 | goto free_eb; |
4538 | 4539 | ||
4539 | spin_lock(&tree->buffer_lock); | 4540 | spin_lock(&tree->buffer_lock); |
4540 | ret = radix_tree_insert(&tree->buffer, start >> PAGE_CACHE_SHIFT, eb); | 4541 | ret = radix_tree_insert(&tree->buffer, start >> PAGE_CACHE_SHIFT, eb); |
4541 | if (ret == -EEXIST) { | 4542 | if (ret == -EEXIST) { |
4542 | exists = radix_tree_lookup(&tree->buffer, | 4543 | exists = radix_tree_lookup(&tree->buffer, |
4543 | start >> PAGE_CACHE_SHIFT); | 4544 | start >> PAGE_CACHE_SHIFT); |
4544 | if (!atomic_inc_not_zero(&exists->refs)) { | 4545 | if (!atomic_inc_not_zero(&exists->refs)) { |
4545 | spin_unlock(&tree->buffer_lock); | 4546 | spin_unlock(&tree->buffer_lock); |
4546 | radix_tree_preload_end(); | 4547 | radix_tree_preload_end(); |
4547 | exists = NULL; | 4548 | exists = NULL; |
4548 | goto again; | 4549 | goto again; |
4549 | } | 4550 | } |
4550 | spin_unlock(&tree->buffer_lock); | 4551 | spin_unlock(&tree->buffer_lock); |
4551 | radix_tree_preload_end(); | 4552 | radix_tree_preload_end(); |
4552 | mark_extent_buffer_accessed(exists); | 4553 | mark_extent_buffer_accessed(exists, NULL); |
4553 | goto free_eb; | 4554 | goto free_eb; |
4554 | } | 4555 | } |
4555 | /* add one reference for the tree */ | 4556 | /* add one reference for the tree */ |
4556 | check_buffer_tree_ref(eb); | 4557 | check_buffer_tree_ref(eb); |
4557 | spin_unlock(&tree->buffer_lock); | 4558 | spin_unlock(&tree->buffer_lock); |
4558 | radix_tree_preload_end(); | 4559 | radix_tree_preload_end(); |
4559 | 4560 | ||
4560 | /* | 4561 | /* |
4561 | * there is a race where release page may have | 4562 | * there is a race where release page may have |
4562 | * tried to find this extent buffer in the radix | 4563 | * tried to find this extent buffer in the radix |
4563 | * but failed. It will tell the VM it is safe to | 4564 | * but failed. It will tell the VM it is safe to |
4564 | * reclaim the page, and it will clear the page private bit. | 4565 | * reclaim the page, and it will clear the page private bit. |
4565 | * We must make sure to set the page private bit properly | 4566 | * We must make sure to set the page private bit properly |
4566 | * after the extent buffer is in the radix tree so | 4567 | * after the extent buffer is in the radix tree so |
4567 | * it doesn't get lost | 4568 | * it doesn't get lost |
4568 | */ | 4569 | */ |
4569 | SetPageChecked(eb->pages[0]); | 4570 | SetPageChecked(eb->pages[0]); |
4570 | for (i = 1; i < num_pages; i++) { | 4571 | for (i = 1; i < num_pages; i++) { |
4571 | p = extent_buffer_page(eb, i); | 4572 | p = extent_buffer_page(eb, i); |
4572 | ClearPageChecked(p); | 4573 | ClearPageChecked(p); |
4573 | unlock_page(p); | 4574 | unlock_page(p); |
4574 | } | 4575 | } |
4575 | unlock_page(eb->pages[0]); | 4576 | unlock_page(eb->pages[0]); |
4576 | return eb; | 4577 | return eb; |
4577 | 4578 | ||
4578 | free_eb: | 4579 | free_eb: |
4579 | for (i = 0; i < num_pages; i++) { | 4580 | for (i = 0; i < num_pages; i++) { |
4580 | if (eb->pages[i]) | 4581 | if (eb->pages[i]) |
4581 | unlock_page(eb->pages[i]); | 4582 | unlock_page(eb->pages[i]); |
4582 | } | 4583 | } |
4583 | 4584 | ||
4584 | WARN_ON(!atomic_dec_and_test(&eb->refs)); | 4585 | WARN_ON(!atomic_dec_and_test(&eb->refs)); |
4585 | btrfs_release_extent_buffer(eb); | 4586 | btrfs_release_extent_buffer(eb); |
4586 | return exists; | 4587 | return exists; |
4587 | } | 4588 | } |
4588 | 4589 | ||
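
alloc_extent_buffer() first tries an RCU lookup, then builds a new buffer, and only then inserts it into the radix tree; if the insert returns -EEXIST it takes a reference on whichever buffer won the race (retrying when that buffer is already on its way out) and throws its own copy away through free_eb. The toy user-space model below captures that insert-or-reuse shape with a single mutex-protected slot standing in for the radix tree; struct obj and insert_or_reuse() are made up for the sketch:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct obj { int refs; };

struct slot {
        pthread_mutex_t lock;
        struct obj *cur;
};

/* Publish ours, or take a reference on whoever won the race.  Mirrors the
 * -EEXIST handling above: if the existing object is already being torn down
 * (refs dropped to zero), go back and try the insert again. */
static struct obj *insert_or_reuse(struct slot *s, struct obj *ours)
{
again:
        pthread_mutex_lock(&s->lock);
        if (!s->cur) {
                ours->refs++;           /* extra reference held by the index */
                s->cur = ours;
                pthread_mutex_unlock(&s->lock);
                return ours;
        }
        if (s->cur->refs == 0) {        /* existing one is mid-teardown */
                pthread_mutex_unlock(&s->lock);
                goto again;             /* never taken in this 1-thread demo */
        }
        s->cur->refs++;                 /* reuse the object already indexed */
        pthread_mutex_unlock(&s->lock);
        free(ours);                     /* our freshly built copy loses */
        return s->cur;
}

int main(void)
{
        struct slot s = { PTHREAD_MUTEX_INITIALIZER, NULL };
        struct obj *a = calloc(1, sizeof(*a));
        struct obj *b = calloc(1, sizeof(*b));

        if (!a || !b)
                return 1;
        a->refs = b->refs = 1;          /* allocator's own reference */
        struct obj *first = insert_or_reuse(&s, a);
        struct obj *second = insert_or_reuse(&s, b);
        printf("reused existing: %s, refs=%d\n",
               first == second ? "yes" : "no", second->refs);
        free(first);
        return 0;
}
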
4589 | struct extent_buffer *find_extent_buffer(struct extent_io_tree *tree, | 4590 | struct extent_buffer *find_extent_buffer(struct extent_io_tree *tree, |
4590 | u64 start, unsigned long len) | 4591 | u64 start, unsigned long len) |
4591 | { | 4592 | { |
4592 | struct extent_buffer *eb; | 4593 | struct extent_buffer *eb; |
4593 | 4594 | ||
4594 | rcu_read_lock(); | 4595 | rcu_read_lock(); |
4595 | eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT); | 4596 | eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT); |
4596 | if (eb && atomic_inc_not_zero(&eb->refs)) { | 4597 | if (eb && atomic_inc_not_zero(&eb->refs)) { |
4597 | rcu_read_unlock(); | 4598 | rcu_read_unlock(); |
4598 | mark_extent_buffer_accessed(eb); | 4599 | mark_extent_buffer_accessed(eb, NULL); |
4599 | return eb; | 4600 | return eb; |
4600 | } | 4601 | } |
4601 | rcu_read_unlock(); | 4602 | rcu_read_unlock(); |
4602 | 4603 | ||
4603 | return NULL; | 4604 | return NULL; |
4604 | } | 4605 | } |
4605 | 4606 | ||
4606 | static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head) | 4607 | static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head) |
4607 | { | 4608 | { |
4608 | struct extent_buffer *eb = | 4609 | struct extent_buffer *eb = |
4609 | container_of(head, struct extent_buffer, rcu_head); | 4610 | container_of(head, struct extent_buffer, rcu_head); |
4610 | 4611 | ||
4611 | __free_extent_buffer(eb); | 4612 | __free_extent_buffer(eb); |
4612 | } | 4613 | } |
4613 | 4614 | ||
4614 | /* Expects to have eb->eb_lock already held */ | 4615 | /* Expects to have eb->eb_lock already held */ |
4615 | static int release_extent_buffer(struct extent_buffer *eb) | 4616 | static int release_extent_buffer(struct extent_buffer *eb) |
4616 | { | 4617 | { |
4617 | WARN_ON(atomic_read(&eb->refs) == 0); | 4618 | WARN_ON(atomic_read(&eb->refs) == 0); |
4618 | if (atomic_dec_and_test(&eb->refs)) { | 4619 | if (atomic_dec_and_test(&eb->refs)) { |
4619 | if (test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags)) { | 4620 | if (test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags)) { |
4620 | spin_unlock(&eb->refs_lock); | 4621 | spin_unlock(&eb->refs_lock); |
4621 | } else { | 4622 | } else { |
4622 | struct extent_io_tree *tree = eb->tree; | 4623 | struct extent_io_tree *tree = eb->tree; |
4623 | 4624 | ||
4624 | spin_unlock(&eb->refs_lock); | 4625 | spin_unlock(&eb->refs_lock); |
4625 | 4626 | ||
4626 | spin_lock(&tree->buffer_lock); | 4627 | spin_lock(&tree->buffer_lock); |
4627 | radix_tree_delete(&tree->buffer, | 4628 | radix_tree_delete(&tree->buffer, |
4628 | eb->start >> PAGE_CACHE_SHIFT); | 4629 | eb->start >> PAGE_CACHE_SHIFT); |
4629 | spin_unlock(&tree->buffer_lock); | 4630 | spin_unlock(&tree->buffer_lock); |
4630 | } | 4631 | } |
4631 | 4632 | ||
4632 | /* Should be safe to release our pages at this point */ | 4633 | /* Should be safe to release our pages at this point */ |
4633 | btrfs_release_extent_buffer_page(eb, 0); | 4634 | btrfs_release_extent_buffer_page(eb, 0); |
4634 | call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu); | 4635 | call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu); |
4635 | return 1; | 4636 | return 1; |
4636 | } | 4637 | } |
4637 | spin_unlock(&eb->refs_lock); | 4638 | spin_unlock(&eb->refs_lock); |
4638 | 4639 | ||
4639 | return 0; | 4640 | return 0; |
4640 | } | 4641 | } |
4641 | 4642 | ||
4642 | void free_extent_buffer(struct extent_buffer *eb) | 4643 | void free_extent_buffer(struct extent_buffer *eb) |
4643 | { | 4644 | { |
4644 | int refs; | 4645 | int refs; |
4645 | int old; | 4646 | int old; |
4646 | if (!eb) | 4647 | if (!eb) |
4647 | return; | 4648 | return; |
4648 | 4649 | ||
4649 | while (1) { | 4650 | while (1) { |
4650 | refs = atomic_read(&eb->refs); | 4651 | refs = atomic_read(&eb->refs); |
4651 | if (refs <= 3) | 4652 | if (refs <= 3) |
4652 | break; | 4653 | break; |
4653 | old = atomic_cmpxchg(&eb->refs, refs, refs - 1); | 4654 | old = atomic_cmpxchg(&eb->refs, refs, refs - 1); |
4654 | if (old == refs) | 4655 | if (old == refs) |
4655 | return; | 4656 | return; |
4656 | } | 4657 | } |
4657 | 4658 | ||
4658 | spin_lock(&eb->refs_lock); | 4659 | spin_lock(&eb->refs_lock); |
4659 | if (atomic_read(&eb->refs) == 2 && | 4660 | if (atomic_read(&eb->refs) == 2 && |
4660 | test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags)) | 4661 | test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags)) |
4661 | atomic_dec(&eb->refs); | 4662 | atomic_dec(&eb->refs); |
4662 | 4663 | ||
4663 | if (atomic_read(&eb->refs) == 2 && | 4664 | if (atomic_read(&eb->refs) == 2 && |
4664 | test_bit(EXTENT_BUFFER_STALE, &eb->bflags) && | 4665 | test_bit(EXTENT_BUFFER_STALE, &eb->bflags) && |
4665 | !extent_buffer_under_io(eb) && | 4666 | !extent_buffer_under_io(eb) && |
4666 | test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) | 4667 | test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) |
4667 | atomic_dec(&eb->refs); | 4668 | atomic_dec(&eb->refs); |
4668 | 4669 | ||
4669 | /* | 4670 | /* |
4670 | * I know this is terrible, but it's temporary until we stop tracking | 4671 | * I know this is terrible, but it's temporary until we stop tracking |
4671 | * the uptodate bits and such for the extent buffers. | 4672 | * the uptodate bits and such for the extent buffers. |
4672 | */ | 4673 | */ |
4673 | release_extent_buffer(eb); | 4674 | release_extent_buffer(eb); |
4674 | } | 4675 | } |
4675 | 4676 | ||
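
free_extent_buffer() drops references on the cheap for as long as more than three remain: a cmpxchg loop decrements the count without ever taking refs_lock, and only the final few puts fall through to the locked path with the DUMMY/STALE checks. A stripped-down user-space version of that fast path using C11 atomics (the threshold and names are illustrative only):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Fast path: drop one reference with cmpxchg as long as plenty remain.
 * Returns true when the slow path (locking, final-release checks) is needed. */
static bool put_ref_fast(atomic_int *refs, int slow_path_threshold)
{
        int old = atomic_load(refs);

        while (old > slow_path_threshold) {
                if (atomic_compare_exchange_weak(refs, &old, old - 1))
                        return false;   /* dropped without taking any lock */
                /* old was reloaded by the failed CAS; loop and retry */
        }
        return true;
}

int main(void)
{
        atomic_int refs;

        atomic_init(&refs, 5);
        while (!put_ref_fast(&refs, 3))
                ;
        printf("refs=%d, slow path needed now\n", atomic_load(&refs));
        return 0;
}
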
4676 | void free_extent_buffer_stale(struct extent_buffer *eb) | 4677 | void free_extent_buffer_stale(struct extent_buffer *eb) |
4677 | { | 4678 | { |
4678 | if (!eb) | 4679 | if (!eb) |
4679 | return; | 4680 | return; |
4680 | 4681 | ||
4681 | spin_lock(&eb->refs_lock); | 4682 | spin_lock(&eb->refs_lock); |
4682 | set_bit(EXTENT_BUFFER_STALE, &eb->bflags); | 4683 | set_bit(EXTENT_BUFFER_STALE, &eb->bflags); |
4683 | 4684 | ||
4684 | if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) && | 4685 | if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) && |
4685 | test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) | 4686 | test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) |
4686 | atomic_dec(&eb->refs); | 4687 | atomic_dec(&eb->refs); |
4687 | release_extent_buffer(eb); | 4688 | release_extent_buffer(eb); |
4688 | } | 4689 | } |
4689 | 4690 | ||
4690 | void clear_extent_buffer_dirty(struct extent_buffer *eb) | 4691 | void clear_extent_buffer_dirty(struct extent_buffer *eb) |
4691 | { | 4692 | { |
4692 | unsigned long i; | 4693 | unsigned long i; |
4693 | unsigned long num_pages; | 4694 | unsigned long num_pages; |
4694 | struct page *page; | 4695 | struct page *page; |
4695 | 4696 | ||
4696 | num_pages = num_extent_pages(eb->start, eb->len); | 4697 | num_pages = num_extent_pages(eb->start, eb->len); |
4697 | 4698 | ||
4698 | for (i = 0; i < num_pages; i++) { | 4699 | for (i = 0; i < num_pages; i++) { |
4699 | page = extent_buffer_page(eb, i); | 4700 | page = extent_buffer_page(eb, i); |
4700 | if (!PageDirty(page)) | 4701 | if (!PageDirty(page)) |
4701 | continue; | 4702 | continue; |
4702 | 4703 | ||
4703 | lock_page(page); | 4704 | lock_page(page); |
4704 | WARN_ON(!PagePrivate(page)); | 4705 | WARN_ON(!PagePrivate(page)); |
4705 | 4706 | ||
4706 | clear_page_dirty_for_io(page); | 4707 | clear_page_dirty_for_io(page); |
4707 | spin_lock_irq(&page->mapping->tree_lock); | 4708 | spin_lock_irq(&page->mapping->tree_lock); |
4708 | if (!PageDirty(page)) { | 4709 | if (!PageDirty(page)) { |
4709 | radix_tree_tag_clear(&page->mapping->page_tree, | 4710 | radix_tree_tag_clear(&page->mapping->page_tree, |
4710 | page_index(page), | 4711 | page_index(page), |
4711 | PAGECACHE_TAG_DIRTY); | 4712 | PAGECACHE_TAG_DIRTY); |
4712 | } | 4713 | } |
4713 | spin_unlock_irq(&page->mapping->tree_lock); | 4714 | spin_unlock_irq(&page->mapping->tree_lock); |
4714 | ClearPageError(page); | 4715 | ClearPageError(page); |
4715 | unlock_page(page); | 4716 | unlock_page(page); |
4716 | } | 4717 | } |
4717 | WARN_ON(atomic_read(&eb->refs) == 0); | 4718 | WARN_ON(atomic_read(&eb->refs) == 0); |
4718 | } | 4719 | } |
4719 | 4720 | ||
4720 | int set_extent_buffer_dirty(struct extent_buffer *eb) | 4721 | int set_extent_buffer_dirty(struct extent_buffer *eb) |
4721 | { | 4722 | { |
4722 | unsigned long i; | 4723 | unsigned long i; |
4723 | unsigned long num_pages; | 4724 | unsigned long num_pages; |
4724 | int was_dirty = 0; | 4725 | int was_dirty = 0; |
4725 | 4726 | ||
4726 | check_buffer_tree_ref(eb); | 4727 | check_buffer_tree_ref(eb); |
4727 | 4728 | ||
4728 | was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags); | 4729 | was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags); |
4729 | 4730 | ||
4730 | num_pages = num_extent_pages(eb->start, eb->len); | 4731 | num_pages = num_extent_pages(eb->start, eb->len); |
4731 | WARN_ON(atomic_read(&eb->refs) == 0); | 4732 | WARN_ON(atomic_read(&eb->refs) == 0); |
4732 | WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)); | 4733 | WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)); |
4733 | 4734 | ||
4734 | for (i = 0; i < num_pages; i++) | 4735 | for (i = 0; i < num_pages; i++) |
4735 | set_page_dirty(extent_buffer_page(eb, i)); | 4736 | set_page_dirty(extent_buffer_page(eb, i)); |
4736 | return was_dirty; | 4737 | return was_dirty; |
4737 | } | 4738 | } |
4738 | 4739 | ||
4739 | int clear_extent_buffer_uptodate(struct extent_buffer *eb) | 4740 | int clear_extent_buffer_uptodate(struct extent_buffer *eb) |
4740 | { | 4741 | { |
4741 | unsigned long i; | 4742 | unsigned long i; |
4742 | struct page *page; | 4743 | struct page *page; |
4743 | unsigned long num_pages; | 4744 | unsigned long num_pages; |
4744 | 4745 | ||
4745 | clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); | 4746 | clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); |
4746 | num_pages = num_extent_pages(eb->start, eb->len); | 4747 | num_pages = num_extent_pages(eb->start, eb->len); |
4747 | for (i = 0; i < num_pages; i++) { | 4748 | for (i = 0; i < num_pages; i++) { |
4748 | page = extent_buffer_page(eb, i); | 4749 | page = extent_buffer_page(eb, i); |
4749 | if (page) | 4750 | if (page) |
4750 | ClearPageUptodate(page); | 4751 | ClearPageUptodate(page); |
4751 | } | 4752 | } |
4752 | return 0; | 4753 | return 0; |
4753 | } | 4754 | } |
4754 | 4755 | ||
4755 | int set_extent_buffer_uptodate(struct extent_buffer *eb) | 4756 | int set_extent_buffer_uptodate(struct extent_buffer *eb) |
4756 | { | 4757 | { |
4757 | unsigned long i; | 4758 | unsigned long i; |
4758 | struct page *page; | 4759 | struct page *page; |
4759 | unsigned long num_pages; | 4760 | unsigned long num_pages; |
4760 | 4761 | ||
4761 | set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); | 4762 | set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); |
4762 | num_pages = num_extent_pages(eb->start, eb->len); | 4763 | num_pages = num_extent_pages(eb->start, eb->len); |
4763 | for (i = 0; i < num_pages; i++) { | 4764 | for (i = 0; i < num_pages; i++) { |
4764 | page = extent_buffer_page(eb, i); | 4765 | page = extent_buffer_page(eb, i); |
4765 | SetPageUptodate(page); | 4766 | SetPageUptodate(page); |
4766 | } | 4767 | } |
4767 | return 0; | 4768 | return 0; |
4768 | } | 4769 | } |
4769 | 4770 | ||
4770 | int extent_buffer_uptodate(struct extent_buffer *eb) | 4771 | int extent_buffer_uptodate(struct extent_buffer *eb) |
4771 | { | 4772 | { |
4772 | return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); | 4773 | return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); |
4773 | } | 4774 | } |
4774 | 4775 | ||
4775 | int read_extent_buffer_pages(struct extent_io_tree *tree, | 4776 | int read_extent_buffer_pages(struct extent_io_tree *tree, |
4776 | struct extent_buffer *eb, u64 start, int wait, | 4777 | struct extent_buffer *eb, u64 start, int wait, |
4777 | get_extent_t *get_extent, int mirror_num) | 4778 | get_extent_t *get_extent, int mirror_num) |
4778 | { | 4779 | { |
4779 | unsigned long i; | 4780 | unsigned long i; |
4780 | unsigned long start_i; | 4781 | unsigned long start_i; |
4781 | struct page *page; | 4782 | struct page *page; |
4782 | int err; | 4783 | int err; |
4783 | int ret = 0; | 4784 | int ret = 0; |
4784 | int locked_pages = 0; | 4785 | int locked_pages = 0; |
4785 | int all_uptodate = 1; | 4786 | int all_uptodate = 1; |
4786 | unsigned long num_pages; | 4787 | unsigned long num_pages; |
4787 | unsigned long num_reads = 0; | 4788 | unsigned long num_reads = 0; |
4788 | struct bio *bio = NULL; | 4789 | struct bio *bio = NULL; |
4789 | unsigned long bio_flags = 0; | 4790 | unsigned long bio_flags = 0; |
4790 | 4791 | ||
4791 | if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) | 4792 | if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) |
4792 | return 0; | 4793 | return 0; |
4793 | 4794 | ||
4794 | if (start) { | 4795 | if (start) { |
4795 | WARN_ON(start < eb->start); | 4796 | WARN_ON(start < eb->start); |
4796 | start_i = (start >> PAGE_CACHE_SHIFT) - | 4797 | start_i = (start >> PAGE_CACHE_SHIFT) - |
4797 | (eb->start >> PAGE_CACHE_SHIFT); | 4798 | (eb->start >> PAGE_CACHE_SHIFT); |
4798 | } else { | 4799 | } else { |
4799 | start_i = 0; | 4800 | start_i = 0; |
4800 | } | 4801 | } |
4801 | 4802 | ||
4802 | num_pages = num_extent_pages(eb->start, eb->len); | 4803 | num_pages = num_extent_pages(eb->start, eb->len); |
4803 | for (i = start_i; i < num_pages; i++) { | 4804 | for (i = start_i; i < num_pages; i++) { |
4804 | page = extent_buffer_page(eb, i); | 4805 | page = extent_buffer_page(eb, i); |
4805 | if (wait == WAIT_NONE) { | 4806 | if (wait == WAIT_NONE) { |
4806 | if (!trylock_page(page)) | 4807 | if (!trylock_page(page)) |
4807 | goto unlock_exit; | 4808 | goto unlock_exit; |
4808 | } else { | 4809 | } else { |
4809 | lock_page(page); | 4810 | lock_page(page); |
4810 | } | 4811 | } |
4811 | locked_pages++; | 4812 | locked_pages++; |
4812 | if (!PageUptodate(page)) { | 4813 | if (!PageUptodate(page)) { |
4813 | num_reads++; | 4814 | num_reads++; |
4814 | all_uptodate = 0; | 4815 | all_uptodate = 0; |
4815 | } | 4816 | } |
4816 | } | 4817 | } |
4817 | if (all_uptodate) { | 4818 | if (all_uptodate) { |
4818 | if (start_i == 0) | 4819 | if (start_i == 0) |
4819 | set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); | 4820 | set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); |
4820 | goto unlock_exit; | 4821 | goto unlock_exit; |
4821 | } | 4822 | } |
4822 | 4823 | ||
4823 | clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags); | 4824 | clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags); |
4824 | eb->read_mirror = 0; | 4825 | eb->read_mirror = 0; |
4825 | atomic_set(&eb->io_pages, num_reads); | 4826 | atomic_set(&eb->io_pages, num_reads); |
4826 | for (i = start_i; i < num_pages; i++) { | 4827 | for (i = start_i; i < num_pages; i++) { |
4827 | page = extent_buffer_page(eb, i); | 4828 | page = extent_buffer_page(eb, i); |
4828 | if (!PageUptodate(page)) { | 4829 | if (!PageUptodate(page)) { |
4829 | ClearPageError(page); | 4830 | ClearPageError(page); |
4830 | err = __extent_read_full_page(tree, page, | 4831 | err = __extent_read_full_page(tree, page, |
4831 | get_extent, &bio, | 4832 | get_extent, &bio, |
4832 | mirror_num, &bio_flags, | 4833 | mirror_num, &bio_flags, |
4833 | READ | REQ_META); | 4834 | READ | REQ_META); |
4834 | if (err) | 4835 | if (err) |
4835 | ret = err; | 4836 | ret = err; |
4836 | } else { | 4837 | } else { |
4837 | unlock_page(page); | 4838 | unlock_page(page); |
4838 | } | 4839 | } |
4839 | } | 4840 | } |
4840 | 4841 | ||
4841 | if (bio) { | 4842 | if (bio) { |
4842 | err = submit_one_bio(READ | REQ_META, bio, mirror_num, | 4843 | err = submit_one_bio(READ | REQ_META, bio, mirror_num, |
4843 | bio_flags); | 4844 | bio_flags); |
4844 | if (err) | 4845 | if (err) |
4845 | return err; | 4846 | return err; |
4846 | } | 4847 | } |
4847 | 4848 | ||
4848 | if (ret || wait != WAIT_COMPLETE) | 4849 | if (ret || wait != WAIT_COMPLETE) |
4849 | return ret; | 4850 | return ret; |
4850 | 4851 | ||
4851 | for (i = start_i; i < num_pages; i++) { | 4852 | for (i = start_i; i < num_pages; i++) { |
4852 | page = extent_buffer_page(eb, i); | 4853 | page = extent_buffer_page(eb, i); |
4853 | wait_on_page_locked(page); | 4854 | wait_on_page_locked(page); |
4854 | if (!PageUptodate(page)) | 4855 | if (!PageUptodate(page)) |
4855 | ret = -EIO; | 4856 | ret = -EIO; |
4856 | } | 4857 | } |
4857 | 4858 | ||
4858 | return ret; | 4859 | return ret; |
4859 | 4860 | ||
4860 | unlock_exit: | 4861 | unlock_exit: |
4861 | i = start_i; | 4862 | i = start_i; |
4862 | while (locked_pages > 0) { | 4863 | while (locked_pages > 0) { |
4863 | page = extent_buffer_page(eb, i); | 4864 | page = extent_buffer_page(eb, i); |
4864 | i++; | 4865 | i++; |
4865 | unlock_page(page); | 4866 | unlock_page(page); |
4866 | locked_pages--; | 4867 | locked_pages--; |
4867 | } | 4868 | } |
4868 | return ret; | 4869 | return ret; |
4869 | } | 4870 | } |
4870 | 4871 | ||
4871 | void read_extent_buffer(struct extent_buffer *eb, void *dstv, | 4872 | void read_extent_buffer(struct extent_buffer *eb, void *dstv, |
4872 | unsigned long start, | 4873 | unsigned long start, |
4873 | unsigned long len) | 4874 | unsigned long len) |
4874 | { | 4875 | { |
4875 | size_t cur; | 4876 | size_t cur; |
4876 | size_t offset; | 4877 | size_t offset; |
4877 | struct page *page; | 4878 | struct page *page; |
4878 | char *kaddr; | 4879 | char *kaddr; |
4879 | char *dst = (char *)dstv; | 4880 | char *dst = (char *)dstv; |
4880 | size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); | 4881 | size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); |
4881 | unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; | 4882 | unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; |
4882 | 4883 | ||
4883 | WARN_ON(start > eb->len); | 4884 | WARN_ON(start > eb->len); |
4884 | WARN_ON(start + len > eb->start + eb->len); | 4885 | WARN_ON(start + len > eb->start + eb->len); |
4885 | 4886 | ||
4886 | offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); | 4887 | offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); |
4887 | 4888 | ||
4888 | while (len > 0) { | 4889 | while (len > 0) { |
4889 | page = extent_buffer_page(eb, i); | 4890 | page = extent_buffer_page(eb, i); |
4890 | 4891 | ||
4891 | cur = min(len, (PAGE_CACHE_SIZE - offset)); | 4892 | cur = min(len, (PAGE_CACHE_SIZE - offset)); |
4892 | kaddr = page_address(page); | 4893 | kaddr = page_address(page); |
4893 | memcpy(dst, kaddr + offset, cur); | 4894 | memcpy(dst, kaddr + offset, cur); |
4894 | 4895 | ||
4895 | dst += cur; | 4896 | dst += cur; |
4896 | len -= cur; | 4897 | len -= cur; |
4897 | offset = 0; | 4898 | offset = 0; |
4898 | i++; | 4899 | i++; |
4899 | } | 4900 | } |
4900 | } | 4901 | } |
4901 | 4902 | ||
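
read_extent_buffer() copies an arbitrary byte range out of storage that is physically split into page-sized chunks: it locates the first page and the offset within it, then copies min(len, PAGE_CACHE_SIZE - offset) per iteration, resetting the offset to zero for every following page. The self-contained user-space equivalent below works over a malloc'd array of "pages"; it assumes the buffer starts page-aligned and so leaves out the start_offset adjustment the kernel function performs, and PAGE_SIZE here is just a local constant:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/* Copy len bytes starting at byte offset start out of an array of pages. */
static void read_paged(char **pages, unsigned long start,
                       void *dstv, unsigned long len)
{
        char *dst = dstv;
        unsigned long i = start / PAGE_SIZE;
        unsigned long offset = start % PAGE_SIZE;

        while (len > 0) {
                unsigned long cur = len < PAGE_SIZE - offset ?
                                    len : PAGE_SIZE - offset;

                memcpy(dst, pages[i] + offset, cur);
                dst += cur;
                len -= cur;
                offset = 0;     /* every page after the first starts at 0 */
                i++;
        }
}

int main(void)
{
        char *pages[2] = { malloc(PAGE_SIZE), malloc(PAGE_SIZE) };
        char out[8] = { 0 };

        memset(pages[0], 'A', PAGE_SIZE);
        memset(pages[1], 'B', PAGE_SIZE);
        /* Read 6 bytes straddling the page boundary: expect "AAABBB". */
        read_paged(pages, PAGE_SIZE - 3, out, 6);
        printf("%s\n", out);
        free(pages[0]);
        free(pages[1]);
        return 0;
}
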
4902 | int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start, | 4903 | int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start, |
4903 | unsigned long min_len, char **map, | 4904 | unsigned long min_len, char **map, |
4904 | unsigned long *map_start, | 4905 | unsigned long *map_start, |
4905 | unsigned long *map_len) | 4906 | unsigned long *map_len) |
4906 | { | 4907 | { |
4907 | size_t offset = start & (PAGE_CACHE_SIZE - 1); | 4908 | size_t offset = start & (PAGE_CACHE_SIZE - 1); |
4908 | char *kaddr; | 4909 | char *kaddr; |
4909 | struct page *p; | 4910 | struct page *p; |
4910 | size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); | 4911 | size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); |
4911 | unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; | 4912 | unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; |
4912 | unsigned long end_i = (start_offset + start + min_len - 1) >> | 4913 | unsigned long end_i = (start_offset + start + min_len - 1) >> |
4913 | PAGE_CACHE_SHIFT; | 4914 | PAGE_CACHE_SHIFT; |
4914 | 4915 | ||
4915 | if (i != end_i) | 4916 | if (i != end_i) |
4916 | return -EINVAL; | 4917 | return -EINVAL; |
4917 | 4918 | ||
4918 | if (i == 0) { | 4919 | if (i == 0) { |
4919 | offset = start_offset; | 4920 | offset = start_offset; |
4920 | *map_start = 0; | 4921 | *map_start = 0; |
4921 | } else { | 4922 | } else { |
4922 | offset = 0; | 4923 | offset = 0; |
4923 | *map_start = ((u64)i << PAGE_CACHE_SHIFT) - start_offset; | 4924 | *map_start = ((u64)i << PAGE_CACHE_SHIFT) - start_offset; |
4924 | } | 4925 | } |
4925 | 4926 | ||
4926 | if (start + min_len > eb->len) { | 4927 | if (start + min_len > eb->len) { |
4927 | WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, " | 4928 | WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, " |
4928 | "wanted %lu %lu\n", | 4929 | "wanted %lu %lu\n", |
4929 | eb->start, eb->len, start, min_len); | 4930 | eb->start, eb->len, start, min_len); |
4930 | return -EINVAL; | 4931 | return -EINVAL; |
4931 | } | 4932 | } |
4932 | 4933 | ||
4933 | p = extent_buffer_page(eb, i); | 4934 | p = extent_buffer_page(eb, i); |
4934 | kaddr = page_address(p); | 4935 | kaddr = page_address(p); |
4935 | *map = kaddr + offset; | 4936 | *map = kaddr + offset; |
4936 | *map_len = PAGE_CACHE_SIZE - offset; | 4937 | *map_len = PAGE_CACHE_SIZE - offset; |
4937 | return 0; | 4938 | return 0; |
4938 | } | 4939 | } |
4939 | 4940 | ||
4940 | int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv, | 4941 | int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv, |
4941 | unsigned long start, | 4942 | unsigned long start, |
4942 | unsigned long len) | 4943 | unsigned long len) |
4943 | { | 4944 | { |
4944 | size_t cur; | 4945 | size_t cur; |
4945 | size_t offset; | 4946 | size_t offset; |
4946 | struct page *page; | 4947 | struct page *page; |
4947 | char *kaddr; | 4948 | char *kaddr; |
4948 | char *ptr = (char *)ptrv; | 4949 | char *ptr = (char *)ptrv; |
4949 | size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); | 4950 | size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); |
4950 | unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; | 4951 | unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; |
4951 | int ret = 0; | 4952 | int ret = 0; |
4952 | 4953 | ||
4953 | WARN_ON(start > eb->len); | 4954 | WARN_ON(start > eb->len); |
4954 | WARN_ON(start + len > eb->start + eb->len); | 4955 | WARN_ON(start + len > eb->start + eb->len); |
4955 | 4956 | ||
4956 | offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); | 4957 | offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); |
4957 | 4958 | ||
4958 | while (len > 0) { | 4959 | while (len > 0) { |
4959 | page = extent_buffer_page(eb, i); | 4960 | page = extent_buffer_page(eb, i); |
4960 | 4961 | ||
4961 | cur = min(len, (PAGE_CACHE_SIZE - offset)); | 4962 | cur = min(len, (PAGE_CACHE_SIZE - offset)); |
4962 | 4963 | ||
4963 | kaddr = page_address(page); | 4964 | kaddr = page_address(page); |
4964 | ret = memcmp(ptr, kaddr + offset, cur); | 4965 | ret = memcmp(ptr, kaddr + offset, cur); |
4965 | if (ret) | 4966 | if (ret) |
4966 | break; | 4967 | break; |
4967 | 4968 | ||
4968 | ptr += cur; | 4969 | ptr += cur; |
4969 | len -= cur; | 4970 | len -= cur; |
4970 | offset = 0; | 4971 | offset = 0; |
4971 | i++; | 4972 | i++; |
4972 | } | 4973 | } |
4973 | return ret; | 4974 | return ret; |
4974 | } | 4975 | } |
4975 | 4976 | ||
4976 | void write_extent_buffer(struct extent_buffer *eb, const void *srcv, | 4977 | void write_extent_buffer(struct extent_buffer *eb, const void *srcv, |
4977 | unsigned long start, unsigned long len) | 4978 | unsigned long start, unsigned long len) |
4978 | { | 4979 | { |
4979 | size_t cur; | 4980 | size_t cur; |
4980 | size_t offset; | 4981 | size_t offset; |
4981 | struct page *page; | 4982 | struct page *page; |
4982 | char *kaddr; | 4983 | char *kaddr; |
4983 | char *src = (char *)srcv; | 4984 | char *src = (char *)srcv; |
4984 | size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); | 4985 | size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); |
4985 | unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; | 4986 | unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; |
4986 | 4987 | ||
4987 | WARN_ON(start > eb->len); | 4988 | WARN_ON(start > eb->len); |
4988 | WARN_ON(start + len > eb->start + eb->len); | 4989 | WARN_ON(start + len > eb->start + eb->len); |
4989 | 4990 | ||
4990 | offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); | 4991 | offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); |
4991 | 4992 | ||
4992 | while (len > 0) { | 4993 | while (len > 0) { |
4993 | page = extent_buffer_page(eb, i); | 4994 | page = extent_buffer_page(eb, i); |
4994 | WARN_ON(!PageUptodate(page)); | 4995 | WARN_ON(!PageUptodate(page)); |
4995 | 4996 | ||
4996 | cur = min(len, PAGE_CACHE_SIZE - offset); | 4997 | cur = min(len, PAGE_CACHE_SIZE - offset); |
4997 | kaddr = page_address(page); | 4998 | kaddr = page_address(page); |
4998 | memcpy(kaddr + offset, src, cur); | 4999 | memcpy(kaddr + offset, src, cur); |
4999 | 5000 | ||
5000 | src += cur; | 5001 | src += cur; |
5001 | len -= cur; | 5002 | len -= cur; |
5002 | offset = 0; | 5003 | offset = 0; |
5003 | i++; | 5004 | i++; |
5004 | } | 5005 | } |
5005 | } | 5006 | } |
5006 | 5007 | ||
5007 | void memset_extent_buffer(struct extent_buffer *eb, char c, | 5008 | void memset_extent_buffer(struct extent_buffer *eb, char c, |
5008 | unsigned long start, unsigned long len) | 5009 | unsigned long start, unsigned long len) |
5009 | { | 5010 | { |
5010 | size_t cur; | 5011 | size_t cur; |
5011 | size_t offset; | 5012 | size_t offset; |
5012 | struct page *page; | 5013 | struct page *page; |
5013 | char *kaddr; | 5014 | char *kaddr; |
5014 | size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); | 5015 | size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); |
5015 | unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; | 5016 | unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; |
5016 | 5017 | ||
5017 | WARN_ON(start > eb->len); | 5018 | WARN_ON(start > eb->len); |
5018 | WARN_ON(start + len > eb->start + eb->len); | 5019 | WARN_ON(start + len > eb->start + eb->len); |
5019 | 5020 | ||
5020 | offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); | 5021 | offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); |
5021 | 5022 | ||
5022 | while (len > 0) { | 5023 | while (len > 0) { |
5023 | page = extent_buffer_page(eb, i); | 5024 | page = extent_buffer_page(eb, i); |
5024 | WARN_ON(!PageUptodate(page)); | 5025 | WARN_ON(!PageUptodate(page)); |
5025 | 5026 | ||
5026 | cur = min(len, PAGE_CACHE_SIZE - offset); | 5027 | cur = min(len, PAGE_CACHE_SIZE - offset); |
5027 | kaddr = page_address(page); | 5028 | kaddr = page_address(page); |
5028 | memset(kaddr + offset, c, cur); | 5029 | memset(kaddr + offset, c, cur); |
5029 | 5030 | ||
5030 | len -= cur; | 5031 | len -= cur; |
5031 | offset = 0; | 5032 | offset = 0; |
5032 | i++; | 5033 | i++; |
5033 | } | 5034 | } |
5034 | } | 5035 | } |
5035 | 5036 | ||
5036 | void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src, | 5037 | void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src, |
5037 | unsigned long dst_offset, unsigned long src_offset, | 5038 | unsigned long dst_offset, unsigned long src_offset, |
5038 | unsigned long len) | 5039 | unsigned long len) |
5039 | { | 5040 | { |
5040 | u64 dst_len = dst->len; | 5041 | u64 dst_len = dst->len; |
5041 | size_t cur; | 5042 | size_t cur; |
5042 | size_t offset; | 5043 | size_t offset; |
5043 | struct page *page; | 5044 | struct page *page; |
5044 | char *kaddr; | 5045 | char *kaddr; |
5045 | size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1); | 5046 | size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1); |
5046 | unsigned long i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT; | 5047 | unsigned long i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT; |
5047 | 5048 | ||
5048 | WARN_ON(src->len != dst_len); | 5049 | WARN_ON(src->len != dst_len); |
5049 | 5050 | ||
5050 | offset = (start_offset + dst_offset) & | 5051 | offset = (start_offset + dst_offset) & |
5051 | (PAGE_CACHE_SIZE - 1); | 5052 | (PAGE_CACHE_SIZE - 1); |
5052 | 5053 | ||
5053 | while (len > 0) { | 5054 | while (len > 0) { |
5054 | page = extent_buffer_page(dst, i); | 5055 | page = extent_buffer_page(dst, i); |
5055 | WARN_ON(!PageUptodate(page)); | 5056 | WARN_ON(!PageUptodate(page)); |
5056 | 5057 | ||
5057 | cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - offset)); | 5058 | cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - offset)); |
5058 | 5059 | ||
5059 | kaddr = page_address(page); | 5060 | kaddr = page_address(page); |
5060 | read_extent_buffer(src, kaddr + offset, src_offset, cur); | 5061 | read_extent_buffer(src, kaddr + offset, src_offset, cur); |
5061 | 5062 | ||
5062 | src_offset += cur; | 5063 | src_offset += cur; |
5063 | len -= cur; | 5064 | len -= cur; |
5064 | offset = 0; | 5065 | offset = 0; |
5065 | i++; | 5066 | i++; |
5066 | } | 5067 | } |
5067 | } | 5068 | } |
5068 | 5069 | ||
5069 | static void move_pages(struct page *dst_page, struct page *src_page, | 5070 | static void move_pages(struct page *dst_page, struct page *src_page, |
5070 | unsigned long dst_off, unsigned long src_off, | 5071 | unsigned long dst_off, unsigned long src_off, |
5071 | unsigned long len) | 5072 | unsigned long len) |
5072 | { | 5073 | { |
5073 | char *dst_kaddr = page_address(dst_page); | 5074 | char *dst_kaddr = page_address(dst_page); |
5074 | if (dst_page == src_page) { | 5075 | if (dst_page == src_page) { |
5075 | memmove(dst_kaddr + dst_off, dst_kaddr + src_off, len); | 5076 | memmove(dst_kaddr + dst_off, dst_kaddr + src_off, len); |
5076 | } else { | 5077 | } else { |
5077 | char *src_kaddr = page_address(src_page); | 5078 | char *src_kaddr = page_address(src_page); |
5078 | char *p = dst_kaddr + dst_off + len; | 5079 | char *p = dst_kaddr + dst_off + len; |
5079 | char *s = src_kaddr + src_off + len; | 5080 | char *s = src_kaddr + src_off + len; |
5080 | 5081 | ||
5081 | while (len--) | 5082 | while (len--) |
5082 | *--p = *--s; | 5083 | *--p = *--s; |
5083 | } | 5084 | } |
5084 | } | 5085 | } |
5085 | 5086 | ||
5086 | static inline bool areas_overlap(unsigned long src, unsigned long dst, unsigned long len) | 5087 | static inline bool areas_overlap(unsigned long src, unsigned long dst, unsigned long len) |
5087 | { | 5088 | { |
5088 | unsigned long distance = (src > dst) ? src - dst : dst - src; | 5089 | unsigned long distance = (src > dst) ? src - dst : dst - src; |
5089 | return distance < len; | 5090 | return distance < len; |
5090 | } | 5091 | } |
5091 | 5092 | ||
5092 | static void copy_pages(struct page *dst_page, struct page *src_page, | 5093 | static void copy_pages(struct page *dst_page, struct page *src_page, |
5093 | unsigned long dst_off, unsigned long src_off, | 5094 | unsigned long dst_off, unsigned long src_off, |
5094 | unsigned long len) | 5095 | unsigned long len) |
5095 | { | 5096 | { |
5096 | char *dst_kaddr = page_address(dst_page); | 5097 | char *dst_kaddr = page_address(dst_page); |
5097 | char *src_kaddr; | 5098 | char *src_kaddr; |
5098 | int must_memmove = 0; | 5099 | int must_memmove = 0; |
5099 | 5100 | ||
5100 | if (dst_page != src_page) { | 5101 | if (dst_page != src_page) { |
5101 | src_kaddr = page_address(src_page); | 5102 | src_kaddr = page_address(src_page); |
5102 | } else { | 5103 | } else { |
5103 | src_kaddr = dst_kaddr; | 5104 | src_kaddr = dst_kaddr; |
5104 | if (areas_overlap(src_off, dst_off, len)) | 5105 | if (areas_overlap(src_off, dst_off, len)) |
5105 | must_memmove = 1; | 5106 | must_memmove = 1; |
5106 | } | 5107 | } |
5107 | 5108 | ||
5108 | if (must_memmove) | 5109 | if (must_memmove) |
5109 | memmove(dst_kaddr + dst_off, src_kaddr + src_off, len); | 5110 | memmove(dst_kaddr + dst_off, src_kaddr + src_off, len); |
5110 | else | 5111 | else |
5111 | memcpy(dst_kaddr + dst_off, src_kaddr + src_off, len); | 5112 | memcpy(dst_kaddr + dst_off, src_kaddr + src_off, len); |
5112 | } | 5113 | } |
5113 | 5114 | ||
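
copy_pages() only pays for memmove() when source and destination live in the same page and areas_overlap() reports that the two ranges intersect; distance < len is the entire overlap test. A quick user-space demonstration of that dispatch, with the helpers simply mirroring the two functions above:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool areas_overlap(unsigned long src, unsigned long dst,
                          unsigned long len)
{
        unsigned long distance = src > dst ? src - dst : dst - src;

        return distance < len;
}

/* Copy len bytes within one buffer, choosing memmove only when required. */
static void copy_within(char *buf, unsigned long dst_off,
                        unsigned long src_off, unsigned long len)
{
        if (areas_overlap(src_off, dst_off, len))
                memmove(buf + dst_off, buf + src_off, len);
        else
                memcpy(buf + dst_off, buf + src_off, len);
}

int main(void)
{
        char buf[16] = "abcdefgh";

        copy_within(buf, 2, 0, 4);      /* overlapping: memmove path */
        printf("%s\n", buf);            /* "ababcdgh" */
        copy_within(buf, 8, 0, 4);      /* disjoint: memcpy path */
        printf("%.12s\n", buf);
        return 0;
}
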
5114 | void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset, | 5115 | void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset, |
5115 | unsigned long src_offset, unsigned long len) | 5116 | unsigned long src_offset, unsigned long len) |
5116 | { | 5117 | { |
5117 | size_t cur; | 5118 | size_t cur; |
5118 | size_t dst_off_in_page; | 5119 | size_t dst_off_in_page; |
5119 | size_t src_off_in_page; | 5120 | size_t src_off_in_page; |
5120 | size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1); | 5121 | size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1); |
5121 | unsigned long dst_i; | 5122 | unsigned long dst_i; |
5122 | unsigned long src_i; | 5123 | unsigned long src_i; |
5123 | 5124 | ||
5124 | if (src_offset + len > dst->len) { | 5125 | if (src_offset + len > dst->len) { |
5125 | printk(KERN_ERR "btrfs memmove bogus src_offset %lu move " | 5126 | printk(KERN_ERR "btrfs memmove bogus src_offset %lu move " |
5126 | "len %lu dst len %lu\n", src_offset, len, dst->len); | 5127 | "len %lu dst len %lu\n", src_offset, len, dst->len); |
5127 | BUG_ON(1); | 5128 | BUG_ON(1); |
5128 | } | 5129 | } |
5129 | if (dst_offset + len > dst->len) { | 5130 | if (dst_offset + len > dst->len) { |
5130 | printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move " | 5131 | printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move " |
5131 | "len %lu dst len %lu\n", dst_offset, len, dst->len); | 5132 | "len %lu dst len %lu\n", dst_offset, len, dst->len); |
5132 | BUG_ON(1); | 5133 | BUG_ON(1); |
5133 | } | 5134 | } |
5134 | 5135 | ||
5135 | while (len > 0) { | 5136 | while (len > 0) { |
5136 | dst_off_in_page = (start_offset + dst_offset) & | 5137 | dst_off_in_page = (start_offset + dst_offset) & |
5137 | (PAGE_CACHE_SIZE - 1); | 5138 | (PAGE_CACHE_SIZE - 1); |
5138 | src_off_in_page = (start_offset + src_offset) & | 5139 | src_off_in_page = (start_offset + src_offset) & |
5139 | (PAGE_CACHE_SIZE - 1); | 5140 | (PAGE_CACHE_SIZE - 1); |
5140 | 5141 | ||
5141 | dst_i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT; | 5142 | dst_i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT; |
5142 | src_i = (start_offset + src_offset) >> PAGE_CACHE_SHIFT; | 5143 | src_i = (start_offset + src_offset) >> PAGE_CACHE_SHIFT; |
5143 | 5144 | ||
5144 | cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - | 5145 | cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - |
5145 | src_off_in_page)); | 5146 | src_off_in_page)); |
5146 | cur = min_t(unsigned long, cur, | 5147 | cur = min_t(unsigned long, cur, |
5147 | (unsigned long)(PAGE_CACHE_SIZE - dst_off_in_page)); | 5148 | (unsigned long)(PAGE_CACHE_SIZE - dst_off_in_page)); |
5148 | 5149 | ||
5149 | copy_pages(extent_buffer_page(dst, dst_i), | 5150 | copy_pages(extent_buffer_page(dst, dst_i), |
5150 | extent_buffer_page(dst, src_i), | 5151 | extent_buffer_page(dst, src_i), |
5151 | dst_off_in_page, src_off_in_page, cur); | 5152 | dst_off_in_page, src_off_in_page, cur); |
5152 | 5153 | ||
5153 | src_offset += cur; | 5154 | src_offset += cur; |
5154 | dst_offset += cur; | 5155 | dst_offset += cur; |
5155 | len -= cur; | 5156 | len -= cur; |
5156 | } | 5157 | } |
5157 | } | 5158 | } |
5158 | 5159 | ||
5159 | void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset, | 5160 | void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset, |
5160 | unsigned long src_offset, unsigned long len) | 5161 | unsigned long src_offset, unsigned long len) |
5161 | { | 5162 | { |
5162 | size_t cur; | 5163 | size_t cur; |
5163 | size_t dst_off_in_page; | 5164 | size_t dst_off_in_page; |
5164 | size_t src_off_in_page; | 5165 | size_t src_off_in_page; |
5165 | unsigned long dst_end = dst_offset + len - 1; | 5166 | unsigned long dst_end = dst_offset + len - 1; |
5166 | unsigned long src_end = src_offset + len - 1; | 5167 | unsigned long src_end = src_offset + len - 1; |
5167 | size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1); | 5168 | size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1); |
5168 | unsigned long dst_i; | 5169 | unsigned long dst_i; |
5169 | unsigned long src_i; | 5170 | unsigned long src_i; |
5170 | 5171 | ||
5171 | if (src_offset + len > dst->len) { | 5172 | if (src_offset + len > dst->len) { |
5172 | printk(KERN_ERR "btrfs memmove bogus src_offset %lu move " | 5173 | printk(KERN_ERR "btrfs memmove bogus src_offset %lu move " |
5173 | "len %lu len %lu\n", src_offset, len, dst->len); | 5174 | "len %lu len %lu\n", src_offset, len, dst->len); |
5174 | BUG_ON(1); | 5175 | BUG_ON(1); |
5175 | } | 5176 | } |
5176 | if (dst_offset + len > dst->len) { | 5177 | if (dst_offset + len > dst->len) { |
5177 | printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move " | 5178 | printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move " |
5178 | "len %lu len %lu\n", dst_offset, len, dst->len); | 5179 | "len %lu len %lu\n", dst_offset, len, dst->len); |
5179 | BUG_ON(1); | 5180 | BUG_ON(1); |
5180 | } | 5181 | } |
5181 | if (dst_offset < src_offset) { | 5182 | if (dst_offset < src_offset) { |
5182 | memcpy_extent_buffer(dst, dst_offset, src_offset, len); | 5183 | memcpy_extent_buffer(dst, dst_offset, src_offset, len); |
5183 | return; | 5184 | return; |
5184 | } | 5185 | } |
5185 | while (len > 0) { | 5186 | while (len > 0) { |
5186 | dst_i = (start_offset + dst_end) >> PAGE_CACHE_SHIFT; | 5187 | dst_i = (start_offset + dst_end) >> PAGE_CACHE_SHIFT; |
5187 | src_i = (start_offset + src_end) >> PAGE_CACHE_SHIFT; | 5188 | src_i = (start_offset + src_end) >> PAGE_CACHE_SHIFT; |
5188 | 5189 | ||
5189 | dst_off_in_page = (start_offset + dst_end) & | 5190 | dst_off_in_page = (start_offset + dst_end) & |
5190 | (PAGE_CACHE_SIZE - 1); | 5191 | (PAGE_CACHE_SIZE - 1); |
5191 | src_off_in_page = (start_offset + src_end) & | 5192 | src_off_in_page = (start_offset + src_end) & |
5192 | (PAGE_CACHE_SIZE - 1); | 5193 | (PAGE_CACHE_SIZE - 1); |
5193 | 5194 | ||
5194 | cur = min_t(unsigned long, len, src_off_in_page + 1); | 5195 | cur = min_t(unsigned long, len, src_off_in_page + 1); |
5195 | cur = min(cur, dst_off_in_page + 1); | 5196 | cur = min(cur, dst_off_in_page + 1); |
5196 | move_pages(extent_buffer_page(dst, dst_i), | 5197 | move_pages(extent_buffer_page(dst, dst_i), |
5197 | extent_buffer_page(dst, src_i), | 5198 | extent_buffer_page(dst, src_i), |
5198 | dst_off_in_page - cur + 1, | 5199 | dst_off_in_page - cur + 1, |
5199 | src_off_in_page - cur + 1, cur); | 5200 | src_off_in_page - cur + 1, cur); |
5200 | 5201 | ||
5201 | dst_end -= cur; | 5202 | dst_end -= cur; |
5202 | src_end -= cur; | 5203 | src_end -= cur; |
5203 | len -= cur; | 5204 | len -= cur; |
5204 | } | 5205 | } |
5205 | } | 5206 | } |
5206 | 5207 | ||
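
When the destination range sits above the source, memmove_extent_buffer() walks both ranges from their last byte toward their first, moving one chunk per iteration and never letting a chunk cross a page boundary, so overlapping bytes are read before they are overwritten; the non-overlapping direction is simply delegated to memcpy_extent_buffer(). A user-space sketch of that backwards chunked walk follows; CHUNK stands in for PAGE_CACHE_SIZE and a plain memmove() stands in for the byte-wise backward copy move_pages() does within a page:

#include <stdio.h>
#include <string.h>

#define CHUNK 4UL       /* stand-in for PAGE_CACHE_SIZE */

/* Overlap-safe move within one flat buffer, copying backwards in chunks
 * that never cross a CHUNK boundary, like the per-page loop above. */
static void move_backwards(char *buf, unsigned long dst_off,
                           unsigned long src_off, unsigned long len)
{
        unsigned long dst_end = dst_off + len - 1;
        unsigned long src_end = src_off + len - 1;

        while (len > 0) {
                unsigned long dst_in = dst_end % CHUNK;
                unsigned long src_in = src_end % CHUNK;
                unsigned long cur = len < src_in + 1 ? len : src_in + 1;

                if (cur > dst_in + 1)
                        cur = dst_in + 1;
                memmove(buf + dst_end - cur + 1, buf + src_end - cur + 1, cur);
                dst_end -= cur;
                src_end -= cur;
                len -= cur;
        }
}

int main(void)
{
        char buf[16] = "0123456789";

        /* Shift bytes 0..7 up by 2; expect "0101234567". */
        move_backwards(buf, 2, 0, 8);
        printf("%s\n", buf);
        return 0;
}
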
5207 | int try_release_extent_buffer(struct page *page) | 5208 | int try_release_extent_buffer(struct page *page) |
5208 | { | 5209 | { |
5209 | struct extent_buffer *eb; | 5210 | struct extent_buffer *eb; |
5210 | 5211 | ||
5211 | /* | 5212 | /* |
5212 | * We need to make sure nobody is attaching this page to an eb right | 5213 | * We need to make sure nobody is attaching this page to an eb right |
5213 | * now. | 5214 | * now. |
5214 | */ | 5215 | */ |
5215 | spin_lock(&page->mapping->private_lock); | 5216 | spin_lock(&page->mapping->private_lock); |
5216 | if (!PagePrivate(page)) { | 5217 | if (!PagePrivate(page)) { |
5217 | spin_unlock(&page->mapping->private_lock); | 5218 | spin_unlock(&page->mapping->private_lock); |
5218 | return 1; | 5219 | return 1; |
5219 | } | 5220 | } |
5220 | 5221 | ||
5221 | eb = (struct extent_buffer *)page->private; | 5222 | eb = (struct extent_buffer *)page->private; |
5222 | BUG_ON(!eb); | 5223 | BUG_ON(!eb); |
5223 | 5224 | ||
5224 | /* | 5225 | /* |
5225 | * This is a little awful but should be ok, we need to make sure that | 5226 | * This is a little awful but should be ok, we need to make sure that |
5226 | * the eb doesn't disappear out from under us while we're looking at | 5227 | * the eb doesn't disappear out from under us while we're looking at |
5227 | * this page. | 5228 | * this page. |
5228 | */ | 5229 | */ |
5229 | spin_lock(&eb->refs_lock); | 5230 | spin_lock(&eb->refs_lock); |
5230 | if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) { | 5231 | if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) { |
5231 | spin_unlock(&eb->refs_lock); | 5232 | spin_unlock(&eb->refs_lock); |
5232 | spin_unlock(&page->mapping->private_lock); | 5233 | spin_unlock(&page->mapping->private_lock); |
5233 | return 0; | 5234 | return 0; |
5234 | } | 5235 | } |
5235 | spin_unlock(&page->mapping->private_lock); | 5236 | spin_unlock(&page->mapping->private_lock); |
5236 | 5237 | ||
5237 | /* | 5238 | /* |
5238 | * If tree ref isn't set then we know the ref on this eb is a real ref, | 5239 | * If tree ref isn't set then we know the ref on this eb is a real ref, |
5239 | * so just return, this page will likely be freed soon anyway. | 5240 | * so just return, this page will likely be freed soon anyway. |
5240 | */ | 5241 | */ |
5241 | if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) { | 5242 | if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) { |
5242 | spin_unlock(&eb->refs_lock); | 5243 | spin_unlock(&eb->refs_lock); |
5243 | return 0; | 5244 | return 0; |
5244 | } | 5245 | } |
5245 | 5246 | ||
5246 | return release_extent_buffer(eb); | 5247 | return release_extent_buffer(eb); |
5247 | } | 5248 | } |
fs/btrfs/file.c
1 | /* | 1 | /* |
2 | * Copyright (C) 2007 Oracle. All rights reserved. | 2 | * Copyright (C) 2007 Oracle. All rights reserved. |
3 | * | 3 | * |
4 | * This program is free software; you can redistribute it and/or | 4 | * This program is free software; you can redistribute it and/or |
5 | * modify it under the terms of the GNU General Public | 5 | * modify it under the terms of the GNU General Public |
6 | * License v2 as published by the Free Software Foundation. | 6 | * License v2 as published by the Free Software Foundation. |
7 | * | 7 | * |
8 | * This program is distributed in the hope that it will be useful, | 8 | * This program is distributed in the hope that it will be useful, |
9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of | 9 | * but WITHOUT ANY WARRANTY; without even the implied warranty of |
10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU | 10 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU |
11 | * General Public License for more details. | 11 | * General Public License for more details. |
12 | * | 12 | * |
13 | * You should have received a copy of the GNU General Public | 13 | * You should have received a copy of the GNU General Public |
14 | * License along with this program; if not, write to the | 14 | * License along with this program; if not, write to the |
15 | * Free Software Foundation, Inc., 59 Temple Place - Suite 330, | 15 | * Free Software Foundation, Inc., 59 Temple Place - Suite 330, |
16 | * Boston, MA 021110-1307, USA. | 16 | * Boston, MA 021110-1307, USA. |
17 | */ | 17 | */ |
18 | 18 | ||
19 | #include <linux/fs.h> | 19 | #include <linux/fs.h> |
20 | #include <linux/pagemap.h> | 20 | #include <linux/pagemap.h> |
21 | #include <linux/highmem.h> | 21 | #include <linux/highmem.h> |
22 | #include <linux/time.h> | 22 | #include <linux/time.h> |
23 | #include <linux/init.h> | 23 | #include <linux/init.h> |
24 | #include <linux/string.h> | 24 | #include <linux/string.h> |
25 | #include <linux/backing-dev.h> | 25 | #include <linux/backing-dev.h> |
26 | #include <linux/mpage.h> | 26 | #include <linux/mpage.h> |
27 | #include <linux/aio.h> | 27 | #include <linux/aio.h> |
28 | #include <linux/falloc.h> | 28 | #include <linux/falloc.h> |
29 | #include <linux/swap.h> | 29 | #include <linux/swap.h> |
30 | #include <linux/writeback.h> | 30 | #include <linux/writeback.h> |
31 | #include <linux/statfs.h> | 31 | #include <linux/statfs.h> |
32 | #include <linux/compat.h> | 32 | #include <linux/compat.h> |
33 | #include <linux/slab.h> | 33 | #include <linux/slab.h> |
34 | #include <linux/btrfs.h> | 34 | #include <linux/btrfs.h> |
35 | #include "ctree.h" | 35 | #include "ctree.h" |
36 | #include "disk-io.h" | 36 | #include "disk-io.h" |
37 | #include "transaction.h" | 37 | #include "transaction.h" |
38 | #include "btrfs_inode.h" | 38 | #include "btrfs_inode.h" |
39 | #include "print-tree.h" | 39 | #include "print-tree.h" |
40 | #include "tree-log.h" | 40 | #include "tree-log.h" |
41 | #include "locking.h" | 41 | #include "locking.h" |
42 | #include "compat.h" | 42 | #include "compat.h" |
43 | #include "volumes.h" | 43 | #include "volumes.h" |
44 | 44 | ||
45 | static struct kmem_cache *btrfs_inode_defrag_cachep; | 45 | static struct kmem_cache *btrfs_inode_defrag_cachep; |
46 | /* | 46 | /* |
47 | * when auto defrag is enabled we | 47 | * when auto defrag is enabled we |
48 | * queue up these defrag structs to remember which | 48 | * queue up these defrag structs to remember which |
49 | * inodes need defragging passes | 49 | * inodes need defragging passes |
50 | */ | 50 | */ |
51 | struct inode_defrag { | 51 | struct inode_defrag { |
52 | struct rb_node rb_node; | 52 | struct rb_node rb_node; |
53 | /* objectid */ | 53 | /* objectid */ |
54 | u64 ino; | 54 | u64 ino; |
55 | /* | 55 | /* |
56 | * transid where the defrag was added, we search for | 56 | * transid where the defrag was added, we search for |
57 | * extents newer than this | 57 | * extents newer than this |
58 | */ | 58 | */ |
59 | u64 transid; | 59 | u64 transid; |
60 | 60 | ||
61 | /* root objectid */ | 61 | /* root objectid */ |
62 | u64 root; | 62 | u64 root; |
63 | 63 | ||
64 | /* last offset we were able to defrag */ | 64 | /* last offset we were able to defrag */ |
65 | u64 last_offset; | 65 | u64 last_offset; |
66 | 66 | ||
67 | /* if we've wrapped around back to zero once already */ | 67 | /* if we've wrapped around back to zero once already */ |
68 | int cycled; | 68 | int cycled; |
69 | }; | 69 | }; |
70 | 70 | ||
71 | static int __compare_inode_defrag(struct inode_defrag *defrag1, | 71 | static int __compare_inode_defrag(struct inode_defrag *defrag1, |
72 | struct inode_defrag *defrag2) | 72 | struct inode_defrag *defrag2) |
73 | { | 73 | { |
74 | if (defrag1->root > defrag2->root) | 74 | if (defrag1->root > defrag2->root) |
75 | return 1; | 75 | return 1; |
76 | else if (defrag1->root < defrag2->root) | 76 | else if (defrag1->root < defrag2->root) |
77 | return -1; | 77 | return -1; |
78 | else if (defrag1->ino > defrag2->ino) | 78 | else if (defrag1->ino > defrag2->ino) |
79 | return 1; | 79 | return 1; |
80 | else if (defrag1->ino < defrag2->ino) | 80 | else if (defrag1->ino < defrag2->ino) |
81 | return -1; | 81 | return -1; |
82 | else | 82 | else |
83 | return 0; | 83 | return 0; |
84 | } | 84 | } |
85 | 85 | ||
86 | /* put a record for an inode into the defrag tree. The lock | 86 | /* put a record for an inode into the defrag tree. The lock |
87 | * must be held already | 87 | * must be held already |
88 | * | 88 | * |
89 | * If you're inserting a record for an older transid than an | 89 | * If you're inserting a record for an older transid than an |
90 | * existing record, the transid already in the tree is lowered | 90 | * existing record, the transid already in the tree is lowered |
91 | * | 91 | * |
92 | * If an existing record is found the defrag item you | 92 | * If an existing record is found the defrag item you |
93 | * pass in is freed | 93 | * pass in is freed |
94 | */ | 94 | */ |
95 | static int __btrfs_add_inode_defrag(struct inode *inode, | 95 | static int __btrfs_add_inode_defrag(struct inode *inode, |
96 | struct inode_defrag *defrag) | 96 | struct inode_defrag *defrag) |
97 | { | 97 | { |
98 | struct btrfs_root *root = BTRFS_I(inode)->root; | 98 | struct btrfs_root *root = BTRFS_I(inode)->root; |
99 | struct inode_defrag *entry; | 99 | struct inode_defrag *entry; |
100 | struct rb_node **p; | 100 | struct rb_node **p; |
101 | struct rb_node *parent = NULL; | 101 | struct rb_node *parent = NULL; |
102 | int ret; | 102 | int ret; |
103 | 103 | ||
104 | p = &root->fs_info->defrag_inodes.rb_node; | 104 | p = &root->fs_info->defrag_inodes.rb_node; |
105 | while (*p) { | 105 | while (*p) { |
106 | parent = *p; | 106 | parent = *p; |
107 | entry = rb_entry(parent, struct inode_defrag, rb_node); | 107 | entry = rb_entry(parent, struct inode_defrag, rb_node); |
108 | 108 | ||
109 | ret = __compare_inode_defrag(defrag, entry); | 109 | ret = __compare_inode_defrag(defrag, entry); |
110 | if (ret < 0) | 110 | if (ret < 0) |
111 | p = &parent->rb_left; | 111 | p = &parent->rb_left; |
112 | else if (ret > 0) | 112 | else if (ret > 0) |
113 | p = &parent->rb_right; | 113 | p = &parent->rb_right; |
114 | else { | 114 | else { |
115 | /* if we're reinserting an entry for | 115 | /* if we're reinserting an entry for |
116 | * an old defrag run, make sure to | 116 | * an old defrag run, make sure to |
117 | * lower the transid of our existing record | 117 | * lower the transid of our existing record |
118 | */ | 118 | */ |
119 | if (defrag->transid < entry->transid) | 119 | if (defrag->transid < entry->transid) |
120 | entry->transid = defrag->transid; | 120 | entry->transid = defrag->transid; |
121 | if (defrag->last_offset > entry->last_offset) | 121 | if (defrag->last_offset > entry->last_offset) |
122 | entry->last_offset = defrag->last_offset; | 122 | entry->last_offset = defrag->last_offset; |
123 | return -EEXIST; | 123 | return -EEXIST; |
124 | } | 124 | } |
125 | } | 125 | } |
126 | set_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags); | 126 | set_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags); |
127 | rb_link_node(&defrag->rb_node, parent, p); | 127 | rb_link_node(&defrag->rb_node, parent, p); |
128 | rb_insert_color(&defrag->rb_node, &root->fs_info->defrag_inodes); | 128 | rb_insert_color(&defrag->rb_node, &root->fs_info->defrag_inodes); |
129 | return 0; | 129 | return 0; |
130 | } | 130 | } |
131 | 131 | ||
132 | static inline int __need_auto_defrag(struct btrfs_root *root) | 132 | static inline int __need_auto_defrag(struct btrfs_root *root) |
133 | { | 133 | { |
134 | if (!btrfs_test_opt(root, AUTO_DEFRAG)) | 134 | if (!btrfs_test_opt(root, AUTO_DEFRAG)) |
135 | return 0; | 135 | return 0; |
136 | 136 | ||
137 | if (btrfs_fs_closing(root->fs_info)) | 137 | if (btrfs_fs_closing(root->fs_info)) |
138 | return 0; | 138 | return 0; |
139 | 139 | ||
140 | return 1; | 140 | return 1; |
141 | } | 141 | } |
142 | 142 | ||
143 | /* | 143 | /* |
144 | * insert a defrag record for this inode if auto defrag is | 144 | * insert a defrag record for this inode if auto defrag is |
145 | * enabled | 145 | * enabled |
146 | */ | 146 | */ |
147 | int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans, | 147 | int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans, |
148 | struct inode *inode) | 148 | struct inode *inode) |
149 | { | 149 | { |
150 | struct btrfs_root *root = BTRFS_I(inode)->root; | 150 | struct btrfs_root *root = BTRFS_I(inode)->root; |
151 | struct inode_defrag *defrag; | 151 | struct inode_defrag *defrag; |
152 | u64 transid; | 152 | u64 transid; |
153 | int ret; | 153 | int ret; |
154 | 154 | ||
155 | if (!__need_auto_defrag(root)) | 155 | if (!__need_auto_defrag(root)) |
156 | return 0; | 156 | return 0; |
157 | 157 | ||
158 | if (test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags)) | 158 | if (test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags)) |
159 | return 0; | 159 | return 0; |
160 | 160 | ||
161 | if (trans) | 161 | if (trans) |
162 | transid = trans->transid; | 162 | transid = trans->transid; |
163 | else | 163 | else |
164 | transid = BTRFS_I(inode)->root->last_trans; | 164 | transid = BTRFS_I(inode)->root->last_trans; |
165 | 165 | ||
166 | defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS); | 166 | defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS); |
167 | if (!defrag) | 167 | if (!defrag) |
168 | return -ENOMEM; | 168 | return -ENOMEM; |
169 | 169 | ||
170 | defrag->ino = btrfs_ino(inode); | 170 | defrag->ino = btrfs_ino(inode); |
171 | defrag->transid = transid; | 171 | defrag->transid = transid; |
172 | defrag->root = root->root_key.objectid; | 172 | defrag->root = root->root_key.objectid; |
173 | 173 | ||
174 | spin_lock(&root->fs_info->defrag_inodes_lock); | 174 | spin_lock(&root->fs_info->defrag_inodes_lock); |
175 | if (!test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags)) { | 175 | if (!test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags)) { |
176 | /* | 176 | /* |
177 | * If we set IN_DEFRAG flag and evict the inode from memory, | 177 | * If we set IN_DEFRAG flag and evict the inode from memory, |
178 | * and then re-read this inode, this new inode doesn't have | 178 | * and then re-read this inode, this new inode doesn't have |
179 | * IN_DEFRAG flag. In that case, we may find an existing defrag record. | 179 | * IN_DEFRAG flag. In that case, we may find an existing defrag record. |
180 | */ | 180 | */ |
181 | ret = __btrfs_add_inode_defrag(inode, defrag); | 181 | ret = __btrfs_add_inode_defrag(inode, defrag); |
182 | if (ret) | 182 | if (ret) |
183 | kmem_cache_free(btrfs_inode_defrag_cachep, defrag); | 183 | kmem_cache_free(btrfs_inode_defrag_cachep, defrag); |
184 | } else { | 184 | } else { |
185 | kmem_cache_free(btrfs_inode_defrag_cachep, defrag); | 185 | kmem_cache_free(btrfs_inode_defrag_cachep, defrag); |
186 | } | 186 | } |
187 | spin_unlock(&root->fs_info->defrag_inodes_lock); | 187 | spin_unlock(&root->fs_info->defrag_inodes_lock); |
188 | return 0; | 188 | return 0; |
189 | } | 189 | } |
190 | 190 | ||
191 | /* | 191 | /* |
192 | * Requeue the defrag object. If there is a defrag object that points to | 192 | * Requeue the defrag object. If there is a defrag object that points to |
193 | * the same inode in the tree, we will merge them together (by | 193 | * the same inode in the tree, we will merge them together (by |
194 | * __btrfs_add_inode_defrag()) and free the one that we want to requeue. | 194 | * __btrfs_add_inode_defrag()) and free the one that we want to requeue. |
195 | */ | 195 | */ |
196 | static void btrfs_requeue_inode_defrag(struct inode *inode, | 196 | static void btrfs_requeue_inode_defrag(struct inode *inode, |
197 | struct inode_defrag *defrag) | 197 | struct inode_defrag *defrag) |
198 | { | 198 | { |
199 | struct btrfs_root *root = BTRFS_I(inode)->root; | 199 | struct btrfs_root *root = BTRFS_I(inode)->root; |
200 | int ret; | 200 | int ret; |
201 | 201 | ||
202 | if (!__need_auto_defrag(root)) | 202 | if (!__need_auto_defrag(root)) |
203 | goto out; | 203 | goto out; |
204 | 204 | ||
205 | /* | 205 | /* |
206 | * Here we don't check the IN_DEFRAG flag, because we need to merge | 206 | * Here we don't check the IN_DEFRAG flag, because we need to merge |
207 | * them together. | 207 | * them together. |
208 | */ | 208 | */ |
209 | spin_lock(&root->fs_info->defrag_inodes_lock); | 209 | spin_lock(&root->fs_info->defrag_inodes_lock); |
210 | ret = __btrfs_add_inode_defrag(inode, defrag); | 210 | ret = __btrfs_add_inode_defrag(inode, defrag); |
211 | spin_unlock(&root->fs_info->defrag_inodes_lock); | 211 | spin_unlock(&root->fs_info->defrag_inodes_lock); |
212 | if (ret) | 212 | if (ret) |
213 | goto out; | 213 | goto out; |
214 | return; | 214 | return; |
215 | out: | 215 | out: |
216 | kmem_cache_free(btrfs_inode_defrag_cachep, defrag); | 216 | kmem_cache_free(btrfs_inode_defrag_cachep, defrag); |
217 | } | 217 | } |
218 | 218 | ||
219 | /* | 219 | /* |
220 | * pick the defraggable inode that we want; if it doesn't exist, we will get | 220 | * pick the defraggable inode that we want; if it doesn't exist, we will get |
221 | * the next one. | 221 | * the next one. |
222 | */ | 222 | */ |
223 | static struct inode_defrag * | 223 | static struct inode_defrag * |
224 | btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino) | 224 | btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino) |
225 | { | 225 | { |
226 | struct inode_defrag *entry = NULL; | 226 | struct inode_defrag *entry = NULL; |
227 | struct inode_defrag tmp; | 227 | struct inode_defrag tmp; |
228 | struct rb_node *p; | 228 | struct rb_node *p; |
229 | struct rb_node *parent = NULL; | 229 | struct rb_node *parent = NULL; |
230 | int ret; | 230 | int ret; |
231 | 231 | ||
232 | tmp.ino = ino; | 232 | tmp.ino = ino; |
233 | tmp.root = root; | 233 | tmp.root = root; |
234 | 234 | ||
235 | spin_lock(&fs_info->defrag_inodes_lock); | 235 | spin_lock(&fs_info->defrag_inodes_lock); |
236 | p = fs_info->defrag_inodes.rb_node; | 236 | p = fs_info->defrag_inodes.rb_node; |
237 | while (p) { | 237 | while (p) { |
238 | parent = p; | 238 | parent = p; |
239 | entry = rb_entry(parent, struct inode_defrag, rb_node); | 239 | entry = rb_entry(parent, struct inode_defrag, rb_node); |
240 | 240 | ||
241 | ret = __compare_inode_defrag(&tmp, entry); | 241 | ret = __compare_inode_defrag(&tmp, entry); |
242 | if (ret < 0) | 242 | if (ret < 0) |
243 | p = parent->rb_left; | 243 | p = parent->rb_left; |
244 | else if (ret > 0) | 244 | else if (ret > 0) |
245 | p = parent->rb_right; | 245 | p = parent->rb_right; |
246 | else | 246 | else |
247 | goto out; | 247 | goto out; |
248 | } | 248 | } |
249 | 249 | ||
250 | if (parent && __compare_inode_defrag(&tmp, entry) > 0) { | 250 | if (parent && __compare_inode_defrag(&tmp, entry) > 0) { |
251 | parent = rb_next(parent); | 251 | parent = rb_next(parent); |
252 | if (parent) | 252 | if (parent) |
253 | entry = rb_entry(parent, struct inode_defrag, rb_node); | 253 | entry = rb_entry(parent, struct inode_defrag, rb_node); |
254 | else | 254 | else |
255 | entry = NULL; | 255 | entry = NULL; |
256 | } | 256 | } |
257 | out: | 257 | out: |
258 | if (entry) | 258 | if (entry) |
259 | rb_erase(parent, &fs_info->defrag_inodes); | 259 | rb_erase(parent, &fs_info->defrag_inodes); |
260 | spin_unlock(&fs_info->defrag_inodes_lock); | 260 | spin_unlock(&fs_info->defrag_inodes_lock); |
261 | return entry; | 261 | return entry; |
262 | } | 262 | } |
263 | 263 | ||
264 | void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info) | 264 | void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info) |
265 | { | 265 | { |
266 | struct inode_defrag *defrag; | 266 | struct inode_defrag *defrag; |
267 | struct rb_node *node; | 267 | struct rb_node *node; |
268 | 268 | ||
269 | spin_lock(&fs_info->defrag_inodes_lock); | 269 | spin_lock(&fs_info->defrag_inodes_lock); |
270 | node = rb_first(&fs_info->defrag_inodes); | 270 | node = rb_first(&fs_info->defrag_inodes); |
271 | while (node) { | 271 | while (node) { |
272 | rb_erase(node, &fs_info->defrag_inodes); | 272 | rb_erase(node, &fs_info->defrag_inodes); |
273 | defrag = rb_entry(node, struct inode_defrag, rb_node); | 273 | defrag = rb_entry(node, struct inode_defrag, rb_node); |
274 | kmem_cache_free(btrfs_inode_defrag_cachep, defrag); | 274 | kmem_cache_free(btrfs_inode_defrag_cachep, defrag); |
275 | 275 | ||
276 | if (need_resched()) { | 276 | if (need_resched()) { |
277 | spin_unlock(&fs_info->defrag_inodes_lock); | 277 | spin_unlock(&fs_info->defrag_inodes_lock); |
278 | cond_resched(); | 278 | cond_resched(); |
279 | spin_lock(&fs_info->defrag_inodes_lock); | 279 | spin_lock(&fs_info->defrag_inodes_lock); |
280 | } | 280 | } |
281 | 281 | ||
282 | node = rb_first(&fs_info->defrag_inodes); | 282 | node = rb_first(&fs_info->defrag_inodes); |
283 | } | 283 | } |
284 | spin_unlock(&fs_info->defrag_inodes_lock); | 284 | spin_unlock(&fs_info->defrag_inodes_lock); |
285 | } | 285 | } |
286 | 286 | ||
287 | #define BTRFS_DEFRAG_BATCH 1024 | 287 | #define BTRFS_DEFRAG_BATCH 1024 |
288 | 288 | ||
289 | static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info, | 289 | static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info, |
290 | struct inode_defrag *defrag) | 290 | struct inode_defrag *defrag) |
291 | { | 291 | { |
292 | struct btrfs_root *inode_root; | 292 | struct btrfs_root *inode_root; |
293 | struct inode *inode; | 293 | struct inode *inode; |
294 | struct btrfs_key key; | 294 | struct btrfs_key key; |
295 | struct btrfs_ioctl_defrag_range_args range; | 295 | struct btrfs_ioctl_defrag_range_args range; |
296 | int num_defrag; | 296 | int num_defrag; |
297 | int index; | 297 | int index; |
298 | int ret; | 298 | int ret; |
299 | 299 | ||
300 | /* get the inode */ | 300 | /* get the inode */ |
301 | key.objectid = defrag->root; | 301 | key.objectid = defrag->root; |
302 | btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY); | 302 | btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY); |
303 | key.offset = (u64)-1; | 303 | key.offset = (u64)-1; |
304 | 304 | ||
305 | index = srcu_read_lock(&fs_info->subvol_srcu); | 305 | index = srcu_read_lock(&fs_info->subvol_srcu); |
306 | 306 | ||
307 | inode_root = btrfs_read_fs_root_no_name(fs_info, &key); | 307 | inode_root = btrfs_read_fs_root_no_name(fs_info, &key); |
308 | if (IS_ERR(inode_root)) { | 308 | if (IS_ERR(inode_root)) { |
309 | ret = PTR_ERR(inode_root); | 309 | ret = PTR_ERR(inode_root); |
310 | goto cleanup; | 310 | goto cleanup; |
311 | } | 311 | } |
312 | 312 | ||
313 | key.objectid = defrag->ino; | 313 | key.objectid = defrag->ino; |
314 | btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY); | 314 | btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY); |
315 | key.offset = 0; | 315 | key.offset = 0; |
316 | inode = btrfs_iget(fs_info->sb, &key, inode_root, NULL); | 316 | inode = btrfs_iget(fs_info->sb, &key, inode_root, NULL); |
317 | if (IS_ERR(inode)) { | 317 | if (IS_ERR(inode)) { |
318 | ret = PTR_ERR(inode); | 318 | ret = PTR_ERR(inode); |
319 | goto cleanup; | 319 | goto cleanup; |
320 | } | 320 | } |
321 | srcu_read_unlock(&fs_info->subvol_srcu, index); | 321 | srcu_read_unlock(&fs_info->subvol_srcu, index); |
322 | 322 | ||
323 | /* do a chunk of defrag */ | 323 | /* do a chunk of defrag */ |
324 | clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags); | 324 | clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags); |
325 | memset(&range, 0, sizeof(range)); | 325 | memset(&range, 0, sizeof(range)); |
326 | range.len = (u64)-1; | 326 | range.len = (u64)-1; |
327 | range.start = defrag->last_offset; | 327 | range.start = defrag->last_offset; |
328 | 328 | ||
329 | sb_start_write(fs_info->sb); | 329 | sb_start_write(fs_info->sb); |
330 | num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid, | 330 | num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid, |
331 | BTRFS_DEFRAG_BATCH); | 331 | BTRFS_DEFRAG_BATCH); |
332 | sb_end_write(fs_info->sb); | 332 | sb_end_write(fs_info->sb); |
333 | /* | 333 | /* |
334 | * if we filled the whole defrag batch, there | 334 | * if we filled the whole defrag batch, there |
335 | * must be more work to do. Queue this defrag | 335 | * must be more work to do. Queue this defrag |
336 | * again | 336 | * again |
337 | */ | 337 | */ |
338 | if (num_defrag == BTRFS_DEFRAG_BATCH) { | 338 | if (num_defrag == BTRFS_DEFRAG_BATCH) { |
339 | defrag->last_offset = range.start; | 339 | defrag->last_offset = range.start; |
340 | btrfs_requeue_inode_defrag(inode, defrag); | 340 | btrfs_requeue_inode_defrag(inode, defrag); |
341 | } else if (defrag->last_offset && !defrag->cycled) { | 341 | } else if (defrag->last_offset && !defrag->cycled) { |
342 | /* | 342 | /* |
343 | * we didn't fill our defrag batch, but | 343 | * we didn't fill our defrag batch, but |
344 | * we didn't start at zero. Make sure we loop | 344 | * we didn't start at zero. Make sure we loop |
345 | * around to the start of the file. | 345 | * around to the start of the file. |
346 | */ | 346 | */ |
347 | defrag->last_offset = 0; | 347 | defrag->last_offset = 0; |
348 | defrag->cycled = 1; | 348 | defrag->cycled = 1; |
349 | btrfs_requeue_inode_defrag(inode, defrag); | 349 | btrfs_requeue_inode_defrag(inode, defrag); |
350 | } else { | 350 | } else { |
351 | kmem_cache_free(btrfs_inode_defrag_cachep, defrag); | 351 | kmem_cache_free(btrfs_inode_defrag_cachep, defrag); |
352 | } | 352 | } |
353 | 353 | ||
354 | iput(inode); | 354 | iput(inode); |
355 | return 0; | 355 | return 0; |
356 | cleanup: | 356 | cleanup: |
357 | srcu_read_unlock(&fs_info->subvol_srcu, index); | 357 | srcu_read_unlock(&fs_info->subvol_srcu, index); |
358 | kmem_cache_free(btrfs_inode_defrag_cachep, defrag); | 358 | kmem_cache_free(btrfs_inode_defrag_cachep, defrag); |
359 | return ret; | 359 | return ret; |
360 | } | 360 | } |
361 | 361 | ||
362 | /* | 362 | /* |
363 | * run through the list of inodes in the FS that need | 363 | * run through the list of inodes in the FS that need |
364 | * defragging | 364 | * defragging |
365 | */ | 365 | */ |
366 | int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info) | 366 | int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info) |
367 | { | 367 | { |
368 | struct inode_defrag *defrag; | 368 | struct inode_defrag *defrag; |
369 | u64 first_ino = 0; | 369 | u64 first_ino = 0; |
370 | u64 root_objectid = 0; | 370 | u64 root_objectid = 0; |
371 | 371 | ||
372 | atomic_inc(&fs_info->defrag_running); | 372 | atomic_inc(&fs_info->defrag_running); |
373 | while(1) { | 373 | while(1) { |
374 | /* Pause the auto defragger. */ | 374 | /* Pause the auto defragger. */ |
375 | if (test_bit(BTRFS_FS_STATE_REMOUNTING, | 375 | if (test_bit(BTRFS_FS_STATE_REMOUNTING, |
376 | &fs_info->fs_state)) | 376 | &fs_info->fs_state)) |
377 | break; | 377 | break; |
378 | 378 | ||
379 | if (!__need_auto_defrag(fs_info->tree_root)) | 379 | if (!__need_auto_defrag(fs_info->tree_root)) |
380 | break; | 380 | break; |
381 | 381 | ||
382 | /* find an inode to defrag */ | 382 | /* find an inode to defrag */ |
383 | defrag = btrfs_pick_defrag_inode(fs_info, root_objectid, | 383 | defrag = btrfs_pick_defrag_inode(fs_info, root_objectid, |
384 | first_ino); | 384 | first_ino); |
385 | if (!defrag) { | 385 | if (!defrag) { |
386 | if (root_objectid || first_ino) { | 386 | if (root_objectid || first_ino) { |
387 | root_objectid = 0; | 387 | root_objectid = 0; |
388 | first_ino = 0; | 388 | first_ino = 0; |
389 | continue; | 389 | continue; |
390 | } else { | 390 | } else { |
391 | break; | 391 | break; |
392 | } | 392 | } |
393 | } | 393 | } |
394 | 394 | ||
395 | first_ino = defrag->ino + 1; | 395 | first_ino = defrag->ino + 1; |
396 | root_objectid = defrag->root; | 396 | root_objectid = defrag->root; |
397 | 397 | ||
398 | __btrfs_run_defrag_inode(fs_info, defrag); | 398 | __btrfs_run_defrag_inode(fs_info, defrag); |
399 | } | 399 | } |
400 | atomic_dec(&fs_info->defrag_running); | 400 | atomic_dec(&fs_info->defrag_running); |
401 | 401 | ||
402 | /* | 402 | /* |
403 | * during unmount, we use the transaction_wait queue to | 403 | * during unmount, we use the transaction_wait queue to |
404 | * wait for the defragger to stop | 404 | * wait for the defragger to stop |
405 | */ | 405 | */ |
406 | wake_up(&fs_info->transaction_wait); | 406 | wake_up(&fs_info->transaction_wait); |
407 | return 0; | 407 | return 0; |
408 | } | 408 | } |
409 | 409 | ||
410 | /* simple helper to fault in pages and copy. This should go away | 410 | /* simple helper to fault in pages and copy. This should go away |
411 | * and be replaced with calls into generic code. | 411 | * and be replaced with calls into generic code. |
412 | */ | 412 | */ |
413 | static noinline int btrfs_copy_from_user(loff_t pos, int num_pages, | 413 | static noinline int btrfs_copy_from_user(loff_t pos, int num_pages, |
414 | size_t write_bytes, | 414 | size_t write_bytes, |
415 | struct page **prepared_pages, | 415 | struct page **prepared_pages, |
416 | struct iov_iter *i) | 416 | struct iov_iter *i) |
417 | { | 417 | { |
418 | size_t copied = 0; | 418 | size_t copied = 0; |
419 | size_t total_copied = 0; | 419 | size_t total_copied = 0; |
420 | int pg = 0; | 420 | int pg = 0; |
421 | int offset = pos & (PAGE_CACHE_SIZE - 1); | 421 | int offset = pos & (PAGE_CACHE_SIZE - 1); |
422 | 422 | ||
423 | while (write_bytes > 0) { | 423 | while (write_bytes > 0) { |
424 | size_t count = min_t(size_t, | 424 | size_t count = min_t(size_t, |
425 | PAGE_CACHE_SIZE - offset, write_bytes); | 425 | PAGE_CACHE_SIZE - offset, write_bytes); |
426 | struct page *page = prepared_pages[pg]; | 426 | struct page *page = prepared_pages[pg]; |
427 | /* | 427 | /* |
428 | * Copy data from userspace to the current page | 428 | * Copy data from userspace to the current page |
429 | */ | 429 | */ |
430 | copied = iov_iter_copy_from_user_atomic(page, i, offset, count); | 430 | copied = iov_iter_copy_from_user_atomic(page, i, offset, count); |
431 | 431 | ||
432 | /* Flush processor's dcache for this page */ | 432 | /* Flush processor's dcache for this page */ |
433 | flush_dcache_page(page); | 433 | flush_dcache_page(page); |
434 | 434 | ||
435 | /* | 435 | /* |
436 | * if we get a partial write, we can end up with | 436 | * if we get a partial write, we can end up with |
437 | * partially up to date pages. These add | 437 | * partially up to date pages. These add |
438 | * a lot of complexity, so make sure they don't | 438 | * a lot of complexity, so make sure they don't |
439 | * happen by forcing this copy to be retried. | 439 | * happen by forcing this copy to be retried. |
440 | * | 440 | * |
441 | * The rest of the btrfs_file_write code will fall | 441 | * The rest of the btrfs_file_write code will fall |
442 | * back to page at a time copies after we return 0. | 442 | * back to page at a time copies after we return 0. |
443 | */ | 443 | */ |
444 | if (!PageUptodate(page) && copied < count) | 444 | if (!PageUptodate(page) && copied < count) |
445 | copied = 0; | 445 | copied = 0; |
446 | 446 | ||
447 | iov_iter_advance(i, copied); | 447 | iov_iter_advance(i, copied); |
448 | write_bytes -= copied; | 448 | write_bytes -= copied; |
449 | total_copied += copied; | 449 | total_copied += copied; |
450 | 450 | ||
451 | /* Return to btrfs_file_aio_write to fault page */ | 451 | /* Return to btrfs_file_aio_write to fault page */ |
452 | if (unlikely(copied == 0)) | 452 | if (unlikely(copied == 0)) |
453 | break; | 453 | break; |
454 | 454 | ||
455 | if (unlikely(copied < PAGE_CACHE_SIZE - offset)) { | 455 | if (unlikely(copied < PAGE_CACHE_SIZE - offset)) { |
456 | offset += copied; | 456 | offset += copied; |
457 | } else { | 457 | } else { |
458 | pg++; | 458 | pg++; |
459 | offset = 0; | 459 | offset = 0; |
460 | } | 460 | } |
461 | } | 461 | } |
462 | return total_copied; | 462 | return total_copied; |
463 | } | 463 | } |
464 | 464 | ||
465 | /* | 465 | /* |
466 | * unlocks pages after btrfs_file_write is done with them | 466 | * unlocks pages after btrfs_file_write is done with them |
467 | */ | 467 | */ |
468 | static void btrfs_drop_pages(struct page **pages, size_t num_pages) | 468 | static void btrfs_drop_pages(struct page **pages, size_t num_pages) |
469 | { | 469 | { |
470 | size_t i; | 470 | size_t i; |
471 | for (i = 0; i < num_pages; i++) { | 471 | for (i = 0; i < num_pages; i++) { |
472 | /* page checked is some magic around finding pages that | 472 | /* page checked is some magic around finding pages that |
473 | * have been modified without going through btrfs_set_page_dirty | 473 | * have been modified without going through btrfs_set_page_dirty |
474 | * clear it here | 474 | * clear it here. There should be no need to mark the pages |
475 | * accessed as prepare_pages should have marked them accessed | ||
476 | * in prepare_pages via find_or_create_page() | ||
475 | */ | 477 | */ |
476 | ClearPageChecked(pages[i]); | 478 | ClearPageChecked(pages[i]); |
477 | unlock_page(pages[i]); | 479 | unlock_page(pages[i]); |
478 | mark_page_accessed(pages[i]); | ||
479 | page_cache_release(pages[i]); | 480 | page_cache_release(pages[i]); |
480 | } | 481 | } |
481 | } | 482 | } |
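A minimal sketch of the pattern the updated comment in btrfs_drop_pages() relies on: a write path whose pages come from grab_cache_page_write_begin()/find_or_create_page() can leave the accessed marking to the page-cache lookup and only unlock and drop its reference on release. The helper name demo_write_page() and its simplified error handling are hypothetical and not part of this patch.

#include <linux/pagemap.h>

/*
 * Hypothetical illustration: the page is obtained via
 * grab_cache_page_write_begin(), which returns it locked, referenced and
 * already marked accessed, so the release path below needs no
 * mark_page_accessed() call of its own.
 */
static int demo_write_page(struct address_space *mapping, pgoff_t index)
{
        struct page *page;

        page = grab_cache_page_write_begin(mapping, index, 0);
        if (!page)
                return -ENOMEM;

        /* ... copy data into the page here ... */
        SetPageUptodate(page);
        set_page_dirty(page);

        unlock_page(page);              /* page stays in the page cache */
        page_cache_release(page);       /* drop the reference we took */
        return 0;
}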
482 | 483 | ||
483 | /* | 484 | /* |
484 | * after copy_from_user, pages need to be dirtied and we need to make | 485 | * after copy_from_user, pages need to be dirtied and we need to make |
485 | * sure holes are created between the current EOF and the start of | 486 | * sure holes are created between the current EOF and the start of |
486 | * any next extents (if required). | 487 | * any next extents (if required). |
487 | * | 488 | * |
488 | * this also makes the decision about creating an inline extent vs | 489 | * this also makes the decision about creating an inline extent vs |
489 | * doing real data extents, marking pages dirty and delalloc as required. | 490 | * doing real data extents, marking pages dirty and delalloc as required. |
490 | */ | 491 | */ |
491 | int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode, | 492 | int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode, |
492 | struct page **pages, size_t num_pages, | 493 | struct page **pages, size_t num_pages, |
493 | loff_t pos, size_t write_bytes, | 494 | loff_t pos, size_t write_bytes, |
494 | struct extent_state **cached) | 495 | struct extent_state **cached) |
495 | { | 496 | { |
496 | int err = 0; | 497 | int err = 0; |
497 | int i; | 498 | int i; |
498 | u64 num_bytes; | 499 | u64 num_bytes; |
499 | u64 start_pos; | 500 | u64 start_pos; |
500 | u64 end_of_last_block; | 501 | u64 end_of_last_block; |
501 | u64 end_pos = pos + write_bytes; | 502 | u64 end_pos = pos + write_bytes; |
502 | loff_t isize = i_size_read(inode); | 503 | loff_t isize = i_size_read(inode); |
503 | 504 | ||
504 | start_pos = pos & ~((u64)root->sectorsize - 1); | 505 | start_pos = pos & ~((u64)root->sectorsize - 1); |
505 | num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize); | 506 | num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize); |
506 | 507 | ||
507 | end_of_last_block = start_pos + num_bytes - 1; | 508 | end_of_last_block = start_pos + num_bytes - 1; |
508 | err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block, | 509 | err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block, |
509 | cached); | 510 | cached); |
510 | if (err) | 511 | if (err) |
511 | return err; | 512 | return err; |
512 | 513 | ||
513 | for (i = 0; i < num_pages; i++) { | 514 | for (i = 0; i < num_pages; i++) { |
514 | struct page *p = pages[i]; | 515 | struct page *p = pages[i]; |
515 | SetPageUptodate(p); | 516 | SetPageUptodate(p); |
516 | ClearPageChecked(p); | 517 | ClearPageChecked(p); |
517 | set_page_dirty(p); | 518 | set_page_dirty(p); |
518 | } | 519 | } |
519 | 520 | ||
520 | /* | 521 | /* |
521 | * we've only changed i_size in ram, and we haven't updated | 522 | * we've only changed i_size in ram, and we haven't updated |
522 | * the disk i_size. There is no need to log the inode | 523 | * the disk i_size. There is no need to log the inode |
523 | * at this time. | 524 | * at this time. |
524 | */ | 525 | */ |
525 | if (end_pos > isize) | 526 | if (end_pos > isize) |
526 | i_size_write(inode, end_pos); | 527 | i_size_write(inode, end_pos); |
527 | return 0; | 528 | return 0; |
528 | } | 529 | } |
529 | 530 | ||
530 | /* | 531 | /* |
531 | * this drops all the extents in the cache that intersect the range | 532 | * this drops all the extents in the cache that intersect the range |
532 | * [start, end]. Existing extents are split as required. | 533 | * [start, end]. Existing extents are split as required. |
533 | */ | 534 | */ |
534 | void btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end, | 535 | void btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end, |
535 | int skip_pinned) | 536 | int skip_pinned) |
536 | { | 537 | { |
537 | struct extent_map *em; | 538 | struct extent_map *em; |
538 | struct extent_map *split = NULL; | 539 | struct extent_map *split = NULL; |
539 | struct extent_map *split2 = NULL; | 540 | struct extent_map *split2 = NULL; |
540 | struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; | 541 | struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; |
541 | u64 len = end - start + 1; | 542 | u64 len = end - start + 1; |
542 | u64 gen; | 543 | u64 gen; |
543 | int ret; | 544 | int ret; |
544 | int testend = 1; | 545 | int testend = 1; |
545 | unsigned long flags; | 546 | unsigned long flags; |
546 | int compressed = 0; | 547 | int compressed = 0; |
547 | bool modified; | 548 | bool modified; |
548 | 549 | ||
549 | WARN_ON(end < start); | 550 | WARN_ON(end < start); |
550 | if (end == (u64)-1) { | 551 | if (end == (u64)-1) { |
551 | len = (u64)-1; | 552 | len = (u64)-1; |
552 | testend = 0; | 553 | testend = 0; |
553 | } | 554 | } |
554 | while (1) { | 555 | while (1) { |
555 | int no_splits = 0; | 556 | int no_splits = 0; |
556 | 557 | ||
557 | modified = false; | 558 | modified = false; |
558 | if (!split) | 559 | if (!split) |
559 | split = alloc_extent_map(); | 560 | split = alloc_extent_map(); |
560 | if (!split2) | 561 | if (!split2) |
561 | split2 = alloc_extent_map(); | 562 | split2 = alloc_extent_map(); |
562 | if (!split || !split2) | 563 | if (!split || !split2) |
563 | no_splits = 1; | 564 | no_splits = 1; |
564 | 565 | ||
565 | write_lock(&em_tree->lock); | 566 | write_lock(&em_tree->lock); |
566 | em = lookup_extent_mapping(em_tree, start, len); | 567 | em = lookup_extent_mapping(em_tree, start, len); |
567 | if (!em) { | 568 | if (!em) { |
568 | write_unlock(&em_tree->lock); | 569 | write_unlock(&em_tree->lock); |
569 | break; | 570 | break; |
570 | } | 571 | } |
571 | flags = em->flags; | 572 | flags = em->flags; |
572 | gen = em->generation; | 573 | gen = em->generation; |
573 | if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) { | 574 | if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) { |
574 | if (testend && em->start + em->len >= start + len) { | 575 | if (testend && em->start + em->len >= start + len) { |
575 | free_extent_map(em); | 576 | free_extent_map(em); |
576 | write_unlock(&em_tree->lock); | 577 | write_unlock(&em_tree->lock); |
577 | break; | 578 | break; |
578 | } | 579 | } |
579 | start = em->start + em->len; | 580 | start = em->start + em->len; |
580 | if (testend) | 581 | if (testend) |
581 | len = start + len - (em->start + em->len); | 582 | len = start + len - (em->start + em->len); |
582 | free_extent_map(em); | 583 | free_extent_map(em); |
583 | write_unlock(&em_tree->lock); | 584 | write_unlock(&em_tree->lock); |
584 | continue; | 585 | continue; |
585 | } | 586 | } |
586 | compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags); | 587 | compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags); |
587 | clear_bit(EXTENT_FLAG_PINNED, &em->flags); | 588 | clear_bit(EXTENT_FLAG_PINNED, &em->flags); |
588 | clear_bit(EXTENT_FLAG_LOGGING, &flags); | 589 | clear_bit(EXTENT_FLAG_LOGGING, &flags); |
589 | modified = !list_empty(&em->list); | 590 | modified = !list_empty(&em->list); |
590 | remove_extent_mapping(em_tree, em); | 591 | remove_extent_mapping(em_tree, em); |
591 | if (no_splits) | 592 | if (no_splits) |
592 | goto next; | 593 | goto next; |
593 | 594 | ||
594 | if (em->start < start) { | 595 | if (em->start < start) { |
595 | split->start = em->start; | 596 | split->start = em->start; |
596 | split->len = start - em->start; | 597 | split->len = start - em->start; |
597 | 598 | ||
598 | if (em->block_start < EXTENT_MAP_LAST_BYTE) { | 599 | if (em->block_start < EXTENT_MAP_LAST_BYTE) { |
599 | split->orig_start = em->orig_start; | 600 | split->orig_start = em->orig_start; |
600 | split->block_start = em->block_start; | 601 | split->block_start = em->block_start; |
601 | 602 | ||
602 | if (compressed) | 603 | if (compressed) |
603 | split->block_len = em->block_len; | 604 | split->block_len = em->block_len; |
604 | else | 605 | else |
605 | split->block_len = split->len; | 606 | split->block_len = split->len; |
606 | split->orig_block_len = max(split->block_len, | 607 | split->orig_block_len = max(split->block_len, |
607 | em->orig_block_len); | 608 | em->orig_block_len); |
608 | split->ram_bytes = em->ram_bytes; | 609 | split->ram_bytes = em->ram_bytes; |
609 | } else { | 610 | } else { |
610 | split->orig_start = split->start; | 611 | split->orig_start = split->start; |
611 | split->block_len = 0; | 612 | split->block_len = 0; |
612 | split->block_start = em->block_start; | 613 | split->block_start = em->block_start; |
613 | split->orig_block_len = 0; | 614 | split->orig_block_len = 0; |
614 | split->ram_bytes = split->len; | 615 | split->ram_bytes = split->len; |
615 | } | 616 | } |
616 | 617 | ||
617 | split->generation = gen; | 618 | split->generation = gen; |
618 | split->bdev = em->bdev; | 619 | split->bdev = em->bdev; |
619 | split->flags = flags; | 620 | split->flags = flags; |
620 | split->compress_type = em->compress_type; | 621 | split->compress_type = em->compress_type; |
621 | ret = add_extent_mapping(em_tree, split, modified); | 622 | ret = add_extent_mapping(em_tree, split, modified); |
622 | BUG_ON(ret); /* Logic error */ | 623 | BUG_ON(ret); /* Logic error */ |
623 | free_extent_map(split); | 624 | free_extent_map(split); |
624 | split = split2; | 625 | split = split2; |
625 | split2 = NULL; | 626 | split2 = NULL; |
626 | } | 627 | } |
627 | if (testend && em->start + em->len > start + len) { | 628 | if (testend && em->start + em->len > start + len) { |
628 | u64 diff = start + len - em->start; | 629 | u64 diff = start + len - em->start; |
629 | 630 | ||
630 | split->start = start + len; | 631 | split->start = start + len; |
631 | split->len = em->start + em->len - (start + len); | 632 | split->len = em->start + em->len - (start + len); |
632 | split->bdev = em->bdev; | 633 | split->bdev = em->bdev; |
633 | split->flags = flags; | 634 | split->flags = flags; |
634 | split->compress_type = em->compress_type; | 635 | split->compress_type = em->compress_type; |
635 | split->generation = gen; | 636 | split->generation = gen; |
636 | 637 | ||
637 | if (em->block_start < EXTENT_MAP_LAST_BYTE) { | 638 | if (em->block_start < EXTENT_MAP_LAST_BYTE) { |
638 | split->orig_block_len = max(em->block_len, | 639 | split->orig_block_len = max(em->block_len, |
639 | em->orig_block_len); | 640 | em->orig_block_len); |
640 | 641 | ||
641 | split->ram_bytes = em->ram_bytes; | 642 | split->ram_bytes = em->ram_bytes; |
642 | if (compressed) { | 643 | if (compressed) { |
643 | split->block_len = em->block_len; | 644 | split->block_len = em->block_len; |
644 | split->block_start = em->block_start; | 645 | split->block_start = em->block_start; |
645 | split->orig_start = em->orig_start; | 646 | split->orig_start = em->orig_start; |
646 | } else { | 647 | } else { |
647 | split->block_len = split->len; | 648 | split->block_len = split->len; |
648 | split->block_start = em->block_start | 649 | split->block_start = em->block_start |
649 | + diff; | 650 | + diff; |
650 | split->orig_start = em->orig_start; | 651 | split->orig_start = em->orig_start; |
651 | } | 652 | } |
652 | } else { | 653 | } else { |
653 | split->ram_bytes = split->len; | 654 | split->ram_bytes = split->len; |
654 | split->orig_start = split->start; | 655 | split->orig_start = split->start; |
655 | split->block_len = 0; | 656 | split->block_len = 0; |
656 | split->block_start = em->block_start; | 657 | split->block_start = em->block_start; |
657 | split->orig_block_len = 0; | 658 | split->orig_block_len = 0; |
658 | } | 659 | } |
659 | 660 | ||
660 | ret = add_extent_mapping(em_tree, split, modified); | 661 | ret = add_extent_mapping(em_tree, split, modified); |
661 | BUG_ON(ret); /* Logic error */ | 662 | BUG_ON(ret); /* Logic error */ |
662 | free_extent_map(split); | 663 | free_extent_map(split); |
663 | split = NULL; | 664 | split = NULL; |
664 | } | 665 | } |
665 | next: | 666 | next: |
666 | write_unlock(&em_tree->lock); | 667 | write_unlock(&em_tree->lock); |
667 | 668 | ||
668 | /* once for us */ | 669 | /* once for us */ |
669 | free_extent_map(em); | 670 | free_extent_map(em); |
670 | /* once for the tree*/ | 671 | /* once for the tree*/ |
671 | free_extent_map(em); | 672 | free_extent_map(em); |
672 | } | 673 | } |
673 | if (split) | 674 | if (split) |
674 | free_extent_map(split); | 675 | free_extent_map(split); |
675 | if (split2) | 676 | if (split2) |
676 | free_extent_map(split2); | 677 | free_extent_map(split2); |
677 | } | 678 | } |
678 | 679 | ||
679 | /* | 680 | /* |
680 | * this is very complex, but the basic idea is to drop all extents | 681 | * this is very complex, but the basic idea is to drop all extents |
681 | * in the range start - end. hint_block is filled in with a block number | 682 | * in the range start - end. hint_block is filled in with a block number |
682 | * that would be a good hint to the block allocator for this file. | 683 | * that would be a good hint to the block allocator for this file. |
683 | * | 684 | * |
684 | * If an extent intersects the range but is not entirely inside the range | 685 | * If an extent intersects the range but is not entirely inside the range |
685 | * it is either truncated or split. Anything entirely inside the range | 686 | * it is either truncated or split. Anything entirely inside the range |
686 | * is deleted from the tree. | 687 | * is deleted from the tree. |
687 | */ | 688 | */ |
688 | int __btrfs_drop_extents(struct btrfs_trans_handle *trans, | 689 | int __btrfs_drop_extents(struct btrfs_trans_handle *trans, |
689 | struct btrfs_root *root, struct inode *inode, | 690 | struct btrfs_root *root, struct inode *inode, |
690 | struct btrfs_path *path, u64 start, u64 end, | 691 | struct btrfs_path *path, u64 start, u64 end, |
691 | u64 *drop_end, int drop_cache) | 692 | u64 *drop_end, int drop_cache) |
692 | { | 693 | { |
693 | struct extent_buffer *leaf; | 694 | struct extent_buffer *leaf; |
694 | struct btrfs_file_extent_item *fi; | 695 | struct btrfs_file_extent_item *fi; |
695 | struct btrfs_key key; | 696 | struct btrfs_key key; |
696 | struct btrfs_key new_key; | 697 | struct btrfs_key new_key; |
697 | u64 ino = btrfs_ino(inode); | 698 | u64 ino = btrfs_ino(inode); |
698 | u64 search_start = start; | 699 | u64 search_start = start; |
699 | u64 disk_bytenr = 0; | 700 | u64 disk_bytenr = 0; |
700 | u64 num_bytes = 0; | 701 | u64 num_bytes = 0; |
701 | u64 extent_offset = 0; | 702 | u64 extent_offset = 0; |
702 | u64 extent_end = 0; | 703 | u64 extent_end = 0; |
703 | int del_nr = 0; | 704 | int del_nr = 0; |
704 | int del_slot = 0; | 705 | int del_slot = 0; |
705 | int extent_type; | 706 | int extent_type; |
706 | int recow; | 707 | int recow; |
707 | int ret; | 708 | int ret; |
708 | int modify_tree = -1; | 709 | int modify_tree = -1; |
709 | int update_refs = (root->ref_cows || root == root->fs_info->tree_root); | 710 | int update_refs = (root->ref_cows || root == root->fs_info->tree_root); |
710 | int found = 0; | 711 | int found = 0; |
711 | 712 | ||
712 | if (drop_cache) | 713 | if (drop_cache) |
713 | btrfs_drop_extent_cache(inode, start, end - 1, 0); | 714 | btrfs_drop_extent_cache(inode, start, end - 1, 0); |
714 | 715 | ||
715 | if (start >= BTRFS_I(inode)->disk_i_size) | 716 | if (start >= BTRFS_I(inode)->disk_i_size) |
716 | modify_tree = 0; | 717 | modify_tree = 0; |
717 | 718 | ||
718 | while (1) { | 719 | while (1) { |
719 | recow = 0; | 720 | recow = 0; |
720 | ret = btrfs_lookup_file_extent(trans, root, path, ino, | 721 | ret = btrfs_lookup_file_extent(trans, root, path, ino, |
721 | search_start, modify_tree); | 722 | search_start, modify_tree); |
722 | if (ret < 0) | 723 | if (ret < 0) |
723 | break; | 724 | break; |
724 | if (ret > 0 && path->slots[0] > 0 && search_start == start) { | 725 | if (ret > 0 && path->slots[0] > 0 && search_start == start) { |
725 | leaf = path->nodes[0]; | 726 | leaf = path->nodes[0]; |
726 | btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1); | 727 | btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1); |
727 | if (key.objectid == ino && | 728 | if (key.objectid == ino && |
728 | key.type == BTRFS_EXTENT_DATA_KEY) | 729 | key.type == BTRFS_EXTENT_DATA_KEY) |
729 | path->slots[0]--; | 730 | path->slots[0]--; |
730 | } | 731 | } |
731 | ret = 0; | 732 | ret = 0; |
732 | next_slot: | 733 | next_slot: |
733 | leaf = path->nodes[0]; | 734 | leaf = path->nodes[0]; |
734 | if (path->slots[0] >= btrfs_header_nritems(leaf)) { | 735 | if (path->slots[0] >= btrfs_header_nritems(leaf)) { |
735 | BUG_ON(del_nr > 0); | 736 | BUG_ON(del_nr > 0); |
736 | ret = btrfs_next_leaf(root, path); | 737 | ret = btrfs_next_leaf(root, path); |
737 | if (ret < 0) | 738 | if (ret < 0) |
738 | break; | 739 | break; |
739 | if (ret > 0) { | 740 | if (ret > 0) { |
740 | ret = 0; | 741 | ret = 0; |
741 | break; | 742 | break; |
742 | } | 743 | } |
743 | leaf = path->nodes[0]; | 744 | leaf = path->nodes[0]; |
744 | recow = 1; | 745 | recow = 1; |
745 | } | 746 | } |
746 | 747 | ||
747 | btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); | 748 | btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); |
748 | if (key.objectid > ino || | 749 | if (key.objectid > ino || |
749 | key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end) | 750 | key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end) |
750 | break; | 751 | break; |
751 | 752 | ||
752 | fi = btrfs_item_ptr(leaf, path->slots[0], | 753 | fi = btrfs_item_ptr(leaf, path->slots[0], |
753 | struct btrfs_file_extent_item); | 754 | struct btrfs_file_extent_item); |
754 | extent_type = btrfs_file_extent_type(leaf, fi); | 755 | extent_type = btrfs_file_extent_type(leaf, fi); |
755 | 756 | ||
756 | if (extent_type == BTRFS_FILE_EXTENT_REG || | 757 | if (extent_type == BTRFS_FILE_EXTENT_REG || |
757 | extent_type == BTRFS_FILE_EXTENT_PREALLOC) { | 758 | extent_type == BTRFS_FILE_EXTENT_PREALLOC) { |
758 | disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi); | 759 | disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi); |
759 | num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi); | 760 | num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi); |
760 | extent_offset = btrfs_file_extent_offset(leaf, fi); | 761 | extent_offset = btrfs_file_extent_offset(leaf, fi); |
761 | extent_end = key.offset + | 762 | extent_end = key.offset + |
762 | btrfs_file_extent_num_bytes(leaf, fi); | 763 | btrfs_file_extent_num_bytes(leaf, fi); |
763 | } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { | 764 | } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { |
764 | extent_end = key.offset + | 765 | extent_end = key.offset + |
765 | btrfs_file_extent_inline_len(leaf, fi); | 766 | btrfs_file_extent_inline_len(leaf, fi); |
766 | } else { | 767 | } else { |
767 | WARN_ON(1); | 768 | WARN_ON(1); |
768 | extent_end = search_start; | 769 | extent_end = search_start; |
769 | } | 770 | } |
770 | 771 | ||
771 | if (extent_end <= search_start) { | 772 | if (extent_end <= search_start) { |
772 | path->slots[0]++; | 773 | path->slots[0]++; |
773 | goto next_slot; | 774 | goto next_slot; |
774 | } | 775 | } |
775 | 776 | ||
776 | found = 1; | 777 | found = 1; |
777 | search_start = max(key.offset, start); | 778 | search_start = max(key.offset, start); |
778 | if (recow || !modify_tree) { | 779 | if (recow || !modify_tree) { |
779 | modify_tree = -1; | 780 | modify_tree = -1; |
780 | btrfs_release_path(path); | 781 | btrfs_release_path(path); |
781 | continue; | 782 | continue; |
782 | } | 783 | } |
783 | 784 | ||
784 | /* | 785 | /* |
785 | * | - range to drop - | | 786 | * | - range to drop - | |
786 | * | -------- extent -------- | | 787 | * | -------- extent -------- | |
787 | */ | 788 | */ |
788 | if (start > key.offset && end < extent_end) { | 789 | if (start > key.offset && end < extent_end) { |
789 | BUG_ON(del_nr > 0); | 790 | BUG_ON(del_nr > 0); |
790 | BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE); | 791 | BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE); |
791 | 792 | ||
792 | memcpy(&new_key, &key, sizeof(new_key)); | 793 | memcpy(&new_key, &key, sizeof(new_key)); |
793 | new_key.offset = start; | 794 | new_key.offset = start; |
794 | ret = btrfs_duplicate_item(trans, root, path, | 795 | ret = btrfs_duplicate_item(trans, root, path, |
795 | &new_key); | 796 | &new_key); |
796 | if (ret == -EAGAIN) { | 797 | if (ret == -EAGAIN) { |
797 | btrfs_release_path(path); | 798 | btrfs_release_path(path); |
798 | continue; | 799 | continue; |
799 | } | 800 | } |
800 | if (ret < 0) | 801 | if (ret < 0) |
801 | break; | 802 | break; |
802 | 803 | ||
803 | leaf = path->nodes[0]; | 804 | leaf = path->nodes[0]; |
804 | fi = btrfs_item_ptr(leaf, path->slots[0] - 1, | 805 | fi = btrfs_item_ptr(leaf, path->slots[0] - 1, |
805 | struct btrfs_file_extent_item); | 806 | struct btrfs_file_extent_item); |
806 | btrfs_set_file_extent_num_bytes(leaf, fi, | 807 | btrfs_set_file_extent_num_bytes(leaf, fi, |
807 | start - key.offset); | 808 | start - key.offset); |
808 | 809 | ||
809 | fi = btrfs_item_ptr(leaf, path->slots[0], | 810 | fi = btrfs_item_ptr(leaf, path->slots[0], |
810 | struct btrfs_file_extent_item); | 811 | struct btrfs_file_extent_item); |
811 | 812 | ||
812 | extent_offset += start - key.offset; | 813 | extent_offset += start - key.offset; |
813 | btrfs_set_file_extent_offset(leaf, fi, extent_offset); | 814 | btrfs_set_file_extent_offset(leaf, fi, extent_offset); |
814 | btrfs_set_file_extent_num_bytes(leaf, fi, | 815 | btrfs_set_file_extent_num_bytes(leaf, fi, |
815 | extent_end - start); | 816 | extent_end - start); |
816 | btrfs_mark_buffer_dirty(leaf); | 817 | btrfs_mark_buffer_dirty(leaf); |
817 | 818 | ||
818 | if (update_refs && disk_bytenr > 0) { | 819 | if (update_refs && disk_bytenr > 0) { |
819 | ret = btrfs_inc_extent_ref(trans, root, | 820 | ret = btrfs_inc_extent_ref(trans, root, |
820 | disk_bytenr, num_bytes, 0, | 821 | disk_bytenr, num_bytes, 0, |
821 | root->root_key.objectid, | 822 | root->root_key.objectid, |
822 | new_key.objectid, | 823 | new_key.objectid, |
823 | start - extent_offset, 0); | 824 | start - extent_offset, 0); |
824 | BUG_ON(ret); /* -ENOMEM */ | 825 | BUG_ON(ret); /* -ENOMEM */ |
825 | } | 826 | } |
826 | key.offset = start; | 827 | key.offset = start; |
827 | } | 828 | } |
828 | /* | 829 | /* |
829 | * | ---- range to drop ----- | | 830 | * | ---- range to drop ----- | |
830 | * | -------- extent -------- | | 831 | * | -------- extent -------- | |
831 | */ | 832 | */ |
832 | if (start <= key.offset && end < extent_end) { | 833 | if (start <= key.offset && end < extent_end) { |
833 | BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE); | 834 | BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE); |
834 | 835 | ||
835 | memcpy(&new_key, &key, sizeof(new_key)); | 836 | memcpy(&new_key, &key, sizeof(new_key)); |
836 | new_key.offset = end; | 837 | new_key.offset = end; |
837 | btrfs_set_item_key_safe(root, path, &new_key); | 838 | btrfs_set_item_key_safe(root, path, &new_key); |
838 | 839 | ||
839 | extent_offset += end - key.offset; | 840 | extent_offset += end - key.offset; |
840 | btrfs_set_file_extent_offset(leaf, fi, extent_offset); | 841 | btrfs_set_file_extent_offset(leaf, fi, extent_offset); |
841 | btrfs_set_file_extent_num_bytes(leaf, fi, | 842 | btrfs_set_file_extent_num_bytes(leaf, fi, |
842 | extent_end - end); | 843 | extent_end - end); |
843 | btrfs_mark_buffer_dirty(leaf); | 844 | btrfs_mark_buffer_dirty(leaf); |
844 | if (update_refs && disk_bytenr > 0) | 845 | if (update_refs && disk_bytenr > 0) |
845 | inode_sub_bytes(inode, end - key.offset); | 846 | inode_sub_bytes(inode, end - key.offset); |
846 | break; | 847 | break; |
847 | } | 848 | } |
848 | 849 | ||
849 | search_start = extent_end; | 850 | search_start = extent_end; |
850 | /* | 851 | /* |
851 | * | ---- range to drop ----- | | 852 | * | ---- range to drop ----- | |
852 | * | -------- extent -------- | | 853 | * | -------- extent -------- | |
853 | */ | 854 | */ |
854 | if (start > key.offset && end >= extent_end) { | 855 | if (start > key.offset && end >= extent_end) { |
855 | BUG_ON(del_nr > 0); | 856 | BUG_ON(del_nr > 0); |
856 | BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE); | 857 | BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE); |
857 | 858 | ||
858 | btrfs_set_file_extent_num_bytes(leaf, fi, | 859 | btrfs_set_file_extent_num_bytes(leaf, fi, |
859 | start - key.offset); | 860 | start - key.offset); |
860 | btrfs_mark_buffer_dirty(leaf); | 861 | btrfs_mark_buffer_dirty(leaf); |
861 | if (update_refs && disk_bytenr > 0) | 862 | if (update_refs && disk_bytenr > 0) |
862 | inode_sub_bytes(inode, extent_end - start); | 863 | inode_sub_bytes(inode, extent_end - start); |
863 | if (end == extent_end) | 864 | if (end == extent_end) |
864 | break; | 865 | break; |
865 | 866 | ||
866 | path->slots[0]++; | 867 | path->slots[0]++; |
867 | goto next_slot; | 868 | goto next_slot; |
868 | } | 869 | } |
869 | 870 | ||
870 | /* | 871 | /* |
871 | * | ---- range to drop ----- | | 872 | * | ---- range to drop ----- | |
872 | * | ------ extent ------ | | 873 | * | ------ extent ------ | |
873 | */ | 874 | */ |
874 | if (start <= key.offset && end >= extent_end) { | 875 | if (start <= key.offset && end >= extent_end) { |
875 | if (del_nr == 0) { | 876 | if (del_nr == 0) { |
876 | del_slot = path->slots[0]; | 877 | del_slot = path->slots[0]; |
877 | del_nr = 1; | 878 | del_nr = 1; |
878 | } else { | 879 | } else { |
879 | BUG_ON(del_slot + del_nr != path->slots[0]); | 880 | BUG_ON(del_slot + del_nr != path->slots[0]); |
880 | del_nr++; | 881 | del_nr++; |
881 | } | 882 | } |
882 | 883 | ||
883 | if (update_refs && | 884 | if (update_refs && |
884 | extent_type == BTRFS_FILE_EXTENT_INLINE) { | 885 | extent_type == BTRFS_FILE_EXTENT_INLINE) { |
885 | inode_sub_bytes(inode, | 886 | inode_sub_bytes(inode, |
886 | extent_end - key.offset); | 887 | extent_end - key.offset); |
887 | extent_end = ALIGN(extent_end, | 888 | extent_end = ALIGN(extent_end, |
888 | root->sectorsize); | 889 | root->sectorsize); |
889 | } else if (update_refs && disk_bytenr > 0) { | 890 | } else if (update_refs && disk_bytenr > 0) { |
890 | ret = btrfs_free_extent(trans, root, | 891 | ret = btrfs_free_extent(trans, root, |
891 | disk_bytenr, num_bytes, 0, | 892 | disk_bytenr, num_bytes, 0, |
892 | root->root_key.objectid, | 893 | root->root_key.objectid, |
893 | key.objectid, key.offset - | 894 | key.objectid, key.offset - |
894 | extent_offset, 0); | 895 | extent_offset, 0); |
895 | BUG_ON(ret); /* -ENOMEM */ | 896 | BUG_ON(ret); /* -ENOMEM */ |
896 | inode_sub_bytes(inode, | 897 | inode_sub_bytes(inode, |
897 | extent_end - key.offset); | 898 | extent_end - key.offset); |
898 | } | 899 | } |
899 | 900 | ||
900 | if (end == extent_end) | 901 | if (end == extent_end) |
901 | break; | 902 | break; |
902 | 903 | ||
903 | if (path->slots[0] + 1 < btrfs_header_nritems(leaf)) { | 904 | if (path->slots[0] + 1 < btrfs_header_nritems(leaf)) { |
904 | path->slots[0]++; | 905 | path->slots[0]++; |
905 | goto next_slot; | 906 | goto next_slot; |
906 | } | 907 | } |
907 | 908 | ||
908 | ret = btrfs_del_items(trans, root, path, del_slot, | 909 | ret = btrfs_del_items(trans, root, path, del_slot, |
909 | del_nr); | 910 | del_nr); |
910 | if (ret) { | 911 | if (ret) { |
911 | btrfs_abort_transaction(trans, root, ret); | 912 | btrfs_abort_transaction(trans, root, ret); |
912 | break; | 913 | break; |
913 | } | 914 | } |
914 | 915 | ||
915 | del_nr = 0; | 916 | del_nr = 0; |
916 | del_slot = 0; | 917 | del_slot = 0; |
917 | 918 | ||
918 | btrfs_release_path(path); | 919 | btrfs_release_path(path); |
919 | continue; | 920 | continue; |
920 | } | 921 | } |
921 | 922 | ||
922 | BUG_ON(1); | 923 | BUG_ON(1); |
923 | } | 924 | } |
924 | 925 | ||
925 | if (!ret && del_nr > 0) { | 926 | if (!ret && del_nr > 0) { |
926 | ret = btrfs_del_items(trans, root, path, del_slot, del_nr); | 927 | ret = btrfs_del_items(trans, root, path, del_slot, del_nr); |
927 | if (ret) | 928 | if (ret) |
928 | btrfs_abort_transaction(trans, root, ret); | 929 | btrfs_abort_transaction(trans, root, ret); |
929 | } | 930 | } |
930 | 931 | ||
931 | if (drop_end) | 932 | if (drop_end) |
932 | *drop_end = found ? min(end, extent_end) : end; | 933 | *drop_end = found ? min(end, extent_end) : end; |
933 | btrfs_release_path(path); | 934 | btrfs_release_path(path); |
934 | return ret; | 935 | return ret; |
935 | } | 936 | } |
936 | 937 | ||
937 | int btrfs_drop_extents(struct btrfs_trans_handle *trans, | 938 | int btrfs_drop_extents(struct btrfs_trans_handle *trans, |
938 | struct btrfs_root *root, struct inode *inode, u64 start, | 939 | struct btrfs_root *root, struct inode *inode, u64 start, |
939 | u64 end, int drop_cache) | 940 | u64 end, int drop_cache) |
940 | { | 941 | { |
941 | struct btrfs_path *path; | 942 | struct btrfs_path *path; |
942 | int ret; | 943 | int ret; |
943 | 944 | ||
944 | path = btrfs_alloc_path(); | 945 | path = btrfs_alloc_path(); |
945 | if (!path) | 946 | if (!path) |
946 | return -ENOMEM; | 947 | return -ENOMEM; |
947 | ret = __btrfs_drop_extents(trans, root, inode, path, start, end, NULL, | 948 | ret = __btrfs_drop_extents(trans, root, inode, path, start, end, NULL, |
948 | drop_cache); | 949 | drop_cache); |
949 | btrfs_free_path(path); | 950 | btrfs_free_path(path); |
950 | return ret; | 951 | return ret; |
951 | } | 952 | } |
952 | 953 | ||
953 | static int extent_mergeable(struct extent_buffer *leaf, int slot, | 954 | static int extent_mergeable(struct extent_buffer *leaf, int slot, |
954 | u64 objectid, u64 bytenr, u64 orig_offset, | 955 | u64 objectid, u64 bytenr, u64 orig_offset, |
955 | u64 *start, u64 *end) | 956 | u64 *start, u64 *end) |
956 | { | 957 | { |
957 | struct btrfs_file_extent_item *fi; | 958 | struct btrfs_file_extent_item *fi; |
958 | struct btrfs_key key; | 959 | struct btrfs_key key; |
959 | u64 extent_end; | 960 | u64 extent_end; |
960 | 961 | ||
961 | if (slot < 0 || slot >= btrfs_header_nritems(leaf)) | 962 | if (slot < 0 || slot >= btrfs_header_nritems(leaf)) |
962 | return 0; | 963 | return 0; |
963 | 964 | ||
964 | btrfs_item_key_to_cpu(leaf, &key, slot); | 965 | btrfs_item_key_to_cpu(leaf, &key, slot); |
965 | if (key.objectid != objectid || key.type != BTRFS_EXTENT_DATA_KEY) | 966 | if (key.objectid != objectid || key.type != BTRFS_EXTENT_DATA_KEY) |
966 | return 0; | 967 | return 0; |
967 | 968 | ||
968 | fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item); | 969 | fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item); |
969 | if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG || | 970 | if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG || |
970 | btrfs_file_extent_disk_bytenr(leaf, fi) != bytenr || | 971 | btrfs_file_extent_disk_bytenr(leaf, fi) != bytenr || |
971 | btrfs_file_extent_offset(leaf, fi) != key.offset - orig_offset || | 972 | btrfs_file_extent_offset(leaf, fi) != key.offset - orig_offset || |
972 | btrfs_file_extent_compression(leaf, fi) || | 973 | btrfs_file_extent_compression(leaf, fi) || |
973 | btrfs_file_extent_encryption(leaf, fi) || | 974 | btrfs_file_extent_encryption(leaf, fi) || |
974 | btrfs_file_extent_other_encoding(leaf, fi)) | 975 | btrfs_file_extent_other_encoding(leaf, fi)) |
975 | return 0; | 976 | return 0; |
976 | 977 | ||
977 | extent_end = key.offset + btrfs_file_extent_num_bytes(leaf, fi); | 978 | extent_end = key.offset + btrfs_file_extent_num_bytes(leaf, fi); |
978 | if ((*start && *start != key.offset) || (*end && *end != extent_end)) | 979 | if ((*start && *start != key.offset) || (*end && *end != extent_end)) |
979 | return 0; | 980 | return 0; |
980 | 981 | ||
981 | *start = key.offset; | 982 | *start = key.offset; |
982 | *end = extent_end; | 983 | *end = extent_end; |
983 | return 1; | 984 | return 1; |
984 | } | 985 | } |
985 | 986 | ||
986 | /* | 987 | /* |
987 | * Mark extent in the range start - end as written. | 988 | * Mark extent in the range start - end as written. |
988 | * | 989 | * |
989 | * This changes extent type from 'pre-allocated' to 'regular'. If only | 990 | * This changes extent type from 'pre-allocated' to 'regular'. If only |
990 | * part of extent is marked as written, the extent will be split into | 991 | * part of extent is marked as written, the extent will be split into |
991 | * two or three. | 992 | * two or three. |
992 | */ | 993 | */ |
993 | int btrfs_mark_extent_written(struct btrfs_trans_handle *trans, | 994 | int btrfs_mark_extent_written(struct btrfs_trans_handle *trans, |
994 | struct inode *inode, u64 start, u64 end) | 995 | struct inode *inode, u64 start, u64 end) |
995 | { | 996 | { |
996 | struct btrfs_root *root = BTRFS_I(inode)->root; | 997 | struct btrfs_root *root = BTRFS_I(inode)->root; |
997 | struct extent_buffer *leaf; | 998 | struct extent_buffer *leaf; |
998 | struct btrfs_path *path; | 999 | struct btrfs_path *path; |
999 | struct btrfs_file_extent_item *fi; | 1000 | struct btrfs_file_extent_item *fi; |
1000 | struct btrfs_key key; | 1001 | struct btrfs_key key; |
1001 | struct btrfs_key new_key; | 1002 | struct btrfs_key new_key; |
1002 | u64 bytenr; | 1003 | u64 bytenr; |
1003 | u64 num_bytes; | 1004 | u64 num_bytes; |
1004 | u64 extent_end; | 1005 | u64 extent_end; |
1005 | u64 orig_offset; | 1006 | u64 orig_offset; |
1006 | u64 other_start; | 1007 | u64 other_start; |
1007 | u64 other_end; | 1008 | u64 other_end; |
1008 | u64 split; | 1009 | u64 split; |
1009 | int del_nr = 0; | 1010 | int del_nr = 0; |
1010 | int del_slot = 0; | 1011 | int del_slot = 0; |
1011 | int recow; | 1012 | int recow; |
1012 | int ret; | 1013 | int ret; |
1013 | u64 ino = btrfs_ino(inode); | 1014 | u64 ino = btrfs_ino(inode); |
1014 | 1015 | ||
1015 | path = btrfs_alloc_path(); | 1016 | path = btrfs_alloc_path(); |
1016 | if (!path) | 1017 | if (!path) |
1017 | return -ENOMEM; | 1018 | return -ENOMEM; |
1018 | again: | 1019 | again: |
1019 | recow = 0; | 1020 | recow = 0; |
1020 | split = start; | 1021 | split = start; |
1021 | key.objectid = ino; | 1022 | key.objectid = ino; |
1022 | key.type = BTRFS_EXTENT_DATA_KEY; | 1023 | key.type = BTRFS_EXTENT_DATA_KEY; |
1023 | key.offset = split; | 1024 | key.offset = split; |
1024 | 1025 | ||
1025 | ret = btrfs_search_slot(trans, root, &key, path, -1, 1); | 1026 | ret = btrfs_search_slot(trans, root, &key, path, -1, 1); |
1026 | if (ret < 0) | 1027 | if (ret < 0) |
1027 | goto out; | 1028 | goto out; |
1028 | if (ret > 0 && path->slots[0] > 0) | 1029 | if (ret > 0 && path->slots[0] > 0) |
1029 | path->slots[0]--; | 1030 | path->slots[0]--; |
1030 | 1031 | ||
1031 | leaf = path->nodes[0]; | 1032 | leaf = path->nodes[0]; |
1032 | btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); | 1033 | btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); |
1033 | BUG_ON(key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY); | 1034 | BUG_ON(key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY); |
1034 | fi = btrfs_item_ptr(leaf, path->slots[0], | 1035 | fi = btrfs_item_ptr(leaf, path->slots[0], |
1035 | struct btrfs_file_extent_item); | 1036 | struct btrfs_file_extent_item); |
1036 | BUG_ON(btrfs_file_extent_type(leaf, fi) != | 1037 | BUG_ON(btrfs_file_extent_type(leaf, fi) != |
1037 | BTRFS_FILE_EXTENT_PREALLOC); | 1038 | BTRFS_FILE_EXTENT_PREALLOC); |
1038 | extent_end = key.offset + btrfs_file_extent_num_bytes(leaf, fi); | 1039 | extent_end = key.offset + btrfs_file_extent_num_bytes(leaf, fi); |
1039 | BUG_ON(key.offset > start || extent_end < end); | 1040 | BUG_ON(key.offset > start || extent_end < end); |
1040 | 1041 | ||
1041 | bytenr = btrfs_file_extent_disk_bytenr(leaf, fi); | 1042 | bytenr = btrfs_file_extent_disk_bytenr(leaf, fi); |
1042 | num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi); | 1043 | num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi); |
1043 | orig_offset = key.offset - btrfs_file_extent_offset(leaf, fi); | 1044 | orig_offset = key.offset - btrfs_file_extent_offset(leaf, fi); |
1044 | memcpy(&new_key, &key, sizeof(new_key)); | 1045 | memcpy(&new_key, &key, sizeof(new_key)); |
1045 | 1046 | ||
1046 | if (start == key.offset && end < extent_end) { | 1047 | if (start == key.offset && end < extent_end) { |
1047 | other_start = 0; | 1048 | other_start = 0; |
1048 | other_end = start; | 1049 | other_end = start; |
1049 | if (extent_mergeable(leaf, path->slots[0] - 1, | 1050 | if (extent_mergeable(leaf, path->slots[0] - 1, |
1050 | ino, bytenr, orig_offset, | 1051 | ino, bytenr, orig_offset, |
1051 | &other_start, &other_end)) { | 1052 | &other_start, &other_end)) { |
1052 | new_key.offset = end; | 1053 | new_key.offset = end; |
1053 | btrfs_set_item_key_safe(root, path, &new_key); | 1054 | btrfs_set_item_key_safe(root, path, &new_key); |
1054 | fi = btrfs_item_ptr(leaf, path->slots[0], | 1055 | fi = btrfs_item_ptr(leaf, path->slots[0], |
1055 | struct btrfs_file_extent_item); | 1056 | struct btrfs_file_extent_item); |
1056 | btrfs_set_file_extent_generation(leaf, fi, | 1057 | btrfs_set_file_extent_generation(leaf, fi, |
1057 | trans->transid); | 1058 | trans->transid); |
1058 | btrfs_set_file_extent_num_bytes(leaf, fi, | 1059 | btrfs_set_file_extent_num_bytes(leaf, fi, |
1059 | extent_end - end); | 1060 | extent_end - end); |
1060 | btrfs_set_file_extent_offset(leaf, fi, | 1061 | btrfs_set_file_extent_offset(leaf, fi, |
1061 | end - orig_offset); | 1062 | end - orig_offset); |
1062 | fi = btrfs_item_ptr(leaf, path->slots[0] - 1, | 1063 | fi = btrfs_item_ptr(leaf, path->slots[0] - 1, |
1063 | struct btrfs_file_extent_item); | 1064 | struct btrfs_file_extent_item); |
1064 | btrfs_set_file_extent_generation(leaf, fi, | 1065 | btrfs_set_file_extent_generation(leaf, fi, |
1065 | trans->transid); | 1066 | trans->transid); |
1066 | btrfs_set_file_extent_num_bytes(leaf, fi, | 1067 | btrfs_set_file_extent_num_bytes(leaf, fi, |
1067 | end - other_start); | 1068 | end - other_start); |
1068 | btrfs_mark_buffer_dirty(leaf); | 1069 | btrfs_mark_buffer_dirty(leaf); |
1069 | goto out; | 1070 | goto out; |
1070 | } | 1071 | } |
1071 | } | 1072 | } |
1072 | 1073 | ||
1073 | if (start > key.offset && end == extent_end) { | 1074 | if (start > key.offset && end == extent_end) { |
1074 | other_start = end; | 1075 | other_start = end; |
1075 | other_end = 0; | 1076 | other_end = 0; |
1076 | if (extent_mergeable(leaf, path->slots[0] + 1, | 1077 | if (extent_mergeable(leaf, path->slots[0] + 1, |
1077 | ino, bytenr, orig_offset, | 1078 | ino, bytenr, orig_offset, |
1078 | &other_start, &other_end)) { | 1079 | &other_start, &other_end)) { |
1079 | fi = btrfs_item_ptr(leaf, path->slots[0], | 1080 | fi = btrfs_item_ptr(leaf, path->slots[0], |
1080 | struct btrfs_file_extent_item); | 1081 | struct btrfs_file_extent_item); |
1081 | btrfs_set_file_extent_num_bytes(leaf, fi, | 1082 | btrfs_set_file_extent_num_bytes(leaf, fi, |
1082 | start - key.offset); | 1083 | start - key.offset); |
1083 | btrfs_set_file_extent_generation(leaf, fi, | 1084 | btrfs_set_file_extent_generation(leaf, fi, |
1084 | trans->transid); | 1085 | trans->transid); |
1085 | path->slots[0]++; | 1086 | path->slots[0]++; |
1086 | new_key.offset = start; | 1087 | new_key.offset = start; |
1087 | btrfs_set_item_key_safe(root, path, &new_key); | 1088 | btrfs_set_item_key_safe(root, path, &new_key); |
1088 | 1089 | ||
1089 | fi = btrfs_item_ptr(leaf, path->slots[0], | 1090 | fi = btrfs_item_ptr(leaf, path->slots[0], |
1090 | struct btrfs_file_extent_item); | 1091 | struct btrfs_file_extent_item); |
1091 | btrfs_set_file_extent_generation(leaf, fi, | 1092 | btrfs_set_file_extent_generation(leaf, fi, |
1092 | trans->transid); | 1093 | trans->transid); |
1093 | btrfs_set_file_extent_num_bytes(leaf, fi, | 1094 | btrfs_set_file_extent_num_bytes(leaf, fi, |
1094 | other_end - start); | 1095 | other_end - start); |
1095 | btrfs_set_file_extent_offset(leaf, fi, | 1096 | btrfs_set_file_extent_offset(leaf, fi, |
1096 | start - orig_offset); | 1097 | start - orig_offset); |
1097 | btrfs_mark_buffer_dirty(leaf); | 1098 | btrfs_mark_buffer_dirty(leaf); |
1098 | goto out; | 1099 | goto out; |
1099 | } | 1100 | } |
1100 | } | 1101 | } |
1101 | 1102 | ||
1102 | while (start > key.offset || end < extent_end) { | 1103 | while (start > key.offset || end < extent_end) { |
1103 | if (key.offset == start) | 1104 | if (key.offset == start) |
1104 | split = end; | 1105 | split = end; |
1105 | 1106 | ||
1106 | new_key.offset = split; | 1107 | new_key.offset = split; |
1107 | ret = btrfs_duplicate_item(trans, root, path, &new_key); | 1108 | ret = btrfs_duplicate_item(trans, root, path, &new_key); |
1108 | if (ret == -EAGAIN) { | 1109 | if (ret == -EAGAIN) { |
1109 | btrfs_release_path(path); | 1110 | btrfs_release_path(path); |
1110 | goto again; | 1111 | goto again; |
1111 | } | 1112 | } |
1112 | if (ret < 0) { | 1113 | if (ret < 0) { |
1113 | btrfs_abort_transaction(trans, root, ret); | 1114 | btrfs_abort_transaction(trans, root, ret); |
1114 | goto out; | 1115 | goto out; |
1115 | } | 1116 | } |
1116 | 1117 | ||
1117 | leaf = path->nodes[0]; | 1118 | leaf = path->nodes[0]; |
1118 | fi = btrfs_item_ptr(leaf, path->slots[0] - 1, | 1119 | fi = btrfs_item_ptr(leaf, path->slots[0] - 1, |
1119 | struct btrfs_file_extent_item); | 1120 | struct btrfs_file_extent_item); |
1120 | btrfs_set_file_extent_generation(leaf, fi, trans->transid); | 1121 | btrfs_set_file_extent_generation(leaf, fi, trans->transid); |
1121 | btrfs_set_file_extent_num_bytes(leaf, fi, | 1122 | btrfs_set_file_extent_num_bytes(leaf, fi, |
1122 | split - key.offset); | 1123 | split - key.offset); |
1123 | 1124 | ||
1124 | fi = btrfs_item_ptr(leaf, path->slots[0], | 1125 | fi = btrfs_item_ptr(leaf, path->slots[0], |
1125 | struct btrfs_file_extent_item); | 1126 | struct btrfs_file_extent_item); |
1126 | 1127 | ||
1127 | btrfs_set_file_extent_generation(leaf, fi, trans->transid); | 1128 | btrfs_set_file_extent_generation(leaf, fi, trans->transid); |
1128 | btrfs_set_file_extent_offset(leaf, fi, split - orig_offset); | 1129 | btrfs_set_file_extent_offset(leaf, fi, split - orig_offset); |
1129 | btrfs_set_file_extent_num_bytes(leaf, fi, | 1130 | btrfs_set_file_extent_num_bytes(leaf, fi, |
1130 | extent_end - split); | 1131 | extent_end - split); |
1131 | btrfs_mark_buffer_dirty(leaf); | 1132 | btrfs_mark_buffer_dirty(leaf); |
1132 | 1133 | ||
1133 | ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0, | 1134 | ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0, |
1134 | root->root_key.objectid, | 1135 | root->root_key.objectid, |
1135 | ino, orig_offset, 0); | 1136 | ino, orig_offset, 0); |
1136 | BUG_ON(ret); /* -ENOMEM */ | 1137 | BUG_ON(ret); /* -ENOMEM */ |
1137 | 1138 | ||
1138 | if (split == start) { | 1139 | if (split == start) { |
1139 | key.offset = start; | 1140 | key.offset = start; |
1140 | } else { | 1141 | } else { |
1141 | BUG_ON(start != key.offset); | 1142 | BUG_ON(start != key.offset); |
1142 | path->slots[0]--; | 1143 | path->slots[0]--; |
1143 | extent_end = end; | 1144 | extent_end = end; |
1144 | } | 1145 | } |
1145 | recow = 1; | 1146 | recow = 1; |
1146 | } | 1147 | } |
1147 | 1148 | ||
1148 | other_start = end; | 1149 | other_start = end; |
1149 | other_end = 0; | 1150 | other_end = 0; |
1150 | if (extent_mergeable(leaf, path->slots[0] + 1, | 1151 | if (extent_mergeable(leaf, path->slots[0] + 1, |
1151 | ino, bytenr, orig_offset, | 1152 | ino, bytenr, orig_offset, |
1152 | &other_start, &other_end)) { | 1153 | &other_start, &other_end)) { |
1153 | if (recow) { | 1154 | if (recow) { |
1154 | btrfs_release_path(path); | 1155 | btrfs_release_path(path); |
1155 | goto again; | 1156 | goto again; |
1156 | } | 1157 | } |
1157 | extent_end = other_end; | 1158 | extent_end = other_end; |
1158 | del_slot = path->slots[0] + 1; | 1159 | del_slot = path->slots[0] + 1; |
1159 | del_nr++; | 1160 | del_nr++; |
1160 | ret = btrfs_free_extent(trans, root, bytenr, num_bytes, | 1161 | ret = btrfs_free_extent(trans, root, bytenr, num_bytes, |
1161 | 0, root->root_key.objectid, | 1162 | 0, root->root_key.objectid, |
1162 | ino, orig_offset, 0); | 1163 | ino, orig_offset, 0); |
1163 | BUG_ON(ret); /* -ENOMEM */ | 1164 | BUG_ON(ret); /* -ENOMEM */ |
1164 | } | 1165 | } |
1165 | other_start = 0; | 1166 | other_start = 0; |
1166 | other_end = start; | 1167 | other_end = start; |
1167 | if (extent_mergeable(leaf, path->slots[0] - 1, | 1168 | if (extent_mergeable(leaf, path->slots[0] - 1, |
1168 | ino, bytenr, orig_offset, | 1169 | ino, bytenr, orig_offset, |
1169 | &other_start, &other_end)) { | 1170 | &other_start, &other_end)) { |
1170 | if (recow) { | 1171 | if (recow) { |
1171 | btrfs_release_path(path); | 1172 | btrfs_release_path(path); |
1172 | goto again; | 1173 | goto again; |
1173 | } | 1174 | } |
1174 | key.offset = other_start; | 1175 | key.offset = other_start; |
1175 | del_slot = path->slots[0]; | 1176 | del_slot = path->slots[0]; |
1176 | del_nr++; | 1177 | del_nr++; |
1177 | ret = btrfs_free_extent(trans, root, bytenr, num_bytes, | 1178 | ret = btrfs_free_extent(trans, root, bytenr, num_bytes, |
1178 | 0, root->root_key.objectid, | 1179 | 0, root->root_key.objectid, |
1179 | ino, orig_offset, 0); | 1180 | ino, orig_offset, 0); |
1180 | BUG_ON(ret); /* -ENOMEM */ | 1181 | BUG_ON(ret); /* -ENOMEM */ |
1181 | } | 1182 | } |
1182 | if (del_nr == 0) { | 1183 | if (del_nr == 0) { |
1183 | fi = btrfs_item_ptr(leaf, path->slots[0], | 1184 | fi = btrfs_item_ptr(leaf, path->slots[0], |
1184 | struct btrfs_file_extent_item); | 1185 | struct btrfs_file_extent_item); |
1185 | btrfs_set_file_extent_type(leaf, fi, | 1186 | btrfs_set_file_extent_type(leaf, fi, |
1186 | BTRFS_FILE_EXTENT_REG); | 1187 | BTRFS_FILE_EXTENT_REG); |
1187 | btrfs_set_file_extent_generation(leaf, fi, trans->transid); | 1188 | btrfs_set_file_extent_generation(leaf, fi, trans->transid); |
1188 | btrfs_mark_buffer_dirty(leaf); | 1189 | btrfs_mark_buffer_dirty(leaf); |
1189 | } else { | 1190 | } else { |
1190 | fi = btrfs_item_ptr(leaf, del_slot - 1, | 1191 | fi = btrfs_item_ptr(leaf, del_slot - 1, |
1191 | struct btrfs_file_extent_item); | 1192 | struct btrfs_file_extent_item); |
1192 | btrfs_set_file_extent_type(leaf, fi, | 1193 | btrfs_set_file_extent_type(leaf, fi, |
1193 | BTRFS_FILE_EXTENT_REG); | 1194 | BTRFS_FILE_EXTENT_REG); |
1194 | btrfs_set_file_extent_generation(leaf, fi, trans->transid); | 1195 | btrfs_set_file_extent_generation(leaf, fi, trans->transid); |
1195 | btrfs_set_file_extent_num_bytes(leaf, fi, | 1196 | btrfs_set_file_extent_num_bytes(leaf, fi, |
1196 | extent_end - key.offset); | 1197 | extent_end - key.offset); |
1197 | btrfs_mark_buffer_dirty(leaf); | 1198 | btrfs_mark_buffer_dirty(leaf); |
1198 | 1199 | ||
1199 | ret = btrfs_del_items(trans, root, path, del_slot, del_nr); | 1200 | ret = btrfs_del_items(trans, root, path, del_slot, del_nr); |
1200 | if (ret < 0) { | 1201 | if (ret < 0) { |
1201 | btrfs_abort_transaction(trans, root, ret); | 1202 | btrfs_abort_transaction(trans, root, ret); |
1202 | goto out; | 1203 | goto out; |
1203 | } | 1204 | } |
1204 | } | 1205 | } |
1205 | out: | 1206 | out: |
1206 | btrfs_free_path(path); | 1207 | btrfs_free_path(path); |
1207 | return 0; | 1208 | return 0; |
1208 | } | 1209 | } |
1209 | 1210 | ||
1210 | /* | 1211 | /* |
1211 | * on error we return an unlocked page and the error value | 1212 | * on error we return an unlocked page and the error value |
1212 | * on success we return a locked page and 0 | 1213 | * on success we return a locked page and 0 |
1213 | */ | 1214 | */ |
1214 | static int prepare_uptodate_page(struct page *page, u64 pos, | 1215 | static int prepare_uptodate_page(struct page *page, u64 pos, |
1215 | bool force_uptodate) | 1216 | bool force_uptodate) |
1216 | { | 1217 | { |
1217 | int ret = 0; | 1218 | int ret = 0; |
1218 | 1219 | ||
1219 | if (((pos & (PAGE_CACHE_SIZE - 1)) || force_uptodate) && | 1220 | if (((pos & (PAGE_CACHE_SIZE - 1)) || force_uptodate) && |
1220 | !PageUptodate(page)) { | 1221 | !PageUptodate(page)) { |
1221 | ret = btrfs_readpage(NULL, page); | 1222 | ret = btrfs_readpage(NULL, page); |
1222 | if (ret) | 1223 | if (ret) |
1223 | return ret; | 1224 | return ret; |
1224 | lock_page(page); | 1225 | lock_page(page); |
1225 | if (!PageUptodate(page)) { | 1226 | if (!PageUptodate(page)) { |
1226 | unlock_page(page); | 1227 | unlock_page(page); |
1227 | return -EIO; | 1228 | return -EIO; |
1228 | } | 1229 | } |
1229 | } | 1230 | } |
1230 | return 0; | 1231 | return 0; |
1231 | } | 1232 | } |
1232 | 1233 | ||
1233 | /* | 1234 | /* |
1234 | * this gets pages into the page cache and locks them down, it also properly | 1235 | * this gets pages into the page cache and locks them down, it also properly |
1235 | * waits for data=ordered extents to finish before allowing the pages to be | 1236 | * waits for data=ordered extents to finish before allowing the pages to be |
1236 | * modified. | 1237 | * modified. |
1237 | */ | 1238 | */ |
1238 | static noinline int prepare_pages(struct btrfs_root *root, struct file *file, | 1239 | static noinline int prepare_pages(struct btrfs_root *root, struct file *file, |
1239 | struct page **pages, size_t num_pages, | 1240 | struct page **pages, size_t num_pages, |
1240 | loff_t pos, unsigned long first_index, | 1241 | loff_t pos, unsigned long first_index, |
1241 | size_t write_bytes, bool force_uptodate) | 1242 | size_t write_bytes, bool force_uptodate) |
1242 | { | 1243 | { |
1243 | struct extent_state *cached_state = NULL; | 1244 | struct extent_state *cached_state = NULL; |
1244 | int i; | 1245 | int i; |
1245 | unsigned long index = pos >> PAGE_CACHE_SHIFT; | 1246 | unsigned long index = pos >> PAGE_CACHE_SHIFT; |
1246 | struct inode *inode = file_inode(file); | 1247 | struct inode *inode = file_inode(file); |
1247 | gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping); | 1248 | gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping); |
1248 | int err = 0; | 1249 | int err = 0; |
1249 | int faili = 0; | 1250 | int faili = 0; |
1250 | u64 start_pos; | 1251 | u64 start_pos; |
1251 | u64 last_pos; | 1252 | u64 last_pos; |
1252 | 1253 | ||
1253 | start_pos = pos & ~((u64)root->sectorsize - 1); | 1254 | start_pos = pos & ~((u64)root->sectorsize - 1); |
1254 | last_pos = ((u64)index + num_pages) << PAGE_CACHE_SHIFT; | 1255 | last_pos = ((u64)index + num_pages) << PAGE_CACHE_SHIFT; |
1255 | 1256 | ||
1256 | again: | 1257 | again: |
1257 | for (i = 0; i < num_pages; i++) { | 1258 | for (i = 0; i < num_pages; i++) { |
1258 | pages[i] = find_or_create_page(inode->i_mapping, index + i, | 1259 | pages[i] = find_or_create_page(inode->i_mapping, index + i, |
1259 | mask | __GFP_WRITE); | 1260 | mask | __GFP_WRITE); |
1260 | if (!pages[i]) { | 1261 | if (!pages[i]) { |
1261 | faili = i - 1; | 1262 | faili = i - 1; |
1262 | err = -ENOMEM; | 1263 | err = -ENOMEM; |
1263 | goto fail; | 1264 | goto fail; |
1264 | } | 1265 | } |
1265 | 1266 | ||
1266 | if (i == 0) | 1267 | if (i == 0) |
1267 | err = prepare_uptodate_page(pages[i], pos, | 1268 | err = prepare_uptodate_page(pages[i], pos, |
1268 | force_uptodate); | 1269 | force_uptodate); |
1269 | if (i == num_pages - 1) | 1270 | if (i == num_pages - 1) |
1270 | err = prepare_uptodate_page(pages[i], | 1271 | err = prepare_uptodate_page(pages[i], |
1271 | pos + write_bytes, false); | 1272 | pos + write_bytes, false); |
1272 | if (err) { | 1273 | if (err) { |
1273 | page_cache_release(pages[i]); | 1274 | page_cache_release(pages[i]); |
1274 | faili = i - 1; | 1275 | faili = i - 1; |
1275 | goto fail; | 1276 | goto fail; |
1276 | } | 1277 | } |
1277 | wait_on_page_writeback(pages[i]); | 1278 | wait_on_page_writeback(pages[i]); |
1278 | } | 1279 | } |
1279 | err = 0; | 1280 | err = 0; |
1280 | if (start_pos < inode->i_size) { | 1281 | if (start_pos < inode->i_size) { |
1281 | struct btrfs_ordered_extent *ordered; | 1282 | struct btrfs_ordered_extent *ordered; |
1282 | lock_extent_bits(&BTRFS_I(inode)->io_tree, | 1283 | lock_extent_bits(&BTRFS_I(inode)->io_tree, |
1283 | start_pos, last_pos - 1, 0, &cached_state); | 1284 | start_pos, last_pos - 1, 0, &cached_state); |
1284 | ordered = btrfs_lookup_first_ordered_extent(inode, | 1285 | ordered = btrfs_lookup_first_ordered_extent(inode, |
1285 | last_pos - 1); | 1286 | last_pos - 1); |
1286 | if (ordered && | 1287 | if (ordered && |
1287 | ordered->file_offset + ordered->len > start_pos && | 1288 | ordered->file_offset + ordered->len > start_pos && |
1288 | ordered->file_offset < last_pos) { | 1289 | ordered->file_offset < last_pos) { |
1289 | btrfs_put_ordered_extent(ordered); | 1290 | btrfs_put_ordered_extent(ordered); |
1290 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, | 1291 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, |
1291 | start_pos, last_pos - 1, | 1292 | start_pos, last_pos - 1, |
1292 | &cached_state, GFP_NOFS); | 1293 | &cached_state, GFP_NOFS); |
1293 | for (i = 0; i < num_pages; i++) { | 1294 | for (i = 0; i < num_pages; i++) { |
1294 | unlock_page(pages[i]); | 1295 | unlock_page(pages[i]); |
1295 | page_cache_release(pages[i]); | 1296 | page_cache_release(pages[i]); |
1296 | } | 1297 | } |
1297 | btrfs_wait_ordered_range(inode, start_pos, | 1298 | btrfs_wait_ordered_range(inode, start_pos, |
1298 | last_pos - start_pos); | 1299 | last_pos - start_pos); |
1299 | goto again; | 1300 | goto again; |
1300 | } | 1301 | } |
1301 | if (ordered) | 1302 | if (ordered) |
1302 | btrfs_put_ordered_extent(ordered); | 1303 | btrfs_put_ordered_extent(ordered); |
1303 | 1304 | ||
1304 | clear_extent_bit(&BTRFS_I(inode)->io_tree, start_pos, | 1305 | clear_extent_bit(&BTRFS_I(inode)->io_tree, start_pos, |
1305 | last_pos - 1, EXTENT_DIRTY | EXTENT_DELALLOC | | 1306 | last_pos - 1, EXTENT_DIRTY | EXTENT_DELALLOC | |
1306 | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, | 1307 | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, |
1307 | 0, 0, &cached_state, GFP_NOFS); | 1308 | 0, 0, &cached_state, GFP_NOFS); |
1308 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, | 1309 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, |
1309 | start_pos, last_pos - 1, &cached_state, | 1310 | start_pos, last_pos - 1, &cached_state, |
1310 | GFP_NOFS); | 1311 | GFP_NOFS); |
1311 | } | 1312 | } |
1312 | for (i = 0; i < num_pages; i++) { | 1313 | for (i = 0; i < num_pages; i++) { |
1313 | if (clear_page_dirty_for_io(pages[i])) | 1314 | if (clear_page_dirty_for_io(pages[i])) |
1314 | account_page_redirty(pages[i]); | 1315 | account_page_redirty(pages[i]); |
1315 | set_page_extent_mapped(pages[i]); | 1316 | set_page_extent_mapped(pages[i]); |
1316 | WARN_ON(!PageLocked(pages[i])); | 1317 | WARN_ON(!PageLocked(pages[i])); |
1317 | } | 1318 | } |
1318 | return 0; | 1319 | return 0; |
1319 | fail: | 1320 | fail: |
1320 | while (faili >= 0) { | 1321 | while (faili >= 0) { |
1321 | unlock_page(pages[faili]); | 1322 | unlock_page(pages[faili]); |
1322 | page_cache_release(pages[faili]); | 1323 | page_cache_release(pages[faili]); |
1323 | faili--; | 1324 | faili--; |
1324 | } | 1325 | } |
1325 | return err; | 1326 | return err; |
1326 | 1327 | ||
1327 | } | 1328 | } |
1328 | 1329 | ||
1329 | static noinline int check_can_nocow(struct inode *inode, loff_t pos, | 1330 | static noinline int check_can_nocow(struct inode *inode, loff_t pos, |
1330 | size_t *write_bytes) | 1331 | size_t *write_bytes) |
1331 | { | 1332 | { |
1332 | struct btrfs_root *root = BTRFS_I(inode)->root; | 1333 | struct btrfs_root *root = BTRFS_I(inode)->root; |
1333 | struct btrfs_ordered_extent *ordered; | 1334 | struct btrfs_ordered_extent *ordered; |
1334 | u64 lockstart, lockend; | 1335 | u64 lockstart, lockend; |
1335 | u64 num_bytes; | 1336 | u64 num_bytes; |
1336 | int ret; | 1337 | int ret; |
1337 | 1338 | ||
1338 | lockstart = round_down(pos, root->sectorsize); | 1339 | lockstart = round_down(pos, root->sectorsize); |
1339 | lockend = lockstart + round_up(*write_bytes, root->sectorsize) - 1; | 1340 | lockend = lockstart + round_up(*write_bytes, root->sectorsize) - 1; |
1340 | 1341 | ||
1341 | while (1) { | 1342 | while (1) { |
1342 | lock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend); | 1343 | lock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend); |
1343 | ordered = btrfs_lookup_ordered_range(inode, lockstart, | 1344 | ordered = btrfs_lookup_ordered_range(inode, lockstart, |
1344 | lockend - lockstart + 1); | 1345 | lockend - lockstart + 1); |
1345 | if (!ordered) { | 1346 | if (!ordered) { |
1346 | break; | 1347 | break; |
1347 | } | 1348 | } |
1348 | unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend); | 1349 | unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend); |
1349 | btrfs_start_ordered_extent(inode, ordered, 1); | 1350 | btrfs_start_ordered_extent(inode, ordered, 1); |
1350 | btrfs_put_ordered_extent(ordered); | 1351 | btrfs_put_ordered_extent(ordered); |
1351 | } | 1352 | } |
1352 | 1353 | ||
1353 | num_bytes = lockend - lockstart + 1; | 1354 | num_bytes = lockend - lockstart + 1; |
1354 | ret = can_nocow_extent(inode, lockstart, &num_bytes, NULL, NULL, NULL); | 1355 | ret = can_nocow_extent(inode, lockstart, &num_bytes, NULL, NULL, NULL); |
1355 | if (ret <= 0) { | 1356 | if (ret <= 0) { |
1356 | ret = 0; | 1357 | ret = 0; |
1357 | } else { | 1358 | } else { |
1358 | clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend, | 1359 | clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend, |
1359 | EXTENT_DIRTY | EXTENT_DELALLOC | | 1360 | EXTENT_DIRTY | EXTENT_DELALLOC | |
1360 | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 0, 0, | 1361 | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 0, 0, |
1361 | NULL, GFP_NOFS); | 1362 | NULL, GFP_NOFS); |
1362 | *write_bytes = min_t(size_t, *write_bytes, num_bytes); | 1363 | *write_bytes = min_t(size_t, *write_bytes, num_bytes); |
1363 | } | 1364 | } |
1364 | 1365 | ||
1365 | unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend); | 1366 | unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend); |
1366 | 1367 | ||
1367 | return ret; | 1368 | return ret; |
1368 | } | 1369 | } |
1369 | 1370 | ||
1370 | static noinline ssize_t __btrfs_buffered_write(struct file *file, | 1371 | static noinline ssize_t __btrfs_buffered_write(struct file *file, |
1371 | struct iov_iter *i, | 1372 | struct iov_iter *i, |
1372 | loff_t pos) | 1373 | loff_t pos) |
1373 | { | 1374 | { |
1374 | struct inode *inode = file_inode(file); | 1375 | struct inode *inode = file_inode(file); |
1375 | struct btrfs_root *root = BTRFS_I(inode)->root; | 1376 | struct btrfs_root *root = BTRFS_I(inode)->root; |
1376 | struct page **pages = NULL; | 1377 | struct page **pages = NULL; |
1377 | u64 release_bytes = 0; | 1378 | u64 release_bytes = 0; |
1378 | unsigned long first_index; | 1379 | unsigned long first_index; |
1379 | size_t num_written = 0; | 1380 | size_t num_written = 0; |
1380 | int nrptrs; | 1381 | int nrptrs; |
1381 | int ret = 0; | 1382 | int ret = 0; |
1382 | bool only_release_metadata = false; | 1383 | bool only_release_metadata = false; |
1383 | bool force_page_uptodate = false; | 1384 | bool force_page_uptodate = false; |
1384 | 1385 | ||
1385 | nrptrs = min((iov_iter_count(i) + PAGE_CACHE_SIZE - 1) / | 1386 | nrptrs = min((iov_iter_count(i) + PAGE_CACHE_SIZE - 1) / |
1386 | PAGE_CACHE_SIZE, PAGE_CACHE_SIZE / | 1387 | PAGE_CACHE_SIZE, PAGE_CACHE_SIZE / |
1387 | (sizeof(struct page *))); | 1388 | (sizeof(struct page *))); |
1388 | nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied); | 1389 | nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied); |
1389 | nrptrs = max(nrptrs, 8); | 1390 | nrptrs = max(nrptrs, 8); |
1390 | pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL); | 1391 | pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL); |
1391 | if (!pages) | 1392 | if (!pages) |
1392 | return -ENOMEM; | 1393 | return -ENOMEM; |
1393 | 1394 | ||
1394 | first_index = pos >> PAGE_CACHE_SHIFT; | 1395 | first_index = pos >> PAGE_CACHE_SHIFT; |
1395 | 1396 | ||
1396 | while (iov_iter_count(i) > 0) { | 1397 | while (iov_iter_count(i) > 0) { |
1397 | size_t offset = pos & (PAGE_CACHE_SIZE - 1); | 1398 | size_t offset = pos & (PAGE_CACHE_SIZE - 1); |
1398 | size_t write_bytes = min(iov_iter_count(i), | 1399 | size_t write_bytes = min(iov_iter_count(i), |
1399 | nrptrs * (size_t)PAGE_CACHE_SIZE - | 1400 | nrptrs * (size_t)PAGE_CACHE_SIZE - |
1400 | offset); | 1401 | offset); |
1401 | size_t num_pages = (write_bytes + offset + | 1402 | size_t num_pages = (write_bytes + offset + |
1402 | PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; | 1403 | PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; |
1403 | size_t reserve_bytes; | 1404 | size_t reserve_bytes; |
1404 | size_t dirty_pages; | 1405 | size_t dirty_pages; |
1405 | size_t copied; | 1406 | size_t copied; |
1406 | 1407 | ||
1407 | WARN_ON(num_pages > nrptrs); | 1408 | WARN_ON(num_pages > nrptrs); |
1408 | 1409 | ||
1409 | /* | 1410 | /* |
1410 | * Fault pages before locking them in prepare_pages | 1411 | * Fault pages before locking them in prepare_pages |
1411 | * to avoid recursive lock | 1412 | * to avoid recursive lock |
1412 | */ | 1413 | */ |
1413 | if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) { | 1414 | if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) { |
1414 | ret = -EFAULT; | 1415 | ret = -EFAULT; |
1415 | break; | 1416 | break; |
1416 | } | 1417 | } |
1417 | 1418 | ||
1418 | reserve_bytes = num_pages << PAGE_CACHE_SHIFT; | 1419 | reserve_bytes = num_pages << PAGE_CACHE_SHIFT; |
1419 | ret = btrfs_check_data_free_space(inode, reserve_bytes); | 1420 | ret = btrfs_check_data_free_space(inode, reserve_bytes); |
1420 | if (ret == -ENOSPC && | 1421 | if (ret == -ENOSPC && |
1421 | (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | | 1422 | (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | |
1422 | BTRFS_INODE_PREALLOC))) { | 1423 | BTRFS_INODE_PREALLOC))) { |
1423 | ret = check_can_nocow(inode, pos, &write_bytes); | 1424 | ret = check_can_nocow(inode, pos, &write_bytes); |
1424 | if (ret > 0) { | 1425 | if (ret > 0) { |
1425 | only_release_metadata = true; | 1426 | only_release_metadata = true; |
1426 | /* | 1427 | /* |
1427 | * our prealloc extent may be smaller than | 1428 | * our prealloc extent may be smaller than |
1428 | * write_bytes, so scale down. | 1429 | * write_bytes, so scale down. |
1429 | */ | 1430 | */ |
1430 | num_pages = (write_bytes + offset + | 1431 | num_pages = (write_bytes + offset + |
1431 | PAGE_CACHE_SIZE - 1) >> | 1432 | PAGE_CACHE_SIZE - 1) >> |
1432 | PAGE_CACHE_SHIFT; | 1433 | PAGE_CACHE_SHIFT; |
1433 | reserve_bytes = num_pages << PAGE_CACHE_SHIFT; | 1434 | reserve_bytes = num_pages << PAGE_CACHE_SHIFT; |
1434 | ret = 0; | 1435 | ret = 0; |
1435 | } else { | 1436 | } else { |
1436 | ret = -ENOSPC; | 1437 | ret = -ENOSPC; |
1437 | } | 1438 | } |
1438 | } | 1439 | } |
1439 | 1440 | ||
1440 | if (ret) | 1441 | if (ret) |
1441 | break; | 1442 | break; |
1442 | 1443 | ||
1443 | ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes); | 1444 | ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes); |
1444 | if (ret) { | 1445 | if (ret) { |
1445 | if (!only_release_metadata) | 1446 | if (!only_release_metadata) |
1446 | btrfs_free_reserved_data_space(inode, | 1447 | btrfs_free_reserved_data_space(inode, |
1447 | reserve_bytes); | 1448 | reserve_bytes); |
1448 | break; | 1449 | break; |
1449 | } | 1450 | } |
1450 | 1451 | ||
1451 | release_bytes = reserve_bytes; | 1452 | release_bytes = reserve_bytes; |
1452 | 1453 | ||
1453 | /* | 1454 | /* |
1454 | * This is going to setup the pages array with the number of | 1455 | * This is going to setup the pages array with the number of |
1455 | * pages we want, so we don't really need to worry about the | 1456 | * pages we want, so we don't really need to worry about the |
1456 | * contents of pages from loop to loop | 1457 | * contents of pages from loop to loop |
1457 | */ | 1458 | */ |
1458 | ret = prepare_pages(root, file, pages, num_pages, | 1459 | ret = prepare_pages(root, file, pages, num_pages, |
1459 | pos, first_index, write_bytes, | 1460 | pos, first_index, write_bytes, |
1460 | force_page_uptodate); | 1461 | force_page_uptodate); |
1461 | if (ret) | 1462 | if (ret) |
1462 | break; | 1463 | break; |
1463 | 1464 | ||
1464 | copied = btrfs_copy_from_user(pos, num_pages, | 1465 | copied = btrfs_copy_from_user(pos, num_pages, |
1465 | write_bytes, pages, i); | 1466 | write_bytes, pages, i); |
1466 | 1467 | ||
1467 | /* | 1468 | /* |
1468 | * if we have trouble faulting in the pages, fall | 1469 | * if we have trouble faulting in the pages, fall |
1469 | * back to one page at a time | 1470 | * back to one page at a time |
1470 | */ | 1471 | */ |
1471 | if (copied < write_bytes) | 1472 | if (copied < write_bytes) |
1472 | nrptrs = 1; | 1473 | nrptrs = 1; |
1473 | 1474 | ||
1474 | if (copied == 0) { | 1475 | if (copied == 0) { |
1475 | force_page_uptodate = true; | 1476 | force_page_uptodate = true; |
1476 | dirty_pages = 0; | 1477 | dirty_pages = 0; |
1477 | } else { | 1478 | } else { |
1478 | force_page_uptodate = false; | 1479 | force_page_uptodate = false; |
1479 | dirty_pages = (copied + offset + | 1480 | dirty_pages = (copied + offset + |
1480 | PAGE_CACHE_SIZE - 1) >> | 1481 | PAGE_CACHE_SIZE - 1) >> |
1481 | PAGE_CACHE_SHIFT; | 1482 | PAGE_CACHE_SHIFT; |
1482 | } | 1483 | } |
1483 | 1484 | ||
1484 | /* | 1485 | /* |
1485 | * If we had a short copy we need to release the excess delaloc | 1486 | * If we had a short copy we need to release the excess delaloc |
1486 | * bytes we reserved. We need to increment outstanding_extents | 1487 | * bytes we reserved. We need to increment outstanding_extents |
1487 | * because btrfs_delalloc_release_space will decrement it, but | 1488 | * because btrfs_delalloc_release_space will decrement it, but |
1488 | * we still have an outstanding extent for the chunk we actually | 1489 | * we still have an outstanding extent for the chunk we actually |
1489 | * managed to copy. | 1490 | * managed to copy. |
1490 | */ | 1491 | */ |
1491 | if (num_pages > dirty_pages) { | 1492 | if (num_pages > dirty_pages) { |
1492 | release_bytes = (num_pages - dirty_pages) << | 1493 | release_bytes = (num_pages - dirty_pages) << |
1493 | PAGE_CACHE_SHIFT; | 1494 | PAGE_CACHE_SHIFT; |
1494 | if (copied > 0) { | 1495 | if (copied > 0) { |
1495 | spin_lock(&BTRFS_I(inode)->lock); | 1496 | spin_lock(&BTRFS_I(inode)->lock); |
1496 | BTRFS_I(inode)->outstanding_extents++; | 1497 | BTRFS_I(inode)->outstanding_extents++; |
1497 | spin_unlock(&BTRFS_I(inode)->lock); | 1498 | spin_unlock(&BTRFS_I(inode)->lock); |
1498 | } | 1499 | } |
1499 | if (only_release_metadata) | 1500 | if (only_release_metadata) |
1500 | btrfs_delalloc_release_metadata(inode, | 1501 | btrfs_delalloc_release_metadata(inode, |
1501 | release_bytes); | 1502 | release_bytes); |
1502 | else | 1503 | else |
1503 | btrfs_delalloc_release_space(inode, | 1504 | btrfs_delalloc_release_space(inode, |
1504 | release_bytes); | 1505 | release_bytes); |
1505 | } | 1506 | } |
1506 | 1507 | ||
1507 | release_bytes = dirty_pages << PAGE_CACHE_SHIFT; | 1508 | release_bytes = dirty_pages << PAGE_CACHE_SHIFT; |
1508 | if (copied > 0) { | 1509 | if (copied > 0) { |
1509 | ret = btrfs_dirty_pages(root, inode, pages, | 1510 | ret = btrfs_dirty_pages(root, inode, pages, |
1510 | dirty_pages, pos, copied, | 1511 | dirty_pages, pos, copied, |
1511 | NULL); | 1512 | NULL); |
1512 | if (ret) { | 1513 | if (ret) { |
1513 | btrfs_drop_pages(pages, num_pages); | 1514 | btrfs_drop_pages(pages, num_pages); |
1514 | break; | 1515 | break; |
1515 | } | 1516 | } |
1516 | } | 1517 | } |
1517 | 1518 | ||
1518 | release_bytes = 0; | 1519 | release_bytes = 0; |
1519 | btrfs_drop_pages(pages, num_pages); | 1520 | btrfs_drop_pages(pages, num_pages); |
1520 | 1521 | ||
1521 | if (only_release_metadata && copied > 0) { | 1522 | if (only_release_metadata && copied > 0) { |
1522 | u64 lockstart = round_down(pos, root->sectorsize); | 1523 | u64 lockstart = round_down(pos, root->sectorsize); |
1523 | u64 lockend = lockstart + | 1524 | u64 lockend = lockstart + |
1524 | (dirty_pages << PAGE_CACHE_SHIFT) - 1; | 1525 | (dirty_pages << PAGE_CACHE_SHIFT) - 1; |
1525 | 1526 | ||
1526 | set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, | 1527 | set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, |
1527 | lockend, EXTENT_NORESERVE, NULL, | 1528 | lockend, EXTENT_NORESERVE, NULL, |
1528 | NULL, GFP_NOFS); | 1529 | NULL, GFP_NOFS); |
1529 | only_release_metadata = false; | 1530 | only_release_metadata = false; |
1530 | } | 1531 | } |
1531 | 1532 | ||
1532 | cond_resched(); | 1533 | cond_resched(); |
1533 | 1534 | ||
1534 | balance_dirty_pages_ratelimited(inode->i_mapping); | 1535 | balance_dirty_pages_ratelimited(inode->i_mapping); |
1535 | if (dirty_pages < (root->leafsize >> PAGE_CACHE_SHIFT) + 1) | 1536 | if (dirty_pages < (root->leafsize >> PAGE_CACHE_SHIFT) + 1) |
1536 | btrfs_btree_balance_dirty(root); | 1537 | btrfs_btree_balance_dirty(root); |
1537 | 1538 | ||
1538 | pos += copied; | 1539 | pos += copied; |
1539 | num_written += copied; | 1540 | num_written += copied; |
1540 | } | 1541 | } |
1541 | 1542 | ||
1542 | kfree(pages); | 1543 | kfree(pages); |
1543 | 1544 | ||
1544 | if (release_bytes) { | 1545 | if (release_bytes) { |
1545 | if (only_release_metadata) | 1546 | if (only_release_metadata) |
1546 | btrfs_delalloc_release_metadata(inode, release_bytes); | 1547 | btrfs_delalloc_release_metadata(inode, release_bytes); |
1547 | else | 1548 | else |
1548 | btrfs_delalloc_release_space(inode, release_bytes); | 1549 | btrfs_delalloc_release_space(inode, release_bytes); |
1549 | } | 1550 | } |
1550 | 1551 | ||
1551 | return num_written ? num_written : ret; | 1552 | return num_written ? num_written : ret; |
1552 | } | 1553 | } |
1553 | 1554 | ||
1554 | static ssize_t __btrfs_direct_write(struct kiocb *iocb, | 1555 | static ssize_t __btrfs_direct_write(struct kiocb *iocb, |
1555 | const struct iovec *iov, | 1556 | const struct iovec *iov, |
1556 | unsigned long nr_segs, loff_t pos, | 1557 | unsigned long nr_segs, loff_t pos, |
1557 | loff_t *ppos, size_t count, size_t ocount) | 1558 | loff_t *ppos, size_t count, size_t ocount) |
1558 | { | 1559 | { |
1559 | struct file *file = iocb->ki_filp; | 1560 | struct file *file = iocb->ki_filp; |
1560 | struct iov_iter i; | 1561 | struct iov_iter i; |
1561 | ssize_t written; | 1562 | ssize_t written; |
1562 | ssize_t written_buffered; | 1563 | ssize_t written_buffered; |
1563 | loff_t endbyte; | 1564 | loff_t endbyte; |
1564 | int err; | 1565 | int err; |
1565 | 1566 | ||
1566 | written = generic_file_direct_write(iocb, iov, &nr_segs, pos, ppos, | 1567 | written = generic_file_direct_write(iocb, iov, &nr_segs, pos, ppos, |
1567 | count, ocount); | 1568 | count, ocount); |
1568 | 1569 | ||
1569 | if (written < 0 || written == count) | 1570 | if (written < 0 || written == count) |
1570 | return written; | 1571 | return written; |
1571 | 1572 | ||
1572 | pos += written; | 1573 | pos += written; |
1573 | count -= written; | 1574 | count -= written; |
1574 | iov_iter_init(&i, iov, nr_segs, count, written); | 1575 | iov_iter_init(&i, iov, nr_segs, count, written); |
1575 | written_buffered = __btrfs_buffered_write(file, &i, pos); | 1576 | written_buffered = __btrfs_buffered_write(file, &i, pos); |
1576 | if (written_buffered < 0) { | 1577 | if (written_buffered < 0) { |
1577 | err = written_buffered; | 1578 | err = written_buffered; |
1578 | goto out; | 1579 | goto out; |
1579 | } | 1580 | } |
1580 | endbyte = pos + written_buffered - 1; | 1581 | endbyte = pos + written_buffered - 1; |
1581 | err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte); | 1582 | err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte); |
1582 | if (err) | 1583 | if (err) |
1583 | goto out; | 1584 | goto out; |
1584 | written += written_buffered; | 1585 | written += written_buffered; |
1585 | *ppos = pos + written_buffered; | 1586 | *ppos = pos + written_buffered; |
1586 | invalidate_mapping_pages(file->f_mapping, pos >> PAGE_CACHE_SHIFT, | 1587 | invalidate_mapping_pages(file->f_mapping, pos >> PAGE_CACHE_SHIFT, |
1587 | endbyte >> PAGE_CACHE_SHIFT); | 1588 | endbyte >> PAGE_CACHE_SHIFT); |
1588 | out: | 1589 | out: |
1589 | return written ? written : err; | 1590 | return written ? written : err; |
1590 | } | 1591 | } |
1591 | 1592 | ||
1592 | static void update_time_for_write(struct inode *inode) | 1593 | static void update_time_for_write(struct inode *inode) |
1593 | { | 1594 | { |
1594 | struct timespec now; | 1595 | struct timespec now; |
1595 | 1596 | ||
1596 | if (IS_NOCMTIME(inode)) | 1597 | if (IS_NOCMTIME(inode)) |
1597 | return; | 1598 | return; |
1598 | 1599 | ||
1599 | now = current_fs_time(inode->i_sb); | 1600 | now = current_fs_time(inode->i_sb); |
1600 | if (!timespec_equal(&inode->i_mtime, &now)) | 1601 | if (!timespec_equal(&inode->i_mtime, &now)) |
1601 | inode->i_mtime = now; | 1602 | inode->i_mtime = now; |
1602 | 1603 | ||
1603 | if (!timespec_equal(&inode->i_ctime, &now)) | 1604 | if (!timespec_equal(&inode->i_ctime, &now)) |
1604 | inode->i_ctime = now; | 1605 | inode->i_ctime = now; |
1605 | 1606 | ||
1606 | if (IS_I_VERSION(inode)) | 1607 | if (IS_I_VERSION(inode)) |
1607 | inode_inc_iversion(inode); | 1608 | inode_inc_iversion(inode); |
1608 | } | 1609 | } |
1609 | 1610 | ||
1610 | static ssize_t btrfs_file_aio_write(struct kiocb *iocb, | 1611 | static ssize_t btrfs_file_aio_write(struct kiocb *iocb, |
1611 | const struct iovec *iov, | 1612 | const struct iovec *iov, |
1612 | unsigned long nr_segs, loff_t pos) | 1613 | unsigned long nr_segs, loff_t pos) |
1613 | { | 1614 | { |
1614 | struct file *file = iocb->ki_filp; | 1615 | struct file *file = iocb->ki_filp; |
1615 | struct inode *inode = file_inode(file); | 1616 | struct inode *inode = file_inode(file); |
1616 | struct btrfs_root *root = BTRFS_I(inode)->root; | 1617 | struct btrfs_root *root = BTRFS_I(inode)->root; |
1617 | loff_t *ppos = &iocb->ki_pos; | 1618 | loff_t *ppos = &iocb->ki_pos; |
1618 | u64 start_pos; | 1619 | u64 start_pos; |
1619 | ssize_t num_written = 0; | 1620 | ssize_t num_written = 0; |
1620 | ssize_t err = 0; | 1621 | ssize_t err = 0; |
1621 | size_t count, ocount; | 1622 | size_t count, ocount; |
1622 | bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host); | 1623 | bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host); |
1623 | 1624 | ||
1624 | mutex_lock(&inode->i_mutex); | 1625 | mutex_lock(&inode->i_mutex); |
1625 | 1626 | ||
1626 | err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ); | 1627 | err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ); |
1627 | if (err) { | 1628 | if (err) { |
1628 | mutex_unlock(&inode->i_mutex); | 1629 | mutex_unlock(&inode->i_mutex); |
1629 | goto out; | 1630 | goto out; |
1630 | } | 1631 | } |
1631 | count = ocount; | 1632 | count = ocount; |
1632 | 1633 | ||
1633 | current->backing_dev_info = inode->i_mapping->backing_dev_info; | 1634 | current->backing_dev_info = inode->i_mapping->backing_dev_info; |
1634 | err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); | 1635 | err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); |
1635 | if (err) { | 1636 | if (err) { |
1636 | mutex_unlock(&inode->i_mutex); | 1637 | mutex_unlock(&inode->i_mutex); |
1637 | goto out; | 1638 | goto out; |
1638 | } | 1639 | } |
1639 | 1640 | ||
1640 | if (count == 0) { | 1641 | if (count == 0) { |
1641 | mutex_unlock(&inode->i_mutex); | 1642 | mutex_unlock(&inode->i_mutex); |
1642 | goto out; | 1643 | goto out; |
1643 | } | 1644 | } |
1644 | 1645 | ||
1645 | err = file_remove_suid(file); | 1646 | err = file_remove_suid(file); |
1646 | if (err) { | 1647 | if (err) { |
1647 | mutex_unlock(&inode->i_mutex); | 1648 | mutex_unlock(&inode->i_mutex); |
1648 | goto out; | 1649 | goto out; |
1649 | } | 1650 | } |
1650 | 1651 | ||
1651 | /* | 1652 | /* |
1652 | * If BTRFS flips readonly due to some impossible error | 1653 | * If BTRFS flips readonly due to some impossible error |
1653 | * (fs_info->fs_state now has BTRFS_SUPER_FLAG_ERROR), | 1654 | * (fs_info->fs_state now has BTRFS_SUPER_FLAG_ERROR), |
1654 | * although we have opened a file as writable, we have | 1655 | * although we have opened a file as writable, we have |
1655 | * to stop this write operation to ensure FS consistency. | 1656 | * to stop this write operation to ensure FS consistency. |
1656 | */ | 1657 | */ |
1657 | if (test_bit(BTRFS_FS_STATE_ERROR, &root->fs_info->fs_state)) { | 1658 | if (test_bit(BTRFS_FS_STATE_ERROR, &root->fs_info->fs_state)) { |
1658 | mutex_unlock(&inode->i_mutex); | 1659 | mutex_unlock(&inode->i_mutex); |
1659 | err = -EROFS; | 1660 | err = -EROFS; |
1660 | goto out; | 1661 | goto out; |
1661 | } | 1662 | } |
1662 | 1663 | ||
1663 | /* | 1664 | /* |
1664 | * We reserve space for updating the inode when we reserve space for the | 1665 | * We reserve space for updating the inode when we reserve space for the |
1665 | * extent we are going to write, so we will enospc out there. We don't | 1666 | * extent we are going to write, so we will enospc out there. We don't |
1666 | * need to start yet another transaction to update the inode as we will | 1667 | * need to start yet another transaction to update the inode as we will |
1667 | * update the inode when we finish writing whatever data we write. | 1668 | * update the inode when we finish writing whatever data we write. |
1668 | */ | 1669 | */ |
1669 | update_time_for_write(inode); | 1670 | update_time_for_write(inode); |
1670 | 1671 | ||
1671 | start_pos = round_down(pos, root->sectorsize); | 1672 | start_pos = round_down(pos, root->sectorsize); |
1672 | if (start_pos > i_size_read(inode)) { | 1673 | if (start_pos > i_size_read(inode)) { |
1673 | err = btrfs_cont_expand(inode, i_size_read(inode), start_pos); | 1674 | err = btrfs_cont_expand(inode, i_size_read(inode), start_pos); |
1674 | if (err) { | 1675 | if (err) { |
1675 | mutex_unlock(&inode->i_mutex); | 1676 | mutex_unlock(&inode->i_mutex); |
1676 | goto out; | 1677 | goto out; |
1677 | } | 1678 | } |
1678 | } | 1679 | } |
1679 | 1680 | ||
1680 | if (sync) | 1681 | if (sync) |
1681 | atomic_inc(&BTRFS_I(inode)->sync_writers); | 1682 | atomic_inc(&BTRFS_I(inode)->sync_writers); |
1682 | 1683 | ||
1683 | if (unlikely(file->f_flags & O_DIRECT)) { | 1684 | if (unlikely(file->f_flags & O_DIRECT)) { |
1684 | num_written = __btrfs_direct_write(iocb, iov, nr_segs, | 1685 | num_written = __btrfs_direct_write(iocb, iov, nr_segs, |
1685 | pos, ppos, count, ocount); | 1686 | pos, ppos, count, ocount); |
1686 | } else { | 1687 | } else { |
1687 | struct iov_iter i; | 1688 | struct iov_iter i; |
1688 | 1689 | ||
1689 | iov_iter_init(&i, iov, nr_segs, count, num_written); | 1690 | iov_iter_init(&i, iov, nr_segs, count, num_written); |
1690 | 1691 | ||
1691 | num_written = __btrfs_buffered_write(file, &i, pos); | 1692 | num_written = __btrfs_buffered_write(file, &i, pos); |
1692 | if (num_written > 0) | 1693 | if (num_written > 0) |
1693 | *ppos = pos + num_written; | 1694 | *ppos = pos + num_written; |
1694 | } | 1695 | } |
1695 | 1696 | ||
1696 | mutex_unlock(&inode->i_mutex); | 1697 | mutex_unlock(&inode->i_mutex); |
1697 | 1698 | ||
1698 | /* | 1699 | /* |
1699 | * we want to make sure fsync finds this change | 1700 | * we want to make sure fsync finds this change |
1700 | * but we haven't joined a transaction running right now. | 1701 | * but we haven't joined a transaction running right now. |
1701 | * | 1702 | * |
1702 | * Later on, someone is sure to update the inode and get the | 1703 | * Later on, someone is sure to update the inode and get the |
1703 | * real transid recorded. | 1704 | * real transid recorded. |
1704 | * | 1705 | * |
1705 | * We set last_trans now to the fs_info generation + 1, | 1706 | * We set last_trans now to the fs_info generation + 1, |
1706 | * this will either be one more than the running transaction | 1707 | * this will either be one more than the running transaction |
1707 | * or the generation used for the next transaction if there isn't | 1708 | * or the generation used for the next transaction if there isn't |
1708 | * one running right now. | 1709 | * one running right now. |
1709 | * | 1710 | * |
1710 | * We also have to set last_sub_trans to the current log transid, | 1711 | * We also have to set last_sub_trans to the current log transid, |
1711 | * otherwise subsequent syncs to a file that's been synced in this | 1712 | * otherwise subsequent syncs to a file that's been synced in this |
1712 | * transaction will appear to have already occured. | 1713 | * transaction will appear to have already occured. |
1713 | */ | 1714 | */ |
1714 | BTRFS_I(inode)->last_trans = root->fs_info->generation + 1; | 1715 | BTRFS_I(inode)->last_trans = root->fs_info->generation + 1; |
1715 | BTRFS_I(inode)->last_sub_trans = root->log_transid; | 1716 | BTRFS_I(inode)->last_sub_trans = root->log_transid; |
1716 | if (num_written > 0) { | 1717 | if (num_written > 0) { |
1717 | err = generic_write_sync(file, pos, num_written); | 1718 | err = generic_write_sync(file, pos, num_written); |
1718 | if (err < 0 && num_written > 0) | 1719 | if (err < 0 && num_written > 0) |
1719 | num_written = err; | 1720 | num_written = err; |
1720 | } | 1721 | } |
1721 | 1722 | ||
1722 | if (sync) | 1723 | if (sync) |
1723 | atomic_dec(&BTRFS_I(inode)->sync_writers); | 1724 | atomic_dec(&BTRFS_I(inode)->sync_writers); |
1724 | out: | 1725 | out: |
1725 | current->backing_dev_info = NULL; | 1726 | current->backing_dev_info = NULL; |
1726 | return num_written ? num_written : err; | 1727 | return num_written ? num_written : err; |
1727 | } | 1728 | } |
1728 | 1729 | ||
1729 | int btrfs_release_file(struct inode *inode, struct file *filp) | 1730 | int btrfs_release_file(struct inode *inode, struct file *filp) |
1730 | { | 1731 | { |
1731 | /* | 1732 | /* |
1732 | * ordered_data_close is set by setattr when we are about to truncate | 1733 | * ordered_data_close is set by setattr when we are about to truncate |
1733 | * a file from a non-zero size to a zero size. This tries to | 1734 | * a file from a non-zero size to a zero size. This tries to |
1734 | * flush down new bytes that may have been written if the | 1735 | * flush down new bytes that may have been written if the |
1735 | * application were using truncate to replace a file in place. | 1736 | * application were using truncate to replace a file in place. |
1736 | */ | 1737 | */ |
1737 | if (test_and_clear_bit(BTRFS_INODE_ORDERED_DATA_CLOSE, | 1738 | if (test_and_clear_bit(BTRFS_INODE_ORDERED_DATA_CLOSE, |
1738 | &BTRFS_I(inode)->runtime_flags)) { | 1739 | &BTRFS_I(inode)->runtime_flags)) { |
1739 | struct btrfs_trans_handle *trans; | 1740 | struct btrfs_trans_handle *trans; |
1740 | struct btrfs_root *root = BTRFS_I(inode)->root; | 1741 | struct btrfs_root *root = BTRFS_I(inode)->root; |
1741 | 1742 | ||
1742 | /* | 1743 | /* |
1743 | * We need to block on a committing transaction to keep us from | 1744 | * We need to block on a committing transaction to keep us from |
1744 | * throwing an ordered operation onto the list and causing | 1745 | * throwing an ordered operation onto the list and causing |
1745 | * something like sync to deadlock trying to flush out this | 1746 | * something like sync to deadlock trying to flush out this |
1746 | * inode. | 1747 | * inode. |
1747 | */ | 1748 | */ |
1748 | trans = btrfs_start_transaction(root, 0); | 1749 | trans = btrfs_start_transaction(root, 0); |
1749 | if (IS_ERR(trans)) | 1750 | if (IS_ERR(trans)) |
1750 | return PTR_ERR(trans); | 1751 | return PTR_ERR(trans); |
1751 | btrfs_add_ordered_operation(trans, BTRFS_I(inode)->root, inode); | 1752 | btrfs_add_ordered_operation(trans, BTRFS_I(inode)->root, inode); |
1752 | btrfs_end_transaction(trans, root); | 1753 | btrfs_end_transaction(trans, root); |
1753 | if (inode->i_size > BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT) | 1754 | if (inode->i_size > BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT) |
1754 | filemap_flush(inode->i_mapping); | 1755 | filemap_flush(inode->i_mapping); |
1755 | } | 1756 | } |
1756 | if (filp->private_data) | 1757 | if (filp->private_data) |
1757 | btrfs_ioctl_trans_end(filp); | 1758 | btrfs_ioctl_trans_end(filp); |
1758 | return 0; | 1759 | return 0; |
1759 | } | 1760 | } |
1760 | 1761 | ||
1761 | /* | 1762 | /* |
1762 | * fsync call for both files and directories. This logs the inode into | 1763 | * fsync call for both files and directories. This logs the inode into |
1763 | * the tree log instead of forcing full commits whenever possible. | 1764 | * the tree log instead of forcing full commits whenever possible. |
1764 | * | 1765 | * |
1765 | * It needs to call filemap_fdatawait so that all ordered extent updates | 1766 | * It needs to call filemap_fdatawait so that all ordered extent updates |
1766 | * in the metadata btree are up to date for copying to the log. | 1767 | * in the metadata btree are up to date for copying to the log. |
1767 | * | 1768 | * |
1768 | * It drops the inode mutex before doing the tree log commit. This is an | 1769 | * It drops the inode mutex before doing the tree log commit. This is an |
1769 | * important optimization for directories because holding the mutex prevents | 1770 | * important optimization for directories because holding the mutex prevents |
1770 | * new operations on the dir while we write to disk. | 1771 | * new operations on the dir while we write to disk. |
1771 | */ | 1772 | */ |
1772 | int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) | 1773 | int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) |
1773 | { | 1774 | { |
1774 | struct dentry *dentry = file->f_path.dentry; | 1775 | struct dentry *dentry = file->f_path.dentry; |
1775 | struct inode *inode = dentry->d_inode; | 1776 | struct inode *inode = dentry->d_inode; |
1776 | struct btrfs_root *root = BTRFS_I(inode)->root; | 1777 | struct btrfs_root *root = BTRFS_I(inode)->root; |
1777 | int ret = 0; | 1778 | int ret = 0; |
1778 | struct btrfs_trans_handle *trans; | 1779 | struct btrfs_trans_handle *trans; |
1779 | bool full_sync = 0; | 1780 | bool full_sync = 0; |
1780 | 1781 | ||
1781 | trace_btrfs_sync_file(file, datasync); | 1782 | trace_btrfs_sync_file(file, datasync); |
1782 | 1783 | ||
1783 | /* | 1784 | /* |
1784 | * We write the dirty pages in the range and wait until they complete | 1785 | * We write the dirty pages in the range and wait until they complete |
1785 | * outside of the ->i_mutex. That way the dirty pages can be flushed | 1786 | * outside of the ->i_mutex. That way the dirty pages can be flushed |
1786 | * by multiple tasks concurrently, which improves performance. See | 1787 | * by multiple tasks concurrently, which improves performance. See |
1787 | * btrfs_wait_ordered_range for an explanation of the ASYNC check. | 1788 | * btrfs_wait_ordered_range for an explanation of the ASYNC check. |
1788 | */ | 1789 | */ |
1789 | atomic_inc(&BTRFS_I(inode)->sync_writers); | 1790 | atomic_inc(&BTRFS_I(inode)->sync_writers); |
1790 | ret = filemap_fdatawrite_range(inode->i_mapping, start, end); | 1791 | ret = filemap_fdatawrite_range(inode->i_mapping, start, end); |
1791 | if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, | 1792 | if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, |
1792 | &BTRFS_I(inode)->runtime_flags)) | 1793 | &BTRFS_I(inode)->runtime_flags)) |
1793 | ret = filemap_fdatawrite_range(inode->i_mapping, start, end); | 1794 | ret = filemap_fdatawrite_range(inode->i_mapping, start, end); |
1794 | atomic_dec(&BTRFS_I(inode)->sync_writers); | 1795 | atomic_dec(&BTRFS_I(inode)->sync_writers); |
1795 | if (ret) | 1796 | if (ret) |
1796 | return ret; | 1797 | return ret; |
1797 | 1798 | ||
1798 | mutex_lock(&inode->i_mutex); | 1799 | mutex_lock(&inode->i_mutex); |
1799 | 1800 | ||
1800 | /* | 1801 | /* |
1801 | * We flush the dirty pages again to avoid leaving any dirty pages | 1802 | * We flush the dirty pages again to avoid leaving any dirty pages |
1802 | * in the range. | 1803 | * in the range. |
1803 | */ | 1804 | */ |
1804 | atomic_inc(&root->log_batch); | 1805 | atomic_inc(&root->log_batch); |
1805 | full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, | 1806 | full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, |
1806 | &BTRFS_I(inode)->runtime_flags); | 1807 | &BTRFS_I(inode)->runtime_flags); |
1807 | if (full_sync) | 1808 | if (full_sync) |
1808 | btrfs_wait_ordered_range(inode, start, end - start + 1); | 1809 | btrfs_wait_ordered_range(inode, start, end - start + 1); |
1809 | atomic_inc(&root->log_batch); | 1810 | atomic_inc(&root->log_batch); |
1810 | 1811 | ||
1811 | /* | 1812 | /* |
1812 | * check the transaction that last modified this inode | 1813 | * check the transaction that last modified this inode |
1813 | * and see if it's already been committed | 1814 | * and see if it's already been committed |
1814 | */ | 1815 | */ |
1815 | if (!BTRFS_I(inode)->last_trans) { | 1816 | if (!BTRFS_I(inode)->last_trans) { |
1816 | mutex_unlock(&inode->i_mutex); | 1817 | mutex_unlock(&inode->i_mutex); |
1817 | goto out; | 1818 | goto out; |
1818 | } | 1819 | } |
1819 | 1820 | ||
1820 | /* | 1821 | /* |
1821 | * if the last transaction that changed this file was before | 1822 | * if the last transaction that changed this file was before |
1822 | * the current transaction, we can bail out now without any | 1823 | * the current transaction, we can bail out now without any |
1823 | * syncing | 1824 | * syncing |
1824 | */ | 1825 | */ |
1825 | smp_mb(); | 1826 | smp_mb(); |
1826 | if (btrfs_inode_in_log(inode, root->fs_info->generation) || | 1827 | if (btrfs_inode_in_log(inode, root->fs_info->generation) || |
1827 | BTRFS_I(inode)->last_trans <= | 1828 | BTRFS_I(inode)->last_trans <= |
1828 | root->fs_info->last_trans_committed) { | 1829 | root->fs_info->last_trans_committed) { |
1829 | BTRFS_I(inode)->last_trans = 0; | 1830 | BTRFS_I(inode)->last_trans = 0; |
1830 | 1831 | ||
1831 | /* | 1832 | /* |
1832 | * We've had everything committed since the last time we were | 1833 | * We've had everything committed since the last time we were |
1833 | * modified so clear this flag in case it was set for whatever | 1834 | * modified so clear this flag in case it was set for whatever |
1834 | * reason, it's no longer relevant. | 1835 | * reason, it's no longer relevant. |
1835 | */ | 1836 | */ |
1836 | clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC, | 1837 | clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC, |
1837 | &BTRFS_I(inode)->runtime_flags); | 1838 | &BTRFS_I(inode)->runtime_flags); |
1838 | mutex_unlock(&inode->i_mutex); | 1839 | mutex_unlock(&inode->i_mutex); |
1839 | goto out; | 1840 | goto out; |
1840 | } | 1841 | } |
1841 | 1842 | ||
1842 | /* | 1843 | /* |
1843 | * ok, we haven't committed the transaction yet, let's do a commit | 1844 | * ok, we haven't committed the transaction yet, let's do a commit |
1844 | */ | 1845 | */ |
1845 | if (file->private_data) | 1846 | if (file->private_data) |
1846 | btrfs_ioctl_trans_end(file); | 1847 | btrfs_ioctl_trans_end(file); |
1847 | 1848 | ||
1848 | trans = btrfs_start_transaction(root, 0); | 1849 | trans = btrfs_start_transaction(root, 0); |
1849 | if (IS_ERR(trans)) { | 1850 | if (IS_ERR(trans)) { |
1850 | ret = PTR_ERR(trans); | 1851 | ret = PTR_ERR(trans); |
1851 | mutex_unlock(&inode->i_mutex); | 1852 | mutex_unlock(&inode->i_mutex); |
1852 | goto out; | 1853 | goto out; |
1853 | } | 1854 | } |
1854 | 1855 | ||
1855 | ret = btrfs_log_dentry_safe(trans, root, dentry); | 1856 | ret = btrfs_log_dentry_safe(trans, root, dentry); |
1856 | if (ret < 0) { | 1857 | if (ret < 0) { |
1857 | /* Fallthrough and commit/free transaction. */ | 1858 | /* Fallthrough and commit/free transaction. */ |
1858 | ret = 1; | 1859 | ret = 1; |
1859 | } | 1860 | } |
1860 | 1861 | ||
1861 | /* we've logged all the items and now have a consistent | 1862 | /* we've logged all the items and now have a consistent |
1862 | * version of the file in the log. It is possible that | 1863 | * version of the file in the log. It is possible that |
1863 | * someone will come in and modify the file, but that's | 1864 | * someone will come in and modify the file, but that's |
1864 | * fine because the log is consistent on disk, and we | 1865 | * fine because the log is consistent on disk, and we |
1865 | * have references to all of the file's extents | 1866 | * have references to all of the file's extents |
1866 | * | 1867 | * |
1867 | * It is possible that someone will come in and log the | 1868 | * It is possible that someone will come in and log the |
1868 | * file again, but that will end up using the synchronization | 1869 | * file again, but that will end up using the synchronization |
1869 | * inside btrfs_sync_log to keep things safe. | 1870 | * inside btrfs_sync_log to keep things safe. |
1870 | */ | 1871 | */ |
1871 | mutex_unlock(&inode->i_mutex); | 1872 | mutex_unlock(&inode->i_mutex); |
1872 | 1873 | ||
1873 | if (ret != BTRFS_NO_LOG_SYNC) { | 1874 | if (ret != BTRFS_NO_LOG_SYNC) { |
1874 | if (ret > 0) { | 1875 | if (ret > 0) { |
1875 | /* | 1876 | /* |
1876 | * If we didn't already wait for ordered extents we need | 1877 | * If we didn't already wait for ordered extents we need |
1877 | * to do that now. | 1878 | * to do that now. |
1878 | */ | 1879 | */ |
1879 | if (!full_sync) | 1880 | if (!full_sync) |
1880 | btrfs_wait_ordered_range(inode, start, | 1881 | btrfs_wait_ordered_range(inode, start, |
1881 | end - start + 1); | 1882 | end - start + 1); |
1882 | ret = btrfs_commit_transaction(trans, root); | 1883 | ret = btrfs_commit_transaction(trans, root); |
1883 | } else { | 1884 | } else { |
1884 | ret = btrfs_sync_log(trans, root); | 1885 | ret = btrfs_sync_log(trans, root); |
1885 | if (ret == 0) { | 1886 | if (ret == 0) { |
1886 | ret = btrfs_end_transaction(trans, root); | 1887 | ret = btrfs_end_transaction(trans, root); |
1887 | } else { | 1888 | } else { |
1888 | if (!full_sync) | 1889 | if (!full_sync) |
1889 | btrfs_wait_ordered_range(inode, start, | 1890 | btrfs_wait_ordered_range(inode, start, |
1890 | end - | 1891 | end - |
1891 | start + 1); | 1892 | start + 1); |
1892 | ret = btrfs_commit_transaction(trans, root); | 1893 | ret = btrfs_commit_transaction(trans, root); |
1893 | } | 1894 | } |
1894 | } | 1895 | } |
1895 | } else { | 1896 | } else { |
1896 | ret = btrfs_end_transaction(trans, root); | 1897 | ret = btrfs_end_transaction(trans, root); |
1897 | } | 1898 | } |
1898 | out: | 1899 | out: |
1899 | return ret > 0 ? -EIO : ret; | 1900 | return ret > 0 ? -EIO : ret; |
1900 | } | 1901 | } |
1901 | 1902 | ||
1902 | static const struct vm_operations_struct btrfs_file_vm_ops = { | 1903 | static const struct vm_operations_struct btrfs_file_vm_ops = { |
1903 | .fault = filemap_fault, | 1904 | .fault = filemap_fault, |
1904 | .page_mkwrite = btrfs_page_mkwrite, | 1905 | .page_mkwrite = btrfs_page_mkwrite, |
1905 | .remap_pages = generic_file_remap_pages, | 1906 | .remap_pages = generic_file_remap_pages, |
1906 | }; | 1907 | }; |
1907 | 1908 | ||
1908 | static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma) | 1909 | static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma) |
1909 | { | 1910 | { |
1910 | struct address_space *mapping = filp->f_mapping; | 1911 | struct address_space *mapping = filp->f_mapping; |
1911 | 1912 | ||
1912 | if (!mapping->a_ops->readpage) | 1913 | if (!mapping->a_ops->readpage) |
1913 | return -ENOEXEC; | 1914 | return -ENOEXEC; |
1914 | 1915 | ||
1915 | file_accessed(filp); | 1916 | file_accessed(filp); |
1916 | vma->vm_ops = &btrfs_file_vm_ops; | 1917 | vma->vm_ops = &btrfs_file_vm_ops; |
1917 | 1918 | ||
1918 | return 0; | 1919 | return 0; |
1919 | } | 1920 | } |
1920 | 1921 | ||
1921 | static int hole_mergeable(struct inode *inode, struct extent_buffer *leaf, | 1922 | static int hole_mergeable(struct inode *inode, struct extent_buffer *leaf, |
1922 | int slot, u64 start, u64 end) | 1923 | int slot, u64 start, u64 end) |
1923 | { | 1924 | { |
1924 | struct btrfs_file_extent_item *fi; | 1925 | struct btrfs_file_extent_item *fi; |
1925 | struct btrfs_key key; | 1926 | struct btrfs_key key; |
1926 | 1927 | ||
1927 | if (slot < 0 || slot >= btrfs_header_nritems(leaf)) | 1928 | if (slot < 0 || slot >= btrfs_header_nritems(leaf)) |
1928 | return 0; | 1929 | return 0; |
1929 | 1930 | ||
1930 | btrfs_item_key_to_cpu(leaf, &key, slot); | 1931 | btrfs_item_key_to_cpu(leaf, &key, slot); |
1931 | if (key.objectid != btrfs_ino(inode) || | 1932 | if (key.objectid != btrfs_ino(inode) || |
1932 | key.type != BTRFS_EXTENT_DATA_KEY) | 1933 | key.type != BTRFS_EXTENT_DATA_KEY) |
1933 | return 0; | 1934 | return 0; |
1934 | 1935 | ||
1935 | fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item); | 1936 | fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item); |
1936 | 1937 | ||
1937 | if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG) | 1938 | if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG) |
1938 | return 0; | 1939 | return 0; |
1939 | 1940 | ||
1940 | if (btrfs_file_extent_disk_bytenr(leaf, fi)) | 1941 | if (btrfs_file_extent_disk_bytenr(leaf, fi)) |
1941 | return 0; | 1942 | return 0; |
1942 | 1943 | ||
1943 | if (key.offset == end) | 1944 | if (key.offset == end) |
1944 | return 1; | 1945 | return 1; |
1945 | if (key.offset + btrfs_file_extent_num_bytes(leaf, fi) == start) | 1946 | if (key.offset + btrfs_file_extent_num_bytes(leaf, fi) == start) |
1946 | return 1; | 1947 | return 1; |
1947 | return 0; | 1948 | return 0; |
1948 | } | 1949 | } |
1949 | 1950 | ||
1950 | static int fill_holes(struct btrfs_trans_handle *trans, struct inode *inode, | 1951 | static int fill_holes(struct btrfs_trans_handle *trans, struct inode *inode, |
1951 | struct btrfs_path *path, u64 offset, u64 end) | 1952 | struct btrfs_path *path, u64 offset, u64 end) |
1952 | { | 1953 | { |
1953 | struct btrfs_root *root = BTRFS_I(inode)->root; | 1954 | struct btrfs_root *root = BTRFS_I(inode)->root; |
1954 | struct extent_buffer *leaf; | 1955 | struct extent_buffer *leaf; |
1955 | struct btrfs_file_extent_item *fi; | 1956 | struct btrfs_file_extent_item *fi; |
1956 | struct extent_map *hole_em; | 1957 | struct extent_map *hole_em; |
1957 | struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; | 1958 | struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; |
1958 | struct btrfs_key key; | 1959 | struct btrfs_key key; |
1959 | int ret; | 1960 | int ret; |
1960 | 1961 | ||
1961 | key.objectid = btrfs_ino(inode); | 1962 | key.objectid = btrfs_ino(inode); |
1962 | key.type = BTRFS_EXTENT_DATA_KEY; | 1963 | key.type = BTRFS_EXTENT_DATA_KEY; |
1963 | key.offset = offset; | 1964 | key.offset = offset; |
1964 | 1965 | ||
1965 | 1966 | ||
1966 | ret = btrfs_search_slot(trans, root, &key, path, 0, 1); | 1967 | ret = btrfs_search_slot(trans, root, &key, path, 0, 1); |
1967 | if (ret < 0) | 1968 | if (ret < 0) |
1968 | return ret; | 1969 | return ret; |
1969 | BUG_ON(!ret); | 1970 | BUG_ON(!ret); |
1970 | 1971 | ||
1971 | leaf = path->nodes[0]; | 1972 | leaf = path->nodes[0]; |
1972 | if (hole_mergeable(inode, leaf, path->slots[0]-1, offset, end)) { | 1973 | if (hole_mergeable(inode, leaf, path->slots[0]-1, offset, end)) { |
1973 | u64 num_bytes; | 1974 | u64 num_bytes; |
1974 | 1975 | ||
1975 | path->slots[0]--; | 1976 | path->slots[0]--; |
1976 | fi = btrfs_item_ptr(leaf, path->slots[0], | 1977 | fi = btrfs_item_ptr(leaf, path->slots[0], |
1977 | struct btrfs_file_extent_item); | 1978 | struct btrfs_file_extent_item); |
1978 | num_bytes = btrfs_file_extent_num_bytes(leaf, fi) + | 1979 | num_bytes = btrfs_file_extent_num_bytes(leaf, fi) + |
1979 | end - offset; | 1980 | end - offset; |
1980 | btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes); | 1981 | btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes); |
1981 | btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes); | 1982 | btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes); |
1982 | btrfs_set_file_extent_offset(leaf, fi, 0); | 1983 | btrfs_set_file_extent_offset(leaf, fi, 0); |
1983 | btrfs_mark_buffer_dirty(leaf); | 1984 | btrfs_mark_buffer_dirty(leaf); |
1984 | goto out; | 1985 | goto out; |
1985 | } | 1986 | } |
1986 | 1987 | ||
1987 | if (hole_mergeable(inode, leaf, path->slots[0]+1, offset, end)) { | 1988 | if (hole_mergeable(inode, leaf, path->slots[0]+1, offset, end)) { |
1988 | u64 num_bytes; | 1989 | u64 num_bytes; |
1989 | 1990 | ||
1990 | path->slots[0]++; | 1991 | path->slots[0]++; |
1991 | key.offset = offset; | 1992 | key.offset = offset; |
1992 | btrfs_set_item_key_safe(root, path, &key); | 1993 | btrfs_set_item_key_safe(root, path, &key); |
1993 | fi = btrfs_item_ptr(leaf, path->slots[0], | 1994 | fi = btrfs_item_ptr(leaf, path->slots[0], |
1994 | struct btrfs_file_extent_item); | 1995 | struct btrfs_file_extent_item); |
1995 | num_bytes = btrfs_file_extent_num_bytes(leaf, fi) + end - | 1996 | num_bytes = btrfs_file_extent_num_bytes(leaf, fi) + end - |
1996 | offset; | 1997 | offset; |
1997 | btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes); | 1998 | btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes); |
1998 | btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes); | 1999 | btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes); |
1999 | btrfs_set_file_extent_offset(leaf, fi, 0); | 2000 | btrfs_set_file_extent_offset(leaf, fi, 0); |
2000 | btrfs_mark_buffer_dirty(leaf); | 2001 | btrfs_mark_buffer_dirty(leaf); |
2001 | goto out; | 2002 | goto out; |
2002 | } | 2003 | } |
2003 | btrfs_release_path(path); | 2004 | btrfs_release_path(path); |
2004 | 2005 | ||
2005 | ret = btrfs_insert_file_extent(trans, root, btrfs_ino(inode), offset, | 2006 | ret = btrfs_insert_file_extent(trans, root, btrfs_ino(inode), offset, |
2006 | 0, 0, end - offset, 0, end - offset, | 2007 | 0, 0, end - offset, 0, end - offset, |
2007 | 0, 0, 0); | 2008 | 0, 0, 0); |
2008 | if (ret) | 2009 | if (ret) |
2009 | return ret; | 2010 | return ret; |
2010 | 2011 | ||
2011 | out: | 2012 | out: |
2012 | btrfs_release_path(path); | 2013 | btrfs_release_path(path); |
2013 | 2014 | ||
2014 | hole_em = alloc_extent_map(); | 2015 | hole_em = alloc_extent_map(); |
2015 | if (!hole_em) { | 2016 | if (!hole_em) { |
2016 | btrfs_drop_extent_cache(inode, offset, end - 1, 0); | 2017 | btrfs_drop_extent_cache(inode, offset, end - 1, 0); |
2017 | set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, | 2018 | set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, |
2018 | &BTRFS_I(inode)->runtime_flags); | 2019 | &BTRFS_I(inode)->runtime_flags); |
2019 | } else { | 2020 | } else { |
2020 | hole_em->start = offset; | 2021 | hole_em->start = offset; |
2021 | hole_em->len = end - offset; | 2022 | hole_em->len = end - offset; |
2022 | hole_em->ram_bytes = hole_em->len; | 2023 | hole_em->ram_bytes = hole_em->len; |
2023 | hole_em->orig_start = offset; | 2024 | hole_em->orig_start = offset; |
2024 | 2025 | ||
2025 | hole_em->block_start = EXTENT_MAP_HOLE; | 2026 | hole_em->block_start = EXTENT_MAP_HOLE; |
2026 | hole_em->block_len = 0; | 2027 | hole_em->block_len = 0; |
2027 | hole_em->orig_block_len = 0; | 2028 | hole_em->orig_block_len = 0; |
2028 | hole_em->bdev = root->fs_info->fs_devices->latest_bdev; | 2029 | hole_em->bdev = root->fs_info->fs_devices->latest_bdev; |
2029 | hole_em->compress_type = BTRFS_COMPRESS_NONE; | 2030 | hole_em->compress_type = BTRFS_COMPRESS_NONE; |
2030 | hole_em->generation = trans->transid; | 2031 | hole_em->generation = trans->transid; |
2031 | 2032 | ||
2032 | do { | 2033 | do { |
2033 | btrfs_drop_extent_cache(inode, offset, end - 1, 0); | 2034 | btrfs_drop_extent_cache(inode, offset, end - 1, 0); |
2034 | write_lock(&em_tree->lock); | 2035 | write_lock(&em_tree->lock); |
2035 | ret = add_extent_mapping(em_tree, hole_em, 1); | 2036 | ret = add_extent_mapping(em_tree, hole_em, 1); |
2036 | write_unlock(&em_tree->lock); | 2037 | write_unlock(&em_tree->lock); |
2037 | } while (ret == -EEXIST); | 2038 | } while (ret == -EEXIST); |
2038 | free_extent_map(hole_em); | 2039 | free_extent_map(hole_em); |
2039 | if (ret) | 2040 | if (ret) |
2040 | set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, | 2041 | set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, |
2041 | &BTRFS_I(inode)->runtime_flags); | 2042 | &BTRFS_I(inode)->runtime_flags); |
2042 | } | 2043 | } |
2043 | 2044 | ||
2044 | return 0; | 2045 | return 0; |
2045 | } | 2046 | } |
2046 | 2047 | ||
2047 | static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) | 2048 | static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) |
2048 | { | 2049 | { |
2049 | struct btrfs_root *root = BTRFS_I(inode)->root; | 2050 | struct btrfs_root *root = BTRFS_I(inode)->root; |
2050 | struct extent_state *cached_state = NULL; | 2051 | struct extent_state *cached_state = NULL; |
2051 | struct btrfs_path *path; | 2052 | struct btrfs_path *path; |
2052 | struct btrfs_block_rsv *rsv; | 2053 | struct btrfs_block_rsv *rsv; |
2053 | struct btrfs_trans_handle *trans; | 2054 | struct btrfs_trans_handle *trans; |
2054 | u64 lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize); | 2055 | u64 lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize); |
2055 | u64 lockend = round_down(offset + len, | 2056 | u64 lockend = round_down(offset + len, |
2056 | BTRFS_I(inode)->root->sectorsize) - 1; | 2057 | BTRFS_I(inode)->root->sectorsize) - 1; |
2057 | u64 cur_offset = lockstart; | 2058 | u64 cur_offset = lockstart; |
2058 | u64 min_size = btrfs_calc_trunc_metadata_size(root, 1); | 2059 | u64 min_size = btrfs_calc_trunc_metadata_size(root, 1); |
2059 | u64 drop_end; | 2060 | u64 drop_end; |
2060 | int ret = 0; | 2061 | int ret = 0; |
2061 | int err = 0; | 2062 | int err = 0; |
2062 | bool same_page = ((offset >> PAGE_CACHE_SHIFT) == | 2063 | bool same_page = ((offset >> PAGE_CACHE_SHIFT) == |
2063 | ((offset + len - 1) >> PAGE_CACHE_SHIFT)); | 2064 | ((offset + len - 1) >> PAGE_CACHE_SHIFT)); |
2064 | 2065 | ||
2065 | btrfs_wait_ordered_range(inode, offset, len); | 2066 | btrfs_wait_ordered_range(inode, offset, len); |
2066 | 2067 | ||
2067 | mutex_lock(&inode->i_mutex); | 2068 | mutex_lock(&inode->i_mutex); |
2068 | /* | 2069 | /* |
2069 | * We needn't truncate any page which is beyond the end of the file | 2070 | * We needn't truncate any page which is beyond the end of the file |
2070 | * because we are sure there is no data there. | 2071 | * because we are sure there is no data there. |
2071 | */ | 2072 | */ |
2072 | /* | 2073 | /* |
2073 | * Only do this if we are in the same page and we aren't doing the | 2074 | * Only do this if we are in the same page and we aren't doing the |
2074 | * entire page. | 2075 | * entire page. |
2075 | */ | 2076 | */ |
2076 | if (same_page && len < PAGE_CACHE_SIZE) { | 2077 | if (same_page && len < PAGE_CACHE_SIZE) { |
2077 | if (offset < round_up(inode->i_size, PAGE_CACHE_SIZE)) | 2078 | if (offset < round_up(inode->i_size, PAGE_CACHE_SIZE)) |
2078 | ret = btrfs_truncate_page(inode, offset, len, 0); | 2079 | ret = btrfs_truncate_page(inode, offset, len, 0); |
2079 | mutex_unlock(&inode->i_mutex); | 2080 | mutex_unlock(&inode->i_mutex); |
2080 | return ret; | 2081 | return ret; |
2081 | } | 2082 | } |
2082 | 2083 | ||
2083 | /* zero back part of the first page */ | 2084 | /* zero back part of the first page */ |
2084 | if (offset < round_up(inode->i_size, PAGE_CACHE_SIZE)) { | 2085 | if (offset < round_up(inode->i_size, PAGE_CACHE_SIZE)) { |
2085 | ret = btrfs_truncate_page(inode, offset, 0, 0); | 2086 | ret = btrfs_truncate_page(inode, offset, 0, 0); |
2086 | if (ret) { | 2087 | if (ret) { |
2087 | mutex_unlock(&inode->i_mutex); | 2088 | mutex_unlock(&inode->i_mutex); |
2088 | return ret; | 2089 | return ret; |
2089 | } | 2090 | } |
2090 | } | 2091 | } |
2091 | 2092 | ||
2092 | /* zero the front end of the last page */ | 2093 | /* zero the front end of the last page */ |
2093 | if (offset + len < round_up(inode->i_size, PAGE_CACHE_SIZE)) { | 2094 | if (offset + len < round_up(inode->i_size, PAGE_CACHE_SIZE)) { |
2094 | ret = btrfs_truncate_page(inode, offset + len, 0, 1); | 2095 | ret = btrfs_truncate_page(inode, offset + len, 0, 1); |
2095 | if (ret) { | 2096 | if (ret) { |
2096 | mutex_unlock(&inode->i_mutex); | 2097 | mutex_unlock(&inode->i_mutex); |
2097 | return ret; | 2098 | return ret; |
2098 | } | 2099 | } |
2099 | } | 2100 | } |
2100 | 2101 | ||
2101 | if (lockend < lockstart) { | 2102 | if (lockend < lockstart) { |
2102 | mutex_unlock(&inode->i_mutex); | 2103 | mutex_unlock(&inode->i_mutex); |
2103 | return 0; | 2104 | return 0; |
2104 | } | 2105 | } |
2105 | 2106 | ||
2106 | while (1) { | 2107 | while (1) { |
2107 | struct btrfs_ordered_extent *ordered; | 2108 | struct btrfs_ordered_extent *ordered; |
2108 | 2109 | ||
2109 | truncate_pagecache_range(inode, lockstart, lockend); | 2110 | truncate_pagecache_range(inode, lockstart, lockend); |
2110 | 2111 | ||
2111 | lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, | 2112 | lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, |
2112 | 0, &cached_state); | 2113 | 0, &cached_state); |
2113 | ordered = btrfs_lookup_first_ordered_extent(inode, lockend); | 2114 | ordered = btrfs_lookup_first_ordered_extent(inode, lockend); |
2114 | 2115 | ||
2115 | /* | 2116 | /* |
2116 | * We need to make sure we have no ordered extents in this range | 2117 | * We need to make sure we have no ordered extents in this range |
2117 | * and that nobody raced in and read a page in this range; if they | 2118 | * and that nobody raced in and read a page in this range; if they |
2118 | * did, we need to try again. | 2119 | * did, we need to try again. |
2119 | */ | 2120 | */ |
2120 | if ((!ordered || | 2121 | if ((!ordered || |
2121 | (ordered->file_offset + ordered->len < lockstart || | 2122 | (ordered->file_offset + ordered->len < lockstart || |
2122 | ordered->file_offset > lockend)) && | 2123 | ordered->file_offset > lockend)) && |
2123 | !test_range_bit(&BTRFS_I(inode)->io_tree, lockstart, | 2124 | !test_range_bit(&BTRFS_I(inode)->io_tree, lockstart, |
2124 | lockend, EXTENT_UPTODATE, 0, | 2125 | lockend, EXTENT_UPTODATE, 0, |
2125 | cached_state)) { | 2126 | cached_state)) { |
2126 | if (ordered) | 2127 | if (ordered) |
2127 | btrfs_put_ordered_extent(ordered); | 2128 | btrfs_put_ordered_extent(ordered); |
2128 | break; | 2129 | break; |
2129 | } | 2130 | } |
2130 | if (ordered) | 2131 | if (ordered) |
2131 | btrfs_put_ordered_extent(ordered); | 2132 | btrfs_put_ordered_extent(ordered); |
2132 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, | 2133 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, |
2133 | lockend, &cached_state, GFP_NOFS); | 2134 | lockend, &cached_state, GFP_NOFS); |
2134 | btrfs_wait_ordered_range(inode, lockstart, | 2135 | btrfs_wait_ordered_range(inode, lockstart, |
2135 | lockend - lockstart + 1); | 2136 | lockend - lockstart + 1); |
2136 | } | 2137 | } |
2137 | 2138 | ||
2138 | path = btrfs_alloc_path(); | 2139 | path = btrfs_alloc_path(); |
2139 | if (!path) { | 2140 | if (!path) { |
2140 | ret = -ENOMEM; | 2141 | ret = -ENOMEM; |
2141 | goto out; | 2142 | goto out; |
2142 | } | 2143 | } |
2143 | 2144 | ||
2144 | rsv = btrfs_alloc_block_rsv(root, BTRFS_BLOCK_RSV_TEMP); | 2145 | rsv = btrfs_alloc_block_rsv(root, BTRFS_BLOCK_RSV_TEMP); |
2145 | if (!rsv) { | 2146 | if (!rsv) { |
2146 | ret = -ENOMEM; | 2147 | ret = -ENOMEM; |
2147 | goto out_free; | 2148 | goto out_free; |
2148 | } | 2149 | } |
2149 | rsv->size = btrfs_calc_trunc_metadata_size(root, 1); | 2150 | rsv->size = btrfs_calc_trunc_metadata_size(root, 1); |
2150 | rsv->failfast = 1; | 2151 | rsv->failfast = 1; |
2151 | 2152 | ||
2152 | /* | 2153 | /* |
2153 | * 1 - update the inode | 2154 | * 1 - update the inode |
2154 | * 1 - removing the extents in the range | 2155 | * 1 - removing the extents in the range |
2155 | * 1 - adding the hole extent | 2156 | * 1 - adding the hole extent |
2156 | */ | 2157 | */ |
2157 | trans = btrfs_start_transaction(root, 3); | 2158 | trans = btrfs_start_transaction(root, 3); |
2158 | if (IS_ERR(trans)) { | 2159 | if (IS_ERR(trans)) { |
2159 | err = PTR_ERR(trans); | 2160 | err = PTR_ERR(trans); |
2160 | goto out_free; | 2161 | goto out_free; |
2161 | } | 2162 | } |
2162 | 2163 | ||
2163 | ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, rsv, | 2164 | ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, rsv, |
2164 | min_size); | 2165 | min_size); |
2165 | BUG_ON(ret); | 2166 | BUG_ON(ret); |
2166 | trans->block_rsv = rsv; | 2167 | trans->block_rsv = rsv; |
2167 | 2168 | ||
2168 | while (cur_offset < lockend) { | 2169 | while (cur_offset < lockend) { |
2169 | ret = __btrfs_drop_extents(trans, root, inode, path, | 2170 | ret = __btrfs_drop_extents(trans, root, inode, path, |
2170 | cur_offset, lockend + 1, | 2171 | cur_offset, lockend + 1, |
2171 | &drop_end, 1); | 2172 | &drop_end, 1); |
2172 | if (ret != -ENOSPC) | 2173 | if (ret != -ENOSPC) |
2173 | break; | 2174 | break; |
2174 | 2175 | ||
2175 | trans->block_rsv = &root->fs_info->trans_block_rsv; | 2176 | trans->block_rsv = &root->fs_info->trans_block_rsv; |
2176 | 2177 | ||
2177 | ret = fill_holes(trans, inode, path, cur_offset, drop_end); | 2178 | ret = fill_holes(trans, inode, path, cur_offset, drop_end); |
2178 | if (ret) { | 2179 | if (ret) { |
2179 | err = ret; | 2180 | err = ret; |
2180 | break; | 2181 | break; |
2181 | } | 2182 | } |
2182 | 2183 | ||
2183 | cur_offset = drop_end; | 2184 | cur_offset = drop_end; |
2184 | 2185 | ||
2185 | ret = btrfs_update_inode(trans, root, inode); | 2186 | ret = btrfs_update_inode(trans, root, inode); |
2186 | if (ret) { | 2187 | if (ret) { |
2187 | err = ret; | 2188 | err = ret; |
2188 | break; | 2189 | break; |
2189 | } | 2190 | } |
2190 | 2191 | ||
2191 | btrfs_end_transaction(trans, root); | 2192 | btrfs_end_transaction(trans, root); |
2192 | btrfs_btree_balance_dirty(root); | 2193 | btrfs_btree_balance_dirty(root); |
2193 | 2194 | ||
2194 | trans = btrfs_start_transaction(root, 3); | 2195 | trans = btrfs_start_transaction(root, 3); |
2195 | if (IS_ERR(trans)) { | 2196 | if (IS_ERR(trans)) { |
2196 | ret = PTR_ERR(trans); | 2197 | ret = PTR_ERR(trans); |
2197 | trans = NULL; | 2198 | trans = NULL; |
2198 | break; | 2199 | break; |
2199 | } | 2200 | } |
2200 | 2201 | ||
2201 | ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, | 2202 | ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, |
2202 | rsv, min_size); | 2203 | rsv, min_size); |
2203 | BUG_ON(ret); /* shouldn't happen */ | 2204 | BUG_ON(ret); /* shouldn't happen */ |
2204 | trans->block_rsv = rsv; | 2205 | trans->block_rsv = rsv; |
2205 | } | 2206 | } |
2206 | 2207 | ||
2207 | if (ret) { | 2208 | if (ret) { |
2208 | err = ret; | 2209 | err = ret; |
2209 | goto out_trans; | 2210 | goto out_trans; |
2210 | } | 2211 | } |
2211 | 2212 | ||
2212 | trans->block_rsv = &root->fs_info->trans_block_rsv; | 2213 | trans->block_rsv = &root->fs_info->trans_block_rsv; |
2213 | ret = fill_holes(trans, inode, path, cur_offset, drop_end); | 2214 | ret = fill_holes(trans, inode, path, cur_offset, drop_end); |
2214 | if (ret) { | 2215 | if (ret) { |
2215 | err = ret; | 2216 | err = ret; |
2216 | goto out_trans; | 2217 | goto out_trans; |
2217 | } | 2218 | } |
2218 | 2219 | ||
2219 | out_trans: | 2220 | out_trans: |
2220 | if (!trans) | 2221 | if (!trans) |
2221 | goto out_free; | 2222 | goto out_free; |
2222 | 2223 | ||
2223 | inode_inc_iversion(inode); | 2224 | inode_inc_iversion(inode); |
2224 | inode->i_mtime = inode->i_ctime = CURRENT_TIME; | 2225 | inode->i_mtime = inode->i_ctime = CURRENT_TIME; |
2225 | 2226 | ||
2226 | trans->block_rsv = &root->fs_info->trans_block_rsv; | 2227 | trans->block_rsv = &root->fs_info->trans_block_rsv; |
2227 | ret = btrfs_update_inode(trans, root, inode); | 2228 | ret = btrfs_update_inode(trans, root, inode); |
2228 | btrfs_end_transaction(trans, root); | 2229 | btrfs_end_transaction(trans, root); |
2229 | btrfs_btree_balance_dirty(root); | 2230 | btrfs_btree_balance_dirty(root); |
2230 | out_free: | 2231 | out_free: |
2231 | btrfs_free_path(path); | 2232 | btrfs_free_path(path); |
2232 | btrfs_free_block_rsv(root, rsv); | 2233 | btrfs_free_block_rsv(root, rsv); |
2233 | out: | 2234 | out: |
2234 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, | 2235 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, |
2235 | &cached_state, GFP_NOFS); | 2236 | &cached_state, GFP_NOFS); |
2236 | mutex_unlock(&inode->i_mutex); | 2237 | mutex_unlock(&inode->i_mutex); |
2237 | if (ret && !err) | 2238 | if (ret && !err) |
2238 | err = ret; | 2239 | err = ret; |
2239 | return err; | 2240 | return err; |
2240 | } | 2241 | } |
2241 | 2242 | ||
2242 | static long btrfs_fallocate(struct file *file, int mode, | 2243 | static long btrfs_fallocate(struct file *file, int mode, |
2243 | loff_t offset, loff_t len) | 2244 | loff_t offset, loff_t len) |
2244 | { | 2245 | { |
2245 | struct inode *inode = file_inode(file); | 2246 | struct inode *inode = file_inode(file); |
2246 | struct extent_state *cached_state = NULL; | 2247 | struct extent_state *cached_state = NULL; |
2247 | struct btrfs_root *root = BTRFS_I(inode)->root; | 2248 | struct btrfs_root *root = BTRFS_I(inode)->root; |
2248 | u64 cur_offset; | 2249 | u64 cur_offset; |
2249 | u64 last_byte; | 2250 | u64 last_byte; |
2250 | u64 alloc_start; | 2251 | u64 alloc_start; |
2251 | u64 alloc_end; | 2252 | u64 alloc_end; |
2252 | u64 alloc_hint = 0; | 2253 | u64 alloc_hint = 0; |
2253 | u64 locked_end; | 2254 | u64 locked_end; |
2254 | struct extent_map *em; | 2255 | struct extent_map *em; |
2255 | int blocksize = BTRFS_I(inode)->root->sectorsize; | 2256 | int blocksize = BTRFS_I(inode)->root->sectorsize; |
2256 | int ret; | 2257 | int ret; |
2257 | 2258 | ||
2258 | alloc_start = round_down(offset, blocksize); | 2259 | alloc_start = round_down(offset, blocksize); |
2259 | alloc_end = round_up(offset + len, blocksize); | 2260 | alloc_end = round_up(offset + len, blocksize); |
2260 | 2261 | ||
2261 | /* Make sure we aren't being given some crap mode */ | 2262 | /* Make sure we aren't being given some crap mode */ |
2262 | if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) | 2263 | if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) |
2263 | return -EOPNOTSUPP; | 2264 | return -EOPNOTSUPP; |
2264 | 2265 | ||
2265 | if (mode & FALLOC_FL_PUNCH_HOLE) | 2266 | if (mode & FALLOC_FL_PUNCH_HOLE) |
2266 | return btrfs_punch_hole(inode, offset, len); | 2267 | return btrfs_punch_hole(inode, offset, len); |
2267 | 2268 | ||
2268 | /* | 2269 | /* |
2269 | * Make sure we have enough space before we do the | 2270 | * Make sure we have enough space before we do the |
2270 | * allocation. | 2271 | * allocation. |
2271 | */ | 2272 | */ |
2272 | ret = btrfs_check_data_free_space(inode, alloc_end - alloc_start); | 2273 | ret = btrfs_check_data_free_space(inode, alloc_end - alloc_start); |
2273 | if (ret) | 2274 | if (ret) |
2274 | return ret; | 2275 | return ret; |
2275 | if (root->fs_info->quota_enabled) { | 2276 | if (root->fs_info->quota_enabled) { |
2276 | ret = btrfs_qgroup_reserve(root, alloc_end - alloc_start); | 2277 | ret = btrfs_qgroup_reserve(root, alloc_end - alloc_start); |
2277 | if (ret) | 2278 | if (ret) |
2278 | goto out_reserve_fail; | 2279 | goto out_reserve_fail; |
2279 | } | 2280 | } |
2280 | 2281 | ||
2281 | mutex_lock(&inode->i_mutex); | 2282 | mutex_lock(&inode->i_mutex); |
2282 | ret = inode_newsize_ok(inode, alloc_end); | 2283 | ret = inode_newsize_ok(inode, alloc_end); |
2283 | if (ret) | 2284 | if (ret) |
2284 | goto out; | 2285 | goto out; |
2285 | 2286 | ||
2286 | if (alloc_start > inode->i_size) { | 2287 | if (alloc_start > inode->i_size) { |
2287 | ret = btrfs_cont_expand(inode, i_size_read(inode), | 2288 | ret = btrfs_cont_expand(inode, i_size_read(inode), |
2288 | alloc_start); | 2289 | alloc_start); |
2289 | if (ret) | 2290 | if (ret) |
2290 | goto out; | 2291 | goto out; |
2291 | } else { | 2292 | } else { |
2292 | /* | 2293 | /* |
2293 | * If we are fallocating from the end of the file onward we | 2294 | * If we are fallocating from the end of the file onward we |
2294 | * need to zero out the end of the page if i_size lands in the | 2295 | * need to zero out the end of the page if i_size lands in the |
2295 | * middle of a page. | 2296 | * middle of a page. |
2296 | */ | 2297 | */ |
2297 | ret = btrfs_truncate_page(inode, inode->i_size, 0, 0); | 2298 | ret = btrfs_truncate_page(inode, inode->i_size, 0, 0); |
2298 | if (ret) | 2299 | if (ret) |
2299 | goto out; | 2300 | goto out; |
2300 | } | 2301 | } |
2301 | 2302 | ||
2302 | /* | 2303 | /* |
2303 | * wait for ordered IO before we have any locks. We'll loop again | 2304 | * wait for ordered IO before we have any locks. We'll loop again |
2304 | * below with the locks held. | 2305 | * below with the locks held. |
2305 | */ | 2306 | */ |
2306 | btrfs_wait_ordered_range(inode, alloc_start, alloc_end - alloc_start); | 2307 | btrfs_wait_ordered_range(inode, alloc_start, alloc_end - alloc_start); |
2307 | 2308 | ||
2308 | locked_end = alloc_end - 1; | 2309 | locked_end = alloc_end - 1; |
2309 | while (1) { | 2310 | while (1) { |
2310 | struct btrfs_ordered_extent *ordered; | 2311 | struct btrfs_ordered_extent *ordered; |
2311 | 2312 | ||
2312 | /* the extent lock is ordered inside the running | 2313 | /* the extent lock is ordered inside the running |
2313 | * transaction | 2314 | * transaction |
2314 | */ | 2315 | */ |
2315 | lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, | 2316 | lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, |
2316 | locked_end, 0, &cached_state); | 2317 | locked_end, 0, &cached_state); |
2317 | ordered = btrfs_lookup_first_ordered_extent(inode, | 2318 | ordered = btrfs_lookup_first_ordered_extent(inode, |
2318 | alloc_end - 1); | 2319 | alloc_end - 1); |
2319 | if (ordered && | 2320 | if (ordered && |
2320 | ordered->file_offset + ordered->len > alloc_start && | 2321 | ordered->file_offset + ordered->len > alloc_start && |
2321 | ordered->file_offset < alloc_end) { | 2322 | ordered->file_offset < alloc_end) { |
2322 | btrfs_put_ordered_extent(ordered); | 2323 | btrfs_put_ordered_extent(ordered); |
2323 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, | 2324 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, |
2324 | alloc_start, locked_end, | 2325 | alloc_start, locked_end, |
2325 | &cached_state, GFP_NOFS); | 2326 | &cached_state, GFP_NOFS); |
2326 | /* | 2327 | /* |
2327 | * we can't wait on the range with the transaction | 2328 | * we can't wait on the range with the transaction |
2328 | * running or with the extent lock held | 2329 | * running or with the extent lock held |
2329 | */ | 2330 | */ |
2330 | btrfs_wait_ordered_range(inode, alloc_start, | 2331 | btrfs_wait_ordered_range(inode, alloc_start, |
2331 | alloc_end - alloc_start); | 2332 | alloc_end - alloc_start); |
2332 | } else { | 2333 | } else { |
2333 | if (ordered) | 2334 | if (ordered) |
2334 | btrfs_put_ordered_extent(ordered); | 2335 | btrfs_put_ordered_extent(ordered); |
2335 | break; | 2336 | break; |
2336 | } | 2337 | } |
2337 | } | 2338 | } |
2338 | 2339 | ||
2339 | cur_offset = alloc_start; | 2340 | cur_offset = alloc_start; |
2340 | while (1) { | 2341 | while (1) { |
2341 | u64 actual_end; | 2342 | u64 actual_end; |
2342 | 2343 | ||
2343 | em = btrfs_get_extent(inode, NULL, 0, cur_offset, | 2344 | em = btrfs_get_extent(inode, NULL, 0, cur_offset, |
2344 | alloc_end - cur_offset, 0); | 2345 | alloc_end - cur_offset, 0); |
2345 | if (IS_ERR_OR_NULL(em)) { | 2346 | if (IS_ERR_OR_NULL(em)) { |
2346 | if (!em) | 2347 | if (!em) |
2347 | ret = -ENOMEM; | 2348 | ret = -ENOMEM; |
2348 | else | 2349 | else |
2349 | ret = PTR_ERR(em); | 2350 | ret = PTR_ERR(em); |
2350 | break; | 2351 | break; |
2351 | } | 2352 | } |
2352 | last_byte = min(extent_map_end(em), alloc_end); | 2353 | last_byte = min(extent_map_end(em), alloc_end); |
2353 | actual_end = min_t(u64, extent_map_end(em), offset + len); | 2354 | actual_end = min_t(u64, extent_map_end(em), offset + len); |
2354 | last_byte = ALIGN(last_byte, blocksize); | 2355 | last_byte = ALIGN(last_byte, blocksize); |
2355 | 2356 | ||
2356 | if (em->block_start == EXTENT_MAP_HOLE || | 2357 | if (em->block_start == EXTENT_MAP_HOLE || |
2357 | (cur_offset >= inode->i_size && | 2358 | (cur_offset >= inode->i_size && |
2358 | !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) { | 2359 | !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) { |
2359 | ret = btrfs_prealloc_file_range(inode, mode, cur_offset, | 2360 | ret = btrfs_prealloc_file_range(inode, mode, cur_offset, |
2360 | last_byte - cur_offset, | 2361 | last_byte - cur_offset, |
2361 | 1 << inode->i_blkbits, | 2362 | 1 << inode->i_blkbits, |
2362 | offset + len, | 2363 | offset + len, |
2363 | &alloc_hint); | 2364 | &alloc_hint); |
2364 | 2365 | ||
2365 | if (ret < 0) { | 2366 | if (ret < 0) { |
2366 | free_extent_map(em); | 2367 | free_extent_map(em); |
2367 | break; | 2368 | break; |
2368 | } | 2369 | } |
2369 | } else if (actual_end > inode->i_size && | 2370 | } else if (actual_end > inode->i_size && |
2370 | !(mode & FALLOC_FL_KEEP_SIZE)) { | 2371 | !(mode & FALLOC_FL_KEEP_SIZE)) { |
2371 | /* | 2372 | /* |
2372 | * We didn't need to allocate any more space, but we | 2373 | * We didn't need to allocate any more space, but we |
2373 | * still extended the size of the file so we need to | 2374 | * still extended the size of the file so we need to |
2374 | * update i_size. | 2375 | * update i_size. |
2375 | */ | 2376 | */ |
2376 | inode->i_ctime = CURRENT_TIME; | 2377 | inode->i_ctime = CURRENT_TIME; |
2377 | i_size_write(inode, actual_end); | 2378 | i_size_write(inode, actual_end); |
2378 | btrfs_ordered_update_i_size(inode, actual_end, NULL); | 2379 | btrfs_ordered_update_i_size(inode, actual_end, NULL); |
2379 | } | 2380 | } |
2380 | free_extent_map(em); | 2381 | free_extent_map(em); |
2381 | 2382 | ||
2382 | cur_offset = last_byte; | 2383 | cur_offset = last_byte; |
2383 | if (cur_offset >= alloc_end) { | 2384 | if (cur_offset >= alloc_end) { |
2384 | ret = 0; | 2385 | ret = 0; |
2385 | break; | 2386 | break; |
2386 | } | 2387 | } |
2387 | } | 2388 | } |
2388 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, | 2389 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, |
2389 | &cached_state, GFP_NOFS); | 2390 | &cached_state, GFP_NOFS); |
2390 | out: | 2391 | out: |
2391 | mutex_unlock(&inode->i_mutex); | 2392 | mutex_unlock(&inode->i_mutex); |
2392 | if (root->fs_info->quota_enabled) | 2393 | if (root->fs_info->quota_enabled) |
2393 | btrfs_qgroup_free(root, alloc_end - alloc_start); | 2394 | btrfs_qgroup_free(root, alloc_end - alloc_start); |
2394 | out_reserve_fail: | 2395 | out_reserve_fail: |
2395 | /* Let go of our reservation. */ | 2396 | /* Let go of our reservation. */ |
2396 | btrfs_free_reserved_data_space(inode, alloc_end - alloc_start); | 2397 | btrfs_free_reserved_data_space(inode, alloc_end - alloc_start); |
2397 | return ret; | 2398 | return ret; |
2398 | } | 2399 | } |
2399 | 2400 | ||
2400 | static int find_desired_extent(struct inode *inode, loff_t *offset, int whence) | 2401 | static int find_desired_extent(struct inode *inode, loff_t *offset, int whence) |
2401 | { | 2402 | { |
2402 | struct btrfs_root *root = BTRFS_I(inode)->root; | 2403 | struct btrfs_root *root = BTRFS_I(inode)->root; |
2403 | struct extent_map *em; | 2404 | struct extent_map *em; |
2404 | struct extent_state *cached_state = NULL; | 2405 | struct extent_state *cached_state = NULL; |
2405 | u64 lockstart = *offset; | 2406 | u64 lockstart = *offset; |
2406 | u64 lockend = i_size_read(inode); | 2407 | u64 lockend = i_size_read(inode); |
2407 | u64 start = *offset; | 2408 | u64 start = *offset; |
2408 | u64 orig_start = *offset; | 2409 | u64 orig_start = *offset; |
2409 | u64 len = i_size_read(inode); | 2410 | u64 len = i_size_read(inode); |
2410 | u64 last_end = 0; | 2411 | u64 last_end = 0; |
2411 | int ret = 0; | 2412 | int ret = 0; |
2412 | 2413 | ||
2413 | lockend = max_t(u64, root->sectorsize, lockend); | 2414 | lockend = max_t(u64, root->sectorsize, lockend); |
2414 | if (lockend <= lockstart) | 2415 | if (lockend <= lockstart) |
2415 | lockend = lockstart + root->sectorsize; | 2416 | lockend = lockstart + root->sectorsize; |
2416 | 2417 | ||
2417 | lockend--; | 2418 | lockend--; |
2418 | len = lockend - lockstart + 1; | 2419 | len = lockend - lockstart + 1; |
2419 | 2420 | ||
2420 | len = max_t(u64, len, root->sectorsize); | 2421 | len = max_t(u64, len, root->sectorsize); |
2421 | if (inode->i_size == 0) | 2422 | if (inode->i_size == 0) |
2422 | return -ENXIO; | 2423 | return -ENXIO; |
2423 | 2424 | ||
2424 | lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, 0, | 2425 | lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, 0, |
2425 | &cached_state); | 2426 | &cached_state); |
2426 | 2427 | ||
2427 | /* | 2428 | /* |
2428 | * Delalloc is such a pain. If we have a hole and we have pending | 2429 | * Delalloc is such a pain. If we have a hole and we have pending |
2429 | * delalloc for a portion of the hole we will get back a hole that | 2430 | * delalloc for a portion of the hole we will get back a hole that |
2430 | * exists for the entire range since it hasn't been actually written | 2431 | * exists for the entire range since it hasn't been actually written |
2431 | * yet. So to take care of this case we need to look for an extent just | 2432 | * yet. So to take care of this case we need to look for an extent just |
2432 | * before the position we want in case there is outstanding delalloc | 2433 | * before the position we want in case there is outstanding delalloc |
2433 | * going on here. | 2434 | * going on here. |
2434 | */ | 2435 | */ |
2435 | if (whence == SEEK_HOLE && start != 0) { | 2436 | if (whence == SEEK_HOLE && start != 0) { |
2436 | if (start <= root->sectorsize) | 2437 | if (start <= root->sectorsize) |
2437 | em = btrfs_get_extent_fiemap(inode, NULL, 0, 0, | 2438 | em = btrfs_get_extent_fiemap(inode, NULL, 0, 0, |
2438 | root->sectorsize, 0); | 2439 | root->sectorsize, 0); |
2439 | else | 2440 | else |
2440 | em = btrfs_get_extent_fiemap(inode, NULL, 0, | 2441 | em = btrfs_get_extent_fiemap(inode, NULL, 0, |
2441 | start - root->sectorsize, | 2442 | start - root->sectorsize, |
2442 | root->sectorsize, 0); | 2443 | root->sectorsize, 0); |
2443 | if (IS_ERR(em)) { | 2444 | if (IS_ERR(em)) { |
2444 | ret = PTR_ERR(em); | 2445 | ret = PTR_ERR(em); |
2445 | goto out; | 2446 | goto out; |
2446 | } | 2447 | } |
2447 | last_end = em->start + em->len; | 2448 | last_end = em->start + em->len; |
2448 | if (em->block_start == EXTENT_MAP_DELALLOC) | 2449 | if (em->block_start == EXTENT_MAP_DELALLOC) |
2449 | last_end = min_t(u64, last_end, inode->i_size); | 2450 | last_end = min_t(u64, last_end, inode->i_size); |
2450 | free_extent_map(em); | 2451 | free_extent_map(em); |
2451 | } | 2452 | } |
2452 | 2453 | ||
2453 | while (1) { | 2454 | while (1) { |
2454 | em = btrfs_get_extent_fiemap(inode, NULL, 0, start, len, 0); | 2455 | em = btrfs_get_extent_fiemap(inode, NULL, 0, start, len, 0); |
2455 | if (IS_ERR(em)) { | 2456 | if (IS_ERR(em)) { |
2456 | ret = PTR_ERR(em); | 2457 | ret = PTR_ERR(em); |
2457 | break; | 2458 | break; |
2458 | } | 2459 | } |
2459 | 2460 | ||
2460 | if (em->block_start == EXTENT_MAP_HOLE) { | 2461 | if (em->block_start == EXTENT_MAP_HOLE) { |
2461 | if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) { | 2462 | if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) { |
2462 | if (last_end <= orig_start) { | 2463 | if (last_end <= orig_start) { |
2463 | free_extent_map(em); | 2464 | free_extent_map(em); |
2464 | ret = -ENXIO; | 2465 | ret = -ENXIO; |
2465 | break; | 2466 | break; |
2466 | } | 2467 | } |
2467 | } | 2468 | } |
2468 | 2469 | ||
2469 | if (whence == SEEK_HOLE) { | 2470 | if (whence == SEEK_HOLE) { |
2470 | *offset = start; | 2471 | *offset = start; |
2471 | free_extent_map(em); | 2472 | free_extent_map(em); |
2472 | break; | 2473 | break; |
2473 | } | 2474 | } |
2474 | } else { | 2475 | } else { |
2475 | if (whence == SEEK_DATA) { | 2476 | if (whence == SEEK_DATA) { |
2476 | if (em->block_start == EXTENT_MAP_DELALLOC) { | 2477 | if (em->block_start == EXTENT_MAP_DELALLOC) { |
2477 | if (start >= inode->i_size) { | 2478 | if (start >= inode->i_size) { |
2478 | free_extent_map(em); | 2479 | free_extent_map(em); |
2479 | ret = -ENXIO; | 2480 | ret = -ENXIO; |
2480 | break; | 2481 | break; |
2481 | } | 2482 | } |
2482 | } | 2483 | } |
2483 | 2484 | ||
2484 | if (!test_bit(EXTENT_FLAG_PREALLOC, | 2485 | if (!test_bit(EXTENT_FLAG_PREALLOC, |
2485 | &em->flags)) { | 2486 | &em->flags)) { |
2486 | *offset = start; | 2487 | *offset = start; |
2487 | free_extent_map(em); | 2488 | free_extent_map(em); |
2488 | break; | 2489 | break; |
2489 | } | 2490 | } |
2490 | } | 2491 | } |
2491 | } | 2492 | } |
2492 | 2493 | ||
2493 | start = em->start + em->len; | 2494 | start = em->start + em->len; |
2494 | last_end = em->start + em->len; | 2495 | last_end = em->start + em->len; |
2495 | 2496 | ||
2496 | if (em->block_start == EXTENT_MAP_DELALLOC) | 2497 | if (em->block_start == EXTENT_MAP_DELALLOC) |
2497 | last_end = min_t(u64, last_end, inode->i_size); | 2498 | last_end = min_t(u64, last_end, inode->i_size); |
2498 | 2499 | ||
2499 | if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) { | 2500 | if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) { |
2500 | free_extent_map(em); | 2501 | free_extent_map(em); |
2501 | ret = -ENXIO; | 2502 | ret = -ENXIO; |
2502 | break; | 2503 | break; |
2503 | } | 2504 | } |
2504 | free_extent_map(em); | 2505 | free_extent_map(em); |
2505 | cond_resched(); | 2506 | cond_resched(); |
2506 | } | 2507 | } |
2507 | if (!ret) | 2508 | if (!ret) |
2508 | *offset = min(*offset, inode->i_size); | 2509 | *offset = min(*offset, inode->i_size); |
2509 | out: | 2510 | out: |
2510 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, | 2511 | unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, |
2511 | &cached_state, GFP_NOFS); | 2512 | &cached_state, GFP_NOFS); |
2512 | return ret; | 2513 | return ret; |
2513 | } | 2514 | } |
2514 | 2515 | ||
2515 | static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence) | 2516 | static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence) |
2516 | { | 2517 | { |
2517 | struct inode *inode = file->f_mapping->host; | 2518 | struct inode *inode = file->f_mapping->host; |
2518 | int ret; | 2519 | int ret; |
2519 | 2520 | ||
2520 | mutex_lock(&inode->i_mutex); | 2521 | mutex_lock(&inode->i_mutex); |
2521 | switch (whence) { | 2522 | switch (whence) { |
2522 | case SEEK_END: | 2523 | case SEEK_END: |
2523 | case SEEK_CUR: | 2524 | case SEEK_CUR: |
2524 | offset = generic_file_llseek(file, offset, whence); | 2525 | offset = generic_file_llseek(file, offset, whence); |
2525 | goto out; | 2526 | goto out; |
2526 | case SEEK_DATA: | 2527 | case SEEK_DATA: |
2527 | case SEEK_HOLE: | 2528 | case SEEK_HOLE: |
2528 | if (offset >= i_size_read(inode)) { | 2529 | if (offset >= i_size_read(inode)) { |
2529 | mutex_unlock(&inode->i_mutex); | 2530 | mutex_unlock(&inode->i_mutex); |
2530 | return -ENXIO; | 2531 | return -ENXIO; |
2531 | } | 2532 | } |
2532 | 2533 | ||
2533 | ret = find_desired_extent(inode, &offset, whence); | 2534 | ret = find_desired_extent(inode, &offset, whence); |
2534 | if (ret) { | 2535 | if (ret) { |
2535 | mutex_unlock(&inode->i_mutex); | 2536 | mutex_unlock(&inode->i_mutex); |
2536 | return ret; | 2537 | return ret; |
2537 | } | 2538 | } |
2538 | } | 2539 | } |
2539 | 2540 | ||
2540 | offset = vfs_setpos(file, offset, inode->i_sb->s_maxbytes); | 2541 | offset = vfs_setpos(file, offset, inode->i_sb->s_maxbytes); |
2541 | out: | 2542 | out: |
2542 | mutex_unlock(&inode->i_mutex); | 2543 | mutex_unlock(&inode->i_mutex); |
2543 | return offset; | 2544 | return offset; |
2544 | } | 2545 | } |
2545 | 2546 | ||
2546 | const struct file_operations btrfs_file_operations = { | 2547 | const struct file_operations btrfs_file_operations = { |
2547 | .llseek = btrfs_file_llseek, | 2548 | .llseek = btrfs_file_llseek, |
2548 | .read = do_sync_read, | 2549 | .read = do_sync_read, |
2549 | .write = do_sync_write, | 2550 | .write = do_sync_write, |
2550 | .aio_read = generic_file_aio_read, | 2551 | .aio_read = generic_file_aio_read, |
2551 | .splice_read = generic_file_splice_read, | 2552 | .splice_read = generic_file_splice_read, |
2552 | .aio_write = btrfs_file_aio_write, | 2553 | .aio_write = btrfs_file_aio_write, |
2553 | .mmap = btrfs_file_mmap, | 2554 | .mmap = btrfs_file_mmap, |
2554 | .open = generic_file_open, | 2555 | .open = generic_file_open, |
2555 | .release = btrfs_release_file, | 2556 | .release = btrfs_release_file, |
2556 | .fsync = btrfs_sync_file, | 2557 | .fsync = btrfs_sync_file, |
2557 | .fallocate = btrfs_fallocate, | 2558 | .fallocate = btrfs_fallocate, |
2558 | .unlocked_ioctl = btrfs_ioctl, | 2559 | .unlocked_ioctl = btrfs_ioctl, |
2559 | #ifdef CONFIG_COMPAT | 2560 | #ifdef CONFIG_COMPAT |
2560 | .compat_ioctl = btrfs_ioctl, | 2561 | .compat_ioctl = btrfs_ioctl, |
2561 | #endif | 2562 | #endif |
2562 | }; | 2563 | }; |
2563 | 2564 | ||
2564 | void btrfs_auto_defrag_exit(void) | 2565 | void btrfs_auto_defrag_exit(void) |
2565 | { | 2566 | { |
2566 | if (btrfs_inode_defrag_cachep) | 2567 | if (btrfs_inode_defrag_cachep) |
2567 | kmem_cache_destroy(btrfs_inode_defrag_cachep); | 2568 | kmem_cache_destroy(btrfs_inode_defrag_cachep); |
2568 | } | 2569 | } |
2569 | 2570 | ||
2570 | int btrfs_auto_defrag_init(void) | 2571 | int btrfs_auto_defrag_init(void) |
2571 | { | 2572 | { |
2572 | btrfs_inode_defrag_cachep = kmem_cache_create("btrfs_inode_defrag", | 2573 | btrfs_inode_defrag_cachep = kmem_cache_create("btrfs_inode_defrag", |
2573 | sizeof(struct inode_defrag), 0, | 2574 | sizeof(struct inode_defrag), 0, |
2574 | SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, | 2575 | SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, |
2575 | NULL); | 2576 | NULL); |
2576 | if (!btrfs_inode_defrag_cachep) | 2577 | if (!btrfs_inode_defrag_cachep) |
2577 | return -ENOMEM; | 2578 | return -ENOMEM; |
2578 | 2579 | ||
2579 | return 0; | 2580 | return 0; |
2580 | } | 2581 | } |
fs/buffer.c
1 | /* | 1 | /* |
2 | * linux/fs/buffer.c | 2 | * linux/fs/buffer.c |
3 | * | 3 | * |
4 | * Copyright (C) 1991, 1992, 2002 Linus Torvalds | 4 | * Copyright (C) 1991, 1992, 2002 Linus Torvalds |
5 | */ | 5 | */ |
6 | 6 | ||
7 | /* | 7 | /* |
8 | * Start bdflush() with kernel_thread not syscall - Paul Gortmaker, 12/95 | 8 | * Start bdflush() with kernel_thread not syscall - Paul Gortmaker, 12/95 |
9 | * | 9 | * |
10 | * Removed a lot of unnecessary code and simplified things now that | 10 | * Removed a lot of unnecessary code and simplified things now that |
11 | * the buffer cache isn't our primary cache - Andrew Tridgell 12/96 | 11 | * the buffer cache isn't our primary cache - Andrew Tridgell 12/96 |
12 | * | 12 | * |
13 | * Speed up hash, lru, and free list operations. Use gfp() for allocating | 13 | * Speed up hash, lru, and free list operations. Use gfp() for allocating |
14 | * hash table, use SLAB cache for buffer heads. SMP threading. -DaveM | 14 | * hash table, use SLAB cache for buffer heads. SMP threading. -DaveM |
15 | * | 15 | * |
16 | * Added 32k buffer block sizes - these are required older ARM systems. - RMK | 16 | * Added 32k buffer block sizes - these are required older ARM systems. - RMK |
17 | * | 17 | * |
18 | * async buffer flushing, 1999 Andrea Arcangeli <andrea@suse.de> | 18 | * async buffer flushing, 1999 Andrea Arcangeli <andrea@suse.de> |
19 | */ | 19 | */ |
20 | 20 | ||
21 | #include <linux/kernel.h> | 21 | #include <linux/kernel.h> |
22 | #include <linux/syscalls.h> | 22 | #include <linux/syscalls.h> |
23 | #include <linux/fs.h> | 23 | #include <linux/fs.h> |
24 | #include <linux/mm.h> | 24 | #include <linux/mm.h> |
25 | #include <linux/percpu.h> | 25 | #include <linux/percpu.h> |
26 | #include <linux/slab.h> | 26 | #include <linux/slab.h> |
27 | #include <linux/capability.h> | 27 | #include <linux/capability.h> |
28 | #include <linux/blkdev.h> | 28 | #include <linux/blkdev.h> |
29 | #include <linux/file.h> | 29 | #include <linux/file.h> |
30 | #include <linux/quotaops.h> | 30 | #include <linux/quotaops.h> |
31 | #include <linux/highmem.h> | 31 | #include <linux/highmem.h> |
32 | #include <linux/export.h> | 32 | #include <linux/export.h> |
33 | #include <linux/writeback.h> | 33 | #include <linux/writeback.h> |
34 | #include <linux/hash.h> | 34 | #include <linux/hash.h> |
35 | #include <linux/suspend.h> | 35 | #include <linux/suspend.h> |
36 | #include <linux/buffer_head.h> | 36 | #include <linux/buffer_head.h> |
37 | #include <linux/task_io_accounting_ops.h> | 37 | #include <linux/task_io_accounting_ops.h> |
38 | #include <linux/bio.h> | 38 | #include <linux/bio.h> |
39 | #include <linux/notifier.h> | 39 | #include <linux/notifier.h> |
40 | #include <linux/cpu.h> | 40 | #include <linux/cpu.h> |
41 | #include <linux/bitops.h> | 41 | #include <linux/bitops.h> |
42 | #include <linux/mpage.h> | 42 | #include <linux/mpage.h> |
43 | #include <linux/bit_spinlock.h> | 43 | #include <linux/bit_spinlock.h> |
44 | #include <trace/events/block.h> | 44 | #include <trace/events/block.h> |
45 | 45 | ||
46 | static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); | 46 | static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); |
47 | 47 | ||
48 | #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers) | 48 | #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers) |
49 | 49 | ||
50 | void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private) | 50 | void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private) |
51 | { | 51 | { |
52 | bh->b_end_io = handler; | 52 | bh->b_end_io = handler; |
53 | bh->b_private = private; | 53 | bh->b_private = private; |
54 | } | 54 | } |
55 | EXPORT_SYMBOL(init_buffer); | 55 | EXPORT_SYMBOL(init_buffer); |
56 | 56 | ||
57 | inline void touch_buffer(struct buffer_head *bh) | 57 | inline void touch_buffer(struct buffer_head *bh) |
58 | { | 58 | { |
59 | trace_block_touch_buffer(bh); | 59 | trace_block_touch_buffer(bh); |
60 | mark_page_accessed(bh->b_page); | 60 | mark_page_accessed(bh->b_page); |
61 | } | 61 | } |
62 | EXPORT_SYMBOL(touch_buffer); | 62 | EXPORT_SYMBOL(touch_buffer); |
63 | 63 | ||
64 | static int sleep_on_buffer(void *word) | 64 | static int sleep_on_buffer(void *word) |
65 | { | 65 | { |
66 | io_schedule(); | 66 | io_schedule(); |
67 | return 0; | 67 | return 0; |
68 | } | 68 | } |
69 | 69 | ||
70 | void __lock_buffer(struct buffer_head *bh) | 70 | void __lock_buffer(struct buffer_head *bh) |
71 | { | 71 | { |
72 | wait_on_bit_lock(&bh->b_state, BH_Lock, sleep_on_buffer, | 72 | wait_on_bit_lock(&bh->b_state, BH_Lock, sleep_on_buffer, |
73 | TASK_UNINTERRUPTIBLE); | 73 | TASK_UNINTERRUPTIBLE); |
74 | } | 74 | } |
75 | EXPORT_SYMBOL(__lock_buffer); | 75 | EXPORT_SYMBOL(__lock_buffer); |
76 | 76 | ||
77 | void unlock_buffer(struct buffer_head *bh) | 77 | void unlock_buffer(struct buffer_head *bh) |
78 | { | 78 | { |
79 | clear_bit_unlock(BH_Lock, &bh->b_state); | 79 | clear_bit_unlock(BH_Lock, &bh->b_state); |
80 | smp_mb__after_clear_bit(); | 80 | smp_mb__after_clear_bit(); |
81 | wake_up_bit(&bh->b_state, BH_Lock); | 81 | wake_up_bit(&bh->b_state, BH_Lock); |
82 | } | 82 | } |
83 | EXPORT_SYMBOL(unlock_buffer); | 83 | EXPORT_SYMBOL(unlock_buffer); |
84 | 84 | ||
85 | /* | 85 | /* |
86 | * Returns if the page has dirty or writeback buffers. If all the buffers | 86 | * Returns if the page has dirty or writeback buffers. If all the buffers |
87 | * are unlocked and clean then the PageDirty information is stale. If | 87 | * are unlocked and clean then the PageDirty information is stale. If |
88 | * any of the pages are locked, it is assumed they are locked for IO. | 88 | * any of the pages are locked, it is assumed they are locked for IO. |
89 | */ | 89 | */ |
90 | void buffer_check_dirty_writeback(struct page *page, | 90 | void buffer_check_dirty_writeback(struct page *page, |
91 | bool *dirty, bool *writeback) | 91 | bool *dirty, bool *writeback) |
92 | { | 92 | { |
93 | struct buffer_head *head, *bh; | 93 | struct buffer_head *head, *bh; |
94 | *dirty = false; | 94 | *dirty = false; |
95 | *writeback = false; | 95 | *writeback = false; |
96 | 96 | ||
97 | BUG_ON(!PageLocked(page)); | 97 | BUG_ON(!PageLocked(page)); |
98 | 98 | ||
99 | if (!page_has_buffers(page)) | 99 | if (!page_has_buffers(page)) |
100 | return; | 100 | return; |
101 | 101 | ||
102 | if (PageWriteback(page)) | 102 | if (PageWriteback(page)) |
103 | *writeback = true; | 103 | *writeback = true; |
104 | 104 | ||
105 | head = page_buffers(page); | 105 | head = page_buffers(page); |
106 | bh = head; | 106 | bh = head; |
107 | do { | 107 | do { |
108 | if (buffer_locked(bh)) | 108 | if (buffer_locked(bh)) |
109 | *writeback = true; | 109 | *writeback = true; |
110 | 110 | ||
111 | if (buffer_dirty(bh)) | 111 | if (buffer_dirty(bh)) |
112 | *dirty = true; | 112 | *dirty = true; |
113 | 113 | ||
114 | bh = bh->b_this_page; | 114 | bh = bh->b_this_page; |
115 | } while (bh != head); | 115 | } while (bh != head); |
116 | } | 116 | } |
117 | EXPORT_SYMBOL(buffer_check_dirty_writeback); | 117 | EXPORT_SYMBOL(buffer_check_dirty_writeback); |
118 | 118 | ||
119 | /* | 119 | /* |
120 | * Block until a buffer comes unlocked. This doesn't stop it | 120 | * Block until a buffer comes unlocked. This doesn't stop it |
121 | * from becoming locked again - you have to lock it yourself | 121 | * from becoming locked again - you have to lock it yourself |
122 | * if you want to preserve its state. | 122 | * if you want to preserve its state. |
123 | */ | 123 | */ |
124 | void __wait_on_buffer(struct buffer_head * bh) | 124 | void __wait_on_buffer(struct buffer_head * bh) |
125 | { | 125 | { |
126 | wait_on_bit(&bh->b_state, BH_Lock, sleep_on_buffer, TASK_UNINTERRUPTIBLE); | 126 | wait_on_bit(&bh->b_state, BH_Lock, sleep_on_buffer, TASK_UNINTERRUPTIBLE); |
127 | } | 127 | } |
128 | EXPORT_SYMBOL(__wait_on_buffer); | 128 | EXPORT_SYMBOL(__wait_on_buffer); |
129 | 129 | ||
130 | static void | 130 | static void |
131 | __clear_page_buffers(struct page *page) | 131 | __clear_page_buffers(struct page *page) |
132 | { | 132 | { |
133 | ClearPagePrivate(page); | 133 | ClearPagePrivate(page); |
134 | set_page_private(page, 0); | 134 | set_page_private(page, 0); |
135 | page_cache_release(page); | 135 | page_cache_release(page); |
136 | } | 136 | } |
137 | 137 | ||
138 | 138 | ||
139 | static int quiet_error(struct buffer_head *bh) | 139 | static int quiet_error(struct buffer_head *bh) |
140 | { | 140 | { |
141 | if (!test_bit(BH_Quiet, &bh->b_state) && printk_ratelimit()) | 141 | if (!test_bit(BH_Quiet, &bh->b_state) && printk_ratelimit()) |
142 | return 0; | 142 | return 0; |
143 | return 1; | 143 | return 1; |
144 | } | 144 | } |
145 | 145 | ||
146 | 146 | ||
147 | static void buffer_io_error(struct buffer_head *bh) | 147 | static void buffer_io_error(struct buffer_head *bh) |
148 | { | 148 | { |
149 | char b[BDEVNAME_SIZE]; | 149 | char b[BDEVNAME_SIZE]; |
150 | printk(KERN_ERR "Buffer I/O error on device %s, logical block %Lu\n", | 150 | printk(KERN_ERR "Buffer I/O error on device %s, logical block %Lu\n", |
151 | bdevname(bh->b_bdev, b), | 151 | bdevname(bh->b_bdev, b), |
152 | (unsigned long long)bh->b_blocknr); | 152 | (unsigned long long)bh->b_blocknr); |
153 | } | 153 | } |
154 | 154 | ||
155 | /* | 155 | /* |
156 | * End-of-IO handler helper function which does not touch the bh after | 156 | * End-of-IO handler helper function which does not touch the bh after |
157 | * unlocking it. | 157 | * unlocking it. |
158 | * Note: unlock_buffer() sort-of does touch the bh after unlocking it, but | 158 | * Note: unlock_buffer() sort-of does touch the bh after unlocking it, but |
159 | * a race there is benign: unlock_buffer() only use the bh's address for | 159 | * a race there is benign: unlock_buffer() only use the bh's address for |
160 | * hashing after unlocking the buffer, so it doesn't actually touch the bh | 160 | * hashing after unlocking the buffer, so it doesn't actually touch the bh |
161 | * itself. | 161 | * itself. |
162 | */ | 162 | */ |
163 | static void __end_buffer_read_notouch(struct buffer_head *bh, int uptodate) | 163 | static void __end_buffer_read_notouch(struct buffer_head *bh, int uptodate) |
164 | { | 164 | { |
165 | if (uptodate) { | 165 | if (uptodate) { |
166 | set_buffer_uptodate(bh); | 166 | set_buffer_uptodate(bh); |
167 | } else { | 167 | } else { |
168 | /* This happens, due to failed READA attempts. */ | 168 | /* This happens, due to failed READA attempts. */ |
169 | clear_buffer_uptodate(bh); | 169 | clear_buffer_uptodate(bh); |
170 | } | 170 | } |
171 | unlock_buffer(bh); | 171 | unlock_buffer(bh); |
172 | } | 172 | } |
173 | 173 | ||
174 | /* | 174 | /* |
175 | * Default synchronous end-of-IO handler.. Just mark it up-to-date and | 175 | * Default synchronous end-of-IO handler.. Just mark it up-to-date and |
176 | * unlock the buffer. This is what ll_rw_block uses too. | 176 | * unlock the buffer. This is what ll_rw_block uses too. |
177 | */ | 177 | */ |
178 | void end_buffer_read_sync(struct buffer_head *bh, int uptodate) | 178 | void end_buffer_read_sync(struct buffer_head *bh, int uptodate) |
179 | { | 179 | { |
180 | __end_buffer_read_notouch(bh, uptodate); | 180 | __end_buffer_read_notouch(bh, uptodate); |
181 | put_bh(bh); | 181 | put_bh(bh); |
182 | } | 182 | } |
183 | EXPORT_SYMBOL(end_buffer_read_sync); | 183 | EXPORT_SYMBOL(end_buffer_read_sync); |
184 | 184 | ||
185 | void end_buffer_write_sync(struct buffer_head *bh, int uptodate) | 185 | void end_buffer_write_sync(struct buffer_head *bh, int uptodate) |
186 | { | 186 | { |
187 | char b[BDEVNAME_SIZE]; | 187 | char b[BDEVNAME_SIZE]; |
188 | 188 | ||
189 | if (uptodate) { | 189 | if (uptodate) { |
190 | set_buffer_uptodate(bh); | 190 | set_buffer_uptodate(bh); |
191 | } else { | 191 | } else { |
192 | if (!quiet_error(bh)) { | 192 | if (!quiet_error(bh)) { |
193 | buffer_io_error(bh); | 193 | buffer_io_error(bh); |
194 | printk(KERN_WARNING "lost page write due to " | 194 | printk(KERN_WARNING "lost page write due to " |
195 | "I/O error on %s\n", | 195 | "I/O error on %s\n", |
196 | bdevname(bh->b_bdev, b)); | 196 | bdevname(bh->b_bdev, b)); |
197 | } | 197 | } |
198 | set_buffer_write_io_error(bh); | 198 | set_buffer_write_io_error(bh); |
199 | clear_buffer_uptodate(bh); | 199 | clear_buffer_uptodate(bh); |
200 | } | 200 | } |
201 | unlock_buffer(bh); | 201 | unlock_buffer(bh); |
202 | put_bh(bh); | 202 | put_bh(bh); |
203 | } | 203 | } |
204 | EXPORT_SYMBOL(end_buffer_write_sync); | 204 | EXPORT_SYMBOL(end_buffer_write_sync); |
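As the comment above notes, end_buffer_read_sync() is the default synchronous completion handler and is what ll_rw_block() uses as well. A sketch of the canonical pattern it supports, assuming the caller already holds a reference to the buffer (illustrative only; the helper name below is hypothetical and not part of this diff):

#include <linux/buffer_head.h>

/* Read one buffer synchronously and report whether it is now uptodate. */
static int example_read_bh_sync(struct buffer_head *bh)
{
	lock_buffer(bh);
	if (buffer_uptodate(bh)) {
		unlock_buffer(bh);
		return 0;
	}
	get_bh(bh);			/* dropped by end_buffer_read_sync() */
	bh->b_end_io = end_buffer_read_sync;
	submit_bh(READ, bh);
	wait_on_buffer(bh);
	return buffer_uptodate(bh) ? 0 : -EIO;
}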
205 | 205 | ||
206 | /* | 206 | /* |
207 | * Various filesystems appear to want __find_get_block to be non-blocking. | 207 | * Various filesystems appear to want __find_get_block to be non-blocking. |
208 | * But it's the page lock which protects the buffers. To get around this, | 208 | * But it's the page lock which protects the buffers. To get around this, |
209 | * we get exclusion from try_to_free_buffers with the blockdev mapping's | 209 | * we get exclusion from try_to_free_buffers with the blockdev mapping's |
210 | * private_lock. | 210 | * private_lock. |
211 | * | 211 | * |
212 | * Hack idea: for the blockdev mapping, i_bufferlist_lock contention | 212 | * Hack idea: for the blockdev mapping, i_bufferlist_lock contention |
213 | * may be quite high. This code could TryLock the page, and if that | 213 | * may be quite high. This code could TryLock the page, and if that |
214 | * succeeds, there is no need to take private_lock. (But if | 214 | * succeeds, there is no need to take private_lock. (But if |
215 | * private_lock is contended then so is mapping->tree_lock). | 215 | * private_lock is contended then so is mapping->tree_lock). |
216 | */ | 216 | */ |
217 | static struct buffer_head * | 217 | static struct buffer_head * |
218 | __find_get_block_slow(struct block_device *bdev, sector_t block) | 218 | __find_get_block_slow(struct block_device *bdev, sector_t block) |
219 | { | 219 | { |
220 | struct inode *bd_inode = bdev->bd_inode; | 220 | struct inode *bd_inode = bdev->bd_inode; |
221 | struct address_space *bd_mapping = bd_inode->i_mapping; | 221 | struct address_space *bd_mapping = bd_inode->i_mapping; |
222 | struct buffer_head *ret = NULL; | 222 | struct buffer_head *ret = NULL; |
223 | pgoff_t index; | 223 | pgoff_t index; |
224 | struct buffer_head *bh; | 224 | struct buffer_head *bh; |
225 | struct buffer_head *head; | 225 | struct buffer_head *head; |
226 | struct page *page; | 226 | struct page *page; |
227 | int all_mapped = 1; | 227 | int all_mapped = 1; |
228 | 228 | ||
229 | index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits); | 229 | index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits); |
230 | page = find_get_page(bd_mapping, index); | 230 | page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED); |
231 | if (!page) | 231 | if (!page) |
232 | goto out; | 232 | goto out; |
233 | 233 | ||
234 | spin_lock(&bd_mapping->private_lock); | 234 | spin_lock(&bd_mapping->private_lock); |
235 | if (!page_has_buffers(page)) | 235 | if (!page_has_buffers(page)) |
236 | goto out_unlock; | 236 | goto out_unlock; |
237 | head = page_buffers(page); | 237 | head = page_buffers(page); |
238 | bh = head; | 238 | bh = head; |
239 | do { | 239 | do { |
240 | if (!buffer_mapped(bh)) | 240 | if (!buffer_mapped(bh)) |
241 | all_mapped = 0; | 241 | all_mapped = 0; |
242 | else if (bh->b_blocknr == block) { | 242 | else if (bh->b_blocknr == block) { |
243 | ret = bh; | 243 | ret = bh; |
244 | get_bh(bh); | 244 | get_bh(bh); |
245 | goto out_unlock; | 245 | goto out_unlock; |
246 | } | 246 | } |
247 | bh = bh->b_this_page; | 247 | bh = bh->b_this_page; |
248 | } while (bh != head); | 248 | } while (bh != head); |
249 | 249 | ||
250 | /* we might be here because some of the buffers on this page are | 250 | /* we might be here because some of the buffers on this page are |
251 | * not mapped. This is due to various races between | 251 | * not mapped. This is due to various races between |
252 | * file io on the block device and getblk. It gets dealt with | 252 | * file io on the block device and getblk. It gets dealt with |
253 | * elsewhere, don't buffer_error if we had some unmapped buffers | 253 | * elsewhere, don't buffer_error if we had some unmapped buffers |
254 | */ | 254 | */ |
255 | if (all_mapped) { | 255 | if (all_mapped) { |
256 | char b[BDEVNAME_SIZE]; | 256 | char b[BDEVNAME_SIZE]; |
257 | 257 | ||
258 | printk("__find_get_block_slow() failed. " | 258 | printk("__find_get_block_slow() failed. " |
259 | "block=%llu, b_blocknr=%llu\n", | 259 | "block=%llu, b_blocknr=%llu\n", |
260 | (unsigned long long)block, | 260 | (unsigned long long)block, |
261 | (unsigned long long)bh->b_blocknr); | 261 | (unsigned long long)bh->b_blocknr); |
262 | printk("b_state=0x%08lx, b_size=%zu\n", | 262 | printk("b_state=0x%08lx, b_size=%zu\n", |
263 | bh->b_state, bh->b_size); | 263 | bh->b_state, bh->b_size); |
264 | printk("device %s blocksize: %d\n", bdevname(bdev, b), | 264 | printk("device %s blocksize: %d\n", bdevname(bdev, b), |
265 | 1 << bd_inode->i_blkbits); | 265 | 1 << bd_inode->i_blkbits); |
266 | } | 266 | } |
267 | out_unlock: | 267 | out_unlock: |
268 | spin_unlock(&bd_mapping->private_lock); | 268 | spin_unlock(&bd_mapping->private_lock); |
269 | page_cache_release(page); | 269 | page_cache_release(page); |
270 | out: | 270 | out: |
271 | return ret; | 271 | return ret; |
272 | } | 272 | } |
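The only change in the hunk above is the page-cache lookup: find_get_page() becomes find_get_page_flags(bd_mapping, index, FGP_ACCESSED), so the lookup itself records the access and no separate mark_page_accessed() call is needed on the returned page. A minimal sketch of that lookup pattern, assuming the caller still drops its page reference as usual (the helper name below is hypothetical, for illustration only):

#include <linux/pagemap.h>

/* Look up a cached page and let the lookup mark it accessed. */
static struct page *example_find_get_accessed(struct address_space *mapping,
					      pgoff_t index)
{
	struct page *page;

	page = find_get_page_flags(mapping, index, FGP_ACCESSED);
	/*
	 * NULL if the page is not cached; otherwise it is returned with an
	 * extra reference that the caller releases with page_cache_release().
	 */
	return page;
}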
273 | 273 | ||
274 | /* | 274 | /* |
275 | * Kick the writeback threads then try to free up some ZONE_NORMAL memory. | 275 | * Kick the writeback threads then try to free up some ZONE_NORMAL memory. |
276 | */ | 276 | */ |
277 | static void free_more_memory(void) | 277 | static void free_more_memory(void) |
278 | { | 278 | { |
279 | struct zone *zone; | 279 | struct zone *zone; |
280 | int nid; | 280 | int nid; |
281 | 281 | ||
282 | wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM); | 282 | wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM); |
283 | yield(); | 283 | yield(); |
284 | 284 | ||
285 | for_each_online_node(nid) { | 285 | for_each_online_node(nid) { |
286 | (void)first_zones_zonelist(node_zonelist(nid, GFP_NOFS), | 286 | (void)first_zones_zonelist(node_zonelist(nid, GFP_NOFS), |
287 | gfp_zone(GFP_NOFS), NULL, | 287 | gfp_zone(GFP_NOFS), NULL, |
288 | &zone); | 288 | &zone); |
289 | if (zone) | 289 | if (zone) |
290 | try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0, | 290 | try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0, |
291 | GFP_NOFS, NULL); | 291 | GFP_NOFS, NULL); |
292 | } | 292 | } |
293 | } | 293 | } |
294 | 294 | ||
295 | /* | 295 | /* |
296 | * I/O completion handler for block_read_full_page() - pages | 296 | * I/O completion handler for block_read_full_page() - pages |
297 | * which come unlocked at the end of I/O. | 297 | * which come unlocked at the end of I/O. |
298 | */ | 298 | */ |
299 | static void end_buffer_async_read(struct buffer_head *bh, int uptodate) | 299 | static void end_buffer_async_read(struct buffer_head *bh, int uptodate) |
300 | { | 300 | { |
301 | unsigned long flags; | 301 | unsigned long flags; |
302 | struct buffer_head *first; | 302 | struct buffer_head *first; |
303 | struct buffer_head *tmp; | 303 | struct buffer_head *tmp; |
304 | struct page *page; | 304 | struct page *page; |
305 | int page_uptodate = 1; | 305 | int page_uptodate = 1; |
306 | 306 | ||
307 | BUG_ON(!buffer_async_read(bh)); | 307 | BUG_ON(!buffer_async_read(bh)); |
308 | 308 | ||
309 | page = bh->b_page; | 309 | page = bh->b_page; |
310 | if (uptodate) { | 310 | if (uptodate) { |
311 | set_buffer_uptodate(bh); | 311 | set_buffer_uptodate(bh); |
312 | } else { | 312 | } else { |
313 | clear_buffer_uptodate(bh); | 313 | clear_buffer_uptodate(bh); |
314 | if (!quiet_error(bh)) | 314 | if (!quiet_error(bh)) |
315 | buffer_io_error(bh); | 315 | buffer_io_error(bh); |
316 | SetPageError(page); | 316 | SetPageError(page); |
317 | } | 317 | } |
318 | 318 | ||
319 | /* | 319 | /* |
320 | * Be _very_ careful from here on. Bad things can happen if | 320 | * Be _very_ careful from here on. Bad things can happen if |
321 | * two buffer heads end IO at almost the same time and both | 321 | * two buffer heads end IO at almost the same time and both |
322 | * decide that the page is now completely done. | 322 | * decide that the page is now completely done. |
323 | */ | 323 | */ |
324 | first = page_buffers(page); | 324 | first = page_buffers(page); |
325 | local_irq_save(flags); | 325 | local_irq_save(flags); |
326 | bit_spin_lock(BH_Uptodate_Lock, &first->b_state); | 326 | bit_spin_lock(BH_Uptodate_Lock, &first->b_state); |
327 | clear_buffer_async_read(bh); | 327 | clear_buffer_async_read(bh); |
328 | unlock_buffer(bh); | 328 | unlock_buffer(bh); |
329 | tmp = bh; | 329 | tmp = bh; |
330 | do { | 330 | do { |
331 | if (!buffer_uptodate(tmp)) | 331 | if (!buffer_uptodate(tmp)) |
332 | page_uptodate = 0; | 332 | page_uptodate = 0; |
333 | if (buffer_async_read(tmp)) { | 333 | if (buffer_async_read(tmp)) { |
334 | BUG_ON(!buffer_locked(tmp)); | 334 | BUG_ON(!buffer_locked(tmp)); |
335 | goto still_busy; | 335 | goto still_busy; |
336 | } | 336 | } |
337 | tmp = tmp->b_this_page; | 337 | tmp = tmp->b_this_page; |
338 | } while (tmp != bh); | 338 | } while (tmp != bh); |
339 | bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); | 339 | bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); |
340 | local_irq_restore(flags); | 340 | local_irq_restore(flags); |
341 | 341 | ||
342 | /* | 342 | /* |
343 | * If none of the buffers had errors and they are all | 343 | * If none of the buffers had errors and they are all |
344 | * uptodate then we can set the page uptodate. | 344 | * uptodate then we can set the page uptodate. |
345 | */ | 345 | */ |
346 | if (page_uptodate && !PageError(page)) | 346 | if (page_uptodate && !PageError(page)) |
347 | SetPageUptodate(page); | 347 | SetPageUptodate(page); |
348 | unlock_page(page); | 348 | unlock_page(page); |
349 | return; | 349 | return; |
350 | 350 | ||
351 | still_busy: | 351 | still_busy: |
352 | bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); | 352 | bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); |
353 | local_irq_restore(flags); | 353 | local_irq_restore(flags); |
354 | return; | 354 | return; |
355 | } | 355 | } |
356 | 356 | ||
357 | /* | 357 | /* |
358 | * Completion handler for block_write_full_page() - pages which are unlocked | 358 | * Completion handler for block_write_full_page() - pages which are unlocked |
359 | * during I/O, and which have PageWriteback cleared upon I/O completion. | 359 | * during I/O, and which have PageWriteback cleared upon I/O completion. |
360 | */ | 360 | */ |
361 | void end_buffer_async_write(struct buffer_head *bh, int uptodate) | 361 | void end_buffer_async_write(struct buffer_head *bh, int uptodate) |
362 | { | 362 | { |
363 | char b[BDEVNAME_SIZE]; | 363 | char b[BDEVNAME_SIZE]; |
364 | unsigned long flags; | 364 | unsigned long flags; |
365 | struct buffer_head *first; | 365 | struct buffer_head *first; |
366 | struct buffer_head *tmp; | 366 | struct buffer_head *tmp; |
367 | struct page *page; | 367 | struct page *page; |
368 | 368 | ||
369 | BUG_ON(!buffer_async_write(bh)); | 369 | BUG_ON(!buffer_async_write(bh)); |
370 | 370 | ||
371 | page = bh->b_page; | 371 | page = bh->b_page; |
372 | if (uptodate) { | 372 | if (uptodate) { |
373 | set_buffer_uptodate(bh); | 373 | set_buffer_uptodate(bh); |
374 | } else { | 374 | } else { |
375 | if (!quiet_error(bh)) { | 375 | if (!quiet_error(bh)) { |
376 | buffer_io_error(bh); | 376 | buffer_io_error(bh); |
377 | printk(KERN_WARNING "lost page write due to " | 377 | printk(KERN_WARNING "lost page write due to " |
378 | "I/O error on %s\n", | 378 | "I/O error on %s\n", |
379 | bdevname(bh->b_bdev, b)); | 379 | bdevname(bh->b_bdev, b)); |
380 | } | 380 | } |
381 | set_bit(AS_EIO, &page->mapping->flags); | 381 | set_bit(AS_EIO, &page->mapping->flags); |
382 | set_buffer_write_io_error(bh); | 382 | set_buffer_write_io_error(bh); |
383 | clear_buffer_uptodate(bh); | 383 | clear_buffer_uptodate(bh); |
384 | SetPageError(page); | 384 | SetPageError(page); |
385 | } | 385 | } |
386 | 386 | ||
387 | first = page_buffers(page); | 387 | first = page_buffers(page); |
388 | local_irq_save(flags); | 388 | local_irq_save(flags); |
389 | bit_spin_lock(BH_Uptodate_Lock, &first->b_state); | 389 | bit_spin_lock(BH_Uptodate_Lock, &first->b_state); |
390 | 390 | ||
391 | clear_buffer_async_write(bh); | 391 | clear_buffer_async_write(bh); |
392 | unlock_buffer(bh); | 392 | unlock_buffer(bh); |
393 | tmp = bh->b_this_page; | 393 | tmp = bh->b_this_page; |
394 | while (tmp != bh) { | 394 | while (tmp != bh) { |
395 | if (buffer_async_write(tmp)) { | 395 | if (buffer_async_write(tmp)) { |
396 | BUG_ON(!buffer_locked(tmp)); | 396 | BUG_ON(!buffer_locked(tmp)); |
397 | goto still_busy; | 397 | goto still_busy; |
398 | } | 398 | } |
399 | tmp = tmp->b_this_page; | 399 | tmp = tmp->b_this_page; |
400 | } | 400 | } |
401 | bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); | 401 | bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); |
402 | local_irq_restore(flags); | 402 | local_irq_restore(flags); |
403 | end_page_writeback(page); | 403 | end_page_writeback(page); |
404 | return; | 404 | return; |
405 | 405 | ||
406 | still_busy: | 406 | still_busy: |
407 | bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); | 407 | bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); |
408 | local_irq_restore(flags); | 408 | local_irq_restore(flags); |
409 | return; | 409 | return; |
410 | } | 410 | } |
411 | EXPORT_SYMBOL(end_buffer_async_write); | 411 | EXPORT_SYMBOL(end_buffer_async_write); |
412 | 412 | ||
413 | /* | 413 | /* |
414 | * If a page's buffers are under async readin (end_buffer_async_read | 414 | * If a page's buffers are under async readin (end_buffer_async_read |
415 | * completion) then there is a possibility that another thread of | 415 | * completion) then there is a possibility that another thread of |
416 | * control could lock one of the buffers after it has completed | 416 | * control could lock one of the buffers after it has completed |
417 | * but while some of the other buffers have not completed. This | 417 | * but while some of the other buffers have not completed. This |
418 | * locked buffer would confuse end_buffer_async_read() into not unlocking | 418 | * locked buffer would confuse end_buffer_async_read() into not unlocking |
419 | * the page. So the absence of BH_Async_Read tells end_buffer_async_read() | 419 | * the page. So the absence of BH_Async_Read tells end_buffer_async_read() |
420 | * that this buffer is not under async I/O. | 420 | * that this buffer is not under async I/O. |
421 | * | 421 | * |
422 | * The page comes unlocked when it has no locked buffer_async buffers | 422 | * The page comes unlocked when it has no locked buffer_async buffers |
423 | * left. | 423 | * left. |
424 | * | 424 | * |
425 | * PageLocked prevents anyone starting new async I/O reads any of | 425 | * PageLocked prevents anyone starting new async I/O reads any of |
426 | * the buffers. | 426 | * the buffers. |
427 | * | 427 | * |
428 | * PageWriteback is used to prevent simultaneous writeout of the same | 428 | * PageWriteback is used to prevent simultaneous writeout of the same |
429 | * page. | 429 | * page. |
430 | * | 430 | * |
431 | * PageLocked prevents anyone from starting writeback of a page which is | 431 | * PageLocked prevents anyone from starting writeback of a page which is |
432 | * under read I/O (PageWriteback is only ever set against a locked page). | 432 | * under read I/O (PageWriteback is only ever set against a locked page). |
433 | */ | 433 | */ |
434 | static void mark_buffer_async_read(struct buffer_head *bh) | 434 | static void mark_buffer_async_read(struct buffer_head *bh) |
435 | { | 435 | { |
436 | bh->b_end_io = end_buffer_async_read; | 436 | bh->b_end_io = end_buffer_async_read; |
437 | set_buffer_async_read(bh); | 437 | set_buffer_async_read(bh); |
438 | } | 438 | } |
439 | 439 | ||
440 | static void mark_buffer_async_write_endio(struct buffer_head *bh, | 440 | static void mark_buffer_async_write_endio(struct buffer_head *bh, |
441 | bh_end_io_t *handler) | 441 | bh_end_io_t *handler) |
442 | { | 442 | { |
443 | bh->b_end_io = handler; | 443 | bh->b_end_io = handler; |
444 | set_buffer_async_write(bh); | 444 | set_buffer_async_write(bh); |
445 | } | 445 | } |
446 | 446 | ||
447 | void mark_buffer_async_write(struct buffer_head *bh) | 447 | void mark_buffer_async_write(struct buffer_head *bh) |
448 | { | 448 | { |
449 | mark_buffer_async_write_endio(bh, end_buffer_async_write); | 449 | mark_buffer_async_write_endio(bh, end_buffer_async_write); |
450 | } | 450 | } |
451 | EXPORT_SYMBOL(mark_buffer_async_write); | 451 | EXPORT_SYMBOL(mark_buffer_async_write); |
452 | 452 | ||
453 | 453 | ||
454 | /* | 454 | /* |
455 | * fs/buffer.c contains helper functions for buffer-backed address space's | 455 | * fs/buffer.c contains helper functions for buffer-backed address space's |
456 | * fsync functions. A common requirement for buffer-based filesystems is | 456 | * fsync functions. A common requirement for buffer-based filesystems is |
457 | * that certain data from the backing blockdev needs to be written out for | 457 | * that certain data from the backing blockdev needs to be written out for |
458 | * a successful fsync(). For example, ext2 indirect blocks need to be | 458 | * a successful fsync(). For example, ext2 indirect blocks need to be |
459 | * written back and waited upon before fsync() returns. | 459 | * written back and waited upon before fsync() returns. |
460 | * | 460 | * |
461 | * The functions mark_buffer_inode_dirty(), fsync_inode_buffers(), | 461 | * The functions mark_buffer_inode_dirty(), fsync_inode_buffers(), |
462 | * inode_has_buffers() and invalidate_inode_buffers() are provided for the | 462 | * inode_has_buffers() and invalidate_inode_buffers() are provided for the |
463 | * management of a list of dependent buffers at ->i_mapping->private_list. | 463 | * management of a list of dependent buffers at ->i_mapping->private_list. |
464 | * | 464 | * |
465 | * Locking is a little subtle: try_to_free_buffers() will remove buffers | 465 | * Locking is a little subtle: try_to_free_buffers() will remove buffers |
466 | * from their controlling inode's queue when they are being freed. But | 466 | * from their controlling inode's queue when they are being freed. But |
467 | * try_to_free_buffers() will be operating against the *blockdev* mapping | 467 | * try_to_free_buffers() will be operating against the *blockdev* mapping |
468 | * at the time, not against the S_ISREG file which depends on those buffers. | 468 | * at the time, not against the S_ISREG file which depends on those buffers. |
469 | * So the locking for private_list is via the private_lock in the address_space | 469 | * So the locking for private_list is via the private_lock in the address_space |
470 | * which backs the buffers. Which is different from the address_space | 470 | * which backs the buffers. Which is different from the address_space |
471 | * against which the buffers are listed. So for a particular address_space, | 471 | * against which the buffers are listed. So for a particular address_space, |
472 | * mapping->private_lock does *not* protect mapping->private_list! In fact, | 472 | * mapping->private_lock does *not* protect mapping->private_list! In fact, |
473 | * mapping->private_list will always be protected by the backing blockdev's | 473 | * mapping->private_list will always be protected by the backing blockdev's |
474 | * ->private_lock. | 474 | * ->private_lock. |
475 | * | 475 | * |
476 | * Which introduces a requirement: all buffers on an address_space's | 476 | * Which introduces a requirement: all buffers on an address_space's |
477 | * ->private_list must be from the same address_space: the blockdev's. | 477 | * ->private_list must be from the same address_space: the blockdev's. |
478 | * | 478 | * |
479 | * address_spaces which do not place buffers at ->private_list via these | 479 | * address_spaces which do not place buffers at ->private_list via these |
480 | * utility functions are free to use private_lock and private_list for | 480 | * utility functions are free to use private_lock and private_list for |
481 | * whatever they want. The only requirement is that list_empty(private_list) | 481 | * whatever they want. The only requirement is that list_empty(private_list) |
482 | * be true at clear_inode() time. | 482 | * be true at clear_inode() time. |
483 | * | 483 | * |
484 | * FIXME: clear_inode should not call invalidate_inode_buffers(). The | 484 | * FIXME: clear_inode should not call invalidate_inode_buffers(). The |
485 | * filesystems should do that. invalidate_inode_buffers() should just go | 485 | * filesystems should do that. invalidate_inode_buffers() should just go |
486 | * BUG_ON(!list_empty). | 486 | * BUG_ON(!list_empty). |
487 | * | 487 | * |
488 | * FIXME: mark_buffer_dirty_inode() is a data-plane operation. It should | 488 | * FIXME: mark_buffer_dirty_inode() is a data-plane operation. It should |
489 | * take an address_space, not an inode. And it should be called | 489 | * take an address_space, not an inode. And it should be called |
490 | * mark_buffer_dirty_fsync() to clearly define why those buffers are being | 490 | * mark_buffer_dirty_fsync() to clearly define why those buffers are being |
491 | * queued up. | 491 | * queued up. |
492 | * | 492 | * |
493 | * FIXME: mark_buffer_dirty_inode() doesn't need to add the buffer to the | 493 | * FIXME: mark_buffer_dirty_inode() doesn't need to add the buffer to the |
494 | * list if it is already on a list. Because if the buffer is on a list, | 494 | * list if it is already on a list. Because if the buffer is on a list, |
495 | * it *must* already be on the right one. If not, the filesystem is being | 495 | * it *must* already be on the right one. If not, the filesystem is being |
496 | * silly. This will save a ton of locking. But first we have to ensure | 496 | * silly. This will save a ton of locking. But first we have to ensure |
497 | * that buffers are taken *off* the old inode's list when they are freed | 497 | * that buffers are taken *off* the old inode's list when they are freed |
498 | * (presumably in truncate). That requires careful auditing of all | 498 | * (presumably in truncate). That requires careful auditing of all |
499 | * filesystems (do it inside bforget()). It could also be done by bringing | 499 | * filesystems (do it inside bforget()). It could also be done by bringing |
500 | * b_inode back. | 500 | * b_inode back. |
501 | */ | 501 | */ |
502 | 502 | ||
503 | /* | 503 | /* |
504 | * The buffer's backing address_space's private_lock must be held | 504 | * The buffer's backing address_space's private_lock must be held |
505 | */ | 505 | */ |
506 | static void __remove_assoc_queue(struct buffer_head *bh) | 506 | static void __remove_assoc_queue(struct buffer_head *bh) |
507 | { | 507 | { |
508 | list_del_init(&bh->b_assoc_buffers); | 508 | list_del_init(&bh->b_assoc_buffers); |
509 | WARN_ON(!bh->b_assoc_map); | 509 | WARN_ON(!bh->b_assoc_map); |
510 | if (buffer_write_io_error(bh)) | 510 | if (buffer_write_io_error(bh)) |
511 | set_bit(AS_EIO, &bh->b_assoc_map->flags); | 511 | set_bit(AS_EIO, &bh->b_assoc_map->flags); |
512 | bh->b_assoc_map = NULL; | 512 | bh->b_assoc_map = NULL; |
513 | } | 513 | } |
514 | 514 | ||
515 | int inode_has_buffers(struct inode *inode) | 515 | int inode_has_buffers(struct inode *inode) |
516 | { | 516 | { |
517 | return !list_empty(&inode->i_data.private_list); | 517 | return !list_empty(&inode->i_data.private_list); |
518 | } | 518 | } |
519 | 519 | ||
520 | /* | 520 | /* |
521 | * osync is designed to support O_SYNC io. It waits synchronously for | 521 | * osync is designed to support O_SYNC io. It waits synchronously for |
522 | * all already-submitted IO to complete, but does not queue any new | 522 | * all already-submitted IO to complete, but does not queue any new |
523 | * writes to the disk. | 523 | * writes to the disk. |
524 | * | 524 | * |
525 | * To do O_SYNC writes, just queue the buffer writes with ll_rw_block as | 525 | * To do O_SYNC writes, just queue the buffer writes with ll_rw_block as |
526 | * you dirty the buffers, and then use osync_inode_buffers to wait for | 526 | * you dirty the buffers, and then use osync_inode_buffers to wait for |
527 | * completion. Any other dirty buffers which are not yet queued for | 527 | * completion. Any other dirty buffers which are not yet queued for |
528 | * write will not be flushed to disk by the osync. | 528 | * write will not be flushed to disk by the osync. |
529 | */ | 529 | */ |
530 | static int osync_buffers_list(spinlock_t *lock, struct list_head *list) | 530 | static int osync_buffers_list(spinlock_t *lock, struct list_head *list) |
531 | { | 531 | { |
532 | struct buffer_head *bh; | 532 | struct buffer_head *bh; |
533 | struct list_head *p; | 533 | struct list_head *p; |
534 | int err = 0; | 534 | int err = 0; |
535 | 535 | ||
536 | spin_lock(lock); | 536 | spin_lock(lock); |
537 | repeat: | 537 | repeat: |
538 | list_for_each_prev(p, list) { | 538 | list_for_each_prev(p, list) { |
539 | bh = BH_ENTRY(p); | 539 | bh = BH_ENTRY(p); |
540 | if (buffer_locked(bh)) { | 540 | if (buffer_locked(bh)) { |
541 | get_bh(bh); | 541 | get_bh(bh); |
542 | spin_unlock(lock); | 542 | spin_unlock(lock); |
543 | wait_on_buffer(bh); | 543 | wait_on_buffer(bh); |
544 | if (!buffer_uptodate(bh)) | 544 | if (!buffer_uptodate(bh)) |
545 | err = -EIO; | 545 | err = -EIO; |
546 | brelse(bh); | 546 | brelse(bh); |
547 | spin_lock(lock); | 547 | spin_lock(lock); |
548 | goto repeat; | 548 | goto repeat; |
549 | } | 549 | } |
550 | } | 550 | } |
551 | spin_unlock(lock); | 551 | spin_unlock(lock); |
552 | return err; | 552 | return err; |
553 | } | 553 | } |
554 | 554 | ||
555 | static void do_thaw_one(struct super_block *sb, void *unused) | 555 | static void do_thaw_one(struct super_block *sb, void *unused) |
556 | { | 556 | { |
557 | char b[BDEVNAME_SIZE]; | 557 | char b[BDEVNAME_SIZE]; |
558 | while (sb->s_bdev && !thaw_bdev(sb->s_bdev, sb)) | 558 | while (sb->s_bdev && !thaw_bdev(sb->s_bdev, sb)) |
559 | printk(KERN_WARNING "Emergency Thaw on %s\n", | 559 | printk(KERN_WARNING "Emergency Thaw on %s\n", |
560 | bdevname(sb->s_bdev, b)); | 560 | bdevname(sb->s_bdev, b)); |
561 | } | 561 | } |
562 | 562 | ||
563 | static void do_thaw_all(struct work_struct *work) | 563 | static void do_thaw_all(struct work_struct *work) |
564 | { | 564 | { |
565 | iterate_supers(do_thaw_one, NULL); | 565 | iterate_supers(do_thaw_one, NULL); |
566 | kfree(work); | 566 | kfree(work); |
567 | printk(KERN_WARNING "Emergency Thaw complete\n"); | 567 | printk(KERN_WARNING "Emergency Thaw complete\n"); |
568 | } | 568 | } |
569 | 569 | ||
570 | /** | 570 | /** |
571 | * emergency_thaw_all -- forcibly thaw every frozen filesystem | 571 | * emergency_thaw_all -- forcibly thaw every frozen filesystem |
572 | * | 572 | * |
573 | * Used for emergency unfreeze of all filesystems via SysRq | 573 | * Used for emergency unfreeze of all filesystems via SysRq |
574 | */ | 574 | */ |
575 | void emergency_thaw_all(void) | 575 | void emergency_thaw_all(void) |
576 | { | 576 | { |
577 | struct work_struct *work; | 577 | struct work_struct *work; |
578 | 578 | ||
579 | work = kmalloc(sizeof(*work), GFP_ATOMIC); | 579 | work = kmalloc(sizeof(*work), GFP_ATOMIC); |
580 | if (work) { | 580 | if (work) { |
581 | INIT_WORK(work, do_thaw_all); | 581 | INIT_WORK(work, do_thaw_all); |
582 | schedule_work(work); | 582 | schedule_work(work); |
583 | } | 583 | } |
584 | } | 584 | } |
585 | 585 | ||
586 | /** | 586 | /** |
587 | * sync_mapping_buffers - write out & wait upon a mapping's "associated" buffers | 587 | * sync_mapping_buffers - write out & wait upon a mapping's "associated" buffers |
588 | * @mapping: the mapping which wants those buffers written | 588 | * @mapping: the mapping which wants those buffers written |
589 | * | 589 | * |
590 | * Starts I/O against the buffers at mapping->private_list, and waits upon | 590 | * Starts I/O against the buffers at mapping->private_list, and waits upon |
591 | * that I/O. | 591 | * that I/O. |
592 | * | 592 | * |
593 | * Basically, this is a convenience function for fsync(). | 593 | * Basically, this is a convenience function for fsync(). |
594 | * @mapping is a file or directory which needs those buffers to be written for | 594 | * @mapping is a file or directory which needs those buffers to be written for |
595 | * a successful fsync(). | 595 | * a successful fsync(). |
596 | */ | 596 | */ |
597 | int sync_mapping_buffers(struct address_space *mapping) | 597 | int sync_mapping_buffers(struct address_space *mapping) |
598 | { | 598 | { |
599 | struct address_space *buffer_mapping = mapping->private_data; | 599 | struct address_space *buffer_mapping = mapping->private_data; |
600 | 600 | ||
601 | if (buffer_mapping == NULL || list_empty(&mapping->private_list)) | 601 | if (buffer_mapping == NULL || list_empty(&mapping->private_list)) |
602 | return 0; | 602 | return 0; |
603 | 603 | ||
604 | return fsync_buffers_list(&buffer_mapping->private_lock, | 604 | return fsync_buffers_list(&buffer_mapping->private_lock, |
605 | &mapping->private_list); | 605 | &mapping->private_list); |
606 | } | 606 | } |
607 | EXPORT_SYMBOL(sync_mapping_buffers); | 607 | EXPORT_SYMBOL(sync_mapping_buffers); |
608 | 608 | ||
609 | /* | 609 | /* |
610 | * Called when we've recently written block `bblock', and it is known that | 610 | * Called when we've recently written block `bblock', and it is known that |
611 | * `bblock' was for a buffer_boundary() buffer. This means that the block at | 611 | * `bblock' was for a buffer_boundary() buffer. This means that the block at |
612 | * `bblock + 1' is probably a dirty indirect block. Hunt it down and, if it's | 612 | * `bblock + 1' is probably a dirty indirect block. Hunt it down and, if it's |
613 | * dirty, schedule it for IO. So that indirects merge nicely with their data. | 613 | * dirty, schedule it for IO. So that indirects merge nicely with their data. |
614 | */ | 614 | */ |
615 | void write_boundary_block(struct block_device *bdev, | 615 | void write_boundary_block(struct block_device *bdev, |
616 | sector_t bblock, unsigned blocksize) | 616 | sector_t bblock, unsigned blocksize) |
617 | { | 617 | { |
618 | struct buffer_head *bh = __find_get_block(bdev, bblock + 1, blocksize); | 618 | struct buffer_head *bh = __find_get_block(bdev, bblock + 1, blocksize); |
619 | if (bh) { | 619 | if (bh) { |
620 | if (buffer_dirty(bh)) | 620 | if (buffer_dirty(bh)) |
621 | ll_rw_block(WRITE, 1, &bh); | 621 | ll_rw_block(WRITE, 1, &bh); |
622 | put_bh(bh); | 622 | put_bh(bh); |
623 | } | 623 | } |
624 | } | 624 | } |
625 | 625 | ||
626 | void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode) | 626 | void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode) |
627 | { | 627 | { |
628 | struct address_space *mapping = inode->i_mapping; | 628 | struct address_space *mapping = inode->i_mapping; |
629 | struct address_space *buffer_mapping = bh->b_page->mapping; | 629 | struct address_space *buffer_mapping = bh->b_page->mapping; |
630 | 630 | ||
631 | mark_buffer_dirty(bh); | 631 | mark_buffer_dirty(bh); |
632 | if (!mapping->private_data) { | 632 | if (!mapping->private_data) { |
633 | mapping->private_data = buffer_mapping; | 633 | mapping->private_data = buffer_mapping; |
634 | } else { | 634 | } else { |
635 | BUG_ON(mapping->private_data != buffer_mapping); | 635 | BUG_ON(mapping->private_data != buffer_mapping); |
636 | } | 636 | } |
637 | if (!bh->b_assoc_map) { | 637 | if (!bh->b_assoc_map) { |
638 | spin_lock(&buffer_mapping->private_lock); | 638 | spin_lock(&buffer_mapping->private_lock); |
639 | list_move_tail(&bh->b_assoc_buffers, | 639 | list_move_tail(&bh->b_assoc_buffers, |
640 | &mapping->private_list); | 640 | &mapping->private_list); |
641 | bh->b_assoc_map = mapping; | 641 | bh->b_assoc_map = mapping; |
642 | spin_unlock(&buffer_mapping->private_lock); | 642 | spin_unlock(&buffer_mapping->private_lock); |
643 | } | 643 | } |
644 | } | 644 | } |
645 | EXPORT_SYMBOL(mark_buffer_dirty_inode); | 645 | EXPORT_SYMBOL(mark_buffer_dirty_inode); |
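The comment block earlier in this file describes how dependent metadata buffers are queued on ->private_list and written back at fsync time. A sketch of that pairing as a filesystem might use it, with mark_buffer_dirty_inode() queueing the buffer and sync_mapping_buffers() writing out and waiting on the list (hypothetical example, not part of this diff):

#include <linux/fs.h>
#include <linux/buffer_head.h>

/* Queue a dependent buffer for the inode, then write and wait on the list. */
static int example_fsync_metadata(struct inode *inode,
				  struct buffer_head *meta_bh)
{
	mark_buffer_dirty_inode(meta_bh, inode);
	return sync_mapping_buffers(inode->i_mapping);
}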
646 | 646 | ||
647 | /* | 647 | /* |
648 | * Mark the page dirty, and set it dirty in the radix tree, and mark the inode | 648 | * Mark the page dirty, and set it dirty in the radix tree, and mark the inode |
649 | * dirty. | 649 | * dirty. |
650 | * | 650 | * |
651 | * If warn is true, then emit a warning if the page is not uptodate and has | 651 | * If warn is true, then emit a warning if the page is not uptodate and has |
652 | * not been truncated. | 652 | * not been truncated. |
653 | */ | 653 | */ |
654 | static void __set_page_dirty(struct page *page, | 654 | static void __set_page_dirty(struct page *page, |
655 | struct address_space *mapping, int warn) | 655 | struct address_space *mapping, int warn) |
656 | { | 656 | { |
657 | unsigned long flags; | 657 | unsigned long flags; |
658 | 658 | ||
659 | spin_lock_irqsave(&mapping->tree_lock, flags); | 659 | spin_lock_irqsave(&mapping->tree_lock, flags); |
660 | if (page->mapping) { /* Race with truncate? */ | 660 | if (page->mapping) { /* Race with truncate? */ |
661 | WARN_ON_ONCE(warn && !PageUptodate(page)); | 661 | WARN_ON_ONCE(warn && !PageUptodate(page)); |
662 | account_page_dirtied(page, mapping); | 662 | account_page_dirtied(page, mapping); |
663 | radix_tree_tag_set(&mapping->page_tree, | 663 | radix_tree_tag_set(&mapping->page_tree, |
664 | page_index(page), PAGECACHE_TAG_DIRTY); | 664 | page_index(page), PAGECACHE_TAG_DIRTY); |
665 | } | 665 | } |
666 | spin_unlock_irqrestore(&mapping->tree_lock, flags); | 666 | spin_unlock_irqrestore(&mapping->tree_lock, flags); |
667 | __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); | 667 | __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); |
668 | } | 668 | } |
669 | 669 | ||
670 | /* | 670 | /* |
671 | * Add a page to the dirty page list. | 671 | * Add a page to the dirty page list. |
672 | * | 672 | * |
673 | * It is a sad fact of life that this function is called from several places | 673 | * It is a sad fact of life that this function is called from several places |
674 | * deeply under spinlocking. It may not sleep. | 674 | * deeply under spinlocking. It may not sleep. |
675 | * | 675 | * |
676 | * If the page has buffers, the uptodate buffers are set dirty, to preserve | 676 | * If the page has buffers, the uptodate buffers are set dirty, to preserve |
677 | * dirty-state coherency between the page and the buffers. It the page does | 677 | * dirty-state coherency between the page and the buffers. It the page does |
678 | * not have buffers then when they are later attached they will all be set | 678 | * not have buffers then when they are later attached they will all be set |
679 | * dirty. | 679 | * dirty. |
680 | * | 680 | * |
681 | * The buffers are dirtied before the page is dirtied. There's a small race | 681 | * The buffers are dirtied before the page is dirtied. There's a small race |
682 | * window in which a writepage caller may see the page cleanness but not the | 682 | * window in which a writepage caller may see the page cleanness but not the |
683 | * buffer dirtiness. That's fine. If this code were to set the page dirty | 683 | * buffer dirtiness. That's fine. If this code were to set the page dirty |
684 | * before the buffers, a concurrent writepage caller could clear the page dirty | 684 | * before the buffers, a concurrent writepage caller could clear the page dirty |
685 | * bit, see a bunch of clean buffers and we'd end up with dirty buffers/clean | 685 | * bit, see a bunch of clean buffers and we'd end up with dirty buffers/clean |
686 | * page on the dirty page list. | 686 | * page on the dirty page list. |
687 | * | 687 | * |
688 | * We use private_lock to lock against try_to_free_buffers while using the | 688 | * We use private_lock to lock against try_to_free_buffers while using the |
689 | * page's buffer list. Also use this to protect against clean buffers being | 689 | * page's buffer list. Also use this to protect against clean buffers being |
690 | * added to the page after it was set dirty. | 690 | * added to the page after it was set dirty. |
691 | * | 691 | * |
692 | * FIXME: may need to call ->reservepage here as well. That's rather up to the | 692 | * FIXME: may need to call ->reservepage here as well. That's rather up to the |
693 | * address_space though. | 693 | * address_space though. |
694 | */ | 694 | */ |
695 | int __set_page_dirty_buffers(struct page *page) | 695 | int __set_page_dirty_buffers(struct page *page) |
696 | { | 696 | { |
697 | int newly_dirty; | 697 | int newly_dirty; |
698 | struct address_space *mapping = page_mapping(page); | 698 | struct address_space *mapping = page_mapping(page); |
699 | 699 | ||
700 | if (unlikely(!mapping)) | 700 | if (unlikely(!mapping)) |
701 | return !TestSetPageDirty(page); | 701 | return !TestSetPageDirty(page); |
702 | 702 | ||
703 | spin_lock(&mapping->private_lock); | 703 | spin_lock(&mapping->private_lock); |
704 | if (page_has_buffers(page)) { | 704 | if (page_has_buffers(page)) { |
705 | struct buffer_head *head = page_buffers(page); | 705 | struct buffer_head *head = page_buffers(page); |
706 | struct buffer_head *bh = head; | 706 | struct buffer_head *bh = head; |
707 | 707 | ||
708 | do { | 708 | do { |
709 | set_buffer_dirty(bh); | 709 | set_buffer_dirty(bh); |
710 | bh = bh->b_this_page; | 710 | bh = bh->b_this_page; |
711 | } while (bh != head); | 711 | } while (bh != head); |
712 | } | 712 | } |
713 | newly_dirty = !TestSetPageDirty(page); | 713 | newly_dirty = !TestSetPageDirty(page); |
714 | spin_unlock(&mapping->private_lock); | 714 | spin_unlock(&mapping->private_lock); |
715 | 715 | ||
716 | if (newly_dirty) | 716 | if (newly_dirty) |
717 | __set_page_dirty(page, mapping, 1); | 717 | __set_page_dirty(page, mapping, 1); |
718 | return newly_dirty; | 718 | return newly_dirty; |
719 | } | 719 | } |
720 | EXPORT_SYMBOL(__set_page_dirty_buffers); | 720 | EXPORT_SYMBOL(__set_page_dirty_buffers); |
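Buffer-backed filesystems commonly wire this helper straight into their address_space_operations as the ->set_page_dirty method; a minimal sketch (hypothetical aops table, illustration only):

#include <linux/fs.h>
#include <linux/buffer_head.h>

static const struct address_space_operations example_aops = {
	.set_page_dirty	= __set_page_dirty_buffers,
};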
721 | 721 | ||
722 | /* | 722 | /* |
723 | * Write out and wait upon a list of buffers. | 723 | * Write out and wait upon a list of buffers. |
724 | * | 724 | * |
725 | * We have conflicting pressures: we want to make sure that all | 725 | * We have conflicting pressures: we want to make sure that all |
726 | * initially dirty buffers get waited on, but that any subsequently | 726 | * initially dirty buffers get waited on, but that any subsequently |
727 | * dirtied buffers don't. After all, we don't want fsync to last | 727 | * dirtied buffers don't. After all, we don't want fsync to last |
728 | * forever if somebody is actively writing to the file. | 728 | * forever if somebody is actively writing to the file. |
729 | * | 729 | * |
730 | * Do this in two main stages: first we copy dirty buffers to a | 730 | * Do this in two main stages: first we copy dirty buffers to a |
731 | * temporary inode list, queueing the writes as we go. Then we clean | 731 | * temporary inode list, queueing the writes as we go. Then we clean |
732 | * up, waiting for those writes to complete. | 732 | * up, waiting for those writes to complete. |
733 | * | 733 | * |
734 | * During this second stage, any subsequent updates to the file may end | 734 | * During this second stage, any subsequent updates to the file may end |
735 | * up refiling the buffer on the original inode's dirty list again, so | 735 | * up refiling the buffer on the original inode's dirty list again, so |
736 | * there is a chance we will end up with a buffer queued for write but | 736 | * there is a chance we will end up with a buffer queued for write but |
737 | * not yet completed on that list. So, as a final cleanup we go through | 737 | * not yet completed on that list. So, as a final cleanup we go through |
738 | * the osync code to catch these locked, dirty buffers without requeuing | 738 | * the osync code to catch these locked, dirty buffers without requeuing |
739 | * any newly dirty buffers for write. | 739 | * any newly dirty buffers for write. |
740 | */ | 740 | */ |
741 | static int fsync_buffers_list(spinlock_t *lock, struct list_head *list) | 741 | static int fsync_buffers_list(spinlock_t *lock, struct list_head *list) |
742 | { | 742 | { |
743 | struct buffer_head *bh; | 743 | struct buffer_head *bh; |
744 | struct list_head tmp; | 744 | struct list_head tmp; |
745 | struct address_space *mapping; | 745 | struct address_space *mapping; |
746 | int err = 0, err2; | 746 | int err = 0, err2; |
747 | struct blk_plug plug; | 747 | struct blk_plug plug; |
748 | 748 | ||
749 | INIT_LIST_HEAD(&tmp); | 749 | INIT_LIST_HEAD(&tmp); |
750 | blk_start_plug(&plug); | 750 | blk_start_plug(&plug); |
751 | 751 | ||
752 | spin_lock(lock); | 752 | spin_lock(lock); |
753 | while (!list_empty(list)) { | 753 | while (!list_empty(list)) { |
754 | bh = BH_ENTRY(list->next); | 754 | bh = BH_ENTRY(list->next); |
755 | mapping = bh->b_assoc_map; | 755 | mapping = bh->b_assoc_map; |
756 | __remove_assoc_queue(bh); | 756 | __remove_assoc_queue(bh); |
757 | /* Avoid race with mark_buffer_dirty_inode() which does | 757 | /* Avoid race with mark_buffer_dirty_inode() which does |
758 | * a lockless check and we rely on seeing the dirty bit */ | 758 | * a lockless check and we rely on seeing the dirty bit */ |
759 | smp_mb(); | 759 | smp_mb(); |
760 | if (buffer_dirty(bh) || buffer_locked(bh)) { | 760 | if (buffer_dirty(bh) || buffer_locked(bh)) { |
761 | list_add(&bh->b_assoc_buffers, &tmp); | 761 | list_add(&bh->b_assoc_buffers, &tmp); |
762 | bh->b_assoc_map = mapping; | 762 | bh->b_assoc_map = mapping; |
763 | if (buffer_dirty(bh)) { | 763 | if (buffer_dirty(bh)) { |
764 | get_bh(bh); | 764 | get_bh(bh); |
765 | spin_unlock(lock); | 765 | spin_unlock(lock); |
766 | /* | 766 | /* |
767 | * Ensure any pending I/O completes so that | 767 | * Ensure any pending I/O completes so that |
768 | * write_dirty_buffer() actually writes the | 768 | * write_dirty_buffer() actually writes the |
769 | * current contents - it is a noop if I/O is | 769 | * current contents - it is a noop if I/O is |
770 | * still in flight on potentially older | 770 | * still in flight on potentially older |
771 | * contents. | 771 | * contents. |
772 | */ | 772 | */ |
773 | write_dirty_buffer(bh, WRITE_SYNC); | 773 | write_dirty_buffer(bh, WRITE_SYNC); |
774 | 774 | ||
775 | /* | 775 | /* |
776 | * Kick off IO for the previous mapping. Note | 776 | * Kick off IO for the previous mapping. Note |
777 | * that we will not run the very last mapping, | 777 | * that we will not run the very last mapping, |
778 | * wait_on_buffer() will do that for us | 778 | * wait_on_buffer() will do that for us |
779 | * through sync_buffer(). | 779 | * through sync_buffer(). |
780 | */ | 780 | */ |
781 | brelse(bh); | 781 | brelse(bh); |
782 | spin_lock(lock); | 782 | spin_lock(lock); |
783 | } | 783 | } |
784 | } | 784 | } |
785 | } | 785 | } |
786 | 786 | ||
787 | spin_unlock(lock); | 787 | spin_unlock(lock); |
788 | blk_finish_plug(&plug); | 788 | blk_finish_plug(&plug); |
789 | spin_lock(lock); | 789 | spin_lock(lock); |
790 | 790 | ||
791 | while (!list_empty(&tmp)) { | 791 | while (!list_empty(&tmp)) { |
792 | bh = BH_ENTRY(tmp.prev); | 792 | bh = BH_ENTRY(tmp.prev); |
793 | get_bh(bh); | 793 | get_bh(bh); |
794 | mapping = bh->b_assoc_map; | 794 | mapping = bh->b_assoc_map; |
795 | __remove_assoc_queue(bh); | 795 | __remove_assoc_queue(bh); |
796 | /* Avoid race with mark_buffer_dirty_inode() which does | 796 | /* Avoid race with mark_buffer_dirty_inode() which does |
797 | * a lockless check and we rely on seeing the dirty bit */ | 797 | * a lockless check and we rely on seeing the dirty bit */ |
798 | smp_mb(); | 798 | smp_mb(); |
799 | if (buffer_dirty(bh)) { | 799 | if (buffer_dirty(bh)) { |
800 | list_add(&bh->b_assoc_buffers, | 800 | list_add(&bh->b_assoc_buffers, |
801 | &mapping->private_list); | 801 | &mapping->private_list); |
802 | bh->b_assoc_map = mapping; | 802 | bh->b_assoc_map = mapping; |
803 | } | 803 | } |
804 | spin_unlock(lock); | 804 | spin_unlock(lock); |
805 | wait_on_buffer(bh); | 805 | wait_on_buffer(bh); |
806 | if (!buffer_uptodate(bh)) | 806 | if (!buffer_uptodate(bh)) |
807 | err = -EIO; | 807 | err = -EIO; |
808 | brelse(bh); | 808 | brelse(bh); |
809 | spin_lock(lock); | 809 | spin_lock(lock); |
810 | } | 810 | } |
811 | 811 | ||
812 | spin_unlock(lock); | 812 | spin_unlock(lock); |
813 | err2 = osync_buffers_list(lock, list); | 813 | err2 = osync_buffers_list(lock, list); |
814 | if (err) | 814 | if (err) |
815 | return err; | 815 | return err; |
816 | else | 816 | else |
817 | return err2; | 817 | return err2; |
818 | } | 818 | } |
819 | 819 | ||
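The tail of the list flush above follows a two-phase shape worth calling out: under the spinlock, dirty entries are moved onto a private list and I/O is started with the lock dropped; a second pass then re-walks the private list and waits for that I/O before the two error codes are combined. A minimal kernel-context sketch of the shape, with a hypothetical struct item and start_io()/wait_io() helpers standing in for write_dirty_buffer() and wait_on_buffer():

    #include <linux/list.h>
    #include <linux/spinlock.h>

    struct item {                                  /* hypothetical payload type */
            struct list_head node;
    };

    static void start_io(struct item *item) { }    /* e.g. write_dirty_buffer() */
    static void wait_io(struct item *item)  { }    /* e.g. wait_on_buffer()     */

    static void flush_list(spinlock_t *lock, struct list_head *list)
    {
            LIST_HEAD(tmp);
            struct item *item;

            spin_lock(lock);
            while (!list_empty(list)) {            /* phase 1: start the I/O    */
                    item = list_first_entry(list, struct item, node);
                    list_move(&item->node, &tmp);
                    spin_unlock(lock);
                    start_io(item);                /* blocking work, lock dropped */
                    spin_lock(lock);
            }
            while (!list_empty(&tmp)) {            /* phase 2: wait for it      */
                    item = list_first_entry(&tmp, struct item, node);
                    list_del_init(&item->node);
                    spin_unlock(lock);
                    wait_io(item);
                    spin_lock(lock);
            }
            spin_unlock(lock);
    }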
820 | /* | 820 | /* |
821 | * Invalidate any and all dirty buffers on a given inode. We are | 821 | * Invalidate any and all dirty buffers on a given inode. We are |
822 | * probably unmounting the fs, but that doesn't mean we have already | 822 | * probably unmounting the fs, but that doesn't mean we have already |
823 | * done a sync(). Just drop the buffers from the inode list. | 823 | * done a sync(). Just drop the buffers from the inode list. |
824 | * | 824 | * |
825 | * NOTE: we take the inode's blockdev's mapping's private_lock. Which | 825 | * NOTE: we take the inode's blockdev's mapping's private_lock. Which |
826 | * assumes that all the buffers are against the blockdev. Not true | 826 | * assumes that all the buffers are against the blockdev. Not true |
827 | * for reiserfs. | 827 | * for reiserfs. |
828 | */ | 828 | */ |
829 | void invalidate_inode_buffers(struct inode *inode) | 829 | void invalidate_inode_buffers(struct inode *inode) |
830 | { | 830 | { |
831 | if (inode_has_buffers(inode)) { | 831 | if (inode_has_buffers(inode)) { |
832 | struct address_space *mapping = &inode->i_data; | 832 | struct address_space *mapping = &inode->i_data; |
833 | struct list_head *list = &mapping->private_list; | 833 | struct list_head *list = &mapping->private_list; |
834 | struct address_space *buffer_mapping = mapping->private_data; | 834 | struct address_space *buffer_mapping = mapping->private_data; |
835 | 835 | ||
836 | spin_lock(&buffer_mapping->private_lock); | 836 | spin_lock(&buffer_mapping->private_lock); |
837 | while (!list_empty(list)) | 837 | while (!list_empty(list)) |
838 | __remove_assoc_queue(BH_ENTRY(list->next)); | 838 | __remove_assoc_queue(BH_ENTRY(list->next)); |
839 | spin_unlock(&buffer_mapping->private_lock); | 839 | spin_unlock(&buffer_mapping->private_lock); |
840 | } | 840 | } |
841 | } | 841 | } |
842 | EXPORT_SYMBOL(invalidate_inode_buffers); | 842 | EXPORT_SYMBOL(invalidate_inode_buffers); |
843 | 843 | ||
844 | /* | 844 | /* |
845 | * Remove any clean buffers from the inode's buffer list. This is called | 845 | * Remove any clean buffers from the inode's buffer list. This is called |
846 | * when we're trying to free the inode itself. Those buffers can pin it. | 846 | * when we're trying to free the inode itself. Those buffers can pin it. |
847 | * | 847 | * |
848 | * Returns true if all buffers were removed. | 848 | * Returns true if all buffers were removed. |
849 | */ | 849 | */ |
850 | int remove_inode_buffers(struct inode *inode) | 850 | int remove_inode_buffers(struct inode *inode) |
851 | { | 851 | { |
852 | int ret = 1; | 852 | int ret = 1; |
853 | 853 | ||
854 | if (inode_has_buffers(inode)) { | 854 | if (inode_has_buffers(inode)) { |
855 | struct address_space *mapping = &inode->i_data; | 855 | struct address_space *mapping = &inode->i_data; |
856 | struct list_head *list = &mapping->private_list; | 856 | struct list_head *list = &mapping->private_list; |
857 | struct address_space *buffer_mapping = mapping->private_data; | 857 | struct address_space *buffer_mapping = mapping->private_data; |
858 | 858 | ||
859 | spin_lock(&buffer_mapping->private_lock); | 859 | spin_lock(&buffer_mapping->private_lock); |
860 | while (!list_empty(list)) { | 860 | while (!list_empty(list)) { |
861 | struct buffer_head *bh = BH_ENTRY(list->next); | 861 | struct buffer_head *bh = BH_ENTRY(list->next); |
862 | if (buffer_dirty(bh)) { | 862 | if (buffer_dirty(bh)) { |
863 | ret = 0; | 863 | ret = 0; |
864 | break; | 864 | break; |
865 | } | 865 | } |
866 | __remove_assoc_queue(bh); | 866 | __remove_assoc_queue(bh); |
867 | } | 867 | } |
868 | spin_unlock(&buffer_mapping->private_lock); | 868 | spin_unlock(&buffer_mapping->private_lock); |
869 | } | 869 | } |
870 | return ret; | 870 | return ret; |
871 | } | 871 | } |
872 | 872 | ||
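Because the return value is a boolean-style flag (true only when every associated buffer could be dropped), a caller that is trying to free the inode can use it directly to decide whether the inode is still pinned. A hedged sketch, assuming some hypothetical inode-reclaim path:

    if (!remove_inode_buffers(inode))
            return;         /* a dirty associated buffer still pins the inode */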
873 | /* | 873 | /* |
874 | * Create the appropriate buffers when given a page for data area and | 874 | * Create the appropriate buffers when given a page for data area and |
875 | * the size of each buffer.. Use the bh->b_this_page linked list to | 875 | * the size of each buffer.. Use the bh->b_this_page linked list to |
876 | * follow the buffers created. Return NULL if unable to create more | 876 | * follow the buffers created. Return NULL if unable to create more |
877 | * buffers. | 877 | * buffers. |
878 | * | 878 | * |
879 | * The retry flag is used to differentiate async IO (paging, swapping) | 879 | * The retry flag is used to differentiate async IO (paging, swapping) |
880 | * which may not fail from ordinary buffer allocations. | 880 | * which may not fail from ordinary buffer allocations. |
881 | */ | 881 | */ |
882 | struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, | 882 | struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, |
883 | int retry) | 883 | int retry) |
884 | { | 884 | { |
885 | struct buffer_head *bh, *head; | 885 | struct buffer_head *bh, *head; |
886 | long offset; | 886 | long offset; |
887 | 887 | ||
888 | try_again: | 888 | try_again: |
889 | head = NULL; | 889 | head = NULL; |
890 | offset = PAGE_SIZE; | 890 | offset = PAGE_SIZE; |
891 | while ((offset -= size) >= 0) { | 891 | while ((offset -= size) >= 0) { |
892 | bh = alloc_buffer_head(GFP_NOFS); | 892 | bh = alloc_buffer_head(GFP_NOFS); |
893 | if (!bh) | 893 | if (!bh) |
894 | goto no_grow; | 894 | goto no_grow; |
895 | 895 | ||
896 | bh->b_this_page = head; | 896 | bh->b_this_page = head; |
897 | bh->b_blocknr = -1; | 897 | bh->b_blocknr = -1; |
898 | head = bh; | 898 | head = bh; |
899 | 899 | ||
900 | bh->b_size = size; | 900 | bh->b_size = size; |
901 | 901 | ||
902 | /* Link the buffer to its page */ | 902 | /* Link the buffer to its page */ |
903 | set_bh_page(bh, page, offset); | 903 | set_bh_page(bh, page, offset); |
904 | } | 904 | } |
905 | return head; | 905 | return head; |
906 | /* | 906 | /* |
907 | * In case anything failed, we just free everything we got. | 907 | * In case anything failed, we just free everything we got. |
908 | */ | 908 | */ |
909 | no_grow: | 909 | no_grow: |
910 | if (head) { | 910 | if (head) { |
911 | do { | 911 | do { |
912 | bh = head; | 912 | bh = head; |
913 | head = head->b_this_page; | 913 | head = head->b_this_page; |
914 | free_buffer_head(bh); | 914 | free_buffer_head(bh); |
915 | } while (head); | 915 | } while (head); |
916 | } | 916 | } |
917 | 917 | ||
918 | /* | 918 | /* |
919 | * Return failure for non-async IO requests. Async IO requests | 919 | * Return failure for non-async IO requests. Async IO requests |
920 | * are not allowed to fail, so we have to wait until buffer heads | 920 | * are not allowed to fail, so we have to wait until buffer heads |
921 | * become available. But we don't want tasks sleeping with | 921 | * become available. But we don't want tasks sleeping with |
922 | * partially complete buffers, so all were released above. | 922 | * partially complete buffers, so all were released above. |
923 | */ | 923 | */ |
924 | if (!retry) | 924 | if (!retry) |
925 | return NULL; | 925 | return NULL; |
926 | 926 | ||
927 | /* We're _really_ low on memory. Now we just | 927 | /* We're _really_ low on memory. Now we just |
928 | * wait for old buffer heads to become free due to | 928 | * wait for old buffer heads to become free due to |
929 | * finishing IO. Since this is an async request and | 929 | * finishing IO. Since this is an async request and |
930 | * the reserve list is empty, we're sure there are | 930 | * the reserve list is empty, we're sure there are |
931 | * async buffer heads in use. | 931 | * async buffer heads in use. |
932 | */ | 932 | */ |
933 | free_more_memory(); | 933 | free_more_memory(); |
934 | goto try_again; | 934 | goto try_again; |
935 | } | 935 | } |
936 | EXPORT_SYMBOL_GPL(alloc_page_buffers); | 936 | EXPORT_SYMBOL_GPL(alloc_page_buffers); |
937 | 937 | ||
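alloc_page_buffers() hands back the head of a NULL-terminated b_this_page chain; each buffer_head already carries its size and is attached to the page data via set_bh_page(), but the chain is not yet circular (link_dev_buffers() below is what closes it into a ring). A minimal kernel-context sketch, assuming a locked page is in hand:

    struct buffer_head *head, *bh;

    head = alloc_page_buffers(page, 512, 1);       /* retry != 0: may not fail */
    for (bh = head; bh != NULL; bh = bh->b_this_page)
            pr_debug("buffer data=%p size=%zu\n", bh->b_data, bh->b_size);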
938 | static inline void | 938 | static inline void |
939 | link_dev_buffers(struct page *page, struct buffer_head *head) | 939 | link_dev_buffers(struct page *page, struct buffer_head *head) |
940 | { | 940 | { |
941 | struct buffer_head *bh, *tail; | 941 | struct buffer_head *bh, *tail; |
942 | 942 | ||
943 | bh = head; | 943 | bh = head; |
944 | do { | 944 | do { |
945 | tail = bh; | 945 | tail = bh; |
946 | bh = bh->b_this_page; | 946 | bh = bh->b_this_page; |
947 | } while (bh); | 947 | } while (bh); |
948 | tail->b_this_page = head; | 948 | tail->b_this_page = head; |
949 | attach_page_buffers(page, head); | 949 | attach_page_buffers(page, head); |
950 | } | 950 | } |
951 | 951 | ||
952 | static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size) | 952 | static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size) |
953 | { | 953 | { |
954 | sector_t retval = ~((sector_t)0); | 954 | sector_t retval = ~((sector_t)0); |
955 | loff_t sz = i_size_read(bdev->bd_inode); | 955 | loff_t sz = i_size_read(bdev->bd_inode); |
956 | 956 | ||
957 | if (sz) { | 957 | if (sz) { |
958 | unsigned int sizebits = blksize_bits(size); | 958 | unsigned int sizebits = blksize_bits(size); |
959 | retval = (sz >> sizebits); | 959 | retval = (sz >> sizebits); |
960 | } | 960 | } |
961 | return retval; | 961 | return retval; |
962 | } | 962 | } |
963 | 963 | ||
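For example, if i_size_read() reports a 1 GiB device and the requested block size is 4096 bytes, blksize_bits(4096) is 12 and the helper returns 1073741824 >> 12 = 262144, i.e. the first block number that lies beyond the end of the device; a zero i_size leaves the all-ones sector_t in place.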
964 | /* | 964 | /* |
965 | * Initialise the state of a blockdev page's buffers. | 965 | * Initialise the state of a blockdev page's buffers. |
966 | */ | 966 | */ |
967 | static sector_t | 967 | static sector_t |
968 | init_page_buffers(struct page *page, struct block_device *bdev, | 968 | init_page_buffers(struct page *page, struct block_device *bdev, |
969 | sector_t block, int size) | 969 | sector_t block, int size) |
970 | { | 970 | { |
971 | struct buffer_head *head = page_buffers(page); | 971 | struct buffer_head *head = page_buffers(page); |
972 | struct buffer_head *bh = head; | 972 | struct buffer_head *bh = head; |
973 | int uptodate = PageUptodate(page); | 973 | int uptodate = PageUptodate(page); |
974 | sector_t end_block = blkdev_max_block(I_BDEV(bdev->bd_inode), size); | 974 | sector_t end_block = blkdev_max_block(I_BDEV(bdev->bd_inode), size); |
975 | 975 | ||
976 | do { | 976 | do { |
977 | if (!buffer_mapped(bh)) { | 977 | if (!buffer_mapped(bh)) { |
978 | init_buffer(bh, NULL, NULL); | 978 | init_buffer(bh, NULL, NULL); |
979 | bh->b_bdev = bdev; | 979 | bh->b_bdev = bdev; |
980 | bh->b_blocknr = block; | 980 | bh->b_blocknr = block; |
981 | if (uptodate) | 981 | if (uptodate) |
982 | set_buffer_uptodate(bh); | 982 | set_buffer_uptodate(bh); |
983 | if (block < end_block) | 983 | if (block < end_block) |
984 | set_buffer_mapped(bh); | 984 | set_buffer_mapped(bh); |
985 | } | 985 | } |
986 | block++; | 986 | block++; |
987 | bh = bh->b_this_page; | 987 | bh = bh->b_this_page; |
988 | } while (bh != head); | 988 | } while (bh != head); |
989 | 989 | ||
990 | /* | 990 | /* |
991 | * Caller needs to validate requested block against end of device. | 991 | * Caller needs to validate requested block against end of device. |
992 | */ | 992 | */ |
993 | return end_block; | 993 | return end_block; |
994 | } | 994 | } |
995 | 995 | ||
996 | /* | 996 | /* |
997 | * Create the page-cache page that contains the requested block. | 997 | * Create the page-cache page that contains the requested block. |
998 | * | 998 | * |
999 | * This is used purely for blockdev mappings. | 999 | * This is used purely for blockdev mappings. |
1000 | */ | 1000 | */ |
1001 | static int | 1001 | static int |
1002 | grow_dev_page(struct block_device *bdev, sector_t block, | 1002 | grow_dev_page(struct block_device *bdev, sector_t block, |
1003 | pgoff_t index, int size, int sizebits) | 1003 | pgoff_t index, int size, int sizebits) |
1004 | { | 1004 | { |
1005 | struct inode *inode = bdev->bd_inode; | 1005 | struct inode *inode = bdev->bd_inode; |
1006 | struct page *page; | 1006 | struct page *page; |
1007 | struct buffer_head *bh; | 1007 | struct buffer_head *bh; |
1008 | sector_t end_block; | 1008 | sector_t end_block; |
1009 | int ret = 0; /* Will call free_more_memory() */ | 1009 | int ret = 0; /* Will call free_more_memory() */ |
1010 | gfp_t gfp_mask; | 1010 | gfp_t gfp_mask; |
1011 | 1011 | ||
1012 | gfp_mask = mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS; | 1012 | gfp_mask = mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS; |
1013 | gfp_mask |= __GFP_MOVABLE; | 1013 | gfp_mask |= __GFP_MOVABLE; |
1014 | /* | 1014 | /* |
1015 | * XXX: __getblk_slow() can not really deal with failure and | 1015 | * XXX: __getblk_slow() can not really deal with failure and |
1016 | * will endlessly loop on improvised global reclaim. Prefer | 1016 | * will endlessly loop on improvised global reclaim. Prefer |
1017 | * looping in the allocator rather than here, at least that | 1017 | * looping in the allocator rather than here, at least that |
1018 | * code knows what it's doing. | 1018 | * code knows what it's doing. |
1019 | */ | 1019 | */ |
1020 | gfp_mask |= __GFP_NOFAIL; | 1020 | gfp_mask |= __GFP_NOFAIL; |
1021 | 1021 | ||
1022 | page = find_or_create_page(inode->i_mapping, index, gfp_mask); | 1022 | page = find_or_create_page(inode->i_mapping, index, gfp_mask); |
1023 | if (!page) | 1023 | if (!page) |
1024 | return ret; | 1024 | return ret; |
1025 | 1025 | ||
1026 | BUG_ON(!PageLocked(page)); | 1026 | BUG_ON(!PageLocked(page)); |
1027 | 1027 | ||
1028 | if (page_has_buffers(page)) { | 1028 | if (page_has_buffers(page)) { |
1029 | bh = page_buffers(page); | 1029 | bh = page_buffers(page); |
1030 | if (bh->b_size == size) { | 1030 | if (bh->b_size == size) { |
1031 | end_block = init_page_buffers(page, bdev, | 1031 | end_block = init_page_buffers(page, bdev, |
1032 | index << sizebits, size); | 1032 | index << sizebits, size); |
1033 | goto done; | 1033 | goto done; |
1034 | } | 1034 | } |
1035 | if (!try_to_free_buffers(page)) | 1035 | if (!try_to_free_buffers(page)) |
1036 | goto failed; | 1036 | goto failed; |
1037 | } | 1037 | } |
1038 | 1038 | ||
1039 | /* | 1039 | /* |
1040 | * Allocate some buffers for this page | 1040 | * Allocate some buffers for this page |
1041 | */ | 1041 | */ |
1042 | bh = alloc_page_buffers(page, size, 0); | 1042 | bh = alloc_page_buffers(page, size, 0); |
1043 | if (!bh) | 1043 | if (!bh) |
1044 | goto failed; | 1044 | goto failed; |
1045 | 1045 | ||
1046 | /* | 1046 | /* |
1047 | * Link the page to the buffers and initialise them. Take the | 1047 | * Link the page to the buffers and initialise them. Take the |
1048 | * lock to be atomic wrt __find_get_block(), which does not | 1048 | * lock to be atomic wrt __find_get_block(), which does not |
1049 | * run under the page lock. | 1049 | * run under the page lock. |
1050 | */ | 1050 | */ |
1051 | spin_lock(&inode->i_mapping->private_lock); | 1051 | spin_lock(&inode->i_mapping->private_lock); |
1052 | link_dev_buffers(page, bh); | 1052 | link_dev_buffers(page, bh); |
1053 | end_block = init_page_buffers(page, bdev, index << sizebits, size); | 1053 | end_block = init_page_buffers(page, bdev, index << sizebits, size); |
1054 | spin_unlock(&inode->i_mapping->private_lock); | 1054 | spin_unlock(&inode->i_mapping->private_lock); |
1055 | done: | 1055 | done: |
1056 | ret = (block < end_block) ? 1 : -ENXIO; | 1056 | ret = (block < end_block) ? 1 : -ENXIO; |
1057 | failed: | 1057 | failed: |
1058 | unlock_page(page); | 1058 | unlock_page(page); |
1059 | page_cache_release(page); | 1059 | page_cache_release(page); |
1060 | return ret; | 1060 | return ret; |
1061 | } | 1061 | } |
1062 | 1062 | ||
1063 | /* | 1063 | /* |
1064 | * Create buffers for the specified block device block's page. If | 1064 | * Create buffers for the specified block device block's page. If |
1065 | * that page was dirty, the buffers are set dirty also. | 1065 | * that page was dirty, the buffers are set dirty also. |
1066 | */ | 1066 | */ |
1067 | static int | 1067 | static int |
1068 | grow_buffers(struct block_device *bdev, sector_t block, int size) | 1068 | grow_buffers(struct block_device *bdev, sector_t block, int size) |
1069 | { | 1069 | { |
1070 | pgoff_t index; | 1070 | pgoff_t index; |
1071 | int sizebits; | 1071 | int sizebits; |
1072 | 1072 | ||
1073 | sizebits = -1; | 1073 | sizebits = -1; |
1074 | do { | 1074 | do { |
1075 | sizebits++; | 1075 | sizebits++; |
1076 | } while ((size << sizebits) < PAGE_SIZE); | 1076 | } while ((size << sizebits) < PAGE_SIZE); |
1077 | 1077 | ||
1078 | index = block >> sizebits; | 1078 | index = block >> sizebits; |
1079 | 1079 | ||
1080 | /* | 1080 | /* |
1081 | * Check for a block which wants to lie outside our maximum possible | 1081 | * Check for a block which wants to lie outside our maximum possible |
1082 | * pagecache index. (this comparison is done using sector_t types). | 1082 | * pagecache index. (this comparison is done using sector_t types). |
1083 | */ | 1083 | */ |
1084 | if (unlikely(index != block >> sizebits)) { | 1084 | if (unlikely(index != block >> sizebits)) { |
1085 | char b[BDEVNAME_SIZE]; | 1085 | char b[BDEVNAME_SIZE]; |
1086 | 1086 | ||
1087 | printk(KERN_ERR "%s: requested out-of-range block %llu for " | 1087 | printk(KERN_ERR "%s: requested out-of-range block %llu for " |
1088 | "device %s\n", | 1088 | "device %s\n", |
1089 | __func__, (unsigned long long)block, | 1089 | __func__, (unsigned long long)block, |
1090 | bdevname(bdev, b)); | 1090 | bdevname(bdev, b)); |
1091 | return -EIO; | 1091 | return -EIO; |
1092 | } | 1092 | } |
1093 | 1093 | ||
1094 | /* Create a page with the proper size buffers.. */ | 1094 | /* Create a page with the proper size buffers.. */ |
1095 | return grow_dev_page(bdev, block, index, size, sizebits); | 1095 | return grow_dev_page(bdev, block, index, size, sizebits); |
1096 | } | 1096 | } |
1097 | 1097 | ||
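As a worked example of the sizebits loop: with a 1024-byte block size and a 4096-byte PAGE_SIZE, sizebits ends up as 2 (four blocks per page), so block 10 maps to page index 10 >> 2 = 2, and grow_dev_page() initialises that page's buffers starting from block index << sizebits = 8, covering blocks 8..11. The unlikely() check then catches block numbers whose computed page index no longer fits in pgoff_t.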
1098 | static struct buffer_head * | 1098 | static struct buffer_head * |
1099 | __getblk_slow(struct block_device *bdev, sector_t block, int size) | 1099 | __getblk_slow(struct block_device *bdev, sector_t block, int size) |
1100 | { | 1100 | { |
1101 | /* Size must be multiple of hard sectorsize */ | 1101 | /* Size must be multiple of hard sectorsize */ |
1102 | if (unlikely(size & (bdev_logical_block_size(bdev)-1) || | 1102 | if (unlikely(size & (bdev_logical_block_size(bdev)-1) || |
1103 | (size < 512 || size > PAGE_SIZE))) { | 1103 | (size < 512 || size > PAGE_SIZE))) { |
1104 | printk(KERN_ERR "getblk(): invalid block size %d requested\n", | 1104 | printk(KERN_ERR "getblk(): invalid block size %d requested\n", |
1105 | size); | 1105 | size); |
1106 | printk(KERN_ERR "logical block size: %d\n", | 1106 | printk(KERN_ERR "logical block size: %d\n", |
1107 | bdev_logical_block_size(bdev)); | 1107 | bdev_logical_block_size(bdev)); |
1108 | 1108 | ||
1109 | dump_stack(); | 1109 | dump_stack(); |
1110 | return NULL; | 1110 | return NULL; |
1111 | } | 1111 | } |
1112 | 1112 | ||
1113 | for (;;) { | 1113 | for (;;) { |
1114 | struct buffer_head *bh; | 1114 | struct buffer_head *bh; |
1115 | int ret; | 1115 | int ret; |
1116 | 1116 | ||
1117 | bh = __find_get_block(bdev, block, size); | 1117 | bh = __find_get_block(bdev, block, size); |
1118 | if (bh) | 1118 | if (bh) |
1119 | return bh; | 1119 | return bh; |
1120 | 1120 | ||
1121 | ret = grow_buffers(bdev, block, size); | 1121 | ret = grow_buffers(bdev, block, size); |
1122 | if (ret < 0) | 1122 | if (ret < 0) |
1123 | return NULL; | 1123 | return NULL; |
1124 | if (ret == 0) | 1124 | if (ret == 0) |
1125 | free_more_memory(); | 1125 | free_more_memory(); |
1126 | } | 1126 | } |
1127 | } | 1127 | } |
1128 | 1128 | ||
1129 | /* | 1129 | /* |
1130 | * The relationship between dirty buffers and dirty pages: | 1130 | * The relationship between dirty buffers and dirty pages: |
1131 | * | 1131 | * |
1132 | * Whenever a page has any dirty buffers, the page's dirty bit is set, and | 1132 | * Whenever a page has any dirty buffers, the page's dirty bit is set, and |
1133 | * the page is tagged dirty in its radix tree. | 1133 | * the page is tagged dirty in its radix tree. |
1134 | * | 1134 | * |
1135 | * At all times, the dirtiness of the buffers represents the dirtiness of | 1135 | * At all times, the dirtiness of the buffers represents the dirtiness of |
1136 | * subsections of the page. If the page has buffers, the page dirty bit is | 1136 | * subsections of the page. If the page has buffers, the page dirty bit is |
1137 | * merely a hint about the true dirty state. | 1137 | * merely a hint about the true dirty state. |
1138 | * | 1138 | * |
1139 | * When a page is set dirty in its entirety, all its buffers are marked dirty | 1139 | * When a page is set dirty in its entirety, all its buffers are marked dirty |
1140 | * (if the page has buffers). | 1140 | * (if the page has buffers). |
1141 | * | 1141 | * |
1142 | * When a buffer is marked dirty, its page is dirtied, but the page's other | 1142 | * When a buffer is marked dirty, its page is dirtied, but the page's other |
1143 | * buffers are not. | 1143 | * buffers are not. |
1144 | * | 1144 | * |
1145 | * Also. When blockdev buffers are explicitly read with bread(), they | 1145 | * Also. When blockdev buffers are explicitly read with bread(), they |
1146 | * individually become uptodate. But their backing page remains not | 1146 | * individually become uptodate. But their backing page remains not |
1147 | * uptodate - even if all of its buffers are uptodate. A subsequent | 1147 | * uptodate - even if all of its buffers are uptodate. A subsequent |
1148 | * block_read_full_page() against that page will discover all the uptodate | 1148 | * block_read_full_page() against that page will discover all the uptodate |
1149 | * buffers, will set the page uptodate and will perform no I/O. | 1149 | * buffers, will set the page uptodate and will perform no I/O. |
1150 | */ | 1150 | */ |
1151 | 1151 | ||
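The rules above can be made concrete with a small kernel-context sketch; bh_a and bh_b are assumed to be two buffers attached to the same, initially clean, page:

    mark_buffer_dirty(bh_a);
    /* buffer_dirty(bh_a) == 1, buffer_dirty(bh_b) == 0,
     * PageDirty(bh_a->b_page) == 1 -- the page bit is now only a hint */

    set_page_dirty(bh_a->b_page);
    /* for mappings that use __set_page_dirty_buffers as ->set_page_dirty,
     * dirtying the whole page marks every attached buffer dirty as well */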
1152 | /** | 1152 | /** |
1153 | * mark_buffer_dirty - mark a buffer_head as needing writeout | 1153 | * mark_buffer_dirty - mark a buffer_head as needing writeout |
1154 | * @bh: the buffer_head to mark dirty | 1154 | * @bh: the buffer_head to mark dirty |
1155 | * | 1155 | * |
1156 | * mark_buffer_dirty() will set the dirty bit against the buffer, then set its | 1156 | * mark_buffer_dirty() will set the dirty bit against the buffer, then set its |
1157 | * backing page dirty, then tag the page as dirty in its address_space's radix | 1157 | * backing page dirty, then tag the page as dirty in its address_space's radix |
1158 | * tree and then attach the address_space's inode to its superblock's dirty | 1158 | * tree and then attach the address_space's inode to its superblock's dirty |
1159 | * inode list. | 1159 | * inode list. |
1160 | * | 1160 | * |
1161 | * mark_buffer_dirty() is atomic. It takes bh->b_page->mapping->private_lock, | 1161 | * mark_buffer_dirty() is atomic. It takes bh->b_page->mapping->private_lock, |
1162 | * mapping->tree_lock and mapping->host->i_lock. | 1162 | * mapping->tree_lock and mapping->host->i_lock. |
1163 | */ | 1163 | */ |
1164 | void mark_buffer_dirty(struct buffer_head *bh) | 1164 | void mark_buffer_dirty(struct buffer_head *bh) |
1165 | { | 1165 | { |
1166 | WARN_ON_ONCE(!buffer_uptodate(bh)); | 1166 | WARN_ON_ONCE(!buffer_uptodate(bh)); |
1167 | 1167 | ||
1168 | trace_block_dirty_buffer(bh); | 1168 | trace_block_dirty_buffer(bh); |
1169 | 1169 | ||
1170 | /* | 1170 | /* |
1171 | * Very *carefully* optimize the it-is-already-dirty case. | 1171 | * Very *carefully* optimize the it-is-already-dirty case. |
1172 | * | 1172 | * |
1173 | * Don't let the final "is it dirty" escape to before we | 1173 | * Don't let the final "is it dirty" escape to before we |
1174 | * perhaps modified the buffer. | 1174 | * perhaps modified the buffer. |
1175 | */ | 1175 | */ |
1176 | if (buffer_dirty(bh)) { | 1176 | if (buffer_dirty(bh)) { |
1177 | smp_mb(); | 1177 | smp_mb(); |
1178 | if (buffer_dirty(bh)) | 1178 | if (buffer_dirty(bh)) |
1179 | return; | 1179 | return; |
1180 | } | 1180 | } |
1181 | 1181 | ||
1182 | if (!test_set_buffer_dirty(bh)) { | 1182 | if (!test_set_buffer_dirty(bh)) { |
1183 | struct page *page = bh->b_page; | 1183 | struct page *page = bh->b_page; |
1184 | if (!TestSetPageDirty(page)) { | 1184 | if (!TestSetPageDirty(page)) { |
1185 | struct address_space *mapping = page_mapping(page); | 1185 | struct address_space *mapping = page_mapping(page); |
1186 | if (mapping) | 1186 | if (mapping) |
1187 | __set_page_dirty(page, mapping, 0); | 1187 | __set_page_dirty(page, mapping, 0); |
1188 | } | 1188 | } |
1189 | } | 1189 | } |
1190 | } | 1190 | } |
1191 | EXPORT_SYMBOL(mark_buffer_dirty); | 1191 | EXPORT_SYMBOL(mark_buffer_dirty); |
1192 | 1192 | ||
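The lock-free fast path above relies on the caller having modified the buffer contents before calling in; the smp_mb() stops the second buffer_dirty() read from being ordered ahead of that modification. A typical caller therefore looks like this (kernel context; offset, src and len are assumed caller state):

    memcpy(bh->b_data + offset, src, len);    /* modify the data first ...  */
    mark_buffer_dirty(bh);                    /* ... then publish dirtiness */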
1193 | /* | 1193 | /* |
1194 | * Decrement a buffer_head's reference count. If all buffers against a page | 1194 | * Decrement a buffer_head's reference count. If all buffers against a page |
1195 | * have zero reference count, are clean and unlocked, and if the page is clean | 1195 | * have zero reference count, are clean and unlocked, and if the page is clean |
1196 | * and unlocked then try_to_free_buffers() may strip the buffers from the page | 1196 | * and unlocked then try_to_free_buffers() may strip the buffers from the page |
1197 | * in preparation for freeing it (sometimes, rarely, buffers are removed from | 1197 | * in preparation for freeing it (sometimes, rarely, buffers are removed from |
1198 | * a page but it ends up not being freed, and buffers may later be reattached). | 1198 | * a page but it ends up not being freed, and buffers may later be reattached). |
1199 | */ | 1199 | */ |
1200 | void __brelse(struct buffer_head * buf) | 1200 | void __brelse(struct buffer_head * buf) |
1201 | { | 1201 | { |
1202 | if (atomic_read(&buf->b_count)) { | 1202 | if (atomic_read(&buf->b_count)) { |
1203 | put_bh(buf); | 1203 | put_bh(buf); |
1204 | return; | 1204 | return; |
1205 | } | 1205 | } |
1206 | WARN(1, KERN_ERR "VFS: brelse: Trying to free free buffer\n"); | 1206 | WARN(1, KERN_ERR "VFS: brelse: Trying to free free buffer\n"); |
1207 | } | 1207 | } |
1208 | EXPORT_SYMBOL(__brelse); | 1208 | EXPORT_SYMBOL(__brelse); |
1209 | 1209 | ||
1210 | /* | 1210 | /* |
1211 | * bforget() is like brelse(), except it discards any | 1211 | * bforget() is like brelse(), except it discards any |
1212 | * potentially dirty data. | 1212 | * potentially dirty data. |
1213 | */ | 1213 | */ |
1214 | void __bforget(struct buffer_head *bh) | 1214 | void __bforget(struct buffer_head *bh) |
1215 | { | 1215 | { |
1216 | clear_buffer_dirty(bh); | 1216 | clear_buffer_dirty(bh); |
1217 | if (bh->b_assoc_map) { | 1217 | if (bh->b_assoc_map) { |
1218 | struct address_space *buffer_mapping = bh->b_page->mapping; | 1218 | struct address_space *buffer_mapping = bh->b_page->mapping; |
1219 | 1219 | ||
1220 | spin_lock(&buffer_mapping->private_lock); | 1220 | spin_lock(&buffer_mapping->private_lock); |
1221 | list_del_init(&bh->b_assoc_buffers); | 1221 | list_del_init(&bh->b_assoc_buffers); |
1222 | bh->b_assoc_map = NULL; | 1222 | bh->b_assoc_map = NULL; |
1223 | spin_unlock(&buffer_mapping->private_lock); | 1223 | spin_unlock(&buffer_mapping->private_lock); |
1224 | } | 1224 | } |
1225 | __brelse(bh); | 1225 | __brelse(bh); |
1226 | } | 1226 | } |
1227 | EXPORT_SYMBOL(__bforget); | 1227 | EXPORT_SYMBOL(__bforget); |
1228 | 1228 | ||
1229 | static struct buffer_head *__bread_slow(struct buffer_head *bh) | 1229 | static struct buffer_head *__bread_slow(struct buffer_head *bh) |
1230 | { | 1230 | { |
1231 | lock_buffer(bh); | 1231 | lock_buffer(bh); |
1232 | if (buffer_uptodate(bh)) { | 1232 | if (buffer_uptodate(bh)) { |
1233 | unlock_buffer(bh); | 1233 | unlock_buffer(bh); |
1234 | return bh; | 1234 | return bh; |
1235 | } else { | 1235 | } else { |
1236 | get_bh(bh); | 1236 | get_bh(bh); |
1237 | bh->b_end_io = end_buffer_read_sync; | 1237 | bh->b_end_io = end_buffer_read_sync; |
1238 | submit_bh(READ, bh); | 1238 | submit_bh(READ, bh); |
1239 | wait_on_buffer(bh); | 1239 | wait_on_buffer(bh); |
1240 | if (buffer_uptodate(bh)) | 1240 | if (buffer_uptodate(bh)) |
1241 | return bh; | 1241 | return bh; |
1242 | } | 1242 | } |
1243 | brelse(bh); | 1243 | brelse(bh); |
1244 | return NULL; | 1244 | return NULL; |
1245 | } | 1245 | } |
1246 | 1246 | ||
1247 | /* | 1247 | /* |
1248 | * Per-cpu buffer LRU implementation. To reduce the cost of __find_get_block(). | 1248 | * Per-cpu buffer LRU implementation. To reduce the cost of __find_get_block(). |
1249 | * The bhs[] array is sorted - newest buffer is at bhs[0]. Buffers have their | 1249 | * The bhs[] array is sorted - newest buffer is at bhs[0]. Buffers have their |
1250 | * refcount elevated by one when they're in an LRU. A buffer can only appear | 1250 | * refcount elevated by one when they're in an LRU. A buffer can only appear |
1251 | * once in a particular CPU's LRU. A single buffer can be present in multiple | 1251 | * once in a particular CPU's LRU. A single buffer can be present in multiple |
1252 | * CPU's LRUs at the same time. | 1252 | * CPU's LRUs at the same time. |
1253 | * | 1253 | * |
1254 | * This is a transparent caching front-end to sb_bread(), sb_getblk() and | 1254 | * This is a transparent caching front-end to sb_bread(), sb_getblk() and |
1255 | * sb_find_get_block(). | 1255 | * sb_find_get_block(). |
1256 | * | 1256 | * |
1257 | * The LRUs themselves only need locking against invalidate_bh_lrus. We use | 1257 | * The LRUs themselves only need locking against invalidate_bh_lrus. We use |
1258 | * a local interrupt disable for that. | 1258 | * a local interrupt disable for that. |
1259 | */ | 1259 | */ |
1260 | 1260 | ||
1261 | #define BH_LRU_SIZE 8 | 1261 | #define BH_LRU_SIZE 8 |
1262 | 1262 | ||
1263 | struct bh_lru { | 1263 | struct bh_lru { |
1264 | struct buffer_head *bhs[BH_LRU_SIZE]; | 1264 | struct buffer_head *bhs[BH_LRU_SIZE]; |
1265 | }; | 1265 | }; |
1266 | 1266 | ||
1267 | static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }}; | 1267 | static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }}; |
1268 | 1268 | ||
1269 | #ifdef CONFIG_SMP | 1269 | #ifdef CONFIG_SMP |
1270 | #define bh_lru_lock() local_irq_disable() | 1270 | #define bh_lru_lock() local_irq_disable() |
1271 | #define bh_lru_unlock() local_irq_enable() | 1271 | #define bh_lru_unlock() local_irq_enable() |
1272 | #else | 1272 | #else |
1273 | #define bh_lru_lock() preempt_disable() | 1273 | #define bh_lru_lock() preempt_disable() |
1274 | #define bh_lru_unlock() preempt_enable() | 1274 | #define bh_lru_unlock() preempt_enable() |
1275 | #endif | 1275 | #endif |
1276 | 1276 | ||
1277 | static inline void check_irqs_on(void) | 1277 | static inline void check_irqs_on(void) |
1278 | { | 1278 | { |
1279 | #ifdef irqs_disabled | 1279 | #ifdef irqs_disabled |
1280 | BUG_ON(irqs_disabled()); | 1280 | BUG_ON(irqs_disabled()); |
1281 | #endif | 1281 | #endif |
1282 | } | 1282 | } |
1283 | 1283 | ||
1284 | /* | 1284 | /* |
1285 | * The LRU management algorithm is dopey-but-simple. Sorry. | 1285 | * The LRU management algorithm is dopey-but-simple. Sorry. |
1286 | */ | 1286 | */ |
1287 | static void bh_lru_install(struct buffer_head *bh) | 1287 | static void bh_lru_install(struct buffer_head *bh) |
1288 | { | 1288 | { |
1289 | struct buffer_head *evictee = NULL; | 1289 | struct buffer_head *evictee = NULL; |
1290 | 1290 | ||
1291 | check_irqs_on(); | 1291 | check_irqs_on(); |
1292 | bh_lru_lock(); | 1292 | bh_lru_lock(); |
1293 | if (__this_cpu_read(bh_lrus.bhs[0]) != bh) { | 1293 | if (__this_cpu_read(bh_lrus.bhs[0]) != bh) { |
1294 | struct buffer_head *bhs[BH_LRU_SIZE]; | 1294 | struct buffer_head *bhs[BH_LRU_SIZE]; |
1295 | int in; | 1295 | int in; |
1296 | int out = 0; | 1296 | int out = 0; |
1297 | 1297 | ||
1298 | get_bh(bh); | 1298 | get_bh(bh); |
1299 | bhs[out++] = bh; | 1299 | bhs[out++] = bh; |
1300 | for (in = 0; in < BH_LRU_SIZE; in++) { | 1300 | for (in = 0; in < BH_LRU_SIZE; in++) { |
1301 | struct buffer_head *bh2 = | 1301 | struct buffer_head *bh2 = |
1302 | __this_cpu_read(bh_lrus.bhs[in]); | 1302 | __this_cpu_read(bh_lrus.bhs[in]); |
1303 | 1303 | ||
1304 | if (bh2 == bh) { | 1304 | if (bh2 == bh) { |
1305 | __brelse(bh2); | 1305 | __brelse(bh2); |
1306 | } else { | 1306 | } else { |
1307 | if (out >= BH_LRU_SIZE) { | 1307 | if (out >= BH_LRU_SIZE) { |
1308 | BUG_ON(evictee != NULL); | 1308 | BUG_ON(evictee != NULL); |
1309 | evictee = bh2; | 1309 | evictee = bh2; |
1310 | } else { | 1310 | } else { |
1311 | bhs[out++] = bh2; | 1311 | bhs[out++] = bh2; |
1312 | } | 1312 | } |
1313 | } | 1313 | } |
1314 | } | 1314 | } |
1315 | while (out < BH_LRU_SIZE) | 1315 | while (out < BH_LRU_SIZE) |
1316 | bhs[out++] = NULL; | 1316 | bhs[out++] = NULL; |
1317 | memcpy(__this_cpu_ptr(&bh_lrus.bhs), bhs, sizeof(bhs)); | 1317 | memcpy(__this_cpu_ptr(&bh_lrus.bhs), bhs, sizeof(bhs)); |
1318 | } | 1318 | } |
1319 | bh_lru_unlock(); | 1319 | bh_lru_unlock(); |
1320 | 1320 | ||
1321 | if (evictee) | 1321 | if (evictee) |
1322 | __brelse(evictee); | 1322 | __brelse(evictee); |
1323 | } | 1323 | } |
1324 | 1324 | ||
1325 | /* | 1325 | /* |
1326 | * Look up the bh in this cpu's LRU. If it's there, move it to the head. | 1326 | * Look up the bh in this cpu's LRU. If it's there, move it to the head. |
1327 | */ | 1327 | */ |
1328 | static struct buffer_head * | 1328 | static struct buffer_head * |
1329 | lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size) | 1329 | lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size) |
1330 | { | 1330 | { |
1331 | struct buffer_head *ret = NULL; | 1331 | struct buffer_head *ret = NULL; |
1332 | unsigned int i; | 1332 | unsigned int i; |
1333 | 1333 | ||
1334 | check_irqs_on(); | 1334 | check_irqs_on(); |
1335 | bh_lru_lock(); | 1335 | bh_lru_lock(); |
1336 | for (i = 0; i < BH_LRU_SIZE; i++) { | 1336 | for (i = 0; i < BH_LRU_SIZE; i++) { |
1337 | struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]); | 1337 | struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]); |
1338 | 1338 | ||
1339 | if (bh && bh->b_bdev == bdev && | 1339 | if (bh && bh->b_bdev == bdev && |
1340 | bh->b_blocknr == block && bh->b_size == size) { | 1340 | bh->b_blocknr == block && bh->b_size == size) { |
1341 | if (i) { | 1341 | if (i) { |
1342 | while (i) { | 1342 | while (i) { |
1343 | __this_cpu_write(bh_lrus.bhs[i], | 1343 | __this_cpu_write(bh_lrus.bhs[i], |
1344 | __this_cpu_read(bh_lrus.bhs[i - 1])); | 1344 | __this_cpu_read(bh_lrus.bhs[i - 1])); |
1345 | i--; | 1345 | i--; |
1346 | } | 1346 | } |
1347 | __this_cpu_write(bh_lrus.bhs[0], bh); | 1347 | __this_cpu_write(bh_lrus.bhs[0], bh); |
1348 | } | 1348 | } |
1349 | get_bh(bh); | 1349 | get_bh(bh); |
1350 | ret = bh; | 1350 | ret = bh; |
1351 | break; | 1351 | break; |
1352 | } | 1352 | } |
1353 | } | 1353 | } |
1354 | bh_lru_unlock(); | 1354 | bh_lru_unlock(); |
1355 | return ret; | 1355 | return ret; |
1356 | } | 1356 | } |
1357 | 1357 | ||
1358 | /* | 1358 | /* |
1359 | * Perform a pagecache lookup for the matching buffer. If it's there, refresh | 1359 | * Perform a pagecache lookup for the matching buffer. If it's there, refresh |
1360 | * it in the LRU and mark it as accessed. If it is not present then return | 1360 | * it in the LRU and mark it as accessed. If it is not present then return |
1361 | * NULL | 1361 | * NULL |
1362 | */ | 1362 | */ |
1363 | struct buffer_head * | 1363 | struct buffer_head * |
1364 | __find_get_block(struct block_device *bdev, sector_t block, unsigned size) | 1364 | __find_get_block(struct block_device *bdev, sector_t block, unsigned size) |
1365 | { | 1365 | { |
1366 | struct buffer_head *bh = lookup_bh_lru(bdev, block, size); | 1366 | struct buffer_head *bh = lookup_bh_lru(bdev, block, size); |
1367 | 1367 | ||
1368 | if (bh == NULL) { | 1368 | if (bh == NULL) { |
| | 1369 | /* __find_get_block_slow will mark the page accessed */ |
1369 | bh = __find_get_block_slow(bdev, block); | 1370 | bh = __find_get_block_slow(bdev, block); |
1370 | if (bh) | 1371 | if (bh) |
1371 | bh_lru_install(bh); | 1372 | bh_lru_install(bh); |
1372 | } | 1373 | } else |
1373 | if (bh) | | |
1374 | touch_buffer(bh); | 1374 | touch_buffer(bh); |
| | 1375 | |
1375 | return bh; | 1376 | return bh; |
1376 | } | 1377 | } |
1377 | EXPORT_SYMBOL(__find_get_block); | 1378 | EXPORT_SYMBOL(__find_get_block); |
1378 | 1379 | ||
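In other words, a lookup-only caller gets back a buffer with its reference count raised and (after this patch) already marked accessed, so all it has to do is drop the reference when finished. A minimal kernel-context sketch, with bdev and block assumed to be in scope:

    struct buffer_head *bh = __find_get_block(bdev, block, 4096);

    if (bh) {
            /* inspect bh->b_data, bh->b_state, ... */
            brelse(bh);     /* drop the reference taken by the lookup */
    }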
1379 | /* | 1380 | /* |
1380 | * __getblk will locate (and, if necessary, create) the buffer_head | 1381 | * __getblk will locate (and, if necessary, create) the buffer_head |
1381 | * which corresponds to the passed block_device, block and size. The | 1382 | * which corresponds to the passed block_device, block and size. The |
1382 | * returned buffer has its reference count incremented. | 1383 | * returned buffer has its reference count incremented. |
1383 | * | 1384 | * |
1384 | * __getblk() will lock up the machine if grow_dev_page's try_to_free_buffers() | 1385 | * __getblk() will lock up the machine if grow_dev_page's try_to_free_buffers() |
1385 | * attempt is failing. FIXME, perhaps? | 1386 | * attempt is failing. FIXME, perhaps? |
1386 | */ | 1387 | */ |
1387 | struct buffer_head * | 1388 | struct buffer_head * |
1388 | __getblk(struct block_device *bdev, sector_t block, unsigned size) | 1389 | __getblk(struct block_device *bdev, sector_t block, unsigned size) |
1389 | { | 1390 | { |
1390 | struct buffer_head *bh = __find_get_block(bdev, block, size); | 1391 | struct buffer_head *bh = __find_get_block(bdev, block, size); |
1391 | 1392 | ||
1392 | might_sleep(); | 1393 | might_sleep(); |
1393 | if (bh == NULL) | 1394 | if (bh == NULL) |
1394 | bh = __getblk_slow(bdev, block, size); | 1395 | bh = __getblk_slow(bdev, block, size); |
1395 | return bh; | 1396 | return bh; |
1396 | } | 1397 | } |
1397 | EXPORT_SYMBOL(__getblk); | 1398 | EXPORT_SYMBOL(__getblk); |
1398 | 1399 | ||
1399 | /* | 1400 | /* |
1400 | * Do async read-ahead on a buffer.. | 1401 | * Do async read-ahead on a buffer.. |
1401 | */ | 1402 | */ |
1402 | void __breadahead(struct block_device *bdev, sector_t block, unsigned size) | 1403 | void __breadahead(struct block_device *bdev, sector_t block, unsigned size) |
1403 | { | 1404 | { |
1404 | struct buffer_head *bh = __getblk(bdev, block, size); | 1405 | struct buffer_head *bh = __getblk(bdev, block, size); |
1405 | if (likely(bh)) { | 1406 | if (likely(bh)) { |
1406 | ll_rw_block(READA, 1, &bh); | 1407 | ll_rw_block(READA, 1, &bh); |
1407 | brelse(bh); | 1408 | brelse(bh); |
1408 | } | 1409 | } |
1409 | } | 1410 | } |
1410 | EXPORT_SYMBOL(__breadahead); | 1411 | EXPORT_SYMBOL(__breadahead); |
1411 | 1412 | ||
1412 | /** | 1413 | /** |
1413 | * __bread() - reads a specified block and returns the bh | 1414 | * __bread() - reads a specified block and returns the bh |
1414 | * @bdev: the block_device to read from | 1415 | * @bdev: the block_device to read from |
1415 | * @block: number of block | 1416 | * @block: number of block |
1416 | * @size: size (in bytes) to read | 1417 | * @size: size (in bytes) to read |
1417 | * | 1418 | * |
1418 | * Reads a specified block, and returns buffer head that contains it. | 1419 | * Reads a specified block, and returns buffer head that contains it. |
1419 | * It returns NULL if the block was unreadable. | 1420 | * It returns NULL if the block was unreadable. |
1420 | */ | 1421 | */ |
1421 | struct buffer_head * | 1422 | struct buffer_head * |
1422 | __bread(struct block_device *bdev, sector_t block, unsigned size) | 1423 | __bread(struct block_device *bdev, sector_t block, unsigned size) |
1423 | { | 1424 | { |
1424 | struct buffer_head *bh = __getblk(bdev, block, size); | 1425 | struct buffer_head *bh = __getblk(bdev, block, size); |
1425 | 1426 | ||
1426 | if (likely(bh) && !buffer_uptodate(bh)) | 1427 | if (likely(bh) && !buffer_uptodate(bh)) |
1427 | bh = __bread_slow(bh); | 1428 | bh = __bread_slow(bh); |
1428 | return bh; | 1429 | return bh; |
1429 | } | 1430 | } |
1430 | EXPORT_SYMBOL(__bread); | 1431 | EXPORT_SYMBOL(__bread); |
1431 | 1432 | ||
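A hedged usage sketch of the contract documented above (kernel context; dst is assumed caller storage): a NULL return means the block could not be read, anything else is uptodate data that must be released with brelse():

    struct buffer_head *bh = __bread(bdev, 0, 1024);    /* block 0, 1 KiB */

    if (!bh)
            return -EIO;
    memcpy(dst, bh->b_data, 1024);
    brelse(bh);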
1432 | /* | 1433 | /* |
1433 | * invalidate_bh_lrus() is called rarely - but not only at unmount. | 1434 | * invalidate_bh_lrus() is called rarely - but not only at unmount. |
1434 | * This doesn't race because it runs in each cpu either in irq | 1435 | * This doesn't race because it runs in each cpu either in irq |
1435 | * or with preempt disabled. | 1436 | * or with preempt disabled. |
1436 | */ | 1437 | */ |
1437 | static void invalidate_bh_lru(void *arg) | 1438 | static void invalidate_bh_lru(void *arg) |
1438 | { | 1439 | { |
1439 | struct bh_lru *b = &get_cpu_var(bh_lrus); | 1440 | struct bh_lru *b = &get_cpu_var(bh_lrus); |
1440 | int i; | 1441 | int i; |
1441 | 1442 | ||
1442 | for (i = 0; i < BH_LRU_SIZE; i++) { | 1443 | for (i = 0; i < BH_LRU_SIZE; i++) { |
1443 | brelse(b->bhs[i]); | 1444 | brelse(b->bhs[i]); |
1444 | b->bhs[i] = NULL; | 1445 | b->bhs[i] = NULL; |
1445 | } | 1446 | } |
1446 | put_cpu_var(bh_lrus); | 1447 | put_cpu_var(bh_lrus); |
1447 | } | 1448 | } |
1448 | 1449 | ||
1449 | static bool has_bh_in_lru(int cpu, void *dummy) | 1450 | static bool has_bh_in_lru(int cpu, void *dummy) |
1450 | { | 1451 | { |
1451 | struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu); | 1452 | struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu); |
1452 | int i; | 1453 | int i; |
1453 | 1454 | ||
1454 | for (i = 0; i < BH_LRU_SIZE; i++) { | 1455 | for (i = 0; i < BH_LRU_SIZE; i++) { |
1455 | if (b->bhs[i]) | 1456 | if (b->bhs[i]) |
1456 | return 1; | 1457 | return 1; |
1457 | } | 1458 | } |
1458 | 1459 | ||
1459 | return 0; | 1460 | return 0; |
1460 | } | 1461 | } |
1461 | 1462 | ||
1462 | void invalidate_bh_lrus(void) | 1463 | void invalidate_bh_lrus(void) |
1463 | { | 1464 | { |
1464 | on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1, GFP_KERNEL); | 1465 | on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1, GFP_KERNEL); |
1465 | } | 1466 | } |
1466 | EXPORT_SYMBOL_GPL(invalidate_bh_lrus); | 1467 | EXPORT_SYMBOL_GPL(invalidate_bh_lrus); |
1467 | 1468 | ||
1468 | void set_bh_page(struct buffer_head *bh, | 1469 | void set_bh_page(struct buffer_head *bh, |
1469 | struct page *page, unsigned long offset) | 1470 | struct page *page, unsigned long offset) |
1470 | { | 1471 | { |
1471 | bh->b_page = page; | 1472 | bh->b_page = page; |
1472 | BUG_ON(offset >= PAGE_SIZE); | 1473 | BUG_ON(offset >= PAGE_SIZE); |
1473 | if (PageHighMem(page)) | 1474 | if (PageHighMem(page)) |
1474 | /* | 1475 | /* |
1475 | * This catches illegal uses and preserves the offset: | 1476 | * This catches illegal uses and preserves the offset: |
1476 | */ | 1477 | */ |
1477 | bh->b_data = (char *)(0 + offset); | 1478 | bh->b_data = (char *)(0 + offset); |
1478 | else | 1479 | else |
1479 | bh->b_data = page_address(page) + offset; | 1480 | bh->b_data = page_address(page) + offset; |
1480 | } | 1481 | } |
1481 | EXPORT_SYMBOL(set_bh_page); | 1482 | EXPORT_SYMBOL(set_bh_page); |
1482 | 1483 | ||
1483 | /* | 1484 | /* |
1484 | * Called when truncating a buffer on a page completely. | 1485 | * Called when truncating a buffer on a page completely. |
1485 | */ | 1486 | */ |
1486 | 1487 | ||
1487 | /* Bits that are cleared during an invalidate */ | 1488 | /* Bits that are cleared during an invalidate */ |
1488 | #define BUFFER_FLAGS_DISCARD \ | 1489 | #define BUFFER_FLAGS_DISCARD \ |
1489 | (1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \ | 1490 | (1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \ |
1490 | 1 << BH_Delay | 1 << BH_Unwritten) | 1491 | 1 << BH_Delay | 1 << BH_Unwritten) |
1491 | 1492 | ||
1492 | static void discard_buffer(struct buffer_head * bh) | 1493 | static void discard_buffer(struct buffer_head * bh) |
1493 | { | 1494 | { |
1494 | unsigned long b_state, b_state_old; | 1495 | unsigned long b_state, b_state_old; |
1495 | 1496 | ||
1496 | lock_buffer(bh); | 1497 | lock_buffer(bh); |
1497 | clear_buffer_dirty(bh); | 1498 | clear_buffer_dirty(bh); |
1498 | bh->b_bdev = NULL; | 1499 | bh->b_bdev = NULL; |
1499 | b_state = bh->b_state; | 1500 | b_state = bh->b_state; |
1500 | for (;;) { | 1501 | for (;;) { |
1501 | b_state_old = cmpxchg(&bh->b_state, b_state, | 1502 | b_state_old = cmpxchg(&bh->b_state, b_state, |
1502 | (b_state & ~BUFFER_FLAGS_DISCARD)); | 1503 | (b_state & ~BUFFER_FLAGS_DISCARD)); |
1503 | if (b_state_old == b_state) | 1504 | if (b_state_old == b_state) |
1504 | break; | 1505 | break; |
1505 | b_state = b_state_old; | 1506 | b_state = b_state_old; |
1506 | } | 1507 | } |
1507 | unlock_buffer(bh); | 1508 | unlock_buffer(bh); |
1508 | } | 1509 | } |
1509 | 1510 | ||
1510 | /** | 1511 | /** |
1511 | * block_invalidatepage - invalidate part or all of a buffer-backed page | 1512 | * block_invalidatepage - invalidate part or all of a buffer-backed page |
1512 | * | 1513 | * |
1513 | * @page: the page which is affected | 1514 | * @page: the page which is affected |
1514 | * @offset: start of the range to invalidate | 1515 | * @offset: start of the range to invalidate |
1515 | * @length: length of the range to invalidate | 1516 | * @length: length of the range to invalidate |
1516 | * | 1517 | * |
1517 | * block_invalidatepage() is called when all or part of the page has become | 1518 | * block_invalidatepage() is called when all or part of the page has become |
1518 | * invalidated by a truncate operation. | 1519 | * invalidated by a truncate operation. |
1519 | * | 1520 | * |
1520 | * block_invalidatepage() does not have to release all buffers, but it must | 1521 | * block_invalidatepage() does not have to release all buffers, but it must |
1521 | * ensure that no dirty buffer is left outside @offset and that no I/O | 1522 | * ensure that no dirty buffer is left outside @offset and that no I/O |
1522 | * is underway against any of the blocks which are outside the truncation | 1523 | * is underway against any of the blocks which are outside the truncation |
1523 | * point. Because the caller is about to free (and possibly reuse) those | 1524 | * point. Because the caller is about to free (and possibly reuse) those |
1524 | * blocks on-disk. | 1525 | * blocks on-disk. |
1525 | */ | 1526 | */ |
1526 | void block_invalidatepage(struct page *page, unsigned int offset, | 1527 | void block_invalidatepage(struct page *page, unsigned int offset, |
1527 | unsigned int length) | 1528 | unsigned int length) |
1528 | { | 1529 | { |
1529 | struct buffer_head *head, *bh, *next; | 1530 | struct buffer_head *head, *bh, *next; |
1530 | unsigned int curr_off = 0; | 1531 | unsigned int curr_off = 0; |
1531 | unsigned int stop = length + offset; | 1532 | unsigned int stop = length + offset; |
1532 | 1533 | ||
1533 | BUG_ON(!PageLocked(page)); | 1534 | BUG_ON(!PageLocked(page)); |
1534 | if (!page_has_buffers(page)) | 1535 | if (!page_has_buffers(page)) |
1535 | goto out; | 1536 | goto out; |
1536 | 1537 | ||
1537 | /* | 1538 | /* |
1538 | * Check for overflow | 1539 | * Check for overflow |
1539 | */ | 1540 | */ |
1540 | BUG_ON(stop > PAGE_CACHE_SIZE || stop < length); | 1541 | BUG_ON(stop > PAGE_CACHE_SIZE || stop < length); |
1541 | 1542 | ||
1542 | head = page_buffers(page); | 1543 | head = page_buffers(page); |
1543 | bh = head; | 1544 | bh = head; |
1544 | do { | 1545 | do { |
1545 | unsigned int next_off = curr_off + bh->b_size; | 1546 | unsigned int next_off = curr_off + bh->b_size; |
1546 | next = bh->b_this_page; | 1547 | next = bh->b_this_page; |
1547 | 1548 | ||
1548 | /* | 1549 | /* |
1549 | * Are we still fully in range ? | 1550 | * Are we still fully in range ? |
1550 | */ | 1551 | */ |
1551 | if (next_off > stop) | 1552 | if (next_off > stop) |
1552 | goto out; | 1553 | goto out; |
1553 | 1554 | ||
1554 | /* | 1555 | /* |
1555 | * is this block fully invalidated? | 1556 | * is this block fully invalidated? |
1556 | */ | 1557 | */ |
1557 | if (offset <= curr_off) | 1558 | if (offset <= curr_off) |
1558 | discard_buffer(bh); | 1559 | discard_buffer(bh); |
1559 | curr_off = next_off; | 1560 | curr_off = next_off; |
1560 | bh = next; | 1561 | bh = next; |
1561 | } while (bh != head); | 1562 | } while (bh != head); |
1562 | 1563 | ||
1563 | /* | 1564 | /* |
1564 | * We release buffers only if the entire page is being invalidated. | 1565 | * We release buffers only if the entire page is being invalidated. |
1565 | * The get_block cached value has been unconditionally invalidated, | 1566 | * The get_block cached value has been unconditionally invalidated, |
1566 | * so real IO is not possible anymore. | 1567 | * so real IO is not possible anymore. |
1567 | */ | 1568 | */ |
1568 | if (offset == 0) | 1569 | if (offset == 0) |
1569 | try_to_release_page(page, 0); | 1570 | try_to_release_page(page, 0); |
1570 | out: | 1571 | out: |
1571 | return; | 1572 | return; |
1572 | } | 1573 | } |
1573 | EXPORT_SYMBOL(block_invalidatepage); | 1574 | EXPORT_SYMBOL(block_invalidatepage); |
1574 | 1575 | ||
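As a worked example of the range handling: with 1024-byte buffers on a 4096-byte page, block_invalidatepage(page, 1024, 3072) gives stop = 4096; the buffer at offset 0 is kept (offset 1024 > curr_off 0), while the buffers at 1024, 2048 and 3072 satisfy offset <= curr_off and are discarded. Because offset is non-zero, no attempt is made to strip the buffers from the page; only a full invalidation (offset == 0) ends with try_to_release_page().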
1575 | 1576 | ||
1576 | /* | 1577 | /* |
1577 | * We attach and possibly dirty the buffers atomically wrt | 1578 | * We attach and possibly dirty the buffers atomically wrt |
1578 | * __set_page_dirty_buffers() via private_lock. try_to_free_buffers | 1579 | * __set_page_dirty_buffers() via private_lock. try_to_free_buffers |
1579 | * is already excluded via the page lock. | 1580 | * is already excluded via the page lock. |
1580 | */ | 1581 | */ |
1581 | void create_empty_buffers(struct page *page, | 1582 | void create_empty_buffers(struct page *page, |
1582 | unsigned long blocksize, unsigned long b_state) | 1583 | unsigned long blocksize, unsigned long b_state) |
1583 | { | 1584 | { |
1584 | struct buffer_head *bh, *head, *tail; | 1585 | struct buffer_head *bh, *head, *tail; |
1585 | 1586 | ||
1586 | head = alloc_page_buffers(page, blocksize, 1); | 1587 | head = alloc_page_buffers(page, blocksize, 1); |
1587 | bh = head; | 1588 | bh = head; |
1588 | do { | 1589 | do { |
1589 | bh->b_state |= b_state; | 1590 | bh->b_state |= b_state; |
1590 | tail = bh; | 1591 | tail = bh; |
1591 | bh = bh->b_this_page; | 1592 | bh = bh->b_this_page; |
1592 | } while (bh); | 1593 | } while (bh); |
1593 | tail->b_this_page = head; | 1594 | tail->b_this_page = head; |
1594 | 1595 | ||
1595 | spin_lock(&page->mapping->private_lock); | 1596 | spin_lock(&page->mapping->private_lock); |
1596 | if (PageUptodate(page) || PageDirty(page)) { | 1597 | if (PageUptodate(page) || PageDirty(page)) { |
1597 | bh = head; | 1598 | bh = head; |
1598 | do { | 1599 | do { |
1599 | if (PageDirty(page)) | 1600 | if (PageDirty(page)) |
1600 | set_buffer_dirty(bh); | 1601 | set_buffer_dirty(bh); |
1601 | if (PageUptodate(page)) | 1602 | if (PageUptodate(page)) |
1602 | set_buffer_uptodate(bh); | 1603 | set_buffer_uptodate(bh); |
1603 | bh = bh->b_this_page; | 1604 | bh = bh->b_this_page; |
1604 | } while (bh != head); | 1605 | } while (bh != head); |
1605 | } | 1606 | } |
1606 | attach_page_buffers(page, head); | 1607 | attach_page_buffers(page, head); |
1607 | spin_unlock(&page->mapping->private_lock); | 1608 | spin_unlock(&page->mapping->private_lock); |
1608 | } | 1609 | } |
1609 | EXPORT_SYMBOL(create_empty_buffers); | 1610 | EXPORT_SYMBOL(create_empty_buffers); |
1610 | 1611 | ||
1611 | /* | 1612 | /* |
1612 | * We are taking a block for data and we don't want any output from any | 1613 | * We are taking a block for data and we don't want any output from any |
1613 | * buffer-cache aliases starting from return from that function and | 1614 | * buffer-cache aliases starting from return from that function and |
1614 | * until the moment when something will explicitly mark the buffer | 1615 | * until the moment when something will explicitly mark the buffer |
1615 | * dirty (hopefully that will not happen until we will free that block ;-) | 1616 | * dirty (hopefully that will not happen until we will free that block ;-) |
1616 | * We don't even need to mark it not-uptodate - nobody can expect | 1617 | * We don't even need to mark it not-uptodate - nobody can expect |
1617 | anything from a newly allocated buffer anyway. We used to use | 1618 | anything from a newly allocated buffer anyway. We used to use |
1618 | * unmap_buffer() for such invalidation, but that was wrong. We definitely | 1619 | * unmap_buffer() for such invalidation, but that was wrong. We definitely |
1619 | * don't want to mark the alias unmapped, for example - it would confuse | 1620 | * don't want to mark the alias unmapped, for example - it would confuse |
1620 | * anyone who might pick it with bread() afterwards... | 1621 | * anyone who might pick it with bread() afterwards... |
1621 | * | 1622 | * |
 * Also.. Note that bforget() doesn't lock the buffer. So there can
 * be writeout I/O going on against recently-freed buffers. We don't
 * wait on that I/O in bforget() - it's more efficient to wait on the I/O
 * only if we really need to. That happens here.
 */
void unmap_underlying_metadata(struct block_device *bdev, sector_t block)
{
	struct buffer_head *old_bh;

	might_sleep();

	old_bh = __find_get_block_slow(bdev, block);
	if (old_bh) {
		clear_buffer_dirty(old_bh);
		wait_on_buffer(old_bh);
		clear_buffer_req(old_bh);
		__brelse(old_bh);
	}
}
EXPORT_SYMBOL(unmap_underlying_metadata);

/*
 * Size is a power-of-two in the range 512..PAGE_SIZE,
 * and the case we care about most is PAGE_SIZE.
 *
 * So this *could* possibly be written with those
 * constraints in mind (relevant mostly if some
 * architecture has a slow bit-scan instruction)
 */
static inline int block_size_bits(unsigned int blocksize)
{
	return ilog2(blocksize);
}

static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state)
{
	BUG_ON(!PageLocked(page));

	if (!page_has_buffers(page))
		create_empty_buffers(page, 1 << ACCESS_ONCE(inode->i_blkbits), b_state);
	return page_buffers(page);
}

/*
 * NOTE! All mapped/uptodate combinations are valid:
 *
 *	Mapped	Uptodate	Meaning
 *
 *	No	No		"unknown" - must do get_block()
 *	No	Yes		"hole" - zero-filled
 *	Yes	No		"allocated" - allocated on disk, not read in
 *	Yes	Yes		"valid" - allocated and up-to-date in memory.
 *
 * "Dirty" is valid only with the last case (mapped+uptodate).
 */

/*
 * While block_write_full_page is writing back the dirty buffers under
 * the page lock, whoever dirtied the buffers may decide to clean them
 * again at any time.  We handle that by only looking at the buffer
 * state inside lock_buffer().
 *
 * If block_write_full_page() is called for regular writeback
 * (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a
 * locked buffer.  This only can happen if someone has written the buffer
 * directly, with submit_bh().  At the address_space level PageWriteback
 * prevents this contention from occurring.
 *
 * If block_write_full_page() is called with wbc->sync_mode ==
 * WB_SYNC_ALL, the writes are posted using WRITE_SYNC; this
 * causes the writes to be flagged as synchronous writes.
 */
static int __block_write_full_page(struct inode *inode, struct page *page,
			get_block_t *get_block, struct writeback_control *wbc,
			bh_end_io_t *handler)
{
	int err;
	sector_t block;
	sector_t last_block;
	struct buffer_head *bh, *head;
	unsigned int blocksize, bbits;
	int nr_underway = 0;
	int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
			WRITE_SYNC : WRITE);

	head = create_page_buffers(page, inode,
					(1 << BH_Dirty)|(1 << BH_Uptodate));

	/*
	 * Be very careful.  We have no exclusion from __set_page_dirty_buffers
	 * here, and the (potentially unmapped) buffers may become dirty at
	 * any time.  If a buffer becomes dirty here after we've inspected it
	 * then we just miss that fact, and the page stays dirty.
	 *
	 * Buffers outside i_size may be dirtied by __set_page_dirty_buffers;
	 * handle that here by just cleaning them.
	 */

	bh = head;
	blocksize = bh->b_size;
	bbits = block_size_bits(blocksize);

	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
	last_block = (i_size_read(inode) - 1) >> bbits;

	/*
	 * Get all the dirty buffers mapped to disk addresses and
	 * handle any aliases from the underlying blockdev's mapping.
	 */
	do {
		if (block > last_block) {
			/*
			 * mapped buffers outside i_size will occur, because
			 * this page can be outside i_size when there is a
			 * truncate in progress.
			 */
			/*
			 * The buffer was zeroed by block_write_full_page()
			 */
			clear_buffer_dirty(bh);
			set_buffer_uptodate(bh);
		} else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
			   buffer_dirty(bh)) {
			WARN_ON(bh->b_size != blocksize);
			err = get_block(inode, block, bh, 1);
			if (err)
				goto recover;
			clear_buffer_delay(bh);
			if (buffer_new(bh)) {
				/* blockdev mappings never come here */
				clear_buffer_new(bh);
				unmap_underlying_metadata(bh->b_bdev,
							bh->b_blocknr);
			}
		}
		bh = bh->b_this_page;
		block++;
	} while (bh != head);

	do {
		if (!buffer_mapped(bh))
			continue;
		/*
		 * If it's a fully non-blocking write attempt and we cannot
		 * lock the buffer then redirty the page.  Note that this can
		 * potentially cause a busy-wait loop from writeback threads
		 * and kswapd activity, but those code paths have their own
		 * higher-level throttling.
		 */
		if (wbc->sync_mode != WB_SYNC_NONE) {
			lock_buffer(bh);
		} else if (!trylock_buffer(bh)) {
			redirty_page_for_writepage(wbc, page);
			continue;
		}
		if (test_clear_buffer_dirty(bh)) {
			mark_buffer_async_write_endio(bh, handler);
		} else {
			unlock_buffer(bh);
		}
	} while ((bh = bh->b_this_page) != head);

	/*
	 * The page and its buffers are protected by PageWriteback(), so we can
	 * drop the bh refcounts early.
	 */
	BUG_ON(PageWriteback(page));
	set_page_writeback(page);

	do {
		struct buffer_head *next = bh->b_this_page;
		if (buffer_async_write(bh)) {
			submit_bh(write_op, bh);
			nr_underway++;
		}
		bh = next;
	} while (bh != head);
	unlock_page(page);

	err = 0;
done:
	if (nr_underway == 0) {
		/*
		 * The page was marked dirty, but the buffers were
		 * clean.  Someone wrote them back by hand with
		 * ll_rw_block/submit_bh.  A rare case.
		 */
		end_page_writeback(page);

		/*
		 * The page and buffer_heads can be released at any time from
		 * here on.
		 */
	}
	return err;

recover:
	/*
	 * ENOSPC, or some other error.  We may already have added some
	 * blocks to the file, so we need to write these out to avoid
	 * exposing stale data.
	 * The page is currently locked and not marked for writeback
	 */
	bh = head;
	/* Recovery: lock and submit the mapped buffers */
	do {
		if (buffer_mapped(bh) && buffer_dirty(bh) &&
		    !buffer_delay(bh)) {
			lock_buffer(bh);
			mark_buffer_async_write_endio(bh, handler);
		} else {
			/*
			 * The buffer may have been set dirty during
			 * attachment to a dirty page.
			 */
			clear_buffer_dirty(bh);
		}
	} while ((bh = bh->b_this_page) != head);
	SetPageError(page);
	BUG_ON(PageWriteback(page));
	mapping_set_error(page->mapping, err);
	set_page_writeback(page);
	do {
		struct buffer_head *next = bh->b_this_page;
		if (buffer_async_write(bh)) {
			clear_buffer_dirty(bh);
			submit_bh(write_op, bh);
			nr_underway++;
		}
		bh = next;
	} while (bh != head);
	unlock_page(page);
	goto done;
}

/*
 * If a page has any new buffers, zero them out here, and mark them uptodate
 * and dirty so they'll be written out (in order to prevent uninitialised
 * block data from leaking). And clear the new bit.
 */
void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
{
	unsigned int block_start, block_end;
	struct buffer_head *head, *bh;

	BUG_ON(!PageLocked(page));
	if (!page_has_buffers(page))
		return;

	bh = head = page_buffers(page);
	block_start = 0;
	do {
		block_end = block_start + bh->b_size;

		if (buffer_new(bh)) {
			if (block_end > from && block_start < to) {
				if (!PageUptodate(page)) {
					unsigned start, size;

					start = max(from, block_start);
					size = min(to, block_end) - start;

					zero_user(page, start, size);
					set_buffer_uptodate(bh);
				}

				clear_buffer_new(bh);
				mark_buffer_dirty(bh);
			}
		}

		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);
}
EXPORT_SYMBOL(page_zero_new_buffers);

int __block_write_begin(struct page *page, loff_t pos, unsigned len,
		get_block_t *get_block)
{
	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
	unsigned to = from + len;
	struct inode *inode = page->mapping->host;
	unsigned block_start, block_end;
	sector_t block;
	int err = 0;
	unsigned blocksize, bbits;
	struct buffer_head *bh, *head, *wait[2], **wait_bh = wait;

	BUG_ON(!PageLocked(page));
	BUG_ON(from > PAGE_CACHE_SIZE);
	BUG_ON(to > PAGE_CACHE_SIZE);
	BUG_ON(from > to);

	head = create_page_buffers(page, inode, 0);
	blocksize = head->b_size;
	bbits = block_size_bits(blocksize);

	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);

	for (bh = head, block_start = 0; bh != head || !block_start;
	    block++, block_start = block_end, bh = bh->b_this_page) {
		block_end = block_start + blocksize;
		if (block_end <= from || block_start >= to) {
			if (PageUptodate(page)) {
				if (!buffer_uptodate(bh))
					set_buffer_uptodate(bh);
			}
			continue;
		}
		if (buffer_new(bh))
			clear_buffer_new(bh);
		if (!buffer_mapped(bh)) {
			WARN_ON(bh->b_size != blocksize);
			err = get_block(inode, block, bh, 1);
			if (err)
				break;
			if (buffer_new(bh)) {
				unmap_underlying_metadata(bh->b_bdev,
							bh->b_blocknr);
				if (PageUptodate(page)) {
					clear_buffer_new(bh);
					set_buffer_uptodate(bh);
					mark_buffer_dirty(bh);
					continue;
				}
				if (block_end > to || block_start < from)
					zero_user_segments(page,
						to, block_end,
						block_start, from);
				continue;
			}
		}
		if (PageUptodate(page)) {
			if (!buffer_uptodate(bh))
				set_buffer_uptodate(bh);
			continue;
		}
		if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
		    !buffer_unwritten(bh) &&
		     (block_start < from || block_end > to)) {
			ll_rw_block(READ, 1, &bh);
			*wait_bh++ = bh;
		}
	}
	/*
	 * If we issued read requests - let them complete.
	 */
	while (wait_bh > wait) {
		wait_on_buffer(*--wait_bh);
		if (!buffer_uptodate(*wait_bh))
			err = -EIO;
	}
	if (unlikely(err))
		page_zero_new_buffers(page, from, to);
	return err;
}
EXPORT_SYMBOL(__block_write_begin);

static int __block_commit_write(struct inode *inode, struct page *page,
		unsigned from, unsigned to)
{
	unsigned block_start, block_end;
	int partial = 0;
	unsigned blocksize;
	struct buffer_head *bh, *head;

	bh = head = page_buffers(page);
	blocksize = bh->b_size;

	block_start = 0;
	do {
		block_end = block_start + blocksize;
		if (block_end <= from || block_start >= to) {
			if (!buffer_uptodate(bh))
				partial = 1;
		} else {
			set_buffer_uptodate(bh);
			mark_buffer_dirty(bh);
		}
		clear_buffer_new(bh);

		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);

	/*
	 * If this is a partial write which happened to make all buffers
	 * uptodate then we can optimize away a bogus readpage() for
	 * the next read(). Here we 'discover' whether the page went
	 * uptodate as a result of this (potentially partial) write.
	 */
	if (!partial)
		SetPageUptodate(page);
	return 0;
}

/*
 * block_write_begin takes care of the basic task of block allocation and
 * bringing partial write blocks uptodate first.
 *
 * The filesystem needs to handle block truncation upon failure.
 */
int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
		unsigned flags, struct page **pagep, get_block_t *get_block)
{
	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
	struct page *page;
	int status;

	page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;

	status = __block_write_begin(page, pos, len, get_block);
	if (unlikely(status)) {
		unlock_page(page);
		page_cache_release(page);
		page = NULL;
	}

	*pagep = page;
	return status;
}
EXPORT_SYMBOL(block_write_begin);
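
/*
 * Illustrative sketch (not part of this file): a filesystem's ->write_begin
 * usually just wraps block_write_begin() with its own get_block callback and,
 * as the comment above requires, truncates any blocks instantiated beyond
 * i_size if the call fails.  myfs_get_block and myfs_write_failed are
 * hypothetical placeholders.
 */
static int myfs_write_begin(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata)
{
	int ret;

	ret = block_write_begin(mapping, pos, len, flags, pagep,
				myfs_get_block);
	if (ret < 0)
		myfs_write_failed(mapping, pos + len); /* drop blocks past i_size */
	return ret;
}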

int block_write_end(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned copied,
			struct page *page, void *fsdata)
{
	struct inode *inode = mapping->host;
	unsigned start;

	start = pos & (PAGE_CACHE_SIZE - 1);

	if (unlikely(copied < len)) {
		/*
		 * The buffers that were written will now be uptodate, so we
		 * don't have to worry about a readpage reading them and
		 * overwriting a partial write. However if we have encountered
		 * a short write and only partially written into a buffer, it
		 * will not be marked uptodate, so a readpage might come in and
		 * destroy our partial write.
		 *
		 * Do the simplest thing, and just treat any short write to a
		 * non uptodate page as a zero-length write, and force the
		 * caller to redo the whole thing.
		 */
		if (!PageUptodate(page))
			copied = 0;

		page_zero_new_buffers(page, start+copied, start+len);
	}
	flush_dcache_page(page);

	/* This could be a short (even 0-length) commit */
	__block_commit_write(inode, page, start, start+copied);

	return copied;
}
EXPORT_SYMBOL(block_write_end);

int generic_write_end(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned copied,
			struct page *page, void *fsdata)
{
	struct inode *inode = mapping->host;
	int i_size_changed = 0;

	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);

	/*
	 * No need to use i_size_read() here, the i_size
	 * cannot change under us because we hold i_mutex.
	 *
	 * But it's important to update i_size while still holding page lock:
	 * page writeout could otherwise come in and zero beyond i_size.
	 */
	if (pos+copied > inode->i_size) {
		i_size_write(inode, pos+copied);
		i_size_changed = 1;
	}

	unlock_page(page);
	page_cache_release(page);

	/*
	 * Don't mark the inode dirty under page lock. First, it unnecessarily
	 * makes the holding time of page lock longer. Second, it forces lock
	 * ordering of page lock and transaction start for journaling
	 * filesystems.
	 */
	if (i_size_changed)
		mark_inode_dirty(inode);

	return copied;
}
EXPORT_SYMBOL(generic_write_end);
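
/*
 * Illustrative sketch (not part of this file): with the helpers above, a
 * simple block-based filesystem can usually point ->write_end straight at
 * generic_write_end() and only provide ->write_begin itself.  The myfs_*
 * callbacks are hypothetical placeholders.
 */
static const struct address_space_operations myfs_aops = {
	.readpage	= myfs_readpage,
	.writepage	= myfs_writepage,
	.write_begin	= myfs_write_begin,
	.write_end	= generic_write_end,
};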

/*
 * block_is_partially_uptodate checks whether buffers within a page are
 * uptodate or not.
 *
 * Returns true if all buffers which correspond to a file portion
 * we want to read are uptodate.
 */
int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
					unsigned long from)
{
	unsigned block_start, block_end, blocksize;
	unsigned to;
	struct buffer_head *bh, *head;
	int ret = 1;

	if (!page_has_buffers(page))
		return 0;

	head = page_buffers(page);
	blocksize = head->b_size;
	to = min_t(unsigned, PAGE_CACHE_SIZE - from, desc->count);
	to = from + to;
	if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
		return 0;

	bh = head;
	block_start = 0;
	do {
		block_end = block_start + blocksize;
		if (block_end > from && block_start < to) {
			if (!buffer_uptodate(bh)) {
				ret = 0;
				break;
			}
			if (block_end >= to)
				break;
		}
		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);

	return ret;
}
EXPORT_SYMBOL(block_is_partially_uptodate);

/*
 * Generic "read page" function for block devices that have the normal
 * get_block functionality. This is most of the block device filesystems.
 * Reads the page asynchronously --- the unlock_buffer() and
 * set/clear_buffer_uptodate() functions propagate buffer state into the
 * page struct once IO has completed.
 */
int block_read_full_page(struct page *page, get_block_t *get_block)
{
	struct inode *inode = page->mapping->host;
	sector_t iblock, lblock;
	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
	unsigned int blocksize, bbits;
	int nr, i;
	int fully_mapped = 1;

	head = create_page_buffers(page, inode, 0);
	blocksize = head->b_size;
	bbits = block_size_bits(blocksize);

	iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
	lblock = (i_size_read(inode)+blocksize-1) >> bbits;
	bh = head;
	nr = 0;
	i = 0;

	do {
		if (buffer_uptodate(bh))
			continue;

		if (!buffer_mapped(bh)) {
			int err = 0;

			fully_mapped = 0;
			if (iblock < lblock) {
				WARN_ON(bh->b_size != blocksize);
				err = get_block(inode, iblock, bh, 0);
				if (err)
					SetPageError(page);
			}
			if (!buffer_mapped(bh)) {
				zero_user(page, i * blocksize, blocksize);
				if (!err)
					set_buffer_uptodate(bh);
				continue;
			}
			/*
			 * get_block() might have updated the buffer
			 * synchronously
			 */
			if (buffer_uptodate(bh))
				continue;
		}
		arr[nr++] = bh;
	} while (i++, iblock++, (bh = bh->b_this_page) != head);

	if (fully_mapped)
		SetPageMappedToDisk(page);

	if (!nr) {
		/*
		 * All buffers are uptodate - we can set the page uptodate
		 * as well. But not if get_block() returned an error.
		 */
		if (!PageError(page))
			SetPageUptodate(page);
		unlock_page(page);
		return 0;
	}

	/* Stage two: lock the buffers */
	for (i = 0; i < nr; i++) {
		bh = arr[i];
		lock_buffer(bh);
		mark_buffer_async_read(bh);
	}

	/*
	 * Stage 3: start the IO.  Check for uptodateness
	 * inside the buffer lock in case another process reading
	 * the underlying blockdev brought it uptodate (the sct fix).
	 */
	for (i = 0; i < nr; i++) {
		bh = arr[i];
		if (buffer_uptodate(bh))
			end_buffer_async_read(bh, 1);
		else
			submit_bh(READ, bh);
	}
	return 0;
}
EXPORT_SYMBOL(block_read_full_page);
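
/*
 * Illustrative sketch (not part of this file): a ->readpage built on
 * block_read_full_page() is normally a one-liner; myfs_get_block is a
 * hypothetical get_block_t callback.
 */
static int myfs_readpage(struct file *file, struct page *page)
{
	return block_read_full_page(page, myfs_get_block);
}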

/* utility function for filesystems that need to do work on expanding
 * truncates.  Uses filesystem pagecache writes to allow the filesystem to
 * deal with the hole.
 */
int generic_cont_expand_simple(struct inode *inode, loff_t size)
{
	struct address_space *mapping = inode->i_mapping;
	struct page *page;
	void *fsdata;
	int err;

	err = inode_newsize_ok(inode, size);
	if (err)
		goto out;

	err = pagecache_write_begin(NULL, mapping, size, 0,
				AOP_FLAG_UNINTERRUPTIBLE|AOP_FLAG_CONT_EXPAND,
				&page, &fsdata);
	if (err)
		goto out;

	err = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata);
	BUG_ON(err > 0);

out:
	return err;
}
EXPORT_SYMBOL(generic_cont_expand_simple);

static int cont_expand_zero(struct file *file, struct address_space *mapping,
			loff_t pos, loff_t *bytes)
{
	struct inode *inode = mapping->host;
	unsigned blocksize = 1 << inode->i_blkbits;
	struct page *page;
	void *fsdata;
	pgoff_t index, curidx;
	loff_t curpos;
	unsigned zerofrom, offset, len;
	int err = 0;

	index = pos >> PAGE_CACHE_SHIFT;
	offset = pos & ~PAGE_CACHE_MASK;

	while (index > (curidx = (curpos = *bytes)>>PAGE_CACHE_SHIFT)) {
		zerofrom = curpos & ~PAGE_CACHE_MASK;
		if (zerofrom & (blocksize-1)) {
			*bytes |= (blocksize-1);
			(*bytes)++;
		}
		len = PAGE_CACHE_SIZE - zerofrom;

		err = pagecache_write_begin(file, mapping, curpos, len,
						AOP_FLAG_UNINTERRUPTIBLE,
						&page, &fsdata);
		if (err)
			goto out;
		zero_user(page, zerofrom, len);
		err = pagecache_write_end(file, mapping, curpos, len, len,
						page, fsdata);
		if (err < 0)
			goto out;
		BUG_ON(err != len);
		err = 0;

		balance_dirty_pages_ratelimited(mapping);
	}

	/* page covers the boundary, find the boundary offset */
	if (index == curidx) {
		zerofrom = curpos & ~PAGE_CACHE_MASK;
		/* if we will expand the thing last block will be filled */
		if (offset <= zerofrom) {
			goto out;
		}
		if (zerofrom & (blocksize-1)) {
			*bytes |= (blocksize-1);
			(*bytes)++;
		}
		len = offset - zerofrom;

		err = pagecache_write_begin(file, mapping, curpos, len,
						AOP_FLAG_UNINTERRUPTIBLE,
						&page, &fsdata);
		if (err)
			goto out;
		zero_user(page, zerofrom, len);
		err = pagecache_write_end(file, mapping, curpos, len, len,
						page, fsdata);
		if (err < 0)
			goto out;
		BUG_ON(err != len);
		err = 0;
	}
out:
	return err;
}

/*
 * For moronic filesystems that do not allow holes in file.
 * We may have to extend the file.
 */
int cont_write_begin(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata,
			get_block_t *get_block, loff_t *bytes)
{
	struct inode *inode = mapping->host;
	unsigned blocksize = 1 << inode->i_blkbits;
	unsigned zerofrom;
	int err;

	err = cont_expand_zero(file, mapping, pos, bytes);
	if (err)
		return err;

	zerofrom = *bytes & ~PAGE_CACHE_MASK;
	if (pos+len > *bytes && zerofrom & (blocksize-1)) {
		*bytes |= (blocksize-1);
		(*bytes)++;
	}

	return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}
EXPORT_SYMBOL(cont_write_begin);
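
/*
 * Illustrative sketch (not part of this file): a filesystem that cannot
 * represent holes would call cont_write_begin() instead of
 * block_write_begin() from its ->write_begin, passing a per-inode cursor
 * that records how far the file has been zero-filled so far.  myfs_i() and
 * its zeroed_up_to field are hypothetical placeholders for such a cursor.
 */
static int myfs_cont_write_begin(struct file *file,
			struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata)
{
	*pagep = NULL;
	return cont_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
				myfs_get_block,
				&myfs_i(mapping->host)->zeroed_up_to);
}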

int block_commit_write(struct page *page, unsigned from, unsigned to)
{
	struct inode *inode = page->mapping->host;
	__block_commit_write(inode, page, from, to);
	return 0;
}
EXPORT_SYMBOL(block_commit_write);

/*
 * block_page_mkwrite() is not allowed to change the file size as it gets
 * called from a page fault handler when a page is first dirtied. Hence we must
 * be careful to check for EOF conditions here. We set the page up correctly
 * for a written page which means we get ENOSPC checking when writing into
 * holes and correct delalloc and unwritten extent mapping on filesystems that
 * support these features.
 *
 * We are not allowed to take the i_mutex here so we have to play games to
 * protect against truncate races as the page could now be beyond EOF.  Because
 * truncate writes the inode size before removing pages, once we have the
 * page lock we can determine safely if the page is beyond EOF. If it is not
 * beyond EOF, then the page is guaranteed safe against truncation until we
 * unlock the page.
 *
 * Direct callers of this function should protect against filesystem freezing
 * using sb_start_write() - sb_end_write() functions.
 */
int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
			 get_block_t get_block)
{
	struct page *page = vmf->page;
	struct inode *inode = file_inode(vma->vm_file);
	unsigned long end;
	loff_t size;
	int ret;

	lock_page(page);
	size = i_size_read(inode);
	if ((page->mapping != inode->i_mapping) ||
	    (page_offset(page) > size)) {
		/* We overload EFAULT to mean page got truncated */
		ret = -EFAULT;
		goto out_unlock;
	}

	/* page is wholly or partially inside EOF */
	if (((page->index + 1) << PAGE_CACHE_SHIFT) > size)
		end = size & ~PAGE_CACHE_MASK;
	else
		end = PAGE_CACHE_SIZE;

	ret = __block_write_begin(page, 0, end, get_block);
	if (!ret)
		ret = block_commit_write(page, 0, end);

	if (unlikely(ret < 0))
		goto out_unlock;
	set_page_dirty(page);
	wait_for_stable_page(page);
	return 0;
out_unlock:
	unlock_page(page);
	return ret;
}
EXPORT_SYMBOL(__block_page_mkwrite);

int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
		   get_block_t get_block)
{
	int ret;
	struct super_block *sb = file_inode(vma->vm_file)->i_sb;

	sb_start_pagefault(sb);

	/*
	 * Update file times before taking page lock. We may end up failing the
	 * fault so this update may be superfluous but who really cares...
	 */
	file_update_time(vma->vm_file);

	ret = __block_page_mkwrite(vma, vmf, get_block);
	sb_end_pagefault(sb);
	return block_page_mkwrite_return(ret);
}
EXPORT_SYMBOL(block_page_mkwrite);
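
/*
 * Illustrative sketch (not part of this file): a filesystem hooks this up by
 * wrapping block_page_mkwrite() in the ->page_mkwrite handler of its file
 * vm_operations_struct.  myfs_get_block and myfs_file_vm_ops are hypothetical
 * placeholders.
 */
static int myfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	return block_page_mkwrite(vma, vmf, myfs_get_block);
}

static const struct vm_operations_struct myfs_file_vm_ops = {
	.fault		= filemap_fault,
	.page_mkwrite	= myfs_page_mkwrite,
};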

/*
 * nobh_write_begin()'s prereads are special: the buffer_heads are freed
 * immediately, while under the page lock.  So it needs a special end_io
 * handler which does not touch the bh after unlocking it.
 */
static void end_buffer_read_nobh(struct buffer_head *bh, int uptodate)
{
	__end_buffer_read_notouch(bh, uptodate);
}

/*
 * Attach the singly-linked list of buffers created by nobh_write_begin, to
 * the page (converting it to circular linked list and taking care of page
 * dirty races).
 */
static void attach_nobh_buffers(struct page *page, struct buffer_head *head)
{
	struct buffer_head *bh;

	BUG_ON(!PageLocked(page));

	spin_lock(&page->mapping->private_lock);
	bh = head;
	do {
		if (PageDirty(page))
			set_buffer_dirty(bh);
		if (!bh->b_this_page)
			bh->b_this_page = head;
		bh = bh->b_this_page;
	} while (bh != head);
	attach_page_buffers(page, head);
	spin_unlock(&page->mapping->private_lock);
}

/*
 * On entry, the page is fully not uptodate.
 * On exit the page is fully uptodate in the areas outside (from,to)
 * The filesystem needs to handle block truncation upon failure.
 */
int nobh_write_begin(struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata,
			get_block_t *get_block)
{
	struct inode *inode = mapping->host;
	const unsigned blkbits = inode->i_blkbits;
	const unsigned blocksize = 1 << blkbits;
	struct buffer_head *head, *bh;
	struct page *page;
	pgoff_t index;
	unsigned from, to;
	unsigned block_in_page;
	unsigned block_start, block_end;
	sector_t block_in_file;
	int nr_reads = 0;
	int ret = 0;
	int is_mapped_to_disk = 1;

	index = pos >> PAGE_CACHE_SHIFT;
	from = pos & (PAGE_CACHE_SIZE - 1);
	to = from + len;

	page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;
	*pagep = page;
	*fsdata = NULL;

	if (page_has_buffers(page)) {
		ret = __block_write_begin(page, pos, len, get_block);
		if (unlikely(ret))
			goto out_release;
		return ret;
	}

	if (PageMappedToDisk(page))
		return 0;

	/*
	 * Allocate buffers so that we can keep track of state, and potentially
	 * attach them to the page if an error occurs. In the common case of
	 * no error, they will just be freed again without ever being attached
	 * to the page (which is all OK, because we're under the page lock).
	 *
	 * Be careful: the buffer linked list is a NULL terminated one, rather
	 * than the circular one we're used to.
	 */
	head = alloc_page_buffers(page, blocksize, 0);
	if (!head) {
		ret = -ENOMEM;
		goto out_release;
	}

	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);

	/*
	 * We loop across all blocks in the page, whether or not they are
	 * part of the affected region.  This is so we can discover if the
	 * page is fully mapped-to-disk.
2569 | */ | 2570 | */ |
2570 | for (block_start = 0, block_in_page = 0, bh = head; | 2571 | for (block_start = 0, block_in_page = 0, bh = head; |
2571 | block_start < PAGE_CACHE_SIZE; | 2572 | block_start < PAGE_CACHE_SIZE; |
2572 | block_in_page++, block_start += blocksize, bh = bh->b_this_page) { | 2573 | block_in_page++, block_start += blocksize, bh = bh->b_this_page) { |
2573 | int create; | 2574 | int create; |
2574 | 2575 | ||
2575 | block_end = block_start + blocksize; | 2576 | block_end = block_start + blocksize; |
2576 | bh->b_state = 0; | 2577 | bh->b_state = 0; |
2577 | create = 1; | 2578 | create = 1; |
2578 | if (block_start >= to) | 2579 | if (block_start >= to) |
2579 | create = 0; | 2580 | create = 0; |
2580 | ret = get_block(inode, block_in_file + block_in_page, | 2581 | ret = get_block(inode, block_in_file + block_in_page, |
2581 | bh, create); | 2582 | bh, create); |
2582 | if (ret) | 2583 | if (ret) |
2583 | goto failed; | 2584 | goto failed; |
2584 | if (!buffer_mapped(bh)) | 2585 | if (!buffer_mapped(bh)) |
2585 | is_mapped_to_disk = 0; | 2586 | is_mapped_to_disk = 0; |
2586 | if (buffer_new(bh)) | 2587 | if (buffer_new(bh)) |
2587 | unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr); | 2588 | unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr); |
2588 | if (PageUptodate(page)) { | 2589 | if (PageUptodate(page)) { |
2589 | set_buffer_uptodate(bh); | 2590 | set_buffer_uptodate(bh); |
2590 | continue; | 2591 | continue; |
2591 | } | 2592 | } |
2592 | if (buffer_new(bh) || !buffer_mapped(bh)) { | 2593 | if (buffer_new(bh) || !buffer_mapped(bh)) { |
2593 | zero_user_segments(page, block_start, from, | 2594 | zero_user_segments(page, block_start, from, |
2594 | to, block_end); | 2595 | to, block_end); |
2595 | continue; | 2596 | continue; |
2596 | } | 2597 | } |
2597 | if (buffer_uptodate(bh)) | 2598 | if (buffer_uptodate(bh)) |
2598 | continue; /* reiserfs does this */ | 2599 | continue; /* reiserfs does this */ |
2599 | if (block_start < from || block_end > to) { | 2600 | if (block_start < from || block_end > to) { |
2600 | lock_buffer(bh); | 2601 | lock_buffer(bh); |
2601 | bh->b_end_io = end_buffer_read_nobh; | 2602 | bh->b_end_io = end_buffer_read_nobh; |
2602 | submit_bh(READ, bh); | 2603 | submit_bh(READ, bh); |
2603 | nr_reads++; | 2604 | nr_reads++; |
2604 | } | 2605 | } |
2605 | } | 2606 | } |
2606 | 2607 | ||
2607 | if (nr_reads) { | 2608 | if (nr_reads) { |
2608 | /* | 2609 | /* |
2609 | * The page is locked, so these buffers are protected from | 2610 | * The page is locked, so these buffers are protected from |
2610 | * any VM or truncate activity. Hence we don't need to care | 2611 | * any VM or truncate activity. Hence we don't need to care |
2611 | * for the buffer_head refcounts. | 2612 | * for the buffer_head refcounts. |
2612 | */ | 2613 | */ |
2613 | for (bh = head; bh; bh = bh->b_this_page) { | 2614 | for (bh = head; bh; bh = bh->b_this_page) { |
2614 | wait_on_buffer(bh); | 2615 | wait_on_buffer(bh); |
2615 | if (!buffer_uptodate(bh)) | 2616 | if (!buffer_uptodate(bh)) |
2616 | ret = -EIO; | 2617 | ret = -EIO; |
2617 | } | 2618 | } |
2618 | if (ret) | 2619 | if (ret) |
2619 | goto failed; | 2620 | goto failed; |
2620 | } | 2621 | } |
2621 | 2622 | ||
2622 | if (is_mapped_to_disk) | 2623 | if (is_mapped_to_disk) |
2623 | SetPageMappedToDisk(page); | 2624 | SetPageMappedToDisk(page); |
2624 | 2625 | ||
2625 | *fsdata = head; /* to be released by nobh_write_end */ | 2626 | *fsdata = head; /* to be released by nobh_write_end */ |
2626 | 2627 | ||
2627 | return 0; | 2628 | return 0; |
2628 | 2629 | ||
2629 | failed: | 2630 | failed: |
2630 | BUG_ON(!ret); | 2631 | BUG_ON(!ret); |
2631 | /* | 2632 | /* |
2632 | * Error recovery is a bit difficult. We need to zero out blocks that | 2633 | * Error recovery is a bit difficult. We need to zero out blocks that |
2633 | * were newly allocated, and dirty them to ensure they get written out. | 2634 | * were newly allocated, and dirty them to ensure they get written out. |
2634 | * Buffers need to be attached to the page at this point, otherwise | 2635 | * Buffers need to be attached to the page at this point, otherwise |
2635 | * the handling of potential IO errors during writeout would be hard | 2636 | * the handling of potential IO errors during writeout would be hard |
2636 | * (could try doing synchronous writeout, but what if that fails too?) | 2637 | * (could try doing synchronous writeout, but what if that fails too?) |
2637 | */ | 2638 | */ |
2638 | attach_nobh_buffers(page, head); | 2639 | attach_nobh_buffers(page, head); |
2639 | page_zero_new_buffers(page, from, to); | 2640 | page_zero_new_buffers(page, from, to); |
2640 | 2641 | ||
2641 | out_release: | 2642 | out_release: |
2642 | unlock_page(page); | 2643 | unlock_page(page); |
2643 | page_cache_release(page); | 2644 | page_cache_release(page); |
2644 | *pagep = NULL; | 2645 | *pagep = NULL; |
2645 | 2646 | ||
2646 | return ret; | 2647 | return ret; |
2647 | } | 2648 | } |
2648 | EXPORT_SYMBOL(nobh_write_begin); | 2649 | EXPORT_SYMBOL(nobh_write_begin); |
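For orientation only (this note and the sketch below are editorial additions, not part of the diff): a filesystem that opts into the nobh path typically wires nobh_write_begin() straight into its ->write_begin address_space operation and supplies its own get_block callback. The myfs_* names and the toy 1:1 block mapping are hypothetical stand-ins; a real filesystem would consult its allocation metadata in get_block and, as the comment above requires, trim over-allocated blocks on failure.

#include <linux/fs.h>
#include <linux/buffer_head.h>

/* Hypothetical block-mapping callback: a toy 1:1 mapping, purely for
 * illustration.  A real filesystem resolves iblock against its own
 * metadata and may allocate a block when 'create' is set. */
static int myfs_get_block(struct inode *inode, sector_t iblock,
                          struct buffer_head *bh_result, int create)
{
        map_bh(bh_result, inode->i_sb, iblock);
        return 0;
}

/* ->write_begin for an ext2-style "nobh" configuration. */
static int myfs_write_begin(struct file *file, struct address_space *mapping,
                            loff_t pos, unsigned len, unsigned flags,
                            struct page **pagep, void **fsdata)
{
        int ret;

        ret = nobh_write_begin(mapping, pos, len, flags, pagep, fsdata,
                               myfs_get_block);
        /*
         * Per the comment above nobh_write_begin(): on failure the
         * filesystem must itself truncate any blocks instantiated
         * beyond the old i_size (omitted in this sketch).
         */
        return ret;
}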
2649 | 2650 | ||
2650 | int nobh_write_end(struct file *file, struct address_space *mapping, | 2651 | int nobh_write_end(struct file *file, struct address_space *mapping, |
2651 | loff_t pos, unsigned len, unsigned copied, | 2652 | loff_t pos, unsigned len, unsigned copied, |
2652 | struct page *page, void *fsdata) | 2653 | struct page *page, void *fsdata) |
2653 | { | 2654 | { |
2654 | struct inode *inode = page->mapping->host; | 2655 | struct inode *inode = page->mapping->host; |
2655 | struct buffer_head *head = fsdata; | 2656 | struct buffer_head *head = fsdata; |
2656 | struct buffer_head *bh; | 2657 | struct buffer_head *bh; |
2657 | BUG_ON(fsdata != NULL && page_has_buffers(page)); | 2658 | BUG_ON(fsdata != NULL && page_has_buffers(page)); |
2658 | 2659 | ||
2659 | if (unlikely(copied < len) && head) | 2660 | if (unlikely(copied < len) && head) |
2660 | attach_nobh_buffers(page, head); | 2661 | attach_nobh_buffers(page, head); |
2661 | if (page_has_buffers(page)) | 2662 | if (page_has_buffers(page)) |
2662 | return generic_write_end(file, mapping, pos, len, | 2663 | return generic_write_end(file, mapping, pos, len, |
2663 | copied, page, fsdata); | 2664 | copied, page, fsdata); |
2664 | 2665 | ||
2665 | SetPageUptodate(page); | 2666 | SetPageUptodate(page); |
2666 | set_page_dirty(page); | 2667 | set_page_dirty(page); |
2667 | if (pos+copied > inode->i_size) { | 2668 | if (pos+copied > inode->i_size) { |
2668 | i_size_write(inode, pos+copied); | 2669 | i_size_write(inode, pos+copied); |
2669 | mark_inode_dirty(inode); | 2670 | mark_inode_dirty(inode); |
2670 | } | 2671 | } |
2671 | 2672 | ||
2672 | unlock_page(page); | 2673 | unlock_page(page); |
2673 | page_cache_release(page); | 2674 | page_cache_release(page); |
2674 | 2675 | ||
2675 | while (head) { | 2676 | while (head) { |
2676 | bh = head; | 2677 | bh = head; |
2677 | head = head->b_this_page; | 2678 | head = head->b_this_page; |
2678 | free_buffer_head(bh); | 2679 | free_buffer_head(bh); |
2679 | } | 2680 | } |
2680 | 2681 | ||
2681 | return copied; | 2682 | return copied; |
2682 | } | 2683 | } |
2683 | EXPORT_SYMBOL(nobh_write_end); | 2684 | EXPORT_SYMBOL(nobh_write_end); |
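The matching ->write_end side is usually just a pass-through, since nobh_write_end() itself handles the short-copy case, the i_size update and the release of the buffer list carried in fsdata. Again a hypothetical sketch, reusing the myfs_* names from the previous example.

static int myfs_write_end(struct file *file, struct address_space *mapping,
                          loff_t pos, unsigned len, unsigned copied,
                          struct page *page, void *fsdata)
{
        /* fsdata is the buffer_head list allocated by nobh_write_begin();
         * nobh_write_end() attaches or frees it as appropriate. */
        return nobh_write_end(file, mapping, pos, len, copied, page, fsdata);
}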
2684 | 2685 | ||
2685 | /* | 2686 | /* |
2686 | * nobh_writepage() - based on block_write_full_page() except | 2687 | * nobh_writepage() - based on block_write_full_page() except |
2687 | * that it tries to operate without attaching bufferheads to | 2688 | * that it tries to operate without attaching bufferheads to |
2688 | * the page. | 2689 | * the page. |
2689 | */ | 2690 | */ |
2690 | int nobh_writepage(struct page *page, get_block_t *get_block, | 2691 | int nobh_writepage(struct page *page, get_block_t *get_block, |
2691 | struct writeback_control *wbc) | 2692 | struct writeback_control *wbc) |
2692 | { | 2693 | { |
2693 | struct inode * const inode = page->mapping->host; | 2694 | struct inode * const inode = page->mapping->host; |
2694 | loff_t i_size = i_size_read(inode); | 2695 | loff_t i_size = i_size_read(inode); |
2695 | const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; | 2696 | const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; |
2696 | unsigned offset; | 2697 | unsigned offset; |
2697 | int ret; | 2698 | int ret; |
2698 | 2699 | ||
2699 | /* Is the page fully inside i_size? */ | 2700 | /* Is the page fully inside i_size? */ |
2700 | if (page->index < end_index) | 2701 | if (page->index < end_index) |
2701 | goto out; | 2702 | goto out; |
2702 | 2703 | ||
2703 | /* Is the page fully outside i_size? (truncate in progress) */ | 2704 | /* Is the page fully outside i_size? (truncate in progress) */ |
2704 | offset = i_size & (PAGE_CACHE_SIZE-1); | 2705 | offset = i_size & (PAGE_CACHE_SIZE-1); |
2705 | if (page->index >= end_index+1 || !offset) { | 2706 | if (page->index >= end_index+1 || !offset) { |
2706 | /* | 2707 | /* |
2707 | * The page may have dirty, unmapped buffers. For example, | 2708 | * The page may have dirty, unmapped buffers. For example, |
2708 | * they may have been added in ext3_writepage(). Make them | 2709 | * they may have been added in ext3_writepage(). Make them |
2709 | * freeable here, so the page does not leak. | 2710 | * freeable here, so the page does not leak. |
2710 | */ | 2711 | */ |
2711 | #if 0 | 2712 | #if 0 |
2712 | /* Not really sure about this - do we need this ? */ | 2713 | /* Not really sure about this - do we need this ? */ |
2713 | if (page->mapping->a_ops->invalidatepage) | 2714 | if (page->mapping->a_ops->invalidatepage) |
2714 | page->mapping->a_ops->invalidatepage(page, offset); | 2715 | page->mapping->a_ops->invalidatepage(page, offset); |
2715 | #endif | 2716 | #endif |
2716 | unlock_page(page); | 2717 | unlock_page(page); |
2717 | return 0; /* don't care */ | 2718 | return 0; /* don't care */ |
2718 | } | 2719 | } |
2719 | 2720 | ||
2720 | /* | 2721 | /* |
2721 | * The page straddles i_size. It must be zeroed out on each and every | 2722 | * The page straddles i_size. It must be zeroed out on each and every |
2722 | * writepage invocation because it may be mmapped. "A file is mapped | 2723 | * writepage invocation because it may be mmapped. "A file is mapped |
2723 | * in multiples of the page size. For a file that is not a multiple of | 2724 | * in multiples of the page size. For a file that is not a multiple of |
2724 | * the page size, the remaining memory is zeroed when mapped, and | 2725 | * the page size, the remaining memory is zeroed when mapped, and |
2725 | * writes to that region are not written out to the file." | 2726 | * writes to that region are not written out to the file." |
2726 | */ | 2727 | */ |
2727 | zero_user_segment(page, offset, PAGE_CACHE_SIZE); | 2728 | zero_user_segment(page, offset, PAGE_CACHE_SIZE); |
2728 | out: | 2729 | out: |
2729 | ret = mpage_writepage(page, get_block, wbc); | 2730 | ret = mpage_writepage(page, get_block, wbc); |
2730 | if (ret == -EAGAIN) | 2731 | if (ret == -EAGAIN) |
2731 | ret = __block_write_full_page(inode, page, get_block, wbc, | 2732 | ret = __block_write_full_page(inode, page, get_block, wbc, |
2732 | end_buffer_async_write); | 2733 | end_buffer_async_write); |
2733 | return ret; | 2734 | return ret; |
2734 | } | 2735 | } |
2735 | EXPORT_SYMBOL(nobh_writepage); | 2736 | EXPORT_SYMBOL(nobh_writepage); |
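A typical ->writepage wrapper for the nobh case is equally thin: nobh_writepage() already handles the i_size edge cases and falls back to __block_write_full_page() when mpage_writepage() returns -EAGAIN. Illustrative sketch only, using the hypothetical myfs_get_block from earlier.

#include <linux/writeback.h>

static int myfs_nobh_writepage(struct page *page,
                               struct writeback_control *wbc)
{
        return nobh_writepage(page, myfs_get_block, wbc);
}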
2736 | 2737 | ||
2737 | int nobh_truncate_page(struct address_space *mapping, | 2738 | int nobh_truncate_page(struct address_space *mapping, |
2738 | loff_t from, get_block_t *get_block) | 2739 | loff_t from, get_block_t *get_block) |
2739 | { | 2740 | { |
2740 | pgoff_t index = from >> PAGE_CACHE_SHIFT; | 2741 | pgoff_t index = from >> PAGE_CACHE_SHIFT; |
2741 | unsigned offset = from & (PAGE_CACHE_SIZE-1); | 2742 | unsigned offset = from & (PAGE_CACHE_SIZE-1); |
2742 | unsigned blocksize; | 2743 | unsigned blocksize; |
2743 | sector_t iblock; | 2744 | sector_t iblock; |
2744 | unsigned length, pos; | 2745 | unsigned length, pos; |
2745 | struct inode *inode = mapping->host; | 2746 | struct inode *inode = mapping->host; |
2746 | struct page *page; | 2747 | struct page *page; |
2747 | struct buffer_head map_bh; | 2748 | struct buffer_head map_bh; |
2748 | int err; | 2749 | int err; |
2749 | 2750 | ||
2750 | blocksize = 1 << inode->i_blkbits; | 2751 | blocksize = 1 << inode->i_blkbits; |
2751 | length = offset & (blocksize - 1); | 2752 | length = offset & (blocksize - 1); |
2752 | 2753 | ||
2753 | /* Block boundary? Nothing to do */ | 2754 | /* Block boundary? Nothing to do */ |
2754 | if (!length) | 2755 | if (!length) |
2755 | return 0; | 2756 | return 0; |
2756 | 2757 | ||
2757 | length = blocksize - length; | 2758 | length = blocksize - length; |
2758 | iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits); | 2759 | iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits); |
2759 | 2760 | ||
2760 | page = grab_cache_page(mapping, index); | 2761 | page = grab_cache_page(mapping, index); |
2761 | err = -ENOMEM; | 2762 | err = -ENOMEM; |
2762 | if (!page) | 2763 | if (!page) |
2763 | goto out; | 2764 | goto out; |
2764 | 2765 | ||
2765 | if (page_has_buffers(page)) { | 2766 | if (page_has_buffers(page)) { |
2766 | has_buffers: | 2767 | has_buffers: |
2767 | unlock_page(page); | 2768 | unlock_page(page); |
2768 | page_cache_release(page); | 2769 | page_cache_release(page); |
2769 | return block_truncate_page(mapping, from, get_block); | 2770 | return block_truncate_page(mapping, from, get_block); |
2770 | } | 2771 | } |
2771 | 2772 | ||
2772 | /* Find the buffer that contains "offset" */ | 2773 | /* Find the buffer that contains "offset" */ |
2773 | pos = blocksize; | 2774 | pos = blocksize; |
2774 | while (offset >= pos) { | 2775 | while (offset >= pos) { |
2775 | iblock++; | 2776 | iblock++; |
2776 | pos += blocksize; | 2777 | pos += blocksize; |
2777 | } | 2778 | } |
2778 | 2779 | ||
2779 | map_bh.b_size = blocksize; | 2780 | map_bh.b_size = blocksize; |
2780 | map_bh.b_state = 0; | 2781 | map_bh.b_state = 0; |
2781 | err = get_block(inode, iblock, &map_bh, 0); | 2782 | err = get_block(inode, iblock, &map_bh, 0); |
2782 | if (err) | 2783 | if (err) |
2783 | goto unlock; | 2784 | goto unlock; |
2784 | /* unmapped? It's a hole - nothing to do */ | 2785 | /* unmapped? It's a hole - nothing to do */ |
2785 | if (!buffer_mapped(&map_bh)) | 2786 | if (!buffer_mapped(&map_bh)) |
2786 | goto unlock; | 2787 | goto unlock; |
2787 | 2788 | ||
2788 | /* Ok, it's mapped. Make sure it's up-to-date */ | 2789 | /* Ok, it's mapped. Make sure it's up-to-date */ |
2789 | if (!PageUptodate(page)) { | 2790 | if (!PageUptodate(page)) { |
2790 | err = mapping->a_ops->readpage(NULL, page); | 2791 | err = mapping->a_ops->readpage(NULL, page); |
2791 | if (err) { | 2792 | if (err) { |
2792 | page_cache_release(page); | 2793 | page_cache_release(page); |
2793 | goto out; | 2794 | goto out; |
2794 | } | 2795 | } |
2795 | lock_page(page); | 2796 | lock_page(page); |
2796 | if (!PageUptodate(page)) { | 2797 | if (!PageUptodate(page)) { |
2797 | err = -EIO; | 2798 | err = -EIO; |
2798 | goto unlock; | 2799 | goto unlock; |
2799 | } | 2800 | } |
2800 | if (page_has_buffers(page)) | 2801 | if (page_has_buffers(page)) |
2801 | goto has_buffers; | 2802 | goto has_buffers; |
2802 | } | 2803 | } |
2803 | zero_user(page, offset, length); | 2804 | zero_user(page, offset, length); |
2804 | set_page_dirty(page); | 2805 | set_page_dirty(page); |
2805 | err = 0; | 2806 | err = 0; |
2806 | 2807 | ||
2807 | unlock: | 2808 | unlock: |
2808 | unlock_page(page); | 2809 | unlock_page(page); |
2809 | page_cache_release(page); | 2810 | page_cache_release(page); |
2810 | out: | 2811 | out: |
2811 | return err; | 2812 | return err; |
2812 | } | 2813 | } |
2813 | EXPORT_SYMBOL(nobh_truncate_page); | 2814 | EXPORT_SYMBOL(nobh_truncate_page); |
2814 | 2815 | ||
2815 | int block_truncate_page(struct address_space *mapping, | 2816 | int block_truncate_page(struct address_space *mapping, |
2816 | loff_t from, get_block_t *get_block) | 2817 | loff_t from, get_block_t *get_block) |
2817 | { | 2818 | { |
2818 | pgoff_t index = from >> PAGE_CACHE_SHIFT; | 2819 | pgoff_t index = from >> PAGE_CACHE_SHIFT; |
2819 | unsigned offset = from & (PAGE_CACHE_SIZE-1); | 2820 | unsigned offset = from & (PAGE_CACHE_SIZE-1); |
2820 | unsigned blocksize; | 2821 | unsigned blocksize; |
2821 | sector_t iblock; | 2822 | sector_t iblock; |
2822 | unsigned length, pos; | 2823 | unsigned length, pos; |
2823 | struct inode *inode = mapping->host; | 2824 | struct inode *inode = mapping->host; |
2824 | struct page *page; | 2825 | struct page *page; |
2825 | struct buffer_head *bh; | 2826 | struct buffer_head *bh; |
2826 | int err; | 2827 | int err; |
2827 | 2828 | ||
2828 | blocksize = 1 << inode->i_blkbits; | 2829 | blocksize = 1 << inode->i_blkbits; |
2829 | length = offset & (blocksize - 1); | 2830 | length = offset & (blocksize - 1); |
2830 | 2831 | ||
2831 | /* Block boundary? Nothing to do */ | 2832 | /* Block boundary? Nothing to do */ |
2832 | if (!length) | 2833 | if (!length) |
2833 | return 0; | 2834 | return 0; |
2834 | 2835 | ||
2835 | length = blocksize - length; | 2836 | length = blocksize - length; |
2836 | iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits); | 2837 | iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits); |
2837 | 2838 | ||
2838 | page = grab_cache_page(mapping, index); | 2839 | page = grab_cache_page(mapping, index); |
2839 | err = -ENOMEM; | 2840 | err = -ENOMEM; |
2840 | if (!page) | 2841 | if (!page) |
2841 | goto out; | 2842 | goto out; |
2842 | 2843 | ||
2843 | if (!page_has_buffers(page)) | 2844 | if (!page_has_buffers(page)) |
2844 | create_empty_buffers(page, blocksize, 0); | 2845 | create_empty_buffers(page, blocksize, 0); |
2845 | 2846 | ||
2846 | /* Find the buffer that contains "offset" */ | 2847 | /* Find the buffer that contains "offset" */ |
2847 | bh = page_buffers(page); | 2848 | bh = page_buffers(page); |
2848 | pos = blocksize; | 2849 | pos = blocksize; |
2849 | while (offset >= pos) { | 2850 | while (offset >= pos) { |
2850 | bh = bh->b_this_page; | 2851 | bh = bh->b_this_page; |
2851 | iblock++; | 2852 | iblock++; |
2852 | pos += blocksize; | 2853 | pos += blocksize; |
2853 | } | 2854 | } |
2854 | 2855 | ||
2855 | err = 0; | 2856 | err = 0; |
2856 | if (!buffer_mapped(bh)) { | 2857 | if (!buffer_mapped(bh)) { |
2857 | WARN_ON(bh->b_size != blocksize); | 2858 | WARN_ON(bh->b_size != blocksize); |
2858 | err = get_block(inode, iblock, bh, 0); | 2859 | err = get_block(inode, iblock, bh, 0); |
2859 | if (err) | 2860 | if (err) |
2860 | goto unlock; | 2861 | goto unlock; |
2861 | /* unmapped? It's a hole - nothing to do */ | 2862 | /* unmapped? It's a hole - nothing to do */ |
2862 | if (!buffer_mapped(bh)) | 2863 | if (!buffer_mapped(bh)) |
2863 | goto unlock; | 2864 | goto unlock; |
2864 | } | 2865 | } |
2865 | 2866 | ||
2866 | /* Ok, it's mapped. Make sure it's up-to-date */ | 2867 | /* Ok, it's mapped. Make sure it's up-to-date */ |
2867 | if (PageUptodate(page)) | 2868 | if (PageUptodate(page)) |
2868 | set_buffer_uptodate(bh); | 2869 | set_buffer_uptodate(bh); |
2869 | 2870 | ||
2870 | if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) { | 2871 | if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) { |
2871 | err = -EIO; | 2872 | err = -EIO; |
2872 | ll_rw_block(READ, 1, &bh); | 2873 | ll_rw_block(READ, 1, &bh); |
2873 | wait_on_buffer(bh); | 2874 | wait_on_buffer(bh); |
2874 | /* Uhhuh. Read error. Complain and punt. */ | 2875 | /* Uhhuh. Read error. Complain and punt. */ |
2875 | if (!buffer_uptodate(bh)) | 2876 | if (!buffer_uptodate(bh)) |
2876 | goto unlock; | 2877 | goto unlock; |
2877 | } | 2878 | } |
2878 | 2879 | ||
2879 | zero_user(page, offset, length); | 2880 | zero_user(page, offset, length); |
2880 | mark_buffer_dirty(bh); | 2881 | mark_buffer_dirty(bh); |
2881 | err = 0; | 2882 | err = 0; |
2882 | 2883 | ||
2883 | unlock: | 2884 | unlock: |
2884 | unlock_page(page); | 2885 | unlock_page(page); |
2885 | page_cache_release(page); | 2886 | page_cache_release(page); |
2886 | out: | 2887 | out: |
2887 | return err; | 2888 | return err; |
2888 | } | 2889 | } |
2889 | EXPORT_SYMBOL(block_truncate_page); | 2890 | EXPORT_SYMBOL(block_truncate_page); |
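Both truncate-page helpers above (nobh_truncate_page() falls back to block_truncate_page() when buffers exist) are normally called from the filesystem's truncate/setattr path to zero the tail of the last partial block, so stale data is not exposed if the file later grows. A hedged sketch of that call site, again with hypothetical myfs_* names:

static void myfs_truncate(struct inode *inode)
{
        /* Zero the bytes past the new EOF within the final block; the
         * on-disk block freeing and timestamp updates that follow are
         * the filesystem's own business and are omitted here. */
        block_truncate_page(inode->i_mapping, inode->i_size, myfs_get_block);
}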
2890 | 2891 | ||
2891 | /* | 2892 | /* |
2892 | * The generic ->writepage function for buffer-backed address_spaces | 2893 | * The generic ->writepage function for buffer-backed address_spaces |
2893 | * this form passes in the end_io handler used to finish the IO. | 2894 | * this form passes in the end_io handler used to finish the IO. |
2894 | */ | 2895 | */ |
2895 | int block_write_full_page_endio(struct page *page, get_block_t *get_block, | 2896 | int block_write_full_page_endio(struct page *page, get_block_t *get_block, |
2896 | struct writeback_control *wbc, bh_end_io_t *handler) | 2897 | struct writeback_control *wbc, bh_end_io_t *handler) |
2897 | { | 2898 | { |
2898 | struct inode * const inode = page->mapping->host; | 2899 | struct inode * const inode = page->mapping->host; |
2899 | loff_t i_size = i_size_read(inode); | 2900 | loff_t i_size = i_size_read(inode); |
2900 | const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; | 2901 | const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; |
2901 | unsigned offset; | 2902 | unsigned offset; |
2902 | 2903 | ||
2903 | /* Is the page fully inside i_size? */ | 2904 | /* Is the page fully inside i_size? */ |
2904 | if (page->index < end_index) | 2905 | if (page->index < end_index) |
2905 | return __block_write_full_page(inode, page, get_block, wbc, | 2906 | return __block_write_full_page(inode, page, get_block, wbc, |
2906 | handler); | 2907 | handler); |
2907 | 2908 | ||
2908 | /* Is the page fully outside i_size? (truncate in progress) */ | 2909 | /* Is the page fully outside i_size? (truncate in progress) */ |
2909 | offset = i_size & (PAGE_CACHE_SIZE-1); | 2910 | offset = i_size & (PAGE_CACHE_SIZE-1); |
2910 | if (page->index >= end_index+1 || !offset) { | 2911 | if (page->index >= end_index+1 || !offset) { |
2911 | /* | 2912 | /* |
2912 | * The page may have dirty, unmapped buffers. For example, | 2913 | * The page may have dirty, unmapped buffers. For example, |
2913 | * they may have been added in ext3_writepage(). Make them | 2914 | * they may have been added in ext3_writepage(). Make them |
2914 | * freeable here, so the page does not leak. | 2915 | * freeable here, so the page does not leak. |
2915 | */ | 2916 | */ |
2916 | do_invalidatepage(page, 0, PAGE_CACHE_SIZE); | 2917 | do_invalidatepage(page, 0, PAGE_CACHE_SIZE); |
2917 | unlock_page(page); | 2918 | unlock_page(page); |
2918 | return 0; /* don't care */ | 2919 | return 0; /* don't care */ |
2919 | } | 2920 | } |
2920 | 2921 | ||
2921 | /* | 2922 | /* |
2922 | * The page straddles i_size. It must be zeroed out on each and every | 2923 | * The page straddles i_size. It must be zeroed out on each and every |
2923 | * writepage invocation because it may be mmapped. "A file is mapped | 2924 | * writepage invocation because it may be mmapped. "A file is mapped |
2924 | * in multiples of the page size. For a file that is not a multiple of | 2925 | * in multiples of the page size. For a file that is not a multiple of |
2925 | * the page size, the remaining memory is zeroed when mapped, and | 2926 | * the page size, the remaining memory is zeroed when mapped, and |
2926 | * writes to that region are not written out to the file." | 2927 | * writes to that region are not written out to the file." |
2927 | */ | 2928 | */ |
2928 | zero_user_segment(page, offset, PAGE_CACHE_SIZE); | 2929 | zero_user_segment(page, offset, PAGE_CACHE_SIZE); |
2929 | return __block_write_full_page(inode, page, get_block, wbc, handler); | 2930 | return __block_write_full_page(inode, page, get_block, wbc, handler); |
2930 | } | 2931 | } |
2931 | EXPORT_SYMBOL(block_write_full_page_endio); | 2932 | EXPORT_SYMBOL(block_write_full_page_endio); |
2932 | 2933 | ||
2933 | /* | 2934 | /* |
2934 | * The generic ->writepage function for buffer-backed address_spaces | 2935 | * The generic ->writepage function for buffer-backed address_spaces |
2935 | */ | 2936 | */ |
2936 | int block_write_full_page(struct page *page, get_block_t *get_block, | 2937 | int block_write_full_page(struct page *page, get_block_t *get_block, |
2937 | struct writeback_control *wbc) | 2938 | struct writeback_control *wbc) |
2938 | { | 2939 | { |
2939 | return block_write_full_page_endio(page, get_block, wbc, | 2940 | return block_write_full_page_endio(page, get_block, wbc, |
2940 | end_buffer_async_write); | 2941 | end_buffer_async_write); |
2941 | } | 2942 | } |
2942 | EXPORT_SYMBOL(block_write_full_page); | 2943 | EXPORT_SYMBOL(block_write_full_page); |
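block_write_full_page() is the conventional, buffer-head based counterpart to the nobh_writepage() wrapper sketched earlier; most buffer-backed filesystems simply forward their ->writepage to it. Hypothetical sketch:

static int myfs_bh_writepage(struct page *page, struct writeback_control *wbc)
{
        return block_write_full_page(page, myfs_get_block, wbc);
}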
2943 | 2944 | ||
2944 | sector_t generic_block_bmap(struct address_space *mapping, sector_t block, | 2945 | sector_t generic_block_bmap(struct address_space *mapping, sector_t block, |
2945 | get_block_t *get_block) | 2946 | get_block_t *get_block) |
2946 | { | 2947 | { |
2947 | struct buffer_head tmp; | 2948 | struct buffer_head tmp; |
2948 | struct inode *inode = mapping->host; | 2949 | struct inode *inode = mapping->host; |
2949 | tmp.b_state = 0; | 2950 | tmp.b_state = 0; |
2950 | tmp.b_blocknr = 0; | 2951 | tmp.b_blocknr = 0; |
2951 | tmp.b_size = 1 << inode->i_blkbits; | 2952 | tmp.b_size = 1 << inode->i_blkbits; |
2952 | get_block(inode, block, &tmp, 0); | 2953 | get_block(inode, block, &tmp, 0); |
2953 | return tmp.b_blocknr; | 2954 | return tmp.b_blocknr; |
2954 | } | 2955 | } |
2955 | EXPORT_SYMBOL(generic_block_bmap); | 2956 | EXPORT_SYMBOL(generic_block_bmap); |
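generic_block_bmap() is usually exposed through the ->bmap address_space operation, which is what the FIBMAP ioctl and swap-file activation ultimately reach. Illustrative wiring, with the same hypothetical helpers:

static sector_t myfs_bmap(struct address_space *mapping, sector_t block)
{
        return generic_block_bmap(mapping, block, myfs_get_block);
}

static const struct address_space_operations myfs_aops = {
        .bmap        = myfs_bmap,
        /* .writepage, .write_begin, .write_end, ... as sketched above */
};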
2956 | 2957 | ||
2957 | static void end_bio_bh_io_sync(struct bio *bio, int err) | 2958 | static void end_bio_bh_io_sync(struct bio *bio, int err) |
2958 | { | 2959 | { |
2959 | struct buffer_head *bh = bio->bi_private; | 2960 | struct buffer_head *bh = bio->bi_private; |
2960 | 2961 | ||
2961 | if (err == -EOPNOTSUPP) { | 2962 | if (err == -EOPNOTSUPP) { |
2962 | set_bit(BIO_EOPNOTSUPP, &bio->bi_flags); | 2963 | set_bit(BIO_EOPNOTSUPP, &bio->bi_flags); |
2963 | } | 2964 | } |
2964 | 2965 | ||
2965 | if (unlikely (test_bit(BIO_QUIET,&bio->bi_flags))) | 2966 | if (unlikely (test_bit(BIO_QUIET,&bio->bi_flags))) |
2966 | set_bit(BH_Quiet, &bh->b_state); | 2967 | set_bit(BH_Quiet, &bh->b_state); |
2967 | 2968 | ||
2968 | bh->b_end_io(bh, test_bit(BIO_UPTODATE, &bio->bi_flags)); | 2969 | bh->b_end_io(bh, test_bit(BIO_UPTODATE, &bio->bi_flags)); |
2969 | bio_put(bio); | 2970 | bio_put(bio); |
2970 | } | 2971 | } |
2971 | 2972 | ||
2972 | /* | 2973 | /* |
2973 | * This allows us to do IO even on the odd last sectors | 2974 | * This allows us to do IO even on the odd last sectors |
2974 | * of a device, even if the bh block size is some multiple | 2975 | * of a device, even if the bh block size is some multiple |
2975 | * of the physical sector size. | 2976 | * of the physical sector size. |
2976 | * | 2977 | * |
2977 | * We'll just truncate the bio to the size of the device, | 2978 | * We'll just truncate the bio to the size of the device, |
2978 | * and clear the end of the buffer head manually. | 2979 | * and clear the end of the buffer head manually. |
2979 | * | 2980 | * |
2980 | * Truly out-of-range accesses will turn into actual IO | 2981 | * Truly out-of-range accesses will turn into actual IO |
2981 | * errors, this only handles the "we need to be able to | 2982 | * errors, this only handles the "we need to be able to |
2982 | * do IO at the final sector" case. | 2983 | * do IO at the final sector" case. |
2983 | */ | 2984 | */ |
2984 | static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh) | 2985 | static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh) |
2985 | { | 2986 | { |
2986 | sector_t maxsector; | 2987 | sector_t maxsector; |
2987 | unsigned bytes; | 2988 | unsigned bytes; |
2988 | 2989 | ||
2989 | maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9; | 2990 | maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9; |
2990 | if (!maxsector) | 2991 | if (!maxsector) |
2991 | return; | 2992 | return; |
2992 | 2993 | ||
2993 | /* | 2994 | /* |
2994 | * If the *whole* IO is past the end of the device, | 2995 | * If the *whole* IO is past the end of the device, |
2995 | * let it through, and the IO layer will turn it into | 2996 | * let it through, and the IO layer will turn it into |
2996 | * an EIO. | 2997 | * an EIO. |
2997 | */ | 2998 | */ |
2998 | if (unlikely(bio->bi_sector >= maxsector)) | 2999 | if (unlikely(bio->bi_sector >= maxsector)) |
2999 | return; | 3000 | return; |
3000 | 3001 | ||
3001 | maxsector -= bio->bi_sector; | 3002 | maxsector -= bio->bi_sector; |
3002 | bytes = bio->bi_size; | 3003 | bytes = bio->bi_size; |
3003 | if (likely((bytes >> 9) <= maxsector)) | 3004 | if (likely((bytes >> 9) <= maxsector)) |
3004 | return; | 3005 | return; |
3005 | 3006 | ||
3006 | /* Uhhuh. We've got a bh that straddles the device size! */ | 3007 | /* Uhhuh. We've got a bh that straddles the device size! */ |
3007 | bytes = maxsector << 9; | 3008 | bytes = maxsector << 9; |
3008 | 3009 | ||
3009 | /* Truncate the bio.. */ | 3010 | /* Truncate the bio.. */ |
3010 | bio->bi_size = bytes; | 3011 | bio->bi_size = bytes; |
3011 | bio->bi_io_vec[0].bv_len = bytes; | 3012 | bio->bi_io_vec[0].bv_len = bytes; |
3012 | 3013 | ||
3013 | /* ..and clear the end of the buffer for reads */ | 3014 | /* ..and clear the end of the buffer for reads */ |
3014 | if ((rw & RW_MASK) == READ) { | 3015 | if ((rw & RW_MASK) == READ) { |
3015 | void *kaddr = kmap_atomic(bh->b_page); | 3016 | void *kaddr = kmap_atomic(bh->b_page); |
3016 | memset(kaddr + bh_offset(bh) + bytes, 0, bh->b_size - bytes); | 3017 | memset(kaddr + bh_offset(bh) + bytes, 0, bh->b_size - bytes); |
3017 | kunmap_atomic(kaddr); | 3018 | kunmap_atomic(kaddr); |
3018 | flush_dcache_page(bh->b_page); | 3019 | flush_dcache_page(bh->b_page); |
3019 | } | 3020 | } |
3020 | } | 3021 | } |
3021 | 3022 | ||
3022 | int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags) | 3023 | int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags) |
3023 | { | 3024 | { |
3024 | struct bio *bio; | 3025 | struct bio *bio; |
3025 | int ret = 0; | 3026 | int ret = 0; |
3026 | 3027 | ||
3027 | BUG_ON(!buffer_locked(bh)); | 3028 | BUG_ON(!buffer_locked(bh)); |
3028 | BUG_ON(!buffer_mapped(bh)); | 3029 | BUG_ON(!buffer_mapped(bh)); |
3029 | BUG_ON(!bh->b_end_io); | 3030 | BUG_ON(!bh->b_end_io); |
3030 | BUG_ON(buffer_delay(bh)); | 3031 | BUG_ON(buffer_delay(bh)); |
3031 | BUG_ON(buffer_unwritten(bh)); | 3032 | BUG_ON(buffer_unwritten(bh)); |
3032 | 3033 | ||
3033 | /* | 3034 | /* |
3034 | * Only clear out a write error when rewriting | 3035 | * Only clear out a write error when rewriting |
3035 | */ | 3036 | */ |
3036 | if (test_set_buffer_req(bh) && (rw & WRITE)) | 3037 | if (test_set_buffer_req(bh) && (rw & WRITE)) |
3037 | clear_buffer_write_io_error(bh); | 3038 | clear_buffer_write_io_error(bh); |
3038 | 3039 | ||
3039 | /* | 3040 | /* |
3040 | * from here on down, it's all bio -- do the initial mapping, | 3041 | * from here on down, it's all bio -- do the initial mapping, |
3041 | * submit_bio -> generic_make_request may further map this bio around | 3042 | * submit_bio -> generic_make_request may further map this bio around |
3042 | */ | 3043 | */ |
3043 | bio = bio_alloc(GFP_NOIO, 1); | 3044 | bio = bio_alloc(GFP_NOIO, 1); |
3044 | 3045 | ||
3045 | bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9); | 3046 | bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9); |
3046 | bio->bi_bdev = bh->b_bdev; | 3047 | bio->bi_bdev = bh->b_bdev; |
3047 | bio->bi_io_vec[0].bv_page = bh->b_page; | 3048 | bio->bi_io_vec[0].bv_page = bh->b_page; |
3048 | bio->bi_io_vec[0].bv_len = bh->b_size; | 3049 | bio->bi_io_vec[0].bv_len = bh->b_size; |
3049 | bio->bi_io_vec[0].bv_offset = bh_offset(bh); | 3050 | bio->bi_io_vec[0].bv_offset = bh_offset(bh); |
3050 | 3051 | ||
3051 | bio->bi_vcnt = 1; | 3052 | bio->bi_vcnt = 1; |
3052 | bio->bi_size = bh->b_size; | 3053 | bio->bi_size = bh->b_size; |
3053 | 3054 | ||
3054 | bio->bi_end_io = end_bio_bh_io_sync; | 3055 | bio->bi_end_io = end_bio_bh_io_sync; |
3055 | bio->bi_private = bh; | 3056 | bio->bi_private = bh; |
3056 | bio->bi_flags |= bio_flags; | 3057 | bio->bi_flags |= bio_flags; |
3057 | 3058 | ||
3058 | /* Take care of bh's that straddle the end of the device */ | 3059 | /* Take care of bh's that straddle the end of the device */ |
3059 | guard_bh_eod(rw, bio, bh); | 3060 | guard_bh_eod(rw, bio, bh); |
3060 | 3061 | ||
3061 | if (buffer_meta(bh)) | 3062 | if (buffer_meta(bh)) |
3062 | rw |= REQ_META; | 3063 | rw |= REQ_META; |
3063 | if (buffer_prio(bh)) | 3064 | if (buffer_prio(bh)) |
3064 | rw |= REQ_PRIO; | 3065 | rw |= REQ_PRIO; |
3065 | 3066 | ||
3066 | bio_get(bio); | 3067 | bio_get(bio); |
3067 | submit_bio(rw, bio); | 3068 | submit_bio(rw, bio); |
3068 | 3069 | ||
3069 | if (bio_flagged(bio, BIO_EOPNOTSUPP)) | 3070 | if (bio_flagged(bio, BIO_EOPNOTSUPP)) |
3070 | ret = -EOPNOTSUPP; | 3071 | ret = -EOPNOTSUPP; |
3071 | 3072 | ||
3072 | bio_put(bio); | 3073 | bio_put(bio); |
3073 | return ret; | 3074 | return ret; |
3074 | } | 3075 | } |
3075 | EXPORT_SYMBOL_GPL(_submit_bh); | 3076 | EXPORT_SYMBOL_GPL(_submit_bh); |
3076 | 3077 | ||
3077 | int submit_bh(int rw, struct buffer_head *bh) | 3078 | int submit_bh(int rw, struct buffer_head *bh) |
3078 | { | 3079 | { |
3079 | return _submit_bh(rw, bh, 0); | 3080 | return _submit_bh(rw, bh, 0); |
3080 | } | 3081 | } |
3081 | EXPORT_SYMBOL(submit_bh); | 3082 | EXPORT_SYMBOL(submit_bh); |
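A common caller pattern for submit_bh() is a synchronous read of a single mapped buffer_head using the exported end_buffer_read_sync() completion; this is essentially what bh_submit_read() near the end of this file does. The version below is a hedged sketch assuming the caller already holds a reference on a mapped bh (for example from sb_getblk()).

static int myfs_read_bh_sync(struct buffer_head *bh)
{
        lock_buffer(bh);
        if (buffer_uptodate(bh)) {
                unlock_buffer(bh);
                return 0;
        }
        get_bh(bh);                     /* dropped by end_buffer_read_sync() */
        bh->b_end_io = end_buffer_read_sync;
        submit_bh(READ, bh);
        wait_on_buffer(bh);
        return buffer_uptodate(bh) ? 0 : -EIO;
}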
3082 | 3083 | ||
3083 | /** | 3084 | /** |
3084 | * ll_rw_block: low-level access to block devices (DEPRECATED) | 3085 | * ll_rw_block: low-level access to block devices (DEPRECATED) |
3085 | * @rw: whether to %READ or %WRITE or maybe %READA (readahead) | 3086 | * @rw: whether to %READ or %WRITE or maybe %READA (readahead) |
3086 | * @nr: number of &struct buffer_heads in the array | 3087 | * @nr: number of &struct buffer_heads in the array |
3087 | * @bhs: array of pointers to &struct buffer_head | 3088 | * @bhs: array of pointers to &struct buffer_head |
3088 | * | 3089 | * |
3089 | * ll_rw_block() takes an array of pointers to &struct buffer_heads, and | 3090 | * ll_rw_block() takes an array of pointers to &struct buffer_heads, and |
3090 | * requests an I/O operation on them, either a %READ or a %WRITE. The third | 3091 | * requests an I/O operation on them, either a %READ or a %WRITE. The third |
3091 | * %READA option is described in the documentation for generic_make_request() | 3092 | * %READA option is described in the documentation for generic_make_request() |
3092 | * which ll_rw_block() calls. | 3093 | * which ll_rw_block() calls. |
3093 | * | 3094 | * |
3094 | * This function drops any buffer that it cannot get a lock on (with the | 3095 | * This function drops any buffer that it cannot get a lock on (with the |
3095 | * BH_Lock state bit), any buffer that appears to be clean when doing a write | 3096 | * BH_Lock state bit), any buffer that appears to be clean when doing a write |
3096 | * request, and any buffer that appears to be up-to-date when doing read | 3097 | * request, and any buffer that appears to be up-to-date when doing read |
3097 | * request. Further it marks as clean buffers that are processed for | 3098 | * request. Further it marks as clean buffers that are processed for |
3098 | * writing (the buffer cache won't assume that they are actually clean | 3099 | * writing (the buffer cache won't assume that they are actually clean |
3099 | * until the buffer gets unlocked). | 3100 | * until the buffer gets unlocked). |
3100 | * | 3101 | * |
3101 | * ll_rw_block sets b_end_io to simple completion handler that marks | 3102 | * ll_rw_block sets b_end_io to simple completion handler that marks |
3102 | * the buffer up-to-date (if approriate), unlocks the buffer and wakes | 3103 | * the buffer up-to-date (if approriate), unlocks the buffer and wakes |
3103 | * any waiters. | 3104 | * any waiters. |
3104 | * | 3105 | * |
3105 | * All of the buffers must be for the same device, and must also be a | 3106 | * All of the buffers must be for the same device, and must also be a |
3106 | * multiple of the current approved size for the device. | 3107 | * multiple of the current approved size for the device. |
3107 | */ | 3108 | */ |
3108 | void ll_rw_block(int rw, int nr, struct buffer_head *bhs[]) | 3109 | void ll_rw_block(int rw, int nr, struct buffer_head *bhs[]) |
3109 | { | 3110 | { |
3110 | int i; | 3111 | int i; |
3111 | 3112 | ||
3112 | for (i = 0; i < nr; i++) { | 3113 | for (i = 0; i < nr; i++) { |
3113 | struct buffer_head *bh = bhs[i]; | 3114 | struct buffer_head *bh = bhs[i]; |
3114 | 3115 | ||
3115 | if (!trylock_buffer(bh)) | 3116 | if (!trylock_buffer(bh)) |
3116 | continue; | 3117 | continue; |
3117 | if (rw == WRITE) { | 3118 | if (rw == WRITE) { |
3118 | if (test_clear_buffer_dirty(bh)) { | 3119 | if (test_clear_buffer_dirty(bh)) { |
3119 | bh->b_end_io = end_buffer_write_sync; | 3120 | bh->b_end_io = end_buffer_write_sync; |
3120 | get_bh(bh); | 3121 | get_bh(bh); |
3121 | submit_bh(WRITE, bh); | 3122 | submit_bh(WRITE, bh); |
3122 | continue; | 3123 | continue; |
3123 | } | 3124 | } |
3124 | } else { | 3125 | } else { |
3125 | if (!buffer_uptodate(bh)) { | 3126 | if (!buffer_uptodate(bh)) { |
3126 | bh->b_end_io = end_buffer_read_sync; | 3127 | bh->b_end_io = end_buffer_read_sync; |
3127 | get_bh(bh); | 3128 | get_bh(bh); |
3128 | submit_bh(rw, bh); | 3129 | submit_bh(rw, bh); |
3129 | continue; | 3130 | continue; |
3130 | } | 3131 | } |
3131 | } | 3132 | } |
3132 | unlock_buffer(bh); | 3133 | unlock_buffer(bh); |
3133 | } | 3134 | } |
3134 | } | 3135 | } |
3135 | EXPORT_SYMBOL(ll_rw_block); | 3136 | EXPORT_SYMBOL(ll_rw_block); |
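ll_rw_block() is mostly used for opportunistic metadata readahead: read the block that is needed now and hint the rest, relying on the "drop what we cannot lock" semantics documented above. The sketch below is illustrative only; sb_getblk()/brelse() are the standard buffer-cache helpers, and NULL/ENOMEM handling is trimmed for brevity.

static struct buffer_head *myfs_bread_with_ra(struct super_block *sb,
                                              sector_t *blocks, int nr)
{
        struct buffer_head *bhs[8];
        int i;

        if (nr > 8)
                nr = 8;
        for (i = 0; i < nr; i++)
                bhs[i] = sb_getblk(sb, blocks[i]);

        /* Read the block we need now, hint readahead for the rest. */
        ll_rw_block(READ, 1, &bhs[0]);
        if (nr > 1)
                ll_rw_block(READA, nr - 1, &bhs[1]);

        wait_on_buffer(bhs[0]);
        for (i = 1; i < nr; i++)
                brelse(bhs[i]);

        if (!buffer_uptodate(bhs[0])) {
                brelse(bhs[0]);
                return NULL;
        }
        return bhs[0];
}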
3136 | 3137 | ||
3137 | void write_dirty_buffer(struct buffer_head *bh, int rw) | 3138 | void write_dirty_buffer(struct buffer_head *bh, int rw) |
3138 | { | 3139 | { |
3139 | lock_buffer(bh); | 3140 | lock_buffer(bh); |
3140 | if (!test_clear_buffer_dirty(bh)) { | 3141 | if (!test_clear_buffer_dirty(bh)) { |
3141 | unlock_buffer(bh); | 3142 | unlock_buffer(bh); |
3142 | return; | 3143 | return; |
3143 | } | 3144 | } |
3144 | bh->b_end_io = end_buffer_write_sync; | 3145 | bh->b_end_io = end_buffer_write_sync; |
3145 | get_bh(bh); | 3146 | get_bh(bh); |
3146 | submit_bh(rw, bh); | 3147 | submit_bh(rw, bh); |
3147 | } | 3148 | } |
3148 | EXPORT_SYMBOL(write_dirty_buffer); | 3149 | EXPORT_SYMBOL(write_dirty_buffer); |
3149 | 3150 | ||
3150 | /* | 3151 | /* |
3151 | * For a data-integrity writeout, we need to wait upon any in-progress I/O | 3152 | * For a data-integrity writeout, we need to wait upon any in-progress I/O |
3152 | * and then start new I/O and then wait upon it. The caller must have a ref on | 3153 | * and then start new I/O and then wait upon it. The caller must have a ref on |
3153 | * the buffer_head. | 3154 | * the buffer_head. |
3154 | */ | 3155 | */ |
3155 | int __sync_dirty_buffer(struct buffer_head *bh, int rw) | 3156 | int __sync_dirty_buffer(struct buffer_head *bh, int rw) |
3156 | { | 3157 | { |
3157 | int ret = 0; | 3158 | int ret = 0; |
3158 | 3159 | ||
3159 | WARN_ON(atomic_read(&bh->b_count) < 1); | 3160 | WARN_ON(atomic_read(&bh->b_count) < 1); |
3160 | lock_buffer(bh); | 3161 | lock_buffer(bh); |
3161 | if (test_clear_buffer_dirty(bh)) { | 3162 | if (test_clear_buffer_dirty(bh)) { |
3162 | get_bh(bh); | 3163 | get_bh(bh); |
3163 | bh->b_end_io = end_buffer_write_sync; | 3164 | bh->b_end_io = end_buffer_write_sync; |
3164 | ret = submit_bh(rw, bh); | 3165 | ret = submit_bh(rw, bh); |
3165 | wait_on_buffer(bh); | 3166 | wait_on_buffer(bh); |
3166 | if (!ret && !buffer_uptodate(bh)) | 3167 | if (!ret && !buffer_uptodate(bh)) |
3167 | ret = -EIO; | 3168 | ret = -EIO; |
3168 | } else { | 3169 | } else { |
3169 | unlock_buffer(bh); | 3170 | unlock_buffer(bh); |
3170 | } | 3171 | } |
3171 | return ret; | 3172 | return ret; |
3172 | } | 3173 | } |
3173 | EXPORT_SYMBOL(__sync_dirty_buffer); | 3174 | EXPORT_SYMBOL(__sync_dirty_buffer); |
3174 | 3175 | ||
3175 | int sync_dirty_buffer(struct buffer_head *bh) | 3176 | int sync_dirty_buffer(struct buffer_head *bh) |
3176 | { | 3177 | { |
3177 | return __sync_dirty_buffer(bh, WRITE_SYNC); | 3178 | return __sync_dirty_buffer(bh, WRITE_SYNC); |
3178 | } | 3179 | } |
3179 | EXPORT_SYMBOL(sync_dirty_buffer); | 3180 | EXPORT_SYMBOL(sync_dirty_buffer); |
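sync_dirty_buffer() is the usual way a filesystem pushes out one modified metadata block with data-integrity semantics. A hedged example of the read-modify-write-wait cycle, with a hypothetical block number supplied by the caller:

static int myfs_update_meta_block(struct super_block *sb, sector_t block)
{
        struct buffer_head *bh;
        int err;

        bh = sb_bread(sb, block);       /* read (or find) the block */
        if (!bh)
                return -EIO;

        /* ... modify bh->b_data here ... */

        mark_buffer_dirty(bh);
        err = sync_dirty_buffer(bh);    /* WRITE_SYNC, then wait; -EIO on failure */
        brelse(bh);
        return err;
}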
3180 | 3181 | ||
3181 | /* | 3182 | /* |
3182 | * try_to_free_buffers() checks if all the buffers on this particular page | 3183 | * try_to_free_buffers() checks if all the buffers on this particular page |
3183 | * are unused, and releases them if so. | 3184 | * are unused, and releases them if so. |
3184 | * | 3185 | * |
3185 | * Exclusion against try_to_free_buffers may be obtained by either | 3186 | * Exclusion against try_to_free_buffers may be obtained by either |
3186 | * locking the page or by holding its mapping's private_lock. | 3187 | * locking the page or by holding its mapping's private_lock. |
3187 | * | 3188 | * |
3188 | * If the page is dirty but all the buffers are clean then we need to | 3189 | * If the page is dirty but all the buffers are clean then we need to |
3189 | * be sure to mark the page clean as well. This is because the page | 3190 | * be sure to mark the page clean as well. This is because the page |
3190 | * may be against a block device, and a later reattachment of buffers | 3191 | * may be against a block device, and a later reattachment of buffers |
3191 | * to a dirty page will set *all* buffers dirty. Which would corrupt | 3192 | * to a dirty page will set *all* buffers dirty. Which would corrupt |
3192 | * filesystem data on the same device. | 3193 | * filesystem data on the same device. |
3193 | * | 3194 | * |
3194 | * The same applies to regular filesystem pages: if all the buffers are | 3195 | * The same applies to regular filesystem pages: if all the buffers are |
3195 | * clean then we set the page clean and proceed. To do that, we require | 3196 | * clean then we set the page clean and proceed. To do that, we require |
3196 | * total exclusion from __set_page_dirty_buffers(). That is obtained with | 3197 | * total exclusion from __set_page_dirty_buffers(). That is obtained with |
3197 | * private_lock. | 3198 | * private_lock. |
3198 | * | 3199 | * |
3199 | * try_to_free_buffers() is non-blocking. | 3200 | * try_to_free_buffers() is non-blocking. |
3200 | */ | 3201 | */ |
3201 | static inline int buffer_busy(struct buffer_head *bh) | 3202 | static inline int buffer_busy(struct buffer_head *bh) |
3202 | { | 3203 | { |
3203 | return atomic_read(&bh->b_count) | | 3204 | return atomic_read(&bh->b_count) | |
3204 | (bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock))); | 3205 | (bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock))); |
3205 | } | 3206 | } |
3206 | 3207 | ||
3207 | static int | 3208 | static int |
3208 | drop_buffers(struct page *page, struct buffer_head **buffers_to_free) | 3209 | drop_buffers(struct page *page, struct buffer_head **buffers_to_free) |
3209 | { | 3210 | { |
3210 | struct buffer_head *head = page_buffers(page); | 3211 | struct buffer_head *head = page_buffers(page); |
3211 | struct buffer_head *bh; | 3212 | struct buffer_head *bh; |
3212 | 3213 | ||
3213 | bh = head; | 3214 | bh = head; |
3214 | do { | 3215 | do { |
3215 | if (buffer_write_io_error(bh) && page->mapping) | 3216 | if (buffer_write_io_error(bh) && page->mapping) |
3216 | set_bit(AS_EIO, &page->mapping->flags); | 3217 | set_bit(AS_EIO, &page->mapping->flags); |
3217 | if (buffer_busy(bh)) | 3218 | if (buffer_busy(bh)) |
3218 | goto failed; | 3219 | goto failed; |
3219 | bh = bh->b_this_page; | 3220 | bh = bh->b_this_page; |
3220 | } while (bh != head); | 3221 | } while (bh != head); |
3221 | 3222 | ||
3222 | do { | 3223 | do { |
3223 | struct buffer_head *next = bh->b_this_page; | 3224 | struct buffer_head *next = bh->b_this_page; |
3224 | 3225 | ||
3225 | if (bh->b_assoc_map) | 3226 | if (bh->b_assoc_map) |
3226 | __remove_assoc_queue(bh); | 3227 | __remove_assoc_queue(bh); |
3227 | bh = next; | 3228 | bh = next; |
3228 | } while (bh != head); | 3229 | } while (bh != head); |
3229 | *buffers_to_free = head; | 3230 | *buffers_to_free = head; |
3230 | __clear_page_buffers(page); | 3231 | __clear_page_buffers(page); |
3231 | return 1; | 3232 | return 1; |
3232 | failed: | 3233 | failed: |
3233 | return 0; | 3234 | return 0; |
3234 | } | 3235 | } |
3235 | 3236 | ||
3236 | int try_to_free_buffers(struct page *page) | 3237 | int try_to_free_buffers(struct page *page) |
3237 | { | 3238 | { |
3238 | struct address_space * const mapping = page->mapping; | 3239 | struct address_space * const mapping = page->mapping; |
3239 | struct buffer_head *buffers_to_free = NULL; | 3240 | struct buffer_head *buffers_to_free = NULL; |
3240 | int ret = 0; | 3241 | int ret = 0; |
3241 | 3242 | ||
3242 | BUG_ON(!PageLocked(page)); | 3243 | BUG_ON(!PageLocked(page)); |
3243 | if (PageWriteback(page)) | 3244 | if (PageWriteback(page)) |
3244 | return 0; | 3245 | return 0; |
3245 | 3246 | ||
3246 | if (mapping == NULL) { /* can this still happen? */ | 3247 | if (mapping == NULL) { /* can this still happen? */ |
3247 | ret = drop_buffers(page, &buffers_to_free); | 3248 | ret = drop_buffers(page, &buffers_to_free); |
3248 | goto out; | 3249 | goto out; |
3249 | } | 3250 | } |
3250 | 3251 | ||
3251 | spin_lock(&mapping->private_lock); | 3252 | spin_lock(&mapping->private_lock); |
3252 | ret = drop_buffers(page, &buffers_to_free); | 3253 | ret = drop_buffers(page, &buffers_to_free); |
3253 | 3254 | ||
3254 | /* | 3255 | /* |
3255 | * If the filesystem writes its buffers by hand (eg ext3) | 3256 | * If the filesystem writes its buffers by hand (eg ext3) |
3256 | * then we can have clean buffers against a dirty page. We | 3257 | * then we can have clean buffers against a dirty page. We |
3257 | * clean the page here; otherwise the VM will never notice | 3258 | * clean the page here; otherwise the VM will never notice |
3258 | * that the filesystem did any IO at all. | 3259 | * that the filesystem did any IO at all. |
3259 | * | 3260 | * |
3260 | * Also, during truncate, discard_buffer will have marked all | 3261 | * Also, during truncate, discard_buffer will have marked all |
3261 | * the page's buffers clean. We discover that here and clean | 3262 | * the page's buffers clean. We discover that here and clean |
3262 | * the page also. | 3263 | * the page also. |
3263 | * | 3264 | * |
3264 | * private_lock must be held over this entire operation in order | 3265 | * private_lock must be held over this entire operation in order |
3265 | * to synchronise against __set_page_dirty_buffers and prevent the | 3266 | * to synchronise against __set_page_dirty_buffers and prevent the |
3266 | * dirty bit from being lost. | 3267 | * dirty bit from being lost. |
3267 | */ | 3268 | */ |
3268 | if (ret) | 3269 | if (ret) |
3269 | cancel_dirty_page(page, PAGE_CACHE_SIZE); | 3270 | cancel_dirty_page(page, PAGE_CACHE_SIZE); |
3270 | spin_unlock(&mapping->private_lock); | 3271 | spin_unlock(&mapping->private_lock); |
3271 | out: | 3272 | out: |
3272 | if (buffers_to_free) { | 3273 | if (buffers_to_free) { |
3273 | struct buffer_head *bh = buffers_to_free; | 3274 | struct buffer_head *bh = buffers_to_free; |
3274 | 3275 | ||
3275 | do { | 3276 | do { |
3276 | struct buffer_head *next = bh->b_this_page; | 3277 | struct buffer_head *next = bh->b_this_page; |
3277 | free_buffer_head(bh); | 3278 | free_buffer_head(bh); |
3278 | bh = next; | 3279 | bh = next; |
3279 | } while (bh != buffers_to_free); | 3280 | } while (bh != buffers_to_free); |
3280 | } | 3281 | } |
3281 | return ret; | 3282 | return ret; |
3282 | } | 3283 | } |
3283 | EXPORT_SYMBOL(try_to_free_buffers); | 3284 | EXPORT_SYMBOL(try_to_free_buffers); |
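Most buffer-backed filesystems simply delegate their ->releasepage operation to try_to_free_buffers(), letting the generic code decide whether every buffer on the page is idle and clean. Illustrative only:

static int myfs_releasepage(struct page *page, gfp_t gfp_mask)
{
        /* Returns 1 if the buffers were freed and the page may be
         * released, 0 if the page must be kept. */
        return try_to_free_buffers(page);
}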
3284 | 3285 | ||
3285 | /* | 3286 | /* |
3286 | * There are no bdflush tunables left. But distributions are | 3287 | * There are no bdflush tunables left. But distributions are |
3287 | * still running obsolete flush daemons, so we terminate them here. | 3288 | * still running obsolete flush daemons, so we terminate them here. |
3288 | * | 3289 | * |
3289 | * Use of bdflush() is deprecated and will be removed in a future kernel. | 3290 | * Use of bdflush() is deprecated and will be removed in a future kernel. |
3290 | * The `flush-X' kernel threads fully replace bdflush daemons and this call. | 3291 | * The `flush-X' kernel threads fully replace bdflush daemons and this call. |
3291 | */ | 3292 | */ |
3292 | SYSCALL_DEFINE2(bdflush, int, func, long, data) | 3293 | SYSCALL_DEFINE2(bdflush, int, func, long, data) |
3293 | { | 3294 | { |
3294 | static int msg_count; | 3295 | static int msg_count; |
3295 | 3296 | ||
3296 | if (!capable(CAP_SYS_ADMIN)) | 3297 | if (!capable(CAP_SYS_ADMIN)) |
3297 | return -EPERM; | 3298 | return -EPERM; |
3298 | 3299 | ||
3299 | if (msg_count < 5) { | 3300 | if (msg_count < 5) { |
3300 | msg_count++; | 3301 | msg_count++; |
3301 | printk(KERN_INFO | 3302 | printk(KERN_INFO |
3302 | "warning: process `%s' used the obsolete bdflush" | 3303 | "warning: process `%s' used the obsolete bdflush" |
3303 | " system call\n", current->comm); | 3304 | " system call\n", current->comm); |
3304 | printk(KERN_INFO "Fix your initscripts?\n"); | 3305 | printk(KERN_INFO "Fix your initscripts?\n"); |
3305 | } | 3306 | } |
3306 | 3307 | ||
3307 | if (func == 1) | 3308 | if (func == 1) |
3308 | do_exit(0); | 3309 | do_exit(0); |
3309 | return 0; | 3310 | return 0; |
3310 | } | 3311 | } |
3311 | 3312 | ||
3312 | /* | 3313 | /* |
3313 | * Buffer-head allocation | 3314 | * Buffer-head allocation |
3314 | */ | 3315 | */ |
3315 | static struct kmem_cache *bh_cachep __read_mostly; | 3316 | static struct kmem_cache *bh_cachep __read_mostly; |
3316 | 3317 | ||
3317 | /* | 3318 | /* |
3318 | * Once the number of bh's in the machine exceeds this level, we start | 3319 | * Once the number of bh's in the machine exceeds this level, we start |
3319 | * stripping them in writeback. | 3320 | * stripping them in writeback. |
3320 | */ | 3321 | */ |
3321 | static unsigned long max_buffer_heads; | 3322 | static unsigned long max_buffer_heads; |
3322 | 3323 | ||
3323 | int buffer_heads_over_limit; | 3324 | int buffer_heads_over_limit; |
3324 | 3325 | ||
3325 | struct bh_accounting { | 3326 | struct bh_accounting { |
3326 | int nr; /* Number of live bh's */ | 3327 | int nr; /* Number of live bh's */ |
3327 | int ratelimit; /* Limit cacheline bouncing */ | 3328 | int ratelimit; /* Limit cacheline bouncing */ |
3328 | }; | 3329 | }; |
3329 | 3330 | ||
3330 | static DEFINE_PER_CPU(struct bh_accounting, bh_accounting) = {0, 0}; | 3331 | static DEFINE_PER_CPU(struct bh_accounting, bh_accounting) = {0, 0}; |
3331 | 3332 | ||
3332 | static void recalc_bh_state(void) | 3333 | static void recalc_bh_state(void) |
3333 | { | 3334 | { |
3334 | int i; | 3335 | int i; |
3335 | int tot = 0; | 3336 | int tot = 0; |
3336 | 3337 | ||
3337 | if (__this_cpu_inc_return(bh_accounting.ratelimit) - 1 < 4096) | 3338 | if (__this_cpu_inc_return(bh_accounting.ratelimit) - 1 < 4096) |
3338 | return; | 3339 | return; |
3339 | __this_cpu_write(bh_accounting.ratelimit, 0); | 3340 | __this_cpu_write(bh_accounting.ratelimit, 0); |
3340 | for_each_online_cpu(i) | 3341 | for_each_online_cpu(i) |
3341 | tot += per_cpu(bh_accounting, i).nr; | 3342 | tot += per_cpu(bh_accounting, i).nr; |
3342 | buffer_heads_over_limit = (tot > max_buffer_heads); | 3343 | buffer_heads_over_limit = (tot > max_buffer_heads); |
3343 | } | 3344 | } |
3344 | 3345 | ||
3345 | struct buffer_head *alloc_buffer_head(gfp_t gfp_flags) | 3346 | struct buffer_head *alloc_buffer_head(gfp_t gfp_flags) |
3346 | { | 3347 | { |
3347 | struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags); | 3348 | struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags); |
3348 | if (ret) { | 3349 | if (ret) { |
3349 | INIT_LIST_HEAD(&ret->b_assoc_buffers); | 3350 | INIT_LIST_HEAD(&ret->b_assoc_buffers); |
3350 | preempt_disable(); | 3351 | preempt_disable(); |
3351 | __this_cpu_inc(bh_accounting.nr); | 3352 | __this_cpu_inc(bh_accounting.nr); |
3352 | recalc_bh_state(); | 3353 | recalc_bh_state(); |
3353 | preempt_enable(); | 3354 | preempt_enable(); |
3354 | } | 3355 | } |
3355 | return ret; | 3356 | return ret; |
3356 | } | 3357 | } |
3357 | EXPORT_SYMBOL(alloc_buffer_head); | 3358 | EXPORT_SYMBOL(alloc_buffer_head); |
3358 | 3359 | ||
3359 | void free_buffer_head(struct buffer_head *bh) | 3360 | void free_buffer_head(struct buffer_head *bh) |
3360 | { | 3361 | { |
3361 | BUG_ON(!list_empty(&bh->b_assoc_buffers)); | 3362 | BUG_ON(!list_empty(&bh->b_assoc_buffers)); |
3362 | kmem_cache_free(bh_cachep, bh); | 3363 | kmem_cache_free(bh_cachep, bh); |
3363 | preempt_disable(); | 3364 | preempt_disable(); |
3364 | __this_cpu_dec(bh_accounting.nr); | 3365 | __this_cpu_dec(bh_accounting.nr); |
3365 | recalc_bh_state(); | 3366 | recalc_bh_state(); |
3366 | preempt_enable(); | 3367 | preempt_enable(); |
3367 | } | 3368 | } |
3368 | EXPORT_SYMBOL(free_buffer_head); | 3369 | EXPORT_SYMBOL(free_buffer_head); |
3369 | 3370 | ||
3370 | static void buffer_exit_cpu(int cpu) | 3371 | static void buffer_exit_cpu(int cpu) |
3371 | { | 3372 | { |
3372 | int i; | 3373 | int i; |
3373 | struct bh_lru *b = &per_cpu(bh_lrus, cpu); | 3374 | struct bh_lru *b = &per_cpu(bh_lrus, cpu); |
3374 | 3375 | ||
3375 | for (i = 0; i < BH_LRU_SIZE; i++) { | 3376 | for (i = 0; i < BH_LRU_SIZE; i++) { |
3376 | brelse(b->bhs[i]); | 3377 | brelse(b->bhs[i]); |
3377 | b->bhs[i] = NULL; | 3378 | b->bhs[i] = NULL; |
3378 | } | 3379 | } |
3379 | this_cpu_add(bh_accounting.nr, per_cpu(bh_accounting, cpu).nr); | 3380 | this_cpu_add(bh_accounting.nr, per_cpu(bh_accounting, cpu).nr); |
3380 | per_cpu(bh_accounting, cpu).nr = 0; | 3381 | per_cpu(bh_accounting, cpu).nr = 0; |
3381 | } | 3382 | } |
3382 | 3383 | ||
3383 | static int buffer_cpu_notify(struct notifier_block *self, | 3384 | static int buffer_cpu_notify(struct notifier_block *self, |
3384 | unsigned long action, void *hcpu) | 3385 | unsigned long action, void *hcpu) |
3385 | { | 3386 | { |
3386 | if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) | 3387 | if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) |
3387 | buffer_exit_cpu((unsigned long)hcpu); | 3388 | buffer_exit_cpu((unsigned long)hcpu); |
3388 | return NOTIFY_OK; | 3389 | return NOTIFY_OK; |
3389 | } | 3390 | } |
3390 | 3391 | ||
3391 | /** | 3392 | /** |
3392 | * bh_uptodate_or_lock - Test whether the buffer is uptodate | 3393 | * bh_uptodate_or_lock - Test whether the buffer is uptodate |
3393 | * @bh: struct buffer_head | 3394 | * @bh: struct buffer_head |
3394 | * | 3395 | * |
3395 | * Return true if the buffer is up-to-date and false, | 3396 | * Return true if the buffer is up-to-date and false, |
3396 | * with the buffer locked, if not. | 3397 | * with the buffer locked, if not. |
3397 | */ | 3398 | */ |
3398 | int bh_uptodate_or_lock(struct buffer_head *bh) | 3399 | int bh_uptodate_or_lock(struct buffer_head *bh) |
3399 | { | 3400 | { |
3400 | if (!buffer_uptodate(bh)) { | 3401 | if (!buffer_uptodate(bh)) { |
3401 | lock_buffer(bh); | 3402 | lock_buffer(bh); |
3402 | if (!buffer_uptodate(bh)) | 3403 | if (!buffer_uptodate(bh)) |
3403 | return 0; | 3404 | return 0; |
3404 | unlock_buffer(bh); | 3405 | unlock_buffer(bh); |
3405 | } | 3406 | } |
3406 | return 1; | 3407 | return 1; |
3407 | } | 3408 | } |
3408 | EXPORT_SYMBOL(bh_uptodate_or_lock); | 3409 | EXPORT_SYMBOL(bh_uptodate_or_lock); |
3409 | 3410 | ||
3410 | /** | 3411 | /** |
3411 | * bh_submit_read - Submit a locked buffer for reading | 3412 | * bh_submit_read - Submit a locked buffer for reading |
3412 | * @bh: struct buffer_head | 3413 | * @bh: struct buffer_head |
3413 | * | 3414 | * |
3414 | * Returns zero on success and -EIO on error. | 3415 | * Returns zero on success and -EIO on error. |
3415 | */ | 3416 | */ |
3416 | int bh_submit_read(struct buffer_head *bh) | 3417 | int bh_submit_read(struct buffer_head *bh) |
3417 | { | 3418 | { |
3418 | BUG_ON(!buffer_locked(bh)); | 3419 | BUG_ON(!buffer_locked(bh)); |
3419 | 3420 | ||
3420 | if (buffer_uptodate(bh)) { | 3421 | if (buffer_uptodate(bh)) { |
3421 | unlock_buffer(bh); | 3422 | unlock_buffer(bh); |
3422 | return 0; | 3423 | return 0; |
3423 | } | 3424 | } |
3424 | 3425 | ||
3425 | get_bh(bh); | 3426 | get_bh(bh); |
3426 | bh->b_end_io = end_buffer_read_sync; | 3427 | bh->b_end_io = end_buffer_read_sync; |
3427 | submit_bh(READ, bh); | 3428 | submit_bh(READ, bh); |
3428 | wait_on_buffer(bh); | 3429 | wait_on_buffer(bh); |
3429 | if (buffer_uptodate(bh)) | 3430 | if (buffer_uptodate(bh)) |
3430 | return 0; | 3431 | return 0; |
3431 | return -EIO; | 3432 | return -EIO; |
3432 | } | 3433 | } |
3433 | EXPORT_SYMBOL(bh_submit_read); | 3434 | EXPORT_SYMBOL(bh_submit_read); |
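Taken together, bh_uptodate_or_lock() and bh_submit_read() support a common synchronous metadata-read idiom. The fragment below is only an illustrative sketch (the helper name is made up, and it assumes the buffer head came from something like sb_getblk() and is unlocked on entry):

	/* Illustrative sketch only, not part of this commit. */
	static int example_read_bh(struct buffer_head *bh)
	{
		/* Fast path: buffer already uptodate, nothing to read. */
		if (bh_uptodate_or_lock(bh))
			return 0;

		/*
		 * Slow path: bh_uptodate_or_lock() returned with the buffer
		 * locked, which is exactly what bh_submit_read() expects.
		 * Returns 0 on success or -EIO if the read failed.
		 */
		return bh_submit_read(bh);
	}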
3434 | 3435 | ||
3435 | void __init buffer_init(void) | 3436 | void __init buffer_init(void) |
3436 | { | 3437 | { |
3437 | unsigned long nrpages; | 3438 | unsigned long nrpages; |
3438 | 3439 | ||
3439 | bh_cachep = kmem_cache_create("buffer_head", | 3440 | bh_cachep = kmem_cache_create("buffer_head", |
3440 | sizeof(struct buffer_head), 0, | 3441 | sizeof(struct buffer_head), 0, |
3441 | (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC| | 3442 | (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC| |
3442 | SLAB_MEM_SPREAD), | 3443 | SLAB_MEM_SPREAD), |
3443 | NULL); | 3444 | NULL); |
3444 | 3445 | ||
3445 | /* | 3446 | /* |
3446 | * Limit the bh occupancy to 10% of ZONE_NORMAL | 3447 | * Limit the bh occupancy to 10% of ZONE_NORMAL |
3447 | */ | 3448 | */ |
3448 | nrpages = (nr_free_buffer_pages() * 10) / 100; | 3449 | nrpages = (nr_free_buffer_pages() * 10) / 100; |
3449 | max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head)); | 3450 | max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head)); |
3450 | hotcpu_notifier(buffer_cpu_notify, 0); | 3451 | hotcpu_notifier(buffer_cpu_notify, 0); |
3451 | } | 3452 | } |
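To put the 10% limit above in concrete terms (illustrative numbers only): with 4 KiB pages, a struct buffer_head of roughly 100 bytes (the exact size depends on architecture and configuration) and about 8 GiB reported by nr_free_buffer_pages(), that is 2,097,152 pages, so nrpages is about 209,715 and max_buffer_heads works out to roughly 209,715 * (4096 / ~100) ≈ 8.4 million buffer heads.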
fs/ext4/mballoc.c
1 | /* | 1 | /* |
2 | * Copyright (c) 2003-2006, Cluster File Systems, Inc, info@clusterfs.com | 2 | * Copyright (c) 2003-2006, Cluster File Systems, Inc, info@clusterfs.com |
3 | * Written by Alex Tomas <alex@clusterfs.com> | 3 | * Written by Alex Tomas <alex@clusterfs.com> |
4 | * | 4 | * |
5 | * This program is free software; you can redistribute it and/or modify | 5 | * This program is free software; you can redistribute it and/or modify |
6 | * it under the terms of the GNU General Public License version 2 as | 6 | * it under the terms of the GNU General Public License version 2 as |
7 | * published by the Free Software Foundation. | 7 | * published by the Free Software Foundation. |
8 | * | 8 | * |
9 | * This program is distributed in the hope that it will be useful, | 9 | * This program is distributed in the hope that it will be useful, |
10 | * but WITHOUT ANY WARRANTY; without even the implied warranty of | 10 | * but WITHOUT ANY WARRANTY; without even the implied warranty of |
11 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | 11 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
12 | * GNU General Public License for more details. | 12 | * GNU General Public License for more details. |
13 | * | 13 | * |
14 | * You should have received a copy of the GNU General Public License | 14 | * You should have received a copy of the GNU General Public License |
15 | * along with this program; if not, write to the Free Software | 15 | * along with this program; if not, write to the Free Software |
16 | * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111- | 16 | * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111- |
17 | */ | 17 | */ |
18 | 18 | ||
19 | 19 | ||
20 | /* | 20 | /* |
21 | * mballoc.c contains the multiblocks allocation routines | 21 | * mballoc.c contains the multiblocks allocation routines |
22 | */ | 22 | */ |
23 | 23 | ||
24 | #include "ext4_jbd2.h" | 24 | #include "ext4_jbd2.h" |
25 | #include "mballoc.h" | 25 | #include "mballoc.h" |
26 | #include <linux/log2.h> | 26 | #include <linux/log2.h> |
27 | #include <linux/module.h> | 27 | #include <linux/module.h> |
28 | #include <linux/slab.h> | 28 | #include <linux/slab.h> |
29 | #include <trace/events/ext4.h> | 29 | #include <trace/events/ext4.h> |
30 | 30 | ||
31 | #ifdef CONFIG_EXT4_DEBUG | 31 | #ifdef CONFIG_EXT4_DEBUG |
32 | ushort ext4_mballoc_debug __read_mostly; | 32 | ushort ext4_mballoc_debug __read_mostly; |
33 | 33 | ||
34 | module_param_named(mballoc_debug, ext4_mballoc_debug, ushort, 0644); | 34 | module_param_named(mballoc_debug, ext4_mballoc_debug, ushort, 0644); |
35 | MODULE_PARM_DESC(mballoc_debug, "Debugging level for ext4's mballoc"); | 35 | MODULE_PARM_DESC(mballoc_debug, "Debugging level for ext4's mballoc"); |
36 | #endif | 36 | #endif |
37 | 37 | ||
38 | /* | 38 | /* |
39 | * MUSTDO: | 39 | * MUSTDO: |
40 | * - test ext4_ext_search_left() and ext4_ext_search_right() | 40 | * - test ext4_ext_search_left() and ext4_ext_search_right() |
41 | * - search for metadata in few groups | 41 | * - search for metadata in few groups |
42 | * | 42 | * |
43 | * TODO v4: | 43 | * TODO v4: |
44 | * - normalization should take into account whether file is still open | 44 | * - normalization should take into account whether file is still open |
45 | * - discard preallocations if no free space left (policy?) | 45 | * - discard preallocations if no free space left (policy?) |
46 | * - don't normalize tails | 46 | * - don't normalize tails |
47 | * - quota | 47 | * - quota |
48 | * - reservation for superuser | 48 | * - reservation for superuser |
49 | * | 49 | * |
50 | * TODO v3: | 50 | * TODO v3: |
51 | * - bitmap read-ahead (proposed by Oleg Drokin aka green) | 51 | * - bitmap read-ahead (proposed by Oleg Drokin aka green) |
52 | * - track min/max extents in each group for better group selection | 52 | * - track min/max extents in each group for better group selection |
53 | * - mb_mark_used() may allocate chunk right after splitting buddy | 53 | * - mb_mark_used() may allocate chunk right after splitting buddy |
54 | * - tree of groups sorted by number of free blocks | 54 | * - tree of groups sorted by number of free blocks |
55 | * - error handling | 55 | * - error handling |
56 | */ | 56 | */ |
57 | 57 | ||
58 | /* | 58 | /* |
59 | * The allocation request involves multiple blocks near the | 59 | * The allocation request involves multiple blocks near the |
60 | * specified goal (block) value. | 60 | * specified goal (block) value. |
61 | * | 61 | * |
62 | * During initialization phase of the allocator we decide to use the | 62 | * During initialization phase of the allocator we decide to use the |
63 | * group preallocation or inode preallocation depending on the size of | 63 | * group preallocation or inode preallocation depending on the size of |
64 | * the file. The size of the file could be the resulting file size we | 64 | * the file. The size of the file could be the resulting file size we |
65 | * would have after allocation, or the current file size, whichever | 65 | * would have after allocation, or the current file size, whichever |
66 | * is larger. If the size is less than sbi->s_mb_stream_request we | 66 | * is larger. If the size is less than sbi->s_mb_stream_request we |
67 | * select to use the group preallocation. The default value of | 67 | * select to use the group preallocation. The default value of |
68 | * s_mb_stream_request is 16 blocks. This can also be tuned via | 68 | * s_mb_stream_request is 16 blocks. This can also be tuned via |
69 | * /sys/fs/ext4/<partition>/mb_stream_req. The value is represented in | 69 | * /sys/fs/ext4/<partition>/mb_stream_req. The value is represented in |
70 | * terms of number of blocks. | 70 | * terms of number of blocks. |
71 | * | 71 | * |
72 | * The main motivation for having small file use group preallocation is to | 72 | * The main motivation for having small file use group preallocation is to |
73 | * ensure that we have small files closer together on the disk. | 73 | * ensure that we have small files closer together on the disk. |
74 | * | 74 | * |
75 | * First stage the allocator looks at the inode prealloc list, | 75 | * First stage the allocator looks at the inode prealloc list, |
76 | * ext4_inode_info->i_prealloc_list, which contains list of prealloc | 76 | * ext4_inode_info->i_prealloc_list, which contains list of prealloc |
77 | * spaces for this particular inode. The inode prealloc space is | 77 | * spaces for this particular inode. The inode prealloc space is |
78 | * represented as: | 78 | * represented as: |
79 | * | 79 | * |
80 | * pa_lstart -> the logical start block for this prealloc space | 80 | * pa_lstart -> the logical start block for this prealloc space |
81 | * pa_pstart -> the physical start block for this prealloc space | 81 | * pa_pstart -> the physical start block for this prealloc space |
82 | * pa_len -> length for this prealloc space (in clusters) | 82 | * pa_len -> length for this prealloc space (in clusters) |
83 | * pa_free -> free space available in this prealloc space (in clusters) | 83 | * pa_free -> free space available in this prealloc space (in clusters) |
84 | * | 84 | * |
85 | * The inode preallocation space is used looking at the _logical_ start | 85 | * The inode preallocation space is used looking at the _logical_ start |
86 | * block. Only if the logical file block falls within the range of a prealloc | 86 | * block. Only if the logical file block falls within the range of a prealloc |
87 | * space do we consume that particular prealloc space. This makes sure that | 87 | * space do we consume that particular prealloc space. This makes sure that |
88 | * we have contiguous physical blocks representing the file blocks | 88 | * we have contiguous physical blocks representing the file blocks |
89 | * | 89 | * |
90 | * The important thing to be noted in case of inode prealloc space is that | 90 | * The important thing to be noted in case of inode prealloc space is that |
91 | * we don't modify the values associated to inode prealloc space except | 91 | * we don't modify the values associated to inode prealloc space except |
92 | * pa_free. | 92 | * pa_free. |
93 | * | 93 | * |
94 | * If we are not able to find blocks in the inode prealloc space and if we | 94 | * If we are not able to find blocks in the inode prealloc space and if we |
95 | * have the group allocation flag set then we look at the locality group | 95 | * have the group allocation flag set then we look at the locality group |
96 | * prealloc space. These are per CPU prealloc list represented as | 96 | * prealloc space. These are per CPU prealloc list represented as |
97 | * | 97 | * |
98 | * ext4_sb_info.s_locality_groups[smp_processor_id()] | 98 | * ext4_sb_info.s_locality_groups[smp_processor_id()] |
99 | * | 99 | * |
100 | * The reason for having a per cpu locality group is to reduce the contention | 100 | * The reason for having a per cpu locality group is to reduce the contention |
101 | * between CPUs. It is possible to get scheduled at this point. | 101 | * between CPUs. It is possible to get scheduled at this point. |
102 | * | 102 | * |
103 | * The locality group prealloc space is used looking at whether we have | 103 | * The locality group prealloc space is used looking at whether we have |
104 | * enough free space (pa_free) within the prealloc space. | 104 | * enough free space (pa_free) within the prealloc space. |
105 | * | 105 | * |
106 | * If we can't allocate blocks via inode prealloc or/and locality group | 106 | * If we can't allocate blocks via inode prealloc or/and locality group |
107 | * prealloc then we look at the buddy cache. The buddy cache is represented | 107 | * prealloc then we look at the buddy cache. The buddy cache is represented |
108 | * by ext4_sb_info.s_buddy_cache (struct inode) whose file offset gets | 108 | * by ext4_sb_info.s_buddy_cache (struct inode) whose file offset gets |
109 | * mapped to the buddy and bitmap information regarding different | 109 | * mapped to the buddy and bitmap information regarding different |
110 | * groups. The buddy information is attached to buddy cache inode so that | 110 | * groups. The buddy information is attached to buddy cache inode so that |
111 | * we can access them through the page cache. The information regarding | 111 | * we can access them through the page cache. The information regarding |
112 | * each group is loaded via ext4_mb_load_buddy. The information involves the | 112 | * each group is loaded via ext4_mb_load_buddy. The information involves the |
113 | * block bitmap and buddy information. The information is stored in the | 113 | * block bitmap and buddy information. The information is stored in the |
114 | * inode as: | 114 | * inode as: |
115 | * | 115 | * |
116 | * { page } | 116 | * { page } |
117 | * [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]... | 117 | * [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]... |
118 | * | 118 | * |
119 | * | 119 | * |
120 | * one block each for bitmap and buddy information. So for each group we | 120 | * one block each for bitmap and buddy information. So for each group we |
121 | * take up 2 blocks. A page can contain blocks_per_page (PAGE_CACHE_SIZE / | 121 | * take up 2 blocks. A page can contain blocks_per_page (PAGE_CACHE_SIZE / |
122 | * blocksize) blocks. So it can have information regarding groups_per_page | 122 | * blocksize) blocks. So it can have information regarding groups_per_page |
123 | * which is blocks_per_page/2 | 123 | * which is blocks_per_page/2 |
124 | * | 124 | * |
125 | * The buddy cache inode is not stored on disk. The inode is thrown | 125 | * The buddy cache inode is not stored on disk. The inode is thrown |
126 | * away when the filesystem is unmounted. | 126 | * away when the filesystem is unmounted. |
127 | * | 127 | * |
128 | * We look for count number of blocks in the buddy cache. If we were able | 128 | * We look for count number of blocks in the buddy cache. If we were able |
129 | * to locate that many free blocks we return with additional information | 129 | * to locate that many free blocks we return with additional information |
130 | * regarding rest of the contiguous physical block available | 130 | * regarding rest of the contiguous physical block available |
131 | * | 131 | * |
132 | * Before allocating blocks via buddy cache we normalize the request | 132 | * Before allocating blocks via buddy cache we normalize the request |
133 | * blocks. This ensures we ask for more blocks than we need. The extra | 133 | * blocks. This ensures we ask for more blocks than we need. The extra |
134 | * blocks that we get after allocation is added to the respective prealloc | 134 | * blocks that we get after allocation is added to the respective prealloc |
135 | * list. In case of inode preallocation we follow a list of heuristics | 135 | * list. In case of inode preallocation we follow a list of heuristics |
136 | * based on file size. This can be found in ext4_mb_normalize_request. If | 136 | * based on file size. This can be found in ext4_mb_normalize_request. If |
137 | * we are doing a group prealloc we try to normalize the request to | 137 | * we are doing a group prealloc we try to normalize the request to |
138 | * sbi->s_mb_group_prealloc. The default value of s_mb_group_prealloc is | 138 | * sbi->s_mb_group_prealloc. The default value of s_mb_group_prealloc is |
139 | * dependent on the cluster size; for non-bigalloc file systems, it is | 139 | * dependent on the cluster size; for non-bigalloc file systems, it is |
140 | * 512 blocks. This can be tuned via | 140 | * 512 blocks. This can be tuned via |
141 | * /sys/fs/ext4/<partition>/mb_group_prealloc. The value is represented in | 141 | * /sys/fs/ext4/<partition>/mb_group_prealloc. The value is represented in |
142 | * terms of number of blocks. If we have mounted the file system with -O | 142 | * terms of number of blocks. If we have mounted the file system with -O |
143 | * stripe=<value> option the group prealloc request is normalized to the | 143 | * stripe=<value> option the group prealloc request is normalized to the |
144 | * smallest multiple of the stripe value (sbi->s_stripe) which is | 144 | * smallest multiple of the stripe value (sbi->s_stripe) which is |
145 | * greater than the default mb_group_prealloc. | 145 | * greater than the default mb_group_prealloc. |
146 | * | 146 | * |
147 | * The regular allocator (using the buddy cache) supports a few tunables. | 147 | * The regular allocator (using the buddy cache) supports a few tunables. |
148 | * | 148 | * |
149 | * /sys/fs/ext4/<partition>/mb_min_to_scan | 149 | * /sys/fs/ext4/<partition>/mb_min_to_scan |
150 | * /sys/fs/ext4/<partition>/mb_max_to_scan | 150 | * /sys/fs/ext4/<partition>/mb_max_to_scan |
151 | * /sys/fs/ext4/<partition>/mb_order2_req | 151 | * /sys/fs/ext4/<partition>/mb_order2_req |
152 | * | 152 | * |
153 | * The regular allocator uses buddy scan only if the request len is power of | 153 | * The regular allocator uses buddy scan only if the request len is power of |
154 | * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The | 154 | * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The |
155 | * value of s_mb_order2_reqs can be tuned via | 155 | * value of s_mb_order2_reqs can be tuned via |
156 | * /sys/fs/ext4/<partition>/mb_order2_req. If the request len is equal to | 156 | * /sys/fs/ext4/<partition>/mb_order2_req. If the request len is equal to |
157 | * stripe size (sbi->s_stripe), we try to search for contiguous block in | 157 | * stripe size (sbi->s_stripe), we try to search for contiguous block in |
158 | * stripe size. This should result in better allocation on RAID setups. If | 158 | * stripe size. This should result in better allocation on RAID setups. If |
159 | * not, we search in the specific group using bitmap for best extents. The | 159 | * not, we search in the specific group using bitmap for best extents. The |
160 | * tunable min_to_scan and max_to_scan control the behaviour here. | 160 | * tunable min_to_scan and max_to_scan control the behaviour here. |
161 | * min_to_scan indicates how long mballoc __must__ look for a best | 161 | * min_to_scan indicates how long mballoc __must__ look for a best |
162 | * extent and max_to_scan indicates how long the mballoc __can__ look for a | 162 | * extent and max_to_scan indicates how long the mballoc __can__ look for a |
163 | * best extent in the found extents. Searching for the blocks starts with | 163 | * best extent in the found extents. Searching for the blocks starts with |
164 | * the group specified as the goal value in allocation context via | 164 | * the group specified as the goal value in allocation context via |
165 | * ac_g_ex. Each group is first checked based on the criteria whether it | 165 | * ac_g_ex. Each group is first checked based on the criteria whether it |
166 | * can be used for allocation. ext4_mb_good_group explains how the groups are | 166 | * can be used for allocation. ext4_mb_good_group explains how the groups are |
167 | * checked. | 167 | * checked. |
168 | * | 168 | * |
169 | * Both the prealloc space are getting populated as above. So for the first | 169 | * Both the prealloc space are getting populated as above. So for the first |
170 | * request we will hit the buddy cache which will result in this prealloc | 170 | * request we will hit the buddy cache which will result in this prealloc |
171 | * space getting filled. The prealloc space is then later used for the | 171 | * space getting filled. The prealloc space is then later used for the |
172 | * subsequent request. | 172 | * subsequent request. |
173 | */ | 173 | */ |
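As a concrete reading of the description above, the choice between inode and group preallocation reduces to a size check against sbi->s_mb_stream_request. The fragment below is a simplified, illustrative restatement only; the helper name is hypothetical and the real decision in mballoc.c takes additional flags and conditions into account:

	/*
	 * Simplified, illustrative restatement of the policy described
	 * above; not the actual ext4 decision function.  "size_in_blocks"
	 * is the larger of the current file size and the size the file
	 * would have after this allocation, in blocks.
	 */
	static bool example_use_group_prealloc(struct ext4_sb_info *sbi,
					       ext4_fsblk_t size_in_blocks)
	{
		/*
		 * Small files (below mb_stream_req, 16 blocks by default)
		 * use the per-CPU locality-group preallocation so they end
		 * up close together on disk; larger files use per-inode
		 * preallocation.
		 */
		return size_in_blocks < sbi->s_mb_stream_request;
	}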
174 | 174 | ||
175 | /* | 175 | /* |
176 | * mballoc operates on the following data: | 176 | * mballoc operates on the following data: |
177 | * - on-disk bitmap | 177 | * - on-disk bitmap |
178 | * - in-core buddy (actually includes buddy and bitmap) | 178 | * - in-core buddy (actually includes buddy and bitmap) |
179 | * - preallocation descriptors (PAs) | 179 | * - preallocation descriptors (PAs) |
180 | * | 180 | * |
181 | * there are two types of preallocations: | 181 | * there are two types of preallocations: |
182 | * - inode | 182 | * - inode |
183 | * assigned to a specific inode and can be used for this inode only. | 183 | * assigned to a specific inode and can be used for this inode only. |
184 | * it describes part of inode's space preallocated to specific | 184 | * it describes part of inode's space preallocated to specific |
185 | * physical blocks. any block from that preallocated can be used | 185 | * physical blocks. any block from that preallocated can be used |
186 | * independent. the descriptor just tracks number of blocks left | 186 | * independent. the descriptor just tracks number of blocks left |
187 | * unused. so, before taking some block from descriptor, one must | 187 | * unused. so, before taking some block from descriptor, one must |
188 | * make sure corresponded logical block isn't allocated yet. this | 188 | * make sure corresponded logical block isn't allocated yet. this |
189 | * also means that freeing any block within descriptor's range | 189 | * also means that freeing any block within descriptor's range |
190 | * must discard all preallocated blocks. | 190 | * must discard all preallocated blocks. |
191 | * - locality group | 191 | * - locality group |
192 | * assigned to specific locality group which does not translate to | 192 | * assigned to specific locality group which does not translate to |
193 | * permanent set of inodes: inode can join and leave group. space | 193 | * permanent set of inodes: inode can join and leave group. space |
194 | * from this type of preallocation can be used for any inode. thus | 194 | * from this type of preallocation can be used for any inode. thus |
195 | * it's consumed from the beginning to the end. | 195 | * it's consumed from the beginning to the end. |
196 | * | 196 | * |
197 | * relation between them can be expressed as: | 197 | * relation between them can be expressed as: |
198 | * in-core buddy = on-disk bitmap + preallocation descriptors | 198 | * in-core buddy = on-disk bitmap + preallocation descriptors |
199 | * | 199 | * |
200 | * this means the blocks mballoc considers used are: | 200 | * this means the blocks mballoc considers used are: |
201 | * - allocated blocks (persistent) | 201 | * - allocated blocks (persistent) |
202 | * - preallocated blocks (non-persistent) | 202 | * - preallocated blocks (non-persistent) |
203 | * | 203 | * |
204 | * consistency in mballoc world means that at any time a block is either | 204 | * consistency in mballoc world means that at any time a block is either |
205 | * free or used in ALL structures. notice: "any time" should not be read | 205 | * free or used in ALL structures. notice: "any time" should not be read |
206 | * literally -- time is discrete and delimited by locks. | 206 | * literally -- time is discrete and delimited by locks. |
207 | * | 207 | * |
208 | * to keep it simple, we don't use block numbers, instead we count number of | 208 | * to keep it simple, we don't use block numbers, instead we count number of |
209 | * blocks: how many blocks marked used/free in on-disk bitmap, buddy and PA. | 209 | * blocks: how many blocks marked used/free in on-disk bitmap, buddy and PA. |
210 | * | 210 | * |
211 | * all operations can be expressed as: | 211 | * all operations can be expressed as: |
212 | * - init buddy: buddy = on-disk + PAs | 212 | * - init buddy: buddy = on-disk + PAs |
213 | * - new PA: buddy += N; PA = N | 213 | * - new PA: buddy += N; PA = N |
214 | * - use inode PA: on-disk += N; PA -= N | 214 | * - use inode PA: on-disk += N; PA -= N |
215 | * - discard inode PA buddy -= on-disk - PA; PA = 0 | 215 | * - discard inode PA buddy -= on-disk - PA; PA = 0 |
216 | * - use locality group PA on-disk += N; PA -= N | 216 | * - use locality group PA on-disk += N; PA -= N |
217 | * - discard locality group PA buddy -= PA; PA = 0 | 217 | * - discard locality group PA buddy -= PA; PA = 0 |
218 | * note: 'buddy -= on-disk - PA' is used to show that on-disk bitmap | 218 | * note: 'buddy -= on-disk - PA' is used to show that on-disk bitmap |
219 | * is used in real operation because we can't know actual used | 219 | * is used in real operation because we can't know actual used |
220 | * bits from PA, only from on-disk bitmap | 220 | * bits from PA, only from on-disk bitmap |
221 | * | 221 | * |
222 | * if we follow this strict logic, then all operations above should be atomic. | 222 | * if we follow this strict logic, then all operations above should be atomic. |
223 | * given some of them can block, we'd have to use something like semaphores | 223 | * given some of them can block, we'd have to use something like semaphores |
224 | * killing performance on high-end SMP hardware. let's try to relax it using | 224 | * killing performance on high-end SMP hardware. let's try to relax it using |
225 | * the following knowledge: | 225 | * the following knowledge: |
226 | * 1) if buddy is referenced, it's already initialized | 226 | * 1) if buddy is referenced, it's already initialized |
227 | * 2) while block is used in buddy and the buddy is referenced, | 227 | * 2) while block is used in buddy and the buddy is referenced, |
228 | * nobody can re-allocate that block | 228 | * nobody can re-allocate that block |
229 | * 3) we work on bitmaps and '+' actually means 'set bits'. if on-disk has | 229 | * 3) we work on bitmaps and '+' actually means 'set bits'. if on-disk has |
230 | * bit set and PA claims same block, it's OK. IOW, one can set bit in | 230 | * bit set and PA claims same block, it's OK. IOW, one can set bit in |
231 | * on-disk bitmap if buddy has same bit set or/and PA covers corresponded | 231 | * on-disk bitmap if buddy has same bit set or/and PA covers corresponded |
232 | * block | 232 | * block |
233 | * | 233 | * |
234 | * so, now we're building a concurrency table: | 234 | * so, now we're building a concurrency table: |
235 | * - init buddy vs. | 235 | * - init buddy vs. |
236 | * - new PA | 236 | * - new PA |
237 | * blocks for PA are allocated in the buddy, buddy must be referenced | 237 | * blocks for PA are allocated in the buddy, buddy must be referenced |
238 | * until PA is linked to allocation group to avoid concurrent buddy init | 238 | * until PA is linked to allocation group to avoid concurrent buddy init |
239 | * - use inode PA | 239 | * - use inode PA |
240 | * we need to make sure that either on-disk bitmap or PA has uptodate data | 240 | * we need to make sure that either on-disk bitmap or PA has uptodate data |
241 | * given (3) we care that PA-=N operation doesn't interfere with init | 241 | * given (3) we care that PA-=N operation doesn't interfere with init |
242 | * - discard inode PA | 242 | * - discard inode PA |
243 | * the simplest way would be to have buddy initialized by the discard | 243 | * the simplest way would be to have buddy initialized by the discard |
244 | * - use locality group PA | 244 | * - use locality group PA |
245 | * again PA-=N must be serialized with init | 245 | * again PA-=N must be serialized with init |
246 | * - discard locality group PA | 246 | * - discard locality group PA |
247 | * the simplest way would be to have buddy initialized by the discard | 247 | * the simplest way would be to have buddy initialized by the discard |
248 | * - new PA vs. | 248 | * - new PA vs. |
249 | * - use inode PA | 249 | * - use inode PA |
250 | * i_data_sem serializes them | 250 | * i_data_sem serializes them |
251 | * - discard inode PA | 251 | * - discard inode PA |
252 | * discard process must wait until PA isn't used by another process | 252 | * discard process must wait until PA isn't used by another process |
253 | * - use locality group PA | 253 | * - use locality group PA |
254 | * some mutex should serialize them | 254 | * some mutex should serialize them |
255 | * - discard locality group PA | 255 | * - discard locality group PA |
256 | * discard process must wait until PA isn't used by another process | 256 | * discard process must wait until PA isn't used by another process |
257 | * - use inode PA | 257 | * - use inode PA |
258 | * - use inode PA | 258 | * - use inode PA |
259 | * i_data_sem or another mutex should serialize them | 259 | * i_data_sem or another mutex should serialize them |
260 | * - discard inode PA | 260 | * - discard inode PA |
261 | * discard process must wait until PA isn't used by another process | 261 | * discard process must wait until PA isn't used by another process |
262 | * - use locality group PA | 262 | * - use locality group PA |
263 | * nothing wrong here -- they're different PAs covering different blocks | 263 | * nothing wrong here -- they're different PAs covering different blocks |
264 | * - discard locality group PA | 264 | * - discard locality group PA |
265 | * discard process must wait until PA isn't used by another process | 265 | * discard process must wait until PA isn't used by another process |
266 | * | 266 | * |
267 | * now we're ready to make few consequences: | 267 | * now we're ready to make few consequences: |
268 | * - PA is referenced and while it is no discard is possible | 268 | * - PA is referenced and while it is no discard is possible |
269 | * - PA is referenced until block isn't marked in on-disk bitmap | 269 | * - PA is referenced until block isn't marked in on-disk bitmap |
270 | * - PA changes only after on-disk bitmap | 270 | * - PA changes only after on-disk bitmap |
271 | * - discard must not compete with init. either init is done before | 271 | * - discard must not compete with init. either init is done before |
272 | * any discard or they're serialized somehow | 272 | * any discard or they're serialized somehow |
273 | * - buddy init as sum of on-disk bitmap and PAs is done atomically | 273 | * - buddy init as sum of on-disk bitmap and PAs is done atomically |
274 | * | 274 | * |
275 | * a special case when we've used PA to emptiness. no need to modify buddy | 275 | * a special case when we've used PA to emptiness. no need to modify buddy |
276 | * in this case, but we should care about concurrent init | 276 | * in this case, but we should care about concurrent init |
277 | * | 277 | * |
278 | */ | 278 | */ |
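A small worked example of the accounting identity above, counting blocks rather than specific block numbers (numbers are illustrative): suppose a group's on-disk bitmap has 100 blocks marked used and existing PAs cover another 20 reserved-but-unused blocks. Initializing the buddy marks 120 blocks used (init: buddy = on-disk + PAs). Creating a new PA of N = 8 marks 8 more blocks in the buddy only (buddy = 128, on-disk still 100). As the owner allocates 3 of those blocks they are set in the on-disk bitmap and subtracted from the PA (on-disk = 103, PA = 5, buddy unchanged at 128). Discarding the remaining inode PA frees its 5 still-unused blocks in the buddy, leaving buddy = 123 = on-disk 103 + the other PAs' 20, as the identity requires.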
279 | 279 | ||
280 | /* | 280 | /* |
281 | * Logic in few words: | 281 | * Logic in few words: |
282 | * | 282 | * |
283 | * - allocation: | 283 | * - allocation: |
284 | * load group | 284 | * load group |
285 | * find blocks | 285 | * find blocks |
286 | * mark bits in on-disk bitmap | 286 | * mark bits in on-disk bitmap |
287 | * release group | 287 | * release group |
288 | * | 288 | * |
289 | * - use preallocation: | 289 | * - use preallocation: |
290 | * find proper PA (per-inode or group) | 290 | * find proper PA (per-inode or group) |
291 | * load group | 291 | * load group |
292 | * mark bits in on-disk bitmap | 292 | * mark bits in on-disk bitmap |
293 | * release group | 293 | * release group |
294 | * release PA | 294 | * release PA |
295 | * | 295 | * |
296 | * - free: | 296 | * - free: |
297 | * load group | 297 | * load group |
298 | * mark bits in on-disk bitmap | 298 | * mark bits in on-disk bitmap |
299 | * release group | 299 | * release group |
300 | * | 300 | * |
301 | * - discard preallocations in group: | 301 | * - discard preallocations in group: |
302 | * mark PAs deleted | 302 | * mark PAs deleted |
303 | * move them onto local list | 303 | * move them onto local list |
304 | * load on-disk bitmap | 304 | * load on-disk bitmap |
305 | * load group | 305 | * load group |
306 | * remove PA from object (inode or locality group) | 306 | * remove PA from object (inode or locality group) |
307 | * mark free blocks in-core | 307 | * mark free blocks in-core |
308 | * | 308 | * |
309 | * - discard inode's preallocations: | 309 | * - discard inode's preallocations: |
310 | */ | 310 | */ |
311 | 311 | ||
312 | /* | 312 | /* |
313 | * Locking rules | 313 | * Locking rules |
314 | * | 314 | * |
315 | * Locks: | 315 | * Locks: |
316 | * - bitlock on a group (group) | 316 | * - bitlock on a group (group) |
317 | * - object (inode/locality) (object) | 317 | * - object (inode/locality) (object) |
318 | * - per-pa lock (pa) | 318 | * - per-pa lock (pa) |
319 | * | 319 | * |
320 | * Paths: | 320 | * Paths: |
321 | * - new pa | 321 | * - new pa |
322 | * object | 322 | * object |
323 | * group | 323 | * group |
324 | * | 324 | * |
325 | * - find and use pa: | 325 | * - find and use pa: |
326 | * pa | 326 | * pa |
327 | * | 327 | * |
328 | * - release consumed pa: | 328 | * - release consumed pa: |
329 | * pa | 329 | * pa |
330 | * group | 330 | * group |
331 | * object | 331 | * object |
332 | * | 332 | * |
333 | * - generate in-core bitmap: | 333 | * - generate in-core bitmap: |
334 | * group | 334 | * group |
335 | * pa | 335 | * pa |
336 | * | 336 | * |
337 | * - discard all for given object (inode, locality group): | 337 | * - discard all for given object (inode, locality group): |
338 | * object | 338 | * object |
339 | * pa | 339 | * pa |
340 | * group | 340 | * group |
341 | * | 341 | * |
342 | * - discard all for given group: | 342 | * - discard all for given group: |
343 | * group | 343 | * group |
344 | * pa | 344 | * pa |
345 | * group | 345 | * group |
346 | * object | 346 | * object |
347 | * | 347 | * |
348 | */ | 348 | */ |
349 | static struct kmem_cache *ext4_pspace_cachep; | 349 | static struct kmem_cache *ext4_pspace_cachep; |
350 | static struct kmem_cache *ext4_ac_cachep; | 350 | static struct kmem_cache *ext4_ac_cachep; |
351 | static struct kmem_cache *ext4_free_data_cachep; | 351 | static struct kmem_cache *ext4_free_data_cachep; |
352 | 352 | ||
353 | /* We create slab caches for groupinfo data structures based on the | 353 | /* We create slab caches for groupinfo data structures based on the |
354 | * superblock block size. There will be one per mounted filesystem for | 354 | * superblock block size. There will be one per mounted filesystem for |
355 | * each unique s_blocksize_bits */ | 355 | * each unique s_blocksize_bits */ |
356 | #define NR_GRPINFO_CACHES 8 | 356 | #define NR_GRPINFO_CACHES 8 |
357 | static struct kmem_cache *ext4_groupinfo_caches[NR_GRPINFO_CACHES]; | 357 | static struct kmem_cache *ext4_groupinfo_caches[NR_GRPINFO_CACHES]; |
358 | 358 | ||
359 | static const char *ext4_groupinfo_slab_names[NR_GRPINFO_CACHES] = { | 359 | static const char *ext4_groupinfo_slab_names[NR_GRPINFO_CACHES] = { |
360 | "ext4_groupinfo_1k", "ext4_groupinfo_2k", "ext4_groupinfo_4k", | 360 | "ext4_groupinfo_1k", "ext4_groupinfo_2k", "ext4_groupinfo_4k", |
361 | "ext4_groupinfo_8k", "ext4_groupinfo_16k", "ext4_groupinfo_32k", | 361 | "ext4_groupinfo_8k", "ext4_groupinfo_16k", "ext4_groupinfo_32k", |
362 | "ext4_groupinfo_64k", "ext4_groupinfo_128k" | 362 | "ext4_groupinfo_64k", "ext4_groupinfo_128k" |
363 | }; | 363 | }; |
364 | 364 | ||
365 | static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap, | 365 | static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap, |
366 | ext4_group_t group); | 366 | ext4_group_t group); |
367 | static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap, | 367 | static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap, |
368 | ext4_group_t group); | 368 | ext4_group_t group); |
369 | static void ext4_free_data_callback(struct super_block *sb, | 369 | static void ext4_free_data_callback(struct super_block *sb, |
370 | struct ext4_journal_cb_entry *jce, int rc); | 370 | struct ext4_journal_cb_entry *jce, int rc); |
371 | 371 | ||
372 | static inline void *mb_correct_addr_and_bit(int *bit, void *addr) | 372 | static inline void *mb_correct_addr_and_bit(int *bit, void *addr) |
373 | { | 373 | { |
374 | #if BITS_PER_LONG == 64 | 374 | #if BITS_PER_LONG == 64 |
375 | *bit += ((unsigned long) addr & 7UL) << 3; | 375 | *bit += ((unsigned long) addr & 7UL) << 3; |
376 | addr = (void *) ((unsigned long) addr & ~7UL); | 376 | addr = (void *) ((unsigned long) addr & ~7UL); |
377 | #elif BITS_PER_LONG == 32 | 377 | #elif BITS_PER_LONG == 32 |
378 | *bit += ((unsigned long) addr & 3UL) << 3; | 378 | *bit += ((unsigned long) addr & 3UL) << 3; |
379 | addr = (void *) ((unsigned long) addr & ~3UL); | 379 | addr = (void *) ((unsigned long) addr & ~3UL); |
380 | #else | 380 | #else |
381 | #error "how many bits you are?!" | 381 | #error "how many bits you are?!" |
382 | #endif | 382 | #endif |
383 | return addr; | 383 | return addr; |
384 | } | 384 | } |
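For illustration of the fix-up above on a 64-bit machine (BITS_PER_LONG == 64): if addr points 5 bytes past an 8-byte boundary and bit is 3, the helper turns bit into 3 + (5 << 3) = 43 and rounds addr down by those 5 bytes; bit 43 of the aligned array is byte 5, bit 3, i.e. exactly the bit originally requested, so the long-aligned ext4_*_bit operation still touches the same storage.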
385 | 385 | ||
386 | static inline int mb_test_bit(int bit, void *addr) | 386 | static inline int mb_test_bit(int bit, void *addr) |
387 | { | 387 | { |
388 | /* | 388 | /* |
389 | * ext4_test_bit on architecture like powerpc | 389 | * ext4_test_bit on architecture like powerpc |
390 | * needs unsigned long aligned address | 390 | * needs unsigned long aligned address |
391 | */ | 391 | */ |
392 | addr = mb_correct_addr_and_bit(&bit, addr); | 392 | addr = mb_correct_addr_and_bit(&bit, addr); |
393 | return ext4_test_bit(bit, addr); | 393 | return ext4_test_bit(bit, addr); |
394 | } | 394 | } |
395 | 395 | ||
396 | static inline void mb_set_bit(int bit, void *addr) | 396 | static inline void mb_set_bit(int bit, void *addr) |
397 | { | 397 | { |
398 | addr = mb_correct_addr_and_bit(&bit, addr); | 398 | addr = mb_correct_addr_and_bit(&bit, addr); |
399 | ext4_set_bit(bit, addr); | 399 | ext4_set_bit(bit, addr); |
400 | } | 400 | } |
401 | 401 | ||
402 | static inline void mb_clear_bit(int bit, void *addr) | 402 | static inline void mb_clear_bit(int bit, void *addr) |
403 | { | 403 | { |
404 | addr = mb_correct_addr_and_bit(&bit, addr); | 404 | addr = mb_correct_addr_and_bit(&bit, addr); |
405 | ext4_clear_bit(bit, addr); | 405 | ext4_clear_bit(bit, addr); |
406 | } | 406 | } |
407 | 407 | ||
408 | static inline int mb_test_and_clear_bit(int bit, void *addr) | 408 | static inline int mb_test_and_clear_bit(int bit, void *addr) |
409 | { | 409 | { |
410 | addr = mb_correct_addr_and_bit(&bit, addr); | 410 | addr = mb_correct_addr_and_bit(&bit, addr); |
411 | return ext4_test_and_clear_bit(bit, addr); | 411 | return ext4_test_and_clear_bit(bit, addr); |
412 | } | 412 | } |
413 | 413 | ||
414 | static inline int mb_find_next_zero_bit(void *addr, int max, int start) | 414 | static inline int mb_find_next_zero_bit(void *addr, int max, int start) |
415 | { | 415 | { |
416 | int fix = 0, ret, tmpmax; | 416 | int fix = 0, ret, tmpmax; |
417 | addr = mb_correct_addr_and_bit(&fix, addr); | 417 | addr = mb_correct_addr_and_bit(&fix, addr); |
418 | tmpmax = max + fix; | 418 | tmpmax = max + fix; |
419 | start += fix; | 419 | start += fix; |
420 | 420 | ||
421 | ret = ext4_find_next_zero_bit(addr, tmpmax, start) - fix; | 421 | ret = ext4_find_next_zero_bit(addr, tmpmax, start) - fix; |
422 | if (ret > max) | 422 | if (ret > max) |
423 | return max; | 423 | return max; |
424 | return ret; | 424 | return ret; |
425 | } | 425 | } |
426 | 426 | ||
427 | static inline int mb_find_next_bit(void *addr, int max, int start) | 427 | static inline int mb_find_next_bit(void *addr, int max, int start) |
428 | { | 428 | { |
429 | int fix = 0, ret, tmpmax; | 429 | int fix = 0, ret, tmpmax; |
430 | addr = mb_correct_addr_and_bit(&fix, addr); | 430 | addr = mb_correct_addr_and_bit(&fix, addr); |
431 | tmpmax = max + fix; | 431 | tmpmax = max + fix; |
432 | start += fix; | 432 | start += fix; |
433 | 433 | ||
434 | ret = ext4_find_next_bit(addr, tmpmax, start) - fix; | 434 | ret = ext4_find_next_bit(addr, tmpmax, start) - fix; |
435 | if (ret > max) | 435 | if (ret > max) |
436 | return max; | 436 | return max; |
437 | return ret; | 437 | return ret; |
438 | } | 438 | } |
439 | 439 | ||
440 | static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) | 440 | static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) |
441 | { | 441 | { |
442 | char *bb; | 442 | char *bb; |
443 | 443 | ||
444 | BUG_ON(e4b->bd_bitmap == e4b->bd_buddy); | 444 | BUG_ON(e4b->bd_bitmap == e4b->bd_buddy); |
445 | BUG_ON(max == NULL); | 445 | BUG_ON(max == NULL); |
446 | 446 | ||
447 | if (order > e4b->bd_blkbits + 1) { | 447 | if (order > e4b->bd_blkbits + 1) { |
448 | *max = 0; | 448 | *max = 0; |
449 | return NULL; | 449 | return NULL; |
450 | } | 450 | } |
451 | 451 | ||
452 | /* at order 0 we see each particular block */ | 452 | /* at order 0 we see each particular block */ |
453 | if (order == 0) { | 453 | if (order == 0) { |
454 | *max = 1 << (e4b->bd_blkbits + 3); | 454 | *max = 1 << (e4b->bd_blkbits + 3); |
455 | return e4b->bd_bitmap; | 455 | return e4b->bd_bitmap; |
456 | } | 456 | } |
457 | 457 | ||
458 | bb = e4b->bd_buddy + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order]; | 458 | bb = e4b->bd_buddy + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order]; |
459 | *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order]; | 459 | *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order]; |
460 | 460 | ||
461 | return bb; | 461 | return bb; |
462 | } | 462 | } |
463 | 463 | ||
464 | #ifdef DOUBLE_CHECK | 464 | #ifdef DOUBLE_CHECK |
465 | static void mb_free_blocks_double(struct inode *inode, struct ext4_buddy *e4b, | 465 | static void mb_free_blocks_double(struct inode *inode, struct ext4_buddy *e4b, |
466 | int first, int count) | 466 | int first, int count) |
467 | { | 467 | { |
468 | int i; | 468 | int i; |
469 | struct super_block *sb = e4b->bd_sb; | 469 | struct super_block *sb = e4b->bd_sb; |
470 | 470 | ||
471 | if (unlikely(e4b->bd_info->bb_bitmap == NULL)) | 471 | if (unlikely(e4b->bd_info->bb_bitmap == NULL)) |
472 | return; | 472 | return; |
473 | assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group)); | 473 | assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group)); |
474 | for (i = 0; i < count; i++) { | 474 | for (i = 0; i < count; i++) { |
475 | if (!mb_test_bit(first + i, e4b->bd_info->bb_bitmap)) { | 475 | if (!mb_test_bit(first + i, e4b->bd_info->bb_bitmap)) { |
476 | ext4_fsblk_t blocknr; | 476 | ext4_fsblk_t blocknr; |
477 | 477 | ||
478 | blocknr = ext4_group_first_block_no(sb, e4b->bd_group); | 478 | blocknr = ext4_group_first_block_no(sb, e4b->bd_group); |
479 | blocknr += EXT4_C2B(EXT4_SB(sb), first + i); | 479 | blocknr += EXT4_C2B(EXT4_SB(sb), first + i); |
480 | ext4_grp_locked_error(sb, e4b->bd_group, | 480 | ext4_grp_locked_error(sb, e4b->bd_group, |
481 | inode ? inode->i_ino : 0, | 481 | inode ? inode->i_ino : 0, |
482 | blocknr, | 482 | blocknr, |
483 | "freeing block already freed " | 483 | "freeing block already freed " |
484 | "(bit %u)", | 484 | "(bit %u)", |
485 | first + i); | 485 | first + i); |
486 | } | 486 | } |
487 | mb_clear_bit(first + i, e4b->bd_info->bb_bitmap); | 487 | mb_clear_bit(first + i, e4b->bd_info->bb_bitmap); |
488 | } | 488 | } |
489 | } | 489 | } |
490 | 490 | ||
491 | static void mb_mark_used_double(struct ext4_buddy *e4b, int first, int count) | 491 | static void mb_mark_used_double(struct ext4_buddy *e4b, int first, int count) |
492 | { | 492 | { |
493 | int i; | 493 | int i; |
494 | 494 | ||
495 | if (unlikely(e4b->bd_info->bb_bitmap == NULL)) | 495 | if (unlikely(e4b->bd_info->bb_bitmap == NULL)) |
496 | return; | 496 | return; |
497 | assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group)); | 497 | assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group)); |
498 | for (i = 0; i < count; i++) { | 498 | for (i = 0; i < count; i++) { |
499 | BUG_ON(mb_test_bit(first + i, e4b->bd_info->bb_bitmap)); | 499 | BUG_ON(mb_test_bit(first + i, e4b->bd_info->bb_bitmap)); |
500 | mb_set_bit(first + i, e4b->bd_info->bb_bitmap); | 500 | mb_set_bit(first + i, e4b->bd_info->bb_bitmap); |
501 | } | 501 | } |
502 | } | 502 | } |
503 | 503 | ||
504 | static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap) | 504 | static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap) |
505 | { | 505 | { |
506 | if (memcmp(e4b->bd_info->bb_bitmap, bitmap, e4b->bd_sb->s_blocksize)) { | 506 | if (memcmp(e4b->bd_info->bb_bitmap, bitmap, e4b->bd_sb->s_blocksize)) { |
507 | unsigned char *b1, *b2; | 507 | unsigned char *b1, *b2; |
508 | int i; | 508 | int i; |
509 | b1 = (unsigned char *) e4b->bd_info->bb_bitmap; | 509 | b1 = (unsigned char *) e4b->bd_info->bb_bitmap; |
510 | b2 = (unsigned char *) bitmap; | 510 | b2 = (unsigned char *) bitmap; |
511 | for (i = 0; i < e4b->bd_sb->s_blocksize; i++) { | 511 | for (i = 0; i < e4b->bd_sb->s_blocksize; i++) { |
512 | if (b1[i] != b2[i]) { | 512 | if (b1[i] != b2[i]) { |
513 | ext4_msg(e4b->bd_sb, KERN_ERR, | 513 | ext4_msg(e4b->bd_sb, KERN_ERR, |
514 | "corruption in group %u " | 514 | "corruption in group %u " |
515 | "at byte %u(%u): %x in copy != %x " | 515 | "at byte %u(%u): %x in copy != %x " |
516 | "on disk/prealloc", | 516 | "on disk/prealloc", |
517 | e4b->bd_group, i, i * 8, b1[i], b2[i]); | 517 | e4b->bd_group, i, i * 8, b1[i], b2[i]); |
518 | BUG(); | 518 | BUG(); |
519 | } | 519 | } |
520 | } | 520 | } |
521 | } | 521 | } |
522 | } | 522 | } |
523 | 523 | ||
524 | #else | 524 | #else |
525 | static inline void mb_free_blocks_double(struct inode *inode, | 525 | static inline void mb_free_blocks_double(struct inode *inode, |
526 | struct ext4_buddy *e4b, int first, int count) | 526 | struct ext4_buddy *e4b, int first, int count) |
527 | { | 527 | { |
528 | return; | 528 | return; |
529 | } | 529 | } |
530 | static inline void mb_mark_used_double(struct ext4_buddy *e4b, | 530 | static inline void mb_mark_used_double(struct ext4_buddy *e4b, |
531 | int first, int count) | 531 | int first, int count) |
532 | { | 532 | { |
533 | return; | 533 | return; |
534 | } | 534 | } |
535 | static inline void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap) | 535 | static inline void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap) |
536 | { | 536 | { |
537 | return; | 537 | return; |
538 | } | 538 | } |
539 | #endif | 539 | #endif |
540 | 540 | ||
541 | #ifdef AGGRESSIVE_CHECK | 541 | #ifdef AGGRESSIVE_CHECK |
542 | 542 | ||
543 | #define MB_CHECK_ASSERT(assert) \ | 543 | #define MB_CHECK_ASSERT(assert) \ |
544 | do { \ | 544 | do { \ |
545 | if (!(assert)) { \ | 545 | if (!(assert)) { \ |
546 | printk(KERN_EMERG \ | 546 | printk(KERN_EMERG \ |
547 | "Assertion failure in %s() at %s:%d: \"%s\"\n", \ | 547 | "Assertion failure in %s() at %s:%d: \"%s\"\n", \ |
548 | function, file, line, # assert); \ | 548 | function, file, line, # assert); \ |
549 | BUG(); \ | 549 | BUG(); \ |
550 | } \ | 550 | } \ |
551 | } while (0) | 551 | } while (0) |
552 | 552 | ||
553 | static int __mb_check_buddy(struct ext4_buddy *e4b, char *file, | 553 | static int __mb_check_buddy(struct ext4_buddy *e4b, char *file, |
554 | const char *function, int line) | 554 | const char *function, int line) |
555 | { | 555 | { |
556 | struct super_block *sb = e4b->bd_sb; | 556 | struct super_block *sb = e4b->bd_sb; |
557 | int order = e4b->bd_blkbits + 1; | 557 | int order = e4b->bd_blkbits + 1; |
558 | int max; | 558 | int max; |
559 | int max2; | 559 | int max2; |
560 | int i; | 560 | int i; |
561 | int j; | 561 | int j; |
562 | int k; | 562 | int k; |
563 | int count; | 563 | int count; |
564 | struct ext4_group_info *grp; | 564 | struct ext4_group_info *grp; |
565 | int fragments = 0; | 565 | int fragments = 0; |
566 | int fstart; | 566 | int fstart; |
567 | struct list_head *cur; | 567 | struct list_head *cur; |
568 | void *buddy; | 568 | void *buddy; |
569 | void *buddy2; | 569 | void *buddy2; |
570 | 570 | ||
571 | { | 571 | { |
572 | static int mb_check_counter; | 572 | static int mb_check_counter; |
573 | if (mb_check_counter++ % 100 != 0) | 573 | if (mb_check_counter++ % 100 != 0) |
574 | return 0; | 574 | return 0; |
575 | } | 575 | } |
576 | 576 | ||
577 | while (order > 1) { | 577 | while (order > 1) { |
578 | buddy = mb_find_buddy(e4b, order, &max); | 578 | buddy = mb_find_buddy(e4b, order, &max); |
579 | MB_CHECK_ASSERT(buddy); | 579 | MB_CHECK_ASSERT(buddy); |
580 | buddy2 = mb_find_buddy(e4b, order - 1, &max2); | 580 | buddy2 = mb_find_buddy(e4b, order - 1, &max2); |
581 | MB_CHECK_ASSERT(buddy2); | 581 | MB_CHECK_ASSERT(buddy2); |
582 | MB_CHECK_ASSERT(buddy != buddy2); | 582 | MB_CHECK_ASSERT(buddy != buddy2); |
583 | MB_CHECK_ASSERT(max * 2 == max2); | 583 | MB_CHECK_ASSERT(max * 2 == max2); |
584 | 584 | ||
585 | count = 0; | 585 | count = 0; |
586 | for (i = 0; i < max; i++) { | 586 | for (i = 0; i < max; i++) { |
587 | 587 | ||
588 | if (mb_test_bit(i, buddy)) { | 588 | if (mb_test_bit(i, buddy)) { |
589 | /* only single bit in buddy2 may be 1 */ | 589 | /* only single bit in buddy2 may be 1 */ |
590 | if (!mb_test_bit(i << 1, buddy2)) { | 590 | if (!mb_test_bit(i << 1, buddy2)) { |
591 | MB_CHECK_ASSERT( | 591 | MB_CHECK_ASSERT( |
592 | mb_test_bit((i<<1)+1, buddy2)); | 592 | mb_test_bit((i<<1)+1, buddy2)); |
593 | } else if (!mb_test_bit((i << 1) + 1, buddy2)) { | 593 | } else if (!mb_test_bit((i << 1) + 1, buddy2)) { |
594 | MB_CHECK_ASSERT( | 594 | MB_CHECK_ASSERT( |
595 | mb_test_bit(i << 1, buddy2)); | 595 | mb_test_bit(i << 1, buddy2)); |
596 | } | 596 | } |
597 | continue; | 597 | continue; |
598 | } | 598 | } |
599 | 599 | ||
600 | /* both bits in buddy2 must be 1 */ | 600 | /* both bits in buddy2 must be 1 */ |
601 | MB_CHECK_ASSERT(mb_test_bit(i << 1, buddy2)); | 601 | MB_CHECK_ASSERT(mb_test_bit(i << 1, buddy2)); |
602 | MB_CHECK_ASSERT(mb_test_bit((i << 1) + 1, buddy2)); | 602 | MB_CHECK_ASSERT(mb_test_bit((i << 1) + 1, buddy2)); |
603 | 603 | ||
604 | for (j = 0; j < (1 << order); j++) { | 604 | for (j = 0; j < (1 << order); j++) { |
605 | k = (i * (1 << order)) + j; | 605 | k = (i * (1 << order)) + j; |
606 | MB_CHECK_ASSERT( | 606 | MB_CHECK_ASSERT( |
607 | !mb_test_bit(k, e4b->bd_bitmap)); | 607 | !mb_test_bit(k, e4b->bd_bitmap)); |
608 | } | 608 | } |
609 | count++; | 609 | count++; |
610 | } | 610 | } |
611 | MB_CHECK_ASSERT(e4b->bd_info->bb_counters[order] == count); | 611 | MB_CHECK_ASSERT(e4b->bd_info->bb_counters[order] == count); |
612 | order--; | 612 | order--; |
613 | } | 613 | } |
614 | 614 | ||
615 | fstart = -1; | 615 | fstart = -1; |
616 | buddy = mb_find_buddy(e4b, 0, &max); | 616 | buddy = mb_find_buddy(e4b, 0, &max); |
617 | for (i = 0; i < max; i++) { | 617 | for (i = 0; i < max; i++) { |
618 | if (!mb_test_bit(i, buddy)) { | 618 | if (!mb_test_bit(i, buddy)) { |
619 | MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free); | 619 | MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free); |
620 | if (fstart == -1) { | 620 | if (fstart == -1) { |
621 | fragments++; | 621 | fragments++; |
622 | fstart = i; | 622 | fstart = i; |
623 | } | 623 | } |
624 | continue; | 624 | continue; |
625 | } | 625 | } |
626 | fstart = -1; | 626 | fstart = -1; |
627 | /* check used bits only */ | 627 | /* check used bits only */ |
628 | for (j = 0; j < e4b->bd_blkbits + 1; j++) { | 628 | for (j = 0; j < e4b->bd_blkbits + 1; j++) { |
629 | buddy2 = mb_find_buddy(e4b, j, &max2); | 629 | buddy2 = mb_find_buddy(e4b, j, &max2); |
630 | k = i >> j; | 630 | k = i >> j; |
631 | MB_CHECK_ASSERT(k < max2); | 631 | MB_CHECK_ASSERT(k < max2); |
632 | MB_CHECK_ASSERT(mb_test_bit(k, buddy2)); | 632 | MB_CHECK_ASSERT(mb_test_bit(k, buddy2)); |
633 | } | 633 | } |
634 | } | 634 | } |
635 | MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info)); | 635 | MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info)); |
636 | MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments); | 636 | MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments); |
637 | 637 | ||
638 | grp = ext4_get_group_info(sb, e4b->bd_group); | 638 | grp = ext4_get_group_info(sb, e4b->bd_group); |
639 | list_for_each(cur, &grp->bb_prealloc_list) { | 639 | list_for_each(cur, &grp->bb_prealloc_list) { |
640 | ext4_group_t groupnr; | 640 | ext4_group_t groupnr; |
641 | struct ext4_prealloc_space *pa; | 641 | struct ext4_prealloc_space *pa; |
642 | pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list); | 642 | pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list); |
643 | ext4_get_group_no_and_offset(sb, pa->pa_pstart, &groupnr, &k); | 643 | ext4_get_group_no_and_offset(sb, pa->pa_pstart, &groupnr, &k); |
644 | MB_CHECK_ASSERT(groupnr == e4b->bd_group); | 644 | MB_CHECK_ASSERT(groupnr == e4b->bd_group); |
645 | for (i = 0; i < pa->pa_len; i++) | 645 | for (i = 0; i < pa->pa_len; i++) |
646 | MB_CHECK_ASSERT(mb_test_bit(k + i, buddy)); | 646 | MB_CHECK_ASSERT(mb_test_bit(k + i, buddy)); |
647 | } | 647 | } |
648 | return 0; | 648 | return 0; |
649 | } | 649 | } |
650 | #undef MB_CHECK_ASSERT | 650 | #undef MB_CHECK_ASSERT |
651 | #define mb_check_buddy(e4b) __mb_check_buddy(e4b, \ | 651 | #define mb_check_buddy(e4b) __mb_check_buddy(e4b, \ |
652 | __FILE__, __func__, __LINE__) | 652 | __FILE__, __func__, __LINE__) |
653 | #else | 653 | #else |
654 | #define mb_check_buddy(e4b) | 654 | #define mb_check_buddy(e4b) |
655 | #endif | 655 | #endif |
656 | 656 | ||
657 | /* | 657 | /* |
658 | * Divide blocks started from @first with length @len into | 658 | * Divide blocks started from @first with length @len into |
659 | * smaller chunks with power of 2 blocks. | 659 | * smaller chunks with power of 2 blocks. |
660 | * Clear the bits in bitmap which the blocks of the chunk(s) covered, | 660 | * Clear the bits in bitmap which the blocks of the chunk(s) covered, |
661 | * then increase bb_counters[] for corresponded chunk size. | 661 | * then increase bb_counters[] for corresponded chunk size. |
662 | */ | 662 | */ |
663 | static void ext4_mb_mark_free_simple(struct super_block *sb, | 663 | static void ext4_mb_mark_free_simple(struct super_block *sb, |
664 | void *buddy, ext4_grpblk_t first, ext4_grpblk_t len, | 664 | void *buddy, ext4_grpblk_t first, ext4_grpblk_t len, |
665 | struct ext4_group_info *grp) | 665 | struct ext4_group_info *grp) |
666 | { | 666 | { |
667 | struct ext4_sb_info *sbi = EXT4_SB(sb); | 667 | struct ext4_sb_info *sbi = EXT4_SB(sb); |
668 | ext4_grpblk_t min; | 668 | ext4_grpblk_t min; |
669 | ext4_grpblk_t max; | 669 | ext4_grpblk_t max; |
670 | ext4_grpblk_t chunk; | 670 | ext4_grpblk_t chunk; |
671 | unsigned short border; | 671 | unsigned short border; |
672 | 672 | ||
673 | BUG_ON(len > EXT4_CLUSTERS_PER_GROUP(sb)); | 673 | BUG_ON(len > EXT4_CLUSTERS_PER_GROUP(sb)); |
674 | 674 | ||
675 | border = 2 << sb->s_blocksize_bits; | 675 | border = 2 << sb->s_blocksize_bits; |
676 | 676 | ||
677 | while (len > 0) { | 677 | while (len > 0) { |
678 | /* find how many blocks can be covered since this position */ | 678 | /* find how many blocks can be covered since this position */ |
679 | max = ffs(first | border) - 1; | 679 | max = ffs(first | border) - 1; |
680 | 680 | ||
681 | /* find how many blocks of power 2 we need to mark */ | 681 | /* find how many blocks of power 2 we need to mark */ |
682 | min = fls(len) - 1; | 682 | min = fls(len) - 1; |
683 | 683 | ||
684 | if (max < min) | 684 | if (max < min) |
685 | min = max; | 685 | min = max; |
686 | chunk = 1 << min; | 686 | chunk = 1 << min; |
687 | 687 | ||
688 | /* mark multiblock chunks only */ | 688 | /* mark multiblock chunks only */ |
689 | grp->bb_counters[min]++; | 689 | grp->bb_counters[min]++; |
690 | if (min > 0) | 690 | if (min > 0) |
691 | mb_clear_bit(first >> min, | 691 | mb_clear_bit(first >> min, |
692 | buddy + sbi->s_mb_offsets[min]); | 692 | buddy + sbi->s_mb_offsets[min]); |
693 | 693 | ||
694 | len -= chunk; | 694 | len -= chunk; |
695 | first += chunk; | 695 | first += chunk; |
696 | } | 696 | } |
697 | } | 697 | } |
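The helper above splits a free range into power-of-two chunks: the chunk order is bounded both by the alignment of the current start (ffs) and by the remaining length (fls). A minimal user-space sketch of the same chunking, with made-up names and example values, to show how a range decomposes:

/*
 * Minimal user-space sketch (not kernel code) of the chunking done by
 * ext4_mb_mark_free_simple(): a free range [first, first+len) is split
 * into power-of-two chunks whose order is limited both by the alignment
 * of 'first' (ffs) and by the remaining length (fls).
 */
#include <stdio.h>
#include <strings.h>	/* ffs() */

static int fls_u32(unsigned v)		/* highest set bit, 1-based */
{
	int r = 0;

	while (v) {
		r++;
		v >>= 1;
	}
	return r;
}

static void show_chunks(unsigned first, unsigned len, unsigned border)
{
	while (len > 0) {
		int max = ffs(first | border) - 1;	/* alignment limit */
		int min = fls_u32(len) - 1;		/* remaining-length limit */

		if (max < min)
			min = max;
		printf("chunk: start=%u order=%d size=%u\n",
		       first, min, 1u << min);
		first += 1u << min;
		len   -= 1u << min;
	}
}

int main(void)
{
	/* e.g. a 13-cluster free extent starting at cluster 5;
	 * border chosen as 2 << 12, matching a 4k block size example */
	show_chunks(5, 13, 2u << 12);
	return 0;
}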
698 | 698 | ||
699 | /* | 699 | /* |
700 | * Cache the order of the largest free extent we have available in this block | 700 | * Cache the order of the largest free extent we have available in this block |
701 | * group. | 701 | * group. |
702 | */ | 702 | */ |
703 | static void | 703 | static void |
704 | mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp) | 704 | mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp) |
705 | { | 705 | { |
706 | int i; | 706 | int i; |
707 | int bits; | 707 | int bits; |
708 | 708 | ||
709 | grp->bb_largest_free_order = -1; /* uninit */ | 709 | grp->bb_largest_free_order = -1; /* uninit */ |
710 | 710 | ||
711 | bits = sb->s_blocksize_bits + 1; | 711 | bits = sb->s_blocksize_bits + 1; |
712 | for (i = bits; i >= 0; i--) { | 712 | for (i = bits; i >= 0; i--) { |
713 | if (grp->bb_counters[i] > 0) { | 713 | if (grp->bb_counters[i] > 0) { |
714 | grp->bb_largest_free_order = i; | 714 | grp->bb_largest_free_order = i; |
715 | break; | 715 | break; |
716 | } | 716 | } |
717 | } | 717 | } |
718 | } | 718 | } |
719 | 719 | ||
720 | static noinline_for_stack | 720 | static noinline_for_stack |
721 | void ext4_mb_generate_buddy(struct super_block *sb, | 721 | void ext4_mb_generate_buddy(struct super_block *sb, |
722 | void *buddy, void *bitmap, ext4_group_t group) | 722 | void *buddy, void *bitmap, ext4_group_t group) |
723 | { | 723 | { |
724 | struct ext4_group_info *grp = ext4_get_group_info(sb, group); | 724 | struct ext4_group_info *grp = ext4_get_group_info(sb, group); |
725 | ext4_grpblk_t max = EXT4_CLUSTERS_PER_GROUP(sb); | 725 | ext4_grpblk_t max = EXT4_CLUSTERS_PER_GROUP(sb); |
726 | ext4_grpblk_t i = 0; | 726 | ext4_grpblk_t i = 0; |
727 | ext4_grpblk_t first; | 727 | ext4_grpblk_t first; |
728 | ext4_grpblk_t len; | 728 | ext4_grpblk_t len; |
729 | unsigned free = 0; | 729 | unsigned free = 0; |
730 | unsigned fragments = 0; | 730 | unsigned fragments = 0; |
731 | unsigned long long period = get_cycles(); | 731 | unsigned long long period = get_cycles(); |
732 | 732 | ||
733 | /* initialize buddy from bitmap which is aggregation | 733 | /* initialize buddy from bitmap which is aggregation |
734 | * of on-disk bitmap and preallocations */ | 734 | * of on-disk bitmap and preallocations */ |
735 | i = mb_find_next_zero_bit(bitmap, max, 0); | 735 | i = mb_find_next_zero_bit(bitmap, max, 0); |
736 | grp->bb_first_free = i; | 736 | grp->bb_first_free = i; |
737 | while (i < max) { | 737 | while (i < max) { |
738 | fragments++; | 738 | fragments++; |
739 | first = i; | 739 | first = i; |
740 | i = mb_find_next_bit(bitmap, max, i); | 740 | i = mb_find_next_bit(bitmap, max, i); |
741 | len = i - first; | 741 | len = i - first; |
742 | free += len; | 742 | free += len; |
743 | if (len > 1) | 743 | if (len > 1) |
744 | ext4_mb_mark_free_simple(sb, buddy, first, len, grp); | 744 | ext4_mb_mark_free_simple(sb, buddy, first, len, grp); |
745 | else | 745 | else |
746 | grp->bb_counters[0]++; | 746 | grp->bb_counters[0]++; |
747 | if (i < max) | 747 | if (i < max) |
748 | i = mb_find_next_zero_bit(bitmap, max, i); | 748 | i = mb_find_next_zero_bit(bitmap, max, i); |
749 | } | 749 | } |
750 | grp->bb_fragments = fragments; | 750 | grp->bb_fragments = fragments; |
751 | 751 | ||
752 | if (free != grp->bb_free) { | 752 | if (free != grp->bb_free) { |
753 | ext4_grp_locked_error(sb, group, 0, 0, | 753 | ext4_grp_locked_error(sb, group, 0, 0, |
754 | "block bitmap and bg descriptor " | 754 | "block bitmap and bg descriptor " |
755 | "inconsistent: %u vs %u free clusters", | 755 | "inconsistent: %u vs %u free clusters", |
756 | free, grp->bb_free); | 756 | free, grp->bb_free); |
757 | /* | 757 | /* |
758 | * If we intend to continue, we consider group descriptor | 758 | * If we intend to continue, we consider group descriptor |
759 | * corrupt and update bb_free using bitmap value | 759 | * corrupt and update bb_free using bitmap value |
760 | */ | 760 | */ |
761 | grp->bb_free = free; | 761 | grp->bb_free = free; |
762 | set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); | 762 | set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); |
763 | } | 763 | } |
764 | mb_set_largest_free_order(sb, grp); | 764 | mb_set_largest_free_order(sb, grp); |
765 | 765 | ||
766 | clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state)); | 766 | clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state)); |
767 | 767 | ||
768 | period = get_cycles() - period; | 768 | period = get_cycles() - period; |
769 | spin_lock(&EXT4_SB(sb)->s_bal_lock); | 769 | spin_lock(&EXT4_SB(sb)->s_bal_lock); |
770 | EXT4_SB(sb)->s_mb_buddies_generated++; | 770 | EXT4_SB(sb)->s_mb_buddies_generated++; |
771 | EXT4_SB(sb)->s_mb_generation_time += period; | 771 | EXT4_SB(sb)->s_mb_generation_time += period; |
772 | spin_unlock(&EXT4_SB(sb)->s_bal_lock); | 772 | spin_unlock(&EXT4_SB(sb)->s_bal_lock); |
773 | } | 773 | } |
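ext4_mb_generate_buddy() above walks the bitmap by alternating between "next zero bit" and "next set bit" to enumerate free extents, accumulating the free count and the fragment count. A minimal user-space sketch of that walk, with an illustrative helper and an example bitmap:

/*
 * User-space sketch (illustrative only) of the bitmap walk in
 * ext4_mb_generate_buddy(): enumerate free runs and count free
 * clusters and fragments.
 */
#include <stdio.h>

/* return first bit >= start whose value equals 'want', or max */
static int next_bit(const unsigned char *bm, int max, int start, int want)
{
	for (; start < max; start++)
		if (((bm[start >> 3] >> (start & 7)) & 1) == want)
			return start;
	return max;
}

int main(void)
{
	/* 16 clusters; a set bit means the cluster is in use */
	unsigned char bitmap[2] = { 0x0f, 0xf0 };	/* used: 0-3 and 12-15 */
	int max = 16, free = 0, fragments = 0;
	int i, first;

	i = next_bit(bitmap, max, 0, 0);
	while (i < max) {
		fragments++;
		first = i;
		i = next_bit(bitmap, max, i, 1);	/* end of this free run */
		free += i - first;
		if (i < max)
			i = next_bit(bitmap, max, i, 0);
	}
	printf("free=%d fragments=%d\n", free, fragments);
	return 0;
}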
774 | 774 | ||
775 | static void mb_regenerate_buddy(struct ext4_buddy *e4b) | 775 | static void mb_regenerate_buddy(struct ext4_buddy *e4b) |
776 | { | 776 | { |
777 | int count; | 777 | int count; |
778 | int order = 1; | 778 | int order = 1; |
779 | void *buddy; | 779 | void *buddy; |
780 | 780 | ||
781 | while ((buddy = mb_find_buddy(e4b, order++, &count))) { | 781 | while ((buddy = mb_find_buddy(e4b, order++, &count))) { |
782 | ext4_set_bits(buddy, 0, count); | 782 | ext4_set_bits(buddy, 0, count); |
783 | } | 783 | } |
784 | e4b->bd_info->bb_fragments = 0; | 784 | e4b->bd_info->bb_fragments = 0; |
785 | memset(e4b->bd_info->bb_counters, 0, | 785 | memset(e4b->bd_info->bb_counters, 0, |
786 | sizeof(*e4b->bd_info->bb_counters) * | 786 | sizeof(*e4b->bd_info->bb_counters) * |
787 | (e4b->bd_sb->s_blocksize_bits + 2)); | 787 | (e4b->bd_sb->s_blocksize_bits + 2)); |
788 | 788 | ||
789 | ext4_mb_generate_buddy(e4b->bd_sb, e4b->bd_buddy, | 789 | ext4_mb_generate_buddy(e4b->bd_sb, e4b->bd_buddy, |
790 | e4b->bd_bitmap, e4b->bd_group); | 790 | e4b->bd_bitmap, e4b->bd_group); |
791 | } | 791 | } |
792 | 792 | ||
793 | /* The buddy information is attached to the buddy cache inode | 793 | /* The buddy information is attached to the buddy cache inode |
794 | * for convenience. The information regarding each group | 794 | * for convenience. The information regarding each group |
795 | * is loaded via ext4_mb_load_buddy. The information involves | 795 | * is loaded via ext4_mb_load_buddy. The information involves |
796 | * the block bitmap and buddy information. The information is | 796 | * the block bitmap and buddy information. The information is |
797 | * stored in the inode as | 797 | * stored in the inode as |
798 | * | 798 | * |
799 | * { page } | 799 | * { page } |
800 | * [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]... | 800 | * [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]... |
801 | * | 801 | * |
802 | * | 802 | * |
803 | * one block each for bitmap and buddy information. | 803 | * one block each for bitmap and buddy information. |
804 | * So for each group we take up 2 blocks. A page can | 804 | * So for each group we take up 2 blocks. A page can |
805 | * contain blocks_per_page (PAGE_CACHE_SIZE / blocksize) blocks. | 805 | * contain blocks_per_page (PAGE_CACHE_SIZE / blocksize) blocks. |
806 | * So it can have information regarding groups_per_page which | 806 | * So it can have information regarding groups_per_page which |
807 | * is blocks_per_page/2 | 807 | * is blocks_per_page/2 |
808 | * | 808 | * |
809 | * Locking note: This routine takes the block group lock of all groups | 809 | * Locking note: This routine takes the block group lock of all groups |
810 | * for this page; do not hold this lock when calling this routine! | 810 | * for this page; do not hold this lock when calling this routine! |
811 | */ | 811 | */ |
812 | 812 | ||
813 | static int ext4_mb_init_cache(struct page *page, char *incore) | 813 | static int ext4_mb_init_cache(struct page *page, char *incore) |
814 | { | 814 | { |
815 | ext4_group_t ngroups; | 815 | ext4_group_t ngroups; |
816 | int blocksize; | 816 | int blocksize; |
817 | int blocks_per_page; | 817 | int blocks_per_page; |
818 | int groups_per_page; | 818 | int groups_per_page; |
819 | int err = 0; | 819 | int err = 0; |
820 | int i; | 820 | int i; |
821 | ext4_group_t first_group, group; | 821 | ext4_group_t first_group, group; |
822 | int first_block; | 822 | int first_block; |
823 | struct super_block *sb; | 823 | struct super_block *sb; |
824 | struct buffer_head *bhs; | 824 | struct buffer_head *bhs; |
825 | struct buffer_head **bh = NULL; | 825 | struct buffer_head **bh = NULL; |
826 | struct inode *inode; | 826 | struct inode *inode; |
827 | char *data; | 827 | char *data; |
828 | char *bitmap; | 828 | char *bitmap; |
829 | struct ext4_group_info *grinfo; | 829 | struct ext4_group_info *grinfo; |
830 | 830 | ||
831 | mb_debug(1, "init page %lu\n", page->index); | 831 | mb_debug(1, "init page %lu\n", page->index); |
832 | 832 | ||
833 | inode = page->mapping->host; | 833 | inode = page->mapping->host; |
834 | sb = inode->i_sb; | 834 | sb = inode->i_sb; |
835 | ngroups = ext4_get_groups_count(sb); | 835 | ngroups = ext4_get_groups_count(sb); |
836 | blocksize = 1 << inode->i_blkbits; | 836 | blocksize = 1 << inode->i_blkbits; |
837 | blocks_per_page = PAGE_CACHE_SIZE / blocksize; | 837 | blocks_per_page = PAGE_CACHE_SIZE / blocksize; |
838 | 838 | ||
839 | groups_per_page = blocks_per_page >> 1; | 839 | groups_per_page = blocks_per_page >> 1; |
840 | if (groups_per_page == 0) | 840 | if (groups_per_page == 0) |
841 | groups_per_page = 1; | 841 | groups_per_page = 1; |
842 | 842 | ||
843 | /* allocate buffer_heads to read bitmaps */ | 843 | /* allocate buffer_heads to read bitmaps */ |
844 | if (groups_per_page > 1) { | 844 | if (groups_per_page > 1) { |
845 | i = sizeof(struct buffer_head *) * groups_per_page; | 845 | i = sizeof(struct buffer_head *) * groups_per_page; |
846 | bh = kzalloc(i, GFP_NOFS); | 846 | bh = kzalloc(i, GFP_NOFS); |
847 | if (bh == NULL) { | 847 | if (bh == NULL) { |
848 | err = -ENOMEM; | 848 | err = -ENOMEM; |
849 | goto out; | 849 | goto out; |
850 | } | 850 | } |
851 | } else | 851 | } else |
852 | bh = &bhs; | 852 | bh = &bhs; |
853 | 853 | ||
854 | first_group = page->index * blocks_per_page / 2; | 854 | first_group = page->index * blocks_per_page / 2; |
855 | 855 | ||
856 | /* read all groups the page covers into the cache */ | 856 | /* read all groups the page covers into the cache */ |
857 | for (i = 0, group = first_group; i < groups_per_page; i++, group++) { | 857 | for (i = 0, group = first_group; i < groups_per_page; i++, group++) { |
858 | if (group >= ngroups) | 858 | if (group >= ngroups) |
859 | break; | 859 | break; |
860 | 860 | ||
861 | grinfo = ext4_get_group_info(sb, group); | 861 | grinfo = ext4_get_group_info(sb, group); |
862 | /* | 862 | /* |
863 | * If page is uptodate then we came here after online resize | 863 | * If page is uptodate then we came here after online resize |
864 | * which added some new uninitialized group info structs, so | 864 | * which added some new uninitialized group info structs, so |
865 | * we must skip all initialized uptodate buddies on the page, | 865 | * we must skip all initialized uptodate buddies on the page, |
866 | * which may be currently in use by an allocating task. | 866 | * which may be currently in use by an allocating task. |
867 | */ | 867 | */ |
868 | if (PageUptodate(page) && !EXT4_MB_GRP_NEED_INIT(grinfo)) { | 868 | if (PageUptodate(page) && !EXT4_MB_GRP_NEED_INIT(grinfo)) { |
869 | bh[i] = NULL; | 869 | bh[i] = NULL; |
870 | continue; | 870 | continue; |
871 | } | 871 | } |
872 | if (!(bh[i] = ext4_read_block_bitmap_nowait(sb, group))) { | 872 | if (!(bh[i] = ext4_read_block_bitmap_nowait(sb, group))) { |
873 | err = -ENOMEM; | 873 | err = -ENOMEM; |
874 | goto out; | 874 | goto out; |
875 | } | 875 | } |
876 | mb_debug(1, "read bitmap for group %u\n", group); | 876 | mb_debug(1, "read bitmap for group %u\n", group); |
877 | } | 877 | } |
878 | 878 | ||
879 | /* wait for I/O completion */ | 879 | /* wait for I/O completion */ |
880 | for (i = 0, group = first_group; i < groups_per_page; i++, group++) { | 880 | for (i = 0, group = first_group; i < groups_per_page; i++, group++) { |
881 | if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i])) { | 881 | if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i])) { |
882 | err = -EIO; | 882 | err = -EIO; |
883 | goto out; | 883 | goto out; |
884 | } | 884 | } |
885 | } | 885 | } |
886 | 886 | ||
887 | first_block = page->index * blocks_per_page; | 887 | first_block = page->index * blocks_per_page; |
888 | for (i = 0; i < blocks_per_page; i++) { | 888 | for (i = 0; i < blocks_per_page; i++) { |
889 | group = (first_block + i) >> 1; | 889 | group = (first_block + i) >> 1; |
890 | if (group >= ngroups) | 890 | if (group >= ngroups) |
891 | break; | 891 | break; |
892 | 892 | ||
893 | if (!bh[group - first_group]) | 893 | if (!bh[group - first_group]) |
894 | /* skip initialized uptodate buddy */ | 894 | /* skip initialized uptodate buddy */ |
895 | continue; | 895 | continue; |
896 | 896 | ||
897 | /* | 897 | /* |
898 | * data carry information regarding this | 898 | * data carry information regarding this |
899 | * particular group in the format specified | 899 | * particular group in the format specified |
900 | * above | 900 | * above |
901 | * | 901 | * |
902 | */ | 902 | */ |
903 | data = page_address(page) + (i * blocksize); | 903 | data = page_address(page) + (i * blocksize); |
904 | bitmap = bh[group - first_group]->b_data; | 904 | bitmap = bh[group - first_group]->b_data; |
905 | 905 | ||
906 | /* | 906 | /* |
907 | * We place the buddy block and bitmap block | 907 | * We place the buddy block and bitmap block |
908 | * close together | 908 | * close together |
909 | */ | 909 | */ |
910 | if ((first_block + i) & 1) { | 910 | if ((first_block + i) & 1) { |
911 | /* this is block of buddy */ | 911 | /* this is block of buddy */ |
912 | BUG_ON(incore == NULL); | 912 | BUG_ON(incore == NULL); |
913 | mb_debug(1, "put buddy for group %u in page %lu/%x\n", | 913 | mb_debug(1, "put buddy for group %u in page %lu/%x\n", |
914 | group, page->index, i * blocksize); | 914 | group, page->index, i * blocksize); |
915 | trace_ext4_mb_buddy_bitmap_load(sb, group); | 915 | trace_ext4_mb_buddy_bitmap_load(sb, group); |
916 | grinfo = ext4_get_group_info(sb, group); | 916 | grinfo = ext4_get_group_info(sb, group); |
917 | grinfo->bb_fragments = 0; | 917 | grinfo->bb_fragments = 0; |
918 | memset(grinfo->bb_counters, 0, | 918 | memset(grinfo->bb_counters, 0, |
919 | sizeof(*grinfo->bb_counters) * | 919 | sizeof(*grinfo->bb_counters) * |
920 | (sb->s_blocksize_bits+2)); | 920 | (sb->s_blocksize_bits+2)); |
921 | /* | 921 | /* |
922 | * incore got set to the group block bitmap below | 922 | * incore got set to the group block bitmap below |
923 | */ | 923 | */ |
924 | ext4_lock_group(sb, group); | 924 | ext4_lock_group(sb, group); |
925 | /* init the buddy */ | 925 | /* init the buddy */ |
926 | memset(data, 0xff, blocksize); | 926 | memset(data, 0xff, blocksize); |
927 | ext4_mb_generate_buddy(sb, data, incore, group); | 927 | ext4_mb_generate_buddy(sb, data, incore, group); |
928 | ext4_unlock_group(sb, group); | 928 | ext4_unlock_group(sb, group); |
929 | incore = NULL; | 929 | incore = NULL; |
930 | } else { | 930 | } else { |
931 | /* this is block of bitmap */ | 931 | /* this is block of bitmap */ |
932 | BUG_ON(incore != NULL); | 932 | BUG_ON(incore != NULL); |
933 | mb_debug(1, "put bitmap for group %u in page %lu/%x\n", | 933 | mb_debug(1, "put bitmap for group %u in page %lu/%x\n", |
934 | group, page->index, i * blocksize); | 934 | group, page->index, i * blocksize); |
935 | trace_ext4_mb_bitmap_load(sb, group); | 935 | trace_ext4_mb_bitmap_load(sb, group); |
936 | 936 | ||
937 | /* see comments in ext4_mb_put_pa() */ | 937 | /* see comments in ext4_mb_put_pa() */ |
938 | ext4_lock_group(sb, group); | 938 | ext4_lock_group(sb, group); |
939 | memcpy(data, bitmap, blocksize); | 939 | memcpy(data, bitmap, blocksize); |
940 | 940 | ||
941 | /* mark all preallocated blks used in in-core bitmap */ | 941 | /* mark all preallocated blks used in in-core bitmap */ |
942 | ext4_mb_generate_from_pa(sb, data, group); | 942 | ext4_mb_generate_from_pa(sb, data, group); |
943 | ext4_mb_generate_from_freelist(sb, data, group); | 943 | ext4_mb_generate_from_freelist(sb, data, group); |
944 | ext4_unlock_group(sb, group); | 944 | ext4_unlock_group(sb, group); |
945 | 945 | ||
946 | /* set incore so that the buddy information can be | 946 | /* set incore so that the buddy information can be |
947 | * generated using this | 947 | * generated using this |
948 | */ | 948 | */ |
949 | incore = data; | 949 | incore = data; |
950 | } | 950 | } |
951 | } | 951 | } |
952 | SetPageUptodate(page); | 952 | SetPageUptodate(page); |
953 | 953 | ||
954 | out: | 954 | out: |
955 | if (bh) { | 955 | if (bh) { |
956 | for (i = 0; i < groups_per_page; i++) | 956 | for (i = 0; i < groups_per_page; i++) |
957 | brelse(bh[i]); | 957 | brelse(bh[i]); |
958 | if (bh != &bhs) | 958 | if (bh != &bhs) |
959 | kfree(bh); | 959 | kfree(bh); |
960 | } | 960 | } |
961 | return err; | 961 | return err; |
962 | } | 962 | } |
963 | 963 | ||
964 | /* | 964 | /* |
965 | * Lock the buddy and bitmap pages. This makes sure other parallel init_group | 965 | * Lock the buddy and bitmap pages. This makes sure other parallel init_group |
966 | * on the same buddy page doesn't happen while holding the buddy page lock. | 966 | * on the same buddy page doesn't happen while holding the buddy page lock. |
967 | * Return locked buddy and bitmap pages on e4b struct. If buddy and bitmap | 967 | * Return locked buddy and bitmap pages on e4b struct. If buddy and bitmap |
968 | * are on the same page e4b->bd_buddy_page is NULL and return value is 0. | 968 | * are on the same page e4b->bd_buddy_page is NULL and return value is 0. |
969 | */ | 969 | */ |
970 | static int ext4_mb_get_buddy_page_lock(struct super_block *sb, | 970 | static int ext4_mb_get_buddy_page_lock(struct super_block *sb, |
971 | ext4_group_t group, struct ext4_buddy *e4b) | 971 | ext4_group_t group, struct ext4_buddy *e4b) |
972 | { | 972 | { |
973 | struct inode *inode = EXT4_SB(sb)->s_buddy_cache; | 973 | struct inode *inode = EXT4_SB(sb)->s_buddy_cache; |
974 | int block, pnum, poff; | 974 | int block, pnum, poff; |
975 | int blocks_per_page; | 975 | int blocks_per_page; |
976 | struct page *page; | 976 | struct page *page; |
977 | 977 | ||
978 | e4b->bd_buddy_page = NULL; | 978 | e4b->bd_buddy_page = NULL; |
979 | e4b->bd_bitmap_page = NULL; | 979 | e4b->bd_bitmap_page = NULL; |
980 | 980 | ||
981 | blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; | 981 | blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; |
982 | /* | 982 | /* |
983 | * the buddy cache inode stores the block bitmap | 983 | * the buddy cache inode stores the block bitmap |
984 | * and buddy information in consecutive blocks. | 984 | * and buddy information in consecutive blocks. |
985 | * So for each group we need two blocks. | 985 | * So for each group we need two blocks. |
986 | */ | 986 | */ |
987 | block = group * 2; | 987 | block = group * 2; |
988 | pnum = block / blocks_per_page; | 988 | pnum = block / blocks_per_page; |
989 | poff = block % blocks_per_page; | 989 | poff = block % blocks_per_page; |
990 | page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); | 990 | page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); |
991 | if (!page) | 991 | if (!page) |
992 | return -EIO; | 992 | return -EIO; |
993 | BUG_ON(page->mapping != inode->i_mapping); | 993 | BUG_ON(page->mapping != inode->i_mapping); |
994 | e4b->bd_bitmap_page = page; | 994 | e4b->bd_bitmap_page = page; |
995 | e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize); | 995 | e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize); |
996 | 996 | ||
997 | if (blocks_per_page >= 2) { | 997 | if (blocks_per_page >= 2) { |
998 | /* buddy and bitmap are on the same page */ | 998 | /* buddy and bitmap are on the same page */ |
999 | return 0; | 999 | return 0; |
1000 | } | 1000 | } |
1001 | 1001 | ||
1002 | block++; | 1002 | block++; |
1003 | pnum = block / blocks_per_page; | 1003 | pnum = block / blocks_per_page; |
1004 | page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); | 1004 | page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); |
1005 | if (!page) | 1005 | if (!page) |
1006 | return -EIO; | 1006 | return -EIO; |
1007 | BUG_ON(page->mapping != inode->i_mapping); | 1007 | BUG_ON(page->mapping != inode->i_mapping); |
1008 | e4b->bd_buddy_page = page; | 1008 | e4b->bd_buddy_page = page; |
1009 | return 0; | 1009 | return 0; |
1010 | } | 1010 | } |
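ext4_mb_get_buddy_page_lock() above relies on the fixed layout of the buddy-cache inode: block 2*group holds the bitmap and block 2*group+1 the buddy. A small arithmetic sketch (example values only, not kernel code) of how a group maps to a page number and an offset within that page:

/*
 * Sketch of the group -> (page, offset) mapping used by the buddy cache.
 * page_size and blocksize here are example values.
 */
#include <stdio.h>

int main(void)
{
	unsigned page_size = 4096, blocksize = 1024;
	unsigned blocks_per_page = page_size / blocksize;
	unsigned group = 5;

	unsigned bitmap_block = group * 2;	/* block bitmap block */
	unsigned buddy_block  = group * 2 + 1;	/* buddy block        */

	printf("bitmap: page %u, block offset %u\n",
	       bitmap_block / blocks_per_page, bitmap_block % blocks_per_page);
	printf("buddy : page %u, block offset %u\n",
	       buddy_block / blocks_per_page, buddy_block % blocks_per_page);
	return 0;
}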
1011 | 1011 | ||
1012 | static void ext4_mb_put_buddy_page_lock(struct ext4_buddy *e4b) | 1012 | static void ext4_mb_put_buddy_page_lock(struct ext4_buddy *e4b) |
1013 | { | 1013 | { |
1014 | if (e4b->bd_bitmap_page) { | 1014 | if (e4b->bd_bitmap_page) { |
1015 | unlock_page(e4b->bd_bitmap_page); | 1015 | unlock_page(e4b->bd_bitmap_page); |
1016 | page_cache_release(e4b->bd_bitmap_page); | 1016 | page_cache_release(e4b->bd_bitmap_page); |
1017 | } | 1017 | } |
1018 | if (e4b->bd_buddy_page) { | 1018 | if (e4b->bd_buddy_page) { |
1019 | unlock_page(e4b->bd_buddy_page); | 1019 | unlock_page(e4b->bd_buddy_page); |
1020 | page_cache_release(e4b->bd_buddy_page); | 1020 | page_cache_release(e4b->bd_buddy_page); |
1021 | } | 1021 | } |
1022 | } | 1022 | } |
1023 | 1023 | ||
1024 | /* | 1024 | /* |
1025 | * Locking note: This routine calls ext4_mb_init_cache(), which takes the | 1025 | * Locking note: This routine calls ext4_mb_init_cache(), which takes the |
1026 | * block group lock of all groups for this page; do not hold the BG lock when | 1026 | * block group lock of all groups for this page; do not hold the BG lock when |
1027 | * calling this routine! | 1027 | * calling this routine! |
1028 | */ | 1028 | */ |
1029 | static noinline_for_stack | 1029 | static noinline_for_stack |
1030 | int ext4_mb_init_group(struct super_block *sb, ext4_group_t group) | 1030 | int ext4_mb_init_group(struct super_block *sb, ext4_group_t group) |
1031 | { | 1031 | { |
1032 | 1032 | ||
1033 | struct ext4_group_info *this_grp; | 1033 | struct ext4_group_info *this_grp; |
1034 | struct ext4_buddy e4b; | 1034 | struct ext4_buddy e4b; |
1035 | struct page *page; | 1035 | struct page *page; |
1036 | int ret = 0; | 1036 | int ret = 0; |
1037 | 1037 | ||
1038 | might_sleep(); | 1038 | might_sleep(); |
1039 | mb_debug(1, "init group %u\n", group); | 1039 | mb_debug(1, "init group %u\n", group); |
1040 | this_grp = ext4_get_group_info(sb, group); | 1040 | this_grp = ext4_get_group_info(sb, group); |
1041 | /* | 1041 | /* |
1042 | * This ensures that we don't reinit the buddy cache | 1042 | * This ensures that we don't reinit the buddy cache |
1043 | * page which map to the group from which we are already | 1043 | * page which map to the group from which we are already |
1044 | * allocating. If we are looking at the buddy cache we would | 1044 | * allocating. If we are looking at the buddy cache we would |
1045 | * have taken a reference using ext4_mb_load_buddy and that | 1045 | * have taken a reference using ext4_mb_load_buddy and that |
1046 | * would have pinned buddy page to page cache. | 1046 | * would have pinned buddy page to page cache. |
1047 | * The call to ext4_mb_get_buddy_page_lock will mark the | ||
1048 | * page accessed. | ||
1047 | */ | 1049 | */ |
1048 | ret = ext4_mb_get_buddy_page_lock(sb, group, &e4b); | 1050 | ret = ext4_mb_get_buddy_page_lock(sb, group, &e4b); |
1049 | if (ret || !EXT4_MB_GRP_NEED_INIT(this_grp)) { | 1051 | if (ret || !EXT4_MB_GRP_NEED_INIT(this_grp)) { |
1050 | /* | 1052 | /* |
1051 | * somebody initialized the group | 1053 | * somebody initialized the group |
1052 | * return without doing anything | 1054 | * return without doing anything |
1053 | */ | 1055 | */ |
1054 | goto err; | 1056 | goto err; |
1055 | } | 1057 | } |
1056 | 1058 | ||
1057 | page = e4b.bd_bitmap_page; | 1059 | page = e4b.bd_bitmap_page; |
1058 | ret = ext4_mb_init_cache(page, NULL); | 1060 | ret = ext4_mb_init_cache(page, NULL); |
1059 | if (ret) | 1061 | if (ret) |
1060 | goto err; | 1062 | goto err; |
1061 | if (!PageUptodate(page)) { | 1063 | if (!PageUptodate(page)) { |
1062 | ret = -EIO; | 1064 | ret = -EIO; |
1063 | goto err; | 1065 | goto err; |
1064 | } | 1066 | } |
1065 | mark_page_accessed(page); | ||
1066 | 1067 | ||
1067 | if (e4b.bd_buddy_page == NULL) { | 1068 | if (e4b.bd_buddy_page == NULL) { |
1068 | /* | 1069 | /* |
1069 | * If both the bitmap and buddy are in | 1070 | * If both the bitmap and buddy are in |
1070 | * the same page we don't need to force | 1071 | * the same page we don't need to force |
1071 | * init the buddy | 1072 | * init the buddy |
1072 | */ | 1073 | */ |
1073 | ret = 0; | 1074 | ret = 0; |
1074 | goto err; | 1075 | goto err; |
1075 | } | 1076 | } |
1076 | /* init buddy cache */ | 1077 | /* init buddy cache */ |
1077 | page = e4b.bd_buddy_page; | 1078 | page = e4b.bd_buddy_page; |
1078 | ret = ext4_mb_init_cache(page, e4b.bd_bitmap); | 1079 | ret = ext4_mb_init_cache(page, e4b.bd_bitmap); |
1079 | if (ret) | 1080 | if (ret) |
1080 | goto err; | 1081 | goto err; |
1081 | if (!PageUptodate(page)) { | 1082 | if (!PageUptodate(page)) { |
1082 | ret = -EIO; | 1083 | ret = -EIO; |
1083 | goto err; | 1084 | goto err; |
1084 | } | 1085 | } |
1085 | mark_page_accessed(page); | ||
1086 | err: | 1086 | err: |
1087 | ext4_mb_put_buddy_page_lock(&e4b); | 1087 | ext4_mb_put_buddy_page_lock(&e4b); |
1088 | return ret; | 1088 | return ret; |
1089 | } | 1089 | } |
1090 | 1090 | ||
1091 | /* | 1091 | /* |
1092 | * Locking note: This routine calls ext4_mb_init_cache(), which takes the | 1092 | * Locking note: This routine calls ext4_mb_init_cache(), which takes the |
1093 | * block group lock of all groups for this page; do not hold the BG lock when | 1093 | * block group lock of all groups for this page; do not hold the BG lock when |
1094 | * calling this routine! | 1094 | * calling this routine! |
1095 | */ | 1095 | */ |
1096 | static noinline_for_stack int | 1096 | static noinline_for_stack int |
1097 | ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group, | 1097 | ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group, |
1098 | struct ext4_buddy *e4b) | 1098 | struct ext4_buddy *e4b) |
1099 | { | 1099 | { |
1100 | int blocks_per_page; | 1100 | int blocks_per_page; |
1101 | int block; | 1101 | int block; |
1102 | int pnum; | 1102 | int pnum; |
1103 | int poff; | 1103 | int poff; |
1104 | struct page *page; | 1104 | struct page *page; |
1105 | int ret; | 1105 | int ret; |
1106 | struct ext4_group_info *grp; | 1106 | struct ext4_group_info *grp; |
1107 | struct ext4_sb_info *sbi = EXT4_SB(sb); | 1107 | struct ext4_sb_info *sbi = EXT4_SB(sb); |
1108 | struct inode *inode = sbi->s_buddy_cache; | 1108 | struct inode *inode = sbi->s_buddy_cache; |
1109 | 1109 | ||
1110 | might_sleep(); | 1110 | might_sleep(); |
1111 | mb_debug(1, "load group %u\n", group); | 1111 | mb_debug(1, "load group %u\n", group); |
1112 | 1112 | ||
1113 | blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; | 1113 | blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; |
1114 | grp = ext4_get_group_info(sb, group); | 1114 | grp = ext4_get_group_info(sb, group); |
1115 | 1115 | ||
1116 | e4b->bd_blkbits = sb->s_blocksize_bits; | 1116 | e4b->bd_blkbits = sb->s_blocksize_bits; |
1117 | e4b->bd_info = grp; | 1117 | e4b->bd_info = grp; |
1118 | e4b->bd_sb = sb; | 1118 | e4b->bd_sb = sb; |
1119 | e4b->bd_group = group; | 1119 | e4b->bd_group = group; |
1120 | e4b->bd_buddy_page = NULL; | 1120 | e4b->bd_buddy_page = NULL; |
1121 | e4b->bd_bitmap_page = NULL; | 1121 | e4b->bd_bitmap_page = NULL; |
1122 | 1122 | ||
1123 | if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { | 1123 | if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { |
1124 | /* | 1124 | /* |
1125 | * we need full data about the group | 1125 | * we need full data about the group |
1126 | * to make a good selection | 1126 | * to make a good selection |
1127 | */ | 1127 | */ |
1128 | ret = ext4_mb_init_group(sb, group); | 1128 | ret = ext4_mb_init_group(sb, group); |
1129 | if (ret) | 1129 | if (ret) |
1130 | return ret; | 1130 | return ret; |
1131 | } | 1131 | } |
1132 | 1132 | ||
1133 | /* | 1133 | /* |
1134 | * the buddy cache inode stores the block bitmap | 1134 | * the buddy cache inode stores the block bitmap |
1135 | * and buddy information in consecutive blocks. | 1135 | * and buddy information in consecutive blocks. |
1136 | * So for each group we need two blocks. | 1136 | * So for each group we need two blocks. |
1137 | */ | 1137 | */ |
1138 | block = group * 2; | 1138 | block = group * 2; |
1139 | pnum = block / blocks_per_page; | 1139 | pnum = block / blocks_per_page; |
1140 | poff = block % blocks_per_page; | 1140 | poff = block % blocks_per_page; |
1141 | 1141 | ||
1142 | /* we could use find_or_create_page(), but it locks page | 1142 | /* we could use find_or_create_page(), but it locks page |
1143 | * what we'd like to avoid in fast path ... */ | 1143 | * what we'd like to avoid in fast path ... */ |
1144 | page = find_get_page(inode->i_mapping, pnum); | 1144 | page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED); |
1145 | if (page == NULL || !PageUptodate(page)) { | 1145 | if (page == NULL || !PageUptodate(page)) { |
1146 | if (page) | 1146 | if (page) |
1147 | /* | 1147 | /* |
1148 | * drop the page reference and try | 1148 | * drop the page reference and try |
1149 | * to get the page with lock. If we | 1149 | * to get the page with lock. If we |
1150 | * are not uptodate that implies | 1150 | * are not uptodate that implies |
1151 | * somebody just created the page but | 1151 | * somebody just created the page but |
1152 | * is yet to initialize the same. So | 1152 | * is yet to initialize the same. So |
1153 | * wait for it to initialize. | 1153 | * wait for it to initialize. |
1154 | */ | 1154 | */ |
1155 | page_cache_release(page); | 1155 | page_cache_release(page); |
1156 | page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); | 1156 | page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); |
1157 | if (page) { | 1157 | if (page) { |
1158 | BUG_ON(page->mapping != inode->i_mapping); | 1158 | BUG_ON(page->mapping != inode->i_mapping); |
1159 | if (!PageUptodate(page)) { | 1159 | if (!PageUptodate(page)) { |
1160 | ret = ext4_mb_init_cache(page, NULL); | 1160 | ret = ext4_mb_init_cache(page, NULL); |
1161 | if (ret) { | 1161 | if (ret) { |
1162 | unlock_page(page); | 1162 | unlock_page(page); |
1163 | goto err; | 1163 | goto err; |
1164 | } | 1164 | } |
1165 | mb_cmp_bitmaps(e4b, page_address(page) + | 1165 | mb_cmp_bitmaps(e4b, page_address(page) + |
1166 | (poff * sb->s_blocksize)); | 1166 | (poff * sb->s_blocksize)); |
1167 | } | 1167 | } |
1168 | unlock_page(page); | 1168 | unlock_page(page); |
1169 | } | 1169 | } |
1170 | } | 1170 | } |
1171 | if (page == NULL || !PageUptodate(page)) { | 1171 | if (page == NULL || !PageUptodate(page)) { |
1172 | ret = -EIO; | 1172 | ret = -EIO; |
1173 | goto err; | 1173 | goto err; |
1174 | } | 1174 | } |
1175 | |||
1176 | /* Pages marked accessed already */ | ||
1175 | e4b->bd_bitmap_page = page; | 1177 | e4b->bd_bitmap_page = page; |
1176 | e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize); | 1178 | e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize); |
1177 | mark_page_accessed(page); | ||
1178 | 1179 | ||
1179 | block++; | 1180 | block++; |
1180 | pnum = block / blocks_per_page; | 1181 | pnum = block / blocks_per_page; |
1181 | poff = block % blocks_per_page; | 1182 | poff = block % blocks_per_page; |
1182 | 1183 | ||
1183 | page = find_get_page(inode->i_mapping, pnum); | 1184 | page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED); |
1184 | if (page == NULL || !PageUptodate(page)) { | 1185 | if (page == NULL || !PageUptodate(page)) { |
1185 | if (page) | 1186 | if (page) |
1186 | page_cache_release(page); | 1187 | page_cache_release(page); |
1187 | page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); | 1188 | page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); |
1188 | if (page) { | 1189 | if (page) { |
1189 | BUG_ON(page->mapping != inode->i_mapping); | 1190 | BUG_ON(page->mapping != inode->i_mapping); |
1190 | if (!PageUptodate(page)) { | 1191 | if (!PageUptodate(page)) { |
1191 | ret = ext4_mb_init_cache(page, e4b->bd_bitmap); | 1192 | ret = ext4_mb_init_cache(page, e4b->bd_bitmap); |
1192 | if (ret) { | 1193 | if (ret) { |
1193 | unlock_page(page); | 1194 | unlock_page(page); |
1194 | goto err; | 1195 | goto err; |
1195 | } | 1196 | } |
1196 | } | 1197 | } |
1197 | unlock_page(page); | 1198 | unlock_page(page); |
1198 | } | 1199 | } |
1199 | } | 1200 | } |
1200 | if (page == NULL || !PageUptodate(page)) { | 1201 | if (page == NULL || !PageUptodate(page)) { |
1201 | ret = -EIO; | 1202 | ret = -EIO; |
1202 | goto err; | 1203 | goto err; |
1203 | } | 1204 | } |
1205 | |||
1206 | /* Pages marked accessed already */ | ||
1204 | e4b->bd_buddy_page = page; | 1207 | e4b->bd_buddy_page = page; |
1205 | e4b->bd_buddy = page_address(page) + (poff * sb->s_blocksize); | 1208 | e4b->bd_buddy = page_address(page) + (poff * sb->s_blocksize); |
1206 | mark_page_accessed(page); | ||
1207 | 1209 | ||
1208 | BUG_ON(e4b->bd_bitmap_page == NULL); | 1210 | BUG_ON(e4b->bd_bitmap_page == NULL); |
1209 | BUG_ON(e4b->bd_buddy_page == NULL); | 1211 | BUG_ON(e4b->bd_buddy_page == NULL); |
1210 | 1212 | ||
1211 | return 0; | 1213 | return 0; |
1212 | 1214 | ||
1213 | err: | 1215 | err: |
1214 | if (page) | 1216 | if (page) |
1215 | page_cache_release(page); | 1217 | page_cache_release(page); |
1216 | if (e4b->bd_bitmap_page) | 1218 | if (e4b->bd_bitmap_page) |
1217 | page_cache_release(e4b->bd_bitmap_page); | 1219 | page_cache_release(e4b->bd_bitmap_page); |
1218 | if (e4b->bd_buddy_page) | 1220 | if (e4b->bd_buddy_page) |
1219 | page_cache_release(e4b->bd_buddy_page); | 1221 | page_cache_release(e4b->bd_buddy_page); |
1220 | e4b->bd_buddy = NULL; | 1222 | e4b->bd_buddy = NULL; |
1221 | e4b->bd_bitmap = NULL; | 1223 | e4b->bd_bitmap = NULL; |
1222 | return ret; | 1224 | return ret; |
1223 | } | 1225 | } |
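The lookups above now pass FGP_ACCESSED to find_get_page_flags() instead of following find_get_page() with a separate mark_page_accessed() call on a page that is already visible. A condensed sketch of the before/after calling pattern; it is not buildable on its own and the wrapper names are made up:

/*
 * Illustration only -- the wrapper names below are hypothetical.  This is
 * the shape of the change in the lookups above: the accessed state is
 * requested as part of the page-cache lookup instead of being set
 * afterwards on an already-visible page.
 */
static struct page *lookup_cached_block_old(struct address_space *mapping,
					    pgoff_t pnum)
{
	struct page *page = find_get_page(mapping, pnum);

	if (page)
		mark_page_accessed(page);	/* separate step after the lookup */
	return page;
}

static struct page *lookup_cached_block_new(struct address_space *mapping,
					    pgoff_t pnum)
{
	/* accessed state handled inside the lookup via FGP_ACCESSED */
	return find_get_page_flags(mapping, pnum, FGP_ACCESSED);
}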
1224 | 1226 | ||
1225 | static void ext4_mb_unload_buddy(struct ext4_buddy *e4b) | 1227 | static void ext4_mb_unload_buddy(struct ext4_buddy *e4b) |
1226 | { | 1228 | { |
1227 | if (e4b->bd_bitmap_page) | 1229 | if (e4b->bd_bitmap_page) |
1228 | page_cache_release(e4b->bd_bitmap_page); | 1230 | page_cache_release(e4b->bd_bitmap_page); |
1229 | if (e4b->bd_buddy_page) | 1231 | if (e4b->bd_buddy_page) |
1230 | page_cache_release(e4b->bd_buddy_page); | 1232 | page_cache_release(e4b->bd_buddy_page); |
1231 | } | 1233 | } |
1232 | 1234 | ||
1233 | 1235 | ||
1234 | static int mb_find_order_for_block(struct ext4_buddy *e4b, int block) | 1236 | static int mb_find_order_for_block(struct ext4_buddy *e4b, int block) |
1235 | { | 1237 | { |
1236 | int order = 1; | 1238 | int order = 1; |
1237 | void *bb; | 1239 | void *bb; |
1238 | 1240 | ||
1239 | BUG_ON(e4b->bd_bitmap == e4b->bd_buddy); | 1241 | BUG_ON(e4b->bd_bitmap == e4b->bd_buddy); |
1240 | BUG_ON(block >= (1 << (e4b->bd_blkbits + 3))); | 1242 | BUG_ON(block >= (1 << (e4b->bd_blkbits + 3))); |
1241 | 1243 | ||
1242 | bb = e4b->bd_buddy; | 1244 | bb = e4b->bd_buddy; |
1243 | while (order <= e4b->bd_blkbits + 1) { | 1245 | while (order <= e4b->bd_blkbits + 1) { |
1244 | block = block >> 1; | 1246 | block = block >> 1; |
1245 | if (!mb_test_bit(block, bb)) { | 1247 | if (!mb_test_bit(block, bb)) { |
1246 | /* this block is part of buddy of order 'order' */ | 1248 | /* this block is part of buddy of order 'order' */ |
1247 | return order; | 1249 | return order; |
1248 | } | 1250 | } |
1249 | bb += 1 << (e4b->bd_blkbits - order); | 1251 | bb += 1 << (e4b->bd_blkbits - order); |
1250 | order++; | 1252 | order++; |
1251 | } | 1253 | } |
1252 | return 0; | 1254 | return 0; |
1253 | } | 1255 | } |
1254 | 1256 | ||
1255 | static void mb_clear_bits(void *bm, int cur, int len) | 1257 | static void mb_clear_bits(void *bm, int cur, int len) |
1256 | { | 1258 | { |
1257 | __u32 *addr; | 1259 | __u32 *addr; |
1258 | 1260 | ||
1259 | len = cur + len; | 1261 | len = cur + len; |
1260 | while (cur < len) { | 1262 | while (cur < len) { |
1261 | if ((cur & 31) == 0 && (len - cur) >= 32) { | 1263 | if ((cur & 31) == 0 && (len - cur) >= 32) { |
1262 | /* fast path: clear whole word at once */ | 1264 | /* fast path: clear whole word at once */ |
1263 | addr = bm + (cur >> 3); | 1265 | addr = bm + (cur >> 3); |
1264 | *addr = 0; | 1266 | *addr = 0; |
1265 | cur += 32; | 1267 | cur += 32; |
1266 | continue; | 1268 | continue; |
1267 | } | 1269 | } |
1268 | mb_clear_bit(cur, bm); | 1270 | mb_clear_bit(cur, bm); |
1269 | cur++; | 1271 | cur++; |
1270 | } | 1272 | } |
1271 | } | 1273 | } |
1272 | 1274 | ||
1273 | /* clear bits in given range | 1275 | /* clear bits in given range |
1274 | * will return first found zero bit if any, -1 otherwise | 1276 | * will return first found zero bit if any, -1 otherwise |
1275 | */ | 1277 | */ |
1276 | static int mb_test_and_clear_bits(void *bm, int cur, int len) | 1278 | static int mb_test_and_clear_bits(void *bm, int cur, int len) |
1277 | { | 1279 | { |
1278 | __u32 *addr; | 1280 | __u32 *addr; |
1279 | int zero_bit = -1; | 1281 | int zero_bit = -1; |
1280 | 1282 | ||
1281 | len = cur + len; | 1283 | len = cur + len; |
1282 | while (cur < len) { | 1284 | while (cur < len) { |
1283 | if ((cur & 31) == 0 && (len - cur) >= 32) { | 1285 | if ((cur & 31) == 0 && (len - cur) >= 32) { |
1284 | /* fast path: clear whole word at once */ | 1286 | /* fast path: clear whole word at once */ |
1285 | addr = bm + (cur >> 3); | 1287 | addr = bm + (cur >> 3); |
1286 | if (*addr != (__u32)(-1) && zero_bit == -1) | 1288 | if (*addr != (__u32)(-1) && zero_bit == -1) |
1287 | zero_bit = cur + mb_find_next_zero_bit(addr, 32, 0); | 1289 | zero_bit = cur + mb_find_next_zero_bit(addr, 32, 0); |
1288 | *addr = 0; | 1290 | *addr = 0; |
1289 | cur += 32; | 1291 | cur += 32; |
1290 | continue; | 1292 | continue; |
1291 | } | 1293 | } |
1292 | if (!mb_test_and_clear_bit(cur, bm) && zero_bit == -1) | 1294 | if (!mb_test_and_clear_bit(cur, bm) && zero_bit == -1) |
1293 | zero_bit = cur; | 1295 | zero_bit = cur; |
1294 | cur++; | 1296 | cur++; |
1295 | } | 1297 | } |
1296 | 1298 | ||
1297 | return zero_bit; | 1299 | return zero_bit; |
1298 | } | 1300 | } |
1299 | 1301 | ||
1300 | void ext4_set_bits(void *bm, int cur, int len) | 1302 | void ext4_set_bits(void *bm, int cur, int len) |
1301 | { | 1303 | { |
1302 | __u32 *addr; | 1304 | __u32 *addr; |
1303 | 1305 | ||
1304 | len = cur + len; | 1306 | len = cur + len; |
1305 | while (cur < len) { | 1307 | while (cur < len) { |
1306 | if ((cur & 31) == 0 && (len - cur) >= 32) { | 1308 | if ((cur & 31) == 0 && (len - cur) >= 32) { |
1307 | /* fast path: set whole word at once */ | 1309 | /* fast path: set whole word at once */ |
1308 | addr = bm + (cur >> 3); | 1310 | addr = bm + (cur >> 3); |
1309 | *addr = 0xffffffff; | 1311 | *addr = 0xffffffff; |
1310 | cur += 32; | 1312 | cur += 32; |
1311 | continue; | 1313 | continue; |
1312 | } | 1314 | } |
1313 | mb_set_bit(cur, bm); | 1315 | mb_set_bit(cur, bm); |
1314 | cur++; | 1316 | cur++; |
1315 | } | 1317 | } |
1316 | } | 1318 | } |
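mb_clear_bits(), mb_test_and_clear_bits() and ext4_set_bits() above share the same shape: handle stray bits one at a time until the position is 32-bit aligned, then write whole words at once. A user-space sketch of that pattern, using plain within-word bit numbering purely for illustration:

/*
 * User-space sketch of the word-at-a-time fast path used by the bit-range
 * helpers above.  Bit numbering is plain within-word numbering here, for
 * illustration only.
 */
#include <stdio.h>
#include <stdint.h>

static void set_bits(uint32_t *bm, int cur, int len)
{
	len = cur + len;
	while (cur < len) {
		if ((cur & 31) == 0 && (len - cur) >= 32) {
			/* fast path: set whole word at once */
			bm[cur >> 5] = 0xffffffff;
			cur += 32;
			continue;
		}
		bm[cur >> 5] |= 1u << (cur & 31);	/* single bit */
		cur++;
	}
}

int main(void)
{
	uint32_t bm[4] = { 0 };
	int i;

	set_bits(bm, 30, 40);	/* range crosses a word boundary */
	for (i = 0; i < 4; i++)
		printf("word %d: %08x\n", i, bm[i]);
	return 0;
}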
1317 | 1319 | ||
1318 | /* | 1320 | /* |
1319 | * _________________________________________________________________ */ | 1321 | * _________________________________________________________________ */ |
1320 | 1322 | ||
1321 | static inline int mb_buddy_adjust_border(int* bit, void* bitmap, int side) | 1323 | static inline int mb_buddy_adjust_border(int* bit, void* bitmap, int side) |
1322 | { | 1324 | { |
1323 | if (mb_test_bit(*bit + side, bitmap)) { | 1325 | if (mb_test_bit(*bit + side, bitmap)) { |
1324 | mb_clear_bit(*bit, bitmap); | 1326 | mb_clear_bit(*bit, bitmap); |
1325 | (*bit) -= side; | 1327 | (*bit) -= side; |
1326 | return 1; | 1328 | return 1; |
1327 | } | 1329 | } |
1328 | else { | 1330 | else { |
1329 | (*bit) += side; | 1331 | (*bit) += side; |
1330 | mb_set_bit(*bit, bitmap); | 1332 | mb_set_bit(*bit, bitmap); |
1331 | return -1; | 1333 | return -1; |
1332 | } | 1334 | } |
1333 | } | 1335 | } |
1334 | 1336 | ||
1335 | static void mb_buddy_mark_free(struct ext4_buddy *e4b, int first, int last) | 1337 | static void mb_buddy_mark_free(struct ext4_buddy *e4b, int first, int last) |
1336 | { | 1338 | { |
1337 | int max; | 1339 | int max; |
1338 | int order = 1; | 1340 | int order = 1; |
1339 | void *buddy = mb_find_buddy(e4b, order, &max); | 1341 | void *buddy = mb_find_buddy(e4b, order, &max); |
1340 | 1342 | ||
1341 | while (buddy) { | 1343 | while (buddy) { |
1342 | void *buddy2; | 1344 | void *buddy2; |
1343 | 1345 | ||
1344 | /* Bits in range [first; last] are known to be set since | 1346 | /* Bits in range [first; last] are known to be set since |
1345 | * corresponding blocks were allocated. Bits in range | 1347 | * corresponding blocks were allocated. Bits in range |
1346 | * (first; last) will stay set because they form buddies on | 1348 | * (first; last) will stay set because they form buddies on |
1347 | * upper layer. We just deal with borders if they don't | 1349 | * upper layer. We just deal with borders if they don't |
1348 | * align with upper layer and then go up. | 1350 | * align with upper layer and then go up. |
1349 | * Releasing entire group is all about clearing | 1351 | * Releasing entire group is all about clearing |
1350 | * single bit of highest order buddy. | 1352 | * single bit of highest order buddy. |
1351 | */ | 1353 | */ |
1352 | 1354 | ||
1353 | /* Example: | 1355 | /* Example: |
1354 | * --------------------------------- | 1356 | * --------------------------------- |
1355 | * | 1 | 1 | 1 | 1 | | 1357 | * | 1 | 1 | 1 | 1 | |
1356 | * --------------------------------- | 1358 | * --------------------------------- |
1357 | * | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | 1359 | * | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
1358 | * --------------------------------- | 1360 | * --------------------------------- |
1359 | * 0 1 2 3 4 5 6 7 | 1361 | * 0 1 2 3 4 5 6 7 |
1360 | * \_____________________/ | 1362 | * \_____________________/ |
1361 | * | 1363 | * |
1362 | * Neither [1] nor [6] is aligned to above layer. | 1364 | * Neither [1] nor [6] is aligned to above layer. |
1363 | * Left neighbour [0] is free, so mark it busy, | 1365 | * Left neighbour [0] is free, so mark it busy, |
1364 | * decrease bb_counters and extend range to | 1366 | * decrease bb_counters and extend range to |
1365 | * [0; 6] | 1367 | * [0; 6] |
1366 | * Right neighbour [7] is busy. It can't be coalesced with [6], so | 1368 | * Right neighbour [7] is busy. It can't be coalesced with [6], so |
1367 | * mark [6] free, increase bb_counters and shrink range to | 1369 | * mark [6] free, increase bb_counters and shrink range to |
1368 | * [0; 5]. | 1370 | * [0; 5]. |
1369 | * Then shift range to [0; 2], go up and do the same. | 1371 | * Then shift range to [0; 2], go up and do the same. |
1370 | */ | 1372 | */ |
1371 | 1373 | ||
1372 | 1374 | ||
1373 | if (first & 1) | 1375 | if (first & 1) |
1374 | e4b->bd_info->bb_counters[order] += mb_buddy_adjust_border(&first, buddy, -1); | 1376 | e4b->bd_info->bb_counters[order] += mb_buddy_adjust_border(&first, buddy, -1); |
1375 | if (!(last & 1)) | 1377 | if (!(last & 1)) |
1376 | e4b->bd_info->bb_counters[order] += mb_buddy_adjust_border(&last, buddy, 1); | 1378 | e4b->bd_info->bb_counters[order] += mb_buddy_adjust_border(&last, buddy, 1); |
1377 | if (first > last) | 1379 | if (first > last) |
1378 | break; | 1380 | break; |
1379 | order++; | 1381 | order++; |
1380 | 1382 | ||
1381 | if (first == last || !(buddy2 = mb_find_buddy(e4b, order, &max))) { | 1383 | if (first == last || !(buddy2 = mb_find_buddy(e4b, order, &max))) { |
1382 | mb_clear_bits(buddy, first, last - first + 1); | 1384 | mb_clear_bits(buddy, first, last - first + 1); |
1383 | e4b->bd_info->bb_counters[order - 1] += last - first + 1; | 1385 | e4b->bd_info->bb_counters[order - 1] += last - first + 1; |
1384 | break; | 1386 | break; |
1385 | } | 1387 | } |
1386 | first >>= 1; | 1388 | first >>= 1; |
1387 | last >>= 1; | 1389 | last >>= 1; |
1388 | buddy = buddy2; | 1390 | buddy = buddy2; |
1389 | } | 1391 | } |
1390 | } | 1392 | } |
1391 | 1393 | ||
1392 | static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b, | 1394 | static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b, |
1393 | int first, int count) | 1395 | int first, int count) |
1394 | { | 1396 | { |
1395 | int left_is_free = 0; | 1397 | int left_is_free = 0; |
1396 | int right_is_free = 0; | 1398 | int right_is_free = 0; |
1397 | int block; | 1399 | int block; |
1398 | int last = first + count - 1; | 1400 | int last = first + count - 1; |
1399 | struct super_block *sb = e4b->bd_sb; | 1401 | struct super_block *sb = e4b->bd_sb; |
1400 | 1402 | ||
1401 | if (WARN_ON(count == 0)) | 1403 | if (WARN_ON(count == 0)) |
1402 | return; | 1404 | return; |
1403 | BUG_ON(last >= (sb->s_blocksize << 3)); | 1405 | BUG_ON(last >= (sb->s_blocksize << 3)); |
1404 | assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group)); | 1406 | assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group)); |
1405 | /* Don't bother if the block group is corrupt. */ | 1407 | /* Don't bother if the block group is corrupt. */ |
1406 | if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) | 1408 | if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) |
1407 | return; | 1409 | return; |
1408 | 1410 | ||
1409 | mb_check_buddy(e4b); | 1411 | mb_check_buddy(e4b); |
1410 | mb_free_blocks_double(inode, e4b, first, count); | 1412 | mb_free_blocks_double(inode, e4b, first, count); |
1411 | 1413 | ||
1412 | e4b->bd_info->bb_free += count; | 1414 | e4b->bd_info->bb_free += count; |
1413 | if (first < e4b->bd_info->bb_first_free) | 1415 | if (first < e4b->bd_info->bb_first_free) |
1414 | e4b->bd_info->bb_first_free = first; | 1416 | e4b->bd_info->bb_first_free = first; |
1415 | 1417 | ||
1416 | /* access memory sequentially: check left neighbour, | 1418 | /* access memory sequentially: check left neighbour, |
1417 | * clear range and then check right neighbour | 1419 | * clear range and then check right neighbour |
1418 | */ | 1420 | */ |
1419 | if (first != 0) | 1421 | if (first != 0) |
1420 | left_is_free = !mb_test_bit(first - 1, e4b->bd_bitmap); | 1422 | left_is_free = !mb_test_bit(first - 1, e4b->bd_bitmap); |
1421 | block = mb_test_and_clear_bits(e4b->bd_bitmap, first, count); | 1423 | block = mb_test_and_clear_bits(e4b->bd_bitmap, first, count); |
1422 | if (last + 1 < EXT4_SB(sb)->s_mb_maxs[0]) | 1424 | if (last + 1 < EXT4_SB(sb)->s_mb_maxs[0]) |
1423 | right_is_free = !mb_test_bit(last + 1, e4b->bd_bitmap); | 1425 | right_is_free = !mb_test_bit(last + 1, e4b->bd_bitmap); |
1424 | 1426 | ||
1425 | if (unlikely(block != -1)) { | 1427 | if (unlikely(block != -1)) { |
1426 | ext4_fsblk_t blocknr; | 1428 | ext4_fsblk_t blocknr; |
1427 | 1429 | ||
1428 | blocknr = ext4_group_first_block_no(sb, e4b->bd_group); | 1430 | blocknr = ext4_group_first_block_no(sb, e4b->bd_group); |
1429 | blocknr += EXT4_C2B(EXT4_SB(sb), block); | 1431 | blocknr += EXT4_C2B(EXT4_SB(sb), block); |
1430 | ext4_grp_locked_error(sb, e4b->bd_group, | 1432 | ext4_grp_locked_error(sb, e4b->bd_group, |
1431 | inode ? inode->i_ino : 0, | 1433 | inode ? inode->i_ino : 0, |
1432 | blocknr, | 1434 | blocknr, |
1433 | "freeing already freed block " | 1435 | "freeing already freed block " |
1434 | "(bit %u); block bitmap corrupt.", | 1436 | "(bit %u); block bitmap corrupt.", |
1435 | block); | 1437 | block); |
1436 | /* Mark the block group as corrupt. */ | 1438 | /* Mark the block group as corrupt. */ |
1437 | set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, | 1439 | set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, |
1438 | &e4b->bd_info->bb_state); | 1440 | &e4b->bd_info->bb_state); |
1439 | mb_regenerate_buddy(e4b); | 1441 | mb_regenerate_buddy(e4b); |
1440 | goto done; | 1442 | goto done; |
1441 | } | 1443 | } |
1442 | 1444 | ||
1443 | /* let's maintain fragments counter */ | 1445 | /* let's maintain fragments counter */ |
1444 | if (left_is_free && right_is_free) | 1446 | if (left_is_free && right_is_free) |
1445 | e4b->bd_info->bb_fragments--; | 1447 | e4b->bd_info->bb_fragments--; |
1446 | else if (!left_is_free && !right_is_free) | 1448 | else if (!left_is_free && !right_is_free) |
1447 | e4b->bd_info->bb_fragments++; | 1449 | e4b->bd_info->bb_fragments++; |
1448 | 1450 | ||
1449 | /* buddy[0] == bd_bitmap is a special case, so handle | 1451 | /* buddy[0] == bd_bitmap is a special case, so handle |
1450 | * it right away and let mb_buddy_mark_free stay free of | 1452 | * it right away and let mb_buddy_mark_free stay free of |
1451 | * zero order checks. | 1453 | * zero order checks. |
1452 | * Check if neighbours are to be coalesced, | 1454 | * Check if neighbours are to be coalesced, |
1453 | * adjust bitmap bb_counters and borders appropriately. | 1455 | * adjust bitmap bb_counters and borders appropriately. |
1454 | */ | 1456 | */ |
1455 | if (first & 1) { | 1457 | if (first & 1) { |
1456 | first += !left_is_free; | 1458 | first += !left_is_free; |
1457 | e4b->bd_info->bb_counters[0] += left_is_free ? -1 : 1; | 1459 | e4b->bd_info->bb_counters[0] += left_is_free ? -1 : 1; |
1458 | } | 1460 | } |
1459 | if (!(last & 1)) { | 1461 | if (!(last & 1)) { |
1460 | last -= !right_is_free; | 1462 | last -= !right_is_free; |
1461 | e4b->bd_info->bb_counters[0] += right_is_free ? -1 : 1; | 1463 | e4b->bd_info->bb_counters[0] += right_is_free ? -1 : 1; |
1462 | } | 1464 | } |
1463 | 1465 | ||
1464 | if (first <= last) | 1466 | if (first <= last) |
1465 | mb_buddy_mark_free(e4b, first >> 1, last >> 1); | 1467 | mb_buddy_mark_free(e4b, first >> 1, last >> 1); |
1466 | 1468 | ||
1467 | done: | 1469 | done: |
1468 | mb_set_largest_free_order(sb, e4b->bd_info); | 1470 | mb_set_largest_free_order(sb, e4b->bd_info); |
1469 | mb_check_buddy(e4b); | 1471 | mb_check_buddy(e4b); |
1470 | } | 1472 | } |
1471 | 1473 | ||
1472 | static int mb_find_extent(struct ext4_buddy *e4b, int block, | 1474 | static int mb_find_extent(struct ext4_buddy *e4b, int block, |
1473 | int needed, struct ext4_free_extent *ex) | 1475 | int needed, struct ext4_free_extent *ex) |
1474 | { | 1476 | { |
1475 | int next = block; | 1477 | int next = block; |
1476 | int max, order; | 1478 | int max, order; |
1477 | void *buddy; | 1479 | void *buddy; |
1478 | 1480 | ||
1479 | assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group)); | 1481 | assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group)); |
1480 | BUG_ON(ex == NULL); | 1482 | BUG_ON(ex == NULL); |
1481 | 1483 | ||
1482 | buddy = mb_find_buddy(e4b, 0, &max); | 1484 | buddy = mb_find_buddy(e4b, 0, &max); |
1483 | BUG_ON(buddy == NULL); | 1485 | BUG_ON(buddy == NULL); |
1484 | BUG_ON(block >= max); | 1486 | BUG_ON(block >= max); |
1485 | if (mb_test_bit(block, buddy)) { | 1487 | if (mb_test_bit(block, buddy)) { |
1486 | ex->fe_len = 0; | 1488 | ex->fe_len = 0; |
1487 | ex->fe_start = 0; | 1489 | ex->fe_start = 0; |
1488 | ex->fe_group = 0; | 1490 | ex->fe_group = 0; |
1489 | return 0; | 1491 | return 0; |
1490 | } | 1492 | } |
1491 | 1493 | ||
1492 | /* find actual order */ | 1494 | /* find actual order */ |
1493 | order = mb_find_order_for_block(e4b, block); | 1495 | order = mb_find_order_for_block(e4b, block); |
1494 | block = block >> order; | 1496 | block = block >> order; |
1495 | 1497 | ||
1496 | ex->fe_len = 1 << order; | 1498 | ex->fe_len = 1 << order; |
1497 | ex->fe_start = block << order; | 1499 | ex->fe_start = block << order; |
1498 | ex->fe_group = e4b->bd_group; | 1500 | ex->fe_group = e4b->bd_group; |
1499 | 1501 | ||
1500 | /* calc difference from given start */ | 1502 | /* calc difference from given start */ |
1501 | next = next - ex->fe_start; | 1503 | next = next - ex->fe_start; |
1502 | ex->fe_len -= next; | 1504 | ex->fe_len -= next; |
1503 | ex->fe_start += next; | 1505 | ex->fe_start += next; |
1504 | 1506 | ||
1505 | while (needed > ex->fe_len && | 1507 | while (needed > ex->fe_len && |
1506 | mb_find_buddy(e4b, order, &max)) { | 1508 | mb_find_buddy(e4b, order, &max)) { |
1507 | 1509 | ||
1508 | if (block + 1 >= max) | 1510 | if (block + 1 >= max) |
1509 | break; | 1511 | break; |
1510 | 1512 | ||
1511 | next = (block + 1) * (1 << order); | 1513 | next = (block + 1) * (1 << order); |
1512 | if (mb_test_bit(next, e4b->bd_bitmap)) | 1514 | if (mb_test_bit(next, e4b->bd_bitmap)) |
1513 | break; | 1515 | break; |
1514 | 1516 | ||
1515 | order = mb_find_order_for_block(e4b, next); | 1517 | order = mb_find_order_for_block(e4b, next); |
1516 | 1518 | ||
1517 | block = next >> order; | 1519 | block = next >> order; |
1518 | ex->fe_len += 1 << order; | 1520 | ex->fe_len += 1 << order; |
1519 | } | 1521 | } |
1520 | 1522 | ||
1521 | BUG_ON(ex->fe_start + ex->fe_len > (1 << (e4b->bd_blkbits + 3))); | 1523 | BUG_ON(ex->fe_start + ex->fe_len > (1 << (e4b->bd_blkbits + 3))); |
1522 | return ex->fe_len; | 1524 | return ex->fe_len; |
1523 | } | 1525 | } |
1524 | 1526 | ||
1525 | static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex) | 1527 | static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex) |
1526 | { | 1528 | { |
1527 | int ord; | 1529 | int ord; |
1528 | int mlen = 0; | 1530 | int mlen = 0; |
1529 | int max = 0; | 1531 | int max = 0; |
1530 | int cur; | 1532 | int cur; |
1531 | int start = ex->fe_start; | 1533 | int start = ex->fe_start; |
1532 | int len = ex->fe_len; | 1534 | int len = ex->fe_len; |
1533 | unsigned ret = 0; | 1535 | unsigned ret = 0; |
1534 | int len0 = len; | 1536 | int len0 = len; |
1535 | void *buddy; | 1537 | void *buddy; |
1536 | 1538 | ||
1537 | BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3)); | 1539 | BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3)); |
1538 | BUG_ON(e4b->bd_group != ex->fe_group); | 1540 | BUG_ON(e4b->bd_group != ex->fe_group); |
1539 | assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group)); | 1541 | assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group)); |
1540 | mb_check_buddy(e4b); | 1542 | mb_check_buddy(e4b); |
1541 | mb_mark_used_double(e4b, start, len); | 1543 | mb_mark_used_double(e4b, start, len); |
1542 | 1544 | ||
1543 | e4b->bd_info->bb_free -= len; | 1545 | e4b->bd_info->bb_free -= len; |
1544 | if (e4b->bd_info->bb_first_free == start) | 1546 | if (e4b->bd_info->bb_first_free == start) |
1545 | e4b->bd_info->bb_first_free += len; | 1547 | e4b->bd_info->bb_first_free += len; |
1546 | 1548 | ||
1547 | /* let's maintain fragments counter */ | 1549 | /* let's maintain fragments counter */ |
1548 | if (start != 0) | 1550 | if (start != 0) |
1549 | mlen = !mb_test_bit(start - 1, e4b->bd_bitmap); | 1551 | mlen = !mb_test_bit(start - 1, e4b->bd_bitmap); |
1550 | if (start + len < EXT4_SB(e4b->bd_sb)->s_mb_maxs[0]) | 1552 | if (start + len < EXT4_SB(e4b->bd_sb)->s_mb_maxs[0]) |
1551 | max = !mb_test_bit(start + len, e4b->bd_bitmap); | 1553 | max = !mb_test_bit(start + len, e4b->bd_bitmap); |
1552 | if (mlen && max) | 1554 | if (mlen && max) |
1553 | e4b->bd_info->bb_fragments++; | 1555 | e4b->bd_info->bb_fragments++; |
1554 | else if (!mlen && !max) | 1556 | else if (!mlen && !max) |
1555 | e4b->bd_info->bb_fragments--; | 1557 | e4b->bd_info->bb_fragments--; |
1556 | 1558 | ||
1557 | /* let's maintain buddy itself */ | 1559 | /* let's maintain buddy itself */ |
1558 | while (len) { | 1560 | while (len) { |
1559 | ord = mb_find_order_for_block(e4b, start); | 1561 | ord = mb_find_order_for_block(e4b, start); |
1560 | 1562 | ||
1561 | if (((start >> ord) << ord) == start && len >= (1 << ord)) { | 1563 | if (((start >> ord) << ord) == start && len >= (1 << ord)) { |
1562 | /* the whole chunk may be allocated at once! */ | 1564 | /* the whole chunk may be allocated at once! */ |
1563 | mlen = 1 << ord; | 1565 | mlen = 1 << ord; |
1564 | buddy = mb_find_buddy(e4b, ord, &max); | 1566 | buddy = mb_find_buddy(e4b, ord, &max); |
1565 | BUG_ON((start >> ord) >= max); | 1567 | BUG_ON((start >> ord) >= max); |
1566 | mb_set_bit(start >> ord, buddy); | 1568 | mb_set_bit(start >> ord, buddy); |
1567 | e4b->bd_info->bb_counters[ord]--; | 1569 | e4b->bd_info->bb_counters[ord]--; |
1568 | start += mlen; | 1570 | start += mlen; |
1569 | len -= mlen; | 1571 | len -= mlen; |
1570 | BUG_ON(len < 0); | 1572 | BUG_ON(len < 0); |
1571 | continue; | 1573 | continue; |
1572 | } | 1574 | } |
1573 | 1575 | ||
1574 | /* store for history */ | 1576 | /* store for history */ |
1575 | if (ret == 0) | 1577 | if (ret == 0) |
1576 | ret = len | (ord << 16); | 1578 | ret = len | (ord << 16); |
1577 | 1579 | ||
1578 | /* we have to split large buddy */ | 1580 | /* we have to split large buddy */ |
1579 | BUG_ON(ord <= 0); | 1581 | BUG_ON(ord <= 0); |
1580 | buddy = mb_find_buddy(e4b, ord, &max); | 1582 | buddy = mb_find_buddy(e4b, ord, &max); |
1581 | mb_set_bit(start >> ord, buddy); | 1583 | mb_set_bit(start >> ord, buddy); |
1582 | e4b->bd_info->bb_counters[ord]--; | 1584 | e4b->bd_info->bb_counters[ord]--; |
1583 | 1585 | ||
1584 | ord--; | 1586 | ord--; |
1585 | cur = (start >> ord) & ~1U; | 1587 | cur = (start >> ord) & ~1U; |
1586 | buddy = mb_find_buddy(e4b, ord, &max); | 1588 | buddy = mb_find_buddy(e4b, ord, &max); |
1587 | mb_clear_bit(cur, buddy); | 1589 | mb_clear_bit(cur, buddy); |
1588 | mb_clear_bit(cur + 1, buddy); | 1590 | mb_clear_bit(cur + 1, buddy); |
1589 | e4b->bd_info->bb_counters[ord]++; | 1591 | e4b->bd_info->bb_counters[ord]++; |
1590 | e4b->bd_info->bb_counters[ord]++; | 1592 | e4b->bd_info->bb_counters[ord]++; |
1591 | } | 1593 | } |
1592 | mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info); | 1594 | mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info); |
1593 | 1595 | ||
1594 | ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0); | 1596 | ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0); |
1595 | mb_check_buddy(e4b); | 1597 | mb_check_buddy(e4b); |
1596 | 1598 | ||
1597 | return ret; | 1599 | return ret; |
1598 | } | 1600 | } |
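The buddy bookkeeping in mb_mark_used() is easier to follow in isolation. The following stand-alone sketch (not from the patch; toy_counters is a hypothetical stand-in for bb_counters[]) shows the accounting performed each time a larger buddy has to be split: the parent order loses one free chunk and the child order gains two.

#include <stdio.h>

/* Toy per-order free-chunk counters, analogous to bb_counters[]. */
static int toy_counters[8];

/*
 * Split one free chunk of order 'ord' into two free chunks of order
 * 'ord - 1', which is the same accounting mb_mark_used() does when it
 * cannot consume a whole buddy: mark the parent used (one fewer free
 * chunk at that order) and mark both children free.
 */
static void toy_split(int ord)
{
	toy_counters[ord]--;
	toy_counters[ord - 1] += 2;
}

int main(void)
{
	int i;

	toy_counters[3] = 1;	/* one free chunk of 8 blocks */
	toy_split(3);		/* split it into two chunks of 4 */
	toy_split(2);		/* split one of those into two chunks of 2 */
	for (i = 3; i >= 0; i--)
		printf("order %d: %d free chunk(s)\n", i, toy_counters[i]);
	return 0;
}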
1599 | 1601 | ||
1600 | /* | 1602 | /* |
1601 | * Must be called under group lock! | 1603 | * Must be called under group lock! |
1602 | */ | 1604 | */ |
1603 | static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, | 1605 | static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, |
1604 | struct ext4_buddy *e4b) | 1606 | struct ext4_buddy *e4b) |
1605 | { | 1607 | { |
1606 | struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); | 1608 | struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); |
1607 | int ret; | 1609 | int ret; |
1608 | 1610 | ||
1609 | BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group); | 1611 | BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group); |
1610 | BUG_ON(ac->ac_status == AC_STATUS_FOUND); | 1612 | BUG_ON(ac->ac_status == AC_STATUS_FOUND); |
1611 | 1613 | ||
1612 | ac->ac_b_ex.fe_len = min(ac->ac_b_ex.fe_len, ac->ac_g_ex.fe_len); | 1614 | ac->ac_b_ex.fe_len = min(ac->ac_b_ex.fe_len, ac->ac_g_ex.fe_len); |
1613 | ac->ac_b_ex.fe_logical = ac->ac_g_ex.fe_logical; | 1615 | ac->ac_b_ex.fe_logical = ac->ac_g_ex.fe_logical; |
1614 | ret = mb_mark_used(e4b, &ac->ac_b_ex); | 1616 | ret = mb_mark_used(e4b, &ac->ac_b_ex); |
1615 | 1617 | ||
1616 | /* preallocation can change ac_b_ex, thus we store actually | 1618 | /* preallocation can change ac_b_ex, thus we store actually |
1617 | * allocated blocks for history */ | 1619 | * allocated blocks for history */ |
1618 | ac->ac_f_ex = ac->ac_b_ex; | 1620 | ac->ac_f_ex = ac->ac_b_ex; |
1619 | 1621 | ||
1620 | ac->ac_status = AC_STATUS_FOUND; | 1622 | ac->ac_status = AC_STATUS_FOUND; |
1621 | ac->ac_tail = ret & 0xffff; | 1623 | ac->ac_tail = ret & 0xffff; |
1622 | ac->ac_buddy = ret >> 16; | 1624 | ac->ac_buddy = ret >> 16; |
1623 | 1625 | ||
1624 | /* | 1626 | /* |
1625 | * take the page reference. We want the page to be pinned | 1627 | * take the page reference. We want the page to be pinned |
1626 | * so that we don't get an ext4_mb_init_cache() call for this | 1628 | * so that we don't get an ext4_mb_init_cache() call for this |
1627 | * group until we update the bitmap. That would mean we | 1629 | * group until we update the bitmap. That would mean we |
1628 | * double allocate blocks. The reference is dropped | 1630 | * double allocate blocks. The reference is dropped |
1629 | * in ext4_mb_release_context | 1631 | * in ext4_mb_release_context |
1630 | */ | 1632 | */ |
1631 | ac->ac_bitmap_page = e4b->bd_bitmap_page; | 1633 | ac->ac_bitmap_page = e4b->bd_bitmap_page; |
1632 | get_page(ac->ac_bitmap_page); | 1634 | get_page(ac->ac_bitmap_page); |
1633 | ac->ac_buddy_page = e4b->bd_buddy_page; | 1635 | ac->ac_buddy_page = e4b->bd_buddy_page; |
1634 | get_page(ac->ac_buddy_page); | 1636 | get_page(ac->ac_buddy_page); |
1635 | /* store last allocated for subsequent stream allocation */ | 1637 | /* store last allocated for subsequent stream allocation */ |
1636 | if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { | 1638 | if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { |
1637 | spin_lock(&sbi->s_md_lock); | 1639 | spin_lock(&sbi->s_md_lock); |
1638 | sbi->s_mb_last_group = ac->ac_f_ex.fe_group; | 1640 | sbi->s_mb_last_group = ac->ac_f_ex.fe_group; |
1639 | sbi->s_mb_last_start = ac->ac_f_ex.fe_start; | 1641 | sbi->s_mb_last_start = ac->ac_f_ex.fe_start; |
1640 | spin_unlock(&sbi->s_md_lock); | 1642 | spin_unlock(&sbi->s_md_lock); |
1641 | } | 1643 | } |
1642 | } | 1644 | } |
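The ac_tail/ac_buddy assignments above unpack the value mb_mark_used() returned: the low 16 bits carry the remaining length at the first buddy split and the high 16 bits carry the order at that point. A minimal sketch of the same round trip (the numbers are made up):

#include <stdio.h>

/* Pack the way mb_mark_used() builds its return value: length in the
 * low half, buddy order in the high half. */
static unsigned int pack_history(int len, int ord)
{
	return (unsigned int)len | ((unsigned int)ord << 16);
}

int main(void)
{
	unsigned int ret = pack_history(37, 5);

	/* The same unpacking as ac_tail / ac_buddy above. */
	printf("tail = %u, buddy order = %u\n", ret & 0xffff, ret >> 16);
	return 0;
}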
1643 | 1645 | ||
1644 | /* | 1646 | /* |
1645 | * regular allocator, for general purposes allocation | 1647 | * regular allocator, for general purposes allocation |
1646 | */ | 1648 | */ |
1647 | 1649 | ||
1648 | static void ext4_mb_check_limits(struct ext4_allocation_context *ac, | 1650 | static void ext4_mb_check_limits(struct ext4_allocation_context *ac, |
1649 | struct ext4_buddy *e4b, | 1651 | struct ext4_buddy *e4b, |
1650 | int finish_group) | 1652 | int finish_group) |
1651 | { | 1653 | { |
1652 | struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); | 1654 | struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); |
1653 | struct ext4_free_extent *bex = &ac->ac_b_ex; | 1655 | struct ext4_free_extent *bex = &ac->ac_b_ex; |
1654 | struct ext4_free_extent *gex = &ac->ac_g_ex; | 1656 | struct ext4_free_extent *gex = &ac->ac_g_ex; |
1655 | struct ext4_free_extent ex; | 1657 | struct ext4_free_extent ex; |
1656 | int max; | 1658 | int max; |
1657 | 1659 | ||
1658 | if (ac->ac_status == AC_STATUS_FOUND) | 1660 | if (ac->ac_status == AC_STATUS_FOUND) |
1659 | return; | 1661 | return; |
1660 | /* | 1662 | /* |
1661 | * We don't want to scan for a whole year | 1663 | * We don't want to scan for a whole year |
1662 | */ | 1664 | */ |
1663 | if (ac->ac_found > sbi->s_mb_max_to_scan && | 1665 | if (ac->ac_found > sbi->s_mb_max_to_scan && |
1664 | !(ac->ac_flags & EXT4_MB_HINT_FIRST)) { | 1666 | !(ac->ac_flags & EXT4_MB_HINT_FIRST)) { |
1665 | ac->ac_status = AC_STATUS_BREAK; | 1667 | ac->ac_status = AC_STATUS_BREAK; |
1666 | return; | 1668 | return; |
1667 | } | 1669 | } |
1668 | 1670 | ||
1669 | /* | 1671 | /* |
1670 | * Haven't found a good chunk so far, let's continue | 1672 | * Haven't found a good chunk so far, let's continue |
1671 | */ | 1673 | */ |
1672 | if (bex->fe_len < gex->fe_len) | 1674 | if (bex->fe_len < gex->fe_len) |
1673 | return; | 1675 | return; |
1674 | 1676 | ||
1675 | if ((finish_group || ac->ac_found > sbi->s_mb_min_to_scan) | 1677 | if ((finish_group || ac->ac_found > sbi->s_mb_min_to_scan) |
1676 | && bex->fe_group == e4b->bd_group) { | 1678 | && bex->fe_group == e4b->bd_group) { |
1677 | /* recheck chunk's availability - we don't know | 1679 | /* recheck chunk's availability - we don't know |
1678 | * when it was found (within this lock-unlock | 1680 | * when it was found (within this lock-unlock |
1679 | * period or not) */ | 1681 | * period or not) */ |
1680 | max = mb_find_extent(e4b, bex->fe_start, gex->fe_len, &ex); | 1682 | max = mb_find_extent(e4b, bex->fe_start, gex->fe_len, &ex); |
1681 | if (max >= gex->fe_len) { | 1683 | if (max >= gex->fe_len) { |
1682 | ext4_mb_use_best_found(ac, e4b); | 1684 | ext4_mb_use_best_found(ac, e4b); |
1683 | return; | 1685 | return; |
1684 | } | 1686 | } |
1685 | } | 1687 | } |
1686 | } | 1688 | } |
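The two tunables used above bound how long a group scan may run. Below is a simplified sketch of that decision only (the max_to_scan/min_to_scan values are hypothetical, and the real code also revalidates the best extent under the group lock before using it):

#include <stdbool.h>
#include <stdio.h>

static const int max_to_scan = 200;	/* like s_mb_max_to_scan */
static const int min_to_scan = 10;	/* like s_mb_min_to_scan */

/* Stop once too many extents were examined, or once enough were
 * examined and the best one already covers the goal length. */
static bool scan_should_stop(int found, int best_len, int goal_len)
{
	if (found > max_to_scan)
		return true;
	return found > min_to_scan && best_len >= goal_len;
}

int main(void)
{
	printf("%d\n", scan_should_stop(15, 32, 32));	/* 1: good enough */
	printf("%d\n", scan_should_stop(5, 8, 32));	/* 0: keep scanning */
	return 0;
}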
1687 | 1689 | ||
1688 | /* | 1690 | /* |
1689 | * The routine checks whether the found extent is good enough. If it is, | 1691 | * The routine checks whether the found extent is good enough. If it is, |
1690 | * then the extent gets marked used and a flag is set in the context | 1692 | * then the extent gets marked used and a flag is set in the context |
1691 | * to stop scanning. Otherwise, the extent is compared with the | 1693 | * to stop scanning. Otherwise, the extent is compared with the |
1692 | * previously found extent and, if the new one is better, it's stored | 1694 | * previously found extent and, if the new one is better, it's stored |
1693 | * in the context. Later, the best found extent will be used, if | 1695 | * in the context. Later, the best found extent will be used, if |
1694 | * mballoc can't find a good enough extent. | 1696 | * mballoc can't find a good enough extent. |
1695 | * | 1697 | * |
1696 | * FIXME: real allocation policy is to be designed yet! | 1698 | * FIXME: real allocation policy is to be designed yet! |
1697 | */ | 1699 | */ |
1698 | static void ext4_mb_measure_extent(struct ext4_allocation_context *ac, | 1700 | static void ext4_mb_measure_extent(struct ext4_allocation_context *ac, |
1699 | struct ext4_free_extent *ex, | 1701 | struct ext4_free_extent *ex, |
1700 | struct ext4_buddy *e4b) | 1702 | struct ext4_buddy *e4b) |
1701 | { | 1703 | { |
1702 | struct ext4_free_extent *bex = &ac->ac_b_ex; | 1704 | struct ext4_free_extent *bex = &ac->ac_b_ex; |
1703 | struct ext4_free_extent *gex = &ac->ac_g_ex; | 1705 | struct ext4_free_extent *gex = &ac->ac_g_ex; |
1704 | 1706 | ||
1705 | BUG_ON(ex->fe_len <= 0); | 1707 | BUG_ON(ex->fe_len <= 0); |
1706 | BUG_ON(ex->fe_len > EXT4_CLUSTERS_PER_GROUP(ac->ac_sb)); | 1708 | BUG_ON(ex->fe_len > EXT4_CLUSTERS_PER_GROUP(ac->ac_sb)); |
1707 | BUG_ON(ex->fe_start >= EXT4_CLUSTERS_PER_GROUP(ac->ac_sb)); | 1709 | BUG_ON(ex->fe_start >= EXT4_CLUSTERS_PER_GROUP(ac->ac_sb)); |
1708 | BUG_ON(ac->ac_status != AC_STATUS_CONTINUE); | 1710 | BUG_ON(ac->ac_status != AC_STATUS_CONTINUE); |
1709 | 1711 | ||
1710 | ac->ac_found++; | 1712 | ac->ac_found++; |
1711 | 1713 | ||
1712 | /* | 1714 | /* |
1713 | * The special case - take what you catch first | 1715 | * The special case - take what you catch first |
1714 | */ | 1716 | */ |
1715 | if (unlikely(ac->ac_flags & EXT4_MB_HINT_FIRST)) { | 1717 | if (unlikely(ac->ac_flags & EXT4_MB_HINT_FIRST)) { |
1716 | *bex = *ex; | 1718 | *bex = *ex; |
1717 | ext4_mb_use_best_found(ac, e4b); | 1719 | ext4_mb_use_best_found(ac, e4b); |
1718 | return; | 1720 | return; |
1719 | } | 1721 | } |
1720 | 1722 | ||
1721 | /* | 1723 | /* |
1722 | * Let's check whether the chunk is good enough | 1724 | * Let's check whether the chunk is good enough |
1723 | */ | 1725 | */ |
1724 | if (ex->fe_len == gex->fe_len) { | 1726 | if (ex->fe_len == gex->fe_len) { |
1725 | *bex = *ex; | 1727 | *bex = *ex; |
1726 | ext4_mb_use_best_found(ac, e4b); | 1728 | ext4_mb_use_best_found(ac, e4b); |
1727 | return; | 1729 | return; |
1728 | } | 1730 | } |
1729 | 1731 | ||
1730 | /* | 1732 | /* |
1731 | * If this is first found extent, just store it in the context | 1733 | * If this is first found extent, just store it in the context |
1732 | */ | 1734 | */ |
1733 | if (bex->fe_len == 0) { | 1735 | if (bex->fe_len == 0) { |
1734 | *bex = *ex; | 1736 | *bex = *ex; |
1735 | return; | 1737 | return; |
1736 | } | 1738 | } |
1737 | 1739 | ||
1738 | /* | 1740 | /* |
1739 | * If new found extent is better, store it in the context | 1741 | * If new found extent is better, store it in the context |
1740 | */ | 1742 | */ |
1741 | if (bex->fe_len < gex->fe_len) { | 1743 | if (bex->fe_len < gex->fe_len) { |
1742 | /* if the request isn't satisfied, any found extent | 1744 | /* if the request isn't satisfied, any found extent |
1743 | * larger than previous best one is better */ | 1745 | * larger than previous best one is better */ |
1744 | if (ex->fe_len > bex->fe_len) | 1746 | if (ex->fe_len > bex->fe_len) |
1745 | *bex = *ex; | 1747 | *bex = *ex; |
1746 | } else if (ex->fe_len > gex->fe_len) { | 1748 | } else if (ex->fe_len > gex->fe_len) { |
1747 | /* if the request is satisfied, then we try to find | 1749 | /* if the request is satisfied, then we try to find |
1748 | * an extent that still satisfies the request, but is | 1750 | * an extent that still satisfies the request, but is |
1749 | * smaller than the previous one */ | 1751 | * smaller than the previous one */ |
1750 | if (ex->fe_len < bex->fe_len) | 1752 | if (ex->fe_len < bex->fe_len) |
1751 | *bex = *ex; | 1753 | *bex = *ex; |
1752 | } | 1754 | } |
1753 | 1755 | ||
1754 | ext4_mb_check_limits(ac, e4b, 0); | 1756 | ext4_mb_check_limits(ac, e4b, 0); |
1755 | } | 1757 | } |
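The selection policy in ext4_mb_measure_extent() reduces to: while the goal length is unmet, prefer longer extents; once an extent exceeds the goal, prefer the tightest fit that still covers it. A stand-alone sketch of that policy (simplified types, not from the patch):

#include <stdio.h>

struct toy_ex { int len; };

/* Keep the better of the current best extent and a newly found one,
 * given the goal length, mirroring the comparisons above. */
static void toy_measure(struct toy_ex *best, struct toy_ex ex, int goal)
{
	if (best->len == 0) {
		*best = ex;			/* first candidate */
	} else if (best->len < goal) {
		if (ex.len > best->len)
			*best = ex;		/* goal unmet: bigger is better */
	} else if (ex.len > goal && ex.len < best->len) {
		*best = ex;			/* goal met: tighter fit is better */
	}
}

int main(void)
{
	struct toy_ex best = { 0 };
	int goal = 16;
	int lens[] = { 8, 12, 40, 24, 20 };

	for (unsigned int i = 0; i < sizeof(lens) / sizeof(lens[0]); i++)
		toy_measure(&best, (struct toy_ex){ lens[i] }, goal);
	printf("best length: %d\n", best.len);	/* prints 20 */
	return 0;
}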
1756 | 1758 | ||
1757 | static noinline_for_stack | 1759 | static noinline_for_stack |
1758 | int ext4_mb_try_best_found(struct ext4_allocation_context *ac, | 1760 | int ext4_mb_try_best_found(struct ext4_allocation_context *ac, |
1759 | struct ext4_buddy *e4b) | 1761 | struct ext4_buddy *e4b) |
1760 | { | 1762 | { |
1761 | struct ext4_free_extent ex = ac->ac_b_ex; | 1763 | struct ext4_free_extent ex = ac->ac_b_ex; |
1762 | ext4_group_t group = ex.fe_group; | 1764 | ext4_group_t group = ex.fe_group; |
1763 | int max; | 1765 | int max; |
1764 | int err; | 1766 | int err; |
1765 | 1767 | ||
1766 | BUG_ON(ex.fe_len <= 0); | 1768 | BUG_ON(ex.fe_len <= 0); |
1767 | err = ext4_mb_load_buddy(ac->ac_sb, group, e4b); | 1769 | err = ext4_mb_load_buddy(ac->ac_sb, group, e4b); |
1768 | if (err) | 1770 | if (err) |
1769 | return err; | 1771 | return err; |
1770 | 1772 | ||
1771 | ext4_lock_group(ac->ac_sb, group); | 1773 | ext4_lock_group(ac->ac_sb, group); |
1772 | max = mb_find_extent(e4b, ex.fe_start, ex.fe_len, &ex); | 1774 | max = mb_find_extent(e4b, ex.fe_start, ex.fe_len, &ex); |
1773 | 1775 | ||
1774 | if (max > 0) { | 1776 | if (max > 0) { |
1775 | ac->ac_b_ex = ex; | 1777 | ac->ac_b_ex = ex; |
1776 | ext4_mb_use_best_found(ac, e4b); | 1778 | ext4_mb_use_best_found(ac, e4b); |
1777 | } | 1779 | } |
1778 | 1780 | ||
1779 | ext4_unlock_group(ac->ac_sb, group); | 1781 | ext4_unlock_group(ac->ac_sb, group); |
1780 | ext4_mb_unload_buddy(e4b); | 1782 | ext4_mb_unload_buddy(e4b); |
1781 | 1783 | ||
1782 | return 0; | 1784 | return 0; |
1783 | } | 1785 | } |
1784 | 1786 | ||
1785 | static noinline_for_stack | 1787 | static noinline_for_stack |
1786 | int ext4_mb_find_by_goal(struct ext4_allocation_context *ac, | 1788 | int ext4_mb_find_by_goal(struct ext4_allocation_context *ac, |
1787 | struct ext4_buddy *e4b) | 1789 | struct ext4_buddy *e4b) |
1788 | { | 1790 | { |
1789 | ext4_group_t group = ac->ac_g_ex.fe_group; | 1791 | ext4_group_t group = ac->ac_g_ex.fe_group; |
1790 | int max; | 1792 | int max; |
1791 | int err; | 1793 | int err; |
1792 | struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); | 1794 | struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); |
1793 | struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group); | 1795 | struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group); |
1794 | struct ext4_free_extent ex; | 1796 | struct ext4_free_extent ex; |
1795 | 1797 | ||
1796 | if (!(ac->ac_flags & EXT4_MB_HINT_TRY_GOAL)) | 1798 | if (!(ac->ac_flags & EXT4_MB_HINT_TRY_GOAL)) |
1797 | return 0; | 1799 | return 0; |
1798 | if (grp->bb_free == 0) | 1800 | if (grp->bb_free == 0) |
1799 | return 0; | 1801 | return 0; |
1800 | 1802 | ||
1801 | err = ext4_mb_load_buddy(ac->ac_sb, group, e4b); | 1803 | err = ext4_mb_load_buddy(ac->ac_sb, group, e4b); |
1802 | if (err) | 1804 | if (err) |
1803 | return err; | 1805 | return err; |
1804 | 1806 | ||
1805 | if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) { | 1807 | if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) { |
1806 | ext4_mb_unload_buddy(e4b); | 1808 | ext4_mb_unload_buddy(e4b); |
1807 | return 0; | 1809 | return 0; |
1808 | } | 1810 | } |
1809 | 1811 | ||
1810 | ext4_lock_group(ac->ac_sb, group); | 1812 | ext4_lock_group(ac->ac_sb, group); |
1811 | max = mb_find_extent(e4b, ac->ac_g_ex.fe_start, | 1813 | max = mb_find_extent(e4b, ac->ac_g_ex.fe_start, |
1812 | ac->ac_g_ex.fe_len, &ex); | 1814 | ac->ac_g_ex.fe_len, &ex); |
1813 | 1815 | ||
1814 | if (max >= ac->ac_g_ex.fe_len && ac->ac_g_ex.fe_len == sbi->s_stripe) { | 1816 | if (max >= ac->ac_g_ex.fe_len && ac->ac_g_ex.fe_len == sbi->s_stripe) { |
1815 | ext4_fsblk_t start; | 1817 | ext4_fsblk_t start; |
1816 | 1818 | ||
1817 | start = ext4_group_first_block_no(ac->ac_sb, e4b->bd_group) + | 1819 | start = ext4_group_first_block_no(ac->ac_sb, e4b->bd_group) + |
1818 | ex.fe_start; | 1820 | ex.fe_start; |
1819 | /* use do_div to get remainder (would be 64-bit modulo) */ | 1821 | /* use do_div to get remainder (would be 64-bit modulo) */ |
1820 | if (do_div(start, sbi->s_stripe) == 0) { | 1822 | if (do_div(start, sbi->s_stripe) == 0) { |
1821 | ac->ac_found++; | 1823 | ac->ac_found++; |
1822 | ac->ac_b_ex = ex; | 1824 | ac->ac_b_ex = ex; |
1823 | ext4_mb_use_best_found(ac, e4b); | 1825 | ext4_mb_use_best_found(ac, e4b); |
1824 | } | 1826 | } |
1825 | } else if (max >= ac->ac_g_ex.fe_len) { | 1827 | } else if (max >= ac->ac_g_ex.fe_len) { |
1826 | BUG_ON(ex.fe_len <= 0); | 1828 | BUG_ON(ex.fe_len <= 0); |
1827 | BUG_ON(ex.fe_group != ac->ac_g_ex.fe_group); | 1829 | BUG_ON(ex.fe_group != ac->ac_g_ex.fe_group); |
1828 | BUG_ON(ex.fe_start != ac->ac_g_ex.fe_start); | 1830 | BUG_ON(ex.fe_start != ac->ac_g_ex.fe_start); |
1829 | ac->ac_found++; | 1831 | ac->ac_found++; |
1830 | ac->ac_b_ex = ex; | 1832 | ac->ac_b_ex = ex; |
1831 | ext4_mb_use_best_found(ac, e4b); | 1833 | ext4_mb_use_best_found(ac, e4b); |
1832 | } else if (max > 0 && (ac->ac_flags & EXT4_MB_HINT_MERGE)) { | 1834 | } else if (max > 0 && (ac->ac_flags & EXT4_MB_HINT_MERGE)) { |
1833 | /* Sometimes, the caller may want to merge even a small | 1835 | /* Sometimes, the caller may want to merge even a small |
1834 | * number of blocks into an existing extent */ | 1836 | * number of blocks into an existing extent */ |
1835 | BUG_ON(ex.fe_len <= 0); | 1837 | BUG_ON(ex.fe_len <= 0); |
1836 | BUG_ON(ex.fe_group != ac->ac_g_ex.fe_group); | 1838 | BUG_ON(ex.fe_group != ac->ac_g_ex.fe_group); |
1837 | BUG_ON(ex.fe_start != ac->ac_g_ex.fe_start); | 1839 | BUG_ON(ex.fe_start != ac->ac_g_ex.fe_start); |
1838 | ac->ac_found++; | 1840 | ac->ac_found++; |
1839 | ac->ac_b_ex = ex; | 1841 | ac->ac_b_ex = ex; |
1840 | ext4_mb_use_best_found(ac, e4b); | 1842 | ext4_mb_use_best_found(ac, e4b); |
1841 | } | 1843 | } |
1842 | ext4_unlock_group(ac->ac_sb, group); | 1844 | ext4_unlock_group(ac->ac_sb, group); |
1843 | ext4_mb_unload_buddy(e4b); | 1845 | ext4_mb_unload_buddy(e4b); |
1844 | 1846 | ||
1845 | return 0; | 1847 | return 0; |
1846 | } | 1848 | } |
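For striped storage, the branch above only takes the goal extent when its filesystem-wide start block is a multiple of the stripe width; do_div() is the kernel's 64-bit divide-and-return-remainder helper. A plain C sketch of the same test (the block numbers are made up):

#include <stdio.h>
#include <stdint.h>

/* The check above: add the in-group offset to the group's first block
 * and accept the extent only if the result is stripe aligned. Plain
 * '%' stands in for do_div()'s remainder here. */
static int stripe_aligned(uint64_t group_first_block, uint32_t fe_start,
			  uint32_t stripe)
{
	uint64_t start = group_first_block + fe_start;

	return (start % stripe) == 0;
}

int main(void)
{
	/* hypothetical numbers: group starts at block 32768, stripe of 16 */
	printf("%d\n", stripe_aligned(32768, 48, 16));	/* 1: aligned */
	printf("%d\n", stripe_aligned(32768, 50, 16));	/* 0: not aligned */
	return 0;
}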
1847 | 1849 | ||
1848 | /* | 1850 | /* |
1849 | * The routine scans buddy structures (not bitmap!) from the given order | 1851 | * The routine scans buddy structures (not bitmap!) from the given order |
1850 | * up to the max order and tries to find a big enough chunk to satisfy the request | 1852 | * up to the max order and tries to find a big enough chunk to satisfy the request |
1851 | */ | 1853 | */ |
1852 | static noinline_for_stack | 1854 | static noinline_for_stack |
1853 | void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac, | 1855 | void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac, |
1854 | struct ext4_buddy *e4b) | 1856 | struct ext4_buddy *e4b) |
1855 | { | 1857 | { |
1856 | struct super_block *sb = ac->ac_sb; | 1858 | struct super_block *sb = ac->ac_sb; |
1857 | struct ext4_group_info *grp = e4b->bd_info; | 1859 | struct ext4_group_info *grp = e4b->bd_info; |
1858 | void *buddy; | 1860 | void *buddy; |
1859 | int i; | 1861 | int i; |
1860 | int k; | 1862 | int k; |
1861 | int max; | 1863 | int max; |
1862 | 1864 | ||
1863 | BUG_ON(ac->ac_2order <= 0); | 1865 | BUG_ON(ac->ac_2order <= 0); |
1864 | for (i = ac->ac_2order; i <= sb->s_blocksize_bits + 1; i++) { | 1866 | for (i = ac->ac_2order; i <= sb->s_blocksize_bits + 1; i++) { |
1865 | if (grp->bb_counters[i] == 0) | 1867 | if (grp->bb_counters[i] == 0) |
1866 | continue; | 1868 | continue; |
1867 | 1869 | ||
1868 | buddy = mb_find_buddy(e4b, i, &max); | 1870 | buddy = mb_find_buddy(e4b, i, &max); |
1869 | BUG_ON(buddy == NULL); | 1871 | BUG_ON(buddy == NULL); |
1870 | 1872 | ||
1871 | k = mb_find_next_zero_bit(buddy, max, 0); | 1873 | k = mb_find_next_zero_bit(buddy, max, 0); |
1872 | BUG_ON(k >= max); | 1874 | BUG_ON(k >= max); |
1873 | 1875 | ||
1874 | ac->ac_found++; | 1876 | ac->ac_found++; |
1875 | 1877 | ||
1876 | ac->ac_b_ex.fe_len = 1 << i; | 1878 | ac->ac_b_ex.fe_len = 1 << i; |
1877 | ac->ac_b_ex.fe_start = k << i; | 1879 | ac->ac_b_ex.fe_start = k << i; |
1878 | ac->ac_b_ex.fe_group = e4b->bd_group; | 1880 | ac->ac_b_ex.fe_group = e4b->bd_group; |
1879 | 1881 | ||
1880 | ext4_mb_use_best_found(ac, e4b); | 1882 | ext4_mb_use_best_found(ac, e4b); |
1881 | 1883 | ||
1882 | BUG_ON(ac->ac_b_ex.fe_len != ac->ac_g_ex.fe_len); | 1884 | BUG_ON(ac->ac_b_ex.fe_len != ac->ac_g_ex.fe_len); |
1883 | 1885 | ||
1884 | if (EXT4_SB(sb)->s_mb_stats) | 1886 | if (EXT4_SB(sb)->s_mb_stats) |
1885 | atomic_inc(&EXT4_SB(sb)->s_bal_2orders); | 1887 | atomic_inc(&EXT4_SB(sb)->s_bal_2orders); |
1886 | 1888 | ||
1887 | break; | 1889 | break; |
1888 | } | 1890 | } |
1889 | } | 1891 | } |
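ext4_mb_simple_scan_group() walks the per-order counters rather than the bitmap: starting at the requested order, the first order with a free chunk yields an extent of exactly 1 << order blocks. A toy version of that walk (the counter values are invented):

#include <stdio.h>

#define TOY_MAX_ORDER 8

/* Hypothetical per-order counters, like bb_counters[] above. */
static int toy_counters[TOY_MAX_ORDER + 1] = { 0, 0, 0, 4, 0, 1, 0, 0, 0 };

/* Return the lowest order >= wanted_order with a free chunk, or -1. */
static int find_order(int wanted_order)
{
	int i;

	for (i = wanted_order; i <= TOY_MAX_ORDER; i++)
		if (toy_counters[i])
			return i;
	return -1;
}

int main(void)
{
	int ord = find_order(4);

	if (ord >= 0)
		printf("found a free chunk of %d blocks (order %d)\n",
		       1 << ord, ord);
	return 0;
}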
1890 | 1892 | ||
1891 | /* | 1893 | /* |
1892 | * The routine scans the group and measures all found extents. | 1894 | * The routine scans the group and measures all found extents. |
1893 | * In order to optimize scanning, the caller must pass the number of | 1895 | * In order to optimize scanning, the caller must pass the number of |
1894 | * free blocks in the group, so the routine knows the upper limit. | 1896 | * free blocks in the group, so the routine knows the upper limit. |
1895 | */ | 1897 | */ |
1896 | static noinline_for_stack | 1898 | static noinline_for_stack |
1897 | void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac, | 1899 | void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac, |
1898 | struct ext4_buddy *e4b) | 1900 | struct ext4_buddy *e4b) |
1899 | { | 1901 | { |
1900 | struct super_block *sb = ac->ac_sb; | 1902 | struct super_block *sb = ac->ac_sb; |
1901 | void *bitmap = e4b->bd_bitmap; | 1903 | void *bitmap = e4b->bd_bitmap; |
1902 | struct ext4_free_extent ex; | 1904 | struct ext4_free_extent ex; |
1903 | int i; | 1905 | int i; |
1904 | int free; | 1906 | int free; |
1905 | 1907 | ||
1906 | free = e4b->bd_info->bb_free; | 1908 | free = e4b->bd_info->bb_free; |
1907 | BUG_ON(free <= 0); | 1909 | BUG_ON(free <= 0); |
1908 | 1910 | ||
1909 | i = e4b->bd_info->bb_first_free; | 1911 | i = e4b->bd_info->bb_first_free; |
1910 | 1912 | ||
1911 | while (free && ac->ac_status == AC_STATUS_CONTINUE) { | 1913 | while (free && ac->ac_status == AC_STATUS_CONTINUE) { |
1912 | i = mb_find_next_zero_bit(bitmap, | 1914 | i = mb_find_next_zero_bit(bitmap, |
1913 | EXT4_CLUSTERS_PER_GROUP(sb), i); | 1915 | EXT4_CLUSTERS_PER_GROUP(sb), i); |
1914 | if (i >= EXT4_CLUSTERS_PER_GROUP(sb)) { | 1916 | if (i >= EXT4_CLUSTERS_PER_GROUP(sb)) { |
1915 | /* | 1917 | /* |
1916 | * If we have a corrupt bitmap, we won't find any | 1918 | * If we have a corrupt bitmap, we won't find any |
1917 | * free blocks even though the group info says | 1919 | * free blocks even though the group info says |
1918 | * we have free blocks | 1920 | * we have free blocks |
1919 | */ | 1921 | */ |
1920 | ext4_grp_locked_error(sb, e4b->bd_group, 0, 0, | 1922 | ext4_grp_locked_error(sb, e4b->bd_group, 0, 0, |
1921 | "%d free clusters as per " | 1923 | "%d free clusters as per " |
1922 | "group info. But bitmap says 0", | 1924 | "group info. But bitmap says 0", |
1923 | free); | 1925 | free); |
1924 | break; | 1926 | break; |
1925 | } | 1927 | } |
1926 | 1928 | ||
1927 | mb_find_extent(e4b, i, ac->ac_g_ex.fe_len, &ex); | 1929 | mb_find_extent(e4b, i, ac->ac_g_ex.fe_len, &ex); |
1928 | BUG_ON(ex.fe_len <= 0); | 1930 | BUG_ON(ex.fe_len <= 0); |
1929 | if (free < ex.fe_len) { | 1931 | if (free < ex.fe_len) { |
1930 | ext4_grp_locked_error(sb, e4b->bd_group, 0, 0, | 1932 | ext4_grp_locked_error(sb, e4b->bd_group, 0, 0, |
1931 | "%d free clusters as per " | 1933 | "%d free clusters as per " |
1932 | "group info. But got %d blocks", | 1934 | "group info. But got %d blocks", |
1933 | free, ex.fe_len); | 1935 | free, ex.fe_len); |
1934 | /* | 1936 | /* |
1935 | * The number of free blocks differs. This most likely | 1937 | * The number of free blocks differs. This most likely |
1936 | * indicates that the bitmap is corrupt. So exit | 1938 | * indicates that the bitmap is corrupt. So exit |
1937 | * without claiming the space. | 1939 | * without claiming the space. |
1938 | */ | 1940 | */ |
1939 | break; | 1941 | break; |
1940 | } | 1942 | } |
1941 | 1943 | ||
1942 | ext4_mb_measure_extent(ac, &ex, e4b); | 1944 | ext4_mb_measure_extent(ac, &ex, e4b); |
1943 | 1945 | ||
1944 | i += ex.fe_len; | 1946 | i += ex.fe_len; |
1945 | free -= ex.fe_len; | 1947 | free -= ex.fe_len; |
1946 | } | 1948 | } |
1947 | 1949 | ||
1948 | ext4_mb_check_limits(ac, e4b, 1); | 1950 | ext4_mb_check_limits(ac, e4b, 1); |
1949 | } | 1951 | } |
1950 | 1952 | ||
1951 | /* | 1953 | /* |
1952 | * This is a special case for storages like raid5 | 1954 | * This is a special case for storages like raid5 |
1953 | * we try to find stripe-aligned chunks for stripe-size-multiple requests | 1955 | * we try to find stripe-aligned chunks for stripe-size-multiple requests |
1954 | */ | 1956 | */ |
1955 | static noinline_for_stack | 1957 | static noinline_for_stack |
1956 | void ext4_mb_scan_aligned(struct ext4_allocation_context *ac, | 1958 | void ext4_mb_scan_aligned(struct ext4_allocation_context *ac, |
1957 | struct ext4_buddy *e4b) | 1959 | struct ext4_buddy *e4b) |
1958 | { | 1960 | { |
1959 | struct super_block *sb = ac->ac_sb; | 1961 | struct super_block *sb = ac->ac_sb; |
1960 | struct ext4_sb_info *sbi = EXT4_SB(sb); | 1962 | struct ext4_sb_info *sbi = EXT4_SB(sb); |
1961 | void *bitmap = e4b->bd_bitmap; | 1963 | void *bitmap = e4b->bd_bitmap; |
1962 | struct ext4_free_extent ex; | 1964 | struct ext4_free_extent ex; |
1963 | ext4_fsblk_t first_group_block; | 1965 | ext4_fsblk_t first_group_block; |
1964 | ext4_fsblk_t a; | 1966 | ext4_fsblk_t a; |
1965 | ext4_grpblk_t i; | 1967 | ext4_grpblk_t i; |
1966 | int max; | 1968 | int max; |
1967 | 1969 | ||
1968 | BUG_ON(sbi->s_stripe == 0); | 1970 | BUG_ON(sbi->s_stripe == 0); |
1969 | 1971 | ||
1970 | /* find first stripe-aligned block in group */ | 1972 | /* find first stripe-aligned block in group */ |
1971 | first_group_block = ext4_group_first_block_no(sb, e4b->bd_group); | 1973 | first_group_block = ext4_group_first_block_no(sb, e4b->bd_group); |
1972 | 1974 | ||
1973 | a = first_group_block + sbi->s_stripe - 1; | 1975 | a = first_group_block + sbi->s_stripe - 1; |
1974 | do_div(a, sbi->s_stripe); | 1976 | do_div(a, sbi->s_stripe); |
1975 | i = (a * sbi->s_stripe) - first_group_block; | 1977 | i = (a * sbi->s_stripe) - first_group_block; |
1976 | 1978 | ||
1977 | while (i < EXT4_CLUSTERS_PER_GROUP(sb)) { | 1979 | while (i < EXT4_CLUSTERS_PER_GROUP(sb)) { |
1978 | if (!mb_test_bit(i, bitmap)) { | 1980 | if (!mb_test_bit(i, bitmap)) { |
1979 | max = mb_find_extent(e4b, i, sbi->s_stripe, &ex); | 1981 | max = mb_find_extent(e4b, i, sbi->s_stripe, &ex); |
1980 | if (max >= sbi->s_stripe) { | 1982 | if (max >= sbi->s_stripe) { |
1981 | ac->ac_found++; | 1983 | ac->ac_found++; |
1982 | ac->ac_b_ex = ex; | 1984 | ac->ac_b_ex = ex; |
1983 | ext4_mb_use_best_found(ac, e4b); | 1985 | ext4_mb_use_best_found(ac, e4b); |
1984 | break; | 1986 | break; |
1985 | } | 1987 | } |
1986 | } | 1988 | } |
1987 | i += sbi->s_stripe; | 1989 | i += sbi->s_stripe; |
1988 | } | 1990 | } |
1989 | } | 1991 | } |
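The arithmetic at the top of ext4_mb_scan_aligned() rounds the group's first block up to the next stripe boundary and converts it back to an in-group offset, so the scan only ever probes stripe-aligned positions. A small sketch with made-up numbers:

#include <stdio.h>
#include <stdint.h>

/* Round first_group_block up to a stripe multiple and return the
 * corresponding in-group offset, as the code above does with do_div(). */
static uint32_t first_aligned_offset(uint64_t first_group_block, uint32_t stripe)
{
	uint64_t a = (first_group_block + stripe - 1) / stripe;	/* round up */

	return (uint32_t)(a * stripe - first_group_block);
}

int main(void)
{
	/* hypothetical: group starts at block 32770 with a stripe of 16 */
	printf("start scanning at in-group offset %u\n",
	       first_aligned_offset(32770, 16));	/* 14, i.e. block 32784 */
	return 0;
}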
1990 | 1992 | ||
1991 | /* This is now called BEFORE we load the buddy bitmap. */ | 1993 | /* This is now called BEFORE we load the buddy bitmap. */ |
1992 | static int ext4_mb_good_group(struct ext4_allocation_context *ac, | 1994 | static int ext4_mb_good_group(struct ext4_allocation_context *ac, |
1993 | ext4_group_t group, int cr) | 1995 | ext4_group_t group, int cr) |
1994 | { | 1996 | { |
1995 | unsigned free, fragments; | 1997 | unsigned free, fragments; |
1996 | int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb)); | 1998 | int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb)); |
1997 | struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group); | 1999 | struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group); |
1998 | 2000 | ||
1999 | BUG_ON(cr < 0 || cr >= 4); | 2001 | BUG_ON(cr < 0 || cr >= 4); |
2000 | 2002 | ||
2001 | free = grp->bb_free; | 2003 | free = grp->bb_free; |
2002 | if (free == 0) | 2004 | if (free == 0) |
2003 | return 0; | 2005 | return 0; |
2004 | if (cr <= 2 && free < ac->ac_g_ex.fe_len) | 2006 | if (cr <= 2 && free < ac->ac_g_ex.fe_len) |
2005 | return 0; | 2007 | return 0; |
2006 | 2008 | ||
2007 | if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp))) | 2009 | if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp))) |
2008 | return 0; | 2010 | return 0; |
2009 | 2011 | ||
2010 | /* We only do this if the grp has never been initialized */ | 2012 | /* We only do this if the grp has never been initialized */ |
2011 | if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { | 2013 | if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { |
2012 | int ret = ext4_mb_init_group(ac->ac_sb, group); | 2014 | int ret = ext4_mb_init_group(ac->ac_sb, group); |
2013 | if (ret) | 2015 | if (ret) |
2014 | return 0; | 2016 | return 0; |
2015 | } | 2017 | } |
2016 | 2018 | ||
2017 | fragments = grp->bb_fragments; | 2019 | fragments = grp->bb_fragments; |
2018 | if (fragments == 0) | 2020 | if (fragments == 0) |
2019 | return 0; | 2021 | return 0; |
2020 | 2022 | ||
2021 | switch (cr) { | 2023 | switch (cr) { |
2022 | case 0: | 2024 | case 0: |
2023 | BUG_ON(ac->ac_2order == 0); | 2025 | BUG_ON(ac->ac_2order == 0); |
2024 | 2026 | ||
2025 | /* Avoid using the first bg of a flexgroup for data files */ | 2027 | /* Avoid using the first bg of a flexgroup for data files */ |
2026 | if ((ac->ac_flags & EXT4_MB_HINT_DATA) && | 2028 | if ((ac->ac_flags & EXT4_MB_HINT_DATA) && |
2027 | (flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) && | 2029 | (flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) && |
2028 | ((group % flex_size) == 0)) | 2030 | ((group % flex_size) == 0)) |
2029 | return 0; | 2031 | return 0; |
2030 | 2032 | ||
2031 | if ((ac->ac_2order > ac->ac_sb->s_blocksize_bits+1) || | 2033 | if ((ac->ac_2order > ac->ac_sb->s_blocksize_bits+1) || |
2032 | (free / fragments) >= ac->ac_g_ex.fe_len) | 2034 | (free / fragments) >= ac->ac_g_ex.fe_len) |
2033 | return 1; | 2035 | return 1; |
2034 | 2036 | ||
2035 | if (grp->bb_largest_free_order < ac->ac_2order) | 2037 | if (grp->bb_largest_free_order < ac->ac_2order) |
2036 | return 0; | 2038 | return 0; |
2037 | 2039 | ||
2038 | return 1; | 2040 | return 1; |
2039 | case 1: | 2041 | case 1: |
2040 | if ((free / fragments) >= ac->ac_g_ex.fe_len) | 2042 | if ((free / fragments) >= ac->ac_g_ex.fe_len) |
2041 | return 1; | 2043 | return 1; |
2042 | break; | 2044 | break; |
2043 | case 2: | 2045 | case 2: |
2044 | if (free >= ac->ac_g_ex.fe_len) | 2046 | if (free >= ac->ac_g_ex.fe_len) |
2045 | return 1; | 2047 | return 1; |
2046 | break; | 2048 | break; |
2047 | case 3: | 2049 | case 3: |
2048 | return 1; | 2050 | return 1; |
2049 | default: | 2051 | default: |
2050 | BUG(); | 2052 | BUG(); |
2051 | } | 2053 | } |
2052 | 2054 | ||
2053 | return 0; | 2055 | return 0; |
2054 | } | 2056 | } |
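The cr levels above act as progressively weaker filters on which groups are worth loading a buddy bitmap for. A condensed sketch of the four criteria (field names are simplified, and the flex_bg and need-init checks are omitted):

#include <stdbool.h>
#include <stdio.h>

struct toy_grp { unsigned int free, fragments, largest_free_order; };

/* cr 0: want a single free buddy chunk of exactly the goal order.
 * cr 1: the average free chunk must be at least the goal length.
 * cr 2: total free space must cover the goal.
 * cr 3: anything with some free space will do. */
static bool good_group(const struct toy_grp *g, unsigned int goal_len,
		       unsigned int goal_order, int cr)
{
	if (g->free == 0 || g->fragments == 0)
		return false;
	switch (cr) {
	case 0:
		return g->largest_free_order >= goal_order;
	case 1:
		return g->free / g->fragments >= goal_len;
	case 2:
		return g->free >= goal_len;
	default:
		return true;
	}
}

int main(void)
{
	struct toy_grp g = { .free = 64, .fragments = 4, .largest_free_order = 3 };
	int cr;

	for (cr = 0; cr < 4; cr++)
		printf("cr=%d -> %d\n", cr, good_group(&g, 32, 5, cr));
	return 0;
}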
2055 | 2057 | ||
2056 | static noinline_for_stack int | 2058 | static noinline_for_stack int |
2057 | ext4_mb_regular_allocator(struct ext4_allocation_context *ac) | 2059 | ext4_mb_regular_allocator(struct ext4_allocation_context *ac) |
2058 | { | 2060 | { |
2059 | ext4_group_t ngroups, group, i; | 2061 | ext4_group_t ngroups, group, i; |
2060 | int cr; | 2062 | int cr; |
2061 | int err = 0; | 2063 | int err = 0; |
2062 | struct ext4_sb_info *sbi; | 2064 | struct ext4_sb_info *sbi; |
2063 | struct super_block *sb; | 2065 | struct super_block *sb; |
2064 | struct ext4_buddy e4b; | 2066 | struct ext4_buddy e4b; |
2065 | 2067 | ||
2066 | sb = ac->ac_sb; | 2068 | sb = ac->ac_sb; |
2067 | sbi = EXT4_SB(sb); | 2069 | sbi = EXT4_SB(sb); |
2068 | ngroups = ext4_get_groups_count(sb); | 2070 | ngroups = ext4_get_groups_count(sb); |
2069 | /* non-extent files are limited to low blocks/groups */ | 2071 | /* non-extent files are limited to low blocks/groups */ |
2070 | if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))) | 2072 | if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))) |
2071 | ngroups = sbi->s_blockfile_groups; | 2073 | ngroups = sbi->s_blockfile_groups; |
2072 | 2074 | ||
2073 | BUG_ON(ac->ac_status == AC_STATUS_FOUND); | 2075 | BUG_ON(ac->ac_status == AC_STATUS_FOUND); |
2074 | 2076 | ||
2075 | /* first, try the goal */ | 2077 | /* first, try the goal */ |
2076 | err = ext4_mb_find_by_goal(ac, &e4b); | 2078 | err = ext4_mb_find_by_goal(ac, &e4b); |
2077 | if (err || ac->ac_status == AC_STATUS_FOUND) | 2079 | if (err || ac->ac_status == AC_STATUS_FOUND) |
2078 | goto out; | 2080 | goto out; |
2079 | 2081 | ||
2080 | if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY)) | 2082 | if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY)) |
2081 | goto out; | 2083 | goto out; |
2082 | 2084 | ||
2083 | /* | 2085 | /* |
2084 | * ac->ac_2order is set only if the fe_len is a power of 2; | 2086 | * ac->ac_2order is set only if the fe_len is a power of 2; |
2085 | * if ac_2order is set we also set the criteria to 0 so that we | 2087 | * if ac_2order is set we also set the criteria to 0 so that we |
2086 | * try exact allocation using buddy. | 2088 | * try exact allocation using buddy. |
2087 | */ | 2089 | */ |
2088 | i = fls(ac->ac_g_ex.fe_len); | 2090 | i = fls(ac->ac_g_ex.fe_len); |
2089 | ac->ac_2order = 0; | 2091 | ac->ac_2order = 0; |
2090 | /* | 2092 | /* |
2091 | * We search using buddy data only if the order of the request | 2093 | * We search using buddy data only if the order of the request |
2092 | * is greater than or equal to sbi->s_mb_order2_reqs. | 2094 | * is greater than or equal to sbi->s_mb_order2_reqs. |
2093 | * You can tune it via /sys/fs/ext4/<partition>/mb_order2_req | 2095 | * You can tune it via /sys/fs/ext4/<partition>/mb_order2_req |
2094 | */ | 2096 | */ |
2095 | if (i >= sbi->s_mb_order2_reqs) { | 2097 | if (i >= sbi->s_mb_order2_reqs) { |
2096 | /* | 2098 | /* |
2097 | * This should tell if fe_len is exactly power of 2 | 2099 | * This should tell if fe_len is exactly power of 2 |
2098 | */ | 2100 | */ |
2099 | if ((ac->ac_g_ex.fe_len & (~(1 << (i - 1)))) == 0) | 2101 | if ((ac->ac_g_ex.fe_len & (~(1 << (i - 1)))) == 0) |
2100 | ac->ac_2order = i - 1; | 2102 | ac->ac_2order = i - 1; |
2101 | } | 2103 | } |
2102 | 2104 | ||
2103 | /* if stream allocation is enabled, use global goal */ | 2105 | /* if stream allocation is enabled, use global goal */ |
2104 | if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { | 2106 | if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { |
2105 | /* TBD: may be hot point */ | 2107 | /* TBD: may be hot point */ |
2106 | spin_lock(&sbi->s_md_lock); | 2108 | spin_lock(&sbi->s_md_lock); |
2107 | ac->ac_g_ex.fe_group = sbi->s_mb_last_group; | 2109 | ac->ac_g_ex.fe_group = sbi->s_mb_last_group; |
2108 | ac->ac_g_ex.fe_start = sbi->s_mb_last_start; | 2110 | ac->ac_g_ex.fe_start = sbi->s_mb_last_start; |
2109 | spin_unlock(&sbi->s_md_lock); | 2111 | spin_unlock(&sbi->s_md_lock); |
2110 | } | 2112 | } |
2111 | 2113 | ||
2112 | /* Let's just scan groups to find more or less suitable blocks */ | 2114 | /* Let's just scan groups to find more or less suitable blocks */ |
2113 | cr = ac->ac_2order ? 0 : 1; | 2115 | cr = ac->ac_2order ? 0 : 1; |
2114 | /* | 2116 | /* |
2115 | * cr == 0 try to get exact allocation, | 2117 | * cr == 0 try to get exact allocation, |
2116 | * cr == 3 try to get anything | 2118 | * cr == 3 try to get anything |
2117 | */ | 2119 | */ |
2118 | repeat: | 2120 | repeat: |
2119 | for (; cr < 4 && ac->ac_status == AC_STATUS_CONTINUE; cr++) { | 2121 | for (; cr < 4 && ac->ac_status == AC_STATUS_CONTINUE; cr++) { |
2120 | ac->ac_criteria = cr; | 2122 | ac->ac_criteria = cr; |
2121 | /* | 2123 | /* |
2122 | * searching for the right group start | 2124 | * searching for the right group start |
2123 | * from the goal value specified | 2125 | * from the goal value specified |
2124 | */ | 2126 | */ |
2125 | group = ac->ac_g_ex.fe_group; | 2127 | group = ac->ac_g_ex.fe_group; |
2126 | 2128 | ||
2127 | for (i = 0; i < ngroups; group++, i++) { | 2129 | for (i = 0; i < ngroups; group++, i++) { |
2128 | cond_resched(); | 2130 | cond_resched(); |
2129 | /* | 2131 | /* |
2130 | * Artificially restricted ngroups for non-extent | 2132 | * Artificially restricted ngroups for non-extent |
2131 | * files makes group > ngroups possible on first loop. | 2133 | * files makes group > ngroups possible on first loop. |
2132 | */ | 2134 | */ |
2133 | if (group >= ngroups) | 2135 | if (group >= ngroups) |
2134 | group = 0; | 2136 | group = 0; |
2135 | 2137 | ||
2136 | /* This now checks without needing the buddy page */ | 2138 | /* This now checks without needing the buddy page */ |
2137 | if (!ext4_mb_good_group(ac, group, cr)) | 2139 | if (!ext4_mb_good_group(ac, group, cr)) |
2138 | continue; | 2140 | continue; |
2139 | 2141 | ||
2140 | err = ext4_mb_load_buddy(sb, group, &e4b); | 2142 | err = ext4_mb_load_buddy(sb, group, &e4b); |
2141 | if (err) | 2143 | if (err) |
2142 | goto out; | 2144 | goto out; |
2143 | 2145 | ||
2144 | ext4_lock_group(sb, group); | 2146 | ext4_lock_group(sb, group); |
2145 | 2147 | ||
2146 | /* | 2148 | /* |
2147 | * We need to check again after locking the | 2149 | * We need to check again after locking the |
2148 | * block group | 2150 | * block group |
2149 | */ | 2151 | */ |
2150 | if (!ext4_mb_good_group(ac, group, cr)) { | 2152 | if (!ext4_mb_good_group(ac, group, cr)) { |
2151 | ext4_unlock_group(sb, group); | 2153 | ext4_unlock_group(sb, group); |
2152 | ext4_mb_unload_buddy(&e4b); | 2154 | ext4_mb_unload_buddy(&e4b); |
2153 | continue; | 2155 | continue; |
2154 | } | 2156 | } |
2155 | 2157 | ||
2156 | ac->ac_groups_scanned++; | 2158 | ac->ac_groups_scanned++; |
2157 | if (cr == 0 && ac->ac_2order < sb->s_blocksize_bits+2) | 2159 | if (cr == 0 && ac->ac_2order < sb->s_blocksize_bits+2) |
2158 | ext4_mb_simple_scan_group(ac, &e4b); | 2160 | ext4_mb_simple_scan_group(ac, &e4b); |
2159 | else if (cr == 1 && sbi->s_stripe && | 2161 | else if (cr == 1 && sbi->s_stripe && |
2160 | !(ac->ac_g_ex.fe_len % sbi->s_stripe)) | 2162 | !(ac->ac_g_ex.fe_len % sbi->s_stripe)) |
2161 | ext4_mb_scan_aligned(ac, &e4b); | 2163 | ext4_mb_scan_aligned(ac, &e4b); |
2162 | else | 2164 | else |
2163 | ext4_mb_complex_scan_group(ac, &e4b); | 2165 | ext4_mb_complex_scan_group(ac, &e4b); |
2164 | 2166 | ||
2165 | ext4_unlock_group(sb, group); | 2167 | ext4_unlock_group(sb, group); |
2166 | ext4_mb_unload_buddy(&e4b); | 2168 | ext4_mb_unload_buddy(&e4b); |
2167 | 2169 | ||
2168 | if (ac->ac_status != AC_STATUS_CONTINUE) | 2170 | if (ac->ac_status != AC_STATUS_CONTINUE) |
2169 | break; | 2171 | break; |
2170 | } | 2172 | } |
2171 | } | 2173 | } |
2172 | 2174 | ||
2173 | if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND && | 2175 | if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND && |
2174 | !(ac->ac_flags & EXT4_MB_HINT_FIRST)) { | 2176 | !(ac->ac_flags & EXT4_MB_HINT_FIRST)) { |
2175 | /* | 2177 | /* |
2176 | * We've been searching too long. Let's try to allocate | 2178 | * We've been searching too long. Let's try to allocate |
2177 | * the best chunk we've found so far | 2179 | * the best chunk we've found so far |
2178 | */ | 2180 | */ |
2179 | 2181 | ||
2180 | ext4_mb_try_best_found(ac, &e4b); | 2182 | ext4_mb_try_best_found(ac, &e4b); |
2181 | if (ac->ac_status != AC_STATUS_FOUND) { | 2183 | if (ac->ac_status != AC_STATUS_FOUND) { |
2182 | /* | 2184 | /* |
2183 | * Someone luckier has already allocated it. | 2185 | * Someone luckier has already allocated it. |
2184 | * The only thing we can do is just take the first | 2186 | * The only thing we can do is just take the first |
2185 | * block(s) found | 2187 | * block(s) found |
2186 | printk(KERN_DEBUG "EXT4-fs: someone won our chunk\n"); | 2188 | printk(KERN_DEBUG "EXT4-fs: someone won our chunk\n"); |
2187 | */ | 2189 | */ |
2188 | ac->ac_b_ex.fe_group = 0; | 2190 | ac->ac_b_ex.fe_group = 0; |
2189 | ac->ac_b_ex.fe_start = 0; | 2191 | ac->ac_b_ex.fe_start = 0; |
2190 | ac->ac_b_ex.fe_len = 0; | 2192 | ac->ac_b_ex.fe_len = 0; |
2191 | ac->ac_status = AC_STATUS_CONTINUE; | 2193 | ac->ac_status = AC_STATUS_CONTINUE; |
2192 | ac->ac_flags |= EXT4_MB_HINT_FIRST; | 2194 | ac->ac_flags |= EXT4_MB_HINT_FIRST; |
2193 | cr = 3; | 2195 | cr = 3; |
2194 | atomic_inc(&sbi->s_mb_lost_chunks); | 2196 | atomic_inc(&sbi->s_mb_lost_chunks); |
2195 | goto repeat; | 2197 | goto repeat; |
2196 | } | 2198 | } |
2197 | } | 2199 | } |
2198 | out: | 2200 | out: |
2199 | return err; | 2201 | return err; |
2200 | } | 2202 | } |
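The fls()-based test in the allocator above decides whether the goal length is an exact power of two, in which case ac_2order is set and the cr == 0 buddy scan can be used. A portable sketch of that test (toy_fls() stands in for the kernel's fls()):

#include <stdio.h>

/* Minimal fls(): index of the highest set bit, 1-based; 0 if v == 0. */
static int toy_fls(unsigned int v)
{
	int i = 0;

	while (v) {
		v >>= 1;
		i++;
	}
	return i;
}

/* len is an exact power of two iff clearing its top bit leaves zero;
 * the usable buddy order is then i - 1, exactly as computed above. */
static int goal_order(unsigned int len)
{
	int i = toy_fls(len);

	if (i && (len & ~(1U << (i - 1))) == 0)
		return i - 1;
	return 0;	/* not a power of two: no order-based scan */
}

int main(void)
{
	printf("len=64 -> order %d\n", goal_order(64));	/* 6 */
	printf("len=96 -> order %d\n", goal_order(96));	/* 0 */
	return 0;
}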
2201 | 2203 | ||
2202 | static void *ext4_mb_seq_groups_start(struct seq_file *seq, loff_t *pos) | 2204 | static void *ext4_mb_seq_groups_start(struct seq_file *seq, loff_t *pos) |
2203 | { | 2205 | { |
2204 | struct super_block *sb = seq->private; | 2206 | struct super_block *sb = seq->private; |
2205 | ext4_group_t group; | 2207 | ext4_group_t group; |
2206 | 2208 | ||
2207 | if (*pos < 0 || *pos >= ext4_get_groups_count(sb)) | 2209 | if (*pos < 0 || *pos >= ext4_get_groups_count(sb)) |
2208 | return NULL; | 2210 | return NULL; |
2209 | group = *pos + 1; | 2211 | group = *pos + 1; |
2210 | return (void *) ((unsigned long) group); | 2212 | return (void *) ((unsigned long) group); |
2211 | } | 2213 | } |
2212 | 2214 | ||
2213 | static void *ext4_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos) | 2215 | static void *ext4_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos) |
2214 | { | 2216 | { |
2215 | struct super_block *sb = seq->private; | 2217 | struct super_block *sb = seq->private; |
2216 | ext4_group_t group; | 2218 | ext4_group_t group; |
2217 | 2219 | ||
2218 | ++*pos; | 2220 | ++*pos; |
2219 | if (*pos < 0 || *pos >= ext4_get_groups_count(sb)) | 2221 | if (*pos < 0 || *pos >= ext4_get_groups_count(sb)) |
2220 | return NULL; | 2222 | return NULL; |
2221 | group = *pos + 1; | 2223 | group = *pos + 1; |
2222 | return (void *) ((unsigned long) group); | 2224 | return (void *) ((unsigned long) group); |
2223 | } | 2225 | } |
2224 | 2226 | ||
2225 | static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v) | 2227 | static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v) |
2226 | { | 2228 | { |
2227 | struct super_block *sb = seq->private; | 2229 | struct super_block *sb = seq->private; |
2228 | ext4_group_t group = (ext4_group_t) ((unsigned long) v); | 2230 | ext4_group_t group = (ext4_group_t) ((unsigned long) v); |
2229 | int i; | 2231 | int i; |
2230 | int err, buddy_loaded = 0; | 2232 | int err, buddy_loaded = 0; |
2231 | struct ext4_buddy e4b; | 2233 | struct ext4_buddy e4b; |
2232 | struct ext4_group_info *grinfo; | 2234 | struct ext4_group_info *grinfo; |
2233 | struct sg { | 2235 | struct sg { |
2234 | struct ext4_group_info info; | 2236 | struct ext4_group_info info; |
2235 | ext4_grpblk_t counters[16]; | 2237 | ext4_grpblk_t counters[16]; |
2236 | } sg; | 2238 | } sg; |
2237 | 2239 | ||
2238 | group--; | 2240 | group--; |
2239 | if (group == 0) | 2241 | if (group == 0) |
2240 | seq_printf(seq, "#%-5s: %-5s %-5s %-5s " | 2242 | seq_printf(seq, "#%-5s: %-5s %-5s %-5s " |
2241 | "[ %-5s %-5s %-5s %-5s %-5s %-5s %-5s " | 2243 | "[ %-5s %-5s %-5s %-5s %-5s %-5s %-5s " |
2242 | "%-5s %-5s %-5s %-5s %-5s %-5s %-5s ]\n", | 2244 | "%-5s %-5s %-5s %-5s %-5s %-5s %-5s ]\n", |
2243 | "group", "free", "frags", "first", | 2245 | "group", "free", "frags", "first", |
2244 | "2^0", "2^1", "2^2", "2^3", "2^4", "2^5", "2^6", | 2246 | "2^0", "2^1", "2^2", "2^3", "2^4", "2^5", "2^6", |
2245 | "2^7", "2^8", "2^9", "2^10", "2^11", "2^12", "2^13"); | 2247 | "2^7", "2^8", "2^9", "2^10", "2^11", "2^12", "2^13"); |
2246 | 2248 | ||
2247 | i = (sb->s_blocksize_bits + 2) * sizeof(sg.info.bb_counters[0]) + | 2249 | i = (sb->s_blocksize_bits + 2) * sizeof(sg.info.bb_counters[0]) + |
2248 | sizeof(struct ext4_group_info); | 2250 | sizeof(struct ext4_group_info); |
2249 | grinfo = ext4_get_group_info(sb, group); | 2251 | grinfo = ext4_get_group_info(sb, group); |
2250 | /* Load the group info in memory only if not already loaded. */ | 2252 | /* Load the group info in memory only if not already loaded. */ |
2251 | if (unlikely(EXT4_MB_GRP_NEED_INIT(grinfo))) { | 2253 | if (unlikely(EXT4_MB_GRP_NEED_INIT(grinfo))) { |
2252 | err = ext4_mb_load_buddy(sb, group, &e4b); | 2254 | err = ext4_mb_load_buddy(sb, group, &e4b); |
2253 | if (err) { | 2255 | if (err) { |
2254 | seq_printf(seq, "#%-5u: I/O error\n", group); | 2256 | seq_printf(seq, "#%-5u: I/O error\n", group); |
2255 | return 0; | 2257 | return 0; |
2256 | } | 2258 | } |
2257 | buddy_loaded = 1; | 2259 | buddy_loaded = 1; |
2258 | } | 2260 | } |
2259 | 2261 | ||
2260 | memcpy(&sg, ext4_get_group_info(sb, group), i); | 2262 | memcpy(&sg, ext4_get_group_info(sb, group), i); |
2261 | 2263 | ||
2262 | if (buddy_loaded) | 2264 | if (buddy_loaded) |
2263 | ext4_mb_unload_buddy(&e4b); | 2265 | ext4_mb_unload_buddy(&e4b); |
2264 | 2266 | ||
2265 | seq_printf(seq, "#%-5u: %-5u %-5u %-5u [", group, sg.info.bb_free, | 2267 | seq_printf(seq, "#%-5u: %-5u %-5u %-5u [", group, sg.info.bb_free, |
2266 | sg.info.bb_fragments, sg.info.bb_first_free); | 2268 | sg.info.bb_fragments, sg.info.bb_first_free); |
2267 | for (i = 0; i <= 13; i++) | 2269 | for (i = 0; i <= 13; i++) |
2268 | seq_printf(seq, " %-5u", i <= sb->s_blocksize_bits + 1 ? | 2270 | seq_printf(seq, " %-5u", i <= sb->s_blocksize_bits + 1 ? |
2269 | sg.info.bb_counters[i] : 0); | 2271 | sg.info.bb_counters[i] : 0); |
2270 | seq_printf(seq, " ]\n"); | 2272 | seq_printf(seq, " ]\n"); |
2271 | 2273 | ||
2272 | return 0; | 2274 | return 0; |
2273 | } | 2275 | } |
2274 | 2276 | ||
2275 | static void ext4_mb_seq_groups_stop(struct seq_file *seq, void *v) | 2277 | static void ext4_mb_seq_groups_stop(struct seq_file *seq, void *v) |
2276 | { | 2278 | { |
2277 | } | 2279 | } |
2278 | 2280 | ||
2279 | static const struct seq_operations ext4_mb_seq_groups_ops = { | 2281 | static const struct seq_operations ext4_mb_seq_groups_ops = { |
2280 | .start = ext4_mb_seq_groups_start, | 2282 | .start = ext4_mb_seq_groups_start, |
2281 | .next = ext4_mb_seq_groups_next, | 2283 | .next = ext4_mb_seq_groups_next, |
2282 | .stop = ext4_mb_seq_groups_stop, | 2284 | .stop = ext4_mb_seq_groups_stop, |
2283 | .show = ext4_mb_seq_groups_show, | 2285 | .show = ext4_mb_seq_groups_show, |
2284 | }; | 2286 | }; |
2285 | 2287 | ||
2286 | static int ext4_mb_seq_groups_open(struct inode *inode, struct file *file) | 2288 | static int ext4_mb_seq_groups_open(struct inode *inode, struct file *file) |
2287 | { | 2289 | { |
2288 | struct super_block *sb = PDE_DATA(inode); | 2290 | struct super_block *sb = PDE_DATA(inode); |
2289 | int rc; | 2291 | int rc; |
2290 | 2292 | ||
2291 | rc = seq_open(file, &ext4_mb_seq_groups_ops); | 2293 | rc = seq_open(file, &ext4_mb_seq_groups_ops); |
2292 | if (rc == 0) { | 2294 | if (rc == 0) { |
2293 | struct seq_file *m = file->private_data; | 2295 | struct seq_file *m = file->private_data; |
2294 | m->private = sb; | 2296 | m->private = sb; |
2295 | } | 2297 | } |
2296 | return rc; | 2298 | return rc; |
2297 | 2299 | ||
2298 | } | 2300 | } |
2299 | 2301 | ||
2300 | static const struct file_operations ext4_mb_seq_groups_fops = { | 2302 | static const struct file_operations ext4_mb_seq_groups_fops = { |
2301 | .owner = THIS_MODULE, | 2303 | .owner = THIS_MODULE, |
2302 | .open = ext4_mb_seq_groups_open, | 2304 | .open = ext4_mb_seq_groups_open, |
2303 | .read = seq_read, | 2305 | .read = seq_read, |
2304 | .llseek = seq_lseek, | 2306 | .llseek = seq_lseek, |
2305 | .release = seq_release, | 2307 | .release = seq_release, |
2306 | }; | 2308 | }; |
2307 | 2309 | ||
2308 | static struct kmem_cache *get_groupinfo_cache(int blocksize_bits) | 2310 | static struct kmem_cache *get_groupinfo_cache(int blocksize_bits) |
2309 | { | 2311 | { |
2310 | int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE; | 2312 | int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE; |
2311 | struct kmem_cache *cachep = ext4_groupinfo_caches[cache_index]; | 2313 | struct kmem_cache *cachep = ext4_groupinfo_caches[cache_index]; |
2312 | 2314 | ||
2313 | BUG_ON(!cachep); | 2315 | BUG_ON(!cachep); |
2314 | return cachep; | 2316 | return cachep; |
2315 | } | 2317 | } |
2316 | 2318 | ||
2317 | /* | 2319 | /* |
2318 | * Allocate the top-level s_group_info array for the specified number | 2320 | * Allocate the top-level s_group_info array for the specified number |
2319 | * of groups | 2321 | * of groups |
2320 | */ | 2322 | */ |
2321 | int ext4_mb_alloc_groupinfo(struct super_block *sb, ext4_group_t ngroups) | 2323 | int ext4_mb_alloc_groupinfo(struct super_block *sb, ext4_group_t ngroups) |
2322 | { | 2324 | { |
2323 | struct ext4_sb_info *sbi = EXT4_SB(sb); | 2325 | struct ext4_sb_info *sbi = EXT4_SB(sb); |
2324 | unsigned size; | 2326 | unsigned size; |
2325 | struct ext4_group_info ***new_groupinfo; | 2327 | struct ext4_group_info ***new_groupinfo; |
2326 | 2328 | ||
2327 | size = (ngroups + EXT4_DESC_PER_BLOCK(sb) - 1) >> | 2329 | size = (ngroups + EXT4_DESC_PER_BLOCK(sb) - 1) >> |
2328 | EXT4_DESC_PER_BLOCK_BITS(sb); | 2330 | EXT4_DESC_PER_BLOCK_BITS(sb); |
2329 | if (size <= sbi->s_group_info_size) | 2331 | if (size <= sbi->s_group_info_size) |
2330 | return 0; | 2332 | return 0; |
2331 | 2333 | ||
2332 | size = roundup_pow_of_two(sizeof(*sbi->s_group_info) * size); | 2334 | size = roundup_pow_of_two(sizeof(*sbi->s_group_info) * size); |
2333 | new_groupinfo = ext4_kvzalloc(size, GFP_KERNEL); | 2335 | new_groupinfo = ext4_kvzalloc(size, GFP_KERNEL); |
2334 | if (!new_groupinfo) { | 2336 | if (!new_groupinfo) { |
2335 | ext4_msg(sb, KERN_ERR, "can't allocate buddy meta group"); | 2337 | ext4_msg(sb, KERN_ERR, "can't allocate buddy meta group"); |
2336 | return -ENOMEM; | 2338 | return -ENOMEM; |
2337 | } | 2339 | } |
2338 | if (sbi->s_group_info) { | 2340 | if (sbi->s_group_info) { |
2339 | memcpy(new_groupinfo, sbi->s_group_info, | 2341 | memcpy(new_groupinfo, sbi->s_group_info, |
2340 | sbi->s_group_info_size * sizeof(*sbi->s_group_info)); | 2342 | sbi->s_group_info_size * sizeof(*sbi->s_group_info)); |
2341 | ext4_kvfree(sbi->s_group_info); | 2343 | ext4_kvfree(sbi->s_group_info); |
2342 | } | 2344 | } |
2343 | sbi->s_group_info = new_groupinfo; | 2345 | sbi->s_group_info = new_groupinfo; |
2344 | sbi->s_group_info_size = size / sizeof(*sbi->s_group_info); | 2346 | sbi->s_group_info_size = size / sizeof(*sbi->s_group_info); |
2345 | ext4_debug("allocated s_groupinfo array for %d meta_bg's\n", | 2347 | ext4_debug("allocated s_groupinfo array for %d meta_bg's\n", |
2346 | sbi->s_group_info_size); | 2348 | sbi->s_group_info_size); |
2347 | return 0; | 2349 | return 0; |
2348 | } | 2350 | } |
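ext4_mb_alloc_groupinfo() grows the top-level array by allocating a new table rounded up to a power-of-two size, copying the old pointer blocks across, and only then freeing the old table, so the array can be resized online. A simplified user-space sketch of that grow-and-copy pattern (void * stands in for the ext4_group_info pointer blocks):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static size_t roundup_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Allocate a larger, power-of-two-sized table, copy the old entries,
 * then free the old table, in the same order as the code above. */
static void **grow_table(void **old, size_t old_n, size_t new_n)
{
	size_t bytes = roundup_pow2(new_n * sizeof(void *));
	void **new_tbl = calloc(1, bytes);

	if (!new_tbl)
		return NULL;
	if (old)
		memcpy(new_tbl, old, old_n * sizeof(void *));
	free(old);
	return new_tbl;
}

int main(void)
{
	void **tbl = grow_table(NULL, 0, 3);

	tbl = grow_table(tbl, 3, 10);	/* resize keeps the first 3 slots */
	printf("%s\n", tbl ? "resized" : "allocation failed");
	free(tbl);
	return 0;
}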
2349 | 2351 | ||
2350 | /* Create and initialize ext4_group_info data for the given group. */ | 2352 | /* Create and initialize ext4_group_info data for the given group. */ |
2351 | int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group, | 2353 | int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group, |
2352 | struct ext4_group_desc *desc) | 2354 | struct ext4_group_desc *desc) |
2353 | { | 2355 | { |
2354 | int i; | 2356 | int i; |
2355 | int metalen = 0; | 2357 | int metalen = 0; |
2356 | struct ext4_sb_info *sbi = EXT4_SB(sb); | 2358 | struct ext4_sb_info *sbi = EXT4_SB(sb); |
        struct ext4_group_info **meta_group_info;
        struct kmem_cache *cachep = get_groupinfo_cache(sb->s_blocksize_bits);

        /*
         * First check if this group is the first of a reserved block.
         * If it's true, we have to allocate a new table of pointers
         * to ext4_group_info structures
         */
        if (group % EXT4_DESC_PER_BLOCK(sb) == 0) {
                metalen = sizeof(*meta_group_info) <<
                        EXT4_DESC_PER_BLOCK_BITS(sb);
                meta_group_info = kmalloc(metalen, GFP_KERNEL);
                if (meta_group_info == NULL) {
                        ext4_msg(sb, KERN_ERR, "can't allocate mem "
                                 "for a buddy group");
                        goto exit_meta_group_info;
                }
                sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)] =
                        meta_group_info;
        }

        meta_group_info =
                sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)];
        i = group & (EXT4_DESC_PER_BLOCK(sb) - 1);

        meta_group_info[i] = kmem_cache_zalloc(cachep, GFP_KERNEL);
        if (meta_group_info[i] == NULL) {
                ext4_msg(sb, KERN_ERR, "can't allocate buddy mem");
                goto exit_group_info;
        }
        set_bit(EXT4_GROUP_INFO_NEED_INIT_BIT,
                &(meta_group_info[i]->bb_state));

        /*
         * initialize bb_free to be able to skip
         * empty groups without initialization
         */
        if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
                meta_group_info[i]->bb_free =
                        ext4_free_clusters_after_init(sb, group, desc);
        } else {
                meta_group_info[i]->bb_free =
                        ext4_free_group_clusters(sb, desc);
        }

        INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
        init_rwsem(&meta_group_info[i]->alloc_sem);
        meta_group_info[i]->bb_free_root = RB_ROOT;
        meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */

#ifdef DOUBLE_CHECK
        {
                struct buffer_head *bh;
                meta_group_info[i]->bb_bitmap =
                        kmalloc(sb->s_blocksize, GFP_KERNEL);
                BUG_ON(meta_group_info[i]->bb_bitmap == NULL);
                bh = ext4_read_block_bitmap(sb, group);
                BUG_ON(bh == NULL);
                memcpy(meta_group_info[i]->bb_bitmap, bh->b_data,
                       sb->s_blocksize);
                put_bh(bh);
        }
#endif

        return 0;

exit_group_info:
        /* If a meta_group_info table has been allocated, release it now */
        if (group % EXT4_DESC_PER_BLOCK(sb) == 0) {
                kfree(sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)]);
                sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)] = NULL;
        }
exit_meta_group_info:
        return -ENOMEM;
} /* ext4_mb_add_groupinfo */

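/*
 * Worked example of the two-level s_group_info indexing used above
 * (assuming EXT4_DESC_PER_BLOCK(sb) == 128, e.g. 4k blocks with 32-byte
 * group descriptors): the outer index group >> EXT4_DESC_PER_BLOCK_BITS(sb)
 * picks a block-sized array of pointers, and group &
 * (EXT4_DESC_PER_BLOCK(sb) - 1) picks the slot inside it, so group 300
 * lands at s_group_info[2][44] (300 >> 7 == 2, 300 & 127 == 44).
 */
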
static int ext4_mb_init_backend(struct super_block *sb)
{
        ext4_group_t ngroups = ext4_get_groups_count(sb);
        ext4_group_t i;
        struct ext4_sb_info *sbi = EXT4_SB(sb);
        int err;
        struct ext4_group_desc *desc;
        struct kmem_cache *cachep;

        err = ext4_mb_alloc_groupinfo(sb, ngroups);
        if (err)
                return err;

        sbi->s_buddy_cache = new_inode(sb);
        if (sbi->s_buddy_cache == NULL) {
                ext4_msg(sb, KERN_ERR, "can't get new inode");
                goto err_freesgi;
        }
        /* To avoid potentially colliding with a valid on-disk inode number,
         * use EXT4_BAD_INO for the buddy cache inode number.  This inode is
         * not in the inode hash, so it should never be found by iget(), but
         * this will avoid confusion if it ever shows up during debugging. */
        sbi->s_buddy_cache->i_ino = EXT4_BAD_INO;
        EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
        for (i = 0; i < ngroups; i++) {
                desc = ext4_get_group_desc(sb, i, NULL);
                if (desc == NULL) {
                        ext4_msg(sb, KERN_ERR, "can't read descriptor %u", i);
                        goto err_freebuddy;
                }
                if (ext4_mb_add_groupinfo(sb, i, desc) != 0)
                        goto err_freebuddy;
        }

        return 0;

err_freebuddy:
        cachep = get_groupinfo_cache(sb->s_blocksize_bits);
        while (i-- > 0)
                kmem_cache_free(cachep, ext4_get_group_info(sb, i));
        i = sbi->s_group_info_size;
        while (i-- > 0)
                kfree(sbi->s_group_info[i]);
        iput(sbi->s_buddy_cache);
err_freesgi:
        ext4_kvfree(sbi->s_group_info);
        return -ENOMEM;
}

static void ext4_groupinfo_destroy_slabs(void)
{
        int i;

        for (i = 0; i < NR_GRPINFO_CACHES; i++) {
                if (ext4_groupinfo_caches[i])
                        kmem_cache_destroy(ext4_groupinfo_caches[i]);
                ext4_groupinfo_caches[i] = NULL;
        }
}

static int ext4_groupinfo_create_slab(size_t size)
{
        static DEFINE_MUTEX(ext4_grpinfo_slab_create_mutex);
        int slab_size;
        int blocksize_bits = order_base_2(size);
        int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
        struct kmem_cache *cachep;

        if (cache_index >= NR_GRPINFO_CACHES)
                return -EINVAL;

        if (unlikely(cache_index < 0))
                cache_index = 0;

        mutex_lock(&ext4_grpinfo_slab_create_mutex);
        if (ext4_groupinfo_caches[cache_index]) {
                mutex_unlock(&ext4_grpinfo_slab_create_mutex);
                return 0;       /* Already created */
        }

        slab_size = offsetof(struct ext4_group_info,
                             bb_counters[blocksize_bits + 2]);

        cachep = kmem_cache_create(ext4_groupinfo_slab_names[cache_index],
                                   slab_size, 0, SLAB_RECLAIM_ACCOUNT,
                                   NULL);

        ext4_groupinfo_caches[cache_index] = cachep;

        mutex_unlock(&ext4_grpinfo_slab_create_mutex);
        if (!cachep) {
                printk(KERN_EMERG
                       "EXT4-fs: no memory for groupinfo slab cache\n");
                return -ENOMEM;
        }

        return 0;
}

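/*
 * Sizing sketch for the slab created above (assuming a 4k block size, so
 * blocksize_bits == 12): the object is truncated at
 *
 *         slab_size = offsetof(struct ext4_group_info, bb_counters[14]);
 *
 * i.e. one free-extent counter per buddy order 0..13 and nothing beyond,
 * which is why a separate cache is kept per supported block size.
 */
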
int ext4_mb_init(struct super_block *sb)
{
        struct ext4_sb_info *sbi = EXT4_SB(sb);
        unsigned i, j;
        unsigned offset;
        unsigned max;
        int ret;

        i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_offsets);

        sbi->s_mb_offsets = kmalloc(i, GFP_KERNEL);
        if (sbi->s_mb_offsets == NULL) {
                ret = -ENOMEM;
                goto out;
        }

        i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_maxs);
        sbi->s_mb_maxs = kmalloc(i, GFP_KERNEL);
        if (sbi->s_mb_maxs == NULL) {
                ret = -ENOMEM;
                goto out;
        }

        ret = ext4_groupinfo_create_slab(sb->s_blocksize);
        if (ret < 0)
                goto out;

        /* order 0 is regular bitmap */
        sbi->s_mb_maxs[0] = sb->s_blocksize << 3;
        sbi->s_mb_offsets[0] = 0;

        i = 1;
        offset = 0;
        max = sb->s_blocksize << 2;
        do {
                sbi->s_mb_offsets[i] = offset;
                sbi->s_mb_maxs[i] = max;
                offset += 1 << (sb->s_blocksize_bits - i);
                max = max >> 1;
                i++;
        } while (i <= sb->s_blocksize_bits + 1);

        spin_lock_init(&sbi->s_md_lock);
        spin_lock_init(&sbi->s_bal_lock);

        sbi->s_mb_max_to_scan = MB_DEFAULT_MAX_TO_SCAN;
        sbi->s_mb_min_to_scan = MB_DEFAULT_MIN_TO_SCAN;
        sbi->s_mb_stats = MB_DEFAULT_STATS;
        sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
        sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
        /*
         * The default group preallocation is 512, which for 4k block
         * sizes translates to 2 megabytes.  However for bigalloc file
         * systems, this is probably too big (i.e., if the cluster size
         * is 1 megabyte, then group preallocation size becomes half a
         * gigabyte!).  As a default, we will keep a two megabyte
         * group preallocation size for cluster sizes up to 64k, and after
         * that, we will force a minimum group preallocation size of
         * 32 clusters.  This translates to 8 megs when the cluster
         * size is 256k, and 32 megs when the cluster size is 1 meg,
         * which seems reasonable as a default.
         */
        sbi->s_mb_group_prealloc = max(MB_DEFAULT_GROUP_PREALLOC >>
                                       sbi->s_cluster_bits, 32);
        /*
         * If there is a s_stripe > 1, then we set the s_mb_group_prealloc
         * to the lowest multiple of s_stripe which is bigger than
         * the s_mb_group_prealloc as determined above.  We want
         * the preallocation size to be an exact multiple of the
         * RAID stripe size so that preallocations don't fragment
         * the stripes.
         */
        if (sbi->s_stripe > 1) {
                sbi->s_mb_group_prealloc = roundup(
                        sbi->s_mb_group_prealloc, sbi->s_stripe);
        }

        sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
        if (sbi->s_locality_groups == NULL) {
                ret = -ENOMEM;
                goto out_free_groupinfo_slab;
        }
        for_each_possible_cpu(i) {
                struct ext4_locality_group *lg;
                lg = per_cpu_ptr(sbi->s_locality_groups, i);
                mutex_init(&lg->lg_mutex);
                for (j = 0; j < PREALLOC_TB_SIZE; j++)
                        INIT_LIST_HEAD(&lg->lg_prealloc_list[j]);
                spin_lock_init(&lg->lg_prealloc_lock);
        }

        /* init file for buddy data */
        ret = ext4_mb_init_backend(sb);
        if (ret != 0)
                goto out_free_locality_groups;

        if (sbi->s_proc)
                proc_create_data("mb_groups", S_IRUGO, sbi->s_proc,
                                 &ext4_mb_seq_groups_fops, sb);

        return 0;

out_free_locality_groups:
        free_percpu(sbi->s_locality_groups);
        sbi->s_locality_groups = NULL;
out_free_groupinfo_slab:
        ext4_groupinfo_destroy_slabs();
out:
        kfree(sbi->s_mb_offsets);
        sbi->s_mb_offsets = NULL;
        kfree(sbi->s_mb_maxs);
        sbi->s_mb_maxs = NULL;
        return ret;
}

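/*
 * Worked example of the s_mb_group_prealloc default computed above (the
 * values follow directly from the max() expression, assuming
 * MB_DEFAULT_GROUP_PREALLOC == 512 and a 4k block size):
 *
 *         cluster size      s_cluster_bits   512 >> bits   prealloc
 *         4k (no bigalloc)  0                512           512 clusters = 2MB
 *         256k              6                8             32 clusters  = 8MB
 *         1M                8                2             32 clusters  = 32MB
 */
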
/* needs to be called with the ext4 group lock held */
static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)
{
        struct ext4_prealloc_space *pa;
        struct list_head *cur, *tmp;
        int count = 0;

        list_for_each_safe(cur, tmp, &grp->bb_prealloc_list) {
                pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
                list_del(&pa->pa_group_list);
                count++;
                kmem_cache_free(ext4_pspace_cachep, pa);
        }
        if (count)
                mb_debug(1, "mballoc: %u PAs left\n", count);
}

int ext4_mb_release(struct super_block *sb)
{
        ext4_group_t ngroups = ext4_get_groups_count(sb);
        ext4_group_t i;
        int num_meta_group_infos;
        struct ext4_group_info *grinfo;
        struct ext4_sb_info *sbi = EXT4_SB(sb);
        struct kmem_cache *cachep = get_groupinfo_cache(sb->s_blocksize_bits);

        if (sbi->s_proc)
                remove_proc_entry("mb_groups", sbi->s_proc);

        if (sbi->s_group_info) {
                for (i = 0; i < ngroups; i++) {
                        grinfo = ext4_get_group_info(sb, i);
#ifdef DOUBLE_CHECK
                        kfree(grinfo->bb_bitmap);
#endif
                        ext4_lock_group(sb, i);
                        ext4_mb_cleanup_pa(grinfo);
                        ext4_unlock_group(sb, i);
                        kmem_cache_free(cachep, grinfo);
                }
                num_meta_group_infos = (ngroups +
                                EXT4_DESC_PER_BLOCK(sb) - 1) >>
                        EXT4_DESC_PER_BLOCK_BITS(sb);
                for (i = 0; i < num_meta_group_infos; i++)
                        kfree(sbi->s_group_info[i]);
                ext4_kvfree(sbi->s_group_info);
        }
        kfree(sbi->s_mb_offsets);
        kfree(sbi->s_mb_maxs);
        if (sbi->s_buddy_cache)
                iput(sbi->s_buddy_cache);
        if (sbi->s_mb_stats) {
                ext4_msg(sb, KERN_INFO,
                         "mballoc: %u blocks %u reqs (%u success)",
                         atomic_read(&sbi->s_bal_allocated),
                         atomic_read(&sbi->s_bal_reqs),
                         atomic_read(&sbi->s_bal_success));
                ext4_msg(sb, KERN_INFO,
                         "mballoc: %u extents scanned, %u goal hits, "
                         "%u 2^N hits, %u breaks, %u lost",
                         atomic_read(&sbi->s_bal_ex_scanned),
                         atomic_read(&sbi->s_bal_goals),
                         atomic_read(&sbi->s_bal_2orders),
                         atomic_read(&sbi->s_bal_breaks),
                         atomic_read(&sbi->s_mb_lost_chunks));
                ext4_msg(sb, KERN_INFO,
                         "mballoc: %lu generated and it took %Lu",
                         sbi->s_mb_buddies_generated,
                         sbi->s_mb_generation_time);
                ext4_msg(sb, KERN_INFO,
                         "mballoc: %u preallocated, %u discarded",
                         atomic_read(&sbi->s_mb_preallocated),
                         atomic_read(&sbi->s_mb_discarded));
        }

        free_percpu(sbi->s_locality_groups);

        return 0;
}

static inline int ext4_issue_discard(struct super_block *sb,
                ext4_group_t block_group, ext4_grpblk_t cluster, int count)
{
        ext4_fsblk_t discard_block;

        discard_block = (EXT4_C2B(EXT4_SB(sb), cluster) +
                         ext4_group_first_block_no(sb, block_group));
        count = EXT4_C2B(EXT4_SB(sb), count);
        trace_ext4_discard_blocks(sb,
                        (unsigned long long) discard_block, count);
        return sb_issue_discard(sb, discard_block, count, GFP_NOFS, 0);
}

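/*
 * Example of the cluster-to-block translation above (assuming a bigalloc
 * filesystem with 16 blocks per cluster): a discard of cluster 5, count 2,
 * in a group whose first block is N covers blocks N + 80 .. N + 111, since
 * both the offset and the count are scaled by EXT4_C2B() before being
 * handed to sb_issue_discard().
 */
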
/*
 * This function is called by the jbd2 layer once the commit has finished,
 * so we know we can free the blocks that were released with that commit.
 */
static void ext4_free_data_callback(struct super_block *sb,
                                    struct ext4_journal_cb_entry *jce,
                                    int rc)
{
        struct ext4_free_data *entry = (struct ext4_free_data *)jce;
        struct ext4_buddy e4b;
        struct ext4_group_info *db;
        int err, count = 0, count2 = 0;

        mb_debug(1, "gonna free %u blocks in group %u (0x%p):",
                 entry->efd_count, entry->efd_group, entry);

        if (test_opt(sb, DISCARD)) {
                err = ext4_issue_discard(sb, entry->efd_group,
                                         entry->efd_start_cluster,
                                         entry->efd_count);
                if (err && err != -EOPNOTSUPP)
                        ext4_msg(sb, KERN_WARNING, "discard request in"
                                 " group:%d block:%d count:%d failed"
                                 " with %d", entry->efd_group,
                                 entry->efd_start_cluster,
                                 entry->efd_count, err);
        }

        err = ext4_mb_load_buddy(sb, entry->efd_group, &e4b);
        /* we expect to find existing buddy because it's pinned */
        BUG_ON(err != 0);

        db = e4b.bd_info;
        /* there are blocks to put in buddy to make them really free */
        count += entry->efd_count;
        count2++;
        ext4_lock_group(sb, entry->efd_group);
        /* Take it out of per group rb tree */
        rb_erase(&entry->efd_node, &(db->bb_free_root));
        mb_free_blocks(NULL, &e4b, entry->efd_start_cluster, entry->efd_count);

        /*
         * Clear the trimmed flag for the group so that the next
         * ext4_trim_fs can trim it.
         * If the volume is mounted with -o discard, online discard
         * is supported and the free blocks will be trimmed online.
         */
        if (!test_opt(sb, DISCARD))
                EXT4_MB_GRP_CLEAR_TRIMMED(db);

        if (!db->bb_free_root.rb_node) {
                /* No more items in the per group rb tree
                 * balance refcounts from ext4_mb_free_metadata()
                 */
                page_cache_release(e4b.bd_buddy_page);
                page_cache_release(e4b.bd_bitmap_page);
        }
        ext4_unlock_group(sb, entry->efd_group);
        kmem_cache_free(ext4_free_data_cachep, entry);
        ext4_mb_unload_buddy(&e4b);

        mb_debug(1, "freed %u blocks in %u structures\n", count, count2);
}

int __init ext4_init_mballoc(void)
{
        ext4_pspace_cachep = KMEM_CACHE(ext4_prealloc_space,
                                        SLAB_RECLAIM_ACCOUNT);
        if (ext4_pspace_cachep == NULL)
                return -ENOMEM;

        ext4_ac_cachep = KMEM_CACHE(ext4_allocation_context,
                                    SLAB_RECLAIM_ACCOUNT);
        if (ext4_ac_cachep == NULL) {
                kmem_cache_destroy(ext4_pspace_cachep);
                return -ENOMEM;
        }

        ext4_free_data_cachep = KMEM_CACHE(ext4_free_data,
                                           SLAB_RECLAIM_ACCOUNT);
        if (ext4_free_data_cachep == NULL) {
                kmem_cache_destroy(ext4_pspace_cachep);
                kmem_cache_destroy(ext4_ac_cachep);
                return -ENOMEM;
        }
        return 0;
}

void ext4_exit_mballoc(void)
{
        /*
         * Wait for completion of call_rcu()'s on ext4_pspace_cachep
         * before destroying the slab cache.
         */
        rcu_barrier();
        kmem_cache_destroy(ext4_pspace_cachep);
        kmem_cache_destroy(ext4_ac_cachep);
        kmem_cache_destroy(ext4_free_data_cachep);
        ext4_groupinfo_destroy_slabs();
}

/*
 * Check quota and mark chosen space (ac->ac_b_ex) non-free in bitmaps
 * Returns 0 if success or error code
 */
static noinline_for_stack int
ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
                                handle_t *handle, unsigned int reserv_clstrs)
{
        struct buffer_head *bitmap_bh = NULL;
        struct ext4_group_desc *gdp;
        struct buffer_head *gdp_bh;
        struct ext4_sb_info *sbi;
        struct super_block *sb;
        ext4_fsblk_t block;
        int err, len;

        BUG_ON(ac->ac_status != AC_STATUS_FOUND);
        BUG_ON(ac->ac_b_ex.fe_len <= 0);

        sb = ac->ac_sb;
        sbi = EXT4_SB(sb);

        err = -EIO;
        bitmap_bh = ext4_read_block_bitmap(sb, ac->ac_b_ex.fe_group);
        if (!bitmap_bh)
                goto out_err;

        err = ext4_journal_get_write_access(handle, bitmap_bh);
        if (err)
                goto out_err;

        err = -EIO;
        gdp = ext4_get_group_desc(sb, ac->ac_b_ex.fe_group, &gdp_bh);
        if (!gdp)
                goto out_err;

        ext4_debug("using block group %u(%d)\n", ac->ac_b_ex.fe_group,
                        ext4_free_group_clusters(sb, gdp));

        err = ext4_journal_get_write_access(handle, gdp_bh);
        if (err)
                goto out_err;

        block = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);

        len = EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
        if (!ext4_data_block_valid(sbi, block, len)) {
                ext4_error(sb, "Allocating blocks %llu-%llu which overlap "
                           "fs metadata", block, block+len);
                /* File system mounted not to panic on error
                 * Fix the bitmap and repeat the block allocation
                 * We leak some of the blocks here.
                 */
                ext4_lock_group(sb, ac->ac_b_ex.fe_group);
                ext4_set_bits(bitmap_bh->b_data, ac->ac_b_ex.fe_start,
                              ac->ac_b_ex.fe_len);
                ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
                err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
                if (!err)
                        err = -EAGAIN;
                goto out_err;
        }

        ext4_lock_group(sb, ac->ac_b_ex.fe_group);
#ifdef AGGRESSIVE_CHECK
        {
                int i;
                for (i = 0; i < ac->ac_b_ex.fe_len; i++) {
                        BUG_ON(mb_test_bit(ac->ac_b_ex.fe_start + i,
                                           bitmap_bh->b_data));
                }
        }
#endif
        ext4_set_bits(bitmap_bh->b_data, ac->ac_b_ex.fe_start,
                      ac->ac_b_ex.fe_len);
        if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
                gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
                ext4_free_group_clusters_set(sb, gdp,
                                             ext4_free_clusters_after_init(sb,
                                                ac->ac_b_ex.fe_group, gdp));
        }
        len = ext4_free_group_clusters(sb, gdp) - ac->ac_b_ex.fe_len;
        ext4_free_group_clusters_set(sb, gdp, len);
        ext4_block_bitmap_csum_set(sb, ac->ac_b_ex.fe_group, gdp, bitmap_bh);
        ext4_group_desc_csum_set(sb, ac->ac_b_ex.fe_group, gdp);

        ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
        percpu_counter_sub(&sbi->s_freeclusters_counter, ac->ac_b_ex.fe_len);
        /*
         * Now reduce the dirty block count also. Should not go negative
         */
        if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
                /* release all the reserved blocks if non delalloc */
                percpu_counter_sub(&sbi->s_dirtyclusters_counter,
                                   reserv_clstrs);

        if (sbi->s_log_groups_per_flex) {
                ext4_group_t flex_group = ext4_flex_group(sbi,
                                                          ac->ac_b_ex.fe_group);
                atomic64_sub(ac->ac_b_ex.fe_len,
                             &sbi->s_flex_groups[flex_group].free_clusters);
        }

        err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
        if (err)
                goto out_err;
        err = ext4_handle_dirty_metadata(handle, NULL, gdp_bh);

out_err:
        brelse(bitmap_bh);
        return err;
}

/*
 * here we normalize request for locality group
 * Group requests are normalized to s_mb_group_prealloc, which goes to
 * s_stripe if we set the same via mount option.
 * s_mb_group_prealloc can be configured via
 * /sys/fs/ext4/<partition>/mb_group_prealloc
 *
 * XXX: should we try to preallocate more than the group has now?
 */
static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
{
        struct super_block *sb = ac->ac_sb;
        struct ext4_locality_group *lg = ac->ac_lg;

        BUG_ON(lg == NULL);
        ac->ac_g_ex.fe_len = EXT4_SB(sb)->s_mb_group_prealloc;
        mb_debug(1, "#%u: goal %u blocks for locality group\n",
                 current->pid, ac->ac_g_ex.fe_len);
}

/*
 * Normalization means making request better in terms of
 * size and alignment
 */
static noinline_for_stack void
ext4_mb_normalize_request(struct ext4_allocation_context *ac,
                                struct ext4_allocation_request *ar)
{
        struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
        int bsbits, max;
        ext4_lblk_t end;
        loff_t size, start_off;
        loff_t orig_size __maybe_unused;
        ext4_lblk_t start;
        struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
        struct ext4_prealloc_space *pa;

        /* normalize only data requests, metadata requests
           do not need preallocation */
        if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
                return;

        /* sometimes the caller may want exact blocks */
        if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
                return;

        /* caller may indicate that preallocation isn't
         * required (it's a tail, for example) */
        if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
                return;

        if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) {
                ext4_mb_normalize_group_request(ac);
                return;
        }

        bsbits = ac->ac_sb->s_blocksize_bits;

        /* first, let's learn the actual file size
         * given that the current request is allocated */
        size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
        size = size << bsbits;
        if (size < i_size_read(ac->ac_inode))
                size = i_size_read(ac->ac_inode);
        orig_size = size;

        /* max size of free chunks */
        max = 2 << bsbits;

#define NRL_CHECK_SIZE(req, size, max, chunk_size)      \
                (req <= (size) || max <= (chunk_size))

        /* first, try to predict filesize */
        /* XXX: should this table be tunable? */
        start_off = 0;
        if (size <= 16 * 1024) {
                size = 16 * 1024;
        } else if (size <= 32 * 1024) {
                size = 32 * 1024;
        } else if (size <= 64 * 1024) {
                size = 64 * 1024;
        } else if (size <= 128 * 1024) {
                size = 128 * 1024;
        } else if (size <= 256 * 1024) {
                size = 256 * 1024;
        } else if (size <= 512 * 1024) {
                size = 512 * 1024;
        } else if (size <= 1024 * 1024) {
                size = 1024 * 1024;
        } else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, 2 * 1024)) {
                start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
                                                (21 - bsbits)) << 21;
                size = 2 * 1024 * 1024;
        } else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, 4 * 1024)) {
                start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
                                                (22 - bsbits)) << 22;
                size = 4 * 1024 * 1024;
        } else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
                                        (8<<20)>>bsbits, max, 8 * 1024)) {
                start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
                                                (23 - bsbits)) << 23;
                size = 8 * 1024 * 1024;
        } else {
                start_off = (loff_t)ac->ac_o_ex.fe_logical << bsbits;
                size      = ac->ac_o_ex.fe_len << bsbits;
        }
        size = size >> bsbits;
        start = start_off >> bsbits;

        /* don't cover already allocated blocks in selected range */
        if (ar->pleft && start <= ar->lleft) {
                size -= ar->lleft + 1 - start;
                start = ar->lleft + 1;
        }
        if (ar->pright && start + size - 1 >= ar->lright)
                size -= start + size - ar->lright;

        end = start + size;

        /* check we don't cross already preallocated blocks */
        rcu_read_lock();
        list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
                ext4_lblk_t pa_end;

                if (pa->pa_deleted)
                        continue;
                spin_lock(&pa->pa_lock);
                if (pa->pa_deleted) {
                        spin_unlock(&pa->pa_lock);
                        continue;
                }

                pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb),
                                                  pa->pa_len);

                /* PA must not overlap original request */
                BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end ||
                        ac->ac_o_ex.fe_logical < pa->pa_lstart));

                /* skip PAs this normalized request doesn't overlap with */
                if (pa->pa_lstart >= end || pa_end <= start) {
                        spin_unlock(&pa->pa_lock);
                        continue;
                }
                BUG_ON(pa->pa_lstart <= start && pa_end >= end);

                /* adjust start or end to be adjacent to this pa */
                if (pa_end <= ac->ac_o_ex.fe_logical) {
                        BUG_ON(pa_end < start);
                        start = pa_end;
                } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) {
                        BUG_ON(pa->pa_lstart > end);
                        end = pa->pa_lstart;
                }
                spin_unlock(&pa->pa_lock);
        }
        rcu_read_unlock();
        size = end - start;

        /* XXX: extra loop to check we really don't overlap preallocations */
        rcu_read_lock();
        list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
                ext4_lblk_t pa_end;

                spin_lock(&pa->pa_lock);
                if (pa->pa_deleted == 0) {
                        pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb),
                                                          pa->pa_len);
                        BUG_ON(!(start >= pa_end || end <= pa->pa_lstart));
                }
                spin_unlock(&pa->pa_lock);
        }
        rcu_read_unlock();

        if (start + size <= ac->ac_o_ex.fe_logical &&
                        start > ac->ac_o_ex.fe_logical) {
                ext4_msg(ac->ac_sb, KERN_ERR,
                         "start %lu, size %lu, fe_logical %lu",
                         (unsigned long) start, (unsigned long) size,
                         (unsigned long) ac->ac_o_ex.fe_logical);
        }
        BUG_ON(start + size <= ac->ac_o_ex.fe_logical &&
                        start > ac->ac_o_ex.fe_logical);
        BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));

        /* now prepare goal request */

        /* XXX: is it better to align blocks WRT to logical
         * placement or satisfy big request as is */
        ac->ac_g_ex.fe_logical = start;
        ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size);

        /* define goal start in order to merge */
        if (ar->pright && (ar->lright == (start + size))) {
                /* merge to the right */
                ext4_get_group_no_and_offset(ac->ac_sb, ar->pright - size,
                                             &ac->ac_f_ex.fe_group,
                                             &ac->ac_f_ex.fe_start);
                ac->ac_flags |= EXT4_MB_HINT_TRY_GOAL;
        }
        if (ar->pleft && (ar->lleft + 1 == start)) {
                /* merge to the left */
                ext4_get_group_no_and_offset(ac->ac_sb, ar->pleft + 1,
                                             &ac->ac_f_ex.fe_group,
                                             &ac->ac_f_ex.fe_start);
                ac->ac_flags |= EXT4_MB_HINT_TRY_GOAL;
        }

        mb_debug(1, "goal: %u(was %u) blocks at %u\n", (unsigned) size,
                 (unsigned) orig_size, (unsigned) start);
}

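/*
 * Rough illustration of the normalization table above (assuming 4k blocks
 * and no neighbouring preallocations): a write that extends a file to
 * ~100 KiB falls into the 128 KiB bucket, so the goal extent covers the
 * file from logical block 0; a file between 1 MiB and 4 MiB falls into the
 * 2 MiB case, where start_off is aligned down to a 2 MiB boundary of the
 * requested logical block (the >> (21 - bsbits) << 21 arithmetic) before
 * the goal extent is built from (start, size).
 */
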
static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
{
        struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);

        if (sbi->s_mb_stats && ac->ac_g_ex.fe_len > 1) {
                atomic_inc(&sbi->s_bal_reqs);
                atomic_add(ac->ac_b_ex.fe_len, &sbi->s_bal_allocated);
                if (ac->ac_b_ex.fe_len >= ac->ac_o_ex.fe_len)
                        atomic_inc(&sbi->s_bal_success);
                atomic_add(ac->ac_found, &sbi->s_bal_ex_scanned);
                if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
                                ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
                        atomic_inc(&sbi->s_bal_goals);
                if (ac->ac_found > sbi->s_mb_max_to_scan)
                        atomic_inc(&sbi->s_bal_breaks);
        }

        if (ac->ac_op == EXT4_MB_HISTORY_ALLOC)
                trace_ext4_mballoc_alloc(ac);
        else
                trace_ext4_mballoc_prealloc(ac);
}

/*
 * Called on failure; free up any blocks from the inode PA for this
 * context.  We don't need this for MB_GROUP_PA because we only change
 * pa_free in ext4_mb_release_context(), but on failure, we've already
 * zeroed out ac->ac_b_ex.fe_len, so group_pa->pa_free is not changed.
 */
static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
{
        struct ext4_prealloc_space *pa = ac->ac_pa;
        struct ext4_buddy e4b;
        int err;

        if (pa == NULL) {
                if (ac->ac_f_ex.fe_len == 0)
                        return;
                err = ext4_mb_load_buddy(ac->ac_sb, ac->ac_f_ex.fe_group, &e4b);
                if (err) {
                        /*
                         * This should never happen since we pin the
                         * pages in the ext4_allocation_context so
                         * ext4_mb_load_buddy() should never fail.
                         */
                        WARN(1, "mb_load_buddy failed (%d)", err);
                        return;
                }
                ext4_lock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
                mb_free_blocks(ac->ac_inode, &e4b, ac->ac_f_ex.fe_start,
                               ac->ac_f_ex.fe_len);
                ext4_unlock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
                ext4_mb_unload_buddy(&e4b);
                return;
        }
        if (pa->pa_type == MB_INODE_PA)
                pa->pa_free += ac->ac_b_ex.fe_len;
}

3228 | /* | 3230 | /* |
3229 | * use blocks preallocated to inode | 3231 | * use blocks preallocated to inode |
3230 | */ | 3232 | */ |
3231 | static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac, | 3233 | static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac, |
3232 | struct ext4_prealloc_space *pa) | 3234 | struct ext4_prealloc_space *pa) |
3233 | { | 3235 | { |
3234 | struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); | 3236 | struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); |
3235 | ext4_fsblk_t start; | 3237 | ext4_fsblk_t start; |
3236 | ext4_fsblk_t end; | 3238 | ext4_fsblk_t end; |
3237 | int len; | 3239 | int len; |
3238 | 3240 | ||
3239 | /* found preallocated blocks, use them */ | 3241 | /* found preallocated blocks, use them */ |
3240 | start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart); | 3242 | start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart); |
3241 | end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len), | 3243 | end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len), |
3242 | start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len)); | 3244 | start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len)); |
3243 | len = EXT4_NUM_B2C(sbi, end - start); | 3245 | len = EXT4_NUM_B2C(sbi, end - start); |
3244 | ext4_get_group_no_and_offset(ac->ac_sb, start, &ac->ac_b_ex.fe_group, | 3246 | ext4_get_group_no_and_offset(ac->ac_sb, start, &ac->ac_b_ex.fe_group, |
3245 | &ac->ac_b_ex.fe_start); | 3247 | &ac->ac_b_ex.fe_start); |
3246 | ac->ac_b_ex.fe_len = len; | 3248 | ac->ac_b_ex.fe_len = len; |
3247 | ac->ac_status = AC_STATUS_FOUND; | 3249 | ac->ac_status = AC_STATUS_FOUND; |
3248 | ac->ac_pa = pa; | 3250 | ac->ac_pa = pa; |
3249 | 3251 | ||
3250 | BUG_ON(start < pa->pa_pstart); | 3252 | BUG_ON(start < pa->pa_pstart); |
3251 | BUG_ON(end > pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len)); | 3253 | BUG_ON(end > pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len)); |
3252 | BUG_ON(pa->pa_free < len); | 3254 | BUG_ON(pa->pa_free < len); |
3253 | pa->pa_free -= len; | 3255 | pa->pa_free -= len; |
3254 | 3256 | ||
3255 | mb_debug(1, "use %llu/%u from inode pa %p\n", start, len, pa); | 3257 | mb_debug(1, "use %llu/%u from inode pa %p\n", start, len, pa); |
3256 | } | 3258 | } |
3257 | 3259 | ||
3258 | /* | 3260 | /* |
3259 | * use blocks preallocated to locality group | 3261 | * use blocks preallocated to locality group |
3260 | */ | 3262 | */ |
3261 | static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac, | 3263 | static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac, |
3262 | struct ext4_prealloc_space *pa) | 3264 | struct ext4_prealloc_space *pa) |
3263 | { | 3265 | { |
3264 | unsigned int len = ac->ac_o_ex.fe_len; | 3266 | unsigned int len = ac->ac_o_ex.fe_len; |
3265 | 3267 | ||
3266 | ext4_get_group_no_and_offset(ac->ac_sb, pa->pa_pstart, | 3268 | ext4_get_group_no_and_offset(ac->ac_sb, pa->pa_pstart, |
3267 | &ac->ac_b_ex.fe_group, | 3269 | &ac->ac_b_ex.fe_group, |
3268 | &ac->ac_b_ex.fe_start); | 3270 | &ac->ac_b_ex.fe_start); |
3269 | ac->ac_b_ex.fe_len = len; | 3271 | ac->ac_b_ex.fe_len = len; |
3270 | ac->ac_status = AC_STATUS_FOUND; | 3272 | ac->ac_status = AC_STATUS_FOUND; |
3271 | ac->ac_pa = pa; | 3273 | ac->ac_pa = pa; |
3272 | 3274 | ||
3273 | /* we don't correct pa_pstart or pa_plen here to avoid | 3275 | /* we don't correct pa_pstart or pa_plen here to avoid |
3274 | * possible race when the group is being loaded concurrently | 3276 | * possible race when the group is being loaded concurrently |
3275 | * instead we correct pa later, after blocks are marked | 3277 | * instead we correct pa later, after blocks are marked |
3276 | * in on-disk bitmap -- see ext4_mb_release_context() | 3278 | * in on-disk bitmap -- see ext4_mb_release_context() |
3277 | * Other CPUs are prevented from allocating from this pa by lg_mutex | 3279 | * Other CPUs are prevented from allocating from this pa by lg_mutex |
3278 | */ | 3280 | */ |
3279 | mb_debug(1, "use %u/%u from group pa %p\n", pa->pa_lstart-len, len, pa); | 3281 | mb_debug(1, "use %u/%u from group pa %p\n", pa->pa_lstart-len, len, pa); |
3280 | } | 3282 | } |
3281 | 3283 | ||
3282 | /* | 3284 | /* |
3283 | * Return the prealloc space that has the minimal distance | 3285 | * Return the prealloc space that has the minimal distance |
3284 | * from the goal block. @cpa is the prealloc | 3286 | * from the goal block. @cpa is the prealloc |
3285 | * space with the currently known minimal distance | 3287 | * space with the currently known minimal distance |
3286 | * from the goal block. | 3288 | * from the goal block. |
3287 | */ | 3289 | */ |
3288 | static struct ext4_prealloc_space * | 3290 | static struct ext4_prealloc_space * |
3289 | ext4_mb_check_group_pa(ext4_fsblk_t goal_block, | 3291 | ext4_mb_check_group_pa(ext4_fsblk_t goal_block, |
3290 | struct ext4_prealloc_space *pa, | 3292 | struct ext4_prealloc_space *pa, |
3291 | struct ext4_prealloc_space *cpa) | 3293 | struct ext4_prealloc_space *cpa) |
3292 | { | 3294 | { |
3293 | ext4_fsblk_t cur_distance, new_distance; | 3295 | ext4_fsblk_t cur_distance, new_distance; |
3294 | 3296 | ||
3295 | if (cpa == NULL) { | 3297 | if (cpa == NULL) { |
3296 | atomic_inc(&pa->pa_count); | 3298 | atomic_inc(&pa->pa_count); |
3297 | return pa; | 3299 | return pa; |
3298 | } | 3300 | } |
3299 | cur_distance = abs(goal_block - cpa->pa_pstart); | 3301 | cur_distance = abs(goal_block - cpa->pa_pstart); |
3300 | new_distance = abs(goal_block - pa->pa_pstart); | 3302 | new_distance = abs(goal_block - pa->pa_pstart); |
3301 | 3303 | ||
3302 | if (cur_distance <= new_distance) | 3304 | if (cur_distance <= new_distance) |
3303 | return cpa; | 3305 | return cpa; |
3304 | 3306 | ||
3305 | /* drop the previous reference */ | 3307 | /* drop the previous reference */ |
3306 | atomic_dec(&cpa->pa_count); | 3308 | atomic_dec(&cpa->pa_count); |
3307 | atomic_inc(&pa->pa_count); | 3309 | atomic_inc(&pa->pa_count); |
3308 | return pa; | 3310 | return pa; |
3309 | } | 3311 | } |
3310 | 3312 | ||
3311 | /* | 3313 | /* |
3312 | * search goal blocks in preallocated space | 3314 | * search goal blocks in preallocated space |
3313 | */ | 3315 | */ |
3314 | static noinline_for_stack int | 3316 | static noinline_for_stack int |
3315 | ext4_mb_use_preallocated(struct ext4_allocation_context *ac) | 3317 | ext4_mb_use_preallocated(struct ext4_allocation_context *ac) |
3316 | { | 3318 | { |
3317 | struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); | 3319 | struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); |
3318 | int order, i; | 3320 | int order, i; |
3319 | struct ext4_inode_info *ei = EXT4_I(ac->ac_inode); | 3321 | struct ext4_inode_info *ei = EXT4_I(ac->ac_inode); |
3320 | struct ext4_locality_group *lg; | 3322 | struct ext4_locality_group *lg; |
3321 | struct ext4_prealloc_space *pa, *cpa = NULL; | 3323 | struct ext4_prealloc_space *pa, *cpa = NULL; |
3322 | ext4_fsblk_t goal_block; | 3324 | ext4_fsblk_t goal_block; |
3323 | 3325 | ||
3324 | /* only data can be preallocated */ | 3326 | /* only data can be preallocated */ |
3325 | if (!(ac->ac_flags & EXT4_MB_HINT_DATA)) | 3327 | if (!(ac->ac_flags & EXT4_MB_HINT_DATA)) |
3326 | return 0; | 3328 | return 0; |
3327 | 3329 | ||
3328 | /* first, try per-file preallocation */ | 3330 | /* first, try per-file preallocation */ |
3329 | rcu_read_lock(); | 3331 | rcu_read_lock(); |
3330 | list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) { | 3332 | list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) { |
3331 | 3333 | ||
3332 | /* all fields in this condition don't change, | 3334 | /* all fields in this condition don't change, |
3333 | * so we can skip locking for them */ | 3335 | * so we can skip locking for them */ |
3334 | if (ac->ac_o_ex.fe_logical < pa->pa_lstart || | 3336 | if (ac->ac_o_ex.fe_logical < pa->pa_lstart || |
3335 | ac->ac_o_ex.fe_logical >= (pa->pa_lstart + | 3337 | ac->ac_o_ex.fe_logical >= (pa->pa_lstart + |
3336 | EXT4_C2B(sbi, pa->pa_len))) | 3338 | EXT4_C2B(sbi, pa->pa_len))) |
3337 | continue; | 3339 | continue; |
3338 | 3340 | ||
3339 | /* non-extent files can't have physical blocks past 2^32 */ | 3341 | /* non-extent files can't have physical blocks past 2^32 */ |
3340 | if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) && | 3342 | if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) && |
3341 | (pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) > | 3343 | (pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) > |
3342 | EXT4_MAX_BLOCK_FILE_PHYS)) | 3344 | EXT4_MAX_BLOCK_FILE_PHYS)) |
3343 | continue; | 3345 | continue; |
3344 | 3346 | ||
3345 | /* found preallocated blocks, use them */ | 3347 | /* found preallocated blocks, use them */ |
3346 | spin_lock(&pa->pa_lock); | 3348 | spin_lock(&pa->pa_lock); |
3347 | if (pa->pa_deleted == 0 && pa->pa_free) { | 3349 | if (pa->pa_deleted == 0 && pa->pa_free) { |
3348 | atomic_inc(&pa->pa_count); | 3350 | atomic_inc(&pa->pa_count); |
3349 | ext4_mb_use_inode_pa(ac, pa); | 3351 | ext4_mb_use_inode_pa(ac, pa); |
3350 | spin_unlock(&pa->pa_lock); | 3352 | spin_unlock(&pa->pa_lock); |
3351 | ac->ac_criteria = 10; | 3353 | ac->ac_criteria = 10; |
3352 | rcu_read_unlock(); | 3354 | rcu_read_unlock(); |
3353 | return 1; | 3355 | return 1; |
3354 | } | 3356 | } |
3355 | spin_unlock(&pa->pa_lock); | 3357 | spin_unlock(&pa->pa_lock); |
3356 | } | 3358 | } |
3357 | rcu_read_unlock(); | 3359 | rcu_read_unlock(); |
3358 | 3360 | ||
3359 | /* can we use group allocation? */ | 3361 | /* can we use group allocation? */ |
3360 | if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)) | 3362 | if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)) |
3361 | return 0; | 3363 | return 0; |
3362 | 3364 | ||
3363 | /* inode may have no locality group for some reason */ | 3365 | /* inode may have no locality group for some reason */ |
3364 | lg = ac->ac_lg; | 3366 | lg = ac->ac_lg; |
3365 | if (lg == NULL) | 3367 | if (lg == NULL) |
3366 | return 0; | 3368 | return 0; |
3367 | order = fls(ac->ac_o_ex.fe_len) - 1; | 3369 | order = fls(ac->ac_o_ex.fe_len) - 1; |
3368 | if (order > PREALLOC_TB_SIZE - 1) | 3370 | if (order > PREALLOC_TB_SIZE - 1) |
3369 | /* The max size of hash table is PREALLOC_TB_SIZE */ | 3371 | /* The max size of hash table is PREALLOC_TB_SIZE */ |
3370 | order = PREALLOC_TB_SIZE - 1; | 3372 | order = PREALLOC_TB_SIZE - 1; |
3371 | 3373 | ||
3372 | goal_block = ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex); | 3374 | goal_block = ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex); |
3373 | /* | 3375 | /* |
3374 | * search for the prealloc space that is having | 3376 | * search for the prealloc space that is having |
3375 | * minimal distance from the goal block. | 3377 | * minimal distance from the goal block. |
3376 | */ | 3378 | */ |
3377 | for (i = order; i < PREALLOC_TB_SIZE; i++) { | 3379 | for (i = order; i < PREALLOC_TB_SIZE; i++) { |
3378 | rcu_read_lock(); | 3380 | rcu_read_lock(); |
3379 | list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[i], | 3381 | list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[i], |
3380 | pa_inode_list) { | 3382 | pa_inode_list) { |
3381 | spin_lock(&pa->pa_lock); | 3383 | spin_lock(&pa->pa_lock); |
3382 | if (pa->pa_deleted == 0 && | 3384 | if (pa->pa_deleted == 0 && |
3383 | pa->pa_free >= ac->ac_o_ex.fe_len) { | 3385 | pa->pa_free >= ac->ac_o_ex.fe_len) { |
3384 | 3386 | ||
3385 | cpa = ext4_mb_check_group_pa(goal_block, | 3387 | cpa = ext4_mb_check_group_pa(goal_block, |
3386 | pa, cpa); | 3388 | pa, cpa); |
3387 | } | 3389 | } |
3388 | spin_unlock(&pa->pa_lock); | 3390 | spin_unlock(&pa->pa_lock); |
3389 | } | 3391 | } |
3390 | rcu_read_unlock(); | 3392 | rcu_read_unlock(); |
3391 | } | 3393 | } |
3392 | if (cpa) { | 3394 | if (cpa) { |
3393 | ext4_mb_use_group_pa(ac, cpa); | 3395 | ext4_mb_use_group_pa(ac, cpa); |
3394 | ac->ac_criteria = 20; | 3396 | ac->ac_criteria = 20; |
3395 | return 1; | 3397 | return 1; |
3396 | } | 3398 | } |
3397 | return 0; | 3399 | return 0; |
3398 | } | 3400 | } |
3399 | 3401 | ||
3400 | /* | 3402 | /* |
3401 | * the function goes through all blocks freed in the group | 3403 | * the function goes through all blocks freed in the group |
3402 | * but not yet committed and marks them used in in-core bitmap. | 3404 | * but not yet committed and marks them used in in-core bitmap. |
3403 | * buddy must be generated from this bitmap | 3405 | * buddy must be generated from this bitmap |
3404 | * Need to be called with the ext4 group lock held | 3406 | * Need to be called with the ext4 group lock held |
3405 | */ | 3407 | */ |
3406 | static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap, | 3408 | static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap, |
3407 | ext4_group_t group) | 3409 | ext4_group_t group) |
3408 | { | 3410 | { |
3409 | struct rb_node *n; | 3411 | struct rb_node *n; |
3410 | struct ext4_group_info *grp; | 3412 | struct ext4_group_info *grp; |
3411 | struct ext4_free_data *entry; | 3413 | struct ext4_free_data *entry; |
3412 | 3414 | ||
3413 | grp = ext4_get_group_info(sb, group); | 3415 | grp = ext4_get_group_info(sb, group); |
3414 | n = rb_first(&(grp->bb_free_root)); | 3416 | n = rb_first(&(grp->bb_free_root)); |
3415 | 3417 | ||
3416 | while (n) { | 3418 | while (n) { |
3417 | entry = rb_entry(n, struct ext4_free_data, efd_node); | 3419 | entry = rb_entry(n, struct ext4_free_data, efd_node); |
3418 | ext4_set_bits(bitmap, entry->efd_start_cluster, entry->efd_count); | 3420 | ext4_set_bits(bitmap, entry->efd_start_cluster, entry->efd_count); |
3419 | n = rb_next(n); | 3421 | n = rb_next(n); |
3420 | } | 3422 | } |
3421 | return; | 3423 | return; |
3422 | } | 3424 | } |
3423 | 3425 | ||
3424 | /* | 3426 | /* |
3425 | * the function goes through all preallocations in this group and marks them | 3427 | * the function goes through all preallocations in this group and marks them |
3426 | * used in in-core bitmap. buddy must be generated from this bitmap | 3428 | * used in in-core bitmap. buddy must be generated from this bitmap |
3427 | * Need to be called with ext4 group lock held | 3429 | * Need to be called with ext4 group lock held |
3428 | */ | 3430 | */ |
3429 | static noinline_for_stack | 3431 | static noinline_for_stack |
3430 | void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap, | 3432 | void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap, |
3431 | ext4_group_t group) | 3433 | ext4_group_t group) |
3432 | { | 3434 | { |
3433 | struct ext4_group_info *grp = ext4_get_group_info(sb, group); | 3435 | struct ext4_group_info *grp = ext4_get_group_info(sb, group); |
3434 | struct ext4_prealloc_space *pa; | 3436 | struct ext4_prealloc_space *pa; |
3435 | struct list_head *cur; | 3437 | struct list_head *cur; |
3436 | ext4_group_t groupnr; | 3438 | ext4_group_t groupnr; |
3437 | ext4_grpblk_t start; | 3439 | ext4_grpblk_t start; |
3438 | int preallocated = 0; | 3440 | int preallocated = 0; |
3439 | int len; | 3441 | int len; |
3440 | 3442 | ||
3441 | /* every form of preallocation discard first loads the group, | 3443 | /* every form of preallocation discard first loads the group, |
3442 | * so the only competing code is preallocation use. | 3444 | * so the only competing code is preallocation use. |
3443 | * we don't need any locking here | 3445 | * we don't need any locking here |
3444 | * notice we do NOT ignore preallocations with pa_deleted | 3446 | * notice we do NOT ignore preallocations with pa_deleted |
3445 | * otherwise we could leave used blocks available for | 3447 | * otherwise we could leave used blocks available for |
3446 | * allocation in buddy when concurrent ext4_mb_put_pa() | 3448 | * allocation in buddy when concurrent ext4_mb_put_pa() |
3447 | * is dropping preallocation | 3449 | * is dropping preallocation |
3448 | */ | 3450 | */ |
3449 | list_for_each(cur, &grp->bb_prealloc_list) { | 3451 | list_for_each(cur, &grp->bb_prealloc_list) { |
3450 | pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list); | 3452 | pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list); |
3451 | spin_lock(&pa->pa_lock); | 3453 | spin_lock(&pa->pa_lock); |
3452 | ext4_get_group_no_and_offset(sb, pa->pa_pstart, | 3454 | ext4_get_group_no_and_offset(sb, pa->pa_pstart, |
3453 | &groupnr, &start); | 3455 | &groupnr, &start); |
3454 | len = pa->pa_len; | 3456 | len = pa->pa_len; |
3455 | spin_unlock(&pa->pa_lock); | 3457 | spin_unlock(&pa->pa_lock); |
3456 | if (unlikely(len == 0)) | 3458 | if (unlikely(len == 0)) |
3457 | continue; | 3459 | continue; |
3458 | BUG_ON(groupnr != group); | 3460 | BUG_ON(groupnr != group); |
3459 | ext4_set_bits(bitmap, start, len); | 3461 | ext4_set_bits(bitmap, start, len); |
3460 | preallocated += len; | 3462 | preallocated += len; |
3461 | } | 3463 | } |
3462 | mb_debug(1, "prellocated %u for group %u\n", preallocated, group); | 3464 | mb_debug(1, "prellocated %u for group %u\n", preallocated, group); |
3463 | } | 3465 | } |
3464 | 3466 | ||
3465 | static void ext4_mb_pa_callback(struct rcu_head *head) | 3467 | static void ext4_mb_pa_callback(struct rcu_head *head) |
3466 | { | 3468 | { |
3467 | struct ext4_prealloc_space *pa; | 3469 | struct ext4_prealloc_space *pa; |
3468 | pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu); | 3470 | pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu); |
3469 | 3471 | ||
3470 | BUG_ON(atomic_read(&pa->pa_count)); | 3472 | BUG_ON(atomic_read(&pa->pa_count)); |
3471 | BUG_ON(pa->pa_deleted == 0); | 3473 | BUG_ON(pa->pa_deleted == 0); |
3472 | kmem_cache_free(ext4_pspace_cachep, pa); | 3474 | kmem_cache_free(ext4_pspace_cachep, pa); |
3473 | } | 3475 | } |
3474 | 3476 | ||
3475 | /* | 3477 | /* |
3476 | * drops a reference to preallocated space descriptor | 3478 | * drops a reference to preallocated space descriptor |
3477 | * if this was the last reference and the space is consumed | 3479 | * if this was the last reference and the space is consumed |
3478 | */ | 3480 | */ |
3479 | static void ext4_mb_put_pa(struct ext4_allocation_context *ac, | 3481 | static void ext4_mb_put_pa(struct ext4_allocation_context *ac, |
3480 | struct super_block *sb, struct ext4_prealloc_space *pa) | 3482 | struct super_block *sb, struct ext4_prealloc_space *pa) |
3481 | { | 3483 | { |
3482 | ext4_group_t grp; | 3484 | ext4_group_t grp; |
3483 | ext4_fsblk_t grp_blk; | 3485 | ext4_fsblk_t grp_blk; |
3484 | 3486 | ||
3485 | /* in this short window concurrent discard can set pa_deleted */ | 3487 | /* in this short window concurrent discard can set pa_deleted */ |
3486 | spin_lock(&pa->pa_lock); | 3488 | spin_lock(&pa->pa_lock); |
3487 | if (!atomic_dec_and_test(&pa->pa_count) || pa->pa_free != 0) { | 3489 | if (!atomic_dec_and_test(&pa->pa_count) || pa->pa_free != 0) { |
3488 | spin_unlock(&pa->pa_lock); | 3490 | spin_unlock(&pa->pa_lock); |
3489 | return; | 3491 | return; |
3490 | } | 3492 | } |
3491 | 3493 | ||
3492 | if (pa->pa_deleted == 1) { | 3494 | if (pa->pa_deleted == 1) { |
3493 | spin_unlock(&pa->pa_lock); | 3495 | spin_unlock(&pa->pa_lock); |
3494 | return; | 3496 | return; |
3495 | } | 3497 | } |
3496 | 3498 | ||
3497 | pa->pa_deleted = 1; | 3499 | pa->pa_deleted = 1; |
3498 | spin_unlock(&pa->pa_lock); | 3500 | spin_unlock(&pa->pa_lock); |
3499 | 3501 | ||
3500 | grp_blk = pa->pa_pstart; | 3502 | grp_blk = pa->pa_pstart; |
3501 | /* | 3503 | /* |
3502 | * If doing group-based preallocation, pa_pstart may be in the | 3504 | * If doing group-based preallocation, pa_pstart may be in the |
3503 | * next group when pa is used up | 3505 | * next group when pa is used up |
3504 | */ | 3506 | */ |
3505 | if (pa->pa_type == MB_GROUP_PA) | 3507 | if (pa->pa_type == MB_GROUP_PA) |
3506 | grp_blk--; | 3508 | grp_blk--; |
3507 | 3509 | ||
3508 | grp = ext4_get_group_number(sb, grp_blk); | 3510 | grp = ext4_get_group_number(sb, grp_blk); |
3509 | 3511 | ||
3510 | /* | 3512 | /* |
3511 | * possible race: | 3513 | * possible race: |
3512 | * | 3514 | * |
3513 | * P1 (buddy init) P2 (regular allocation) | 3515 | * P1 (buddy init) P2 (regular allocation) |
3514 | * find block B in PA | 3516 | * find block B in PA |
3515 | * copy on-disk bitmap to buddy | 3517 | * copy on-disk bitmap to buddy |
3516 | * mark B in on-disk bitmap | 3518 | * mark B in on-disk bitmap |
3517 | * drop PA from group | 3519 | * drop PA from group |
3518 | * mark all PAs in buddy | 3520 | * mark all PAs in buddy |
3519 | * | 3521 | * |
3520 | * thus, P1 initializes buddy with B available. to prevent this | 3522 | * thus, P1 initializes buddy with B available. to prevent this |
3521 | * we make "copy" and "mark all PAs" atomic and serialize "drop PA" | 3523 | * we make "copy" and "mark all PAs" atomic and serialize "drop PA" |
3522 | * against that pair | 3524 | * against that pair |
3523 | */ | 3525 | */ |
3524 | ext4_lock_group(sb, grp); | 3526 | ext4_lock_group(sb, grp); |
3525 | list_del(&pa->pa_group_list); | 3527 | list_del(&pa->pa_group_list); |
3526 | ext4_unlock_group(sb, grp); | 3528 | ext4_unlock_group(sb, grp); |
3527 | 3529 | ||
3528 | spin_lock(pa->pa_obj_lock); | 3530 | spin_lock(pa->pa_obj_lock); |
3529 | list_del_rcu(&pa->pa_inode_list); | 3531 | list_del_rcu(&pa->pa_inode_list); |
3530 | spin_unlock(pa->pa_obj_lock); | 3532 | spin_unlock(pa->pa_obj_lock); |
3531 | 3533 | ||
3532 | call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback); | 3534 | call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback); |
3533 | } | 3535 | } |
3534 | 3536 | ||
3535 | /* | 3537 | /* |
3536 | * creates new preallocated space for given inode | 3538 | * creates new preallocated space for given inode |
3537 | */ | 3539 | */ |
3538 | static noinline_for_stack int | 3540 | static noinline_for_stack int |
3539 | ext4_mb_new_inode_pa(struct ext4_allocation_context *ac) | 3541 | ext4_mb_new_inode_pa(struct ext4_allocation_context *ac) |
3540 | { | 3542 | { |
3541 | struct super_block *sb = ac->ac_sb; | 3543 | struct super_block *sb = ac->ac_sb; |
3542 | struct ext4_sb_info *sbi = EXT4_SB(sb); | 3544 | struct ext4_sb_info *sbi = EXT4_SB(sb); |
3543 | struct ext4_prealloc_space *pa; | 3545 | struct ext4_prealloc_space *pa; |
3544 | struct ext4_group_info *grp; | 3546 | struct ext4_group_info *grp; |
3545 | struct ext4_inode_info *ei; | 3547 | struct ext4_inode_info *ei; |
3546 | 3548 | ||
3547 | /* preallocate only when found space is larger than requested */ | 3549 | /* preallocate only when found space is larger than requested */ |
3548 | BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len); | 3550 | BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len); |
3549 | BUG_ON(ac->ac_status != AC_STATUS_FOUND); | 3551 | BUG_ON(ac->ac_status != AC_STATUS_FOUND); |
3550 | BUG_ON(!S_ISREG(ac->ac_inode->i_mode)); | 3552 | BUG_ON(!S_ISREG(ac->ac_inode->i_mode)); |
3551 | 3553 | ||
3552 | pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS); | 3554 | pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS); |
3553 | if (pa == NULL) | 3555 | if (pa == NULL) |
3554 | return -ENOMEM; | 3556 | return -ENOMEM; |
3555 | 3557 | ||
3556 | if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) { | 3558 | if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) { |
3557 | int winl; | 3559 | int winl; |
3558 | int wins; | 3560 | int wins; |
3559 | int win; | 3561 | int win; |
3560 | int offs; | 3562 | int offs; |
3561 | 3563 | ||
3562 | /* we can't allocate as much as normalizer wants. | 3564 | /* we can't allocate as much as normalizer wants. |
3563 | * so, found space must get proper lstart | 3565 | * so, found space must get proper lstart |
3564 | * to cover original request */ | 3566 | * to cover original request */ |
3565 | BUG_ON(ac->ac_g_ex.fe_logical > ac->ac_o_ex.fe_logical); | 3567 | BUG_ON(ac->ac_g_ex.fe_logical > ac->ac_o_ex.fe_logical); |
3566 | BUG_ON(ac->ac_g_ex.fe_len < ac->ac_o_ex.fe_len); | 3568 | BUG_ON(ac->ac_g_ex.fe_len < ac->ac_o_ex.fe_len); |
3567 | 3569 | ||
3568 | /* we're limited by original request in that | 3570 | /* we're limited by original request in that |
3569 | * logical block must be covered anyway | 3571 | * logical block must be covered anyway |
3570 | * winl is window we can move our chunk within */ | 3572 | * winl is window we can move our chunk within */ |
3571 | winl = ac->ac_o_ex.fe_logical - ac->ac_g_ex.fe_logical; | 3573 | winl = ac->ac_o_ex.fe_logical - ac->ac_g_ex.fe_logical; |
3572 | 3574 | ||
3573 | /* also, we should cover whole original request */ | 3575 | /* also, we should cover whole original request */ |
3574 | wins = EXT4_C2B(sbi, ac->ac_b_ex.fe_len - ac->ac_o_ex.fe_len); | 3576 | wins = EXT4_C2B(sbi, ac->ac_b_ex.fe_len - ac->ac_o_ex.fe_len); |
3575 | 3577 | ||
3576 | /* the smallest one defines real window */ | 3578 | /* the smallest one defines real window */ |
3577 | win = min(winl, wins); | 3579 | win = min(winl, wins); |
3578 | 3580 | ||
3579 | offs = ac->ac_o_ex.fe_logical % | 3581 | offs = ac->ac_o_ex.fe_logical % |
3580 | EXT4_C2B(sbi, ac->ac_b_ex.fe_len); | 3582 | EXT4_C2B(sbi, ac->ac_b_ex.fe_len); |
3581 | if (offs && offs < win) | 3583 | if (offs && offs < win) |
3582 | win = offs; | 3584 | win = offs; |
3583 | 3585 | ||
3584 | ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical - | 3586 | ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical - |
3585 | EXT4_NUM_B2C(sbi, win); | 3587 | EXT4_NUM_B2C(sbi, win); |
3586 | BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical); | 3588 | BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical); |
3587 | BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len); | 3589 | BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len); |
3588 | } | 3590 | } |
3589 | 3591 | ||
3590 | /* preallocation can change ac_b_ex, thus we store actually | 3592 | /* preallocation can change ac_b_ex, thus we store actually |
3591 | * allocated blocks for history */ | 3593 | * allocated blocks for history */ |
3592 | ac->ac_f_ex = ac->ac_b_ex; | 3594 | ac->ac_f_ex = ac->ac_b_ex; |
3593 | 3595 | ||
3594 | pa->pa_lstart = ac->ac_b_ex.fe_logical; | 3596 | pa->pa_lstart = ac->ac_b_ex.fe_logical; |
3595 | pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex); | 3597 | pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex); |
3596 | pa->pa_len = ac->ac_b_ex.fe_len; | 3598 | pa->pa_len = ac->ac_b_ex.fe_len; |
3597 | pa->pa_free = pa->pa_len; | 3599 | pa->pa_free = pa->pa_len; |
3598 | atomic_set(&pa->pa_count, 1); | 3600 | atomic_set(&pa->pa_count, 1); |
3599 | spin_lock_init(&pa->pa_lock); | 3601 | spin_lock_init(&pa->pa_lock); |
3600 | INIT_LIST_HEAD(&pa->pa_inode_list); | 3602 | INIT_LIST_HEAD(&pa->pa_inode_list); |
3601 | INIT_LIST_HEAD(&pa->pa_group_list); | 3603 | INIT_LIST_HEAD(&pa->pa_group_list); |
3602 | pa->pa_deleted = 0; | 3604 | pa->pa_deleted = 0; |
3603 | pa->pa_type = MB_INODE_PA; | 3605 | pa->pa_type = MB_INODE_PA; |
3604 | 3606 | ||
3605 | mb_debug(1, "new inode pa %p: %llu/%u for %u\n", pa, | 3607 | mb_debug(1, "new inode pa %p: %llu/%u for %u\n", pa, |
3606 | pa->pa_pstart, pa->pa_len, pa->pa_lstart); | 3608 | pa->pa_pstart, pa->pa_len, pa->pa_lstart); |
3607 | trace_ext4_mb_new_inode_pa(ac, pa); | 3609 | trace_ext4_mb_new_inode_pa(ac, pa); |
3608 | 3610 | ||
3609 | ext4_mb_use_inode_pa(ac, pa); | 3611 | ext4_mb_use_inode_pa(ac, pa); |
3610 | atomic_add(pa->pa_free, &sbi->s_mb_preallocated); | 3612 | atomic_add(pa->pa_free, &sbi->s_mb_preallocated); |
3611 | 3613 | ||
3612 | ei = EXT4_I(ac->ac_inode); | 3614 | ei = EXT4_I(ac->ac_inode); |
3613 | grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group); | 3615 | grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group); |
3614 | 3616 | ||
3615 | pa->pa_obj_lock = &ei->i_prealloc_lock; | 3617 | pa->pa_obj_lock = &ei->i_prealloc_lock; |
3616 | pa->pa_inode = ac->ac_inode; | 3618 | pa->pa_inode = ac->ac_inode; |
3617 | 3619 | ||
3618 | ext4_lock_group(sb, ac->ac_b_ex.fe_group); | 3620 | ext4_lock_group(sb, ac->ac_b_ex.fe_group); |
3619 | list_add(&pa->pa_group_list, &grp->bb_prealloc_list); | 3621 | list_add(&pa->pa_group_list, &grp->bb_prealloc_list); |
3620 | ext4_unlock_group(sb, ac->ac_b_ex.fe_group); | 3622 | ext4_unlock_group(sb, ac->ac_b_ex.fe_group); |
3621 | 3623 | ||
3622 | spin_lock(pa->pa_obj_lock); | 3624 | spin_lock(pa->pa_obj_lock); |
3623 | list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list); | 3625 | list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list); |
3624 | spin_unlock(pa->pa_obj_lock); | 3626 | spin_unlock(pa->pa_obj_lock); |
3625 | 3627 | ||
3626 | return 0; | 3628 | return 0; |
3627 | } | 3629 | } |
3628 | 3630 | ||
3629 | /* | 3631 | /* |
3630 | * creates new preallocated space for the locality group the inode belongs to | 3632 | * creates new preallocated space for the locality group the inode belongs to |
3631 | */ | 3633 | */ |
3632 | static noinline_for_stack int | 3634 | static noinline_for_stack int |
3633 | ext4_mb_new_group_pa(struct ext4_allocation_context *ac) | 3635 | ext4_mb_new_group_pa(struct ext4_allocation_context *ac) |
3634 | { | 3636 | { |
3635 | struct super_block *sb = ac->ac_sb; | 3637 | struct super_block *sb = ac->ac_sb; |
3636 | struct ext4_locality_group *lg; | 3638 | struct ext4_locality_group *lg; |
3637 | struct ext4_prealloc_space *pa; | 3639 | struct ext4_prealloc_space *pa; |
3638 | struct ext4_group_info *grp; | 3640 | struct ext4_group_info *grp; |
3639 | 3641 | ||
3640 | /* preallocate only when found space is larger than requested */ | 3642 | /* preallocate only when found space is larger than requested */ |
3641 | BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len); | 3643 | BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len); |
3642 | BUG_ON(ac->ac_status != AC_STATUS_FOUND); | 3644 | BUG_ON(ac->ac_status != AC_STATUS_FOUND); |
3643 | BUG_ON(!S_ISREG(ac->ac_inode->i_mode)); | 3645 | BUG_ON(!S_ISREG(ac->ac_inode->i_mode)); |
3644 | 3646 | ||
3645 | BUG_ON(ext4_pspace_cachep == NULL); | 3647 | BUG_ON(ext4_pspace_cachep == NULL); |
3646 | pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS); | 3648 | pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS); |
3647 | if (pa == NULL) | 3649 | if (pa == NULL) |
3648 | return -ENOMEM; | 3650 | return -ENOMEM; |
3649 | 3651 | ||
3650 | /* preallocation can change ac_b_ex, thus we store actually | 3652 | /* preallocation can change ac_b_ex, thus we store actually |
3651 | * allocated blocks for history */ | 3653 | * allocated blocks for history */ |
3652 | ac->ac_f_ex = ac->ac_b_ex; | 3654 | ac->ac_f_ex = ac->ac_b_ex; |
3653 | 3655 | ||
3654 | pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex); | 3656 | pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex); |
3655 | pa->pa_lstart = pa->pa_pstart; | 3657 | pa->pa_lstart = pa->pa_pstart; |
3656 | pa->pa_len = ac->ac_b_ex.fe_len; | 3658 | pa->pa_len = ac->ac_b_ex.fe_len; |
3657 | pa->pa_free = pa->pa_len; | 3659 | pa->pa_free = pa->pa_len; |
3658 | atomic_set(&pa->pa_count, 1); | 3660 | atomic_set(&pa->pa_count, 1); |
3659 | spin_lock_init(&pa->pa_lock); | 3661 | spin_lock_init(&pa->pa_lock); |
3660 | INIT_LIST_HEAD(&pa->pa_inode_list); | 3662 | INIT_LIST_HEAD(&pa->pa_inode_list); |
3661 | INIT_LIST_HEAD(&pa->pa_group_list); | 3663 | INIT_LIST_HEAD(&pa->pa_group_list); |
3662 | pa->pa_deleted = 0; | 3664 | pa->pa_deleted = 0; |
3663 | pa->pa_type = MB_GROUP_PA; | 3665 | pa->pa_type = MB_GROUP_PA; |
3664 | 3666 | ||
3665 | mb_debug(1, "new group pa %p: %llu/%u for %u\n", pa, | 3667 | mb_debug(1, "new group pa %p: %llu/%u for %u\n", pa, |
3666 | pa->pa_pstart, pa->pa_len, pa->pa_lstart); | 3668 | pa->pa_pstart, pa->pa_len, pa->pa_lstart); |
3667 | trace_ext4_mb_new_group_pa(ac, pa); | 3669 | trace_ext4_mb_new_group_pa(ac, pa); |
3668 | 3670 | ||
3669 | ext4_mb_use_group_pa(ac, pa); | 3671 | ext4_mb_use_group_pa(ac, pa); |
3670 | atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated); | 3672 | atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated); |
3671 | 3673 | ||
3672 | grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group); | 3674 | grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group); |
3673 | lg = ac->ac_lg; | 3675 | lg = ac->ac_lg; |
3674 | BUG_ON(lg == NULL); | 3676 | BUG_ON(lg == NULL); |
3675 | 3677 | ||
3676 | pa->pa_obj_lock = &lg->lg_prealloc_lock; | 3678 | pa->pa_obj_lock = &lg->lg_prealloc_lock; |
3677 | pa->pa_inode = NULL; | 3679 | pa->pa_inode = NULL; |
3678 | 3680 | ||
3679 | ext4_lock_group(sb, ac->ac_b_ex.fe_group); | 3681 | ext4_lock_group(sb, ac->ac_b_ex.fe_group); |
3680 | list_add(&pa->pa_group_list, &grp->bb_prealloc_list); | 3682 | list_add(&pa->pa_group_list, &grp->bb_prealloc_list); |
3681 | ext4_unlock_group(sb, ac->ac_b_ex.fe_group); | 3683 | ext4_unlock_group(sb, ac->ac_b_ex.fe_group); |
3682 | 3684 | ||
3683 | /* | 3685 | /* |
3684 | * We will later add the new pa to the right bucket | 3686 | * We will later add the new pa to the right bucket |
3685 | * after updating the pa_free in ext4_mb_release_context | 3687 | * after updating the pa_free in ext4_mb_release_context |
3686 | */ | 3688 | */ |
3687 | return 0; | 3689 | return 0; |
3688 | } | 3690 | } |
3689 | 3691 | ||
3690 | static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac) | 3692 | static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac) |
3691 | { | 3693 | { |
3692 | int err; | 3694 | int err; |
3693 | 3695 | ||
3694 | if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) | 3696 | if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) |
3695 | err = ext4_mb_new_group_pa(ac); | 3697 | err = ext4_mb_new_group_pa(ac); |
3696 | else | 3698 | else |
3697 | err = ext4_mb_new_inode_pa(ac); | 3699 | err = ext4_mb_new_inode_pa(ac); |
3698 | return err; | 3700 | return err; |
3699 | } | 3701 | } |
3700 | 3702 | ||
3701 | /* | 3703 | /* |
3702 | * finds all unused blocks in on-disk bitmap, frees them in | 3704 | * finds all unused blocks in on-disk bitmap, frees them in |
3703 | * in-core bitmap and buddy. | 3705 | * in-core bitmap and buddy. |
3704 | * @pa must be unlinked from inode and group lists, so that | 3706 | * @pa must be unlinked from inode and group lists, so that |
3705 | * nobody else can find/use it. | 3707 | * nobody else can find/use it. |
3706 | * the caller MUST hold group/inode locks. | 3708 | * the caller MUST hold group/inode locks. |
3707 | * TODO: optimize the case when there are no in-core structures yet | 3709 | * TODO: optimize the case when there are no in-core structures yet |
3708 | */ | 3710 | */ |
3709 | static noinline_for_stack int | 3711 | static noinline_for_stack int |
3710 | ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh, | 3712 | ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh, |
3711 | struct ext4_prealloc_space *pa) | 3713 | struct ext4_prealloc_space *pa) |
3712 | { | 3714 | { |
3713 | struct super_block *sb = e4b->bd_sb; | 3715 | struct super_block *sb = e4b->bd_sb; |
3714 | struct ext4_sb_info *sbi = EXT4_SB(sb); | 3716 | struct ext4_sb_info *sbi = EXT4_SB(sb); |
3715 | unsigned int end; | 3717 | unsigned int end; |
3716 | unsigned int next; | 3718 | unsigned int next; |
3717 | ext4_group_t group; | 3719 | ext4_group_t group; |
3718 | ext4_grpblk_t bit; | 3720 | ext4_grpblk_t bit; |
3719 | unsigned long long grp_blk_start; | 3721 | unsigned long long grp_blk_start; |
3720 | int err = 0; | 3722 | int err = 0; |
3721 | int free = 0; | 3723 | int free = 0; |
3722 | 3724 | ||
3723 | BUG_ON(pa->pa_deleted == 0); | 3725 | BUG_ON(pa->pa_deleted == 0); |
3724 | ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit); | 3726 | ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit); |
3725 | grp_blk_start = pa->pa_pstart - EXT4_C2B(sbi, bit); | 3727 | grp_blk_start = pa->pa_pstart - EXT4_C2B(sbi, bit); |
3726 | BUG_ON(group != e4b->bd_group && pa->pa_len != 0); | 3728 | BUG_ON(group != e4b->bd_group && pa->pa_len != 0); |
3727 | end = bit + pa->pa_len; | 3729 | end = bit + pa->pa_len; |
3728 | 3730 | ||
3729 | while (bit < end) { | 3731 | while (bit < end) { |
3730 | bit = mb_find_next_zero_bit(bitmap_bh->b_data, end, bit); | 3732 | bit = mb_find_next_zero_bit(bitmap_bh->b_data, end, bit); |
3731 | if (bit >= end) | 3733 | if (bit >= end) |
3732 | break; | 3734 | break; |
3733 | next = mb_find_next_bit(bitmap_bh->b_data, end, bit); | 3735 | next = mb_find_next_bit(bitmap_bh->b_data, end, bit); |
3734 | mb_debug(1, " free preallocated %u/%u in group %u\n", | 3736 | mb_debug(1, " free preallocated %u/%u in group %u\n", |
3735 | (unsigned) ext4_group_first_block_no(sb, group) + bit, | 3737 | (unsigned) ext4_group_first_block_no(sb, group) + bit, |
3736 | (unsigned) next - bit, (unsigned) group); | 3738 | (unsigned) next - bit, (unsigned) group); |
3737 | free += next - bit; | 3739 | free += next - bit; |
3738 | 3740 | ||
3739 | trace_ext4_mballoc_discard(sb, NULL, group, bit, next - bit); | 3741 | trace_ext4_mballoc_discard(sb, NULL, group, bit, next - bit); |
3740 | trace_ext4_mb_release_inode_pa(pa, (grp_blk_start + | 3742 | trace_ext4_mb_release_inode_pa(pa, (grp_blk_start + |
3741 | EXT4_C2B(sbi, bit)), | 3743 | EXT4_C2B(sbi, bit)), |
3742 | next - bit); | 3744 | next - bit); |
3743 | mb_free_blocks(pa->pa_inode, e4b, bit, next - bit); | 3745 | mb_free_blocks(pa->pa_inode, e4b, bit, next - bit); |
3744 | bit = next + 1; | 3746 | bit = next + 1; |
3745 | } | 3747 | } |
3746 | if (free != pa->pa_free) { | 3748 | if (free != pa->pa_free) { |
3747 | ext4_msg(e4b->bd_sb, KERN_CRIT, | 3749 | ext4_msg(e4b->bd_sb, KERN_CRIT, |
3748 | "pa %p: logic %lu, phys. %lu, len %lu", | 3750 | "pa %p: logic %lu, phys. %lu, len %lu", |
3749 | pa, (unsigned long) pa->pa_lstart, | 3751 | pa, (unsigned long) pa->pa_lstart, |
3750 | (unsigned long) pa->pa_pstart, | 3752 | (unsigned long) pa->pa_pstart, |
3751 | (unsigned long) pa->pa_len); | 3753 | (unsigned long) pa->pa_len); |
3752 | ext4_grp_locked_error(sb, group, 0, 0, "free %u, pa_free %u", | 3754 | ext4_grp_locked_error(sb, group, 0, 0, "free %u, pa_free %u", |
3753 | free, pa->pa_free); | 3755 | free, pa->pa_free); |
3754 | /* | 3756 | /* |
3755 | * pa is already deleted so we use the value obtained | 3757 | * pa is already deleted so we use the value obtained |
3756 | * from the bitmap and continue. | 3758 | * from the bitmap and continue. |
3757 | */ | 3759 | */ |
3758 | } | 3760 | } |
3759 | atomic_add(free, &sbi->s_mb_discarded); | 3761 | atomic_add(free, &sbi->s_mb_discarded); |
3760 | 3762 | ||
3761 | return err; | 3763 | return err; |
3762 | } | 3764 | } |
3763 | 3765 | ||
3764 | static noinline_for_stack int | 3766 | static noinline_for_stack int |
3765 | ext4_mb_release_group_pa(struct ext4_buddy *e4b, | 3767 | ext4_mb_release_group_pa(struct ext4_buddy *e4b, |
3766 | struct ext4_prealloc_space *pa) | 3768 | struct ext4_prealloc_space *pa) |
3767 | { | 3769 | { |
3768 | struct super_block *sb = e4b->bd_sb; | 3770 | struct super_block *sb = e4b->bd_sb; |
3769 | ext4_group_t group; | 3771 | ext4_group_t group; |
3770 | ext4_grpblk_t bit; | 3772 | ext4_grpblk_t bit; |
3771 | 3773 | ||
3772 | trace_ext4_mb_release_group_pa(sb, pa); | 3774 | trace_ext4_mb_release_group_pa(sb, pa); |
3773 | BUG_ON(pa->pa_deleted == 0); | 3775 | BUG_ON(pa->pa_deleted == 0); |
3774 | ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit); | 3776 | ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit); |
3775 | BUG_ON(group != e4b->bd_group && pa->pa_len != 0); | 3777 | BUG_ON(group != e4b->bd_group && pa->pa_len != 0); |
3776 | mb_free_blocks(pa->pa_inode, e4b, bit, pa->pa_len); | 3778 | mb_free_blocks(pa->pa_inode, e4b, bit, pa->pa_len); |
3777 | atomic_add(pa->pa_len, &EXT4_SB(sb)->s_mb_discarded); | 3779 | atomic_add(pa->pa_len, &EXT4_SB(sb)->s_mb_discarded); |
3778 | trace_ext4_mballoc_discard(sb, NULL, group, bit, pa->pa_len); | 3780 | trace_ext4_mballoc_discard(sb, NULL, group, bit, pa->pa_len); |
3779 | 3781 | ||
3780 | return 0; | 3782 | return 0; |
3781 | } | 3783 | } |
3782 | 3784 | ||
3783 | /* | 3785 | /* |
3784 | * releases all preallocations in given group | 3786 | * releases all preallocations in given group |
3785 | * | 3787 | * |
3786 | * first, we need to decide discard policy: | 3788 | * first, we need to decide discard policy: |
3787 | * - when do we discard | 3789 | * - when do we discard |
3788 | * 1) ENOSPC | 3790 | * 1) ENOSPC |
3789 | * - how many do we discard | 3791 | * - how many do we discard |
3790 | * 1) how many requested | 3792 | * 1) how many requested |
3791 | */ | 3793 | */ |
3792 | static noinline_for_stack int | 3794 | static noinline_for_stack int |
3793 | ext4_mb_discard_group_preallocations(struct super_block *sb, | 3795 | ext4_mb_discard_group_preallocations(struct super_block *sb, |
3794 | ext4_group_t group, int needed) | 3796 | ext4_group_t group, int needed) |
3795 | { | 3797 | { |
3796 | struct ext4_group_info *grp = ext4_get_group_info(sb, group); | 3798 | struct ext4_group_info *grp = ext4_get_group_info(sb, group); |
3797 | struct buffer_head *bitmap_bh = NULL; | 3799 | struct buffer_head *bitmap_bh = NULL; |
3798 | struct ext4_prealloc_space *pa, *tmp; | 3800 | struct ext4_prealloc_space *pa, *tmp; |
3799 | struct list_head list; | 3801 | struct list_head list; |
3800 | struct ext4_buddy e4b; | 3802 | struct ext4_buddy e4b; |
3801 | int err; | 3803 | int err; |
3802 | int busy = 0; | 3804 | int busy = 0; |
3803 | int free = 0; | 3805 | int free = 0; |
3804 | 3806 | ||
3805 | mb_debug(1, "discard preallocation for group %u\n", group); | 3807 | mb_debug(1, "discard preallocation for group %u\n", group); |
3806 | 3808 | ||
3807 | if (list_empty(&grp->bb_prealloc_list)) | 3809 | if (list_empty(&grp->bb_prealloc_list)) |
3808 | return 0; | 3810 | return 0; |
3809 | 3811 | ||
3810 | bitmap_bh = ext4_read_block_bitmap(sb, group); | 3812 | bitmap_bh = ext4_read_block_bitmap(sb, group); |
3811 | if (bitmap_bh == NULL) { | 3813 | if (bitmap_bh == NULL) { |
3812 | ext4_error(sb, "Error reading block bitmap for %u", group); | 3814 | ext4_error(sb, "Error reading block bitmap for %u", group); |
3813 | return 0; | 3815 | return 0; |
3814 | } | 3816 | } |
3815 | 3817 | ||
3816 | err = ext4_mb_load_buddy(sb, group, &e4b); | 3818 | err = ext4_mb_load_buddy(sb, group, &e4b); |
3817 | if (err) { | 3819 | if (err) { |
3818 | ext4_error(sb, "Error loading buddy information for %u", group); | 3820 | ext4_error(sb, "Error loading buddy information for %u", group); |
3819 | put_bh(bitmap_bh); | 3821 | put_bh(bitmap_bh); |
3820 | return 0; | 3822 | return 0; |
3821 | } | 3823 | } |
3822 | 3824 | ||
3823 | if (needed == 0) | 3825 | if (needed == 0) |
3824 | needed = EXT4_CLUSTERS_PER_GROUP(sb) + 1; | 3826 | needed = EXT4_CLUSTERS_PER_GROUP(sb) + 1; |
3825 | 3827 | ||
3826 | INIT_LIST_HEAD(&list); | 3828 | INIT_LIST_HEAD(&list); |
3827 | repeat: | 3829 | repeat: |
3828 | ext4_lock_group(sb, group); | 3830 | ext4_lock_group(sb, group); |
3829 | list_for_each_entry_safe(pa, tmp, | 3831 | list_for_each_entry_safe(pa, tmp, |
3830 | &grp->bb_prealloc_list, pa_group_list) { | 3832 | &grp->bb_prealloc_list, pa_group_list) { |
3831 | spin_lock(&pa->pa_lock); | 3833 | spin_lock(&pa->pa_lock); |
3832 | if (atomic_read(&pa->pa_count)) { | 3834 | if (atomic_read(&pa->pa_count)) { |
3833 | spin_unlock(&pa->pa_lock); | 3835 | spin_unlock(&pa->pa_lock); |
3834 | busy = 1; | 3836 | busy = 1; |
3835 | continue; | 3837 | continue; |
3836 | } | 3838 | } |
3837 | if (pa->pa_deleted) { | 3839 | if (pa->pa_deleted) { |
3838 | spin_unlock(&pa->pa_lock); | 3840 | spin_unlock(&pa->pa_lock); |
3839 | continue; | 3841 | continue; |
3840 | } | 3842 | } |
3841 | 3843 | ||
3842 | /* seems this one can be freed ... */ | 3844 | /* seems this one can be freed ... */ |
3843 | pa->pa_deleted = 1; | 3845 | pa->pa_deleted = 1; |
3844 | 3846 | ||
3845 | /* we can trust pa_free ... */ | 3847 | /* we can trust pa_free ... */ |
3846 | free += pa->pa_free; | 3848 | free += pa->pa_free; |
3847 | 3849 | ||
3848 | spin_unlock(&pa->pa_lock); | 3850 | spin_unlock(&pa->pa_lock); |
3849 | 3851 | ||
3850 | list_del(&pa->pa_group_list); | 3852 | list_del(&pa->pa_group_list); |
3851 | list_add(&pa->u.pa_tmp_list, &list); | 3853 | list_add(&pa->u.pa_tmp_list, &list); |
3852 | } | 3854 | } |
3853 | 3855 | ||
3854 | /* if we still need more blocks and some PAs were used, try again */ | 3856 | /* if we still need more blocks and some PAs were used, try again */ |
3855 | if (free < needed && busy) { | 3857 | if (free < needed && busy) { |
3856 | busy = 0; | 3858 | busy = 0; |
3857 | ext4_unlock_group(sb, group); | 3859 | ext4_unlock_group(sb, group); |
3858 | cond_resched(); | 3860 | cond_resched(); |
3859 | goto repeat; | 3861 | goto repeat; |
3860 | } | 3862 | } |
3861 | 3863 | ||
3862 | /* found anything to free? */ | 3864 | /* found anything to free? */ |
3863 | if (list_empty(&list)) { | 3865 | if (list_empty(&list)) { |
3864 | BUG_ON(free != 0); | 3866 | BUG_ON(free != 0); |
3865 | goto out; | 3867 | goto out; |
3866 | } | 3868 | } |
3867 | 3869 | ||
3868 | /* now free all selected PAs */ | 3870 | /* now free all selected PAs */ |
3869 | list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) { | 3871 | list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) { |
3870 | 3872 | ||
3871 | /* remove from object (inode or locality group) */ | 3873 | /* remove from object (inode or locality group) */ |
3872 | spin_lock(pa->pa_obj_lock); | 3874 | spin_lock(pa->pa_obj_lock); |
3873 | list_del_rcu(&pa->pa_inode_list); | 3875 | list_del_rcu(&pa->pa_inode_list); |
3874 | spin_unlock(pa->pa_obj_lock); | 3876 | spin_unlock(pa->pa_obj_lock); |
3875 | 3877 | ||
3876 | if (pa->pa_type == MB_GROUP_PA) | 3878 | if (pa->pa_type == MB_GROUP_PA) |
3877 | ext4_mb_release_group_pa(&e4b, pa); | 3879 | ext4_mb_release_group_pa(&e4b, pa); |
3878 | else | 3880 | else |
3879 | ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa); | 3881 | ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa); |
3880 | 3882 | ||
3881 | list_del(&pa->u.pa_tmp_list); | 3883 | list_del(&pa->u.pa_tmp_list); |
3882 | call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback); | 3884 | call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback); |
3883 | } | 3885 | } |
3884 | 3886 | ||
3885 | out: | 3887 | out: |
3886 | ext4_unlock_group(sb, group); | 3888 | ext4_unlock_group(sb, group); |
3887 | ext4_mb_unload_buddy(&e4b); | 3889 | ext4_mb_unload_buddy(&e4b); |
3888 | put_bh(bitmap_bh); | 3890 | put_bh(bitmap_bh); |
3889 | return free; | 3891 | return free; |
3890 | } | 3892 | } |
3891 | 3893 | ||
3892 | /* | 3894 | /* |
3893 | * releases all unused preallocated blocks for the given inode | 3895 | * releases all unused preallocated blocks for the given inode |
3894 | * | 3896 | * |
3895 | * It's important to discard preallocations under i_data_sem | 3897 | * It's important to discard preallocations under i_data_sem |
3896 | * We don't want another block to be served from the prealloc | 3898 | * We don't want another block to be served from the prealloc |
3897 | * space when we are discarding the inode prealloc space. | 3899 | * space when we are discarding the inode prealloc space. |
3898 | * | 3900 | * |
3899 | * FIXME!! Make sure it is valid at all the call sites | 3901 | * FIXME!! Make sure it is valid at all the call sites |
3900 | */ | 3902 | */ |
3901 | void ext4_discard_preallocations(struct inode *inode) | 3903 | void ext4_discard_preallocations(struct inode *inode) |
3902 | { | 3904 | { |
3903 | struct ext4_inode_info *ei = EXT4_I(inode); | 3905 | struct ext4_inode_info *ei = EXT4_I(inode); |
3904 | struct super_block *sb = inode->i_sb; | 3906 | struct super_block *sb = inode->i_sb; |
3905 | struct buffer_head *bitmap_bh = NULL; | 3907 | struct buffer_head *bitmap_bh = NULL; |
3906 | struct ext4_prealloc_space *pa, *tmp; | 3908 | struct ext4_prealloc_space *pa, *tmp; |
3907 | ext4_group_t group = 0; | 3909 | ext4_group_t group = 0; |
3908 | struct list_head list; | 3910 | struct list_head list; |
3909 | struct ext4_buddy e4b; | 3911 | struct ext4_buddy e4b; |
3910 | int err; | 3912 | int err; |
3911 | 3913 | ||
3912 | if (!S_ISREG(inode->i_mode)) { | 3914 | if (!S_ISREG(inode->i_mode)) { |
3913 | /*BUG_ON(!list_empty(&ei->i_prealloc_list));*/ | 3915 | /*BUG_ON(!list_empty(&ei->i_prealloc_list));*/ |
3914 | return; | 3916 | return; |
3915 | } | 3917 | } |
3916 | 3918 | ||
3917 | mb_debug(1, "discard preallocation for inode %lu\n", inode->i_ino); | 3919 | mb_debug(1, "discard preallocation for inode %lu\n", inode->i_ino); |
3918 | trace_ext4_discard_preallocations(inode); | 3920 | trace_ext4_discard_preallocations(inode); |
3919 | 3921 | ||
3920 | INIT_LIST_HEAD(&list); | 3922 | INIT_LIST_HEAD(&list); |
3921 | 3923 | ||
3922 | repeat: | 3924 | repeat: |
3923 | /* first, collect all pa's in the inode */ | 3925 | /* first, collect all pa's in the inode */ |
3924 | spin_lock(&ei->i_prealloc_lock); | 3926 | spin_lock(&ei->i_prealloc_lock); |
3925 | while (!list_empty(&ei->i_prealloc_list)) { | 3927 | while (!list_empty(&ei->i_prealloc_list)) { |
3926 | pa = list_entry(ei->i_prealloc_list.next, | 3928 | pa = list_entry(ei->i_prealloc_list.next, |
3927 | struct ext4_prealloc_space, pa_inode_list); | 3929 | struct ext4_prealloc_space, pa_inode_list); |
3928 | BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock); | 3930 | BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock); |
3929 | spin_lock(&pa->pa_lock); | 3931 | spin_lock(&pa->pa_lock); |
3930 | if (atomic_read(&pa->pa_count)) { | 3932 | if (atomic_read(&pa->pa_count)) { |
3931 | /* this shouldn't happen often - nobody should | 3933 | /* this shouldn't happen often - nobody should |
3932 | * use preallocation while we're discarding it */ | 3934 | * use preallocation while we're discarding it */ |
3933 | spin_unlock(&pa->pa_lock); | 3935 | spin_unlock(&pa->pa_lock); |
3934 | spin_unlock(&ei->i_prealloc_lock); | 3936 | spin_unlock(&ei->i_prealloc_lock); |
3935 | ext4_msg(sb, KERN_ERR, | 3937 | ext4_msg(sb, KERN_ERR, |
3936 | "uh-oh! used pa while discarding"); | 3938 | "uh-oh! used pa while discarding"); |
3937 | WARN_ON(1); | 3939 | WARN_ON(1); |
3938 | schedule_timeout_uninterruptible(HZ); | 3940 | schedule_timeout_uninterruptible(HZ); |
3939 | goto repeat; | 3941 | goto repeat; |
3940 | 3942 | ||
3941 | } | 3943 | } |
3942 | if (pa->pa_deleted == 0) { | 3944 | if (pa->pa_deleted == 0) { |
3943 | pa->pa_deleted = 1; | 3945 | pa->pa_deleted = 1; |
3944 | spin_unlock(&pa->pa_lock); | 3946 | spin_unlock(&pa->pa_lock); |
3945 | list_del_rcu(&pa->pa_inode_list); | 3947 | list_del_rcu(&pa->pa_inode_list); |
3946 | list_add(&pa->u.pa_tmp_list, &list); | 3948 | list_add(&pa->u.pa_tmp_list, &list); |
3947 | continue; | 3949 | continue; |
3948 | } | 3950 | } |
3949 | 3951 | ||
3950 | /* someone is deleting pa right now */ | 3952 | /* someone is deleting pa right now */ |
3951 | spin_unlock(&pa->pa_lock); | 3953 | spin_unlock(&pa->pa_lock); |
3952 | spin_unlock(&ei->i_prealloc_lock); | 3954 | spin_unlock(&ei->i_prealloc_lock); |
3953 | 3955 | ||
        /* we have to wait here because pa_deleted
         * doesn't mean pa is already unlinked from
         * the list. as we might be called from
         * ->clear_inode() the inode will get freed
         * and concurrent thread which is unlinking
         * pa from inode's list may access already
         * freed memory, bad-bad-bad */

        /* XXX: if this happens too often, we can
         * add a flag to force wait only in case
         * of ->clear_inode(), but not in case of
         * regular truncate */
        schedule_timeout_uninterruptible(HZ);
        goto repeat;
    }
    spin_unlock(&ei->i_prealloc_lock);

    list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
        BUG_ON(pa->pa_type != MB_INODE_PA);
        group = ext4_get_group_number(sb, pa->pa_pstart);

        err = ext4_mb_load_buddy(sb, group, &e4b);
        if (err) {
            ext4_error(sb, "Error loading buddy information for %u",
                       group);
            continue;
        }

        bitmap_bh = ext4_read_block_bitmap(sb, group);
        if (bitmap_bh == NULL) {
            ext4_error(sb, "Error reading block bitmap for %u",
                       group);
            ext4_mb_unload_buddy(&e4b);
            continue;
        }

        ext4_lock_group(sb, group);
        list_del(&pa->pa_group_list);
        ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
        ext4_unlock_group(sb, group);

        ext4_mb_unload_buddy(&e4b);
        put_bh(bitmap_bh);

        list_del(&pa->u.pa_tmp_list);
        call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
    }
}

#ifdef CONFIG_EXT4_DEBUG
static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
{
    struct super_block *sb = ac->ac_sb;
    ext4_group_t ngroups, i;

    if (!ext4_mballoc_debug ||
        (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED))
        return;

    ext4_msg(ac->ac_sb, KERN_ERR, "Can't allocate:"
            " Allocation context details:");
    ext4_msg(ac->ac_sb, KERN_ERR, "status %d flags %d",
            ac->ac_status, ac->ac_flags);
    ext4_msg(ac->ac_sb, KERN_ERR, "orig %lu/%lu/%lu@%lu, "
            "goal %lu/%lu/%lu@%lu, "
            "best %lu/%lu/%lu@%lu cr %d",
            (unsigned long)ac->ac_o_ex.fe_group,
            (unsigned long)ac->ac_o_ex.fe_start,
            (unsigned long)ac->ac_o_ex.fe_len,
            (unsigned long)ac->ac_o_ex.fe_logical,
            (unsigned long)ac->ac_g_ex.fe_group,
            (unsigned long)ac->ac_g_ex.fe_start,
            (unsigned long)ac->ac_g_ex.fe_len,
            (unsigned long)ac->ac_g_ex.fe_logical,
            (unsigned long)ac->ac_b_ex.fe_group,
            (unsigned long)ac->ac_b_ex.fe_start,
            (unsigned long)ac->ac_b_ex.fe_len,
            (unsigned long)ac->ac_b_ex.fe_logical,
            (int)ac->ac_criteria);
    ext4_msg(ac->ac_sb, KERN_ERR, "%lu scanned, %d found",
            ac->ac_ex_scanned, ac->ac_found);
    ext4_msg(ac->ac_sb, KERN_ERR, "groups: ");
    ngroups = ext4_get_groups_count(sb);
    for (i = 0; i < ngroups; i++) {
        struct ext4_group_info *grp = ext4_get_group_info(sb, i);
        struct ext4_prealloc_space *pa;
        ext4_grpblk_t start;
        struct list_head *cur;
        ext4_lock_group(sb, i);
        list_for_each(cur, &grp->bb_prealloc_list) {
            pa = list_entry(cur, struct ext4_prealloc_space,
                            pa_group_list);
            spin_lock(&pa->pa_lock);
            ext4_get_group_no_and_offset(sb, pa->pa_pstart,
                                         NULL, &start);
            spin_unlock(&pa->pa_lock);
            printk(KERN_ERR "PA:%u:%d:%u \n", i,
                   start, pa->pa_len);
        }
        ext4_unlock_group(sb, i);

        if (grp->bb_free == 0)
            continue;
        printk(KERN_ERR "%u: %d/%d \n",
               i, grp->bb_free, grp->bb_fragments);
    }
    printk(KERN_ERR "\n");
}
#else
static inline void ext4_mb_show_ac(struct ext4_allocation_context *ac)
{
    return;
}
#endif

/*
 * We use locality group preallocation for small size file. The size of the
 * file is determined by the current size or the resulting size after
 * allocation which ever is larger
 *
 * One can tune this size via /sys/fs/ext4/<partition>/mb_stream_req
 */
static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
{
    struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
    int bsbits = ac->ac_sb->s_blocksize_bits;
    loff_t size, isize;

    if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
        return;

    if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
        return;

    size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
    isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
            >> bsbits;

    if ((size == isize) &&
        !ext4_fs_is_busy(sbi) &&
        (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
        ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
        return;
    }

    if (sbi->s_mb_group_prealloc <= 0) {
        ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
        return;
    }

    /* don't use group allocation for large files */
    size = max(size, isize);
    if (size > sbi->s_mb_stream_request) {
        ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
        return;
    }

    BUG_ON(ac->ac_lg != NULL);
    /*
     * locality group prealloc space are per cpu. The reason for having
     * per cpu locality group is to reduce the contention between block
     * request from multiple CPUs.
     */
    ac->ac_lg = __this_cpu_ptr(sbi->s_locality_groups);

    /* we're going to use group allocation */
    ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;

    /* serialize all allocations in the group */
    mutex_lock(&ac->ac_lg->lg_mutex);
}

static noinline_for_stack int
ext4_mb_initialize_context(struct ext4_allocation_context *ac,
                           struct ext4_allocation_request *ar)
{
    struct super_block *sb = ar->inode->i_sb;
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    struct ext4_super_block *es = sbi->s_es;
    ext4_group_t group;
    unsigned int len;
    ext4_fsblk_t goal;
    ext4_grpblk_t block;

    /* we can't allocate > group size */
    len = ar->len;

    /* just a dirty hack to filter too big requests */
    if (len >= EXT4_CLUSTERS_PER_GROUP(sb))
        len = EXT4_CLUSTERS_PER_GROUP(sb);

    /* start searching from the goal */
    goal = ar->goal;
    if (goal < le32_to_cpu(es->s_first_data_block) ||
        goal >= ext4_blocks_count(es))
        goal = le32_to_cpu(es->s_first_data_block);
    ext4_get_group_no_and_offset(sb, goal, &group, &block);

    /* set up allocation goals */
    ac->ac_b_ex.fe_logical = EXT4_LBLK_CMASK(sbi, ar->logical);
    ac->ac_status = AC_STATUS_CONTINUE;
    ac->ac_sb = sb;
    ac->ac_inode = ar->inode;
    ac->ac_o_ex.fe_logical = ac->ac_b_ex.fe_logical;
    ac->ac_o_ex.fe_group = group;
    ac->ac_o_ex.fe_start = block;
    ac->ac_o_ex.fe_len = len;
    ac->ac_g_ex = ac->ac_o_ex;
    ac->ac_flags = ar->flags;

    /* we have to define context: we'll we work with a file or
     * locality group. this is a policy, actually */
    ext4_mb_group_or_file(ac);

    mb_debug(1, "init ac: %u blocks @ %u, goal %u, flags %x, 2^%d, "
            "left: %u/%u, right %u/%u to %swritable\n",
            (unsigned) ar->len, (unsigned) ar->logical,
            (unsigned) ar->goal, ac->ac_flags, ac->ac_2order,
            (unsigned) ar->lleft, (unsigned) ar->pleft,
            (unsigned) ar->lright, (unsigned) ar->pright,
            atomic_read(&ar->inode->i_writecount) ? "" : "non-");
    return 0;

}

static noinline_for_stack void
ext4_mb_discard_lg_preallocations(struct super_block *sb,
                                  struct ext4_locality_group *lg,
                                  int order, int total_entries)
{
    ext4_group_t group = 0;
    struct ext4_buddy e4b;
    struct list_head discard_list;
    struct ext4_prealloc_space *pa, *tmp;

    mb_debug(1, "discard locality group preallocation\n");

    INIT_LIST_HEAD(&discard_list);

    spin_lock(&lg->lg_prealloc_lock);
    list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[order],
                            pa_inode_list) {
        spin_lock(&pa->pa_lock);
        if (atomic_read(&pa->pa_count)) {
            /*
             * This is the pa that we just used
             * for block allocation. So don't
             * free that
             */
            spin_unlock(&pa->pa_lock);
            continue;
        }
        if (pa->pa_deleted) {
            spin_unlock(&pa->pa_lock);
            continue;
        }
        /* only lg prealloc space */
        BUG_ON(pa->pa_type != MB_GROUP_PA);

        /* seems this one can be freed ... */
        pa->pa_deleted = 1;
        spin_unlock(&pa->pa_lock);

        list_del_rcu(&pa->pa_inode_list);
        list_add(&pa->u.pa_tmp_list, &discard_list);

        total_entries--;
        if (total_entries <= 5) {
            /*
             * we want to keep only 5 entries
             * allowing it to grow to 8. This
             * mak sure we don't call discard
             * soon for this list.
             */
            break;
        }
    }
    spin_unlock(&lg->lg_prealloc_lock);

    list_for_each_entry_safe(pa, tmp, &discard_list, u.pa_tmp_list) {

        group = ext4_get_group_number(sb, pa->pa_pstart);
        if (ext4_mb_load_buddy(sb, group, &e4b)) {
            ext4_error(sb, "Error loading buddy information for %u",
                       group);
            continue;
        }
        ext4_lock_group(sb, group);
        list_del(&pa->pa_group_list);
        ext4_mb_release_group_pa(&e4b, pa);
        ext4_unlock_group(sb, group);

        ext4_mb_unload_buddy(&e4b);
        list_del(&pa->u.pa_tmp_list);
        call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
    }
}

/*
 * We have incremented pa_count. So it cannot be freed at this
 * point. Also we hold lg_mutex. So no parallel allocation is
 * possible from this lg. That means pa_free cannot be updated.
 *
 * A parallel ext4_mb_discard_group_preallocations is possible.
 * which can cause the lg_prealloc_list to be updated.
 */

static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
{
    int order, added = 0, lg_prealloc_count = 1;
    struct super_block *sb = ac->ac_sb;
    struct ext4_locality_group *lg = ac->ac_lg;
    struct ext4_prealloc_space *tmp_pa, *pa = ac->ac_pa;

    order = fls(pa->pa_free) - 1;
    if (order > PREALLOC_TB_SIZE - 1)
        /* The max size of hash table is PREALLOC_TB_SIZE */
        order = PREALLOC_TB_SIZE - 1;
    /* Add the prealloc space to lg */
    spin_lock(&lg->lg_prealloc_lock);
    list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order],
                            pa_inode_list) {
        spin_lock(&tmp_pa->pa_lock);
        if (tmp_pa->pa_deleted) {
            spin_unlock(&tmp_pa->pa_lock);
            continue;
        }
        if (!added && pa->pa_free < tmp_pa->pa_free) {
            /* Add to the tail of the previous entry */
            list_add_tail_rcu(&pa->pa_inode_list,
                              &tmp_pa->pa_inode_list);
            added = 1;
            /*
             * we want to count the total
             * number of entries in the list
             */
        }
        spin_unlock(&tmp_pa->pa_lock);
        lg_prealloc_count++;
    }
    if (!added)
        list_add_tail_rcu(&pa->pa_inode_list,
                          &lg->lg_prealloc_list[order]);
    spin_unlock(&lg->lg_prealloc_lock);

    /* Now trim the list to be not more than 8 elements */
    if (lg_prealloc_count > 8) {
        ext4_mb_discard_lg_preallocations(sb, lg,
                                          order, lg_prealloc_count);
        return;
    }
    return ;
}

/*
 * release all resource we used in allocation
 */
static int ext4_mb_release_context(struct ext4_allocation_context *ac)
{
    struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
    struct ext4_prealloc_space *pa = ac->ac_pa;
    if (pa) {
        if (pa->pa_type == MB_GROUP_PA) {
            /* see comment in ext4_mb_use_group_pa() */
            spin_lock(&pa->pa_lock);
            pa->pa_pstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
            pa->pa_lstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
            pa->pa_free -= ac->ac_b_ex.fe_len;
            pa->pa_len -= ac->ac_b_ex.fe_len;
            spin_unlock(&pa->pa_lock);
        }
    }
    if (pa) {
        /*
         * We want to add the pa to the right bucket.
         * Remove it from the list and while adding
         * make sure the list to which we are adding
         * doesn't grow big.
         */
        if ((pa->pa_type == MB_GROUP_PA) && likely(pa->pa_free)) {
            spin_lock(pa->pa_obj_lock);
            list_del_rcu(&pa->pa_inode_list);
            spin_unlock(pa->pa_obj_lock);
            ext4_mb_add_n_trim(ac);
        }
        ext4_mb_put_pa(ac, ac->ac_sb, pa);
    }
    if (ac->ac_bitmap_page)
        page_cache_release(ac->ac_bitmap_page);
    if (ac->ac_buddy_page)
        page_cache_release(ac->ac_buddy_page);
    if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
        mutex_unlock(&ac->ac_lg->lg_mutex);
    ext4_mb_collect_stats(ac);
    return 0;
}

static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
{
    ext4_group_t i, ngroups = ext4_get_groups_count(sb);
    int ret;
    int freed = 0;

    trace_ext4_mb_discard_preallocations(sb, needed);
    for (i = 0; i < ngroups && needed > 0; i++) {
        ret = ext4_mb_discard_group_preallocations(sb, i, needed);
        freed += ret;
        needed -= ret;
    }

    return freed;
}

/*
 * Main entry point into mballoc to allocate blocks
 * it tries to use preallocation first, then falls back
 * to usual allocation
 */
ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
                                struct ext4_allocation_request *ar, int *errp)
{
    int freed;
    struct ext4_allocation_context *ac = NULL;
    struct ext4_sb_info *sbi;
    struct super_block *sb;
    ext4_fsblk_t block = 0;
    unsigned int inquota = 0;
    unsigned int reserv_clstrs = 0;

    might_sleep();
    sb = ar->inode->i_sb;
    sbi = EXT4_SB(sb);

    trace_ext4_request_blocks(ar);

    /* Allow to use superuser reservation for quota file */
    if (IS_NOQUOTA(ar->inode))
        ar->flags |= EXT4_MB_USE_ROOT_BLOCKS;

    /*
     * For delayed allocation, we could skip the ENOSPC and
     * EDQUOT check, as blocks and quotas have been already
     * reserved when data being copied into pagecache.
     */
    if (ext4_test_inode_state(ar->inode, EXT4_STATE_DELALLOC_RESERVED))
        ar->flags |= EXT4_MB_DELALLOC_RESERVED;
    else {
        /* Without delayed allocation we need to verify
         * there is enough free blocks to do block allocation
         * and verify allocation doesn't exceed the quota limits.
         */
        while (ar->len &&
               ext4_claim_free_clusters(sbi, ar->len, ar->flags)) {

            /* let others to free the space */
            cond_resched();
            ar->len = ar->len >> 1;
        }
        if (!ar->len) {
            *errp = -ENOSPC;
            return 0;
        }
        reserv_clstrs = ar->len;
        if (ar->flags & EXT4_MB_USE_ROOT_BLOCKS) {
            dquot_alloc_block_nofail(ar->inode,
                                     EXT4_C2B(sbi, ar->len));
        } else {
            while (ar->len &&
                   dquot_alloc_block(ar->inode,
                                     EXT4_C2B(sbi, ar->len))) {

                ar->flags |= EXT4_MB_HINT_NOPREALLOC;
                ar->len--;
            }
        }
        inquota = ar->len;
        if (ar->len == 0) {
            *errp = -EDQUOT;
            goto out;
        }
    }

    ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_NOFS);
    if (!ac) {
        ar->len = 0;
        *errp = -ENOMEM;
        goto out;
    }

    *errp = ext4_mb_initialize_context(ac, ar);
    if (*errp) {
        ar->len = 0;
        goto out;
    }

    ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
    if (!ext4_mb_use_preallocated(ac)) {
        ac->ac_op = EXT4_MB_HISTORY_ALLOC;
        ext4_mb_normalize_request(ac, ar);
repeat:
        /* allocate space in core */
        *errp = ext4_mb_regular_allocator(ac);
        if (*errp)
            goto discard_and_exit;

        /* as we've just preallocated more space than
         * user requested originally, we store allocated
         * space in a special descriptor */
        if (ac->ac_status == AC_STATUS_FOUND &&
            ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
            *errp = ext4_mb_new_preallocation(ac);
        if (*errp) {
        discard_and_exit:
            ext4_discard_allocated_blocks(ac);
            goto errout;
        }
    }
    if (likely(ac->ac_status == AC_STATUS_FOUND)) {
        *errp = ext4_mb_mark_diskspace_used(ac, handle, reserv_clstrs);
        if (*errp == -EAGAIN) {
            /*
             * drop the reference that we took
             * in ext4_mb_use_best_found
             */
            ext4_mb_release_context(ac);
            ac->ac_b_ex.fe_group = 0;
            ac->ac_b_ex.fe_start = 0;
            ac->ac_b_ex.fe_len = 0;
            ac->ac_status = AC_STATUS_CONTINUE;
            goto repeat;
        } else if (*errp) {
            ext4_discard_allocated_blocks(ac);
            goto errout;
        } else {
            block = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
            ar->len = ac->ac_b_ex.fe_len;
        }
    } else {
        freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
        if (freed)
            goto repeat;
        *errp = -ENOSPC;
    }

errout:
    if (*errp) {
        ac->ac_b_ex.fe_len = 0;
        ar->len = 0;
        ext4_mb_show_ac(ac);
    }
    ext4_mb_release_context(ac);
out:
    if (ac)
        kmem_cache_free(ext4_ac_cachep, ac);
    if (inquota && ar->len < inquota)
        dquot_free_block(ar->inode, EXT4_C2B(sbi, inquota - ar->len));
    if (!ar->len) {
        if (!ext4_test_inode_state(ar->inode,
                                   EXT4_STATE_DELALLOC_RESERVED))
            /* release all the reserved blocks if non delalloc */
            percpu_counter_sub(&sbi->s_dirtyclusters_counter,
                               reserv_clstrs);
    }

    trace_ext4_allocate_blocks(ar, (unsigned long long)block);

    return block;
}

/*
 * We can merge two free data extents only if the physical blocks
 * are contiguous, AND the extents were freed by the same transaction,
 * AND the blocks are associated with the same group.
 */
static int can_merge(struct ext4_free_data *entry1,
                     struct ext4_free_data *entry2)
{
    if ((entry1->efd_tid == entry2->efd_tid) &&
        (entry1->efd_group == entry2->efd_group) &&
        ((entry1->efd_start_cluster + entry1->efd_count) == entry2->efd_start_cluster))
        return 1;
    return 0;
}

static noinline_for_stack int
ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
                      struct ext4_free_data *new_entry)
{
    ext4_group_t group = e4b->bd_group;
    ext4_grpblk_t cluster;
    struct ext4_free_data *entry;
    struct ext4_group_info *db = e4b->bd_info;
    struct super_block *sb = e4b->bd_sb;
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    struct rb_node **n = &db->bb_free_root.rb_node, *node;
    struct rb_node *parent = NULL, *new_node;

    BUG_ON(!ext4_handle_valid(handle));
    BUG_ON(e4b->bd_bitmap_page == NULL);
    BUG_ON(e4b->bd_buddy_page == NULL);

    new_node = &new_entry->efd_node;
    cluster = new_entry->efd_start_cluster;

    if (!*n) {
        /* first free block exent. We need to
           protect buddy cache from being freed,
         * otherwise we'll refresh it from
         * on-disk bitmap and lose not-yet-available
         * blocks */
        page_cache_get(e4b->bd_buddy_page);
        page_cache_get(e4b->bd_bitmap_page);
    }
    while (*n) {
        parent = *n;
        entry = rb_entry(parent, struct ext4_free_data, efd_node);
        if (cluster < entry->efd_start_cluster)
            n = &(*n)->rb_left;
        else if (cluster >= (entry->efd_start_cluster + entry->efd_count))
            n = &(*n)->rb_right;
        else {
            ext4_grp_locked_error(sb, group, 0,
                    ext4_group_first_block_no(sb, group) +
                    EXT4_C2B(sbi, cluster),
                    "Block already on to-be-freed list");
            return 0;
        }
    }

    rb_link_node(new_node, parent, n);
    rb_insert_color(new_node, &db->bb_free_root);

    /* Now try to see the extent can be merged to left and right */
    node = rb_prev(new_node);
    if (node) {
        entry = rb_entry(node, struct ext4_free_data, efd_node);
        if (can_merge(entry, new_entry) &&
            ext4_journal_callback_try_del(handle, &entry->efd_jce)) {
            new_entry->efd_start_cluster = entry->efd_start_cluster;
            new_entry->efd_count += entry->efd_count;
            rb_erase(node, &(db->bb_free_root));
            kmem_cache_free(ext4_free_data_cachep, entry);
        }
    }

    node = rb_next(new_node);
    if (node) {
        entry = rb_entry(node, struct ext4_free_data, efd_node);
        if (can_merge(new_entry, entry) &&
            ext4_journal_callback_try_del(handle, &entry->efd_jce)) {
            new_entry->efd_count += entry->efd_count;
            rb_erase(node, &(db->bb_free_root));
            kmem_cache_free(ext4_free_data_cachep, entry);
        }
    }
    /* Add the extent to transaction's private list */
    ext4_journal_callback_add(handle, ext4_free_data_callback,
                              &new_entry->efd_jce);
    return 0;
}

/**
 * ext4_free_blocks() -- Free given blocks and update quota
 * @handle: handle for this transaction
 * @inode: inode
 * @block: start physical block to free
 * @count: number of blocks to count
 * @flags: flags used by ext4_free_blocks
 */
void ext4_free_blocks(handle_t *handle, struct inode *inode,
                      struct buffer_head *bh, ext4_fsblk_t block,
                      unsigned long count, int flags)
{
    struct buffer_head *bitmap_bh = NULL;
    struct super_block *sb = inode->i_sb;
    struct ext4_group_desc *gdp;
    unsigned int overflow;
    ext4_grpblk_t bit;
    struct buffer_head *gd_bh;
    ext4_group_t block_group;
    struct ext4_sb_info *sbi;
    struct ext4_inode_info *ei = EXT4_I(inode);
    struct ext4_buddy e4b;
    unsigned int count_clusters;
    int err = 0;
    int ret;

    might_sleep();
    if (bh) {
        if (block)
            BUG_ON(block != bh->b_blocknr);
        else
            block = bh->b_blocknr;
    }

    sbi = EXT4_SB(sb);
    if (!(flags & EXT4_FREE_BLOCKS_VALIDATED) &&
        !ext4_data_block_valid(sbi, block, count)) {
        ext4_error(sb, "Freeing blocks not in datazone - "
                   "block = %llu, count = %lu", block, count);
        goto error_return;
    }

    ext4_debug("freeing block %llu\n", block);
    trace_ext4_free_blocks(inode, block, count, flags);

    if (flags & EXT4_FREE_BLOCKS_FORGET) {
        struct buffer_head *tbh = bh;
        int i;

        BUG_ON(bh && (count > 1));

        for (i = 0; i < count; i++) {
            cond_resched();
            if (!bh)
                tbh = sb_find_get_block(inode->i_sb,
                                        block + i);
            if (!tbh)
                continue;
            ext4_forget(handle, flags & EXT4_FREE_BLOCKS_METADATA,
                        inode, tbh, block + i);
        }
    }

    /*
     * We need to make sure we don't reuse the freed block until
     * after the transaction is committed, which we can do by
     * treating the block as metadata, below. We make an
     * exception if the inode is to be written in writeback mode
     * since writeback mode has weak data consistency guarantees.
     */
    if (!ext4_should_writeback_data(inode))
        flags |= EXT4_FREE_BLOCKS_METADATA;

    /*
     * If the extent to be freed does not begin on a cluster
     * boundary, we need to deal with partial clusters at the
     * beginning and end of the extent. Normally we will free
     * blocks at the beginning or the end unless we are explicitly
     * requested to avoid doing so.
     */
    overflow = EXT4_PBLK_COFF(sbi, block);
    if (overflow) {
        if (flags & EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER) {
            overflow = sbi->s_cluster_ratio - overflow;
            block += overflow;
            if (count > overflow)
                count -= overflow;
            else
                return;
        } else {
            block -= overflow;
            count += overflow;
        }
    }
    overflow = EXT4_LBLK_COFF(sbi, count);
    if (overflow) {
        if (flags & EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER) {
            if (count > overflow)
                count -= overflow;
            else
                return;
        } else
            count += sbi->s_cluster_ratio - overflow;
    }

do_more:
    overflow = 0;
    ext4_get_group_no_and_offset(sb, block, &block_group, &bit);

    if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(
                 ext4_get_group_info(sb, block_group))))
        return;

    /*
     * Check to see if we are freeing blocks across a group
     * boundary.
     */
    if (EXT4_C2B(sbi, bit) + count > EXT4_BLOCKS_PER_GROUP(sb)) {
        overflow = EXT4_C2B(sbi, bit) + count -
                   EXT4_BLOCKS_PER_GROUP(sb);
        count -= overflow;
    }
    count_clusters = EXT4_NUM_B2C(sbi, count);
    bitmap_bh = ext4_read_block_bitmap(sb, block_group);
    if (!bitmap_bh) {
        err = -EIO;
        goto error_return;
    }
    gdp = ext4_get_group_desc(sb, block_group, &gd_bh);
    if (!gdp) {
        err = -EIO;
        goto error_return;
    }

    if (in_range(ext4_block_bitmap(sb, gdp), block, count) ||
        in_range(ext4_inode_bitmap(sb, gdp), block, count) ||
        in_range(block, ext4_inode_table(sb, gdp),
                 EXT4_SB(sb)->s_itb_per_group) ||
        in_range(block + count - 1, ext4_inode_table(sb, gdp),
                 EXT4_SB(sb)->s_itb_per_group)) {

        ext4_error(sb, "Freeing blocks in system zone - "
                   "Block = %llu, count = %lu", block, count);
        /* err = 0. ext4_std_error should be a no op */
        goto error_return;
    }

    BUFFER_TRACE(bitmap_bh, "getting write access");
    err = ext4_journal_get_write_access(handle, bitmap_bh);
    if (err)
        goto error_return;

    /*
     * We are about to modify some metadata. Call the journal APIs
     * to unshare ->b_data if a currently-committing transaction is
     * using it
     */
    BUFFER_TRACE(gd_bh, "get_write_access");
    err = ext4_journal_get_write_access(handle, gd_bh);
    if (err)
        goto error_return;
#ifdef AGGRESSIVE_CHECK
    {
        int i;
        for (i = 0; i < count_clusters; i++)
            BUG_ON(!mb_test_bit(bit + i, bitmap_bh->b_data));
    }
#endif
    trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);

    err = ext4_mb_load_buddy(sb, block_group, &e4b);
    if (err)
        goto error_return;

    if ((flags & EXT4_FREE_BLOCKS_METADATA) && ext4_handle_valid(handle)) {
        struct ext4_free_data *new_entry;
        /*
         * blocks being freed are metadata. these blocks shouldn't
         * be used until this transaction is committed
         */
    retry:
        new_entry = kmem_cache_alloc(ext4_free_data_cachep, GFP_NOFS);
        if (!new_entry) {
            /*
             * We use a retry loop because
             * ext4_free_blocks() is not allowed to fail.
             */
            cond_resched();
            congestion_wait(BLK_RW_ASYNC, HZ/50);
            goto retry;
        }
        new_entry->efd_start_cluster = bit;
        new_entry->efd_group = block_group;
        new_entry->efd_count = count_clusters;
        new_entry->efd_tid = handle->h_transaction->t_tid;

        ext4_lock_group(sb, block_group);
        mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
        ext4_mb_free_metadata(handle, &e4b, new_entry);
    } else {
        /* need to update group_info->bb_free and bitmap
         * with group lock held. generate_buddy look at
         * them with group lock_held
         */
        if (test_opt(sb, DISCARD)) {
            err = ext4_issue_discard(sb, block_group, bit, count);
            if (err && err != -EOPNOTSUPP)
                ext4_msg(sb, KERN_WARNING, "discard request in"
                        " group:%d block:%d count:%lu failed"
                        " with %d", block_group, bit, count,
                        err);
        } else
            EXT4_MB_GRP_CLEAR_TRIMMED(e4b.bd_info);

        ext4_lock_group(sb, block_group);
        mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
        mb_free_blocks(inode, &e4b, bit, count_clusters);
    }

    ret = ext4_free_group_clusters(sb, gdp) + count_clusters;
    ext4_free_group_clusters_set(sb, gdp, ret);
    ext4_block_bitmap_csum_set(sb, block_group, gdp, bitmap_bh);
    ext4_group_desc_csum_set(sb, block_group, gdp);
    ext4_unlock_group(sb, block_group);

    if (sbi->s_log_groups_per_flex) {
        ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
        atomic64_add(count_clusters,
                     &sbi->s_flex_groups[flex_group].free_clusters);
    }

    if (flags & EXT4_FREE_BLOCKS_RESERVE && ei->i_reserved_data_blocks) {
        percpu_counter_add(&sbi->s_dirtyclusters_counter,
                           count_clusters);
        spin_lock(&ei->i_block_reservation_lock);
        if (flags & EXT4_FREE_BLOCKS_METADATA)
            ei->i_reserved_meta_blocks += count_clusters;
        else
            ei->i_reserved_data_blocks += count_clusters;
        spin_unlock(&ei->i_block_reservation_lock);
        if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE))
            dquot_reclaim_block(inode,
                                EXT4_C2B(sbi, count_clusters));
    } else if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE))
        dquot_free_block(inode, EXT4_C2B(sbi, count_clusters));
    percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters);

    ext4_mb_unload_buddy(&e4b);

    /* We dirtied the bitmap block */
    BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
4866 | err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh); | 4868 | err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh); |
4867 | 4869 | ||
4868 | /* And the group descriptor block */ | 4870 | /* And the group descriptor block */ |
4869 | BUFFER_TRACE(gd_bh, "dirtied group descriptor block"); | 4871 | BUFFER_TRACE(gd_bh, "dirtied group descriptor block"); |
4870 | ret = ext4_handle_dirty_metadata(handle, NULL, gd_bh); | 4872 | ret = ext4_handle_dirty_metadata(handle, NULL, gd_bh); |
4871 | if (!err) | 4873 | if (!err) |
4872 | err = ret; | 4874 | err = ret; |
4873 | 4875 | ||
4874 | if (overflow && !err) { | 4876 | if (overflow && !err) { |
4875 | block += count; | 4877 | block += count; |
4876 | count = overflow; | 4878 | count = overflow; |
4877 | put_bh(bitmap_bh); | 4879 | put_bh(bitmap_bh); |
4878 | goto do_more; | 4880 | goto do_more; |
4879 | } | 4881 | } |
4880 | error_return: | 4882 | error_return: |
4881 | brelse(bitmap_bh); | 4883 | brelse(bitmap_bh); |
4882 | ext4_std_error(sb, err); | 4884 | ext4_std_error(sb, err); |
4883 | return; | 4885 | return; |
4884 | } | 4886 | } |
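A small aside on the allocation retry in the metadata branch above: as its comment says, ext4_free_blocks() is not allowed to fail, so when kmem_cache_alloc() returns NULL the code backs off with cond_resched() plus congestion_wait() and simply tries again instead of propagating -ENOMEM. A minimal stand-alone sketch of that pattern, where the cache and the entry type are hypothetical placeholders rather than anything from mballoc.c:

    #include <linux/slab.h>
    #include <linux/sched.h>
    #include <linux/backing-dev.h>

    struct my_entry { int dummy; };	/* placeholder type for the sketch */

    /* Allocate from a slab cache in a context that must not fail. */
    static struct my_entry *alloc_entry_nofail(struct kmem_cache *cachep)
    {
    	struct my_entry *entry;
    retry:
    	entry = kmem_cache_alloc(cachep, GFP_NOFS);
    	if (!entry) {
    		/* back off briefly, then retry rather than return -ENOMEM */
    		cond_resched();
    		congestion_wait(BLK_RW_ASYNC, HZ/50);
    		goto retry;
    	}
    	return entry;
    }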
4885 | 4887 | ||
4886 | /** | 4888 | /** |
4887 | * ext4_group_add_blocks() -- Add given blocks to an existing group | 4889 | * ext4_group_add_blocks() -- Add given blocks to an existing group |
4888 | * @handle: handle to this transaction | 4890 | * @handle: handle to this transaction |
4889 | * @sb: super block | 4891 | * @sb: super block |
4890 | * @block: start physical block to add to the block group | 4892 | * @block: start physical block to add to the block group |
4891 | * @count: number of blocks to free | 4893 | * @count: number of blocks to free |
4892 | * | 4894 | * |
4893 | * This marks the blocks as free in the bitmap and buddy. | 4895 | * This marks the blocks as free in the bitmap and buddy. |
4894 | */ | 4896 | */ |
4895 | int ext4_group_add_blocks(handle_t *handle, struct super_block *sb, | 4897 | int ext4_group_add_blocks(handle_t *handle, struct super_block *sb, |
4896 | ext4_fsblk_t block, unsigned long count) | 4898 | ext4_fsblk_t block, unsigned long count) |
4897 | { | 4899 | { |
4898 | struct buffer_head *bitmap_bh = NULL; | 4900 | struct buffer_head *bitmap_bh = NULL; |
4899 | struct buffer_head *gd_bh; | 4901 | struct buffer_head *gd_bh; |
4900 | ext4_group_t block_group; | 4902 | ext4_group_t block_group; |
4901 | ext4_grpblk_t bit; | 4903 | ext4_grpblk_t bit; |
4902 | unsigned int i; | 4904 | unsigned int i; |
4903 | struct ext4_group_desc *desc; | 4905 | struct ext4_group_desc *desc; |
4904 | struct ext4_sb_info *sbi = EXT4_SB(sb); | 4906 | struct ext4_sb_info *sbi = EXT4_SB(sb); |
4905 | struct ext4_buddy e4b; | 4907 | struct ext4_buddy e4b; |
4906 | int err = 0, ret, blk_free_count; | 4908 | int err = 0, ret, blk_free_count; |
4907 | ext4_grpblk_t blocks_freed; | 4909 | ext4_grpblk_t blocks_freed; |
4908 | 4910 | ||
4909 | ext4_debug("Adding block(s) %llu-%llu\n", block, block + count - 1); | 4911 | ext4_debug("Adding block(s) %llu-%llu\n", block, block + count - 1); |
4910 | 4912 | ||
4911 | if (count == 0) | 4913 | if (count == 0) |
4912 | return 0; | 4914 | return 0; |
4913 | 4915 | ||
4914 | ext4_get_group_no_and_offset(sb, block, &block_group, &bit); | 4916 | ext4_get_group_no_and_offset(sb, block, &block_group, &bit); |
4915 | /* | 4917 | /* |
4916 | * Check to see if we are freeing blocks across a group | 4918 | * Check to see if we are freeing blocks across a group |
4917 | * boundary. | 4919 | * boundary. |
4918 | */ | 4920 | */ |
4919 | if (bit + count > EXT4_BLOCKS_PER_GROUP(sb)) { | 4921 | if (bit + count > EXT4_BLOCKS_PER_GROUP(sb)) { |
4920 | ext4_warning(sb, "too many blocks added to group %u\n", | 4922 | ext4_warning(sb, "too many blocks added to group %u\n", |
4921 | block_group); | 4923 | block_group); |
4922 | err = -EINVAL; | 4924 | err = -EINVAL; |
4923 | goto error_return; | 4925 | goto error_return; |
4924 | } | 4926 | } |
4925 | 4927 | ||
4926 | bitmap_bh = ext4_read_block_bitmap(sb, block_group); | 4928 | bitmap_bh = ext4_read_block_bitmap(sb, block_group); |
4927 | if (!bitmap_bh) { | 4929 | if (!bitmap_bh) { |
4928 | err = -EIO; | 4930 | err = -EIO; |
4929 | goto error_return; | 4931 | goto error_return; |
4930 | } | 4932 | } |
4931 | 4933 | ||
4932 | desc = ext4_get_group_desc(sb, block_group, &gd_bh); | 4934 | desc = ext4_get_group_desc(sb, block_group, &gd_bh); |
4933 | if (!desc) { | 4935 | if (!desc) { |
4934 | err = -EIO; | 4936 | err = -EIO; |
4935 | goto error_return; | 4937 | goto error_return; |
4936 | } | 4938 | } |
4937 | 4939 | ||
4938 | if (in_range(ext4_block_bitmap(sb, desc), block, count) || | 4940 | if (in_range(ext4_block_bitmap(sb, desc), block, count) || |
4939 | in_range(ext4_inode_bitmap(sb, desc), block, count) || | 4941 | in_range(ext4_inode_bitmap(sb, desc), block, count) || |
4940 | in_range(block, ext4_inode_table(sb, desc), sbi->s_itb_per_group) || | 4942 | in_range(block, ext4_inode_table(sb, desc), sbi->s_itb_per_group) || |
4941 | in_range(block + count - 1, ext4_inode_table(sb, desc), | 4943 | in_range(block + count - 1, ext4_inode_table(sb, desc), |
4942 | sbi->s_itb_per_group)) { | 4944 | sbi->s_itb_per_group)) { |
4943 | ext4_error(sb, "Adding blocks in system zones - " | 4945 | ext4_error(sb, "Adding blocks in system zones - " |
4944 | "Block = %llu, count = %lu", | 4946 | "Block = %llu, count = %lu", |
4945 | block, count); | 4947 | block, count); |
4946 | err = -EINVAL; | 4948 | err = -EINVAL; |
4947 | goto error_return; | 4949 | goto error_return; |
4948 | } | 4950 | } |
4949 | 4951 | ||
4950 | BUFFER_TRACE(bitmap_bh, "getting write access"); | 4952 | BUFFER_TRACE(bitmap_bh, "getting write access"); |
4951 | err = ext4_journal_get_write_access(handle, bitmap_bh); | 4953 | err = ext4_journal_get_write_access(handle, bitmap_bh); |
4952 | if (err) | 4954 | if (err) |
4953 | goto error_return; | 4955 | goto error_return; |
4954 | 4956 | ||
4955 | /* | 4957 | /* |
4956 | * We are about to modify some metadata. Call the journal APIs | 4958 | * We are about to modify some metadata. Call the journal APIs |
4957 | * to unshare ->b_data if a currently-committing transaction is | 4959 | * to unshare ->b_data if a currently-committing transaction is |
4958 | * using it | 4960 | * using it |
4959 | */ | 4961 | */ |
4960 | BUFFER_TRACE(gd_bh, "get_write_access"); | 4962 | BUFFER_TRACE(gd_bh, "get_write_access"); |
4961 | err = ext4_journal_get_write_access(handle, gd_bh); | 4963 | err = ext4_journal_get_write_access(handle, gd_bh); |
4962 | if (err) | 4964 | if (err) |
4963 | goto error_return; | 4965 | goto error_return; |
4964 | 4966 | ||
4965 | for (i = 0, blocks_freed = 0; i < count; i++) { | 4967 | for (i = 0, blocks_freed = 0; i < count; i++) { |
4966 | BUFFER_TRACE(bitmap_bh, "clear bit"); | 4968 | BUFFER_TRACE(bitmap_bh, "clear bit"); |
4967 | if (!mb_test_bit(bit + i, bitmap_bh->b_data)) { | 4969 | if (!mb_test_bit(bit + i, bitmap_bh->b_data)) { |
4968 | ext4_error(sb, "bit already cleared for block %llu", | 4970 | ext4_error(sb, "bit already cleared for block %llu", |
4969 | (ext4_fsblk_t)(block + i)); | 4971 | (ext4_fsblk_t)(block + i)); |
4970 | BUFFER_TRACE(bitmap_bh, "bit already cleared"); | 4972 | BUFFER_TRACE(bitmap_bh, "bit already cleared"); |
4971 | } else { | 4973 | } else { |
4972 | blocks_freed++; | 4974 | blocks_freed++; |
4973 | } | 4975 | } |
4974 | } | 4976 | } |
4975 | 4977 | ||
4976 | err = ext4_mb_load_buddy(sb, block_group, &e4b); | 4978 | err = ext4_mb_load_buddy(sb, block_group, &e4b); |
4977 | if (err) | 4979 | if (err) |
4978 | goto error_return; | 4980 | goto error_return; |
4979 | 4981 | ||
4980 | /* | 4982 | /* |
4981 | * need to update group_info->bb_free and the bitmap | 4983 | * need to update group_info->bb_free and the bitmap |
4982 | * with the group lock held; generate_buddy() looks at | 4984 | * with the group lock held; generate_buddy() looks at |
4983 | * them with the group lock held | 4985 | * them with the group lock held |
4984 | */ | 4986 | */ |
4985 | ext4_lock_group(sb, block_group); | 4987 | ext4_lock_group(sb, block_group); |
4986 | mb_clear_bits(bitmap_bh->b_data, bit, count); | 4988 | mb_clear_bits(bitmap_bh->b_data, bit, count); |
4987 | mb_free_blocks(NULL, &e4b, bit, count); | 4989 | mb_free_blocks(NULL, &e4b, bit, count); |
4988 | blk_free_count = blocks_freed + ext4_free_group_clusters(sb, desc); | 4990 | blk_free_count = blocks_freed + ext4_free_group_clusters(sb, desc); |
4989 | ext4_free_group_clusters_set(sb, desc, blk_free_count); | 4991 | ext4_free_group_clusters_set(sb, desc, blk_free_count); |
4990 | ext4_block_bitmap_csum_set(sb, block_group, desc, bitmap_bh); | 4992 | ext4_block_bitmap_csum_set(sb, block_group, desc, bitmap_bh); |
4991 | ext4_group_desc_csum_set(sb, block_group, desc); | 4993 | ext4_group_desc_csum_set(sb, block_group, desc); |
4992 | ext4_unlock_group(sb, block_group); | 4994 | ext4_unlock_group(sb, block_group); |
4993 | percpu_counter_add(&sbi->s_freeclusters_counter, | 4995 | percpu_counter_add(&sbi->s_freeclusters_counter, |
4994 | EXT4_NUM_B2C(sbi, blocks_freed)); | 4996 | EXT4_NUM_B2C(sbi, blocks_freed)); |
4995 | 4997 | ||
4996 | if (sbi->s_log_groups_per_flex) { | 4998 | if (sbi->s_log_groups_per_flex) { |
4997 | ext4_group_t flex_group = ext4_flex_group(sbi, block_group); | 4999 | ext4_group_t flex_group = ext4_flex_group(sbi, block_group); |
4998 | atomic64_add(EXT4_NUM_B2C(sbi, blocks_freed), | 5000 | atomic64_add(EXT4_NUM_B2C(sbi, blocks_freed), |
4999 | &sbi->s_flex_groups[flex_group].free_clusters); | 5001 | &sbi->s_flex_groups[flex_group].free_clusters); |
5000 | } | 5002 | } |
5001 | 5003 | ||
5002 | ext4_mb_unload_buddy(&e4b); | 5004 | ext4_mb_unload_buddy(&e4b); |
5003 | 5005 | ||
5004 | /* We dirtied the bitmap block */ | 5006 | /* We dirtied the bitmap block */ |
5005 | BUFFER_TRACE(bitmap_bh, "dirtied bitmap block"); | 5007 | BUFFER_TRACE(bitmap_bh, "dirtied bitmap block"); |
5006 | err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh); | 5008 | err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh); |
5007 | 5009 | ||
5008 | /* And the group descriptor block */ | 5010 | /* And the group descriptor block */ |
5009 | BUFFER_TRACE(gd_bh, "dirtied group descriptor block"); | 5011 | BUFFER_TRACE(gd_bh, "dirtied group descriptor block"); |
5010 | ret = ext4_handle_dirty_metadata(handle, NULL, gd_bh); | 5012 | ret = ext4_handle_dirty_metadata(handle, NULL, gd_bh); |
5011 | if (!err) | 5013 | if (!err) |
5012 | err = ret; | 5014 | err = ret; |
5013 | 5015 | ||
5014 | error_return: | 5016 | error_return: |
5015 | brelse(bitmap_bh); | 5017 | brelse(bitmap_bh); |
5016 | ext4_std_error(sb, err); | 5018 | ext4_std_error(sb, err); |
5017 | return err; | 5019 | return err; |
5018 | } | 5020 | } |
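ext4_group_add_blocks() above is the interface used when a range of blocks has to be handed to a group as new free space (the online-resize path in mainline). A purely illustrative call site, assuming a journal handle has already been started; new_blk and nr are made-up names:

    /* Illustrative only: handle, sb, new_blk and nr are assumed to exist. */
    int err = ext4_group_add_blocks(handle, sb, new_blk, nr);
    if (err)
    	ext4_warning(sb, "adding %lu blocks at %llu failed: %d",
    		     nr, new_blk, err);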
5019 | 5021 | ||
5020 | /** | 5022 | /** |
5021 | * ext4_trim_extent -- function to TRIM a single free extent in the group | 5023 | * ext4_trim_extent -- function to TRIM a single free extent in the group |
5022 | * @sb: super block for the file system | 5024 | * @sb: super block for the file system |
5023 | * @start: starting block of the free extent in the alloc. group | 5025 | * @start: starting block of the free extent in the alloc. group |
5024 | * @count: number of blocks to TRIM | 5026 | * @count: number of blocks to TRIM |
5025 | * @group: alloc. group we are working with | 5027 | * @group: alloc. group we are working with |
5026 | * @e4b: ext4 buddy for the group | 5028 | * @e4b: ext4 buddy for the group |
5027 | * | 5029 | * |
5028 | * Trim "count" blocks starting at "start" in the "group". To ensure that no | 5030 | * Trim "count" blocks starting at "start" in the "group". To ensure that no |
5029 | * one will allocate those blocks, mark them as used in the buddy bitmap. This | 5031 | * one will allocate those blocks, mark them as used in the buddy bitmap. This |
5030 | * must be called under the group lock. | 5032 | * must be called under the group lock. |
5031 | */ | 5033 | */ |
5032 | static int ext4_trim_extent(struct super_block *sb, int start, int count, | 5034 | static int ext4_trim_extent(struct super_block *sb, int start, int count, |
5033 | ext4_group_t group, struct ext4_buddy *e4b) | 5035 | ext4_group_t group, struct ext4_buddy *e4b) |
5034 | { | 5036 | { |
5035 | struct ext4_free_extent ex; | 5037 | struct ext4_free_extent ex; |
5036 | int ret = 0; | 5038 | int ret = 0; |
5037 | 5039 | ||
5038 | trace_ext4_trim_extent(sb, group, start, count); | 5040 | trace_ext4_trim_extent(sb, group, start, count); |
5039 | 5041 | ||
5040 | assert_spin_locked(ext4_group_lock_ptr(sb, group)); | 5042 | assert_spin_locked(ext4_group_lock_ptr(sb, group)); |
5041 | 5043 | ||
5042 | ex.fe_start = start; | 5044 | ex.fe_start = start; |
5043 | ex.fe_group = group; | 5045 | ex.fe_group = group; |
5044 | ex.fe_len = count; | 5046 | ex.fe_len = count; |
5045 | 5047 | ||
5046 | /* | 5048 | /* |
5047 | * Mark blocks used, so no one can reuse them while | 5049 | * Mark blocks used, so no one can reuse them while |
5048 | * being trimmed. | 5050 | * being trimmed. |
5049 | */ | 5051 | */ |
5050 | mb_mark_used(e4b, &ex); | 5052 | mb_mark_used(e4b, &ex); |
5051 | ext4_unlock_group(sb, group); | 5053 | ext4_unlock_group(sb, group); |
5052 | ret = ext4_issue_discard(sb, group, start, count); | 5054 | ret = ext4_issue_discard(sb, group, start, count); |
5053 | ext4_lock_group(sb, group); | 5055 | ext4_lock_group(sb, group); |
5054 | mb_free_blocks(NULL, e4b, start, ex.fe_len); | 5056 | mb_free_blocks(NULL, e4b, start, ex.fe_len); |
5055 | return ret; | 5057 | return ret; |
5056 | } | 5058 | } |
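The shape of ext4_trim_extent() is worth calling out: the extent is marked busy while the group lock is held, the lock is dropped for the blocking discard, and it is retaken before the extent is freed again in the buddy bitmap. A generic sketch of that lock-drop-around-I/O pattern, with reserve_range(), do_blocking_io() and release_range() being hypothetical helpers:

    #include <linux/spinlock.h>

    /* Called with *lock held; drops it only around the sleeping I/O. */
    static int trim_range(spinlock_t *lock, unsigned long start, unsigned long len)
    {
    	int ret;

    	reserve_range(start, len);	/* nothing may reuse it while we sleep */
    	spin_unlock(lock);
    	ret = do_blocking_io(start, len);	/* e.g. a discard request */
    	spin_lock(lock);
    	release_range(start, len);	/* make it free again */
    	return ret;
    }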
5057 | 5059 | ||
5058 | /** | 5060 | /** |
5059 | * ext4_trim_all_free -- function to trim all free space in alloc. group | 5061 | * ext4_trim_all_free -- function to trim all free space in alloc. group |
5060 | * @sb: super block for file system | 5062 | * @sb: super block for file system |
5061 | * @group: group to be trimmed | 5063 | * @group: group to be trimmed |
5062 | * @start: first group block to examine | 5064 | * @start: first group block to examine |
5063 | * @max: last group block to examine | 5065 | * @max: last group block to examine |
5064 | * @minblocks: minimum extent block count | 5066 | * @minblocks: minimum extent block count |
5065 | * | 5067 | * |
5066 | * ext4_trim_all_free walks through the group's buddy bitmap searching for free | 5068 | * ext4_trim_all_free walks through the group's buddy bitmap searching for free |
5067 | * extents. When a free extent is found, ext4_trim_extent is called to TRIM | 5069 | * extents. When a free extent is found, ext4_trim_extent is called to TRIM |
5068 | * the extent. | 5070 | * the extent. |
5069 | * | 5071 | * |
5070 | * | 5072 | * |
5071 | * In other words: for each free extent found in the group's block bitmap, the | 5073 | * In other words: for each free extent found in the group's block bitmap, the |
5072 | * extent is first marked as used in the group buddy bitmap, a TRIM command is | 5074 | * extent is first marked as used in the group buddy bitmap, a TRIM command is |
5073 | * then issued on it, and finally it is freed again in the buddy bitmap. This | 5075 | * then issued on it, and finally it is freed again in the buddy bitmap. This |
5074 | * is done until the whole group is scanned. | 5076 | * is done until the whole group is scanned. |
5075 | */ | 5077 | */ |
5076 | static ext4_grpblk_t | 5078 | static ext4_grpblk_t |
5077 | ext4_trim_all_free(struct super_block *sb, ext4_group_t group, | 5079 | ext4_trim_all_free(struct super_block *sb, ext4_group_t group, |
5078 | ext4_grpblk_t start, ext4_grpblk_t max, | 5080 | ext4_grpblk_t start, ext4_grpblk_t max, |
5079 | ext4_grpblk_t minblocks) | 5081 | ext4_grpblk_t minblocks) |
5080 | { | 5082 | { |
5081 | void *bitmap; | 5083 | void *bitmap; |
5082 | ext4_grpblk_t next, count = 0, free_count = 0; | 5084 | ext4_grpblk_t next, count = 0, free_count = 0; |
5083 | struct ext4_buddy e4b; | 5085 | struct ext4_buddy e4b; |
5084 | int ret = 0; | 5086 | int ret = 0; |
5085 | 5087 | ||
5086 | trace_ext4_trim_all_free(sb, group, start, max); | 5088 | trace_ext4_trim_all_free(sb, group, start, max); |
5087 | 5089 | ||
5088 | ret = ext4_mb_load_buddy(sb, group, &e4b); | 5090 | ret = ext4_mb_load_buddy(sb, group, &e4b); |
5089 | if (ret) { | 5091 | if (ret) { |
5090 | ext4_error(sb, "Error in loading buddy " | 5092 | ext4_error(sb, "Error in loading buddy " |
5091 | "information for %u", group); | 5093 | "information for %u", group); |
5092 | return ret; | 5094 | return ret; |
5093 | } | 5095 | } |
5094 | bitmap = e4b.bd_bitmap; | 5096 | bitmap = e4b.bd_bitmap; |
5095 | 5097 | ||
5096 | ext4_lock_group(sb, group); | 5098 | ext4_lock_group(sb, group); |
5097 | if (EXT4_MB_GRP_WAS_TRIMMED(e4b.bd_info) && | 5099 | if (EXT4_MB_GRP_WAS_TRIMMED(e4b.bd_info) && |
5098 | minblocks >= atomic_read(&EXT4_SB(sb)->s_last_trim_minblks)) | 5100 | minblocks >= atomic_read(&EXT4_SB(sb)->s_last_trim_minblks)) |
5099 | goto out; | 5101 | goto out; |
5100 | 5102 | ||
5101 | start = (e4b.bd_info->bb_first_free > start) ? | 5103 | start = (e4b.bd_info->bb_first_free > start) ? |
5102 | e4b.bd_info->bb_first_free : start; | 5104 | e4b.bd_info->bb_first_free : start; |
5103 | 5105 | ||
5104 | while (start <= max) { | 5106 | while (start <= max) { |
5105 | start = mb_find_next_zero_bit(bitmap, max + 1, start); | 5107 | start = mb_find_next_zero_bit(bitmap, max + 1, start); |
5106 | if (start > max) | 5108 | if (start > max) |
5107 | break; | 5109 | break; |
5108 | next = mb_find_next_bit(bitmap, max + 1, start); | 5110 | next = mb_find_next_bit(bitmap, max + 1, start); |
5109 | 5111 | ||
5110 | if ((next - start) >= minblocks) { | 5112 | if ((next - start) >= minblocks) { |
5111 | ret = ext4_trim_extent(sb, start, | 5113 | ret = ext4_trim_extent(sb, start, |
5112 | next - start, group, &e4b); | 5114 | next - start, group, &e4b); |
5113 | if (ret && ret != -EOPNOTSUPP) | 5115 | if (ret && ret != -EOPNOTSUPP) |
5114 | break; | 5116 | break; |
5115 | ret = 0; | 5117 | ret = 0; |
5116 | count += next - start; | 5118 | count += next - start; |
5117 | } | 5119 | } |
5118 | free_count += next - start; | 5120 | free_count += next - start; |
5119 | start = next + 1; | 5121 | start = next + 1; |
5120 | 5122 | ||
5121 | if (fatal_signal_pending(current)) { | 5123 | if (fatal_signal_pending(current)) { |
5122 | count = -ERESTARTSYS; | 5124 | count = -ERESTARTSYS; |
5123 | break; | 5125 | break; |
5124 | } | 5126 | } |
5125 | 5127 | ||
5126 | if (need_resched()) { | 5128 | if (need_resched()) { |
5127 | ext4_unlock_group(sb, group); | 5129 | ext4_unlock_group(sb, group); |
5128 | cond_resched(); | 5130 | cond_resched(); |
5129 | ext4_lock_group(sb, group); | 5131 | ext4_lock_group(sb, group); |
5130 | } | 5132 | } |
5131 | 5133 | ||
5132 | if ((e4b.bd_info->bb_free - free_count) < minblocks) | 5134 | if ((e4b.bd_info->bb_free - free_count) < minblocks) |
5133 | break; | 5135 | break; |
5134 | } | 5136 | } |
5135 | 5137 | ||
5136 | if (!ret) { | 5138 | if (!ret) { |
5137 | ret = count; | 5139 | ret = count; |
5138 | EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info); | 5140 | EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info); |
5139 | } | 5141 | } |
5140 | out: | 5142 | out: |
5141 | ext4_unlock_group(sb, group); | 5143 | ext4_unlock_group(sb, group); |
5142 | ext4_mb_unload_buddy(&e4b); | 5144 | ext4_mb_unload_buddy(&e4b); |
5143 | 5145 | ||
5144 | ext4_debug("trimmed %d blocks in the group %d\n", | 5146 | ext4_debug("trimmed %d blocks in the group %d\n", |
5145 | count, group); | 5147 | count, group); |
5146 | 5148 | ||
5147 | return ret; | 5149 | return ret; |
5148 | } | 5150 | } |
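The scan loop in ext4_trim_all_free() is the usual "walk every free run in a bitmap" idiom: find the next zero bit, find the next set bit after it, and the gap between the two is one candidate extent. A self-contained sketch of the same idiom using the generic bitmap helpers, with the callback being a placeholder:

    #include <linux/bitops.h>

    /* Visit every run of zero bits in [0, size) that is at least minlen long. */
    static void for_each_free_run(const unsigned long *bitmap, unsigned long size,
    			      unsigned long minlen,
    			      void (*visit)(unsigned long start, unsigned long len))
    {
    	unsigned long start = 0, next;

    	while (start < size) {
    		start = find_next_zero_bit(bitmap, size, start);
    		if (start >= size)
    			break;
    		next = find_next_bit(bitmap, size, start);
    		if (next - start >= minlen)
    			visit(start, next - start);
    		start = next + 1;
    	}
    }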
5149 | 5151 | ||
5150 | /** | 5152 | /** |
5151 | * ext4_trim_fs() -- trim ioctl handler function | 5153 | * ext4_trim_fs() -- trim ioctl handler function |
5152 | * @sb: superblock for filesystem | 5154 | * @sb: superblock for filesystem |
5153 | * @range: fstrim_range structure | 5155 | * @range: fstrim_range structure |
5154 | * | 5156 | * |
5155 | * start: first byte to trim | 5157 | * start: first byte to trim |
5156 | * len: number of bytes to trim from start | 5158 | * len: number of bytes to trim from start |
5157 | * minlen: minimum extent length in bytes | 5159 | * minlen: minimum extent length in bytes |
5158 | * ext4_trim_fs goes through all allocation groups containing bytes from | 5160 | * ext4_trim_fs goes through all allocation groups containing bytes from |
5159 | * start to start+len. For each such group the ext4_trim_all_free function | 5161 | * start to start+len. For each such group the ext4_trim_all_free function |
5160 | * is invoked to trim all free space. | 5162 | * is invoked to trim all free space. |
5161 | */ | 5163 | */ |
5162 | int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range) | 5164 | int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range) |
5163 | { | 5165 | { |
5164 | struct ext4_group_info *grp; | 5166 | struct ext4_group_info *grp; |
5165 | ext4_group_t group, first_group, last_group; | 5167 | ext4_group_t group, first_group, last_group; |
5166 | ext4_grpblk_t cnt = 0, first_cluster, last_cluster; | 5168 | ext4_grpblk_t cnt = 0, first_cluster, last_cluster; |
5167 | uint64_t start, end, minlen, trimmed = 0; | 5169 | uint64_t start, end, minlen, trimmed = 0; |
5168 | ext4_fsblk_t first_data_blk = | 5170 | ext4_fsblk_t first_data_blk = |
5169 | le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block); | 5171 | le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block); |
5170 | ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es); | 5172 | ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es); |
5171 | int ret = 0; | 5173 | int ret = 0; |
5172 | 5174 | ||
5173 | start = range->start >> sb->s_blocksize_bits; | 5175 | start = range->start >> sb->s_blocksize_bits; |
5174 | end = start + (range->len >> sb->s_blocksize_bits) - 1; | 5176 | end = start + (range->len >> sb->s_blocksize_bits) - 1; |
5175 | minlen = EXT4_NUM_B2C(EXT4_SB(sb), | 5177 | minlen = EXT4_NUM_B2C(EXT4_SB(sb), |
5176 | range->minlen >> sb->s_blocksize_bits); | 5178 | range->minlen >> sb->s_blocksize_bits); |
5177 | 5179 | ||
5178 | if (minlen > EXT4_CLUSTERS_PER_GROUP(sb) || | 5180 | if (minlen > EXT4_CLUSTERS_PER_GROUP(sb) || |
5179 | start >= max_blks || | 5181 | start >= max_blks || |
5180 | range->len < sb->s_blocksize) | 5182 | range->len < sb->s_blocksize) |
5181 | return -EINVAL; | 5183 | return -EINVAL; |
5182 | if (end >= max_blks) | 5184 | if (end >= max_blks) |
5183 | end = max_blks - 1; | 5185 | end = max_blks - 1; |
5184 | if (end <= first_data_blk) | 5186 | if (end <= first_data_blk) |
5185 | goto out; | 5187 | goto out; |
5186 | if (start < first_data_blk) | 5188 | if (start < first_data_blk) |
5187 | start = first_data_blk; | 5189 | start = first_data_blk; |
5188 | 5190 | ||
5189 | /* Determine first and last group to examine based on start and end */ | 5191 | /* Determine first and last group to examine based on start and end */ |
5190 | ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) start, | 5192 | ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) start, |
5191 | &first_group, &first_cluster); | 5193 | &first_group, &first_cluster); |
5192 | ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) end, | 5194 | ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) end, |
5193 | &last_group, &last_cluster); | 5195 | &last_group, &last_cluster); |
5194 | 5196 | ||
5195 | /* end now represents the last cluster to discard in this group */ | 5197 | /* end now represents the last cluster to discard in this group */ |
5196 | end = EXT4_CLUSTERS_PER_GROUP(sb) - 1; | 5198 | end = EXT4_CLUSTERS_PER_GROUP(sb) - 1; |
5197 | 5199 | ||
5198 | for (group = first_group; group <= last_group; group++) { | 5200 | for (group = first_group; group <= last_group; group++) { |
5199 | grp = ext4_get_group_info(sb, group); | 5201 | grp = ext4_get_group_info(sb, group); |
5200 | /* We only do this if the grp has never been initialized */ | 5202 | /* We only do this if the grp has never been initialized */ |
5201 | if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { | 5203 | if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { |
5202 | ret = ext4_mb_init_group(sb, group); | 5204 | ret = ext4_mb_init_group(sb, group); |
5203 | if (ret) | 5205 | if (ret) |
5204 | break; | 5206 | break; |
5205 | } | 5207 | } |
5206 | 5208 | ||
5207 | /* | 5209 | /* |
5208 | * For all the groups except the last one, the last cluster will | 5210 | * For all the groups except the last one, the last cluster will |
5209 | * always be EXT4_CLUSTERS_PER_GROUP(sb)-1, so we only need to | 5211 | * always be EXT4_CLUSTERS_PER_GROUP(sb)-1, so we only need to |
5210 | * change it for the last group; note that last_cluster is | 5212 | * change it for the last group; note that last_cluster is |
5211 | * already computed earlier by ext4_get_group_no_and_offset() | 5213 | * already computed earlier by ext4_get_group_no_and_offset() |
5212 | */ | 5214 | */ |
5213 | if (group == last_group) | 5215 | if (group == last_group) |
5214 | end = last_cluster; | 5216 | end = last_cluster; |
5215 | 5217 | ||
5216 | if (grp->bb_free >= minlen) { | 5218 | if (grp->bb_free >= minlen) { |
5217 | cnt = ext4_trim_all_free(sb, group, first_cluster, | 5219 | cnt = ext4_trim_all_free(sb, group, first_cluster, |
5218 | end, minlen); | 5220 | end, minlen); |
5219 | if (cnt < 0) { | 5221 | if (cnt < 0) { |
5220 | ret = cnt; | 5222 | ret = cnt; |
5221 | break; | 5223 | break; |
5222 | } | 5224 | } |
5223 | trimmed += cnt; | 5225 | trimmed += cnt; |
5224 | } | 5226 | } |
5225 | 5227 | ||
5226 | /* | 5228 | /* |
5227 | * For every group except the first one, we are sure | 5229 | * For every group except the first one, we are sure |
5228 | * that the first cluster to discard will be cluster #0. | 5230 | * that the first cluster to discard will be cluster #0. |
5229 | */ | 5231 | */ |
5230 | first_cluster = 0; | 5232 | first_cluster = 0; |
5231 | } | 5233 | } |
5232 | 5234 | ||
5233 | if (!ret) | 5235 | if (!ret) |
5234 | atomic_set(&EXT4_SB(sb)->s_last_trim_minblks, minlen); | 5236 | atomic_set(&EXT4_SB(sb)->s_last_trim_minblks, minlen); |
5235 | 5237 | ||
5236 | out: | 5238 | out: |
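A quick worked example of the byte-to-block conversion at the top of ext4_trim_fs(), assuming a 4 KiB block size (s_blocksize_bits = 12) and an fstrim_range of start = 1 GiB, len = 256 MiB:

    start = 0x40000000 >> 12                = 262144   (first block to examine)
    end   = 262144 + (0x10000000 >> 12) - 1 = 327679   (last block to examine)

minlen is converted the same way and then rounded to clusters before being compared against EXT4_CLUSTERS_PER_GROUP(sb).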
fs/f2fs/checkpoint.c
1 | /* | 1 | /* |
2 | * fs/f2fs/checkpoint.c | 2 | * fs/f2fs/checkpoint.c |
3 | * | 3 | * |
4 | * Copyright (c) 2012 Samsung Electronics Co., Ltd. | 4 | * Copyright (c) 2012 Samsung Electronics Co., Ltd. |
5 | * http://www.samsung.com/ | 5 | * http://www.samsung.com/ |
6 | * | 6 | * |
7 | * This program is free software; you can redistribute it and/or modify | 7 | * This program is free software; you can redistribute it and/or modify |
8 | * it under the terms of the GNU General Public License version 2 as | 8 | * it under the terms of the GNU General Public License version 2 as |
9 | * published by the Free Software Foundation. | 9 | * published by the Free Software Foundation. |
10 | */ | 10 | */ |
11 | #include <linux/fs.h> | 11 | #include <linux/fs.h> |
12 | #include <linux/bio.h> | 12 | #include <linux/bio.h> |
13 | #include <linux/mpage.h> | 13 | #include <linux/mpage.h> |
14 | #include <linux/writeback.h> | 14 | #include <linux/writeback.h> |
15 | #include <linux/blkdev.h> | 15 | #include <linux/blkdev.h> |
16 | #include <linux/f2fs_fs.h> | 16 | #include <linux/f2fs_fs.h> |
17 | #include <linux/pagevec.h> | 17 | #include <linux/pagevec.h> |
18 | #include <linux/swap.h> | 18 | #include <linux/swap.h> |
19 | 19 | ||
20 | #include "f2fs.h" | 20 | #include "f2fs.h" |
21 | #include "node.h" | 21 | #include "node.h" |
22 | #include "segment.h" | 22 | #include "segment.h" |
23 | #include <trace/events/f2fs.h> | 23 | #include <trace/events/f2fs.h> |
24 | 24 | ||
25 | static struct kmem_cache *orphan_entry_slab; | 25 | static struct kmem_cache *orphan_entry_slab; |
26 | static struct kmem_cache *inode_entry_slab; | 26 | static struct kmem_cache *inode_entry_slab; |
27 | 27 | ||
28 | /* | 28 | /* |
29 | * We guarantee no failure on the returned page. | 29 | * We guarantee no failure on the returned page. |
30 | */ | 30 | */ |
31 | struct page *grab_meta_page(struct f2fs_sb_info *sbi, pgoff_t index) | 31 | struct page *grab_meta_page(struct f2fs_sb_info *sbi, pgoff_t index) |
32 | { | 32 | { |
33 | struct address_space *mapping = sbi->meta_inode->i_mapping; | 33 | struct address_space *mapping = sbi->meta_inode->i_mapping; |
34 | struct page *page = NULL; | 34 | struct page *page = NULL; |
35 | repeat: | 35 | repeat: |
36 | page = grab_cache_page(mapping, index); | 36 | page = grab_cache_page(mapping, index); |
37 | if (!page) { | 37 | if (!page) { |
38 | cond_resched(); | 38 | cond_resched(); |
39 | goto repeat; | 39 | goto repeat; |
40 | } | 40 | } |
41 | 41 | ||
42 | /* We wait on writeback only inside grab_meta_page() */ | 42 | /* We wait on writeback only inside grab_meta_page() */ |
43 | wait_on_page_writeback(page); | 43 | wait_on_page_writeback(page); |
44 | SetPageUptodate(page); | 44 | SetPageUptodate(page); |
45 | return page; | 45 | return page; |
46 | } | 46 | } |
47 | 47 | ||
48 | /* | 48 | /* |
49 | * We guarantee no failure on the returned page. | 49 | * We guarantee no failure on the returned page. |
50 | */ | 50 | */ |
51 | struct page *get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index) | 51 | struct page *get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index) |
52 | { | 52 | { |
53 | struct address_space *mapping = sbi->meta_inode->i_mapping; | 53 | struct address_space *mapping = sbi->meta_inode->i_mapping; |
54 | struct page *page; | 54 | struct page *page; |
55 | repeat: | 55 | repeat: |
56 | page = grab_cache_page(mapping, index); | 56 | page = grab_cache_page(mapping, index); |
57 | if (!page) { | 57 | if (!page) { |
58 | cond_resched(); | 58 | cond_resched(); |
59 | goto repeat; | 59 | goto repeat; |
60 | } | 60 | } |
61 | if (PageUptodate(page)) | 61 | if (PageUptodate(page)) |
62 | goto out; | 62 | goto out; |
63 | 63 | ||
64 | if (f2fs_readpage(sbi, page, index, READ_SYNC)) | 64 | if (f2fs_readpage(sbi, page, index, READ_SYNC)) |
65 | goto repeat; | 65 | goto repeat; |
66 | 66 | ||
67 | lock_page(page); | 67 | lock_page(page); |
68 | if (page->mapping != mapping) { | 68 | if (page->mapping != mapping) { |
69 | f2fs_put_page(page, 1); | 69 | f2fs_put_page(page, 1); |
70 | goto repeat; | 70 | goto repeat; |
71 | } | 71 | } |
72 | out: | 72 | out: |
73 | mark_page_accessed(page); | ||
74 | return page; | 73 | return page; |
75 | } | 74 | } |
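Note the one real change in this hunk: the old code called mark_page_accessed() on the meta page just before returning, while the new code omits it and relies on the page-cache allocation path having already set up the accessed state for a freshly created page. A hedged sketch of what that means for a caller; my_read_meta() is illustrative and not part of this patch:

    /* Illustrative only; mirrors the shape of get_meta_page() above. */
    static struct page *my_read_meta(struct address_space *mapping, pgoff_t index)
    {
    	struct page *page = grab_cache_page(mapping, index);

    	if (!page)
    		return NULL;
    	/*
    	 * No explicit mark_page_accessed() here any more: the allocation
    	 * path is expected to have initialised the accessed state when the
    	 * page entered the page cache, so the extra atomic operation on
    	 * this hot path can be skipped.
    	 */
    	return page;
    }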
76 | 75 | ||
77 | static int f2fs_write_meta_page(struct page *page, | 76 | static int f2fs_write_meta_page(struct page *page, |
78 | struct writeback_control *wbc) | 77 | struct writeback_control *wbc) |
79 | { | 78 | { |
80 | struct inode *inode = page->mapping->host; | 79 | struct inode *inode = page->mapping->host; |
81 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); | 80 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); |
82 | 81 | ||
83 | /* Should not write any meta pages if an IO error has occurred */ | 82 | /* Should not write any meta pages if an IO error has occurred */ |
84 | if (wbc->for_reclaim || | 83 | if (wbc->for_reclaim || |
85 | is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ERROR_FLAG)) { | 84 | is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ERROR_FLAG)) { |
86 | dec_page_count(sbi, F2FS_DIRTY_META); | 85 | dec_page_count(sbi, F2FS_DIRTY_META); |
87 | wbc->pages_skipped++; | 86 | wbc->pages_skipped++; |
88 | set_page_dirty(page); | 87 | set_page_dirty(page); |
89 | return AOP_WRITEPAGE_ACTIVATE; | 88 | return AOP_WRITEPAGE_ACTIVATE; |
90 | } | 89 | } |
91 | 90 | ||
92 | wait_on_page_writeback(page); | 91 | wait_on_page_writeback(page); |
93 | 92 | ||
94 | write_meta_page(sbi, page); | 93 | write_meta_page(sbi, page); |
95 | dec_page_count(sbi, F2FS_DIRTY_META); | 94 | dec_page_count(sbi, F2FS_DIRTY_META); |
96 | unlock_page(page); | 95 | unlock_page(page); |
97 | return 0; | 96 | return 0; |
98 | } | 97 | } |
99 | 98 | ||
100 | static int f2fs_write_meta_pages(struct address_space *mapping, | 99 | static int f2fs_write_meta_pages(struct address_space *mapping, |
101 | struct writeback_control *wbc) | 100 | struct writeback_control *wbc) |
102 | { | 101 | { |
103 | struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); | 102 | struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); |
104 | struct block_device *bdev = sbi->sb->s_bdev; | 103 | struct block_device *bdev = sbi->sb->s_bdev; |
105 | long written; | 104 | long written; |
106 | 105 | ||
107 | if (wbc->for_kupdate) | 106 | if (wbc->for_kupdate) |
108 | return 0; | 107 | return 0; |
109 | 108 | ||
110 | if (get_pages(sbi, F2FS_DIRTY_META) == 0) | 109 | if (get_pages(sbi, F2FS_DIRTY_META) == 0) |
111 | return 0; | 110 | return 0; |
112 | 111 | ||
113 | /* if mounting failed, skip writing node pages */ | 112 | /* if mounting failed, skip writing node pages */ |
114 | mutex_lock(&sbi->cp_mutex); | 113 | mutex_lock(&sbi->cp_mutex); |
115 | written = sync_meta_pages(sbi, META, bio_get_nr_vecs(bdev)); | 114 | written = sync_meta_pages(sbi, META, bio_get_nr_vecs(bdev)); |
116 | mutex_unlock(&sbi->cp_mutex); | 115 | mutex_unlock(&sbi->cp_mutex); |
117 | wbc->nr_to_write -= written; | 116 | wbc->nr_to_write -= written; |
118 | return 0; | 117 | return 0; |
119 | } | 118 | } |
120 | 119 | ||
121 | long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type, | 120 | long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type, |
122 | long nr_to_write) | 121 | long nr_to_write) |
123 | { | 122 | { |
124 | struct address_space *mapping = sbi->meta_inode->i_mapping; | 123 | struct address_space *mapping = sbi->meta_inode->i_mapping; |
125 | pgoff_t index = 0, end = LONG_MAX; | 124 | pgoff_t index = 0, end = LONG_MAX; |
126 | struct pagevec pvec; | 125 | struct pagevec pvec; |
127 | long nwritten = 0; | 126 | long nwritten = 0; |
128 | struct writeback_control wbc = { | 127 | struct writeback_control wbc = { |
129 | .for_reclaim = 0, | 128 | .for_reclaim = 0, |
130 | }; | 129 | }; |
131 | 130 | ||
132 | pagevec_init(&pvec, 0); | 131 | pagevec_init(&pvec, 0); |
133 | 132 | ||
134 | while (index <= end) { | 133 | while (index <= end) { |
135 | int i, nr_pages; | 134 | int i, nr_pages; |
136 | nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, | 135 | nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, |
137 | PAGECACHE_TAG_DIRTY, | 136 | PAGECACHE_TAG_DIRTY, |
138 | min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); | 137 | min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); |
139 | if (nr_pages == 0) | 138 | if (nr_pages == 0) |
140 | break; | 139 | break; |
141 | 140 | ||
142 | for (i = 0; i < nr_pages; i++) { | 141 | for (i = 0; i < nr_pages; i++) { |
143 | struct page *page = pvec.pages[i]; | 142 | struct page *page = pvec.pages[i]; |
144 | lock_page(page); | 143 | lock_page(page); |
145 | BUG_ON(page->mapping != mapping); | 144 | BUG_ON(page->mapping != mapping); |
146 | BUG_ON(!PageDirty(page)); | 145 | BUG_ON(!PageDirty(page)); |
147 | clear_page_dirty_for_io(page); | 146 | clear_page_dirty_for_io(page); |
148 | if (f2fs_write_meta_page(page, &wbc)) { | 147 | if (f2fs_write_meta_page(page, &wbc)) { |
149 | unlock_page(page); | 148 | unlock_page(page); |
150 | break; | 149 | break; |
151 | } | 150 | } |
152 | if (nwritten++ >= nr_to_write) | 151 | if (nwritten++ >= nr_to_write) |
153 | break; | 152 | break; |
154 | } | 153 | } |
155 | pagevec_release(&pvec); | 154 | pagevec_release(&pvec); |
156 | cond_resched(); | 155 | cond_resched(); |
157 | } | 156 | } |
158 | 157 | ||
159 | if (nwritten) | 158 | if (nwritten) |
160 | f2fs_submit_bio(sbi, type, nr_to_write == LONG_MAX); | 159 | f2fs_submit_bio(sbi, type, nr_to_write == LONG_MAX); |
161 | 160 | ||
162 | return nwritten; | 161 | return nwritten; |
163 | } | 162 | } |
164 | 163 | ||
165 | static int f2fs_set_meta_page_dirty(struct page *page) | 164 | static int f2fs_set_meta_page_dirty(struct page *page) |
166 | { | 165 | { |
167 | struct address_space *mapping = page->mapping; | 166 | struct address_space *mapping = page->mapping; |
168 | struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); | 167 | struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); |
169 | 168 | ||
170 | SetPageUptodate(page); | 169 | SetPageUptodate(page); |
171 | if (!PageDirty(page)) { | 170 | if (!PageDirty(page)) { |
172 | __set_page_dirty_nobuffers(page); | 171 | __set_page_dirty_nobuffers(page); |
173 | inc_page_count(sbi, F2FS_DIRTY_META); | 172 | inc_page_count(sbi, F2FS_DIRTY_META); |
174 | return 1; | 173 | return 1; |
175 | } | 174 | } |
176 | return 0; | 175 | return 0; |
177 | } | 176 | } |
178 | 177 | ||
179 | const struct address_space_operations f2fs_meta_aops = { | 178 | const struct address_space_operations f2fs_meta_aops = { |
180 | .writepage = f2fs_write_meta_page, | 179 | .writepage = f2fs_write_meta_page, |
181 | .writepages = f2fs_write_meta_pages, | 180 | .writepages = f2fs_write_meta_pages, |
182 | .set_page_dirty = f2fs_set_meta_page_dirty, | 181 | .set_page_dirty = f2fs_set_meta_page_dirty, |
183 | }; | 182 | }; |
184 | 183 | ||
185 | int acquire_orphan_inode(struct f2fs_sb_info *sbi) | 184 | int acquire_orphan_inode(struct f2fs_sb_info *sbi) |
186 | { | 185 | { |
187 | unsigned int max_orphans; | 186 | unsigned int max_orphans; |
188 | int err = 0; | 187 | int err = 0; |
189 | 188 | ||
190 | /* | 189 | /* |
191 | * considering 512 blocks in a segment, 5 blocks are needed for the cp | 190 | * considering 512 blocks in a segment, 5 blocks are needed for the cp |
192 | * and log segment summaries. The remaining blocks are used to keep | 191 | * and log segment summaries. The remaining blocks are used to keep |
193 | * orphan entries; with the limitation of one reserved segment | 192 | * orphan entries; with the limitation of one reserved segment |
194 | * for the cp pack we can have at most 1020*507 orphan entries | 193 | * for the cp pack we can have at most 1020*507 orphan entries |
195 | */ | 194 | */ |
196 | max_orphans = (sbi->blocks_per_seg - 5) * F2FS_ORPHANS_PER_BLOCK; | 195 | max_orphans = (sbi->blocks_per_seg - 5) * F2FS_ORPHANS_PER_BLOCK; |
197 | mutex_lock(&sbi->orphan_inode_mutex); | 196 | mutex_lock(&sbi->orphan_inode_mutex); |
198 | if (sbi->n_orphans >= max_orphans) | 197 | if (sbi->n_orphans >= max_orphans) |
199 | err = -ENOSPC; | 198 | err = -ENOSPC; |
200 | else | 199 | else |
201 | sbi->n_orphans++; | 200 | sbi->n_orphans++; |
202 | mutex_unlock(&sbi->orphan_inode_mutex); | 201 | mutex_unlock(&sbi->orphan_inode_mutex); |
203 | return err; | 202 | return err; |
204 | } | 203 | } |
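To make the limit in the comment above concrete, a quick back-of-the-envelope check under its stated assumptions of 512 blocks per segment and 1020 orphan entries per block:

    max_orphans = (512 - 5) * 1020 = 507 * 1020 = 517140 orphan inodes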
205 | 204 | ||
206 | void release_orphan_inode(struct f2fs_sb_info *sbi) | 205 | void release_orphan_inode(struct f2fs_sb_info *sbi) |
207 | { | 206 | { |
208 | mutex_lock(&sbi->orphan_inode_mutex); | 207 | mutex_lock(&sbi->orphan_inode_mutex); |
209 | sbi->n_orphans--; | 208 | sbi->n_orphans--; |
210 | mutex_unlock(&sbi->orphan_inode_mutex); | 209 | mutex_unlock(&sbi->orphan_inode_mutex); |
211 | } | 210 | } |
212 | 211 | ||
213 | void add_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) | 212 | void add_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) |
214 | { | 213 | { |
215 | struct list_head *head, *this; | 214 | struct list_head *head, *this; |
216 | struct orphan_inode_entry *new = NULL, *orphan = NULL; | 215 | struct orphan_inode_entry *new = NULL, *orphan = NULL; |
217 | 216 | ||
218 | mutex_lock(&sbi->orphan_inode_mutex); | 217 | mutex_lock(&sbi->orphan_inode_mutex); |
219 | head = &sbi->orphan_inode_list; | 218 | head = &sbi->orphan_inode_list; |
220 | list_for_each(this, head) { | 219 | list_for_each(this, head) { |
221 | orphan = list_entry(this, struct orphan_inode_entry, list); | 220 | orphan = list_entry(this, struct orphan_inode_entry, list); |
222 | if (orphan->ino == ino) | 221 | if (orphan->ino == ino) |
223 | goto out; | 222 | goto out; |
224 | if (orphan->ino > ino) | 223 | if (orphan->ino > ino) |
225 | break; | 224 | break; |
226 | orphan = NULL; | 225 | orphan = NULL; |
227 | } | 226 | } |
228 | retry: | 227 | retry: |
229 | new = kmem_cache_alloc(orphan_entry_slab, GFP_ATOMIC); | 228 | new = kmem_cache_alloc(orphan_entry_slab, GFP_ATOMIC); |
230 | if (!new) { | 229 | if (!new) { |
231 | cond_resched(); | 230 | cond_resched(); |
232 | goto retry; | 231 | goto retry; |
233 | } | 232 | } |
234 | new->ino = ino; | 233 | new->ino = ino; |
235 | 234 | ||
236 | /* add the new entry into the list, which is sorted by inode number */ | 235 | /* add the new entry into the list, which is sorted by inode number */ |
237 | if (orphan) | 236 | if (orphan) |
238 | list_add(&new->list, this->prev); | 237 | list_add(&new->list, this->prev); |
239 | else | 238 | else |
240 | list_add_tail(&new->list, head); | 239 | list_add_tail(&new->list, head); |
241 | out: | 240 | out: |
242 | mutex_unlock(&sbi->orphan_inode_mutex); | 241 | mutex_unlock(&sbi->orphan_inode_mutex); |
243 | } | 242 | } |
244 | 243 | ||
245 | void remove_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) | 244 | void remove_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) |
246 | { | 245 | { |
247 | struct list_head *head; | 246 | struct list_head *head; |
248 | struct orphan_inode_entry *orphan; | 247 | struct orphan_inode_entry *orphan; |
249 | 248 | ||
250 | mutex_lock(&sbi->orphan_inode_mutex); | 249 | mutex_lock(&sbi->orphan_inode_mutex); |
251 | head = &sbi->orphan_inode_list; | 250 | head = &sbi->orphan_inode_list; |
252 | list_for_each_entry(orphan, head, list) { | 251 | list_for_each_entry(orphan, head, list) { |
253 | if (orphan->ino == ino) { | 252 | if (orphan->ino == ino) { |
254 | list_del(&orphan->list); | 253 | list_del(&orphan->list); |
255 | kmem_cache_free(orphan_entry_slab, orphan); | 254 | kmem_cache_free(orphan_entry_slab, orphan); |
256 | sbi->n_orphans--; | 255 | sbi->n_orphans--; |
257 | break; | 256 | break; |
258 | } | 257 | } |
259 | } | 258 | } |
260 | mutex_unlock(&sbi->orphan_inode_mutex); | 259 | mutex_unlock(&sbi->orphan_inode_mutex); |
261 | } | 260 | } |
262 | 261 | ||
263 | static void recover_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) | 262 | static void recover_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) |
264 | { | 263 | { |
265 | struct inode *inode = f2fs_iget(sbi->sb, ino); | 264 | struct inode *inode = f2fs_iget(sbi->sb, ino); |
266 | BUG_ON(IS_ERR(inode)); | 265 | BUG_ON(IS_ERR(inode)); |
267 | clear_nlink(inode); | 266 | clear_nlink(inode); |
268 | 267 | ||
269 | /* truncate all the data during iput */ | 268 | /* truncate all the data during iput */ |
270 | iput(inode); | 269 | iput(inode); |
271 | } | 270 | } |
272 | 271 | ||
273 | int recover_orphan_inodes(struct f2fs_sb_info *sbi) | 272 | int recover_orphan_inodes(struct f2fs_sb_info *sbi) |
274 | { | 273 | { |
275 | block_t start_blk, orphan_blkaddr, i, j; | 274 | block_t start_blk, orphan_blkaddr, i, j; |
276 | 275 | ||
277 | if (!is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG)) | 276 | if (!is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG)) |
278 | return 0; | 277 | return 0; |
279 | 278 | ||
280 | sbi->por_doing = 1; | 279 | sbi->por_doing = 1; |
281 | start_blk = __start_cp_addr(sbi) + 1; | 280 | start_blk = __start_cp_addr(sbi) + 1; |
282 | orphan_blkaddr = __start_sum_addr(sbi) - 1; | 281 | orphan_blkaddr = __start_sum_addr(sbi) - 1; |
283 | 282 | ||
284 | for (i = 0; i < orphan_blkaddr; i++) { | 283 | for (i = 0; i < orphan_blkaddr; i++) { |
285 | struct page *page = get_meta_page(sbi, start_blk + i); | 284 | struct page *page = get_meta_page(sbi, start_blk + i); |
286 | struct f2fs_orphan_block *orphan_blk; | 285 | struct f2fs_orphan_block *orphan_blk; |
287 | 286 | ||
288 | orphan_blk = (struct f2fs_orphan_block *)page_address(page); | 287 | orphan_blk = (struct f2fs_orphan_block *)page_address(page); |
289 | for (j = 0; j < le32_to_cpu(orphan_blk->entry_count); j++) { | 288 | for (j = 0; j < le32_to_cpu(orphan_blk->entry_count); j++) { |
290 | nid_t ino = le32_to_cpu(orphan_blk->ino[j]); | 289 | nid_t ino = le32_to_cpu(orphan_blk->ino[j]); |
291 | recover_orphan_inode(sbi, ino); | 290 | recover_orphan_inode(sbi, ino); |
292 | } | 291 | } |
293 | f2fs_put_page(page, 1); | 292 | f2fs_put_page(page, 1); |
294 | } | 293 | } |
295 | /* clear Orphan Flag */ | 294 | /* clear Orphan Flag */ |
296 | clear_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG); | 295 | clear_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG); |
297 | sbi->por_doing = 0; | 296 | sbi->por_doing = 0; |
298 | return 0; | 297 | return 0; |
299 | } | 298 | } |
300 | 299 | ||
301 | static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk) | 300 | static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk) |
302 | { | 301 | { |
303 | struct list_head *head, *this, *next; | 302 | struct list_head *head, *this, *next; |
304 | struct f2fs_orphan_block *orphan_blk = NULL; | 303 | struct f2fs_orphan_block *orphan_blk = NULL; |
305 | struct page *page = NULL; | 304 | struct page *page = NULL; |
306 | unsigned int nentries = 0; | 305 | unsigned int nentries = 0; |
307 | unsigned short index = 1; | 306 | unsigned short index = 1; |
308 | unsigned short orphan_blocks; | 307 | unsigned short orphan_blocks; |
309 | 308 | ||
310 | orphan_blocks = (unsigned short)((sbi->n_orphans + | 309 | orphan_blocks = (unsigned short)((sbi->n_orphans + |
311 | (F2FS_ORPHANS_PER_BLOCK - 1)) / F2FS_ORPHANS_PER_BLOCK); | 310 | (F2FS_ORPHANS_PER_BLOCK - 1)) / F2FS_ORPHANS_PER_BLOCK); |
312 | 311 | ||
313 | mutex_lock(&sbi->orphan_inode_mutex); | 312 | mutex_lock(&sbi->orphan_inode_mutex); |
314 | head = &sbi->orphan_inode_list; | 313 | head = &sbi->orphan_inode_list; |
315 | 314 | ||
316 | /* loop over each orphan inode entry and write them into the journal block */ | 315 | /* loop over each orphan inode entry and write them into the journal block */ |
317 | list_for_each_safe(this, next, head) { | 316 | list_for_each_safe(this, next, head) { |
318 | struct orphan_inode_entry *orphan; | 317 | struct orphan_inode_entry *orphan; |
319 | 318 | ||
320 | orphan = list_entry(this, struct orphan_inode_entry, list); | 319 | orphan = list_entry(this, struct orphan_inode_entry, list); |
321 | 320 | ||
322 | if (nentries == F2FS_ORPHANS_PER_BLOCK) { | 321 | if (nentries == F2FS_ORPHANS_PER_BLOCK) { |
323 | /* | 322 | /* |
324 | * the current orphan block is full (1020 entries), | 323 | * the current orphan block is full (1020 entries), |
325 | * so we need to flush it | 324 | * so we need to flush it |
326 | * and bring another one into memory | 325 | * and bring another one into memory |
327 | */ | 326 | */ |
328 | orphan_blk->blk_addr = cpu_to_le16(index); | 327 | orphan_blk->blk_addr = cpu_to_le16(index); |
329 | orphan_blk->blk_count = cpu_to_le16(orphan_blocks); | 328 | orphan_blk->blk_count = cpu_to_le16(orphan_blocks); |
330 | orphan_blk->entry_count = cpu_to_le32(nentries); | 329 | orphan_blk->entry_count = cpu_to_le32(nentries); |
331 | set_page_dirty(page); | 330 | set_page_dirty(page); |
332 | f2fs_put_page(page, 1); | 331 | f2fs_put_page(page, 1); |
333 | index++; | 332 | index++; |
334 | start_blk++; | 333 | start_blk++; |
335 | nentries = 0; | 334 | nentries = 0; |
336 | page = NULL; | 335 | page = NULL; |
337 | } | 336 | } |
338 | if (page) | 337 | if (page) |
339 | goto page_exist; | 338 | goto page_exist; |
340 | 339 | ||
341 | page = grab_meta_page(sbi, start_blk); | 340 | page = grab_meta_page(sbi, start_blk); |
342 | orphan_blk = (struct f2fs_orphan_block *)page_address(page); | 341 | orphan_blk = (struct f2fs_orphan_block *)page_address(page); |
343 | memset(orphan_blk, 0, sizeof(*orphan_blk)); | 342 | memset(orphan_blk, 0, sizeof(*orphan_blk)); |
344 | page_exist: | 343 | page_exist: |
345 | orphan_blk->ino[nentries++] = cpu_to_le32(orphan->ino); | 344 | orphan_blk->ino[nentries++] = cpu_to_le32(orphan->ino); |
346 | } | 345 | } |
347 | if (!page) | 346 | if (!page) |
348 | goto end; | 347 | goto end; |
349 | 348 | ||
350 | orphan_blk->blk_addr = cpu_to_le16(index); | 349 | orphan_blk->blk_addr = cpu_to_le16(index); |
351 | orphan_blk->blk_count = cpu_to_le16(orphan_blocks); | 350 | orphan_blk->blk_count = cpu_to_le16(orphan_blocks); |
352 | orphan_blk->entry_count = cpu_to_le32(nentries); | 351 | orphan_blk->entry_count = cpu_to_le32(nentries); |
353 | set_page_dirty(page); | 352 | set_page_dirty(page); |
354 | f2fs_put_page(page, 1); | 353 | f2fs_put_page(page, 1); |
355 | end: | 354 | end: |
356 | mutex_unlock(&sbi->orphan_inode_mutex); | 355 | mutex_unlock(&sbi->orphan_inode_mutex); |
357 | } | 356 | } |
358 | 357 | ||
359 | static struct page *validate_checkpoint(struct f2fs_sb_info *sbi, | 358 | static struct page *validate_checkpoint(struct f2fs_sb_info *sbi, |
360 | block_t cp_addr, unsigned long long *version) | 359 | block_t cp_addr, unsigned long long *version) |
361 | { | 360 | { |
362 | struct page *cp_page_1, *cp_page_2 = NULL; | 361 | struct page *cp_page_1, *cp_page_2 = NULL; |
363 | unsigned long blk_size = sbi->blocksize; | 362 | unsigned long blk_size = sbi->blocksize; |
364 | struct f2fs_checkpoint *cp_block; | 363 | struct f2fs_checkpoint *cp_block; |
365 | unsigned long long cur_version = 0, pre_version = 0; | 364 | unsigned long long cur_version = 0, pre_version = 0; |
366 | size_t crc_offset; | 365 | size_t crc_offset; |
367 | __u32 crc = 0; | 366 | __u32 crc = 0; |
368 | 367 | ||
369 | /* Read the 1st cp block in this CP pack */ | 368 | /* Read the 1st cp block in this CP pack */ |
370 | cp_page_1 = get_meta_page(sbi, cp_addr); | 369 | cp_page_1 = get_meta_page(sbi, cp_addr); |
371 | 370 | ||
372 | /* get the version number */ | 371 | /* get the version number */ |
373 | cp_block = (struct f2fs_checkpoint *)page_address(cp_page_1); | 372 | cp_block = (struct f2fs_checkpoint *)page_address(cp_page_1); |
374 | crc_offset = le32_to_cpu(cp_block->checksum_offset); | 373 | crc_offset = le32_to_cpu(cp_block->checksum_offset); |
375 | if (crc_offset >= blk_size) | 374 | if (crc_offset >= blk_size) |
376 | goto invalid_cp1; | 375 | goto invalid_cp1; |
377 | 376 | ||
378 | crc = le32_to_cpu(*((__u32 *)((unsigned char *)cp_block + crc_offset))); | 377 | crc = le32_to_cpu(*((__u32 *)((unsigned char *)cp_block + crc_offset))); |
379 | if (!f2fs_crc_valid(crc, cp_block, crc_offset)) | 378 | if (!f2fs_crc_valid(crc, cp_block, crc_offset)) |
380 | goto invalid_cp1; | 379 | goto invalid_cp1; |
381 | 380 | ||
382 | pre_version = cur_cp_version(cp_block); | 381 | pre_version = cur_cp_version(cp_block); |
383 | 382 | ||
384 | /* Read the 2nd cp block in this CP pack */ | 383 | /* Read the 2nd cp block in this CP pack */ |
385 | cp_addr += le32_to_cpu(cp_block->cp_pack_total_block_count) - 1; | 384 | cp_addr += le32_to_cpu(cp_block->cp_pack_total_block_count) - 1; |
386 | cp_page_2 = get_meta_page(sbi, cp_addr); | 385 | cp_page_2 = get_meta_page(sbi, cp_addr); |
387 | 386 | ||
388 | cp_block = (struct f2fs_checkpoint *)page_address(cp_page_2); | 387 | cp_block = (struct f2fs_checkpoint *)page_address(cp_page_2); |
389 | crc_offset = le32_to_cpu(cp_block->checksum_offset); | 388 | crc_offset = le32_to_cpu(cp_block->checksum_offset); |
390 | if (crc_offset >= blk_size) | 389 | if (crc_offset >= blk_size) |
391 | goto invalid_cp2; | 390 | goto invalid_cp2; |
392 | 391 | ||
393 | crc = le32_to_cpu(*((__u32 *)((unsigned char *)cp_block + crc_offset))); | 392 | crc = le32_to_cpu(*((__u32 *)((unsigned char *)cp_block + crc_offset))); |
394 | if (!f2fs_crc_valid(crc, cp_block, crc_offset)) | 393 | if (!f2fs_crc_valid(crc, cp_block, crc_offset)) |
395 | goto invalid_cp2; | 394 | goto invalid_cp2; |
396 | 395 | ||
397 | cur_version = cur_cp_version(cp_block); | 396 | cur_version = cur_cp_version(cp_block); |
398 | 397 | ||
399 | if (cur_version == pre_version) { | 398 | if (cur_version == pre_version) { |
400 | *version = cur_version; | 399 | *version = cur_version; |
401 | f2fs_put_page(cp_page_2, 1); | 400 | f2fs_put_page(cp_page_2, 1); |
402 | return cp_page_1; | 401 | return cp_page_1; |
403 | } | 402 | } |
404 | invalid_cp2: | 403 | invalid_cp2: |
405 | f2fs_put_page(cp_page_2, 1); | 404 | f2fs_put_page(cp_page_2, 1); |
406 | invalid_cp1: | 405 | invalid_cp1: |
407 | f2fs_put_page(cp_page_1, 1); | 406 | f2fs_put_page(cp_page_1, 1); |
408 | return NULL; | 407 | return NULL; |
409 | } | 408 | } |
410 | 409 | ||
411 | int get_valid_checkpoint(struct f2fs_sb_info *sbi) | 410 | int get_valid_checkpoint(struct f2fs_sb_info *sbi) |
412 | { | 411 | { |
413 | struct f2fs_checkpoint *cp_block; | 412 | struct f2fs_checkpoint *cp_block; |
414 | struct f2fs_super_block *fsb = sbi->raw_super; | 413 | struct f2fs_super_block *fsb = sbi->raw_super; |
415 | struct page *cp1, *cp2, *cur_page; | 414 | struct page *cp1, *cp2, *cur_page; |
416 | unsigned long blk_size = sbi->blocksize; | 415 | unsigned long blk_size = sbi->blocksize; |
417 | unsigned long long cp1_version = 0, cp2_version = 0; | 416 | unsigned long long cp1_version = 0, cp2_version = 0; |
418 | unsigned long long cp_start_blk_no; | 417 | unsigned long long cp_start_blk_no; |
419 | 418 | ||
420 | sbi->ckpt = kzalloc(blk_size, GFP_KERNEL); | 419 | sbi->ckpt = kzalloc(blk_size, GFP_KERNEL); |
421 | if (!sbi->ckpt) | 420 | if (!sbi->ckpt) |
422 | return -ENOMEM; | 421 | return -ENOMEM; |
423 | /* | 422 | /* |
424 | * Finding the valid cp block involves reading both | 423 | * Finding the valid cp block involves reading both |
425 | * sets (cp pack 1 and cp pack 2) | 424 | * sets (cp pack 1 and cp pack 2) |
426 | */ | 425 | */ |
427 | cp_start_blk_no = le32_to_cpu(fsb->cp_blkaddr); | 426 | cp_start_blk_no = le32_to_cpu(fsb->cp_blkaddr); |
428 | cp1 = validate_checkpoint(sbi, cp_start_blk_no, &cp1_version); | 427 | cp1 = validate_checkpoint(sbi, cp_start_blk_no, &cp1_version); |
429 | 428 | ||
430 | /* The second checkpoint pack should start at the next segment */ | 429 | /* The second checkpoint pack should start at the next segment */ |
431 | cp_start_blk_no += 1 << le32_to_cpu(fsb->log_blocks_per_seg); | 430 | cp_start_blk_no += 1 << le32_to_cpu(fsb->log_blocks_per_seg); |
432 | cp2 = validate_checkpoint(sbi, cp_start_blk_no, &cp2_version); | 431 | cp2 = validate_checkpoint(sbi, cp_start_blk_no, &cp2_version); |
433 | 432 | ||
434 | if (cp1 && cp2) { | 433 | if (cp1 && cp2) { |
435 | if (ver_after(cp2_version, cp1_version)) | 434 | if (ver_after(cp2_version, cp1_version)) |
436 | cur_page = cp2; | 435 | cur_page = cp2; |
437 | else | 436 | else |
438 | cur_page = cp1; | 437 | cur_page = cp1; |
439 | } else if (cp1) { | 438 | } else if (cp1) { |
440 | cur_page = cp1; | 439 | cur_page = cp1; |
441 | } else if (cp2) { | 440 | } else if (cp2) { |
442 | cur_page = cp2; | 441 | cur_page = cp2; |
443 | } else { | 442 | } else { |
444 | goto fail_no_cp; | 443 | goto fail_no_cp; |
445 | } | 444 | } |
446 | 445 | ||
447 | cp_block = (struct f2fs_checkpoint *)page_address(cur_page); | 446 | cp_block = (struct f2fs_checkpoint *)page_address(cur_page); |
448 | memcpy(sbi->ckpt, cp_block, blk_size); | 447 | memcpy(sbi->ckpt, cp_block, blk_size); |
449 | 448 | ||
450 | f2fs_put_page(cp1, 1); | 449 | f2fs_put_page(cp1, 1); |
451 | f2fs_put_page(cp2, 1); | 450 | f2fs_put_page(cp2, 1); |
452 | return 0; | 451 | return 0; |
453 | 452 | ||
454 | fail_no_cp: | 453 | fail_no_cp: |
455 | kfree(sbi->ckpt); | 454 | kfree(sbi->ckpt); |
456 | return -EINVAL; | 455 | return -EINVAL; |
457 | } | 456 | } |
458 | 457 | ||
459 | static int __add_dirty_inode(struct inode *inode, struct dir_inode_entry *new) | 458 | static int __add_dirty_inode(struct inode *inode, struct dir_inode_entry *new) |
460 | { | 459 | { |
461 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); | 460 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); |
462 | struct list_head *head = &sbi->dir_inode_list; | 461 | struct list_head *head = &sbi->dir_inode_list; |
463 | struct list_head *this; | 462 | struct list_head *this; |
464 | 463 | ||
465 | list_for_each(this, head) { | 464 | list_for_each(this, head) { |
466 | struct dir_inode_entry *entry; | 465 | struct dir_inode_entry *entry; |
467 | entry = list_entry(this, struct dir_inode_entry, list); | 466 | entry = list_entry(this, struct dir_inode_entry, list); |
468 | if (entry->inode == inode) | 467 | if (entry->inode == inode) |
469 | return -EEXIST; | 468 | return -EEXIST; |
470 | } | 469 | } |
471 | list_add_tail(&new->list, head); | 470 | list_add_tail(&new->list, head); |
472 | #ifdef CONFIG_F2FS_STAT_FS | 471 | #ifdef CONFIG_F2FS_STAT_FS |
473 | sbi->n_dirty_dirs++; | 472 | sbi->n_dirty_dirs++; |
474 | #endif | 473 | #endif |
475 | return 0; | 474 | return 0; |
476 | } | 475 | } |
477 | 476 | ||
478 | void set_dirty_dir_page(struct inode *inode, struct page *page) | 477 | void set_dirty_dir_page(struct inode *inode, struct page *page) |
479 | { | 478 | { |
480 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); | 479 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); |
481 | struct dir_inode_entry *new; | 480 | struct dir_inode_entry *new; |
482 | 481 | ||
483 | if (!S_ISDIR(inode->i_mode)) | 482 | if (!S_ISDIR(inode->i_mode)) |
484 | return; | 483 | return; |
485 | retry: | 484 | retry: |
486 | new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS); | 485 | new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS); |
487 | if (!new) { | 486 | if (!new) { |
488 | cond_resched(); | 487 | cond_resched(); |
489 | goto retry; | 488 | goto retry; |
490 | } | 489 | } |
491 | new->inode = inode; | 490 | new->inode = inode; |
492 | INIT_LIST_HEAD(&new->list); | 491 | INIT_LIST_HEAD(&new->list); |
493 | 492 | ||
494 | spin_lock(&sbi->dir_inode_lock); | 493 | spin_lock(&sbi->dir_inode_lock); |
495 | if (__add_dirty_inode(inode, new)) | 494 | if (__add_dirty_inode(inode, new)) |
496 | kmem_cache_free(inode_entry_slab, new); | 495 | kmem_cache_free(inode_entry_slab, new); |
497 | 496 | ||
498 | inc_page_count(sbi, F2FS_DIRTY_DENTS); | 497 | inc_page_count(sbi, F2FS_DIRTY_DENTS); |
499 | inode_inc_dirty_dents(inode); | 498 | inode_inc_dirty_dents(inode); |
500 | SetPagePrivate(page); | 499 | SetPagePrivate(page); |
501 | spin_unlock(&sbi->dir_inode_lock); | 500 | spin_unlock(&sbi->dir_inode_lock); |
502 | } | 501 | } |
503 | 502 | ||
504 | void add_dirty_dir_inode(struct inode *inode) | 503 | void add_dirty_dir_inode(struct inode *inode) |
505 | { | 504 | { |
506 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); | 505 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); |
507 | struct dir_inode_entry *new; | 506 | struct dir_inode_entry *new; |
508 | retry: | 507 | retry: |
509 | new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS); | 508 | new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS); |
510 | if (!new) { | 509 | if (!new) { |
511 | cond_resched(); | 510 | cond_resched(); |
512 | goto retry; | 511 | goto retry; |
513 | } | 512 | } |
514 | new->inode = inode; | 513 | new->inode = inode; |
515 | INIT_LIST_HEAD(&new->list); | 514 | INIT_LIST_HEAD(&new->list); |
516 | 515 | ||
517 | spin_lock(&sbi->dir_inode_lock); | 516 | spin_lock(&sbi->dir_inode_lock); |
518 | if (__add_dirty_inode(inode, new)) | 517 | if (__add_dirty_inode(inode, new)) |
519 | kmem_cache_free(inode_entry_slab, new); | 518 | kmem_cache_free(inode_entry_slab, new); |
520 | spin_unlock(&sbi->dir_inode_lock); | 519 | spin_unlock(&sbi->dir_inode_lock); |
521 | } | 520 | } |
522 | 521 | ||
523 | void remove_dirty_dir_inode(struct inode *inode) | 522 | void remove_dirty_dir_inode(struct inode *inode) |
524 | { | 523 | { |
525 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); | 524 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); |
526 | struct list_head *head = &sbi->dir_inode_list; | 525 | struct list_head *head = &sbi->dir_inode_list; |
527 | struct list_head *this; | 526 | struct list_head *this; |
528 | 527 | ||
529 | if (!S_ISDIR(inode->i_mode)) | 528 | if (!S_ISDIR(inode->i_mode)) |
530 | return; | 529 | return; |
531 | 530 | ||
532 | spin_lock(&sbi->dir_inode_lock); | 531 | spin_lock(&sbi->dir_inode_lock); |
533 | if (atomic_read(&F2FS_I(inode)->dirty_dents)) { | 532 | if (atomic_read(&F2FS_I(inode)->dirty_dents)) { |
534 | spin_unlock(&sbi->dir_inode_lock); | 533 | spin_unlock(&sbi->dir_inode_lock); |
535 | return; | 534 | return; |
536 | } | 535 | } |
537 | 536 | ||
538 | list_for_each(this, head) { | 537 | list_for_each(this, head) { |
539 | struct dir_inode_entry *entry; | 538 | struct dir_inode_entry *entry; |
540 | entry = list_entry(this, struct dir_inode_entry, list); | 539 | entry = list_entry(this, struct dir_inode_entry, list); |
541 | if (entry->inode == inode) { | 540 | if (entry->inode == inode) { |
542 | list_del(&entry->list); | 541 | list_del(&entry->list); |
543 | kmem_cache_free(inode_entry_slab, entry); | 542 | kmem_cache_free(inode_entry_slab, entry); |
544 | #ifdef CONFIG_F2FS_STAT_FS | 543 | #ifdef CONFIG_F2FS_STAT_FS |
545 | sbi->n_dirty_dirs--; | 544 | sbi->n_dirty_dirs--; |
546 | #endif | 545 | #endif |
547 | break; | 546 | break; |
548 | } | 547 | } |
549 | } | 548 | } |
550 | spin_unlock(&sbi->dir_inode_lock); | 549 | spin_unlock(&sbi->dir_inode_lock); |
551 | 550 | ||
552 | /* Only from the recovery routine */ | 551 | /* Only from the recovery routine */ |
553 | if (is_inode_flag_set(F2FS_I(inode), FI_DELAY_IPUT)) { | 552 | if (is_inode_flag_set(F2FS_I(inode), FI_DELAY_IPUT)) { |
554 | clear_inode_flag(F2FS_I(inode), FI_DELAY_IPUT); | 553 | clear_inode_flag(F2FS_I(inode), FI_DELAY_IPUT); |
555 | iput(inode); | 554 | iput(inode); |
556 | } | 555 | } |
557 | } | 556 | } |
558 | 557 | ||
559 | struct inode *check_dirty_dir_inode(struct f2fs_sb_info *sbi, nid_t ino) | 558 | struct inode *check_dirty_dir_inode(struct f2fs_sb_info *sbi, nid_t ino) |
560 | { | 559 | { |
561 | struct list_head *head = &sbi->dir_inode_list; | 560 | struct list_head *head = &sbi->dir_inode_list; |
562 | struct list_head *this; | 561 | struct list_head *this; |
563 | struct inode *inode = NULL; | 562 | struct inode *inode = NULL; |
564 | 563 | ||
565 | spin_lock(&sbi->dir_inode_lock); | 564 | spin_lock(&sbi->dir_inode_lock); |
566 | list_for_each(this, head) { | 565 | list_for_each(this, head) { |
567 | struct dir_inode_entry *entry; | 566 | struct dir_inode_entry *entry; |
568 | entry = list_entry(this, struct dir_inode_entry, list); | 567 | entry = list_entry(this, struct dir_inode_entry, list); |
569 | if (entry->inode->i_ino == ino) { | 568 | if (entry->inode->i_ino == ino) { |
570 | inode = entry->inode; | 569 | inode = entry->inode; |
571 | break; | 570 | break; |
572 | } | 571 | } |
573 | } | 572 | } |
574 | spin_unlock(&sbi->dir_inode_lock); | 573 | spin_unlock(&sbi->dir_inode_lock); |
575 | return inode; | 574 | return inode; |
576 | } | 575 | } |
577 | 576 | ||
578 | void sync_dirty_dir_inodes(struct f2fs_sb_info *sbi) | 577 | void sync_dirty_dir_inodes(struct f2fs_sb_info *sbi) |
579 | { | 578 | { |
580 | struct list_head *head = &sbi->dir_inode_list; | 579 | struct list_head *head = &sbi->dir_inode_list; |
581 | struct dir_inode_entry *entry; | 580 | struct dir_inode_entry *entry; |
582 | struct inode *inode; | 581 | struct inode *inode; |
583 | retry: | 582 | retry: |
584 | spin_lock(&sbi->dir_inode_lock); | 583 | spin_lock(&sbi->dir_inode_lock); |
585 | if (list_empty(head)) { | 584 | if (list_empty(head)) { |
586 | spin_unlock(&sbi->dir_inode_lock); | 585 | spin_unlock(&sbi->dir_inode_lock); |
587 | return; | 586 | return; |
588 | } | 587 | } |
589 | entry = list_entry(head->next, struct dir_inode_entry, list); | 588 | entry = list_entry(head->next, struct dir_inode_entry, list); |
590 | inode = igrab(entry->inode); | 589 | inode = igrab(entry->inode); |
591 | spin_unlock(&sbi->dir_inode_lock); | 590 | spin_unlock(&sbi->dir_inode_lock); |
592 | if (inode) { | 591 | if (inode) { |
593 | filemap_flush(inode->i_mapping); | 592 | filemap_flush(inode->i_mapping); |
594 | iput(inode); | 593 | iput(inode); |
595 | } else { | 594 | } else { |
596 | /* | 595 | /* |
597 | * We should submit the bio, since there exist several | 596 | * We should submit the bio, since there exist several |
598 | * dentry pages under writeback in the freeing inode. | 597 | * dentry pages under writeback in the freeing inode. |
599 | */ | 598 | */ |
600 | f2fs_submit_bio(sbi, DATA, true); | 599 | f2fs_submit_bio(sbi, DATA, true); |
601 | } | 600 | } |
602 | goto retry; | 601 | goto retry; |
603 | } | 602 | } |
604 | 603 | ||
605 | /* | 604 | /* |
606 | * Freeze all the FS-operations for checkpoint. | 605 | * Freeze all the FS-operations for checkpoint. |
607 | */ | 606 | */ |
608 | static void block_operations(struct f2fs_sb_info *sbi) | 607 | static void block_operations(struct f2fs_sb_info *sbi) |
609 | { | 608 | { |
610 | struct writeback_control wbc = { | 609 | struct writeback_control wbc = { |
611 | .sync_mode = WB_SYNC_ALL, | 610 | .sync_mode = WB_SYNC_ALL, |
612 | .nr_to_write = LONG_MAX, | 611 | .nr_to_write = LONG_MAX, |
613 | .for_reclaim = 0, | 612 | .for_reclaim = 0, |
614 | }; | 613 | }; |
615 | struct blk_plug plug; | 614 | struct blk_plug plug; |
616 | 615 | ||
617 | blk_start_plug(&plug); | 616 | blk_start_plug(&plug); |
618 | 617 | ||
619 | retry_flush_dents: | 618 | retry_flush_dents: |
620 | mutex_lock_all(sbi); | 619 | mutex_lock_all(sbi); |
621 | 620 | ||
622 | /* write all the dirty dentry pages */ | 621 | /* write all the dirty dentry pages */ |
623 | if (get_pages(sbi, F2FS_DIRTY_DENTS)) { | 622 | if (get_pages(sbi, F2FS_DIRTY_DENTS)) { |
624 | mutex_unlock_all(sbi); | 623 | mutex_unlock_all(sbi); |
625 | sync_dirty_dir_inodes(sbi); | 624 | sync_dirty_dir_inodes(sbi); |
626 | goto retry_flush_dents; | 625 | goto retry_flush_dents; |
627 | } | 626 | } |
628 | 627 | ||
629 | /* | 628 | /* |
630 | * POR: we should ensure that there are no dirty node pages | 629 | * POR: we should ensure that there are no dirty node pages |
631 | * until finishing nat/sit flush. | 630 | * until finishing nat/sit flush. |
632 | */ | 631 | */ |
633 | retry_flush_nodes: | 632 | retry_flush_nodes: |
634 | mutex_lock(&sbi->node_write); | 633 | mutex_lock(&sbi->node_write); |
635 | 634 | ||
636 | if (get_pages(sbi, F2FS_DIRTY_NODES)) { | 635 | if (get_pages(sbi, F2FS_DIRTY_NODES)) { |
637 | mutex_unlock(&sbi->node_write); | 636 | mutex_unlock(&sbi->node_write); |
638 | sync_node_pages(sbi, 0, &wbc); | 637 | sync_node_pages(sbi, 0, &wbc); |
639 | goto retry_flush_nodes; | 638 | goto retry_flush_nodes; |
640 | } | 639 | } |
641 | blk_finish_plug(&plug); | 640 | blk_finish_plug(&plug); |
642 | } | 641 | } |
643 | 642 | ||
644 | static void unblock_operations(struct f2fs_sb_info *sbi) | 643 | static void unblock_operations(struct f2fs_sb_info *sbi) |
645 | { | 644 | { |
646 | mutex_unlock(&sbi->node_write); | 645 | mutex_unlock(&sbi->node_write); |
647 | mutex_unlock_all(sbi); | 646 | mutex_unlock_all(sbi); |
648 | } | 647 | } |
649 | 648 | ||
650 | static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) | 649 | static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) |
651 | { | 650 | { |
652 | struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); | 651 | struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); |
653 | nid_t last_nid = 0; | 652 | nid_t last_nid = 0; |
654 | block_t start_blk; | 653 | block_t start_blk; |
655 | struct page *cp_page; | 654 | struct page *cp_page; |
656 | unsigned int data_sum_blocks, orphan_blocks; | 655 | unsigned int data_sum_blocks, orphan_blocks; |
657 | __u32 crc32 = 0; | 656 | __u32 crc32 = 0; |
658 | void *kaddr; | 657 | void *kaddr; |
659 | int i; | 658 | int i; |
660 | 659 | ||
661 | /* Flush all the NAT/SIT pages */ | 660 | /* Flush all the NAT/SIT pages */ |
662 | while (get_pages(sbi, F2FS_DIRTY_META)) | 661 | while (get_pages(sbi, F2FS_DIRTY_META)) |
663 | sync_meta_pages(sbi, META, LONG_MAX); | 662 | sync_meta_pages(sbi, META, LONG_MAX); |
664 | 663 | ||
665 | next_free_nid(sbi, &last_nid); | 664 | next_free_nid(sbi, &last_nid); |
666 | 665 | ||
667 | /* | 666 | /* |
668 | * modify checkpoint | 667 | * modify checkpoint |
669 | * version number is already updated | 668 | * version number is already updated |
670 | */ | 669 | */ |
671 | ckpt->elapsed_time = cpu_to_le64(get_mtime(sbi)); | 670 | ckpt->elapsed_time = cpu_to_le64(get_mtime(sbi)); |
672 | ckpt->valid_block_count = cpu_to_le64(valid_user_blocks(sbi)); | 671 | ckpt->valid_block_count = cpu_to_le64(valid_user_blocks(sbi)); |
673 | ckpt->free_segment_count = cpu_to_le32(free_segments(sbi)); | 672 | ckpt->free_segment_count = cpu_to_le32(free_segments(sbi)); |
674 | for (i = 0; i < 3; i++) { | 673 | for (i = 0; i < 3; i++) { |
675 | ckpt->cur_node_segno[i] = | 674 | ckpt->cur_node_segno[i] = |
676 | cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_NODE)); | 675 | cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_NODE)); |
677 | ckpt->cur_node_blkoff[i] = | 676 | ckpt->cur_node_blkoff[i] = |
678 | cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_NODE)); | 677 | cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_NODE)); |
679 | ckpt->alloc_type[i + CURSEG_HOT_NODE] = | 678 | ckpt->alloc_type[i + CURSEG_HOT_NODE] = |
680 | curseg_alloc_type(sbi, i + CURSEG_HOT_NODE); | 679 | curseg_alloc_type(sbi, i + CURSEG_HOT_NODE); |
681 | } | 680 | } |
682 | for (i = 0; i < 3; i++) { | 681 | for (i = 0; i < 3; i++) { |
683 | ckpt->cur_data_segno[i] = | 682 | ckpt->cur_data_segno[i] = |
684 | cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_DATA)); | 683 | cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_DATA)); |
685 | ckpt->cur_data_blkoff[i] = | 684 | ckpt->cur_data_blkoff[i] = |
686 | cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_DATA)); | 685 | cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_DATA)); |
687 | ckpt->alloc_type[i + CURSEG_HOT_DATA] = | 686 | ckpt->alloc_type[i + CURSEG_HOT_DATA] = |
688 | curseg_alloc_type(sbi, i + CURSEG_HOT_DATA); | 687 | curseg_alloc_type(sbi, i + CURSEG_HOT_DATA); |
689 | } | 688 | } |
690 | 689 | ||
691 | ckpt->valid_node_count = cpu_to_le32(valid_node_count(sbi)); | 690 | ckpt->valid_node_count = cpu_to_le32(valid_node_count(sbi)); |
692 | ckpt->valid_inode_count = cpu_to_le32(valid_inode_count(sbi)); | 691 | ckpt->valid_inode_count = cpu_to_le32(valid_inode_count(sbi)); |
693 | ckpt->next_free_nid = cpu_to_le32(last_nid); | 692 | ckpt->next_free_nid = cpu_to_le32(last_nid); |
694 | 693 | ||
695 | /* 2 cp + n data seg summary + orphan inode blocks */ | 694 | /* 2 cp + n data seg summary + orphan inode blocks */ |
696 | data_sum_blocks = npages_for_summary_flush(sbi); | 695 | data_sum_blocks = npages_for_summary_flush(sbi); |
697 | if (data_sum_blocks < 3) | 696 | if (data_sum_blocks < 3) |
698 | set_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG); | 697 | set_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG); |
699 | else | 698 | else |
700 | clear_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG); | 699 | clear_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG); |
701 | 700 | ||
702 | orphan_blocks = (sbi->n_orphans + F2FS_ORPHANS_PER_BLOCK - 1) | 701 | orphan_blocks = (sbi->n_orphans + F2FS_ORPHANS_PER_BLOCK - 1) |
703 | / F2FS_ORPHANS_PER_BLOCK; | 702 | / F2FS_ORPHANS_PER_BLOCK; |
704 | ckpt->cp_pack_start_sum = cpu_to_le32(1 + orphan_blocks); | 703 | ckpt->cp_pack_start_sum = cpu_to_le32(1 + orphan_blocks); |
705 | 704 | ||
706 | if (is_umount) { | 705 | if (is_umount) { |
707 | set_ckpt_flags(ckpt, CP_UMOUNT_FLAG); | 706 | set_ckpt_flags(ckpt, CP_UMOUNT_FLAG); |
708 | ckpt->cp_pack_total_block_count = cpu_to_le32(2 + | 707 | ckpt->cp_pack_total_block_count = cpu_to_le32(2 + |
709 | data_sum_blocks + orphan_blocks + NR_CURSEG_NODE_TYPE); | 708 | data_sum_blocks + orphan_blocks + NR_CURSEG_NODE_TYPE); |
710 | } else { | 709 | } else { |
711 | clear_ckpt_flags(ckpt, CP_UMOUNT_FLAG); | 710 | clear_ckpt_flags(ckpt, CP_UMOUNT_FLAG); |
712 | ckpt->cp_pack_total_block_count = cpu_to_le32(2 + | 711 | ckpt->cp_pack_total_block_count = cpu_to_le32(2 + |
713 | data_sum_blocks + orphan_blocks); | 712 | data_sum_blocks + orphan_blocks); |
714 | } | 713 | } |
715 | 714 | ||
716 | if (sbi->n_orphans) | 715 | if (sbi->n_orphans) |
717 | set_ckpt_flags(ckpt, CP_ORPHAN_PRESENT_FLAG); | 716 | set_ckpt_flags(ckpt, CP_ORPHAN_PRESENT_FLAG); |
718 | else | 717 | else |
719 | clear_ckpt_flags(ckpt, CP_ORPHAN_PRESENT_FLAG); | 718 | clear_ckpt_flags(ckpt, CP_ORPHAN_PRESENT_FLAG); |
720 | 719 | ||
721 | /* update SIT/NAT bitmap */ | 720 | /* update SIT/NAT bitmap */ |
722 | get_sit_bitmap(sbi, __bitmap_ptr(sbi, SIT_BITMAP)); | 721 | get_sit_bitmap(sbi, __bitmap_ptr(sbi, SIT_BITMAP)); |
723 | get_nat_bitmap(sbi, __bitmap_ptr(sbi, NAT_BITMAP)); | 722 | get_nat_bitmap(sbi, __bitmap_ptr(sbi, NAT_BITMAP)); |
724 | 723 | ||
725 | crc32 = f2fs_crc32(ckpt, le32_to_cpu(ckpt->checksum_offset)); | 724 | crc32 = f2fs_crc32(ckpt, le32_to_cpu(ckpt->checksum_offset)); |
726 | *((__le32 *)((unsigned char *)ckpt + | 725 | *((__le32 *)((unsigned char *)ckpt + |
727 | le32_to_cpu(ckpt->checksum_offset))) | 726 | le32_to_cpu(ckpt->checksum_offset))) |
728 | = cpu_to_le32(crc32); | 727 | = cpu_to_le32(crc32); |
729 | 728 | ||
730 | start_blk = __start_cp_addr(sbi); | 729 | start_blk = __start_cp_addr(sbi); |
731 | 730 | ||
732 | /* write out checkpoint buffer at block 0 */ | 731 | /* write out checkpoint buffer at block 0 */ |
733 | cp_page = grab_meta_page(sbi, start_blk++); | 732 | cp_page = grab_meta_page(sbi, start_blk++); |
734 | kaddr = page_address(cp_page); | 733 | kaddr = page_address(cp_page); |
735 | memcpy(kaddr, ckpt, (1 << sbi->log_blocksize)); | 734 | memcpy(kaddr, ckpt, (1 << sbi->log_blocksize)); |
736 | set_page_dirty(cp_page); | 735 | set_page_dirty(cp_page); |
737 | f2fs_put_page(cp_page, 1); | 736 | f2fs_put_page(cp_page, 1); |
738 | 737 | ||
739 | if (sbi->n_orphans) { | 738 | if (sbi->n_orphans) { |
740 | write_orphan_inodes(sbi, start_blk); | 739 | write_orphan_inodes(sbi, start_blk); |
741 | start_blk += orphan_blocks; | 740 | start_blk += orphan_blocks; |
742 | } | 741 | } |
743 | 742 | ||
744 | write_data_summaries(sbi, start_blk); | 743 | write_data_summaries(sbi, start_blk); |
745 | start_blk += data_sum_blocks; | 744 | start_blk += data_sum_blocks; |
746 | if (is_umount) { | 745 | if (is_umount) { |
747 | write_node_summaries(sbi, start_blk); | 746 | write_node_summaries(sbi, start_blk); |
748 | start_blk += NR_CURSEG_NODE_TYPE; | 747 | start_blk += NR_CURSEG_NODE_TYPE; |
749 | } | 748 | } |
750 | 749 | ||
751 | /* writeout checkpoint block */ | 750 | /* writeout checkpoint block */ |
752 | cp_page = grab_meta_page(sbi, start_blk); | 751 | cp_page = grab_meta_page(sbi, start_blk); |
753 | kaddr = page_address(cp_page); | 752 | kaddr = page_address(cp_page); |
754 | memcpy(kaddr, ckpt, (1 << sbi->log_blocksize)); | 753 | memcpy(kaddr, ckpt, (1 << sbi->log_blocksize)); |
755 | set_page_dirty(cp_page); | 754 | set_page_dirty(cp_page); |
756 | f2fs_put_page(cp_page, 1); | 755 | f2fs_put_page(cp_page, 1); |
757 | 756 | ||
758 | /* wait for writeback of previously submitted node/meta pages */ | 757 | /* wait for writeback of previously submitted node/meta pages */ |
759 | while (get_pages(sbi, F2FS_WRITEBACK)) | 758 | while (get_pages(sbi, F2FS_WRITEBACK)) |
760 | congestion_wait(BLK_RW_ASYNC, HZ / 50); | 759 | congestion_wait(BLK_RW_ASYNC, HZ / 50); |
761 | 760 | ||
762 | filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX); | 761 | filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX); |
763 | filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX); | 762 | filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX); |
764 | 763 | ||
765 | /* update user_block_counts */ | 764 | /* update user_block_counts */ |
766 | sbi->last_valid_block_count = sbi->total_valid_block_count; | 765 | sbi->last_valid_block_count = sbi->total_valid_block_count; |
767 | sbi->alloc_valid_block_count = 0; | 766 | sbi->alloc_valid_block_count = 0; |
768 | 767 | ||
769 | /* Here, we only have one bio having CP pack */ | 768 | /* Here, we only have one bio having CP pack */ |
770 | sync_meta_pages(sbi, META_FLUSH, LONG_MAX); | 769 | sync_meta_pages(sbi, META_FLUSH, LONG_MAX); |
771 | 770 | ||
772 | if (!is_set_ckpt_flags(ckpt, CP_ERROR_FLAG)) { | 771 | if (!is_set_ckpt_flags(ckpt, CP_ERROR_FLAG)) { |
773 | clear_prefree_segments(sbi); | 772 | clear_prefree_segments(sbi); |
774 | F2FS_RESET_SB_DIRT(sbi); | 773 | F2FS_RESET_SB_DIRT(sbi); |
775 | } | 774 | } |
776 | } | 775 | } |
777 | 776 | ||
778 | /* | 777 | /* |
779 | * We guarantee that this checkpoint procedure should not fail. | 778 | * We guarantee that this checkpoint procedure should not fail. |
780 | */ | 779 | */ |
781 | void write_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) | 780 | void write_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) |
782 | { | 781 | { |
783 | struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); | 782 | struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); |
784 | unsigned long long ckpt_ver; | 783 | unsigned long long ckpt_ver; |
785 | 784 | ||
786 | trace_f2fs_write_checkpoint(sbi->sb, is_umount, "start block_ops"); | 785 | trace_f2fs_write_checkpoint(sbi->sb, is_umount, "start block_ops"); |
787 | 786 | ||
788 | mutex_lock(&sbi->cp_mutex); | 787 | mutex_lock(&sbi->cp_mutex); |
789 | block_operations(sbi); | 788 | block_operations(sbi); |
790 | 789 | ||
791 | trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish block_ops"); | 790 | trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish block_ops"); |
792 | 791 | ||
793 | f2fs_submit_bio(sbi, DATA, true); | 792 | f2fs_submit_bio(sbi, DATA, true); |
794 | f2fs_submit_bio(sbi, NODE, true); | 793 | f2fs_submit_bio(sbi, NODE, true); |
795 | f2fs_submit_bio(sbi, META, true); | 794 | f2fs_submit_bio(sbi, META, true); |
796 | 795 | ||
797 | /* | 796 | /* |
798 | * update checkpoint pack index | 797 | * update checkpoint pack index |
799 | * Increase the version number so that | 798 | * Increase the version number so that |
800 | * SIT entries and seg summaries are written at the correct place | 799 | * SIT entries and seg summaries are written at the correct place |
801 | */ | 800 | */ |
802 | ckpt_ver = cur_cp_version(ckpt); | 801 | ckpt_ver = cur_cp_version(ckpt); |
803 | ckpt->checkpoint_ver = cpu_to_le64(++ckpt_ver); | 802 | ckpt->checkpoint_ver = cpu_to_le64(++ckpt_ver); |
804 | 803 | ||
805 | /* write cached NAT/SIT entries to NAT/SIT area */ | 804 | /* write cached NAT/SIT entries to NAT/SIT area */ |
806 | flush_nat_entries(sbi); | 805 | flush_nat_entries(sbi); |
807 | flush_sit_entries(sbi); | 806 | flush_sit_entries(sbi); |
808 | 807 | ||
809 | /* unlock all the fs_lock[] in do_checkpoint() */ | 808 | /* unlock all the fs_lock[] in do_checkpoint() */ |
810 | do_checkpoint(sbi, is_umount); | 809 | do_checkpoint(sbi, is_umount); |
811 | 810 | ||
812 | unblock_operations(sbi); | 811 | unblock_operations(sbi); |
813 | mutex_unlock(&sbi->cp_mutex); | 812 | mutex_unlock(&sbi->cp_mutex); |
814 | 813 | ||
815 | trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish checkpoint"); | 814 | trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish checkpoint"); |
816 | } | 815 | } |
817 | 816 | ||
818 | void init_orphan_info(struct f2fs_sb_info *sbi) | 817 | void init_orphan_info(struct f2fs_sb_info *sbi) |
819 | { | 818 | { |
820 | mutex_init(&sbi->orphan_inode_mutex); | 819 | mutex_init(&sbi->orphan_inode_mutex); |
821 | INIT_LIST_HEAD(&sbi->orphan_inode_list); | 820 | INIT_LIST_HEAD(&sbi->orphan_inode_list); |
822 | sbi->n_orphans = 0; | 821 | sbi->n_orphans = 0; |
823 | } | 822 | } |
824 | 823 | ||
825 | int __init create_checkpoint_caches(void) | 824 | int __init create_checkpoint_caches(void) |
826 | { | 825 | { |
827 | orphan_entry_slab = f2fs_kmem_cache_create("f2fs_orphan_entry", | 826 | orphan_entry_slab = f2fs_kmem_cache_create("f2fs_orphan_entry", |
828 | sizeof(struct orphan_inode_entry), NULL); | 827 | sizeof(struct orphan_inode_entry), NULL); |
829 | if (unlikely(!orphan_entry_slab)) | 828 | if (unlikely(!orphan_entry_slab)) |
830 | return -ENOMEM; | 829 | return -ENOMEM; |
831 | inode_entry_slab = f2fs_kmem_cache_create("f2fs_dirty_dir_entry", | 830 | inode_entry_slab = f2fs_kmem_cache_create("f2fs_dirty_dir_entry", |
832 | sizeof(struct dir_inode_entry), NULL); | 831 | sizeof(struct dir_inode_entry), NULL); |
833 | if (unlikely(!inode_entry_slab)) { | 832 | if (unlikely(!inode_entry_slab)) { |
834 | kmem_cache_destroy(orphan_entry_slab); | 833 | kmem_cache_destroy(orphan_entry_slab); |
835 | return -ENOMEM; | 834 | return -ENOMEM; |
836 | } | 835 | } |
837 | return 0; | 836 | return 0; |
838 | } | 837 | } |
839 | 838 | ||
840 | void destroy_checkpoint_caches(void) | 839 | void destroy_checkpoint_caches(void) |
841 | { | 840 | { |
842 | kmem_cache_destroy(orphan_entry_slab); | 841 | kmem_cache_destroy(orphan_entry_slab); |
843 | kmem_cache_destroy(inode_entry_slab); | 842 | kmem_cache_destroy(inode_entry_slab); |
844 | } | 843 | } |
845 | 844 |
fs/f2fs/node.c
1 | /* | 1 | /* |
2 | * fs/f2fs/node.c | 2 | * fs/f2fs/node.c |
3 | * | 3 | * |
4 | * Copyright (c) 2012 Samsung Electronics Co., Ltd. | 4 | * Copyright (c) 2012 Samsung Electronics Co., Ltd. |
5 | * http://www.samsung.com/ | 5 | * http://www.samsung.com/ |
6 | * | 6 | * |
7 | * This program is free software; you can redistribute it and/or modify | 7 | * This program is free software; you can redistribute it and/or modify |
8 | * it under the terms of the GNU General Public License version 2 as | 8 | * it under the terms of the GNU General Public License version 2 as |
9 | * published by the Free Software Foundation. | 9 | * published by the Free Software Foundation. |
10 | */ | 10 | */ |
11 | #include <linux/fs.h> | 11 | #include <linux/fs.h> |
12 | #include <linux/f2fs_fs.h> | 12 | #include <linux/f2fs_fs.h> |
13 | #include <linux/mpage.h> | 13 | #include <linux/mpage.h> |
14 | #include <linux/backing-dev.h> | 14 | #include <linux/backing-dev.h> |
15 | #include <linux/blkdev.h> | 15 | #include <linux/blkdev.h> |
16 | #include <linux/pagevec.h> | 16 | #include <linux/pagevec.h> |
17 | #include <linux/swap.h> | 17 | #include <linux/swap.h> |
18 | 18 | ||
19 | #include "f2fs.h" | 19 | #include "f2fs.h" |
20 | #include "node.h" | 20 | #include "node.h" |
21 | #include "segment.h" | 21 | #include "segment.h" |
22 | #include <trace/events/f2fs.h> | 22 | #include <trace/events/f2fs.h> |
23 | 23 | ||
24 | static struct kmem_cache *nat_entry_slab; | 24 | static struct kmem_cache *nat_entry_slab; |
25 | static struct kmem_cache *free_nid_slab; | 25 | static struct kmem_cache *free_nid_slab; |
26 | 26 | ||
27 | static void clear_node_page_dirty(struct page *page) | 27 | static void clear_node_page_dirty(struct page *page) |
28 | { | 28 | { |
29 | struct address_space *mapping = page->mapping; | 29 | struct address_space *mapping = page->mapping; |
30 | struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); | 30 | struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); |
31 | unsigned int long flags; | 31 | unsigned int long flags; |
32 | 32 | ||
33 | if (PageDirty(page)) { | 33 | if (PageDirty(page)) { |
34 | spin_lock_irqsave(&mapping->tree_lock, flags); | 34 | spin_lock_irqsave(&mapping->tree_lock, flags); |
35 | radix_tree_tag_clear(&mapping->page_tree, | 35 | radix_tree_tag_clear(&mapping->page_tree, |
36 | page_index(page), | 36 | page_index(page), |
37 | PAGECACHE_TAG_DIRTY); | 37 | PAGECACHE_TAG_DIRTY); |
38 | spin_unlock_irqrestore(&mapping->tree_lock, flags); | 38 | spin_unlock_irqrestore(&mapping->tree_lock, flags); |
39 | 39 | ||
40 | clear_page_dirty_for_io(page); | 40 | clear_page_dirty_for_io(page); |
41 | dec_page_count(sbi, F2FS_DIRTY_NODES); | 41 | dec_page_count(sbi, F2FS_DIRTY_NODES); |
42 | } | 42 | } |
43 | ClearPageUptodate(page); | 43 | ClearPageUptodate(page); |
44 | } | 44 | } |
45 | 45 | ||
46 | static struct page *get_current_nat_page(struct f2fs_sb_info *sbi, nid_t nid) | 46 | static struct page *get_current_nat_page(struct f2fs_sb_info *sbi, nid_t nid) |
47 | { | 47 | { |
48 | pgoff_t index = current_nat_addr(sbi, nid); | 48 | pgoff_t index = current_nat_addr(sbi, nid); |
49 | return get_meta_page(sbi, index); | 49 | return get_meta_page(sbi, index); |
50 | } | 50 | } |
51 | 51 | ||
52 | static struct page *get_next_nat_page(struct f2fs_sb_info *sbi, nid_t nid) | 52 | static struct page *get_next_nat_page(struct f2fs_sb_info *sbi, nid_t nid) |
53 | { | 53 | { |
54 | struct page *src_page; | 54 | struct page *src_page; |
55 | struct page *dst_page; | 55 | struct page *dst_page; |
56 | pgoff_t src_off; | 56 | pgoff_t src_off; |
57 | pgoff_t dst_off; | 57 | pgoff_t dst_off; |
58 | void *src_addr; | 58 | void *src_addr; |
59 | void *dst_addr; | 59 | void *dst_addr; |
60 | struct f2fs_nm_info *nm_i = NM_I(sbi); | 60 | struct f2fs_nm_info *nm_i = NM_I(sbi); |
61 | 61 | ||
62 | src_off = current_nat_addr(sbi, nid); | 62 | src_off = current_nat_addr(sbi, nid); |
63 | dst_off = next_nat_addr(sbi, src_off); | 63 | dst_off = next_nat_addr(sbi, src_off); |
64 | 64 | ||
65 | /* get current nat block page with lock */ | 65 | /* get current nat block page with lock */ |
66 | src_page = get_meta_page(sbi, src_off); | 66 | src_page = get_meta_page(sbi, src_off); |
67 | 67 | ||
68 | /* Dirty src_page means that it is already the new target NAT page. */ | 68 | /* Dirty src_page means that it is already the new target NAT page. */ |
69 | if (PageDirty(src_page)) | 69 | if (PageDirty(src_page)) |
70 | return src_page; | 70 | return src_page; |
71 | 71 | ||
72 | dst_page = grab_meta_page(sbi, dst_off); | 72 | dst_page = grab_meta_page(sbi, dst_off); |
73 | 73 | ||
74 | src_addr = page_address(src_page); | 74 | src_addr = page_address(src_page); |
75 | dst_addr = page_address(dst_page); | 75 | dst_addr = page_address(dst_page); |
76 | memcpy(dst_addr, src_addr, PAGE_CACHE_SIZE); | 76 | memcpy(dst_addr, src_addr, PAGE_CACHE_SIZE); |
77 | set_page_dirty(dst_page); | 77 | set_page_dirty(dst_page); |
78 | f2fs_put_page(src_page, 1); | 78 | f2fs_put_page(src_page, 1); |
79 | 79 | ||
80 | set_to_next_nat(nm_i, nid); | 80 | set_to_next_nat(nm_i, nid); |
81 | 81 | ||
82 | return dst_page; | 82 | return dst_page; |
83 | } | 83 | } |
84 | 84 | ||
85 | /* | 85 | /* |
86 | * Readahead NAT pages | 86 | * Readahead NAT pages |
87 | */ | 87 | */ |
88 | static void ra_nat_pages(struct f2fs_sb_info *sbi, int nid) | 88 | static void ra_nat_pages(struct f2fs_sb_info *sbi, int nid) |
89 | { | 89 | { |
90 | struct address_space *mapping = sbi->meta_inode->i_mapping; | 90 | struct address_space *mapping = sbi->meta_inode->i_mapping; |
91 | struct f2fs_nm_info *nm_i = NM_I(sbi); | 91 | struct f2fs_nm_info *nm_i = NM_I(sbi); |
92 | struct blk_plug plug; | 92 | struct blk_plug plug; |
93 | struct page *page; | 93 | struct page *page; |
94 | pgoff_t index; | 94 | pgoff_t index; |
95 | int i; | 95 | int i; |
96 | 96 | ||
97 | blk_start_plug(&plug); | 97 | blk_start_plug(&plug); |
98 | 98 | ||
99 | for (i = 0; i < FREE_NID_PAGES; i++, nid += NAT_ENTRY_PER_BLOCK) { | 99 | for (i = 0; i < FREE_NID_PAGES; i++, nid += NAT_ENTRY_PER_BLOCK) { |
100 | if (nid >= nm_i->max_nid) | 100 | if (nid >= nm_i->max_nid) |
101 | nid = 0; | 101 | nid = 0; |
102 | index = current_nat_addr(sbi, nid); | 102 | index = current_nat_addr(sbi, nid); |
103 | 103 | ||
104 | page = grab_cache_page(mapping, index); | 104 | page = grab_cache_page(mapping, index); |
105 | if (!page) | 105 | if (!page) |
106 | continue; | 106 | continue; |
107 | if (PageUptodate(page)) { | 107 | if (PageUptodate(page)) { |
108 | f2fs_put_page(page, 1); | 108 | f2fs_put_page(page, 1); |
109 | continue; | 109 | continue; |
110 | } | 110 | } |
111 | if (f2fs_readpage(sbi, page, index, READ)) | 111 | if (f2fs_readpage(sbi, page, index, READ)) |
112 | continue; | 112 | continue; |
113 | 113 | ||
114 | f2fs_put_page(page, 0); | 114 | f2fs_put_page(page, 0); |
115 | } | 115 | } |
116 | blk_finish_plug(&plug); | 116 | blk_finish_plug(&plug); |
117 | } | 117 | } |
118 | 118 | ||
119 | static struct nat_entry *__lookup_nat_cache(struct f2fs_nm_info *nm_i, nid_t n) | 119 | static struct nat_entry *__lookup_nat_cache(struct f2fs_nm_info *nm_i, nid_t n) |
120 | { | 120 | { |
121 | return radix_tree_lookup(&nm_i->nat_root, n); | 121 | return radix_tree_lookup(&nm_i->nat_root, n); |
122 | } | 122 | } |
123 | 123 | ||
124 | static unsigned int __gang_lookup_nat_cache(struct f2fs_nm_info *nm_i, | 124 | static unsigned int __gang_lookup_nat_cache(struct f2fs_nm_info *nm_i, |
125 | nid_t start, unsigned int nr, struct nat_entry **ep) | 125 | nid_t start, unsigned int nr, struct nat_entry **ep) |
126 | { | 126 | { |
127 | return radix_tree_gang_lookup(&nm_i->nat_root, (void **)ep, start, nr); | 127 | return radix_tree_gang_lookup(&nm_i->nat_root, (void **)ep, start, nr); |
128 | } | 128 | } |
129 | 129 | ||
130 | static void __del_from_nat_cache(struct f2fs_nm_info *nm_i, struct nat_entry *e) | 130 | static void __del_from_nat_cache(struct f2fs_nm_info *nm_i, struct nat_entry *e) |
131 | { | 131 | { |
132 | list_del(&e->list); | 132 | list_del(&e->list); |
133 | radix_tree_delete(&nm_i->nat_root, nat_get_nid(e)); | 133 | radix_tree_delete(&nm_i->nat_root, nat_get_nid(e)); |
134 | nm_i->nat_cnt--; | 134 | nm_i->nat_cnt--; |
135 | kmem_cache_free(nat_entry_slab, e); | 135 | kmem_cache_free(nat_entry_slab, e); |
136 | } | 136 | } |
137 | 137 | ||
138 | int is_checkpointed_node(struct f2fs_sb_info *sbi, nid_t nid) | 138 | int is_checkpointed_node(struct f2fs_sb_info *sbi, nid_t nid) |
139 | { | 139 | { |
140 | struct f2fs_nm_info *nm_i = NM_I(sbi); | 140 | struct f2fs_nm_info *nm_i = NM_I(sbi); |
141 | struct nat_entry *e; | 141 | struct nat_entry *e; |
142 | int is_cp = 1; | 142 | int is_cp = 1; |
143 | 143 | ||
144 | read_lock(&nm_i->nat_tree_lock); | 144 | read_lock(&nm_i->nat_tree_lock); |
145 | e = __lookup_nat_cache(nm_i, nid); | 145 | e = __lookup_nat_cache(nm_i, nid); |
146 | if (e && !e->checkpointed) | 146 | if (e && !e->checkpointed) |
147 | is_cp = 0; | 147 | is_cp = 0; |
148 | read_unlock(&nm_i->nat_tree_lock); | 148 | read_unlock(&nm_i->nat_tree_lock); |
149 | return is_cp; | 149 | return is_cp; |
150 | } | 150 | } |
151 | 151 | ||
152 | static struct nat_entry *grab_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid) | 152 | static struct nat_entry *grab_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid) |
153 | { | 153 | { |
154 | struct nat_entry *new; | 154 | struct nat_entry *new; |
155 | 155 | ||
156 | new = kmem_cache_alloc(nat_entry_slab, GFP_ATOMIC); | 156 | new = kmem_cache_alloc(nat_entry_slab, GFP_ATOMIC); |
157 | if (!new) | 157 | if (!new) |
158 | return NULL; | 158 | return NULL; |
159 | if (radix_tree_insert(&nm_i->nat_root, nid, new)) { | 159 | if (radix_tree_insert(&nm_i->nat_root, nid, new)) { |
160 | kmem_cache_free(nat_entry_slab, new); | 160 | kmem_cache_free(nat_entry_slab, new); |
161 | return NULL; | 161 | return NULL; |
162 | } | 162 | } |
163 | memset(new, 0, sizeof(struct nat_entry)); | 163 | memset(new, 0, sizeof(struct nat_entry)); |
164 | nat_set_nid(new, nid); | 164 | nat_set_nid(new, nid); |
165 | list_add_tail(&new->list, &nm_i->nat_entries); | 165 | list_add_tail(&new->list, &nm_i->nat_entries); |
166 | nm_i->nat_cnt++; | 166 | nm_i->nat_cnt++; |
167 | return new; | 167 | return new; |
168 | } | 168 | } |
169 | 169 | ||
170 | static void cache_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid, | 170 | static void cache_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid, |
171 | struct f2fs_nat_entry *ne) | 171 | struct f2fs_nat_entry *ne) |
172 | { | 172 | { |
173 | struct nat_entry *e; | 173 | struct nat_entry *e; |
174 | retry: | 174 | retry: |
175 | write_lock(&nm_i->nat_tree_lock); | 175 | write_lock(&nm_i->nat_tree_lock); |
176 | e = __lookup_nat_cache(nm_i, nid); | 176 | e = __lookup_nat_cache(nm_i, nid); |
177 | if (!e) { | 177 | if (!e) { |
178 | e = grab_nat_entry(nm_i, nid); | 178 | e = grab_nat_entry(nm_i, nid); |
179 | if (!e) { | 179 | if (!e) { |
180 | write_unlock(&nm_i->nat_tree_lock); | 180 | write_unlock(&nm_i->nat_tree_lock); |
181 | goto retry; | 181 | goto retry; |
182 | } | 182 | } |
183 | nat_set_blkaddr(e, le32_to_cpu(ne->block_addr)); | 183 | nat_set_blkaddr(e, le32_to_cpu(ne->block_addr)); |
184 | nat_set_ino(e, le32_to_cpu(ne->ino)); | 184 | nat_set_ino(e, le32_to_cpu(ne->ino)); |
185 | nat_set_version(e, ne->version); | 185 | nat_set_version(e, ne->version); |
186 | e->checkpointed = true; | 186 | e->checkpointed = true; |
187 | } | 187 | } |
188 | write_unlock(&nm_i->nat_tree_lock); | 188 | write_unlock(&nm_i->nat_tree_lock); |
189 | } | 189 | } |
190 | 190 | ||
191 | static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni, | 191 | static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni, |
192 | block_t new_blkaddr) | 192 | block_t new_blkaddr) |
193 | { | 193 | { |
194 | struct f2fs_nm_info *nm_i = NM_I(sbi); | 194 | struct f2fs_nm_info *nm_i = NM_I(sbi); |
195 | struct nat_entry *e; | 195 | struct nat_entry *e; |
196 | retry: | 196 | retry: |
197 | write_lock(&nm_i->nat_tree_lock); | 197 | write_lock(&nm_i->nat_tree_lock); |
198 | e = __lookup_nat_cache(nm_i, ni->nid); | 198 | e = __lookup_nat_cache(nm_i, ni->nid); |
199 | if (!e) { | 199 | if (!e) { |
200 | e = grab_nat_entry(nm_i, ni->nid); | 200 | e = grab_nat_entry(nm_i, ni->nid); |
201 | if (!e) { | 201 | if (!e) { |
202 | write_unlock(&nm_i->nat_tree_lock); | 202 | write_unlock(&nm_i->nat_tree_lock); |
203 | goto retry; | 203 | goto retry; |
204 | } | 204 | } |
205 | e->ni = *ni; | 205 | e->ni = *ni; |
206 | e->checkpointed = true; | 206 | e->checkpointed = true; |
207 | BUG_ON(ni->blk_addr == NEW_ADDR); | 207 | BUG_ON(ni->blk_addr == NEW_ADDR); |
208 | } else if (new_blkaddr == NEW_ADDR) { | 208 | } else if (new_blkaddr == NEW_ADDR) { |
209 | /* | 209 | /* |
210 | * when nid is reallocated, | 210 | * when nid is reallocated, |
211 | * the previous nat entry may remain in the nat cache. | 211 | * the previous nat entry may remain in the nat cache. |
212 | * So, reinitialize it with new information. | 212 | * So, reinitialize it with new information. |
213 | */ | 213 | */ |
214 | e->ni = *ni; | 214 | e->ni = *ni; |
215 | BUG_ON(ni->blk_addr != NULL_ADDR); | 215 | BUG_ON(ni->blk_addr != NULL_ADDR); |
216 | } | 216 | } |
217 | 217 | ||
218 | if (new_blkaddr == NEW_ADDR) | 218 | if (new_blkaddr == NEW_ADDR) |
219 | e->checkpointed = false; | 219 | e->checkpointed = false; |
220 | 220 | ||
221 | /* sanity check */ | 221 | /* sanity check */ |
222 | BUG_ON(nat_get_blkaddr(e) != ni->blk_addr); | 222 | BUG_ON(nat_get_blkaddr(e) != ni->blk_addr); |
223 | BUG_ON(nat_get_blkaddr(e) == NULL_ADDR && | 223 | BUG_ON(nat_get_blkaddr(e) == NULL_ADDR && |
224 | new_blkaddr == NULL_ADDR); | 224 | new_blkaddr == NULL_ADDR); |
225 | BUG_ON(nat_get_blkaddr(e) == NEW_ADDR && | 225 | BUG_ON(nat_get_blkaddr(e) == NEW_ADDR && |
226 | new_blkaddr == NEW_ADDR); | 226 | new_blkaddr == NEW_ADDR); |
227 | BUG_ON(nat_get_blkaddr(e) != NEW_ADDR && | 227 | BUG_ON(nat_get_blkaddr(e) != NEW_ADDR && |
228 | nat_get_blkaddr(e) != NULL_ADDR && | 228 | nat_get_blkaddr(e) != NULL_ADDR && |
229 | new_blkaddr == NEW_ADDR); | 229 | new_blkaddr == NEW_ADDR); |
230 | 230 | ||
231 | /* increment version no. as node is removed */ | 231 | /* increment version no. as node is removed */ |
232 | if (nat_get_blkaddr(e) != NEW_ADDR && new_blkaddr == NULL_ADDR) { | 232 | if (nat_get_blkaddr(e) != NEW_ADDR && new_blkaddr == NULL_ADDR) { |
233 | unsigned char version = nat_get_version(e); | 233 | unsigned char version = nat_get_version(e); |
234 | nat_set_version(e, inc_node_version(version)); | 234 | nat_set_version(e, inc_node_version(version)); |
235 | } | 235 | } |
236 | 236 | ||
237 | /* change address */ | 237 | /* change address */ |
238 | nat_set_blkaddr(e, new_blkaddr); | 238 | nat_set_blkaddr(e, new_blkaddr); |
239 | __set_nat_cache_dirty(nm_i, e); | 239 | __set_nat_cache_dirty(nm_i, e); |
240 | write_unlock(&nm_i->nat_tree_lock); | 240 | write_unlock(&nm_i->nat_tree_lock); |
241 | } | 241 | } |
242 | 242 | ||
243 | static int try_to_free_nats(struct f2fs_sb_info *sbi, int nr_shrink) | 243 | static int try_to_free_nats(struct f2fs_sb_info *sbi, int nr_shrink) |
244 | { | 244 | { |
245 | struct f2fs_nm_info *nm_i = NM_I(sbi); | 245 | struct f2fs_nm_info *nm_i = NM_I(sbi); |
246 | 246 | ||
247 | if (nm_i->nat_cnt <= NM_WOUT_THRESHOLD) | 247 | if (nm_i->nat_cnt <= NM_WOUT_THRESHOLD) |
248 | return 0; | 248 | return 0; |
249 | 249 | ||
250 | write_lock(&nm_i->nat_tree_lock); | 250 | write_lock(&nm_i->nat_tree_lock); |
251 | while (nr_shrink && !list_empty(&nm_i->nat_entries)) { | 251 | while (nr_shrink && !list_empty(&nm_i->nat_entries)) { |
252 | struct nat_entry *ne; | 252 | struct nat_entry *ne; |
253 | ne = list_first_entry(&nm_i->nat_entries, | 253 | ne = list_first_entry(&nm_i->nat_entries, |
254 | struct nat_entry, list); | 254 | struct nat_entry, list); |
255 | __del_from_nat_cache(nm_i, ne); | 255 | __del_from_nat_cache(nm_i, ne); |
256 | nr_shrink--; | 256 | nr_shrink--; |
257 | } | 257 | } |
258 | write_unlock(&nm_i->nat_tree_lock); | 258 | write_unlock(&nm_i->nat_tree_lock); |
259 | return nr_shrink; | 259 | return nr_shrink; |
260 | } | 260 | } |
261 | 261 | ||
262 | /* | 262 | /* |
263 | * This function always returns success | 263 | * This function always returns success |
264 | */ | 264 | */ |
265 | void get_node_info(struct f2fs_sb_info *sbi, nid_t nid, struct node_info *ni) | 265 | void get_node_info(struct f2fs_sb_info *sbi, nid_t nid, struct node_info *ni) |
266 | { | 266 | { |
267 | struct f2fs_nm_info *nm_i = NM_I(sbi); | 267 | struct f2fs_nm_info *nm_i = NM_I(sbi); |
268 | struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); | 268 | struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); |
269 | struct f2fs_summary_block *sum = curseg->sum_blk; | 269 | struct f2fs_summary_block *sum = curseg->sum_blk; |
270 | nid_t start_nid = START_NID(nid); | 270 | nid_t start_nid = START_NID(nid); |
271 | struct f2fs_nat_block *nat_blk; | 271 | struct f2fs_nat_block *nat_blk; |
272 | struct page *page = NULL; | 272 | struct page *page = NULL; |
273 | struct f2fs_nat_entry ne; | 273 | struct f2fs_nat_entry ne; |
274 | struct nat_entry *e; | 274 | struct nat_entry *e; |
275 | int i; | 275 | int i; |
276 | 276 | ||
277 | memset(&ne, 0, sizeof(struct f2fs_nat_entry)); | 277 | memset(&ne, 0, sizeof(struct f2fs_nat_entry)); |
278 | ni->nid = nid; | 278 | ni->nid = nid; |
279 | 279 | ||
280 | /* Check nat cache */ | 280 | /* Check nat cache */ |
281 | read_lock(&nm_i->nat_tree_lock); | 281 | read_lock(&nm_i->nat_tree_lock); |
282 | e = __lookup_nat_cache(nm_i, nid); | 282 | e = __lookup_nat_cache(nm_i, nid); |
283 | if (e) { | 283 | if (e) { |
284 | ni->ino = nat_get_ino(e); | 284 | ni->ino = nat_get_ino(e); |
285 | ni->blk_addr = nat_get_blkaddr(e); | 285 | ni->blk_addr = nat_get_blkaddr(e); |
286 | ni->version = nat_get_version(e); | 286 | ni->version = nat_get_version(e); |
287 | } | 287 | } |
288 | read_unlock(&nm_i->nat_tree_lock); | 288 | read_unlock(&nm_i->nat_tree_lock); |
289 | if (e) | 289 | if (e) |
290 | return; | 290 | return; |
291 | 291 | ||
292 | /* Check current segment summary */ | 292 | /* Check current segment summary */ |
293 | mutex_lock(&curseg->curseg_mutex); | 293 | mutex_lock(&curseg->curseg_mutex); |
294 | i = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 0); | 294 | i = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 0); |
295 | if (i >= 0) { | 295 | if (i >= 0) { |
296 | ne = nat_in_journal(sum, i); | 296 | ne = nat_in_journal(sum, i); |
297 | node_info_from_raw_nat(ni, &ne); | 297 | node_info_from_raw_nat(ni, &ne); |
298 | } | 298 | } |
299 | mutex_unlock(&curseg->curseg_mutex); | 299 | mutex_unlock(&curseg->curseg_mutex); |
300 | if (i >= 0) | 300 | if (i >= 0) |
301 | goto cache; | 301 | goto cache; |
302 | 302 | ||
303 | /* Fill node_info from nat page */ | 303 | /* Fill node_info from nat page */ |
304 | page = get_current_nat_page(sbi, start_nid); | 304 | page = get_current_nat_page(sbi, start_nid); |
305 | nat_blk = (struct f2fs_nat_block *)page_address(page); | 305 | nat_blk = (struct f2fs_nat_block *)page_address(page); |
306 | ne = nat_blk->entries[nid - start_nid]; | 306 | ne = nat_blk->entries[nid - start_nid]; |
307 | node_info_from_raw_nat(ni, &ne); | 307 | node_info_from_raw_nat(ni, &ne); |
308 | f2fs_put_page(page, 1); | 308 | f2fs_put_page(page, 1); |
309 | cache: | 309 | cache: |
310 | /* cache nat entry */ | 310 | /* cache nat entry */ |
311 | cache_nat_entry(NM_I(sbi), nid, &ne); | 311 | cache_nat_entry(NM_I(sbi), nid, &ne); |
312 | } | 312 | } |
313 | 313 | ||
314 | /* | 314 | /* |
315 | * The maximum depth is four. | 315 | * The maximum depth is four. |
316 | * Offset[0] will have raw inode offset. | 316 | * Offset[0] will have raw inode offset. |
317 | */ | 317 | */ |
318 | static int get_node_path(struct f2fs_inode_info *fi, long block, | 318 | static int get_node_path(struct f2fs_inode_info *fi, long block, |
319 | int offset[4], unsigned int noffset[4]) | 319 | int offset[4], unsigned int noffset[4]) |
320 | { | 320 | { |
321 | const long direct_index = ADDRS_PER_INODE(fi); | 321 | const long direct_index = ADDRS_PER_INODE(fi); |
322 | const long direct_blks = ADDRS_PER_BLOCK; | 322 | const long direct_blks = ADDRS_PER_BLOCK; |
323 | const long dptrs_per_blk = NIDS_PER_BLOCK; | 323 | const long dptrs_per_blk = NIDS_PER_BLOCK; |
324 | const long indirect_blks = ADDRS_PER_BLOCK * NIDS_PER_BLOCK; | 324 | const long indirect_blks = ADDRS_PER_BLOCK * NIDS_PER_BLOCK; |
325 | const long dindirect_blks = indirect_blks * NIDS_PER_BLOCK; | 325 | const long dindirect_blks = indirect_blks * NIDS_PER_BLOCK; |
326 | int n = 0; | 326 | int n = 0; |
327 | int level = 0; | 327 | int level = 0; |
328 | 328 | ||
329 | noffset[0] = 0; | 329 | noffset[0] = 0; |
330 | 330 | ||
331 | if (block < direct_index) { | 331 | if (block < direct_index) { |
332 | offset[n] = block; | 332 | offset[n] = block; |
333 | goto got; | 333 | goto got; |
334 | } | 334 | } |
335 | block -= direct_index; | 335 | block -= direct_index; |
336 | if (block < direct_blks) { | 336 | if (block < direct_blks) { |
337 | offset[n++] = NODE_DIR1_BLOCK; | 337 | offset[n++] = NODE_DIR1_BLOCK; |
338 | noffset[n] = 1; | 338 | noffset[n] = 1; |
339 | offset[n] = block; | 339 | offset[n] = block; |
340 | level = 1; | 340 | level = 1; |
341 | goto got; | 341 | goto got; |
342 | } | 342 | } |
343 | block -= direct_blks; | 343 | block -= direct_blks; |
344 | if (block < direct_blks) { | 344 | if (block < direct_blks) { |
345 | offset[n++] = NODE_DIR2_BLOCK; | 345 | offset[n++] = NODE_DIR2_BLOCK; |
346 | noffset[n] = 2; | 346 | noffset[n] = 2; |
347 | offset[n] = block; | 347 | offset[n] = block; |
348 | level = 1; | 348 | level = 1; |
349 | goto got; | 349 | goto got; |
350 | } | 350 | } |
351 | block -= direct_blks; | 351 | block -= direct_blks; |
352 | if (block < indirect_blks) { | 352 | if (block < indirect_blks) { |
353 | offset[n++] = NODE_IND1_BLOCK; | 353 | offset[n++] = NODE_IND1_BLOCK; |
354 | noffset[n] = 3; | 354 | noffset[n] = 3; |
355 | offset[n++] = block / direct_blks; | 355 | offset[n++] = block / direct_blks; |
356 | noffset[n] = 4 + offset[n - 1]; | 356 | noffset[n] = 4 + offset[n - 1]; |
357 | offset[n] = block % direct_blks; | 357 | offset[n] = block % direct_blks; |
358 | level = 2; | 358 | level = 2; |
359 | goto got; | 359 | goto got; |
360 | } | 360 | } |
361 | block -= indirect_blks; | 361 | block -= indirect_blks; |
362 | if (block < indirect_blks) { | 362 | if (block < indirect_blks) { |
363 | offset[n++] = NODE_IND2_BLOCK; | 363 | offset[n++] = NODE_IND2_BLOCK; |
364 | noffset[n] = 4 + dptrs_per_blk; | 364 | noffset[n] = 4 + dptrs_per_blk; |
365 | offset[n++] = block / direct_blks; | 365 | offset[n++] = block / direct_blks; |
366 | noffset[n] = 5 + dptrs_per_blk + offset[n - 1]; | 366 | noffset[n] = 5 + dptrs_per_blk + offset[n - 1]; |
367 | offset[n] = block % direct_blks; | 367 | offset[n] = block % direct_blks; |
368 | level = 2; | 368 | level = 2; |
369 | goto got; | 369 | goto got; |
370 | } | 370 | } |
371 | block -= indirect_blks; | 371 | block -= indirect_blks; |
372 | if (block < dindirect_blks) { | 372 | if (block < dindirect_blks) { |
373 | offset[n++] = NODE_DIND_BLOCK; | 373 | offset[n++] = NODE_DIND_BLOCK; |
374 | noffset[n] = 5 + (dptrs_per_blk * 2); | 374 | noffset[n] = 5 + (dptrs_per_blk * 2); |
375 | offset[n++] = block / indirect_blks; | 375 | offset[n++] = block / indirect_blks; |
376 | noffset[n] = 6 + (dptrs_per_blk * 2) + | 376 | noffset[n] = 6 + (dptrs_per_blk * 2) + |
377 | offset[n - 1] * (dptrs_per_blk + 1); | 377 | offset[n - 1] * (dptrs_per_blk + 1); |
378 | offset[n++] = (block / direct_blks) % dptrs_per_blk; | 378 | offset[n++] = (block / direct_blks) % dptrs_per_blk; |
379 | noffset[n] = 7 + (dptrs_per_blk * 2) + | 379 | noffset[n] = 7 + (dptrs_per_blk * 2) + |
380 | offset[n - 2] * (dptrs_per_blk + 1) + | 380 | offset[n - 2] * (dptrs_per_blk + 1) + |
381 | offset[n - 1]; | 381 | offset[n - 1]; |
382 | offset[n] = block % direct_blks; | 382 | offset[n] = block % direct_blks; |
383 | level = 3; | 383 | level = 3; |
384 | goto got; | 384 | goto got; |
385 | } else { | 385 | } else { |
386 | BUG(); | 386 | BUG(); |
387 | } | 387 | } |
388 | got: | 388 | got: |
389 | return level; | 389 | return level; |
390 | } | 390 | } |
391 | 391 | ||
392 | /* | 392 | /* |
393 | * Caller should call f2fs_put_dnode(dn). | 393 | * Caller should call f2fs_put_dnode(dn). |
394 | * Also, it should grab and release a mutex by calling mutex_lock_op() and | 394 | * Also, it should grab and release a mutex by calling mutex_lock_op() and |
395 | * mutex_unlock_op() only if ro is not set to RDONLY_NODE. | 395 | * mutex_unlock_op() only if ro is not set to RDONLY_NODE. |
396 | * In the case of RDONLY_NODE, we don't need to care about the mutex. | 396 | * In the case of RDONLY_NODE, we don't need to care about the mutex. |
397 | */ | 397 | */ |
398 | int get_dnode_of_data(struct dnode_of_data *dn, pgoff_t index, int mode) | 398 | int get_dnode_of_data(struct dnode_of_data *dn, pgoff_t index, int mode) |
399 | { | 399 | { |
400 | struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); | 400 | struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); |
401 | struct page *npage[4]; | 401 | struct page *npage[4]; |
402 | struct page *parent; | 402 | struct page *parent; |
403 | int offset[4]; | 403 | int offset[4]; |
404 | unsigned int noffset[4]; | 404 | unsigned int noffset[4]; |
405 | nid_t nids[4]; | 405 | nid_t nids[4]; |
406 | int level, i; | 406 | int level, i; |
407 | int err = 0; | 407 | int err = 0; |
408 | 408 | ||
409 | level = get_node_path(F2FS_I(dn->inode), index, offset, noffset); | 409 | level = get_node_path(F2FS_I(dn->inode), index, offset, noffset); |
410 | 410 | ||
411 | nids[0] = dn->inode->i_ino; | 411 | nids[0] = dn->inode->i_ino; |
412 | npage[0] = dn->inode_page; | 412 | npage[0] = dn->inode_page; |
413 | 413 | ||
414 | if (!npage[0]) { | 414 | if (!npage[0]) { |
415 | npage[0] = get_node_page(sbi, nids[0]); | 415 | npage[0] = get_node_page(sbi, nids[0]); |
416 | if (IS_ERR(npage[0])) | 416 | if (IS_ERR(npage[0])) |
417 | return PTR_ERR(npage[0]); | 417 | return PTR_ERR(npage[0]); |
418 | } | 418 | } |
419 | parent = npage[0]; | 419 | parent = npage[0]; |
420 | if (level != 0) | 420 | if (level != 0) |
421 | nids[1] = get_nid(parent, offset[0], true); | 421 | nids[1] = get_nid(parent, offset[0], true); |
422 | dn->inode_page = npage[0]; | 422 | dn->inode_page = npage[0]; |
423 | dn->inode_page_locked = true; | 423 | dn->inode_page_locked = true; |
424 | 424 | ||
425 | /* get indirect or direct nodes */ | 425 | /* get indirect or direct nodes */ |
426 | for (i = 1; i <= level; i++) { | 426 | for (i = 1; i <= level; i++) { |
427 | bool done = false; | 427 | bool done = false; |
428 | 428 | ||
429 | if (!nids[i] && mode == ALLOC_NODE) { | 429 | if (!nids[i] && mode == ALLOC_NODE) { |
430 | /* alloc new node */ | 430 | /* alloc new node */ |
431 | if (!alloc_nid(sbi, &(nids[i]))) { | 431 | if (!alloc_nid(sbi, &(nids[i]))) { |
432 | err = -ENOSPC; | 432 | err = -ENOSPC; |
433 | goto release_pages; | 433 | goto release_pages; |
434 | } | 434 | } |
435 | 435 | ||
436 | dn->nid = nids[i]; | 436 | dn->nid = nids[i]; |
437 | npage[i] = new_node_page(dn, noffset[i], NULL); | 437 | npage[i] = new_node_page(dn, noffset[i], NULL); |
438 | if (IS_ERR(npage[i])) { | 438 | if (IS_ERR(npage[i])) { |
439 | alloc_nid_failed(sbi, nids[i]); | 439 | alloc_nid_failed(sbi, nids[i]); |
440 | err = PTR_ERR(npage[i]); | 440 | err = PTR_ERR(npage[i]); |
441 | goto release_pages; | 441 | goto release_pages; |
442 | } | 442 | } |
443 | 443 | ||
444 | set_nid(parent, offset[i - 1], nids[i], i == 1); | 444 | set_nid(parent, offset[i - 1], nids[i], i == 1); |
445 | alloc_nid_done(sbi, nids[i]); | 445 | alloc_nid_done(sbi, nids[i]); |
446 | done = true; | 446 | done = true; |
447 | } else if (mode == LOOKUP_NODE_RA && i == level && level > 1) { | 447 | } else if (mode == LOOKUP_NODE_RA && i == level && level > 1) { |
448 | npage[i] = get_node_page_ra(parent, offset[i - 1]); | 448 | npage[i] = get_node_page_ra(parent, offset[i - 1]); |
449 | if (IS_ERR(npage[i])) { | 449 | if (IS_ERR(npage[i])) { |
450 | err = PTR_ERR(npage[i]); | 450 | err = PTR_ERR(npage[i]); |
451 | goto release_pages; | 451 | goto release_pages; |
452 | } | 452 | } |
453 | done = true; | 453 | done = true; |
454 | } | 454 | } |
455 | if (i == 1) { | 455 | if (i == 1) { |
456 | dn->inode_page_locked = false; | 456 | dn->inode_page_locked = false; |
457 | unlock_page(parent); | 457 | unlock_page(parent); |
458 | } else { | 458 | } else { |
459 | f2fs_put_page(parent, 1); | 459 | f2fs_put_page(parent, 1); |
460 | } | 460 | } |
461 | 461 | ||
462 | if (!done) { | 462 | if (!done) { |
463 | npage[i] = get_node_page(sbi, nids[i]); | 463 | npage[i] = get_node_page(sbi, nids[i]); |
464 | if (IS_ERR(npage[i])) { | 464 | if (IS_ERR(npage[i])) { |
465 | err = PTR_ERR(npage[i]); | 465 | err = PTR_ERR(npage[i]); |
466 | f2fs_put_page(npage[0], 0); | 466 | f2fs_put_page(npage[0], 0); |
467 | goto release_out; | 467 | goto release_out; |
468 | } | 468 | } |
469 | } | 469 | } |
470 | if (i < level) { | 470 | if (i < level) { |
471 | parent = npage[i]; | 471 | parent = npage[i]; |
472 | nids[i + 1] = get_nid(parent, offset[i], false); | 472 | nids[i + 1] = get_nid(parent, offset[i], false); |
473 | } | 473 | } |
474 | } | 474 | } |
475 | dn->nid = nids[level]; | 475 | dn->nid = nids[level]; |
476 | dn->ofs_in_node = offset[level]; | 476 | dn->ofs_in_node = offset[level]; |
477 | dn->node_page = npage[level]; | 477 | dn->node_page = npage[level]; |
478 | dn->data_blkaddr = datablock_addr(dn->node_page, dn->ofs_in_node); | 478 | dn->data_blkaddr = datablock_addr(dn->node_page, dn->ofs_in_node); |
479 | return 0; | 479 | return 0; |
480 | 480 | ||
481 | release_pages: | 481 | release_pages: |
482 | f2fs_put_page(parent, 1); | 482 | f2fs_put_page(parent, 1); |
483 | if (i > 1) | 483 | if (i > 1) |
484 | f2fs_put_page(npage[0], 0); | 484 | f2fs_put_page(npage[0], 0); |
485 | release_out: | 485 | release_out: |
486 | dn->inode_page = NULL; | 486 | dn->inode_page = NULL; |
487 | dn->node_page = NULL; | 487 | dn->node_page = NULL; |
488 | return err; | 488 | return err; |
489 | } | 489 | } |
490 | 490 | ||
491 | static void truncate_node(struct dnode_of_data *dn) | 491 | static void truncate_node(struct dnode_of_data *dn) |
492 | { | 492 | { |
493 | struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); | 493 | struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); |
494 | struct node_info ni; | 494 | struct node_info ni; |
495 | 495 | ||
496 | get_node_info(sbi, dn->nid, &ni); | 496 | get_node_info(sbi, dn->nid, &ni); |
497 | if (dn->inode->i_blocks == 0) { | 497 | if (dn->inode->i_blocks == 0) { |
498 | BUG_ON(ni.blk_addr != NULL_ADDR); | 498 | BUG_ON(ni.blk_addr != NULL_ADDR); |
499 | goto invalidate; | 499 | goto invalidate; |
500 | } | 500 | } |
501 | BUG_ON(ni.blk_addr == NULL_ADDR); | 501 | BUG_ON(ni.blk_addr == NULL_ADDR); |
502 | 502 | ||
503 | /* Deallocate node address */ | 503 | /* Deallocate node address */ |
504 | invalidate_blocks(sbi, ni.blk_addr); | 504 | invalidate_blocks(sbi, ni.blk_addr); |
505 | dec_valid_node_count(sbi, dn->inode, 1); | 505 | dec_valid_node_count(sbi, dn->inode, 1); |
506 | set_node_addr(sbi, &ni, NULL_ADDR); | 506 | set_node_addr(sbi, &ni, NULL_ADDR); |
507 | 507 | ||
508 | if (dn->nid == dn->inode->i_ino) { | 508 | if (dn->nid == dn->inode->i_ino) { |
509 | remove_orphan_inode(sbi, dn->nid); | 509 | remove_orphan_inode(sbi, dn->nid); |
510 | dec_valid_inode_count(sbi); | 510 | dec_valid_inode_count(sbi); |
511 | } else { | 511 | } else { |
512 | sync_inode_page(dn); | 512 | sync_inode_page(dn); |
513 | } | 513 | } |
514 | invalidate: | 514 | invalidate: |
515 | clear_node_page_dirty(dn->node_page); | 515 | clear_node_page_dirty(dn->node_page); |
516 | F2FS_SET_SB_DIRT(sbi); | 516 | F2FS_SET_SB_DIRT(sbi); |
517 | 517 | ||
518 | f2fs_put_page(dn->node_page, 1); | 518 | f2fs_put_page(dn->node_page, 1); |
519 | dn->node_page = NULL; | 519 | dn->node_page = NULL; |
520 | trace_f2fs_truncate_node(dn->inode, dn->nid, ni.blk_addr); | 520 | trace_f2fs_truncate_node(dn->inode, dn->nid, ni.blk_addr); |
521 | } | 521 | } |
522 | 522 | ||
523 | static int truncate_dnode(struct dnode_of_data *dn) | 523 | static int truncate_dnode(struct dnode_of_data *dn) |
524 | { | 524 | { |
525 | struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); | 525 | struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); |
526 | struct page *page; | 526 | struct page *page; |
527 | 527 | ||
528 | if (dn->nid == 0) | 528 | if (dn->nid == 0) |
529 | return 1; | 529 | return 1; |
530 | 530 | ||
531 | /* get direct node */ | 531 | /* get direct node */ |
532 | page = get_node_page(sbi, dn->nid); | 532 | page = get_node_page(sbi, dn->nid); |
533 | if (IS_ERR(page) && PTR_ERR(page) == -ENOENT) | 533 | if (IS_ERR(page) && PTR_ERR(page) == -ENOENT) |
534 | return 1; | 534 | return 1; |
535 | else if (IS_ERR(page)) | 535 | else if (IS_ERR(page)) |
536 | return PTR_ERR(page); | 536 | return PTR_ERR(page); |
537 | 537 | ||
538 | /* Make dnode_of_data for parameter */ | 538 | /* Make dnode_of_data for parameter */ |
539 | dn->node_page = page; | 539 | dn->node_page = page; |
540 | dn->ofs_in_node = 0; | 540 | dn->ofs_in_node = 0; |
541 | truncate_data_blocks(dn); | 541 | truncate_data_blocks(dn); |
542 | truncate_node(dn); | 542 | truncate_node(dn); |
543 | return 1; | 543 | return 1; |
544 | } | 544 | } |
545 | 545 | ||
546 | static int truncate_nodes(struct dnode_of_data *dn, unsigned int nofs, | 546 | static int truncate_nodes(struct dnode_of_data *dn, unsigned int nofs, |
547 | int ofs, int depth) | 547 | int ofs, int depth) |
548 | { | 548 | { |
549 | struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); | 549 | struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); |
550 | struct dnode_of_data rdn = *dn; | 550 | struct dnode_of_data rdn = *dn; |
551 | struct page *page; | 551 | struct page *page; |
552 | struct f2fs_node *rn; | 552 | struct f2fs_node *rn; |
553 | nid_t child_nid; | 553 | nid_t child_nid; |
554 | unsigned int child_nofs; | 554 | unsigned int child_nofs; |
555 | int freed = 0; | 555 | int freed = 0; |
556 | int i, ret; | 556 | int i, ret; |
557 | 557 | ||
558 | if (dn->nid == 0) | 558 | if (dn->nid == 0) |
559 | return NIDS_PER_BLOCK + 1; | 559 | return NIDS_PER_BLOCK + 1; |
560 | 560 | ||
561 | trace_f2fs_truncate_nodes_enter(dn->inode, dn->nid, dn->data_blkaddr); | 561 | trace_f2fs_truncate_nodes_enter(dn->inode, dn->nid, dn->data_blkaddr); |
562 | 562 | ||
563 | page = get_node_page(sbi, dn->nid); | 563 | page = get_node_page(sbi, dn->nid); |
564 | if (IS_ERR(page)) { | 564 | if (IS_ERR(page)) { |
565 | trace_f2fs_truncate_nodes_exit(dn->inode, PTR_ERR(page)); | 565 | trace_f2fs_truncate_nodes_exit(dn->inode, PTR_ERR(page)); |
566 | return PTR_ERR(page); | 566 | return PTR_ERR(page); |
567 | } | 567 | } |
568 | 568 | ||
569 | rn = F2FS_NODE(page); | 569 | rn = F2FS_NODE(page); |
570 | if (depth < 3) { | 570 | if (depth < 3) { |
571 | for (i = ofs; i < NIDS_PER_BLOCK; i++, freed++) { | 571 | for (i = ofs; i < NIDS_PER_BLOCK; i++, freed++) { |
572 | child_nid = le32_to_cpu(rn->in.nid[i]); | 572 | child_nid = le32_to_cpu(rn->in.nid[i]); |
573 | if (child_nid == 0) | 573 | if (child_nid == 0) |
574 | continue; | 574 | continue; |
575 | rdn.nid = child_nid; | 575 | rdn.nid = child_nid; |
576 | ret = truncate_dnode(&rdn); | 576 | ret = truncate_dnode(&rdn); |
577 | if (ret < 0) | 577 | if (ret < 0) |
578 | goto out_err; | 578 | goto out_err; |
579 | set_nid(page, i, 0, false); | 579 | set_nid(page, i, 0, false); |
580 | } | 580 | } |
581 | } else { | 581 | } else { |
582 | child_nofs = nofs + ofs * (NIDS_PER_BLOCK + 1) + 1; | 582 | child_nofs = nofs + ofs * (NIDS_PER_BLOCK + 1) + 1; |
583 | for (i = ofs; i < NIDS_PER_BLOCK; i++) { | 583 | for (i = ofs; i < NIDS_PER_BLOCK; i++) { |
584 | child_nid = le32_to_cpu(rn->in.nid[i]); | 584 | child_nid = le32_to_cpu(rn->in.nid[i]); |
585 | if (child_nid == 0) { | 585 | if (child_nid == 0) { |
586 | child_nofs += NIDS_PER_BLOCK + 1; | 586 | child_nofs += NIDS_PER_BLOCK + 1; |
587 | continue; | 587 | continue; |
588 | } | 588 | } |
589 | rdn.nid = child_nid; | 589 | rdn.nid = child_nid; |
590 | ret = truncate_nodes(&rdn, child_nofs, 0, depth - 1); | 590 | ret = truncate_nodes(&rdn, child_nofs, 0, depth - 1); |
591 | if (ret == (NIDS_PER_BLOCK + 1)) { | 591 | if (ret == (NIDS_PER_BLOCK + 1)) { |
592 | set_nid(page, i, 0, false); | 592 | set_nid(page, i, 0, false); |
593 | child_nofs += ret; | 593 | child_nofs += ret; |
594 | } else if (ret < 0 && ret != -ENOENT) { | 594 | } else if (ret < 0 && ret != -ENOENT) { |
595 | goto out_err; | 595 | goto out_err; |
596 | } | 596 | } |
597 | } | 597 | } |
598 | freed = child_nofs; | 598 | freed = child_nofs; |
599 | } | 599 | } |
600 | 600 | ||
601 | if (!ofs) { | 601 | if (!ofs) { |
602 | /* remove current indirect node */ | 602 | /* remove current indirect node */ |
603 | dn->node_page = page; | 603 | dn->node_page = page; |
604 | truncate_node(dn); | 604 | truncate_node(dn); |
605 | freed++; | 605 | freed++; |
606 | } else { | 606 | } else { |
607 | f2fs_put_page(page, 1); | 607 | f2fs_put_page(page, 1); |
608 | } | 608 | } |
609 | trace_f2fs_truncate_nodes_exit(dn->inode, freed); | 609 | trace_f2fs_truncate_nodes_exit(dn->inode, freed); |
610 | return freed; | 610 | return freed; |
611 | 611 | ||
612 | out_err: | 612 | out_err: |
613 | f2fs_put_page(page, 1); | 613 | f2fs_put_page(page, 1); |
614 | trace_f2fs_truncate_nodes_exit(dn->inode, ret); | 614 | trace_f2fs_truncate_nodes_exit(dn->inode, ret); |
615 | return ret; | 615 | return ret; |
616 | } | 616 | } |
617 | 617 | ||
618 | static int truncate_partial_nodes(struct dnode_of_data *dn, | 618 | static int truncate_partial_nodes(struct dnode_of_data *dn, |
619 | struct f2fs_inode *ri, int *offset, int depth) | 619 | struct f2fs_inode *ri, int *offset, int depth) |
620 | { | 620 | { |
621 | struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); | 621 | struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); |
622 | struct page *pages[2]; | 622 | struct page *pages[2]; |
623 | nid_t nid[3]; | 623 | nid_t nid[3]; |
624 | nid_t child_nid; | 624 | nid_t child_nid; |
625 | int err = 0; | 625 | int err = 0; |
626 | int i; | 626 | int i; |
627 | int idx = depth - 2; | 627 | int idx = depth - 2; |
628 | 628 | ||
629 | nid[0] = le32_to_cpu(ri->i_nid[offset[0] - NODE_DIR1_BLOCK]); | 629 | nid[0] = le32_to_cpu(ri->i_nid[offset[0] - NODE_DIR1_BLOCK]); |
630 | if (!nid[0]) | 630 | if (!nid[0]) |
631 | return 0; | 631 | return 0; |
632 | 632 | ||
633 | /* get indirect nodes in the path */ | 633 | /* get indirect nodes in the path */ |
634 | for (i = 0; i < depth - 1; i++) { | 634 | for (i = 0; i < depth - 1; i++) { |
635 | /* reference count will be increased */ | 635 | /* reference count will be increased */ |
636 | pages[i] = get_node_page(sbi, nid[i]); | 636 | pages[i] = get_node_page(sbi, nid[i]); |
637 | if (IS_ERR(pages[i])) { | 637 | if (IS_ERR(pages[i])) { |
638 | depth = i + 1; | 638 | depth = i + 1; |
639 | err = PTR_ERR(pages[i]); | 639 | err = PTR_ERR(pages[i]); |
640 | goto fail; | 640 | goto fail; |
641 | } | 641 | } |
642 | nid[i + 1] = get_nid(pages[i], offset[i + 1], false); | 642 | nid[i + 1] = get_nid(pages[i], offset[i + 1], false); |
643 | } | 643 | } |
644 | 644 | ||
645 | /* free direct nodes linked to a partial indirect node */ | 645 | /* free direct nodes linked to a partial indirect node */ |
646 | for (i = offset[depth - 1]; i < NIDS_PER_BLOCK; i++) { | 646 | for (i = offset[depth - 1]; i < NIDS_PER_BLOCK; i++) { |
647 | child_nid = get_nid(pages[idx], i, false); | 647 | child_nid = get_nid(pages[idx], i, false); |
648 | if (!child_nid) | 648 | if (!child_nid) |
649 | continue; | 649 | continue; |
650 | dn->nid = child_nid; | 650 | dn->nid = child_nid; |
651 | err = truncate_dnode(dn); | 651 | err = truncate_dnode(dn); |
652 | if (err < 0) | 652 | if (err < 0) |
653 | goto fail; | 653 | goto fail; |
654 | set_nid(pages[idx], i, 0, false); | 654 | set_nid(pages[idx], i, 0, false); |
655 | } | 655 | } |
656 | 656 | ||
657 | if (offset[depth - 1] == 0) { | 657 | if (offset[depth - 1] == 0) { |
658 | dn->node_page = pages[idx]; | 658 | dn->node_page = pages[idx]; |
659 | dn->nid = nid[idx]; | 659 | dn->nid = nid[idx]; |
660 | truncate_node(dn); | 660 | truncate_node(dn); |
661 | } else { | 661 | } else { |
662 | f2fs_put_page(pages[idx], 1); | 662 | f2fs_put_page(pages[idx], 1); |
663 | } | 663 | } |
664 | offset[idx]++; | 664 | offset[idx]++; |
665 | offset[depth - 1] = 0; | 665 | offset[depth - 1] = 0; |
666 | fail: | 666 | fail: |
667 | for (i = depth - 3; i >= 0; i--) | 667 | for (i = depth - 3; i >= 0; i--) |
668 | f2fs_put_page(pages[i], 1); | 668 | f2fs_put_page(pages[i], 1); |
669 | 669 | ||
670 | trace_f2fs_truncate_partial_nodes(dn->inode, nid, depth, err); | 670 | trace_f2fs_truncate_partial_nodes(dn->inode, nid, depth, err); |
671 | 671 | ||
672 | return err; | 672 | return err; |
673 | } | 673 | } |
674 | 674 | ||
675 | /* | 675 | /* |
676 | * All the block addresses of data and nodes should be nullified. | 676 | * All the block addresses of data and nodes should be nullified. |
677 | */ | 677 | */ |
678 | int truncate_inode_blocks(struct inode *inode, pgoff_t from) | 678 | int truncate_inode_blocks(struct inode *inode, pgoff_t from) |
679 | { | 679 | { |
680 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); | 680 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); |
681 | struct address_space *node_mapping = sbi->node_inode->i_mapping; | 681 | struct address_space *node_mapping = sbi->node_inode->i_mapping; |
682 | int err = 0, cont = 1; | 682 | int err = 0, cont = 1; |
683 | int level, offset[4], noffset[4]; | 683 | int level, offset[4], noffset[4]; |
684 | unsigned int nofs = 0; | 684 | unsigned int nofs = 0; |
685 | struct f2fs_node *rn; | 685 | struct f2fs_node *rn; |
686 | struct dnode_of_data dn; | 686 | struct dnode_of_data dn; |
687 | struct page *page; | 687 | struct page *page; |
688 | 688 | ||
689 | trace_f2fs_truncate_inode_blocks_enter(inode, from); | 689 | trace_f2fs_truncate_inode_blocks_enter(inode, from); |
690 | 690 | ||
691 | level = get_node_path(F2FS_I(inode), from, offset, noffset); | 691 | level = get_node_path(F2FS_I(inode), from, offset, noffset); |
692 | restart: | 692 | restart: |
693 | page = get_node_page(sbi, inode->i_ino); | 693 | page = get_node_page(sbi, inode->i_ino); |
694 | if (IS_ERR(page)) { | 694 | if (IS_ERR(page)) { |
695 | trace_f2fs_truncate_inode_blocks_exit(inode, PTR_ERR(page)); | 695 | trace_f2fs_truncate_inode_blocks_exit(inode, PTR_ERR(page)); |
696 | return PTR_ERR(page); | 696 | return PTR_ERR(page); |
697 | } | 697 | } |
698 | 698 | ||
699 | set_new_dnode(&dn, inode, page, NULL, 0); | 699 | set_new_dnode(&dn, inode, page, NULL, 0); |
700 | unlock_page(page); | 700 | unlock_page(page); |
701 | 701 | ||
702 | rn = F2FS_NODE(page); | 702 | rn = F2FS_NODE(page); |
703 | switch (level) { | 703 | switch (level) { |
704 | case 0: | 704 | case 0: |
705 | case 1: | 705 | case 1: |
706 | nofs = noffset[1]; | 706 | nofs = noffset[1]; |
707 | break; | 707 | break; |
708 | case 2: | 708 | case 2: |
709 | nofs = noffset[1]; | 709 | nofs = noffset[1]; |
710 | if (!offset[level - 1]) | 710 | if (!offset[level - 1]) |
711 | goto skip_partial; | 711 | goto skip_partial; |
712 | err = truncate_partial_nodes(&dn, &rn->i, offset, level); | 712 | err = truncate_partial_nodes(&dn, &rn->i, offset, level); |
713 | if (err < 0 && err != -ENOENT) | 713 | if (err < 0 && err != -ENOENT) |
714 | goto fail; | 714 | goto fail; |
715 | nofs += 1 + NIDS_PER_BLOCK; | 715 | nofs += 1 + NIDS_PER_BLOCK; |
716 | break; | 716 | break; |
717 | case 3: | 717 | case 3: |
718 | nofs = 5 + 2 * NIDS_PER_BLOCK; | 718 | nofs = 5 + 2 * NIDS_PER_BLOCK; |
719 | if (!offset[level - 1]) | 719 | if (!offset[level - 1]) |
720 | goto skip_partial; | 720 | goto skip_partial; |
721 | err = truncate_partial_nodes(&dn, &rn->i, offset, level); | 721 | err = truncate_partial_nodes(&dn, &rn->i, offset, level); |
722 | if (err < 0 && err != -ENOENT) | 722 | if (err < 0 && err != -ENOENT) |
723 | goto fail; | 723 | goto fail; |
724 | break; | 724 | break; |
725 | default: | 725 | default: |
726 | BUG(); | 726 | BUG(); |
727 | } | 727 | } |
728 | 728 | ||
729 | skip_partial: | 729 | skip_partial: |
730 | while (cont) { | 730 | while (cont) { |
731 | dn.nid = le32_to_cpu(rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]); | 731 | dn.nid = le32_to_cpu(rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]); |
732 | switch (offset[0]) { | 732 | switch (offset[0]) { |
733 | case NODE_DIR1_BLOCK: | 733 | case NODE_DIR1_BLOCK: |
734 | case NODE_DIR2_BLOCK: | 734 | case NODE_DIR2_BLOCK: |
735 | err = truncate_dnode(&dn); | 735 | err = truncate_dnode(&dn); |
736 | break; | 736 | break; |
737 | 737 | ||
738 | case NODE_IND1_BLOCK: | 738 | case NODE_IND1_BLOCK: |
739 | case NODE_IND2_BLOCK: | 739 | case NODE_IND2_BLOCK: |
740 | err = truncate_nodes(&dn, nofs, offset[1], 2); | 740 | err = truncate_nodes(&dn, nofs, offset[1], 2); |
741 | break; | 741 | break; |
742 | 742 | ||
743 | case NODE_DIND_BLOCK: | 743 | case NODE_DIND_BLOCK: |
744 | err = truncate_nodes(&dn, nofs, offset[1], 3); | 744 | err = truncate_nodes(&dn, nofs, offset[1], 3); |
745 | cont = 0; | 745 | cont = 0; |
746 | break; | 746 | break; |
747 | 747 | ||
748 | default: | 748 | default: |
749 | BUG(); | 749 | BUG(); |
750 | } | 750 | } |
751 | if (err < 0 && err != -ENOENT) | 751 | if (err < 0 && err != -ENOENT) |
752 | goto fail; | 752 | goto fail; |
753 | if (offset[1] == 0 && | 753 | if (offset[1] == 0 && |
754 | rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]) { | 754 | rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]) { |
755 | lock_page(page); | 755 | lock_page(page); |
756 | if (page->mapping != node_mapping) { | 756 | if (page->mapping != node_mapping) { |
757 | f2fs_put_page(page, 1); | 757 | f2fs_put_page(page, 1); |
758 | goto restart; | 758 | goto restart; |
759 | } | 759 | } |
760 | wait_on_page_writeback(page); | 760 | wait_on_page_writeback(page); |
761 | rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK] = 0; | 761 | rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK] = 0; |
762 | set_page_dirty(page); | 762 | set_page_dirty(page); |
763 | unlock_page(page); | 763 | unlock_page(page); |
764 | } | 764 | } |
765 | offset[1] = 0; | 765 | offset[1] = 0; |
766 | offset[0]++; | 766 | offset[0]++; |
767 | nofs += err; | 767 | nofs += err; |
768 | } | 768 | } |
769 | fail: | 769 | fail: |
770 | f2fs_put_page(page, 0); | 770 | f2fs_put_page(page, 0); |
771 | trace_f2fs_truncate_inode_blocks_exit(inode, err); | 771 | trace_f2fs_truncate_inode_blocks_exit(inode, err); |
772 | return err > 0 ? 0 : err; | 772 | return err > 0 ? 0 : err; |
773 | } | 773 | } |
774 | 774 | ||
775 | int truncate_xattr_node(struct inode *inode, struct page *page) | 775 | int truncate_xattr_node(struct inode *inode, struct page *page) |
776 | { | 776 | { |
777 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); | 777 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); |
778 | nid_t nid = F2FS_I(inode)->i_xattr_nid; | 778 | nid_t nid = F2FS_I(inode)->i_xattr_nid; |
779 | struct dnode_of_data dn; | 779 | struct dnode_of_data dn; |
780 | struct page *npage; | 780 | struct page *npage; |
781 | 781 | ||
782 | if (!nid) | 782 | if (!nid) |
783 | return 0; | 783 | return 0; |
784 | 784 | ||
785 | npage = get_node_page(sbi, nid); | 785 | npage = get_node_page(sbi, nid); |
786 | if (IS_ERR(npage)) | 786 | if (IS_ERR(npage)) |
787 | return PTR_ERR(npage); | 787 | return PTR_ERR(npage); |
788 | 788 | ||
789 | F2FS_I(inode)->i_xattr_nid = 0; | 789 | F2FS_I(inode)->i_xattr_nid = 0; |
790 | 790 | ||
791 | /* need to do checkpoint during fsync */ | 791 | /* need to do checkpoint during fsync */ |
792 | F2FS_I(inode)->xattr_ver = cur_cp_version(F2FS_CKPT(sbi)); | 792 | F2FS_I(inode)->xattr_ver = cur_cp_version(F2FS_CKPT(sbi)); |
793 | 793 | ||
794 | set_new_dnode(&dn, inode, page, npage, nid); | 794 | set_new_dnode(&dn, inode, page, npage, nid); |
795 | 795 | ||
796 | if (page) | 796 | if (page) |
797 | dn.inode_page_locked = 1; | 797 | dn.inode_page_locked = 1; |
798 | truncate_node(&dn); | 798 | truncate_node(&dn); |
799 | return 0; | 799 | return 0; |
800 | } | 800 | } |
801 | 801 | ||
802 | /* | 802 | /* |
803 | * Caller should grab and release a mutex by calling mutex_lock_op() and | 803 | * Caller should grab and release a mutex by calling mutex_lock_op() and |
804 | * mutex_unlock_op(). | 804 | * mutex_unlock_op(). |
805 | */ | 805 | */ |
806 | int remove_inode_page(struct inode *inode) | 806 | int remove_inode_page(struct inode *inode) |
807 | { | 807 | { |
808 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); | 808 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); |
809 | struct page *page; | 809 | struct page *page; |
810 | nid_t ino = inode->i_ino; | 810 | nid_t ino = inode->i_ino; |
811 | struct dnode_of_data dn; | 811 | struct dnode_of_data dn; |
812 | int err; | 812 | int err; |
813 | 813 | ||
814 | page = get_node_page(sbi, ino); | 814 | page = get_node_page(sbi, ino); |
815 | if (IS_ERR(page)) | 815 | if (IS_ERR(page)) |
816 | return PTR_ERR(page); | 816 | return PTR_ERR(page); |
817 | 817 | ||
818 | err = truncate_xattr_node(inode, page); | 818 | err = truncate_xattr_node(inode, page); |
819 | if (err) { | 819 | if (err) { |
820 | f2fs_put_page(page, 1); | 820 | f2fs_put_page(page, 1); |
821 | return err; | 821 | return err; |
822 | } | 822 | } |
823 | 823 | ||
824 | /* 0 is possible after f2fs_new_inode() has failed */ | 824 | /* 0 is possible after f2fs_new_inode() has failed */ |
825 | BUG_ON(inode->i_blocks != 0 && inode->i_blocks != 1); | 825 | BUG_ON(inode->i_blocks != 0 && inode->i_blocks != 1); |
826 | set_new_dnode(&dn, inode, page, page, ino); | 826 | set_new_dnode(&dn, inode, page, page, ino); |
827 | truncate_node(&dn); | 827 | truncate_node(&dn); |
828 | return 0; | 828 | return 0; |
829 | } | 829 | } |
830 | 830 | ||
831 | struct page *new_inode_page(struct inode *inode, const struct qstr *name) | 831 | struct page *new_inode_page(struct inode *inode, const struct qstr *name) |
832 | { | 832 | { |
833 | struct dnode_of_data dn; | 833 | struct dnode_of_data dn; |
834 | 834 | ||
835 | /* allocate inode page for new inode */ | 835 | /* allocate inode page for new inode */ |
836 | set_new_dnode(&dn, inode, NULL, NULL, inode->i_ino); | 836 | set_new_dnode(&dn, inode, NULL, NULL, inode->i_ino); |
837 | 837 | ||
838 | /* caller should f2fs_put_page(page, 1); */ | 838 | /* caller should f2fs_put_page(page, 1); */ |
839 | return new_node_page(&dn, 0, NULL); | 839 | return new_node_page(&dn, 0, NULL); |
840 | } | 840 | } |
841 | 841 | ||
842 | struct page *new_node_page(struct dnode_of_data *dn, | 842 | struct page *new_node_page(struct dnode_of_data *dn, |
843 | unsigned int ofs, struct page *ipage) | 843 | unsigned int ofs, struct page *ipage) |
844 | { | 844 | { |
845 | struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); | 845 | struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); |
846 | struct address_space *mapping = sbi->node_inode->i_mapping; | 846 | struct address_space *mapping = sbi->node_inode->i_mapping; |
847 | struct node_info old_ni, new_ni; | 847 | struct node_info old_ni, new_ni; |
848 | struct page *page; | 848 | struct page *page; |
849 | int err; | 849 | int err; |
850 | 850 | ||
851 | if (is_inode_flag_set(F2FS_I(dn->inode), FI_NO_ALLOC)) | 851 | if (is_inode_flag_set(F2FS_I(dn->inode), FI_NO_ALLOC)) |
852 | return ERR_PTR(-EPERM); | 852 | return ERR_PTR(-EPERM); |
853 | 853 | ||
854 | page = grab_cache_page(mapping, dn->nid); | 854 | page = grab_cache_page(mapping, dn->nid); |
855 | if (!page) | 855 | if (!page) |
856 | return ERR_PTR(-ENOMEM); | 856 | return ERR_PTR(-ENOMEM); |
857 | 857 | ||
858 | if (!inc_valid_node_count(sbi, dn->inode, 1)) { | 858 | if (!inc_valid_node_count(sbi, dn->inode, 1)) { |
859 | err = -ENOSPC; | 859 | err = -ENOSPC; |
860 | goto fail; | 860 | goto fail; |
861 | } | 861 | } |
862 | 862 | ||
863 | get_node_info(sbi, dn->nid, &old_ni); | 863 | get_node_info(sbi, dn->nid, &old_ni); |
864 | 864 | ||
865 | /* Reinitialize old_ni with new node page */ | 865 | /* Reinitialize old_ni with new node page */ |
866 | BUG_ON(old_ni.blk_addr != NULL_ADDR); | 866 | BUG_ON(old_ni.blk_addr != NULL_ADDR); |
867 | new_ni = old_ni; | 867 | new_ni = old_ni; |
868 | new_ni.ino = dn->inode->i_ino; | 868 | new_ni.ino = dn->inode->i_ino; |
869 | set_node_addr(sbi, &new_ni, NEW_ADDR); | 869 | set_node_addr(sbi, &new_ni, NEW_ADDR); |
870 | 870 | ||
871 | fill_node_footer(page, dn->nid, dn->inode->i_ino, ofs, true); | 871 | fill_node_footer(page, dn->nid, dn->inode->i_ino, ofs, true); |
872 | set_cold_node(dn->inode, page); | 872 | set_cold_node(dn->inode, page); |
873 | SetPageUptodate(page); | 873 | SetPageUptodate(page); |
874 | set_page_dirty(page); | 874 | set_page_dirty(page); |
875 | 875 | ||
876 | if (ofs == XATTR_NODE_OFFSET) | 876 | if (ofs == XATTR_NODE_OFFSET) |
877 | F2FS_I(dn->inode)->i_xattr_nid = dn->nid; | 877 | F2FS_I(dn->inode)->i_xattr_nid = dn->nid; |
878 | 878 | ||
879 | dn->node_page = page; | 879 | dn->node_page = page; |
880 | if (ipage) | 880 | if (ipage) |
881 | update_inode(dn->inode, ipage); | 881 | update_inode(dn->inode, ipage); |
882 | else | 882 | else |
883 | sync_inode_page(dn); | 883 | sync_inode_page(dn); |
884 | if (ofs == 0) | 884 | if (ofs == 0) |
885 | inc_valid_inode_count(sbi); | 885 | inc_valid_inode_count(sbi); |
886 | 886 | ||
887 | return page; | 887 | return page; |
888 | 888 | ||
889 | fail: | 889 | fail: |
890 | clear_node_page_dirty(page); | 890 | clear_node_page_dirty(page); |
891 | f2fs_put_page(page, 1); | 891 | f2fs_put_page(page, 1); |
892 | return ERR_PTR(err); | 892 | return ERR_PTR(err); |
893 | } | 893 | } |
894 | 894 | ||
895 | /* | 895 | /* |
896 | * Caller should do the following after getting each return value: | 896 | * Caller should do the following after getting each return value: |
897 | * 0: f2fs_put_page(page, 0) | 897 | * 0: f2fs_put_page(page, 0) |
898 | * LOCKED_PAGE: f2fs_put_page(page, 1) | 898 | * LOCKED_PAGE: f2fs_put_page(page, 1) |
899 | * error: nothing | 899 | * error: nothing |
900 | */ | 900 | */ |
901 | static int read_node_page(struct page *page, int type) | 901 | static int read_node_page(struct page *page, int type) |
902 | { | 902 | { |
903 | struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb); | 903 | struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb); |
904 | struct node_info ni; | 904 | struct node_info ni; |
905 | 905 | ||
906 | get_node_info(sbi, page->index, &ni); | 906 | get_node_info(sbi, page->index, &ni); |
907 | 907 | ||
908 | if (ni.blk_addr == NULL_ADDR) { | 908 | if (ni.blk_addr == NULL_ADDR) { |
909 | f2fs_put_page(page, 1); | 909 | f2fs_put_page(page, 1); |
910 | return -ENOENT; | 910 | return -ENOENT; |
911 | } | 911 | } |
912 | 912 | ||
913 | if (PageUptodate(page)) | 913 | if (PageUptodate(page)) |
914 | return LOCKED_PAGE; | 914 | return LOCKED_PAGE; |
915 | 915 | ||
916 | return f2fs_readpage(sbi, page, ni.blk_addr, type); | 916 | return f2fs_readpage(sbi, page, ni.blk_addr, type); |
917 | } | 917 | } |
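
The comment above read_node_page() spells out a contract rather than enforcing it. As a minimal sketch (variable names are illustrative only), a caller is expected to react to the three outcomes like this, exactly as ra_node_page() and get_node_page() below do:

	err = read_node_page(page, READA);
	if (err == 0)
		f2fs_put_page(page, 0);		/* 0: put the page without unlocking it */
	else if (err == LOCKED_PAGE)
		f2fs_put_page(page, 1);		/* LOCKED_PAGE: page was already uptodate; unlock and put */
	/* err < 0: nothing to do here, per the contract above */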
918 | 918 | ||
919 | /* | 919 | /* |
920 | * Readahead a node page | 920 | * Readahead a node page |
921 | */ | 921 | */ |
922 | void ra_node_page(struct f2fs_sb_info *sbi, nid_t nid) | 922 | void ra_node_page(struct f2fs_sb_info *sbi, nid_t nid) |
923 | { | 923 | { |
924 | struct address_space *mapping = sbi->node_inode->i_mapping; | 924 | struct address_space *mapping = sbi->node_inode->i_mapping; |
925 | struct page *apage; | 925 | struct page *apage; |
926 | int err; | 926 | int err; |
927 | 927 | ||
928 | apage = find_get_page(mapping, nid); | 928 | apage = find_get_page(mapping, nid); |
929 | if (apage && PageUptodate(apage)) { | 929 | if (apage && PageUptodate(apage)) { |
930 | f2fs_put_page(apage, 0); | 930 | f2fs_put_page(apage, 0); |
931 | return; | 931 | return; |
932 | } | 932 | } |
933 | f2fs_put_page(apage, 0); | 933 | f2fs_put_page(apage, 0); |
934 | 934 | ||
935 | apage = grab_cache_page(mapping, nid); | 935 | apage = grab_cache_page(mapping, nid); |
936 | if (!apage) | 936 | if (!apage) |
937 | return; | 937 | return; |
938 | 938 | ||
939 | err = read_node_page(apage, READA); | 939 | err = read_node_page(apage, READA); |
940 | if (err == 0) | 940 | if (err == 0) |
941 | f2fs_put_page(apage, 0); | 941 | f2fs_put_page(apage, 0); |
942 | else if (err == LOCKED_PAGE) | 942 | else if (err == LOCKED_PAGE) |
943 | f2fs_put_page(apage, 1); | 943 | f2fs_put_page(apage, 1); |
944 | } | 944 | } |
945 | 945 | ||
946 | struct page *get_node_page(struct f2fs_sb_info *sbi, pgoff_t nid) | 946 | struct page *get_node_page(struct f2fs_sb_info *sbi, pgoff_t nid) |
947 | { | 947 | { |
948 | struct address_space *mapping = sbi->node_inode->i_mapping; | 948 | struct address_space *mapping = sbi->node_inode->i_mapping; |
949 | struct page *page; | 949 | struct page *page; |
950 | int err; | 950 | int err; |
951 | repeat: | 951 | repeat: |
952 | page = grab_cache_page(mapping, nid); | 952 | page = grab_cache_page(mapping, nid); |
953 | if (!page) | 953 | if (!page) |
954 | return ERR_PTR(-ENOMEM); | 954 | return ERR_PTR(-ENOMEM); |
955 | 955 | ||
956 | err = read_node_page(page, READ_SYNC); | 956 | err = read_node_page(page, READ_SYNC); |
957 | if (err < 0) | 957 | if (err < 0) |
958 | return ERR_PTR(err); | 958 | return ERR_PTR(err); |
959 | else if (err == LOCKED_PAGE) | 959 | else if (err == LOCKED_PAGE) |
960 | goto got_it; | 960 | goto got_it; |
961 | 961 | ||
962 | lock_page(page); | 962 | lock_page(page); |
963 | if (!PageUptodate(page)) { | 963 | if (!PageUptodate(page)) { |
964 | f2fs_put_page(page, 1); | 964 | f2fs_put_page(page, 1); |
965 | return ERR_PTR(-EIO); | 965 | return ERR_PTR(-EIO); |
966 | } | 966 | } |
967 | if (page->mapping != mapping) { | 967 | if (page->mapping != mapping) { |
968 | f2fs_put_page(page, 1); | 968 | f2fs_put_page(page, 1); |
969 | goto repeat; | 969 | goto repeat; |
970 | } | 970 | } |
971 | got_it: | 971 | got_it: |
972 | BUG_ON(nid != nid_of_node(page)); | 972 | BUG_ON(nid != nid_of_node(page)); |
973 | mark_page_accessed(page); | ||
974 | return page; | 973 | return page; |
975 | } | 974 | } |
976 | 975 | ||
977 | /* | 976 | /* |
978 | * Return a locked page for the desired node page. | 977 | * Return a locked page for the desired node page. |
979 | * And, readahead MAX_RA_NODE number of node pages. | 978 | * And, readahead MAX_RA_NODE number of node pages. |
980 | */ | 979 | */ |
981 | struct page *get_node_page_ra(struct page *parent, int start) | 980 | struct page *get_node_page_ra(struct page *parent, int start) |
982 | { | 981 | { |
983 | struct f2fs_sb_info *sbi = F2FS_SB(parent->mapping->host->i_sb); | 982 | struct f2fs_sb_info *sbi = F2FS_SB(parent->mapping->host->i_sb); |
984 | struct address_space *mapping = sbi->node_inode->i_mapping; | 983 | struct address_space *mapping = sbi->node_inode->i_mapping; |
985 | struct blk_plug plug; | 984 | struct blk_plug plug; |
986 | struct page *page; | 985 | struct page *page; |
987 | int err, i, end; | 986 | int err, i, end; |
988 | nid_t nid; | 987 | nid_t nid; |
989 | 988 | ||
990 | /* First, try getting the desired direct node. */ | 989 | /* First, try getting the desired direct node. */ |
991 | nid = get_nid(parent, start, false); | 990 | nid = get_nid(parent, start, false); |
992 | if (!nid) | 991 | if (!nid) |
993 | return ERR_PTR(-ENOENT); | 992 | return ERR_PTR(-ENOENT); |
994 | repeat: | 993 | repeat: |
995 | page = grab_cache_page(mapping, nid); | 994 | page = grab_cache_page(mapping, nid); |
996 | if (!page) | 995 | if (!page) |
997 | return ERR_PTR(-ENOMEM); | 996 | return ERR_PTR(-ENOMEM); |
998 | 997 | ||
999 | err = read_node_page(page, READ_SYNC); | 998 | err = read_node_page(page, READ_SYNC); |
1000 | if (err < 0) | 999 | if (err < 0) |
1001 | return ERR_PTR(err); | 1000 | return ERR_PTR(err); |
1002 | else if (err == LOCKED_PAGE) | 1001 | else if (err == LOCKED_PAGE) |
1003 | goto page_hit; | 1002 | goto page_hit; |
1004 | 1003 | ||
1005 | blk_start_plug(&plug); | 1004 | blk_start_plug(&plug); |
1006 | 1005 | ||
1007 | /* Then, try readahead for siblings of the desired node */ | 1006 | /* Then, try readahead for siblings of the desired node */ |
1008 | end = start + MAX_RA_NODE; | 1007 | end = start + MAX_RA_NODE; |
1009 | end = min(end, NIDS_PER_BLOCK); | 1008 | end = min(end, NIDS_PER_BLOCK); |
1010 | for (i = start + 1; i < end; i++) { | 1009 | for (i = start + 1; i < end; i++) { |
1011 | nid = get_nid(parent, i, false); | 1010 | nid = get_nid(parent, i, false); |
1012 | if (!nid) | 1011 | if (!nid) |
1013 | continue; | 1012 | continue; |
1014 | ra_node_page(sbi, nid); | 1013 | ra_node_page(sbi, nid); |
1015 | } | 1014 | } |
1016 | 1015 | ||
1017 | blk_finish_plug(&plug); | 1016 | blk_finish_plug(&plug); |
1018 | 1017 | ||
1019 | lock_page(page); | 1018 | lock_page(page); |
1020 | if (page->mapping != mapping) { | 1019 | if (page->mapping != mapping) { |
1021 | f2fs_put_page(page, 1); | 1020 | f2fs_put_page(page, 1); |
1022 | goto repeat; | 1021 | goto repeat; |
1023 | } | 1022 | } |
1024 | page_hit: | 1023 | page_hit: |
1025 | if (!PageUptodate(page)) { | 1024 | if (!PageUptodate(page)) { |
1026 | f2fs_put_page(page, 1); | 1025 | f2fs_put_page(page, 1); |
1027 | return ERR_PTR(-EIO); | 1026 | return ERR_PTR(-EIO); |
1028 | } | 1027 | } |
1029 | mark_page_accessed(page); | ||
1030 | return page; | 1028 | return page; |
1031 | } | 1029 | } |
1032 | 1030 | ||
1033 | void sync_inode_page(struct dnode_of_data *dn) | 1031 | void sync_inode_page(struct dnode_of_data *dn) |
1034 | { | 1032 | { |
1035 | if (IS_INODE(dn->node_page) || dn->inode_page == dn->node_page) { | 1033 | if (IS_INODE(dn->node_page) || dn->inode_page == dn->node_page) { |
1036 | update_inode(dn->inode, dn->node_page); | 1034 | update_inode(dn->inode, dn->node_page); |
1037 | } else if (dn->inode_page) { | 1035 | } else if (dn->inode_page) { |
1038 | if (!dn->inode_page_locked) | 1036 | if (!dn->inode_page_locked) |
1039 | lock_page(dn->inode_page); | 1037 | lock_page(dn->inode_page); |
1040 | update_inode(dn->inode, dn->inode_page); | 1038 | update_inode(dn->inode, dn->inode_page); |
1041 | if (!dn->inode_page_locked) | 1039 | if (!dn->inode_page_locked) |
1042 | unlock_page(dn->inode_page); | 1040 | unlock_page(dn->inode_page); |
1043 | } else { | 1041 | } else { |
1044 | update_inode_page(dn->inode); | 1042 | update_inode_page(dn->inode); |
1045 | } | 1043 | } |
1046 | } | 1044 | } |
1047 | 1045 | ||
1048 | int sync_node_pages(struct f2fs_sb_info *sbi, nid_t ino, | 1046 | int sync_node_pages(struct f2fs_sb_info *sbi, nid_t ino, |
1049 | struct writeback_control *wbc) | 1047 | struct writeback_control *wbc) |
1050 | { | 1048 | { |
1051 | struct address_space *mapping = sbi->node_inode->i_mapping; | 1049 | struct address_space *mapping = sbi->node_inode->i_mapping; |
1052 | pgoff_t index, end; | 1050 | pgoff_t index, end; |
1053 | struct pagevec pvec; | 1051 | struct pagevec pvec; |
1054 | int step = ino ? 2 : 0; | 1052 | int step = ino ? 2 : 0; |
1055 | int nwritten = 0, wrote = 0; | 1053 | int nwritten = 0, wrote = 0; |
1056 | 1054 | ||
1057 | pagevec_init(&pvec, 0); | 1055 | pagevec_init(&pvec, 0); |
1058 | 1056 | ||
1059 | next_step: | 1057 | next_step: |
1060 | index = 0; | 1058 | index = 0; |
1061 | end = LONG_MAX; | 1059 | end = LONG_MAX; |
1062 | 1060 | ||
1063 | while (index <= end) { | 1061 | while (index <= end) { |
1064 | int i, nr_pages; | 1062 | int i, nr_pages; |
1065 | nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, | 1063 | nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, |
1066 | PAGECACHE_TAG_DIRTY, | 1064 | PAGECACHE_TAG_DIRTY, |
1067 | min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); | 1065 | min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); |
1068 | if (nr_pages == 0) | 1066 | if (nr_pages == 0) |
1069 | break; | 1067 | break; |
1070 | 1068 | ||
1071 | for (i = 0; i < nr_pages; i++) { | 1069 | for (i = 0; i < nr_pages; i++) { |
1072 | struct page *page = pvec.pages[i]; | 1070 | struct page *page = pvec.pages[i]; |
1073 | 1071 | ||
1074 | /* | 1072 | /* |
1075 | * flushing sequence with step: | 1073 | * flushing sequence with step: |
1076 | * 0. indirect nodes | 1074 | * 0. indirect nodes |
1077 | * 1. dentry dnodes | 1075 | * 1. dentry dnodes |
1078 | * 2. file dnodes | 1076 | * 2. file dnodes |
1079 | */ | 1077 | */ |
1080 | if (step == 0 && IS_DNODE(page)) | 1078 | if (step == 0 && IS_DNODE(page)) |
1081 | continue; | 1079 | continue; |
1082 | if (step == 1 && (!IS_DNODE(page) || | 1080 | if (step == 1 && (!IS_DNODE(page) || |
1083 | is_cold_node(page))) | 1081 | is_cold_node(page))) |
1084 | continue; | 1082 | continue; |
1085 | if (step == 2 && (!IS_DNODE(page) || | 1083 | if (step == 2 && (!IS_DNODE(page) || |
1086 | !is_cold_node(page))) | 1084 | !is_cold_node(page))) |
1087 | continue; | 1085 | continue; |
1088 | 1086 | ||
1089 | /* | 1087 | /* |
1090 | * In fsync mode, | 1088 | * In fsync mode, |
1091 | * we should not skip writing node pages. | 1089 | * we should not skip writing node pages. |
1092 | */ | 1090 | */ |
1093 | if (ino && ino_of_node(page) == ino) | 1091 | if (ino && ino_of_node(page) == ino) |
1094 | lock_page(page); | 1092 | lock_page(page); |
1095 | else if (!trylock_page(page)) | 1093 | else if (!trylock_page(page)) |
1096 | continue; | 1094 | continue; |
1097 | 1095 | ||
1098 | if (unlikely(page->mapping != mapping)) { | 1096 | if (unlikely(page->mapping != mapping)) { |
1099 | continue_unlock: | 1097 | continue_unlock: |
1100 | unlock_page(page); | 1098 | unlock_page(page); |
1101 | continue; | 1099 | continue; |
1102 | } | 1100 | } |
1103 | if (ino && ino_of_node(page) != ino) | 1101 | if (ino && ino_of_node(page) != ino) |
1104 | goto continue_unlock; | 1102 | goto continue_unlock; |
1105 | 1103 | ||
1106 | if (!PageDirty(page)) { | 1104 | if (!PageDirty(page)) { |
1107 | /* someone wrote it for us */ | 1105 | /* someone wrote it for us */ |
1108 | goto continue_unlock; | 1106 | goto continue_unlock; |
1109 | } | 1107 | } |
1110 | 1108 | ||
1111 | if (!clear_page_dirty_for_io(page)) | 1109 | if (!clear_page_dirty_for_io(page)) |
1112 | goto continue_unlock; | 1110 | goto continue_unlock; |
1113 | 1111 | ||
1114 | /* called by fsync() */ | 1112 | /* called by fsync() */ |
1115 | if (ino && IS_DNODE(page)) { | 1113 | if (ino && IS_DNODE(page)) { |
1116 | int mark = !is_checkpointed_node(sbi, ino); | 1114 | int mark = !is_checkpointed_node(sbi, ino); |
1117 | set_fsync_mark(page, 1); | 1115 | set_fsync_mark(page, 1); |
1118 | if (IS_INODE(page)) | 1116 | if (IS_INODE(page)) |
1119 | set_dentry_mark(page, mark); | 1117 | set_dentry_mark(page, mark); |
1120 | nwritten++; | 1118 | nwritten++; |
1121 | } else { | 1119 | } else { |
1122 | set_fsync_mark(page, 0); | 1120 | set_fsync_mark(page, 0); |
1123 | set_dentry_mark(page, 0); | 1121 | set_dentry_mark(page, 0); |
1124 | } | 1122 | } |
1125 | mapping->a_ops->writepage(page, wbc); | 1123 | mapping->a_ops->writepage(page, wbc); |
1126 | wrote++; | 1124 | wrote++; |
1127 | 1125 | ||
1128 | if (--wbc->nr_to_write == 0) | 1126 | if (--wbc->nr_to_write == 0) |
1129 | break; | 1127 | break; |
1130 | } | 1128 | } |
1131 | pagevec_release(&pvec); | 1129 | pagevec_release(&pvec); |
1132 | cond_resched(); | 1130 | cond_resched(); |
1133 | 1131 | ||
1134 | if (wbc->nr_to_write == 0) { | 1132 | if (wbc->nr_to_write == 0) { |
1135 | step = 2; | 1133 | step = 2; |
1136 | break; | 1134 | break; |
1137 | } | 1135 | } |
1138 | } | 1136 | } |
1139 | 1137 | ||
1140 | if (step < 2) { | 1138 | if (step < 2) { |
1141 | step++; | 1139 | step++; |
1142 | goto next_step; | 1140 | goto next_step; |
1143 | } | 1141 | } |
1144 | 1142 | ||
1145 | if (wrote) | 1143 | if (wrote) |
1146 | f2fs_submit_bio(sbi, NODE, wbc->sync_mode == WB_SYNC_ALL); | 1144 | f2fs_submit_bio(sbi, NODE, wbc->sync_mode == WB_SYNC_ALL); |
1147 | 1145 | ||
1148 | return nwritten; | 1146 | return nwritten; |
1149 | } | 1147 | } |
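
The three-step ordering that sync_node_pages() walks (indirect nodes, then dentry dnodes, then file dnodes) can be restated as a single predicate. The helper below is hypothetical, not part of f2fs; it only mirrors the skip conditions in the loop above:

	/* Hypothetical restatement of the per-step filter in sync_node_pages(). */
	static bool wanted_in_step(struct page *page, int step)
	{
		if (step == 0)			/* step 0: indirect nodes only */
			return !IS_DNODE(page);
		if (step == 1)			/* step 1: dentry dnodes (hot) */
			return IS_DNODE(page) && !is_cold_node(page);
		/* step 2: file dnodes (cold) */
		return IS_DNODE(page) && is_cold_node(page);
	}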
1150 | 1148 | ||
1151 | static int f2fs_write_node_page(struct page *page, | 1149 | static int f2fs_write_node_page(struct page *page, |
1152 | struct writeback_control *wbc) | 1150 | struct writeback_control *wbc) |
1153 | { | 1151 | { |
1154 | struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb); | 1152 | struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb); |
1155 | nid_t nid; | 1153 | nid_t nid; |
1156 | block_t new_addr; | 1154 | block_t new_addr; |
1157 | struct node_info ni; | 1155 | struct node_info ni; |
1158 | 1156 | ||
1159 | wait_on_page_writeback(page); | 1157 | wait_on_page_writeback(page); |
1160 | 1158 | ||
1161 | /* get old block addr of this node page */ | 1159 | /* get old block addr of this node page */ |
1162 | nid = nid_of_node(page); | 1160 | nid = nid_of_node(page); |
1163 | BUG_ON(page->index != nid); | 1161 | BUG_ON(page->index != nid); |
1164 | 1162 | ||
1165 | get_node_info(sbi, nid, &ni); | 1163 | get_node_info(sbi, nid, &ni); |
1166 | 1164 | ||
1167 | /* This page is already truncated */ | 1165 | /* This page is already truncated */ |
1168 | if (ni.blk_addr == NULL_ADDR) { | 1166 | if (ni.blk_addr == NULL_ADDR) { |
1169 | dec_page_count(sbi, F2FS_DIRTY_NODES); | 1167 | dec_page_count(sbi, F2FS_DIRTY_NODES); |
1170 | unlock_page(page); | 1168 | unlock_page(page); |
1171 | return 0; | 1169 | return 0; |
1172 | } | 1170 | } |
1173 | 1171 | ||
1174 | if (wbc->for_reclaim) { | 1172 | if (wbc->for_reclaim) { |
1175 | dec_page_count(sbi, F2FS_DIRTY_NODES); | 1173 | dec_page_count(sbi, F2FS_DIRTY_NODES); |
1176 | wbc->pages_skipped++; | 1174 | wbc->pages_skipped++; |
1177 | set_page_dirty(page); | 1175 | set_page_dirty(page); |
1178 | return AOP_WRITEPAGE_ACTIVATE; | 1176 | return AOP_WRITEPAGE_ACTIVATE; |
1179 | } | 1177 | } |
1180 | 1178 | ||
1181 | mutex_lock(&sbi->node_write); | 1179 | mutex_lock(&sbi->node_write); |
1182 | set_page_writeback(page); | 1180 | set_page_writeback(page); |
1183 | write_node_page(sbi, page, nid, ni.blk_addr, &new_addr); | 1181 | write_node_page(sbi, page, nid, ni.blk_addr, &new_addr); |
1184 | set_node_addr(sbi, &ni, new_addr); | 1182 | set_node_addr(sbi, &ni, new_addr); |
1185 | dec_page_count(sbi, F2FS_DIRTY_NODES); | 1183 | dec_page_count(sbi, F2FS_DIRTY_NODES); |
1186 | mutex_unlock(&sbi->node_write); | 1184 | mutex_unlock(&sbi->node_write); |
1187 | unlock_page(page); | 1185 | unlock_page(page); |
1188 | return 0; | 1186 | return 0; |
1189 | } | 1187 | } |
1190 | 1188 | ||
1191 | /* | 1189 | /* |
1192 | * It is very important to gather dirty pages and write at once, so that we can | 1190 | * It is very important to gather dirty pages and write at once, so that we can |
1193 | * submit a big bio without interfering with other data writes. | 1191 | * submit a big bio without interfering with other data writes. |
1194 | * By default, 512 pages (2MB) * 3 node types is a reasonable amount. | 1192 | * By default, 512 pages (2MB) * 3 node types is a reasonable amount. |
1195 | */ | 1193 | */ |
1196 | #define COLLECT_DIRTY_NODES 1536 | 1194 | #define COLLECT_DIRTY_NODES 1536 |
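
The constant follows directly from the comment: with 4 KB pages, 512 pages are 2 MB per node type, and three node types give 3 * 512 = 1536 dirty node pages to collect before a batched write is worthwhile.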
1197 | static int f2fs_write_node_pages(struct address_space *mapping, | 1195 | static int f2fs_write_node_pages(struct address_space *mapping, |
1198 | struct writeback_control *wbc) | 1196 | struct writeback_control *wbc) |
1199 | { | 1197 | { |
1200 | struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); | 1198 | struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); |
1201 | long nr_to_write = wbc->nr_to_write; | 1199 | long nr_to_write = wbc->nr_to_write; |
1202 | 1200 | ||
1203 | /* First check balancing cached NAT entries */ | 1201 | /* First check balancing cached NAT entries */ |
1204 | if (try_to_free_nats(sbi, NAT_ENTRY_PER_BLOCK)) { | 1202 | if (try_to_free_nats(sbi, NAT_ENTRY_PER_BLOCK)) { |
1205 | f2fs_sync_fs(sbi->sb, true); | 1203 | f2fs_sync_fs(sbi->sb, true); |
1206 | return 0; | 1204 | return 0; |
1207 | } | 1205 | } |
1208 | 1206 | ||
1209 | /* collect a number of dirty node pages and write together */ | 1207 | /* collect a number of dirty node pages and write together */ |
1210 | if (get_pages(sbi, F2FS_DIRTY_NODES) < COLLECT_DIRTY_NODES) | 1208 | if (get_pages(sbi, F2FS_DIRTY_NODES) < COLLECT_DIRTY_NODES) |
1211 | return 0; | 1209 | return 0; |
1212 | 1210 | ||
1213 | /* if mounting has failed, skip writing node pages */ | 1211 | /* if mounting has failed, skip writing node pages */ |
1214 | wbc->nr_to_write = 3 * max_hw_blocks(sbi); | 1212 | wbc->nr_to_write = 3 * max_hw_blocks(sbi); |
1215 | sync_node_pages(sbi, 0, wbc); | 1213 | sync_node_pages(sbi, 0, wbc); |
1216 | wbc->nr_to_write = nr_to_write - (3 * max_hw_blocks(sbi) - | 1214 | wbc->nr_to_write = nr_to_write - (3 * max_hw_blocks(sbi) - |
1217 | wbc->nr_to_write); | 1215 | wbc->nr_to_write); |
1218 | return 0; | 1216 | return 0; |
1219 | } | 1217 | } |
1220 | 1218 | ||
1221 | static int f2fs_set_node_page_dirty(struct page *page) | 1219 | static int f2fs_set_node_page_dirty(struct page *page) |
1222 | { | 1220 | { |
1223 | struct address_space *mapping = page->mapping; | 1221 | struct address_space *mapping = page->mapping; |
1224 | struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); | 1222 | struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); |
1225 | 1223 | ||
1226 | SetPageUptodate(page); | 1224 | SetPageUptodate(page); |
1227 | if (!PageDirty(page)) { | 1225 | if (!PageDirty(page)) { |
1228 | __set_page_dirty_nobuffers(page); | 1226 | __set_page_dirty_nobuffers(page); |
1229 | inc_page_count(sbi, F2FS_DIRTY_NODES); | 1227 | inc_page_count(sbi, F2FS_DIRTY_NODES); |
1230 | SetPagePrivate(page); | 1228 | SetPagePrivate(page); |
1231 | return 1; | 1229 | return 1; |
1232 | } | 1230 | } |
1233 | return 0; | 1231 | return 0; |
1234 | } | 1232 | } |
1235 | 1233 | ||
1236 | static void f2fs_invalidate_node_page(struct page *page, unsigned int offset, | 1234 | static void f2fs_invalidate_node_page(struct page *page, unsigned int offset, |
1237 | unsigned int length) | 1235 | unsigned int length) |
1238 | { | 1236 | { |
1239 | struct inode *inode = page->mapping->host; | 1237 | struct inode *inode = page->mapping->host; |
1240 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); | 1238 | struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); |
1241 | if (PageDirty(page)) | 1239 | if (PageDirty(page)) |
1242 | dec_page_count(sbi, F2FS_DIRTY_NODES); | 1240 | dec_page_count(sbi, F2FS_DIRTY_NODES); |
1243 | ClearPagePrivate(page); | 1241 | ClearPagePrivate(page); |
1244 | } | 1242 | } |
1245 | 1243 | ||
1246 | static int f2fs_release_node_page(struct page *page, gfp_t wait) | 1244 | static int f2fs_release_node_page(struct page *page, gfp_t wait) |
1247 | { | 1245 | { |
1248 | ClearPagePrivate(page); | 1246 | ClearPagePrivate(page); |
1249 | return 1; | 1247 | return 1; |
1250 | } | 1248 | } |
1251 | 1249 | ||
1252 | /* | 1250 | /* |
1253 | * Structure of the f2fs node operations | 1251 | * Structure of the f2fs node operations |
1254 | */ | 1252 | */ |
1255 | const struct address_space_operations f2fs_node_aops = { | 1253 | const struct address_space_operations f2fs_node_aops = { |
1256 | .writepage = f2fs_write_node_page, | 1254 | .writepage = f2fs_write_node_page, |
1257 | .writepages = f2fs_write_node_pages, | 1255 | .writepages = f2fs_write_node_pages, |
1258 | .set_page_dirty = f2fs_set_node_page_dirty, | 1256 | .set_page_dirty = f2fs_set_node_page_dirty, |
1259 | .invalidatepage = f2fs_invalidate_node_page, | 1257 | .invalidatepage = f2fs_invalidate_node_page, |
1260 | .releasepage = f2fs_release_node_page, | 1258 | .releasepage = f2fs_release_node_page, |
1261 | }; | 1259 | }; |
1262 | 1260 | ||
1263 | static struct free_nid *__lookup_free_nid_list(nid_t n, struct list_head *head) | 1261 | static struct free_nid *__lookup_free_nid_list(nid_t n, struct list_head *head) |
1264 | { | 1262 | { |
1265 | struct list_head *this; | 1263 | struct list_head *this; |
1266 | struct free_nid *i; | 1264 | struct free_nid *i; |
1267 | list_for_each(this, head) { | 1265 | list_for_each(this, head) { |
1268 | i = list_entry(this, struct free_nid, list); | 1266 | i = list_entry(this, struct free_nid, list); |
1269 | if (i->nid == n) | 1267 | if (i->nid == n) |
1270 | return i; | 1268 | return i; |
1271 | } | 1269 | } |
1272 | return NULL; | 1270 | return NULL; |
1273 | } | 1271 | } |
1274 | 1272 | ||
1275 | static void __del_from_free_nid_list(struct free_nid *i) | 1273 | static void __del_from_free_nid_list(struct free_nid *i) |
1276 | { | 1274 | { |
1277 | list_del(&i->list); | 1275 | list_del(&i->list); |
1278 | kmem_cache_free(free_nid_slab, i); | 1276 | kmem_cache_free(free_nid_slab, i); |
1279 | } | 1277 | } |
1280 | 1278 | ||
1281 | static int add_free_nid(struct f2fs_nm_info *nm_i, nid_t nid, bool build) | 1279 | static int add_free_nid(struct f2fs_nm_info *nm_i, nid_t nid, bool build) |
1282 | { | 1280 | { |
1283 | struct free_nid *i; | 1281 | struct free_nid *i; |
1284 | struct nat_entry *ne; | 1282 | struct nat_entry *ne; |
1285 | bool allocated = false; | 1283 | bool allocated = false; |
1286 | 1284 | ||
1287 | if (nm_i->fcnt > 2 * MAX_FREE_NIDS) | 1285 | if (nm_i->fcnt > 2 * MAX_FREE_NIDS) |
1288 | return -1; | 1286 | return -1; |
1289 | 1287 | ||
1290 | /* 0 nid should not be used */ | 1288 | /* 0 nid should not be used */ |
1291 | if (nid == 0) | 1289 | if (nid == 0) |
1292 | return 0; | 1290 | return 0; |
1293 | 1291 | ||
1294 | if (!build) | 1292 | if (!build) |
1295 | goto retry; | 1293 | goto retry; |
1296 | 1294 | ||
1297 | /* do not add allocated nids */ | 1295 | /* do not add allocated nids */ |
1298 | read_lock(&nm_i->nat_tree_lock); | 1296 | read_lock(&nm_i->nat_tree_lock); |
1299 | ne = __lookup_nat_cache(nm_i, nid); | 1297 | ne = __lookup_nat_cache(nm_i, nid); |
1300 | if (ne && nat_get_blkaddr(ne) != NULL_ADDR) | 1298 | if (ne && nat_get_blkaddr(ne) != NULL_ADDR) |
1301 | allocated = true; | 1299 | allocated = true; |
1302 | read_unlock(&nm_i->nat_tree_lock); | 1300 | read_unlock(&nm_i->nat_tree_lock); |
1303 | if (allocated) | 1301 | if (allocated) |
1304 | return 0; | 1302 | return 0; |
1305 | retry: | 1303 | retry: |
1306 | i = kmem_cache_alloc(free_nid_slab, GFP_NOFS); | 1304 | i = kmem_cache_alloc(free_nid_slab, GFP_NOFS); |
1307 | if (!i) { | 1305 | if (!i) { |
1308 | cond_resched(); | 1306 | cond_resched(); |
1309 | goto retry; | 1307 | goto retry; |
1310 | } | 1308 | } |
1311 | i->nid = nid; | 1309 | i->nid = nid; |
1312 | i->state = NID_NEW; | 1310 | i->state = NID_NEW; |
1313 | 1311 | ||
1314 | spin_lock(&nm_i->free_nid_list_lock); | 1312 | spin_lock(&nm_i->free_nid_list_lock); |
1315 | if (__lookup_free_nid_list(nid, &nm_i->free_nid_list)) { | 1313 | if (__lookup_free_nid_list(nid, &nm_i->free_nid_list)) { |
1316 | spin_unlock(&nm_i->free_nid_list_lock); | 1314 | spin_unlock(&nm_i->free_nid_list_lock); |
1317 | kmem_cache_free(free_nid_slab, i); | 1315 | kmem_cache_free(free_nid_slab, i); |
1318 | return 0; | 1316 | return 0; |
1319 | } | 1317 | } |
1320 | list_add_tail(&i->list, &nm_i->free_nid_list); | 1318 | list_add_tail(&i->list, &nm_i->free_nid_list); |
1321 | nm_i->fcnt++; | 1319 | nm_i->fcnt++; |
1322 | spin_unlock(&nm_i->free_nid_list_lock); | 1320 | spin_unlock(&nm_i->free_nid_list_lock); |
1323 | return 1; | 1321 | return 1; |
1324 | } | 1322 | } |
1325 | 1323 | ||
1326 | static void remove_free_nid(struct f2fs_nm_info *nm_i, nid_t nid) | 1324 | static void remove_free_nid(struct f2fs_nm_info *nm_i, nid_t nid) |
1327 | { | 1325 | { |
1328 | struct free_nid *i; | 1326 | struct free_nid *i; |
1329 | spin_lock(&nm_i->free_nid_list_lock); | 1327 | spin_lock(&nm_i->free_nid_list_lock); |
1330 | i = __lookup_free_nid_list(nid, &nm_i->free_nid_list); | 1328 | i = __lookup_free_nid_list(nid, &nm_i->free_nid_list); |
1331 | if (i && i->state == NID_NEW) { | 1329 | if (i && i->state == NID_NEW) { |
1332 | __del_from_free_nid_list(i); | 1330 | __del_from_free_nid_list(i); |
1333 | nm_i->fcnt--; | 1331 | nm_i->fcnt--; |
1334 | } | 1332 | } |
1335 | spin_unlock(&nm_i->free_nid_list_lock); | 1333 | spin_unlock(&nm_i->free_nid_list_lock); |
1336 | } | 1334 | } |
1337 | 1335 | ||
1338 | static void scan_nat_page(struct f2fs_nm_info *nm_i, | 1336 | static void scan_nat_page(struct f2fs_nm_info *nm_i, |
1339 | struct page *nat_page, nid_t start_nid) | 1337 | struct page *nat_page, nid_t start_nid) |
1340 | { | 1338 | { |
1341 | struct f2fs_nat_block *nat_blk = page_address(nat_page); | 1339 | struct f2fs_nat_block *nat_blk = page_address(nat_page); |
1342 | block_t blk_addr; | 1340 | block_t blk_addr; |
1343 | int i; | 1341 | int i; |
1344 | 1342 | ||
1345 | i = start_nid % NAT_ENTRY_PER_BLOCK; | 1343 | i = start_nid % NAT_ENTRY_PER_BLOCK; |
1346 | 1344 | ||
1347 | for (; i < NAT_ENTRY_PER_BLOCK; i++, start_nid++) { | 1345 | for (; i < NAT_ENTRY_PER_BLOCK; i++, start_nid++) { |
1348 | 1346 | ||
1349 | if (start_nid >= nm_i->max_nid) | 1347 | if (start_nid >= nm_i->max_nid) |
1350 | break; | 1348 | break; |
1351 | 1349 | ||
1352 | blk_addr = le32_to_cpu(nat_blk->entries[i].block_addr); | 1350 | blk_addr = le32_to_cpu(nat_blk->entries[i].block_addr); |
1353 | BUG_ON(blk_addr == NEW_ADDR); | 1351 | BUG_ON(blk_addr == NEW_ADDR); |
1354 | if (blk_addr == NULL_ADDR) { | 1352 | if (blk_addr == NULL_ADDR) { |
1355 | if (add_free_nid(nm_i, start_nid, true) < 0) | 1353 | if (add_free_nid(nm_i, start_nid, true) < 0) |
1356 | break; | 1354 | break; |
1357 | } | 1355 | } |
1358 | } | 1356 | } |
1359 | } | 1357 | } |
1360 | 1358 | ||
1361 | static void build_free_nids(struct f2fs_sb_info *sbi) | 1359 | static void build_free_nids(struct f2fs_sb_info *sbi) |
1362 | { | 1360 | { |
1363 | struct f2fs_nm_info *nm_i = NM_I(sbi); | 1361 | struct f2fs_nm_info *nm_i = NM_I(sbi); |
1364 | struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); | 1362 | struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); |
1365 | struct f2fs_summary_block *sum = curseg->sum_blk; | 1363 | struct f2fs_summary_block *sum = curseg->sum_blk; |
1366 | int i = 0; | 1364 | int i = 0; |
1367 | nid_t nid = nm_i->next_scan_nid; | 1365 | nid_t nid = nm_i->next_scan_nid; |
1368 | 1366 | ||
1369 | /* Enough entries */ | 1367 | /* Enough entries */ |
1370 | if (nm_i->fcnt > NAT_ENTRY_PER_BLOCK) | 1368 | if (nm_i->fcnt > NAT_ENTRY_PER_BLOCK) |
1371 | return; | 1369 | return; |
1372 | 1370 | ||
1373 | /* readahead nat pages to be scanned */ | 1371 | /* readahead nat pages to be scanned */ |
1374 | ra_nat_pages(sbi, nid); | 1372 | ra_nat_pages(sbi, nid); |
1375 | 1373 | ||
1376 | while (1) { | 1374 | while (1) { |
1377 | struct page *page = get_current_nat_page(sbi, nid); | 1375 | struct page *page = get_current_nat_page(sbi, nid); |
1378 | 1376 | ||
1379 | scan_nat_page(nm_i, page, nid); | 1377 | scan_nat_page(nm_i, page, nid); |
1380 | f2fs_put_page(page, 1); | 1378 | f2fs_put_page(page, 1); |
1381 | 1379 | ||
1382 | nid += (NAT_ENTRY_PER_BLOCK - (nid % NAT_ENTRY_PER_BLOCK)); | 1380 | nid += (NAT_ENTRY_PER_BLOCK - (nid % NAT_ENTRY_PER_BLOCK)); |
1383 | if (nid >= nm_i->max_nid) | 1381 | if (nid >= nm_i->max_nid) |
1384 | nid = 0; | 1382 | nid = 0; |
1385 | 1383 | ||
1386 | if (i++ == FREE_NID_PAGES) | 1384 | if (i++ == FREE_NID_PAGES) |
1387 | break; | 1385 | break; |
1388 | } | 1386 | } |
1389 | 1387 | ||
1390 | /* go to the next free nat pages to find more free nids */ | 1388 | /* go to the next free nat pages to find more free nids */ |
1391 | nm_i->next_scan_nid = nid; | 1389 | nm_i->next_scan_nid = nid; |
1392 | 1390 | ||
1393 | /* find free nids from current sum_pages */ | 1391 | /* find free nids from current sum_pages */ |
1394 | mutex_lock(&curseg->curseg_mutex); | 1392 | mutex_lock(&curseg->curseg_mutex); |
1395 | for (i = 0; i < nats_in_cursum(sum); i++) { | 1393 | for (i = 0; i < nats_in_cursum(sum); i++) { |
1396 | block_t addr = le32_to_cpu(nat_in_journal(sum, i).block_addr); | 1394 | block_t addr = le32_to_cpu(nat_in_journal(sum, i).block_addr); |
1397 | nid = le32_to_cpu(nid_in_journal(sum, i)); | 1395 | nid = le32_to_cpu(nid_in_journal(sum, i)); |
1398 | if (addr == NULL_ADDR) | 1396 | if (addr == NULL_ADDR) |
1399 | add_free_nid(nm_i, nid, true); | 1397 | add_free_nid(nm_i, nid, true); |
1400 | else | 1398 | else |
1401 | remove_free_nid(nm_i, nid); | 1399 | remove_free_nid(nm_i, nid); |
1402 | } | 1400 | } |
1403 | mutex_unlock(&curseg->curseg_mutex); | 1401 | mutex_unlock(&curseg->curseg_mutex); |
1404 | } | 1402 | } |
1405 | 1403 | ||
1406 | /* | 1404 | /* |
1407 | * If this function returns success, the caller can obtain a new nid | 1405 | * If this function returns success, the caller can obtain a new nid |
1408 | * from the second parameter of this function. | 1406 | * from the second parameter of this function. |
1409 | * The returned nid can be used as an ino as well as a nid when an inode is created. | 1407 | * The returned nid can be used as an ino as well as a nid when an inode is created. |
1410 | */ | 1408 | */ |
1411 | bool alloc_nid(struct f2fs_sb_info *sbi, nid_t *nid) | 1409 | bool alloc_nid(struct f2fs_sb_info *sbi, nid_t *nid) |
1412 | { | 1410 | { |
1413 | struct f2fs_nm_info *nm_i = NM_I(sbi); | 1411 | struct f2fs_nm_info *nm_i = NM_I(sbi); |
1414 | struct free_nid *i = NULL; | 1412 | struct free_nid *i = NULL; |
1415 | struct list_head *this; | 1413 | struct list_head *this; |
1416 | retry: | 1414 | retry: |
1417 | if (sbi->total_valid_node_count + 1 >= nm_i->max_nid) | 1415 | if (sbi->total_valid_node_count + 1 >= nm_i->max_nid) |
1418 | return false; | 1416 | return false; |
1419 | 1417 | ||
1420 | spin_lock(&nm_i->free_nid_list_lock); | 1418 | spin_lock(&nm_i->free_nid_list_lock); |
1421 | 1419 | ||
1422 | /* We should not use stale free nids created by build_free_nids */ | 1420 | /* We should not use stale free nids created by build_free_nids */ |
1423 | if (nm_i->fcnt && !sbi->on_build_free_nids) { | 1421 | if (nm_i->fcnt && !sbi->on_build_free_nids) { |
1424 | BUG_ON(list_empty(&nm_i->free_nid_list)); | 1422 | BUG_ON(list_empty(&nm_i->free_nid_list)); |
1425 | list_for_each(this, &nm_i->free_nid_list) { | 1423 | list_for_each(this, &nm_i->free_nid_list) { |
1426 | i = list_entry(this, struct free_nid, list); | 1424 | i = list_entry(this, struct free_nid, list); |
1427 | if (i->state == NID_NEW) | 1425 | if (i->state == NID_NEW) |
1428 | break; | 1426 | break; |
1429 | } | 1427 | } |
1430 | 1428 | ||
1431 | BUG_ON(i->state != NID_NEW); | 1429 | BUG_ON(i->state != NID_NEW); |
1432 | *nid = i->nid; | 1430 | *nid = i->nid; |
1433 | i->state = NID_ALLOC; | 1431 | i->state = NID_ALLOC; |
1434 | nm_i->fcnt--; | 1432 | nm_i->fcnt--; |
1435 | spin_unlock(&nm_i->free_nid_list_lock); | 1433 | spin_unlock(&nm_i->free_nid_list_lock); |
1436 | return true; | 1434 | return true; |
1437 | } | 1435 | } |
1438 | spin_unlock(&nm_i->free_nid_list_lock); | 1436 | spin_unlock(&nm_i->free_nid_list_lock); |
1439 | 1437 | ||
1440 | /* Let's scan nat pages and its caches to get free nids */ | 1438 | /* Let's scan nat pages and its caches to get free nids */ |
1441 | mutex_lock(&nm_i->build_lock); | 1439 | mutex_lock(&nm_i->build_lock); |
1442 | sbi->on_build_free_nids = 1; | 1440 | sbi->on_build_free_nids = 1; |
1443 | build_free_nids(sbi); | 1441 | build_free_nids(sbi); |
1444 | sbi->on_build_free_nids = 0; | 1442 | sbi->on_build_free_nids = 0; |
1445 | mutex_unlock(&nm_i->build_lock); | 1443 | mutex_unlock(&nm_i->build_lock); |
1446 | goto retry; | 1444 | goto retry; |
1447 | } | 1445 | } |
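
A minimal caller-side sketch of how alloc_nid() is intended to pair with alloc_nid_done() and alloc_nid_failed() defined below; create_inode_with_nid() is a hypothetical consumer used only for illustration, not an f2fs function, and error handling is simplified:

	/* Illustrative sketch only: pairing the nid allocation helpers. */
	nid_t nid;
	int err;

	if (!alloc_nid(sbi, &nid))
		return -ENOSPC;			/* no free nid available */

	err = create_inode_with_nid(sbi, nid);	/* hypothetical consumer of the nid */
	if (err)
		alloc_nid_failed(sbi, nid);	/* return the nid to the free list (or drop it) */
	else
		alloc_nid_done(sbi, nid);	/* nid is in use; remove it from the free list */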

/*
 * alloc_nid() should be called prior to this function.
 */
void alloc_nid_done(struct f2fs_sb_info *sbi, nid_t nid)
{
	struct f2fs_nm_info *nm_i = NM_I(sbi);
	struct free_nid *i;

	spin_lock(&nm_i->free_nid_list_lock);
	i = __lookup_free_nid_list(nid, &nm_i->free_nid_list);
	BUG_ON(!i || i->state != NID_ALLOC);
	__del_from_free_nid_list(i);
	spin_unlock(&nm_i->free_nid_list_lock);
}

/*
 * alloc_nid() should be called prior to this function.
 */
void alloc_nid_failed(struct f2fs_sb_info *sbi, nid_t nid)
{
	struct f2fs_nm_info *nm_i = NM_I(sbi);
	struct free_nid *i;

	if (!nid)
		return;

	spin_lock(&nm_i->free_nid_list_lock);
	i = __lookup_free_nid_list(nid, &nm_i->free_nid_list);
	BUG_ON(!i || i->state != NID_ALLOC);
	if (nm_i->fcnt > 2 * MAX_FREE_NIDS) {
		__del_from_free_nid_list(i);
	} else {
		i->state = NID_NEW;
		nm_i->fcnt++;
	}
	spin_unlock(&nm_i->free_nid_list_lock);
}

void recover_node_page(struct f2fs_sb_info *sbi, struct page *page,
		struct f2fs_summary *sum, struct node_info *ni,
		block_t new_blkaddr)
{
	rewrite_node_page(sbi, page, sum, ni->blk_addr, new_blkaddr);
	set_node_addr(sbi, ni, new_blkaddr);
	clear_node_page_dirty(page);
}

int recover_inode_page(struct f2fs_sb_info *sbi, struct page *page)
{
	struct address_space *mapping = sbi->node_inode->i_mapping;
	struct f2fs_node *src, *dst;
	nid_t ino = ino_of_node(page);
	struct node_info old_ni, new_ni;
	struct page *ipage;

	ipage = grab_cache_page(mapping, ino);
	if (!ipage)
		return -ENOMEM;

	/* Should not use this inode from free nid list */
	remove_free_nid(NM_I(sbi), ino);

	get_node_info(sbi, ino, &old_ni);
	SetPageUptodate(ipage);
	fill_node_footer(ipage, ino, ino, 0, true);

	src = F2FS_NODE(page);
	dst = F2FS_NODE(ipage);

	memcpy(dst, src, (unsigned long)&src->i.i_ext - (unsigned long)&src->i);
	dst->i.i_size = 0;
	dst->i.i_blocks = cpu_to_le64(1);
	dst->i.i_links = cpu_to_le32(1);
	dst->i.i_xattr_nid = 0;

	new_ni = old_ni;
	new_ni.ino = ino;

	if (!inc_valid_node_count(sbi, NULL, 1))
		WARN_ON(1);
	set_node_addr(sbi, &new_ni, NEW_ADDR);
	inc_valid_inode_count(sbi);
	f2fs_put_page(ipage, 1);
	return 0;
}

int restore_node_summary(struct f2fs_sb_info *sbi,
			unsigned int segno, struct f2fs_summary_block *sum)
{
	struct f2fs_node *rn;
	struct f2fs_summary *sum_entry;
	struct page *page;
	block_t addr;
	int i, last_offset;

	/* allocate a temporary page for reading node blocks */
	page = alloc_page(GFP_NOFS | __GFP_ZERO);
	if (!page)
		return -ENOMEM;
	lock_page(page);

	/* scan the node segment */
	last_offset = sbi->blocks_per_seg;
	addr = START_BLOCK(sbi, segno);
	sum_entry = &sum->entries[0];

	for (i = 0; i < last_offset; i++, sum_entry++) {
		/*
		 * In order to read the next node page,
		 * we must clear the PageUptodate flag.
		 */
		ClearPageUptodate(page);

		if (f2fs_readpage(sbi, page, addr, READ_SYNC))
			goto out;

		lock_page(page);
		rn = F2FS_NODE(page);
		sum_entry->nid = rn->footer.nid;
		sum_entry->version = 0;
		sum_entry->ofs_in_node = 0;
		addr++;
	}
	unlock_page(page);
out:
	__free_pages(page, 0);
	return 0;
}

static bool flush_nats_in_journal(struct f2fs_sb_info *sbi)
{
	struct f2fs_nm_info *nm_i = NM_I(sbi);
	struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
	struct f2fs_summary_block *sum = curseg->sum_blk;
	int i;

	mutex_lock(&curseg->curseg_mutex);

	if (nats_in_cursum(sum) < NAT_JOURNAL_ENTRIES) {
		mutex_unlock(&curseg->curseg_mutex);
		return false;
	}

	for (i = 0; i < nats_in_cursum(sum); i++) {
		struct nat_entry *ne;
		struct f2fs_nat_entry raw_ne;
		nid_t nid = le32_to_cpu(nid_in_journal(sum, i));

		raw_ne = nat_in_journal(sum, i);
retry:
		write_lock(&nm_i->nat_tree_lock);
		ne = __lookup_nat_cache(nm_i, nid);
		if (ne) {
			__set_nat_cache_dirty(nm_i, ne);
			write_unlock(&nm_i->nat_tree_lock);
			continue;
		}
		ne = grab_nat_entry(nm_i, nid);
		if (!ne) {
			write_unlock(&nm_i->nat_tree_lock);
			goto retry;
		}
		nat_set_blkaddr(ne, le32_to_cpu(raw_ne.block_addr));
		nat_set_ino(ne, le32_to_cpu(raw_ne.ino));
		nat_set_version(ne, raw_ne.version);
		__set_nat_cache_dirty(nm_i, ne);
		write_unlock(&nm_i->nat_tree_lock);
	}
	update_nats_in_cursum(sum, -i);
	mutex_unlock(&curseg->curseg_mutex);
	return true;
}

/*
 * This function is called during the checkpointing process.
 */
void flush_nat_entries(struct f2fs_sb_info *sbi)
{
	struct f2fs_nm_info *nm_i = NM_I(sbi);
	struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
	struct f2fs_summary_block *sum = curseg->sum_blk;
	struct list_head *cur, *n;
	struct page *page = NULL;
	struct f2fs_nat_block *nat_blk = NULL;
	nid_t start_nid = 0, end_nid = 0;
	bool flushed;

	flushed = flush_nats_in_journal(sbi);

	if (!flushed)
		mutex_lock(&curseg->curseg_mutex);

	/* 1) flush dirty nat caches */
	list_for_each_safe(cur, n, &nm_i->dirty_nat_entries) {
		struct nat_entry *ne;
		nid_t nid;
		struct f2fs_nat_entry raw_ne;
		int offset = -1;
		block_t new_blkaddr;

		ne = list_entry(cur, struct nat_entry, list);
		nid = nat_get_nid(ne);

		if (nat_get_blkaddr(ne) == NEW_ADDR)
			continue;
		if (flushed)
			goto to_nat_page;

		/* if there is room for nat entries in curseg->sumpage */
		offset = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 1);
		if (offset >= 0) {
			raw_ne = nat_in_journal(sum, offset);
			goto flush_now;
		}
to_nat_page:
		if (!page || (start_nid > nid || nid > end_nid)) {
			if (page) {
				f2fs_put_page(page, 1);
				page = NULL;
			}
			start_nid = START_NID(nid);
			end_nid = start_nid + NAT_ENTRY_PER_BLOCK - 1;

			/*
			 * get the nat block with dirty flag, increased
			 * reference count, mapped and locked
			 */
			page = get_next_nat_page(sbi, start_nid);
			nat_blk = page_address(page);
		}

		BUG_ON(!nat_blk);
		raw_ne = nat_blk->entries[nid - start_nid];
flush_now:
		new_blkaddr = nat_get_blkaddr(ne);

		raw_ne.ino = cpu_to_le32(nat_get_ino(ne));
		raw_ne.block_addr = cpu_to_le32(new_blkaddr);
		raw_ne.version = nat_get_version(ne);

		if (offset < 0) {
			nat_blk->entries[nid - start_nid] = raw_ne;
		} else {
			nat_in_journal(sum, offset) = raw_ne;
			nid_in_journal(sum, offset) = cpu_to_le32(nid);
		}

		if (nat_get_blkaddr(ne) == NULL_ADDR &&
				add_free_nid(NM_I(sbi), nid, false) <= 0) {
			write_lock(&nm_i->nat_tree_lock);
			__del_from_nat_cache(nm_i, ne);
			write_unlock(&nm_i->nat_tree_lock);
		} else {
			write_lock(&nm_i->nat_tree_lock);
			__clear_nat_cache_dirty(nm_i, ne);
			ne->checkpointed = true;
			write_unlock(&nm_i->nat_tree_lock);
		}
	}
	if (!flushed)
		mutex_unlock(&curseg->curseg_mutex);
	f2fs_put_page(page, 1);

	/* 2) shrink nat caches if necessary */
	try_to_free_nats(sbi, nm_i->nat_cnt - NM_WOUT_THRESHOLD);
}

static int init_node_manager(struct f2fs_sb_info *sbi)
{
	struct f2fs_super_block *sb_raw = F2FS_RAW_SUPER(sbi);
	struct f2fs_nm_info *nm_i = NM_I(sbi);
	unsigned char *version_bitmap;
	unsigned int nat_segs, nat_blocks;

	nm_i->nat_blkaddr = le32_to_cpu(sb_raw->nat_blkaddr);

	/* segment_count_nat includes the pair segment, so divide by 2 */
	nat_segs = le32_to_cpu(sb_raw->segment_count_nat) >> 1;
	nat_blocks = nat_segs << le32_to_cpu(sb_raw->log_blocks_per_seg);
	nm_i->max_nid = NAT_ENTRY_PER_BLOCK * nat_blocks;
	nm_i->fcnt = 0;
	nm_i->nat_cnt = 0;

	INIT_LIST_HEAD(&nm_i->free_nid_list);
	INIT_RADIX_TREE(&nm_i->nat_root, GFP_ATOMIC);
	INIT_LIST_HEAD(&nm_i->nat_entries);
	INIT_LIST_HEAD(&nm_i->dirty_nat_entries);

	mutex_init(&nm_i->build_lock);
	spin_lock_init(&nm_i->free_nid_list_lock);
	rwlock_init(&nm_i->nat_tree_lock);

	nm_i->next_scan_nid = le32_to_cpu(sbi->ckpt->next_free_nid);
	nm_i->bitmap_size = __bitmap_size(sbi, NAT_BITMAP);
	version_bitmap = __bitmap_ptr(sbi, NAT_BITMAP);
	if (!version_bitmap)
		return -EFAULT;

	nm_i->nat_bitmap = kmemdup(version_bitmap, nm_i->bitmap_size,
					GFP_KERNEL);
	if (!nm_i->nat_bitmap)
		return -ENOMEM;
	return 0;
}

int build_node_manager(struct f2fs_sb_info *sbi)
{
	int err;

	sbi->nm_info = kzalloc(sizeof(struct f2fs_nm_info), GFP_KERNEL);
	if (!sbi->nm_info)
		return -ENOMEM;

	err = init_node_manager(sbi);
	if (err)
		return err;

	build_free_nids(sbi);
	return 0;
}

void destroy_node_manager(struct f2fs_sb_info *sbi)
{
	struct f2fs_nm_info *nm_i = NM_I(sbi);
	struct free_nid *i, *next_i;
	struct nat_entry *natvec[NATVEC_SIZE];
	nid_t nid = 0;
	unsigned int found;

	if (!nm_i)
		return;

	/* destroy free nid list */
	spin_lock(&nm_i->free_nid_list_lock);
	list_for_each_entry_safe(i, next_i, &nm_i->free_nid_list, list) {
		BUG_ON(i->state == NID_ALLOC);
		__del_from_free_nid_list(i);
		nm_i->fcnt--;
	}
	BUG_ON(nm_i->fcnt);
	spin_unlock(&nm_i->free_nid_list_lock);

	/* destroy nat cache */
	write_lock(&nm_i->nat_tree_lock);
	while ((found = __gang_lookup_nat_cache(nm_i,
			nid, NATVEC_SIZE, natvec))) {
		unsigned idx;
		for (idx = 0; idx < found; idx++) {
			struct nat_entry *e = natvec[idx];
			nid = nat_get_nid(e) + 1;
			__del_from_nat_cache(nm_i, e);
		}
	}
	BUG_ON(nm_i->nat_cnt);
	write_unlock(&nm_i->nat_tree_lock);

	kfree(nm_i->nat_bitmap);
	sbi->nm_info = NULL;
	kfree(nm_i);
}

int __init create_node_manager_caches(void)
{
	nat_entry_slab = f2fs_kmem_cache_create("nat_entry",
			sizeof(struct nat_entry), NULL);
	if (!nat_entry_slab)
		return -ENOMEM;

	free_nid_slab = f2fs_kmem_cache_create("free_nid",
			sizeof(struct free_nid), NULL);
	if (!free_nid_slab) {
		kmem_cache_destroy(nat_entry_slab);
		return -ENOMEM;
	}
	return 0;
}

void destroy_node_manager_caches(void)
{
	kmem_cache_destroy(free_nid_slab);
	kmem_cache_destroy(nat_entry_slab);
}

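A rough sketch, assuming the usual f2fs setup order, of how the node-manager entry points above are typically paired over the module and mount lifetime; the real call sites are outside this file and error unwinding is omitted, so treat the snippet as orientation only:

	/* Illustrative ordering only; not an actual call site. */
	if (create_node_manager_caches())	/* once, at module init */
		return -ENOMEM;

	err = build_node_manager(sbi);		/* per superblock, at mount time */
	if (err)
		goto out;

	/* ... filesystem in use ... */

	destroy_node_manager(sbi);		/* at unmount */
out:
	destroy_node_manager_caches();		/* once, at module exit */
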
fs/fuse/file.c
/*
  FUSE: Filesystem in Userspace
  Copyright (C) 2001-2008 Miklos Szeredi <miklos@szeredi.hu>

  This program can be distributed under the terms of the GNU GPL.
  See the file COPYING.
*/

#include "fuse_i.h"

#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/compat.h>
#include <linux/swap.h>
#include <linux/aio.h>
#include <linux/falloc.h>

static const struct file_operations fuse_direct_io_file_operations;

static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
			  int opcode, struct fuse_open_out *outargp)
{
	struct fuse_open_in inarg;
	struct fuse_req *req;
	int err;

	req = fuse_get_req_nopages(fc);
	if (IS_ERR(req))
		return PTR_ERR(req);

	memset(&inarg, 0, sizeof(inarg));
	inarg.flags = file->f_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
	if (!fc->atomic_o_trunc)
		inarg.flags &= ~O_TRUNC;
	req->in.h.opcode = opcode;
	req->in.h.nodeid = nodeid;
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(inarg);
	req->in.args[0].value = &inarg;
	req->out.numargs = 1;
	req->out.args[0].size = sizeof(*outargp);
	req->out.args[0].value = outargp;
	fuse_request_send(fc, req);
	err = req->out.h.error;
	fuse_put_request(fc, req);

	return err;
}

struct fuse_file *fuse_file_alloc(struct fuse_conn *fc)
{
	struct fuse_file *ff;

	ff = kmalloc(sizeof(struct fuse_file), GFP_KERNEL);
	if (unlikely(!ff))
		return NULL;

	ff->fc = fc;
	ff->reserved_req = fuse_request_alloc(0);
	if (unlikely(!ff->reserved_req)) {
		kfree(ff);
		return NULL;
	}

	INIT_LIST_HEAD(&ff->write_entry);
	atomic_set(&ff->count, 0);
	RB_CLEAR_NODE(&ff->polled_node);
	init_waitqueue_head(&ff->poll_wait);

	spin_lock(&fc->lock);
	ff->kh = ++fc->khctr;
	spin_unlock(&fc->lock);

	return ff;
}

void fuse_file_free(struct fuse_file *ff)
{
	fuse_request_free(ff->reserved_req);
	kfree(ff);
}

struct fuse_file *fuse_file_get(struct fuse_file *ff)
{
	atomic_inc(&ff->count);
	return ff;
}

static void fuse_release_async(struct work_struct *work)
{
	struct fuse_req *req;
	struct fuse_conn *fc;
	struct path path;

	req = container_of(work, struct fuse_req, misc.release.work);
	path = req->misc.release.path;
	fc = get_fuse_conn(path.dentry->d_inode);

	fuse_put_request(fc, req);
	path_put(&path);
}

static void fuse_release_end(struct fuse_conn *fc, struct fuse_req *req)
{
	if (fc->destroy_req) {
		/*
		 * If this is a fuseblk mount, then it's possible that
		 * releasing the path will result in releasing the
		 * super block and sending the DESTROY request.  If
		 * the server is single threaded, this would hang.
		 * For this reason do the path_put() in a separate
		 * thread.
		 */
		atomic_inc(&req->count);
		INIT_WORK(&req->misc.release.work, fuse_release_async);
		schedule_work(&req->misc.release.work);
	} else {
		path_put(&req->misc.release.path);
	}
}

static void fuse_file_put(struct fuse_file *ff, bool sync)
{
	if (atomic_dec_and_test(&ff->count)) {
		struct fuse_req *req = ff->reserved_req;

		if (sync) {
			req->background = 0;
			fuse_request_send(ff->fc, req);
			path_put(&req->misc.release.path);
			fuse_put_request(ff->fc, req);
		} else {
			req->end = fuse_release_end;
			req->background = 1;
			fuse_request_send_background(ff->fc, req);
		}
		kfree(ff);
	}
}

int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
		 bool isdir)
{
	struct fuse_open_out outarg;
	struct fuse_file *ff;
	int err;
	int opcode = isdir ? FUSE_OPENDIR : FUSE_OPEN;

	ff = fuse_file_alloc(fc);
	if (!ff)
		return -ENOMEM;

	err = fuse_send_open(fc, nodeid, file, opcode, &outarg);
	if (err) {
		fuse_file_free(ff);
		return err;
	}

	if (isdir)
		outarg.open_flags &= ~FOPEN_DIRECT_IO;

	ff->fh = outarg.fh;
	ff->nodeid = nodeid;
	ff->open_flags = outarg.open_flags;
	file->private_data = fuse_file_get(ff);

	return 0;
}
EXPORT_SYMBOL_GPL(fuse_do_open);

void fuse_finish_open(struct inode *inode, struct file *file)
{
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = get_fuse_conn(inode);

	if (ff->open_flags & FOPEN_DIRECT_IO)
		file->f_op = &fuse_direct_io_file_operations;
	if (!(ff->open_flags & FOPEN_KEEP_CACHE))
		invalidate_inode_pages2(inode->i_mapping);
	if (ff->open_flags & FOPEN_NONSEEKABLE)
		nonseekable_open(inode, file);
	if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
		struct fuse_inode *fi = get_fuse_inode(inode);

		spin_lock(&fc->lock);
		fi->attr_version = ++fc->attr_version;
		i_size_write(inode, 0);
		spin_unlock(&fc->lock);
		fuse_invalidate_attr(inode);
	}
}

int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
{
	struct fuse_conn *fc = get_fuse_conn(inode);
	int err;

	err = generic_file_open(inode, file);
	if (err)
		return err;

	err = fuse_do_open(fc, get_node_id(inode), file, isdir);
	if (err)
		return err;

	fuse_finish_open(inode, file);

	return 0;
}

static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode)
{
	struct fuse_conn *fc = ff->fc;
	struct fuse_req *req = ff->reserved_req;
	struct fuse_release_in *inarg = &req->misc.release.in;

	spin_lock(&fc->lock);
	list_del(&ff->write_entry);
	if (!RB_EMPTY_NODE(&ff->polled_node))
		rb_erase(&ff->polled_node, &fc->polled_files);
	spin_unlock(&fc->lock);

	wake_up_interruptible_all(&ff->poll_wait);

	inarg->fh = ff->fh;
	inarg->flags = flags;
	req->in.h.opcode = opcode;
	req->in.h.nodeid = ff->nodeid;
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(struct fuse_release_in);
	req->in.args[0].value = inarg;
}

void fuse_release_common(struct file *file, int opcode)
{
	struct fuse_file *ff;
	struct fuse_req *req;

	ff = file->private_data;
	if (unlikely(!ff))
		return;

	req = ff->reserved_req;
	fuse_prepare_release(ff, file->f_flags, opcode);

	if (ff->flock) {
		struct fuse_release_in *inarg = &req->misc.release.in;
		inarg->release_flags |= FUSE_RELEASE_FLOCK_UNLOCK;
		inarg->lock_owner = fuse_lock_owner_id(ff->fc,
						       (fl_owner_t) file);
	}
	/* Hold vfsmount and dentry until release is finished */
	path_get(&file->f_path);
	req->misc.release.path = file->f_path;

	/*
	 * Normally this will send the RELEASE request, however if
	 * some asynchronous READ or WRITE requests are outstanding,
	 * the sending will be delayed.
	 *
	 * Make the release synchronous if this is a fuseblk mount,
	 * synchronous RELEASE is allowed (and desirable) in this case
	 * because the server can be trusted not to screw up.
	 */
	fuse_file_put(ff, ff->fc->destroy_req != NULL);
}

static int fuse_open(struct inode *inode, struct file *file)
{
	return fuse_open_common(inode, file, false);
}

static int fuse_release(struct inode *inode, struct file *file)
{
	fuse_release_common(file, FUSE_RELEASE);

	/* return value is ignored by VFS */
	return 0;
}

void fuse_sync_release(struct fuse_file *ff, int flags)
{
	WARN_ON(atomic_read(&ff->count) > 1);
	fuse_prepare_release(ff, flags, FUSE_RELEASE);
	ff->reserved_req->force = 1;
	ff->reserved_req->background = 0;
	fuse_request_send(ff->fc, ff->reserved_req);
	fuse_put_request(ff->fc, ff->reserved_req);
	kfree(ff);
}
EXPORT_SYMBOL_GPL(fuse_sync_release);

/*
 * Scramble the ID space with XTEA, so that the value of the files_struct
 * pointer is not exposed to userspace.
 */
u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
{
	u32 *k = fc->scramble_key;
	u64 v = (unsigned long) id;
	u32 v0 = v;
	u32 v1 = v >> 32;
	u32 sum = 0;
	int i;

	for (i = 0; i < 32; i++) {
		v0 += ((v1 << 4 ^ v1 >> 5) + v1) ^ (sum + k[sum & 3]);
		sum += 0x9E3779B9;
		v1 += ((v0 << 4 ^ v0 >> 5) + v0) ^ (sum + k[sum>>11 & 3]);
	}

	return (u64) v0 + ((u64) v1 << 32);
}
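
To see the scrambling in isolation, here is a minimal self-contained sketch of the same 32-round XTEA encryption of a 64-bit value; the function name and the idea of passing a caller-supplied key are assumptions for illustration, whereas the routine above always uses the per-connection fc->scramble_key:

	/* Stand-alone sketch of the scrambling above; not part of fuse. */
	static u64 xtea_scramble64(u64 v, const u32 k[4])
	{
		u32 v0 = (u32) v;		/* low 32 bits */
		u32 v1 = (u32) (v >> 32);	/* high 32 bits */
		u32 sum = 0;
		int i;

		for (i = 0; i < 32; i++) {
			v0 += ((v1 << 4 ^ v1 >> 5) + v1) ^ (sum + k[sum & 3]);
			sum += 0x9E3779B9;
			v1 += ((v0 << 4 ^ v0 >> 5) + v0) ^ (sum + k[sum >> 11 & 3]);
		}
		return (u64) v0 + ((u64) v1 << 32);
	}

Because the mapping is deterministic for a given key, a lock owner keeps a stable 64-bit identity across requests while the raw pointer value stays hidden from userspace.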

/*
 * Check if page is under writeback
 *
 * This is currently done by walking the list of writepage requests
 * for the inode, which can be pretty inefficient.
 */
static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
{
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);
	struct fuse_req *req;
	bool found = false;

	spin_lock(&fc->lock);
	list_for_each_entry(req, &fi->writepages, writepages_entry) {
		pgoff_t curr_index;

		BUG_ON(req->inode != inode);
		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
		if (curr_index == index) {
			found = true;
			break;
		}
	}
	spin_unlock(&fc->lock);

	return found;
}

/*
 * Wait for page writeback to be completed.
 *
 * Since fuse doesn't rely on the VM writeback tracking, this has to
 * use some other means.
 */
static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
{
	struct fuse_inode *fi = get_fuse_inode(inode);

	wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
	return 0;
}

static int fuse_flush(struct file *file, fl_owner_t id)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_file *ff = file->private_data;
	struct fuse_req *req;
	struct fuse_flush_in inarg;
	int err;

	if (is_bad_inode(inode))
		return -EIO;

	if (fc->no_flush)
		return 0;

	req = fuse_get_req_nofail_nopages(fc, file);
	memset(&inarg, 0, sizeof(inarg));
	inarg.fh = ff->fh;
	inarg.lock_owner = fuse_lock_owner_id(fc, id);
	req->in.h.opcode = FUSE_FLUSH;
	req->in.h.nodeid = get_node_id(inode);
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(inarg);
	req->in.args[0].value = &inarg;
	req->force = 1;
	fuse_request_send(fc, req);
	err = req->out.h.error;
	fuse_put_request(fc, req);
	if (err == -ENOSYS) {
		fc->no_flush = 1;
		err = 0;
	}
	return err;
}

/*
 * Wait for all pending writepages on the inode to finish.
 *
 * This is currently done by blocking further writes with FUSE_NOWRITE
 * and waiting for all sent writes to complete.
 *
 * This must be called under i_mutex, otherwise the FUSE_NOWRITE usage
 * could conflict with truncation.
 */
static void fuse_sync_writes(struct inode *inode)
{
	fuse_set_nowrite(inode);
	fuse_release_nowrite(inode);
}

int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
		      int datasync, int isdir)
{
	struct inode *inode = file->f_mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_file *ff = file->private_data;
	struct fuse_req *req;
	struct fuse_fsync_in inarg;
	int err;

	if (is_bad_inode(inode))
		return -EIO;

	err = filemap_write_and_wait_range(inode->i_mapping, start, end);
	if (err)
		return err;

	if ((!isdir && fc->no_fsync) || (isdir && fc->no_fsyncdir))
		return 0;

	mutex_lock(&inode->i_mutex);

	/*
	 * Start writeback against all dirty pages of the inode, then
	 * wait for all outstanding writes, before sending the FSYNC
	 * request.
	 */
	err = write_inode_now(inode, 0);
	if (err)
		goto out;

	fuse_sync_writes(inode);

	req = fuse_get_req_nopages(fc);
	if (IS_ERR(req)) {
		err = PTR_ERR(req);
		goto out;
	}

	memset(&inarg, 0, sizeof(inarg));
	inarg.fh = ff->fh;
	inarg.fsync_flags = datasync ? 1 : 0;
	req->in.h.opcode = isdir ? FUSE_FSYNCDIR : FUSE_FSYNC;
	req->in.h.nodeid = get_node_id(inode);
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(inarg);
	req->in.args[0].value = &inarg;
	fuse_request_send(fc, req);
	err = req->out.h.error;
	fuse_put_request(fc, req);
	if (err == -ENOSYS) {
		if (isdir)
			fc->no_fsyncdir = 1;
		else
			fc->no_fsync = 1;
		err = 0;
	}
out:
	mutex_unlock(&inode->i_mutex);
	return err;
}

static int fuse_fsync(struct file *file, loff_t start, loff_t end,
		      int datasync)
{
	return fuse_fsync_common(file, start, end, datasync, 0);
}

void fuse_read_fill(struct fuse_req *req, struct file *file, loff_t pos,
		    size_t count, int opcode)
{
	struct fuse_read_in *inarg = &req->misc.read.in;
	struct fuse_file *ff = file->private_data;

	inarg->fh = ff->fh;
	inarg->offset = pos;
	inarg->size = count;
	inarg->flags = file->f_flags;
	req->in.h.opcode = opcode;
	req->in.h.nodeid = ff->nodeid;
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(struct fuse_read_in);
	req->in.args[0].value = inarg;
	req->out.argvar = 1;
	req->out.numargs = 1;
	req->out.args[0].size = count;
}

static void fuse_release_user_pages(struct fuse_req *req, int write)
{
	unsigned i;

	for (i = 0; i < req->num_pages; i++) {
		struct page *page = req->pages[i];
		if (write)
			set_page_dirty_lock(page);
		put_page(page);
	}
}

511 | /** | 511 | /** |
512 | * In case of short read, the caller sets 'pos' to the position of | 512 | * In case of short read, the caller sets 'pos' to the position of |
513 | * actual end of fuse request in IO request. Otherwise, if bytes_requested | 513 | * actual end of fuse request in IO request. Otherwise, if bytes_requested |
514 | * == bytes_transferred or rw == WRITE, the caller sets 'pos' to -1. | 514 | * == bytes_transferred or rw == WRITE, the caller sets 'pos' to -1. |
515 | * | 515 | * |
516 | * An example: | 516 | * An example: |
517 | * User requested DIO read of 64K. It was split into two 32K fuse requests, | 517 | * User requested DIO read of 64K. It was split into two 32K fuse requests, |
518 | * both submitted asynchronously. The first of them was ACKed by userspace as | 518 | * both submitted asynchronously. The first of them was ACKed by userspace as |
519 | * fully completed (req->out.args[0].size == 32K) resulting in pos == -1. The | 519 | * fully completed (req->out.args[0].size == 32K) resulting in pos == -1. The |
520 | * second request was ACKed as short, e.g. only 1K was read, resulting in | 520 | * second request was ACKed as short, e.g. only 1K was read, resulting in |
521 | * pos == 33K. | 521 | * pos == 33K. |
522 | * | 522 | * |
523 | * Thus, when all fuse requests are completed, the minimal non-negative 'pos' | 523 | * Thus, when all fuse requests are completed, the minimal non-negative 'pos' |
524 | * will be equal to the length of the longest contiguous fragment of | 524 | * will be equal to the length of the longest contiguous fragment of |
525 | * transferred data starting from the beginning of IO request. | 525 | * transferred data starting from the beginning of IO request. |
526 | */ | 526 | */ |
527 | static void fuse_aio_complete(struct fuse_io_priv *io, int err, ssize_t pos) | 527 | static void fuse_aio_complete(struct fuse_io_priv *io, int err, ssize_t pos) |
528 | { | 528 | { |
529 | int left; | 529 | int left; |
530 | 530 | ||
531 | spin_lock(&io->lock); | 531 | spin_lock(&io->lock); |
532 | if (err) | 532 | if (err) |
533 | io->err = io->err ? : err; | 533 | io->err = io->err ? : err; |
534 | else if (pos >= 0 && (io->bytes < 0 || pos < io->bytes)) | 534 | else if (pos >= 0 && (io->bytes < 0 || pos < io->bytes)) |
535 | io->bytes = pos; | 535 | io->bytes = pos; |
536 | 536 | ||
537 | left = --io->reqs; | 537 | left = --io->reqs; |
538 | spin_unlock(&io->lock); | 538 | spin_unlock(&io->lock); |
539 | 539 | ||
540 | if (!left) { | 540 | if (!left) { |
541 | long res; | 541 | long res; |
542 | 542 | ||
543 | if (io->err) | 543 | if (io->err) |
544 | res = io->err; | 544 | res = io->err; |
545 | else if (io->bytes >= 0 && io->write) | 545 | else if (io->bytes >= 0 && io->write) |
546 | res = -EIO; | 546 | res = -EIO; |
547 | else { | 547 | else { |
548 | res = io->bytes < 0 ? io->size : io->bytes; | 548 | res = io->bytes < 0 ? io->size : io->bytes; |
549 | 549 | ||
550 | if (!is_sync_kiocb(io->iocb)) { | 550 | if (!is_sync_kiocb(io->iocb)) { |
551 | struct inode *inode = file_inode(io->iocb->ki_filp); | 551 | struct inode *inode = file_inode(io->iocb->ki_filp); |
552 | struct fuse_conn *fc = get_fuse_conn(inode); | 552 | struct fuse_conn *fc = get_fuse_conn(inode); |
553 | struct fuse_inode *fi = get_fuse_inode(inode); | 553 | struct fuse_inode *fi = get_fuse_inode(inode); |
554 | 554 | ||
555 | spin_lock(&fc->lock); | 555 | spin_lock(&fc->lock); |
556 | fi->attr_version = ++fc->attr_version; | 556 | fi->attr_version = ++fc->attr_version; |
557 | spin_unlock(&fc->lock); | 557 | spin_unlock(&fc->lock); |
558 | } | 558 | } |
559 | } | 559 | } |
560 | 560 | ||
561 | aio_complete(io->iocb, res, 0); | 561 | aio_complete(io->iocb, res, 0); |
562 | kfree(io); | 562 | kfree(io); |
563 | } | 563 | } |
564 | } | 564 | } |
565 | 565 | ||
566 | static void fuse_aio_complete_req(struct fuse_conn *fc, struct fuse_req *req) | 566 | static void fuse_aio_complete_req(struct fuse_conn *fc, struct fuse_req *req) |
567 | { | 567 | { |
568 | struct fuse_io_priv *io = req->io; | 568 | struct fuse_io_priv *io = req->io; |
569 | ssize_t pos = -1; | 569 | ssize_t pos = -1; |
570 | 570 | ||
571 | fuse_release_user_pages(req, !io->write); | 571 | fuse_release_user_pages(req, !io->write); |
572 | 572 | ||
573 | if (io->write) { | 573 | if (io->write) { |
574 | if (req->misc.write.in.size != req->misc.write.out.size) | 574 | if (req->misc.write.in.size != req->misc.write.out.size) |
575 | pos = req->misc.write.in.offset - io->offset + | 575 | pos = req->misc.write.in.offset - io->offset + |
576 | req->misc.write.out.size; | 576 | req->misc.write.out.size; |
577 | } else { | 577 | } else { |
578 | if (req->misc.read.in.size != req->out.args[0].size) | 578 | if (req->misc.read.in.size != req->out.args[0].size) |
579 | pos = req->misc.read.in.offset - io->offset + | 579 | pos = req->misc.read.in.offset - io->offset + |
580 | req->out.args[0].size; | 580 | req->out.args[0].size; |
581 | } | 581 | } |
582 | 582 | ||
583 | fuse_aio_complete(io, req->out.h.error, pos); | 583 | fuse_aio_complete(io, req->out.h.error, pos); |
584 | } | 584 | } |
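
The arithmetic above is easiest to see with concrete numbers. Below is a hypothetical, standalone check (not part of the patch; variable names invented for illustration) that reproduces the example from the comment before fuse_aio_complete(): a 64K direct read split into two 32K requests, where the second completes short with only 1K transferred, giving pos == 33K.

/* Hypothetical standalone check of the short-read bookkeeping done by
 * fuse_aio_complete_req(); names are made up for illustration only. */
#include <assert.h>

int main(void)
{
	long io_offset = 0;			/* offset of the whole IO request      */
	long req2_in_offset = 32 * 1024;	/* second fuse request starts at 32K   */
	long req2_out_size = 1 * 1024;		/* userspace only transferred 1K       */

	/* Same expression as the short-read branch of fuse_aio_complete_req(). */
	long pos = req2_in_offset - io_offset + req2_out_size;

	assert(pos == 33 * 1024);		/* 33K, matching the comment's example */
	return 0;
}

Since the first request completes in full (pos == -1), the minimal non-negative pos reported to fuse_aio_complete() is 33K, i.e. the length of the contiguous prefix that was actually read.
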
585 | 585 | ||
586 | static size_t fuse_async_req_send(struct fuse_conn *fc, struct fuse_req *req, | 586 | static size_t fuse_async_req_send(struct fuse_conn *fc, struct fuse_req *req, |
587 | size_t num_bytes, struct fuse_io_priv *io) | 587 | size_t num_bytes, struct fuse_io_priv *io) |
588 | { | 588 | { |
589 | spin_lock(&io->lock); | 589 | spin_lock(&io->lock); |
590 | io->size += num_bytes; | 590 | io->size += num_bytes; |
591 | io->reqs++; | 591 | io->reqs++; |
592 | spin_unlock(&io->lock); | 592 | spin_unlock(&io->lock); |
593 | 593 | ||
594 | req->io = io; | 594 | req->io = io; |
595 | req->end = fuse_aio_complete_req; | 595 | req->end = fuse_aio_complete_req; |
596 | 596 | ||
597 | __fuse_get_request(req); | 597 | __fuse_get_request(req); |
598 | fuse_request_send_background(fc, req); | 598 | fuse_request_send_background(fc, req); |
599 | 599 | ||
600 | return num_bytes; | 600 | return num_bytes; |
601 | } | 601 | } |
602 | 602 | ||
603 | static size_t fuse_send_read(struct fuse_req *req, struct fuse_io_priv *io, | 603 | static size_t fuse_send_read(struct fuse_req *req, struct fuse_io_priv *io, |
604 | loff_t pos, size_t count, fl_owner_t owner) | 604 | loff_t pos, size_t count, fl_owner_t owner) |
605 | { | 605 | { |
606 | struct file *file = io->file; | 606 | struct file *file = io->file; |
607 | struct fuse_file *ff = file->private_data; | 607 | struct fuse_file *ff = file->private_data; |
608 | struct fuse_conn *fc = ff->fc; | 608 | struct fuse_conn *fc = ff->fc; |
609 | 609 | ||
610 | fuse_read_fill(req, file, pos, count, FUSE_READ); | 610 | fuse_read_fill(req, file, pos, count, FUSE_READ); |
611 | if (owner != NULL) { | 611 | if (owner != NULL) { |
612 | struct fuse_read_in *inarg = &req->misc.read.in; | 612 | struct fuse_read_in *inarg = &req->misc.read.in; |
613 | 613 | ||
614 | inarg->read_flags |= FUSE_READ_LOCKOWNER; | 614 | inarg->read_flags |= FUSE_READ_LOCKOWNER; |
615 | inarg->lock_owner = fuse_lock_owner_id(fc, owner); | 615 | inarg->lock_owner = fuse_lock_owner_id(fc, owner); |
616 | } | 616 | } |
617 | 617 | ||
618 | if (io->async) | 618 | if (io->async) |
619 | return fuse_async_req_send(fc, req, count, io); | 619 | return fuse_async_req_send(fc, req, count, io); |
620 | 620 | ||
621 | fuse_request_send(fc, req); | 621 | fuse_request_send(fc, req); |
622 | return req->out.args[0].size; | 622 | return req->out.args[0].size; |
623 | } | 623 | } |
624 | 624 | ||
625 | static void fuse_read_update_size(struct inode *inode, loff_t size, | 625 | static void fuse_read_update_size(struct inode *inode, loff_t size, |
626 | u64 attr_ver) | 626 | u64 attr_ver) |
627 | { | 627 | { |
628 | struct fuse_conn *fc = get_fuse_conn(inode); | 628 | struct fuse_conn *fc = get_fuse_conn(inode); |
629 | struct fuse_inode *fi = get_fuse_inode(inode); | 629 | struct fuse_inode *fi = get_fuse_inode(inode); |
630 | 630 | ||
631 | spin_lock(&fc->lock); | 631 | spin_lock(&fc->lock); |
632 | if (attr_ver == fi->attr_version && size < inode->i_size && | 632 | if (attr_ver == fi->attr_version && size < inode->i_size && |
633 | !test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) { | 633 | !test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) { |
634 | fi->attr_version = ++fc->attr_version; | 634 | fi->attr_version = ++fc->attr_version; |
635 | i_size_write(inode, size); | 635 | i_size_write(inode, size); |
636 | } | 636 | } |
637 | spin_unlock(&fc->lock); | 637 | spin_unlock(&fc->lock); |
638 | } | 638 | } |
639 | 639 | ||
640 | static int fuse_readpage(struct file *file, struct page *page) | 640 | static int fuse_readpage(struct file *file, struct page *page) |
641 | { | 641 | { |
642 | struct fuse_io_priv io = { .async = 0, .file = file }; | 642 | struct fuse_io_priv io = { .async = 0, .file = file }; |
643 | struct inode *inode = page->mapping->host; | 643 | struct inode *inode = page->mapping->host; |
644 | struct fuse_conn *fc = get_fuse_conn(inode); | 644 | struct fuse_conn *fc = get_fuse_conn(inode); |
645 | struct fuse_req *req; | 645 | struct fuse_req *req; |
646 | size_t num_read; | 646 | size_t num_read; |
647 | loff_t pos = page_offset(page); | 647 | loff_t pos = page_offset(page); |
648 | size_t count = PAGE_CACHE_SIZE; | 648 | size_t count = PAGE_CACHE_SIZE; |
649 | u64 attr_ver; | 649 | u64 attr_ver; |
650 | int err; | 650 | int err; |
651 | 651 | ||
652 | err = -EIO; | 652 | err = -EIO; |
653 | if (is_bad_inode(inode)) | 653 | if (is_bad_inode(inode)) |
654 | goto out; | 654 | goto out; |
655 | 655 | ||
656 | /* | 656 | /* |
657 | * Page writeback can extend beyond the lifetime of the | 657 | * Page writeback can extend beyond the lifetime of the |
658 | * page-cache page, so make sure we read a properly synced | 658 | * page-cache page, so make sure we read a properly synced |
659 | * page. | 659 | * page. |
660 | */ | 660 | */ |
661 | fuse_wait_on_page_writeback(inode, page->index); | 661 | fuse_wait_on_page_writeback(inode, page->index); |
662 | 662 | ||
663 | req = fuse_get_req(fc, 1); | 663 | req = fuse_get_req(fc, 1); |
664 | err = PTR_ERR(req); | 664 | err = PTR_ERR(req); |
665 | if (IS_ERR(req)) | 665 | if (IS_ERR(req)) |
666 | goto out; | 666 | goto out; |
667 | 667 | ||
668 | attr_ver = fuse_get_attr_version(fc); | 668 | attr_ver = fuse_get_attr_version(fc); |
669 | 669 | ||
670 | req->out.page_zeroing = 1; | 670 | req->out.page_zeroing = 1; |
671 | req->out.argpages = 1; | 671 | req->out.argpages = 1; |
672 | req->num_pages = 1; | 672 | req->num_pages = 1; |
673 | req->pages[0] = page; | 673 | req->pages[0] = page; |
674 | req->page_descs[0].length = count; | 674 | req->page_descs[0].length = count; |
675 | num_read = fuse_send_read(req, &io, pos, count, NULL); | 675 | num_read = fuse_send_read(req, &io, pos, count, NULL); |
676 | err = req->out.h.error; | 676 | err = req->out.h.error; |
677 | fuse_put_request(fc, req); | 677 | fuse_put_request(fc, req); |
678 | 678 | ||
679 | if (!err) { | 679 | if (!err) { |
680 | /* | 680 | /* |
681 | * Short read means EOF. If file size is larger, truncate it | 681 | * Short read means EOF. If file size is larger, truncate it |
682 | */ | 682 | */ |
683 | if (num_read < count) | 683 | if (num_read < count) |
684 | fuse_read_update_size(inode, pos + num_read, attr_ver); | 684 | fuse_read_update_size(inode, pos + num_read, attr_ver); |
685 | 685 | ||
686 | SetPageUptodate(page); | 686 | SetPageUptodate(page); |
687 | } | 687 | } |
688 | 688 | ||
689 | fuse_invalidate_attr(inode); /* atime changed */ | 689 | fuse_invalidate_attr(inode); /* atime changed */ |
690 | out: | 690 | out: |
691 | unlock_page(page); | 691 | unlock_page(page); |
692 | return err; | 692 | return err; |
693 | } | 693 | } |
694 | 694 | ||
695 | static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) | 695 | static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req) |
696 | { | 696 | { |
697 | int i; | 697 | int i; |
698 | size_t count = req->misc.read.in.size; | 698 | size_t count = req->misc.read.in.size; |
699 | size_t num_read = req->out.args[0].size; | 699 | size_t num_read = req->out.args[0].size; |
700 | struct address_space *mapping = NULL; | 700 | struct address_space *mapping = NULL; |
701 | 701 | ||
702 | for (i = 0; mapping == NULL && i < req->num_pages; i++) | 702 | for (i = 0; mapping == NULL && i < req->num_pages; i++) |
703 | mapping = req->pages[i]->mapping; | 703 | mapping = req->pages[i]->mapping; |
704 | 704 | ||
705 | if (mapping) { | 705 | if (mapping) { |
706 | struct inode *inode = mapping->host; | 706 | struct inode *inode = mapping->host; |
707 | 707 | ||
708 | /* | 708 | /* |
709 | * Short read means EOF. If file size is larger, truncate it | 709 | * Short read means EOF. If file size is larger, truncate it |
710 | */ | 710 | */ |
711 | if (!req->out.h.error && num_read < count) { | 711 | if (!req->out.h.error && num_read < count) { |
712 | loff_t pos; | 712 | loff_t pos; |
713 | 713 | ||
714 | pos = page_offset(req->pages[0]) + num_read; | 714 | pos = page_offset(req->pages[0]) + num_read; |
715 | fuse_read_update_size(inode, pos, | 715 | fuse_read_update_size(inode, pos, |
716 | req->misc.read.attr_ver); | 716 | req->misc.read.attr_ver); |
717 | } | 717 | } |
718 | fuse_invalidate_attr(inode); /* atime changed */ | 718 | fuse_invalidate_attr(inode); /* atime changed */ |
719 | } | 719 | } |
720 | 720 | ||
721 | for (i = 0; i < req->num_pages; i++) { | 721 | for (i = 0; i < req->num_pages; i++) { |
722 | struct page *page = req->pages[i]; | 722 | struct page *page = req->pages[i]; |
723 | if (!req->out.h.error) | 723 | if (!req->out.h.error) |
724 | SetPageUptodate(page); | 724 | SetPageUptodate(page); |
725 | else | 725 | else |
726 | SetPageError(page); | 726 | SetPageError(page); |
727 | unlock_page(page); | 727 | unlock_page(page); |
728 | page_cache_release(page); | 728 | page_cache_release(page); |
729 | } | 729 | } |
730 | if (req->ff) | 730 | if (req->ff) |
731 | fuse_file_put(req->ff, false); | 731 | fuse_file_put(req->ff, false); |
732 | } | 732 | } |
733 | 733 | ||
734 | static void fuse_send_readpages(struct fuse_req *req, struct file *file) | 734 | static void fuse_send_readpages(struct fuse_req *req, struct file *file) |
735 | { | 735 | { |
736 | struct fuse_file *ff = file->private_data; | 736 | struct fuse_file *ff = file->private_data; |
737 | struct fuse_conn *fc = ff->fc; | 737 | struct fuse_conn *fc = ff->fc; |
738 | loff_t pos = page_offset(req->pages[0]); | 738 | loff_t pos = page_offset(req->pages[0]); |
739 | size_t count = req->num_pages << PAGE_CACHE_SHIFT; | 739 | size_t count = req->num_pages << PAGE_CACHE_SHIFT; |
740 | 740 | ||
741 | req->out.argpages = 1; | 741 | req->out.argpages = 1; |
742 | req->out.page_zeroing = 1; | 742 | req->out.page_zeroing = 1; |
743 | req->out.page_replace = 1; | 743 | req->out.page_replace = 1; |
744 | fuse_read_fill(req, file, pos, count, FUSE_READ); | 744 | fuse_read_fill(req, file, pos, count, FUSE_READ); |
745 | req->misc.read.attr_ver = fuse_get_attr_version(fc); | 745 | req->misc.read.attr_ver = fuse_get_attr_version(fc); |
746 | if (fc->async_read) { | 746 | if (fc->async_read) { |
747 | req->ff = fuse_file_get(ff); | 747 | req->ff = fuse_file_get(ff); |
748 | req->end = fuse_readpages_end; | 748 | req->end = fuse_readpages_end; |
749 | fuse_request_send_background(fc, req); | 749 | fuse_request_send_background(fc, req); |
750 | } else { | 750 | } else { |
751 | fuse_request_send(fc, req); | 751 | fuse_request_send(fc, req); |
752 | fuse_readpages_end(fc, req); | 752 | fuse_readpages_end(fc, req); |
753 | fuse_put_request(fc, req); | 753 | fuse_put_request(fc, req); |
754 | } | 754 | } |
755 | } | 755 | } |
756 | 756 | ||
757 | struct fuse_fill_data { | 757 | struct fuse_fill_data { |
758 | struct fuse_req *req; | 758 | struct fuse_req *req; |
759 | struct file *file; | 759 | struct file *file; |
760 | struct inode *inode; | 760 | struct inode *inode; |
761 | unsigned nr_pages; | 761 | unsigned nr_pages; |
762 | }; | 762 | }; |
763 | 763 | ||
764 | static int fuse_readpages_fill(void *_data, struct page *page) | 764 | static int fuse_readpages_fill(void *_data, struct page *page) |
765 | { | 765 | { |
766 | struct fuse_fill_data *data = _data; | 766 | struct fuse_fill_data *data = _data; |
767 | struct fuse_req *req = data->req; | 767 | struct fuse_req *req = data->req; |
768 | struct inode *inode = data->inode; | 768 | struct inode *inode = data->inode; |
769 | struct fuse_conn *fc = get_fuse_conn(inode); | 769 | struct fuse_conn *fc = get_fuse_conn(inode); |
770 | 770 | ||
771 | fuse_wait_on_page_writeback(inode, page->index); | 771 | fuse_wait_on_page_writeback(inode, page->index); |
772 | 772 | ||
773 | if (req->num_pages && | 773 | if (req->num_pages && |
774 | (req->num_pages == FUSE_MAX_PAGES_PER_REQ || | 774 | (req->num_pages == FUSE_MAX_PAGES_PER_REQ || |
775 | (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_read || | 775 | (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_read || |
776 | req->pages[req->num_pages - 1]->index + 1 != page->index)) { | 776 | req->pages[req->num_pages - 1]->index + 1 != page->index)) { |
777 | int nr_alloc = min_t(unsigned, data->nr_pages, | 777 | int nr_alloc = min_t(unsigned, data->nr_pages, |
778 | FUSE_MAX_PAGES_PER_REQ); | 778 | FUSE_MAX_PAGES_PER_REQ); |
779 | fuse_send_readpages(req, data->file); | 779 | fuse_send_readpages(req, data->file); |
780 | if (fc->async_read) | 780 | if (fc->async_read) |
781 | req = fuse_get_req_for_background(fc, nr_alloc); | 781 | req = fuse_get_req_for_background(fc, nr_alloc); |
782 | else | 782 | else |
783 | req = fuse_get_req(fc, nr_alloc); | 783 | req = fuse_get_req(fc, nr_alloc); |
784 | 784 | ||
785 | data->req = req; | 785 | data->req = req; |
786 | if (IS_ERR(req)) { | 786 | if (IS_ERR(req)) { |
787 | unlock_page(page); | 787 | unlock_page(page); |
788 | return PTR_ERR(req); | 788 | return PTR_ERR(req); |
789 | } | 789 | } |
790 | } | 790 | } |
791 | 791 | ||
792 | if (WARN_ON(req->num_pages >= req->max_pages)) { | 792 | if (WARN_ON(req->num_pages >= req->max_pages)) { |
793 | fuse_put_request(fc, req); | 793 | fuse_put_request(fc, req); |
794 | return -EIO; | 794 | return -EIO; |
795 | } | 795 | } |
796 | 796 | ||
797 | page_cache_get(page); | 797 | page_cache_get(page); |
798 | req->pages[req->num_pages] = page; | 798 | req->pages[req->num_pages] = page; |
799 | req->page_descs[req->num_pages].length = PAGE_SIZE; | 799 | req->page_descs[req->num_pages].length = PAGE_SIZE; |
800 | req->num_pages++; | 800 | req->num_pages++; |
801 | data->nr_pages--; | 801 | data->nr_pages--; |
802 | return 0; | 802 | return 0; |
803 | } | 803 | } |
804 | 804 | ||
805 | static int fuse_readpages(struct file *file, struct address_space *mapping, | 805 | static int fuse_readpages(struct file *file, struct address_space *mapping, |
806 | struct list_head *pages, unsigned nr_pages) | 806 | struct list_head *pages, unsigned nr_pages) |
807 | { | 807 | { |
808 | struct inode *inode = mapping->host; | 808 | struct inode *inode = mapping->host; |
809 | struct fuse_conn *fc = get_fuse_conn(inode); | 809 | struct fuse_conn *fc = get_fuse_conn(inode); |
810 | struct fuse_fill_data data; | 810 | struct fuse_fill_data data; |
811 | int err; | 811 | int err; |
812 | int nr_alloc = min_t(unsigned, nr_pages, FUSE_MAX_PAGES_PER_REQ); | 812 | int nr_alloc = min_t(unsigned, nr_pages, FUSE_MAX_PAGES_PER_REQ); |
813 | 813 | ||
814 | err = -EIO; | 814 | err = -EIO; |
815 | if (is_bad_inode(inode)) | 815 | if (is_bad_inode(inode)) |
816 | goto out; | 816 | goto out; |
817 | 817 | ||
818 | data.file = file; | 818 | data.file = file; |
819 | data.inode = inode; | 819 | data.inode = inode; |
820 | if (fc->async_read) | 820 | if (fc->async_read) |
821 | data.req = fuse_get_req_for_background(fc, nr_alloc); | 821 | data.req = fuse_get_req_for_background(fc, nr_alloc); |
822 | else | 822 | else |
823 | data.req = fuse_get_req(fc, nr_alloc); | 823 | data.req = fuse_get_req(fc, nr_alloc); |
824 | data.nr_pages = nr_pages; | 824 | data.nr_pages = nr_pages; |
825 | err = PTR_ERR(data.req); | 825 | err = PTR_ERR(data.req); |
826 | if (IS_ERR(data.req)) | 826 | if (IS_ERR(data.req)) |
827 | goto out; | 827 | goto out; |
828 | 828 | ||
829 | err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data); | 829 | err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data); |
830 | if (!err) { | 830 | if (!err) { |
831 | if (data.req->num_pages) | 831 | if (data.req->num_pages) |
832 | fuse_send_readpages(data.req, file); | 832 | fuse_send_readpages(data.req, file); |
833 | else | 833 | else |
834 | fuse_put_request(fc, data.req); | 834 | fuse_put_request(fc, data.req); |
835 | } | 835 | } |
836 | out: | 836 | out: |
837 | return err; | 837 | return err; |
838 | } | 838 | } |
839 | 839 | ||
840 | static ssize_t fuse_file_aio_read(struct kiocb *iocb, const struct iovec *iov, | 840 | static ssize_t fuse_file_aio_read(struct kiocb *iocb, const struct iovec *iov, |
841 | unsigned long nr_segs, loff_t pos) | 841 | unsigned long nr_segs, loff_t pos) |
842 | { | 842 | { |
843 | struct inode *inode = iocb->ki_filp->f_mapping->host; | 843 | struct inode *inode = iocb->ki_filp->f_mapping->host; |
844 | struct fuse_conn *fc = get_fuse_conn(inode); | 844 | struct fuse_conn *fc = get_fuse_conn(inode); |
845 | 845 | ||
846 | /* | 846 | /* |
847 | * In auto invalidate mode, always update attributes on read. | 847 | * In auto invalidate mode, always update attributes on read. |
848 | * Otherwise, only update if we attempt to read past EOF (to ensure | 848 | * Otherwise, only update if we attempt to read past EOF (to ensure |
849 | * i_size is up to date). | 849 | * i_size is up to date). |
850 | */ | 850 | */ |
851 | if (fc->auto_inval_data || | 851 | if (fc->auto_inval_data || |
852 | (pos + iov_length(iov, nr_segs) > i_size_read(inode))) { | 852 | (pos + iov_length(iov, nr_segs) > i_size_read(inode))) { |
853 | int err; | 853 | int err; |
854 | err = fuse_update_attributes(inode, NULL, iocb->ki_filp, NULL); | 854 | err = fuse_update_attributes(inode, NULL, iocb->ki_filp, NULL); |
855 | if (err) | 855 | if (err) |
856 | return err; | 856 | return err; |
857 | } | 857 | } |
858 | 858 | ||
859 | return generic_file_aio_read(iocb, iov, nr_segs, pos); | 859 | return generic_file_aio_read(iocb, iov, nr_segs, pos); |
860 | } | 860 | } |
861 | 861 | ||
862 | static void fuse_write_fill(struct fuse_req *req, struct fuse_file *ff, | 862 | static void fuse_write_fill(struct fuse_req *req, struct fuse_file *ff, |
863 | loff_t pos, size_t count) | 863 | loff_t pos, size_t count) |
864 | { | 864 | { |
865 | struct fuse_write_in *inarg = &req->misc.write.in; | 865 | struct fuse_write_in *inarg = &req->misc.write.in; |
866 | struct fuse_write_out *outarg = &req->misc.write.out; | 866 | struct fuse_write_out *outarg = &req->misc.write.out; |
867 | 867 | ||
868 | inarg->fh = ff->fh; | 868 | inarg->fh = ff->fh; |
869 | inarg->offset = pos; | 869 | inarg->offset = pos; |
870 | inarg->size = count; | 870 | inarg->size = count; |
871 | req->in.h.opcode = FUSE_WRITE; | 871 | req->in.h.opcode = FUSE_WRITE; |
872 | req->in.h.nodeid = ff->nodeid; | 872 | req->in.h.nodeid = ff->nodeid; |
873 | req->in.numargs = 2; | 873 | req->in.numargs = 2; |
874 | if (ff->fc->minor < 9) | 874 | if (ff->fc->minor < 9) |
875 | req->in.args[0].size = FUSE_COMPAT_WRITE_IN_SIZE; | 875 | req->in.args[0].size = FUSE_COMPAT_WRITE_IN_SIZE; |
876 | else | 876 | else |
877 | req->in.args[0].size = sizeof(struct fuse_write_in); | 877 | req->in.args[0].size = sizeof(struct fuse_write_in); |
878 | req->in.args[0].value = inarg; | 878 | req->in.args[0].value = inarg; |
879 | req->in.args[1].size = count; | 879 | req->in.args[1].size = count; |
880 | req->out.numargs = 1; | 880 | req->out.numargs = 1; |
881 | req->out.args[0].size = sizeof(struct fuse_write_out); | 881 | req->out.args[0].size = sizeof(struct fuse_write_out); |
882 | req->out.args[0].value = outarg; | 882 | req->out.args[0].value = outarg; |
883 | } | 883 | } |
884 | 884 | ||
885 | static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io, | 885 | static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io, |
886 | loff_t pos, size_t count, fl_owner_t owner) | 886 | loff_t pos, size_t count, fl_owner_t owner) |
887 | { | 887 | { |
888 | struct file *file = io->file; | 888 | struct file *file = io->file; |
889 | struct fuse_file *ff = file->private_data; | 889 | struct fuse_file *ff = file->private_data; |
890 | struct fuse_conn *fc = ff->fc; | 890 | struct fuse_conn *fc = ff->fc; |
891 | struct fuse_write_in *inarg = &req->misc.write.in; | 891 | struct fuse_write_in *inarg = &req->misc.write.in; |
892 | 892 | ||
893 | fuse_write_fill(req, ff, pos, count); | 893 | fuse_write_fill(req, ff, pos, count); |
894 | inarg->flags = file->f_flags; | 894 | inarg->flags = file->f_flags; |
895 | if (owner != NULL) { | 895 | if (owner != NULL) { |
896 | inarg->write_flags |= FUSE_WRITE_LOCKOWNER; | 896 | inarg->write_flags |= FUSE_WRITE_LOCKOWNER; |
897 | inarg->lock_owner = fuse_lock_owner_id(fc, owner); | 897 | inarg->lock_owner = fuse_lock_owner_id(fc, owner); |
898 | } | 898 | } |
899 | 899 | ||
900 | if (io->async) | 900 | if (io->async) |
901 | return fuse_async_req_send(fc, req, count, io); | 901 | return fuse_async_req_send(fc, req, count, io); |
902 | 902 | ||
903 | fuse_request_send(fc, req); | 903 | fuse_request_send(fc, req); |
904 | return req->misc.write.out.size; | 904 | return req->misc.write.out.size; |
905 | } | 905 | } |
906 | 906 | ||
907 | void fuse_write_update_size(struct inode *inode, loff_t pos) | 907 | void fuse_write_update_size(struct inode *inode, loff_t pos) |
908 | { | 908 | { |
909 | struct fuse_conn *fc = get_fuse_conn(inode); | 909 | struct fuse_conn *fc = get_fuse_conn(inode); |
910 | struct fuse_inode *fi = get_fuse_inode(inode); | 910 | struct fuse_inode *fi = get_fuse_inode(inode); |
911 | 911 | ||
912 | spin_lock(&fc->lock); | 912 | spin_lock(&fc->lock); |
913 | fi->attr_version = ++fc->attr_version; | 913 | fi->attr_version = ++fc->attr_version; |
914 | if (pos > inode->i_size) | 914 | if (pos > inode->i_size) |
915 | i_size_write(inode, pos); | 915 | i_size_write(inode, pos); |
916 | spin_unlock(&fc->lock); | 916 | spin_unlock(&fc->lock); |
917 | } | 917 | } |
918 | 918 | ||
919 | static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file, | 919 | static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file, |
920 | struct inode *inode, loff_t pos, | 920 | struct inode *inode, loff_t pos, |
921 | size_t count) | 921 | size_t count) |
922 | { | 922 | { |
923 | size_t res; | 923 | size_t res; |
924 | unsigned offset; | 924 | unsigned offset; |
925 | unsigned i; | 925 | unsigned i; |
926 | struct fuse_io_priv io = { .async = 0, .file = file }; | 926 | struct fuse_io_priv io = { .async = 0, .file = file }; |
927 | 927 | ||
928 | for (i = 0; i < req->num_pages; i++) | 928 | for (i = 0; i < req->num_pages; i++) |
929 | fuse_wait_on_page_writeback(inode, req->pages[i]->index); | 929 | fuse_wait_on_page_writeback(inode, req->pages[i]->index); |
930 | 930 | ||
931 | res = fuse_send_write(req, &io, pos, count, NULL); | 931 | res = fuse_send_write(req, &io, pos, count, NULL); |
932 | 932 | ||
933 | offset = req->page_descs[0].offset; | 933 | offset = req->page_descs[0].offset; |
934 | count = res; | 934 | count = res; |
935 | for (i = 0; i < req->num_pages; i++) { | 935 | for (i = 0; i < req->num_pages; i++) { |
936 | struct page *page = req->pages[i]; | 936 | struct page *page = req->pages[i]; |
937 | 937 | ||
938 | if (!req->out.h.error && !offset && count >= PAGE_CACHE_SIZE) | 938 | if (!req->out.h.error && !offset && count >= PAGE_CACHE_SIZE) |
939 | SetPageUptodate(page); | 939 | SetPageUptodate(page); |
940 | 940 | ||
941 | if (count > PAGE_CACHE_SIZE - offset) | 941 | if (count > PAGE_CACHE_SIZE - offset) |
942 | count -= PAGE_CACHE_SIZE - offset; | 942 | count -= PAGE_CACHE_SIZE - offset; |
943 | else | 943 | else |
944 | count = 0; | 944 | count = 0; |
945 | offset = 0; | 945 | offset = 0; |
946 | 946 | ||
947 | unlock_page(page); | 947 | unlock_page(page); |
948 | page_cache_release(page); | 948 | page_cache_release(page); |
949 | } | 949 | } |
950 | 950 | ||
951 | return res; | 951 | return res; |
952 | } | 952 | } |
953 | 953 | ||
954 | static ssize_t fuse_fill_write_pages(struct fuse_req *req, | 954 | static ssize_t fuse_fill_write_pages(struct fuse_req *req, |
955 | struct address_space *mapping, | 955 | struct address_space *mapping, |
956 | struct iov_iter *ii, loff_t pos) | 956 | struct iov_iter *ii, loff_t pos) |
957 | { | 957 | { |
958 | struct fuse_conn *fc = get_fuse_conn(mapping->host); | 958 | struct fuse_conn *fc = get_fuse_conn(mapping->host); |
959 | unsigned offset = pos & (PAGE_CACHE_SIZE - 1); | 959 | unsigned offset = pos & (PAGE_CACHE_SIZE - 1); |
960 | size_t count = 0; | 960 | size_t count = 0; |
961 | int err; | 961 | int err; |
962 | 962 | ||
963 | req->in.argpages = 1; | 963 | req->in.argpages = 1; |
964 | req->page_descs[0].offset = offset; | 964 | req->page_descs[0].offset = offset; |
965 | 965 | ||
966 | do { | 966 | do { |
967 | size_t tmp; | 967 | size_t tmp; |
968 | struct page *page; | 968 | struct page *page; |
969 | pgoff_t index = pos >> PAGE_CACHE_SHIFT; | 969 | pgoff_t index = pos >> PAGE_CACHE_SHIFT; |
970 | size_t bytes = min_t(size_t, PAGE_CACHE_SIZE - offset, | 970 | size_t bytes = min_t(size_t, PAGE_CACHE_SIZE - offset, |
971 | iov_iter_count(ii)); | 971 | iov_iter_count(ii)); |
972 | 972 | ||
973 | bytes = min_t(size_t, bytes, fc->max_write - count); | 973 | bytes = min_t(size_t, bytes, fc->max_write - count); |
974 | 974 | ||
975 | again: | 975 | again: |
976 | err = -EFAULT; | 976 | err = -EFAULT; |
977 | if (iov_iter_fault_in_readable(ii, bytes)) | 977 | if (iov_iter_fault_in_readable(ii, bytes)) |
978 | break; | 978 | break; |
979 | 979 | ||
980 | err = -ENOMEM; | 980 | err = -ENOMEM; |
981 | page = grab_cache_page_write_begin(mapping, index, 0); | 981 | page = grab_cache_page_write_begin(mapping, index, 0); |
982 | if (!page) | 982 | if (!page) |
983 | break; | 983 | break; |
984 | 984 | ||
985 | if (mapping_writably_mapped(mapping)) | 985 | if (mapping_writably_mapped(mapping)) |
986 | flush_dcache_page(page); | 986 | flush_dcache_page(page); |
987 | 987 | ||
988 | tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes); | 988 | tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes); |
989 | flush_dcache_page(page); | 989 | flush_dcache_page(page); |
990 | 990 | ||
991 | mark_page_accessed(page); | ||
992 | |||
993 | if (!tmp) { | 991 | if (!tmp) { |
994 | unlock_page(page); | 992 | unlock_page(page); |
995 | page_cache_release(page); | 993 | page_cache_release(page); |
996 | bytes = min(bytes, iov_iter_single_seg_count(ii)); | 994 | bytes = min(bytes, iov_iter_single_seg_count(ii)); |
997 | goto again; | 995 | goto again; |
998 | } | 996 | } |
999 | 997 | ||
1000 | err = 0; | 998 | err = 0; |
1001 | req->pages[req->num_pages] = page; | 999 | req->pages[req->num_pages] = page; |
1002 | req->page_descs[req->num_pages].length = tmp; | 1000 | req->page_descs[req->num_pages].length = tmp; |
1003 | req->num_pages++; | 1001 | req->num_pages++; |
1004 | 1002 | ||
1005 | iov_iter_advance(ii, tmp); | 1003 | iov_iter_advance(ii, tmp); |
1006 | count += tmp; | 1004 | count += tmp; |
1007 | pos += tmp; | 1005 | pos += tmp; |
1008 | offset += tmp; | 1006 | offset += tmp; |
1009 | if (offset == PAGE_CACHE_SIZE) | 1007 | if (offset == PAGE_CACHE_SIZE) |
1010 | offset = 0; | 1008 | offset = 0; |
1011 | 1009 | ||
1012 | if (!fc->big_writes) | 1010 | if (!fc->big_writes) |
1013 | break; | 1011 | break; |
1014 | } while (iov_iter_count(ii) && count < fc->max_write && | 1012 | } while (iov_iter_count(ii) && count < fc->max_write && |
1015 | req->num_pages < req->max_pages && offset == 0); | 1013 | req->num_pages < req->max_pages && offset == 0); |
1016 | 1014 | ||
1017 | return count > 0 ? count : err; | 1015 | return count > 0 ? count : err; |
1018 | } | 1016 | } |
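
The hunk above drops the explicit mark_page_accessed() from fuse_fill_write_pages(): with this series, grab_cache_page_write_begin() is expected to set the page's initial accessed state itself, before the page becomes visible in the page cache. The sketch below is a minimal paraphrase of that allocation path, not the verbatim upstream code; locking and most error handling are omitted, and the wrapper name sketch_grab_page() is invented for illustration. FGP_ACCESSED and init_page_accessed() are the flag/helper introduced by the upstream series this backport tracks.

/*
 * Minimal sketch (assumptions noted above): mark a newly allocated page
 * accessed non-atomically while it is still private, and fall back to the
 * atomic mark_page_accessed() only for pages that are already visible.
 */
#include <linux/pagemap.h>
#include <linux/swap.h>

static struct page *sketch_grab_page(struct address_space *mapping,
				     pgoff_t index, unsigned int fgp_flags,
				     gfp_t gfp_mask)
{
	struct page *page = find_get_page(mapping, index);

	if (page) {
		/* Already visible to other CPUs: take the atomic path. */
		if (fgp_flags & FGP_ACCESSED)
			mark_page_accessed(page);
		return page;
	}

	page = __page_cache_alloc(gfp_mask);
	if (!page)
		return NULL;

	/* Not yet in the radix tree, so no atomic operations are needed. */
	if (fgp_flags & FGP_ACCESSED)
		init_page_accessed(page);

	if (add_to_page_cache_lru(page, mapping, index, gfp_mask)) {
		page_cache_release(page);
		return NULL;
	}
	return page;
}

With grab_cache_page_write_begin() built on such a helper (and presumably requesting FGP_ACCESSED), the page handed back to fuse_fill_write_pages() already carries its initial accessed state, which is why the extra mark_page_accessed() call can go away here.
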
1019 | 1017 | ||
1020 | static inline unsigned fuse_wr_pages(loff_t pos, size_t len) | 1018 | static inline unsigned fuse_wr_pages(loff_t pos, size_t len) |
1021 | { | 1019 | { |
1022 | return min_t(unsigned, | 1020 | return min_t(unsigned, |
1023 | ((pos + len - 1) >> PAGE_CACHE_SHIFT) - | 1021 | ((pos + len - 1) >> PAGE_CACHE_SHIFT) - |
1024 | (pos >> PAGE_CACHE_SHIFT) + 1, | 1022 | (pos >> PAGE_CACHE_SHIFT) + 1, |
1025 | FUSE_MAX_PAGES_PER_REQ); | 1023 | FUSE_MAX_PAGES_PER_REQ); |
1026 | } | 1024 | } |
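
fuse_wr_pages() simply counts how many page-cache pages the byte range [pos, pos + len) touches, capped at FUSE_MAX_PAGES_PER_REQ. A quick, hypothetical standalone check of the formula (assuming 4K pages; the FUSE_MAX_PAGES_PER_REQ clamp is left out of the sketch):

/* Hypothetical standalone check of the fuse_wr_pages() page-count formula,
 * assuming 4K pages and len > 0; the helper just mirrors the expression. */
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* 4K pages assumed for this example */

static unsigned int pages_spanned(long long pos, unsigned long len)
{
	return (unsigned int)(((pos + len - 1) >> SKETCH_PAGE_SHIFT) -
			      (pos >> SKETCH_PAGE_SHIFT) + 1);
}

int main(void)
{
	assert(pages_spanned(0, 4096) == 1);	/* exactly one page       */
	assert(pages_spanned(4090, 10) == 2);	/* straddles a boundary   */
	assert(pages_spanned(8192, 8192) == 2);	/* two full aligned pages */
	return 0;
}
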
1027 | 1025 | ||
1028 | static ssize_t fuse_perform_write(struct file *file, | 1026 | static ssize_t fuse_perform_write(struct file *file, |
1029 | struct address_space *mapping, | 1027 | struct address_space *mapping, |
1030 | struct iov_iter *ii, loff_t pos) | 1028 | struct iov_iter *ii, loff_t pos) |
1031 | { | 1029 | { |
1032 | struct inode *inode = mapping->host; | 1030 | struct inode *inode = mapping->host; |
1033 | struct fuse_conn *fc = get_fuse_conn(inode); | 1031 | struct fuse_conn *fc = get_fuse_conn(inode); |
1034 | struct fuse_inode *fi = get_fuse_inode(inode); | 1032 | struct fuse_inode *fi = get_fuse_inode(inode); |
1035 | int err = 0; | 1033 | int err = 0; |
1036 | ssize_t res = 0; | 1034 | ssize_t res = 0; |
1037 | 1035 | ||
1038 | if (is_bad_inode(inode)) | 1036 | if (is_bad_inode(inode)) |
1039 | return -EIO; | 1037 | return -EIO; |
1040 | 1038 | ||
1041 | if (inode->i_size < pos + iov_iter_count(ii)) | 1039 | if (inode->i_size < pos + iov_iter_count(ii)) |
1042 | set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); | 1040 | set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); |
1043 | 1041 | ||
1044 | do { | 1042 | do { |
1045 | struct fuse_req *req; | 1043 | struct fuse_req *req; |
1046 | ssize_t count; | 1044 | ssize_t count; |
1047 | unsigned nr_pages = fuse_wr_pages(pos, iov_iter_count(ii)); | 1045 | unsigned nr_pages = fuse_wr_pages(pos, iov_iter_count(ii)); |
1048 | 1046 | ||
1049 | req = fuse_get_req(fc, nr_pages); | 1047 | req = fuse_get_req(fc, nr_pages); |
1050 | if (IS_ERR(req)) { | 1048 | if (IS_ERR(req)) { |
1051 | err = PTR_ERR(req); | 1049 | err = PTR_ERR(req); |
1052 | break; | 1050 | break; |
1053 | } | 1051 | } |
1054 | 1052 | ||
1055 | count = fuse_fill_write_pages(req, mapping, ii, pos); | 1053 | count = fuse_fill_write_pages(req, mapping, ii, pos); |
1056 | if (count <= 0) { | 1054 | if (count <= 0) { |
1057 | err = count; | 1055 | err = count; |
1058 | } else { | 1056 | } else { |
1059 | size_t num_written; | 1057 | size_t num_written; |
1060 | 1058 | ||
1061 | num_written = fuse_send_write_pages(req, file, inode, | 1059 | num_written = fuse_send_write_pages(req, file, inode, |
1062 | pos, count); | 1060 | pos, count); |
1063 | err = req->out.h.error; | 1061 | err = req->out.h.error; |
1064 | if (!err) { | 1062 | if (!err) { |
1065 | res += num_written; | 1063 | res += num_written; |
1066 | pos += num_written; | 1064 | pos += num_written; |
1067 | 1065 | ||
1068 | /* break out of the loop on short write */ | 1066 | /* break out of the loop on short write */ |
1069 | if (num_written != count) | 1067 | if (num_written != count) |
1070 | err = -EIO; | 1068 | err = -EIO; |
1071 | } | 1069 | } |
1072 | } | 1070 | } |
1073 | fuse_put_request(fc, req); | 1071 | fuse_put_request(fc, req); |
1074 | } while (!err && iov_iter_count(ii)); | 1072 | } while (!err && iov_iter_count(ii)); |
1075 | 1073 | ||
1076 | if (res > 0) | 1074 | if (res > 0) |
1077 | fuse_write_update_size(inode, pos); | 1075 | fuse_write_update_size(inode, pos); |
1078 | 1076 | ||
1079 | clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); | 1077 | clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); |
1080 | fuse_invalidate_attr(inode); | 1078 | fuse_invalidate_attr(inode); |
1081 | 1079 | ||
1082 | return res > 0 ? res : err; | 1080 | return res > 0 ? res : err; |
1083 | } | 1081 | } |
1084 | 1082 | ||
1085 | static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov, | 1083 | static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov, |
1086 | unsigned long nr_segs, loff_t pos) | 1084 | unsigned long nr_segs, loff_t pos) |
1087 | { | 1085 | { |
1088 | struct file *file = iocb->ki_filp; | 1086 | struct file *file = iocb->ki_filp; |
1089 | struct address_space *mapping = file->f_mapping; | 1087 | struct address_space *mapping = file->f_mapping; |
1090 | size_t count = 0; | 1088 | size_t count = 0; |
1091 | size_t ocount = 0; | 1089 | size_t ocount = 0; |
1092 | ssize_t written = 0; | 1090 | ssize_t written = 0; |
1093 | ssize_t written_buffered = 0; | 1091 | ssize_t written_buffered = 0; |
1094 | struct inode *inode = mapping->host; | 1092 | struct inode *inode = mapping->host; |
1095 | ssize_t err; | 1093 | ssize_t err; |
1096 | struct iov_iter i; | 1094 | struct iov_iter i; |
1097 | loff_t endbyte = 0; | 1095 | loff_t endbyte = 0; |
1098 | 1096 | ||
1099 | WARN_ON(iocb->ki_pos != pos); | 1097 | WARN_ON(iocb->ki_pos != pos); |
1100 | 1098 | ||
1101 | ocount = 0; | 1099 | ocount = 0; |
1102 | err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ); | 1100 | err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ); |
1103 | if (err) | 1101 | if (err) |
1104 | return err; | 1102 | return err; |
1105 | 1103 | ||
1106 | count = ocount; | 1104 | count = ocount; |
1107 | mutex_lock(&inode->i_mutex); | 1105 | mutex_lock(&inode->i_mutex); |
1108 | 1106 | ||
1109 | /* We can write back this queue in page reclaim */ | 1107 | /* We can write back this queue in page reclaim */ |
1110 | current->backing_dev_info = mapping->backing_dev_info; | 1108 | current->backing_dev_info = mapping->backing_dev_info; |
1111 | 1109 | ||
1112 | err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); | 1110 | err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); |
1113 | if (err) | 1111 | if (err) |
1114 | goto out; | 1112 | goto out; |
1115 | 1113 | ||
1116 | if (count == 0) | 1114 | if (count == 0) |
1117 | goto out; | 1115 | goto out; |
1118 | 1116 | ||
1119 | err = file_remove_suid(file); | 1117 | err = file_remove_suid(file); |
1120 | if (err) | 1118 | if (err) |
1121 | goto out; | 1119 | goto out; |
1122 | 1120 | ||
1123 | err = file_update_time(file); | 1121 | err = file_update_time(file); |
1124 | if (err) | 1122 | if (err) |
1125 | goto out; | 1123 | goto out; |
1126 | 1124 | ||
1127 | if (file->f_flags & O_DIRECT) { | 1125 | if (file->f_flags & O_DIRECT) { |
1128 | written = generic_file_direct_write(iocb, iov, &nr_segs, | 1126 | written = generic_file_direct_write(iocb, iov, &nr_segs, |
1129 | pos, &iocb->ki_pos, | 1127 | pos, &iocb->ki_pos, |
1130 | count, ocount); | 1128 | count, ocount); |
1131 | if (written < 0 || written == count) | 1129 | if (written < 0 || written == count) |
1132 | goto out; | 1130 | goto out; |
1133 | 1131 | ||
1134 | pos += written; | 1132 | pos += written; |
1135 | count -= written; | 1133 | count -= written; |
1136 | 1134 | ||
1137 | iov_iter_init(&i, iov, nr_segs, count, written); | 1135 | iov_iter_init(&i, iov, nr_segs, count, written); |
1138 | written_buffered = fuse_perform_write(file, mapping, &i, pos); | 1136 | written_buffered = fuse_perform_write(file, mapping, &i, pos); |
1139 | if (written_buffered < 0) { | 1137 | if (written_buffered < 0) { |
1140 | err = written_buffered; | 1138 | err = written_buffered; |
1141 | goto out; | 1139 | goto out; |
1142 | } | 1140 | } |
1143 | endbyte = pos + written_buffered - 1; | 1141 | endbyte = pos + written_buffered - 1; |
1144 | 1142 | ||
1145 | err = filemap_write_and_wait_range(file->f_mapping, pos, | 1143 | err = filemap_write_and_wait_range(file->f_mapping, pos, |
1146 | endbyte); | 1144 | endbyte); |
1147 | if (err) | 1145 | if (err) |
1148 | goto out; | 1146 | goto out; |
1149 | 1147 | ||
1150 | invalidate_mapping_pages(file->f_mapping, | 1148 | invalidate_mapping_pages(file->f_mapping, |
1151 | pos >> PAGE_CACHE_SHIFT, | 1149 | pos >> PAGE_CACHE_SHIFT, |
1152 | endbyte >> PAGE_CACHE_SHIFT); | 1150 | endbyte >> PAGE_CACHE_SHIFT); |
1153 | 1151 | ||
1154 | written += written_buffered; | 1152 | written += written_buffered; |
1155 | iocb->ki_pos = pos + written_buffered; | 1153 | iocb->ki_pos = pos + written_buffered; |
1156 | } else { | 1154 | } else { |
1157 | iov_iter_init(&i, iov, nr_segs, count, 0); | 1155 | iov_iter_init(&i, iov, nr_segs, count, 0); |
1158 | written = fuse_perform_write(file, mapping, &i, pos); | 1156 | written = fuse_perform_write(file, mapping, &i, pos); |
1159 | if (written >= 0) | 1157 | if (written >= 0) |
1160 | iocb->ki_pos = pos + written; | 1158 | iocb->ki_pos = pos + written; |
1161 | } | 1159 | } |
1162 | out: | 1160 | out: |
1163 | current->backing_dev_info = NULL; | 1161 | current->backing_dev_info = NULL; |
1164 | mutex_unlock(&inode->i_mutex); | 1162 | mutex_unlock(&inode->i_mutex); |
1165 | 1163 | ||
1166 | return written ? written : err; | 1164 | return written ? written : err; |
1167 | } | 1165 | } |
1168 | 1166 | ||
1169 | static inline void fuse_page_descs_length_init(struct fuse_req *req, | 1167 | static inline void fuse_page_descs_length_init(struct fuse_req *req, |
1170 | unsigned index, unsigned nr_pages) | 1168 | unsigned index, unsigned nr_pages) |
1171 | { | 1169 | { |
1172 | int i; | 1170 | int i; |
1173 | 1171 | ||
1174 | for (i = index; i < index + nr_pages; i++) | 1172 | for (i = index; i < index + nr_pages; i++) |
1175 | req->page_descs[i].length = PAGE_SIZE - | 1173 | req->page_descs[i].length = PAGE_SIZE - |
1176 | req->page_descs[i].offset; | 1174 | req->page_descs[i].offset; |
1177 | } | 1175 | } |
1178 | 1176 | ||
1179 | static inline unsigned long fuse_get_user_addr(const struct iov_iter *ii) | 1177 | static inline unsigned long fuse_get_user_addr(const struct iov_iter *ii) |
1180 | { | 1178 | { |
1181 | return (unsigned long)ii->iov->iov_base + ii->iov_offset; | 1179 | return (unsigned long)ii->iov->iov_base + ii->iov_offset; |
1182 | } | 1180 | } |
1183 | 1181 | ||
1184 | static inline size_t fuse_get_frag_size(const struct iov_iter *ii, | 1182 | static inline size_t fuse_get_frag_size(const struct iov_iter *ii, |
1185 | size_t max_size) | 1183 | size_t max_size) |
1186 | { | 1184 | { |
1187 | return min(iov_iter_single_seg_count(ii), max_size); | 1185 | return min(iov_iter_single_seg_count(ii), max_size); |
1188 | } | 1186 | } |
1189 | 1187 | ||
1190 | static int fuse_get_user_pages(struct fuse_req *req, struct iov_iter *ii, | 1188 | static int fuse_get_user_pages(struct fuse_req *req, struct iov_iter *ii, |
1191 | size_t *nbytesp, int write) | 1189 | size_t *nbytesp, int write) |
1192 | { | 1190 | { |
1193 | size_t nbytes = 0; /* # bytes already packed in req */ | 1191 | size_t nbytes = 0; /* # bytes already packed in req */ |
1194 | 1192 | ||
1195 | /* Special case for kernel I/O: can copy directly into the buffer */ | 1193 | /* Special case for kernel I/O: can copy directly into the buffer */ |
1196 | if (segment_eq(get_fs(), KERNEL_DS)) { | 1194 | if (segment_eq(get_fs(), KERNEL_DS)) { |
1197 | unsigned long user_addr = fuse_get_user_addr(ii); | 1195 | unsigned long user_addr = fuse_get_user_addr(ii); |
1198 | size_t frag_size = fuse_get_frag_size(ii, *nbytesp); | 1196 | size_t frag_size = fuse_get_frag_size(ii, *nbytesp); |
1199 | 1197 | ||
1200 | if (write) | 1198 | if (write) |
1201 | req->in.args[1].value = (void *) user_addr; | 1199 | req->in.args[1].value = (void *) user_addr; |
1202 | else | 1200 | else |
1203 | req->out.args[0].value = (void *) user_addr; | 1201 | req->out.args[0].value = (void *) user_addr; |
1204 | 1202 | ||
1205 | iov_iter_advance(ii, frag_size); | 1203 | iov_iter_advance(ii, frag_size); |
1206 | *nbytesp = frag_size; | 1204 | *nbytesp = frag_size; |
1207 | return 0; | 1205 | return 0; |
1208 | } | 1206 | } |
1209 | 1207 | ||
1210 | while (nbytes < *nbytesp && req->num_pages < req->max_pages) { | 1208 | while (nbytes < *nbytesp && req->num_pages < req->max_pages) { |
1211 | unsigned npages; | 1209 | unsigned npages; |
1212 | unsigned long user_addr = fuse_get_user_addr(ii); | 1210 | unsigned long user_addr = fuse_get_user_addr(ii); |
1213 | unsigned offset = user_addr & ~PAGE_MASK; | 1211 | unsigned offset = user_addr & ~PAGE_MASK; |
1214 | size_t frag_size = fuse_get_frag_size(ii, *nbytesp - nbytes); | 1212 | size_t frag_size = fuse_get_frag_size(ii, *nbytesp - nbytes); |
1215 | int ret; | 1213 | int ret; |
1216 | 1214 | ||
1217 | unsigned n = req->max_pages - req->num_pages; | 1215 | unsigned n = req->max_pages - req->num_pages; |
1218 | frag_size = min_t(size_t, frag_size, n << PAGE_SHIFT); | 1216 | frag_size = min_t(size_t, frag_size, n << PAGE_SHIFT); |
1219 | 1217 | ||
1220 | npages = (frag_size + offset + PAGE_SIZE - 1) >> PAGE_SHIFT; | 1218 | npages = (frag_size + offset + PAGE_SIZE - 1) >> PAGE_SHIFT; |
1221 | npages = clamp(npages, 1U, n); | 1219 | npages = clamp(npages, 1U, n); |
1222 | 1220 | ||
1223 | ret = get_user_pages_fast(user_addr, npages, !write, | 1221 | ret = get_user_pages_fast(user_addr, npages, !write, |
1224 | &req->pages[req->num_pages]); | 1222 | &req->pages[req->num_pages]); |
1225 | if (ret < 0) | 1223 | if (ret < 0) |
1226 | return ret; | 1224 | return ret; |
1227 | 1225 | ||
1228 | npages = ret; | 1226 | npages = ret; |
1229 | frag_size = min_t(size_t, frag_size, | 1227 | frag_size = min_t(size_t, frag_size, |
1230 | (npages << PAGE_SHIFT) - offset); | 1228 | (npages << PAGE_SHIFT) - offset); |
1231 | iov_iter_advance(ii, frag_size); | 1229 | iov_iter_advance(ii, frag_size); |
1232 | 1230 | ||
1233 | req->page_descs[req->num_pages].offset = offset; | 1231 | req->page_descs[req->num_pages].offset = offset; |
1234 | fuse_page_descs_length_init(req, req->num_pages, npages); | 1232 | fuse_page_descs_length_init(req, req->num_pages, npages); |
1235 | 1233 | ||
1236 | req->num_pages += npages; | 1234 | req->num_pages += npages; |
1237 | req->page_descs[req->num_pages - 1].length -= | 1235 | req->page_descs[req->num_pages - 1].length -= |
1238 | (npages << PAGE_SHIFT) - offset - frag_size; | 1236 | (npages << PAGE_SHIFT) - offset - frag_size; |
1239 | 1237 | ||
1240 | nbytes += frag_size; | 1238 | nbytes += frag_size; |
1241 | } | 1239 | } |
1242 | 1240 | ||
1243 | if (write) | 1241 | if (write) |
1244 | req->in.argpages = 1; | 1242 | req->in.argpages = 1; |
1245 | else | 1243 | else |
1246 | req->out.argpages = 1; | 1244 | req->out.argpages = 1; |
1247 | 1245 | ||
1248 | *nbytesp = nbytes; | 1246 | *nbytesp = nbytes; |
1249 | 1247 | ||
1250 | return 0; | 1248 | return 0; |
1251 | } | 1249 | } |
1252 | 1250 | ||
1253 | static inline int fuse_iter_npages(const struct iov_iter *ii_p) | 1251 | static inline int fuse_iter_npages(const struct iov_iter *ii_p) |
1254 | { | 1252 | { |
1255 | struct iov_iter ii = *ii_p; | 1253 | struct iov_iter ii = *ii_p; |
1256 | int npages = 0; | 1254 | int npages = 0; |
1257 | 1255 | ||
1258 | while (iov_iter_count(&ii) && npages < FUSE_MAX_PAGES_PER_REQ) { | 1256 | while (iov_iter_count(&ii) && npages < FUSE_MAX_PAGES_PER_REQ) { |
1259 | unsigned long user_addr = fuse_get_user_addr(&ii); | 1257 | unsigned long user_addr = fuse_get_user_addr(&ii); |
1260 | unsigned offset = user_addr & ~PAGE_MASK; | 1258 | unsigned offset = user_addr & ~PAGE_MASK; |
1261 | size_t frag_size = iov_iter_single_seg_count(&ii); | 1259 | size_t frag_size = iov_iter_single_seg_count(&ii); |
1262 | 1260 | ||
1263 | npages += (frag_size + offset + PAGE_SIZE - 1) >> PAGE_SHIFT; | 1261 | npages += (frag_size + offset + PAGE_SIZE - 1) >> PAGE_SHIFT; |
1264 | iov_iter_advance(&ii, frag_size); | 1262 | iov_iter_advance(&ii, frag_size); |
1265 | } | 1263 | } |
1266 | 1264 | ||
1267 | return min(npages, FUSE_MAX_PAGES_PER_REQ); | 1265 | return min(npages, FUSE_MAX_PAGES_PER_REQ); |
1268 | } | 1266 | } |
1269 | 1267 | ||
1270 | ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov, | 1268 | ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov, |
1271 | unsigned long nr_segs, size_t count, loff_t *ppos, | 1269 | unsigned long nr_segs, size_t count, loff_t *ppos, |
1272 | int write) | 1270 | int write) |
1273 | { | 1271 | { |
1274 | struct file *file = io->file; | 1272 | struct file *file = io->file; |
1275 | struct fuse_file *ff = file->private_data; | 1273 | struct fuse_file *ff = file->private_data; |
1276 | struct fuse_conn *fc = ff->fc; | 1274 | struct fuse_conn *fc = ff->fc; |
1277 | size_t nmax = write ? fc->max_write : fc->max_read; | 1275 | size_t nmax = write ? fc->max_write : fc->max_read; |
1278 | loff_t pos = *ppos; | 1276 | loff_t pos = *ppos; |
1279 | ssize_t res = 0; | 1277 | ssize_t res = 0; |
1280 | struct fuse_req *req; | 1278 | struct fuse_req *req; |
1281 | struct iov_iter ii; | 1279 | struct iov_iter ii; |
1282 | 1280 | ||
1283 | iov_iter_init(&ii, iov, nr_segs, count, 0); | 1281 | iov_iter_init(&ii, iov, nr_segs, count, 0); |
1284 | 1282 | ||
1285 | if (io->async) | 1283 | if (io->async) |
1286 | req = fuse_get_req_for_background(fc, fuse_iter_npages(&ii)); | 1284 | req = fuse_get_req_for_background(fc, fuse_iter_npages(&ii)); |
1287 | else | 1285 | else |
1288 | req = fuse_get_req(fc, fuse_iter_npages(&ii)); | 1286 | req = fuse_get_req(fc, fuse_iter_npages(&ii)); |
1289 | if (IS_ERR(req)) | 1287 | if (IS_ERR(req)) |
1290 | return PTR_ERR(req); | 1288 | return PTR_ERR(req); |
1291 | 1289 | ||
1292 | while (count) { | 1290 | while (count) { |
1293 | size_t nres; | 1291 | size_t nres; |
1294 | fl_owner_t owner = current->files; | 1292 | fl_owner_t owner = current->files; |
1295 | size_t nbytes = min(count, nmax); | 1293 | size_t nbytes = min(count, nmax); |
1296 | int err = fuse_get_user_pages(req, &ii, &nbytes, write); | 1294 | int err = fuse_get_user_pages(req, &ii, &nbytes, write); |
1297 | if (err) { | 1295 | if (err) { |
1298 | res = err; | 1296 | res = err; |
1299 | break; | 1297 | break; |
1300 | } | 1298 | } |
1301 | 1299 | ||
1302 | if (write) | 1300 | if (write) |
1303 | nres = fuse_send_write(req, io, pos, nbytes, owner); | 1301 | nres = fuse_send_write(req, io, pos, nbytes, owner); |
1304 | else | 1302 | else |
1305 | nres = fuse_send_read(req, io, pos, nbytes, owner); | 1303 | nres = fuse_send_read(req, io, pos, nbytes, owner); |
1306 | 1304 | ||
1307 | if (!io->async) | 1305 | if (!io->async) |
1308 | fuse_release_user_pages(req, !write); | 1306 | fuse_release_user_pages(req, !write); |
1309 | if (req->out.h.error) { | 1307 | if (req->out.h.error) { |
1310 | if (!res) | 1308 | if (!res) |
1311 | res = req->out.h.error; | 1309 | res = req->out.h.error; |
1312 | break; | 1310 | break; |
1313 | } else if (nres > nbytes) { | 1311 | } else if (nres > nbytes) { |
1314 | res = -EIO; | 1312 | res = -EIO; |
1315 | break; | 1313 | break; |
1316 | } | 1314 | } |
1317 | count -= nres; | 1315 | count -= nres; |
1318 | res += nres; | 1316 | res += nres; |
1319 | pos += nres; | 1317 | pos += nres; |
1320 | if (nres != nbytes) | 1318 | if (nres != nbytes) |
1321 | break; | 1319 | break; |
1322 | if (count) { | 1320 | if (count) { |
1323 | fuse_put_request(fc, req); | 1321 | fuse_put_request(fc, req); |
1324 | if (io->async) | 1322 | if (io->async) |
1325 | req = fuse_get_req_for_background(fc, | 1323 | req = fuse_get_req_for_background(fc, |
1326 | fuse_iter_npages(&ii)); | 1324 | fuse_iter_npages(&ii)); |
1327 | else | 1325 | else |
1328 | req = fuse_get_req(fc, fuse_iter_npages(&ii)); | 1326 | req = fuse_get_req(fc, fuse_iter_npages(&ii)); |
1329 | if (IS_ERR(req)) | 1327 | if (IS_ERR(req)) |
1330 | break; | 1328 | break; |
1331 | } | 1329 | } |
1332 | } | 1330 | } |
1333 | if (!IS_ERR(req)) | 1331 | if (!IS_ERR(req)) |
1334 | fuse_put_request(fc, req); | 1332 | fuse_put_request(fc, req); |
1335 | if (res > 0) | 1333 | if (res > 0) |
1336 | *ppos = pos; | 1334 | *ppos = pos; |
1337 | 1335 | ||
1338 | return res; | 1336 | return res; |
1339 | } | 1337 | } |
1340 | EXPORT_SYMBOL_GPL(fuse_direct_io); | 1338 | EXPORT_SYMBOL_GPL(fuse_direct_io); |
1341 | 1339 | ||
1342 | static ssize_t __fuse_direct_read(struct fuse_io_priv *io, | 1340 | static ssize_t __fuse_direct_read(struct fuse_io_priv *io, |
1343 | const struct iovec *iov, | 1341 | const struct iovec *iov, |
1344 | unsigned long nr_segs, loff_t *ppos, | 1342 | unsigned long nr_segs, loff_t *ppos, |
1345 | size_t count) | 1343 | size_t count) |
1346 | { | 1344 | { |
1347 | ssize_t res; | 1345 | ssize_t res; |
1348 | struct file *file = io->file; | 1346 | struct file *file = io->file; |
1349 | struct inode *inode = file_inode(file); | 1347 | struct inode *inode = file_inode(file); |
1350 | 1348 | ||
1351 | if (is_bad_inode(inode)) | 1349 | if (is_bad_inode(inode)) |
1352 | return -EIO; | 1350 | return -EIO; |
1353 | 1351 | ||
1354 | res = fuse_direct_io(io, iov, nr_segs, count, ppos, 0); | 1352 | res = fuse_direct_io(io, iov, nr_segs, count, ppos, 0); |
1355 | 1353 | ||
1356 | fuse_invalidate_attr(inode); | 1354 | fuse_invalidate_attr(inode); |
1357 | 1355 | ||
1358 | return res; | 1356 | return res; |
1359 | } | 1357 | } |
1360 | 1358 | ||
1361 | static ssize_t fuse_direct_read(struct file *file, char __user *buf, | 1359 | static ssize_t fuse_direct_read(struct file *file, char __user *buf, |
1362 | size_t count, loff_t *ppos) | 1360 | size_t count, loff_t *ppos) |
1363 | { | 1361 | { |
1364 | struct fuse_io_priv io = { .async = 0, .file = file }; | 1362 | struct fuse_io_priv io = { .async = 0, .file = file }; |
1365 | struct iovec iov = { .iov_base = buf, .iov_len = count }; | 1363 | struct iovec iov = { .iov_base = buf, .iov_len = count }; |
1366 | return __fuse_direct_read(&io, &iov, 1, ppos, count); | 1364 | return __fuse_direct_read(&io, &iov, 1, ppos, count); |
1367 | } | 1365 | } |
1368 | 1366 | ||
1369 | static ssize_t __fuse_direct_write(struct fuse_io_priv *io, | 1367 | static ssize_t __fuse_direct_write(struct fuse_io_priv *io, |
1370 | const struct iovec *iov, | 1368 | const struct iovec *iov, |
1371 | unsigned long nr_segs, loff_t *ppos) | 1369 | unsigned long nr_segs, loff_t *ppos) |
1372 | { | 1370 | { |
1373 | struct file *file = io->file; | 1371 | struct file *file = io->file; |
1374 | struct inode *inode = file_inode(file); | 1372 | struct inode *inode = file_inode(file); |
1375 | size_t count = iov_length(iov, nr_segs); | 1373 | size_t count = iov_length(iov, nr_segs); |
1376 | ssize_t res; | 1374 | ssize_t res; |
1377 | 1375 | ||
1378 | res = generic_write_checks(file, ppos, &count, 0); | 1376 | res = generic_write_checks(file, ppos, &count, 0); |
1379 | if (!res) | 1377 | if (!res) |
1380 | res = fuse_direct_io(io, iov, nr_segs, count, ppos, 1); | 1378 | res = fuse_direct_io(io, iov, nr_segs, count, ppos, 1); |
1381 | 1379 | ||
1382 | fuse_invalidate_attr(inode); | 1380 | fuse_invalidate_attr(inode); |
1383 | 1381 | ||
1384 | return res; | 1382 | return res; |
1385 | } | 1383 | } |
1386 | 1384 | ||
1387 | static ssize_t fuse_direct_write(struct file *file, const char __user *buf, | 1385 | static ssize_t fuse_direct_write(struct file *file, const char __user *buf, |
1388 | size_t count, loff_t *ppos) | 1386 | size_t count, loff_t *ppos) |
1389 | { | 1387 | { |
1390 | struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count }; | 1388 | struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count }; |
1391 | struct inode *inode = file_inode(file); | 1389 | struct inode *inode = file_inode(file); |
1392 | ssize_t res; | 1390 | ssize_t res; |
1393 | struct fuse_io_priv io = { .async = 0, .file = file }; | 1391 | struct fuse_io_priv io = { .async = 0, .file = file }; |
1394 | 1392 | ||
1395 | if (is_bad_inode(inode)) | 1393 | if (is_bad_inode(inode)) |
1396 | return -EIO; | 1394 | return -EIO; |
1397 | 1395 | ||
1398 | /* Don't allow parallel writes to the same file */ | 1396 | /* Don't allow parallel writes to the same file */ |
1399 | mutex_lock(&inode->i_mutex); | 1397 | mutex_lock(&inode->i_mutex); |
1400 | res = __fuse_direct_write(&io, &iov, 1, ppos); | 1398 | res = __fuse_direct_write(&io, &iov, 1, ppos); |
1401 | if (res > 0) | 1399 | if (res > 0) |
1402 | fuse_write_update_size(inode, *ppos); | 1400 | fuse_write_update_size(inode, *ppos); |
1403 | mutex_unlock(&inode->i_mutex); | 1401 | mutex_unlock(&inode->i_mutex); |
1404 | 1402 | ||
1405 | return res; | 1403 | return res; |
1406 | } | 1404 | } |
1407 | 1405 | ||
1408 | static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req) | 1406 | static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req) |
1409 | { | 1407 | { |
1410 | __free_page(req->pages[0]); | 1408 | __free_page(req->pages[0]); |
1411 | fuse_file_put(req->ff, false); | 1409 | fuse_file_put(req->ff, false); |
1412 | } | 1410 | } |
1413 | 1411 | ||
1414 | static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) | 1412 | static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) |
1415 | { | 1413 | { |
1416 | struct inode *inode = req->inode; | 1414 | struct inode *inode = req->inode; |
1417 | struct fuse_inode *fi = get_fuse_inode(inode); | 1415 | struct fuse_inode *fi = get_fuse_inode(inode); |
1418 | struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info; | 1416 | struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info; |
1419 | 1417 | ||
1420 | list_del(&req->writepages_entry); | 1418 | list_del(&req->writepages_entry); |
1421 | dec_bdi_stat(bdi, BDI_WRITEBACK); | 1419 | dec_bdi_stat(bdi, BDI_WRITEBACK); |
1422 | dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP); | 1420 | dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP); |
1423 | bdi_writeout_inc(bdi); | 1421 | bdi_writeout_inc(bdi); |
1424 | wake_up(&fi->page_waitq); | 1422 | wake_up(&fi->page_waitq); |
1425 | } | 1423 | } |
1426 | 1424 | ||
1427 | /* Called under fc->lock, may release and reacquire it */ | 1425 | /* Called under fc->lock, may release and reacquire it */ |
1428 | static void fuse_send_writepage(struct fuse_conn *fc, struct fuse_req *req) | 1426 | static void fuse_send_writepage(struct fuse_conn *fc, struct fuse_req *req) |
1429 | __releases(fc->lock) | 1427 | __releases(fc->lock) |
1430 | __acquires(fc->lock) | 1428 | __acquires(fc->lock) |
1431 | { | 1429 | { |
1432 | struct fuse_inode *fi = get_fuse_inode(req->inode); | 1430 | struct fuse_inode *fi = get_fuse_inode(req->inode); |
1433 | loff_t size = i_size_read(req->inode); | 1431 | loff_t size = i_size_read(req->inode); |
1434 | struct fuse_write_in *inarg = &req->misc.write.in; | 1432 | struct fuse_write_in *inarg = &req->misc.write.in; |
1435 | 1433 | ||
1436 | if (!fc->connected) | 1434 | if (!fc->connected) |
1437 | goto out_free; | 1435 | goto out_free; |
1438 | 1436 | ||
1439 | if (inarg->offset + PAGE_CACHE_SIZE <= size) { | 1437 | if (inarg->offset + PAGE_CACHE_SIZE <= size) { |
1440 | inarg->size = PAGE_CACHE_SIZE; | 1438 | inarg->size = PAGE_CACHE_SIZE; |
1441 | } else if (inarg->offset < size) { | 1439 | } else if (inarg->offset < size) { |
1442 | inarg->size = size & (PAGE_CACHE_SIZE - 1); | 1440 | inarg->size = size & (PAGE_CACHE_SIZE - 1); |
1443 | } else { | 1441 | } else { |
1444 | /* Got truncated off completely */ | 1442 | /* Got truncated off completely */ |
1445 | goto out_free; | 1443 | goto out_free; |
1446 | } | 1444 | } |
1447 | 1445 | ||
1448 | req->in.args[1].size = inarg->size; | 1446 | req->in.args[1].size = inarg->size; |
1449 | fi->writectr++; | 1447 | fi->writectr++; |
1450 | fuse_request_send_background_locked(fc, req); | 1448 | fuse_request_send_background_locked(fc, req); |
1451 | return; | 1449 | return; |
1452 | 1450 | ||
1453 | out_free: | 1451 | out_free: |
1454 | fuse_writepage_finish(fc, req); | 1452 | fuse_writepage_finish(fc, req); |
1455 | spin_unlock(&fc->lock); | 1453 | spin_unlock(&fc->lock); |
1456 | fuse_writepage_free(fc, req); | 1454 | fuse_writepage_free(fc, req); |
1457 | fuse_put_request(fc, req); | 1455 | fuse_put_request(fc, req); |
1458 | spin_lock(&fc->lock); | 1456 | spin_lock(&fc->lock); |
1459 | } | 1457 | } |
1460 | 1458 | ||
1461 | /* | 1459 | /* |
1462 | * If fi->writectr is positive (no truncate or fsync going on) send | 1460 | * If fi->writectr is positive (no truncate or fsync going on) send |
1463 | * all queued writepage requests. | 1461 | * all queued writepage requests. |
1464 | * | 1462 | * |
1465 | * Called with fc->lock | 1463 | * Called with fc->lock |
1466 | */ | 1464 | */ |
1467 | void fuse_flush_writepages(struct inode *inode) | 1465 | void fuse_flush_writepages(struct inode *inode) |
1468 | __releases(fc->lock) | 1466 | __releases(fc->lock) |
1469 | __acquires(fc->lock) | 1467 | __acquires(fc->lock) |
1470 | { | 1468 | { |
1471 | struct fuse_conn *fc = get_fuse_conn(inode); | 1469 | struct fuse_conn *fc = get_fuse_conn(inode); |
1472 | struct fuse_inode *fi = get_fuse_inode(inode); | 1470 | struct fuse_inode *fi = get_fuse_inode(inode); |
1473 | struct fuse_req *req; | 1471 | struct fuse_req *req; |
1474 | 1472 | ||
1475 | while (fi->writectr >= 0 && !list_empty(&fi->queued_writes)) { | 1473 | while (fi->writectr >= 0 && !list_empty(&fi->queued_writes)) { |
1476 | req = list_entry(fi->queued_writes.next, struct fuse_req, list); | 1474 | req = list_entry(fi->queued_writes.next, struct fuse_req, list); |
1477 | list_del_init(&req->list); | 1475 | list_del_init(&req->list); |
1478 | fuse_send_writepage(fc, req); | 1476 | fuse_send_writepage(fc, req); |
1479 | } | 1477 | } |
1480 | } | 1478 | } |
1481 | 1479 | ||
1482 | static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req) | 1480 | static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req) |
1483 | { | 1481 | { |
1484 | struct inode *inode = req->inode; | 1482 | struct inode *inode = req->inode; |
1485 | struct fuse_inode *fi = get_fuse_inode(inode); | 1483 | struct fuse_inode *fi = get_fuse_inode(inode); |
1486 | 1484 | ||
1487 | mapping_set_error(inode->i_mapping, req->out.h.error); | 1485 | mapping_set_error(inode->i_mapping, req->out.h.error); |
1488 | spin_lock(&fc->lock); | 1486 | spin_lock(&fc->lock); |
1489 | fi->writectr--; | 1487 | fi->writectr--; |
1490 | fuse_writepage_finish(fc, req); | 1488 | fuse_writepage_finish(fc, req); |
1491 | spin_unlock(&fc->lock); | 1489 | spin_unlock(&fc->lock); |
1492 | fuse_writepage_free(fc, req); | 1490 | fuse_writepage_free(fc, req); |
1493 | } | 1491 | } |
1494 | 1492 | ||
1495 | static int fuse_writepage_locked(struct page *page) | 1493 | static int fuse_writepage_locked(struct page *page) |
1496 | { | 1494 | { |
1497 | struct address_space *mapping = page->mapping; | 1495 | struct address_space *mapping = page->mapping; |
1498 | struct inode *inode = mapping->host; | 1496 | struct inode *inode = mapping->host; |
1499 | struct fuse_conn *fc = get_fuse_conn(inode); | 1497 | struct fuse_conn *fc = get_fuse_conn(inode); |
1500 | struct fuse_inode *fi = get_fuse_inode(inode); | 1498 | struct fuse_inode *fi = get_fuse_inode(inode); |
1501 | struct fuse_req *req; | 1499 | struct fuse_req *req; |
1502 | struct fuse_file *ff; | 1500 | struct fuse_file *ff; |
1503 | struct page *tmp_page; | 1501 | struct page *tmp_page; |
1504 | 1502 | ||
1505 | set_page_writeback(page); | 1503 | set_page_writeback(page); |
1506 | 1504 | ||
1507 | req = fuse_request_alloc_nofs(1); | 1505 | req = fuse_request_alloc_nofs(1); |
1508 | if (!req) | 1506 | if (!req) |
1509 | goto err; | 1507 | goto err; |
1510 | 1508 | ||
1511 | req->background = 1; /* writeback always goes to bg_queue */ | 1509 | req->background = 1; /* writeback always goes to bg_queue */ |
1512 | tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); | 1510 | tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); |
1513 | if (!tmp_page) | 1511 | if (!tmp_page) |
1514 | goto err_free; | 1512 | goto err_free; |
1515 | 1513 | ||
1516 | spin_lock(&fc->lock); | 1514 | spin_lock(&fc->lock); |
1517 | BUG_ON(list_empty(&fi->write_files)); | 1515 | BUG_ON(list_empty(&fi->write_files)); |
1518 | ff = list_entry(fi->write_files.next, struct fuse_file, write_entry); | 1516 | ff = list_entry(fi->write_files.next, struct fuse_file, write_entry); |
1519 | req->ff = fuse_file_get(ff); | 1517 | req->ff = fuse_file_get(ff); |
1520 | spin_unlock(&fc->lock); | 1518 | spin_unlock(&fc->lock); |
1521 | 1519 | ||
1522 | fuse_write_fill(req, ff, page_offset(page), 0); | 1520 | fuse_write_fill(req, ff, page_offset(page), 0); |
1523 | 1521 | ||
1524 | copy_highpage(tmp_page, page); | 1522 | copy_highpage(tmp_page, page); |
1525 | req->misc.write.in.write_flags |= FUSE_WRITE_CACHE; | 1523 | req->misc.write.in.write_flags |= FUSE_WRITE_CACHE; |
1526 | req->in.argpages = 1; | 1524 | req->in.argpages = 1; |
1527 | req->num_pages = 1; | 1525 | req->num_pages = 1; |
1528 | req->pages[0] = tmp_page; | 1526 | req->pages[0] = tmp_page; |
1529 | req->page_descs[0].offset = 0; | 1527 | req->page_descs[0].offset = 0; |
1530 | req->page_descs[0].length = PAGE_SIZE; | 1528 | req->page_descs[0].length = PAGE_SIZE; |
1531 | req->end = fuse_writepage_end; | 1529 | req->end = fuse_writepage_end; |
1532 | req->inode = inode; | 1530 | req->inode = inode; |
1533 | 1531 | ||
1534 | inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK); | 1532 | inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK); |
1535 | inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP); | 1533 | inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP); |
1536 | 1534 | ||
1537 | spin_lock(&fc->lock); | 1535 | spin_lock(&fc->lock); |
1538 | list_add(&req->writepages_entry, &fi->writepages); | 1536 | list_add(&req->writepages_entry, &fi->writepages); |
1539 | list_add_tail(&req->list, &fi->queued_writes); | 1537 | list_add_tail(&req->list, &fi->queued_writes); |
1540 | fuse_flush_writepages(inode); | 1538 | fuse_flush_writepages(inode); |
1541 | spin_unlock(&fc->lock); | 1539 | spin_unlock(&fc->lock); |
1542 | 1540 | ||
1543 | end_page_writeback(page); | 1541 | end_page_writeback(page); |
1544 | 1542 | ||
1545 | return 0; | 1543 | return 0; |
1546 | 1544 | ||
1547 | err_free: | 1545 | err_free: |
1548 | fuse_request_free(req); | 1546 | fuse_request_free(req); |
1549 | err: | 1547 | err: |
1550 | end_page_writeback(page); | 1548 | end_page_writeback(page); |
1551 | return -ENOMEM; | 1549 | return -ENOMEM; |
1552 | } | 1550 | } |
1553 | 1551 | ||
1554 | static int fuse_writepage(struct page *page, struct writeback_control *wbc) | 1552 | static int fuse_writepage(struct page *page, struct writeback_control *wbc) |
1555 | { | 1553 | { |
1556 | int err; | 1554 | int err; |
1557 | 1555 | ||
1558 | err = fuse_writepage_locked(page); | 1556 | err = fuse_writepage_locked(page); |
1559 | unlock_page(page); | 1557 | unlock_page(page); |
1560 | 1558 | ||
1561 | return err; | 1559 | return err; |
1562 | } | 1560 | } |
1563 | 1561 | ||
1564 | static int fuse_launder_page(struct page *page) | 1562 | static int fuse_launder_page(struct page *page) |
1565 | { | 1563 | { |
1566 | int err = 0; | 1564 | int err = 0; |
1567 | if (clear_page_dirty_for_io(page)) { | 1565 | if (clear_page_dirty_for_io(page)) { |
1568 | struct inode *inode = page->mapping->host; | 1566 | struct inode *inode = page->mapping->host; |
1569 | err = fuse_writepage_locked(page); | 1567 | err = fuse_writepage_locked(page); |
1570 | if (!err) | 1568 | if (!err) |
1571 | fuse_wait_on_page_writeback(inode, page->index); | 1569 | fuse_wait_on_page_writeback(inode, page->index); |
1572 | } | 1570 | } |
1573 | return err; | 1571 | return err; |
1574 | } | 1572 | } |
1575 | 1573 | ||
1576 | /* | 1574 | /* |
1577 | * Write back dirty pages now, because there may not be any suitable | 1575 | * Write back dirty pages now, because there may not be any suitable |
1578 | * open files later | 1576 | * open files later |
1579 | */ | 1577 | */ |
1580 | static void fuse_vma_close(struct vm_area_struct *vma) | 1578 | static void fuse_vma_close(struct vm_area_struct *vma) |
1581 | { | 1579 | { |
1582 | filemap_write_and_wait(vma->vm_file->f_mapping); | 1580 | filemap_write_and_wait(vma->vm_file->f_mapping); |
1583 | } | 1581 | } |
1584 | 1582 | ||
1585 | /* | 1583 | /* |
1586 | * Wait for writeback against this page to complete before allowing it | 1584 | * Wait for writeback against this page to complete before allowing it |
1587 | * to be marked dirty again, and hence written back again, possibly | 1585 | * to be marked dirty again, and hence written back again, possibly |
1588 | * before the previous writepage completed. | 1586 | * before the previous writepage completed. |
1589 | * | 1587 | * |
1590 | * Block here, instead of in ->writepage(), so that the userspace fs | 1588 | * Block here, instead of in ->writepage(), so that the userspace fs |
1591 | * can only block processes actually operating on the filesystem. | 1589 | * can only block processes actually operating on the filesystem. |
1592 | * | 1590 | * |
1593 | * Otherwise unprivileged userspace fs would be able to block | 1591 | * Otherwise unprivileged userspace fs would be able to block |
1594 | * unrelated: | 1592 | * unrelated: |
1595 | * | 1593 | * |
1596 | * - page migration | 1594 | * - page migration |
1597 | * - sync(2) | 1595 | * - sync(2) |
1598 | * - try_to_free_pages() with order > PAGE_ALLOC_COSTLY_ORDER | 1596 | * - try_to_free_pages() with order > PAGE_ALLOC_COSTLY_ORDER |
1599 | */ | 1597 | */ |
1600 | static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) | 1598 | static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) |
1601 | { | 1599 | { |
1602 | struct page *page = vmf->page; | 1600 | struct page *page = vmf->page; |
1603 | /* | 1601 | /* |
1604 | * Don't use page->mapping as it may become NULL from a | 1602 | * Don't use page->mapping as it may become NULL from a |
1605 | * concurrent truncate. | 1603 | * concurrent truncate. |
1606 | */ | 1604 | */ |
1607 | struct inode *inode = vma->vm_file->f_mapping->host; | 1605 | struct inode *inode = vma->vm_file->f_mapping->host; |
1608 | 1606 | ||
1609 | fuse_wait_on_page_writeback(inode, page->index); | 1607 | fuse_wait_on_page_writeback(inode, page->index); |
1610 | return 0; | 1608 | return 0; |
1611 | } | 1609 | } |
1612 | 1610 | ||
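A hypothetical userspace sequence that exercises the path above (the mount point and page size are illustrative assumptions, error handling omitted): once writeback of the dirtied page has started, its PTE is write-protected again, so the next store faults into fuse_page_mkwrite() and waits until the userspace filesystem has completed the queued write.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/fuse/file", O_RDWR);	/* a file on a FUSE mount (example path) */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	p[0] = 'a';	/* dirties the page through the shared mapping */
	sleep(30);	/* give background writeback a chance to pick the page up */
	p[1] = 'b';	/* if writeback ran, this store faults into fuse_page_mkwrite()
			   and blocks until the queued FUSE write completes */

	munmap(p, 4096);
	close(fd);
	return 0;
}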
1613 | static const struct vm_operations_struct fuse_file_vm_ops = { | 1611 | static const struct vm_operations_struct fuse_file_vm_ops = { |
1614 | .close = fuse_vma_close, | 1612 | .close = fuse_vma_close, |
1615 | .fault = filemap_fault, | 1613 | .fault = filemap_fault, |
1616 | .page_mkwrite = fuse_page_mkwrite, | 1614 | .page_mkwrite = fuse_page_mkwrite, |
1617 | .remap_pages = generic_file_remap_pages, | 1615 | .remap_pages = generic_file_remap_pages, |
1618 | }; | 1616 | }; |
1619 | 1617 | ||
1620 | static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma) | 1618 | static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma) |
1621 | { | 1619 | { |
1622 | if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) { | 1620 | if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) { |
1623 | struct inode *inode = file_inode(file); | 1621 | struct inode *inode = file_inode(file); |
1624 | struct fuse_conn *fc = get_fuse_conn(inode); | 1622 | struct fuse_conn *fc = get_fuse_conn(inode); |
1625 | struct fuse_inode *fi = get_fuse_inode(inode); | 1623 | struct fuse_inode *fi = get_fuse_inode(inode); |
1626 | struct fuse_file *ff = file->private_data; | 1624 | struct fuse_file *ff = file->private_data; |
1627 | /* | 1625 | /* |
1628 | * file may be written through mmap, so chain it onto the | 1626 | * file may be written through mmap, so chain it onto the |
1629 | * inodes's write_file list | 1627 | * inodes's write_file list |
1629 | * inode's write_files list | 1627 | * inode's write_files list |
1630 | */ | 1628 | */ |
1631 | spin_lock(&fc->lock); | 1629 | spin_lock(&fc->lock); |
1632 | if (list_empty(&ff->write_entry)) | 1630 | if (list_empty(&ff->write_entry)) |
1633 | list_add(&ff->write_entry, &fi->write_files); | 1631 | list_add(&ff->write_entry, &fi->write_files); |
1634 | spin_unlock(&fc->lock); | 1632 | spin_unlock(&fc->lock); |
1635 | } | 1633 | } |
1636 | file_accessed(file); | 1634 | file_accessed(file); |
1637 | vma->vm_ops = &fuse_file_vm_ops; | 1635 | vma->vm_ops = &fuse_file_vm_ops; |
1638 | return 0; | 1636 | return 0; |
1639 | } | 1637 | } |
1640 | 1638 | ||
1641 | static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma) | 1639 | static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma) |
1642 | { | 1640 | { |
1643 | /* Can't provide the coherency needed for MAP_SHARED */ | 1641 | /* Can't provide the coherency needed for MAP_SHARED */ |
1644 | if (vma->vm_flags & VM_MAYSHARE) | 1642 | if (vma->vm_flags & VM_MAYSHARE) |
1645 | return -ENODEV; | 1643 | return -ENODEV; |
1646 | 1644 | ||
1647 | invalidate_inode_pages2(file->f_mapping); | 1645 | invalidate_inode_pages2(file->f_mapping); |
1648 | 1646 | ||
1649 | return generic_file_mmap(file, vma); | 1647 | return generic_file_mmap(file, vma); |
1650 | } | 1648 | } |
1651 | 1649 | ||
1652 | static int convert_fuse_file_lock(const struct fuse_file_lock *ffl, | 1650 | static int convert_fuse_file_lock(const struct fuse_file_lock *ffl, |
1653 | struct file_lock *fl) | 1651 | struct file_lock *fl) |
1654 | { | 1652 | { |
1655 | switch (ffl->type) { | 1653 | switch (ffl->type) { |
1656 | case F_UNLCK: | 1654 | case F_UNLCK: |
1657 | break; | 1655 | break; |
1658 | 1656 | ||
1659 | case F_RDLCK: | 1657 | case F_RDLCK: |
1660 | case F_WRLCK: | 1658 | case F_WRLCK: |
1661 | if (ffl->start > OFFSET_MAX || ffl->end > OFFSET_MAX || | 1659 | if (ffl->start > OFFSET_MAX || ffl->end > OFFSET_MAX || |
1662 | ffl->end < ffl->start) | 1660 | ffl->end < ffl->start) |
1663 | return -EIO; | 1661 | return -EIO; |
1664 | 1662 | ||
1665 | fl->fl_start = ffl->start; | 1663 | fl->fl_start = ffl->start; |
1666 | fl->fl_end = ffl->end; | 1664 | fl->fl_end = ffl->end; |
1667 | fl->fl_pid = ffl->pid; | 1665 | fl->fl_pid = ffl->pid; |
1668 | break; | 1666 | break; |
1669 | 1667 | ||
1670 | default: | 1668 | default: |
1671 | return -EIO; | 1669 | return -EIO; |
1672 | } | 1670 | } |
1673 | fl->fl_type = ffl->type; | 1671 | fl->fl_type = ffl->type; |
1674 | return 0; | 1672 | return 0; |
1675 | } | 1673 | } |
1676 | 1674 | ||
1677 | static void fuse_lk_fill(struct fuse_req *req, struct file *file, | 1675 | static void fuse_lk_fill(struct fuse_req *req, struct file *file, |
1678 | const struct file_lock *fl, int opcode, pid_t pid, | 1676 | const struct file_lock *fl, int opcode, pid_t pid, |
1679 | int flock) | 1677 | int flock) |
1680 | { | 1678 | { |
1681 | struct inode *inode = file_inode(file); | 1679 | struct inode *inode = file_inode(file); |
1682 | struct fuse_conn *fc = get_fuse_conn(inode); | 1680 | struct fuse_conn *fc = get_fuse_conn(inode); |
1683 | struct fuse_file *ff = file->private_data; | 1681 | struct fuse_file *ff = file->private_data; |
1684 | struct fuse_lk_in *arg = &req->misc.lk_in; | 1682 | struct fuse_lk_in *arg = &req->misc.lk_in; |
1685 | 1683 | ||
1686 | arg->fh = ff->fh; | 1684 | arg->fh = ff->fh; |
1687 | arg->owner = fuse_lock_owner_id(fc, fl->fl_owner); | 1685 | arg->owner = fuse_lock_owner_id(fc, fl->fl_owner); |
1688 | arg->lk.start = fl->fl_start; | 1686 | arg->lk.start = fl->fl_start; |
1689 | arg->lk.end = fl->fl_end; | 1687 | arg->lk.end = fl->fl_end; |
1690 | arg->lk.type = fl->fl_type; | 1688 | arg->lk.type = fl->fl_type; |
1691 | arg->lk.pid = pid; | 1689 | arg->lk.pid = pid; |
1692 | if (flock) | 1690 | if (flock) |
1693 | arg->lk_flags |= FUSE_LK_FLOCK; | 1691 | arg->lk_flags |= FUSE_LK_FLOCK; |
1694 | req->in.h.opcode = opcode; | 1692 | req->in.h.opcode = opcode; |
1695 | req->in.h.nodeid = get_node_id(inode); | 1693 | req->in.h.nodeid = get_node_id(inode); |
1696 | req->in.numargs = 1; | 1694 | req->in.numargs = 1; |
1697 | req->in.args[0].size = sizeof(*arg); | 1695 | req->in.args[0].size = sizeof(*arg); |
1698 | req->in.args[0].value = arg; | 1696 | req->in.args[0].value = arg; |
1699 | } | 1697 | } |
1700 | 1698 | ||
1701 | static int fuse_getlk(struct file *file, struct file_lock *fl) | 1699 | static int fuse_getlk(struct file *file, struct file_lock *fl) |
1702 | { | 1700 | { |
1703 | struct inode *inode = file_inode(file); | 1701 | struct inode *inode = file_inode(file); |
1704 | struct fuse_conn *fc = get_fuse_conn(inode); | 1702 | struct fuse_conn *fc = get_fuse_conn(inode); |
1705 | struct fuse_req *req; | 1703 | struct fuse_req *req; |
1706 | struct fuse_lk_out outarg; | 1704 | struct fuse_lk_out outarg; |
1707 | int err; | 1705 | int err; |
1708 | 1706 | ||
1709 | req = fuse_get_req_nopages(fc); | 1707 | req = fuse_get_req_nopages(fc); |
1710 | if (IS_ERR(req)) | 1708 | if (IS_ERR(req)) |
1711 | return PTR_ERR(req); | 1709 | return PTR_ERR(req); |
1712 | 1710 | ||
1713 | fuse_lk_fill(req, file, fl, FUSE_GETLK, 0, 0); | 1711 | fuse_lk_fill(req, file, fl, FUSE_GETLK, 0, 0); |
1714 | req->out.numargs = 1; | 1712 | req->out.numargs = 1; |
1715 | req->out.args[0].size = sizeof(outarg); | 1713 | req->out.args[0].size = sizeof(outarg); |
1716 | req->out.args[0].value = &outarg; | 1714 | req->out.args[0].value = &outarg; |
1717 | fuse_request_send(fc, req); | 1715 | fuse_request_send(fc, req); |
1718 | err = req->out.h.error; | 1716 | err = req->out.h.error; |
1719 | fuse_put_request(fc, req); | 1717 | fuse_put_request(fc, req); |
1720 | if (!err) | 1718 | if (!err) |
1721 | err = convert_fuse_file_lock(&outarg.lk, fl); | 1719 | err = convert_fuse_file_lock(&outarg.lk, fl); |
1722 | 1720 | ||
1723 | return err; | 1721 | return err; |
1724 | } | 1722 | } |
1725 | 1723 | ||
1726 | static int fuse_setlk(struct file *file, struct file_lock *fl, int flock) | 1724 | static int fuse_setlk(struct file *file, struct file_lock *fl, int flock) |
1727 | { | 1725 | { |
1728 | struct inode *inode = file_inode(file); | 1726 | struct inode *inode = file_inode(file); |
1729 | struct fuse_conn *fc = get_fuse_conn(inode); | 1727 | struct fuse_conn *fc = get_fuse_conn(inode); |
1730 | struct fuse_req *req; | 1728 | struct fuse_req *req; |
1731 | int opcode = (fl->fl_flags & FL_SLEEP) ? FUSE_SETLKW : FUSE_SETLK; | 1729 | int opcode = (fl->fl_flags & FL_SLEEP) ? FUSE_SETLKW : FUSE_SETLK; |
1732 | pid_t pid = fl->fl_type != F_UNLCK ? current->tgid : 0; | 1730 | pid_t pid = fl->fl_type != F_UNLCK ? current->tgid : 0; |
1733 | int err; | 1731 | int err; |
1734 | 1732 | ||
1735 | if (fl->fl_lmops && fl->fl_lmops->lm_grant) { | 1733 | if (fl->fl_lmops && fl->fl_lmops->lm_grant) { |
1736 | /* NLM needs asynchronous locks, which we don't support yet */ | 1734 | /* NLM needs asynchronous locks, which we don't support yet */ |
1737 | return -ENOLCK; | 1735 | return -ENOLCK; |
1738 | } | 1736 | } |
1739 | 1737 | ||
1740 | /* Unlock on close is handled by the flush method */ | 1738 | /* Unlock on close is handled by the flush method */ |
1741 | if (fl->fl_flags & FL_CLOSE) | 1739 | if (fl->fl_flags & FL_CLOSE) |
1742 | return 0; | 1740 | return 0; |
1743 | 1741 | ||
1744 | req = fuse_get_req_nopages(fc); | 1742 | req = fuse_get_req_nopages(fc); |
1745 | if (IS_ERR(req)) | 1743 | if (IS_ERR(req)) |
1746 | return PTR_ERR(req); | 1744 | return PTR_ERR(req); |
1747 | 1745 | ||
1748 | fuse_lk_fill(req, file, fl, opcode, pid, flock); | 1746 | fuse_lk_fill(req, file, fl, opcode, pid, flock); |
1749 | fuse_request_send(fc, req); | 1747 | fuse_request_send(fc, req); |
1750 | err = req->out.h.error; | 1748 | err = req->out.h.error; |
1751 | /* locking is restartable */ | 1749 | /* locking is restartable */ |
1752 | if (err == -EINTR) | 1750 | if (err == -EINTR) |
1753 | err = -ERESTARTSYS; | 1751 | err = -ERESTARTSYS; |
1754 | fuse_put_request(fc, req); | 1752 | fuse_put_request(fc, req); |
1755 | return err; | 1753 | return err; |
1756 | } | 1754 | } |
1757 | 1755 | ||
1758 | static int fuse_file_lock(struct file *file, int cmd, struct file_lock *fl) | 1756 | static int fuse_file_lock(struct file *file, int cmd, struct file_lock *fl) |
1759 | { | 1757 | { |
1760 | struct inode *inode = file_inode(file); | 1758 | struct inode *inode = file_inode(file); |
1761 | struct fuse_conn *fc = get_fuse_conn(inode); | 1759 | struct fuse_conn *fc = get_fuse_conn(inode); |
1762 | int err; | 1760 | int err; |
1763 | 1761 | ||
1764 | if (cmd == F_CANCELLK) { | 1762 | if (cmd == F_CANCELLK) { |
1765 | err = 0; | 1763 | err = 0; |
1766 | } else if (cmd == F_GETLK) { | 1764 | } else if (cmd == F_GETLK) { |
1767 | if (fc->no_lock) { | 1765 | if (fc->no_lock) { |
1768 | posix_test_lock(file, fl); | 1766 | posix_test_lock(file, fl); |
1769 | err = 0; | 1767 | err = 0; |
1770 | } else | 1768 | } else |
1771 | err = fuse_getlk(file, fl); | 1769 | err = fuse_getlk(file, fl); |
1772 | } else { | 1770 | } else { |
1773 | if (fc->no_lock) | 1771 | if (fc->no_lock) |
1774 | err = posix_lock_file(file, fl, NULL); | 1772 | err = posix_lock_file(file, fl, NULL); |
1775 | else | 1773 | else |
1776 | err = fuse_setlk(file, fl, 0); | 1774 | err = fuse_setlk(file, fl, 0); |
1777 | } | 1775 | } |
1778 | return err; | 1776 | return err; |
1779 | } | 1777 | } |
1780 | 1778 | ||
1781 | static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl) | 1779 | static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl) |
1782 | { | 1780 | { |
1783 | struct inode *inode = file_inode(file); | 1781 | struct inode *inode = file_inode(file); |
1784 | struct fuse_conn *fc = get_fuse_conn(inode); | 1782 | struct fuse_conn *fc = get_fuse_conn(inode); |
1785 | int err; | 1783 | int err; |
1786 | 1784 | ||
1787 | if (fc->no_flock) { | 1785 | if (fc->no_flock) { |
1788 | err = flock_lock_file_wait(file, fl); | 1786 | err = flock_lock_file_wait(file, fl); |
1789 | } else { | 1787 | } else { |
1790 | struct fuse_file *ff = file->private_data; | 1788 | struct fuse_file *ff = file->private_data; |
1791 | 1789 | ||
1792 | /* emulate flock with POSIX locks */ | 1790 | /* emulate flock with POSIX locks */ |
1793 | fl->fl_owner = (fl_owner_t) file; | 1791 | fl->fl_owner = (fl_owner_t) file; |
1794 | ff->flock = true; | 1792 | ff->flock = true; |
1795 | err = fuse_setlk(file, fl, 1); | 1793 | err = fuse_setlk(file, fl, 1); |
1796 | } | 1794 | } |
1797 | 1795 | ||
1798 | return err; | 1796 | return err; |
1799 | } | 1797 | } |
1800 | 1798 | ||
1801 | static sector_t fuse_bmap(struct address_space *mapping, sector_t block) | 1799 | static sector_t fuse_bmap(struct address_space *mapping, sector_t block) |
1802 | { | 1800 | { |
1803 | struct inode *inode = mapping->host; | 1801 | struct inode *inode = mapping->host; |
1804 | struct fuse_conn *fc = get_fuse_conn(inode); | 1802 | struct fuse_conn *fc = get_fuse_conn(inode); |
1805 | struct fuse_req *req; | 1803 | struct fuse_req *req; |
1806 | struct fuse_bmap_in inarg; | 1804 | struct fuse_bmap_in inarg; |
1807 | struct fuse_bmap_out outarg; | 1805 | struct fuse_bmap_out outarg; |
1808 | int err; | 1806 | int err; |
1809 | 1807 | ||
1810 | if (!inode->i_sb->s_bdev || fc->no_bmap) | 1808 | if (!inode->i_sb->s_bdev || fc->no_bmap) |
1811 | return 0; | 1809 | return 0; |
1812 | 1810 | ||
1813 | req = fuse_get_req_nopages(fc); | 1811 | req = fuse_get_req_nopages(fc); |
1814 | if (IS_ERR(req)) | 1812 | if (IS_ERR(req)) |
1815 | return 0; | 1813 | return 0; |
1816 | 1814 | ||
1817 | memset(&inarg, 0, sizeof(inarg)); | 1815 | memset(&inarg, 0, sizeof(inarg)); |
1818 | inarg.block = block; | 1816 | inarg.block = block; |
1819 | inarg.blocksize = inode->i_sb->s_blocksize; | 1817 | inarg.blocksize = inode->i_sb->s_blocksize; |
1820 | req->in.h.opcode = FUSE_BMAP; | 1818 | req->in.h.opcode = FUSE_BMAP; |
1821 | req->in.h.nodeid = get_node_id(inode); | 1819 | req->in.h.nodeid = get_node_id(inode); |
1822 | req->in.numargs = 1; | 1820 | req->in.numargs = 1; |
1823 | req->in.args[0].size = sizeof(inarg); | 1821 | req->in.args[0].size = sizeof(inarg); |
1824 | req->in.args[0].value = &inarg; | 1822 | req->in.args[0].value = &inarg; |
1825 | req->out.numargs = 1; | 1823 | req->out.numargs = 1; |
1826 | req->out.args[0].size = sizeof(outarg); | 1824 | req->out.args[0].size = sizeof(outarg); |
1827 | req->out.args[0].value = &outarg; | 1825 | req->out.args[0].value = &outarg; |
1828 | fuse_request_send(fc, req); | 1826 | fuse_request_send(fc, req); |
1829 | err = req->out.h.error; | 1827 | err = req->out.h.error; |
1830 | fuse_put_request(fc, req); | 1828 | fuse_put_request(fc, req); |
1831 | if (err == -ENOSYS) | 1829 | if (err == -ENOSYS) |
1832 | fc->no_bmap = 1; | 1830 | fc->no_bmap = 1; |
1833 | 1831 | ||
1834 | return err ? 0 : outarg.block; | 1832 | return err ? 0 : outarg.block; |
1835 | } | 1833 | } |
1836 | 1834 | ||
1837 | static loff_t fuse_file_llseek(struct file *file, loff_t offset, int whence) | 1835 | static loff_t fuse_file_llseek(struct file *file, loff_t offset, int whence) |
1838 | { | 1836 | { |
1839 | loff_t retval; | 1837 | loff_t retval; |
1840 | struct inode *inode = file_inode(file); | 1838 | struct inode *inode = file_inode(file); |
1841 | 1839 | ||
1842 | /* No i_mutex protection necessary for SEEK_CUR and SEEK_SET */ | 1840 | /* No i_mutex protection necessary for SEEK_CUR and SEEK_SET */ |
1843 | if (whence == SEEK_CUR || whence == SEEK_SET) | 1841 | if (whence == SEEK_CUR || whence == SEEK_SET) |
1844 | return generic_file_llseek(file, offset, whence); | 1842 | return generic_file_llseek(file, offset, whence); |
1845 | 1843 | ||
1846 | mutex_lock(&inode->i_mutex); | 1844 | mutex_lock(&inode->i_mutex); |
1847 | retval = fuse_update_attributes(inode, NULL, file, NULL); | 1845 | retval = fuse_update_attributes(inode, NULL, file, NULL); |
1848 | if (!retval) | 1846 | if (!retval) |
1849 | retval = generic_file_llseek(file, offset, whence); | 1847 | retval = generic_file_llseek(file, offset, whence); |
1850 | mutex_unlock(&inode->i_mutex); | 1848 | mutex_unlock(&inode->i_mutex); |
1851 | 1849 | ||
1852 | return retval; | 1850 | return retval; |
1853 | } | 1851 | } |
1854 | 1852 | ||
1855 | static int fuse_ioctl_copy_user(struct page **pages, struct iovec *iov, | 1853 | static int fuse_ioctl_copy_user(struct page **pages, struct iovec *iov, |
1856 | unsigned int nr_segs, size_t bytes, bool to_user) | 1854 | unsigned int nr_segs, size_t bytes, bool to_user) |
1857 | { | 1855 | { |
1858 | struct iov_iter ii; | 1856 | struct iov_iter ii; |
1859 | int page_idx = 0; | 1857 | int page_idx = 0; |
1860 | 1858 | ||
1861 | if (!bytes) | 1859 | if (!bytes) |
1862 | return 0; | 1860 | return 0; |
1863 | 1861 | ||
1864 | iov_iter_init(&ii, iov, nr_segs, bytes, 0); | 1862 | iov_iter_init(&ii, iov, nr_segs, bytes, 0); |
1865 | 1863 | ||
1866 | while (iov_iter_count(&ii)) { | 1864 | while (iov_iter_count(&ii)) { |
1867 | struct page *page = pages[page_idx++]; | 1865 | struct page *page = pages[page_idx++]; |
1868 | size_t todo = min_t(size_t, PAGE_SIZE, iov_iter_count(&ii)); | 1866 | size_t todo = min_t(size_t, PAGE_SIZE, iov_iter_count(&ii)); |
1869 | void *kaddr; | 1867 | void *kaddr; |
1870 | 1868 | ||
1871 | kaddr = kmap(page); | 1869 | kaddr = kmap(page); |
1872 | 1870 | ||
1873 | while (todo) { | 1871 | while (todo) { |
1874 | char __user *uaddr = ii.iov->iov_base + ii.iov_offset; | 1872 | char __user *uaddr = ii.iov->iov_base + ii.iov_offset; |
1875 | size_t iov_len = ii.iov->iov_len - ii.iov_offset; | 1873 | size_t iov_len = ii.iov->iov_len - ii.iov_offset; |
1876 | size_t copy = min(todo, iov_len); | 1874 | size_t copy = min(todo, iov_len); |
1877 | size_t left; | 1875 | size_t left; |
1878 | 1876 | ||
1879 | if (!to_user) | 1877 | if (!to_user) |
1880 | left = copy_from_user(kaddr, uaddr, copy); | 1878 | left = copy_from_user(kaddr, uaddr, copy); |
1881 | else | 1879 | else |
1882 | left = copy_to_user(uaddr, kaddr, copy); | 1880 | left = copy_to_user(uaddr, kaddr, copy); |
1883 | 1881 | ||
1884 | if (unlikely(left)) | 1882 | if (unlikely(left)) |
1885 | return -EFAULT; | 1883 | return -EFAULT; |
1886 | 1884 | ||
1887 | iov_iter_advance(&ii, copy); | 1885 | iov_iter_advance(&ii, copy); |
1888 | todo -= copy; | 1886 | todo -= copy; |
1889 | kaddr += copy; | 1887 | kaddr += copy; |
1890 | } | 1888 | } |
1891 | 1889 | ||
1892 | kunmap(page); | 1890 | kunmap(page); |
1893 | } | 1891 | } |
1894 | 1892 | ||
1895 | return 0; | 1893 | return 0; |
1896 | } | 1894 | } |
1897 | 1895 | ||
1898 | /* | 1896 | /* |
1899 | * CUSE servers compiled on 32bit broke on 64bit kernels because the | 1897 | * CUSE servers compiled on 32bit broke on 64bit kernels because the |
1900 | * ABI was defined to be 'struct iovec' which is different on 32bit | 1898 | * ABI was defined to be 'struct iovec' which is different on 32bit |
1901 | * and 64bit. Fortunately we can determine which structure the server | 1899 | * and 64bit. Fortunately we can determine which structure the server |
1902 | * used from the size of the reply. | 1900 | * used from the size of the reply. |
1903 | */ | 1901 | */ |
1904 | static int fuse_copy_ioctl_iovec_old(struct iovec *dst, void *src, | 1902 | static int fuse_copy_ioctl_iovec_old(struct iovec *dst, void *src, |
1905 | size_t transferred, unsigned count, | 1903 | size_t transferred, unsigned count, |
1906 | bool is_compat) | 1904 | bool is_compat) |
1907 | { | 1905 | { |
1908 | #ifdef CONFIG_COMPAT | 1906 | #ifdef CONFIG_COMPAT |
1909 | if (count * sizeof(struct compat_iovec) == transferred) { | 1907 | if (count * sizeof(struct compat_iovec) == transferred) { |
1910 | struct compat_iovec *ciov = src; | 1908 | struct compat_iovec *ciov = src; |
1911 | unsigned i; | 1909 | unsigned i; |
1912 | 1910 | ||
1913 | /* | 1911 | /* |
1914 | * With this interface a 32bit server cannot support | 1912 | * With this interface a 32bit server cannot support |
1915 | * non-compat (i.e. ones coming from 64bit apps) ioctl | 1913 | * non-compat (i.e. ones coming from 64bit apps) ioctl |
1916 | * requests | 1914 | * requests |
1917 | */ | 1915 | */ |
1918 | if (!is_compat) | 1916 | if (!is_compat) |
1919 | return -EINVAL; | 1917 | return -EINVAL; |
1920 | 1918 | ||
1921 | for (i = 0; i < count; i++) { | 1919 | for (i = 0; i < count; i++) { |
1922 | dst[i].iov_base = compat_ptr(ciov[i].iov_base); | 1920 | dst[i].iov_base = compat_ptr(ciov[i].iov_base); |
1923 | dst[i].iov_len = ciov[i].iov_len; | 1921 | dst[i].iov_len = ciov[i].iov_len; |
1924 | } | 1922 | } |
1925 | return 0; | 1923 | return 0; |
1926 | } | 1924 | } |
1927 | #endif | 1925 | #endif |
1928 | 1926 | ||
1929 | if (count * sizeof(struct iovec) != transferred) | 1927 | if (count * sizeof(struct iovec) != transferred) |
1930 | return -EIO; | 1928 | return -EIO; |
1931 | 1929 | ||
1932 | memcpy(dst, src, transferred); | 1930 | memcpy(dst, src, transferred); |
1933 | return 0; | 1931 | return 0; |
1934 | } | 1932 | } |
1935 | 1933 | ||
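A worked illustration of the size-based discrimination above, runnable in userspace; the compat_iovec layout is reproduced locally for the common 64-bit case and is an assumption for illustration only.

#include <stdint.h>
#include <stdio.h>
#include <sys/uio.h>

/* Mirrors the kernel's compat_iovec: 32-bit pointer and length. */
struct compat_iovec_sketch {
	uint32_t iov_base;
	uint32_t iov_len;
};

int main(void)
{
	unsigned count = 2;		/* iovecs the server claims to have sent */
	size_t transferred = 16;	/* bytes actually received in the reply */

	if (transferred == count * sizeof(struct compat_iovec_sketch))
		printf("reply uses the 32-bit (compat) iovec layout\n");
	else if (transferred == count * sizeof(struct iovec))
		printf("reply uses the native 64-bit iovec layout\n");
	else
		printf("malformed reply\n");	/* the kernel returns -EIO here */
	return 0;
}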
1936 | /* Make sure iov_length() won't overflow */ | 1934 | /* Make sure iov_length() won't overflow */ |
1937 | static int fuse_verify_ioctl_iov(struct iovec *iov, size_t count) | 1935 | static int fuse_verify_ioctl_iov(struct iovec *iov, size_t count) |
1938 | { | 1936 | { |
1939 | size_t n; | 1937 | size_t n; |
1940 | u32 max = FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT; | 1938 | u32 max = FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT; |
1941 | 1939 | ||
1942 | for (n = 0; n < count; n++, iov++) { | 1940 | for (n = 0; n < count; n++, iov++) { |
1943 | if (iov->iov_len > (size_t) max) | 1941 | if (iov->iov_len > (size_t) max) |
1944 | return -ENOMEM; | 1942 | return -ENOMEM; |
1945 | max -= iov->iov_len; | 1943 | max -= iov->iov_len; |
1946 | } | 1944 | } |
1947 | return 0; | 1945 | return 0; |
1948 | } | 1946 | } |
1949 | 1947 | ||
1950 | static int fuse_copy_ioctl_iovec(struct fuse_conn *fc, struct iovec *dst, | 1948 | static int fuse_copy_ioctl_iovec(struct fuse_conn *fc, struct iovec *dst, |
1951 | void *src, size_t transferred, unsigned count, | 1949 | void *src, size_t transferred, unsigned count, |
1952 | bool is_compat) | 1950 | bool is_compat) |
1953 | { | 1951 | { |
1954 | unsigned i; | 1952 | unsigned i; |
1955 | struct fuse_ioctl_iovec *fiov = src; | 1953 | struct fuse_ioctl_iovec *fiov = src; |
1956 | 1954 | ||
1957 | if (fc->minor < 16) { | 1955 | if (fc->minor < 16) { |
1958 | return fuse_copy_ioctl_iovec_old(dst, src, transferred, | 1956 | return fuse_copy_ioctl_iovec_old(dst, src, transferred, |
1959 | count, is_compat); | 1957 | count, is_compat); |
1960 | } | 1958 | } |
1961 | 1959 | ||
1962 | if (count * sizeof(struct fuse_ioctl_iovec) != transferred) | 1960 | if (count * sizeof(struct fuse_ioctl_iovec) != transferred) |
1963 | return -EIO; | 1961 | return -EIO; |
1964 | 1962 | ||
1965 | for (i = 0; i < count; i++) { | 1963 | for (i = 0; i < count; i++) { |
1966 | /* Did the server supply an inappropriate value? */ | 1964 | /* Did the server supply an inappropriate value? */ |
1967 | if (fiov[i].base != (unsigned long) fiov[i].base || | 1965 | if (fiov[i].base != (unsigned long) fiov[i].base || |
1968 | fiov[i].len != (unsigned long) fiov[i].len) | 1966 | fiov[i].len != (unsigned long) fiov[i].len) |
1969 | return -EIO; | 1967 | return -EIO; |
1970 | 1968 | ||
1971 | dst[i].iov_base = (void __user *) (unsigned long) fiov[i].base; | 1969 | dst[i].iov_base = (void __user *) (unsigned long) fiov[i].base; |
1972 | dst[i].iov_len = (size_t) fiov[i].len; | 1970 | dst[i].iov_len = (size_t) fiov[i].len; |
1973 | 1971 | ||
1974 | #ifdef CONFIG_COMPAT | 1972 | #ifdef CONFIG_COMPAT |
1975 | if (is_compat && | 1973 | if (is_compat && |
1976 | (ptr_to_compat(dst[i].iov_base) != fiov[i].base || | 1974 | (ptr_to_compat(dst[i].iov_base) != fiov[i].base || |
1977 | (compat_size_t) dst[i].iov_len != fiov[i].len)) | 1975 | (compat_size_t) dst[i].iov_len != fiov[i].len)) |
1978 | return -EIO; | 1976 | return -EIO; |
1979 | #endif | 1977 | #endif |
1980 | } | 1978 | } |
1981 | 1979 | ||
1982 | return 0; | 1980 | return 0; |
1983 | } | 1981 | } |
1984 | 1982 | ||
1985 | 1983 | ||
1986 | /* | 1984 | /* |
1987 | * For ioctls, there is no generic way to determine how much memory | 1985 | * For ioctls, there is no generic way to determine how much memory |
1988 | * needs to be read and/or written. Furthermore, ioctls are allowed | 1986 | * needs to be read and/or written. Furthermore, ioctls are allowed |
1989 | * to dereference the passed pointer, so the parameter requires deep | 1987 | * to dereference the passed pointer, so the parameter requires deep |
1990 | * copying but FUSE has no idea whatsoever about what to copy in or | 1988 | * copying but FUSE has no idea whatsoever about what to copy in or |
1991 | * out. | 1989 | * out. |
1992 | * | 1990 | * |
1993 | * This is solved by allowing FUSE server to retry ioctl with | 1991 | * This is solved by allowing FUSE server to retry ioctl with |
1994 | * necessary in/out iovecs. Let's assume the ioctl implementation | 1992 | * necessary in/out iovecs. Let's assume the ioctl implementation |
1995 | * needs to read in the following structure. | 1993 | * needs to read in the following structure. |
1996 | * | 1994 | * |
1997 | * struct a { | 1995 | * struct a { |
1998 | * char *buf; | 1996 | * char *buf; |
1999 | * size_t buflen; | 1997 | * size_t buflen; |
2000 | * } | 1998 | * } |
2001 | * | 1999 | * |
2002 | * On the first callout to FUSE server, inarg->in_size and | 2000 | * On the first callout to FUSE server, inarg->in_size and |
2003 | * inarg->out_size will be NULL; then, the server completes the ioctl | 2001 | * inarg->out_size will be NULL; then, the server completes the ioctl |
2004 | * with FUSE_IOCTL_RETRY set in out->flags, out->in_iovs set to 1 and | 2002 | * with FUSE_IOCTL_RETRY set in out->flags, out->in_iovs set to 1 and |
2005 | * the actual iov array to | 2003 | * the actual iov array to |
2006 | * | 2004 | * |
2007 | * { { .iov_base = inarg.arg, .iov_len = sizeof(struct a) } } | 2005 | * { { .iov_base = inarg.arg, .iov_len = sizeof(struct a) } } |
2008 | * | 2006 | * |
2009 | * which tells FUSE to copy in the requested area and retry the ioctl. | 2007 | * which tells FUSE to copy in the requested area and retry the ioctl. |
2010 | * On the second round, the server has access to the structure and | 2008 | * On the second round, the server has access to the structure and |
2011 | * from that it can tell what to look for next, so on the invocation, | 2009 | * from that it can tell what to look for next, so on the invocation, |
2012 | * it sets FUSE_IOCTL_RETRY, out->in_iovs to 2 and iov array to | 2010 | * it sets FUSE_IOCTL_RETRY, out->in_iovs to 2 and iov array to |
2013 | * | 2011 | * |
2014 | * { { .iov_base = inarg.arg, .iov_len = sizeof(struct a) }, | 2012 | * { { .iov_base = inarg.arg, .iov_len = sizeof(struct a) }, |
2015 | * { .iov_base = a.buf, .iov_len = a.buflen } } | 2013 | * { .iov_base = a.buf, .iov_len = a.buflen } } |
2016 | * | 2014 | * |
2017 | * FUSE will copy both struct a and the pointed buffer from the | 2015 | * FUSE will copy both struct a and the pointed buffer from the |
2018 | * process doing the ioctl and retry ioctl with both struct a and the | 2016 | * process doing the ioctl and retry ioctl with both struct a and the |
2019 | * buffer. | 2017 | * buffer. |
2020 | * | 2018 | * |
2021 | * This time, FUSE server has everything it needs and completes ioctl | 2019 | * This time, FUSE server has everything it needs and completes ioctl |
2022 | * without FUSE_IOCTL_RETRY which finishes the ioctl call. | 2020 | * without FUSE_IOCTL_RETRY which finishes the ioctl call. |
2023 | * | 2021 | * |
2024 | * Copying data out works the same way. | 2022 | * Copying data out works the same way. |
2025 | * | 2023 | * |
2026 | * Note that if FUSE_IOCTL_UNRESTRICTED is clear, the kernel | 2024 | * Note that if FUSE_IOCTL_UNRESTRICTED is clear, the kernel |
2027 | * automatically initializes in and out iovs by decoding @cmd with | 2025 | * automatically initializes in and out iovs by decoding @cmd with |
2028 | * _IOC_* macros and the server is not allowed to request RETRY. This | 2026 | * _IOC_* macros and the server is not allowed to request RETRY. This |
2029 | * limits ioctl data transfers to well-formed ioctls and is the forced | 2027 | * limits ioctl data transfers to well-formed ioctls and is the forced |
2030 | * behavior for all FUSE servers. | 2028 | * behavior for all FUSE servers. |
2031 | */ | 2029 | */ |
2032 | long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg, | 2030 | long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg, |
2033 | unsigned int flags) | 2031 | unsigned int flags) |
2034 | { | 2032 | { |
2035 | struct fuse_file *ff = file->private_data; | 2033 | struct fuse_file *ff = file->private_data; |
2036 | struct fuse_conn *fc = ff->fc; | 2034 | struct fuse_conn *fc = ff->fc; |
2037 | struct fuse_ioctl_in inarg = { | 2035 | struct fuse_ioctl_in inarg = { |
2038 | .fh = ff->fh, | 2036 | .fh = ff->fh, |
2039 | .cmd = cmd, | 2037 | .cmd = cmd, |
2040 | .arg = arg, | 2038 | .arg = arg, |
2041 | .flags = flags | 2039 | .flags = flags |
2042 | }; | 2040 | }; |
2043 | struct fuse_ioctl_out outarg; | 2041 | struct fuse_ioctl_out outarg; |
2044 | struct fuse_req *req = NULL; | 2042 | struct fuse_req *req = NULL; |
2045 | struct page **pages = NULL; | 2043 | struct page **pages = NULL; |
2046 | struct iovec *iov_page = NULL; | 2044 | struct iovec *iov_page = NULL; |
2047 | struct iovec *in_iov = NULL, *out_iov = NULL; | 2045 | struct iovec *in_iov = NULL, *out_iov = NULL; |
2048 | unsigned int in_iovs = 0, out_iovs = 0, num_pages = 0, max_pages; | 2046 | unsigned int in_iovs = 0, out_iovs = 0, num_pages = 0, max_pages; |
2049 | size_t in_size, out_size, transferred; | 2047 | size_t in_size, out_size, transferred; |
2050 | int err; | 2048 | int err; |
2051 | 2049 | ||
2052 | #if BITS_PER_LONG == 32 | 2050 | #if BITS_PER_LONG == 32 |
2053 | inarg.flags |= FUSE_IOCTL_32BIT; | 2051 | inarg.flags |= FUSE_IOCTL_32BIT; |
2054 | #else | 2052 | #else |
2055 | if (flags & FUSE_IOCTL_COMPAT) | 2053 | if (flags & FUSE_IOCTL_COMPAT) |
2056 | inarg.flags |= FUSE_IOCTL_32BIT; | 2054 | inarg.flags |= FUSE_IOCTL_32BIT; |
2057 | #endif | 2055 | #endif |
2058 | 2056 | ||
2059 | /* assume all the iovs returned by client always fits in a page */ | 2057 | /* assume all the iovs returned by client always fits in a page */ |
2059 | /* assume all the iovs returned by the client always fit in a page */ | 2057 | /* assume all the iovs returned by the client always fit in a page */ |
2060 | BUILD_BUG_ON(sizeof(struct fuse_ioctl_iovec) * FUSE_IOCTL_MAX_IOV > PAGE_SIZE); | 2058 | BUILD_BUG_ON(sizeof(struct fuse_ioctl_iovec) * FUSE_IOCTL_MAX_IOV > PAGE_SIZE); |
2061 | 2059 | ||
2062 | err = -ENOMEM; | 2060 | err = -ENOMEM; |
2063 | pages = kcalloc(FUSE_MAX_PAGES_PER_REQ, sizeof(pages[0]), GFP_KERNEL); | 2061 | pages = kcalloc(FUSE_MAX_PAGES_PER_REQ, sizeof(pages[0]), GFP_KERNEL); |
2064 | iov_page = (struct iovec *) __get_free_page(GFP_KERNEL); | 2062 | iov_page = (struct iovec *) __get_free_page(GFP_KERNEL); |
2065 | if (!pages || !iov_page) | 2063 | if (!pages || !iov_page) |
2066 | goto out; | 2064 | goto out; |
2067 | 2065 | ||
2068 | /* | 2066 | /* |
2069 | * If restricted, initialize IO parameters as encoded in @cmd. | 2067 | * If restricted, initialize IO parameters as encoded in @cmd. |
2070 | * RETRY from server is not allowed. | 2068 | * RETRY from server is not allowed. |
2071 | */ | 2069 | */ |
2072 | if (!(flags & FUSE_IOCTL_UNRESTRICTED)) { | 2070 | if (!(flags & FUSE_IOCTL_UNRESTRICTED)) { |
2073 | struct iovec *iov = iov_page; | 2071 | struct iovec *iov = iov_page; |
2074 | 2072 | ||
2075 | iov->iov_base = (void __user *)arg; | 2073 | iov->iov_base = (void __user *)arg; |
2076 | iov->iov_len = _IOC_SIZE(cmd); | 2074 | iov->iov_len = _IOC_SIZE(cmd); |
2077 | 2075 | ||
2078 | if (_IOC_DIR(cmd) & _IOC_WRITE) { | 2076 | if (_IOC_DIR(cmd) & _IOC_WRITE) { |
2079 | in_iov = iov; | 2077 | in_iov = iov; |
2080 | in_iovs = 1; | 2078 | in_iovs = 1; |
2081 | } | 2079 | } |
2082 | 2080 | ||
2083 | if (_IOC_DIR(cmd) & _IOC_READ) { | 2081 | if (_IOC_DIR(cmd) & _IOC_READ) { |
2084 | out_iov = iov; | 2082 | out_iov = iov; |
2085 | out_iovs = 1; | 2083 | out_iovs = 1; |
2086 | } | 2084 | } |
2087 | } | 2085 | } |
2088 | 2086 | ||
2089 | retry: | 2087 | retry: |
2090 | inarg.in_size = in_size = iov_length(in_iov, in_iovs); | 2088 | inarg.in_size = in_size = iov_length(in_iov, in_iovs); |
2091 | inarg.out_size = out_size = iov_length(out_iov, out_iovs); | 2089 | inarg.out_size = out_size = iov_length(out_iov, out_iovs); |
2092 | 2090 | ||
2093 | /* | 2091 | /* |
2094 | * Out data can be used either for actual out data or iovs, | 2092 | * Out data can be used either for actual out data or iovs, |
2095 | * make sure there always is at least one page. | 2093 | * make sure there always is at least one page. |
2096 | */ | 2094 | */ |
2097 | out_size = max_t(size_t, out_size, PAGE_SIZE); | 2095 | out_size = max_t(size_t, out_size, PAGE_SIZE); |
2098 | max_pages = DIV_ROUND_UP(max(in_size, out_size), PAGE_SIZE); | 2096 | max_pages = DIV_ROUND_UP(max(in_size, out_size), PAGE_SIZE); |
2099 | 2097 | ||
2100 | /* make sure there are enough buffer pages and init request with them */ | 2098 | /* make sure there are enough buffer pages and init request with them */ |
2101 | err = -ENOMEM; | 2099 | err = -ENOMEM; |
2102 | if (max_pages > FUSE_MAX_PAGES_PER_REQ) | 2100 | if (max_pages > FUSE_MAX_PAGES_PER_REQ) |
2103 | goto out; | 2101 | goto out; |
2104 | while (num_pages < max_pages) { | 2102 | while (num_pages < max_pages) { |
2105 | pages[num_pages] = alloc_page(GFP_KERNEL | __GFP_HIGHMEM); | 2103 | pages[num_pages] = alloc_page(GFP_KERNEL | __GFP_HIGHMEM); |
2106 | if (!pages[num_pages]) | 2104 | if (!pages[num_pages]) |
2107 | goto out; | 2105 | goto out; |
2108 | num_pages++; | 2106 | num_pages++; |
2109 | } | 2107 | } |
2110 | 2108 | ||
2111 | req = fuse_get_req(fc, num_pages); | 2109 | req = fuse_get_req(fc, num_pages); |
2112 | if (IS_ERR(req)) { | 2110 | if (IS_ERR(req)) { |
2113 | err = PTR_ERR(req); | 2111 | err = PTR_ERR(req); |
2114 | req = NULL; | 2112 | req = NULL; |
2115 | goto out; | 2113 | goto out; |
2116 | } | 2114 | } |
2117 | memcpy(req->pages, pages, sizeof(req->pages[0]) * num_pages); | 2115 | memcpy(req->pages, pages, sizeof(req->pages[0]) * num_pages); |
2118 | req->num_pages = num_pages; | 2116 | req->num_pages = num_pages; |
2119 | fuse_page_descs_length_init(req, 0, req->num_pages); | 2117 | fuse_page_descs_length_init(req, 0, req->num_pages); |
2120 | 2118 | ||
2121 | /* okay, let's send it to the client */ | 2119 | /* okay, let's send it to the client */ |
2122 | req->in.h.opcode = FUSE_IOCTL; | 2120 | req->in.h.opcode = FUSE_IOCTL; |
2123 | req->in.h.nodeid = ff->nodeid; | 2121 | req->in.h.nodeid = ff->nodeid; |
2124 | req->in.numargs = 1; | 2122 | req->in.numargs = 1; |
2125 | req->in.args[0].size = sizeof(inarg); | 2123 | req->in.args[0].size = sizeof(inarg); |
2126 | req->in.args[0].value = &inarg; | 2124 | req->in.args[0].value = &inarg; |
2127 | if (in_size) { | 2125 | if (in_size) { |
2128 | req->in.numargs++; | 2126 | req->in.numargs++; |
2129 | req->in.args[1].size = in_size; | 2127 | req->in.args[1].size = in_size; |
2130 | req->in.argpages = 1; | 2128 | req->in.argpages = 1; |
2131 | 2129 | ||
2132 | err = fuse_ioctl_copy_user(pages, in_iov, in_iovs, in_size, | 2130 | err = fuse_ioctl_copy_user(pages, in_iov, in_iovs, in_size, |
2133 | false); | 2131 | false); |
2134 | if (err) | 2132 | if (err) |
2135 | goto out; | 2133 | goto out; |
2136 | } | 2134 | } |
2137 | 2135 | ||
2138 | req->out.numargs = 2; | 2136 | req->out.numargs = 2; |
2139 | req->out.args[0].size = sizeof(outarg); | 2137 | req->out.args[0].size = sizeof(outarg); |
2140 | req->out.args[0].value = &outarg; | 2138 | req->out.args[0].value = &outarg; |
2141 | req->out.args[1].size = out_size; | 2139 | req->out.args[1].size = out_size; |
2142 | req->out.argpages = 1; | 2140 | req->out.argpages = 1; |
2143 | req->out.argvar = 1; | 2141 | req->out.argvar = 1; |
2144 | 2142 | ||
2145 | fuse_request_send(fc, req); | 2143 | fuse_request_send(fc, req); |
2146 | err = req->out.h.error; | 2144 | err = req->out.h.error; |
2147 | transferred = req->out.args[1].size; | 2145 | transferred = req->out.args[1].size; |
2148 | fuse_put_request(fc, req); | 2146 | fuse_put_request(fc, req); |
2149 | req = NULL; | 2147 | req = NULL; |
2150 | if (err) | 2148 | if (err) |
2151 | goto out; | 2149 | goto out; |
2152 | 2150 | ||
2153 | /* did it ask for retry? */ | 2151 | /* did it ask for retry? */ |
2154 | if (outarg.flags & FUSE_IOCTL_RETRY) { | 2152 | if (outarg.flags & FUSE_IOCTL_RETRY) { |
2155 | void *vaddr; | 2153 | void *vaddr; |
2156 | 2154 | ||
2157 | /* no retry if in restricted mode */ | 2155 | /* no retry if in restricted mode */ |
2158 | err = -EIO; | 2156 | err = -EIO; |
2159 | if (!(flags & FUSE_IOCTL_UNRESTRICTED)) | 2157 | if (!(flags & FUSE_IOCTL_UNRESTRICTED)) |
2160 | goto out; | 2158 | goto out; |
2161 | 2159 | ||
2162 | in_iovs = outarg.in_iovs; | 2160 | in_iovs = outarg.in_iovs; |
2163 | out_iovs = outarg.out_iovs; | 2161 | out_iovs = outarg.out_iovs; |
2164 | 2162 | ||
2165 | /* | 2163 | /* |
2166 | * Make sure things are in boundary, separate checks | 2164 | * Make sure things are in boundary, separate checks |
2167 | * are to protect against overflow. | 2165 | * are to protect against overflow. |
2168 | */ | 2166 | */ |
2169 | err = -ENOMEM; | 2167 | err = -ENOMEM; |
2170 | if (in_iovs > FUSE_IOCTL_MAX_IOV || | 2168 | if (in_iovs > FUSE_IOCTL_MAX_IOV || |
2171 | out_iovs > FUSE_IOCTL_MAX_IOV || | 2169 | out_iovs > FUSE_IOCTL_MAX_IOV || |
2172 | in_iovs + out_iovs > FUSE_IOCTL_MAX_IOV) | 2170 | in_iovs + out_iovs > FUSE_IOCTL_MAX_IOV) |
2173 | goto out; | 2171 | goto out; |
2174 | 2172 | ||
2175 | vaddr = kmap_atomic(pages[0]); | 2173 | vaddr = kmap_atomic(pages[0]); |
2176 | err = fuse_copy_ioctl_iovec(fc, iov_page, vaddr, | 2174 | err = fuse_copy_ioctl_iovec(fc, iov_page, vaddr, |
2177 | transferred, in_iovs + out_iovs, | 2175 | transferred, in_iovs + out_iovs, |
2178 | (flags & FUSE_IOCTL_COMPAT) != 0); | 2176 | (flags & FUSE_IOCTL_COMPAT) != 0); |
2179 | kunmap_atomic(vaddr); | 2177 | kunmap_atomic(vaddr); |
2180 | if (err) | 2178 | if (err) |
2181 | goto out; | 2179 | goto out; |
2182 | 2180 | ||
2183 | in_iov = iov_page; | 2181 | in_iov = iov_page; |
2184 | out_iov = in_iov + in_iovs; | 2182 | out_iov = in_iov + in_iovs; |
2185 | 2183 | ||
2186 | err = fuse_verify_ioctl_iov(in_iov, in_iovs); | 2184 | err = fuse_verify_ioctl_iov(in_iov, in_iovs); |
2187 | if (err) | 2185 | if (err) |
2188 | goto out; | 2186 | goto out; |
2189 | 2187 | ||
2190 | err = fuse_verify_ioctl_iov(out_iov, out_iovs); | 2188 | err = fuse_verify_ioctl_iov(out_iov, out_iovs); |
2191 | if (err) | 2189 | if (err) |
2192 | goto out; | 2190 | goto out; |
2193 | 2191 | ||
2194 | goto retry; | 2192 | goto retry; |
2195 | } | 2193 | } |
2196 | 2194 | ||
2197 | err = -EIO; | 2195 | err = -EIO; |
2198 | if (transferred > inarg.out_size) | 2196 | if (transferred > inarg.out_size) |
2199 | goto out; | 2197 | goto out; |
2200 | 2198 | ||
2201 | err = fuse_ioctl_copy_user(pages, out_iov, out_iovs, transferred, true); | 2199 | err = fuse_ioctl_copy_user(pages, out_iov, out_iovs, transferred, true); |
2202 | out: | 2200 | out: |
2203 | if (req) | 2201 | if (req) |
2204 | fuse_put_request(fc, req); | 2202 | fuse_put_request(fc, req); |
2205 | free_page((unsigned long) iov_page); | 2203 | free_page((unsigned long) iov_page); |
2206 | while (num_pages) | 2204 | while (num_pages) |
2207 | __free_page(pages[--num_pages]); | 2205 | __free_page(pages[--num_pages]); |
2208 | kfree(pages); | 2206 | kfree(pages); |
2209 | 2207 | ||
2210 | return err ? err : outarg.result; | 2208 | return err ? err : outarg.result; |
2211 | } | 2209 | } |
2212 | EXPORT_SYMBOL_GPL(fuse_do_ioctl); | 2210 | EXPORT_SYMBOL_GPL(fuse_do_ioctl); |
2213 | 2211 | ||
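To show the other side of the retry handshake documented above fuse_do_ioctl(), here is a minimal server-side sketch. It assumes the libfuse low-level helpers fuse_reply_ioctl_retry() and fuse_reply_ioctl(); handle_a_ioctl() and struct a are hypothetical names mirroring the example in that comment, and retry is only honoured when FUSE_IOCTL_UNRESTRICTED is in effect (e.g. CUSE).

#define FUSE_USE_VERSION 30
#include <fuse_lowlevel.h>
#include <string.h>
#include <sys/uio.h>

struct a {
	char *buf;
	size_t buflen;
};

/* Called from the server's ioctl handler with whatever has been copied in so far. */
static void handle_a_ioctl(fuse_req_t req, void *arg,
			   const void *in_buf, size_t in_bufsz)
{
	if (in_bufsz == 0) {
		/* First callout: nothing copied in yet, ask for struct a itself. */
		struct iovec iov = { .iov_base = arg, .iov_len = sizeof(struct a) };

		fuse_reply_ioctl_retry(req, &iov, 1, NULL, 0);
	} else if (in_bufsz == sizeof(struct a)) {
		/* Second callout: struct a arrived, now ask for the buffer it points to. */
		struct a in;

		memcpy(&in, in_buf, sizeof(in));
		struct iovec iov[2] = {
			{ .iov_base = arg,    .iov_len = sizeof(struct a) },
			{ .iov_base = in.buf, .iov_len = in.buflen },
		};

		fuse_reply_ioctl_retry(req, iov, 2, NULL, 0);
	} else {
		/* Third callout: struct a and the buffer are both available, finish. */
		fuse_reply_ioctl(req, 0, NULL, 0);
	}
}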
2214 | long fuse_ioctl_common(struct file *file, unsigned int cmd, | 2212 | long fuse_ioctl_common(struct file *file, unsigned int cmd, |
2215 | unsigned long arg, unsigned int flags) | 2213 | unsigned long arg, unsigned int flags) |
2216 | { | 2214 | { |
2217 | struct inode *inode = file_inode(file); | 2215 | struct inode *inode = file_inode(file); |
2218 | struct fuse_conn *fc = get_fuse_conn(inode); | 2216 | struct fuse_conn *fc = get_fuse_conn(inode); |
2219 | 2217 | ||
2220 | if (!fuse_allow_current_process(fc)) | 2218 | if (!fuse_allow_current_process(fc)) |
2221 | return -EACCES; | 2219 | return -EACCES; |
2222 | 2220 | ||
2223 | if (is_bad_inode(inode)) | 2221 | if (is_bad_inode(inode)) |
2224 | return -EIO; | 2222 | return -EIO; |
2225 | 2223 | ||
2226 | return fuse_do_ioctl(file, cmd, arg, flags); | 2224 | return fuse_do_ioctl(file, cmd, arg, flags); |
2227 | } | 2225 | } |
2228 | 2226 | ||
2229 | static long fuse_file_ioctl(struct file *file, unsigned int cmd, | 2227 | static long fuse_file_ioctl(struct file *file, unsigned int cmd, |
2230 | unsigned long arg) | 2228 | unsigned long arg) |
2231 | { | 2229 | { |
2232 | return fuse_ioctl_common(file, cmd, arg, 0); | 2230 | return fuse_ioctl_common(file, cmd, arg, 0); |
2233 | } | 2231 | } |
2234 | 2232 | ||
2235 | static long fuse_file_compat_ioctl(struct file *file, unsigned int cmd, | 2233 | static long fuse_file_compat_ioctl(struct file *file, unsigned int cmd, |
2236 | unsigned long arg) | 2234 | unsigned long arg) |
2237 | { | 2235 | { |
2238 | return fuse_ioctl_common(file, cmd, arg, FUSE_IOCTL_COMPAT); | 2236 | return fuse_ioctl_common(file, cmd, arg, FUSE_IOCTL_COMPAT); |
2239 | } | 2237 | } |
2240 | 2238 | ||
2241 | /* | 2239 | /* |
2242 | * All files which have been polled are linked to RB tree | 2240 | * All files which have been polled are linked to RB tree |
2243 | * fuse_conn->polled_files which is indexed by kh. Walk the tree and | 2241 | * fuse_conn->polled_files which is indexed by kh. Walk the tree and |
2244 | * find the matching one. | 2242 | * find the matching one. |
2245 | */ | 2243 | */ |
2246 | static struct rb_node **fuse_find_polled_node(struct fuse_conn *fc, u64 kh, | 2244 | static struct rb_node **fuse_find_polled_node(struct fuse_conn *fc, u64 kh, |
2247 | struct rb_node **parent_out) | 2245 | struct rb_node **parent_out) |
2248 | { | 2246 | { |
2249 | struct rb_node **link = &fc->polled_files.rb_node; | 2247 | struct rb_node **link = &fc->polled_files.rb_node; |
2250 | struct rb_node *last = NULL; | 2248 | struct rb_node *last = NULL; |
2251 | 2249 | ||
2252 | while (*link) { | 2250 | while (*link) { |
2253 | struct fuse_file *ff; | 2251 | struct fuse_file *ff; |
2254 | 2252 | ||
2255 | last = *link; | 2253 | last = *link; |
2256 | ff = rb_entry(last, struct fuse_file, polled_node); | 2254 | ff = rb_entry(last, struct fuse_file, polled_node); |
2257 | 2255 | ||
2258 | if (kh < ff->kh) | 2256 | if (kh < ff->kh) |
2259 | link = &last->rb_left; | 2257 | link = &last->rb_left; |
2260 | else if (kh > ff->kh) | 2258 | else if (kh > ff->kh) |
2261 | link = &last->rb_right; | 2259 | link = &last->rb_right; |
2262 | else | 2260 | else |
2263 | return link; | 2261 | return link; |
2264 | } | 2262 | } |
2265 | 2263 | ||
2266 | if (parent_out) | 2264 | if (parent_out) |
2267 | *parent_out = last; | 2265 | *parent_out = last; |
2268 | return link; | 2266 | return link; |
2269 | } | 2267 | } |
2270 | 2268 | ||
2271 | /* | 2269 | /* |
2272 | * The file is about to be polled. Make sure it's on the polled_files | 2270 | * The file is about to be polled. Make sure it's on the polled_files |
2273 | * RB tree. Note that files once added to the polled_files tree are | 2271 | * RB tree. Note that files once added to the polled_files tree are |
2274 | * not removed before the file is released. This is because a file | 2272 | * not removed before the file is released. This is because a file |
2275 | * polled once is likely to be polled again. | 2273 | * polled once is likely to be polled again. |
2276 | */ | 2274 | */ |
2277 | static void fuse_register_polled_file(struct fuse_conn *fc, | 2275 | static void fuse_register_polled_file(struct fuse_conn *fc, |
2278 | struct fuse_file *ff) | 2276 | struct fuse_file *ff) |
2279 | { | 2277 | { |
2280 | spin_lock(&fc->lock); | 2278 | spin_lock(&fc->lock); |
2281 | if (RB_EMPTY_NODE(&ff->polled_node)) { | 2279 | if (RB_EMPTY_NODE(&ff->polled_node)) { |
2282 | struct rb_node **link, *parent; | 2280 | struct rb_node **link, *parent; |
2283 | 2281 | ||
2284 | link = fuse_find_polled_node(fc, ff->kh, &parent); | 2282 | link = fuse_find_polled_node(fc, ff->kh, &parent); |
2285 | BUG_ON(*link); | 2283 | BUG_ON(*link); |
2286 | rb_link_node(&ff->polled_node, parent, link); | 2284 | rb_link_node(&ff->polled_node, parent, link); |
2287 | rb_insert_color(&ff->polled_node, &fc->polled_files); | 2285 | rb_insert_color(&ff->polled_node, &fc->polled_files); |
2288 | } | 2286 | } |
2289 | spin_unlock(&fc->lock); | 2287 | spin_unlock(&fc->lock); |
2290 | } | 2288 | } |
2291 | 2289 | ||
2292 | unsigned fuse_file_poll(struct file *file, poll_table *wait) | 2290 | unsigned fuse_file_poll(struct file *file, poll_table *wait) |
2293 | { | 2291 | { |
2294 | struct fuse_file *ff = file->private_data; | 2292 | struct fuse_file *ff = file->private_data; |
2295 | struct fuse_conn *fc = ff->fc; | 2293 | struct fuse_conn *fc = ff->fc; |
2296 | struct fuse_poll_in inarg = { .fh = ff->fh, .kh = ff->kh }; | 2294 | struct fuse_poll_in inarg = { .fh = ff->fh, .kh = ff->kh }; |
2297 | struct fuse_poll_out outarg; | 2295 | struct fuse_poll_out outarg; |
2298 | struct fuse_req *req; | 2296 | struct fuse_req *req; |
2299 | int err; | 2297 | int err; |
2300 | 2298 | ||
2301 | if (fc->no_poll) | 2299 | if (fc->no_poll) |
2302 | return DEFAULT_POLLMASK; | 2300 | return DEFAULT_POLLMASK; |
2303 | 2301 | ||
2304 | poll_wait(file, &ff->poll_wait, wait); | 2302 | poll_wait(file, &ff->poll_wait, wait); |
2305 | inarg.events = (__u32)poll_requested_events(wait); | 2303 | inarg.events = (__u32)poll_requested_events(wait); |
2306 | 2304 | ||
2307 | /* | 2305 | /* |
2308 | * Ask for notification iff there's someone waiting for it. | 2306 | * Ask for notification iff there's someone waiting for it. |
2309 | * The client may ignore the flag and always notify. | 2307 | * The client may ignore the flag and always notify. |
2310 | */ | 2308 | */ |
2311 | if (waitqueue_active(&ff->poll_wait)) { | 2309 | if (waitqueue_active(&ff->poll_wait)) { |
2312 | inarg.flags |= FUSE_POLL_SCHEDULE_NOTIFY; | 2310 | inarg.flags |= FUSE_POLL_SCHEDULE_NOTIFY; |
2313 | fuse_register_polled_file(fc, ff); | 2311 | fuse_register_polled_file(fc, ff); |
2314 | } | 2312 | } |
2315 | 2313 | ||
2316 | req = fuse_get_req_nopages(fc); | 2314 | req = fuse_get_req_nopages(fc); |
2317 | if (IS_ERR(req)) | 2315 | if (IS_ERR(req)) |
2318 | return POLLERR; | 2316 | return POLLERR; |
2319 | 2317 | ||
2320 | req->in.h.opcode = FUSE_POLL; | 2318 | req->in.h.opcode = FUSE_POLL; |
2321 | req->in.h.nodeid = ff->nodeid; | 2319 | req->in.h.nodeid = ff->nodeid; |
2322 | req->in.numargs = 1; | 2320 | req->in.numargs = 1; |
2323 | req->in.args[0].size = sizeof(inarg); | 2321 | req->in.args[0].size = sizeof(inarg); |
2324 | req->in.args[0].value = &inarg; | 2322 | req->in.args[0].value = &inarg; |
2325 | req->out.numargs = 1; | 2323 | req->out.numargs = 1; |
2326 | req->out.args[0].size = sizeof(outarg); | 2324 | req->out.args[0].size = sizeof(outarg); |
2327 | req->out.args[0].value = &outarg; | 2325 | req->out.args[0].value = &outarg; |
2328 | fuse_request_send(fc, req); | 2326 | fuse_request_send(fc, req); |
2329 | err = req->out.h.error; | 2327 | err = req->out.h.error; |
2330 | fuse_put_request(fc, req); | 2328 | fuse_put_request(fc, req); |
2331 | 2329 | ||
2332 | if (!err) | 2330 | if (!err) |
2333 | return outarg.revents; | 2331 | return outarg.revents; |
2334 | if (err == -ENOSYS) { | 2332 | if (err == -ENOSYS) { |
2335 | fc->no_poll = 1; | 2333 | fc->no_poll = 1; |
2336 | return DEFAULT_POLLMASK; | 2334 | return DEFAULT_POLLMASK; |
2337 | } | 2335 | } |
2338 | return POLLERR; | 2336 | return POLLERR; |
2339 | } | 2337 | } |
2340 | EXPORT_SYMBOL_GPL(fuse_file_poll); | 2338 | EXPORT_SYMBOL_GPL(fuse_file_poll); |
2341 | 2339 | ||
2342 | /* | 2340 | /* |
2343 | * This is called from fuse_handle_notify() on FUSE_NOTIFY_POLL and | 2341 | * This is called from fuse_handle_notify() on FUSE_NOTIFY_POLL and |
2344 | * wakes up the poll waiters. | 2342 | * wakes up the poll waiters. |
2345 | */ | 2343 | */ |
2346 | int fuse_notify_poll_wakeup(struct fuse_conn *fc, | 2344 | int fuse_notify_poll_wakeup(struct fuse_conn *fc, |
2347 | struct fuse_notify_poll_wakeup_out *outarg) | 2345 | struct fuse_notify_poll_wakeup_out *outarg) |
2348 | { | 2346 | { |
2349 | u64 kh = outarg->kh; | 2347 | u64 kh = outarg->kh; |
2350 | struct rb_node **link; | 2348 | struct rb_node **link; |
2351 | 2349 | ||
2352 | spin_lock(&fc->lock); | 2350 | spin_lock(&fc->lock); |
2353 | 2351 | ||
2354 | link = fuse_find_polled_node(fc, kh, NULL); | 2352 | link = fuse_find_polled_node(fc, kh, NULL); |
2355 | if (*link) { | 2353 | if (*link) { |
2356 | struct fuse_file *ff; | 2354 | struct fuse_file *ff; |
2357 | 2355 | ||
2358 | ff = rb_entry(*link, struct fuse_file, polled_node); | 2356 | ff = rb_entry(*link, struct fuse_file, polled_node); |
2359 | wake_up_interruptible_sync(&ff->poll_wait); | 2357 | wake_up_interruptible_sync(&ff->poll_wait); |
2360 | } | 2358 | } |
2361 | 2359 | ||
2362 | spin_unlock(&fc->lock); | 2360 | spin_unlock(&fc->lock); |
2363 | return 0; | 2361 | return 0; |
2364 | } | 2362 | } |
2365 | 2363 | ||
2366 | static void fuse_do_truncate(struct file *file) | 2364 | static void fuse_do_truncate(struct file *file) |
2367 | { | 2365 | { |
2368 | struct inode *inode = file->f_mapping->host; | 2366 | struct inode *inode = file->f_mapping->host; |
2369 | struct iattr attr; | 2367 | struct iattr attr; |
2370 | 2368 | ||
2371 | attr.ia_valid = ATTR_SIZE; | 2369 | attr.ia_valid = ATTR_SIZE; |
2372 | attr.ia_size = i_size_read(inode); | 2370 | attr.ia_size = i_size_read(inode); |
2373 | 2371 | ||
2374 | attr.ia_file = file; | 2372 | attr.ia_file = file; |
2375 | attr.ia_valid |= ATTR_FILE; | 2373 | attr.ia_valid |= ATTR_FILE; |
2376 | 2374 | ||
2377 | fuse_do_setattr(inode, &attr, file); | 2375 | fuse_do_setattr(inode, &attr, file); |
2378 | } | 2376 | } |
2379 | 2377 | ||
2380 | static inline loff_t fuse_round_up(loff_t off) | 2378 | static inline loff_t fuse_round_up(loff_t off) |
2381 | { | 2379 | { |
2382 | return round_up(off, FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT); | 2380 | return round_up(off, FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT); |
2383 | } | 2381 | } |
2384 | 2382 | ||
2385 | static ssize_t | 2383 | static ssize_t |
2386 | fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, | 2384 | fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, |
2387 | loff_t offset, unsigned long nr_segs) | 2385 | loff_t offset, unsigned long nr_segs) |
2388 | { | 2386 | { |
2389 | ssize_t ret = 0; | 2387 | ssize_t ret = 0; |
2390 | struct file *file = iocb->ki_filp; | 2388 | struct file *file = iocb->ki_filp; |
2391 | struct fuse_file *ff = file->private_data; | 2389 | struct fuse_file *ff = file->private_data; |
2392 | bool async_dio = ff->fc->async_dio; | 2390 | bool async_dio = ff->fc->async_dio; |
2393 | loff_t pos = 0; | 2391 | loff_t pos = 0; |
2394 | struct inode *inode; | 2392 | struct inode *inode; |
2395 | loff_t i_size; | 2393 | loff_t i_size; |
2396 | size_t count = iov_length(iov, nr_segs); | 2394 | size_t count = iov_length(iov, nr_segs); |
2397 | struct fuse_io_priv *io; | 2395 | struct fuse_io_priv *io; |
2398 | 2396 | ||
2399 | pos = offset; | 2397 | pos = offset; |
2400 | inode = file->f_mapping->host; | 2398 | inode = file->f_mapping->host; |
2401 | i_size = i_size_read(inode); | 2399 | i_size = i_size_read(inode); |
2402 | 2400 | ||
2403 | /* optimization for short read */ | 2401 | /* optimization for short read */ |
2404 | if (async_dio && rw != WRITE && offset + count > i_size) { | 2402 | if (async_dio && rw != WRITE && offset + count > i_size) { |
2405 | if (offset >= i_size) | 2403 | if (offset >= i_size) |
2406 | return 0; | 2404 | return 0; |
2407 | count = min_t(loff_t, count, fuse_round_up(i_size - offset)); | 2405 | count = min_t(loff_t, count, fuse_round_up(i_size - offset)); |
2408 | } | 2406 | } |
2409 | 2407 | ||
2410 | io = kmalloc(sizeof(struct fuse_io_priv), GFP_KERNEL); | 2408 | io = kmalloc(sizeof(struct fuse_io_priv), GFP_KERNEL); |
2411 | if (!io) | 2409 | if (!io) |
2412 | return -ENOMEM; | 2410 | return -ENOMEM; |
2413 | spin_lock_init(&io->lock); | 2411 | spin_lock_init(&io->lock); |
2414 | io->reqs = 1; | 2412 | io->reqs = 1; |
2415 | io->bytes = -1; | 2413 | io->bytes = -1; |
2416 | io->size = 0; | 2414 | io->size = 0; |
2417 | io->offset = offset; | 2415 | io->offset = offset; |
2418 | io->write = (rw == WRITE); | 2416 | io->write = (rw == WRITE); |
2419 | io->err = 0; | 2417 | io->err = 0; |
2420 | io->file = file; | 2418 | io->file = file; |
2421 | /* | 2419 | /* |
2422 | * By default, we want to optimize all I/Os with async request | 2420 | * By default, we want to optimize all I/Os with async request |
2423 | * submission to the client filesystem if supported. | 2421 | * submission to the client filesystem if supported. |
2424 | */ | 2422 | */ |
2425 | io->async = async_dio; | 2423 | io->async = async_dio; |
2426 | io->iocb = iocb; | 2424 | io->iocb = iocb; |
2427 | 2425 | ||
2428 | /* | 2426 | /* |
2429 | * We cannot asynchronously extend the size of a file. We have no method | 2427 | * We cannot asynchronously extend the size of a file. We have no method |
2430 | * to wait on real async I/O requests, so we must submit this request | 2428 | * to wait on real async I/O requests, so we must submit this request |
2431 | * synchronously. | 2429 | * synchronously. |
2432 | */ | 2430 | */ |
2433 | if (!is_sync_kiocb(iocb) && (offset + count > i_size) && rw == WRITE) | 2431 | if (!is_sync_kiocb(iocb) && (offset + count > i_size) && rw == WRITE) |
2434 | io->async = false; | 2432 | io->async = false; |
2435 | 2433 | ||
2436 | if (rw == WRITE) | 2434 | if (rw == WRITE) |
2437 | ret = __fuse_direct_write(io, iov, nr_segs, &pos); | 2435 | ret = __fuse_direct_write(io, iov, nr_segs, &pos); |
2438 | else | 2436 | else |
2439 | ret = __fuse_direct_read(io, iov, nr_segs, &pos, count); | 2437 | ret = __fuse_direct_read(io, iov, nr_segs, &pos, count); |
2440 | 2438 | ||
2441 | if (io->async) { | 2439 | if (io->async) { |
2442 | fuse_aio_complete(io, ret < 0 ? ret : 0, -1); | 2440 | fuse_aio_complete(io, ret < 0 ? ret : 0, -1); |
2443 | 2441 | ||
2444 | /* we have a non-extending, async request, so return */ | 2442 | /* we have a non-extending, async request, so return */ |
2445 | if (!is_sync_kiocb(iocb)) | 2443 | if (!is_sync_kiocb(iocb)) |
2446 | return -EIOCBQUEUED; | 2444 | return -EIOCBQUEUED; |
2447 | 2445 | ||
2448 | ret = wait_on_sync_kiocb(iocb); | 2446 | ret = wait_on_sync_kiocb(iocb); |
2449 | } else { | 2447 | } else { |
2450 | kfree(io); | 2448 | kfree(io); |
2451 | } | 2449 | } |
2452 | 2450 | ||
2453 | if (rw == WRITE) { | 2451 | if (rw == WRITE) { |
2454 | if (ret > 0) | 2452 | if (ret > 0) |
2455 | fuse_write_update_size(inode, pos); | 2453 | fuse_write_update_size(inode, pos); |
2456 | else if (ret < 0 && offset + count > i_size) | 2454 | else if (ret < 0 && offset + count > i_size) |
2457 | fuse_do_truncate(file); | 2455 | fuse_do_truncate(file); |
2458 | } | 2456 | } |
2459 | 2457 | ||
2460 | return ret; | 2458 | return ret; |
2461 | } | 2459 | } |
2462 | 2460 | ||
2463 | static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, | 2461 | static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, |
2464 | loff_t length) | 2462 | loff_t length) |
2465 | { | 2463 | { |
2466 | struct fuse_file *ff = file->private_data; | 2464 | struct fuse_file *ff = file->private_data; |
2467 | struct inode *inode = file->f_inode; | 2465 | struct inode *inode = file->f_inode; |
2468 | struct fuse_inode *fi = get_fuse_inode(inode); | 2466 | struct fuse_inode *fi = get_fuse_inode(inode); |
2469 | struct fuse_conn *fc = ff->fc; | 2467 | struct fuse_conn *fc = ff->fc; |
2470 | struct fuse_req *req; | 2468 | struct fuse_req *req; |
2471 | struct fuse_fallocate_in inarg = { | 2469 | struct fuse_fallocate_in inarg = { |
2472 | .fh = ff->fh, | 2470 | .fh = ff->fh, |
2473 | .offset = offset, | 2471 | .offset = offset, |
2474 | .length = length, | 2472 | .length = length, |
2475 | .mode = mode | 2473 | .mode = mode |
2476 | }; | 2474 | }; |
2477 | int err; | 2475 | int err; |
2478 | bool lock_inode = !(mode & FALLOC_FL_KEEP_SIZE) || | 2476 | bool lock_inode = !(mode & FALLOC_FL_KEEP_SIZE) || |
2479 | (mode & FALLOC_FL_PUNCH_HOLE); | 2477 | (mode & FALLOC_FL_PUNCH_HOLE); |
2480 | 2478 | ||
2481 | if (fc->no_fallocate) | 2479 | if (fc->no_fallocate) |
2482 | return -EOPNOTSUPP; | 2480 | return -EOPNOTSUPP; |
2483 | 2481 | ||
2484 | if (lock_inode) { | 2482 | if (lock_inode) { |
2485 | mutex_lock(&inode->i_mutex); | 2483 | mutex_lock(&inode->i_mutex); |
2486 | if (mode & FALLOC_FL_PUNCH_HOLE) { | 2484 | if (mode & FALLOC_FL_PUNCH_HOLE) { |
2487 | loff_t endbyte = offset + length - 1; | 2485 | loff_t endbyte = offset + length - 1; |
2488 | err = filemap_write_and_wait_range(inode->i_mapping, | 2486 | err = filemap_write_and_wait_range(inode->i_mapping, |
2489 | offset, endbyte); | 2487 | offset, endbyte); |
2490 | if (err) | 2488 | if (err) |
2491 | goto out; | 2489 | goto out; |
2492 | 2490 | ||
2493 | fuse_sync_writes(inode); | 2491 | fuse_sync_writes(inode); |
2494 | } | 2492 | } |
2495 | } | 2493 | } |
2496 | 2494 | ||
2497 | if (!(mode & FALLOC_FL_KEEP_SIZE)) | 2495 | if (!(mode & FALLOC_FL_KEEP_SIZE)) |
2498 | set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); | 2496 | set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); |
2499 | 2497 | ||
2500 | req = fuse_get_req_nopages(fc); | 2498 | req = fuse_get_req_nopages(fc); |
2501 | if (IS_ERR(req)) { | 2499 | if (IS_ERR(req)) { |
2502 | err = PTR_ERR(req); | 2500 | err = PTR_ERR(req); |
2503 | goto out; | 2501 | goto out; |
2504 | } | 2502 | } |
2505 | 2503 | ||
2506 | req->in.h.opcode = FUSE_FALLOCATE; | 2504 | req->in.h.opcode = FUSE_FALLOCATE; |
2507 | req->in.h.nodeid = ff->nodeid; | 2505 | req->in.h.nodeid = ff->nodeid; |
2508 | req->in.numargs = 1; | 2506 | req->in.numargs = 1; |
2509 | req->in.args[0].size = sizeof(inarg); | 2507 | req->in.args[0].size = sizeof(inarg); |
2510 | req->in.args[0].value = &inarg; | 2508 | req->in.args[0].value = &inarg; |
2511 | fuse_request_send(fc, req); | 2509 | fuse_request_send(fc, req); |
2512 | err = req->out.h.error; | 2510 | err = req->out.h.error; |
2513 | if (err == -ENOSYS) { | 2511 | if (err == -ENOSYS) { |
2514 | fc->no_fallocate = 1; | 2512 | fc->no_fallocate = 1; |
2515 | err = -EOPNOTSUPP; | 2513 | err = -EOPNOTSUPP; |
2516 | } | 2514 | } |
2517 | fuse_put_request(fc, req); | 2515 | fuse_put_request(fc, req); |
2518 | 2516 | ||
2519 | if (err) | 2517 | if (err) |
2520 | goto out; | 2518 | goto out; |
2521 | 2519 | ||
2522 | /* we could have extended the file */ | 2520 | /* we could have extended the file */ |
2523 | if (!(mode & FALLOC_FL_KEEP_SIZE)) | 2521 | if (!(mode & FALLOC_FL_KEEP_SIZE)) |
2524 | fuse_write_update_size(inode, offset + length); | 2522 | fuse_write_update_size(inode, offset + length); |
2525 | 2523 | ||
2526 | if (mode & FALLOC_FL_PUNCH_HOLE) | 2524 | if (mode & FALLOC_FL_PUNCH_HOLE) |
2527 | truncate_pagecache_range(inode, offset, offset + length - 1); | 2525 | truncate_pagecache_range(inode, offset, offset + length - 1); |
2528 | 2526 | ||
2529 | fuse_invalidate_attr(inode); | 2527 | fuse_invalidate_attr(inode); |
2530 | 2528 | ||
2531 | out: | 2529 | out: |
2532 | if (!(mode & FALLOC_FL_KEEP_SIZE)) | 2530 | if (!(mode & FALLOC_FL_KEEP_SIZE)) |
2533 | clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); | 2531 | clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); |
2534 | 2532 | ||
2535 | if (lock_inode) | 2533 | if (lock_inode) |
2536 | mutex_unlock(&inode->i_mutex); | 2534 | mutex_unlock(&inode->i_mutex); |
2537 | 2535 | ||
2538 | return err; | 2536 | return err; |
2539 | } | 2537 | } |
2540 | 2538 | ||
2541 | static const struct file_operations fuse_file_operations = { | 2539 | static const struct file_operations fuse_file_operations = { |
2542 | .llseek = fuse_file_llseek, | 2540 | .llseek = fuse_file_llseek, |
2543 | .read = do_sync_read, | 2541 | .read = do_sync_read, |
2544 | .aio_read = fuse_file_aio_read, | 2542 | .aio_read = fuse_file_aio_read, |
2545 | .write = do_sync_write, | 2543 | .write = do_sync_write, |
2546 | .aio_write = fuse_file_aio_write, | 2544 | .aio_write = fuse_file_aio_write, |
2547 | .mmap = fuse_file_mmap, | 2545 | .mmap = fuse_file_mmap, |
2548 | .open = fuse_open, | 2546 | .open = fuse_open, |
2549 | .flush = fuse_flush, | 2547 | .flush = fuse_flush, |
2550 | .release = fuse_release, | 2548 | .release = fuse_release, |
2551 | .fsync = fuse_fsync, | 2549 | .fsync = fuse_fsync, |
2552 | .lock = fuse_file_lock, | 2550 | .lock = fuse_file_lock, |
2553 | .flock = fuse_file_flock, | 2551 | .flock = fuse_file_flock, |
2554 | .splice_read = generic_file_splice_read, | 2552 | .splice_read = generic_file_splice_read, |
2555 | .unlocked_ioctl = fuse_file_ioctl, | 2553 | .unlocked_ioctl = fuse_file_ioctl, |
2556 | .compat_ioctl = fuse_file_compat_ioctl, | 2554 | .compat_ioctl = fuse_file_compat_ioctl, |
2557 | .poll = fuse_file_poll, | 2555 | .poll = fuse_file_poll, |
2558 | .fallocate = fuse_file_fallocate, | 2556 | .fallocate = fuse_file_fallocate, |
2559 | }; | 2557 | }; |
2560 | 2558 | ||
2561 | static const struct file_operations fuse_direct_io_file_operations = { | 2559 | static const struct file_operations fuse_direct_io_file_operations = { |
2562 | .llseek = fuse_file_llseek, | 2560 | .llseek = fuse_file_llseek, |
2563 | .read = fuse_direct_read, | 2561 | .read = fuse_direct_read, |
2564 | .write = fuse_direct_write, | 2562 | .write = fuse_direct_write, |
2565 | .mmap = fuse_direct_mmap, | 2563 | .mmap = fuse_direct_mmap, |
2566 | .open = fuse_open, | 2564 | .open = fuse_open, |
2567 | .flush = fuse_flush, | 2565 | .flush = fuse_flush, |
2568 | .release = fuse_release, | 2566 | .release = fuse_release, |
2569 | .fsync = fuse_fsync, | 2567 | .fsync = fuse_fsync, |
2570 | .lock = fuse_file_lock, | 2568 | .lock = fuse_file_lock, |
2571 | .flock = fuse_file_flock, | 2569 | .flock = fuse_file_flock, |
2572 | .unlocked_ioctl = fuse_file_ioctl, | 2570 | .unlocked_ioctl = fuse_file_ioctl, |
2573 | .compat_ioctl = fuse_file_compat_ioctl, | 2571 | .compat_ioctl = fuse_file_compat_ioctl, |
2574 | .poll = fuse_file_poll, | 2572 | .poll = fuse_file_poll, |
2575 | .fallocate = fuse_file_fallocate, | 2573 | .fallocate = fuse_file_fallocate, |
2576 | /* no splice_read */ | 2574 | /* no splice_read */ |
2577 | }; | 2575 | }; |
2578 | 2576 | ||
2579 | static const struct address_space_operations fuse_file_aops = { | 2577 | static const struct address_space_operations fuse_file_aops = { |
2580 | .readpage = fuse_readpage, | 2578 | .readpage = fuse_readpage, |
2581 | .writepage = fuse_writepage, | 2579 | .writepage = fuse_writepage, |
2582 | .launder_page = fuse_launder_page, | 2580 | .launder_page = fuse_launder_page, |
2583 | .readpages = fuse_readpages, | 2581 | .readpages = fuse_readpages, |
2584 | .set_page_dirty = __set_page_dirty_nobuffers, | 2582 | .set_page_dirty = __set_page_dirty_nobuffers, |
2585 | .bmap = fuse_bmap, | 2583 | .bmap = fuse_bmap, |
2586 | .direct_IO = fuse_direct_IO, | 2584 | .direct_IO = fuse_direct_IO, |
2587 | }; | 2585 | }; |
2588 | 2586 | ||
2589 | void fuse_init_file_inode(struct inode *inode) | 2587 | void fuse_init_file_inode(struct inode *inode) |
2590 | { | 2588 | { |
2591 | inode->i_fop = &fuse_file_operations; | 2589 | inode->i_fop = &fuse_file_operations; |
2592 | inode->i_data.a_ops = &fuse_file_aops; | 2590 | inode->i_data.a_ops = &fuse_file_aops; |
2593 | } | 2591 | } |
2594 | 2592 |
fs/gfs2/aops.c
1 | /* | 1 | /* |
2 | * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved. | 2 | * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved. |
3 | * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved. | 3 | * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved. |
4 | * | 4 | * |
5 | * This copyrighted material is made available to anyone wishing to use, | 5 | * This copyrighted material is made available to anyone wishing to use, |
6 | * modify, copy, or redistribute it subject to the terms and conditions | 6 | * modify, copy, or redistribute it subject to the terms and conditions |
7 | * of the GNU General Public License version 2. | 7 | * of the GNU General Public License version 2. |
8 | */ | 8 | */ |
9 | 9 | ||
10 | #include <linux/sched.h> | 10 | #include <linux/sched.h> |
11 | #include <linux/slab.h> | 11 | #include <linux/slab.h> |
12 | #include <linux/spinlock.h> | 12 | #include <linux/spinlock.h> |
13 | #include <linux/completion.h> | 13 | #include <linux/completion.h> |
14 | #include <linux/buffer_head.h> | 14 | #include <linux/buffer_head.h> |
15 | #include <linux/pagemap.h> | 15 | #include <linux/pagemap.h> |
16 | #include <linux/pagevec.h> | 16 | #include <linux/pagevec.h> |
17 | #include <linux/mpage.h> | 17 | #include <linux/mpage.h> |
18 | #include <linux/fs.h> | 18 | #include <linux/fs.h> |
19 | #include <linux/writeback.h> | 19 | #include <linux/writeback.h> |
20 | #include <linux/swap.h> | 20 | #include <linux/swap.h> |
21 | #include <linux/gfs2_ondisk.h> | 21 | #include <linux/gfs2_ondisk.h> |
22 | #include <linux/backing-dev.h> | 22 | #include <linux/backing-dev.h> |
23 | #include <linux/aio.h> | 23 | #include <linux/aio.h> |
24 | 24 | ||
25 | #include "gfs2.h" | 25 | #include "gfs2.h" |
26 | #include "incore.h" | 26 | #include "incore.h" |
27 | #include "bmap.h" | 27 | #include "bmap.h" |
28 | #include "glock.h" | 28 | #include "glock.h" |
29 | #include "inode.h" | 29 | #include "inode.h" |
30 | #include "log.h" | 30 | #include "log.h" |
31 | #include "meta_io.h" | 31 | #include "meta_io.h" |
32 | #include "quota.h" | 32 | #include "quota.h" |
33 | #include "trans.h" | 33 | #include "trans.h" |
34 | #include "rgrp.h" | 34 | #include "rgrp.h" |
35 | #include "super.h" | 35 | #include "super.h" |
36 | #include "util.h" | 36 | #include "util.h" |
37 | #include "glops.h" | 37 | #include "glops.h" |
38 | 38 | ||
39 | 39 | ||
40 | static void gfs2_page_add_databufs(struct gfs2_inode *ip, struct page *page, | 40 | static void gfs2_page_add_databufs(struct gfs2_inode *ip, struct page *page, |
41 | unsigned int from, unsigned int to) | 41 | unsigned int from, unsigned int to) |
42 | { | 42 | { |
43 | struct buffer_head *head = page_buffers(page); | 43 | struct buffer_head *head = page_buffers(page); |
44 | unsigned int bsize = head->b_size; | 44 | unsigned int bsize = head->b_size; |
45 | struct buffer_head *bh; | 45 | struct buffer_head *bh; |
46 | unsigned int start, end; | 46 | unsigned int start, end; |
47 | 47 | ||
48 | for (bh = head, start = 0; bh != head || !start; | 48 | for (bh = head, start = 0; bh != head || !start; |
49 | bh = bh->b_this_page, start = end) { | 49 | bh = bh->b_this_page, start = end) { |
50 | end = start + bsize; | 50 | end = start + bsize; |
51 | if (end <= from || start >= to) | 51 | if (end <= from || start >= to) |
52 | continue; | 52 | continue; |
53 | if (gfs2_is_jdata(ip)) | 53 | if (gfs2_is_jdata(ip)) |
54 | set_buffer_uptodate(bh); | 54 | set_buffer_uptodate(bh); |
55 | gfs2_trans_add_data(ip->i_gl, bh); | 55 | gfs2_trans_add_data(ip->i_gl, bh); |
56 | } | 56 | } |
57 | } | 57 | } |
58 | 58 | ||
59 | /** | 59 | /** |
60 | * gfs2_get_block_noalloc - Fills in a buffer head with details about a block | 60 | * gfs2_get_block_noalloc - Fills in a buffer head with details about a block |
61 | * @inode: The inode | 61 | * @inode: The inode |
62 | * @lblock: The block number to look up | 62 | * @lblock: The block number to look up |
63 | * @bh_result: The buffer head to return the result in | 63 | * @bh_result: The buffer head to return the result in |
64 | * @create: Non-zero if we may add block to the file | 64 | * @create: Non-zero if we may add block to the file |
65 | * | 65 | * |
66 | * Returns: errno | 66 | * Returns: errno |
67 | */ | 67 | */ |
68 | 68 | ||
69 | static int gfs2_get_block_noalloc(struct inode *inode, sector_t lblock, | 69 | static int gfs2_get_block_noalloc(struct inode *inode, sector_t lblock, |
70 | struct buffer_head *bh_result, int create) | 70 | struct buffer_head *bh_result, int create) |
71 | { | 71 | { |
72 | int error; | 72 | int error; |
73 | 73 | ||
74 | error = gfs2_block_map(inode, lblock, bh_result, 0); | 74 | error = gfs2_block_map(inode, lblock, bh_result, 0); |
75 | if (error) | 75 | if (error) |
76 | return error; | 76 | return error; |
77 | if (!buffer_mapped(bh_result)) | 77 | if (!buffer_mapped(bh_result)) |
78 | return -EIO; | 78 | return -EIO; |
79 | return 0; | 79 | return 0; |
80 | } | 80 | } |
81 | 81 | ||
82 | static int gfs2_get_block_direct(struct inode *inode, sector_t lblock, | 82 | static int gfs2_get_block_direct(struct inode *inode, sector_t lblock, |
83 | struct buffer_head *bh_result, int create) | 83 | struct buffer_head *bh_result, int create) |
84 | { | 84 | { |
85 | return gfs2_block_map(inode, lblock, bh_result, 0); | 85 | return gfs2_block_map(inode, lblock, bh_result, 0); |
86 | } | 86 | } |
87 | 87 | ||
88 | /** | 88 | /** |
89 | * gfs2_writepage_common - Common bits of writepage | 89 | * gfs2_writepage_common - Common bits of writepage |
90 | * @page: The page to be written | 90 | * @page: The page to be written |
91 | * @wbc: The writeback control | 91 | * @wbc: The writeback control |
92 | * | 92 | * |
93 | * Returns: 1 if writepage is ok, otherwise an error code or zero if no error. | 93 | * Returns: 1 if writepage is ok, otherwise an error code or zero if no error. |
94 | */ | 94 | */ |
95 | 95 | ||
96 | static int gfs2_writepage_common(struct page *page, | 96 | static int gfs2_writepage_common(struct page *page, |
97 | struct writeback_control *wbc) | 97 | struct writeback_control *wbc) |
98 | { | 98 | { |
99 | struct inode *inode = page->mapping->host; | 99 | struct inode *inode = page->mapping->host; |
100 | struct gfs2_inode *ip = GFS2_I(inode); | 100 | struct gfs2_inode *ip = GFS2_I(inode); |
101 | struct gfs2_sbd *sdp = GFS2_SB(inode); | 101 | struct gfs2_sbd *sdp = GFS2_SB(inode); |
102 | loff_t i_size = i_size_read(inode); | 102 | loff_t i_size = i_size_read(inode); |
103 | pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; | 103 | pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; |
104 | unsigned offset; | 104 | unsigned offset; |
105 | 105 | ||
106 | if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl))) | 106 | if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl))) |
107 | goto out; | 107 | goto out; |
108 | if (current->journal_info) | 108 | if (current->journal_info) |
109 | goto redirty; | 109 | goto redirty; |
110 | /* Is the page fully outside i_size? (truncate in progress) */ | 110 | /* Is the page fully outside i_size? (truncate in progress) */ |
111 | offset = i_size & (PAGE_CACHE_SIZE-1); | 111 | offset = i_size & (PAGE_CACHE_SIZE-1); |
112 | if (page->index > end_index || (page->index == end_index && !offset)) { | 112 | if (page->index > end_index || (page->index == end_index && !offset)) { |
113 | page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE); | 113 | page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE); |
114 | goto out; | 114 | goto out; |
115 | } | 115 | } |
116 | return 1; | 116 | return 1; |
117 | redirty: | 117 | redirty: |
118 | redirty_page_for_writepage(wbc, page); | 118 | redirty_page_for_writepage(wbc, page); |
119 | out: | 119 | out: |
120 | unlock_page(page); | 120 | unlock_page(page); |
121 | return 0; | 121 | return 0; |
122 | } | 122 | } |
123 | 123 | ||
124 | /** | 124 | /** |
125 | * gfs2_writepage - Write page for writeback mappings | 125 | * gfs2_writepage - Write page for writeback mappings |
126 | * @page: The page | 126 | * @page: The page |
127 | * @wbc: The writeback control | 127 | * @wbc: The writeback control |
128 | * | 128 | * |
129 | */ | 129 | */ |
130 | 130 | ||
131 | static int gfs2_writepage(struct page *page, struct writeback_control *wbc) | 131 | static int gfs2_writepage(struct page *page, struct writeback_control *wbc) |
132 | { | 132 | { |
133 | int ret; | 133 | int ret; |
134 | 134 | ||
135 | ret = gfs2_writepage_common(page, wbc); | 135 | ret = gfs2_writepage_common(page, wbc); |
136 | if (ret <= 0) | 136 | if (ret <= 0) |
137 | return ret; | 137 | return ret; |
138 | 138 | ||
139 | return nobh_writepage(page, gfs2_get_block_noalloc, wbc); | 139 | return nobh_writepage(page, gfs2_get_block_noalloc, wbc); |
140 | } | 140 | } |
141 | 141 | ||
142 | /** | 142 | /** |
143 | * __gfs2_jdata_writepage - The core of jdata writepage | 143 | * __gfs2_jdata_writepage - The core of jdata writepage |
144 | * @page: The page to write | 144 | * @page: The page to write |
145 | * @wbc: The writeback control | 145 | * @wbc: The writeback control |
146 | * | 146 | * |
147 | * This is shared between writepage and writepages and implements the | 147 | * This is shared between writepage and writepages and implements the |
148 | * core of the writepage operation. If a transaction is required then | 148 | * core of the writepage operation. If a transaction is required then |
149 | * PageChecked will have been set and the transaction will have | 149 | * PageChecked will have been set and the transaction will have |
150 | * already been started before this is called. | 150 | * already been started before this is called. |
151 | */ | 151 | */ |
152 | 152 | ||
153 | static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc) | 153 | static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc) |
154 | { | 154 | { |
155 | struct inode *inode = page->mapping->host; | 155 | struct inode *inode = page->mapping->host; |
156 | struct gfs2_inode *ip = GFS2_I(inode); | 156 | struct gfs2_inode *ip = GFS2_I(inode); |
157 | struct gfs2_sbd *sdp = GFS2_SB(inode); | 157 | struct gfs2_sbd *sdp = GFS2_SB(inode); |
158 | 158 | ||
159 | if (PageChecked(page)) { | 159 | if (PageChecked(page)) { |
160 | ClearPageChecked(page); | 160 | ClearPageChecked(page); |
161 | if (!page_has_buffers(page)) { | 161 | if (!page_has_buffers(page)) { |
162 | create_empty_buffers(page, inode->i_sb->s_blocksize, | 162 | create_empty_buffers(page, inode->i_sb->s_blocksize, |
163 | (1 << BH_Dirty)|(1 << BH_Uptodate)); | 163 | (1 << BH_Dirty)|(1 << BH_Uptodate)); |
164 | } | 164 | } |
165 | gfs2_page_add_databufs(ip, page, 0, sdp->sd_vfs->s_blocksize-1); | 165 | gfs2_page_add_databufs(ip, page, 0, sdp->sd_vfs->s_blocksize-1); |
166 | } | 166 | } |
167 | return block_write_full_page(page, gfs2_get_block_noalloc, wbc); | 167 | return block_write_full_page(page, gfs2_get_block_noalloc, wbc); |
168 | } | 168 | } |
169 | 169 | ||
170 | /** | 170 | /** |
171 | * gfs2_jdata_writepage - Write complete page | 171 | * gfs2_jdata_writepage - Write complete page |
172 | * @page: Page to write | 172 | * @page: Page to write |
173 | * | 173 | * |
174 | * Returns: errno | 174 | * Returns: errno |
175 | * | 175 | * |
176 | */ | 176 | */ |
177 | 177 | ||
178 | static int gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc) | 178 | static int gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc) |
179 | { | 179 | { |
180 | struct inode *inode = page->mapping->host; | 180 | struct inode *inode = page->mapping->host; |
181 | struct gfs2_sbd *sdp = GFS2_SB(inode); | 181 | struct gfs2_sbd *sdp = GFS2_SB(inode); |
182 | int ret; | 182 | int ret; |
183 | int done_trans = 0; | 183 | int done_trans = 0; |
184 | 184 | ||
185 | if (PageChecked(page)) { | 185 | if (PageChecked(page)) { |
186 | if (wbc->sync_mode != WB_SYNC_ALL) | 186 | if (wbc->sync_mode != WB_SYNC_ALL) |
187 | goto out_ignore; | 187 | goto out_ignore; |
188 | ret = gfs2_trans_begin(sdp, RES_DINODE + 1, 0); | 188 | ret = gfs2_trans_begin(sdp, RES_DINODE + 1, 0); |
189 | if (ret) | 189 | if (ret) |
190 | goto out_ignore; | 190 | goto out_ignore; |
191 | done_trans = 1; | 191 | done_trans = 1; |
192 | } | 192 | } |
193 | ret = gfs2_writepage_common(page, wbc); | 193 | ret = gfs2_writepage_common(page, wbc); |
194 | if (ret > 0) | 194 | if (ret > 0) |
195 | ret = __gfs2_jdata_writepage(page, wbc); | 195 | ret = __gfs2_jdata_writepage(page, wbc); |
196 | if (done_trans) | 196 | if (done_trans) |
197 | gfs2_trans_end(sdp); | 197 | gfs2_trans_end(sdp); |
198 | return ret; | 198 | return ret; |
199 | 199 | ||
200 | out_ignore: | 200 | out_ignore: |
201 | redirty_page_for_writepage(wbc, page); | 201 | redirty_page_for_writepage(wbc, page); |
202 | unlock_page(page); | 202 | unlock_page(page); |
203 | return 0; | 203 | return 0; |
204 | } | 204 | } |
205 | 205 | ||
206 | /** | 206 | /** |
207 | * gfs2_writepages - Write a bunch of dirty pages back to disk | 207 | * gfs2_writepages - Write a bunch of dirty pages back to disk |
208 | * @mapping: The mapping to write | 208 | * @mapping: The mapping to write |
209 | * @wbc: Write-back control | 209 | * @wbc: Write-back control |
210 | * | 210 | * |
211 | * Used for both ordered and writeback modes. | 211 | * Used for both ordered and writeback modes. |
212 | */ | 212 | */ |
213 | static int gfs2_writepages(struct address_space *mapping, | 213 | static int gfs2_writepages(struct address_space *mapping, |
214 | struct writeback_control *wbc) | 214 | struct writeback_control *wbc) |
215 | { | 215 | { |
216 | return mpage_writepages(mapping, wbc, gfs2_get_block_noalloc); | 216 | return mpage_writepages(mapping, wbc, gfs2_get_block_noalloc); |
217 | } | 217 | } |
218 | 218 | ||
219 | /** | 219 | /** |
220 | * gfs2_write_jdata_pagevec - Write back a pagevec's worth of pages | 220 | * gfs2_write_jdata_pagevec - Write back a pagevec's worth of pages |
221 | * @mapping: The mapping | 221 | * @mapping: The mapping |
222 | * @wbc: The writeback control | 222 | * @wbc: The writeback control |
223 | * @writepage: The writepage function to call for each page | 223 | * @writepage: The writepage function to call for each page |
224 | * @pvec: The vector of pages | 224 | * @pvec: The vector of pages |
225 | * @nr_pages: The number of pages to write | 225 | * @nr_pages: The number of pages to write |
226 | * | 226 | * |
227 | * Returns: non-zero if loop should terminate, zero otherwise | 227 | * Returns: non-zero if loop should terminate, zero otherwise |
228 | */ | 228 | */ |
229 | 229 | ||
230 | static int gfs2_write_jdata_pagevec(struct address_space *mapping, | 230 | static int gfs2_write_jdata_pagevec(struct address_space *mapping, |
231 | struct writeback_control *wbc, | 231 | struct writeback_control *wbc, |
232 | struct pagevec *pvec, | 232 | struct pagevec *pvec, |
233 | int nr_pages, pgoff_t end) | 233 | int nr_pages, pgoff_t end) |
234 | { | 234 | { |
235 | struct inode *inode = mapping->host; | 235 | struct inode *inode = mapping->host; |
236 | struct gfs2_sbd *sdp = GFS2_SB(inode); | 236 | struct gfs2_sbd *sdp = GFS2_SB(inode); |
237 | loff_t i_size = i_size_read(inode); | 237 | loff_t i_size = i_size_read(inode); |
238 | pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; | 238 | pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; |
239 | unsigned offset = i_size & (PAGE_CACHE_SIZE-1); | 239 | unsigned offset = i_size & (PAGE_CACHE_SIZE-1); |
240 | unsigned nrblocks = nr_pages * (PAGE_CACHE_SIZE/inode->i_sb->s_blocksize); | 240 | unsigned nrblocks = nr_pages * (PAGE_CACHE_SIZE/inode->i_sb->s_blocksize); |
241 | int i; | 241 | int i; |
242 | int ret; | 242 | int ret; |
243 | 243 | ||
244 | ret = gfs2_trans_begin(sdp, nrblocks, nrblocks); | 244 | ret = gfs2_trans_begin(sdp, nrblocks, nrblocks); |
245 | if (ret < 0) | 245 | if (ret < 0) |
246 | return ret; | 246 | return ret; |
247 | 247 | ||
248 | for(i = 0; i < nr_pages; i++) { | 248 | for(i = 0; i < nr_pages; i++) { |
249 | struct page *page = pvec->pages[i]; | 249 | struct page *page = pvec->pages[i]; |
250 | 250 | ||
251 | lock_page(page); | 251 | lock_page(page); |
252 | 252 | ||
253 | if (unlikely(page->mapping != mapping)) { | 253 | if (unlikely(page->mapping != mapping)) { |
254 | unlock_page(page); | 254 | unlock_page(page); |
255 | continue; | 255 | continue; |
256 | } | 256 | } |
257 | 257 | ||
258 | if (!wbc->range_cyclic && page->index > end) { | 258 | if (!wbc->range_cyclic && page->index > end) { |
259 | ret = 1; | 259 | ret = 1; |
260 | unlock_page(page); | 260 | unlock_page(page); |
261 | continue; | 261 | continue; |
262 | } | 262 | } |
263 | 263 | ||
264 | if (wbc->sync_mode != WB_SYNC_NONE) | 264 | if (wbc->sync_mode != WB_SYNC_NONE) |
265 | wait_on_page_writeback(page); | 265 | wait_on_page_writeback(page); |
266 | 266 | ||
267 | if (PageWriteback(page) || | 267 | if (PageWriteback(page) || |
268 | !clear_page_dirty_for_io(page)) { | 268 | !clear_page_dirty_for_io(page)) { |
269 | unlock_page(page); | 269 | unlock_page(page); |
270 | continue; | 270 | continue; |
271 | } | 271 | } |
272 | 272 | ||
273 | /* Is the page fully outside i_size? (truncate in progress) */ | 273 | /* Is the page fully outside i_size? (truncate in progress) */ |
274 | if (page->index > end_index || (page->index == end_index && !offset)) { | 274 | if (page->index > end_index || (page->index == end_index && !offset)) { |
275 | page->mapping->a_ops->invalidatepage(page, 0, | 275 | page->mapping->a_ops->invalidatepage(page, 0, |
276 | PAGE_CACHE_SIZE); | 276 | PAGE_CACHE_SIZE); |
277 | unlock_page(page); | 277 | unlock_page(page); |
278 | continue; | 278 | continue; |
279 | } | 279 | } |
280 | 280 | ||
281 | ret = __gfs2_jdata_writepage(page, wbc); | 281 | ret = __gfs2_jdata_writepage(page, wbc); |
282 | 282 | ||
283 | if (ret || (--(wbc->nr_to_write) <= 0)) | 283 | if (ret || (--(wbc->nr_to_write) <= 0)) |
284 | ret = 1; | 284 | ret = 1; |
285 | } | 285 | } |
286 | gfs2_trans_end(sdp); | 286 | gfs2_trans_end(sdp); |
287 | return ret; | 287 | return ret; |
288 | } | 288 | } |
289 | 289 | ||
290 | /** | 290 | /** |
291 | * gfs2_write_cache_jdata - Like write_cache_pages but different | 291 | * gfs2_write_cache_jdata - Like write_cache_pages but different |
292 | * @mapping: The mapping to write | 292 | * @mapping: The mapping to write |
293 | * @wbc: The writeback control | 293 | * @wbc: The writeback control |
294 | * @writepage: The writepage function to call | 294 | * @writepage: The writepage function to call |
295 | * @data: The data to pass to writepage | 295 | * @data: The data to pass to writepage |
296 | * | 296 | * |
297 | * The reason that we use our own function here is that we need to | 297 | * The reason that we use our own function here is that we need to |
298 | * start transactions before we grab page locks. This allows us | 298 | * start transactions before we grab page locks. This allows us |
299 | * to get the ordering right. | 299 | * to get the ordering right. |
300 | */ | 300 | */ |
301 | 301 | ||
302 | static int gfs2_write_cache_jdata(struct address_space *mapping, | 302 | static int gfs2_write_cache_jdata(struct address_space *mapping, |
303 | struct writeback_control *wbc) | 303 | struct writeback_control *wbc) |
304 | { | 304 | { |
305 | int ret = 0; | 305 | int ret = 0; |
306 | int done = 0; | 306 | int done = 0; |
307 | struct pagevec pvec; | 307 | struct pagevec pvec; |
308 | int nr_pages; | 308 | int nr_pages; |
309 | pgoff_t index; | 309 | pgoff_t index; |
310 | pgoff_t end; | 310 | pgoff_t end; |
311 | int scanned = 0; | 311 | int scanned = 0; |
312 | int range_whole = 0; | 312 | int range_whole = 0; |
313 | 313 | ||
314 | pagevec_init(&pvec, 0); | 314 | pagevec_init(&pvec, 0); |
315 | if (wbc->range_cyclic) { | 315 | if (wbc->range_cyclic) { |
316 | index = mapping->writeback_index; /* Start from prev offset */ | 316 | index = mapping->writeback_index; /* Start from prev offset */ |
317 | end = -1; | 317 | end = -1; |
318 | } else { | 318 | } else { |
319 | index = wbc->range_start >> PAGE_CACHE_SHIFT; | 319 | index = wbc->range_start >> PAGE_CACHE_SHIFT; |
320 | end = wbc->range_end >> PAGE_CACHE_SHIFT; | 320 | end = wbc->range_end >> PAGE_CACHE_SHIFT; |
321 | if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) | 321 | if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) |
322 | range_whole = 1; | 322 | range_whole = 1; |
323 | scanned = 1; | 323 | scanned = 1; |
324 | } | 324 | } |
325 | 325 | ||
326 | retry: | 326 | retry: |
327 | while (!done && (index <= end) && | 327 | while (!done && (index <= end) && |
328 | (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, | 328 | (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, |
329 | PAGECACHE_TAG_DIRTY, | 329 | PAGECACHE_TAG_DIRTY, |
330 | min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) { | 330 | min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) { |
331 | scanned = 1; | 331 | scanned = 1; |
332 | ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end); | 332 | ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end); |
333 | if (ret) | 333 | if (ret) |
334 | done = 1; | 334 | done = 1; |
335 | if (ret > 0) | 335 | if (ret > 0) |
336 | ret = 0; | 336 | ret = 0; |
337 | 337 | ||
338 | pagevec_release(&pvec); | 338 | pagevec_release(&pvec); |
339 | cond_resched(); | 339 | cond_resched(); |
340 | } | 340 | } |
341 | 341 | ||
342 | if (!scanned && !done) { | 342 | if (!scanned && !done) { |
343 | /* | 343 | /* |
344 | * We hit the last page and there is more work to be done: wrap | 344 | * We hit the last page and there is more work to be done: wrap |
345 | * back to the start of the file | 345 | * back to the start of the file |
346 | */ | 346 | */ |
347 | scanned = 1; | 347 | scanned = 1; |
348 | index = 0; | 348 | index = 0; |
349 | goto retry; | 349 | goto retry; |
350 | } | 350 | } |
351 | 351 | ||
352 | if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)) | 352 | if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)) |
353 | mapping->writeback_index = index; | 353 | mapping->writeback_index = index; |
354 | return ret; | 354 | return ret; |
355 | } | 355 | } |
356 | 356 | ||
357 | 357 | ||
358 | /** | 358 | /** |
359 | * gfs2_jdata_writepages - Write a bunch of dirty pages back to disk | 359 | * gfs2_jdata_writepages - Write a bunch of dirty pages back to disk |
360 | * @mapping: The mapping to write | 360 | * @mapping: The mapping to write |
361 | * @wbc: The writeback control | 361 | * @wbc: The writeback control |
362 | * | 362 | * |
363 | */ | 363 | */ |
364 | 364 | ||
365 | static int gfs2_jdata_writepages(struct address_space *mapping, | 365 | static int gfs2_jdata_writepages(struct address_space *mapping, |
366 | struct writeback_control *wbc) | 366 | struct writeback_control *wbc) |
367 | { | 367 | { |
368 | struct gfs2_inode *ip = GFS2_I(mapping->host); | 368 | struct gfs2_inode *ip = GFS2_I(mapping->host); |
369 | struct gfs2_sbd *sdp = GFS2_SB(mapping->host); | 369 | struct gfs2_sbd *sdp = GFS2_SB(mapping->host); |
370 | int ret; | 370 | int ret; |
371 | 371 | ||
372 | ret = gfs2_write_cache_jdata(mapping, wbc); | 372 | ret = gfs2_write_cache_jdata(mapping, wbc); |
373 | if (ret == 0 && wbc->sync_mode == WB_SYNC_ALL) { | 373 | if (ret == 0 && wbc->sync_mode == WB_SYNC_ALL) { |
374 | gfs2_log_flush(sdp, ip->i_gl); | 374 | gfs2_log_flush(sdp, ip->i_gl); |
375 | ret = gfs2_write_cache_jdata(mapping, wbc); | 375 | ret = gfs2_write_cache_jdata(mapping, wbc); |
376 | } | 376 | } |
377 | return ret; | 377 | return ret; |
378 | } | 378 | } |
379 | 379 | ||
380 | /** | 380 | /** |
381 | * stuffed_readpage - Fill in a Linux page with stuffed file data | 381 | * stuffed_readpage - Fill in a Linux page with stuffed file data |
382 | * @ip: the inode | 382 | * @ip: the inode |
383 | * @page: the page | 383 | * @page: the page |
384 | * | 384 | * |
385 | * Returns: errno | 385 | * Returns: errno |
386 | */ | 386 | */ |
387 | 387 | ||
388 | static int stuffed_readpage(struct gfs2_inode *ip, struct page *page) | 388 | static int stuffed_readpage(struct gfs2_inode *ip, struct page *page) |
389 | { | 389 | { |
390 | struct buffer_head *dibh; | 390 | struct buffer_head *dibh; |
391 | u64 dsize = i_size_read(&ip->i_inode); | 391 | u64 dsize = i_size_read(&ip->i_inode); |
392 | void *kaddr; | 392 | void *kaddr; |
393 | int error; | 393 | int error; |
394 | 394 | ||
395 | /* | 395 | /* |
396 | * Due to the order of unstuffing files and ->fault(), we can be | 396 | * Due to the order of unstuffing files and ->fault(), we can be |
397 | * asked for a zero page in the case of a stuffed file being extended, | 397 | * asked for a zero page in the case of a stuffed file being extended, |
398 | * so we need to supply one here. It doesn't happen often. | 398 | * so we need to supply one here. It doesn't happen often. |
399 | */ | 399 | */ |
400 | if (unlikely(page->index)) { | 400 | if (unlikely(page->index)) { |
401 | zero_user(page, 0, PAGE_CACHE_SIZE); | 401 | zero_user(page, 0, PAGE_CACHE_SIZE); |
402 | SetPageUptodate(page); | 402 | SetPageUptodate(page); |
403 | return 0; | 403 | return 0; |
404 | } | 404 | } |
405 | 405 | ||
406 | error = gfs2_meta_inode_buffer(ip, &dibh); | 406 | error = gfs2_meta_inode_buffer(ip, &dibh); |
407 | if (error) | 407 | if (error) |
408 | return error; | 408 | return error; |
409 | 409 | ||
410 | kaddr = kmap_atomic(page); | 410 | kaddr = kmap_atomic(page); |
411 | if (dsize > (dibh->b_size - sizeof(struct gfs2_dinode))) | 411 | if (dsize > (dibh->b_size - sizeof(struct gfs2_dinode))) |
412 | dsize = (dibh->b_size - sizeof(struct gfs2_dinode)); | 412 | dsize = (dibh->b_size - sizeof(struct gfs2_dinode)); |
413 | memcpy(kaddr, dibh->b_data + sizeof(struct gfs2_dinode), dsize); | 413 | memcpy(kaddr, dibh->b_data + sizeof(struct gfs2_dinode), dsize); |
414 | memset(kaddr + dsize, 0, PAGE_CACHE_SIZE - dsize); | 414 | memset(kaddr + dsize, 0, PAGE_CACHE_SIZE - dsize); |
415 | kunmap_atomic(kaddr); | 415 | kunmap_atomic(kaddr); |
416 | flush_dcache_page(page); | 416 | flush_dcache_page(page); |
417 | brelse(dibh); | 417 | brelse(dibh); |
418 | SetPageUptodate(page); | 418 | SetPageUptodate(page); |
419 | 419 | ||
420 | return 0; | 420 | return 0; |
421 | } | 421 | } |
422 | 422 | ||
423 | 423 | ||
424 | /** | 424 | /** |
425 | * __gfs2_readpage - readpage | 425 | * __gfs2_readpage - readpage |
426 | * @file: The file to read a page for | 426 | * @file: The file to read a page for |
427 | * @page: The page to read | 427 | * @page: The page to read |
428 | * | 428 | * |
429 | * This is the core of gfs2's readpage. Its used by the internal file | 429 | * This is the core of gfs2's readpage. Its used by the internal file |
430 | * reading code as in that case we already hold the glock. Also its | 430 | * reading code as in that case we already hold the glock. Also its |
431 | * called by gfs2_readpage() once the required lock has been granted. | 431 | * called by gfs2_readpage() once the required lock has been granted. |
432 | * | 432 | * |
433 | */ | 433 | */ |
434 | 434 | ||
435 | static int __gfs2_readpage(void *file, struct page *page) | 435 | static int __gfs2_readpage(void *file, struct page *page) |
436 | { | 436 | { |
437 | struct gfs2_inode *ip = GFS2_I(page->mapping->host); | 437 | struct gfs2_inode *ip = GFS2_I(page->mapping->host); |
438 | struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host); | 438 | struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host); |
439 | int error; | 439 | int error; |
440 | 440 | ||
441 | if (gfs2_is_stuffed(ip)) { | 441 | if (gfs2_is_stuffed(ip)) { |
442 | error = stuffed_readpage(ip, page); | 442 | error = stuffed_readpage(ip, page); |
443 | unlock_page(page); | 443 | unlock_page(page); |
444 | } else { | 444 | } else { |
445 | error = mpage_readpage(page, gfs2_block_map); | 445 | error = mpage_readpage(page, gfs2_block_map); |
446 | } | 446 | } |
447 | 447 | ||
448 | if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) | 448 | if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) |
449 | return -EIO; | 449 | return -EIO; |
450 | 450 | ||
451 | return error; | 451 | return error; |
452 | } | 452 | } |
453 | 453 | ||
454 | /** | 454 | /** |
455 | * gfs2_readpage - read a page of a file | 455 | * gfs2_readpage - read a page of a file |
456 | * @file: The file to read | 456 | * @file: The file to read |
457 | * @page: The page of the file | 457 | * @page: The page of the file |
458 | * | 458 | * |
459 | * This deals with the locking required. We have to unlock and | 459 | * This deals with the locking required. We have to unlock and |
460 | * relock the page in order to get the locking in the right | 460 | * relock the page in order to get the locking in the right |
461 | * order. | 461 | * order. |
462 | */ | 462 | */ |
463 | 463 | ||
464 | static int gfs2_readpage(struct file *file, struct page *page) | 464 | static int gfs2_readpage(struct file *file, struct page *page) |
465 | { | 465 | { |
466 | struct address_space *mapping = page->mapping; | 466 | struct address_space *mapping = page->mapping; |
467 | struct gfs2_inode *ip = GFS2_I(mapping->host); | 467 | struct gfs2_inode *ip = GFS2_I(mapping->host); |
468 | struct gfs2_holder gh; | 468 | struct gfs2_holder gh; |
469 | int error; | 469 | int error; |
470 | 470 | ||
471 | unlock_page(page); | 471 | unlock_page(page); |
472 | gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); | 472 | gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); |
473 | error = gfs2_glock_nq(&gh); | 473 | error = gfs2_glock_nq(&gh); |
474 | if (unlikely(error)) | 474 | if (unlikely(error)) |
475 | goto out; | 475 | goto out; |
476 | error = AOP_TRUNCATED_PAGE; | 476 | error = AOP_TRUNCATED_PAGE; |
477 | lock_page(page); | 477 | lock_page(page); |
478 | if (page->mapping == mapping && !PageUptodate(page)) | 478 | if (page->mapping == mapping && !PageUptodate(page)) |
479 | error = __gfs2_readpage(file, page); | 479 | error = __gfs2_readpage(file, page); |
480 | else | 480 | else |
481 | unlock_page(page); | 481 | unlock_page(page); |
482 | gfs2_glock_dq(&gh); | 482 | gfs2_glock_dq(&gh); |
483 | out: | 483 | out: |
484 | gfs2_holder_uninit(&gh); | 484 | gfs2_holder_uninit(&gh); |
485 | if (error && error != AOP_TRUNCATED_PAGE) | 485 | if (error && error != AOP_TRUNCATED_PAGE) |
486 | lock_page(page); | 486 | lock_page(page); |
487 | return error; | 487 | return error; |
488 | } | 488 | } |
489 | 489 | ||
490 | /** | 490 | /** |
491 | * gfs2_internal_read - read an internal file | 491 | * gfs2_internal_read - read an internal file |
492 | * @ip: The gfs2 inode | 492 | * @ip: The gfs2 inode |
493 | * @buf: The buffer to fill | 493 | * @buf: The buffer to fill |
494 | * @pos: The file position | 494 | * @pos: The file position |
495 | * @size: The amount to read | 495 | * @size: The amount to read |
496 | * | 496 | * |
497 | */ | 497 | */ |
498 | 498 | ||
499 | int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos, | 499 | int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos, |
500 | unsigned size) | 500 | unsigned size) |
501 | { | 501 | { |
502 | struct address_space *mapping = ip->i_inode.i_mapping; | 502 | struct address_space *mapping = ip->i_inode.i_mapping; |
503 | unsigned long index = *pos / PAGE_CACHE_SIZE; | 503 | unsigned long index = *pos / PAGE_CACHE_SIZE; |
504 | unsigned offset = *pos & (PAGE_CACHE_SIZE - 1); | 504 | unsigned offset = *pos & (PAGE_CACHE_SIZE - 1); |
505 | unsigned copied = 0; | 505 | unsigned copied = 0; |
506 | unsigned amt; | 506 | unsigned amt; |
507 | struct page *page; | 507 | struct page *page; |
508 | void *p; | 508 | void *p; |
509 | 509 | ||
510 | do { | 510 | do { |
511 | amt = size - copied; | 511 | amt = size - copied; |
512 | if (offset + size > PAGE_CACHE_SIZE) | 512 | if (offset + size > PAGE_CACHE_SIZE) |
513 | amt = PAGE_CACHE_SIZE - offset; | 513 | amt = PAGE_CACHE_SIZE - offset; |
514 | page = read_cache_page(mapping, index, __gfs2_readpage, NULL); | 514 | page = read_cache_page(mapping, index, __gfs2_readpage, NULL); |
515 | if (IS_ERR(page)) | 515 | if (IS_ERR(page)) |
516 | return PTR_ERR(page); | 516 | return PTR_ERR(page); |
517 | p = kmap_atomic(page); | 517 | p = kmap_atomic(page); |
518 | memcpy(buf + copied, p + offset, amt); | 518 | memcpy(buf + copied, p + offset, amt); |
519 | kunmap_atomic(p); | 519 | kunmap_atomic(p); |
520 | mark_page_accessed(page); | ||
521 | page_cache_release(page); | 520 | page_cache_release(page); |
522 | copied += amt; | 521 | copied += amt; |
523 | index++; | 522 | index++; |
524 | offset = 0; | 523 | offset = 0; |
525 | } while(copied < size); | 524 | } while(copied < size); |
526 | (*pos) += size; | 525 | (*pos) += size; |
527 | return size; | 526 | return size; |
528 | } | 527 | } |
529 | 528 | ||
530 | /** | 529 | /** |
531 | * gfs2_readpages - Read a bunch of pages at once | 530 | * gfs2_readpages - Read a bunch of pages at once |
532 | * | 531 | * |
533 | * Some notes: | 532 | * Some notes: |
534 | * 1. This is only for readahead, so we can simply ignore any things | 533 | * 1. This is only for readahead, so we can simply ignore any things |
535 | * which are slightly inconvenient (such as locking conflicts between | 534 | * which are slightly inconvenient (such as locking conflicts between |
536 | * the page lock and the glock) and return having done no I/O. Its | 535 | * the page lock and the glock) and return having done no I/O. Its |
537 | * obviously not something we'd want to do on too regular a basis. | 536 | * obviously not something we'd want to do on too regular a basis. |
538 | * Any I/O we ignore at this time will be done via readpage later. | 537 | * Any I/O we ignore at this time will be done via readpage later. |
539 | * 2. We don't handle stuffed files here we let readpage do the honours. | 538 | * 2. We don't handle stuffed files here we let readpage do the honours. |
540 | * 3. mpage_readpages() does most of the heavy lifting in the common case. | 539 | * 3. mpage_readpages() does most of the heavy lifting in the common case. |
541 | * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places. | 540 | * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places. |
542 | */ | 541 | */ |
543 | 542 | ||
544 | static int gfs2_readpages(struct file *file, struct address_space *mapping, | 543 | static int gfs2_readpages(struct file *file, struct address_space *mapping, |
545 | struct list_head *pages, unsigned nr_pages) | 544 | struct list_head *pages, unsigned nr_pages) |
546 | { | 545 | { |
547 | struct inode *inode = mapping->host; | 546 | struct inode *inode = mapping->host; |
548 | struct gfs2_inode *ip = GFS2_I(inode); | 547 | struct gfs2_inode *ip = GFS2_I(inode); |
549 | struct gfs2_sbd *sdp = GFS2_SB(inode); | 548 | struct gfs2_sbd *sdp = GFS2_SB(inode); |
550 | struct gfs2_holder gh; | 549 | struct gfs2_holder gh; |
551 | int ret; | 550 | int ret; |
552 | 551 | ||
553 | gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); | 552 | gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); |
554 | ret = gfs2_glock_nq(&gh); | 553 | ret = gfs2_glock_nq(&gh); |
555 | if (unlikely(ret)) | 554 | if (unlikely(ret)) |
556 | goto out_uninit; | 555 | goto out_uninit; |
557 | if (!gfs2_is_stuffed(ip)) | 556 | if (!gfs2_is_stuffed(ip)) |
558 | ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map); | 557 | ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map); |
559 | gfs2_glock_dq(&gh); | 558 | gfs2_glock_dq(&gh); |
560 | out_uninit: | 559 | out_uninit: |
561 | gfs2_holder_uninit(&gh); | 560 | gfs2_holder_uninit(&gh); |
562 | if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) | 561 | if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) |
563 | ret = -EIO; | 562 | ret = -EIO; |
564 | return ret; | 563 | return ret; |
565 | } | 564 | } |
566 | 565 | ||
567 | /** | 566 | /** |
568 | * gfs2_write_begin - Begin to write to a file | 567 | * gfs2_write_begin - Begin to write to a file |
569 | * @file: The file to write to | 568 | * @file: The file to write to |
570 | * @mapping: The mapping in which to write | 569 | * @mapping: The mapping in which to write |
571 | * @pos: The file offset at which to start writing | 570 | * @pos: The file offset at which to start writing |
572 | * @len: Length of the write | 571 | * @len: Length of the write |
573 | * @flags: Various flags | 572 | * @flags: Various flags |
574 | * @pagep: Pointer to return the page | 573 | * @pagep: Pointer to return the page |
575 | * @fsdata: Pointer to return fs data (unused by GFS2) | 574 | * @fsdata: Pointer to return fs data (unused by GFS2) |
576 | * | 575 | * |
577 | * Returns: errno | 576 | * Returns: errno |
578 | */ | 577 | */ |
579 | 578 | ||
580 | static int gfs2_write_begin(struct file *file, struct address_space *mapping, | 579 | static int gfs2_write_begin(struct file *file, struct address_space *mapping, |
581 | loff_t pos, unsigned len, unsigned flags, | 580 | loff_t pos, unsigned len, unsigned flags, |
582 | struct page **pagep, void **fsdata) | 581 | struct page **pagep, void **fsdata) |
583 | { | 582 | { |
584 | struct gfs2_inode *ip = GFS2_I(mapping->host); | 583 | struct gfs2_inode *ip = GFS2_I(mapping->host); |
585 | struct gfs2_sbd *sdp = GFS2_SB(mapping->host); | 584 | struct gfs2_sbd *sdp = GFS2_SB(mapping->host); |
586 | struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); | 585 | struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); |
587 | unsigned int data_blocks = 0, ind_blocks = 0, rblocks; | 586 | unsigned int data_blocks = 0, ind_blocks = 0, rblocks; |
588 | unsigned requested = 0; | 587 | unsigned requested = 0; |
589 | int alloc_required; | 588 | int alloc_required; |
590 | int error = 0; | 589 | int error = 0; |
591 | pgoff_t index = pos >> PAGE_CACHE_SHIFT; | 590 | pgoff_t index = pos >> PAGE_CACHE_SHIFT; |
592 | unsigned from = pos & (PAGE_CACHE_SIZE - 1); | 591 | unsigned from = pos & (PAGE_CACHE_SIZE - 1); |
593 | struct page *page; | 592 | struct page *page; |
594 | 593 | ||
595 | gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh); | 594 | gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh); |
596 | error = gfs2_glock_nq(&ip->i_gh); | 595 | error = gfs2_glock_nq(&ip->i_gh); |
597 | if (unlikely(error)) | 596 | if (unlikely(error)) |
598 | goto out_uninit; | 597 | goto out_uninit; |
599 | if (&ip->i_inode == sdp->sd_rindex) { | 598 | if (&ip->i_inode == sdp->sd_rindex) { |
600 | error = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE, | 599 | error = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE, |
601 | GL_NOCACHE, &m_ip->i_gh); | 600 | GL_NOCACHE, &m_ip->i_gh); |
602 | if (unlikely(error)) { | 601 | if (unlikely(error)) { |
603 | gfs2_glock_dq(&ip->i_gh); | 602 | gfs2_glock_dq(&ip->i_gh); |
604 | goto out_uninit; | 603 | goto out_uninit; |
605 | } | 604 | } |
606 | } | 605 | } |
607 | 606 | ||
608 | alloc_required = gfs2_write_alloc_required(ip, pos, len); | 607 | alloc_required = gfs2_write_alloc_required(ip, pos, len); |
609 | 608 | ||
610 | if (alloc_required || gfs2_is_jdata(ip)) | 609 | if (alloc_required || gfs2_is_jdata(ip)) |
611 | gfs2_write_calc_reserv(ip, len, &data_blocks, &ind_blocks); | 610 | gfs2_write_calc_reserv(ip, len, &data_blocks, &ind_blocks); |
612 | 611 | ||
613 | if (alloc_required) { | 612 | if (alloc_required) { |
614 | error = gfs2_quota_lock_check(ip); | 613 | error = gfs2_quota_lock_check(ip); |
615 | if (error) | 614 | if (error) |
616 | goto out_unlock; | 615 | goto out_unlock; |
617 | 616 | ||
618 | requested = data_blocks + ind_blocks; | 617 | requested = data_blocks + ind_blocks; |
619 | error = gfs2_inplace_reserve(ip, requested, 0); | 618 | error = gfs2_inplace_reserve(ip, requested, 0); |
620 | if (error) | 619 | if (error) |
621 | goto out_qunlock; | 620 | goto out_qunlock; |
622 | } | 621 | } |
623 | 622 | ||
624 | rblocks = RES_DINODE + ind_blocks; | 623 | rblocks = RES_DINODE + ind_blocks; |
625 | if (gfs2_is_jdata(ip)) | 624 | if (gfs2_is_jdata(ip)) |
626 | rblocks += data_blocks ? data_blocks : 1; | 625 | rblocks += data_blocks ? data_blocks : 1; |
627 | if (ind_blocks || data_blocks) | 626 | if (ind_blocks || data_blocks) |
628 | rblocks += RES_STATFS + RES_QUOTA; | 627 | rblocks += RES_STATFS + RES_QUOTA; |
629 | if (&ip->i_inode == sdp->sd_rindex) | 628 | if (&ip->i_inode == sdp->sd_rindex) |
630 | rblocks += 2 * RES_STATFS; | 629 | rblocks += 2 * RES_STATFS; |
631 | if (alloc_required) | 630 | if (alloc_required) |
632 | rblocks += gfs2_rg_blocks(ip, requested); | 631 | rblocks += gfs2_rg_blocks(ip, requested); |
633 | 632 | ||
634 | error = gfs2_trans_begin(sdp, rblocks, | 633 | error = gfs2_trans_begin(sdp, rblocks, |
635 | PAGE_CACHE_SIZE/sdp->sd_sb.sb_bsize); | 634 | PAGE_CACHE_SIZE/sdp->sd_sb.sb_bsize); |
636 | if (error) | 635 | if (error) |
637 | goto out_trans_fail; | 636 | goto out_trans_fail; |
638 | 637 | ||
639 | error = -ENOMEM; | 638 | error = -ENOMEM; |
640 | flags |= AOP_FLAG_NOFS; | 639 | flags |= AOP_FLAG_NOFS; |
641 | page = grab_cache_page_write_begin(mapping, index, flags); | 640 | page = grab_cache_page_write_begin(mapping, index, flags); |
642 | *pagep = page; | 641 | *pagep = page; |
643 | if (unlikely(!page)) | 642 | if (unlikely(!page)) |
644 | goto out_endtrans; | 643 | goto out_endtrans; |
645 | 644 | ||
646 | if (gfs2_is_stuffed(ip)) { | 645 | if (gfs2_is_stuffed(ip)) { |
647 | error = 0; | 646 | error = 0; |
648 | if (pos + len > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) { | 647 | if (pos + len > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) { |
649 | error = gfs2_unstuff_dinode(ip, page); | 648 | error = gfs2_unstuff_dinode(ip, page); |
650 | if (error == 0) | 649 | if (error == 0) |
651 | goto prepare_write; | 650 | goto prepare_write; |
652 | } else if (!PageUptodate(page)) { | 651 | } else if (!PageUptodate(page)) { |
653 | error = stuffed_readpage(ip, page); | 652 | error = stuffed_readpage(ip, page); |
654 | } | 653 | } |
655 | goto out; | 654 | goto out; |
656 | } | 655 | } |
657 | 656 | ||
658 | prepare_write: | 657 | prepare_write: |
659 | error = __block_write_begin(page, from, len, gfs2_block_map); | 658 | error = __block_write_begin(page, from, len, gfs2_block_map); |
660 | out: | 659 | out: |
661 | if (error == 0) | 660 | if (error == 0) |
662 | return 0; | 661 | return 0; |
663 | 662 | ||
664 | unlock_page(page); | 663 | unlock_page(page); |
665 | page_cache_release(page); | 664 | page_cache_release(page); |
666 | 665 | ||
667 | gfs2_trans_end(sdp); | 666 | gfs2_trans_end(sdp); |
668 | if (pos + len > ip->i_inode.i_size) | 667 | if (pos + len > ip->i_inode.i_size) |
669 | gfs2_trim_blocks(&ip->i_inode); | 668 | gfs2_trim_blocks(&ip->i_inode); |
670 | goto out_trans_fail; | 669 | goto out_trans_fail; |
671 | 670 | ||
672 | out_endtrans: | 671 | out_endtrans: |
673 | gfs2_trans_end(sdp); | 672 | gfs2_trans_end(sdp); |
674 | out_trans_fail: | 673 | out_trans_fail: |
675 | if (alloc_required) { | 674 | if (alloc_required) { |
676 | gfs2_inplace_release(ip); | 675 | gfs2_inplace_release(ip); |
677 | out_qunlock: | 676 | out_qunlock: |
678 | gfs2_quota_unlock(ip); | 677 | gfs2_quota_unlock(ip); |
679 | } | 678 | } |
680 | out_unlock: | 679 | out_unlock: |
681 | if (&ip->i_inode == sdp->sd_rindex) { | 680 | if (&ip->i_inode == sdp->sd_rindex) { |
682 | gfs2_glock_dq(&m_ip->i_gh); | 681 | gfs2_glock_dq(&m_ip->i_gh); |
683 | gfs2_holder_uninit(&m_ip->i_gh); | 682 | gfs2_holder_uninit(&m_ip->i_gh); |
684 | } | 683 | } |
685 | gfs2_glock_dq(&ip->i_gh); | 684 | gfs2_glock_dq(&ip->i_gh); |
686 | out_uninit: | 685 | out_uninit: |
687 | gfs2_holder_uninit(&ip->i_gh); | 686 | gfs2_holder_uninit(&ip->i_gh); |
688 | return error; | 687 | return error; |
689 | } | 688 | } |
690 | 689 | ||
691 | /** | 690 | /** |
692 | * adjust_fs_space - Adjusts the free space available due to gfs2_grow | 691 | * adjust_fs_space - Adjusts the free space available due to gfs2_grow |
693 | * @inode: the rindex inode | 692 | * @inode: the rindex inode |
694 | */ | 693 | */ |
695 | static void adjust_fs_space(struct inode *inode) | 694 | static void adjust_fs_space(struct inode *inode) |
696 | { | 695 | { |
697 | struct gfs2_sbd *sdp = inode->i_sb->s_fs_info; | 696 | struct gfs2_sbd *sdp = inode->i_sb->s_fs_info; |
698 | struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); | 697 | struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); |
699 | struct gfs2_inode *l_ip = GFS2_I(sdp->sd_sc_inode); | 698 | struct gfs2_inode *l_ip = GFS2_I(sdp->sd_sc_inode); |
700 | struct gfs2_statfs_change_host *m_sc = &sdp->sd_statfs_master; | 699 | struct gfs2_statfs_change_host *m_sc = &sdp->sd_statfs_master; |
701 | struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local; | 700 | struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local; |
702 | struct buffer_head *m_bh, *l_bh; | 701 | struct buffer_head *m_bh, *l_bh; |
703 | u64 fs_total, new_free; | 702 | u64 fs_total, new_free; |
704 | 703 | ||
705 | /* Total up the file system space, according to the latest rindex. */ | 704 | /* Total up the file system space, according to the latest rindex. */ |
706 | fs_total = gfs2_ri_total(sdp); | 705 | fs_total = gfs2_ri_total(sdp); |
707 | if (gfs2_meta_inode_buffer(m_ip, &m_bh) != 0) | 706 | if (gfs2_meta_inode_buffer(m_ip, &m_bh) != 0) |
708 | return; | 707 | return; |
709 | 708 | ||
710 | spin_lock(&sdp->sd_statfs_spin); | 709 | spin_lock(&sdp->sd_statfs_spin); |
711 | gfs2_statfs_change_in(m_sc, m_bh->b_data + | 710 | gfs2_statfs_change_in(m_sc, m_bh->b_data + |
712 | sizeof(struct gfs2_dinode)); | 711 | sizeof(struct gfs2_dinode)); |
713 | if (fs_total > (m_sc->sc_total + l_sc->sc_total)) | 712 | if (fs_total > (m_sc->sc_total + l_sc->sc_total)) |
714 | new_free = fs_total - (m_sc->sc_total + l_sc->sc_total); | 713 | new_free = fs_total - (m_sc->sc_total + l_sc->sc_total); |
715 | else | 714 | else |
716 | new_free = 0; | 715 | new_free = 0; |
717 | spin_unlock(&sdp->sd_statfs_spin); | 716 | spin_unlock(&sdp->sd_statfs_spin); |
718 | fs_warn(sdp, "File system extended by %llu blocks.\n", | 717 | fs_warn(sdp, "File system extended by %llu blocks.\n", |
719 | (unsigned long long)new_free); | 718 | (unsigned long long)new_free); |
720 | gfs2_statfs_change(sdp, new_free, new_free, 0); | 719 | gfs2_statfs_change(sdp, new_free, new_free, 0); |
721 | 720 | ||
722 | if (gfs2_meta_inode_buffer(l_ip, &l_bh) != 0) | 721 | if (gfs2_meta_inode_buffer(l_ip, &l_bh) != 0) |
723 | goto out; | 722 | goto out; |
724 | update_statfs(sdp, m_bh, l_bh); | 723 | update_statfs(sdp, m_bh, l_bh); |
725 | brelse(l_bh); | 724 | brelse(l_bh); |
726 | out: | 725 | out: |
727 | brelse(m_bh); | 726 | brelse(m_bh); |
728 | } | 727 | } |
729 | 728 | ||
730 | /** | 729 | /** |
731 | * gfs2_stuffed_write_end - Write end for stuffed files | 730 | * gfs2_stuffed_write_end - Write end for stuffed files |
732 | * @inode: The inode | 731 | * @inode: The inode |
733 | * @dibh: The buffer_head containing the on-disk inode | 732 | * @dibh: The buffer_head containing the on-disk inode |
734 | * @pos: The file position | 733 | * @pos: The file position |
735 | * @len: The length of the write | 734 | * @len: The length of the write |
736 | * @copied: How much was actually copied by the VFS | 735 | * @copied: How much was actually copied by the VFS |
737 | * @page: The page | 736 | * @page: The page |
738 | * | 737 | * |
739 | * This copies the data from the page into the inode block after | 738 | * This copies the data from the page into the inode block after |
740 | * the inode data structure itself. | 739 | * the inode data structure itself. |
741 | * | 740 | * |
742 | * Returns: errno | 741 | * Returns: errno |
743 | */ | 742 | */ |
744 | static int gfs2_stuffed_write_end(struct inode *inode, struct buffer_head *dibh, | 743 | static int gfs2_stuffed_write_end(struct inode *inode, struct buffer_head *dibh, |
745 | loff_t pos, unsigned len, unsigned copied, | 744 | loff_t pos, unsigned len, unsigned copied, |
746 | struct page *page) | 745 | struct page *page) |
747 | { | 746 | { |
748 | struct gfs2_inode *ip = GFS2_I(inode); | 747 | struct gfs2_inode *ip = GFS2_I(inode); |
749 | struct gfs2_sbd *sdp = GFS2_SB(inode); | 748 | struct gfs2_sbd *sdp = GFS2_SB(inode); |
750 | struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); | 749 | struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); |
751 | u64 to = pos + copied; | 750 | u64 to = pos + copied; |
752 | void *kaddr; | 751 | void *kaddr; |
753 | unsigned char *buf = dibh->b_data + sizeof(struct gfs2_dinode); | 752 | unsigned char *buf = dibh->b_data + sizeof(struct gfs2_dinode); |
754 | 753 | ||
755 | BUG_ON((pos + len) > (dibh->b_size - sizeof(struct gfs2_dinode))); | 754 | BUG_ON((pos + len) > (dibh->b_size - sizeof(struct gfs2_dinode))); |
756 | kaddr = kmap_atomic(page); | 755 | kaddr = kmap_atomic(page); |
757 | memcpy(buf + pos, kaddr + pos, copied); | 756 | memcpy(buf + pos, kaddr + pos, copied); |
758 | memset(kaddr + pos + copied, 0, len - copied); | 757 | memset(kaddr + pos + copied, 0, len - copied); |
759 | flush_dcache_page(page); | 758 | flush_dcache_page(page); |
760 | kunmap_atomic(kaddr); | 759 | kunmap_atomic(kaddr); |
761 | 760 | ||
762 | if (!PageUptodate(page)) | 761 | if (!PageUptodate(page)) |
763 | SetPageUptodate(page); | 762 | SetPageUptodate(page); |
764 | unlock_page(page); | 763 | unlock_page(page); |
765 | page_cache_release(page); | 764 | page_cache_release(page); |
766 | 765 | ||
767 | if (copied) { | 766 | if (copied) { |
768 | if (inode->i_size < to) | 767 | if (inode->i_size < to) |
769 | i_size_write(inode, to); | 768 | i_size_write(inode, to); |
770 | mark_inode_dirty(inode); | 769 | mark_inode_dirty(inode); |
771 | } | 770 | } |
772 | 771 | ||
773 | if (inode == sdp->sd_rindex) { | 772 | if (inode == sdp->sd_rindex) { |
774 | adjust_fs_space(inode); | 773 | adjust_fs_space(inode); |
775 | sdp->sd_rindex_uptodate = 0; | 774 | sdp->sd_rindex_uptodate = 0; |
776 | } | 775 | } |
777 | 776 | ||
778 | brelse(dibh); | 777 | brelse(dibh); |
779 | gfs2_trans_end(sdp); | 778 | gfs2_trans_end(sdp); |
780 | if (inode == sdp->sd_rindex) { | 779 | if (inode == sdp->sd_rindex) { |
781 | gfs2_glock_dq(&m_ip->i_gh); | 780 | gfs2_glock_dq(&m_ip->i_gh); |
782 | gfs2_holder_uninit(&m_ip->i_gh); | 781 | gfs2_holder_uninit(&m_ip->i_gh); |
783 | } | 782 | } |
784 | gfs2_glock_dq(&ip->i_gh); | 783 | gfs2_glock_dq(&ip->i_gh); |
785 | gfs2_holder_uninit(&ip->i_gh); | 784 | gfs2_holder_uninit(&ip->i_gh); |
786 | return copied; | 785 | return copied; |
787 | } | 786 | } |
788 | 787 | ||
789 | /** | 788 | /** |
790 | * gfs2_write_end | 789 | * gfs2_write_end |
791 | * @file: The file to write to | 790 | * @file: The file to write to |
792 | * @mapping: The address space to write to | 791 | * @mapping: The address space to write to |
793 | * @pos: The file position | 792 | * @pos: The file position |
794 | * @len: The length of the data | 793 | * @len: The length of the data |
795 | * @copied: | 794 | * @copied: |
796 | * @page: The page that has been written | 795 | * @page: The page that has been written |
797 | * @fsdata: The fsdata (unused in GFS2) | 796 | * @fsdata: The fsdata (unused in GFS2) |
798 | * | 797 | * |
799 | * The main write_end function for GFS2. We have a separate one for | 798 | * The main write_end function for GFS2. We have a separate one for |
800 | * stuffed files as they are slightly different, otherwise we just | 799 | * stuffed files as they are slightly different, otherwise we just |
801 | * put our locking around the VFS provided functions. | 800 | * put our locking around the VFS provided functions. |
802 | * | 801 | * |
803 | * Returns: errno | 802 | * Returns: errno |
804 | */ | 803 | */ |
805 | 804 | ||
806 | static int gfs2_write_end(struct file *file, struct address_space *mapping, | 805 | static int gfs2_write_end(struct file *file, struct address_space *mapping, |
807 | loff_t pos, unsigned len, unsigned copied, | 806 | loff_t pos, unsigned len, unsigned copied, |
808 | struct page *page, void *fsdata) | 807 | struct page *page, void *fsdata) |
809 | { | 808 | { |
810 | struct inode *inode = page->mapping->host; | 809 | struct inode *inode = page->mapping->host; |
811 | struct gfs2_inode *ip = GFS2_I(inode); | 810 | struct gfs2_inode *ip = GFS2_I(inode); |
812 | struct gfs2_sbd *sdp = GFS2_SB(inode); | 811 | struct gfs2_sbd *sdp = GFS2_SB(inode); |
813 | struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); | 812 | struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); |
814 | struct buffer_head *dibh; | 813 | struct buffer_head *dibh; |
815 | unsigned int from = pos & (PAGE_CACHE_SIZE - 1); | 814 | unsigned int from = pos & (PAGE_CACHE_SIZE - 1); |
816 | unsigned int to = from + len; | 815 | unsigned int to = from + len; |
817 | int ret; | 816 | int ret; |
818 | struct gfs2_trans *tr = current->journal_info; | 817 | struct gfs2_trans *tr = current->journal_info; |
819 | BUG_ON(!tr); | 818 | BUG_ON(!tr); |
820 | 819 | ||
821 | BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL); | 820 | BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL); |
822 | 821 | ||
823 | ret = gfs2_meta_inode_buffer(ip, &dibh); | 822 | ret = gfs2_meta_inode_buffer(ip, &dibh); |
824 | if (unlikely(ret)) { | 823 | if (unlikely(ret)) { |
825 | unlock_page(page); | 824 | unlock_page(page); |
826 | page_cache_release(page); | 825 | page_cache_release(page); |
827 | goto failed; | 826 | goto failed; |
828 | } | 827 | } |
829 | 828 | ||
830 | if (gfs2_is_stuffed(ip)) | 829 | if (gfs2_is_stuffed(ip)) |
831 | return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page); | 830 | return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page); |
832 | 831 | ||
833 | if (!gfs2_is_writeback(ip)) | 832 | if (!gfs2_is_writeback(ip)) |
834 | gfs2_page_add_databufs(ip, page, from, to); | 833 | gfs2_page_add_databufs(ip, page, from, to); |
835 | 834 | ||
836 | ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata); | 835 | ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata); |
837 | if (tr->tr_num_buf_new) | 836 | if (tr->tr_num_buf_new) |
838 | __mark_inode_dirty(inode, I_DIRTY_DATASYNC); | 837 | __mark_inode_dirty(inode, I_DIRTY_DATASYNC); |
839 | else | 838 | else |
840 | gfs2_trans_add_meta(ip->i_gl, dibh); | 839 | gfs2_trans_add_meta(ip->i_gl, dibh); |
841 | 840 | ||
842 | 841 | ||
843 | if (inode == sdp->sd_rindex) { | 842 | if (inode == sdp->sd_rindex) { |
844 | adjust_fs_space(inode); | 843 | adjust_fs_space(inode); |
845 | sdp->sd_rindex_uptodate = 0; | 844 | sdp->sd_rindex_uptodate = 0; |
846 | } | 845 | } |
847 | 846 | ||
848 | brelse(dibh); | 847 | brelse(dibh); |
849 | failed: | 848 | failed: |
850 | gfs2_trans_end(sdp); | 849 | gfs2_trans_end(sdp); |
851 | gfs2_inplace_release(ip); | 850 | gfs2_inplace_release(ip); |
852 | if (ip->i_res->rs_qa_qd_num) | 851 | if (ip->i_res->rs_qa_qd_num) |
853 | gfs2_quota_unlock(ip); | 852 | gfs2_quota_unlock(ip); |
854 | if (inode == sdp->sd_rindex) { | 853 | if (inode == sdp->sd_rindex) { |
855 | gfs2_glock_dq(&m_ip->i_gh); | 854 | gfs2_glock_dq(&m_ip->i_gh); |
856 | gfs2_holder_uninit(&m_ip->i_gh); | 855 | gfs2_holder_uninit(&m_ip->i_gh); |
857 | } | 856 | } |
858 | gfs2_glock_dq(&ip->i_gh); | 857 | gfs2_glock_dq(&ip->i_gh); |
859 | gfs2_holder_uninit(&ip->i_gh); | 858 | gfs2_holder_uninit(&ip->i_gh); |
860 | return ret; | 859 | return ret; |
861 | } | 860 | } |
862 | 861 | ||
863 | /** | 862 | /** |
864 | * gfs2_set_page_dirty - Page dirtying function | 863 | * gfs2_set_page_dirty - Page dirtying function |
865 | * @page: The page to dirty | 864 | * @page: The page to dirty |
866 | * | 865 | * |
867 | * Returns: 1 if it dirtyed the page, or 0 otherwise | 866 | * Returns: 1 if it dirtyed the page, or 0 otherwise |
868 | */ | 867 | */ |
869 | 868 | ||
870 | static int gfs2_set_page_dirty(struct page *page) | 869 | static int gfs2_set_page_dirty(struct page *page) |
871 | { | 870 | { |
872 | SetPageChecked(page); | 871 | SetPageChecked(page); |
873 | return __set_page_dirty_buffers(page); | 872 | return __set_page_dirty_buffers(page); |
874 | } | 873 | } |
875 | 874 | ||
876 | /** | 875 | /** |
877 | * gfs2_bmap - Block map function | 876 | * gfs2_bmap - Block map function |
878 | * @mapping: Address space info | 877 | * @mapping: Address space info |
879 | * @lblock: The block to map | 878 | * @lblock: The block to map |
880 | * | 879 | * |
881 | * Returns: The disk address for the block or 0 on hole or error | 880 | * Returns: The disk address for the block or 0 on hole or error |
882 | */ | 881 | */ |
883 | 882 | ||
884 | static sector_t gfs2_bmap(struct address_space *mapping, sector_t lblock) | 883 | static sector_t gfs2_bmap(struct address_space *mapping, sector_t lblock) |
885 | { | 884 | { |
886 | struct gfs2_inode *ip = GFS2_I(mapping->host); | 885 | struct gfs2_inode *ip = GFS2_I(mapping->host); |
887 | struct gfs2_holder i_gh; | 886 | struct gfs2_holder i_gh; |
888 | sector_t dblock = 0; | 887 | sector_t dblock = 0; |
889 | int error; | 888 | int error; |
890 | 889 | ||
891 | error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &i_gh); | 890 | error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &i_gh); |
892 | if (error) | 891 | if (error) |
893 | return 0; | 892 | return 0; |
894 | 893 | ||
895 | if (!gfs2_is_stuffed(ip)) | 894 | if (!gfs2_is_stuffed(ip)) |
896 | dblock = generic_block_bmap(mapping, lblock, gfs2_block_map); | 895 | dblock = generic_block_bmap(mapping, lblock, gfs2_block_map); |
897 | 896 | ||
898 | gfs2_glock_dq_uninit(&i_gh); | 897 | gfs2_glock_dq_uninit(&i_gh); |
899 | 898 | ||
900 | return dblock; | 899 | return dblock; |
901 | } | 900 | } |
902 | 901 | ||
903 | static void gfs2_discard(struct gfs2_sbd *sdp, struct buffer_head *bh) | 902 | static void gfs2_discard(struct gfs2_sbd *sdp, struct buffer_head *bh) |
904 | { | 903 | { |
905 | struct gfs2_bufdata *bd; | 904 | struct gfs2_bufdata *bd; |
906 | 905 | ||
907 | lock_buffer(bh); | 906 | lock_buffer(bh); |
908 | gfs2_log_lock(sdp); | 907 | gfs2_log_lock(sdp); |
909 | clear_buffer_dirty(bh); | 908 | clear_buffer_dirty(bh); |
910 | bd = bh->b_private; | 909 | bd = bh->b_private; |
911 | if (bd) { | 910 | if (bd) { |
912 | if (!list_empty(&bd->bd_list) && !buffer_pinned(bh)) | 911 | if (!list_empty(&bd->bd_list) && !buffer_pinned(bh)) |
913 | list_del_init(&bd->bd_list); | 912 | list_del_init(&bd->bd_list); |
914 | else | 913 | else |
915 | gfs2_remove_from_journal(bh, current->journal_info, 0); | 914 | gfs2_remove_from_journal(bh, current->journal_info, 0); |
916 | } | 915 | } |
917 | bh->b_bdev = NULL; | 916 | bh->b_bdev = NULL; |
918 | clear_buffer_mapped(bh); | 917 | clear_buffer_mapped(bh); |
919 | clear_buffer_req(bh); | 918 | clear_buffer_req(bh); |
920 | clear_buffer_new(bh); | 919 | clear_buffer_new(bh); |
921 | gfs2_log_unlock(sdp); | 920 | gfs2_log_unlock(sdp); |
922 | unlock_buffer(bh); | 921 | unlock_buffer(bh); |
923 | } | 922 | } |
924 | 923 | ||
925 | static void gfs2_invalidatepage(struct page *page, unsigned int offset, | 924 | static void gfs2_invalidatepage(struct page *page, unsigned int offset, |
926 | unsigned int length) | 925 | unsigned int length) |
927 | { | 926 | { |
928 | struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host); | 927 | struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host); |
929 | unsigned int stop = offset + length; | 928 | unsigned int stop = offset + length; |
930 | int partial_page = (offset || length < PAGE_CACHE_SIZE); | 929 | int partial_page = (offset || length < PAGE_CACHE_SIZE); |
931 | struct buffer_head *bh, *head; | 930 | struct buffer_head *bh, *head; |
932 | unsigned long pos = 0; | 931 | unsigned long pos = 0; |
933 | 932 | ||
934 | BUG_ON(!PageLocked(page)); | 933 | BUG_ON(!PageLocked(page)); |
935 | if (!partial_page) | 934 | if (!partial_page) |
936 | ClearPageChecked(page); | 935 | ClearPageChecked(page); |
937 | if (!page_has_buffers(page)) | 936 | if (!page_has_buffers(page)) |
938 | goto out; | 937 | goto out; |
939 | 938 | ||
940 | bh = head = page_buffers(page); | 939 | bh = head = page_buffers(page); |
941 | do { | 940 | do { |
942 | if (pos + bh->b_size > stop) | 941 | if (pos + bh->b_size > stop) |
943 | return; | 942 | return; |
944 | 943 | ||
945 | if (offset <= pos) | 944 | if (offset <= pos) |
946 | gfs2_discard(sdp, bh); | 945 | gfs2_discard(sdp, bh); |
947 | pos += bh->b_size; | 946 | pos += bh->b_size; |
948 | bh = bh->b_this_page; | 947 | bh = bh->b_this_page; |
949 | } while (bh != head); | 948 | } while (bh != head); |
950 | out: | 949 | out: |
951 | if (!partial_page) | 950 | if (!partial_page) |
952 | try_to_release_page(page, 0); | 951 | try_to_release_page(page, 0); |
953 | } | 952 | } |
954 | 953 | ||
955 | /** | 954 | /** |
956 | * gfs2_ok_for_dio - check that dio is valid on this file | 955 | * gfs2_ok_for_dio - check that dio is valid on this file |
957 | * @ip: The inode | 956 | * @ip: The inode |
958 | * @rw: READ or WRITE | 957 | * @rw: READ or WRITE |
959 | * @offset: The offset at which we are reading or writing | 958 | * @offset: The offset at which we are reading or writing |
960 | * | 959 | * |
961 | * Returns: 0 (to ignore the i/o request and thus fall back to buffered i/o) | 960 | * Returns: 0 (to ignore the i/o request and thus fall back to buffered i/o) |
962 | * 1 (to accept the i/o request) | 961 | * 1 (to accept the i/o request) |
963 | */ | 962 | */ |
964 | static int gfs2_ok_for_dio(struct gfs2_inode *ip, int rw, loff_t offset) | 963 | static int gfs2_ok_for_dio(struct gfs2_inode *ip, int rw, loff_t offset) |
965 | { | 964 | { |
966 | /* | 965 | /* |
967 | * Should we return an error here? I can't see that O_DIRECT for | 966 | * Should we return an error here? I can't see that O_DIRECT for |
968 | * a stuffed file makes any sense. For now we'll silently fall | 967 | * a stuffed file makes any sense. For now we'll silently fall |
969 | * back to buffered I/O | 968 | * back to buffered I/O |
970 | */ | 969 | */ |
971 | if (gfs2_is_stuffed(ip)) | 970 | if (gfs2_is_stuffed(ip)) |
972 | return 0; | 971 | return 0; |
973 | 972 | ||
974 | if (offset >= i_size_read(&ip->i_inode)) | 973 | if (offset >= i_size_read(&ip->i_inode)) |
975 | return 0; | 974 | return 0; |
976 | return 1; | 975 | return 1; |
977 | } | 976 | } |
978 | 977 | ||
979 | 978 | ||
980 | 979 | ||
981 | static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb, | 980 | static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb, |
982 | const struct iovec *iov, loff_t offset, | 981 | const struct iovec *iov, loff_t offset, |
983 | unsigned long nr_segs) | 982 | unsigned long nr_segs) |
984 | { | 983 | { |
985 | struct file *file = iocb->ki_filp; | 984 | struct file *file = iocb->ki_filp; |
986 | struct inode *inode = file->f_mapping->host; | 985 | struct inode *inode = file->f_mapping->host; |
987 | struct address_space *mapping = inode->i_mapping; | 986 | struct address_space *mapping = inode->i_mapping; |
988 | struct gfs2_inode *ip = GFS2_I(inode); | 987 | struct gfs2_inode *ip = GFS2_I(inode); |
989 | struct gfs2_holder gh; | 988 | struct gfs2_holder gh; |
990 | int rv; | 989 | int rv; |
991 | 990 | ||
992 | /* | 991 | /* |
993 | * Deferred lock, even if its a write, since we do no allocation | 992 | * Deferred lock, even if its a write, since we do no allocation |
994 | * on this path. All we need change is atime, and this lock mode | 993 | * on this path. All we need change is atime, and this lock mode |
995 | * ensures that other nodes have flushed their buffered read caches | 994 | * ensures that other nodes have flushed their buffered read caches |
996 | * (i.e. their page cache entries for this inode). We do not, | 995 | * (i.e. their page cache entries for this inode). We do not, |
997 | * unfortunately have the option of only flushing a range like | 996 | * unfortunately have the option of only flushing a range like |
998 | * the VFS does. | 997 | * the VFS does. |
999 | */ | 998 | */ |
1000 | gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, &gh); | 999 | gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, &gh); |
1001 | rv = gfs2_glock_nq(&gh); | 1000 | rv = gfs2_glock_nq(&gh); |
1002 | if (rv) | 1001 | if (rv) |
1003 | return rv; | 1002 | return rv; |
1004 | rv = gfs2_ok_for_dio(ip, rw, offset); | 1003 | rv = gfs2_ok_for_dio(ip, rw, offset); |
1005 | if (rv != 1) | 1004 | if (rv != 1) |
1006 | goto out; /* dio not valid, fall back to buffered i/o */ | 1005 | goto out; /* dio not valid, fall back to buffered i/o */ |
1007 | 1006 | ||
1008 | /* | 1007 | /* |
1009 | * Now since we are holding a deferred (CW) lock at this point, you | 1008 | * Now since we are holding a deferred (CW) lock at this point, you |
1010 | * might be wondering why this is ever needed. There is a case however | 1009 | * might be wondering why this is ever needed. There is a case however |
1011 | * where we've granted a deferred local lock against a cached exclusive | 1010 | * where we've granted a deferred local lock against a cached exclusive |
1012 | * glock. That is ok provided all granted local locks are deferred, but | 1011 | * glock. That is ok provided all granted local locks are deferred, but |
1013 | * it also means that it is possible to encounter pages which are | 1012 | * it also means that it is possible to encounter pages which are |
1014 | * cached and possibly also mapped. So here we check for that and sort | 1013 | * cached and possibly also mapped. So here we check for that and sort |
1015 | * them out ahead of the dio. The glock state machine will take care of | 1014 | * them out ahead of the dio. The glock state machine will take care of |
1016 | * everything else. | 1015 | * everything else. |
1017 | * | 1016 | * |
1018 | * If in fact the cached glock state (gl->gl_state) is deferred (CW) in | 1017 | * If in fact the cached glock state (gl->gl_state) is deferred (CW) in |
1019 | * the first place, mapping->nr_pages will always be zero. | 1018 | * the first place, mapping->nr_pages will always be zero. |
1020 | */ | 1019 | */ |
1021 | if (mapping->nrpages) { | 1020 | if (mapping->nrpages) { |
1022 | loff_t lstart = offset & (PAGE_CACHE_SIZE - 1); | 1021 | loff_t lstart = offset & (PAGE_CACHE_SIZE - 1); |
1023 | loff_t len = iov_length(iov, nr_segs); | 1022 | loff_t len = iov_length(iov, nr_segs); |
1024 | loff_t end = PAGE_ALIGN(offset + len) - 1; | 1023 | loff_t end = PAGE_ALIGN(offset + len) - 1; |
1025 | 1024 | ||
1026 | rv = 0; | 1025 | rv = 0; |
1027 | if (len == 0) | 1026 | if (len == 0) |
1028 | goto out; | 1027 | goto out; |
1029 | if (test_and_clear_bit(GIF_SW_PAGED, &ip->i_flags)) | 1028 | if (test_and_clear_bit(GIF_SW_PAGED, &ip->i_flags)) |
1030 | unmap_shared_mapping_range(ip->i_inode.i_mapping, offset, len); | 1029 | unmap_shared_mapping_range(ip->i_inode.i_mapping, offset, len); |
1031 | rv = filemap_write_and_wait_range(mapping, lstart, end); | 1030 | rv = filemap_write_and_wait_range(mapping, lstart, end); |
1032 | if (rv) | 1031 | if (rv) |
1033 | return rv; | 1032 | return rv; |
1034 | truncate_inode_pages_range(mapping, lstart, end); | 1033 | truncate_inode_pages_range(mapping, lstart, end); |
1035 | } | 1034 | } |
1036 | 1035 | ||
1037 | rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, | 1036 | rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, |
1038 | offset, nr_segs, gfs2_get_block_direct, | 1037 | offset, nr_segs, gfs2_get_block_direct, |
1039 | NULL, NULL, 0); | 1038 | NULL, NULL, 0); |
1040 | out: | 1039 | out: |
1041 | gfs2_glock_dq(&gh); | 1040 | gfs2_glock_dq(&gh); |
1042 | gfs2_holder_uninit(&gh); | 1041 | gfs2_holder_uninit(&gh); |
1043 | return rv; | 1042 | return rv; |
1044 | } | 1043 | } |
1045 | 1044 | ||
1046 | /** | 1045 | /** |
1047 | * gfs2_releasepage - free the metadata associated with a page | 1046 | * gfs2_releasepage - free the metadata associated with a page |
1048 | * @page: the page that's being released | 1047 | * @page: the page that's being released |
1049 | * @gfp_mask: passed from Linux VFS, ignored by us | 1048 | * @gfp_mask: passed from Linux VFS, ignored by us |
1050 | * | 1049 | * |
1051 | * Call try_to_free_buffers() if the buffers in this page can be | 1050 | * Call try_to_free_buffers() if the buffers in this page can be |
1052 | * released. | 1051 | * released. |
1053 | * | 1052 | * |
1054 | * Returns: 0 | 1053 | * Returns: 0 |
1055 | */ | 1054 | */ |
1056 | 1055 | ||
1057 | int gfs2_releasepage(struct page *page, gfp_t gfp_mask) | 1056 | int gfs2_releasepage(struct page *page, gfp_t gfp_mask) |
1058 | { | 1057 | { |
1059 | struct address_space *mapping = page->mapping; | 1058 | struct address_space *mapping = page->mapping; |
1060 | struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping); | 1059 | struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping); |
1061 | struct buffer_head *bh, *head; | 1060 | struct buffer_head *bh, *head; |
1062 | struct gfs2_bufdata *bd; | 1061 | struct gfs2_bufdata *bd; |
1063 | 1062 | ||
1064 | if (!page_has_buffers(page)) | 1063 | if (!page_has_buffers(page)) |
1065 | return 0; | 1064 | return 0; |
1066 | 1065 | ||
1067 | gfs2_log_lock(sdp); | 1066 | gfs2_log_lock(sdp); |
1068 | spin_lock(&sdp->sd_ail_lock); | 1067 | spin_lock(&sdp->sd_ail_lock); |
1069 | head = bh = page_buffers(page); | 1068 | head = bh = page_buffers(page); |
1070 | do { | 1069 | do { |
1071 | if (atomic_read(&bh->b_count)) | 1070 | if (atomic_read(&bh->b_count)) |
1072 | goto cannot_release; | 1071 | goto cannot_release; |
1073 | bd = bh->b_private; | 1072 | bd = bh->b_private; |
1074 | if (bd && bd->bd_tr) | 1073 | if (bd && bd->bd_tr) |
1075 | goto cannot_release; | 1074 | goto cannot_release; |
1076 | if (buffer_pinned(bh) || buffer_dirty(bh)) | 1075 | if (buffer_pinned(bh) || buffer_dirty(bh)) |
1077 | goto not_possible; | 1076 | goto not_possible; |
1078 | bh = bh->b_this_page; | 1077 | bh = bh->b_this_page; |
1079 | } while(bh != head); | 1078 | } while(bh != head); |
1080 | spin_unlock(&sdp->sd_ail_lock); | 1079 | spin_unlock(&sdp->sd_ail_lock); |
1081 | gfs2_log_unlock(sdp); | 1080 | gfs2_log_unlock(sdp); |
1082 | 1081 | ||
1083 | head = bh = page_buffers(page); | 1082 | head = bh = page_buffers(page); |
1084 | do { | 1083 | do { |
1085 | gfs2_log_lock(sdp); | 1084 | gfs2_log_lock(sdp); |
1086 | bd = bh->b_private; | 1085 | bd = bh->b_private; |
1087 | if (bd) { | 1086 | if (bd) { |
1088 | gfs2_assert_warn(sdp, bd->bd_bh == bh); | 1087 | gfs2_assert_warn(sdp, bd->bd_bh == bh); |
1089 | if (!list_empty(&bd->bd_list)) { | 1088 | if (!list_empty(&bd->bd_list)) { |
1090 | if (!buffer_pinned(bh)) | 1089 | if (!buffer_pinned(bh)) |
1091 | list_del_init(&bd->bd_list); | 1090 | list_del_init(&bd->bd_list); |
1092 | else | 1091 | else |
1093 | bd = NULL; | 1092 | bd = NULL; |
1094 | } | 1093 | } |
1095 | if (bd) | 1094 | if (bd) |
1096 | bd->bd_bh = NULL; | 1095 | bd->bd_bh = NULL; |
1097 | bh->b_private = NULL; | 1096 | bh->b_private = NULL; |
1098 | } | 1097 | } |
1099 | gfs2_log_unlock(sdp); | 1098 | gfs2_log_unlock(sdp); |
1100 | if (bd) | 1099 | if (bd) |
1101 | kmem_cache_free(gfs2_bufdata_cachep, bd); | 1100 | kmem_cache_free(gfs2_bufdata_cachep, bd); |
1102 | 1101 | ||
1103 | bh = bh->b_this_page; | 1102 | bh = bh->b_this_page; |
1104 | } while (bh != head); | 1103 | } while (bh != head); |
1105 | 1104 | ||
1106 | return try_to_free_buffers(page); | 1105 | return try_to_free_buffers(page); |
1107 | 1106 | ||
1108 | not_possible: /* Should never happen */ | 1107 | not_possible: /* Should never happen */ |
1109 | WARN_ON(buffer_dirty(bh)); | 1108 | WARN_ON(buffer_dirty(bh)); |
1110 | WARN_ON(buffer_pinned(bh)); | 1109 | WARN_ON(buffer_pinned(bh)); |
1111 | cannot_release: | 1110 | cannot_release: |
1112 | spin_unlock(&sdp->sd_ail_lock); | 1111 | spin_unlock(&sdp->sd_ail_lock); |
1113 | gfs2_log_unlock(sdp); | 1112 | gfs2_log_unlock(sdp); |
1114 | return 0; | 1113 | return 0; |
1115 | } | 1114 | } |
1116 | 1115 | ||
1117 | static const struct address_space_operations gfs2_writeback_aops = { | 1116 | static const struct address_space_operations gfs2_writeback_aops = { |
1118 | .writepage = gfs2_writepage, | 1117 | .writepage = gfs2_writepage, |
1119 | .writepages = gfs2_writepages, | 1118 | .writepages = gfs2_writepages, |
1120 | .readpage = gfs2_readpage, | 1119 | .readpage = gfs2_readpage, |
1121 | .readpages = gfs2_readpages, | 1120 | .readpages = gfs2_readpages, |
1122 | .write_begin = gfs2_write_begin, | 1121 | .write_begin = gfs2_write_begin, |
1123 | .write_end = gfs2_write_end, | 1122 | .write_end = gfs2_write_end, |
1124 | .bmap = gfs2_bmap, | 1123 | .bmap = gfs2_bmap, |
1125 | .invalidatepage = gfs2_invalidatepage, | 1124 | .invalidatepage = gfs2_invalidatepage, |
1126 | .releasepage = gfs2_releasepage, | 1125 | .releasepage = gfs2_releasepage, |
1127 | .direct_IO = gfs2_direct_IO, | 1126 | .direct_IO = gfs2_direct_IO, |
1128 | .migratepage = buffer_migrate_page, | 1127 | .migratepage = buffer_migrate_page, |
1129 | .is_partially_uptodate = block_is_partially_uptodate, | 1128 | .is_partially_uptodate = block_is_partially_uptodate, |
1130 | .error_remove_page = generic_error_remove_page, | 1129 | .error_remove_page = generic_error_remove_page, |
1131 | }; | 1130 | }; |
1132 | 1131 | ||
1133 | static const struct address_space_operations gfs2_ordered_aops = { | 1132 | static const struct address_space_operations gfs2_ordered_aops = { |
1134 | .writepage = gfs2_writepage, | 1133 | .writepage = gfs2_writepage, |
1135 | .writepages = gfs2_writepages, | 1134 | .writepages = gfs2_writepages, |
1136 | .readpage = gfs2_readpage, | 1135 | .readpage = gfs2_readpage, |
1137 | .readpages = gfs2_readpages, | 1136 | .readpages = gfs2_readpages, |
1138 | .write_begin = gfs2_write_begin, | 1137 | .write_begin = gfs2_write_begin, |
1139 | .write_end = gfs2_write_end, | 1138 | .write_end = gfs2_write_end, |
1140 | .set_page_dirty = gfs2_set_page_dirty, | 1139 | .set_page_dirty = gfs2_set_page_dirty, |
1141 | .bmap = gfs2_bmap, | 1140 | .bmap = gfs2_bmap, |
1142 | .invalidatepage = gfs2_invalidatepage, | 1141 | .invalidatepage = gfs2_invalidatepage, |
1143 | .releasepage = gfs2_releasepage, | 1142 | .releasepage = gfs2_releasepage, |
1144 | .direct_IO = gfs2_direct_IO, | 1143 | .direct_IO = gfs2_direct_IO, |
1145 | .migratepage = buffer_migrate_page, | 1144 | .migratepage = buffer_migrate_page, |
1146 | .is_partially_uptodate = block_is_partially_uptodate, | 1145 | .is_partially_uptodate = block_is_partially_uptodate, |
1147 | .error_remove_page = generic_error_remove_page, | 1146 | .error_remove_page = generic_error_remove_page, |
1148 | }; | 1147 | }; |
1149 | 1148 | ||
1150 | static const struct address_space_operations gfs2_jdata_aops = { | 1149 | static const struct address_space_operations gfs2_jdata_aops = { |
1151 | .writepage = gfs2_jdata_writepage, | 1150 | .writepage = gfs2_jdata_writepage, |
1152 | .writepages = gfs2_jdata_writepages, | 1151 | .writepages = gfs2_jdata_writepages, |
1153 | .readpage = gfs2_readpage, | 1152 | .readpage = gfs2_readpage, |
1154 | .readpages = gfs2_readpages, | 1153 | .readpages = gfs2_readpages, |
1155 | .write_begin = gfs2_write_begin, | 1154 | .write_begin = gfs2_write_begin, |
1156 | .write_end = gfs2_write_end, | 1155 | .write_end = gfs2_write_end, |
1157 | .set_page_dirty = gfs2_set_page_dirty, | 1156 | .set_page_dirty = gfs2_set_page_dirty, |
1158 | .bmap = gfs2_bmap, | 1157 | .bmap = gfs2_bmap, |
1159 | .invalidatepage = gfs2_invalidatepage, | 1158 | .invalidatepage = gfs2_invalidatepage, |
1160 | .releasepage = gfs2_releasepage, | 1159 | .releasepage = gfs2_releasepage, |
1161 | .is_partially_uptodate = block_is_partially_uptodate, | 1160 | .is_partially_uptodate = block_is_partially_uptodate, |
1162 | .error_remove_page = generic_error_remove_page, | 1161 | .error_remove_page = generic_error_remove_page, |
1163 | }; | 1162 | }; |
1164 | 1163 | ||
1165 | void gfs2_set_aops(struct inode *inode) | 1164 | void gfs2_set_aops(struct inode *inode) |
1166 | { | 1165 | { |
1167 | struct gfs2_inode *ip = GFS2_I(inode); | 1166 | struct gfs2_inode *ip = GFS2_I(inode); |
1168 | 1167 | ||
1169 | if (gfs2_is_writeback(ip)) | 1168 | if (gfs2_is_writeback(ip)) |
1170 | inode->i_mapping->a_ops = &gfs2_writeback_aops; | 1169 | inode->i_mapping->a_ops = &gfs2_writeback_aops; |
1171 | else if (gfs2_is_ordered(ip)) | 1170 | else if (gfs2_is_ordered(ip)) |
1172 | inode->i_mapping->a_ops = &gfs2_ordered_aops; | 1171 | inode->i_mapping->a_ops = &gfs2_ordered_aops; |
1173 | else if (gfs2_is_jdata(ip)) | 1172 | else if (gfs2_is_jdata(ip)) |
1174 | inode->i_mapping->a_ops = &gfs2_jdata_aops; | 1173 | inode->i_mapping->a_ops = &gfs2_jdata_aops; |
1175 | else | 1174 | else |
1176 | BUG(); | 1175 | BUG(); |
1177 | } | 1176 | } |
1178 | 1177 | ||
1179 | 1178 | ||
fs/gfs2/meta_io.c
1 | /* | 1 | /* |
2 | * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved. | 2 | * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved. |
3 | * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved. | 3 | * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved. |
4 | * | 4 | * |
5 | * This copyrighted material is made available to anyone wishing to use, | 5 | * This copyrighted material is made available to anyone wishing to use, |
6 | * modify, copy, or redistribute it subject to the terms and conditions | 6 | * modify, copy, or redistribute it subject to the terms and conditions |
7 | * of the GNU General Public License version 2. | 7 | * of the GNU General Public License version 2. |
8 | */ | 8 | */ |
9 | 9 | ||
10 | #include <linux/sched.h> | 10 | #include <linux/sched.h> |
11 | #include <linux/slab.h> | 11 | #include <linux/slab.h> |
12 | #include <linux/spinlock.h> | 12 | #include <linux/spinlock.h> |
13 | #include <linux/completion.h> | 13 | #include <linux/completion.h> |
14 | #include <linux/buffer_head.h> | 14 | #include <linux/buffer_head.h> |
15 | #include <linux/mm.h> | 15 | #include <linux/mm.h> |
16 | #include <linux/pagemap.h> | 16 | #include <linux/pagemap.h> |
17 | #include <linux/writeback.h> | 17 | #include <linux/writeback.h> |
18 | #include <linux/swap.h> | 18 | #include <linux/swap.h> |
19 | #include <linux/delay.h> | 19 | #include <linux/delay.h> |
20 | #include <linux/bio.h> | 20 | #include <linux/bio.h> |
21 | #include <linux/gfs2_ondisk.h> | 21 | #include <linux/gfs2_ondisk.h> |
22 | 22 | ||
23 | #include "gfs2.h" | 23 | #include "gfs2.h" |
24 | #include "incore.h" | 24 | #include "incore.h" |
25 | #include "glock.h" | 25 | #include "glock.h" |
26 | #include "glops.h" | 26 | #include "glops.h" |
27 | #include "inode.h" | 27 | #include "inode.h" |
28 | #include "log.h" | 28 | #include "log.h" |
29 | #include "lops.h" | 29 | #include "lops.h" |
30 | #include "meta_io.h" | 30 | #include "meta_io.h" |
31 | #include "rgrp.h" | 31 | #include "rgrp.h" |
32 | #include "trans.h" | 32 | #include "trans.h" |
33 | #include "util.h" | 33 | #include "util.h" |
34 | #include "trace_gfs2.h" | 34 | #include "trace_gfs2.h" |
35 | 35 | ||
36 | static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wbc) | 36 | static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wbc) |
37 | { | 37 | { |
38 | struct buffer_head *bh, *head; | 38 | struct buffer_head *bh, *head; |
39 | int nr_underway = 0; | 39 | int nr_underway = 0; |
40 | int write_op = REQ_META | REQ_PRIO | | 40 | int write_op = REQ_META | REQ_PRIO | |
41 | (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); | 41 | (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); |
42 | 42 | ||
43 | BUG_ON(!PageLocked(page)); | 43 | BUG_ON(!PageLocked(page)); |
44 | BUG_ON(!page_has_buffers(page)); | 44 | BUG_ON(!page_has_buffers(page)); |
45 | 45 | ||
46 | head = page_buffers(page); | 46 | head = page_buffers(page); |
47 | bh = head; | 47 | bh = head; |
48 | 48 | ||
49 | do { | 49 | do { |
50 | if (!buffer_mapped(bh)) | 50 | if (!buffer_mapped(bh)) |
51 | continue; | 51 | continue; |
52 | /* | 52 | /* |
53 | * If it's a fully non-blocking write attempt and we cannot | 53 | * If it's a fully non-blocking write attempt and we cannot |
54 | * lock the buffer then redirty the page. Note that this can | 54 | * lock the buffer then redirty the page. Note that this can |
55 | * potentially cause a busy-wait loop from flusher thread and kswapd | 55 | * potentially cause a busy-wait loop from flusher thread and kswapd |
56 | * activity, but those code paths have their own higher-level | 56 | * activity, but those code paths have their own higher-level |
57 | * throttling. | 57 | * throttling. |
58 | */ | 58 | */ |
59 | if (wbc->sync_mode != WB_SYNC_NONE) { | 59 | if (wbc->sync_mode != WB_SYNC_NONE) { |
60 | lock_buffer(bh); | 60 | lock_buffer(bh); |
61 | } else if (!trylock_buffer(bh)) { | 61 | } else if (!trylock_buffer(bh)) { |
62 | redirty_page_for_writepage(wbc, page); | 62 | redirty_page_for_writepage(wbc, page); |
63 | continue; | 63 | continue; |
64 | } | 64 | } |
65 | if (test_clear_buffer_dirty(bh)) { | 65 | if (test_clear_buffer_dirty(bh)) { |
66 | mark_buffer_async_write(bh); | 66 | mark_buffer_async_write(bh); |
67 | } else { | 67 | } else { |
68 | unlock_buffer(bh); | 68 | unlock_buffer(bh); |
69 | } | 69 | } |
70 | } while ((bh = bh->b_this_page) != head); | 70 | } while ((bh = bh->b_this_page) != head); |
71 | 71 | ||
72 | /* | 72 | /* |
73 | * The page and its buffers are protected by PageWriteback(), so we can | 73 | * The page and its buffers are protected by PageWriteback(), so we can |
74 | * drop the bh refcounts early. | 74 | * drop the bh refcounts early. |
75 | */ | 75 | */ |
76 | BUG_ON(PageWriteback(page)); | 76 | BUG_ON(PageWriteback(page)); |
77 | set_page_writeback(page); | 77 | set_page_writeback(page); |
78 | 78 | ||
79 | do { | 79 | do { |
80 | struct buffer_head *next = bh->b_this_page; | 80 | struct buffer_head *next = bh->b_this_page; |
81 | if (buffer_async_write(bh)) { | 81 | if (buffer_async_write(bh)) { |
82 | submit_bh(write_op, bh); | 82 | submit_bh(write_op, bh); |
83 | nr_underway++; | 83 | nr_underway++; |
84 | } | 84 | } |
85 | bh = next; | 85 | bh = next; |
86 | } while (bh != head); | 86 | } while (bh != head); |
87 | unlock_page(page); | 87 | unlock_page(page); |
88 | 88 | ||
89 | if (nr_underway == 0) | 89 | if (nr_underway == 0) |
90 | end_page_writeback(page); | 90 | end_page_writeback(page); |
91 | 91 | ||
92 | return 0; | 92 | return 0; |
93 | } | 93 | } |
94 | 94 | ||
95 | const struct address_space_operations gfs2_meta_aops = { | 95 | const struct address_space_operations gfs2_meta_aops = { |
96 | .writepage = gfs2_aspace_writepage, | 96 | .writepage = gfs2_aspace_writepage, |
97 | .releasepage = gfs2_releasepage, | 97 | .releasepage = gfs2_releasepage, |
98 | }; | 98 | }; |
99 | 99 | ||
100 | /** | 100 | /** |
101 | * gfs2_getbuf - Get a buffer with a given address space | 101 | * gfs2_getbuf - Get a buffer with a given address space |
102 | * @gl: the glock | 102 | * @gl: the glock |
103 | * @blkno: the block number (filesystem scope) | 103 | * @blkno: the block number (filesystem scope) |
104 | * @create: 1 if the buffer should be created | 104 | * @create: 1 if the buffer should be created |
105 | * | 105 | * |
106 | * Returns: the buffer | 106 | * Returns: the buffer |
107 | */ | 107 | */ |
108 | 108 | ||
109 | struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create) | 109 | struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create) |
110 | { | 110 | { |
111 | struct address_space *mapping = gfs2_glock2aspace(gl); | 111 | struct address_space *mapping = gfs2_glock2aspace(gl); |
112 | struct gfs2_sbd *sdp = gl->gl_sbd; | 112 | struct gfs2_sbd *sdp = gl->gl_sbd; |
113 | struct page *page; | 113 | struct page *page; |
114 | struct buffer_head *bh; | 114 | struct buffer_head *bh; |
115 | unsigned int shift; | 115 | unsigned int shift; |
116 | unsigned long index; | 116 | unsigned long index; |
117 | unsigned int bufnum; | 117 | unsigned int bufnum; |
118 | 118 | ||
119 | shift = PAGE_CACHE_SHIFT - sdp->sd_sb.sb_bsize_shift; | 119 | shift = PAGE_CACHE_SHIFT - sdp->sd_sb.sb_bsize_shift; |
120 | index = blkno >> shift; /* convert block to page */ | 120 | index = blkno >> shift; /* convert block to page */ |
121 | bufnum = blkno - (index << shift); /* block buf index within page */ | 121 | bufnum = blkno - (index << shift); /* block buf index within page */ |
122 | 122 | ||
123 | if (create) { | 123 | if (create) { |
124 | for (;;) { | 124 | for (;;) { |
125 | page = grab_cache_page(mapping, index); | 125 | page = grab_cache_page(mapping, index); |
126 | if (page) | 126 | if (page) |
127 | break; | 127 | break; |
128 | yield(); | 128 | yield(); |
129 | } | 129 | } |
130 | } else { | 130 | } else { |
131 | page = find_lock_page(mapping, index); | 131 | page = find_get_page_flags(mapping, index, |
| | 132 | FGP_LOCK|FGP_ACCESSED); |
132 | if (!page) | 133 | if (!page) |
133 | return NULL; | 134 | return NULL; |
134 | } | 135 | } |
135 | 136 | ||
136 | if (!page_has_buffers(page)) | 137 | if (!page_has_buffers(page)) |
137 | create_empty_buffers(page, sdp->sd_sb.sb_bsize, 0); | 138 | create_empty_buffers(page, sdp->sd_sb.sb_bsize, 0); |
138 | 139 | ||
139 | /* Locate header for our buffer within our page */ | 140 | /* Locate header for our buffer within our page */ |
140 | for (bh = page_buffers(page); bufnum--; bh = bh->b_this_page) | 141 | for (bh = page_buffers(page); bufnum--; bh = bh->b_this_page) |
141 | /* Do nothing */; | 142 | /* Do nothing */; |
142 | get_bh(bh); | 143 | get_bh(bh); |
143 | 144 | ||
144 | if (!buffer_mapped(bh)) | 145 | if (!buffer_mapped(bh)) |
145 | map_bh(bh, sdp->sd_vfs, blkno); | 146 | map_bh(bh, sdp->sd_vfs, blkno); |
146 | 147 | ||
147 | unlock_page(page); | 148 | unlock_page(page); |
148 | mark_page_accessed(page); | ||
149 | page_cache_release(page); | 149 | page_cache_release(page); |
150 | 150 | ||
151 | return bh; | 151 | return bh; |
152 | } | 152 | } |
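The only change visible in the gfs2_getbuf() listing above is in its non-create path: the find_lock_page() call becomes find_get_page_flags(mapping, index, FGP_LOCK|FGP_ACCESSED), and the mark_page_accessed() call that used to follow unlock_page() is dropped, since the FGP_ACCESSED flag lets the lookup itself take care of marking the page accessed. A minimal before/after sketch of that caller pattern (illustrative only, with hypothetical helper names; it assumes the <linux/pagemap.h>/<linux/swap.h> API used by this file, and the caller still drops its page reference with page_cache_release()):

	#include <linux/pagemap.h>
	#include <linux/swap.h>

	/* Old pattern: lock the page, use it, then mark it accessed as a
	 * separate step after it has been unlocked. */
	static struct page *lookup_buf_old(struct address_space *mapping,
					   pgoff_t index)
	{
		struct page *page = find_lock_page(mapping, index);

		if (page) {
			/* ... use the locked page ... */
			unlock_page(page);
			mark_page_accessed(page);
		}
		return page;	/* caller does page_cache_release() */
	}

	/* New pattern: a single lookup that both locks the page and marks
	 * it accessed via FGP_LOCK|FGP_ACCESSED. */
	static struct page *lookup_buf_new(struct address_space *mapping,
					   pgoff_t index)
	{
		struct page *page = find_get_page_flags(mapping, index,
							FGP_LOCK | FGP_ACCESSED);

		if (page) {
			/* ... use the locked page ... */
			unlock_page(page);
		}
		return page;	/* caller does page_cache_release() */
	}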
153 | 153 | ||
154 | static void meta_prep_new(struct buffer_head *bh) | 154 | static void meta_prep_new(struct buffer_head *bh) |
155 | { | 155 | { |
156 | struct gfs2_meta_header *mh = (struct gfs2_meta_header *)bh->b_data; | 156 | struct gfs2_meta_header *mh = (struct gfs2_meta_header *)bh->b_data; |
157 | 157 | ||
158 | lock_buffer(bh); | 158 | lock_buffer(bh); |
159 | clear_buffer_dirty(bh); | 159 | clear_buffer_dirty(bh); |
160 | set_buffer_uptodate(bh); | 160 | set_buffer_uptodate(bh); |
161 | unlock_buffer(bh); | 161 | unlock_buffer(bh); |
162 | 162 | ||
163 | mh->mh_magic = cpu_to_be32(GFS2_MAGIC); | 163 | mh->mh_magic = cpu_to_be32(GFS2_MAGIC); |
164 | } | 164 | } |
165 | 165 | ||
166 | /** | 166 | /** |
167 | * gfs2_meta_new - Get a block | 167 | * gfs2_meta_new - Get a block |
168 | * @gl: The glock associated with this block | 168 | * @gl: The glock associated with this block |
169 | * @blkno: The block number | 169 | * @blkno: The block number |
170 | * | 170 | * |
171 | * Returns: The buffer | 171 | * Returns: The buffer |
172 | */ | 172 | */ |
173 | 173 | ||
174 | struct buffer_head *gfs2_meta_new(struct gfs2_glock *gl, u64 blkno) | 174 | struct buffer_head *gfs2_meta_new(struct gfs2_glock *gl, u64 blkno) |
175 | { | 175 | { |
176 | struct buffer_head *bh; | 176 | struct buffer_head *bh; |
177 | bh = gfs2_getbuf(gl, blkno, CREATE); | 177 | bh = gfs2_getbuf(gl, blkno, CREATE); |
178 | meta_prep_new(bh); | 178 | meta_prep_new(bh); |
179 | return bh; | 179 | return bh; |
180 | } | 180 | } |
181 | 181 | ||
182 | /** | 182 | /** |
183 | * gfs2_meta_read - Read a block from disk | 183 | * gfs2_meta_read - Read a block from disk |
184 | * @gl: The glock covering the block | 184 | * @gl: The glock covering the block |
185 | * @blkno: The block number | 185 | * @blkno: The block number |
186 | * @flags: flags | 186 | * @flags: flags |
187 | * @bhp: the place where the buffer is returned (NULL on failure) | 187 | * @bhp: the place where the buffer is returned (NULL on failure) |
188 | * | 188 | * |
189 | * Returns: errno | 189 | * Returns: errno |
190 | */ | 190 | */ |
191 | 191 | ||
192 | int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags, | 192 | int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags, |
193 | struct buffer_head **bhp) | 193 | struct buffer_head **bhp) |
194 | { | 194 | { |
195 | struct gfs2_sbd *sdp = gl->gl_sbd; | 195 | struct gfs2_sbd *sdp = gl->gl_sbd; |
196 | struct buffer_head *bh; | 196 | struct buffer_head *bh; |
197 | 197 | ||
198 | if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) { | 198 | if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) { |
199 | *bhp = NULL; | 199 | *bhp = NULL; |
200 | return -EIO; | 200 | return -EIO; |
201 | } | 201 | } |
202 | 202 | ||
203 | *bhp = bh = gfs2_getbuf(gl, blkno, CREATE); | 203 | *bhp = bh = gfs2_getbuf(gl, blkno, CREATE); |
204 | 204 | ||
205 | lock_buffer(bh); | 205 | lock_buffer(bh); |
206 | if (buffer_uptodate(bh)) { | 206 | if (buffer_uptodate(bh)) { |
207 | unlock_buffer(bh); | 207 | unlock_buffer(bh); |
208 | return 0; | 208 | return 0; |
209 | } | 209 | } |
210 | bh->b_end_io = end_buffer_read_sync; | 210 | bh->b_end_io = end_buffer_read_sync; |
211 | get_bh(bh); | 211 | get_bh(bh); |
212 | submit_bh(READ_SYNC | REQ_META | REQ_PRIO, bh); | 212 | submit_bh(READ_SYNC | REQ_META | REQ_PRIO, bh); |
213 | if (!(flags & DIO_WAIT)) | 213 | if (!(flags & DIO_WAIT)) |
214 | return 0; | 214 | return 0; |
215 | 215 | ||
216 | wait_on_buffer(bh); | 216 | wait_on_buffer(bh); |
217 | if (unlikely(!buffer_uptodate(bh))) { | 217 | if (unlikely(!buffer_uptodate(bh))) { |
218 | struct gfs2_trans *tr = current->journal_info; | 218 | struct gfs2_trans *tr = current->journal_info; |
219 | if (tr && tr->tr_touched) | 219 | if (tr && tr->tr_touched) |
220 | gfs2_io_error_bh(sdp, bh); | 220 | gfs2_io_error_bh(sdp, bh); |
221 | brelse(bh); | 221 | brelse(bh); |
222 | *bhp = NULL; | 222 | *bhp = NULL; |
223 | return -EIO; | 223 | return -EIO; |
224 | } | 224 | } |
225 | 225 | ||
226 | return 0; | 226 | return 0; |
227 | } | 227 | } |
228 | 228 | ||
229 | /** | 229 | /** |
230 | * gfs2_meta_wait - Reread a block from disk | 230 | * gfs2_meta_wait - Reread a block from disk |
231 | * @sdp: the filesystem | 231 | * @sdp: the filesystem |
232 | * @bh: The block to wait for | 232 | * @bh: The block to wait for |
233 | * | 233 | * |
234 | * Returns: errno | 234 | * Returns: errno |
235 | */ | 235 | */ |
236 | 236 | ||
237 | int gfs2_meta_wait(struct gfs2_sbd *sdp, struct buffer_head *bh) | 237 | int gfs2_meta_wait(struct gfs2_sbd *sdp, struct buffer_head *bh) |
238 | { | 238 | { |
239 | if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) | 239 | if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) |
240 | return -EIO; | 240 | return -EIO; |
241 | 241 | ||
242 | wait_on_buffer(bh); | 242 | wait_on_buffer(bh); |
243 | 243 | ||
244 | if (!buffer_uptodate(bh)) { | 244 | if (!buffer_uptodate(bh)) { |
245 | struct gfs2_trans *tr = current->journal_info; | 245 | struct gfs2_trans *tr = current->journal_info; |
246 | if (tr && tr->tr_touched) | 246 | if (tr && tr->tr_touched) |
247 | gfs2_io_error_bh(sdp, bh); | 247 | gfs2_io_error_bh(sdp, bh); |
248 | return -EIO; | 248 | return -EIO; |
249 | } | 249 | } |
250 | if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) | 250 | if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) |
251 | return -EIO; | 251 | return -EIO; |
252 | 252 | ||
253 | return 0; | 253 | return 0; |
254 | } | 254 | } |
255 | 255 | ||
256 | void gfs2_remove_from_journal(struct buffer_head *bh, struct gfs2_trans *tr, int meta) | 256 | void gfs2_remove_from_journal(struct buffer_head *bh, struct gfs2_trans *tr, int meta) |
257 | { | 257 | { |
258 | struct address_space *mapping = bh->b_page->mapping; | 258 | struct address_space *mapping = bh->b_page->mapping; |
259 | struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping); | 259 | struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping); |
260 | struct gfs2_bufdata *bd = bh->b_private; | 260 | struct gfs2_bufdata *bd = bh->b_private; |
261 | int was_pinned = 0; | 261 | int was_pinned = 0; |
262 | 262 | ||
263 | if (test_clear_buffer_pinned(bh)) { | 263 | if (test_clear_buffer_pinned(bh)) { |
264 | trace_gfs2_pin(bd, 0); | 264 | trace_gfs2_pin(bd, 0); |
265 | atomic_dec(&sdp->sd_log_pinned); | 265 | atomic_dec(&sdp->sd_log_pinned); |
266 | list_del_init(&bd->bd_list); | 266 | list_del_init(&bd->bd_list); |
267 | if (meta) { | 267 | if (meta) { |
268 | gfs2_assert_warn(sdp, sdp->sd_log_num_buf); | 268 | gfs2_assert_warn(sdp, sdp->sd_log_num_buf); |
269 | sdp->sd_log_num_buf--; | 269 | sdp->sd_log_num_buf--; |
270 | tr->tr_num_buf_rm++; | 270 | tr->tr_num_buf_rm++; |
271 | } else { | 271 | } else { |
272 | gfs2_assert_warn(sdp, sdp->sd_log_num_databuf); | 272 | gfs2_assert_warn(sdp, sdp->sd_log_num_databuf); |
273 | sdp->sd_log_num_databuf--; | 273 | sdp->sd_log_num_databuf--; |
274 | tr->tr_num_databuf_rm++; | 274 | tr->tr_num_databuf_rm++; |
275 | } | 275 | } |
276 | tr->tr_touched = 1; | 276 | tr->tr_touched = 1; |
277 | was_pinned = 1; | 277 | was_pinned = 1; |
278 | brelse(bh); | 278 | brelse(bh); |
279 | } | 279 | } |
280 | if (bd) { | 280 | if (bd) { |
281 | spin_lock(&sdp->sd_ail_lock); | 281 | spin_lock(&sdp->sd_ail_lock); |
282 | if (bd->bd_tr) { | 282 | if (bd->bd_tr) { |
283 | gfs2_trans_add_revoke(sdp, bd); | 283 | gfs2_trans_add_revoke(sdp, bd); |
284 | } else if (was_pinned) { | 284 | } else if (was_pinned) { |
285 | bh->b_private = NULL; | 285 | bh->b_private = NULL; |
286 | kmem_cache_free(gfs2_bufdata_cachep, bd); | 286 | kmem_cache_free(gfs2_bufdata_cachep, bd); |
287 | } | 287 | } |
288 | spin_unlock(&sdp->sd_ail_lock); | 288 | spin_unlock(&sdp->sd_ail_lock); |
289 | } | 289 | } |
290 | clear_buffer_dirty(bh); | 290 | clear_buffer_dirty(bh); |
291 | clear_buffer_uptodate(bh); | 291 | clear_buffer_uptodate(bh); |
292 | } | 292 | } |
293 | 293 | ||
294 | /** | 294 | /** |
295 | * gfs2_meta_wipe - ensure an inode's buffers aren't dirty/pinned anymore | 295 | * gfs2_meta_wipe - ensure an inode's buffers aren't dirty/pinned anymore |

296 | * @ip: the inode who owns the buffers | 296 | * @ip: the inode who owns the buffers |
297 | * @bstart: the first buffer in the run | 297 | * @bstart: the first buffer in the run |
298 | * @blen: the number of buffers in the run | 298 | * @blen: the number of buffers in the run |
299 | * | 299 | * |
300 | */ | 300 | */ |
301 | 301 | ||
302 | void gfs2_meta_wipe(struct gfs2_inode *ip, u64 bstart, u32 blen) | 302 | void gfs2_meta_wipe(struct gfs2_inode *ip, u64 bstart, u32 blen) |
303 | { | 303 | { |
304 | struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode); | 304 | struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode); |
305 | struct buffer_head *bh; | 305 | struct buffer_head *bh; |
306 | 306 | ||
307 | while (blen) { | 307 | while (blen) { |
308 | bh = gfs2_getbuf(ip->i_gl, bstart, NO_CREATE); | 308 | bh = gfs2_getbuf(ip->i_gl, bstart, NO_CREATE); |
309 | if (bh) { | 309 | if (bh) { |
310 | lock_buffer(bh); | 310 | lock_buffer(bh); |
311 | gfs2_log_lock(sdp); | 311 | gfs2_log_lock(sdp); |
312 | gfs2_remove_from_journal(bh, current->journal_info, 1); | 312 | gfs2_remove_from_journal(bh, current->journal_info, 1); |
313 | gfs2_log_unlock(sdp); | 313 | gfs2_log_unlock(sdp); |
314 | unlock_buffer(bh); | 314 | unlock_buffer(bh); |
315 | brelse(bh); | 315 | brelse(bh); |
316 | } | 316 | } |
317 | 317 | ||
318 | bstart++; | 318 | bstart++; |
319 | blen--; | 319 | blen--; |
320 | } | 320 | } |
321 | } | 321 | } |
322 | 322 | ||
323 | /** | 323 | /** |
324 | * gfs2_meta_indirect_buffer - Get a metadata buffer | 324 | * gfs2_meta_indirect_buffer - Get a metadata buffer |
325 | * @ip: The GFS2 inode | 325 | * @ip: The GFS2 inode |
326 | * @height: The level of this buf in the metadata (indir addr) tree (if any) | 326 | * @height: The level of this buf in the metadata (indir addr) tree (if any) |
327 | * @num: The block number (device relative) of the buffer | 327 | * @num: The block number (device relative) of the buffer |
328 | * @bhp: the buffer is returned here | 328 | * @bhp: the buffer is returned here |
329 | * | 329 | * |
330 | * Returns: errno | 330 | * Returns: errno |
331 | */ | 331 | */ |
332 | 332 | ||
333 | int gfs2_meta_indirect_buffer(struct gfs2_inode *ip, int height, u64 num, | 333 | int gfs2_meta_indirect_buffer(struct gfs2_inode *ip, int height, u64 num, |
334 | struct buffer_head **bhp) | 334 | struct buffer_head **bhp) |
335 | { | 335 | { |
336 | struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode); | 336 | struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode); |
337 | struct gfs2_glock *gl = ip->i_gl; | 337 | struct gfs2_glock *gl = ip->i_gl; |
338 | struct buffer_head *bh; | 338 | struct buffer_head *bh; |
339 | int ret = 0; | 339 | int ret = 0; |
340 | u32 mtype = height ? GFS2_METATYPE_IN : GFS2_METATYPE_DI; | 340 | u32 mtype = height ? GFS2_METATYPE_IN : GFS2_METATYPE_DI; |
341 | 341 | ||
342 | ret = gfs2_meta_read(gl, num, DIO_WAIT, &bh); | 342 | ret = gfs2_meta_read(gl, num, DIO_WAIT, &bh); |
343 | if (ret == 0 && gfs2_metatype_check(sdp, bh, mtype)) { | 343 | if (ret == 0 && gfs2_metatype_check(sdp, bh, mtype)) { |
344 | brelse(bh); | 344 | brelse(bh); |
345 | ret = -EIO; | 345 | ret = -EIO; |
346 | } | 346 | } |
347 | *bhp = bh; | 347 | *bhp = bh; |
348 | return ret; | 348 | return ret; |
349 | } | 349 | } |
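A brief sketch, assuming ip, height and block are supplied by the caller, of reading a dinode or indirect block through this helper:

	struct buffer_head *bh;
	int error;

	/* height == 0 expects a dinode block, height > 0 an indirect block. */
	error = gfs2_meta_indirect_buffer(ip, height, block, &bh);
	if (error)
		return error;
	/* ... parse the metadata in bh->b_data ... */
	brelse(bh);
	return 0;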
350 | 350 | ||
351 | /** | 351 | /** |
352 | * gfs2_meta_ra - start readahead on an extent of a file | 352 | * gfs2_meta_ra - start readahead on an extent of a file |
353 | * @gl: the glock the blocks belong to | 353 | * @gl: the glock the blocks belong to |
354 | * @dblock: the starting disk block | 354 | * @dblock: the starting disk block |
355 | * @extlen: the number of blocks in the extent | 355 | * @extlen: the number of blocks in the extent |
356 | * | 356 | * |
357 | * returns: the first buffer in the extent | 357 | * returns: the first buffer in the extent |
358 | */ | 358 | */ |
359 | 359 | ||
360 | struct buffer_head *gfs2_meta_ra(struct gfs2_glock *gl, u64 dblock, u32 extlen) | 360 | struct buffer_head *gfs2_meta_ra(struct gfs2_glock *gl, u64 dblock, u32 extlen) |
361 | { | 361 | { |
362 | struct gfs2_sbd *sdp = gl->gl_sbd; | 362 | struct gfs2_sbd *sdp = gl->gl_sbd; |
363 | struct buffer_head *first_bh, *bh; | 363 | struct buffer_head *first_bh, *bh; |
364 | u32 max_ra = gfs2_tune_get(sdp, gt_max_readahead) >> | 364 | u32 max_ra = gfs2_tune_get(sdp, gt_max_readahead) >> |
365 | sdp->sd_sb.sb_bsize_shift; | 365 | sdp->sd_sb.sb_bsize_shift; |
366 | 366 | ||
367 | BUG_ON(!extlen); | 367 | BUG_ON(!extlen); |
368 | 368 | ||
369 | if (max_ra < 1) | 369 | if (max_ra < 1) |
370 | max_ra = 1; | 370 | max_ra = 1; |
371 | if (extlen > max_ra) | 371 | if (extlen > max_ra) |
372 | extlen = max_ra; | 372 | extlen = max_ra; |
373 | 373 | ||
374 | first_bh = gfs2_getbuf(gl, dblock, CREATE); | 374 | first_bh = gfs2_getbuf(gl, dblock, CREATE); |
375 | 375 | ||
376 | if (buffer_uptodate(first_bh)) | 376 | if (buffer_uptodate(first_bh)) |
377 | goto out; | 377 | goto out; |
378 | if (!buffer_locked(first_bh)) | 378 | if (!buffer_locked(first_bh)) |
379 | ll_rw_block(READ_SYNC | REQ_META, 1, &first_bh); | 379 | ll_rw_block(READ_SYNC | REQ_META, 1, &first_bh); |
380 | 380 | ||
381 | dblock++; | 381 | dblock++; |
382 | extlen--; | 382 | extlen--; |
383 | 383 | ||
384 | while (extlen) { | 384 | while (extlen) { |
385 | bh = gfs2_getbuf(gl, dblock, CREATE); | 385 | bh = gfs2_getbuf(gl, dblock, CREATE); |
386 | 386 | ||
387 | if (!buffer_uptodate(bh) && !buffer_locked(bh)) | 387 | if (!buffer_uptodate(bh) && !buffer_locked(bh)) |
388 | ll_rw_block(READA | REQ_META, 1, &bh); | 388 | ll_rw_block(READA | REQ_META, 1, &bh); |
389 | brelse(bh); | 389 | brelse(bh); |
390 | dblock++; | 390 | dblock++; |
391 | extlen--; | 391 | extlen--; |
392 | if (!buffer_locked(first_bh) && buffer_uptodate(first_bh)) | 392 | if (!buffer_locked(first_bh) && buffer_uptodate(first_bh)) |
393 | goto out; | 393 | goto out; |
394 | } | 394 | } |
395 | 395 | ||
396 | wait_on_buffer(first_bh); | 396 | wait_on_buffer(first_bh); |
397 | out: | 397 | out: |
398 | return first_bh; | 398 | return first_bh; |
399 | } | 399 | } |
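A hedged usage sketch (the extent bounds and the error handling are assumptions, not taken from a real call site): only the first buffer is waited on, so the caller holds one reference and must release it:

	struct buffer_head *bh;

	/* Start readahead for the extent; bh is the first block, already waited on. */
	bh = gfs2_meta_ra(gl, dblock, extlen);
	if (!buffer_uptodate(bh)) {
		brelse(bh);
		return -EIO;
	}
	/* ... consume bh->b_data ... */
	brelse(bh);
	return 0;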
400 | 400 |
fs/ntfs/attrib.c
1 | /** | 1 | /** |
2 | * attrib.c - NTFS attribute operations. Part of the Linux-NTFS project. | 2 | * attrib.c - NTFS attribute operations. Part of the Linux-NTFS project. |
3 | * | 3 | * |
4 | * Copyright (c) 2001-2012 Anton Altaparmakov and Tuxera Inc. | 4 | * Copyright (c) 2001-2012 Anton Altaparmakov and Tuxera Inc. |
5 | * Copyright (c) 2002 Richard Russon | 5 | * Copyright (c) 2002 Richard Russon |
6 | * | 6 | * |
7 | * This program/include file is free software; you can redistribute it and/or | 7 | * This program/include file is free software; you can redistribute it and/or |
8 | * modify it under the terms of the GNU General Public License as published | 8 | * modify it under the terms of the GNU General Public License as published |
9 | * by the Free Software Foundation; either version 2 of the License, or | 9 | * by the Free Software Foundation; either version 2 of the License, or |
10 | * (at your option) any later version. | 10 | * (at your option) any later version. |
11 | * | 11 | * |
12 | * This program/include file is distributed in the hope that it will be | 12 | * This program/include file is distributed in the hope that it will be |
13 | * useful, but WITHOUT ANY WARRANTY; without even the implied warranty | 13 | * useful, but WITHOUT ANY WARRANTY; without even the implied warranty |
14 | * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | 14 | * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
15 | * GNU General Public License for more details. | 15 | * GNU General Public License for more details. |
16 | * | 16 | * |
17 | * You should have received a copy of the GNU General Public License | 17 | * You should have received a copy of the GNU General Public License |
18 | * along with this program (in the main directory of the Linux-NTFS | 18 | * along with this program (in the main directory of the Linux-NTFS |
19 | * distribution in the file COPYING); if not, write to the Free Software | 19 | * distribution in the file COPYING); if not, write to the Free Software |
20 | * Foundation,Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA | 20 | * Foundation,Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA |
21 | */ | 21 | */ |
22 | 22 | ||
23 | #include <linux/buffer_head.h> | 23 | #include <linux/buffer_head.h> |
24 | #include <linux/sched.h> | 24 | #include <linux/sched.h> |
25 | #include <linux/slab.h> | 25 | #include <linux/slab.h> |
26 | #include <linux/swap.h> | 26 | #include <linux/swap.h> |
27 | #include <linux/writeback.h> | 27 | #include <linux/writeback.h> |
28 | 28 | ||
29 | #include "attrib.h" | 29 | #include "attrib.h" |
30 | #include "debug.h" | 30 | #include "debug.h" |
31 | #include "layout.h" | 31 | #include "layout.h" |
32 | #include "lcnalloc.h" | 32 | #include "lcnalloc.h" |
33 | #include "malloc.h" | 33 | #include "malloc.h" |
34 | #include "mft.h" | 34 | #include "mft.h" |
35 | #include "ntfs.h" | 35 | #include "ntfs.h" |
36 | #include "types.h" | 36 | #include "types.h" |
37 | 37 | ||
38 | /** | 38 | /** |
39 | * ntfs_map_runlist_nolock - map (a part of) a runlist of an ntfs inode | 39 | * ntfs_map_runlist_nolock - map (a part of) a runlist of an ntfs inode |
40 | * @ni: ntfs inode for which to map (part of) a runlist | 40 | * @ni: ntfs inode for which to map (part of) a runlist |
41 | * @vcn: map runlist part containing this vcn | 41 | * @vcn: map runlist part containing this vcn |
42 | * @ctx: active attribute search context if present or NULL if not | 42 | * @ctx: active attribute search context if present or NULL if not |
43 | * | 43 | * |
44 | * Map the part of a runlist containing the @vcn of the ntfs inode @ni. | 44 | * Map the part of a runlist containing the @vcn of the ntfs inode @ni. |
45 | * | 45 | * |
46 | * If @ctx is specified, it is an active search context of @ni and its base mft | 46 | * If @ctx is specified, it is an active search context of @ni and its base mft |
47 | * record. This is needed when ntfs_map_runlist_nolock() encounters unmapped | 47 | * record. This is needed when ntfs_map_runlist_nolock() encounters unmapped |
48 | * runlist fragments and allows their mapping. If you do not have the mft | 48 | * runlist fragments and allows their mapping. If you do not have the mft |
49 | * record mapped, you can specify @ctx as NULL and ntfs_map_runlist_nolock() | 49 | * record mapped, you can specify @ctx as NULL and ntfs_map_runlist_nolock() |
50 | * will perform the necessary mapping and unmapping. | 50 | * will perform the necessary mapping and unmapping. |
51 | * | 51 | * |
52 | * Note, ntfs_map_runlist_nolock() saves the state of @ctx on entry and | 52 | * Note, ntfs_map_runlist_nolock() saves the state of @ctx on entry and |
53 | * restores it before returning. Thus, @ctx will be left pointing to the same | 53 | * restores it before returning. Thus, @ctx will be left pointing to the same |
54 | * attribute on return as on entry. However, the actual pointers in @ctx may | 54 | * attribute on return as on entry. However, the actual pointers in @ctx may |
55 | * point to different memory locations on return, so you must remember to reset | 55 | * point to different memory locations on return, so you must remember to reset |
56 | * any cached pointers from the @ctx, i.e. after the call to | 56 | * any cached pointers from the @ctx, i.e. after the call to |
57 | * ntfs_map_runlist_nolock(), you will probably want to do: | 57 | * ntfs_map_runlist_nolock(), you will probably want to do: |
58 | * m = ctx->mrec; | 58 | * m = ctx->mrec; |
59 | * a = ctx->attr; | 59 | * a = ctx->attr; |
60 | * Assuming you cache ctx->attr in a variable @a of type ATTR_RECORD * and that | 60 | * Assuming you cache ctx->attr in a variable @a of type ATTR_RECORD * and that |
61 | * you cache ctx->mrec in a variable @m of type MFT_RECORD *. | 61 | * you cache ctx->mrec in a variable @m of type MFT_RECORD *. |
62 | * | 62 | * |
63 | * Return 0 on success and -errno on error. There is one special error code | 63 | * Return 0 on success and -errno on error. There is one special error code |
64 | * which is not an error as such. This is -ENOENT. It means that @vcn is out | 64 | * which is not an error as such. This is -ENOENT. It means that @vcn is out |
65 | * of bounds of the runlist. | 65 | * of bounds of the runlist. |
66 | * | 66 | * |
67 | * Note the runlist can be NULL after this function returns if @vcn is zero and | 67 | * Note the runlist can be NULL after this function returns if @vcn is zero and |
68 | * the attribute has zero allocated size, i.e. there simply is no runlist. | 68 | * the attribute has zero allocated size, i.e. there simply is no runlist. |
69 | * | 69 | * |
70 | * WARNING: If @ctx is supplied, regardless of whether success or failure is | 70 | * WARNING: If @ctx is supplied, regardless of whether success or failure is |
71 | * returned, you need to check IS_ERR(@ctx->mrec) and if 'true' the @ctx | 71 | * returned, you need to check IS_ERR(@ctx->mrec) and if 'true' the @ctx |
72 | * is no longer valid, i.e. you need to either call | 72 | * is no longer valid, i.e. you need to either call |
73 | * ntfs_attr_reinit_search_ctx() or ntfs_attr_put_search_ctx() on it. | 73 | * ntfs_attr_reinit_search_ctx() or ntfs_attr_put_search_ctx() on it. |
74 | * In that case PTR_ERR(@ctx->mrec) will give you the error code for | 74 | * In that case PTR_ERR(@ctx->mrec) will give you the error code for |
75 | * why the mapping of the old inode failed. | 75 | * why the mapping of the old inode failed. |
76 | * | 76 | * |
77 | * Locking: - The runlist described by @ni must be locked for writing on entry | 77 | * Locking: - The runlist described by @ni must be locked for writing on entry |
78 | * and is locked on return. Note the runlist will be modified. | 78 | * and is locked on return. Note the runlist will be modified. |
79 | * - If @ctx is NULL, the base mft record of @ni must not be mapped on | 79 | * - If @ctx is NULL, the base mft record of @ni must not be mapped on |
80 | * entry and it will be left unmapped on return. | 80 | * entry and it will be left unmapped on return. |
81 | * - If @ctx is not NULL, the base mft record must be mapped on entry | 81 | * - If @ctx is not NULL, the base mft record must be mapped on entry |
82 | * and it will be left mapped on return. | 82 | * and it will be left mapped on return. |
83 | */ | 83 | */ |
84 | int ntfs_map_runlist_nolock(ntfs_inode *ni, VCN vcn, ntfs_attr_search_ctx *ctx) | 84 | int ntfs_map_runlist_nolock(ntfs_inode *ni, VCN vcn, ntfs_attr_search_ctx *ctx) |
85 | { | 85 | { |
86 | VCN end_vcn; | 86 | VCN end_vcn; |
87 | unsigned long flags; | 87 | unsigned long flags; |
88 | ntfs_inode *base_ni; | 88 | ntfs_inode *base_ni; |
89 | MFT_RECORD *m; | 89 | MFT_RECORD *m; |
90 | ATTR_RECORD *a; | 90 | ATTR_RECORD *a; |
91 | runlist_element *rl; | 91 | runlist_element *rl; |
92 | struct page *put_this_page = NULL; | 92 | struct page *put_this_page = NULL; |
93 | int err = 0; | 93 | int err = 0; |
94 | bool ctx_is_temporary, ctx_needs_reset; | 94 | bool ctx_is_temporary, ctx_needs_reset; |
95 | ntfs_attr_search_ctx old_ctx = { NULL, }; | 95 | ntfs_attr_search_ctx old_ctx = { NULL, }; |
96 | 96 | ||
97 | ntfs_debug("Mapping runlist part containing vcn 0x%llx.", | 97 | ntfs_debug("Mapping runlist part containing vcn 0x%llx.", |
98 | (unsigned long long)vcn); | 98 | (unsigned long long)vcn); |
99 | if (!NInoAttr(ni)) | 99 | if (!NInoAttr(ni)) |
100 | base_ni = ni; | 100 | base_ni = ni; |
101 | else | 101 | else |
102 | base_ni = ni->ext.base_ntfs_ino; | 102 | base_ni = ni->ext.base_ntfs_ino; |
103 | if (!ctx) { | 103 | if (!ctx) { |
104 | ctx_is_temporary = ctx_needs_reset = true; | 104 | ctx_is_temporary = ctx_needs_reset = true; |
105 | m = map_mft_record(base_ni); | 105 | m = map_mft_record(base_ni); |
106 | if (IS_ERR(m)) | 106 | if (IS_ERR(m)) |
107 | return PTR_ERR(m); | 107 | return PTR_ERR(m); |
108 | ctx = ntfs_attr_get_search_ctx(base_ni, m); | 108 | ctx = ntfs_attr_get_search_ctx(base_ni, m); |
109 | if (unlikely(!ctx)) { | 109 | if (unlikely(!ctx)) { |
110 | err = -ENOMEM; | 110 | err = -ENOMEM; |
111 | goto err_out; | 111 | goto err_out; |
112 | } | 112 | } |
113 | } else { | 113 | } else { |
114 | VCN allocated_size_vcn; | 114 | VCN allocated_size_vcn; |
115 | 115 | ||
116 | BUG_ON(IS_ERR(ctx->mrec)); | 116 | BUG_ON(IS_ERR(ctx->mrec)); |
117 | a = ctx->attr; | 117 | a = ctx->attr; |
118 | BUG_ON(!a->non_resident); | 118 | BUG_ON(!a->non_resident); |
119 | ctx_is_temporary = false; | 119 | ctx_is_temporary = false; |
120 | end_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn); | 120 | end_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn); |
121 | read_lock_irqsave(&ni->size_lock, flags); | 121 | read_lock_irqsave(&ni->size_lock, flags); |
122 | allocated_size_vcn = ni->allocated_size >> | 122 | allocated_size_vcn = ni->allocated_size >> |
123 | ni->vol->cluster_size_bits; | 123 | ni->vol->cluster_size_bits; |
124 | read_unlock_irqrestore(&ni->size_lock, flags); | 124 | read_unlock_irqrestore(&ni->size_lock, flags); |
125 | if (!a->data.non_resident.lowest_vcn && end_vcn <= 0) | 125 | if (!a->data.non_resident.lowest_vcn && end_vcn <= 0) |
126 | end_vcn = allocated_size_vcn - 1; | 126 | end_vcn = allocated_size_vcn - 1; |
127 | /* | 127 | /* |
128 | * If we already have the attribute extent containing @vcn in | 128 | * If we already have the attribute extent containing @vcn in |
129 | * @ctx, no need to look it up again. We slightly cheat in | 129 | * @ctx, no need to look it up again. We slightly cheat in |
130 | * that if vcn exceeds the allocated size, we will refuse to | 130 | * that if vcn exceeds the allocated size, we will refuse to |
131 | * map the runlist below, so there is definitely no need to get | 131 | * map the runlist below, so there is definitely no need to get |
132 | * the right attribute extent. | 132 | * the right attribute extent. |
133 | */ | 133 | */ |
134 | if (vcn >= allocated_size_vcn || (a->type == ni->type && | 134 | if (vcn >= allocated_size_vcn || (a->type == ni->type && |
135 | a->name_length == ni->name_len && | 135 | a->name_length == ni->name_len && |
136 | !memcmp((u8*)a + le16_to_cpu(a->name_offset), | 136 | !memcmp((u8*)a + le16_to_cpu(a->name_offset), |
137 | ni->name, ni->name_len) && | 137 | ni->name, ni->name_len) && |
138 | sle64_to_cpu(a->data.non_resident.lowest_vcn) | 138 | sle64_to_cpu(a->data.non_resident.lowest_vcn) |
139 | <= vcn && end_vcn >= vcn)) | 139 | <= vcn && end_vcn >= vcn)) |
140 | ctx_needs_reset = false; | 140 | ctx_needs_reset = false; |
141 | else { | 141 | else { |
142 | /* Save the old search context. */ | 142 | /* Save the old search context. */ |
143 | old_ctx = *ctx; | 143 | old_ctx = *ctx; |
144 | /* | 144 | /* |
145 | * If the currently mapped (extent) inode is not the | 145 | * If the currently mapped (extent) inode is not the |
146 | * base inode we will unmap it when we reinitialize the | 146 | * base inode we will unmap it when we reinitialize the |
147 | * search context which means we need to get a | 147 | * search context which means we need to get a |
148 | * reference to the page containing the mapped mft | 148 | * reference to the page containing the mapped mft |
149 | * record so we do not accidentally drop changes to the | 149 | * record so we do not accidentally drop changes to the |
150 | * mft record when it has not been marked dirty yet. | 150 | * mft record when it has not been marked dirty yet. |
151 | */ | 151 | */ |
152 | if (old_ctx.base_ntfs_ino && old_ctx.ntfs_ino != | 152 | if (old_ctx.base_ntfs_ino && old_ctx.ntfs_ino != |
153 | old_ctx.base_ntfs_ino) { | 153 | old_ctx.base_ntfs_ino) { |
154 | put_this_page = old_ctx.ntfs_ino->page; | 154 | put_this_page = old_ctx.ntfs_ino->page; |
155 | page_cache_get(put_this_page); | 155 | page_cache_get(put_this_page); |
156 | } | 156 | } |
157 | /* | 157 | /* |
158 | * Reinitialize the search context so we can lookup the | 158 | * Reinitialize the search context so we can lookup the |
159 | * needed attribute extent. | 159 | * needed attribute extent. |
160 | */ | 160 | */ |
161 | ntfs_attr_reinit_search_ctx(ctx); | 161 | ntfs_attr_reinit_search_ctx(ctx); |
162 | ctx_needs_reset = true; | 162 | ctx_needs_reset = true; |
163 | } | 163 | } |
164 | } | 164 | } |
165 | if (ctx_needs_reset) { | 165 | if (ctx_needs_reset) { |
166 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, | 166 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, |
167 | CASE_SENSITIVE, vcn, NULL, 0, ctx); | 167 | CASE_SENSITIVE, vcn, NULL, 0, ctx); |
168 | if (unlikely(err)) { | 168 | if (unlikely(err)) { |
169 | if (err == -ENOENT) | 169 | if (err == -ENOENT) |
170 | err = -EIO; | 170 | err = -EIO; |
171 | goto err_out; | 171 | goto err_out; |
172 | } | 172 | } |
173 | BUG_ON(!ctx->attr->non_resident); | 173 | BUG_ON(!ctx->attr->non_resident); |
174 | } | 174 | } |
175 | a = ctx->attr; | 175 | a = ctx->attr; |
176 | /* | 176 | /* |
177 | * Only decompress the mapping pairs if @vcn is inside it. Otherwise | 177 | * Only decompress the mapping pairs if @vcn is inside it. Otherwise |
178 | * we get into problems when we try to map an out of bounds vcn because | 178 | * we get into problems when we try to map an out of bounds vcn because |
179 | * we then try to map the already mapped runlist fragment and | 179 | * we then try to map the already mapped runlist fragment and |
180 | * ntfs_mapping_pairs_decompress() fails. | 180 | * ntfs_mapping_pairs_decompress() fails. |
181 | */ | 181 | */ |
182 | end_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn) + 1; | 182 | end_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn) + 1; |
183 | if (unlikely(vcn && vcn >= end_vcn)) { | 183 | if (unlikely(vcn && vcn >= end_vcn)) { |
184 | err = -ENOENT; | 184 | err = -ENOENT; |
185 | goto err_out; | 185 | goto err_out; |
186 | } | 186 | } |
187 | rl = ntfs_mapping_pairs_decompress(ni->vol, a, ni->runlist.rl); | 187 | rl = ntfs_mapping_pairs_decompress(ni->vol, a, ni->runlist.rl); |
188 | if (IS_ERR(rl)) | 188 | if (IS_ERR(rl)) |
189 | err = PTR_ERR(rl); | 189 | err = PTR_ERR(rl); |
190 | else | 190 | else |
191 | ni->runlist.rl = rl; | 191 | ni->runlist.rl = rl; |
192 | err_out: | 192 | err_out: |
193 | if (ctx_is_temporary) { | 193 | if (ctx_is_temporary) { |
194 | if (likely(ctx)) | 194 | if (likely(ctx)) |
195 | ntfs_attr_put_search_ctx(ctx); | 195 | ntfs_attr_put_search_ctx(ctx); |
196 | unmap_mft_record(base_ni); | 196 | unmap_mft_record(base_ni); |
197 | } else if (ctx_needs_reset) { | 197 | } else if (ctx_needs_reset) { |
198 | /* | 198 | /* |
199 | * If there is no attribute list, restoring the search context | 199 | * If there is no attribute list, restoring the search context |
200 | * is accomplished simply by copying the saved context back over | 200 | * is accomplished simply by copying the saved context back over |
201 | * the caller supplied context. If there is an attribute list, | 201 | * the caller supplied context. If there is an attribute list, |
202 | * things are more complicated as we need to deal with mapping | 202 | * things are more complicated as we need to deal with mapping |
203 | * of mft records and resulting potential changes in pointers. | 203 | * of mft records and resulting potential changes in pointers. |
204 | */ | 204 | */ |
205 | if (NInoAttrList(base_ni)) { | 205 | if (NInoAttrList(base_ni)) { |
206 | /* | 206 | /* |
207 | * If the currently mapped (extent) inode is not the | 207 | * If the currently mapped (extent) inode is not the |
208 | * one we had before, we need to unmap it and map the | 208 | * one we had before, we need to unmap it and map the |
209 | * old one. | 209 | * old one. |
210 | */ | 210 | */ |
211 | if (ctx->ntfs_ino != old_ctx.ntfs_ino) { | 211 | if (ctx->ntfs_ino != old_ctx.ntfs_ino) { |
212 | /* | 212 | /* |
213 | * If the currently mapped inode is not the | 213 | * If the currently mapped inode is not the |
214 | * base inode, unmap it. | 214 | * base inode, unmap it. |
215 | */ | 215 | */ |
216 | if (ctx->base_ntfs_ino && ctx->ntfs_ino != | 216 | if (ctx->base_ntfs_ino && ctx->ntfs_ino != |
217 | ctx->base_ntfs_ino) { | 217 | ctx->base_ntfs_ino) { |
218 | unmap_extent_mft_record(ctx->ntfs_ino); | 218 | unmap_extent_mft_record(ctx->ntfs_ino); |
219 | ctx->mrec = ctx->base_mrec; | 219 | ctx->mrec = ctx->base_mrec; |
220 | BUG_ON(!ctx->mrec); | 220 | BUG_ON(!ctx->mrec); |
221 | } | 221 | } |
222 | /* | 222 | /* |
223 | * If the old mapped inode is not the base | 223 | * If the old mapped inode is not the base |
224 | * inode, map it. | 224 | * inode, map it. |
225 | */ | 225 | */ |
226 | if (old_ctx.base_ntfs_ino && | 226 | if (old_ctx.base_ntfs_ino && |
227 | old_ctx.ntfs_ino != | 227 | old_ctx.ntfs_ino != |
228 | old_ctx.base_ntfs_ino) { | 228 | old_ctx.base_ntfs_ino) { |
229 | retry_map: | 229 | retry_map: |
230 | ctx->mrec = map_mft_record( | 230 | ctx->mrec = map_mft_record( |
231 | old_ctx.ntfs_ino); | 231 | old_ctx.ntfs_ino); |
232 | /* | 232 | /* |
233 | * Something bad has happened. If out | 233 | * Something bad has happened. If out |
234 | * of memory retry till it succeeds. | 234 | * of memory retry till it succeeds. |
235 | * Any other errors are fatal and we | 235 | * Any other errors are fatal and we |
236 | * return the error code in ctx->mrec. | 236 | * return the error code in ctx->mrec. |
237 | * Let the caller deal with it... We | 237 | * Let the caller deal with it... We |
238 | * just need to fudge things so the | 238 | * just need to fudge things so the |
239 | * caller can reinit and/or put the | 239 | * caller can reinit and/or put the |
240 | * search context safely. | 240 | * search context safely. |
241 | */ | 241 | */ |
242 | if (IS_ERR(ctx->mrec)) { | 242 | if (IS_ERR(ctx->mrec)) { |
243 | if (PTR_ERR(ctx->mrec) == | 243 | if (PTR_ERR(ctx->mrec) == |
244 | -ENOMEM) { | 244 | -ENOMEM) { |
245 | schedule(); | 245 | schedule(); |
246 | goto retry_map; | 246 | goto retry_map; |
247 | } else | 247 | } else |
248 | old_ctx.ntfs_ino = | 248 | old_ctx.ntfs_ino = |
249 | old_ctx. | 249 | old_ctx. |
250 | base_ntfs_ino; | 250 | base_ntfs_ino; |
251 | } | 251 | } |
252 | } | 252 | } |
253 | } | 253 | } |
254 | /* Update the changed pointers in the saved context. */ | 254 | /* Update the changed pointers in the saved context. */ |
255 | if (ctx->mrec != old_ctx.mrec) { | 255 | if (ctx->mrec != old_ctx.mrec) { |
256 | if (!IS_ERR(ctx->mrec)) | 256 | if (!IS_ERR(ctx->mrec)) |
257 | old_ctx.attr = (ATTR_RECORD*)( | 257 | old_ctx.attr = (ATTR_RECORD*)( |
258 | (u8*)ctx->mrec + | 258 | (u8*)ctx->mrec + |
259 | ((u8*)old_ctx.attr - | 259 | ((u8*)old_ctx.attr - |
260 | (u8*)old_ctx.mrec)); | 260 | (u8*)old_ctx.mrec)); |
261 | old_ctx.mrec = ctx->mrec; | 261 | old_ctx.mrec = ctx->mrec; |
262 | } | 262 | } |
263 | } | 263 | } |
264 | /* Restore the search context to the saved one. */ | 264 | /* Restore the search context to the saved one. */ |
265 | *ctx = old_ctx; | 265 | *ctx = old_ctx; |
266 | /* | 266 | /* |
267 | * We drop the reference on the page we took earlier. In the | 267 | * We drop the reference on the page we took earlier. In the |
268 | * case that IS_ERR(ctx->mrec) is true this means we might lose | 268 | * case that IS_ERR(ctx->mrec) is true this means we might lose |
269 | * some changes to the mft record that had been made between | 269 | * some changes to the mft record that had been made between |
270 | * the last time it was marked dirty/written out and now. This | 270 | * the last time it was marked dirty/written out and now. This |
271 | * at this stage is not a problem as the mapping error is fatal | 271 | * at this stage is not a problem as the mapping error is fatal |
272 | * enough that the mft record cannot be written out anyway and | 272 | * enough that the mft record cannot be written out anyway and |
273 | * the caller is very likely to shutdown the whole inode | 273 | * the caller is very likely to shutdown the whole inode |
274 | * immediately and mark the volume dirty for chkdsk to pick up | 274 | * immediately and mark the volume dirty for chkdsk to pick up |
275 | * the pieces anyway. | 275 | * the pieces anyway. |
276 | */ | 276 | */ |
277 | if (put_this_page) | 277 | if (put_this_page) |
278 | page_cache_release(put_this_page); | 278 | page_cache_release(put_this_page); |
279 | } | 279 | } |
280 | return err; | 280 | return err; |
281 | } | 281 | } |
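Following the kerneldoc above, a caller that holds an active search context refreshes its cached pointers after the call; a rough sketch (the surrounding variables m, a and err are assumed):

	err = ntfs_map_runlist_nolock(ni, vcn, ctx);
	/* The context can be invalidated regardless of the return value. */
	if (IS_ERR(ctx->mrec)) {
		err = PTR_ERR(ctx->mrec);
		ntfs_attr_put_search_ctx(ctx);
		return err;
	}
	if (err && err != -ENOENT)	/* -ENOENT only means @vcn is out of bounds */
		return err;
	m = ctx->mrec;	/* refresh cached mft record pointer */
	a = ctx->attr;	/* refresh cached attribute record pointer */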
282 | 282 | ||
283 | /** | 283 | /** |
284 | * ntfs_map_runlist - map (a part of) a runlist of an ntfs inode | 284 | * ntfs_map_runlist - map (a part of) a runlist of an ntfs inode |
285 | * @ni: ntfs inode for which to map (part of) a runlist | 285 | * @ni: ntfs inode for which to map (part of) a runlist |
286 | * @vcn: map runlist part containing this vcn | 286 | * @vcn: map runlist part containing this vcn |
287 | * | 287 | * |
288 | * Map the part of a runlist containing the @vcn of the ntfs inode @ni. | 288 | * Map the part of a runlist containing the @vcn of the ntfs inode @ni. |
289 | * | 289 | * |
290 | * Return 0 on success and -errno on error. There is one special error code | 290 | * Return 0 on success and -errno on error. There is one special error code |
291 | * which is not an error as such. This is -ENOENT. It means that @vcn is out | 291 | * which is not an error as such. This is -ENOENT. It means that @vcn is out |
292 | * of bounds of the runlist. | 292 | * of bounds of the runlist. |
293 | * | 293 | * |
294 | * Locking: - The runlist must be unlocked on entry and is unlocked on return. | 294 | * Locking: - The runlist must be unlocked on entry and is unlocked on return. |
295 | * - This function takes the runlist lock for writing and may modify | 295 | * - This function takes the runlist lock for writing and may modify |
296 | * the runlist. | 296 | * the runlist. |
297 | */ | 297 | */ |
298 | int ntfs_map_runlist(ntfs_inode *ni, VCN vcn) | 298 | int ntfs_map_runlist(ntfs_inode *ni, VCN vcn) |
299 | { | 299 | { |
300 | int err = 0; | 300 | int err = 0; |
301 | 301 | ||
302 | down_write(&ni->runlist.lock); | 302 | down_write(&ni->runlist.lock); |
303 | /* Make sure someone else didn't do the work while we were sleeping. */ | 303 | /* Make sure someone else didn't do the work while we were sleeping. */ |
304 | if (likely(ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn) <= | 304 | if (likely(ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn) <= |
305 | LCN_RL_NOT_MAPPED)) | 305 | LCN_RL_NOT_MAPPED)) |
306 | err = ntfs_map_runlist_nolock(ni, vcn, NULL); | 306 | err = ntfs_map_runlist_nolock(ni, vcn, NULL); |
307 | up_write(&ni->runlist.lock); | 307 | up_write(&ni->runlist.lock); |
308 | return err; | 308 | return err; |
309 | } | 309 | } |
310 | 310 | ||
311 | /** | 311 | /** |
312 | * ntfs_attr_vcn_to_lcn_nolock - convert a vcn into a lcn given an ntfs inode | 312 | * ntfs_attr_vcn_to_lcn_nolock - convert a vcn into a lcn given an ntfs inode |
313 | * @ni: ntfs inode of the attribute whose runlist to search | 313 | * @ni: ntfs inode of the attribute whose runlist to search |
314 | * @vcn: vcn to convert | 314 | * @vcn: vcn to convert |
315 | * @write_locked: true if the runlist is locked for writing | 315 | * @write_locked: true if the runlist is locked for writing |
316 | * | 316 | * |
317 | * Find the virtual cluster number @vcn in the runlist of the ntfs attribute | 317 | * Find the virtual cluster number @vcn in the runlist of the ntfs attribute |
318 | * described by the ntfs inode @ni and return the corresponding logical cluster | 318 | * described by the ntfs inode @ni and return the corresponding logical cluster |
319 | * number (lcn). | 319 | * number (lcn). |
320 | * | 320 | * |
321 | * If the @vcn is not mapped yet, the attempt is made to map the attribute | 321 | * If the @vcn is not mapped yet, the attempt is made to map the attribute |
322 | * extent containing the @vcn and the vcn to lcn conversion is retried. | 322 | * extent containing the @vcn and the vcn to lcn conversion is retried. |
323 | * | 323 | * |
324 | * If @write_locked is true the caller has locked the runlist for writing and | 324 | * If @write_locked is true the caller has locked the runlist for writing and |
325 | * if false for reading. | 325 | * if false for reading. |
326 | * | 326 | * |
327 | * Since lcns must be >= 0, we use negative return codes with special meaning: | 327 | * Since lcns must be >= 0, we use negative return codes with special meaning: |
328 | * | 328 | * |
329 | * Return code Meaning / Description | 329 | * Return code Meaning / Description |
330 | * ========================================== | 330 | * ========================================== |
331 | * LCN_HOLE Hole / not allocated on disk. | 331 | * LCN_HOLE Hole / not allocated on disk. |
332 | * LCN_ENOENT There is no such vcn in the runlist, i.e. @vcn is out of bounds. | 332 | * LCN_ENOENT There is no such vcn in the runlist, i.e. @vcn is out of bounds. |
333 | * LCN_ENOMEM Not enough memory to map runlist. | 333 | * LCN_ENOMEM Not enough memory to map runlist. |
334 | * LCN_EIO Critical error (runlist/file is corrupt, i/o error, etc). | 334 | * LCN_EIO Critical error (runlist/file is corrupt, i/o error, etc). |
335 | * | 335 | * |
336 | * Locking: - The runlist must be locked on entry and is left locked on return. | 336 | * Locking: - The runlist must be locked on entry and is left locked on return. |
337 | * - If @write_locked is 'false', i.e. the runlist is locked for reading, | 337 | * - If @write_locked is 'false', i.e. the runlist is locked for reading, |
338 | * the lock may be dropped inside the function so you cannot rely on | 338 | * the lock may be dropped inside the function so you cannot rely on |
339 | * the runlist still being the same when this function returns. | 339 | * the runlist still being the same when this function returns. |
340 | */ | 340 | */ |
341 | LCN ntfs_attr_vcn_to_lcn_nolock(ntfs_inode *ni, const VCN vcn, | 341 | LCN ntfs_attr_vcn_to_lcn_nolock(ntfs_inode *ni, const VCN vcn, |
342 | const bool write_locked) | 342 | const bool write_locked) |
343 | { | 343 | { |
344 | LCN lcn; | 344 | LCN lcn; |
345 | unsigned long flags; | 345 | unsigned long flags; |
346 | bool is_retry = false; | 346 | bool is_retry = false; |
347 | 347 | ||
348 | BUG_ON(!ni); | 348 | BUG_ON(!ni); |
349 | ntfs_debug("Entering for i_ino 0x%lx, vcn 0x%llx, %s_locked.", | 349 | ntfs_debug("Entering for i_ino 0x%lx, vcn 0x%llx, %s_locked.", |
350 | ni->mft_no, (unsigned long long)vcn, | 350 | ni->mft_no, (unsigned long long)vcn, |
351 | write_locked ? "write" : "read"); | 351 | write_locked ? "write" : "read"); |
352 | BUG_ON(!NInoNonResident(ni)); | 352 | BUG_ON(!NInoNonResident(ni)); |
353 | BUG_ON(vcn < 0); | 353 | BUG_ON(vcn < 0); |
354 | if (!ni->runlist.rl) { | 354 | if (!ni->runlist.rl) { |
355 | read_lock_irqsave(&ni->size_lock, flags); | 355 | read_lock_irqsave(&ni->size_lock, flags); |
356 | if (!ni->allocated_size) { | 356 | if (!ni->allocated_size) { |
357 | read_unlock_irqrestore(&ni->size_lock, flags); | 357 | read_unlock_irqrestore(&ni->size_lock, flags); |
358 | return LCN_ENOENT; | 358 | return LCN_ENOENT; |
359 | } | 359 | } |
360 | read_unlock_irqrestore(&ni->size_lock, flags); | 360 | read_unlock_irqrestore(&ni->size_lock, flags); |
361 | } | 361 | } |
362 | retry_remap: | 362 | retry_remap: |
363 | /* Convert vcn to lcn. If that fails map the runlist and retry once. */ | 363 | /* Convert vcn to lcn. If that fails map the runlist and retry once. */ |
364 | lcn = ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn); | 364 | lcn = ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn); |
365 | if (likely(lcn >= LCN_HOLE)) { | 365 | if (likely(lcn >= LCN_HOLE)) { |
366 | ntfs_debug("Done, lcn 0x%llx.", (long long)lcn); | 366 | ntfs_debug("Done, lcn 0x%llx.", (long long)lcn); |
367 | return lcn; | 367 | return lcn; |
368 | } | 368 | } |
369 | if (lcn != LCN_RL_NOT_MAPPED) { | 369 | if (lcn != LCN_RL_NOT_MAPPED) { |
370 | if (lcn != LCN_ENOENT) | 370 | if (lcn != LCN_ENOENT) |
371 | lcn = LCN_EIO; | 371 | lcn = LCN_EIO; |
372 | } else if (!is_retry) { | 372 | } else if (!is_retry) { |
373 | int err; | 373 | int err; |
374 | 374 | ||
375 | if (!write_locked) { | 375 | if (!write_locked) { |
376 | up_read(&ni->runlist.lock); | 376 | up_read(&ni->runlist.lock); |
377 | down_write(&ni->runlist.lock); | 377 | down_write(&ni->runlist.lock); |
378 | if (unlikely(ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn) != | 378 | if (unlikely(ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn) != |
379 | LCN_RL_NOT_MAPPED)) { | 379 | LCN_RL_NOT_MAPPED)) { |
380 | up_write(&ni->runlist.lock); | 380 | up_write(&ni->runlist.lock); |
381 | down_read(&ni->runlist.lock); | 381 | down_read(&ni->runlist.lock); |
382 | goto retry_remap; | 382 | goto retry_remap; |
383 | } | 383 | } |
384 | } | 384 | } |
385 | err = ntfs_map_runlist_nolock(ni, vcn, NULL); | 385 | err = ntfs_map_runlist_nolock(ni, vcn, NULL); |
386 | if (!write_locked) { | 386 | if (!write_locked) { |
387 | up_write(&ni->runlist.lock); | 387 | up_write(&ni->runlist.lock); |
388 | down_read(&ni->runlist.lock); | 388 | down_read(&ni->runlist.lock); |
389 | } | 389 | } |
390 | if (likely(!err)) { | 390 | if (likely(!err)) { |
391 | is_retry = true; | 391 | is_retry = true; |
392 | goto retry_remap; | 392 | goto retry_remap; |
393 | } | 393 | } |
394 | if (err == -ENOENT) | 394 | if (err == -ENOENT) |
395 | lcn = LCN_ENOENT; | 395 | lcn = LCN_ENOENT; |
396 | else if (err == -ENOMEM) | 396 | else if (err == -ENOMEM) |
397 | lcn = LCN_ENOMEM; | 397 | lcn = LCN_ENOMEM; |
398 | else | 398 | else |
399 | lcn = LCN_EIO; | 399 | lcn = LCN_EIO; |
400 | } | 400 | } |
401 | if (lcn != LCN_ENOENT) | 401 | if (lcn != LCN_ENOENT) |
402 | ntfs_error(ni->vol->sb, "Failed with error code %lli.", | 402 | ntfs_error(ni->vol->sb, "Failed with error code %lli.", |
403 | (long long)lcn); | 403 | (long long)lcn); |
404 | return lcn; | 404 | return lcn; |
405 | } | 405 | } |
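Since real lcns are never negative, a caller can dispatch on the codes in the table above; a minimal sketch with assumed variable names:

	LCN lcn = ntfs_attr_vcn_to_lcn_nolock(ni, vcn, false /* runlist read-locked */);

	if (lcn >= 0) {
		/* @vcn maps to logical cluster @lcn on disk. */
	} else if (lcn == LCN_HOLE) {
		/* Sparse run: supply zeroes on read, allocate clusters on write. */
	} else if (lcn == LCN_ENOENT) {
		err = -ENOENT;	/* @vcn is out of bounds of the runlist */
	} else {
		err = (lcn == LCN_ENOMEM) ? -ENOMEM : -EIO;
	}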
406 | 406 | ||
407 | /** | 407 | /** |
408 | * ntfs_attr_find_vcn_nolock - find a vcn in the runlist of an ntfs inode | 408 | * ntfs_attr_find_vcn_nolock - find a vcn in the runlist of an ntfs inode |
409 | * @ni: ntfs inode describing the runlist to search | 409 | * @ni: ntfs inode describing the runlist to search |
410 | * @vcn: vcn to find | 410 | * @vcn: vcn to find |
411 | * @ctx: active attribute search context if present or NULL if not | 411 | * @ctx: active attribute search context if present or NULL if not |
412 | * | 412 | * |
413 | * Find the virtual cluster number @vcn in the runlist described by the ntfs | 413 | * Find the virtual cluster number @vcn in the runlist described by the ntfs |
414 | * inode @ni and return the address of the runlist element containing the @vcn. | 414 | * inode @ni and return the address of the runlist element containing the @vcn. |
415 | * | 415 | * |
416 | * If the @vcn is not mapped yet, the attempt is made to map the attribute | 416 | * If the @vcn is not mapped yet, the attempt is made to map the attribute |
417 | * extent containing the @vcn and the vcn to lcn conversion is retried. | 417 | * extent containing the @vcn and the vcn to lcn conversion is retried. |
418 | * | 418 | * |
419 | * If @ctx is specified, it is an active search context of @ni and its base mft | 419 | * If @ctx is specified, it is an active search context of @ni and its base mft |
420 | * record. This is needed when ntfs_attr_find_vcn_nolock() encounters unmapped | 420 | * record. This is needed when ntfs_attr_find_vcn_nolock() encounters unmapped |
421 | * runlist fragments and allows their mapping. If you do not have the mft | 421 | * runlist fragments and allows their mapping. If you do not have the mft |
422 | * record mapped, you can specify @ctx as NULL and ntfs_attr_find_vcn_nolock() | 422 | * record mapped, you can specify @ctx as NULL and ntfs_attr_find_vcn_nolock() |
423 | * will perform the necessary mapping and unmapping. | 423 | * will perform the necessary mapping and unmapping. |
424 | * | 424 | * |
425 | * Note, ntfs_attr_find_vcn_nolock() saves the state of @ctx on entry and | 425 | * Note, ntfs_attr_find_vcn_nolock() saves the state of @ctx on entry and |
426 | * restores it before returning. Thus, @ctx will be left pointing to the same | 426 | * restores it before returning. Thus, @ctx will be left pointing to the same |
427 | * attribute on return as on entry. However, the actual pointers in @ctx may | 427 | * attribute on return as on entry. However, the actual pointers in @ctx may |
428 | * point to different memory locations on return, so you must remember to reset | 428 | * point to different memory locations on return, so you must remember to reset |
429 | * any cached pointers from the @ctx, i.e. after the call to | 429 | * any cached pointers from the @ctx, i.e. after the call to |
430 | * ntfs_attr_find_vcn_nolock(), you will probably want to do: | 430 | * ntfs_attr_find_vcn_nolock(), you will probably want to do: |
431 | * m = ctx->mrec; | 431 | * m = ctx->mrec; |
432 | * a = ctx->attr; | 432 | * a = ctx->attr; |
433 | * Assuming you cache ctx->attr in a variable @a of type ATTR_RECORD * and that | 433 | * Assuming you cache ctx->attr in a variable @a of type ATTR_RECORD * and that |
434 | * you cache ctx->mrec in a variable @m of type MFT_RECORD *. | 434 | * you cache ctx->mrec in a variable @m of type MFT_RECORD *. |
435 | * Note you need to distinguish between the lcn of the returned runlist element | 435 | * Note you need to distinguish between the lcn of the returned runlist element |
436 | * being >= 0 and LCN_HOLE. In the later case you have to return zeroes on | 436 | * being >= 0 and LCN_HOLE. In the later case you have to return zeroes on |
437 | * read and allocate clusters on write. | 437 | * read and allocate clusters on write. |
438 | * | 438 | * |
439 | * Return the runlist element containing the @vcn on success and | 439 | * Return the runlist element containing the @vcn on success and |
440 | * ERR_PTR(-errno) on error. You need to test the return value with IS_ERR() | 440 | * ERR_PTR(-errno) on error. You need to test the return value with IS_ERR() |
441 | * to decide if the return is success or failure and PTR_ERR() to get to the | 441 | * to decide if the return is success or failure and PTR_ERR() to get to the |
442 | * error code if IS_ERR() is true. | 442 | * error code if IS_ERR() is true. |
443 | * | 443 | * |
444 | * The possible error return codes are: | 444 | * The possible error return codes are: |
445 | * -ENOENT - No such vcn in the runlist, i.e. @vcn is out of bounds. | 445 | * -ENOENT - No such vcn in the runlist, i.e. @vcn is out of bounds. |
446 | * -ENOMEM - Not enough memory to map runlist. | 446 | * -ENOMEM - Not enough memory to map runlist. |
447 | * -EIO - Critical error (runlist/file is corrupt, i/o error, etc). | 447 | * -EIO - Critical error (runlist/file is corrupt, i/o error, etc). |
448 | * | 448 | * |
449 | * WARNING: If @ctx is supplied, regardless of whether success or failure is | 449 | * WARNING: If @ctx is supplied, regardless of whether success or failure is |
450 | * returned, you need to check IS_ERR(@ctx->mrec) and if 'true' the @ctx | 450 | * returned, you need to check IS_ERR(@ctx->mrec) and if 'true' the @ctx |
451 | * is no longer valid, i.e. you need to either call | 451 | * is no longer valid, i.e. you need to either call |
452 | * ntfs_attr_reinit_search_ctx() or ntfs_attr_put_search_ctx() on it. | 452 | * ntfs_attr_reinit_search_ctx() or ntfs_attr_put_search_ctx() on it. |
453 | * In that case PTR_ERR(@ctx->mrec) will give you the error code for | 453 | * In that case PTR_ERR(@ctx->mrec) will give you the error code for |
454 | * why the mapping of the old inode failed. | 454 | * why the mapping of the old inode failed. |
455 | * | 455 | * |
456 | * Locking: - The runlist described by @ni must be locked for writing on entry | 456 | * Locking: - The runlist described by @ni must be locked for writing on entry |
457 | * and is locked on return. Note the runlist may be modified when | 457 | * and is locked on return. Note the runlist may be modified when |
458 | * needed runlist fragments need to be mapped. | 458 | * needed runlist fragments need to be mapped. |
459 | * - If @ctx is NULL, the base mft record of @ni must not be mapped on | 459 | * - If @ctx is NULL, the base mft record of @ni must not be mapped on |
460 | * entry and it will be left unmapped on return. | 460 | * entry and it will be left unmapped on return. |
461 | * - If @ctx is not NULL, the base mft record must be mapped on entry | 461 | * - If @ctx is not NULL, the base mft record must be mapped on entry |
462 | * and it will be left mapped on return. | 462 | * and it will be left mapped on return. |
463 | */ | 463 | */ |
464 | runlist_element *ntfs_attr_find_vcn_nolock(ntfs_inode *ni, const VCN vcn, | 464 | runlist_element *ntfs_attr_find_vcn_nolock(ntfs_inode *ni, const VCN vcn, |
465 | ntfs_attr_search_ctx *ctx) | 465 | ntfs_attr_search_ctx *ctx) |
466 | { | 466 | { |
467 | unsigned long flags; | 467 | unsigned long flags; |
468 | runlist_element *rl; | 468 | runlist_element *rl; |
469 | int err = 0; | 469 | int err = 0; |
470 | bool is_retry = false; | 470 | bool is_retry = false; |
471 | 471 | ||
472 | BUG_ON(!ni); | 472 | BUG_ON(!ni); |
473 | ntfs_debug("Entering for i_ino 0x%lx, vcn 0x%llx, with%s ctx.", | 473 | ntfs_debug("Entering for i_ino 0x%lx, vcn 0x%llx, with%s ctx.", |
474 | ni->mft_no, (unsigned long long)vcn, ctx ? "" : "out"); | 474 | ni->mft_no, (unsigned long long)vcn, ctx ? "" : "out"); |
475 | BUG_ON(!NInoNonResident(ni)); | 475 | BUG_ON(!NInoNonResident(ni)); |
476 | BUG_ON(vcn < 0); | 476 | BUG_ON(vcn < 0); |
477 | if (!ni->runlist.rl) { | 477 | if (!ni->runlist.rl) { |
478 | read_lock_irqsave(&ni->size_lock, flags); | 478 | read_lock_irqsave(&ni->size_lock, flags); |
479 | if (!ni->allocated_size) { | 479 | if (!ni->allocated_size) { |
480 | read_unlock_irqrestore(&ni->size_lock, flags); | 480 | read_unlock_irqrestore(&ni->size_lock, flags); |
481 | return ERR_PTR(-ENOENT); | 481 | return ERR_PTR(-ENOENT); |
482 | } | 482 | } |
483 | read_unlock_irqrestore(&ni->size_lock, flags); | 483 | read_unlock_irqrestore(&ni->size_lock, flags); |
484 | } | 484 | } |
485 | retry_remap: | 485 | retry_remap: |
486 | rl = ni->runlist.rl; | 486 | rl = ni->runlist.rl; |
487 | if (likely(rl && vcn >= rl[0].vcn)) { | 487 | if (likely(rl && vcn >= rl[0].vcn)) { |
488 | while (likely(rl->length)) { | 488 | while (likely(rl->length)) { |
489 | if (unlikely(vcn < rl[1].vcn)) { | 489 | if (unlikely(vcn < rl[1].vcn)) { |
490 | if (likely(rl->lcn >= LCN_HOLE)) { | 490 | if (likely(rl->lcn >= LCN_HOLE)) { |
491 | ntfs_debug("Done."); | 491 | ntfs_debug("Done."); |
492 | return rl; | 492 | return rl; |
493 | } | 493 | } |
494 | break; | 494 | break; |
495 | } | 495 | } |
496 | rl++; | 496 | rl++; |
497 | } | 497 | } |
498 | if (likely(rl->lcn != LCN_RL_NOT_MAPPED)) { | 498 | if (likely(rl->lcn != LCN_RL_NOT_MAPPED)) { |
499 | if (likely(rl->lcn == LCN_ENOENT)) | 499 | if (likely(rl->lcn == LCN_ENOENT)) |
500 | err = -ENOENT; | 500 | err = -ENOENT; |
501 | else | 501 | else |
502 | err = -EIO; | 502 | err = -EIO; |
503 | } | 503 | } |
504 | } | 504 | } |
505 | if (!err && !is_retry) { | 505 | if (!err && !is_retry) { |
506 | /* | 506 | /* |
507 | * If the search context is invalid we cannot map the unmapped | 507 | * If the search context is invalid we cannot map the unmapped |
508 | * region. | 508 | * region. |
509 | */ | 509 | */ |
510 | if (IS_ERR(ctx->mrec)) | 510 | if (IS_ERR(ctx->mrec)) |
511 | err = PTR_ERR(ctx->mrec); | 511 | err = PTR_ERR(ctx->mrec); |
512 | else { | 512 | else { |
513 | /* | 513 | /* |
514 | * The @vcn is in an unmapped region, map the runlist | 514 | * The @vcn is in an unmapped region, map the runlist |
515 | * and retry. | 515 | * and retry. |
516 | */ | 516 | */ |
517 | err = ntfs_map_runlist_nolock(ni, vcn, ctx); | 517 | err = ntfs_map_runlist_nolock(ni, vcn, ctx); |
518 | if (likely(!err)) { | 518 | if (likely(!err)) { |
519 | is_retry = true; | 519 | is_retry = true; |
520 | goto retry_remap; | 520 | goto retry_remap; |
521 | } | 521 | } |
522 | } | 522 | } |
523 | if (err == -EINVAL) | 523 | if (err == -EINVAL) |
524 | err = -EIO; | 524 | err = -EIO; |
525 | } else if (!err) | 525 | } else if (!err) |
526 | err = -EIO; | 526 | err = -EIO; |
527 | if (err != -ENOENT) | 527 | if (err != -ENOENT) |
528 | ntfs_error(ni->vol->sb, "Failed with error code %i.", err); | 528 | ntfs_error(ni->vol->sb, "Failed with error code %i.", err); |
529 | return ERR_PTR(err); | 529 | return ERR_PTR(err); |
530 | } | 530 | } |
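Per the kerneldoc, the return value must be tested with IS_ERR() and a hole distinguished from a mapped run; a brief assumed sketch:

	runlist_element *rl = ntfs_attr_find_vcn_nolock(ni, vcn, ctx);

	if (IS_ERR(rl)) {
		err = PTR_ERR(rl);	/* -ENOENT, -ENOMEM or -EIO */
	} else if (rl->lcn >= 0) {
		/* Mapped run: rl->lcn is the lcn of the run's first cluster. */
	} else {
		/* rl->lcn == LCN_HOLE: return zeroes on read, allocate on write. */
	}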
531 | 531 | ||
532 | /** | 532 | /** |
533 | * ntfs_attr_find - find (next) attribute in mft record | 533 | * ntfs_attr_find - find (next) attribute in mft record |
534 | * @type: attribute type to find | 534 | * @type: attribute type to find |
535 | * @name: attribute name to find (optional, i.e. NULL means don't care) | 535 | * @name: attribute name to find (optional, i.e. NULL means don't care) |
536 | * @name_len: attribute name length (only needed if @name present) | 536 | * @name_len: attribute name length (only needed if @name present) |
537 | * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present) | 537 | * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present) |
538 | * @val: attribute value to find (optional, resident attributes only) | 538 | * @val: attribute value to find (optional, resident attributes only) |
539 | * @val_len: attribute value length | 539 | * @val_len: attribute value length |
540 | * @ctx: search context with mft record and attribute to search from | 540 | * @ctx: search context with mft record and attribute to search from |
541 | * | 541 | * |
542 | * You should not need to call this function directly. Use ntfs_attr_lookup() | 542 | * You should not need to call this function directly. Use ntfs_attr_lookup() |
543 | * instead. | 543 | * instead. |
544 | * | 544 | * |
545 | * ntfs_attr_find() takes a search context @ctx as parameter and searches the | 545 | * ntfs_attr_find() takes a search context @ctx as parameter and searches the |
546 | * mft record specified by @ctx->mrec, beginning at @ctx->attr, for an | 546 | * mft record specified by @ctx->mrec, beginning at @ctx->attr, for an |
547 | * attribute of @type, optionally @name and @val. | 547 | * attribute of @type, optionally @name and @val. |
548 | * | 548 | * |
549 | * If the attribute is found, ntfs_attr_find() returns 0 and @ctx->attr will | 549 | * If the attribute is found, ntfs_attr_find() returns 0 and @ctx->attr will |
550 | * point to the found attribute. | 550 | * point to the found attribute. |
551 | * | 551 | * |
552 | * If the attribute is not found, ntfs_attr_find() returns -ENOENT and | 552 | * If the attribute is not found, ntfs_attr_find() returns -ENOENT and |
553 | * @ctx->attr will point to the attribute before which the attribute being | 553 | * @ctx->attr will point to the attribute before which the attribute being |
554 | * searched for would need to be inserted if such an action were to be desired. | 554 | * searched for would need to be inserted if such an action were to be desired. |
555 | * | 555 | * |
556 | * On actual error, ntfs_attr_find() returns -EIO. In this case @ctx->attr is | 556 | * On actual error, ntfs_attr_find() returns -EIO. In this case @ctx->attr is |
557 | * undefined and in particular do not rely on it not changing. | 557 | * undefined and in particular do not rely on it not changing. |
558 | * | 558 | * |
559 | * If @ctx->is_first is 'true', the search begins with @ctx->attr itself. If it | 559 | * If @ctx->is_first is 'true', the search begins with @ctx->attr itself. If it |
560 | * is 'false', the search begins after @ctx->attr. | 560 | * is 'false', the search begins after @ctx->attr. |
561 | * | 561 | * |
562 | * If @ic is IGNORE_CASE, the @name comparison is not case sensitive and | 562 | * If @ic is IGNORE_CASE, the @name comparison is not case sensitive and |
563 | * @ctx->ntfs_ino must be set to the ntfs inode to which the mft record | 563 | * @ctx->ntfs_ino must be set to the ntfs inode to which the mft record |
564 | * @ctx->mrec belongs. This is so we can get at the ntfs volume and hence at | 564 | * @ctx->mrec belongs. This is so we can get at the ntfs volume and hence at |
565 | * the upcase table. If @ic is CASE_SENSITIVE, the comparison is case | 565 | * the upcase table. If @ic is CASE_SENSITIVE, the comparison is case |
566 | * sensitive. When @name is present, @name_len is the @name length in Unicode | 566 | * sensitive. When @name is present, @name_len is the @name length in Unicode |
567 | * characters. | 567 | * characters. |
568 | * | 568 | * |
569 | * If @name is not present (NULL), we assume that the unnamed attribute is | 569 | * If @name is not present (NULL), we assume that the unnamed attribute is |
570 | * being searched for. | 570 | * being searched for. |
571 | * | 571 | * |
572 | * Finally, the resident attribute value @val is looked for, if present. If | 572 | * Finally, the resident attribute value @val is looked for, if present. If |
573 | * @val is not present (NULL), @val_len is ignored. | 573 | * @val is not present (NULL), @val_len is ignored. |
574 | * | 574 | * |
575 | * ntfs_attr_find() only searches the specified mft record and it ignores the | 575 | * ntfs_attr_find() only searches the specified mft record and it ignores the |
576 | * presence of an attribute list attribute (unless it is the one being searched | 576 | * presence of an attribute list attribute (unless it is the one being searched |
577 | * for, obviously). If you need to take attribute lists into consideration, | 577 | * for, obviously). If you need to take attribute lists into consideration, |
578 | * use ntfs_attr_lookup() instead (see below). This also means that you cannot | 578 | * use ntfs_attr_lookup() instead (see below). This also means that you cannot |
579 | * use ntfs_attr_find() to search for extent records of non-resident | 579 | * use ntfs_attr_find() to search for extent records of non-resident |
580 | * attributes, as extents with lowest_vcn != 0 are usually described by the | 580 | * attributes, as extents with lowest_vcn != 0 are usually described by the |
581 | * attribute list attribute only. - Note that it is possible that the first | 581 | * attribute list attribute only. - Note that it is possible that the first |
582 | * extent is only in the attribute list while the last extent is in the base | 582 | * extent is only in the attribute list while the last extent is in the base |
583 | * mft record, so do not rely on being able to find the first extent in the | 583 | * mft record, so do not rely on being able to find the first extent in the |
584 | * base mft record. | 584 | * base mft record. |
585 | * | 585 | * |
586 | * Warning: Never use @val when looking for attribute types which can be | 586 | * Warning: Never use @val when looking for attribute types which can be |
587 | * non-resident as this most likely will result in a crash! | 587 | * non-resident as this most likely will result in a crash! |
588 | */ | 588 | */ |
589 | static int ntfs_attr_find(const ATTR_TYPE type, const ntfschar *name, | 589 | static int ntfs_attr_find(const ATTR_TYPE type, const ntfschar *name, |
590 | const u32 name_len, const IGNORE_CASE_BOOL ic, | 590 | const u32 name_len, const IGNORE_CASE_BOOL ic, |
591 | const u8 *val, const u32 val_len, ntfs_attr_search_ctx *ctx) | 591 | const u8 *val, const u32 val_len, ntfs_attr_search_ctx *ctx) |
592 | { | 592 | { |
593 | ATTR_RECORD *a; | 593 | ATTR_RECORD *a; |
594 | ntfs_volume *vol = ctx->ntfs_ino->vol; | 594 | ntfs_volume *vol = ctx->ntfs_ino->vol; |
595 | ntfschar *upcase = vol->upcase; | 595 | ntfschar *upcase = vol->upcase; |
596 | u32 upcase_len = vol->upcase_len; | 596 | u32 upcase_len = vol->upcase_len; |
597 | 597 | ||
598 | /* | 598 | /* |
599 | * Iterate over attributes in mft record starting at @ctx->attr, or the | 599 | * Iterate over attributes in mft record starting at @ctx->attr, or the |
600 | * attribute following that, if @ctx->is_first is 'true'. | 600 | * attribute following that, if @ctx->is_first is 'true'. |
601 | */ | 601 | */ |
602 | if (ctx->is_first) { | 602 | if (ctx->is_first) { |
603 | a = ctx->attr; | 603 | a = ctx->attr; |
604 | ctx->is_first = false; | 604 | ctx->is_first = false; |
605 | } else | 605 | } else |
606 | a = (ATTR_RECORD*)((u8*)ctx->attr + | 606 | a = (ATTR_RECORD*)((u8*)ctx->attr + |
607 | le32_to_cpu(ctx->attr->length)); | 607 | le32_to_cpu(ctx->attr->length)); |
608 | for (;; a = (ATTR_RECORD*)((u8*)a + le32_to_cpu(a->length))) { | 608 | for (;; a = (ATTR_RECORD*)((u8*)a + le32_to_cpu(a->length))) { |
609 | if ((u8*)a < (u8*)ctx->mrec || (u8*)a > (u8*)ctx->mrec + | 609 | if ((u8*)a < (u8*)ctx->mrec || (u8*)a > (u8*)ctx->mrec + |
610 | le32_to_cpu(ctx->mrec->bytes_allocated)) | 610 | le32_to_cpu(ctx->mrec->bytes_allocated)) |
611 | break; | 611 | break; |
612 | ctx->attr = a; | 612 | ctx->attr = a; |
613 | if (unlikely(le32_to_cpu(a->type) > le32_to_cpu(type) || | 613 | if (unlikely(le32_to_cpu(a->type) > le32_to_cpu(type) || |
614 | a->type == AT_END)) | 614 | a->type == AT_END)) |
615 | return -ENOENT; | 615 | return -ENOENT; |
616 | if (unlikely(!a->length)) | 616 | if (unlikely(!a->length)) |
617 | break; | 617 | break; |
618 | if (a->type != type) | 618 | if (a->type != type) |
619 | continue; | 619 | continue; |
620 | /* | 620 | /* |
621 | * If @name is present, compare the two names. If @name is | 621 | * If @name is present, compare the two names. If @name is |
622 | * missing, assume we want an unnamed attribute. | 622 | * missing, assume we want an unnamed attribute. |
623 | */ | 623 | */ |
624 | if (!name) { | 624 | if (!name) { |
625 | /* The search failed if the found attribute is named. */ | 625 | /* The search failed if the found attribute is named. */ |
626 | if (a->name_length) | 626 | if (a->name_length) |
627 | return -ENOENT; | 627 | return -ENOENT; |
628 | } else if (!ntfs_are_names_equal(name, name_len, | 628 | } else if (!ntfs_are_names_equal(name, name_len, |
629 | (ntfschar*)((u8*)a + le16_to_cpu(a->name_offset)), | 629 | (ntfschar*)((u8*)a + le16_to_cpu(a->name_offset)), |
630 | a->name_length, ic, upcase, upcase_len)) { | 630 | a->name_length, ic, upcase, upcase_len)) { |
631 | register int rc; | 631 | register int rc; |
632 | 632 | ||
633 | rc = ntfs_collate_names(name, name_len, | 633 | rc = ntfs_collate_names(name, name_len, |
634 | (ntfschar*)((u8*)a + | 634 | (ntfschar*)((u8*)a + |
635 | le16_to_cpu(a->name_offset)), | 635 | le16_to_cpu(a->name_offset)), |
636 | a->name_length, 1, IGNORE_CASE, | 636 | a->name_length, 1, IGNORE_CASE, |
637 | upcase, upcase_len); | 637 | upcase, upcase_len); |
638 | /* | 638 | /* |
639 | * If @name collates before a->name, there is no | 639 | * If @name collates before a->name, there is no |
640 | * matching attribute. | 640 | * matching attribute. |
641 | */ | 641 | */ |
642 | if (rc == -1) | 642 | if (rc == -1) |
643 | return -ENOENT; | 643 | return -ENOENT; |
644 | /* If the strings are not equal, continue search. */ | 644 | /* If the strings are not equal, continue search. */ |
645 | if (rc) | 645 | if (rc) |
646 | continue; | 646 | continue; |
647 | rc = ntfs_collate_names(name, name_len, | 647 | rc = ntfs_collate_names(name, name_len, |
648 | (ntfschar*)((u8*)a + | 648 | (ntfschar*)((u8*)a + |
649 | le16_to_cpu(a->name_offset)), | 649 | le16_to_cpu(a->name_offset)), |
650 | a->name_length, 1, CASE_SENSITIVE, | 650 | a->name_length, 1, CASE_SENSITIVE, |
651 | upcase, upcase_len); | 651 | upcase, upcase_len); |
652 | if (rc == -1) | 652 | if (rc == -1) |
653 | return -ENOENT; | 653 | return -ENOENT; |
654 | if (rc) | 654 | if (rc) |
655 | continue; | 655 | continue; |
656 | } | 656 | } |
657 | /* | 657 | /* |
658 | * The names match or @name not present and attribute is | 658 | * The names match or @name not present and attribute is |
659 | * unnamed. If no @val specified, we have found the attribute | 659 | * unnamed. If no @val specified, we have found the attribute |
660 | * and are done. | 660 | * and are done. |
661 | */ | 661 | */ |
662 | if (!val) | 662 | if (!val) |
663 | return 0; | 663 | return 0; |
664 | /* @val is present; compare values. */ | 664 | /* @val is present; compare values. */ |
665 | else { | 665 | else { |
666 | register int rc; | 666 | register int rc; |
667 | 667 | ||
668 | rc = memcmp(val, (u8*)a + le16_to_cpu( | 668 | rc = memcmp(val, (u8*)a + le16_to_cpu( |
669 | a->data.resident.value_offset), | 669 | a->data.resident.value_offset), |
670 | min_t(u32, val_len, le32_to_cpu( | 670 | min_t(u32, val_len, le32_to_cpu( |
671 | a->data.resident.value_length))); | 671 | a->data.resident.value_length))); |
672 | /* | 672 | /* |
673 | * If @val collates before the current attribute's | 673 | * If @val collates before the current attribute's |
674 | * value, there is no matching attribute. | 674 | * value, there is no matching attribute. |
675 | */ | 675 | */ |
676 | if (!rc) { | 676 | if (!rc) { |
677 | register u32 avl; | 677 | register u32 avl; |
678 | 678 | ||
679 | avl = le32_to_cpu( | 679 | avl = le32_to_cpu( |
680 | a->data.resident.value_length); | 680 | a->data.resident.value_length); |
681 | if (val_len == avl) | 681 | if (val_len == avl) |
682 | return 0; | 682 | return 0; |
683 | if (val_len < avl) | 683 | if (val_len < avl) |
684 | return -ENOENT; | 684 | return -ENOENT; |
685 | } else if (rc < 0) | 685 | } else if (rc < 0) |
686 | return -ENOENT; | 686 | return -ENOENT; |
687 | } | 687 | } |
688 | } | 688 | } |
689 | ntfs_error(vol->sb, "Inode is corrupt. Run chkdsk."); | 689 | ntfs_error(vol->sb, "Inode is corrupt. Run chkdsk."); |
690 | NVolSetErrors(vol); | 690 | NVolSetErrors(vol); |
691 | return -EIO; | 691 | return -EIO; |
692 | } | 692 | } |
693 | 693 | ||
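The return contract documented above (0 when a match is found, -ENOENT once a higher-collating type or a mismatching name is reached, -EIO on a corrupt record) is what lets code in this file enumerate attributes by calling ntfs_attr_find() repeatedly on the same search context. A minimal sketch of that pattern, assuming the AT_DATA type constant and treating the process_attr() callback as a purely illustrative hook:

static int sketch_for_each_data_attr(ntfs_inode *ni, MFT_RECORD *m,
                void (*process_attr)(ATTR_RECORD *a))
{
        ntfs_attr_search_ctx ctx;
        int err;

        ntfs_attr_init_search_ctx(&ctx, ni, m);
        /* Each successful call leaves ctx.attr at the matching record. */
        while (!(err = ntfs_attr_find(AT_DATA, NULL, 0, CASE_SENSITIVE,
                        NULL, 0, &ctx)))
                process_attr(ctx.attr);
        /* -ENOENT here simply means the enumeration is complete. */
        return err == -ENOENT ? 0 : err;
}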
694 | /** | 694 | /** |
695 | * load_attribute_list - load an attribute list into memory | 695 | * load_attribute_list - load an attribute list into memory |
696 | * @vol: ntfs volume from which to read | 696 | * @vol: ntfs volume from which to read |
697 | * @runlist: runlist of the attribute list | 697 | * @runlist: runlist of the attribute list |
698 | * @al_start: destination buffer | 698 | * @al_start: destination buffer |
699 | * @size: size of the destination buffer in bytes | 699 | * @size: size of the destination buffer in bytes |
700 | * @initialized_size: initialized size of the attribute list | 700 | * @initialized_size: initialized size of the attribute list |
701 | * | 701 | * |
702 | * Walk the runlist @runlist and load all clusters from it copying them into | 702 | * Walk the runlist @runlist and load all clusters from it copying them into |
703 | * the linear buffer @al. The maximum number of bytes copied to @al is @size | 703 | * the linear buffer @al. The maximum number of bytes copied to @al is @size |
704 | * bytes. Note, @size does not need to be a multiple of the cluster size. If | 704 | * bytes. Note, @size does not need to be a multiple of the cluster size. If |
705 | * @initialized_size is less than @size, the region in @al between | 705 | * @initialized_size is less than @size, the region in @al between |
706 | * @initialized_size and @size will be zeroed and not read from disk. | 706 | * @initialized_size and @size will be zeroed and not read from disk. |
707 | * | 707 | * |
708 | * Return 0 on success or -errno on error. | 708 | * Return 0 on success or -errno on error. |
709 | */ | 709 | */ |
710 | int load_attribute_list(ntfs_volume *vol, runlist *runlist, u8 *al_start, | 710 | int load_attribute_list(ntfs_volume *vol, runlist *runlist, u8 *al_start, |
711 | const s64 size, const s64 initialized_size) | 711 | const s64 size, const s64 initialized_size) |
712 | { | 712 | { |
713 | LCN lcn; | 713 | LCN lcn; |
714 | u8 *al = al_start; | 714 | u8 *al = al_start; |
715 | u8 *al_end = al + initialized_size; | 715 | u8 *al_end = al + initialized_size; |
716 | runlist_element *rl; | 716 | runlist_element *rl; |
717 | struct buffer_head *bh; | 717 | struct buffer_head *bh; |
718 | struct super_block *sb; | 718 | struct super_block *sb; |
719 | unsigned long block_size; | 719 | unsigned long block_size; |
720 | unsigned long block, max_block; | 720 | unsigned long block, max_block; |
721 | int err = 0; | 721 | int err = 0; |
722 | unsigned char block_size_bits; | 722 | unsigned char block_size_bits; |
723 | 723 | ||
724 | ntfs_debug("Entering."); | 724 | ntfs_debug("Entering."); |
725 | if (!vol || !runlist || !al || size <= 0 || initialized_size < 0 || | 725 | if (!vol || !runlist || !al || size <= 0 || initialized_size < 0 || |
726 | initialized_size > size) | 726 | initialized_size > size) |
727 | return -EINVAL; | 727 | return -EINVAL; |
728 | if (!initialized_size) { | 728 | if (!initialized_size) { |
729 | memset(al, 0, size); | 729 | memset(al, 0, size); |
730 | return 0; | 730 | return 0; |
731 | } | 731 | } |
732 | sb = vol->sb; | 732 | sb = vol->sb; |
733 | block_size = sb->s_blocksize; | 733 | block_size = sb->s_blocksize; |
734 | block_size_bits = sb->s_blocksize_bits; | 734 | block_size_bits = sb->s_blocksize_bits; |
735 | down_read(&runlist->lock); | 735 | down_read(&runlist->lock); |
736 | rl = runlist->rl; | 736 | rl = runlist->rl; |
737 | if (!rl) { | 737 | if (!rl) { |
738 | ntfs_error(sb, "Cannot read attribute list since runlist is " | 738 | ntfs_error(sb, "Cannot read attribute list since runlist is " |
739 | "missing."); | 739 | "missing."); |
740 | goto err_out; | 740 | goto err_out; |
741 | } | 741 | } |
742 | /* Read all clusters specified by the runlist one run at a time. */ | 742 | /* Read all clusters specified by the runlist one run at a time. */ |
743 | while (rl->length) { | 743 | while (rl->length) { |
744 | lcn = ntfs_rl_vcn_to_lcn(rl, rl->vcn); | 744 | lcn = ntfs_rl_vcn_to_lcn(rl, rl->vcn); |
745 | ntfs_debug("Reading vcn = 0x%llx, lcn = 0x%llx.", | 745 | ntfs_debug("Reading vcn = 0x%llx, lcn = 0x%llx.", |
746 | (unsigned long long)rl->vcn, | 746 | (unsigned long long)rl->vcn, |
747 | (unsigned long long)lcn); | 747 | (unsigned long long)lcn); |
748 | /* The attribute list cannot be sparse. */ | 748 | /* The attribute list cannot be sparse. */ |
749 | if (lcn < 0) { | 749 | if (lcn < 0) { |
750 | ntfs_error(sb, "ntfs_rl_vcn_to_lcn() failed. Cannot " | 750 | ntfs_error(sb, "ntfs_rl_vcn_to_lcn() failed. Cannot " |
751 | "read attribute list."); | 751 | "read attribute list."); |
752 | goto err_out; | 752 | goto err_out; |
753 | } | 753 | } |
754 | block = lcn << vol->cluster_size_bits >> block_size_bits; | 754 | block = lcn << vol->cluster_size_bits >> block_size_bits; |
755 | /* Read the run from device in chunks of block_size bytes. */ | 755 | /* Read the run from device in chunks of block_size bytes. */ |
756 | max_block = block + (rl->length << vol->cluster_size_bits >> | 756 | max_block = block + (rl->length << vol->cluster_size_bits >> |
757 | block_size_bits); | 757 | block_size_bits); |
758 | ntfs_debug("max_block = 0x%lx.", max_block); | 758 | ntfs_debug("max_block = 0x%lx.", max_block); |
759 | do { | 759 | do { |
760 | ntfs_debug("Reading block = 0x%lx.", block); | 760 | ntfs_debug("Reading block = 0x%lx.", block); |
761 | bh = sb_bread(sb, block); | 761 | bh = sb_bread(sb, block); |
762 | if (!bh) { | 762 | if (!bh) { |
763 | ntfs_error(sb, "sb_bread() failed. Cannot " | 763 | ntfs_error(sb, "sb_bread() failed. Cannot " |
764 | "read attribute list."); | 764 | "read attribute list."); |
765 | goto err_out; | 765 | goto err_out; |
766 | } | 766 | } |
767 | if (al + block_size >= al_end) | 767 | if (al + block_size >= al_end) |
768 | goto do_final; | 768 | goto do_final; |
769 | memcpy(al, bh->b_data, block_size); | 769 | memcpy(al, bh->b_data, block_size); |
770 | brelse(bh); | 770 | brelse(bh); |
771 | al += block_size; | 771 | al += block_size; |
772 | } while (++block < max_block); | 772 | } while (++block < max_block); |
773 | rl++; | 773 | rl++; |
774 | } | 774 | } |
775 | if (initialized_size < size) { | 775 | if (initialized_size < size) { |
776 | initialize: | 776 | initialize: |
777 | memset(al_start + initialized_size, 0, size - initialized_size); | 777 | memset(al_start + initialized_size, 0, size - initialized_size); |
778 | } | 778 | } |
779 | done: | 779 | done: |
780 | up_read(&runlist->lock); | 780 | up_read(&runlist->lock); |
781 | return err; | 781 | return err; |
782 | do_final: | 782 | do_final: |
783 | if (al < al_end) { | 783 | if (al < al_end) { |
784 | /* | 784 | /* |
785 | * Partial block. | 785 | * Partial block. |
786 | * | 786 | * |
787 | * Note: The attribute list can be smaller than its allocation | 787 | * Note: The attribute list can be smaller than its allocation |
788 | * by multiple clusters. This has been encountered by at least | 788 | * by multiple clusters. This has been encountered by at least |
789 | * two people running Windows XP, thus we cannot do any | 789 | * two people running Windows XP, thus we cannot do any |
790 | * truncation sanity checking here. (AIA) | 790 | * truncation sanity checking here. (AIA) |
791 | */ | 791 | */ |
792 | memcpy(al, bh->b_data, al_end - al); | 792 | memcpy(al, bh->b_data, al_end - al); |
793 | brelse(bh); | 793 | brelse(bh); |
794 | if (initialized_size < size) | 794 | if (initialized_size < size) |
795 | goto initialize; | 795 | goto initialize; |
796 | goto done; | 796 | goto done; |
797 | } | 797 | } |
798 | brelse(bh); | 798 | brelse(bh); |
799 | /* Real overflow! */ | 799 | /* Real overflow! */ |
800 | ntfs_error(sb, "Attribute list buffer overflow. Read attribute list " | 800 | ntfs_error(sb, "Attribute list buffer overflow. Read attribute list " |
801 | "is truncated."); | 801 | "is truncated."); |
802 | err_out: | 802 | err_out: |
803 | err = -EIO; | 803 | err = -EIO; |
804 | goto done; | 804 | goto done; |
805 | } | 805 | } |
806 | 806 | ||
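A minimal sketch of how this helper might be invoked once the attribute list attribute has been located for an inode; the attr_list_rl field name and the initialized size argument are assumptions made for the example, and the real callers live elsewhere in the driver:

static int sketch_load_attr_list(ntfs_inode *ni, const s64 init_size)
{
        /* ni->attr_list is assumed to be a buffer of ni->attr_list_size bytes. */
        return load_attribute_list(ni->vol, &ni->attr_list_rl, ni->attr_list,
                        ni->attr_list_size, init_size);
}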
807 | /** | 807 | /** |
808 | * ntfs_external_attr_find - find an attribute in the attribute list of an inode | 808 | * ntfs_external_attr_find - find an attribute in the attribute list of an inode |
809 | * @type: attribute type to find | 809 | * @type: attribute type to find |
810 | * @name: attribute name to find (optional, i.e. NULL means don't care) | 810 | * @name: attribute name to find (optional, i.e. NULL means don't care) |
811 | * @name_len: attribute name length (only needed if @name present) | 811 | * @name_len: attribute name length (only needed if @name present) |
812 | * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present) | 812 | * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present) |
813 | * @lowest_vcn: lowest vcn to find (optional, non-resident attributes only) | 813 | * @lowest_vcn: lowest vcn to find (optional, non-resident attributes only) |
814 | * @val: attribute value to find (optional, resident attributes only) | 814 | * @val: attribute value to find (optional, resident attributes only) |
815 | * @val_len: attribute value length | 815 | * @val_len: attribute value length |
816 | * @ctx: search context with mft record and attribute to search from | 816 | * @ctx: search context with mft record and attribute to search from |
817 | * | 817 | * |
818 | * You should not need to call this function directly. Use ntfs_attr_lookup() | 818 | * You should not need to call this function directly. Use ntfs_attr_lookup() |
819 | * instead. | 819 | * instead. |
820 | * | 820 | * |
821 | * Find an attribute by searching the attribute list for the corresponding | 821 | * Find an attribute by searching the attribute list for the corresponding |
822 | * attribute list entry. Having found the entry, map the mft record if the | 822 | * attribute list entry. Having found the entry, map the mft record if the |
823 | * attribute is in a different mft record/inode, ntfs_attr_find() the attribute | 823 | * attribute is in a different mft record/inode, ntfs_attr_find() the attribute |
824 | * in there and return it. | 824 | * in there and return it. |
825 | * | 825 | * |
826 | * On first search @ctx->ntfs_ino must be the base mft record and @ctx must | 826 | * On first search @ctx->ntfs_ino must be the base mft record and @ctx must |
827 | * have been obtained from a call to ntfs_attr_get_search_ctx(). On subsequent | 827 | * have been obtained from a call to ntfs_attr_get_search_ctx(). On subsequent |
828 | * calls @ctx->ntfs_ino can be any extent inode, too (@ctx->base_ntfs_ino is | 828 | * calls @ctx->ntfs_ino can be any extent inode, too (@ctx->base_ntfs_ino is |
829 | * then the base inode). | 829 | * then the base inode). |
830 | * | 830 | * |
831 | * After finishing with the attribute/mft record you need to call | 831 | * After finishing with the attribute/mft record you need to call |
832 | * ntfs_attr_put_search_ctx() to cleanup the search context (unmapping any | 832 | * ntfs_attr_put_search_ctx() to cleanup the search context (unmapping any |
833 | * mapped inodes, etc). | 833 | * mapped inodes, etc). |
834 | * | 834 | * |
835 | * If the attribute is found, ntfs_external_attr_find() returns 0 and | 835 | * If the attribute is found, ntfs_external_attr_find() returns 0 and |
836 | * @ctx->attr will point to the found attribute. @ctx->mrec will point to the | 836 | * @ctx->attr will point to the found attribute. @ctx->mrec will point to the |
837 | * mft record in which @ctx->attr is located and @ctx->al_entry will point to | 837 | * mft record in which @ctx->attr is located and @ctx->al_entry will point to |
838 | * the attribute list entry for the attribute. | 838 | * the attribute list entry for the attribute. |
839 | * | 839 | * |
840 | * If the attribute is not found, ntfs_external_attr_find() returns -ENOENT and | 840 | * If the attribute is not found, ntfs_external_attr_find() returns -ENOENT and |
841 | * @ctx->attr will point to the attribute in the base mft record before which | 841 | * @ctx->attr will point to the attribute in the base mft record before which |
842 | * the attribute being searched for would need to be inserted if such an action | 842 | * the attribute being searched for would need to be inserted if such an action |
843 | * were to be desired. @ctx->mrec will point to the mft record in which | 843 | * were to be desired. @ctx->mrec will point to the mft record in which |
844 | * @ctx->attr is located and @ctx->al_entry will point to the attribute list | 844 | * @ctx->attr is located and @ctx->al_entry will point to the attribute list |
845 | * entry of the attribute before which the attribute being searched for would | 845 | * entry of the attribute before which the attribute being searched for would |
846 | * need to be inserted if such an action were to be desired. | 846 | * need to be inserted if such an action were to be desired. |
847 | * | 847 | * |
848 | * Thus to insert the not found attribute, one wants to add the attribute to | 848 | * Thus to insert the not found attribute, one wants to add the attribute to |
849 | * @ctx->mrec (the base mft record) and if there is not enough space, the | 849 | * @ctx->mrec (the base mft record) and if there is not enough space, the |
850 | * attribute should be placed in a newly allocated extent mft record. The | 850 | * attribute should be placed in a newly allocated extent mft record. The |
851 | * attribute list entry for the inserted attribute should be inserted in the | 851 | * attribute list entry for the inserted attribute should be inserted in the |
852 | * attribute list attribute at @ctx->al_entry. | 852 | * attribute list attribute at @ctx->al_entry. |
853 | * | 853 | * |
854 | * On actual error, ntfs_external_attr_find() returns -EIO. In this case | 854 | * On actual error, ntfs_external_attr_find() returns -EIO. In this case |
855 | * @ctx->attr is undefined and in particular do not rely on it not changing. | 855 | * @ctx->attr is undefined and in particular do not rely on it not changing. |
856 | */ | 856 | */ |
857 | static int ntfs_external_attr_find(const ATTR_TYPE type, | 857 | static int ntfs_external_attr_find(const ATTR_TYPE type, |
858 | const ntfschar *name, const u32 name_len, | 858 | const ntfschar *name, const u32 name_len, |
859 | const IGNORE_CASE_BOOL ic, const VCN lowest_vcn, | 859 | const IGNORE_CASE_BOOL ic, const VCN lowest_vcn, |
860 | const u8 *val, const u32 val_len, ntfs_attr_search_ctx *ctx) | 860 | const u8 *val, const u32 val_len, ntfs_attr_search_ctx *ctx) |
861 | { | 861 | { |
862 | ntfs_inode *base_ni, *ni; | 862 | ntfs_inode *base_ni, *ni; |
863 | ntfs_volume *vol; | 863 | ntfs_volume *vol; |
864 | ATTR_LIST_ENTRY *al_entry, *next_al_entry; | 864 | ATTR_LIST_ENTRY *al_entry, *next_al_entry; |
865 | u8 *al_start, *al_end; | 865 | u8 *al_start, *al_end; |
866 | ATTR_RECORD *a; | 866 | ATTR_RECORD *a; |
867 | ntfschar *al_name; | 867 | ntfschar *al_name; |
868 | u32 al_name_len; | 868 | u32 al_name_len; |
869 | int err = 0; | 869 | int err = 0; |
870 | static const char *es = " Unmount and run chkdsk."; | 870 | static const char *es = " Unmount and run chkdsk."; |
871 | 871 | ||
872 | ni = ctx->ntfs_ino; | 872 | ni = ctx->ntfs_ino; |
873 | base_ni = ctx->base_ntfs_ino; | 873 | base_ni = ctx->base_ntfs_ino; |
874 | ntfs_debug("Entering for inode 0x%lx, type 0x%x.", ni->mft_no, type); | 874 | ntfs_debug("Entering for inode 0x%lx, type 0x%x.", ni->mft_no, type); |
875 | if (!base_ni) { | 875 | if (!base_ni) { |
876 | /* First call happens with the base mft record. */ | 876 | /* First call happens with the base mft record. */ |
877 | base_ni = ctx->base_ntfs_ino = ctx->ntfs_ino; | 877 | base_ni = ctx->base_ntfs_ino = ctx->ntfs_ino; |
878 | ctx->base_mrec = ctx->mrec; | 878 | ctx->base_mrec = ctx->mrec; |
879 | } | 879 | } |
880 | if (ni == base_ni) | 880 | if (ni == base_ni) |
881 | ctx->base_attr = ctx->attr; | 881 | ctx->base_attr = ctx->attr; |
882 | if (type == AT_END) | 882 | if (type == AT_END) |
883 | goto not_found; | 883 | goto not_found; |
884 | vol = base_ni->vol; | 884 | vol = base_ni->vol; |
885 | al_start = base_ni->attr_list; | 885 | al_start = base_ni->attr_list; |
886 | al_end = al_start + base_ni->attr_list_size; | 886 | al_end = al_start + base_ni->attr_list_size; |
887 | if (!ctx->al_entry) | 887 | if (!ctx->al_entry) |
888 | ctx->al_entry = (ATTR_LIST_ENTRY*)al_start; | 888 | ctx->al_entry = (ATTR_LIST_ENTRY*)al_start; |
889 | /* | 889 | /* |
890 | * Iterate over entries in attribute list starting at @ctx->al_entry, | 890 | * Iterate over entries in attribute list starting at @ctx->al_entry, |
891 | * or the entry following that, if @ctx->is_first is 'true'. | 891 | * or the entry following that, if @ctx->is_first is 'true'. |
892 | */ | 892 | */ |
893 | if (ctx->is_first) { | 893 | if (ctx->is_first) { |
894 | al_entry = ctx->al_entry; | 894 | al_entry = ctx->al_entry; |
895 | ctx->is_first = false; | 895 | ctx->is_first = false; |
896 | } else | 896 | } else |
897 | al_entry = (ATTR_LIST_ENTRY*)((u8*)ctx->al_entry + | 897 | al_entry = (ATTR_LIST_ENTRY*)((u8*)ctx->al_entry + |
898 | le16_to_cpu(ctx->al_entry->length)); | 898 | le16_to_cpu(ctx->al_entry->length)); |
899 | for (;; al_entry = next_al_entry) { | 899 | for (;; al_entry = next_al_entry) { |
900 | /* Out of bounds check. */ | 900 | /* Out of bounds check. */ |
901 | if ((u8*)al_entry < base_ni->attr_list || | 901 | if ((u8*)al_entry < base_ni->attr_list || |
902 | (u8*)al_entry > al_end) | 902 | (u8*)al_entry > al_end) |
903 | break; /* Inode is corrupt. */ | 903 | break; /* Inode is corrupt. */ |
904 | ctx->al_entry = al_entry; | 904 | ctx->al_entry = al_entry; |
905 | /* Catch the end of the attribute list. */ | 905 | /* Catch the end of the attribute list. */ |
906 | if ((u8*)al_entry == al_end) | 906 | if ((u8*)al_entry == al_end) |
907 | goto not_found; | 907 | goto not_found; |
908 | if (!al_entry->length) | 908 | if (!al_entry->length) |
909 | break; | 909 | break; |
910 | if ((u8*)al_entry + 6 > al_end || (u8*)al_entry + | 910 | if ((u8*)al_entry + 6 > al_end || (u8*)al_entry + |
911 | le16_to_cpu(al_entry->length) > al_end) | 911 | le16_to_cpu(al_entry->length) > al_end) |
912 | break; | 912 | break; |
913 | next_al_entry = (ATTR_LIST_ENTRY*)((u8*)al_entry + | 913 | next_al_entry = (ATTR_LIST_ENTRY*)((u8*)al_entry + |
914 | le16_to_cpu(al_entry->length)); | 914 | le16_to_cpu(al_entry->length)); |
915 | if (le32_to_cpu(al_entry->type) > le32_to_cpu(type)) | 915 | if (le32_to_cpu(al_entry->type) > le32_to_cpu(type)) |
916 | goto not_found; | 916 | goto not_found; |
917 | if (type != al_entry->type) | 917 | if (type != al_entry->type) |
918 | continue; | 918 | continue; |
919 | /* | 919 | /* |
920 | * If @name is present, compare the two names. If @name is | 920 | * If @name is present, compare the two names. If @name is |
921 | * missing, assume we want an unnamed attribute. | 921 | * missing, assume we want an unnamed attribute. |
922 | */ | 922 | */ |
923 | al_name_len = al_entry->name_length; | 923 | al_name_len = al_entry->name_length; |
924 | al_name = (ntfschar*)((u8*)al_entry + al_entry->name_offset); | 924 | al_name = (ntfschar*)((u8*)al_entry + al_entry->name_offset); |
925 | if (!name) { | 925 | if (!name) { |
926 | if (al_name_len) | 926 | if (al_name_len) |
927 | goto not_found; | 927 | goto not_found; |
928 | } else if (!ntfs_are_names_equal(al_name, al_name_len, name, | 928 | } else if (!ntfs_are_names_equal(al_name, al_name_len, name, |
929 | name_len, ic, vol->upcase, vol->upcase_len)) { | 929 | name_len, ic, vol->upcase, vol->upcase_len)) { |
930 | register int rc; | 930 | register int rc; |
931 | 931 | ||
932 | rc = ntfs_collate_names(name, name_len, al_name, | 932 | rc = ntfs_collate_names(name, name_len, al_name, |
933 | al_name_len, 1, IGNORE_CASE, | 933 | al_name_len, 1, IGNORE_CASE, |
934 | vol->upcase, vol->upcase_len); | 934 | vol->upcase, vol->upcase_len); |
935 | /* | 935 | /* |
936 | * If @name collates before al_name, there is no | 936 | * If @name collates before al_name, there is no |
937 | * matching attribute. | 937 | * matching attribute. |
938 | */ | 938 | */ |
939 | if (rc == -1) | 939 | if (rc == -1) |
940 | goto not_found; | 940 | goto not_found; |
941 | /* If the strings are not equal, continue search. */ | 941 | /* If the strings are not equal, continue search. */ |
942 | if (rc) | 942 | if (rc) |
943 | continue; | 943 | continue; |
944 | /* | 944 | /* |
945 | * FIXME: Reverse engineering showed 0, IGNORE_CASE but | 945 | * FIXME: Reverse engineering showed 0, IGNORE_CASE but |
946 | * that is inconsistent with ntfs_attr_find(). The | 946 | * that is inconsistent with ntfs_attr_find(). The |
947 | * subsequent rc checks were also different. Perhaps I | 947 | * subsequent rc checks were also different. Perhaps I |
948 | * made a mistake in one of the two. Need to recheck | 948 | * made a mistake in one of the two. Need to recheck |
949 | * which is correct or at least see what is going on... | 949 | * which is correct or at least see what is going on... |
950 | * (AIA) | 950 | * (AIA) |
951 | */ | 951 | */ |
952 | rc = ntfs_collate_names(name, name_len, al_name, | 952 | rc = ntfs_collate_names(name, name_len, al_name, |
953 | al_name_len, 1, CASE_SENSITIVE, | 953 | al_name_len, 1, CASE_SENSITIVE, |
954 | vol->upcase, vol->upcase_len); | 954 | vol->upcase, vol->upcase_len); |
955 | if (rc == -1) | 955 | if (rc == -1) |
956 | goto not_found; | 956 | goto not_found; |
957 | if (rc) | 957 | if (rc) |
958 | continue; | 958 | continue; |
959 | } | 959 | } |
960 | /* | 960 | /* |
961 | * The names match or @name not present and attribute is | 961 | * The names match or @name not present and attribute is |
962 | * unnamed. Now check @lowest_vcn. Continue search if the | 962 | * unnamed. Now check @lowest_vcn. Continue search if the |
963 | * next attribute list entry still fits @lowest_vcn. Otherwise | 963 | * next attribute list entry still fits @lowest_vcn. Otherwise |
964 | * we have reached the right one or the search has failed. | 964 | * we have reached the right one or the search has failed. |
965 | */ | 965 | */ |
966 | if (lowest_vcn && (u8*)next_al_entry >= al_start && | 966 | if (lowest_vcn && (u8*)next_al_entry >= al_start && |
967 | (u8*)next_al_entry + 6 < al_end && | 967 | (u8*)next_al_entry + 6 < al_end && |
968 | (u8*)next_al_entry + le16_to_cpu( | 968 | (u8*)next_al_entry + le16_to_cpu( |
969 | next_al_entry->length) <= al_end && | 969 | next_al_entry->length) <= al_end && |
970 | sle64_to_cpu(next_al_entry->lowest_vcn) <= | 970 | sle64_to_cpu(next_al_entry->lowest_vcn) <= |
971 | lowest_vcn && | 971 | lowest_vcn && |
972 | next_al_entry->type == al_entry->type && | 972 | next_al_entry->type == al_entry->type && |
973 | next_al_entry->name_length == al_name_len && | 973 | next_al_entry->name_length == al_name_len && |
974 | ntfs_are_names_equal((ntfschar*)((u8*) | 974 | ntfs_are_names_equal((ntfschar*)((u8*) |
975 | next_al_entry + | 975 | next_al_entry + |
976 | next_al_entry->name_offset), | 976 | next_al_entry->name_offset), |
977 | next_al_entry->name_length, | 977 | next_al_entry->name_length, |
978 | al_name, al_name_len, CASE_SENSITIVE, | 978 | al_name, al_name_len, CASE_SENSITIVE, |
979 | vol->upcase, vol->upcase_len)) | 979 | vol->upcase, vol->upcase_len)) |
980 | continue; | 980 | continue; |
981 | if (MREF_LE(al_entry->mft_reference) == ni->mft_no) { | 981 | if (MREF_LE(al_entry->mft_reference) == ni->mft_no) { |
982 | if (MSEQNO_LE(al_entry->mft_reference) != ni->seq_no) { | 982 | if (MSEQNO_LE(al_entry->mft_reference) != ni->seq_no) { |
983 | ntfs_error(vol->sb, "Found stale mft " | 983 | ntfs_error(vol->sb, "Found stale mft " |
984 | "reference in attribute list " | 984 | "reference in attribute list " |
985 | "of base inode 0x%lx.%s", | 985 | "of base inode 0x%lx.%s", |
986 | base_ni->mft_no, es); | 986 | base_ni->mft_no, es); |
987 | err = -EIO; | 987 | err = -EIO; |
988 | break; | 988 | break; |
989 | } | 989 | } |
990 | } else { /* Mft references do not match. */ | 990 | } else { /* Mft references do not match. */ |
991 | /* If there is a mapped record unmap it first. */ | 991 | /* If there is a mapped record unmap it first. */ |
992 | if (ni != base_ni) | 992 | if (ni != base_ni) |
993 | unmap_extent_mft_record(ni); | 993 | unmap_extent_mft_record(ni); |
994 | /* Do we want the base record back? */ | 994 | /* Do we want the base record back? */ |
995 | if (MREF_LE(al_entry->mft_reference) == | 995 | if (MREF_LE(al_entry->mft_reference) == |
996 | base_ni->mft_no) { | 996 | base_ni->mft_no) { |
997 | ni = ctx->ntfs_ino = base_ni; | 997 | ni = ctx->ntfs_ino = base_ni; |
998 | ctx->mrec = ctx->base_mrec; | 998 | ctx->mrec = ctx->base_mrec; |
999 | } else { | 999 | } else { |
1000 | /* We want an extent record. */ | 1000 | /* We want an extent record. */ |
1001 | ctx->mrec = map_extent_mft_record(base_ni, | 1001 | ctx->mrec = map_extent_mft_record(base_ni, |
1002 | le64_to_cpu( | 1002 | le64_to_cpu( |
1003 | al_entry->mft_reference), &ni); | 1003 | al_entry->mft_reference), &ni); |
1004 | if (IS_ERR(ctx->mrec)) { | 1004 | if (IS_ERR(ctx->mrec)) { |
1005 | ntfs_error(vol->sb, "Failed to map " | 1005 | ntfs_error(vol->sb, "Failed to map " |
1006 | "extent mft record " | 1006 | "extent mft record " |
1007 | "0x%lx of base inode " | 1007 | "0x%lx of base inode " |
1008 | "0x%lx.%s", | 1008 | "0x%lx.%s", |
1009 | MREF_LE(al_entry-> | 1009 | MREF_LE(al_entry-> |
1010 | mft_reference), | 1010 | mft_reference), |
1011 | base_ni->mft_no, es); | 1011 | base_ni->mft_no, es); |
1012 | err = PTR_ERR(ctx->mrec); | 1012 | err = PTR_ERR(ctx->mrec); |
1013 | if (err == -ENOENT) | 1013 | if (err == -ENOENT) |
1014 | err = -EIO; | 1014 | err = -EIO; |
1015 | /* Cause @ctx to be sanitized below. */ | 1015 | /* Cause @ctx to be sanitized below. */ |
1016 | ni = NULL; | 1016 | ni = NULL; |
1017 | break; | 1017 | break; |
1018 | } | 1018 | } |
1019 | ctx->ntfs_ino = ni; | 1019 | ctx->ntfs_ino = ni; |
1020 | } | 1020 | } |
1021 | ctx->attr = (ATTR_RECORD*)((u8*)ctx->mrec + | 1021 | ctx->attr = (ATTR_RECORD*)((u8*)ctx->mrec + |
1022 | le16_to_cpu(ctx->mrec->attrs_offset)); | 1022 | le16_to_cpu(ctx->mrec->attrs_offset)); |
1023 | } | 1023 | } |
1024 | /* | 1024 | /* |
1025 | * ctx->vfs_ino, ctx->mrec, and ctx->attr now point to the | 1025 | * ctx->vfs_ino, ctx->mrec, and ctx->attr now point to the |
1026 | * mft record containing the attribute represented by the | 1026 | * mft record containing the attribute represented by the |
1027 | * current al_entry. | 1027 | * current al_entry. |
1028 | */ | 1028 | */ |
1029 | /* | 1029 | /* |
1030 | * We could call into ntfs_attr_find() to find the right | 1030 | * We could call into ntfs_attr_find() to find the right |
1031 | * attribute in this mft record but this would be less | 1031 | * attribute in this mft record but this would be less |
1032 | * efficient and not quite accurate as ntfs_attr_find() ignores | 1032 | * efficient and not quite accurate as ntfs_attr_find() ignores |
1033 | * the attribute instance numbers for example which become | 1033 | * the attribute instance numbers for example which become |
1034 | * important when one plays with attribute lists. Also, | 1034 | * important when one plays with attribute lists. Also, |
1035 | * because a proper match has been found in the attribute list | 1035 | * because a proper match has been found in the attribute list |
1036 | * entry above, the comparison can now be optimized. So it is | 1036 | * entry above, the comparison can now be optimized. So it is |
1037 | * worth re-implementing a simplified ntfs_attr_find() here. | 1037 | * worth re-implementing a simplified ntfs_attr_find() here. |
1038 | */ | 1038 | */ |
1039 | a = ctx->attr; | 1039 | a = ctx->attr; |
1040 | /* | 1040 | /* |
1041 | * Use a manual loop so we can still use break and continue | 1041 | * Use a manual loop so we can still use break and continue |
1042 | * with the same meanings as above. | 1042 | * with the same meanings as above. |
1043 | */ | 1043 | */ |
1044 | do_next_attr_loop: | 1044 | do_next_attr_loop: |
1045 | if ((u8*)a < (u8*)ctx->mrec || (u8*)a > (u8*)ctx->mrec + | 1045 | if ((u8*)a < (u8*)ctx->mrec || (u8*)a > (u8*)ctx->mrec + |
1046 | le32_to_cpu(ctx->mrec->bytes_allocated)) | 1046 | le32_to_cpu(ctx->mrec->bytes_allocated)) |
1047 | break; | 1047 | break; |
1048 | if (a->type == AT_END) | 1048 | if (a->type == AT_END) |
1049 | break; | 1049 | break; |
1050 | if (!a->length) | 1050 | if (!a->length) |
1051 | break; | 1051 | break; |
1052 | if (al_entry->instance != a->instance) | 1052 | if (al_entry->instance != a->instance) |
1053 | goto do_next_attr; | 1053 | goto do_next_attr; |
1054 | /* | 1054 | /* |
1055 | * If the type and/or the name are mismatched between the | 1055 | * If the type and/or the name are mismatched between the |
1056 | * attribute list entry and the attribute record, there is | 1056 | * attribute list entry and the attribute record, there is |
1057 | * corruption so we break and return error EIO. | 1057 | * corruption so we break and return error EIO. |
1058 | */ | 1058 | */ |
1059 | if (al_entry->type != a->type) | 1059 | if (al_entry->type != a->type) |
1060 | break; | 1060 | break; |
1061 | if (!ntfs_are_names_equal((ntfschar*)((u8*)a + | 1061 | if (!ntfs_are_names_equal((ntfschar*)((u8*)a + |
1062 | le16_to_cpu(a->name_offset)), a->name_length, | 1062 | le16_to_cpu(a->name_offset)), a->name_length, |
1063 | al_name, al_name_len, CASE_SENSITIVE, | 1063 | al_name, al_name_len, CASE_SENSITIVE, |
1064 | vol->upcase, vol->upcase_len)) | 1064 | vol->upcase, vol->upcase_len)) |
1065 | break; | 1065 | break; |
1066 | ctx->attr = a; | 1066 | ctx->attr = a; |
1067 | /* | 1067 | /* |
1068 | * If no @val specified or @val specified and it matches, we | 1068 | * If no @val specified or @val specified and it matches, we |
1069 | * have found it! | 1069 | * have found it! |
1070 | */ | 1070 | */ |
1071 | if (!val || (!a->non_resident && le32_to_cpu( | 1071 | if (!val || (!a->non_resident && le32_to_cpu( |
1072 | a->data.resident.value_length) == val_len && | 1072 | a->data.resident.value_length) == val_len && |
1073 | !memcmp((u8*)a + | 1073 | !memcmp((u8*)a + |
1074 | le16_to_cpu(a->data.resident.value_offset), | 1074 | le16_to_cpu(a->data.resident.value_offset), |
1075 | val, val_len))) { | 1075 | val, val_len))) { |
1076 | ntfs_debug("Done, found."); | 1076 | ntfs_debug("Done, found."); |
1077 | return 0; | 1077 | return 0; |
1078 | } | 1078 | } |
1079 | do_next_attr: | 1079 | do_next_attr: |
1080 | /* Proceed to the next attribute in the current mft record. */ | 1080 | /* Proceed to the next attribute in the current mft record. */ |
1081 | a = (ATTR_RECORD*)((u8*)a + le32_to_cpu(a->length)); | 1081 | a = (ATTR_RECORD*)((u8*)a + le32_to_cpu(a->length)); |
1082 | goto do_next_attr_loop; | 1082 | goto do_next_attr_loop; |
1083 | } | 1083 | } |
1084 | if (!err) { | 1084 | if (!err) { |
1085 | ntfs_error(vol->sb, "Base inode 0x%lx contains corrupt " | 1085 | ntfs_error(vol->sb, "Base inode 0x%lx contains corrupt " |
1086 | "attribute list attribute.%s", base_ni->mft_no, | 1086 | "attribute list attribute.%s", base_ni->mft_no, |
1087 | es); | 1087 | es); |
1088 | err = -EIO; | 1088 | err = -EIO; |
1089 | } | 1089 | } |
1090 | if (ni != base_ni) { | 1090 | if (ni != base_ni) { |
1091 | if (ni) | 1091 | if (ni) |
1092 | unmap_extent_mft_record(ni); | 1092 | unmap_extent_mft_record(ni); |
1093 | ctx->ntfs_ino = base_ni; | 1093 | ctx->ntfs_ino = base_ni; |
1094 | ctx->mrec = ctx->base_mrec; | 1094 | ctx->mrec = ctx->base_mrec; |
1095 | ctx->attr = ctx->base_attr; | 1095 | ctx->attr = ctx->base_attr; |
1096 | } | 1096 | } |
1097 | if (err != -ENOMEM) | 1097 | if (err != -ENOMEM) |
1098 | NVolSetErrors(vol); | 1098 | NVolSetErrors(vol); |
1099 | return err; | 1099 | return err; |
1100 | not_found: | 1100 | not_found: |
1101 | /* | 1101 | /* |
1102 | * If we were looking for AT_END, we reset the search context @ctx and | 1102 | * If we were looking for AT_END, we reset the search context @ctx and |
1103 | * use ntfs_attr_find() to seek to the end of the base mft record. | 1103 | * use ntfs_attr_find() to seek to the end of the base mft record. |
1104 | */ | 1104 | */ |
1105 | if (type == AT_END) { | 1105 | if (type == AT_END) { |
1106 | ntfs_attr_reinit_search_ctx(ctx); | 1106 | ntfs_attr_reinit_search_ctx(ctx); |
1107 | return ntfs_attr_find(AT_END, name, name_len, ic, val, val_len, | 1107 | return ntfs_attr_find(AT_END, name, name_len, ic, val, val_len, |
1108 | ctx); | 1108 | ctx); |
1109 | } | 1109 | } |
1110 | /* | 1110 | /* |
1111 | * The attribute was not found. Before we return, we want to ensure | 1111 | * The attribute was not found. Before we return, we want to ensure |
1112 | * @ctx->mrec and @ctx->attr indicate the position at which the | 1112 | * @ctx->mrec and @ctx->attr indicate the position at which the |
1113 | * attribute should be inserted in the base mft record. Since we also | 1113 | * attribute should be inserted in the base mft record. Since we also |
1114 | * want to preserve @ctx->al_entry we cannot reinitialize the search | 1114 | * want to preserve @ctx->al_entry we cannot reinitialize the search |
1115 | * context using ntfs_attr_reinit_search_ctx() as this would set | 1115 | * context using ntfs_attr_reinit_search_ctx() as this would set |
1116 | * @ctx->al_entry to NULL. Thus we do the necessary bits manually (see | 1116 | * @ctx->al_entry to NULL. Thus we do the necessary bits manually (see |
1117 | * ntfs_attr_init_search_ctx() below). Note, we _only_ preserve | 1117 | * ntfs_attr_init_search_ctx() below). Note, we _only_ preserve |
1118 | * @ctx->al_entry as the remaining fields (base_*) are identical to | 1118 | * @ctx->al_entry as the remaining fields (base_*) are identical to |
1119 | * their non base_ counterparts and we cannot set @ctx->base_attr | 1119 | * their non base_ counterparts and we cannot set @ctx->base_attr |
1120 | * correctly yet as we do not know what @ctx->attr will be set to by | 1120 | * correctly yet as we do not know what @ctx->attr will be set to by |
1121 | * the call to ntfs_attr_find() below. | 1121 | * the call to ntfs_attr_find() below. |
1122 | */ | 1122 | */ |
1123 | if (ni != base_ni) | 1123 | if (ni != base_ni) |
1124 | unmap_extent_mft_record(ni); | 1124 | unmap_extent_mft_record(ni); |
1125 | ctx->mrec = ctx->base_mrec; | 1125 | ctx->mrec = ctx->base_mrec; |
1126 | ctx->attr = (ATTR_RECORD*)((u8*)ctx->mrec + | 1126 | ctx->attr = (ATTR_RECORD*)((u8*)ctx->mrec + |
1127 | le16_to_cpu(ctx->mrec->attrs_offset)); | 1127 | le16_to_cpu(ctx->mrec->attrs_offset)); |
1128 | ctx->is_first = true; | 1128 | ctx->is_first = true; |
1129 | ctx->ntfs_ino = base_ni; | 1129 | ctx->ntfs_ino = base_ni; |
1130 | ctx->base_ntfs_ino = NULL; | 1130 | ctx->base_ntfs_ino = NULL; |
1131 | ctx->base_mrec = NULL; | 1131 | ctx->base_mrec = NULL; |
1132 | ctx->base_attr = NULL; | 1132 | ctx->base_attr = NULL; |
1133 | /* | 1133 | /* |
1134 | * In case there are multiple matches in the base mft record, need to | 1134 | * In case there are multiple matches in the base mft record, need to |
1135 | * keep enumerating until we get an attribute not found response (or | 1135 | * keep enumerating until we get an attribute not found response (or |
1136 | * another error), otherwise we would keep returning the same attribute | 1136 | * another error), otherwise we would keep returning the same attribute |
1137 | * over and over again and all programs using us for enumeration would | 1137 | * over and over again and all programs using us for enumeration would |
1138 | * lock up in a tight loop. | 1138 | * lock up in a tight loop. |
1139 | */ | 1139 | */ |
1140 | do { | 1140 | do { |
1141 | err = ntfs_attr_find(type, name, name_len, ic, val, val_len, | 1141 | err = ntfs_attr_find(type, name, name_len, ic, val, val_len, |
1142 | ctx); | 1142 | ctx); |
1143 | } while (!err); | 1143 | } while (!err); |
1144 | ntfs_debug("Done, not found."); | 1144 | ntfs_debug("Done, not found."); |
1145 | return err; | 1145 | return err; |
1146 | } | 1146 | } |
1147 | 1147 | ||
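A short sketch of the not-found contract spelled out above, going through ntfs_attr_lookup() as recommended rather than calling ntfs_external_attr_find() directly; the AT_DATA type and the helper itself are illustrative assumptions:

static ATTR_RECORD *sketch_find_insert_position(ntfs_attr_search_ctx *ctx)
{
        int err;

        err = ntfs_attr_lookup(AT_DATA, NULL, 0, CASE_SENSITIVE, 0, NULL, 0,
                        ctx);
        if (!err)
                return NULL;            /* Already present, nothing to insert. */
        if (err != -ENOENT)
                return ERR_PTR(err);    /* Real error, ctx->attr is undefined. */
        /*
         * -ENOENT: ctx->mrec is the base mft record and ctx->attr points at
         * the attribute before which the new one would have to be inserted;
         * ctx->al_entry is the corresponding attribute list entry position.
         */
        return ctx->attr;
}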
1148 | /** | 1148 | /** |
1149 | * ntfs_attr_lookup - find an attribute in an ntfs inode | 1149 | * ntfs_attr_lookup - find an attribute in an ntfs inode |
1150 | * @type: attribute type to find | 1150 | * @type: attribute type to find |
1151 | * @name: attribute name to find (optional, i.e. NULL means don't care) | 1151 | * @name: attribute name to find (optional, i.e. NULL means don't care) |
1152 | * @name_len: attribute name length (only needed if @name present) | 1152 | * @name_len: attribute name length (only needed if @name present) |
1153 | * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present) | 1153 | * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present) |
1154 | * @lowest_vcn: lowest vcn to find (optional, non-resident attributes only) | 1154 | * @lowest_vcn: lowest vcn to find (optional, non-resident attributes only) |
1155 | * @val: attribute value to find (optional, resident attributes only) | 1155 | * @val: attribute value to find (optional, resident attributes only) |
1156 | * @val_len: attribute value length | 1156 | * @val_len: attribute value length |
1157 | * @ctx: search context with mft record and attribute to search from | 1157 | * @ctx: search context with mft record and attribute to search from |
1158 | * | 1158 | * |
1159 | * Find an attribute in an ntfs inode. On first search @ctx->ntfs_ino must | 1159 | * Find an attribute in an ntfs inode. On first search @ctx->ntfs_ino must |
1160 | * be the base mft record and @ctx must have been obtained from a call to | 1160 | * be the base mft record and @ctx must have been obtained from a call to |
1161 | * ntfs_attr_get_search_ctx(). | 1161 | * ntfs_attr_get_search_ctx(). |
1162 | * | 1162 | * |
1163 | * This function transparently handles attribute lists and @ctx is used to | 1163 | * This function transparently handles attribute lists and @ctx is used to |
1164 | * continue searches where they were left off at. | 1164 | * continue searches where they were left off at. |
1165 | * | 1165 | * |
1166 | * After finishing with the attribute/mft record you need to call | 1166 | * After finishing with the attribute/mft record you need to call |
1167 | * ntfs_attr_put_search_ctx() to cleanup the search context (unmapping any | 1167 | * ntfs_attr_put_search_ctx() to cleanup the search context (unmapping any |
1168 | * mapped inodes, etc). | 1168 | * mapped inodes, etc). |
1169 | * | 1169 | * |
1170 | * Return 0 if the search was successful and -errno if not. | 1170 | * Return 0 if the search was successful and -errno if not. |
1171 | * | 1171 | * |
1172 | * When 0, @ctx->attr is the found attribute and it is in mft record | 1172 | * When 0, @ctx->attr is the found attribute and it is in mft record |
1173 | * @ctx->mrec. If an attribute list attribute is present, @ctx->al_entry is | 1173 | * @ctx->mrec. If an attribute list attribute is present, @ctx->al_entry is |
1174 | * the attribute list entry of the found attribute. | 1174 | * the attribute list entry of the found attribute. |
1175 | * | 1175 | * |
1176 | * When -ENOENT, @ctx->attr is the attribute which collates just after the | 1176 | * When -ENOENT, @ctx->attr is the attribute which collates just after the |
1177 | * attribute being searched for, i.e. if one wants to add the attribute to the | 1177 | * attribute being searched for, i.e. if one wants to add the attribute to the |
1178 | * mft record this is the correct place to insert it into. If an attribute | 1178 | * mft record this is the correct place to insert it into. If an attribute |
1179 | * list attribute is present, @ctx->al_entry is the attribute list entry which | 1179 | * list attribute is present, @ctx->al_entry is the attribute list entry which |
1180 | * collates just after the attribute list entry of the attribute being searched | 1180 | * collates just after the attribute list entry of the attribute being searched |
1181 | * for, i.e. if one wants to add the attribute to the mft record this is the | 1181 | * for, i.e. if one wants to add the attribute to the mft record this is the |
1182 | * correct place to insert its attribute list entry into. | 1182 | * correct place to insert its attribute list entry into. |
1183 | * | 1183 | * |
1184 | * When -errno != -ENOENT, an error occurred during the lookup. @ctx->attr is | 1184 | * When -errno != -ENOENT, an error occurred during the lookup. @ctx->attr is |
1185 | * then undefined and in particular you should not rely on it not changing. | 1185 | * then undefined and in particular you should not rely on it not changing. |
1186 | */ | 1186 | */ |
1187 | int ntfs_attr_lookup(const ATTR_TYPE type, const ntfschar *name, | 1187 | int ntfs_attr_lookup(const ATTR_TYPE type, const ntfschar *name, |
1188 | const u32 name_len, const IGNORE_CASE_BOOL ic, | 1188 | const u32 name_len, const IGNORE_CASE_BOOL ic, |
1189 | const VCN lowest_vcn, const u8 *val, const u32 val_len, | 1189 | const VCN lowest_vcn, const u8 *val, const u32 val_len, |
1190 | ntfs_attr_search_ctx *ctx) | 1190 | ntfs_attr_search_ctx *ctx) |
1191 | { | 1191 | { |
1192 | ntfs_inode *base_ni; | 1192 | ntfs_inode *base_ni; |
1193 | 1193 | ||
1194 | ntfs_debug("Entering."); | 1194 | ntfs_debug("Entering."); |
1195 | BUG_ON(IS_ERR(ctx->mrec)); | 1195 | BUG_ON(IS_ERR(ctx->mrec)); |
1196 | if (ctx->base_ntfs_ino) | 1196 | if (ctx->base_ntfs_ino) |
1197 | base_ni = ctx->base_ntfs_ino; | 1197 | base_ni = ctx->base_ntfs_ino; |
1198 | else | 1198 | else |
1199 | base_ni = ctx->ntfs_ino; | 1199 | base_ni = ctx->ntfs_ino; |
1200 | /* Sanity check, just for debugging really. */ | 1200 | /* Sanity check, just for debugging really. */ |
1201 | BUG_ON(!base_ni); | 1201 | BUG_ON(!base_ni); |
1202 | if (!NInoAttrList(base_ni) || type == AT_ATTRIBUTE_LIST) | 1202 | if (!NInoAttrList(base_ni) || type == AT_ATTRIBUTE_LIST) |
1203 | return ntfs_attr_find(type, name, name_len, ic, val, val_len, | 1203 | return ntfs_attr_find(type, name, name_len, ic, val, val_len, |
1204 | ctx); | 1204 | ctx); |
1205 | return ntfs_external_attr_find(type, name, name_len, ic, lowest_vcn, | 1205 | return ntfs_external_attr_find(type, name, name_len, ic, lowest_vcn, |
1206 | val, val_len, ctx); | 1206 | val, val_len, ctx); |
1207 | } | 1207 | } |
1208 | 1208 | ||
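A minimal sketch of the lifecycle described above, assuming @ni is the base ntfs inode and @m its mapped mft record; the helper name and the choice of the unnamed AT_DATA attribute are assumptions for illustration:

static int sketch_lookup_resident_data(ntfs_inode *ni, MFT_RECORD *m)
{
        ntfs_attr_search_ctx *ctx;
        int err;

        ctx = ntfs_attr_get_search_ctx(ni, m);
        if (unlikely(!ctx))
                return -ENOMEM;
        err = ntfs_attr_lookup(AT_DATA, NULL, 0, CASE_SENSITIVE, 0, NULL, 0,
                        ctx);
        if (!err)
                /* ctx->attr is only valid until the context is put. */
                ntfs_debug("Found, value length 0x%x.", le32_to_cpu(
                                ctx->attr->data.resident.value_length));
        ntfs_attr_put_search_ctx(ctx);
        return err;
}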
1209 | /** | 1209 | /** |
1210 | * ntfs_attr_init_search_ctx - initialize an attribute search context | 1210 | * ntfs_attr_init_search_ctx - initialize an attribute search context |
1211 | * @ctx: attribute search context to initialize | 1211 | * @ctx: attribute search context to initialize |
1212 | * @ni: ntfs inode with which to initialize the search context | 1212 | * @ni: ntfs inode with which to initialize the search context |
1213 | * @mrec: mft record with which to initialize the search context | 1213 | * @mrec: mft record with which to initialize the search context |
1214 | * | 1214 | * |
1215 | * Initialize the attribute search context @ctx with @ni and @mrec. | 1215 | * Initialize the attribute search context @ctx with @ni and @mrec. |
1216 | */ | 1216 | */ |
1217 | static inline void ntfs_attr_init_search_ctx(ntfs_attr_search_ctx *ctx, | 1217 | static inline void ntfs_attr_init_search_ctx(ntfs_attr_search_ctx *ctx, |
1218 | ntfs_inode *ni, MFT_RECORD *mrec) | 1218 | ntfs_inode *ni, MFT_RECORD *mrec) |
1219 | { | 1219 | { |
1220 | *ctx = (ntfs_attr_search_ctx) { | 1220 | *ctx = (ntfs_attr_search_ctx) { |
1221 | .mrec = mrec, | 1221 | .mrec = mrec, |
1222 | /* Sanity checks are performed elsewhere. */ | 1222 | /* Sanity checks are performed elsewhere. */ |
1223 | .attr = (ATTR_RECORD*)((u8*)mrec + | 1223 | .attr = (ATTR_RECORD*)((u8*)mrec + |
1224 | le16_to_cpu(mrec->attrs_offset)), | 1224 | le16_to_cpu(mrec->attrs_offset)), |
1225 | .is_first = true, | 1225 | .is_first = true, |
1226 | .ntfs_ino = ni, | 1226 | .ntfs_ino = ni, |
1227 | }; | 1227 | }; |
1228 | } | 1228 | } |
1229 | 1229 | ||
1230 | /** | 1230 | /** |
1231 | * ntfs_attr_reinit_search_ctx - reinitialize an attribute search context | 1231 | * ntfs_attr_reinit_search_ctx - reinitialize an attribute search context |
1232 | * @ctx: attribute search context to reinitialize | 1232 | * @ctx: attribute search context to reinitialize |
1233 | * | 1233 | * |
1234 | * Reinitialize the attribute search context @ctx, first unmapping an | 1234 | * Reinitialize the attribute search context @ctx, first unmapping an |
1235 | * associated extent mft record if one is present. | 1235 | * associated extent mft record if one is present. |
1236 | * | 1236 | * |
1237 | * This is used when a search for a new attribute is being started to reset | 1237 | * This is used when a search for a new attribute is being started to reset |
1238 | * the search context to the beginning. | 1238 | * the search context to the beginning. |
1239 | */ | 1239 | */ |
1240 | void ntfs_attr_reinit_search_ctx(ntfs_attr_search_ctx *ctx) | 1240 | void ntfs_attr_reinit_search_ctx(ntfs_attr_search_ctx *ctx) |
1241 | { | 1241 | { |
1242 | if (likely(!ctx->base_ntfs_ino)) { | 1242 | if (likely(!ctx->base_ntfs_ino)) { |
1243 | /* No attribute list. */ | 1243 | /* No attribute list. */ |
1244 | ctx->is_first = true; | 1244 | ctx->is_first = true; |
1245 | /* Sanity checks are performed elsewhere. */ | 1245 | /* Sanity checks are performed elsewhere. */ |
1246 | ctx->attr = (ATTR_RECORD*)((u8*)ctx->mrec + | 1246 | ctx->attr = (ATTR_RECORD*)((u8*)ctx->mrec + |
1247 | le16_to_cpu(ctx->mrec->attrs_offset)); | 1247 | le16_to_cpu(ctx->mrec->attrs_offset)); |
1248 | /* | 1248 | /* |
1249 | * This needs resetting due to ntfs_external_attr_find() which | 1249 | * This needs resetting due to ntfs_external_attr_find() which |
1250 | * can leave it set despite having zeroed ctx->base_ntfs_ino. | 1250 | * can leave it set despite having zeroed ctx->base_ntfs_ino. |
1251 | */ | 1251 | */ |
1252 | ctx->al_entry = NULL; | 1252 | ctx->al_entry = NULL; |
1253 | return; | 1253 | return; |
1254 | } /* Attribute list. */ | 1254 | } /* Attribute list. */ |
1255 | if (ctx->ntfs_ino != ctx->base_ntfs_ino) | 1255 | if (ctx->ntfs_ino != ctx->base_ntfs_ino) |
1256 | unmap_extent_mft_record(ctx->ntfs_ino); | 1256 | unmap_extent_mft_record(ctx->ntfs_ino); |
1257 | ntfs_attr_init_search_ctx(ctx, ctx->base_ntfs_ino, ctx->base_mrec); | 1257 | ntfs_attr_init_search_ctx(ctx, ctx->base_ntfs_ino, ctx->base_mrec); |
1258 | return; | 1258 | return; |
1259 | } | 1259 | } |
1260 | 1260 | ||
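A small sketch of the reuse this enables: one context serving two unrelated lookups on the same base inode, rewound in between; the two attribute types picked here are assumptions:

static int sketch_two_lookups(ntfs_attr_search_ctx *ctx)
{
        int err;

        err = ntfs_attr_lookup(AT_ATTRIBUTE_LIST, NULL, 0, CASE_SENSITIVE, 0,
                        NULL, 0, ctx);
        if (err && err != -ENOENT)
                return err;
        /* Rewind to the start of the base mft record for the next search. */
        ntfs_attr_reinit_search_ctx(ctx);
        return ntfs_attr_lookup(AT_DATA, NULL, 0, CASE_SENSITIVE, 0, NULL, 0,
                        ctx);
}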
1261 | /** | 1261 | /** |
1262 | * ntfs_attr_get_search_ctx - allocate/initialize a new attribute search context | 1262 | * ntfs_attr_get_search_ctx - allocate/initialize a new attribute search context |
1263 | * @ni: ntfs inode with which to initialize the search context | 1263 | * @ni: ntfs inode with which to initialize the search context |
1264 | * @mrec: mft record with which to initialize the search context | 1264 | * @mrec: mft record with which to initialize the search context |
1265 | * | 1265 | * |
1266 | * Allocate a new attribute search context, initialize it with @ni and @mrec, | 1266 | * Allocate a new attribute search context, initialize it with @ni and @mrec, |
1267 | * and return it. Return NULL if allocation failed. | 1267 | * and return it. Return NULL if allocation failed. |
1268 | */ | 1268 | */ |
1269 | ntfs_attr_search_ctx *ntfs_attr_get_search_ctx(ntfs_inode *ni, MFT_RECORD *mrec) | 1269 | ntfs_attr_search_ctx *ntfs_attr_get_search_ctx(ntfs_inode *ni, MFT_RECORD *mrec) |
1270 | { | 1270 | { |
1271 | ntfs_attr_search_ctx *ctx; | 1271 | ntfs_attr_search_ctx *ctx; |
1272 | 1272 | ||
1273 | ctx = kmem_cache_alloc(ntfs_attr_ctx_cache, GFP_NOFS); | 1273 | ctx = kmem_cache_alloc(ntfs_attr_ctx_cache, GFP_NOFS); |
1274 | if (ctx) | 1274 | if (ctx) |
1275 | ntfs_attr_init_search_ctx(ctx, ni, mrec); | 1275 | ntfs_attr_init_search_ctx(ctx, ni, mrec); |
1276 | return ctx; | 1276 | return ctx; |
1277 | } | 1277 | } |
1278 | 1278 | ||
1279 | /** | 1279 | /** |
1280 | * ntfs_attr_put_search_ctx - release an attribute search context | 1280 | * ntfs_attr_put_search_ctx - release an attribute search context |
1281 | * @ctx: attribute search context to free | 1281 | * @ctx: attribute search context to free |
1282 | * | 1282 | * |
1283 | * Release the attribute search context @ctx, unmapping an associated extent | 1283 | * Release the attribute search context @ctx, unmapping an associated extent |
1284 | * mft record if present. | 1284 | * mft record if present. |
1285 | */ | 1285 | */ |
1286 | void ntfs_attr_put_search_ctx(ntfs_attr_search_ctx *ctx) | 1286 | void ntfs_attr_put_search_ctx(ntfs_attr_search_ctx *ctx) |
1287 | { | 1287 | { |
1288 | if (ctx->base_ntfs_ino && ctx->ntfs_ino != ctx->base_ntfs_ino) | 1288 | if (ctx->base_ntfs_ino && ctx->ntfs_ino != ctx->base_ntfs_ino) |
1289 | unmap_extent_mft_record(ctx->ntfs_ino); | 1289 | unmap_extent_mft_record(ctx->ntfs_ino); |
1290 | kmem_cache_free(ntfs_attr_ctx_cache, ctx); | 1290 | kmem_cache_free(ntfs_attr_ctx_cache, ctx); |
1291 | return; | 1291 | return; |
1292 | } | 1292 | } |
1293 | 1293 | ||
1294 | #ifdef NTFS_RW | 1294 | #ifdef NTFS_RW |
1295 | 1295 | ||
1296 | /** | 1296 | /** |
1297 | * ntfs_attr_find_in_attrdef - find an attribute in the $AttrDef system file | 1297 | * ntfs_attr_find_in_attrdef - find an attribute in the $AttrDef system file |
1298 | * @vol: ntfs volume to which the attribute belongs | 1298 | * @vol: ntfs volume to which the attribute belongs |
1299 | * @type: attribute type which to find | 1299 | * @type: attribute type which to find |
1300 | * | 1300 | * |
1301 | * Search for the attribute definition record corresponding to the attribute | 1301 | * Search for the attribute definition record corresponding to the attribute |
1302 | * @type in the $AttrDef system file. | 1302 | * @type in the $AttrDef system file. |
1303 | * | 1303 | * |
1304 | * Return the attribute type definition record if found and NULL if not found. | 1304 | * Return the attribute type definition record if found and NULL if not found. |
1305 | */ | 1305 | */ |
1306 | static ATTR_DEF *ntfs_attr_find_in_attrdef(const ntfs_volume *vol, | 1306 | static ATTR_DEF *ntfs_attr_find_in_attrdef(const ntfs_volume *vol, |
1307 | const ATTR_TYPE type) | 1307 | const ATTR_TYPE type) |
1308 | { | 1308 | { |
1309 | ATTR_DEF *ad; | 1309 | ATTR_DEF *ad; |
1310 | 1310 | ||
1311 | BUG_ON(!vol->attrdef); | 1311 | BUG_ON(!vol->attrdef); |
1312 | BUG_ON(!type); | 1312 | BUG_ON(!type); |
1313 | for (ad = vol->attrdef; (u8*)ad - (u8*)vol->attrdef < | 1313 | for (ad = vol->attrdef; (u8*)ad - (u8*)vol->attrdef < |
1314 | vol->attrdef_size && ad->type; ++ad) { | 1314 | vol->attrdef_size && ad->type; ++ad) { |
1315 | /* We have not found it yet, carry on searching. */ | 1315 | /* We have not found it yet, carry on searching. */ |
1316 | if (likely(le32_to_cpu(ad->type) < le32_to_cpu(type))) | 1316 | if (likely(le32_to_cpu(ad->type) < le32_to_cpu(type))) |
1317 | continue; | 1317 | continue; |
1318 | /* We found the attribute; return it. */ | 1318 | /* We found the attribute; return it. */ |
1319 | if (likely(ad->type == type)) | 1319 | if (likely(ad->type == type)) |
1320 | return ad; | 1320 | return ad; |
1321 | /* We have gone too far already. No point in continuing. */ | 1321 | /* We have gone too far already. No point in continuing. */ |
1322 | break; | 1322 | break; |
1323 | } | 1323 | } |
1324 | /* Attribute not found. */ | 1324 | /* Attribute not found. */ |
1325 | ntfs_debug("Attribute type 0x%x not found in $AttrDef.", | 1325 | ntfs_debug("Attribute type 0x%x not found in $AttrDef.", |
1326 | le32_to_cpu(type)); | 1326 | le32_to_cpu(type)); |
1327 | return NULL; | 1327 | return NULL; |
1328 | } | 1328 | } |
1329 | 1329 | ||
1330 | /** | 1330 | /** |
1331 | * ntfs_attr_size_bounds_check - check a size of an attribute type for validity | 1331 | * ntfs_attr_size_bounds_check - check a size of an attribute type for validity |
1332 | * @vol: ntfs volume to which the attribute belongs | 1332 | * @vol: ntfs volume to which the attribute belongs |
1333 | * @type: attribute type which to check | 1333 | * @type: attribute type which to check |
1334 | * @size: size which to check | 1334 | * @size: size which to check |
1335 | * | 1335 | * |
1336 | * Check whether the @size in bytes is valid for an attribute of @type on the | 1336 | * Check whether the @size in bytes is valid for an attribute of @type on the |
1337 | * ntfs volume @vol. This information is obtained from the $AttrDef system file. | 1337 | * ntfs volume @vol. This information is obtained from the $AttrDef system file. |
1338 | * | 1338 | * |
1339 | * Return 0 if valid, -ERANGE if not valid, or -ENOENT if the attribute is not | 1339 | * Return 0 if valid, -ERANGE if not valid, or -ENOENT if the attribute is not |
1340 | * listed in $AttrDef. | 1340 | * listed in $AttrDef. |
1341 | */ | 1341 | */ |
1342 | int ntfs_attr_size_bounds_check(const ntfs_volume *vol, const ATTR_TYPE type, | 1342 | int ntfs_attr_size_bounds_check(const ntfs_volume *vol, const ATTR_TYPE type, |
1343 | const s64 size) | 1343 | const s64 size) |
1344 | { | 1344 | { |
1345 | ATTR_DEF *ad; | 1345 | ATTR_DEF *ad; |
1346 | 1346 | ||
1347 | BUG_ON(size < 0); | 1347 | BUG_ON(size < 0); |
1348 | /* | 1348 | /* |
1349 | * $ATTRIBUTE_LIST has a maximum size of 256kiB, but this is not | 1349 | * $ATTRIBUTE_LIST has a maximum size of 256kiB, but this is not |
1350 | * listed in $AttrDef. | 1350 | * listed in $AttrDef. |
1351 | */ | 1351 | */ |
1352 | if (unlikely(type == AT_ATTRIBUTE_LIST && size > 256 * 1024)) | 1352 | if (unlikely(type == AT_ATTRIBUTE_LIST && size > 256 * 1024)) |
1353 | return -ERANGE; | 1353 | return -ERANGE; |
1354 | /* Get the $AttrDef entry for the attribute @type. */ | 1354 | /* Get the $AttrDef entry for the attribute @type. */ |
1355 | ad = ntfs_attr_find_in_attrdef(vol, type); | 1355 | ad = ntfs_attr_find_in_attrdef(vol, type); |
1356 | if (unlikely(!ad)) | 1356 | if (unlikely(!ad)) |
1357 | return -ENOENT; | 1357 | return -ENOENT; |
1358 | /* Do the bounds check. */ | 1358 | /* Do the bounds check. */ |
1359 | if (((sle64_to_cpu(ad->min_size) > 0) && | 1359 | if (((sle64_to_cpu(ad->min_size) > 0) && |
1360 | size < sle64_to_cpu(ad->min_size)) || | 1360 | size < sle64_to_cpu(ad->min_size)) || |
1361 | ((sle64_to_cpu(ad->max_size) > 0) && size > | 1361 | ((sle64_to_cpu(ad->max_size) > 0) && size > |
1362 | sle64_to_cpu(ad->max_size))) | 1362 | sle64_to_cpu(ad->max_size))) |
1363 | return -ERANGE; | 1363 | return -ERANGE; |
1364 | return 0; | 1364 | return 0; |
1365 | } | 1365 | } |
1366 | 1366 | ||
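Callers usually map the two failure modes onto user-visible errors; an illustrative sketch following the handling in ntfs_attr_extend_allocation() further down (new_alloc_size stands for the proposed size):

	err = ntfs_attr_size_bounds_check(vol, ni->type, new_alloc_size);
	if (unlikely(err)) {
		if (err == -ERANGE)	/* size outside the $AttrDef bounds */
			return -EFBIG;	/* POSIX conformant for write(2) */
		return -EIO;		/* -ENOENT: type not defined on volume */
	}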
1367 | /** | 1367 | /** |
1368 | * ntfs_attr_can_be_non_resident - check if an attribute can be non-resident | 1368 | * ntfs_attr_can_be_non_resident - check if an attribute can be non-resident |
1369 | * @vol: ntfs volume to which the attribute belongs | 1369 | * @vol: ntfs volume to which the attribute belongs |
1370 | * @type: attribute type which to check | 1370 | * @type: attribute type which to check |
1371 | * | 1371 | * |
1372 | * Check whether the attribute of @type on the ntfs volume @vol is allowed to | 1372 | * Check whether the attribute of @type on the ntfs volume @vol is allowed to |
1373 | * be non-resident. This information is obtained from the $AttrDef system file. | 1373 | * be non-resident. This information is obtained from the $AttrDef system file. |
1374 | * | 1374 | * |
1375 | * Return 0 if the attribute is allowed to be non-resident, -EPERM if not, and | 1375 | * Return 0 if the attribute is allowed to be non-resident, -EPERM if not, and |
1376 | * -ENOENT if the attribute is not listed in $AttrDef. | 1376 | * -ENOENT if the attribute is not listed in $AttrDef. |
1377 | */ | 1377 | */ |
1378 | int ntfs_attr_can_be_non_resident(const ntfs_volume *vol, const ATTR_TYPE type) | 1378 | int ntfs_attr_can_be_non_resident(const ntfs_volume *vol, const ATTR_TYPE type) |
1379 | { | 1379 | { |
1380 | ATTR_DEF *ad; | 1380 | ATTR_DEF *ad; |
1381 | 1381 | ||
1382 | /* Find the attribute definition record in $AttrDef. */ | 1382 | /* Find the attribute definition record in $AttrDef. */ |
1383 | ad = ntfs_attr_find_in_attrdef(vol, type); | 1383 | ad = ntfs_attr_find_in_attrdef(vol, type); |
1384 | if (unlikely(!ad)) | 1384 | if (unlikely(!ad)) |
1385 | return -ENOENT; | 1385 | return -ENOENT; |
1386 | /* Check the flags and return the result. */ | 1386 | /* Check the flags and return the result. */ |
1387 | if (ad->flags & ATTR_DEF_RESIDENT) | 1387 | if (ad->flags & ATTR_DEF_RESIDENT) |
1388 | return -EPERM; | 1388 | return -EPERM; |
1389 | return 0; | 1389 | return 0; |
1390 | } | 1390 | } |
1391 | 1391 | ||
1392 | /** | 1392 | /** |
1393 | * ntfs_attr_can_be_resident - check if an attribute can be resident | 1393 | * ntfs_attr_can_be_resident - check if an attribute can be resident |
1394 | * @vol: ntfs volume to which the attribute belongs | 1394 | * @vol: ntfs volume to which the attribute belongs |
1395 | * @type: attribute type which to check | 1395 | * @type: attribute type which to check |
1396 | * | 1396 | * |
1397 | * Check whether the attribute of @type on the ntfs volume @vol is allowed to | 1397 | * Check whether the attribute of @type on the ntfs volume @vol is allowed to |
1398 | * be resident. This information is derived from our ntfs knowledge and may | 1398 | * be resident. This information is derived from our ntfs knowledge and may |
1399 | * not be completely accurate, especially when user defined attributes are | 1399 | * not be completely accurate, especially when user defined attributes are |
1400 | * present. Basically we allow everything to be resident except for index | 1400 | * present. Basically we allow everything to be resident except for index |
1401 | * allocation and $EA attributes. | 1401 | * allocation and $EA attributes. |
1402 | * | 1402 | * |
1403 | * Return 0 if the attribute is allowed to be resident and -EPERM if not. | 1403 | * Return 0 if the attribute is allowed to be resident and -EPERM if not. |
1404 | * | 1404 | * |
1405 | * Warning: In the system file $MFT the attribute $Bitmap must be non-resident | 1405 | * Warning: In the system file $MFT the attribute $Bitmap must be non-resident |
1406 | * otherwise Windows will not boot (blue screen of death)! We cannot | 1406 | * otherwise Windows will not boot (blue screen of death)! We cannot |
1407 | * check for this here as we do not know which inode's $Bitmap is | 1407 | * check for this here as we do not know which inode's $Bitmap is |
1408 | * being asked about so the caller needs to special case this. | 1408 | * being asked about so the caller needs to special case this. |
1409 | */ | 1409 | */ |
1410 | int ntfs_attr_can_be_resident(const ntfs_volume *vol, const ATTR_TYPE type) | 1410 | int ntfs_attr_can_be_resident(const ntfs_volume *vol, const ATTR_TYPE type) |
1411 | { | 1411 | { |
1412 | if (type == AT_INDEX_ALLOCATION) | 1412 | if (type == AT_INDEX_ALLOCATION) |
1413 | return -EPERM; | 1413 | return -EPERM; |
1414 | return 0; | 1414 | return 0; |
1415 | } | 1415 | } |
1416 | 1416 | ||
1417 | /** | 1417 | /** |
1418 | * ntfs_attr_record_resize - resize an attribute record | 1418 | * ntfs_attr_record_resize - resize an attribute record |
1419 | * @m: mft record containing attribute record | 1419 | * @m: mft record containing attribute record |
1420 | * @a: attribute record to resize | 1420 | * @a: attribute record to resize |
1421 | * @new_size: new size in bytes to which to resize the attribute record @a | 1421 | * @new_size: new size in bytes to which to resize the attribute record @a |
1422 | * | 1422 | * |
1423 | * Resize the attribute record @a, i.e. the resident part of the attribute, in | 1423 | * Resize the attribute record @a, i.e. the resident part of the attribute, in |
1424 | * the mft record @m to @new_size bytes. | 1424 | * the mft record @m to @new_size bytes. |
1425 | * | 1425 | * |
1426 | * Return 0 on success and -errno on error. The following error codes are | 1426 | * Return 0 on success and -errno on error. The following error codes are |
1427 | * defined: | 1427 | * defined: |
1428 | * -ENOSPC - Not enough space in the mft record @m to perform the resize. | 1428 | * -ENOSPC - Not enough space in the mft record @m to perform the resize. |
1429 | * | 1429 | * |
1430 | * Note: On error, no modifications have been performed whatsoever. | 1430 | * Note: On error, no modifications have been performed whatsoever. |
1431 | * | 1431 | * |
1432 | * Warning: If you make a record smaller without having copied all the data you | 1432 | * Warning: If you make a record smaller without having copied all the data you |
1433 | * are interested in, the data may be overwritten. | 1433 | * are interested in, the data may be overwritten. |
1434 | */ | 1434 | */ |
1435 | int ntfs_attr_record_resize(MFT_RECORD *m, ATTR_RECORD *a, u32 new_size) | 1435 | int ntfs_attr_record_resize(MFT_RECORD *m, ATTR_RECORD *a, u32 new_size) |
1436 | { | 1436 | { |
1437 | ntfs_debug("Entering for new_size %u.", new_size); | 1437 | ntfs_debug("Entering for new_size %u.", new_size); |
1438 | /* Align to 8 bytes if it is not already done. */ | 1438 | /* Align to 8 bytes if it is not already done. */ |
1439 | if (new_size & 7) | 1439 | if (new_size & 7) |
1440 | new_size = (new_size + 7) & ~7; | 1440 | new_size = (new_size + 7) & ~7; |
1441 | /* If the actual attribute length has changed, move things around. */ | 1441 | /* If the actual attribute length has changed, move things around. */ |
1442 | if (new_size != le32_to_cpu(a->length)) { | 1442 | if (new_size != le32_to_cpu(a->length)) { |
1443 | u32 new_muse = le32_to_cpu(m->bytes_in_use) - | 1443 | u32 new_muse = le32_to_cpu(m->bytes_in_use) - |
1444 | le32_to_cpu(a->length) + new_size; | 1444 | le32_to_cpu(a->length) + new_size; |
1445 | /* Not enough space in this mft record. */ | 1445 | /* Not enough space in this mft record. */ |
1446 | if (new_muse > le32_to_cpu(m->bytes_allocated)) | 1446 | if (new_muse > le32_to_cpu(m->bytes_allocated)) |
1447 | return -ENOSPC; | 1447 | return -ENOSPC; |
1448 | /* Move attributes following @a to their new location. */ | 1448 | /* Move attributes following @a to their new location. */ |
1449 | memmove((u8*)a + new_size, (u8*)a + le32_to_cpu(a->length), | 1449 | memmove((u8*)a + new_size, (u8*)a + le32_to_cpu(a->length), |
1450 | le32_to_cpu(m->bytes_in_use) - ((u8*)a - | 1450 | le32_to_cpu(m->bytes_in_use) - ((u8*)a - |
1451 | (u8*)m) - le32_to_cpu(a->length)); | 1451 | (u8*)m) - le32_to_cpu(a->length)); |
1452 | /* Adjust @m to reflect the change in used space. */ | 1452 | /* Adjust @m to reflect the change in used space. */ |
1453 | m->bytes_in_use = cpu_to_le32(new_muse); | 1453 | m->bytes_in_use = cpu_to_le32(new_muse); |
1454 | /* Adjust @a to reflect the new size. */ | 1454 | /* Adjust @a to reflect the new size. */ |
1455 | if (new_size >= offsetof(ATTR_REC, length) + sizeof(a->length)) | 1455 | if (new_size >= offsetof(ATTR_REC, length) + sizeof(a->length)) |
1456 | a->length = cpu_to_le32(new_size); | 1456 | a->length = cpu_to_le32(new_size); |
1457 | } | 1457 | } |
1458 | return 0; | 1458 | return 0; |
1459 | } | 1459 | } |
1460 | 1460 | ||
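Because the requested size is rounded up to the mandatory 8-byte alignment internally (e.g. a request of 61 bytes becomes 64), callers only pass the desired record size and handle -ENOSPC, as ntfs_resident_attr_value_resize() below does:

	if (ntfs_attr_record_resize(m, a,
			le16_to_cpu(a->data.resident.value_offset) + new_size))
		return -ENOSPC;	/* record no longer fits in this mft record */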
1461 | /** | 1461 | /** |
1462 | * ntfs_resident_attr_value_resize - resize the value of a resident attribute | 1462 | * ntfs_resident_attr_value_resize - resize the value of a resident attribute |
1463 | * @m: mft record containing attribute record | 1463 | * @m: mft record containing attribute record |
1464 | * @a: attribute record whose value to resize | 1464 | * @a: attribute record whose value to resize |
1465 | * @new_size: new size in bytes to which to resize the attribute value of @a | 1465 | * @new_size: new size in bytes to which to resize the attribute value of @a |
1466 | * | 1466 | * |
1467 | * Resize the value of the attribute @a in the mft record @m to @new_size bytes. | 1467 | * Resize the value of the attribute @a in the mft record @m to @new_size bytes. |
1468 | * If the value is made bigger, the newly allocated space is cleared. | 1468 | * If the value is made bigger, the newly allocated space is cleared. |
1469 | * | 1469 | * |
1470 | * Return 0 on success and -errno on error. The following error codes are | 1470 | * Return 0 on success and -errno on error. The following error codes are |
1471 | * defined: | 1471 | * defined: |
1472 | * -ENOSPC - Not enough space in the mft record @m to perform the resize. | 1472 | * -ENOSPC - Not enough space in the mft record @m to perform the resize. |
1473 | * | 1473 | * |
1474 | * Note: On error, no modifications have been performed whatsoever. | 1474 | * Note: On error, no modifications have been performed whatsoever. |
1475 | * | 1475 | * |
1476 | * Warning: If you make a record smaller without having copied all the data you | 1476 | * Warning: If you make a record smaller without having copied all the data you |
1477 | * are interested in, the data may be overwritten. | 1477 | * are interested in, the data may be overwritten. |
1478 | */ | 1478 | */ |
1479 | int ntfs_resident_attr_value_resize(MFT_RECORD *m, ATTR_RECORD *a, | 1479 | int ntfs_resident_attr_value_resize(MFT_RECORD *m, ATTR_RECORD *a, |
1480 | const u32 new_size) | 1480 | const u32 new_size) |
1481 | { | 1481 | { |
1482 | u32 old_size; | 1482 | u32 old_size; |
1483 | 1483 | ||
1484 | /* Resize the resident part of the attribute record. */ | 1484 | /* Resize the resident part of the attribute record. */ |
1485 | if (ntfs_attr_record_resize(m, a, | 1485 | if (ntfs_attr_record_resize(m, a, |
1486 | le16_to_cpu(a->data.resident.value_offset) + new_size)) | 1486 | le16_to_cpu(a->data.resident.value_offset) + new_size)) |
1487 | return -ENOSPC; | 1487 | return -ENOSPC; |
1488 | /* | 1488 | /* |
1489 | * The resize succeeded! If we made the attribute value bigger, clear | 1489 | * The resize succeeded! If we made the attribute value bigger, clear |
1490 | * the area between the old size and @new_size. | 1490 | * the area between the old size and @new_size. |
1491 | */ | 1491 | */ |
1492 | old_size = le32_to_cpu(a->data.resident.value_length); | 1492 | old_size = le32_to_cpu(a->data.resident.value_length); |
1493 | if (new_size > old_size) | 1493 | if (new_size > old_size) |
1494 | memset((u8*)a + le16_to_cpu(a->data.resident.value_offset) + | 1494 | memset((u8*)a + le16_to_cpu(a->data.resident.value_offset) + |
1495 | old_size, 0, new_size - old_size); | 1495 | old_size, 0, new_size - old_size); |
1496 | /* Finally update the length of the attribute value. */ | 1496 | /* Finally update the length of the attribute value. */ |
1497 | a->data.resident.value_length = cpu_to_le32(new_size); | 1497 | a->data.resident.value_length = cpu_to_le32(new_size); |
1498 | return 0; | 1498 | return 0; |
1499 | } | 1499 | } |
1500 | 1500 | ||
1501 | /** | 1501 | /** |
1502 | * ntfs_attr_make_non_resident - convert a resident to a non-resident attribute | 1502 | * ntfs_attr_make_non_resident - convert a resident to a non-resident attribute |
1503 | * @ni: ntfs inode describing the attribute to convert | 1503 | * @ni: ntfs inode describing the attribute to convert |
1504 | * @data_size: size of the resident data to copy to the non-resident attribute | 1504 | * @data_size: size of the resident data to copy to the non-resident attribute |
1505 | * | 1505 | * |
1506 | * Convert the resident ntfs attribute described by the ntfs inode @ni to a | 1506 | * Convert the resident ntfs attribute described by the ntfs inode @ni to a |
1507 | * non-resident one. | 1507 | * non-resident one. |
1508 | * | 1508 | * |
1509 | * @data_size must be equal to the attribute value size. This is needed since | 1509 | * @data_size must be equal to the attribute value size. This is needed since |
1510 | * we need to know the size before we can map the mft record and our callers | 1510 | * we need to know the size before we can map the mft record and our callers |
1511 | * always know it. The reason we cannot simply read the size from the vfs | 1511 | * always know it. The reason we cannot simply read the size from the vfs |
1512 | * inode i_size is that this is not necessarily uptodate. This happens when | 1512 | * inode i_size is that this is not necessarily uptodate. This happens when |
1513 | * ntfs_attr_make_non_resident() is called in the ->truncate call path(s). | 1513 | * ntfs_attr_make_non_resident() is called in the ->truncate call path(s). |
1514 | * | 1514 | * |
1515 | * Return 0 on success and -errno on error. The following error return codes | 1515 | * Return 0 on success and -errno on error. The following error return codes |
1516 | * are defined: | 1516 | * are defined: |
1517 | * -EPERM - The attribute is not allowed to be non-resident. | 1517 | * -EPERM - The attribute is not allowed to be non-resident. |
1518 | * -ENOMEM - Not enough memory. | 1518 | * -ENOMEM - Not enough memory. |
1519 | * -ENOSPC - Not enough disk space. | 1519 | * -ENOSPC - Not enough disk space. |
1520 | * -EINVAL - Attribute not defined on the volume. | 1520 | * -EINVAL - Attribute not defined on the volume. |
1521 | * -EIO - I/O error or other error. | 1521 | * -EIO - I/O error or other error. |
1522 | * Note that -ENOSPC is also returned in the case that there is not enough | 1522 | * Note that -ENOSPC is also returned in the case that there is not enough |
1523 | * space in the mft record to do the conversion. This can happen when the mft | 1523 | * space in the mft record to do the conversion. This can happen when the mft |
1524 | * record is already very full. The caller is responsible for trying to make | 1524 | * record is already very full. The caller is responsible for trying to make |
1525 | * space in the mft record and trying again. FIXME: Do we need a separate | 1525 | * space in the mft record and trying again. FIXME: Do we need a separate |
1526 | * error return code for this kind of -ENOSPC or is it always worth trying | 1526 | * error return code for this kind of -ENOSPC or is it always worth trying |
1527 | * again in case the attribute may then fit in a resident state so no need to | 1527 | * again in case the attribute may then fit in a resident state so no need to |
1528 | * make it non-resident at all? Ho-hum... (AIA) | 1528 | * make it non-resident at all? Ho-hum... (AIA) |
1529 | * | 1529 | * |
1530 | * NOTE to self: No changes in the attribute list are required to move from | 1530 | * NOTE to self: No changes in the attribute list are required to move from |
1531 | * a resident to a non-resident attribute. | 1531 | * a resident to a non-resident attribute. |
1532 | * | 1532 | * |
1533 | * Locking: - The caller must hold i_mutex on the inode. | 1533 | * Locking: - The caller must hold i_mutex on the inode. |
1534 | */ | 1534 | */ |
1535 | int ntfs_attr_make_non_resident(ntfs_inode *ni, const u32 data_size) | 1535 | int ntfs_attr_make_non_resident(ntfs_inode *ni, const u32 data_size) |
1536 | { | 1536 | { |
1537 | s64 new_size; | 1537 | s64 new_size; |
1538 | struct inode *vi = VFS_I(ni); | 1538 | struct inode *vi = VFS_I(ni); |
1539 | ntfs_volume *vol = ni->vol; | 1539 | ntfs_volume *vol = ni->vol; |
1540 | ntfs_inode *base_ni; | 1540 | ntfs_inode *base_ni; |
1541 | MFT_RECORD *m; | 1541 | MFT_RECORD *m; |
1542 | ATTR_RECORD *a; | 1542 | ATTR_RECORD *a; |
1543 | ntfs_attr_search_ctx *ctx; | 1543 | ntfs_attr_search_ctx *ctx; |
1544 | struct page *page; | 1544 | struct page *page; |
1545 | runlist_element *rl; | 1545 | runlist_element *rl; |
1546 | u8 *kaddr; | 1546 | u8 *kaddr; |
1547 | unsigned long flags; | 1547 | unsigned long flags; |
1548 | int mp_size, mp_ofs, name_ofs, arec_size, err, err2; | 1548 | int mp_size, mp_ofs, name_ofs, arec_size, err, err2; |
1549 | u32 attr_size; | 1549 | u32 attr_size; |
1550 | u8 old_res_attr_flags; | 1550 | u8 old_res_attr_flags; |
1551 | 1551 | ||
1552 | /* Check that the attribute is allowed to be non-resident. */ | 1552 | /* Check that the attribute is allowed to be non-resident. */ |
1553 | err = ntfs_attr_can_be_non_resident(vol, ni->type); | 1553 | err = ntfs_attr_can_be_non_resident(vol, ni->type); |
1554 | if (unlikely(err)) { | 1554 | if (unlikely(err)) { |
1555 | if (err == -EPERM) | 1555 | if (err == -EPERM) |
1556 | ntfs_debug("Attribute is not allowed to be " | 1556 | ntfs_debug("Attribute is not allowed to be " |
1557 | "non-resident."); | 1557 | "non-resident."); |
1558 | else | 1558 | else |
1559 | ntfs_debug("Attribute not defined on the NTFS " | 1559 | ntfs_debug("Attribute not defined on the NTFS " |
1560 | "volume!"); | 1560 | "volume!"); |
1561 | return err; | 1561 | return err; |
1562 | } | 1562 | } |
1563 | /* | 1563 | /* |
1564 | * FIXME: Compressed and encrypted attributes are not supported when | 1564 | * FIXME: Compressed and encrypted attributes are not supported when |
1565 | * writing and we should never have gotten here for them. | 1565 | * writing and we should never have gotten here for them. |
1566 | */ | 1566 | */ |
1567 | BUG_ON(NInoCompressed(ni)); | 1567 | BUG_ON(NInoCompressed(ni)); |
1568 | BUG_ON(NInoEncrypted(ni)); | 1568 | BUG_ON(NInoEncrypted(ni)); |
1569 | /* | 1569 | /* |
1570 | * The size needs to be aligned to a cluster boundary for allocation | 1570 | * The size needs to be aligned to a cluster boundary for allocation |
1571 | * purposes. | 1571 | * purposes. |
1572 | */ | 1572 | */ |
1573 | new_size = (data_size + vol->cluster_size - 1) & | 1573 | new_size = (data_size + vol->cluster_size - 1) & |
1574 | ~(vol->cluster_size - 1); | 1574 | ~(vol->cluster_size - 1); |
1575 | if (new_size > 0) { | 1575 | if (new_size > 0) { |
1576 | /* | 1576 | /* |
1577 | * Will need the page later and since the page lock nests | 1577 | * Will need the page later and since the page lock nests |
1578 | * outside all ntfs locks, we need to get the page now. | 1578 | * outside all ntfs locks, we need to get the page now. |
1579 | */ | 1579 | */ |
1580 | page = find_or_create_page(vi->i_mapping, 0, | 1580 | page = find_or_create_page(vi->i_mapping, 0, |
1581 | mapping_gfp_mask(vi->i_mapping)); | 1581 | mapping_gfp_mask(vi->i_mapping)); |
1582 | if (unlikely(!page)) | 1582 | if (unlikely(!page)) |
1583 | return -ENOMEM; | 1583 | return -ENOMEM; |
1584 | /* Start by allocating clusters to hold the attribute value. */ | 1584 | /* Start by allocating clusters to hold the attribute value. */ |
1585 | rl = ntfs_cluster_alloc(vol, 0, new_size >> | 1585 | rl = ntfs_cluster_alloc(vol, 0, new_size >> |
1586 | vol->cluster_size_bits, -1, DATA_ZONE, true); | 1586 | vol->cluster_size_bits, -1, DATA_ZONE, true); |
1587 | if (IS_ERR(rl)) { | 1587 | if (IS_ERR(rl)) { |
1588 | err = PTR_ERR(rl); | 1588 | err = PTR_ERR(rl); |
1589 | ntfs_debug("Failed to allocate cluster%s, error code " | 1589 | ntfs_debug("Failed to allocate cluster%s, error code " |
1590 | "%i.", (new_size >> | 1590 | "%i.", (new_size >> |
1591 | vol->cluster_size_bits) > 1 ? "s" : "", | 1591 | vol->cluster_size_bits) > 1 ? "s" : "", |
1592 | err); | 1592 | err); |
1593 | goto page_err_out; | 1593 | goto page_err_out; |
1594 | } | 1594 | } |
1595 | } else { | 1595 | } else { |
1596 | rl = NULL; | 1596 | rl = NULL; |
1597 | page = NULL; | 1597 | page = NULL; |
1598 | } | 1598 | } |
1599 | /* Determine the size of the mapping pairs array. */ | 1599 | /* Determine the size of the mapping pairs array. */ |
1600 | mp_size = ntfs_get_size_for_mapping_pairs(vol, rl, 0, -1); | 1600 | mp_size = ntfs_get_size_for_mapping_pairs(vol, rl, 0, -1); |
1601 | if (unlikely(mp_size < 0)) { | 1601 | if (unlikely(mp_size < 0)) { |
1602 | err = mp_size; | 1602 | err = mp_size; |
1603 | ntfs_debug("Failed to get size for mapping pairs array, error " | 1603 | ntfs_debug("Failed to get size for mapping pairs array, error " |
1604 | "code %i.", err); | 1604 | "code %i.", err); |
1605 | goto rl_err_out; | 1605 | goto rl_err_out; |
1606 | } | 1606 | } |
1607 | down_write(&ni->runlist.lock); | 1607 | down_write(&ni->runlist.lock); |
1608 | if (!NInoAttr(ni)) | 1608 | if (!NInoAttr(ni)) |
1609 | base_ni = ni; | 1609 | base_ni = ni; |
1610 | else | 1610 | else |
1611 | base_ni = ni->ext.base_ntfs_ino; | 1611 | base_ni = ni->ext.base_ntfs_ino; |
1612 | m = map_mft_record(base_ni); | 1612 | m = map_mft_record(base_ni); |
1613 | if (IS_ERR(m)) { | 1613 | if (IS_ERR(m)) { |
1614 | err = PTR_ERR(m); | 1614 | err = PTR_ERR(m); |
1615 | m = NULL; | 1615 | m = NULL; |
1616 | ctx = NULL; | 1616 | ctx = NULL; |
1617 | goto err_out; | 1617 | goto err_out; |
1618 | } | 1618 | } |
1619 | ctx = ntfs_attr_get_search_ctx(base_ni, m); | 1619 | ctx = ntfs_attr_get_search_ctx(base_ni, m); |
1620 | if (unlikely(!ctx)) { | 1620 | if (unlikely(!ctx)) { |
1621 | err = -ENOMEM; | 1621 | err = -ENOMEM; |
1622 | goto err_out; | 1622 | goto err_out; |
1623 | } | 1623 | } |
1624 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, | 1624 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, |
1625 | CASE_SENSITIVE, 0, NULL, 0, ctx); | 1625 | CASE_SENSITIVE, 0, NULL, 0, ctx); |
1626 | if (unlikely(err)) { | 1626 | if (unlikely(err)) { |
1627 | if (err == -ENOENT) | 1627 | if (err == -ENOENT) |
1628 | err = -EIO; | 1628 | err = -EIO; |
1629 | goto err_out; | 1629 | goto err_out; |
1630 | } | 1630 | } |
1631 | m = ctx->mrec; | 1631 | m = ctx->mrec; |
1632 | a = ctx->attr; | 1632 | a = ctx->attr; |
1633 | BUG_ON(NInoNonResident(ni)); | 1633 | BUG_ON(NInoNonResident(ni)); |
1634 | BUG_ON(a->non_resident); | 1634 | BUG_ON(a->non_resident); |
1635 | /* | 1635 | /* |
1636 | * Calculate new offsets for the name and the mapping pairs array. | 1636 | * Calculate new offsets for the name and the mapping pairs array. |
1637 | */ | 1637 | */ |
1638 | if (NInoSparse(ni) || NInoCompressed(ni)) | 1638 | if (NInoSparse(ni) || NInoCompressed(ni)) |
1639 | name_ofs = (offsetof(ATTR_REC, | 1639 | name_ofs = (offsetof(ATTR_REC, |
1640 | data.non_resident.compressed_size) + | 1640 | data.non_resident.compressed_size) + |
1641 | sizeof(a->data.non_resident.compressed_size) + | 1641 | sizeof(a->data.non_resident.compressed_size) + |
1642 | 7) & ~7; | 1642 | 7) & ~7; |
1643 | else | 1643 | else |
1644 | name_ofs = (offsetof(ATTR_REC, | 1644 | name_ofs = (offsetof(ATTR_REC, |
1645 | data.non_resident.compressed_size) + 7) & ~7; | 1645 | data.non_resident.compressed_size) + 7) & ~7; |
1646 | mp_ofs = (name_ofs + a->name_length * sizeof(ntfschar) + 7) & ~7; | 1646 | mp_ofs = (name_ofs + a->name_length * sizeof(ntfschar) + 7) & ~7; |
1647 | /* | 1647 | /* |
1648 | * Determine the size of the resident part of the now non-resident | 1648 | * Determine the size of the resident part of the now non-resident |
1649 | * attribute record. | 1649 | * attribute record. |
1650 | */ | 1650 | */ |
1651 | arec_size = (mp_ofs + mp_size + 7) & ~7; | 1651 | arec_size = (mp_ofs + mp_size + 7) & ~7; |
1652 | /* | 1652 | /* |
1653 | * If the page is not uptodate bring it uptodate by copying from the | 1653 | * If the page is not uptodate bring it uptodate by copying from the |
1654 | * attribute value. | 1654 | * attribute value. |
1655 | */ | 1655 | */ |
1656 | attr_size = le32_to_cpu(a->data.resident.value_length); | 1656 | attr_size = le32_to_cpu(a->data.resident.value_length); |
1657 | BUG_ON(attr_size != data_size); | 1657 | BUG_ON(attr_size != data_size); |
1658 | if (page && !PageUptodate(page)) { | 1658 | if (page && !PageUptodate(page)) { |
1659 | kaddr = kmap_atomic(page); | 1659 | kaddr = kmap_atomic(page); |
1660 | memcpy(kaddr, (u8*)a + | 1660 | memcpy(kaddr, (u8*)a + |
1661 | le16_to_cpu(a->data.resident.value_offset), | 1661 | le16_to_cpu(a->data.resident.value_offset), |
1662 | attr_size); | 1662 | attr_size); |
1663 | memset(kaddr + attr_size, 0, PAGE_CACHE_SIZE - attr_size); | 1663 | memset(kaddr + attr_size, 0, PAGE_CACHE_SIZE - attr_size); |
1664 | kunmap_atomic(kaddr); | 1664 | kunmap_atomic(kaddr); |
1665 | flush_dcache_page(page); | 1665 | flush_dcache_page(page); |
1666 | SetPageUptodate(page); | 1666 | SetPageUptodate(page); |
1667 | } | 1667 | } |
1668 | /* Backup the attribute flag. */ | 1668 | /* Backup the attribute flag. */ |
1669 | old_res_attr_flags = a->data.resident.flags; | 1669 | old_res_attr_flags = a->data.resident.flags; |
1670 | /* Resize the resident part of the attribute record. */ | 1670 | /* Resize the resident part of the attribute record. */ |
1671 | err = ntfs_attr_record_resize(m, a, arec_size); | 1671 | err = ntfs_attr_record_resize(m, a, arec_size); |
1672 | if (unlikely(err)) | 1672 | if (unlikely(err)) |
1673 | goto err_out; | 1673 | goto err_out; |
1674 | /* | 1674 | /* |
1675 | * Convert the resident part of the attribute record to describe a | 1675 | * Convert the resident part of the attribute record to describe a |
1676 | * non-resident attribute. | 1676 | * non-resident attribute. |
1677 | */ | 1677 | */ |
1678 | a->non_resident = 1; | 1678 | a->non_resident = 1; |
1679 | /* Move the attribute name if it exists and update the offset. */ | 1679 | /* Move the attribute name if it exists and update the offset. */ |
1680 | if (a->name_length) | 1680 | if (a->name_length) |
1681 | memmove((u8*)a + name_ofs, (u8*)a + le16_to_cpu(a->name_offset), | 1681 | memmove((u8*)a + name_ofs, (u8*)a + le16_to_cpu(a->name_offset), |
1682 | a->name_length * sizeof(ntfschar)); | 1682 | a->name_length * sizeof(ntfschar)); |
1683 | a->name_offset = cpu_to_le16(name_ofs); | 1683 | a->name_offset = cpu_to_le16(name_ofs); |
1684 | /* Setup the fields specific to non-resident attributes. */ | 1684 | /* Setup the fields specific to non-resident attributes. */ |
1685 | a->data.non_resident.lowest_vcn = 0; | 1685 | a->data.non_resident.lowest_vcn = 0; |
1686 | a->data.non_resident.highest_vcn = cpu_to_sle64((new_size - 1) >> | 1686 | a->data.non_resident.highest_vcn = cpu_to_sle64((new_size - 1) >> |
1687 | vol->cluster_size_bits); | 1687 | vol->cluster_size_bits); |
1688 | a->data.non_resident.mapping_pairs_offset = cpu_to_le16(mp_ofs); | 1688 | a->data.non_resident.mapping_pairs_offset = cpu_to_le16(mp_ofs); |
1689 | memset(&a->data.non_resident.reserved, 0, | 1689 | memset(&a->data.non_resident.reserved, 0, |
1690 | sizeof(a->data.non_resident.reserved)); | 1690 | sizeof(a->data.non_resident.reserved)); |
1691 | a->data.non_resident.allocated_size = cpu_to_sle64(new_size); | 1691 | a->data.non_resident.allocated_size = cpu_to_sle64(new_size); |
1692 | a->data.non_resident.data_size = | 1692 | a->data.non_resident.data_size = |
1693 | a->data.non_resident.initialized_size = | 1693 | a->data.non_resident.initialized_size = |
1694 | cpu_to_sle64(attr_size); | 1694 | cpu_to_sle64(attr_size); |
1695 | if (NInoSparse(ni) || NInoCompressed(ni)) { | 1695 | if (NInoSparse(ni) || NInoCompressed(ni)) { |
1696 | a->data.non_resident.compression_unit = 0; | 1696 | a->data.non_resident.compression_unit = 0; |
1697 | if (NInoCompressed(ni) || vol->major_ver < 3) | 1697 | if (NInoCompressed(ni) || vol->major_ver < 3) |
1698 | a->data.non_resident.compression_unit = 4; | 1698 | a->data.non_resident.compression_unit = 4; |
1699 | a->data.non_resident.compressed_size = | 1699 | a->data.non_resident.compressed_size = |
1700 | a->data.non_resident.allocated_size; | 1700 | a->data.non_resident.allocated_size; |
1701 | } else | 1701 | } else |
1702 | a->data.non_resident.compression_unit = 0; | 1702 | a->data.non_resident.compression_unit = 0; |
1703 | /* Generate the mapping pairs array into the attribute record. */ | 1703 | /* Generate the mapping pairs array into the attribute record. */ |
1704 | err = ntfs_mapping_pairs_build(vol, (u8*)a + mp_ofs, | 1704 | err = ntfs_mapping_pairs_build(vol, (u8*)a + mp_ofs, |
1705 | arec_size - mp_ofs, rl, 0, -1, NULL); | 1705 | arec_size - mp_ofs, rl, 0, -1, NULL); |
1706 | if (unlikely(err)) { | 1706 | if (unlikely(err)) { |
1707 | ntfs_debug("Failed to build mapping pairs, error code %i.", | 1707 | ntfs_debug("Failed to build mapping pairs, error code %i.", |
1708 | err); | 1708 | err); |
1709 | goto undo_err_out; | 1709 | goto undo_err_out; |
1710 | } | 1710 | } |
1711 | /* Setup the in-memory attribute structure to be non-resident. */ | 1711 | /* Setup the in-memory attribute structure to be non-resident. */ |
1712 | ni->runlist.rl = rl; | 1712 | ni->runlist.rl = rl; |
1713 | write_lock_irqsave(&ni->size_lock, flags); | 1713 | write_lock_irqsave(&ni->size_lock, flags); |
1714 | ni->allocated_size = new_size; | 1714 | ni->allocated_size = new_size; |
1715 | if (NInoSparse(ni) || NInoCompressed(ni)) { | 1715 | if (NInoSparse(ni) || NInoCompressed(ni)) { |
1716 | ni->itype.compressed.size = ni->allocated_size; | 1716 | ni->itype.compressed.size = ni->allocated_size; |
1717 | if (a->data.non_resident.compression_unit) { | 1717 | if (a->data.non_resident.compression_unit) { |
1718 | ni->itype.compressed.block_size = 1U << (a->data. | 1718 | ni->itype.compressed.block_size = 1U << (a->data. |
1719 | non_resident.compression_unit + | 1719 | non_resident.compression_unit + |
1720 | vol->cluster_size_bits); | 1720 | vol->cluster_size_bits); |
1721 | ni->itype.compressed.block_size_bits = | 1721 | ni->itype.compressed.block_size_bits = |
1722 | ffs(ni->itype.compressed.block_size) - | 1722 | ffs(ni->itype.compressed.block_size) - |
1723 | 1; | 1723 | 1; |
1724 | ni->itype.compressed.block_clusters = 1U << | 1724 | ni->itype.compressed.block_clusters = 1U << |
1725 | a->data.non_resident.compression_unit; | 1725 | a->data.non_resident.compression_unit; |
1726 | } else { | 1726 | } else { |
1727 | ni->itype.compressed.block_size = 0; | 1727 | ni->itype.compressed.block_size = 0; |
1728 | ni->itype.compressed.block_size_bits = 0; | 1728 | ni->itype.compressed.block_size_bits = 0; |
1729 | ni->itype.compressed.block_clusters = 0; | 1729 | ni->itype.compressed.block_clusters = 0; |
1730 | } | 1730 | } |
1731 | vi->i_blocks = ni->itype.compressed.size >> 9; | 1731 | vi->i_blocks = ni->itype.compressed.size >> 9; |
1732 | } else | 1732 | } else |
1733 | vi->i_blocks = ni->allocated_size >> 9; | 1733 | vi->i_blocks = ni->allocated_size >> 9; |
1734 | write_unlock_irqrestore(&ni->size_lock, flags); | 1734 | write_unlock_irqrestore(&ni->size_lock, flags); |
1735 | /* | 1735 | /* |
1736 | * This needs to be last since the address space operations ->readpage | 1736 | * This needs to be last since the address space operations ->readpage |
1737 | * and ->writepage can run concurrently with us as they are not | 1737 | * and ->writepage can run concurrently with us as they are not |
1738 | * serialized on i_mutex. Note, we are not allowed to fail once we flip | 1738 | * serialized on i_mutex. Note, we are not allowed to fail once we flip |
1739 | * this switch, which is another reason to do this last. | 1739 | * this switch, which is another reason to do this last. |
1740 | */ | 1740 | */ |
1741 | NInoSetNonResident(ni); | 1741 | NInoSetNonResident(ni); |
1742 | /* Mark the mft record dirty, so it gets written back. */ | 1742 | /* Mark the mft record dirty, so it gets written back. */ |
1743 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 1743 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
1744 | mark_mft_record_dirty(ctx->ntfs_ino); | 1744 | mark_mft_record_dirty(ctx->ntfs_ino); |
1745 | ntfs_attr_put_search_ctx(ctx); | 1745 | ntfs_attr_put_search_ctx(ctx); |
1746 | unmap_mft_record(base_ni); | 1746 | unmap_mft_record(base_ni); |
1747 | up_write(&ni->runlist.lock); | 1747 | up_write(&ni->runlist.lock); |
1748 | if (page) { | 1748 | if (page) { |
1749 | set_page_dirty(page); | 1749 | set_page_dirty(page); |
1750 | unlock_page(page); | 1750 | unlock_page(page); |
1751 | mark_page_accessed(page); | ||
1752 | page_cache_release(page); | 1751 | page_cache_release(page); |
1753 | } | 1752 | } |
1754 | ntfs_debug("Done."); | 1753 | ntfs_debug("Done."); |
1755 | return 0; | 1754 | return 0; |
1756 | undo_err_out: | 1755 | undo_err_out: |
1757 | /* Convert the attribute back into a resident attribute. */ | 1756 | /* Convert the attribute back into a resident attribute. */ |
1758 | a->non_resident = 0; | 1757 | a->non_resident = 0; |
1759 | /* Move the attribute name if it exists and update the offset. */ | 1758 | /* Move the attribute name if it exists and update the offset. */ |
1760 | name_ofs = (offsetof(ATTR_RECORD, data.resident.reserved) + | 1759 | name_ofs = (offsetof(ATTR_RECORD, data.resident.reserved) + |
1761 | sizeof(a->data.resident.reserved) + 7) & ~7; | 1760 | sizeof(a->data.resident.reserved) + 7) & ~7; |
1762 | if (a->name_length) | 1761 | if (a->name_length) |
1763 | memmove((u8*)a + name_ofs, (u8*)a + le16_to_cpu(a->name_offset), | 1762 | memmove((u8*)a + name_ofs, (u8*)a + le16_to_cpu(a->name_offset), |
1764 | a->name_length * sizeof(ntfschar)); | 1763 | a->name_length * sizeof(ntfschar)); |
1765 | mp_ofs = (name_ofs + a->name_length * sizeof(ntfschar) + 7) & ~7; | 1764 | mp_ofs = (name_ofs + a->name_length * sizeof(ntfschar) + 7) & ~7; |
1766 | a->name_offset = cpu_to_le16(name_ofs); | 1765 | a->name_offset = cpu_to_le16(name_ofs); |
1767 | arec_size = (mp_ofs + attr_size + 7) & ~7; | 1766 | arec_size = (mp_ofs + attr_size + 7) & ~7; |
1768 | /* Resize the resident part of the attribute record. */ | 1767 | /* Resize the resident part of the attribute record. */ |
1769 | err2 = ntfs_attr_record_resize(m, a, arec_size); | 1768 | err2 = ntfs_attr_record_resize(m, a, arec_size); |
1770 | if (unlikely(err2)) { | 1769 | if (unlikely(err2)) { |
1771 | /* | 1770 | /* |
1772 | * This cannot happen (well if memory corruption is at work it | 1771 | * This cannot happen (well if memory corruption is at work it |
1773 | * could happen in theory), but deal with it as well as we can. | 1772 | * could happen in theory), but deal with it as well as we can. |
1774 | * If the old size is too small, truncate the attribute, | 1773 | * If the old size is too small, truncate the attribute, |
1775 | * otherwise simply give it a larger allocated size. | 1774 | * otherwise simply give it a larger allocated size. |
1776 | * FIXME: Should check whether chkdsk complains when the | 1775 | * FIXME: Should check whether chkdsk complains when the |
1777 | * allocated size is much bigger than the resident value size. | 1776 | * allocated size is much bigger than the resident value size. |
1778 | */ | 1777 | */ |
1779 | arec_size = le32_to_cpu(a->length); | 1778 | arec_size = le32_to_cpu(a->length); |
1780 | if ((mp_ofs + attr_size) > arec_size) { | 1779 | if ((mp_ofs + attr_size) > arec_size) { |
1781 | err2 = attr_size; | 1780 | err2 = attr_size; |
1782 | attr_size = arec_size - mp_ofs; | 1781 | attr_size = arec_size - mp_ofs; |
1783 | ntfs_error(vol->sb, "Failed to undo partial resident " | 1782 | ntfs_error(vol->sb, "Failed to undo partial resident " |
1784 | "to non-resident attribute " | 1783 | "to non-resident attribute " |
1785 | "conversion. Truncating inode 0x%lx, " | 1784 | "conversion. Truncating inode 0x%lx, " |
1786 | "attribute type 0x%x from %i bytes to " | 1785 | "attribute type 0x%x from %i bytes to " |
1787 | "%i bytes to maintain metadata " | 1786 | "%i bytes to maintain metadata " |
1788 | "consistency. THIS MEANS YOU ARE " | 1787 | "consistency. THIS MEANS YOU ARE " |
1789 | "LOSING %i BYTES DATA FROM THIS %s.", | 1788 | "LOSING %i BYTES DATA FROM THIS %s.", |
1790 | vi->i_ino, | 1789 | vi->i_ino, |
1791 | (unsigned)le32_to_cpu(ni->type), | 1790 | (unsigned)le32_to_cpu(ni->type), |
1792 | err2, attr_size, err2 - attr_size, | 1791 | err2, attr_size, err2 - attr_size, |
1793 | ((ni->type == AT_DATA) && | 1792 | ((ni->type == AT_DATA) && |
1794 | !ni->name_len) ? "FILE": "ATTRIBUTE"); | 1793 | !ni->name_len) ? "FILE": "ATTRIBUTE"); |
1795 | write_lock_irqsave(&ni->size_lock, flags); | 1794 | write_lock_irqsave(&ni->size_lock, flags); |
1796 | ni->initialized_size = attr_size; | 1795 | ni->initialized_size = attr_size; |
1797 | i_size_write(vi, attr_size); | 1796 | i_size_write(vi, attr_size); |
1798 | write_unlock_irqrestore(&ni->size_lock, flags); | 1797 | write_unlock_irqrestore(&ni->size_lock, flags); |
1799 | } | 1798 | } |
1800 | } | 1799 | } |
1801 | /* Setup the fields specific to resident attributes. */ | 1800 | /* Setup the fields specific to resident attributes. */ |
1802 | a->data.resident.value_length = cpu_to_le32(attr_size); | 1801 | a->data.resident.value_length = cpu_to_le32(attr_size); |
1803 | a->data.resident.value_offset = cpu_to_le16(mp_ofs); | 1802 | a->data.resident.value_offset = cpu_to_le16(mp_ofs); |
1804 | a->data.resident.flags = old_res_attr_flags; | 1803 | a->data.resident.flags = old_res_attr_flags; |
1805 | memset(&a->data.resident.reserved, 0, | 1804 | memset(&a->data.resident.reserved, 0, |
1806 | sizeof(a->data.resident.reserved)); | 1805 | sizeof(a->data.resident.reserved)); |
1807 | /* Copy the data from the page back to the attribute value. */ | 1806 | /* Copy the data from the page back to the attribute value. */ |
1808 | if (page) { | 1807 | if (page) { |
1809 | kaddr = kmap_atomic(page); | 1808 | kaddr = kmap_atomic(page); |
1810 | memcpy((u8*)a + mp_ofs, kaddr, attr_size); | 1809 | memcpy((u8*)a + mp_ofs, kaddr, attr_size); |
1811 | kunmap_atomic(kaddr); | 1810 | kunmap_atomic(kaddr); |
1812 | } | 1811 | } |
1813 | /* Setup the allocated size in the ntfs inode in case it changed. */ | 1812 | /* Setup the allocated size in the ntfs inode in case it changed. */ |
1814 | write_lock_irqsave(&ni->size_lock, flags); | 1813 | write_lock_irqsave(&ni->size_lock, flags); |
1815 | ni->allocated_size = arec_size - mp_ofs; | 1814 | ni->allocated_size = arec_size - mp_ofs; |
1816 | write_unlock_irqrestore(&ni->size_lock, flags); | 1815 | write_unlock_irqrestore(&ni->size_lock, flags); |
1817 | /* Mark the mft record dirty, so it gets written back. */ | 1816 | /* Mark the mft record dirty, so it gets written back. */ |
1818 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 1817 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
1819 | mark_mft_record_dirty(ctx->ntfs_ino); | 1818 | mark_mft_record_dirty(ctx->ntfs_ino); |
1820 | err_out: | 1819 | err_out: |
1821 | if (ctx) | 1820 | if (ctx) |
1822 | ntfs_attr_put_search_ctx(ctx); | 1821 | ntfs_attr_put_search_ctx(ctx); |
1823 | if (m) | 1822 | if (m) |
1824 | unmap_mft_record(base_ni); | 1823 | unmap_mft_record(base_ni); |
1825 | ni->runlist.rl = NULL; | 1824 | ni->runlist.rl = NULL; |
1826 | up_write(&ni->runlist.lock); | 1825 | up_write(&ni->runlist.lock); |
1827 | rl_err_out: | 1826 | rl_err_out: |
1828 | if (rl) { | 1827 | if (rl) { |
1829 | if (ntfs_cluster_free_from_rl(vol, rl) < 0) { | 1828 | if (ntfs_cluster_free_from_rl(vol, rl) < 0) { |
1830 | ntfs_error(vol->sb, "Failed to release allocated " | 1829 | ntfs_error(vol->sb, "Failed to release allocated " |
1831 | "cluster(s) in error code path. Run " | 1830 | "cluster(s) in error code path. Run " |
1832 | "chkdsk to recover the lost " | 1831 | "chkdsk to recover the lost " |
1833 | "cluster(s)."); | 1832 | "cluster(s)."); |
1834 | NVolSetErrors(vol); | 1833 | NVolSetErrors(vol); |
1835 | } | 1834 | } |
1836 | ntfs_free(rl); | 1835 | ntfs_free(rl); |
1837 | page_err_out: | 1836 | page_err_out: |
1838 | unlock_page(page); | 1837 | unlock_page(page); |
1839 | page_cache_release(page); | 1838 | page_cache_release(page); |
1840 | } | 1839 | } |
1841 | if (err == -EINVAL) | 1840 | if (err == -EINVAL) |
1842 | err = -EIO; | 1841 | err = -EIO; |
1843 | return err; | 1842 | return err; |
1844 | } | 1843 | } |
1845 | 1844 | ||
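An illustrative sketch of how a caller might drive the conversion when a resident value outgrows its mft record; attr_len is assumed to hold the current resident value length, and the -ENOSPC retry described above is left to the caller:

	err = ntfs_attr_make_non_resident(ni, attr_len);
	if (unlikely(err)) {
		if (err == -ENOSPC) {
			/* Mft record too full: make space in it and retry. */
		} else
			ntfs_error(vol->sb, "Cannot make attribute "
					"non-resident, error code %i.", err);
	}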
1846 | /** | 1845 | /** |
1847 | * ntfs_attr_extend_allocation - extend the allocated space of an attribute | 1846 | * ntfs_attr_extend_allocation - extend the allocated space of an attribute |
1848 | * @ni: ntfs inode of the attribute whose allocation to extend | 1847 | * @ni: ntfs inode of the attribute whose allocation to extend |
1849 | * @new_alloc_size: new size in bytes to which to extend the allocation to | 1848 | * @new_alloc_size: new size in bytes to which to extend the allocation to |
1850 | * @new_data_size: new size in bytes to which to extend the data to | 1849 | * @new_data_size: new size in bytes to which to extend the data to |
1851 | * @data_start: beginning of region which is required to be non-sparse | 1850 | * @data_start: beginning of region which is required to be non-sparse |
1852 | * | 1851 | * |
1853 | * Extend the allocated space of an attribute described by the ntfs inode @ni | 1852 | * Extend the allocated space of an attribute described by the ntfs inode @ni |
1854 | * to @new_alloc_size bytes. If @data_start is -1, the whole extension may be | 1853 | * to @new_alloc_size bytes. If @data_start is -1, the whole extension may be |
1855 | * implemented as a hole in the file (as long as both the volume and the ntfs | 1854 | * implemented as a hole in the file (as long as both the volume and the ntfs |
1856 | * inode @ni have sparse support enabled). If @data_start is >= 0, then the | 1855 | * inode @ni have sparse support enabled). If @data_start is >= 0, then the |
1857 | * region between the old allocated size and @data_start - 1 may be made sparse | 1856 | * region between the old allocated size and @data_start - 1 may be made sparse |
1858 | * but the regions between @data_start and @new_alloc_size must be backed by | 1857 | * but the regions between @data_start and @new_alloc_size must be backed by |
1859 | * actual clusters. | 1858 | * actual clusters. |
1860 | * | 1859 | * |
1861 | * If @new_data_size is -1, it is ignored. If it is >= 0, then the data size | 1860 | * If @new_data_size is -1, it is ignored. If it is >= 0, then the data size |
1862 | * of the attribute is extended to @new_data_size. Note that the i_size of the | 1861 | * of the attribute is extended to @new_data_size. Note that the i_size of the |
1863 | * vfs inode is not updated. Only the data size in the base attribute record | 1862 | * vfs inode is not updated. Only the data size in the base attribute record |
1864 | * is updated. The caller has to update i_size separately if this is required. | 1863 | * is updated. The caller has to update i_size separately if this is required. |
1865 | * WARNING: It is a BUG() for @new_data_size to be smaller than the old data | 1864 | * WARNING: It is a BUG() for @new_data_size to be smaller than the old data |
1866 | * size as well as for @new_data_size to be greater than @new_alloc_size. | 1865 | * size as well as for @new_data_size to be greater than @new_alloc_size. |
1867 | * | 1866 | * |
1868 | * For resident attributes this involves resizing the attribute record and if | 1867 | * For resident attributes this involves resizing the attribute record and if |
1869 | * necessary moving it and/or other attributes into extent mft records and/or | 1868 | * necessary moving it and/or other attributes into extent mft records and/or |
1870 | * converting the attribute to a non-resident attribute which in turn involves | 1869 | * converting the attribute to a non-resident attribute which in turn involves |
1871 | * extending the allocation of a non-resident attribute as described below. | 1870 | * extending the allocation of a non-resident attribute as described below. |
1872 | * | 1871 | * |
1873 | * For non-resident attributes this involves allocating clusters in the data | 1872 | * For non-resident attributes this involves allocating clusters in the data |
1874 | * zone on the volume (except for regions that are being made sparse) and | 1873 | * zone on the volume (except for regions that are being made sparse) and |
1875 | * extending the run list to describe the allocated clusters as well as | 1874 | * extending the run list to describe the allocated clusters as well as |
1876 | * updating the mapping pairs array of the attribute. This in turn involves | 1875 | * updating the mapping pairs array of the attribute. This in turn involves |
1877 | * resizing the attribute record and if necessary moving it and/or other | 1876 | * resizing the attribute record and if necessary moving it and/or other |
1878 | * attributes into extent mft records and/or splitting the attribute record | 1877 | * attributes into extent mft records and/or splitting the attribute record |
1879 | * into multiple extent attribute records. | 1878 | * into multiple extent attribute records. |
1880 | * | 1879 | * |
1881 | * Also, the attribute list attribute is updated if present and in some of the | 1880 | * Also, the attribute list attribute is updated if present and in some of the |
1882 | * above cases (the ones where extent mft records/attributes come into play), | 1881 | * above cases (the ones where extent mft records/attributes come into play), |
1883 | * an attribute list attribute is created if not already present. | 1882 | * an attribute list attribute is created if not already present. |
1884 | * | 1883 | * |
1885 | * Return the new allocated size on success and -errno on error. In the case | 1884 | * Return the new allocated size on success and -errno on error. In the case |
1886 | * that an error is encountered but a partial extension at least up to | 1885 | * that an error is encountered but a partial extension at least up to |
1887 | * @data_start (if present) is possible, the allocation is partially extended | 1886 | * @data_start (if present) is possible, the allocation is partially extended |
1888 | * and this is returned. This means the caller must check the returned size to | 1887 | * and this is returned. This means the caller must check the returned size to |
1889 | * determine if the extension was partial. If @data_start is -1 then partial | 1888 | * determine if the extension was partial. If @data_start is -1 then partial |
1890 | * allocations are not performed. | 1889 | * allocations are not performed. |
1891 | * | 1890 | * |
1892 | * WARNING: Do not call ntfs_attr_extend_allocation() for $MFT/$DATA. | 1891 | * WARNING: Do not call ntfs_attr_extend_allocation() for $MFT/$DATA. |
1893 | * | 1892 | * |
1894 | * Locking: This function takes the runlist lock of @ni for writing as well as | 1893 | * Locking: This function takes the runlist lock of @ni for writing as well as |
1895 | * locking the mft record of the base ntfs inode. These locks are maintained | 1894 | * locking the mft record of the base ntfs inode. These locks are maintained |
1896 | * throughout execution of the function. These locks are required so that the | 1895 | * throughout execution of the function. These locks are required so that the |
1897 | * attribute can be resized safely and so that it can for example be converted | 1896 | * attribute can be resized safely and so that it can for example be converted |
1898 | * from resident to non-resident safely. | 1897 | * from resident to non-resident safely. |
1899 | * | 1898 | * |
1900 | * TODO: At present attribute list attribute handling is not implemented. | 1899 | * TODO: At present attribute list attribute handling is not implemented. |
1901 | * | 1900 | * |
1902 | * TODO: At present it is not safe to call this function for anything other | 1901 | * TODO: At present it is not safe to call this function for anything other |
1903 | * than the $DATA attribute(s) of an uncompressed and unencrypted file. | 1902 | * than the $DATA attribute(s) of an uncompressed and unencrypted file. |
1904 | */ | 1903 | */ |
1905 | s64 ntfs_attr_extend_allocation(ntfs_inode *ni, s64 new_alloc_size, | 1904 | s64 ntfs_attr_extend_allocation(ntfs_inode *ni, s64 new_alloc_size, |
1906 | const s64 new_data_size, const s64 data_start) | 1905 | const s64 new_data_size, const s64 data_start) |
1907 | { | 1906 | { |
1908 | VCN vcn; | 1907 | VCN vcn; |
1909 | s64 ll, allocated_size, start = data_start; | 1908 | s64 ll, allocated_size, start = data_start; |
1910 | struct inode *vi = VFS_I(ni); | 1909 | struct inode *vi = VFS_I(ni); |
1911 | ntfs_volume *vol = ni->vol; | 1910 | ntfs_volume *vol = ni->vol; |
1912 | ntfs_inode *base_ni; | 1911 | ntfs_inode *base_ni; |
1913 | MFT_RECORD *m; | 1912 | MFT_RECORD *m; |
1914 | ATTR_RECORD *a; | 1913 | ATTR_RECORD *a; |
1915 | ntfs_attr_search_ctx *ctx; | 1914 | ntfs_attr_search_ctx *ctx; |
1916 | runlist_element *rl, *rl2; | 1915 | runlist_element *rl, *rl2; |
1917 | unsigned long flags; | 1916 | unsigned long flags; |
1918 | int err, mp_size; | 1917 | int err, mp_size; |
1919 | u32 attr_len = 0; /* Silence stupid gcc warning. */ | 1918 | u32 attr_len = 0; /* Silence stupid gcc warning. */ |
1920 | bool mp_rebuilt; | 1919 | bool mp_rebuilt; |
1921 | 1920 | ||
1922 | #ifdef DEBUG | 1921 | #ifdef DEBUG |
1923 | read_lock_irqsave(&ni->size_lock, flags); | 1922 | read_lock_irqsave(&ni->size_lock, flags); |
1924 | allocated_size = ni->allocated_size; | 1923 | allocated_size = ni->allocated_size; |
1925 | read_unlock_irqrestore(&ni->size_lock, flags); | 1924 | read_unlock_irqrestore(&ni->size_lock, flags); |
1926 | ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, " | 1925 | ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, " |
1927 | "old_allocated_size 0x%llx, " | 1926 | "old_allocated_size 0x%llx, " |
1928 | "new_allocated_size 0x%llx, new_data_size 0x%llx, " | 1927 | "new_allocated_size 0x%llx, new_data_size 0x%llx, " |
1929 | "data_start 0x%llx.", vi->i_ino, | 1928 | "data_start 0x%llx.", vi->i_ino, |
1930 | (unsigned)le32_to_cpu(ni->type), | 1929 | (unsigned)le32_to_cpu(ni->type), |
1931 | (unsigned long long)allocated_size, | 1930 | (unsigned long long)allocated_size, |
1932 | (unsigned long long)new_alloc_size, | 1931 | (unsigned long long)new_alloc_size, |
1933 | (unsigned long long)new_data_size, | 1932 | (unsigned long long)new_data_size, |
1934 | (unsigned long long)start); | 1933 | (unsigned long long)start); |
1935 | #endif | 1934 | #endif |
1936 | retry_extend: | 1935 | retry_extend: |
1937 | /* | 1936 | /* |
1938 | * For non-resident attributes, @start and @new_size need to be aligned | 1937 | * For non-resident attributes, @start and @new_size need to be aligned |
1939 | * to cluster boundaries for allocation purposes. | 1938 | * to cluster boundaries for allocation purposes. |
1940 | */ | 1939 | */ |
1941 | if (NInoNonResident(ni)) { | 1940 | if (NInoNonResident(ni)) { |
1942 | if (start > 0) | 1941 | if (start > 0) |
1943 | start &= ~(s64)vol->cluster_size_mask; | 1942 | start &= ~(s64)vol->cluster_size_mask; |
1944 | new_alloc_size = (new_alloc_size + vol->cluster_size - 1) & | 1943 | new_alloc_size = (new_alloc_size + vol->cluster_size - 1) & |
1945 | ~(s64)vol->cluster_size_mask; | 1944 | ~(s64)vol->cluster_size_mask; |
1946 | } | 1945 | } |
1947 | BUG_ON(new_data_size >= 0 && new_data_size > new_alloc_size); | 1946 | BUG_ON(new_data_size >= 0 && new_data_size > new_alloc_size); |
1948 | /* Check if new size is allowed in $AttrDef. */ | 1947 | /* Check if new size is allowed in $AttrDef. */ |
1949 | err = ntfs_attr_size_bounds_check(vol, ni->type, new_alloc_size); | 1948 | err = ntfs_attr_size_bounds_check(vol, ni->type, new_alloc_size); |
1950 | if (unlikely(err)) { | 1949 | if (unlikely(err)) { |
1951 | /* Only emit errors when the write will fail completely. */ | 1950 | /* Only emit errors when the write will fail completely. */ |
1952 | read_lock_irqsave(&ni->size_lock, flags); | 1951 | read_lock_irqsave(&ni->size_lock, flags); |
1953 | allocated_size = ni->allocated_size; | 1952 | allocated_size = ni->allocated_size; |
1954 | read_unlock_irqrestore(&ni->size_lock, flags); | 1953 | read_unlock_irqrestore(&ni->size_lock, flags); |
1955 | if (start < 0 || start >= allocated_size) { | 1954 | if (start < 0 || start >= allocated_size) { |
1956 | if (err == -ERANGE) { | 1955 | if (err == -ERANGE) { |
1957 | ntfs_error(vol->sb, "Cannot extend allocation " | 1956 | ntfs_error(vol->sb, "Cannot extend allocation " |
1958 | "of inode 0x%lx, attribute " | 1957 | "of inode 0x%lx, attribute " |
1959 | "type 0x%x, because the new " | 1958 | "type 0x%x, because the new " |
1960 | "allocation would exceed the " | 1959 | "allocation would exceed the " |
1961 | "maximum allowed size for " | 1960 | "maximum allowed size for " |
1962 | "this attribute type.", | 1961 | "this attribute type.", |
1963 | vi->i_ino, (unsigned) | 1962 | vi->i_ino, (unsigned) |
1964 | le32_to_cpu(ni->type)); | 1963 | le32_to_cpu(ni->type)); |
1965 | } else { | 1964 | } else { |
1966 | ntfs_error(vol->sb, "Cannot extend allocation " | 1965 | ntfs_error(vol->sb, "Cannot extend allocation " |
1967 | "of inode 0x%lx, attribute " | 1966 | "of inode 0x%lx, attribute " |
1968 | "type 0x%x, because this " | 1967 | "type 0x%x, because this " |
1969 | "attribute type is not " | 1968 | "attribute type is not " |
1970 | "defined on the NTFS volume. " | 1969 | "defined on the NTFS volume. " |
1971 | "Possible corruption! You " | 1970 | "Possible corruption! You " |
1972 | "should run chkdsk!", | 1971 | "should run chkdsk!", |
1973 | vi->i_ino, (unsigned) | 1972 | vi->i_ino, (unsigned) |
1974 | le32_to_cpu(ni->type)); | 1973 | le32_to_cpu(ni->type)); |
1975 | } | 1974 | } |
1976 | } | 1975 | } |
1977 | /* Translate error code to be POSIX conformant for write(2). */ | 1976 | /* Translate error code to be POSIX conformant for write(2). */ |
1978 | if (err == -ERANGE) | 1977 | if (err == -ERANGE) |
1979 | err = -EFBIG; | 1978 | err = -EFBIG; |
1980 | else | 1979 | else |
1981 | err = -EIO; | 1980 | err = -EIO; |
1982 | return err; | 1981 | return err; |
1983 | } | 1982 | } |
1984 | if (!NInoAttr(ni)) | 1983 | if (!NInoAttr(ni)) |
1985 | base_ni = ni; | 1984 | base_ni = ni; |
1986 | else | 1985 | else |
1987 | base_ni = ni->ext.base_ntfs_ino; | 1986 | base_ni = ni->ext.base_ntfs_ino; |
1988 | /* | 1987 | /* |
1989 | * We will be modifying both the runlist (if non-resident) and the mft | 1988 | * We will be modifying both the runlist (if non-resident) and the mft |
1990 | * record so lock them both down. | 1989 | * record so lock them both down. |
1991 | */ | 1990 | */ |
1992 | down_write(&ni->runlist.lock); | 1991 | down_write(&ni->runlist.lock); |
1993 | m = map_mft_record(base_ni); | 1992 | m = map_mft_record(base_ni); |
1994 | if (IS_ERR(m)) { | 1993 | if (IS_ERR(m)) { |
1995 | err = PTR_ERR(m); | 1994 | err = PTR_ERR(m); |
1996 | m = NULL; | 1995 | m = NULL; |
1997 | ctx = NULL; | 1996 | ctx = NULL; |
1998 | goto err_out; | 1997 | goto err_out; |
1999 | } | 1998 | } |
2000 | ctx = ntfs_attr_get_search_ctx(base_ni, m); | 1999 | ctx = ntfs_attr_get_search_ctx(base_ni, m); |
2001 | if (unlikely(!ctx)) { | 2000 | if (unlikely(!ctx)) { |
2002 | err = -ENOMEM; | 2001 | err = -ENOMEM; |
2003 | goto err_out; | 2002 | goto err_out; |
2004 | } | 2003 | } |
2005 | read_lock_irqsave(&ni->size_lock, flags); | 2004 | read_lock_irqsave(&ni->size_lock, flags); |
2006 | allocated_size = ni->allocated_size; | 2005 | allocated_size = ni->allocated_size; |
2007 | read_unlock_irqrestore(&ni->size_lock, flags); | 2006 | read_unlock_irqrestore(&ni->size_lock, flags); |
2008 | /* | 2007 | /* |
2009 | * If non-resident, seek to the last extent. If resident, there is | 2008 | * If non-resident, seek to the last extent. If resident, there is |
2010 | * only one extent, so seek to that. | 2009 | * only one extent, so seek to that. |
2011 | */ | 2010 | */ |
2012 | vcn = NInoNonResident(ni) ? allocated_size >> vol->cluster_size_bits : | 2011 | vcn = NInoNonResident(ni) ? allocated_size >> vol->cluster_size_bits : |
2013 | 0; | 2012 | 0; |
2014 | /* | 2013 | /* |
2015 | * Abort if someone did the work whilst we waited for the locks. If we | 2014 | * Abort if someone did the work whilst we waited for the locks. If we |
2016 | * just converted the attribute from resident to non-resident it is | 2015 | * just converted the attribute from resident to non-resident it is |
2017 | * likely that exactly this has happened already. We cannot quite | 2016 | * likely that exactly this has happened already. We cannot quite |
2018 | * abort if we need to update the data size. | 2017 | * abort if we need to update the data size. |
2019 | */ | 2018 | */ |
2020 | if (unlikely(new_alloc_size <= allocated_size)) { | 2019 | if (unlikely(new_alloc_size <= allocated_size)) { |
2021 | ntfs_debug("Allocated size already exceeds requested size."); | 2020 | ntfs_debug("Allocated size already exceeds requested size."); |
2022 | new_alloc_size = allocated_size; | 2021 | new_alloc_size = allocated_size; |
2023 | if (new_data_size < 0) | 2022 | if (new_data_size < 0) |
2024 | goto done; | 2023 | goto done; |
2025 | /* | 2024 | /* |
2026 | * We want the first attribute extent so that we can update the | 2025 | * We want the first attribute extent so that we can update the |
2027 | * data size. | 2026 | * data size. |
2028 | */ | 2027 | */ |
2029 | vcn = 0; | 2028 | vcn = 0; |
2030 | } | 2029 | } |
2031 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, | 2030 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, |
2032 | CASE_SENSITIVE, vcn, NULL, 0, ctx); | 2031 | CASE_SENSITIVE, vcn, NULL, 0, ctx); |
2033 | if (unlikely(err)) { | 2032 | if (unlikely(err)) { |
2034 | if (err == -ENOENT) | 2033 | if (err == -ENOENT) |
2035 | err = -EIO; | 2034 | err = -EIO; |
2036 | goto err_out; | 2035 | goto err_out; |
2037 | } | 2036 | } |
2038 | m = ctx->mrec; | 2037 | m = ctx->mrec; |
2039 | a = ctx->attr; | 2038 | a = ctx->attr; |
2040 | /* Use goto to reduce indentation. */ | 2039 | /* Use goto to reduce indentation. */ |
2041 | if (a->non_resident) | 2040 | if (a->non_resident) |
2042 | goto do_non_resident_extend; | 2041 | goto do_non_resident_extend; |
2043 | BUG_ON(NInoNonResident(ni)); | 2042 | BUG_ON(NInoNonResident(ni)); |
2044 | /* The total length of the attribute value. */ | 2043 | /* The total length of the attribute value. */ |
2045 | attr_len = le32_to_cpu(a->data.resident.value_length); | 2044 | attr_len = le32_to_cpu(a->data.resident.value_length); |
2046 | /* | 2045 | /* |
2047 | * Extend the attribute record to be able to store the new attribute | 2046 | * Extend the attribute record to be able to store the new attribute |
2048 | * size. ntfs_attr_record_resize() will not do anything if the size is | 2047 | * size. ntfs_attr_record_resize() will not do anything if the size is |
2049 | * not changing. | 2048 | * not changing. |
2050 | */ | 2049 | */ |
2051 | if (new_alloc_size < vol->mft_record_size && | 2050 | if (new_alloc_size < vol->mft_record_size && |
2052 | !ntfs_attr_record_resize(m, a, | 2051 | !ntfs_attr_record_resize(m, a, |
2053 | le16_to_cpu(a->data.resident.value_offset) + | 2052 | le16_to_cpu(a->data.resident.value_offset) + |
2054 | new_alloc_size)) { | 2053 | new_alloc_size)) { |
2055 | /* The resize succeeded! */ | 2054 | /* The resize succeeded! */ |
2056 | write_lock_irqsave(&ni->size_lock, flags); | 2055 | write_lock_irqsave(&ni->size_lock, flags); |
2057 | ni->allocated_size = le32_to_cpu(a->length) - | 2056 | ni->allocated_size = le32_to_cpu(a->length) - |
2058 | le16_to_cpu(a->data.resident.value_offset); | 2057 | le16_to_cpu(a->data.resident.value_offset); |
2059 | write_unlock_irqrestore(&ni->size_lock, flags); | 2058 | write_unlock_irqrestore(&ni->size_lock, flags); |
2060 | if (new_data_size >= 0) { | 2059 | if (new_data_size >= 0) { |
2061 | BUG_ON(new_data_size < attr_len); | 2060 | BUG_ON(new_data_size < attr_len); |
2062 | a->data.resident.value_length = | 2061 | a->data.resident.value_length = |
2063 | cpu_to_le32((u32)new_data_size); | 2062 | cpu_to_le32((u32)new_data_size); |
2064 | } | 2063 | } |
2065 | goto flush_done; | 2064 | goto flush_done; |
2066 | } | 2065 | } |
2067 | /* | 2066 | /* |
2068 | * We have to drop all the locks so we can call | 2067 | * We have to drop all the locks so we can call |
2069 | * ntfs_attr_make_non_resident(). This could be optimised by try- | 2068 | * ntfs_attr_make_non_resident(). This could be optimised by try- |
2070 | * locking the first page cache page and only if that fails dropping | 2069 | * locking the first page cache page and only if that fails dropping |
2071 | * the locks, locking the page, and redoing all the locking and | 2070 | * the locks, locking the page, and redoing all the locking and |
2072 | * lookups. While this would be a huge optimisation, it is not worth | 2071 | * lookups. While this would be a huge optimisation, it is not worth |
2073 | * it as this is definitely a slow code path. | 2072 | * it as this is definitely a slow code path. |
2074 | */ | 2073 | */ |
2075 | ntfs_attr_put_search_ctx(ctx); | 2074 | ntfs_attr_put_search_ctx(ctx); |
2076 | unmap_mft_record(base_ni); | 2075 | unmap_mft_record(base_ni); |
2077 | up_write(&ni->runlist.lock); | 2076 | up_write(&ni->runlist.lock); |
2078 | /* | 2077 | /* |
2079 | * Not enough space in the mft record, try to make the attribute | 2078 | * Not enough space in the mft record, try to make the attribute |
2080 | * non-resident and if successful restart the extension process. | 2079 | * non-resident and if successful restart the extension process. |
2081 | */ | 2080 | */ |
2082 | err = ntfs_attr_make_non_resident(ni, attr_len); | 2081 | err = ntfs_attr_make_non_resident(ni, attr_len); |
2083 | if (likely(!err)) | 2082 | if (likely(!err)) |
2084 | goto retry_extend; | 2083 | goto retry_extend; |
2085 | /* | 2084 | /* |
2086 | * Could not make non-resident. If this is due to this not being | 2085 | * Could not make non-resident. If this is due to this not being |
2087 | * permitted for this attribute type or there not being enough space, | 2086 | * permitted for this attribute type or there not being enough space, |
2088 | * try to make other attributes non-resident. Otherwise fail. | 2087 | * try to make other attributes non-resident. Otherwise fail. |
2089 | */ | 2088 | */ |
2090 | if (unlikely(err != -EPERM && err != -ENOSPC)) { | 2089 | if (unlikely(err != -EPERM && err != -ENOSPC)) { |
2091 | /* Only emit errors when the write will fail completely. */ | 2090 | /* Only emit errors when the write will fail completely. */ |
2092 | read_lock_irqsave(&ni->size_lock, flags); | 2091 | read_lock_irqsave(&ni->size_lock, flags); |
2093 | allocated_size = ni->allocated_size; | 2092 | allocated_size = ni->allocated_size; |
2094 | read_unlock_irqrestore(&ni->size_lock, flags); | 2093 | read_unlock_irqrestore(&ni->size_lock, flags); |
2095 | if (start < 0 || start >= allocated_size) | 2094 | if (start < 0 || start >= allocated_size) |
2096 | ntfs_error(vol->sb, "Cannot extend allocation of " | 2095 | ntfs_error(vol->sb, "Cannot extend allocation of " |
2097 | "inode 0x%lx, attribute type 0x%x, " | 2096 | "inode 0x%lx, attribute type 0x%x, " |
2098 | "because the conversion from resident " | 2097 | "because the conversion from resident " |
2099 | "to non-resident attribute failed " | 2098 | "to non-resident attribute failed " |
2100 | "with error code %i.", vi->i_ino, | 2099 | "with error code %i.", vi->i_ino, |
2101 | (unsigned)le32_to_cpu(ni->type), err); | 2100 | (unsigned)le32_to_cpu(ni->type), err); |
2102 | if (err != -ENOMEM) | 2101 | if (err != -ENOMEM) |
2103 | err = -EIO; | 2102 | err = -EIO; |
2104 | goto conv_err_out; | 2103 | goto conv_err_out; |
2105 | } | 2104 | } |
2106 | /* TODO: Not implemented from here, abort. */ | 2105 | /* TODO: Not implemented from here, abort. */ |
2107 | read_lock_irqsave(&ni->size_lock, flags); | 2106 | read_lock_irqsave(&ni->size_lock, flags); |
2108 | allocated_size = ni->allocated_size; | 2107 | allocated_size = ni->allocated_size; |
2109 | read_unlock_irqrestore(&ni->size_lock, flags); | 2108 | read_unlock_irqrestore(&ni->size_lock, flags); |
2110 | if (start < 0 || start >= allocated_size) { | 2109 | if (start < 0 || start >= allocated_size) { |
2111 | if (err == -ENOSPC) | 2110 | if (err == -ENOSPC) |
2112 | ntfs_error(vol->sb, "Not enough space in the mft " | 2111 | ntfs_error(vol->sb, "Not enough space in the mft " |
2113 | "record/on disk for the non-resident " | 2112 | "record/on disk for the non-resident " |
2114 | "attribute value. This case is not " | 2113 | "attribute value. This case is not " |
2115 | "implemented yet."); | 2114 | "implemented yet."); |
2116 | else /* if (err == -EPERM) */ | 2115 | else /* if (err == -EPERM) */ |
2117 | ntfs_error(vol->sb, "This attribute type may not be " | 2116 | ntfs_error(vol->sb, "This attribute type may not be " |
2118 | "non-resident. This case is not " | 2117 | "non-resident. This case is not " |
2119 | "implemented yet."); | 2118 | "implemented yet."); |
2120 | } | 2119 | } |
2121 | err = -EOPNOTSUPP; | 2120 | err = -EOPNOTSUPP; |
2122 | goto conv_err_out; | 2121 | goto conv_err_out; |
2123 | #if 0 | 2122 | #if 0 |
2124 | // TODO: Attempt to make other attributes non-resident. | 2123 | // TODO: Attempt to make other attributes non-resident. |
2125 | if (!err) | 2124 | if (!err) |
2126 | goto do_resident_extend; | 2125 | goto do_resident_extend; |
2127 | /* | 2126 | /* |
2128 | * Both the attribute list attribute and the standard information | 2127 | * Both the attribute list attribute and the standard information |
2129 | * attribute must remain in the base inode. Thus, if this is one of | 2128 | * attribute must remain in the base inode. Thus, if this is one of |
2130 | * these attributes, we have to try to move other attributes out into | 2129 | * these attributes, we have to try to move other attributes out into |
2131 | * extent mft records instead. | 2130 | * extent mft records instead. |
2132 | */ | 2131 | */ |
2133 | if (ni->type == AT_ATTRIBUTE_LIST || | 2132 | if (ni->type == AT_ATTRIBUTE_LIST || |
2134 | ni->type == AT_STANDARD_INFORMATION) { | 2133 | ni->type == AT_STANDARD_INFORMATION) { |
2135 | // TODO: Attempt to move other attributes into extent mft | 2134 | // TODO: Attempt to move other attributes into extent mft |
2136 | // records. | 2135 | // records. |
2137 | err = -EOPNOTSUPP; | 2136 | err = -EOPNOTSUPP; |
2138 | if (!err) | 2137 | if (!err) |
2139 | goto do_resident_extend; | 2138 | goto do_resident_extend; |
2140 | goto err_out; | 2139 | goto err_out; |
2141 | } | 2140 | } |
2142 | // TODO: Attempt to move this attribute to an extent mft record, but | 2141 | // TODO: Attempt to move this attribute to an extent mft record, but |
2143 | // only if it is not already the only attribute in an mft record in | 2142 | // only if it is not already the only attribute in an mft record in |
2144 | // which case there would be nothing to gain. | 2143 | // which case there would be nothing to gain. |
2145 | err = -EOPNOTSUPP; | 2144 | err = -EOPNOTSUPP; |
2146 | if (!err) | 2145 | if (!err) |
2147 | goto do_resident_extend; | 2146 | goto do_resident_extend; |
2148 | /* There is nothing we can do to make enough space. )-: */ | 2147 | /* There is nothing we can do to make enough space. )-: */ |
2149 | goto err_out; | 2148 | goto err_out; |
2150 | #endif | 2149 | #endif |
2151 | do_non_resident_extend: | 2150 | do_non_resident_extend: |
2152 | BUG_ON(!NInoNonResident(ni)); | 2151 | BUG_ON(!NInoNonResident(ni)); |
2153 | if (new_alloc_size == allocated_size) { | 2152 | if (new_alloc_size == allocated_size) { |
2154 | BUG_ON(vcn); | 2153 | BUG_ON(vcn); |
2155 | goto alloc_done; | 2154 | goto alloc_done; |
2156 | } | 2155 | } |
2157 | /* | 2156 | /* |
2158 | * If the data starts after the end of the old allocation, this is a | 2157 | * If the data starts after the end of the old allocation, this is a |
2159 | * $DATA attribute and sparse attributes are enabled on the volume and | 2158 | * $DATA attribute and sparse attributes are enabled on the volume and |
2160 | * for this inode, then create a sparse region between the old | 2159 | * for this inode, then create a sparse region between the old |
2161 | * allocated size and the start of the data. Otherwise simply proceed | 2160 | * allocated size and the start of the data. Otherwise simply proceed |
2162 | * with filling the whole space between the old allocated size and the | 2161 | * with filling the whole space between the old allocated size and the |
2163 | * new allocated size with clusters. | 2162 | * new allocated size with clusters. |
2164 | */ | 2163 | */ |
2165 | if ((start >= 0 && start <= allocated_size) || ni->type != AT_DATA || | 2164 | if ((start >= 0 && start <= allocated_size) || ni->type != AT_DATA || |
2166 | !NVolSparseEnabled(vol) || NInoSparseDisabled(ni)) | 2165 | !NVolSparseEnabled(vol) || NInoSparseDisabled(ni)) |
2167 | goto skip_sparse; | 2166 | goto skip_sparse; |
2168 | // TODO: This is not implemented yet. We just fill in with real | 2167 | // TODO: This is not implemented yet. We just fill in with real |
2169 | // clusters for now... | 2168 | // clusters for now... |
2170 | ntfs_debug("Inserting holes is not-implemented yet. Falling back to " | 2169 | ntfs_debug("Inserting holes is not-implemented yet. Falling back to " |
2171 | "allocating real clusters instead."); | 2170 | "allocating real clusters instead."); |
2172 | skip_sparse: | 2171 | skip_sparse: |
2173 | rl = ni->runlist.rl; | 2172 | rl = ni->runlist.rl; |
2174 | if (likely(rl)) { | 2173 | if (likely(rl)) { |
2175 | /* Seek to the end of the runlist. */ | 2174 | /* Seek to the end of the runlist. */ |
2176 | while (rl->length) | 2175 | while (rl->length) |
2177 | rl++; | 2176 | rl++; |
2178 | } | 2177 | } |
2179 | /* If this attribute extent is not mapped, map it now. */ | 2178 | /* If this attribute extent is not mapped, map it now. */ |
2180 | if (unlikely(!rl || rl->lcn == LCN_RL_NOT_MAPPED || | 2179 | if (unlikely(!rl || rl->lcn == LCN_RL_NOT_MAPPED || |
2181 | (rl->lcn == LCN_ENOENT && rl > ni->runlist.rl && | 2180 | (rl->lcn == LCN_ENOENT && rl > ni->runlist.rl && |
2182 | (rl-1)->lcn == LCN_RL_NOT_MAPPED))) { | 2181 | (rl-1)->lcn == LCN_RL_NOT_MAPPED))) { |
2183 | if (!rl && !allocated_size) | 2182 | if (!rl && !allocated_size) |
2184 | goto first_alloc; | 2183 | goto first_alloc; |
2185 | rl = ntfs_mapping_pairs_decompress(vol, a, ni->runlist.rl); | 2184 | rl = ntfs_mapping_pairs_decompress(vol, a, ni->runlist.rl); |
2186 | if (IS_ERR(rl)) { | 2185 | if (IS_ERR(rl)) { |
2187 | err = PTR_ERR(rl); | 2186 | err = PTR_ERR(rl); |
2188 | if (start < 0 || start >= allocated_size) | 2187 | if (start < 0 || start >= allocated_size) |
2189 | ntfs_error(vol->sb, "Cannot extend allocation " | 2188 | ntfs_error(vol->sb, "Cannot extend allocation " |
2190 | "of inode 0x%lx, attribute " | 2189 | "of inode 0x%lx, attribute " |
2191 | "type 0x%x, because the " | 2190 | "type 0x%x, because the " |
2192 | "mapping of a runlist " | 2191 | "mapping of a runlist " |
2193 | "fragment failed with error " | 2192 | "fragment failed with error " |
2194 | "code %i.", vi->i_ino, | 2193 | "code %i.", vi->i_ino, |
2195 | (unsigned)le32_to_cpu(ni->type), | 2194 | (unsigned)le32_to_cpu(ni->type), |
2196 | err); | 2195 | err); |
2197 | if (err != -ENOMEM) | 2196 | if (err != -ENOMEM) |
2198 | err = -EIO; | 2197 | err = -EIO; |
2199 | goto err_out; | 2198 | goto err_out; |
2200 | } | 2199 | } |
2201 | ni->runlist.rl = rl; | 2200 | ni->runlist.rl = rl; |
2202 | /* Seek to the end of the runlist. */ | 2201 | /* Seek to the end of the runlist. */ |
2203 | while (rl->length) | 2202 | while (rl->length) |
2204 | rl++; | 2203 | rl++; |
2205 | } | 2204 | } |
2206 | /* | 2205 | /* |
2207 | * We now know the runlist of the last extent is mapped and @rl is at | 2206 | * We now know the runlist of the last extent is mapped and @rl is at |
2208 | * the end of the runlist. We want to begin allocating clusters | 2207 | * the end of the runlist. We want to begin allocating clusters |
2209 | * starting at the last allocated cluster to reduce fragmentation. If | 2208 | * starting at the last allocated cluster to reduce fragmentation. If |
2210 | * there are no valid LCNs in the attribute we let the cluster | 2209 | * there are no valid LCNs in the attribute we let the cluster |
2211 | * allocator choose the starting cluster. | 2210 | * allocator choose the starting cluster. |
2212 | */ | 2211 | */ |
2213 | /* If the last LCN is a hole or similar, seek back to last real LCN. */ | 2212 | /* If the last LCN is a hole or similar, seek back to last real LCN. */ |
2214 | while (rl->lcn < 0 && rl > ni->runlist.rl) | 2213 | while (rl->lcn < 0 && rl > ni->runlist.rl) |
2215 | rl--; | 2214 | rl--; |
2216 | first_alloc: | 2215 | first_alloc: |
2217 | // FIXME: Need to implement partial allocations so at least part of the | 2216 | // FIXME: Need to implement partial allocations so at least part of the |
2218 | // write can be performed when start >= 0. (Needed for POSIX write(2) | 2217 | // write can be performed when start >= 0. (Needed for POSIX write(2) |
2219 | // conformance.) | 2218 | // conformance.) |
2220 | rl2 = ntfs_cluster_alloc(vol, allocated_size >> vol->cluster_size_bits, | 2219 | rl2 = ntfs_cluster_alloc(vol, allocated_size >> vol->cluster_size_bits, |
2221 | (new_alloc_size - allocated_size) >> | 2220 | (new_alloc_size - allocated_size) >> |
2222 | vol->cluster_size_bits, (rl && (rl->lcn >= 0)) ? | 2221 | vol->cluster_size_bits, (rl && (rl->lcn >= 0)) ? |
2223 | rl->lcn + rl->length : -1, DATA_ZONE, true); | 2222 | rl->lcn + rl->length : -1, DATA_ZONE, true); |
2224 | if (IS_ERR(rl2)) { | 2223 | if (IS_ERR(rl2)) { |
2225 | err = PTR_ERR(rl2); | 2224 | err = PTR_ERR(rl2); |
2226 | if (start < 0 || start >= allocated_size) | 2225 | if (start < 0 || start >= allocated_size) |
2227 | ntfs_error(vol->sb, "Cannot extend allocation of " | 2226 | ntfs_error(vol->sb, "Cannot extend allocation of " |
2228 | "inode 0x%lx, attribute type 0x%x, " | 2227 | "inode 0x%lx, attribute type 0x%x, " |
2229 | "because the allocation of clusters " | 2228 | "because the allocation of clusters " |
2230 | "failed with error code %i.", vi->i_ino, | 2229 | "failed with error code %i.", vi->i_ino, |
2231 | (unsigned)le32_to_cpu(ni->type), err); | 2230 | (unsigned)le32_to_cpu(ni->type), err); |
2232 | if (err != -ENOMEM && err != -ENOSPC) | 2231 | if (err != -ENOMEM && err != -ENOSPC) |
2233 | err = -EIO; | 2232 | err = -EIO; |
2234 | goto err_out; | 2233 | goto err_out; |
2235 | } | 2234 | } |
2236 | rl = ntfs_runlists_merge(ni->runlist.rl, rl2); | 2235 | rl = ntfs_runlists_merge(ni->runlist.rl, rl2); |
2237 | if (IS_ERR(rl)) { | 2236 | if (IS_ERR(rl)) { |
2238 | err = PTR_ERR(rl); | 2237 | err = PTR_ERR(rl); |
2239 | if (start < 0 || start >= allocated_size) | 2238 | if (start < 0 || start >= allocated_size) |
2240 | ntfs_error(vol->sb, "Cannot extend allocation of " | 2239 | ntfs_error(vol->sb, "Cannot extend allocation of " |
2241 | "inode 0x%lx, attribute type 0x%x, " | 2240 | "inode 0x%lx, attribute type 0x%x, " |
2242 | "because the runlist merge failed " | 2241 | "because the runlist merge failed " |
2243 | "with error code %i.", vi->i_ino, | 2242 | "with error code %i.", vi->i_ino, |
2244 | (unsigned)le32_to_cpu(ni->type), err); | 2243 | (unsigned)le32_to_cpu(ni->type), err); |
2245 | if (err != -ENOMEM) | 2244 | if (err != -ENOMEM) |
2246 | err = -EIO; | 2245 | err = -EIO; |
2247 | if (ntfs_cluster_free_from_rl(vol, rl2)) { | 2246 | if (ntfs_cluster_free_from_rl(vol, rl2)) { |
2248 | ntfs_error(vol->sb, "Failed to release allocated " | 2247 | ntfs_error(vol->sb, "Failed to release allocated " |
2249 | "cluster(s) in error code path. Run " | 2248 | "cluster(s) in error code path. Run " |
2250 | "chkdsk to recover the lost " | 2249 | "chkdsk to recover the lost " |
2251 | "cluster(s)."); | 2250 | "cluster(s)."); |
2252 | NVolSetErrors(vol); | 2251 | NVolSetErrors(vol); |
2253 | } | 2252 | } |
2254 | ntfs_free(rl2); | 2253 | ntfs_free(rl2); |
2255 | goto err_out; | 2254 | goto err_out; |
2256 | } | 2255 | } |
2257 | ni->runlist.rl = rl; | 2256 | ni->runlist.rl = rl; |
2258 | ntfs_debug("Allocated 0x%llx clusters.", (long long)(new_alloc_size - | 2257 | ntfs_debug("Allocated 0x%llx clusters.", (long long)(new_alloc_size - |
2259 | allocated_size) >> vol->cluster_size_bits); | 2258 | allocated_size) >> vol->cluster_size_bits); |
2260 | /* Find the runlist element with which the attribute extent starts. */ | 2259 | /* Find the runlist element with which the attribute extent starts. */ |
2261 | ll = sle64_to_cpu(a->data.non_resident.lowest_vcn); | 2260 | ll = sle64_to_cpu(a->data.non_resident.lowest_vcn); |
2262 | rl2 = ntfs_rl_find_vcn_nolock(rl, ll); | 2261 | rl2 = ntfs_rl_find_vcn_nolock(rl, ll); |
2263 | BUG_ON(!rl2); | 2262 | BUG_ON(!rl2); |
2264 | BUG_ON(!rl2->length); | 2263 | BUG_ON(!rl2->length); |
2265 | BUG_ON(rl2->lcn < LCN_HOLE); | 2264 | BUG_ON(rl2->lcn < LCN_HOLE); |
2266 | mp_rebuilt = false; | 2265 | mp_rebuilt = false; |
2267 | /* Get the size for the new mapping pairs array for this extent. */ | 2266 | /* Get the size for the new mapping pairs array for this extent. */ |
2268 | mp_size = ntfs_get_size_for_mapping_pairs(vol, rl2, ll, -1); | 2267 | mp_size = ntfs_get_size_for_mapping_pairs(vol, rl2, ll, -1); |
2269 | if (unlikely(mp_size <= 0)) { | 2268 | if (unlikely(mp_size <= 0)) { |
2270 | err = mp_size; | 2269 | err = mp_size; |
2271 | if (start < 0 || start >= allocated_size) | 2270 | if (start < 0 || start >= allocated_size) |
2272 | ntfs_error(vol->sb, "Cannot extend allocation of " | 2271 | ntfs_error(vol->sb, "Cannot extend allocation of " |
2273 | "inode 0x%lx, attribute type 0x%x, " | 2272 | "inode 0x%lx, attribute type 0x%x, " |
2274 | "because determining the size for the " | 2273 | "because determining the size for the " |
2275 | "mapping pairs failed with error code " | 2274 | "mapping pairs failed with error code " |
2276 | "%i.", vi->i_ino, | 2275 | "%i.", vi->i_ino, |
2277 | (unsigned)le32_to_cpu(ni->type), err); | 2276 | (unsigned)le32_to_cpu(ni->type), err); |
2278 | err = -EIO; | 2277 | err = -EIO; |
2279 | goto undo_alloc; | 2278 | goto undo_alloc; |
2280 | } | 2279 | } |
2281 | /* Extend the attribute record to fit the bigger mapping pairs array. */ | 2280 | /* Extend the attribute record to fit the bigger mapping pairs array. */ |
2282 | attr_len = le32_to_cpu(a->length); | 2281 | attr_len = le32_to_cpu(a->length); |
2283 | err = ntfs_attr_record_resize(m, a, mp_size + | 2282 | err = ntfs_attr_record_resize(m, a, mp_size + |
2284 | le16_to_cpu(a->data.non_resident.mapping_pairs_offset)); | 2283 | le16_to_cpu(a->data.non_resident.mapping_pairs_offset)); |
2285 | if (unlikely(err)) { | 2284 | if (unlikely(err)) { |
2286 | BUG_ON(err != -ENOSPC); | 2285 | BUG_ON(err != -ENOSPC); |
2287 | // TODO: Deal with this by moving this extent to a new mft | 2286 | // TODO: Deal with this by moving this extent to a new mft |
2288 | // record or by starting a new extent in a new mft record, | 2287 | // record or by starting a new extent in a new mft record, |
2289 | // possibly by extending this extent partially and filling it | 2288 | // possibly by extending this extent partially and filling it |
2290 | // and creating a new extent for the remainder, or by making | 2289 | // and creating a new extent for the remainder, or by making |
2291 | // other attributes non-resident and/or by moving other | 2290 | // other attributes non-resident and/or by moving other |
2292 | // attributes out of this mft record. | 2291 | // attributes out of this mft record. |
2293 | if (start < 0 || start >= allocated_size) | 2292 | if (start < 0 || start >= allocated_size) |
2294 | ntfs_error(vol->sb, "Not enough space in the mft " | 2293 | ntfs_error(vol->sb, "Not enough space in the mft " |
2295 | "record for the extended attribute " | 2294 | "record for the extended attribute " |
2296 | "record. This case is not " | 2295 | "record. This case is not " |
2297 | "implemented yet."); | 2296 | "implemented yet."); |
2298 | err = -EOPNOTSUPP; | 2297 | err = -EOPNOTSUPP; |
2299 | goto undo_alloc; | 2298 | goto undo_alloc; |
2300 | } | 2299 | } |
2301 | mp_rebuilt = true; | 2300 | mp_rebuilt = true; |
2302 | /* Generate the mapping pairs array directly into the attr record. */ | 2301 | /* Generate the mapping pairs array directly into the attr record. */ |
2303 | err = ntfs_mapping_pairs_build(vol, (u8*)a + | 2302 | err = ntfs_mapping_pairs_build(vol, (u8*)a + |
2304 | le16_to_cpu(a->data.non_resident.mapping_pairs_offset), | 2303 | le16_to_cpu(a->data.non_resident.mapping_pairs_offset), |
2305 | mp_size, rl2, ll, -1, NULL); | 2304 | mp_size, rl2, ll, -1, NULL); |
2306 | if (unlikely(err)) { | 2305 | if (unlikely(err)) { |
2307 | if (start < 0 || start >= allocated_size) | 2306 | if (start < 0 || start >= allocated_size) |
2308 | ntfs_error(vol->sb, "Cannot extend allocation of " | 2307 | ntfs_error(vol->sb, "Cannot extend allocation of " |
2309 | "inode 0x%lx, attribute type 0x%x, " | 2308 | "inode 0x%lx, attribute type 0x%x, " |
2310 | "because building the mapping pairs " | 2309 | "because building the mapping pairs " |
2311 | "failed with error code %i.", vi->i_ino, | 2310 | "failed with error code %i.", vi->i_ino, |
2312 | (unsigned)le32_to_cpu(ni->type), err); | 2311 | (unsigned)le32_to_cpu(ni->type), err); |
2313 | err = -EIO; | 2312 | err = -EIO; |
2314 | goto undo_alloc; | 2313 | goto undo_alloc; |
2315 | } | 2314 | } |
2316 | /* Update the highest_vcn. */ | 2315 | /* Update the highest_vcn. */ |
2317 | a->data.non_resident.highest_vcn = cpu_to_sle64((new_alloc_size >> | 2316 | a->data.non_resident.highest_vcn = cpu_to_sle64((new_alloc_size >> |
2318 | vol->cluster_size_bits) - 1); | 2317 | vol->cluster_size_bits) - 1); |
2319 | /* | 2318 | /* |
2320 | * We now have extended the allocated size of the attribute. Reflect | 2319 | * We now have extended the allocated size of the attribute. Reflect |
2321 | * this in the ntfs_inode structure and the attribute record. | 2320 | * this in the ntfs_inode structure and the attribute record. |
2322 | */ | 2321 | */ |
2323 | if (a->data.non_resident.lowest_vcn) { | 2322 | if (a->data.non_resident.lowest_vcn) { |
2324 | /* | 2323 | /* |
2325 | * We are not in the first attribute extent, switch to it, but | 2324 | * We are not in the first attribute extent, switch to it, but |
2326 | * first ensure the changes will make it to disk later. | 2325 | * first ensure the changes will make it to disk later. |
2327 | */ | 2326 | */ |
2328 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 2327 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
2329 | mark_mft_record_dirty(ctx->ntfs_ino); | 2328 | mark_mft_record_dirty(ctx->ntfs_ino); |
2330 | ntfs_attr_reinit_search_ctx(ctx); | 2329 | ntfs_attr_reinit_search_ctx(ctx); |
2331 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, | 2330 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, |
2332 | CASE_SENSITIVE, 0, NULL, 0, ctx); | 2331 | CASE_SENSITIVE, 0, NULL, 0, ctx); |
2333 | if (unlikely(err)) | 2332 | if (unlikely(err)) |
2334 | goto restore_undo_alloc; | 2333 | goto restore_undo_alloc; |
2335 | /* @m is not used any more so no need to set it. */ | 2334 | /* @m is not used any more so no need to set it. */ |
2336 | a = ctx->attr; | 2335 | a = ctx->attr; |
2337 | } | 2336 | } |
2338 | write_lock_irqsave(&ni->size_lock, flags); | 2337 | write_lock_irqsave(&ni->size_lock, flags); |
2339 | ni->allocated_size = new_alloc_size; | 2338 | ni->allocated_size = new_alloc_size; |
2340 | a->data.non_resident.allocated_size = cpu_to_sle64(new_alloc_size); | 2339 | a->data.non_resident.allocated_size = cpu_to_sle64(new_alloc_size); |
2341 | /* | 2340 | /* |
2342 | * FIXME: This would fail if @ni is a directory, $MFT, or an index, | 2341 | * FIXME: This would fail if @ni is a directory, $MFT, or an index, |
2343 | * since those can have sparse/compressed set. For example a directory can be | 2342 | * since those can have sparse/compressed set. For example a directory can be |
2344 | * set compressed even though it is not compressed itself and in that | 2343 | * set compressed even though it is not compressed itself and in that |
2345 | * case the bit means that files are to be created compressed in the | 2344 | * case the bit means that files are to be created compressed in the |
2346 | * directory... At present this is ok as this code is only called for | 2345 | * directory... At present this is ok as this code is only called for |
2347 | * regular files, and only for their $DATA attribute(s). | 2346 | * regular files, and only for their $DATA attribute(s). |
2348 | * FIXME: The calculation is wrong if we created a hole above. For now | 2347 | * FIXME: The calculation is wrong if we created a hole above. For now |
2349 | * it does not matter as we never create holes. | 2348 | * it does not matter as we never create holes. |
2350 | */ | 2349 | */ |
2351 | if (NInoSparse(ni) || NInoCompressed(ni)) { | 2350 | if (NInoSparse(ni) || NInoCompressed(ni)) { |
2352 | ni->itype.compressed.size += new_alloc_size - allocated_size; | 2351 | ni->itype.compressed.size += new_alloc_size - allocated_size; |
2353 | a->data.non_resident.compressed_size = | 2352 | a->data.non_resident.compressed_size = |
2354 | cpu_to_sle64(ni->itype.compressed.size); | 2353 | cpu_to_sle64(ni->itype.compressed.size); |
2355 | vi->i_blocks = ni->itype.compressed.size >> 9; | 2354 | vi->i_blocks = ni->itype.compressed.size >> 9; |
2356 | } else | 2355 | } else |
2357 | vi->i_blocks = new_alloc_size >> 9; | 2356 | vi->i_blocks = new_alloc_size >> 9; |
2358 | write_unlock_irqrestore(&ni->size_lock, flags); | 2357 | write_unlock_irqrestore(&ni->size_lock, flags); |
2359 | alloc_done: | 2358 | alloc_done: |
2360 | if (new_data_size >= 0) { | 2359 | if (new_data_size >= 0) { |
2361 | BUG_ON(new_data_size < | 2360 | BUG_ON(new_data_size < |
2362 | sle64_to_cpu(a->data.non_resident.data_size)); | 2361 | sle64_to_cpu(a->data.non_resident.data_size)); |
2363 | a->data.non_resident.data_size = cpu_to_sle64(new_data_size); | 2362 | a->data.non_resident.data_size = cpu_to_sle64(new_data_size); |
2364 | } | 2363 | } |
2365 | flush_done: | 2364 | flush_done: |
2366 | /* Ensure the changes make it to disk. */ | 2365 | /* Ensure the changes make it to disk. */ |
2367 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 2366 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
2368 | mark_mft_record_dirty(ctx->ntfs_ino); | 2367 | mark_mft_record_dirty(ctx->ntfs_ino); |
2369 | done: | 2368 | done: |
2370 | ntfs_attr_put_search_ctx(ctx); | 2369 | ntfs_attr_put_search_ctx(ctx); |
2371 | unmap_mft_record(base_ni); | 2370 | unmap_mft_record(base_ni); |
2372 | up_write(&ni->runlist.lock); | 2371 | up_write(&ni->runlist.lock); |
2373 | ntfs_debug("Done, new_allocated_size 0x%llx.", | 2372 | ntfs_debug("Done, new_allocated_size 0x%llx.", |
2374 | (unsigned long long)new_alloc_size); | 2373 | (unsigned long long)new_alloc_size); |
2375 | return new_alloc_size; | 2374 | return new_alloc_size; |
2376 | restore_undo_alloc: | 2375 | restore_undo_alloc: |
2377 | if (start < 0 || start >= allocated_size) | 2376 | if (start < 0 || start >= allocated_size) |
2378 | ntfs_error(vol->sb, "Cannot complete extension of allocation " | 2377 | ntfs_error(vol->sb, "Cannot complete extension of allocation " |
2379 | "of inode 0x%lx, attribute type 0x%x, because " | 2378 | "of inode 0x%lx, attribute type 0x%x, because " |
2380 | "lookup of first attribute extent failed with " | 2379 | "lookup of first attribute extent failed with " |
2381 | "error code %i.", vi->i_ino, | 2380 | "error code %i.", vi->i_ino, |
2382 | (unsigned)le32_to_cpu(ni->type), err); | 2381 | (unsigned)le32_to_cpu(ni->type), err); |
2383 | if (err == -ENOENT) | 2382 | if (err == -ENOENT) |
2384 | err = -EIO; | 2383 | err = -EIO; |
2385 | ntfs_attr_reinit_search_ctx(ctx); | 2384 | ntfs_attr_reinit_search_ctx(ctx); |
2386 | if (ntfs_attr_lookup(ni->type, ni->name, ni->name_len, CASE_SENSITIVE, | 2385 | if (ntfs_attr_lookup(ni->type, ni->name, ni->name_len, CASE_SENSITIVE, |
2387 | allocated_size >> vol->cluster_size_bits, NULL, 0, | 2386 | allocated_size >> vol->cluster_size_bits, NULL, 0, |
2388 | ctx)) { | 2387 | ctx)) { |
2389 | ntfs_error(vol->sb, "Failed to find last attribute extent of " | 2388 | ntfs_error(vol->sb, "Failed to find last attribute extent of " |
2390 | "attribute in error code path. Run chkdsk to " | 2389 | "attribute in error code path. Run chkdsk to " |
2391 | "recover."); | 2390 | "recover."); |
2392 | write_lock_irqsave(&ni->size_lock, flags); | 2391 | write_lock_irqsave(&ni->size_lock, flags); |
2393 | ni->allocated_size = new_alloc_size; | 2392 | ni->allocated_size = new_alloc_size; |
2394 | /* | 2393 | /* |
2395 | * FIXME: This would fail if @ni is a directory... See above. | 2394 | * FIXME: This would fail if @ni is a directory... See above. |
2396 | * FIXME: The calculation is wrong if we created a hole above. | 2395 | * FIXME: The calculation is wrong if we created a hole above. |
2397 | * For now it does not matter as we never create holes. | 2396 | * For now it does not matter as we never create holes. |
2398 | */ | 2397 | */ |
2399 | if (NInoSparse(ni) || NInoCompressed(ni)) { | 2398 | if (NInoSparse(ni) || NInoCompressed(ni)) { |
2400 | ni->itype.compressed.size += new_alloc_size - | 2399 | ni->itype.compressed.size += new_alloc_size - |
2401 | allocated_size; | 2400 | allocated_size; |
2402 | vi->i_blocks = ni->itype.compressed.size >> 9; | 2401 | vi->i_blocks = ni->itype.compressed.size >> 9; |
2403 | } else | 2402 | } else |
2404 | vi->i_blocks = new_alloc_size >> 9; | 2403 | vi->i_blocks = new_alloc_size >> 9; |
2405 | write_unlock_irqrestore(&ni->size_lock, flags); | 2404 | write_unlock_irqrestore(&ni->size_lock, flags); |
2406 | ntfs_attr_put_search_ctx(ctx); | 2405 | ntfs_attr_put_search_ctx(ctx); |
2407 | unmap_mft_record(base_ni); | 2406 | unmap_mft_record(base_ni); |
2408 | up_write(&ni->runlist.lock); | 2407 | up_write(&ni->runlist.lock); |
2409 | /* | 2408 | /* |
2410 | * The only thing that is now wrong is the allocated size of the | 2409 | * The only thing that is now wrong is the allocated size of the |
2411 | * base attribute extent which chkdsk should be able to fix. | 2410 | * base attribute extent which chkdsk should be able to fix. |
2412 | */ | 2411 | */ |
2413 | NVolSetErrors(vol); | 2412 | NVolSetErrors(vol); |
2414 | return err; | 2413 | return err; |
2415 | } | 2414 | } |
2416 | ctx->attr->data.non_resident.highest_vcn = cpu_to_sle64( | 2415 | ctx->attr->data.non_resident.highest_vcn = cpu_to_sle64( |
2417 | (allocated_size >> vol->cluster_size_bits) - 1); | 2416 | (allocated_size >> vol->cluster_size_bits) - 1); |
2418 | undo_alloc: | 2417 | undo_alloc: |
2419 | ll = allocated_size >> vol->cluster_size_bits; | 2418 | ll = allocated_size >> vol->cluster_size_bits; |
2420 | if (ntfs_cluster_free(ni, ll, -1, ctx) < 0) { | 2419 | if (ntfs_cluster_free(ni, ll, -1, ctx) < 0) { |
2421 | ntfs_error(vol->sb, "Failed to release allocated cluster(s) " | 2420 | ntfs_error(vol->sb, "Failed to release allocated cluster(s) " |
2422 | "in error code path. Run chkdsk to recover " | 2421 | "in error code path. Run chkdsk to recover " |
2423 | "the lost cluster(s)."); | 2422 | "the lost cluster(s)."); |
2424 | NVolSetErrors(vol); | 2423 | NVolSetErrors(vol); |
2425 | } | 2424 | } |
2426 | m = ctx->mrec; | 2425 | m = ctx->mrec; |
2427 | a = ctx->attr; | 2426 | a = ctx->attr; |
2428 | /* | 2427 | /* |
2429 | * If the runlist truncation fails and/or the search context is no | 2428 | * If the runlist truncation fails and/or the search context is no |
2430 | * longer valid, we cannot resize the attribute record or build the | 2429 | * longer valid, we cannot resize the attribute record or build the |
2431 | * mapping pairs array thus we mark the inode bad so that no access to | 2430 | * mapping pairs array thus we mark the inode bad so that no access to |
2432 | * the freed clusters can happen. | 2431 | * the freed clusters can happen. |
2433 | */ | 2432 | */ |
2434 | if (ntfs_rl_truncate_nolock(vol, &ni->runlist, ll) || IS_ERR(m)) { | 2433 | if (ntfs_rl_truncate_nolock(vol, &ni->runlist, ll) || IS_ERR(m)) { |
2435 | ntfs_error(vol->sb, "Failed to %s in error code path. Run " | 2434 | ntfs_error(vol->sb, "Failed to %s in error code path. Run " |
2436 | "chkdsk to recover.", IS_ERR(m) ? | 2435 | "chkdsk to recover.", IS_ERR(m) ? |
2437 | "restore attribute search context" : | 2436 | "restore attribute search context" : |
2438 | "truncate attribute runlist"); | 2437 | "truncate attribute runlist"); |
2439 | NVolSetErrors(vol); | 2438 | NVolSetErrors(vol); |
2440 | } else if (mp_rebuilt) { | 2439 | } else if (mp_rebuilt) { |
2441 | if (ntfs_attr_record_resize(m, a, attr_len)) { | 2440 | if (ntfs_attr_record_resize(m, a, attr_len)) { |
2442 | ntfs_error(vol->sb, "Failed to restore attribute " | 2441 | ntfs_error(vol->sb, "Failed to restore attribute " |
2443 | "record in error code path. Run " | 2442 | "record in error code path. Run " |
2444 | "chkdsk to recover."); | 2443 | "chkdsk to recover."); |
2445 | NVolSetErrors(vol); | 2444 | NVolSetErrors(vol); |
2446 | } else /* if (success) */ { | 2445 | } else /* if (success) */ { |
2447 | if (ntfs_mapping_pairs_build(vol, (u8*)a + le16_to_cpu( | 2446 | if (ntfs_mapping_pairs_build(vol, (u8*)a + le16_to_cpu( |
2448 | a->data.non_resident. | 2447 | a->data.non_resident. |
2449 | mapping_pairs_offset), attr_len - | 2448 | mapping_pairs_offset), attr_len - |
2450 | le16_to_cpu(a->data.non_resident. | 2449 | le16_to_cpu(a->data.non_resident. |
2451 | mapping_pairs_offset), rl2, ll, -1, | 2450 | mapping_pairs_offset), rl2, ll, -1, |
2452 | NULL)) { | 2451 | NULL)) { |
2453 | ntfs_error(vol->sb, "Failed to restore " | 2452 | ntfs_error(vol->sb, "Failed to restore " |
2454 | "mapping pairs array in error " | 2453 | "mapping pairs array in error " |
2455 | "code path. Run chkdsk to " | 2454 | "code path. Run chkdsk to " |
2456 | "recover."); | 2455 | "recover."); |
2457 | NVolSetErrors(vol); | 2456 | NVolSetErrors(vol); |
2458 | } | 2457 | } |
2459 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 2458 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
2460 | mark_mft_record_dirty(ctx->ntfs_ino); | 2459 | mark_mft_record_dirty(ctx->ntfs_ino); |
2461 | } | 2460 | } |
2462 | } | 2461 | } |
2463 | err_out: | 2462 | err_out: |
2464 | if (ctx) | 2463 | if (ctx) |
2465 | ntfs_attr_put_search_ctx(ctx); | 2464 | ntfs_attr_put_search_ctx(ctx); |
2466 | if (m) | 2465 | if (m) |
2467 | unmap_mft_record(base_ni); | 2466 | unmap_mft_record(base_ni); |
2468 | up_write(&ni->runlist.lock); | 2467 | up_write(&ni->runlist.lock); |
2469 | conv_err_out: | 2468 | conv_err_out: |
2470 | ntfs_debug("Failed. Returning error code %i.", err); | 2469 | ntfs_debug("Failed. Returning error code %i.", err); |
2471 | return err; | 2470 | return err; |
2472 | } | 2471 | } |
2473 | 2472 | ||
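The cluster-boundary rounding performed at the retry_extend label above is plain mask arithmetic: @start is rounded down and the requested allocation size is rounded up to the volume cluster size. A minimal standalone sketch of that rounding (ordinary userspace C, not kernel code; the 4096-byte cluster size and the example offsets are assumptions chosen purely for illustration):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        int64_t cluster_size = 4096;              /* assumed cluster size */
        int64_t cluster_size_mask = cluster_size - 1;
        int64_t start = 6000, new_alloc_size = 10000;   /* example values */

        if (start > 0)
                start &= ~cluster_size_mask;      /* round down to a cluster */
        new_alloc_size = (new_alloc_size + cluster_size - 1) &
                        ~cluster_size_mask;       /* round up to a cluster */
        printf("start = %lld, new_alloc_size = %lld\n",
               (long long)start, (long long)new_alloc_size);
        return 0;
}

Compiled and run, this prints start = 4096 and new_alloc_size = 12288, i.e. the requested range grows to cover whole clusters on both ends before any clusters are allocated.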
2474 | /** | 2473 | /** |
2475 | * ntfs_attr_set - fill (a part of) an attribute with a byte | 2474 | * ntfs_attr_set - fill (a part of) an attribute with a byte |
2476 | * @ni: ntfs inode describing the attribute to fill | 2475 | * @ni: ntfs inode describing the attribute to fill |
2477 | * @ofs: offset inside the attribute at which to start to fill | 2476 | * @ofs: offset inside the attribute at which to start to fill |
2478 | * @cnt: number of bytes to fill | 2477 | * @cnt: number of bytes to fill |
2479 | * @val: the unsigned 8-bit value with which to fill the attribute | 2478 | * @val: the unsigned 8-bit value with which to fill the attribute |
2480 | * | 2479 | * |
2481 | * Fill @cnt bytes of the attribute described by the ntfs inode @ni starting at | 2480 | * Fill @cnt bytes of the attribute described by the ntfs inode @ni starting at |
2482 | * byte offset @ofs inside the attribute with the constant byte @val. | 2481 | * byte offset @ofs inside the attribute with the constant byte @val. |
2483 | * | 2482 | * |
2484 | * This function is effectively like memset() applied to an ntfs attribute. | 2483 | * This function is effectively like memset() applied to an ntfs attribute. |
2485 | * Note this function actually only operates on the page cache pages belonging | 2484 | * Note this function actually only operates on the page cache pages belonging |
2486 | * to the ntfs attribute and it marks them dirty after doing the memset(). | 2485 | * to the ntfs attribute and it marks them dirty after doing the memset(). |
2487 | * Thus it relies on the vm dirty page write code paths to cause the modified | 2486 | * Thus it relies on the vm dirty page write code paths to cause the modified |
2488 | * pages to be written to the mft record/disk. | 2487 | * pages to be written to the mft record/disk. |
2489 | * | 2488 | * |
2490 | * Return 0 on success and -errno on error. An error code of -ESPIPE means | 2489 | * Return 0 on success and -errno on error. An error code of -ESPIPE means |
2491 | * that @ofs + @cnt were outside the end of the attribute and no write was | 2490 | * that @ofs + @cnt were outside the end of the attribute and no write was |
2492 | * performed. | 2491 | * performed. |
2493 | */ | 2492 | */ |
2494 | int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val) | 2493 | int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val) |
2495 | { | 2494 | { |
2496 | ntfs_volume *vol = ni->vol; | 2495 | ntfs_volume *vol = ni->vol; |
2497 | struct address_space *mapping; | 2496 | struct address_space *mapping; |
2498 | struct page *page; | 2497 | struct page *page; |
2499 | u8 *kaddr; | 2498 | u8 *kaddr; |
2500 | pgoff_t idx, end; | 2499 | pgoff_t idx, end; |
2501 | unsigned start_ofs, end_ofs, size; | 2500 | unsigned start_ofs, end_ofs, size; |
2502 | 2501 | ||
2503 | ntfs_debug("Entering for ofs 0x%llx, cnt 0x%llx, val 0x%hx.", | 2502 | ntfs_debug("Entering for ofs 0x%llx, cnt 0x%llx, val 0x%hx.", |
2504 | (long long)ofs, (long long)cnt, val); | 2503 | (long long)ofs, (long long)cnt, val); |
2505 | BUG_ON(ofs < 0); | 2504 | BUG_ON(ofs < 0); |
2506 | BUG_ON(cnt < 0); | 2505 | BUG_ON(cnt < 0); |
2507 | if (!cnt) | 2506 | if (!cnt) |
2508 | goto done; | 2507 | goto done; |
2509 | /* | 2508 | /* |
2510 | * FIXME: Compressed and encrypted attributes are not supported when | 2509 | * FIXME: Compressed and encrypted attributes are not supported when |
2511 | * writing and we should never have gotten here for them. | 2510 | * writing and we should never have gotten here for them. |
2512 | */ | 2511 | */ |
2513 | BUG_ON(NInoCompressed(ni)); | 2512 | BUG_ON(NInoCompressed(ni)); |
2514 | BUG_ON(NInoEncrypted(ni)); | 2513 | BUG_ON(NInoEncrypted(ni)); |
2515 | mapping = VFS_I(ni)->i_mapping; | 2514 | mapping = VFS_I(ni)->i_mapping; |
2516 | /* Work out the starting index and page offset. */ | 2515 | /* Work out the starting index and page offset. */ |
2517 | idx = ofs >> PAGE_CACHE_SHIFT; | 2516 | idx = ofs >> PAGE_CACHE_SHIFT; |
2518 | start_ofs = ofs & ~PAGE_CACHE_MASK; | 2517 | start_ofs = ofs & ~PAGE_CACHE_MASK; |
2519 | /* Work out the ending index and page offset. */ | 2518 | /* Work out the ending index and page offset. */ |
2520 | end = ofs + cnt; | 2519 | end = ofs + cnt; |
2521 | end_ofs = end & ~PAGE_CACHE_MASK; | 2520 | end_ofs = end & ~PAGE_CACHE_MASK; |
2522 | /* If the end is outside the inode size return -ESPIPE. */ | 2521 | /* If the end is outside the inode size return -ESPIPE. */ |
2523 | if (unlikely(end > i_size_read(VFS_I(ni)))) { | 2522 | if (unlikely(end > i_size_read(VFS_I(ni)))) { |
2524 | ntfs_error(vol->sb, "Request exceeds end of attribute."); | 2523 | ntfs_error(vol->sb, "Request exceeds end of attribute."); |
2525 | return -ESPIPE; | 2524 | return -ESPIPE; |
2526 | } | 2525 | } |
2527 | end >>= PAGE_CACHE_SHIFT; | 2526 | end >>= PAGE_CACHE_SHIFT; |
2528 | /* If there is a first partial page, need to do it the slow way. */ | 2527 | /* If there is a first partial page, need to do it the slow way. */ |
2529 | if (start_ofs) { | 2528 | if (start_ofs) { |
2530 | page = read_mapping_page(mapping, idx, NULL); | 2529 | page = read_mapping_page(mapping, idx, NULL); |
2531 | if (IS_ERR(page)) { | 2530 | if (IS_ERR(page)) { |
2532 | ntfs_error(vol->sb, "Failed to read first partial " | 2531 | ntfs_error(vol->sb, "Failed to read first partial " |
2533 | "page (error, index 0x%lx).", idx); | 2532 | "page (error, index 0x%lx).", idx); |
2534 | return PTR_ERR(page); | 2533 | return PTR_ERR(page); |
2535 | } | 2534 | } |
2536 | /* | 2535 | /* |
2537 | * If the last page is the same as the first page, need to | 2536 | * If the last page is the same as the first page, need to |
2538 | * limit the write to the end offset. | 2537 | * limit the write to the end offset. |
2539 | */ | 2538 | */ |
2540 | size = PAGE_CACHE_SIZE; | 2539 | size = PAGE_CACHE_SIZE; |
2541 | if (idx == end) | 2540 | if (idx == end) |
2542 | size = end_ofs; | 2541 | size = end_ofs; |
2543 | kaddr = kmap_atomic(page); | 2542 | kaddr = kmap_atomic(page); |
2544 | memset(kaddr + start_ofs, val, size - start_ofs); | 2543 | memset(kaddr + start_ofs, val, size - start_ofs); |
2545 | flush_dcache_page(page); | 2544 | flush_dcache_page(page); |
2546 | kunmap_atomic(kaddr); | 2545 | kunmap_atomic(kaddr); |
2547 | set_page_dirty(page); | 2546 | set_page_dirty(page); |
2548 | page_cache_release(page); | 2547 | page_cache_release(page); |
2549 | balance_dirty_pages_ratelimited(mapping); | 2548 | balance_dirty_pages_ratelimited(mapping); |
2550 | cond_resched(); | 2549 | cond_resched(); |
2551 | if (idx == end) | 2550 | if (idx == end) |
2552 | goto done; | 2551 | goto done; |
2553 | idx++; | 2552 | idx++; |
2554 | } | 2553 | } |
2555 | /* Do the whole pages the fast way. */ | 2554 | /* Do the whole pages the fast way. */ |
2556 | for (; idx < end; idx++) { | 2555 | for (; idx < end; idx++) { |
2557 | /* Find or create the current page. (The page is locked.) */ | 2556 | /* Find or create the current page. (The page is locked.) */ |
2558 | page = grab_cache_page(mapping, idx); | 2557 | page = grab_cache_page(mapping, idx); |
2559 | if (unlikely(!page)) { | 2558 | if (unlikely(!page)) { |
2560 | ntfs_error(vol->sb, "Insufficient memory to grab " | 2559 | ntfs_error(vol->sb, "Insufficient memory to grab " |
2561 | "page (index 0x%lx).", idx); | 2560 | "page (index 0x%lx).", idx); |
2562 | return -ENOMEM; | 2561 | return -ENOMEM; |
2563 | } | 2562 | } |
2564 | kaddr = kmap_atomic(page); | 2563 | kaddr = kmap_atomic(page); |
2565 | memset(kaddr, val, PAGE_CACHE_SIZE); | 2564 | memset(kaddr, val, PAGE_CACHE_SIZE); |
2566 | flush_dcache_page(page); | 2565 | flush_dcache_page(page); |
2567 | kunmap_atomic(kaddr); | 2566 | kunmap_atomic(kaddr); |
2568 | /* | 2567 | /* |
2569 | * If the page has buffers, mark them uptodate since buffer | 2568 | * If the page has buffers, mark them uptodate since buffer |
2570 | * state and not page state is definitive in 2.6 kernels. | 2569 | * state and not page state is definitive in 2.6 kernels. |
2571 | */ | 2570 | */ |
2572 | if (page_has_buffers(page)) { | 2571 | if (page_has_buffers(page)) { |
2573 | struct buffer_head *bh, *head; | 2572 | struct buffer_head *bh, *head; |
2574 | 2573 | ||
2575 | bh = head = page_buffers(page); | 2574 | bh = head = page_buffers(page); |
2576 | do { | 2575 | do { |
2577 | set_buffer_uptodate(bh); | 2576 | set_buffer_uptodate(bh); |
2578 | } while ((bh = bh->b_this_page) != head); | 2577 | } while ((bh = bh->b_this_page) != head); |
2579 | } | 2578 | } |
2580 | /* Now that buffers are uptodate, set the page uptodate, too. */ | 2579 | /* Now that buffers are uptodate, set the page uptodate, too. */ |
2581 | SetPageUptodate(page); | 2580 | SetPageUptodate(page); |
2582 | /* | 2581 | /* |
2583 | * Set the page and all its buffers dirty and mark the inode | 2582 | * Set the page and all its buffers dirty and mark the inode |
2584 | * dirty, too. The VM will write the page later on. | 2583 | * dirty, too. The VM will write the page later on. |
2585 | */ | 2584 | */ |
2586 | set_page_dirty(page); | 2585 | set_page_dirty(page); |
2587 | /* Finally unlock and release the page. */ | 2586 | /* Finally unlock and release the page. */ |
2588 | unlock_page(page); | 2587 | unlock_page(page); |
2589 | page_cache_release(page); | 2588 | page_cache_release(page); |
2590 | balance_dirty_pages_ratelimited(mapping); | 2589 | balance_dirty_pages_ratelimited(mapping); |
2591 | cond_resched(); | 2590 | cond_resched(); |
2592 | } | 2591 | } |
2593 | /* If there is a last partial page, need to do it the slow way. */ | 2592 | /* If there is a last partial page, need to do it the slow way. */ |
2594 | if (end_ofs) { | 2593 | if (end_ofs) { |
2595 | page = read_mapping_page(mapping, idx, NULL); | 2594 | page = read_mapping_page(mapping, idx, NULL); |
2596 | if (IS_ERR(page)) { | 2595 | if (IS_ERR(page)) { |
2597 | ntfs_error(vol->sb, "Failed to read last partial page " | 2596 | ntfs_error(vol->sb, "Failed to read last partial page " |
2598 | "(error, index 0x%lx).", idx); | 2597 | "(error, index 0x%lx).", idx); |
2599 | return PTR_ERR(page); | 2598 | return PTR_ERR(page); |
2600 | } | 2599 | } |
2601 | kaddr = kmap_atomic(page); | 2600 | kaddr = kmap_atomic(page); |
2602 | memset(kaddr, val, end_ofs); | 2601 | memset(kaddr, val, end_ofs); |
2603 | flush_dcache_page(page); | 2602 | flush_dcache_page(page); |
2604 | kunmap_atomic(kaddr); | 2603 | kunmap_atomic(kaddr); |
2605 | set_page_dirty(page); | 2604 | set_page_dirty(page); |
2606 | page_cache_release(page); | 2605 | page_cache_release(page); |
2607 | balance_dirty_pages_ratelimited(mapping); | 2606 | balance_dirty_pages_ratelimited(mapping); |
2608 | cond_resched(); | 2607 | cond_resched(); |
2609 | } | 2608 | } |
2610 | done: | 2609 | done: |
2611 | ntfs_debug("Done."); | 2610 | ntfs_debug("Done."); |
2612 | return 0; | 2611 | return 0; |
2613 | } | 2612 | } |
2614 | 2613 | ||
2615 | #endif /* NTFS_RW */ | 2614 | #endif /* NTFS_RW */ |
2616 | 2615 |
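ntfs_attr_set() above splits the byte range [ofs, ofs + cnt) into an optional first partial page, a run of whole pages and an optional last partial page, and memsets each in turn. A minimal standalone sketch of the same index arithmetic (userspace C; the 4 KiB PAGE_SHIFT/PAGE_SIZE values and the example range are assumptions for illustration only, whereas the kernel code uses PAGE_CACHE_SHIFT and PAGE_CACHE_MASK):

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void)
{
        unsigned long long ofs = 5000, cnt = 10000;      /* example range */
        unsigned long long end = ofs + cnt;
        unsigned long idx = ofs >> PAGE_SHIFT;           /* first page index */
        unsigned long end_idx = end >> PAGE_SHIFT;       /* page index holding end */
        unsigned long start_ofs = ofs & (PAGE_SIZE - 1); /* offset in first page */
        unsigned long end_ofs = end & (PAGE_SIZE - 1);   /* offset in last page */

        if (start_ofs) {
                unsigned long size = (idx == end_idx) ? end_ofs : PAGE_SIZE;

                printf("first partial page %lu: memset %lu bytes at offset %lu\n",
                       idx, size - start_ofs, start_ofs);
                if (idx == end_idx)
                        return 0;
                idx++;
        }
        for (; idx < end_idx; idx++)
                printf("whole page %lu: memset all %lu bytes\n", idx, PAGE_SIZE);
        if (end_ofs)
                printf("last partial page %lu: memset %lu bytes at offset 0\n",
                       end_idx, end_ofs);
        return 0;
}

For ofs = 5000 and cnt = 10000 this reports a 3192-byte memset in page 1, a full 4096-byte memset of page 2 and a 2712-byte memset in page 3, which together cover exactly the requested 10000 bytes.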
fs/ntfs/file.c
1 | /* | 1 | /* |
2 | * file.c - NTFS kernel file operations. Part of the Linux-NTFS project. | 2 | * file.c - NTFS kernel file operations. Part of the Linux-NTFS project. |
3 | * | 3 | * |
4 | * Copyright (c) 2001-2011 Anton Altaparmakov and Tuxera Inc. | 4 | * Copyright (c) 2001-2011 Anton Altaparmakov and Tuxera Inc. |
5 | * | 5 | * |
6 | * This program/include file is free software; you can redistribute it and/or | 6 | * This program/include file is free software; you can redistribute it and/or |
7 | * modify it under the terms of the GNU General Public License as published | 7 | * modify it under the terms of the GNU General Public License as published |
8 | * by the Free Software Foundation; either version 2 of the License, or | 8 | * by the Free Software Foundation; either version 2 of the License, or |
9 | * (at your option) any later version. | 9 | * (at your option) any later version. |
10 | * | 10 | * |
11 | * This program/include file is distributed in the hope that it will be | 11 | * This program/include file is distributed in the hope that it will be |
12 | * useful, but WITHOUT ANY WARRANTY; without even the implied warranty | 12 | * useful, but WITHOUT ANY WARRANTY; without even the implied warranty |
13 | * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | 13 | * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
14 | * GNU General Public License for more details. | 14 | * GNU General Public License for more details. |
15 | * | 15 | * |
16 | * You should have received a copy of the GNU General Public License | 16 | * You should have received a copy of the GNU General Public License |
17 | * along with this program (in the main directory of the Linux-NTFS | 17 | * along with this program (in the main directory of the Linux-NTFS |
18 | * distribution in the file COPYING); if not, write to the Free Software | 18 | * distribution in the file COPYING); if not, write to the Free Software |
19 | * Foundation,Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA | 19 | * Foundation,Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA |
20 | */ | 20 | */ |
21 | 21 | ||
22 | #include <linux/buffer_head.h> | 22 | #include <linux/buffer_head.h> |
23 | #include <linux/gfp.h> | 23 | #include <linux/gfp.h> |
24 | #include <linux/pagemap.h> | 24 | #include <linux/pagemap.h> |
25 | #include <linux/pagevec.h> | 25 | #include <linux/pagevec.h> |
26 | #include <linux/sched.h> | 26 | #include <linux/sched.h> |
27 | #include <linux/swap.h> | 27 | #include <linux/swap.h> |
28 | #include <linux/uio.h> | 28 | #include <linux/uio.h> |
29 | #include <linux/writeback.h> | 29 | #include <linux/writeback.h> |
30 | #include <linux/aio.h> | 30 | #include <linux/aio.h> |
31 | 31 | ||
32 | #include <asm/page.h> | 32 | #include <asm/page.h> |
33 | #include <asm/uaccess.h> | 33 | #include <asm/uaccess.h> |
34 | 34 | ||
35 | #include "attrib.h" | 35 | #include "attrib.h" |
36 | #include "bitmap.h" | 36 | #include "bitmap.h" |
37 | #include "inode.h" | 37 | #include "inode.h" |
38 | #include "debug.h" | 38 | #include "debug.h" |
39 | #include "lcnalloc.h" | 39 | #include "lcnalloc.h" |
40 | #include "malloc.h" | 40 | #include "malloc.h" |
41 | #include "mft.h" | 41 | #include "mft.h" |
42 | #include "ntfs.h" | 42 | #include "ntfs.h" |
43 | 43 | ||
44 | /** | 44 | /** |
45 | * ntfs_file_open - called when an inode is about to be opened | 45 | * ntfs_file_open - called when an inode is about to be opened |
46 | * @vi: inode to be opened | 46 | * @vi: inode to be opened |
47 | * @filp: file structure describing the inode | 47 | * @filp: file structure describing the inode |
48 | * | 48 | * |
49 | * Limit file size to the page cache limit on architectures where unsigned long | 49 | * Limit file size to the page cache limit on architectures where unsigned long |
50 | * is 32-bits. This is the most we can do for now without overflowing the page | 50 | * is 32-bits. This is the most we can do for now without overflowing the page |
51 | * cache page index. Doing it this way means we don't run into problems because | 51 | * cache page index. Doing it this way means we don't run into problems because |
52 | * of existing too large files. It would be better to allow the user to read | 52 | * of existing too large files. It would be better to allow the user to read |
53 | * the beginning of the file but I doubt very much anyone is going to hit this | 53 | * the beginning of the file but I doubt very much anyone is going to hit this |
54 | * check on a 32-bit architecture, so there is no point in adding the extra | 54 | * check on a 32-bit architecture, so there is no point in adding the extra |
55 | * complexity required to support this. | 55 | * complexity required to support this. |
56 | * | 56 | * |
57 | * On 64-bit architectures, the check is hopefully optimized away by the | 57 | * On 64-bit architectures, the check is hopefully optimized away by the |
58 | * compiler. | 58 | * compiler. |
59 | * | 59 | * |
60 | * After the check passes, just call generic_file_open() to do its work. | 60 | * After the check passes, just call generic_file_open() to do its work. |
61 | */ | 61 | */ |
62 | static int ntfs_file_open(struct inode *vi, struct file *filp) | 62 | static int ntfs_file_open(struct inode *vi, struct file *filp) |
63 | { | 63 | { |
64 | if (sizeof(unsigned long) < 8) { | 64 | if (sizeof(unsigned long) < 8) { |
65 | if (i_size_read(vi) > MAX_LFS_FILESIZE) | 65 | if (i_size_read(vi) > MAX_LFS_FILESIZE) |
66 | return -EOVERFLOW; | 66 | return -EOVERFLOW; |
67 | } | 67 | } |
68 | return generic_file_open(vi, filp); | 68 | return generic_file_open(vi, filp); |
69 | } | 69 | } |
70 | 70 | ||
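A rough sketch of the limit the check above guards against, assuming the usual 3.x-era definition of MAX_LFS_FILESIZE (the definition itself is not part of this diff):

/*
 * On a 32-bit architecture pgoff_t is an unsigned long, so the page cache
 * index i_size >> PAGE_CACHE_SHIFT must stay representable in 32 bits.
 * With the customary definition
 *
 *	MAX_LFS_FILESIZE = ((loff_t)PAGE_CACHE_SIZE << (BITS_PER_LONG - 1)) - 1
 *	                 = (4096 << 31) - 1 = 2^43 - 1	(just under 8 TiB)
 *
 * the largest permitted i_size still yields a page index below 2^31,
 * comfortably inside pgoff_t. On 64-bit, sizeof(unsigned long) == 8, so the
 * branch is dead code the compiler can drop, as the comment above notes.
 */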
71 | #ifdef NTFS_RW | 71 | #ifdef NTFS_RW |
72 | 72 | ||
73 | /** | 73 | /** |
74 | * ntfs_attr_extend_initialized - extend the initialized size of an attribute | 74 | * ntfs_attr_extend_initialized - extend the initialized size of an attribute |
75 | * @ni: ntfs inode of the attribute to extend | 75 | * @ni: ntfs inode of the attribute to extend |
76 | * @new_init_size: requested new initialized size in bytes | 76 | * @new_init_size: requested new initialized size in bytes |
77 | * @cached_page: store any allocated but unused page here | 77 | * @cached_page: store any allocated but unused page here |
78 | * @lru_pvec: lru-buffering pagevec of the caller | 78 | * @lru_pvec: lru-buffering pagevec of the caller |
79 | * | 79 | * |
80 | * Extend the initialized size of an attribute described by the ntfs inode @ni | 80 | * Extend the initialized size of an attribute described by the ntfs inode @ni |
81 | * to @new_init_size bytes. This involves zeroing any non-sparse space between | 81 | * to @new_init_size bytes. This involves zeroing any non-sparse space between |
82 | * the old initialized size and @new_init_size both in the page cache and on | 82 | * the old initialized size and @new_init_size both in the page cache and on |
83 | * disk (if relevant complete pages are already uptodate in the page cache then | 83 | * disk (if relevant complete pages are already uptodate in the page cache then |
84 | * these are simply marked dirty). | 84 | * these are simply marked dirty). |
85 | * | 85 | * |
86 | * As a side-effect, the file size (vfs inode->i_size) may be incremented as, | 86 | * As a side-effect, the file size (vfs inode->i_size) may be incremented as, |
87 | * in the resident attribute case, it is tied to the initialized size and, in | 87 | * in the resident attribute case, it is tied to the initialized size and, in |
88 | * the non-resident attribute case, it may not fall below the initialized size. | 88 | * the non-resident attribute case, it may not fall below the initialized size. |
89 | * | 89 | * |
90 | * Note that if the attribute is resident, we do not need to touch the page | 90 | * Note that if the attribute is resident, we do not need to touch the page |
91 | * cache at all. This is because if the page cache page is not uptodate we | 91 | * cache at all. This is because if the page cache page is not uptodate we |
92 | * bring it uptodate later, when doing the write to the mft record since we | 92 | * bring it uptodate later, when doing the write to the mft record since we |
93 | * then already have the page mapped. And if the page is uptodate, the | 93 | * then already have the page mapped. And if the page is uptodate, the |
94 | * non-initialized region will already have been zeroed when the page was | 94 | * non-initialized region will already have been zeroed when the page was |
95 | * brought uptodate and the region may in fact already have been overwritten | 95 | * brought uptodate and the region may in fact already have been overwritten |
96 | * with new data via mmap() based writes, so we cannot just zero it. And since | 96 | * with new data via mmap() based writes, so we cannot just zero it. And since |
97 | * POSIX specifies that the behaviour of resizing a file whilst it is mmap()ped | 97 | * POSIX specifies that the behaviour of resizing a file whilst it is mmap()ped |
98 | * is unspecified, we choose not to do zeroing and thus we do not need to touch | 98 | * is unspecified, we choose not to do zeroing and thus we do not need to touch |
99 | * the page at all. For a more detailed explanation see ntfs_truncate() in | 99 | * the page at all. For a more detailed explanation see ntfs_truncate() in |
100 | * fs/ntfs/inode.c. | 100 | * fs/ntfs/inode.c. |
101 | * | 101 | * |
102 | * Return 0 on success and -errno on error. In the case that an error is | 102 | * Return 0 on success and -errno on error. In the case that an error is |
103 | * encountered it is possible that the initialized size will already have been | 103 | * encountered it is possible that the initialized size will already have been |
104 | * incremented some way towards @new_init_size but it is guaranteed that if | 104 | * incremented some way towards @new_init_size but it is guaranteed that if |
105 | * this is the case, the necessary zeroing will also have happened and that all | 105 | * this is the case, the necessary zeroing will also have happened and that all |
106 | * metadata is self-consistent. | 106 | * metadata is self-consistent. |
107 | * | 107 | * |
108 | * Locking: i_mutex on the vfs inode corresponding to the ntfs inode @ni must be | 108 | * Locking: i_mutex on the vfs inode corresponding to the ntfs inode @ni must be |
109 | * held by the caller. | 109 | * held by the caller. |
110 | */ | 110 | */ |
111 | static int ntfs_attr_extend_initialized(ntfs_inode *ni, const s64 new_init_size) | 111 | static int ntfs_attr_extend_initialized(ntfs_inode *ni, const s64 new_init_size) |
112 | { | 112 | { |
113 | s64 old_init_size; | 113 | s64 old_init_size; |
114 | loff_t old_i_size; | 114 | loff_t old_i_size; |
115 | pgoff_t index, end_index; | 115 | pgoff_t index, end_index; |
116 | unsigned long flags; | 116 | unsigned long flags; |
117 | struct inode *vi = VFS_I(ni); | 117 | struct inode *vi = VFS_I(ni); |
118 | ntfs_inode *base_ni; | 118 | ntfs_inode *base_ni; |
119 | MFT_RECORD *m = NULL; | 119 | MFT_RECORD *m = NULL; |
120 | ATTR_RECORD *a; | 120 | ATTR_RECORD *a; |
121 | ntfs_attr_search_ctx *ctx = NULL; | 121 | ntfs_attr_search_ctx *ctx = NULL; |
122 | struct address_space *mapping; | 122 | struct address_space *mapping; |
123 | struct page *page = NULL; | 123 | struct page *page = NULL; |
124 | u8 *kattr; | 124 | u8 *kattr; |
125 | int err; | 125 | int err; |
126 | u32 attr_len; | 126 | u32 attr_len; |
127 | 127 | ||
128 | read_lock_irqsave(&ni->size_lock, flags); | 128 | read_lock_irqsave(&ni->size_lock, flags); |
129 | old_init_size = ni->initialized_size; | 129 | old_init_size = ni->initialized_size; |
130 | old_i_size = i_size_read(vi); | 130 | old_i_size = i_size_read(vi); |
131 | BUG_ON(new_init_size > ni->allocated_size); | 131 | BUG_ON(new_init_size > ni->allocated_size); |
132 | read_unlock_irqrestore(&ni->size_lock, flags); | 132 | read_unlock_irqrestore(&ni->size_lock, flags); |
133 | ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, " | 133 | ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, " |
134 | "old_initialized_size 0x%llx, " | 134 | "old_initialized_size 0x%llx, " |
135 | "new_initialized_size 0x%llx, i_size 0x%llx.", | 135 | "new_initialized_size 0x%llx, i_size 0x%llx.", |
136 | vi->i_ino, (unsigned)le32_to_cpu(ni->type), | 136 | vi->i_ino, (unsigned)le32_to_cpu(ni->type), |
137 | (unsigned long long)old_init_size, | 137 | (unsigned long long)old_init_size, |
138 | (unsigned long long)new_init_size, old_i_size); | 138 | (unsigned long long)new_init_size, old_i_size); |
139 | if (!NInoAttr(ni)) | 139 | if (!NInoAttr(ni)) |
140 | base_ni = ni; | 140 | base_ni = ni; |
141 | else | 141 | else |
142 | base_ni = ni->ext.base_ntfs_ino; | 142 | base_ni = ni->ext.base_ntfs_ino; |
143 | /* Use goto to reduce indentation and we need the label below anyway. */ | 143 | /* Use goto to reduce indentation and we need the label below anyway. */ |
144 | if (NInoNonResident(ni)) | 144 | if (NInoNonResident(ni)) |
145 | goto do_non_resident_extend; | 145 | goto do_non_resident_extend; |
146 | BUG_ON(old_init_size != old_i_size); | 146 | BUG_ON(old_init_size != old_i_size); |
147 | m = map_mft_record(base_ni); | 147 | m = map_mft_record(base_ni); |
148 | if (IS_ERR(m)) { | 148 | if (IS_ERR(m)) { |
149 | err = PTR_ERR(m); | 149 | err = PTR_ERR(m); |
150 | m = NULL; | 150 | m = NULL; |
151 | goto err_out; | 151 | goto err_out; |
152 | } | 152 | } |
153 | ctx = ntfs_attr_get_search_ctx(base_ni, m); | 153 | ctx = ntfs_attr_get_search_ctx(base_ni, m); |
154 | if (unlikely(!ctx)) { | 154 | if (unlikely(!ctx)) { |
155 | err = -ENOMEM; | 155 | err = -ENOMEM; |
156 | goto err_out; | 156 | goto err_out; |
157 | } | 157 | } |
158 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, | 158 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, |
159 | CASE_SENSITIVE, 0, NULL, 0, ctx); | 159 | CASE_SENSITIVE, 0, NULL, 0, ctx); |
160 | if (unlikely(err)) { | 160 | if (unlikely(err)) { |
161 | if (err == -ENOENT) | 161 | if (err == -ENOENT) |
162 | err = -EIO; | 162 | err = -EIO; |
163 | goto err_out; | 163 | goto err_out; |
164 | } | 164 | } |
165 | m = ctx->mrec; | 165 | m = ctx->mrec; |
166 | a = ctx->attr; | 166 | a = ctx->attr; |
167 | BUG_ON(a->non_resident); | 167 | BUG_ON(a->non_resident); |
168 | /* The total length of the attribute value. */ | 168 | /* The total length of the attribute value. */ |
169 | attr_len = le32_to_cpu(a->data.resident.value_length); | 169 | attr_len = le32_to_cpu(a->data.resident.value_length); |
170 | BUG_ON(old_i_size != (loff_t)attr_len); | 170 | BUG_ON(old_i_size != (loff_t)attr_len); |
171 | /* | 171 | /* |
172 | * Do the zeroing in the mft record and update the attribute size in | 172 | * Do the zeroing in the mft record and update the attribute size in |
173 | * the mft record. | 173 | * the mft record. |
174 | */ | 174 | */ |
175 | kattr = (u8*)a + le16_to_cpu(a->data.resident.value_offset); | 175 | kattr = (u8*)a + le16_to_cpu(a->data.resident.value_offset); |
176 | memset(kattr + attr_len, 0, new_init_size - attr_len); | 176 | memset(kattr + attr_len, 0, new_init_size - attr_len); |
177 | a->data.resident.value_length = cpu_to_le32((u32)new_init_size); | 177 | a->data.resident.value_length = cpu_to_le32((u32)new_init_size); |
178 | /* Finally, update the sizes in the vfs and ntfs inodes. */ | 178 | /* Finally, update the sizes in the vfs and ntfs inodes. */ |
179 | write_lock_irqsave(&ni->size_lock, flags); | 179 | write_lock_irqsave(&ni->size_lock, flags); |
180 | i_size_write(vi, new_init_size); | 180 | i_size_write(vi, new_init_size); |
181 | ni->initialized_size = new_init_size; | 181 | ni->initialized_size = new_init_size; |
182 | write_unlock_irqrestore(&ni->size_lock, flags); | 182 | write_unlock_irqrestore(&ni->size_lock, flags); |
183 | goto done; | 183 | goto done; |
184 | do_non_resident_extend: | 184 | do_non_resident_extend: |
185 | /* | 185 | /* |
186 | * If the new initialized size @new_init_size exceeds the current file | 186 | * If the new initialized size @new_init_size exceeds the current file |
187 | * size (vfs inode->i_size), we need to extend the file size to the | 187 | * size (vfs inode->i_size), we need to extend the file size to the |
188 | * new initialized size. | 188 | * new initialized size. |
189 | */ | 189 | */ |
190 | if (new_init_size > old_i_size) { | 190 | if (new_init_size > old_i_size) { |
191 | m = map_mft_record(base_ni); | 191 | m = map_mft_record(base_ni); |
192 | if (IS_ERR(m)) { | 192 | if (IS_ERR(m)) { |
193 | err = PTR_ERR(m); | 193 | err = PTR_ERR(m); |
194 | m = NULL; | 194 | m = NULL; |
195 | goto err_out; | 195 | goto err_out; |
196 | } | 196 | } |
197 | ctx = ntfs_attr_get_search_ctx(base_ni, m); | 197 | ctx = ntfs_attr_get_search_ctx(base_ni, m); |
198 | if (unlikely(!ctx)) { | 198 | if (unlikely(!ctx)) { |
199 | err = -ENOMEM; | 199 | err = -ENOMEM; |
200 | goto err_out; | 200 | goto err_out; |
201 | } | 201 | } |
202 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, | 202 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, |
203 | CASE_SENSITIVE, 0, NULL, 0, ctx); | 203 | CASE_SENSITIVE, 0, NULL, 0, ctx); |
204 | if (unlikely(err)) { | 204 | if (unlikely(err)) { |
205 | if (err == -ENOENT) | 205 | if (err == -ENOENT) |
206 | err = -EIO; | 206 | err = -EIO; |
207 | goto err_out; | 207 | goto err_out; |
208 | } | 208 | } |
209 | m = ctx->mrec; | 209 | m = ctx->mrec; |
210 | a = ctx->attr; | 210 | a = ctx->attr; |
211 | BUG_ON(!a->non_resident); | 211 | BUG_ON(!a->non_resident); |
212 | BUG_ON(old_i_size != (loff_t) | 212 | BUG_ON(old_i_size != (loff_t) |
213 | sle64_to_cpu(a->data.non_resident.data_size)); | 213 | sle64_to_cpu(a->data.non_resident.data_size)); |
214 | a->data.non_resident.data_size = cpu_to_sle64(new_init_size); | 214 | a->data.non_resident.data_size = cpu_to_sle64(new_init_size); |
215 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 215 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
216 | mark_mft_record_dirty(ctx->ntfs_ino); | 216 | mark_mft_record_dirty(ctx->ntfs_ino); |
217 | /* Update the file size in the vfs inode. */ | 217 | /* Update the file size in the vfs inode. */ |
218 | i_size_write(vi, new_init_size); | 218 | i_size_write(vi, new_init_size); |
219 | ntfs_attr_put_search_ctx(ctx); | 219 | ntfs_attr_put_search_ctx(ctx); |
220 | ctx = NULL; | 220 | ctx = NULL; |
221 | unmap_mft_record(base_ni); | 221 | unmap_mft_record(base_ni); |
222 | m = NULL; | 222 | m = NULL; |
223 | } | 223 | } |
224 | mapping = vi->i_mapping; | 224 | mapping = vi->i_mapping; |
225 | index = old_init_size >> PAGE_CACHE_SHIFT; | 225 | index = old_init_size >> PAGE_CACHE_SHIFT; |
226 | end_index = (new_init_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; | 226 | end_index = (new_init_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; |
227 | do { | 227 | do { |
228 | /* | 228 | /* |
229 | * Read the page. If the page is not present, this will zero | 229 | * Read the page. If the page is not present, this will zero |
230 | * the uninitialized regions for us. | 230 | * the uninitialized regions for us. |
231 | */ | 231 | */ |
232 | page = read_mapping_page(mapping, index, NULL); | 232 | page = read_mapping_page(mapping, index, NULL); |
233 | if (IS_ERR(page)) { | 233 | if (IS_ERR(page)) { |
234 | err = PTR_ERR(page); | 234 | err = PTR_ERR(page); |
235 | goto init_err_out; | 235 | goto init_err_out; |
236 | } | 236 | } |
237 | if (unlikely(PageError(page))) { | 237 | if (unlikely(PageError(page))) { |
238 | page_cache_release(page); | 238 | page_cache_release(page); |
239 | err = -EIO; | 239 | err = -EIO; |
240 | goto init_err_out; | 240 | goto init_err_out; |
241 | } | 241 | } |
242 | /* | 242 | /* |
243 | * Update the initialized size in the ntfs inode. This is | 243 | * Update the initialized size in the ntfs inode. This is |
244 | * enough to make ntfs_writepage() work. | 244 | * enough to make ntfs_writepage() work. |
245 | */ | 245 | */ |
246 | write_lock_irqsave(&ni->size_lock, flags); | 246 | write_lock_irqsave(&ni->size_lock, flags); |
247 | ni->initialized_size = (s64)(index + 1) << PAGE_CACHE_SHIFT; | 247 | ni->initialized_size = (s64)(index + 1) << PAGE_CACHE_SHIFT; |
248 | if (ni->initialized_size > new_init_size) | 248 | if (ni->initialized_size > new_init_size) |
249 | ni->initialized_size = new_init_size; | 249 | ni->initialized_size = new_init_size; |
250 | write_unlock_irqrestore(&ni->size_lock, flags); | 250 | write_unlock_irqrestore(&ni->size_lock, flags); |
251 | /* Set the page dirty so it gets written out. */ | 251 | /* Set the page dirty so it gets written out. */ |
252 | set_page_dirty(page); | 252 | set_page_dirty(page); |
253 | page_cache_release(page); | 253 | page_cache_release(page); |
254 | /* | 254 | /* |
255 | * Play nice with the vm and the rest of the system. This is | 255 | * Play nice with the vm and the rest of the system. This is |
256 | * very much needed as we can potentially be modifying the | 256 | * very much needed as we can potentially be modifying the |
257 | * initialised size from a very small value to a really huge | 257 | * initialised size from a very small value to a really huge |
258 | * value, e.g. | 258 | * value, e.g. |
259 | * f = open(somefile, O_TRUNC); | 259 | * f = open(somefile, O_TRUNC); |
260 | * truncate(f, 10GiB); | 260 | * truncate(f, 10GiB); |
261 | * seek(f, 10GiB); | 261 | * seek(f, 10GiB); |
262 | * write(f, 1); | 262 | * write(f, 1); |
263 | * And this would mean we would be marking dirty hundreds of | 263 | * And this would mean we would be marking dirty hundreds of |
264 | * thousands of pages or as in the above example more than | 264 | * thousands of pages or as in the above example more than |
265 | * two and a half million pages! | 265 | * two and a half million pages! |
266 | * | 266 | * |
267 | * TODO: For sparse pages could optimize this workload by using | 267 | * TODO: For sparse pages could optimize this workload by using |
268 | * the FsMisc / MiscFs page bit as a "PageIsSparse" bit. This | 268 | * the FsMisc / MiscFs page bit as a "PageIsSparse" bit. This |
269 | * would be set in readpage for sparse pages and here we would | 269 | * would be set in readpage for sparse pages and here we would |
270 | * not need to mark dirty any pages which have this bit set. | 270 | * not need to mark dirty any pages which have this bit set. |
271 | * The only caveat is that we have to clear the bit everywhere | 271 | * The only caveat is that we have to clear the bit everywhere |
272 | * where we allocate any clusters that lie in the page or that | 272 | * where we allocate any clusters that lie in the page or that |
273 | * contain the page. | 273 | * contain the page. |
274 | * | 274 | * |
275 | * TODO: An even greater optimization would be for us to only | 275 | * TODO: An even greater optimization would be for us to only |
276 | * call readpage() on pages which are not in sparse regions as | 276 | * call readpage() on pages which are not in sparse regions as |
277 | * determined from the runlist. This would greatly reduce the | 277 | * determined from the runlist. This would greatly reduce the |
278 | * number of pages we read and make dirty in the case of sparse | 278 | * number of pages we read and make dirty in the case of sparse |
279 | * files. | 279 | * files. |
280 | */ | 280 | */ |
281 | balance_dirty_pages_ratelimited(mapping); | 281 | balance_dirty_pages_ratelimited(mapping); |
282 | cond_resched(); | 282 | cond_resched(); |
283 | } while (++index < end_index); | 283 | } while (++index < end_index); |
284 | read_lock_irqsave(&ni->size_lock, flags); | 284 | read_lock_irqsave(&ni->size_lock, flags); |
285 | BUG_ON(ni->initialized_size != new_init_size); | 285 | BUG_ON(ni->initialized_size != new_init_size); |
286 | read_unlock_irqrestore(&ni->size_lock, flags); | 286 | read_unlock_irqrestore(&ni->size_lock, flags); |
287 | /* Now bring in sync the initialized_size in the mft record. */ | 287 | /* Now bring in sync the initialized_size in the mft record. */ |
288 | m = map_mft_record(base_ni); | 288 | m = map_mft_record(base_ni); |
289 | if (IS_ERR(m)) { | 289 | if (IS_ERR(m)) { |
290 | err = PTR_ERR(m); | 290 | err = PTR_ERR(m); |
291 | m = NULL; | 291 | m = NULL; |
292 | goto init_err_out; | 292 | goto init_err_out; |
293 | } | 293 | } |
294 | ctx = ntfs_attr_get_search_ctx(base_ni, m); | 294 | ctx = ntfs_attr_get_search_ctx(base_ni, m); |
295 | if (unlikely(!ctx)) { | 295 | if (unlikely(!ctx)) { |
296 | err = -ENOMEM; | 296 | err = -ENOMEM; |
297 | goto init_err_out; | 297 | goto init_err_out; |
298 | } | 298 | } |
299 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, | 299 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, |
300 | CASE_SENSITIVE, 0, NULL, 0, ctx); | 300 | CASE_SENSITIVE, 0, NULL, 0, ctx); |
301 | if (unlikely(err)) { | 301 | if (unlikely(err)) { |
302 | if (err == -ENOENT) | 302 | if (err == -ENOENT) |
303 | err = -EIO; | 303 | err = -EIO; |
304 | goto init_err_out; | 304 | goto init_err_out; |
305 | } | 305 | } |
306 | m = ctx->mrec; | 306 | m = ctx->mrec; |
307 | a = ctx->attr; | 307 | a = ctx->attr; |
308 | BUG_ON(!a->non_resident); | 308 | BUG_ON(!a->non_resident); |
309 | a->data.non_resident.initialized_size = cpu_to_sle64(new_init_size); | 309 | a->data.non_resident.initialized_size = cpu_to_sle64(new_init_size); |
310 | done: | 310 | done: |
311 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 311 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
312 | mark_mft_record_dirty(ctx->ntfs_ino); | 312 | mark_mft_record_dirty(ctx->ntfs_ino); |
313 | if (ctx) | 313 | if (ctx) |
314 | ntfs_attr_put_search_ctx(ctx); | 314 | ntfs_attr_put_search_ctx(ctx); |
315 | if (m) | 315 | if (m) |
316 | unmap_mft_record(base_ni); | 316 | unmap_mft_record(base_ni); |
317 | ntfs_debug("Done, initialized_size 0x%llx, i_size 0x%llx.", | 317 | ntfs_debug("Done, initialized_size 0x%llx, i_size 0x%llx.", |
318 | (unsigned long long)new_init_size, i_size_read(vi)); | 318 | (unsigned long long)new_init_size, i_size_read(vi)); |
319 | return 0; | 319 | return 0; |
320 | init_err_out: | 320 | init_err_out: |
321 | write_lock_irqsave(&ni->size_lock, flags); | 321 | write_lock_irqsave(&ni->size_lock, flags); |
322 | ni->initialized_size = old_init_size; | 322 | ni->initialized_size = old_init_size; |
323 | write_unlock_irqrestore(&ni->size_lock, flags); | 323 | write_unlock_irqrestore(&ni->size_lock, flags); |
324 | err_out: | 324 | err_out: |
325 | if (ctx) | 325 | if (ctx) |
326 | ntfs_attr_put_search_ctx(ctx); | 326 | ntfs_attr_put_search_ctx(ctx); |
327 | if (m) | 327 | if (m) |
328 | unmap_mft_record(base_ni); | 328 | unmap_mft_record(base_ni); |
329 | ntfs_debug("Failed. Returning error code %i.", err); | 329 | ntfs_debug("Failed. Returning error code %i.", err); |
330 | return err; | 330 | return err; |
331 | } | 331 | } |
332 | 332 | ||
333 | /** | 333 | /** |
334 | * ntfs_fault_in_pages_readable - | 334 | * ntfs_fault_in_pages_readable - |
335 | * | 335 | * |
336 | * Fault a number of userspace pages into pagetables. | 336 | * Fault a number of userspace pages into pagetables. |
337 | * | 337 | * |
338 | * Unlike include/linux/pagemap.h::fault_in_pages_readable(), this one copes | 338 | * Unlike include/linux/pagemap.h::fault_in_pages_readable(), this one copes |
339 | * with more than two userspace pages as well as handling the single page case | 339 | * with more than two userspace pages as well as handling the single page case |
340 | * elegantly. | 340 | * elegantly. |
341 | * | 341 | * |
342 | * If you find this difficult to understand, then think of the while loop being | 342 | * If you find this difficult to understand, then think of the while loop being |
343 | * the following code, except that we do without the integer variable ret: | 343 | * the following code, except that we do without the integer variable ret: |
344 | * | 344 | * |
345 | * do { | 345 | * do { |
346 | * ret = __get_user(c, uaddr); | 346 | * ret = __get_user(c, uaddr); |
347 | * uaddr += PAGE_SIZE; | 347 | * uaddr += PAGE_SIZE; |
348 | * } while (!ret && uaddr < end); | 348 | * } while (!ret && uaddr < end); |
349 | * | 349 | * |
350 | * Note, the final __get_user() may well run out-of-bounds of the user buffer, | 350 | * Note, the final __get_user() may well run out-of-bounds of the user buffer, |
351 | * but _not_ out-of-bounds of the page the user buffer belongs to, and since | 351 | * but _not_ out-of-bounds of the page the user buffer belongs to, and since |
352 | * this is only a read and not a write, and since it is still in the same page, | 352 | * this is only a read and not a write, and since it is still in the same page, |
353 | * it should not matter and this makes the code much simpler. | 353 | * it should not matter and this makes the code much simpler. |
354 | */ | 354 | */ |
355 | static inline void ntfs_fault_in_pages_readable(const char __user *uaddr, | 355 | static inline void ntfs_fault_in_pages_readable(const char __user *uaddr, |
356 | int bytes) | 356 | int bytes) |
357 | { | 357 | { |
358 | const char __user *end; | 358 | const char __user *end; |
359 | volatile char c; | 359 | volatile char c; |
360 | 360 | ||
361 | /* Set @end to the first byte outside the last page we care about. */ | 361 | /* Set @end to the first byte outside the last page we care about. */ |
362 | end = (const char __user*)PAGE_ALIGN((unsigned long)uaddr + bytes); | 362 | end = (const char __user*)PAGE_ALIGN((unsigned long)uaddr + bytes); |
363 | 363 | ||
364 | while (!__get_user(c, uaddr) && (uaddr += PAGE_SIZE, uaddr < end)) | 364 | while (!__get_user(c, uaddr) && (uaddr += PAGE_SIZE, uaddr < end)) |
365 | ; | 365 | ; |
366 | } | 366 | } |
367 | 367 | ||
368 | /** | 368 | /** |
369 | * ntfs_fault_in_pages_readable_iovec - | 369 | * ntfs_fault_in_pages_readable_iovec - |
370 | * | 370 | * |
371 | * Same as ntfs_fault_in_pages_readable() but operates on an array of iovecs. | 371 | * Same as ntfs_fault_in_pages_readable() but operates on an array of iovecs. |
372 | */ | 372 | */ |
373 | static inline void ntfs_fault_in_pages_readable_iovec(const struct iovec *iov, | 373 | static inline void ntfs_fault_in_pages_readable_iovec(const struct iovec *iov, |
374 | size_t iov_ofs, int bytes) | 374 | size_t iov_ofs, int bytes) |
375 | { | 375 | { |
376 | do { | 376 | do { |
377 | const char __user *buf; | 377 | const char __user *buf; |
378 | unsigned len; | 378 | unsigned len; |
379 | 379 | ||
380 | buf = iov->iov_base + iov_ofs; | 380 | buf = iov->iov_base + iov_ofs; |
381 | len = iov->iov_len - iov_ofs; | 381 | len = iov->iov_len - iov_ofs; |
382 | if (len > bytes) | 382 | if (len > bytes) |
383 | len = bytes; | 383 | len = bytes; |
384 | ntfs_fault_in_pages_readable(buf, len); | 384 | ntfs_fault_in_pages_readable(buf, len); |
385 | bytes -= len; | 385 | bytes -= len; |
386 | iov++; | 386 | iov++; |
387 | iov_ofs = 0; | 387 | iov_ofs = 0; |
388 | } while (bytes); | 388 | } while (bytes); |
389 | } | 389 | } |
390 | 390 | ||
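A hedged usage sketch of why the source buffer is faulted in up front; the calling sequence is paraphrased from the buffered write path mentioned later in this file, not quoted from the diff:

/*
 * Sketch, assuming the usual ntfs_file_buffered_write() ordering:
 *
 *	ntfs_fault_in_pages_readable_iovec(iov, iov_ofs, bytes);
 *	err = __ntfs_grab_cache_pages(mapping, index, nr_pages, pages,
 *			&cached_page);
 *	... copy from the user buffer into the now-locked pages ...
 *
 * Pre-faulting matters because the copy runs while the destination pages
 * are locked; if the source buffer were an mmap() of the same file and not
 * yet resident, taking the fault during the copy could block on the very
 * page locks this thread already holds.
 */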
391 | /** | 391 | /** |
392 | * __ntfs_grab_cache_pages - obtain a number of locked pages | 392 | * __ntfs_grab_cache_pages - obtain a number of locked pages |
393 | * @mapping: address space mapping from which to obtain page cache pages | 393 | * @mapping: address space mapping from which to obtain page cache pages |
394 | * @index: starting index in @mapping at which to begin obtaining pages | 394 | * @index: starting index in @mapping at which to begin obtaining pages |
395 | * @nr_pages: number of page cache pages to obtain | 395 | * @nr_pages: number of page cache pages to obtain |
396 | * @pages: array of pages in which to return the obtained page cache pages | 396 | * @pages: array of pages in which to return the obtained page cache pages |
397 | * @cached_page: allocated but as yet unused page | 397 | * @cached_page: allocated but as yet unused page |
398 | * @lru_pvec: lru-buffering pagevec of caller | 398 | * @lru_pvec: lru-buffering pagevec of caller |
399 | * | 399 | * |
400 | * Obtain @nr_pages locked page cache pages from the mapping @mapping and | 400 | * Obtain @nr_pages locked page cache pages from the mapping @mapping and |
401 | * starting at index @index. | 401 | * starting at index @index. |
402 | * | 402 | * |
403 | * If a page is newly created, add it to the LRU list. | 403 | * If a page is newly created, add it to the LRU list. |
404 | * | 404 | * |
405 | * Note, the page locks are obtained in ascending page index order. | 405 | * Note, the page locks are obtained in ascending page index order. |
406 | */ | 406 | */ |
407 | static inline int __ntfs_grab_cache_pages(struct address_space *mapping, | 407 | static inline int __ntfs_grab_cache_pages(struct address_space *mapping, |
408 | pgoff_t index, const unsigned nr_pages, struct page **pages, | 408 | pgoff_t index, const unsigned nr_pages, struct page **pages, |
409 | struct page **cached_page) | 409 | struct page **cached_page) |
410 | { | 410 | { |
411 | int err, nr; | 411 | int err, nr; |
412 | 412 | ||
413 | BUG_ON(!nr_pages); | 413 | BUG_ON(!nr_pages); |
414 | err = nr = 0; | 414 | err = nr = 0; |
415 | do { | 415 | do { |
416 | pages[nr] = find_lock_page(mapping, index); | 416 | pages[nr] = find_lock_page(mapping, index); |
417 | if (!pages[nr]) { | 417 | if (!pages[nr]) { |
418 | if (!*cached_page) { | 418 | if (!*cached_page) { |
419 | *cached_page = page_cache_alloc(mapping); | 419 | *cached_page = page_cache_alloc(mapping); |
420 | if (unlikely(!*cached_page)) { | 420 | if (unlikely(!*cached_page)) { |
421 | err = -ENOMEM; | 421 | err = -ENOMEM; |
422 | goto err_out; | 422 | goto err_out; |
423 | } | 423 | } |
424 | } | 424 | } |
425 | err = add_to_page_cache_lru(*cached_page, mapping, index, | 425 | err = add_to_page_cache_lru(*cached_page, mapping, index, |
426 | GFP_KERNEL); | 426 | GFP_KERNEL); |
427 | if (unlikely(err)) { | 427 | if (unlikely(err)) { |
428 | if (err == -EEXIST) | 428 | if (err == -EEXIST) |
429 | continue; | 429 | continue; |
430 | goto err_out; | 430 | goto err_out; |
431 | } | 431 | } |
432 | pages[nr] = *cached_page; | 432 | pages[nr] = *cached_page; |
433 | *cached_page = NULL; | 433 | *cached_page = NULL; |
434 | } | 434 | } |
435 | index++; | 435 | index++; |
436 | nr++; | 436 | nr++; |
437 | } while (nr < nr_pages); | 437 | } while (nr < nr_pages); |
438 | out: | 438 | out: |
439 | return err; | 439 | return err; |
440 | err_out: | 440 | err_out: |
441 | while (nr > 0) { | 441 | while (nr > 0) { |
442 | unlock_page(pages[--nr]); | 442 | unlock_page(pages[--nr]); |
443 | page_cache_release(pages[nr]); | 443 | page_cache_release(pages[nr]); |
444 | } | 444 | } |
445 | goto out; | 445 | goto out; |
446 | } | 446 | } |
447 | 447 | ||
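The loop above is a per-filesystem copy of the find-or-create pattern that the change described at the top of this commit consolidates behind a single flag-driven helper for the generic page cache APIs. A minimal single-page sketch of the same pattern, built only from the calls visible in the function above (ntfs_get_one_locked_page() is a hypothetical name, not an existing kernel function):

static struct page *ntfs_get_one_locked_page(struct address_space *mapping,
		pgoff_t index)
{
	struct page *page;
	int err;

	for (;;) {
		/* Fast path: the page already exists; return it locked. */
		page = find_lock_page(mapping, index);
		if (page)
			return page;
		page = page_cache_alloc(mapping);
		if (!page)
			return NULL;
		/* On success the new page is inserted locked and put on the LRU. */
		err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
		if (!err)
			return page;
		page_cache_release(page);
		if (err != -EEXIST)
			return NULL;
		/* Lost an insertion race with another task; retry the lookup. */
	}
}

The real __ntfs_grab_cache_pages() differs in that it keeps the unused allocation in *cached_page across -EEXIST retries instead of freeing it, and it grabs nr_pages consecutive indices in one pass, unwinding the locks it already holds on error.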
448 | static inline int ntfs_submit_bh_for_read(struct buffer_head *bh) | 448 | static inline int ntfs_submit_bh_for_read(struct buffer_head *bh) |
449 | { | 449 | { |
450 | lock_buffer(bh); | 450 | lock_buffer(bh); |
451 | get_bh(bh); | 451 | get_bh(bh); |
452 | bh->b_end_io = end_buffer_read_sync; | 452 | bh->b_end_io = end_buffer_read_sync; |
453 | return submit_bh(READ, bh); | 453 | return submit_bh(READ, bh); |
454 | } | 454 | } |
455 | 455 | ||
456 | /** | 456 | /** |
457 | * ntfs_prepare_pages_for_non_resident_write - prepare pages for receiving data | 457 | * ntfs_prepare_pages_for_non_resident_write - prepare pages for receiving data |
458 | * @pages: array of destination pages | 458 | * @pages: array of destination pages |
459 | * @nr_pages: number of pages in @pages | 459 | * @nr_pages: number of pages in @pages |
460 | * @pos: byte position in file at which the write begins | 460 | * @pos: byte position in file at which the write begins |
461 | * @bytes: number of bytes to be written | 461 | * @bytes: number of bytes to be written |
462 | * | 462 | * |
463 | * This is called for non-resident attributes from ntfs_file_buffered_write() | 463 | * This is called for non-resident attributes from ntfs_file_buffered_write() |
464 | * with i_mutex held on the inode (@pages[0]->mapping->host). There are | 464 | * with i_mutex held on the inode (@pages[0]->mapping->host). There are |
465 | * @nr_pages pages in @pages which are locked but not kmap()ped. The source | 465 | * @nr_pages pages in @pages which are locked but not kmap()ped. The source |
466 | * data has not yet been copied into the @pages. | 466 | * data has not yet been copied into the @pages. |
467 | * | 467 | * |
468 | * Need to fill any holes with actual clusters, allocate buffers if necessary, | 468 | * Need to fill any holes with actual clusters, allocate buffers if necessary, |
469 | * ensure all the buffers are mapped, and bring uptodate any buffers that are | 469 | * ensure all the buffers are mapped, and bring uptodate any buffers that are |
470 | * only partially being written to. | 470 | * only partially being written to. |
471 | * | 471 | * |
472 | * If @nr_pages is greater than one, we are guaranteed that the cluster size is | 472 | * If @nr_pages is greater than one, we are guaranteed that the cluster size is |
473 | * greater than PAGE_CACHE_SIZE, that all pages in @pages are entirely inside | 473 | * greater than PAGE_CACHE_SIZE, that all pages in @pages are entirely inside |
474 | * the same cluster and that they are the entirety of that cluster, and that | 474 | * the same cluster and that they are the entirety of that cluster, and that |
475 | * the cluster is sparse, i.e. we need to allocate a cluster to fill the hole. | 475 | * the cluster is sparse, i.e. we need to allocate a cluster to fill the hole. |
476 | * | 476 | * |
477 | * i_size is not to be modified yet. | 477 | * i_size is not to be modified yet. |
478 | * | 478 | * |
479 | * Return 0 on success or -errno on error. | 479 | * Return 0 on success or -errno on error. |
480 | */ | 480 | */ |
481 | static int ntfs_prepare_pages_for_non_resident_write(struct page **pages, | 481 | static int ntfs_prepare_pages_for_non_resident_write(struct page **pages, |
482 | unsigned nr_pages, s64 pos, size_t bytes) | 482 | unsigned nr_pages, s64 pos, size_t bytes) |
483 | { | 483 | { |
484 | VCN vcn, highest_vcn = 0, cpos, cend, bh_cpos, bh_cend; | 484 | VCN vcn, highest_vcn = 0, cpos, cend, bh_cpos, bh_cend; |
485 | LCN lcn; | 485 | LCN lcn; |
486 | s64 bh_pos, vcn_len, end, initialized_size; | 486 | s64 bh_pos, vcn_len, end, initialized_size; |
487 | sector_t lcn_block; | 487 | sector_t lcn_block; |
488 | struct page *page; | 488 | struct page *page; |
489 | struct inode *vi; | 489 | struct inode *vi; |
490 | ntfs_inode *ni, *base_ni = NULL; | 490 | ntfs_inode *ni, *base_ni = NULL; |
491 | ntfs_volume *vol; | 491 | ntfs_volume *vol; |
492 | runlist_element *rl, *rl2; | 492 | runlist_element *rl, *rl2; |
493 | struct buffer_head *bh, *head, *wait[2], **wait_bh = wait; | 493 | struct buffer_head *bh, *head, *wait[2], **wait_bh = wait; |
494 | ntfs_attr_search_ctx *ctx = NULL; | 494 | ntfs_attr_search_ctx *ctx = NULL; |
495 | MFT_RECORD *m = NULL; | 495 | MFT_RECORD *m = NULL; |
496 | ATTR_RECORD *a = NULL; | 496 | ATTR_RECORD *a = NULL; |
497 | unsigned long flags; | 497 | unsigned long flags; |
498 | u32 attr_rec_len = 0; | 498 | u32 attr_rec_len = 0; |
499 | unsigned blocksize, u; | 499 | unsigned blocksize, u; |
500 | int err, mp_size; | 500 | int err, mp_size; |
501 | bool rl_write_locked, was_hole, is_retry; | 501 | bool rl_write_locked, was_hole, is_retry; |
502 | unsigned char blocksize_bits; | 502 | unsigned char blocksize_bits; |
503 | struct { | 503 | struct { |
504 | u8 runlist_merged:1; | 504 | u8 runlist_merged:1; |
505 | u8 mft_attr_mapped:1; | 505 | u8 mft_attr_mapped:1; |
506 | u8 mp_rebuilt:1; | 506 | u8 mp_rebuilt:1; |
507 | u8 attr_switched:1; | 507 | u8 attr_switched:1; |
508 | } status = { 0, 0, 0, 0 }; | 508 | } status = { 0, 0, 0, 0 }; |
509 | 509 | ||
510 | BUG_ON(!nr_pages); | 510 | BUG_ON(!nr_pages); |
511 | BUG_ON(!pages); | 511 | BUG_ON(!pages); |
512 | BUG_ON(!*pages); | 512 | BUG_ON(!*pages); |
513 | vi = pages[0]->mapping->host; | 513 | vi = pages[0]->mapping->host; |
514 | ni = NTFS_I(vi); | 514 | ni = NTFS_I(vi); |
515 | vol = ni->vol; | 515 | vol = ni->vol; |
516 | ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, start page " | 516 | ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, start page " |
517 | "index 0x%lx, nr_pages 0x%x, pos 0x%llx, bytes 0x%zx.", | 517 | "index 0x%lx, nr_pages 0x%x, pos 0x%llx, bytes 0x%zx.", |
518 | vi->i_ino, ni->type, pages[0]->index, nr_pages, | 518 | vi->i_ino, ni->type, pages[0]->index, nr_pages, |
519 | (long long)pos, bytes); | 519 | (long long)pos, bytes); |
520 | blocksize = vol->sb->s_blocksize; | 520 | blocksize = vol->sb->s_blocksize; |
521 | blocksize_bits = vol->sb->s_blocksize_bits; | 521 | blocksize_bits = vol->sb->s_blocksize_bits; |
522 | u = 0; | 522 | u = 0; |
523 | do { | 523 | do { |
524 | page = pages[u]; | 524 | page = pages[u]; |
525 | BUG_ON(!page); | 525 | BUG_ON(!page); |
526 | /* | 526 | /* |
527 | * create_empty_buffers() will create uptodate/dirty buffers if | 527 | * create_empty_buffers() will create uptodate/dirty buffers if |
528 | * the page is uptodate/dirty. | 528 | * the page is uptodate/dirty. |
529 | */ | 529 | */ |
530 | if (!page_has_buffers(page)) { | 530 | if (!page_has_buffers(page)) { |
531 | create_empty_buffers(page, blocksize, 0); | 531 | create_empty_buffers(page, blocksize, 0); |
532 | if (unlikely(!page_has_buffers(page))) | 532 | if (unlikely(!page_has_buffers(page))) |
533 | return -ENOMEM; | 533 | return -ENOMEM; |
534 | } | 534 | } |
535 | } while (++u < nr_pages); | 535 | } while (++u < nr_pages); |
536 | rl_write_locked = false; | 536 | rl_write_locked = false; |
537 | rl = NULL; | 537 | rl = NULL; |
538 | err = 0; | 538 | err = 0; |
539 | vcn = lcn = -1; | 539 | vcn = lcn = -1; |
540 | vcn_len = 0; | 540 | vcn_len = 0; |
541 | lcn_block = -1; | 541 | lcn_block = -1; |
542 | was_hole = false; | 542 | was_hole = false; |
543 | cpos = pos >> vol->cluster_size_bits; | 543 | cpos = pos >> vol->cluster_size_bits; |
544 | end = pos + bytes; | 544 | end = pos + bytes; |
545 | cend = (end + vol->cluster_size - 1) >> vol->cluster_size_bits; | 545 | cend = (end + vol->cluster_size - 1) >> vol->cluster_size_bits; |
546 | /* | 546 | /* |
547 | * Loop over each page and for each page over each buffer. Use goto to | 547 | * Loop over each page and for each page over each buffer. Use goto to |
548 | * reduce indentation. | 548 | * reduce indentation. |
549 | */ | 549 | */ |
550 | u = 0; | 550 | u = 0; |
551 | do_next_page: | 551 | do_next_page: |
552 | page = pages[u]; | 552 | page = pages[u]; |
553 | bh_pos = (s64)page->index << PAGE_CACHE_SHIFT; | 553 | bh_pos = (s64)page->index << PAGE_CACHE_SHIFT; |
554 | bh = head = page_buffers(page); | 554 | bh = head = page_buffers(page); |
555 | do { | 555 | do { |
556 | VCN cdelta; | 556 | VCN cdelta; |
557 | s64 bh_end; | 557 | s64 bh_end; |
558 | unsigned bh_cofs; | 558 | unsigned bh_cofs; |
559 | 559 | ||
560 | /* Clear buffer_new on all buffers to reinitialise state. */ | 560 | /* Clear buffer_new on all buffers to reinitialise state. */ |
561 | if (buffer_new(bh)) | 561 | if (buffer_new(bh)) |
562 | clear_buffer_new(bh); | 562 | clear_buffer_new(bh); |
563 | bh_end = bh_pos + blocksize; | 563 | bh_end = bh_pos + blocksize; |
564 | bh_cpos = bh_pos >> vol->cluster_size_bits; | 564 | bh_cpos = bh_pos >> vol->cluster_size_bits; |
565 | bh_cofs = bh_pos & vol->cluster_size_mask; | 565 | bh_cofs = bh_pos & vol->cluster_size_mask; |
566 | if (buffer_mapped(bh)) { | 566 | if (buffer_mapped(bh)) { |
567 | /* | 567 | /* |
568 | * The buffer is already mapped. If it is uptodate, | 568 | * The buffer is already mapped. If it is uptodate, |
569 | * ignore it. | 569 | * ignore it. |
570 | */ | 570 | */ |
571 | if (buffer_uptodate(bh)) | 571 | if (buffer_uptodate(bh)) |
572 | continue; | 572 | continue; |
573 | /* | 573 | /* |
574 | * The buffer is not uptodate. If the page is uptodate | 574 | * The buffer is not uptodate. If the page is uptodate |
575 | * set the buffer uptodate and otherwise ignore it. | 575 | * set the buffer uptodate and otherwise ignore it. |
576 | */ | 576 | */ |
577 | if (PageUptodate(page)) { | 577 | if (PageUptodate(page)) { |
578 | set_buffer_uptodate(bh); | 578 | set_buffer_uptodate(bh); |
579 | continue; | 579 | continue; |
580 | } | 580 | } |
581 | /* | 581 | /* |
582 | * Neither the page nor the buffer are uptodate. If | 582 | * Neither the page nor the buffer are uptodate. If |
583 | * the buffer is only partially being written to, we | 583 | * the buffer is only partially being written to, we |
584 | * need to read it in before the write, i.e. now. | 584 | * need to read it in before the write, i.e. now. |
585 | */ | 585 | */ |
586 | if ((bh_pos < pos && bh_end > pos) || | 586 | if ((bh_pos < pos && bh_end > pos) || |
587 | (bh_pos < end && bh_end > end)) { | 587 | (bh_pos < end && bh_end > end)) { |
588 | /* | 588 | /* |
589 | * If the buffer is fully or partially within | 589 | * If the buffer is fully or partially within |
590 | * the initialized size, do an actual read. | 590 | * the initialized size, do an actual read. |
591 | * Otherwise, simply zero the buffer. | 591 | * Otherwise, simply zero the buffer. |
592 | */ | 592 | */ |
593 | read_lock_irqsave(&ni->size_lock, flags); | 593 | read_lock_irqsave(&ni->size_lock, flags); |
594 | initialized_size = ni->initialized_size; | 594 | initialized_size = ni->initialized_size; |
595 | read_unlock_irqrestore(&ni->size_lock, flags); | 595 | read_unlock_irqrestore(&ni->size_lock, flags); |
596 | if (bh_pos < initialized_size) { | 596 | if (bh_pos < initialized_size) { |
597 | ntfs_submit_bh_for_read(bh); | 597 | ntfs_submit_bh_for_read(bh); |
598 | *wait_bh++ = bh; | 598 | *wait_bh++ = bh; |
599 | } else { | 599 | } else { |
600 | zero_user(page, bh_offset(bh), | 600 | zero_user(page, bh_offset(bh), |
601 | blocksize); | 601 | blocksize); |
602 | set_buffer_uptodate(bh); | 602 | set_buffer_uptodate(bh); |
603 | } | 603 | } |
604 | } | 604 | } |
605 | continue; | 605 | continue; |
606 | } | 606 | } |
607 | /* Unmapped buffer. Need to map it. */ | 607 | /* Unmapped buffer. Need to map it. */ |
608 | bh->b_bdev = vol->sb->s_bdev; | 608 | bh->b_bdev = vol->sb->s_bdev; |
609 | /* | 609 | /* |
610 | * If the current buffer is in the same clusters as the map | 610 | * If the current buffer is in the same clusters as the map |
611 | * cache, there is no need to check the runlist again. The | 611 | * cache, there is no need to check the runlist again. The |
612 | * map cache is made up of @vcn, which is the first cached file | 612 | * map cache is made up of @vcn, which is the first cached file |
613 | * cluster, @vcn_len which is the number of cached file | 613 | * cluster, @vcn_len which is the number of cached file |
614 | * clusters, @lcn is the device cluster corresponding to @vcn, | 614 | * clusters, @lcn is the device cluster corresponding to @vcn, |
615 | * and @lcn_block is the block number corresponding to @lcn. | 615 | * and @lcn_block is the block number corresponding to @lcn. |
616 | */ | 616 | */ |
617 | cdelta = bh_cpos - vcn; | 617 | cdelta = bh_cpos - vcn; |
618 | if (likely(!cdelta || (cdelta > 0 && cdelta < vcn_len))) { | 618 | if (likely(!cdelta || (cdelta > 0 && cdelta < vcn_len))) { |
619 | map_buffer_cached: | 619 | map_buffer_cached: |
620 | BUG_ON(lcn < 0); | 620 | BUG_ON(lcn < 0); |
621 | bh->b_blocknr = lcn_block + | 621 | bh->b_blocknr = lcn_block + |
622 | (cdelta << (vol->cluster_size_bits - | 622 | (cdelta << (vol->cluster_size_bits - |
623 | blocksize_bits)) + | 623 | blocksize_bits)) + |
624 | (bh_cofs >> blocksize_bits); | 624 | (bh_cofs >> blocksize_bits); |
625 | set_buffer_mapped(bh); | 625 | set_buffer_mapped(bh); |
626 | /* | 626 | /* |
627 | * If the page is uptodate so is the buffer. If the | 627 | * If the page is uptodate so is the buffer. If the |
628 | * buffer is fully outside the write, we ignore it if | 628 | * buffer is fully outside the write, we ignore it if |
629 | * it was already allocated and we mark it dirty so it | 629 | * it was already allocated and we mark it dirty so it |
630 | * gets written out if we allocated it. On the other | 630 | * gets written out if we allocated it. On the other |
631 | * hand, if we allocated the buffer but we are not | 631 | * hand, if we allocated the buffer but we are not |
632 | * marking it dirty we set buffer_new so we can do | 632 | * marking it dirty we set buffer_new so we can do |
633 | * error recovery. | 633 | * error recovery. |
634 | */ | 634 | */ |
635 | if (PageUptodate(page)) { | 635 | if (PageUptodate(page)) { |
636 | if (!buffer_uptodate(bh)) | 636 | if (!buffer_uptodate(bh)) |
637 | set_buffer_uptodate(bh); | 637 | set_buffer_uptodate(bh); |
638 | if (unlikely(was_hole)) { | 638 | if (unlikely(was_hole)) { |
639 | /* We allocated the buffer. */ | 639 | /* We allocated the buffer. */ |
640 | unmap_underlying_metadata(bh->b_bdev, | 640 | unmap_underlying_metadata(bh->b_bdev, |
641 | bh->b_blocknr); | 641 | bh->b_blocknr); |
642 | if (bh_end <= pos || bh_pos >= end) | 642 | if (bh_end <= pos || bh_pos >= end) |
643 | mark_buffer_dirty(bh); | 643 | mark_buffer_dirty(bh); |
644 | else | 644 | else |
645 | set_buffer_new(bh); | 645 | set_buffer_new(bh); |
646 | } | 646 | } |
647 | continue; | 647 | continue; |
648 | } | 648 | } |
649 | /* Page is _not_ uptodate. */ | 649 | /* Page is _not_ uptodate. */ |
650 | if (likely(!was_hole)) { | 650 | if (likely(!was_hole)) { |
651 | /* | 651 | /* |
652 | * Buffer was already allocated. If it is not | 652 | * Buffer was already allocated. If it is not |
653 | * uptodate and is only partially being written | 653 | * uptodate and is only partially being written |
654 | * to, we need to read it in before the write, | 654 | * to, we need to read it in before the write, |
655 | * i.e. now. | 655 | * i.e. now. |
656 | */ | 656 | */ |
657 | if (!buffer_uptodate(bh) && bh_pos < end && | 657 | if (!buffer_uptodate(bh) && bh_pos < end && |
658 | bh_end > pos && | 658 | bh_end > pos && |
659 | (bh_pos < pos || | 659 | (bh_pos < pos || |
660 | bh_end > end)) { | 660 | bh_end > end)) { |
661 | /* | 661 | /* |
662 | * If the buffer is fully or partially | 662 | * If the buffer is fully or partially |
663 | * within the initialized size, do an | 663 | * within the initialized size, do an |
664 | * actual read. Otherwise, simply zero | 664 | * actual read. Otherwise, simply zero |
665 | * the buffer. | 665 | * the buffer. |
666 | */ | 666 | */ |
667 | read_lock_irqsave(&ni->size_lock, | 667 | read_lock_irqsave(&ni->size_lock, |
668 | flags); | 668 | flags); |
669 | initialized_size = ni->initialized_size; | 669 | initialized_size = ni->initialized_size; |
670 | read_unlock_irqrestore(&ni->size_lock, | 670 | read_unlock_irqrestore(&ni->size_lock, |
671 | flags); | 671 | flags); |
672 | if (bh_pos < initialized_size) { | 672 | if (bh_pos < initialized_size) { |
673 | ntfs_submit_bh_for_read(bh); | 673 | ntfs_submit_bh_for_read(bh); |
674 | *wait_bh++ = bh; | 674 | *wait_bh++ = bh; |
675 | } else { | 675 | } else { |
676 | zero_user(page, bh_offset(bh), | 676 | zero_user(page, bh_offset(bh), |
677 | blocksize); | 677 | blocksize); |
678 | set_buffer_uptodate(bh); | 678 | set_buffer_uptodate(bh); |
679 | } | 679 | } |
680 | } | 680 | } |
681 | continue; | 681 | continue; |
682 | } | 682 | } |
683 | /* We allocated the buffer. */ | 683 | /* We allocated the buffer. */ |
684 | unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr); | 684 | unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr); |
685 | /* | 685 | /* |
686 | * If the buffer is fully outside the write, zero it, | 686 | * If the buffer is fully outside the write, zero it, |
687 | * set it uptodate, and mark it dirty so it gets | 687 | * set it uptodate, and mark it dirty so it gets |
688 | * written out. If it is partially being written to, | 688 | * written out. If it is partially being written to, |
689 | * zero the region surrounding the write but leave it to | 689 | * zero the region surrounding the write but leave it to |
690 | * commit write to do anything else. Finally, if the | 690 | * commit write to do anything else. Finally, if the |
691 | * buffer is fully being overwritten, do nothing. | 691 | * buffer is fully being overwritten, do nothing. |
692 | */ | 692 | */ |
693 | if (bh_end <= pos || bh_pos >= end) { | 693 | if (bh_end <= pos || bh_pos >= end) { |
694 | if (!buffer_uptodate(bh)) { | 694 | if (!buffer_uptodate(bh)) { |
695 | zero_user(page, bh_offset(bh), | 695 | zero_user(page, bh_offset(bh), |
696 | blocksize); | 696 | blocksize); |
697 | set_buffer_uptodate(bh); | 697 | set_buffer_uptodate(bh); |
698 | } | 698 | } |
699 | mark_buffer_dirty(bh); | 699 | mark_buffer_dirty(bh); |
700 | continue; | 700 | continue; |
701 | } | 701 | } |
702 | set_buffer_new(bh); | 702 | set_buffer_new(bh); |
703 | if (!buffer_uptodate(bh) && | 703 | if (!buffer_uptodate(bh) && |
704 | (bh_pos < pos || bh_end > end)) { | 704 | (bh_pos < pos || bh_end > end)) { |
705 | u8 *kaddr; | 705 | u8 *kaddr; |
706 | unsigned pofs; | 706 | unsigned pofs; |
707 | 707 | ||
708 | kaddr = kmap_atomic(page); | 708 | kaddr = kmap_atomic(page); |
709 | if (bh_pos < pos) { | 709 | if (bh_pos < pos) { |
710 | pofs = bh_pos & ~PAGE_CACHE_MASK; | 710 | pofs = bh_pos & ~PAGE_CACHE_MASK; |
711 | memset(kaddr + pofs, 0, pos - bh_pos); | 711 | memset(kaddr + pofs, 0, pos - bh_pos); |
712 | } | 712 | } |
713 | if (bh_end > end) { | 713 | if (bh_end > end) { |
714 | pofs = end & ~PAGE_CACHE_MASK; | 714 | pofs = end & ~PAGE_CACHE_MASK; |
715 | memset(kaddr + pofs, 0, bh_end - end); | 715 | memset(kaddr + pofs, 0, bh_end - end); |
716 | } | 716 | } |
717 | kunmap_atomic(kaddr); | 717 | kunmap_atomic(kaddr); |
718 | flush_dcache_page(page); | 718 | flush_dcache_page(page); |
719 | } | 719 | } |
720 | continue; | 720 | continue; |
721 | } | 721 | } |
722 | /* | 722 | /* |
723 | * Slow path: this is the first buffer in the cluster. If it | 723 | * Slow path: this is the first buffer in the cluster. If it |
724 | * is outside allocated size and is not uptodate, zero it and | 724 | * is outside allocated size and is not uptodate, zero it and |
725 | * set it uptodate. | 725 | * set it uptodate. |
726 | */ | 726 | */ |
727 | read_lock_irqsave(&ni->size_lock, flags); | 727 | read_lock_irqsave(&ni->size_lock, flags); |
728 | initialized_size = ni->allocated_size; | 728 | initialized_size = ni->allocated_size; |
729 | read_unlock_irqrestore(&ni->size_lock, flags); | 729 | read_unlock_irqrestore(&ni->size_lock, flags); |
730 | if (bh_pos > initialized_size) { | 730 | if (bh_pos > initialized_size) { |
731 | if (PageUptodate(page)) { | 731 | if (PageUptodate(page)) { |
732 | if (!buffer_uptodate(bh)) | 732 | if (!buffer_uptodate(bh)) |
733 | set_buffer_uptodate(bh); | 733 | set_buffer_uptodate(bh); |
734 | } else if (!buffer_uptodate(bh)) { | 734 | } else if (!buffer_uptodate(bh)) { |
735 | zero_user(page, bh_offset(bh), blocksize); | 735 | zero_user(page, bh_offset(bh), blocksize); |
736 | set_buffer_uptodate(bh); | 736 | set_buffer_uptodate(bh); |
737 | } | 737 | } |
738 | continue; | 738 | continue; |
739 | } | 739 | } |
740 | is_retry = false; | 740 | is_retry = false; |
741 | if (!rl) { | 741 | if (!rl) { |
742 | down_read(&ni->runlist.lock); | 742 | down_read(&ni->runlist.lock); |
743 | retry_remap: | 743 | retry_remap: |
744 | rl = ni->runlist.rl; | 744 | rl = ni->runlist.rl; |
745 | } | 745 | } |
746 | if (likely(rl != NULL)) { | 746 | if (likely(rl != NULL)) { |
747 | /* Seek to element containing target cluster. */ | 747 | /* Seek to element containing target cluster. */ |
748 | while (rl->length && rl[1].vcn <= bh_cpos) | 748 | while (rl->length && rl[1].vcn <= bh_cpos) |
749 | rl++; | 749 | rl++; |
750 | lcn = ntfs_rl_vcn_to_lcn(rl, bh_cpos); | 750 | lcn = ntfs_rl_vcn_to_lcn(rl, bh_cpos); |
751 | if (likely(lcn >= 0)) { | 751 | if (likely(lcn >= 0)) { |
752 | /* | 752 | /* |
753 | * Successful remap, setup the map cache and | 753 | * Successful remap, setup the map cache and |
754 | * use that to deal with the buffer. | 754 | * use that to deal with the buffer. |
755 | */ | 755 | */ |
756 | was_hole = false; | 756 | was_hole = false; |
757 | vcn = bh_cpos; | 757 | vcn = bh_cpos; |
758 | vcn_len = rl[1].vcn - vcn; | 758 | vcn_len = rl[1].vcn - vcn; |
759 | lcn_block = lcn << (vol->cluster_size_bits - | 759 | lcn_block = lcn << (vol->cluster_size_bits - |
760 | blocksize_bits); | 760 | blocksize_bits); |
761 | cdelta = 0; | 761 | cdelta = 0; |
762 | /* | 762 | /* |
763 | * If the number of remaining clusters touched | 763 | * If the number of remaining clusters touched |
764 | * by the write is smaller or equal to the | 764 | * by the write is smaller or equal to the |
765 | * number of cached clusters, unlock the | 765 | * number of cached clusters, unlock the |
766 | * runlist as the map cache will be used from | 766 | * runlist as the map cache will be used from |
767 | * now on. | 767 | * now on. |
768 | */ | 768 | */ |
769 | if (likely(vcn + vcn_len >= cend)) { | 769 | if (likely(vcn + vcn_len >= cend)) { |
770 | if (rl_write_locked) { | 770 | if (rl_write_locked) { |
771 | up_write(&ni->runlist.lock); | 771 | up_write(&ni->runlist.lock); |
772 | rl_write_locked = false; | 772 | rl_write_locked = false; |
773 | } else | 773 | } else |
774 | up_read(&ni->runlist.lock); | 774 | up_read(&ni->runlist.lock); |
775 | rl = NULL; | 775 | rl = NULL; |
776 | } | 776 | } |
777 | goto map_buffer_cached; | 777 | goto map_buffer_cached; |
778 | } | 778 | } |
779 | } else | 779 | } else |
780 | lcn = LCN_RL_NOT_MAPPED; | 780 | lcn = LCN_RL_NOT_MAPPED; |
781 | /* | 781 | /* |
782 | * If it is not a hole and not out of bounds, the runlist is | 782 | * If it is not a hole and not out of bounds, the runlist is |
783 | * probably unmapped so try to map it now. | 783 | * probably unmapped so try to map it now. |
784 | */ | 784 | */ |
785 | if (unlikely(lcn != LCN_HOLE && lcn != LCN_ENOENT)) { | 785 | if (unlikely(lcn != LCN_HOLE && lcn != LCN_ENOENT)) { |
786 | if (likely(!is_retry && lcn == LCN_RL_NOT_MAPPED)) { | 786 | if (likely(!is_retry && lcn == LCN_RL_NOT_MAPPED)) { |
787 | /* Attempt to map runlist. */ | 787 | /* Attempt to map runlist. */ |
788 | if (!rl_write_locked) { | 788 | if (!rl_write_locked) { |
789 | /* | 789 | /* |
790 | * We need the runlist locked for | 790 | * We need the runlist locked for |
791 | * writing, so if it is locked for | 791 | * writing, so if it is locked for |
792 | * reading relock it now and retry in | 792 | * reading relock it now and retry in |
793 | * case it changed whilst we dropped | 793 | * case it changed whilst we dropped |
794 | * the lock. | 794 | * the lock. |
795 | */ | 795 | */ |
796 | up_read(&ni->runlist.lock); | 796 | up_read(&ni->runlist.lock); |
797 | down_write(&ni->runlist.lock); | 797 | down_write(&ni->runlist.lock); |
798 | rl_write_locked = true; | 798 | rl_write_locked = true; |
799 | goto retry_remap; | 799 | goto retry_remap; |
800 | } | 800 | } |
801 | err = ntfs_map_runlist_nolock(ni, bh_cpos, | 801 | err = ntfs_map_runlist_nolock(ni, bh_cpos, |
802 | NULL); | 802 | NULL); |
803 | if (likely(!err)) { | 803 | if (likely(!err)) { |
804 | is_retry = true; | 804 | is_retry = true; |
805 | goto retry_remap; | 805 | goto retry_remap; |
806 | } | 806 | } |
807 | /* | 807 | /* |
808 | * If @vcn is out of bounds, pretend @lcn is | 808 | * If @vcn is out of bounds, pretend @lcn is |
809 | * LCN_ENOENT. As long as the buffer is out | 809 | * LCN_ENOENT. As long as the buffer is out |
810 | * of bounds this will work fine. | 810 | * of bounds this will work fine. |
811 | */ | 811 | */ |
812 | if (err == -ENOENT) { | 812 | if (err == -ENOENT) { |
813 | lcn = LCN_ENOENT; | 813 | lcn = LCN_ENOENT; |
814 | err = 0; | 814 | err = 0; |
815 | goto rl_not_mapped_enoent; | 815 | goto rl_not_mapped_enoent; |
816 | } | 816 | } |
817 | } else | 817 | } else |
818 | err = -EIO; | 818 | err = -EIO; |
819 | /* Failed to map the buffer, even after retrying. */ | 819 | /* Failed to map the buffer, even after retrying. */ |
820 | bh->b_blocknr = -1; | 820 | bh->b_blocknr = -1; |
821 | ntfs_error(vol->sb, "Failed to write to inode 0x%lx, " | 821 | ntfs_error(vol->sb, "Failed to write to inode 0x%lx, " |
822 | "attribute type 0x%x, vcn 0x%llx, " | 822 | "attribute type 0x%x, vcn 0x%llx, " |
823 | "vcn offset 0x%x, because its " | 823 | "vcn offset 0x%x, because its " |
824 | "location on disk could not be " | 824 | "location on disk could not be " |
825 | "determined%s (error code %i).", | 825 | "determined%s (error code %i).", |
826 | ni->mft_no, ni->type, | 826 | ni->mft_no, ni->type, |
827 | (unsigned long long)bh_cpos, | 827 | (unsigned long long)bh_cpos, |
828 | (unsigned)bh_pos & | 828 | (unsigned)bh_pos & |
829 | vol->cluster_size_mask, | 829 | vol->cluster_size_mask, |
830 | is_retry ? " even after retrying" : "", | 830 | is_retry ? " even after retrying" : "", |
831 | err); | 831 | err); |
832 | break; | 832 | break; |
833 | } | 833 | } |
834 | rl_not_mapped_enoent: | 834 | rl_not_mapped_enoent: |
835 | /* | 835 | /* |
836 | * The buffer is in a hole or out of bounds. We need to fill | 836 | * The buffer is in a hole or out of bounds. We need to fill |
837 | * the hole, unless the buffer is in a cluster which is not | 837 | * the hole, unless the buffer is in a cluster which is not |
838 | * touched by the write, in which case we just leave the buffer | 838 | * touched by the write, in which case we just leave the buffer |
839 | * unmapped. This can only happen when the cluster size is | 839 | * unmapped. This can only happen when the cluster size is |
840 | * less than the page cache size. | 840 | * less than the page cache size. |
841 | */ | 841 | */ |
842 | if (unlikely(vol->cluster_size < PAGE_CACHE_SIZE)) { | 842 | if (unlikely(vol->cluster_size < PAGE_CACHE_SIZE)) { |
843 | bh_cend = (bh_end + vol->cluster_size - 1) >> | 843 | bh_cend = (bh_end + vol->cluster_size - 1) >> |
844 | vol->cluster_size_bits; | 844 | vol->cluster_size_bits; |
845 | if ((bh_cend <= cpos || bh_cpos >= cend)) { | 845 | if ((bh_cend <= cpos || bh_cpos >= cend)) { |
846 | bh->b_blocknr = -1; | 846 | bh->b_blocknr = -1; |
847 | /* | 847 | /* |
848 | * If the buffer is uptodate we skip it. If it | 848 | * If the buffer is uptodate we skip it. If it |
849 | * is not but the page is uptodate, we can set | 849 | * is not but the page is uptodate, we can set |
850 | * the buffer uptodate. If the page is not | 850 | * the buffer uptodate. If the page is not |
851 | * uptodate, we can clear the buffer and set it | 851 | * uptodate, we can clear the buffer and set it |
852 | * uptodate. Whether this is worthwhile is | 852 | * uptodate. Whether this is worthwhile is |
853 | * debatable and this could be removed. | 853 | * debatable and this could be removed. |
854 | */ | 854 | */ |
855 | if (PageUptodate(page)) { | 855 | if (PageUptodate(page)) { |
856 | if (!buffer_uptodate(bh)) | 856 | if (!buffer_uptodate(bh)) |
857 | set_buffer_uptodate(bh); | 857 | set_buffer_uptodate(bh); |
858 | } else if (!buffer_uptodate(bh)) { | 858 | } else if (!buffer_uptodate(bh)) { |
859 | zero_user(page, bh_offset(bh), | 859 | zero_user(page, bh_offset(bh), |
860 | blocksize); | 860 | blocksize); |
861 | set_buffer_uptodate(bh); | 861 | set_buffer_uptodate(bh); |
862 | } | 862 | } |
863 | continue; | 863 | continue; |
864 | } | 864 | } |
865 | } | 865 | } |
866 | /* | 866 | /* |
867 | * Out of bounds buffer is invalid if it was not really out of | 867 | * Out of bounds buffer is invalid if it was not really out of |
868 | * bounds. | 868 | * bounds. |
869 | */ | 869 | */ |
870 | BUG_ON(lcn != LCN_HOLE); | 870 | BUG_ON(lcn != LCN_HOLE); |
871 | /* | 871 | /* |
872 | * We need the runlist locked for writing, so if it is locked | 872 | * We need the runlist locked for writing, so if it is locked |
873 | * for reading relock it now and retry in case it changed | 873 | * for reading relock it now and retry in case it changed |
874 | * whilst we dropped the lock. | 874 | * whilst we dropped the lock. |
875 | */ | 875 | */ |
876 | BUG_ON(!rl); | 876 | BUG_ON(!rl); |
877 | if (!rl_write_locked) { | 877 | if (!rl_write_locked) { |
878 | up_read(&ni->runlist.lock); | 878 | up_read(&ni->runlist.lock); |
879 | down_write(&ni->runlist.lock); | 879 | down_write(&ni->runlist.lock); |
880 | rl_write_locked = true; | 880 | rl_write_locked = true; |
881 | goto retry_remap; | 881 | goto retry_remap; |
882 | } | 882 | } |
883 | /* Find the previous last allocated cluster. */ | 883 | /* Find the previous last allocated cluster. */ |
884 | BUG_ON(rl->lcn != LCN_HOLE); | 884 | BUG_ON(rl->lcn != LCN_HOLE); |
885 | lcn = -1; | 885 | lcn = -1; |
886 | rl2 = rl; | 886 | rl2 = rl; |
887 | while (--rl2 >= ni->runlist.rl) { | 887 | while (--rl2 >= ni->runlist.rl) { |
888 | if (rl2->lcn >= 0) { | 888 | if (rl2->lcn >= 0) { |
889 | lcn = rl2->lcn + rl2->length; | 889 | lcn = rl2->lcn + rl2->length; |
890 | break; | 890 | break; |
891 | } | 891 | } |
892 | } | 892 | } |
893 | rl2 = ntfs_cluster_alloc(vol, bh_cpos, 1, lcn, DATA_ZONE, | 893 | rl2 = ntfs_cluster_alloc(vol, bh_cpos, 1, lcn, DATA_ZONE, |
894 | false); | 894 | false); |
895 | if (IS_ERR(rl2)) { | 895 | if (IS_ERR(rl2)) { |
896 | err = PTR_ERR(rl2); | 896 | err = PTR_ERR(rl2); |
897 | ntfs_debug("Failed to allocate cluster, error code %i.", | 897 | ntfs_debug("Failed to allocate cluster, error code %i.", |
898 | err); | 898 | err); |
899 | break; | 899 | break; |
900 | } | 900 | } |
901 | lcn = rl2->lcn; | 901 | lcn = rl2->lcn; |
902 | rl = ntfs_runlists_merge(ni->runlist.rl, rl2); | 902 | rl = ntfs_runlists_merge(ni->runlist.rl, rl2); |
903 | if (IS_ERR(rl)) { | 903 | if (IS_ERR(rl)) { |
904 | err = PTR_ERR(rl); | 904 | err = PTR_ERR(rl); |
905 | if (err != -ENOMEM) | 905 | if (err != -ENOMEM) |
906 | err = -EIO; | 906 | err = -EIO; |
907 | if (ntfs_cluster_free_from_rl(vol, rl2)) { | 907 | if (ntfs_cluster_free_from_rl(vol, rl2)) { |
908 | ntfs_error(vol->sb, "Failed to release " | 908 | ntfs_error(vol->sb, "Failed to release " |
909 | "allocated cluster in error " | 909 | "allocated cluster in error " |
910 | "code path. Run chkdsk to " | 910 | "code path. Run chkdsk to " |
911 | "recover the lost cluster."); | 911 | "recover the lost cluster."); |
912 | NVolSetErrors(vol); | 912 | NVolSetErrors(vol); |
913 | } | 913 | } |
914 | ntfs_free(rl2); | 914 | ntfs_free(rl2); |
915 | break; | 915 | break; |
916 | } | 916 | } |
917 | ni->runlist.rl = rl; | 917 | ni->runlist.rl = rl; |
918 | status.runlist_merged = 1; | 918 | status.runlist_merged = 1; |
919 | ntfs_debug("Allocated cluster, lcn 0x%llx.", | 919 | ntfs_debug("Allocated cluster, lcn 0x%llx.", |
920 | (unsigned long long)lcn); | 920 | (unsigned long long)lcn); |
921 | /* Map and lock the mft record and get the attribute record. */ | 921 | /* Map and lock the mft record and get the attribute record. */ |
922 | if (!NInoAttr(ni)) | 922 | if (!NInoAttr(ni)) |
923 | base_ni = ni; | 923 | base_ni = ni; |
924 | else | 924 | else |
925 | base_ni = ni->ext.base_ntfs_ino; | 925 | base_ni = ni->ext.base_ntfs_ino; |
926 | m = map_mft_record(base_ni); | 926 | m = map_mft_record(base_ni); |
927 | if (IS_ERR(m)) { | 927 | if (IS_ERR(m)) { |
928 | err = PTR_ERR(m); | 928 | err = PTR_ERR(m); |
929 | break; | 929 | break; |
930 | } | 930 | } |
931 | ctx = ntfs_attr_get_search_ctx(base_ni, m); | 931 | ctx = ntfs_attr_get_search_ctx(base_ni, m); |
932 | if (unlikely(!ctx)) { | 932 | if (unlikely(!ctx)) { |
933 | err = -ENOMEM; | 933 | err = -ENOMEM; |
934 | unmap_mft_record(base_ni); | 934 | unmap_mft_record(base_ni); |
935 | break; | 935 | break; |
936 | } | 936 | } |
937 | status.mft_attr_mapped = 1; | 937 | status.mft_attr_mapped = 1; |
938 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, | 938 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, |
939 | CASE_SENSITIVE, bh_cpos, NULL, 0, ctx); | 939 | CASE_SENSITIVE, bh_cpos, NULL, 0, ctx); |
940 | if (unlikely(err)) { | 940 | if (unlikely(err)) { |
941 | if (err == -ENOENT) | 941 | if (err == -ENOENT) |
942 | err = -EIO; | 942 | err = -EIO; |
943 | break; | 943 | break; |
944 | } | 944 | } |
945 | m = ctx->mrec; | 945 | m = ctx->mrec; |
946 | a = ctx->attr; | 946 | a = ctx->attr; |
947 | /* | 947 | /* |
948 | * Find the runlist element with which the attribute extent | 948 | * Find the runlist element with which the attribute extent |
949 | * starts. Note, we cannot use the _attr_ version because we | 949 | * starts. Note, we cannot use the _attr_ version because we |
950 | * have mapped the mft record. That is ok because we know the | 950 | * have mapped the mft record. That is ok because we know the |
951 | * runlist fragment must be mapped already to have ever gotten | 951 | * runlist fragment must be mapped already to have ever gotten |
952 | * here, so we can just use the _rl_ version. | 952 | * here, so we can just use the _rl_ version. |
953 | */ | 953 | */ |
954 | vcn = sle64_to_cpu(a->data.non_resident.lowest_vcn); | 954 | vcn = sle64_to_cpu(a->data.non_resident.lowest_vcn); |
955 | rl2 = ntfs_rl_find_vcn_nolock(rl, vcn); | 955 | rl2 = ntfs_rl_find_vcn_nolock(rl, vcn); |
956 | BUG_ON(!rl2); | 956 | BUG_ON(!rl2); |
957 | BUG_ON(!rl2->length); | 957 | BUG_ON(!rl2->length); |
958 | BUG_ON(rl2->lcn < LCN_HOLE); | 958 | BUG_ON(rl2->lcn < LCN_HOLE); |
959 | highest_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn); | 959 | highest_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn); |
960 | /* | 960 | /* |
961 | * If @highest_vcn is zero, calculate the real highest_vcn | 961 | * If @highest_vcn is zero, calculate the real highest_vcn |
962 | * (which can really be zero). | 962 | * (which can really be zero). |
963 | */ | 963 | */ |
964 | if (!highest_vcn) | 964 | if (!highest_vcn) |
965 | highest_vcn = (sle64_to_cpu( | 965 | highest_vcn = (sle64_to_cpu( |
966 | a->data.non_resident.allocated_size) >> | 966 | a->data.non_resident.allocated_size) >> |
967 | vol->cluster_size_bits) - 1; | 967 | vol->cluster_size_bits) - 1; |
968 | /* | 968 | /* |
969 | * Determine the size of the mapping pairs array for the new | 969 | * Determine the size of the mapping pairs array for the new |
970 | * extent, i.e. the old extent with the hole filled. | 970 | * extent, i.e. the old extent with the hole filled. |
971 | */ | 971 | */ |
972 | mp_size = ntfs_get_size_for_mapping_pairs(vol, rl2, vcn, | 972 | mp_size = ntfs_get_size_for_mapping_pairs(vol, rl2, vcn, |
973 | highest_vcn); | 973 | highest_vcn); |
974 | if (unlikely(mp_size <= 0)) { | 974 | if (unlikely(mp_size <= 0)) { |
975 | if (!(err = mp_size)) | 975 | if (!(err = mp_size)) |
976 | err = -EIO; | 976 | err = -EIO; |
977 | ntfs_debug("Failed to get size for mapping pairs " | 977 | ntfs_debug("Failed to get size for mapping pairs " |
978 | "array, error code %i.", err); | 978 | "array, error code %i.", err); |
979 | break; | 979 | break; |
980 | } | 980 | } |
981 | /* | 981 | /* |
982 | * Resize the attribute record to fit the new mapping pairs | 982 | * Resize the attribute record to fit the new mapping pairs |
983 | * array. | 983 | * array. |
984 | */ | 984 | */ |
985 | attr_rec_len = le32_to_cpu(a->length); | 985 | attr_rec_len = le32_to_cpu(a->length); |
986 | err = ntfs_attr_record_resize(m, a, mp_size + le16_to_cpu( | 986 | err = ntfs_attr_record_resize(m, a, mp_size + le16_to_cpu( |
987 | a->data.non_resident.mapping_pairs_offset)); | 987 | a->data.non_resident.mapping_pairs_offset)); |
988 | if (unlikely(err)) { | 988 | if (unlikely(err)) { |
989 | BUG_ON(err != -ENOSPC); | 989 | BUG_ON(err != -ENOSPC); |
990 | // TODO: Deal with this by using the current attribute | 990 | // TODO: Deal with this by using the current attribute |
991 | // and fill it with as much of the mapping pairs | 991 | // and fill it with as much of the mapping pairs |
992 | // array as possible. Then loop over each attribute | 992 | // array as possible. Then loop over each attribute |
993 | // extent rewriting the mapping pairs arrays as we go | 993 | // extent rewriting the mapping pairs arrays as we go |
994 | // along and, if we reach the end without | 994 | // along and, if we reach the end without |
995 | // enough space, try to resize the last attribute | 995 | // enough space, try to resize the last attribute |
996 | // extent and if even that fails, add a new attribute | 996 | // extent and if even that fails, add a new attribute |
997 | // extent. | 997 | // extent. |
998 | // We could also try to resize at each step in the hope | 998 | // We could also try to resize at each step in the hope |
999 | // that we will not need to rewrite every single extent. | 999 | // that we will not need to rewrite every single extent. |
1000 | // Note, we may need to decompress some extents to fill | 1000 | // Note, we may need to decompress some extents to fill |
1001 | // the runlist as we are walking the extents... | 1001 | // the runlist as we are walking the extents... |
1002 | ntfs_error(vol->sb, "Not enough space in the mft " | 1002 | ntfs_error(vol->sb, "Not enough space in the mft " |
1003 | "record for the extended attribute " | 1003 | "record for the extended attribute " |
1004 | "record. This case is not " | 1004 | "record. This case is not " |
1005 | "implemented yet."); | 1005 | "implemented yet."); |
1006 | err = -EOPNOTSUPP; | 1006 | err = -EOPNOTSUPP; |
1007 | break; | 1007 | break; |
1008 | } | 1008 | } |
1009 | status.mp_rebuilt = 1; | 1009 | status.mp_rebuilt = 1; |
1010 | /* | 1010 | /* |
1011 | * Generate the mapping pairs array directly into the attribute | 1011 | * Generate the mapping pairs array directly into the attribute |
1012 | * record. | 1012 | * record. |
1013 | */ | 1013 | */ |
1014 | err = ntfs_mapping_pairs_build(vol, (u8*)a + le16_to_cpu( | 1014 | err = ntfs_mapping_pairs_build(vol, (u8*)a + le16_to_cpu( |
1015 | a->data.non_resident.mapping_pairs_offset), | 1015 | a->data.non_resident.mapping_pairs_offset), |
1016 | mp_size, rl2, vcn, highest_vcn, NULL); | 1016 | mp_size, rl2, vcn, highest_vcn, NULL); |
1017 | if (unlikely(err)) { | 1017 | if (unlikely(err)) { |
1018 | ntfs_error(vol->sb, "Cannot fill hole in inode 0x%lx, " | 1018 | ntfs_error(vol->sb, "Cannot fill hole in inode 0x%lx, " |
1019 | "attribute type 0x%x, because building " | 1019 | "attribute type 0x%x, because building " |
1020 | "the mapping pairs failed with error " | 1020 | "the mapping pairs failed with error " |
1021 | "code %i.", vi->i_ino, | 1021 | "code %i.", vi->i_ino, |
1022 | (unsigned)le32_to_cpu(ni->type), err); | 1022 | (unsigned)le32_to_cpu(ni->type), err); |
1023 | err = -EIO; | 1023 | err = -EIO; |
1024 | break; | 1024 | break; |
1025 | } | 1025 | } |
1026 | /* Update the highest_vcn but only if it was not set. */ | 1026 | /* Update the highest_vcn but only if it was not set. */ |
1027 | if (unlikely(!a->data.non_resident.highest_vcn)) | 1027 | if (unlikely(!a->data.non_resident.highest_vcn)) |
1028 | a->data.non_resident.highest_vcn = | 1028 | a->data.non_resident.highest_vcn = |
1029 | cpu_to_sle64(highest_vcn); | 1029 | cpu_to_sle64(highest_vcn); |
1030 | /* | 1030 | /* |
1031 | * If the attribute is sparse/compressed, update the compressed | 1031 | * If the attribute is sparse/compressed, update the compressed |
1032 | * size in the ntfs_inode structure and the attribute record. | 1032 | * size in the ntfs_inode structure and the attribute record. |
1033 | */ | 1033 | */ |
1034 | if (likely(NInoSparse(ni) || NInoCompressed(ni))) { | 1034 | if (likely(NInoSparse(ni) || NInoCompressed(ni))) { |
1035 | /* | 1035 | /* |
1036 | * If we are not in the first attribute extent, switch | 1036 | * If we are not in the first attribute extent, switch |
1037 | * to it, but first ensure the changes will make it to | 1037 | * to it, but first ensure the changes will make it to |
1038 | * disk later. | 1038 | * disk later. |
1039 | */ | 1039 | */ |
1040 | if (a->data.non_resident.lowest_vcn) { | 1040 | if (a->data.non_resident.lowest_vcn) { |
1041 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 1041 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
1042 | mark_mft_record_dirty(ctx->ntfs_ino); | 1042 | mark_mft_record_dirty(ctx->ntfs_ino); |
1043 | ntfs_attr_reinit_search_ctx(ctx); | 1043 | ntfs_attr_reinit_search_ctx(ctx); |
1044 | err = ntfs_attr_lookup(ni->type, ni->name, | 1044 | err = ntfs_attr_lookup(ni->type, ni->name, |
1045 | ni->name_len, CASE_SENSITIVE, | 1045 | ni->name_len, CASE_SENSITIVE, |
1046 | 0, NULL, 0, ctx); | 1046 | 0, NULL, 0, ctx); |
1047 | if (unlikely(err)) { | 1047 | if (unlikely(err)) { |
1048 | status.attr_switched = 1; | 1048 | status.attr_switched = 1; |
1049 | break; | 1049 | break; |
1050 | } | 1050 | } |
1051 | /* @m is not used any more so do not set it. */ | 1051 | /* @m is not used any more so do not set it. */ |
1052 | a = ctx->attr; | 1052 | a = ctx->attr; |
1053 | } | 1053 | } |
1054 | write_lock_irqsave(&ni->size_lock, flags); | 1054 | write_lock_irqsave(&ni->size_lock, flags); |
1055 | ni->itype.compressed.size += vol->cluster_size; | 1055 | ni->itype.compressed.size += vol->cluster_size; |
1056 | a->data.non_resident.compressed_size = | 1056 | a->data.non_resident.compressed_size = |
1057 | cpu_to_sle64(ni->itype.compressed.size); | 1057 | cpu_to_sle64(ni->itype.compressed.size); |
1058 | write_unlock_irqrestore(&ni->size_lock, flags); | 1058 | write_unlock_irqrestore(&ni->size_lock, flags); |
1059 | } | 1059 | } |
1060 | /* Ensure the changes make it to disk. */ | 1060 | /* Ensure the changes make it to disk. */ |
1061 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 1061 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
1062 | mark_mft_record_dirty(ctx->ntfs_ino); | 1062 | mark_mft_record_dirty(ctx->ntfs_ino); |
1063 | ntfs_attr_put_search_ctx(ctx); | 1063 | ntfs_attr_put_search_ctx(ctx); |
1064 | unmap_mft_record(base_ni); | 1064 | unmap_mft_record(base_ni); |
1065 | /* Successfully filled the hole. */ | 1065 | /* Successfully filled the hole. */ |
1066 | status.runlist_merged = 0; | 1066 | status.runlist_merged = 0; |
1067 | status.mft_attr_mapped = 0; | 1067 | status.mft_attr_mapped = 0; |
1068 | status.mp_rebuilt = 0; | 1068 | status.mp_rebuilt = 0; |
1069 | /* Setup the map cache and use that to deal with the buffer. */ | 1069 | /* Setup the map cache and use that to deal with the buffer. */ |
1070 | was_hole = true; | 1070 | was_hole = true; |
1071 | vcn = bh_cpos; | 1071 | vcn = bh_cpos; |
1072 | vcn_len = 1; | 1072 | vcn_len = 1; |
1073 | lcn_block = lcn << (vol->cluster_size_bits - blocksize_bits); | 1073 | lcn_block = lcn << (vol->cluster_size_bits - blocksize_bits); |
1074 | cdelta = 0; | 1074 | cdelta = 0; |
1075 | /* | 1075 | /* |
1076 | * If the number of remaining clusters in the @pages is smaller | 1076 | * If the number of remaining clusters in the @pages is smaller |
1077 | * or equal to the number of cached clusters, unlock the | 1077 | * or equal to the number of cached clusters, unlock the |
1078 | * runlist as the map cache will be used from now on. | 1078 | * runlist as the map cache will be used from now on. |
1079 | */ | 1079 | */ |
1080 | if (likely(vcn + vcn_len >= cend)) { | 1080 | if (likely(vcn + vcn_len >= cend)) { |
1081 | up_write(&ni->runlist.lock); | 1081 | up_write(&ni->runlist.lock); |
1082 | rl_write_locked = false; | 1082 | rl_write_locked = false; |
1083 | rl = NULL; | 1083 | rl = NULL; |
1084 | } | 1084 | } |
1085 | goto map_buffer_cached; | 1085 | goto map_buffer_cached; |
1086 | } while (bh_pos += blocksize, (bh = bh->b_this_page) != head); | 1086 | } while (bh_pos += blocksize, (bh = bh->b_this_page) != head); |
1087 | /* If there are no errors, do the next page. */ | 1087 | /* If there are no errors, do the next page. */ |
1088 | if (likely(!err && ++u < nr_pages)) | 1088 | if (likely(!err && ++u < nr_pages)) |
1089 | goto do_next_page; | 1089 | goto do_next_page; |
1090 | /* If there are no errors, release the runlist lock if we took it. */ | 1090 | /* If there are no errors, release the runlist lock if we took it. */ |
1091 | if (likely(!err)) { | 1091 | if (likely(!err)) { |
1092 | if (unlikely(rl_write_locked)) { | 1092 | if (unlikely(rl_write_locked)) { |
1093 | up_write(&ni->runlist.lock); | 1093 | up_write(&ni->runlist.lock); |
1094 | rl_write_locked = false; | 1094 | rl_write_locked = false; |
1095 | } else if (unlikely(rl)) | 1095 | } else if (unlikely(rl)) |
1096 | up_read(&ni->runlist.lock); | 1096 | up_read(&ni->runlist.lock); |
1097 | rl = NULL; | 1097 | rl = NULL; |
1098 | } | 1098 | } |
1099 | /* If we issued read requests, let them complete. */ | 1099 | /* If we issued read requests, let them complete. */ |
1100 | read_lock_irqsave(&ni->size_lock, flags); | 1100 | read_lock_irqsave(&ni->size_lock, flags); |
1101 | initialized_size = ni->initialized_size; | 1101 | initialized_size = ni->initialized_size; |
1102 | read_unlock_irqrestore(&ni->size_lock, flags); | 1102 | read_unlock_irqrestore(&ni->size_lock, flags); |
1103 | while (wait_bh > wait) { | 1103 | while (wait_bh > wait) { |
1104 | bh = *--wait_bh; | 1104 | bh = *--wait_bh; |
1105 | wait_on_buffer(bh); | 1105 | wait_on_buffer(bh); |
1106 | if (likely(buffer_uptodate(bh))) { | 1106 | if (likely(buffer_uptodate(bh))) { |
1107 | page = bh->b_page; | 1107 | page = bh->b_page; |
1108 | bh_pos = ((s64)page->index << PAGE_CACHE_SHIFT) + | 1108 | bh_pos = ((s64)page->index << PAGE_CACHE_SHIFT) + |
1109 | bh_offset(bh); | 1109 | bh_offset(bh); |
1110 | /* | 1110 | /* |
1111 | * If the buffer overflows the initialized size, need | 1111 | * If the buffer overflows the initialized size, need |
1112 | * to zero the overflowing region. | 1112 | * to zero the overflowing region. |
1113 | */ | 1113 | */ |
1114 | if (unlikely(bh_pos + blocksize > initialized_size)) { | 1114 | if (unlikely(bh_pos + blocksize > initialized_size)) { |
1115 | int ofs = 0; | 1115 | int ofs = 0; |
1116 | 1116 | ||
1117 | if (likely(bh_pos < initialized_size)) | 1117 | if (likely(bh_pos < initialized_size)) |
1118 | ofs = initialized_size - bh_pos; | 1118 | ofs = initialized_size - bh_pos; |
1119 | zero_user_segment(page, bh_offset(bh) + ofs, | 1119 | zero_user_segment(page, bh_offset(bh) + ofs, |
1120 | blocksize); | 1120 | blocksize); |
1121 | } | 1121 | } |
1122 | } else /* if (unlikely(!buffer_uptodate(bh))) */ | 1122 | } else /* if (unlikely(!buffer_uptodate(bh))) */ |
1123 | err = -EIO; | 1123 | err = -EIO; |
1124 | } | 1124 | } |
1125 | if (likely(!err)) { | 1125 | if (likely(!err)) { |
1126 | /* Clear buffer_new on all buffers. */ | 1126 | /* Clear buffer_new on all buffers. */ |
1127 | u = 0; | 1127 | u = 0; |
1128 | do { | 1128 | do { |
1129 | bh = head = page_buffers(pages[u]); | 1129 | bh = head = page_buffers(pages[u]); |
1130 | do { | 1130 | do { |
1131 | if (buffer_new(bh)) | 1131 | if (buffer_new(bh)) |
1132 | clear_buffer_new(bh); | 1132 | clear_buffer_new(bh); |
1133 | } while ((bh = bh->b_this_page) != head); | 1133 | } while ((bh = bh->b_this_page) != head); |
1134 | } while (++u < nr_pages); | 1134 | } while (++u < nr_pages); |
1135 | ntfs_debug("Done."); | 1135 | ntfs_debug("Done."); |
1136 | return err; | 1136 | return err; |
1137 | } | 1137 | } |
1138 | if (status.attr_switched) { | 1138 | if (status.attr_switched) { |
1139 | /* Get back to the attribute extent we modified. */ | 1139 | /* Get back to the attribute extent we modified. */ |
1140 | ntfs_attr_reinit_search_ctx(ctx); | 1140 | ntfs_attr_reinit_search_ctx(ctx); |
1141 | if (ntfs_attr_lookup(ni->type, ni->name, ni->name_len, | 1141 | if (ntfs_attr_lookup(ni->type, ni->name, ni->name_len, |
1142 | CASE_SENSITIVE, bh_cpos, NULL, 0, ctx)) { | 1142 | CASE_SENSITIVE, bh_cpos, NULL, 0, ctx)) { |
1143 | ntfs_error(vol->sb, "Failed to find required " | 1143 | ntfs_error(vol->sb, "Failed to find required " |
1144 | "attribute extent of attribute in " | 1144 | "attribute extent of attribute in " |
1145 | "error code path. Run chkdsk to " | 1145 | "error code path. Run chkdsk to " |
1146 | "recover."); | 1146 | "recover."); |
1147 | write_lock_irqsave(&ni->size_lock, flags); | 1147 | write_lock_irqsave(&ni->size_lock, flags); |
1148 | ni->itype.compressed.size += vol->cluster_size; | 1148 | ni->itype.compressed.size += vol->cluster_size; |
1149 | write_unlock_irqrestore(&ni->size_lock, flags); | 1149 | write_unlock_irqrestore(&ni->size_lock, flags); |
1150 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 1150 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
1151 | mark_mft_record_dirty(ctx->ntfs_ino); | 1151 | mark_mft_record_dirty(ctx->ntfs_ino); |
1152 | /* | 1152 | /* |
1153 | * The only thing that is now wrong is the compressed | 1153 | * The only thing that is now wrong is the compressed |
1154 | * size of the base attribute extent which chkdsk | 1154 | * size of the base attribute extent which chkdsk |
1155 | * should be able to fix. | 1155 | * should be able to fix. |
1156 | */ | 1156 | */ |
1157 | NVolSetErrors(vol); | 1157 | NVolSetErrors(vol); |
1158 | } else { | 1158 | } else { |
1159 | m = ctx->mrec; | 1159 | m = ctx->mrec; |
1160 | a = ctx->attr; | 1160 | a = ctx->attr; |
1161 | status.attr_switched = 0; | 1161 | status.attr_switched = 0; |
1162 | } | 1162 | } |
1163 | } | 1163 | } |
1164 | /* | 1164 | /* |
1165 | * If the runlist has been modified, need to restore it by punching a | 1165 | * If the runlist has been modified, need to restore it by punching a |
1166 | * hole into it and we then need to deallocate the on-disk cluster as | 1166 | * hole into it and we then need to deallocate the on-disk cluster as |
1167 | * well. Note, we only modify the runlist if we are able to generate a | 1167 | * well. Note, we only modify the runlist if we are able to generate a |
1168 | * new mapping pairs array, i.e. only when the mapped attribute extent | 1168 | * new mapping pairs array, i.e. only when the mapped attribute extent |
1169 | * is not switched. | 1169 | * is not switched. |
1170 | */ | 1170 | */ |
1171 | if (status.runlist_merged && !status.attr_switched) { | 1171 | if (status.runlist_merged && !status.attr_switched) { |
1172 | BUG_ON(!rl_write_locked); | 1172 | BUG_ON(!rl_write_locked); |
1173 | /* Make the file cluster we allocated sparse in the runlist. */ | 1173 | /* Make the file cluster we allocated sparse in the runlist. */ |
1174 | if (ntfs_rl_punch_nolock(vol, &ni->runlist, bh_cpos, 1)) { | 1174 | if (ntfs_rl_punch_nolock(vol, &ni->runlist, bh_cpos, 1)) { |
1175 | ntfs_error(vol->sb, "Failed to punch hole into " | 1175 | ntfs_error(vol->sb, "Failed to punch hole into " |
1176 | "attribute runlist in error code " | 1176 | "attribute runlist in error code " |
1177 | "path. Run chkdsk to recover the " | 1177 | "path. Run chkdsk to recover the " |
1178 | "lost cluster."); | 1178 | "lost cluster."); |
1179 | NVolSetErrors(vol); | 1179 | NVolSetErrors(vol); |
1180 | } else /* if (success) */ { | 1180 | } else /* if (success) */ { |
1181 | status.runlist_merged = 0; | 1181 | status.runlist_merged = 0; |
1182 | /* | 1182 | /* |
1183 | * Deallocate the on-disk cluster we allocated but only | 1183 | * Deallocate the on-disk cluster we allocated but only |
1184 | * if we succeeded in punching its vcn out of the | 1184 | * if we succeeded in punching its vcn out of the |
1185 | * runlist. | 1185 | * runlist. |
1186 | */ | 1186 | */ |
1187 | down_write(&vol->lcnbmp_lock); | 1187 | down_write(&vol->lcnbmp_lock); |
1188 | if (ntfs_bitmap_clear_bit(vol->lcnbmp_ino, lcn)) { | 1188 | if (ntfs_bitmap_clear_bit(vol->lcnbmp_ino, lcn)) { |
1189 | ntfs_error(vol->sb, "Failed to release " | 1189 | ntfs_error(vol->sb, "Failed to release " |
1190 | "allocated cluster in error " | 1190 | "allocated cluster in error " |
1191 | "code path. Run chkdsk to " | 1191 | "code path. Run chkdsk to " |
1192 | "recover the lost cluster."); | 1192 | "recover the lost cluster."); |
1193 | NVolSetErrors(vol); | 1193 | NVolSetErrors(vol); |
1194 | } | 1194 | } |
1195 | up_write(&vol->lcnbmp_lock); | 1195 | up_write(&vol->lcnbmp_lock); |
1196 | } | 1196 | } |
1197 | } | 1197 | } |
1198 | /* | 1198 | /* |
1199 | * Resize the attribute record to its old size and rebuild the mapping | 1199 | * Resize the attribute record to its old size and rebuild the mapping |
1200 | * pairs array. Note, we only can do this if the runlist has been | 1200 | * pairs array. Note, we only can do this if the runlist has been |
1201 | * restored to its old state which also implies that the mapped | 1201 | * restored to its old state which also implies that the mapped |
1202 | * attribute extent is not switched. | 1202 | * attribute extent is not switched. |
1203 | */ | 1203 | */ |
1204 | if (status.mp_rebuilt && !status.runlist_merged) { | 1204 | if (status.mp_rebuilt && !status.runlist_merged) { |
1205 | if (ntfs_attr_record_resize(m, a, attr_rec_len)) { | 1205 | if (ntfs_attr_record_resize(m, a, attr_rec_len)) { |
1206 | ntfs_error(vol->sb, "Failed to restore attribute " | 1206 | ntfs_error(vol->sb, "Failed to restore attribute " |
1207 | "record in error code path. Run " | 1207 | "record in error code path. Run " |
1208 | "chkdsk to recover."); | 1208 | "chkdsk to recover."); |
1209 | NVolSetErrors(vol); | 1209 | NVolSetErrors(vol); |
1210 | } else /* if (success) */ { | 1210 | } else /* if (success) */ { |
1211 | if (ntfs_mapping_pairs_build(vol, (u8*)a + | 1211 | if (ntfs_mapping_pairs_build(vol, (u8*)a + |
1212 | le16_to_cpu(a->data.non_resident. | 1212 | le16_to_cpu(a->data.non_resident. |
1213 | mapping_pairs_offset), attr_rec_len - | 1213 | mapping_pairs_offset), attr_rec_len - |
1214 | le16_to_cpu(a->data.non_resident. | 1214 | le16_to_cpu(a->data.non_resident. |
1215 | mapping_pairs_offset), ni->runlist.rl, | 1215 | mapping_pairs_offset), ni->runlist.rl, |
1216 | vcn, highest_vcn, NULL)) { | 1216 | vcn, highest_vcn, NULL)) { |
1217 | ntfs_error(vol->sb, "Failed to restore " | 1217 | ntfs_error(vol->sb, "Failed to restore " |
1218 | "mapping pairs array in error " | 1218 | "mapping pairs array in error " |
1219 | "code path. Run chkdsk to " | 1219 | "code path. Run chkdsk to " |
1220 | "recover."); | 1220 | "recover."); |
1221 | NVolSetErrors(vol); | 1221 | NVolSetErrors(vol); |
1222 | } | 1222 | } |
1223 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 1223 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
1224 | mark_mft_record_dirty(ctx->ntfs_ino); | 1224 | mark_mft_record_dirty(ctx->ntfs_ino); |
1225 | } | 1225 | } |
1226 | } | 1226 | } |
1227 | /* Release the mft record and the attribute. */ | 1227 | /* Release the mft record and the attribute. */ |
1228 | if (status.mft_attr_mapped) { | 1228 | if (status.mft_attr_mapped) { |
1229 | ntfs_attr_put_search_ctx(ctx); | 1229 | ntfs_attr_put_search_ctx(ctx); |
1230 | unmap_mft_record(base_ni); | 1230 | unmap_mft_record(base_ni); |
1231 | } | 1231 | } |
1232 | /* Release the runlist lock. */ | 1232 | /* Release the runlist lock. */ |
1233 | if (rl_write_locked) | 1233 | if (rl_write_locked) |
1234 | up_write(&ni->runlist.lock); | 1234 | up_write(&ni->runlist.lock); |
1235 | else if (rl) | 1235 | else if (rl) |
1236 | up_read(&ni->runlist.lock); | 1236 | up_read(&ni->runlist.lock); |
1237 | /* | 1237 | /* |
1238 | * Zero out any newly allocated blocks to avoid exposing stale data. | 1238 | * Zero out any newly allocated blocks to avoid exposing stale data. |
1239 | * If BH_New is set, we know that the block was newly allocated above | 1239 | * If BH_New is set, we know that the block was newly allocated above |
1240 | * and that it has not been fully zeroed and marked dirty yet. | 1240 | * and that it has not been fully zeroed and marked dirty yet. |
1241 | */ | 1241 | */ |
1242 | nr_pages = u; | 1242 | nr_pages = u; |
1243 | u = 0; | 1243 | u = 0; |
1244 | end = bh_cpos << vol->cluster_size_bits; | 1244 | end = bh_cpos << vol->cluster_size_bits; |
1245 | do { | 1245 | do { |
1246 | page = pages[u]; | 1246 | page = pages[u]; |
1247 | bh = head = page_buffers(page); | 1247 | bh = head = page_buffers(page); |
1248 | do { | 1248 | do { |
1249 | if (u == nr_pages && | 1249 | if (u == nr_pages && |
1250 | ((s64)page->index << PAGE_CACHE_SHIFT) + | 1250 | ((s64)page->index << PAGE_CACHE_SHIFT) + |
1251 | bh_offset(bh) >= end) | 1251 | bh_offset(bh) >= end) |
1252 | break; | 1252 | break; |
1253 | if (!buffer_new(bh)) | 1253 | if (!buffer_new(bh)) |
1254 | continue; | 1254 | continue; |
1255 | clear_buffer_new(bh); | 1255 | clear_buffer_new(bh); |
1256 | if (!buffer_uptodate(bh)) { | 1256 | if (!buffer_uptodate(bh)) { |
1257 | if (PageUptodate(page)) | 1257 | if (PageUptodate(page)) |
1258 | set_buffer_uptodate(bh); | 1258 | set_buffer_uptodate(bh); |
1259 | else { | 1259 | else { |
1260 | zero_user(page, bh_offset(bh), | 1260 | zero_user(page, bh_offset(bh), |
1261 | blocksize); | 1261 | blocksize); |
1262 | set_buffer_uptodate(bh); | 1262 | set_buffer_uptodate(bh); |
1263 | } | 1263 | } |
1264 | } | 1264 | } |
1265 | mark_buffer_dirty(bh); | 1265 | mark_buffer_dirty(bh); |
1266 | } while ((bh = bh->b_this_page) != head); | 1266 | } while ((bh = bh->b_this_page) != head); |
1267 | } while (++u <= nr_pages); | 1267 | } while (++u <= nr_pages); |
1268 | ntfs_error(vol->sb, "Failed. Returning error code %i.", err); | 1268 | ntfs_error(vol->sb, "Failed. Returning error code %i.", err); |
1269 | return err; | 1269 | return err; |
1270 | } | 1270 | } |
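The error handling above follows a staged-rollback pattern: each step that changes persistent or shared state sets a flag in the local status bitfield (runlist_merged, mft_attr_mapped, mp_rebuilt, attr_switched), and the error path undoes only the steps whose flags are still set, newest first. A minimal sketch of that pattern, using hypothetical do_step_*()/undo_step_*() helpers rather than functions from this file:

static int example_staged_update(void)
{
        struct {
                unsigned a_done:1;
                unsigned b_done:1;
        } status = { 0, 0 };
        int err;

        err = do_step_a();              /* hypothetical helper */
        if (err)
                goto err_out;
        status.a_done = 1;
        err = do_step_b();              /* hypothetical helper */
        if (err)
                goto err_out;
        status.b_done = 1;
        return 0;
err_out:
        /* Undo only what was actually done, newest step first. */
        if (status.b_done)
                undo_step_b();
        if (status.a_done)
                undo_step_a();
        return err;
}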
1271 | 1271 | ||
1272 | /* | 1272 | /* |
1273 | * Copy as much as we can into the pages and return the number of bytes which | 1273 | * Copy as much as we can into the pages and return the number of bytes which |
1274 | * were successfully copied. If a fault is encountered then clear the pages | 1274 | * were successfully copied. If a fault is encountered then clear the pages |
1275 | * out to (ofs + bytes) and return the number of bytes which were copied. | 1275 | * out to (ofs + bytes) and return the number of bytes which were copied. |
1276 | */ | 1276 | */ |
1277 | static inline size_t ntfs_copy_from_user(struct page **pages, | 1277 | static inline size_t ntfs_copy_from_user(struct page **pages, |
1278 | unsigned nr_pages, unsigned ofs, const char __user *buf, | 1278 | unsigned nr_pages, unsigned ofs, const char __user *buf, |
1279 | size_t bytes) | 1279 | size_t bytes) |
1280 | { | 1280 | { |
1281 | struct page **last_page = pages + nr_pages; | 1281 | struct page **last_page = pages + nr_pages; |
1282 | char *addr; | 1282 | char *addr; |
1283 | size_t total = 0; | 1283 | size_t total = 0; |
1284 | unsigned len; | 1284 | unsigned len; |
1285 | int left; | 1285 | int left; |
1286 | 1286 | ||
1287 | do { | 1287 | do { |
1288 | len = PAGE_CACHE_SIZE - ofs; | 1288 | len = PAGE_CACHE_SIZE - ofs; |
1289 | if (len > bytes) | 1289 | if (len > bytes) |
1290 | len = bytes; | 1290 | len = bytes; |
1291 | addr = kmap_atomic(*pages); | 1291 | addr = kmap_atomic(*pages); |
1292 | left = __copy_from_user_inatomic(addr + ofs, buf, len); | 1292 | left = __copy_from_user_inatomic(addr + ofs, buf, len); |
1293 | kunmap_atomic(addr); | 1293 | kunmap_atomic(addr); |
1294 | if (unlikely(left)) { | 1294 | if (unlikely(left)) { |
1295 | /* Do it the slow way. */ | 1295 | /* Do it the slow way. */ |
1296 | addr = kmap(*pages); | 1296 | addr = kmap(*pages); |
1297 | left = __copy_from_user(addr + ofs, buf, len); | 1297 | left = __copy_from_user(addr + ofs, buf, len); |
1298 | kunmap(*pages); | 1298 | kunmap(*pages); |
1299 | if (unlikely(left)) | 1299 | if (unlikely(left)) |
1300 | goto err_out; | 1300 | goto err_out; |
1301 | } | 1301 | } |
1302 | total += len; | 1302 | total += len; |
1303 | bytes -= len; | 1303 | bytes -= len; |
1304 | if (!bytes) | 1304 | if (!bytes) |
1305 | break; | 1305 | break; |
1306 | buf += len; | 1306 | buf += len; |
1307 | ofs = 0; | 1307 | ofs = 0; |
1308 | } while (++pages < last_page); | 1308 | } while (++pages < last_page); |
1309 | out: | 1309 | out: |
1310 | return total; | 1310 | return total; |
1311 | err_out: | 1311 | err_out: |
1312 | total += len - left; | 1312 | total += len - left; |
1313 | /* Zero the rest of the target like __copy_from_user(). */ | 1313 | /* Zero the rest of the target like __copy_from_user(). */ |
1314 | while (++pages < last_page) { | 1314 | while (++pages < last_page) { |
1315 | bytes -= len; | 1315 | bytes -= len; |
1316 | if (!bytes) | 1316 | if (!bytes) |
1317 | break; | 1317 | break; |
1318 | len = PAGE_CACHE_SIZE; | 1318 | len = PAGE_CACHE_SIZE; |
1319 | if (len > bytes) | 1319 | if (len > bytes) |
1320 | len = bytes; | 1320 | len = bytes; |
1321 | zero_user(*pages, 0, len); | 1321 | zero_user(*pages, 0, len); |
1322 | } | 1322 | } |
1323 | goto out; | 1323 | goto out; |
1324 | } | 1324 | } |
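ntfs_copy_from_user() above uses a common two-stage copy: first try under kmap_atomic() with __copy_from_user_inatomic(), which cannot sleep and may therefore fail rather than fault the user pages in, and only on failure fall back to a sleeping kmap()/__copy_from_user(). A minimal single-page sketch of the same pattern (the function name is illustrative; the kmap and copy helpers are the ones used above):

static int example_copy_page_from_user(struct page *page, unsigned ofs,
                const char __user *buf, unsigned len)
{
        char *addr;
        int left;

        /* Fast path: atomic kmap; the copy cannot sleep so it may fail. */
        addr = kmap_atomic(page);
        left = __copy_from_user_inatomic(addr + ofs, buf, len);
        kunmap_atomic(addr);
        if (unlikely(left)) {
                /* Slow path: may sleep to fault the user buffer in. */
                addr = kmap(page);
                left = __copy_from_user(addr + ofs, buf, len);
                kunmap(page);
        }
        return left ? -EFAULT : 0;
}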
1325 | 1325 | ||
1326 | static size_t __ntfs_copy_from_user_iovec_inatomic(char *vaddr, | 1326 | static size_t __ntfs_copy_from_user_iovec_inatomic(char *vaddr, |
1327 | const struct iovec *iov, size_t iov_ofs, size_t bytes) | 1327 | const struct iovec *iov, size_t iov_ofs, size_t bytes) |
1328 | { | 1328 | { |
1329 | size_t total = 0; | 1329 | size_t total = 0; |
1330 | 1330 | ||
1331 | while (1) { | 1331 | while (1) { |
1332 | const char __user *buf = iov->iov_base + iov_ofs; | 1332 | const char __user *buf = iov->iov_base + iov_ofs; |
1333 | unsigned len; | 1333 | unsigned len; |
1334 | size_t left; | 1334 | size_t left; |
1335 | 1335 | ||
1336 | len = iov->iov_len - iov_ofs; | 1336 | len = iov->iov_len - iov_ofs; |
1337 | if (len > bytes) | 1337 | if (len > bytes) |
1338 | len = bytes; | 1338 | len = bytes; |
1339 | left = __copy_from_user_inatomic(vaddr, buf, len); | 1339 | left = __copy_from_user_inatomic(vaddr, buf, len); |
1340 | total += len; | 1340 | total += len; |
1341 | bytes -= len; | 1341 | bytes -= len; |
1342 | vaddr += len; | 1342 | vaddr += len; |
1343 | if (unlikely(left)) { | 1343 | if (unlikely(left)) { |
1344 | total -= left; | 1344 | total -= left; |
1345 | break; | 1345 | break; |
1346 | } | 1346 | } |
1347 | if (!bytes) | 1347 | if (!bytes) |
1348 | break; | 1348 | break; |
1349 | iov++; | 1349 | iov++; |
1350 | iov_ofs = 0; | 1350 | iov_ofs = 0; |
1351 | } | 1351 | } |
1352 | return total; | 1352 | return total; |
1353 | } | 1353 | } |
1354 | 1354 | ||
1355 | static inline void ntfs_set_next_iovec(const struct iovec **iovp, | 1355 | static inline void ntfs_set_next_iovec(const struct iovec **iovp, |
1356 | size_t *iov_ofsp, size_t bytes) | 1356 | size_t *iov_ofsp, size_t bytes) |
1357 | { | 1357 | { |
1358 | const struct iovec *iov = *iovp; | 1358 | const struct iovec *iov = *iovp; |
1359 | size_t iov_ofs = *iov_ofsp; | 1359 | size_t iov_ofs = *iov_ofsp; |
1360 | 1360 | ||
1361 | while (bytes) { | 1361 | while (bytes) { |
1362 | unsigned len; | 1362 | unsigned len; |
1363 | 1363 | ||
1364 | len = iov->iov_len - iov_ofs; | 1364 | len = iov->iov_len - iov_ofs; |
1365 | if (len > bytes) | 1365 | if (len > bytes) |
1366 | len = bytes; | 1366 | len = bytes; |
1367 | bytes -= len; | 1367 | bytes -= len; |
1368 | iov_ofs += len; | 1368 | iov_ofs += len; |
1369 | if (iov->iov_len == iov_ofs) { | 1369 | if (iov->iov_len == iov_ofs) { |
1370 | iov++; | 1370 | iov++; |
1371 | iov_ofs = 0; | 1371 | iov_ofs = 0; |
1372 | } | 1372 | } |
1373 | } | 1373 | } |
1374 | *iovp = iov; | 1374 | *iovp = iov; |
1375 | *iov_ofsp = iov_ofs; | 1375 | *iov_ofsp = iov_ofs; |
1376 | } | 1376 | } |
1377 | 1377 | ||
1378 | /* | 1378 | /* |
1379 | * This has the same side-effects and return value as ntfs_copy_from_user(). | 1379 | * This has the same side-effects and return value as ntfs_copy_from_user(). |
1380 | * The difference is that on a fault we need to memset the remainder of the | 1380 | * The difference is that on a fault we need to memset the remainder of the |
1381 | * pages (out to offset + bytes), to emulate ntfs_copy_from_user()'s | 1381 | * pages (out to offset + bytes), to emulate ntfs_copy_from_user()'s |
1382 | * single-segment behaviour. | 1382 | * single-segment behaviour. |
1383 | * | 1383 | * |
1384 | * We call the same helper (__ntfs_copy_from_user_iovec_inatomic()) both when | 1384 | * We call the same helper (__ntfs_copy_from_user_iovec_inatomic()) both when |
1385 | * atomic and when not atomic. This is ok because it calls | 1385 | * atomic and when not atomic. This is ok because it calls |
1386 | * __copy_from_user_inatomic() and it is ok to call this when non-atomic. In | 1386 | * __copy_from_user_inatomic() and it is ok to call this when non-atomic. In |
1387 | * fact, the only difference between __copy_from_user_inatomic() and | 1387 | * fact, the only difference between __copy_from_user_inatomic() and |
1388 | * __copy_from_user() is that the latter calls might_sleep() and the former | 1388 | * __copy_from_user() is that the latter calls might_sleep() and the former |
1389 | * should not zero the tail of the buffer on error. And on many architectures | 1389 | * should not zero the tail of the buffer on error. And on many architectures |
1390 | * __copy_from_user_inatomic() is just defined to __copy_from_user() so it | 1390 | * __copy_from_user_inatomic() is just defined to __copy_from_user() so it |
1391 | * makes no difference at all on those architectures. | 1391 | * makes no difference at all on those architectures. |
1392 | */ | 1392 | */ |
1393 | static inline size_t ntfs_copy_from_user_iovec(struct page **pages, | 1393 | static inline size_t ntfs_copy_from_user_iovec(struct page **pages, |
1394 | unsigned nr_pages, unsigned ofs, const struct iovec **iov, | 1394 | unsigned nr_pages, unsigned ofs, const struct iovec **iov, |
1395 | size_t *iov_ofs, size_t bytes) | 1395 | size_t *iov_ofs, size_t bytes) |
1396 | { | 1396 | { |
1397 | struct page **last_page = pages + nr_pages; | 1397 | struct page **last_page = pages + nr_pages; |
1398 | char *addr; | 1398 | char *addr; |
1399 | size_t copied, len, total = 0; | 1399 | size_t copied, len, total = 0; |
1400 | 1400 | ||
1401 | do { | 1401 | do { |
1402 | len = PAGE_CACHE_SIZE - ofs; | 1402 | len = PAGE_CACHE_SIZE - ofs; |
1403 | if (len > bytes) | 1403 | if (len > bytes) |
1404 | len = bytes; | 1404 | len = bytes; |
1405 | addr = kmap_atomic(*pages); | 1405 | addr = kmap_atomic(*pages); |
1406 | copied = __ntfs_copy_from_user_iovec_inatomic(addr + ofs, | 1406 | copied = __ntfs_copy_from_user_iovec_inatomic(addr + ofs, |
1407 | *iov, *iov_ofs, len); | 1407 | *iov, *iov_ofs, len); |
1408 | kunmap_atomic(addr); | 1408 | kunmap_atomic(addr); |
1409 | if (unlikely(copied != len)) { | 1409 | if (unlikely(copied != len)) { |
1410 | /* Do it the slow way. */ | 1410 | /* Do it the slow way. */ |
1411 | addr = kmap(*pages); | 1411 | addr = kmap(*pages); |
1412 | copied = __ntfs_copy_from_user_iovec_inatomic(addr + | 1412 | copied = __ntfs_copy_from_user_iovec_inatomic(addr + |
1413 | ofs, *iov, *iov_ofs, len); | 1413 | ofs, *iov, *iov_ofs, len); |
1414 | if (unlikely(copied != len)) | 1414 | if (unlikely(copied != len)) |
1415 | goto err_out; | 1415 | goto err_out; |
1416 | kunmap(*pages); | 1416 | kunmap(*pages); |
1417 | } | 1417 | } |
1418 | total += len; | 1418 | total += len; |
1419 | ntfs_set_next_iovec(iov, iov_ofs, len); | 1419 | ntfs_set_next_iovec(iov, iov_ofs, len); |
1420 | bytes -= len; | 1420 | bytes -= len; |
1421 | if (!bytes) | 1421 | if (!bytes) |
1422 | break; | 1422 | break; |
1423 | ofs = 0; | 1423 | ofs = 0; |
1424 | } while (++pages < last_page); | 1424 | } while (++pages < last_page); |
1425 | out: | 1425 | out: |
1426 | return total; | 1426 | return total; |
1427 | err_out: | 1427 | err_out: |
1428 | BUG_ON(copied > len); | 1428 | BUG_ON(copied > len); |
1429 | /* Zero the rest of the target like __copy_from_user(). */ | 1429 | /* Zero the rest of the target like __copy_from_user(). */ |
1430 | memset(addr + ofs + copied, 0, len - copied); | 1430 | memset(addr + ofs + copied, 0, len - copied); |
1431 | kunmap(*pages); | 1431 | kunmap(*pages); |
1432 | total += copied; | 1432 | total += copied; |
1433 | ntfs_set_next_iovec(iov, iov_ofs, copied); | 1433 | ntfs_set_next_iovec(iov, iov_ofs, copied); |
1434 | while (++pages < last_page) { | 1434 | while (++pages < last_page) { |
1435 | bytes -= len; | 1435 | bytes -= len; |
1436 | if (!bytes) | 1436 | if (!bytes) |
1437 | break; | 1437 | break; |
1438 | len = PAGE_CACHE_SIZE; | 1438 | len = PAGE_CACHE_SIZE; |
1439 | if (len > bytes) | 1439 | if (len > bytes) |
1440 | len = bytes; | 1440 | len = bytes; |
1441 | zero_user(*pages, 0, len); | 1441 | zero_user(*pages, 0, len); |
1442 | } | 1442 | } |
1443 | goto out; | 1443 | goto out; |
1444 | } | 1444 | } |
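As the comment above notes, __copy_from_user_inatomic() does not zero the tail of the destination on a fault, whereas __copy_from_user() does; the err_out path therefore has to memset() the uncopied remainder itself. A minimal sketch of that zero-the-tail convention for a single buffer (the function name is illustrative):

static size_t example_copy_with_zero_tail(char *dst, const char __user *src,
                size_t len)
{
        size_t left = __copy_from_user_inatomic(dst, src, len);

        if (unlikely(left)) {
                /* Fault: only the first len - left bytes were copied. */
                memset(dst + len - left, 0, left);
        }
        return len - left;      /* bytes actually copied from user space */
}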
1445 | 1445 | ||
1446 | static inline void ntfs_flush_dcache_pages(struct page **pages, | 1446 | static inline void ntfs_flush_dcache_pages(struct page **pages, |
1447 | unsigned nr_pages) | 1447 | unsigned nr_pages) |
1448 | { | 1448 | { |
1449 | BUG_ON(!nr_pages); | 1449 | BUG_ON(!nr_pages); |
1450 | /* | 1450 | /* |
1451 | * Warning: Do not do the decrement at the same time as the call to | 1451 | * Warning: Do not do the decrement at the same time as the call to |
1452 | * flush_dcache_page() because it is a NULL macro on i386 and hence the | 1452 | * flush_dcache_page() because it is a NULL macro on i386 and hence the |
1453 | * decrement never happens so the loop never terminates. | 1453 | * decrement never happens so the loop never terminates. |
1454 | */ | 1454 | */ |
1455 | do { | 1455 | do { |
1456 | --nr_pages; | 1456 | --nr_pages; |
1457 | flush_dcache_page(pages[nr_pages]); | 1457 | flush_dcache_page(pages[nr_pages]); |
1458 | } while (nr_pages > 0); | 1458 | } while (nr_pages > 0); |
1459 | } | 1459 | } |
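The warning in ntfs_flush_dcache_pages() comes down to macro argument evaluation: where flush_dcache_page() expands to an empty statement (the comment cites i386), its argument expression is never evaluated, so a side effect folded into the argument silently disappears. For illustration only, the broken variant would be:

/*
 * Broken variant, for illustration only.  If flush_dcache_page() is an
 * empty macro, pages[--nr_pages] is never evaluated, nr_pages never
 * reaches zero and the loop never terminates:
 *
 *      do {
 *              flush_dcache_page(pages[--nr_pages]);
 *      } while (nr_pages > 0);
 *
 * Keeping the decrement in its own statement, as the helper above does,
 * guarantees it always runs.
 */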
1460 | 1460 | ||
1461 | /** | 1461 | /** |
1462 | * ntfs_commit_pages_after_non_resident_write - commit the received data | 1462 | * ntfs_commit_pages_after_non_resident_write - commit the received data |
1463 | * @pages: array of destination pages | 1463 | * @pages: array of destination pages |
1464 | * @nr_pages: number of pages in @pages | 1464 | * @nr_pages: number of pages in @pages |
1465 | * @pos: byte position in file at which the write begins | 1465 | * @pos: byte position in file at which the write begins |
1466 | * @bytes: number of bytes to be written | 1466 | * @bytes: number of bytes to be written |
1467 | * | 1467 | * |
1468 | * See description of ntfs_commit_pages_after_write(), below. | 1468 | * See description of ntfs_commit_pages_after_write(), below. |
1469 | */ | 1469 | */ |
1470 | static inline int ntfs_commit_pages_after_non_resident_write( | 1470 | static inline int ntfs_commit_pages_after_non_resident_write( |
1471 | struct page **pages, const unsigned nr_pages, | 1471 | struct page **pages, const unsigned nr_pages, |
1472 | s64 pos, size_t bytes) | 1472 | s64 pos, size_t bytes) |
1473 | { | 1473 | { |
1474 | s64 end, initialized_size; | 1474 | s64 end, initialized_size; |
1475 | struct inode *vi; | 1475 | struct inode *vi; |
1476 | ntfs_inode *ni, *base_ni; | 1476 | ntfs_inode *ni, *base_ni; |
1477 | struct buffer_head *bh, *head; | 1477 | struct buffer_head *bh, *head; |
1478 | ntfs_attr_search_ctx *ctx; | 1478 | ntfs_attr_search_ctx *ctx; |
1479 | MFT_RECORD *m; | 1479 | MFT_RECORD *m; |
1480 | ATTR_RECORD *a; | 1480 | ATTR_RECORD *a; |
1481 | unsigned long flags; | 1481 | unsigned long flags; |
1482 | unsigned blocksize, u; | 1482 | unsigned blocksize, u; |
1483 | int err; | 1483 | int err; |
1484 | 1484 | ||
1485 | vi = pages[0]->mapping->host; | 1485 | vi = pages[0]->mapping->host; |
1486 | ni = NTFS_I(vi); | 1486 | ni = NTFS_I(vi); |
1487 | blocksize = vi->i_sb->s_blocksize; | 1487 | blocksize = vi->i_sb->s_blocksize; |
1488 | end = pos + bytes; | 1488 | end = pos + bytes; |
1489 | u = 0; | 1489 | u = 0; |
1490 | do { | 1490 | do { |
1491 | s64 bh_pos; | 1491 | s64 bh_pos; |
1492 | struct page *page; | 1492 | struct page *page; |
1493 | bool partial; | 1493 | bool partial; |
1494 | 1494 | ||
1495 | page = pages[u]; | 1495 | page = pages[u]; |
1496 | bh_pos = (s64)page->index << PAGE_CACHE_SHIFT; | 1496 | bh_pos = (s64)page->index << PAGE_CACHE_SHIFT; |
1497 | bh = head = page_buffers(page); | 1497 | bh = head = page_buffers(page); |
1498 | partial = false; | 1498 | partial = false; |
1499 | do { | 1499 | do { |
1500 | s64 bh_end; | 1500 | s64 bh_end; |
1501 | 1501 | ||
1502 | bh_end = bh_pos + blocksize; | 1502 | bh_end = bh_pos + blocksize; |
1503 | if (bh_end <= pos || bh_pos >= end) { | 1503 | if (bh_end <= pos || bh_pos >= end) { |
1504 | if (!buffer_uptodate(bh)) | 1504 | if (!buffer_uptodate(bh)) |
1505 | partial = true; | 1505 | partial = true; |
1506 | } else { | 1506 | } else { |
1507 | set_buffer_uptodate(bh); | 1507 | set_buffer_uptodate(bh); |
1508 | mark_buffer_dirty(bh); | 1508 | mark_buffer_dirty(bh); |
1509 | } | 1509 | } |
1510 | } while (bh_pos += blocksize, (bh = bh->b_this_page) != head); | 1510 | } while (bh_pos += blocksize, (bh = bh->b_this_page) != head); |
1511 | /* | 1511 | /* |
1512 | * If all buffers are now uptodate but the page is not, set the | 1512 | * If all buffers are now uptodate but the page is not, set the |
1513 | * page uptodate. | 1513 | * page uptodate. |
1514 | */ | 1514 | */ |
1515 | if (!partial && !PageUptodate(page)) | 1515 | if (!partial && !PageUptodate(page)) |
1516 | SetPageUptodate(page); | 1516 | SetPageUptodate(page); |
1517 | } while (++u < nr_pages); | 1517 | } while (++u < nr_pages); |
1518 | /* | 1518 | /* |
1519 | * Finally, if we do not need to update initialized_size or i_size we | 1519 | * Finally, if we do not need to update initialized_size or i_size we |
1520 | * are finished. | 1520 | * are finished. |
1521 | */ | 1521 | */ |
1522 | read_lock_irqsave(&ni->size_lock, flags); | 1522 | read_lock_irqsave(&ni->size_lock, flags); |
1523 | initialized_size = ni->initialized_size; | 1523 | initialized_size = ni->initialized_size; |
1524 | read_unlock_irqrestore(&ni->size_lock, flags); | 1524 | read_unlock_irqrestore(&ni->size_lock, flags); |
1525 | if (end <= initialized_size) { | 1525 | if (end <= initialized_size) { |
1526 | ntfs_debug("Done."); | 1526 | ntfs_debug("Done."); |
1527 | return 0; | 1527 | return 0; |
1528 | } | 1528 | } |
1529 | /* | 1529 | /* |
1530 | * Update initialized_size/i_size as appropriate, both in the inode and | 1530 | * Update initialized_size/i_size as appropriate, both in the inode and |
1531 | * the mft record. | 1531 | * the mft record. |
1532 | */ | 1532 | */ |
1533 | if (!NInoAttr(ni)) | 1533 | if (!NInoAttr(ni)) |
1534 | base_ni = ni; | 1534 | base_ni = ni; |
1535 | else | 1535 | else |
1536 | base_ni = ni->ext.base_ntfs_ino; | 1536 | base_ni = ni->ext.base_ntfs_ino; |
1537 | /* Map, pin, and lock the mft record. */ | 1537 | /* Map, pin, and lock the mft record. */ |
1538 | m = map_mft_record(base_ni); | 1538 | m = map_mft_record(base_ni); |
1539 | if (IS_ERR(m)) { | 1539 | if (IS_ERR(m)) { |
1540 | err = PTR_ERR(m); | 1540 | err = PTR_ERR(m); |
1541 | m = NULL; | 1541 | m = NULL; |
1542 | ctx = NULL; | 1542 | ctx = NULL; |
1543 | goto err_out; | 1543 | goto err_out; |
1544 | } | 1544 | } |
1545 | BUG_ON(!NInoNonResident(ni)); | 1545 | BUG_ON(!NInoNonResident(ni)); |
1546 | ctx = ntfs_attr_get_search_ctx(base_ni, m); | 1546 | ctx = ntfs_attr_get_search_ctx(base_ni, m); |
1547 | if (unlikely(!ctx)) { | 1547 | if (unlikely(!ctx)) { |
1548 | err = -ENOMEM; | 1548 | err = -ENOMEM; |
1549 | goto err_out; | 1549 | goto err_out; |
1550 | } | 1550 | } |
1551 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, | 1551 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, |
1552 | CASE_SENSITIVE, 0, NULL, 0, ctx); | 1552 | CASE_SENSITIVE, 0, NULL, 0, ctx); |
1553 | if (unlikely(err)) { | 1553 | if (unlikely(err)) { |
1554 | if (err == -ENOENT) | 1554 | if (err == -ENOENT) |
1555 | err = -EIO; | 1555 | err = -EIO; |
1556 | goto err_out; | 1556 | goto err_out; |
1557 | } | 1557 | } |
1558 | a = ctx->attr; | 1558 | a = ctx->attr; |
1559 | BUG_ON(!a->non_resident); | 1559 | BUG_ON(!a->non_resident); |
1560 | write_lock_irqsave(&ni->size_lock, flags); | 1560 | write_lock_irqsave(&ni->size_lock, flags); |
1561 | BUG_ON(end > ni->allocated_size); | 1561 | BUG_ON(end > ni->allocated_size); |
1562 | ni->initialized_size = end; | 1562 | ni->initialized_size = end; |
1563 | a->data.non_resident.initialized_size = cpu_to_sle64(end); | 1563 | a->data.non_resident.initialized_size = cpu_to_sle64(end); |
1564 | if (end > i_size_read(vi)) { | 1564 | if (end > i_size_read(vi)) { |
1565 | i_size_write(vi, end); | 1565 | i_size_write(vi, end); |
1566 | a->data.non_resident.data_size = | 1566 | a->data.non_resident.data_size = |
1567 | a->data.non_resident.initialized_size; | 1567 | a->data.non_resident.initialized_size; |
1568 | } | 1568 | } |
1569 | write_unlock_irqrestore(&ni->size_lock, flags); | 1569 | write_unlock_irqrestore(&ni->size_lock, flags); |
1570 | /* Mark the mft record dirty, so it gets written back. */ | 1570 | /* Mark the mft record dirty, so it gets written back. */ |
1571 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 1571 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
1572 | mark_mft_record_dirty(ctx->ntfs_ino); | 1572 | mark_mft_record_dirty(ctx->ntfs_ino); |
1573 | ntfs_attr_put_search_ctx(ctx); | 1573 | ntfs_attr_put_search_ctx(ctx); |
1574 | unmap_mft_record(base_ni); | 1574 | unmap_mft_record(base_ni); |
1575 | ntfs_debug("Done."); | 1575 | ntfs_debug("Done."); |
1576 | return 0; | 1576 | return 0; |
1577 | err_out: | 1577 | err_out: |
1578 | if (ctx) | 1578 | if (ctx) |
1579 | ntfs_attr_put_search_ctx(ctx); | 1579 | ntfs_attr_put_search_ctx(ctx); |
1580 | if (m) | 1580 | if (m) |
1581 | unmap_mft_record(base_ni); | 1581 | unmap_mft_record(base_ni); |
1582 | ntfs_error(vi->i_sb, "Failed to update initialized_size/i_size (error " | 1582 | ntfs_error(vi->i_sb, "Failed to update initialized_size/i_size (error " |
1583 | "code %i).", err); | 1583 | "code %i).", err); |
1584 | if (err != -ENOMEM) | 1584 | if (err != -ENOMEM) |
1585 | NVolSetErrors(ni->vol); | 1585 | NVolSetErrors(ni->vol); |
1586 | return err; | 1586 | return err; |
1587 | } | 1587 | } |
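For orientation, the helpers in this part of the file combine roughly as sketched below in the buffered-write path. This is a simplified, hypothetical fragment rather than a verbatim excerpt from ntfs_file_buffered_write(), and it assumes the pages have already been grabbed, locked and prepared. Because ntfs_copy_from_user() zero-fills whatever it could not copy, the pages remain fully initialized after a fault and can still be committed; the write then just reports the shorter count.

static ssize_t example_write_chunk(struct page **pages, unsigned nr_pages,
                unsigned ofs, const char __user *buf, size_t bytes, s64 pos)
{
        size_t copied;
        int err;

        copied = ntfs_copy_from_user(pages, nr_pages, ofs, buf, bytes);
        ntfs_flush_dcache_pages(pages, nr_pages);
        err = ntfs_commit_pages_after_write(pages, nr_pages, pos, bytes);
        if (err < 0)
                return err;
        return copied;  /* may be less than bytes if a fault occurred */
}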
1588 | 1588 | ||
1589 | /** | 1589 | /** |
1590 | * ntfs_commit_pages_after_write - commit the received data | 1590 | * ntfs_commit_pages_after_write - commit the received data |
1591 | * @pages: array of destination pages | 1591 | * @pages: array of destination pages |
1592 | * @nr_pages: number of pages in @pages | 1592 | * @nr_pages: number of pages in @pages |
1593 | * @pos: byte position in file at which the write begins | 1593 | * @pos: byte position in file at which the write begins |
1594 | * @bytes: number of bytes to be written | 1594 | * @bytes: number of bytes to be written |
1595 | * | 1595 | * |
1596 | * This is called from ntfs_file_buffered_write() with i_mutex held on the inode | 1596 | * This is called from ntfs_file_buffered_write() with i_mutex held on the inode |
1597 | * (@pages[0]->mapping->host). There are @nr_pages pages in @pages which are | 1597 | * (@pages[0]->mapping->host). There are @nr_pages pages in @pages which are |
1598 | * locked but not kmap()ped. The source data has already been copied into the | 1598 | * locked but not kmap()ped. The source data has already been copied into the |
1599 | @pages. ntfs_prepare_pages_for_non_resident_write() has been called before | 1599 | @pages. ntfs_prepare_pages_for_non_resident_write() has been called before |
1600 | * the data was copied (for non-resident attributes only) and it returned | 1600 | * the data was copied (for non-resident attributes only) and it returned |
1601 | * success. | 1601 | * success. |
1602 | * | 1602 | * |
1603 | * Need to set uptodate and mark dirty all buffers within the boundary of the | 1603 | * Need to set uptodate and mark dirty all buffers within the boundary of the |
1604 | * write. If all buffers in a page are uptodate we set the page uptodate, too. | 1604 | * write. If all buffers in a page are uptodate we set the page uptodate, too. |
1605 | * | 1605 | * |
1606 | * Setting the buffers dirty ensures that they get written out later when | 1606 | * Setting the buffers dirty ensures that they get written out later when |
1607 | * ntfs_writepage() is invoked by the VM. | 1607 | * ntfs_writepage() is invoked by the VM. |
1608 | * | 1608 | * |
1609 | * Finally, we need to update i_size and initialized_size as appropriate both | 1609 | * Finally, we need to update i_size and initialized_size as appropriate both |
1610 | * in the inode and the mft record. | 1610 | * in the inode and the mft record. |
1611 | * | 1611 | * |
1612 | * This is modelled after fs/buffer.c::generic_commit_write(), which marks | 1612 | * This is modelled after fs/buffer.c::generic_commit_write(), which marks |
1613 | * buffers uptodate and dirty, sets the page uptodate if all buffers in the | 1613 | * buffers uptodate and dirty, sets the page uptodate if all buffers in the |
1614 | * page are uptodate, and updates i_size if the end of io is beyond i_size. In | 1614 | * page are uptodate, and updates i_size if the end of io is beyond i_size. In |
1615 | * that case, it also marks the inode dirty. | 1615 | * that case, it also marks the inode dirty. |
1616 | * | 1616 | * |
1617 | * If things have gone as outlined in | 1617 | * If things have gone as outlined in |
1618 | * ntfs_prepare_pages_for_non_resident_write(), we do not need to do any page | 1618 | * ntfs_prepare_pages_for_non_resident_write(), we do not need to do any page |
1619 | * content modifications here for non-resident attributes. For resident | 1619 | * content modifications here for non-resident attributes. For resident |
1620 | * attributes we need to bring the page uptodate here, which we combine with | 1620 | * attributes we need to bring the page uptodate here, which we combine with |
1621 | * copying into the mft record, thereby saving one atomic kmap. | 1621 | * copying into the mft record, thereby saving one atomic kmap. |
1622 | * | 1622 | * |
1623 | * Return 0 on success or -errno on error. | 1623 | * Return 0 on success or -errno on error. |
1624 | */ | 1624 | */ |
1625 | static int ntfs_commit_pages_after_write(struct page **pages, | 1625 | static int ntfs_commit_pages_after_write(struct page **pages, |
1626 | const unsigned nr_pages, s64 pos, size_t bytes) | 1626 | const unsigned nr_pages, s64 pos, size_t bytes) |
1627 | { | 1627 | { |
1628 | s64 end, initialized_size; | 1628 | s64 end, initialized_size; |
1629 | loff_t i_size; | 1629 | loff_t i_size; |
1630 | struct inode *vi; | 1630 | struct inode *vi; |
1631 | ntfs_inode *ni, *base_ni; | 1631 | ntfs_inode *ni, *base_ni; |
1632 | struct page *page; | 1632 | struct page *page; |
1633 | ntfs_attr_search_ctx *ctx; | 1633 | ntfs_attr_search_ctx *ctx; |
1634 | MFT_RECORD *m; | 1634 | MFT_RECORD *m; |
1635 | ATTR_RECORD *a; | 1635 | ATTR_RECORD *a; |
1636 | char *kattr, *kaddr; | 1636 | char *kattr, *kaddr; |
1637 | unsigned long flags; | 1637 | unsigned long flags; |
1638 | u32 attr_len; | 1638 | u32 attr_len; |
1639 | int err; | 1639 | int err; |
1640 | 1640 | ||
1641 | BUG_ON(!nr_pages); | 1641 | BUG_ON(!nr_pages); |
1642 | BUG_ON(!pages); | 1642 | BUG_ON(!pages); |
1643 | page = pages[0]; | 1643 | page = pages[0]; |
1644 | BUG_ON(!page); | 1644 | BUG_ON(!page); |
1645 | vi = page->mapping->host; | 1645 | vi = page->mapping->host; |
1646 | ni = NTFS_I(vi); | 1646 | ni = NTFS_I(vi); |
1647 | ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, start page " | 1647 | ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, start page " |
1648 | "index 0x%lx, nr_pages 0x%x, pos 0x%llx, bytes 0x%zx.", | 1648 | "index 0x%lx, nr_pages 0x%x, pos 0x%llx, bytes 0x%zx.", |
1649 | vi->i_ino, ni->type, page->index, nr_pages, | 1649 | vi->i_ino, ni->type, page->index, nr_pages, |
1650 | (long long)pos, bytes); | 1650 | (long long)pos, bytes); |
1651 | if (NInoNonResident(ni)) | 1651 | if (NInoNonResident(ni)) |
1652 | return ntfs_commit_pages_after_non_resident_write(pages, | 1652 | return ntfs_commit_pages_after_non_resident_write(pages, |
1653 | nr_pages, pos, bytes); | 1653 | nr_pages, pos, bytes); |
1654 | BUG_ON(nr_pages > 1); | 1654 | BUG_ON(nr_pages > 1); |
1655 | /* | 1655 | /* |
1656 | * Attribute is resident, implying it is not compressed, encrypted, or | 1656 | * Attribute is resident, implying it is not compressed, encrypted, or |
1657 | * sparse. | 1657 | * sparse. |
1658 | */ | 1658 | */ |
1659 | if (!NInoAttr(ni)) | 1659 | if (!NInoAttr(ni)) |
1660 | base_ni = ni; | 1660 | base_ni = ni; |
1661 | else | 1661 | else |
1662 | base_ni = ni->ext.base_ntfs_ino; | 1662 | base_ni = ni->ext.base_ntfs_ino; |
1663 | BUG_ON(NInoNonResident(ni)); | 1663 | BUG_ON(NInoNonResident(ni)); |
1664 | /* Map, pin, and lock the mft record. */ | 1664 | /* Map, pin, and lock the mft record. */ |
1665 | m = map_mft_record(base_ni); | 1665 | m = map_mft_record(base_ni); |
1666 | if (IS_ERR(m)) { | 1666 | if (IS_ERR(m)) { |
1667 | err = PTR_ERR(m); | 1667 | err = PTR_ERR(m); |
1668 | m = NULL; | 1668 | m = NULL; |
1669 | ctx = NULL; | 1669 | ctx = NULL; |
1670 | goto err_out; | 1670 | goto err_out; |
1671 | } | 1671 | } |
1672 | ctx = ntfs_attr_get_search_ctx(base_ni, m); | 1672 | ctx = ntfs_attr_get_search_ctx(base_ni, m); |
1673 | if (unlikely(!ctx)) { | 1673 | if (unlikely(!ctx)) { |
1674 | err = -ENOMEM; | 1674 | err = -ENOMEM; |
1675 | goto err_out; | 1675 | goto err_out; |
1676 | } | 1676 | } |
1677 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, | 1677 | err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, |
1678 | CASE_SENSITIVE, 0, NULL, 0, ctx); | 1678 | CASE_SENSITIVE, 0, NULL, 0, ctx); |
1679 | if (unlikely(err)) { | 1679 | if (unlikely(err)) { |
1680 | if (err == -ENOENT) | 1680 | if (err == -ENOENT) |
1681 | err = -EIO; | 1681 | err = -EIO; |
1682 | goto err_out; | 1682 | goto err_out; |
1683 | } | 1683 | } |
1684 | a = ctx->attr; | 1684 | a = ctx->attr; |
1685 | BUG_ON(a->non_resident); | 1685 | BUG_ON(a->non_resident); |
1686 | /* The total length of the attribute value. */ | 1686 | /* The total length of the attribute value. */ |
1687 | attr_len = le32_to_cpu(a->data.resident.value_length); | 1687 | attr_len = le32_to_cpu(a->data.resident.value_length); |
1688 | i_size = i_size_read(vi); | 1688 | i_size = i_size_read(vi); |
1689 | BUG_ON(attr_len != i_size); | 1689 | BUG_ON(attr_len != i_size); |
1690 | BUG_ON(pos > attr_len); | 1690 | BUG_ON(pos > attr_len); |
1691 | end = pos + bytes; | 1691 | end = pos + bytes; |
1692 | BUG_ON(end > le32_to_cpu(a->length) - | 1692 | BUG_ON(end > le32_to_cpu(a->length) - |
1693 | le16_to_cpu(a->data.resident.value_offset)); | 1693 | le16_to_cpu(a->data.resident.value_offset)); |
1694 | kattr = (u8*)a + le16_to_cpu(a->data.resident.value_offset); | 1694 | kattr = (u8*)a + le16_to_cpu(a->data.resident.value_offset); |
1695 | kaddr = kmap_atomic(page); | 1695 | kaddr = kmap_atomic(page); |
1696 | /* Copy the received data from the page to the mft record. */ | 1696 | /* Copy the received data from the page to the mft record. */ |
1697 | memcpy(kattr + pos, kaddr + pos, bytes); | 1697 | memcpy(kattr + pos, kaddr + pos, bytes); |
1698 | /* Update the attribute length if necessary. */ | 1698 | /* Update the attribute length if necessary. */ |
1699 | if (end > attr_len) { | 1699 | if (end > attr_len) { |
1700 | attr_len = end; | 1700 | attr_len = end; |
1701 | a->data.resident.value_length = cpu_to_le32(attr_len); | 1701 | a->data.resident.value_length = cpu_to_le32(attr_len); |
1702 | } | 1702 | } |
1703 | /* | 1703 | /* |
1704 | * If the page is not uptodate, bring the out of bounds area(s) | 1704 | * If the page is not uptodate, bring the out of bounds area(s) |
1705 | * uptodate by copying data from the mft record to the page. | 1705 | * uptodate by copying data from the mft record to the page. |
1706 | */ | 1706 | */ |
1707 | if (!PageUptodate(page)) { | 1707 | if (!PageUptodate(page)) { |
1708 | if (pos > 0) | 1708 | if (pos > 0) |
1709 | memcpy(kaddr, kattr, pos); | 1709 | memcpy(kaddr, kattr, pos); |
1710 | if (end < attr_len) | 1710 | if (end < attr_len) |
1711 | memcpy(kaddr + end, kattr + end, attr_len - end); | 1711 | memcpy(kaddr + end, kattr + end, attr_len - end); |
1712 | /* Zero the region outside the end of the attribute value. */ | 1712 | /* Zero the region outside the end of the attribute value. */ |
1713 | memset(kaddr + attr_len, 0, PAGE_CACHE_SIZE - attr_len); | 1713 | memset(kaddr + attr_len, 0, PAGE_CACHE_SIZE - attr_len); |
1714 | flush_dcache_page(page); | 1714 | flush_dcache_page(page); |
1715 | SetPageUptodate(page); | 1715 | SetPageUptodate(page); |
1716 | } | 1716 | } |
1717 | kunmap_atomic(kaddr); | 1717 | kunmap_atomic(kaddr); |
1718 | /* Update initialized_size/i_size if necessary. */ | 1718 | /* Update initialized_size/i_size if necessary. */ |
1719 | read_lock_irqsave(&ni->size_lock, flags); | 1719 | read_lock_irqsave(&ni->size_lock, flags); |
1720 | initialized_size = ni->initialized_size; | 1720 | initialized_size = ni->initialized_size; |
1721 | BUG_ON(end > ni->allocated_size); | 1721 | BUG_ON(end > ni->allocated_size); |
1722 | read_unlock_irqrestore(&ni->size_lock, flags); | 1722 | read_unlock_irqrestore(&ni->size_lock, flags); |
1723 | BUG_ON(initialized_size != i_size); | 1723 | BUG_ON(initialized_size != i_size); |
1724 | if (end > initialized_size) { | 1724 | if (end > initialized_size) { |
1725 | write_lock_irqsave(&ni->size_lock, flags); | 1725 | write_lock_irqsave(&ni->size_lock, flags); |
1726 | ni->initialized_size = end; | 1726 | ni->initialized_size = end; |
1727 | i_size_write(vi, end); | 1727 | i_size_write(vi, end); |
1728 | write_unlock_irqrestore(&ni->size_lock, flags); | 1728 | write_unlock_irqrestore(&ni->size_lock, flags); |
1729 | } | 1729 | } |
1730 | /* Mark the mft record dirty, so it gets written back. */ | 1730 | /* Mark the mft record dirty, so it gets written back. */ |
1731 | flush_dcache_mft_record_page(ctx->ntfs_ino); | 1731 | flush_dcache_mft_record_page(ctx->ntfs_ino); |
1732 | mark_mft_record_dirty(ctx->ntfs_ino); | 1732 | mark_mft_record_dirty(ctx->ntfs_ino); |
1733 | ntfs_attr_put_search_ctx(ctx); | 1733 | ntfs_attr_put_search_ctx(ctx); |
1734 | unmap_mft_record(base_ni); | 1734 | unmap_mft_record(base_ni); |
1735 | ntfs_debug("Done."); | 1735 | ntfs_debug("Done."); |
1736 | return 0; | 1736 | return 0; |
1737 | err_out: | 1737 | err_out: |
1738 | if (err == -ENOMEM) { | 1738 | if (err == -ENOMEM) { |
1739 | ntfs_warning(vi->i_sb, "Error allocating memory required to " | 1739 | ntfs_warning(vi->i_sb, "Error allocating memory required to " |
1740 | "commit the write."); | 1740 | "commit the write."); |
1741 | if (PageUptodate(page)) { | 1741 | if (PageUptodate(page)) { |
1742 | ntfs_warning(vi->i_sb, "Page is uptodate, setting " | 1742 | ntfs_warning(vi->i_sb, "Page is uptodate, setting " |
1743 | "dirty so the write will be retried " | 1743 | "dirty so the write will be retried " |
1744 | "later on by the VM."); | 1744 | "later on by the VM."); |
1745 | /* | 1745 | /* |
1746 | * Put the page on mapping->dirty_pages, but leave its | 1746 | * Put the page on mapping->dirty_pages, but leave its |
1747 | * buffers' dirty state as-is. | 1747 | * buffers' dirty state as-is. |
1748 | */ | 1748 | */ |
1749 | __set_page_dirty_nobuffers(page); | 1749 | __set_page_dirty_nobuffers(page); |
1750 | err = 0; | 1750 | err = 0; |
1751 | } else | 1751 | } else |
1752 | ntfs_error(vi->i_sb, "Page is not uptodate. Written " | 1752 | ntfs_error(vi->i_sb, "Page is not uptodate. Written " |
1753 | "data has been lost."); | 1753 | "data has been lost."); |
1754 | } else { | 1754 | } else { |
1755 | ntfs_error(vi->i_sb, "Resident attribute commit write failed " | 1755 | ntfs_error(vi->i_sb, "Resident attribute commit write failed " |
1756 | "with error %i.", err); | 1756 | "with error %i.", err); |
1757 | NVolSetErrors(ni->vol); | 1757 | NVolSetErrors(ni->vol); |
1758 | } | 1758 | } |
1759 | if (ctx) | 1759 | if (ctx) |
1760 | ntfs_attr_put_search_ctx(ctx); | 1760 | ntfs_attr_put_search_ctx(ctx); |
1761 | if (m) | 1761 | if (m) |
1762 | unmap_mft_record(base_ni); | 1762 | unmap_mft_record(base_ni); |
1763 | return err; | 1763 | return err; |
1764 | } | 1764 | } |
1765 | 1765 | ||
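The kernel-doc above describes the buffer handling used for the non-resident case (mark every buffer inside the written range uptodate and dirty, and set the page uptodate once all of its buffers are), but the helper that does it, ntfs_commit_pages_after_non_resident_write(), lies outside this hunk. As a rough orientation only, a sketch of that pattern, loosely modelled on the fs/buffer.c::generic_commit_write() style the comment cites; the function name and simplifications are illustrative, not NTFS code:

#include <linux/buffer_head.h>
#include <linux/mm.h>

/*
 * Illustrative sketch only: walk the buffers of one page, mark those
 * covering [from, to) uptodate and dirty, and set the page uptodate
 * once no buffer outside the written range is left stale.
 */
static void example_commit_write_range(struct page *page, unsigned from,
		unsigned to)
{
	struct buffer_head *bh, *head;
	unsigned block_start = 0, block_end;
	int partial = 0;

	bh = head = page_buffers(page);
	do {
		block_end = block_start + bh->b_size;
		if (block_end <= from || block_start >= to) {
			/* Buffer lies outside the write. */
			if (!buffer_uptodate(bh))
				partial = 1;
		} else {
			set_buffer_uptodate(bh);
			mark_buffer_dirty(bh);	/* written out later by the VM */
		}
		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);
	if (!partial)
		SetPageUptodate(page);
}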
1766 | static void ntfs_write_failed(struct address_space *mapping, loff_t to) | 1766 | static void ntfs_write_failed(struct address_space *mapping, loff_t to) |
1767 | { | 1767 | { |
1768 | struct inode *inode = mapping->host; | 1768 | struct inode *inode = mapping->host; |
1769 | 1769 | ||
1770 | if (to > inode->i_size) { | 1770 | if (to > inode->i_size) { |
1771 | truncate_pagecache(inode, inode->i_size); | 1771 | truncate_pagecache(inode, inode->i_size); |
1772 | ntfs_truncate_vfs(inode); | 1772 | ntfs_truncate_vfs(inode); |
1773 | } | 1773 | } |
1774 | } | 1774 | } |
1775 | 1775 | ||
1776 | /** | 1776 | /** |
1777 | * ntfs_file_buffered_write - | 1777 | * ntfs_file_buffered_write - |
1778 | * | 1778 | * |
1779 | * Locking: The vfs is holding ->i_mutex on the inode. | 1779 | * Locking: The vfs is holding ->i_mutex on the inode. |
1780 | */ | 1780 | */ |
1781 | static ssize_t ntfs_file_buffered_write(struct kiocb *iocb, | 1781 | static ssize_t ntfs_file_buffered_write(struct kiocb *iocb, |
1782 | const struct iovec *iov, unsigned long nr_segs, | 1782 | const struct iovec *iov, unsigned long nr_segs, |
1783 | loff_t pos, loff_t *ppos, size_t count) | 1783 | loff_t pos, loff_t *ppos, size_t count) |
1784 | { | 1784 | { |
1785 | struct file *file = iocb->ki_filp; | 1785 | struct file *file = iocb->ki_filp; |
1786 | struct address_space *mapping = file->f_mapping; | 1786 | struct address_space *mapping = file->f_mapping; |
1787 | struct inode *vi = mapping->host; | 1787 | struct inode *vi = mapping->host; |
1788 | ntfs_inode *ni = NTFS_I(vi); | 1788 | ntfs_inode *ni = NTFS_I(vi); |
1789 | ntfs_volume *vol = ni->vol; | 1789 | ntfs_volume *vol = ni->vol; |
1790 | struct page *pages[NTFS_MAX_PAGES_PER_CLUSTER]; | 1790 | struct page *pages[NTFS_MAX_PAGES_PER_CLUSTER]; |
1791 | struct page *cached_page = NULL; | 1791 | struct page *cached_page = NULL; |
1792 | char __user *buf = NULL; | 1792 | char __user *buf = NULL; |
1793 | s64 end, ll; | 1793 | s64 end, ll; |
1794 | VCN last_vcn; | 1794 | VCN last_vcn; |
1795 | LCN lcn; | 1795 | LCN lcn; |
1796 | unsigned long flags; | 1796 | unsigned long flags; |
1797 | size_t bytes, iov_ofs = 0; /* Offset in the current iovec. */ | 1797 | size_t bytes, iov_ofs = 0; /* Offset in the current iovec. */ |
1798 | ssize_t status, written; | 1798 | ssize_t status, written; |
1799 | unsigned nr_pages; | 1799 | unsigned nr_pages; |
1800 | int err; | 1800 | int err; |
1801 | 1801 | ||
1802 | ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, " | 1802 | ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, " |
1803 | "pos 0x%llx, count 0x%lx.", | 1803 | "pos 0x%llx, count 0x%lx.", |
1804 | vi->i_ino, (unsigned)le32_to_cpu(ni->type), | 1804 | vi->i_ino, (unsigned)le32_to_cpu(ni->type), |
1805 | (unsigned long long)pos, (unsigned long)count); | 1805 | (unsigned long long)pos, (unsigned long)count); |
1806 | if (unlikely(!count)) | 1806 | if (unlikely(!count)) |
1807 | return 0; | 1807 | return 0; |
1808 | BUG_ON(NInoMstProtected(ni)); | 1808 | BUG_ON(NInoMstProtected(ni)); |
1809 | /* | 1809 | /* |
1810 | * If the attribute is not an index root and it is encrypted or | 1810 | * If the attribute is not an index root and it is encrypted or |
1811 | * compressed, we cannot write to it yet. Note we need to check for | 1811 | * compressed, we cannot write to it yet. Note we need to check for |
1812 | * AT_INDEX_ALLOCATION since this is the type of both directory and | 1812 | * AT_INDEX_ALLOCATION since this is the type of both directory and |
1813 | * index inodes. | 1813 | * index inodes. |
1814 | */ | 1814 | */ |
1815 | if (ni->type != AT_INDEX_ALLOCATION) { | 1815 | if (ni->type != AT_INDEX_ALLOCATION) { |
1816 | /* If file is encrypted, deny access, just like NT4. */ | 1816 | /* If file is encrypted, deny access, just like NT4. */ |
1817 | if (NInoEncrypted(ni)) { | 1817 | if (NInoEncrypted(ni)) { |
1818 | /* | 1818 | /* |
1819 | * Reminder for later: Encrypted files are _always_ | 1819 | * Reminder for later: Encrypted files are _always_ |
1820 | * non-resident so that the content can always be | 1820 | * non-resident so that the content can always be |
1821 | * encrypted. | 1821 | * encrypted. |
1822 | */ | 1822 | */ |
1823 | ntfs_debug("Denying write access to encrypted file."); | 1823 | ntfs_debug("Denying write access to encrypted file."); |
1824 | return -EACCES; | 1824 | return -EACCES; |
1825 | } | 1825 | } |
1826 | if (NInoCompressed(ni)) { | 1826 | if (NInoCompressed(ni)) { |
1827 | /* Only unnamed $DATA attribute can be compressed. */ | 1827 | /* Only unnamed $DATA attribute can be compressed. */ |
1828 | BUG_ON(ni->type != AT_DATA); | 1828 | BUG_ON(ni->type != AT_DATA); |
1829 | BUG_ON(ni->name_len); | 1829 | BUG_ON(ni->name_len); |
1830 | /* | 1830 | /* |
1831 | * Reminder for later: If resident, the data is not | 1831 | * Reminder for later: If resident, the data is not |
1832 | * actually compressed. Only on the switch to non- | 1832 | * actually compressed. Only on the switch to non- |
1833 | * resident does compression kick in. This is in | 1833 | * resident does compression kick in. This is in |
1834 | * contrast to encrypted files (see above). | 1834 | * contrast to encrypted files (see above). |
1835 | */ | 1835 | */ |
1836 | ntfs_error(vi->i_sb, "Writing to compressed files is " | 1836 | ntfs_error(vi->i_sb, "Writing to compressed files is " |
1837 | "not implemented yet. Sorry."); | 1837 | "not implemented yet. Sorry."); |
1838 | return -EOPNOTSUPP; | 1838 | return -EOPNOTSUPP; |
1839 | } | 1839 | } |
1840 | } | 1840 | } |
1841 | /* | 1841 | /* |
1842 | * If a previous ntfs_truncate() failed, repeat it and abort if it | 1842 | * If a previous ntfs_truncate() failed, repeat it and abort if it |
1843 | * fails again. | 1843 | * fails again. |
1844 | */ | 1844 | */ |
1845 | if (unlikely(NInoTruncateFailed(ni))) { | 1845 | if (unlikely(NInoTruncateFailed(ni))) { |
1846 | inode_dio_wait(vi); | 1846 | inode_dio_wait(vi); |
1847 | err = ntfs_truncate(vi); | 1847 | err = ntfs_truncate(vi); |
1848 | if (err || NInoTruncateFailed(ni)) { | 1848 | if (err || NInoTruncateFailed(ni)) { |
1849 | if (!err) | 1849 | if (!err) |
1850 | err = -EIO; | 1850 | err = -EIO; |
1851 | ntfs_error(vol->sb, "Cannot perform write to inode " | 1851 | ntfs_error(vol->sb, "Cannot perform write to inode " |
1852 | "0x%lx, attribute type 0x%x, because " | 1852 | "0x%lx, attribute type 0x%x, because " |
1853 | "ntfs_truncate() failed (error code " | 1853 | "ntfs_truncate() failed (error code " |
1854 | "%i).", vi->i_ino, | 1854 | "%i).", vi->i_ino, |
1855 | (unsigned)le32_to_cpu(ni->type), err); | 1855 | (unsigned)le32_to_cpu(ni->type), err); |
1856 | return err; | 1856 | return err; |
1857 | } | 1857 | } |
1858 | } | 1858 | } |
1859 | /* The first byte after the write. */ | 1859 | /* The first byte after the write. */ |
1860 | end = pos + count; | 1860 | end = pos + count; |
1861 | /* | 1861 | /* |
1862 | * If the write goes beyond the allocated size, extend the allocation | 1862 | * If the write goes beyond the allocated size, extend the allocation |
1863 | * to cover the whole of the write, rounded up to the nearest cluster. | 1863 | * to cover the whole of the write, rounded up to the nearest cluster. |
1864 | */ | 1864 | */ |
1865 | read_lock_irqsave(&ni->size_lock, flags); | 1865 | read_lock_irqsave(&ni->size_lock, flags); |
1866 | ll = ni->allocated_size; | 1866 | ll = ni->allocated_size; |
1867 | read_unlock_irqrestore(&ni->size_lock, flags); | 1867 | read_unlock_irqrestore(&ni->size_lock, flags); |
1868 | if (end > ll) { | 1868 | if (end > ll) { |
1869 | /* Extend the allocation without changing the data size. */ | 1869 | /* Extend the allocation without changing the data size. */ |
1870 | ll = ntfs_attr_extend_allocation(ni, end, -1, pos); | 1870 | ll = ntfs_attr_extend_allocation(ni, end, -1, pos); |
1871 | if (likely(ll >= 0)) { | 1871 | if (likely(ll >= 0)) { |
1872 | BUG_ON(pos >= ll); | 1872 | BUG_ON(pos >= ll); |
1873 | /* If the extension was partial truncate the write. */ | 1873 | /* If the extension was partial truncate the write. */ |
1874 | if (end > ll) { | 1874 | if (end > ll) { |
1875 | ntfs_debug("Truncating write to inode 0x%lx, " | 1875 | ntfs_debug("Truncating write to inode 0x%lx, " |
1876 | "attribute type 0x%x, because " | 1876 | "attribute type 0x%x, because " |
1877 | "the allocation was only " | 1877 | "the allocation was only " |
1878 | "partially extended.", | 1878 | "partially extended.", |
1879 | vi->i_ino, (unsigned) | 1879 | vi->i_ino, (unsigned) |
1880 | le32_to_cpu(ni->type)); | 1880 | le32_to_cpu(ni->type)); |
1881 | end = ll; | 1881 | end = ll; |
1882 | count = ll - pos; | 1882 | count = ll - pos; |
1883 | } | 1883 | } |
1884 | } else { | 1884 | } else { |
1885 | err = ll; | 1885 | err = ll; |
1886 | read_lock_irqsave(&ni->size_lock, flags); | 1886 | read_lock_irqsave(&ni->size_lock, flags); |
1887 | ll = ni->allocated_size; | 1887 | ll = ni->allocated_size; |
1888 | read_unlock_irqrestore(&ni->size_lock, flags); | 1888 | read_unlock_irqrestore(&ni->size_lock, flags); |
1889 | /* Perform a partial write if possible or fail. */ | 1889 | /* Perform a partial write if possible or fail. */ |
1890 | if (pos < ll) { | 1890 | if (pos < ll) { |
1891 | ntfs_debug("Truncating write to inode 0x%lx, " | 1891 | ntfs_debug("Truncating write to inode 0x%lx, " |
1892 | "attribute type 0x%x, because " | 1892 | "attribute type 0x%x, because " |
1893 | "extending the allocation " | 1893 | "extending the allocation " |
1894 | "failed (error code %i).", | 1894 | "failed (error code %i).", |
1895 | vi->i_ino, (unsigned) | 1895 | vi->i_ino, (unsigned) |
1896 | le32_to_cpu(ni->type), err); | 1896 | le32_to_cpu(ni->type), err); |
1897 | end = ll; | 1897 | end = ll; |
1898 | count = ll - pos; | 1898 | count = ll - pos; |
1899 | } else { | 1899 | } else { |
1900 | ntfs_error(vol->sb, "Cannot perform write to " | 1900 | ntfs_error(vol->sb, "Cannot perform write to " |
1901 | "inode 0x%lx, attribute type " | 1901 | "inode 0x%lx, attribute type " |
1902 | "0x%x, because extending the " | 1902 | "0x%x, because extending the " |
1903 | "allocation failed (error " | 1903 | "allocation failed (error " |
1904 | "code %i).", vi->i_ino, | 1904 | "code %i).", vi->i_ino, |
1905 | (unsigned) | 1905 | (unsigned) |
1906 | le32_to_cpu(ni->type), err); | 1906 | le32_to_cpu(ni->type), err); |
1907 | return err; | 1907 | return err; |
1908 | } | 1908 | } |
1909 | } | 1909 | } |
1910 | } | 1910 | } |
1911 | written = 0; | 1911 | written = 0; |
1912 | /* | 1912 | /* |
1913 | * If the write starts beyond the initialized size, extend it up to the | 1913 | * If the write starts beyond the initialized size, extend it up to the |
1914 | * beginning of the write and initialize all non-sparse space between | 1914 | * beginning of the write and initialize all non-sparse space between |
1915 | * the old initialized size and the new one. This automatically also | 1915 | * the old initialized size and the new one. This automatically also |
1916 | * increments the vfs inode->i_size to keep it above or equal to the | 1916 | * increments the vfs inode->i_size to keep it above or equal to the |
1917 | * initialized_size. | 1917 | * initialized_size. |
1918 | */ | 1918 | */ |
1919 | read_lock_irqsave(&ni->size_lock, flags); | 1919 | read_lock_irqsave(&ni->size_lock, flags); |
1920 | ll = ni->initialized_size; | 1920 | ll = ni->initialized_size; |
1921 | read_unlock_irqrestore(&ni->size_lock, flags); | 1921 | read_unlock_irqrestore(&ni->size_lock, flags); |
1922 | if (pos > ll) { | 1922 | if (pos > ll) { |
1923 | err = ntfs_attr_extend_initialized(ni, pos); | 1923 | err = ntfs_attr_extend_initialized(ni, pos); |
1924 | if (err < 0) { | 1924 | if (err < 0) { |
1925 | ntfs_error(vol->sb, "Cannot perform write to inode " | 1925 | ntfs_error(vol->sb, "Cannot perform write to inode " |
1926 | "0x%lx, attribute type 0x%x, because " | 1926 | "0x%lx, attribute type 0x%x, because " |
1927 | "extending the initialized size " | 1927 | "extending the initialized size " |
1928 | "failed (error code %i).", vi->i_ino, | 1928 | "failed (error code %i).", vi->i_ino, |
1929 | (unsigned)le32_to_cpu(ni->type), err); | 1929 | (unsigned)le32_to_cpu(ni->type), err); |
1930 | status = err; | 1930 | status = err; |
1931 | goto err_out; | 1931 | goto err_out; |
1932 | } | 1932 | } |
1933 | } | 1933 | } |
1934 | /* | 1934 | /* |
1935 | * Determine the number of pages per cluster for non-resident | 1935 | * Determine the number of pages per cluster for non-resident |
1936 | * attributes. | 1936 | * attributes. |
1937 | */ | 1937 | */ |
1938 | nr_pages = 1; | 1938 | nr_pages = 1; |
1939 | if (vol->cluster_size > PAGE_CACHE_SIZE && NInoNonResident(ni)) | 1939 | if (vol->cluster_size > PAGE_CACHE_SIZE && NInoNonResident(ni)) |
1940 | nr_pages = vol->cluster_size >> PAGE_CACHE_SHIFT; | 1940 | nr_pages = vol->cluster_size >> PAGE_CACHE_SHIFT; |
1941 | /* Finally, perform the actual write. */ | 1941 | /* Finally, perform the actual write. */ |
1942 | last_vcn = -1; | 1942 | last_vcn = -1; |
1943 | if (likely(nr_segs == 1)) | 1943 | if (likely(nr_segs == 1)) |
1944 | buf = iov->iov_base; | 1944 | buf = iov->iov_base; |
1945 | do { | 1945 | do { |
1946 | VCN vcn; | 1946 | VCN vcn; |
1947 | pgoff_t idx, start_idx; | 1947 | pgoff_t idx, start_idx; |
1948 | unsigned ofs, do_pages, u; | 1948 | unsigned ofs, do_pages, u; |
1949 | size_t copied; | 1949 | size_t copied; |
1950 | 1950 | ||
1951 | start_idx = idx = pos >> PAGE_CACHE_SHIFT; | 1951 | start_idx = idx = pos >> PAGE_CACHE_SHIFT; |
1952 | ofs = pos & ~PAGE_CACHE_MASK; | 1952 | ofs = pos & ~PAGE_CACHE_MASK; |
1953 | bytes = PAGE_CACHE_SIZE - ofs; | 1953 | bytes = PAGE_CACHE_SIZE - ofs; |
1954 | do_pages = 1; | 1954 | do_pages = 1; |
1955 | if (nr_pages > 1) { | 1955 | if (nr_pages > 1) { |
1956 | vcn = pos >> vol->cluster_size_bits; | 1956 | vcn = pos >> vol->cluster_size_bits; |
1957 | if (vcn != last_vcn) { | 1957 | if (vcn != last_vcn) { |
1958 | last_vcn = vcn; | 1958 | last_vcn = vcn; |
1959 | /* | 1959 | /* |
1960 | * Get the lcn of the vcn the write is in. If | 1960 | * Get the lcn of the vcn the write is in. If |
1961 | * it is a hole, need to lock down all pages in | 1961 | * it is a hole, need to lock down all pages in |
1962 | * the cluster. | 1962 | * the cluster. |
1963 | */ | 1963 | */ |
1964 | down_read(&ni->runlist.lock); | 1964 | down_read(&ni->runlist.lock); |
1965 | lcn = ntfs_attr_vcn_to_lcn_nolock(ni, pos >> | 1965 | lcn = ntfs_attr_vcn_to_lcn_nolock(ni, pos >> |
1966 | vol->cluster_size_bits, false); | 1966 | vol->cluster_size_bits, false); |
1967 | up_read(&ni->runlist.lock); | 1967 | up_read(&ni->runlist.lock); |
1968 | if (unlikely(lcn < LCN_HOLE)) { | 1968 | if (unlikely(lcn < LCN_HOLE)) { |
1969 | status = -EIO; | 1969 | status = -EIO; |
1970 | if (lcn == LCN_ENOMEM) | 1970 | if (lcn == LCN_ENOMEM) |
1971 | status = -ENOMEM; | 1971 | status = -ENOMEM; |
1972 | else | 1972 | else |
1973 | ntfs_error(vol->sb, "Cannot " | 1973 | ntfs_error(vol->sb, "Cannot " |
1974 | "perform write to " | 1974 | "perform write to " |
1975 | "inode 0x%lx, " | 1975 | "inode 0x%lx, " |
1976 | "attribute type 0x%x, " | 1976 | "attribute type 0x%x, " |
1977 | "because the attribute " | 1977 | "because the attribute " |
1978 | "is corrupt.", | 1978 | "is corrupt.", |
1979 | vi->i_ino, (unsigned) | 1979 | vi->i_ino, (unsigned) |
1980 | le32_to_cpu(ni->type)); | 1980 | le32_to_cpu(ni->type)); |
1981 | break; | 1981 | break; |
1982 | } | 1982 | } |
1983 | if (lcn == LCN_HOLE) { | 1983 | if (lcn == LCN_HOLE) { |
1984 | start_idx = (pos & ~(s64) | 1984 | start_idx = (pos & ~(s64) |
1985 | vol->cluster_size_mask) | 1985 | vol->cluster_size_mask) |
1986 | >> PAGE_CACHE_SHIFT; | 1986 | >> PAGE_CACHE_SHIFT; |
1987 | bytes = vol->cluster_size - (pos & | 1987 | bytes = vol->cluster_size - (pos & |
1988 | vol->cluster_size_mask); | 1988 | vol->cluster_size_mask); |
1989 | do_pages = nr_pages; | 1989 | do_pages = nr_pages; |
1990 | } | 1990 | } |
1991 | } | 1991 | } |
1992 | } | 1992 | } |
1993 | if (bytes > count) | 1993 | if (bytes > count) |
1994 | bytes = count; | 1994 | bytes = count; |
1995 | /* | 1995 | /* |
1996 | * Bring in the user page(s) that we will copy from _first_. | 1996 | * Bring in the user page(s) that we will copy from _first_. |
1997 | * Otherwise there is a nasty deadlock on copying from the same | 1997 | * Otherwise there is a nasty deadlock on copying from the same |
1998 | * page(s) as we are writing to, without it/them being marked | 1998 | * page(s) as we are writing to, without it/them being marked |
1999 | * up-to-date. Note, at present there is nothing to stop the | 1999 | * up-to-date. Note, at present there is nothing to stop the |
2000 | * pages being swapped out between us bringing them into memory | 2000 | * pages being swapped out between us bringing them into memory |
2001 | * and doing the actual copying. | 2001 | * and doing the actual copying. |
2002 | */ | 2002 | */ |
2003 | if (likely(nr_segs == 1)) | 2003 | if (likely(nr_segs == 1)) |
2004 | ntfs_fault_in_pages_readable(buf, bytes); | 2004 | ntfs_fault_in_pages_readable(buf, bytes); |
2005 | else | 2005 | else |
2006 | ntfs_fault_in_pages_readable_iovec(iov, iov_ofs, bytes); | 2006 | ntfs_fault_in_pages_readable_iovec(iov, iov_ofs, bytes); |
2007 | /* Get and lock @do_pages starting at index @start_idx. */ | 2007 | /* Get and lock @do_pages starting at index @start_idx. */ |
2008 | status = __ntfs_grab_cache_pages(mapping, start_idx, do_pages, | 2008 | status = __ntfs_grab_cache_pages(mapping, start_idx, do_pages, |
2009 | pages, &cached_page); | 2009 | pages, &cached_page); |
2010 | if (unlikely(status)) | 2010 | if (unlikely(status)) |
2011 | break; | 2011 | break; |
2012 | /* | 2012 | /* |
2013 | * For non-resident attributes, we need to fill any holes with | 2013 | * For non-resident attributes, we need to fill any holes with |
2014 | * actual clusters and ensure all buffers are mapped. We also | 2014 | * actual clusters and ensure all buffers are mapped. We also |
2015 | * need to bring uptodate any buffers that are only partially | 2015 | * need to bring uptodate any buffers that are only partially |
2016 | * being written to. | 2016 | * being written to. |
2017 | */ | 2017 | */ |
2018 | if (NInoNonResident(ni)) { | 2018 | if (NInoNonResident(ni)) { |
2019 | status = ntfs_prepare_pages_for_non_resident_write( | 2019 | status = ntfs_prepare_pages_for_non_resident_write( |
2020 | pages, do_pages, pos, bytes); | 2020 | pages, do_pages, pos, bytes); |
2021 | if (unlikely(status)) { | 2021 | if (unlikely(status)) { |
2022 | loff_t i_size; | 2022 | loff_t i_size; |
2023 | 2023 | ||
2024 | do { | 2024 | do { |
2025 | unlock_page(pages[--do_pages]); | 2025 | unlock_page(pages[--do_pages]); |
2026 | page_cache_release(pages[do_pages]); | 2026 | page_cache_release(pages[do_pages]); |
2027 | } while (do_pages); | 2027 | } while (do_pages); |
2028 | /* | 2028 | /* |
2029 | * The write preparation may have instantiated | 2029 | * The write preparation may have instantiated |
2030 | * allocated space outside i_size. Trim this | 2030 | * allocated space outside i_size. Trim this |
2031 | * off again. We can ignore any errors in this | 2031 | * off again. We can ignore any errors in this |
2032 | * case as we will just be wasting a bit of | 2032 | * case as we will just be wasting a bit of |
2033 | * allocated space, which is not a disaster. | 2033 | * allocated space, which is not a disaster. |
2034 | */ | 2034 | */ |
2035 | i_size = i_size_read(vi); | 2035 | i_size = i_size_read(vi); |
2036 | if (pos + bytes > i_size) { | 2036 | if (pos + bytes > i_size) { |
2037 | ntfs_write_failed(mapping, pos + bytes); | 2037 | ntfs_write_failed(mapping, pos + bytes); |
2038 | } | 2038 | } |
2039 | break; | 2039 | break; |
2040 | } | 2040 | } |
2041 | } | 2041 | } |
2042 | u = (pos >> PAGE_CACHE_SHIFT) - pages[0]->index; | 2042 | u = (pos >> PAGE_CACHE_SHIFT) - pages[0]->index; |
2043 | if (likely(nr_segs == 1)) { | 2043 | if (likely(nr_segs == 1)) { |
2044 | copied = ntfs_copy_from_user(pages + u, do_pages - u, | 2044 | copied = ntfs_copy_from_user(pages + u, do_pages - u, |
2045 | ofs, buf, bytes); | 2045 | ofs, buf, bytes); |
2046 | buf += copied; | 2046 | buf += copied; |
2047 | } else | 2047 | } else |
2048 | copied = ntfs_copy_from_user_iovec(pages + u, | 2048 | copied = ntfs_copy_from_user_iovec(pages + u, |
2049 | do_pages - u, ofs, &iov, &iov_ofs, | 2049 | do_pages - u, ofs, &iov, &iov_ofs, |
2050 | bytes); | 2050 | bytes); |
2051 | ntfs_flush_dcache_pages(pages + u, do_pages - u); | 2051 | ntfs_flush_dcache_pages(pages + u, do_pages - u); |
2052 | status = ntfs_commit_pages_after_write(pages, do_pages, pos, | 2052 | status = ntfs_commit_pages_after_write(pages, do_pages, pos, |
2053 | bytes); | 2053 | bytes); |
2054 | if (likely(!status)) { | 2054 | if (likely(!status)) { |
2055 | written += copied; | 2055 | written += copied; |
2056 | count -= copied; | 2056 | count -= copied; |
2057 | pos += copied; | 2057 | pos += copied; |
2058 | if (unlikely(copied != bytes)) | 2058 | if (unlikely(copied != bytes)) |
2059 | status = -EFAULT; | 2059 | status = -EFAULT; |
2060 | } | 2060 | } |
2061 | do { | 2061 | do { |
2062 | unlock_page(pages[--do_pages]); | 2062 | unlock_page(pages[--do_pages]); |
2063 | mark_page_accessed(pages[do_pages]); | ||
2064 | page_cache_release(pages[do_pages]); | 2063 | page_cache_release(pages[do_pages]); |
2065 | } while (do_pages); | 2064 | } while (do_pages); |
2066 | if (unlikely(status)) | 2065 | if (unlikely(status)) |
2067 | break; | 2066 | break; |
2068 | balance_dirty_pages_ratelimited(mapping); | 2067 | balance_dirty_pages_ratelimited(mapping); |
2069 | cond_resched(); | 2068 | cond_resched(); |
2070 | } while (count); | 2069 | } while (count); |
2071 | err_out: | 2070 | err_out: |
2072 | *ppos = pos; | 2071 | *ppos = pos; |
2073 | if (cached_page) | 2072 | if (cached_page) |
2074 | page_cache_release(cached_page); | 2073 | page_cache_release(cached_page); |
2075 | ntfs_debug("Done. Returning %s (written 0x%lx, status %li).", | 2074 | ntfs_debug("Done. Returning %s (written 0x%lx, status %li).", |
2076 | written ? "written" : "status", (unsigned long)written, | 2075 | written ? "written" : "status", (unsigned long)written, |
2077 | (long)status); | 2076 | (long)status); |
2078 | return written ? written : status; | 2077 | return written ? written : status; |
2079 | } | 2078 | } |
2080 | 2079 | ||
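The only functional change in this hunk is the removal of mark_page_accessed() from the page-release loop above (old line 2063): the pages handed back by the page-cache grab path are now expected to arrive with their accessed state already initialised, so the extra atomic flag update after unlock_page() becomes redundant. Below is a minimal sketch of a single-page buffered write following that model; the function name and the use of grab_cache_page_write_begin() are illustrative, not the NTFS helper itself:

#include <linux/pagemap.h>
#include <linux/highmem.h>

/* Illustrative only: write @len bytes at offset @ofs into page @index. */
static int example_write_one_page(struct address_space *mapping,
		pgoff_t index, const char *src, unsigned ofs, unsigned len)
{
	struct page *page;
	char *kaddr;

	/* Returns a locked page; accessed state is handled at grab time. */
	page = grab_cache_page_write_begin(mapping, index, 0);
	if (!page)
		return -ENOMEM;
	kaddr = kmap_atomic(page);
	memcpy(kaddr + ofs, src, len);
	kunmap_atomic(kaddr);
	flush_dcache_page(page);
	set_page_dirty(page);		/* schedule for writeback */
	unlock_page(page);
	/* No mark_page_accessed() needed here any more. */
	page_cache_release(page);
	return 0;
}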
2081 | /** | 2080 | /** |
2082 | * ntfs_file_aio_write_nolock - | 2081 | * ntfs_file_aio_write_nolock - |
2083 | */ | 2082 | */ |
2084 | static ssize_t ntfs_file_aio_write_nolock(struct kiocb *iocb, | 2083 | static ssize_t ntfs_file_aio_write_nolock(struct kiocb *iocb, |
2085 | const struct iovec *iov, unsigned long nr_segs, loff_t *ppos) | 2084 | const struct iovec *iov, unsigned long nr_segs, loff_t *ppos) |
2086 | { | 2085 | { |
2087 | struct file *file = iocb->ki_filp; | 2086 | struct file *file = iocb->ki_filp; |
2088 | struct address_space *mapping = file->f_mapping; | 2087 | struct address_space *mapping = file->f_mapping; |
2089 | struct inode *inode = mapping->host; | 2088 | struct inode *inode = mapping->host; |
2090 | loff_t pos; | 2089 | loff_t pos; |
2091 | size_t count; /* after file limit checks */ | 2090 | size_t count; /* after file limit checks */ |
2092 | ssize_t written, err; | 2091 | ssize_t written, err; |
2093 | 2092 | ||
2094 | count = 0; | 2093 | count = 0; |
2095 | err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ); | 2094 | err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ); |
2096 | if (err) | 2095 | if (err) |
2097 | return err; | 2096 | return err; |
2098 | pos = *ppos; | 2097 | pos = *ppos; |
2099 | /* We can write back this queue in page reclaim. */ | 2098 | /* We can write back this queue in page reclaim. */ |
2100 | current->backing_dev_info = mapping->backing_dev_info; | 2099 | current->backing_dev_info = mapping->backing_dev_info; |
2101 | written = 0; | 2100 | written = 0; |
2102 | err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); | 2101 | err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); |
2103 | if (err) | 2102 | if (err) |
2104 | goto out; | 2103 | goto out; |
2105 | if (!count) | 2104 | if (!count) |
2106 | goto out; | 2105 | goto out; |
2107 | err = file_remove_suid(file); | 2106 | err = file_remove_suid(file); |
2108 | if (err) | 2107 | if (err) |
2109 | goto out; | 2108 | goto out; |
2110 | err = file_update_time(file); | 2109 | err = file_update_time(file); |
2111 | if (err) | 2110 | if (err) |
2112 | goto out; | 2111 | goto out; |
2113 | written = ntfs_file_buffered_write(iocb, iov, nr_segs, pos, ppos, | 2112 | written = ntfs_file_buffered_write(iocb, iov, nr_segs, pos, ppos, |
2114 | count); | 2113 | count); |
2115 | out: | 2114 | out: |
2116 | current->backing_dev_info = NULL; | 2115 | current->backing_dev_info = NULL; |
2117 | return written ? written : err; | 2116 | return written ? written : err; |
2118 | } | 2117 | } |
2119 | 2118 | ||
2120 | /** | 2119 | /** |
2121 | * ntfs_file_aio_write - | 2120 | * ntfs_file_aio_write - |
2122 | */ | 2121 | */ |
2123 | static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov, | 2122 | static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov, |
2124 | unsigned long nr_segs, loff_t pos) | 2123 | unsigned long nr_segs, loff_t pos) |
2125 | { | 2124 | { |
2126 | struct file *file = iocb->ki_filp; | 2125 | struct file *file = iocb->ki_filp; |
2127 | struct address_space *mapping = file->f_mapping; | 2126 | struct address_space *mapping = file->f_mapping; |
2128 | struct inode *inode = mapping->host; | 2127 | struct inode *inode = mapping->host; |
2129 | ssize_t ret; | 2128 | ssize_t ret; |
2130 | 2129 | ||
2131 | BUG_ON(iocb->ki_pos != pos); | 2130 | BUG_ON(iocb->ki_pos != pos); |
2132 | 2131 | ||
2133 | mutex_lock(&inode->i_mutex); | 2132 | mutex_lock(&inode->i_mutex); |
2134 | ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos); | 2133 | ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos); |
2135 | mutex_unlock(&inode->i_mutex); | 2134 | mutex_unlock(&inode->i_mutex); |
2136 | if (ret > 0) { | 2135 | if (ret > 0) { |
2137 | int err = generic_write_sync(file, pos, ret); | 2136 | int err = generic_write_sync(file, pos, ret); |
2138 | if (err < 0) | 2137 | if (err < 0) |
2139 | ret = err; | 2138 | ret = err; |
2140 | } | 2139 | } |
2141 | return ret; | 2140 | return ret; |
2142 | } | 2141 | } |
2143 | 2142 | ||
2144 | /** | 2143 | /** |
2145 | * ntfs_file_fsync - sync a file to disk | 2144 | * ntfs_file_fsync - sync a file to disk |
2146 | * @filp: file to be synced | 2145 | * @filp: file to be synced |
2147 | * @datasync: if non-zero only flush user data and not metadata | 2146 | * @datasync: if non-zero only flush user data and not metadata |
2148 | * | 2147 | * |
2149 | * Data integrity sync of a file to disk. Used for fsync, fdatasync, and msync | 2148 | * Data integrity sync of a file to disk. Used for fsync, fdatasync, and msync |
2150 | * system calls. This function is inspired by fs/buffer.c::file_fsync(). | 2149 | * system calls. This function is inspired by fs/buffer.c::file_fsync(). |
2151 | * | 2150 | * |
2152 | * If @datasync is false, write the mft record and all associated extent mft | 2151 | * If @datasync is false, write the mft record and all associated extent mft |
2153 | * records as well as the $DATA attribute and then sync the block device. | 2152 | * records as well as the $DATA attribute and then sync the block device. |
2154 | * | 2153 | * |
2155 | * If @datasync is true and the attribute is non-resident, we skip the writing | 2154 | * If @datasync is true and the attribute is non-resident, we skip the writing |
2156 | * of the mft record and all associated extent mft records (this might still | 2155 | * of the mft record and all associated extent mft records (this might still |
2157 | * happen due to the write_inode_now() call). | 2156 | * happen due to the write_inode_now() call). |
2158 | * | 2157 | * |
2159 | * Also, if @datasync is true, we do not wait on the inode to be written out | 2158 | * Also, if @datasync is true, we do not wait on the inode to be written out |
2160 | * but we always wait on the page cache pages to be written out. | 2159 | * but we always wait on the page cache pages to be written out. |
2161 | * | 2160 | * |
2162 | * Locking: Caller must hold i_mutex on the inode. | 2161 | * Locking: Caller must hold i_mutex on the inode. |
2163 | * | 2162 | * |
2164 | * TODO: We should probably also write all attribute/index inodes associated | 2163 | * TODO: We should probably also write all attribute/index inodes associated |
2165 | * with this inode but since we have no simple way of getting to them we ignore | 2164 | * with this inode but since we have no simple way of getting to them we ignore |
2166 | * this problem for now. | 2165 | * this problem for now. |
2167 | */ | 2166 | */ |
2168 | static int ntfs_file_fsync(struct file *filp, loff_t start, loff_t end, | 2167 | static int ntfs_file_fsync(struct file *filp, loff_t start, loff_t end, |
2169 | int datasync) | 2168 | int datasync) |
2170 | { | 2169 | { |
2171 | struct inode *vi = filp->f_mapping->host; | 2170 | struct inode *vi = filp->f_mapping->host; |
2172 | int err, ret = 0; | 2171 | int err, ret = 0; |
2173 | 2172 | ||
2174 | ntfs_debug("Entering for inode 0x%lx.", vi->i_ino); | 2173 | ntfs_debug("Entering for inode 0x%lx.", vi->i_ino); |
2175 | 2174 | ||
2176 | err = filemap_write_and_wait_range(vi->i_mapping, start, end); | 2175 | err = filemap_write_and_wait_range(vi->i_mapping, start, end); |
2177 | if (err) | 2176 | if (err) |
2178 | return err; | 2177 | return err; |
2179 | mutex_lock(&vi->i_mutex); | 2178 | mutex_lock(&vi->i_mutex); |
2180 | 2179 | ||
2181 | BUG_ON(S_ISDIR(vi->i_mode)); | 2180 | BUG_ON(S_ISDIR(vi->i_mode)); |
2182 | if (!datasync || !NInoNonResident(NTFS_I(vi))) | 2181 | if (!datasync || !NInoNonResident(NTFS_I(vi))) |
2183 | ret = __ntfs_write_inode(vi, 1); | 2182 | ret = __ntfs_write_inode(vi, 1); |
2184 | write_inode_now(vi, !datasync); | 2183 | write_inode_now(vi, !datasync); |
2185 | /* | 2184 | /* |
2186 | * NOTE: If we were to use mapping->private_list (see ext2 and | 2185 | * NOTE: If we were to use mapping->private_list (see ext2 and |
2187 | * fs/buffer.c) for dirty blocks then we could optimize the below to be | 2186 | * fs/buffer.c) for dirty blocks then we could optimize the below to be |
2188 | * sync_mapping_buffers(vi->i_mapping). | 2187 | * sync_mapping_buffers(vi->i_mapping). |
2189 | */ | 2188 | */ |
2190 | err = sync_blockdev(vi->i_sb->s_bdev); | 2189 | err = sync_blockdev(vi->i_sb->s_bdev); |
2191 | if (unlikely(err && !ret)) | 2190 | if (unlikely(err && !ret)) |
2192 | ret = err; | 2191 | ret = err; |
2193 | if (likely(!ret)) | 2192 | if (likely(!ret)) |
2194 | ntfs_debug("Done."); | 2193 | ntfs_debug("Done."); |
2195 | else | 2194 | else |
2196 | ntfs_warning(vi->i_sb, "Failed to f%ssync inode 0x%lx. Error " | 2195 | ntfs_warning(vi->i_sb, "Failed to f%ssync inode 0x%lx. Error " |
2197 | "%u.", datasync ? "data" : "", vi->i_ino, -ret); | 2196 | "%u.", datasync ? "data" : "", vi->i_ino, -ret); |
2198 | mutex_unlock(&vi->i_mutex); | 2197 | mutex_unlock(&vi->i_mutex); |
2199 | return ret; | 2198 | return ret; |
2200 | } | 2199 | } |
2201 | 2200 | ||
2202 | #endif /* NTFS_RW */ | 2201 | #endif /* NTFS_RW */ |
2203 | 2202 | ||
2204 | const struct file_operations ntfs_file_ops = { | 2203 | const struct file_operations ntfs_file_ops = { |
2205 | .llseek = generic_file_llseek, /* Seek inside file. */ | 2204 | .llseek = generic_file_llseek, /* Seek inside file. */ |
2206 | .read = do_sync_read, /* Read from file. */ | 2205 | .read = do_sync_read, /* Read from file. */ |
2207 | .aio_read = generic_file_aio_read, /* Async read from file. */ | 2206 | .aio_read = generic_file_aio_read, /* Async read from file. */ |
2208 | #ifdef NTFS_RW | 2207 | #ifdef NTFS_RW |
2209 | .write = do_sync_write, /* Write to file. */ | 2208 | .write = do_sync_write, /* Write to file. */ |
2210 | .aio_write = ntfs_file_aio_write, /* Async write to file. */ | 2209 | .aio_write = ntfs_file_aio_write, /* Async write to file. */ |
2211 | /*.release = ,*/ /* Last file is closed. See | 2210 | /*.release = ,*/ /* Last file is closed. See |
2212 | fs/ext2/file.c:: | 2211 | fs/ext2/file.c:: |
2213 | ext2_release_file() for | 2212 | ext2_release_file() for |
2214 | how to use this to discard | 2213 | how to use this to discard |
2215 | preallocated space for | 2214 | preallocated space for |
2216 | write opened files. */ | 2215 | write opened files. */ |
2217 | .fsync = ntfs_file_fsync, /* Sync a file to disk. */ | 2216 | .fsync = ntfs_file_fsync, /* Sync a file to disk. */ |
2218 | /*.aio_fsync = ,*/ /* Sync all outstanding async | 2217 | /*.aio_fsync = ,*/ /* Sync all outstanding async |
2219 | i/o operations on a | 2218 | i/o operations on a |
2220 | kiocb. */ | 2219 | kiocb. */ |
2221 | #endif /* NTFS_RW */ | 2220 | #endif /* NTFS_RW */ |
2222 | /*.ioctl = ,*/ /* Perform function on the | 2221 | /*.ioctl = ,*/ /* Perform function on the |
2223 | mounted filesystem. */ | 2222 | mounted filesystem. */ |
2224 | .mmap = generic_file_mmap, /* Mmap file. */ | 2223 | .mmap = generic_file_mmap, /* Mmap file. */ |
2225 | .open = ntfs_file_open, /* Open file. */ | 2224 | .open = ntfs_file_open, /* Open file. */ |
2226 | .splice_read = generic_file_splice_read /* Zero-copy data send with | 2225 | .splice_read = generic_file_splice_read /* Zero-copy data send with |
2227 | the data source being on | 2226 | the data source being on |
2228 | the ntfs partition. We do | 2227 | the ntfs partition. We do |
2229 | not need to care about the | 2228 | not need to care about the |
2230 | data destination. */ | 2229 | data destination. */ |
2231 | /*.sendpage = ,*/ /* Zero-copy data send with | 2230 | /*.sendpage = ,*/ /* Zero-copy data send with |
2232 | the data destination being | 2231 | the data destination being |
2233 | on the ntfs partition. We | 2232 | on the ntfs partition. We |
2234 | do not need to care about | 2233 | do not need to care about |
2235 | the data source. */ | 2234 | the data source. */ |
2236 | }; | 2235 | }; |
2237 | 2236 | ||
2238 | const struct inode_operations ntfs_file_inode_ops = { | 2237 | const struct inode_operations ntfs_file_inode_ops = { |
2239 | #ifdef NTFS_RW | 2238 | #ifdef NTFS_RW |
2240 | .setattr = ntfs_setattr, | 2239 | .setattr = ntfs_setattr, |
2241 | #endif /* NTFS_RW */ | 2240 | #endif /* NTFS_RW */ |
2242 | }; | 2241 | }; |
2243 | 2242 | ||
2244 | const struct file_operations ntfs_empty_file_ops = {}; | 2243 | const struct file_operations ntfs_empty_file_ops = {}; |
2245 | 2244 | ||
2246 | const struct inode_operations ntfs_empty_inode_ops = {}; | 2245 | const struct inode_operations ntfs_empty_inode_ops = {}; |
2247 | 2246 |
include/linux/page-flags.h
1 | /* | 1 | /* |
2 | * Macros for manipulating and testing page->flags | 2 | * Macros for manipulating and testing page->flags |
3 | */ | 3 | */ |
4 | 4 | ||
5 | #ifndef PAGE_FLAGS_H | 5 | #ifndef PAGE_FLAGS_H |
6 | #define PAGE_FLAGS_H | 6 | #define PAGE_FLAGS_H |
7 | 7 | ||
8 | #include <linux/types.h> | 8 | #include <linux/types.h> |
9 | #include <linux/bug.h> | 9 | #include <linux/bug.h> |
10 | #include <linux/mmdebug.h> | 10 | #include <linux/mmdebug.h> |
11 | #ifndef __GENERATING_BOUNDS_H | 11 | #ifndef __GENERATING_BOUNDS_H |
12 | #include <linux/mm_types.h> | 12 | #include <linux/mm_types.h> |
13 | #include <generated/bounds.h> | 13 | #include <generated/bounds.h> |
14 | #endif /* !__GENERATING_BOUNDS_H */ | 14 | #endif /* !__GENERATING_BOUNDS_H */ |
15 | 15 | ||
16 | /* | 16 | /* |
17 | * Various page->flags bits: | 17 | * Various page->flags bits: |
18 | * | 18 | * |
19 | * PG_reserved is set for special pages, which can never be swapped out. Some | 19 | * PG_reserved is set for special pages, which can never be swapped out. Some |
20 | * of them might not even exist (eg empty_bad_page)... | 20 | * of them might not even exist (eg empty_bad_page)... |
21 | * | 21 | * |
22 | * The PG_private bitflag is set on pagecache pages if they contain filesystem | 22 | * The PG_private bitflag is set on pagecache pages if they contain filesystem |
23 | * specific data (which is normally at page->private). It can be used by | 23 | * specific data (which is normally at page->private). It can be used by |
24 | * private allocations for its own usage. | 24 | * private allocations for its own usage. |
25 | * | 25 | * |
26 | * During initiation of disk I/O, PG_locked is set. This bit is set before I/O | 26 | * During initiation of disk I/O, PG_locked is set. This bit is set before I/O |
27 | * and cleared when writeback _starts_ or when read _completes_. PG_writeback | 27 | * and cleared when writeback _starts_ or when read _completes_. PG_writeback |
28 | * is set before writeback starts and cleared when it finishes. | 28 | * is set before writeback starts and cleared when it finishes. |
29 | * | 29 | * |
30 | * PG_locked also pins a page in pagecache, and blocks truncation of the file | 30 | * PG_locked also pins a page in pagecache, and blocks truncation of the file |
31 | * while it is held. | 31 | * while it is held. |
32 | * | 32 | * |
33 | * page_waitqueue(page) is a wait queue of all tasks waiting for the page | 33 | * page_waitqueue(page) is a wait queue of all tasks waiting for the page |
34 | * to become unlocked. | 34 | * to become unlocked. |
35 | * | 35 | * |
36 | * PG_uptodate tells whether the page's contents is valid. When a read | 36 | * PG_uptodate tells whether the page's contents is valid. When a read |
37 | * completes, the page becomes uptodate, unless a disk I/O error happened. | 37 | * completes, the page becomes uptodate, unless a disk I/O error happened. |
38 | * | 38 | * |
39 | * PG_referenced, PG_reclaim are used for page reclaim for anonymous and | 39 | * PG_referenced, PG_reclaim are used for page reclaim for anonymous and |
40 | * file-backed pagecache (see mm/vmscan.c). | 40 | * file-backed pagecache (see mm/vmscan.c). |
41 | * | 41 | * |
42 | * PG_error is set to indicate that an I/O error occurred on this page. | 42 | * PG_error is set to indicate that an I/O error occurred on this page. |
43 | * | 43 | * |
44 | * PG_arch_1 is an architecture specific page state bit. The generic code | 44 | * PG_arch_1 is an architecture specific page state bit. The generic code |
45 | * guarantees that this bit is cleared for a page when it first is entered into | 45 | * guarantees that this bit is cleared for a page when it first is entered into |
46 | * the page cache. | 46 | * the page cache. |
47 | * | 47 | * |
48 | * PG_highmem pages are not permanently mapped into the kernel virtual address | 48 | * PG_highmem pages are not permanently mapped into the kernel virtual address |
49 | * space, they need to be kmapped separately for doing IO on the pages. The | 49 | * space, they need to be kmapped separately for doing IO on the pages. The |
50 | * struct page (these bits with information) are always mapped into kernel | 50 | * struct page (these bits with information) are always mapped into kernel |
51 | * address space... | 51 | * address space... |
52 | * | 52 | * |
53 | * PG_hwpoison indicates that a page got corrupted in hardware and contains | 53 | * PG_hwpoison indicates that a page got corrupted in hardware and contains |
54 | * data with incorrect ECC bits that triggered a machine check. Accessing is | 54 | * data with incorrect ECC bits that triggered a machine check. Accessing is |
55 | * not safe since it may cause another machine check. Don't touch! | 55 | * not safe since it may cause another machine check. Don't touch! |
56 | */ | 56 | */ |
57 | 57 | ||
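To make the lifecycle described above concrete, here is a small sketch of how a read-completion handler typically drives these bits; the handler itself is hypothetical, only the flag helpers are real:

/* Hypothetical completion handler showing the PG_locked / PG_uptodate /
 * PG_error lifecycle described in the comment above. */
static void example_read_end_io(struct page *page, int uptodate)
{
	if (uptodate)
		SetPageUptodate(page);	/* contents are now valid */
	else
		SetPageError(page);	/* record the I/O error */
	/* Clearing PG_locked marks the read as complete and wakes any
	 * tasks sleeping on page_waitqueue(page). */
	unlock_page(page);
}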
58 | /* | 58 | /* |
59 | * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break | 59 | * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break |
60 | * locked- and dirty-page accounting. | 60 | * locked- and dirty-page accounting. |
61 | * | 61 | * |
62 | * The page flags field is split into two parts, the main flags area | 62 | * The page flags field is split into two parts, the main flags area |
63 | * which extends from the low bits upwards, and the fields area which | 63 | * which extends from the low bits upwards, and the fields area which |
64 | * extends from the high bits downwards. | 64 | * extends from the high bits downwards. |
65 | * | 65 | * |
66 | * | FIELD | ... | FLAGS | | 66 | * | FIELD | ... | FLAGS | |
67 | * N-1 ^ 0 | 67 | * N-1 ^ 0 |
68 | * (NR_PAGEFLAGS) | 68 | * (NR_PAGEFLAGS) |
69 | * | 69 | * |
70 | * The fields area is reserved for fields mapping zone, node (for NUMA) and | 70 | * The fields area is reserved for fields mapping zone, node (for NUMA) and |
71 | * SPARSEMEM section (for variants of SPARSEMEM that require section ids like | 71 | * SPARSEMEM section (for variants of SPARSEMEM that require section ids like |
72 | * SPARSEMEM_EXTREME with !SPARSEMEM_VMEMMAP). | 72 | * SPARSEMEM_EXTREME with !SPARSEMEM_VMEMMAP). |
73 | */ | 73 | */ |
74 | enum pageflags { | 74 | enum pageflags { |
75 | PG_locked, /* Page is locked. Don't touch. */ | 75 | PG_locked, /* Page is locked. Don't touch. */ |
76 | PG_error, | 76 | PG_error, |
77 | PG_referenced, | 77 | PG_referenced, |
78 | PG_uptodate, | 78 | PG_uptodate, |
79 | PG_dirty, | 79 | PG_dirty, |
80 | PG_lru, | 80 | PG_lru, |
81 | PG_active, | 81 | PG_active, |
82 | PG_slab, | 82 | PG_slab, |
83 | PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ | 83 | PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ |
84 | PG_arch_1, | 84 | PG_arch_1, |
85 | PG_reserved, | 85 | PG_reserved, |
86 | PG_private, /* If pagecache, has fs-private data */ | 86 | PG_private, /* If pagecache, has fs-private data */ |
87 | PG_private_2, /* If pagecache, has fs aux data */ | 87 | PG_private_2, /* If pagecache, has fs aux data */ |
88 | PG_writeback, /* Page is under writeback */ | 88 | PG_writeback, /* Page is under writeback */ |
89 | #ifdef CONFIG_PAGEFLAGS_EXTENDED | 89 | #ifdef CONFIG_PAGEFLAGS_EXTENDED |
90 | PG_head, /* A head page */ | 90 | PG_head, /* A head page */ |
91 | PG_tail, /* A tail page */ | 91 | PG_tail, /* A tail page */ |
92 | #else | 92 | #else |
93 | PG_compound, /* A compound page */ | 93 | PG_compound, /* A compound page */ |
94 | #endif | 94 | #endif |
95 | PG_swapcache, /* Swap page: swp_entry_t in private */ | 95 | PG_swapcache, /* Swap page: swp_entry_t in private */ |
96 | PG_mappedtodisk, /* Has blocks allocated on-disk */ | 96 | PG_mappedtodisk, /* Has blocks allocated on-disk */ |
97 | PG_reclaim, /* To be reclaimed asap */ | 97 | PG_reclaim, /* To be reclaimed asap */ |
98 | PG_swapbacked, /* Page is backed by RAM/swap */ | 98 | PG_swapbacked, /* Page is backed by RAM/swap */ |
99 | PG_unevictable, /* Page is "unevictable" */ | 99 | PG_unevictable, /* Page is "unevictable" */ |
100 | #ifdef CONFIG_MMU | 100 | #ifdef CONFIG_MMU |
101 | PG_mlocked, /* Page is vma mlocked */ | 101 | PG_mlocked, /* Page is vma mlocked */ |
102 | #endif | 102 | #endif |
103 | #ifdef CONFIG_ARCH_USES_PG_UNCACHED | 103 | #ifdef CONFIG_ARCH_USES_PG_UNCACHED |
104 | PG_uncached, /* Page has been mapped as uncached */ | 104 | PG_uncached, /* Page has been mapped as uncached */ |
105 | #endif | 105 | #endif |
106 | #ifdef CONFIG_MEMORY_FAILURE | 106 | #ifdef CONFIG_MEMORY_FAILURE |
107 | PG_hwpoison, /* hardware poisoned page. Don't touch */ | 107 | PG_hwpoison, /* hardware poisoned page. Don't touch */ |
108 | #endif | 108 | #endif |
109 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE | 109 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE |
110 | PG_compound_lock, | 110 | PG_compound_lock, |
111 | #endif | 111 | #endif |
112 | __NR_PAGEFLAGS, | 112 | __NR_PAGEFLAGS, |
113 | 113 | ||
114 | /* Filesystems */ | 114 | /* Filesystems */ |
115 | PG_checked = PG_owner_priv_1, | 115 | PG_checked = PG_owner_priv_1, |
116 | 116 | ||
117 | /* Two page bits are conscripted by FS-Cache to maintain local caching | 117 | /* Two page bits are conscripted by FS-Cache to maintain local caching |
118 | * state. These bits are set on pages belonging to the netfs's inodes | 118 | * state. These bits are set on pages belonging to the netfs's inodes |
119 | * when those inodes are being locally cached. | 119 | * when those inodes are being locally cached. |
120 | */ | 120 | */ |
121 | PG_fscache = PG_private_2, /* page backed by cache */ | 121 | PG_fscache = PG_private_2, /* page backed by cache */ |
122 | 122 | ||
123 | /* XEN */ | 123 | /* XEN */ |
124 | PG_pinned = PG_owner_priv_1, | 124 | PG_pinned = PG_owner_priv_1, |
125 | PG_savepinned = PG_dirty, | 125 | PG_savepinned = PG_dirty, |
126 | 126 | ||
127 | /* SLOB */ | 127 | /* SLOB */ |
128 | PG_slob_free = PG_private, | 128 | PG_slob_free = PG_private, |
129 | }; | 129 | }; |
130 | 130 | ||
131 | #ifndef __GENERATING_BOUNDS_H | 131 | #ifndef __GENERATING_BOUNDS_H |
132 | 132 | ||
133 | /* | 133 | /* |
134 | * Macros to create function definitions for page flags | 134 | * Macros to create function definitions for page flags |
135 | */ | 135 | */ |
136 | #define TESTPAGEFLAG(uname, lname) \ | 136 | #define TESTPAGEFLAG(uname, lname) \ |
137 | static inline int Page##uname(const struct page *page) \ | 137 | static inline int Page##uname(const struct page *page) \ |
138 | { return test_bit(PG_##lname, &page->flags); } | 138 | { return test_bit(PG_##lname, &page->flags); } |
139 | 139 | ||
140 | #define SETPAGEFLAG(uname, lname) \ | 140 | #define SETPAGEFLAG(uname, lname) \ |
141 | static inline void SetPage##uname(struct page *page) \ | 141 | static inline void SetPage##uname(struct page *page) \ |
142 | { set_bit(PG_##lname, &page->flags); } | 142 | { set_bit(PG_##lname, &page->flags); } |
143 | 143 | ||
144 | #define CLEARPAGEFLAG(uname, lname) \ | 144 | #define CLEARPAGEFLAG(uname, lname) \ |
145 | static inline void ClearPage##uname(struct page *page) \ | 145 | static inline void ClearPage##uname(struct page *page) \ |
146 | { clear_bit(PG_##lname, &page->flags); } | 146 | { clear_bit(PG_##lname, &page->flags); } |
147 | 147 | ||
148 | #define __SETPAGEFLAG(uname, lname) \ | 148 | #define __SETPAGEFLAG(uname, lname) \ |
149 | static inline void __SetPage##uname(struct page *page) \ | 149 | static inline void __SetPage##uname(struct page *page) \ |
150 | { __set_bit(PG_##lname, &page->flags); } | 150 | { __set_bit(PG_##lname, &page->flags); } |
151 | 151 | ||
152 | #define __CLEARPAGEFLAG(uname, lname) \ | 152 | #define __CLEARPAGEFLAG(uname, lname) \ |
153 | static inline void __ClearPage##uname(struct page *page) \ | 153 | static inline void __ClearPage##uname(struct page *page) \ |
154 | { __clear_bit(PG_##lname, &page->flags); } | 154 | { __clear_bit(PG_##lname, &page->flags); } |
155 | 155 | ||
156 | #define TESTSETFLAG(uname, lname) \ | 156 | #define TESTSETFLAG(uname, lname) \ |
157 | static inline int TestSetPage##uname(struct page *page) \ | 157 | static inline int TestSetPage##uname(struct page *page) \ |
158 | { return test_and_set_bit(PG_##lname, &page->flags); } | 158 | { return test_and_set_bit(PG_##lname, &page->flags); } |
159 | 159 | ||
160 | #define TESTCLEARFLAG(uname, lname) \ | 160 | #define TESTCLEARFLAG(uname, lname) \ |
161 | static inline int TestClearPage##uname(struct page *page) \ | 161 | static inline int TestClearPage##uname(struct page *page) \ |
162 | { return test_and_clear_bit(PG_##lname, &page->flags); } | 162 | { return test_and_clear_bit(PG_##lname, &page->flags); } |
163 | 163 | ||
164 | #define __TESTCLEARFLAG(uname, lname) \ | 164 | #define __TESTCLEARFLAG(uname, lname) \ |
165 | static inline int __TestClearPage##uname(struct page *page) \ | 165 | static inline int __TestClearPage##uname(struct page *page) \ |
166 | { return __test_and_clear_bit(PG_##lname, &page->flags); } | 166 | { return __test_and_clear_bit(PG_##lname, &page->flags); } |
167 | 167 | ||
168 | #define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ | 168 | #define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ |
169 | SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname) | 169 | SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname) |
170 | 170 | ||
171 | #define __PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ | 171 | #define __PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ |
172 | __SETPAGEFLAG(uname, lname) __CLEARPAGEFLAG(uname, lname) | 172 | __SETPAGEFLAG(uname, lname) __CLEARPAGEFLAG(uname, lname) |
173 | 173 | ||
174 | #define PAGEFLAG_FALSE(uname) \ | 174 | #define PAGEFLAG_FALSE(uname) \ |
175 | static inline int Page##uname(const struct page *page) \ | 175 | static inline int Page##uname(const struct page *page) \ |
176 | { return 0; } | 176 | { return 0; } |
177 | 177 | ||
178 | #define TESTSCFLAG(uname, lname) \ | 178 | #define TESTSCFLAG(uname, lname) \ |
179 | TESTSETFLAG(uname, lname) TESTCLEARFLAG(uname, lname) | 179 | TESTSETFLAG(uname, lname) TESTCLEARFLAG(uname, lname) |
180 | 180 | ||
181 | #define SETPAGEFLAG_NOOP(uname) \ | 181 | #define SETPAGEFLAG_NOOP(uname) \ |
182 | static inline void SetPage##uname(struct page *page) { } | 182 | static inline void SetPage##uname(struct page *page) { } |
183 | 183 | ||
184 | #define CLEARPAGEFLAG_NOOP(uname) \ | 184 | #define CLEARPAGEFLAG_NOOP(uname) \ |
185 | static inline void ClearPage##uname(struct page *page) { } | 185 | static inline void ClearPage##uname(struct page *page) { } |
186 | 186 | ||
187 | #define __CLEARPAGEFLAG_NOOP(uname) \ | 187 | #define __CLEARPAGEFLAG_NOOP(uname) \ |
188 | static inline void __ClearPage##uname(struct page *page) { } | 188 | static inline void __ClearPage##uname(struct page *page) { } |
189 | 189 | ||
190 | #define TESTCLEARFLAG_FALSE(uname) \ | 190 | #define TESTCLEARFLAG_FALSE(uname) \ |
191 | static inline int TestClearPage##uname(struct page *page) { return 0; } | 191 | static inline int TestClearPage##uname(struct page *page) { return 0; } |
192 | 192 | ||
193 | #define __TESTCLEARFLAG_FALSE(uname) \ | 193 | #define __TESTCLEARFLAG_FALSE(uname) \ |
194 | static inline int __TestClearPage##uname(struct page *page) { return 0; } | 194 | static inline int __TestClearPage##uname(struct page *page) { return 0; } |
195 | 195 | ||
196 | struct page; /* forward declaration */ | 196 | struct page; /* forward declaration */ |
197 | 197 | ||
198 | TESTPAGEFLAG(Locked, locked) | 198 | TESTPAGEFLAG(Locked, locked) |
199 | PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error) | 199 | PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error) |
200 | PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced) | 200 | PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced) |
201 | __SETPAGEFLAG(Referenced, referenced) | ||
201 | PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty) | 202 | PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty) |
202 | PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru) | 203 | PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru) |
203 | PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active) | 204 | PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active) |
204 | TESTCLEARFLAG(Active, active) | 205 | TESTCLEARFLAG(Active, active) |
205 | __PAGEFLAG(Slab, slab) | 206 | __PAGEFLAG(Slab, slab) |
206 | PAGEFLAG(Checked, checked) /* Used by some filesystems */ | 207 | PAGEFLAG(Checked, checked) /* Used by some filesystems */ |
207 | PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */ | 208 | PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */ |
208 | PAGEFLAG(SavePinned, savepinned); /* Xen */ | 209 | PAGEFLAG(SavePinned, savepinned); /* Xen */ |
209 | PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved) | 210 | PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved) |
210 | PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) | 211 | PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) |
211 | __SETPAGEFLAG(SwapBacked, swapbacked) | 212 | __SETPAGEFLAG(SwapBacked, swapbacked) |
212 | 213 | ||
213 | __PAGEFLAG(SlobFree, slob_free) | 214 | __PAGEFLAG(SlobFree, slob_free) |
214 | 215 | ||
215 | /* | 216 | /* |
216 | * Private page markings that may be used by the filesystem that owns the page | 217 | * Private page markings that may be used by the filesystem that owns the page |
217 | * for its own purposes. | 218 | * for its own purposes. |
218 | * - PG_private and PG_private_2 cause releasepage() and co to be invoked | 219 | * - PG_private and PG_private_2 cause releasepage() and co to be invoked |
219 | */ | 220 | */ |
220 | PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private) | 221 | PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private) |
221 | __CLEARPAGEFLAG(Private, private) | 222 | __CLEARPAGEFLAG(Private, private) |
222 | PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2) | 223 | PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2) |
223 | PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1) | 224 | PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1) |
224 | 225 | ||
225 | /* | 226 | /* |
226 | * Only test-and-set exist for PG_writeback. The unconditional operators are | 227 | * Only test-and-set exist for PG_writeback. The unconditional operators are |
227 | * risky: they bypass page accounting. | 228 | * risky: they bypass page accounting. |
228 | */ | 229 | */ |
229 | TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) | 230 | TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) |
230 | PAGEFLAG(MappedToDisk, mappedtodisk) | 231 | PAGEFLAG(MappedToDisk, mappedtodisk) |
231 | 232 | ||
232 | /* PG_readahead is only used for reads; PG_reclaim is only for writes */ | 233 | /* PG_readahead is only used for reads; PG_reclaim is only for writes */ |
233 | PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim) | 234 | PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim) |
234 | PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim) | 235 | PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim) |
235 | 236 | ||
236 | #ifdef CONFIG_HIGHMEM | 237 | #ifdef CONFIG_HIGHMEM |
237 | /* | 238 | /* |
238 | * Must use a macro here due to header dependency issues. page_zone() is not | 239 | * Must use a macro here due to header dependency issues. page_zone() is not |
239 | * available at this point. | 240 | * available at this point. |
240 | */ | 241 | */ |
241 | #define PageHighMem(__p) is_highmem(page_zone(__p)) | 242 | #define PageHighMem(__p) is_highmem(page_zone(__p)) |
242 | #else | 243 | #else |
243 | PAGEFLAG_FALSE(HighMem) | 244 | PAGEFLAG_FALSE(HighMem) |
244 | #endif | 245 | #endif |
245 | 246 | ||
246 | #ifdef CONFIG_SWAP | 247 | #ifdef CONFIG_SWAP |
247 | PAGEFLAG(SwapCache, swapcache) | 248 | PAGEFLAG(SwapCache, swapcache) |
248 | #else | 249 | #else |
249 | PAGEFLAG_FALSE(SwapCache) | 250 | PAGEFLAG_FALSE(SwapCache) |
250 | SETPAGEFLAG_NOOP(SwapCache) CLEARPAGEFLAG_NOOP(SwapCache) | 251 | SETPAGEFLAG_NOOP(SwapCache) CLEARPAGEFLAG_NOOP(SwapCache) |
251 | #endif | 252 | #endif |
252 | 253 | ||
253 | PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable) | 254 | PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable) |
254 | TESTCLEARFLAG(Unevictable, unevictable) | 255 | TESTCLEARFLAG(Unevictable, unevictable) |
255 | 256 | ||
256 | #ifdef CONFIG_MMU | 257 | #ifdef CONFIG_MMU |
257 | PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked) | 258 | PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked) |
258 | TESTSCFLAG(Mlocked, mlocked) __TESTCLEARFLAG(Mlocked, mlocked) | 259 | TESTSCFLAG(Mlocked, mlocked) __TESTCLEARFLAG(Mlocked, mlocked) |
259 | #else | 260 | #else |
260 | PAGEFLAG_FALSE(Mlocked) SETPAGEFLAG_NOOP(Mlocked) | 261 | PAGEFLAG_FALSE(Mlocked) SETPAGEFLAG_NOOP(Mlocked) |
261 | TESTCLEARFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked) | 262 | TESTCLEARFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked) |
262 | #endif | 263 | #endif |
263 | 264 | ||
264 | #ifdef CONFIG_ARCH_USES_PG_UNCACHED | 265 | #ifdef CONFIG_ARCH_USES_PG_UNCACHED |
265 | PAGEFLAG(Uncached, uncached) | 266 | PAGEFLAG(Uncached, uncached) |
266 | #else | 267 | #else |
267 | PAGEFLAG_FALSE(Uncached) | 268 | PAGEFLAG_FALSE(Uncached) |
268 | #endif | 269 | #endif |
269 | 270 | ||
270 | #ifdef CONFIG_MEMORY_FAILURE | 271 | #ifdef CONFIG_MEMORY_FAILURE |
271 | PAGEFLAG(HWPoison, hwpoison) | 272 | PAGEFLAG(HWPoison, hwpoison) |
272 | TESTSCFLAG(HWPoison, hwpoison) | 273 | TESTSCFLAG(HWPoison, hwpoison) |
273 | #define __PG_HWPOISON (1UL << PG_hwpoison) | 274 | #define __PG_HWPOISON (1UL << PG_hwpoison) |
274 | #else | 275 | #else |
275 | PAGEFLAG_FALSE(HWPoison) | 276 | PAGEFLAG_FALSE(HWPoison) |
276 | #define __PG_HWPOISON 0 | 277 | #define __PG_HWPOISON 0 |
277 | #endif | 278 | #endif |
278 | 279 | ||
279 | u64 stable_page_flags(struct page *page); | 280 | u64 stable_page_flags(struct page *page); |
280 | 281 | ||
281 | static inline int PageUptodate(struct page *page) | 282 | static inline int PageUptodate(struct page *page) |
282 | { | 283 | { |
283 | int ret = test_bit(PG_uptodate, &(page)->flags); | 284 | int ret = test_bit(PG_uptodate, &(page)->flags); |
284 | 285 | ||
285 | /* | 286 | /* |
286 | * Must ensure that the data we read out of the page is loaded | 287 | * Must ensure that the data we read out of the page is loaded |
287 | * _after_ we've loaded page->flags to check for PageUptodate. | 288 | * _after_ we've loaded page->flags to check for PageUptodate. |
288 | * We can skip the barrier if the page is not uptodate, because | 289 | * We can skip the barrier if the page is not uptodate, because |
289 | * we wouldn't be reading anything from it. | 290 | * we wouldn't be reading anything from it. |
290 | * | 291 | * |
291 | * See SetPageUptodate() for the other side of the story. | 292 | * See SetPageUptodate() for the other side of the story. |
292 | */ | 293 | */ |
293 | if (ret) | 294 | if (ret) |
294 | smp_rmb(); | 295 | smp_rmb(); |
295 | 296 | ||
296 | return ret; | 297 | return ret; |
297 | } | 298 | } |
298 | 299 | ||
299 | static inline void __SetPageUptodate(struct page *page) | 300 | static inline void __SetPageUptodate(struct page *page) |
300 | { | 301 | { |
301 | smp_wmb(); | 302 | smp_wmb(); |
302 | __set_bit(PG_uptodate, &(page)->flags); | 303 | __set_bit(PG_uptodate, &(page)->flags); |
303 | } | 304 | } |
304 | 305 | ||
305 | static inline void SetPageUptodate(struct page *page) | 306 | static inline void SetPageUptodate(struct page *page) |
306 | { | 307 | { |
307 | /* | 308 | /* |
308 | * Memory barrier must be issued before setting the PG_uptodate bit, | 309 | * Memory barrier must be issued before setting the PG_uptodate bit, |
309 | * so that all previous stores issued in order to bring the page | 310 | * so that all previous stores issued in order to bring the page |
310 | * uptodate are actually visible before PageUptodate becomes true. | 311 | * uptodate are actually visible before PageUptodate becomes true. |
311 | */ | 312 | */ |
312 | smp_wmb(); | 313 | smp_wmb(); |
313 | set_bit(PG_uptodate, &(page)->flags); | 314 | set_bit(PG_uptodate, &(page)->flags); |
314 | } | 315 | } |
315 | 316 | ||
316 | CLEARPAGEFLAG(Uptodate, uptodate) | 317 | CLEARPAGEFLAG(Uptodate, uptodate) |
317 | 318 | ||
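The two Uptodate helpers above form a publish/consume pair: SetPageUptodate() issues smp_wmb() before setting PG_uptodate, and PageUptodate() issues smp_rmb() after observing the bit, so any reader that sees the flag also sees the stores that made the page uptodate. A minimal sketch of a caller relying on that ordering (the function names and the memcpy-based fill are illustrative, not part of this patch):

	#include <linux/errno.h>
	#include <linux/highmem.h>
	#include <linux/page-flags.h>
	#include <linux/string.h>

	/* Writer: fill the page, then publish it as uptodate. */
	static void fill_and_publish(struct page *page, const void *src, size_t len)
	{
		void *dst = kmap_atomic(page);

		memcpy(dst, src, len);		/* stores that bring the page uptodate */
		kunmap_atomic(dst);
		SetPageUptodate(page);		/* smp_wmb(), then atomic set_bit()    */
	}

	/* Reader: only touch the data once PageUptodate() has been observed. */
	static int read_if_uptodate(struct page *page, void *buf, size_t len)
	{
		void *src;

		if (!PageUptodate(page))	/* test_bit(), then smp_rmb() if set   */
			return -EIO;
		src = kmap_atomic(page);
		memcpy(buf, src, len);		/* guaranteed to see the writer's data */
		kunmap_atomic(src);
		return 0;
	}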
318 | extern void cancel_dirty_page(struct page *page, unsigned int account_size); | 319 | extern void cancel_dirty_page(struct page *page, unsigned int account_size); |
319 | 320 | ||
320 | int test_clear_page_writeback(struct page *page); | 321 | int test_clear_page_writeback(struct page *page); |
321 | int __test_set_page_writeback(struct page *page, bool keep_write); | 322 | int __test_set_page_writeback(struct page *page, bool keep_write); |
322 | 323 | ||
323 | #define test_set_page_writeback(page) \ | 324 | #define test_set_page_writeback(page) \ |
324 | __test_set_page_writeback(page, false) | 325 | __test_set_page_writeback(page, false) |
325 | #define test_set_page_writeback_keepwrite(page) \ | 326 | #define test_set_page_writeback_keepwrite(page) \ |
326 | __test_set_page_writeback(page, true) | 327 | __test_set_page_writeback(page, true) |
327 | 328 | ||
328 | static inline void set_page_writeback(struct page *page) | 329 | static inline void set_page_writeback(struct page *page) |
329 | { | 330 | { |
330 | test_set_page_writeback(page); | 331 | test_set_page_writeback(page); |
331 | } | 332 | } |
332 | 333 | ||
333 | static inline void set_page_writeback_keepwrite(struct page *page) | 334 | static inline void set_page_writeback_keepwrite(struct page *page) |
334 | { | 335 | { |
335 | test_set_page_writeback_keepwrite(page); | 336 | test_set_page_writeback_keepwrite(page); |
336 | } | 337 | } |
337 | 338 | ||
338 | #ifdef CONFIG_PAGEFLAGS_EXTENDED | 339 | #ifdef CONFIG_PAGEFLAGS_EXTENDED |
339 | /* | 340 | /* |
340 | * System with lots of page flags available. This allows separate | 341 | * System with lots of page flags available. This allows separate |
341 | * flags for PageHead() and PageTail() checks of compound pages so that bit | 342 | * flags for PageHead() and PageTail() checks of compound pages so that bit |
342 | * tests can be used in performance sensitive paths. PageCompound is | 343 | * tests can be used in performance sensitive paths. PageCompound is |
343 | * generally not used in hot code paths. | 344 | * generally not used in hot code paths. |
344 | */ | 345 | */ |
345 | __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head) | 346 | __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head) |
346 | __PAGEFLAG(Tail, tail) | 347 | __PAGEFLAG(Tail, tail) |
347 | 348 | ||
348 | static inline int PageCompound(struct page *page) | 349 | static inline int PageCompound(struct page *page) |
349 | { | 350 | { |
350 | return page->flags & ((1L << PG_head) | (1L << PG_tail)); | 351 | return page->flags & ((1L << PG_head) | (1L << PG_tail)); |
351 | 352 | ||
352 | } | 353 | } |
353 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE | 354 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE |
354 | static inline void ClearPageCompound(struct page *page) | 355 | static inline void ClearPageCompound(struct page *page) |
355 | { | 356 | { |
356 | BUG_ON(!PageHead(page)); | 357 | BUG_ON(!PageHead(page)); |
357 | ClearPageHead(page); | 358 | ClearPageHead(page); |
358 | } | 359 | } |
359 | #endif | 360 | #endif |
360 | #else | 361 | #else |
361 | /* | 362 | /* |
362 | * Reduce page flag use as much as possible by overlapping | 363 | * Reduce page flag use as much as possible by overlapping |
363 | * compound page flags with the flags used for page cache pages. Possible | 364 | * compound page flags with the flags used for page cache pages. Possible |
364 | * because PageCompound is always set for compound pages and not for | 365 | * because PageCompound is always set for compound pages and not for |
365 | * pages on the LRU and/or pagecache. | 366 | * pages on the LRU and/or pagecache. |
366 | */ | 367 | */ |
367 | TESTPAGEFLAG(Compound, compound) | 368 | TESTPAGEFLAG(Compound, compound) |
368 | __SETPAGEFLAG(Head, compound) __CLEARPAGEFLAG(Head, compound) | 369 | __SETPAGEFLAG(Head, compound) __CLEARPAGEFLAG(Head, compound) |
369 | 370 | ||
370 | /* | 371 | /* |
371 | * PG_reclaim is used in combination with PG_compound to mark the | 372 | * PG_reclaim is used in combination with PG_compound to mark the |
372 | * head and tail of a compound page. This saves one page flag | 373 | * head and tail of a compound page. This saves one page flag |
373 | * but makes it impossible to use compound pages for the page cache. | 374 | * but makes it impossible to use compound pages for the page cache. |
374 | * The PG_reclaim bit would have to be used for reclaim or readahead | 375 | * The PG_reclaim bit would have to be used for reclaim or readahead |
375 | * if compound pages enter the page cache. | 376 | * if compound pages enter the page cache. |
376 | * | 377 | * |
377 | * PG_compound & PG_reclaim => Tail page | 378 | * PG_compound & PG_reclaim => Tail page |
378 | * PG_compound & ~PG_reclaim => Head page | 379 | * PG_compound & ~PG_reclaim => Head page |
379 | */ | 380 | */ |
380 | #define PG_head_mask ((1L << PG_compound)) | 381 | #define PG_head_mask ((1L << PG_compound)) |
381 | #define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim)) | 382 | #define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim)) |
382 | 383 | ||
383 | static inline int PageHead(struct page *page) | 384 | static inline int PageHead(struct page *page) |
384 | { | 385 | { |
385 | return ((page->flags & PG_head_tail_mask) == PG_head_mask); | 386 | return ((page->flags & PG_head_tail_mask) == PG_head_mask); |
386 | } | 387 | } |
387 | 388 | ||
388 | static inline int PageTail(struct page *page) | 389 | static inline int PageTail(struct page *page) |
389 | { | 390 | { |
390 | return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask); | 391 | return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask); |
391 | } | 392 | } |
392 | 393 | ||
393 | static inline void __SetPageTail(struct page *page) | 394 | static inline void __SetPageTail(struct page *page) |
394 | { | 395 | { |
395 | page->flags |= PG_head_tail_mask; | 396 | page->flags |= PG_head_tail_mask; |
396 | } | 397 | } |
397 | 398 | ||
398 | static inline void __ClearPageTail(struct page *page) | 399 | static inline void __ClearPageTail(struct page *page) |
399 | { | 400 | { |
400 | page->flags &= ~PG_head_tail_mask; | 401 | page->flags &= ~PG_head_tail_mask; |
401 | } | 402 | } |
402 | 403 | ||
403 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE | 404 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE |
404 | static inline void ClearPageCompound(struct page *page) | 405 | static inline void ClearPageCompound(struct page *page) |
405 | { | 406 | { |
406 | BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound)); | 407 | BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound)); |
407 | clear_bit(PG_compound, &page->flags); | 408 | clear_bit(PG_compound, &page->flags); |
408 | } | 409 | } |
409 | #endif | 410 | #endif |
410 | 411 | ||
411 | #endif /* !PAGEFLAGS_EXTENDED */ | 412 | #endif /* !PAGEFLAGS_EXTENDED */ |
412 | 413 | ||
413 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE | 414 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE |
414 | /* | 415 | /* |
415 | * PageHuge() only returns true for hugetlbfs pages, but not for | 416 | * PageHuge() only returns true for hugetlbfs pages, but not for |
416 | * normal or transparent huge pages. | 417 | * normal or transparent huge pages. |
417 | * | 418 | * |
418 | * PageTransHuge() returns true for both transparent huge and | 419 | * PageTransHuge() returns true for both transparent huge and |
419 | * hugetlbfs pages, but not normal pages. PageTransHuge() can only be | 420 | * hugetlbfs pages, but not normal pages. PageTransHuge() can only be |
420 | * called in the core VM paths where hugetlbfs pages can't exist. | 421 | * called in the core VM paths where hugetlbfs pages can't exist. |
421 | */ | 422 | */ |
422 | static inline int PageTransHuge(struct page *page) | 423 | static inline int PageTransHuge(struct page *page) |
423 | { | 424 | { |
424 | VM_BUG_ON(PageTail(page)); | 425 | VM_BUG_ON(PageTail(page)); |
425 | return PageHead(page); | 426 | return PageHead(page); |
426 | } | 427 | } |
427 | 428 | ||
428 | /* | 429 | /* |
429 | * PageTransCompound returns true for both transparent huge pages | 430 | * PageTransCompound returns true for both transparent huge pages |
430 | * and hugetlbfs pages, so it should only be called when it's known | 431 | * and hugetlbfs pages, so it should only be called when it's known |
431 | * that hugetlbfs pages aren't involved. | 432 | * that hugetlbfs pages aren't involved. |
432 | */ | 433 | */ |
433 | static inline int PageTransCompound(struct page *page) | 434 | static inline int PageTransCompound(struct page *page) |
434 | { | 435 | { |
435 | return PageCompound(page); | 436 | return PageCompound(page); |
436 | } | 437 | } |
437 | 438 | ||
438 | /* | 439 | /* |
439 | * PageTransTail returns true for both transparent huge pages | 440 | * PageTransTail returns true for both transparent huge pages |
440 | * and hugetlbfs pages, so it should only be called when it's known | 441 | * and hugetlbfs pages, so it should only be called when it's known |
441 | * that hugetlbfs pages aren't involved. | 442 | * that hugetlbfs pages aren't involved. |
442 | */ | 443 | */ |
443 | static inline int PageTransTail(struct page *page) | 444 | static inline int PageTransTail(struct page *page) |
444 | { | 445 | { |
445 | return PageTail(page); | 446 | return PageTail(page); |
446 | } | 447 | } |
447 | 448 | ||
448 | #else | 449 | #else |
449 | 450 | ||
450 | static inline int PageTransHuge(struct page *page) | 451 | static inline int PageTransHuge(struct page *page) |
451 | { | 452 | { |
452 | return 0; | 453 | return 0; |
453 | } | 454 | } |
454 | 455 | ||
455 | static inline int PageTransCompound(struct page *page) | 456 | static inline int PageTransCompound(struct page *page) |
456 | { | 457 | { |
457 | return 0; | 458 | return 0; |
458 | } | 459 | } |
459 | 460 | ||
460 | static inline int PageTransTail(struct page *page) | 461 | static inline int PageTransTail(struct page *page) |
461 | { | 462 | { |
462 | return 0; | 463 | return 0; |
463 | } | 464 | } |
464 | #endif | 465 | #endif |
465 | 466 | ||
466 | /* | 467 | /* |
467 | * If network-based swap is enabled, sl*b must keep track of whether pages | 468 | * If network-based swap is enabled, sl*b must keep track of whether pages |
468 | * were allocated from pfmemalloc reserves. | 469 | * were allocated from pfmemalloc reserves. |
469 | */ | 470 | */ |
470 | static inline int PageSlabPfmemalloc(struct page *page) | 471 | static inline int PageSlabPfmemalloc(struct page *page) |
471 | { | 472 | { |
472 | VM_BUG_ON(!PageSlab(page)); | 473 | VM_BUG_ON(!PageSlab(page)); |
473 | return PageActive(page); | 474 | return PageActive(page); |
474 | } | 475 | } |
475 | 476 | ||
476 | static inline void SetPageSlabPfmemalloc(struct page *page) | 477 | static inline void SetPageSlabPfmemalloc(struct page *page) |
477 | { | 478 | { |
478 | VM_BUG_ON(!PageSlab(page)); | 479 | VM_BUG_ON(!PageSlab(page)); |
479 | SetPageActive(page); | 480 | SetPageActive(page); |
480 | } | 481 | } |
481 | 482 | ||
482 | static inline void __ClearPageSlabPfmemalloc(struct page *page) | 483 | static inline void __ClearPageSlabPfmemalloc(struct page *page) |
483 | { | 484 | { |
484 | VM_BUG_ON(!PageSlab(page)); | 485 | VM_BUG_ON(!PageSlab(page)); |
485 | __ClearPageActive(page); | 486 | __ClearPageActive(page); |
486 | } | 487 | } |
487 | 488 | ||
488 | static inline void ClearPageSlabPfmemalloc(struct page *page) | 489 | static inline void ClearPageSlabPfmemalloc(struct page *page) |
489 | { | 490 | { |
490 | VM_BUG_ON(!PageSlab(page)); | 491 | VM_BUG_ON(!PageSlab(page)); |
491 | ClearPageActive(page); | 492 | ClearPageActive(page); |
492 | } | 493 | } |
493 | 494 | ||
494 | #ifdef CONFIG_MMU | 495 | #ifdef CONFIG_MMU |
495 | #define __PG_MLOCKED (1 << PG_mlocked) | 496 | #define __PG_MLOCKED (1 << PG_mlocked) |
496 | #else | 497 | #else |
497 | #define __PG_MLOCKED 0 | 498 | #define __PG_MLOCKED 0 |
498 | #endif | 499 | #endif |
499 | 500 | ||
500 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE | 501 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE |
501 | #define __PG_COMPOUND_LOCK (1 << PG_compound_lock) | 502 | #define __PG_COMPOUND_LOCK (1 << PG_compound_lock) |
502 | #else | 503 | #else |
503 | #define __PG_COMPOUND_LOCK 0 | 504 | #define __PG_COMPOUND_LOCK 0 |
504 | #endif | 505 | #endif |
505 | 506 | ||
506 | /* | 507 | /* |
507 | * Flags checked when a page is freed. Pages being freed should not have | 508 | * Flags checked when a page is freed. Pages being freed should not have |
508 | * these flags set. If they are, there is a problem. | 509 | * these flags set. If they are, there is a problem. |
509 | */ | 510 | */ |
510 | #define PAGE_FLAGS_CHECK_AT_FREE \ | 511 | #define PAGE_FLAGS_CHECK_AT_FREE \ |
511 | (1 << PG_lru | 1 << PG_locked | \ | 512 | (1 << PG_lru | 1 << PG_locked | \ |
512 | 1 << PG_private | 1 << PG_private_2 | \ | 513 | 1 << PG_private | 1 << PG_private_2 | \ |
513 | 1 << PG_writeback | 1 << PG_reserved | \ | 514 | 1 << PG_writeback | 1 << PG_reserved | \ |
514 | 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ | 515 | 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ |
515 | 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ | 516 | 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ |
516 | __PG_COMPOUND_LOCK) | 517 | __PG_COMPOUND_LOCK) |
517 | 518 | ||
518 | /* | 519 | /* |
519 | * Flags checked when a page is prepped for return by the page allocator. | 520 | * Flags checked when a page is prepped for return by the page allocator. |
520 | * Pages being prepped should not have any flags set. If they are set, | 521 | * Pages being prepped should not have any flags set. If they are set, |
521 | * there has been a kernel bug or struct page corruption. | 522 | * there has been a kernel bug or struct page corruption. |
522 | */ | 523 | */ |
523 | #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1) | 524 | #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1) |
524 | 525 | ||
525 | #define PAGE_FLAGS_PRIVATE \ | 526 | #define PAGE_FLAGS_PRIVATE \ |
526 | (1 << PG_private | 1 << PG_private_2) | 527 | (1 << PG_private | 1 << PG_private_2) |
527 | /** | 528 | /** |
528 | * page_has_private - Determine if page has private stuff | 529 | * page_has_private - Determine if page has private stuff |
529 | * @page: The page to be checked | 530 | * @page: The page to be checked |
530 | * | 531 | * |
531 | * Determine if a page has private stuff, indicating that release routines | 532 | * Determine if a page has private stuff, indicating that release routines |
532 | * should be invoked upon it. | 533 | * should be invoked upon it. |
533 | */ | 534 | */ |
534 | static inline int page_has_private(struct page *page) | 535 | static inline int page_has_private(struct page *page) |
535 | { | 536 | { |
536 | return !!(page->flags & PAGE_FLAGS_PRIVATE); | 537 | return !!(page->flags & PAGE_FLAGS_PRIVATE); |
537 | } | 538 | } |
538 | 539 | ||
539 | #endif /* !__GENERATING_BOUNDS_H */ | 540 | #endif /* !__GENERATING_BOUNDS_H */ |
540 | 541 | ||
541 | #endif /* PAGE_FLAGS_H */ | 542 | #endif /* PAGE_FLAGS_H */ |
542 | 543 |
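Worth calling out in this hunk is the new __SETPAGEFLAG(Referenced, referenced) line (new line 201 above): together with the existing PAGEFLAG(Referenced, referenced), the macros now generate both an atomic and a non-atomic setter for PG_referenced. Hand-expanded as a sketch (not code from the patch), the only difference is the bitop used, which is why the __ variant is only safe while nobody else can reach page->flags, e.g. on a freshly allocated page that has not been made visible yet:

	/* Generated from PAGEFLAG(Referenced, referenced): */
	static inline void SetPageReferenced(struct page *page)
	{
		set_bit(PG_referenced, &page->flags);	/* atomic RMW */
	}

	/* Generated from the newly added __SETPAGEFLAG(Referenced, referenced): */
	static inline void __SetPageReferenced(struct page *page)
	{
		__set_bit(PG_referenced, &page->flags);	/* plain, non-atomic RMW */
	}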
include/linux/pagemap.h
1 | #ifndef _LINUX_PAGEMAP_H | 1 | #ifndef _LINUX_PAGEMAP_H |
2 | #define _LINUX_PAGEMAP_H | 2 | #define _LINUX_PAGEMAP_H |
3 | 3 | ||
4 | /* | 4 | /* |
5 | * Copyright 1995 Linus Torvalds | 5 | * Copyright 1995 Linus Torvalds |
6 | */ | 6 | */ |
7 | #include <linux/mm.h> | 7 | #include <linux/mm.h> |
8 | #include <linux/fs.h> | 8 | #include <linux/fs.h> |
9 | #include <linux/list.h> | 9 | #include <linux/list.h> |
10 | #include <linux/highmem.h> | 10 | #include <linux/highmem.h> |
11 | #include <linux/compiler.h> | 11 | #include <linux/compiler.h> |
12 | #include <asm/uaccess.h> | 12 | #include <asm/uaccess.h> |
13 | #include <linux/gfp.h> | 13 | #include <linux/gfp.h> |
14 | #include <linux/bitops.h> | 14 | #include <linux/bitops.h> |
15 | #include <linux/hardirq.h> /* for in_interrupt() */ | 15 | #include <linux/hardirq.h> /* for in_interrupt() */ |
16 | #include <linux/hugetlb_inline.h> | 16 | #include <linux/hugetlb_inline.h> |
17 | 17 | ||
18 | /* | 18 | /* |
19 | * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page | 19 | * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page |
20 | * allocation mode flags. | 20 | * allocation mode flags. |
21 | */ | 21 | */ |
22 | enum mapping_flags { | 22 | enum mapping_flags { |
23 | AS_EIO = __GFP_BITS_SHIFT + 0, /* IO error on async write */ | 23 | AS_EIO = __GFP_BITS_SHIFT + 0, /* IO error on async write */ |
24 | AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ | 24 | AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ |
25 | AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ | 25 | AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ |
26 | AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ | 26 | AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ |
27 | AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */ | 27 | AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */ |
28 | }; | 28 | }; |
29 | 29 | ||
30 | static inline void mapping_set_error(struct address_space *mapping, int error) | 30 | static inline void mapping_set_error(struct address_space *mapping, int error) |
31 | { | 31 | { |
32 | if (unlikely(error)) { | 32 | if (unlikely(error)) { |
33 | if (error == -ENOSPC) | 33 | if (error == -ENOSPC) |
34 | set_bit(AS_ENOSPC, &mapping->flags); | 34 | set_bit(AS_ENOSPC, &mapping->flags); |
35 | else | 35 | else |
36 | set_bit(AS_EIO, &mapping->flags); | 36 | set_bit(AS_EIO, &mapping->flags); |
37 | } | 37 | } |
38 | } | 38 | } |
39 | 39 | ||
40 | static inline void mapping_set_unevictable(struct address_space *mapping) | 40 | static inline void mapping_set_unevictable(struct address_space *mapping) |
41 | { | 41 | { |
42 | set_bit(AS_UNEVICTABLE, &mapping->flags); | 42 | set_bit(AS_UNEVICTABLE, &mapping->flags); |
43 | } | 43 | } |
44 | 44 | ||
45 | static inline void mapping_clear_unevictable(struct address_space *mapping) | 45 | static inline void mapping_clear_unevictable(struct address_space *mapping) |
46 | { | 46 | { |
47 | clear_bit(AS_UNEVICTABLE, &mapping->flags); | 47 | clear_bit(AS_UNEVICTABLE, &mapping->flags); |
48 | } | 48 | } |
49 | 49 | ||
50 | static inline int mapping_unevictable(struct address_space *mapping) | 50 | static inline int mapping_unevictable(struct address_space *mapping) |
51 | { | 51 | { |
52 | if (mapping) | 52 | if (mapping) |
53 | return test_bit(AS_UNEVICTABLE, &mapping->flags); | 53 | return test_bit(AS_UNEVICTABLE, &mapping->flags); |
54 | return !!mapping; | 54 | return !!mapping; |
55 | } | 55 | } |
56 | 56 | ||
57 | static inline void mapping_set_balloon(struct address_space *mapping) | 57 | static inline void mapping_set_balloon(struct address_space *mapping) |
58 | { | 58 | { |
59 | set_bit(AS_BALLOON_MAP, &mapping->flags); | 59 | set_bit(AS_BALLOON_MAP, &mapping->flags); |
60 | } | 60 | } |
61 | 61 | ||
62 | static inline void mapping_clear_balloon(struct address_space *mapping) | 62 | static inline void mapping_clear_balloon(struct address_space *mapping) |
63 | { | 63 | { |
64 | clear_bit(AS_BALLOON_MAP, &mapping->flags); | 64 | clear_bit(AS_BALLOON_MAP, &mapping->flags); |
65 | } | 65 | } |
66 | 66 | ||
67 | static inline int mapping_balloon(struct address_space *mapping) | 67 | static inline int mapping_balloon(struct address_space *mapping) |
68 | { | 68 | { |
69 | return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags); | 69 | return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags); |
70 | } | 70 | } |
71 | 71 | ||
72 | static inline gfp_t mapping_gfp_mask(struct address_space * mapping) | 72 | static inline gfp_t mapping_gfp_mask(struct address_space * mapping) |
73 | { | 73 | { |
74 | return (__force gfp_t)mapping->flags & __GFP_BITS_MASK; | 74 | return (__force gfp_t)mapping->flags & __GFP_BITS_MASK; |
75 | } | 75 | } |
76 | 76 | ||
77 | /* | 77 | /* |
78 | * This is non-atomic. Only to be used before the mapping is activated. | 78 | * This is non-atomic. Only to be used before the mapping is activated. |
79 | * Probably needs a barrier... | 79 | * Probably needs a barrier... |
80 | */ | 80 | */ |
81 | static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask) | 81 | static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask) |
82 | { | 82 | { |
83 | m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) | | 83 | m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) | |
84 | (__force unsigned long)mask; | 84 | (__force unsigned long)mask; |
85 | } | 85 | } |
86 | 86 | ||
87 | /* | 87 | /* |
88 | * The page cache can be done in larger chunks than | 88 | * The page cache can be done in larger chunks than |
89 | * one page, because it allows for more efficient | 89 | * one page, because it allows for more efficient |
90 | * throughput (it can then be mapped into user | 90 | * throughput (it can then be mapped into user |
91 | * space in smaller chunks for same flexibility). | 91 | * space in smaller chunks for same flexibility). |
92 | * | 92 | * |
93 | * Or rather, it _will_ be done in larger chunks. | 93 | * Or rather, it _will_ be done in larger chunks. |
94 | */ | 94 | */ |
95 | #define PAGE_CACHE_SHIFT PAGE_SHIFT | 95 | #define PAGE_CACHE_SHIFT PAGE_SHIFT |
96 | #define PAGE_CACHE_SIZE PAGE_SIZE | 96 | #define PAGE_CACHE_SIZE PAGE_SIZE |
97 | #define PAGE_CACHE_MASK PAGE_MASK | 97 | #define PAGE_CACHE_MASK PAGE_MASK |
98 | #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK) | 98 | #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK) |
99 | 99 | ||
100 | #define page_cache_get(page) get_page(page) | 100 | #define page_cache_get(page) get_page(page) |
101 | #define page_cache_release(page) put_page(page) | 101 | #define page_cache_release(page) put_page(page) |
102 | void release_pages(struct page **pages, int nr, bool cold); | 102 | void release_pages(struct page **pages, int nr, bool cold); |
103 | 103 | ||
104 | /* | 104 | /* |
105 | * speculatively take a reference to a page. | 105 | * speculatively take a reference to a page. |
106 | * If the page is free (_count == 0), then _count is untouched, and 0 | 106 | * If the page is free (_count == 0), then _count is untouched, and 0 |
107 | * is returned. Otherwise, _count is incremented by 1 and 1 is returned. | 107 | * is returned. Otherwise, _count is incremented by 1 and 1 is returned. |
108 | * | 108 | * |
109 | * This function must be called inside the same rcu_read_lock() section as has | 109 | * This function must be called inside the same rcu_read_lock() section as has |
110 | * been used to lookup the page in the pagecache radix-tree (or page table): | 110 | * been used to lookup the page in the pagecache radix-tree (or page table): |
111 | * this allows allocators to use a synchronize_rcu() to stabilize _count. | 111 | * this allows allocators to use a synchronize_rcu() to stabilize _count. |
112 | * | 112 | * |
113 | * Unless an RCU grace period has passed, the count of all pages coming out | 113 | * Unless an RCU grace period has passed, the count of all pages coming out |
114 | * of the allocator must be considered unstable. page_count may return higher | 114 | * of the allocator must be considered unstable. page_count may return higher |
115 | * than expected, and put_page must be able to do the right thing when the | 115 | * than expected, and put_page must be able to do the right thing when the |
116 | * page has been finished with, no matter what it is subsequently allocated | 116 | * page has been finished with, no matter what it is subsequently allocated |
117 | * for (because put_page is what is used here to drop an invalid speculative | 117 | * for (because put_page is what is used here to drop an invalid speculative |
118 | * reference). | 118 | * reference). |
119 | * | 119 | * |
120 | * This is the interesting part of the lockless pagecache (and lockless | 120 | * This is the interesting part of the lockless pagecache (and lockless |
121 | * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page) | 121 | * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page) |
122 | * has the following pattern: | 122 | * has the following pattern: |
123 | * 1. find page in radix tree | 123 | * 1. find page in radix tree |
124 | * 2. conditionally increment refcount | 124 | * 2. conditionally increment refcount |
125 | * 3. check the page is still in pagecache (if no, goto 1) | 125 | * 3. check the page is still in pagecache (if no, goto 1) |
126 | * | 126 | * |
127 | * Remove-side that cares about stability of _count (eg. reclaim) has the | 127 | * Remove-side that cares about stability of _count (eg. reclaim) has the |
128 | * following (with tree_lock held for write): | 128 | * following (with tree_lock held for write): |
129 | * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg) | 129 | * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg) |
130 | * B. remove page from pagecache | 130 | * B. remove page from pagecache |
131 | * C. free the page | 131 | * C. free the page |
132 | * | 132 | * |
133 | * There are 2 critical interleavings that matter: | 133 | * There are 2 critical interleavings that matter: |
134 | * - 2 runs before A: in this case, A sees elevated refcount and bails out | 134 | * - 2 runs before A: in this case, A sees elevated refcount and bails out |
135 | * - A runs before 2: in this case, 2 sees zero refcount and retries; | 135 | * - A runs before 2: in this case, 2 sees zero refcount and retries; |
136 | * subsequently, B will complete and 1 will find no page, causing the | 136 | * subsequently, B will complete and 1 will find no page, causing the |
137 | * lookup to return NULL. | 137 | * lookup to return NULL. |
138 | * | 138 | * |
139 | * It is possible that between 1 and 2, the page is removed then the exact same | 139 | * It is possible that between 1 and 2, the page is removed then the exact same |
140 | * page is inserted into the same position in pagecache. That's OK: the | 140 | * page is inserted into the same position in pagecache. That's OK: the |
141 | * old find_get_page using tree_lock could equally have run before or after | 141 | * old find_get_page using tree_lock could equally have run before or after |
142 | * such a re-insertion, depending on order that locks are granted. | 142 | * such a re-insertion, depending on order that locks are granted. |
143 | * | 143 | * |
144 | * Lookups racing against pagecache insertion isn't a big problem: either 1 | 144 | * Lookups racing against pagecache insertion isn't a big problem: either 1 |
145 | * will find the page or it will not. Likewise, the old find_get_page could run | 145 | * will find the page or it will not. Likewise, the old find_get_page could run |
146 | * either before the insertion or afterwards, depending on timing. | 146 | * either before the insertion or afterwards, depending on timing. |
147 | */ | 147 | */ |
148 | static inline int page_cache_get_speculative(struct page *page) | 148 | static inline int page_cache_get_speculative(struct page *page) |
149 | { | 149 | { |
150 | VM_BUG_ON(in_interrupt()); | 150 | VM_BUG_ON(in_interrupt()); |
151 | 151 | ||
152 | #ifdef CONFIG_TINY_RCU | 152 | #ifdef CONFIG_TINY_RCU |
153 | # ifdef CONFIG_PREEMPT_COUNT | 153 | # ifdef CONFIG_PREEMPT_COUNT |
154 | VM_BUG_ON(!in_atomic()); | 154 | VM_BUG_ON(!in_atomic()); |
155 | # endif | 155 | # endif |
156 | /* | 156 | /* |
157 | * Preempt must be disabled here - we rely on rcu_read_lock doing | 157 | * Preempt must be disabled here - we rely on rcu_read_lock doing |
158 | * this for us. | 158 | * this for us. |
159 | * | 159 | * |
160 | * Pagecache won't be truncated from interrupt context, so if we have | 160 | * Pagecache won't be truncated from interrupt context, so if we have |
161 | * found a page in the radix tree here, we have pinned its refcount by | 161 | * found a page in the radix tree here, we have pinned its refcount by |
162 | * disabling preempt, and hence no need for the "speculative get" that | 162 | * disabling preempt, and hence no need for the "speculative get" that |
163 | * SMP requires. | 163 | * SMP requires. |
164 | */ | 164 | */ |
165 | VM_BUG_ON(page_count(page) == 0); | 165 | VM_BUG_ON(page_count(page) == 0); |
166 | atomic_inc(&page->_count); | 166 | atomic_inc(&page->_count); |
167 | 167 | ||
168 | #else | 168 | #else |
169 | if (unlikely(!get_page_unless_zero(page))) { | 169 | if (unlikely(!get_page_unless_zero(page))) { |
170 | /* | 170 | /* |
171 | * Either the page has been freed, or will be freed. | 171 | * Either the page has been freed, or will be freed. |
172 | * In either case, retry here and the caller should | 172 | * In either case, retry here and the caller should |
173 | * do the right thing (see comments above). | 173 | * do the right thing (see comments above). |
174 | */ | 174 | */ |
175 | return 0; | 175 | return 0; |
176 | } | 176 | } |
177 | #endif | 177 | #endif |
178 | VM_BUG_ON(PageTail(page)); | 178 | VM_BUG_ON(PageTail(page)); |
179 | 179 | ||
180 | return 1; | 180 | return 1; |
181 | } | 181 | } |
182 | 182 | ||
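The block comment before page_cache_get_speculative() describes a three-step lookup protocol; the function above implements step 2. A condensed sketch of the whole lookup side, assuming a plain radix_tree_lookup() and ignoring the slot and exceptional-entry handling that the real find_get_entry() in mm/filemap.c has to cope with:

	#include <linux/radix-tree.h>
	#include <linux/rcupdate.h>

	static struct page *speculative_lookup_sketch(struct address_space *mapping,
						      pgoff_t offset)
	{
		struct page *page;

	repeat:
		rcu_read_lock();
		page = radix_tree_lookup(&mapping->page_tree, offset);	/* 1. find */
		if (page) {
			if (!page_cache_get_speculative(page)) {	/* 2. take ref */
				rcu_read_unlock();
				goto repeat;			/* page was being freed */
			}
			/* 3. recheck: the page may have been freed and reused
			 * for a different offset between steps 1 and 2. */
			if (unlikely(page != radix_tree_lookup(&mapping->page_tree,
							       offset))) {
				page_cache_release(page);
				rcu_read_unlock();
				goto repeat;
			}
		}
		rcu_read_unlock();
		return page;
	}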
183 | /* | 183 | /* |
184 | * Same as above, but add instead of inc (could just be merged) | 184 | * Same as above, but add instead of inc (could just be merged) |
185 | */ | 185 | */ |
186 | static inline int page_cache_add_speculative(struct page *page, int count) | 186 | static inline int page_cache_add_speculative(struct page *page, int count) |
187 | { | 187 | { |
188 | VM_BUG_ON(in_interrupt()); | 188 | VM_BUG_ON(in_interrupt()); |
189 | 189 | ||
190 | #if !defined(CONFIG_SMP) && defined(CONFIG_TREE_RCU) | 190 | #if !defined(CONFIG_SMP) && defined(CONFIG_TREE_RCU) |
191 | # ifdef CONFIG_PREEMPT_COUNT | 191 | # ifdef CONFIG_PREEMPT_COUNT |
192 | VM_BUG_ON(!in_atomic()); | 192 | VM_BUG_ON(!in_atomic()); |
193 | # endif | 193 | # endif |
194 | VM_BUG_ON(page_count(page) == 0); | 194 | VM_BUG_ON(page_count(page) == 0); |
195 | atomic_add(count, &page->_count); | 195 | atomic_add(count, &page->_count); |
196 | 196 | ||
197 | #else | 197 | #else |
198 | if (unlikely(!atomic_add_unless(&page->_count, count, 0))) | 198 | if (unlikely(!atomic_add_unless(&page->_count, count, 0))) |
199 | return 0; | 199 | return 0; |
200 | #endif | 200 | #endif |
201 | VM_BUG_ON(PageCompound(page) && page != compound_head(page)); | 201 | VM_BUG_ON(PageCompound(page) && page != compound_head(page)); |
202 | 202 | ||
203 | return 1; | 203 | return 1; |
204 | } | 204 | } |
205 | 205 | ||
206 | static inline int page_freeze_refs(struct page *page, int count) | 206 | static inline int page_freeze_refs(struct page *page, int count) |
207 | { | 207 | { |
208 | return likely(atomic_cmpxchg(&page->_count, count, 0) == count); | 208 | return likely(atomic_cmpxchg(&page->_count, count, 0) == count); |
209 | } | 209 | } |
210 | 210 | ||
211 | static inline void page_unfreeze_refs(struct page *page, int count) | 211 | static inline void page_unfreeze_refs(struct page *page, int count) |
212 | { | 212 | { |
213 | VM_BUG_ON(page_count(page) != 0); | 213 | VM_BUG_ON(page_count(page) != 0); |
214 | VM_BUG_ON(count == 0); | 214 | VM_BUG_ON(count == 0); |
215 | 215 | ||
216 | atomic_set(&page->_count, count); | 216 | atomic_set(&page->_count, count); |
217 | } | 217 | } |
218 | 218 | ||
219 | #ifdef CONFIG_NUMA | 219 | #ifdef CONFIG_NUMA |
220 | extern struct page *__page_cache_alloc(gfp_t gfp); | 220 | extern struct page *__page_cache_alloc(gfp_t gfp); |
221 | #else | 221 | #else |
222 | static inline struct page *__page_cache_alloc(gfp_t gfp) | 222 | static inline struct page *__page_cache_alloc(gfp_t gfp) |
223 | { | 223 | { |
224 | return alloc_pages(gfp, 0); | 224 | return alloc_pages(gfp, 0); |
225 | } | 225 | } |
226 | #endif | 226 | #endif |
227 | 227 | ||
228 | static inline struct page *page_cache_alloc(struct address_space *x) | 228 | static inline struct page *page_cache_alloc(struct address_space *x) |
229 | { | 229 | { |
230 | return __page_cache_alloc(mapping_gfp_mask(x)); | 230 | return __page_cache_alloc(mapping_gfp_mask(x)); |
231 | } | 231 | } |
232 | 232 | ||
233 | static inline struct page *page_cache_alloc_cold(struct address_space *x) | 233 | static inline struct page *page_cache_alloc_cold(struct address_space *x) |
234 | { | 234 | { |
235 | return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD); | 235 | return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD); |
236 | } | 236 | } |
237 | 237 | ||
238 | static inline struct page *page_cache_alloc_readahead(struct address_space *x) | 238 | static inline struct page *page_cache_alloc_readahead(struct address_space *x) |
239 | { | 239 | { |
240 | return __page_cache_alloc(mapping_gfp_mask(x) | | 240 | return __page_cache_alloc(mapping_gfp_mask(x) | |
241 | __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN); | 241 | __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN); |
242 | } | 242 | } |
243 | 243 | ||
244 | typedef int filler_t(void *, struct page *); | 244 | typedef int filler_t(void *, struct page *); |
245 | 245 | ||
246 | pgoff_t page_cache_next_hole(struct address_space *mapping, | 246 | pgoff_t page_cache_next_hole(struct address_space *mapping, |
247 | pgoff_t index, unsigned long max_scan); | 247 | pgoff_t index, unsigned long max_scan); |
248 | pgoff_t page_cache_prev_hole(struct address_space *mapping, | 248 | pgoff_t page_cache_prev_hole(struct address_space *mapping, |
249 | pgoff_t index, unsigned long max_scan); | 249 | pgoff_t index, unsigned long max_scan); |
250 | 250 | ||
251 | #define FGP_ACCESSED 0x00000001 | ||
252 | #define FGP_LOCK 0x00000002 | ||
253 | #define FGP_CREAT 0x00000004 | ||
254 | #define FGP_WRITE 0x00000008 | ||
255 | #define FGP_NOFS 0x00000010 | ||
256 | #define FGP_NOWAIT 0x00000020 | ||
257 | |||
258 | struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset, | ||
259 | int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask); | ||
260 | |||
261 | /** | ||
262 | * find_get_page - find and get a page reference | ||
263 | * @mapping: the address_space to search | ||
264 | * @offset: the page index | ||
265 | * | ||
266 | * Looks up the page cache slot at @mapping & @offset. If there is a | ||
267 | * page cache page, it is returned with an increased refcount. | ||
268 | * | ||
269 | * Otherwise, %NULL is returned. | ||
270 | */ | ||
271 | static inline struct page *find_get_page(struct address_space *mapping, | ||
272 | pgoff_t offset) | ||
273 | { | ||
274 | return pagecache_get_page(mapping, offset, 0, 0, 0); | ||
275 | } | ||
276 | |||
277 | static inline struct page *find_get_page_flags(struct address_space *mapping, | ||
278 | pgoff_t offset, int fgp_flags) | ||
279 | { | ||
280 | return pagecache_get_page(mapping, offset, fgp_flags, 0, 0); | ||
281 | } | ||
282 | |||
283 | /** | ||
284 | * find_lock_page - locate, pin and lock a pagecache page | ||
285 | * pagecache_get_page - find and get a page reference | ||
286 | * @mapping: the address_space to search | ||
287 | * @offset: the page index | ||
288 | * | ||
289 | * Looks up the page cache slot at @mapping & @offset. If there is a | ||
290 | * page cache page, it is returned locked and with an increased | ||
291 | * refcount. | ||
292 | * | ||
293 | * Otherwise, %NULL is returned. | ||
294 | * | ||
295 | * find_lock_page() may sleep. | ||
296 | */ | ||
297 | static inline struct page *find_lock_page(struct address_space *mapping, | ||
298 | pgoff_t offset) | ||
299 | { | ||
300 | return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0); | ||
301 | } | ||
302 | |||
303 | /** | ||
304 | * find_or_create_page - locate or add a pagecache page | ||
305 | * @mapping: the page's address_space | ||
306 | * @index: the page's index into the mapping | ||
307 | * @gfp_mask: page allocation mode | ||
308 | * | ||
309 | * Looks up the page cache slot at @mapping & @offset. If there is a | ||
310 | * page cache page, it is returned locked and with an increased | ||
311 | * refcount. | ||
312 | * | ||
313 | * If the page is not present, a new page is allocated using @gfp_mask | ||
314 | * and added to the page cache and the VM's LRU list. The page is | ||
315 | * returned locked and with an increased refcount. | ||
316 | * | ||
317 | * On memory exhaustion, %NULL is returned. | ||
318 | * | ||
319 | * find_or_create_page() may sleep, even if @gfp_mask specifies an | ||
320 | * atomic allocation! | ||
321 | */ | ||
322 | static inline struct page *find_or_create_page(struct address_space *mapping, | ||
323 | pgoff_t offset, gfp_t gfp_mask) | ||
324 | { | ||
325 | return pagecache_get_page(mapping, offset, | ||
326 | FGP_LOCK|FGP_ACCESSED|FGP_CREAT, | ||
327 | gfp_mask, gfp_mask & GFP_RECLAIM_MASK); | ||
328 | } | ||
329 | |||
330 | /** | ||
331 | * grab_cache_page_nowait - returns locked page at given index in given cache | ||
332 | * @mapping: target address_space | ||
333 | * @index: the page index | ||
334 | * | ||
335 | * Same as grab_cache_page(), but do not wait if the page is unavailable. | ||
336 | * This is intended for speculative data generators, where the data can | ||
337 | * be regenerated if the page couldn't be grabbed. This routine should | ||
338 | * be safe to call while holding the lock for another page. | ||
339 | * | ||
340 | * Clear __GFP_FS when allocating the page to avoid recursion into the fs | ||
341 | * and deadlock against the caller's locked page. | ||
342 | */ | ||
343 | static inline struct page *grab_cache_page_nowait(struct address_space *mapping, | ||
344 | pgoff_t index) | ||
345 | { | ||
346 | return pagecache_get_page(mapping, index, | ||
347 | FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT, | ||
348 | mapping_gfp_mask(mapping), | ||
349 | GFP_NOFS); | ||
350 | } | ||
351 | |||
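All of the FGP_* flags and inline wrappers above are thin front-ends to pagecache_get_page(); each legacy entry point just fixes a flag combination and a pair of gfp masks. As a hedged illustration of calling the core function directly (the helper name is hypothetical), a locked lookup-or-create that also wants the page marked accessed is essentially what find_or_create_page() does:

	/* Illustrative only; mirrors the find_or_create_page() wrapper above. */
	static struct page *grab_locked_accessed_page(struct address_space *mapping,
						      pgoff_t index, gfp_t gfp)
	{
		return pagecache_get_page(mapping, index,
					  FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
					  gfp,			   /* page allocation  */
					  gfp & GFP_RECLAIM_MASK); /* radix-tree nodes */
	}

Passing 0 for both gfp masks, as find_get_page() does above, is fine because nothing is allocated unless FGP_CREAT is set.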
251 | struct page *find_get_entry(struct address_space *mapping, pgoff_t offset); | 352 | struct page *find_get_entry(struct address_space *mapping, pgoff_t offset); |
252 | struct page *find_get_page(struct address_space *mapping, pgoff_t offset); | ||
253 | struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset); | 353 | struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset); |
254 | struct page *find_lock_page(struct address_space *mapping, pgoff_t offset); | ||
255 | struct page *find_or_create_page(struct address_space *mapping, pgoff_t index, | ||
256 | gfp_t gfp_mask); | ||
257 | unsigned find_get_entries(struct address_space *mapping, pgoff_t start, | 354 | unsigned find_get_entries(struct address_space *mapping, pgoff_t start, |
258 | unsigned int nr_entries, struct page **entries, | 355 | unsigned int nr_entries, struct page **entries, |
259 | pgoff_t *indices); | 356 | pgoff_t *indices); |
260 | unsigned find_get_pages(struct address_space *mapping, pgoff_t start, | 357 | unsigned find_get_pages(struct address_space *mapping, pgoff_t start, |
261 | unsigned int nr_pages, struct page **pages); | 358 | unsigned int nr_pages, struct page **pages); |
262 | unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, | 359 | unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, |
263 | unsigned int nr_pages, struct page **pages); | 360 | unsigned int nr_pages, struct page **pages); |
264 | unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, | 361 | unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, |
265 | int tag, unsigned int nr_pages, struct page **pages); | 362 | int tag, unsigned int nr_pages, struct page **pages); |
266 | 363 | ||
267 | struct page *grab_cache_page_write_begin(struct address_space *mapping, | 364 | struct page *grab_cache_page_write_begin(struct address_space *mapping, |
268 | pgoff_t index, unsigned flags); | 365 | pgoff_t index, unsigned flags); |
269 | 366 | ||
270 | /* | 367 | /* |
271 | * Returns locked page at given index in given cache, creating it if needed. | 368 | * Returns locked page at given index in given cache, creating it if needed. |
272 | */ | 369 | */ |
273 | static inline struct page *grab_cache_page(struct address_space *mapping, | 370 | static inline struct page *grab_cache_page(struct address_space *mapping, |
274 | pgoff_t index) | 371 | pgoff_t index) |
275 | { | 372 | { |
276 | return find_or_create_page(mapping, index, mapping_gfp_mask(mapping)); | 373 | return find_or_create_page(mapping, index, mapping_gfp_mask(mapping)); |
277 | } | 374 | } |
278 | 375 | ||
279 | extern struct page * grab_cache_page_nowait(struct address_space *mapping, | ||
280 | pgoff_t index); | ||
281 | extern struct page * read_cache_page(struct address_space *mapping, | 376 | extern struct page * read_cache_page(struct address_space *mapping, |
282 | pgoff_t index, filler_t *filler, void *data); | 377 | pgoff_t index, filler_t *filler, void *data); |
283 | extern struct page * read_cache_page_gfp(struct address_space *mapping, | 378 | extern struct page * read_cache_page_gfp(struct address_space *mapping, |
284 | pgoff_t index, gfp_t gfp_mask); | 379 | pgoff_t index, gfp_t gfp_mask); |
285 | extern int read_cache_pages(struct address_space *mapping, | 380 | extern int read_cache_pages(struct address_space *mapping, |
286 | struct list_head *pages, filler_t *filler, void *data); | 381 | struct list_head *pages, filler_t *filler, void *data); |
287 | 382 | ||
288 | static inline struct page *read_mapping_page(struct address_space *mapping, | 383 | static inline struct page *read_mapping_page(struct address_space *mapping, |
289 | pgoff_t index, void *data) | 384 | pgoff_t index, void *data) |
290 | { | 385 | { |
291 | filler_t *filler = (filler_t *)mapping->a_ops->readpage; | 386 | filler_t *filler = (filler_t *)mapping->a_ops->readpage; |
292 | return read_cache_page(mapping, index, filler, data); | 387 | return read_cache_page(mapping, index, filler, data); |
293 | } | 388 | } |
294 | 389 | ||
295 | /* | 390 | /* |
296 | * Return byte-offset into filesystem object for page. | 391 | * Return byte-offset into filesystem object for page. |
297 | */ | 392 | */ |
298 | static inline loff_t page_offset(struct page *page) | 393 | static inline loff_t page_offset(struct page *page) |
299 | { | 394 | { |
300 | return ((loff_t)page->index) << PAGE_CACHE_SHIFT; | 395 | return ((loff_t)page->index) << PAGE_CACHE_SHIFT; |
301 | } | 396 | } |
302 | 397 | ||
303 | static inline loff_t page_file_offset(struct page *page) | 398 | static inline loff_t page_file_offset(struct page *page) |
304 | { | 399 | { |
305 | return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT; | 400 | return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT; |
306 | } | 401 | } |
307 | 402 | ||
308 | extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma, | 403 | extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma, |
309 | unsigned long address); | 404 | unsigned long address); |
310 | 405 | ||
311 | static inline pgoff_t linear_page_index(struct vm_area_struct *vma, | 406 | static inline pgoff_t linear_page_index(struct vm_area_struct *vma, |
312 | unsigned long address) | 407 | unsigned long address) |
313 | { | 408 | { |
314 | pgoff_t pgoff; | 409 | pgoff_t pgoff; |
315 | if (unlikely(is_vm_hugetlb_page(vma))) | 410 | if (unlikely(is_vm_hugetlb_page(vma))) |
316 | return linear_hugepage_index(vma, address); | 411 | return linear_hugepage_index(vma, address); |
317 | pgoff = (address - vma->vm_start) >> PAGE_SHIFT; | 412 | pgoff = (address - vma->vm_start) >> PAGE_SHIFT; |
318 | pgoff += vma->vm_pgoff; | 413 | pgoff += vma->vm_pgoff; |
319 | return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT); | 414 | return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT); |
320 | } | 415 | } |
321 | 416 | ||
322 | extern void __lock_page(struct page *page); | 417 | extern void __lock_page(struct page *page); |
323 | extern int __lock_page_killable(struct page *page); | 418 | extern int __lock_page_killable(struct page *page); |
324 | extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm, | 419 | extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm, |
325 | unsigned int flags); | 420 | unsigned int flags); |
326 | extern void unlock_page(struct page *page); | 421 | extern void unlock_page(struct page *page); |
327 | 422 | ||
328 | static inline void __set_page_locked(struct page *page) | 423 | static inline void __set_page_locked(struct page *page) |
329 | { | 424 | { |
330 | __set_bit(PG_locked, &page->flags); | 425 | __set_bit(PG_locked, &page->flags); |
331 | } | 426 | } |
332 | 427 | ||
333 | static inline void __clear_page_locked(struct page *page) | 428 | static inline void __clear_page_locked(struct page *page) |
334 | { | 429 | { |
335 | __clear_bit(PG_locked, &page->flags); | 430 | __clear_bit(PG_locked, &page->flags); |
336 | } | 431 | } |
337 | 432 | ||
338 | static inline int trylock_page(struct page *page) | 433 | static inline int trylock_page(struct page *page) |
339 | { | 434 | { |
340 | return (likely(!test_and_set_bit_lock(PG_locked, &page->flags))); | 435 | return (likely(!test_and_set_bit_lock(PG_locked, &page->flags))); |
341 | } | 436 | } |
342 | 437 | ||
343 | /* | 438 | /* |
344 | * lock_page may only be called if we have the page's inode pinned. | 439 | * lock_page may only be called if we have the page's inode pinned. |
345 | */ | 440 | */ |
346 | static inline void lock_page(struct page *page) | 441 | static inline void lock_page(struct page *page) |
347 | { | 442 | { |
348 | might_sleep(); | 443 | might_sleep(); |
349 | if (!trylock_page(page)) | 444 | if (!trylock_page(page)) |
350 | __lock_page(page); | 445 | __lock_page(page); |
351 | } | 446 | } |
352 | 447 | ||
353 | /* | 448 | /* |
354 | * lock_page_killable is like lock_page but can be interrupted by fatal | 449 | * lock_page_killable is like lock_page but can be interrupted by fatal |
355 | * signals. It returns 0 if it locked the page and -EINTR if it was | 450 | * signals. It returns 0 if it locked the page and -EINTR if it was |
356 | * killed while waiting. | 451 | * killed while waiting. |
357 | */ | 452 | */ |
358 | static inline int lock_page_killable(struct page *page) | 453 | static inline int lock_page_killable(struct page *page) |
359 | { | 454 | { |
360 | might_sleep(); | 455 | might_sleep(); |
361 | if (!trylock_page(page)) | 456 | if (!trylock_page(page)) |
362 | return __lock_page_killable(page); | 457 | return __lock_page_killable(page); |
363 | return 0; | 458 | return 0; |
364 | } | 459 | } |
365 | 460 | ||
366 | /* | 461 | /* |
367 | * lock_page_or_retry - Lock the page, unless this would block and the | 462 | * lock_page_or_retry - Lock the page, unless this would block and the |
368 | * caller indicated that it can handle a retry. | 463 | * caller indicated that it can handle a retry. |
369 | */ | 464 | */ |
370 | static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm, | 465 | static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm, |
371 | unsigned int flags) | 466 | unsigned int flags) |
372 | { | 467 | { |
373 | might_sleep(); | 468 | might_sleep(); |
374 | return trylock_page(page) || __lock_page_or_retry(page, mm, flags); | 469 | return trylock_page(page) || __lock_page_or_retry(page, mm, flags); |
375 | } | 470 | } |
376 | 471 | ||
377 | /* | 472 | /* |
378 | * This is exported only for wait_on_page_locked/wait_on_page_writeback. | 473 | * This is exported only for wait_on_page_locked/wait_on_page_writeback. |
379 | * Never use this directly! | 474 | * Never use this directly! |
380 | */ | 475 | */ |
381 | extern void wait_on_page_bit(struct page *page, int bit_nr); | 476 | extern void wait_on_page_bit(struct page *page, int bit_nr); |
382 | 477 | ||
383 | extern int wait_on_page_bit_killable(struct page *page, int bit_nr); | 478 | extern int wait_on_page_bit_killable(struct page *page, int bit_nr); |
384 | 479 | ||
385 | static inline int wait_on_page_locked_killable(struct page *page) | 480 | static inline int wait_on_page_locked_killable(struct page *page) |
386 | { | 481 | { |
387 | if (PageLocked(page)) | 482 | if (PageLocked(page)) |
388 | return wait_on_page_bit_killable(page, PG_locked); | 483 | return wait_on_page_bit_killable(page, PG_locked); |
389 | return 0; | 484 | return 0; |
390 | } | 485 | } |
391 | 486 | ||
392 | /* | 487 | /* |
393 | * Wait for a page to be unlocked. | 488 | * Wait for a page to be unlocked. |
394 | * | 489 | * |
395 | * This must be called with the caller "holding" the page, | 490 | * This must be called with the caller "holding" the page, |
396 | * ie with increased "page->count" so that the page won't | 491 | * ie with increased "page->count" so that the page won't |
397 | * go away during the wait.. | 492 | * go away during the wait.. |
398 | */ | 493 | */ |
399 | static inline void wait_on_page_locked(struct page *page) | 494 | static inline void wait_on_page_locked(struct page *page) |
400 | { | 495 | { |
401 | if (PageLocked(page)) | 496 | if (PageLocked(page)) |
402 | wait_on_page_bit(page, PG_locked); | 497 | wait_on_page_bit(page, PG_locked); |
403 | } | 498 | } |
404 | 499 | ||
405 | /* | 500 | /* |
406 | * Wait for a page to complete writeback | 501 | * Wait for a page to complete writeback |
407 | */ | 502 | */ |
408 | static inline void wait_on_page_writeback(struct page *page) | 503 | static inline void wait_on_page_writeback(struct page *page) |
409 | { | 504 | { |
410 | if (PageWriteback(page)) | 505 | if (PageWriteback(page)) |
411 | wait_on_page_bit(page, PG_writeback); | 506 | wait_on_page_bit(page, PG_writeback); |
412 | } | 507 | } |
413 | 508 | ||
414 | extern void end_page_writeback(struct page *page); | 509 | extern void end_page_writeback(struct page *page); |
415 | void wait_for_stable_page(struct page *page); | 510 | void wait_for_stable_page(struct page *page); |
416 | 511 | ||
417 | /* | 512 | /* |
418 | * Add an arbitrary waiter to a page's wait queue | 513 | * Add an arbitrary waiter to a page's wait queue |
419 | */ | 514 | */ |
420 | extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter); | 515 | extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter); |
421 | 516 | ||
422 | /* | 517 | /* |
423 | * Fault a userspace page into pagetables. Return non-zero on a fault. | 518 | * Fault a userspace page into pagetables. Return non-zero on a fault. |
424 | * | 519 | * |
425 | * This assumes that two userspace pages are always sufficient. That's | 520 | * This assumes that two userspace pages are always sufficient. That's |
426 | * not true if PAGE_CACHE_SIZE > PAGE_SIZE. | 521 | * not true if PAGE_CACHE_SIZE > PAGE_SIZE. |
427 | */ | 522 | */ |
428 | static inline int fault_in_pages_writeable(char __user *uaddr, int size) | 523 | static inline int fault_in_pages_writeable(char __user *uaddr, int size) |
429 | { | 524 | { |
430 | int ret; | 525 | int ret; |
431 | 526 | ||
432 | if (unlikely(size == 0)) | 527 | if (unlikely(size == 0)) |
433 | return 0; | 528 | return 0; |
434 | 529 | ||
435 | /* | 530 | /* |
436 | * Writing zeroes into userspace here is OK, because we know that if | 531 | * Writing zeroes into userspace here is OK, because we know that if |
437 | * the zero gets there, we'll be overwriting it. | 532 | * the zero gets there, we'll be overwriting it. |
438 | */ | 533 | */ |
439 | ret = __put_user(0, uaddr); | 534 | ret = __put_user(0, uaddr); |
440 | if (ret == 0) { | 535 | if (ret == 0) { |
441 | char __user *end = uaddr + size - 1; | 536 | char __user *end = uaddr + size - 1; |
442 | 537 | ||
443 | /* | 538 | /* |
444 | * If the page was already mapped, this will get a cache miss | 539 | * If the page was already mapped, this will get a cache miss |
445 | * for sure, so try to avoid doing it. | 540 | * for sure, so try to avoid doing it. |
446 | */ | 541 | */ |
447 | if (((unsigned long)uaddr & PAGE_MASK) != | 542 | if (((unsigned long)uaddr & PAGE_MASK) != |
448 | ((unsigned long)end & PAGE_MASK)) | 543 | ((unsigned long)end & PAGE_MASK)) |
449 | ret = __put_user(0, end); | 544 | ret = __put_user(0, end); |
450 | } | 545 | } |
451 | return ret; | 546 | return ret; |
452 | } | 547 | } |
453 | 548 | ||
454 | static inline int fault_in_pages_readable(const char __user *uaddr, int size) | 549 | static inline int fault_in_pages_readable(const char __user *uaddr, int size) |
455 | { | 550 | { |
456 | volatile char c; | 551 | volatile char c; |
457 | int ret; | 552 | int ret; |
458 | 553 | ||
459 | if (unlikely(size == 0)) | 554 | if (unlikely(size == 0)) |
460 | return 0; | 555 | return 0; |
461 | 556 | ||
462 | ret = __get_user(c, uaddr); | 557 | ret = __get_user(c, uaddr); |
463 | if (ret == 0) { | 558 | if (ret == 0) { |
464 | const char __user *end = uaddr + size - 1; | 559 | const char __user *end = uaddr + size - 1; |
465 | 560 | ||
466 | if (((unsigned long)uaddr & PAGE_MASK) != | 561 | if (((unsigned long)uaddr & PAGE_MASK) != |
467 | ((unsigned long)end & PAGE_MASK)) { | 562 | ((unsigned long)end & PAGE_MASK)) { |
468 | ret = __get_user(c, end); | 563 | ret = __get_user(c, end); |
469 | (void)c; | 564 | (void)c; |
470 | } | 565 | } |
471 | } | 566 | } |
472 | return ret; | 567 | return ret; |
473 | } | 568 | } |
474 | 569 | ||
475 | /* | 570 | /* |
476 | * Multipage variants of the above prefault helpers, useful if more than | 571 | * Multipage variants of the above prefault helpers, useful if more than |
477 | * PAGE_SIZE of data needs to be prefaulted. These are separate from the above | 572 | * PAGE_SIZE of data needs to be prefaulted. These are separate from the above |
478 | * functions (which only handle up to PAGE_SIZE) to avoid clobbering the | 573 | * functions (which only handle up to PAGE_SIZE) to avoid clobbering the |
479 | * filemap.c hotpaths. | 574 | * filemap.c hotpaths. |
480 | */ | 575 | */ |
481 | static inline int fault_in_multipages_writeable(char __user *uaddr, int size) | 576 | static inline int fault_in_multipages_writeable(char __user *uaddr, int size) |
482 | { | 577 | { |
483 | int ret = 0; | 578 | int ret = 0; |
484 | char __user *end = uaddr + size - 1; | 579 | char __user *end = uaddr + size - 1; |
485 | 580 | ||
486 | if (unlikely(size == 0)) | 581 | if (unlikely(size == 0)) |
487 | return ret; | 582 | return ret; |
488 | 583 | ||
489 | /* | 584 | /* |
490 | * Writing zeroes into userspace here is OK, because we know that if | 585 | * Writing zeroes into userspace here is OK, because we know that if |
491 | * the zero gets there, we'll be overwriting it. | 586 | * the zero gets there, we'll be overwriting it. |
492 | */ | 587 | */ |
493 | while (uaddr <= end) { | 588 | while (uaddr <= end) { |
494 | ret = __put_user(0, uaddr); | 589 | ret = __put_user(0, uaddr); |
495 | if (ret != 0) | 590 | if (ret != 0) |
496 | return ret; | 591 | return ret; |
497 | uaddr += PAGE_SIZE; | 592 | uaddr += PAGE_SIZE; |
498 | } | 593 | } |
499 | 594 | ||
500 | /* Check whether the range spilled into the next page. */ | 595 | /* Check whether the range spilled into the next page. */ |
501 | if (((unsigned long)uaddr & PAGE_MASK) == | 596 | if (((unsigned long)uaddr & PAGE_MASK) == |
502 | ((unsigned long)end & PAGE_MASK)) | 597 | ((unsigned long)end & PAGE_MASK)) |
503 | ret = __put_user(0, end); | 598 | ret = __put_user(0, end); |
504 | 599 | ||
505 | return ret; | 600 | return ret; |
506 | } | 601 | } |
507 | 602 | ||
508 | static inline int fault_in_multipages_readable(const char __user *uaddr, | 603 | static inline int fault_in_multipages_readable(const char __user *uaddr, |
509 | int size) | 604 | int size) |
510 | { | 605 | { |
511 | volatile char c; | 606 | volatile char c; |
512 | int ret = 0; | 607 | int ret = 0; |
513 | const char __user *end = uaddr + size - 1; | 608 | const char __user *end = uaddr + size - 1; |
514 | 609 | ||
515 | if (unlikely(size == 0)) | 610 | if (unlikely(size == 0)) |
516 | return ret; | 611 | return ret; |
517 | 612 | ||
518 | while (uaddr <= end) { | 613 | while (uaddr <= end) { |
519 | ret = __get_user(c, uaddr); | 614 | ret = __get_user(c, uaddr); |
520 | if (ret != 0) | 615 | if (ret != 0) |
521 | return ret; | 616 | return ret; |
522 | uaddr += PAGE_SIZE; | 617 | uaddr += PAGE_SIZE; |
523 | } | 618 | } |
524 | 619 | ||
525 | /* Check whether the range spilled into the next page. */ | 620 | /* Check whether the range spilled into the next page. */ |
526 | if (((unsigned long)uaddr & PAGE_MASK) == | 621 | if (((unsigned long)uaddr & PAGE_MASK) == |
527 | ((unsigned long)end & PAGE_MASK)) { | 622 | ((unsigned long)end & PAGE_MASK)) { |
528 | ret = __get_user(c, end); | 623 | ret = __get_user(c, end); |
529 | (void)c; | 624 | (void)c; |
530 | } | 625 | } |
531 | 626 | ||
532 | return ret; | 627 | return ret; |
533 | } | 628 | } |
534 | 629 | ||
535 | int add_to_page_cache_locked(struct page *page, struct address_space *mapping, | 630 | int add_to_page_cache_locked(struct page *page, struct address_space *mapping, |
536 | pgoff_t index, gfp_t gfp_mask); | 631 | pgoff_t index, gfp_t gfp_mask); |
537 | int add_to_page_cache_lru(struct page *page, struct address_space *mapping, | 632 | int add_to_page_cache_lru(struct page *page, struct address_space *mapping, |
538 | pgoff_t index, gfp_t gfp_mask); | 633 | pgoff_t index, gfp_t gfp_mask); |
539 | extern void delete_from_page_cache(struct page *page); | 634 | extern void delete_from_page_cache(struct page *page); |
540 | extern void __delete_from_page_cache(struct page *page); | 635 | extern void __delete_from_page_cache(struct page *page); |
541 | int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask); | 636 | int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask); |
542 | 637 | ||
543 | /* | 638 | /* |
544 | * Like add_to_page_cache_locked, but used to add newly allocated pages: | 639 | * Like add_to_page_cache_locked, but used to add newly allocated pages: |
545 | * the page is new, so we can just run __set_page_locked() against it. | 640 | * the page is new, so we can just run __set_page_locked() against it. |
546 | */ | 641 | */ |
547 | static inline int add_to_page_cache(struct page *page, | 642 | static inline int add_to_page_cache(struct page *page, |
548 | struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask) | 643 | struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask) |
549 | { | 644 | { |
550 | int error; | 645 | int error; |
551 | 646 | ||
552 | __set_page_locked(page); | 647 | __set_page_locked(page); |
553 | error = add_to_page_cache_locked(page, mapping, offset, gfp_mask); | 648 | error = add_to_page_cache_locked(page, mapping, offset, gfp_mask); |
554 | if (unlikely(error)) | 649 | if (unlikely(error)) |
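The pagemap.h hunks above drop the out-of-line extern for grab_cache_page_nowait(); the helper itself survives as a thin wrapper around the new core lookup, so only the declaration moves. What the reworked allocation path buys is the chance to initialise the accessed state of a freshly allocated page before it becomes visible. Below is a minimal sketch of that ordering, using helpers declared in this header (add_to_page_cache_lru(), page_cache_release(), __page_cache_alloc()) plus the new init_page_accessed(); the function name and error handling are illustrative only and are not taken from the patch:

	/*
	 * Illustrative sketch only -- not the upstream pagecache_get_page()
	 * implementation. Allocate a page, record the "first access"
	 * non-atomically while the page is still private to us, then
	 * insert it into the page cache and LRU.
	 */
	static struct page *sketch_alloc_and_add(struct address_space *mapping,
						 pgoff_t index, gfp_t gfp)
	{
		struct page *page = __page_cache_alloc(gfp);

		if (!page)
			return NULL;

		/* Not yet visible to other CPUs: non-atomic flag update is safe. */
		init_page_accessed(page);

		if (add_to_page_cache_lru(page, mapping, index, gfp)) {
			page_cache_release(page);
			return NULL;
		}

		/* Callers no longer need a separate mark_page_accessed(). */
		return page;
	}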
include/linux/swap.h
1 | #ifndef _LINUX_SWAP_H | 1 | #ifndef _LINUX_SWAP_H |
2 | #define _LINUX_SWAP_H | 2 | #define _LINUX_SWAP_H |
3 | 3 | ||
4 | #include <linux/spinlock.h> | 4 | #include <linux/spinlock.h> |
5 | #include <linux/linkage.h> | 5 | #include <linux/linkage.h> |
6 | #include <linux/mmzone.h> | 6 | #include <linux/mmzone.h> |
7 | #include <linux/list.h> | 7 | #include <linux/list.h> |
8 | #include <linux/memcontrol.h> | 8 | #include <linux/memcontrol.h> |
9 | #include <linux/sched.h> | 9 | #include <linux/sched.h> |
10 | #include <linux/node.h> | 10 | #include <linux/node.h> |
11 | #include <linux/fs.h> | 11 | #include <linux/fs.h> |
12 | #include <linux/atomic.h> | 12 | #include <linux/atomic.h> |
13 | #include <linux/page-flags.h> | 13 | #include <linux/page-flags.h> |
14 | #include <asm/page.h> | 14 | #include <asm/page.h> |
15 | 15 | ||
16 | struct notifier_block; | 16 | struct notifier_block; |
17 | 17 | ||
18 | struct bio; | 18 | struct bio; |
19 | 19 | ||
20 | #define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */ | 20 | #define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */ |
21 | #define SWAP_FLAG_PRIO_MASK 0x7fff | 21 | #define SWAP_FLAG_PRIO_MASK 0x7fff |
22 | #define SWAP_FLAG_PRIO_SHIFT 0 | 22 | #define SWAP_FLAG_PRIO_SHIFT 0 |
23 | #define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap */ | 23 | #define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap */ |
24 | #define SWAP_FLAG_DISCARD_ONCE 0x20000 /* discard swap area at swapon-time */ | 24 | #define SWAP_FLAG_DISCARD_ONCE 0x20000 /* discard swap area at swapon-time */ |
25 | #define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */ | 25 | #define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */ |
26 | 26 | ||
27 | #define SWAP_FLAGS_VALID (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \ | 27 | #define SWAP_FLAGS_VALID (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \ |
28 | SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \ | 28 | SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \ |
29 | SWAP_FLAG_DISCARD_PAGES) | 29 | SWAP_FLAG_DISCARD_PAGES) |
30 | 30 | ||
31 | static inline int current_is_kswapd(void) | 31 | static inline int current_is_kswapd(void) |
32 | { | 32 | { |
33 | return current->flags & PF_KSWAPD; | 33 | return current->flags & PF_KSWAPD; |
34 | } | 34 | } |
35 | 35 | ||
36 | /* | 36 | /* |
37 | * MAX_SWAPFILES defines the maximum number of swaptypes: things which can | 37 | * MAX_SWAPFILES defines the maximum number of swaptypes: things which can |
38 | * be swapped to. The swap type and the offset into that swap type are | 38 | * be swapped to. The swap type and the offset into that swap type are |
39 | * encoded into pte's and into pgoff_t's in the swapcache. Using five bits | 39 | * encoded into pte's and into pgoff_t's in the swapcache. Using five bits |
40 | * for the type means that the maximum number of swapcache pages is 27 bits | 40 | * for the type means that the maximum number of swapcache pages is 27 bits |
41 | * on 32-bit-pgoff_t architectures. And that assumes that the architecture packs | 41 | * on 32-bit-pgoff_t architectures. And that assumes that the architecture packs |
42 | * the type/offset into the pte as 5/27 as well. | 42 | * the type/offset into the pte as 5/27 as well. |
43 | */ | 43 | */ |
44 | #define MAX_SWAPFILES_SHIFT 5 | 44 | #define MAX_SWAPFILES_SHIFT 5 |
45 | 45 | ||
46 | /* | 46 | /* |
47 | * Use some of the swap files numbers for other purposes. This | 47 | * Use some of the swap files numbers for other purposes. This |
48 | * is a convenient way to hook into the VM to trigger special | 48 | * is a convenient way to hook into the VM to trigger special |
49 | * actions on faults. | 49 | * actions on faults. |
50 | */ | 50 | */ |
51 | 51 | ||
52 | /* | 52 | /* |
53 | * NUMA node memory migration support | 53 | * NUMA node memory migration support |
54 | */ | 54 | */ |
55 | #ifdef CONFIG_MIGRATION | 55 | #ifdef CONFIG_MIGRATION |
56 | #define SWP_MIGRATION_NUM 2 | 56 | #define SWP_MIGRATION_NUM 2 |
57 | #define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) | 57 | #define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) |
58 | #define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) | 58 | #define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) |
59 | #else | 59 | #else |
60 | #define SWP_MIGRATION_NUM 0 | 60 | #define SWP_MIGRATION_NUM 0 |
61 | #endif | 61 | #endif |
62 | 62 | ||
63 | /* | 63 | /* |
64 | * Handling of hardware poisoned pages with memory corruption. | 64 | * Handling of hardware poisoned pages with memory corruption. |
65 | */ | 65 | */ |
66 | #ifdef CONFIG_MEMORY_FAILURE | 66 | #ifdef CONFIG_MEMORY_FAILURE |
67 | #define SWP_HWPOISON_NUM 1 | 67 | #define SWP_HWPOISON_NUM 1 |
68 | #define SWP_HWPOISON MAX_SWAPFILES | 68 | #define SWP_HWPOISON MAX_SWAPFILES |
69 | #else | 69 | #else |
70 | #define SWP_HWPOISON_NUM 0 | 70 | #define SWP_HWPOISON_NUM 0 |
71 | #endif | 71 | #endif |
72 | 72 | ||
73 | #define MAX_SWAPFILES \ | 73 | #define MAX_SWAPFILES \ |
74 | ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM) | 74 | ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM) |
75 | 75 | ||
76 | /* | 76 | /* |
77 | * Magic header for a swap area. The first part of the union is | 77 | * Magic header for a swap area. The first part of the union is |
78 | * what the swap magic looks like for the old (limited to 128MB) | 78 | * what the swap magic looks like for the old (limited to 128MB) |
79 | * swap area format, the second part of the union adds - in the | 79 | * swap area format, the second part of the union adds - in the |
80 | * old reserved area - some extra information. Note that the first | 80 | * old reserved area - some extra information. Note that the first |
81 | * kilobyte is reserved for boot loader or disk label stuff... | 81 | * kilobyte is reserved for boot loader or disk label stuff... |
82 | * | 82 | * |
83 | * Having the magic at the end of the PAGE_SIZE makes detecting swap | 83 | * Having the magic at the end of the PAGE_SIZE makes detecting swap |
84 | * areas somewhat tricky on machines that support multiple page sizes. | 84 | * areas somewhat tricky on machines that support multiple page sizes. |
85 | * For 2.5 we'll probably want to move the magic to just beyond the | 85 | * For 2.5 we'll probably want to move the magic to just beyond the |
86 | * bootbits... | 86 | * bootbits... |
87 | */ | 87 | */ |
88 | union swap_header { | 88 | union swap_header { |
89 | struct { | 89 | struct { |
90 | char reserved[PAGE_SIZE - 10]; | 90 | char reserved[PAGE_SIZE - 10]; |
91 | char magic[10]; /* SWAP-SPACE or SWAPSPACE2 */ | 91 | char magic[10]; /* SWAP-SPACE or SWAPSPACE2 */ |
92 | } magic; | 92 | } magic; |
93 | struct { | 93 | struct { |
94 | char bootbits[1024]; /* Space for disklabel etc. */ | 94 | char bootbits[1024]; /* Space for disklabel etc. */ |
95 | __u32 version; | 95 | __u32 version; |
96 | __u32 last_page; | 96 | __u32 last_page; |
97 | __u32 nr_badpages; | 97 | __u32 nr_badpages; |
98 | unsigned char sws_uuid[16]; | 98 | unsigned char sws_uuid[16]; |
99 | unsigned char sws_volume[16]; | 99 | unsigned char sws_volume[16]; |
100 | __u32 padding[117]; | 100 | __u32 padding[117]; |
101 | __u32 badpages[1]; | 101 | __u32 badpages[1]; |
102 | } info; | 102 | } info; |
103 | }; | 103 | }; |
104 | 104 | ||
105 | /* A swap entry has to fit into a "unsigned long", as | 105 | /* A swap entry has to fit into a "unsigned long", as |
106 | * the entry is hidden in the "index" field of the | 106 | * the entry is hidden in the "index" field of the |
107 | * swapper address space. | 107 | * swapper address space. |
108 | */ | 108 | */ |
109 | typedef struct { | 109 | typedef struct { |
110 | unsigned long val; | 110 | unsigned long val; |
111 | } swp_entry_t; | 111 | } swp_entry_t; |
112 | 112 | ||
113 | /* | 113 | /* |
114 | * current->reclaim_state points to one of these when a task is running | 114 | * current->reclaim_state points to one of these when a task is running |
115 | * memory reclaim | 115 | * memory reclaim |
116 | */ | 116 | */ |
117 | struct reclaim_state { | 117 | struct reclaim_state { |
118 | unsigned long reclaimed_slab; | 118 | unsigned long reclaimed_slab; |
119 | }; | 119 | }; |
120 | 120 | ||
121 | #ifdef __KERNEL__ | 121 | #ifdef __KERNEL__ |
122 | 122 | ||
123 | struct address_space; | 123 | struct address_space; |
124 | struct sysinfo; | 124 | struct sysinfo; |
125 | struct writeback_control; | 125 | struct writeback_control; |
126 | struct zone; | 126 | struct zone; |
127 | 127 | ||
128 | /* | 128 | /* |
129 | * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of | 129 | * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of |
130 | * disk blocks. A list of swap extents maps the entire swapfile. (Where the | 130 | * disk blocks. A list of swap extents maps the entire swapfile. (Where the |
131 | * term `swapfile' refers to either a blockdevice or an IS_REG file. Apart | 131 | * term `swapfile' refers to either a blockdevice or an IS_REG file. Apart |
132 | * from setup, they're handled identically. | 132 | * from setup, they're handled identically. |
133 | * | 133 | * |
134 | * We always assume that blocks are of size PAGE_SIZE. | 134 | * We always assume that blocks are of size PAGE_SIZE. |
135 | */ | 135 | */ |
136 | struct swap_extent { | 136 | struct swap_extent { |
137 | struct list_head list; | 137 | struct list_head list; |
138 | pgoff_t start_page; | 138 | pgoff_t start_page; |
139 | pgoff_t nr_pages; | 139 | pgoff_t nr_pages; |
140 | sector_t start_block; | 140 | sector_t start_block; |
141 | }; | 141 | }; |
142 | 142 | ||
143 | /* | 143 | /* |
144 | * Max bad pages in the new format.. | 144 | * Max bad pages in the new format.. |
145 | */ | 145 | */ |
146 | #define __swapoffset(x) ((unsigned long)&((union swap_header *)0)->x) | 146 | #define __swapoffset(x) ((unsigned long)&((union swap_header *)0)->x) |
147 | #define MAX_SWAP_BADPAGES \ | 147 | #define MAX_SWAP_BADPAGES \ |
148 | ((__swapoffset(magic.magic) - __swapoffset(info.badpages)) / sizeof(int)) | 148 | ((__swapoffset(magic.magic) - __swapoffset(info.badpages)) / sizeof(int)) |
149 | 149 | ||
150 | enum { | 150 | enum { |
151 | SWP_USED = (1 << 0), /* is slot in swap_info[] used? */ | 151 | SWP_USED = (1 << 0), /* is slot in swap_info[] used? */ |
152 | SWP_WRITEOK = (1 << 1), /* ok to write to this swap? */ | 152 | SWP_WRITEOK = (1 << 1), /* ok to write to this swap? */ |
153 | SWP_DISCARDABLE = (1 << 2), /* blkdev support discard */ | 153 | SWP_DISCARDABLE = (1 << 2), /* blkdev support discard */ |
154 | SWP_DISCARDING = (1 << 3), /* now discarding a free cluster */ | 154 | SWP_DISCARDING = (1 << 3), /* now discarding a free cluster */ |
155 | SWP_SOLIDSTATE = (1 << 4), /* blkdev seeks are cheap */ | 155 | SWP_SOLIDSTATE = (1 << 4), /* blkdev seeks are cheap */ |
156 | SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */ | 156 | SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */ |
157 | SWP_BLKDEV = (1 << 6), /* its a block device */ | 157 | SWP_BLKDEV = (1 << 6), /* its a block device */ |
158 | SWP_FILE = (1 << 7), /* set after swap_activate success */ | 158 | SWP_FILE = (1 << 7), /* set after swap_activate success */ |
159 | SWP_AREA_DISCARD = (1 << 8), /* single-time swap area discards */ | 159 | SWP_AREA_DISCARD = (1 << 8), /* single-time swap area discards */ |
160 | SWP_PAGE_DISCARD = (1 << 9), /* freed swap page-cluster discards */ | 160 | SWP_PAGE_DISCARD = (1 << 9), /* freed swap page-cluster discards */ |
161 | /* add others here before... */ | 161 | /* add others here before... */ |
162 | SWP_SCANNING = (1 << 10), /* refcount in scan_swap_map */ | 162 | SWP_SCANNING = (1 << 10), /* refcount in scan_swap_map */ |
163 | }; | 163 | }; |
164 | 164 | ||
165 | #define SWAP_CLUSTER_MAX 32UL | 165 | #define SWAP_CLUSTER_MAX 32UL |
166 | #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX | 166 | #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX |
167 | 167 | ||
168 | /* | 168 | /* |
169 | * Ratio between the present memory in the zone and the "gap" that | 169 | * Ratio between the present memory in the zone and the "gap" that |
170 | * we're allowing kswapd to shrink in addition to the per-zone high | 170 | * we're allowing kswapd to shrink in addition to the per-zone high |
171 | * wmark, even for zones that already have the high wmark satisfied, | 171 | * wmark, even for zones that already have the high wmark satisfied, |
172 | * in order to provide better per-zone lru behavior. We are ok to | 172 | * in order to provide better per-zone lru behavior. We are ok to |
173 | * spend not more than 1% of the memory for this zone balancing "gap". | 173 | * spend not more than 1% of the memory for this zone balancing "gap". |
174 | */ | 174 | */ |
175 | #define KSWAPD_ZONE_BALANCE_GAP_RATIO 100 | 175 | #define KSWAPD_ZONE_BALANCE_GAP_RATIO 100 |
176 | 176 | ||
177 | #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */ | 177 | #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */ |
178 | #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */ | 178 | #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */ |
179 | #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ | 179 | #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ |
180 | #define SWAP_CONT_MAX 0x7f /* Max count, in each swap_map continuation */ | 180 | #define SWAP_CONT_MAX 0x7f /* Max count, in each swap_map continuation */ |
181 | #define COUNT_CONTINUED 0x80 /* See swap_map continuation for full count */ | 181 | #define COUNT_CONTINUED 0x80 /* See swap_map continuation for full count */ |
182 | #define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs, in first swap_map */ | 182 | #define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs, in first swap_map */ |
183 | 183 | ||
184 | /* | 184 | /* |
185 | * We use this to track usage of a cluster. A cluster is a block of swap disk | 185 | * We use this to track usage of a cluster. A cluster is a block of swap disk |
186 | * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All | 186 | * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All |
187 | * free clusters are organized into a list. We fetch an entry from the list to | 187 | * free clusters are organized into a list. We fetch an entry from the list to |
188 | * get a free cluster. | 188 | * get a free cluster. |
189 | * | 189 | * |
190 | * The data field stores next cluster if the cluster is free or cluster usage | 190 | * The data field stores next cluster if the cluster is free or cluster usage |
191 | * counter otherwise. The flags field determines if a cluster is free. This is | 191 | * counter otherwise. The flags field determines if a cluster is free. This is |
192 | * protected by swap_info_struct.lock. | 192 | * protected by swap_info_struct.lock. |
193 | */ | 193 | */ |
194 | struct swap_cluster_info { | 194 | struct swap_cluster_info { |
195 | unsigned int data:24; | 195 | unsigned int data:24; |
196 | unsigned int flags:8; | 196 | unsigned int flags:8; |
197 | }; | 197 | }; |
198 | #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ | 198 | #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ |
199 | #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ | 199 | #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ |
200 | 200 | ||
201 | /* | 201 | /* |
202 | * We assign a cluster to each CPU, so each CPU can allocate swap entry from | 202 | * We assign a cluster to each CPU, so each CPU can allocate swap entry from |
203 | * its own cluster and swapout sequentially. The purpose is to optimize swapout | 203 | * its own cluster and swapout sequentially. The purpose is to optimize swapout |
204 | * throughput. | 204 | * throughput. |
205 | */ | 205 | */ |
206 | struct percpu_cluster { | 206 | struct percpu_cluster { |
207 | struct swap_cluster_info index; /* Current cluster index */ | 207 | struct swap_cluster_info index; /* Current cluster index */ |
208 | unsigned int next; /* Likely next allocation offset */ | 208 | unsigned int next; /* Likely next allocation offset */ |
209 | }; | 209 | }; |
210 | 210 | ||
211 | /* | 211 | /* |
212 | * The in-memory structure used to track swap areas. | 212 | * The in-memory structure used to track swap areas. |
213 | */ | 213 | */ |
214 | struct swap_info_struct { | 214 | struct swap_info_struct { |
215 | unsigned long flags; /* SWP_USED etc: see above */ | 215 | unsigned long flags; /* SWP_USED etc: see above */ |
216 | signed short prio; /* swap priority of this type */ | 216 | signed short prio; /* swap priority of this type */ |
217 | struct plist_node list; /* entry in swap_active_head */ | 217 | struct plist_node list; /* entry in swap_active_head */ |
218 | struct plist_node avail_list; /* entry in swap_avail_head */ | 218 | struct plist_node avail_list; /* entry in swap_avail_head */ |
219 | signed char type; /* strange name for an index */ | 219 | signed char type; /* strange name for an index */ |
220 | unsigned int max; /* extent of the swap_map */ | 220 | unsigned int max; /* extent of the swap_map */ |
221 | unsigned char *swap_map; /* vmalloc'ed array of usage counts */ | 221 | unsigned char *swap_map; /* vmalloc'ed array of usage counts */ |
222 | struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ | 222 | struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ |
223 | struct swap_cluster_info free_cluster_head; /* free cluster list head */ | 223 | struct swap_cluster_info free_cluster_head; /* free cluster list head */ |
224 | struct swap_cluster_info free_cluster_tail; /* free cluster list tail */ | 224 | struct swap_cluster_info free_cluster_tail; /* free cluster list tail */ |
225 | unsigned int lowest_bit; /* index of first free in swap_map */ | 225 | unsigned int lowest_bit; /* index of first free in swap_map */ |
226 | unsigned int highest_bit; /* index of last free in swap_map */ | 226 | unsigned int highest_bit; /* index of last free in swap_map */ |
227 | unsigned int pages; /* total of usable pages of swap */ | 227 | unsigned int pages; /* total of usable pages of swap */ |
228 | unsigned int inuse_pages; /* number of those currently in use */ | 228 | unsigned int inuse_pages; /* number of those currently in use */ |
229 | unsigned int cluster_next; /* likely index for next allocation */ | 229 | unsigned int cluster_next; /* likely index for next allocation */ |
230 | unsigned int cluster_nr; /* countdown to next cluster search */ | 230 | unsigned int cluster_nr; /* countdown to next cluster search */ |
231 | struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */ | 231 | struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */ |
232 | struct swap_extent *curr_swap_extent; | 232 | struct swap_extent *curr_swap_extent; |
233 | struct swap_extent first_swap_extent; | 233 | struct swap_extent first_swap_extent; |
234 | struct block_device *bdev; /* swap device or bdev of swap file */ | 234 | struct block_device *bdev; /* swap device or bdev of swap file */ |
235 | struct file *swap_file; /* seldom referenced */ | 235 | struct file *swap_file; /* seldom referenced */ |
236 | unsigned int old_block_size; /* seldom referenced */ | 236 | unsigned int old_block_size; /* seldom referenced */ |
237 | #ifdef CONFIG_FRONTSWAP | 237 | #ifdef CONFIG_FRONTSWAP |
238 | unsigned long *frontswap_map; /* frontswap in-use, one bit per page */ | 238 | unsigned long *frontswap_map; /* frontswap in-use, one bit per page */ |
239 | atomic_t frontswap_pages; /* frontswap pages in-use counter */ | 239 | atomic_t frontswap_pages; /* frontswap pages in-use counter */ |
240 | #endif | 240 | #endif |
241 | spinlock_t lock; /* | 241 | spinlock_t lock; /* |
242 | * protect map scan related fields like | 242 | * protect map scan related fields like |
243 | * swap_map, lowest_bit, highest_bit, | 243 | * swap_map, lowest_bit, highest_bit, |
244 | * inuse_pages, cluster_next, | 244 | * inuse_pages, cluster_next, |
245 | * cluster_nr, lowest_alloc, | 245 | * cluster_nr, lowest_alloc, |
246 | * highest_alloc, free/discard cluster | 246 | * highest_alloc, free/discard cluster |
247 | * list. other fields are only changed | 247 | * list. other fields are only changed |
248 | * at swapon/swapoff, so are protected | 248 | * at swapon/swapoff, so are protected |
249 | * by swap_lock. changing flags need | 249 | * by swap_lock. changing flags need |
250 | * hold this lock and swap_lock. If | 250 | * hold this lock and swap_lock. If |
251 | * both locks need hold, hold swap_lock | 251 | * both locks need hold, hold swap_lock |
252 | * first. | 252 | * first. |
253 | */ | 253 | */ |
254 | struct work_struct discard_work; /* discard worker */ | 254 | struct work_struct discard_work; /* discard worker */ |
255 | struct swap_cluster_info discard_cluster_head; /* list head of discard clusters */ | 255 | struct swap_cluster_info discard_cluster_head; /* list head of discard clusters */ |
256 | struct swap_cluster_info discard_cluster_tail; /* list tail of discard clusters */ | 256 | struct swap_cluster_info discard_cluster_tail; /* list tail of discard clusters */ |
257 | }; | 257 | }; |
258 | 258 | ||
259 | /* linux/mm/page_alloc.c */ | 259 | /* linux/mm/page_alloc.c */ |
260 | extern unsigned long totalram_pages; | 260 | extern unsigned long totalram_pages; |
261 | extern unsigned long totalreserve_pages; | 261 | extern unsigned long totalreserve_pages; |
262 | extern unsigned long dirty_balance_reserve; | 262 | extern unsigned long dirty_balance_reserve; |
263 | extern unsigned long nr_free_buffer_pages(void); | 263 | extern unsigned long nr_free_buffer_pages(void); |
264 | extern unsigned long nr_free_pagecache_pages(void); | 264 | extern unsigned long nr_free_pagecache_pages(void); |
265 | 265 | ||
266 | /* Definition of global_page_state not available yet */ | 266 | /* Definition of global_page_state not available yet */ |
267 | #define nr_free_pages() global_page_state(NR_FREE_PAGES) | 267 | #define nr_free_pages() global_page_state(NR_FREE_PAGES) |
268 | 268 | ||
269 | 269 | ||
270 | /* linux/mm/swap.c */ | 270 | /* linux/mm/swap.c */ |
271 | extern void lru_cache_add(struct page *); | 271 | extern void lru_cache_add(struct page *); |
272 | extern void lru_cache_add_anon(struct page *page); | 272 | extern void lru_cache_add_anon(struct page *page); |
273 | extern void lru_cache_add_file(struct page *page); | 273 | extern void lru_cache_add_file(struct page *page); |
274 | extern void lru_add_page_tail(struct page *page, struct page *page_tail, | 274 | extern void lru_add_page_tail(struct page *page, struct page *page_tail, |
275 | struct lruvec *lruvec, struct list_head *head); | 275 | struct lruvec *lruvec, struct list_head *head); |
276 | extern void activate_page(struct page *); | 276 | extern void activate_page(struct page *); |
277 | extern void mark_page_accessed(struct page *); | 277 | extern void mark_page_accessed(struct page *); |
278 | extern void init_page_accessed(struct page *page); | ||
278 | extern void lru_add_drain(void); | 279 | extern void lru_add_drain(void); |
279 | extern void lru_add_drain_cpu(int cpu); | 280 | extern void lru_add_drain_cpu(int cpu); |
280 | extern void lru_add_drain_all(void); | 281 | extern void lru_add_drain_all(void); |
281 | extern void rotate_reclaimable_page(struct page *page); | 282 | extern void rotate_reclaimable_page(struct page *page); |
282 | extern void deactivate_page(struct page *page); | 283 | extern void deactivate_page(struct page *page); |
283 | extern void swap_setup(void); | 284 | extern void swap_setup(void); |
284 | 285 | ||
285 | extern void add_page_to_unevictable_list(struct page *page); | 286 | extern void add_page_to_unevictable_list(struct page *page); |
286 | 287 | ||
287 | /* linux/mm/vmscan.c */ | 288 | /* linux/mm/vmscan.c */ |
288 | extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, | 289 | extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, |
289 | gfp_t gfp_mask, nodemask_t *mask); | 290 | gfp_t gfp_mask, nodemask_t *mask); |
290 | extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); | 291 | extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); |
291 | extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem, | 292 | extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem, |
292 | gfp_t gfp_mask, bool noswap); | 293 | gfp_t gfp_mask, bool noswap); |
293 | extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, | 294 | extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, |
294 | gfp_t gfp_mask, bool noswap, | 295 | gfp_t gfp_mask, bool noswap, |
295 | struct zone *zone, | 296 | struct zone *zone, |
296 | unsigned long *nr_scanned); | 297 | unsigned long *nr_scanned); |
297 | extern unsigned long shrink_all_memory(unsigned long nr_pages); | 298 | extern unsigned long shrink_all_memory(unsigned long nr_pages); |
298 | extern int vm_swappiness; | 299 | extern int vm_swappiness; |
299 | extern int remove_mapping(struct address_space *mapping, struct page *page); | 300 | extern int remove_mapping(struct address_space *mapping, struct page *page); |
300 | extern unsigned long vm_total_pages; | 301 | extern unsigned long vm_total_pages; |
301 | 302 | ||
302 | #ifdef CONFIG_NUMA | 303 | #ifdef CONFIG_NUMA |
303 | extern int zone_reclaim_mode; | 304 | extern int zone_reclaim_mode; |
304 | extern int sysctl_min_unmapped_ratio; | 305 | extern int sysctl_min_unmapped_ratio; |
305 | extern int sysctl_min_slab_ratio; | 306 | extern int sysctl_min_slab_ratio; |
306 | extern int zone_reclaim(struct zone *, gfp_t, unsigned int); | 307 | extern int zone_reclaim(struct zone *, gfp_t, unsigned int); |
307 | #else | 308 | #else |
308 | #define zone_reclaim_mode 0 | 309 | #define zone_reclaim_mode 0 |
309 | static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) | 310 | static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) |
310 | { | 311 | { |
311 | return 0; | 312 | return 0; |
312 | } | 313 | } |
313 | #endif | 314 | #endif |
314 | 315 | ||
315 | extern int page_evictable(struct page *page); | 316 | extern int page_evictable(struct page *page); |
316 | extern void check_move_unevictable_pages(struct page **, int nr_pages); | 317 | extern void check_move_unevictable_pages(struct page **, int nr_pages); |
317 | 318 | ||
318 | extern unsigned long scan_unevictable_pages; | 319 | extern unsigned long scan_unevictable_pages; |
319 | extern int scan_unevictable_handler(struct ctl_table *, int, | 320 | extern int scan_unevictable_handler(struct ctl_table *, int, |
320 | void __user *, size_t *, loff_t *); | 321 | void __user *, size_t *, loff_t *); |
321 | #ifdef CONFIG_NUMA | 322 | #ifdef CONFIG_NUMA |
322 | extern int scan_unevictable_register_node(struct node *node); | 323 | extern int scan_unevictable_register_node(struct node *node); |
323 | extern void scan_unevictable_unregister_node(struct node *node); | 324 | extern void scan_unevictable_unregister_node(struct node *node); |
324 | #else | 325 | #else |
325 | static inline int scan_unevictable_register_node(struct node *node) | 326 | static inline int scan_unevictable_register_node(struct node *node) |
326 | { | 327 | { |
327 | return 0; | 328 | return 0; |
328 | } | 329 | } |
329 | static inline void scan_unevictable_unregister_node(struct node *node) | 330 | static inline void scan_unevictable_unregister_node(struct node *node) |
330 | { | 331 | { |
331 | } | 332 | } |
332 | #endif | 333 | #endif |
333 | 334 | ||
334 | extern int kswapd_run(int nid); | 335 | extern int kswapd_run(int nid); |
335 | extern void kswapd_stop(int nid); | 336 | extern void kswapd_stop(int nid); |
336 | #ifdef CONFIG_MEMCG | 337 | #ifdef CONFIG_MEMCG |
337 | extern int mem_cgroup_swappiness(struct mem_cgroup *mem); | 338 | extern int mem_cgroup_swappiness(struct mem_cgroup *mem); |
338 | #else | 339 | #else |
339 | static inline int mem_cgroup_swappiness(struct mem_cgroup *mem) | 340 | static inline int mem_cgroup_swappiness(struct mem_cgroup *mem) |
340 | { | 341 | { |
341 | return vm_swappiness; | 342 | return vm_swappiness; |
342 | } | 343 | } |
343 | #endif | 344 | #endif |
344 | #ifdef CONFIG_MEMCG_SWAP | 345 | #ifdef CONFIG_MEMCG_SWAP |
345 | extern void mem_cgroup_uncharge_swap(swp_entry_t ent); | 346 | extern void mem_cgroup_uncharge_swap(swp_entry_t ent); |
346 | #else | 347 | #else |
347 | static inline void mem_cgroup_uncharge_swap(swp_entry_t ent) | 348 | static inline void mem_cgroup_uncharge_swap(swp_entry_t ent) |
348 | { | 349 | { |
349 | } | 350 | } |
350 | #endif | 351 | #endif |
351 | #ifdef CONFIG_SWAP | 352 | #ifdef CONFIG_SWAP |
352 | /* linux/mm/page_io.c */ | 353 | /* linux/mm/page_io.c */ |
353 | extern int swap_readpage(struct page *); | 354 | extern int swap_readpage(struct page *); |
354 | extern int swap_writepage(struct page *page, struct writeback_control *wbc); | 355 | extern int swap_writepage(struct page *page, struct writeback_control *wbc); |
355 | extern void end_swap_bio_write(struct bio *bio, int err); | 356 | extern void end_swap_bio_write(struct bio *bio, int err); |
356 | extern int __swap_writepage(struct page *page, struct writeback_control *wbc, | 357 | extern int __swap_writepage(struct page *page, struct writeback_control *wbc, |
357 | void (*end_write_func)(struct bio *, int)); | 358 | void (*end_write_func)(struct bio *, int)); |
358 | extern int swap_set_page_dirty(struct page *page); | 359 | extern int swap_set_page_dirty(struct page *page); |
359 | extern void end_swap_bio_read(struct bio *bio, int err); | 360 | extern void end_swap_bio_read(struct bio *bio, int err); |
360 | 361 | ||
361 | int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, | 362 | int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, |
362 | unsigned long nr_pages, sector_t start_block); | 363 | unsigned long nr_pages, sector_t start_block); |
363 | int generic_swapfile_activate(struct swap_info_struct *, struct file *, | 364 | int generic_swapfile_activate(struct swap_info_struct *, struct file *, |
364 | sector_t *); | 365 | sector_t *); |
365 | 366 | ||
366 | /* linux/mm/swap_state.c */ | 367 | /* linux/mm/swap_state.c */ |
367 | extern struct address_space swapper_spaces[]; | 368 | extern struct address_space swapper_spaces[]; |
368 | #define swap_address_space(entry) (&swapper_spaces[swp_type(entry)]) | 369 | #define swap_address_space(entry) (&swapper_spaces[swp_type(entry)]) |
369 | extern unsigned long total_swapcache_pages(void); | 370 | extern unsigned long total_swapcache_pages(void); |
370 | extern void show_swap_cache_info(void); | 371 | extern void show_swap_cache_info(void); |
371 | extern int add_to_swap(struct page *, struct list_head *list); | 372 | extern int add_to_swap(struct page *, struct list_head *list); |
372 | extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); | 373 | extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); |
373 | extern int __add_to_swap_cache(struct page *page, swp_entry_t entry); | 374 | extern int __add_to_swap_cache(struct page *page, swp_entry_t entry); |
374 | extern void __delete_from_swap_cache(struct page *); | 375 | extern void __delete_from_swap_cache(struct page *); |
375 | extern void delete_from_swap_cache(struct page *); | 376 | extern void delete_from_swap_cache(struct page *); |
376 | extern void free_page_and_swap_cache(struct page *); | 377 | extern void free_page_and_swap_cache(struct page *); |
377 | extern void free_pages_and_swap_cache(struct page **, int); | 378 | extern void free_pages_and_swap_cache(struct page **, int); |
378 | extern struct page *lookup_swap_cache(swp_entry_t); | 379 | extern struct page *lookup_swap_cache(swp_entry_t); |
379 | extern struct page *read_swap_cache_async(swp_entry_t, gfp_t, | 380 | extern struct page *read_swap_cache_async(swp_entry_t, gfp_t, |
380 | struct vm_area_struct *vma, unsigned long addr); | 381 | struct vm_area_struct *vma, unsigned long addr); |
381 | extern struct page *swapin_readahead(swp_entry_t, gfp_t, | 382 | extern struct page *swapin_readahead(swp_entry_t, gfp_t, |
382 | struct vm_area_struct *vma, unsigned long addr); | 383 | struct vm_area_struct *vma, unsigned long addr); |
383 | 384 | ||
384 | /* linux/mm/swapfile.c */ | 385 | /* linux/mm/swapfile.c */ |
385 | extern atomic_long_t nr_swap_pages; | 386 | extern atomic_long_t nr_swap_pages; |
386 | extern long total_swap_pages; | 387 | extern long total_swap_pages; |
387 | 388 | ||
388 | /* Swap 50% full? Release swapcache more aggressively.. */ | 389 | /* Swap 50% full? Release swapcache more aggressively.. */ |
389 | static inline bool vm_swap_full(void) | 390 | static inline bool vm_swap_full(void) |
390 | { | 391 | { |
391 | return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages; | 392 | return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages; |
392 | } | 393 | } |
393 | 394 | ||
394 | static inline long get_nr_swap_pages(void) | 395 | static inline long get_nr_swap_pages(void) |
395 | { | 396 | { |
396 | return atomic_long_read(&nr_swap_pages); | 397 | return atomic_long_read(&nr_swap_pages); |
397 | } | 398 | } |
398 | 399 | ||
399 | extern void si_swapinfo(struct sysinfo *); | 400 | extern void si_swapinfo(struct sysinfo *); |
400 | extern swp_entry_t get_swap_page(void); | 401 | extern swp_entry_t get_swap_page(void); |
401 | extern swp_entry_t get_swap_page_of_type(int); | 402 | extern swp_entry_t get_swap_page_of_type(int); |
402 | extern int add_swap_count_continuation(swp_entry_t, gfp_t); | 403 | extern int add_swap_count_continuation(swp_entry_t, gfp_t); |
403 | extern void swap_shmem_alloc(swp_entry_t); | 404 | extern void swap_shmem_alloc(swp_entry_t); |
404 | extern int swap_duplicate(swp_entry_t); | 405 | extern int swap_duplicate(swp_entry_t); |
405 | extern int swapcache_prepare(swp_entry_t); | 406 | extern int swapcache_prepare(swp_entry_t); |
406 | extern void swap_free(swp_entry_t); | 407 | extern void swap_free(swp_entry_t); |
407 | extern void swapcache_free(swp_entry_t, struct page *page); | 408 | extern void swapcache_free(swp_entry_t, struct page *page); |
408 | extern int free_swap_and_cache(swp_entry_t); | 409 | extern int free_swap_and_cache(swp_entry_t); |
409 | extern int swap_type_of(dev_t, sector_t, struct block_device **); | 410 | extern int swap_type_of(dev_t, sector_t, struct block_device **); |
410 | extern unsigned int count_swap_pages(int, int); | 411 | extern unsigned int count_swap_pages(int, int); |
411 | extern sector_t map_swap_page(struct page *, struct block_device **); | 412 | extern sector_t map_swap_page(struct page *, struct block_device **); |
412 | extern sector_t swapdev_block(int, pgoff_t); | 413 | extern sector_t swapdev_block(int, pgoff_t); |
413 | extern int page_swapcount(struct page *); | 414 | extern int page_swapcount(struct page *); |
414 | extern struct swap_info_struct *page_swap_info(struct page *); | 415 | extern struct swap_info_struct *page_swap_info(struct page *); |
415 | extern int reuse_swap_page(struct page *); | 416 | extern int reuse_swap_page(struct page *); |
416 | extern int try_to_free_swap(struct page *); | 417 | extern int try_to_free_swap(struct page *); |
417 | struct backing_dev_info; | 418 | struct backing_dev_info; |
418 | 419 | ||
419 | #ifdef CONFIG_MEMCG | 420 | #ifdef CONFIG_MEMCG |
420 | extern void | 421 | extern void |
421 | mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout); | 422 | mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout); |
422 | #else | 423 | #else |
423 | static inline void | 424 | static inline void |
424 | mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout) | 425 | mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout) |
425 | { | 426 | { |
426 | } | 427 | } |
427 | #endif | 428 | #endif |
428 | 429 | ||
429 | #else /* CONFIG_SWAP */ | 430 | #else /* CONFIG_SWAP */ |
430 | 431 | ||
431 | #define swap_address_space(entry) (NULL) | 432 | #define swap_address_space(entry) (NULL) |
432 | #define get_nr_swap_pages() 0L | 433 | #define get_nr_swap_pages() 0L |
433 | #define total_swap_pages 0L | 434 | #define total_swap_pages 0L |
434 | #define total_swapcache_pages() 0UL | 435 | #define total_swapcache_pages() 0UL |
435 | #define vm_swap_full() 0 | 436 | #define vm_swap_full() 0 |
436 | 437 | ||
437 | #define si_swapinfo(val) \ | 438 | #define si_swapinfo(val) \ |
438 | do { (val)->freeswap = (val)->totalswap = 0; } while (0) | 439 | do { (val)->freeswap = (val)->totalswap = 0; } while (0) |
439 | /* only sparc can not include linux/pagemap.h in this file | 440 | /* only sparc can not include linux/pagemap.h in this file |
440 | * so leave page_cache_release and release_pages undeclared... */ | 441 | * so leave page_cache_release and release_pages undeclared... */ |
441 | #define free_page_and_swap_cache(page) \ | 442 | #define free_page_and_swap_cache(page) \ |
442 | page_cache_release(page) | 443 | page_cache_release(page) |
443 | #define free_pages_and_swap_cache(pages, nr) \ | 444 | #define free_pages_and_swap_cache(pages, nr) \ |
444 | release_pages((pages), (nr), false); | 445 | release_pages((pages), (nr), false); |
445 | 446 | ||
446 | static inline void show_swap_cache_info(void) | 447 | static inline void show_swap_cache_info(void) |
447 | { | 448 | { |
448 | } | 449 | } |
449 | 450 | ||
450 | #define free_swap_and_cache(swp) is_migration_entry(swp) | 451 | #define free_swap_and_cache(swp) is_migration_entry(swp) |
451 | #define swapcache_prepare(swp) is_migration_entry(swp) | 452 | #define swapcache_prepare(swp) is_migration_entry(swp) |
452 | 453 | ||
453 | static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) | 454 | static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) |
454 | { | 455 | { |
455 | return 0; | 456 | return 0; |
456 | } | 457 | } |
457 | 458 | ||
458 | static inline void swap_shmem_alloc(swp_entry_t swp) | 459 | static inline void swap_shmem_alloc(swp_entry_t swp) |
459 | { | 460 | { |
460 | } | 461 | } |
461 | 462 | ||
462 | static inline int swap_duplicate(swp_entry_t swp) | 463 | static inline int swap_duplicate(swp_entry_t swp) |
463 | { | 464 | { |
464 | return 0; | 465 | return 0; |
465 | } | 466 | } |
466 | 467 | ||
467 | static inline void swap_free(swp_entry_t swp) | 468 | static inline void swap_free(swp_entry_t swp) |
468 | { | 469 | { |
469 | } | 470 | } |
470 | 471 | ||
471 | static inline void swapcache_free(swp_entry_t swp, struct page *page) | 472 | static inline void swapcache_free(swp_entry_t swp, struct page *page) |
472 | { | 473 | { |
473 | } | 474 | } |
474 | 475 | ||
475 | static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask, | 476 | static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask, |
476 | struct vm_area_struct *vma, unsigned long addr) | 477 | struct vm_area_struct *vma, unsigned long addr) |
477 | { | 478 | { |
478 | return NULL; | 479 | return NULL; |
479 | } | 480 | } |
480 | 481 | ||
481 | static inline int swap_writepage(struct page *p, struct writeback_control *wbc) | 482 | static inline int swap_writepage(struct page *p, struct writeback_control *wbc) |
482 | { | 483 | { |
483 | return 0; | 484 | return 0; |
484 | } | 485 | } |
485 | 486 | ||
486 | static inline struct page *lookup_swap_cache(swp_entry_t swp) | 487 | static inline struct page *lookup_swap_cache(swp_entry_t swp) |
487 | { | 488 | { |
488 | return NULL; | 489 | return NULL; |
489 | } | 490 | } |
490 | 491 | ||
491 | static inline int add_to_swap(struct page *page, struct list_head *list) | 492 | static inline int add_to_swap(struct page *page, struct list_head *list) |
492 | { | 493 | { |
493 | return 0; | 494 | return 0; |
494 | } | 495 | } |
495 | 496 | ||
496 | static inline int add_to_swap_cache(struct page *page, swp_entry_t entry, | 497 | static inline int add_to_swap_cache(struct page *page, swp_entry_t entry, |
497 | gfp_t gfp_mask) | 498 | gfp_t gfp_mask) |
498 | { | 499 | { |
499 | return -1; | 500 | return -1; |
500 | } | 501 | } |
501 | 502 | ||
502 | static inline void __delete_from_swap_cache(struct page *page) | 503 | static inline void __delete_from_swap_cache(struct page *page) |
503 | { | 504 | { |
504 | } | 505 | } |
505 | 506 | ||
506 | static inline void delete_from_swap_cache(struct page *page) | 507 | static inline void delete_from_swap_cache(struct page *page) |
507 | { | 508 | { |
508 | } | 509 | } |
509 | 510 | ||
510 | static inline int page_swapcount(struct page *page) | 511 | static inline int page_swapcount(struct page *page) |
511 | { | 512 | { |
512 | return 0; | 513 | return 0; |
513 | } | 514 | } |
514 | 515 | ||
515 | #define reuse_swap_page(page) (page_mapcount(page) == 1) | 516 | #define reuse_swap_page(page) (page_mapcount(page) == 1) |
516 | 517 | ||
517 | static inline int try_to_free_swap(struct page *page) | 518 | static inline int try_to_free_swap(struct page *page) |
518 | { | 519 | { |
519 | return 0; | 520 | return 0; |
520 | } | 521 | } |
521 | 522 | ||
522 | static inline swp_entry_t get_swap_page(void) | 523 | static inline swp_entry_t get_swap_page(void) |
523 | { | 524 | { |
524 | swp_entry_t entry; | 525 | swp_entry_t entry; |
525 | entry.val = 0; | 526 | entry.val = 0; |
526 | return entry; | 527 | return entry; |
527 | } | 528 | } |
528 | 529 | ||
529 | static inline void | 530 | static inline void |
530 | mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent) | 531 | mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent) |
531 | { | 532 | { |
532 | } | 533 | } |
533 | 534 | ||
534 | #endif /* CONFIG_SWAP */ | 535 | #endif /* CONFIG_SWAP */ |
535 | #endif /* __KERNEL__*/ | 536 | #endif /* __KERNEL__*/ |
536 | #endif /* _LINUX_SWAP_H */ | 537 | #endif /* _LINUX_SWAP_H */ |
537 | 538 |
mm/filemap.c
1 | /* | 1 | /* |
2 | * linux/mm/filemap.c | 2 | * linux/mm/filemap.c |
3 | * | 3 | * |
4 | * Copyright (C) 1994-1999 Linus Torvalds | 4 | * Copyright (C) 1994-1999 Linus Torvalds |
5 | */ | 5 | */ |
6 | 6 | ||
7 | /* | 7 | /* |
8 | * This file handles the generic file mmap semantics used by | 8 | * This file handles the generic file mmap semantics used by |
9 | * most "normal" filesystems (but you don't /have/ to use this: | 9 | * most "normal" filesystems (but you don't /have/ to use this: |
10 | * the NFS filesystem used to do this differently, for example) | 10 | * the NFS filesystem used to do this differently, for example) |
11 | */ | 11 | */ |
12 | #include <linux/export.h> | 12 | #include <linux/export.h> |
13 | #include <linux/compiler.h> | 13 | #include <linux/compiler.h> |
14 | #include <linux/fs.h> | 14 | #include <linux/fs.h> |
15 | #include <linux/uaccess.h> | 15 | #include <linux/uaccess.h> |
16 | #include <linux/aio.h> | 16 | #include <linux/aio.h> |
17 | #include <linux/capability.h> | 17 | #include <linux/capability.h> |
18 | #include <linux/kernel_stat.h> | 18 | #include <linux/kernel_stat.h> |
19 | #include <linux/gfp.h> | 19 | #include <linux/gfp.h> |
20 | #include <linux/mm.h> | 20 | #include <linux/mm.h> |
21 | #include <linux/swap.h> | 21 | #include <linux/swap.h> |
22 | #include <linux/mman.h> | 22 | #include <linux/mman.h> |
23 | #include <linux/pagemap.h> | 23 | #include <linux/pagemap.h> |
24 | #include <linux/file.h> | 24 | #include <linux/file.h> |
25 | #include <linux/uio.h> | 25 | #include <linux/uio.h> |
26 | #include <linux/hash.h> | 26 | #include <linux/hash.h> |
27 | #include <linux/writeback.h> | 27 | #include <linux/writeback.h> |
28 | #include <linux/backing-dev.h> | 28 | #include <linux/backing-dev.h> |
29 | #include <linux/pagevec.h> | 29 | #include <linux/pagevec.h> |
30 | #include <linux/blkdev.h> | 30 | #include <linux/blkdev.h> |
31 | #include <linux/security.h> | 31 | #include <linux/security.h> |
32 | #include <linux/cpuset.h> | 32 | #include <linux/cpuset.h> |
33 | #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */ | 33 | #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */ |
34 | #include <linux/memcontrol.h> | 34 | #include <linux/memcontrol.h> |
35 | #include <linux/cleancache.h> | 35 | #include <linux/cleancache.h> |
36 | #include "internal.h" | 36 | #include "internal.h" |
37 | 37 | ||
38 | #define CREATE_TRACE_POINTS | 38 | #define CREATE_TRACE_POINTS |
39 | #include <trace/events/filemap.h> | 39 | #include <trace/events/filemap.h> |
40 | 40 | ||
41 | /* | 41 | /* |
42 | * FIXME: remove all knowledge of the buffer layer from the core VM | 42 | * FIXME: remove all knowledge of the buffer layer from the core VM |
43 | */ | 43 | */ |
44 | #include <linux/buffer_head.h> /* for try_to_free_buffers */ | 44 | #include <linux/buffer_head.h> /* for try_to_free_buffers */ |
45 | 45 | ||
46 | #include <asm/mman.h> | 46 | #include <asm/mman.h> |
47 | 47 | ||
48 | /* | 48 | /* |
49 | * Shared mappings implemented 30.11.1994. It's not fully working yet, | 49 | * Shared mappings implemented 30.11.1994. It's not fully working yet, |
50 | * though. | 50 | * though. |
51 | * | 51 | * |
52 | * Shared mappings now work. 15.8.1995 Bruno. | 52 | * Shared mappings now work. 15.8.1995 Bruno. |
53 | * | 53 | * |
54 | * finished 'unifying' the page and buffer cache and SMP-threaded the | 54 | * finished 'unifying' the page and buffer cache and SMP-threaded the |
55 | * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com> | 55 | * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com> |
56 | * | 56 | * |
57 | * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de> | 57 | * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de> |
58 | */ | 58 | */ |
59 | 59 | ||
60 | /* | 60 | /* |
61 | * Lock ordering: | 61 | * Lock ordering: |
62 | * | 62 | * |
63 | * ->i_mmap_mutex (truncate_pagecache) | 63 | * ->i_mmap_mutex (truncate_pagecache) |
64 | * ->private_lock (__free_pte->__set_page_dirty_buffers) | 64 | * ->private_lock (__free_pte->__set_page_dirty_buffers) |
65 | * ->swap_lock (exclusive_swap_page, others) | 65 | * ->swap_lock (exclusive_swap_page, others) |
66 | * ->mapping->tree_lock | 66 | * ->mapping->tree_lock |
67 | * | 67 | * |
68 | * ->i_mutex | 68 | * ->i_mutex |
69 | * ->i_mmap_mutex (truncate->unmap_mapping_range) | 69 | * ->i_mmap_mutex (truncate->unmap_mapping_range) |
70 | * | 70 | * |
71 | * ->mmap_sem | 71 | * ->mmap_sem |
72 | * ->i_mmap_mutex | 72 | * ->i_mmap_mutex |
73 | * ->page_table_lock or pte_lock (various, mainly in memory.c) | 73 | * ->page_table_lock or pte_lock (various, mainly in memory.c) |
74 | * ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock) | 74 | * ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock) |
75 | * | 75 | * |
76 | * ->mmap_sem | 76 | * ->mmap_sem |
77 | * ->lock_page (access_process_vm) | 77 | * ->lock_page (access_process_vm) |
78 | * | 78 | * |
79 | * ->i_mutex (generic_file_buffered_write) | 79 | * ->i_mutex (generic_file_buffered_write) |
80 | * ->mmap_sem (fault_in_pages_readable->do_page_fault) | 80 | * ->mmap_sem (fault_in_pages_readable->do_page_fault) |
81 | * | 81 | * |
82 | * bdi->wb.list_lock | 82 | * bdi->wb.list_lock |
83 | * sb_lock (fs/fs-writeback.c) | 83 | * sb_lock (fs/fs-writeback.c) |
84 | * ->mapping->tree_lock (__sync_single_inode) | 84 | * ->mapping->tree_lock (__sync_single_inode) |
85 | * | 85 | * |
86 | * ->i_mmap_mutex | 86 | * ->i_mmap_mutex |
87 | * ->anon_vma.lock (vma_adjust) | 87 | * ->anon_vma.lock (vma_adjust) |
88 | * | 88 | * |
89 | * ->anon_vma.lock | 89 | * ->anon_vma.lock |
90 | * ->page_table_lock or pte_lock (anon_vma_prepare and various) | 90 | * ->page_table_lock or pte_lock (anon_vma_prepare and various) |
91 | * | 91 | * |
92 | * ->page_table_lock or pte_lock | 92 | * ->page_table_lock or pte_lock |
93 | * ->swap_lock (try_to_unmap_one) | 93 | * ->swap_lock (try_to_unmap_one) |
94 | * ->private_lock (try_to_unmap_one) | 94 | * ->private_lock (try_to_unmap_one) |
95 | * ->tree_lock (try_to_unmap_one) | 95 | * ->tree_lock (try_to_unmap_one) |
96 | * ->zone.lru_lock (follow_page->mark_page_accessed) | 96 | * ->zone.lru_lock (follow_page->mark_page_accessed) |
97 | * ->zone.lru_lock (check_pte_range->isolate_lru_page) | 97 | * ->zone.lru_lock (check_pte_range->isolate_lru_page) |
98 | * ->private_lock (page_remove_rmap->set_page_dirty) | 98 | * ->private_lock (page_remove_rmap->set_page_dirty) |
99 | * ->tree_lock (page_remove_rmap->set_page_dirty) | 99 | * ->tree_lock (page_remove_rmap->set_page_dirty) |
100 | * bdi.wb->list_lock (page_remove_rmap->set_page_dirty) | 100 | * bdi.wb->list_lock (page_remove_rmap->set_page_dirty) |
101 | * ->inode->i_lock (page_remove_rmap->set_page_dirty) | 101 | * ->inode->i_lock (page_remove_rmap->set_page_dirty) |
102 | * bdi.wb->list_lock (zap_pte_range->set_page_dirty) | 102 | * bdi.wb->list_lock (zap_pte_range->set_page_dirty) |
103 | * ->inode->i_lock (zap_pte_range->set_page_dirty) | 103 | * ->inode->i_lock (zap_pte_range->set_page_dirty) |
104 | * ->private_lock (zap_pte_range->__set_page_dirty_buffers) | 104 | * ->private_lock (zap_pte_range->__set_page_dirty_buffers) |
105 | * | 105 | * |
106 | * ->i_mmap_mutex | 106 | * ->i_mmap_mutex |
107 | * ->tasklist_lock (memory_failure, collect_procs_ao) | 107 | * ->tasklist_lock (memory_failure, collect_procs_ao) |
108 | */ | 108 | */ |
109 | 109 | ||
110 | /* | 110 | /* |
111 | * Delete a page from the page cache and free it. Caller has to make | 111 | * Delete a page from the page cache and free it. Caller has to make |
112 | * sure the page is locked and that nobody else uses it - or that usage | 112 | * sure the page is locked and that nobody else uses it - or that usage |
113 | * is safe. The caller must hold the mapping's tree_lock. | 113 | * is safe. The caller must hold the mapping's tree_lock. |
114 | */ | 114 | */ |
115 | void __delete_from_page_cache(struct page *page) | 115 | void __delete_from_page_cache(struct page *page) |
116 | { | 116 | { |
117 | struct address_space *mapping = page->mapping; | 117 | struct address_space *mapping = page->mapping; |
118 | 118 | ||
119 | trace_mm_filemap_delete_from_page_cache(page); | 119 | trace_mm_filemap_delete_from_page_cache(page); |
120 | /* | 120 | /* |
121 | * if we're uptodate, flush out into the cleancache, otherwise | 121 | * if we're uptodate, flush out into the cleancache, otherwise |
122 | * invalidate any existing cleancache entries. We can't leave | 122 | * invalidate any existing cleancache entries. We can't leave |
123 | * stale data around in the cleancache once our page is gone | 123 | * stale data around in the cleancache once our page is gone |
124 | */ | 124 | */ |
125 | if (PageUptodate(page) && PageMappedToDisk(page)) | 125 | if (PageUptodate(page) && PageMappedToDisk(page)) |
126 | cleancache_put_page(page); | 126 | cleancache_put_page(page); |
127 | else | 127 | else |
128 | cleancache_invalidate_page(mapping, page); | 128 | cleancache_invalidate_page(mapping, page); |
129 | 129 | ||
130 | radix_tree_delete(&mapping->page_tree, page->index); | 130 | radix_tree_delete(&mapping->page_tree, page->index); |
131 | page->mapping = NULL; | 131 | page->mapping = NULL; |
132 | /* Leave page->index set: truncation lookup relies upon it */ | 132 | /* Leave page->index set: truncation lookup relies upon it */ |
133 | mapping->nrpages--; | 133 | mapping->nrpages--; |
134 | __dec_zone_page_state(page, NR_FILE_PAGES); | 134 | __dec_zone_page_state(page, NR_FILE_PAGES); |
135 | if (PageSwapBacked(page)) | 135 | if (PageSwapBacked(page)) |
136 | __dec_zone_page_state(page, NR_SHMEM); | 136 | __dec_zone_page_state(page, NR_SHMEM); |
137 | BUG_ON(page_mapped(page)); | 137 | BUG_ON(page_mapped(page)); |
138 | 138 | ||
139 | /* | 139 | /* |
140 | * Some filesystems seem to re-dirty the page even after | 140 | * Some filesystems seem to re-dirty the page even after |
141 | * the VM has canceled the dirty bit (eg ext3 journaling). | 141 | * the VM has canceled the dirty bit (eg ext3 journaling). |
142 | * | 142 | * |
143 | * Fix it up by doing a final dirty accounting check after | 143 | * Fix it up by doing a final dirty accounting check after |
144 | * having removed the page entirely. | 144 | * having removed the page entirely. |
145 | */ | 145 | */ |
146 | if (PageDirty(page) && mapping_cap_account_dirty(mapping)) { | 146 | if (PageDirty(page) && mapping_cap_account_dirty(mapping)) { |
147 | dec_zone_page_state(page, NR_FILE_DIRTY); | 147 | dec_zone_page_state(page, NR_FILE_DIRTY); |
148 | dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); | 148 | dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); |
149 | } | 149 | } |
150 | } | 150 | } |
151 | 151 | ||
152 | /** | 152 | /** |
153 | * delete_from_page_cache - delete page from page cache | 153 | * delete_from_page_cache - delete page from page cache |
154 | * @page: the page which the kernel is trying to remove from page cache | 154 | * @page: the page which the kernel is trying to remove from page cache |
155 | * | 155 | * |
156 | * This must be called only on pages that have been verified to be in the page | 156 | * This must be called only on pages that have been verified to be in the page |
157 | * cache and locked. It will never put the page into the free list, the caller | 157 | * cache and locked. It will never put the page into the free list, the caller |
158 | * has a reference on the page. | 158 | * has a reference on the page. |
159 | */ | 159 | */ |
160 | void delete_from_page_cache(struct page *page) | 160 | void delete_from_page_cache(struct page *page) |
161 | { | 161 | { |
162 | struct address_space *mapping = page->mapping; | 162 | struct address_space *mapping = page->mapping; |
163 | void (*freepage)(struct page *); | 163 | void (*freepage)(struct page *); |
164 | 164 | ||
165 | BUG_ON(!PageLocked(page)); | 165 | BUG_ON(!PageLocked(page)); |
166 | 166 | ||
167 | freepage = mapping->a_ops->freepage; | 167 | freepage = mapping->a_ops->freepage; |
168 | spin_lock_irq(&mapping->tree_lock); | 168 | spin_lock_irq(&mapping->tree_lock); |
169 | __delete_from_page_cache(page); | 169 | __delete_from_page_cache(page); |
170 | spin_unlock_irq(&mapping->tree_lock); | 170 | spin_unlock_irq(&mapping->tree_lock); |
171 | mem_cgroup_uncharge_cache_page(page); | 171 | mem_cgroup_uncharge_cache_page(page); |
172 | 172 | ||
173 | if (freepage) | 173 | if (freepage) |
174 | freepage(page); | 174 | freepage(page); |
175 | page_cache_release(page); | 175 | page_cache_release(page); |
176 | } | 176 | } |
177 | EXPORT_SYMBOL(delete_from_page_cache); | 177 | EXPORT_SYMBOL(delete_from_page_cache); |
178 | 178 | ||
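The kernel-doc above requires that a page handed to delete_from_page_cache() is locked and known to be in the page cache, and that the caller still holds its own reference. A minimal caller sketch (not part of this patch): the helper name drop_cached_page() and its arguments are hypothetical, and the usual mm/filemap.c headers are assumed.

/* Hypothetical helper: drop one cached page if it is still attached. */
static void drop_cached_page(struct address_space *mapping, pgoff_t index)
{
	struct page *page = find_lock_page(mapping, index);	/* takes a ref and the page lock */

	if (!page)
		return;
	/* Recheck under the page lock: truncation or reclaim may have raced. */
	if (page->mapping == mapping)
		delete_from_page_cache(page);	/* drops the pagecache reference */
	unlock_page(page);
	page_cache_release(page);		/* drop find_lock_page()'s reference */
}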
179 | static int sleep_on_page(void *word) | 179 | static int sleep_on_page(void *word) |
180 | { | 180 | { |
181 | io_schedule(); | 181 | io_schedule(); |
182 | return 0; | 182 | return 0; |
183 | } | 183 | } |
184 | 184 | ||
185 | static int sleep_on_page_killable(void *word) | 185 | static int sleep_on_page_killable(void *word) |
186 | { | 186 | { |
187 | sleep_on_page(word); | 187 | sleep_on_page(word); |
188 | return fatal_signal_pending(current) ? -EINTR : 0; | 188 | return fatal_signal_pending(current) ? -EINTR : 0; |
189 | } | 189 | } |
190 | 190 | ||
191 | static int filemap_check_errors(struct address_space *mapping) | 191 | static int filemap_check_errors(struct address_space *mapping) |
192 | { | 192 | { |
193 | int ret = 0; | 193 | int ret = 0; |
194 | /* Check for outstanding write errors */ | 194 | /* Check for outstanding write errors */ |
195 | if (test_bit(AS_ENOSPC, &mapping->flags) && | 195 | if (test_bit(AS_ENOSPC, &mapping->flags) && |
196 | test_and_clear_bit(AS_ENOSPC, &mapping->flags)) | 196 | test_and_clear_bit(AS_ENOSPC, &mapping->flags)) |
197 | ret = -ENOSPC; | 197 | ret = -ENOSPC; |
198 | if (test_bit(AS_EIO, &mapping->flags) && | 198 | if (test_bit(AS_EIO, &mapping->flags) && |
199 | test_and_clear_bit(AS_EIO, &mapping->flags)) | 199 | test_and_clear_bit(AS_EIO, &mapping->flags)) |
200 | ret = -EIO; | 200 | ret = -EIO; |
201 | return ret; | 201 | return ret; |
202 | } | 202 | } |
203 | 203 | ||
204 | /** | 204 | /** |
205 | * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range | 205 | * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range |
206 | * @mapping: address space structure to write | 206 | * @mapping: address space structure to write |
207 | * @start: offset in bytes where the range starts | 207 | * @start: offset in bytes where the range starts |
208 | * @end: offset in bytes where the range ends (inclusive) | 208 | * @end: offset in bytes where the range ends (inclusive) |
209 | * @sync_mode: enable synchronous operation | 209 | * @sync_mode: enable synchronous operation |
210 | * | 210 | * |
211 | * Start writeback against all of a mapping's dirty pages that lie | 211 | * Start writeback against all of a mapping's dirty pages that lie |
212 | * within the byte offsets <start, end> inclusive. | 212 | * within the byte offsets <start, end> inclusive. |
213 | * | 213 | * |
214 | * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as | 214 | * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as |
215 | * opposed to a regular memory cleansing writeback. The difference between | 215 | * opposed to a regular memory cleansing writeback. The difference between |
216 | * these two operations is that if a dirty page/buffer is encountered, it must | 216 | * these two operations is that if a dirty page/buffer is encountered, it must |
217 | * be waited upon, and not just skipped over. | 217 | * be waited upon, and not just skipped over. |
218 | */ | 218 | */ |
219 | int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start, | 219 | int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start, |
220 | loff_t end, int sync_mode) | 220 | loff_t end, int sync_mode) |
221 | { | 221 | { |
222 | int ret; | 222 | int ret; |
223 | struct writeback_control wbc = { | 223 | struct writeback_control wbc = { |
224 | .sync_mode = sync_mode, | 224 | .sync_mode = sync_mode, |
225 | .nr_to_write = LONG_MAX, | 225 | .nr_to_write = LONG_MAX, |
226 | .range_start = start, | 226 | .range_start = start, |
227 | .range_end = end, | 227 | .range_end = end, |
228 | }; | 228 | }; |
229 | 229 | ||
230 | if (!mapping_cap_writeback_dirty(mapping)) | 230 | if (!mapping_cap_writeback_dirty(mapping)) |
231 | return 0; | 231 | return 0; |
232 | 232 | ||
233 | ret = do_writepages(mapping, &wbc); | 233 | ret = do_writepages(mapping, &wbc); |
234 | return ret; | 234 | return ret; |
235 | } | 235 | } |
236 | 236 | ||
237 | static inline int __filemap_fdatawrite(struct address_space *mapping, | 237 | static inline int __filemap_fdatawrite(struct address_space *mapping, |
238 | int sync_mode) | 238 | int sync_mode) |
239 | { | 239 | { |
240 | return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode); | 240 | return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode); |
241 | } | 241 | } |
242 | 242 | ||
243 | int filemap_fdatawrite(struct address_space *mapping) | 243 | int filemap_fdatawrite(struct address_space *mapping) |
244 | { | 244 | { |
245 | return __filemap_fdatawrite(mapping, WB_SYNC_ALL); | 245 | return __filemap_fdatawrite(mapping, WB_SYNC_ALL); |
246 | } | 246 | } |
247 | EXPORT_SYMBOL(filemap_fdatawrite); | 247 | EXPORT_SYMBOL(filemap_fdatawrite); |
248 | 248 | ||
249 | int filemap_fdatawrite_range(struct address_space *mapping, loff_t start, | 249 | int filemap_fdatawrite_range(struct address_space *mapping, loff_t start, |
250 | loff_t end) | 250 | loff_t end) |
251 | { | 251 | { |
252 | return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL); | 252 | return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL); |
253 | } | 253 | } |
254 | EXPORT_SYMBOL(filemap_fdatawrite_range); | 254 | EXPORT_SYMBOL(filemap_fdatawrite_range); |
255 | 255 | ||
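__filemap_fdatawrite_range() above separates WB_SYNC_ALL (data integrity, dirty pages must not be skipped) from WB_SYNC_NONE (best-effort cleansing). A sketch of an integrity-writeback caller, assuming the usual headers; the helper name and the pos/len arguments are hypothetical:

/* Hypothetical: start WB_SYNC_ALL writeback for the byte range [pos, pos + len - 1]. */
static int start_range_writeback(struct inode *inode, loff_t pos, size_t len)
{
	return filemap_fdatawrite_range(inode->i_mapping, pos, pos + len - 1);
}

This only starts the I/O; pairing it with filemap_fdatawait_range() (below) turns it into an integrity sync.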
256 | /** | 256 | /** |
257 | * filemap_flush - mostly a non-blocking flush | 257 | * filemap_flush - mostly a non-blocking flush |
258 | * @mapping: target address_space | 258 | * @mapping: target address_space |
259 | * | 259 | * |
260 | * This is a mostly non-blocking flush. Not suitable for data-integrity | 260 | * This is a mostly non-blocking flush. Not suitable for data-integrity |
261 | * purposes - I/O may not be started against all dirty pages. | 261 | * purposes - I/O may not be started against all dirty pages. |
262 | */ | 262 | */ |
263 | int filemap_flush(struct address_space *mapping) | 263 | int filemap_flush(struct address_space *mapping) |
264 | { | 264 | { |
265 | return __filemap_fdatawrite(mapping, WB_SYNC_NONE); | 265 | return __filemap_fdatawrite(mapping, WB_SYNC_NONE); |
266 | } | 266 | } |
267 | EXPORT_SYMBOL(filemap_flush); | 267 | EXPORT_SYMBOL(filemap_flush); |
268 | 268 | ||
269 | /** | 269 | /** |
270 | * filemap_fdatawait_range - wait for writeback to complete | 270 | * filemap_fdatawait_range - wait for writeback to complete |
271 | * @mapping: address space structure to wait for | 271 | * @mapping: address space structure to wait for |
272 | * @start_byte: offset in bytes where the range starts | 272 | * @start_byte: offset in bytes where the range starts |
273 | * @end_byte: offset in bytes where the range ends (inclusive) | 273 | * @end_byte: offset in bytes where the range ends (inclusive) |
274 | * | 274 | * |
275 | * Walk the list of under-writeback pages of the given address space | 275 | * Walk the list of under-writeback pages of the given address space |
276 | * in the given range and wait for all of them. | 276 | * in the given range and wait for all of them. |
277 | */ | 277 | */ |
278 | int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, | 278 | int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, |
279 | loff_t end_byte) | 279 | loff_t end_byte) |
280 | { | 280 | { |
281 | pgoff_t index = start_byte >> PAGE_CACHE_SHIFT; | 281 | pgoff_t index = start_byte >> PAGE_CACHE_SHIFT; |
282 | pgoff_t end = end_byte >> PAGE_CACHE_SHIFT; | 282 | pgoff_t end = end_byte >> PAGE_CACHE_SHIFT; |
283 | struct pagevec pvec; | 283 | struct pagevec pvec; |
284 | int nr_pages; | 284 | int nr_pages; |
285 | int ret2, ret = 0; | 285 | int ret2, ret = 0; |
286 | 286 | ||
287 | if (end_byte < start_byte) | 287 | if (end_byte < start_byte) |
288 | goto out; | 288 | goto out; |
289 | 289 | ||
290 | pagevec_init(&pvec, 0); | 290 | pagevec_init(&pvec, 0); |
291 | while ((index <= end) && | 291 | while ((index <= end) && |
292 | (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, | 292 | (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, |
293 | PAGECACHE_TAG_WRITEBACK, | 293 | PAGECACHE_TAG_WRITEBACK, |
294 | min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) { | 294 | min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) { |
295 | unsigned i; | 295 | unsigned i; |
296 | 296 | ||
297 | for (i = 0; i < nr_pages; i++) { | 297 | for (i = 0; i < nr_pages; i++) { |
298 | struct page *page = pvec.pages[i]; | 298 | struct page *page = pvec.pages[i]; |
299 | 299 | ||
300 | /* until radix tree lookup accepts end_index */ | 300 | /* until radix tree lookup accepts end_index */ |
301 | if (page->index > end) | 301 | if (page->index > end) |
302 | continue; | 302 | continue; |
303 | 303 | ||
304 | wait_on_page_writeback(page); | 304 | wait_on_page_writeback(page); |
305 | if (TestClearPageError(page)) | 305 | if (TestClearPageError(page)) |
306 | ret = -EIO; | 306 | ret = -EIO; |
307 | } | 307 | } |
308 | pagevec_release(&pvec); | 308 | pagevec_release(&pvec); |
309 | cond_resched(); | 309 | cond_resched(); |
310 | } | 310 | } |
311 | out: | 311 | out: |
312 | ret2 = filemap_check_errors(mapping); | 312 | ret2 = filemap_check_errors(mapping); |
313 | if (!ret) | 313 | if (!ret) |
314 | ret = ret2; | 314 | ret = ret2; |
315 | 315 | ||
316 | return ret; | 316 | return ret; |
317 | } | 317 | } |
318 | EXPORT_SYMBOL(filemap_fdatawait_range); | 318 | EXPORT_SYMBOL(filemap_fdatawait_range); |
319 | 319 | ||
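filemap_fdatawait_range() only waits for writeback that is already in flight; it does not start any. A sketch of waiting on a single page-sized window, assuming the usual headers; the helper name and the 'index' argument are hypothetical:

/* Hypothetical: wait for in-flight writeback covering one page at 'index'. */
static int wait_one_page(struct address_space *mapping, pgoff_t index)
{
	loff_t start = (loff_t)index << PAGE_CACHE_SHIFT;

	return filemap_fdatawait_range(mapping, start, start + PAGE_CACHE_SIZE - 1);
}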
320 | /** | 320 | /** |
321 | * filemap_fdatawait - wait for all under-writeback pages to complete | 321 | * filemap_fdatawait - wait for all under-writeback pages to complete |
322 | * @mapping: address space structure to wait for | 322 | * @mapping: address space structure to wait for |
323 | * | 323 | * |
324 | * Walk the list of under-writeback pages of the given address space | 324 | * Walk the list of under-writeback pages of the given address space |
325 | * and wait for all of them. | 325 | * and wait for all of them. |
326 | */ | 326 | */ |
327 | int filemap_fdatawait(struct address_space *mapping) | 327 | int filemap_fdatawait(struct address_space *mapping) |
328 | { | 328 | { |
329 | loff_t i_size = i_size_read(mapping->host); | 329 | loff_t i_size = i_size_read(mapping->host); |
330 | 330 | ||
331 | if (i_size == 0) | 331 | if (i_size == 0) |
332 | return 0; | 332 | return 0; |
333 | 333 | ||
334 | return filemap_fdatawait_range(mapping, 0, i_size - 1); | 334 | return filemap_fdatawait_range(mapping, 0, i_size - 1); |
335 | } | 335 | } |
336 | EXPORT_SYMBOL(filemap_fdatawait); | 336 | EXPORT_SYMBOL(filemap_fdatawait); |
337 | 337 | ||
338 | int filemap_write_and_wait(struct address_space *mapping) | 338 | int filemap_write_and_wait(struct address_space *mapping) |
339 | { | 339 | { |
340 | int err = 0; | 340 | int err = 0; |
341 | 341 | ||
342 | if (mapping->nrpages) { | 342 | if (mapping->nrpages) { |
343 | err = filemap_fdatawrite(mapping); | 343 | err = filemap_fdatawrite(mapping); |
344 | /* | 344 | /* |
345 | * Even if the above returned error, the pages may be | 345 | * Even if the above returned error, the pages may be |
346 | * written partially (e.g. -ENOSPC), so we wait for it. | 346 | * written partially (e.g. -ENOSPC), so we wait for it. |
347 | * But -EIO is a special case; it may indicate the worst | 347 | * But -EIO is a special case; it may indicate the worst |
348 | * thing (e.g. bug) happened, so we avoid waiting for it. | 348 | * thing (e.g. bug) happened, so we avoid waiting for it. |
349 | */ | 349 | */ |
350 | if (err != -EIO) { | 350 | if (err != -EIO) { |
351 | int err2 = filemap_fdatawait(mapping); | 351 | int err2 = filemap_fdatawait(mapping); |
352 | if (!err) | 352 | if (!err) |
353 | err = err2; | 353 | err = err2; |
354 | } | 354 | } |
355 | } else { | 355 | } else { |
356 | err = filemap_check_errors(mapping); | 356 | err = filemap_check_errors(mapping); |
357 | } | 357 | } |
358 | return err; | 358 | return err; |
359 | } | 359 | } |
360 | EXPORT_SYMBOL(filemap_write_and_wait); | 360 | EXPORT_SYMBOL(filemap_write_and_wait); |
361 | 361 | ||
362 | /** | 362 | /** |
363 | * filemap_write_and_wait_range - write out & wait on a file range | 363 | * filemap_write_and_wait_range - write out & wait on a file range |
364 | * @mapping: the address_space for the pages | 364 | * @mapping: the address_space for the pages |
365 | * @lstart: offset in bytes where the range starts | 365 | * @lstart: offset in bytes where the range starts |
366 | * @lend: offset in bytes where the range ends (inclusive) | 366 | * @lend: offset in bytes where the range ends (inclusive) |
367 | * | 367 | * |
368 | * Write out and wait upon file offsets lstart->lend, inclusive. | 368 | * Write out and wait upon file offsets lstart->lend, inclusive. |
369 | * | 369 | * |
370 | * Note that `lend' is inclusive (describes the last byte to be written) so | 370 | * Note that `lend' is inclusive (describes the last byte to be written) so |
371 | * that this function can be used to write to the very end-of-file (end = -1). | 371 | * that this function can be used to write to the very end-of-file (end = -1). |
372 | */ | 372 | */ |
373 | int filemap_write_and_wait_range(struct address_space *mapping, | 373 | int filemap_write_and_wait_range(struct address_space *mapping, |
374 | loff_t lstart, loff_t lend) | 374 | loff_t lstart, loff_t lend) |
375 | { | 375 | { |
376 | int err = 0; | 376 | int err = 0; |
377 | 377 | ||
378 | if (mapping->nrpages) { | 378 | if (mapping->nrpages) { |
379 | err = __filemap_fdatawrite_range(mapping, lstart, lend, | 379 | err = __filemap_fdatawrite_range(mapping, lstart, lend, |
380 | WB_SYNC_ALL); | 380 | WB_SYNC_ALL); |
381 | /* See comment of filemap_write_and_wait() */ | 381 | /* See comment of filemap_write_and_wait() */ |
382 | if (err != -EIO) { | 382 | if (err != -EIO) { |
383 | int err2 = filemap_fdatawait_range(mapping, | 383 | int err2 = filemap_fdatawait_range(mapping, |
384 | lstart, lend); | 384 | lstart, lend); |
385 | if (!err) | 385 | if (!err) |
386 | err = err2; | 386 | err = err2; |
387 | } | 387 | } |
388 | } else { | 388 | } else { |
389 | err = filemap_check_errors(mapping); | 389 | err = filemap_check_errors(mapping); |
390 | } | 390 | } |
391 | return err; | 391 | return err; |
392 | } | 392 | } |
393 | EXPORT_SYMBOL(filemap_write_and_wait_range); | 393 | EXPORT_SYMBOL(filemap_write_and_wait_range); |
394 | 394 | ||
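filemap_write_and_wait_range() is the usual building block for fsync-style paths and for flushing the page cache ahead of direct I/O; note that 'lend' is inclusive, as the kernel-doc says. A sketch of a direct-I/O style flush, assuming the usual headers; the helper name and the pos/count arguments are hypothetical:

/* Hypothetical: write out and wait on [pos, pos + count - 1] before a direct read. */
static int flush_for_dio(struct address_space *mapping, loff_t pos, size_t count)
{
	return filemap_write_and_wait_range(mapping, pos, pos + count - 1);
}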
395 | /** | 395 | /** |
396 | * replace_page_cache_page - replace a pagecache page with a new one | 396 | * replace_page_cache_page - replace a pagecache page with a new one |
397 | * @old: page to be replaced | 397 | * @old: page to be replaced |
398 | * @new: page to replace with | 398 | * @new: page to replace with |
399 | * @gfp_mask: allocation mode | 399 | * @gfp_mask: allocation mode |
400 | * | 400 | * |
401 | * This function replaces a page in the pagecache with a new one. On | 401 | * This function replaces a page in the pagecache with a new one. On |
402 | * success it acquires the pagecache reference for the new page and | 402 | * success it acquires the pagecache reference for the new page and |
403 | * drops it for the old page. Both the old and new pages must be | 403 | * drops it for the old page. Both the old and new pages must be |
404 | * locked. This function does not add the new page to the LRU, the | 404 | * locked. This function does not add the new page to the LRU, the |
405 | * caller must do that. | 405 | * caller must do that. |
406 | * | 406 | * |
407 | * The remove + add is atomic. The only way this function can fail is | 407 | * The remove + add is atomic. The only way this function can fail is |
408 | * memory allocation failure. | 408 | * memory allocation failure. |
409 | */ | 409 | */ |
410 | int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) | 410 | int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) |
411 | { | 411 | { |
412 | int error; | 412 | int error; |
413 | 413 | ||
414 | VM_BUG_ON(!PageLocked(old)); | 414 | VM_BUG_ON(!PageLocked(old)); |
415 | VM_BUG_ON(!PageLocked(new)); | 415 | VM_BUG_ON(!PageLocked(new)); |
416 | VM_BUG_ON(new->mapping); | 416 | VM_BUG_ON(new->mapping); |
417 | 417 | ||
418 | error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); | 418 | error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); |
419 | if (!error) { | 419 | if (!error) { |
420 | struct address_space *mapping = old->mapping; | 420 | struct address_space *mapping = old->mapping; |
421 | void (*freepage)(struct page *); | 421 | void (*freepage)(struct page *); |
422 | 422 | ||
423 | pgoff_t offset = old->index; | 423 | pgoff_t offset = old->index; |
424 | freepage = mapping->a_ops->freepage; | 424 | freepage = mapping->a_ops->freepage; |
425 | 425 | ||
426 | page_cache_get(new); | 426 | page_cache_get(new); |
427 | new->mapping = mapping; | 427 | new->mapping = mapping; |
428 | new->index = offset; | 428 | new->index = offset; |
429 | 429 | ||
430 | spin_lock_irq(&mapping->tree_lock); | 430 | spin_lock_irq(&mapping->tree_lock); |
431 | __delete_from_page_cache(old); | 431 | __delete_from_page_cache(old); |
432 | error = radix_tree_insert(&mapping->page_tree, offset, new); | 432 | error = radix_tree_insert(&mapping->page_tree, offset, new); |
433 | BUG_ON(error); | 433 | BUG_ON(error); |
434 | mapping->nrpages++; | 434 | mapping->nrpages++; |
435 | __inc_zone_page_state(new, NR_FILE_PAGES); | 435 | __inc_zone_page_state(new, NR_FILE_PAGES); |
436 | if (PageSwapBacked(new)) | 436 | if (PageSwapBacked(new)) |
437 | __inc_zone_page_state(new, NR_SHMEM); | 437 | __inc_zone_page_state(new, NR_SHMEM); |
438 | spin_unlock_irq(&mapping->tree_lock); | 438 | spin_unlock_irq(&mapping->tree_lock); |
439 | /* mem_cgroup codes must not be called under tree_lock */ | 439 | /* mem_cgroup codes must not be called under tree_lock */ |
440 | mem_cgroup_replace_page_cache(old, new); | 440 | mem_cgroup_replace_page_cache(old, new); |
441 | radix_tree_preload_end(); | 441 | radix_tree_preload_end(); |
442 | if (freepage) | 442 | if (freepage) |
443 | freepage(old); | 443 | freepage(old); |
444 | page_cache_release(old); | 444 | page_cache_release(old); |
445 | } | 445 | } |
446 | 446 | ||
447 | return error; | 447 | return error; |
448 | } | 448 | } |
449 | EXPORT_SYMBOL_GPL(replace_page_cache_page); | 449 | EXPORT_SYMBOL_GPL(replace_page_cache_page); |
450 | 450 | ||
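replace_page_cache_page() atomically swaps one page for another at the same index; both pages must be locked and, as the kernel-doc notes, the caller is responsible for putting the new page on the LRU. A sketch that assumes both pages are already locked; the helper name and the GFP_KERNEL choice are illustrative only:

/* Hypothetical: replace 'old' with 'new' at the same index (both pages locked). */
static int swap_cached_page(struct page *old, struct page *new)
{
	int err = replace_page_cache_page(old, new, GFP_KERNEL);

	if (!err)
		lru_cache_add_file(new);	/* caller adds the new page to the LRU */
	return err;
}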
451 | static int page_cache_tree_insert(struct address_space *mapping, | 451 | static int page_cache_tree_insert(struct address_space *mapping, |
452 | struct page *page) | 452 | struct page *page) |
453 | { | 453 | { |
454 | void **slot; | 454 | void **slot; |
455 | int error; | 455 | int error; |
456 | 456 | ||
457 | slot = radix_tree_lookup_slot(&mapping->page_tree, page->index); | 457 | slot = radix_tree_lookup_slot(&mapping->page_tree, page->index); |
458 | if (slot) { | 458 | if (slot) { |
459 | void *p; | 459 | void *p; |
460 | 460 | ||
461 | p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); | 461 | p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock); |
462 | if (!radix_tree_exceptional_entry(p)) | 462 | if (!radix_tree_exceptional_entry(p)) |
463 | return -EEXIST; | 463 | return -EEXIST; |
464 | radix_tree_replace_slot(slot, page); | 464 | radix_tree_replace_slot(slot, page); |
465 | mapping->nrpages++; | 465 | mapping->nrpages++; |
466 | return 0; | 466 | return 0; |
467 | } | 467 | } |
468 | error = radix_tree_insert(&mapping->page_tree, page->index, page); | 468 | error = radix_tree_insert(&mapping->page_tree, page->index, page); |
469 | if (!error) | 469 | if (!error) |
470 | mapping->nrpages++; | 470 | mapping->nrpages++; |
471 | return error; | 471 | return error; |
472 | } | 472 | } |
473 | 473 | ||
474 | /** | 474 | /** |
475 | * add_to_page_cache_locked - add a locked page to the pagecache | 475 | * add_to_page_cache_locked - add a locked page to the pagecache |
476 | * @page: page to add | 476 | * @page: page to add |
477 | * @mapping: the page's address_space | 477 | * @mapping: the page's address_space |
478 | * @offset: page index | 478 | * @offset: page index |
479 | * @gfp_mask: page allocation mode | 479 | * @gfp_mask: page allocation mode |
480 | * | 480 | * |
481 | * This function is used to add a page to the pagecache. It must be locked. | 481 | * This function is used to add a page to the pagecache. It must be locked. |
482 | * This function does not add the page to the LRU. The caller must do that. | 482 | * This function does not add the page to the LRU. The caller must do that. |
483 | */ | 483 | */ |
484 | int add_to_page_cache_locked(struct page *page, struct address_space *mapping, | 484 | int add_to_page_cache_locked(struct page *page, struct address_space *mapping, |
485 | pgoff_t offset, gfp_t gfp_mask) | 485 | pgoff_t offset, gfp_t gfp_mask) |
486 | { | 486 | { |
487 | int error; | 487 | int error; |
488 | 488 | ||
489 | VM_BUG_ON(!PageLocked(page)); | 489 | VM_BUG_ON(!PageLocked(page)); |
490 | VM_BUG_ON(PageSwapBacked(page)); | 490 | VM_BUG_ON(PageSwapBacked(page)); |
491 | 491 | ||
492 | error = mem_cgroup_cache_charge(page, current->mm, | 492 | error = mem_cgroup_cache_charge(page, current->mm, |
493 | gfp_mask & GFP_RECLAIM_MASK); | 493 | gfp_mask & GFP_RECLAIM_MASK); |
494 | if (error) | 494 | if (error) |
495 | return error; | 495 | return error; |
496 | 496 | ||
497 | error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM); | 497 | error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM); |
498 | if (error) { | 498 | if (error) { |
499 | mem_cgroup_uncharge_cache_page(page); | 499 | mem_cgroup_uncharge_cache_page(page); |
500 | return error; | 500 | return error; |
501 | } | 501 | } |
502 | 502 | ||
503 | page_cache_get(page); | 503 | page_cache_get(page); |
504 | page->mapping = mapping; | 504 | page->mapping = mapping; |
505 | page->index = offset; | 505 | page->index = offset; |
506 | 506 | ||
507 | spin_lock_irq(&mapping->tree_lock); | 507 | spin_lock_irq(&mapping->tree_lock); |
508 | error = page_cache_tree_insert(mapping, page); | 508 | error = page_cache_tree_insert(mapping, page); |
509 | radix_tree_preload_end(); | 509 | radix_tree_preload_end(); |
510 | if (unlikely(error)) | 510 | if (unlikely(error)) |
511 | goto err_insert; | 511 | goto err_insert; |
512 | __inc_zone_page_state(page, NR_FILE_PAGES); | 512 | __inc_zone_page_state(page, NR_FILE_PAGES); |
513 | spin_unlock_irq(&mapping->tree_lock); | 513 | spin_unlock_irq(&mapping->tree_lock); |
514 | trace_mm_filemap_add_to_page_cache(page); | 514 | trace_mm_filemap_add_to_page_cache(page); |
515 | return 0; | 515 | return 0; |
516 | err_insert: | 516 | err_insert: |
517 | page->mapping = NULL; | 517 | page->mapping = NULL; |
518 | /* Leave page->index set: truncation relies upon it */ | 518 | /* Leave page->index set: truncation relies upon it */ |
519 | spin_unlock_irq(&mapping->tree_lock); | 519 | spin_unlock_irq(&mapping->tree_lock); |
520 | mem_cgroup_uncharge_cache_page(page); | 520 | mem_cgroup_uncharge_cache_page(page); |
521 | page_cache_release(page); | 521 | page_cache_release(page); |
522 | return error; | 522 | return error; |
523 | } | 523 | } |
524 | EXPORT_SYMBOL(add_to_page_cache_locked); | 524 | EXPORT_SYMBOL(add_to_page_cache_locked); |
525 | 525 | ||
526 | int add_to_page_cache_lru(struct page *page, struct address_space *mapping, | 526 | int add_to_page_cache_lru(struct page *page, struct address_space *mapping, |
527 | pgoff_t offset, gfp_t gfp_mask) | 527 | pgoff_t offset, gfp_t gfp_mask) |
528 | { | 528 | { |
529 | int ret; | 529 | int ret; |
530 | 530 | ||
531 | ret = add_to_page_cache(page, mapping, offset, gfp_mask); | 531 | ret = add_to_page_cache(page, mapping, offset, gfp_mask); |
532 | if (ret == 0) | 532 | if (ret == 0) |
533 | lru_cache_add_file(page); | 533 | lru_cache_add_file(page); |
534 | return ret; | 534 | return ret; |
535 | } | 535 | } |
536 | EXPORT_SYMBOL_GPL(add_to_page_cache_lru); | 536 | EXPORT_SYMBOL_GPL(add_to_page_cache_lru); |
537 | 537 | ||
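add_to_page_cache_lru() is how readahead-style code inserts a freshly allocated page: on success the page is locked, referenced, in the radix tree and on the LRU. A sketch of the allocate-and-insert step, assuming the usual headers; the helper name is hypothetical and GFP_KERNEL is only an illustrative mask:

/* Hypothetical: allocate a page and insert it at 'index'; returns it locked. */
static struct page *alloc_and_add(struct address_space *mapping, pgoff_t index)
{
	struct page *page = page_cache_alloc_cold(mapping);

	if (!page)
		return NULL;
	if (add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
		page_cache_release(page);	/* likely -EEXIST: someone else added one */
		return NULL;
	}
	/* Page is locked; the caller issues the read and calls unlock_page(). */
	return page;
}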
538 | #ifdef CONFIG_NUMA | 538 | #ifdef CONFIG_NUMA |
539 | struct page *__page_cache_alloc(gfp_t gfp) | 539 | struct page *__page_cache_alloc(gfp_t gfp) |
540 | { | 540 | { |
541 | int n; | 541 | int n; |
542 | struct page *page; | 542 | struct page *page; |
543 | 543 | ||
544 | if (cpuset_do_page_mem_spread()) { | 544 | if (cpuset_do_page_mem_spread()) { |
545 | unsigned int cpuset_mems_cookie; | 545 | unsigned int cpuset_mems_cookie; |
546 | do { | 546 | do { |
547 | cpuset_mems_cookie = read_mems_allowed_begin(); | 547 | cpuset_mems_cookie = read_mems_allowed_begin(); |
548 | n = cpuset_mem_spread_node(); | 548 | n = cpuset_mem_spread_node(); |
549 | page = alloc_pages_exact_node(n, gfp, 0); | 549 | page = alloc_pages_exact_node(n, gfp, 0); |
550 | } while (!page && read_mems_allowed_retry(cpuset_mems_cookie)); | 550 | } while (!page && read_mems_allowed_retry(cpuset_mems_cookie)); |
551 | 551 | ||
552 | return page; | 552 | return page; |
553 | } | 553 | } |
554 | return alloc_pages(gfp, 0); | 554 | return alloc_pages(gfp, 0); |
555 | } | 555 | } |
556 | EXPORT_SYMBOL(__page_cache_alloc); | 556 | EXPORT_SYMBOL(__page_cache_alloc); |
557 | #endif | 557 | #endif |
558 | 558 | ||
559 | /* | 559 | /* |
560 | * In order to wait for pages to become available there must be | 560 | * In order to wait for pages to become available there must be |
561 | * waitqueues associated with pages. By using a hash table of | 561 | * waitqueues associated with pages. By using a hash table of |
562 | * waitqueues where the bucket discipline is to maintain all | 562 | * waitqueues where the bucket discipline is to maintain all |
563 | * waiters on the same queue and wake all when any of the pages | 563 | * waiters on the same queue and wake all when any of the pages |
564 | * become available, and for the woken contexts to check to be | 564 | * become available, and for the woken contexts to check to be |
565 | * sure the appropriate page became available, this saves space | 565 | * sure the appropriate page became available, this saves space |
566 | * at a cost of "thundering herd" phenomena during rare hash | 566 | * at a cost of "thundering herd" phenomena during rare hash |
567 | * collisions. | 567 | * collisions. |
568 | */ | 568 | */ |
569 | static wait_queue_head_t *page_waitqueue(struct page *page) | 569 | static wait_queue_head_t *page_waitqueue(struct page *page) |
570 | { | 570 | { |
571 | const struct zone *zone = page_zone(page); | 571 | const struct zone *zone = page_zone(page); |
572 | 572 | ||
573 | return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)]; | 573 | return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)]; |
574 | } | 574 | } |
575 | 575 | ||
576 | static inline void wake_up_page(struct page *page, int bit) | 576 | static inline void wake_up_page(struct page *page, int bit) |
577 | { | 577 | { |
578 | __wake_up_bit(page_waitqueue(page), &page->flags, bit); | 578 | __wake_up_bit(page_waitqueue(page), &page->flags, bit); |
579 | } | 579 | } |
580 | 580 | ||
581 | void wait_on_page_bit(struct page *page, int bit_nr) | 581 | void wait_on_page_bit(struct page *page, int bit_nr) |
582 | { | 582 | { |
583 | DEFINE_WAIT_BIT(wait, &page->flags, bit_nr); | 583 | DEFINE_WAIT_BIT(wait, &page->flags, bit_nr); |
584 | 584 | ||
585 | if (test_bit(bit_nr, &page->flags)) | 585 | if (test_bit(bit_nr, &page->flags)) |
586 | __wait_on_bit(page_waitqueue(page), &wait, sleep_on_page, | 586 | __wait_on_bit(page_waitqueue(page), &wait, sleep_on_page, |
587 | TASK_UNINTERRUPTIBLE); | 587 | TASK_UNINTERRUPTIBLE); |
588 | } | 588 | } |
589 | EXPORT_SYMBOL(wait_on_page_bit); | 589 | EXPORT_SYMBOL(wait_on_page_bit); |
590 | 590 | ||
591 | int wait_on_page_bit_killable(struct page *page, int bit_nr) | 591 | int wait_on_page_bit_killable(struct page *page, int bit_nr) |
592 | { | 592 | { |
593 | DEFINE_WAIT_BIT(wait, &page->flags, bit_nr); | 593 | DEFINE_WAIT_BIT(wait, &page->flags, bit_nr); |
594 | 594 | ||
595 | if (!test_bit(bit_nr, &page->flags)) | 595 | if (!test_bit(bit_nr, &page->flags)) |
596 | return 0; | 596 | return 0; |
597 | 597 | ||
598 | return __wait_on_bit(page_waitqueue(page), &wait, | 598 | return __wait_on_bit(page_waitqueue(page), &wait, |
599 | sleep_on_page_killable, TASK_KILLABLE); | 599 | sleep_on_page_killable, TASK_KILLABLE); |
600 | } | 600 | } |
601 | 601 | ||
602 | /** | 602 | /** |
603 | * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue | 603 | * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue |
604 | * @page: Page defining the wait queue of interest | 604 | * @page: Page defining the wait queue of interest |
605 | * @waiter: Waiter to add to the queue | 605 | * @waiter: Waiter to add to the queue |
606 | * | 606 | * |
607 | * Add an arbitrary @waiter to the wait queue for the nominated @page. | 607 | * Add an arbitrary @waiter to the wait queue for the nominated @page. |
608 | */ | 608 | */ |
609 | void add_page_wait_queue(struct page *page, wait_queue_t *waiter) | 609 | void add_page_wait_queue(struct page *page, wait_queue_t *waiter) |
610 | { | 610 | { |
611 | wait_queue_head_t *q = page_waitqueue(page); | 611 | wait_queue_head_t *q = page_waitqueue(page); |
612 | unsigned long flags; | 612 | unsigned long flags; |
613 | 613 | ||
614 | spin_lock_irqsave(&q->lock, flags); | 614 | spin_lock_irqsave(&q->lock, flags); |
615 | __add_wait_queue(q, waiter); | 615 | __add_wait_queue(q, waiter); |
616 | spin_unlock_irqrestore(&q->lock, flags); | 616 | spin_unlock_irqrestore(&q->lock, flags); |
617 | } | 617 | } |
618 | EXPORT_SYMBOL_GPL(add_page_wait_queue); | 618 | EXPORT_SYMBOL_GPL(add_page_wait_queue); |
619 | 619 | ||
620 | /** | 620 | /** |
621 | * unlock_page - unlock a locked page | 621 | * unlock_page - unlock a locked page |
622 | * @page: the page | 622 | * @page: the page |
623 | * | 623 | * |
624 | * Unlocks the page and wakes up sleepers in ___wait_on_page_locked(). | 624 | * Unlocks the page and wakes up sleepers in ___wait_on_page_locked(). |
625 | * Also wakes sleepers in wait_on_page_writeback() because the wakeup | 625 | * Also wakes sleepers in wait_on_page_writeback() because the wakeup |
626 | * mechanism between PageLocked pages and PageWriteback pages is shared. | 626 | * mechanism between PageLocked pages and PageWriteback pages is shared. |
627 | * But that's OK - sleepers in wait_on_page_writeback() just go back to sleep. | 627 | * But that's OK - sleepers in wait_on_page_writeback() just go back to sleep. |
628 | * | 628 | * |
629 | * The mb is necessary to enforce ordering between the clear_bit and the read | 629 | * The mb is necessary to enforce ordering between the clear_bit and the read |
630 | * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()). | 630 | * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()). |
631 | */ | 631 | */ |
632 | void unlock_page(struct page *page) | 632 | void unlock_page(struct page *page) |
633 | { | 633 | { |
634 | VM_BUG_ON(!PageLocked(page)); | 634 | VM_BUG_ON(!PageLocked(page)); |
635 | clear_bit_unlock(PG_locked, &page->flags); | 635 | clear_bit_unlock(PG_locked, &page->flags); |
636 | smp_mb__after_clear_bit(); | 636 | smp_mb__after_clear_bit(); |
637 | wake_up_page(page, PG_locked); | 637 | wake_up_page(page, PG_locked); |
638 | } | 638 | } |
639 | EXPORT_SYMBOL(unlock_page); | 639 | EXPORT_SYMBOL(unlock_page); |
640 | 640 | ||
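From a caller's perspective the contract documented above is simple: whoever owns PG_locked (via lock_page(), trylock_page(), or a page returned locked by the insertion paths above) must drop it with unlock_page(), which also wakes any waiters. A sketch, assuming the usual headers; the helper name and the elided data-filling step are hypothetical:

/* Hypothetical: bring a page uptodate under the page lock. */
static void fill_page(struct page *page)
{
	lock_page(page);		/* may sleep in __lock_page() */
	if (!PageUptodate(page)) {
		/* ... copy or read data into the page here ... */
		SetPageUptodate(page);
	}
	unlock_page(page);		/* clears PG_locked and wakes waiters */
}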
641 | /** | 641 | /** |
642 | * end_page_writeback - end writeback against a page | 642 | * end_page_writeback - end writeback against a page |
643 | * @page: the page | 643 | * @page: the page |
644 | */ | 644 | */ |
645 | void end_page_writeback(struct page *page) | 645 | void end_page_writeback(struct page *page) |
646 | { | 646 | { |
647 | if (TestClearPageReclaim(page)) | 647 | if (TestClearPageReclaim(page)) |
648 | rotate_reclaimable_page(page); | 648 | rotate_reclaimable_page(page); |
649 | 649 | ||
650 | if (!test_clear_page_writeback(page)) | 650 | if (!test_clear_page_writeback(page)) |
651 | BUG(); | 651 | BUG(); |
652 | 652 | ||
653 | smp_mb__after_clear_bit(); | 653 | smp_mb__after_clear_bit(); |
654 | wake_up_page(page, PG_writeback); | 654 | wake_up_page(page, PG_writeback); |
655 | } | 655 | } |
656 | EXPORT_SYMBOL(end_page_writeback); | 656 | EXPORT_SYMBOL(end_page_writeback); |
657 | 657 | ||
658 | /** | 658 | /** |
659 | * __lock_page - get a lock on the page, assuming we need to sleep to get it | 659 | * __lock_page - get a lock on the page, assuming we need to sleep to get it |
660 | * @page: the page to lock | 660 | * @page: the page to lock |
661 | */ | 661 | */ |
662 | void __lock_page(struct page *page) | 662 | void __lock_page(struct page *page) |
663 | { | 663 | { |
664 | DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); | 664 | DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); |
665 | 665 | ||
666 | __wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page, | 666 | __wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page, |
667 | TASK_UNINTERRUPTIBLE); | 667 | TASK_UNINTERRUPTIBLE); |
668 | } | 668 | } |
669 | EXPORT_SYMBOL(__lock_page); | 669 | EXPORT_SYMBOL(__lock_page); |
670 | 670 | ||
671 | int __lock_page_killable(struct page *page) | 671 | int __lock_page_killable(struct page *page) |
672 | { | 672 | { |
673 | DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); | 673 | DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); |
674 | 674 | ||
675 | return __wait_on_bit_lock(page_waitqueue(page), &wait, | 675 | return __wait_on_bit_lock(page_waitqueue(page), &wait, |
676 | sleep_on_page_killable, TASK_KILLABLE); | 676 | sleep_on_page_killable, TASK_KILLABLE); |
677 | } | 677 | } |
678 | EXPORT_SYMBOL_GPL(__lock_page_killable); | 678 | EXPORT_SYMBOL_GPL(__lock_page_killable); |
679 | 679 | ||
680 | int __lock_page_or_retry(struct page *page, struct mm_struct *mm, | 680 | int __lock_page_or_retry(struct page *page, struct mm_struct *mm, |
681 | unsigned int flags) | 681 | unsigned int flags) |
682 | { | 682 | { |
683 | if (flags & FAULT_FLAG_ALLOW_RETRY) { | 683 | if (flags & FAULT_FLAG_ALLOW_RETRY) { |
684 | /* | 684 | /* |
685 | * CAUTION! In this case, mmap_sem is not released | 685 | * CAUTION! In this case, mmap_sem is not released |
686 | * even though return 0. | 686 | * even though return 0. |
687 | */ | 687 | */ |
688 | if (flags & FAULT_FLAG_RETRY_NOWAIT) | 688 | if (flags & FAULT_FLAG_RETRY_NOWAIT) |
689 | return 0; | 689 | return 0; |
690 | 690 | ||
691 | up_read(&mm->mmap_sem); | 691 | up_read(&mm->mmap_sem); |
692 | if (flags & FAULT_FLAG_KILLABLE) | 692 | if (flags & FAULT_FLAG_KILLABLE) |
693 | wait_on_page_locked_killable(page); | 693 | wait_on_page_locked_killable(page); |
694 | else | 694 | else |
695 | wait_on_page_locked(page); | 695 | wait_on_page_locked(page); |
696 | return 0; | 696 | return 0; |
697 | } else { | 697 | } else { |
698 | if (flags & FAULT_FLAG_KILLABLE) { | 698 | if (flags & FAULT_FLAG_KILLABLE) { |
699 | int ret; | 699 | int ret; |
700 | 700 | ||
701 | ret = __lock_page_killable(page); | 701 | ret = __lock_page_killable(page); |
702 | if (ret) { | 702 | if (ret) { |
703 | up_read(&mm->mmap_sem); | 703 | up_read(&mm->mmap_sem); |
704 | return 0; | 704 | return 0; |
705 | } | 705 | } |
706 | } else | 706 | } else |
707 | __lock_page(page); | 707 | __lock_page(page); |
708 | return 1; | 708 | return 1; |
709 | } | 709 | } |
710 | } | 710 | } |
711 | 711 | ||
712 | /** | 712 | /** |
713 | * page_cache_next_hole - find the next hole (not-present entry) | 713 | * page_cache_next_hole - find the next hole (not-present entry) |
714 | * @mapping: mapping | 714 | * @mapping: mapping |
715 | * @index: index | 715 | * @index: index |
716 | * @max_scan: maximum range to search | 716 | * @max_scan: maximum range to search |
717 | * | 717 | * |
718 | * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the | 718 | * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the |
719 | * lowest indexed hole. | 719 | * lowest indexed hole. |
720 | * | 720 | * |
721 | * Returns: the index of the hole if found, otherwise returns an index | 721 | * Returns: the index of the hole if found, otherwise returns an index |
722 | * outside of the set specified (in which case 'return - index >= | 722 | * outside of the set specified (in which case 'return - index >= |
723 | * max_scan' will be true). In rare cases of index wrap-around, 0 will | 723 | * max_scan' will be true). In rare cases of index wrap-around, 0 will |
724 | * be returned. | 724 | * be returned. |
725 | * | 725 | * |
726 | * page_cache_next_hole may be called under rcu_read_lock. However, | 726 | * page_cache_next_hole may be called under rcu_read_lock. However, |
727 | * like radix_tree_gang_lookup, this will not atomically search a | 727 | * like radix_tree_gang_lookup, this will not atomically search a |
728 | * snapshot of the tree at a single point in time. For example, if a | 728 | * snapshot of the tree at a single point in time. For example, if a |
729 | * hole is created at index 5, then subsequently a hole is created at | 729 | * hole is created at index 5, then subsequently a hole is created at |
730 | * index 10, page_cache_next_hole covering both indexes may return 10 | 730 | * index 10, page_cache_next_hole covering both indexes may return 10 |
731 | * if called under rcu_read_lock. | 731 | * if called under rcu_read_lock. |
732 | */ | 732 | */ |
733 | pgoff_t page_cache_next_hole(struct address_space *mapping, | 733 | pgoff_t page_cache_next_hole(struct address_space *mapping, |
734 | pgoff_t index, unsigned long max_scan) | 734 | pgoff_t index, unsigned long max_scan) |
735 | { | 735 | { |
736 | unsigned long i; | 736 | unsigned long i; |
737 | 737 | ||
738 | for (i = 0; i < max_scan; i++) { | 738 | for (i = 0; i < max_scan; i++) { |
739 | struct page *page; | 739 | struct page *page; |
740 | 740 | ||
741 | page = radix_tree_lookup(&mapping->page_tree, index); | 741 | page = radix_tree_lookup(&mapping->page_tree, index); |
742 | if (!page || radix_tree_exceptional_entry(page)) | 742 | if (!page || radix_tree_exceptional_entry(page)) |
743 | break; | 743 | break; |
744 | index++; | 744 | index++; |
745 | if (index == 0) | 745 | if (index == 0) |
746 | break; | 746 | break; |
747 | } | 747 | } |
748 | 748 | ||
749 | return index; | 749 | return index; |
750 | } | 750 | } |
751 | EXPORT_SYMBOL(page_cache_next_hole); | 751 | EXPORT_SYMBOL(page_cache_next_hole); |
752 | 752 | ||
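As the kernel-doc above explains, a return value outside the scanned range means no hole was found. A sketch of a readahead-style probe, assuming the usual headers; the helper name and its arguments are hypothetical, and rcu_read_lock() is taken because the lookup may run without mapping->tree_lock:

/* Hypothetical: first non-present index after 'offset', scanning at most 'max' slots. */
static pgoff_t first_hole_after(struct address_space *mapping,
				pgoff_t offset, unsigned long max)
{
	pgoff_t hole;

	rcu_read_lock();
	hole = page_cache_next_hole(mapping, offset + 1, max);
	rcu_read_unlock();

	/* hole - (offset + 1) >= max  means no hole in the scanned range */
	return hole;
}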
753 | /** | 753 | /** |
754 | * page_cache_prev_hole - find the prev hole (not-present entry) | 754 | * page_cache_prev_hole - find the prev hole (not-present entry) |
755 | * @mapping: mapping | 755 | * @mapping: mapping |
756 | * @index: index | 756 | * @index: index |
757 | * @max_scan: maximum range to search | 757 | * @max_scan: maximum range to search |
758 | * | 758 | * |
759 | * Search backwards in the range [max(index-max_scan+1, 0), index] for | 759 | * Search backwards in the range [max(index-max_scan+1, 0), index] for |
760 | * the first hole. | 760 | * the first hole. |
761 | * | 761 | * |
762 | * Returns: the index of the hole if found, otherwise returns an index | 762 | * Returns: the index of the hole if found, otherwise returns an index |
763 | * outside of the set specified (in which case 'index - return >= | 763 | * outside of the set specified (in which case 'index - return >= |
764 | * max_scan' will be true). In rare cases of wrap-around, ULONG_MAX | 764 | * max_scan' will be true). In rare cases of wrap-around, ULONG_MAX |
765 | * will be returned. | 765 | * will be returned. |
766 | * | 766 | * |
767 | * page_cache_prev_hole may be called under rcu_read_lock. However, | 767 | * page_cache_prev_hole may be called under rcu_read_lock. However, |
768 | * like radix_tree_gang_lookup, this will not atomically search a | 768 | * like radix_tree_gang_lookup, this will not atomically search a |
769 | * snapshot of the tree at a single point in time. For example, if a | 769 | * snapshot of the tree at a single point in time. For example, if a |
770 | * hole is created at index 10, then subsequently a hole is created at | 770 | * hole is created at index 10, then subsequently a hole is created at |
771 | * index 5, page_cache_prev_hole covering both indexes may return 5 if | 771 | * index 5, page_cache_prev_hole covering both indexes may return 5 if |
772 | * called under rcu_read_lock. | 772 | * called under rcu_read_lock. |
773 | */ | 773 | */ |
774 | pgoff_t page_cache_prev_hole(struct address_space *mapping, | 774 | pgoff_t page_cache_prev_hole(struct address_space *mapping, |
775 | pgoff_t index, unsigned long max_scan) | 775 | pgoff_t index, unsigned long max_scan) |
776 | { | 776 | { |
777 | unsigned long i; | 777 | unsigned long i; |
778 | 778 | ||
779 | for (i = 0; i < max_scan; i++) { | 779 | for (i = 0; i < max_scan; i++) { |
780 | struct page *page; | 780 | struct page *page; |
781 | 781 | ||
782 | page = radix_tree_lookup(&mapping->page_tree, index); | 782 | page = radix_tree_lookup(&mapping->page_tree, index); |
783 | if (!page || radix_tree_exceptional_entry(page)) | 783 | if (!page || radix_tree_exceptional_entry(page)) |
784 | break; | 784 | break; |
785 | index--; | 785 | index--; |
786 | if (index == ULONG_MAX) | 786 | if (index == ULONG_MAX) |
787 | break; | 787 | break; |
788 | } | 788 | } |
789 | 789 | ||
790 | return index; | 790 | return index; |
791 | } | 791 | } |
792 | EXPORT_SYMBOL(page_cache_prev_hole); | 792 | EXPORT_SYMBOL(page_cache_prev_hole); |
793 | 793 | ||
794 | /** | 794 | /** |
795 | * find_get_entry - find and get a page cache entry | 795 | * find_get_entry - find and get a page cache entry |
796 | * @mapping: the address_space to search | 796 | * @mapping: the address_space to search |
797 | * @offset: the page cache index | 797 | * @offset: the page cache index |
798 | * | 798 | * |
799 | * Looks up the page cache slot at @mapping & @offset. If there is a | 799 | * Looks up the page cache slot at @mapping & @offset. If there is a |
800 | * page cache page, it is returned with an increased refcount. | 800 | * page cache page, it is returned with an increased refcount. |
801 | * | 801 | * |
802 | * If the slot holds a shadow entry of a previously evicted page, it | 802 | * If the slot holds a shadow entry of a previously evicted page, it |
803 | * is returned. | 803 | * is returned. |
804 | * | 804 | * |
805 | * Otherwise, %NULL is returned. | 805 | * Otherwise, %NULL is returned. |
806 | */ | 806 | */ |
807 | struct page *find_get_entry(struct address_space *mapping, pgoff_t offset) | 807 | struct page *find_get_entry(struct address_space *mapping, pgoff_t offset) |
808 | { | 808 | { |
809 | void **pagep; | 809 | void **pagep; |
810 | struct page *page; | 810 | struct page *page; |
811 | 811 | ||
812 | rcu_read_lock(); | 812 | rcu_read_lock(); |
813 | repeat: | 813 | repeat: |
814 | page = NULL; | 814 | page = NULL; |
815 | pagep = radix_tree_lookup_slot(&mapping->page_tree, offset); | 815 | pagep = radix_tree_lookup_slot(&mapping->page_tree, offset); |
816 | if (pagep) { | 816 | if (pagep) { |
817 | page = radix_tree_deref_slot(pagep); | 817 | page = radix_tree_deref_slot(pagep); |
818 | if (unlikely(!page)) | 818 | if (unlikely(!page)) |
819 | goto out; | 819 | goto out; |
820 | if (radix_tree_exception(page)) { | 820 | if (radix_tree_exception(page)) { |
821 | if (radix_tree_deref_retry(page)) | 821 | if (radix_tree_deref_retry(page)) |
822 | goto repeat; | 822 | goto repeat; |
823 | /* | 823 | /* |
824 | * Otherwise, shmem/tmpfs must be storing a swap entry | 824 | * Otherwise, shmem/tmpfs must be storing a swap entry |
825 | * here as an exceptional entry: so return it without | 825 | * here as an exceptional entry: so return it without |
826 | * attempting to raise page count. | 826 | * attempting to raise page count. |
827 | */ | 827 | */ |
828 | goto out; | 828 | goto out; |
829 | } | 829 | } |
830 | if (!page_cache_get_speculative(page)) | 830 | if (!page_cache_get_speculative(page)) |
831 | goto repeat; | 831 | goto repeat; |
832 | 832 | ||
833 | /* | 833 | /* |
834 | * Has the page moved? | 834 | * Has the page moved? |
835 | * This is part of the lockless pagecache protocol. See | 835 | * This is part of the lockless pagecache protocol. See |
836 | * include/linux/pagemap.h for details. | 836 | * include/linux/pagemap.h for details. |
837 | */ | 837 | */ |
838 | if (unlikely(page != *pagep)) { | 838 | if (unlikely(page != *pagep)) { |
839 | page_cache_release(page); | 839 | page_cache_release(page); |
840 | goto repeat; | 840 | goto repeat; |
841 | } | 841 | } |
842 | } | 842 | } |
843 | out: | 843 | out: |
844 | rcu_read_unlock(); | 844 | rcu_read_unlock(); |
845 | 845 | ||
846 | return page; | 846 | return page; |
847 | } | 847 | } |
848 | EXPORT_SYMBOL(find_get_entry); | 848 | EXPORT_SYMBOL(find_get_entry); |
849 | 849 | ||
850 | /** | 850 | /** |
851 | * find_get_page - find and get a page reference | ||
852 | * @mapping: the address_space to search | ||
853 | * @offset: the page index | ||
854 | * | ||
855 | * Looks up the page cache slot at @mapping & @offset. If there is a | ||
856 | * page cache page, it is returned with an increased refcount. | ||
857 | * | ||
858 | * Otherwise, %NULL is returned. | ||
859 | */ | ||
860 | struct page *find_get_page(struct address_space *mapping, pgoff_t offset) | ||
861 | { | ||
862 | struct page *page = find_get_entry(mapping, offset); | ||
863 | |||
864 | if (radix_tree_exceptional_entry(page)) | ||
865 | page = NULL; | ||
866 | return page; | ||
867 | } | ||
868 | EXPORT_SYMBOL(find_get_page); | ||
869 | |||
870 | /** | ||
871 | * find_lock_entry - locate, pin and lock a page cache entry | 851 | * find_lock_entry - locate, pin and lock a page cache entry |
872 | * @mapping: the address_space to search | 852 | * @mapping: the address_space to search |
873 | * @offset: the page cache index | 853 | * @offset: the page cache index |
874 | * | 854 | * |
875 | * Looks up the page cache slot at @mapping & @offset. If there is a | 855 | * Looks up the page cache slot at @mapping & @offset. If there is a |
876 | * page cache page, it is returned locked and with an increased | 856 | * page cache page, it is returned locked and with an increased |
877 | * refcount. | 857 | * refcount. |
878 | * | 858 | * |
879 | * If the slot holds a shadow entry of a previously evicted page, it | 859 | * If the slot holds a shadow entry of a previously evicted page, it |
880 | * is returned. | 860 | * is returned. |
881 | * | 861 | * |
882 | * Otherwise, %NULL is returned. | 862 | * Otherwise, %NULL is returned. |
883 | * | 863 | * |
884 | * find_lock_entry() may sleep. | 864 | * find_lock_entry() may sleep. |
885 | */ | 865 | */ |
886 | struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset) | 866 | struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset) |
887 | { | 867 | { |
888 | struct page *page; | 868 | struct page *page; |
889 | 869 | ||
890 | repeat: | 870 | repeat: |
891 | page = find_get_entry(mapping, offset); | 871 | page = find_get_entry(mapping, offset); |
892 | if (page && !radix_tree_exception(page)) { | 872 | if (page && !radix_tree_exception(page)) { |
893 | lock_page(page); | 873 | lock_page(page); |
894 | /* Has the page been truncated? */ | 874 | /* Has the page been truncated? */ |
895 | if (unlikely(page->mapping != mapping)) { | 875 | if (unlikely(page->mapping != mapping)) { |
896 | unlock_page(page); | 876 | unlock_page(page); |
897 | page_cache_release(page); | 877 | page_cache_release(page); |
898 | goto repeat; | 878 | goto repeat; |
899 | } | 879 | } |
900 | VM_BUG_ON(page->index != offset); | 880 | VM_BUG_ON(page->index != offset); |
901 | } | 881 | } |
902 | return page; | 882 | return page; |
903 | } | 883 | } |
904 | EXPORT_SYMBOL(find_lock_entry); | 884 | EXPORT_SYMBOL(find_lock_entry); |
905 | 885 | ||
906 | /** | 886 | /** |
907 | * find_lock_page - locate, pin and lock a pagecache page | 887 | * pagecache_get_page - find and get a page reference |
908 | * @mapping: the address_space to search | 888 | * @mapping: the address_space to search |
909 | * @offset: the page index | 889 | * @offset: the page index |
890 | * @fgp_flags: PCG flags | ||
891 | * @gfp_mask: gfp mask to use if a page is to be allocated | ||
910 | * | 892 | * |
911 | * Looks up the page cache slot at @mapping & @offset. If there is a | 893 | * Looks up the page cache slot at @mapping & @offset. |
912 | * page cache page, it is returned locked and with an increased | ||
913 | * refcount. | ||
914 | * | 894 | * |
915 | * Otherwise, %NULL is returned. | 895 | * PCG flags modify how the page is returned |
916 | * | 896 | * |
917 | * find_lock_page() may sleep. | 898 | * FGP_LOCK: Page is returned locked |
918 | */ | 898 | * FGP_LOCK: Page is return locked |
919 | struct page *find_lock_page(struct address_space *mapping, pgoff_t offset) | 899 | * FGP_CREAT: If page is not present then a new page is allocated using |
920 | { | 900 | * @gfp_mask and added to the page cache and the VM's LRU |
921 | struct page *page = find_lock_entry(mapping, offset); | 901 | * list. The page is returned locked and with an increased |
922 | 902 | * refcount. Otherwise, %NULL is returned. | |
923 | if (radix_tree_exceptional_entry(page)) | ||
924 | page = NULL; | ||
925 | return page; | ||
926 | } | ||
927 | EXPORT_SYMBOL(find_lock_page); | ||
928 | |||
929 | /** | ||
930 | * find_or_create_page - locate or add a pagecache page | ||
931 | * @mapping: the page's address_space | ||
932 | * @index: the page's index into the mapping | ||
933 | * @gfp_mask: page allocation mode | ||
934 | * | 903 | * |
935 | * Looks up the page cache slot at @mapping & @offset. If there is a | 904 | * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even |
936 | * page cache page, it is returned locked and with an increased | 905 | * if the GFP flags specified for FGP_CREAT are atomic. |
937 | * refcount. | ||
938 | * | 906 | * |
939 | * If the page is not present, a new page is allocated using @gfp_mask | 907 | * If there is a page cache page, it is returned with an increased refcount. |
940 | * and added to the page cache and the VM's LRU list. The page is | ||
941 | * returned locked and with an increased refcount. | ||
942 | * | ||
943 | * On memory exhaustion, %NULL is returned. | ||
944 | * | ||
945 | * find_or_create_page() may sleep, even if @gfp_flags specifies an | ||
946 | * atomic allocation! | ||
947 | */ | 908 | */ |
948 | struct page *find_or_create_page(struct address_space *mapping, | 909 | struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset, |
949 | pgoff_t index, gfp_t gfp_mask) | 910 | int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask) |
950 | { | 911 | { |
951 | struct page *page; | 912 | struct page *page; |
952 | int err; | 913 | |
953 | repeat: | 914 | repeat: |
954 | page = find_lock_page(mapping, index); | 915 | page = find_get_entry(mapping, offset); |
955 | if (!page) { | 916 | if (radix_tree_exceptional_entry(page)) |
956 | page = __page_cache_alloc(gfp_mask); | 917 | page = NULL; |
918 | if (!page) | ||
919 | goto no_page; | ||
920 | |||
921 | if (fgp_flags & FGP_LOCK) { | ||
922 | if (fgp_flags & FGP_NOWAIT) { | ||
923 | if (!trylock_page(page)) { | ||
924 | page_cache_release(page); | ||
925 | return NULL; | ||
926 | } | ||
927 | } else { | ||
928 | lock_page(page); | ||
929 | } | ||
930 | |||
931 | /* Has the page been truncated? */ | ||
932 | if (unlikely(page->mapping != mapping)) { | ||
933 | unlock_page(page); | ||
934 | page_cache_release(page); | ||
935 | goto repeat; | ||
936 | } | ||
937 | VM_BUG_ON(page->index != offset); | ||
938 | } | ||
939 | |||
940 | if (page && (fgp_flags & FGP_ACCESSED)) | ||
941 | mark_page_accessed(page); | ||
942 | |||
943 | no_page: | ||
944 | if (!page && (fgp_flags & FGP_CREAT)) { | ||
945 | int err; | ||
946 | if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping)) | ||
947 | cache_gfp_mask |= __GFP_WRITE; | ||
948 | if (fgp_flags & FGP_NOFS) { | ||
949 | cache_gfp_mask &= ~__GFP_FS; | ||
950 | radix_gfp_mask &= ~__GFP_FS; | ||
951 | } | ||
952 | |||
953 | page = __page_cache_alloc(cache_gfp_mask); | ||
957 | if (!page) | 954 | if (!page) |
958 | return NULL; | 955 | return NULL; |
959 | /* | 956 | |
960 | * We want a regular kernel memory (not highmem or DMA etc) | 957 | if (WARN_ON_ONCE(!(fgp_flags & FGP_LOCK))) |
961 | * allocation for the radix tree nodes, but we need to honour | 958 | fgp_flags |= FGP_LOCK; |
962 | * the context-specific requirements the caller has asked for. | 959 | |
963 | * GFP_RECLAIM_MASK collects those requirements. | 960 | /* Init accessed so avoid atomic mark_page_accessed later */ |
964 | */ | 961 | if (fgp_flags & FGP_ACCESSED) |
965 | err = add_to_page_cache_lru(page, mapping, index, | 962 | init_page_accessed(page); |
966 | (gfp_mask & GFP_RECLAIM_MASK)); | 963 | |
964 | err = add_to_page_cache_lru(page, mapping, offset, radix_gfp_mask); | ||
967 | if (unlikely(err)) { | 965 | if (unlikely(err)) { |
968 | page_cache_release(page); | 966 | page_cache_release(page); |
969 | page = NULL; | 967 | page = NULL; |
970 | if (err == -EEXIST) | 968 | if (err == -EEXIST) |
971 | goto repeat; | 969 | goto repeat; |
972 | } | 970 | } |
973 | } | 971 | } |
972 | |||
974 | return page; | 973 | return page; |
975 | } | 974 | } |
976 | EXPORT_SYMBOL(find_or_create_page); | 975 | EXPORT_SYMBOL(pagecache_get_page); |
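For reference, the old lookup helpers become thin wrappers around pagecache_get_page(). A minimal sketch is shown below, assuming the FGP_* flags and the two-gfp-mask signature from this diff; in the actual patch these wrappers live in include/linux/pagemap.h and their exact form may differ slightly:

static inline struct page *find_get_page(struct address_space *mapping,
					 pgoff_t offset)
{
	/* Plain lookup: no locking, no creation, no accessed marking. */
	return pagecache_get_page(mapping, offset, 0, 0, 0);
}

static inline struct page *find_lock_page(struct address_space *mapping,
					  pgoff_t offset)
{
	/* Lock the page if it exists; never allocate a new one. */
	return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0);
}

static inline struct page *find_or_create_page(struct address_space *mapping,
					       pgoff_t offset, gfp_t gfp_mask)
{
	/* Lock, mark accessed and create on miss, as the old API did. */
	return pagecache_get_page(mapping, offset,
				  FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
				  gfp_mask, gfp_mask & GFP_RECLAIM_MASK);
}

Because FGP_ACCESSED is passed on the create path, a freshly allocated page is initialised via init_page_accessed() before it becomes visible, which is the non-atomic fast path this commit is after.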
977 | 976 | ||
978 | /** | 977 | /** |
979 | * find_get_entries - gang pagecache lookup | 978 | * find_get_entries - gang pagecache lookup |
980 | * @mapping: The address_space to search | 979 | * @mapping: The address_space to search |
981 | * @start: The starting page cache index | 980 | * @start: The starting page cache index |
982 | * @nr_entries: The maximum number of entries | 981 | * @nr_entries: The maximum number of entries |
983 | * @entries: Where the resulting entries are placed | 982 | * @entries: Where the resulting entries are placed |
984 | * @indices: The cache indices corresponding to the entries in @entries | 983 | * @indices: The cache indices corresponding to the entries in @entries |
985 | * | 984 | * |
986 | * find_get_entries() will search for and return a group of up to | 985 | * find_get_entries() will search for and return a group of up to |
987 | * @nr_entries entries in the mapping. The entries are placed at | 986 | * @nr_entries entries in the mapping. The entries are placed at |
988 | * @entries. find_get_entries() takes a reference against any actual | 987 | * @entries. find_get_entries() takes a reference against any actual |
989 | * pages it returns. | 988 | * pages it returns. |
990 | * | 989 | * |
991 | * The search returns a group of mapping-contiguous page cache entries | 990 | * The search returns a group of mapping-contiguous page cache entries |
992 | * with ascending indexes. There may be holes in the indices due to | 991 | * with ascending indexes. There may be holes in the indices due to |
993 | * not-present pages. | 992 | * not-present pages. |
994 | * | 993 | * |
995 | * Any shadow entries of evicted pages are included in the returned | 994 | * Any shadow entries of evicted pages are included in the returned |
996 | * array. | 995 | * array. |
997 | * | 996 | * |
998 | * find_get_entries() returns the number of pages and shadow entries | 997 | * find_get_entries() returns the number of pages and shadow entries |
999 | * which were found. | 998 | * which were found. |
1000 | */ | 999 | */ |
1001 | unsigned find_get_entries(struct address_space *mapping, | 1000 | unsigned find_get_entries(struct address_space *mapping, |
1002 | pgoff_t start, unsigned int nr_entries, | 1001 | pgoff_t start, unsigned int nr_entries, |
1003 | struct page **entries, pgoff_t *indices) | 1002 | struct page **entries, pgoff_t *indices) |
1004 | { | 1003 | { |
1005 | void **slot; | 1004 | void **slot; |
1006 | unsigned int ret = 0; | 1005 | unsigned int ret = 0; |
1007 | struct radix_tree_iter iter; | 1006 | struct radix_tree_iter iter; |
1008 | 1007 | ||
1009 | if (!nr_entries) | 1008 | if (!nr_entries) |
1010 | return 0; | 1009 | return 0; |
1011 | 1010 | ||
1012 | rcu_read_lock(); | 1011 | rcu_read_lock(); |
1013 | restart: | 1012 | restart: |
1014 | radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) { | 1013 | radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) { |
1015 | struct page *page; | 1014 | struct page *page; |
1016 | repeat: | 1015 | repeat: |
1017 | page = radix_tree_deref_slot(slot); | 1016 | page = radix_tree_deref_slot(slot); |
1018 | if (unlikely(!page)) | 1017 | if (unlikely(!page)) |
1019 | continue; | 1018 | continue; |
1020 | if (radix_tree_exception(page)) { | 1019 | if (radix_tree_exception(page)) { |
1021 | if (radix_tree_deref_retry(page)) | 1020 | if (radix_tree_deref_retry(page)) |
1022 | goto restart; | 1021 | goto restart; |
1023 | /* | 1022 | /* |
1024 | * Otherwise, we must be storing a swap entry | 1023 | * Otherwise, we must be storing a swap entry |
1025 | * here as an exceptional entry: so return it | 1024 | * here as an exceptional entry: so return it |
1026 | * without attempting to raise page count. | 1025 | * without attempting to raise page count. |
1027 | */ | 1026 | */ |
1028 | goto export; | 1027 | goto export; |
1029 | } | 1028 | } |
1030 | if (!page_cache_get_speculative(page)) | 1029 | if (!page_cache_get_speculative(page)) |
1031 | goto repeat; | 1030 | goto repeat; |
1032 | 1031 | ||
1033 | /* Has the page moved? */ | 1032 | /* Has the page moved? */ |
1034 | if (unlikely(page != *slot)) { | 1033 | if (unlikely(page != *slot)) { |
1035 | page_cache_release(page); | 1034 | page_cache_release(page); |
1036 | goto repeat; | 1035 | goto repeat; |
1037 | } | 1036 | } |
1038 | export: | 1037 | export: |
1039 | indices[ret] = iter.index; | 1038 | indices[ret] = iter.index; |
1040 | entries[ret] = page; | 1039 | entries[ret] = page; |
1041 | if (++ret == nr_entries) | 1040 | if (++ret == nr_entries) |
1042 | break; | 1041 | break; |
1043 | } | 1042 | } |
1044 | rcu_read_unlock(); | 1043 | rcu_read_unlock(); |
1045 | return ret; | 1044 | return ret; |
1046 | } | 1045 | } |
1047 | 1046 | ||
1048 | /** | 1047 | /** |
1049 | * find_get_pages - gang pagecache lookup | 1048 | * find_get_pages - gang pagecache lookup |
1050 | * @mapping: The address_space to search | 1049 | * @mapping: The address_space to search |
1051 | * @start: The starting page index | 1050 | * @start: The starting page index |
1052 | * @nr_pages: The maximum number of pages | 1051 | * @nr_pages: The maximum number of pages |
1053 | * @pages: Where the resulting pages are placed | 1052 | * @pages: Where the resulting pages are placed |
1054 | * | 1053 | * |
1055 | * find_get_pages() will search for and return a group of up to | 1054 | * find_get_pages() will search for and return a group of up to |
1056 | * @nr_pages pages in the mapping. The pages are placed at @pages. | 1055 | * @nr_pages pages in the mapping. The pages are placed at @pages. |
1057 | * find_get_pages() takes a reference against the returned pages. | 1056 | * find_get_pages() takes a reference against the returned pages. |
1058 | * | 1057 | * |
1059 | * The search returns a group of mapping-contiguous pages with ascending | 1058 | * The search returns a group of mapping-contiguous pages with ascending |
1060 | * indexes. There may be holes in the indices due to not-present pages. | 1059 | * indexes. There may be holes in the indices due to not-present pages. |
1061 | * | 1060 | * |
1062 | * find_get_pages() returns the number of pages which were found. | 1061 | * find_get_pages() returns the number of pages which were found. |
1063 | */ | 1062 | */ |
1064 | unsigned find_get_pages(struct address_space *mapping, pgoff_t start, | 1063 | unsigned find_get_pages(struct address_space *mapping, pgoff_t start, |
1065 | unsigned int nr_pages, struct page **pages) | 1064 | unsigned int nr_pages, struct page **pages) |
1066 | { | 1065 | { |
1067 | struct radix_tree_iter iter; | 1066 | struct radix_tree_iter iter; |
1068 | void **slot; | 1067 | void **slot; |
1069 | unsigned ret = 0; | 1068 | unsigned ret = 0; |
1070 | 1069 | ||
1071 | if (unlikely(!nr_pages)) | 1070 | if (unlikely(!nr_pages)) |
1072 | return 0; | 1071 | return 0; |
1073 | 1072 | ||
1074 | rcu_read_lock(); | 1073 | rcu_read_lock(); |
1075 | restart: | 1074 | restart: |
1076 | radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) { | 1075 | radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) { |
1077 | struct page *page; | 1076 | struct page *page; |
1078 | repeat: | 1077 | repeat: |
1079 | page = radix_tree_deref_slot(slot); | 1078 | page = radix_tree_deref_slot(slot); |
1080 | if (unlikely(!page)) | 1079 | if (unlikely(!page)) |
1081 | continue; | 1080 | continue; |
1082 | 1081 | ||
1083 | if (radix_tree_exception(page)) { | 1082 | if (radix_tree_exception(page)) { |
1084 | if (radix_tree_deref_retry(page)) { | 1083 | if (radix_tree_deref_retry(page)) { |
1085 | /* | 1084 | /* |
1086 | * Transient condition which can only trigger | 1085 | * Transient condition which can only trigger |
1087 | * when entry at index 0 moves out of or back | 1086 | * when entry at index 0 moves out of or back |
1088 | * to root: none yet gotten, safe to restart. | 1087 | * to root: none yet gotten, safe to restart. |
1089 | */ | 1088 | */ |
1090 | WARN_ON(iter.index); | 1089 | WARN_ON(iter.index); |
1091 | goto restart; | 1090 | goto restart; |
1092 | } | 1091 | } |
1093 | /* | 1092 | /* |
1094 | * Otherwise, shmem/tmpfs must be storing a swap entry | 1093 | * Otherwise, shmem/tmpfs must be storing a swap entry |
1095 | * here as an exceptional entry: so skip over it - | 1094 | * here as an exceptional entry: so skip over it - |
1096 | * we only reach this from invalidate_mapping_pages(). | 1095 | * we only reach this from invalidate_mapping_pages(). |
1097 | */ | 1096 | */ |
1098 | continue; | 1097 | continue; |
1099 | } | 1098 | } |
1100 | 1099 | ||
1101 | if (!page_cache_get_speculative(page)) | 1100 | if (!page_cache_get_speculative(page)) |
1102 | goto repeat; | 1101 | goto repeat; |
1103 | 1102 | ||
1104 | /* Has the page moved? */ | 1103 | /* Has the page moved? */ |
1105 | if (unlikely(page != *slot)) { | 1104 | if (unlikely(page != *slot)) { |
1106 | page_cache_release(page); | 1105 | page_cache_release(page); |
1107 | goto repeat; | 1106 | goto repeat; |
1108 | } | 1107 | } |
1109 | 1108 | ||
1110 | pages[ret] = page; | 1109 | pages[ret] = page; |
1111 | if (++ret == nr_pages) | 1110 | if (++ret == nr_pages) |
1112 | break; | 1111 | break; |
1113 | } | 1112 | } |
1114 | 1113 | ||
1115 | rcu_read_unlock(); | 1114 | rcu_read_unlock(); |
1116 | return ret; | 1115 | return ret; |
1117 | } | 1116 | } |
1118 | 1117 | ||
1119 | /** | 1118 | /** |
1120 | * find_get_pages_contig - gang contiguous pagecache lookup | 1119 | * find_get_pages_contig - gang contiguous pagecache lookup |
1121 | * @mapping: The address_space to search | 1120 | * @mapping: The address_space to search |
1122 | * @index: The starting page index | 1121 | * @index: The starting page index |
1123 | * @nr_pages: The maximum number of pages | 1122 | * @nr_pages: The maximum number of pages |
1124 | * @pages: Where the resulting pages are placed | 1123 | * @pages: Where the resulting pages are placed |
1125 | * | 1124 | * |
1126 | * find_get_pages_contig() works exactly like find_get_pages(), except | 1125 | * find_get_pages_contig() works exactly like find_get_pages(), except |
1127 | * that the returned number of pages are guaranteed to be contiguous. | 1126 | * that the returned number of pages are guaranteed to be contiguous. |
1128 | * | 1127 | * |
1129 | * find_get_pages_contig() returns the number of pages which were found. | 1128 | * find_get_pages_contig() returns the number of pages which were found. |
1130 | */ | 1129 | */ |
1131 | unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index, | 1130 | unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index, |
1132 | unsigned int nr_pages, struct page **pages) | 1131 | unsigned int nr_pages, struct page **pages) |
1133 | { | 1132 | { |
1134 | struct radix_tree_iter iter; | 1133 | struct radix_tree_iter iter; |
1135 | void **slot; | 1134 | void **slot; |
1136 | unsigned int ret = 0; | 1135 | unsigned int ret = 0; |
1137 | 1136 | ||
1138 | if (unlikely(!nr_pages)) | 1137 | if (unlikely(!nr_pages)) |
1139 | return 0; | 1138 | return 0; |
1140 | 1139 | ||
1141 | rcu_read_lock(); | 1140 | rcu_read_lock(); |
1142 | restart: | 1141 | restart: |
1143 | radix_tree_for_each_contig(slot, &mapping->page_tree, &iter, index) { | 1142 | radix_tree_for_each_contig(slot, &mapping->page_tree, &iter, index) { |
1144 | struct page *page; | 1143 | struct page *page; |
1145 | repeat: | 1144 | repeat: |
1146 | page = radix_tree_deref_slot(slot); | 1145 | page = radix_tree_deref_slot(slot); |
1147 | /* The hole, there no reason to continue */ | 1146 | /* The hole, there no reason to continue */ |
1148 | if (unlikely(!page)) | 1147 | if (unlikely(!page)) |
1149 | break; | 1148 | break; |
1150 | 1149 | ||
1151 | if (radix_tree_exception(page)) { | 1150 | if (radix_tree_exception(page)) { |
1152 | if (radix_tree_deref_retry(page)) { | 1151 | if (radix_tree_deref_retry(page)) { |
1153 | /* | 1152 | /* |
1154 | * Transient condition which can only trigger | 1153 | * Transient condition which can only trigger |
1155 | * when entry at index 0 moves out of or back | 1154 | * when entry at index 0 moves out of or back |
1156 | * to root: none yet gotten, safe to restart. | 1155 | * to root: none yet gotten, safe to restart. |
1157 | */ | 1156 | */ |
1158 | goto restart; | 1157 | goto restart; |
1159 | } | 1158 | } |
1160 | /* | 1159 | /* |
1161 | * Otherwise, shmem/tmpfs must be storing a swap entry | 1160 | * Otherwise, shmem/tmpfs must be storing a swap entry |
1162 | * here as an exceptional entry: so stop looking for | 1161 | * here as an exceptional entry: so stop looking for |
1163 | * contiguous pages. | 1162 | * contiguous pages. |
1164 | */ | 1163 | */ |
1165 | break; | 1164 | break; |
1166 | } | 1165 | } |
1167 | 1166 | ||
1168 | if (!page_cache_get_speculative(page)) | 1167 | if (!page_cache_get_speculative(page)) |
1169 | goto repeat; | 1168 | goto repeat; |
1170 | 1169 | ||
1171 | /* Has the page moved? */ | 1170 | /* Has the page moved? */ |
1172 | if (unlikely(page != *slot)) { | 1171 | if (unlikely(page != *slot)) { |
1173 | page_cache_release(page); | 1172 | page_cache_release(page); |
1174 | goto repeat; | 1173 | goto repeat; |
1175 | } | 1174 | } |
1176 | 1175 | ||
1177 | /* | 1176 | /* |
1178 | * must check mapping and index after taking the ref. | 1177 | * must check mapping and index after taking the ref. |
1179 | * otherwise we can get both false positives and false | 1178 | * otherwise we can get both false positives and false |
1180 | * negatives, which is just confusing to the caller. | 1179 | * negatives, which is just confusing to the caller. |
1181 | */ | 1180 | */ |
1182 | if (page->mapping == NULL || page->index != iter.index) { | 1181 | if (page->mapping == NULL || page->index != iter.index) { |
1183 | page_cache_release(page); | 1182 | page_cache_release(page); |
1184 | break; | 1183 | break; |
1185 | } | 1184 | } |
1186 | 1185 | ||
1187 | pages[ret] = page; | 1186 | pages[ret] = page; |
1188 | if (++ret == nr_pages) | 1187 | if (++ret == nr_pages) |
1189 | break; | 1188 | break; |
1190 | } | 1189 | } |
1191 | rcu_read_unlock(); | 1190 | rcu_read_unlock(); |
1192 | return ret; | 1191 | return ret; |
1193 | } | 1192 | } |
1194 | EXPORT_SYMBOL(find_get_pages_contig); | 1193 | EXPORT_SYMBOL(find_get_pages_contig); |
1195 | 1194 | ||
1196 | /** | 1195 | /** |
1197 | * find_get_pages_tag - find and return pages that match @tag | 1196 | * find_get_pages_tag - find and return pages that match @tag |
1198 | * @mapping: the address_space to search | 1197 | * @mapping: the address_space to search |
1199 | * @index: the starting page index | 1198 | * @index: the starting page index |
1200 | * @tag: the tag index | 1199 | * @tag: the tag index |
1201 | * @nr_pages: the maximum number of pages | 1200 | * @nr_pages: the maximum number of pages |
1202 | * @pages: where the resulting pages are placed | 1201 | * @pages: where the resulting pages are placed |
1203 | * | 1202 | * |
1204 | * Like find_get_pages, except we only return pages which are tagged with | 1203 | * Like find_get_pages, except we only return pages which are tagged with |
1205 | * @tag. We update @index to index the next page for the traversal. | 1204 | * @tag. We update @index to index the next page for the traversal. |
1206 | */ | 1205 | */ |
1207 | unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, | 1206 | unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, |
1208 | int tag, unsigned int nr_pages, struct page **pages) | 1207 | int tag, unsigned int nr_pages, struct page **pages) |
1209 | { | 1208 | { |
1210 | struct radix_tree_iter iter; | 1209 | struct radix_tree_iter iter; |
1211 | void **slot; | 1210 | void **slot; |
1212 | unsigned ret = 0; | 1211 | unsigned ret = 0; |
1213 | 1212 | ||
1214 | if (unlikely(!nr_pages)) | 1213 | if (unlikely(!nr_pages)) |
1215 | return 0; | 1214 | return 0; |
1216 | 1215 | ||
1217 | rcu_read_lock(); | 1216 | rcu_read_lock(); |
1218 | restart: | 1217 | restart: |
1219 | radix_tree_for_each_tagged(slot, &mapping->page_tree, | 1218 | radix_tree_for_each_tagged(slot, &mapping->page_tree, |
1220 | &iter, *index, tag) { | 1219 | &iter, *index, tag) { |
1221 | struct page *page; | 1220 | struct page *page; |
1222 | repeat: | 1221 | repeat: |
1223 | page = radix_tree_deref_slot(slot); | 1222 | page = radix_tree_deref_slot(slot); |
1224 | if (unlikely(!page)) | 1223 | if (unlikely(!page)) |
1225 | continue; | 1224 | continue; |
1226 | 1225 | ||
1227 | if (radix_tree_exception(page)) { | 1226 | if (radix_tree_exception(page)) { |
1228 | if (radix_tree_deref_retry(page)) { | 1227 | if (radix_tree_deref_retry(page)) { |
1229 | /* | 1228 | /* |
1230 | * Transient condition which can only trigger | 1229 | * Transient condition which can only trigger |
1231 | * when entry at index 0 moves out of or back | 1230 | * when entry at index 0 moves out of or back |
1232 | * to root: none yet gotten, safe to restart. | 1231 | * to root: none yet gotten, safe to restart. |
1233 | */ | 1232 | */ |
1234 | goto restart; | 1233 | goto restart; |
1235 | } | 1234 | } |
1236 | /* | 1235 | /* |
1237 | * This function is never used on a shmem/tmpfs | 1236 | * This function is never used on a shmem/tmpfs |
1238 | * mapping, so a swap entry won't be found here. | 1237 | * mapping, so a swap entry won't be found here. |
1239 | */ | 1238 | */ |
1240 | BUG(); | 1239 | BUG(); |
1241 | } | 1240 | } |
1242 | 1241 | ||
1243 | if (!page_cache_get_speculative(page)) | 1242 | if (!page_cache_get_speculative(page)) |
1244 | goto repeat; | 1243 | goto repeat; |
1245 | 1244 | ||
1246 | /* Has the page moved? */ | 1245 | /* Has the page moved? */ |
1247 | if (unlikely(page != *slot)) { | 1246 | if (unlikely(page != *slot)) { |
1248 | page_cache_release(page); | 1247 | page_cache_release(page); |
1249 | goto repeat; | 1248 | goto repeat; |
1250 | } | 1249 | } |
1251 | 1250 | ||
1252 | pages[ret] = page; | 1251 | pages[ret] = page; |
1253 | if (++ret == nr_pages) | 1252 | if (++ret == nr_pages) |
1254 | break; | 1253 | break; |
1255 | } | 1254 | } |
1256 | 1255 | ||
1257 | rcu_read_unlock(); | 1256 | rcu_read_unlock(); |
1258 | 1257 | ||
1259 | if (ret) | 1258 | if (ret) |
1260 | *index = pages[ret - 1]->index + 1; | 1259 | *index = pages[ret - 1]->index + 1; |
1261 | 1260 | ||
1262 | return ret; | 1261 | return ret; |
1263 | } | 1262 | } |
1264 | EXPORT_SYMBOL(find_get_pages_tag); | 1263 | EXPORT_SYMBOL(find_get_pages_tag); |
1265 | 1264 | ||
1266 | /** | ||
1267 | * grab_cache_page_nowait - returns locked page at given index in given cache | ||
1268 | * @mapping: target address_space | ||
1269 | * @index: the page index | ||
1270 | * | ||
1271 | * Same as grab_cache_page(), but do not wait if the page is unavailable. | ||
1272 | * This is intended for speculative data generators, where the data can | ||
1273 | * be regenerated if the page couldn't be grabbed. This routine should | ||
1274 | * be safe to call while holding the lock for another page. | ||
1275 | * | ||
1276 | * Clear __GFP_FS when allocating the page to avoid recursion into the fs | ||
1277 | * and deadlock against the caller's locked page. | ||
1278 | */ | ||
1279 | struct page * | ||
1280 | grab_cache_page_nowait(struct address_space *mapping, pgoff_t index) | ||
1281 | { | ||
1282 | struct page *page = find_get_page(mapping, index); | ||
1283 | |||
1284 | if (page) { | ||
1285 | if (trylock_page(page)) | ||
1286 | return page; | ||
1287 | page_cache_release(page); | ||
1288 | return NULL; | ||
1289 | } | ||
1290 | page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS); | ||
1291 | if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) { | ||
1292 | page_cache_release(page); | ||
1293 | page = NULL; | ||
1294 | } | ||
1295 | return page; | ||
1296 | } | ||
1297 | EXPORT_SYMBOL(grab_cache_page_nowait); | ||
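Similarly, the grab_cache_page_nowait() implementation removed above collapses into a single pagecache_get_page() call. A hedged sketch, with the flag combination inferred from the deleted body (trylock, allocate on miss, clear __GFP_FS to avoid recursing into the filesystem):

static inline struct page *
grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
{
	/*
	 * FGP_LOCK | FGP_NOWAIT: trylock instead of sleeping on the page lock.
	 * FGP_CREAT:             allocate and insert the page on a miss.
	 * FGP_NOFS:              strip __GFP_FS so the allocation cannot
	 *                        deadlock against a page the caller holds locked.
	 */
	return pagecache_get_page(mapping, index,
				  FGP_LOCK | FGP_CREAT | FGP_NOFS | FGP_NOWAIT,
				  mapping_gfp_mask(mapping), GFP_NOFS);
}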
1298 | |||
1299 | /* | 1265 | /* |
1300 | * CD/DVDs are error prone. When a medium error occurs, the driver may fail | 1266 | * CD/DVDs are error prone. When a medium error occurs, the driver may fail |
1301 | * a _large_ part of the i/o request. Imagine the worst scenario: | 1267 | * a _large_ part of the i/o request. Imagine the worst scenario: |
1302 | * | 1268 | * |
1303 | * ---R__________________________________________B__________ | 1269 | * ---R__________________________________________B__________ |
1304 | * ^ reading here ^ bad block(assume 4k) | 1270 | * ^ reading here ^ bad block(assume 4k) |
1305 | * | 1271 | * |
1306 | * read(R) => miss => readahead(R...B) => media error => frustrating retries | 1272 | * read(R) => miss => readahead(R...B) => media error => frustrating retries |
1307 | * => failing the whole request => read(R) => read(R+1) => | 1273 | * => failing the whole request => read(R) => read(R+1) => |
1308 | * readahead(R+1...B+1) => bang => read(R+2) => read(R+3) => | 1274 | * readahead(R+1...B+1) => bang => read(R+2) => read(R+3) => |
1309 | * readahead(R+3...B+2) => bang => read(R+3) => read(R+4) => | 1275 | * readahead(R+3...B+2) => bang => read(R+3) => read(R+4) => |
1310 | * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ...... | 1276 | * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ...... |
1311 | * | 1277 | * |
1312 | * It is going insane. Fix it by quickly scaling down the readahead size. | 1278 | * It is going insane. Fix it by quickly scaling down the readahead size. |
1313 | */ | 1279 | */ |
1314 | static void shrink_readahead_size_eio(struct file *filp, | 1280 | static void shrink_readahead_size_eio(struct file *filp, |
1315 | struct file_ra_state *ra) | 1281 | struct file_ra_state *ra) |
1316 | { | 1282 | { |
1317 | ra->ra_pages /= 4; | 1283 | ra->ra_pages /= 4; |
1318 | } | 1284 | } |
1319 | 1285 | ||
1320 | /** | 1286 | /** |
1321 | * do_generic_file_read - generic file read routine | 1287 | * do_generic_file_read - generic file read routine |
1322 | * @filp: the file to read | 1288 | * @filp: the file to read |
1323 | * @ppos: current file position | 1289 | * @ppos: current file position |
1324 | * @desc: read_descriptor | 1290 | * @desc: read_descriptor |
1325 | * @actor: read method | 1291 | * @actor: read method |
1326 | * | 1292 | * |
1327 | * This is a generic file read routine, and uses the | 1293 | * This is a generic file read routine, and uses the |
1328 | * mapping->a_ops->readpage() function for the actual low-level stuff. | 1294 | * mapping->a_ops->readpage() function for the actual low-level stuff. |
1329 | * | 1295 | * |
1330 | * This is really ugly. But the goto's actually try to clarify some | 1296 | * This is really ugly. But the goto's actually try to clarify some |
1331 | * of the logic when it comes to error handling etc. | 1297 | * of the logic when it comes to error handling etc. |
1332 | */ | 1298 | */ |
1333 | static void do_generic_file_read(struct file *filp, loff_t *ppos, | 1299 | static void do_generic_file_read(struct file *filp, loff_t *ppos, |
1334 | read_descriptor_t *desc, read_actor_t actor) | 1300 | read_descriptor_t *desc, read_actor_t actor) |
1335 | { | 1301 | { |
1336 | struct address_space *mapping = filp->f_mapping; | 1302 | struct address_space *mapping = filp->f_mapping; |
1337 | struct inode *inode = mapping->host; | 1303 | struct inode *inode = mapping->host; |
1338 | struct file_ra_state *ra = &filp->f_ra; | 1304 | struct file_ra_state *ra = &filp->f_ra; |
1339 | pgoff_t index; | 1305 | pgoff_t index; |
1340 | pgoff_t last_index; | 1306 | pgoff_t last_index; |
1341 | pgoff_t prev_index; | 1307 | pgoff_t prev_index; |
1342 | unsigned long offset; /* offset into pagecache page */ | 1308 | unsigned long offset; /* offset into pagecache page */ |
1343 | unsigned int prev_offset; | 1309 | unsigned int prev_offset; |
1344 | int error; | 1310 | int error; |
1345 | 1311 | ||
1346 | index = *ppos >> PAGE_CACHE_SHIFT; | 1312 | index = *ppos >> PAGE_CACHE_SHIFT; |
1347 | prev_index = ra->prev_pos >> PAGE_CACHE_SHIFT; | 1313 | prev_index = ra->prev_pos >> PAGE_CACHE_SHIFT; |
1348 | prev_offset = ra->prev_pos & (PAGE_CACHE_SIZE-1); | 1314 | prev_offset = ra->prev_pos & (PAGE_CACHE_SIZE-1); |
1349 | last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT; | 1315 | last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT; |
1350 | offset = *ppos & ~PAGE_CACHE_MASK; | 1316 | offset = *ppos & ~PAGE_CACHE_MASK; |
1351 | 1317 | ||
1352 | for (;;) { | 1318 | for (;;) { |
1353 | struct page *page; | 1319 | struct page *page; |
1354 | pgoff_t end_index; | 1320 | pgoff_t end_index; |
1355 | loff_t isize; | 1321 | loff_t isize; |
1356 | unsigned long nr, ret; | 1322 | unsigned long nr, ret; |
1357 | 1323 | ||
1358 | cond_resched(); | 1324 | cond_resched(); |
1359 | find_page: | 1325 | find_page: |
1360 | page = find_get_page(mapping, index); | 1326 | page = find_get_page(mapping, index); |
1361 | if (!page) { | 1327 | if (!page) { |
1362 | page_cache_sync_readahead(mapping, | 1328 | page_cache_sync_readahead(mapping, |
1363 | ra, filp, | 1329 | ra, filp, |
1364 | index, last_index - index); | 1330 | index, last_index - index); |
1365 | page = find_get_page(mapping, index); | 1331 | page = find_get_page(mapping, index); |
1366 | if (unlikely(page == NULL)) | 1332 | if (unlikely(page == NULL)) |
1367 | goto no_cached_page; | 1333 | goto no_cached_page; |
1368 | } | 1334 | } |
1369 | if (PageReadahead(page)) { | 1335 | if (PageReadahead(page)) { |
1370 | page_cache_async_readahead(mapping, | 1336 | page_cache_async_readahead(mapping, |
1371 | ra, filp, page, | 1337 | ra, filp, page, |
1372 | index, last_index - index); | 1338 | index, last_index - index); |
1373 | } | 1339 | } |
1374 | if (!PageUptodate(page)) { | 1340 | if (!PageUptodate(page)) { |
1375 | if (inode->i_blkbits == PAGE_CACHE_SHIFT || | 1341 | if (inode->i_blkbits == PAGE_CACHE_SHIFT || |
1376 | !mapping->a_ops->is_partially_uptodate) | 1342 | !mapping->a_ops->is_partially_uptodate) |
1377 | goto page_not_up_to_date; | 1343 | goto page_not_up_to_date; |
1378 | if (!trylock_page(page)) | 1344 | if (!trylock_page(page)) |
1379 | goto page_not_up_to_date; | 1345 | goto page_not_up_to_date; |
1380 | /* Did it get truncated before we got the lock? */ | 1346 | /* Did it get truncated before we got the lock? */ |
1381 | if (!page->mapping) | 1347 | if (!page->mapping) |
1382 | goto page_not_up_to_date_locked; | 1348 | goto page_not_up_to_date_locked; |
1383 | if (!mapping->a_ops->is_partially_uptodate(page, | 1349 | if (!mapping->a_ops->is_partially_uptodate(page, |
1384 | desc, offset)) | 1350 | desc, offset)) |
1385 | goto page_not_up_to_date_locked; | 1351 | goto page_not_up_to_date_locked; |
1386 | unlock_page(page); | 1352 | unlock_page(page); |
1387 | } | 1353 | } |
1388 | page_ok: | 1354 | page_ok: |
1389 | /* | 1355 | /* |
1390 | * i_size must be checked after we know the page is Uptodate. | 1356 | * i_size must be checked after we know the page is Uptodate. |
1391 | * | 1357 | * |
1392 | * Checking i_size after the check allows us to calculate | 1358 | * Checking i_size after the check allows us to calculate |
1393 | * the correct value for "nr", which means the zero-filled | 1359 | * the correct value for "nr", which means the zero-filled |
1394 | * part of the page is not copied back to userspace (unless | 1360 | * part of the page is not copied back to userspace (unless |
1395 | * another truncate extends the file - this is desired though). | 1361 | * another truncate extends the file - this is desired though). |
1396 | */ | 1362 | */ |
1397 | 1363 | ||
1398 | isize = i_size_read(inode); | 1364 | isize = i_size_read(inode); |
1399 | end_index = (isize - 1) >> PAGE_CACHE_SHIFT; | 1365 | end_index = (isize - 1) >> PAGE_CACHE_SHIFT; |
1400 | if (unlikely(!isize || index > end_index)) { | 1366 | if (unlikely(!isize || index > end_index)) { |
1401 | page_cache_release(page); | 1367 | page_cache_release(page); |
1402 | goto out; | 1368 | goto out; |
1403 | } | 1369 | } |
1404 | 1370 | ||
1405 | /* nr is the maximum number of bytes to copy from this page */ | 1371 | /* nr is the maximum number of bytes to copy from this page */ |
1406 | nr = PAGE_CACHE_SIZE; | 1372 | nr = PAGE_CACHE_SIZE; |
1407 | if (index == end_index) { | 1373 | if (index == end_index) { |
1408 | nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; | 1374 | nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; |
1409 | if (nr <= offset) { | 1375 | if (nr <= offset) { |
1410 | page_cache_release(page); | 1376 | page_cache_release(page); |
1411 | goto out; | 1377 | goto out; |
1412 | } | 1378 | } |
1413 | } | 1379 | } |
1414 | nr = nr - offset; | 1380 | nr = nr - offset; |
1415 | 1381 | ||
1416 | /* If users can be writing to this page using arbitrary | 1382 | /* If users can be writing to this page using arbitrary |
1417 | * virtual addresses, take care about potential aliasing | 1383 | * virtual addresses, take care about potential aliasing |
1418 | * before reading the page on the kernel side. | 1384 | * before reading the page on the kernel side. |
1419 | */ | 1385 | */ |
1420 | if (mapping_writably_mapped(mapping)) | 1386 | if (mapping_writably_mapped(mapping)) |
1421 | flush_dcache_page(page); | 1387 | flush_dcache_page(page); |
1422 | 1388 | ||
1423 | /* | 1389 | /* |
1424 | * When a sequential read accesses a page several times, | 1390 | * When a sequential read accesses a page several times, |
1425 | * only mark it as accessed the first time. | 1391 | * only mark it as accessed the first time. |
1426 | */ | 1392 | */ |
1427 | if (prev_index != index || offset != prev_offset) | 1393 | if (prev_index != index || offset != prev_offset) |
1428 | mark_page_accessed(page); | 1394 | mark_page_accessed(page); |
1429 | prev_index = index; | 1395 | prev_index = index; |
1430 | 1396 | ||
1431 | /* | 1397 | /* |
1432 | * Ok, we have the page, and it's up-to-date, so | 1398 | * Ok, we have the page, and it's up-to-date, so |
1433 | * now we can copy it to user space... | 1399 | * now we can copy it to user space... |
1434 | * | 1400 | * |
1435 | * The actor routine returns how many bytes were actually used.. | 1401 | * The actor routine returns how many bytes were actually used.. |
1436 | * NOTE! This may not be the same as how much of a user buffer | 1402 | * NOTE! This may not be the same as how much of a user buffer |
1437 | * we filled up (we may be padding etc), so we can only update | 1403 | * we filled up (we may be padding etc), so we can only update |
1438 | * "pos" here (the actor routine has to update the user buffer | 1404 | * "pos" here (the actor routine has to update the user buffer |
1439 | * pointers and the remaining count). | 1405 | * pointers and the remaining count). |
1440 | */ | 1406 | */ |
1441 | ret = actor(desc, page, offset, nr); | 1407 | ret = actor(desc, page, offset, nr); |
1442 | offset += ret; | 1408 | offset += ret; |
1443 | index += offset >> PAGE_CACHE_SHIFT; | 1409 | index += offset >> PAGE_CACHE_SHIFT; |
1444 | offset &= ~PAGE_CACHE_MASK; | 1410 | offset &= ~PAGE_CACHE_MASK; |
1445 | prev_offset = offset; | 1411 | prev_offset = offset; |
1446 | 1412 | ||
1447 | page_cache_release(page); | 1413 | page_cache_release(page); |
1448 | if (ret == nr && desc->count) | 1414 | if (ret == nr && desc->count) |
1449 | continue; | 1415 | continue; |
1450 | goto out; | 1416 | goto out; |
1451 | 1417 | ||
1452 | page_not_up_to_date: | 1418 | page_not_up_to_date: |
1453 | /* Get exclusive access to the page ... */ | 1419 | /* Get exclusive access to the page ... */ |
1454 | error = lock_page_killable(page); | 1420 | error = lock_page_killable(page); |
1455 | if (unlikely(error)) | 1421 | if (unlikely(error)) |
1456 | goto readpage_error; | 1422 | goto readpage_error; |
1457 | 1423 | ||
1458 | page_not_up_to_date_locked: | 1424 | page_not_up_to_date_locked: |
1459 | /* Did it get truncated before we got the lock? */ | 1425 | /* Did it get truncated before we got the lock? */ |
1460 | if (!page->mapping) { | 1426 | if (!page->mapping) { |
1461 | unlock_page(page); | 1427 | unlock_page(page); |
1462 | page_cache_release(page); | 1428 | page_cache_release(page); |
1463 | continue; | 1429 | continue; |
1464 | } | 1430 | } |
1465 | 1431 | ||
1466 | /* Did somebody else fill it already? */ | 1432 | /* Did somebody else fill it already? */ |
1467 | if (PageUptodate(page)) { | 1433 | if (PageUptodate(page)) { |
1468 | unlock_page(page); | 1434 | unlock_page(page); |
1469 | goto page_ok; | 1435 | goto page_ok; |
1470 | } | 1436 | } |
1471 | 1437 | ||
1472 | readpage: | 1438 | readpage: |
1473 | /* | 1439 | /* |
1474 | * A previous I/O error may have been due to temporary | 1440 | * A previous I/O error may have been due to temporary |
1475 | * failures, eg. multipath errors. | 1441 | * failures, eg. multipath errors. |
1476 | * PG_error will be set again if readpage fails. | 1442 | * PG_error will be set again if readpage fails. |
1477 | */ | 1443 | */ |
1478 | ClearPageError(page); | 1444 | ClearPageError(page); |
1479 | /* Start the actual read. The read will unlock the page. */ | 1445 | /* Start the actual read. The read will unlock the page. */ |
1480 | error = mapping->a_ops->readpage(filp, page); | 1446 | error = mapping->a_ops->readpage(filp, page); |
1481 | 1447 | ||
1482 | if (unlikely(error)) { | 1448 | if (unlikely(error)) { |
1483 | if (error == AOP_TRUNCATED_PAGE) { | 1449 | if (error == AOP_TRUNCATED_PAGE) { |
1484 | page_cache_release(page); | 1450 | page_cache_release(page); |
1485 | goto find_page; | 1451 | goto find_page; |
1486 | } | 1452 | } |
1487 | goto readpage_error; | 1453 | goto readpage_error; |
1488 | } | 1454 | } |
1489 | 1455 | ||
1490 | if (!PageUptodate(page)) { | 1456 | if (!PageUptodate(page)) { |
1491 | error = lock_page_killable(page); | 1457 | error = lock_page_killable(page); |
1492 | if (unlikely(error)) | 1458 | if (unlikely(error)) |
1493 | goto readpage_error; | 1459 | goto readpage_error; |
1494 | if (!PageUptodate(page)) { | 1460 | if (!PageUptodate(page)) { |
1495 | if (page->mapping == NULL) { | 1461 | if (page->mapping == NULL) { |
1496 | /* | 1462 | /* |
1497 | * invalidate_mapping_pages got it | 1463 | * invalidate_mapping_pages got it |
1498 | */ | 1464 | */ |
1499 | unlock_page(page); | 1465 | unlock_page(page); |
1500 | page_cache_release(page); | 1466 | page_cache_release(page); |
1501 | goto find_page; | 1467 | goto find_page; |
1502 | } | 1468 | } |
1503 | unlock_page(page); | 1469 | unlock_page(page); |
1504 | shrink_readahead_size_eio(filp, ra); | 1470 | shrink_readahead_size_eio(filp, ra); |
1505 | error = -EIO; | 1471 | error = -EIO; |
1506 | goto readpage_error; | 1472 | goto readpage_error; |
1507 | } | 1473 | } |
1508 | unlock_page(page); | 1474 | unlock_page(page); |
1509 | } | 1475 | } |
1510 | 1476 | ||
1511 | goto page_ok; | 1477 | goto page_ok; |
1512 | 1478 | ||
1513 | readpage_error: | 1479 | readpage_error: |
1514 | /* UHHUH! A synchronous read error occurred. Report it */ | 1480 | /* UHHUH! A synchronous read error occurred. Report it */ |
1515 | desc->error = error; | 1481 | desc->error = error; |
1516 | page_cache_release(page); | 1482 | page_cache_release(page); |
1517 | goto out; | 1483 | goto out; |
1518 | 1484 | ||
1519 | no_cached_page: | 1485 | no_cached_page: |
1520 | /* | 1486 | /* |
1521 | * Ok, it wasn't cached, so we need to create a new | 1487 | * Ok, it wasn't cached, so we need to create a new |
1522 | * page.. | 1488 | * page.. |
1523 | */ | 1489 | */ |
1524 | page = page_cache_alloc_cold(mapping); | 1490 | page = page_cache_alloc_cold(mapping); |
1525 | if (!page) { | 1491 | if (!page) { |
1526 | desc->error = -ENOMEM; | 1492 | desc->error = -ENOMEM; |
1527 | goto out; | 1493 | goto out; |
1528 | } | 1494 | } |
1529 | error = add_to_page_cache_lru(page, mapping, | 1495 | error = add_to_page_cache_lru(page, mapping, |
1530 | index, GFP_KERNEL); | 1496 | index, GFP_KERNEL); |
1531 | if (error) { | 1497 | if (error) { |
1532 | page_cache_release(page); | 1498 | page_cache_release(page); |
1533 | if (error == -EEXIST) | 1499 | if (error == -EEXIST) |
1534 | goto find_page; | 1500 | goto find_page; |
1535 | desc->error = error; | 1501 | desc->error = error; |
1536 | goto out; | 1502 | goto out; |
1537 | } | 1503 | } |
1538 | goto readpage; | 1504 | goto readpage; |
1539 | } | 1505 | } |
1540 | 1506 | ||
1541 | out: | 1507 | out: |
1542 | ra->prev_pos = prev_index; | 1508 | ra->prev_pos = prev_index; |
1543 | ra->prev_pos <<= PAGE_CACHE_SHIFT; | 1509 | ra->prev_pos <<= PAGE_CACHE_SHIFT; |
1544 | ra->prev_pos |= prev_offset; | 1510 | ra->prev_pos |= prev_offset; |
1545 | 1511 | ||
1546 | *ppos = ((loff_t)index << PAGE_CACHE_SHIFT) + offset; | 1512 | *ppos = ((loff_t)index << PAGE_CACHE_SHIFT) + offset; |
1547 | file_accessed(filp); | 1513 | file_accessed(filp); |
1548 | } | 1514 | } |
1549 | 1515 | ||
1550 | int file_read_actor(read_descriptor_t *desc, struct page *page, | 1516 | int file_read_actor(read_descriptor_t *desc, struct page *page, |
1551 | unsigned long offset, unsigned long size) | 1517 | unsigned long offset, unsigned long size) |
1552 | { | 1518 | { |
1553 | char *kaddr; | 1519 | char *kaddr; |
1554 | unsigned long left, count = desc->count; | 1520 | unsigned long left, count = desc->count; |
1555 | 1521 | ||
1556 | if (size > count) | 1522 | if (size > count) |
1557 | size = count; | 1523 | size = count; |
1558 | 1524 | ||
1559 | /* | 1525 | /* |
1560 | * Faults on the destination of a read are common, so do it before | 1526 | * Faults on the destination of a read are common, so do it before |
1561 | * taking the kmap. | 1527 | * taking the kmap. |
1562 | */ | 1528 | */ |
1563 | if (!fault_in_pages_writeable(desc->arg.buf, size)) { | 1529 | if (!fault_in_pages_writeable(desc->arg.buf, size)) { |
1564 | kaddr = kmap_atomic(page); | 1530 | kaddr = kmap_atomic(page); |
1565 | left = __copy_to_user_inatomic(desc->arg.buf, | 1531 | left = __copy_to_user_inatomic(desc->arg.buf, |
1566 | kaddr + offset, size); | 1532 | kaddr + offset, size); |
1567 | kunmap_atomic(kaddr); | 1533 | kunmap_atomic(kaddr); |
1568 | if (left == 0) | 1534 | if (left == 0) |
1569 | goto success; | 1535 | goto success; |
1570 | } | 1536 | } |
1571 | 1537 | ||
1572 | /* Do it the slow way */ | 1538 | /* Do it the slow way */ |
1573 | kaddr = kmap(page); | 1539 | kaddr = kmap(page); |
1574 | left = __copy_to_user(desc->arg.buf, kaddr + offset, size); | 1540 | left = __copy_to_user(desc->arg.buf, kaddr + offset, size); |
1575 | kunmap(page); | 1541 | kunmap(page); |
1576 | 1542 | ||
1577 | if (left) { | 1543 | if (left) { |
1578 | size -= left; | 1544 | size -= left; |
1579 | desc->error = -EFAULT; | 1545 | desc->error = -EFAULT; |
1580 | } | 1546 | } |
1581 | success: | 1547 | success: |
1582 | desc->count = count - size; | 1548 | desc->count = count - size; |
1583 | desc->written += size; | 1549 | desc->written += size; |
1584 | desc->arg.buf += size; | 1550 | desc->arg.buf += size; |
1585 | return size; | 1551 | return size; |
1586 | } | 1552 | } |
1587 | 1553 | ||
1588 | /* | 1554 | /* |
1589 | * Performs necessary checks before doing a write | 1555 | * Performs necessary checks before doing a write |
1590 | * @iov: io vector request | 1556 | * @iov: io vector request |
1591 | * @nr_segs: number of segments in the iovec | 1557 | * @nr_segs: number of segments in the iovec |
1592 | * @count: number of bytes to write | 1558 | * @count: number of bytes to write |
1593 | * @access_flags: type of access: %VERIFY_READ or %VERIFY_WRITE | 1559 | * @access_flags: type of access: %VERIFY_READ or %VERIFY_WRITE |
1594 | * | 1560 | * |
1595 | * Adjust number of segments and amount of bytes to write (nr_segs should be | 1561 | * Adjust number of segments and amount of bytes to write (nr_segs should be |
1596 | * properly initialized first). Returns appropriate error code that caller | 1562 | * properly initialized first). Returns appropriate error code that caller |
1597 | * should return or zero in case that write should be allowed. | 1563 | * should return or zero in case that write should be allowed. |
1598 | */ | 1564 | */ |
1599 | int generic_segment_checks(const struct iovec *iov, | 1565 | int generic_segment_checks(const struct iovec *iov, |
1600 | unsigned long *nr_segs, size_t *count, int access_flags) | 1566 | unsigned long *nr_segs, size_t *count, int access_flags) |
1601 | { | 1567 | { |
1602 | unsigned long seg; | 1568 | unsigned long seg; |
1603 | size_t cnt = 0; | 1569 | size_t cnt = 0; |
1604 | for (seg = 0; seg < *nr_segs; seg++) { | 1570 | for (seg = 0; seg < *nr_segs; seg++) { |
1605 | const struct iovec *iv = &iov[seg]; | 1571 | const struct iovec *iv = &iov[seg]; |
1606 | 1572 | ||
1607 | /* | 1573 | /* |
1608 | * If any segment has a negative length, or the cumulative | 1574 | * If any segment has a negative length, or the cumulative |
1609 | * length ever wraps negative then return -EINVAL. | 1575 | * length ever wraps negative then return -EINVAL. |
1610 | */ | 1576 | */ |
1611 | cnt += iv->iov_len; | 1577 | cnt += iv->iov_len; |
1612 | if (unlikely((ssize_t)(cnt|iv->iov_len) < 0)) | 1578 | if (unlikely((ssize_t)(cnt|iv->iov_len) < 0)) |
1613 | return -EINVAL; | 1579 | return -EINVAL; |
1614 | if (access_ok(access_flags, iv->iov_base, iv->iov_len)) | 1580 | if (access_ok(access_flags, iv->iov_base, iv->iov_len)) |
1615 | continue; | 1581 | continue; |
1616 | if (seg == 0) | 1582 | if (seg == 0) |
1617 | return -EFAULT; | 1583 | return -EFAULT; |
1618 | *nr_segs = seg; | 1584 | *nr_segs = seg; |
1619 | cnt -= iv->iov_len; /* This segment is no good */ | 1585 | cnt -= iv->iov_len; /* This segment is no good */ |
1620 | break; | 1586 | break; |
1621 | } | 1587 | } |
1622 | *count = cnt; | 1588 | *count = cnt; |
1623 | return 0; | 1589 | return 0; |
1624 | } | 1590 | } |
1625 | EXPORT_SYMBOL(generic_segment_checks); | 1591 | EXPORT_SYMBOL(generic_segment_checks); |
1626 | 1592 | ||
1627 | /** | 1593 | /** |
1628 | * generic_file_aio_read - generic filesystem read routine | 1594 | * generic_file_aio_read - generic filesystem read routine |
1629 | * @iocb: kernel I/O control block | 1595 | * @iocb: kernel I/O control block |
1630 | * @iov: io vector request | 1596 | * @iov: io vector request |
1631 | * @nr_segs: number of segments in the iovec | 1597 | * @nr_segs: number of segments in the iovec |
1632 | * @pos: current file position | 1598 | * @pos: current file position |
1633 | * | 1599 | * |
1634 | * This is the "read()" routine for all filesystems | 1600 | * This is the "read()" routine for all filesystems |
1635 | * that can use the page cache directly. | 1601 | * that can use the page cache directly. |
1636 | */ | 1602 | */ |
1637 | ssize_t | 1603 | ssize_t |
1638 | generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, | 1604 | generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, |
1639 | unsigned long nr_segs, loff_t pos) | 1605 | unsigned long nr_segs, loff_t pos) |
1640 | { | 1606 | { |
1641 | struct file *filp = iocb->ki_filp; | 1607 | struct file *filp = iocb->ki_filp; |
1642 | ssize_t retval; | 1608 | ssize_t retval; |
1643 | unsigned long seg = 0; | 1609 | unsigned long seg = 0; |
1644 | size_t count; | 1610 | size_t count; |
1645 | loff_t *ppos = &iocb->ki_pos; | 1611 | loff_t *ppos = &iocb->ki_pos; |
1646 | 1612 | ||
1647 | count = 0; | 1613 | count = 0; |
1648 | retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE); | 1614 | retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE); |
1649 | if (retval) | 1615 | if (retval) |
1650 | return retval; | 1616 | return retval; |
1651 | 1617 | ||
1652 | /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ | 1618 | /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ |
1653 | if (filp->f_flags & O_DIRECT) { | 1619 | if (filp->f_flags & O_DIRECT) { |
1654 | loff_t size; | 1620 | loff_t size; |
1655 | struct address_space *mapping; | 1621 | struct address_space *mapping; |
1656 | struct inode *inode; | 1622 | struct inode *inode; |
1657 | 1623 | ||
1658 | mapping = filp->f_mapping; | 1624 | mapping = filp->f_mapping; |
1659 | inode = mapping->host; | 1625 | inode = mapping->host; |
1660 | if (!count) | 1626 | if (!count) |
1661 | goto out; /* skip atime */ | 1627 | goto out; /* skip atime */ |
1662 | size = i_size_read(inode); | 1628 | size = i_size_read(inode); |
1663 | if (pos < size) { | 1629 | if (pos < size) { |
1664 | retval = filemap_write_and_wait_range(mapping, pos, | 1630 | retval = filemap_write_and_wait_range(mapping, pos, |
1665 | pos + iov_length(iov, nr_segs) - 1); | 1631 | pos + iov_length(iov, nr_segs) - 1); |
1666 | if (!retval) { | 1632 | if (!retval) { |
1667 | retval = mapping->a_ops->direct_IO(READ, iocb, | 1633 | retval = mapping->a_ops->direct_IO(READ, iocb, |
1668 | iov, pos, nr_segs); | 1634 | iov, pos, nr_segs); |
1669 | } | 1635 | } |
1670 | if (retval > 0) { | 1636 | if (retval > 0) { |
1671 | *ppos = pos + retval; | 1637 | *ppos = pos + retval; |
1672 | count -= retval; | 1638 | count -= retval; |
1673 | } | 1639 | } |
1674 | 1640 | ||
1675 | /* | 1641 | /* |
1676 | * Btrfs can have a short DIO read if we encounter | 1642 | * Btrfs can have a short DIO read if we encounter |
1677 | * compressed extents, so if there was an error, or if | 1643 | * compressed extents, so if there was an error, or if |
1678 | * we've already read everything we wanted to, or if | 1644 | * we've already read everything we wanted to, or if |
1679 | * there was a short read because we hit EOF, go ahead | 1645 | * there was a short read because we hit EOF, go ahead |
1680 | * and return. Otherwise fallthrough to buffered io for | 1646 | * and return. Otherwise fallthrough to buffered io for |
1681 | * the rest of the read. | 1647 | * the rest of the read. |
1682 | */ | 1648 | */ |
1683 | if (retval < 0 || !count || *ppos >= size) { | 1649 | if (retval < 0 || !count || *ppos >= size) { |
1684 | file_accessed(filp); | 1650 | file_accessed(filp); |
1685 | goto out; | 1651 | goto out; |
1686 | } | 1652 | } |
1687 | } | 1653 | } |
1688 | } | 1654 | } |
1689 | 1655 | ||
1690 | count = retval; | 1656 | count = retval; |
1691 | for (seg = 0; seg < nr_segs; seg++) { | 1657 | for (seg = 0; seg < nr_segs; seg++) { |
1692 | read_descriptor_t desc; | 1658 | read_descriptor_t desc; |
1693 | loff_t offset = 0; | 1659 | loff_t offset = 0; |
1694 | 1660 | ||
1695 | /* | 1661 | /* |
1696 | * If we did a short DIO read we need to skip the section of the | 1662 | * If we did a short DIO read we need to skip the section of the |
1697 | * iov that we've already read data into. | 1663 | * iov that we've already read data into. |
1698 | */ | 1664 | */ |
1699 | if (count) { | 1665 | if (count) { |
1700 | if (count > iov[seg].iov_len) { | 1666 | if (count > iov[seg].iov_len) { |
1701 | count -= iov[seg].iov_len; | 1667 | count -= iov[seg].iov_len; |
1702 | continue; | 1668 | continue; |
1703 | } | 1669 | } |
1704 | offset = count; | 1670 | offset = count; |
1705 | count = 0; | 1671 | count = 0; |
1706 | } | 1672 | } |
1707 | 1673 | ||
1708 | desc.written = 0; | 1674 | desc.written = 0; |
1709 | desc.arg.buf = iov[seg].iov_base + offset; | 1675 | desc.arg.buf = iov[seg].iov_base + offset; |
1710 | desc.count = iov[seg].iov_len - offset; | 1676 | desc.count = iov[seg].iov_len - offset; |
1711 | if (desc.count == 0) | 1677 | if (desc.count == 0) |
1712 | continue; | 1678 | continue; |
1713 | desc.error = 0; | 1679 | desc.error = 0; |
1714 | do_generic_file_read(filp, ppos, &desc, file_read_actor); | 1680 | do_generic_file_read(filp, ppos, &desc, file_read_actor); |
1715 | retval += desc.written; | 1681 | retval += desc.written; |
1716 | if (desc.error) { | 1682 | if (desc.error) { |
1717 | retval = retval ?: desc.error; | 1683 | retval = retval ?: desc.error; |
1718 | break; | 1684 | break; |
1719 | } | 1685 | } |
1720 | if (desc.count > 0) | 1686 | if (desc.count > 0) |
1721 | break; | 1687 | break; |
1722 | } | 1688 | } |
1723 | out: | 1689 | out: |
1724 | return retval; | 1690 | return retval; |
1725 | } | 1691 | } |
1726 | EXPORT_SYMBOL(generic_file_aio_read); | 1692 | EXPORT_SYMBOL(generic_file_aio_read); |
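Note: generic_file_aio_read() is the generic ->aio_read implementation that most disk filesystems of this kernel generation reuse directly. A minimal wiring sketch, assuming a hypothetical example_file_operations (the helpers referenced are real exported symbols):

    /* Hypothetical file_operations showing the usual wiring of the generic read path */
    const struct file_operations example_file_operations = {
            .llseek   = generic_file_llseek,
            .read     = do_sync_read,             /* synchronous wrapper around ->aio_read */
            .aio_read = generic_file_aio_read,    /* the routine above */
            .mmap     = generic_file_mmap,        /* defined further down in this file */
    };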
1727 | 1693 | ||
1728 | #ifdef CONFIG_MMU | 1694 | #ifdef CONFIG_MMU |
1729 | /** | 1695 | /** |
1730 | * page_cache_read - adds requested page to the page cache if not already there | 1696 | * page_cache_read - adds requested page to the page cache if not already there |
1731 | * @file: file to read | 1697 | * @file: file to read |
1732 | * @offset: page index | 1698 | * @offset: page index |
1733 | * | 1699 | * |
1734 | * This adds the requested page to the page cache if it isn't already there, | 1700 | * This adds the requested page to the page cache if it isn't already there, |
1735 | * and schedules an I/O to read in its contents from disk. | 1701 | * and schedules an I/O to read in its contents from disk. |
1736 | */ | 1702 | */ |
1737 | static int page_cache_read(struct file *file, pgoff_t offset) | 1703 | static int page_cache_read(struct file *file, pgoff_t offset) |
1738 | { | 1704 | { |
1739 | struct address_space *mapping = file->f_mapping; | 1705 | struct address_space *mapping = file->f_mapping; |
1740 | struct page *page; | 1706 | struct page *page; |
1741 | int ret; | 1707 | int ret; |
1742 | 1708 | ||
1743 | do { | 1709 | do { |
1744 | page = page_cache_alloc_cold(mapping); | 1710 | page = page_cache_alloc_cold(mapping); |
1745 | if (!page) | 1711 | if (!page) |
1746 | return -ENOMEM; | 1712 | return -ENOMEM; |
1747 | 1713 | ||
1748 | ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL); | 1714 | ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL); |
1749 | if (ret == 0) | 1715 | if (ret == 0) |
1750 | ret = mapping->a_ops->readpage(file, page); | 1716 | ret = mapping->a_ops->readpage(file, page); |
1751 | else if (ret == -EEXIST) | 1717 | else if (ret == -EEXIST) |
1752 | ret = 0; /* losing race to add is OK */ | 1718 | ret = 0; /* losing race to add is OK */ |
1753 | 1719 | ||
1754 | page_cache_release(page); | 1720 | page_cache_release(page); |
1755 | 1721 | ||
1756 | } while (ret == AOP_TRUNCATED_PAGE); | 1722 | } while (ret == AOP_TRUNCATED_PAGE); |
1757 | 1723 | ||
1758 | return ret; | 1724 | return ret; |
1759 | } | 1725 | } |
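Note: page_cache_read() retries while ->readpage returns AOP_TRUNCATED_PAGE and treats -EEXIST from add_to_page_cache_lru() as success, since losing the insertion race still leaves the wanted page in the cache. The filler it invokes must unlock the page itself once the read completes or fails. A hypothetical zero-filling ->readpage illustrating that contract (example_readpage is not a real kernel function):

    /* Sketch of the ->readpage contract page_cache_read() relies on: bring the
     * page uptodate (or record an error) and unlock it when done. */
    static int example_readpage(struct file *file, struct page *page)
    {
            void *kaddr = kmap_atomic(page);

            memset(kaddr, 0, PAGE_CACHE_SIZE);      /* stand-in for real block I/O */
            kunmap_atomic(kaddr);

            SetPageUptodate(page);
            unlock_page(page);                      /* readers wait on the page lock */
            return 0;
    }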
1760 | 1726 | ||
1761 | #define MMAP_LOTSAMISS (100) | 1727 | #define MMAP_LOTSAMISS (100) |
1762 | 1728 | ||
1763 | /* | 1729 | /* |
1764 | * Synchronous readahead happens when we don't even find | 1730 | * Synchronous readahead happens when we don't even find |
1765 | * a page in the page cache at all. | 1731 | * a page in the page cache at all. |
1766 | */ | 1732 | */ |
1767 | static void do_sync_mmap_readahead(struct vm_area_struct *vma, | 1733 | static void do_sync_mmap_readahead(struct vm_area_struct *vma, |
1768 | struct file_ra_state *ra, | 1734 | struct file_ra_state *ra, |
1769 | struct file *file, | 1735 | struct file *file, |
1770 | pgoff_t offset) | 1736 | pgoff_t offset) |
1771 | { | 1737 | { |
1772 | unsigned long ra_pages; | 1738 | unsigned long ra_pages; |
1773 | struct address_space *mapping = file->f_mapping; | 1739 | struct address_space *mapping = file->f_mapping; |
1774 | 1740 | ||
1775 | /* If we don't want any read-ahead, don't bother */ | 1741 | /* If we don't want any read-ahead, don't bother */ |
1776 | if (vma->vm_flags & VM_RAND_READ) | 1742 | if (vma->vm_flags & VM_RAND_READ) |
1777 | return; | 1743 | return; |
1778 | if (!ra->ra_pages) | 1744 | if (!ra->ra_pages) |
1779 | return; | 1745 | return; |
1780 | 1746 | ||
1781 | if (vma->vm_flags & VM_SEQ_READ) { | 1747 | if (vma->vm_flags & VM_SEQ_READ) { |
1782 | page_cache_sync_readahead(mapping, ra, file, offset, | 1748 | page_cache_sync_readahead(mapping, ra, file, offset, |
1783 | ra->ra_pages); | 1749 | ra->ra_pages); |
1784 | return; | 1750 | return; |
1785 | } | 1751 | } |
1786 | 1752 | ||
1787 | /* Avoid banging the cache line if not needed */ | 1753 | /* Avoid banging the cache line if not needed */ |
1788 | if (ra->mmap_miss < MMAP_LOTSAMISS * 10) | 1754 | if (ra->mmap_miss < MMAP_LOTSAMISS * 10) |
1789 | ra->mmap_miss++; | 1755 | ra->mmap_miss++; |
1790 | 1756 | ||
1791 | /* | 1757 | /* |
1792 | * Do we miss much more than hit in this file? If so, | 1758 | * Do we miss much more than hit in this file? If so, |
1793 | * stop bothering with read-ahead. It will only hurt. | 1759 | * stop bothering with read-ahead. It will only hurt. |
1794 | */ | 1760 | */ |
1795 | if (ra->mmap_miss > MMAP_LOTSAMISS) | 1761 | if (ra->mmap_miss > MMAP_LOTSAMISS) |
1796 | return; | 1762 | return; |
1797 | 1763 | ||
1798 | /* | 1764 | /* |
1799 | * mmap read-around | 1765 | * mmap read-around |
1800 | */ | 1766 | */ |
1801 | ra_pages = max_sane_readahead(ra->ra_pages); | 1767 | ra_pages = max_sane_readahead(ra->ra_pages); |
1802 | ra->start = max_t(long, 0, offset - ra_pages / 2); | 1768 | ra->start = max_t(long, 0, offset - ra_pages / 2); |
1803 | ra->size = ra_pages; | 1769 | ra->size = ra_pages; |
1804 | ra->async_size = ra_pages / 4; | 1770 | ra->async_size = ra_pages / 4; |
1805 | ra_submit(ra, mapping, file); | 1771 | ra_submit(ra, mapping, file); |
1806 | } | 1772 | } |
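Note: for the non-sequential mmap case the readahead window is centred on the faulting page: roughly half the pages sit before the fault, half after, and the last quarter is flagged as the async tail so a later fault there extends the readahead. A worked example with an assumed 32-page window:

    /* Worked example (assumed values): ra->ra_pages == 32, faulting offset == 100 */
    unsigned long ra_pages = 32;
    pgoff_t offset = 100;
    pgoff_t start            = max_t(long, 0, offset - ra_pages / 2);  /* 84: window starts 16 pages early */
    unsigned long size       = ra_pages;                               /* covers page indexes 84..115 */
    unsigned long async_size = ra_pages / 4;                           /* 8: tail that triggers async readahead */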
1807 | 1773 | ||
1808 | /* | 1774 | /* |
1809 | * Asynchronous readahead happens when we find the page and PG_readahead, | 1775 | * Asynchronous readahead happens when we find the page and PG_readahead, |
1810 | * so we want to possibly extend the readahead further.. | 1776 | * so we want to possibly extend the readahead further.. |
1811 | */ | 1777 | */ |
1812 | static void do_async_mmap_readahead(struct vm_area_struct *vma, | 1778 | static void do_async_mmap_readahead(struct vm_area_struct *vma, |
1813 | struct file_ra_state *ra, | 1779 | struct file_ra_state *ra, |
1814 | struct file *file, | 1780 | struct file *file, |
1815 | struct page *page, | 1781 | struct page *page, |
1816 | pgoff_t offset) | 1782 | pgoff_t offset) |
1817 | { | 1783 | { |
1818 | struct address_space *mapping = file->f_mapping; | 1784 | struct address_space *mapping = file->f_mapping; |
1819 | 1785 | ||
1820 | /* If we don't want any read-ahead, don't bother */ | 1786 | /* If we don't want any read-ahead, don't bother */ |
1821 | if (vma->vm_flags & VM_RAND_READ) | 1787 | if (vma->vm_flags & VM_RAND_READ) |
1822 | return; | 1788 | return; |
1823 | if (ra->mmap_miss > 0) | 1789 | if (ra->mmap_miss > 0) |
1824 | ra->mmap_miss--; | 1790 | ra->mmap_miss--; |
1825 | if (PageReadahead(page)) | 1791 | if (PageReadahead(page)) |
1826 | page_cache_async_readahead(mapping, ra, file, | 1792 | page_cache_async_readahead(mapping, ra, file, |
1827 | page, offset, ra->ra_pages); | 1793 | page, offset, ra->ra_pages); |
1828 | } | 1794 | } |
1829 | 1795 | ||
1830 | /** | 1796 | /** |
1831 | * filemap_fault - read in file data for page fault handling | 1797 | * filemap_fault - read in file data for page fault handling |
1832 | * @vma: vma in which the fault was taken | 1798 | * @vma: vma in which the fault was taken |
1833 | * @vmf: struct vm_fault containing details of the fault | 1799 | * @vmf: struct vm_fault containing details of the fault |
1834 | * | 1800 | * |
1835 | * filemap_fault() is invoked via the vma operations vector for a | 1801 | * filemap_fault() is invoked via the vma operations vector for a |
1836 | * mapped memory region to read in file data during a page fault. | 1802 | * mapped memory region to read in file data during a page fault. |
1837 | * | 1803 | * |
1838 | * The goto's are kind of ugly, but this streamlines the normal case of having | 1804 | * The goto's are kind of ugly, but this streamlines the normal case of having |
1839 | * it in the page cache, and handles the special cases reasonably without | 1805 | * it in the page cache, and handles the special cases reasonably without |
1840 | * having a lot of duplicated code. | 1806 | * having a lot of duplicated code. |
1841 | */ | 1807 | */ |
1842 | int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) | 1808 | int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) |
1843 | { | 1809 | { |
1844 | int error; | 1810 | int error; |
1845 | struct file *file = vma->vm_file; | 1811 | struct file *file = vma->vm_file; |
1846 | struct address_space *mapping = file->f_mapping; | 1812 | struct address_space *mapping = file->f_mapping; |
1847 | struct file_ra_state *ra = &file->f_ra; | 1813 | struct file_ra_state *ra = &file->f_ra; |
1848 | struct inode *inode = mapping->host; | 1814 | struct inode *inode = mapping->host; |
1849 | pgoff_t offset = vmf->pgoff; | 1815 | pgoff_t offset = vmf->pgoff; |
1850 | struct page *page; | 1816 | struct page *page; |
1851 | pgoff_t size; | 1817 | pgoff_t size; |
1852 | int ret = 0; | 1818 | int ret = 0; |
1853 | 1819 | ||
1854 | size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; | 1820 | size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; |
1855 | if (offset >= size) | 1821 | if (offset >= size) |
1856 | return VM_FAULT_SIGBUS; | 1822 | return VM_FAULT_SIGBUS; |
1857 | 1823 | ||
1858 | /* | 1824 | /* |
1859 | * Do we have something in the page cache already? | 1825 | * Do we have something in the page cache already? |
1860 | */ | 1826 | */ |
1861 | page = find_get_page(mapping, offset); | 1827 | page = find_get_page(mapping, offset); |
1862 | if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) { | 1828 | if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) { |
1863 | /* | 1829 | /* |
1864 | * We found the page, so try async readahead before | 1830 | * We found the page, so try async readahead before |
1865 | * waiting for the lock. | 1831 | * waiting for the lock. |
1866 | */ | 1832 | */ |
1867 | do_async_mmap_readahead(vma, ra, file, page, offset); | 1833 | do_async_mmap_readahead(vma, ra, file, page, offset); |
1868 | } else if (!page) { | 1834 | } else if (!page) { |
1869 | /* No page in the page cache at all */ | 1835 | /* No page in the page cache at all */ |
1870 | do_sync_mmap_readahead(vma, ra, file, offset); | 1836 | do_sync_mmap_readahead(vma, ra, file, offset); |
1871 | count_vm_event(PGMAJFAULT); | 1837 | count_vm_event(PGMAJFAULT); |
1872 | mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); | 1838 | mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); |
1873 | ret = VM_FAULT_MAJOR; | 1839 | ret = VM_FAULT_MAJOR; |
1874 | retry_find: | 1840 | retry_find: |
1875 | page = find_get_page(mapping, offset); | 1841 | page = find_get_page(mapping, offset); |
1876 | if (!page) | 1842 | if (!page) |
1877 | goto no_cached_page; | 1843 | goto no_cached_page; |
1878 | } | 1844 | } |
1879 | 1845 | ||
1880 | if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) { | 1846 | if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) { |
1881 | page_cache_release(page); | 1847 | page_cache_release(page); |
1882 | return ret | VM_FAULT_RETRY; | 1848 | return ret | VM_FAULT_RETRY; |
1883 | } | 1849 | } |
1884 | 1850 | ||
1885 | /* Did it get truncated? */ | 1851 | /* Did it get truncated? */ |
1886 | if (unlikely(page->mapping != mapping)) { | 1852 | if (unlikely(page->mapping != mapping)) { |
1887 | unlock_page(page); | 1853 | unlock_page(page); |
1888 | put_page(page); | 1854 | put_page(page); |
1889 | goto retry_find; | 1855 | goto retry_find; |
1890 | } | 1856 | } |
1891 | VM_BUG_ON(page->index != offset); | 1857 | VM_BUG_ON(page->index != offset); |
1892 | 1858 | ||
1893 | /* | 1859 | /* |
1894 | * We have a locked page in the page cache, now we need to check | 1860 | * We have a locked page in the page cache, now we need to check |
1895 | * that it's up-to-date. If not, it is going to be due to an error. | 1861 | * that it's up-to-date. If not, it is going to be due to an error. |
1896 | */ | 1862 | */ |
1897 | if (unlikely(!PageUptodate(page))) | 1863 | if (unlikely(!PageUptodate(page))) |
1898 | goto page_not_uptodate; | 1864 | goto page_not_uptodate; |
1899 | 1865 | ||
1900 | /* | 1866 | /* |
1901 | * Found the page and have a reference on it. | 1867 | * Found the page and have a reference on it. |
1902 | * We must recheck i_size under page lock. | 1868 | * We must recheck i_size under page lock. |
1903 | */ | 1869 | */ |
1904 | size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; | 1870 | size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; |
1905 | if (unlikely(offset >= size)) { | 1871 | if (unlikely(offset >= size)) { |
1906 | unlock_page(page); | 1872 | unlock_page(page); |
1907 | page_cache_release(page); | 1873 | page_cache_release(page); |
1908 | return VM_FAULT_SIGBUS; | 1874 | return VM_FAULT_SIGBUS; |
1909 | } | 1875 | } |
1910 | 1876 | ||
1911 | vmf->page = page; | 1877 | vmf->page = page; |
1912 | return ret | VM_FAULT_LOCKED; | 1878 | return ret | VM_FAULT_LOCKED; |
1913 | 1879 | ||
1914 | no_cached_page: | 1880 | no_cached_page: |
1915 | /* | 1881 | /* |
1916 | * We're only likely to ever get here if MADV_RANDOM is in | 1882 | * We're only likely to ever get here if MADV_RANDOM is in |
1917 | * effect. | 1883 | * effect. |
1918 | */ | 1884 | */ |
1919 | error = page_cache_read(file, offset); | 1885 | error = page_cache_read(file, offset); |
1920 | 1886 | ||
1921 | /* | 1887 | /* |
1922 | * The page we want has now been added to the page cache. | 1888 | * The page we want has now been added to the page cache. |
1923 | * In the unlikely event that someone removed it in the | 1889 | * In the unlikely event that someone removed it in the |
1924 | * meantime, we'll just come back here and read it again. | 1890 | * meantime, we'll just come back here and read it again. |
1925 | */ | 1891 | */ |
1926 | if (error >= 0) | 1892 | if (error >= 0) |
1927 | goto retry_find; | 1893 | goto retry_find; |
1928 | 1894 | ||
1929 | /* | 1895 | /* |
1930 | * An error return from page_cache_read can result if the | 1896 | * An error return from page_cache_read can result if the |
1931 | * system is low on memory, or a problem occurs while trying | 1897 | * system is low on memory, or a problem occurs while trying |
1932 | * to schedule I/O. | 1898 | * to schedule I/O. |
1933 | */ | 1899 | */ |
1934 | if (error == -ENOMEM) | 1900 | if (error == -ENOMEM) |
1935 | return VM_FAULT_OOM; | 1901 | return VM_FAULT_OOM; |
1936 | return VM_FAULT_SIGBUS; | 1902 | return VM_FAULT_SIGBUS; |
1937 | 1903 | ||
1938 | page_not_uptodate: | 1904 | page_not_uptodate: |
1939 | /* | 1905 | /* |
1940 | * Umm, take care of errors if the page isn't up-to-date. | 1906 | * Umm, take care of errors if the page isn't up-to-date. |
1941 | * Try to re-read it _once_. We do this synchronously, | 1907 | * Try to re-read it _once_. We do this synchronously, |
1942 | * because there really aren't any performance issues here | 1908 | * because there really aren't any performance issues here |
1943 | * and we need to check for errors. | 1909 | * and we need to check for errors. |
1944 | */ | 1910 | */ |
1945 | ClearPageError(page); | 1911 | ClearPageError(page); |
1946 | error = mapping->a_ops->readpage(file, page); | 1912 | error = mapping->a_ops->readpage(file, page); |
1947 | if (!error) { | 1913 | if (!error) { |
1948 | wait_on_page_locked(page); | 1914 | wait_on_page_locked(page); |
1949 | if (!PageUptodate(page)) | 1915 | if (!PageUptodate(page)) |
1950 | error = -EIO; | 1916 | error = -EIO; |
1951 | } | 1917 | } |
1952 | page_cache_release(page); | 1918 | page_cache_release(page); |
1953 | 1919 | ||
1954 | if (!error || error == AOP_TRUNCATED_PAGE) | 1920 | if (!error || error == AOP_TRUNCATED_PAGE) |
1955 | goto retry_find; | 1921 | goto retry_find; |
1956 | 1922 | ||
1957 | /* Things didn't work out. Return zero to tell the mm layer so. */ | 1923 | /* Things didn't work out. Return zero to tell the mm layer so. */ |
1958 | shrink_readahead_size_eio(file, ra); | 1924 | shrink_readahead_size_eio(file, ra); |
1959 | return VM_FAULT_SIGBUS; | 1925 | return VM_FAULT_SIGBUS; |
1960 | } | 1926 | } |
1961 | EXPORT_SYMBOL(filemap_fault); | 1927 | EXPORT_SYMBOL(filemap_fault); |
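Note: both i_size checks in filemap_fault() round the file size up to whole pages; an offset at or beyond that boundary gets VM_FAULT_SIGBUS, and the check is repeated under the page lock because a concurrent truncate can shrink the file after the initial lookup. A worked example with assumed values:

    /* Worked example (assumed 4096-byte pages): i_size_read(inode) == 10000 */
    loff_t i_size = 10000;
    pgoff_t size = (i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;  /* 3 pages */
    /* page offsets 0, 1 and 2 can be faulted in; offset >= 3 returns VM_FAULT_SIGBUS */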
1962 | 1928 | ||
1963 | int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) | 1929 | int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) |
1964 | { | 1930 | { |
1965 | struct page *page = vmf->page; | 1931 | struct page *page = vmf->page; |
1966 | struct inode *inode = file_inode(vma->vm_file); | 1932 | struct inode *inode = file_inode(vma->vm_file); |
1967 | int ret = VM_FAULT_LOCKED; | 1933 | int ret = VM_FAULT_LOCKED; |
1968 | 1934 | ||
1969 | sb_start_pagefault(inode->i_sb); | 1935 | sb_start_pagefault(inode->i_sb); |
1970 | file_update_time(vma->vm_file); | 1936 | file_update_time(vma->vm_file); |
1971 | lock_page(page); | 1937 | lock_page(page); |
1972 | if (page->mapping != inode->i_mapping) { | 1938 | if (page->mapping != inode->i_mapping) { |
1973 | unlock_page(page); | 1939 | unlock_page(page); |
1974 | ret = VM_FAULT_NOPAGE; | 1940 | ret = VM_FAULT_NOPAGE; |
1975 | goto out; | 1941 | goto out; |
1976 | } | 1942 | } |
1977 | /* | 1943 | /* |
1978 | * We mark the page dirty already here so that when freeze is in | 1944 | * We mark the page dirty already here so that when freeze is in |
1979 | * progress, we are guaranteed that writeback during freezing will | 1945 | * progress, we are guaranteed that writeback during freezing will |
1980 | * see the dirty page and writeprotect it again. | 1946 | * see the dirty page and writeprotect it again. |
1981 | */ | 1947 | */ |
1982 | set_page_dirty(page); | 1948 | set_page_dirty(page); |
1983 | wait_for_stable_page(page); | 1949 | wait_for_stable_page(page); |
1984 | out: | 1950 | out: |
1985 | sb_end_pagefault(inode->i_sb); | 1951 | sb_end_pagefault(inode->i_sb); |
1986 | return ret; | 1952 | return ret; |
1987 | } | 1953 | } |
1988 | EXPORT_SYMBOL(filemap_page_mkwrite); | 1954 | EXPORT_SYMBOL(filemap_page_mkwrite); |
1989 | 1955 | ||
1990 | const struct vm_operations_struct generic_file_vm_ops = { | 1956 | const struct vm_operations_struct generic_file_vm_ops = { |
1991 | .fault = filemap_fault, | 1957 | .fault = filemap_fault, |
1992 | .page_mkwrite = filemap_page_mkwrite, | 1958 | .page_mkwrite = filemap_page_mkwrite, |
1993 | .remap_pages = generic_file_remap_pages, | 1959 | .remap_pages = generic_file_remap_pages, |
1994 | }; | 1960 | }; |
1995 | 1961 | ||
1996 | /* This is used for a general mmap of a disk file */ | 1962 | /* This is used for a general mmap of a disk file */ |
1997 | 1963 | ||
1998 | int generic_file_mmap(struct file * file, struct vm_area_struct * vma) | 1964 | int generic_file_mmap(struct file * file, struct vm_area_struct * vma) |
1999 | { | 1965 | { |
2000 | struct address_space *mapping = file->f_mapping; | 1966 | struct address_space *mapping = file->f_mapping; |
2001 | 1967 | ||
2002 | if (!mapping->a_ops->readpage) | 1968 | if (!mapping->a_ops->readpage) |
2003 | return -ENOEXEC; | 1969 | return -ENOEXEC; |
2004 | file_accessed(file); | 1970 | file_accessed(file); |
2005 | vma->vm_ops = &generic_file_vm_ops; | 1971 | vma->vm_ops = &generic_file_vm_ops; |
2006 | return 0; | 1972 | return 0; |
2007 | } | 1973 | } |
2008 | 1974 | ||
2009 | /* | 1975 | /* |
2010 | * This is for filesystems which do not implement ->writepage. | 1976 | * This is for filesystems which do not implement ->writepage. |
2011 | */ | 1977 | */ |
2012 | int generic_file_readonly_mmap(struct file *file, struct vm_area_struct *vma) | 1978 | int generic_file_readonly_mmap(struct file *file, struct vm_area_struct *vma) |
2013 | { | 1979 | { |
2014 | if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) | 1980 | if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) |
2015 | return -EINVAL; | 1981 | return -EINVAL; |
2016 | return generic_file_mmap(file, vma); | 1982 | return generic_file_mmap(file, vma); |
2017 | } | 1983 | } |
2018 | #else | 1984 | #else |
2019 | int generic_file_mmap(struct file * file, struct vm_area_struct * vma) | 1985 | int generic_file_mmap(struct file * file, struct vm_area_struct * vma) |
2020 | { | 1986 | { |
2021 | return -ENOSYS; | 1987 | return -ENOSYS; |
2022 | } | 1988 | } |
2023 | int generic_file_readonly_mmap(struct file * file, struct vm_area_struct * vma) | 1989 | int generic_file_readonly_mmap(struct file * file, struct vm_area_struct * vma) |
2024 | { | 1990 | { |
2025 | return -ENOSYS; | 1991 | return -ENOSYS; |
2026 | } | 1992 | } |
2027 | #endif /* CONFIG_MMU */ | 1993 | #endif /* CONFIG_MMU */ |
2028 | 1994 | ||
2029 | EXPORT_SYMBOL(generic_file_mmap); | 1995 | EXPORT_SYMBOL(generic_file_mmap); |
2030 | EXPORT_SYMBOL(generic_file_readonly_mmap); | 1996 | EXPORT_SYMBOL(generic_file_readonly_mmap); |
2031 | 1997 | ||
2032 | static struct page *wait_on_page_read(struct page *page) | 1998 | static struct page *wait_on_page_read(struct page *page) |
2033 | { | 1999 | { |
2034 | if (!IS_ERR(page)) { | 2000 | if (!IS_ERR(page)) { |
2035 | wait_on_page_locked(page); | 2001 | wait_on_page_locked(page); |
2036 | if (!PageUptodate(page)) { | 2002 | if (!PageUptodate(page)) { |
2037 | page_cache_release(page); | 2003 | page_cache_release(page); |
2038 | page = ERR_PTR(-EIO); | 2004 | page = ERR_PTR(-EIO); |
2039 | } | 2005 | } |
2040 | } | 2006 | } |
2041 | return page; | 2007 | return page; |
2042 | } | 2008 | } |
2043 | 2009 | ||
2044 | static struct page *__read_cache_page(struct address_space *mapping, | 2010 | static struct page *__read_cache_page(struct address_space *mapping, |
2045 | pgoff_t index, | 2011 | pgoff_t index, |
2046 | int (*filler)(void *, struct page *), | 2012 | int (*filler)(void *, struct page *), |
2047 | void *data, | 2013 | void *data, |
2048 | gfp_t gfp) | 2014 | gfp_t gfp) |
2049 | { | 2015 | { |
2050 | struct page *page; | 2016 | struct page *page; |
2051 | int err; | 2017 | int err; |
2052 | repeat: | 2018 | repeat: |
2053 | page = find_get_page(mapping, index); | 2019 | page = find_get_page(mapping, index); |
2054 | if (!page) { | 2020 | if (!page) { |
2055 | page = __page_cache_alloc(gfp | __GFP_COLD); | 2021 | page = __page_cache_alloc(gfp | __GFP_COLD); |
2056 | if (!page) | 2022 | if (!page) |
2057 | return ERR_PTR(-ENOMEM); | 2023 | return ERR_PTR(-ENOMEM); |
2058 | err = add_to_page_cache_lru(page, mapping, index, gfp); | 2024 | err = add_to_page_cache_lru(page, mapping, index, gfp); |
2059 | if (unlikely(err)) { | 2025 | if (unlikely(err)) { |
2060 | page_cache_release(page); | 2026 | page_cache_release(page); |
2061 | if (err == -EEXIST) | 2027 | if (err == -EEXIST) |
2062 | goto repeat; | 2028 | goto repeat; |
2063 | /* Presumably ENOMEM for radix tree node */ | 2029 | /* Presumably ENOMEM for radix tree node */ |
2064 | return ERR_PTR(err); | 2030 | return ERR_PTR(err); |
2065 | } | 2031 | } |
2066 | err = filler(data, page); | 2032 | err = filler(data, page); |
2067 | if (err < 0) { | 2033 | if (err < 0) { |
2068 | page_cache_release(page); | 2034 | page_cache_release(page); |
2069 | page = ERR_PTR(err); | 2035 | page = ERR_PTR(err); |
2070 | } else { | 2036 | } else { |
2071 | page = wait_on_page_read(page); | 2037 | page = wait_on_page_read(page); |
2072 | } | 2038 | } |
2073 | } | 2039 | } |
2074 | return page; | 2040 | return page; |
2075 | } | 2041 | } |
2076 | 2042 | ||
2077 | static struct page *do_read_cache_page(struct address_space *mapping, | 2043 | static struct page *do_read_cache_page(struct address_space *mapping, |
2078 | pgoff_t index, | 2044 | pgoff_t index, |
2079 | int (*filler)(void *, struct page *), | 2045 | int (*filler)(void *, struct page *), |
2080 | void *data, | 2046 | void *data, |
2081 | gfp_t gfp) | 2047 | gfp_t gfp) |
2082 | 2048 | ||
2083 | { | 2049 | { |
2084 | struct page *page; | 2050 | struct page *page; |
2085 | int err; | 2051 | int err; |
2086 | 2052 | ||
2087 | retry: | 2053 | retry: |
2088 | page = __read_cache_page(mapping, index, filler, data, gfp); | 2054 | page = __read_cache_page(mapping, index, filler, data, gfp); |
2089 | if (IS_ERR(page)) | 2055 | if (IS_ERR(page)) |
2090 | return page; | 2056 | return page; |
2091 | if (PageUptodate(page)) | 2057 | if (PageUptodate(page)) |
2092 | goto out; | 2058 | goto out; |
2093 | 2059 | ||
2094 | lock_page(page); | 2060 | lock_page(page); |
2095 | if (!page->mapping) { | 2061 | if (!page->mapping) { |
2096 | unlock_page(page); | 2062 | unlock_page(page); |
2097 | page_cache_release(page); | 2063 | page_cache_release(page); |
2098 | goto retry; | 2064 | goto retry; |
2099 | } | 2065 | } |
2100 | if (PageUptodate(page)) { | 2066 | if (PageUptodate(page)) { |
2101 | unlock_page(page); | 2067 | unlock_page(page); |
2102 | goto out; | 2068 | goto out; |
2103 | } | 2069 | } |
2104 | err = filler(data, page); | 2070 | err = filler(data, page); |
2105 | if (err < 0) { | 2071 | if (err < 0) { |
2106 | page_cache_release(page); | 2072 | page_cache_release(page); |
2107 | return ERR_PTR(err); | 2073 | return ERR_PTR(err); |
2108 | } else { | 2074 | } else { |
2109 | page = wait_on_page_read(page); | 2075 | page = wait_on_page_read(page); |
2110 | if (IS_ERR(page)) | 2076 | if (IS_ERR(page)) |
2111 | return page; | 2077 | return page; |
2112 | } | 2078 | } |
2113 | out: | 2079 | out: |
2114 | mark_page_accessed(page); | 2080 | mark_page_accessed(page); |
2115 | return page; | 2081 | return page; |
2116 | } | 2082 | } |
2117 | 2083 | ||
2118 | /** | 2084 | /** |
2119 | * read_cache_page - read into page cache, fill it if needed | 2085 | * read_cache_page - read into page cache, fill it if needed |
2120 | * @mapping: the page's address_space | 2086 | * @mapping: the page's address_space |
2121 | * @index: the page index | 2087 | * @index: the page index |
2122 | * @filler: function to perform the read | 2088 | * @filler: function to perform the read |
2123 | * @data: first arg to filler(data, page) function, often left as NULL | 2089 | * @data: first arg to filler(data, page) function, often left as NULL |
2124 | * | 2090 | * |
2125 | * Read into the page cache. If a page already exists, and PageUptodate() is | 2091 | * Read into the page cache. If a page already exists, and PageUptodate() is |
2126 | * not set, try to fill the page and wait for it to become unlocked. | 2092 | * not set, try to fill the page and wait for it to become unlocked. |
2127 | * | 2093 | * |
2128 | * If the page does not get brought uptodate, return -EIO. | 2094 | * If the page does not get brought uptodate, return -EIO. |
2129 | */ | 2095 | */ |
2130 | struct page *read_cache_page(struct address_space *mapping, | 2096 | struct page *read_cache_page(struct address_space *mapping, |
2131 | pgoff_t index, | 2097 | pgoff_t index, |
2132 | int (*filler)(void *, struct page *), | 2098 | int (*filler)(void *, struct page *), |
2133 | void *data) | 2099 | void *data) |
2134 | { | 2100 | { |
2135 | return do_read_cache_page(mapping, index, filler, data, mapping_gfp_mask(mapping)); | 2101 | return do_read_cache_page(mapping, index, filler, data, mapping_gfp_mask(mapping)); |
2136 | } | 2102 | } |
2137 | EXPORT_SYMBOL(read_cache_page); | 2103 | EXPORT_SYMBOL(read_cache_page); |
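Note: most callers reach read_cache_page() through the read_mapping_page() wrapper, which passes NULL data so the mapping's ->readpage acts as the filler. A usage sketch, where example_get_first_page() is a hypothetical helper:

    /* Usage sketch: fetch page 0 of an inode through the page cache */
    static struct page *example_get_first_page(struct inode *inode)
    {
            struct page *page = read_mapping_page(inode->i_mapping, 0, NULL);

            if (IS_ERR(page))
                    return page;    /* typically ERR_PTR(-EIO) or ERR_PTR(-ENOMEM) */
            /* page is uptodate and referenced; drop it with page_cache_release() */
            return page;
    }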
2138 | 2104 | ||
2139 | /** | 2105 | /** |
2140 | * read_cache_page_gfp - read into page cache, using specified page allocation flags. | 2106 | * read_cache_page_gfp - read into page cache, using specified page allocation flags. |
2141 | * @mapping: the page's address_space | 2107 | * @mapping: the page's address_space |
2142 | * @index: the page index | 2108 | * @index: the page index |
2143 | * @gfp: the page allocator flags to use if allocating | 2109 | * @gfp: the page allocator flags to use if allocating |
2144 | * | 2110 | * |
2145 | * This is the same as "read_mapping_page(mapping, index, NULL)", but with | 2111 | * This is the same as "read_mapping_page(mapping, index, NULL)", but with |
2146 | * any new page allocations done using the specified allocation flags. | 2112 | * any new page allocations done using the specified allocation flags. |
2147 | * | 2113 | * |
2148 | * If the page does not get brought uptodate, return -EIO. | 2114 | * If the page does not get brought uptodate, return -EIO. |
2149 | */ | 2115 | */ |
2150 | struct page *read_cache_page_gfp(struct address_space *mapping, | 2116 | struct page *read_cache_page_gfp(struct address_space *mapping, |
2151 | pgoff_t index, | 2117 | pgoff_t index, |
2152 | gfp_t gfp) | 2118 | gfp_t gfp) |
2153 | { | 2119 | { |
2154 | filler_t *filler = (filler_t *)mapping->a_ops->readpage; | 2120 | filler_t *filler = (filler_t *)mapping->a_ops->readpage; |
2155 | 2121 | ||
2156 | return do_read_cache_page(mapping, index, filler, NULL, gfp); | 2122 | return do_read_cache_page(mapping, index, filler, NULL, gfp); |
2157 | } | 2123 | } |
2158 | EXPORT_SYMBOL(read_cache_page_gfp); | 2124 | EXPORT_SYMBOL(read_cache_page_gfp); |
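Note: a usage sketch for the gfp variant, assuming mapping and index come from the caller; constraining the allocation (for instance masking off __GFP_FS) is one plausible reason to use it instead of read_mapping_page():

    /* Usage sketch (hypothetical caller): same lookup/fill behaviour, but the
     * page allocation must not recurse into the filesystem. */
    struct page *page = read_cache_page_gfp(mapping, index,
                                            mapping_gfp_mask(mapping) & ~__GFP_FS);
    if (IS_ERR(page))
            return PTR_ERR(page);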
2159 | 2125 | ||
2160 | static size_t __iovec_copy_from_user_inatomic(char *vaddr, | 2126 | static size_t __iovec_copy_from_user_inatomic(char *vaddr, |
2161 | const struct iovec *iov, size_t base, size_t bytes) | 2127 | const struct iovec *iov, size_t base, size_t bytes) |
2162 | { | 2128 | { |
2163 | size_t copied = 0, left = 0; | 2129 | size_t copied = 0, left = 0; |
2164 | 2130 | ||
2165 | while (bytes) { | 2131 | while (bytes) { |
2166 | char __user *buf = iov->iov_base + base; | 2132 | char __user *buf = iov->iov_base + base; |
2167 | int copy = min(bytes, iov->iov_len - base); | 2133 | int copy = min(bytes, iov->iov_len - base); |
2168 | 2134 | ||
2169 | base = 0; | 2135 | base = 0; |
2170 | left = __copy_from_user_inatomic(vaddr, buf, copy); | 2136 | left = __copy_from_user_inatomic(vaddr, buf, copy); |
2171 | copied += copy; | 2137 | copied += copy; |
2172 | bytes -= copy; | 2138 | bytes -= copy; |
2173 | vaddr += copy; | 2139 | vaddr += copy; |
2174 | iov++; | 2140 | iov++; |
2175 | 2141 | ||
2176 | if (unlikely(left)) | 2142 | if (unlikely(left)) |
2177 | break; | 2143 | break; |
2178 | } | 2144 | } |
2179 | return copied - left; | 2145 | return copied - left; |
2180 | } | 2146 | } |
2181 | 2147 | ||
2182 | /* | 2148 | /* |
2183 | * Copy as much as we can into the page and return the number of bytes which | 2149 | * Copy as much as we can into the page and return the number of bytes which |
2184 | * were successfully copied. If a fault is encountered then return the number of | 2150 | * were successfully copied. If a fault is encountered then return the number of |
2185 | * bytes which were copied. | 2151 | * bytes which were copied. |
2186 | */ | 2152 | */ |
2187 | size_t iov_iter_copy_from_user_atomic(struct page *page, | 2153 | size_t iov_iter_copy_from_user_atomic(struct page *page, |
2188 | struct iov_iter *i, unsigned long offset, size_t bytes) | 2154 | struct iov_iter *i, unsigned long offset, size_t bytes) |
2189 | { | 2155 | { |
2190 | char *kaddr; | 2156 | char *kaddr; |
2191 | size_t copied; | 2157 | size_t copied; |
2192 | 2158 | ||
2193 | kaddr = kmap_atomic(page); | 2159 | kaddr = kmap_atomic(page); |
2194 | if (likely(i->nr_segs == 1)) { | 2160 | if (likely(i->nr_segs == 1)) { |
2195 | int left; | 2161 | int left; |
2196 | char __user *buf = i->iov->iov_base + i->iov_offset; | 2162 | char __user *buf = i->iov->iov_base + i->iov_offset; |
2197 | left = __copy_from_user_inatomic(kaddr + offset, buf, bytes); | 2163 | left = __copy_from_user_inatomic(kaddr + offset, buf, bytes); |
2198 | copied = bytes - left; | 2164 | copied = bytes - left; |
2199 | } else { | 2165 | } else { |
2200 | copied = __iovec_copy_from_user_inatomic(kaddr + offset, | 2166 | copied = __iovec_copy_from_user_inatomic(kaddr + offset, |
2201 | i->iov, i->iov_offset, bytes); | 2167 | i->iov, i->iov_offset, bytes); |
2202 | } | 2168 | } |
2203 | kunmap_atomic(kaddr); | 2169 | kunmap_atomic(kaddr); |
2204 | 2170 | ||
2205 | return copied; | 2171 | return copied; |
2206 | } | 2172 | } |
2207 | EXPORT_SYMBOL(iov_iter_copy_from_user_atomic); | 2173 | EXPORT_SYMBOL(iov_iter_copy_from_user_atomic); |
2208 | 2174 | ||
2209 | /* | 2175 | /* |
2210 | * This has the same sideeffects and return value as | 2176 | * This has the same sideeffects and return value as |
2211 | * iov_iter_copy_from_user_atomic(). | 2177 | * iov_iter_copy_from_user_atomic(). |
2212 | * The difference is that it attempts to resolve faults. | 2178 | * The difference is that it attempts to resolve faults. |
2213 | * Page must not be locked. | 2179 | * Page must not be locked. |
2214 | */ | 2180 | */ |
2215 | size_t iov_iter_copy_from_user(struct page *page, | 2181 | size_t iov_iter_copy_from_user(struct page *page, |
2216 | struct iov_iter *i, unsigned long offset, size_t bytes) | 2182 | struct iov_iter *i, unsigned long offset, size_t bytes) |
2217 | { | 2183 | { |
2218 | char *kaddr; | 2184 | char *kaddr; |
2219 | size_t copied; | 2185 | size_t copied; |
2220 | 2186 | ||
2221 | kaddr = kmap(page); | 2187 | kaddr = kmap(page); |
2222 | if (likely(i->nr_segs == 1)) { | 2188 | if (likely(i->nr_segs == 1)) { |
2223 | int left; | 2189 | int left; |
2224 | char __user *buf = i->iov->iov_base + i->iov_offset; | 2190 | char __user *buf = i->iov->iov_base + i->iov_offset; |
2225 | left = __copy_from_user(kaddr + offset, buf, bytes); | 2191 | left = __copy_from_user(kaddr + offset, buf, bytes); |
2226 | copied = bytes - left; | 2192 | copied = bytes - left; |
2227 | } else { | 2193 | } else { |
2228 | copied = __iovec_copy_from_user_inatomic(kaddr + offset, | 2194 | copied = __iovec_copy_from_user_inatomic(kaddr + offset, |
2229 | i->iov, i->iov_offset, bytes); | 2195 | i->iov, i->iov_offset, bytes); |
2230 | } | 2196 | } |
2231 | kunmap(page); | 2197 | kunmap(page); |
2232 | return copied; | 2198 | return copied; |
2233 | } | 2199 | } |
2234 | EXPORT_SYMBOL(iov_iter_copy_from_user); | 2200 | EXPORT_SYMBOL(iov_iter_copy_from_user); |
2235 | 2201 | ||
2236 | void iov_iter_advance(struct iov_iter *i, size_t bytes) | 2202 | void iov_iter_advance(struct iov_iter *i, size_t bytes) |
2237 | { | 2203 | { |
2238 | BUG_ON(i->count < bytes); | 2204 | BUG_ON(i->count < bytes); |
2239 | 2205 | ||
2240 | if (likely(i->nr_segs == 1)) { | 2206 | if (likely(i->nr_segs == 1)) { |
2241 | i->iov_offset += bytes; | 2207 | i->iov_offset += bytes; |
2242 | i->count -= bytes; | 2208 | i->count -= bytes; |
2243 | } else { | 2209 | } else { |
2244 | const struct iovec *iov = i->iov; | 2210 | const struct iovec *iov = i->iov; |
2245 | size_t base = i->iov_offset; | 2211 | size_t base = i->iov_offset; |
2246 | unsigned long nr_segs = i->nr_segs; | 2212 | unsigned long nr_segs = i->nr_segs; |
2247 | 2213 | ||
2248 | /* | 2214 | /* |
2249 | * The !iov->iov_len check ensures we skip over unlikely | 2215 | * The !iov->iov_len check ensures we skip over unlikely |
2250 | * zero-length segments (without overruning the iovec). | 2216 | * zero-length segments (without overruning the iovec). |
2251 | */ | 2217 | */ |
2252 | while (bytes || unlikely(i->count && !iov->iov_len)) { | 2218 | while (bytes || unlikely(i->count && !iov->iov_len)) { |
2253 | int copy; | 2219 | int copy; |
2254 | 2220 | ||
2255 | copy = min(bytes, iov->iov_len - base); | 2221 | copy = min(bytes, iov->iov_len - base); |
2256 | BUG_ON(!i->count || i->count < copy); | 2222 | BUG_ON(!i->count || i->count < copy); |
2257 | i->count -= copy; | 2223 | i->count -= copy; |
2258 | bytes -= copy; | 2224 | bytes -= copy; |
2259 | base += copy; | 2225 | base += copy; |
2260 | if (iov->iov_len == base) { | 2226 | if (iov->iov_len == base) { |
2261 | iov++; | 2227 | iov++; |
2262 | nr_segs--; | 2228 | nr_segs--; |
2263 | base = 0; | 2229 | base = 0; |
2264 | } | 2230 | } |
2265 | } | 2231 | } |
2266 | i->iov = iov; | 2232 | i->iov = iov; |
2267 | i->iov_offset = base; | 2233 | i->iov_offset = base; |
2268 | i->nr_segs = nr_segs; | 2234 | i->nr_segs = nr_segs; |
2269 | } | 2235 | } |
2270 | } | 2236 | } |
2271 | EXPORT_SYMBOL(iov_iter_advance); | 2237 | EXPORT_SYMBOL(iov_iter_advance); |
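Note: iov_iter_advance() consumes bytes from the front of the iterator, walking across segments and skipping zero-length ones. A sketch using the iov_iter_init() of this kernel generation and two assumed user buffers:

    /* Sketch (assumed buffers): advancing an iov_iter across two segments */
    char buf0[100], buf1[200];
    struct iovec iov[2] = {
            { .iov_base = buf0, .iov_len = 100 },
            { .iov_base = buf1, .iov_len = 200 },
    };
    struct iov_iter i;

    iov_iter_init(&i, iov, 2, 300, 0);      /* 300 bytes total, nothing written yet */
    iov_iter_advance(&i, 150);              /* consumes segment 0 plus 50 bytes of segment 1 */
    /* now: i.nr_segs == 1, i.iov == &iov[1], i.iov_offset == 50, i.count == 150 */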
2272 | 2238 | ||
2273 | /* | 2239 | /* |
2274 | * Fault in the first iovec of the given iov_iter, to a maximum length | 2240 | * Fault in the first iovec of the given iov_iter, to a maximum length |
2275 | * of bytes. Returns 0 on success, or non-zero if the memory could not be | 2241 | * of bytes. Returns 0 on success, or non-zero if the memory could not be |
2276 | * accessed (ie. because it is an invalid address). | 2242 | * accessed (ie. because it is an invalid address). |
2277 | * | 2243 | * |
2278 | * writev-intensive code may want this to prefault several iovecs -- that | 2244 | * writev-intensive code may want this to prefault several iovecs -- that |
2279 | * would be possible (callers must not rely on the fact that _only_ the | 2245 | * would be possible (callers must not rely on the fact that _only_ the |
2280 | * first iovec will be faulted with the current implementation). | 2246 | * first iovec will be faulted with the current implementation). |
2281 | */ | 2247 | */ |
2282 | int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes) | 2248 | int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes) |
2283 | { | 2249 | { |
2284 | char __user *buf = i->iov->iov_base + i->iov_offset; | 2250 | char __user *buf = i->iov->iov_base + i->iov_offset; |
2285 | bytes = min(bytes, i->iov->iov_len - i->iov_offset); | 2251 | bytes = min(bytes, i->iov->iov_len - i->iov_offset); |
2286 | return fault_in_pages_readable(buf, bytes); | 2252 | return fault_in_pages_readable(buf, bytes); |
2287 | } | 2253 | } |
2288 | EXPORT_SYMBOL(iov_iter_fault_in_readable); | 2254 | EXPORT_SYMBOL(iov_iter_fault_in_readable); |
2289 | 2255 | ||
2290 | /* | 2256 | /* |
2291 | * Return the count of just the current iov_iter segment. | 2257 | * Return the count of just the current iov_iter segment. |
2292 | */ | 2258 | */ |
2293 | size_t iov_iter_single_seg_count(const struct iov_iter *i) | 2259 | size_t iov_iter_single_seg_count(const struct iov_iter *i) |
2294 | { | 2260 | { |
2295 | const struct iovec *iov = i->iov; | 2261 | const struct iovec *iov = i->iov; |
2296 | if (i->nr_segs == 1) | 2262 | if (i->nr_segs == 1) |
2297 | return i->count; | 2263 | return i->count; |
2298 | else | 2264 | else |
2299 | return min(i->count, iov->iov_len - i->iov_offset); | 2265 | return min(i->count, iov->iov_len - i->iov_offset); |
2300 | } | 2266 | } |
2301 | EXPORT_SYMBOL(iov_iter_single_seg_count); | 2267 | EXPORT_SYMBOL(iov_iter_single_seg_count); |
2302 | 2268 | ||
2303 | /* | 2269 | /* |
2304 | * Performs necessary checks before doing a write | 2270 | * Performs necessary checks before doing a write |
2305 | * | 2271 | * |
2306 | * Can adjust writing position or amount of bytes to write. | 2272 | * Can adjust writing position or amount of bytes to write. |
2307 | * Returns appropriate error code that caller should return or | 2273 | * Returns appropriate error code that caller should return or |
2308 | * zero in case that write should be allowed. | 2274 | * zero in case that write should be allowed. |
2309 | */ | 2275 | */ |
2310 | inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk) | 2276 | inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk) |
2311 | { | 2277 | { |
2312 | struct inode *inode = file->f_mapping->host; | 2278 | struct inode *inode = file->f_mapping->host; |
2313 | unsigned long limit = rlimit(RLIMIT_FSIZE); | 2279 | unsigned long limit = rlimit(RLIMIT_FSIZE); |
2314 | 2280 | ||
2315 | if (unlikely(*pos < 0)) | 2281 | if (unlikely(*pos < 0)) |
2316 | return -EINVAL; | 2282 | return -EINVAL; |
2317 | 2283 | ||
2318 | if (!isblk) { | 2284 | if (!isblk) { |
2319 | /* FIXME: this is for backwards compatibility with 2.4 */ | 2285 | /* FIXME: this is for backwards compatibility with 2.4 */ |
2320 | if (file->f_flags & O_APPEND) | 2286 | if (file->f_flags & O_APPEND) |
2321 | *pos = i_size_read(inode); | 2287 | *pos = i_size_read(inode); |
2322 | 2288 | ||
2323 | if (limit != RLIM_INFINITY) { | 2289 | if (limit != RLIM_INFINITY) { |
2324 | if (*pos >= limit) { | 2290 | if (*pos >= limit) { |
2325 | send_sig(SIGXFSZ, current, 0); | 2291 | send_sig(SIGXFSZ, current, 0); |
2326 | return -EFBIG; | 2292 | return -EFBIG; |
2327 | } | 2293 | } |
2328 | if (*count > limit - (typeof(limit))*pos) { | 2294 | if (*count > limit - (typeof(limit))*pos) { |
2329 | *count = limit - (typeof(limit))*pos; | 2295 | *count = limit - (typeof(limit))*pos; |
2330 | } | 2296 | } |
2331 | } | 2297 | } |
2332 | } | 2298 | } |
2333 | 2299 | ||
2334 | /* | 2300 | /* |
2335 | * LFS rule | 2301 | * LFS rule |
2336 | */ | 2302 | */ |
2337 | if (unlikely(*pos + *count > MAX_NON_LFS && | 2303 | if (unlikely(*pos + *count > MAX_NON_LFS && |
2338 | !(file->f_flags & O_LARGEFILE))) { | 2304 | !(file->f_flags & O_LARGEFILE))) { |
2339 | if (*pos >= MAX_NON_LFS) { | 2305 | if (*pos >= MAX_NON_LFS) { |
2340 | return -EFBIG; | 2306 | return -EFBIG; |
2341 | } | 2307 | } |
2342 | if (*count > MAX_NON_LFS - (unsigned long)*pos) { | 2308 | if (*count > MAX_NON_LFS - (unsigned long)*pos) { |
2343 | *count = MAX_NON_LFS - (unsigned long)*pos; | 2309 | *count = MAX_NON_LFS - (unsigned long)*pos; |
2344 | } | 2310 | } |
2345 | } | 2311 | } |
2346 | 2312 | ||
2347 | /* | 2313 | /* |
2348 | * Are we about to exceed the fs block limit ? | 2314 | * Are we about to exceed the fs block limit ? |
2349 | * | 2315 | * |
2350 | * If we have written data it becomes a short write. If we have | 2316 | * If we have written data it becomes a short write. If we have |
2351 | * exceeded without writing data we send a signal and return EFBIG. | 2317 | * exceeded without writing data we send a signal and return EFBIG. |
2352 | * Linus frestrict idea will clean these up nicely.. | 2318 | * Linus frestrict idea will clean these up nicely.. |
2353 | */ | 2319 | */ |
2354 | if (likely(!isblk)) { | 2320 | if (likely(!isblk)) { |
2355 | if (unlikely(*pos >= inode->i_sb->s_maxbytes)) { | 2321 | if (unlikely(*pos >= inode->i_sb->s_maxbytes)) { |
2356 | if (*count || *pos > inode->i_sb->s_maxbytes) { | 2322 | if (*count || *pos > inode->i_sb->s_maxbytes) { |
2357 | return -EFBIG; | 2323 | return -EFBIG; |
2358 | } | 2324 | } |
2359 | /* zero-length writes at ->s_maxbytes are OK */ | 2325 | /* zero-length writes at ->s_maxbytes are OK */ |
2360 | } | 2326 | } |
2361 | 2327 | ||
2362 | if (unlikely(*pos + *count > inode->i_sb->s_maxbytes)) | 2328 | if (unlikely(*pos + *count > inode->i_sb->s_maxbytes)) |
2363 | *count = inode->i_sb->s_maxbytes - *pos; | 2329 | *count = inode->i_sb->s_maxbytes - *pos; |
2364 | } else { | 2330 | } else { |
2365 | #ifdef CONFIG_BLOCK | 2331 | #ifdef CONFIG_BLOCK |
2366 | loff_t isize; | 2332 | loff_t isize; |
2367 | if (bdev_read_only(I_BDEV(inode))) | 2333 | if (bdev_read_only(I_BDEV(inode))) |
2368 | return -EPERM; | 2334 | return -EPERM; |
2369 | isize = i_size_read(inode); | 2335 | isize = i_size_read(inode); |
2370 | if (*pos >= isize) { | 2336 | if (*pos >= isize) { |
2371 | if (*count || *pos > isize) | 2337 | if (*count || *pos > isize) |
2372 | return -ENOSPC; | 2338 | return -ENOSPC; |
2373 | } | 2339 | } |
2374 | 2340 | ||
2375 | if (*pos + *count > isize) | 2341 | if (*pos + *count > isize) |
2376 | *count = isize - *pos; | 2342 | *count = isize - *pos; |
2377 | #else | 2343 | #else |
2378 | return -EPERM; | 2344 | return -EPERM; |
2379 | #endif | 2345 | #endif |
2380 | } | 2346 | } |
2381 | return 0; | 2347 | return 0; |
2382 | } | 2348 | } |
2383 | EXPORT_SYMBOL(generic_write_checks); | 2349 | EXPORT_SYMBOL(generic_write_checks); |
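Note: callers pass pos and count by reference and must honour the clamped values. A sketch of the usual sequence (example_check_write() is hypothetical; it mirrors what __generic_file_aio_write() does before starting the write):

    /* Sketch of the typical caller pattern around generic_write_checks() */
    static ssize_t example_check_write(struct file *file, loff_t *pos, size_t *count)
    {
            struct inode *inode = file->f_mapping->host;
            int err;

            err = generic_write_checks(file, pos, count, S_ISBLK(inode->i_mode));
            if (err)
                    return err;
            if (*count == 0)
                    return 0;       /* clamped to nothing: a zero-length write */
            return 1;               /* caller proceeds using the adjusted *pos / *count */
    }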
2384 | 2350 | ||
2385 | int pagecache_write_begin(struct file *file, struct address_space *mapping, | 2351 | int pagecache_write_begin(struct file *file, struct address_space *mapping, |
2386 | loff_t pos, unsigned len, unsigned flags, | 2352 | loff_t pos, unsigned len, unsigned flags, |
2387 | struct page **pagep, void **fsdata) | 2353 | struct page **pagep, void **fsdata) |
2388 | { | 2354 | { |
2389 | const struct address_space_operations *aops = mapping->a_ops; | 2355 | const struct address_space_operations *aops = mapping->a_ops; |
2390 | 2356 | ||
2391 | return aops->write_begin(file, mapping, pos, len, flags, | 2357 | return aops->write_begin(file, mapping, pos, len, flags, |
2392 | pagep, fsdata); | 2358 | pagep, fsdata); |
2393 | } | 2359 | } |
2394 | EXPORT_SYMBOL(pagecache_write_begin); | 2360 | EXPORT_SYMBOL(pagecache_write_begin); |
2395 | 2361 | ||
2396 | int pagecache_write_end(struct file *file, struct address_space *mapping, | 2362 | int pagecache_write_end(struct file *file, struct address_space *mapping, |
2397 | loff_t pos, unsigned len, unsigned copied, | 2363 | loff_t pos, unsigned len, unsigned copied, |
2398 | struct page *page, void *fsdata) | 2364 | struct page *page, void *fsdata) |
2399 | { | 2365 | { |
2400 | const struct address_space_operations *aops = mapping->a_ops; | 2366 | const struct address_space_operations *aops = mapping->a_ops; |
2401 | 2367 | ||
2402 | mark_page_accessed(page); | ||
2403 | return aops->write_end(file, mapping, pos, len, copied, page, fsdata); | 2368 | return aops->write_end(file, mapping, pos, len, copied, page, fsdata); |
2404 | } | 2369 | } |
2405 | EXPORT_SYMBOL(pagecache_write_end); | 2370 | EXPORT_SYMBOL(pagecache_write_end); |
2406 | 2371 | ||
2407 | ssize_t | 2372 | ssize_t |
2408 | generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov, | 2373 | generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov, |
2409 | unsigned long *nr_segs, loff_t pos, loff_t *ppos, | 2374 | unsigned long *nr_segs, loff_t pos, loff_t *ppos, |
2410 | size_t count, size_t ocount) | 2375 | size_t count, size_t ocount) |
2411 | { | 2376 | { |
2412 | struct file *file = iocb->ki_filp; | 2377 | struct file *file = iocb->ki_filp; |
2413 | struct address_space *mapping = file->f_mapping; | 2378 | struct address_space *mapping = file->f_mapping; |
2414 | struct inode *inode = mapping->host; | 2379 | struct inode *inode = mapping->host; |
2415 | ssize_t written; | 2380 | ssize_t written; |
2416 | size_t write_len; | 2381 | size_t write_len; |
2417 | pgoff_t end; | 2382 | pgoff_t end; |
2418 | 2383 | ||
2419 | if (count != ocount) | 2384 | if (count != ocount) |
2420 | *nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count); | 2385 | *nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count); |
2421 | 2386 | ||
2422 | write_len = iov_length(iov, *nr_segs); | 2387 | write_len = iov_length(iov, *nr_segs); |
2423 | end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT; | 2388 | end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT; |
2424 | 2389 | ||
2425 | written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1); | 2390 | written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1); |
2426 | if (written) | 2391 | if (written) |
2427 | goto out; | 2392 | goto out; |
2428 | 2393 | ||
2429 | /* | 2394 | /* |
2430 | * After a write we want buffered reads to be sure to go to disk to get | 2395 | * After a write we want buffered reads to be sure to go to disk to get |
2431 | * the new data. We invalidate clean cached page from the region we're | 2396 | * the new data. We invalidate clean cached page from the region we're |
2432 | * about to write. We do this *before* the write so that we can return | 2397 | * about to write. We do this *before* the write so that we can return |
2433 | * without clobbering -EIOCBQUEUED from ->direct_IO(). | 2398 | * without clobbering -EIOCBQUEUED from ->direct_IO(). |
2434 | */ | 2399 | */ |
2435 | if (mapping->nrpages) { | 2400 | if (mapping->nrpages) { |
2436 | written = invalidate_inode_pages2_range(mapping, | 2401 | written = invalidate_inode_pages2_range(mapping, |
2437 | pos >> PAGE_CACHE_SHIFT, end); | 2402 | pos >> PAGE_CACHE_SHIFT, end); |
2438 | /* | 2403 | /* |
2439 | * If a page can not be invalidated, return 0 to fall back | 2404 | * If a page can not be invalidated, return 0 to fall back |
2440 | * to buffered write. | 2405 | * to buffered write. |
2441 | */ | 2406 | */ |
2442 | if (written) { | 2407 | if (written) { |
2443 | if (written == -EBUSY) | 2408 | if (written == -EBUSY) |
2444 | return 0; | 2409 | return 0; |
2445 | goto out; | 2410 | goto out; |
2446 | } | 2411 | } |
2447 | } | 2412 | } |
2448 | 2413 | ||
2449 | written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs); | 2414 | written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs); |
2450 | 2415 | ||
2451 | /* | 2416 | /* |
2452 | * Finally, try again to invalidate clean pages which might have been | 2417 | * Finally, try again to invalidate clean pages which might have been |
2453 | * cached by non-direct readahead, or faulted in by get_user_pages() | 2418 | * cached by non-direct readahead, or faulted in by get_user_pages() |
2454 | * if the source of the write was an mmap'ed region of the file | 2419 | * if the source of the write was an mmap'ed region of the file |
2455 | * we're writing. Either one is a pretty crazy thing to do, | 2420 | * we're writing. Either one is a pretty crazy thing to do, |
2456 | * so we don't support it 100%. If this invalidation | 2421 | * so we don't support it 100%. If this invalidation |
2457 | * fails, tough, the write still worked... | 2422 | * fails, tough, the write still worked... |
2458 | */ | 2423 | */ |
2459 | if (mapping->nrpages) { | 2424 | if (mapping->nrpages) { |
2460 | invalidate_inode_pages2_range(mapping, | 2425 | invalidate_inode_pages2_range(mapping, |
2461 | pos >> PAGE_CACHE_SHIFT, end); | 2426 | pos >> PAGE_CACHE_SHIFT, end); |
2462 | } | 2427 | } |
2463 | 2428 | ||
2464 | if (written > 0) { | 2429 | if (written > 0) { |
2465 | pos += written; | 2430 | pos += written; |
2466 | if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) { | 2431 | if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) { |
2467 | i_size_write(inode, pos); | 2432 | i_size_write(inode, pos); |
2468 | mark_inode_dirty(inode); | 2433 | mark_inode_dirty(inode); |
2469 | } | 2434 | } |
2470 | *ppos = pos; | 2435 | *ppos = pos; |
2471 | } | 2436 | } |
2472 | out: | 2437 | out: |
2473 | return written; | 2438 | return written; |
2474 | } | 2439 | } |
2475 | EXPORT_SYMBOL(generic_file_direct_write); | 2440 | EXPORT_SYMBOL(generic_file_direct_write); |
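Note: the invalidation range is computed so it covers every page the direct write can touch, from the page containing pos through the page containing the last byte written. A worked example with assumed 4096-byte pages:

    /* Worked example (assumed values): pos == 6000, write_len == 10000 */
    loff_t  pos       = 6000;
    size_t  write_len = 10000;
    pgoff_t first     = pos >> PAGE_CACHE_SHIFT;                    /* page 1 */
    pgoff_t end       = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;  /* page 3 (last byte 15999) */
    /* invalidate_inode_pages2_range(mapping, first, end) covers pages 1..3 inclusive */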
2476 | 2441 | ||
2477 | /* | 2442 | /* |
2478 | * Find or create a page at the given pagecache position. Return the locked | 2443 | * Find or create a page at the given pagecache position. Return the locked |
2479 | * page. This function is specifically for buffered writes. | 2444 | * page. This function is specifically for buffered writes. |
2480 | */ | 2445 | */ |
2481 | struct page *grab_cache_page_write_begin(struct address_space *mapping, | 2446 | struct page *grab_cache_page_write_begin(struct address_space *mapping, |
2482 | pgoff_t index, unsigned flags) | 2447 | pgoff_t index, unsigned flags) |
2483 | { | 2448 | { |
2484 | int status; | ||
2485 | gfp_t gfp_mask; | ||
2486 | struct page *page; | 2449 | struct page *page; |
2487 | gfp_t gfp_notmask = 0; | 2450 | int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT; |
2488 | 2451 | ||
2489 | gfp_mask = mapping_gfp_mask(mapping); | ||
2490 | if (mapping_cap_account_dirty(mapping)) | ||
2491 | gfp_mask |= __GFP_WRITE; | ||
2492 | if (flags & AOP_FLAG_NOFS) | 2452 | if (flags & AOP_FLAG_NOFS) |
2493 | gfp_notmask = __GFP_FS; | 2453 | fgp_flags |= FGP_NOFS; |
2494 | repeat: | 2454 | |
2495 | page = find_lock_page(mapping, index); | 2455 | page = pagecache_get_page(mapping, index, fgp_flags, |
| | 2456 | mapping_gfp_mask(mapping), |
| | 2457 | GFP_KERNEL); |
2496 | if (page) | 2458 | if (page) |
2497 | goto found; | 2459 | wait_for_stable_page(page); |
2498 | 2460 | ||
2499 | page = __page_cache_alloc(gfp_mask & ~gfp_notmask); | ||
2500 | if (!page) | ||
2501 | return NULL; | ||
2502 | status = add_to_page_cache_lru(page, mapping, index, | ||
2503 | GFP_KERNEL & ~gfp_notmask); | ||
2504 | if (unlikely(status)) { | ||
2505 | page_cache_release(page); | ||
2506 | if (status == -EEXIST) | ||
2507 | goto repeat; | ||
2508 | return NULL; | ||
2509 | } | ||
2510 | found: | ||
2511 | wait_for_stable_page(page); | ||
2512 | return page; | 2461 | return page; |
2513 | } | 2462 | } |
2514 | EXPORT_SYMBOL(grab_cache_page_write_begin); | 2463 | EXPORT_SYMBOL(grab_cache_page_write_begin); |
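Note: the rewritten grab_cache_page_write_begin() (right-hand column) collapses the old find_lock_page()/__page_cache_alloc()/add_to_page_cache_lru() sequence into one pagecache_get_page() call. FGP_LOCK asks for a locked page, FGP_CREAT for allocation on a miss, FGP_WRITE replaces the old mapping_cap_account_dirty()/__GFP_WRITE handling, and FGP_ACCESSED lets the helper set the referenced state at allocation time via init_page_accessed() rather than with a later mark_page_accessed() call; that is also why the separate mark_page_accessed() calls in pagecache_write_end() and generic_perform_write() disappear on the right-hand side of this diff, while wait_for_stable_page() stays in the wrapper. A heavily simplified behavioural sketch, not the actual implementation (FGP_NOFS/FGP_WRITE handling, the post-lock mapping recheck and most error paths are omitted):

    /* Simplified sketch of pagecache_get_page() behaviour for the flags used above */
    static struct page *sketch_pagecache_get_page(struct address_space *mapping,
                    pgoff_t index, int fgp_flags, gfp_t cache_gfp, gfp_t radix_gfp)
    {
            struct page *page;
            int err;
    repeat:
            page = find_get_page(mapping, index);
            if (page) {
                    if (fgp_flags & FGP_LOCK)
                            lock_page(page);
                    if (fgp_flags & FGP_ACCESSED)
                            mark_page_accessed(page);   /* already visible: atomic ops */
                    return page;
            }
            if (!(fgp_flags & FGP_CREAT))
                    return NULL;

            page = __page_cache_alloc(cache_gfp);
            if (!page)
                    return NULL;
            if (fgp_flags & FGP_ACCESSED)
                    init_page_accessed(page);           /* not yet visible: non-atomic */
            /* add_to_page_cache_lru() inserts the page locked, satisfying FGP_LOCK */
            err = add_to_page_cache_lru(page, mapping, index, radix_gfp);
            if (err) {
                    page_cache_release(page);
                    if (err == -EEXIST)
                            goto repeat;                /* lost the insertion race */
                    return NULL;
            }
            return page;
    }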
2515 | 2464 | ||
2516 | static ssize_t generic_perform_write(struct file *file, | 2465 | static ssize_t generic_perform_write(struct file *file, |
2517 | struct iov_iter *i, loff_t pos) | 2466 | struct iov_iter *i, loff_t pos) |
2518 | { | 2467 | { |
2519 | struct address_space *mapping = file->f_mapping; | 2468 | struct address_space *mapping = file->f_mapping; |
2520 | const struct address_space_operations *a_ops = mapping->a_ops; | 2469 | const struct address_space_operations *a_ops = mapping->a_ops; |
2521 | long status = 0; | 2470 | long status = 0; |
2522 | ssize_t written = 0; | 2471 | ssize_t written = 0; |
2523 | unsigned int flags = 0; | 2472 | unsigned int flags = 0; |
2524 | 2473 | ||
2525 | /* | 2474 | /* |
2526 | * Copies from kernel address space cannot fail (NFSD is a big user). | 2475 | * Copies from kernel address space cannot fail (NFSD is a big user). |
2527 | */ | 2476 | */ |
2528 | if (segment_eq(get_fs(), KERNEL_DS)) | 2477 | if (segment_eq(get_fs(), KERNEL_DS)) |
2529 | flags |= AOP_FLAG_UNINTERRUPTIBLE; | 2478 | flags |= AOP_FLAG_UNINTERRUPTIBLE; |
2530 | 2479 | ||
2531 | do { | 2480 | do { |
2532 | struct page *page; | 2481 | struct page *page; |
2533 | unsigned long offset; /* Offset into pagecache page */ | 2482 | unsigned long offset; /* Offset into pagecache page */ |
2534 | unsigned long bytes; /* Bytes to write to page */ | 2483 | unsigned long bytes; /* Bytes to write to page */ |
2535 | size_t copied; /* Bytes copied from user */ | 2484 | size_t copied; /* Bytes copied from user */ |
2536 | void *fsdata; | 2485 | void *fsdata; |
2537 | 2486 | ||
2538 | offset = (pos & (PAGE_CACHE_SIZE - 1)); | 2487 | offset = (pos & (PAGE_CACHE_SIZE - 1)); |
2539 | bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, | 2488 | bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, |
2540 | iov_iter_count(i)); | 2489 | iov_iter_count(i)); |
2541 | 2490 | ||
2542 | again: | 2491 | again: |
2543 | /* | 2492 | /* |
2544 | * Bring in the user page that we will copy from _first_. | 2493 | * Bring in the user page that we will copy from _first_. |
2545 | * Otherwise there's a nasty deadlock on copying from the | 2494 | * Otherwise there's a nasty deadlock on copying from the |
2546 | * same page as we're writing to, without it being marked | 2495 | * same page as we're writing to, without it being marked |
2547 | * up-to-date. | 2496 | * up-to-date. |
2548 | * | 2497 | * |
2549 | * Not only is this an optimisation, but it is also required | 2498 | * Not only is this an optimisation, but it is also required |
2550 | * to check that the address is actually valid, when atomic | 2499 | * to check that the address is actually valid, when atomic |
2551 | * usercopies are used, below. | 2500 | * usercopies are used, below. |
2552 | */ | 2501 | */ |
2553 | if (unlikely(iov_iter_fault_in_readable(i, bytes))) { | 2502 | if (unlikely(iov_iter_fault_in_readable(i, bytes))) { |
2554 | status = -EFAULT; | 2503 | status = -EFAULT; |
2555 | break; | 2504 | break; |
2556 | } | 2505 | } |
2557 | 2506 | ||
2558 | status = a_ops->write_begin(file, mapping, pos, bytes, flags, | 2507 | status = a_ops->write_begin(file, mapping, pos, bytes, flags, |
2559 | &page, &fsdata); | 2508 | &page, &fsdata); |
2560 | if (unlikely(status)) | 2509 | if (unlikely(status < 0)) |
2561 | break; | 2510 | break; |
2562 | 2511 | ||
2563 | if (mapping_writably_mapped(mapping)) | 2512 | if (mapping_writably_mapped(mapping)) |
2564 | flush_dcache_page(page); | 2513 | flush_dcache_page(page); |
2565 | 2514 | ||
2566 | copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); | 2515 | copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); |
2567 | flush_dcache_page(page); | 2516 | flush_dcache_page(page); |
2568 | 2517 | ||
2569 | mark_page_accessed(page); | ||
2570 | status = a_ops->write_end(file, mapping, pos, bytes, copied, | 2518 | status = a_ops->write_end(file, mapping, pos, bytes, copied, |
2571 | page, fsdata); | 2519 | page, fsdata); |
2572 | if (unlikely(status < 0)) | 2520 | if (unlikely(status < 0)) |
2573 | break; | 2521 | break; |
2574 | copied = status; | 2522 | copied = status; |
2575 | 2523 | ||
2576 | cond_resched(); | 2524 | cond_resched(); |
2577 | 2525 | ||
2578 | iov_iter_advance(i, copied); | 2526 | iov_iter_advance(i, copied); |
2579 | if (unlikely(copied == 0)) { | 2527 | if (unlikely(copied == 0)) { |
2580 | /* | 2528 | /* |
2581 | * If we were unable to copy any data at all, we must | 2529 | * If we were unable to copy any data at all, we must |
2582 | * fall back to a single segment length write. | 2530 | * fall back to a single segment length write. |
2583 | * | 2531 | * |
2584 | * If we didn't fallback here, we could livelock | 2532 | * If we didn't fallback here, we could livelock |
2585 | * because not all segments in the iov can be copied at | 2533 | * because not all segments in the iov can be copied at |
2586 | * once without a pagefault. | 2534 | * once without a pagefault. |
2587 | */ | 2535 | */ |
2588 | bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, | 2536 | bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, |
2589 | iov_iter_single_seg_count(i)); | 2537 | iov_iter_single_seg_count(i)); |
2590 | goto again; | 2538 | goto again; |
2591 | } | 2539 | } |
2592 | pos += copied; | 2540 | pos += copied; |
2593 | written += copied; | 2541 | written += copied; |
2594 | 2542 | ||
2595 | balance_dirty_pages_ratelimited(mapping); | 2543 | balance_dirty_pages_ratelimited(mapping); |
2596 | if (fatal_signal_pending(current)) { | 2544 | if (fatal_signal_pending(current)) { |
2597 | status = -EINTR; | 2545 | status = -EINTR; |
2598 | break; | 2546 | break; |
2599 | } | 2547 | } |
2600 | } while (iov_iter_count(i)); | 2548 | } while (iov_iter_count(i)); |
2601 | 2549 | ||
2602 | return written ? written : status; | 2550 | return written ? written : status; |
2603 | } | 2551 | } |
2604 | 2552 | ||
2605 | ssize_t | 2553 | ssize_t |
2606 | generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov, | 2554 | generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov, |
2607 | unsigned long nr_segs, loff_t pos, loff_t *ppos, | 2555 | unsigned long nr_segs, loff_t pos, loff_t *ppos, |
2608 | size_t count, ssize_t written) | 2556 | size_t count, ssize_t written) |
2609 | { | 2557 | { |
2610 | struct file *file = iocb->ki_filp; | 2558 | struct file *file = iocb->ki_filp; |
2611 | ssize_t status; | 2559 | ssize_t status; |
2612 | struct iov_iter i; | 2560 | struct iov_iter i; |
2613 | 2561 | ||
2614 | iov_iter_init(&i, iov, nr_segs, count, written); | 2562 | iov_iter_init(&i, iov, nr_segs, count, written); |
2615 | status = generic_perform_write(file, &i, pos); | 2563 | status = generic_perform_write(file, &i, pos); |
2616 | 2564 | ||
2617 | if (likely(status >= 0)) { | 2565 | if (likely(status >= 0)) { |
2618 | written += status; | 2566 | written += status; |
2619 | *ppos = pos + status; | 2567 | *ppos = pos + status; |
2620 | } | 2568 | } |
2621 | 2569 | ||
2622 | return written ? written : status; | 2570 | return written ? written : status; |
2623 | } | 2571 | } |
2624 | EXPORT_SYMBOL(generic_file_buffered_write); | 2572 | EXPORT_SYMBOL(generic_file_buffered_write); |
2625 | 2573 | ||
2626 | /** | 2574 | /** |
2627 | * __generic_file_aio_write - write data to a file | 2575 | * __generic_file_aio_write - write data to a file |
2628 | * @iocb: IO state structure (file, offset, etc.) | 2576 | * @iocb: IO state structure (file, offset, etc.) |
2629 | * @iov: vector with data to write | 2577 | * @iov: vector with data to write |
2630 | * @nr_segs: number of segments in the vector | 2578 | * @nr_segs: number of segments in the vector |
2631 | * @ppos: position where to write | 2579 | * @ppos: position where to write |
2632 | * | 2580 | * |
2633 | * This function does all the work needed for actually writing data to a | 2581 | * This function does all the work needed for actually writing data to a |
2634 | * file. It does all basic checks, removes SUID from the file, updates | 2582 | * file. It does all basic checks, removes SUID from the file, updates |
2635 | * modification times and calls proper subroutines depending on whether we | 2583 | * modification times and calls proper subroutines depending on whether we |
2636 | * do direct IO or a standard buffered write. | 2584 | * do direct IO or a standard buffered write. |
2637 | * | 2585 | * |
2638 | * It expects i_mutex to be grabbed unless we work on a block device or similar | 2586 | * It expects i_mutex to be grabbed unless we work on a block device or similar |
2639 | * object which does not need locking at all. | 2587 | * object which does not need locking at all. |
2640 | * | 2588 | * |
2641 | * This function does *not* take care of syncing data in case of O_SYNC write. | 2589 | * This function does *not* take care of syncing data in case of O_SYNC write. |
2642 | * A caller has to handle it. This is mainly due to the fact that we want to | 2590 | * A caller has to handle it. This is mainly due to the fact that we want to |
2643 | * avoid syncing under i_mutex. | 2591 | * avoid syncing under i_mutex. |
2644 | */ | 2592 | */ |
2645 | ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, | 2593 | ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, |
2646 | unsigned long nr_segs, loff_t *ppos) | 2594 | unsigned long nr_segs, loff_t *ppos) |
2647 | { | 2595 | { |
2648 | struct file *file = iocb->ki_filp; | 2596 | struct file *file = iocb->ki_filp; |
2649 | struct address_space * mapping = file->f_mapping; | 2597 | struct address_space * mapping = file->f_mapping; |
2650 | size_t ocount; /* original count */ | 2598 | size_t ocount; /* original count */ |
2651 | size_t count; /* after file limit checks */ | 2599 | size_t count; /* after file limit checks */ |
2652 | struct inode *inode = mapping->host; | 2600 | struct inode *inode = mapping->host; |
2653 | loff_t pos; | 2601 | loff_t pos; |
2654 | ssize_t written; | 2602 | ssize_t written; |
2655 | ssize_t err; | 2603 | ssize_t err; |
2656 | 2604 | ||
2657 | ocount = 0; | 2605 | ocount = 0; |
2658 | err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ); | 2606 | err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ); |
2659 | if (err) | 2607 | if (err) |
2660 | return err; | 2608 | return err; |
2661 | 2609 | ||
2662 | count = ocount; | 2610 | count = ocount; |
2663 | pos = *ppos; | 2611 | pos = *ppos; |
2664 | 2612 | ||
2665 | /* We can write back this queue in page reclaim */ | 2613 | /* We can write back this queue in page reclaim */ |
2666 | current->backing_dev_info = mapping->backing_dev_info; | 2614 | current->backing_dev_info = mapping->backing_dev_info; |
2667 | written = 0; | 2615 | written = 0; |
2668 | 2616 | ||
2669 | err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); | 2617 | err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); |
2670 | if (err) | 2618 | if (err) |
2671 | goto out; | 2619 | goto out; |
2672 | 2620 | ||
2673 | if (count == 0) | 2621 | if (count == 0) |
2674 | goto out; | 2622 | goto out; |
2675 | 2623 | ||
2676 | err = file_remove_suid(file); | 2624 | err = file_remove_suid(file); |
2677 | if (err) | 2625 | if (err) |
2678 | goto out; | 2626 | goto out; |
2679 | 2627 | ||
2680 | err = file_update_time(file); | 2628 | err = file_update_time(file); |
2681 | if (err) | 2629 | if (err) |
2682 | goto out; | 2630 | goto out; |
2683 | 2631 | ||
2684 | /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ | 2632 | /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ |
2685 | if (unlikely(file->f_flags & O_DIRECT)) { | 2633 | if (unlikely(file->f_flags & O_DIRECT)) { |
2686 | loff_t endbyte; | 2634 | loff_t endbyte; |
2687 | ssize_t written_buffered; | 2635 | ssize_t written_buffered; |
2688 | 2636 | ||
2689 | written = generic_file_direct_write(iocb, iov, &nr_segs, pos, | 2637 | written = generic_file_direct_write(iocb, iov, &nr_segs, pos, |
2690 | ppos, count, ocount); | 2638 | ppos, count, ocount); |
2691 | if (written < 0 || written == count) | 2639 | if (written < 0 || written == count) |
2692 | goto out; | 2640 | goto out; |
2693 | /* | 2641 | /* |
2694 | * direct-io write to a hole: fall through to buffered I/O | 2642 | * direct-io write to a hole: fall through to buffered I/O |
2695 | * for completing the rest of the request. | 2643 | * for completing the rest of the request. |
2696 | */ | 2644 | */ |
2697 | pos += written; | 2645 | pos += written; |
2698 | count -= written; | 2646 | count -= written; |
2699 | written_buffered = generic_file_buffered_write(iocb, iov, | 2647 | written_buffered = generic_file_buffered_write(iocb, iov, |
2700 | nr_segs, pos, ppos, count, | 2648 | nr_segs, pos, ppos, count, |
2701 | written); | 2649 | written); |
2702 | /* | 2650 | /* |
2703 | * If generic_file_buffered_write() returned a synchronous error | 2651 | * If generic_file_buffered_write() returned a synchronous error |
2704 | * then we want to return the number of bytes which were | 2652 | * then we want to return the number of bytes which were |
2705 | * direct-written, or the error code if that was zero. Note | 2653 | * direct-written, or the error code if that was zero. Note |
2706 | * that this differs from normal direct-io semantics, which | 2654 | * that this differs from normal direct-io semantics, which |
2707 | * will return -EFOO even if some bytes were written. | 2655 | * will return -EFOO even if some bytes were written. |
2708 | */ | 2656 | */ |
2709 | if (written_buffered < 0) { | 2657 | if (written_buffered < 0) { |
2710 | err = written_buffered; | 2658 | err = written_buffered; |
2711 | goto out; | 2659 | goto out; |
2712 | } | 2660 | } |
2713 | 2661 | ||
2714 | /* | 2662 | /* |
2715 | * We need to ensure that the page cache pages are written to | 2663 | * We need to ensure that the page cache pages are written to |
2716 | * disk and invalidated to preserve the expected O_DIRECT | 2664 | * disk and invalidated to preserve the expected O_DIRECT |
2717 | * semantics. | 2665 | * semantics. |
2718 | */ | 2666 | */ |
2719 | endbyte = pos + written_buffered - written - 1; | 2667 | endbyte = pos + written_buffered - written - 1; |
2720 | err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte); | 2668 | err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte); |
2721 | if (err == 0) { | 2669 | if (err == 0) { |
2722 | written = written_buffered; | 2670 | written = written_buffered; |
2723 | invalidate_mapping_pages(mapping, | 2671 | invalidate_mapping_pages(mapping, |
2724 | pos >> PAGE_CACHE_SHIFT, | 2672 | pos >> PAGE_CACHE_SHIFT, |
2725 | endbyte >> PAGE_CACHE_SHIFT); | 2673 | endbyte >> PAGE_CACHE_SHIFT); |
2726 | } else { | 2674 | } else { |
2727 | /* | 2675 | /* |
2728 | * We don't know how much we wrote, so just return | 2676 | * We don't know how much we wrote, so just return |
2729 | * the number of bytes which were direct-written | 2677 | * the number of bytes which were direct-written |
2730 | */ | 2678 | */ |
2731 | } | 2679 | } |
2732 | } else { | 2680 | } else { |
2733 | written = generic_file_buffered_write(iocb, iov, nr_segs, | 2681 | written = generic_file_buffered_write(iocb, iov, nr_segs, |
2734 | pos, ppos, count, written); | 2682 | pos, ppos, count, written); |
2735 | } | 2683 | } |
2736 | out: | 2684 | out: |
2737 | current->backing_dev_info = NULL; | 2685 | current->backing_dev_info = NULL; |
2738 | return written ? written : err; | 2686 | return written ? written : err; |
2739 | } | 2687 | } |
2740 | EXPORT_SYMBOL(__generic_file_aio_write); | 2688 | EXPORT_SYMBOL(__generic_file_aio_write); |
2741 | 2689 | ||
2742 | /** | 2690 | /** |
2743 | * generic_file_aio_write - write data to a file | 2691 | * generic_file_aio_write - write data to a file |
2744 | * @iocb: IO state structure | 2692 | * @iocb: IO state structure |
2745 | * @iov: vector with data to write | 2693 | * @iov: vector with data to write |
2746 | * @nr_segs: number of segments in the vector | 2694 | * @nr_segs: number of segments in the vector |
2747 | * @pos: position in file where to write | 2695 | * @pos: position in file where to write |
2748 | * | 2696 | * |
2749 | * This is a wrapper around __generic_file_aio_write() to be used by most | 2697 | * This is a wrapper around __generic_file_aio_write() to be used by most |
2750 | * filesystems. It takes care of syncing the file in case of O_SYNC file | 2698 | * filesystems. It takes care of syncing the file in case of O_SYNC file |
2751 | * and acquires i_mutex as needed. | 2699 | * and acquires i_mutex as needed. |
2752 | */ | 2700 | */ |
2753 | ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, | 2701 | ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, |
2754 | unsigned long nr_segs, loff_t pos) | 2702 | unsigned long nr_segs, loff_t pos) |
2755 | { | 2703 | { |
2756 | struct file *file = iocb->ki_filp; | 2704 | struct file *file = iocb->ki_filp; |
2757 | struct inode *inode = file->f_mapping->host; | 2705 | struct inode *inode = file->f_mapping->host; |
2758 | ssize_t ret; | 2706 | ssize_t ret; |
2759 | 2707 | ||
2760 | BUG_ON(iocb->ki_pos != pos); | 2708 | BUG_ON(iocb->ki_pos != pos); |
2761 | 2709 | ||
2762 | mutex_lock(&inode->i_mutex); | 2710 | mutex_lock(&inode->i_mutex); |
2763 | ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); | 2711 | ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); |
2764 | mutex_unlock(&inode->i_mutex); | 2712 | mutex_unlock(&inode->i_mutex); |
2765 | 2713 | ||
2766 | if (ret > 0) { | 2714 | if (ret > 0) { |
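A side note on the kerneldoc above: __generic_file_aio_write() deliberately leaves O_SYNC handling to its caller, and the generic_file_aio_write() wrapper is the one expected to do it after dropping i_mutex. A simplified sketch of that caller-side epilogue (illustrative only, not copied from this hunk; it assumes the generic_write_sync(file, pos, count) signature of this kernel era):

	/*
	 * Illustrative sketch of the caller handling O_SYNC, not taken
	 * from this diff.  Runs after i_mutex has been released.
	 */
	if (ret > 0) {
		ssize_t err;

		err = generic_write_sync(file, pos, ret);
		if (err < 0)
			ret = err;
	}
	return ret;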
mm/shmem.c
1 | /* | 1 | /* |
2 | * Resizable virtual memory filesystem for Linux. | 2 | * Resizable virtual memory filesystem for Linux. |
3 | * | 3 | * |
4 | * Copyright (C) 2000 Linus Torvalds. | 4 | * Copyright (C) 2000 Linus Torvalds. |
5 | * 2000 Transmeta Corp. | 5 | * 2000 Transmeta Corp. |
6 | * 2000-2001 Christoph Rohland | 6 | * 2000-2001 Christoph Rohland |
7 | * 2000-2001 SAP AG | 7 | * 2000-2001 SAP AG |
8 | * 2002 Red Hat Inc. | 8 | * 2002 Red Hat Inc. |
9 | * Copyright (C) 2002-2011 Hugh Dickins. | 9 | * Copyright (C) 2002-2011 Hugh Dickins. |
10 | * Copyright (C) 2011 Google Inc. | 10 | * Copyright (C) 2011 Google Inc. |
11 | * Copyright (C) 2002-2005 VERITAS Software Corporation. | 11 | * Copyright (C) 2002-2005 VERITAS Software Corporation. |
12 | * Copyright (C) 2004 Andi Kleen, SuSE Labs | 12 | * Copyright (C) 2004 Andi Kleen, SuSE Labs |
13 | * | 13 | * |
14 | * Extended attribute support for tmpfs: | 14 | * Extended attribute support for tmpfs: |
15 | * Copyright (c) 2004, Luke Kenneth Casson Leighton <lkcl@lkcl.net> | 15 | * Copyright (c) 2004, Luke Kenneth Casson Leighton <lkcl@lkcl.net> |
16 | * Copyright (c) 2004 Red Hat, Inc., James Morris <jmorris@redhat.com> | 16 | * Copyright (c) 2004 Red Hat, Inc., James Morris <jmorris@redhat.com> |
17 | * | 17 | * |
18 | * tiny-shmem: | 18 | * tiny-shmem: |
19 | * Copyright (c) 2004, 2008 Matt Mackall <mpm@selenic.com> | 19 | * Copyright (c) 2004, 2008 Matt Mackall <mpm@selenic.com> |
20 | * | 20 | * |
21 | * This file is released under the GPL. | 21 | * This file is released under the GPL. |
22 | */ | 22 | */ |
23 | 23 | ||
24 | #include <linux/fs.h> | 24 | #include <linux/fs.h> |
25 | #include <linux/init.h> | 25 | #include <linux/init.h> |
26 | #include <linux/vfs.h> | 26 | #include <linux/vfs.h> |
27 | #include <linux/mount.h> | 27 | #include <linux/mount.h> |
28 | #include <linux/ramfs.h> | 28 | #include <linux/ramfs.h> |
29 | #include <linux/pagemap.h> | 29 | #include <linux/pagemap.h> |
30 | #include <linux/file.h> | 30 | #include <linux/file.h> |
31 | #include <linux/mm.h> | 31 | #include <linux/mm.h> |
32 | #include <linux/export.h> | 32 | #include <linux/export.h> |
33 | #include <linux/swap.h> | 33 | #include <linux/swap.h> |
34 | #include <linux/aio.h> | 34 | #include <linux/aio.h> |
35 | 35 | ||
36 | static struct vfsmount *shm_mnt; | 36 | static struct vfsmount *shm_mnt; |
37 | 37 | ||
38 | #ifdef CONFIG_SHMEM | 38 | #ifdef CONFIG_SHMEM |
39 | /* | 39 | /* |
40 | * This virtual memory filesystem is heavily based on the ramfs. It | 40 | * This virtual memory filesystem is heavily based on the ramfs. It |
41 | * extends ramfs by the ability to use swap and honor resource limits | 41 | * extends ramfs by the ability to use swap and honor resource limits |
42 | * which makes it a completely usable filesystem. | 42 | * which makes it a completely usable filesystem. |
43 | */ | 43 | */ |
44 | 44 | ||
45 | #include <linux/xattr.h> | 45 | #include <linux/xattr.h> |
46 | #include <linux/exportfs.h> | 46 | #include <linux/exportfs.h> |
47 | #include <linux/posix_acl.h> | 47 | #include <linux/posix_acl.h> |
48 | #include <linux/generic_acl.h> | 48 | #include <linux/generic_acl.h> |
49 | #include <linux/mman.h> | 49 | #include <linux/mman.h> |
50 | #include <linux/string.h> | 50 | #include <linux/string.h> |
51 | #include <linux/slab.h> | 51 | #include <linux/slab.h> |
52 | #include <linux/backing-dev.h> | 52 | #include <linux/backing-dev.h> |
53 | #include <linux/shmem_fs.h> | 53 | #include <linux/shmem_fs.h> |
54 | #include <linux/writeback.h> | 54 | #include <linux/writeback.h> |
55 | #include <linux/blkdev.h> | 55 | #include <linux/blkdev.h> |
56 | #include <linux/pagevec.h> | 56 | #include <linux/pagevec.h> |
57 | #include <linux/percpu_counter.h> | 57 | #include <linux/percpu_counter.h> |
58 | #include <linux/falloc.h> | 58 | #include <linux/falloc.h> |
59 | #include <linux/splice.h> | 59 | #include <linux/splice.h> |
60 | #include <linux/security.h> | 60 | #include <linux/security.h> |
61 | #include <linux/swapops.h> | 61 | #include <linux/swapops.h> |
62 | #include <linux/mempolicy.h> | 62 | #include <linux/mempolicy.h> |
63 | #include <linux/namei.h> | 63 | #include <linux/namei.h> |
64 | #include <linux/ctype.h> | 64 | #include <linux/ctype.h> |
65 | #include <linux/migrate.h> | 65 | #include <linux/migrate.h> |
66 | #include <linux/highmem.h> | 66 | #include <linux/highmem.h> |
67 | #include <linux/seq_file.h> | 67 | #include <linux/seq_file.h> |
68 | #include <linux/magic.h> | 68 | #include <linux/magic.h> |
69 | 69 | ||
70 | #include <asm/uaccess.h> | 70 | #include <asm/uaccess.h> |
71 | #include <asm/pgtable.h> | 71 | #include <asm/pgtable.h> |
72 | 72 | ||
73 | #define BLOCKS_PER_PAGE (PAGE_CACHE_SIZE/512) | 73 | #define BLOCKS_PER_PAGE (PAGE_CACHE_SIZE/512) |
74 | #define VM_ACCT(size) (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT) | 74 | #define VM_ACCT(size) (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT) |
75 | 75 | ||
76 | /* Pretend that each entry is of this size in directory's i_size */ | 76 | /* Pretend that each entry is of this size in directory's i_size */ |
77 | #define BOGO_DIRENT_SIZE 20 | 77 | #define BOGO_DIRENT_SIZE 20 |
78 | 78 | ||
79 | /* Symlink up to this size is kmalloc'ed instead of using a swappable page */ | 79 | /* Symlink up to this size is kmalloc'ed instead of using a swappable page */ |
80 | #define SHORT_SYMLINK_LEN 128 | 80 | #define SHORT_SYMLINK_LEN 128 |
81 | 81 | ||
82 | /* | 82 | /* |
83 | * shmem_fallocate communicates with shmem_fault or shmem_writepage via | 83 | * shmem_fallocate communicates with shmem_fault or shmem_writepage via |
84 | * inode->i_private (with i_mutex making sure that it has only one user at | 84 | * inode->i_private (with i_mutex making sure that it has only one user at |
85 | * a time): we would prefer not to enlarge the shmem inode just for that. | 85 | * a time): we would prefer not to enlarge the shmem inode just for that. |
86 | */ | 86 | */ |
87 | struct shmem_falloc { | 87 | struct shmem_falloc { |
88 | wait_queue_head_t *waitq; /* faults into hole wait for punch to end */ | 88 | wait_queue_head_t *waitq; /* faults into hole wait for punch to end */ |
89 | pgoff_t start; /* start of range currently being fallocated */ | 89 | pgoff_t start; /* start of range currently being fallocated */ |
90 | pgoff_t next; /* the next page offset to be fallocated */ | 90 | pgoff_t next; /* the next page offset to be fallocated */ |
91 | pgoff_t nr_falloced; /* how many new pages have been fallocated */ | 91 | pgoff_t nr_falloced; /* how many new pages have been fallocated */ |
92 | pgoff_t nr_unswapped; /* how often writepage refused to swap out */ | 92 | pgoff_t nr_unswapped; /* how often writepage refused to swap out */ |
93 | }; | 93 | }; |
94 | 94 | ||
95 | /* Flag allocation requirements to shmem_getpage */ | 95 | /* Flag allocation requirements to shmem_getpage */ |
96 | enum sgp_type { | 96 | enum sgp_type { |
97 | SGP_READ, /* don't exceed i_size, don't allocate page */ | 97 | SGP_READ, /* don't exceed i_size, don't allocate page */ |
98 | SGP_CACHE, /* don't exceed i_size, may allocate page */ | 98 | SGP_CACHE, /* don't exceed i_size, may allocate page */ |
99 | SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */ | 99 | SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */ |
100 | SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */ | 100 | SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */ |
101 | SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */ | 101 | SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */ |
102 | }; | 102 | }; |
103 | 103 | ||
104 | #ifdef CONFIG_TMPFS | 104 | #ifdef CONFIG_TMPFS |
105 | static unsigned long shmem_default_max_blocks(void) | 105 | static unsigned long shmem_default_max_blocks(void) |
106 | { | 106 | { |
107 | return totalram_pages / 2; | 107 | return totalram_pages / 2; |
108 | } | 108 | } |
109 | 109 | ||
110 | static unsigned long shmem_default_max_inodes(void) | 110 | static unsigned long shmem_default_max_inodes(void) |
111 | { | 111 | { |
112 | return min(totalram_pages - totalhigh_pages, totalram_pages / 2); | 112 | return min(totalram_pages - totalhigh_pages, totalram_pages / 2); |
113 | } | 113 | } |
114 | #endif | 114 | #endif |
115 | 115 | ||
116 | static bool shmem_should_replace_page(struct page *page, gfp_t gfp); | 116 | static bool shmem_should_replace_page(struct page *page, gfp_t gfp); |
117 | static int shmem_replace_page(struct page **pagep, gfp_t gfp, | 117 | static int shmem_replace_page(struct page **pagep, gfp_t gfp, |
118 | struct shmem_inode_info *info, pgoff_t index); | 118 | struct shmem_inode_info *info, pgoff_t index); |
119 | static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, | 119 | static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, |
120 | struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type); | 120 | struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type); |
121 | 121 | ||
122 | static inline int shmem_getpage(struct inode *inode, pgoff_t index, | 122 | static inline int shmem_getpage(struct inode *inode, pgoff_t index, |
123 | struct page **pagep, enum sgp_type sgp, int *fault_type) | 123 | struct page **pagep, enum sgp_type sgp, int *fault_type) |
124 | { | 124 | { |
125 | return shmem_getpage_gfp(inode, index, pagep, sgp, | 125 | return shmem_getpage_gfp(inode, index, pagep, sgp, |
126 | mapping_gfp_mask(inode->i_mapping), fault_type); | 126 | mapping_gfp_mask(inode->i_mapping), fault_type); |
127 | } | 127 | } |
128 | 128 | ||
129 | static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb) | 129 | static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb) |
130 | { | 130 | { |
131 | return sb->s_fs_info; | 131 | return sb->s_fs_info; |
132 | } | 132 | } |
133 | 133 | ||
134 | /* | 134 | /* |
135 | * shmem_file_setup pre-accounts the whole fixed size of a VM object, | 135 | * shmem_file_setup pre-accounts the whole fixed size of a VM object, |
136 | * for shared memory and for shared anonymous (/dev/zero) mappings | 136 | * for shared memory and for shared anonymous (/dev/zero) mappings |
137 | * (unless MAP_NORESERVE and sysctl_overcommit_memory <= 1), | 137 | * (unless MAP_NORESERVE and sysctl_overcommit_memory <= 1), |
138 | * consistent with the pre-accounting of private mappings ... | 138 | * consistent with the pre-accounting of private mappings ... |
139 | */ | 139 | */ |
140 | static inline int shmem_acct_size(unsigned long flags, loff_t size) | 140 | static inline int shmem_acct_size(unsigned long flags, loff_t size) |
141 | { | 141 | { |
142 | return (flags & VM_NORESERVE) ? | 142 | return (flags & VM_NORESERVE) ? |
143 | 0 : security_vm_enough_memory_mm(current->mm, VM_ACCT(size)); | 143 | 0 : security_vm_enough_memory_mm(current->mm, VM_ACCT(size)); |
144 | } | 144 | } |
145 | 145 | ||
146 | static inline void shmem_unacct_size(unsigned long flags, loff_t size) | 146 | static inline void shmem_unacct_size(unsigned long flags, loff_t size) |
147 | { | 147 | { |
148 | if (!(flags & VM_NORESERVE)) | 148 | if (!(flags & VM_NORESERVE)) |
149 | vm_unacct_memory(VM_ACCT(size)); | 149 | vm_unacct_memory(VM_ACCT(size)); |
150 | } | 150 | } |
151 | 151 | ||
152 | /* | 152 | /* |
153 | * ... whereas tmpfs objects are accounted incrementally as | 153 | * ... whereas tmpfs objects are accounted incrementally as |
154 | * pages are allocated, in order to allow huge sparse files. | 154 | * pages are allocated, in order to allow huge sparse files. |
155 | * shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM, | 155 | * shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM, |
156 | * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM. | 156 | * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM. |
157 | */ | 157 | */ |
158 | static inline int shmem_acct_block(unsigned long flags) | 158 | static inline int shmem_acct_block(unsigned long flags) |
159 | { | 159 | { |
160 | return (flags & VM_NORESERVE) ? | 160 | return (flags & VM_NORESERVE) ? |
161 | security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_CACHE_SIZE)) : 0; | 161 | security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_CACHE_SIZE)) : 0; |
162 | } | 162 | } |
163 | 163 | ||
164 | static inline void shmem_unacct_blocks(unsigned long flags, long pages) | 164 | static inline void shmem_unacct_blocks(unsigned long flags, long pages) |
165 | { | 165 | { |
166 | if (flags & VM_NORESERVE) | 166 | if (flags & VM_NORESERVE) |
167 | vm_unacct_memory(pages * VM_ACCT(PAGE_CACHE_SIZE)); | 167 | vm_unacct_memory(pages * VM_ACCT(PAGE_CACHE_SIZE)); |
168 | } | 168 | } |
169 | 169 | ||
170 | static const struct super_operations shmem_ops; | 170 | static const struct super_operations shmem_ops; |
171 | static const struct address_space_operations shmem_aops; | 171 | static const struct address_space_operations shmem_aops; |
172 | static const struct file_operations shmem_file_operations; | 172 | static const struct file_operations shmem_file_operations; |
173 | static const struct inode_operations shmem_inode_operations; | 173 | static const struct inode_operations shmem_inode_operations; |
174 | static const struct inode_operations shmem_dir_inode_operations; | 174 | static const struct inode_operations shmem_dir_inode_operations; |
175 | static const struct inode_operations shmem_special_inode_operations; | 175 | static const struct inode_operations shmem_special_inode_operations; |
176 | static const struct vm_operations_struct shmem_vm_ops; | 176 | static const struct vm_operations_struct shmem_vm_ops; |
177 | 177 | ||
178 | static struct backing_dev_info shmem_backing_dev_info __read_mostly = { | 178 | static struct backing_dev_info shmem_backing_dev_info __read_mostly = { |
179 | .ra_pages = 0, /* No readahead */ | 179 | .ra_pages = 0, /* No readahead */ |
180 | .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED, | 180 | .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED, |
181 | }; | 181 | }; |
182 | 182 | ||
183 | static LIST_HEAD(shmem_swaplist); | 183 | static LIST_HEAD(shmem_swaplist); |
184 | static DEFINE_MUTEX(shmem_swaplist_mutex); | 184 | static DEFINE_MUTEX(shmem_swaplist_mutex); |
185 | 185 | ||
186 | static int shmem_reserve_inode(struct super_block *sb) | 186 | static int shmem_reserve_inode(struct super_block *sb) |
187 | { | 187 | { |
188 | struct shmem_sb_info *sbinfo = SHMEM_SB(sb); | 188 | struct shmem_sb_info *sbinfo = SHMEM_SB(sb); |
189 | if (sbinfo->max_inodes) { | 189 | if (sbinfo->max_inodes) { |
190 | spin_lock(&sbinfo->stat_lock); | 190 | spin_lock(&sbinfo->stat_lock); |
191 | if (!sbinfo->free_inodes) { | 191 | if (!sbinfo->free_inodes) { |
192 | spin_unlock(&sbinfo->stat_lock); | 192 | spin_unlock(&sbinfo->stat_lock); |
193 | return -ENOSPC; | 193 | return -ENOSPC; |
194 | } | 194 | } |
195 | sbinfo->free_inodes--; | 195 | sbinfo->free_inodes--; |
196 | spin_unlock(&sbinfo->stat_lock); | 196 | spin_unlock(&sbinfo->stat_lock); |
197 | } | 197 | } |
198 | return 0; | 198 | return 0; |
199 | } | 199 | } |
200 | 200 | ||
201 | static void shmem_free_inode(struct super_block *sb) | 201 | static void shmem_free_inode(struct super_block *sb) |
202 | { | 202 | { |
203 | struct shmem_sb_info *sbinfo = SHMEM_SB(sb); | 203 | struct shmem_sb_info *sbinfo = SHMEM_SB(sb); |
204 | if (sbinfo->max_inodes) { | 204 | if (sbinfo->max_inodes) { |
205 | spin_lock(&sbinfo->stat_lock); | 205 | spin_lock(&sbinfo->stat_lock); |
206 | sbinfo->free_inodes++; | 206 | sbinfo->free_inodes++; |
207 | spin_unlock(&sbinfo->stat_lock); | 207 | spin_unlock(&sbinfo->stat_lock); |
208 | } | 208 | } |
209 | } | 209 | } |
210 | 210 | ||
211 | /** | 211 | /** |
212 | * shmem_recalc_inode - recalculate the block usage of an inode | 212 | * shmem_recalc_inode - recalculate the block usage of an inode |
213 | * @inode: inode to recalc | 213 | * @inode: inode to recalc |
214 | * | 214 | * |
215 | * We have to calculate the free blocks since the mm can drop | 215 | * We have to calculate the free blocks since the mm can drop |
216 | * undirtied hole pages behind our back. | 216 | * undirtied hole pages behind our back. |
217 | * | 217 | * |
218 | * But normally info->alloced == inode->i_mapping->nrpages + info->swapped | 218 | * But normally info->alloced == inode->i_mapping->nrpages + info->swapped |
219 | * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped) | 219 | * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped) |
220 | * | 220 | * |
221 | * It has to be called with the spinlock held. | 221 | * It has to be called with the spinlock held. |
222 | */ | 222 | */ |
223 | static void shmem_recalc_inode(struct inode *inode) | 223 | static void shmem_recalc_inode(struct inode *inode) |
224 | { | 224 | { |
225 | struct shmem_inode_info *info = SHMEM_I(inode); | 225 | struct shmem_inode_info *info = SHMEM_I(inode); |
226 | long freed; | 226 | long freed; |
227 | 227 | ||
228 | freed = info->alloced - info->swapped - inode->i_mapping->nrpages; | 228 | freed = info->alloced - info->swapped - inode->i_mapping->nrpages; |
229 | if (freed > 0) { | 229 | if (freed > 0) { |
230 | struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); | 230 | struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); |
231 | if (sbinfo->max_blocks) | 231 | if (sbinfo->max_blocks) |
232 | percpu_counter_add(&sbinfo->used_blocks, -freed); | 232 | percpu_counter_add(&sbinfo->used_blocks, -freed); |
233 | info->alloced -= freed; | 233 | info->alloced -= freed; |
234 | inode->i_blocks -= freed * BLOCKS_PER_PAGE; | 234 | inode->i_blocks -= freed * BLOCKS_PER_PAGE; |
235 | shmem_unacct_blocks(info->flags, freed); | 235 | shmem_unacct_blocks(info->flags, freed); |
236 | } | 236 | } |
237 | } | 237 | } |
238 | 238 | ||
239 | /* | 239 | /* |
240 | * Replace item expected in radix tree by a new item, while holding tree lock. | 240 | * Replace item expected in radix tree by a new item, while holding tree lock. |
241 | */ | 241 | */ |
242 | static int shmem_radix_tree_replace(struct address_space *mapping, | 242 | static int shmem_radix_tree_replace(struct address_space *mapping, |
243 | pgoff_t index, void *expected, void *replacement) | 243 | pgoff_t index, void *expected, void *replacement) |
244 | { | 244 | { |
245 | void **pslot; | 245 | void **pslot; |
246 | void *item; | 246 | void *item; |
247 | 247 | ||
248 | VM_BUG_ON(!expected); | 248 | VM_BUG_ON(!expected); |
249 | VM_BUG_ON(!replacement); | 249 | VM_BUG_ON(!replacement); |
250 | pslot = radix_tree_lookup_slot(&mapping->page_tree, index); | 250 | pslot = radix_tree_lookup_slot(&mapping->page_tree, index); |
251 | if (!pslot) | 251 | if (!pslot) |
252 | return -ENOENT; | 252 | return -ENOENT; |
253 | item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock); | 253 | item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock); |
254 | if (item != expected) | 254 | if (item != expected) |
255 | return -ENOENT; | 255 | return -ENOENT; |
256 | radix_tree_replace_slot(pslot, replacement); | 256 | radix_tree_replace_slot(pslot, replacement); |
257 | return 0; | 257 | return 0; |
258 | } | 258 | } |
259 | 259 | ||
260 | /* | 260 | /* |
261 | * Sometimes, before we decide whether to proceed or to fail, we must check | 261 | * Sometimes, before we decide whether to proceed or to fail, we must check |
262 | * that an entry was not already brought back from swap by a racing thread. | 262 | * that an entry was not already brought back from swap by a racing thread. |
263 | * | 263 | * |
264 | * Checking page is not enough: by the time a SwapCache page is locked, it | 264 | * Checking page is not enough: by the time a SwapCache page is locked, it |
265 | * might be reused, and again be SwapCache, using the same swap as before. | 265 | * might be reused, and again be SwapCache, using the same swap as before. |
266 | */ | 266 | */ |
267 | static bool shmem_confirm_swap(struct address_space *mapping, | 267 | static bool shmem_confirm_swap(struct address_space *mapping, |
268 | pgoff_t index, swp_entry_t swap) | 268 | pgoff_t index, swp_entry_t swap) |
269 | { | 269 | { |
270 | void *item; | 270 | void *item; |
271 | 271 | ||
272 | rcu_read_lock(); | 272 | rcu_read_lock(); |
273 | item = radix_tree_lookup(&mapping->page_tree, index); | 273 | item = radix_tree_lookup(&mapping->page_tree, index); |
274 | rcu_read_unlock(); | 274 | rcu_read_unlock(); |
275 | return item == swp_to_radix_entry(swap); | 275 | return item == swp_to_radix_entry(swap); |
276 | } | 276 | } |
277 | 277 | ||
278 | /* | 278 | /* |
279 | * Like add_to_page_cache_locked, but error if expected item has gone. | 279 | * Like add_to_page_cache_locked, but error if expected item has gone. |
280 | */ | 280 | */ |
281 | static int shmem_add_to_page_cache(struct page *page, | 281 | static int shmem_add_to_page_cache(struct page *page, |
282 | struct address_space *mapping, | 282 | struct address_space *mapping, |
283 | pgoff_t index, gfp_t gfp, void *expected) | 283 | pgoff_t index, gfp_t gfp, void *expected) |
284 | { | 284 | { |
285 | int error; | 285 | int error; |
286 | 286 | ||
287 | VM_BUG_ON(!PageLocked(page)); | 287 | VM_BUG_ON(!PageLocked(page)); |
288 | VM_BUG_ON(!PageSwapBacked(page)); | 288 | VM_BUG_ON(!PageSwapBacked(page)); |
289 | 289 | ||
290 | page_cache_get(page); | 290 | page_cache_get(page); |
291 | page->mapping = mapping; | 291 | page->mapping = mapping; |
292 | page->index = index; | 292 | page->index = index; |
293 | 293 | ||
294 | spin_lock_irq(&mapping->tree_lock); | 294 | spin_lock_irq(&mapping->tree_lock); |
295 | if (!expected) | 295 | if (!expected) |
296 | error = radix_tree_insert(&mapping->page_tree, index, page); | 296 | error = radix_tree_insert(&mapping->page_tree, index, page); |
297 | else | 297 | else |
298 | error = shmem_radix_tree_replace(mapping, index, expected, | 298 | error = shmem_radix_tree_replace(mapping, index, expected, |
299 | page); | 299 | page); |
300 | if (!error) { | 300 | if (!error) { |
301 | mapping->nrpages++; | 301 | mapping->nrpages++; |
302 | __inc_zone_page_state(page, NR_FILE_PAGES); | 302 | __inc_zone_page_state(page, NR_FILE_PAGES); |
303 | __inc_zone_page_state(page, NR_SHMEM); | 303 | __inc_zone_page_state(page, NR_SHMEM); |
304 | spin_unlock_irq(&mapping->tree_lock); | 304 | spin_unlock_irq(&mapping->tree_lock); |
305 | } else { | 305 | } else { |
306 | page->mapping = NULL; | 306 | page->mapping = NULL; |
307 | spin_unlock_irq(&mapping->tree_lock); | 307 | spin_unlock_irq(&mapping->tree_lock); |
308 | page_cache_release(page); | 308 | page_cache_release(page); |
309 | } | 309 | } |
310 | return error; | 310 | return error; |
311 | } | 311 | } |
312 | 312 | ||
313 | /* | 313 | /* |
314 | * Like delete_from_page_cache, but substitutes swap for page. | 314 | * Like delete_from_page_cache, but substitutes swap for page. |
315 | */ | 315 | */ |
316 | static void shmem_delete_from_page_cache(struct page *page, void *radswap) | 316 | static void shmem_delete_from_page_cache(struct page *page, void *radswap) |
317 | { | 317 | { |
318 | struct address_space *mapping = page->mapping; | 318 | struct address_space *mapping = page->mapping; |
319 | int error; | 319 | int error; |
320 | 320 | ||
321 | spin_lock_irq(&mapping->tree_lock); | 321 | spin_lock_irq(&mapping->tree_lock); |
322 | error = shmem_radix_tree_replace(mapping, page->index, page, radswap); | 322 | error = shmem_radix_tree_replace(mapping, page->index, page, radswap); |
323 | page->mapping = NULL; | 323 | page->mapping = NULL; |
324 | mapping->nrpages--; | 324 | mapping->nrpages--; |
325 | __dec_zone_page_state(page, NR_FILE_PAGES); | 325 | __dec_zone_page_state(page, NR_FILE_PAGES); |
326 | __dec_zone_page_state(page, NR_SHMEM); | 326 | __dec_zone_page_state(page, NR_SHMEM); |
327 | spin_unlock_irq(&mapping->tree_lock); | 327 | spin_unlock_irq(&mapping->tree_lock); |
328 | page_cache_release(page); | 328 | page_cache_release(page); |
329 | BUG_ON(error); | 329 | BUG_ON(error); |
330 | } | 330 | } |
331 | 331 | ||
332 | /* | 332 | /* |
333 | * Remove swap entry from radix tree, free the swap and its page cache. | 333 | * Remove swap entry from radix tree, free the swap and its page cache. |
334 | */ | 334 | */ |
335 | static int shmem_free_swap(struct address_space *mapping, | 335 | static int shmem_free_swap(struct address_space *mapping, |
336 | pgoff_t index, void *radswap) | 336 | pgoff_t index, void *radswap) |
337 | { | 337 | { |
338 | void *old; | 338 | void *old; |
339 | 339 | ||
340 | spin_lock_irq(&mapping->tree_lock); | 340 | spin_lock_irq(&mapping->tree_lock); |
341 | old = radix_tree_delete_item(&mapping->page_tree, index, radswap); | 341 | old = radix_tree_delete_item(&mapping->page_tree, index, radswap); |
342 | spin_unlock_irq(&mapping->tree_lock); | 342 | spin_unlock_irq(&mapping->tree_lock); |
343 | if (old != radswap) | 343 | if (old != radswap) |
344 | return -ENOENT; | 344 | return -ENOENT; |
345 | free_swap_and_cache(radix_to_swp_entry(radswap)); | 345 | free_swap_and_cache(radix_to_swp_entry(radswap)); |
346 | return 0; | 346 | return 0; |
347 | } | 347 | } |
348 | 348 | ||
349 | /* | 349 | /* |
350 | * SysV IPC SHM_UNLOCK restores Unevictable pages to their evictable lists. | 350 | * SysV IPC SHM_UNLOCK restores Unevictable pages to their evictable lists. |
351 | */ | 351 | */ |
352 | void shmem_unlock_mapping(struct address_space *mapping) | 352 | void shmem_unlock_mapping(struct address_space *mapping) |
353 | { | 353 | { |
354 | struct pagevec pvec; | 354 | struct pagevec pvec; |
355 | pgoff_t indices[PAGEVEC_SIZE]; | 355 | pgoff_t indices[PAGEVEC_SIZE]; |
356 | pgoff_t index = 0; | 356 | pgoff_t index = 0; |
357 | 357 | ||
358 | pagevec_init(&pvec, 0); | 358 | pagevec_init(&pvec, 0); |
359 | /* | 359 | /* |
360 | * Minor point, but we might as well stop if someone else SHM_LOCKs it. | 360 | * Minor point, but we might as well stop if someone else SHM_LOCKs it. |
361 | */ | 361 | */ |
362 | while (!mapping_unevictable(mapping)) { | 362 | while (!mapping_unevictable(mapping)) { |
363 | /* | 363 | /* |
364 | * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it | 364 | * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it |
365 | * has finished, if it hits a row of PAGEVEC_SIZE swap entries. | 365 | * has finished, if it hits a row of PAGEVEC_SIZE swap entries. |
366 | */ | 366 | */ |
367 | pvec.nr = find_get_entries(mapping, index, | 367 | pvec.nr = find_get_entries(mapping, index, |
368 | PAGEVEC_SIZE, pvec.pages, indices); | 368 | PAGEVEC_SIZE, pvec.pages, indices); |
369 | if (!pvec.nr) | 369 | if (!pvec.nr) |
370 | break; | 370 | break; |
371 | index = indices[pvec.nr - 1] + 1; | 371 | index = indices[pvec.nr - 1] + 1; |
372 | pagevec_remove_exceptionals(&pvec); | 372 | pagevec_remove_exceptionals(&pvec); |
373 | check_move_unevictable_pages(pvec.pages, pvec.nr); | 373 | check_move_unevictable_pages(pvec.pages, pvec.nr); |
374 | pagevec_release(&pvec); | 374 | pagevec_release(&pvec); |
375 | cond_resched(); | 375 | cond_resched(); |
376 | } | 376 | } |
377 | } | 377 | } |
378 | 378 | ||
379 | /* | 379 | /* |
380 | * Remove range of pages and swap entries from radix tree, and free them. | 380 | * Remove range of pages and swap entries from radix tree, and free them. |
381 | * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. | 381 | * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. |
382 | */ | 382 | */ |
383 | static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, | 383 | static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, |
384 | bool unfalloc) | 384 | bool unfalloc) |
385 | { | 385 | { |
386 | struct address_space *mapping = inode->i_mapping; | 386 | struct address_space *mapping = inode->i_mapping; |
387 | struct shmem_inode_info *info = SHMEM_I(inode); | 387 | struct shmem_inode_info *info = SHMEM_I(inode); |
388 | pgoff_t start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; | 388 | pgoff_t start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; |
389 | pgoff_t end = (lend + 1) >> PAGE_CACHE_SHIFT; | 389 | pgoff_t end = (lend + 1) >> PAGE_CACHE_SHIFT; |
390 | unsigned int partial_start = lstart & (PAGE_CACHE_SIZE - 1); | 390 | unsigned int partial_start = lstart & (PAGE_CACHE_SIZE - 1); |
391 | unsigned int partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1); | 391 | unsigned int partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1); |
392 | struct pagevec pvec; | 392 | struct pagevec pvec; |
393 | pgoff_t indices[PAGEVEC_SIZE]; | 393 | pgoff_t indices[PAGEVEC_SIZE]; |
394 | long nr_swaps_freed = 0; | 394 | long nr_swaps_freed = 0; |
395 | pgoff_t index; | 395 | pgoff_t index; |
396 | int i; | 396 | int i; |
397 | 397 | ||
398 | if (lend == -1) | 398 | if (lend == -1) |
399 | end = -1; /* unsigned, so actually very big */ | 399 | end = -1; /* unsigned, so actually very big */ |
400 | 400 | ||
401 | pagevec_init(&pvec, 0); | 401 | pagevec_init(&pvec, 0); |
402 | index = start; | 402 | index = start; |
403 | while (index < end) { | 403 | while (index < end) { |
404 | pvec.nr = find_get_entries(mapping, index, | 404 | pvec.nr = find_get_entries(mapping, index, |
405 | min(end - index, (pgoff_t)PAGEVEC_SIZE), | 405 | min(end - index, (pgoff_t)PAGEVEC_SIZE), |
406 | pvec.pages, indices); | 406 | pvec.pages, indices); |
407 | if (!pvec.nr) | 407 | if (!pvec.nr) |
408 | break; | 408 | break; |
409 | mem_cgroup_uncharge_start(); | 409 | mem_cgroup_uncharge_start(); |
410 | for (i = 0; i < pagevec_count(&pvec); i++) { | 410 | for (i = 0; i < pagevec_count(&pvec); i++) { |
411 | struct page *page = pvec.pages[i]; | 411 | struct page *page = pvec.pages[i]; |
412 | 412 | ||
413 | index = indices[i]; | 413 | index = indices[i]; |
414 | if (index >= end) | 414 | if (index >= end) |
415 | break; | 415 | break; |
416 | 416 | ||
417 | if (radix_tree_exceptional_entry(page)) { | 417 | if (radix_tree_exceptional_entry(page)) { |
418 | if (unfalloc) | 418 | if (unfalloc) |
419 | continue; | 419 | continue; |
420 | nr_swaps_freed += !shmem_free_swap(mapping, | 420 | nr_swaps_freed += !shmem_free_swap(mapping, |
421 | index, page); | 421 | index, page); |
422 | continue; | 422 | continue; |
423 | } | 423 | } |
424 | 424 | ||
425 | if (!trylock_page(page)) | 425 | if (!trylock_page(page)) |
426 | continue; | 426 | continue; |
427 | if (!unfalloc || !PageUptodate(page)) { | 427 | if (!unfalloc || !PageUptodate(page)) { |
428 | if (page->mapping == mapping) { | 428 | if (page->mapping == mapping) { |
429 | VM_BUG_ON(PageWriteback(page)); | 429 | VM_BUG_ON(PageWriteback(page)); |
430 | truncate_inode_page(mapping, page); | 430 | truncate_inode_page(mapping, page); |
431 | } | 431 | } |
432 | } | 432 | } |
433 | unlock_page(page); | 433 | unlock_page(page); |
434 | } | 434 | } |
435 | pagevec_remove_exceptionals(&pvec); | 435 | pagevec_remove_exceptionals(&pvec); |
436 | pagevec_release(&pvec); | 436 | pagevec_release(&pvec); |
437 | mem_cgroup_uncharge_end(); | 437 | mem_cgroup_uncharge_end(); |
438 | cond_resched(); | 438 | cond_resched(); |
439 | index++; | 439 | index++; |
440 | } | 440 | } |
441 | 441 | ||
442 | if (partial_start) { | 442 | if (partial_start) { |
443 | struct page *page = NULL; | 443 | struct page *page = NULL; |
444 | shmem_getpage(inode, start - 1, &page, SGP_READ, NULL); | 444 | shmem_getpage(inode, start - 1, &page, SGP_READ, NULL); |
445 | if (page) { | 445 | if (page) { |
446 | unsigned int top = PAGE_CACHE_SIZE; | 446 | unsigned int top = PAGE_CACHE_SIZE; |
447 | if (start > end) { | 447 | if (start > end) { |
448 | top = partial_end; | 448 | top = partial_end; |
449 | partial_end = 0; | 449 | partial_end = 0; |
450 | } | 450 | } |
451 | zero_user_segment(page, partial_start, top); | 451 | zero_user_segment(page, partial_start, top); |
452 | set_page_dirty(page); | 452 | set_page_dirty(page); |
453 | unlock_page(page); | 453 | unlock_page(page); |
454 | page_cache_release(page); | 454 | page_cache_release(page); |
455 | } | 455 | } |
456 | } | 456 | } |
457 | if (partial_end) { | 457 | if (partial_end) { |
458 | struct page *page = NULL; | 458 | struct page *page = NULL; |
459 | shmem_getpage(inode, end, &page, SGP_READ, NULL); | 459 | shmem_getpage(inode, end, &page, SGP_READ, NULL); |
460 | if (page) { | 460 | if (page) { |
461 | zero_user_segment(page, 0, partial_end); | 461 | zero_user_segment(page, 0, partial_end); |
462 | set_page_dirty(page); | 462 | set_page_dirty(page); |
463 | unlock_page(page); | 463 | unlock_page(page); |
464 | page_cache_release(page); | 464 | page_cache_release(page); |
465 | } | 465 | } |
466 | } | 466 | } |
467 | if (start >= end) | 467 | if (start >= end) |
468 | return; | 468 | return; |
469 | 469 | ||
470 | index = start; | 470 | index = start; |
471 | while (index < end) { | 471 | while (index < end) { |
472 | cond_resched(); | 472 | cond_resched(); |
473 | 473 | ||
474 | pvec.nr = find_get_entries(mapping, index, | 474 | pvec.nr = find_get_entries(mapping, index, |
475 | min(end - index, (pgoff_t)PAGEVEC_SIZE), | 475 | min(end - index, (pgoff_t)PAGEVEC_SIZE), |
476 | pvec.pages, indices); | 476 | pvec.pages, indices); |
477 | if (!pvec.nr) { | 477 | if (!pvec.nr) { |
478 | /* If all gone or hole-punch or unfalloc, we're done */ | 478 | /* If all gone or hole-punch or unfalloc, we're done */ |
479 | if (index == start || end != -1) | 479 | if (index == start || end != -1) |
480 | break; | 480 | break; |
481 | /* But if truncating, restart to make sure all gone */ | 481 | /* But if truncating, restart to make sure all gone */ |
482 | index = start; | 482 | index = start; |
483 | continue; | 483 | continue; |
484 | } | 484 | } |
485 | mem_cgroup_uncharge_start(); | 485 | mem_cgroup_uncharge_start(); |
486 | for (i = 0; i < pagevec_count(&pvec); i++) { | 486 | for (i = 0; i < pagevec_count(&pvec); i++) { |
487 | struct page *page = pvec.pages[i]; | 487 | struct page *page = pvec.pages[i]; |
488 | 488 | ||
489 | index = indices[i]; | 489 | index = indices[i]; |
490 | if (index >= end) | 490 | if (index >= end) |
491 | break; | 491 | break; |
492 | 492 | ||
493 | if (radix_tree_exceptional_entry(page)) { | 493 | if (radix_tree_exceptional_entry(page)) { |
494 | if (unfalloc) | 494 | if (unfalloc) |
495 | continue; | 495 | continue; |
496 | if (shmem_free_swap(mapping, index, page)) { | 496 | if (shmem_free_swap(mapping, index, page)) { |
497 | /* Swap was replaced by page: retry */ | 497 | /* Swap was replaced by page: retry */ |
498 | index--; | 498 | index--; |
499 | break; | 499 | break; |
500 | } | 500 | } |
501 | nr_swaps_freed++; | 501 | nr_swaps_freed++; |
502 | continue; | 502 | continue; |
503 | } | 503 | } |
504 | 504 | ||
505 | lock_page(page); | 505 | lock_page(page); |
506 | if (!unfalloc || !PageUptodate(page)) { | 506 | if (!unfalloc || !PageUptodate(page)) { |
507 | if (page->mapping == mapping) { | 507 | if (page->mapping == mapping) { |
508 | VM_BUG_ON(PageWriteback(page)); | 508 | VM_BUG_ON(PageWriteback(page)); |
509 | truncate_inode_page(mapping, page); | 509 | truncate_inode_page(mapping, page); |
510 | } else { | 510 | } else { |
511 | /* Page was replaced by swap: retry */ | 511 | /* Page was replaced by swap: retry */ |
512 | unlock_page(page); | 512 | unlock_page(page); |
513 | index--; | 513 | index--; |
514 | break; | 514 | break; |
515 | } | 515 | } |
516 | } | 516 | } |
517 | unlock_page(page); | 517 | unlock_page(page); |
518 | } | 518 | } |
519 | pagevec_remove_exceptionals(&pvec); | 519 | pagevec_remove_exceptionals(&pvec); |
520 | pagevec_release(&pvec); | 520 | pagevec_release(&pvec); |
521 | mem_cgroup_uncharge_end(); | 521 | mem_cgroup_uncharge_end(); |
522 | index++; | 522 | index++; |
523 | } | 523 | } |
524 | 524 | ||
525 | spin_lock(&info->lock); | 525 | spin_lock(&info->lock); |
526 | info->swapped -= nr_swaps_freed; | 526 | info->swapped -= nr_swaps_freed; |
527 | shmem_recalc_inode(inode); | 527 | shmem_recalc_inode(inode); |
528 | spin_unlock(&info->lock); | 528 | spin_unlock(&info->lock); |
529 | } | 529 | } |
530 | 530 | ||
531 | void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend) | 531 | void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend) |
532 | { | 532 | { |
533 | shmem_undo_range(inode, lstart, lend, false); | 533 | shmem_undo_range(inode, lstart, lend, false); |
534 | inode->i_ctime = inode->i_mtime = CURRENT_TIME; | 534 | inode->i_ctime = inode->i_mtime = CURRENT_TIME; |
535 | } | 535 | } |
536 | EXPORT_SYMBOL_GPL(shmem_truncate_range); | 536 | EXPORT_SYMBOL_GPL(shmem_truncate_range); |
537 | 537 | ||
538 | static int shmem_setattr(struct dentry *dentry, struct iattr *attr) | 538 | static int shmem_setattr(struct dentry *dentry, struct iattr *attr) |
539 | { | 539 | { |
540 | struct inode *inode = dentry->d_inode; | 540 | struct inode *inode = dentry->d_inode; |
541 | int error; | 541 | int error; |
542 | 542 | ||
543 | error = inode_change_ok(inode, attr); | 543 | error = inode_change_ok(inode, attr); |
544 | if (error) | 544 | if (error) |
545 | return error; | 545 | return error; |
546 | 546 | ||
547 | if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) { | 547 | if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) { |
548 | loff_t oldsize = inode->i_size; | 548 | loff_t oldsize = inode->i_size; |
549 | loff_t newsize = attr->ia_size; | 549 | loff_t newsize = attr->ia_size; |
550 | 550 | ||
551 | if (newsize != oldsize) { | 551 | if (newsize != oldsize) { |
552 | i_size_write(inode, newsize); | 552 | i_size_write(inode, newsize); |
553 | inode->i_ctime = inode->i_mtime = CURRENT_TIME; | 553 | inode->i_ctime = inode->i_mtime = CURRENT_TIME; |
554 | } | 554 | } |
555 | if (newsize < oldsize) { | 555 | if (newsize < oldsize) { |
556 | loff_t holebegin = round_up(newsize, PAGE_SIZE); | 556 | loff_t holebegin = round_up(newsize, PAGE_SIZE); |
557 | unmap_mapping_range(inode->i_mapping, holebegin, 0, 1); | 557 | unmap_mapping_range(inode->i_mapping, holebegin, 0, 1); |
558 | shmem_truncate_range(inode, newsize, (loff_t)-1); | 558 | shmem_truncate_range(inode, newsize, (loff_t)-1); |
559 | /* unmap again to remove racily COWed private pages */ | 559 | /* unmap again to remove racily COWed private pages */ |
560 | unmap_mapping_range(inode->i_mapping, holebegin, 0, 1); | 560 | unmap_mapping_range(inode->i_mapping, holebegin, 0, 1); |
561 | } | 561 | } |
562 | } | 562 | } |
563 | 563 | ||
564 | setattr_copy(inode, attr); | 564 | setattr_copy(inode, attr); |
565 | #ifdef CONFIG_TMPFS_POSIX_ACL | 565 | #ifdef CONFIG_TMPFS_POSIX_ACL |
566 | if (attr->ia_valid & ATTR_MODE) | 566 | if (attr->ia_valid & ATTR_MODE) |
567 | error = generic_acl_chmod(inode); | 567 | error = generic_acl_chmod(inode); |
568 | #endif | 568 | #endif |
569 | return error; | 569 | return error; |
570 | } | 570 | } |
571 | 571 | ||
572 | static void shmem_evict_inode(struct inode *inode) | 572 | static void shmem_evict_inode(struct inode *inode) |
573 | { | 573 | { |
574 | struct shmem_inode_info *info = SHMEM_I(inode); | 574 | struct shmem_inode_info *info = SHMEM_I(inode); |
575 | 575 | ||
576 | if (inode->i_mapping->a_ops == &shmem_aops) { | 576 | if (inode->i_mapping->a_ops == &shmem_aops) { |
577 | shmem_unacct_size(info->flags, inode->i_size); | 577 | shmem_unacct_size(info->flags, inode->i_size); |
578 | inode->i_size = 0; | 578 | inode->i_size = 0; |
579 | shmem_truncate_range(inode, 0, (loff_t)-1); | 579 | shmem_truncate_range(inode, 0, (loff_t)-1); |
580 | if (!list_empty(&info->swaplist)) { | 580 | if (!list_empty(&info->swaplist)) { |
581 | mutex_lock(&shmem_swaplist_mutex); | 581 | mutex_lock(&shmem_swaplist_mutex); |
582 | list_del_init(&info->swaplist); | 582 | list_del_init(&info->swaplist); |
583 | mutex_unlock(&shmem_swaplist_mutex); | 583 | mutex_unlock(&shmem_swaplist_mutex); |
584 | } | 584 | } |
585 | } else | 585 | } else |
586 | kfree(info->symlink); | 586 | kfree(info->symlink); |
587 | 587 | ||
588 | simple_xattrs_free(&info->xattrs); | 588 | simple_xattrs_free(&info->xattrs); |
589 | WARN_ON(inode->i_blocks); | 589 | WARN_ON(inode->i_blocks); |
590 | shmem_free_inode(inode->i_sb); | 590 | shmem_free_inode(inode->i_sb); |
591 | clear_inode(inode); | 591 | clear_inode(inode); |
592 | } | 592 | } |
593 | 593 | ||
594 | /* | 594 | /* |
595 | * If swap found in inode, free it and move page from swapcache to filecache. | 595 | * If swap found in inode, free it and move page from swapcache to filecache. |
596 | */ | 596 | */ |
597 | static int shmem_unuse_inode(struct shmem_inode_info *info, | 597 | static int shmem_unuse_inode(struct shmem_inode_info *info, |
598 | swp_entry_t swap, struct page **pagep) | 598 | swp_entry_t swap, struct page **pagep) |
599 | { | 599 | { |
600 | struct address_space *mapping = info->vfs_inode.i_mapping; | 600 | struct address_space *mapping = info->vfs_inode.i_mapping; |
601 | void *radswap; | 601 | void *radswap; |
602 | pgoff_t index; | 602 | pgoff_t index; |
603 | gfp_t gfp; | 603 | gfp_t gfp; |
604 | int error = 0; | 604 | int error = 0; |
605 | 605 | ||
606 | radswap = swp_to_radix_entry(swap); | 606 | radswap = swp_to_radix_entry(swap); |
607 | index = radix_tree_locate_item(&mapping->page_tree, radswap); | 607 | index = radix_tree_locate_item(&mapping->page_tree, radswap); |
608 | if (index == -1) | 608 | if (index == -1) |
609 | return 0; | 609 | return 0; |
610 | 610 | ||
611 | /* | 611 | /* |
612 | * Move _head_ to start search for next from here. | 612 | * Move _head_ to start search for next from here. |
613 | * But be careful: shmem_evict_inode checks list_empty without taking | 613 | * But be careful: shmem_evict_inode checks list_empty without taking |
614 | * mutex, and there's an instant in list_move_tail when info->swaplist | 614 | * mutex, and there's an instant in list_move_tail when info->swaplist |
615 | * would appear empty, if it were the only one on shmem_swaplist. | 615 | * would appear empty, if it were the only one on shmem_swaplist. |
616 | */ | 616 | */ |
617 | if (shmem_swaplist.next != &info->swaplist) | 617 | if (shmem_swaplist.next != &info->swaplist) |
618 | list_move_tail(&shmem_swaplist, &info->swaplist); | 618 | list_move_tail(&shmem_swaplist, &info->swaplist); |
619 | 619 | ||
620 | gfp = mapping_gfp_mask(mapping); | 620 | gfp = mapping_gfp_mask(mapping); |
621 | if (shmem_should_replace_page(*pagep, gfp)) { | 621 | if (shmem_should_replace_page(*pagep, gfp)) { |
622 | mutex_unlock(&shmem_swaplist_mutex); | 622 | mutex_unlock(&shmem_swaplist_mutex); |
623 | error = shmem_replace_page(pagep, gfp, info, index); | 623 | error = shmem_replace_page(pagep, gfp, info, index); |
624 | mutex_lock(&shmem_swaplist_mutex); | 624 | mutex_lock(&shmem_swaplist_mutex); |
625 | /* | 625 | /* |
626 | * We needed to drop mutex to make that restrictive page | 626 | * We needed to drop mutex to make that restrictive page |
627 | * allocation, but the inode might have been freed while we | 627 | * allocation, but the inode might have been freed while we |
628 | * dropped it: although a racing shmem_evict_inode() cannot | 628 | * dropped it: although a racing shmem_evict_inode() cannot |
629 | * complete without emptying the radix_tree, our page lock | 629 | * complete without emptying the radix_tree, our page lock |
630 | * on this swapcache page is not enough to prevent that - | 630 | * on this swapcache page is not enough to prevent that - |
631 | * free_swap_and_cache() of our swap entry will only | 631 | * free_swap_and_cache() of our swap entry will only |
632 | * trylock_page(), removing swap from radix_tree whatever. | 632 | * trylock_page(), removing swap from radix_tree whatever. |
633 | * | 633 | * |
634 | * We must not proceed to shmem_add_to_page_cache() if the | 634 | * We must not proceed to shmem_add_to_page_cache() if the |
635 | * inode has been freed, but of course we cannot rely on | 635 | * inode has been freed, but of course we cannot rely on |
636 | * inode or mapping or info to check that. However, we can | 636 | * inode or mapping or info to check that. However, we can |
637 | * safely check if our swap entry is still in use (and here | 637 | * safely check if our swap entry is still in use (and here |
638 | * it can't have got reused for another page): if it's still | 638 | * it can't have got reused for another page): if it's still |
639 | * in use, then the inode cannot have been freed yet, and we | 639 | * in use, then the inode cannot have been freed yet, and we |
640 | * can safely proceed (if it's no longer in use, that tells | 640 | * can safely proceed (if it's no longer in use, that tells |
641 | * nothing about the inode, but we don't need to unuse swap). | 641 | * nothing about the inode, but we don't need to unuse swap). |
642 | */ | 642 | */ |
643 | if (!page_swapcount(*pagep)) | 643 | if (!page_swapcount(*pagep)) |
644 | error = -ENOENT; | 644 | error = -ENOENT; |
645 | } | 645 | } |
646 | 646 | ||
647 | /* | 647 | /* |
648 | * We rely on shmem_swaplist_mutex, not only to protect the swaplist, | 648 | * We rely on shmem_swaplist_mutex, not only to protect the swaplist, |
649 | * but also to hold up shmem_evict_inode(): so inode cannot be freed | 649 | * but also to hold up shmem_evict_inode(): so inode cannot be freed |
650 | * beneath us (pagelock doesn't help until the page is in pagecache). | 650 | * beneath us (pagelock doesn't help until the page is in pagecache). |
651 | */ | 651 | */ |
652 | if (!error) | 652 | if (!error) |
653 | error = shmem_add_to_page_cache(*pagep, mapping, index, | 653 | error = shmem_add_to_page_cache(*pagep, mapping, index, |
654 | GFP_NOWAIT, radswap); | 654 | GFP_NOWAIT, radswap); |
655 | if (error != -ENOMEM) { | 655 | if (error != -ENOMEM) { |
656 | /* | 656 | /* |
657 | * Truncation and eviction use free_swap_and_cache(), which | 657 | * Truncation and eviction use free_swap_and_cache(), which |
658 | * only does trylock page: if we raced, best clean up here. | 658 | * only does trylock page: if we raced, best clean up here. |
659 | */ | 659 | */ |
660 | delete_from_swap_cache(*pagep); | 660 | delete_from_swap_cache(*pagep); |
661 | set_page_dirty(*pagep); | 661 | set_page_dirty(*pagep); |
662 | if (!error) { | 662 | if (!error) { |
663 | spin_lock(&info->lock); | 663 | spin_lock(&info->lock); |
664 | info->swapped--; | 664 | info->swapped--; |
665 | spin_unlock(&info->lock); | 665 | spin_unlock(&info->lock); |
666 | swap_free(swap); | 666 | swap_free(swap); |
667 | } | 667 | } |
668 | error = 1; /* not an error, but entry was found */ | 668 | error = 1; /* not an error, but entry was found */ |
669 | } | 669 | } |
670 | return error; | 670 | return error; |
671 | } | 671 | } |
672 | 672 | ||
673 | /* | 673 | /* |
674 | * Search through swapped inodes to find and replace swap by page. | 674 | * Search through swapped inodes to find and replace swap by page. |
675 | */ | 675 | */ |
676 | int shmem_unuse(swp_entry_t swap, struct page *page) | 676 | int shmem_unuse(swp_entry_t swap, struct page *page) |
677 | { | 677 | { |
678 | struct list_head *this, *next; | 678 | struct list_head *this, *next; |
679 | struct shmem_inode_info *info; | 679 | struct shmem_inode_info *info; |
680 | int found = 0; | 680 | int found = 0; |
681 | int error = 0; | 681 | int error = 0; |
682 | 682 | ||
683 | /* | 683 | /* |
684 | * There's a faint possibility that swap page was replaced before | 684 | * There's a faint possibility that swap page was replaced before |
685 | * caller locked it: caller will come back later with the right page. | 685 | * caller locked it: caller will come back later with the right page. |
686 | */ | 686 | */ |
687 | if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val)) | 687 | if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val)) |
688 | goto out; | 688 | goto out; |
689 | 689 | ||
690 | /* | 690 | /* |
691 | * Charge page using GFP_KERNEL while we can wait, before taking | 691 | * Charge page using GFP_KERNEL while we can wait, before taking |
692 | * the shmem_swaplist_mutex which might hold up shmem_writepage(). | 692 | * the shmem_swaplist_mutex which might hold up shmem_writepage(). |
693 | * Charged back to the user (not to caller) when swap account is used. | 693 | * Charged back to the user (not to caller) when swap account is used. |
694 | */ | 694 | */ |
695 | error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); | 695 | error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL); |
696 | if (error) | 696 | if (error) |
697 | goto out; | 697 | goto out; |
698 | /* No radix_tree_preload: swap entry keeps a place for page in tree */ | 698 | /* No radix_tree_preload: swap entry keeps a place for page in tree */ |
699 | 699 | ||
700 | mutex_lock(&shmem_swaplist_mutex); | 700 | mutex_lock(&shmem_swaplist_mutex); |
701 | list_for_each_safe(this, next, &shmem_swaplist) { | 701 | list_for_each_safe(this, next, &shmem_swaplist) { |
702 | info = list_entry(this, struct shmem_inode_info, swaplist); | 702 | info = list_entry(this, struct shmem_inode_info, swaplist); |
703 | if (info->swapped) | 703 | if (info->swapped) |
704 | found = shmem_unuse_inode(info, swap, &page); | 704 | found = shmem_unuse_inode(info, swap, &page); |
705 | else | 705 | else |
706 | list_del_init(&info->swaplist); | 706 | list_del_init(&info->swaplist); |
707 | cond_resched(); | 707 | cond_resched(); |
708 | if (found) | 708 | if (found) |
709 | break; | 709 | break; |
710 | } | 710 | } |
711 | mutex_unlock(&shmem_swaplist_mutex); | 711 | mutex_unlock(&shmem_swaplist_mutex); |
712 | 712 | ||
713 | if (found < 0) | 713 | if (found < 0) |
714 | error = found; | 714 | error = found; |
715 | out: | 715 | out: |
716 | unlock_page(page); | 716 | unlock_page(page); |
717 | page_cache_release(page); | 717 | page_cache_release(page); |
718 | return error; | 718 | return error; |
719 | } | 719 | } |
720 | 720 | ||
721 | /* | 721 | /* |
722 | * Move the page from the page cache to the swap cache. | 722 | * Move the page from the page cache to the swap cache. |
723 | */ | 723 | */ |
724 | static int shmem_writepage(struct page *page, struct writeback_control *wbc) | 724 | static int shmem_writepage(struct page *page, struct writeback_control *wbc) |
725 | { | 725 | { |
726 | struct shmem_inode_info *info; | 726 | struct shmem_inode_info *info; |
727 | struct address_space *mapping; | 727 | struct address_space *mapping; |
728 | struct inode *inode; | 728 | struct inode *inode; |
729 | swp_entry_t swap; | 729 | swp_entry_t swap; |
730 | pgoff_t index; | 730 | pgoff_t index; |
731 | 731 | ||
732 | BUG_ON(!PageLocked(page)); | 732 | BUG_ON(!PageLocked(page)); |
733 | mapping = page->mapping; | 733 | mapping = page->mapping; |
734 | index = page->index; | 734 | index = page->index; |
735 | inode = mapping->host; | 735 | inode = mapping->host; |
736 | info = SHMEM_I(inode); | 736 | info = SHMEM_I(inode); |
737 | if (info->flags & VM_LOCKED) | 737 | if (info->flags & VM_LOCKED) |
738 | goto redirty; | 738 | goto redirty; |
739 | if (!total_swap_pages) | 739 | if (!total_swap_pages) |
740 | goto redirty; | 740 | goto redirty; |
741 | 741 | ||
742 | /* | 742 | /* |
743 | * shmem_backing_dev_info's capabilities prevent regular writeback or | 743 | * shmem_backing_dev_info's capabilities prevent regular writeback or |
744 | * sync from ever calling shmem_writepage; but a stacking filesystem | 744 | * sync from ever calling shmem_writepage; but a stacking filesystem |
745 | * might use ->writepage of its underlying filesystem, in which case | 745 | * might use ->writepage of its underlying filesystem, in which case |
746 | * tmpfs should write out to swap only in response to memory pressure, | 746 | * tmpfs should write out to swap only in response to memory pressure, |
747 | * and not for the writeback threads or sync. | 747 | * and not for the writeback threads or sync. |
748 | */ | 748 | */ |
749 | if (!wbc->for_reclaim) { | 749 | if (!wbc->for_reclaim) { |
750 | WARN_ON_ONCE(1); /* Still happens? Tell us about it! */ | 750 | WARN_ON_ONCE(1); /* Still happens? Tell us about it! */ |
751 | goto redirty; | 751 | goto redirty; |
752 | } | 752 | } |
753 | 753 | ||
754 | /* | 754 | /* |
755 | * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC | 755 | * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC |
756 | * value into swapfile.c, the only way we can correctly account for a | 756 | * value into swapfile.c, the only way we can correctly account for a |
757 | * fallocated page arriving here is now to initialize it and write it. | 757 | * fallocated page arriving here is now to initialize it and write it. |
758 | * | 758 | * |
759 | * That's okay for a page already fallocated earlier, but if we have | 759 | * That's okay for a page already fallocated earlier, but if we have |
760 | * not yet completed the fallocation, then (a) we want to keep track | 760 | * not yet completed the fallocation, then (a) we want to keep track |
761 | * of this page in case we have to undo it, and (b) it may not be a | 761 | * of this page in case we have to undo it, and (b) it may not be a |
762 | * good idea to continue anyway, once we're pushing into swap. So | 762 | * good idea to continue anyway, once we're pushing into swap. So |
763 | * reactivate the page, and let shmem_fallocate() quit when too many. | 763 | * reactivate the page, and let shmem_fallocate() quit when too many. |
764 | */ | 764 | */ |
765 | if (!PageUptodate(page)) { | 765 | if (!PageUptodate(page)) { |
766 | if (inode->i_private) { | 766 | if (inode->i_private) { |
767 | struct shmem_falloc *shmem_falloc; | 767 | struct shmem_falloc *shmem_falloc; |
768 | spin_lock(&inode->i_lock); | 768 | spin_lock(&inode->i_lock); |
769 | shmem_falloc = inode->i_private; | 769 | shmem_falloc = inode->i_private; |
770 | if (shmem_falloc && | 770 | if (shmem_falloc && |
771 | !shmem_falloc->waitq && | 771 | !shmem_falloc->waitq && |
772 | index >= shmem_falloc->start && | 772 | index >= shmem_falloc->start && |
773 | index < shmem_falloc->next) | 773 | index < shmem_falloc->next) |
774 | shmem_falloc->nr_unswapped++; | 774 | shmem_falloc->nr_unswapped++; |
775 | else | 775 | else |
776 | shmem_falloc = NULL; | 776 | shmem_falloc = NULL; |
777 | spin_unlock(&inode->i_lock); | 777 | spin_unlock(&inode->i_lock); |
778 | if (shmem_falloc) | 778 | if (shmem_falloc) |
779 | goto redirty; | 779 | goto redirty; |
780 | } | 780 | } |
781 | clear_highpage(page); | 781 | clear_highpage(page); |
782 | flush_dcache_page(page); | 782 | flush_dcache_page(page); |
783 | SetPageUptodate(page); | 783 | SetPageUptodate(page); |
784 | } | 784 | } |
785 | 785 | ||
786 | swap = get_swap_page(); | 786 | swap = get_swap_page(); |
787 | if (!swap.val) | 787 | if (!swap.val) |
788 | goto redirty; | 788 | goto redirty; |
789 | 789 | ||
790 | /* | 790 | /* |
791 | * Add inode to shmem_unuse()'s list of swapped-out inodes, | 791 | * Add inode to shmem_unuse()'s list of swapped-out inodes, |
792 | * if it's not already there. Do it now before the page is | 792 | * if it's not already there. Do it now before the page is |
793 | * moved to swap cache, when its pagelock no longer protects | 793 | * moved to swap cache, when its pagelock no longer protects |
794 | * the inode from eviction. But don't unlock the mutex until | 794 | * the inode from eviction. But don't unlock the mutex until |
795 | * we've incremented swapped, because shmem_unuse_inode() will | 795 | * we've incremented swapped, because shmem_unuse_inode() will |
796 | * prune a !swapped inode from the swaplist under this mutex. | 796 | * prune a !swapped inode from the swaplist under this mutex. |
797 | */ | 797 | */ |
798 | mutex_lock(&shmem_swaplist_mutex); | 798 | mutex_lock(&shmem_swaplist_mutex); |
799 | if (list_empty(&info->swaplist)) | 799 | if (list_empty(&info->swaplist)) |
800 | list_add_tail(&info->swaplist, &shmem_swaplist); | 800 | list_add_tail(&info->swaplist, &shmem_swaplist); |
801 | 801 | ||
802 | if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) { | 802 | if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) { |
803 | swap_shmem_alloc(swap); | 803 | swap_shmem_alloc(swap); |
804 | shmem_delete_from_page_cache(page, swp_to_radix_entry(swap)); | 804 | shmem_delete_from_page_cache(page, swp_to_radix_entry(swap)); |
805 | 805 | ||
806 | spin_lock(&info->lock); | 806 | spin_lock(&info->lock); |
807 | info->swapped++; | 807 | info->swapped++; |
808 | shmem_recalc_inode(inode); | 808 | shmem_recalc_inode(inode); |
809 | spin_unlock(&info->lock); | 809 | spin_unlock(&info->lock); |
810 | 810 | ||
811 | mutex_unlock(&shmem_swaplist_mutex); | 811 | mutex_unlock(&shmem_swaplist_mutex); |
812 | BUG_ON(page_mapped(page)); | 812 | BUG_ON(page_mapped(page)); |
813 | swap_writepage(page, wbc); | 813 | swap_writepage(page, wbc); |
814 | return 0; | 814 | return 0; |
815 | } | 815 | } |
816 | 816 | ||
817 | mutex_unlock(&shmem_swaplist_mutex); | 817 | mutex_unlock(&shmem_swaplist_mutex); |
818 | swapcache_free(swap, NULL); | 818 | swapcache_free(swap, NULL); |
819 | redirty: | 819 | redirty: |
820 | set_page_dirty(page); | 820 | set_page_dirty(page); |
821 | if (wbc->for_reclaim) | 821 | if (wbc->for_reclaim) |
822 | return AOP_WRITEPAGE_ACTIVATE; /* Return with page locked */ | 822 | return AOP_WRITEPAGE_ACTIVATE; /* Return with page locked */ |
823 | unlock_page(page); | 823 | unlock_page(page); |
824 | return 0; | 824 | return 0; |
825 | } | 825 | } |
826 | 826 | ||
827 | #ifdef CONFIG_NUMA | 827 | #ifdef CONFIG_NUMA |
828 | #ifdef CONFIG_TMPFS | 828 | #ifdef CONFIG_TMPFS |
829 | static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol) | 829 | static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol) |
830 | { | 830 | { |
831 | char buffer[64]; | 831 | char buffer[64]; |
832 | 832 | ||
833 | if (!mpol || mpol->mode == MPOL_DEFAULT) | 833 | if (!mpol || mpol->mode == MPOL_DEFAULT) |
834 | return; /* show nothing */ | 834 | return; /* show nothing */ |
835 | 835 | ||
836 | mpol_to_str(buffer, sizeof(buffer), mpol); | 836 | mpol_to_str(buffer, sizeof(buffer), mpol); |
837 | 837 | ||
838 | seq_printf(seq, ",mpol=%s", buffer); | 838 | seq_printf(seq, ",mpol=%s", buffer); |
839 | } | 839 | } |
840 | 840 | ||
841 | static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) | 841 | static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) |
842 | { | 842 | { |
843 | struct mempolicy *mpol = NULL; | 843 | struct mempolicy *mpol = NULL; |
844 | if (sbinfo->mpol) { | 844 | if (sbinfo->mpol) { |
845 | spin_lock(&sbinfo->stat_lock); /* prevent replace/use races */ | 845 | spin_lock(&sbinfo->stat_lock); /* prevent replace/use races */ |
846 | mpol = sbinfo->mpol; | 846 | mpol = sbinfo->mpol; |
847 | mpol_get(mpol); | 847 | mpol_get(mpol); |
848 | spin_unlock(&sbinfo->stat_lock); | 848 | spin_unlock(&sbinfo->stat_lock); |
849 | } | 849 | } |
850 | return mpol; | 850 | return mpol; |
851 | } | 851 | } |
852 | #endif /* CONFIG_TMPFS */ | 852 | #endif /* CONFIG_TMPFS */ |
853 | 853 | ||
854 | static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp, | 854 | static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp, |
855 | struct shmem_inode_info *info, pgoff_t index) | 855 | struct shmem_inode_info *info, pgoff_t index) |
856 | { | 856 | { |
857 | struct vm_area_struct pvma; | 857 | struct vm_area_struct pvma; |
858 | struct page *page; | 858 | struct page *page; |
859 | 859 | ||
860 | /* Create a pseudo vma that just contains the policy */ | 860 | /* Create a pseudo vma that just contains the policy */ |
861 | pvma.vm_start = 0; | 861 | pvma.vm_start = 0; |
862 | /* Bias interleave by inode number to distribute better across nodes */ | 862 | /* Bias interleave by inode number to distribute better across nodes */ |
863 | pvma.vm_pgoff = index + info->vfs_inode.i_ino; | 863 | pvma.vm_pgoff = index + info->vfs_inode.i_ino; |
864 | pvma.vm_ops = NULL; | 864 | pvma.vm_ops = NULL; |
865 | pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index); | 865 | pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index); |
866 | 866 | ||
867 | page = swapin_readahead(swap, gfp, &pvma, 0); | 867 | page = swapin_readahead(swap, gfp, &pvma, 0); |
868 | 868 | ||
869 | /* Drop reference taken by mpol_shared_policy_lookup() */ | 869 | /* Drop reference taken by mpol_shared_policy_lookup() */ |
870 | mpol_cond_put(pvma.vm_policy); | 870 | mpol_cond_put(pvma.vm_policy); |
871 | 871 | ||
872 | return page; | 872 | return page; |
873 | } | 873 | } |
874 | 874 | ||
875 | static struct page *shmem_alloc_page(gfp_t gfp, | 875 | static struct page *shmem_alloc_page(gfp_t gfp, |
876 | struct shmem_inode_info *info, pgoff_t index) | 876 | struct shmem_inode_info *info, pgoff_t index) |
877 | { | 877 | { |
878 | struct vm_area_struct pvma; | 878 | struct vm_area_struct pvma; |
879 | struct page *page; | 879 | struct page *page; |
880 | 880 | ||
881 | /* Create a pseudo vma that just contains the policy */ | 881 | /* Create a pseudo vma that just contains the policy */ |
882 | pvma.vm_start = 0; | 882 | pvma.vm_start = 0; |
883 | /* Bias interleave by inode number to distribute better across nodes */ | 883 | /* Bias interleave by inode number to distribute better across nodes */ |
884 | pvma.vm_pgoff = index + info->vfs_inode.i_ino; | 884 | pvma.vm_pgoff = index + info->vfs_inode.i_ino; |
885 | pvma.vm_ops = NULL; | 885 | pvma.vm_ops = NULL; |
886 | pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index); | 886 | pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index); |
887 | 887 | ||
888 | page = alloc_page_vma(gfp, &pvma, 0); | 888 | page = alloc_page_vma(gfp, &pvma, 0); |
889 | 889 | ||
890 | /* Drop reference taken by mpol_shared_policy_lookup() */ | 890 | /* Drop reference taken by mpol_shared_policy_lookup() */ |
891 | mpol_cond_put(pvma.vm_policy); | 891 | mpol_cond_put(pvma.vm_policy); |
892 | 892 | ||
893 | return page; | 893 | return page; |
894 | } | 894 | } |
895 | #else /* !CONFIG_NUMA */ | 895 | #else /* !CONFIG_NUMA */ |
896 | #ifdef CONFIG_TMPFS | 896 | #ifdef CONFIG_TMPFS |
897 | static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol) | 897 | static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol) |
898 | { | 898 | { |
899 | } | 899 | } |
900 | #endif /* CONFIG_TMPFS */ | 900 | #endif /* CONFIG_TMPFS */ |
901 | 901 | ||
902 | static inline struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp, | 902 | static inline struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp, |
903 | struct shmem_inode_info *info, pgoff_t index) | 903 | struct shmem_inode_info *info, pgoff_t index) |
904 | { | 904 | { |
905 | return swapin_readahead(swap, gfp, NULL, 0); | 905 | return swapin_readahead(swap, gfp, NULL, 0); |
906 | } | 906 | } |
907 | 907 | ||
908 | static inline struct page *shmem_alloc_page(gfp_t gfp, | 908 | static inline struct page *shmem_alloc_page(gfp_t gfp, |
909 | struct shmem_inode_info *info, pgoff_t index) | 909 | struct shmem_inode_info *info, pgoff_t index) |
910 | { | 910 | { |
911 | return alloc_page(gfp); | 911 | return alloc_page(gfp); |
912 | } | 912 | } |
913 | #endif /* CONFIG_NUMA */ | 913 | #endif /* CONFIG_NUMA */ |
914 | 914 | ||
915 | #if !defined(CONFIG_NUMA) || !defined(CONFIG_TMPFS) | 915 | #if !defined(CONFIG_NUMA) || !defined(CONFIG_TMPFS) |
916 | static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) | 916 | static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) |
917 | { | 917 | { |
918 | return NULL; | 918 | return NULL; |
919 | } | 919 | } |
920 | #endif | 920 | #endif |
921 | 921 | ||
922 | /* | 922 | /* |
923 | * When a page is moved from swapcache to shmem filecache (either by the | 923 | * When a page is moved from swapcache to shmem filecache (either by the |
924 | * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of | 924 | * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of |
925 | * shmem_unuse_inode()), it may have been read in earlier from swap, in | 925 | * shmem_unuse_inode()), it may have been read in earlier from swap, in |
926 | * ignorance of the mapping it belongs to. If that mapping has special | 926 | * ignorance of the mapping it belongs to. If that mapping has special |
927 | * constraints (like the gma500 GEM driver, which requires RAM below 4GB), | 927 | * constraints (like the gma500 GEM driver, which requires RAM below 4GB), |
928 | * we may need to copy to a suitable page before moving to filecache. | 928 | * we may need to copy to a suitable page before moving to filecache. |
929 | * | 929 | * |
930 | * In a future release, this may well be extended to respect cpuset and | 930 | * In a future release, this may well be extended to respect cpuset and |
931 | * NUMA mempolicy, and applied also to anonymous pages in do_swap_page(); | 931 | * NUMA mempolicy, and applied also to anonymous pages in do_swap_page(); |
932 | * but for now it is a simple matter of zone. | 932 | * but for now it is a simple matter of zone. |
933 | */ | 933 | */ |
934 | static bool shmem_should_replace_page(struct page *page, gfp_t gfp) | 934 | static bool shmem_should_replace_page(struct page *page, gfp_t gfp) |
935 | { | 935 | { |
936 | return page_zonenum(page) > gfp_zone(gfp); | 936 | return page_zonenum(page) > gfp_zone(gfp); |
937 | } | 937 | } |
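For context on the zone check above, a minimal illustration (an assumption added for illustration, not part of this patch) of how a driver such as gma500 ends up with a constrained mapping gfp mask that shmem_should_replace_page() compares against:

/*
 * Illustration only: a GEM driver that needs its shmem-backed pages
 * below 4GB tightens the mapping's gfp mask roughly like this.  With
 * the usual x86_64 layout (ZONE_DMA32 == 1, ZONE_NORMAL == 2), a swap
 * page that was read into ZONE_NORMAL then satisfies
 * page_zonenum(page) > gfp_zone(gfp) and gets copied before it is
 * moved into the shmem page cache.
 */
mapping_set_gfp_mask(inode->i_mapping, GFP_KERNEL | __GFP_DMA32);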
938 | 938 | ||
939 | static int shmem_replace_page(struct page **pagep, gfp_t gfp, | 939 | static int shmem_replace_page(struct page **pagep, gfp_t gfp, |
940 | struct shmem_inode_info *info, pgoff_t index) | 940 | struct shmem_inode_info *info, pgoff_t index) |
941 | { | 941 | { |
942 | struct page *oldpage, *newpage; | 942 | struct page *oldpage, *newpage; |
943 | struct address_space *swap_mapping; | 943 | struct address_space *swap_mapping; |
944 | pgoff_t swap_index; | 944 | pgoff_t swap_index; |
945 | int error; | 945 | int error; |
946 | 946 | ||
947 | oldpage = *pagep; | 947 | oldpage = *pagep; |
948 | swap_index = page_private(oldpage); | 948 | swap_index = page_private(oldpage); |
949 | swap_mapping = page_mapping(oldpage); | 949 | swap_mapping = page_mapping(oldpage); |
950 | 950 | ||
951 | /* | 951 | /* |
952 | * We have arrived here because our zones are constrained, so don't | 952 | * We have arrived here because our zones are constrained, so don't |
953 | * limit chance of success by further cpuset and node constraints. | 953 | * limit chance of success by further cpuset and node constraints. |
954 | */ | 954 | */ |
955 | gfp &= ~GFP_CONSTRAINT_MASK; | 955 | gfp &= ~GFP_CONSTRAINT_MASK; |
956 | newpage = shmem_alloc_page(gfp, info, index); | 956 | newpage = shmem_alloc_page(gfp, info, index); |
957 | if (!newpage) | 957 | if (!newpage) |
958 | return -ENOMEM; | 958 | return -ENOMEM; |
959 | 959 | ||
960 | page_cache_get(newpage); | 960 | page_cache_get(newpage); |
961 | copy_highpage(newpage, oldpage); | 961 | copy_highpage(newpage, oldpage); |
962 | flush_dcache_page(newpage); | 962 | flush_dcache_page(newpage); |
963 | 963 | ||
964 | __set_page_locked(newpage); | 964 | __set_page_locked(newpage); |
965 | SetPageUptodate(newpage); | 965 | SetPageUptodate(newpage); |
966 | SetPageSwapBacked(newpage); | 966 | SetPageSwapBacked(newpage); |
967 | set_page_private(newpage, swap_index); | 967 | set_page_private(newpage, swap_index); |
968 | SetPageSwapCache(newpage); | 968 | SetPageSwapCache(newpage); |
969 | 969 | ||
970 | /* | 970 | /* |
971 | * Our caller will very soon move newpage out of swapcache, but it's | 971 | * Our caller will very soon move newpage out of swapcache, but it's |
972 | * a nice clean interface for us to replace oldpage by newpage there. | 972 | * a nice clean interface for us to replace oldpage by newpage there. |
973 | */ | 973 | */ |
974 | spin_lock_irq(&swap_mapping->tree_lock); | 974 | spin_lock_irq(&swap_mapping->tree_lock); |
975 | error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage, | 975 | error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage, |
976 | newpage); | 976 | newpage); |
977 | if (!error) { | 977 | if (!error) { |
978 | __inc_zone_page_state(newpage, NR_FILE_PAGES); | 978 | __inc_zone_page_state(newpage, NR_FILE_PAGES); |
979 | __dec_zone_page_state(oldpage, NR_FILE_PAGES); | 979 | __dec_zone_page_state(oldpage, NR_FILE_PAGES); |
980 | } | 980 | } |
981 | spin_unlock_irq(&swap_mapping->tree_lock); | 981 | spin_unlock_irq(&swap_mapping->tree_lock); |
982 | 982 | ||
983 | if (unlikely(error)) { | 983 | if (unlikely(error)) { |
984 | /* | 984 | /* |
985 | * Is this possible? I think not, now that our callers check | 985 | * Is this possible? I think not, now that our callers check |
986 | * both PageSwapCache and page_private after getting page lock; | 986 | * both PageSwapCache and page_private after getting page lock; |
987 | * but be defensive. Reverse old to newpage for clear and free. | 987 | * but be defensive. Reverse old to newpage for clear and free. |
988 | */ | 988 | */ |
989 | oldpage = newpage; | 989 | oldpage = newpage; |
990 | } else { | 990 | } else { |
991 | mem_cgroup_replace_page_cache(oldpage, newpage); | 991 | mem_cgroup_replace_page_cache(oldpage, newpage); |
992 | lru_cache_add_anon(newpage); | 992 | lru_cache_add_anon(newpage); |
993 | *pagep = newpage; | 993 | *pagep = newpage; |
994 | } | 994 | } |
995 | 995 | ||
996 | ClearPageSwapCache(oldpage); | 996 | ClearPageSwapCache(oldpage); |
997 | set_page_private(oldpage, 0); | 997 | set_page_private(oldpage, 0); |
998 | 998 | ||
999 | unlock_page(oldpage); | 999 | unlock_page(oldpage); |
1000 | page_cache_release(oldpage); | 1000 | page_cache_release(oldpage); |
1001 | page_cache_release(oldpage); | 1001 | page_cache_release(oldpage); |
1002 | return error; | 1002 | return error; |
1003 | } | 1003 | } |
1004 | 1004 | ||
1005 | /* | 1005 | /* |
1006 | * shmem_getpage_gfp - find page in cache, or get from swap, or allocate | 1006 | * shmem_getpage_gfp - find page in cache, or get from swap, or allocate |
1007 | * | 1007 | * |
1008 | * If we allocate a new one we do not mark it dirty. That's up to the | 1008 | * If we allocate a new one we do not mark it dirty. That's up to the |
1009 | * vm. If we swap it in we mark it dirty since we also free the swap | 1009 | * vm. If we swap it in we mark it dirty since we also free the swap |
1010 | * entry since a page cannot live in both the swap and page cache | 1010 | * entry since a page cannot live in both the swap and page cache |
1011 | */ | 1011 | */ |
1012 | static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, | 1012 | static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, |
1013 | struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type) | 1013 | struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type) |
1014 | { | 1014 | { |
1015 | struct address_space *mapping = inode->i_mapping; | 1015 | struct address_space *mapping = inode->i_mapping; |
1016 | struct shmem_inode_info *info; | 1016 | struct shmem_inode_info *info; |
1017 | struct shmem_sb_info *sbinfo; | 1017 | struct shmem_sb_info *sbinfo; |
1018 | struct page *page; | 1018 | struct page *page; |
1019 | swp_entry_t swap; | 1019 | swp_entry_t swap; |
1020 | int error; | 1020 | int error; |
1021 | int once = 0; | 1021 | int once = 0; |
1022 | int alloced = 0; | 1022 | int alloced = 0; |
1023 | 1023 | ||
1024 | if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT)) | 1024 | if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT)) |
1025 | return -EFBIG; | 1025 | return -EFBIG; |
1026 | repeat: | 1026 | repeat: |
1027 | swap.val = 0; | 1027 | swap.val = 0; |
1028 | page = find_lock_entry(mapping, index); | 1028 | page = find_lock_entry(mapping, index); |
1029 | if (radix_tree_exceptional_entry(page)) { | 1029 | if (radix_tree_exceptional_entry(page)) { |
1030 | swap = radix_to_swp_entry(page); | 1030 | swap = radix_to_swp_entry(page); |
1031 | page = NULL; | 1031 | page = NULL; |
1032 | } | 1032 | } |
1033 | 1033 | ||
1034 | if (sgp != SGP_WRITE && sgp != SGP_FALLOC && | 1034 | if (sgp != SGP_WRITE && sgp != SGP_FALLOC && |
1035 | ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { | 1035 | ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { |
1036 | error = -EINVAL; | 1036 | error = -EINVAL; |
1037 | goto failed; | 1037 | goto failed; |
1038 | } | 1038 | } |
1039 | 1039 | ||
1040 | /* fallocated page? */ | 1040 | /* fallocated page? */ |
1041 | if (page && !PageUptodate(page)) { | 1041 | if (page && !PageUptodate(page)) { |
1042 | if (sgp != SGP_READ) | 1042 | if (sgp != SGP_READ) |
1043 | goto clear; | 1043 | goto clear; |
1044 | unlock_page(page); | 1044 | unlock_page(page); |
1045 | page_cache_release(page); | 1045 | page_cache_release(page); |
1046 | page = NULL; | 1046 | page = NULL; |
1047 | } | 1047 | } |
1048 | if (page || (sgp == SGP_READ && !swap.val)) { | 1048 | if (page || (sgp == SGP_READ && !swap.val)) { |
1049 | *pagep = page; | 1049 | *pagep = page; |
1050 | return 0; | 1050 | return 0; |
1051 | } | 1051 | } |
1052 | 1052 | ||
1053 | /* | 1053 | /* |
1054 | * Fast cache lookup did not find it: | 1054 | * Fast cache lookup did not find it: |
1055 | * bring it back from swap or allocate. | 1055 | * bring it back from swap or allocate. |
1056 | */ | 1056 | */ |
1057 | info = SHMEM_I(inode); | 1057 | info = SHMEM_I(inode); |
1058 | sbinfo = SHMEM_SB(inode->i_sb); | 1058 | sbinfo = SHMEM_SB(inode->i_sb); |
1059 | 1059 | ||
1060 | if (swap.val) { | 1060 | if (swap.val) { |
1061 | /* Look it up and read it in.. */ | 1061 | /* Look it up and read it in.. */ |
1062 | page = lookup_swap_cache(swap); | 1062 | page = lookup_swap_cache(swap); |
1063 | if (!page) { | 1063 | if (!page) { |
1064 | /* here we actually do the io */ | 1064 | /* here we actually do the io */ |
1065 | if (fault_type) | 1065 | if (fault_type) |
1066 | *fault_type |= VM_FAULT_MAJOR; | 1066 | *fault_type |= VM_FAULT_MAJOR; |
1067 | page = shmem_swapin(swap, gfp, info, index); | 1067 | page = shmem_swapin(swap, gfp, info, index); |
1068 | if (!page) { | 1068 | if (!page) { |
1069 | error = -ENOMEM; | 1069 | error = -ENOMEM; |
1070 | goto failed; | 1070 | goto failed; |
1071 | } | 1071 | } |
1072 | } | 1072 | } |
1073 | 1073 | ||
1074 | /* We have to do this with page locked to prevent races */ | 1074 | /* We have to do this with page locked to prevent races */ |
1075 | lock_page(page); | 1075 | lock_page(page); |
1076 | if (!PageSwapCache(page) || page_private(page) != swap.val || | 1076 | if (!PageSwapCache(page) || page_private(page) != swap.val || |
1077 | !shmem_confirm_swap(mapping, index, swap)) { | 1077 | !shmem_confirm_swap(mapping, index, swap)) { |
1078 | error = -EEXIST; /* try again */ | 1078 | error = -EEXIST; /* try again */ |
1079 | goto unlock; | 1079 | goto unlock; |
1080 | } | 1080 | } |
1081 | if (!PageUptodate(page)) { | 1081 | if (!PageUptodate(page)) { |
1082 | error = -EIO; | 1082 | error = -EIO; |
1083 | goto failed; | 1083 | goto failed; |
1084 | } | 1084 | } |
1085 | wait_on_page_writeback(page); | 1085 | wait_on_page_writeback(page); |
1086 | 1086 | ||
1087 | if (shmem_should_replace_page(page, gfp)) { | 1087 | if (shmem_should_replace_page(page, gfp)) { |
1088 | error = shmem_replace_page(&page, gfp, info, index); | 1088 | error = shmem_replace_page(&page, gfp, info, index); |
1089 | if (error) | 1089 | if (error) |
1090 | goto failed; | 1090 | goto failed; |
1091 | } | 1091 | } |
1092 | 1092 | ||
1093 | error = mem_cgroup_cache_charge(page, current->mm, | 1093 | error = mem_cgroup_cache_charge(page, current->mm, |
1094 | gfp & GFP_RECLAIM_MASK); | 1094 | gfp & GFP_RECLAIM_MASK); |
1095 | if (!error) { | 1095 | if (!error) { |
1096 | error = shmem_add_to_page_cache(page, mapping, index, | 1096 | error = shmem_add_to_page_cache(page, mapping, index, |
1097 | gfp, swp_to_radix_entry(swap)); | 1097 | gfp, swp_to_radix_entry(swap)); |
1098 | /* | 1098 | /* |
1099 | * We already confirmed swap under page lock, and make | 1099 | * We already confirmed swap under page lock, and make |
1100 | * no memory allocation here, so usually no possibility | 1100 | * no memory allocation here, so usually no possibility |
1101 | * of error; but free_swap_and_cache() only trylocks a | 1101 | * of error; but free_swap_and_cache() only trylocks a |
1102 | * page, so it is just possible that the entry has been | 1102 | * page, so it is just possible that the entry has been |
1103 | * truncated or holepunched since swap was confirmed. | 1103 | * truncated or holepunched since swap was confirmed. |
1104 | * shmem_undo_range() will have done some of the | 1104 | * shmem_undo_range() will have done some of the |
1105 | * unaccounting, now delete_from_swap_cache() will do | 1105 | * unaccounting, now delete_from_swap_cache() will do |
1106 | * the rest (including mem_cgroup_uncharge_swapcache). | 1106 | * the rest (including mem_cgroup_uncharge_swapcache). |
1107 | * Reset swap.val? No, leave it so "failed" goes back to | 1107 | * Reset swap.val? No, leave it so "failed" goes back to |
1108 | * "repeat": reading a hole and writing should succeed. | 1108 | * "repeat": reading a hole and writing should succeed. |
1109 | */ | 1109 | */ |
1110 | if (error) | 1110 | if (error) |
1111 | delete_from_swap_cache(page); | 1111 | delete_from_swap_cache(page); |
1112 | } | 1112 | } |
1113 | if (error) | 1113 | if (error) |
1114 | goto failed; | 1114 | goto failed; |
1115 | 1115 | ||
1116 | spin_lock(&info->lock); | 1116 | spin_lock(&info->lock); |
1117 | info->swapped--; | 1117 | info->swapped--; |
1118 | shmem_recalc_inode(inode); | 1118 | shmem_recalc_inode(inode); |
1119 | spin_unlock(&info->lock); | 1119 | spin_unlock(&info->lock); |
1120 | 1120 | ||
1121 | delete_from_swap_cache(page); | 1121 | delete_from_swap_cache(page); |
1122 | set_page_dirty(page); | 1122 | set_page_dirty(page); |
1123 | swap_free(swap); | 1123 | swap_free(swap); |
1124 | 1124 | ||
1125 | } else { | 1125 | } else { |
1126 | if (shmem_acct_block(info->flags)) { | 1126 | if (shmem_acct_block(info->flags)) { |
1127 | error = -ENOSPC; | 1127 | error = -ENOSPC; |
1128 | goto failed; | 1128 | goto failed; |
1129 | } | 1129 | } |
1130 | if (sbinfo->max_blocks) { | 1130 | if (sbinfo->max_blocks) { |
1131 | if (percpu_counter_compare(&sbinfo->used_blocks, | 1131 | if (percpu_counter_compare(&sbinfo->used_blocks, |
1132 | sbinfo->max_blocks) >= 0) { | 1132 | sbinfo->max_blocks) >= 0) { |
1133 | error = -ENOSPC; | 1133 | error = -ENOSPC; |
1134 | goto unacct; | 1134 | goto unacct; |
1135 | } | 1135 | } |
1136 | percpu_counter_inc(&sbinfo->used_blocks); | 1136 | percpu_counter_inc(&sbinfo->used_blocks); |
1137 | } | 1137 | } |
1138 | 1138 | ||
1139 | page = shmem_alloc_page(gfp, info, index); | 1139 | page = shmem_alloc_page(gfp, info, index); |
1140 | if (!page) { | 1140 | if (!page) { |
1141 | error = -ENOMEM; | 1141 | error = -ENOMEM; |
1142 | goto decused; | 1142 | goto decused; |
1143 | } | 1143 | } |
1144 | 1144 | ||
1145 | __SetPageSwapBacked(page); | 1145 | __SetPageSwapBacked(page); |
1146 | __set_page_locked(page); | 1146 | __set_page_locked(page); |
1147 | error = mem_cgroup_cache_charge(page, current->mm, | 1147 | error = mem_cgroup_cache_charge(page, current->mm, |
1148 | gfp & GFP_RECLAIM_MASK); | 1148 | gfp & GFP_RECLAIM_MASK); |
1149 | if (error) | 1149 | if (error) |
1150 | goto decused; | 1150 | goto decused; |
1151 | error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK); | 1151 | error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK); |
1152 | if (!error) { | 1152 | if (!error) { |
1153 | error = shmem_add_to_page_cache(page, mapping, index, | 1153 | error = shmem_add_to_page_cache(page, mapping, index, |
1154 | gfp, NULL); | 1154 | gfp, NULL); |
1155 | radix_tree_preload_end(); | 1155 | radix_tree_preload_end(); |
1156 | } | 1156 | } |
1157 | if (error) { | 1157 | if (error) { |
1158 | mem_cgroup_uncharge_cache_page(page); | 1158 | mem_cgroup_uncharge_cache_page(page); |
1159 | goto decused; | 1159 | goto decused; |
1160 | } | 1160 | } |
1161 | lru_cache_add_anon(page); | 1161 | lru_cache_add_anon(page); |
1162 | 1162 | ||
1163 | spin_lock(&info->lock); | 1163 | spin_lock(&info->lock); |
1164 | info->alloced++; | 1164 | info->alloced++; |
1165 | inode->i_blocks += BLOCKS_PER_PAGE; | 1165 | inode->i_blocks += BLOCKS_PER_PAGE; |
1166 | shmem_recalc_inode(inode); | 1166 | shmem_recalc_inode(inode); |
1167 | spin_unlock(&info->lock); | 1167 | spin_unlock(&info->lock); |
1168 | alloced = true; | 1168 | alloced = true; |
1169 | 1169 | ||
1170 | /* | 1170 | /* |
1171 | * Let SGP_FALLOC use the SGP_WRITE optimization on a new page. | 1171 | * Let SGP_FALLOC use the SGP_WRITE optimization on a new page. |
1172 | */ | 1172 | */ |
1173 | if (sgp == SGP_FALLOC) | 1173 | if (sgp == SGP_FALLOC) |
1174 | sgp = SGP_WRITE; | 1174 | sgp = SGP_WRITE; |
1175 | clear: | 1175 | clear: |
1176 | /* | 1176 | /* |
1177 | * Let SGP_WRITE caller clear ends if write does not fill page; | 1177 | * Let SGP_WRITE caller clear ends if write does not fill page; |
1178 | * but SGP_FALLOC on a page fallocated earlier must initialize | 1178 | * but SGP_FALLOC on a page fallocated earlier must initialize |
1179 | * it now, lest undo on failure cancel our earlier guarantee. | 1179 | * it now, lest undo on failure cancel our earlier guarantee. |
1180 | */ | 1180 | */ |
1181 | if (sgp != SGP_WRITE) { | 1181 | if (sgp != SGP_WRITE) { |
1182 | clear_highpage(page); | 1182 | clear_highpage(page); |
1183 | flush_dcache_page(page); | 1183 | flush_dcache_page(page); |
1184 | SetPageUptodate(page); | 1184 | SetPageUptodate(page); |
1185 | } | 1185 | } |
1186 | if (sgp == SGP_DIRTY) | 1186 | if (sgp == SGP_DIRTY) |
1187 | set_page_dirty(page); | 1187 | set_page_dirty(page); |
1188 | } | 1188 | } |
1189 | 1189 | ||
1190 | /* Perhaps the file has been truncated since we checked */ | 1190 | /* Perhaps the file has been truncated since we checked */ |
1191 | if (sgp != SGP_WRITE && sgp != SGP_FALLOC && | 1191 | if (sgp != SGP_WRITE && sgp != SGP_FALLOC && |
1192 | ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { | 1192 | ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) { |
1193 | error = -EINVAL; | 1193 | error = -EINVAL; |
1194 | if (alloced) | 1194 | if (alloced) |
1195 | goto trunc; | 1195 | goto trunc; |
1196 | else | 1196 | else |
1197 | goto failed; | 1197 | goto failed; |
1198 | } | 1198 | } |
1199 | *pagep = page; | 1199 | *pagep = page; |
1200 | return 0; | 1200 | return 0; |
1201 | 1201 | ||
1202 | /* | 1202 | /* |
1203 | * Error recovery. | 1203 | * Error recovery. |
1204 | */ | 1204 | */ |
1205 | trunc: | 1205 | trunc: |
1206 | info = SHMEM_I(inode); | 1206 | info = SHMEM_I(inode); |
1207 | ClearPageDirty(page); | 1207 | ClearPageDirty(page); |
1208 | delete_from_page_cache(page); | 1208 | delete_from_page_cache(page); |
1209 | spin_lock(&info->lock); | 1209 | spin_lock(&info->lock); |
1210 | info->alloced--; | 1210 | info->alloced--; |
1211 | inode->i_blocks -= BLOCKS_PER_PAGE; | 1211 | inode->i_blocks -= BLOCKS_PER_PAGE; |
1212 | spin_unlock(&info->lock); | 1212 | spin_unlock(&info->lock); |
1213 | decused: | 1213 | decused: |
1214 | sbinfo = SHMEM_SB(inode->i_sb); | 1214 | sbinfo = SHMEM_SB(inode->i_sb); |
1215 | if (sbinfo->max_blocks) | 1215 | if (sbinfo->max_blocks) |
1216 | percpu_counter_add(&sbinfo->used_blocks, -1); | 1216 | percpu_counter_add(&sbinfo->used_blocks, -1); |
1217 | unacct: | 1217 | unacct: |
1218 | shmem_unacct_blocks(info->flags, 1); | 1218 | shmem_unacct_blocks(info->flags, 1); |
1219 | failed: | 1219 | failed: |
1220 | if (swap.val && error != -EINVAL && | 1220 | if (swap.val && error != -EINVAL && |
1221 | !shmem_confirm_swap(mapping, index, swap)) | 1221 | !shmem_confirm_swap(mapping, index, swap)) |
1222 | error = -EEXIST; | 1222 | error = -EEXIST; |
1223 | unlock: | 1223 | unlock: |
1224 | if (page) { | 1224 | if (page) { |
1225 | unlock_page(page); | 1225 | unlock_page(page); |
1226 | page_cache_release(page); | 1226 | page_cache_release(page); |
1227 | } | 1227 | } |
1228 | if (error == -ENOSPC && !once++) { | 1228 | if (error == -ENOSPC && !once++) { |
1229 | info = SHMEM_I(inode); | 1229 | info = SHMEM_I(inode); |
1230 | spin_lock(&info->lock); | 1230 | spin_lock(&info->lock); |
1231 | shmem_recalc_inode(inode); | 1231 | shmem_recalc_inode(inode); |
1232 | spin_unlock(&info->lock); | 1232 | spin_unlock(&info->lock); |
1233 | goto repeat; | 1233 | goto repeat; |
1234 | } | 1234 | } |
1235 | if (error == -EEXIST) /* from above or from radix_tree_insert */ | 1235 | if (error == -EEXIST) /* from above or from radix_tree_insert */ |
1236 | goto repeat; | 1236 | goto repeat; |
1237 | return error; | 1237 | return error; |
1238 | } | 1238 | } |
1239 | 1239 | ||
1240 | static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) | 1240 | static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) |
1241 | { | 1241 | { |
1242 | struct inode *inode = file_inode(vma->vm_file); | 1242 | struct inode *inode = file_inode(vma->vm_file); |
1243 | int error; | 1243 | int error; |
1244 | int ret = VM_FAULT_LOCKED; | 1244 | int ret = VM_FAULT_LOCKED; |
1245 | 1245 | ||
1246 | /* | 1246 | /* |
1247 | * Trinity finds that probing a hole which tmpfs is punching can | 1247 | * Trinity finds that probing a hole which tmpfs is punching can |
1248 | * prevent the hole-punch from ever completing: which in turn | 1248 | * prevent the hole-punch from ever completing: which in turn |
1249 | * locks writers out with its hold on i_mutex. So refrain from | 1249 | * locks writers out with its hold on i_mutex. So refrain from |
1250 | * faulting pages into the hole while it's being punched. Although | 1250 | * faulting pages into the hole while it's being punched. Although |
1251 | * shmem_undo_range() does remove the additions, it may be unable to | 1251 | * shmem_undo_range() does remove the additions, it may be unable to |
1252 | * keep up, as each new page needs its own unmap_mapping_range() call, | 1252 | * keep up, as each new page needs its own unmap_mapping_range() call, |
1253 | * and the i_mmap tree grows ever slower to scan if new vmas are added. | 1253 | * and the i_mmap tree grows ever slower to scan if new vmas are added. |
1254 | * | 1254 | * |
1255 | * It does not matter if we sometimes reach this check just before the | 1255 | * It does not matter if we sometimes reach this check just before the |
1256 | * hole-punch begins, so that one fault then races with the punch: | 1256 | * hole-punch begins, so that one fault then races with the punch: |
1257 | * we just need to make racing faults a rare case. | 1257 | * we just need to make racing faults a rare case. |
1258 | * | 1258 | * |
1259 | * The implementation below would be much simpler if we just used a | 1259 | * The implementation below would be much simpler if we just used a |
1260 | * standard mutex or completion: but we cannot take i_mutex in fault, | 1260 | * standard mutex or completion: but we cannot take i_mutex in fault, |
1261 | * and bloating every shmem inode for this unlikely case would be sad. | 1261 | * and bloating every shmem inode for this unlikely case would be sad. |
1262 | */ | 1262 | */ |
1263 | if (unlikely(inode->i_private)) { | 1263 | if (unlikely(inode->i_private)) { |
1264 | struct shmem_falloc *shmem_falloc; | 1264 | struct shmem_falloc *shmem_falloc; |
1265 | 1265 | ||
1266 | spin_lock(&inode->i_lock); | 1266 | spin_lock(&inode->i_lock); |
1267 | shmem_falloc = inode->i_private; | 1267 | shmem_falloc = inode->i_private; |
1268 | if (shmem_falloc && | 1268 | if (shmem_falloc && |
1269 | shmem_falloc->waitq && | 1269 | shmem_falloc->waitq && |
1270 | vmf->pgoff >= shmem_falloc->start && | 1270 | vmf->pgoff >= shmem_falloc->start && |
1271 | vmf->pgoff < shmem_falloc->next) { | 1271 | vmf->pgoff < shmem_falloc->next) { |
1272 | wait_queue_head_t *shmem_falloc_waitq; | 1272 | wait_queue_head_t *shmem_falloc_waitq; |
1273 | DEFINE_WAIT(shmem_fault_wait); | 1273 | DEFINE_WAIT(shmem_fault_wait); |
1274 | 1274 | ||
1275 | ret = VM_FAULT_NOPAGE; | 1275 | ret = VM_FAULT_NOPAGE; |
1276 | if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) && | 1276 | if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) && |
1277 | !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) { | 1277 | !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) { |
1278 | /* It's polite to up mmap_sem if we can */ | 1278 | /* It's polite to up mmap_sem if we can */ |
1279 | up_read(&vma->vm_mm->mmap_sem); | 1279 | up_read(&vma->vm_mm->mmap_sem); |
1280 | ret = VM_FAULT_RETRY; | 1280 | ret = VM_FAULT_RETRY; |
1281 | } | 1281 | } |
1282 | 1282 | ||
1283 | shmem_falloc_waitq = shmem_falloc->waitq; | 1283 | shmem_falloc_waitq = shmem_falloc->waitq; |
1284 | prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait, | 1284 | prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait, |
1285 | TASK_UNINTERRUPTIBLE); | 1285 | TASK_UNINTERRUPTIBLE); |
1286 | spin_unlock(&inode->i_lock); | 1286 | spin_unlock(&inode->i_lock); |
1287 | schedule(); | 1287 | schedule(); |
1288 | 1288 | ||
1289 | /* | 1289 | /* |
1290 | * shmem_falloc_waitq points into the shmem_fallocate() | 1290 | * shmem_falloc_waitq points into the shmem_fallocate() |
1291 | * stack of the hole-punching task: shmem_falloc_waitq | 1291 | * stack of the hole-punching task: shmem_falloc_waitq |
1292 | * is usually invalid by the time we reach here, but | 1292 | * is usually invalid by the time we reach here, but |
1293 | * finish_wait() does not dereference it in that case; | 1293 | * finish_wait() does not dereference it in that case; |
1294 | * though i_lock needed lest racing with wake_up_all(). | 1294 | * though i_lock needed lest racing with wake_up_all(). |
1295 | */ | 1295 | */ |
1296 | spin_lock(&inode->i_lock); | 1296 | spin_lock(&inode->i_lock); |
1297 | finish_wait(shmem_falloc_waitq, &shmem_fault_wait); | 1297 | finish_wait(shmem_falloc_waitq, &shmem_fault_wait); |
1298 | spin_unlock(&inode->i_lock); | 1298 | spin_unlock(&inode->i_lock); |
1299 | return ret; | 1299 | return ret; |
1300 | } | 1300 | } |
1301 | spin_unlock(&inode->i_lock); | 1301 | spin_unlock(&inode->i_lock); |
1302 | } | 1302 | } |
1303 | 1303 | ||
1304 | error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret); | 1304 | error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret); |
1305 | if (error) | 1305 | if (error) |
1306 | return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS); | 1306 | return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS); |
1307 | 1307 | ||
1308 | if (ret & VM_FAULT_MAJOR) { | 1308 | if (ret & VM_FAULT_MAJOR) { |
1309 | count_vm_event(PGMAJFAULT); | 1309 | count_vm_event(PGMAJFAULT); |
1310 | mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); | 1310 | mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); |
1311 | } | 1311 | } |
1312 | return ret; | 1312 | return ret; |
1313 | } | 1313 | } |
1314 | 1314 | ||
1315 | #ifdef CONFIG_NUMA | 1315 | #ifdef CONFIG_NUMA |
1316 | static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol) | 1316 | static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol) |
1317 | { | 1317 | { |
1318 | struct inode *inode = file_inode(vma->vm_file); | 1318 | struct inode *inode = file_inode(vma->vm_file); |
1319 | return mpol_set_shared_policy(&SHMEM_I(inode)->policy, vma, mpol); | 1319 | return mpol_set_shared_policy(&SHMEM_I(inode)->policy, vma, mpol); |
1320 | } | 1320 | } |
1321 | 1321 | ||
1322 | static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, | 1322 | static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, |
1323 | unsigned long addr) | 1323 | unsigned long addr) |
1324 | { | 1324 | { |
1325 | struct inode *inode = file_inode(vma->vm_file); | 1325 | struct inode *inode = file_inode(vma->vm_file); |
1326 | pgoff_t index; | 1326 | pgoff_t index; |
1327 | 1327 | ||
1328 | index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; | 1328 | index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; |
1329 | return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index); | 1329 | return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index); |
1330 | } | 1330 | } |
1331 | #endif | 1331 | #endif |
1332 | 1332 | ||
1333 | int shmem_lock(struct file *file, int lock, struct user_struct *user) | 1333 | int shmem_lock(struct file *file, int lock, struct user_struct *user) |
1334 | { | 1334 | { |
1335 | struct inode *inode = file_inode(file); | 1335 | struct inode *inode = file_inode(file); |
1336 | struct shmem_inode_info *info = SHMEM_I(inode); | 1336 | struct shmem_inode_info *info = SHMEM_I(inode); |
1337 | int retval = -ENOMEM; | 1337 | int retval = -ENOMEM; |
1338 | 1338 | ||
1339 | spin_lock(&info->lock); | 1339 | spin_lock(&info->lock); |
1340 | if (lock && !(info->flags & VM_LOCKED)) { | 1340 | if (lock && !(info->flags & VM_LOCKED)) { |
1341 | if (!user_shm_lock(inode->i_size, user)) | 1341 | if (!user_shm_lock(inode->i_size, user)) |
1342 | goto out_nomem; | 1342 | goto out_nomem; |
1343 | info->flags |= VM_LOCKED; | 1343 | info->flags |= VM_LOCKED; |
1344 | mapping_set_unevictable(file->f_mapping); | 1344 | mapping_set_unevictable(file->f_mapping); |
1345 | } | 1345 | } |
1346 | if (!lock && (info->flags & VM_LOCKED) && user) { | 1346 | if (!lock && (info->flags & VM_LOCKED) && user) { |
1347 | user_shm_unlock(inode->i_size, user); | 1347 | user_shm_unlock(inode->i_size, user); |
1348 | info->flags &= ~VM_LOCKED; | 1348 | info->flags &= ~VM_LOCKED; |
1349 | mapping_clear_unevictable(file->f_mapping); | 1349 | mapping_clear_unevictable(file->f_mapping); |
1350 | } | 1350 | } |
1351 | retval = 0; | 1351 | retval = 0; |
1352 | 1352 | ||
1353 | out_nomem: | 1353 | out_nomem: |
1354 | spin_unlock(&info->lock); | 1354 | spin_unlock(&info->lock); |
1355 | return retval; | 1355 | return retval; |
1356 | } | 1356 | } |
1357 | 1357 | ||
1358 | static int shmem_mmap(struct file *file, struct vm_area_struct *vma) | 1358 | static int shmem_mmap(struct file *file, struct vm_area_struct *vma) |
1359 | { | 1359 | { |
1360 | file_accessed(file); | 1360 | file_accessed(file); |
1361 | vma->vm_ops = &shmem_vm_ops; | 1361 | vma->vm_ops = &shmem_vm_ops; |
1362 | return 0; | 1362 | return 0; |
1363 | } | 1363 | } |
1364 | 1364 | ||
1365 | static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir, | 1365 | static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir, |
1366 | umode_t mode, dev_t dev, unsigned long flags) | 1366 | umode_t mode, dev_t dev, unsigned long flags) |
1367 | { | 1367 | { |
1368 | struct inode *inode; | 1368 | struct inode *inode; |
1369 | struct shmem_inode_info *info; | 1369 | struct shmem_inode_info *info; |
1370 | struct shmem_sb_info *sbinfo = SHMEM_SB(sb); | 1370 | struct shmem_sb_info *sbinfo = SHMEM_SB(sb); |
1371 | 1371 | ||
1372 | if (shmem_reserve_inode(sb)) | 1372 | if (shmem_reserve_inode(sb)) |
1373 | return NULL; | 1373 | return NULL; |
1374 | 1374 | ||
1375 | inode = new_inode(sb); | 1375 | inode = new_inode(sb); |
1376 | if (inode) { | 1376 | if (inode) { |
1377 | inode->i_ino = get_next_ino(); | 1377 | inode->i_ino = get_next_ino(); |
1378 | inode_init_owner(inode, dir, mode); | 1378 | inode_init_owner(inode, dir, mode); |
1379 | inode->i_blocks = 0; | 1379 | inode->i_blocks = 0; |
1380 | inode->i_mapping->backing_dev_info = &shmem_backing_dev_info; | 1380 | inode->i_mapping->backing_dev_info = &shmem_backing_dev_info; |
1381 | inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; | 1381 | inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; |
1382 | inode->i_generation = get_seconds(); | 1382 | inode->i_generation = get_seconds(); |
1383 | info = SHMEM_I(inode); | 1383 | info = SHMEM_I(inode); |
1384 | memset(info, 0, (char *)inode - (char *)info); | 1384 | memset(info, 0, (char *)inode - (char *)info); |
1385 | spin_lock_init(&info->lock); | 1385 | spin_lock_init(&info->lock); |
1386 | info->flags = flags & VM_NORESERVE; | 1386 | info->flags = flags & VM_NORESERVE; |
1387 | INIT_LIST_HEAD(&info->swaplist); | 1387 | INIT_LIST_HEAD(&info->swaplist); |
1388 | simple_xattrs_init(&info->xattrs); | 1388 | simple_xattrs_init(&info->xattrs); |
1389 | cache_no_acl(inode); | 1389 | cache_no_acl(inode); |
1390 | 1390 | ||
1391 | switch (mode & S_IFMT) { | 1391 | switch (mode & S_IFMT) { |
1392 | default: | 1392 | default: |
1393 | inode->i_op = &shmem_special_inode_operations; | 1393 | inode->i_op = &shmem_special_inode_operations; |
1394 | init_special_inode(inode, mode, dev); | 1394 | init_special_inode(inode, mode, dev); |
1395 | break; | 1395 | break; |
1396 | case S_IFREG: | 1396 | case S_IFREG: |
1397 | inode->i_mapping->a_ops = &shmem_aops; | 1397 | inode->i_mapping->a_ops = &shmem_aops; |
1398 | inode->i_op = &shmem_inode_operations; | 1398 | inode->i_op = &shmem_inode_operations; |
1399 | inode->i_fop = &shmem_file_operations; | 1399 | inode->i_fop = &shmem_file_operations; |
1400 | mpol_shared_policy_init(&info->policy, | 1400 | mpol_shared_policy_init(&info->policy, |
1401 | shmem_get_sbmpol(sbinfo)); | 1401 | shmem_get_sbmpol(sbinfo)); |
1402 | break; | 1402 | break; |
1403 | case S_IFDIR: | 1403 | case S_IFDIR: |
1404 | inc_nlink(inode); | 1404 | inc_nlink(inode); |
1405 | /* Some things misbehave if size == 0 on a directory */ | 1405 | /* Some things misbehave if size == 0 on a directory */ |
1406 | inode->i_size = 2 * BOGO_DIRENT_SIZE; | 1406 | inode->i_size = 2 * BOGO_DIRENT_SIZE; |
1407 | inode->i_op = &shmem_dir_inode_operations; | 1407 | inode->i_op = &shmem_dir_inode_operations; |
1408 | inode->i_fop = &simple_dir_operations; | 1408 | inode->i_fop = &simple_dir_operations; |
1409 | break; | 1409 | break; |
1410 | case S_IFLNK: | 1410 | case S_IFLNK: |
1411 | /* | 1411 | /* |
1412 | * Must not load anything in the rbtree, | 1412 | * Must not load anything in the rbtree, |
1413 | * mpol_free_shared_policy will not be called. | 1413 | * mpol_free_shared_policy will not be called. |
1414 | */ | 1414 | */ |
1415 | mpol_shared_policy_init(&info->policy, NULL); | 1415 | mpol_shared_policy_init(&info->policy, NULL); |
1416 | break; | 1416 | break; |
1417 | } | 1417 | } |
1418 | } else | 1418 | } else |
1419 | shmem_free_inode(sb); | 1419 | shmem_free_inode(sb); |
1420 | return inode; | 1420 | return inode; |
1421 | } | 1421 | } |
1422 | 1422 | ||
1423 | bool shmem_mapping(struct address_space *mapping) | 1423 | bool shmem_mapping(struct address_space *mapping) |
1424 | { | 1424 | { |
1425 | return mapping->backing_dev_info == &shmem_backing_dev_info; | 1425 | return mapping->backing_dev_info == &shmem_backing_dev_info; |
1426 | } | 1426 | } |
1427 | 1427 | ||
1428 | #ifdef CONFIG_TMPFS | 1428 | #ifdef CONFIG_TMPFS |
1429 | static const struct inode_operations shmem_symlink_inode_operations; | 1429 | static const struct inode_operations shmem_symlink_inode_operations; |
1430 | static const struct inode_operations shmem_short_symlink_operations; | 1430 | static const struct inode_operations shmem_short_symlink_operations; |
1431 | 1431 | ||
1432 | #ifdef CONFIG_TMPFS_XATTR | 1432 | #ifdef CONFIG_TMPFS_XATTR |
1433 | static int shmem_initxattrs(struct inode *, const struct xattr *, void *); | 1433 | static int shmem_initxattrs(struct inode *, const struct xattr *, void *); |
1434 | #else | 1434 | #else |
1435 | #define shmem_initxattrs NULL | 1435 | #define shmem_initxattrs NULL |
1436 | #endif | 1436 | #endif |
1437 | 1437 | ||
1438 | static int | 1438 | static int |
1439 | shmem_write_begin(struct file *file, struct address_space *mapping, | 1439 | shmem_write_begin(struct file *file, struct address_space *mapping, |
1440 | loff_t pos, unsigned len, unsigned flags, | 1440 | loff_t pos, unsigned len, unsigned flags, |
1441 | struct page **pagep, void **fsdata) | 1441 | struct page **pagep, void **fsdata) |
1442 | { | 1442 | { |
| | 1443 | int ret; |
1443 | struct inode *inode = mapping->host; | 1444 | struct inode *inode = mapping->host; |
1444 | pgoff_t index = pos >> PAGE_CACHE_SHIFT; | 1445 | pgoff_t index = pos >> PAGE_CACHE_SHIFT; |
1445 | return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL); | 1446 | ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL); |
| | 1447 | if (ret == 0 && *pagep) |
| | 1448 | init_page_accessed(*pagep); |
| | 1449 | return ret; |
1446 | } | 1450 | } |
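The shmem_write_begin() change above keeps the shmem_getpage() return value in ret and, when a page was obtained, calls init_page_accessed() on *pagep so the referenced bit of the locked, just-prepared page is set with a plain (non-atomic) bit operation. A minimal sketch of that helper, assuming it matches the definition this patch adds in mm/swap.c:

	/*
	 * Sketch only: acts like a first mark_page_accessed(), but uses the
	 * non-atomic __SetPageReferenced() (assumed to be the setter this
	 * patch introduces in page-flags.h), which is safe while nothing
	 * else can be updating the page flags concurrently.
	 */
	void init_page_accessed(struct page *page)
	{
		if (!PageReferenced(page))
			__SetPageReferenced(page);
	}

With PG_referenced initialised in write_begin, the later atomic mark_page_accessed() in the common buffered-write path can be skipped for these pages, in line with the other write_begin updates in this patch.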
1447 | 1451 | ||
1448 | static int | 1452 | static int |
1449 | shmem_write_end(struct file *file, struct address_space *mapping, | 1453 | shmem_write_end(struct file *file, struct address_space *mapping, |
1450 | loff_t pos, unsigned len, unsigned copied, | 1454 | loff_t pos, unsigned len, unsigned copied, |
1451 | struct page *page, void *fsdata) | 1455 | struct page *page, void *fsdata) |
1452 | { | 1456 | { |
1453 | struct inode *inode = mapping->host; | 1457 | struct inode *inode = mapping->host; |
1454 | 1458 | ||
1455 | if (pos + copied > inode->i_size) | 1459 | if (pos + copied > inode->i_size) |
1456 | i_size_write(inode, pos + copied); | 1460 | i_size_write(inode, pos + copied); |
1457 | 1461 | ||
1458 | if (!PageUptodate(page)) { | 1462 | if (!PageUptodate(page)) { |
1459 | if (copied < PAGE_CACHE_SIZE) { | 1463 | if (copied < PAGE_CACHE_SIZE) { |
1460 | unsigned from = pos & (PAGE_CACHE_SIZE - 1); | 1464 | unsigned from = pos & (PAGE_CACHE_SIZE - 1); |
1461 | zero_user_segments(page, 0, from, | 1465 | zero_user_segments(page, 0, from, |
1462 | from + copied, PAGE_CACHE_SIZE); | 1466 | from + copied, PAGE_CACHE_SIZE); |
1463 | } | 1467 | } |
1464 | SetPageUptodate(page); | 1468 | SetPageUptodate(page); |
1465 | } | 1469 | } |
1466 | set_page_dirty(page); | 1470 | set_page_dirty(page); |
1467 | unlock_page(page); | 1471 | unlock_page(page); |
1468 | page_cache_release(page); | 1472 | page_cache_release(page); |
1469 | 1473 | ||
1470 | return copied; | 1474 | return copied; |
1471 | } | 1475 | } |
1472 | 1476 | ||
1473 | static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_t *desc, read_actor_t actor) | 1477 | static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_t *desc, read_actor_t actor) |
1474 | { | 1478 | { |
1475 | struct inode *inode = file_inode(filp); | 1479 | struct inode *inode = file_inode(filp); |
1476 | struct address_space *mapping = inode->i_mapping; | 1480 | struct address_space *mapping = inode->i_mapping; |
1477 | pgoff_t index; | 1481 | pgoff_t index; |
1478 | unsigned long offset; | 1482 | unsigned long offset; |
1479 | enum sgp_type sgp = SGP_READ; | 1483 | enum sgp_type sgp = SGP_READ; |
1480 | 1484 | ||
1481 | /* | 1485 | /* |
1482 | * Might this read be for a stacking filesystem? Then when reading | 1486 | * Might this read be for a stacking filesystem? Then when reading |
1483 | * holes of a sparse file, we actually need to allocate those pages, | 1487 | * holes of a sparse file, we actually need to allocate those pages, |
1484 | * and even mark them dirty, so it cannot exceed the max_blocks limit. | 1488 | * and even mark them dirty, so it cannot exceed the max_blocks limit. |
1485 | */ | 1489 | */ |
1486 | if (segment_eq(get_fs(), KERNEL_DS)) | 1490 | if (segment_eq(get_fs(), KERNEL_DS)) |
1487 | sgp = SGP_DIRTY; | 1491 | sgp = SGP_DIRTY; |
1488 | 1492 | ||
1489 | index = *ppos >> PAGE_CACHE_SHIFT; | 1493 | index = *ppos >> PAGE_CACHE_SHIFT; |
1490 | offset = *ppos & ~PAGE_CACHE_MASK; | 1494 | offset = *ppos & ~PAGE_CACHE_MASK; |
1491 | 1495 | ||
1492 | for (;;) { | 1496 | for (;;) { |
1493 | struct page *page = NULL; | 1497 | struct page *page = NULL; |
1494 | pgoff_t end_index; | 1498 | pgoff_t end_index; |
1495 | unsigned long nr, ret; | 1499 | unsigned long nr, ret; |
1496 | loff_t i_size = i_size_read(inode); | 1500 | loff_t i_size = i_size_read(inode); |
1497 | 1501 | ||
1498 | end_index = i_size >> PAGE_CACHE_SHIFT; | 1502 | end_index = i_size >> PAGE_CACHE_SHIFT; |
1499 | if (index > end_index) | 1503 | if (index > end_index) |
1500 | break; | 1504 | break; |
1501 | if (index == end_index) { | 1505 | if (index == end_index) { |
1502 | nr = i_size & ~PAGE_CACHE_MASK; | 1506 | nr = i_size & ~PAGE_CACHE_MASK; |
1503 | if (nr <= offset) | 1507 | if (nr <= offset) |
1504 | break; | 1508 | break; |
1505 | } | 1509 | } |
1506 | 1510 | ||
1507 | desc->error = shmem_getpage(inode, index, &page, sgp, NULL); | 1511 | desc->error = shmem_getpage(inode, index, &page, sgp, NULL); |
1508 | if (desc->error) { | 1512 | if (desc->error) { |
1509 | if (desc->error == -EINVAL) | 1513 | if (desc->error == -EINVAL) |
1510 | desc->error = 0; | 1514 | desc->error = 0; |
1511 | break; | 1515 | break; |
1512 | } | 1516 | } |
1513 | if (page) | 1517 | if (page) |
1514 | unlock_page(page); | 1518 | unlock_page(page); |
1515 | 1519 | ||
1516 | /* | 1520 | /* |
1517 | * We must evaluate after, since reads (unlike writes) | 1521 | * We must evaluate after, since reads (unlike writes) |
1518 | * are called without i_mutex protection against truncate | 1522 | * are called without i_mutex protection against truncate |
1519 | */ | 1523 | */ |
1520 | nr = PAGE_CACHE_SIZE; | 1524 | nr = PAGE_CACHE_SIZE; |
1521 | i_size = i_size_read(inode); | 1525 | i_size = i_size_read(inode); |
1522 | end_index = i_size >> PAGE_CACHE_SHIFT; | 1526 | end_index = i_size >> PAGE_CACHE_SHIFT; |
1523 | if (index == end_index) { | 1527 | if (index == end_index) { |
1524 | nr = i_size & ~PAGE_CACHE_MASK; | 1528 | nr = i_size & ~PAGE_CACHE_MASK; |
1525 | if (nr <= offset) { | 1529 | if (nr <= offset) { |
1526 | if (page) | 1530 | if (page) |
1527 | page_cache_release(page); | 1531 | page_cache_release(page); |
1528 | break; | 1532 | break; |
1529 | } | 1533 | } |
1530 | } | 1534 | } |
1531 | nr -= offset; | 1535 | nr -= offset; |
1532 | 1536 | ||
1533 | if (page) { | 1537 | if (page) { |
1534 | /* | 1538 | /* |
1535 | * If users can be writing to this page using arbitrary | 1539 | * If users can be writing to this page using arbitrary |
1536 | * virtual addresses, take care about potential aliasing | 1540 | * virtual addresses, take care about potential aliasing |
1537 | * before reading the page on the kernel side. | 1541 | * before reading the page on the kernel side. |
1538 | */ | 1542 | */ |
1539 | if (mapping_writably_mapped(mapping)) | 1543 | if (mapping_writably_mapped(mapping)) |
1540 | flush_dcache_page(page); | 1544 | flush_dcache_page(page); |
1541 | /* | 1545 | /* |
1542 | * Mark the page accessed if we read the beginning. | 1546 | * Mark the page accessed if we read the beginning. |
1543 | */ | 1547 | */ |
1544 | if (!offset) | 1548 | if (!offset) |
1545 | mark_page_accessed(page); | 1549 | mark_page_accessed(page); |
1546 | } else { | 1550 | } else { |
1547 | page = ZERO_PAGE(0); | 1551 | page = ZERO_PAGE(0); |
1548 | page_cache_get(page); | 1552 | page_cache_get(page); |
1549 | } | 1553 | } |
1550 | 1554 | ||
1551 | /* | 1555 | /* |
1552 | * Ok, we have the page, and it's up-to-date, so | 1556 | * Ok, we have the page, and it's up-to-date, so |
1553 | * now we can copy it to user space... | 1557 | * now we can copy it to user space... |
1554 | * | 1558 | * |
1555 | * The actor routine returns how many bytes were actually used.. | 1559 | * The actor routine returns how many bytes were actually used.. |
1556 | * NOTE! This may not be the same as how much of a user buffer | 1560 | * NOTE! This may not be the same as how much of a user buffer |
1557 | * we filled up (we may be padding etc), so we can only update | 1561 | * we filled up (we may be padding etc), so we can only update |
1558 | * "pos" here (the actor routine has to update the user buffer | 1562 | * "pos" here (the actor routine has to update the user buffer |
1559 | * pointers and the remaining count). | 1563 | * pointers and the remaining count). |
1560 | */ | 1564 | */ |
1561 | ret = actor(desc, page, offset, nr); | 1565 | ret = actor(desc, page, offset, nr); |
1562 | offset += ret; | 1566 | offset += ret; |
1563 | index += offset >> PAGE_CACHE_SHIFT; | 1567 | index += offset >> PAGE_CACHE_SHIFT; |
1564 | offset &= ~PAGE_CACHE_MASK; | 1568 | offset &= ~PAGE_CACHE_MASK; |
1565 | 1569 | ||
1566 | page_cache_release(page); | 1570 | page_cache_release(page); |
1567 | if (ret != nr || !desc->count) | 1571 | if (ret != nr || !desc->count) |
1568 | break; | 1572 | break; |
1569 | 1573 | ||
1570 | cond_resched(); | 1574 | cond_resched(); |
1571 | } | 1575 | } |
1572 | 1576 | ||
1573 | *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset; | 1577 | *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset; |
1574 | file_accessed(filp); | 1578 | file_accessed(filp); |
1575 | } | 1579 | } |
1576 | 1580 | ||
1577 | static ssize_t shmem_file_aio_read(struct kiocb *iocb, | 1581 | static ssize_t shmem_file_aio_read(struct kiocb *iocb, |
1578 | const struct iovec *iov, unsigned long nr_segs, loff_t pos) | 1582 | const struct iovec *iov, unsigned long nr_segs, loff_t pos) |
1579 | { | 1583 | { |
1580 | struct file *filp = iocb->ki_filp; | 1584 | struct file *filp = iocb->ki_filp; |
1581 | ssize_t retval; | 1585 | ssize_t retval; |
1582 | unsigned long seg; | 1586 | unsigned long seg; |
1583 | size_t count; | 1587 | size_t count; |
1584 | loff_t *ppos = &iocb->ki_pos; | 1588 | loff_t *ppos = &iocb->ki_pos; |
1585 | 1589 | ||
1586 | retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE); | 1590 | retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE); |
1587 | if (retval) | 1591 | if (retval) |
1588 | return retval; | 1592 | return retval; |
1589 | 1593 | ||
1590 | for (seg = 0; seg < nr_segs; seg++) { | 1594 | for (seg = 0; seg < nr_segs; seg++) { |
1591 | read_descriptor_t desc; | 1595 | read_descriptor_t desc; |
1592 | 1596 | ||
1593 | desc.written = 0; | 1597 | desc.written = 0; |
1594 | desc.arg.buf = iov[seg].iov_base; | 1598 | desc.arg.buf = iov[seg].iov_base; |
1595 | desc.count = iov[seg].iov_len; | 1599 | desc.count = iov[seg].iov_len; |
1596 | if (desc.count == 0) | 1600 | if (desc.count == 0) |
1597 | continue; | 1601 | continue; |
1598 | desc.error = 0; | 1602 | desc.error = 0; |
1599 | do_shmem_file_read(filp, ppos, &desc, file_read_actor); | 1603 | do_shmem_file_read(filp, ppos, &desc, file_read_actor); |
1600 | retval += desc.written; | 1604 | retval += desc.written; |
1601 | if (desc.error) { | 1605 | if (desc.error) { |
1602 | retval = retval ?: desc.error; | 1606 | retval = retval ?: desc.error; |
1603 | break; | 1607 | break; |
1604 | } | 1608 | } |
1605 | if (desc.count > 0) | 1609 | if (desc.count > 0) |
1606 | break; | 1610 | break; |
1607 | } | 1611 | } |
1608 | return retval; | 1612 | return retval; |
1609 | } | 1613 | } |
1610 | 1614 | ||
1611 | static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos, | 1615 | static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos, |
1612 | struct pipe_inode_info *pipe, size_t len, | 1616 | struct pipe_inode_info *pipe, size_t len, |
1613 | unsigned int flags) | 1617 | unsigned int flags) |
1614 | { | 1618 | { |
1615 | struct address_space *mapping = in->f_mapping; | 1619 | struct address_space *mapping = in->f_mapping; |
1616 | struct inode *inode = mapping->host; | 1620 | struct inode *inode = mapping->host; |
1617 | unsigned int loff, nr_pages, req_pages; | 1621 | unsigned int loff, nr_pages, req_pages; |
1618 | struct page *pages[PIPE_DEF_BUFFERS]; | 1622 | struct page *pages[PIPE_DEF_BUFFERS]; |
1619 | struct partial_page partial[PIPE_DEF_BUFFERS]; | 1623 | struct partial_page partial[PIPE_DEF_BUFFERS]; |
1620 | struct page *page; | 1624 | struct page *page; |
1621 | pgoff_t index, end_index; | 1625 | pgoff_t index, end_index; |
1622 | loff_t isize, left; | 1626 | loff_t isize, left; |
1623 | int error, page_nr; | 1627 | int error, page_nr; |
1624 | struct splice_pipe_desc spd = { | 1628 | struct splice_pipe_desc spd = { |
1625 | .pages = pages, | 1629 | .pages = pages, |
1626 | .partial = partial, | 1630 | .partial = partial, |
1627 | .nr_pages_max = PIPE_DEF_BUFFERS, | 1631 | .nr_pages_max = PIPE_DEF_BUFFERS, |
1628 | .flags = flags, | 1632 | .flags = flags, |
1629 | .ops = &page_cache_pipe_buf_ops, | 1633 | .ops = &page_cache_pipe_buf_ops, |
1630 | .spd_release = spd_release_page, | 1634 | .spd_release = spd_release_page, |
1631 | }; | 1635 | }; |
1632 | 1636 | ||
1633 | isize = i_size_read(inode); | 1637 | isize = i_size_read(inode); |
1634 | if (unlikely(*ppos >= isize)) | 1638 | if (unlikely(*ppos >= isize)) |
1635 | return 0; | 1639 | return 0; |
1636 | 1640 | ||
1637 | left = isize - *ppos; | 1641 | left = isize - *ppos; |
1638 | if (unlikely(left < len)) | 1642 | if (unlikely(left < len)) |
1639 | len = left; | 1643 | len = left; |
1640 | 1644 | ||
1641 | if (splice_grow_spd(pipe, &spd)) | 1645 | if (splice_grow_spd(pipe, &spd)) |
1642 | return -ENOMEM; | 1646 | return -ENOMEM; |
1643 | 1647 | ||
1644 | index = *ppos >> PAGE_CACHE_SHIFT; | 1648 | index = *ppos >> PAGE_CACHE_SHIFT; |
1645 | loff = *ppos & ~PAGE_CACHE_MASK; | 1649 | loff = *ppos & ~PAGE_CACHE_MASK; |
1646 | req_pages = (len + loff + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; | 1650 | req_pages = (len + loff + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; |
1647 | nr_pages = min(req_pages, pipe->buffers); | 1651 | nr_pages = min(req_pages, pipe->buffers); |
1648 | 1652 | ||
1649 | spd.nr_pages = find_get_pages_contig(mapping, index, | 1653 | spd.nr_pages = find_get_pages_contig(mapping, index, |
1650 | nr_pages, spd.pages); | 1654 | nr_pages, spd.pages); |
1651 | index += spd.nr_pages; | 1655 | index += spd.nr_pages; |
1652 | error = 0; | 1656 | error = 0; |
1653 | 1657 | ||
1654 | while (spd.nr_pages < nr_pages) { | 1658 | while (spd.nr_pages < nr_pages) { |
1655 | error = shmem_getpage(inode, index, &page, SGP_CACHE, NULL); | 1659 | error = shmem_getpage(inode, index, &page, SGP_CACHE, NULL); |
1656 | if (error) | 1660 | if (error) |
1657 | break; | 1661 | break; |
1658 | unlock_page(page); | 1662 | unlock_page(page); |
1659 | spd.pages[spd.nr_pages++] = page; | 1663 | spd.pages[spd.nr_pages++] = page; |
1660 | index++; | 1664 | index++; |
1661 | } | 1665 | } |
1662 | 1666 | ||
1663 | index = *ppos >> PAGE_CACHE_SHIFT; | 1667 | index = *ppos >> PAGE_CACHE_SHIFT; |
1664 | nr_pages = spd.nr_pages; | 1668 | nr_pages = spd.nr_pages; |
1665 | spd.nr_pages = 0; | 1669 | spd.nr_pages = 0; |
1666 | 1670 | ||
1667 | for (page_nr = 0; page_nr < nr_pages; page_nr++) { | 1671 | for (page_nr = 0; page_nr < nr_pages; page_nr++) { |
1668 | unsigned int this_len; | 1672 | unsigned int this_len; |
1669 | 1673 | ||
1670 | if (!len) | 1674 | if (!len) |
1671 | break; | 1675 | break; |
1672 | 1676 | ||
1673 | this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff); | 1677 | this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff); |
1674 | page = spd.pages[page_nr]; | 1678 | page = spd.pages[page_nr]; |
1675 | 1679 | ||
1676 | if (!PageUptodate(page) || page->mapping != mapping) { | 1680 | if (!PageUptodate(page) || page->mapping != mapping) { |
1677 | error = shmem_getpage(inode, index, &page, | 1681 | error = shmem_getpage(inode, index, &page, |
1678 | SGP_CACHE, NULL); | 1682 | SGP_CACHE, NULL); |
1679 | if (error) | 1683 | if (error) |
1680 | break; | 1684 | break; |
1681 | unlock_page(page); | 1685 | unlock_page(page); |
1682 | page_cache_release(spd.pages[page_nr]); | 1686 | page_cache_release(spd.pages[page_nr]); |
1683 | spd.pages[page_nr] = page; | 1687 | spd.pages[page_nr] = page; |
1684 | } | 1688 | } |
1685 | 1689 | ||
1686 | isize = i_size_read(inode); | 1690 | isize = i_size_read(inode); |
1687 | end_index = (isize - 1) >> PAGE_CACHE_SHIFT; | 1691 | end_index = (isize - 1) >> PAGE_CACHE_SHIFT; |
1688 | if (unlikely(!isize || index > end_index)) | 1692 | if (unlikely(!isize || index > end_index)) |
1689 | break; | 1693 | break; |
1690 | 1694 | ||
1691 | if (end_index == index) { | 1695 | if (end_index == index) { |
1692 | unsigned int plen; | 1696 | unsigned int plen; |
1693 | 1697 | ||
1694 | plen = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; | 1698 | plen = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; |
1695 | if (plen <= loff) | 1699 | if (plen <= loff) |
1696 | break; | 1700 | break; |
1697 | 1701 | ||
1698 | this_len = min(this_len, plen - loff); | 1702 | this_len = min(this_len, plen - loff); |
1699 | len = this_len; | 1703 | len = this_len; |
1700 | } | 1704 | } |
1701 | 1705 | ||
1702 | spd.partial[page_nr].offset = loff; | 1706 | spd.partial[page_nr].offset = loff; |
1703 | spd.partial[page_nr].len = this_len; | 1707 | spd.partial[page_nr].len = this_len; |
1704 | len -= this_len; | 1708 | len -= this_len; |
1705 | loff = 0; | 1709 | loff = 0; |
1706 | spd.nr_pages++; | 1710 | spd.nr_pages++; |
1707 | index++; | 1711 | index++; |
1708 | } | 1712 | } |
1709 | 1713 | ||
1710 | while (page_nr < nr_pages) | 1714 | while (page_nr < nr_pages) |
1711 | page_cache_release(spd.pages[page_nr++]); | 1715 | page_cache_release(spd.pages[page_nr++]); |
1712 | 1716 | ||
1713 | if (spd.nr_pages) | 1717 | if (spd.nr_pages) |
1714 | error = splice_to_pipe(pipe, &spd); | 1718 | error = splice_to_pipe(pipe, &spd); |
1715 | 1719 | ||
1716 | splice_shrink_spd(&spd); | 1720 | splice_shrink_spd(&spd); |
1717 | 1721 | ||
1718 | if (error > 0) { | 1722 | if (error > 0) { |
1719 | *ppos += error; | 1723 | *ppos += error; |
1720 | file_accessed(in); | 1724 | file_accessed(in); |
1721 | } | 1725 | } |
1722 | return error; | 1726 | return error; |
1723 | } | 1727 | } |
1724 | 1728 | ||
1725 | /* | 1729 | /* |
1726 | * llseek SEEK_DATA or SEEK_HOLE through the radix_tree. | 1730 | * llseek SEEK_DATA or SEEK_HOLE through the radix_tree. |
1727 | */ | 1731 | */ |
1728 | static pgoff_t shmem_seek_hole_data(struct address_space *mapping, | 1732 | static pgoff_t shmem_seek_hole_data(struct address_space *mapping, |
1729 | pgoff_t index, pgoff_t end, int whence) | 1733 | pgoff_t index, pgoff_t end, int whence) |
1730 | { | 1734 | { |
1731 | struct page *page; | 1735 | struct page *page; |
1732 | struct pagevec pvec; | 1736 | struct pagevec pvec; |
1733 | pgoff_t indices[PAGEVEC_SIZE]; | 1737 | pgoff_t indices[PAGEVEC_SIZE]; |
1734 | bool done = false; | 1738 | bool done = false; |
1735 | int i; | 1739 | int i; |
1736 | 1740 | ||
1737 | pagevec_init(&pvec, 0); | 1741 | pagevec_init(&pvec, 0); |
1738 | pvec.nr = 1; /* start small: we may be there already */ | 1742 | pvec.nr = 1; /* start small: we may be there already */ |
1739 | while (!done) { | 1743 | while (!done) { |
1740 | pvec.nr = find_get_entries(mapping, index, | 1744 | pvec.nr = find_get_entries(mapping, index, |
1741 | pvec.nr, pvec.pages, indices); | 1745 | pvec.nr, pvec.pages, indices); |
1742 | if (!pvec.nr) { | 1746 | if (!pvec.nr) { |
1743 | if (whence == SEEK_DATA) | 1747 | if (whence == SEEK_DATA) |
1744 | index = end; | 1748 | index = end; |
1745 | break; | 1749 | break; |
1746 | } | 1750 | } |
1747 | for (i = 0; i < pvec.nr; i++, index++) { | 1751 | for (i = 0; i < pvec.nr; i++, index++) { |
1748 | if (index < indices[i]) { | 1752 | if (index < indices[i]) { |
1749 | if (whence == SEEK_HOLE) { | 1753 | if (whence == SEEK_HOLE) { |
1750 | done = true; | 1754 | done = true; |
1751 | break; | 1755 | break; |
1752 | } | 1756 | } |
1753 | index = indices[i]; | 1757 | index = indices[i]; |
1754 | } | 1758 | } |
1755 | page = pvec.pages[i]; | 1759 | page = pvec.pages[i]; |
1756 | if (page && !radix_tree_exceptional_entry(page)) { | 1760 | if (page && !radix_tree_exceptional_entry(page)) { |
1757 | if (!PageUptodate(page)) | 1761 | if (!PageUptodate(page)) |
1758 | page = NULL; | 1762 | page = NULL; |
1759 | } | 1763 | } |
1760 | if (index >= end || | 1764 | if (index >= end || |
1761 | (page && whence == SEEK_DATA) || | 1765 | (page && whence == SEEK_DATA) || |
1762 | (!page && whence == SEEK_HOLE)) { | 1766 | (!page && whence == SEEK_HOLE)) { |
1763 | done = true; | 1767 | done = true; |
1764 | break; | 1768 | break; |
1765 | } | 1769 | } |
1766 | } | 1770 | } |
1767 | pagevec_remove_exceptionals(&pvec); | 1771 | pagevec_remove_exceptionals(&pvec); |
1768 | pagevec_release(&pvec); | 1772 | pagevec_release(&pvec); |
1769 | pvec.nr = PAGEVEC_SIZE; | 1773 | pvec.nr = PAGEVEC_SIZE; |
1770 | cond_resched(); | 1774 | cond_resched(); |
1771 | } | 1775 | } |
1772 | return index; | 1776 | return index; |
1773 | } | 1777 | } |
1774 | 1778 | ||
1775 | static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence) | 1779 | static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence) |
1776 | { | 1780 | { |
1777 | struct address_space *mapping = file->f_mapping; | 1781 | struct address_space *mapping = file->f_mapping; |
1778 | struct inode *inode = mapping->host; | 1782 | struct inode *inode = mapping->host; |
1779 | pgoff_t start, end; | 1783 | pgoff_t start, end; |
1780 | loff_t new_offset; | 1784 | loff_t new_offset; |
1781 | 1785 | ||
1782 | if (whence != SEEK_DATA && whence != SEEK_HOLE) | 1786 | if (whence != SEEK_DATA && whence != SEEK_HOLE) |
1783 | return generic_file_llseek_size(file, offset, whence, | 1787 | return generic_file_llseek_size(file, offset, whence, |
1784 | MAX_LFS_FILESIZE, i_size_read(inode)); | 1788 | MAX_LFS_FILESIZE, i_size_read(inode)); |
1785 | mutex_lock(&inode->i_mutex); | 1789 | mutex_lock(&inode->i_mutex); |
1786 | /* We're holding i_mutex so we can access i_size directly */ | 1790 | /* We're holding i_mutex so we can access i_size directly */ |
1787 | 1791 | ||
1788 | if (offset < 0) | 1792 | if (offset < 0) |
1789 | offset = -EINVAL; | 1793 | offset = -EINVAL; |
1790 | else if (offset >= inode->i_size) | 1794 | else if (offset >= inode->i_size) |
1791 | offset = -ENXIO; | 1795 | offset = -ENXIO; |
1792 | else { | 1796 | else { |
1793 | start = offset >> PAGE_CACHE_SHIFT; | 1797 | start = offset >> PAGE_CACHE_SHIFT; |
1794 | end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; | 1798 | end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; |
1795 | new_offset = shmem_seek_hole_data(mapping, start, end, whence); | 1799 | new_offset = shmem_seek_hole_data(mapping, start, end, whence); |
1796 | new_offset <<= PAGE_CACHE_SHIFT; | 1800 | new_offset <<= PAGE_CACHE_SHIFT; |
1797 | if (new_offset > offset) { | 1801 | if (new_offset > offset) { |
1798 | if (new_offset < inode->i_size) | 1802 | if (new_offset < inode->i_size) |
1799 | offset = new_offset; | 1803 | offset = new_offset; |
1800 | else if (whence == SEEK_DATA) | 1804 | else if (whence == SEEK_DATA) |
1801 | offset = -ENXIO; | 1805 | offset = -ENXIO; |
1802 | else | 1806 | else |
1803 | offset = inode->i_size; | 1807 | offset = inode->i_size; |
1804 | } | 1808 | } |
1805 | } | 1809 | } |
1806 | 1810 | ||
1807 | if (offset >= 0) | 1811 | if (offset >= 0) |
1808 | offset = vfs_setpos(file, offset, MAX_LFS_FILESIZE); | 1812 | offset = vfs_setpos(file, offset, MAX_LFS_FILESIZE); |
1809 | mutex_unlock(&inode->i_mutex); | 1813 | mutex_unlock(&inode->i_mutex); |
1810 | return offset; | 1814 | return offset; |
1811 | } | 1815 | } |
1812 | 1816 | ||
1813 | static long shmem_fallocate(struct file *file, int mode, loff_t offset, | 1817 | static long shmem_fallocate(struct file *file, int mode, loff_t offset, |
1814 | loff_t len) | 1818 | loff_t len) |
1815 | { | 1819 | { |
1816 | struct inode *inode = file_inode(file); | 1820 | struct inode *inode = file_inode(file); |
1817 | struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); | 1821 | struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); |
1818 | struct shmem_falloc shmem_falloc; | 1822 | struct shmem_falloc shmem_falloc; |
1819 | pgoff_t start, index, end; | 1823 | pgoff_t start, index, end; |
1820 | int error; | 1824 | int error; |
1821 | 1825 | ||
1822 | mutex_lock(&inode->i_mutex); | 1826 | mutex_lock(&inode->i_mutex); |
1823 | 1827 | ||
1824 | if (mode & FALLOC_FL_PUNCH_HOLE) { | 1828 | if (mode & FALLOC_FL_PUNCH_HOLE) { |
1825 | struct address_space *mapping = file->f_mapping; | 1829 | struct address_space *mapping = file->f_mapping; |
1826 | loff_t unmap_start = round_up(offset, PAGE_SIZE); | 1830 | loff_t unmap_start = round_up(offset, PAGE_SIZE); |
1827 | loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1; | 1831 | loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1; |
1828 | DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq); | 1832 | DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq); |
1829 | 1833 | ||
1830 | shmem_falloc.waitq = &shmem_falloc_waitq; | 1834 | shmem_falloc.waitq = &shmem_falloc_waitq; |
1831 | shmem_falloc.start = unmap_start >> PAGE_SHIFT; | 1835 | shmem_falloc.start = unmap_start >> PAGE_SHIFT; |
1832 | shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; | 1836 | shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; |
1833 | spin_lock(&inode->i_lock); | 1837 | spin_lock(&inode->i_lock); |
1834 | inode->i_private = &shmem_falloc; | 1838 | inode->i_private = &shmem_falloc; |
1835 | spin_unlock(&inode->i_lock); | 1839 | spin_unlock(&inode->i_lock); |
1836 | 1840 | ||
1837 | if ((u64)unmap_end > (u64)unmap_start) | 1841 | if ((u64)unmap_end > (u64)unmap_start) |
1838 | unmap_mapping_range(mapping, unmap_start, | 1842 | unmap_mapping_range(mapping, unmap_start, |
1839 | 1 + unmap_end - unmap_start, 0); | 1843 | 1 + unmap_end - unmap_start, 0); |
1840 | shmem_truncate_range(inode, offset, offset + len - 1); | 1844 | shmem_truncate_range(inode, offset, offset + len - 1); |
1841 | /* No need to unmap again: hole-punching leaves COWed pages */ | 1845 | /* No need to unmap again: hole-punching leaves COWed pages */ |
1842 | 1846 | ||
1843 | spin_lock(&inode->i_lock); | 1847 | spin_lock(&inode->i_lock); |
1844 | inode->i_private = NULL; | 1848 | inode->i_private = NULL; |
1845 | wake_up_all(&shmem_falloc_waitq); | 1849 | wake_up_all(&shmem_falloc_waitq); |
1846 | spin_unlock(&inode->i_lock); | 1850 | spin_unlock(&inode->i_lock); |
1847 | error = 0; | 1851 | error = 0; |
1848 | goto out; | 1852 | goto out; |
1849 | } | 1853 | } |
1850 | 1854 | ||
1851 | /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */ | 1855 | /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */ |
1852 | error = inode_newsize_ok(inode, offset + len); | 1856 | error = inode_newsize_ok(inode, offset + len); |
1853 | if (error) | 1857 | if (error) |
1854 | goto out; | 1858 | goto out; |
1855 | 1859 | ||
1856 | start = offset >> PAGE_CACHE_SHIFT; | 1860 | start = offset >> PAGE_CACHE_SHIFT; |
1857 | end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; | 1861 | end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; |
1858 | /* Try to avoid a swapstorm if len is impossible to satisfy */ | 1862 | /* Try to avoid a swapstorm if len is impossible to satisfy */ |
1859 | if (sbinfo->max_blocks && end - start > sbinfo->max_blocks) { | 1863 | if (sbinfo->max_blocks && end - start > sbinfo->max_blocks) { |
1860 | error = -ENOSPC; | 1864 | error = -ENOSPC; |
1861 | goto out; | 1865 | goto out; |
1862 | } | 1866 | } |
1863 | 1867 | ||
1864 | shmem_falloc.waitq = NULL; | 1868 | shmem_falloc.waitq = NULL; |
1865 | shmem_falloc.start = start; | 1869 | shmem_falloc.start = start; |
1866 | shmem_falloc.next = start; | 1870 | shmem_falloc.next = start; |
1867 | shmem_falloc.nr_falloced = 0; | 1871 | shmem_falloc.nr_falloced = 0; |
1868 | shmem_falloc.nr_unswapped = 0; | 1872 | shmem_falloc.nr_unswapped = 0; |
1869 | spin_lock(&inode->i_lock); | 1873 | spin_lock(&inode->i_lock); |
1870 | inode->i_private = &shmem_falloc; | 1874 | inode->i_private = &shmem_falloc; |
1871 | spin_unlock(&inode->i_lock); | 1875 | spin_unlock(&inode->i_lock); |
1872 | 1876 | ||
1873 | for (index = start; index < end; index++) { | 1877 | for (index = start; index < end; index++) { |
1874 | struct page *page; | 1878 | struct page *page; |
1875 | 1879 | ||
1876 | /* | 1880 | /* |
1877 | * Good, the fallocate(2) manpage permits EINTR: we may have | 1881 | * Good, the fallocate(2) manpage permits EINTR: we may have |
1878 | * been interrupted because we are using up too much memory. | 1882 | * been interrupted because we are using up too much memory. |
1879 | */ | 1883 | */ |
1880 | if (signal_pending(current)) | 1884 | if (signal_pending(current)) |
1881 | error = -EINTR; | 1885 | error = -EINTR; |
1882 | else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced) | 1886 | else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced) |
1883 | error = -ENOMEM; | 1887 | error = -ENOMEM; |
1884 | else | 1888 | else |
1885 | error = shmem_getpage(inode, index, &page, SGP_FALLOC, | 1889 | error = shmem_getpage(inode, index, &page, SGP_FALLOC, |
1886 | NULL); | 1890 | NULL); |
1887 | if (error) { | 1891 | if (error) { |
1888 | /* Remove the !PageUptodate pages we added */ | 1892 | /* Remove the !PageUptodate pages we added */ |
1889 | shmem_undo_range(inode, | 1893 | shmem_undo_range(inode, |
1890 | (loff_t)start << PAGE_CACHE_SHIFT, | 1894 | (loff_t)start << PAGE_CACHE_SHIFT, |
1891 | (loff_t)index << PAGE_CACHE_SHIFT, true); | 1895 | (loff_t)index << PAGE_CACHE_SHIFT, true); |
1892 | goto undone; | 1896 | goto undone; |
1893 | } | 1897 | } |
1894 | 1898 | ||
1895 | /* | 1899 | /* |
1896 | * Inform shmem_writepage() how far we have reached. | 1900 | * Inform shmem_writepage() how far we have reached. |
1897 | * No need for lock or barrier: we have the page lock. | 1901 | * No need for lock or barrier: we have the page lock. |
1898 | */ | 1902 | */ |
1899 | shmem_falloc.next++; | 1903 | shmem_falloc.next++; |
1900 | if (!PageUptodate(page)) | 1904 | if (!PageUptodate(page)) |
1901 | shmem_falloc.nr_falloced++; | 1905 | shmem_falloc.nr_falloced++; |
1902 | 1906 | ||
1903 | /* | 1907 | /* |
1904 | * If !PageUptodate, leave it that way so that freeable pages | 1908 | * If !PageUptodate, leave it that way so that freeable pages |
1905 | * can be recognized if we need to rollback on error later. | 1909 | * can be recognized if we need to rollback on error later. |
1906 | * But set_page_dirty so that memory pressure will swap rather | 1910 | * But set_page_dirty so that memory pressure will swap rather |
1907 | * than free the pages we are allocating (and SGP_CACHE pages | 1911 | * than free the pages we are allocating (and SGP_CACHE pages |
1908 | * might still be clean: we now need to mark those dirty too). | 1912 | * might still be clean: we now need to mark those dirty too). |
1909 | */ | 1913 | */ |
1910 | set_page_dirty(page); | 1914 | set_page_dirty(page); |
1911 | unlock_page(page); | 1915 | unlock_page(page); |
1912 | page_cache_release(page); | 1916 | page_cache_release(page); |
1913 | cond_resched(); | 1917 | cond_resched(); |
1914 | } | 1918 | } |
1915 | 1919 | ||
1916 | if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) | 1920 | if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) |
1917 | i_size_write(inode, offset + len); | 1921 | i_size_write(inode, offset + len); |
1918 | inode->i_ctime = CURRENT_TIME; | 1922 | inode->i_ctime = CURRENT_TIME; |
1919 | undone: | 1923 | undone: |
1920 | spin_lock(&inode->i_lock); | 1924 | spin_lock(&inode->i_lock); |
1921 | inode->i_private = NULL; | 1925 | inode->i_private = NULL; |
1922 | spin_unlock(&inode->i_lock); | 1926 | spin_unlock(&inode->i_lock); |
1923 | out: | 1927 | out: |
1924 | mutex_unlock(&inode->i_mutex); | 1928 | mutex_unlock(&inode->i_mutex); |
1925 | return error; | 1929 | return error; |
1926 | } | 1930 | } |
1927 | 1931 | ||
1928 | static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf) | 1932 | static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf) |
1929 | { | 1933 | { |
1930 | struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb); | 1934 | struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb); |
1931 | 1935 | ||
1932 | buf->f_type = TMPFS_MAGIC; | 1936 | buf->f_type = TMPFS_MAGIC; |
1933 | buf->f_bsize = PAGE_CACHE_SIZE; | 1937 | buf->f_bsize = PAGE_CACHE_SIZE; |
1934 | buf->f_namelen = NAME_MAX; | 1938 | buf->f_namelen = NAME_MAX; |
1935 | if (sbinfo->max_blocks) { | 1939 | if (sbinfo->max_blocks) { |
1936 | buf->f_blocks = sbinfo->max_blocks; | 1940 | buf->f_blocks = sbinfo->max_blocks; |
1937 | buf->f_bavail = | 1941 | buf->f_bavail = |
1938 | buf->f_bfree = sbinfo->max_blocks - | 1942 | buf->f_bfree = sbinfo->max_blocks - |
1939 | percpu_counter_sum(&sbinfo->used_blocks); | 1943 | percpu_counter_sum(&sbinfo->used_blocks); |
1940 | } | 1944 | } |
1941 | if (sbinfo->max_inodes) { | 1945 | if (sbinfo->max_inodes) { |
1942 | buf->f_files = sbinfo->max_inodes; | 1946 | buf->f_files = sbinfo->max_inodes; |
1943 | buf->f_ffree = sbinfo->free_inodes; | 1947 | buf->f_ffree = sbinfo->free_inodes; |
1944 | } | 1948 | } |
1945 | /* else leave those fields 0 like simple_statfs */ | 1949 | /* else leave those fields 0 like simple_statfs */ |
1946 | return 0; | 1950 | return 0; |
1947 | } | 1951 | } |
1948 | 1952 | ||
1949 | /* | 1953 | /* |
1950 | * File creation. Allocate an inode, and we're done.. | 1954 | * File creation. Allocate an inode, and we're done.. |
1951 | */ | 1955 | */ |
1952 | static int | 1956 | static int |
1953 | shmem_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) | 1957 | shmem_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) |
1954 | { | 1958 | { |
1955 | struct inode *inode; | 1959 | struct inode *inode; |
1956 | int error = -ENOSPC; | 1960 | int error = -ENOSPC; |
1957 | 1961 | ||
1958 | inode = shmem_get_inode(dir->i_sb, dir, mode, dev, VM_NORESERVE); | 1962 | inode = shmem_get_inode(dir->i_sb, dir, mode, dev, VM_NORESERVE); |
1959 | if (inode) { | 1963 | if (inode) { |
1960 | #ifdef CONFIG_TMPFS_POSIX_ACL | 1964 | #ifdef CONFIG_TMPFS_POSIX_ACL |
1961 | error = generic_acl_init(inode, dir); | 1965 | error = generic_acl_init(inode, dir); |
1962 | if (error) { | 1966 | if (error) { |
1963 | iput(inode); | 1967 | iput(inode); |
1964 | return error; | 1968 | return error; |
1965 | } | 1969 | } |
1966 | #endif | 1970 | #endif |
1967 | error = security_inode_init_security(inode, dir, | 1971 | error = security_inode_init_security(inode, dir, |
1968 | &dentry->d_name, | 1972 | &dentry->d_name, |
1969 | shmem_initxattrs, NULL); | 1973 | shmem_initxattrs, NULL); |
1970 | if (error) { | 1974 | if (error) { |
1971 | if (error != -EOPNOTSUPP) { | 1975 | if (error != -EOPNOTSUPP) { |
1972 | iput(inode); | 1976 | iput(inode); |
1973 | return error; | 1977 | return error; |
1974 | } | 1978 | } |
1975 | } | 1979 | } |
1976 | 1980 | ||
1977 | error = 0; | 1981 | error = 0; |
1978 | dir->i_size += BOGO_DIRENT_SIZE; | 1982 | dir->i_size += BOGO_DIRENT_SIZE; |
1979 | dir->i_ctime = dir->i_mtime = CURRENT_TIME; | 1983 | dir->i_ctime = dir->i_mtime = CURRENT_TIME; |
1980 | d_instantiate(dentry, inode); | 1984 | d_instantiate(dentry, inode); |
1981 | dget(dentry); /* Extra count - pin the dentry in core */ | 1985 | dget(dentry); /* Extra count - pin the dentry in core */ |
1982 | } | 1986 | } |
1983 | return error; | 1987 | return error; |
1984 | } | 1988 | } |
1985 | 1989 | ||
1986 | static int | 1990 | static int |
1987 | shmem_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) | 1991 | shmem_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) |
1988 | { | 1992 | { |
1989 | struct inode *inode; | 1993 | struct inode *inode; |
1990 | int error = -ENOSPC; | 1994 | int error = -ENOSPC; |
1991 | 1995 | ||
1992 | inode = shmem_get_inode(dir->i_sb, dir, mode, 0, VM_NORESERVE); | 1996 | inode = shmem_get_inode(dir->i_sb, dir, mode, 0, VM_NORESERVE); |
1993 | if (inode) { | 1997 | if (inode) { |
1994 | error = security_inode_init_security(inode, dir, | 1998 | error = security_inode_init_security(inode, dir, |
1995 | NULL, | 1999 | NULL, |
1996 | shmem_initxattrs, NULL); | 2000 | shmem_initxattrs, NULL); |
1997 | if (error) { | 2001 | if (error) { |
1998 | if (error != -EOPNOTSUPP) { | 2002 | if (error != -EOPNOTSUPP) { |
1999 | iput(inode); | 2003 | iput(inode); |
2000 | return error; | 2004 | return error; |
2001 | } | 2005 | } |
2002 | } | 2006 | } |
2003 | #ifdef CONFIG_TMPFS_POSIX_ACL | 2007 | #ifdef CONFIG_TMPFS_POSIX_ACL |
2004 | error = generic_acl_init(inode, dir); | 2008 | error = generic_acl_init(inode, dir); |
2005 | if (error) { | 2009 | if (error) { |
2006 | iput(inode); | 2010 | iput(inode); |
2007 | return error; | 2011 | return error; |
2008 | } | 2012 | } |
2009 | #else | 2013 | #else |
2010 | error = 0; | 2014 | error = 0; |
2011 | #endif | 2015 | #endif |
2012 | d_tmpfile(dentry, inode); | 2016 | d_tmpfile(dentry, inode); |
2013 | } | 2017 | } |
2014 | return error; | 2018 | return error; |
2015 | } | 2019 | } |
2016 | 2020 | ||
2017 | static int shmem_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) | 2021 | static int shmem_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) |
2018 | { | 2022 | { |
2019 | int error; | 2023 | int error; |
2020 | 2024 | ||
2021 | if ((error = shmem_mknod(dir, dentry, mode | S_IFDIR, 0))) | 2025 | if ((error = shmem_mknod(dir, dentry, mode | S_IFDIR, 0))) |
2022 | return error; | 2026 | return error; |
2023 | inc_nlink(dir); | 2027 | inc_nlink(dir); |
2024 | return 0; | 2028 | return 0; |
2025 | } | 2029 | } |
2026 | 2030 | ||
2027 | static int shmem_create(struct inode *dir, struct dentry *dentry, umode_t mode, | 2031 | static int shmem_create(struct inode *dir, struct dentry *dentry, umode_t mode, |
2028 | bool excl) | 2032 | bool excl) |
2029 | { | 2033 | { |
2030 | return shmem_mknod(dir, dentry, mode | S_IFREG, 0); | 2034 | return shmem_mknod(dir, dentry, mode | S_IFREG, 0); |
2031 | } | 2035 | } |
2032 | 2036 | ||
2033 | /* | 2037 | /* |
2034 | * Link a file.. | 2038 | * Link a file.. |
2035 | */ | 2039 | */ |
2036 | static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry) | 2040 | static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry) |
2037 | { | 2041 | { |
2038 | struct inode *inode = old_dentry->d_inode; | 2042 | struct inode *inode = old_dentry->d_inode; |
2039 | int ret; | 2043 | int ret; |
2040 | 2044 | ||
2041 | /* | 2045 | /* |
2042 | * No ordinary (disk based) filesystem counts links as inodes; | 2046 | * No ordinary (disk based) filesystem counts links as inodes; |
2043 | * but each new link needs a new dentry, pinning lowmem, and | 2047 | * but each new link needs a new dentry, pinning lowmem, and |
2044 | * tmpfs dentries cannot be pruned until they are unlinked. | 2048 | * tmpfs dentries cannot be pruned until they are unlinked. |
2045 | */ | 2049 | */ |
2046 | ret = shmem_reserve_inode(inode->i_sb); | 2050 | ret = shmem_reserve_inode(inode->i_sb); |
2047 | if (ret) | 2051 | if (ret) |
2048 | goto out; | 2052 | goto out; |
2049 | 2053 | ||
2050 | dir->i_size += BOGO_DIRENT_SIZE; | 2054 | dir->i_size += BOGO_DIRENT_SIZE; |
2051 | inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME; | 2055 | inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME; |
2052 | inc_nlink(inode); | 2056 | inc_nlink(inode); |
2053 | ihold(inode); /* New dentry reference */ | 2057 | ihold(inode); /* New dentry reference */ |
2054 | dget(dentry); /* Extra pinning count for the created dentry */ | 2058 | dget(dentry); /* Extra pinning count for the created dentry */ |
2055 | d_instantiate(dentry, inode); | 2059 | d_instantiate(dentry, inode); |
2056 | out: | 2060 | out: |
2057 | return ret; | 2061 | return ret; |
2058 | } | 2062 | } |
2059 | 2063 | ||
2060 | static int shmem_unlink(struct inode *dir, struct dentry *dentry) | 2064 | static int shmem_unlink(struct inode *dir, struct dentry *dentry) |
2061 | { | 2065 | { |
2062 | struct inode *inode = dentry->d_inode; | 2066 | struct inode *inode = dentry->d_inode; |
2063 | 2067 | ||
2064 | if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)) | 2068 | if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)) |
2065 | shmem_free_inode(inode->i_sb); | 2069 | shmem_free_inode(inode->i_sb); |
2066 | 2070 | ||
2067 | dir->i_size -= BOGO_DIRENT_SIZE; | 2071 | dir->i_size -= BOGO_DIRENT_SIZE; |
2068 | inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME; | 2072 | inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME; |
2069 | drop_nlink(inode); | 2073 | drop_nlink(inode); |
2070 | dput(dentry); /* Undo the count from "create" - this does all the work */ | 2074 | dput(dentry); /* Undo the count from "create" - this does all the work */ |
2071 | return 0; | 2075 | return 0; |
2072 | } | 2076 | } |
2073 | 2077 | ||
2074 | static int shmem_rmdir(struct inode *dir, struct dentry *dentry) | 2078 | static int shmem_rmdir(struct inode *dir, struct dentry *dentry) |
2075 | { | 2079 | { |
2076 | if (!simple_empty(dentry)) | 2080 | if (!simple_empty(dentry)) |
2077 | return -ENOTEMPTY; | 2081 | return -ENOTEMPTY; |
2078 | 2082 | ||
2079 | drop_nlink(dentry->d_inode); | 2083 | drop_nlink(dentry->d_inode); |
2080 | drop_nlink(dir); | 2084 | drop_nlink(dir); |
2081 | return shmem_unlink(dir, dentry); | 2085 | return shmem_unlink(dir, dentry); |
2082 | } | 2086 | } |
2083 | 2087 | ||
2084 | /* | 2088 | /* |
2085 | * The VFS layer already does all the dentry stuff for rename, | 2089 | * The VFS layer already does all the dentry stuff for rename, |
2086 | * we just have to decrement the usage count for the target if | 2090 | * we just have to decrement the usage count for the target if |
2087 | * it exists so that the VFS layer correctly free's it when it | 2091 | * it exists so that the VFS layer correctly free's it when it |
2088 | * gets overwritten. | 2092 | * gets overwritten. |
2089 | */ | 2093 | */ |
2090 | static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry) | 2094 | static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry) |
2091 | { | 2095 | { |
2092 | struct inode *inode = old_dentry->d_inode; | 2096 | struct inode *inode = old_dentry->d_inode; |
2093 | int they_are_dirs = S_ISDIR(inode->i_mode); | 2097 | int they_are_dirs = S_ISDIR(inode->i_mode); |
2094 | 2098 | ||
2095 | if (!simple_empty(new_dentry)) | 2099 | if (!simple_empty(new_dentry)) |
2096 | return -ENOTEMPTY; | 2100 | return -ENOTEMPTY; |
2097 | 2101 | ||
2098 | if (new_dentry->d_inode) { | 2102 | if (new_dentry->d_inode) { |
2099 | (void) shmem_unlink(new_dir, new_dentry); | 2103 | (void) shmem_unlink(new_dir, new_dentry); |
2100 | if (they_are_dirs) | 2104 | if (they_are_dirs) |
2101 | drop_nlink(old_dir); | 2105 | drop_nlink(old_dir); |
2102 | } else if (they_are_dirs) { | 2106 | } else if (they_are_dirs) { |
2103 | drop_nlink(old_dir); | 2107 | drop_nlink(old_dir); |
2104 | inc_nlink(new_dir); | 2108 | inc_nlink(new_dir); |
2105 | } | 2109 | } |
2106 | 2110 | ||
2107 | old_dir->i_size -= BOGO_DIRENT_SIZE; | 2111 | old_dir->i_size -= BOGO_DIRENT_SIZE; |
2108 | new_dir->i_size += BOGO_DIRENT_SIZE; | 2112 | new_dir->i_size += BOGO_DIRENT_SIZE; |
2109 | old_dir->i_ctime = old_dir->i_mtime = | 2113 | old_dir->i_ctime = old_dir->i_mtime = |
2110 | new_dir->i_ctime = new_dir->i_mtime = | 2114 | new_dir->i_ctime = new_dir->i_mtime = |
2111 | inode->i_ctime = CURRENT_TIME; | 2115 | inode->i_ctime = CURRENT_TIME; |
2112 | return 0; | 2116 | return 0; |
2113 | } | 2117 | } |
2114 | 2118 | ||
2115 | static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *symname) | 2119 | static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *symname) |
2116 | { | 2120 | { |
2117 | int error; | 2121 | int error; |
2118 | int len; | 2122 | int len; |
2119 | struct inode *inode; | 2123 | struct inode *inode; |
2120 | struct page *page; | 2124 | struct page *page; |
2121 | char *kaddr; | 2125 | char *kaddr; |
2122 | struct shmem_inode_info *info; | 2126 | struct shmem_inode_info *info; |
2123 | 2127 | ||
2124 | len = strlen(symname) + 1; | 2128 | len = strlen(symname) + 1; |
2125 | if (len > PAGE_CACHE_SIZE) | 2129 | if (len > PAGE_CACHE_SIZE) |
2126 | return -ENAMETOOLONG; | 2130 | return -ENAMETOOLONG; |
2127 | 2131 | ||
2128 | inode = shmem_get_inode(dir->i_sb, dir, S_IFLNK|S_IRWXUGO, 0, VM_NORESERVE); | 2132 | inode = shmem_get_inode(dir->i_sb, dir, S_IFLNK|S_IRWXUGO, 0, VM_NORESERVE); |
2129 | if (!inode) | 2133 | if (!inode) |
2130 | return -ENOSPC; | 2134 | return -ENOSPC; |
2131 | 2135 | ||
2132 | error = security_inode_init_security(inode, dir, &dentry->d_name, | 2136 | error = security_inode_init_security(inode, dir, &dentry->d_name, |
2133 | shmem_initxattrs, NULL); | 2137 | shmem_initxattrs, NULL); |
2134 | if (error) { | 2138 | if (error) { |
2135 | if (error != -EOPNOTSUPP) { | 2139 | if (error != -EOPNOTSUPP) { |
2136 | iput(inode); | 2140 | iput(inode); |
2137 | return error; | 2141 | return error; |
2138 | } | 2142 | } |
2139 | error = 0; | 2143 | error = 0; |
2140 | } | 2144 | } |
2141 | 2145 | ||
2142 | info = SHMEM_I(inode); | 2146 | info = SHMEM_I(inode); |
2143 | inode->i_size = len-1; | 2147 | inode->i_size = len-1; |
2144 | if (len <= SHORT_SYMLINK_LEN) { | 2148 | if (len <= SHORT_SYMLINK_LEN) { |
2145 | info->symlink = kmemdup(symname, len, GFP_KERNEL); | 2149 | info->symlink = kmemdup(symname, len, GFP_KERNEL); |
2146 | if (!info->symlink) { | 2150 | if (!info->symlink) { |
2147 | iput(inode); | 2151 | iput(inode); |
2148 | return -ENOMEM; | 2152 | return -ENOMEM; |
2149 | } | 2153 | } |
2150 | inode->i_op = &shmem_short_symlink_operations; | 2154 | inode->i_op = &shmem_short_symlink_operations; |
2151 | } else { | 2155 | } else { |
2152 | error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL); | 2156 | error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL); |
2153 | if (error) { | 2157 | if (error) { |
2154 | iput(inode); | 2158 | iput(inode); |
2155 | return error; | 2159 | return error; |
2156 | } | 2160 | } |
2157 | inode->i_mapping->a_ops = &shmem_aops; | 2161 | inode->i_mapping->a_ops = &shmem_aops; |
2158 | inode->i_op = &shmem_symlink_inode_operations; | 2162 | inode->i_op = &shmem_symlink_inode_operations; |
2159 | kaddr = kmap_atomic(page); | 2163 | kaddr = kmap_atomic(page); |
2160 | memcpy(kaddr, symname, len); | 2164 | memcpy(kaddr, symname, len); |
2161 | kunmap_atomic(kaddr); | 2165 | kunmap_atomic(kaddr); |
2162 | SetPageUptodate(page); | 2166 | SetPageUptodate(page); |
2163 | set_page_dirty(page); | 2167 | set_page_dirty(page); |
2164 | unlock_page(page); | 2168 | unlock_page(page); |
2165 | page_cache_release(page); | 2169 | page_cache_release(page); |
2166 | } | 2170 | } |
2167 | dir->i_size += BOGO_DIRENT_SIZE; | 2171 | dir->i_size += BOGO_DIRENT_SIZE; |
2168 | dir->i_ctime = dir->i_mtime = CURRENT_TIME; | 2172 | dir->i_ctime = dir->i_mtime = CURRENT_TIME; |
2169 | d_instantiate(dentry, inode); | 2173 | d_instantiate(dentry, inode); |
2170 | dget(dentry); | 2174 | dget(dentry); |
2171 | return 0; | 2175 | return 0; |
2172 | } | 2176 | } |
2173 | 2177 | ||
2174 | static void *shmem_follow_short_symlink(struct dentry *dentry, struct nameidata *nd) | 2178 | static void *shmem_follow_short_symlink(struct dentry *dentry, struct nameidata *nd) |
2175 | { | 2179 | { |
2176 | nd_set_link(nd, SHMEM_I(dentry->d_inode)->symlink); | 2180 | nd_set_link(nd, SHMEM_I(dentry->d_inode)->symlink); |
2177 | return NULL; | 2181 | return NULL; |
2178 | } | 2182 | } |
2179 | 2183 | ||
2180 | static void *shmem_follow_link(struct dentry *dentry, struct nameidata *nd) | 2184 | static void *shmem_follow_link(struct dentry *dentry, struct nameidata *nd) |
2181 | { | 2185 | { |
2182 | struct page *page = NULL; | 2186 | struct page *page = NULL; |
2183 | int error = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ, NULL); | 2187 | int error = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ, NULL); |
2184 | nd_set_link(nd, error ? ERR_PTR(error) : kmap(page)); | 2188 | nd_set_link(nd, error ? ERR_PTR(error) : kmap(page)); |
2185 | if (page) | 2189 | if (page) |
2186 | unlock_page(page); | 2190 | unlock_page(page); |
2187 | return page; | 2191 | return page; |
2188 | } | 2192 | } |
2189 | 2193 | ||
2190 | static void shmem_put_link(struct dentry *dentry, struct nameidata *nd, void *cookie) | 2194 | static void shmem_put_link(struct dentry *dentry, struct nameidata *nd, void *cookie) |
2191 | { | 2195 | { |
2192 | if (!IS_ERR(nd_get_link(nd))) { | 2196 | if (!IS_ERR(nd_get_link(nd))) { |
2193 | struct page *page = cookie; | 2197 | struct page *page = cookie; |
2194 | kunmap(page); | 2198 | kunmap(page); |
2195 | mark_page_accessed(page); | 2199 | mark_page_accessed(page); |
2196 | page_cache_release(page); | 2200 | page_cache_release(page); |
2197 | } | 2201 | } |
2198 | } | 2202 | } |
2199 | 2203 | ||
2200 | #ifdef CONFIG_TMPFS_XATTR | 2204 | #ifdef CONFIG_TMPFS_XATTR |
2201 | /* | 2205 | /* |
2202 | * Superblocks without xattr inode operations may get some security.* xattr | 2206 | * Superblocks without xattr inode operations may get some security.* xattr |
2203 | * support from the LSM "for free". As soon as we have any other xattrs | 2207 | * support from the LSM "for free". As soon as we have any other xattrs |
2204 | * like ACLs, we also need to implement the security.* handlers at | 2208 | * like ACLs, we also need to implement the security.* handlers at |
2205 | * filesystem level, though. | 2209 | * filesystem level, though. |
2206 | */ | 2210 | */ |
2207 | 2211 | ||
2208 | /* | 2212 | /* |
2209 | * Callback for security_inode_init_security() for acquiring xattrs. | 2213 | * Callback for security_inode_init_security() for acquiring xattrs. |
2210 | */ | 2214 | */ |
2211 | static int shmem_initxattrs(struct inode *inode, | 2215 | static int shmem_initxattrs(struct inode *inode, |
2212 | const struct xattr *xattr_array, | 2216 | const struct xattr *xattr_array, |
2213 | void *fs_info) | 2217 | void *fs_info) |
2214 | { | 2218 | { |
2215 | struct shmem_inode_info *info = SHMEM_I(inode); | 2219 | struct shmem_inode_info *info = SHMEM_I(inode); |
2216 | const struct xattr *xattr; | 2220 | const struct xattr *xattr; |
2217 | struct simple_xattr *new_xattr; | 2221 | struct simple_xattr *new_xattr; |
2218 | size_t len; | 2222 | size_t len; |
2219 | 2223 | ||
2220 | for (xattr = xattr_array; xattr->name != NULL; xattr++) { | 2224 | for (xattr = xattr_array; xattr->name != NULL; xattr++) { |
2221 | new_xattr = simple_xattr_alloc(xattr->value, xattr->value_len); | 2225 | new_xattr = simple_xattr_alloc(xattr->value, xattr->value_len); |
2222 | if (!new_xattr) | 2226 | if (!new_xattr) |
2223 | return -ENOMEM; | 2227 | return -ENOMEM; |
2224 | 2228 | ||
2225 | len = strlen(xattr->name) + 1; | 2229 | len = strlen(xattr->name) + 1; |
2226 | new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len, | 2230 | new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len, |
2227 | GFP_KERNEL); | 2231 | GFP_KERNEL); |
2228 | if (!new_xattr->name) { | 2232 | if (!new_xattr->name) { |
2229 | kfree(new_xattr); | 2233 | kfree(new_xattr); |
2230 | return -ENOMEM; | 2234 | return -ENOMEM; |
2231 | } | 2235 | } |
2232 | 2236 | ||
2233 | memcpy(new_xattr->name, XATTR_SECURITY_PREFIX, | 2237 | memcpy(new_xattr->name, XATTR_SECURITY_PREFIX, |
2234 | XATTR_SECURITY_PREFIX_LEN); | 2238 | XATTR_SECURITY_PREFIX_LEN); |
2235 | memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN, | 2239 | memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN, |
2236 | xattr->name, len); | 2240 | xattr->name, len); |
2237 | 2241 | ||
2238 | simple_xattr_list_add(&info->xattrs, new_xattr); | 2242 | simple_xattr_list_add(&info->xattrs, new_xattr); |
2239 | } | 2243 | } |
2240 | 2244 | ||
2241 | return 0; | 2245 | return 0; |
2242 | } | 2246 | } |
2243 | 2247 | ||
2244 | static const struct xattr_handler *shmem_xattr_handlers[] = { | 2248 | static const struct xattr_handler *shmem_xattr_handlers[] = { |
2245 | #ifdef CONFIG_TMPFS_POSIX_ACL | 2249 | #ifdef CONFIG_TMPFS_POSIX_ACL |
2246 | &generic_acl_access_handler, | 2250 | &generic_acl_access_handler, |
2247 | &generic_acl_default_handler, | 2251 | &generic_acl_default_handler, |
2248 | #endif | 2252 | #endif |
2249 | NULL | 2253 | NULL |
2250 | }; | 2254 | }; |
2251 | 2255 | ||
2252 | static int shmem_xattr_validate(const char *name) | 2256 | static int shmem_xattr_validate(const char *name) |
2253 | { | 2257 | { |
2254 | struct { const char *prefix; size_t len; } arr[] = { | 2258 | struct { const char *prefix; size_t len; } arr[] = { |
2255 | { XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN }, | 2259 | { XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN }, |
2256 | { XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN } | 2260 | { XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN } |
2257 | }; | 2261 | }; |
2258 | int i; | 2262 | int i; |
2259 | 2263 | ||
2260 | for (i = 0; i < ARRAY_SIZE(arr); i++) { | 2264 | for (i = 0; i < ARRAY_SIZE(arr); i++) { |
2261 | size_t preflen = arr[i].len; | 2265 | size_t preflen = arr[i].len; |
2262 | if (strncmp(name, arr[i].prefix, preflen) == 0) { | 2266 | if (strncmp(name, arr[i].prefix, preflen) == 0) { |
2263 | if (!name[preflen]) | 2267 | if (!name[preflen]) |
2264 | return -EINVAL; | 2268 | return -EINVAL; |
2265 | return 0; | 2269 | return 0; |
2266 | } | 2270 | } |
2267 | } | 2271 | } |
2268 | return -EOPNOTSUPP; | 2272 | return -EOPNOTSUPP; |
2269 | } | 2273 | } |
2270 | 2274 | ||
2271 | static ssize_t shmem_getxattr(struct dentry *dentry, const char *name, | 2275 | static ssize_t shmem_getxattr(struct dentry *dentry, const char *name, |
2272 | void *buffer, size_t size) | 2276 | void *buffer, size_t size) |
2273 | { | 2277 | { |
2274 | struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); | 2278 | struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); |
2275 | int err; | 2279 | int err; |
2276 | 2280 | ||
2277 | /* | 2281 | /* |
2278 | * If this is a request for a synthetic attribute in the system.* | 2282 | * If this is a request for a synthetic attribute in the system.* |
2279 | * namespace use the generic infrastructure to resolve a handler | 2283 | * namespace use the generic infrastructure to resolve a handler |
2280 | * for it via sb->s_xattr. | 2284 | * for it via sb->s_xattr. |
2281 | */ | 2285 | */ |
2282 | if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) | 2286 | if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) |
2283 | return generic_getxattr(dentry, name, buffer, size); | 2287 | return generic_getxattr(dentry, name, buffer, size); |
2284 | 2288 | ||
2285 | err = shmem_xattr_validate(name); | 2289 | err = shmem_xattr_validate(name); |
2286 | if (err) | 2290 | if (err) |
2287 | return err; | 2291 | return err; |
2288 | 2292 | ||
2289 | return simple_xattr_get(&info->xattrs, name, buffer, size); | 2293 | return simple_xattr_get(&info->xattrs, name, buffer, size); |
2290 | } | 2294 | } |
2291 | 2295 | ||
2292 | static int shmem_setxattr(struct dentry *dentry, const char *name, | 2296 | static int shmem_setxattr(struct dentry *dentry, const char *name, |
2293 | const void *value, size_t size, int flags) | 2297 | const void *value, size_t size, int flags) |
2294 | { | 2298 | { |
2295 | struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); | 2299 | struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); |
2296 | int err; | 2300 | int err; |
2297 | 2301 | ||
2298 | /* | 2302 | /* |
2299 | * If this is a request for a synthetic attribute in the system.* | 2303 | * If this is a request for a synthetic attribute in the system.* |
2300 | * namespace use the generic infrastructure to resolve a handler | 2304 | * namespace use the generic infrastructure to resolve a handler |
2301 | * for it via sb->s_xattr. | 2305 | * for it via sb->s_xattr. |
2302 | */ | 2306 | */ |
2303 | if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) | 2307 | if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) |
2304 | return generic_setxattr(dentry, name, value, size, flags); | 2308 | return generic_setxattr(dentry, name, value, size, flags); |
2305 | 2309 | ||
2306 | err = shmem_xattr_validate(name); | 2310 | err = shmem_xattr_validate(name); |
2307 | if (err) | 2311 | if (err) |
2308 | return err; | 2312 | return err; |
2309 | 2313 | ||
2310 | return simple_xattr_set(&info->xattrs, name, value, size, flags); | 2314 | return simple_xattr_set(&info->xattrs, name, value, size, flags); |
2311 | } | 2315 | } |
2312 | 2316 | ||
2313 | static int shmem_removexattr(struct dentry *dentry, const char *name) | 2317 | static int shmem_removexattr(struct dentry *dentry, const char *name) |
2314 | { | 2318 | { |
2315 | struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); | 2319 | struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); |
2316 | int err; | 2320 | int err; |
2317 | 2321 | ||
2318 | /* | 2322 | /* |
2319 | * If this is a request for a synthetic attribute in the system.* | 2323 | * If this is a request for a synthetic attribute in the system.* |
2320 | * namespace use the generic infrastructure to resolve a handler | 2324 | * namespace use the generic infrastructure to resolve a handler |
2321 | * for it via sb->s_xattr. | 2325 | * for it via sb->s_xattr. |
2322 | */ | 2326 | */ |
2323 | if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) | 2327 | if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) |
2324 | return generic_removexattr(dentry, name); | 2328 | return generic_removexattr(dentry, name); |
2325 | 2329 | ||
2326 | err = shmem_xattr_validate(name); | 2330 | err = shmem_xattr_validate(name); |
2327 | if (err) | 2331 | if (err) |
2328 | return err; | 2332 | return err; |
2329 | 2333 | ||
2330 | return simple_xattr_remove(&info->xattrs, name); | 2334 | return simple_xattr_remove(&info->xattrs, name); |
2331 | } | 2335 | } |
2332 | 2336 | ||
2333 | static ssize_t shmem_listxattr(struct dentry *dentry, char *buffer, size_t size) | 2337 | static ssize_t shmem_listxattr(struct dentry *dentry, char *buffer, size_t size) |
2334 | { | 2338 | { |
2335 | struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); | 2339 | struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); |
2336 | return simple_xattr_list(&info->xattrs, buffer, size); | 2340 | return simple_xattr_list(&info->xattrs, buffer, size); |
2337 | } | 2341 | } |
2338 | #endif /* CONFIG_TMPFS_XATTR */ | 2342 | #endif /* CONFIG_TMPFS_XATTR */ |
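The three handlers above all funnel names through shmem_xattr_validate(), so on this kernel a tmpfs inode stores only security.* and trusted.* attributes itself, routes system.* through the generic sb->s_xattr handlers, and returns -EOPNOTSUPP for anything else (notably user.*). A small userspace probe of that behaviour, illustrative only and not part of the diff; the path and the exact outcomes depend on the running kernel, capabilities and LSM configuration:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/xattr.h>
    #include <unistd.h>

    int main(void)
    {
            const char *path = "/dev/shm/xattr-probe";      /* any tmpfs file */
            const char *names[] = { "user.comment", "trusted.probe",
                                    "security.probe" };
            int fd = open(path, O_CREAT | O_RDWR, 0600);
            int i;

            if (fd < 0)
                    return 1;
            for (i = 0; i < 3; i++) {
                    /* user.* is expected to fail with EOPNOTSUPP here;
                     * trusted.* normally needs CAP_SYS_ADMIN. */
                    if (setxattr(path, names[i], "x", 1, 0) == 0)
                            printf("%-16s accepted\n", names[i]);
                    else
                            printf("%-16s rejected: %s\n", names[i],
                                   strerror(errno));
            }
            close(fd);
            unlink(path);
            return 0;
    }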
2339 | 2343 | ||
2340 | static const struct inode_operations shmem_short_symlink_operations = { | 2344 | static const struct inode_operations shmem_short_symlink_operations = { |
2341 | .readlink = generic_readlink, | 2345 | .readlink = generic_readlink, |
2342 | .follow_link = shmem_follow_short_symlink, | 2346 | .follow_link = shmem_follow_short_symlink, |
2343 | #ifdef CONFIG_TMPFS_XATTR | 2347 | #ifdef CONFIG_TMPFS_XATTR |
2344 | .setxattr = shmem_setxattr, | 2348 | .setxattr = shmem_setxattr, |
2345 | .getxattr = shmem_getxattr, | 2349 | .getxattr = shmem_getxattr, |
2346 | .listxattr = shmem_listxattr, | 2350 | .listxattr = shmem_listxattr, |
2347 | .removexattr = shmem_removexattr, | 2351 | .removexattr = shmem_removexattr, |
2348 | #endif | 2352 | #endif |
2349 | }; | 2353 | }; |
2350 | 2354 | ||
2351 | static const struct inode_operations shmem_symlink_inode_operations = { | 2355 | static const struct inode_operations shmem_symlink_inode_operations = { |
2352 | .readlink = generic_readlink, | 2356 | .readlink = generic_readlink, |
2353 | .follow_link = shmem_follow_link, | 2357 | .follow_link = shmem_follow_link, |
2354 | .put_link = shmem_put_link, | 2358 | .put_link = shmem_put_link, |
2355 | #ifdef CONFIG_TMPFS_XATTR | 2359 | #ifdef CONFIG_TMPFS_XATTR |
2356 | .setxattr = shmem_setxattr, | 2360 | .setxattr = shmem_setxattr, |
2357 | .getxattr = shmem_getxattr, | 2361 | .getxattr = shmem_getxattr, |
2358 | .listxattr = shmem_listxattr, | 2362 | .listxattr = shmem_listxattr, |
2359 | .removexattr = shmem_removexattr, | 2363 | .removexattr = shmem_removexattr, |
2360 | #endif | 2364 | #endif |
2361 | }; | 2365 | }; |
2362 | 2366 | ||
2363 | static struct dentry *shmem_get_parent(struct dentry *child) | 2367 | static struct dentry *shmem_get_parent(struct dentry *child) |
2364 | { | 2368 | { |
2365 | return ERR_PTR(-ESTALE); | 2369 | return ERR_PTR(-ESTALE); |
2366 | } | 2370 | } |
2367 | 2371 | ||
2368 | static int shmem_match(struct inode *ino, void *vfh) | 2372 | static int shmem_match(struct inode *ino, void *vfh) |
2369 | { | 2373 | { |
2370 | __u32 *fh = vfh; | 2374 | __u32 *fh = vfh; |
2371 | __u64 inum = fh[2]; | 2375 | __u64 inum = fh[2]; |
2372 | inum = (inum << 32) | fh[1]; | 2376 | inum = (inum << 32) | fh[1]; |
2373 | return ino->i_ino == inum && fh[0] == ino->i_generation; | 2377 | return ino->i_ino == inum && fh[0] == ino->i_generation; |
2374 | } | 2378 | } |
2375 | 2379 | ||
2376 | static struct dentry *shmem_fh_to_dentry(struct super_block *sb, | 2380 | static struct dentry *shmem_fh_to_dentry(struct super_block *sb, |
2377 | struct fid *fid, int fh_len, int fh_type) | 2381 | struct fid *fid, int fh_len, int fh_type) |
2378 | { | 2382 | { |
2379 | struct inode *inode; | 2383 | struct inode *inode; |
2380 | struct dentry *dentry = NULL; | 2384 | struct dentry *dentry = NULL; |
2381 | u64 inum; | 2385 | u64 inum; |
2382 | 2386 | ||
2383 | if (fh_len < 3) | 2387 | if (fh_len < 3) |
2384 | return NULL; | 2388 | return NULL; |
2385 | 2389 | ||
2386 | inum = fid->raw[2]; | 2390 | inum = fid->raw[2]; |
2387 | inum = (inum << 32) | fid->raw[1]; | 2391 | inum = (inum << 32) | fid->raw[1]; |
2388 | 2392 | ||
2389 | inode = ilookup5(sb, (unsigned long)(inum + fid->raw[0]), | 2393 | inode = ilookup5(sb, (unsigned long)(inum + fid->raw[0]), |
2390 | shmem_match, fid->raw); | 2394 | shmem_match, fid->raw); |
2391 | if (inode) { | 2395 | if (inode) { |
2392 | dentry = d_find_alias(inode); | 2396 | dentry = d_find_alias(inode); |
2393 | iput(inode); | 2397 | iput(inode); |
2394 | } | 2398 | } |
2395 | 2399 | ||
2396 | return dentry; | 2400 | return dentry; |
2397 | } | 2401 | } |
2398 | 2402 | ||
2399 | static int shmem_encode_fh(struct inode *inode, __u32 *fh, int *len, | 2403 | static int shmem_encode_fh(struct inode *inode, __u32 *fh, int *len, |
2400 | struct inode *parent) | 2404 | struct inode *parent) |
2401 | { | 2405 | { |
2402 | if (*len < 3) { | 2406 | if (*len < 3) { |
2403 | *len = 3; | 2407 | *len = 3; |
2404 | return FILEID_INVALID; | 2408 | return FILEID_INVALID; |
2405 | } | 2409 | } |
2406 | 2410 | ||
2407 | if (inode_unhashed(inode)) { | 2411 | if (inode_unhashed(inode)) { |
2408 | /* Unfortunately insert_inode_hash is not idempotent, | 2412 | /* Unfortunately insert_inode_hash is not idempotent, |
2409 | * so as we hash inodes here rather than at creation | 2413 | * so as we hash inodes here rather than at creation |
2410 | * time, we need a lock to ensure we only try | 2414 | * time, we need a lock to ensure we only try |
2411 | * to do it once | 2415 | * to do it once |
2412 | */ | 2416 | */ |
2413 | static DEFINE_SPINLOCK(lock); | 2417 | static DEFINE_SPINLOCK(lock); |
2414 | spin_lock(&lock); | 2418 | spin_lock(&lock); |
2415 | if (inode_unhashed(inode)) | 2419 | if (inode_unhashed(inode)) |
2416 | __insert_inode_hash(inode, | 2420 | __insert_inode_hash(inode, |
2417 | inode->i_ino + inode->i_generation); | 2421 | inode->i_ino + inode->i_generation); |
2418 | spin_unlock(&lock); | 2422 | spin_unlock(&lock); |
2419 | } | 2423 | } |
2420 | 2424 | ||
2421 | fh[0] = inode->i_generation; | 2425 | fh[0] = inode->i_generation; |
2422 | fh[1] = inode->i_ino; | 2426 | fh[1] = inode->i_ino; |
2423 | fh[2] = ((__u64)inode->i_ino) >> 32; | 2427 | fh[2] = ((__u64)inode->i_ino) >> 32; |
2424 | 2428 | ||
2425 | *len = 3; | 2429 | *len = 3; |
2426 | return 1; | 2430 | return 1; |
2427 | } | 2431 | } |
2428 | 2432 | ||
2429 | static const struct export_operations shmem_export_ops = { | 2433 | static const struct export_operations shmem_export_ops = { |
2430 | .get_parent = shmem_get_parent, | 2434 | .get_parent = shmem_get_parent, |
2431 | .encode_fh = shmem_encode_fh, | 2435 | .encode_fh = shmem_encode_fh, |
2432 | .fh_to_dentry = shmem_fh_to_dentry, | 2436 | .fh_to_dentry = shmem_fh_to_dentry, |
2433 | }; | 2437 | }; |
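shmem_encode_fh() and shmem_match() above agree on a three-word file handle layout: fh[0] holds i_generation, fh[1] the low 32 bits of i_ino and fh[2] the high 32 bits. A standalone sketch of that packing and unpacking with plain integers (illustrative only; helper names are hypothetical):

    #include <stdint.h>
    #include <stdio.h>

    /* Mirror of the layout used above: fh[0] = generation,
     * fh[1] = low 32 bits of the inode number, fh[2] = high 32 bits. */
    static void encode(uint32_t fh[3], uint64_t ino, uint32_t generation)
    {
            fh[0] = generation;
            fh[1] = (uint32_t)ino;
            fh[2] = (uint32_t)(ino >> 32);
    }

    static int match(const uint32_t fh[3], uint64_t ino, uint32_t generation)
    {
            uint64_t inum = ((uint64_t)fh[2] << 32) | fh[1];

            return ino == inum && fh[0] == generation;
    }

    int main(void)
    {
            uint32_t fh[3];

            encode(fh, 0x0123456789abcdefULL, 42);
            printf("match: %d\n", match(fh, 0x0123456789abcdefULL, 42)); /* 1 */
            return 0;
    }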
2434 | 2438 | ||
2435 | static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, | 2439 | static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, |
2436 | bool remount) | 2440 | bool remount) |
2437 | { | 2441 | { |
2438 | char *this_char, *value, *rest; | 2442 | char *this_char, *value, *rest; |
2439 | struct mempolicy *mpol = NULL; | 2443 | struct mempolicy *mpol = NULL; |
2440 | uid_t uid; | 2444 | uid_t uid; |
2441 | gid_t gid; | 2445 | gid_t gid; |
2442 | 2446 | ||
2443 | while (options != NULL) { | 2447 | while (options != NULL) { |
2444 | this_char = options; | 2448 | this_char = options; |
2445 | for (;;) { | 2449 | for (;;) { |
2446 | /* | 2450 | /* |
2447 | * NUL-terminate this option: unfortunately, | 2451 | * NUL-terminate this option: unfortunately, |
2448 | * mount options form a comma-separated list, | 2452 | * mount options form a comma-separated list, |
2449 | * but mpol's nodelist may also contain commas. | 2453 | * but mpol's nodelist may also contain commas. |
2450 | */ | 2454 | */ |
2451 | options = strchr(options, ','); | 2455 | options = strchr(options, ','); |
2452 | if (options == NULL) | 2456 | if (options == NULL) |
2453 | break; | 2457 | break; |
2454 | options++; | 2458 | options++; |
2455 | if (!isdigit(*options)) { | 2459 | if (!isdigit(*options)) { |
2456 | options[-1] = '\0'; | 2460 | options[-1] = '\0'; |
2457 | break; | 2461 | break; |
2458 | } | 2462 | } |
2459 | } | 2463 | } |
2460 | if (!*this_char) | 2464 | if (!*this_char) |
2461 | continue; | 2465 | continue; |
2462 | if ((value = strchr(this_char,'=')) != NULL) { | 2466 | if ((value = strchr(this_char,'=')) != NULL) { |
2463 | *value++ = 0; | 2467 | *value++ = 0; |
2464 | } else { | 2468 | } else { |
2465 | printk(KERN_ERR | 2469 | printk(KERN_ERR |
2466 | "tmpfs: No value for mount option '%s'\n", | 2470 | "tmpfs: No value for mount option '%s'\n", |
2467 | this_char); | 2471 | this_char); |
2468 | goto error; | 2472 | goto error; |
2469 | } | 2473 | } |
2470 | 2474 | ||
2471 | if (!strcmp(this_char,"size")) { | 2475 | if (!strcmp(this_char,"size")) { |
2472 | unsigned long long size; | 2476 | unsigned long long size; |
2473 | size = memparse(value,&rest); | 2477 | size = memparse(value,&rest); |
2474 | if (*rest == '%') { | 2478 | if (*rest == '%') { |
2475 | size <<= PAGE_SHIFT; | 2479 | size <<= PAGE_SHIFT; |
2476 | size *= totalram_pages; | 2480 | size *= totalram_pages; |
2477 | do_div(size, 100); | 2481 | do_div(size, 100); |
2478 | rest++; | 2482 | rest++; |
2479 | } | 2483 | } |
2480 | if (*rest) | 2484 | if (*rest) |
2481 | goto bad_val; | 2485 | goto bad_val; |
2482 | sbinfo->max_blocks = | 2486 | sbinfo->max_blocks = |
2483 | DIV_ROUND_UP(size, PAGE_CACHE_SIZE); | 2487 | DIV_ROUND_UP(size, PAGE_CACHE_SIZE); |
2484 | } else if (!strcmp(this_char,"nr_blocks")) { | 2488 | } else if (!strcmp(this_char,"nr_blocks")) { |
2485 | sbinfo->max_blocks = memparse(value, &rest); | 2489 | sbinfo->max_blocks = memparse(value, &rest); |
2486 | if (*rest) | 2490 | if (*rest) |
2487 | goto bad_val; | 2491 | goto bad_val; |
2488 | } else if (!strcmp(this_char,"nr_inodes")) { | 2492 | } else if (!strcmp(this_char,"nr_inodes")) { |
2489 | sbinfo->max_inodes = memparse(value, &rest); | 2493 | sbinfo->max_inodes = memparse(value, &rest); |
2490 | if (*rest) | 2494 | if (*rest) |
2491 | goto bad_val; | 2495 | goto bad_val; |
2492 | } else if (!strcmp(this_char,"mode")) { | 2496 | } else if (!strcmp(this_char,"mode")) { |
2493 | if (remount) | 2497 | if (remount) |
2494 | continue; | 2498 | continue; |
2495 | sbinfo->mode = simple_strtoul(value, &rest, 8) & 07777; | 2499 | sbinfo->mode = simple_strtoul(value, &rest, 8) & 07777; |
2496 | if (*rest) | 2500 | if (*rest) |
2497 | goto bad_val; | 2501 | goto bad_val; |
2498 | } else if (!strcmp(this_char,"uid")) { | 2502 | } else if (!strcmp(this_char,"uid")) { |
2499 | if (remount) | 2503 | if (remount) |
2500 | continue; | 2504 | continue; |
2501 | uid = simple_strtoul(value, &rest, 0); | 2505 | uid = simple_strtoul(value, &rest, 0); |
2502 | if (*rest) | 2506 | if (*rest) |
2503 | goto bad_val; | 2507 | goto bad_val; |
2504 | sbinfo->uid = make_kuid(current_user_ns(), uid); | 2508 | sbinfo->uid = make_kuid(current_user_ns(), uid); |
2505 | if (!uid_valid(sbinfo->uid)) | 2509 | if (!uid_valid(sbinfo->uid)) |
2506 | goto bad_val; | 2510 | goto bad_val; |
2507 | } else if (!strcmp(this_char,"gid")) { | 2511 | } else if (!strcmp(this_char,"gid")) { |
2508 | if (remount) | 2512 | if (remount) |
2509 | continue; | 2513 | continue; |
2510 | gid = simple_strtoul(value, &rest, 0); | 2514 | gid = simple_strtoul(value, &rest, 0); |
2511 | if (*rest) | 2515 | if (*rest) |
2512 | goto bad_val; | 2516 | goto bad_val; |
2513 | sbinfo->gid = make_kgid(current_user_ns(), gid); | 2517 | sbinfo->gid = make_kgid(current_user_ns(), gid); |
2514 | if (!gid_valid(sbinfo->gid)) | 2518 | if (!gid_valid(sbinfo->gid)) |
2515 | goto bad_val; | 2519 | goto bad_val; |
2516 | } else if (!strcmp(this_char,"mpol")) { | 2520 | } else if (!strcmp(this_char,"mpol")) { |
2517 | mpol_put(mpol); | 2521 | mpol_put(mpol); |
2518 | mpol = NULL; | 2522 | mpol = NULL; |
2519 | if (mpol_parse_str(value, &mpol)) | 2523 | if (mpol_parse_str(value, &mpol)) |
2520 | goto bad_val; | 2524 | goto bad_val; |
2521 | } else { | 2525 | } else { |
2522 | printk(KERN_ERR "tmpfs: Bad mount option %s\n", | 2526 | printk(KERN_ERR "tmpfs: Bad mount option %s\n", |
2523 | this_char); | 2527 | this_char); |
2524 | goto error; | 2528 | goto error; |
2525 | } | 2529 | } |
2526 | } | 2530 | } |
2527 | sbinfo->mpol = mpol; | 2531 | sbinfo->mpol = mpol; |
2528 | return 0; | 2532 | return 0; |
2529 | 2533 | ||
2530 | bad_val: | 2534 | bad_val: |
2531 | printk(KERN_ERR "tmpfs: Bad value '%s' for mount option '%s'\n", | 2535 | printk(KERN_ERR "tmpfs: Bad value '%s' for mount option '%s'\n", |
2532 | value, this_char); | 2536 | value, this_char); |
2533 | error: | 2537 | error: |
2534 | mpol_put(mpol); | 2538 | mpol_put(mpol); |
2535 | return 1; | 2539 | return 1; |
2536 | 2540 | ||
2537 | } | 2541 | } |
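The splitting loop above treats a ',' as an option separator only when the next character is not a digit, so an mpol nodelist such as bind:0,2 survives in one piece while ",size=..." starts a new option. A standalone sketch of just that rule (illustrative, not part of the diff):

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char buf[] = "mpol=bind:0,2,size=50%,mode=1777";
            char *options = buf;

            while (options != NULL) {
                    char *this_char = options;

                    for (;;) {
                            options = strchr(options, ',');
                            if (options == NULL)
                                    break;
                            options++;
                            /* a comma followed by a digit stays inside the
                             * current option (mpol nodelist), anything else
                             * terminates it */
                            if (!isdigit((unsigned char)*options)) {
                                    options[-1] = '\0';
                                    break;
                            }
                    }
                    if (*this_char)
                            printf("option: %s\n", this_char);
            }
            return 0;
            /* prints: option: mpol=bind:0,2
             *         option: size=50%
             *         option: mode=1777 */
    }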
2538 | 2542 | ||
2539 | static int shmem_remount_fs(struct super_block *sb, int *flags, char *data) | 2543 | static int shmem_remount_fs(struct super_block *sb, int *flags, char *data) |
2540 | { | 2544 | { |
2541 | struct shmem_sb_info *sbinfo = SHMEM_SB(sb); | 2545 | struct shmem_sb_info *sbinfo = SHMEM_SB(sb); |
2542 | struct shmem_sb_info config = *sbinfo; | 2546 | struct shmem_sb_info config = *sbinfo; |
2543 | unsigned long inodes; | 2547 | unsigned long inodes; |
2544 | int error = -EINVAL; | 2548 | int error = -EINVAL; |
2545 | 2549 | ||
2546 | config.mpol = NULL; | 2550 | config.mpol = NULL; |
2547 | if (shmem_parse_options(data, &config, true)) | 2551 | if (shmem_parse_options(data, &config, true)) |
2548 | return error; | 2552 | return error; |
2549 | 2553 | ||
2550 | spin_lock(&sbinfo->stat_lock); | 2554 | spin_lock(&sbinfo->stat_lock); |
2551 | inodes = sbinfo->max_inodes - sbinfo->free_inodes; | 2555 | inodes = sbinfo->max_inodes - sbinfo->free_inodes; |
2552 | if (percpu_counter_compare(&sbinfo->used_blocks, config.max_blocks) > 0) | 2556 | if (percpu_counter_compare(&sbinfo->used_blocks, config.max_blocks) > 0) |
2553 | goto out; | 2557 | goto out; |
2554 | if (config.max_inodes < inodes) | 2558 | if (config.max_inodes < inodes) |
2555 | goto out; | 2559 | goto out; |
2556 | /* | 2560 | /* |
2557 | * Those tests disallow limited->unlimited while any are in use; | 2561 | * Those tests disallow limited->unlimited while any are in use; |
2558 | * but we must separately disallow unlimited->limited, because | 2562 | * but we must separately disallow unlimited->limited, because |
2559 | * in that case we have no record of how much is already in use. | 2563 | * in that case we have no record of how much is already in use. |
2560 | */ | 2564 | */ |
2561 | if (config.max_blocks && !sbinfo->max_blocks) | 2565 | if (config.max_blocks && !sbinfo->max_blocks) |
2562 | goto out; | 2566 | goto out; |
2563 | if (config.max_inodes && !sbinfo->max_inodes) | 2567 | if (config.max_inodes && !sbinfo->max_inodes) |
2564 | goto out; | 2568 | goto out; |
2565 | 2569 | ||
2566 | error = 0; | 2570 | error = 0; |
2567 | sbinfo->max_blocks = config.max_blocks; | 2571 | sbinfo->max_blocks = config.max_blocks; |
2568 | sbinfo->max_inodes = config.max_inodes; | 2572 | sbinfo->max_inodes = config.max_inodes; |
2569 | sbinfo->free_inodes = config.max_inodes - inodes; | 2573 | sbinfo->free_inodes = config.max_inodes - inodes; |
2570 | 2574 | ||
2571 | /* | 2575 | /* |
2572 | * Preserve previous mempolicy unless mpol remount option was specified. | 2576 | * Preserve previous mempolicy unless mpol remount option was specified. |
2573 | */ | 2577 | */ |
2574 | if (config.mpol) { | 2578 | if (config.mpol) { |
2575 | mpol_put(sbinfo->mpol); | 2579 | mpol_put(sbinfo->mpol); |
2576 | sbinfo->mpol = config.mpol; /* transfers initial ref */ | 2580 | sbinfo->mpol = config.mpol; /* transfers initial ref */ |
2577 | } | 2581 | } |
2578 | out: | 2582 | out: |
2579 | spin_unlock(&sbinfo->stat_lock); | 2583 | spin_unlock(&sbinfo->stat_lock); |
2580 | return error; | 2584 | return error; |
2581 | } | 2585 | } |
2582 | 2586 | ||
2583 | static int shmem_show_options(struct seq_file *seq, struct dentry *root) | 2587 | static int shmem_show_options(struct seq_file *seq, struct dentry *root) |
2584 | { | 2588 | { |
2585 | struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb); | 2589 | struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb); |
2586 | 2590 | ||
2587 | if (sbinfo->max_blocks != shmem_default_max_blocks()) | 2591 | if (sbinfo->max_blocks != shmem_default_max_blocks()) |
2588 | seq_printf(seq, ",size=%luk", | 2592 | seq_printf(seq, ",size=%luk", |
2589 | sbinfo->max_blocks << (PAGE_CACHE_SHIFT - 10)); | 2593 | sbinfo->max_blocks << (PAGE_CACHE_SHIFT - 10)); |
2590 | if (sbinfo->max_inodes != shmem_default_max_inodes()) | 2594 | if (sbinfo->max_inodes != shmem_default_max_inodes()) |
2591 | seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes); | 2595 | seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes); |
2592 | if (sbinfo->mode != (S_IRWXUGO | S_ISVTX)) | 2596 | if (sbinfo->mode != (S_IRWXUGO | S_ISVTX)) |
2593 | seq_printf(seq, ",mode=%03ho", sbinfo->mode); | 2597 | seq_printf(seq, ",mode=%03ho", sbinfo->mode); |
2594 | if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID)) | 2598 | if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID)) |
2595 | seq_printf(seq, ",uid=%u", | 2599 | seq_printf(seq, ",uid=%u", |
2596 | from_kuid_munged(&init_user_ns, sbinfo->uid)); | 2600 | from_kuid_munged(&init_user_ns, sbinfo->uid)); |
2597 | if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID)) | 2601 | if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID)) |
2598 | seq_printf(seq, ",gid=%u", | 2602 | seq_printf(seq, ",gid=%u", |
2599 | from_kgid_munged(&init_user_ns, sbinfo->gid)); | 2603 | from_kgid_munged(&init_user_ns, sbinfo->gid)); |
2600 | shmem_show_mpol(seq, sbinfo->mpol); | 2604 | shmem_show_mpol(seq, sbinfo->mpol); |
2601 | return 0; | 2605 | return 0; |
2602 | } | 2606 | } |
2603 | #endif /* CONFIG_TMPFS */ | 2607 | #endif /* CONFIG_TMPFS */ |
2604 | 2608 | ||
2605 | static void shmem_put_super(struct super_block *sb) | 2609 | static void shmem_put_super(struct super_block *sb) |
2606 | { | 2610 | { |
2607 | struct shmem_sb_info *sbinfo = SHMEM_SB(sb); | 2611 | struct shmem_sb_info *sbinfo = SHMEM_SB(sb); |
2608 | 2612 | ||
2609 | percpu_counter_destroy(&sbinfo->used_blocks); | 2613 | percpu_counter_destroy(&sbinfo->used_blocks); |
2610 | mpol_put(sbinfo->mpol); | 2614 | mpol_put(sbinfo->mpol); |
2611 | kfree(sbinfo); | 2615 | kfree(sbinfo); |
2612 | sb->s_fs_info = NULL; | 2616 | sb->s_fs_info = NULL; |
2613 | } | 2617 | } |
2614 | 2618 | ||
2615 | int shmem_fill_super(struct super_block *sb, void *data, int silent) | 2619 | int shmem_fill_super(struct super_block *sb, void *data, int silent) |
2616 | { | 2620 | { |
2617 | struct inode *inode; | 2621 | struct inode *inode; |
2618 | struct shmem_sb_info *sbinfo; | 2622 | struct shmem_sb_info *sbinfo; |
2619 | int err = -ENOMEM; | 2623 | int err = -ENOMEM; |
2620 | 2624 | ||
2621 | /* Round up to L1_CACHE_BYTES to resist false sharing */ | 2625 | /* Round up to L1_CACHE_BYTES to resist false sharing */ |
2622 | sbinfo = kzalloc(max((int)sizeof(struct shmem_sb_info), | 2626 | sbinfo = kzalloc(max((int)sizeof(struct shmem_sb_info), |
2623 | L1_CACHE_BYTES), GFP_KERNEL); | 2627 | L1_CACHE_BYTES), GFP_KERNEL); |
2624 | if (!sbinfo) | 2628 | if (!sbinfo) |
2625 | return -ENOMEM; | 2629 | return -ENOMEM; |
2626 | 2630 | ||
2627 | sbinfo->mode = S_IRWXUGO | S_ISVTX; | 2631 | sbinfo->mode = S_IRWXUGO | S_ISVTX; |
2628 | sbinfo->uid = current_fsuid(); | 2632 | sbinfo->uid = current_fsuid(); |
2629 | sbinfo->gid = current_fsgid(); | 2633 | sbinfo->gid = current_fsgid(); |
2630 | sb->s_fs_info = sbinfo; | 2634 | sb->s_fs_info = sbinfo; |
2631 | 2635 | ||
2632 | #ifdef CONFIG_TMPFS | 2636 | #ifdef CONFIG_TMPFS |
2633 | /* | 2637 | /* |
2634 | * By default we only allow half of the physical RAM per | 2638 | * By default we only allow half of the physical RAM per |
2635 | * tmpfs instance, limiting inodes to one per page of lowmem; | 2639 | * tmpfs instance, limiting inodes to one per page of lowmem; |
2636 | * but the internal instance is left unlimited. | 2640 | * but the internal instance is left unlimited. |
2637 | */ | 2641 | */ |
2638 | if (!(sb->s_flags & MS_KERNMOUNT)) { | 2642 | if (!(sb->s_flags & MS_KERNMOUNT)) { |
2639 | sbinfo->max_blocks = shmem_default_max_blocks(); | 2643 | sbinfo->max_blocks = shmem_default_max_blocks(); |
2640 | sbinfo->max_inodes = shmem_default_max_inodes(); | 2644 | sbinfo->max_inodes = shmem_default_max_inodes(); |
2641 | if (shmem_parse_options(data, sbinfo, false)) { | 2645 | if (shmem_parse_options(data, sbinfo, false)) { |
2642 | err = -EINVAL; | 2646 | err = -EINVAL; |
2643 | goto failed; | 2647 | goto failed; |
2644 | } | 2648 | } |
2645 | } else { | 2649 | } else { |
2646 | sb->s_flags |= MS_NOUSER; | 2650 | sb->s_flags |= MS_NOUSER; |
2647 | } | 2651 | } |
2648 | sb->s_export_op = &shmem_export_ops; | 2652 | sb->s_export_op = &shmem_export_ops; |
2649 | sb->s_flags |= MS_NOSEC; | 2653 | sb->s_flags |= MS_NOSEC; |
2650 | #else | 2654 | #else |
2651 | sb->s_flags |= MS_NOUSER; | 2655 | sb->s_flags |= MS_NOUSER; |
2652 | #endif | 2656 | #endif |
2653 | 2657 | ||
2654 | spin_lock_init(&sbinfo->stat_lock); | 2658 | spin_lock_init(&sbinfo->stat_lock); |
2655 | if (percpu_counter_init(&sbinfo->used_blocks, 0)) | 2659 | if (percpu_counter_init(&sbinfo->used_blocks, 0)) |
2656 | goto failed; | 2660 | goto failed; |
2657 | sbinfo->free_inodes = sbinfo->max_inodes; | 2661 | sbinfo->free_inodes = sbinfo->max_inodes; |
2658 | 2662 | ||
2659 | sb->s_maxbytes = MAX_LFS_FILESIZE; | 2663 | sb->s_maxbytes = MAX_LFS_FILESIZE; |
2660 | sb->s_blocksize = PAGE_CACHE_SIZE; | 2664 | sb->s_blocksize = PAGE_CACHE_SIZE; |
2661 | sb->s_blocksize_bits = PAGE_CACHE_SHIFT; | 2665 | sb->s_blocksize_bits = PAGE_CACHE_SHIFT; |
2662 | sb->s_magic = TMPFS_MAGIC; | 2666 | sb->s_magic = TMPFS_MAGIC; |
2663 | sb->s_op = &shmem_ops; | 2667 | sb->s_op = &shmem_ops; |
2664 | sb->s_time_gran = 1; | 2668 | sb->s_time_gran = 1; |
2665 | #ifdef CONFIG_TMPFS_XATTR | 2669 | #ifdef CONFIG_TMPFS_XATTR |
2666 | sb->s_xattr = shmem_xattr_handlers; | 2670 | sb->s_xattr = shmem_xattr_handlers; |
2667 | #endif | 2671 | #endif |
2668 | #ifdef CONFIG_TMPFS_POSIX_ACL | 2672 | #ifdef CONFIG_TMPFS_POSIX_ACL |
2669 | sb->s_flags |= MS_POSIXACL; | 2673 | sb->s_flags |= MS_POSIXACL; |
2670 | #endif | 2674 | #endif |
2671 | 2675 | ||
2672 | inode = shmem_get_inode(sb, NULL, S_IFDIR | sbinfo->mode, 0, VM_NORESERVE); | 2676 | inode = shmem_get_inode(sb, NULL, S_IFDIR | sbinfo->mode, 0, VM_NORESERVE); |
2673 | if (!inode) | 2677 | if (!inode) |
2674 | goto failed; | 2678 | goto failed; |
2675 | inode->i_uid = sbinfo->uid; | 2679 | inode->i_uid = sbinfo->uid; |
2676 | inode->i_gid = sbinfo->gid; | 2680 | inode->i_gid = sbinfo->gid; |
2677 | sb->s_root = d_make_root(inode); | 2681 | sb->s_root = d_make_root(inode); |
2678 | if (!sb->s_root) | 2682 | if (!sb->s_root) |
2679 | goto failed; | 2683 | goto failed; |
2680 | return 0; | 2684 | return 0; |
2681 | 2685 | ||
2682 | failed: | 2686 | failed: |
2683 | shmem_put_super(sb); | 2687 | shmem_put_super(sb); |
2684 | return err; | 2688 | return err; |
2685 | } | 2689 | } |
2686 | 2690 | ||
2687 | static struct kmem_cache *shmem_inode_cachep; | 2691 | static struct kmem_cache *shmem_inode_cachep; |
2688 | 2692 | ||
2689 | static struct inode *shmem_alloc_inode(struct super_block *sb) | 2693 | static struct inode *shmem_alloc_inode(struct super_block *sb) |
2690 | { | 2694 | { |
2691 | struct shmem_inode_info *info; | 2695 | struct shmem_inode_info *info; |
2692 | info = kmem_cache_alloc(shmem_inode_cachep, GFP_KERNEL); | 2696 | info = kmem_cache_alloc(shmem_inode_cachep, GFP_KERNEL); |
2693 | if (!info) | 2697 | if (!info) |
2694 | return NULL; | 2698 | return NULL; |
2695 | return &info->vfs_inode; | 2699 | return &info->vfs_inode; |
2696 | } | 2700 | } |
2697 | 2701 | ||
2698 | static void shmem_destroy_callback(struct rcu_head *head) | 2702 | static void shmem_destroy_callback(struct rcu_head *head) |
2699 | { | 2703 | { |
2700 | struct inode *inode = container_of(head, struct inode, i_rcu); | 2704 | struct inode *inode = container_of(head, struct inode, i_rcu); |
2701 | kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode)); | 2705 | kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode)); |
2702 | } | 2706 | } |
2703 | 2707 | ||
2704 | static void shmem_destroy_inode(struct inode *inode) | 2708 | static void shmem_destroy_inode(struct inode *inode) |
2705 | { | 2709 | { |
2706 | if (S_ISREG(inode->i_mode)) | 2710 | if (S_ISREG(inode->i_mode)) |
2707 | mpol_free_shared_policy(&SHMEM_I(inode)->policy); | 2711 | mpol_free_shared_policy(&SHMEM_I(inode)->policy); |
2708 | call_rcu(&inode->i_rcu, shmem_destroy_callback); | 2712 | call_rcu(&inode->i_rcu, shmem_destroy_callback); |
2709 | } | 2713 | } |
2710 | 2714 | ||
2711 | static void shmem_init_inode(void *foo) | 2715 | static void shmem_init_inode(void *foo) |
2712 | { | 2716 | { |
2713 | struct shmem_inode_info *info = foo; | 2717 | struct shmem_inode_info *info = foo; |
2714 | inode_init_once(&info->vfs_inode); | 2718 | inode_init_once(&info->vfs_inode); |
2715 | } | 2719 | } |
2716 | 2720 | ||
2717 | static int shmem_init_inodecache(void) | 2721 | static int shmem_init_inodecache(void) |
2718 | { | 2722 | { |
2719 | shmem_inode_cachep = kmem_cache_create("shmem_inode_cache", | 2723 | shmem_inode_cachep = kmem_cache_create("shmem_inode_cache", |
2720 | sizeof(struct shmem_inode_info), | 2724 | sizeof(struct shmem_inode_info), |
2721 | 0, SLAB_PANIC, shmem_init_inode); | 2725 | 0, SLAB_PANIC, shmem_init_inode); |
2722 | return 0; | 2726 | return 0; |
2723 | } | 2727 | } |
2724 | 2728 | ||
2725 | static void shmem_destroy_inodecache(void) | 2729 | static void shmem_destroy_inodecache(void) |
2726 | { | 2730 | { |
2727 | kmem_cache_destroy(shmem_inode_cachep); | 2731 | kmem_cache_destroy(shmem_inode_cachep); |
2728 | } | 2732 | } |
2729 | 2733 | ||
2730 | static const struct address_space_operations shmem_aops = { | 2734 | static const struct address_space_operations shmem_aops = { |
2731 | .writepage = shmem_writepage, | 2735 | .writepage = shmem_writepage, |
2732 | .set_page_dirty = __set_page_dirty_no_writeback, | 2736 | .set_page_dirty = __set_page_dirty_no_writeback, |
2733 | #ifdef CONFIG_TMPFS | 2737 | #ifdef CONFIG_TMPFS |
2734 | .write_begin = shmem_write_begin, | 2738 | .write_begin = shmem_write_begin, |
2735 | .write_end = shmem_write_end, | 2739 | .write_end = shmem_write_end, |
2736 | #endif | 2740 | #endif |
2737 | .migratepage = migrate_page, | 2741 | .migratepage = migrate_page, |
2738 | .error_remove_page = generic_error_remove_page, | 2742 | .error_remove_page = generic_error_remove_page, |
2739 | }; | 2743 | }; |
2740 | 2744 | ||
2741 | static const struct file_operations shmem_file_operations = { | 2745 | static const struct file_operations shmem_file_operations = { |
2742 | .mmap = shmem_mmap, | 2746 | .mmap = shmem_mmap, |
2743 | #ifdef CONFIG_TMPFS | 2747 | #ifdef CONFIG_TMPFS |
2744 | .llseek = shmem_file_llseek, | 2748 | .llseek = shmem_file_llseek, |
2745 | .read = do_sync_read, | 2749 | .read = do_sync_read, |
2746 | .write = do_sync_write, | 2750 | .write = do_sync_write, |
2747 | .aio_read = shmem_file_aio_read, | 2751 | .aio_read = shmem_file_aio_read, |
2748 | .aio_write = generic_file_aio_write, | 2752 | .aio_write = generic_file_aio_write, |
2749 | .fsync = noop_fsync, | 2753 | .fsync = noop_fsync, |
2750 | .splice_read = shmem_file_splice_read, | 2754 | .splice_read = shmem_file_splice_read, |
2751 | .splice_write = generic_file_splice_write, | 2755 | .splice_write = generic_file_splice_write, |
2752 | .fallocate = shmem_fallocate, | 2756 | .fallocate = shmem_fallocate, |
2753 | #endif | 2757 | #endif |
2754 | }; | 2758 | }; |
2755 | 2759 | ||
2756 | static const struct inode_operations shmem_inode_operations = { | 2760 | static const struct inode_operations shmem_inode_operations = { |
2757 | .setattr = shmem_setattr, | 2761 | .setattr = shmem_setattr, |
2758 | #ifdef CONFIG_TMPFS_XATTR | 2762 | #ifdef CONFIG_TMPFS_XATTR |
2759 | .setxattr = shmem_setxattr, | 2763 | .setxattr = shmem_setxattr, |
2760 | .getxattr = shmem_getxattr, | 2764 | .getxattr = shmem_getxattr, |
2761 | .listxattr = shmem_listxattr, | 2765 | .listxattr = shmem_listxattr, |
2762 | .removexattr = shmem_removexattr, | 2766 | .removexattr = shmem_removexattr, |
2763 | #endif | 2767 | #endif |
2764 | }; | 2768 | }; |
2765 | 2769 | ||
2766 | static const struct inode_operations shmem_dir_inode_operations = { | 2770 | static const struct inode_operations shmem_dir_inode_operations = { |
2767 | #ifdef CONFIG_TMPFS | 2771 | #ifdef CONFIG_TMPFS |
2768 | .create = shmem_create, | 2772 | .create = shmem_create, |
2769 | .lookup = simple_lookup, | 2773 | .lookup = simple_lookup, |
2770 | .link = shmem_link, | 2774 | .link = shmem_link, |
2771 | .unlink = shmem_unlink, | 2775 | .unlink = shmem_unlink, |
2772 | .symlink = shmem_symlink, | 2776 | .symlink = shmem_symlink, |
2773 | .mkdir = shmem_mkdir, | 2777 | .mkdir = shmem_mkdir, |
2774 | .rmdir = shmem_rmdir, | 2778 | .rmdir = shmem_rmdir, |
2775 | .mknod = shmem_mknod, | 2779 | .mknod = shmem_mknod, |
2776 | .rename = shmem_rename, | 2780 | .rename = shmem_rename, |
2777 | .tmpfile = shmem_tmpfile, | 2781 | .tmpfile = shmem_tmpfile, |
2778 | #endif | 2782 | #endif |
2779 | #ifdef CONFIG_TMPFS_XATTR | 2783 | #ifdef CONFIG_TMPFS_XATTR |
2780 | .setxattr = shmem_setxattr, | 2784 | .setxattr = shmem_setxattr, |
2781 | .getxattr = shmem_getxattr, | 2785 | .getxattr = shmem_getxattr, |
2782 | .listxattr = shmem_listxattr, | 2786 | .listxattr = shmem_listxattr, |
2783 | .removexattr = shmem_removexattr, | 2787 | .removexattr = shmem_removexattr, |
2784 | #endif | 2788 | #endif |
2785 | #ifdef CONFIG_TMPFS_POSIX_ACL | 2789 | #ifdef CONFIG_TMPFS_POSIX_ACL |
2786 | .setattr = shmem_setattr, | 2790 | .setattr = shmem_setattr, |
2787 | #endif | 2791 | #endif |
2788 | }; | 2792 | }; |
2789 | 2793 | ||
2790 | static const struct inode_operations shmem_special_inode_operations = { | 2794 | static const struct inode_operations shmem_special_inode_operations = { |
2791 | #ifdef CONFIG_TMPFS_XATTR | 2795 | #ifdef CONFIG_TMPFS_XATTR |
2792 | .setxattr = shmem_setxattr, | 2796 | .setxattr = shmem_setxattr, |
2793 | .getxattr = shmem_getxattr, | 2797 | .getxattr = shmem_getxattr, |
2794 | .listxattr = shmem_listxattr, | 2798 | .listxattr = shmem_listxattr, |
2795 | .removexattr = shmem_removexattr, | 2799 | .removexattr = shmem_removexattr, |
2796 | #endif | 2800 | #endif |
2797 | #ifdef CONFIG_TMPFS_POSIX_ACL | 2801 | #ifdef CONFIG_TMPFS_POSIX_ACL |
2798 | .setattr = shmem_setattr, | 2802 | .setattr = shmem_setattr, |
2799 | #endif | 2803 | #endif |
2800 | }; | 2804 | }; |
2801 | 2805 | ||
2802 | static const struct super_operations shmem_ops = { | 2806 | static const struct super_operations shmem_ops = { |
2803 | .alloc_inode = shmem_alloc_inode, | 2807 | .alloc_inode = shmem_alloc_inode, |
2804 | .destroy_inode = shmem_destroy_inode, | 2808 | .destroy_inode = shmem_destroy_inode, |
2805 | #ifdef CONFIG_TMPFS | 2809 | #ifdef CONFIG_TMPFS |
2806 | .statfs = shmem_statfs, | 2810 | .statfs = shmem_statfs, |
2807 | .remount_fs = shmem_remount_fs, | 2811 | .remount_fs = shmem_remount_fs, |
2808 | .show_options = shmem_show_options, | 2812 | .show_options = shmem_show_options, |
2809 | #endif | 2813 | #endif |
2810 | .evict_inode = shmem_evict_inode, | 2814 | .evict_inode = shmem_evict_inode, |
2811 | .drop_inode = generic_delete_inode, | 2815 | .drop_inode = generic_delete_inode, |
2812 | .put_super = shmem_put_super, | 2816 | .put_super = shmem_put_super, |
2813 | }; | 2817 | }; |
2814 | 2818 | ||
2815 | static const struct vm_operations_struct shmem_vm_ops = { | 2819 | static const struct vm_operations_struct shmem_vm_ops = { |
2816 | .fault = shmem_fault, | 2820 | .fault = shmem_fault, |
2817 | #ifdef CONFIG_NUMA | 2821 | #ifdef CONFIG_NUMA |
2818 | .set_policy = shmem_set_policy, | 2822 | .set_policy = shmem_set_policy, |
2819 | .get_policy = shmem_get_policy, | 2823 | .get_policy = shmem_get_policy, |
2820 | #endif | 2824 | #endif |
2821 | .remap_pages = generic_file_remap_pages, | 2825 | .remap_pages = generic_file_remap_pages, |
2822 | }; | 2826 | }; |
2823 | 2827 | ||
2824 | static struct dentry *shmem_mount(struct file_system_type *fs_type, | 2828 | static struct dentry *shmem_mount(struct file_system_type *fs_type, |
2825 | int flags, const char *dev_name, void *data) | 2829 | int flags, const char *dev_name, void *data) |
2826 | { | 2830 | { |
2827 | return mount_nodev(fs_type, flags, data, shmem_fill_super); | 2831 | return mount_nodev(fs_type, flags, data, shmem_fill_super); |
2828 | } | 2832 | } |
2829 | 2833 | ||
2830 | static struct file_system_type shmem_fs_type = { | 2834 | static struct file_system_type shmem_fs_type = { |
2831 | .owner = THIS_MODULE, | 2835 | .owner = THIS_MODULE, |
2832 | .name = "tmpfs", | 2836 | .name = "tmpfs", |
2833 | .mount = shmem_mount, | 2837 | .mount = shmem_mount, |
2834 | .kill_sb = kill_litter_super, | 2838 | .kill_sb = kill_litter_super, |
2835 | .fs_flags = FS_USERNS_MOUNT, | 2839 | .fs_flags = FS_USERNS_MOUNT, |
2836 | }; | 2840 | }; |
2837 | 2841 | ||
2838 | int __init shmem_init(void) | 2842 | int __init shmem_init(void) |
2839 | { | 2843 | { |
2840 | int error; | 2844 | int error; |
2841 | 2845 | ||
2842 | /* If rootfs called this, don't re-init */ | 2846 | /* If rootfs called this, don't re-init */ |
2843 | if (shmem_inode_cachep) | 2847 | if (shmem_inode_cachep) |
2844 | return 0; | 2848 | return 0; |
2845 | 2849 | ||
2846 | error = bdi_init(&shmem_backing_dev_info); | 2850 | error = bdi_init(&shmem_backing_dev_info); |
2847 | if (error) | 2851 | if (error) |
2848 | goto out4; | 2852 | goto out4; |
2849 | 2853 | ||
2850 | error = shmem_init_inodecache(); | 2854 | error = shmem_init_inodecache(); |
2851 | if (error) | 2855 | if (error) |
2852 | goto out3; | 2856 | goto out3; |
2853 | 2857 | ||
2854 | error = register_filesystem(&shmem_fs_type); | 2858 | error = register_filesystem(&shmem_fs_type); |
2855 | if (error) { | 2859 | if (error) { |
2856 | printk(KERN_ERR "Could not register tmpfs\n"); | 2860 | printk(KERN_ERR "Could not register tmpfs\n"); |
2857 | goto out2; | 2861 | goto out2; |
2858 | } | 2862 | } |
2859 | 2863 | ||
2860 | shm_mnt = kern_mount(&shmem_fs_type); | 2864 | shm_mnt = kern_mount(&shmem_fs_type); |
2861 | if (IS_ERR(shm_mnt)) { | 2865 | if (IS_ERR(shm_mnt)) { |
2862 | error = PTR_ERR(shm_mnt); | 2866 | error = PTR_ERR(shm_mnt); |
2863 | printk(KERN_ERR "Could not kern_mount tmpfs\n"); | 2867 | printk(KERN_ERR "Could not kern_mount tmpfs\n"); |
2864 | goto out1; | 2868 | goto out1; |
2865 | } | 2869 | } |
2866 | return 0; | 2870 | return 0; |
2867 | 2871 | ||
2868 | out1: | 2872 | out1: |
2869 | unregister_filesystem(&shmem_fs_type); | 2873 | unregister_filesystem(&shmem_fs_type); |
2870 | out2: | 2874 | out2: |
2871 | shmem_destroy_inodecache(); | 2875 | shmem_destroy_inodecache(); |
2872 | out3: | 2876 | out3: |
2873 | bdi_destroy(&shmem_backing_dev_info); | 2877 | bdi_destroy(&shmem_backing_dev_info); |
2874 | out4: | 2878 | out4: |
2875 | shm_mnt = ERR_PTR(error); | 2879 | shm_mnt = ERR_PTR(error); |
2876 | return error; | 2880 | return error; |
2877 | } | 2881 | } |
2878 | 2882 | ||
2879 | #else /* !CONFIG_SHMEM */ | 2883 | #else /* !CONFIG_SHMEM */ |
2880 | 2884 | ||
2881 | /* | 2885 | /* |
2882 | * tiny-shmem: simple shmemfs and tmpfs using ramfs code | 2886 | * tiny-shmem: simple shmemfs and tmpfs using ramfs code |
2883 | * | 2887 | * |
2884 | * This is intended for small systems where the benefits of the full | 2888 | * This is intended for small systems where the benefits of the full |
2885 | * shmem code (swap-backed and resource-limited) are outweighed by | 2889 | * shmem code (swap-backed and resource-limited) are outweighed by |
2886 | * their complexity. On systems without swap this code should be | 2890 | * their complexity. On systems without swap this code should be |
2887 | * effectively equivalent, but much lighter weight. | 2891 | * effectively equivalent, but much lighter weight. |
2888 | */ | 2892 | */ |
2889 | 2893 | ||
2890 | static struct file_system_type shmem_fs_type = { | 2894 | static struct file_system_type shmem_fs_type = { |
2891 | .name = "tmpfs", | 2895 | .name = "tmpfs", |
2892 | .mount = ramfs_mount, | 2896 | .mount = ramfs_mount, |
2893 | .kill_sb = kill_litter_super, | 2897 | .kill_sb = kill_litter_super, |
2894 | .fs_flags = FS_USERNS_MOUNT, | 2898 | .fs_flags = FS_USERNS_MOUNT, |
2895 | }; | 2899 | }; |
2896 | 2900 | ||
2897 | int __init shmem_init(void) | 2901 | int __init shmem_init(void) |
2898 | { | 2902 | { |
2899 | BUG_ON(register_filesystem(&shmem_fs_type) != 0); | 2903 | BUG_ON(register_filesystem(&shmem_fs_type) != 0); |
2900 | 2904 | ||
2901 | shm_mnt = kern_mount(&shmem_fs_type); | 2905 | shm_mnt = kern_mount(&shmem_fs_type); |
2902 | BUG_ON(IS_ERR(shm_mnt)); | 2906 | BUG_ON(IS_ERR(shm_mnt)); |
2903 | 2907 | ||
2904 | return 0; | 2908 | return 0; |
2905 | } | 2909 | } |
2906 | 2910 | ||
2907 | int shmem_unuse(swp_entry_t swap, struct page *page) | 2911 | int shmem_unuse(swp_entry_t swap, struct page *page) |
2908 | { | 2912 | { |
2909 | return 0; | 2913 | return 0; |
2910 | } | 2914 | } |
2911 | 2915 | ||
2912 | int shmem_lock(struct file *file, int lock, struct user_struct *user) | 2916 | int shmem_lock(struct file *file, int lock, struct user_struct *user) |
2913 | { | 2917 | { |
2914 | return 0; | 2918 | return 0; |
2915 | } | 2919 | } |
2916 | 2920 | ||
2917 | void shmem_unlock_mapping(struct address_space *mapping) | 2921 | void shmem_unlock_mapping(struct address_space *mapping) |
2918 | { | 2922 | { |
2919 | } | 2923 | } |
2920 | 2924 | ||
2921 | void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend) | 2925 | void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend) |
2922 | { | 2926 | { |
2923 | truncate_inode_pages_range(inode->i_mapping, lstart, lend); | 2927 | truncate_inode_pages_range(inode->i_mapping, lstart, lend); |
2924 | } | 2928 | } |
2925 | EXPORT_SYMBOL_GPL(shmem_truncate_range); | 2929 | EXPORT_SYMBOL_GPL(shmem_truncate_range); |
2926 | 2930 | ||
2927 | #define shmem_vm_ops generic_file_vm_ops | 2931 | #define shmem_vm_ops generic_file_vm_ops |
2928 | #define shmem_file_operations ramfs_file_operations | 2932 | #define shmem_file_operations ramfs_file_operations |
2929 | #define shmem_get_inode(sb, dir, mode, dev, flags) ramfs_get_inode(sb, dir, mode, dev) | 2933 | #define shmem_get_inode(sb, dir, mode, dev, flags) ramfs_get_inode(sb, dir, mode, dev) |
2930 | #define shmem_acct_size(flags, size) 0 | 2934 | #define shmem_acct_size(flags, size) 0 |
2931 | #define shmem_unacct_size(flags, size) do {} while (0) | 2935 | #define shmem_unacct_size(flags, size) do {} while (0) |
2932 | 2936 | ||
2933 | #endif /* CONFIG_SHMEM */ | 2937 | #endif /* CONFIG_SHMEM */ |
2934 | 2938 | ||
2935 | /* common code */ | 2939 | /* common code */ |
2936 | 2940 | ||
2937 | static struct dentry_operations anon_ops = { | 2941 | static struct dentry_operations anon_ops = { |
2938 | .d_dname = simple_dname | 2942 | .d_dname = simple_dname |
2939 | }; | 2943 | }; |
2940 | 2944 | ||
2941 | /** | 2945 | /** |
2942 | * shmem_file_setup - get an unlinked file living in tmpfs | 2946 | * shmem_file_setup - get an unlinked file living in tmpfs |
2943 | * @name: name for dentry (to be seen in /proc/<pid>/maps) | 2947 | * @name: name for dentry (to be seen in /proc/<pid>/maps) |
2944 | * @size: size to be set for the file | 2948 | * @size: size to be set for the file |
2945 | * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size | 2949 | * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size |
2946 | */ | 2950 | */ |
2947 | struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags) | 2951 | struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags) |
2948 | { | 2952 | { |
2949 | struct file *res; | 2953 | struct file *res; |
2950 | struct inode *inode; | 2954 | struct inode *inode; |
2951 | struct path path; | 2955 | struct path path; |
2952 | struct super_block *sb; | 2956 | struct super_block *sb; |
2953 | struct qstr this; | 2957 | struct qstr this; |
2954 | 2958 | ||
2955 | if (IS_ERR(shm_mnt)) | 2959 | if (IS_ERR(shm_mnt)) |
2956 | return ERR_CAST(shm_mnt); | 2960 | return ERR_CAST(shm_mnt); |
2957 | 2961 | ||
2958 | if (size < 0 || size > MAX_LFS_FILESIZE) | 2962 | if (size < 0 || size > MAX_LFS_FILESIZE) |
2959 | return ERR_PTR(-EINVAL); | 2963 | return ERR_PTR(-EINVAL); |
2960 | 2964 | ||
2961 | if (shmem_acct_size(flags, size)) | 2965 | if (shmem_acct_size(flags, size)) |
2962 | return ERR_PTR(-ENOMEM); | 2966 | return ERR_PTR(-ENOMEM); |
2963 | 2967 | ||
2964 | res = ERR_PTR(-ENOMEM); | 2968 | res = ERR_PTR(-ENOMEM); |
2965 | this.name = name; | 2969 | this.name = name; |
2966 | this.len = strlen(name); | 2970 | this.len = strlen(name); |
2967 | this.hash = 0; /* will go */ | 2971 | this.hash = 0; /* will go */ |
2968 | sb = shm_mnt->mnt_sb; | 2972 | sb = shm_mnt->mnt_sb; |
2969 | path.dentry = d_alloc_pseudo(sb, &this); | 2973 | path.dentry = d_alloc_pseudo(sb, &this); |
2970 | if (!path.dentry) | 2974 | if (!path.dentry) |
2971 | goto put_memory; | 2975 | goto put_memory; |
2972 | d_set_d_op(path.dentry, &anon_ops); | 2976 | d_set_d_op(path.dentry, &anon_ops); |
2973 | path.mnt = mntget(shm_mnt); | 2977 | path.mnt = mntget(shm_mnt); |
2974 | 2978 | ||
2975 | res = ERR_PTR(-ENOSPC); | 2979 | res = ERR_PTR(-ENOSPC); |
2976 | inode = shmem_get_inode(sb, NULL, S_IFREG | S_IRWXUGO, 0, flags); | 2980 | inode = shmem_get_inode(sb, NULL, S_IFREG | S_IRWXUGO, 0, flags); |
2977 | if (!inode) | 2981 | if (!inode) |
2978 | goto put_dentry; | 2982 | goto put_dentry; |
2979 | 2983 | ||
2980 | d_instantiate(path.dentry, inode); | 2984 | d_instantiate(path.dentry, inode); |
2981 | inode->i_size = size; | 2985 | inode->i_size = size; |
2982 | clear_nlink(inode); /* It is unlinked */ | 2986 | clear_nlink(inode); /* It is unlinked */ |
2983 | res = ERR_PTR(ramfs_nommu_expand_for_mapping(inode, size)); | 2987 | res = ERR_PTR(ramfs_nommu_expand_for_mapping(inode, size)); |
2984 | if (IS_ERR(res)) | 2988 | if (IS_ERR(res)) |
2985 | goto put_dentry; | 2989 | goto put_dentry; |
2986 | 2990 | ||
2987 | res = alloc_file(&path, FMODE_WRITE | FMODE_READ, | 2991 | res = alloc_file(&path, FMODE_WRITE | FMODE_READ, |
2988 | &shmem_file_operations); | 2992 | &shmem_file_operations); |
2989 | if (IS_ERR(res)) | 2993 | if (IS_ERR(res)) |
2990 | goto put_dentry; | 2994 | goto put_dentry; |
2991 | 2995 | ||
2992 | return res; | 2996 | return res; |
2993 | 2997 | ||
2994 | put_dentry: | 2998 | put_dentry: |
2995 | path_put(&path); | 2999 | path_put(&path); |
2996 | put_memory: | 3000 | put_memory: |
2997 | shmem_unacct_size(flags, size); | 3001 | shmem_unacct_size(flags, size); |
2998 | return res; | 3002 | return res; |
2999 | } | 3003 | } |
3000 | EXPORT_SYMBOL_GPL(shmem_file_setup); | 3004 | EXPORT_SYMBOL_GPL(shmem_file_setup); |
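Kernel code uses shmem_file_setup() to obtain such an unlinked tmpfs file as a scratch or shared buffer. A hedged in-kernel sketch of a caller (hypothetical function name, not from this diff):

    #include <linux/err.h>
    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/shmem_fs.h>
    #include <linux/sizes.h>

    static struct file *example_scratch_file(void)
    {
            /* 1 MiB unlinked tmpfs file; VM_NORESERVE skips pre-accounting
             * of the whole object size, as the kernel-doc above notes. */
            struct file *filp = shmem_file_setup("example-scratch", SZ_1M,
                                                 VM_NORESERVE);

            if (IS_ERR(filp))
                    return filp;
            /* ... use filp, e.g. via shmem_read_mapping_page_gfp() ... */
            return filp;    /* the caller drops it with fput() when done */
    }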
3001 | 3005 | ||
3002 | /** | 3006 | /** |
3003 | * shmem_zero_setup - setup a shared anonymous mapping | 3007 | * shmem_zero_setup - setup a shared anonymous mapping |
3004 | * @vma: the vma to be mmapped is prepared by do_mmap_pgoff | 3008 | * @vma: the vma to be mmapped is prepared by do_mmap_pgoff |
3005 | */ | 3009 | */ |
3006 | int shmem_zero_setup(struct vm_area_struct *vma) | 3010 | int shmem_zero_setup(struct vm_area_struct *vma) |
3007 | { | 3011 | { |
3008 | struct file *file; | 3012 | struct file *file; |
3009 | loff_t size = vma->vm_end - vma->vm_start; | 3013 | loff_t size = vma->vm_end - vma->vm_start; |
3010 | 3014 | ||
3011 | file = shmem_file_setup("dev/zero", size, vma->vm_flags); | 3015 | file = shmem_file_setup("dev/zero", size, vma->vm_flags); |
3012 | if (IS_ERR(file)) | 3016 | if (IS_ERR(file)) |
3013 | return PTR_ERR(file); | 3017 | return PTR_ERR(file); |
3014 | 3018 | ||
3015 | if (vma->vm_file) | 3019 | if (vma->vm_file) |
3016 | fput(vma->vm_file); | 3020 | fput(vma->vm_file); |
3017 | vma->vm_file = file; | 3021 | vma->vm_file = file; |
3018 | vma->vm_ops = &shmem_vm_ops; | 3022 | vma->vm_ops = &shmem_vm_ops; |
3019 | return 0; | 3023 | return 0; |
3020 | } | 3024 | } |
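shmem_zero_setup() is what gives a shared anonymous mapping its tmpfs backing, which is why writes through such a mapping are visible across fork(). A minimal userspace illustration (not part of the diff):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            /* MAP_SHARED | MAP_ANONYMOUS is backed by an unlinked tmpfs
             * file set up through shmem_zero_setup(). */
            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            if (fork() == 0) {
                    strcpy(p, "written by child");
                    _exit(0);
            }
            wait(NULL);
            printf("%s\n", p);      /* "written by child" */
            return 0;
    }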
3021 | 3025 | ||
3022 | /** | 3026 | /** |
3023 | * shmem_read_mapping_page_gfp - read into page cache, using specified page allocation flags. | 3027 | * shmem_read_mapping_page_gfp - read into page cache, using specified page allocation flags. |
3024 | * @mapping: the page's address_space | 3028 | * @mapping: the page's address_space |
3025 | * @index: the page index | 3029 | * @index: the page index |
3026 | * @gfp: the page allocator flags to use if allocating | 3030 | * @gfp: the page allocator flags to use if allocating |
3027 | * | 3031 | * |
3028 | * This behaves as a tmpfs "read_cache_page_gfp(mapping, index, gfp)", | 3032 | * This behaves as a tmpfs "read_cache_page_gfp(mapping, index, gfp)", |
3029 | * with any new page allocations done using the specified allocation flags. | 3033 | * with any new page allocations done using the specified allocation flags. |
3030 | * But read_cache_page_gfp() uses the ->readpage() method: which does not | 3034 | * But read_cache_page_gfp() uses the ->readpage() method: which does not |
3031 | * suit tmpfs, since it may have pages in swapcache, and needs to find those | 3035 | * suit tmpfs, since it may have pages in swapcache, and needs to find those |
3032 | * for itself; although drivers/gpu/drm i915 and ttm rely upon this support. | 3036 | * for itself; although drivers/gpu/drm i915 and ttm rely upon this support. |
3033 | * | 3037 | * |
3034 | * i915_gem_object_get_pages_gtt() mixes __GFP_NORETRY | __GFP_NOWARN in | 3038 | * i915_gem_object_get_pages_gtt() mixes __GFP_NORETRY | __GFP_NOWARN in |
3035 | * with the mapping_gfp_mask(), to avoid OOMing the machine unnecessarily. | 3039 | * with the mapping_gfp_mask(), to avoid OOMing the machine unnecessarily. |
3036 | */ | 3040 | */ |
3037 | struct page *shmem_read_mapping_page_gfp(struct address_space *mapping, | 3041 | struct page *shmem_read_mapping_page_gfp(struct address_space *mapping, |
3038 | pgoff_t index, gfp_t gfp) | 3042 | pgoff_t index, gfp_t gfp) |
3039 | { | 3043 | { |
3040 | #ifdef CONFIG_SHMEM | 3044 | #ifdef CONFIG_SHMEM |
3041 | struct inode *inode = mapping->host; | 3045 | struct inode *inode = mapping->host; |
3042 | struct page *page; | 3046 | struct page *page; |
3043 | int error; | 3047 | int error; |
3044 | 3048 | ||
3045 | BUG_ON(mapping->a_ops != &shmem_aops); | 3049 | BUG_ON(mapping->a_ops != &shmem_aops); |
3046 | error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE, gfp, NULL); | 3050 | error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE, gfp, NULL); |
3047 | if (error) | 3051 | if (error) |
3048 | page = ERR_PTR(error); | 3052 | page = ERR_PTR(error); |
3049 | else | 3053 | else |
3050 | unlock_page(page); | 3054 | unlock_page(page); |
3051 | return page; | 3055 | return page; |
3052 | #else | 3056 | #else |
3053 | /* | 3057 | /* |
3054 | * The tiny !SHMEM case uses ramfs without swap | 3058 | * The tiny !SHMEM case uses ramfs without swap |
3055 | */ | 3059 | */ |
3056 | return read_cache_page_gfp(mapping, index, gfp); | 3060 | return read_cache_page_gfp(mapping, index, gfp); |
3057 | #endif | 3061 | #endif |
3058 | } | 3062 | } |
3059 | EXPORT_SYMBOL_GPL(shmem_read_mapping_page_gfp); | 3063 | EXPORT_SYMBOL_GPL(shmem_read_mapping_page_gfp); |
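The kernel-doc above points at the i915/ttm use case; a hedged kernel-side sketch of how a driver might pin pages from a shmem-backed mapping with the relaxed gfp mask the comment describes (example_pin_pages() is hypothetical, not taken from this diff):

    #include <linux/err.h>
    #include <linux/pagemap.h>
    #include <linux/shmem_fs.h>

    static int example_pin_pages(struct address_space *mapping,
                                 struct page **pages, pgoff_t nr)
    {
            /* fail the allocation back to the caller rather than OOMing */
            gfp_t gfp = mapping_gfp_mask(mapping) |
                        __GFP_NORETRY | __GFP_NOWARN;
            pgoff_t i;

            for (i = 0; i < nr; i++) {
                    struct page *page =
                            shmem_read_mapping_page_gfp(mapping, i, gfp);

                    if (IS_ERR(page)) {
                            while (i--)
                                    page_cache_release(pages[i]);
                            return PTR_ERR(page);
                    }
                    pages[i] = page;        /* reference held by us */
            }
            return 0;
    }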
3060 | 3064 |
mm/swap.c
1 | /* | 1 | /* |
2 | * linux/mm/swap.c | 2 | * linux/mm/swap.c |
3 | * | 3 | * |
4 | * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds | 4 | * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds |
5 | */ | 5 | */ |
6 | 6 | ||
7 | /* | 7 | /* |
8 | * This file contains the default values for the operation of the | 8 | * This file contains the default values for the operation of the |
9 | * Linux VM subsystem. Fine-tuning documentation can be found in | 9 | * Linux VM subsystem. Fine-tuning documentation can be found in |
10 | * Documentation/sysctl/vm.txt. | 10 | * Documentation/sysctl/vm.txt. |
11 | * Started 18.12.91 | 11 | * Started 18.12.91 |
12 | * Swap aging added 23.2.95, Stephen Tweedie. | 12 | * Swap aging added 23.2.95, Stephen Tweedie. |
13 | * Buffermem limits added 12.3.98, Rik van Riel. | 13 | * Buffermem limits added 12.3.98, Rik van Riel. |
14 | */ | 14 | */ |
15 | 15 | ||
16 | #include <linux/mm.h> | 16 | #include <linux/mm.h> |
17 | #include <linux/sched.h> | 17 | #include <linux/sched.h> |
18 | #include <linux/kernel_stat.h> | 18 | #include <linux/kernel_stat.h> |
19 | #include <linux/swap.h> | 19 | #include <linux/swap.h> |
20 | #include <linux/mman.h> | 20 | #include <linux/mman.h> |
21 | #include <linux/pagemap.h> | 21 | #include <linux/pagemap.h> |
22 | #include <linux/pagevec.h> | 22 | #include <linux/pagevec.h> |
23 | #include <linux/init.h> | 23 | #include <linux/init.h> |
24 | #include <linux/export.h> | 24 | #include <linux/export.h> |
25 | #include <linux/mm_inline.h> | 25 | #include <linux/mm_inline.h> |
26 | #include <linux/percpu_counter.h> | 26 | #include <linux/percpu_counter.h> |
27 | #include <linux/percpu.h> | 27 | #include <linux/percpu.h> |
28 | #include <linux/cpu.h> | 28 | #include <linux/cpu.h> |
29 | #include <linux/notifier.h> | 29 | #include <linux/notifier.h> |
30 | #include <linux/backing-dev.h> | 30 | #include <linux/backing-dev.h> |
31 | #include <linux/memcontrol.h> | 31 | #include <linux/memcontrol.h> |
32 | #include <linux/gfp.h> | 32 | #include <linux/gfp.h> |
33 | #include <linux/uio.h> | 33 | #include <linux/uio.h> |
34 | #include <linux/hugetlb.h> | 34 | #include <linux/hugetlb.h> |
35 | 35 | ||
36 | #include "internal.h" | 36 | #include "internal.h" |
37 | 37 | ||
38 | #define CREATE_TRACE_POINTS | 38 | #define CREATE_TRACE_POINTS |
39 | #include <trace/events/pagemap.h> | 39 | #include <trace/events/pagemap.h> |
40 | 40 | ||
41 | /* How many pages do we try to swap or page in/out together? */ | 41 | /* How many pages do we try to swap or page in/out together? */ |
42 | int page_cluster; | 42 | int page_cluster; |
43 | 43 | ||
44 | static DEFINE_PER_CPU(struct pagevec, lru_add_pvec); | 44 | static DEFINE_PER_CPU(struct pagevec, lru_add_pvec); |
45 | static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs); | 45 | static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs); |
46 | static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs); | 46 | static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs); |
47 | 47 | ||
48 | /* | 48 | /* |
49 | * This path almost never happens for VM activity - pages are normally | 49 | * This path almost never happens for VM activity - pages are normally |
50 | * freed via pagevecs. But it gets used by networking. | 50 | * freed via pagevecs. But it gets used by networking. |
51 | */ | 51 | */ |
52 | static void __page_cache_release(struct page *page) | 52 | static void __page_cache_release(struct page *page) |
53 | { | 53 | { |
54 | if (PageLRU(page)) { | 54 | if (PageLRU(page)) { |
55 | struct zone *zone = page_zone(page); | 55 | struct zone *zone = page_zone(page); |
56 | struct lruvec *lruvec; | 56 | struct lruvec *lruvec; |
57 | unsigned long flags; | 57 | unsigned long flags; |
58 | 58 | ||
59 | spin_lock_irqsave(&zone->lru_lock, flags); | 59 | spin_lock_irqsave(&zone->lru_lock, flags); |
60 | lruvec = mem_cgroup_page_lruvec(page, zone); | 60 | lruvec = mem_cgroup_page_lruvec(page, zone); |
61 | VM_BUG_ON(!PageLRU(page)); | 61 | VM_BUG_ON(!PageLRU(page)); |
62 | __ClearPageLRU(page); | 62 | __ClearPageLRU(page); |
63 | del_page_from_lru_list(page, lruvec, page_off_lru(page)); | 63 | del_page_from_lru_list(page, lruvec, page_off_lru(page)); |
64 | spin_unlock_irqrestore(&zone->lru_lock, flags); | 64 | spin_unlock_irqrestore(&zone->lru_lock, flags); |
65 | } | 65 | } |
66 | } | 66 | } |
67 | 67 | ||
68 | static void __put_single_page(struct page *page) | 68 | static void __put_single_page(struct page *page) |
69 | { | 69 | { |
70 | __page_cache_release(page); | 70 | __page_cache_release(page); |
71 | free_hot_cold_page(page, false); | 71 | free_hot_cold_page(page, false); |
72 | } | 72 | } |
73 | 73 | ||
74 | static void __put_compound_page(struct page *page) | 74 | static void __put_compound_page(struct page *page) |
75 | { | 75 | { |
76 | compound_page_dtor *dtor; | 76 | compound_page_dtor *dtor; |
77 | 77 | ||
78 | __page_cache_release(page); | 78 | __page_cache_release(page); |
79 | dtor = get_compound_page_dtor(page); | 79 | dtor = get_compound_page_dtor(page); |
80 | (*dtor)(page); | 80 | (*dtor)(page); |
81 | } | 81 | } |
82 | 82 | ||
83 | static void put_compound_page(struct page *page) | 83 | static void put_compound_page(struct page *page) |
84 | { | 84 | { |
85 | if (unlikely(PageTail(page))) { | 85 | if (unlikely(PageTail(page))) { |
86 | /* __split_huge_page_refcount can run under us */ | 86 | /* __split_huge_page_refcount can run under us */ |
87 | struct page *page_head = compound_head(page); | 87 | struct page *page_head = compound_head(page); |
88 | 88 | ||
89 | if (likely(page != page_head && | 89 | if (likely(page != page_head && |
90 | get_page_unless_zero(page_head))) { | 90 | get_page_unless_zero(page_head))) { |
91 | unsigned long flags; | 91 | unsigned long flags; |
92 | 92 | ||
93 | /* | 93 | /* |
94 | * THP can not break up slab pages so avoid taking | 94 | * THP can not break up slab pages so avoid taking |
95 | * compound_lock(). Slab performs non-atomic bit ops | 95 | * compound_lock(). Slab performs non-atomic bit ops |
96 | * on page->flags for better performance. In particular | 96 | * on page->flags for better performance. In particular |
97 | * slab_unlock() in slub used to be a hot path. It is | 97 | * slab_unlock() in slub used to be a hot path. It is |
98 | * still hot on arches that do not support | 98 | * still hot on arches that do not support |
99 | * this_cpu_cmpxchg_double(). | 99 | * this_cpu_cmpxchg_double(). |
100 | */ | 100 | */ |
101 | if (PageSlab(page_head) || PageHeadHuge(page_head)) { | 101 | if (PageSlab(page_head) || PageHeadHuge(page_head)) { |
102 | if (likely(PageTail(page))) { | 102 | if (likely(PageTail(page))) { |
103 | /* | 103 | /* |
104 | * __split_huge_page_refcount | 104 | * __split_huge_page_refcount |
105 | * cannot race here. | 105 | * cannot race here. |
106 | */ | 106 | */ |
107 | VM_BUG_ON(!PageHead(page_head)); | 107 | VM_BUG_ON(!PageHead(page_head)); |
108 | atomic_dec(&page->_mapcount); | 108 | atomic_dec(&page->_mapcount); |
109 | if (put_page_testzero(page_head)) | 109 | if (put_page_testzero(page_head)) |
110 | VM_BUG_ON(1); | 110 | VM_BUG_ON(1); |
111 | if (put_page_testzero(page_head)) | 111 | if (put_page_testzero(page_head)) |
112 | __put_compound_page(page_head); | 112 | __put_compound_page(page_head); |
113 | return; | 113 | return; |
114 | } else | 114 | } else |
115 | /* | 115 | /* |
116 | * __split_huge_page_refcount | 116 | * __split_huge_page_refcount |
117 | * ran before us, "page" was a | 117 | * ran before us, "page" was a |
118 | * THP tail. The split | 118 | * THP tail. The split |
119 | * page_head has been freed | 119 | * page_head has been freed |
120 | * and reallocated as slab or | 120 | * and reallocated as slab or |
121 | * hugetlbfs page of smaller | 121 | * hugetlbfs page of smaller |
122 | * order (only possible if | 122 | * order (only possible if |
123 | * reallocated as slab on | 123 | * reallocated as slab on |
124 | * x86). | 124 | * x86). |
125 | */ | 125 | */ |
126 | goto skip_lock; | 126 | goto skip_lock; |
127 | } | 127 | } |
128 | /* | 128 | /* |
129 | * page_head wasn't a dangling pointer but it | 129 | * page_head wasn't a dangling pointer but it |
130 | * may not be a head page anymore by the time | 130 | * may not be a head page anymore by the time |
131 | * we obtain the lock. That is ok as long as it | 131 | * we obtain the lock. That is ok as long as it |
132 | * can't be freed from under us. | 132 | * can't be freed from under us. |
133 | */ | 133 | */ |
134 | flags = compound_lock_irqsave(page_head); | 134 | flags = compound_lock_irqsave(page_head); |
135 | if (unlikely(!PageTail(page))) { | 135 | if (unlikely(!PageTail(page))) { |
136 | /* __split_huge_page_refcount run before us */ | 136 | /* __split_huge_page_refcount run before us */ |
137 | compound_unlock_irqrestore(page_head, flags); | 137 | compound_unlock_irqrestore(page_head, flags); |
138 | skip_lock: | 138 | skip_lock: |
139 | if (put_page_testzero(page_head)) { | 139 | if (put_page_testzero(page_head)) { |
140 | /* | 140 | /* |
141 | * The head page may have been | 141 | * The head page may have been |
142 | * freed and reallocated as a | 142 | * freed and reallocated as a |
143 | * compound page of smaller | 143 | * compound page of smaller |
144 | * order and then freed again. | 144 | * order and then freed again. |
145 | * All we know is that it | 145 | * All we know is that it |
146 | * cannot have become: a THP | 146 | * cannot have become: a THP |
147 | * page, a compound page of | 147 | * page, a compound page of |
148 | * higher order, a tail page. | 148 | * higher order, a tail page. |
149 | * That is because we still | 149 | * That is because we still |
150 | * hold the refcount of the | 150 | * hold the refcount of the |
151 | * split THP tail and | 151 | * split THP tail and |
152 | * page_head was the THP head | 152 | * page_head was the THP head |
153 | * before the split. | 153 | * before the split. |
154 | */ | 154 | */ |
155 | if (PageHead(page_head)) | 155 | if (PageHead(page_head)) |
156 | __put_compound_page(page_head); | 156 | __put_compound_page(page_head); |
157 | else | 157 | else |
158 | __put_single_page(page_head); | 158 | __put_single_page(page_head); |
159 | } | 159 | } |
160 | out_put_single: | 160 | out_put_single: |
161 | if (put_page_testzero(page)) | 161 | if (put_page_testzero(page)) |
162 | __put_single_page(page); | 162 | __put_single_page(page); |
163 | return; | 163 | return; |
164 | } | 164 | } |
165 | VM_BUG_ON(page_head != page->first_page); | 165 | VM_BUG_ON(page_head != page->first_page); |
166 | /* | 166 | /* |
167 | * We can release the refcount taken by | 167 | * We can release the refcount taken by |
168 | * get_page_unless_zero() now that | 168 | * get_page_unless_zero() now that |
169 | * __split_huge_page_refcount() is blocked on | 169 | * __split_huge_page_refcount() is blocked on |
170 | * the compound_lock. | 170 | * the compound_lock. |
171 | */ | 171 | */ |
172 | if (put_page_testzero(page_head)) | 172 | if (put_page_testzero(page_head)) |
173 | VM_BUG_ON(1); | 173 | VM_BUG_ON(1); |
174 | /* __split_huge_page_refcount will wait now */ | 174 | /* __split_huge_page_refcount will wait now */ |
175 | VM_BUG_ON(page_mapcount(page) <= 0); | 175 | VM_BUG_ON(page_mapcount(page) <= 0); |
176 | atomic_dec(&page->_mapcount); | 176 | atomic_dec(&page->_mapcount); |
177 | VM_BUG_ON(atomic_read(&page_head->_count) <= 0); | 177 | VM_BUG_ON(atomic_read(&page_head->_count) <= 0); |
178 | VM_BUG_ON(atomic_read(&page->_count) != 0); | 178 | VM_BUG_ON(atomic_read(&page->_count) != 0); |
179 | compound_unlock_irqrestore(page_head, flags); | 179 | compound_unlock_irqrestore(page_head, flags); |
180 | 180 | ||
181 | if (put_page_testzero(page_head)) { | 181 | if (put_page_testzero(page_head)) { |
182 | if (PageHead(page_head)) | 182 | if (PageHead(page_head)) |
183 | __put_compound_page(page_head); | 183 | __put_compound_page(page_head); |
184 | else | 184 | else |
185 | __put_single_page(page_head); | 185 | __put_single_page(page_head); |
186 | } | 186 | } |
187 | } else { | 187 | } else { |
188 | /* page_head is a dangling pointer */ | 188 | /* page_head is a dangling pointer */ |
189 | VM_BUG_ON(PageTail(page)); | 189 | VM_BUG_ON(PageTail(page)); |
190 | goto out_put_single; | 190 | goto out_put_single; |
191 | } | 191 | } |
192 | } else if (put_page_testzero(page)) { | 192 | } else if (put_page_testzero(page)) { |
193 | if (PageHead(page)) | 193 | if (PageHead(page)) |
194 | __put_compound_page(page); | 194 | __put_compound_page(page); |
195 | else | 195 | else |
196 | __put_single_page(page); | 196 | __put_single_page(page); |
197 | } | 197 | } |
198 | } | 198 | } |
199 | 199 | ||
200 | void put_page(struct page *page) | 200 | void put_page(struct page *page) |
201 | { | 201 | { |
202 | if (unlikely(PageCompound(page))) | 202 | if (unlikely(PageCompound(page))) |
203 | put_compound_page(page); | 203 | put_compound_page(page); |
204 | else if (put_page_testzero(page)) | 204 | else if (put_page_testzero(page)) |
205 | __put_single_page(page); | 205 | __put_single_page(page); |
206 | } | 206 | } |
207 | EXPORT_SYMBOL(put_page); | 207 | EXPORT_SYMBOL(put_page); |
208 | 208 | ||
209 | /* | 209 | /* |
210 | * This function is exported but must not be called by anything other | 210 | * This function is exported but must not be called by anything other |
211 | * than get_page(). It implements the slow path of get_page(). | 211 | * than get_page(). It implements the slow path of get_page(). |
212 | */ | 212 | */ |
213 | bool __get_page_tail(struct page *page) | 213 | bool __get_page_tail(struct page *page) |
214 | { | 214 | { |
215 | /* | 215 | /* |
216 | * This takes care of get_page() if run on a tail page | 216 | * This takes care of get_page() if run on a tail page |
217 | * returned by one of the get_user_pages/follow_page variants. | 217 | * returned by one of the get_user_pages/follow_page variants. |
218 | * get_user_pages/follow_page itself doesn't need the compound | 218 | * get_user_pages/follow_page itself doesn't need the compound |
219 | * lock because it runs __get_page_tail_foll() under the | 219 | * lock because it runs __get_page_tail_foll() under the |
220 | * proper PT lock that already serializes against | 220 | * proper PT lock that already serializes against |
221 | * split_huge_page(). | 221 | * split_huge_page(). |
222 | */ | 222 | */ |
223 | unsigned long flags; | 223 | unsigned long flags; |
224 | bool got = false; | 224 | bool got = false; |
225 | struct page *page_head = compound_head(page); | 225 | struct page *page_head = compound_head(page); |
226 | 226 | ||
227 | if (likely(page != page_head && get_page_unless_zero(page_head))) { | 227 | if (likely(page != page_head && get_page_unless_zero(page_head))) { |
228 | /* See the comment in put_compound_page(). */ | 228 | /* See the comment in put_compound_page(). */ |
229 | if (PageSlab(page_head) || PageHeadHuge(page_head)) { | 229 | if (PageSlab(page_head) || PageHeadHuge(page_head)) { |
230 | if (likely(PageTail(page))) { | 230 | if (likely(PageTail(page))) { |
231 | /* | 231 | /* |
232 | * This is a hugetlbfs page or a slab | 232 | * This is a hugetlbfs page or a slab |
233 | * page. __split_huge_page_refcount | 233 | * page. __split_huge_page_refcount |
234 | * cannot race here. | 234 | * cannot race here. |
235 | */ | 235 | */ |
236 | VM_BUG_ON(!PageHead(page_head)); | 236 | VM_BUG_ON(!PageHead(page_head)); |
237 | __get_page_tail_foll(page, false); | 237 | __get_page_tail_foll(page, false); |
238 | return true; | 238 | return true; |
239 | } else { | 239 | } else { |
240 | /* | 240 | /* |
241 | * __split_huge_page_refcount run | 241 | * __split_huge_page_refcount run |
242 | * before us, "page" was a THP | 242 | * before us, "page" was a THP |
243 | * tail. The split page_head has been | 243 | * tail. The split page_head has been |
244 | * freed and reallocated as slab or | 244 | * freed and reallocated as slab or |
245 | * hugetlbfs page of smaller order | 245 | * hugetlbfs page of smaller order |
246 | * (only possible if reallocated as | 246 | * (only possible if reallocated as |
247 | * slab on x86). | 247 | * slab on x86). |
248 | */ | 248 | */ |
249 | put_page(page_head); | 249 | put_page(page_head); |
250 | return false; | 250 | return false; |
251 | } | 251 | } |
252 | } | 252 | } |
253 | 253 | ||
254 | /* | 254 | /* |
255 | * page_head wasn't a dangling pointer but it | 255 | * page_head wasn't a dangling pointer but it |
256 | * may not be a head page anymore by the time | 256 | * may not be a head page anymore by the time |
257 | * we obtain the lock. That is ok as long as it | 257 | * we obtain the lock. That is ok as long as it |
258 | * can't be freed from under us. | 258 | * can't be freed from under us. |
259 | */ | 259 | */ |
260 | flags = compound_lock_irqsave(page_head); | 260 | flags = compound_lock_irqsave(page_head); |
261 | /* here __split_huge_page_refcount won't run anymore */ | 261 | /* here __split_huge_page_refcount won't run anymore */ |
262 | if (likely(PageTail(page))) { | 262 | if (likely(PageTail(page))) { |
263 | __get_page_tail_foll(page, false); | 263 | __get_page_tail_foll(page, false); |
264 | got = true; | 264 | got = true; |
265 | } | 265 | } |
266 | compound_unlock_irqrestore(page_head, flags); | 266 | compound_unlock_irqrestore(page_head, flags); |
267 | if (unlikely(!got)) | 267 | if (unlikely(!got)) |
268 | put_page(page_head); | 268 | put_page(page_head); |
269 | } | 269 | } |
270 | return got; | 270 | return got; |
271 | } | 271 | } |
272 | EXPORT_SYMBOL(__get_page_tail); | 272 | EXPORT_SYMBOL(__get_page_tail); |
273 | 273 | ||
274 | /** | 274 | /** |
275 | * put_pages_list() - release a list of pages | 275 | * put_pages_list() - release a list of pages |
276 | * @pages: list of pages threaded on page->lru | 276 | * @pages: list of pages threaded on page->lru |
277 | * | 277 | * |
278 | * Release a list of pages which are strung together on page->lru. Currently | 278 | * Release a list of pages which are strung together on page->lru. Currently |
279 | * used by read_cache_pages() and related error recovery code. | 279 | * used by read_cache_pages() and related error recovery code. |
280 | */ | 280 | */ |
281 | void put_pages_list(struct list_head *pages) | 281 | void put_pages_list(struct list_head *pages) |
282 | { | 282 | { |
283 | while (!list_empty(pages)) { | 283 | while (!list_empty(pages)) { |
284 | struct page *victim; | 284 | struct page *victim; |
285 | 285 | ||
286 | victim = list_entry(pages->prev, struct page, lru); | 286 | victim = list_entry(pages->prev, struct page, lru); |
287 | list_del(&victim->lru); | 287 | list_del(&victim->lru); |
288 | page_cache_release(victim); | 288 | page_cache_release(victim); |
289 | } | 289 | } |
290 | } | 290 | } |
291 | EXPORT_SYMBOL(put_pages_list); | 291 | EXPORT_SYMBOL(put_pages_list); |
292 | 292 | ||
293 | /* | 293 | /* |
294 | * get_kernel_pages() - pin kernel pages in memory | 294 | * get_kernel_pages() - pin kernel pages in memory |
295 | * @kiov: An array of struct kvec structures | 295 | * @kiov: An array of struct kvec structures |
296 | * @nr_segs: number of segments to pin | 296 | * @nr_segs: number of segments to pin |
297 | * @write: pinning for read/write, currently ignored | 297 | * @write: pinning for read/write, currently ignored |
298 | * @pages: array that receives pointers to the pages pinned. | 298 | * @pages: array that receives pointers to the pages pinned. |
299 | * Should be at least nr_segs long. | 299 | * Should be at least nr_segs long. |
300 | * | 300 | * |
301 | * Returns number of pages pinned. This may be fewer than the number | 301 | * Returns number of pages pinned. This may be fewer than the number |
302 | * requested. If nr_segs is 0 or negative, returns 0. If no pages | 302 | * requested. If nr_segs is 0 or negative, returns 0. If no pages |
303 | * were pinned, returns -errno. Each page returned must be released | 303 | * were pinned, returns -errno. Each page returned must be released |
304 | * with a put_page() call when it is finished with. | 304 | * with a put_page() call when it is finished with. |
305 | */ | 305 | */ |
306 | int get_kernel_pages(const struct kvec *kiov, int nr_segs, int write, | 306 | int get_kernel_pages(const struct kvec *kiov, int nr_segs, int write, |
307 | struct page **pages) | 307 | struct page **pages) |
308 | { | 308 | { |
309 | int seg; | 309 | int seg; |
310 | 310 | ||
311 | for (seg = 0; seg < nr_segs; seg++) { | 311 | for (seg = 0; seg < nr_segs; seg++) { |
312 | if (WARN_ON(kiov[seg].iov_len != PAGE_SIZE)) | 312 | if (WARN_ON(kiov[seg].iov_len != PAGE_SIZE)) |
313 | return seg; | 313 | return seg; |
314 | 314 | ||
315 | pages[seg] = kmap_to_page(kiov[seg].iov_base); | 315 | pages[seg] = kmap_to_page(kiov[seg].iov_base); |
316 | page_cache_get(pages[seg]); | 316 | page_cache_get(pages[seg]); |
317 | } | 317 | } |
318 | 318 | ||
319 | return seg; | 319 | return seg; |
320 | } | 320 | } |
321 | EXPORT_SYMBOL_GPL(get_kernel_pages); | 321 | EXPORT_SYMBOL_GPL(get_kernel_pages); |
322 | 322 | ||
323 | /* | 323 | /* |
324 | * get_kernel_page() - pin a kernel page in memory | 324 | * get_kernel_page() - pin a kernel page in memory |
325 | * @start: starting kernel address | 325 | * @start: starting kernel address |
326 | * @write: pinning for read/write, currently ignored | 326 | * @write: pinning for read/write, currently ignored |
327 | * @pages: array that receives pointer to the page pinned. | 327 | * @pages: array that receives pointer to the page pinned. |
328 | * Must be at least nr_segs long. | 328 | * Must be at least nr_segs long. |
329 | * | 329 | * |
330 | * Returns 1 if page is pinned. If the page was not pinned, returns | 330 | * Returns 1 if page is pinned. If the page was not pinned, returns |
331 | * -errno. The page returned must be released with a put_page() call | 331 | * -errno. The page returned must be released with a put_page() call |
332 | * when it is finished with. | 332 | * when it is finished with. |
333 | */ | 333 | */ |
334 | int get_kernel_page(unsigned long start, int write, struct page **pages) | 334 | int get_kernel_page(unsigned long start, int write, struct page **pages) |
335 | { | 335 | { |
336 | const struct kvec kiov = { | 336 | const struct kvec kiov = { |
337 | .iov_base = (void *)start, | 337 | .iov_base = (void *)start, |
338 | .iov_len = PAGE_SIZE | 338 | .iov_len = PAGE_SIZE |
339 | }; | 339 | }; |
340 | 340 | ||
341 | return get_kernel_pages(&kiov, 1, write, pages); | 341 | return get_kernel_pages(&kiov, 1, write, pages); |
342 | } | 342 | } |
343 | EXPORT_SYMBOL_GPL(get_kernel_page); | 343 | EXPORT_SYMBOL_GPL(get_kernel_page); |
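For context, a minimal usage sketch (not part of this commit; the helper name is invented) of pinning the page behind a page-aligned, page-sized kernel buffer with get_kernel_page() and dropping the reference afterwards:

#include <linux/errno.h>
#include <linux/mm.h>

/* Hypothetical helper: pin the single page backing a page-aligned buffer. */
static int example_pin_kernel_buffer(void *buf)
{
	struct page *page;
	int pinned;

	pinned = get_kernel_page((unsigned long)buf, 0, &page);
	if (pinned != 1)
		return pinned < 0 ? pinned : -EFAULT;

	/* ... hand "page" to code that expects a pinned struct page ... */

	put_page(page);		/* pairs with the reference taken above */
	return 0;
}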
344 | 344 | ||
345 | static void pagevec_lru_move_fn(struct pagevec *pvec, | 345 | static void pagevec_lru_move_fn(struct pagevec *pvec, |
346 | void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg), | 346 | void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg), |
347 | void *arg) | 347 | void *arg) |
348 | { | 348 | { |
349 | int i; | 349 | int i; |
350 | struct zone *zone = NULL; | 350 | struct zone *zone = NULL; |
351 | struct lruvec *lruvec; | 351 | struct lruvec *lruvec; |
352 | unsigned long flags = 0; | 352 | unsigned long flags = 0; |
353 | 353 | ||
354 | for (i = 0; i < pagevec_count(pvec); i++) { | 354 | for (i = 0; i < pagevec_count(pvec); i++) { |
355 | struct page *page = pvec->pages[i]; | 355 | struct page *page = pvec->pages[i]; |
356 | struct zone *pagezone = page_zone(page); | 356 | struct zone *pagezone = page_zone(page); |
357 | 357 | ||
358 | if (pagezone != zone) { | 358 | if (pagezone != zone) { |
359 | if (zone) | 359 | if (zone) |
360 | spin_unlock_irqrestore(&zone->lru_lock, flags); | 360 | spin_unlock_irqrestore(&zone->lru_lock, flags); |
361 | zone = pagezone; | 361 | zone = pagezone; |
362 | spin_lock_irqsave(&zone->lru_lock, flags); | 362 | spin_lock_irqsave(&zone->lru_lock, flags); |
363 | } | 363 | } |
364 | 364 | ||
365 | lruvec = mem_cgroup_page_lruvec(page, zone); | 365 | lruvec = mem_cgroup_page_lruvec(page, zone); |
366 | (*move_fn)(page, lruvec, arg); | 366 | (*move_fn)(page, lruvec, arg); |
367 | } | 367 | } |
368 | if (zone) | 368 | if (zone) |
369 | spin_unlock_irqrestore(&zone->lru_lock, flags); | 369 | spin_unlock_irqrestore(&zone->lru_lock, flags); |
370 | release_pages(pvec->pages, pvec->nr, pvec->cold); | 370 | release_pages(pvec->pages, pvec->nr, pvec->cold); |
371 | pagevec_reinit(pvec); | 371 | pagevec_reinit(pvec); |
372 | } | 372 | } |
373 | 373 | ||
374 | static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec, | 374 | static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec, |
375 | void *arg) | 375 | void *arg) |
376 | { | 376 | { |
377 | int *pgmoved = arg; | 377 | int *pgmoved = arg; |
378 | 378 | ||
379 | if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { | 379 | if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { |
380 | enum lru_list lru = page_lru_base_type(page); | 380 | enum lru_list lru = page_lru_base_type(page); |
381 | list_move_tail(&page->lru, &lruvec->lists[lru]); | 381 | list_move_tail(&page->lru, &lruvec->lists[lru]); |
382 | (*pgmoved)++; | 382 | (*pgmoved)++; |
383 | } | 383 | } |
384 | } | 384 | } |
385 | 385 | ||
386 | /* | 386 | /* |
387 | * pagevec_move_tail() must be called with IRQ disabled. | 387 | * pagevec_move_tail() must be called with IRQ disabled. |
388 | * Otherwise this may cause nasty races. | 388 | * Otherwise this may cause nasty races. |
389 | */ | 389 | */ |
390 | static void pagevec_move_tail(struct pagevec *pvec) | 390 | static void pagevec_move_tail(struct pagevec *pvec) |
391 | { | 391 | { |
392 | int pgmoved = 0; | 392 | int pgmoved = 0; |
393 | 393 | ||
394 | pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved); | 394 | pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved); |
395 | __count_vm_events(PGROTATED, pgmoved); | 395 | __count_vm_events(PGROTATED, pgmoved); |
396 | } | 396 | } |
397 | 397 | ||
398 | /* | 398 | /* |
399 | * Writeback is about to end against a page which has been marked for immediate | 399 | * Writeback is about to end against a page which has been marked for immediate |
400 | * reclaim. If it still appears to be reclaimable, move it to the tail of the | 400 | * reclaim. If it still appears to be reclaimable, move it to the tail of the |
401 | * inactive list. | 401 | * inactive list. |
402 | */ | 402 | */ |
403 | void rotate_reclaimable_page(struct page *page) | 403 | void rotate_reclaimable_page(struct page *page) |
404 | { | 404 | { |
405 | if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) && | 405 | if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) && |
406 | !PageUnevictable(page) && PageLRU(page)) { | 406 | !PageUnevictable(page) && PageLRU(page)) { |
407 | struct pagevec *pvec; | 407 | struct pagevec *pvec; |
408 | unsigned long flags; | 408 | unsigned long flags; |
409 | 409 | ||
410 | page_cache_get(page); | 410 | page_cache_get(page); |
411 | local_irq_save(flags); | 411 | local_irq_save(flags); |
412 | pvec = &__get_cpu_var(lru_rotate_pvecs); | 412 | pvec = &__get_cpu_var(lru_rotate_pvecs); |
413 | if (!pagevec_add(pvec, page)) | 413 | if (!pagevec_add(pvec, page)) |
414 | pagevec_move_tail(pvec); | 414 | pagevec_move_tail(pvec); |
415 | local_irq_restore(flags); | 415 | local_irq_restore(flags); |
416 | } | 416 | } |
417 | } | 417 | } |
418 | 418 | ||
419 | static void update_page_reclaim_stat(struct lruvec *lruvec, | 419 | static void update_page_reclaim_stat(struct lruvec *lruvec, |
420 | int file, int rotated) | 420 | int file, int rotated) |
421 | { | 421 | { |
422 | struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; | 422 | struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; |
423 | 423 | ||
424 | reclaim_stat->recent_scanned[file]++; | 424 | reclaim_stat->recent_scanned[file]++; |
425 | if (rotated) | 425 | if (rotated) |
426 | reclaim_stat->recent_rotated[file]++; | 426 | reclaim_stat->recent_rotated[file]++; |
427 | } | 427 | } |
428 | 428 | ||
429 | static void __activate_page(struct page *page, struct lruvec *lruvec, | 429 | static void __activate_page(struct page *page, struct lruvec *lruvec, |
430 | void *arg) | 430 | void *arg) |
431 | { | 431 | { |
432 | if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { | 432 | if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { |
433 | int file = page_is_file_cache(page); | 433 | int file = page_is_file_cache(page); |
434 | int lru = page_lru_base_type(page); | 434 | int lru = page_lru_base_type(page); |
435 | 435 | ||
436 | del_page_from_lru_list(page, lruvec, lru); | 436 | del_page_from_lru_list(page, lruvec, lru); |
437 | SetPageActive(page); | 437 | SetPageActive(page); |
438 | lru += LRU_ACTIVE; | 438 | lru += LRU_ACTIVE; |
439 | add_page_to_lru_list(page, lruvec, lru); | 439 | add_page_to_lru_list(page, lruvec, lru); |
440 | trace_mm_lru_activate(page, page_to_pfn(page)); | 440 | trace_mm_lru_activate(page, page_to_pfn(page)); |
441 | 441 | ||
442 | __count_vm_event(PGACTIVATE); | 442 | __count_vm_event(PGACTIVATE); |
443 | update_page_reclaim_stat(lruvec, file, 1); | 443 | update_page_reclaim_stat(lruvec, file, 1); |
444 | } | 444 | } |
445 | } | 445 | } |
446 | 446 | ||
447 | #ifdef CONFIG_SMP | 447 | #ifdef CONFIG_SMP |
448 | static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs); | 448 | static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs); |
449 | 449 | ||
450 | static void activate_page_drain(int cpu) | 450 | static void activate_page_drain(int cpu) |
451 | { | 451 | { |
452 | struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu); | 452 | struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu); |
453 | 453 | ||
454 | if (pagevec_count(pvec)) | 454 | if (pagevec_count(pvec)) |
455 | pagevec_lru_move_fn(pvec, __activate_page, NULL); | 455 | pagevec_lru_move_fn(pvec, __activate_page, NULL); |
456 | } | 456 | } |
457 | 457 | ||
458 | static bool need_activate_page_drain(int cpu) | 458 | static bool need_activate_page_drain(int cpu) |
459 | { | 459 | { |
460 | return pagevec_count(&per_cpu(activate_page_pvecs, cpu)) != 0; | 460 | return pagevec_count(&per_cpu(activate_page_pvecs, cpu)) != 0; |
461 | } | 461 | } |
462 | 462 | ||
463 | void activate_page(struct page *page) | 463 | void activate_page(struct page *page) |
464 | { | 464 | { |
465 | if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { | 465 | if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { |
466 | struct pagevec *pvec = &get_cpu_var(activate_page_pvecs); | 466 | struct pagevec *pvec = &get_cpu_var(activate_page_pvecs); |
467 | 467 | ||
468 | page_cache_get(page); | 468 | page_cache_get(page); |
469 | if (!pagevec_add(pvec, page)) | 469 | if (!pagevec_add(pvec, page)) |
470 | pagevec_lru_move_fn(pvec, __activate_page, NULL); | 470 | pagevec_lru_move_fn(pvec, __activate_page, NULL); |
471 | put_cpu_var(activate_page_pvecs); | 471 | put_cpu_var(activate_page_pvecs); |
472 | } | 472 | } |
473 | } | 473 | } |
474 | 474 | ||
475 | #else | 475 | #else |
476 | static inline void activate_page_drain(int cpu) | 476 | static inline void activate_page_drain(int cpu) |
477 | { | 477 | { |
478 | } | 478 | } |
479 | 479 | ||
480 | static bool need_activate_page_drain(int cpu) | 480 | static bool need_activate_page_drain(int cpu) |
481 | { | 481 | { |
482 | return false; | 482 | return false; |
483 | } | 483 | } |
484 | 484 | ||
485 | void activate_page(struct page *page) | 485 | void activate_page(struct page *page) |
486 | { | 486 | { |
487 | struct zone *zone = page_zone(page); | 487 | struct zone *zone = page_zone(page); |
488 | 488 | ||
489 | spin_lock_irq(&zone->lru_lock); | 489 | spin_lock_irq(&zone->lru_lock); |
490 | __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL); | 490 | __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL); |
491 | spin_unlock_irq(&zone->lru_lock); | 491 | spin_unlock_irq(&zone->lru_lock); |
492 | } | 492 | } |
493 | #endif | 493 | #endif |
494 | 494 | ||
495 | static void __lru_cache_activate_page(struct page *page) | 495 | static void __lru_cache_activate_page(struct page *page) |
496 | { | 496 | { |
497 | struct pagevec *pvec = &get_cpu_var(lru_add_pvec); | 497 | struct pagevec *pvec = &get_cpu_var(lru_add_pvec); |
498 | int i; | 498 | int i; |
499 | 499 | ||
500 | /* | 500 | /* |
501 | * Search backwards on the optimistic assumption that the page being | 501 | * Search backwards on the optimistic assumption that the page being |
502 | * activated has just been added to this pagevec. Note that only | 502 | * activated has just been added to this pagevec. Note that only |
503 | * the local pagevec is examined as a !PageLRU page could be in the | 503 | * the local pagevec is examined as a !PageLRU page could be in the |
504 | * process of being released, reclaimed, migrated or on a remote | 504 | * process of being released, reclaimed, migrated or on a remote |
505 | * pagevec that is currently being drained. Furthermore, marking | 505 | * pagevec that is currently being drained. Furthermore, marking |
506 | * a remote pagevec's page PageActive potentially hits a race where | 506 | * a remote pagevec's page PageActive potentially hits a race where |
507 | * a page is marked PageActive just after it is added to the inactive | 507 | * a page is marked PageActive just after it is added to the inactive |
508 | * list causing accounting errors and BUG_ON checks to trigger. | 508 | * list causing accounting errors and BUG_ON checks to trigger. |
509 | */ | 509 | */ |
510 | for (i = pagevec_count(pvec) - 1; i >= 0; i--) { | 510 | for (i = pagevec_count(pvec) - 1; i >= 0; i--) { |
511 | struct page *pagevec_page = pvec->pages[i]; | 511 | struct page *pagevec_page = pvec->pages[i]; |
512 | 512 | ||
513 | if (pagevec_page == page) { | 513 | if (pagevec_page == page) { |
514 | SetPageActive(page); | 514 | SetPageActive(page); |
515 | break; | 515 | break; |
516 | } | 516 | } |
517 | } | 517 | } |
518 | 518 | ||
519 | put_cpu_var(lru_add_pvec); | 519 | put_cpu_var(lru_add_pvec); |
520 | } | 520 | } |
521 | 521 | ||
522 | /* | 522 | /* |
523 | * Mark a page as having seen activity. | 523 | * Mark a page as having seen activity. |
524 | * | 524 | * |
525 | * inactive,unreferenced -> inactive,referenced | 525 | * inactive,unreferenced -> inactive,referenced |
526 | * inactive,referenced -> active,unreferenced | 526 | * inactive,referenced -> active,unreferenced |
527 | * active,unreferenced -> active,referenced | 527 | * active,unreferenced -> active,referenced |
528 | */ | 528 | */ |
529 | void mark_page_accessed(struct page *page) | 529 | void mark_page_accessed(struct page *page) |
530 | { | 530 | { |
531 | if (!PageActive(page) && !PageUnevictable(page) && | 531 | if (!PageActive(page) && !PageUnevictable(page) && |
532 | PageReferenced(page)) { | 532 | PageReferenced(page)) { |
533 | 533 | ||
534 | /* | 534 | /* |
535 | * If the page is on the LRU, queue it for activation via | 535 | * If the page is on the LRU, queue it for activation via |
536 | * activate_page_pvecs. Otherwise, assume the page is on a | 536 | * activate_page_pvecs. Otherwise, assume the page is on a |
537 | * pagevec, mark it active and it'll be moved to the active | 537 | * pagevec, mark it active and it'll be moved to the active |
538 | * LRU on the next drain. | 538 | * LRU on the next drain. |
539 | */ | 539 | */ |
540 | if (PageLRU(page)) | 540 | if (PageLRU(page)) |
541 | activate_page(page); | 541 | activate_page(page); |
542 | else | 542 | else |
543 | __lru_cache_activate_page(page); | 543 | __lru_cache_activate_page(page); |
544 | ClearPageReferenced(page); | 544 | ClearPageReferenced(page); |
545 | } else if (!PageReferenced(page)) { | 545 | } else if (!PageReferenced(page)) { |
546 | SetPageReferenced(page); | 546 | SetPageReferenced(page); |
547 | } | 547 | } |
548 | } | 548 | } |
549 | EXPORT_SYMBOL(mark_page_accessed); | 549 | EXPORT_SYMBOL(mark_page_accessed); |
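As a tiny illustration of the transitions listed above (not part of the patch; the function is made up), two successive touches are what promote a page from the inactive to the active list:

#include <linux/mm_types.h>
#include <linux/swap.h>

/* Hypothetical helper: promote a page the way two separate reads would. */
static void example_touch_twice(struct page *page)
{
	mark_page_accessed(page);	/* inactive,unreferenced -> inactive,referenced */
	mark_page_accessed(page);	/* inactive,referenced   -> active,unreferenced */
}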
550 | 550 | ||
551 | /* | ||
552 | * Used to mark_page_accessed(page) that is not visible yet and when it is | ||
553 | * still safe to use non-atomic ops | ||
554 | */ | ||
555 | void init_page_accessed(struct page *page) | ||
556 | { | ||
557 | if (!PageReferenced(page)) | ||
558 | __SetPageReferenced(page); | ||
559 | } | ||
560 | EXPORT_SYMBOL(init_page_accessed); | ||
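A minimal sketch of the allocation path the changelog describes (simplified; not the exact pagecache_get_page() code, and the helper name is invented): the referenced bit is set with a non-atomic op while the page is still private to the allocator, and only then is the page inserted into the page cache and queued for the LRU.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/swap.h>

/* Hypothetical helper, loosely modelled on the write_begin allocation path. */
static struct page *example_alloc_cache_page(struct address_space *mapping,
					     pgoff_t index, gfp_t gfp)
{
	struct page *page = __page_cache_alloc(gfp);

	if (!page)
		return NULL;

	/* Not yet visible to anyone else: the non-atomic flag op is safe. */
	init_page_accessed(page);

	if (add_to_page_cache_lru(page, mapping, index, gfp)) {
		page_cache_release(page);
		return NULL;
	}

	/*
	 * A later mark_page_accessed() now sees PageReferenced and can
	 * activate the page instead of merely setting the referenced bit.
	 */
	return page;
}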
561 | |||
551 | static void __lru_cache_add(struct page *page) | 562 | static void __lru_cache_add(struct page *page) |
552 | { | 563 | { |
553 | struct pagevec *pvec = &get_cpu_var(lru_add_pvec); | 564 | struct pagevec *pvec = &get_cpu_var(lru_add_pvec); |
554 | 565 | ||
555 | page_cache_get(page); | 566 | page_cache_get(page); |
556 | if (!pagevec_space(pvec)) | 567 | if (!pagevec_space(pvec)) |
557 | __pagevec_lru_add(pvec); | 568 | __pagevec_lru_add(pvec); |
558 | pagevec_add(pvec, page); | 569 | pagevec_add(pvec, page); |
559 | put_cpu_var(lru_add_pvec); | 570 | put_cpu_var(lru_add_pvec); |
560 | } | 571 | } |
561 | 572 | ||
562 | /** | 573 | /** |
563 | * lru_cache_add: add a page to the page lists | 574 | * lru_cache_add: add a page to the page lists |
564 | * @page: the page to add | 575 | * @page: the page to add |
565 | */ | 576 | */ |
566 | void lru_cache_add_anon(struct page *page) | 577 | void lru_cache_add_anon(struct page *page) |
567 | { | 578 | { |
568 | if (PageActive(page)) | 579 | if (PageActive(page)) |
569 | ClearPageActive(page); | 580 | ClearPageActive(page); |
570 | __lru_cache_add(page); | 581 | __lru_cache_add(page); |
571 | } | 582 | } |
572 | 583 | ||
573 | void lru_cache_add_file(struct page *page) | 584 | void lru_cache_add_file(struct page *page) |
574 | { | 585 | { |
575 | if (PageActive(page)) | 586 | if (PageActive(page)) |
576 | ClearPageActive(page); | 587 | ClearPageActive(page); |
577 | __lru_cache_add(page); | 588 | __lru_cache_add(page); |
578 | } | 589 | } |
579 | EXPORT_SYMBOL(lru_cache_add_file); | 590 | EXPORT_SYMBOL(lru_cache_add_file); |
580 | 591 | ||
581 | /** | 592 | /** |
582 | * lru_cache_add - add a page to a page list | 593 | * lru_cache_add - add a page to a page list |
583 | * @page: the page to be added to the LRU. | 594 | * @page: the page to be added to the LRU. |
584 | * | 595 | * |
585 | * Queue the page for addition to the LRU via pagevec. The decision on whether | 596 | * Queue the page for addition to the LRU via pagevec. The decision on whether |
586 | * to add the page to the [in]active [file|anon] list is deferred until the | 597 | * to add the page to the [in]active [file|anon] list is deferred until the |
587 | * pagevec is drained. This gives a chance for the caller of lru_cache_add() | 598 | * pagevec is drained. This gives a chance for the caller of lru_cache_add() |
588 | * to have the page added to the active list using mark_page_accessed(). | 599 | * to have the page added to the active list using mark_page_accessed(). |
589 | */ | 600 | */ |
590 | void lru_cache_add(struct page *page) | 601 | void lru_cache_add(struct page *page) |
591 | { | 602 | { |
592 | VM_BUG_ON(PageActive(page) && PageUnevictable(page)); | 603 | VM_BUG_ON(PageActive(page) && PageUnevictable(page)); |
593 | VM_BUG_ON(PageLRU(page)); | 604 | VM_BUG_ON(PageLRU(page)); |
594 | __lru_cache_add(page); | 605 | __lru_cache_add(page); |
595 | } | 606 | } |
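To illustrate the window the comment above is talking about (hypothetical helper, not in the tree): the page only sits on a per-cpu pagevec after lru_cache_add(), so a subsequent mark_page_accessed() can still decide which list it first lands on.

#include <linux/swap.h>

/* Hypothetical helper showing the add-then-access window. */
static void example_add_then_access(struct page *page)
{
	lru_cache_add(page);	/* queued on the per-cpu pagevec, not on the LRU yet */

	/*
	 * If the page was already PageReferenced (for example because
	 * init_page_accessed() ran before it became visible), this finds it
	 * in the local pagevec, sets PageActive, and the page is drained
	 * straight to the active list.
	 */
	mark_page_accessed(page);
}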
596 | 607 | ||
597 | /** | 608 | /** |
598 | * add_page_to_unevictable_list - add a page to the unevictable list | 609 | * add_page_to_unevictable_list - add a page to the unevictable list |
599 | * @page: the page to be added to the unevictable list | 610 | * @page: the page to be added to the unevictable list |
600 | * | 611 | * |
601 | * Add page directly to its zone's unevictable list. To avoid races with | 612 | * Add page directly to its zone's unevictable list. To avoid races with |
602 | * tasks that might be making the page evictable, through eg. munlock, | 613 | * tasks that might be making the page evictable, through eg. munlock, |
603 | * munmap or exit, while it's not on the lru, we want to add the page | 614 | * munmap or exit, while it's not on the lru, we want to add the page |
604 | * while it's locked or otherwise "invisible" to other tasks. This is | 615 | * while it's locked or otherwise "invisible" to other tasks. This is |
605 | * difficult to do when using the pagevec cache, so bypass that. | 616 | * difficult to do when using the pagevec cache, so bypass that. |
606 | */ | 617 | */ |
607 | void add_page_to_unevictable_list(struct page *page) | 618 | void add_page_to_unevictable_list(struct page *page) |
608 | { | 619 | { |
609 | struct zone *zone = page_zone(page); | 620 | struct zone *zone = page_zone(page); |
610 | struct lruvec *lruvec; | 621 | struct lruvec *lruvec; |
611 | 622 | ||
612 | spin_lock_irq(&zone->lru_lock); | 623 | spin_lock_irq(&zone->lru_lock); |
613 | lruvec = mem_cgroup_page_lruvec(page, zone); | 624 | lruvec = mem_cgroup_page_lruvec(page, zone); |
614 | ClearPageActive(page); | 625 | ClearPageActive(page); |
615 | SetPageUnevictable(page); | 626 | SetPageUnevictable(page); |
616 | SetPageLRU(page); | 627 | SetPageLRU(page); |
617 | add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE); | 628 | add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE); |
618 | spin_unlock_irq(&zone->lru_lock); | 629 | spin_unlock_irq(&zone->lru_lock); |
619 | } | 630 | } |
620 | 631 | ||
621 | /* | 632 | /* |
622 | * If the page can not be invalidated, it is moved to the | 633 | * If the page can not be invalidated, it is moved to the |
623 | * inactive list to speed up its reclaim. It is moved to the | 634 | * inactive list to speed up its reclaim. It is moved to the |
624 | * head of the list, rather than the tail, to give the flusher | 635 | * head of the list, rather than the tail, to give the flusher |
625 | * threads some time to write it out, as this is much more | 636 | * threads some time to write it out, as this is much more |
626 | * effective than the single-page writeout from reclaim. | 637 | * effective than the single-page writeout from reclaim. |
627 | * | 638 | * |
628 | * If the page isn't page_mapped and dirty/writeback, the page | 639 | * If the page isn't page_mapped and dirty/writeback, the page |
629 | * could reclaim asap using PG_reclaim. | 640 | * could reclaim asap using PG_reclaim. |
630 | * | 641 | * |
631 | * 1. active, mapped page -> none | 642 | * 1. active, mapped page -> none |
632 | * 2. active, dirty/writeback page -> inactive, head, PG_reclaim | 643 | * 2. active, dirty/writeback page -> inactive, head, PG_reclaim |
633 | * 3. inactive, mapped page -> none | 644 | * 3. inactive, mapped page -> none |
634 | * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim | 645 | * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim |
635 | * 5. inactive, clean -> inactive, tail | 646 | * 5. inactive, clean -> inactive, tail |
636 | * 6. Others -> none | 647 | * 6. Others -> none |
637 | * | 648 | * |
638 | * In case 4, the page is moved to the head of the inactive list because | 649 | * In case 4, the page is moved to the head of the inactive list because |
639 | * the VM expects the flusher threads to write it out, which is much more | 650 | * the VM expects the flusher threads to write it out, which is much more |
640 | * effective than the single-page writeout from reclaim. | 651 | * effective than the single-page writeout from reclaim. |
641 | */ | 652 | */ |
642 | static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec, | 653 | static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec, |
643 | void *arg) | 654 | void *arg) |
644 | { | 655 | { |
645 | int lru, file; | 656 | int lru, file; |
646 | bool active; | 657 | bool active; |
647 | 658 | ||
648 | if (!PageLRU(page)) | 659 | if (!PageLRU(page)) |
649 | return; | 660 | return; |
650 | 661 | ||
651 | if (PageUnevictable(page)) | 662 | if (PageUnevictable(page)) |
652 | return; | 663 | return; |
653 | 664 | ||
654 | /* Some processes are using the page */ | 665 | /* Some processes are using the page */ |
655 | if (page_mapped(page)) | 666 | if (page_mapped(page)) |
656 | return; | 667 | return; |
657 | 668 | ||
658 | active = PageActive(page); | 669 | active = PageActive(page); |
659 | file = page_is_file_cache(page); | 670 | file = page_is_file_cache(page); |
660 | lru = page_lru_base_type(page); | 671 | lru = page_lru_base_type(page); |
661 | 672 | ||
662 | del_page_from_lru_list(page, lruvec, lru + active); | 673 | del_page_from_lru_list(page, lruvec, lru + active); |
663 | ClearPageActive(page); | 674 | ClearPageActive(page); |
664 | ClearPageReferenced(page); | 675 | ClearPageReferenced(page); |
665 | add_page_to_lru_list(page, lruvec, lru); | 676 | add_page_to_lru_list(page, lruvec, lru); |
666 | 677 | ||
667 | if (PageWriteback(page) || PageDirty(page)) { | 678 | if (PageWriteback(page) || PageDirty(page)) { |
668 | /* | 679 | /* |
669 | * PG_reclaim could be raced with end_page_writeback | 680 | * PG_reclaim could be raced with end_page_writeback |
670 | * It can make readahead confusing. But race window | 681 | * It can make readahead confusing. But race window |
671 | * is _really_ small and it's non-critical problem. | 682 | * is _really_ small and it's non-critical problem. |
672 | */ | 683 | */ |
673 | SetPageReclaim(page); | 684 | SetPageReclaim(page); |
674 | } else { | 685 | } else { |
675 | /* | 686 | /* |
676 | * The page's writeback completed while it sat on the pagevec, | 687 | * The page's writeback completed while it sat on the pagevec, |
677 | * so move the page to the tail of the inactive list. | 688 | * so move the page to the tail of the inactive list. |
678 | */ | 689 | */ |
679 | list_move_tail(&page->lru, &lruvec->lists[lru]); | 690 | list_move_tail(&page->lru, &lruvec->lists[lru]); |
680 | __count_vm_event(PGROTATED); | 691 | __count_vm_event(PGROTATED); |
681 | } | 692 | } |
682 | 693 | ||
683 | if (active) | 694 | if (active) |
684 | __count_vm_event(PGDEACTIVATE); | 695 | __count_vm_event(PGDEACTIVATE); |
685 | update_page_reclaim_stat(lruvec, file, 0); | 696 | update_page_reclaim_stat(lruvec, file, 0); |
686 | } | 697 | } |
687 | 698 | ||
688 | /* | 699 | /* |
689 | * Drain pages out of the cpu's pagevecs. | 700 | * Drain pages out of the cpu's pagevecs. |
690 | * Either "cpu" is the current CPU, and preemption has already been | 701 | * Either "cpu" is the current CPU, and preemption has already been |
691 | * disabled; or "cpu" is being hot-unplugged, and is already dead. | 702 | * disabled; or "cpu" is being hot-unplugged, and is already dead. |
692 | */ | 703 | */ |
693 | void lru_add_drain_cpu(int cpu) | 704 | void lru_add_drain_cpu(int cpu) |
694 | { | 705 | { |
695 | struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu); | 706 | struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu); |
696 | 707 | ||
697 | if (pagevec_count(pvec)) | 708 | if (pagevec_count(pvec)) |
698 | __pagevec_lru_add(pvec); | 709 | __pagevec_lru_add(pvec); |
699 | 710 | ||
700 | pvec = &per_cpu(lru_rotate_pvecs, cpu); | 711 | pvec = &per_cpu(lru_rotate_pvecs, cpu); |
701 | if (pagevec_count(pvec)) { | 712 | if (pagevec_count(pvec)) { |
702 | unsigned long flags; | 713 | unsigned long flags; |
703 | 714 | ||
704 | /* No harm done if a racing interrupt already did this */ | 715 | /* No harm done if a racing interrupt already did this */ |
705 | local_irq_save(flags); | 716 | local_irq_save(flags); |
706 | pagevec_move_tail(pvec); | 717 | pagevec_move_tail(pvec); |
707 | local_irq_restore(flags); | 718 | local_irq_restore(flags); |
708 | } | 719 | } |
709 | 720 | ||
710 | pvec = &per_cpu(lru_deactivate_pvecs, cpu); | 721 | pvec = &per_cpu(lru_deactivate_pvecs, cpu); |
711 | if (pagevec_count(pvec)) | 722 | if (pagevec_count(pvec)) |
712 | pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); | 723 | pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); |
713 | 724 | ||
714 | activate_page_drain(cpu); | 725 | activate_page_drain(cpu); |
715 | } | 726 | } |
716 | 727 | ||
717 | /** | 728 | /** |
718 | * deactivate_page - forcefully deactivate a page | 729 | * deactivate_page - forcefully deactivate a page |
719 | * @page: page to deactivate | 730 | * @page: page to deactivate |
720 | * | 731 | * |
721 | * This function hints the VM that @page is a good reclaim candidate, | 732 | * This function hints the VM that @page is a good reclaim candidate, |
722 | * for example if its invalidation fails due to the page being dirty | 733 | * for example if its invalidation fails due to the page being dirty |
723 | * or under writeback. | 734 | * or under writeback. |
724 | */ | 735 | */ |
725 | void deactivate_page(struct page *page) | 736 | void deactivate_page(struct page *page) |
726 | { | 737 | { |
727 | /* | 738 | /* |
728 | * In a workload with many unevictable pages, such as one using mprotect, | 739 | * In a workload with many unevictable pages, such as one using mprotect, |
729 | * deactivating unevictable pages to accelerate reclaim is pointless. | 740 | * deactivating unevictable pages to accelerate reclaim is pointless. |
730 | */ | 741 | */ |
731 | if (PageUnevictable(page)) | 742 | if (PageUnevictable(page)) |
732 | return; | 743 | return; |
733 | 744 | ||
734 | if (likely(get_page_unless_zero(page))) { | 745 | if (likely(get_page_unless_zero(page))) { |
735 | struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs); | 746 | struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs); |
736 | 747 | ||
737 | if (!pagevec_add(pvec, page)) | 748 | if (!pagevec_add(pvec, page)) |
738 | pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); | 749 | pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); |
739 | put_cpu_var(lru_deactivate_pvecs); | 750 | put_cpu_var(lru_deactivate_pvecs); |
740 | } | 751 | } |
741 | } | 752 | } |
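For reference, a stripped-down version of how the invalidation path uses this hint, similar in spirit to invalidate_mapping_pages() (the wrapper name is made up):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/swap.h>

/* Hypothetical wrapper: try to drop a page cache page, otherwise hint reclaim. */
static void example_try_to_drop(struct page *page)
{
	int invalidated = 0;

	if (trylock_page(page)) {
		invalidated = invalidate_inode_page(page);
		unlock_page(page);
	}

	/*
	 * The page could not be dropped right now (dirty, under writeback,
	 * mapped, or we failed to lock it): ask the VM to treat it as a
	 * good reclaim candidate instead.
	 */
	if (!invalidated)
		deactivate_page(page);
}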
742 | 753 | ||
743 | void lru_add_drain(void) | 754 | void lru_add_drain(void) |
744 | { | 755 | { |
745 | lru_add_drain_cpu(get_cpu()); | 756 | lru_add_drain_cpu(get_cpu()); |
746 | put_cpu(); | 757 | put_cpu(); |
747 | } | 758 | } |
748 | 759 | ||
749 | static void lru_add_drain_per_cpu(struct work_struct *dummy) | 760 | static void lru_add_drain_per_cpu(struct work_struct *dummy) |
750 | { | 761 | { |
751 | lru_add_drain(); | 762 | lru_add_drain(); |
752 | } | 763 | } |
753 | 764 | ||
754 | static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work); | 765 | static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work); |
755 | 766 | ||
756 | void lru_add_drain_all(void) | 767 | void lru_add_drain_all(void) |
757 | { | 768 | { |
758 | static DEFINE_MUTEX(lock); | 769 | static DEFINE_MUTEX(lock); |
759 | static struct cpumask has_work; | 770 | static struct cpumask has_work; |
760 | int cpu; | 771 | int cpu; |
761 | 772 | ||
762 | mutex_lock(&lock); | 773 | mutex_lock(&lock); |
763 | get_online_cpus(); | 774 | get_online_cpus(); |
764 | cpumask_clear(&has_work); | 775 | cpumask_clear(&has_work); |
765 | 776 | ||
766 | for_each_online_cpu(cpu) { | 777 | for_each_online_cpu(cpu) { |
767 | struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); | 778 | struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); |
768 | 779 | ||
769 | if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) || | 780 | if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) || |
770 | pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) || | 781 | pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) || |
771 | pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) || | 782 | pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) || |
772 | need_activate_page_drain(cpu)) { | 783 | need_activate_page_drain(cpu)) { |
773 | INIT_WORK(work, lru_add_drain_per_cpu); | 784 | INIT_WORK(work, lru_add_drain_per_cpu); |
774 | schedule_work_on(cpu, work); | 785 | schedule_work_on(cpu, work); |
775 | cpumask_set_cpu(cpu, &has_work); | 786 | cpumask_set_cpu(cpu, &has_work); |
776 | } | 787 | } |
777 | } | 788 | } |
778 | 789 | ||
779 | for_each_cpu(cpu, &has_work) | 790 | for_each_cpu(cpu, &has_work) |
780 | flush_work(&per_cpu(lru_add_drain_work, cpu)); | 791 | flush_work(&per_cpu(lru_add_drain_work, cpu)); |
781 | 792 | ||
782 | put_online_cpus(); | 793 | put_online_cpus(); |
783 | mutex_unlock(&lock); | 794 | mutex_unlock(&lock); |
784 | } | 795 | } |
785 | 796 | ||
786 | /* | 797 | /* |
787 | * Batched page_cache_release(). Decrement the reference count on all the | 798 | * Batched page_cache_release(). Decrement the reference count on all the |
788 | * passed pages. If it fell to zero then remove the page from the LRU and | 799 | * passed pages. If it fell to zero then remove the page from the LRU and |
789 | * free it. | 800 | * free it. |
790 | * | 801 | * |
791 | * Avoid taking zone->lru_lock if possible, but if it is taken, retain it | 802 | * Avoid taking zone->lru_lock if possible, but if it is taken, retain it |
792 | * for the remainder of the operation. | 803 | * for the remainder of the operation. |
793 | * | 804 | * |
794 | * The locking in this function is against shrink_inactive_list(): we recheck | 805 | * The locking in this function is against shrink_inactive_list(): we recheck |
795 | * the page count inside the lock to see whether shrink_inactive_list() | 806 | * the page count inside the lock to see whether shrink_inactive_list() |
796 | * grabbed the page via the LRU. If it did, give up: shrink_inactive_list() | 807 | * grabbed the page via the LRU. If it did, give up: shrink_inactive_list() |
797 | * will free it. | 808 | * will free it. |
798 | */ | 809 | */ |
799 | void release_pages(struct page **pages, int nr, bool cold) | 810 | void release_pages(struct page **pages, int nr, bool cold) |
800 | { | 811 | { |
801 | int i; | 812 | int i; |
802 | LIST_HEAD(pages_to_free); | 813 | LIST_HEAD(pages_to_free); |
803 | struct zone *zone = NULL; | 814 | struct zone *zone = NULL; |
804 | struct lruvec *lruvec; | 815 | struct lruvec *lruvec; |
805 | unsigned long uninitialized_var(flags); | 816 | unsigned long uninitialized_var(flags); |
806 | 817 | ||
807 | for (i = 0; i < nr; i++) { | 818 | for (i = 0; i < nr; i++) { |
808 | struct page *page = pages[i]; | 819 | struct page *page = pages[i]; |
809 | 820 | ||
810 | if (unlikely(PageCompound(page))) { | 821 | if (unlikely(PageCompound(page))) { |
811 | if (zone) { | 822 | if (zone) { |
812 | spin_unlock_irqrestore(&zone->lru_lock, flags); | 823 | spin_unlock_irqrestore(&zone->lru_lock, flags); |
813 | zone = NULL; | 824 | zone = NULL; |
814 | } | 825 | } |
815 | put_compound_page(page); | 826 | put_compound_page(page); |
816 | continue; | 827 | continue; |
817 | } | 828 | } |
818 | 829 | ||
819 | if (!put_page_testzero(page)) | 830 | if (!put_page_testzero(page)) |
820 | continue; | 831 | continue; |
821 | 832 | ||
822 | if (PageLRU(page)) { | 833 | if (PageLRU(page)) { |
823 | struct zone *pagezone = page_zone(page); | 834 | struct zone *pagezone = page_zone(page); |
824 | 835 | ||
825 | if (pagezone != zone) { | 836 | if (pagezone != zone) { |
826 | if (zone) | 837 | if (zone) |
827 | spin_unlock_irqrestore(&zone->lru_lock, | 838 | spin_unlock_irqrestore(&zone->lru_lock, |
828 | flags); | 839 | flags); |
829 | zone = pagezone; | 840 | zone = pagezone; |
830 | spin_lock_irqsave(&zone->lru_lock, flags); | 841 | spin_lock_irqsave(&zone->lru_lock, flags); |
831 | } | 842 | } |
832 | 843 | ||
833 | lruvec = mem_cgroup_page_lruvec(page, zone); | 844 | lruvec = mem_cgroup_page_lruvec(page, zone); |
834 | VM_BUG_ON(!PageLRU(page)); | 845 | VM_BUG_ON(!PageLRU(page)); |
835 | __ClearPageLRU(page); | 846 | __ClearPageLRU(page); |
836 | del_page_from_lru_list(page, lruvec, page_off_lru(page)); | 847 | del_page_from_lru_list(page, lruvec, page_off_lru(page)); |
837 | } | 848 | } |
838 | 849 | ||
839 | /* Clear Active bit in case of parallel mark_page_accessed */ | 850 | /* Clear Active bit in case of parallel mark_page_accessed */ |
840 | __ClearPageActive(page); | 851 | __ClearPageActive(page); |
841 | 852 | ||
842 | list_add(&page->lru, &pages_to_free); | 853 | list_add(&page->lru, &pages_to_free); |
843 | } | 854 | } |
844 | if (zone) | 855 | if (zone) |
845 | spin_unlock_irqrestore(&zone->lru_lock, flags); | 856 | spin_unlock_irqrestore(&zone->lru_lock, flags); |
846 | 857 | ||
847 | free_hot_cold_page_list(&pages_to_free, cold); | 858 | free_hot_cold_page_list(&pages_to_free, cold); |
848 | } | 859 | } |
849 | EXPORT_SYMBOL(release_pages); | 860 | EXPORT_SYMBOL(release_pages); |
850 | 861 | ||
851 | /* | 862 | /* |
852 | * The pages which we're about to release may be in the deferred lru-addition | 863 | * The pages which we're about to release may be in the deferred lru-addition |
853 | * queues. That would prevent them from really being freed right now. That's | 864 | * queues. That would prevent them from really being freed right now. That's |
854 | * OK from a correctness point of view but is inefficient - those pages may be | 865 | * OK from a correctness point of view but is inefficient - those pages may be |
855 | * cache-warm and we want to give them back to the page allocator ASAP. | 866 | * cache-warm and we want to give them back to the page allocator ASAP. |
856 | * | 867 | * |
857 | * So __pagevec_release() will drain those queues here. __pagevec_lru_add() | 868 | * So __pagevec_release() will drain those queues here. __pagevec_lru_add() |
858 | * and __pagevec_lru_add_active() call release_pages() directly to avoid | 869 | * and __pagevec_lru_add_active() call release_pages() directly to avoid |
859 | * mutual recursion. | 870 | * mutual recursion. |
860 | */ | 871 | */ |
861 | void __pagevec_release(struct pagevec *pvec) | 872 | void __pagevec_release(struct pagevec *pvec) |
862 | { | 873 | { |
863 | lru_add_drain(); | 874 | lru_add_drain(); |
864 | release_pages(pvec->pages, pagevec_count(pvec), pvec->cold); | 875 | release_pages(pvec->pages, pagevec_count(pvec), pvec->cold); |
865 | pagevec_reinit(pvec); | 876 | pagevec_reinit(pvec); |
866 | } | 877 | } |
867 | EXPORT_SYMBOL(__pagevec_release); | 878 | EXPORT_SYMBOL(__pagevec_release); |
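A short sketch of the usual way callers batch page releases through a pagevec instead of a put_page() loop (helper name invented); pagevec_release() ends up here in __pagevec_release(), so the deferred lru-addition queues are drained first, as the comment above explains:

#include <linux/pagevec.h>

/* Hypothetical helper: drop references on an array of pages in batches. */
static void example_release_batch(struct page **pages, int nr)
{
	struct pagevec pvec;
	int i;

	pagevec_init(&pvec, 0);
	for (i = 0; i < nr; i++) {
		/* pagevec_add() returns the space left, i.e. 0 once full. */
		if (!pagevec_add(&pvec, pages[i]))
			pagevec_release(&pvec);
	}
	pagevec_release(&pvec);	/* flush the remainder */
}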
868 | 879 | ||
869 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE | 880 | #ifdef CONFIG_TRANSPARENT_HUGEPAGE |
870 | /* used by __split_huge_page_refcount() */ | 881 | /* used by __split_huge_page_refcount() */ |
871 | void lru_add_page_tail(struct page *page, struct page *page_tail, | 882 | void lru_add_page_tail(struct page *page, struct page *page_tail, |
872 | struct lruvec *lruvec, struct list_head *list) | 883 | struct lruvec *lruvec, struct list_head *list) |
873 | { | 884 | { |
874 | const int file = 0; | 885 | const int file = 0; |
875 | 886 | ||
876 | VM_BUG_ON(!PageHead(page)); | 887 | VM_BUG_ON(!PageHead(page)); |
877 | VM_BUG_ON(PageCompound(page_tail)); | 888 | VM_BUG_ON(PageCompound(page_tail)); |
878 | VM_BUG_ON(PageLRU(page_tail)); | 889 | VM_BUG_ON(PageLRU(page_tail)); |
879 | VM_BUG_ON(NR_CPUS != 1 && | 890 | VM_BUG_ON(NR_CPUS != 1 && |
880 | !spin_is_locked(&lruvec_zone(lruvec)->lru_lock)); | 891 | !spin_is_locked(&lruvec_zone(lruvec)->lru_lock)); |
881 | 892 | ||
882 | if (!list) | 893 | if (!list) |
883 | SetPageLRU(page_tail); | 894 | SetPageLRU(page_tail); |
884 | 895 | ||
885 | if (likely(PageLRU(page))) | 896 | if (likely(PageLRU(page))) |
886 | list_add_tail(&page_tail->lru, &page->lru); | 897 | list_add_tail(&page_tail->lru, &page->lru); |
887 | else if (list) { | 898 | else if (list) { |
888 | /* page reclaim is reclaiming a huge page */ | 899 | /* page reclaim is reclaiming a huge page */ |
889 | get_page(page_tail); | 900 | get_page(page_tail); |
890 | list_add_tail(&page_tail->lru, list); | 901 | list_add_tail(&page_tail->lru, list); |
891 | } else { | 902 | } else { |
892 | struct list_head *list_head; | 903 | struct list_head *list_head; |
893 | /* | 904 | /* |
894 | * Head page has not yet been counted, as an hpage, | 905 | * Head page has not yet been counted, as an hpage, |
895 | * so we must account for each subpage individually. | 906 | * so we must account for each subpage individually. |
896 | * | 907 | * |
897 | * Use the standard add function to put page_tail on the list, | 908 | * Use the standard add function to put page_tail on the list, |
898 | * but then correct its position so they all end up in order. | 909 | * but then correct its position so they all end up in order. |
899 | */ | 910 | */ |
900 | add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail)); | 911 | add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail)); |
901 | list_head = page_tail->lru.prev; | 912 | list_head = page_tail->lru.prev; |
902 | list_move_tail(&page_tail->lru, list_head); | 913 | list_move_tail(&page_tail->lru, list_head); |
903 | } | 914 | } |
904 | 915 | ||
905 | if (!PageUnevictable(page)) | 916 | if (!PageUnevictable(page)) |
906 | update_page_reclaim_stat(lruvec, file, PageActive(page_tail)); | 917 | update_page_reclaim_stat(lruvec, file, PageActive(page_tail)); |
907 | } | 918 | } |
908 | #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ | 919 | #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ |
909 | 920 | ||
910 | static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, | 921 | static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, |
911 | void *arg) | 922 | void *arg) |
912 | { | 923 | { |
913 | int file = page_is_file_cache(page); | 924 | int file = page_is_file_cache(page); |
914 | int active = PageActive(page); | 925 | int active = PageActive(page); |
915 | enum lru_list lru = page_lru(page); | 926 | enum lru_list lru = page_lru(page); |
916 | 927 | ||
917 | VM_BUG_ON(PageLRU(page)); | 928 | VM_BUG_ON(PageLRU(page)); |
918 | 929 | ||
919 | SetPageLRU(page); | 930 | SetPageLRU(page); |
920 | add_page_to_lru_list(page, lruvec, lru); | 931 | add_page_to_lru_list(page, lruvec, lru); |
921 | update_page_reclaim_stat(lruvec, file, active); | 932 | update_page_reclaim_stat(lruvec, file, active); |
922 | trace_mm_lru_insertion(page, page_to_pfn(page), lru, trace_pagemap_flags(page)); | 933 | trace_mm_lru_insertion(page, page_to_pfn(page), lru, trace_pagemap_flags(page)); |
923 | } | 934 | } |
924 | 935 | ||
925 | /* | 936 | /* |
926 | * Add the passed pages to the LRU, then drop the caller's refcount | 937 | * Add the passed pages to the LRU, then drop the caller's refcount |
927 | * on them. Reinitialises the caller's pagevec. | 938 | * on them. Reinitialises the caller's pagevec. |
928 | */ | 939 | */ |
929 | void __pagevec_lru_add(struct pagevec *pvec) | 940 | void __pagevec_lru_add(struct pagevec *pvec) |
930 | { | 941 | { |
931 | pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL); | 942 | pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL); |
932 | } | 943 | } |
933 | EXPORT_SYMBOL(__pagevec_lru_add); | 944 | EXPORT_SYMBOL(__pagevec_lru_add); |

/**
 * pagevec_lookup_entries - gang pagecache lookup
 * @pvec: Where the resulting entries are placed
 * @mapping: The address_space to search
 * @start: The starting entry index
 * @nr_pages: The maximum number of entries
 * @indices: The cache indices corresponding to the entries in @pvec
 *
 * pagevec_lookup_entries() will search for and return a group of up
 * to @nr_pages pages and shadow entries in the mapping.  All
 * entries are placed in @pvec.  pagevec_lookup_entries() takes a
 * reference against actual pages in @pvec.
 *
 * The search returns a group of mapping-contiguous entries with
 * ascending indexes.  There may be holes in the indices due to
 * not-present entries.
 *
 * pagevec_lookup_entries() returns the number of entries which were
 * found.
 */
unsigned pagevec_lookup_entries(struct pagevec *pvec,
				struct address_space *mapping,
				pgoff_t start, unsigned nr_pages,
				pgoff_t *indices)
{
	pvec->nr = find_get_entries(mapping, start, nr_pages,
				    pvec->pages, indices);
	return pagevec_count(pvec);
}

/**
 * pagevec_remove_exceptionals - pagevec exceptionals pruning
 * @pvec: The pagevec to prune
 *
 * pagevec_lookup_entries() fills both pages and exceptional radix
 * tree entries into the pagevec.  This function prunes all
 * exceptionals from @pvec without leaving holes, so that it can be
 * passed on to page-only pagevec operations.
 */
void pagevec_remove_exceptionals(struct pagevec *pvec)
{
	int i, j;

	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];
		if (!radix_tree_exceptional_entry(page))
			pvec->pages[j++] = page;
	}
	pvec->nr = j;
}
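As a minimal sketch of how pagevec_lookup_entries() and pagevec_remove_exceptionals() are meant to be used together, loosely modelled on the truncate path, consider the loop below. It is illustrative only and not part of this patch; mapping is assumed to be a valid struct address_space *. Real pages and shadow (exceptional) entries come back mixed in the same pagevec, so the exceptionals must be pruned before the pagevec is handed to page-only helpers such as pagevec_release().

	pgoff_t indices[PAGEVEC_SIZE];
	struct pagevec pvec;
	pgoff_t index = 0;
	int i;

	pagevec_init(&pvec, 0);
	while (pagevec_lookup_entries(&pvec, mapping, index,
				      PAGEVEC_SIZE, indices)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			index = indices[i];
			if (radix_tree_exceptional_entry(page))
				continue;	/* shadow entry, no struct page behind it */
			/* ... operate on the real, referenced page ... */
		}
		pagevec_remove_exceptionals(&pvec);	/* keep only the real pages */
		pagevec_release(&pvec);			/* drop the references taken by the lookup */
		index++;
	}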

/**
 * pagevec_lookup - gang pagecache lookup
 * @pvec: Where the resulting pages are placed
 * @mapping: The address_space to search
 * @start: The starting page index
 * @nr_pages: The maximum number of pages
 *
 * pagevec_lookup() will search for and return a group of up to @nr_pages pages
 * in the mapping.  The pages are placed in @pvec.  pagevec_lookup() takes a
 * reference against the pages in @pvec.
 *
 * The search returns a group of mapping-contiguous pages with ascending
 * indexes.  There may be holes in the indices due to not-present pages.
 *
 * pagevec_lookup() returns the number of pages which were found.
 */
unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
			pgoff_t start, unsigned nr_pages)
{
	pvec->nr = find_get_pages(mapping, start, nr_pages, pvec->pages);
	return pagevec_count(pvec);
}
EXPORT_SYMBOL(pagevec_lookup);
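Unlike pagevec_lookup_entries(), pagevec_lookup() returns only real pages, so the result can be passed straight to page-only operations. A minimal, illustrative walk over a mapping (not part of the patch; mapping is assumed caller-supplied) might look like this:

	struct pagevec pvec;
	pgoff_t index = 0;
	int i;

	pagevec_init(&pvec, 0);
	while (pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			/* ... inspect or lock the referenced page ... */
			index = page->index + 1;	/* resume after the last page seen */
		}
		pagevec_release(&pvec);	/* drop the references taken by the lookup */
	}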

unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
			    pgoff_t *index, int tag, unsigned nr_pages)
{
	pvec->nr = find_get_pages_tag(mapping, index, tag,
				      nr_pages, pvec->pages);
	return pagevec_count(pvec);
}
EXPORT_SYMBOL(pagevec_lookup_tag);
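pagevec_lookup_tag() restricts the lookup to pages carrying a given radix-tree tag and, unlike pagevec_lookup(), takes @index by reference and advances it past the last page returned, so the caller does not have to track the resume point itself. An illustrative writeback-style loop (not part of the patch; mapping is assumed caller-supplied) could look like:

	struct pagevec pvec;
	pgoff_t index = 0;
	int i;

	pagevec_init(&pvec, 0);
	while (pagevec_lookup_tag(&pvec, mapping, &index,
				  PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			/* ... lock and write back the dirty page ... */
		}
		pagevec_release(&pvec);	/* index was already advanced by the lookup */
	}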

/*
 * Perform any setup for the swap system
 */
void __init swap_setup(void)
{
	unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT);
#ifdef CONFIG_SWAP
	int i;

	bdi_init(swapper_spaces[0].backing_dev_info);
	for (i = 0; i < MAX_SWAPFILES; i++) {
		spin_lock_init(&swapper_spaces[i].tree_lock);
		INIT_LIST_HEAD(&swapper_spaces[i].i_mmap_nonlinear);
	}
#endif

	/* Use a smaller cluster for small-memory machines */
	if (megs < 16)
		page_cluster = 2;
	else
		page_cluster = 3;
	/*
	 * Right now other parts of the system mean that we
	 * _really_ don't want to cluster much more
	 */
}
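As a side note on the sizing above: page_cluster is the log2 of the swap readahead window (tunable at runtime via /proc/sys/vm/page-cluster), so the assignments here give a window of a handful of pages. An illustrative, non-patch snippet showing the resulting window size:

	/* Illustrative only: swap readahead window, in pages, implied by the
	 * values chosen above (2^page_cluster). */
	unsigned long ra_pages = 1UL << page_cluster;	/* 4 pages if megs < 16, else 8 */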