Commit d618a27c7808608e376de803a4fd3940f33776c2

Authored by Mel Gorman
Committed by Jiri Slaby
1 parent 967e64285a

mm: non-atomically mark page accessed during page cache allocation where possible

commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream.

aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after.  Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage.  The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page.  This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.

The primary APIs of concern in this care are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not.  Then
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job.  There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted.  This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change.  It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations.  The size of the
file is 1/10th physical memory to avoid dirty page balancing.  In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO.  The sync results are expected to be
more stable.  The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts.  Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison.  As async results were variable
do to writback timings, I'm only reporting the maximum figures.  The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

        samples percentage
ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

Showing 17 changed files with 219 additions and 162 deletions Inline Diff

fs/btrfs/extent_io.c
1 #include <linux/bitops.h> 1 #include <linux/bitops.h>
2 #include <linux/slab.h> 2 #include <linux/slab.h>
3 #include <linux/bio.h> 3 #include <linux/bio.h>
4 #include <linux/mm.h> 4 #include <linux/mm.h>
5 #include <linux/pagemap.h> 5 #include <linux/pagemap.h>
6 #include <linux/page-flags.h> 6 #include <linux/page-flags.h>
7 #include <linux/spinlock.h> 7 #include <linux/spinlock.h>
8 #include <linux/blkdev.h> 8 #include <linux/blkdev.h>
9 #include <linux/swap.h> 9 #include <linux/swap.h>
10 #include <linux/writeback.h> 10 #include <linux/writeback.h>
11 #include <linux/pagevec.h> 11 #include <linux/pagevec.h>
12 #include <linux/prefetch.h> 12 #include <linux/prefetch.h>
13 #include <linux/cleancache.h> 13 #include <linux/cleancache.h>
14 #include "extent_io.h" 14 #include "extent_io.h"
15 #include "extent_map.h" 15 #include "extent_map.h"
16 #include "compat.h" 16 #include "compat.h"
17 #include "ctree.h" 17 #include "ctree.h"
18 #include "btrfs_inode.h" 18 #include "btrfs_inode.h"
19 #include "volumes.h" 19 #include "volumes.h"
20 #include "check-integrity.h" 20 #include "check-integrity.h"
21 #include "locking.h" 21 #include "locking.h"
22 #include "rcu-string.h" 22 #include "rcu-string.h"
23 23
24 static struct kmem_cache *extent_state_cache; 24 static struct kmem_cache *extent_state_cache;
25 static struct kmem_cache *extent_buffer_cache; 25 static struct kmem_cache *extent_buffer_cache;
26 static struct bio_set *btrfs_bioset; 26 static struct bio_set *btrfs_bioset;
27 27
28 #ifdef CONFIG_BTRFS_DEBUG 28 #ifdef CONFIG_BTRFS_DEBUG
29 static LIST_HEAD(buffers); 29 static LIST_HEAD(buffers);
30 static LIST_HEAD(states); 30 static LIST_HEAD(states);
31 31
32 static DEFINE_SPINLOCK(leak_lock); 32 static DEFINE_SPINLOCK(leak_lock);
33 33
34 static inline 34 static inline
35 void btrfs_leak_debug_add(struct list_head *new, struct list_head *head) 35 void btrfs_leak_debug_add(struct list_head *new, struct list_head *head)
36 { 36 {
37 unsigned long flags; 37 unsigned long flags;
38 38
39 spin_lock_irqsave(&leak_lock, flags); 39 spin_lock_irqsave(&leak_lock, flags);
40 list_add(new, head); 40 list_add(new, head);
41 spin_unlock_irqrestore(&leak_lock, flags); 41 spin_unlock_irqrestore(&leak_lock, flags);
42 } 42 }
43 43
44 static inline 44 static inline
45 void btrfs_leak_debug_del(struct list_head *entry) 45 void btrfs_leak_debug_del(struct list_head *entry)
46 { 46 {
47 unsigned long flags; 47 unsigned long flags;
48 48
49 spin_lock_irqsave(&leak_lock, flags); 49 spin_lock_irqsave(&leak_lock, flags);
50 list_del(entry); 50 list_del(entry);
51 spin_unlock_irqrestore(&leak_lock, flags); 51 spin_unlock_irqrestore(&leak_lock, flags);
52 } 52 }
53 53
54 static inline 54 static inline
55 void btrfs_leak_debug_check(void) 55 void btrfs_leak_debug_check(void)
56 { 56 {
57 struct extent_state *state; 57 struct extent_state *state;
58 struct extent_buffer *eb; 58 struct extent_buffer *eb;
59 59
60 while (!list_empty(&states)) { 60 while (!list_empty(&states)) {
61 state = list_entry(states.next, struct extent_state, leak_list); 61 state = list_entry(states.next, struct extent_state, leak_list);
62 printk(KERN_ERR "btrfs state leak: start %llu end %llu " 62 printk(KERN_ERR "btrfs state leak: start %llu end %llu "
63 "state %lu in tree %p refs %d\n", 63 "state %lu in tree %p refs %d\n",
64 state->start, state->end, state->state, state->tree, 64 state->start, state->end, state->state, state->tree,
65 atomic_read(&state->refs)); 65 atomic_read(&state->refs));
66 list_del(&state->leak_list); 66 list_del(&state->leak_list);
67 kmem_cache_free(extent_state_cache, state); 67 kmem_cache_free(extent_state_cache, state);
68 } 68 }
69 69
70 while (!list_empty(&buffers)) { 70 while (!list_empty(&buffers)) {
71 eb = list_entry(buffers.next, struct extent_buffer, leak_list); 71 eb = list_entry(buffers.next, struct extent_buffer, leak_list);
72 printk(KERN_ERR "btrfs buffer leak start %llu len %lu " 72 printk(KERN_ERR "btrfs buffer leak start %llu len %lu "
73 "refs %d\n", 73 "refs %d\n",
74 eb->start, eb->len, atomic_read(&eb->refs)); 74 eb->start, eb->len, atomic_read(&eb->refs));
75 list_del(&eb->leak_list); 75 list_del(&eb->leak_list);
76 kmem_cache_free(extent_buffer_cache, eb); 76 kmem_cache_free(extent_buffer_cache, eb);
77 } 77 }
78 } 78 }
79 79
80 #define btrfs_debug_check_extent_io_range(inode, start, end) \ 80 #define btrfs_debug_check_extent_io_range(inode, start, end) \
81 __btrfs_debug_check_extent_io_range(__func__, (inode), (start), (end)) 81 __btrfs_debug_check_extent_io_range(__func__, (inode), (start), (end))
82 static inline void __btrfs_debug_check_extent_io_range(const char *caller, 82 static inline void __btrfs_debug_check_extent_io_range(const char *caller,
83 struct inode *inode, u64 start, u64 end) 83 struct inode *inode, u64 start, u64 end)
84 { 84 {
85 u64 isize = i_size_read(inode); 85 u64 isize = i_size_read(inode);
86 86
87 if (end >= PAGE_SIZE && (end % 2) == 0 && end != isize - 1) { 87 if (end >= PAGE_SIZE && (end % 2) == 0 && end != isize - 1) {
88 printk_ratelimited(KERN_DEBUG 88 printk_ratelimited(KERN_DEBUG
89 "btrfs: %s: ino %llu isize %llu odd range [%llu,%llu]\n", 89 "btrfs: %s: ino %llu isize %llu odd range [%llu,%llu]\n",
90 caller, btrfs_ino(inode), isize, start, end); 90 caller, btrfs_ino(inode), isize, start, end);
91 } 91 }
92 } 92 }
93 #else 93 #else
94 #define btrfs_leak_debug_add(new, head) do {} while (0) 94 #define btrfs_leak_debug_add(new, head) do {} while (0)
95 #define btrfs_leak_debug_del(entry) do {} while (0) 95 #define btrfs_leak_debug_del(entry) do {} while (0)
96 #define btrfs_leak_debug_check() do {} while (0) 96 #define btrfs_leak_debug_check() do {} while (0)
97 #define btrfs_debug_check_extent_io_range(c, s, e) do {} while (0) 97 #define btrfs_debug_check_extent_io_range(c, s, e) do {} while (0)
98 #endif 98 #endif
99 99
100 #define BUFFER_LRU_MAX 64 100 #define BUFFER_LRU_MAX 64
101 101
102 struct tree_entry { 102 struct tree_entry {
103 u64 start; 103 u64 start;
104 u64 end; 104 u64 end;
105 struct rb_node rb_node; 105 struct rb_node rb_node;
106 }; 106 };
107 107
108 struct extent_page_data { 108 struct extent_page_data {
109 struct bio *bio; 109 struct bio *bio;
110 struct extent_io_tree *tree; 110 struct extent_io_tree *tree;
111 get_extent_t *get_extent; 111 get_extent_t *get_extent;
112 unsigned long bio_flags; 112 unsigned long bio_flags;
113 113
114 /* tells writepage not to lock the state bits for this range 114 /* tells writepage not to lock the state bits for this range
115 * it still does the unlocking 115 * it still does the unlocking
116 */ 116 */
117 unsigned int extent_locked:1; 117 unsigned int extent_locked:1;
118 118
119 /* tells the submit_bio code to use a WRITE_SYNC */ 119 /* tells the submit_bio code to use a WRITE_SYNC */
120 unsigned int sync_io:1; 120 unsigned int sync_io:1;
121 }; 121 };
122 122
123 static noinline void flush_write_bio(void *data); 123 static noinline void flush_write_bio(void *data);
124 static inline struct btrfs_fs_info * 124 static inline struct btrfs_fs_info *
125 tree_fs_info(struct extent_io_tree *tree) 125 tree_fs_info(struct extent_io_tree *tree)
126 { 126 {
127 return btrfs_sb(tree->mapping->host->i_sb); 127 return btrfs_sb(tree->mapping->host->i_sb);
128 } 128 }
129 129
130 int __init extent_io_init(void) 130 int __init extent_io_init(void)
131 { 131 {
132 extent_state_cache = kmem_cache_create("btrfs_extent_state", 132 extent_state_cache = kmem_cache_create("btrfs_extent_state",
133 sizeof(struct extent_state), 0, 133 sizeof(struct extent_state), 0,
134 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL); 134 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
135 if (!extent_state_cache) 135 if (!extent_state_cache)
136 return -ENOMEM; 136 return -ENOMEM;
137 137
138 extent_buffer_cache = kmem_cache_create("btrfs_extent_buffer", 138 extent_buffer_cache = kmem_cache_create("btrfs_extent_buffer",
139 sizeof(struct extent_buffer), 0, 139 sizeof(struct extent_buffer), 0,
140 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL); 140 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
141 if (!extent_buffer_cache) 141 if (!extent_buffer_cache)
142 goto free_state_cache; 142 goto free_state_cache;
143 143
144 btrfs_bioset = bioset_create(BIO_POOL_SIZE, 144 btrfs_bioset = bioset_create(BIO_POOL_SIZE,
145 offsetof(struct btrfs_io_bio, bio)); 145 offsetof(struct btrfs_io_bio, bio));
146 if (!btrfs_bioset) 146 if (!btrfs_bioset)
147 goto free_buffer_cache; 147 goto free_buffer_cache;
148 148
149 if (bioset_integrity_create(btrfs_bioset, BIO_POOL_SIZE)) 149 if (bioset_integrity_create(btrfs_bioset, BIO_POOL_SIZE))
150 goto free_bioset; 150 goto free_bioset;
151 151
152 return 0; 152 return 0;
153 153
154 free_bioset: 154 free_bioset:
155 bioset_free(btrfs_bioset); 155 bioset_free(btrfs_bioset);
156 btrfs_bioset = NULL; 156 btrfs_bioset = NULL;
157 157
158 free_buffer_cache: 158 free_buffer_cache:
159 kmem_cache_destroy(extent_buffer_cache); 159 kmem_cache_destroy(extent_buffer_cache);
160 extent_buffer_cache = NULL; 160 extent_buffer_cache = NULL;
161 161
162 free_state_cache: 162 free_state_cache:
163 kmem_cache_destroy(extent_state_cache); 163 kmem_cache_destroy(extent_state_cache);
164 extent_state_cache = NULL; 164 extent_state_cache = NULL;
165 return -ENOMEM; 165 return -ENOMEM;
166 } 166 }
167 167
168 void extent_io_exit(void) 168 void extent_io_exit(void)
169 { 169 {
170 btrfs_leak_debug_check(); 170 btrfs_leak_debug_check();
171 171
172 /* 172 /*
173 * Make sure all delayed rcu free are flushed before we 173 * Make sure all delayed rcu free are flushed before we
174 * destroy caches. 174 * destroy caches.
175 */ 175 */
176 rcu_barrier(); 176 rcu_barrier();
177 if (extent_state_cache) 177 if (extent_state_cache)
178 kmem_cache_destroy(extent_state_cache); 178 kmem_cache_destroy(extent_state_cache);
179 if (extent_buffer_cache) 179 if (extent_buffer_cache)
180 kmem_cache_destroy(extent_buffer_cache); 180 kmem_cache_destroy(extent_buffer_cache);
181 if (btrfs_bioset) 181 if (btrfs_bioset)
182 bioset_free(btrfs_bioset); 182 bioset_free(btrfs_bioset);
183 } 183 }
184 184
185 void extent_io_tree_init(struct extent_io_tree *tree, 185 void extent_io_tree_init(struct extent_io_tree *tree,
186 struct address_space *mapping) 186 struct address_space *mapping)
187 { 187 {
188 tree->state = RB_ROOT; 188 tree->state = RB_ROOT;
189 INIT_RADIX_TREE(&tree->buffer, GFP_ATOMIC); 189 INIT_RADIX_TREE(&tree->buffer, GFP_ATOMIC);
190 tree->ops = NULL; 190 tree->ops = NULL;
191 tree->dirty_bytes = 0; 191 tree->dirty_bytes = 0;
192 spin_lock_init(&tree->lock); 192 spin_lock_init(&tree->lock);
193 spin_lock_init(&tree->buffer_lock); 193 spin_lock_init(&tree->buffer_lock);
194 tree->mapping = mapping; 194 tree->mapping = mapping;
195 } 195 }
196 196
197 static struct extent_state *alloc_extent_state(gfp_t mask) 197 static struct extent_state *alloc_extent_state(gfp_t mask)
198 { 198 {
199 struct extent_state *state; 199 struct extent_state *state;
200 200
201 state = kmem_cache_alloc(extent_state_cache, mask); 201 state = kmem_cache_alloc(extent_state_cache, mask);
202 if (!state) 202 if (!state)
203 return state; 203 return state;
204 state->state = 0; 204 state->state = 0;
205 state->private = 0; 205 state->private = 0;
206 state->tree = NULL; 206 state->tree = NULL;
207 btrfs_leak_debug_add(&state->leak_list, &states); 207 btrfs_leak_debug_add(&state->leak_list, &states);
208 atomic_set(&state->refs, 1); 208 atomic_set(&state->refs, 1);
209 init_waitqueue_head(&state->wq); 209 init_waitqueue_head(&state->wq);
210 trace_alloc_extent_state(state, mask, _RET_IP_); 210 trace_alloc_extent_state(state, mask, _RET_IP_);
211 return state; 211 return state;
212 } 212 }
213 213
214 void free_extent_state(struct extent_state *state) 214 void free_extent_state(struct extent_state *state)
215 { 215 {
216 if (!state) 216 if (!state)
217 return; 217 return;
218 if (atomic_dec_and_test(&state->refs)) { 218 if (atomic_dec_and_test(&state->refs)) {
219 WARN_ON(state->tree); 219 WARN_ON(state->tree);
220 btrfs_leak_debug_del(&state->leak_list); 220 btrfs_leak_debug_del(&state->leak_list);
221 trace_free_extent_state(state, _RET_IP_); 221 trace_free_extent_state(state, _RET_IP_);
222 kmem_cache_free(extent_state_cache, state); 222 kmem_cache_free(extent_state_cache, state);
223 } 223 }
224 } 224 }
225 225
226 static struct rb_node *tree_insert(struct rb_root *root, u64 offset, 226 static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
227 struct rb_node *node) 227 struct rb_node *node)
228 { 228 {
229 struct rb_node **p = &root->rb_node; 229 struct rb_node **p = &root->rb_node;
230 struct rb_node *parent = NULL; 230 struct rb_node *parent = NULL;
231 struct tree_entry *entry; 231 struct tree_entry *entry;
232 232
233 while (*p) { 233 while (*p) {
234 parent = *p; 234 parent = *p;
235 entry = rb_entry(parent, struct tree_entry, rb_node); 235 entry = rb_entry(parent, struct tree_entry, rb_node);
236 236
237 if (offset < entry->start) 237 if (offset < entry->start)
238 p = &(*p)->rb_left; 238 p = &(*p)->rb_left;
239 else if (offset > entry->end) 239 else if (offset > entry->end)
240 p = &(*p)->rb_right; 240 p = &(*p)->rb_right;
241 else 241 else
242 return parent; 242 return parent;
243 } 243 }
244 244
245 rb_link_node(node, parent, p); 245 rb_link_node(node, parent, p);
246 rb_insert_color(node, root); 246 rb_insert_color(node, root);
247 return NULL; 247 return NULL;
248 } 248 }
249 249
250 static struct rb_node *__etree_search(struct extent_io_tree *tree, u64 offset, 250 static struct rb_node *__etree_search(struct extent_io_tree *tree, u64 offset,
251 struct rb_node **prev_ret, 251 struct rb_node **prev_ret,
252 struct rb_node **next_ret) 252 struct rb_node **next_ret)
253 { 253 {
254 struct rb_root *root = &tree->state; 254 struct rb_root *root = &tree->state;
255 struct rb_node *n = root->rb_node; 255 struct rb_node *n = root->rb_node;
256 struct rb_node *prev = NULL; 256 struct rb_node *prev = NULL;
257 struct rb_node *orig_prev = NULL; 257 struct rb_node *orig_prev = NULL;
258 struct tree_entry *entry; 258 struct tree_entry *entry;
259 struct tree_entry *prev_entry = NULL; 259 struct tree_entry *prev_entry = NULL;
260 260
261 while (n) { 261 while (n) {
262 entry = rb_entry(n, struct tree_entry, rb_node); 262 entry = rb_entry(n, struct tree_entry, rb_node);
263 prev = n; 263 prev = n;
264 prev_entry = entry; 264 prev_entry = entry;
265 265
266 if (offset < entry->start) 266 if (offset < entry->start)
267 n = n->rb_left; 267 n = n->rb_left;
268 else if (offset > entry->end) 268 else if (offset > entry->end)
269 n = n->rb_right; 269 n = n->rb_right;
270 else 270 else
271 return n; 271 return n;
272 } 272 }
273 273
274 if (prev_ret) { 274 if (prev_ret) {
275 orig_prev = prev; 275 orig_prev = prev;
276 while (prev && offset > prev_entry->end) { 276 while (prev && offset > prev_entry->end) {
277 prev = rb_next(prev); 277 prev = rb_next(prev);
278 prev_entry = rb_entry(prev, struct tree_entry, rb_node); 278 prev_entry = rb_entry(prev, struct tree_entry, rb_node);
279 } 279 }
280 *prev_ret = prev; 280 *prev_ret = prev;
281 prev = orig_prev; 281 prev = orig_prev;
282 } 282 }
283 283
284 if (next_ret) { 284 if (next_ret) {
285 prev_entry = rb_entry(prev, struct tree_entry, rb_node); 285 prev_entry = rb_entry(prev, struct tree_entry, rb_node);
286 while (prev && offset < prev_entry->start) { 286 while (prev && offset < prev_entry->start) {
287 prev = rb_prev(prev); 287 prev = rb_prev(prev);
288 prev_entry = rb_entry(prev, struct tree_entry, rb_node); 288 prev_entry = rb_entry(prev, struct tree_entry, rb_node);
289 } 289 }
290 *next_ret = prev; 290 *next_ret = prev;
291 } 291 }
292 return NULL; 292 return NULL;
293 } 293 }
294 294
295 static inline struct rb_node *tree_search(struct extent_io_tree *tree, 295 static inline struct rb_node *tree_search(struct extent_io_tree *tree,
296 u64 offset) 296 u64 offset)
297 { 297 {
298 struct rb_node *prev = NULL; 298 struct rb_node *prev = NULL;
299 struct rb_node *ret; 299 struct rb_node *ret;
300 300
301 ret = __etree_search(tree, offset, &prev, NULL); 301 ret = __etree_search(tree, offset, &prev, NULL);
302 if (!ret) 302 if (!ret)
303 return prev; 303 return prev;
304 return ret; 304 return ret;
305 } 305 }
306 306
307 static void merge_cb(struct extent_io_tree *tree, struct extent_state *new, 307 static void merge_cb(struct extent_io_tree *tree, struct extent_state *new,
308 struct extent_state *other) 308 struct extent_state *other)
309 { 309 {
310 if (tree->ops && tree->ops->merge_extent_hook) 310 if (tree->ops && tree->ops->merge_extent_hook)
311 tree->ops->merge_extent_hook(tree->mapping->host, new, 311 tree->ops->merge_extent_hook(tree->mapping->host, new,
312 other); 312 other);
313 } 313 }
314 314
315 /* 315 /*
316 * utility function to look for merge candidates inside a given range. 316 * utility function to look for merge candidates inside a given range.
317 * Any extents with matching state are merged together into a single 317 * Any extents with matching state are merged together into a single
318 * extent in the tree. Extents with EXTENT_IO in their state field 318 * extent in the tree. Extents with EXTENT_IO in their state field
319 * are not merged because the end_io handlers need to be able to do 319 * are not merged because the end_io handlers need to be able to do
320 * operations on them without sleeping (or doing allocations/splits). 320 * operations on them without sleeping (or doing allocations/splits).
321 * 321 *
322 * This should be called with the tree lock held. 322 * This should be called with the tree lock held.
323 */ 323 */
324 static void merge_state(struct extent_io_tree *tree, 324 static void merge_state(struct extent_io_tree *tree,
325 struct extent_state *state) 325 struct extent_state *state)
326 { 326 {
327 struct extent_state *other; 327 struct extent_state *other;
328 struct rb_node *other_node; 328 struct rb_node *other_node;
329 329
330 if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) 330 if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY))
331 return; 331 return;
332 332
333 other_node = rb_prev(&state->rb_node); 333 other_node = rb_prev(&state->rb_node);
334 if (other_node) { 334 if (other_node) {
335 other = rb_entry(other_node, struct extent_state, rb_node); 335 other = rb_entry(other_node, struct extent_state, rb_node);
336 if (other->end == state->start - 1 && 336 if (other->end == state->start - 1 &&
337 other->state == state->state) { 337 other->state == state->state) {
338 merge_cb(tree, state, other); 338 merge_cb(tree, state, other);
339 state->start = other->start; 339 state->start = other->start;
340 other->tree = NULL; 340 other->tree = NULL;
341 rb_erase(&other->rb_node, &tree->state); 341 rb_erase(&other->rb_node, &tree->state);
342 free_extent_state(other); 342 free_extent_state(other);
343 } 343 }
344 } 344 }
345 other_node = rb_next(&state->rb_node); 345 other_node = rb_next(&state->rb_node);
346 if (other_node) { 346 if (other_node) {
347 other = rb_entry(other_node, struct extent_state, rb_node); 347 other = rb_entry(other_node, struct extent_state, rb_node);
348 if (other->start == state->end + 1 && 348 if (other->start == state->end + 1 &&
349 other->state == state->state) { 349 other->state == state->state) {
350 merge_cb(tree, state, other); 350 merge_cb(tree, state, other);
351 state->end = other->end; 351 state->end = other->end;
352 other->tree = NULL; 352 other->tree = NULL;
353 rb_erase(&other->rb_node, &tree->state); 353 rb_erase(&other->rb_node, &tree->state);
354 free_extent_state(other); 354 free_extent_state(other);
355 } 355 }
356 } 356 }
357 } 357 }
358 358
359 static void set_state_cb(struct extent_io_tree *tree, 359 static void set_state_cb(struct extent_io_tree *tree,
360 struct extent_state *state, unsigned long *bits) 360 struct extent_state *state, unsigned long *bits)
361 { 361 {
362 if (tree->ops && tree->ops->set_bit_hook) 362 if (tree->ops && tree->ops->set_bit_hook)
363 tree->ops->set_bit_hook(tree->mapping->host, state, bits); 363 tree->ops->set_bit_hook(tree->mapping->host, state, bits);
364 } 364 }
365 365
366 static void clear_state_cb(struct extent_io_tree *tree, 366 static void clear_state_cb(struct extent_io_tree *tree,
367 struct extent_state *state, unsigned long *bits) 367 struct extent_state *state, unsigned long *bits)
368 { 368 {
369 if (tree->ops && tree->ops->clear_bit_hook) 369 if (tree->ops && tree->ops->clear_bit_hook)
370 tree->ops->clear_bit_hook(tree->mapping->host, state, bits); 370 tree->ops->clear_bit_hook(tree->mapping->host, state, bits);
371 } 371 }
372 372
373 static void set_state_bits(struct extent_io_tree *tree, 373 static void set_state_bits(struct extent_io_tree *tree,
374 struct extent_state *state, unsigned long *bits); 374 struct extent_state *state, unsigned long *bits);
375 375
376 /* 376 /*
377 * insert an extent_state struct into the tree. 'bits' are set on the 377 * insert an extent_state struct into the tree. 'bits' are set on the
378 * struct before it is inserted. 378 * struct before it is inserted.
379 * 379 *
380 * This may return -EEXIST if the extent is already there, in which case the 380 * This may return -EEXIST if the extent is already there, in which case the
381 * state struct is freed. 381 * state struct is freed.
382 * 382 *
383 * The tree lock is not taken internally. This is a utility function and 383 * The tree lock is not taken internally. This is a utility function and
384 * probably isn't what you want to call (see set/clear_extent_bit). 384 * probably isn't what you want to call (see set/clear_extent_bit).
385 */ 385 */
386 static int insert_state(struct extent_io_tree *tree, 386 static int insert_state(struct extent_io_tree *tree,
387 struct extent_state *state, u64 start, u64 end, 387 struct extent_state *state, u64 start, u64 end,
388 unsigned long *bits) 388 unsigned long *bits)
389 { 389 {
390 struct rb_node *node; 390 struct rb_node *node;
391 391
392 if (end < start) 392 if (end < start)
393 WARN(1, KERN_ERR "btrfs end < start %llu %llu\n", 393 WARN(1, KERN_ERR "btrfs end < start %llu %llu\n",
394 end, start); 394 end, start);
395 state->start = start; 395 state->start = start;
396 state->end = end; 396 state->end = end;
397 397
398 set_state_bits(tree, state, bits); 398 set_state_bits(tree, state, bits);
399 399
400 node = tree_insert(&tree->state, end, &state->rb_node); 400 node = tree_insert(&tree->state, end, &state->rb_node);
401 if (node) { 401 if (node) {
402 struct extent_state *found; 402 struct extent_state *found;
403 found = rb_entry(node, struct extent_state, rb_node); 403 found = rb_entry(node, struct extent_state, rb_node);
404 printk(KERN_ERR "btrfs found node %llu %llu on insert of " 404 printk(KERN_ERR "btrfs found node %llu %llu on insert of "
405 "%llu %llu\n", 405 "%llu %llu\n",
406 found->start, found->end, start, end); 406 found->start, found->end, start, end);
407 return -EEXIST; 407 return -EEXIST;
408 } 408 }
409 state->tree = tree; 409 state->tree = tree;
410 merge_state(tree, state); 410 merge_state(tree, state);
411 return 0; 411 return 0;
412 } 412 }
413 413
414 static void split_cb(struct extent_io_tree *tree, struct extent_state *orig, 414 static void split_cb(struct extent_io_tree *tree, struct extent_state *orig,
415 u64 split) 415 u64 split)
416 { 416 {
417 if (tree->ops && tree->ops->split_extent_hook) 417 if (tree->ops && tree->ops->split_extent_hook)
418 tree->ops->split_extent_hook(tree->mapping->host, orig, split); 418 tree->ops->split_extent_hook(tree->mapping->host, orig, split);
419 } 419 }
420 420
421 /* 421 /*
422 * split a given extent state struct in two, inserting the preallocated 422 * split a given extent state struct in two, inserting the preallocated
423 * struct 'prealloc' as the newly created second half. 'split' indicates an 423 * struct 'prealloc' as the newly created second half. 'split' indicates an
424 * offset inside 'orig' where it should be split. 424 * offset inside 'orig' where it should be split.
425 * 425 *
426 * Before calling, 426 * Before calling,
427 * the tree has 'orig' at [orig->start, orig->end]. After calling, there 427 * the tree has 'orig' at [orig->start, orig->end]. After calling, there
428 * are two extent state structs in the tree: 428 * are two extent state structs in the tree:
429 * prealloc: [orig->start, split - 1] 429 * prealloc: [orig->start, split - 1]
430 * orig: [ split, orig->end ] 430 * orig: [ split, orig->end ]
431 * 431 *
432 * The tree locks are not taken by this function. They need to be held 432 * The tree locks are not taken by this function. They need to be held
433 * by the caller. 433 * by the caller.
434 */ 434 */
435 static int split_state(struct extent_io_tree *tree, struct extent_state *orig, 435 static int split_state(struct extent_io_tree *tree, struct extent_state *orig,
436 struct extent_state *prealloc, u64 split) 436 struct extent_state *prealloc, u64 split)
437 { 437 {
438 struct rb_node *node; 438 struct rb_node *node;
439 439
440 split_cb(tree, orig, split); 440 split_cb(tree, orig, split);
441 441
442 prealloc->start = orig->start; 442 prealloc->start = orig->start;
443 prealloc->end = split - 1; 443 prealloc->end = split - 1;
444 prealloc->state = orig->state; 444 prealloc->state = orig->state;
445 orig->start = split; 445 orig->start = split;
446 446
447 node = tree_insert(&tree->state, prealloc->end, &prealloc->rb_node); 447 node = tree_insert(&tree->state, prealloc->end, &prealloc->rb_node);
448 if (node) { 448 if (node) {
449 free_extent_state(prealloc); 449 free_extent_state(prealloc);
450 return -EEXIST; 450 return -EEXIST;
451 } 451 }
452 prealloc->tree = tree; 452 prealloc->tree = tree;
453 return 0; 453 return 0;
454 } 454 }
455 455
456 static struct extent_state *next_state(struct extent_state *state) 456 static struct extent_state *next_state(struct extent_state *state)
457 { 457 {
458 struct rb_node *next = rb_next(&state->rb_node); 458 struct rb_node *next = rb_next(&state->rb_node);
459 if (next) 459 if (next)
460 return rb_entry(next, struct extent_state, rb_node); 460 return rb_entry(next, struct extent_state, rb_node);
461 else 461 else
462 return NULL; 462 return NULL;
463 } 463 }
464 464
465 /* 465 /*
466 * utility function to clear some bits in an extent state struct. 466 * utility function to clear some bits in an extent state struct.
467 * it will optionally wake up any one waiting on this state (wake == 1). 467 * it will optionally wake up any one waiting on this state (wake == 1).
468 * 468 *
469 * If no bits are set on the state struct after clearing things, the 469 * If no bits are set on the state struct after clearing things, the
470 * struct is freed and removed from the tree 470 * struct is freed and removed from the tree
471 */ 471 */
472 static struct extent_state *clear_state_bit(struct extent_io_tree *tree, 472 static struct extent_state *clear_state_bit(struct extent_io_tree *tree,
473 struct extent_state *state, 473 struct extent_state *state,
474 unsigned long *bits, int wake) 474 unsigned long *bits, int wake)
475 { 475 {
476 struct extent_state *next; 476 struct extent_state *next;
477 unsigned long bits_to_clear = *bits & ~EXTENT_CTLBITS; 477 unsigned long bits_to_clear = *bits & ~EXTENT_CTLBITS;
478 478
479 if ((bits_to_clear & EXTENT_DIRTY) && (state->state & EXTENT_DIRTY)) { 479 if ((bits_to_clear & EXTENT_DIRTY) && (state->state & EXTENT_DIRTY)) {
480 u64 range = state->end - state->start + 1; 480 u64 range = state->end - state->start + 1;
481 WARN_ON(range > tree->dirty_bytes); 481 WARN_ON(range > tree->dirty_bytes);
482 tree->dirty_bytes -= range; 482 tree->dirty_bytes -= range;
483 } 483 }
484 clear_state_cb(tree, state, bits); 484 clear_state_cb(tree, state, bits);
485 state->state &= ~bits_to_clear; 485 state->state &= ~bits_to_clear;
486 if (wake) 486 if (wake)
487 wake_up(&state->wq); 487 wake_up(&state->wq);
488 if (state->state == 0) { 488 if (state->state == 0) {
489 next = next_state(state); 489 next = next_state(state);
490 if (state->tree) { 490 if (state->tree) {
491 rb_erase(&state->rb_node, &tree->state); 491 rb_erase(&state->rb_node, &tree->state);
492 state->tree = NULL; 492 state->tree = NULL;
493 free_extent_state(state); 493 free_extent_state(state);
494 } else { 494 } else {
495 WARN_ON(1); 495 WARN_ON(1);
496 } 496 }
497 } else { 497 } else {
498 merge_state(tree, state); 498 merge_state(tree, state);
499 next = next_state(state); 499 next = next_state(state);
500 } 500 }
501 return next; 501 return next;
502 } 502 }
503 503
504 static struct extent_state * 504 static struct extent_state *
505 alloc_extent_state_atomic(struct extent_state *prealloc) 505 alloc_extent_state_atomic(struct extent_state *prealloc)
506 { 506 {
507 if (!prealloc) 507 if (!prealloc)
508 prealloc = alloc_extent_state(GFP_ATOMIC); 508 prealloc = alloc_extent_state(GFP_ATOMIC);
509 509
510 return prealloc; 510 return prealloc;
511 } 511 }
512 512
513 static void extent_io_tree_panic(struct extent_io_tree *tree, int err) 513 static void extent_io_tree_panic(struct extent_io_tree *tree, int err)
514 { 514 {
515 btrfs_panic(tree_fs_info(tree), err, "Locking error: " 515 btrfs_panic(tree_fs_info(tree), err, "Locking error: "
516 "Extent tree was modified by another " 516 "Extent tree was modified by another "
517 "thread while locked."); 517 "thread while locked.");
518 } 518 }
519 519
520 /* 520 /*
521 * clear some bits on a range in the tree. This may require splitting 521 * clear some bits on a range in the tree. This may require splitting
522 * or inserting elements in the tree, so the gfp mask is used to 522 * or inserting elements in the tree, so the gfp mask is used to
523 * indicate which allocations or sleeping are allowed. 523 * indicate which allocations or sleeping are allowed.
524 * 524 *
525 * pass 'wake' == 1 to kick any sleepers, and 'delete' == 1 to remove 525 * pass 'wake' == 1 to kick any sleepers, and 'delete' == 1 to remove
526 * the given range from the tree regardless of state (ie for truncate). 526 * the given range from the tree regardless of state (ie for truncate).
527 * 527 *
528 * the range [start, end] is inclusive. 528 * the range [start, end] is inclusive.
529 * 529 *
530 * This takes the tree lock, and returns 0 on success and < 0 on error. 530 * This takes the tree lock, and returns 0 on success and < 0 on error.
531 */ 531 */
532 int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, 532 int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
533 unsigned long bits, int wake, int delete, 533 unsigned long bits, int wake, int delete,
534 struct extent_state **cached_state, 534 struct extent_state **cached_state,
535 gfp_t mask) 535 gfp_t mask)
536 { 536 {
537 struct extent_state *state; 537 struct extent_state *state;
538 struct extent_state *cached; 538 struct extent_state *cached;
539 struct extent_state *prealloc = NULL; 539 struct extent_state *prealloc = NULL;
540 struct rb_node *node; 540 struct rb_node *node;
541 u64 last_end; 541 u64 last_end;
542 int err; 542 int err;
543 int clear = 0; 543 int clear = 0;
544 544
545 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end); 545 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end);
546 546
547 if (bits & EXTENT_DELALLOC) 547 if (bits & EXTENT_DELALLOC)
548 bits |= EXTENT_NORESERVE; 548 bits |= EXTENT_NORESERVE;
549 549
550 if (delete) 550 if (delete)
551 bits |= ~EXTENT_CTLBITS; 551 bits |= ~EXTENT_CTLBITS;
552 bits |= EXTENT_FIRST_DELALLOC; 552 bits |= EXTENT_FIRST_DELALLOC;
553 553
554 if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY)) 554 if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
555 clear = 1; 555 clear = 1;
556 again: 556 again:
557 if (!prealloc && (mask & __GFP_WAIT)) { 557 if (!prealloc && (mask & __GFP_WAIT)) {
558 prealloc = alloc_extent_state(mask); 558 prealloc = alloc_extent_state(mask);
559 if (!prealloc) 559 if (!prealloc)
560 return -ENOMEM; 560 return -ENOMEM;
561 } 561 }
562 562
563 spin_lock(&tree->lock); 563 spin_lock(&tree->lock);
564 if (cached_state) { 564 if (cached_state) {
565 cached = *cached_state; 565 cached = *cached_state;
566 566
567 if (clear) { 567 if (clear) {
568 *cached_state = NULL; 568 *cached_state = NULL;
569 cached_state = NULL; 569 cached_state = NULL;
570 } 570 }
571 571
572 if (cached && cached->tree && cached->start <= start && 572 if (cached && cached->tree && cached->start <= start &&
573 cached->end > start) { 573 cached->end > start) {
574 if (clear) 574 if (clear)
575 atomic_dec(&cached->refs); 575 atomic_dec(&cached->refs);
576 state = cached; 576 state = cached;
577 goto hit_next; 577 goto hit_next;
578 } 578 }
579 if (clear) 579 if (clear)
580 free_extent_state(cached); 580 free_extent_state(cached);
581 } 581 }
582 /* 582 /*
583 * this search will find the extents that end after 583 * this search will find the extents that end after
584 * our range starts 584 * our range starts
585 */ 585 */
586 node = tree_search(tree, start); 586 node = tree_search(tree, start);
587 if (!node) 587 if (!node)
588 goto out; 588 goto out;
589 state = rb_entry(node, struct extent_state, rb_node); 589 state = rb_entry(node, struct extent_state, rb_node);
590 hit_next: 590 hit_next:
591 if (state->start > end) 591 if (state->start > end)
592 goto out; 592 goto out;
593 WARN_ON(state->end < start); 593 WARN_ON(state->end < start);
594 last_end = state->end; 594 last_end = state->end;
595 595
596 /* the state doesn't have the wanted bits, go ahead */ 596 /* the state doesn't have the wanted bits, go ahead */
597 if (!(state->state & bits)) { 597 if (!(state->state & bits)) {
598 state = next_state(state); 598 state = next_state(state);
599 goto next; 599 goto next;
600 } 600 }
601 601
602 /* 602 /*
603 * | ---- desired range ---- | 603 * | ---- desired range ---- |
604 * | state | or 604 * | state | or
605 * | ------------- state -------------- | 605 * | ------------- state -------------- |
606 * 606 *
607 * We need to split the extent we found, and may flip 607 * We need to split the extent we found, and may flip
608 * bits on second half. 608 * bits on second half.
609 * 609 *
610 * If the extent we found extends past our range, we 610 * If the extent we found extends past our range, we
611 * just split and search again. It'll get split again 611 * just split and search again. It'll get split again
612 * the next time though. 612 * the next time though.
613 * 613 *
614 * If the extent we found is inside our range, we clear 614 * If the extent we found is inside our range, we clear
615 * the desired bit on it. 615 * the desired bit on it.
616 */ 616 */
617 617
618 if (state->start < start) { 618 if (state->start < start) {
619 prealloc = alloc_extent_state_atomic(prealloc); 619 prealloc = alloc_extent_state_atomic(prealloc);
620 BUG_ON(!prealloc); 620 BUG_ON(!prealloc);
621 err = split_state(tree, state, prealloc, start); 621 err = split_state(tree, state, prealloc, start);
622 if (err) 622 if (err)
623 extent_io_tree_panic(tree, err); 623 extent_io_tree_panic(tree, err);
624 624
625 prealloc = NULL; 625 prealloc = NULL;
626 if (err) 626 if (err)
627 goto out; 627 goto out;
628 if (state->end <= end) { 628 if (state->end <= end) {
629 state = clear_state_bit(tree, state, &bits, wake); 629 state = clear_state_bit(tree, state, &bits, wake);
630 goto next; 630 goto next;
631 } 631 }
632 goto search_again; 632 goto search_again;
633 } 633 }
634 /* 634 /*
635 * | ---- desired range ---- | 635 * | ---- desired range ---- |
636 * | state | 636 * | state |
637 * We need to split the extent, and clear the bit 637 * We need to split the extent, and clear the bit
638 * on the first half 638 * on the first half
639 */ 639 */
640 if (state->start <= end && state->end > end) { 640 if (state->start <= end && state->end > end) {
641 prealloc = alloc_extent_state_atomic(prealloc); 641 prealloc = alloc_extent_state_atomic(prealloc);
642 BUG_ON(!prealloc); 642 BUG_ON(!prealloc);
643 err = split_state(tree, state, prealloc, end + 1); 643 err = split_state(tree, state, prealloc, end + 1);
644 if (err) 644 if (err)
645 extent_io_tree_panic(tree, err); 645 extent_io_tree_panic(tree, err);
646 646
647 if (wake) 647 if (wake)
648 wake_up(&state->wq); 648 wake_up(&state->wq);
649 649
650 clear_state_bit(tree, prealloc, &bits, wake); 650 clear_state_bit(tree, prealloc, &bits, wake);
651 651
652 prealloc = NULL; 652 prealloc = NULL;
653 goto out; 653 goto out;
654 } 654 }
655 655
656 state = clear_state_bit(tree, state, &bits, wake); 656 state = clear_state_bit(tree, state, &bits, wake);
657 next: 657 next:
658 if (last_end == (u64)-1) 658 if (last_end == (u64)-1)
659 goto out; 659 goto out;
660 start = last_end + 1; 660 start = last_end + 1;
661 if (start <= end && state && !need_resched()) 661 if (start <= end && state && !need_resched())
662 goto hit_next; 662 goto hit_next;
663 goto search_again; 663 goto search_again;
664 664
665 out: 665 out:
666 spin_unlock(&tree->lock); 666 spin_unlock(&tree->lock);
667 if (prealloc) 667 if (prealloc)
668 free_extent_state(prealloc); 668 free_extent_state(prealloc);
669 669
670 return 0; 670 return 0;
671 671
672 search_again: 672 search_again:
673 if (start > end) 673 if (start > end)
674 goto out; 674 goto out;
675 spin_unlock(&tree->lock); 675 spin_unlock(&tree->lock);
676 if (mask & __GFP_WAIT) 676 if (mask & __GFP_WAIT)
677 cond_resched(); 677 cond_resched();
678 goto again; 678 goto again;
679 } 679 }
680 680
681 static void wait_on_state(struct extent_io_tree *tree, 681 static void wait_on_state(struct extent_io_tree *tree,
682 struct extent_state *state) 682 struct extent_state *state)
683 __releases(tree->lock) 683 __releases(tree->lock)
684 __acquires(tree->lock) 684 __acquires(tree->lock)
685 { 685 {
686 DEFINE_WAIT(wait); 686 DEFINE_WAIT(wait);
687 prepare_to_wait(&state->wq, &wait, TASK_UNINTERRUPTIBLE); 687 prepare_to_wait(&state->wq, &wait, TASK_UNINTERRUPTIBLE);
688 spin_unlock(&tree->lock); 688 spin_unlock(&tree->lock);
689 schedule(); 689 schedule();
690 spin_lock(&tree->lock); 690 spin_lock(&tree->lock);
691 finish_wait(&state->wq, &wait); 691 finish_wait(&state->wq, &wait);
692 } 692 }
693 693
694 /* 694 /*
695 * waits for one or more bits to clear on a range in the state tree. 695 * waits for one or more bits to clear on a range in the state tree.
696 * The range [start, end] is inclusive. 696 * The range [start, end] is inclusive.
697 * The tree lock is taken by this function 697 * The tree lock is taken by this function
698 */ 698 */
699 static void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, 699 static void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
700 unsigned long bits) 700 unsigned long bits)
701 { 701 {
702 struct extent_state *state; 702 struct extent_state *state;
703 struct rb_node *node; 703 struct rb_node *node;
704 704
705 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end); 705 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end);
706 706
707 spin_lock(&tree->lock); 707 spin_lock(&tree->lock);
708 again: 708 again:
709 while (1) { 709 while (1) {
710 /* 710 /*
711 * this search will find all the extents that end after 711 * this search will find all the extents that end after
712 * our range starts 712 * our range starts
713 */ 713 */
714 node = tree_search(tree, start); 714 node = tree_search(tree, start);
715 if (!node) 715 if (!node)
716 break; 716 break;
717 717
718 state = rb_entry(node, struct extent_state, rb_node); 718 state = rb_entry(node, struct extent_state, rb_node);
719 719
720 if (state->start > end) 720 if (state->start > end)
721 goto out; 721 goto out;
722 722
723 if (state->state & bits) { 723 if (state->state & bits) {
724 start = state->start; 724 start = state->start;
725 atomic_inc(&state->refs); 725 atomic_inc(&state->refs);
726 wait_on_state(tree, state); 726 wait_on_state(tree, state);
727 free_extent_state(state); 727 free_extent_state(state);
728 goto again; 728 goto again;
729 } 729 }
730 start = state->end + 1; 730 start = state->end + 1;
731 731
732 if (start > end) 732 if (start > end)
733 break; 733 break;
734 734
735 cond_resched_lock(&tree->lock); 735 cond_resched_lock(&tree->lock);
736 } 736 }
737 out: 737 out:
738 spin_unlock(&tree->lock); 738 spin_unlock(&tree->lock);
739 } 739 }
740 740
741 static void set_state_bits(struct extent_io_tree *tree, 741 static void set_state_bits(struct extent_io_tree *tree,
742 struct extent_state *state, 742 struct extent_state *state,
743 unsigned long *bits) 743 unsigned long *bits)
744 { 744 {
745 unsigned long bits_to_set = *bits & ~EXTENT_CTLBITS; 745 unsigned long bits_to_set = *bits & ~EXTENT_CTLBITS;
746 746
747 set_state_cb(tree, state, bits); 747 set_state_cb(tree, state, bits);
748 if ((bits_to_set & EXTENT_DIRTY) && !(state->state & EXTENT_DIRTY)) { 748 if ((bits_to_set & EXTENT_DIRTY) && !(state->state & EXTENT_DIRTY)) {
749 u64 range = state->end - state->start + 1; 749 u64 range = state->end - state->start + 1;
750 tree->dirty_bytes += range; 750 tree->dirty_bytes += range;
751 } 751 }
752 state->state |= bits_to_set; 752 state->state |= bits_to_set;
753 } 753 }
754 754
755 static void cache_state(struct extent_state *state, 755 static void cache_state(struct extent_state *state,
756 struct extent_state **cached_ptr) 756 struct extent_state **cached_ptr)
757 { 757 {
758 if (cached_ptr && !(*cached_ptr)) { 758 if (cached_ptr && !(*cached_ptr)) {
759 if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) { 759 if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY)) {
760 *cached_ptr = state; 760 *cached_ptr = state;
761 atomic_inc(&state->refs); 761 atomic_inc(&state->refs);
762 } 762 }
763 } 763 }
764 } 764 }
765 765
766 /* 766 /*
767 * set some bits on a range in the tree. This may require allocations or 767 * set some bits on a range in the tree. This may require allocations or
768 * sleeping, so the gfp mask is used to indicate what is allowed. 768 * sleeping, so the gfp mask is used to indicate what is allowed.
769 * 769 *
770 * If any of the exclusive bits are set, this will fail with -EEXIST if some 770 * If any of the exclusive bits are set, this will fail with -EEXIST if some
771 * part of the range already has the desired bits set. The start of the 771 * part of the range already has the desired bits set. The start of the
772 * existing range is returned in failed_start in this case. 772 * existing range is returned in failed_start in this case.
773 * 773 *
774 * [start, end] is inclusive This takes the tree lock. 774 * [start, end] is inclusive This takes the tree lock.
775 */ 775 */
776 776
777 static int __must_check 777 static int __must_check
778 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, 778 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
779 unsigned long bits, unsigned long exclusive_bits, 779 unsigned long bits, unsigned long exclusive_bits,
780 u64 *failed_start, struct extent_state **cached_state, 780 u64 *failed_start, struct extent_state **cached_state,
781 gfp_t mask) 781 gfp_t mask)
782 { 782 {
783 struct extent_state *state; 783 struct extent_state *state;
784 struct extent_state *prealloc = NULL; 784 struct extent_state *prealloc = NULL;
785 struct rb_node *node; 785 struct rb_node *node;
786 int err = 0; 786 int err = 0;
787 u64 last_start; 787 u64 last_start;
788 u64 last_end; 788 u64 last_end;
789 789
790 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end); 790 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end);
791 791
792 bits |= EXTENT_FIRST_DELALLOC; 792 bits |= EXTENT_FIRST_DELALLOC;
793 again: 793 again:
794 if (!prealloc && (mask & __GFP_WAIT)) { 794 if (!prealloc && (mask & __GFP_WAIT)) {
795 prealloc = alloc_extent_state(mask); 795 prealloc = alloc_extent_state(mask);
796 BUG_ON(!prealloc); 796 BUG_ON(!prealloc);
797 } 797 }
798 798
799 spin_lock(&tree->lock); 799 spin_lock(&tree->lock);
800 if (cached_state && *cached_state) { 800 if (cached_state && *cached_state) {
801 state = *cached_state; 801 state = *cached_state;
802 if (state->start <= start && state->end > start && 802 if (state->start <= start && state->end > start &&
803 state->tree) { 803 state->tree) {
804 node = &state->rb_node; 804 node = &state->rb_node;
805 goto hit_next; 805 goto hit_next;
806 } 806 }
807 } 807 }
808 /* 808 /*
809 * this search will find all the extents that end after 809 * this search will find all the extents that end after
810 * our range starts. 810 * our range starts.
811 */ 811 */
812 node = tree_search(tree, start); 812 node = tree_search(tree, start);
813 if (!node) { 813 if (!node) {
814 prealloc = alloc_extent_state_atomic(prealloc); 814 prealloc = alloc_extent_state_atomic(prealloc);
815 BUG_ON(!prealloc); 815 BUG_ON(!prealloc);
816 err = insert_state(tree, prealloc, start, end, &bits); 816 err = insert_state(tree, prealloc, start, end, &bits);
817 if (err) 817 if (err)
818 extent_io_tree_panic(tree, err); 818 extent_io_tree_panic(tree, err);
819 819
820 prealloc = NULL; 820 prealloc = NULL;
821 goto out; 821 goto out;
822 } 822 }
823 state = rb_entry(node, struct extent_state, rb_node); 823 state = rb_entry(node, struct extent_state, rb_node);
824 hit_next: 824 hit_next:
825 last_start = state->start; 825 last_start = state->start;
826 last_end = state->end; 826 last_end = state->end;
827 827
828 /* 828 /*
829 * | ---- desired range ---- | 829 * | ---- desired range ---- |
830 * | state | 830 * | state |
831 * 831 *
832 * Just lock what we found and keep going 832 * Just lock what we found and keep going
833 */ 833 */
834 if (state->start == start && state->end <= end) { 834 if (state->start == start && state->end <= end) {
835 if (state->state & exclusive_bits) { 835 if (state->state & exclusive_bits) {
836 *failed_start = state->start; 836 *failed_start = state->start;
837 err = -EEXIST; 837 err = -EEXIST;
838 goto out; 838 goto out;
839 } 839 }
840 840
841 set_state_bits(tree, state, &bits); 841 set_state_bits(tree, state, &bits);
842 cache_state(state, cached_state); 842 cache_state(state, cached_state);
843 merge_state(tree, state); 843 merge_state(tree, state);
844 if (last_end == (u64)-1) 844 if (last_end == (u64)-1)
845 goto out; 845 goto out;
846 start = last_end + 1; 846 start = last_end + 1;
847 state = next_state(state); 847 state = next_state(state);
848 if (start < end && state && state->start == start && 848 if (start < end && state && state->start == start &&
849 !need_resched()) 849 !need_resched())
850 goto hit_next; 850 goto hit_next;
851 goto search_again; 851 goto search_again;
852 } 852 }
853 853
854 /* 854 /*
855 * | ---- desired range ---- | 855 * | ---- desired range ---- |
856 * | state | 856 * | state |
857 * or 857 * or
858 * | ------------- state -------------- | 858 * | ------------- state -------------- |
859 * 859 *
860 * We need to split the extent we found, and may flip bits on 860 * We need to split the extent we found, and may flip bits on
861 * second half. 861 * second half.
862 * 862 *
863 * If the extent we found extends past our 863 * If the extent we found extends past our
864 * range, we just split and search again. It'll get split 864 * range, we just split and search again. It'll get split
865 * again the next time though.
866 *
867 * If the extent we found is inside our range, we set the
868 * desired bit on it.
869 */
870 if (state->start < start) {
871 if (state->state & exclusive_bits) {
872 *failed_start = start;
873 err = -EEXIST;
874 goto out;
875 }
876
877 prealloc = alloc_extent_state_atomic(prealloc);
878 BUG_ON(!prealloc);
879 err = split_state(tree, state, prealloc, start);
880 if (err)
881 extent_io_tree_panic(tree, err);
882
883 prealloc = NULL;
884 if (err)
885 goto out;
886 if (state->end <= end) {
887 set_state_bits(tree, state, &bits);
888 cache_state(state, cached_state);
889 merge_state(tree, state);
890 if (last_end == (u64)-1)
891 goto out;
892 start = last_end + 1;
893 state = next_state(state);
894 if (start < end && state && state->start == start &&
895 !need_resched())
896 goto hit_next;
897 }
898 goto search_again;
899 }
900 /*
901 * | ---- desired range ---- |
902 * | state | or | state |
903 *
904 * There's a hole, we need to insert something in it and
905 * ignore the extent we found.
906 */
907 if (state->start > start) {
908 u64 this_end;
909 if (end < last_start)
910 this_end = end;
911 else
912 this_end = last_start - 1;
913
914 prealloc = alloc_extent_state_atomic(prealloc);
915 BUG_ON(!prealloc);
916
917 /*
918 * Avoid to free 'prealloc' if it can be merged with
919 * the later extent.
920 */
921 err = insert_state(tree, prealloc, start, this_end,
922 &bits);
923 if (err)
924 extent_io_tree_panic(tree, err);
925
926 cache_state(prealloc, cached_state);
927 prealloc = NULL;
928 start = this_end + 1;
929 goto search_again;
930 }
931 /*
932 * | ---- desired range ---- |
933 * | state |
934 * We need to split the extent, and set the bit
935 * on the first half
936 */
937 if (state->start <= end && state->end > end) {
938 if (state->state & exclusive_bits) {
939 *failed_start = start;
940 err = -EEXIST;
941 goto out;
942 }
943
944 prealloc = alloc_extent_state_atomic(prealloc);
945 BUG_ON(!prealloc);
946 err = split_state(tree, state, prealloc, end + 1);
947 if (err)
948 extent_io_tree_panic(tree, err);
949
950 set_state_bits(tree, prealloc, &bits);
951 cache_state(prealloc, cached_state);
952 merge_state(tree, prealloc);
953 prealloc = NULL;
954 goto out;
955 }
956
957 goto search_again;
958
959 out:
960 spin_unlock(&tree->lock);
961 if (prealloc)
962 free_extent_state(prealloc);
963
964 return err;
965
966 search_again:
967 if (start > end)
968 goto out;
969 spin_unlock(&tree->lock);
970 if (mask & __GFP_WAIT)
971 cond_resched();
972 goto again;
973 }
974
975 int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
976 unsigned long bits, u64 * failed_start,
977 struct extent_state **cached_state, gfp_t mask)
978 {
979 return __set_extent_bit(tree, start, end, bits, 0, failed_start,
980 cached_state, mask);
981 }
982
983
984 /**
985 * convert_extent_bit - convert all bits in a given range from one bit to
986 * another
987 * @tree: the io tree to search
988 * @start: the start offset in bytes
989 * @end: the end offset in bytes (inclusive)
990 * @bits: the bits to set in this range
991 * @clear_bits: the bits to clear in this range
992 * @cached_state: state that we're going to cache
993 * @mask: the allocation mask
994 *
995 * This will go through and set bits for the given range. If any states exist
996 * already in this range they are set with the given bit and cleared of the
997 * clear_bits. This is only meant to be used by things that are mergeable, ie
998 * converting from say DELALLOC to DIRTY. This is not meant to be used with
999 * boundary bits like LOCK.
1000 */
1001 int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
1002 unsigned long bits, unsigned long clear_bits,
1003 struct extent_state **cached_state, gfp_t mask)
1004 {
1005 struct extent_state *state;
1006 struct extent_state *prealloc = NULL;
1007 struct rb_node *node;
1008 int err = 0;
1009 u64 last_start;
1010 u64 last_end;
1011
1012 btrfs_debug_check_extent_io_range(tree->mapping->host, start, end);
1013
1014 again:
1015 if (!prealloc && (mask & __GFP_WAIT)) {
1016 prealloc = alloc_extent_state(mask);
1017 if (!prealloc)
1018 return -ENOMEM;
1019 }
1020
1021 spin_lock(&tree->lock);
1022 if (cached_state && *cached_state) {
1023 state = *cached_state;
1024 if (state->start <= start && state->end > start &&
1025 state->tree) {
1026 node = &state->rb_node;
1027 goto hit_next;
1028 }
1029 }
1030
1031 /*
1032 * this search will find all the extents that end after
1033 * our range starts.
1034 */
1035 node = tree_search(tree, start);
1036 if (!node) {
1037 prealloc = alloc_extent_state_atomic(prealloc);
1038 if (!prealloc) {
1039 err = -ENOMEM;
1040 goto out;
1041 }
1042 err = insert_state(tree, prealloc, start, end, &bits);
1043 prealloc = NULL;
1044 if (err)
1045 extent_io_tree_panic(tree, err);
1046 goto out;
1047 }
1048 state = rb_entry(node, struct extent_state, rb_node);
1049 hit_next:
1050 last_start = state->start;
1051 last_end = state->end;
1052
1053 /*
1054 * | ---- desired range ---- |
1055 * | state |
1056 *
1057 * Just lock what we found and keep going
1058 */
1059 if (state->start == start && state->end <= end) {
1060 set_state_bits(tree, state, &bits);
1061 cache_state(state, cached_state);
1062 state = clear_state_bit(tree, state, &clear_bits, 0);
1063 if (last_end == (u64)-1)
1064 goto out;
1065 start = last_end + 1;
1066 if (start < end && state && state->start == start &&
1067 !need_resched())
1068 goto hit_next;
1069 goto search_again;
1070 }
1071
1072 /*
1073 * | ---- desired range ---- |
1074 * | state |
1075 * or
1076 * | ------------- state -------------- |
1077 *
1078 * We need to split the extent we found, and may flip bits on
1079 * second half.
1080 *
1081 * If the extent we found extends past our
1082 * range, we just split and search again. It'll get split
1083 * again the next time though.
1084 *
1085 * If the extent we found is inside our range, we set the
1086 * desired bit on it.
1087 */
1088 if (state->start < start) {
1089 prealloc = alloc_extent_state_atomic(prealloc);
1090 if (!prealloc) {
1091 err = -ENOMEM;
1092 goto out;
1093 }
1094 err = split_state(tree, state, prealloc, start);
1095 if (err)
1096 extent_io_tree_panic(tree, err);
1097 prealloc = NULL;
1098 if (err)
1099 goto out;
1100 if (state->end <= end) {
1101 set_state_bits(tree, state, &bits);
1102 cache_state(state, cached_state);
1103 state = clear_state_bit(tree, state, &clear_bits, 0);
1104 if (last_end == (u64)-1)
1105 goto out;
1106 start = last_end + 1;
1107 if (start < end && state && state->start == start &&
1108 !need_resched())
1109 goto hit_next;
1110 }
1111 goto search_again;
1112 }
1113 /*
1114 * | ---- desired range ---- |
1115 * | state | or | state |
1116 *
1117 * There's a hole, we need to insert something in it and
1118 * ignore the extent we found.
1119 */
1120 if (state->start > start) {
1121 u64 this_end;
1122 if (end < last_start)
1123 this_end = end;
1124 else
1125 this_end = last_start - 1;
1126
1127 prealloc = alloc_extent_state_atomic(prealloc);
1128 if (!prealloc) {
1129 err = -ENOMEM;
1130 goto out;
1131 }
1132
1133 /*
1134 * Avoid to free 'prealloc' if it can be merged with
1135 * the later extent.
1136 */
1137 err = insert_state(tree, prealloc, start, this_end,
1138 &bits);
1139 if (err)
1140 extent_io_tree_panic(tree, err);
1141 cache_state(prealloc, cached_state);
1142 prealloc = NULL;
1143 start = this_end + 1;
1144 goto search_again;
1145 }
1146 /*
1147 * | ---- desired range ---- |
1148 * | state |
1149 * We need to split the extent, and set the bit
1150 * on the first half
1151 */
1152 if (state->start <= end && state->end > end) {
1153 prealloc = alloc_extent_state_atomic(prealloc);
1154 if (!prealloc) {
1155 err = -ENOMEM;
1156 goto out;
1157 }
1158
1159 err = split_state(tree, state, prealloc, end + 1);
1160 if (err)
1161 extent_io_tree_panic(tree, err);
1162
1163 set_state_bits(tree, prealloc, &bits);
1164 cache_state(prealloc, cached_state);
1165 clear_state_bit(tree, prealloc, &clear_bits, 0);
1166 prealloc = NULL;
1167 goto out;
1168 }
1169
1170 goto search_again;
1171
1172 out:
1173 spin_unlock(&tree->lock);
1174 if (prealloc)
1175 free_extent_state(prealloc);
1176
1177 return err;
1178
1179 search_again:
1180 if (start > end)
1181 goto out;
1182 spin_unlock(&tree->lock);
1183 if (mask & __GFP_WAIT)
1184 cond_resched();
1185 goto again;
1186 }
1187
1188 /* wrappers around set/clear extent bit */
1189 int set_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
1190 gfp_t mask)
1191 {
1192 return set_extent_bit(tree, start, end, EXTENT_DIRTY, NULL,
1193 NULL, mask);
1194 }
1195
1196 int set_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
1197 unsigned long bits, gfp_t mask)
1198 {
1199 return set_extent_bit(tree, start, end, bits, NULL,
1200 NULL, mask);
1201 }
1202
1203 int clear_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
1204 unsigned long bits, gfp_t mask)
1205 {
1206 return clear_extent_bit(tree, start, end, bits, 0, 0, NULL, mask);
1207 }
1208
1209 int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
1210 struct extent_state **cached_state, gfp_t mask)
1211 {
1212 return set_extent_bit(tree, start, end,
1213 EXTENT_DELALLOC | EXTENT_UPTODATE,
1214 NULL, cached_state, mask);
1215 }
1216
1217 int set_extent_defrag(struct extent_io_tree *tree, u64 start, u64 end,
1218 struct extent_state **cached_state, gfp_t mask)
1219 {
1220 return set_extent_bit(tree, start, end,
1221 EXTENT_DELALLOC | EXTENT_UPTODATE | EXTENT_DEFRAG,
1222 NULL, cached_state, mask);
1223 }
1224
1225 int clear_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
1226 gfp_t mask)
1227 {
1228 return clear_extent_bit(tree, start, end,
1229 EXTENT_DIRTY | EXTENT_DELALLOC |
1230 EXTENT_DO_ACCOUNTING, 0, 0, NULL, mask);
1231 }
1232
1233 int set_extent_new(struct extent_io_tree *tree, u64 start, u64 end,
1234 gfp_t mask)
1235 {
1236 return set_extent_bit(tree, start, end, EXTENT_NEW, NULL,
1237 NULL, mask);
1238 }
1239
1240 int set_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
1241 struct extent_state **cached_state, gfp_t mask)
1242 {
1243 return set_extent_bit(tree, start, end, EXTENT_UPTODATE, NULL,
1244 cached_state, mask);
1245 }
1246
1247 int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
1248 struct extent_state **cached_state, gfp_t mask)
1249 {
1250 return clear_extent_bit(tree, start, end, EXTENT_UPTODATE, 0, 0,
1251 cached_state, mask);
1252 }
1253
1254 /*
1255 * either insert or lock state struct between start and end use mask to tell
1256 * us if waiting is desired.
1257 */
1258 int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
1259 unsigned long bits, struct extent_state **cached_state)
1260 {
1261 int err;
1262 u64 failed_start;
1263 while (1) {
1264 err = __set_extent_bit(tree, start, end, EXTENT_LOCKED | bits,
1265 EXTENT_LOCKED, &failed_start,
1266 cached_state, GFP_NOFS);
1267 if (err == -EEXIST) {
1268 wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED);
1269 start = failed_start;
1270 } else
1271 break;
1272 WARN_ON(start > end);
1273 }
1274 return err;
1275 }
1276
1277 int lock_extent(struct extent_io_tree *tree, u64 start, u64 end)
1278 {
1279 return lock_extent_bits(tree, start, end, 0, NULL);
1280 }
1281
1282 int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end)
1283 {
1284 int err;
1285 u64 failed_start;
1286
1287 err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, EXTENT_LOCKED,
1288 &failed_start, NULL, GFP_NOFS);
1289 if (err == -EEXIST) {
1290 if (failed_start > start)
1291 clear_extent_bit(tree, start, failed_start - 1,
1292 EXTENT_LOCKED, 1, 0, NULL, GFP_NOFS);
1293 return 0;
1294 }
1295 return 1;
1296 }
1297
1298 int unlock_extent_cached(struct extent_io_tree *tree, u64 start, u64 end,
1299 struct extent_state **cached, gfp_t mask)
1300 {
1301 return clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, cached,
1302 mask);
1303 }
1304
1305 int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end)
1306 {
1307 return clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, NULL,
1308 GFP_NOFS);
1309 }
1310
1311 int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
1312 {
1313 unsigned long index = start >> PAGE_CACHE_SHIFT;
1314 unsigned long end_index = end >> PAGE_CACHE_SHIFT;
1315 struct page *page;
1316
1317 while (index <= end_index) {
1318 page = find_get_page(inode->i_mapping, index);
1319 BUG_ON(!page); /* Pages should be in the extent_io_tree */
1320 clear_page_dirty_for_io(page);
1321 page_cache_release(page);
1322 index++;
1323 }
1324 return 0;
1325 }
1326
1327 int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
1328 {
1329 unsigned long index = start >> PAGE_CACHE_SHIFT;
1330 unsigned long end_index = end >> PAGE_CACHE_SHIFT;
1331 struct page *page;
1332
1333 while (index <= end_index) {
1334 page = find_get_page(inode->i_mapping, index);
1335 BUG_ON(!page); /* Pages should be in the extent_io_tree */
1336 account_page_redirty(page);
1337 __set_page_dirty_nobuffers(page);
1338 page_cache_release(page);
1339 index++;
1340 }
1341 return 0;
1342 }
1343
1344 /*
1345 * helper function to set both pages and extents in the tree writeback
1346 */
1347 static int set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
1348 {
1349 unsigned long index = start >> PAGE_CACHE_SHIFT;
1350 unsigned long end_index = end >> PAGE_CACHE_SHIFT;
1351 struct page *page;
1352
1353 while (index <= end_index) {
1354 page = find_get_page(tree->mapping, index);
1355 BUG_ON(!page); /* Pages should be in the extent_io_tree */
1356 set_page_writeback(page);
1357 page_cache_release(page);
1358 index++;
1359 }
1360 return 0;
1361 }
1362
1363 /* find the first state struct with 'bits' set after 'start', and
1364 * return it. tree->lock must be held. NULL will returned if
1365 * nothing was found after 'start'
1366 */
1367 static struct extent_state *
1368 find_first_extent_bit_state(struct extent_io_tree *tree,
1369 u64 start, unsigned long bits)
1370 {
1371 struct rb_node *node;
1372 struct extent_state *state;
1373
1374 /*
1375 * this search will find all the extents that end after
1376 * our range starts.
1377 */
1378 node = tree_search(tree, start);
1379 if (!node)
1380 goto out;
1381
1382 while (1) {
1383 state = rb_entry(node, struct extent_state, rb_node);
1384 if (state->end >= start && (state->state & bits))
1385 return state;
1386
1387 node = rb_next(node);
1388 if (!node)
1389 break;
1390 }
1391 out:
1392 return NULL;
1393 }
1394
1395 /*
1396 * find the first offset in the io tree with 'bits' set. zero is
1397 * returned if we find something, and *start_ret and *end_ret are
1398 * set to reflect the state struct that was found.
1399 *
1400 * If nothing was found, 1 is returned. If found something, return 0.
1401 */
1402 int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
1403 u64 *start_ret, u64 *end_ret, unsigned long bits,
1404 struct extent_state **cached_state)
1405 {
1406 struct extent_state *state;
1407 struct rb_node *n;
1408 int ret = 1;
1409
1410 spin_lock(&tree->lock);
1411 if (cached_state && *cached_state) {
1412 state = *cached_state;
1413 if (state->end == start - 1 && state->tree) {
1414 n = rb_next(&state->rb_node);
1415 while (n) {
1416 state = rb_entry(n, struct extent_state,
1417 rb_node);
1418 if (state->state & bits)
1419 goto got_it;
1420 n = rb_next(n);
1421 }
1422 free_extent_state(*cached_state);
1423 *cached_state = NULL;
1424 goto out;
1425 }
1426 free_extent_state(*cached_state);
1427 *cached_state = NULL;
1428 }
1429
1430 state = find_first_extent_bit_state(tree, start, bits);
1431 got_it:
1432 if (state) {
1433 cache_state(state, cached_state);
1434 *start_ret = state->start;
1435 *end_ret = state->end;
1436 ret = 0;
1437 }
1438 out:
1439 spin_unlock(&tree->lock);
1440 return ret;
1441 }
1442
1443 /*
1444 * find a contiguous range of bytes in the file marked as delalloc, not
1445 * more than 'max_bytes'. start and end are used to return the range,
1446 *
1447 * 1 is returned if we find something, 0 if nothing was in the tree
1448 */
1449 static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
1450 u64 *start, u64 *end, u64 max_bytes,
1451 struct extent_state **cached_state)
1452 {
1453 struct rb_node *node;
1454 struct extent_state *state;
1455 u64 cur_start = *start;
1456 u64 found = 0;
1457 u64 total_bytes = 0;
1458
1459 spin_lock(&tree->lock);
1460
1461 /*
1462 * this search will find all the extents that end after
1463 * our range starts.
1464 */
1465 node = tree_search(tree, cur_start);
1466 if (!node) {
1467 if (!found)
1468 *end = (u64)-1;
1469 goto out;
1470 }
1471
1472 while (1) {
1473 state = rb_entry(node, struct extent_state, rb_node);
1474 if (found && (state->start != cur_start ||
1475 (state->state & EXTENT_BOUNDARY))) {
1476 goto out;
1477 }
1478 if (!(state->state & EXTENT_DELALLOC)) {
1479 if (!found)
1480 *end = state->end;
1481 goto out;
1482 }
1483 if (!found) {
1484 *start = state->start;
1485 *cached_state = state;
1486 atomic_inc(&state->refs);
1487 }
1488 found++;
1489 *end = state->end;
1490 cur_start = state->end + 1;
1491 node = rb_next(node);
1492 total_bytes += state->end - state->start + 1;
1493 if (total_bytes >= max_bytes)
1494 break;
1495 if (!node)
1496 break;
1497 }
1498 out:
1499 spin_unlock(&tree->lock);
1500 return found;
1501 }
1502
1503 static noinline void __unlock_for_delalloc(struct inode *inode,
1504 struct page *locked_page,
1505 u64 start, u64 end)
1506 {
1507 int ret;
1508 struct page *pages[16];
1509 unsigned long index = start >> PAGE_CACHE_SHIFT;
1510 unsigned long end_index = end >> PAGE_CACHE_SHIFT;
1511 unsigned long nr_pages = end_index - index + 1;
1512 int i;
1513
1514 if (index == locked_page->index && end_index == index)
1515 return;
1516
1517 while (nr_pages > 0) {
1518 ret = find_get_pages_contig(inode->i_mapping, index,
1519 min_t(unsigned long, nr_pages,
1520 ARRAY_SIZE(pages)), pages);
1521 for (i = 0; i < ret; i++) {
1522 if (pages[i] != locked_page)
1523 unlock_page(pages[i]);
1524 page_cache_release(pages[i]);
1525 }
1526 nr_pages -= ret;
1527 index += ret;
1528 cond_resched();
1529 }
1530 }
1531
1532 static noinline int lock_delalloc_pages(struct inode *inode,
1533 struct page *locked_page,
1534 u64 delalloc_start,
1535 u64 delalloc_end)
1536 {
1537 unsigned long index = delalloc_start >> PAGE_CACHE_SHIFT;
1538 unsigned long start_index = index;
1539 unsigned long end_index = delalloc_end >> PAGE_CACHE_SHIFT;
1540 unsigned long pages_locked = 0;
1541 struct page *pages[16];
1542 unsigned long nrpages;
1543 int ret;
1544 int i;
1545
1546 /* the caller is responsible for locking the start index */
1547 if (index == locked_page->index && index == end_index)
1548 return 0;
1549
1550 /* skip the page at the start index */
1551 nrpages = end_index - index + 1;
1552 while (nrpages > 0) {
1553 ret = find_get_pages_contig(inode->i_mapping, index,
1554 min_t(unsigned long,
1555 nrpages, ARRAY_SIZE(pages)), pages);
1556 if (ret == 0) {
1557 ret = -EAGAIN;
1558 goto done;
1559 }
1560 /* now we have an array of pages, lock them all */
1561 for (i = 0; i < ret; i++) {
1562 /*
1563 * the caller is taking responsibility for
1564 * locked_page
1565 */
1566 if (pages[i] != locked_page) {
1567 lock_page(pages[i]);
1568 if (!PageDirty(pages[i]) ||
1569 pages[i]->mapping != inode->i_mapping) {
1570 ret = -EAGAIN;
1571 unlock_page(pages[i]);
1572 page_cache_release(pages[i]);
1573 goto done;
1574 }
1575 }
1576 page_cache_release(pages[i]);
1577 pages_locked++;
1578 }
1579 nrpages -= ret;
1580 index += ret;
1581 cond_resched();
1582 }
1583 ret = 0;
1584 done:
1585 if (ret && pages_locked) {
1586 __unlock_for_delalloc(inode, locked_page,
1587 delalloc_start,
1588 ((u64)(start_index + pages_locked - 1)) <<
1589 PAGE_CACHE_SHIFT);
1590 }
1591 return ret;
1592 }
1593
1594 /*
1595 * find a contiguous range of bytes in the file marked as delalloc, not
1596 * more than 'max_bytes'. start and end are used to return the range,
1597 *
1598 * 1 is returned if we find something, 0 if nothing was in the tree
1599 */
1600 static noinline u64 find_lock_delalloc_range(struct inode *inode,
1601 struct extent_io_tree *tree,
1602 struct page *locked_page,
1603 u64 *start, u64 *end,
1604 u64 max_bytes)
1605 {
1606 u64 delalloc_start;
1607 u64 delalloc_end;
1608 u64 found;
1609 struct extent_state *cached_state = NULL;
1610 int ret;
1611 int loops = 0;
1612
1613 again:
1614 /* step one, find a bunch of delalloc bytes starting at start */
1615 delalloc_start = *start;
1616 delalloc_end = 0;
1617 found = find_delalloc_range(tree, &delalloc_start, &delalloc_end,
1618 max_bytes, &cached_state);
1619 if (!found || delalloc_end <= *start) {
1620 *start = delalloc_start;
1621 *end = delalloc_end;
1622 free_extent_state(cached_state);
1623 return 0;
1624 }
1625
1626 /*
1627 * start comes from the offset of locked_page. We have to lock
1628 * pages in order, so we can't process delalloc bytes before
1629 * locked_page
1630 */
1631 if (delalloc_start < *start)
1632 delalloc_start = *start;
1633
1634 /*
1635 * make sure to limit the number of pages we try to lock down
1636 */
1637 if (delalloc_end + 1 - delalloc_start > max_bytes)
1638 delalloc_end = delalloc_start + max_bytes - 1;
1639
1640 /* step two, lock all the pages after the page that has start */
1641 ret = lock_delalloc_pages(inode, locked_page,
1642 delalloc_start, delalloc_end);
1643 if (ret == -EAGAIN) {
1644 /* some of the pages are gone, lets avoid looping by
1645 * shortening the size of the delalloc range we're searching
1646 */
1647 free_extent_state(cached_state);
1648 cached_state = NULL;
1649 if (!loops) {
1650 max_bytes = PAGE_CACHE_SIZE;
1651 loops = 1;
1652 goto again;
1653 } else {
1654 found = 0;
1655 goto out_failed;
1656 }
1657 }
1658 BUG_ON(ret); /* Only valid values are 0 and -EAGAIN */
1659
1660 /* step three, lock the state bits for the whole range */
1661 lock_extent_bits(tree, delalloc_start, delalloc_end, 0, &cached_state);
1662
1663 /* then test to make sure it is all still delalloc */
1664 ret = test_range_bit(tree, delalloc_start, delalloc_end,
1665 EXTENT_DELALLOC, 1, cached_state);
1666 if (!ret) {
1667 unlock_extent_cached(tree, delalloc_start, delalloc_end,
1668 &cached_state, GFP_NOFS);
1669 __unlock_for_delalloc(inode, locked_page,
1670 delalloc_start, delalloc_end);
1671 cond_resched();
1672 goto again;
1673 }
1674 free_extent_state(cached_state);
1675 *start = delalloc_start;
1676 *end = delalloc_end;
1677 out_failed:
1678 return found;
1679 }
1680
1681 int extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
1682 struct page *locked_page,
1683 unsigned long clear_bits,
1684 unsigned long page_ops)
1685 {
1686 struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
1687 int ret;
1688 struct page *pages[16];
1689 unsigned long index = start >> PAGE_CACHE_SHIFT;
1690 unsigned long end_index = end >> PAGE_CACHE_SHIFT;
1691 unsigned long nr_pages = end_index - index + 1;
1692 int i;
1693
1694 clear_extent_bit(tree, start, end, clear_bits, 1, 0, NULL, GFP_NOFS);
1695 if (page_ops == 0)
1696 return 0;
1697
1698 while (nr_pages > 0) {
1699 ret = find_get_pages_contig(inode->i_mapping, index,
1700 min_t(unsigned long,
1701 nr_pages, ARRAY_SIZE(pages)), pages);
1702 for (i = 0; i < ret; i++) {
1703
1704 if (page_ops & PAGE_SET_PRIVATE2)
1705 SetPagePrivate2(pages[i]);
1706
1707 if (pages[i] == locked_page) {
1708 page_cache_release(pages[i]);
1709 continue;
1710 }
1711 if (page_ops & PAGE_CLEAR_DIRTY)
1712 clear_page_dirty_for_io(pages[i]);
1713 if (page_ops & PAGE_SET_WRITEBACK)
1714 set_page_writeback(pages[i]);
1715 if (page_ops & PAGE_END_WRITEBACK)
1716 end_page_writeback(pages[i]);
1717 if (page_ops & PAGE_UNLOCK)
1718 unlock_page(pages[i]);
1719 page_cache_release(pages[i]);
1720 }
1721 nr_pages -= ret;
1722 index += ret;
1723 cond_resched();
1724 }
1725 return 0;
1726 }
1727
1728 /*
1729 * count the number of bytes in the tree that have a given bit(s)
1730 * set. This can be fairly slow, except for EXTENT_DIRTY which is
1731 * cached. The total number found is returned.
1732 */
1733 u64 count_range_bits(struct extent_io_tree *tree,
1734 u64 *start, u64 search_end, u64 max_bytes,
1735 unsigned long bits, int contig)
1736 {
1737 struct rb_node *node;
1738 struct extent_state *state;
1739 u64 cur_start = *start;
1740 u64 total_bytes = 0;
1741 u64 last = 0;
1742 int found = 0;
1743
1744 if (search_end <= cur_start) {
1745 WARN_ON(1);
1746 return 0;
1747 }
1748
1749 spin_lock(&tree->lock);
1750 if (cur_start == 0 && bits == EXTENT_DIRTY) {
1751 total_bytes = tree->dirty_bytes;
1752 goto out;
1753 }
1754 /*
1755 * this search will find all the extents that end after
1756 * our range starts.
1757 */
1758 node = tree_search(tree, cur_start);
1759 if (!node)
1760 goto out;
1761
1762 while (1) {
1763 state = rb_entry(node, struct extent_state, rb_node);
1764 if (state->start > search_end)
1765 break;
1766 if (contig && found && state->start > last + 1)
1767 break;
1768 if (state->end >= cur_start && (state->state & bits) == bits) {
1769 total_bytes += min(search_end, state->end) + 1 -
1770 max(cur_start, state->start);
1771 if (total_bytes >= max_bytes)
1772 break;
1773 if (!found) {
1774 *start = max(cur_start, state->start);
1775 found = 1;
1776 }
1777 last = state->end;
1778 } else if (contig && found) {
1779 break;
1780 }
1781 node = rb_next(node);
1782 if (!node)
1783 break;
1784 }
1785 out:
1786 spin_unlock(&tree->lock);
1787 return total_bytes;
1788 }
1789
1790 /*
1791 * set the private field for a given byte offset in the tree. If there isn't
1792 * an extent_state there already, this does nothing.
1793 */
1794 static int set_state_private(struct extent_io_tree *tree, u64 start, u64 private)
1795 {
1796 struct rb_node *node;
1797 struct extent_state *state;
1798 int ret = 0;
1799
1800 spin_lock(&tree->lock);
1801 /*
1802 * this search will find all the extents that end after
1803 * our range starts.
1804 */
1805 node = tree_search(tree, start);
1806 if (!node) {
1807 ret = -ENOENT;
1808 goto out;
1809 }
1810 state = rb_entry(node, struct extent_state, rb_node);
1811 if (state->start != start) {
1812 ret = -ENOENT;
1813 goto out;
1814 }
1815 state->private = private;
1816 out:
1817 spin_unlock(&tree->lock);
1818 return ret;
1819 }
1820
1821 int get_state_private(struct extent_io_tree *tree, u64 start, u64 *private)
1822 {
1823 struct rb_node *node;
1824 struct extent_state *state;
1825 int ret = 0;
1826
1827 spin_lock(&tree->lock);
1828 /*
1829 * this search will find all the extents that end after
1830 * our range starts.
1831 */
1832 node = tree_search(tree, start);
1833 if (!node) {
1834 ret = -ENOENT;
1835 goto out;
1836 }
1837 state = rb_entry(node, struct extent_state, rb_node);
1838 if (state->start != start) {
1839 ret = -ENOENT;
1840 goto out;
1841 }
1842 *private = state->private;
1843 out:
1844 spin_unlock(&tree->lock);
1845 return ret;
1846 }
1847
1848 /*
1849 * searches a range in the state tree for a given mask.
1850 * If 'filled' == 1, this returns 1 only if every extent in the tree
1851 * has the bits set. Otherwise, 1 is returned if any bit in the
1852 * range is found set.
1853 */
1854 int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
1855 unsigned long bits, int filled, struct extent_state *cached)
1856 {
1857 struct extent_state *state = NULL;
1858 struct rb_node *node;
1859 int bitset = 0;
1860
1861 spin_lock(&tree->lock);
1862 if (cached && cached->tree && cached->start <= start &&
1863 cached->end > start)
1864 node = &cached->rb_node;
1865 else
1866 node = tree_search(tree, start);
1867 while (node && start <= end) {
1868 state = rb_entry(node, struct extent_state, rb_node);
1869
1870 if (filled && state->start > start) {
1871 bitset = 0;
1872 break;
1873 }
1874
1875 if (state->start > end)
1876 break;
1877
1878 if (state->state & bits) {
1879 bitset = 1;
1880 if (!filled)
1881 break;
1882 } else if (filled) {
1883 bitset = 0;
1884 break;
1885 }
1886
1887 if (state->end == (u64)-1)
1888 break;
1889
1890 start = state->end + 1;
1891 if (start > end)
1892 break;
1893 node = rb_next(node);
1894 if (!node) {
1895 if (filled)
1896 bitset = 0;
1897 break;
1898 }
1899 }
1900 spin_unlock(&tree->lock);
1901 return bitset;
1902 }
1903
1904 /*
1905 * helper function to set a given page up to date if all the
1906 * extents in the tree for that page are up to date
1907 */
1908 static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
1909 {
1910 u64 start = page_offset(page);
1911 u64 end = start + PAGE_CACHE_SIZE - 1;
1912 if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
1913 SetPageUptodate(page);
1914 }
1915
1916 /*
1917 * When IO fails, either with EIO or csum verification fails, we
1918 * try other mirrors that might have a good copy of the data. This
1919 * io_failure_record is used to record state as we go through all the
1920 * mirrors. If another mirror has good data, the page is set up to date
1921 * and things continue. If a good mirror can't be found, the original
1922 * bio end_io callback is called to indicate things have failed.
1923 */
1924 struct io_failure_record {
1925 struct page *page;
1926 u64 start;
1927 u64 len;
1928 u64 logical;
1929 unsigned long bio_flags;
1930 int this_mirror;
1931 int failed_mirror;
1932 int in_validation;
1933 };

static int free_io_failure(struct inode *inode, struct io_failure_record *rec,
				int did_repair)
{
	int ret;
	int err = 0;
	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;

	set_state_private(failure_tree, rec->start, 0);
	ret = clear_extent_bits(failure_tree, rec->start,
				rec->start + rec->len - 1,
				EXTENT_LOCKED | EXTENT_DIRTY, GFP_NOFS);
	if (ret)
		err = ret;

	ret = clear_extent_bits(&BTRFS_I(inode)->io_tree, rec->start,
				rec->start + rec->len - 1,
				EXTENT_DAMAGED, GFP_NOFS);
	if (ret && !err)
		err = ret;

	kfree(rec);
	return err;
}

static void repair_io_failure_callback(struct bio *bio, int err)
{
	complete(bio->bi_private);
}

/*
 * this bypasses the standard btrfs submit functions deliberately, as
 * the standard behavior is to write all copies in a raid setup. here we only
 * want to write the one bad copy. so we do the mapping for ourselves and issue
 * submit_bio directly.
 * to avoid any synchronization issues, wait for the data after writing, which
 * actually prevents the read that triggered the error from finishing.
 * currently, there can be no more than two copies of every data bit. thus,
 * exactly one rewrite is required.
 */
int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
			u64 length, u64 logical, struct page *page,
			int mirror_num)
{
	struct bio *bio;
	struct btrfs_device *dev;
	DECLARE_COMPLETION_ONSTACK(compl);
	u64 map_length = 0;
	u64 sector;
	struct btrfs_bio *bbio = NULL;
	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
	int ret;

	BUG_ON(!mirror_num);

	/* we can't repair anything in raid56 yet */
	if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
		return 0;

	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
	if (!bio)
		return -EIO;
	bio->bi_private = &compl;
	bio->bi_end_io = repair_io_failure_callback;
	bio->bi_size = 0;
	map_length = length;

	ret = btrfs_map_block(fs_info, WRITE, logical,
			      &map_length, &bbio, mirror_num);
	if (ret) {
		bio_put(bio);
		return -EIO;
	}
	BUG_ON(mirror_num != bbio->mirror_num);
	sector = bbio->stripes[mirror_num-1].physical >> 9;
	bio->bi_sector = sector;
	dev = bbio->stripes[mirror_num-1].dev;
	kfree(bbio);
	if (!dev || !dev->bdev || !dev->writeable) {
		bio_put(bio);
		return -EIO;
	}
	bio->bi_bdev = dev->bdev;
	bio_add_page(bio, page, length, start - page_offset(page));
	btrfsic_submit_bio(WRITE_SYNC, bio);
	wait_for_completion(&compl);

	if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
		/* try to remap that extent elsewhere? */
		bio_put(bio);
		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS);
		return -EIO;
	}

	printk_ratelimited_in_rcu(KERN_INFO "btrfs read error corrected: ino %lu off %llu "
		      "(dev %s sector %llu)\n", page->mapping->host->i_ino,
		      start, rcu_str_deref(dev->name), sector);

	bio_put(bio);
	return 0;
}
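/*
 * Note that the physical byte offset from btrfs_map_block() is shifted
 * by 9 to become a 512-byte sector number before it is handed to the
 * block layer, and that the page is added at start - page_offset(page),
 * so only the failed range within the page is rewritten.
 */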

int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
			 int mirror_num)
{
	u64 start = eb->start;
	unsigned long i, num_pages = num_extent_pages(eb->start, eb->len);
	int ret = 0;

	for (i = 0; i < num_pages; i++) {
		struct page *p = extent_buffer_page(eb, i);
		ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE,
					start, p, mirror_num);
		if (ret)
			break;
		start += PAGE_CACHE_SIZE;
	}

	return ret;
}

/*
 * each time an IO finishes, we do a fast check in the IO failure tree
 * to see if we need to process or clean up an io_failure_record
 */
static int clean_io_failure(u64 start, struct page *page)
{
	u64 private;
	u64 private_failure;
	struct io_failure_record *failrec;
	struct btrfs_fs_info *fs_info;
	struct extent_state *state;
	int num_copies;
	int did_repair = 0;
	int ret;
	struct inode *inode = page->mapping->host;

	private = 0;
	ret = count_range_bits(&BTRFS_I(inode)->io_failure_tree, &private,
				(u64)-1, 1, EXTENT_DIRTY, 0);
	if (!ret)
		return 0;

	ret = get_state_private(&BTRFS_I(inode)->io_failure_tree, start,
				&private_failure);
	if (ret)
		return 0;

	failrec = (struct io_failure_record *)(unsigned long) private_failure;
	BUG_ON(!failrec->this_mirror);

	if (failrec->in_validation) {
		/* there was no real error, just free the record */
		pr_debug("clean_io_failure: freeing dummy error at %llu\n",
			 failrec->start);
		did_repair = 1;
		goto out;
	}

	spin_lock(&BTRFS_I(inode)->io_tree.lock);
	state = find_first_extent_bit_state(&BTRFS_I(inode)->io_tree,
					    failrec->start,
					    EXTENT_LOCKED);
	spin_unlock(&BTRFS_I(inode)->io_tree.lock);

	if (state && state->start <= failrec->start &&
	    state->end >= failrec->start + failrec->len - 1) {
		fs_info = BTRFS_I(inode)->root->fs_info;
		num_copies = btrfs_num_copies(fs_info, failrec->logical,
					      failrec->len);
		if (num_copies > 1) {
			ret = repair_io_failure(fs_info, start, failrec->len,
						failrec->logical, page,
						failrec->failed_mirror);
			did_repair = !ret;
		}
		ret = 0;
	}

out:
	if (!ret)
		ret = free_io_failure(inode, failrec, did_repair);

	return ret;
}
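/*
 * clean_io_failure() is invoked from the read end_io path below once a
 * page's checksum verifies, so a range that previously failed and was
 * recorded in the io_failure_tree gets repaired (when more than one
 * copy exists) and its record released.
 */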

/*
 * this is a generic handler for readpage errors (default
 * readpage_io_failed_hook). if other copies exist, read those and write back
 * good data to the failed position. does not investigate in remapping the
 * failed extent elsewhere, hoping the device will be smart enough to do this as
 * needed
 */

static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
			      struct page *page, u64 start, u64 end,
			      int failed_mirror)
{
	struct io_failure_record *failrec = NULL;
	u64 private;
	struct extent_map *em;
	struct inode *inode = page->mapping->host;
	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
	struct bio *bio;
	struct btrfs_io_bio *btrfs_failed_bio;
	struct btrfs_io_bio *btrfs_bio;
	int num_copies;
	int ret;
	int read_mode;
	u64 logical;

	BUG_ON(failed_bio->bi_rw & REQ_WRITE);

	ret = get_state_private(failure_tree, start, &private);
	if (ret) {
		failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
		if (!failrec)
			return -ENOMEM;
		failrec->start = start;
		failrec->len = end - start + 1;
		failrec->this_mirror = 0;
		failrec->bio_flags = 0;
		failrec->in_validation = 0;

		read_lock(&em_tree->lock);
		em = lookup_extent_mapping(em_tree, start, failrec->len);
		if (!em) {
			read_unlock(&em_tree->lock);
			kfree(failrec);
			return -EIO;
		}

		if (em->start > start || em->start + em->len < start) {
			free_extent_map(em);
			em = NULL;
		}
		read_unlock(&em_tree->lock);

		if (!em) {
			kfree(failrec);
			return -EIO;
		}
		logical = start - em->start;
		logical = em->block_start + logical;
		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
			logical = em->block_start;
			failrec->bio_flags = EXTENT_BIO_COMPRESSED;
			extent_set_compress_type(&failrec->bio_flags,
						 em->compress_type);
		}
		pr_debug("bio_readpage_error: (new) logical=%llu, start=%llu, "
			 "len=%llu\n", logical, start, failrec->len);
		failrec->logical = logical;
		free_extent_map(em);

		/* set the bits in the private failure tree */
		ret = set_extent_bits(failure_tree, start, end,
					EXTENT_LOCKED | EXTENT_DIRTY, GFP_NOFS);
		if (ret >= 0)
			ret = set_state_private(failure_tree, start,
						(u64)(unsigned long)failrec);
		/* set the bits in the inode's tree */
		if (ret >= 0)
			ret = set_extent_bits(tree, start, end, EXTENT_DAMAGED,
						GFP_NOFS);
		if (ret < 0) {
			kfree(failrec);
			return ret;
		}
	} else {
		failrec = (struct io_failure_record *)(unsigned long)private;
		pr_debug("bio_readpage_error: (found) logical=%llu, "
			 "start=%llu, len=%llu, validation=%d\n",
			 failrec->logical, failrec->start, failrec->len,
			 failrec->in_validation);
		/*
		 * when data can be on disk more than twice, add to failrec here
		 * (e.g. with a list for failed_mirror) to make
		 * clean_io_failure() clean all those errors at once.
		 */
	}
	num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info,
				      failrec->logical, failrec->len);
	if (num_copies == 1) {
		/*
		 * we only have a single copy of the data, so don't bother with
		 * all the retry and error correction code that follows. no
		 * matter what the error is, it is very likely to persist.
		 */
		pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
			 num_copies, failrec->this_mirror, failed_mirror);
		free_io_failure(inode, failrec, 0);
		return -EIO;
	}

	/*
	 * there are two premises:
	 * a) deliver good data to the caller
	 * b) correct the bad sectors on disk
	 */
	if (failed_bio->bi_vcnt > 1) {
		/*
		 * to fulfill b), we need to know the exact failing sectors, as
		 * we don't want to rewrite any more than the failed ones. thus,
		 * we need separate read requests for the failed bio
		 *
		 * if the following BUG_ON triggers, our validation request got
		 * merged. we need separate requests for our algorithm to work.
		 */
		BUG_ON(failrec->in_validation);
		failrec->in_validation = 1;
		failrec->this_mirror = failed_mirror;
		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
	} else {
		/*
		 * we're ready to fulfill a) and b) alongside. get a good copy
		 * of the failed sector and if we succeed, we have setup
		 * everything for repair_io_failure to do the rest for us.
		 */
		if (failrec->in_validation) {
			BUG_ON(failrec->this_mirror != failed_mirror);
			failrec->in_validation = 0;
			failrec->this_mirror = 0;
		}
		failrec->failed_mirror = failed_mirror;
		failrec->this_mirror++;
		if (failrec->this_mirror == failed_mirror)
			failrec->this_mirror++;
		read_mode = READ_SYNC;
	}

	if (failrec->this_mirror > num_copies) {
		pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
			 num_copies, failrec->this_mirror, failed_mirror);
		free_io_failure(inode, failrec, 0);
		return -EIO;
	}

	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
	if (!bio) {
		free_io_failure(inode, failrec, 0);
		return -EIO;
	}
	bio->bi_end_io = failed_bio->bi_end_io;
	bio->bi_sector = failrec->logical >> 9;
	bio->bi_bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
	bio->bi_size = 0;

	btrfs_failed_bio = btrfs_io_bio(failed_bio);
	if (btrfs_failed_bio->csum) {
		struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
		u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);

		btrfs_bio = btrfs_io_bio(bio);
		btrfs_bio->csum = btrfs_bio->csum_inline;
		phy_offset >>= inode->i_sb->s_blocksize_bits;
		phy_offset *= csum_size;
		memcpy(btrfs_bio->csum, btrfs_failed_bio->csum + phy_offset,
		       csum_size);
	}

	bio_add_page(bio, page, failrec->len, start - page_offset(page));

	pr_debug("bio_readpage_error: submitting new read[%#x] to "
		 "this_mirror=%d, num_copies=%d, in_validation=%d\n", read_mode,
		 failrec->this_mirror, num_copies, failrec->in_validation);

	ret = tree->ops->submit_bio_hook(inode, read_mode, bio,
					 failrec->this_mirror,
					 failrec->bio_flags, 0);
	return ret;
}
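/*
 * Mirror selection above in short: with a multi-vec bio the failed
 * mirror is re-read first purely to isolate the failing sector
 * (in_validation); otherwise this_mirror simply walks 1..num_copies,
 * skipping failed_mirror, and the whole attempt is abandoned with -EIO
 * once this_mirror exceeds num_copies.
 */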

/* lots and lots of room for performance fixes in the end_bio funcs */

int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
{
	int uptodate = (err == 0);
	struct extent_io_tree *tree;
	int ret = 0;

	tree = &BTRFS_I(page->mapping->host)->io_tree;

	if (tree->ops && tree->ops->writepage_end_io_hook) {
		ret = tree->ops->writepage_end_io_hook(page, start,
						end, NULL, uptodate);
		if (ret)
			uptodate = 0;
	}

	if (!uptodate) {
		ClearPageUptodate(page);
		SetPageError(page);
		ret = ret < 0 ? ret : -EIO;
		mapping_set_error(page->mapping, ret);
	}
	return 0;
}

/*
 * after a writepage IO is done, we need to:
 * clear the uptodate bits on error
 * clear the writeback bits in the extent tree for this IO
 * end_page_writeback if the page has no more pending IO
 *
 * Scheduling is not allowed, so the extent state tree is expected
 * to have one and only one object corresponding to this IO.
 */
static void end_bio_extent_writepage(struct bio *bio, int err)
{
	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
	struct extent_io_tree *tree;
	u64 start;
	u64 end;

	do {
		struct page *page = bvec->bv_page;
		tree = &BTRFS_I(page->mapping->host)->io_tree;

		/* We always issue full-page reads, but if some block
		 * in a page fails to read, blk_update_request() will
		 * advance bv_offset and adjust bv_len to compensate.
		 * Print a warning for nonzero offsets, and an error
		 * if they don't add up to a full page. */
		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE)
			printk("%s page write in btrfs with offset %u and length %u\n",
			       bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE
			       ? KERN_ERR "partial" : KERN_INFO "incomplete",
			       bvec->bv_offset, bvec->bv_len);

		start = page_offset(page);
		end = start + bvec->bv_offset + bvec->bv_len - 1;

		if (--bvec >= bio->bi_io_vec)
			prefetchw(&bvec->bv_page->flags);

		if (end_extent_writepage(page, err, start, end))
			continue;

		end_page_writeback(page);
	} while (bvec >= bio->bi_io_vec);

	bio_put(bio);
}
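/*
 * The write completion above walks the bio_vec array from the last
 * entry backwards (bvec starts at bi_io_vec + bi_vcnt - 1 and is
 * decremented), prefetching the next page's flags before finishing
 * writeback on the current one.
 */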

static void
endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len,
			      int uptodate)
{
	struct extent_state *cached = NULL;
	u64 end = start + len - 1;

	if (uptodate && tree->track_uptodate)
		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
	unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC);
}

/*
 * after a readpage IO is done, we need to:
 * clear the uptodate bits on error
 * set the uptodate bits if things worked
 * set the page up to date if all extents in the tree are uptodate
 * clear the lock bit in the extent tree
 * unlock the page if there are no other extents locked for it
 *
 * Scheduling is not allowed, so the extent state tree is expected
 * to have one and only one object corresponding to this IO.
 */
static void end_bio_extent_readpage(struct bio *bio, int err)
{
	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
	struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
	struct bio_vec *bvec = bio->bi_io_vec;
	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
	struct extent_io_tree *tree;
	u64 offset = 0;
	u64 start;
	u64 end;
	u64 len;
	u64 extent_start = 0;
	u64 extent_len = 0;
	int mirror;
	int ret;

	if (err)
		uptodate = 0;

	do {
		struct page *page = bvec->bv_page;
		struct inode *inode = page->mapping->host;

		pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
			 "mirror=%lu\n", (u64)bio->bi_sector, err,
			 io_bio->mirror_num);
		tree = &BTRFS_I(inode)->io_tree;

		/* We always issue full-page reads, but if some block
		 * in a page fails to read, blk_update_request() will
		 * advance bv_offset and adjust bv_len to compensate.
		 * Print a warning for nonzero offsets, and an error
		 * if they don't add up to a full page. */
		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE)
			printk("%s page read in btrfs with offset %u and length %u\n",
			       bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE
			       ? KERN_ERR "partial" : KERN_INFO "incomplete",
			       bvec->bv_offset, bvec->bv_len);

		start = page_offset(page);
		end = start + bvec->bv_offset + bvec->bv_len - 1;
		len = bvec->bv_len;

		if (++bvec <= bvec_end)
			prefetchw(&bvec->bv_page->flags);

		mirror = io_bio->mirror_num;
		if (likely(uptodate && tree->ops &&
			   tree->ops->readpage_end_io_hook)) {
			ret = tree->ops->readpage_end_io_hook(io_bio, offset,
							      page, start, end,
							      mirror);
			if (ret)
				uptodate = 0;
			else
				clean_io_failure(start, page);
		}

		if (likely(uptodate))
			goto readpage_ok;

		if (tree->ops && tree->ops->readpage_io_failed_hook) {
			ret = tree->ops->readpage_io_failed_hook(page, mirror);
			if (!ret && !err &&
			    test_bit(BIO_UPTODATE, &bio->bi_flags))
				uptodate = 1;
		} else {
			/*
			 * The generic bio_readpage_error handles errors the
			 * following way: If possible, new read requests are
			 * created and submitted and will end up in
			 * end_bio_extent_readpage as well (if we're lucky, not
			 * in the !uptodate case). In that case it returns 0 and
			 * we just go on with the next page in our bio. If it
			 * can't handle the error it will return -EIO and we
			 * remain responsible for that page.
			 */
			ret = bio_readpage_error(bio, offset, page, start, end,
						 mirror);
			if (ret == 0) {
				uptodate =
					test_bit(BIO_UPTODATE, &bio->bi_flags);
				if (err)
					uptodate = 0;
				offset += len;
				continue;
			}
		}
readpage_ok:
		if (likely(uptodate)) {
			loff_t i_size = i_size_read(inode);
			pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
			unsigned offset;

			/* Zero out the end if this page straddles i_size */
			offset = i_size & (PAGE_CACHE_SIZE-1);
			if (page->index == end_index && offset)
				zero_user_segment(page, offset, PAGE_CACHE_SIZE);
			SetPageUptodate(page);
		} else {
			ClearPageUptodate(page);
			SetPageError(page);
		}
		unlock_page(page);
		offset += len;

		if (unlikely(!uptodate)) {
			if (extent_len) {
				endio_readpage_release_extent(tree,
							      extent_start,
							      extent_len, 1);
				extent_start = 0;
				extent_len = 0;
			}
			endio_readpage_release_extent(tree, start,
						      end - start + 1, 0);
		} else if (!extent_len) {
			extent_start = start;
			extent_len = end + 1 - start;
		} else if (extent_start + extent_len == start) {
			extent_len += end + 1 - start;
		} else {
			endio_readpage_release_extent(tree, extent_start,
						      extent_len, uptodate);
			extent_start = start;
			extent_len = end + 1 - start;
		}
	} while (bvec <= bvec_end);

	if (extent_len)
		endio_readpage_release_extent(tree, extent_start, extent_len,
					      uptodate);
	if (io_bio->end_io)
		io_bio->end_io(io_bio, err);
	bio_put(bio);
}

/*
 * this allocates from the btrfs_bioset. We're returning a bio right now
 * but you can call btrfs_io_bio for the appropriate container_of magic
 */
struct bio *
btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
		gfp_t gfp_flags)
{
	struct btrfs_io_bio *btrfs_bio;
	struct bio *bio;

	bio = bio_alloc_bioset(gfp_flags, nr_vecs, btrfs_bioset);

	if (bio == NULL && (current->flags & PF_MEMALLOC)) {
		while (!bio && (nr_vecs /= 2)) {
			bio = bio_alloc_bioset(gfp_flags,
					       nr_vecs, btrfs_bioset);
		}
	}

	if (bio) {
		bio->bi_size = 0;
		bio->bi_bdev = bdev;
		bio->bi_sector = first_sector;
		btrfs_bio = btrfs_io_bio(bio);
		btrfs_bio->csum = NULL;
		btrfs_bio->csum_allocated = NULL;
		btrfs_bio->end_io = NULL;
	}
	return bio;
}
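/*
 * A sketch of the "container_of magic" mentioned above, assuming the
 * btrfs_io_bio definition elsewhere embeds its struct bio as the last
 * member (the exact field layout here is illustrative, not quoted):
 *
 *	struct btrfs_io_bio {
 *		unsigned int mirror_num;
 *		u8 *csum;
 *		u8 *csum_allocated;
 *		btrfs_io_bio_end_io_t *end_io;
 *		struct bio bio;
 *	};
 *
 *	static inline struct btrfs_io_bio *btrfs_io_bio(struct bio *bio)
 *	{
 *		return container_of(bio, struct btrfs_io_bio, bio);
 *	}
 *
 * Because every bio handed out here comes from btrfs_bioset, it really is
 * the bio member of a larger btrfs_io_bio, which is why btrfs_io_bio(bio)
 * can recover the wrapper without any extra allocation.
 */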

struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
{
	return bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
}


/* this also allocates from the btrfs_bioset */
struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
{
	struct btrfs_io_bio *btrfs_bio;
	struct bio *bio;

	bio = bio_alloc_bioset(gfp_mask, nr_iovecs, btrfs_bioset);
	if (bio) {
		btrfs_bio = btrfs_io_bio(bio);
		btrfs_bio->csum = NULL;
		btrfs_bio->csum_allocated = NULL;
		btrfs_bio->end_io = NULL;
	}
	return bio;
}


static int __must_check submit_one_bio(int rw, struct bio *bio,
				       int mirror_num, unsigned long bio_flags)
{
	int ret = 0;
	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
	struct page *page = bvec->bv_page;
	struct extent_io_tree *tree = bio->bi_private;
	u64 start;

	start = page_offset(page) + bvec->bv_offset;

	bio->bi_private = NULL;

	bio_get(bio);

	if (tree->ops && tree->ops->submit_bio_hook)
		ret = tree->ops->submit_bio_hook(page->mapping->host, rw, bio,
						 mirror_num, bio_flags, start);
	else
		btrfsic_submit_bio(rw, bio);

	if (bio_flagged(bio, BIO_EOPNOTSUPP))
		ret = -EOPNOTSUPP;
	bio_put(bio);
	return ret;
}

static int merge_bio(int rw, struct extent_io_tree *tree, struct page *page,
		     unsigned long offset, size_t size, struct bio *bio,
		     unsigned long bio_flags)
{
	int ret = 0;
	if (tree->ops && tree->ops->merge_bio_hook)
		ret = tree->ops->merge_bio_hook(rw, page, offset, size, bio,
						bio_flags);
	BUG_ON(ret < 0);
	return ret;

}

static int submit_extent_page(int rw, struct extent_io_tree *tree,
			      struct page *page, sector_t sector,
			      size_t size, unsigned long offset,
			      struct block_device *bdev,
			      struct bio **bio_ret,
			      unsigned long max_pages,
			      bio_end_io_t end_io_func,
			      int mirror_num,
			      unsigned long prev_bio_flags,
			      unsigned long bio_flags)
{
	int ret = 0;
	struct bio *bio;
	int nr;
	int contig = 0;
	int this_compressed = bio_flags & EXTENT_BIO_COMPRESSED;
	int old_compressed = prev_bio_flags & EXTENT_BIO_COMPRESSED;
	size_t page_size = min_t(size_t, size, PAGE_CACHE_SIZE);

	if (bio_ret && *bio_ret) {
		bio = *bio_ret;
		if (old_compressed)
			contig = bio->bi_sector == sector;
		else
			contig = bio_end_sector(bio) == sector;

		if (prev_bio_flags != bio_flags || !contig ||
		    merge_bio(rw, tree, page, offset, page_size, bio, bio_flags) ||
		    bio_add_page(bio, page, page_size, offset) < page_size) {
			ret = submit_one_bio(rw, bio, mirror_num,
					     prev_bio_flags);
			if (ret < 0)
				return ret;
			bio = NULL;
		} else {
			return 0;
		}
	}
	if (this_compressed)
		nr = BIO_MAX_PAGES;
	else
		nr = bio_get_nr_vecs(bdev);

	bio = btrfs_bio_alloc(bdev, sector, nr, GFP_NOFS | __GFP_HIGH);
	if (!bio)
		return -ENOMEM;

	bio_add_page(bio, page, page_size, offset);
	bio->bi_end_io = end_io_func;
	bio->bi_private = tree;

	if (bio_ret)
		*bio_ret = bio;
	else
		ret = submit_one_bio(rw, bio, mirror_num, bio_flags);

	return ret;
}
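/*
 * In other words: submit_extent_page() keeps appending contiguous pages
 * to the bio cached in *bio_ret and only submits it when the flags
 * change, the new sector is not adjacent, the merge hook refuses, or
 * bio_add_page() cannot take the full page; the freshly allocated
 * replacement bio is then either cached again or submitted immediately
 * when no bio_ret is supplied.
 */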

static void attach_extent_buffer_page(struct extent_buffer *eb,
				      struct page *page)
{
	if (!PagePrivate(page)) {
		SetPagePrivate(page);
		page_cache_get(page);
		set_page_private(page, (unsigned long)eb);
	} else {
		WARN_ON(page->private != (unsigned long)eb);
	}
}

void set_page_extent_mapped(struct page *page)
{
	if (!PagePrivate(page)) {
		SetPagePrivate(page);
		page_cache_get(page);
		set_page_private(page, EXTENT_PAGE_PRIVATE);
	}
}

static struct extent_map *
__get_extent_map(struct inode *inode, struct page *page, size_t pg_offset,
		 u64 start, u64 len, get_extent_t *get_extent,
		 struct extent_map **em_cached)
{
	struct extent_map *em;

	if (em_cached && *em_cached) {
		em = *em_cached;
		if (em->in_tree && start >= em->start &&
		    start < extent_map_end(em)) {
			atomic_inc(&em->refs);
			return em;
		}

		free_extent_map(em);
		*em_cached = NULL;
	}

	em = get_extent(inode, page, pg_offset, start, len, 0);
	if (em_cached && !IS_ERR_OR_NULL(em)) {
		BUG_ON(*em_cached);
		atomic_inc(&em->refs);
		*em_cached = em;
	}
	return em;
}
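/*
 * __get_extent_map() is a tiny one-entry cache: if *em_cached still
 * covers 'start' it is reused with an extra reference, otherwise the
 * stale entry is dropped and the freshly looked-up mapping becomes the
 * cached entry for subsequent calls on neighbouring pages.
 */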
/*
 * basic readpage implementation. Locked extent state structs are inserted
 * into the tree that are removed when the IO is done (by the end_io
 * handlers)
 * XXX JDM: This needs looking at to ensure proper page locking
 */
static int __do_readpage(struct extent_io_tree *tree,
			 struct page *page,
			 get_extent_t *get_extent,
			 struct extent_map **em_cached,
			 struct bio **bio, int mirror_num,
			 unsigned long *bio_flags, int rw)
{
	struct inode *inode = page->mapping->host;
	u64 start = page_offset(page);
	u64 page_end = start + PAGE_CACHE_SIZE - 1;
	u64 end;
	u64 cur = start;
	u64 extent_offset;
	u64 last_byte = i_size_read(inode);
	u64 block_start;
	u64 cur_end;
	sector_t sector;
	struct extent_map *em;
	struct block_device *bdev;
	int ret;
	int nr = 0;
	int parent_locked = *bio_flags & EXTENT_BIO_PARENT_LOCKED;
	size_t pg_offset = 0;
	size_t iosize;
	size_t disk_io_size;
	size_t blocksize = inode->i_sb->s_blocksize;
	unsigned long this_bio_flag = *bio_flags & EXTENT_BIO_PARENT_LOCKED;

	set_page_extent_mapped(page);

	end = page_end;
	if (!PageUptodate(page)) {
		if (cleancache_get_page(page) == 0) {
			BUG_ON(blocksize != PAGE_SIZE);
			unlock_extent(tree, start, end);
			goto out;
		}
	}

	if (page->index == last_byte >> PAGE_CACHE_SHIFT) {
		char *userpage;
		size_t zero_offset = last_byte & (PAGE_CACHE_SIZE - 1);

		if (zero_offset) {
			iosize = PAGE_CACHE_SIZE - zero_offset;
			userpage = kmap_atomic(page);
			memset(userpage + zero_offset, 0, iosize);
			flush_dcache_page(page);
			kunmap_atomic(userpage);
		}
	}
	while (cur <= end) {
		unsigned long pnr = (last_byte >> PAGE_CACHE_SHIFT) + 1;

		if (cur >= last_byte) {
			char *userpage;
			struct extent_state *cached = NULL;

			iosize = PAGE_CACHE_SIZE - pg_offset;
			userpage = kmap_atomic(page);
			memset(userpage + pg_offset, 0, iosize);
			flush_dcache_page(page);
			kunmap_atomic(userpage);
			set_extent_uptodate(tree, cur, cur + iosize - 1,
					    &cached, GFP_NOFS);
			if (!parent_locked)
				unlock_extent_cached(tree, cur,
						     cur + iosize - 1,
						     &cached, GFP_NOFS);
			break;
		}
		em = __get_extent_map(inode, page, pg_offset, cur,
				      end - cur + 1, get_extent, em_cached);
		if (IS_ERR_OR_NULL(em)) {
			SetPageError(page);
			if (!parent_locked)
				unlock_extent(tree, cur, end);
			break;
		}
		extent_offset = cur - em->start;
		BUG_ON(extent_map_end(em) <= cur);
		BUG_ON(end < cur);

		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
			this_bio_flag |= EXTENT_BIO_COMPRESSED;
			extent_set_compress_type(&this_bio_flag,
						 em->compress_type);
		}

		iosize = min(extent_map_end(em) - cur, end - cur + 1);
		cur_end = min(extent_map_end(em) - 1, end);
		iosize = ALIGN(iosize, blocksize);
		if (this_bio_flag & EXTENT_BIO_COMPRESSED) {
			disk_io_size = em->block_len;
			sector = em->block_start >> 9;
		} else {
			sector = (em->block_start + extent_offset) >> 9;
			disk_io_size = iosize;
		}
		bdev = em->bdev;
		block_start = em->block_start;
		if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
			block_start = EXTENT_MAP_HOLE;
		free_extent_map(em);
		em = NULL;

		/* we've found a hole, just zero and go on */
		if (block_start == EXTENT_MAP_HOLE) {
			char *userpage;
			struct extent_state *cached = NULL;

			userpage = kmap_atomic(page);
			memset(userpage + pg_offset, 0, iosize);
			flush_dcache_page(page);
			kunmap_atomic(userpage);

			set_extent_uptodate(tree, cur, cur + iosize - 1,
					    &cached, GFP_NOFS);
			unlock_extent_cached(tree, cur, cur + iosize - 1,
					     &cached, GFP_NOFS);
			cur = cur + iosize;
			pg_offset += iosize;
			continue;
		}
		/* the get_extent function already copied into the page */
		if (test_range_bit(tree, cur, cur_end,
				   EXTENT_UPTODATE, 1, NULL)) {
			check_page_uptodate(tree, page);
			if (!parent_locked)
				unlock_extent(tree, cur, cur + iosize - 1);
			cur = cur + iosize;
			pg_offset += iosize;
			continue;
		}
		/* we have an inline extent but it didn't get marked up
		 * to date. Error out
		 */
		if (block_start == EXTENT_MAP_INLINE) {
			SetPageError(page);
			if (!parent_locked)
				unlock_extent(tree, cur, cur + iosize - 1);
			cur = cur + iosize;
			pg_offset += iosize;
			continue;
		}

		pnr -= page->index;
		ret = submit_extent_page(rw, tree, page,
					 sector, disk_io_size, pg_offset,
					 bdev, bio, pnr,
					 end_bio_extent_readpage, mirror_num,
					 *bio_flags,
					 this_bio_flag);
		if (!ret) {
			nr++;
			*bio_flags = this_bio_flag;
		} else {
			SetPageError(page);
			if (!parent_locked)
				unlock_extent(tree, cur, cur + iosize - 1);
		}
		cur = cur + iosize;
		pg_offset += iosize;
	}
out:
	if (!nr) {
		if (!PageError(page))
			SetPageUptodate(page);
		unlock_page(page);
	}
	return 0;
}
2921 2921
static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
					     struct page *pages[], int nr_pages,
					     u64 start, u64 end,
					     get_extent_t *get_extent,
					     struct extent_map **em_cached,
					     struct bio **bio, int mirror_num,
					     unsigned long *bio_flags, int rw)
{
	struct inode *inode;
	struct btrfs_ordered_extent *ordered;
	int index;

	inode = pages[0]->mapping->host;
	while (1) {
		lock_extent(tree, start, end);
		ordered = btrfs_lookup_ordered_range(inode, start,
						     end - start + 1);
		if (!ordered)
			break;
		unlock_extent(tree, start, end);
		btrfs_start_ordered_extent(inode, ordered, 1);
		btrfs_put_ordered_extent(ordered);
	}

	for (index = 0; index < nr_pages; index++) {
		__do_readpage(tree, pages[index], get_extent, em_cached, bio,
			      mirror_num, bio_flags, rw);
		page_cache_release(pages[index]);
	}
}

static void __extent_readpages(struct extent_io_tree *tree,
			       struct page *pages[],
			       int nr_pages, get_extent_t *get_extent,
			       struct extent_map **em_cached,
			       struct bio **bio, int mirror_num,
			       unsigned long *bio_flags, int rw)
{
	u64 start = 0;
	u64 end = 0;
	u64 page_start;
	int index;
	int first_index = 0;

	for (index = 0; index < nr_pages; index++) {
		page_start = page_offset(pages[index]);
		if (!end) {
			start = page_start;
			end = start + PAGE_CACHE_SIZE - 1;
			first_index = index;
		} else if (end + 1 == page_start) {
			end += PAGE_CACHE_SIZE;
		} else {
			__do_contiguous_readpages(tree, &pages[first_index],
						  index - first_index, start,
						  end, get_extent, em_cached,
						  bio, mirror_num, bio_flags,
						  rw);
			start = page_start;
			end = start + PAGE_CACHE_SIZE - 1;
			first_index = index;
		}
	}

	if (end)
		__do_contiguous_readpages(tree, &pages[first_index],
					  index - first_index, start,
					  end, get_extent, em_cached, bio,
					  mirror_num, bio_flags, rw);
}

static int __extent_read_full_page(struct extent_io_tree *tree,
				   struct page *page,
				   get_extent_t *get_extent,
				   struct bio **bio, int mirror_num,
				   unsigned long *bio_flags, int rw)
{
	struct inode *inode = page->mapping->host;
	struct btrfs_ordered_extent *ordered;
	u64 start = page_offset(page);
	u64 end = start + PAGE_CACHE_SIZE - 1;
	int ret;

	while (1) {
		lock_extent(tree, start, end);
		ordered = btrfs_lookup_ordered_extent(inode, start);
		if (!ordered)
			break;
		unlock_extent(tree, start, end);
		btrfs_start_ordered_extent(inode, ordered, 1);
		btrfs_put_ordered_extent(ordered);
	}

	ret = __do_readpage(tree, page, get_extent, NULL, bio, mirror_num,
			    bio_flags, rw);
	return ret;
}

int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
			  get_extent_t *get_extent, int mirror_num)
{
	struct bio *bio = NULL;
	unsigned long bio_flags = 0;
	int ret;

	ret = __extent_read_full_page(tree, page, get_extent, &bio, mirror_num,
				      &bio_flags, READ);
	if (bio)
		ret = submit_one_bio(READ, bio, mirror_num, bio_flags);
	return ret;
}

int extent_read_full_page_nolock(struct extent_io_tree *tree, struct page *page,
				 get_extent_t *get_extent, int mirror_num)
{
	struct bio *bio = NULL;
	unsigned long bio_flags = EXTENT_BIO_PARENT_LOCKED;
	int ret;

	ret = __do_readpage(tree, page, get_extent, NULL, &bio, mirror_num,
			    &bio_flags, READ);
	if (bio)
		ret = submit_one_bio(READ, bio, mirror_num, bio_flags);
	return ret;
}

static noinline void update_nr_written(struct page *page,
				       struct writeback_control *wbc,
				       unsigned long nr_written)
{
	wbc->nr_to_write -= nr_written;
	if (wbc->range_cyclic || (wbc->nr_to_write > 0 &&
	    wbc->range_start == 0 && wbc->range_end == LLONG_MAX))
		page->mapping->writeback_index = page->index + nr_written;
}

/*
 * the writepage semantics are similar to regular writepage.  extent
 * records are inserted to lock ranges in the tree, and as dirty areas
 * are found, they are marked writeback.  Then the lock bits are removed
 * and the end_io handler clears the writeback ranges
 */
static int __extent_writepage(struct page *page, struct writeback_control *wbc,
			      void *data)
{
	struct inode *inode = page->mapping->host;
	struct extent_page_data *epd = data;
	struct extent_io_tree *tree = epd->tree;
	u64 start = page_offset(page);
	u64 delalloc_start;
	u64 page_end = start + PAGE_CACHE_SIZE - 1;
	u64 end;
	u64 cur = start;
	u64 extent_offset;
	u64 last_byte = i_size_read(inode);
	u64 block_start;
	u64 iosize;
	sector_t sector;
	struct extent_state *cached_state = NULL;
	struct extent_map *em;
	struct block_device *bdev;
	int ret;
	int nr = 0;
	size_t pg_offset = 0;
	size_t blocksize;
	loff_t i_size = i_size_read(inode);
	unsigned long end_index = i_size >> PAGE_CACHE_SHIFT;
	u64 nr_delalloc;
	u64 delalloc_end;
	int page_started;
	int compressed;
	int write_flags;
	unsigned long nr_written = 0;
	bool fill_delalloc = true;

	if (wbc->sync_mode == WB_SYNC_ALL)
		write_flags = WRITE_SYNC;
	else
		write_flags = WRITE;

	trace___extent_writepage(page, inode, wbc);

	WARN_ON(!PageLocked(page));

	ClearPageError(page);

	pg_offset = i_size & (PAGE_CACHE_SIZE - 1);
	if (page->index > end_index ||
	   (page->index == end_index && !pg_offset)) {
		page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE);
		unlock_page(page);
		return 0;
	}

	if (page->index == end_index) {
		char *userpage;

		userpage = kmap_atomic(page);
		memset(userpage + pg_offset, 0,
		       PAGE_CACHE_SIZE - pg_offset);
		kunmap_atomic(userpage);
		flush_dcache_page(page);
	}
	pg_offset = 0;

	set_page_extent_mapped(page);

	if (!tree->ops || !tree->ops->fill_delalloc)
		fill_delalloc = false;

	delalloc_start = start;
	delalloc_end = 0;
	page_started = 0;
	if (!epd->extent_locked && fill_delalloc) {
		u64 delalloc_to_write = 0;
		/*
		 * make sure the wbc mapping index is at least updated
		 * to this page.
		 */
		update_nr_written(page, wbc, 0);

		while (delalloc_end < page_end) {
			nr_delalloc = find_lock_delalloc_range(inode, tree,
						       page,
						       &delalloc_start,
						       &delalloc_end,
						       128 * 1024 * 1024);
			if (nr_delalloc == 0) {
				delalloc_start = delalloc_end + 1;
				continue;
			}
			ret = tree->ops->fill_delalloc(inode, page,
						       delalloc_start,
						       delalloc_end,
						       &page_started,
						       &nr_written);
			/* File system has been set read-only */
			if (ret) {
				SetPageError(page);
				goto done;
			}
			/*
			 * delalloc_end is already one less than the total
			 * length, so we don't subtract one from
			 * PAGE_CACHE_SIZE
			 */
			delalloc_to_write += (delalloc_end - delalloc_start +
					      PAGE_CACHE_SIZE) >>
					      PAGE_CACHE_SHIFT;
			delalloc_start = delalloc_end + 1;
		}
		if (wbc->nr_to_write < delalloc_to_write) {
			int thresh = 8192;

			if (delalloc_to_write < thresh * 2)
				thresh = delalloc_to_write;
			wbc->nr_to_write = min_t(u64, delalloc_to_write,
						 thresh);
		}

		/* did the fill delalloc function already unlock and start
		 * the IO?
		 */
		if (page_started) {
			ret = 0;
			/*
			 * we've unlocked the page, so we can't update
			 * the mapping's writeback index, just update
			 * nr_to_write.
			 */
			wbc->nr_to_write -= nr_written;
			goto done_unlocked;
		}
	}
	if (tree->ops && tree->ops->writepage_start_hook) {
		ret = tree->ops->writepage_start_hook(page, start,
						      page_end);
		if (ret) {
			/* Fixup worker will requeue */
			if (ret == -EBUSY)
				wbc->pages_skipped++;
			else
				redirty_page_for_writepage(wbc, page);
			update_nr_written(page, wbc, nr_written);
			unlock_page(page);
			ret = 0;
			goto done_unlocked;
		}
	}

	/*
	 * we don't want to touch the inode after unlocking the page,
	 * so we update the mapping writeback index now
	 */
	update_nr_written(page, wbc, nr_written + 1);

	end = page_end;
	if (last_byte <= start) {
		if (tree->ops && tree->ops->writepage_end_io_hook)
			tree->ops->writepage_end_io_hook(page, start,
							 page_end, NULL, 1);
		goto done;
	}

	blocksize = inode->i_sb->s_blocksize;

	while (cur <= end) {
		if (cur >= last_byte) {
			if (tree->ops && tree->ops->writepage_end_io_hook)
				tree->ops->writepage_end_io_hook(page, cur,
							 page_end, NULL, 1);
			break;
		}
		em = epd->get_extent(inode, page, pg_offset, cur,
				     end - cur + 1, 1);
		if (IS_ERR_OR_NULL(em)) {
			SetPageError(page);
			break;
		}

		extent_offset = cur - em->start;
		BUG_ON(extent_map_end(em) <= cur);
		BUG_ON(end < cur);
		iosize = min(extent_map_end(em) - cur, end - cur + 1);
		iosize = ALIGN(iosize, blocksize);
		sector = (em->block_start + extent_offset) >> 9;
		bdev = em->bdev;
		block_start = em->block_start;
		compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
		free_extent_map(em);
		em = NULL;

		/*
		 * compressed and inline extents are written through other
		 * paths in the FS
		 */
		if (compressed || block_start == EXTENT_MAP_HOLE ||
		    block_start == EXTENT_MAP_INLINE) {
			/*
			 * end_io notification does not happen here for
			 * compressed extents
			 */
			if (!compressed && tree->ops &&
			    tree->ops->writepage_end_io_hook)
				tree->ops->writepage_end_io_hook(page, cur,
							 cur + iosize - 1,
							 NULL, 1);
			else if (compressed) {
				/* we don't want to end_page_writeback on
				 * a compressed extent.  this happens
				 * elsewhere
				 */
				nr++;
			}

			cur += iosize;
			pg_offset += iosize;
			continue;
		}
		/* leave this out until we have a page_mkwrite call */
		if (0 && !test_range_bit(tree, cur, cur + iosize - 1,
					 EXTENT_DIRTY, 0, NULL)) {
			cur = cur + iosize;
			pg_offset += iosize;
			continue;
		}

		if (tree->ops && tree->ops->writepage_io_hook) {
			ret = tree->ops->writepage_io_hook(page, cur,
						cur + iosize - 1);
		} else {
			ret = 0;
		}
		if (ret) {
			SetPageError(page);
		} else {
			unsigned long max_nr = end_index + 1;

			set_range_writeback(tree, cur, cur + iosize - 1);
			if (!PageWriteback(page)) {
				printk(KERN_ERR "btrfs warning page %lu not "
				       "writeback, cur %llu end %llu\n",
				       page->index, cur, end);
			}

			ret = submit_extent_page(write_flags, tree, page,
						 sector, iosize, pg_offset,
						 bdev, &epd->bio, max_nr,
						 end_bio_extent_writepage,
						 0, 0, 0);
			if (ret)
				SetPageError(page);
		}
		cur = cur + iosize;
		pg_offset += iosize;
		nr++;
	}
done:
	if (nr == 0) {
		/* make sure the mapping tag for page dirty gets cleared */
		set_page_writeback(page);
		end_page_writeback(page);
	}
	unlock_page(page);

done_unlocked:

	/* drop our reference on any cached states */
	free_extent_state(cached_state);
	return 0;
}

static int eb_wait(void *word)
{
	io_schedule();
	return 0;
}

void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
{
	wait_on_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK, eb_wait,
		    TASK_UNINTERRUPTIBLE);
}

static int lock_extent_buffer_for_io(struct extent_buffer *eb,
				     struct btrfs_fs_info *fs_info,
				     struct extent_page_data *epd)
{
	unsigned long i, num_pages;
	int flush = 0;
	int ret = 0;

	if (!btrfs_try_tree_write_lock(eb)) {
		flush = 1;
		flush_write_bio(epd);
		btrfs_tree_lock(eb);
	}

	if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
		btrfs_tree_unlock(eb);
		if (!epd->sync_io)
			return 0;
		if (!flush) {
			flush_write_bio(epd);
			flush = 1;
		}
		while (1) {
			wait_on_extent_buffer_writeback(eb);
			btrfs_tree_lock(eb);
			if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags))
				break;
			btrfs_tree_unlock(eb);
		}
	}

	/*
	 * We need to do this to prevent races in people who check if the eb is
	 * under IO since we can end up having no IO bits set for a short period
	 * of time.
	 */
	spin_lock(&eb->refs_lock);
	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
		set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
		spin_unlock(&eb->refs_lock);
		btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
		__percpu_counter_add(&fs_info->dirty_metadata_bytes,
				     -eb->len,
				     fs_info->dirty_metadata_batch);
		ret = 1;
	} else {
		spin_unlock(&eb->refs_lock);
	}

	btrfs_tree_unlock(eb);

	if (!ret)
		return ret;

	num_pages = num_extent_pages(eb->start, eb->len);
	for (i = 0; i < num_pages; i++) {
		struct page *p = extent_buffer_page(eb, i);

		if (!trylock_page(p)) {
			if (!flush) {
				flush_write_bio(epd);
				flush = 1;
			}
			lock_page(p);
		}
	}

	return ret;
}

static void end_extent_buffer_writeback(struct extent_buffer *eb)
{
	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
	smp_mb__after_clear_bit();
	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
}

static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
{
	int uptodate = err == 0;
	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
	struct extent_buffer *eb;
	int done;

	do {
		struct page *page = bvec->bv_page;

		bvec--;
		eb = (struct extent_buffer *)page->private;
		BUG_ON(!eb);
		done = atomic_dec_and_test(&eb->io_pages);

		if (!uptodate || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
			ClearPageUptodate(page);
			SetPageError(page);
		}

		end_page_writeback(page);

		if (!done)
			continue;

		end_extent_buffer_writeback(eb);
	} while (bvec >= bio->bi_io_vec);

	bio_put(bio);

}

static int write_one_eb(struct extent_buffer *eb,
			struct btrfs_fs_info *fs_info,
			struct writeback_control *wbc,
			struct extent_page_data *epd)
{
	struct block_device *bdev = fs_info->fs_devices->latest_bdev;
	u64 offset = eb->start;
	unsigned long i, num_pages;
	unsigned long bio_flags = 0;
	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
	int ret = 0;

	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
	num_pages = num_extent_pages(eb->start, eb->len);
	atomic_set(&eb->io_pages, num_pages);
	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
		bio_flags = EXTENT_BIO_TREE_LOG;

	for (i = 0; i < num_pages; i++) {
		struct page *p = extent_buffer_page(eb, i);

		clear_page_dirty_for_io(p);
		set_page_writeback(p);
		ret = submit_extent_page(rw, eb->tree, p, offset >> 9,
					 PAGE_CACHE_SIZE, 0, bdev, &epd->bio,
					 -1, end_bio_extent_buffer_writepage,
					 0, epd->bio_flags, bio_flags);
		epd->bio_flags = bio_flags;
		if (ret) {
			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
			SetPageError(p);
			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
				end_extent_buffer_writeback(eb);
			ret = -EIO;
			break;
		}
		offset += PAGE_CACHE_SIZE;
		update_nr_written(p, wbc, 1);
		unlock_page(p);
	}

	if (unlikely(ret)) {
		for (; i < num_pages; i++) {
			struct page *p = extent_buffer_page(eb, i);
			unlock_page(p);
		}
	}

	return ret;
}

int btree_write_cache_pages(struct address_space *mapping,
			    struct writeback_control *wbc)
{
	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
	struct btrfs_fs_info *fs_info = BTRFS_I(mapping->host)->root->fs_info;
	struct extent_buffer *eb, *prev_eb = NULL;
	struct extent_page_data epd = {
		.bio = NULL,
		.tree = tree,
		.extent_locked = 0,
		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
		.bio_flags = 0,
	};
	int ret = 0;
	int done = 0;
	int nr_to_write_done = 0;
	struct pagevec pvec;
	int nr_pages;
	pgoff_t index;
	pgoff_t end;		/* Inclusive */
	int scanned = 0;
	int tag;

	pagevec_init(&pvec, 0);
	if (wbc->range_cyclic) {
		index = mapping->writeback_index; /* Start from prev offset */
		end = -1;
	} else {
		index = wbc->range_start >> PAGE_CACHE_SHIFT;
		end = wbc->range_end >> PAGE_CACHE_SHIFT;
		scanned = 1;
	}
	if (wbc->sync_mode == WB_SYNC_ALL)
		tag = PAGECACHE_TAG_TOWRITE;
	else
		tag = PAGECACHE_TAG_DIRTY;
retry:
	if (wbc->sync_mode == WB_SYNC_ALL)
		tag_pages_for_writeback(mapping, index, end);
	while (!done && !nr_to_write_done && (index <= end) &&
	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
		unsigned i;

		scanned = 1;
		for (i = 0; i < nr_pages; i++) {
			struct page *page = pvec.pages[i];

			if (!PagePrivate(page))
				continue;

			if (!wbc->range_cyclic && page->index > end) {
				done = 1;
				break;
			}

			spin_lock(&mapping->private_lock);
			if (!PagePrivate(page)) {
				spin_unlock(&mapping->private_lock);
				continue;
			}

			eb = (struct extent_buffer *)page->private;

			/*
			 * Shouldn't happen and normally this would be a BUG_ON
			 * but no sense in crashing the users box for something
			 * we can survive anyway.
			 */
			if (!eb) {
				spin_unlock(&mapping->private_lock);
				WARN_ON(1);
				continue;
			}

			if (eb == prev_eb) {
				spin_unlock(&mapping->private_lock);
				continue;
			}

			ret = atomic_inc_not_zero(&eb->refs);
			spin_unlock(&mapping->private_lock);
			if (!ret)
				continue;

			prev_eb = eb;
			ret = lock_extent_buffer_for_io(eb, fs_info, &epd);
			if (!ret) {
				free_extent_buffer(eb);
				continue;
			}

			ret = write_one_eb(eb, fs_info, wbc, &epd);
			if (ret) {
				done = 1;
				free_extent_buffer(eb);
				break;
			}
			free_extent_buffer(eb);

			/*
			 * the filesystem may choose to bump up nr_to_write.
			 * We have to make sure to honor the new nr_to_write
			 * at any time
			 */
			nr_to_write_done = wbc->nr_to_write <= 0;
		}
		pagevec_release(&pvec);
		cond_resched();
	}
	if (!scanned && !done) {
		/*
		 * We hit the last page and there is more work to be done: wrap
		 * back to the start of the file
		 */
		scanned = 1;
		index = 0;
		goto retry;
	}
	flush_write_bio(&epd);
	return ret;
}

/**
 * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
 * @mapping: address space structure to write
 * @wbc: subtract the number of written pages from *@wbc->nr_to_write
 * @writepage: function called for each page
 * @data: data passed to writepage function
 *
 * If a page is already under I/O, write_cache_pages() skips it, even
 * if it's dirty.  This is desirable behaviour for memory-cleaning writeback,
 * but it is INCORRECT for data-integrity system calls such as fsync().  fsync()
 * and msync() need to guarantee that all the data which was dirty at the time
 * the call was made get new I/O started against them.  If wbc->sync_mode is
 * WB_SYNC_ALL then we were called for data integrity and we must wait for
 * existing IO to complete.
 */
static int extent_write_cache_pages(struct extent_io_tree *tree,
				    struct address_space *mapping,
				    struct writeback_control *wbc,
				    writepage_t writepage, void *data,
				    void (*flush_fn)(void *))
{
	struct inode *inode = mapping->host;
	int ret = 0;
	int done = 0;
	int nr_to_write_done = 0;
	struct pagevec pvec;
	int nr_pages;
	pgoff_t index;
	pgoff_t end;		/* Inclusive */
	int scanned = 0;
	int tag;

	/*
	 * We have to hold onto the inode so that ordered extents can do their
	 * work when the IO finishes. The alternative to this is failing to add
	 * an ordered extent if the igrab() fails there and that is a huge pain
	 * to deal with, so instead just hold onto the inode throughout the
	 * writepages operation. If it fails here we are freeing up the inode
	 * anyway and we'd rather not waste our time writing out stuff that is
	 * going to be truncated anyway.
	 */
	if (!igrab(inode))
		return 0;

	pagevec_init(&pvec, 0);
	if (wbc->range_cyclic) {
		index = mapping->writeback_index; /* Start from prev offset */
		end = -1;
	} else {
		index = wbc->range_start >> PAGE_CACHE_SHIFT;
		end = wbc->range_end >> PAGE_CACHE_SHIFT;
		scanned = 1;
	}
	if (wbc->sync_mode == WB_SYNC_ALL)
		tag = PAGECACHE_TAG_TOWRITE;
	else
		tag = PAGECACHE_TAG_DIRTY;
retry:
	if (wbc->sync_mode == WB_SYNC_ALL)
		tag_pages_for_writeback(mapping, index, end);
	while (!done && !nr_to_write_done && (index <= end) &&
	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
		unsigned i;

		scanned = 1;
		for (i = 0; i < nr_pages; i++) {
			struct page *page = pvec.pages[i];

			/*
			 * At this point we hold neither mapping->tree_lock nor
			 * lock on the page itself: the page may be truncated or
			 * invalidated (changing page->mapping to NULL), or even
			 * swizzled back from swapper_space to tmpfs file
			 * mapping
			 */
			if (!trylock_page(page)) {
				flush_fn(data);
				lock_page(page);
			}

			if (unlikely(page->mapping != mapping)) {
				unlock_page(page);
				continue;
			}

			if (!wbc->range_cyclic && page->index > end) {
				done = 1;
				unlock_page(page);
				continue;
			}

			if (wbc->sync_mode != WB_SYNC_NONE) {
				if (PageWriteback(page))
					flush_fn(data);
				wait_on_page_writeback(page);
			}

			if (PageWriteback(page) ||
			    !clear_page_dirty_for_io(page)) {
				unlock_page(page);
				continue;
			}

			ret = (*writepage)(page, wbc, data);

			if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
				unlock_page(page);
				ret = 0;
			}
			if (ret)
				done = 1;

			/*
			 * the filesystem may choose to bump up nr_to_write.
			 * We have to make sure to honor the new nr_to_write
			 * at any time
			 */
			nr_to_write_done = wbc->nr_to_write <= 0;
		}
		pagevec_release(&pvec);
		cond_resched();
	}
	if (!scanned && !done) {
		/*
		 * We hit the last page and there is more work to be done: wrap
		 * back to the start of the file
		 */
		scanned = 1;
		index = 0;
		goto retry;
	}
	btrfs_add_delayed_iput(inode);
	return ret;
}

static void flush_epd_write_bio(struct extent_page_data *epd)
{
	if (epd->bio) {
		int rw = WRITE;
		int ret;

		if (epd->sync_io)
			rw = WRITE_SYNC;

		ret = submit_one_bio(rw, epd->bio, 0, epd->bio_flags);
		BUG_ON(ret < 0); /* -ENOMEM */
		epd->bio = NULL;
	}
}

static noinline void flush_write_bio(void *data)
{
	struct extent_page_data *epd = data;
	flush_epd_write_bio(epd);
}

int extent_write_full_page(struct extent_io_tree *tree, struct page *page,
			   get_extent_t *get_extent,
			   struct writeback_control *wbc)
{
	int ret;
	struct extent_page_data epd = {
		.bio = NULL,
		.tree = tree,
		.get_extent = get_extent,
		.extent_locked = 0,
		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
		.bio_flags = 0,
	};

	ret = __extent_writepage(page, wbc, &epd);

	flush_epd_write_bio(&epd);
	return ret;
}

int extent_write_locked_range(struct extent_io_tree *tree, struct inode *inode,
			      u64 start, u64 end, get_extent_t *get_extent,
			      int mode)
{
	int ret = 0;
	struct address_space *mapping = inode->i_mapping;
	struct page *page;
	unsigned long nr_pages = (end - start + PAGE_CACHE_SIZE) >>
		PAGE_CACHE_SHIFT;

	struct extent_page_data epd = {
		.bio = NULL,
		.tree = tree,
		.get_extent = get_extent,
		.extent_locked = 1,
		.sync_io = mode == WB_SYNC_ALL,
		.bio_flags = 0,
	};
	struct writeback_control wbc_writepages = {
		.sync_mode	= mode,
		.nr_to_write	= nr_pages * 2,
		.range_start	= start,
		.range_end	= end + 1,
	};

	while (start <= end) {
		page = find_get_page(mapping, start >> PAGE_CACHE_SHIFT);
		if (clear_page_dirty_for_io(page))
			ret = __extent_writepage(page, &wbc_writepages, &epd);
		else {
			if (tree->ops && tree->ops->writepage_end_io_hook)
				tree->ops->writepage_end_io_hook(page, start,
						 start + PAGE_CACHE_SIZE - 1,
						 NULL, 1);
			unlock_page(page);
		}
		page_cache_release(page);
		start += PAGE_CACHE_SIZE;
	}

	flush_epd_write_bio(&epd);
	return ret;
}

int extent_writepages(struct extent_io_tree *tree,
		      struct address_space *mapping,
		      get_extent_t *get_extent,
		      struct writeback_control *wbc)
{
	int ret = 0;
	struct extent_page_data epd = {
		.bio = NULL,
		.tree = tree,
		.get_extent = get_extent,
		.extent_locked = 0,
		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
		.bio_flags = 0,
	};

	ret = extent_write_cache_pages(tree, mapping, wbc,
				       __extent_writepage, &epd,
				       flush_write_bio);
	flush_epd_write_bio(&epd);
	return ret;
}

int extent_readpages(struct extent_io_tree *tree,
		     struct address_space *mapping,
		     struct list_head *pages, unsigned nr_pages,
		     get_extent_t get_extent)
{
	struct bio *bio = NULL;
	unsigned page_idx;
	unsigned long bio_flags = 0;
	struct page *pagepool[16];
	struct page *page;
	struct extent_map *em_cached = NULL;
	int nr = 0;

	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
		page = list_entry(pages->prev, struct page, lru);

		prefetchw(&page->flags);
		list_del(&page->lru);
		if (add_to_page_cache_lru(page, mapping,
					page->index, GFP_NOFS)) {
			page_cache_release(page);
			continue;
		}

		pagepool[nr++] = page;
		if (nr < ARRAY_SIZE(pagepool))
			continue;
		__extent_readpages(tree, pagepool, nr, get_extent, &em_cached,
				   &bio, 0, &bio_flags, READ);
		nr = 0;
	}
	if (nr)
		__extent_readpages(tree, pagepool, nr, get_extent, &em_cached,
				   &bio, 0, &bio_flags, READ);

	if (em_cached)
		free_extent_map(em_cached);

	BUG_ON(!list_empty(pages));
	if (bio)
		return submit_one_bio(READ, bio, 0, bio_flags);
	return 0;
}

/*
 * basic invalidatepage code, this waits on any locked or writeback
 * ranges corresponding to the page, and then deletes any extent state
 * records from the tree
 */
int extent_invalidatepage(struct extent_io_tree *tree,
			  struct page *page, unsigned long offset)
{
	struct extent_state *cached_state = NULL;
	u64 start = page_offset(page);
	u64 end = start + PAGE_CACHE_SIZE - 1;
	size_t blocksize = page->mapping->host->i_sb->s_blocksize;

	start += ALIGN(offset, blocksize);
	if (start > end)
		return 0;

	lock_extent_bits(tree, start, end, 0, &cached_state);
	wait_on_page_writeback(page);
	clear_extent_bit(tree, start, end,
			 EXTENT_LOCKED | EXTENT_DIRTY | EXTENT_DELALLOC |
			 EXTENT_DO_ACCOUNTING,
			 1, 1, &cached_state, GFP_NOFS);
	return 0;
}

/*
 * a helper for releasepage, this tests for areas of the page that
 * are locked or under IO and drops the related state bits if it is safe
 * to drop the page.
 */
static int try_release_extent_state(struct extent_map_tree *map,
				    struct extent_io_tree *tree,
				    struct page *page, gfp_t mask)
{
	u64 start = page_offset(page);
	u64 end = start + PAGE_CACHE_SIZE - 1;
	int ret = 1;

	if (test_range_bit(tree, start, end,
			   EXTENT_IOBITS, 0, NULL))
		ret = 0;
	else {
		if ((mask & GFP_NOFS) == GFP_NOFS)
			mask = GFP_NOFS;
		/*
		 * at this point we can safely clear everything except the
		 * locked bit and the nodatasum bit
		 */
		ret = clear_extent_bit(tree, start, end,
				 ~(EXTENT_LOCKED | EXTENT_NODATASUM),
				 0, 0, NULL, mask);

		/* if clear_extent_bit failed for enomem reasons,
		 * we can't allow the release to continue.
		 */
		if (ret < 0)
			ret = 0;
		else
			ret = 1;
3977 } 3977 }
3978 return ret; 3978 return ret;
3979 } 3979 }
3980 3980
3981 /* 3981 /*
3982 * a helper for releasepage. As long as there are no locked extents 3982 * a helper for releasepage. As long as there are no locked extents
3983 * in the range corresponding to the page, both state records and extent 3983 * in the range corresponding to the page, both state records and extent
3984 * map records are removed 3984 * map records are removed
3985 */ 3985 */
3986 int try_release_extent_mapping(struct extent_map_tree *map, 3986 int try_release_extent_mapping(struct extent_map_tree *map,
3987 struct extent_io_tree *tree, struct page *page, 3987 struct extent_io_tree *tree, struct page *page,
3988 gfp_t mask) 3988 gfp_t mask)
3989 { 3989 {
3990 struct extent_map *em; 3990 struct extent_map *em;
3991 u64 start = page_offset(page); 3991 u64 start = page_offset(page);
3992 u64 end = start + PAGE_CACHE_SIZE - 1; 3992 u64 end = start + PAGE_CACHE_SIZE - 1;
3993 3993
3994 if ((mask & __GFP_WAIT) && 3994 if ((mask & __GFP_WAIT) &&
3995 page->mapping->host->i_size > 16 * 1024 * 1024) { 3995 page->mapping->host->i_size > 16 * 1024 * 1024) {
3996 u64 len; 3996 u64 len;
3997 while (start <= end) { 3997 while (start <= end) {
3998 len = end - start + 1; 3998 len = end - start + 1;
3999 write_lock(&map->lock); 3999 write_lock(&map->lock);
4000 em = lookup_extent_mapping(map, start, len); 4000 em = lookup_extent_mapping(map, start, len);
4001 if (!em) { 4001 if (!em) {
4002 write_unlock(&map->lock); 4002 write_unlock(&map->lock);
4003 break; 4003 break;
4004 } 4004 }
4005 if (test_bit(EXTENT_FLAG_PINNED, &em->flags) || 4005 if (test_bit(EXTENT_FLAG_PINNED, &em->flags) ||
4006 em->start != start) { 4006 em->start != start) {
4007 write_unlock(&map->lock); 4007 write_unlock(&map->lock);
4008 free_extent_map(em); 4008 free_extent_map(em);
4009 break; 4009 break;
4010 } 4010 }
4011 if (!test_range_bit(tree, em->start, 4011 if (!test_range_bit(tree, em->start,
4012 extent_map_end(em) - 1, 4012 extent_map_end(em) - 1,
4013 EXTENT_LOCKED | EXTENT_WRITEBACK, 4013 EXTENT_LOCKED | EXTENT_WRITEBACK,
4014 0, NULL)) { 4014 0, NULL)) {
4015 remove_extent_mapping(map, em); 4015 remove_extent_mapping(map, em);
4016 /* once for the rb tree */ 4016 /* once for the rb tree */
4017 free_extent_map(em); 4017 free_extent_map(em);
4018 } 4018 }
4019 start = extent_map_end(em); 4019 start = extent_map_end(em);
4020 write_unlock(&map->lock); 4020 write_unlock(&map->lock);
4021 4021
4022 /* once for us */ 4022 /* once for us */
4023 free_extent_map(em); 4023 free_extent_map(em);
4024 } 4024 }
4025 } 4025 }
4026 return try_release_extent_state(map, tree, page, mask); 4026 return try_release_extent_state(map, tree, page, mask);
4027 } 4027 }
4028 4028
4029 /* 4029 /*
4030 * helper function for fiemap, which doesn't want to see any holes. 4030 * helper function for fiemap, which doesn't want to see any holes.
4031 * This maps until we find something past 'last' 4031 * This maps until we find something past 'last'
4032 */ 4032 */
4033 static struct extent_map *get_extent_skip_holes(struct inode *inode, 4033 static struct extent_map *get_extent_skip_holes(struct inode *inode,
4034 u64 offset, 4034 u64 offset,
4035 u64 last, 4035 u64 last,
4036 get_extent_t *get_extent) 4036 get_extent_t *get_extent)
4037 { 4037 {
4038 u64 sectorsize = BTRFS_I(inode)->root->sectorsize; 4038 u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
4039 struct extent_map *em; 4039 struct extent_map *em;
4040 u64 len; 4040 u64 len;
4041 4041
4042 if (offset >= last) 4042 if (offset >= last)
4043 return NULL; 4043 return NULL;
4044 4044
4045 while(1) { 4045 while(1) {
4046 len = last - offset; 4046 len = last - offset;
4047 if (len == 0) 4047 if (len == 0)
4048 break; 4048 break;
4049 len = ALIGN(len, sectorsize); 4049 len = ALIGN(len, sectorsize);
4050 em = get_extent(inode, NULL, 0, offset, len, 0); 4050 em = get_extent(inode, NULL, 0, offset, len, 0);
4051 if (IS_ERR_OR_NULL(em)) 4051 if (IS_ERR_OR_NULL(em))
4052 return em; 4052 return em;
4053 4053
4054 /* if this isn't a hole return it */ 4054 /* if this isn't a hole return it */
4055 if (!test_bit(EXTENT_FLAG_VACANCY, &em->flags) && 4055 if (!test_bit(EXTENT_FLAG_VACANCY, &em->flags) &&
4056 em->block_start != EXTENT_MAP_HOLE) { 4056 em->block_start != EXTENT_MAP_HOLE) {
4057 return em; 4057 return em;
4058 } 4058 }
4059 4059
4060 /* this is a hole, advance to the next extent */ 4060 /* this is a hole, advance to the next extent */
4061 offset = extent_map_end(em); 4061 offset = extent_map_end(em);
4062 free_extent_map(em); 4062 free_extent_map(em);
4063 if (offset >= last) 4063 if (offset >= last)
4064 break; 4064 break;
4065 } 4065 }
4066 return NULL; 4066 return NULL;
4067 } 4067 }
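get_extent_skip_holes() keeps re-querying the mapping and stepping over hole extents until it finds real data or walks past 'last'. A small self-contained sketch of the same skip-the-holes walk, using a toy lookup table in place of the get_extent callback:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct ext { size_t start, end; bool hole; };

/* toy mapping: [0,10) hole, [10,20) data, [20,30) hole */
static const struct ext *lookup(size_t offset)
{
	static const struct ext map[] = {
		{  0, 10, true  },
		{ 10, 20, false },
		{ 20, 30, true  },
	};
	size_t i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
		if (offset >= map[i].start && offset < map[i].end)
			return &map[i];
	return NULL;
}

static const struct ext *skip_holes(size_t offset, size_t last)
{
	while (offset < last) {
		const struct ext *e = lookup(offset);

		if (!e)
			return NULL;
		if (!e->hole)			/* first non-hole mapping: done */
			return e;
		offset = e->end;		/* advance past the hole */
	}
	return NULL;
}

int main(void)
{
	const struct ext *e = skip_holes(0, 30);

	if (e)
		printf("first data extent: [%zu, %zu)\n", e->start, e->end);
	return 0;
}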
4068 4068
4069 int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, 4069 int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
4070 __u64 start, __u64 len, get_extent_t *get_extent) 4070 __u64 start, __u64 len, get_extent_t *get_extent)
4071 { 4071 {
4072 int ret = 0; 4072 int ret = 0;
4073 u64 off = start; 4073 u64 off = start;
4074 u64 max = start + len; 4074 u64 max = start + len;
4075 u32 flags = 0; 4075 u32 flags = 0;
4076 u32 found_type; 4076 u32 found_type;
4077 u64 last; 4077 u64 last;
4078 u64 last_for_get_extent = 0; 4078 u64 last_for_get_extent = 0;
4079 u64 disko = 0; 4079 u64 disko = 0;
4080 u64 isize = i_size_read(inode); 4080 u64 isize = i_size_read(inode);
4081 struct btrfs_key found_key; 4081 struct btrfs_key found_key;
4082 struct extent_map *em = NULL; 4082 struct extent_map *em = NULL;
4083 struct extent_state *cached_state = NULL; 4083 struct extent_state *cached_state = NULL;
4084 struct btrfs_path *path; 4084 struct btrfs_path *path;
4085 struct btrfs_file_extent_item *item; 4085 struct btrfs_file_extent_item *item;
4086 int end = 0; 4086 int end = 0;
4087 u64 em_start = 0; 4087 u64 em_start = 0;
4088 u64 em_len = 0; 4088 u64 em_len = 0;
4089 u64 em_end = 0; 4089 u64 em_end = 0;
4090 unsigned long emflags; 4090 unsigned long emflags;
4091 4091
4092 if (len == 0) 4092 if (len == 0)
4093 return -EINVAL; 4093 return -EINVAL;
4094 4094
4095 path = btrfs_alloc_path(); 4095 path = btrfs_alloc_path();
4096 if (!path) 4096 if (!path)
4097 return -ENOMEM; 4097 return -ENOMEM;
4098 path->leave_spinning = 1; 4098 path->leave_spinning = 1;
4099 4099
4100 start = ALIGN(start, BTRFS_I(inode)->root->sectorsize); 4100 start = ALIGN(start, BTRFS_I(inode)->root->sectorsize);
4101 len = ALIGN(len, BTRFS_I(inode)->root->sectorsize); 4101 len = ALIGN(len, BTRFS_I(inode)->root->sectorsize);
4102 4102
4103 /* 4103 /*
4104 * lookup the last file extent. We're not using i_size here 4104 * lookup the last file extent. We're not using i_size here
4105 * because there might be preallocation past i_size 4105 * because there might be preallocation past i_size
4106 */ 4106 */
4107 ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, 4107 ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root,
4108 path, btrfs_ino(inode), -1, 0); 4108 path, btrfs_ino(inode), -1, 0);
4109 if (ret < 0) { 4109 if (ret < 0) {
4110 btrfs_free_path(path); 4110 btrfs_free_path(path);
4111 return ret; 4111 return ret;
4112 } 4112 }
4113 WARN_ON(!ret); 4113 WARN_ON(!ret);
4114 path->slots[0]--; 4114 path->slots[0]--;
4115 item = btrfs_item_ptr(path->nodes[0], path->slots[0], 4115 item = btrfs_item_ptr(path->nodes[0], path->slots[0],
4116 struct btrfs_file_extent_item); 4116 struct btrfs_file_extent_item);
4117 btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]); 4117 btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
4118 found_type = btrfs_key_type(&found_key); 4118 found_type = btrfs_key_type(&found_key);
4119 4119
4120 /* No extents, but there might be delalloc bits */ 4120 /* No extents, but there might be delalloc bits */
4121 if (found_key.objectid != btrfs_ino(inode) || 4121 if (found_key.objectid != btrfs_ino(inode) ||
4122 found_type != BTRFS_EXTENT_DATA_KEY) { 4122 found_type != BTRFS_EXTENT_DATA_KEY) {
4123 /* have to trust i_size as the end */ 4123 /* have to trust i_size as the end */
4124 last = (u64)-1; 4124 last = (u64)-1;
4125 last_for_get_extent = isize; 4125 last_for_get_extent = isize;
4126 } else { 4126 } else {
4127 /* 4127 /*
4128 * remember the start of the last extent. There are a 4128 * remember the start of the last extent. There are a
4129 * bunch of different factors that go into the length of the 4129 * bunch of different factors that go into the length of the
4130 * extent, so its much less complex to remember where it started 4130 * extent, so its much less complex to remember where it started
4131 */ 4131 */
4132 last = found_key.offset; 4132 last = found_key.offset;
4133 last_for_get_extent = last + 1; 4133 last_for_get_extent = last + 1;
4134 } 4134 }
4135 btrfs_free_path(path); 4135 btrfs_free_path(path);
4136 4136
4137 /* 4137 /*
4138 * we might have some extents allocated but more delalloc past those 4138 * we might have some extents allocated but more delalloc past those
4139 * extents. so, we trust isize unless the start of the last extent is 4139 * extents. so, we trust isize unless the start of the last extent is
4140 * beyond isize 4140 * beyond isize
4141 */ 4141 */
4142 if (last < isize) { 4142 if (last < isize) {
4143 last = (u64)-1; 4143 last = (u64)-1;
4144 last_for_get_extent = isize; 4144 last_for_get_extent = isize;
4145 } 4145 }
4146 4146
4147 lock_extent_bits(&BTRFS_I(inode)->io_tree, start, start + len - 1, 0, 4147 lock_extent_bits(&BTRFS_I(inode)->io_tree, start, start + len - 1, 0,
4148 &cached_state); 4148 &cached_state);
4149 4149
4150 em = get_extent_skip_holes(inode, start, last_for_get_extent, 4150 em = get_extent_skip_holes(inode, start, last_for_get_extent,
4151 get_extent); 4151 get_extent);
4152 if (!em) 4152 if (!em)
4153 goto out; 4153 goto out;
4154 if (IS_ERR(em)) { 4154 if (IS_ERR(em)) {
4155 ret = PTR_ERR(em); 4155 ret = PTR_ERR(em);
4156 goto out; 4156 goto out;
4157 } 4157 }
4158 4158
4159 while (!end) { 4159 while (!end) {
4160 u64 offset_in_extent = 0; 4160 u64 offset_in_extent = 0;
4161 4161
4162 /* break if the extent we found is outside the range */ 4162 /* break if the extent we found is outside the range */
4163 if (em->start >= max || extent_map_end(em) < off) 4163 if (em->start >= max || extent_map_end(em) < off)
4164 break; 4164 break;
4165 4165
4166 /* 4166 /*
4167 * get_extent may return an extent that starts before our 4167 * get_extent may return an extent that starts before our
4168 * requested range. We have to make sure the ranges 4168 * requested range. We have to make sure the ranges
4169 * we return to fiemap always move forward and don't 4169 * we return to fiemap always move forward and don't
4170 * overlap, so adjust the offsets here 4170 * overlap, so adjust the offsets here
4171 */ 4171 */
4172 em_start = max(em->start, off); 4172 em_start = max(em->start, off);
4173 4173
4174 /* 4174 /*
4175 * record the offset from the start of the extent 4175 * record the offset from the start of the extent
4176 * for adjusting the disk offset below. Only do this if the 4176 * for adjusting the disk offset below. Only do this if the
4177 * extent isn't compressed since our in ram offset may be past 4177 * extent isn't compressed since our in ram offset may be past
4178 * what we have actually allocated on disk. 4178 * what we have actually allocated on disk.
4179 */ 4179 */
4180 if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) 4180 if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
4181 offset_in_extent = em_start - em->start; 4181 offset_in_extent = em_start - em->start;
4182 em_end = extent_map_end(em); 4182 em_end = extent_map_end(em);
4183 em_len = em_end - em_start; 4183 em_len = em_end - em_start;
4184 emflags = em->flags; 4184 emflags = em->flags;
4185 disko = 0; 4185 disko = 0;
4186 flags = 0; 4186 flags = 0;
4187 4187
4188 /* 4188 /*
4189 * bump off for our next call to get_extent 4189 * bump off for our next call to get_extent
4190 */ 4190 */
4191 off = extent_map_end(em); 4191 off = extent_map_end(em);
4192 if (off >= max) 4192 if (off >= max)
4193 end = 1; 4193 end = 1;
4194 4194
4195 if (em->block_start == EXTENT_MAP_LAST_BYTE) { 4195 if (em->block_start == EXTENT_MAP_LAST_BYTE) {
4196 end = 1; 4196 end = 1;
4197 flags |= FIEMAP_EXTENT_LAST; 4197 flags |= FIEMAP_EXTENT_LAST;
4198 } else if (em->block_start == EXTENT_MAP_INLINE) { 4198 } else if (em->block_start == EXTENT_MAP_INLINE) {
4199 flags |= (FIEMAP_EXTENT_DATA_INLINE | 4199 flags |= (FIEMAP_EXTENT_DATA_INLINE |
4200 FIEMAP_EXTENT_NOT_ALIGNED); 4200 FIEMAP_EXTENT_NOT_ALIGNED);
4201 } else if (em->block_start == EXTENT_MAP_DELALLOC) { 4201 } else if (em->block_start == EXTENT_MAP_DELALLOC) {
4202 flags |= (FIEMAP_EXTENT_DELALLOC | 4202 flags |= (FIEMAP_EXTENT_DELALLOC |
4203 FIEMAP_EXTENT_UNKNOWN); 4203 FIEMAP_EXTENT_UNKNOWN);
4204 } else { 4204 } else {
4205 disko = em->block_start + offset_in_extent; 4205 disko = em->block_start + offset_in_extent;
4206 } 4206 }
4207 if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) 4207 if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
4208 flags |= FIEMAP_EXTENT_ENCODED; 4208 flags |= FIEMAP_EXTENT_ENCODED;
4209 4209
4210 free_extent_map(em); 4210 free_extent_map(em);
4211 em = NULL; 4211 em = NULL;
4212 if ((em_start >= last) || em_len == (u64)-1 || 4212 if ((em_start >= last) || em_len == (u64)-1 ||
4213 (last == (u64)-1 && isize <= em_end)) { 4213 (last == (u64)-1 && isize <= em_end)) {
4214 flags |= FIEMAP_EXTENT_LAST; 4214 flags |= FIEMAP_EXTENT_LAST;
4215 end = 1; 4215 end = 1;
4216 } 4216 }
4217 4217
4218 /* now scan forward to see if this is really the last extent. */ 4218 /* now scan forward to see if this is really the last extent. */
4219 em = get_extent_skip_holes(inode, off, last_for_get_extent, 4219 em = get_extent_skip_holes(inode, off, last_for_get_extent,
4220 get_extent); 4220 get_extent);
4221 if (IS_ERR(em)) { 4221 if (IS_ERR(em)) {
4222 ret = PTR_ERR(em); 4222 ret = PTR_ERR(em);
4223 goto out; 4223 goto out;
4224 } 4224 }
4225 if (!em) { 4225 if (!em) {
4226 flags |= FIEMAP_EXTENT_LAST; 4226 flags |= FIEMAP_EXTENT_LAST;
4227 end = 1; 4227 end = 1;
4228 } 4228 }
4229 ret = fiemap_fill_next_extent(fieinfo, em_start, disko, 4229 ret = fiemap_fill_next_extent(fieinfo, em_start, disko,
4230 em_len, flags); 4230 em_len, flags);
4231 if (ret) 4231 if (ret)
4232 goto out_free; 4232 goto out_free;
4233 } 4233 }
4234 out_free: 4234 out_free:
4235 free_extent_map(em); 4235 free_extent_map(em);
4236 out: 4236 out:
4237 unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, start + len - 1, 4237 unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, start + len - 1,
4238 &cached_state, GFP_NOFS); 4238 &cached_state, GFP_NOFS);
4239 return ret; 4239 return ret;
4240 } 4240 }
4241 4241
4242 static void __free_extent_buffer(struct extent_buffer *eb) 4242 static void __free_extent_buffer(struct extent_buffer *eb)
4243 { 4243 {
4244 btrfs_leak_debug_del(&eb->leak_list); 4244 btrfs_leak_debug_del(&eb->leak_list);
4245 kmem_cache_free(extent_buffer_cache, eb); 4245 kmem_cache_free(extent_buffer_cache, eb);
4246 } 4246 }
4247 4247
4248 static int extent_buffer_under_io(struct extent_buffer *eb) 4248 static int extent_buffer_under_io(struct extent_buffer *eb)
4249 { 4249 {
4250 return (atomic_read(&eb->io_pages) || 4250 return (atomic_read(&eb->io_pages) ||
4251 test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) || 4251 test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
4252 test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)); 4252 test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
4253 } 4253 }
4254 4254
4255 /* 4255 /*
4256 * Helper for releasing extent buffer page. 4256 * Helper for releasing extent buffer page.
4257 */ 4257 */
4258 static void btrfs_release_extent_buffer_page(struct extent_buffer *eb, 4258 static void btrfs_release_extent_buffer_page(struct extent_buffer *eb,
4259 unsigned long start_idx) 4259 unsigned long start_idx)
4260 { 4260 {
4261 unsigned long index; 4261 unsigned long index;
4262 unsigned long num_pages; 4262 unsigned long num_pages;
4263 struct page *page; 4263 struct page *page;
4264 int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags); 4264 int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
4265 4265
4266 BUG_ON(extent_buffer_under_io(eb)); 4266 BUG_ON(extent_buffer_under_io(eb));
4267 4267
4268 num_pages = num_extent_pages(eb->start, eb->len); 4268 num_pages = num_extent_pages(eb->start, eb->len);
4269 index = start_idx + num_pages; 4269 index = start_idx + num_pages;
4270 if (start_idx >= index) 4270 if (start_idx >= index)
4271 return; 4271 return;
4272 4272
4273 do { 4273 do {
4274 index--; 4274 index--;
4275 page = extent_buffer_page(eb, index); 4275 page = extent_buffer_page(eb, index);
4276 if (page && mapped) { 4276 if (page && mapped) {
4277 spin_lock(&page->mapping->private_lock); 4277 spin_lock(&page->mapping->private_lock);
4278 /* 4278 /*
4279 * We do this since we'll remove the pages after we've 4279 * We do this since we'll remove the pages after we've
4280 * removed the eb from the radix tree, so we could race 4280 * removed the eb from the radix tree, so we could race
4281 * and have this page now attached to the new eb. So 4281 * and have this page now attached to the new eb. So
4282 * only clear page_private if it's still connected to 4282 * only clear page_private if it's still connected to
4283 * this eb. 4283 * this eb.
4284 */ 4284 */
4285 if (PagePrivate(page) && 4285 if (PagePrivate(page) &&
4286 page->private == (unsigned long)eb) { 4286 page->private == (unsigned long)eb) {
4287 BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)); 4287 BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
4288 BUG_ON(PageDirty(page)); 4288 BUG_ON(PageDirty(page));
4289 BUG_ON(PageWriteback(page)); 4289 BUG_ON(PageWriteback(page));
4290 /* 4290 /*
4291 * We need to make sure we haven't be attached 4291 * We need to make sure we haven't be attached
4292 * to a new eb. 4292 * to a new eb.
4293 */ 4293 */
4294 ClearPagePrivate(page); 4294 ClearPagePrivate(page);
4295 set_page_private(page, 0); 4295 set_page_private(page, 0);
4296 /* One for the page private */ 4296 /* One for the page private */
4297 page_cache_release(page); 4297 page_cache_release(page);
4298 } 4298 }
4299 spin_unlock(&page->mapping->private_lock); 4299 spin_unlock(&page->mapping->private_lock);
4300 4300
4301 } 4301 }
4302 if (page) { 4302 if (page) {
4303 /* One for when we alloced the page */ 4303 /* One for when we alloced the page */
4304 page_cache_release(page); 4304 page_cache_release(page);
4305 } 4305 }
4306 } while (index != start_idx); 4306 } while (index != start_idx);
4307 } 4307 }
4308 4308
4309 /* 4309 /*
4310 * Helper for releasing the extent buffer. 4310 * Helper for releasing the extent buffer.
4311 */ 4311 */
4312 static inline void btrfs_release_extent_buffer(struct extent_buffer *eb) 4312 static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
4313 { 4313 {
4314 btrfs_release_extent_buffer_page(eb, 0); 4314 btrfs_release_extent_buffer_page(eb, 0);
4315 __free_extent_buffer(eb); 4315 __free_extent_buffer(eb);
4316 } 4316 }
4317 4317
4318 static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree, 4318 static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
4319 u64 start, 4319 u64 start,
4320 unsigned long len, 4320 unsigned long len,
4321 gfp_t mask) 4321 gfp_t mask)
4322 { 4322 {
4323 struct extent_buffer *eb = NULL; 4323 struct extent_buffer *eb = NULL;
4324 4324
4325 eb = kmem_cache_zalloc(extent_buffer_cache, mask); 4325 eb = kmem_cache_zalloc(extent_buffer_cache, mask);
4326 if (eb == NULL) 4326 if (eb == NULL)
4327 return NULL; 4327 return NULL;
4328 eb->start = start; 4328 eb->start = start;
4329 eb->len = len; 4329 eb->len = len;
4330 eb->tree = tree; 4330 eb->tree = tree;
4331 eb->bflags = 0; 4331 eb->bflags = 0;
4332 rwlock_init(&eb->lock); 4332 rwlock_init(&eb->lock);
4333 atomic_set(&eb->write_locks, 0); 4333 atomic_set(&eb->write_locks, 0);
4334 atomic_set(&eb->read_locks, 0); 4334 atomic_set(&eb->read_locks, 0);
4335 atomic_set(&eb->blocking_readers, 0); 4335 atomic_set(&eb->blocking_readers, 0);
4336 atomic_set(&eb->blocking_writers, 0); 4336 atomic_set(&eb->blocking_writers, 0);
4337 atomic_set(&eb->spinning_readers, 0); 4337 atomic_set(&eb->spinning_readers, 0);
4338 atomic_set(&eb->spinning_writers, 0); 4338 atomic_set(&eb->spinning_writers, 0);
4339 eb->lock_nested = 0; 4339 eb->lock_nested = 0;
4340 init_waitqueue_head(&eb->write_lock_wq); 4340 init_waitqueue_head(&eb->write_lock_wq);
4341 init_waitqueue_head(&eb->read_lock_wq); 4341 init_waitqueue_head(&eb->read_lock_wq);
4342 4342
4343 btrfs_leak_debug_add(&eb->leak_list, &buffers); 4343 btrfs_leak_debug_add(&eb->leak_list, &buffers);
4344 4344
4345 spin_lock_init(&eb->refs_lock); 4345 spin_lock_init(&eb->refs_lock);
4346 atomic_set(&eb->refs, 1); 4346 atomic_set(&eb->refs, 1);
4347 atomic_set(&eb->io_pages, 0); 4347 atomic_set(&eb->io_pages, 0);
4348 4348
4349 /* 4349 /*
4350 * Sanity checks, currently the maximum is 64k covered by 16x 4k pages 4350 * Sanity checks, currently the maximum is 64k covered by 16x 4k pages
4351 */ 4351 */
4352 BUILD_BUG_ON(BTRFS_MAX_METADATA_BLOCKSIZE 4352 BUILD_BUG_ON(BTRFS_MAX_METADATA_BLOCKSIZE
4353 > MAX_INLINE_EXTENT_BUFFER_SIZE); 4353 > MAX_INLINE_EXTENT_BUFFER_SIZE);
4354 BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE); 4354 BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE);
4355 4355
4356 return eb; 4356 return eb;
4357 } 4357 }
4358 4358
4359 struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src) 4359 struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
4360 { 4360 {
4361 unsigned long i; 4361 unsigned long i;
4362 struct page *p; 4362 struct page *p;
4363 struct extent_buffer *new; 4363 struct extent_buffer *new;
4364 unsigned long num_pages = num_extent_pages(src->start, src->len); 4364 unsigned long num_pages = num_extent_pages(src->start, src->len);
4365 4365
4366 new = __alloc_extent_buffer(NULL, src->start, src->len, GFP_NOFS); 4366 new = __alloc_extent_buffer(NULL, src->start, src->len, GFP_NOFS);
4367 if (new == NULL) 4367 if (new == NULL)
4368 return NULL; 4368 return NULL;
4369 4369
4370 for (i = 0; i < num_pages; i++) { 4370 for (i = 0; i < num_pages; i++) {
4371 p = alloc_page(GFP_NOFS); 4371 p = alloc_page(GFP_NOFS);
4372 if (!p) { 4372 if (!p) {
4373 btrfs_release_extent_buffer(new); 4373 btrfs_release_extent_buffer(new);
4374 return NULL; 4374 return NULL;
4375 } 4375 }
4376 attach_extent_buffer_page(new, p); 4376 attach_extent_buffer_page(new, p);
4377 WARN_ON(PageDirty(p)); 4377 WARN_ON(PageDirty(p));
4378 SetPageUptodate(p); 4378 SetPageUptodate(p);
4379 new->pages[i] = p; 4379 new->pages[i] = p;
4380 } 4380 }
4381 4381
4382 copy_extent_buffer(new, src, 0, 0, src->len); 4382 copy_extent_buffer(new, src, 0, 0, src->len);
4383 set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags); 4383 set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
4384 set_bit(EXTENT_BUFFER_DUMMY, &new->bflags); 4384 set_bit(EXTENT_BUFFER_DUMMY, &new->bflags);
4385 4385
4386 return new; 4386 return new;
4387 } 4387 }
4388 4388
4389 struct extent_buffer *alloc_dummy_extent_buffer(u64 start, unsigned long len) 4389 struct extent_buffer *alloc_dummy_extent_buffer(u64 start, unsigned long len)
4390 { 4390 {
4391 struct extent_buffer *eb; 4391 struct extent_buffer *eb;
4392 unsigned long num_pages = num_extent_pages(0, len); 4392 unsigned long num_pages = num_extent_pages(0, len);
4393 unsigned long i; 4393 unsigned long i;
4394 4394
4395 eb = __alloc_extent_buffer(NULL, start, len, GFP_NOFS); 4395 eb = __alloc_extent_buffer(NULL, start, len, GFP_NOFS);
4396 if (!eb) 4396 if (!eb)
4397 return NULL; 4397 return NULL;
4398 4398
4399 for (i = 0; i < num_pages; i++) { 4399 for (i = 0; i < num_pages; i++) {
4400 eb->pages[i] = alloc_page(GFP_NOFS); 4400 eb->pages[i] = alloc_page(GFP_NOFS);
4401 if (!eb->pages[i]) 4401 if (!eb->pages[i])
4402 goto err; 4402 goto err;
4403 } 4403 }
4404 set_extent_buffer_uptodate(eb); 4404 set_extent_buffer_uptodate(eb);
4405 btrfs_set_header_nritems(eb, 0); 4405 btrfs_set_header_nritems(eb, 0);
4406 set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags); 4406 set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
4407 4407
4408 return eb; 4408 return eb;
4409 err: 4409 err:
4410 for (; i > 0; i--) 4410 for (; i > 0; i--)
4411 __free_page(eb->pages[i - 1]); 4411 __free_page(eb->pages[i - 1]);
4412 __free_extent_buffer(eb); 4412 __free_extent_buffer(eb);
4413 return NULL; 4413 return NULL;
4414 } 4414 }
4415 4415
4416 static void check_buffer_tree_ref(struct extent_buffer *eb) 4416 static void check_buffer_tree_ref(struct extent_buffer *eb)
4417 { 4417 {
4418 int refs; 4418 int refs;
4419 /* the ref bit is tricky. We have to make sure it is set 4419 /* the ref bit is tricky. We have to make sure it is set
4420 * if we have the buffer dirty. Otherwise the 4420 * if we have the buffer dirty. Otherwise the
4421 * code to free a buffer can end up dropping a dirty 4421 * code to free a buffer can end up dropping a dirty
4422 * page 4422 * page
4423 * 4423 *
4424 * Once the ref bit is set, it won't go away while the 4424 * Once the ref bit is set, it won't go away while the
4425 * buffer is dirty or in writeback, and it also won't 4425 * buffer is dirty or in writeback, and it also won't
4426 * go away while we have the reference count on the 4426 * go away while we have the reference count on the
4427 * eb bumped. 4427 * eb bumped.
4428 * 4428 *
4429 * We can't just set the ref bit without bumping the 4429 * We can't just set the ref bit without bumping the
4430 * ref on the eb because free_extent_buffer might 4430 * ref on the eb because free_extent_buffer might
4431 * see the ref bit and try to clear it. If this happens 4431 * see the ref bit and try to clear it. If this happens
4432 * free_extent_buffer might end up dropping our original 4432 * free_extent_buffer might end up dropping our original
4433 * ref by mistake and freeing the page before we are able 4433 * ref by mistake and freeing the page before we are able
4434 * to add one more ref. 4434 * to add one more ref.
4435 * 4435 *
4436 * So bump the ref count first, then set the bit. If someone 4436 * So bump the ref count first, then set the bit. If someone
4437 * beat us to it, drop the ref we added. 4437 * beat us to it, drop the ref we added.
4438 */ 4438 */
4439 refs = atomic_read(&eb->refs); 4439 refs = atomic_read(&eb->refs);
4440 if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) 4440 if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
4441 return; 4441 return;
4442 4442
4443 spin_lock(&eb->refs_lock); 4443 spin_lock(&eb->refs_lock);
4444 if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) 4444 if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
4445 atomic_inc(&eb->refs); 4445 atomic_inc(&eb->refs);
4446 spin_unlock(&eb->refs_lock); 4446 spin_unlock(&eb->refs_lock);
4447 } 4447 }
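The comment above describes a subtle ordering requirement: the extra tree reference must be added at most once, and only by the thread that wins the TREE_REF test-and-set, so a concurrent free_extent_buffer() cannot drop the caller's reference by mistake. A rough userspace model of that guarded-extra-reference idea, using C11 atomics and a mutex in place of the kernel primitives:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct obj {
	atomic_int refs;
	atomic_bool tree_ref;		/* stands in for EXTENT_BUFFER_TREE_REF */
	pthread_mutex_t lock;		/* stands in for eb->refs_lock */
};

static void check_tree_ref(struct obj *o)
{
	/* fast path: the extra reference is already accounted for */
	if (atomic_load(&o->refs) >= 2 && atomic_load(&o->tree_ref))
		return;

	pthread_mutex_lock(&o->lock);
	/* only the thread that flips the flag adds the extra reference */
	if (!atomic_exchange(&o->tree_ref, true))
		atomic_fetch_add(&o->refs, 1);
	pthread_mutex_unlock(&o->lock);
}

int main(void)
{
	struct obj o = { .refs = 1, .tree_ref = false,
			 .lock = PTHREAD_MUTEX_INITIALIZER };

	check_tree_ref(&o);
	check_tree_ref(&o);		/* second call is a no-op */
	printf("refs=%d tree_ref=%d\n", atomic_load(&o.refs),
	       atomic_load(&o.tree_ref));
	return 0;
}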
4448 4448
-static void mark_extent_buffer_accessed(struct extent_buffer *eb)
+static void mark_extent_buffer_accessed(struct extent_buffer *eb,
+					struct page *accessed)
 {
 	unsigned long num_pages, i;
 
 	check_buffer_tree_ref(eb);
 
 	num_pages = num_extent_pages(eb->start, eb->len);
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
-		mark_page_accessed(p);
+		if (p != accessed)
+			mark_page_accessed(p);
 	}
 }
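The new 'accessed' argument lets a caller that has already handled mark_page_accessed() for one of the buffer's pages (such as the page p it just looked up) tell this helper to skip that page rather than mark it a second time. A toy illustration of the skip-one-page behaviour; the types and names here are invented for the example:

#include <stdbool.h>
#include <stdio.h>

struct toy_page { bool accessed; };

static void mark_accessed(struct toy_page *p)
{
	p->accessed = true;
}

static void mark_all_accessed(struct toy_page *pages, int nr,
			      struct toy_page *already)
{
	int i;

	for (i = 0; i < nr; i++)
		if (&pages[i] != already)	/* skip the page the caller handled */
			mark_accessed(&pages[i]);
}

int main(void)
{
	struct toy_page pages[4] = {{ false }};
	int i;

	mark_accessed(&pages[2]);		/* already marked by the lookup path */
	mark_all_accessed(pages, 4, &pages[2]);
	for (i = 0; i < 4; i++)
		printf("page %d accessed=%d\n", i, pages[i].accessed);
	return 0;
}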
4461 4463
4462 struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree, 4464 struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree,
4463 u64 start, unsigned long len) 4465 u64 start, unsigned long len)
4464 { 4466 {
4465 unsigned long num_pages = num_extent_pages(start, len); 4467 unsigned long num_pages = num_extent_pages(start, len);
4466 unsigned long i; 4468 unsigned long i;
4467 unsigned long index = start >> PAGE_CACHE_SHIFT; 4469 unsigned long index = start >> PAGE_CACHE_SHIFT;
4468 struct extent_buffer *eb; 4470 struct extent_buffer *eb;
4469 struct extent_buffer *exists = NULL; 4471 struct extent_buffer *exists = NULL;
4470 struct page *p; 4472 struct page *p;
4471 struct address_space *mapping = tree->mapping; 4473 struct address_space *mapping = tree->mapping;
4472 int uptodate = 1; 4474 int uptodate = 1;
4473 int ret; 4475 int ret;
4474 4476
 	rcu_read_lock();
 	eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT);
 	if (eb && atomic_inc_not_zero(&eb->refs)) {
 		rcu_read_unlock();
-		mark_extent_buffer_accessed(eb);
+		mark_extent_buffer_accessed(eb, NULL);
 		return eb;
 	}
 	rcu_read_unlock();
4483 4485
4484 eb = __alloc_extent_buffer(tree, start, len, GFP_NOFS); 4486 eb = __alloc_extent_buffer(tree, start, len, GFP_NOFS);
4485 if (!eb) 4487 if (!eb)
4486 return NULL; 4488 return NULL;
4487 4489
4488 for (i = 0; i < num_pages; i++, index++) { 4490 for (i = 0; i < num_pages; i++, index++) {
4489 p = find_or_create_page(mapping, index, GFP_NOFS); 4491 p = find_or_create_page(mapping, index, GFP_NOFS);
4490 if (!p) 4492 if (!p)
4491 goto free_eb; 4493 goto free_eb;
4492 4494
4493 spin_lock(&mapping->private_lock); 4495 spin_lock(&mapping->private_lock);
4494 if (PagePrivate(p)) { 4496 if (PagePrivate(p)) {
4495 /* 4497 /*
4496 * We could have already allocated an eb for this page 4498 * We could have already allocated an eb for this page
4497 * and attached one so lets see if we can get a ref on 4499 * and attached one so lets see if we can get a ref on
4498 * the existing eb, and if we can we know it's good and 4500 * the existing eb, and if we can we know it's good and
4499 * we can just return that one, else we know we can just 4501 * we can just return that one, else we know we can just
4500 * overwrite page->private. 4502 * overwrite page->private.
4501 */ 4503 */
4502 exists = (struct extent_buffer *)p->private; 4504 exists = (struct extent_buffer *)p->private;
 			if (atomic_inc_not_zero(&exists->refs)) {
 				spin_unlock(&mapping->private_lock);
 				unlock_page(p);
 				page_cache_release(p);
-				mark_extent_buffer_accessed(exists);
+				mark_extent_buffer_accessed(exists, p);
 				goto free_eb;
 			}
4510 4512
4511 /* 4513 /*
4512 * Do this so attach doesn't complain and we need to 4514 * Do this so attach doesn't complain and we need to
4513 * drop the ref the old guy had. 4515 * drop the ref the old guy had.
4514 */ 4516 */
4515 ClearPagePrivate(p); 4517 ClearPagePrivate(p);
4516 WARN_ON(PageDirty(p)); 4518 WARN_ON(PageDirty(p));
4517 page_cache_release(p); 4519 page_cache_release(p);
4518 } 4520 }
 		attach_extent_buffer_page(eb, p);
 		spin_unlock(&mapping->private_lock);
 		WARN_ON(PageDirty(p));
-		mark_page_accessed(p);
 		eb->pages[i] = p;
 		if (!PageUptodate(p))
 			uptodate = 0;
4526 4527
4527 /* 4528 /*
4528 * see below about how we avoid a nasty race with release page 4529 * see below about how we avoid a nasty race with release page
4529 * and why we unlock later 4530 * and why we unlock later
4530 */ 4531 */
4531 } 4532 }
4532 if (uptodate) 4533 if (uptodate)
4533 set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); 4534 set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
4534 again: 4535 again:
4535 ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); 4536 ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
4536 if (ret) 4537 if (ret)
4537 goto free_eb; 4538 goto free_eb;
4538 4539
4539 spin_lock(&tree->buffer_lock); 4540 spin_lock(&tree->buffer_lock);
4540 ret = radix_tree_insert(&tree->buffer, start >> PAGE_CACHE_SHIFT, eb); 4541 ret = radix_tree_insert(&tree->buffer, start >> PAGE_CACHE_SHIFT, eb);
4541 if (ret == -EEXIST) { 4542 if (ret == -EEXIST) {
4542 exists = radix_tree_lookup(&tree->buffer, 4543 exists = radix_tree_lookup(&tree->buffer,
4543 start >> PAGE_CACHE_SHIFT); 4544 start >> PAGE_CACHE_SHIFT);
4544 if (!atomic_inc_not_zero(&exists->refs)) { 4545 if (!atomic_inc_not_zero(&exists->refs)) {
4545 spin_unlock(&tree->buffer_lock); 4546 spin_unlock(&tree->buffer_lock);
4546 radix_tree_preload_end(); 4547 radix_tree_preload_end();
4547 exists = NULL; 4548 exists = NULL;
4548 goto again; 4549 goto again;
4549 } 4550 }
 		spin_unlock(&tree->buffer_lock);
 		radix_tree_preload_end();
-		mark_extent_buffer_accessed(exists);
+		mark_extent_buffer_accessed(exists, NULL);
 		goto free_eb;
4554 } 4555 }
4555 /* add one reference for the tree */ 4556 /* add one reference for the tree */
4556 check_buffer_tree_ref(eb); 4557 check_buffer_tree_ref(eb);
4557 spin_unlock(&tree->buffer_lock); 4558 spin_unlock(&tree->buffer_lock);
4558 radix_tree_preload_end(); 4559 radix_tree_preload_end();
4559 4560
4560 /* 4561 /*
4561 * there is a race where release page may have 4562 * there is a race where release page may have
4562 * tried to find this extent buffer in the radix 4563 * tried to find this extent buffer in the radix
4563 * but failed. It will tell the VM it is safe to 4564 * but failed. It will tell the VM it is safe to
4564 * reclaim the, and it will clear the page private bit. 4565 * reclaim the, and it will clear the page private bit.
4565 * We must make sure to set the page private bit properly 4566 * We must make sure to set the page private bit properly
4566 * after the extent buffer is in the radix tree so 4567 * after the extent buffer is in the radix tree so
4567 * it doesn't get lost 4568 * it doesn't get lost
4568 */ 4569 */
4569 SetPageChecked(eb->pages[0]); 4570 SetPageChecked(eb->pages[0]);
4570 for (i = 1; i < num_pages; i++) { 4571 for (i = 1; i < num_pages; i++) {
4571 p = extent_buffer_page(eb, i); 4572 p = extent_buffer_page(eb, i);
4572 ClearPageChecked(p); 4573 ClearPageChecked(p);
4573 unlock_page(p); 4574 unlock_page(p);
4574 } 4575 }
4575 unlock_page(eb->pages[0]); 4576 unlock_page(eb->pages[0]);
4576 return eb; 4577 return eb;
4577 4578
4578 free_eb: 4579 free_eb:
4579 for (i = 0; i < num_pages; i++) { 4580 for (i = 0; i < num_pages; i++) {
4580 if (eb->pages[i]) 4581 if (eb->pages[i])
4581 unlock_page(eb->pages[i]); 4582 unlock_page(eb->pages[i]);
4582 } 4583 }
4583 4584
4584 WARN_ON(!atomic_dec_and_test(&eb->refs)); 4585 WARN_ON(!atomic_dec_and_test(&eb->refs));
4585 btrfs_release_extent_buffer(eb); 4586 btrfs_release_extent_buffer(eb);
4586 return exists; 4587 return exists;
4587 } 4588 }
4588 4589
4589 struct extent_buffer *find_extent_buffer(struct extent_io_tree *tree, 4590 struct extent_buffer *find_extent_buffer(struct extent_io_tree *tree,
4590 u64 start, unsigned long len) 4591 u64 start, unsigned long len)
4591 { 4592 {
4592 struct extent_buffer *eb; 4593 struct extent_buffer *eb;
4593 4594
 	rcu_read_lock();
 	eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT);
 	if (eb && atomic_inc_not_zero(&eb->refs)) {
 		rcu_read_unlock();
-		mark_extent_buffer_accessed(eb);
+		mark_extent_buffer_accessed(eb, NULL);
 		return eb;
 	}
 	rcu_read_unlock();
4602 4603
4603 return NULL; 4604 return NULL;
4604 } 4605 }
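Both lookups above rely on atomic_inc_not_zero(): under rcu_read_lock() the buffer may already be on its way to being freed, so a reference is taken only if the count is still non-zero, otherwise the lookup allocates a new buffer (or, here, returns NULL). A standalone sketch of that conditional ref-grab, modelling atomic_inc_not_zero() with a C11 compare-exchange loop (RCU itself is not modelled):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static bool inc_not_zero(atomic_int *refs)
{
	int old = atomic_load(refs);

	while (old != 0) {
		/* only bump the count if it is still non-zero */
		if (atomic_compare_exchange_weak(refs, &old, old + 1))
			return true;
		/* old was reloaded by the failed CAS; retry */
	}
	return false;		/* object already being torn down */
}

int main(void)
{
	atomic_int live = 2, dying = 0;

	printf("live:  %d\n", inc_not_zero(&live));	/* 1: reference taken */
	printf("dying: %d\n", inc_not_zero(&dying));	/* 0: lookup must fall back */
	return 0;
}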
4605 4606
4606 static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head) 4607 static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head)
4607 { 4608 {
4608 struct extent_buffer *eb = 4609 struct extent_buffer *eb =
4609 container_of(head, struct extent_buffer, rcu_head); 4610 container_of(head, struct extent_buffer, rcu_head);
4610 4611
4611 __free_extent_buffer(eb); 4612 __free_extent_buffer(eb);
4612 } 4613 }
4613 4614
4614 /* Expects to have eb->eb_lock already held */ 4615 /* Expects to have eb->eb_lock already held */
4615 static int release_extent_buffer(struct extent_buffer *eb) 4616 static int release_extent_buffer(struct extent_buffer *eb)
4616 { 4617 {
4617 WARN_ON(atomic_read(&eb->refs) == 0); 4618 WARN_ON(atomic_read(&eb->refs) == 0);
4618 if (atomic_dec_and_test(&eb->refs)) { 4619 if (atomic_dec_and_test(&eb->refs)) {
4619 if (test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags)) { 4620 if (test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags)) {
4620 spin_unlock(&eb->refs_lock); 4621 spin_unlock(&eb->refs_lock);
4621 } else { 4622 } else {
4622 struct extent_io_tree *tree = eb->tree; 4623 struct extent_io_tree *tree = eb->tree;
4623 4624
4624 spin_unlock(&eb->refs_lock); 4625 spin_unlock(&eb->refs_lock);
4625 4626
4626 spin_lock(&tree->buffer_lock); 4627 spin_lock(&tree->buffer_lock);
4627 radix_tree_delete(&tree->buffer, 4628 radix_tree_delete(&tree->buffer,
4628 eb->start >> PAGE_CACHE_SHIFT); 4629 eb->start >> PAGE_CACHE_SHIFT);
4629 spin_unlock(&tree->buffer_lock); 4630 spin_unlock(&tree->buffer_lock);
4630 } 4631 }
4631 4632
4632 /* Should be safe to release our pages at this point */ 4633 /* Should be safe to release our pages at this point */
4633 btrfs_release_extent_buffer_page(eb, 0); 4634 btrfs_release_extent_buffer_page(eb, 0);
4634 call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu); 4635 call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu);
4635 return 1; 4636 return 1;
4636 } 4637 }
4637 spin_unlock(&eb->refs_lock); 4638 spin_unlock(&eb->refs_lock);
4638 4639
4639 return 0; 4640 return 0;
4640 } 4641 }
4641 4642
4642 void free_extent_buffer(struct extent_buffer *eb) 4643 void free_extent_buffer(struct extent_buffer *eb)
4643 { 4644 {
4644 int refs; 4645 int refs;
4645 int old; 4646 int old;
4646 if (!eb) 4647 if (!eb)
4647 return; 4648 return;
4648 4649
4649 while (1) { 4650 while (1) {
4650 refs = atomic_read(&eb->refs); 4651 refs = atomic_read(&eb->refs);
4651 if (refs <= 3) 4652 if (refs <= 3)
4652 break; 4653 break;
4653 old = atomic_cmpxchg(&eb->refs, refs, refs - 1); 4654 old = atomic_cmpxchg(&eb->refs, refs, refs - 1);
4654 if (old == refs) 4655 if (old == refs)
4655 return; 4656 return;
4656 } 4657 }
4657 4658
4658 spin_lock(&eb->refs_lock); 4659 spin_lock(&eb->refs_lock);
4659 if (atomic_read(&eb->refs) == 2 && 4660 if (atomic_read(&eb->refs) == 2 &&
4660 test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags)) 4661 test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))
4661 atomic_dec(&eb->refs); 4662 atomic_dec(&eb->refs);
4662 4663
4663 if (atomic_read(&eb->refs) == 2 && 4664 if (atomic_read(&eb->refs) == 2 &&
4664 test_bit(EXTENT_BUFFER_STALE, &eb->bflags) && 4665 test_bit(EXTENT_BUFFER_STALE, &eb->bflags) &&
4665 !extent_buffer_under_io(eb) && 4666 !extent_buffer_under_io(eb) &&
4666 test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) 4667 test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
4667 atomic_dec(&eb->refs); 4668 atomic_dec(&eb->refs);
4668 4669
4669 /* 4670 /*
4670 * I know this is terrible, but it's temporary until we stop tracking 4671 * I know this is terrible, but it's temporary until we stop tracking
4671 * the uptodate bits and such for the extent buffers. 4672 * the uptodate bits and such for the extent buffers.
4672 */ 4673 */
4673 release_extent_buffer(eb); 4674 release_extent_buffer(eb);
4674 } 4675 }
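free_extent_buffer() first tries a lockless fast path: while the reference count is comfortably above the small threshold it drops a reference with a bare cmpxchg loop and never touches refs_lock; only low counts fall through to the locked checks. A self-contained sketch of that fast path with C11 atomics (the threshold value is only illustrative):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define SLOW_PATH_THRESHOLD 3

static bool put_ref_fast(atomic_int *refs)
{
	int old = atomic_load(refs);

	while (old > SLOW_PATH_THRESHOLD) {
		/* try to move old -> old - 1; on failure old is reloaded */
		if (atomic_compare_exchange_weak(refs, &old, old - 1))
			return true;	/* dropped without taking any lock */
	}
	return false;			/* caller must take the lock and re-check */
}

int main(void)
{
	atomic_int refs = 5;

	while (put_ref_fast(&refs))
		;
	printf("refs left for the slow path: %d\n", atomic_load(&refs));	/* 3 */
	return 0;
}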
4675 4676
4676 void free_extent_buffer_stale(struct extent_buffer *eb) 4677 void free_extent_buffer_stale(struct extent_buffer *eb)
4677 { 4678 {
4678 if (!eb) 4679 if (!eb)
4679 return; 4680 return;
4680 4681
4681 spin_lock(&eb->refs_lock); 4682 spin_lock(&eb->refs_lock);
4682 set_bit(EXTENT_BUFFER_STALE, &eb->bflags); 4683 set_bit(EXTENT_BUFFER_STALE, &eb->bflags);
4683 4684
4684 if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) && 4685 if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) &&
4685 test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) 4686 test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
4686 atomic_dec(&eb->refs); 4687 atomic_dec(&eb->refs);
4687 release_extent_buffer(eb); 4688 release_extent_buffer(eb);
4688 } 4689 }
4689 4690
4690 void clear_extent_buffer_dirty(struct extent_buffer *eb) 4691 void clear_extent_buffer_dirty(struct extent_buffer *eb)
4691 { 4692 {
4692 unsigned long i; 4693 unsigned long i;
4693 unsigned long num_pages; 4694 unsigned long num_pages;
4694 struct page *page; 4695 struct page *page;
4695 4696
4696 num_pages = num_extent_pages(eb->start, eb->len); 4697 num_pages = num_extent_pages(eb->start, eb->len);
4697 4698
4698 for (i = 0; i < num_pages; i++) { 4699 for (i = 0; i < num_pages; i++) {
4699 page = extent_buffer_page(eb, i); 4700 page = extent_buffer_page(eb, i);
4700 if (!PageDirty(page)) 4701 if (!PageDirty(page))
4701 continue; 4702 continue;
4702 4703
4703 lock_page(page); 4704 lock_page(page);
4704 WARN_ON(!PagePrivate(page)); 4705 WARN_ON(!PagePrivate(page));
4705 4706
4706 clear_page_dirty_for_io(page); 4707 clear_page_dirty_for_io(page);
4707 spin_lock_irq(&page->mapping->tree_lock); 4708 spin_lock_irq(&page->mapping->tree_lock);
4708 if (!PageDirty(page)) { 4709 if (!PageDirty(page)) {
4709 radix_tree_tag_clear(&page->mapping->page_tree, 4710 radix_tree_tag_clear(&page->mapping->page_tree,
4710 page_index(page), 4711 page_index(page),
4711 PAGECACHE_TAG_DIRTY); 4712 PAGECACHE_TAG_DIRTY);
4712 } 4713 }
4713 spin_unlock_irq(&page->mapping->tree_lock); 4714 spin_unlock_irq(&page->mapping->tree_lock);
4714 ClearPageError(page); 4715 ClearPageError(page);
4715 unlock_page(page); 4716 unlock_page(page);
4716 } 4717 }
4717 WARN_ON(atomic_read(&eb->refs) == 0); 4718 WARN_ON(atomic_read(&eb->refs) == 0);
4718 } 4719 }
4719 4720
4720 int set_extent_buffer_dirty(struct extent_buffer *eb) 4721 int set_extent_buffer_dirty(struct extent_buffer *eb)
4721 { 4722 {
4722 unsigned long i; 4723 unsigned long i;
4723 unsigned long num_pages; 4724 unsigned long num_pages;
4724 int was_dirty = 0; 4725 int was_dirty = 0;
4725 4726
4726 check_buffer_tree_ref(eb); 4727 check_buffer_tree_ref(eb);
4727 4728
4728 was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags); 4729 was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
4729 4730
4730 num_pages = num_extent_pages(eb->start, eb->len); 4731 num_pages = num_extent_pages(eb->start, eb->len);
4731 WARN_ON(atomic_read(&eb->refs) == 0); 4732 WARN_ON(atomic_read(&eb->refs) == 0);
4732 WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)); 4733 WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
4733 4734
4734 for (i = 0; i < num_pages; i++) 4735 for (i = 0; i < num_pages; i++)
4735 set_page_dirty(extent_buffer_page(eb, i)); 4736 set_page_dirty(extent_buffer_page(eb, i));
4736 return was_dirty; 4737 return was_dirty;
4737 } 4738 }
4738 4739
4739 int clear_extent_buffer_uptodate(struct extent_buffer *eb) 4740 int clear_extent_buffer_uptodate(struct extent_buffer *eb)
4740 { 4741 {
4741 unsigned long i; 4742 unsigned long i;
4742 struct page *page; 4743 struct page *page;
4743 unsigned long num_pages; 4744 unsigned long num_pages;
4744 4745
4745 clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); 4746 clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
4746 num_pages = num_extent_pages(eb->start, eb->len); 4747 num_pages = num_extent_pages(eb->start, eb->len);
4747 for (i = 0; i < num_pages; i++) { 4748 for (i = 0; i < num_pages; i++) {
4748 page = extent_buffer_page(eb, i); 4749 page = extent_buffer_page(eb, i);
4749 if (page) 4750 if (page)
4750 ClearPageUptodate(page); 4751 ClearPageUptodate(page);
4751 } 4752 }
4752 return 0; 4753 return 0;
4753 } 4754 }
4754 4755
4755 int set_extent_buffer_uptodate(struct extent_buffer *eb) 4756 int set_extent_buffer_uptodate(struct extent_buffer *eb)
4756 { 4757 {
4757 unsigned long i; 4758 unsigned long i;
4758 struct page *page; 4759 struct page *page;
4759 unsigned long num_pages; 4760 unsigned long num_pages;
4760 4761
4761 set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); 4762 set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
4762 num_pages = num_extent_pages(eb->start, eb->len); 4763 num_pages = num_extent_pages(eb->start, eb->len);
4763 for (i = 0; i < num_pages; i++) { 4764 for (i = 0; i < num_pages; i++) {
4764 page = extent_buffer_page(eb, i); 4765 page = extent_buffer_page(eb, i);
4765 SetPageUptodate(page); 4766 SetPageUptodate(page);
4766 } 4767 }
4767 return 0; 4768 return 0;
4768 } 4769 }
4769 4770
4770 int extent_buffer_uptodate(struct extent_buffer *eb) 4771 int extent_buffer_uptodate(struct extent_buffer *eb)
4771 { 4772 {
4772 return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); 4773 return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
4773 } 4774 }
4774 4775
4775 int read_extent_buffer_pages(struct extent_io_tree *tree, 4776 int read_extent_buffer_pages(struct extent_io_tree *tree,
4776 struct extent_buffer *eb, u64 start, int wait, 4777 struct extent_buffer *eb, u64 start, int wait,
4777 get_extent_t *get_extent, int mirror_num) 4778 get_extent_t *get_extent, int mirror_num)
4778 { 4779 {
4779 unsigned long i; 4780 unsigned long i;
4780 unsigned long start_i; 4781 unsigned long start_i;
4781 struct page *page; 4782 struct page *page;
4782 int err; 4783 int err;
4783 int ret = 0; 4784 int ret = 0;
4784 int locked_pages = 0; 4785 int locked_pages = 0;
4785 int all_uptodate = 1; 4786 int all_uptodate = 1;
4786 unsigned long num_pages; 4787 unsigned long num_pages;
4787 unsigned long num_reads = 0; 4788 unsigned long num_reads = 0;
4788 struct bio *bio = NULL; 4789 struct bio *bio = NULL;
4789 unsigned long bio_flags = 0; 4790 unsigned long bio_flags = 0;
4790 4791
4791 if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) 4792 if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
4792 return 0; 4793 return 0;
4793 4794
4794 if (start) { 4795 if (start) {
4795 WARN_ON(start < eb->start); 4796 WARN_ON(start < eb->start);
4796 start_i = (start >> PAGE_CACHE_SHIFT) - 4797 start_i = (start >> PAGE_CACHE_SHIFT) -
4797 (eb->start >> PAGE_CACHE_SHIFT); 4798 (eb->start >> PAGE_CACHE_SHIFT);
4798 } else { 4799 } else {
4799 start_i = 0; 4800 start_i = 0;
4800 } 4801 }
4801 4802
4802 num_pages = num_extent_pages(eb->start, eb->len); 4803 num_pages = num_extent_pages(eb->start, eb->len);
4803 for (i = start_i; i < num_pages; i++) { 4804 for (i = start_i; i < num_pages; i++) {
4804 page = extent_buffer_page(eb, i); 4805 page = extent_buffer_page(eb, i);
4805 if (wait == WAIT_NONE) { 4806 if (wait == WAIT_NONE) {
4806 if (!trylock_page(page)) 4807 if (!trylock_page(page))
4807 goto unlock_exit; 4808 goto unlock_exit;
4808 } else { 4809 } else {
4809 lock_page(page); 4810 lock_page(page);
4810 } 4811 }
4811 locked_pages++; 4812 locked_pages++;
4812 if (!PageUptodate(page)) { 4813 if (!PageUptodate(page)) {
4813 num_reads++; 4814 num_reads++;
4814 all_uptodate = 0; 4815 all_uptodate = 0;
4815 } 4816 }
4816 } 4817 }
4817 if (all_uptodate) { 4818 if (all_uptodate) {
4818 if (start_i == 0) 4819 if (start_i == 0)
4819 set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); 4820 set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
4820 goto unlock_exit; 4821 goto unlock_exit;
4821 } 4822 }
4822 4823
4823 clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags); 4824 clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
4824 eb->read_mirror = 0; 4825 eb->read_mirror = 0;
4825 atomic_set(&eb->io_pages, num_reads); 4826 atomic_set(&eb->io_pages, num_reads);
4826 for (i = start_i; i < num_pages; i++) { 4827 for (i = start_i; i < num_pages; i++) {
4827 page = extent_buffer_page(eb, i); 4828 page = extent_buffer_page(eb, i);
4828 if (!PageUptodate(page)) { 4829 if (!PageUptodate(page)) {
4829 ClearPageError(page); 4830 ClearPageError(page);
4830 err = __extent_read_full_page(tree, page, 4831 err = __extent_read_full_page(tree, page,
4831 get_extent, &bio, 4832 get_extent, &bio,
4832 mirror_num, &bio_flags, 4833 mirror_num, &bio_flags,
4833 READ | REQ_META); 4834 READ | REQ_META);
4834 if (err) 4835 if (err)
4835 ret = err; 4836 ret = err;
4836 } else { 4837 } else {
4837 unlock_page(page); 4838 unlock_page(page);
4838 } 4839 }
4839 } 4840 }
4840 4841
4841 if (bio) { 4842 if (bio) {
4842 err = submit_one_bio(READ | REQ_META, bio, mirror_num, 4843 err = submit_one_bio(READ | REQ_META, bio, mirror_num,
4843 bio_flags); 4844 bio_flags);
4844 if (err) 4845 if (err)
4845 return err; 4846 return err;
4846 } 4847 }
4847 4848
4848 if (ret || wait != WAIT_COMPLETE) 4849 if (ret || wait != WAIT_COMPLETE)
4849 return ret; 4850 return ret;
4850 4851
4851 for (i = start_i; i < num_pages; i++) { 4852 for (i = start_i; i < num_pages; i++) {
4852 page = extent_buffer_page(eb, i); 4853 page = extent_buffer_page(eb, i);
4853 wait_on_page_locked(page); 4854 wait_on_page_locked(page);
4854 if (!PageUptodate(page)) 4855 if (!PageUptodate(page))
4855 ret = -EIO; 4856 ret = -EIO;
4856 } 4857 }
4857 4858
4858 return ret; 4859 return ret;
4859 4860
4860 unlock_exit: 4861 unlock_exit:
4861 i = start_i; 4862 i = start_i;
4862 while (locked_pages > 0) { 4863 while (locked_pages > 0) {
4863 page = extent_buffer_page(eb, i); 4864 page = extent_buffer_page(eb, i);
4864 i++; 4865 i++;
4865 unlock_page(page); 4866 unlock_page(page);
4866 locked_pages--; 4867 locked_pages--;
4867 } 4868 }
4868 return ret; 4869 return ret;
4869 } 4870 }
4870 4871
4871 void read_extent_buffer(struct extent_buffer *eb, void *dstv, 4872 void read_extent_buffer(struct extent_buffer *eb, void *dstv,
4872 unsigned long start, 4873 unsigned long start,
4873 unsigned long len) 4874 unsigned long len)
4874 { 4875 {
4875 size_t cur; 4876 size_t cur;
4876 size_t offset; 4877 size_t offset;
4877 struct page *page; 4878 struct page *page;
4878 char *kaddr; 4879 char *kaddr;
4879 char *dst = (char *)dstv; 4880 char *dst = (char *)dstv;
4880 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); 4881 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
4881 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; 4882 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
4882 4883
4883 WARN_ON(start > eb->len); 4884 WARN_ON(start > eb->len);
4884 WARN_ON(start + len > eb->start + eb->len); 4885 WARN_ON(start + len > eb->start + eb->len);
4885 4886
4886 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); 4887 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
4887 4888
4888 while (len > 0) { 4889 while (len > 0) {
4889 page = extent_buffer_page(eb, i); 4890 page = extent_buffer_page(eb, i);
4890 4891
4891 cur = min(len, (PAGE_CACHE_SIZE - offset)); 4892 cur = min(len, (PAGE_CACHE_SIZE - offset));
4892 kaddr = page_address(page); 4893 kaddr = page_address(page);
4893 memcpy(dst, kaddr + offset, cur); 4894 memcpy(dst, kaddr + offset, cur);
4894 4895
4895 dst += cur; 4896 dst += cur;
4896 len -= cur; 4897 len -= cur;
4897 offset = 0; 4898 offset = 0;
4898 i++; 4899 i++;
4899 } 4900 }
4900 } 4901 }
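read_extent_buffer() copies an arbitrary byte range out of a buffer scattered across fixed-size pages: the first iteration honours the offset within the starting page, and every later page is copied from offset 0. A small userspace sketch of the same cross-page copy loop, with a tiny PAGE_SZ so the arithmetic is easy to follow; the flat pages[] array stands in for the kernel's page machinery:

#include <stdio.h>
#include <string.h>

#define PAGE_SZ 8	/* tiny page size to keep the example readable */

static void read_paged(char pages[][PAGE_SZ], void *dstv,
		       size_t start, size_t len)
{
	char *dst = dstv;
	size_t i = start / PAGE_SZ;		/* first page index */
	size_t offset = start % PAGE_SZ;	/* offset within that page */

	while (len > 0) {
		size_t cur = len < PAGE_SZ - offset ? len : PAGE_SZ - offset;

		memcpy(dst, pages[i] + offset, cur);
		dst += cur;
		len -= cur;
		offset = 0;	/* every page after the first starts at 0 */
		i++;
	}
}

int main(void)
{
	char pages[3][PAGE_SZ] = { "01234567", "89abcdef", "ghijklmn" };
	char out[11] = {0};

	read_paged(pages, out, 5, 10);	/* spans pages 0 and 1 */
	printf("%s\n", out);		/* prints "56789abcde" */
	return 0;
}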
4901 4902
4902 int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start, 4903 int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start,
4903 unsigned long min_len, char **map, 4904 unsigned long min_len, char **map,
4904 unsigned long *map_start, 4905 unsigned long *map_start,
4905 unsigned long *map_len) 4906 unsigned long *map_len)
4906 { 4907 {
4907 size_t offset = start & (PAGE_CACHE_SIZE - 1); 4908 size_t offset = start & (PAGE_CACHE_SIZE - 1);
4908 char *kaddr; 4909 char *kaddr;
4909 struct page *p; 4910 struct page *p;
4910 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); 4911 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
4911 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; 4912 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
4912 unsigned long end_i = (start_offset + start + min_len - 1) >> 4913 unsigned long end_i = (start_offset + start + min_len - 1) >>
4913 PAGE_CACHE_SHIFT; 4914 PAGE_CACHE_SHIFT;
4914 4915
4915 if (i != end_i) 4916 if (i != end_i)
4916 return -EINVAL; 4917 return -EINVAL;
4917 4918
4918 if (i == 0) { 4919 if (i == 0) {
4919 offset = start_offset; 4920 offset = start_offset;
4920 *map_start = 0; 4921 *map_start = 0;
4921 } else { 4922 } else {
4922 offset = 0; 4923 offset = 0;
4923 *map_start = ((u64)i << PAGE_CACHE_SHIFT) - start_offset; 4924 *map_start = ((u64)i << PAGE_CACHE_SHIFT) - start_offset;
4924 } 4925 }
4925 4926
4926 if (start + min_len > eb->len) { 4927 if (start + min_len > eb->len) {
4927 WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, " 4928 WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, "
4928 "wanted %lu %lu\n", 4929 "wanted %lu %lu\n",
4929 eb->start, eb->len, start, min_len); 4930 eb->start, eb->len, start, min_len);
4930 return -EINVAL; 4931 return -EINVAL;
4931 } 4932 }
4932 4933
4933 p = extent_buffer_page(eb, i); 4934 p = extent_buffer_page(eb, i);
4934 kaddr = page_address(p); 4935 kaddr = page_address(p);
4935 *map = kaddr + offset; 4936 *map = kaddr + offset;
4936 *map_len = PAGE_CACHE_SIZE - offset; 4937 *map_len = PAGE_CACHE_SIZE - offset;
4937 return 0; 4938 return 0;
4938 } 4939 }
4939 4940
4940 int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv, 4941 int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv,
4941 unsigned long start, 4942 unsigned long start,
4942 unsigned long len) 4943 unsigned long len)
4943 { 4944 {
4944 size_t cur; 4945 size_t cur;
4945 size_t offset; 4946 size_t offset;
4946 struct page *page; 4947 struct page *page;
4947 char *kaddr; 4948 char *kaddr;
4948 char *ptr = (char *)ptrv; 4949 char *ptr = (char *)ptrv;
4949 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); 4950 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
4950 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; 4951 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
4951 int ret = 0; 4952 int ret = 0;
4952 4953
4953 WARN_ON(start > eb->len); 4954 WARN_ON(start > eb->len);
4954 WARN_ON(start + len > eb->start + eb->len); 4955 WARN_ON(start + len > eb->start + eb->len);
4955 4956
4956 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); 4957 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
4957 4958
4958 while (len > 0) { 4959 while (len > 0) {
4959 page = extent_buffer_page(eb, i); 4960 page = extent_buffer_page(eb, i);
4960 4961
4961 cur = min(len, (PAGE_CACHE_SIZE - offset)); 4962 cur = min(len, (PAGE_CACHE_SIZE - offset));
4962 4963
4963 kaddr = page_address(page); 4964 kaddr = page_address(page);
4964 ret = memcmp(ptr, kaddr + offset, cur); 4965 ret = memcmp(ptr, kaddr + offset, cur);
4965 if (ret) 4966 if (ret)
4966 break; 4967 break;
4967 4968
4968 ptr += cur; 4969 ptr += cur;
4969 len -= cur; 4970 len -= cur;
4970 offset = 0; 4971 offset = 0;
4971 i++; 4972 i++;
4972 } 4973 }
4973 return ret; 4974 return ret;
4974 } 4975 }
4975 4976
4976 void write_extent_buffer(struct extent_buffer *eb, const void *srcv, 4977 void write_extent_buffer(struct extent_buffer *eb, const void *srcv,
4977 unsigned long start, unsigned long len) 4978 unsigned long start, unsigned long len)
4978 { 4979 {
4979 size_t cur; 4980 size_t cur;
4980 size_t offset; 4981 size_t offset;
4981 struct page *page; 4982 struct page *page;
4982 char *kaddr; 4983 char *kaddr;
4983 char *src = (char *)srcv; 4984 char *src = (char *)srcv;
4984 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); 4985 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
4985 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; 4986 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
4986 4987
4987 WARN_ON(start > eb->len); 4988 WARN_ON(start > eb->len);
4988 WARN_ON(start + len > eb->start + eb->len); 4989 WARN_ON(start + len > eb->start + eb->len);
4989 4990
4990 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); 4991 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
4991 4992
4992 while (len > 0) { 4993 while (len > 0) {
4993 page = extent_buffer_page(eb, i); 4994 page = extent_buffer_page(eb, i);
4994 WARN_ON(!PageUptodate(page)); 4995 WARN_ON(!PageUptodate(page));
4995 4996
4996 cur = min(len, PAGE_CACHE_SIZE - offset); 4997 cur = min(len, PAGE_CACHE_SIZE - offset);
4997 kaddr = page_address(page); 4998 kaddr = page_address(page);
4998 memcpy(kaddr + offset, src, cur); 4999 memcpy(kaddr + offset, src, cur);
4999 5000
5000 src += cur; 5001 src += cur;
5001 len -= cur; 5002 len -= cur;
5002 offset = 0; 5003 offset = 0;
5003 i++; 5004 i++;
5004 } 5005 }
5005 } 5006 }
5006 5007
5007 void memset_extent_buffer(struct extent_buffer *eb, char c, 5008 void memset_extent_buffer(struct extent_buffer *eb, char c,
5008 unsigned long start, unsigned long len) 5009 unsigned long start, unsigned long len)
5009 { 5010 {
5010 size_t cur; 5011 size_t cur;
5011 size_t offset; 5012 size_t offset;
5012 struct page *page; 5013 struct page *page;
5013 char *kaddr; 5014 char *kaddr;
5014 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); 5015 size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
5015 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; 5016 unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
5016 5017
5017 WARN_ON(start > eb->len); 5018 WARN_ON(start > eb->len);
5018 WARN_ON(start + len > eb->start + eb->len); 5019 WARN_ON(start + len > eb->start + eb->len);
5019 5020
5020 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1); 5021 offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
5021 5022
5022 while (len > 0) { 5023 while (len > 0) {
5023 page = extent_buffer_page(eb, i); 5024 page = extent_buffer_page(eb, i);
5024 WARN_ON(!PageUptodate(page)); 5025 WARN_ON(!PageUptodate(page));
5025 5026
5026 cur = min(len, PAGE_CACHE_SIZE - offset); 5027 cur = min(len, PAGE_CACHE_SIZE - offset);
5027 kaddr = page_address(page); 5028 kaddr = page_address(page);
5028 memset(kaddr + offset, c, cur); 5029 memset(kaddr + offset, c, cur);
5029 5030
5030 len -= cur; 5031 len -= cur;
5031 offset = 0; 5032 offset = 0;
5032 i++; 5033 i++;
5033 } 5034 }
5034 } 5035 }
5035 5036
5036 void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src, 5037 void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
5037 unsigned long dst_offset, unsigned long src_offset, 5038 unsigned long dst_offset, unsigned long src_offset,
5038 unsigned long len) 5039 unsigned long len)
5039 { 5040 {
5040 u64 dst_len = dst->len; 5041 u64 dst_len = dst->len;
5041 size_t cur; 5042 size_t cur;
5042 size_t offset; 5043 size_t offset;
5043 struct page *page; 5044 struct page *page;
5044 char *kaddr; 5045 char *kaddr;
5045 size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1); 5046 size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1);
5046 unsigned long i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT; 5047 unsigned long i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT;
5047 5048
5048 WARN_ON(src->len != dst_len); 5049 WARN_ON(src->len != dst_len);
5049 5050
5050 offset = (start_offset + dst_offset) & 5051 offset = (start_offset + dst_offset) &
5051 (PAGE_CACHE_SIZE - 1); 5052 (PAGE_CACHE_SIZE - 1);
5052 5053
5053 while (len > 0) { 5054 while (len > 0) {
5054 page = extent_buffer_page(dst, i); 5055 page = extent_buffer_page(dst, i);
5055 WARN_ON(!PageUptodate(page)); 5056 WARN_ON(!PageUptodate(page));
5056 5057
5057 cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - offset)); 5058 cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - offset));
5058 5059
5059 kaddr = page_address(page); 5060 kaddr = page_address(page);
5060 read_extent_buffer(src, kaddr + offset, src_offset, cur); 5061 read_extent_buffer(src, kaddr + offset, src_offset, cur);
5061 5062
5062 src_offset += cur; 5063 src_offset += cur;
5063 len -= cur; 5064 len -= cur;
5064 offset = 0; 5065 offset = 0;
5065 i++; 5066 i++;
5066 } 5067 }
5067 } 5068 }
5068 5069
5069 static void move_pages(struct page *dst_page, struct page *src_page, 5070 static void move_pages(struct page *dst_page, struct page *src_page,
5070 unsigned long dst_off, unsigned long src_off, 5071 unsigned long dst_off, unsigned long src_off,
5071 unsigned long len) 5072 unsigned long len)
5072 { 5073 {
5073 char *dst_kaddr = page_address(dst_page); 5074 char *dst_kaddr = page_address(dst_page);
5074 if (dst_page == src_page) { 5075 if (dst_page == src_page) {
5075 memmove(dst_kaddr + dst_off, dst_kaddr + src_off, len); 5076 memmove(dst_kaddr + dst_off, dst_kaddr + src_off, len);
5076 } else { 5077 } else {
5077 char *src_kaddr = page_address(src_page); 5078 char *src_kaddr = page_address(src_page);
5078 char *p = dst_kaddr + dst_off + len; 5079 char *p = dst_kaddr + dst_off + len;
5079 char *s = src_kaddr + src_off + len; 5080 char *s = src_kaddr + src_off + len;
5080 5081
5081 while (len--) 5082 while (len--)
5082 *--p = *--s; 5083 *--p = *--s;
5083 } 5084 }
5084 } 5085 }
5085 5086
5086 static inline bool areas_overlap(unsigned long src, unsigned long dst, unsigned long len) 5087 static inline bool areas_overlap(unsigned long src, unsigned long dst, unsigned long len)
5087 { 5088 {
5088 unsigned long distance = (src > dst) ? src - dst : dst - src; 5089 unsigned long distance = (src > dst) ? src - dst : dst - src;
5089 return distance < len; 5090 return distance < len;
5090 } 5091 }
5091 5092
5092 static void copy_pages(struct page *dst_page, struct page *src_page, 5093 static void copy_pages(struct page *dst_page, struct page *src_page,
5093 unsigned long dst_off, unsigned long src_off, 5094 unsigned long dst_off, unsigned long src_off,
5094 unsigned long len) 5095 unsigned long len)
5095 { 5096 {
5096 char *dst_kaddr = page_address(dst_page); 5097 char *dst_kaddr = page_address(dst_page);
5097 char *src_kaddr; 5098 char *src_kaddr;
5098 int must_memmove = 0; 5099 int must_memmove = 0;
5099 5100
5100 if (dst_page != src_page) { 5101 if (dst_page != src_page) {
5101 src_kaddr = page_address(src_page); 5102 src_kaddr = page_address(src_page);
5102 } else { 5103 } else {
5103 src_kaddr = dst_kaddr; 5104 src_kaddr = dst_kaddr;
5104 if (areas_overlap(src_off, dst_off, len)) 5105 if (areas_overlap(src_off, dst_off, len))
5105 must_memmove = 1; 5106 must_memmove = 1;
5106 } 5107 }
5107 5108
5108 if (must_memmove) 5109 if (must_memmove)
5109 memmove(dst_kaddr + dst_off, src_kaddr + src_off, len); 5110 memmove(dst_kaddr + dst_off, src_kaddr + src_off, len);
5110 else 5111 else
5111 memcpy(dst_kaddr + dst_off, src_kaddr + src_off, len); 5112 memcpy(dst_kaddr + dst_off, src_kaddr + src_off, len);
5112 } 5113 }
5113 5114
5114 void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset, 5115 void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
5115 unsigned long src_offset, unsigned long len) 5116 unsigned long src_offset, unsigned long len)
5116 { 5117 {
5117 size_t cur; 5118 size_t cur;
5118 size_t dst_off_in_page; 5119 size_t dst_off_in_page;
5119 size_t src_off_in_page; 5120 size_t src_off_in_page;
5120 size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1); 5121 size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1);
5121 unsigned long dst_i; 5122 unsigned long dst_i;
5122 unsigned long src_i; 5123 unsigned long src_i;
5123 5124
5124 if (src_offset + len > dst->len) { 5125 if (src_offset + len > dst->len) {
5125 printk(KERN_ERR "btrfs memmove bogus src_offset %lu move " 5126 printk(KERN_ERR "btrfs memmove bogus src_offset %lu move "
5126 "len %lu dst len %lu\n", src_offset, len, dst->len); 5127 "len %lu dst len %lu\n", src_offset, len, dst->len);
5127 BUG_ON(1); 5128 BUG_ON(1);
5128 } 5129 }
5129 if (dst_offset + len > dst->len) { 5130 if (dst_offset + len > dst->len) {
5130 printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move " 5131 printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move "
5131 "len %lu dst len %lu\n", dst_offset, len, dst->len); 5132 "len %lu dst len %lu\n", dst_offset, len, dst->len);
5132 BUG_ON(1); 5133 BUG_ON(1);
5133 } 5134 }
5134 5135
5135 while (len > 0) { 5136 while (len > 0) {
5136 dst_off_in_page = (start_offset + dst_offset) & 5137 dst_off_in_page = (start_offset + dst_offset) &
5137 (PAGE_CACHE_SIZE - 1); 5138 (PAGE_CACHE_SIZE - 1);
5138 src_off_in_page = (start_offset + src_offset) & 5139 src_off_in_page = (start_offset + src_offset) &
5139 (PAGE_CACHE_SIZE - 1); 5140 (PAGE_CACHE_SIZE - 1);
5140 5141
5141 dst_i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT; 5142 dst_i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT;
5142 src_i = (start_offset + src_offset) >> PAGE_CACHE_SHIFT; 5143 src_i = (start_offset + src_offset) >> PAGE_CACHE_SHIFT;
5143 5144
5144 cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - 5145 cur = min(len, (unsigned long)(PAGE_CACHE_SIZE -
5145 src_off_in_page)); 5146 src_off_in_page));
5146 cur = min_t(unsigned long, cur, 5147 cur = min_t(unsigned long, cur,
5147 (unsigned long)(PAGE_CACHE_SIZE - dst_off_in_page)); 5148 (unsigned long)(PAGE_CACHE_SIZE - dst_off_in_page));
5148 5149
5149 copy_pages(extent_buffer_page(dst, dst_i), 5150 copy_pages(extent_buffer_page(dst, dst_i),
5150 extent_buffer_page(dst, src_i), 5151 extent_buffer_page(dst, src_i),
5151 dst_off_in_page, src_off_in_page, cur); 5152 dst_off_in_page, src_off_in_page, cur);
5152 5153
5153 src_offset += cur; 5154 src_offset += cur;
5154 dst_offset += cur; 5155 dst_offset += cur;
5155 len -= cur; 5156 len -= cur;
5156 } 5157 }
5157 } 5158 }
5158 5159
5159 void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset, 5160 void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
5160 unsigned long src_offset, unsigned long len) 5161 unsigned long src_offset, unsigned long len)
5161 { 5162 {
5162 size_t cur; 5163 size_t cur;
5163 size_t dst_off_in_page; 5164 size_t dst_off_in_page;
5164 size_t src_off_in_page; 5165 size_t src_off_in_page;
5165 unsigned long dst_end = dst_offset + len - 1; 5166 unsigned long dst_end = dst_offset + len - 1;
5166 unsigned long src_end = src_offset + len - 1; 5167 unsigned long src_end = src_offset + len - 1;
5167 size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1); 5168 size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1);
5168 unsigned long dst_i; 5169 unsigned long dst_i;
5169 unsigned long src_i; 5170 unsigned long src_i;
5170 5171
5171 if (src_offset + len > dst->len) { 5172 if (src_offset + len > dst->len) {
5172 printk(KERN_ERR "btrfs memmove bogus src_offset %lu move " 5173 printk(KERN_ERR "btrfs memmove bogus src_offset %lu move "
5173 "len %lu len %lu\n", src_offset, len, dst->len); 5174 "len %lu len %lu\n", src_offset, len, dst->len);
5174 BUG_ON(1); 5175 BUG_ON(1);
5175 } 5176 }
5176 if (dst_offset + len > dst->len) { 5177 if (dst_offset + len > dst->len) {
5177 printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move " 5178 printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move "
5178 "len %lu len %lu\n", dst_offset, len, dst->len); 5179 "len %lu len %lu\n", dst_offset, len, dst->len);
5179 BUG_ON(1); 5180 BUG_ON(1);
5180 } 5181 }
5181 if (dst_offset < src_offset) { 5182 if (dst_offset < src_offset) {
5182 memcpy_extent_buffer(dst, dst_offset, src_offset, len); 5183 memcpy_extent_buffer(dst, dst_offset, src_offset, len);
5183 return; 5184 return;
5184 } 5185 }
5185 while (len > 0) { 5186 while (len > 0) {
5186 dst_i = (start_offset + dst_end) >> PAGE_CACHE_SHIFT; 5187 dst_i = (start_offset + dst_end) >> PAGE_CACHE_SHIFT;
5187 src_i = (start_offset + src_end) >> PAGE_CACHE_SHIFT; 5188 src_i = (start_offset + src_end) >> PAGE_CACHE_SHIFT;
5188 5189
5189 dst_off_in_page = (start_offset + dst_end) & 5190 dst_off_in_page = (start_offset + dst_end) &
5190 (PAGE_CACHE_SIZE - 1); 5191 (PAGE_CACHE_SIZE - 1);
5191 src_off_in_page = (start_offset + src_end) & 5192 src_off_in_page = (start_offset + src_end) &
5192 (PAGE_CACHE_SIZE - 1); 5193 (PAGE_CACHE_SIZE - 1);
5193 5194
5194 cur = min_t(unsigned long, len, src_off_in_page + 1); 5195 cur = min_t(unsigned long, len, src_off_in_page + 1);
5195 cur = min(cur, dst_off_in_page + 1); 5196 cur = min(cur, dst_off_in_page + 1);
5196 move_pages(extent_buffer_page(dst, dst_i), 5197 move_pages(extent_buffer_page(dst, dst_i),
5197 extent_buffer_page(dst, src_i), 5198 extent_buffer_page(dst, src_i),
5198 dst_off_in_page - cur + 1, 5199 dst_off_in_page - cur + 1,
5199 src_off_in_page - cur + 1, cur); 5200 src_off_in_page - cur + 1, cur);
5200 5201
5201 dst_end -= cur; 5202 dst_end -= cur;
5202 src_end -= cur; 5203 src_end -= cur;
5203 len -= cur; 5204 len -= cur;
5204 } 5205 }
5205 } 5206 }
5206 5207
5207 int try_release_extent_buffer(struct page *page) 5208 int try_release_extent_buffer(struct page *page)
5208 { 5209 {
5209 struct extent_buffer *eb; 5210 struct extent_buffer *eb;
5210 5211
5211 /* 5212 /*
5212 * We need to make sure nobody is attaching this page to an eb right 5213 * We need to make sure nobody is attaching this page to an eb right
5213 * now. 5214 * now.
5214 */ 5215 */
5215 spin_lock(&page->mapping->private_lock); 5216 spin_lock(&page->mapping->private_lock);
5216 if (!PagePrivate(page)) { 5217 if (!PagePrivate(page)) {
5217 spin_unlock(&page->mapping->private_lock); 5218 spin_unlock(&page->mapping->private_lock);
5218 return 1; 5219 return 1;
5219 } 5220 }
5220 5221
5221 eb = (struct extent_buffer *)page->private; 5222 eb = (struct extent_buffer *)page->private;
5222 BUG_ON(!eb); 5223 BUG_ON(!eb);
5223 5224
5224 /* 5225 /*
5225 * This is a little awful but should be ok, we need to make sure that 5226 * This is a little awful but should be ok, we need to make sure that
5226 * the eb doesn't disappear out from under us while we're looking at 5227 * the eb doesn't disappear out from under us while we're looking at
5227 * this page. 5228 * this page.
5228 */ 5229 */
5229 spin_lock(&eb->refs_lock); 5230 spin_lock(&eb->refs_lock);
5230 if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) { 5231 if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
5231 spin_unlock(&eb->refs_lock); 5232 spin_unlock(&eb->refs_lock);
5232 spin_unlock(&page->mapping->private_lock); 5233 spin_unlock(&page->mapping->private_lock);
5233 return 0; 5234 return 0;
5234 } 5235 }
5235 spin_unlock(&page->mapping->private_lock); 5236 spin_unlock(&page->mapping->private_lock);
5236 5237
5237 /* 5238 /*
5238 * If tree ref isn't set then we know the ref on this eb is a real ref, 5239 * If tree ref isn't set then we know the ref on this eb is a real ref,
5239 * so just return, this page will likely be freed soon anyway. 5240 * so just return, this page will likely be freed soon anyway.
5240 */ 5241 */
5241 if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) { 5242 if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
5242 spin_unlock(&eb->refs_lock); 5243 spin_unlock(&eb->refs_lock);
5243 return 0; 5244 return 0;
5244 } 5245 }
5245 5246
5246 return release_extent_buffer(eb); 5247 return release_extent_buffer(eb);
5247 } 5248 }
1 /* 1 /*
2 * Copyright (C) 2007 Oracle. All rights reserved. 2 * Copyright (C) 2007 Oracle. All rights reserved.
3 * 3 *
4 * This program is free software; you can redistribute it and/or 4 * This program is free software; you can redistribute it and/or
5 * modify it under the terms of the GNU General Public 5 * modify it under the terms of the GNU General Public
6 * License v2 as published by the Free Software Foundation. 6 * License v2 as published by the Free Software Foundation.
7 * 7 *
8 * This program is distributed in the hope that it will be useful, 8 * This program is distributed in the hope that it will be useful,
9 * but WITHOUT ANY WARRANTY; without even the implied warranty of 9 * but WITHOUT ANY WARRANTY; without even the implied warranty of
10 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 10 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
11 * General Public License for more details. 11 * General Public License for more details.
12 * 12 *
13 * You should have received a copy of the GNU General Public 13 * You should have received a copy of the GNU General Public
14 * License along with this program; if not, write to the 14 * License along with this program; if not, write to the
15 * Free Software Foundation, Inc., 59 Temple Place - Suite 330, 15 * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
16 * Boston, MA 021110-1307, USA. 16 * Boston, MA 021110-1307, USA.
17 */ 17 */
18 18
19 #include <linux/fs.h> 19 #include <linux/fs.h>
20 #include <linux/pagemap.h> 20 #include <linux/pagemap.h>
21 #include <linux/highmem.h> 21 #include <linux/highmem.h>
22 #include <linux/time.h> 22 #include <linux/time.h>
23 #include <linux/init.h> 23 #include <linux/init.h>
24 #include <linux/string.h> 24 #include <linux/string.h>
25 #include <linux/backing-dev.h> 25 #include <linux/backing-dev.h>
26 #include <linux/mpage.h> 26 #include <linux/mpage.h>
27 #include <linux/aio.h> 27 #include <linux/aio.h>
28 #include <linux/falloc.h> 28 #include <linux/falloc.h>
29 #include <linux/swap.h> 29 #include <linux/swap.h>
30 #include <linux/writeback.h> 30 #include <linux/writeback.h>
31 #include <linux/statfs.h> 31 #include <linux/statfs.h>
32 #include <linux/compat.h> 32 #include <linux/compat.h>
33 #include <linux/slab.h> 33 #include <linux/slab.h>
34 #include <linux/btrfs.h> 34 #include <linux/btrfs.h>
35 #include "ctree.h" 35 #include "ctree.h"
36 #include "disk-io.h" 36 #include "disk-io.h"
37 #include "transaction.h" 37 #include "transaction.h"
38 #include "btrfs_inode.h" 38 #include "btrfs_inode.h"
39 #include "print-tree.h" 39 #include "print-tree.h"
40 #include "tree-log.h" 40 #include "tree-log.h"
41 #include "locking.h" 41 #include "locking.h"
42 #include "compat.h" 42 #include "compat.h"
43 #include "volumes.h" 43 #include "volumes.h"
44 44
45 static struct kmem_cache *btrfs_inode_defrag_cachep; 45 static struct kmem_cache *btrfs_inode_defrag_cachep;
46 /* 46 /*
47 * when auto defrag is enabled we 47 * when auto defrag is enabled we
48 * queue up these defrag structs to remember which 48 * queue up these defrag structs to remember which
49 * inodes need defragging passes 49 * inodes need defragging passes
50 */ 50 */
51 struct inode_defrag { 51 struct inode_defrag {
52 struct rb_node rb_node; 52 struct rb_node rb_node;
53 /* objectid */ 53 /* objectid */
54 u64 ino; 54 u64 ino;
55 /* 55 /*
56 * transid where the defrag was added, we search for 56 * transid where the defrag was added, we search for
57 * extents newer than this 57 * extents newer than this
58 */ 58 */
59 u64 transid; 59 u64 transid;
60 60
61 /* root objectid */ 61 /* root objectid */
62 u64 root; 62 u64 root;
63 63
64 /* last offset we were able to defrag */ 64 /* last offset we were able to defrag */
65 u64 last_offset; 65 u64 last_offset;
66 66
67 /* if we've wrapped around back to zero once already */ 67 /* if we've wrapped around back to zero once already */
68 int cycled; 68 int cycled;
69 }; 69 };
70 70
71 static int __compare_inode_defrag(struct inode_defrag *defrag1, 71 static int __compare_inode_defrag(struct inode_defrag *defrag1,
72 struct inode_defrag *defrag2) 72 struct inode_defrag *defrag2)
73 { 73 {
74 if (defrag1->root > defrag2->root) 74 if (defrag1->root > defrag2->root)
75 return 1; 75 return 1;
76 else if (defrag1->root < defrag2->root) 76 else if (defrag1->root < defrag2->root)
77 return -1; 77 return -1;
78 else if (defrag1->ino > defrag2->ino) 78 else if (defrag1->ino > defrag2->ino)
79 return 1; 79 return 1;
80 else if (defrag1->ino < defrag2->ino) 80 else if (defrag1->ino < defrag2->ino)
81 return -1; 81 return -1;
82 else 82 else
83 return 0; 83 return 0;
84 } 84 }
85 85
86 /* pop a record for an inode into the defrag tree. The lock 86 /* pop a record for an inode into the defrag tree. The lock
87 * must be held already 87 * must be held already
88 * 88 *
89 * If you're inserting a record for an older transid than an 89 * If you're inserting a record for an older transid than an
90 * existing record, the transid already in the tree is lowered 90 * existing record, the transid already in the tree is lowered
91 * 91 *
92 * If an existing record is found the defrag item you 92 * If an existing record is found the defrag item you
93 * pass in is freed 93 * pass in is freed
94 */ 94 */
95 static int __btrfs_add_inode_defrag(struct inode *inode, 95 static int __btrfs_add_inode_defrag(struct inode *inode,
96 struct inode_defrag *defrag) 96 struct inode_defrag *defrag)
97 { 97 {
98 struct btrfs_root *root = BTRFS_I(inode)->root; 98 struct btrfs_root *root = BTRFS_I(inode)->root;
99 struct inode_defrag *entry; 99 struct inode_defrag *entry;
100 struct rb_node **p; 100 struct rb_node **p;
101 struct rb_node *parent = NULL; 101 struct rb_node *parent = NULL;
102 int ret; 102 int ret;
103 103
104 p = &root->fs_info->defrag_inodes.rb_node; 104 p = &root->fs_info->defrag_inodes.rb_node;
105 while (*p) { 105 while (*p) {
106 parent = *p; 106 parent = *p;
107 entry = rb_entry(parent, struct inode_defrag, rb_node); 107 entry = rb_entry(parent, struct inode_defrag, rb_node);
108 108
109 ret = __compare_inode_defrag(defrag, entry); 109 ret = __compare_inode_defrag(defrag, entry);
110 if (ret < 0) 110 if (ret < 0)
111 p = &parent->rb_left; 111 p = &parent->rb_left;
112 else if (ret > 0) 112 else if (ret > 0)
113 p = &parent->rb_right; 113 p = &parent->rb_right;
114 else { 114 else {
115 /* if we're reinserting an entry for 115 /* if we're reinserting an entry for
116 * an old defrag run, make sure to 116 * an old defrag run, make sure to
117 * lower the transid of our existing record 117 * lower the transid of our existing record
118 */ 118 */
119 if (defrag->transid < entry->transid) 119 if (defrag->transid < entry->transid)
120 entry->transid = defrag->transid; 120 entry->transid = defrag->transid;
121 if (defrag->last_offset > entry->last_offset) 121 if (defrag->last_offset > entry->last_offset)
122 entry->last_offset = defrag->last_offset; 122 entry->last_offset = defrag->last_offset;
123 return -EEXIST; 123 return -EEXIST;
124 } 124 }
125 } 125 }
126 set_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags); 126 set_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
127 rb_link_node(&defrag->rb_node, parent, p); 127 rb_link_node(&defrag->rb_node, parent, p);
128 rb_insert_color(&defrag->rb_node, &root->fs_info->defrag_inodes); 128 rb_insert_color(&defrag->rb_node, &root->fs_info->defrag_inodes);
129 return 0; 129 return 0;
130 } 130 }
131 131
132 static inline int __need_auto_defrag(struct btrfs_root *root) 132 static inline int __need_auto_defrag(struct btrfs_root *root)
133 { 133 {
134 if (!btrfs_test_opt(root, AUTO_DEFRAG)) 134 if (!btrfs_test_opt(root, AUTO_DEFRAG))
135 return 0; 135 return 0;
136 136
137 if (btrfs_fs_closing(root->fs_info)) 137 if (btrfs_fs_closing(root->fs_info))
138 return 0; 138 return 0;
139 139
140 return 1; 140 return 1;
141 } 141 }
142 142
143 /* 143 /*
144 * insert a defrag record for this inode if auto defrag is 144 * insert a defrag record for this inode if auto defrag is
145 * enabled 145 * enabled
146 */ 146 */
147 int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans, 147 int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
148 struct inode *inode) 148 struct inode *inode)
149 { 149 {
150 struct btrfs_root *root = BTRFS_I(inode)->root; 150 struct btrfs_root *root = BTRFS_I(inode)->root;
151 struct inode_defrag *defrag; 151 struct inode_defrag *defrag;
152 u64 transid; 152 u64 transid;
153 int ret; 153 int ret;
154 154
155 if (!__need_auto_defrag(root)) 155 if (!__need_auto_defrag(root))
156 return 0; 156 return 0;
157 157
158 if (test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags)) 158 if (test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags))
159 return 0; 159 return 0;
160 160
161 if (trans) 161 if (trans)
162 transid = trans->transid; 162 transid = trans->transid;
163 else 163 else
164 transid = BTRFS_I(inode)->root->last_trans; 164 transid = BTRFS_I(inode)->root->last_trans;
165 165
166 defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS); 166 defrag = kmem_cache_zalloc(btrfs_inode_defrag_cachep, GFP_NOFS);
167 if (!defrag) 167 if (!defrag)
168 return -ENOMEM; 168 return -ENOMEM;
169 169
170 defrag->ino = btrfs_ino(inode); 170 defrag->ino = btrfs_ino(inode);
171 defrag->transid = transid; 171 defrag->transid = transid;
172 defrag->root = root->root_key.objectid; 172 defrag->root = root->root_key.objectid;
173 173
174 spin_lock(&root->fs_info->defrag_inodes_lock); 174 spin_lock(&root->fs_info->defrag_inodes_lock);
175 if (!test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags)) { 175 if (!test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags)) {
176 /* 176 /*
177 * If we set IN_DEFRAG flag and evict the inode from memory, 177 * If we set IN_DEFRAG flag and evict the inode from memory,
178 * and then re-read this inode, this new inode doesn't have 178 * and then re-read this inode, this new inode doesn't have
179 * IN_DEFRAG flag. In that case, we may find an existing defrag record. 179 * IN_DEFRAG flag. In that case, we may find an existing defrag record.
180 */ 180 */
181 ret = __btrfs_add_inode_defrag(inode, defrag); 181 ret = __btrfs_add_inode_defrag(inode, defrag);
182 if (ret) 182 if (ret)
183 kmem_cache_free(btrfs_inode_defrag_cachep, defrag); 183 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
184 } else { 184 } else {
185 kmem_cache_free(btrfs_inode_defrag_cachep, defrag); 185 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
186 } 186 }
187 spin_unlock(&root->fs_info->defrag_inodes_lock); 187 spin_unlock(&root->fs_info->defrag_inodes_lock);
188 return 0; 188 return 0;
189 } 189 }
190 190
191 /* 191 /*
192 * Requeue the defrag object. If there is a defrag object that points to 192 * Requeue the defrag object. If there is a defrag object that points to
193 * the same inode in the tree, we will merge them together (by 193 * the same inode in the tree, we will merge them together (by
194 * __btrfs_add_inode_defrag()) and free the one that we want to requeue. 194 * __btrfs_add_inode_defrag()) and free the one that we want to requeue.
195 */ 195 */
196 static void btrfs_requeue_inode_defrag(struct inode *inode, 196 static void btrfs_requeue_inode_defrag(struct inode *inode,
197 struct inode_defrag *defrag) 197 struct inode_defrag *defrag)
198 { 198 {
199 struct btrfs_root *root = BTRFS_I(inode)->root; 199 struct btrfs_root *root = BTRFS_I(inode)->root;
200 int ret; 200 int ret;
201 201
202 if (!__need_auto_defrag(root)) 202 if (!__need_auto_defrag(root))
203 goto out; 203 goto out;
204 204
205 /* 205 /*
206 * Here we don't check the IN_DEFRAG flag, because we need merge 206 * Here we don't check the IN_DEFRAG flag, because we need merge
207 * them together. 207 * them together.
208 */ 208 */
209 spin_lock(&root->fs_info->defrag_inodes_lock); 209 spin_lock(&root->fs_info->defrag_inodes_lock);
210 ret = __btrfs_add_inode_defrag(inode, defrag); 210 ret = __btrfs_add_inode_defrag(inode, defrag);
211 spin_unlock(&root->fs_info->defrag_inodes_lock); 211 spin_unlock(&root->fs_info->defrag_inodes_lock);
212 if (ret) 212 if (ret)
213 goto out; 213 goto out;
214 return; 214 return;
215 out: 215 out:
216 kmem_cache_free(btrfs_inode_defrag_cachep, defrag); 216 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
217 } 217 }
218 218
219 /* 219 /*
220 * pick the defraggable inode that we want; if it doesn't exist, we will get 220 * pick the defraggable inode that we want; if it doesn't exist, we will get
221 * the next one. 221 * the next one.
222 */ 222 */
223 static struct inode_defrag * 223 static struct inode_defrag *
224 btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino) 224 btrfs_pick_defrag_inode(struct btrfs_fs_info *fs_info, u64 root, u64 ino)
225 { 225 {
226 struct inode_defrag *entry = NULL; 226 struct inode_defrag *entry = NULL;
227 struct inode_defrag tmp; 227 struct inode_defrag tmp;
228 struct rb_node *p; 228 struct rb_node *p;
229 struct rb_node *parent = NULL; 229 struct rb_node *parent = NULL;
230 int ret; 230 int ret;
231 231
232 tmp.ino = ino; 232 tmp.ino = ino;
233 tmp.root = root; 233 tmp.root = root;
234 234
235 spin_lock(&fs_info->defrag_inodes_lock); 235 spin_lock(&fs_info->defrag_inodes_lock);
236 p = fs_info->defrag_inodes.rb_node; 236 p = fs_info->defrag_inodes.rb_node;
237 while (p) { 237 while (p) {
238 parent = p; 238 parent = p;
239 entry = rb_entry(parent, struct inode_defrag, rb_node); 239 entry = rb_entry(parent, struct inode_defrag, rb_node);
240 240
241 ret = __compare_inode_defrag(&tmp, entry); 241 ret = __compare_inode_defrag(&tmp, entry);
242 if (ret < 0) 242 if (ret < 0)
243 p = parent->rb_left; 243 p = parent->rb_left;
244 else if (ret > 0) 244 else if (ret > 0)
245 p = parent->rb_right; 245 p = parent->rb_right;
246 else 246 else
247 goto out; 247 goto out;
248 } 248 }
249 249
250 if (parent && __compare_inode_defrag(&tmp, entry) > 0) { 250 if (parent && __compare_inode_defrag(&tmp, entry) > 0) {
251 parent = rb_next(parent); 251 parent = rb_next(parent);
252 if (parent) 252 if (parent)
253 entry = rb_entry(parent, struct inode_defrag, rb_node); 253 entry = rb_entry(parent, struct inode_defrag, rb_node);
254 else 254 else
255 entry = NULL; 255 entry = NULL;
256 } 256 }
257 out: 257 out:
258 if (entry) 258 if (entry)
259 rb_erase(parent, &fs_info->defrag_inodes); 259 rb_erase(parent, &fs_info->defrag_inodes);
260 spin_unlock(&fs_info->defrag_inodes_lock); 260 spin_unlock(&fs_info->defrag_inodes_lock);
261 return entry; 261 return entry;
262 } 262 }
263 263
264 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info) 264 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
265 { 265 {
266 struct inode_defrag *defrag; 266 struct inode_defrag *defrag;
267 struct rb_node *node; 267 struct rb_node *node;
268 268
269 spin_lock(&fs_info->defrag_inodes_lock); 269 spin_lock(&fs_info->defrag_inodes_lock);
270 node = rb_first(&fs_info->defrag_inodes); 270 node = rb_first(&fs_info->defrag_inodes);
271 while (node) { 271 while (node) {
272 rb_erase(node, &fs_info->defrag_inodes); 272 rb_erase(node, &fs_info->defrag_inodes);
273 defrag = rb_entry(node, struct inode_defrag, rb_node); 273 defrag = rb_entry(node, struct inode_defrag, rb_node);
274 kmem_cache_free(btrfs_inode_defrag_cachep, defrag); 274 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
275 275
276 if (need_resched()) { 276 if (need_resched()) {
277 spin_unlock(&fs_info->defrag_inodes_lock); 277 spin_unlock(&fs_info->defrag_inodes_lock);
278 cond_resched(); 278 cond_resched();
279 spin_lock(&fs_info->defrag_inodes_lock); 279 spin_lock(&fs_info->defrag_inodes_lock);
280 } 280 }
281 281
282 node = rb_first(&fs_info->defrag_inodes); 282 node = rb_first(&fs_info->defrag_inodes);
283 } 283 }
284 spin_unlock(&fs_info->defrag_inodes_lock); 284 spin_unlock(&fs_info->defrag_inodes_lock);
285 } 285 }
286 286
287 #define BTRFS_DEFRAG_BATCH 1024 287 #define BTRFS_DEFRAG_BATCH 1024
288 288
289 static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info, 289 static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info,
290 struct inode_defrag *defrag) 290 struct inode_defrag *defrag)
291 { 291 {
292 struct btrfs_root *inode_root; 292 struct btrfs_root *inode_root;
293 struct inode *inode; 293 struct inode *inode;
294 struct btrfs_key key; 294 struct btrfs_key key;
295 struct btrfs_ioctl_defrag_range_args range; 295 struct btrfs_ioctl_defrag_range_args range;
296 int num_defrag; 296 int num_defrag;
297 int index; 297 int index;
298 int ret; 298 int ret;
299 299
300 /* get the inode */ 300 /* get the inode */
301 key.objectid = defrag->root; 301 key.objectid = defrag->root;
302 btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY); 302 btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY);
303 key.offset = (u64)-1; 303 key.offset = (u64)-1;
304 304
305 index = srcu_read_lock(&fs_info->subvol_srcu); 305 index = srcu_read_lock(&fs_info->subvol_srcu);
306 306
307 inode_root = btrfs_read_fs_root_no_name(fs_info, &key); 307 inode_root = btrfs_read_fs_root_no_name(fs_info, &key);
308 if (IS_ERR(inode_root)) { 308 if (IS_ERR(inode_root)) {
309 ret = PTR_ERR(inode_root); 309 ret = PTR_ERR(inode_root);
310 goto cleanup; 310 goto cleanup;
311 } 311 }
312 312
313 key.objectid = defrag->ino; 313 key.objectid = defrag->ino;
314 btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY); 314 btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY);
315 key.offset = 0; 315 key.offset = 0;
316 inode = btrfs_iget(fs_info->sb, &key, inode_root, NULL); 316 inode = btrfs_iget(fs_info->sb, &key, inode_root, NULL);
317 if (IS_ERR(inode)) { 317 if (IS_ERR(inode)) {
318 ret = PTR_ERR(inode); 318 ret = PTR_ERR(inode);
319 goto cleanup; 319 goto cleanup;
320 } 320 }
321 srcu_read_unlock(&fs_info->subvol_srcu, index); 321 srcu_read_unlock(&fs_info->subvol_srcu, index);
322 322
323 /* do a chunk of defrag */ 323 /* do a chunk of defrag */
324 clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags); 324 clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
325 memset(&range, 0, sizeof(range)); 325 memset(&range, 0, sizeof(range));
326 range.len = (u64)-1; 326 range.len = (u64)-1;
327 range.start = defrag->last_offset; 327 range.start = defrag->last_offset;
328 328
329 sb_start_write(fs_info->sb); 329 sb_start_write(fs_info->sb);
330 num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid, 330 num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
331 BTRFS_DEFRAG_BATCH); 331 BTRFS_DEFRAG_BATCH);
332 sb_end_write(fs_info->sb); 332 sb_end_write(fs_info->sb);
333 /* 333 /*
334 * if we filled the whole defrag batch, there 334 * if we filled the whole defrag batch, there
335 * must be more work to do. Queue this defrag 335 * must be more work to do. Queue this defrag
336 * again 336 * again
337 */ 337 */
338 if (num_defrag == BTRFS_DEFRAG_BATCH) { 338 if (num_defrag == BTRFS_DEFRAG_BATCH) {
339 defrag->last_offset = range.start; 339 defrag->last_offset = range.start;
340 btrfs_requeue_inode_defrag(inode, defrag); 340 btrfs_requeue_inode_defrag(inode, defrag);
341 } else if (defrag->last_offset && !defrag->cycled) { 341 } else if (defrag->last_offset && !defrag->cycled) {
342 /* 342 /*
343 * we didn't fill our defrag batch, but 343 * we didn't fill our defrag batch, but
344 * we didn't start at zero. Make sure we loop 344 * we didn't start at zero. Make sure we loop
345 * around to the start of the file. 345 * around to the start of the file.
346 */ 346 */
347 defrag->last_offset = 0; 347 defrag->last_offset = 0;
348 defrag->cycled = 1; 348 defrag->cycled = 1;
349 btrfs_requeue_inode_defrag(inode, defrag); 349 btrfs_requeue_inode_defrag(inode, defrag);
350 } else { 350 } else {
351 kmem_cache_free(btrfs_inode_defrag_cachep, defrag); 351 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
352 } 352 }
353 353
354 iput(inode); 354 iput(inode);
355 return 0; 355 return 0;
356 cleanup: 356 cleanup:
357 srcu_read_unlock(&fs_info->subvol_srcu, index); 357 srcu_read_unlock(&fs_info->subvol_srcu, index);
358 kmem_cache_free(btrfs_inode_defrag_cachep, defrag); 358 kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
359 return ret; 359 return ret;
360 } 360 }
361 361
362 /* 362 /*
363 * run through the list of inodes in the FS that need 363 * run through the list of inodes in the FS that need
364 * defragging 364 * defragging
365 */ 365 */
366 int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info) 366 int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
367 { 367 {
368 struct inode_defrag *defrag; 368 struct inode_defrag *defrag;
369 u64 first_ino = 0; 369 u64 first_ino = 0;
370 u64 root_objectid = 0; 370 u64 root_objectid = 0;
371 371
372 atomic_inc(&fs_info->defrag_running); 372 atomic_inc(&fs_info->defrag_running);
373 while(1) { 373 while(1) {
374 /* Pause the auto defragger. */ 374 /* Pause the auto defragger. */
375 if (test_bit(BTRFS_FS_STATE_REMOUNTING, 375 if (test_bit(BTRFS_FS_STATE_REMOUNTING,
376 &fs_info->fs_state)) 376 &fs_info->fs_state))
377 break; 377 break;
378 378
379 if (!__need_auto_defrag(fs_info->tree_root)) 379 if (!__need_auto_defrag(fs_info->tree_root))
380 break; 380 break;
381 381
382 /* find an inode to defrag */ 382 /* find an inode to defrag */
383 defrag = btrfs_pick_defrag_inode(fs_info, root_objectid, 383 defrag = btrfs_pick_defrag_inode(fs_info, root_objectid,
384 first_ino); 384 first_ino);
385 if (!defrag) { 385 if (!defrag) {
386 if (root_objectid || first_ino) { 386 if (root_objectid || first_ino) {
387 root_objectid = 0; 387 root_objectid = 0;
388 first_ino = 0; 388 first_ino = 0;
389 continue; 389 continue;
390 } else { 390 } else {
391 break; 391 break;
392 } 392 }
393 } 393 }
394 394
395 first_ino = defrag->ino + 1; 395 first_ino = defrag->ino + 1;
396 root_objectid = defrag->root; 396 root_objectid = defrag->root;
397 397
398 __btrfs_run_defrag_inode(fs_info, defrag); 398 __btrfs_run_defrag_inode(fs_info, defrag);
399 } 399 }
400 atomic_dec(&fs_info->defrag_running); 400 atomic_dec(&fs_info->defrag_running);
401 401
402 /* 402 /*
403 * during unmount, we use the transaction_wait queue to 403 * during unmount, we use the transaction_wait queue to
404 * wait for the defragger to stop 404 * wait for the defragger to stop
405 */ 405 */
406 wake_up(&fs_info->transaction_wait); 406 wake_up(&fs_info->transaction_wait);
407 return 0; 407 return 0;
408 } 408 }
409 409
410 /* simple helper to fault in pages and copy. This should go away 410 /* simple helper to fault in pages and copy. This should go away
411 * and be replaced with calls into generic code. 411 * and be replaced with calls into generic code.
412 */ 412 */
413 static noinline int btrfs_copy_from_user(loff_t pos, int num_pages, 413 static noinline int btrfs_copy_from_user(loff_t pos, int num_pages,
414 size_t write_bytes, 414 size_t write_bytes,
415 struct page **prepared_pages, 415 struct page **prepared_pages,
416 struct iov_iter *i) 416 struct iov_iter *i)
417 { 417 {
418 size_t copied = 0; 418 size_t copied = 0;
419 size_t total_copied = 0; 419 size_t total_copied = 0;
420 int pg = 0; 420 int pg = 0;
421 int offset = pos & (PAGE_CACHE_SIZE - 1); 421 int offset = pos & (PAGE_CACHE_SIZE - 1);
422 422
423 while (write_bytes > 0) { 423 while (write_bytes > 0) {
424 size_t count = min_t(size_t, 424 size_t count = min_t(size_t,
425 PAGE_CACHE_SIZE - offset, write_bytes); 425 PAGE_CACHE_SIZE - offset, write_bytes);
426 struct page *page = prepared_pages[pg]; 426 struct page *page = prepared_pages[pg];
427 /* 427 /*
428 * Copy data from userspace to the current page 428 * Copy data from userspace to the current page
429 */ 429 */
430 copied = iov_iter_copy_from_user_atomic(page, i, offset, count); 430 copied = iov_iter_copy_from_user_atomic(page, i, offset, count);
431 431
432 /* Flush processor's dcache for this page */ 432 /* Flush processor's dcache for this page */
433 flush_dcache_page(page); 433 flush_dcache_page(page);
434 434
435 /* 435 /*
436 * if we get a partial write, we can end up with 436 * if we get a partial write, we can end up with
437 * partially up to date pages. These add 437 * partially up to date pages. These add
438 * a lot of complexity, so make sure they don't 438 * a lot of complexity, so make sure they don't
439 * happen by forcing this copy to be retried. 439 * happen by forcing this copy to be retried.
440 * 440 *
441 * The rest of the btrfs_file_write code will fall 441 * The rest of the btrfs_file_write code will fall
442 * back to page at a time copies after we return 0. 442 * back to page at a time copies after we return 0.
443 */ 443 */
444 if (!PageUptodate(page) && copied < count) 444 if (!PageUptodate(page) && copied < count)
445 copied = 0; 445 copied = 0;
446 446
447 iov_iter_advance(i, copied); 447 iov_iter_advance(i, copied);
448 write_bytes -= copied; 448 write_bytes -= copied;
449 total_copied += copied; 449 total_copied += copied;
450 450
451 /* Return to btrfs_file_aio_write to fault page */ 451 /* Return to btrfs_file_aio_write to fault page */
452 if (unlikely(copied == 0)) 452 if (unlikely(copied == 0))
453 break; 453 break;
454 454
455 if (unlikely(copied < PAGE_CACHE_SIZE - offset)) { 455 if (unlikely(copied < PAGE_CACHE_SIZE - offset)) {
456 offset += copied; 456 offset += copied;
457 } else { 457 } else {
458 pg++; 458 pg++;
459 offset = 0; 459 offset = 0;
460 } 460 }
461 } 461 }
462 return total_copied; 462 return total_copied;
463 } 463 }
464 464
465 /* 465 /*
466 * unlocks pages after btrfs_file_write is done with them 466 * unlocks pages after btrfs_file_write is done with them
467 */ 467 */
 468  468	static void btrfs_drop_pages(struct page **pages, size_t num_pages)
 469  469	{
 470  470		size_t i;
 471  471		for (i = 0; i < num_pages; i++) {
 472  472			/* page checked is some magic around finding pages that
 473  473			 * have been modified without going through btrfs_set_page_dirty
-474			 * clear it here
+     474		 * clear it here. There should be no need to mark the pages
+     475		 * accessed as prepare_pages should have marked them accessed
+     476		 * in prepare_pages via find_or_create_page()
 475  477			 */
 476  478			ClearPageChecked(pages[i]);
 477  479			unlock_page(pages[i]);
-478			mark_page_accessed(pages[i]);
 479  480			page_cache_release(pages[i]);
 480  481		}
 481  482	}
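
The hunk above is the part of this listing the commit actually changes: the explicit mark_page_accessed() on release is dropped because, per the new comment, the pages handed to btrfs_drop_pages() were already marked accessed when prepare_pages obtained them through find_or_create_page(). A minimal sketch of that pairing follows; it is illustrative only -- grab_one_page()/drop_one_page() and their mapping/index arguments are invented for the example and are not btrfs's actual prepare_pages code.

/*
 * Illustrative sketch only, not the btrfs prepare_pages() implementation.
 * It shows the pairing the new comment in btrfs_drop_pages() relies on:
 * the page is marked accessed once, when it is looked up or allocated,
 * so the release path no longer calls mark_page_accessed() itself.
 */
static struct page *grab_one_page(struct address_space *mapping, pgoff_t index)
{
	/*
	 * find_or_create_page() returns a locked, referenced page; after
	 * this series it is expected to handle the "accessed" marking as
	 * well, non-atomically when the page had to be freshly allocated.
	 */
	return find_or_create_page(mapping, index, GFP_NOFS);
}

static void drop_one_page(struct page *page)
{
	unlock_page(page);
	page_cache_release(page);	/* no mark_page_accessed() here */
}
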
482 483
483 /* 484 /*
484 * after copy_from_user, pages need to be dirtied and we need to make 485 * after copy_from_user, pages need to be dirtied and we need to make
485 * sure holes are created between the current EOF and the start of 486 * sure holes are created between the current EOF and the start of
486 * any next extents (if required). 487 * any next extents (if required).
487 * 488 *
488 * this also makes the decision about creating an inline extent vs 489 * this also makes the decision about creating an inline extent vs
489 * doing real data extents, marking pages dirty and delalloc as required. 490 * doing real data extents, marking pages dirty and delalloc as required.
490 */ 491 */
491 int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode, 492 int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
492 struct page **pages, size_t num_pages, 493 struct page **pages, size_t num_pages,
493 loff_t pos, size_t write_bytes, 494 loff_t pos, size_t write_bytes,
494 struct extent_state **cached) 495 struct extent_state **cached)
495 { 496 {
496 int err = 0; 497 int err = 0;
497 int i; 498 int i;
498 u64 num_bytes; 499 u64 num_bytes;
499 u64 start_pos; 500 u64 start_pos;
500 u64 end_of_last_block; 501 u64 end_of_last_block;
501 u64 end_pos = pos + write_bytes; 502 u64 end_pos = pos + write_bytes;
502 loff_t isize = i_size_read(inode); 503 loff_t isize = i_size_read(inode);
503 504
504 start_pos = pos & ~((u64)root->sectorsize - 1); 505 start_pos = pos & ~((u64)root->sectorsize - 1);
505 num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize); 506 num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize);
506 507
507 end_of_last_block = start_pos + num_bytes - 1; 508 end_of_last_block = start_pos + num_bytes - 1;
508 err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block, 509 err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
509 cached); 510 cached);
510 if (err) 511 if (err)
511 return err; 512 return err;
512 513
513 for (i = 0; i < num_pages; i++) { 514 for (i = 0; i < num_pages; i++) {
514 struct page *p = pages[i]; 515 struct page *p = pages[i];
515 SetPageUptodate(p); 516 SetPageUptodate(p);
516 ClearPageChecked(p); 517 ClearPageChecked(p);
517 set_page_dirty(p); 518 set_page_dirty(p);
518 } 519 }
519 520
520 /* 521 /*
521 * we've only changed i_size in ram, and we haven't updated 522 * we've only changed i_size in ram, and we haven't updated
522 * the disk i_size. There is no need to log the inode 523 * the disk i_size. There is no need to log the inode
523 * at this time. 524 * at this time.
524 */ 525 */
525 if (end_pos > isize) 526 if (end_pos > isize)
526 i_size_write(inode, end_pos); 527 i_size_write(inode, end_pos);
527 return 0; 528 return 0;
528 } 529 }
529 530
530 /* 531 /*
531 * this drops all the extents in the cache that intersect the range 532 * this drops all the extents in the cache that intersect the range
532 * [start, end]. Existing extents are split as required. 533 * [start, end]. Existing extents are split as required.
533 */ 534 */
534 void btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end, 535 void btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end,
535 int skip_pinned) 536 int skip_pinned)
536 { 537 {
537 struct extent_map *em; 538 struct extent_map *em;
538 struct extent_map *split = NULL; 539 struct extent_map *split = NULL;
539 struct extent_map *split2 = NULL; 540 struct extent_map *split2 = NULL;
540 struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; 541 struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
541 u64 len = end - start + 1; 542 u64 len = end - start + 1;
542 u64 gen; 543 u64 gen;
543 int ret; 544 int ret;
544 int testend = 1; 545 int testend = 1;
545 unsigned long flags; 546 unsigned long flags;
546 int compressed = 0; 547 int compressed = 0;
547 bool modified; 548 bool modified;
548 549
549 WARN_ON(end < start); 550 WARN_ON(end < start);
550 if (end == (u64)-1) { 551 if (end == (u64)-1) {
551 len = (u64)-1; 552 len = (u64)-1;
552 testend = 0; 553 testend = 0;
553 } 554 }
554 while (1) { 555 while (1) {
555 int no_splits = 0; 556 int no_splits = 0;
556 557
557 modified = false; 558 modified = false;
558 if (!split) 559 if (!split)
559 split = alloc_extent_map(); 560 split = alloc_extent_map();
560 if (!split2) 561 if (!split2)
561 split2 = alloc_extent_map(); 562 split2 = alloc_extent_map();
562 if (!split || !split2) 563 if (!split || !split2)
563 no_splits = 1; 564 no_splits = 1;
564 565
565 write_lock(&em_tree->lock); 566 write_lock(&em_tree->lock);
566 em = lookup_extent_mapping(em_tree, start, len); 567 em = lookup_extent_mapping(em_tree, start, len);
567 if (!em) { 568 if (!em) {
568 write_unlock(&em_tree->lock); 569 write_unlock(&em_tree->lock);
569 break; 570 break;
570 } 571 }
571 flags = em->flags; 572 flags = em->flags;
572 gen = em->generation; 573 gen = em->generation;
573 if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) { 574 if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) {
574 if (testend && em->start + em->len >= start + len) { 575 if (testend && em->start + em->len >= start + len) {
575 free_extent_map(em); 576 free_extent_map(em);
576 write_unlock(&em_tree->lock); 577 write_unlock(&em_tree->lock);
577 break; 578 break;
578 } 579 }
579 start = em->start + em->len; 580 start = em->start + em->len;
580 if (testend) 581 if (testend)
581 len = start + len - (em->start + em->len); 582 len = start + len - (em->start + em->len);
582 free_extent_map(em); 583 free_extent_map(em);
583 write_unlock(&em_tree->lock); 584 write_unlock(&em_tree->lock);
584 continue; 585 continue;
585 } 586 }
586 compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags); 587 compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
587 clear_bit(EXTENT_FLAG_PINNED, &em->flags); 588 clear_bit(EXTENT_FLAG_PINNED, &em->flags);
588 clear_bit(EXTENT_FLAG_LOGGING, &flags); 589 clear_bit(EXTENT_FLAG_LOGGING, &flags);
589 modified = !list_empty(&em->list); 590 modified = !list_empty(&em->list);
590 remove_extent_mapping(em_tree, em); 591 remove_extent_mapping(em_tree, em);
591 if (no_splits) 592 if (no_splits)
592 goto next; 593 goto next;
593 594
594 if (em->start < start) { 595 if (em->start < start) {
595 split->start = em->start; 596 split->start = em->start;
596 split->len = start - em->start; 597 split->len = start - em->start;
597 598
598 if (em->block_start < EXTENT_MAP_LAST_BYTE) { 599 if (em->block_start < EXTENT_MAP_LAST_BYTE) {
599 split->orig_start = em->orig_start; 600 split->orig_start = em->orig_start;
600 split->block_start = em->block_start; 601 split->block_start = em->block_start;
601 602
602 if (compressed) 603 if (compressed)
603 split->block_len = em->block_len; 604 split->block_len = em->block_len;
604 else 605 else
605 split->block_len = split->len; 606 split->block_len = split->len;
606 split->orig_block_len = max(split->block_len, 607 split->orig_block_len = max(split->block_len,
607 em->orig_block_len); 608 em->orig_block_len);
608 split->ram_bytes = em->ram_bytes; 609 split->ram_bytes = em->ram_bytes;
609 } else { 610 } else {
610 split->orig_start = split->start; 611 split->orig_start = split->start;
611 split->block_len = 0; 612 split->block_len = 0;
612 split->block_start = em->block_start; 613 split->block_start = em->block_start;
613 split->orig_block_len = 0; 614 split->orig_block_len = 0;
614 split->ram_bytes = split->len; 615 split->ram_bytes = split->len;
615 } 616 }
616 617
617 split->generation = gen; 618 split->generation = gen;
618 split->bdev = em->bdev; 619 split->bdev = em->bdev;
619 split->flags = flags; 620 split->flags = flags;
620 split->compress_type = em->compress_type; 621 split->compress_type = em->compress_type;
621 ret = add_extent_mapping(em_tree, split, modified); 622 ret = add_extent_mapping(em_tree, split, modified);
622 BUG_ON(ret); /* Logic error */ 623 BUG_ON(ret); /* Logic error */
623 free_extent_map(split); 624 free_extent_map(split);
624 split = split2; 625 split = split2;
625 split2 = NULL; 626 split2 = NULL;
626 } 627 }
627 if (testend && em->start + em->len > start + len) { 628 if (testend && em->start + em->len > start + len) {
628 u64 diff = start + len - em->start; 629 u64 diff = start + len - em->start;
629 630
630 split->start = start + len; 631 split->start = start + len;
631 split->len = em->start + em->len - (start + len); 632 split->len = em->start + em->len - (start + len);
632 split->bdev = em->bdev; 633 split->bdev = em->bdev;
633 split->flags = flags; 634 split->flags = flags;
634 split->compress_type = em->compress_type; 635 split->compress_type = em->compress_type;
635 split->generation = gen; 636 split->generation = gen;
636 637
637 if (em->block_start < EXTENT_MAP_LAST_BYTE) { 638 if (em->block_start < EXTENT_MAP_LAST_BYTE) {
638 split->orig_block_len = max(em->block_len, 639 split->orig_block_len = max(em->block_len,
639 em->orig_block_len); 640 em->orig_block_len);
640 641
641 split->ram_bytes = em->ram_bytes; 642 split->ram_bytes = em->ram_bytes;
642 if (compressed) { 643 if (compressed) {
643 split->block_len = em->block_len; 644 split->block_len = em->block_len;
644 split->block_start = em->block_start; 645 split->block_start = em->block_start;
645 split->orig_start = em->orig_start; 646 split->orig_start = em->orig_start;
646 } else { 647 } else {
647 split->block_len = split->len; 648 split->block_len = split->len;
648 split->block_start = em->block_start 649 split->block_start = em->block_start
649 + diff; 650 + diff;
650 split->orig_start = em->orig_start; 651 split->orig_start = em->orig_start;
651 } 652 }
652 } else { 653 } else {
653 split->ram_bytes = split->len; 654 split->ram_bytes = split->len;
654 split->orig_start = split->start; 655 split->orig_start = split->start;
655 split->block_len = 0; 656 split->block_len = 0;
656 split->block_start = em->block_start; 657 split->block_start = em->block_start;
657 split->orig_block_len = 0; 658 split->orig_block_len = 0;
658 } 659 }
659 660
660 ret = add_extent_mapping(em_tree, split, modified); 661 ret = add_extent_mapping(em_tree, split, modified);
661 BUG_ON(ret); /* Logic error */ 662 BUG_ON(ret); /* Logic error */
662 free_extent_map(split); 663 free_extent_map(split);
663 split = NULL; 664 split = NULL;
664 } 665 }
665 next: 666 next:
666 write_unlock(&em_tree->lock); 667 write_unlock(&em_tree->lock);
667 668
668 /* once for us */ 669 /* once for us */
669 free_extent_map(em); 670 free_extent_map(em);
670 /* once for the tree*/ 671 /* once for the tree*/
671 free_extent_map(em); 672 free_extent_map(em);
672 } 673 }
673 if (split) 674 if (split)
674 free_extent_map(split); 675 free_extent_map(split);
675 if (split2) 676 if (split2)
676 free_extent_map(split2); 677 free_extent_map(split2);
677 } 678 }
678 679
679 /* 680 /*
680 * this is very complex, but the basic idea is to drop all extents 681 * this is very complex, but the basic idea is to drop all extents
681 * in the range start - end. hint_block is filled in with a block number 682 * in the range start - end. hint_block is filled in with a block number
682 * that would be a good hint to the block allocator for this file. 683 * that would be a good hint to the block allocator for this file.
683 * 684 *
684 * If an extent intersects the range but is not entirely inside the range 685 * If an extent intersects the range but is not entirely inside the range
685 * it is either truncated or split. Anything entirely inside the range 686 * it is either truncated or split. Anything entirely inside the range
686 * is deleted from the tree. 687 * is deleted from the tree.
687 */ 688 */
int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
			 struct btrfs_root *root, struct inode *inode,
			 struct btrfs_path *path, u64 start, u64 end,
			 u64 *drop_end, int drop_cache)
{
	struct extent_buffer *leaf;
	struct btrfs_file_extent_item *fi;
	struct btrfs_key key;
	struct btrfs_key new_key;
	u64 ino = btrfs_ino(inode);
	u64 search_start = start;
	u64 disk_bytenr = 0;
	u64 num_bytes = 0;
	u64 extent_offset = 0;
	u64 extent_end = 0;
	int del_nr = 0;
	int del_slot = 0;
	int extent_type;
	int recow;
	int ret;
	int modify_tree = -1;
	int update_refs = (root->ref_cows || root == root->fs_info->tree_root);
	int found = 0;

	if (drop_cache)
		btrfs_drop_extent_cache(inode, start, end - 1, 0);

	if (start >= BTRFS_I(inode)->disk_i_size)
		modify_tree = 0;

	while (1) {
		recow = 0;
		ret = btrfs_lookup_file_extent(trans, root, path, ino,
					       search_start, modify_tree);
		if (ret < 0)
			break;
		if (ret > 0 && path->slots[0] > 0 && search_start == start) {
			leaf = path->nodes[0];
			btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
			if (key.objectid == ino &&
			    key.type == BTRFS_EXTENT_DATA_KEY)
				path->slots[0]--;
		}
		ret = 0;
next_slot:
		leaf = path->nodes[0];
		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
			BUG_ON(del_nr > 0);
			ret = btrfs_next_leaf(root, path);
			if (ret < 0)
				break;
			if (ret > 0) {
				ret = 0;
				break;
			}
			leaf = path->nodes[0];
			recow = 1;
		}

		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
		if (key.objectid > ino ||
		    key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
			break;

		fi = btrfs_item_ptr(leaf, path->slots[0],
				    struct btrfs_file_extent_item);
		extent_type = btrfs_file_extent_type(leaf, fi);

		if (extent_type == BTRFS_FILE_EXTENT_REG ||
		    extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
			disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
			num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
			extent_offset = btrfs_file_extent_offset(leaf, fi);
			extent_end = key.offset +
				btrfs_file_extent_num_bytes(leaf, fi);
		} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
			extent_end = key.offset +
				btrfs_file_extent_inline_len(leaf, fi);
		} else {
			WARN_ON(1);
			extent_end = search_start;
		}

		if (extent_end <= search_start) {
			path->slots[0]++;
			goto next_slot;
		}

		found = 1;
		search_start = max(key.offset, start);
		if (recow || !modify_tree) {
			modify_tree = -1;
			btrfs_release_path(path);
			continue;
		}

		/*
		 *     | - range to drop - |
		 *  | -------- extent -------- |
		 */
		if (start > key.offset && end < extent_end) {
			BUG_ON(del_nr > 0);
			BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE);

			memcpy(&new_key, &key, sizeof(new_key));
			new_key.offset = start;
			ret = btrfs_duplicate_item(trans, root, path,
						   &new_key);
			if (ret == -EAGAIN) {
				btrfs_release_path(path);
				continue;
			}
			if (ret < 0)
				break;

			leaf = path->nodes[0];
			fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
					    struct btrfs_file_extent_item);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							start - key.offset);

			fi = btrfs_item_ptr(leaf, path->slots[0],
					    struct btrfs_file_extent_item);

			extent_offset += start - key.offset;
			btrfs_set_file_extent_offset(leaf, fi, extent_offset);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							extent_end - start);
			btrfs_mark_buffer_dirty(leaf);

			if (update_refs && disk_bytenr > 0) {
				ret = btrfs_inc_extent_ref(trans, root,
						disk_bytenr, num_bytes, 0,
						root->root_key.objectid,
						new_key.objectid,
						start - extent_offset, 0);
				BUG_ON(ret); /* -ENOMEM */
			}
			key.offset = start;
		}
		/*
		 *  | ---- range to drop ----- |
		 *      | -------- extent -------- |
		 */
		if (start <= key.offset && end < extent_end) {
			BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE);

			memcpy(&new_key, &key, sizeof(new_key));
			new_key.offset = end;
			btrfs_set_item_key_safe(root, path, &new_key);

			extent_offset += end - key.offset;
			btrfs_set_file_extent_offset(leaf, fi, extent_offset);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							extent_end - end);
			btrfs_mark_buffer_dirty(leaf);
			if (update_refs && disk_bytenr > 0)
				inode_sub_bytes(inode, end - key.offset);
			break;
		}

		search_start = extent_end;
		/*
		 *       | ---- range to drop ----- |
		 *  | -------- extent -------- |
		 */
		if (start > key.offset && end >= extent_end) {
			BUG_ON(del_nr > 0);
			BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE);

			btrfs_set_file_extent_num_bytes(leaf, fi,
							start - key.offset);
			btrfs_mark_buffer_dirty(leaf);
			if (update_refs && disk_bytenr > 0)
				inode_sub_bytes(inode, extent_end - start);
			if (end == extent_end)
				break;

			path->slots[0]++;
			goto next_slot;
		}

		/*
		 *  | ---- range to drop ----- |
		 *    | ------ extent ------ |
		 */
		if (start <= key.offset && end >= extent_end) {
			if (del_nr == 0) {
				del_slot = path->slots[0];
				del_nr = 1;
			} else {
				BUG_ON(del_slot + del_nr != path->slots[0]);
				del_nr++;
			}

			if (update_refs &&
			    extent_type == BTRFS_FILE_EXTENT_INLINE) {
				inode_sub_bytes(inode,
						extent_end - key.offset);
				extent_end = ALIGN(extent_end,
						   root->sectorsize);
			} else if (update_refs && disk_bytenr > 0) {
				ret = btrfs_free_extent(trans, root,
						disk_bytenr, num_bytes, 0,
						root->root_key.objectid,
						key.objectid, key.offset -
						extent_offset, 0);
				BUG_ON(ret); /* -ENOMEM */
				inode_sub_bytes(inode,
						extent_end - key.offset);
			}

			if (end == extent_end)
				break;

			if (path->slots[0] + 1 < btrfs_header_nritems(leaf)) {
				path->slots[0]++;
				goto next_slot;
			}

			ret = btrfs_del_items(trans, root, path, del_slot,
					      del_nr);
			if (ret) {
				btrfs_abort_transaction(trans, root, ret);
				break;
			}

			del_nr = 0;
			del_slot = 0;

			btrfs_release_path(path);
			continue;
		}

		BUG_ON(1);
	}

	if (!ret && del_nr > 0) {
		ret = btrfs_del_items(trans, root, path, del_slot, del_nr);
		if (ret)
			btrfs_abort_transaction(trans, root, ret);
	}

	if (drop_end)
		*drop_end = found ? min(end, extent_end) : end;
	btrfs_release_path(path);
	return ret;
}

int btrfs_drop_extents(struct btrfs_trans_handle *trans,
		       struct btrfs_root *root, struct inode *inode, u64 start,
		       u64 end, int drop_cache)
{
	struct btrfs_path *path;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
	ret = __btrfs_drop_extents(trans, root, inode, path, start, end, NULL,
				   drop_cache);
	btrfs_free_path(path);
	return ret;
}

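/*
 * Helper for btrfs_mark_extent_written(): returns 1 if the file extent
 * item in @slot refers to the same on-disk extent (@bytenr/@orig_offset),
 * is a plain regular extent with no compression, encryption or other
 * encoding, and lines up with the *start/*end window the caller passes
 * in, so the two items can be merged; returns 0 otherwise.
 */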
static int extent_mergeable(struct extent_buffer *leaf, int slot,
			    u64 objectid, u64 bytenr, u64 orig_offset,
			    u64 *start, u64 *end)
{
	struct btrfs_file_extent_item *fi;
	struct btrfs_key key;
	u64 extent_end;

	if (slot < 0 || slot >= btrfs_header_nritems(leaf))
		return 0;

	btrfs_item_key_to_cpu(leaf, &key, slot);
	if (key.objectid != objectid || key.type != BTRFS_EXTENT_DATA_KEY)
		return 0;

	fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
	if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG ||
	    btrfs_file_extent_disk_bytenr(leaf, fi) != bytenr ||
	    btrfs_file_extent_offset(leaf, fi) != key.offset - orig_offset ||
	    btrfs_file_extent_compression(leaf, fi) ||
	    btrfs_file_extent_encryption(leaf, fi) ||
	    btrfs_file_extent_other_encoding(leaf, fi))
		return 0;

	extent_end = key.offset + btrfs_file_extent_num_bytes(leaf, fi);
	if ((*start && *start != key.offset) || (*end && *end != extent_end))
		return 0;

	*start = key.offset;
	*end = extent_end;
	return 1;
}

/*
 * Mark the extent in the range start - end as written.
 *
 * This changes the extent type from 'pre-allocated' to 'regular'. If only
 * part of the extent is marked as written, the extent will be split into
 * two or three.
 */
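/*
 * Where the written range butts up against a neighbouring extent item
 * that points at the same on-disk extent, the items are merged (see
 * extent_mergeable() above) instead of leaving extra splits behind.
 */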
int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
			      struct inode *inode, u64 start, u64 end)
{
	struct btrfs_root *root = BTRFS_I(inode)->root;
	struct extent_buffer *leaf;
	struct btrfs_path *path;
	struct btrfs_file_extent_item *fi;
	struct btrfs_key key;
	struct btrfs_key new_key;
	u64 bytenr;
	u64 num_bytes;
	u64 extent_end;
	u64 orig_offset;
	u64 other_start;
	u64 other_end;
	u64 split;
	int del_nr = 0;
	int del_slot = 0;
	int recow;
	int ret;
	u64 ino = btrfs_ino(inode);

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
again:
	recow = 0;
	split = start;
	key.objectid = ino;
	key.type = BTRFS_EXTENT_DATA_KEY;
	key.offset = split;

	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
	if (ret < 0)
		goto out;
	if (ret > 0 && path->slots[0] > 0)
		path->slots[0]--;

	leaf = path->nodes[0];
	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
	BUG_ON(key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY);
	fi = btrfs_item_ptr(leaf, path->slots[0],
			    struct btrfs_file_extent_item);
	BUG_ON(btrfs_file_extent_type(leaf, fi) !=
	       BTRFS_FILE_EXTENT_PREALLOC);
	extent_end = key.offset + btrfs_file_extent_num_bytes(leaf, fi);
	BUG_ON(key.offset > start || extent_end < end);

	bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
	num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
	orig_offset = key.offset - btrfs_file_extent_offset(leaf, fi);
	memcpy(&new_key, &key, sizeof(new_key));

	if (start == key.offset && end < extent_end) {
		other_start = 0;
		other_end = start;
		if (extent_mergeable(leaf, path->slots[0] - 1,
				     ino, bytenr, orig_offset,
				     &other_start, &other_end)) {
			new_key.offset = end;
			btrfs_set_item_key_safe(root, path, &new_key);
			fi = btrfs_item_ptr(leaf, path->slots[0],
					    struct btrfs_file_extent_item);
			btrfs_set_file_extent_generation(leaf, fi,
							 trans->transid);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							extent_end - end);
			btrfs_set_file_extent_offset(leaf, fi,
						     end - orig_offset);
			fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
					    struct btrfs_file_extent_item);
			btrfs_set_file_extent_generation(leaf, fi,
							 trans->transid);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							end - other_start);
			btrfs_mark_buffer_dirty(leaf);
			goto out;
		}
	}

	if (start > key.offset && end == extent_end) {
		other_start = end;
		other_end = 0;
		if (extent_mergeable(leaf, path->slots[0] + 1,
				     ino, bytenr, orig_offset,
				     &other_start, &other_end)) {
			fi = btrfs_item_ptr(leaf, path->slots[0],
					    struct btrfs_file_extent_item);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							start - key.offset);
			btrfs_set_file_extent_generation(leaf, fi,
							 trans->transid);
			path->slots[0]++;
			new_key.offset = start;
			btrfs_set_item_key_safe(root, path, &new_key);

			fi = btrfs_item_ptr(leaf, path->slots[0],
					    struct btrfs_file_extent_item);
			btrfs_set_file_extent_generation(leaf, fi,
							 trans->transid);
			btrfs_set_file_extent_num_bytes(leaf, fi,
							other_end - start);
			btrfs_set_file_extent_offset(leaf, fi,
						     start - orig_offset);
			btrfs_mark_buffer_dirty(leaf);
			goto out;
		}
	}

	while (start > key.offset || end < extent_end) {
		if (key.offset == start)
			split = end;

		new_key.offset = split;
		ret = btrfs_duplicate_item(trans, root, path, &new_key);
		if (ret == -EAGAIN) {
			btrfs_release_path(path);
			goto again;
		}
		if (ret < 0) {
			btrfs_abort_transaction(trans, root, ret);
			goto out;
		}

		leaf = path->nodes[0];
		fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
				    struct btrfs_file_extent_item);
		btrfs_set_file_extent_generation(leaf, fi, trans->transid);
		btrfs_set_file_extent_num_bytes(leaf, fi,
						split - key.offset);

		fi = btrfs_item_ptr(leaf, path->slots[0],
				    struct btrfs_file_extent_item);

		btrfs_set_file_extent_generation(leaf, fi, trans->transid);
		btrfs_set_file_extent_offset(leaf, fi, split - orig_offset);
		btrfs_set_file_extent_num_bytes(leaf, fi,
						extent_end - split);
		btrfs_mark_buffer_dirty(leaf);

		ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0,
					   root->root_key.objectid,
					   ino, orig_offset, 0);
		BUG_ON(ret); /* -ENOMEM */

		if (split == start) {
			key.offset = start;
		} else {
			BUG_ON(start != key.offset);
			path->slots[0]--;
			extent_end = end;
		}
		recow = 1;
	}

	other_start = end;
	other_end = 0;
	if (extent_mergeable(leaf, path->slots[0] + 1,
			     ino, bytenr, orig_offset,
			     &other_start, &other_end)) {
		if (recow) {
			btrfs_release_path(path);
			goto again;
		}
		extent_end = other_end;
		del_slot = path->slots[0] + 1;
		del_nr++;
		ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
					0, root->root_key.objectid,
					ino, orig_offset, 0);
		BUG_ON(ret); /* -ENOMEM */
	}
	other_start = 0;
	other_end = start;
	if (extent_mergeable(leaf, path->slots[0] - 1,
			     ino, bytenr, orig_offset,
			     &other_start, &other_end)) {
		if (recow) {
			btrfs_release_path(path);
			goto again;
		}
		key.offset = other_start;
		del_slot = path->slots[0];
		del_nr++;
		ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
					0, root->root_key.objectid,
					ino, orig_offset, 0);
		BUG_ON(ret); /* -ENOMEM */
	}
	if (del_nr == 0) {
		fi = btrfs_item_ptr(leaf, path->slots[0],
				    struct btrfs_file_extent_item);
		btrfs_set_file_extent_type(leaf, fi,
					   BTRFS_FILE_EXTENT_REG);
		btrfs_set_file_extent_generation(leaf, fi, trans->transid);
		btrfs_mark_buffer_dirty(leaf);
	} else {
		fi = btrfs_item_ptr(leaf, del_slot - 1,
				    struct btrfs_file_extent_item);
		btrfs_set_file_extent_type(leaf, fi,
					   BTRFS_FILE_EXTENT_REG);
		btrfs_set_file_extent_generation(leaf, fi, trans->transid);
		btrfs_set_file_extent_num_bytes(leaf, fi,
						extent_end - key.offset);
		btrfs_mark_buffer_dirty(leaf);

		ret = btrfs_del_items(trans, root, path, del_slot, del_nr);
		if (ret < 0) {
			btrfs_abort_transaction(trans, root, ret);
			goto out;
		}
	}
out:
	btrfs_free_path(path);
	return 0;
}

/*
 * on error we return an unlocked page and the error value
 * on success we return a locked page and 0
 */
static int prepare_uptodate_page(struct page *page, u64 pos,
				 bool force_uptodate)
{
	int ret = 0;

	if (((pos & (PAGE_CACHE_SIZE - 1)) || force_uptodate) &&
	    !PageUptodate(page)) {
		ret = btrfs_readpage(NULL, page);
		if (ret)
			return ret;
		lock_page(page);
		if (!PageUptodate(page)) {
			unlock_page(page);
			return -EIO;
		}
	}
	return 0;
}

/*
 * this gets pages into the page cache and locks them down; it also properly
 * waits for data=ordered extents to finish before allowing the pages to be
 * modified.
 */
static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
				  struct page **pages, size_t num_pages,
				  loff_t pos, unsigned long first_index,
				  size_t write_bytes, bool force_uptodate)
{
	struct extent_state *cached_state = NULL;
	int i;
	unsigned long index = pos >> PAGE_CACHE_SHIFT;
	struct inode *inode = file_inode(file);
	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
	int err = 0;
	int faili = 0;
	u64 start_pos;
	u64 last_pos;

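	/*
	 * Round the requested byte range out to whole sectors and whole
	 * pages.  As an illustration (assuming 4K pages and a 4K
	 * sectorsize): pos = 6000 and num_pages = 2 give index = 1,
	 * start_pos = 4096 and last_pos = 12288, i.e. [4096, 12288) is the
	 * range that gets locked and waited on further down.
	 */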
	start_pos = pos & ~((u64)root->sectorsize - 1);
	last_pos = ((u64)index + num_pages) << PAGE_CACHE_SHIFT;

again:
	for (i = 0; i < num_pages; i++) {
		pages[i] = find_or_create_page(inode->i_mapping, index + i,
					       mask | __GFP_WRITE);
		if (!pages[i]) {
			faili = i - 1;
			err = -ENOMEM;
			goto fail;
		}

		if (i == 0)
			err = prepare_uptodate_page(pages[i], pos,
						    force_uptodate);
		if (i == num_pages - 1)
			err = prepare_uptodate_page(pages[i],
						    pos + write_bytes, false);
		if (err) {
			page_cache_release(pages[i]);
			faili = i - 1;
			goto fail;
		}
		wait_on_page_writeback(pages[i]);
	}
	err = 0;
	if (start_pos < inode->i_size) {
		struct btrfs_ordered_extent *ordered;
		lock_extent_bits(&BTRFS_I(inode)->io_tree,
				 start_pos, last_pos - 1, 0, &cached_state);
		ordered = btrfs_lookup_first_ordered_extent(inode,
							    last_pos - 1);
		if (ordered &&
		    ordered->file_offset + ordered->len > start_pos &&
		    ordered->file_offset < last_pos) {
			btrfs_put_ordered_extent(ordered);
			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
					     start_pos, last_pos - 1,
					     &cached_state, GFP_NOFS);
			for (i = 0; i < num_pages; i++) {
				unlock_page(pages[i]);
				page_cache_release(pages[i]);
			}
			btrfs_wait_ordered_range(inode, start_pos,
						 last_pos - start_pos);
			goto again;
		}
		if (ordered)
			btrfs_put_ordered_extent(ordered);

		clear_extent_bit(&BTRFS_I(inode)->io_tree, start_pos,
				 last_pos - 1, EXTENT_DIRTY | EXTENT_DELALLOC |
				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
				 0, 0, &cached_state, GFP_NOFS);
		unlock_extent_cached(&BTRFS_I(inode)->io_tree,
				     start_pos, last_pos - 1, &cached_state,
				     GFP_NOFS);
	}
	for (i = 0; i < num_pages; i++) {
		if (clear_page_dirty_for_io(pages[i]))
			account_page_redirty(pages[i]);
		set_page_extent_mapped(pages[i]);
		WARN_ON(!PageLocked(pages[i]));
	}
	return 0;
fail:
	while (faili >= 0) {
		unlock_page(pages[faili]);
		page_cache_release(pages[faili]);
		faili--;
	}
	return err;

}

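/*
 * See whether the write at @pos for *@write_bytes bytes can skip COW:
 * wait out any ordered extents that overlap the range, then ask
 * can_nocow_extent().  On success the delalloc-related extent bits for
 * the range are cleared, *@write_bytes is clamped to the length that can
 * actually be written in place, and a positive value is returned;
 * otherwise 0 is returned and the caller must take the normal COW path.
 */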
static noinline int check_can_nocow(struct inode *inode, loff_t pos,
				    size_t *write_bytes)
{
	struct btrfs_root *root = BTRFS_I(inode)->root;
	struct btrfs_ordered_extent *ordered;
	u64 lockstart, lockend;
	u64 num_bytes;
	int ret;

	lockstart = round_down(pos, root->sectorsize);
	lockend = lockstart + round_up(*write_bytes, root->sectorsize) - 1;

	while (1) {
		lock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend);
		ordered = btrfs_lookup_ordered_range(inode, lockstart,
						     lockend - lockstart + 1);
		if (!ordered) {
			break;
		}
		unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend);
		btrfs_start_ordered_extent(inode, ordered, 1);
		btrfs_put_ordered_extent(ordered);
	}

	num_bytes = lockend - lockstart + 1;
	ret = can_nocow_extent(inode, lockstart, &num_bytes, NULL, NULL, NULL);
	if (ret <= 0) {
		ret = 0;
	} else {
		clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
				 EXTENT_DIRTY | EXTENT_DELALLOC |
				 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 0, 0,
				 NULL, GFP_NOFS);
		*write_bytes = min_t(size_t, *write_bytes, num_bytes);
	}

	unlock_extent(&BTRFS_I(inode)->io_tree, lockstart, lockend);

	return ret;
}

static noinline ssize_t __btrfs_buffered_write(struct file *file,
					       struct iov_iter *i,
					       loff_t pos)
{
	struct inode *inode = file_inode(file);
	struct btrfs_root *root = BTRFS_I(inode)->root;
	struct page **pages = NULL;
	u64 release_bytes = 0;
	unsigned long first_index;
	size_t num_written = 0;
	int nrptrs;
	int ret = 0;
	bool only_release_metadata = false;
	bool force_page_uptodate = false;

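	/*
	 * Size the pages[] array: enough pointers to cover the whole write,
	 * but never more than fit in one page of pointers and never more
	 * than this task's dirty-throttling headroom, with a floor of 8.
	 * For example (assuming 4K pages and 8-byte pointers) that is at
	 * most 512 pages, i.e. up to 2MB copied per pass of the loop below.
	 */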
	nrptrs = min((iov_iter_count(i) + PAGE_CACHE_SIZE - 1) /
		     PAGE_CACHE_SIZE, PAGE_CACHE_SIZE /
		     (sizeof(struct page *)));
	nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
	nrptrs = max(nrptrs, 8);
	pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	first_index = pos >> PAGE_CACHE_SHIFT;

	while (iov_iter_count(i) > 0) {
		size_t offset = pos & (PAGE_CACHE_SIZE - 1);
		size_t write_bytes = min(iov_iter_count(i),
					 nrptrs * (size_t)PAGE_CACHE_SIZE -
					 offset);
		size_t num_pages = (write_bytes + offset +
				    PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
		size_t reserve_bytes;
		size_t dirty_pages;
		size_t copied;

		WARN_ON(num_pages > nrptrs);

		/*
		 * Fault pages before locking them in prepare_pages
		 * to avoid recursive lock
		 */
		if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
			ret = -EFAULT;
			break;
		}

		reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
		ret = btrfs_check_data_free_space(inode, reserve_bytes);
		if (ret == -ENOSPC &&
		    (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
					      BTRFS_INODE_PREALLOC))) {
			ret = check_can_nocow(inode, pos, &write_bytes);
			if (ret > 0) {
				only_release_metadata = true;
				/*
				 * our prealloc extent may be smaller than
				 * write_bytes, so scale down.
				 */
				num_pages = (write_bytes + offset +
					     PAGE_CACHE_SIZE - 1) >>
					     PAGE_CACHE_SHIFT;
				reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
				ret = 0;
			} else {
				ret = -ENOSPC;
			}
		}

		if (ret)
			break;

		ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes);
		if (ret) {
			if (!only_release_metadata)
				btrfs_free_reserved_data_space(inode,
							       reserve_bytes);
			break;
		}

		release_bytes = reserve_bytes;

		/*
		 * This is going to setup the pages array with the number of
		 * pages we want, so we don't really need to worry about the
		 * contents of pages from loop to loop
		 */
		ret = prepare_pages(root, file, pages, num_pages,
				    pos, first_index, write_bytes,
				    force_page_uptodate);
		if (ret)
			break;

		copied = btrfs_copy_from_user(pos, num_pages,
					      write_bytes, pages, i);

		/*
		 * if we have trouble faulting in the pages, fall
		 * back to one page at a time
		 */
		if (copied < write_bytes)
			nrptrs = 1;

		if (copied == 0) {
			force_page_uptodate = true;
			dirty_pages = 0;
		} else {
			force_page_uptodate = false;
			dirty_pages = (copied + offset +
				       PAGE_CACHE_SIZE - 1) >>
				       PAGE_CACHE_SHIFT;
		}

		/*
		 * If we had a short copy we need to release the excess
		 * delalloc bytes we reserved. We need to increment
		 * outstanding_extents because btrfs_delalloc_release_space
		 * will decrement it, but we still have an outstanding extent
		 * for the chunk we actually managed to copy.
		 */
		if (num_pages > dirty_pages) {
			release_bytes = (num_pages - dirty_pages) <<
					PAGE_CACHE_SHIFT;
			if (copied > 0) {
				spin_lock(&BTRFS_I(inode)->lock);
				BTRFS_I(inode)->outstanding_extents++;
				spin_unlock(&BTRFS_I(inode)->lock);
			}
			if (only_release_metadata)
				btrfs_delalloc_release_metadata(inode,
								release_bytes);
			else
				btrfs_delalloc_release_space(inode,
							     release_bytes);
		}

		release_bytes = dirty_pages << PAGE_CACHE_SHIFT;
		if (copied > 0) {
			ret = btrfs_dirty_pages(root, inode, pages,
						dirty_pages, pos, copied,
						NULL);
			if (ret) {
				btrfs_drop_pages(pages, num_pages);
				break;
			}
		}

		release_bytes = 0;
		btrfs_drop_pages(pages, num_pages);

		if (only_release_metadata && copied > 0) {
			u64 lockstart = round_down(pos, root->sectorsize);
			u64 lockend = lockstart +
				(dirty_pages << PAGE_CACHE_SHIFT) - 1;

			set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
				       lockend, EXTENT_NORESERVE, NULL,
				       NULL, GFP_NOFS);
			only_release_metadata = false;
		}

		cond_resched();

		balance_dirty_pages_ratelimited(inode->i_mapping);
		if (dirty_pages < (root->leafsize >> PAGE_CACHE_SHIFT) + 1)
			btrfs_btree_balance_dirty(root);

		pos += copied;
		num_written += copied;
	}

	kfree(pages);

	if (release_bytes) {
		if (only_release_metadata)
			btrfs_delalloc_release_metadata(inode, release_bytes);
		else
			btrfs_delalloc_release_space(inode, release_bytes);
	}

	return num_written ? num_written : ret;
}

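/*
 * O_DIRECT write path.  If the direct write comes up short, fall back to
 * buffered writes for whatever is left, then write back and invalidate
 * that part of the page cache so a later direct read sees the new data.
 */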
static ssize_t __btrfs_direct_write(struct kiocb *iocb,
				    const struct iovec *iov,
				    unsigned long nr_segs, loff_t pos,
				    loff_t *ppos, size_t count, size_t ocount)
{
	struct file *file = iocb->ki_filp;
	struct iov_iter i;
	ssize_t written;
	ssize_t written_buffered;
	loff_t endbyte;
	int err;

	written = generic_file_direct_write(iocb, iov, &nr_segs, pos, ppos,
					    count, ocount);

	if (written < 0 || written == count)
		return written;

	pos += written;
	count -= written;
	iov_iter_init(&i, iov, nr_segs, count, written);
	written_buffered = __btrfs_buffered_write(file, &i, pos);
	if (written_buffered < 0) {
		err = written_buffered;
		goto out;
	}
	endbyte = pos + written_buffered - 1;
	err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
	if (err)
		goto out;
	written += written_buffered;
	*ppos = pos + written_buffered;
	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_CACHE_SHIFT,
				 endbyte >> PAGE_CACHE_SHIFT);
out:
	return written ? written : err;
}

static void update_time_for_write(struct inode *inode)
{
	struct timespec now;

	if (IS_NOCMTIME(inode))
		return;

	now = current_fs_time(inode->i_sb);
	if (!timespec_equal(&inode->i_mtime, &now))
		inode->i_mtime = now;

	if (!timespec_equal(&inode->i_ctime, &now))
		inode->i_ctime = now;

	if (IS_I_VERSION(inode))
		inode_inc_iversion(inode);
}

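/*
 * Common entry point for buffered and O_DIRECT writes: validate and trim
 * the request, update the timestamps, extend the file if the write starts
 * beyond i_size, hand off to the direct or buffered path, and finally let
 * generic_write_sync() take care of O_SYNC/O_DSYNC semantics.
 */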
1610 static ssize_t btrfs_file_aio_write(struct kiocb *iocb, 1611 static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
1611 const struct iovec *iov, 1612 const struct iovec *iov,
1612 unsigned long nr_segs, loff_t pos) 1613 unsigned long nr_segs, loff_t pos)
1613 { 1614 {
1614 struct file *file = iocb->ki_filp; 1615 struct file *file = iocb->ki_filp;
1615 struct inode *inode = file_inode(file); 1616 struct inode *inode = file_inode(file);
1616 struct btrfs_root *root = BTRFS_I(inode)->root; 1617 struct btrfs_root *root = BTRFS_I(inode)->root;
1617 loff_t *ppos = &iocb->ki_pos; 1618 loff_t *ppos = &iocb->ki_pos;
1618 u64 start_pos; 1619 u64 start_pos;
1619 ssize_t num_written = 0; 1620 ssize_t num_written = 0;
1620 ssize_t err = 0; 1621 ssize_t err = 0;
1621 size_t count, ocount; 1622 size_t count, ocount;
1622 bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host); 1623 bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
1623 1624
1624 mutex_lock(&inode->i_mutex); 1625 mutex_lock(&inode->i_mutex);
1625 1626
1626 err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ); 1627 err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
1627 if (err) { 1628 if (err) {
1628 mutex_unlock(&inode->i_mutex); 1629 mutex_unlock(&inode->i_mutex);
1629 goto out; 1630 goto out;
1630 } 1631 }
1631 count = ocount; 1632 count = ocount;
1632 1633
1633 current->backing_dev_info = inode->i_mapping->backing_dev_info; 1634 current->backing_dev_info = inode->i_mapping->backing_dev_info;
1634 err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); 1635 err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
1635 if (err) { 1636 if (err) {
1636 mutex_unlock(&inode->i_mutex); 1637 mutex_unlock(&inode->i_mutex);
1637 goto out; 1638 goto out;
1638 } 1639 }
1639 1640
1640 if (count == 0) { 1641 if (count == 0) {
1641 mutex_unlock(&inode->i_mutex); 1642 mutex_unlock(&inode->i_mutex);
1642 goto out; 1643 goto out;
1643 } 1644 }
1644 1645
1645 err = file_remove_suid(file); 1646 err = file_remove_suid(file);
1646 if (err) { 1647 if (err) {
1647 mutex_unlock(&inode->i_mutex); 1648 mutex_unlock(&inode->i_mutex);
1648 goto out; 1649 goto out;
1649 } 1650 }
1650 1651
1651 /* 1652 /*
1652 * If BTRFS flips readonly due to some impossible error 1653 * If BTRFS flips readonly due to some impossible error
1653 * (fs_info->fs_state now has BTRFS_SUPER_FLAG_ERROR), 1654 * (fs_info->fs_state now has BTRFS_SUPER_FLAG_ERROR),
1654 * although we have opened a file as writable, we have 1655 * although we have opened a file as writable, we have
1655 * to stop this write operation to ensure FS consistency. 1656 * to stop this write operation to ensure FS consistency.
1656 */ 1657 */
1657 if (test_bit(BTRFS_FS_STATE_ERROR, &root->fs_info->fs_state)) { 1658 if (test_bit(BTRFS_FS_STATE_ERROR, &root->fs_info->fs_state)) {
1658 mutex_unlock(&inode->i_mutex); 1659 mutex_unlock(&inode->i_mutex);
1659 err = -EROFS; 1660 err = -EROFS;
1660 goto out; 1661 goto out;
1661 } 1662 }
1662 1663
1663 /* 1664 /*
1664 * We reserve space for updating the inode when we reserve space for the 1665 * We reserve space for updating the inode when we reserve space for the
1665 * extent we are going to write, so we will enospc out there. We don't 1666 * extent we are going to write, so we will enospc out there. We don't
1666 * need to start yet another transaction to update the inode as we will 1667 * need to start yet another transaction to update the inode as we will
1667 * update the inode when we finish writing whatever data we write. 1668 * update the inode when we finish writing whatever data we write.
1668 */ 1669 */
1669 update_time_for_write(inode); 1670 update_time_for_write(inode);
1670 1671
1671 start_pos = round_down(pos, root->sectorsize); 1672 start_pos = round_down(pos, root->sectorsize);
1672 if (start_pos > i_size_read(inode)) { 1673 if (start_pos > i_size_read(inode)) {
1673 err = btrfs_cont_expand(inode, i_size_read(inode), start_pos); 1674 err = btrfs_cont_expand(inode, i_size_read(inode), start_pos);
1674 if (err) { 1675 if (err) {
1675 mutex_unlock(&inode->i_mutex); 1676 mutex_unlock(&inode->i_mutex);
1676 goto out; 1677 goto out;
1677 } 1678 }
1678 } 1679 }
1679 1680
1680 if (sync) 1681 if (sync)
1681 atomic_inc(&BTRFS_I(inode)->sync_writers); 1682 atomic_inc(&BTRFS_I(inode)->sync_writers);
1682 1683
1683 if (unlikely(file->f_flags & O_DIRECT)) { 1684 if (unlikely(file->f_flags & O_DIRECT)) {
1684 num_written = __btrfs_direct_write(iocb, iov, nr_segs, 1685 num_written = __btrfs_direct_write(iocb, iov, nr_segs,
1685 pos, ppos, count, ocount); 1686 pos, ppos, count, ocount);
1686 } else { 1687 } else {
1687 struct iov_iter i; 1688 struct iov_iter i;
1688 1689
1689 iov_iter_init(&i, iov, nr_segs, count, num_written); 1690 iov_iter_init(&i, iov, nr_segs, count, num_written);
1690 1691
1691 num_written = __btrfs_buffered_write(file, &i, pos); 1692 num_written = __btrfs_buffered_write(file, &i, pos);
1692 if (num_written > 0) 1693 if (num_written > 0)
1693 *ppos = pos + num_written; 1694 *ppos = pos + num_written;
1694 } 1695 }
1695 1696
1696 mutex_unlock(&inode->i_mutex); 1697 mutex_unlock(&inode->i_mutex);
1697 1698
1698 /* 1699 /*
1699 * we want to make sure fsync finds this change 1700 * we want to make sure fsync finds this change
1700 * but we haven't joined a transaction running right now. 1701 * but we haven't joined a transaction running right now.
1701 * 1702 *
1702 * Later on, someone is sure to update the inode and get the 1703 * Later on, someone is sure to update the inode and get the
1703 * real transid recorded. 1704 * real transid recorded.
1704 * 1705 *
1705 * We set last_trans now to the fs_info generation + 1, 1706 * We set last_trans now to the fs_info generation + 1,
1706 * this will either be one more than the running transaction 1707 * this will either be one more than the running transaction
1707 * or the generation used for the next transaction if there isn't 1708 * or the generation used for the next transaction if there isn't
1708 * one running right now. 1709 * one running right now.
1709 * 1710 *
1710 * We also have to set last_sub_trans to the current log transid, 1711 * We also have to set last_sub_trans to the current log transid,
1711 * otherwise subsequent syncs to a file that's been synced in this 1712 * otherwise subsequent syncs to a file that's been synced in this
1712 * transaction will appear to have already occurred. 1713 * transaction will appear to have already occurred.
1713 */ 1714 */
1714 BTRFS_I(inode)->last_trans = root->fs_info->generation + 1; 1715 BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
1715 BTRFS_I(inode)->last_sub_trans = root->log_transid; 1716 BTRFS_I(inode)->last_sub_trans = root->log_transid;
1716 if (num_written > 0) { 1717 if (num_written > 0) {
1717 err = generic_write_sync(file, pos, num_written); 1718 err = generic_write_sync(file, pos, num_written);
1718 if (err < 0 && num_written > 0) 1719 if (err < 0 && num_written > 0)
1719 num_written = err; 1720 num_written = err;
1720 } 1721 }
1721 1722
1722 if (sync) 1723 if (sync)
1723 atomic_dec(&BTRFS_I(inode)->sync_writers); 1724 atomic_dec(&BTRFS_I(inode)->sync_writers);
1724 out: 1725 out:
1725 current->backing_dev_info = NULL; 1726 current->backing_dev_info = NULL;
1726 return num_written ? num_written : err; 1727 return num_written ? num_written : err;
1727 } 1728 }
1728 1729
1729 int btrfs_release_file(struct inode *inode, struct file *filp) 1730 int btrfs_release_file(struct inode *inode, struct file *filp)
1730 { 1731 {
1731 /* 1732 /*
1732 * ordered_data_close is set by setattr when we are about to truncate 1733 * ordered_data_close is set by setattr when we are about to truncate
1733 * a file from a non-zero size to a zero size. This tries to 1734 * a file from a non-zero size to a zero size. This tries to
1734 * flush down new bytes that may have been written if the 1735 * flush down new bytes that may have been written if the
1735 * application were using truncate to replace a file in place. 1736 * application were using truncate to replace a file in place.
1736 */ 1737 */
1737 if (test_and_clear_bit(BTRFS_INODE_ORDERED_DATA_CLOSE, 1738 if (test_and_clear_bit(BTRFS_INODE_ORDERED_DATA_CLOSE,
1738 &BTRFS_I(inode)->runtime_flags)) { 1739 &BTRFS_I(inode)->runtime_flags)) {
1739 struct btrfs_trans_handle *trans; 1740 struct btrfs_trans_handle *trans;
1740 struct btrfs_root *root = BTRFS_I(inode)->root; 1741 struct btrfs_root *root = BTRFS_I(inode)->root;
1741 1742
1742 /* 1743 /*
1743 * We need to block on a committing transaction to keep us from 1744 * We need to block on a committing transaction to keep us from
1744 * throwing an ordered operation onto the list and causing 1745 * throwing an ordered operation onto the list and causing
1745 * something like sync to deadlock trying to flush out this 1746 * something like sync to deadlock trying to flush out this
1746 * inode. 1747 * inode.
1747 */ 1748 */
1748 trans = btrfs_start_transaction(root, 0); 1749 trans = btrfs_start_transaction(root, 0);
1749 if (IS_ERR(trans)) 1750 if (IS_ERR(trans))
1750 return PTR_ERR(trans); 1751 return PTR_ERR(trans);
1751 btrfs_add_ordered_operation(trans, BTRFS_I(inode)->root, inode); 1752 btrfs_add_ordered_operation(trans, BTRFS_I(inode)->root, inode);
1752 btrfs_end_transaction(trans, root); 1753 btrfs_end_transaction(trans, root);
1753 if (inode->i_size > BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT) 1754 if (inode->i_size > BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT)
1754 filemap_flush(inode->i_mapping); 1755 filemap_flush(inode->i_mapping);
1755 } 1756 }
1756 if (filp->private_data) 1757 if (filp->private_data)
1757 btrfs_ioctl_trans_end(filp); 1758 btrfs_ioctl_trans_end(filp);
1758 return 0; 1759 return 0;
1759 } 1760 }
1760 1761
1761 /* 1762 /*
1762 * fsync call for both files and directories. This logs the inode into 1763 * fsync call for both files and directories. This logs the inode into
1763 * the tree log instead of forcing full commits whenever possible. 1764 * the tree log instead of forcing full commits whenever possible.
1764 * 1765 *
1765 * It needs to call filemap_fdatawait so that all ordered extent updates 1766 * It needs to call filemap_fdatawait so that all ordered extent updates
1766 * in the metadata btree are up to date for copying to the log. 1767 * in the metadata btree are up to date for copying to the log.
1767 * 1768 *
1768 * It drops the inode mutex before doing the tree log commit. This is an 1769 * It drops the inode mutex before doing the tree log commit. This is an
1769 * important optimization for directories because holding the mutex prevents 1770 * important optimization for directories because holding the mutex prevents
1770 * new operations on the dir while we write to disk. 1771 * new operations on the dir while we write to disk.
1771 */ 1772 */
1772 int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) 1773 int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
1773 { 1774 {
1774 struct dentry *dentry = file->f_path.dentry; 1775 struct dentry *dentry = file->f_path.dentry;
1775 struct inode *inode = dentry->d_inode; 1776 struct inode *inode = dentry->d_inode;
1776 struct btrfs_root *root = BTRFS_I(inode)->root; 1777 struct btrfs_root *root = BTRFS_I(inode)->root;
1777 int ret = 0; 1778 int ret = 0;
1778 struct btrfs_trans_handle *trans; 1779 struct btrfs_trans_handle *trans;
1779 bool full_sync = 0; 1780 bool full_sync = 0;
1780 1781
1781 trace_btrfs_sync_file(file, datasync); 1782 trace_btrfs_sync_file(file, datasync);
1782 1783
1783 /* 1784 /*
1784 * We write the dirty pages in the range and wait until they complete 1785 * We write the dirty pages in the range and wait until they complete
1785 * outside of the ->i_mutex, so that multiple tasks can flush dirty 1786 * outside of the ->i_mutex, so that multiple tasks can flush dirty
1786 * pages concurrently and improve performance. See 1787 * pages concurrently and improve performance. See
1787 * btrfs_wait_ordered_range for an explanation of the ASYNC check. 1788 * btrfs_wait_ordered_range for an explanation of the ASYNC check.
1788 */ 1789 */
1789 atomic_inc(&BTRFS_I(inode)->sync_writers); 1790 atomic_inc(&BTRFS_I(inode)->sync_writers);
1790 ret = filemap_fdatawrite_range(inode->i_mapping, start, end); 1791 ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
1791 if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, 1792 if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
1792 &BTRFS_I(inode)->runtime_flags)) 1793 &BTRFS_I(inode)->runtime_flags))
1793 ret = filemap_fdatawrite_range(inode->i_mapping, start, end); 1794 ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
1794 atomic_dec(&BTRFS_I(inode)->sync_writers); 1795 atomic_dec(&BTRFS_I(inode)->sync_writers);
1795 if (ret) 1796 if (ret)
1796 return ret; 1797 return ret;
1797 1798
1798 mutex_lock(&inode->i_mutex); 1799 mutex_lock(&inode->i_mutex);
1799 1800
1800 /* 1801 /*
1801 * We flush the dirty pages again to avoid some dirty pages in the 1802 * We flush the dirty pages again to avoid some dirty pages in the
1802 * range being left. 1803 * range being left.
1803 */ 1804 */
1804 atomic_inc(&root->log_batch); 1805 atomic_inc(&root->log_batch);
1805 full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, 1806 full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
1806 &BTRFS_I(inode)->runtime_flags); 1807 &BTRFS_I(inode)->runtime_flags);
1807 if (full_sync) 1808 if (full_sync)
1808 btrfs_wait_ordered_range(inode, start, end - start + 1); 1809 btrfs_wait_ordered_range(inode, start, end - start + 1);
1809 atomic_inc(&root->log_batch); 1810 atomic_inc(&root->log_batch);
1810 1811
1811 /* 1812 /*
1812 * check the transaction that last modified this inode 1813 * check the transaction that last modified this inode
1813 * and see if it's already been committed 1814 * and see if it's already been committed
1814 */ 1815 */
1815 if (!BTRFS_I(inode)->last_trans) { 1816 if (!BTRFS_I(inode)->last_trans) {
1816 mutex_unlock(&inode->i_mutex); 1817 mutex_unlock(&inode->i_mutex);
1817 goto out; 1818 goto out;
1818 } 1819 }
1819 1820
1820 /* 1821 /*
1821 * if the last transaction that changed this file was before 1822 * if the last transaction that changed this file was before
1822 * the current transaction, we can bail out now without any 1823 * the current transaction, we can bail out now without any
1823 * syncing 1824 * syncing
1824 */ 1825 */
1825 smp_mb(); 1826 smp_mb();
1826 if (btrfs_inode_in_log(inode, root->fs_info->generation) || 1827 if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
1827 BTRFS_I(inode)->last_trans <= 1828 BTRFS_I(inode)->last_trans <=
1828 root->fs_info->last_trans_committed) { 1829 root->fs_info->last_trans_committed) {
1829 BTRFS_I(inode)->last_trans = 0; 1830 BTRFS_I(inode)->last_trans = 0;
1830 1831
1831 /* 1832 /*
1832 * We've had everything committed since the last time we were 1833 * We've had everything committed since the last time we were
1833 * modified so clear this flag in case it was set for whatever 1834 * modified so clear this flag in case it was set for whatever
1834 * reason, it's no longer relevant. 1835 * reason, it's no longer relevant.
1835 */ 1836 */
1836 clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC, 1837 clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
1837 &BTRFS_I(inode)->runtime_flags); 1838 &BTRFS_I(inode)->runtime_flags);
1838 mutex_unlock(&inode->i_mutex); 1839 mutex_unlock(&inode->i_mutex);
1839 goto out; 1840 goto out;
1840 } 1841 }
1841 1842
1842 /* 1843 /*
1843 * ok we haven't committed the transaction yet, let's do a commit 1844 * ok we haven't committed the transaction yet, let's do a commit
1844 */ 1845 */
1845 if (file->private_data) 1846 if (file->private_data)
1846 btrfs_ioctl_trans_end(file); 1847 btrfs_ioctl_trans_end(file);
1847 1848
1848 trans = btrfs_start_transaction(root, 0); 1849 trans = btrfs_start_transaction(root, 0);
1849 if (IS_ERR(trans)) { 1850 if (IS_ERR(trans)) {
1850 ret = PTR_ERR(trans); 1851 ret = PTR_ERR(trans);
1851 mutex_unlock(&inode->i_mutex); 1852 mutex_unlock(&inode->i_mutex);
1852 goto out; 1853 goto out;
1853 } 1854 }
1854 1855
1855 ret = btrfs_log_dentry_safe(trans, root, dentry); 1856 ret = btrfs_log_dentry_safe(trans, root, dentry);
1856 if (ret < 0) { 1857 if (ret < 0) {
1857 /* Fallthrough and commit/free transaction. */ 1858 /* Fallthrough and commit/free transaction. */
1858 ret = 1; 1859 ret = 1;
1859 } 1860 }
1860 1861
1861 /* we've logged all the items and now have a consistent 1862 /* we've logged all the items and now have a consistent
1862 * version of the file in the log. It is possible that 1863 * version of the file in the log. It is possible that
1863 * someone will come in and modify the file, but that's 1864 * someone will come in and modify the file, but that's
1864 * fine because the log is consistent on disk, and we 1865 * fine because the log is consistent on disk, and we
1865 * have references to all of the file's extents 1866 * have references to all of the file's extents
1866 * 1867 *
1867 * It is possible that someone will come in and log the 1868 * It is possible that someone will come in and log the
1868 * file again, but that will end up using the synchronization 1869 * file again, but that will end up using the synchronization
1869 * inside btrfs_sync_log to keep things safe. 1870 * inside btrfs_sync_log to keep things safe.
1870 */ 1871 */
1871 mutex_unlock(&inode->i_mutex); 1872 mutex_unlock(&inode->i_mutex);
1872 1873
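	/*
	 * Unless logging was skipped entirely (BTRFS_NO_LOG_SYNC), either sync
	 * the tree log (ret == 0) or fall back to a full transaction commit
	 * (ret > 0, e.g. when logging the inode failed).
	 */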
1873 if (ret != BTRFS_NO_LOG_SYNC) { 1874 if (ret != BTRFS_NO_LOG_SYNC) {
1874 if (ret > 0) { 1875 if (ret > 0) {
1875 /* 1876 /*
1876 * If we didn't already wait for ordered extents we need 1877 * If we didn't already wait for ordered extents we need
1877 * to do that now. 1878 * to do that now.
1878 */ 1879 */
1879 if (!full_sync) 1880 if (!full_sync)
1880 btrfs_wait_ordered_range(inode, start, 1881 btrfs_wait_ordered_range(inode, start,
1881 end - start + 1); 1882 end - start + 1);
1882 ret = btrfs_commit_transaction(trans, root); 1883 ret = btrfs_commit_transaction(trans, root);
1883 } else { 1884 } else {
1884 ret = btrfs_sync_log(trans, root); 1885 ret = btrfs_sync_log(trans, root);
1885 if (ret == 0) { 1886 if (ret == 0) {
1886 ret = btrfs_end_transaction(trans, root); 1887 ret = btrfs_end_transaction(trans, root);
1887 } else { 1888 } else {
1888 if (!full_sync) 1889 if (!full_sync)
1889 btrfs_wait_ordered_range(inode, start, 1890 btrfs_wait_ordered_range(inode, start,
1890 end - 1891 end -
1891 start + 1); 1892 start + 1);
1892 ret = btrfs_commit_transaction(trans, root); 1893 ret = btrfs_commit_transaction(trans, root);
1893 } 1894 }
1894 } 1895 }
1895 } else { 1896 } else {
1896 ret = btrfs_end_transaction(trans, root); 1897 ret = btrfs_end_transaction(trans, root);
1897 } 1898 }
1898 out: 1899 out:
1899 return ret > 0 ? -EIO : ret; 1900 return ret > 0 ? -EIO : ret;
1900 } 1901 }
1901 1902
1902 static const struct vm_operations_struct btrfs_file_vm_ops = { 1903 static const struct vm_operations_struct btrfs_file_vm_ops = {
1903 .fault = filemap_fault, 1904 .fault = filemap_fault,
1904 .page_mkwrite = btrfs_page_mkwrite, 1905 .page_mkwrite = btrfs_page_mkwrite,
1905 .remap_pages = generic_file_remap_pages, 1906 .remap_pages = generic_file_remap_pages,
1906 }; 1907 };
1907 1908
1908 static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma) 1909 static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma)
1909 { 1910 {
1910 struct address_space *mapping = filp->f_mapping; 1911 struct address_space *mapping = filp->f_mapping;
1911 1912
1912 if (!mapping->a_ops->readpage) 1913 if (!mapping->a_ops->readpage)
1913 return -ENOEXEC; 1914 return -ENOEXEC;
1914 1915
1915 file_accessed(filp); 1916 file_accessed(filp);
1916 vma->vm_ops = &btrfs_file_vm_ops; 1917 vma->vm_ops = &btrfs_file_vm_ops;
1917 1918
1918 return 0; 1919 return 0;
1919 } 1920 }
1920 1921
1921 static int hole_mergeable(struct inode *inode, struct extent_buffer *leaf, 1922 static int hole_mergeable(struct inode *inode, struct extent_buffer *leaf,
1922 int slot, u64 start, u64 end) 1923 int slot, u64 start, u64 end)
1923 { 1924 {
1924 struct btrfs_file_extent_item *fi; 1925 struct btrfs_file_extent_item *fi;
1925 struct btrfs_key key; 1926 struct btrfs_key key;
1926 1927
1927 if (slot < 0 || slot >= btrfs_header_nritems(leaf)) 1928 if (slot < 0 || slot >= btrfs_header_nritems(leaf))
1928 return 0; 1929 return 0;
1929 1930
1930 btrfs_item_key_to_cpu(leaf, &key, slot); 1931 btrfs_item_key_to_cpu(leaf, &key, slot);
1931 if (key.objectid != btrfs_ino(inode) || 1932 if (key.objectid != btrfs_ino(inode) ||
1932 key.type != BTRFS_EXTENT_DATA_KEY) 1933 key.type != BTRFS_EXTENT_DATA_KEY)
1933 return 0; 1934 return 0;
1934 1935
1935 fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item); 1936 fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
1936 1937
1937 if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG) 1938 if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG)
1938 return 0; 1939 return 0;
1939 1940
1940 if (btrfs_file_extent_disk_bytenr(leaf, fi)) 1941 if (btrfs_file_extent_disk_bytenr(leaf, fi))
1941 return 0; 1942 return 0;
1942 1943
1943 if (key.offset == end) 1944 if (key.offset == end)
1944 return 1; 1945 return 1;
1945 if (key.offset + btrfs_file_extent_num_bytes(leaf, fi) == start) 1946 if (key.offset + btrfs_file_extent_num_bytes(leaf, fi) == start)
1946 return 1; 1947 return 1;
1947 return 0; 1948 return 0;
1948 } 1949 }
1949 1950
1950 static int fill_holes(struct btrfs_trans_handle *trans, struct inode *inode, 1951 static int fill_holes(struct btrfs_trans_handle *trans, struct inode *inode,
1951 struct btrfs_path *path, u64 offset, u64 end) 1952 struct btrfs_path *path, u64 offset, u64 end)
1952 { 1953 {
1953 struct btrfs_root *root = BTRFS_I(inode)->root; 1954 struct btrfs_root *root = BTRFS_I(inode)->root;
1954 struct extent_buffer *leaf; 1955 struct extent_buffer *leaf;
1955 struct btrfs_file_extent_item *fi; 1956 struct btrfs_file_extent_item *fi;
1956 struct extent_map *hole_em; 1957 struct extent_map *hole_em;
1957 struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; 1958 struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
1958 struct btrfs_key key; 1959 struct btrfs_key key;
1959 int ret; 1960 int ret;
1960 1961
1961 key.objectid = btrfs_ino(inode); 1962 key.objectid = btrfs_ino(inode);
1962 key.type = BTRFS_EXTENT_DATA_KEY; 1963 key.type = BTRFS_EXTENT_DATA_KEY;
1963 key.offset = offset; 1964 key.offset = offset;
1964 1965
1965 1966
1966 ret = btrfs_search_slot(trans, root, &key, path, 0, 1); 1967 ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
1967 if (ret < 0) 1968 if (ret < 0)
1968 return ret; 1969 return ret;
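	/*
	 * An exact key match (ret == 0) would mean a file extent item already
	 * starts at 'offset'; the caller has just dropped every extent in this
	 * range, so that should be impossible.
	 */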
1969 BUG_ON(!ret); 1970 BUG_ON(!ret);
1970 1971
1971 leaf = path->nodes[0]; 1972 leaf = path->nodes[0];
1972 if (hole_mergeable(inode, leaf, path->slots[0]-1, offset, end)) { 1973 if (hole_mergeable(inode, leaf, path->slots[0]-1, offset, end)) {
1973 u64 num_bytes; 1974 u64 num_bytes;
1974 1975
1975 path->slots[0]--; 1976 path->slots[0]--;
1976 fi = btrfs_item_ptr(leaf, path->slots[0], 1977 fi = btrfs_item_ptr(leaf, path->slots[0],
1977 struct btrfs_file_extent_item); 1978 struct btrfs_file_extent_item);
1978 num_bytes = btrfs_file_extent_num_bytes(leaf, fi) + 1979 num_bytes = btrfs_file_extent_num_bytes(leaf, fi) +
1979 end - offset; 1980 end - offset;
1980 btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes); 1981 btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes);
1981 btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes); 1982 btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes);
1982 btrfs_set_file_extent_offset(leaf, fi, 0); 1983 btrfs_set_file_extent_offset(leaf, fi, 0);
1983 btrfs_mark_buffer_dirty(leaf); 1984 btrfs_mark_buffer_dirty(leaf);
1984 goto out; 1985 goto out;
1985 } 1986 }
1986 1987
1987 if (hole_mergeable(inode, leaf, path->slots[0]+1, offset, end)) { 1988 if (hole_mergeable(inode, leaf, path->slots[0]+1, offset, end)) {
1988 u64 num_bytes; 1989 u64 num_bytes;
1989 1990
1990 path->slots[0]++; 1991 path->slots[0]++;
1991 key.offset = offset; 1992 key.offset = offset;
1992 btrfs_set_item_key_safe(root, path, &key); 1993 btrfs_set_item_key_safe(root, path, &key);
1993 fi = btrfs_item_ptr(leaf, path->slots[0], 1994 fi = btrfs_item_ptr(leaf, path->slots[0],
1994 struct btrfs_file_extent_item); 1995 struct btrfs_file_extent_item);
1995 num_bytes = btrfs_file_extent_num_bytes(leaf, fi) + end - 1996 num_bytes = btrfs_file_extent_num_bytes(leaf, fi) + end -
1996 offset; 1997 offset;
1997 btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes); 1998 btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes);
1998 btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes); 1999 btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes);
1999 btrfs_set_file_extent_offset(leaf, fi, 0); 2000 btrfs_set_file_extent_offset(leaf, fi, 0);
2000 btrfs_mark_buffer_dirty(leaf); 2001 btrfs_mark_buffer_dirty(leaf);
2001 goto out; 2002 goto out;
2002 } 2003 }
2003 btrfs_release_path(path); 2004 btrfs_release_path(path);
2004 2005
2005 ret = btrfs_insert_file_extent(trans, root, btrfs_ino(inode), offset, 2006 ret = btrfs_insert_file_extent(trans, root, btrfs_ino(inode), offset,
2006 0, 0, end - offset, 0, end - offset, 2007 0, 0, end - offset, 0, end - offset,
2007 0, 0, 0); 2008 0, 0, 0);
2008 if (ret) 2009 if (ret)
2009 return ret; 2010 return ret;
2010 2011
2011 out: 2012 out:
2012 btrfs_release_path(path); 2013 btrfs_release_path(path);
2013 2014
2014 hole_em = alloc_extent_map(); 2015 hole_em = alloc_extent_map();
2015 if (!hole_em) { 2016 if (!hole_em) {
2016 btrfs_drop_extent_cache(inode, offset, end - 1, 0); 2017 btrfs_drop_extent_cache(inode, offset, end - 1, 0);
2017 set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, 2018 set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
2018 &BTRFS_I(inode)->runtime_flags); 2019 &BTRFS_I(inode)->runtime_flags);
2019 } else { 2020 } else {
2020 hole_em->start = offset; 2021 hole_em->start = offset;
2021 hole_em->len = end - offset; 2022 hole_em->len = end - offset;
2022 hole_em->ram_bytes = hole_em->len; 2023 hole_em->ram_bytes = hole_em->len;
2023 hole_em->orig_start = offset; 2024 hole_em->orig_start = offset;
2024 2025
2025 hole_em->block_start = EXTENT_MAP_HOLE; 2026 hole_em->block_start = EXTENT_MAP_HOLE;
2026 hole_em->block_len = 0; 2027 hole_em->block_len = 0;
2027 hole_em->orig_block_len = 0; 2028 hole_em->orig_block_len = 0;
2028 hole_em->bdev = root->fs_info->fs_devices->latest_bdev; 2029 hole_em->bdev = root->fs_info->fs_devices->latest_bdev;
2029 hole_em->compress_type = BTRFS_COMPRESS_NONE; 2030 hole_em->compress_type = BTRFS_COMPRESS_NONE;
2030 hole_em->generation = trans->transid; 2031 hole_em->generation = trans->transid;
2031 2032
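		/*
		 * A racing read can re-insert a cached extent for this range,
		 * so keep dropping the extent cache until the hole extent can
		 * be added (add_extent_mapping returns -EEXIST on overlap).
		 */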
2032 do { 2033 do {
2033 btrfs_drop_extent_cache(inode, offset, end - 1, 0); 2034 btrfs_drop_extent_cache(inode, offset, end - 1, 0);
2034 write_lock(&em_tree->lock); 2035 write_lock(&em_tree->lock);
2035 ret = add_extent_mapping(em_tree, hole_em, 1); 2036 ret = add_extent_mapping(em_tree, hole_em, 1);
2036 write_unlock(&em_tree->lock); 2037 write_unlock(&em_tree->lock);
2037 } while (ret == -EEXIST); 2038 } while (ret == -EEXIST);
2038 free_extent_map(hole_em); 2039 free_extent_map(hole_em);
2039 if (ret) 2040 if (ret)
2040 set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, 2041 set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
2041 &BTRFS_I(inode)->runtime_flags); 2042 &BTRFS_I(inode)->runtime_flags);
2042 } 2043 }
2043 2044
2044 return 0; 2045 return 0;
2045 } 2046 }
2046 2047
2047 static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) 2048 static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
2048 { 2049 {
2049 struct btrfs_root *root = BTRFS_I(inode)->root; 2050 struct btrfs_root *root = BTRFS_I(inode)->root;
2050 struct extent_state *cached_state = NULL; 2051 struct extent_state *cached_state = NULL;
2051 struct btrfs_path *path; 2052 struct btrfs_path *path;
2052 struct btrfs_block_rsv *rsv; 2053 struct btrfs_block_rsv *rsv;
2053 struct btrfs_trans_handle *trans; 2054 struct btrfs_trans_handle *trans;
2054 u64 lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize); 2055 u64 lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize);
2055 u64 lockend = round_down(offset + len, 2056 u64 lockend = round_down(offset + len,
2056 BTRFS_I(inode)->root->sectorsize) - 1; 2057 BTRFS_I(inode)->root->sectorsize) - 1;
2057 u64 cur_offset = lockstart; 2058 u64 cur_offset = lockstart;
2058 u64 min_size = btrfs_calc_trunc_metadata_size(root, 1); 2059 u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
2059 u64 drop_end; 2060 u64 drop_end;
2060 int ret = 0; 2061 int ret = 0;
2061 int err = 0; 2062 int err = 0;
2062 bool same_page = ((offset >> PAGE_CACHE_SHIFT) == 2063 bool same_page = ((offset >> PAGE_CACHE_SHIFT) ==
2063 ((offset + len - 1) >> PAGE_CACHE_SHIFT)); 2064 ((offset + len - 1) >> PAGE_CACHE_SHIFT));
2064 2065
2065 btrfs_wait_ordered_range(inode, offset, len); 2066 btrfs_wait_ordered_range(inode, offset, len);
2066 2067
2067 mutex_lock(&inode->i_mutex); 2068 mutex_lock(&inode->i_mutex);
2068 /* 2069 /*
2069 * We needn't truncate any page which is beyond the end of the file 2070 * We needn't truncate any page which is beyond the end of the file
2070 * because we are sure there is no data there. 2071 * because we are sure there is no data there.
2071 */ 2072 */
2072 /* 2073 /*
2073 * Only do this if we are in the same page and we aren't doing the 2074 * Only do this if we are in the same page and we aren't doing the
2074 * entire page. 2075 * entire page.
2075 */ 2076 */
2076 if (same_page && len < PAGE_CACHE_SIZE) { 2077 if (same_page && len < PAGE_CACHE_SIZE) {
2077 if (offset < round_up(inode->i_size, PAGE_CACHE_SIZE)) 2078 if (offset < round_up(inode->i_size, PAGE_CACHE_SIZE))
2078 ret = btrfs_truncate_page(inode, offset, len, 0); 2079 ret = btrfs_truncate_page(inode, offset, len, 0);
2079 mutex_unlock(&inode->i_mutex); 2080 mutex_unlock(&inode->i_mutex);
2080 return ret; 2081 return ret;
2081 } 2082 }
2082 2083
2083 /* zero back part of the first page */ 2084 /* zero back part of the first page */
2084 if (offset < round_up(inode->i_size, PAGE_CACHE_SIZE)) { 2085 if (offset < round_up(inode->i_size, PAGE_CACHE_SIZE)) {
2085 ret = btrfs_truncate_page(inode, offset, 0, 0); 2086 ret = btrfs_truncate_page(inode, offset, 0, 0);
2086 if (ret) { 2087 if (ret) {
2087 mutex_unlock(&inode->i_mutex); 2088 mutex_unlock(&inode->i_mutex);
2088 return ret; 2089 return ret;
2089 } 2090 }
2090 } 2091 }
2091 2092
2092 /* zero the front end of the last page */ 2093 /* zero the front end of the last page */
2093 if (offset + len < round_up(inode->i_size, PAGE_CACHE_SIZE)) { 2094 if (offset + len < round_up(inode->i_size, PAGE_CACHE_SIZE)) {
2094 ret = btrfs_truncate_page(inode, offset + len, 0, 1); 2095 ret = btrfs_truncate_page(inode, offset + len, 0, 1);
2095 if (ret) { 2096 if (ret) {
2096 mutex_unlock(&inode->i_mutex); 2097 mutex_unlock(&inode->i_mutex);
2097 return ret; 2098 return ret;
2098 } 2099 }
2099 } 2100 }
2100 2101
2101 if (lockend < lockstart) { 2102 if (lockend < lockstart) {
2102 mutex_unlock(&inode->i_mutex); 2103 mutex_unlock(&inode->i_mutex);
2103 return 0; 2104 return 0;
2104 } 2105 }
2105 2106
2106 while (1) { 2107 while (1) {
2107 struct btrfs_ordered_extent *ordered; 2108 struct btrfs_ordered_extent *ordered;
2108 2109
2109 truncate_pagecache_range(inode, lockstart, lockend); 2110 truncate_pagecache_range(inode, lockstart, lockend);
2110 2111
2111 lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, 2112 lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
2112 0, &cached_state); 2113 0, &cached_state);
2113 ordered = btrfs_lookup_first_ordered_extent(inode, lockend); 2114 ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
2114 2115
2115 /* 2116 /*
2116 * We need to make sure we have no ordered extents in this range 2117 * We need to make sure we have no ordered extents in this range
2117 * and that nobody raced in and read a page in this range; if they did, 2118 * and that nobody raced in and read a page in this range; if they did,
2118 * we need to try again. 2119 * we need to try again.
2119 */ 2120 */
2120 if ((!ordered || 2121 if ((!ordered ||
2121 (ordered->file_offset + ordered->len < lockstart || 2122 (ordered->file_offset + ordered->len < lockstart ||
2122 ordered->file_offset > lockend)) && 2123 ordered->file_offset > lockend)) &&
2123 !test_range_bit(&BTRFS_I(inode)->io_tree, lockstart, 2124 !test_range_bit(&BTRFS_I(inode)->io_tree, lockstart,
2124 lockend, EXTENT_UPTODATE, 0, 2125 lockend, EXTENT_UPTODATE, 0,
2125 cached_state)) { 2126 cached_state)) {
2126 if (ordered) 2127 if (ordered)
2127 btrfs_put_ordered_extent(ordered); 2128 btrfs_put_ordered_extent(ordered);
2128 break; 2129 break;
2129 } 2130 }
2130 if (ordered) 2131 if (ordered)
2131 btrfs_put_ordered_extent(ordered); 2132 btrfs_put_ordered_extent(ordered);
2132 unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, 2133 unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
2133 lockend, &cached_state, GFP_NOFS); 2134 lockend, &cached_state, GFP_NOFS);
2134 btrfs_wait_ordered_range(inode, lockstart, 2135 btrfs_wait_ordered_range(inode, lockstart,
2135 lockend - lockstart + 1); 2136 lockend - lockstart + 1);
2136 } 2137 }
2137 2138
2138 path = btrfs_alloc_path(); 2139 path = btrfs_alloc_path();
2139 if (!path) { 2140 if (!path) {
2140 ret = -ENOMEM; 2141 ret = -ENOMEM;
2141 goto out; 2142 goto out;
2142 } 2143 }
2143 2144
2144 rsv = btrfs_alloc_block_rsv(root, BTRFS_BLOCK_RSV_TEMP); 2145 rsv = btrfs_alloc_block_rsv(root, BTRFS_BLOCK_RSV_TEMP);
2145 if (!rsv) { 2146 if (!rsv) {
2146 ret = -ENOMEM; 2147 ret = -ENOMEM;
2147 goto out_free; 2148 goto out_free;
2148 } 2149 }
2149 rsv->size = btrfs_calc_trunc_metadata_size(root, 1); 2150 rsv->size = btrfs_calc_trunc_metadata_size(root, 1);
2150 rsv->failfast = 1; 2151 rsv->failfast = 1;
2151 2152
2152 /* 2153 /*
2153 * 1 - update the inode 2154 * 1 - update the inode
2154 * 1 - removing the extents in the range 2155 * 1 - removing the extents in the range
2155 * 1 - adding the hole extent 2156 * 1 - adding the hole extent
2156 */ 2157 */
2157 trans = btrfs_start_transaction(root, 3); 2158 trans = btrfs_start_transaction(root, 3);
2158 if (IS_ERR(trans)) { 2159 if (IS_ERR(trans)) {
2159 err = PTR_ERR(trans); 2160 err = PTR_ERR(trans);
2160 goto out_free; 2161 goto out_free;
2161 } 2162 }
2162 2163
2163 ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, rsv, 2164 ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, rsv,
2164 min_size); 2165 min_size);
2165 BUG_ON(ret); 2166 BUG_ON(ret);
2166 trans->block_rsv = rsv; 2167 trans->block_rsv = rsv;
2167 2168
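	/*
	 * Drop the extents in the range in chunks: if __btrfs_drop_extents
	 * runs out of reserved metadata space it stops at drop_end, so fill
	 * the hole covered so far and restart with a fresh transaction.
	 */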
2168 while (cur_offset < lockend) { 2169 while (cur_offset < lockend) {
2169 ret = __btrfs_drop_extents(trans, root, inode, path, 2170 ret = __btrfs_drop_extents(trans, root, inode, path,
2170 cur_offset, lockend + 1, 2171 cur_offset, lockend + 1,
2171 &drop_end, 1); 2172 &drop_end, 1);
2172 if (ret != -ENOSPC) 2173 if (ret != -ENOSPC)
2173 break; 2174 break;
2174 2175
2175 trans->block_rsv = &root->fs_info->trans_block_rsv; 2176 trans->block_rsv = &root->fs_info->trans_block_rsv;
2176 2177
2177 ret = fill_holes(trans, inode, path, cur_offset, drop_end); 2178 ret = fill_holes(trans, inode, path, cur_offset, drop_end);
2178 if (ret) { 2179 if (ret) {
2179 err = ret; 2180 err = ret;
2180 break; 2181 break;
2181 } 2182 }
2182 2183
2183 cur_offset = drop_end; 2184 cur_offset = drop_end;
2184 2185
2185 ret = btrfs_update_inode(trans, root, inode); 2186 ret = btrfs_update_inode(trans, root, inode);
2186 if (ret) { 2187 if (ret) {
2187 err = ret; 2188 err = ret;
2188 break; 2189 break;
2189 } 2190 }
2190 2191
2191 btrfs_end_transaction(trans, root); 2192 btrfs_end_transaction(trans, root);
2192 btrfs_btree_balance_dirty(root); 2193 btrfs_btree_balance_dirty(root);
2193 2194
2194 trans = btrfs_start_transaction(root, 3); 2195 trans = btrfs_start_transaction(root, 3);
2195 if (IS_ERR(trans)) { 2196 if (IS_ERR(trans)) {
2196 ret = PTR_ERR(trans); 2197 ret = PTR_ERR(trans);
2197 trans = NULL; 2198 trans = NULL;
2198 break; 2199 break;
2199 } 2200 }
2200 2201
2201 ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, 2202 ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv,
2202 rsv, min_size); 2203 rsv, min_size);
2203 BUG_ON(ret); /* shouldn't happen */ 2204 BUG_ON(ret); /* shouldn't happen */
2204 trans->block_rsv = rsv; 2205 trans->block_rsv = rsv;
2205 } 2206 }
2206 2207
2207 if (ret) { 2208 if (ret) {
2208 err = ret; 2209 err = ret;
2209 goto out_trans; 2210 goto out_trans;
2210 } 2211 }
2211 2212
2212 trans->block_rsv = &root->fs_info->trans_block_rsv; 2213 trans->block_rsv = &root->fs_info->trans_block_rsv;
2213 ret = fill_holes(trans, inode, path, cur_offset, drop_end); 2214 ret = fill_holes(trans, inode, path, cur_offset, drop_end);
2214 if (ret) { 2215 if (ret) {
2215 err = ret; 2216 err = ret;
2216 goto out_trans; 2217 goto out_trans;
2217 } 2218 }
2218 2219
2219 out_trans: 2220 out_trans:
2220 if (!trans) 2221 if (!trans)
2221 goto out_free; 2222 goto out_free;
2222 2223
2223 inode_inc_iversion(inode); 2224 inode_inc_iversion(inode);
2224 inode->i_mtime = inode->i_ctime = CURRENT_TIME; 2225 inode->i_mtime = inode->i_ctime = CURRENT_TIME;
2225 2226
2226 trans->block_rsv = &root->fs_info->trans_block_rsv; 2227 trans->block_rsv = &root->fs_info->trans_block_rsv;
2227 ret = btrfs_update_inode(trans, root, inode); 2228 ret = btrfs_update_inode(trans, root, inode);
2228 btrfs_end_transaction(trans, root); 2229 btrfs_end_transaction(trans, root);
2229 btrfs_btree_balance_dirty(root); 2230 btrfs_btree_balance_dirty(root);
2230 out_free: 2231 out_free:
2231 btrfs_free_path(path); 2232 btrfs_free_path(path);
2232 btrfs_free_block_rsv(root, rsv); 2233 btrfs_free_block_rsv(root, rsv);
2233 out: 2234 out:
2234 unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, 2235 unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend,
2235 &cached_state, GFP_NOFS); 2236 &cached_state, GFP_NOFS);
2236 mutex_unlock(&inode->i_mutex); 2237 mutex_unlock(&inode->i_mutex);
2237 if (ret && !err) 2238 if (ret && !err)
2238 err = ret; 2239 err = ret;
2239 return err; 2240 return err;
2240 } 2241 }
2241 2242
2242 static long btrfs_fallocate(struct file *file, int mode, 2243 static long btrfs_fallocate(struct file *file, int mode,
2243 loff_t offset, loff_t len) 2244 loff_t offset, loff_t len)
2244 { 2245 {
2245 struct inode *inode = file_inode(file); 2246 struct inode *inode = file_inode(file);
2246 struct extent_state *cached_state = NULL; 2247 struct extent_state *cached_state = NULL;
2247 struct btrfs_root *root = BTRFS_I(inode)->root; 2248 struct btrfs_root *root = BTRFS_I(inode)->root;
2248 u64 cur_offset; 2249 u64 cur_offset;
2249 u64 last_byte; 2250 u64 last_byte;
2250 u64 alloc_start; 2251 u64 alloc_start;
2251 u64 alloc_end; 2252 u64 alloc_end;
2252 u64 alloc_hint = 0; 2253 u64 alloc_hint = 0;
2253 u64 locked_end; 2254 u64 locked_end;
2254 struct extent_map *em; 2255 struct extent_map *em;
2255 int blocksize = BTRFS_I(inode)->root->sectorsize; 2256 int blocksize = BTRFS_I(inode)->root->sectorsize;
2256 int ret; 2257 int ret;
2257 2258
2258 alloc_start = round_down(offset, blocksize); 2259 alloc_start = round_down(offset, blocksize);
2259 alloc_end = round_up(offset + len, blocksize); 2260 alloc_end = round_up(offset + len, blocksize);
2260 2261
2261 /* Make sure we aren't being given some crap mode */ 2262 /* Make sure we aren't being given some crap mode */
2262 if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) 2263 if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
2263 return -EOPNOTSUPP; 2264 return -EOPNOTSUPP;
2264 2265
2265 if (mode & FALLOC_FL_PUNCH_HOLE) 2266 if (mode & FALLOC_FL_PUNCH_HOLE)
2266 return btrfs_punch_hole(inode, offset, len); 2267 return btrfs_punch_hole(inode, offset, len);
2267 2268
2268 /* 2269 /*
2269 * Make sure we have enough space before we do the 2270 * Make sure we have enough space before we do the
2270 * allocation. 2271 * allocation.
2271 */ 2272 */
2272 ret = btrfs_check_data_free_space(inode, alloc_end - alloc_start); 2273 ret = btrfs_check_data_free_space(inode, alloc_end - alloc_start);
2273 if (ret) 2274 if (ret)
2274 return ret; 2275 return ret;
2275 if (root->fs_info->quota_enabled) { 2276 if (root->fs_info->quota_enabled) {
2276 ret = btrfs_qgroup_reserve(root, alloc_end - alloc_start); 2277 ret = btrfs_qgroup_reserve(root, alloc_end - alloc_start);
2277 if (ret) 2278 if (ret)
2278 goto out_reserve_fail; 2279 goto out_reserve_fail;
2279 } 2280 }
2280 2281
2281 mutex_lock(&inode->i_mutex); 2282 mutex_lock(&inode->i_mutex);
2282 ret = inode_newsize_ok(inode, alloc_end); 2283 ret = inode_newsize_ok(inode, alloc_end);
2283 if (ret) 2284 if (ret)
2284 goto out; 2285 goto out;
2285 2286
2286 if (alloc_start > inode->i_size) { 2287 if (alloc_start > inode->i_size) {
2287 ret = btrfs_cont_expand(inode, i_size_read(inode), 2288 ret = btrfs_cont_expand(inode, i_size_read(inode),
2288 alloc_start); 2289 alloc_start);
2289 if (ret) 2290 if (ret)
2290 goto out; 2291 goto out;
2291 } else { 2292 } else {
2292 /* 2293 /*
2293 * If we are fallocating from the end of the file onward we 2294 * If we are fallocating from the end of the file onward we
2294 * need to zero out the end of the page if i_size lands in the 2295 * need to zero out the end of the page if i_size lands in the
2295 * middle of a page. 2296 * middle of a page.
2296 */ 2297 */
2297 ret = btrfs_truncate_page(inode, inode->i_size, 0, 0); 2298 ret = btrfs_truncate_page(inode, inode->i_size, 0, 0);
2298 if (ret) 2299 if (ret)
2299 goto out; 2300 goto out;
2300 } 2301 }
2301 2302
2302 /* 2303 /*
2303 * wait for ordered IO before we have any locks. We'll loop again 2304 * wait for ordered IO before we have any locks. We'll loop again
2304 * below with the locks held. 2305 * below with the locks held.
2305 */ 2306 */
2306 btrfs_wait_ordered_range(inode, alloc_start, alloc_end - alloc_start); 2307 btrfs_wait_ordered_range(inode, alloc_start, alloc_end - alloc_start);
2307 2308
2308 locked_end = alloc_end - 1; 2309 locked_end = alloc_end - 1;
2309 while (1) { 2310 while (1) {
2310 struct btrfs_ordered_extent *ordered; 2311 struct btrfs_ordered_extent *ordered;
2311 2312
2312 /* the extent lock is ordered inside the running 2313 /* the extent lock is ordered inside the running
2313 * transaction 2314 * transaction
2314 */ 2315 */
2315 lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, 2316 lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start,
2316 locked_end, 0, &cached_state); 2317 locked_end, 0, &cached_state);
2317 ordered = btrfs_lookup_first_ordered_extent(inode, 2318 ordered = btrfs_lookup_first_ordered_extent(inode,
2318 alloc_end - 1); 2319 alloc_end - 1);
2319 if (ordered && 2320 if (ordered &&
2320 ordered->file_offset + ordered->len > alloc_start && 2321 ordered->file_offset + ordered->len > alloc_start &&
2321 ordered->file_offset < alloc_end) { 2322 ordered->file_offset < alloc_end) {
2322 btrfs_put_ordered_extent(ordered); 2323 btrfs_put_ordered_extent(ordered);
2323 unlock_extent_cached(&BTRFS_I(inode)->io_tree, 2324 unlock_extent_cached(&BTRFS_I(inode)->io_tree,
2324 alloc_start, locked_end, 2325 alloc_start, locked_end,
2325 &cached_state, GFP_NOFS); 2326 &cached_state, GFP_NOFS);
2326 /* 2327 /*
2327 * we can't wait on the range with the transaction 2328 * we can't wait on the range with the transaction
2328 * running or with the extent lock held 2329 * running or with the extent lock held
2329 */ 2330 */
2330 btrfs_wait_ordered_range(inode, alloc_start, 2331 btrfs_wait_ordered_range(inode, alloc_start,
2331 alloc_end - alloc_start); 2332 alloc_end - alloc_start);
2332 } else { 2333 } else {
2333 if (ordered) 2334 if (ordered)
2334 btrfs_put_ordered_extent(ordered); 2335 btrfs_put_ordered_extent(ordered);
2335 break; 2336 break;
2336 } 2337 }
2337 } 2338 }
2338 2339
2339 cur_offset = alloc_start; 2340 cur_offset = alloc_start;
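	/*
	 * Walk the range one extent map at a time and preallocate any holes
	 * (or anything past i_size that isn't already preallocated).
	 */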
2340 while (1) { 2341 while (1) {
2341 u64 actual_end; 2342 u64 actual_end;
2342 2343
2343 em = btrfs_get_extent(inode, NULL, 0, cur_offset, 2344 em = btrfs_get_extent(inode, NULL, 0, cur_offset,
2344 alloc_end - cur_offset, 0); 2345 alloc_end - cur_offset, 0);
2345 if (IS_ERR_OR_NULL(em)) { 2346 if (IS_ERR_OR_NULL(em)) {
2346 if (!em) 2347 if (!em)
2347 ret = -ENOMEM; 2348 ret = -ENOMEM;
2348 else 2349 else
2349 ret = PTR_ERR(em); 2350 ret = PTR_ERR(em);
2350 break; 2351 break;
2351 } 2352 }
2352 last_byte = min(extent_map_end(em), alloc_end); 2353 last_byte = min(extent_map_end(em), alloc_end);
2353 actual_end = min_t(u64, extent_map_end(em), offset + len); 2354 actual_end = min_t(u64, extent_map_end(em), offset + len);
2354 last_byte = ALIGN(last_byte, blocksize); 2355 last_byte = ALIGN(last_byte, blocksize);
2355 2356
2356 if (em->block_start == EXTENT_MAP_HOLE || 2357 if (em->block_start == EXTENT_MAP_HOLE ||
2357 (cur_offset >= inode->i_size && 2358 (cur_offset >= inode->i_size &&
2358 !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) { 2359 !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
2359 ret = btrfs_prealloc_file_range(inode, mode, cur_offset, 2360 ret = btrfs_prealloc_file_range(inode, mode, cur_offset,
2360 last_byte - cur_offset, 2361 last_byte - cur_offset,
2361 1 << inode->i_blkbits, 2362 1 << inode->i_blkbits,
2362 offset + len, 2363 offset + len,
2363 &alloc_hint); 2364 &alloc_hint);
2364 2365
2365 if (ret < 0) { 2366 if (ret < 0) {
2366 free_extent_map(em); 2367 free_extent_map(em);
2367 break; 2368 break;
2368 } 2369 }
2369 } else if (actual_end > inode->i_size && 2370 } else if (actual_end > inode->i_size &&
2370 !(mode & FALLOC_FL_KEEP_SIZE)) { 2371 !(mode & FALLOC_FL_KEEP_SIZE)) {
2371 /* 2372 /*
2372 * We didn't need to allocate any more space, but we 2373 * We didn't need to allocate any more space, but we
2373 * still extended the size of the file so we need to 2374 * still extended the size of the file so we need to
2374 * update i_size. 2375 * update i_size.
2375 */ 2376 */
2376 inode->i_ctime = CURRENT_TIME; 2377 inode->i_ctime = CURRENT_TIME;
2377 i_size_write(inode, actual_end); 2378 i_size_write(inode, actual_end);
2378 btrfs_ordered_update_i_size(inode, actual_end, NULL); 2379 btrfs_ordered_update_i_size(inode, actual_end, NULL);
2379 } 2380 }
2380 free_extent_map(em); 2381 free_extent_map(em);
2381 2382
2382 cur_offset = last_byte; 2383 cur_offset = last_byte;
2383 if (cur_offset >= alloc_end) { 2384 if (cur_offset >= alloc_end) {
2384 ret = 0; 2385 ret = 0;
2385 break; 2386 break;
2386 } 2387 }
2387 } 2388 }
2388 unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, 2389 unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
2389 &cached_state, GFP_NOFS); 2390 &cached_state, GFP_NOFS);
2390 out: 2391 out:
2391 mutex_unlock(&inode->i_mutex); 2392 mutex_unlock(&inode->i_mutex);
2392 if (root->fs_info->quota_enabled) 2393 if (root->fs_info->quota_enabled)
2393 btrfs_qgroup_free(root, alloc_end - alloc_start); 2394 btrfs_qgroup_free(root, alloc_end - alloc_start);
2394 out_reserve_fail: 2395 out_reserve_fail:
2395 /* Let go of our reservation. */ 2396 /* Let go of our reservation. */
2396 btrfs_free_reserved_data_space(inode, alloc_end - alloc_start); 2397 btrfs_free_reserved_data_space(inode, alloc_end - alloc_start);
2397 return ret; 2398 return ret;
2398 } 2399 }
2399 2400
2400 static int find_desired_extent(struct inode *inode, loff_t *offset, int whence) 2401 static int find_desired_extent(struct inode *inode, loff_t *offset, int whence)
2401 { 2402 {
2402 struct btrfs_root *root = BTRFS_I(inode)->root; 2403 struct btrfs_root *root = BTRFS_I(inode)->root;
2403 struct extent_map *em; 2404 struct extent_map *em;
2404 struct extent_state *cached_state = NULL; 2405 struct extent_state *cached_state = NULL;
2405 u64 lockstart = *offset; 2406 u64 lockstart = *offset;
2406 u64 lockend = i_size_read(inode); 2407 u64 lockend = i_size_read(inode);
2407 u64 start = *offset; 2408 u64 start = *offset;
2408 u64 orig_start = *offset; 2409 u64 orig_start = *offset;
2409 u64 len = i_size_read(inode); 2410 u64 len = i_size_read(inode);
2410 u64 last_end = 0; 2411 u64 last_end = 0;
2411 int ret = 0; 2412 int ret = 0;
2412 2413
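	/*
	 * Make sure the locked range covers at least one sector and is
	 * expressed as an inclusive [lockstart, lockend].
	 */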
2413 lockend = max_t(u64, root->sectorsize, lockend); 2414 lockend = max_t(u64, root->sectorsize, lockend);
2414 if (lockend <= lockstart) 2415 if (lockend <= lockstart)
2415 lockend = lockstart + root->sectorsize; 2416 lockend = lockstart + root->sectorsize;
2416 2417
2417 lockend--; 2418 lockend--;
2418 len = lockend - lockstart + 1; 2419 len = lockend - lockstart + 1;
2419 2420
2420 len = max_t(u64, len, root->sectorsize); 2421 len = max_t(u64, len, root->sectorsize);
2421 if (inode->i_size == 0) 2422 if (inode->i_size == 0)
2422 return -ENXIO; 2423 return -ENXIO;
2423 2424
2424 lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, 0, 2425 lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, 0,
2425 &cached_state); 2426 &cached_state);
2426 2427
2427 /* 2428 /*
2428 * Delalloc is such a pain. If we have a hole and we have pending 2429 * Delalloc is such a pain. If we have a hole and we have pending
2429 * delalloc for a portion of the hole we will get back a hole that 2430 * delalloc for a portion of the hole we will get back a hole that
2430 * exists for the entire range since it hasn't been actually written 2431 * exists for the entire range since it hasn't been actually written
2431 * yet. So to take care of this case we need to look for an extent just 2432 * yet. So to take care of this case we need to look for an extent just
2432 * before the position we want in case there is outstanding delalloc 2433 * before the position we want in case there is outstanding delalloc
2433 * going on here. 2434 * going on here.
2434 */ 2435 */
2435 if (whence == SEEK_HOLE && start != 0) { 2436 if (whence == SEEK_HOLE && start != 0) {
2436 if (start <= root->sectorsize) 2437 if (start <= root->sectorsize)
2437 em = btrfs_get_extent_fiemap(inode, NULL, 0, 0, 2438 em = btrfs_get_extent_fiemap(inode, NULL, 0, 0,
2438 root->sectorsize, 0); 2439 root->sectorsize, 0);
2439 else 2440 else
2440 em = btrfs_get_extent_fiemap(inode, NULL, 0, 2441 em = btrfs_get_extent_fiemap(inode, NULL, 0,
2441 start - root->sectorsize, 2442 start - root->sectorsize,
2442 root->sectorsize, 0); 2443 root->sectorsize, 0);
2443 if (IS_ERR(em)) { 2444 if (IS_ERR(em)) {
2444 ret = PTR_ERR(em); 2445 ret = PTR_ERR(em);
2445 goto out; 2446 goto out;
2446 } 2447 }
2447 last_end = em->start + em->len; 2448 last_end = em->start + em->len;
2448 if (em->block_start == EXTENT_MAP_DELALLOC) 2449 if (em->block_start == EXTENT_MAP_DELALLOC)
2449 last_end = min_t(u64, last_end, inode->i_size); 2450 last_end = min_t(u64, last_end, inode->i_size);
2450 free_extent_map(em); 2451 free_extent_map(em);
2451 } 2452 }
2452 2453
2453 while (1) { 2454 while (1) {
2454 em = btrfs_get_extent_fiemap(inode, NULL, 0, start, len, 0); 2455 em = btrfs_get_extent_fiemap(inode, NULL, 0, start, len, 0);
2455 if (IS_ERR(em)) { 2456 if (IS_ERR(em)) {
2456 ret = PTR_ERR(em); 2457 ret = PTR_ERR(em);
2457 break; 2458 break;
2458 } 2459 }
2459 2460
2460 if (em->block_start == EXTENT_MAP_HOLE) { 2461 if (em->block_start == EXTENT_MAP_HOLE) {
2461 if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) { 2462 if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) {
2462 if (last_end <= orig_start) { 2463 if (last_end <= orig_start) {
2463 free_extent_map(em); 2464 free_extent_map(em);
2464 ret = -ENXIO; 2465 ret = -ENXIO;
2465 break; 2466 break;
2466 } 2467 }
2467 } 2468 }
2468 2469
2469 if (whence == SEEK_HOLE) { 2470 if (whence == SEEK_HOLE) {
2470 *offset = start; 2471 *offset = start;
2471 free_extent_map(em); 2472 free_extent_map(em);
2472 break; 2473 break;
2473 } 2474 }
2474 } else { 2475 } else {
2475 if (whence == SEEK_DATA) { 2476 if (whence == SEEK_DATA) {
2476 if (em->block_start == EXTENT_MAP_DELALLOC) { 2477 if (em->block_start == EXTENT_MAP_DELALLOC) {
2477 if (start >= inode->i_size) { 2478 if (start >= inode->i_size) {
2478 free_extent_map(em); 2479 free_extent_map(em);
2479 ret = -ENXIO; 2480 ret = -ENXIO;
2480 break; 2481 break;
2481 } 2482 }
2482 } 2483 }
2483 2484
2484 if (!test_bit(EXTENT_FLAG_PREALLOC, 2485 if (!test_bit(EXTENT_FLAG_PREALLOC,
2485 &em->flags)) { 2486 &em->flags)) {
2486 *offset = start; 2487 *offset = start;
2487 free_extent_map(em); 2488 free_extent_map(em);
2488 break; 2489 break;
2489 } 2490 }
2490 } 2491 }
2491 } 2492 }
2492 2493
2493 start = em->start + em->len; 2494 start = em->start + em->len;
2494 last_end = em->start + em->len; 2495 last_end = em->start + em->len;
2495 2496
2496 if (em->block_start == EXTENT_MAP_DELALLOC) 2497 if (em->block_start == EXTENT_MAP_DELALLOC)
2497 last_end = min_t(u64, last_end, inode->i_size); 2498 last_end = min_t(u64, last_end, inode->i_size);
2498 2499
2499 if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) { 2500 if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) {
2500 free_extent_map(em); 2501 free_extent_map(em);
2501 ret = -ENXIO; 2502 ret = -ENXIO;
2502 break; 2503 break;
2503 } 2504 }
2504 free_extent_map(em); 2505 free_extent_map(em);
2505 cond_resched(); 2506 cond_resched();
2506 } 2507 }
2507 if (!ret) 2508 if (!ret)
2508 *offset = min(*offset, inode->i_size); 2509 *offset = min(*offset, inode->i_size);
2509 out: 2510 out:
2510 unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, 2511 unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend,
2511 &cached_state, GFP_NOFS); 2512 &cached_state, GFP_NOFS);
2512 return ret; 2513 return ret;
2513 } 2514 }
2514 2515
2515 static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence) 2516 static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
2516 { 2517 {
2517 struct inode *inode = file->f_mapping->host; 2518 struct inode *inode = file->f_mapping->host;
2518 int ret; 2519 int ret;
2519 2520
2520 mutex_lock(&inode->i_mutex); 2521 mutex_lock(&inode->i_mutex);
2521 switch (whence) { 2522 switch (whence) {
2522 case SEEK_END: 2523 case SEEK_END:
2523 case SEEK_CUR: 2524 case SEEK_CUR:
2524 offset = generic_file_llseek(file, offset, whence); 2525 offset = generic_file_llseek(file, offset, whence);
2525 goto out; 2526 goto out;
2526 case SEEK_DATA: 2527 case SEEK_DATA:
2527 case SEEK_HOLE: 2528 case SEEK_HOLE:
2528 if (offset >= i_size_read(inode)) { 2529 if (offset >= i_size_read(inode)) {
2529 mutex_unlock(&inode->i_mutex); 2530 mutex_unlock(&inode->i_mutex);
2530 return -ENXIO; 2531 return -ENXIO;
2531 } 2532 }
2532 2533
2533 ret = find_desired_extent(inode, &offset, whence); 2534 ret = find_desired_extent(inode, &offset, whence);
2534 if (ret) { 2535 if (ret) {
2535 mutex_unlock(&inode->i_mutex); 2536 mutex_unlock(&inode->i_mutex);
2536 return ret; 2537 return ret;
2537 } 2538 }
2538 } 2539 }
2539 2540
2540 offset = vfs_setpos(file, offset, inode->i_sb->s_maxbytes); 2541 offset = vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
2541 out: 2542 out:
2542 mutex_unlock(&inode->i_mutex); 2543 mutex_unlock(&inode->i_mutex);
2543 return offset; 2544 return offset;
2544 } 2545 }
2545 2546
2546 const struct file_operations btrfs_file_operations = { 2547 const struct file_operations btrfs_file_operations = {
2547 .llseek = btrfs_file_llseek, 2548 .llseek = btrfs_file_llseek,
2548 .read = do_sync_read, 2549 .read = do_sync_read,
2549 .write = do_sync_write, 2550 .write = do_sync_write,
2550 .aio_read = generic_file_aio_read, 2551 .aio_read = generic_file_aio_read,
2551 .splice_read = generic_file_splice_read, 2552 .splice_read = generic_file_splice_read,
2552 .aio_write = btrfs_file_aio_write, 2553 .aio_write = btrfs_file_aio_write,
2553 .mmap = btrfs_file_mmap, 2554 .mmap = btrfs_file_mmap,
2554 .open = generic_file_open, 2555 .open = generic_file_open,
2555 .release = btrfs_release_file, 2556 .release = btrfs_release_file,
2556 .fsync = btrfs_sync_file, 2557 .fsync = btrfs_sync_file,
2557 .fallocate = btrfs_fallocate, 2558 .fallocate = btrfs_fallocate,
2558 .unlocked_ioctl = btrfs_ioctl, 2559 .unlocked_ioctl = btrfs_ioctl,
2559 #ifdef CONFIG_COMPAT 2560 #ifdef CONFIG_COMPAT
2560 .compat_ioctl = btrfs_ioctl, 2561 .compat_ioctl = btrfs_ioctl,
2561 #endif 2562 #endif
2562 }; 2563 };
2563 2564
2564 void btrfs_auto_defrag_exit(void) 2565 void btrfs_auto_defrag_exit(void)
2565 { 2566 {
2566 if (btrfs_inode_defrag_cachep) 2567 if (btrfs_inode_defrag_cachep)
2567 kmem_cache_destroy(btrfs_inode_defrag_cachep); 2568 kmem_cache_destroy(btrfs_inode_defrag_cachep);
2568 } 2569 }
2569 2570
2570 int btrfs_auto_defrag_init(void) 2571 int btrfs_auto_defrag_init(void)
2571 { 2572 {
2572 btrfs_inode_defrag_cachep = kmem_cache_create("btrfs_inode_defrag", 2573 btrfs_inode_defrag_cachep = kmem_cache_create("btrfs_inode_defrag",
2573 sizeof(struct inode_defrag), 0, 2574 sizeof(struct inode_defrag), 0,
2574 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, 2575 SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
2575 NULL); 2576 NULL);
2576 if (!btrfs_inode_defrag_cachep) 2577 if (!btrfs_inode_defrag_cachep)
2577 return -ENOMEM; 2578 return -ENOMEM;
2578 2579
2579 return 0; 2580 return 0;
2580 } 2581 }
1 /* 1 /*
2 * linux/fs/buffer.c 2 * linux/fs/buffer.c
3 * 3 *
4 * Copyright (C) 1991, 1992, 2002 Linus Torvalds 4 * Copyright (C) 1991, 1992, 2002 Linus Torvalds
5 */ 5 */
6 6
7 /* 7 /*
8 * Start bdflush() with kernel_thread not syscall - Paul Gortmaker, 12/95 8 * Start bdflush() with kernel_thread not syscall - Paul Gortmaker, 12/95
9 * 9 *
10 * Removed a lot of unnecessary code and simplified things now that 10 * Removed a lot of unnecessary code and simplified things now that
11 * the buffer cache isn't our primary cache - Andrew Tridgell 12/96 11 * the buffer cache isn't our primary cache - Andrew Tridgell 12/96
12 * 12 *
13 * Speed up hash, lru, and free list operations. Use gfp() for allocating 13 * Speed up hash, lru, and free list operations. Use gfp() for allocating
14 * hash table, use SLAB cache for buffer heads. SMP threading. -DaveM 14 * hash table, use SLAB cache for buffer heads. SMP threading. -DaveM
15 * 15 *
16 * Added 32k buffer block sizes - these are required for older ARM systems. - RMK 16 * Added 32k buffer block sizes - these are required for older ARM systems. - RMK
17 * 17 *
18 * async buffer flushing, 1999 Andrea Arcangeli <andrea@suse.de> 18 * async buffer flushing, 1999 Andrea Arcangeli <andrea@suse.de>
19 */ 19 */
20 20
21 #include <linux/kernel.h> 21 #include <linux/kernel.h>
22 #include <linux/syscalls.h> 22 #include <linux/syscalls.h>
23 #include <linux/fs.h> 23 #include <linux/fs.h>
24 #include <linux/mm.h> 24 #include <linux/mm.h>
25 #include <linux/percpu.h> 25 #include <linux/percpu.h>
26 #include <linux/slab.h> 26 #include <linux/slab.h>
27 #include <linux/capability.h> 27 #include <linux/capability.h>
28 #include <linux/blkdev.h> 28 #include <linux/blkdev.h>
29 #include <linux/file.h> 29 #include <linux/file.h>
30 #include <linux/quotaops.h> 30 #include <linux/quotaops.h>
31 #include <linux/highmem.h> 31 #include <linux/highmem.h>
32 #include <linux/export.h> 32 #include <linux/export.h>
33 #include <linux/writeback.h> 33 #include <linux/writeback.h>
34 #include <linux/hash.h> 34 #include <linux/hash.h>
35 #include <linux/suspend.h> 35 #include <linux/suspend.h>
36 #include <linux/buffer_head.h> 36 #include <linux/buffer_head.h>
37 #include <linux/task_io_accounting_ops.h> 37 #include <linux/task_io_accounting_ops.h>
38 #include <linux/bio.h> 38 #include <linux/bio.h>
39 #include <linux/notifier.h> 39 #include <linux/notifier.h>
40 #include <linux/cpu.h> 40 #include <linux/cpu.h>
41 #include <linux/bitops.h> 41 #include <linux/bitops.h>
42 #include <linux/mpage.h> 42 #include <linux/mpage.h>
43 #include <linux/bit_spinlock.h> 43 #include <linux/bit_spinlock.h>
44 #include <trace/events/block.h> 44 #include <trace/events/block.h>
45 45
46 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); 46 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
47 47
48 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers) 48 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
49 49
50 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private) 50 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
51 { 51 {
52 bh->b_end_io = handler; 52 bh->b_end_io = handler;
53 bh->b_private = private; 53 bh->b_private = private;
54 } 54 }
55 EXPORT_SYMBOL(init_buffer); 55 EXPORT_SYMBOL(init_buffer);
56 56
57 inline void touch_buffer(struct buffer_head *bh) 57 inline void touch_buffer(struct buffer_head *bh)
58 { 58 {
59 trace_block_touch_buffer(bh); 59 trace_block_touch_buffer(bh);
60 mark_page_accessed(bh->b_page); 60 mark_page_accessed(bh->b_page);
61 } 61 }
62 EXPORT_SYMBOL(touch_buffer); 62 EXPORT_SYMBOL(touch_buffer);
63 63
64 static int sleep_on_buffer(void *word) 64 static int sleep_on_buffer(void *word)
65 { 65 {
66 io_schedule(); 66 io_schedule();
67 return 0; 67 return 0;
68 } 68 }
69 69
70 void __lock_buffer(struct buffer_head *bh) 70 void __lock_buffer(struct buffer_head *bh)
71 { 71 {
72 wait_on_bit_lock(&bh->b_state, BH_Lock, sleep_on_buffer, 72 wait_on_bit_lock(&bh->b_state, BH_Lock, sleep_on_buffer,
73 TASK_UNINTERRUPTIBLE); 73 TASK_UNINTERRUPTIBLE);
74 } 74 }
75 EXPORT_SYMBOL(__lock_buffer); 75 EXPORT_SYMBOL(__lock_buffer);
76 76
77 void unlock_buffer(struct buffer_head *bh) 77 void unlock_buffer(struct buffer_head *bh)
78 { 78 {
79 clear_bit_unlock(BH_Lock, &bh->b_state); 79 clear_bit_unlock(BH_Lock, &bh->b_state);
80 smp_mb__after_clear_bit(); 80 smp_mb__after_clear_bit();
81 wake_up_bit(&bh->b_state, BH_Lock); 81 wake_up_bit(&bh->b_state, BH_Lock);
82 } 82 }
83 EXPORT_SYMBOL(unlock_buffer); 83 EXPORT_SYMBOL(unlock_buffer);
84 84
85 /* 85 /*
86 * Returns if the page has dirty or writeback buffers. If all the buffers 86 * Returns if the page has dirty or writeback buffers. If all the buffers
87 * are unlocked and clean then the PageDirty information is stale. If 87 * are unlocked and clean then the PageDirty information is stale. If
88 * any of the pages are locked, it is assumed they are locked for IO. 88 * any of the pages are locked, it is assumed they are locked for IO.
89 */ 89 */
90 void buffer_check_dirty_writeback(struct page *page, 90 void buffer_check_dirty_writeback(struct page *page,
91 bool *dirty, bool *writeback) 91 bool *dirty, bool *writeback)
92 { 92 {
93 struct buffer_head *head, *bh; 93 struct buffer_head *head, *bh;
94 *dirty = false; 94 *dirty = false;
95 *writeback = false; 95 *writeback = false;
96 96
97 BUG_ON(!PageLocked(page)); 97 BUG_ON(!PageLocked(page));
98 98
99 if (!page_has_buffers(page)) 99 if (!page_has_buffers(page))
100 return; 100 return;
101 101
102 if (PageWriteback(page)) 102 if (PageWriteback(page))
103 *writeback = true; 103 *writeback = true;
104 104
105 head = page_buffers(page); 105 head = page_buffers(page);
106 bh = head; 106 bh = head;
107 do { 107 do {
108 if (buffer_locked(bh)) 108 if (buffer_locked(bh))
109 *writeback = true; 109 *writeback = true;
110 110
111 if (buffer_dirty(bh)) 111 if (buffer_dirty(bh))
112 *dirty = true; 112 *dirty = true;
113 113
114 bh = bh->b_this_page; 114 bh = bh->b_this_page;
115 } while (bh != head); 115 } while (bh != head);
116 } 116 }
117 EXPORT_SYMBOL(buffer_check_dirty_writeback); 117 EXPORT_SYMBOL(buffer_check_dirty_writeback);
118 118
119 /* 119 /*
120 * Block until a buffer comes unlocked. This doesn't stop it 120 * Block until a buffer comes unlocked. This doesn't stop it
121 * from becoming locked again - you have to lock it yourself 121 * from becoming locked again - you have to lock it yourself
122 * if you want to preserve its state. 122 * if you want to preserve its state.
123 */ 123 */
124 void __wait_on_buffer(struct buffer_head * bh) 124 void __wait_on_buffer(struct buffer_head * bh)
125 { 125 {
126 wait_on_bit(&bh->b_state, BH_Lock, sleep_on_buffer, TASK_UNINTERRUPTIBLE); 126 wait_on_bit(&bh->b_state, BH_Lock, sleep_on_buffer, TASK_UNINTERRUPTIBLE);
127 } 127 }
128 EXPORT_SYMBOL(__wait_on_buffer); 128 EXPORT_SYMBOL(__wait_on_buffer);
129 129
130 static void 130 static void
131 __clear_page_buffers(struct page *page) 131 __clear_page_buffers(struct page *page)
132 { 132 {
133 ClearPagePrivate(page); 133 ClearPagePrivate(page);
134 set_page_private(page, 0); 134 set_page_private(page, 0);
135 page_cache_release(page); 135 page_cache_release(page);
136 } 136 }
137 137
138 138
139 static int quiet_error(struct buffer_head *bh) 139 static int quiet_error(struct buffer_head *bh)
140 { 140 {
141 if (!test_bit(BH_Quiet, &bh->b_state) && printk_ratelimit()) 141 if (!test_bit(BH_Quiet, &bh->b_state) && printk_ratelimit())
142 return 0; 142 return 0;
143 return 1; 143 return 1;
144 } 144 }
145 145
146 146
147 static void buffer_io_error(struct buffer_head *bh) 147 static void buffer_io_error(struct buffer_head *bh)
148 { 148 {
149 char b[BDEVNAME_SIZE]; 149 char b[BDEVNAME_SIZE];
150 printk(KERN_ERR "Buffer I/O error on device %s, logical block %Lu\n", 150 printk(KERN_ERR "Buffer I/O error on device %s, logical block %Lu\n",
151 bdevname(bh->b_bdev, b), 151 bdevname(bh->b_bdev, b),
152 (unsigned long long)bh->b_blocknr); 152 (unsigned long long)bh->b_blocknr);
153 } 153 }
154 154
155 /* 155 /*
156 * End-of-IO handler helper function which does not touch the bh after 156 * End-of-IO handler helper function which does not touch the bh after
157 * unlocking it. 157 * unlocking it.
158 * Note: unlock_buffer() sort-of does touch the bh after unlocking it, but 158 * Note: unlock_buffer() sort-of does touch the bh after unlocking it, but
159 * a race there is benign: unlock_buffer() only use the bh's address for 159 * a race there is benign: unlock_buffer() only use the bh's address for
160 * hashing after unlocking the buffer, so it doesn't actually touch the bh 160 * hashing after unlocking the buffer, so it doesn't actually touch the bh
161 * itself. 161 * itself.
162 */ 162 */
163 static void __end_buffer_read_notouch(struct buffer_head *bh, int uptodate) 163 static void __end_buffer_read_notouch(struct buffer_head *bh, int uptodate)
164 { 164 {
165 if (uptodate) { 165 if (uptodate) {
166 set_buffer_uptodate(bh); 166 set_buffer_uptodate(bh);
167 } else { 167 } else {
168 /* This happens, due to failed READA attempts. */ 168 /* This happens, due to failed READA attempts. */
169 clear_buffer_uptodate(bh); 169 clear_buffer_uptodate(bh);
170 } 170 }
171 unlock_buffer(bh); 171 unlock_buffer(bh);
172 } 172 }
173 173
174 /* 174 /*
175 * Default synchronous end-of-IO handler.. Just mark it up-to-date and 175 * Default synchronous end-of-IO handler.. Just mark it up-to-date and
176 * unlock the buffer. This is what ll_rw_block uses too. 176 * unlock the buffer. This is what ll_rw_block uses too.
177 */ 177 */
178 void end_buffer_read_sync(struct buffer_head *bh, int uptodate) 178 void end_buffer_read_sync(struct buffer_head *bh, int uptodate)
179 { 179 {
180 __end_buffer_read_notouch(bh, uptodate); 180 __end_buffer_read_notouch(bh, uptodate);
181 put_bh(bh); 181 put_bh(bh);
182 } 182 }
183 EXPORT_SYMBOL(end_buffer_read_sync); 183 EXPORT_SYMBOL(end_buffer_read_sync);
184 184
185 void end_buffer_write_sync(struct buffer_head *bh, int uptodate) 185 void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
186 { 186 {
187 char b[BDEVNAME_SIZE]; 187 char b[BDEVNAME_SIZE];
188 188
189 if (uptodate) { 189 if (uptodate) {
190 set_buffer_uptodate(bh); 190 set_buffer_uptodate(bh);
191 } else { 191 } else {
192 if (!quiet_error(bh)) { 192 if (!quiet_error(bh)) {
193 buffer_io_error(bh); 193 buffer_io_error(bh);
194 printk(KERN_WARNING "lost page write due to " 194 printk(KERN_WARNING "lost page write due to "
195 "I/O error on %s\n", 195 "I/O error on %s\n",
196 bdevname(bh->b_bdev, b)); 196 bdevname(bh->b_bdev, b));
197 } 197 }
198 set_buffer_write_io_error(bh); 198 set_buffer_write_io_error(bh);
199 clear_buffer_uptodate(bh); 199 clear_buffer_uptodate(bh);
200 } 200 }
201 unlock_buffer(bh); 201 unlock_buffer(bh);
202 put_bh(bh); 202 put_bh(bh);
203 } 203 }
204 EXPORT_SYMBOL(end_buffer_write_sync); 204 EXPORT_SYMBOL(end_buffer_write_sync);
205 205
206 /* 206 /*
207 * Various filesystems appear to want __find_get_block to be non-blocking. 207 * Various filesystems appear to want __find_get_block to be non-blocking.
208 * But it's the page lock which protects the buffers. To get around this, 208 * But it's the page lock which protects the buffers. To get around this,
209 * we get exclusion from try_to_free_buffers with the blockdev mapping's 209 * we get exclusion from try_to_free_buffers with the blockdev mapping's
210 * private_lock. 210 * private_lock.
211 * 211 *
212 * Hack idea: for the blockdev mapping, i_bufferlist_lock contention 212 * Hack idea: for the blockdev mapping, i_bufferlist_lock contention
213 * may be quite high. This code could TryLock the page, and if that 213 * may be quite high. This code could TryLock the page, and if that
214 * succeeds, there is no need to take private_lock. (But if 214 * succeeds, there is no need to take private_lock. (But if
215 * private_lock is contended then so is mapping->tree_lock). 215 * private_lock is contended then so is mapping->tree_lock).
216 */ 216 */
217 static struct buffer_head * 217 static struct buffer_head *
218 __find_get_block_slow(struct block_device *bdev, sector_t block) 218 __find_get_block_slow(struct block_device *bdev, sector_t block)
219 { 219 {
220 struct inode *bd_inode = bdev->bd_inode; 220 struct inode *bd_inode = bdev->bd_inode;
221 struct address_space *bd_mapping = bd_inode->i_mapping; 221 struct address_space *bd_mapping = bd_inode->i_mapping;
222 struct buffer_head *ret = NULL; 222 struct buffer_head *ret = NULL;
223 pgoff_t index; 223 pgoff_t index;
224 struct buffer_head *bh; 224 struct buffer_head *bh;
225 struct buffer_head *head; 225 struct buffer_head *head;
226 struct page *page; 226 struct page *page;
227 int all_mapped = 1; 227 int all_mapped = 1;
228 228
229 index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits); 229 index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
230 -	page = find_get_page(bd_mapping, index);
230 +	page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED);
231 	if (!page)
232 		goto out;
233
234 	spin_lock(&bd_mapping->private_lock);
235 	if (!page_has_buffers(page))
236 		goto out_unlock;
237 	head = page_buffers(page);
238 	bh = head;
239 	do {
240 		if (!buffer_mapped(bh))
241 			all_mapped = 0;
242 		else if (bh->b_blocknr == block) {
243 			ret = bh;
244 			get_bh(bh);
245 			goto out_unlock;
246 		}
247 		bh = bh->b_this_page;
248 	} while (bh != head);
249
250 	/* we might be here because some of the buffers on this page are
251 	 * not mapped. This is due to various races between
252 	 * file io on the block device and getblk. It gets dealt with
253 	 * elsewhere, don't buffer_error if we had some unmapped buffers
254 	 */
255 	if (all_mapped) {
256 		char b[BDEVNAME_SIZE];
257
258 		printk("__find_get_block_slow() failed. "
259 			"block=%llu, b_blocknr=%llu\n",
260 			(unsigned long long)block,
261 			(unsigned long long)bh->b_blocknr);
262 		printk("b_state=0x%08lx, b_size=%zu\n",
263 			bh->b_state, bh->b_size);
264 		printk("device %s blocksize: %d\n", bdevname(bdev, b),
265 			1 << bd_inode->i_blkbits);
266 	}
267 out_unlock:
268 	spin_unlock(&bd_mapping->private_lock);
269 	page_cache_release(page);
270 out:
271 	return ret;
272 }
273
274 /*
275  * Kick the writeback threads then try to free up some ZONE_NORMAL memory.
276  */
277 static void free_more_memory(void)
278 {
279 	struct zone *zone;
280 	int nid;
281
282 	wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
283 	yield();
284
285 	for_each_online_node(nid) {
286 		(void)first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
287 						gfp_zone(GFP_NOFS), NULL,
288 						&zone);
289 		if (zone)
290 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
291 						GFP_NOFS, NULL);
292 	}
293 }
294
295 /*
296  * I/O completion handler for block_read_full_page() - pages
297  * which come unlocked at the end of I/O.
298  */
299 static void end_buffer_async_read(struct buffer_head *bh, int uptodate)
300 {
301 	unsigned long flags;
302 	struct buffer_head *first;
303 	struct buffer_head *tmp;
304 	struct page *page;
305 	int page_uptodate = 1;
306
307 	BUG_ON(!buffer_async_read(bh));
308
309 	page = bh->b_page;
310 	if (uptodate) {
311 		set_buffer_uptodate(bh);
312 	} else {
313 		clear_buffer_uptodate(bh);
314 		if (!quiet_error(bh))
315 			buffer_io_error(bh);
316 		SetPageError(page);
317 	}
318
319 	/*
320 	 * Be _very_ careful from here on. Bad things can happen if
321 	 * two buffer heads end IO at almost the same time and both
322 	 * decide that the page is now completely done.
323 	 */
324 	first = page_buffers(page);
325 	local_irq_save(flags);
326 	bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
327 	clear_buffer_async_read(bh);
328 	unlock_buffer(bh);
329 	tmp = bh;
330 	do {
331 		if (!buffer_uptodate(tmp))
332 			page_uptodate = 0;
333 		if (buffer_async_read(tmp)) {
334 			BUG_ON(!buffer_locked(tmp));
335 			goto still_busy;
336 		}
337 		tmp = tmp->b_this_page;
338 	} while (tmp != bh);
339 	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
340 	local_irq_restore(flags);
341
342 	/*
343 	 * If none of the buffers had errors and they are all
344 	 * uptodate then we can set the page uptodate.
345 	 */
346 	if (page_uptodate && !PageError(page))
347 		SetPageUptodate(page);
348 	unlock_page(page);
349 	return;
350
351 still_busy:
352 	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
353 	local_irq_restore(flags);
354 	return;
355 }
356
357 /*
358  * Completion handler for block_write_full_page() - pages which are unlocked
359  * during I/O, and which have PageWriteback cleared upon I/O completion.
360  */
361 void end_buffer_async_write(struct buffer_head *bh, int uptodate)
362 {
363 	char b[BDEVNAME_SIZE];
364 	unsigned long flags;
365 	struct buffer_head *first;
366 	struct buffer_head *tmp;
367 	struct page *page;
368
369 	BUG_ON(!buffer_async_write(bh));
370
371 	page = bh->b_page;
372 	if (uptodate) {
373 		set_buffer_uptodate(bh);
374 	} else {
375 		if (!quiet_error(bh)) {
376 			buffer_io_error(bh);
377 			printk(KERN_WARNING "lost page write due to "
378 					"I/O error on %s\n",
379 					bdevname(bh->b_bdev, b));
380 		}
381 		set_bit(AS_EIO, &page->mapping->flags);
382 		set_buffer_write_io_error(bh);
383 		clear_buffer_uptodate(bh);
384 		SetPageError(page);
385 	}
386
387 	first = page_buffers(page);
388 	local_irq_save(flags);
389 	bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
390
391 	clear_buffer_async_write(bh);
392 	unlock_buffer(bh);
393 	tmp = bh->b_this_page;
394 	while (tmp != bh) {
395 		if (buffer_async_write(tmp)) {
396 			BUG_ON(!buffer_locked(tmp));
397 			goto still_busy;
398 		}
399 		tmp = tmp->b_this_page;
400 	}
401 	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
402 	local_irq_restore(flags);
403 	end_page_writeback(page);
404 	return;
405
406 still_busy:
407 	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
408 	local_irq_restore(flags);
409 	return;
410 }
411 EXPORT_SYMBOL(end_buffer_async_write);
412
413 /*
414  * If a page's buffers are under async readin (end_buffer_async_read
415  * completion) then there is a possibility that another thread of
416  * control could lock one of the buffers after it has completed
417  * but while some of the other buffers have not completed. This
418  * locked buffer would confuse end_buffer_async_read() into not unlocking
419  * the page. So the absence of BH_Async_Read tells end_buffer_async_read()
420  * that this buffer is not under async I/O.
421  *
422  * The page comes unlocked when it has no locked buffer_async buffers
423  * left.
424  *
425  * PageLocked prevents anyone starting new async I/O reads any of
426  * the buffers.
427  *
428  * PageWriteback is used to prevent simultaneous writeout of the same
429  * page.
430  *
431  * PageLocked prevents anyone from starting writeback of a page which is
432  * under read I/O (PageWriteback is only ever set against a locked page).
433  */
434 static void mark_buffer_async_read(struct buffer_head *bh)
435 {
436 	bh->b_end_io = end_buffer_async_read;
437 	set_buffer_async_read(bh);
438 }
439
440 static void mark_buffer_async_write_endio(struct buffer_head *bh,
441 					  bh_end_io_t *handler)
442 {
443 	bh->b_end_io = handler;
444 	set_buffer_async_write(bh);
445 }
446
447 void mark_buffer_async_write(struct buffer_head *bh)
448 {
449 	mark_buffer_async_write_endio(bh, end_buffer_async_write);
450 }
451 EXPORT_SYMBOL(mark_buffer_async_write);
452
453
454 /*
455  * fs/buffer.c contains helper functions for buffer-backed address space's
456  * fsync functions. A common requirement for buffer-based filesystems is
457  * that certain data from the backing blockdev needs to be written out for
458  * a successful fsync(). For example, ext2 indirect blocks need to be
459  * written back and waited upon before fsync() returns.
460  *
461  * The functions mark_buffer_inode_dirty(), fsync_inode_buffers(),
462  * inode_has_buffers() and invalidate_inode_buffers() are provided for the
463  * management of a list of dependent buffers at ->i_mapping->private_list.
464  *
465  * Locking is a little subtle: try_to_free_buffers() will remove buffers
466  * from their controlling inode's queue when they are being freed. But
467  * try_to_free_buffers() will be operating against the *blockdev* mapping
468  * at the time, not against the S_ISREG file which depends on those buffers.
469  * So the locking for private_list is via the private_lock in the address_space
470  * which backs the buffers. Which is different from the address_space
471  * against which the buffers are listed. So for a particular address_space,
472  * mapping->private_lock does *not* protect mapping->private_list! In fact,
473  * mapping->private_list will always be protected by the backing blockdev's
474  * ->private_lock.
475  *
476  * Which introduces a requirement: all buffers on an address_space's
477  * ->private_list must be from the same address_space: the blockdev's.
478  *
479  * address_spaces which do not place buffers at ->private_list via these
480  * utility functions are free to use private_lock and private_list for
481  * whatever they want. The only requirement is that list_empty(private_list)
482  * be true at clear_inode() time.
483  *
484  * FIXME: clear_inode should not call invalidate_inode_buffers(). The
485  * filesystems should do that. invalidate_inode_buffers() should just go
486  * BUG_ON(!list_empty).
487  *
488  * FIXME: mark_buffer_dirty_inode() is a data-plane operation. It should
489  * take an address_space, not an inode. And it should be called
490  * mark_buffer_dirty_fsync() to clearly define why those buffers are being
491  * queued up.
492  *
493  * FIXME: mark_buffer_dirty_inode() doesn't need to add the buffer to the
494  * list if it is already on a list. Because if the buffer is on a list,
495  * it *must* already be on the right one. If not, the filesystem is being
496  * silly. This will save a ton of locking. But first we have to ensure
497  * that buffers are taken *off* the old inode's list when they are freed
498  * (presumably in truncate). That requires careful auditing of all
499  * filesystems (do it inside bforget()). It could also be done by bringing
500  * b_inode back.
501  */
502
503 /*
504  * The buffer's backing address_space's private_lock must be held
505  */
506 static void __remove_assoc_queue(struct buffer_head *bh)
507 {
508 	list_del_init(&bh->b_assoc_buffers);
509 	WARN_ON(!bh->b_assoc_map);
510 	if (buffer_write_io_error(bh))
511 		set_bit(AS_EIO, &bh->b_assoc_map->flags);
512 	bh->b_assoc_map = NULL;
513 }
514
515 int inode_has_buffers(struct inode *inode)
516 {
517 	return !list_empty(&inode->i_data.private_list);
518 }
519
520 /*
521  * osync is designed to support O_SYNC io. It waits synchronously for
522  * all already-submitted IO to complete, but does not queue any new
523  * writes to the disk.
524  *
525  * To do O_SYNC writes, just queue the buffer writes with ll_rw_block as
526  * you dirty the buffers, and then use osync_inode_buffers to wait for
527  * completion. Any other dirty buffers which are not yet queued for
528  * write will not be flushed to disk by the osync.
529  */
530 static int osync_buffers_list(spinlock_t *lock, struct list_head *list)
531 {
532 	struct buffer_head *bh;
533 	struct list_head *p;
534 	int err = 0;
535
536 	spin_lock(lock);
537 repeat:
538 	list_for_each_prev(p, list) {
539 		bh = BH_ENTRY(p);
540 		if (buffer_locked(bh)) {
541 			get_bh(bh);
542 			spin_unlock(lock);
543 			wait_on_buffer(bh);
544 			if (!buffer_uptodate(bh))
545 				err = -EIO;
546 			brelse(bh);
547 			spin_lock(lock);
548 			goto repeat;
549 		}
550 	}
551 	spin_unlock(lock);
552 	return err;
553 }
554
555 static void do_thaw_one(struct super_block *sb, void *unused)
556 {
557 	char b[BDEVNAME_SIZE];
558 	while (sb->s_bdev && !thaw_bdev(sb->s_bdev, sb))
559 		printk(KERN_WARNING "Emergency Thaw on %s\n",
560 			bdevname(sb->s_bdev, b));
561 }
562
563 static void do_thaw_all(struct work_struct *work)
564 {
565 	iterate_supers(do_thaw_one, NULL);
566 	kfree(work);
567 	printk(KERN_WARNING "Emergency Thaw complete\n");
568 }
569
570 /**
571  * emergency_thaw_all -- forcibly thaw every frozen filesystem
572  *
573  * Used for emergency unfreeze of all filesystems via SysRq
574  */
575 void emergency_thaw_all(void)
576 {
577 	struct work_struct *work;
578
579 	work = kmalloc(sizeof(*work), GFP_ATOMIC);
580 	if (work) {
581 		INIT_WORK(work, do_thaw_all);
582 		schedule_work(work);
583 	}
584 }
585
586 /**
587  * sync_mapping_buffers - write out & wait upon a mapping's "associated" buffers
588  * @mapping: the mapping which wants those buffers written
589  *
590  * Starts I/O against the buffers at mapping->private_list, and waits upon
591  * that I/O.
592  *
593  * Basically, this is a convenience function for fsync().
594  * @mapping is a file or directory which needs those buffers to be written for
595  * a successful fsync().
596  */
597 int sync_mapping_buffers(struct address_space *mapping)
598 {
599 	struct address_space *buffer_mapping = mapping->private_data;
600
601 	if (buffer_mapping == NULL || list_empty(&mapping->private_list))
602 		return 0;
603
604 	return fsync_buffers_list(&buffer_mapping->private_lock,
605 					&mapping->private_list);
606 }
607 EXPORT_SYMBOL(sync_mapping_buffers);
608
609 /*
610  * Called when we've recently written block `bblock', and it is known that
611  * `bblock' was for a buffer_boundary() buffer. This means that the block at
612  * `bblock + 1' is probably a dirty indirect block. Hunt it down and, if it's
613  * dirty, schedule it for IO. So that indirects merge nicely with their data.
614  */
615 void write_boundary_block(struct block_device *bdev,
616 			sector_t bblock, unsigned blocksize)
617 {
618 	struct buffer_head *bh = __find_get_block(bdev, bblock + 1, blocksize);
619 	if (bh) {
620 		if (buffer_dirty(bh))
621 			ll_rw_block(WRITE, 1, &bh);
622 		put_bh(bh);
623 	}
624 }
625
626 void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
627 {
628 	struct address_space *mapping = inode->i_mapping;
629 	struct address_space *buffer_mapping = bh->b_page->mapping;
630
631 	mark_buffer_dirty(bh);
632 	if (!mapping->private_data) {
633 		mapping->private_data = buffer_mapping;
634 	} else {
635 		BUG_ON(mapping->private_data != buffer_mapping);
636 	}
637 	if (!bh->b_assoc_map) {
638 		spin_lock(&buffer_mapping->private_lock);
639 		list_move_tail(&bh->b_assoc_buffers,
640 				&mapping->private_list);
641 		bh->b_assoc_map = mapping;
642 		spin_unlock(&buffer_mapping->private_lock);
643 	}
644 }
645 EXPORT_SYMBOL(mark_buffer_dirty_inode);
646
647 /*
648  * Mark the page dirty, and set it dirty in the radix tree, and mark the inode
649  * dirty.
650  *
651  * If warn is true, then emit a warning if the page is not uptodate and has
652  * not been truncated.
653  */
654 static void __set_page_dirty(struct page *page,
655 		struct address_space *mapping, int warn)
656 {
657 	unsigned long flags;
658
659 	spin_lock_irqsave(&mapping->tree_lock, flags);
660 	if (page->mapping) {	/* Race with truncate? */
661 		WARN_ON_ONCE(warn && !PageUptodate(page));
662 		account_page_dirtied(page, mapping);
663 		radix_tree_tag_set(&mapping->page_tree,
664 				page_index(page), PAGECACHE_TAG_DIRTY);
665 	}
666 	spin_unlock_irqrestore(&mapping->tree_lock, flags);
667 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
668 }
669
670 /*
671  * Add a page to the dirty page list.
672  *
673  * It is a sad fact of life that this function is called from several places
674  * deeply under spinlocking. It may not sleep.
675  *
676  * If the page has buffers, the uptodate buffers are set dirty, to preserve
677  * dirty-state coherency between the page and the buffers. It the page does
678  * not have buffers then when they are later attached they will all be set
679  * dirty.
680  *
681  * The buffers are dirtied before the page is dirtied. There's a small race
682  * window in which a writepage caller may see the page cleanness but not the
683  * buffer dirtiness. That's fine. If this code were to set the page dirty
684  * before the buffers, a concurrent writepage caller could clear the page dirty
685  * bit, see a bunch of clean buffers and we'd end up with dirty buffers/clean
686  * page on the dirty page list.
687  *
688  * We use private_lock to lock against try_to_free_buffers while using the
689  * page's buffer list. Also use this to protect against clean buffers being
690  * added to the page after it was set dirty.
691  *
692  * FIXME: may need to call ->reservepage here as well. That's rather up to the
693  * address_space though.
694  */
695 int __set_page_dirty_buffers(struct page *page)
696 {
697 	int newly_dirty;
698 	struct address_space *mapping = page_mapping(page);
699
700 	if (unlikely(!mapping))
701 		return !TestSetPageDirty(page);
702
703 	spin_lock(&mapping->private_lock);
704 	if (page_has_buffers(page)) {
705 		struct buffer_head *head = page_buffers(page);
706 		struct buffer_head *bh = head;
707
708 		do {
709 			set_buffer_dirty(bh);
710 			bh = bh->b_this_page;
711 		} while (bh != head);
712 	}
713 	newly_dirty = !TestSetPageDirty(page);
714 	spin_unlock(&mapping->private_lock);
715
716 	if (newly_dirty)
717 		__set_page_dirty(page, mapping, 1);
718 	return newly_dirty;
719 }
720 EXPORT_SYMBOL(__set_page_dirty_buffers);
721
722 /*
723  * Write out and wait upon a list of buffers.
724  *
725  * We have conflicting pressures: we want to make sure that all
726  * initially dirty buffers get waited on, but that any subsequently
727  * dirtied buffers don't. After all, we don't want fsync to last
728  * forever if somebody is actively writing to the file.
729  *
730  * Do this in two main stages: first we copy dirty buffers to a
731  * temporary inode list, queueing the writes as we go. Then we clean
732  * up, waiting for those writes to complete.
733  *
734  * During this second stage, any subsequent updates to the file may end
735  * up refiling the buffer on the original inode's dirty list again, so
736  * there is a chance we will end up with a buffer queued for write but
737  * not yet completed on that list. So, as a final cleanup we go through
738  * the osync code to catch these locked, dirty buffers without requeuing
739  * any newly dirty buffers for write.
740  */
741 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
742 {
743 	struct buffer_head *bh;
744 	struct list_head tmp;
745 	struct address_space *mapping;
746 	int err = 0, err2;
747 	struct blk_plug plug;
748
749 	INIT_LIST_HEAD(&tmp);
750 	blk_start_plug(&plug);
751
752 	spin_lock(lock);
753 	while (!list_empty(list)) {
754 		bh = BH_ENTRY(list->next);
755 		mapping = bh->b_assoc_map;
756 		__remove_assoc_queue(bh);
757 		/* Avoid race with mark_buffer_dirty_inode() which does
758 		 * a lockless check and we rely on seeing the dirty bit */
759 		smp_mb();
760 		if (buffer_dirty(bh) || buffer_locked(bh)) {
761 			list_add(&bh->b_assoc_buffers, &tmp);
762 			bh->b_assoc_map = mapping;
763 			if (buffer_dirty(bh)) {
764 				get_bh(bh);
765 				spin_unlock(lock);
766 				/*
767 				 * Ensure any pending I/O completes so that
768 				 * write_dirty_buffer() actually writes the
769 				 * current contents - it is a noop if I/O is
770 				 * still in flight on potentially older
771 				 * contents.
772 				 */
773 				write_dirty_buffer(bh, WRITE_SYNC);
774
775 				/*
776 				 * Kick off IO for the previous mapping. Note
777 				 * that we will not run the very last mapping,
778 				 * wait_on_buffer() will do that for us
779 				 * through sync_buffer().
780 				 */
781 				brelse(bh);
782 				spin_lock(lock);
783 			}
784 		}
785 	}
786
787 	spin_unlock(lock);
788 	blk_finish_plug(&plug);
789 	spin_lock(lock);
790
791 	while (!list_empty(&tmp)) {
792 		bh = BH_ENTRY(tmp.prev);
793 		get_bh(bh);
794 		mapping = bh->b_assoc_map;
795 		__remove_assoc_queue(bh);
796 		/* Avoid race with mark_buffer_dirty_inode() which does
797 		 * a lockless check and we rely on seeing the dirty bit */
798 		smp_mb();
799 		if (buffer_dirty(bh)) {
800 			list_add(&bh->b_assoc_buffers,
801 				&mapping->private_list);
802 			bh->b_assoc_map = mapping;
803 		}
804 		spin_unlock(lock);
805 		wait_on_buffer(bh);
806 		if (!buffer_uptodate(bh))
807 			err = -EIO;
808 		brelse(bh);
809 		spin_lock(lock);
810 	}
811
812 	spin_unlock(lock);
813 	err2 = osync_buffers_list(lock, list);
814 	if (err)
815 		return err;
816 	else
817 		return err2;
818 }
819
820 /*
821  * Invalidate any and all dirty buffers on a given inode. We are
822  * probably unmounting the fs, but that doesn't mean we have already
823  * done a sync(). Just drop the buffers from the inode list.
824  *
825  * NOTE: we take the inode's blockdev's mapping's private_lock. Which
826  * assumes that all the buffers are against the blockdev. Not true
827  * for reiserfs.
828  */
829 void invalidate_inode_buffers(struct inode *inode)
830 {
831 	if (inode_has_buffers(inode)) {
832 		struct address_space *mapping = &inode->i_data;
833 		struct list_head *list = &mapping->private_list;
834 		struct address_space *buffer_mapping = mapping->private_data;
835
836 		spin_lock(&buffer_mapping->private_lock);
837 		while (!list_empty(list))
838 			__remove_assoc_queue(BH_ENTRY(list->next));
839 		spin_unlock(&buffer_mapping->private_lock);
840 	}
841 }
842 EXPORT_SYMBOL(invalidate_inode_buffers);
843
844 /*
845  * Remove any clean buffers from the inode's buffer list. This is called
846  * when we're trying to free the inode itself. Those buffers can pin it.
847  *
848  * Returns true if all buffers were removed.
849  */
850 int remove_inode_buffers(struct inode *inode)
851 {
852 	int ret = 1;
853
854 	if (inode_has_buffers(inode)) {
855 		struct address_space *mapping = &inode->i_data;
856 		struct list_head *list = &mapping->private_list;
857 		struct address_space *buffer_mapping = mapping->private_data;
858
859 		spin_lock(&buffer_mapping->private_lock);
860 		while (!list_empty(list)) {
861 			struct buffer_head *bh = BH_ENTRY(list->next);
862 			if (buffer_dirty(bh)) {
863 				ret = 0;
864 				break;
865 			}
866 			__remove_assoc_queue(bh);
867 		}
868 		spin_unlock(&buffer_mapping->private_lock);
869 	}
870 	return ret;
871 }
872
873 /*
874  * Create the appropriate buffers when given a page for data area and
875  * the size of each buffer.. Use the bh->b_this_page linked list to
876  * follow the buffers created. Return NULL if unable to create more
877  * buffers.
878  *
879  * The retry flag is used to differentiate async IO (paging, swapping)
880  * which may not fail from ordinary buffer allocations.
881  */
882 struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
883 		int retry)
884 {
885 	struct buffer_head *bh, *head;
886 	long offset;
887
888 try_again:
889 	head = NULL;
890 	offset = PAGE_SIZE;
891 	while ((offset -= size) >= 0) {
892 		bh = alloc_buffer_head(GFP_NOFS);
893 		if (!bh)
894 			goto no_grow;
895
896 		bh->b_this_page = head;
897 		bh->b_blocknr = -1;
898 		head = bh;
899
900 		bh->b_size = size;
901
902 		/* Link the buffer to its page */
903 		set_bh_page(bh, page, offset);
904 	}
905 	return head;
906 /*
907  * In case anything failed, we just free everything we got.
908  */
909 no_grow:
910 	if (head) {
911 		do {
912 			bh = head;
913 			head = head->b_this_page;
914 			free_buffer_head(bh);
915 		} while (head);
916 	}
917
918 	/*
919 	 * Return failure for non-async IO requests. Async IO requests
920 	 * are not allowed to fail, so we have to wait until buffer heads
921 	 * become available. But we don't want tasks sleeping with
922 	 * partially complete buffers, so all were released above.
923 	 */
924 	if (!retry)
925 		return NULL;
926
927 	/* We're _really_ low on memory. Now we just
928 	 * wait for old buffer heads to become free due to
929 	 * finishing IO. Since this is an async request and
930 	 * the reserve list is empty, we're sure there are
931 	 * async buffer heads in use.
932 	 */
933 	free_more_memory();
934 	goto try_again;
935 }
936 EXPORT_SYMBOL_GPL(alloc_page_buffers);
937
938 static inline void
939 link_dev_buffers(struct page *page, struct buffer_head *head)
940 {
941 	struct buffer_head *bh, *tail;
942
943 	bh = head;
944 	do {
945 		tail = bh;
946 		bh = bh->b_this_page;
947 	} while (bh);
948 	tail->b_this_page = head;
949 	attach_page_buffers(page, head);
950 }
951
952 static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
953 {
954 	sector_t retval = ~((sector_t)0);
955 	loff_t sz = i_size_read(bdev->bd_inode);
956
957 	if (sz) {
958 		unsigned int sizebits = blksize_bits(size);
959 		retval = (sz >> sizebits);
960 	}
961 	return retval;
962 }
963
964 /*
965  * Initialise the state of a blockdev page's buffers.
966  */
967 static sector_t
968 init_page_buffers(struct page *page, struct block_device *bdev,
969 			sector_t block, int size)
970 {
971 	struct buffer_head *head = page_buffers(page);
972 	struct buffer_head *bh = head;
973 	int uptodate = PageUptodate(page);
974 	sector_t end_block = blkdev_max_block(I_BDEV(bdev->bd_inode), size);
975
976 	do {
977 		if (!buffer_mapped(bh)) {
978 			init_buffer(bh, NULL, NULL);
979 			bh->b_bdev = bdev;
980 			bh->b_blocknr = block;
981 			if (uptodate)
982 				set_buffer_uptodate(bh);
983 			if (block < end_block)
984 				set_buffer_mapped(bh);
985 		}
986 		block++;
987 		bh = bh->b_this_page;
988 	} while (bh != head);
989
990 	/*
991 	 * Caller needs to validate requested block against end of device.
992 	 */
993 	return end_block;
994 }
995
996 /*
997  * Create the page-cache page that contains the requested block.
998  *
999  * This is used purely for blockdev mappings.
1000  */
1001 static int
1002 grow_dev_page(struct block_device *bdev, sector_t block,
1003 		pgoff_t index, int size, int sizebits)
1004 {
1005 	struct inode *inode = bdev->bd_inode;
1006 	struct page *page;
1007 	struct buffer_head *bh;
1008 	sector_t end_block;
1009 	int ret = 0;		/* Will call free_more_memory() */
1010 	gfp_t gfp_mask;
1011
1012 	gfp_mask = mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS;
1013 	gfp_mask |= __GFP_MOVABLE;
1014 	/*
1015 	 * XXX: __getblk_slow() can not really deal with failure and
1016 	 * will endlessly loop on improvised global reclaim. Prefer
1017 	 * looping in the allocator rather than here, at least that
1018 	 * code knows what it's doing.
1019 	 */
1020 	gfp_mask |= __GFP_NOFAIL;
1021
1022 	page = find_or_create_page(inode->i_mapping, index, gfp_mask);
1023 	if (!page)
1024 		return ret;
1025
1026 	BUG_ON(!PageLocked(page));
1027
1028 	if (page_has_buffers(page)) {
1029 		bh = page_buffers(page);
1030 		if (bh->b_size == size) {
1031 			end_block = init_page_buffers(page, bdev,
1032 						index << sizebits, size);
1033 			goto done;
1034 		}
1035 		if (!try_to_free_buffers(page))
1036 			goto failed;
1037 	}
1038
1039 	/*
1040 	 * Allocate some buffers for this page
1041 	 */
1042 	bh = alloc_page_buffers(page, size, 0);
1043 	if (!bh)
1044 		goto failed;
1045
1046 	/*
1047 	 * Link the page to the buffers and initialise them. Take the
1048 	 * lock to be atomic wrt __find_get_block(), which does not
1049 	 * run under the page lock.
1050 	 */
1051 	spin_lock(&inode->i_mapping->private_lock);
1052 	link_dev_buffers(page, bh);
1053 	end_block = init_page_buffers(page, bdev, index << sizebits, size);
1054 	spin_unlock(&inode->i_mapping->private_lock);
1055 done:
1056 	ret = (block < end_block) ? 1 : -ENXIO;
1057 failed:
1058 	unlock_page(page);
1059 	page_cache_release(page);
1060 	return ret;
1061 }
1062
1063 /*
1064  * Create buffers for the specified block device block's page. If
1065  * that page was dirty, the buffers are set dirty also.
1066  */
1067 static int
1068 grow_buffers(struct block_device *bdev, sector_t block, int size) 1068 grow_buffers(struct block_device *bdev, sector_t block, int size)
1069 { 1069 {
1070 pgoff_t index; 1070 pgoff_t index;
1071 int sizebits; 1071 int sizebits;
1072 1072
1073 sizebits = -1; 1073 sizebits = -1;
1074 do { 1074 do {
1075 sizebits++; 1075 sizebits++;
1076 } while ((size << sizebits) < PAGE_SIZE); 1076 } while ((size << sizebits) < PAGE_SIZE);
1077 1077
1078 index = block >> sizebits; 1078 index = block >> sizebits;
1079 1079
1080 /* 1080 /*
1081 * Check for a block which wants to lie outside our maximum possible 1081 * Check for a block which wants to lie outside our maximum possible
1082 * pagecache index. (this comparison is done using sector_t types). 1082 * pagecache index. (this comparison is done using sector_t types).
1083 */ 1083 */
1084 if (unlikely(index != block >> sizebits)) { 1084 if (unlikely(index != block >> sizebits)) {
1085 char b[BDEVNAME_SIZE]; 1085 char b[BDEVNAME_SIZE];
1086 1086
1087 printk(KERN_ERR "%s: requested out-of-range block %llu for " 1087 printk(KERN_ERR "%s: requested out-of-range block %llu for "
1088 "device %s\n", 1088 "device %s\n",
1089 __func__, (unsigned long long)block, 1089 __func__, (unsigned long long)block,
1090 bdevname(bdev, b)); 1090 bdevname(bdev, b));
1091 return -EIO; 1091 return -EIO;
1092 } 1092 }
1093 1093
1094 /* Create a page with the proper size buffers.. */ 1094 /* Create a page with the proper size buffers.. */
1095 return grow_dev_page(bdev, block, index, size, sizebits); 1095 return grow_dev_page(bdev, block, index, size, sizebits);
1096 } 1096 }
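To make the sizebits/index computation above concrete, here is a standalone sketch with hypothetical numbers: with 512-byte blocks on a 4096-byte page, sizebits ends up as 3, block 12345 lands in page index 1543, and (index << sizebits) = 12344 is the first block that grow_dev_page()/init_page_buffers() start from.

	/* Sketch of the block -> page-index mapping used by grow_buffers().
	 * With 512-byte blocks and 4096-byte pages, eight blocks share one
	 * page-cache page. Hypothetical values only. */
	#include <stdio.h>
	#include <stdint.h>

	#define PAGE_SIZE_MODEL 4096u

	int main(void)
	{
		unsigned int size = 512;	/* block size */
		uint64_t block = 12345;		/* requested block number */
		int sizebits = -1;

		do {
			sizebits++;
		} while ((size << sizebits) < PAGE_SIZE_MODEL);

		/* index 1543 holds blocks 12344..12351; (index << sizebits) is
		 * the first block of that page */
		printf("sizebits=%d index=%llu first_block=%llu\n", sizebits,
		       (unsigned long long)(block >> sizebits),
		       (unsigned long long)((block >> sizebits) << sizebits));
		return 0;
	}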
1097 1097
1098 static struct buffer_head * 1098 static struct buffer_head *
1099 __getblk_slow(struct block_device *bdev, sector_t block, int size) 1099 __getblk_slow(struct block_device *bdev, sector_t block, int size)
1100 { 1100 {
1101 /* Size must be multiple of hard sectorsize */ 1101 /* Size must be multiple of hard sectorsize */
1102 if (unlikely(size & (bdev_logical_block_size(bdev)-1) || 1102 if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
1103 (size < 512 || size > PAGE_SIZE))) { 1103 (size < 512 || size > PAGE_SIZE))) {
1104 printk(KERN_ERR "getblk(): invalid block size %d requested\n", 1104 printk(KERN_ERR "getblk(): invalid block size %d requested\n",
1105 size); 1105 size);
1106 printk(KERN_ERR "logical block size: %d\n", 1106 printk(KERN_ERR "logical block size: %d\n",
1107 bdev_logical_block_size(bdev)); 1107 bdev_logical_block_size(bdev));
1108 1108
1109 dump_stack(); 1109 dump_stack();
1110 return NULL; 1110 return NULL;
1111 } 1111 }
1112 1112
1113 for (;;) { 1113 for (;;) {
1114 struct buffer_head *bh; 1114 struct buffer_head *bh;
1115 int ret; 1115 int ret;
1116 1116
1117 bh = __find_get_block(bdev, block, size); 1117 bh = __find_get_block(bdev, block, size);
1118 if (bh) 1118 if (bh)
1119 return bh; 1119 return bh;
1120 1120
1121 ret = grow_buffers(bdev, block, size); 1121 ret = grow_buffers(bdev, block, size);
1122 if (ret < 0) 1122 if (ret < 0)
1123 return NULL; 1123 return NULL;
1124 if (ret == 0) 1124 if (ret == 0)
1125 free_more_memory(); 1125 free_more_memory();
1126 } 1126 }
1127 } 1127 }
1128 1128
1129 /* 1129 /*
1130 * The relationship between dirty buffers and dirty pages: 1130 * The relationship between dirty buffers and dirty pages:
1131 * 1131 *
1132 * Whenever a page has any dirty buffers, the page's dirty bit is set, and 1132 * Whenever a page has any dirty buffers, the page's dirty bit is set, and
1133 * the page is tagged dirty in its radix tree. 1133 * the page is tagged dirty in its radix tree.
1134 * 1134 *
1135 * At all times, the dirtiness of the buffers represents the dirtiness of 1135 * At all times, the dirtiness of the buffers represents the dirtiness of
1136 * subsections of the page. If the page has buffers, the page dirty bit is 1136 * subsections of the page. If the page has buffers, the page dirty bit is
1137 * merely a hint about the true dirty state. 1137 * merely a hint about the true dirty state.
1138 * 1138 *
1139 * When a page is set dirty in its entirety, all its buffers are marked dirty 1139 * When a page is set dirty in its entirety, all its buffers are marked dirty
1140 * (if the page has buffers). 1140 * (if the page has buffers).
1141 * 1141 *
1142 * When a buffer is marked dirty, its page is dirtied, but the page's other 1142 * When a buffer is marked dirty, its page is dirtied, but the page's other
1143 * buffers are not. 1143 * buffers are not.
1144 * 1144 *
1145 * Also. When blockdev buffers are explicitly read with bread(), they 1145 * Also. When blockdev buffers are explicitly read with bread(), they
1146 * individually become uptodate. But their backing page remains not 1146 * individually become uptodate. But their backing page remains not
1147 * uptodate - even if all of its buffers are uptodate. A subsequent 1147 * uptodate - even if all of its buffers are uptodate. A subsequent
1148 * block_read_full_page() against that page will discover all the uptodate 1148 * block_read_full_page() against that page will discover all the uptodate
1149 * buffers, will set the page uptodate and will perform no I/O. 1149 * buffers, will set the page uptodate and will perform no I/O.
1150 */ 1150 */
1151 1151
1152 /** 1152 /**
1153 * mark_buffer_dirty - mark a buffer_head as needing writeout 1153 * mark_buffer_dirty - mark a buffer_head as needing writeout
1154 * @bh: the buffer_head to mark dirty 1154 * @bh: the buffer_head to mark dirty
1155 * 1155 *
1156 * mark_buffer_dirty() will set the dirty bit against the buffer, then set its 1156 * mark_buffer_dirty() will set the dirty bit against the buffer, then set its
1157 * backing page dirty, then tag the page as dirty in its address_space's radix 1157 * backing page dirty, then tag the page as dirty in its address_space's radix
1158 * tree and then attach the address_space's inode to its superblock's dirty 1158 * tree and then attach the address_space's inode to its superblock's dirty
1159 * inode list. 1159 * inode list.
1160 * 1160 *
1161 * mark_buffer_dirty() is atomic. It takes bh->b_page->mapping->private_lock, 1161 * mark_buffer_dirty() is atomic. It takes bh->b_page->mapping->private_lock,
1162 * mapping->tree_lock and mapping->host->i_lock. 1162 * mapping->tree_lock and mapping->host->i_lock.
1163 */ 1163 */
1164 void mark_buffer_dirty(struct buffer_head *bh) 1164 void mark_buffer_dirty(struct buffer_head *bh)
1165 { 1165 {
1166 WARN_ON_ONCE(!buffer_uptodate(bh)); 1166 WARN_ON_ONCE(!buffer_uptodate(bh));
1167 1167
1168 trace_block_dirty_buffer(bh); 1168 trace_block_dirty_buffer(bh);
1169 1169
1170 /* 1170 /*
1171 * Very *carefully* optimize the it-is-already-dirty case. 1171 * Very *carefully* optimize the it-is-already-dirty case.
1172 * 1172 *
1173 * Don't let the final "is it dirty" escape to before we 1173 * Don't let the final "is it dirty" escape to before we
1174 * perhaps modified the buffer. 1174 * perhaps modified the buffer.
1175 */ 1175 */
1176 if (buffer_dirty(bh)) { 1176 if (buffer_dirty(bh)) {
1177 smp_mb(); 1177 smp_mb();
1178 if (buffer_dirty(bh)) 1178 if (buffer_dirty(bh))
1179 return; 1179 return;
1180 } 1180 }
1181 1181
1182 if (!test_set_buffer_dirty(bh)) { 1182 if (!test_set_buffer_dirty(bh)) {
1183 struct page *page = bh->b_page; 1183 struct page *page = bh->b_page;
1184 if (!TestSetPageDirty(page)) { 1184 if (!TestSetPageDirty(page)) {
1185 struct address_space *mapping = page_mapping(page); 1185 struct address_space *mapping = page_mapping(page);
1186 if (mapping) 1186 if (mapping)
1187 __set_page_dirty(page, mapping, 0); 1187 __set_page_dirty(page, mapping, 0);
1188 } 1188 }
1189 } 1189 }
1190 } 1190 }
1191 EXPORT_SYMBOL(mark_buffer_dirty); 1191 EXPORT_SYMBOL(mark_buffer_dirty);
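A hedged sketch of the usual caller pattern, read-modify-dirty on a metadata block: only the buffer-head calls are real kernel APIs here, while the on-disk structure, helper name and block number are invented for illustration.

	/* Hedged sketch of a typical mark_buffer_dirty() caller: a filesystem
	 * updates an on-disk structure through bh->b_data and lets writeback
	 * (or an explicit sync) push it out later. */
	#include <linux/buffer_head.h>
	#include <linux/blkdev.h>

	typedef __le32 my_fs_counter_t;

	static int my_fs_bump_counter(struct block_device *bdev, sector_t block)
	{
		struct buffer_head *bh = __bread(bdev, block, 4096);
		my_fs_counter_t *ctr;

		if (!bh)
			return -EIO;

		ctr = (my_fs_counter_t *)bh->b_data;
		le32_add_cpu(ctr, 1);		/* modify the in-memory copy ... */
		mark_buffer_dirty(bh);		/* ... then flag buffer and page dirty */
		/* sync_dirty_buffer(bh) could be used here for a synchronous write */
		brelse(bh);
		return 0;
	}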
1192 1192
1193 /* 1193 /*
1194 * Decrement a buffer_head's reference count. If all buffers against a page 1194 * Decrement a buffer_head's reference count. If all buffers against a page
1195 * have zero reference count, are clean and unlocked, and if the page is clean 1195 * have zero reference count, are clean and unlocked, and if the page is clean
1196 * and unlocked then try_to_free_buffers() may strip the buffers from the page 1196 * and unlocked then try_to_free_buffers() may strip the buffers from the page
1197 * in preparation for freeing it (sometimes, rarely, buffers are removed from 1197 * in preparation for freeing it (sometimes, rarely, buffers are removed from
1198 * a page but it ends up not being freed, and buffers may later be reattached). 1198 * a page but it ends up not being freed, and buffers may later be reattached).
1199 */ 1199 */
1200 void __brelse(struct buffer_head * buf) 1200 void __brelse(struct buffer_head * buf)
1201 { 1201 {
1202 if (atomic_read(&buf->b_count)) { 1202 if (atomic_read(&buf->b_count)) {
1203 put_bh(buf); 1203 put_bh(buf);
1204 return; 1204 return;
1205 } 1205 }
1206 WARN(1, KERN_ERR "VFS: brelse: Trying to free free buffer\n"); 1206 WARN(1, KERN_ERR "VFS: brelse: Trying to free free buffer\n");
1207 } 1207 }
1208 EXPORT_SYMBOL(__brelse); 1208 EXPORT_SYMBOL(__brelse);
1209 1209
1210 /* 1210 /*
1211 * bforget() is like brelse(), except it discards any 1211 * bforget() is like brelse(), except it discards any
1212 * potentially dirty data. 1212 * potentially dirty data.
1213 */ 1213 */
1214 void __bforget(struct buffer_head *bh) 1214 void __bforget(struct buffer_head *bh)
1215 { 1215 {
1216 clear_buffer_dirty(bh); 1216 clear_buffer_dirty(bh);
1217 if (bh->b_assoc_map) { 1217 if (bh->b_assoc_map) {
1218 struct address_space *buffer_mapping = bh->b_page->mapping; 1218 struct address_space *buffer_mapping = bh->b_page->mapping;
1219 1219
1220 spin_lock(&buffer_mapping->private_lock); 1220 spin_lock(&buffer_mapping->private_lock);
1221 list_del_init(&bh->b_assoc_buffers); 1221 list_del_init(&bh->b_assoc_buffers);
1222 bh->b_assoc_map = NULL; 1222 bh->b_assoc_map = NULL;
1223 spin_unlock(&buffer_mapping->private_lock); 1223 spin_unlock(&buffer_mapping->private_lock);
1224 } 1224 }
1225 __brelse(bh); 1225 __brelse(bh);
1226 } 1226 }
1227 EXPORT_SYMBOL(__bforget); 1227 EXPORT_SYMBOL(__bforget);
1228 1228
1229 static struct buffer_head *__bread_slow(struct buffer_head *bh) 1229 static struct buffer_head *__bread_slow(struct buffer_head *bh)
1230 { 1230 {
1231 lock_buffer(bh); 1231 lock_buffer(bh);
1232 if (buffer_uptodate(bh)) { 1232 if (buffer_uptodate(bh)) {
1233 unlock_buffer(bh); 1233 unlock_buffer(bh);
1234 return bh; 1234 return bh;
1235 } else { 1235 } else {
1236 get_bh(bh); 1236 get_bh(bh);
1237 bh->b_end_io = end_buffer_read_sync; 1237 bh->b_end_io = end_buffer_read_sync;
1238 submit_bh(READ, bh); 1238 submit_bh(READ, bh);
1239 wait_on_buffer(bh); 1239 wait_on_buffer(bh);
1240 if (buffer_uptodate(bh)) 1240 if (buffer_uptodate(bh))
1241 return bh; 1241 return bh;
1242 } 1242 }
1243 brelse(bh); 1243 brelse(bh);
1244 return NULL; 1244 return NULL;
1245 } 1245 }
1246 1246
1247 /* 1247 /*
1248 * Per-cpu buffer LRU implementation. To reduce the cost of __find_get_block(). 1248 * Per-cpu buffer LRU implementation. To reduce the cost of __find_get_block().
1249 * The bhs[] array is sorted - newest buffer is at bhs[0]. Buffers have their 1249 * The bhs[] array is sorted - newest buffer is at bhs[0]. Buffers have their
1250 * refcount elevated by one when they're in an LRU. A buffer can only appear 1250 * refcount elevated by one when they're in an LRU. A buffer can only appear
1251 * once in a particular CPU's LRU. A single buffer can be present in multiple 1251 * once in a particular CPU's LRU. A single buffer can be present in multiple
1252 * CPU's LRUs at the same time. 1252 * CPU's LRUs at the same time.
1253 * 1253 *
1254 * This is a transparent caching front-end to sb_bread(), sb_getblk() and 1254 * This is a transparent caching front-end to sb_bread(), sb_getblk() and
1255 * sb_find_get_block(). 1255 * sb_find_get_block().
1256 * 1256 *
1257 * The LRUs themselves only need locking against invalidate_bh_lrus. We use 1257 * The LRUs themselves only need locking against invalidate_bh_lrus. We use
1258 * a local interrupt disable for that. 1258 * a local interrupt disable for that.
1259 */ 1259 */
1260 1260
1261 #define BH_LRU_SIZE 8 1261 #define BH_LRU_SIZE 8
1262 1262
1263 struct bh_lru { 1263 struct bh_lru {
1264 struct buffer_head *bhs[BH_LRU_SIZE]; 1264 struct buffer_head *bhs[BH_LRU_SIZE];
1265 }; 1265 };
1266 1266
1267 static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }}; 1267 static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }};
1268 1268
1269 #ifdef CONFIG_SMP 1269 #ifdef CONFIG_SMP
1270 #define bh_lru_lock() local_irq_disable() 1270 #define bh_lru_lock() local_irq_disable()
1271 #define bh_lru_unlock() local_irq_enable() 1271 #define bh_lru_unlock() local_irq_enable()
1272 #else 1272 #else
1273 #define bh_lru_lock() preempt_disable() 1273 #define bh_lru_lock() preempt_disable()
1274 #define bh_lru_unlock() preempt_enable() 1274 #define bh_lru_unlock() preempt_enable()
1275 #endif 1275 #endif
1276 1276
1277 static inline void check_irqs_on(void) 1277 static inline void check_irqs_on(void)
1278 { 1278 {
1279 #ifdef irqs_disabled 1279 #ifdef irqs_disabled
1280 BUG_ON(irqs_disabled()); 1280 BUG_ON(irqs_disabled());
1281 #endif 1281 #endif
1282 } 1282 }
1283 1283
1284 /* 1284 /*
1285 * The LRU management algorithm is dopey-but-simple. Sorry. 1285 * The LRU management algorithm is dopey-but-simple. Sorry.
1286 */ 1286 */
1287 static void bh_lru_install(struct buffer_head *bh) 1287 static void bh_lru_install(struct buffer_head *bh)
1288 { 1288 {
1289 struct buffer_head *evictee = NULL; 1289 struct buffer_head *evictee = NULL;
1290 1290
1291 check_irqs_on(); 1291 check_irqs_on();
1292 bh_lru_lock(); 1292 bh_lru_lock();
1293 if (__this_cpu_read(bh_lrus.bhs[0]) != bh) { 1293 if (__this_cpu_read(bh_lrus.bhs[0]) != bh) {
1294 struct buffer_head *bhs[BH_LRU_SIZE]; 1294 struct buffer_head *bhs[BH_LRU_SIZE];
1295 int in; 1295 int in;
1296 int out = 0; 1296 int out = 0;
1297 1297
1298 get_bh(bh); 1298 get_bh(bh);
1299 bhs[out++] = bh; 1299 bhs[out++] = bh;
1300 for (in = 0; in < BH_LRU_SIZE; in++) { 1300 for (in = 0; in < BH_LRU_SIZE; in++) {
1301 struct buffer_head *bh2 = 1301 struct buffer_head *bh2 =
1302 __this_cpu_read(bh_lrus.bhs[in]); 1302 __this_cpu_read(bh_lrus.bhs[in]);
1303 1303
1304 if (bh2 == bh) { 1304 if (bh2 == bh) {
1305 __brelse(bh2); 1305 __brelse(bh2);
1306 } else { 1306 } else {
1307 if (out >= BH_LRU_SIZE) { 1307 if (out >= BH_LRU_SIZE) {
1308 BUG_ON(evictee != NULL); 1308 BUG_ON(evictee != NULL);
1309 evictee = bh2; 1309 evictee = bh2;
1310 } else { 1310 } else {
1311 bhs[out++] = bh2; 1311 bhs[out++] = bh2;
1312 } 1312 }
1313 } 1313 }
1314 } 1314 }
1315 while (out < BH_LRU_SIZE) 1315 while (out < BH_LRU_SIZE)
1316 bhs[out++] = NULL; 1316 bhs[out++] = NULL;
1317 memcpy(__this_cpu_ptr(&bh_lrus.bhs), bhs, sizeof(bhs)); 1317 memcpy(__this_cpu_ptr(&bh_lrus.bhs), bhs, sizeof(bhs));
1318 } 1318 }
1319 bh_lru_unlock(); 1319 bh_lru_unlock();
1320 1320
1321 if (evictee) 1321 if (evictee)
1322 __brelse(evictee); 1322 __brelse(evictee);
1323 } 1323 }
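The install routine above is just a move-to-front insert into a fixed eight-slot array. A userspace model of the same idea (per-CPU storage, locking and refcounting deliberately omitted, so this is an illustration rather than the kernel algorithm verbatim):

	/* Userspace model of the "dopey-but-simple" move-to-front LRU in
	 * bh_lru_install(): slot 0 is the most recent entry, the oldest
	 * entry falls off the end. */
	#include <stdio.h>
	#include <string.h>

	#define LRU_SIZE 8

	static void *lru[LRU_SIZE];

	static void lru_install(void *item)
	{
		void *tmp[LRU_SIZE] = { item };
		int in, out = 1;

		if (lru[0] == item)
			return;			/* already most recent */

		for (in = 0; in < LRU_SIZE && out < LRU_SIZE; in++) {
			if (lru[in] && lru[in] != item)
				tmp[out++] = lru[in];	/* keep, shifted down */
			/* a duplicate of item is dropped; the overflow entry
			 * (the evictee) simply never gets copied */
		}
		memcpy(lru, tmp, sizeof(lru));
	}

	int main(void)
	{
		int a, b, c;

		lru_install(&a);
		lru_install(&b);
		lru_install(&a);	/* moves &a back to slot 0 */
		lru_install(&c);
		printf("%d\n", lru[0] == &c && lru[1] == &a && lru[2] == &b);
		return 0;
	}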
1324 1324
1325 /* 1325 /*
1326 * Look up the bh in this cpu's LRU. If it's there, move it to the head. 1326 * Look up the bh in this cpu's LRU. If it's there, move it to the head.
1327 */ 1327 */
1328 static struct buffer_head * 1328 static struct buffer_head *
1329 lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size) 1329 lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
1330 { 1330 {
1331 struct buffer_head *ret = NULL; 1331 struct buffer_head *ret = NULL;
1332 unsigned int i; 1332 unsigned int i;
1333 1333
1334 check_irqs_on(); 1334 check_irqs_on();
1335 bh_lru_lock(); 1335 bh_lru_lock();
1336 for (i = 0; i < BH_LRU_SIZE; i++) { 1336 for (i = 0; i < BH_LRU_SIZE; i++) {
1337 struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]); 1337 struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
1338 1338
1339 if (bh && bh->b_bdev == bdev && 1339 if (bh && bh->b_bdev == bdev &&
1340 bh->b_blocknr == block && bh->b_size == size) { 1340 bh->b_blocknr == block && bh->b_size == size) {
1341 if (i) { 1341 if (i) {
1342 while (i) { 1342 while (i) {
1343 __this_cpu_write(bh_lrus.bhs[i], 1343 __this_cpu_write(bh_lrus.bhs[i],
1344 __this_cpu_read(bh_lrus.bhs[i - 1])); 1344 __this_cpu_read(bh_lrus.bhs[i - 1]));
1345 i--; 1345 i--;
1346 } 1346 }
1347 __this_cpu_write(bh_lrus.bhs[0], bh); 1347 __this_cpu_write(bh_lrus.bhs[0], bh);
1348 } 1348 }
1349 get_bh(bh); 1349 get_bh(bh);
1350 ret = bh; 1350 ret = bh;
1351 break; 1351 break;
1352 } 1352 }
1353 } 1353 }
1354 bh_lru_unlock(); 1354 bh_lru_unlock();
1355 return ret; 1355 return ret;
1356 } 1356 }
1357 1357
1358 /* 1358 /*
1359 * Perform a pagecache lookup for the matching buffer. If it's there, refresh 1359 * Perform a pagecache lookup for the matching buffer. If it's there, refresh
1360 * it in the LRU and mark it as accessed. If it is not present then return 1360 * it in the LRU and mark it as accessed. If it is not present then return
1361 * NULL 1361 * NULL
1362 */ 1362 */
1363 struct buffer_head * 1363 struct buffer_head *
1364 __find_get_block(struct block_device *bdev, sector_t block, unsigned size) 1364 __find_get_block(struct block_device *bdev, sector_t block, unsigned size)
1365 { 1365 {
1366 struct buffer_head *bh = lookup_bh_lru(bdev, block, size); 1366 struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
1367 1367
1368 if (bh == NULL) { 1368 if (bh == NULL) {
1369 /* __find_get_block_slow will mark the page accessed */
1369 bh = __find_get_block_slow(bdev, block); 1370 bh = __find_get_block_slow(bdev, block);
1370 if (bh) 1371 if (bh)
1371 bh_lru_install(bh); 1372 bh_lru_install(bh);
1372 } 1373 } else
1373 if (bh)
1374 touch_buffer(bh); 1374 touch_buffer(bh);
1375
1375 return bh; 1376 return bh;
1376 } 1377 }
1377 EXPORT_SYMBOL(__find_get_block); 1378 EXPORT_SYMBOL(__find_get_block);
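Note the behaviour change in this hunk: touch_buffer() now runs only on an LRU hit, because on a miss __find_get_block_slow() goes through a page-cache lookup that already marks the page accessed, so repeating it would only add a redundant atomic operation. A hedged sketch of a caller, a lookup that must not allocate anything; the helper name and its use are hypothetical:

	/* Hedged sketch of a __find_get_block() caller: check whether a
	 * metadata block is already cached and dirty, without creating it. */
	#include <linux/buffer_head.h>

	static bool my_fs_block_is_cached_dirty(struct block_device *bdev,
						sector_t block, unsigned size)
	{
		struct buffer_head *bh = __find_get_block(bdev, block, size);
		bool ret = false;

		if (bh) {
			ret = buffer_dirty(bh);
			brelse(bh);	/* drop the reference the lookup took */
		}
		return ret;
	}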
1378 1379
1379 /* 1380 /*
1380 * __getblk will locate (and, if necessary, create) the buffer_head 1381 * __getblk will locate (and, if necessary, create) the buffer_head
1381 * which corresponds to the passed block_device, block and size. The 1382 * which corresponds to the passed block_device, block and size. The
1382 * returned buffer has its reference count incremented. 1383 * returned buffer has its reference count incremented.
1383 * 1384 *
1384 * __getblk() will lock up the machine if grow_dev_page's try_to_free_buffers() 1385 * __getblk() will lock up the machine if grow_dev_page's try_to_free_buffers()
1385 * attempt is failing. FIXME, perhaps? 1386 * attempt is failing. FIXME, perhaps?
1386 */ 1387 */
1387 struct buffer_head * 1388 struct buffer_head *
1388 __getblk(struct block_device *bdev, sector_t block, unsigned size) 1389 __getblk(struct block_device *bdev, sector_t block, unsigned size)
1389 { 1390 {
1390 struct buffer_head *bh = __find_get_block(bdev, block, size); 1391 struct buffer_head *bh = __find_get_block(bdev, block, size);
1391 1392
1392 might_sleep(); 1393 might_sleep();
1393 if (bh == NULL) 1394 if (bh == NULL)
1394 bh = __getblk_slow(bdev, block, size); 1395 bh = __getblk_slow(bdev, block, size);
1395 return bh; 1396 return bh;
1396 } 1397 }
1397 EXPORT_SYMBOL(__getblk); 1398 EXPORT_SYMBOL(__getblk);
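A hedged sketch of the classic __getblk() pattern for a block that will be completely overwritten, so nothing needs to be read before it is marked uptodate; the helper name and block contents are hypothetical:

	/* Hedged sketch: allocate (or find) the buffer, fill it entirely,
	 * then declare it uptodate and dirty so writeback will store it. */
	#include <linux/buffer_head.h>
	#include <linux/string.h>

	static int my_fs_zero_block(struct block_device *bdev, sector_t block,
				    unsigned size)
	{
		struct buffer_head *bh = __getblk(bdev, block, size);

		if (!bh)
			return -ENOMEM;

		lock_buffer(bh);
		memset(bh->b_data, 0, bh->b_size);
		set_buffer_uptodate(bh);	/* contents are now fully defined */
		unlock_buffer(bh);
		mark_buffer_dirty(bh);		/* queue it for writeback */
		brelse(bh);
		return 0;
	}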
1398 1399
1399 /* 1400 /*
1400 * Do async read-ahead on a buffer.. 1401 * Do async read-ahead on a buffer..
1401 */ 1402 */
1402 void __breadahead(struct block_device *bdev, sector_t block, unsigned size) 1403 void __breadahead(struct block_device *bdev, sector_t block, unsigned size)
1403 { 1404 {
1404 struct buffer_head *bh = __getblk(bdev, block, size); 1405 struct buffer_head *bh = __getblk(bdev, block, size);
1405 if (likely(bh)) { 1406 if (likely(bh)) {
1406 ll_rw_block(READA, 1, &bh); 1407 ll_rw_block(READA, 1, &bh);
1407 brelse(bh); 1408 brelse(bh);
1408 } 1409 }
1409 } 1410 }
1410 EXPORT_SYMBOL(__breadahead); 1411 EXPORT_SYMBOL(__breadahead);
1411 1412
1412 /** 1413 /**
1413 * __bread() - reads a specified block and returns the bh 1414 * __bread() - reads a specified block and returns the bh
1414 * @bdev: the block_device to read from 1415 * @bdev: the block_device to read from
1415 * @block: number of block 1416 * @block: number of block
1416 * @size: size (in bytes) to read 1417 * @size: size (in bytes) to read
1417 * 1418 *
1418 * Reads a specified block, and returns buffer head that contains it. 1419 * Reads a specified block, and returns buffer head that contains it.
1419 * It returns NULL if the block was unreadable. 1420 * It returns NULL if the block was unreadable.
1420 */ 1421 */
1421 struct buffer_head * 1422 struct buffer_head *
1422 __bread(struct block_device *bdev, sector_t block, unsigned size) 1423 __bread(struct block_device *bdev, sector_t block, unsigned size)
1423 { 1424 {
1424 struct buffer_head *bh = __getblk(bdev, block, size); 1425 struct buffer_head *bh = __getblk(bdev, block, size);
1425 1426
1426 if (likely(bh) && !buffer_uptodate(bh)) 1427 if (likely(bh) && !buffer_uptodate(bh))
1427 bh = __bread_slow(bh); 1428 bh = __bread_slow(bh);
1428 return bh; 1429 return bh;
1429 } 1430 }
1430 EXPORT_SYMBOL(__bread); 1431 EXPORT_SYMBOL(__bread);
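A hedged example of the synchronous __bread()/brelse() pattern; the superblock layout, magic value and block location are invented for illustration:

	/* Hedged sketch: read one block synchronously and inspect it. */
	#include <linux/buffer_head.h>

	struct my_fs_super { __le32 s_magic; };
	#define MY_FS_MAGIC 0x4d594653

	static int my_fs_check_super(struct block_device *bdev)
	{
		struct buffer_head *bh = __bread(bdev, 1, 1024);
		int ret = -EINVAL;

		if (!bh)
			return -EIO;	/* block was unreadable */
		if (le32_to_cpu(((struct my_fs_super *)bh->b_data)->s_magic) ==
		    MY_FS_MAGIC)
			ret = 0;
		brelse(bh);
		return ret;
	}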
1431 1432
1432 /* 1433 /*
1433 * invalidate_bh_lrus() is called rarely - but not only at unmount. 1434 * invalidate_bh_lrus() is called rarely - but not only at unmount.
1434 * This doesn't race because it runs in each cpu either in irq 1435 * This doesn't race because it runs in each cpu either in irq
1435 * or with preempt disabled. 1436 * or with preempt disabled.
1436 */ 1437 */
1437 static void invalidate_bh_lru(void *arg) 1438 static void invalidate_bh_lru(void *arg)
1438 { 1439 {
1439 struct bh_lru *b = &get_cpu_var(bh_lrus); 1440 struct bh_lru *b = &get_cpu_var(bh_lrus);
1440 int i; 1441 int i;
1441 1442
1442 for (i = 0; i < BH_LRU_SIZE; i++) { 1443 for (i = 0; i < BH_LRU_SIZE; i++) {
1443 brelse(b->bhs[i]); 1444 brelse(b->bhs[i]);
1444 b->bhs[i] = NULL; 1445 b->bhs[i] = NULL;
1445 } 1446 }
1446 put_cpu_var(bh_lrus); 1447 put_cpu_var(bh_lrus);
1447 } 1448 }
1448 1449
1449 static bool has_bh_in_lru(int cpu, void *dummy) 1450 static bool has_bh_in_lru(int cpu, void *dummy)
1450 { 1451 {
1451 struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu); 1452 struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
1452 int i; 1453 int i;
1453 1454
1454 for (i = 0; i < BH_LRU_SIZE; i++) { 1455 for (i = 0; i < BH_LRU_SIZE; i++) {
1455 if (b->bhs[i]) 1456 if (b->bhs[i])
1456 return 1; 1457 return 1;
1457 } 1458 }
1458 1459
1459 return 0; 1460 return 0;
1460 } 1461 }
1461 1462
1462 void invalidate_bh_lrus(void) 1463 void invalidate_bh_lrus(void)
1463 { 1464 {
1464 on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1, GFP_KERNEL); 1465 on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1, GFP_KERNEL);
1465 } 1466 }
1466 EXPORT_SYMBOL_GPL(invalidate_bh_lrus); 1467 EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
1467 1468
1468 void set_bh_page(struct buffer_head *bh, 1469 void set_bh_page(struct buffer_head *bh,
1469 struct page *page, unsigned long offset) 1470 struct page *page, unsigned long offset)
1470 { 1471 {
1471 bh->b_page = page; 1472 bh->b_page = page;
1472 BUG_ON(offset >= PAGE_SIZE); 1473 BUG_ON(offset >= PAGE_SIZE);
1473 if (PageHighMem(page)) 1474 if (PageHighMem(page))
1474 /* 1475 /*
1475 * This catches illegal uses and preserves the offset: 1476 * This catches illegal uses and preserves the offset:
1476 */ 1477 */
1477 bh->b_data = (char *)(0 + offset); 1478 bh->b_data = (char *)(0 + offset);
1478 else 1479 else
1479 bh->b_data = page_address(page) + offset; 1480 bh->b_data = page_address(page) + offset;
1480 } 1481 }
1481 EXPORT_SYMBOL(set_bh_page); 1482 EXPORT_SYMBOL(set_bh_page);
1482 1483
1483 /* 1484 /*
1484 * Called when truncating a buffer on a page completely. 1485 * Called when truncating a buffer on a page completely.
1485 */ 1486 */
1486 1487
1487 /* Bits that are cleared during an invalidate */ 1488 /* Bits that are cleared during an invalidate */
1488 #define BUFFER_FLAGS_DISCARD \ 1489 #define BUFFER_FLAGS_DISCARD \
1489 (1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \ 1490 (1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
1490 1 << BH_Delay | 1 << BH_Unwritten) 1491 1 << BH_Delay | 1 << BH_Unwritten)
1491 1492
1492 static void discard_buffer(struct buffer_head * bh) 1493 static void discard_buffer(struct buffer_head * bh)
1493 { 1494 {
1494 unsigned long b_state, b_state_old; 1495 unsigned long b_state, b_state_old;
1495 1496
1496 lock_buffer(bh); 1497 lock_buffer(bh);
1497 clear_buffer_dirty(bh); 1498 clear_buffer_dirty(bh);
1498 bh->b_bdev = NULL; 1499 bh->b_bdev = NULL;
1499 b_state = bh->b_state; 1500 b_state = bh->b_state;
1500 for (;;) { 1501 for (;;) {
1501 b_state_old = cmpxchg(&bh->b_state, b_state, 1502 b_state_old = cmpxchg(&bh->b_state, b_state,
1502 (b_state & ~BUFFER_FLAGS_DISCARD)); 1503 (b_state & ~BUFFER_FLAGS_DISCARD));
1503 if (b_state_old == b_state) 1504 if (b_state_old == b_state)
1504 break; 1505 break;
1505 b_state = b_state_old; 1506 b_state = b_state_old;
1506 } 1507 }
1507 unlock_buffer(bh); 1508 unlock_buffer(bh);
1508 } 1509 }
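discard_buffer() clears a set of state bits with a cmpxchg() retry loop so that concurrent updates to other bits are never lost. A userspace model of that loop, using C11 atomics in place of the kernel's cmpxchg() and an arbitrary flag mask:

	/* Model of the lock-free flag clearing in discard_buffer(): retry the
	 * compare-and-swap until the masked value is installed on top of
	 * whatever the current state happens to be. */
	#include <stdatomic.h>
	#include <stdio.h>

	#define FLAGS_DISCARD (1u << 1 | 1u << 3)	/* arbitrary bits to clear */

	static _Atomic unsigned long state;

	static void clear_discard_flags(void)
	{
		unsigned long old = atomic_load(&state);

		/* compare_exchange_weak reloads "old" on failure, so each retry
		 * re-applies the mask to the value that is current right now */
		while (!atomic_compare_exchange_weak(&state, &old,
						     old & ~FLAGS_DISCARD))
			;
	}

	int main(void)
	{
		atomic_store(&state, 0xffUL);
		clear_discard_flags();
		printf("%#lx\n", atomic_load(&state));	/* prints 0xf5 */
		return 0;
	}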
1509 1510
1510 /** 1511 /**
1511 * block_invalidatepage - invalidate part or all of a buffer-backed page 1512 * block_invalidatepage - invalidate part or all of a buffer-backed page
1512 * 1513 *
1513 * @page: the page which is affected 1514 * @page: the page which is affected
1514 * @offset: start of the range to invalidate 1515 * @offset: start of the range to invalidate
1515 * @length: length of the range to invalidate 1516 * @length: length of the range to invalidate
1516 * 1517 *
1517 * block_invalidatepage() is called when all or part of the page has become 1518 * block_invalidatepage() is called when all or part of the page has become
1518 * invalidated by a truncate operation. 1519 * invalidated by a truncate operation.
1519 * 1520 *
1520 * block_invalidatepage() does not have to release all buffers, but it must 1521 * block_invalidatepage() does not have to release all buffers, but it must
1521 * ensure that no dirty buffer is left outside @offset and that no I/O 1522 * ensure that no dirty buffer is left outside @offset and that no I/O
1522 * is underway against any of the blocks which are outside the truncation 1523 * is underway against any of the blocks which are outside the truncation
1523 * point. Because the caller is about to free (and possibly reuse) those 1524 * point. Because the caller is about to free (and possibly reuse) those
1524 * blocks on-disk. 1525 * blocks on-disk.
1525 */ 1526 */
1526 void block_invalidatepage(struct page *page, unsigned int offset, 1527 void block_invalidatepage(struct page *page, unsigned int offset,
1527 unsigned int length) 1528 unsigned int length)
1528 { 1529 {
1529 struct buffer_head *head, *bh, *next; 1530 struct buffer_head *head, *bh, *next;
1530 unsigned int curr_off = 0; 1531 unsigned int curr_off = 0;
1531 unsigned int stop = length + offset; 1532 unsigned int stop = length + offset;
1532 1533
1533 BUG_ON(!PageLocked(page)); 1534 BUG_ON(!PageLocked(page));
1534 if (!page_has_buffers(page)) 1535 if (!page_has_buffers(page))
1535 goto out; 1536 goto out;
1536 1537
1537 /* 1538 /*
1538 * Check for overflow 1539 * Check for overflow
1539 */ 1540 */
1540 BUG_ON(stop > PAGE_CACHE_SIZE || stop < length); 1541 BUG_ON(stop > PAGE_CACHE_SIZE || stop < length);
1541 1542
1542 head = page_buffers(page); 1543 head = page_buffers(page);
1543 bh = head; 1544 bh = head;
1544 do { 1545 do {
1545 unsigned int next_off = curr_off + bh->b_size; 1546 unsigned int next_off = curr_off + bh->b_size;
1546 next = bh->b_this_page; 1547 next = bh->b_this_page;
1547 1548
1548 /* 1549 /*
1549 * Are we still fully in range ? 1550 * Are we still fully in range ?
1550 */ 1551 */
1551 if (next_off > stop) 1552 if (next_off > stop)
1552 goto out; 1553 goto out;
1553 1554
1554 /* 1555 /*
1555 * is this block fully invalidated? 1556 * is this block fully invalidated?
1556 */ 1557 */
1557 if (offset <= curr_off) 1558 if (offset <= curr_off)
1558 discard_buffer(bh); 1559 discard_buffer(bh);
1559 curr_off = next_off; 1560 curr_off = next_off;
1560 bh = next; 1561 bh = next;
1561 } while (bh != head); 1562 } while (bh != head);
1562 1563
1563 /* 1564 /*
1564 * We release buffers only if the entire page is being invalidated. 1565 * We release buffers only if the entire page is being invalidated.
1565 * The get_block cached value has been unconditionally invalidated, 1566 * The get_block cached value has been unconditionally invalidated,
1566 * so real IO is not possible anymore. 1567 * so real IO is not possible anymore.
1567 */ 1568 */
1568 if (offset == 0) 1569 if (offset == 0)
1569 try_to_release_page(page, 0); 1570 try_to_release_page(page, 0);
1570 out: 1571 out:
1571 return; 1572 return;
1572 } 1573 }
1573 EXPORT_SYMBOL(block_invalidatepage); 1574 EXPORT_SYMBOL(block_invalidatepage);
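A hedged sketch of how a buffer-backed filesystem typically wires this in: its ->invalidatepage does any private bookkeeping and then forwards to block_invalidatepage(). Everything here other than the aops fields and block_invalidatepage() itself is hypothetical.

	#include <linux/fs.h>
	#include <linux/buffer_head.h>

	static void my_fs_invalidatepage(struct page *page, unsigned int offset,
					 unsigned int length)
	{
		/* e.g. drop filesystem-private per-page state here first */
		block_invalidatepage(page, offset, length);
	}

	static const struct address_space_operations my_fs_aops = {
		.invalidatepage	= my_fs_invalidatepage,
		/* .readpage, .writepage, .write_begin, ... */
	};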
1574 1575
1575 1576
1576 /* 1577 /*
1577 * We attach and possibly dirty the buffers atomically wrt 1578 * We attach and possibly dirty the buffers atomically wrt
1578 * __set_page_dirty_buffers() via private_lock. try_to_free_buffers 1579 * __set_page_dirty_buffers() via private_lock. try_to_free_buffers
1579 * is already excluded via the page lock. 1580 * is already excluded via the page lock.
1580 */ 1581 */
1581 void create_empty_buffers(struct page *page, 1582 void create_empty_buffers(struct page *page,
1582 unsigned long blocksize, unsigned long b_state) 1583 unsigned long blocksize, unsigned long b_state)
1583 { 1584 {
1584 struct buffer_head *bh, *head, *tail; 1585 struct buffer_head *bh, *head, *tail;
1585 1586
1586 head = alloc_page_buffers(page, blocksize, 1); 1587 head = alloc_page_buffers(page, blocksize, 1);
1587 bh = head; 1588 bh = head;
1588 do { 1589 do {
1589 bh->b_state |= b_state; 1590 bh->b_state |= b_state;
1590 tail = bh; 1591 tail = bh;
1591 bh = bh->b_this_page; 1592 bh = bh->b_this_page;
1592 } while (bh); 1593 } while (bh);
1593 tail->b_this_page = head; 1594 tail->b_this_page = head;
1594 1595
1595 spin_lock(&page->mapping->private_lock); 1596 spin_lock(&page->mapping->private_lock);
1596 if (PageUptodate(page) || PageDirty(page)) { 1597 if (PageUptodate(page) || PageDirty(page)) {
1597 bh = head; 1598 bh = head;
1598 do { 1599 do {
1599 if (PageDirty(page)) 1600 if (PageDirty(page))
1600 set_buffer_dirty(bh); 1601 set_buffer_dirty(bh);
1601 if (PageUptodate(page)) 1602 if (PageUptodate(page))
1602 set_buffer_uptodate(bh); 1603 set_buffer_uptodate(bh);
1603 bh = bh->b_this_page; 1604 bh = bh->b_this_page;
1604 } while (bh != head); 1605 } while (bh != head);
1605 } 1606 }
1606 attach_page_buffers(page, head); 1607 attach_page_buffers(page, head);
1607 spin_unlock(&page->mapping->private_lock); 1608 spin_unlock(&page->mapping->private_lock);
1608 } 1609 }
1609 EXPORT_SYMBOL(create_empty_buffers); 1610 EXPORT_SYMBOL(create_empty_buffers);
1610 1611
1611 /* 1612 /*
1612 * We are taking a block for data and we don't want any output from any 1613 * We are taking a block for data and we don't want any output from any
1613 * buffer-cache aliases starting from return from that function and 1614 * buffer-cache aliases starting from return from that function and
1614 * until the moment when something will explicitly mark the buffer 1615 * until the moment when something will explicitly mark the buffer
1615 dirty (hopefully that will not happen until we free that block ;-) 1616 dirty (hopefully that will not happen until we free that block ;-)

1616 * We don't even need to mark it not-uptodate - nobody can expect 1617 * We don't even need to mark it not-uptodate - nobody can expect
1617 anything from a newly allocated buffer anyway. We used to use 1618 anything from a newly allocated buffer anyway. We used to use
1618 * unmap_buffer() for such invalidation, but that was wrong. We definitely 1619 * unmap_buffer() for such invalidation, but that was wrong. We definitely
1619 * don't want to mark the alias unmapped, for example - it would confuse 1620 * don't want to mark the alias unmapped, for example - it would confuse
1620 * anyone who might pick it with bread() afterwards... 1621 * anyone who might pick it with bread() afterwards...
1621 * 1622 *
1622 * Also.. Note that bforget() doesn't lock the buffer. So there can 1623 * Also.. Note that bforget() doesn't lock the buffer. So there can
1623 * be writeout I/O going on against recently-freed buffers. We don't 1624 * be writeout I/O going on against recently-freed buffers. We don't
1624 * wait on that I/O in bforget() - it's more efficient to wait on the I/O 1625 * wait on that I/O in bforget() - it's more efficient to wait on the I/O
1625 * only if we really need to. That happens here. 1626 * only if we really need to. That happens here.
1626 */ 1627 */
1627 void unmap_underlying_metadata(struct block_device *bdev, sector_t block) 1628 void unmap_underlying_metadata(struct block_device *bdev, sector_t block)
1628 { 1629 {
1629 struct buffer_head *old_bh; 1630 struct buffer_head *old_bh;
1630 1631
1631 might_sleep(); 1632 might_sleep();
1632 1633
1633 old_bh = __find_get_block_slow(bdev, block); 1634 old_bh = __find_get_block_slow(bdev, block);
1634 if (old_bh) { 1635 if (old_bh) {
1635 clear_buffer_dirty(old_bh); 1636 clear_buffer_dirty(old_bh);
1636 wait_on_buffer(old_bh); 1637 wait_on_buffer(old_bh);
1637 clear_buffer_req(old_bh); 1638 clear_buffer_req(old_bh);
1638 __brelse(old_bh); 1639 __brelse(old_bh);
1639 } 1640 }
1640 } 1641 }
1641 EXPORT_SYMBOL(unmap_underlying_metadata); 1642 EXPORT_SYMBOL(unmap_underlying_metadata);
1642 1643
1643 /* 1644 /*
1644 * Size is a power-of-two in the range 512..PAGE_SIZE, 1645 * Size is a power-of-two in the range 512..PAGE_SIZE,
1645 * and the case we care about most is PAGE_SIZE. 1646 * and the case we care about most is PAGE_SIZE.
1646 * 1647 *
1647 * So this *could* possibly be written with those 1648 * So this *could* possibly be written with those
1648 * constraints in mind (relevant mostly if some 1649 * constraints in mind (relevant mostly if some
1649 * architecture has a slow bit-scan instruction) 1650 * architecture has a slow bit-scan instruction)
1650 */ 1651 */
1651 static inline int block_size_bits(unsigned int blocksize) 1652 static inline int block_size_bits(unsigned int blocksize)
1652 { 1653 {
1653 return ilog2(blocksize); 1654 return ilog2(blocksize);
1654 } 1655 }
1655 1656
1656 static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state) 1657 static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state)
1657 { 1658 {
1658 BUG_ON(!PageLocked(page)); 1659 BUG_ON(!PageLocked(page));
1659 1660
1660 if (!page_has_buffers(page)) 1661 if (!page_has_buffers(page))
1661 create_empty_buffers(page, 1 << ACCESS_ONCE(inode->i_blkbits), b_state); 1662 create_empty_buffers(page, 1 << ACCESS_ONCE(inode->i_blkbits), b_state);
1662 return page_buffers(page); 1663 return page_buffers(page);
1663 } 1664 }
1664 1665
1665 /* 1666 /*
1666 * NOTE! All mapped/uptodate combinations are valid: 1667 * NOTE! All mapped/uptodate combinations are valid:
1667 * 1668 *
1668 * Mapped Uptodate Meaning 1669 * Mapped Uptodate Meaning
1669 * 1670 *
1670 * No No "unknown" - must do get_block() 1671 * No No "unknown" - must do get_block()
1671 * No Yes "hole" - zero-filled 1672 * No Yes "hole" - zero-filled
1672 * Yes No "allocated" - allocated on disk, not read in 1673 * Yes No "allocated" - allocated on disk, not read in
1673 * Yes Yes "valid" - allocated and up-to-date in memory. 1674 * Yes Yes "valid" - allocated and up-to-date in memory.
1674 * 1675 *
1675 * "Dirty" is valid only with the last case (mapped+uptodate). 1676 * "Dirty" is valid only with the last case (mapped+uptodate).
1676 */ 1677 */
1677 1678
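A hedged sketch of how a read path consumes the table above (the real logic lives in block_read_full_page()); the helper name is hypothetical and error handling is kept minimal:

	#include <linux/buffer_head.h>
	#include <linux/highmem.h>

	/* Returns 0 if the buffer needs no I/O, 1 if the caller must submit a
	 * read, or a negative error from get_block(). */
	static int my_fs_prepare_bh_for_read(struct inode *inode, struct page *page,
					     struct buffer_head *bh, sector_t iblock,
					     get_block_t *get_block)
	{
		int err;

		if (buffer_uptodate(bh))
			return 0;		/* "hole" or "valid": nothing to read */

		if (!buffer_mapped(bh)) {	/* "unknown": ask the filesystem */
			err = get_block(inode, iblock, bh, 0);
			if (err)
				return err;
			if (!buffer_mapped(bh)) {	/* still unmapped: a hole */
				zero_user(page, bh_offset(bh), bh->b_size);
				set_buffer_uptodate(bh);
				return 0;
			}
		}
		/* "allocated": mapped but not uptodate, must be read from disk */
		return 1;
	}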
1678 /* 1679 /*
1679 * While block_write_full_page is writing back the dirty buffers under 1680 * While block_write_full_page is writing back the dirty buffers under
1680 * the page lock, whoever dirtied the buffers may decide to clean them 1681 * the page lock, whoever dirtied the buffers may decide to clean them
1681 * again at any time. We handle that by only looking at the buffer 1682 * again at any time. We handle that by only looking at the buffer
1682 * state inside lock_buffer(). 1683 * state inside lock_buffer().
1683 * 1684 *
1684 * If block_write_full_page() is called for regular writeback 1685 * If block_write_full_page() is called for regular writeback
1685 * (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a 1686 * (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a
1686 * locked buffer. This only can happen if someone has written the buffer 1687 * locked buffer. This only can happen if someone has written the buffer
1687 * directly, with submit_bh(). At the address_space level PageWriteback 1688 * directly, with submit_bh(). At the address_space level PageWriteback
1688 * prevents this contention from occurring. 1689 * prevents this contention from occurring.
1689 * 1690 *
1690 * If block_write_full_page() is called with wbc->sync_mode == 1691 * If block_write_full_page() is called with wbc->sync_mode ==
1691 * WB_SYNC_ALL, the writes are posted using WRITE_SYNC; this 1692 * WB_SYNC_ALL, the writes are posted using WRITE_SYNC; this
1692 * causes the writes to be flagged as synchronous writes. 1693 * causes the writes to be flagged as synchronous writes.
1693 */ 1694 */
1694 static int __block_write_full_page(struct inode *inode, struct page *page, 1695 static int __block_write_full_page(struct inode *inode, struct page *page,
1695 get_block_t *get_block, struct writeback_control *wbc, 1696 get_block_t *get_block, struct writeback_control *wbc,
1696 bh_end_io_t *handler) 1697 bh_end_io_t *handler)
1697 { 1698 {
1698 int err; 1699 int err;
1699 sector_t block; 1700 sector_t block;
1700 sector_t last_block; 1701 sector_t last_block;
1701 struct buffer_head *bh, *head; 1702 struct buffer_head *bh, *head;
1702 unsigned int blocksize, bbits; 1703 unsigned int blocksize, bbits;
1703 int nr_underway = 0; 1704 int nr_underway = 0;
1704 int write_op = (wbc->sync_mode == WB_SYNC_ALL ? 1705 int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
1705 WRITE_SYNC : WRITE); 1706 WRITE_SYNC : WRITE);
1706 1707
1707 head = create_page_buffers(page, inode, 1708 head = create_page_buffers(page, inode,
1708 (1 << BH_Dirty)|(1 << BH_Uptodate)); 1709 (1 << BH_Dirty)|(1 << BH_Uptodate));
1709 1710
1710 /* 1711 /*
1711 * Be very careful. We have no exclusion from __set_page_dirty_buffers 1712 * Be very careful. We have no exclusion from __set_page_dirty_buffers
1712 * here, and the (potentially unmapped) buffers may become dirty at 1713 * here, and the (potentially unmapped) buffers may become dirty at
1713 * any time. If a buffer becomes dirty here after we've inspected it 1714 * any time. If a buffer becomes dirty here after we've inspected it
1714 * then we just miss that fact, and the page stays dirty. 1715 * then we just miss that fact, and the page stays dirty.
1715 * 1716 *
1716 * Buffers outside i_size may be dirtied by __set_page_dirty_buffers; 1717 * Buffers outside i_size may be dirtied by __set_page_dirty_buffers;
1717 * handle that here by just cleaning them. 1718 * handle that here by just cleaning them.
1718 */ 1719 */
1719 1720
1720 bh = head; 1721 bh = head;
1721 blocksize = bh->b_size; 1722 blocksize = bh->b_size;
1722 bbits = block_size_bits(blocksize); 1723 bbits = block_size_bits(blocksize);
1723 1724
1724 block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits); 1725 block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
1725 last_block = (i_size_read(inode) - 1) >> bbits; 1726 last_block = (i_size_read(inode) - 1) >> bbits;
1726 1727
1727 /* 1728 /*
1728 * Get all the dirty buffers mapped to disk addresses and 1729 * Get all the dirty buffers mapped to disk addresses and
1729 * handle any aliases from the underlying blockdev's mapping. 1730 * handle any aliases from the underlying blockdev's mapping.
1730 */ 1731 */
1731 do { 1732 do {
1732 if (block > last_block) { 1733 if (block > last_block) {
1733 /* 1734 /*
1734 * mapped buffers outside i_size will occur, because 1735 * mapped buffers outside i_size will occur, because
1735 * this page can be outside i_size when there is a 1736 * this page can be outside i_size when there is a
1736 * truncate in progress. 1737 * truncate in progress.
1737 */ 1738 */
1738 /* 1739 /*
1739 * The buffer was zeroed by block_write_full_page() 1740 * The buffer was zeroed by block_write_full_page()
1740 */ 1741 */
1741 clear_buffer_dirty(bh); 1742 clear_buffer_dirty(bh);
1742 set_buffer_uptodate(bh); 1743 set_buffer_uptodate(bh);
1743 } else if ((!buffer_mapped(bh) || buffer_delay(bh)) && 1744 } else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
1744 buffer_dirty(bh)) { 1745 buffer_dirty(bh)) {
1745 WARN_ON(bh->b_size != blocksize); 1746 WARN_ON(bh->b_size != blocksize);
1746 err = get_block(inode, block, bh, 1); 1747 err = get_block(inode, block, bh, 1);
1747 if (err) 1748 if (err)
1748 goto recover; 1749 goto recover;
1749 clear_buffer_delay(bh); 1750 clear_buffer_delay(bh);
1750 if (buffer_new(bh)) { 1751 if (buffer_new(bh)) {
1751 /* blockdev mappings never come here */ 1752 /* blockdev mappings never come here */
1752 clear_buffer_new(bh); 1753 clear_buffer_new(bh);
1753 unmap_underlying_metadata(bh->b_bdev, 1754 unmap_underlying_metadata(bh->b_bdev,
1754 bh->b_blocknr); 1755 bh->b_blocknr);
1755 } 1756 }
1756 } 1757 }
1757 bh = bh->b_this_page; 1758 bh = bh->b_this_page;
1758 block++; 1759 block++;
1759 } while (bh != head); 1760 } while (bh != head);
1760 1761
1761 do { 1762 do {
1762 if (!buffer_mapped(bh)) 1763 if (!buffer_mapped(bh))
1763 continue; 1764 continue;
1764 /* 1765 /*
1765 * If it's a fully non-blocking write attempt and we cannot 1766 * If it's a fully non-blocking write attempt and we cannot
1766 * lock the buffer then redirty the page. Note that this can 1767 * lock the buffer then redirty the page. Note that this can
1767 * potentially cause a busy-wait loop from writeback threads 1768 * potentially cause a busy-wait loop from writeback threads
1768 * and kswapd activity, but those code paths have their own 1769 * and kswapd activity, but those code paths have their own
1769 * higher-level throttling. 1770 * higher-level throttling.
1770 */ 1771 */
1771 if (wbc->sync_mode != WB_SYNC_NONE) { 1772 if (wbc->sync_mode != WB_SYNC_NONE) {
1772 lock_buffer(bh); 1773 lock_buffer(bh);
1773 } else if (!trylock_buffer(bh)) { 1774 } else if (!trylock_buffer(bh)) {
1774 redirty_page_for_writepage(wbc, page); 1775 redirty_page_for_writepage(wbc, page);
1775 continue; 1776 continue;
1776 } 1777 }
1777 if (test_clear_buffer_dirty(bh)) { 1778 if (test_clear_buffer_dirty(bh)) {
1778 mark_buffer_async_write_endio(bh, handler); 1779 mark_buffer_async_write_endio(bh, handler);
1779 } else { 1780 } else {
1780 unlock_buffer(bh); 1781 unlock_buffer(bh);
1781 } 1782 }
1782 } while ((bh = bh->b_this_page) != head); 1783 } while ((bh = bh->b_this_page) != head);
1783 1784
1784 /* 1785 /*
1785 * The page and its buffers are protected by PageWriteback(), so we can 1786 * The page and its buffers are protected by PageWriteback(), so we can
1786 * drop the bh refcounts early. 1787 * drop the bh refcounts early.
1787 */ 1788 */
1788 BUG_ON(PageWriteback(page)); 1789 BUG_ON(PageWriteback(page));
1789 set_page_writeback(page); 1790 set_page_writeback(page);
1790 1791
1791 do { 1792 do {
1792 struct buffer_head *next = bh->b_this_page; 1793 struct buffer_head *next = bh->b_this_page;
1793 if (buffer_async_write(bh)) { 1794 if (buffer_async_write(bh)) {
1794 submit_bh(write_op, bh); 1795 submit_bh(write_op, bh);
1795 nr_underway++; 1796 nr_underway++;
1796 } 1797 }
1797 bh = next; 1798 bh = next;
1798 } while (bh != head); 1799 } while (bh != head);
1799 unlock_page(page); 1800 unlock_page(page);
1800 1801
1801 err = 0; 1802 err = 0;
1802 done: 1803 done:
1803 if (nr_underway == 0) { 1804 if (nr_underway == 0) {
1804 /* 1805 /*
1805 * The page was marked dirty, but the buffers were 1806 * The page was marked dirty, but the buffers were
1806 * clean. Someone wrote them back by hand with 1807 * clean. Someone wrote them back by hand with
1807 * ll_rw_block/submit_bh. A rare case. 1808 * ll_rw_block/submit_bh. A rare case.
1808 */ 1809 */
1809 end_page_writeback(page); 1810 end_page_writeback(page);
1810 1811
1811 /* 1812 /*
1812 * The page and buffer_heads can be released at any time from 1813 * The page and buffer_heads can be released at any time from
1813 * here on. 1814 * here on.
1814 */ 1815 */
1815 } 1816 }
1816 return err; 1817 return err;
1817 1818
1818 recover: 1819 recover:
1819 /* 1820 /*
1820 * ENOSPC, or some other error. We may already have added some 1821 * ENOSPC, or some other error. We may already have added some
1821 * blocks to the file, so we need to write these out to avoid 1822 * blocks to the file, so we need to write these out to avoid
1822 * exposing stale data. 1823 * exposing stale data.
1823 * The page is currently locked and not marked for writeback 1824 * The page is currently locked and not marked for writeback
1824 */ 1825 */
1825 bh = head; 1826 bh = head;
1826 /* Recovery: lock and submit the mapped buffers */ 1827 /* Recovery: lock and submit the mapped buffers */
1827 do { 1828 do {
1828 if (buffer_mapped(bh) && buffer_dirty(bh) && 1829 if (buffer_mapped(bh) && buffer_dirty(bh) &&
1829 !buffer_delay(bh)) { 1830 !buffer_delay(bh)) {
1830 lock_buffer(bh); 1831 lock_buffer(bh);
1831 mark_buffer_async_write_endio(bh, handler); 1832 mark_buffer_async_write_endio(bh, handler);
1832 } else { 1833 } else {
1833 /* 1834 /*
1834 * The buffer may have been set dirty during 1835 * The buffer may have been set dirty during
1835 * attachment to a dirty page. 1836 * attachment to a dirty page.
1836 */ 1837 */
1837 clear_buffer_dirty(bh); 1838 clear_buffer_dirty(bh);
1838 } 1839 }
1839 } while ((bh = bh->b_this_page) != head); 1840 } while ((bh = bh->b_this_page) != head);
1840 SetPageError(page); 1841 SetPageError(page);
1841 BUG_ON(PageWriteback(page)); 1842 BUG_ON(PageWriteback(page));
1842 mapping_set_error(page->mapping, err); 1843 mapping_set_error(page->mapping, err);
1843 set_page_writeback(page); 1844 set_page_writeback(page);
1844 do { 1845 do {
1845 struct buffer_head *next = bh->b_this_page; 1846 struct buffer_head *next = bh->b_this_page;
1846 if (buffer_async_write(bh)) { 1847 if (buffer_async_write(bh)) {
1847 clear_buffer_dirty(bh); 1848 clear_buffer_dirty(bh);
1848 submit_bh(write_op, bh); 1849 submit_bh(write_op, bh);
1849 nr_underway++; 1850 nr_underway++;
1850 } 1851 }
1851 bh = next; 1852 bh = next;
1852 } while (bh != head); 1853 } while (bh != head);
1853 unlock_page(page); 1854 unlock_page(page);
1854 goto done; 1855 goto done;
1855 } 1856 }
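The exported entry point for the function above is block_write_full_page(), and a filesystem's ->writepage is usually a thin wrapper around it. A hedged sketch; my_fs_get_block() stands in for the filesystem's real mapping callback:

	#include <linux/buffer_head.h>
	#include <linux/writeback.h>

	static int my_fs_get_block(struct inode *inode, sector_t iblock,
				   struct buffer_head *bh_result, int create);

	static int my_fs_writepage(struct page *page, struct writeback_control *wbc)
	{
		/* maps dirty buffers via my_fs_get_block(), marks them for async
		 * write and submits them; redirties the page instead of blocking
		 * when a buffer is contended under WB_SYNC_NONE */
		return block_write_full_page(page, my_fs_get_block, wbc);
	}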
1856 1857
1857 /* 1858 /*
1858 * If a page has any new buffers, zero them out here, and mark them uptodate 1859 * If a page has any new buffers, zero them out here, and mark them uptodate
1859 * and dirty so they'll be written out (in order to prevent uninitialised 1860 * and dirty so they'll be written out (in order to prevent uninitialised
1860 * block data from leaking). And clear the new bit. 1861 * block data from leaking). And clear the new bit.
1861 */ 1862 */
1862 void page_zero_new_buffers(struct page *page, unsigned from, unsigned to) 1863 void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
1863 { 1864 {
1864 unsigned int block_start, block_end; 1865 unsigned int block_start, block_end;
1865 struct buffer_head *head, *bh; 1866 struct buffer_head *head, *bh;
1866 1867
1867 BUG_ON(!PageLocked(page)); 1868 BUG_ON(!PageLocked(page));
1868 if (!page_has_buffers(page)) 1869 if (!page_has_buffers(page))
1869 return; 1870 return;
1870 1871
1871 bh = head = page_buffers(page); 1872 bh = head = page_buffers(page);
1872 block_start = 0; 1873 block_start = 0;
1873 do { 1874 do {
1874 block_end = block_start + bh->b_size; 1875 block_end = block_start + bh->b_size;
1875 1876
1876 if (buffer_new(bh)) { 1877 if (buffer_new(bh)) {
1877 if (block_end > from && block_start < to) { 1878 if (block_end > from && block_start < to) {
1878 if (!PageUptodate(page)) { 1879 if (!PageUptodate(page)) {
1879 unsigned start, size; 1880 unsigned start, size;
1880 1881
1881 start = max(from, block_start); 1882 start = max(from, block_start);
1882 size = min(to, block_end) - start; 1883 size = min(to, block_end) - start;
1883 1884
1884 zero_user(page, start, size); 1885 zero_user(page, start, size);
1885 set_buffer_uptodate(bh); 1886 set_buffer_uptodate(bh);
1886 } 1887 }
1887 1888
1888 clear_buffer_new(bh); 1889 clear_buffer_new(bh);
1889 mark_buffer_dirty(bh); 1890 mark_buffer_dirty(bh);
1890 } 1891 }
1891 } 1892 }
1892 1893
1893 block_start = block_end; 1894 block_start = block_end;
1894 bh = bh->b_this_page; 1895 bh = bh->b_this_page;
1895 } while (bh != head); 1896 } while (bh != head);
1896 } 1897 }
1897 EXPORT_SYMBOL(page_zero_new_buffers); 1898 EXPORT_SYMBOL(page_zero_new_buffers);
1898 1899
1899 int __block_write_begin(struct page *page, loff_t pos, unsigned len, 1900 int __block_write_begin(struct page *page, loff_t pos, unsigned len,
1900 get_block_t *get_block) 1901 get_block_t *get_block)
1901 { 1902 {
1902 unsigned from = pos & (PAGE_CACHE_SIZE - 1); 1903 unsigned from = pos & (PAGE_CACHE_SIZE - 1);
1903 unsigned to = from + len; 1904 unsigned to = from + len;
1904 struct inode *inode = page->mapping->host; 1905 struct inode *inode = page->mapping->host;
1905 unsigned block_start, block_end; 1906 unsigned block_start, block_end;
1906 sector_t block; 1907 sector_t block;
1907 int err = 0; 1908 int err = 0;
	unsigned blocksize, bbits;
	struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;

	BUG_ON(!PageLocked(page));
	BUG_ON(from > PAGE_CACHE_SIZE);
	BUG_ON(to > PAGE_CACHE_SIZE);
	BUG_ON(from > to);

	head = create_page_buffers(page, inode, 0);
	blocksize = head->b_size;
	bbits = block_size_bits(blocksize);

	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);

	for(bh = head, block_start = 0; bh != head || !block_start;
	    block++, block_start=block_end, bh = bh->b_this_page) {
		block_end = block_start + blocksize;
		if (block_end <= from || block_start >= to) {
			if (PageUptodate(page)) {
				if (!buffer_uptodate(bh))
					set_buffer_uptodate(bh);
			}
			continue;
		}
		if (buffer_new(bh))
			clear_buffer_new(bh);
		if (!buffer_mapped(bh)) {
			WARN_ON(bh->b_size != blocksize);
			err = get_block(inode, block, bh, 1);
			if (err)
				break;
			if (buffer_new(bh)) {
				unmap_underlying_metadata(bh->b_bdev,
							bh->b_blocknr);
				if (PageUptodate(page)) {
					clear_buffer_new(bh);
					set_buffer_uptodate(bh);
					mark_buffer_dirty(bh);
					continue;
				}
				if (block_end > to || block_start < from)
					zero_user_segments(page,
						to, block_end,
						block_start, from);
				continue;
			}
		}
		if (PageUptodate(page)) {
			if (!buffer_uptodate(bh))
				set_buffer_uptodate(bh);
			continue;
		}
		if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
		    !buffer_unwritten(bh) &&
		     (block_start < from || block_end > to)) {
			ll_rw_block(READ, 1, &bh);
			*wait_bh++=bh;
		}
	}
	/*
	 * If we issued read requests - let them complete.
	 */
	while(wait_bh > wait) {
		wait_on_buffer(*--wait_bh);
		if (!buffer_uptodate(*wait_bh))
			err = -EIO;
	}
	if (unlikely(err))
		page_zero_new_buffers(page, from, to);
	return err;
}
EXPORT_SYMBOL(__block_write_begin);

static int __block_commit_write(struct inode *inode, struct page *page,
		unsigned from, unsigned to)
{
	unsigned block_start, block_end;
	int partial = 0;
	unsigned blocksize;
	struct buffer_head *bh, *head;

	bh = head = page_buffers(page);
	blocksize = bh->b_size;

	block_start = 0;
	do {
		block_end = block_start + blocksize;
		if (block_end <= from || block_start >= to) {
			if (!buffer_uptodate(bh))
				partial = 1;
		} else {
			set_buffer_uptodate(bh);
			mark_buffer_dirty(bh);
		}
		clear_buffer_new(bh);

		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);

	/*
	 * If this is a partial write which happened to make all buffers
	 * uptodate then we can optimize away a bogus readpage() for
	 * the next read(). Here we 'discover' whether the page went
	 * uptodate as a result of this (potentially partial) write.
	 */
	if (!partial)
		SetPageUptodate(page);
	return 0;
}

/*
 * block_write_begin takes care of the basic task of block allocation and
 * bringing partial write blocks uptodate first.
 *
 * The filesystem needs to handle block truncation upon failure.
 */
int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
		unsigned flags, struct page **pagep, get_block_t *get_block)
{
	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
	struct page *page;
	int status;

	page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;

	status = __block_write_begin(page, pos, len, get_block);
	if (unlikely(status)) {
		unlock_page(page);
		page_cache_release(page);
		page = NULL;
	}

	*pagep = page;
	return status;
}
EXPORT_SYMBOL(block_write_begin);
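
A minimal sketch (not part of this commit) of how a block-based filesystem
typically wires the helper above into its address_space_operations: the
->write_begin method just forwards to block_write_begin() with the
filesystem's own get_block callback. "myfs_write_begin" and "myfs_get_block"
are hypothetical names used only for illustration; a real implementation
must also truncate any blocks allocated beyond i_size when this returns an
error, as the comment above block_write_begin() notes.

	static int myfs_write_begin(struct file *file,
			struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata)
	{
		/* allocate/map blocks and read in partially-written ones */
		return block_write_begin(mapping, pos, len, flags, pagep,
					 myfs_get_block);
	}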

int block_write_end(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned copied,
			struct page *page, void *fsdata)
{
	struct inode *inode = mapping->host;
	unsigned start;

	start = pos & (PAGE_CACHE_SIZE - 1);

	if (unlikely(copied < len)) {
		/*
		 * The buffers that were written will now be uptodate, so we
		 * don't have to worry about a readpage reading them and
		 * overwriting a partial write. However if we have encountered
		 * a short write and only partially written into a buffer, it
		 * will not be marked uptodate, so a readpage might come in and
		 * destroy our partial write.
		 *
		 * Do the simplest thing, and just treat any short write to a
		 * non uptodate page as a zero-length write, and force the
		 * caller to redo the whole thing.
		 */
		if (!PageUptodate(page))
			copied = 0;

		page_zero_new_buffers(page, start+copied, start+len);
	}
	flush_dcache_page(page);

	/* This could be a short (even 0-length) commit */
	__block_commit_write(inode, page, start, start+copied);

	return copied;
}
EXPORT_SYMBOL(block_write_end);

int generic_write_end(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned copied,
			struct page *page, void *fsdata)
{
	struct inode *inode = mapping->host;
	int i_size_changed = 0;

	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);

	/*
	 * No need to use i_size_read() here, the i_size
	 * cannot change under us because we hold i_mutex.
	 *
	 * But it's important to update i_size while still holding page lock:
	 * page writeout could otherwise come in and zero beyond i_size.
	 */
	if (pos+copied > inode->i_size) {
		i_size_write(inode, pos+copied);
		i_size_changed = 1;
	}

	unlock_page(page);
	page_cache_release(page);

	/*
	 * Don't mark the inode dirty under page lock. First, it unnecessarily
	 * makes the holding time of page lock longer. Second, it forces lock
	 * ordering of page lock and transaction start for journaling
	 * filesystems.
	 */
	if (i_size_changed)
		mark_inode_dirty(inode);

	return copied;
}
EXPORT_SYMBOL(generic_write_end);

/*
 * block_is_partially_uptodate checks whether buffers within a page are
 * uptodate or not.
 *
 * Returns true if all buffers which correspond to a file portion
 * we want to read are uptodate.
 */
int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
					unsigned long from)
{
	unsigned block_start, block_end, blocksize;
	unsigned to;
	struct buffer_head *bh, *head;
	int ret = 1;

	if (!page_has_buffers(page))
		return 0;

	head = page_buffers(page);
	blocksize = head->b_size;
	to = min_t(unsigned, PAGE_CACHE_SIZE - from, desc->count);
	to = from + to;
	if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
		return 0;

	bh = head;
	block_start = 0;
	do {
		block_end = block_start + blocksize;
		if (block_end > from && block_start < to) {
			if (!buffer_uptodate(bh)) {
				ret = 0;
				break;
			}
			if (block_end >= to)
				break;
		}
		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);

	return ret;
}
EXPORT_SYMBOL(block_is_partially_uptodate);
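
A hedged sketch (illustration only, not part of this commit) of where the
two exported helpers above usually end up: generic_write_end() as the
->write_end method and block_is_partially_uptodate() as the
->is_partially_uptodate method of a filesystem's address_space_operations.
All "myfs_*" names are hypothetical.

	static const struct address_space_operations myfs_aops = {
		.readpage		= myfs_readpage,
		.writepage		= myfs_writepage,
		.write_begin		= myfs_write_begin,
		.write_end		= generic_write_end,
		.is_partially_uptodate	= block_is_partially_uptodate,
	};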

/*
 * Generic "read page" function for block devices that have the normal
 * get_block functionality. This is most of the block device filesystems.
 * Reads the page asynchronously --- the unlock_buffer() and
 * set/clear_buffer_uptodate() functions propagate buffer state into the
 * page struct once IO has completed.
 */
int block_read_full_page(struct page *page, get_block_t *get_block)
{
	struct inode *inode = page->mapping->host;
	sector_t iblock, lblock;
	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
	unsigned int blocksize, bbits;
	int nr, i;
	int fully_mapped = 1;

	head = create_page_buffers(page, inode, 0);
	blocksize = head->b_size;
	bbits = block_size_bits(blocksize);

	iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
	lblock = (i_size_read(inode)+blocksize-1) >> bbits;
	bh = head;
	nr = 0;
	i = 0;

	do {
		if (buffer_uptodate(bh))
			continue;

		if (!buffer_mapped(bh)) {
			int err = 0;

			fully_mapped = 0;
			if (iblock < lblock) {
				WARN_ON(bh->b_size != blocksize);
				err = get_block(inode, iblock, bh, 0);
				if (err)
					SetPageError(page);
			}
			if (!buffer_mapped(bh)) {
				zero_user(page, i * blocksize, blocksize);
				if (!err)
					set_buffer_uptodate(bh);
				continue;
			}
			/*
			 * get_block() might have updated the buffer
			 * synchronously
			 */
			if (buffer_uptodate(bh))
				continue;
		}
		arr[nr++] = bh;
	} while (i++, iblock++, (bh = bh->b_this_page) != head);

	if (fully_mapped)
		SetPageMappedToDisk(page);

	if (!nr) {
		/*
		 * All buffers are uptodate - we can set the page uptodate
		 * as well. But not if get_block() returned an error.
		 */
		if (!PageError(page))
			SetPageUptodate(page);
		unlock_page(page);
		return 0;
	}

	/* Stage two: lock the buffers */
	for (i = 0; i < nr; i++) {
		bh = arr[i];
		lock_buffer(bh);
		mark_buffer_async_read(bh);
	}

	/*
	 * Stage 3: start the IO. Check for uptodateness
	 * inside the buffer lock in case another process reading
	 * the underlying blockdev brought it uptodate (the sct fix).
	 */
	for (i = 0; i < nr; i++) {
		bh = arr[i];
		if (buffer_uptodate(bh))
			end_buffer_async_read(bh, 1);
		else
			submit_bh(READ, bh);
	}
	return 0;
}
EXPORT_SYMBOL(block_read_full_page);

/* utility function for filesystems that need to do work on expanding
 * truncates. Uses filesystem pagecache writes to allow the filesystem to
 * deal with the hole.
 */
int generic_cont_expand_simple(struct inode *inode, loff_t size)
{
	struct address_space *mapping = inode->i_mapping;
	struct page *page;
	void *fsdata;
	int err;

	err = inode_newsize_ok(inode, size);
	if (err)
		goto out;

	err = pagecache_write_begin(NULL, mapping, size, 0,
				AOP_FLAG_UNINTERRUPTIBLE|AOP_FLAG_CONT_EXPAND,
				&page, &fsdata);
	if (err)
		goto out;

	err = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata);
	BUG_ON(err > 0);

out:
	return err;
}
EXPORT_SYMBOL(generic_cont_expand_simple);
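
A rough sketch (not part of this commit) of the expanding-truncate case the
comment above describes: a hypothetical ->setattr path could call
generic_cont_expand_simple() to page in and zero the new tail before
publishing the larger size. "myfs_setattr_size" is an invented helper name.

	static int myfs_setattr_size(struct inode *inode, loff_t newsize)
	{
		int err = 0;

		if (newsize > i_size_read(inode))
			/* zero-fill the gap through the page cache */
			err = generic_cont_expand_simple(inode, newsize);
		if (!err)
			truncate_setsize(inode, newsize);
		return err;
	}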

static int cont_expand_zero(struct file *file, struct address_space *mapping,
			loff_t pos, loff_t *bytes)
{
	struct inode *inode = mapping->host;
	unsigned blocksize = 1 << inode->i_blkbits;
	struct page *page;
	void *fsdata;
	pgoff_t index, curidx;
	loff_t curpos;
	unsigned zerofrom, offset, len;
	int err = 0;

	index = pos >> PAGE_CACHE_SHIFT;
	offset = pos & ~PAGE_CACHE_MASK;

	while (index > (curidx = (curpos = *bytes)>>PAGE_CACHE_SHIFT)) {
		zerofrom = curpos & ~PAGE_CACHE_MASK;
		if (zerofrom & (blocksize-1)) {
			*bytes |= (blocksize-1);
			(*bytes)++;
		}
		len = PAGE_CACHE_SIZE - zerofrom;

		err = pagecache_write_begin(file, mapping, curpos, len,
						AOP_FLAG_UNINTERRUPTIBLE,
						&page, &fsdata);
		if (err)
			goto out;
		zero_user(page, zerofrom, len);
		err = pagecache_write_end(file, mapping, curpos, len, len,
						page, fsdata);
		if (err < 0)
			goto out;
		BUG_ON(err != len);
		err = 0;

		balance_dirty_pages_ratelimited(mapping);
	}

	/* page covers the boundary, find the boundary offset */
	if (index == curidx) {
		zerofrom = curpos & ~PAGE_CACHE_MASK;
		/* if we will expand the thing last block will be filled */
		if (offset <= zerofrom) {
			goto out;
		}
		if (zerofrom & (blocksize-1)) {
			*bytes |= (blocksize-1);
			(*bytes)++;
		}
		len = offset - zerofrom;

		err = pagecache_write_begin(file, mapping, curpos, len,
						AOP_FLAG_UNINTERRUPTIBLE,
						&page, &fsdata);
		if (err)
			goto out;
		zero_user(page, zerofrom, len);
		err = pagecache_write_end(file, mapping, curpos, len, len,
						page, fsdata);
		if (err < 0)
			goto out;
		BUG_ON(err != len);
		err = 0;
	}
out:
	return err;
}

/*
 * For moronic filesystems that do not allow holes in file.
 * We may have to extend the file.
 */
int cont_write_begin(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata,
			get_block_t *get_block, loff_t *bytes)
{
	struct inode *inode = mapping->host;
	unsigned blocksize = 1 << inode->i_blkbits;
	unsigned zerofrom;
	int err;

	err = cont_expand_zero(file, mapping, pos, bytes);
	if (err)
		return err;

	zerofrom = *bytes & ~PAGE_CACHE_MASK;
	if (pos+len > *bytes && zerofrom & (blocksize-1)) {
		*bytes |= (blocksize-1);
		(*bytes)++;
	}

	return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}
EXPORT_SYMBOL(cont_write_begin);

int block_commit_write(struct page *page, unsigned from, unsigned to)
{
	struct inode *inode = page->mapping->host;
	__block_commit_write(inode,page,from,to);
	return 0;
}
EXPORT_SYMBOL(block_commit_write);

/*
 * block_page_mkwrite() is not allowed to change the file size as it gets
 * called from a page fault handler when a page is first dirtied. Hence we must
 * be careful to check for EOF conditions here. We set the page up correctly
 * for a written page which means we get ENOSPC checking when writing into
 * holes and correct delalloc and unwritten extent mapping on filesystems that
 * support these features.
 *
 * We are not allowed to take the i_mutex here so we have to play games to
 * protect against truncate races as the page could now be beyond EOF. Because
 * truncate writes the inode size before removing pages, once we have the
 * page lock we can determine safely if the page is beyond EOF. If it is not
 * beyond EOF, then the page is guaranteed safe against truncation until we
 * unlock the page.
 *
 * Direct callers of this function should protect against filesystem freezing
 * using sb_start_write() - sb_end_write() functions.
 */
int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
			 get_block_t get_block)
{
	struct page *page = vmf->page;
	struct inode *inode = file_inode(vma->vm_file);
	unsigned long end;
	loff_t size;
	int ret;

	lock_page(page);
	size = i_size_read(inode);
	if ((page->mapping != inode->i_mapping) ||
	    (page_offset(page) > size)) {
		/* We overload EFAULT to mean page got truncated */
		ret = -EFAULT;
		goto out_unlock;
	}

	/* page is wholly or partially inside EOF */
	if (((page->index + 1) << PAGE_CACHE_SHIFT) > size)
		end = size & ~PAGE_CACHE_MASK;
	else
		end = PAGE_CACHE_SIZE;

	ret = __block_write_begin(page, 0, end, get_block);
	if (!ret)
		ret = block_commit_write(page, 0, end);

	if (unlikely(ret < 0))
		goto out_unlock;
	set_page_dirty(page);
	wait_for_stable_page(page);
	return 0;
out_unlock:
	unlock_page(page);
	return ret;
}
EXPORT_SYMBOL(__block_page_mkwrite);

int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
		   get_block_t get_block)
{
	int ret;
	struct super_block *sb = file_inode(vma->vm_file)->i_sb;

	sb_start_pagefault(sb);

	/*
	 * Update file times before taking page lock. We may end up failing the
	 * fault so this update may be superfluous but who really cares...
	 */
	file_update_time(vma->vm_file);

	ret = __block_page_mkwrite(vma, vmf, get_block);
	sb_end_pagefault(sb);
	return block_page_mkwrite_return(ret);
}
EXPORT_SYMBOL(block_page_mkwrite);
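
A hedged illustration (not part of this commit) of the common way the
wrapper above is consumed: a filesystem with no special journaling needs can
point its vm_operations ->page_mkwrite at a thin wrapper around
block_page_mkwrite(). "myfs_page_mkwrite", "myfs_get_block" and
"myfs_vm_ops" are hypothetical names.

	static int myfs_page_mkwrite(struct vm_area_struct *vma,
				     struct vm_fault *vmf)
	{
		/* freeze protection is handled inside block_page_mkwrite() */
		return block_page_mkwrite(vma, vmf, myfs_get_block);
	}

	static const struct vm_operations_struct myfs_vm_ops = {
		.fault		= filemap_fault,
		.page_mkwrite	= myfs_page_mkwrite,
	};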

/*
 * nobh_write_begin()'s prereads are special: the buffer_heads are freed
 * immediately, while under the page lock. So it needs a special end_io
 * handler which does not touch the bh after unlocking it.
 */
static void end_buffer_read_nobh(struct buffer_head *bh, int uptodate)
{
	__end_buffer_read_notouch(bh, uptodate);
}

/*
 * Attach the singly-linked list of buffers created by nobh_write_begin, to
 * the page (converting it to circular linked list and taking care of page
 * dirty races).
 */
static void attach_nobh_buffers(struct page *page, struct buffer_head *head)
{
	struct buffer_head *bh;

	BUG_ON(!PageLocked(page));

	spin_lock(&page->mapping->private_lock);
	bh = head;
	do {
		if (PageDirty(page))
			set_buffer_dirty(bh);
		if (!bh->b_this_page)
			bh->b_this_page = head;
		bh = bh->b_this_page;
	} while (bh != head);
	attach_page_buffers(page, head);
	spin_unlock(&page->mapping->private_lock);
}

/*
 * On entry, the page is fully not uptodate.
 * On exit the page is fully uptodate in the areas outside (from,to)
 * The filesystem needs to handle block truncation upon failure.
 */
int nobh_write_begin(struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata,
			get_block_t *get_block)
{
	struct inode *inode = mapping->host;
	const unsigned blkbits = inode->i_blkbits;
	const unsigned blocksize = 1 << blkbits;
	struct buffer_head *head, *bh;
	struct page *page;
	pgoff_t index;
	unsigned from, to;
	unsigned block_in_page;
	unsigned block_start, block_end;
	sector_t block_in_file;
	int nr_reads = 0;
	int ret = 0;
	int is_mapped_to_disk = 1;

	index = pos >> PAGE_CACHE_SHIFT;
	from = pos & (PAGE_CACHE_SIZE - 1);
	to = from + len;

	page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;
	*pagep = page;
	*fsdata = NULL;

	if (page_has_buffers(page)) {
		ret = __block_write_begin(page, pos, len, get_block);
		if (unlikely(ret))
			goto out_release;
		return ret;
	}

	if (PageMappedToDisk(page))
		return 0;

	/*
	 * Allocate buffers so that we can keep track of state, and potentially
	 * attach them to the page if an error occurs. In the common case of
	 * no error, they will just be freed again without ever being attached
	 * to the page (which is all OK, because we're under the page lock).
	 *
	 * Be careful: the buffer linked list is a NULL terminated one, rather
	 * than the circular one we're used to.
	 */
	head = alloc_page_buffers(page, blocksize, 0);
	if (!head) {
		ret = -ENOMEM;
		goto out_release;
	}

	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);

	/*
	 * We loop across all blocks in the page, whether or not they are
	 * part of the affected region. This is so we can discover if the
	 * page is fully mapped-to-disk.
	 */
	for (block_start = 0, block_in_page = 0, bh = head;
		  block_start < PAGE_CACHE_SIZE;
		  block_in_page++, block_start += blocksize, bh = bh->b_this_page) {
		int create;

		block_end = block_start + blocksize;
		bh->b_state = 0;
		create = 1;
		if (block_start >= to)
			create = 0;
		ret = get_block(inode, block_in_file + block_in_page,
					bh, create);
		if (ret)
			goto failed;
		if (!buffer_mapped(bh))
			is_mapped_to_disk = 0;
		if (buffer_new(bh))
			unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
		if (PageUptodate(page)) {
			set_buffer_uptodate(bh);
			continue;
		}
		if (buffer_new(bh) || !buffer_mapped(bh)) {
			zero_user_segments(page, block_start, from,
							to, block_end);
			continue;
		}
		if (buffer_uptodate(bh))
			continue;	/* reiserfs does this */
		if (block_start < from || block_end > to) {
			lock_buffer(bh);
			bh->b_end_io = end_buffer_read_nobh;
			submit_bh(READ, bh);
			nr_reads++;
		}
	}

	if (nr_reads) {
		/*
		 * The page is locked, so these buffers are protected from
		 * any VM or truncate activity. Hence we don't need to care
		 * for the buffer_head refcounts.
		 */
		for (bh = head; bh; bh = bh->b_this_page) {
			wait_on_buffer(bh);
			if (!buffer_uptodate(bh))
				ret = -EIO;
		}
		if (ret)
			goto failed;
	}

	if (is_mapped_to_disk)
		SetPageMappedToDisk(page);

	*fsdata = head; /* to be released by nobh_write_end */

	return 0;

failed:
	BUG_ON(!ret);
	/*
	 * Error recovery is a bit difficult. We need to zero out blocks that
	 * were newly allocated, and dirty them to ensure they get written out.
	 * Buffers need to be attached to the page at this point, otherwise
	 * the handling of potential IO errors during writeout would be hard
	 * (could try doing synchronous writeout, but what if that fails too?)
	 */
	attach_nobh_buffers(page, head);
	page_zero_new_buffers(page, from, to);

out_release:
	unlock_page(page);
	page_cache_release(page);
	*pagep = NULL;

	return ret;
}
EXPORT_SYMBOL(nobh_write_begin);

int nobh_write_end(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned copied,
			struct page *page, void *fsdata)
{
	struct inode *inode = page->mapping->host;
	struct buffer_head *head = fsdata;
	struct buffer_head *bh;
	BUG_ON(fsdata != NULL && page_has_buffers(page));

	if (unlikely(copied < len) && head)
		attach_nobh_buffers(page, head);
	if (page_has_buffers(page))
		return generic_write_end(file, mapping, pos, len,
					copied, page, fsdata);

	SetPageUptodate(page);
	set_page_dirty(page);
	if (pos+copied > inode->i_size) {
		i_size_write(inode, pos+copied);
		mark_inode_dirty(inode);
	}

	unlock_page(page);
	page_cache_release(page);

	while (head) {
		bh = head;
		head = head->b_this_page;
		free_buffer_head(bh);
	}

	return copied;
}
EXPORT_SYMBOL(nobh_write_end);

/*
 * nobh_writepage() - based on block_full_write_page() except
 * that it tries to operate without attaching bufferheads to
 * the page.
 */
int nobh_writepage(struct page *page, get_block_t *get_block,
			struct writeback_control *wbc)
{
	struct inode * const inode = page->mapping->host;
	loff_t i_size = i_size_read(inode);
	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
	unsigned offset;
	int ret;

	/* Is the page fully inside i_size? */
	if (page->index < end_index)
		goto out;

	/* Is the page fully outside i_size? (truncate in progress) */
	offset = i_size & (PAGE_CACHE_SIZE-1);
	if (page->index >= end_index+1 || !offset) {
		/*
		 * The page may have dirty, unmapped buffers. For example,
		 * they may have been added in ext3_writepage(). Make them
		 * freeable here, so the page does not leak.
		 */
#if 0
		/* Not really sure about this - do we need this ? */
		if (page->mapping->a_ops->invalidatepage)
			page->mapping->a_ops->invalidatepage(page, offset);
#endif
		unlock_page(page);
		return 0; /* don't care */
	}

	/*
	 * The page straddles i_size. It must be zeroed out on each and every
	 * writepage invocation because it may be mmapped. "A file is mapped
	 * in multiples of the page size. For a file that is not a multiple of
	 * the page size, the remaining memory is zeroed when mapped, and
	 * writes to that region are not written out to the file."
	 */
	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
out:
	ret = mpage_writepage(page, get_block, wbc);
	if (ret == -EAGAIN)
		ret = __block_write_full_page(inode, page, get_block, wbc,
					      end_buffer_async_write);
	return ret;
}
EXPORT_SYMBOL(nobh_writepage);

int nobh_truncate_page(struct address_space *mapping,
			loff_t from, get_block_t *get_block)
{
	pgoff_t index = from >> PAGE_CACHE_SHIFT;
	unsigned offset = from & (PAGE_CACHE_SIZE-1);
	unsigned blocksize;
	sector_t iblock;
	unsigned length, pos;
	struct inode *inode = mapping->host;
	struct page *page;
	struct buffer_head map_bh;
	int err;

	blocksize = 1 << inode->i_blkbits;
	length = offset & (blocksize - 1);

	/* Block boundary? Nothing to do */
	if (!length)
		return 0;

	length = blocksize - length;
	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);

	page = grab_cache_page(mapping, index);
	err = -ENOMEM;
	if (!page)
		goto out;

	if (page_has_buffers(page)) {
has_buffers:
		unlock_page(page);
		page_cache_release(page);
		return block_truncate_page(mapping, from, get_block);
	}

	/* Find the buffer that contains "offset" */
	pos = blocksize;
	while (offset >= pos) {
		iblock++;
		pos += blocksize;
	}

	map_bh.b_size = blocksize;
	map_bh.b_state = 0;
	err = get_block(inode, iblock, &map_bh, 0);
	if (err)
		goto unlock;
	/* unmapped? It's a hole - nothing to do */
	if (!buffer_mapped(&map_bh))
		goto unlock;

	/* Ok, it's mapped. Make sure it's up-to-date */
	if (!PageUptodate(page)) {
		err = mapping->a_ops->readpage(NULL, page);
		if (err) {
			page_cache_release(page);
			goto out;
		}
		lock_page(page);
		if (!PageUptodate(page)) {
			err = -EIO;
			goto unlock;
		}
		if (page_has_buffers(page))
			goto has_buffers;
	}
	zero_user(page, offset, length);
	set_page_dirty(page);
	err = 0;

unlock:
	unlock_page(page);
	page_cache_release(page);
out:
	return err;
}
EXPORT_SYMBOL(nobh_truncate_page);

int block_truncate_page(struct address_space *mapping,
			loff_t from, get_block_t *get_block)
{
	pgoff_t index = from >> PAGE_CACHE_SHIFT;
	unsigned offset = from & (PAGE_CACHE_SIZE-1);
	unsigned blocksize;
	sector_t iblock;
	unsigned length, pos;
	struct inode *inode = mapping->host;
	struct page *page;
	struct buffer_head *bh;
	int err;

	blocksize = 1 << inode->i_blkbits;
	length = offset & (blocksize - 1);

	/* Block boundary? Nothing to do */
	if (!length)
		return 0;

	length = blocksize - length;
	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);

	page = grab_cache_page(mapping, index);
	err = -ENOMEM;
	if (!page)
		goto out;

	if (!page_has_buffers(page))
		create_empty_buffers(page, blocksize, 0);

	/* Find the buffer that contains "offset" */
	bh = page_buffers(page);
	pos = blocksize;
	while (offset >= pos) {
		bh = bh->b_this_page;
		iblock++;
		pos += blocksize;
	}

	err = 0;
	if (!buffer_mapped(bh)) {
		WARN_ON(bh->b_size != blocksize);
		err = get_block(inode, iblock, bh, 0);
		if (err)
			goto unlock;
		/* unmapped? It's a hole - nothing to do */
		if (!buffer_mapped(bh))
			goto unlock;
	}

	/* Ok, it's mapped. Make sure it's up-to-date */
	if (PageUptodate(page))
		set_buffer_uptodate(bh);

	if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) {
		err = -EIO;
		ll_rw_block(READ, 1, &bh);
		wait_on_buffer(bh);
		/* Uhhuh. Read error. Complain and punt. */
		if (!buffer_uptodate(bh))
			goto unlock;
	}

	zero_user(page, offset, length);
	mark_buffer_dirty(bh);
	err = 0;

unlock:
	unlock_page(page);
	page_cache_release(page);
out:
	return err;
}
EXPORT_SYMBOL(block_truncate_page);
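
A brief, hedged sketch (not part of this commit) of how block_truncate_page()
is usually invoked: a shrinking truncate in a hypothetical filesystem zeroes
the partial tail block through the page cache before it frees the on-disk
blocks past the new size. "myfs_truncate" and "myfs_get_block" are invented
names.

	static void myfs_truncate(struct inode *inode)
	{
		/* zero the tail of the last (partial) block */
		block_truncate_page(inode->i_mapping, inode->i_size,
				    myfs_get_block);
		/* ...then release on-disk blocks beyond the new i_size... */
	}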

/*
 * The generic ->writepage function for buffer-backed address_spaces
 * this form passes in the end_io handler used to finish the IO.
 */
int block_write_full_page_endio(struct page *page, get_block_t *get_block,
			struct writeback_control *wbc, bh_end_io_t *handler)
{
	struct inode * const inode = page->mapping->host;
	loff_t i_size = i_size_read(inode);
	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
	unsigned offset;

	/* Is the page fully inside i_size? */
	if (page->index < end_index)
		return __block_write_full_page(inode, page, get_block, wbc,
					       handler);

	/* Is the page fully outside i_size? (truncate in progress) */
	offset = i_size & (PAGE_CACHE_SIZE-1);
	if (page->index >= end_index+1 || !offset) {
		/*
		 * The page may have dirty, unmapped buffers. For example,
		 * they may have been added in ext3_writepage(). Make them
		 * freeable here, so the page does not leak.
		 */
		do_invalidatepage(page, 0, PAGE_CACHE_SIZE);
		unlock_page(page);
		return 0; /* don't care */
	}

	/*
	 * The page straddles i_size. It must be zeroed out on each and every
	 * writepage invocation because it may be mmapped. "A file is mapped
	 * in multiples of the page size. For a file that is not a multiple of
	 * the page size, the remaining memory is zeroed when mapped, and
	 * writes to that region are not written out to the file."
	 */
	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
	return __block_write_full_page(inode, page, get_block, wbc, handler);
}
EXPORT_SYMBOL(block_write_full_page_endio);

/*
 * The generic ->writepage function for buffer-backed address_spaces
 */
int block_write_full_page(struct page *page, get_block_t *get_block,
			struct writeback_control *wbc)
{
	return block_write_full_page_endio(page, get_block, wbc,
					   end_buffer_async_write);
}
EXPORT_SYMBOL(block_write_full_page);
2943 2944
2944 sector_t generic_block_bmap(struct address_space *mapping, sector_t block, 2945 sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
2945 get_block_t *get_block) 2946 get_block_t *get_block)
2946 { 2947 {
2947 struct buffer_head tmp; 2948 struct buffer_head tmp;
2948 struct inode *inode = mapping->host; 2949 struct inode *inode = mapping->host;
2949 tmp.b_state = 0; 2950 tmp.b_state = 0;
2950 tmp.b_blocknr = 0; 2951 tmp.b_blocknr = 0;
2951 tmp.b_size = 1 << inode->i_blkbits; 2952 tmp.b_size = 1 << inode->i_blkbits;
2952 get_block(inode, block, &tmp, 0); 2953 get_block(inode, block, &tmp, 0);
2953 return tmp.b_blocknr; 2954 return tmp.b_blocknr;
2954 } 2955 }
2955 EXPORT_SYMBOL(generic_block_bmap); 2956 EXPORT_SYMBOL(generic_block_bmap);
2956 2957
2957 static void end_bio_bh_io_sync(struct bio *bio, int err) 2958 static void end_bio_bh_io_sync(struct bio *bio, int err)
2958 { 2959 {
2959 struct buffer_head *bh = bio->bi_private; 2960 struct buffer_head *bh = bio->bi_private;
2960 2961
2961 if (err == -EOPNOTSUPP) { 2962 if (err == -EOPNOTSUPP) {
2962 set_bit(BIO_EOPNOTSUPP, &bio->bi_flags); 2963 set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
2963 } 2964 }
2964 2965
2965 if (unlikely (test_bit(BIO_QUIET,&bio->bi_flags))) 2966 if (unlikely (test_bit(BIO_QUIET,&bio->bi_flags)))
2966 set_bit(BH_Quiet, &bh->b_state); 2967 set_bit(BH_Quiet, &bh->b_state);
2967 2968
2968 bh->b_end_io(bh, test_bit(BIO_UPTODATE, &bio->bi_flags)); 2969 bh->b_end_io(bh, test_bit(BIO_UPTODATE, &bio->bi_flags));
2969 bio_put(bio); 2970 bio_put(bio);
2970 } 2971 }
2971 2972
2972 /* 2973 /*
2973 * This allows us to do IO even on the odd last sectors 2974 * This allows us to do IO even on the odd last sectors
2974 * of a device, even if the bh block size is some multiple 2975 * of a device, even if the bh block size is some multiple
2975 * of the physical sector size. 2976 * of the physical sector size.
2976 * 2977 *
2977 * We'll just truncate the bio to the size of the device, 2978 * We'll just truncate the bio to the size of the device,
2978 * and clear the end of the buffer head manually. 2979 * and clear the end of the buffer head manually.
2979 * 2980 *
2980 * Truly out-of-range accesses will turn into actual IO 2981 * Truly out-of-range accesses will turn into actual IO
2981 * errors, this only handles the "we need to be able to 2982 * errors, this only handles the "we need to be able to
2982 * do IO at the final sector" case. 2983 * do IO at the final sector" case.
2983 */ 2984 */
2984 static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh) 2985 static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
2985 { 2986 {
2986 sector_t maxsector; 2987 sector_t maxsector;
2987 unsigned bytes; 2988 unsigned bytes;
2988 2989
2989 maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9; 2990 maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9;
2990 if (!maxsector) 2991 if (!maxsector)
2991 return; 2992 return;
2992 2993
2993 /* 2994 /*
2994 * If the *whole* IO is past the end of the device, 2995 * If the *whole* IO is past the end of the device,
2995 * let it through, and the IO layer will turn it into 2996 * let it through, and the IO layer will turn it into
2996 * an EIO. 2997 * an EIO.
2997 */ 2998 */
2998 if (unlikely(bio->bi_sector >= maxsector)) 2999 if (unlikely(bio->bi_sector >= maxsector))
2999 return; 3000 return;
3000 3001
3001 maxsector -= bio->bi_sector; 3002 maxsector -= bio->bi_sector;
3002 bytes = bio->bi_size; 3003 bytes = bio->bi_size;
3003 if (likely((bytes >> 9) <= maxsector)) 3004 if (likely((bytes >> 9) <= maxsector))
3004 return; 3005 return;
3005 3006
3006 /* Uhhuh. We've got a bh that straddles the device size! */ 3007 /* Uhhuh. We've got a bh that straddles the device size! */
3007 bytes = maxsector << 9; 3008 bytes = maxsector << 9;
3008 3009
3009 /* Truncate the bio.. */ 3010 /* Truncate the bio.. */
3010 bio->bi_size = bytes; 3011 bio->bi_size = bytes;
3011 bio->bi_io_vec[0].bv_len = bytes; 3012 bio->bi_io_vec[0].bv_len = bytes;
3012 3013
3013 /* ..and clear the end of the buffer for reads */ 3014 /* ..and clear the end of the buffer for reads */
3014 if ((rw & RW_MASK) == READ) { 3015 if ((rw & RW_MASK) == READ) {
3015 void *kaddr = kmap_atomic(bh->b_page); 3016 void *kaddr = kmap_atomic(bh->b_page);
3016 memset(kaddr + bh_offset(bh) + bytes, 0, bh->b_size - bytes); 3017 memset(kaddr + bh_offset(bh) + bytes, 0, bh->b_size - bytes);
3017 kunmap_atomic(kaddr); 3018 kunmap_atomic(kaddr);
3018 flush_dcache_page(bh->b_page); 3019 flush_dcache_page(bh->b_page);
3019 } 3020 }
3020 } 3021 }
3021 3022
3022 int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags) 3023 int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
3023 { 3024 {
3024 struct bio *bio; 3025 struct bio *bio;
3025 int ret = 0; 3026 int ret = 0;
3026 3027
3027 BUG_ON(!buffer_locked(bh)); 3028 BUG_ON(!buffer_locked(bh));
3028 BUG_ON(!buffer_mapped(bh)); 3029 BUG_ON(!buffer_mapped(bh));
3029 BUG_ON(!bh->b_end_io); 3030 BUG_ON(!bh->b_end_io);
3030 BUG_ON(buffer_delay(bh)); 3031 BUG_ON(buffer_delay(bh));
3031 BUG_ON(buffer_unwritten(bh)); 3032 BUG_ON(buffer_unwritten(bh));
3032 3033
3033 /* 3034 /*
3034 * Only clear out a write error when rewriting 3035 * Only clear out a write error when rewriting
3035 */ 3036 */
3036 if (test_set_buffer_req(bh) && (rw & WRITE)) 3037 if (test_set_buffer_req(bh) && (rw & WRITE))
3037 clear_buffer_write_io_error(bh); 3038 clear_buffer_write_io_error(bh);
3038 3039
3039 /* 3040 /*
3040 * from here on down, it's all bio -- do the initial mapping, 3041 * from here on down, it's all bio -- do the initial mapping,
3041 * submit_bio -> generic_make_request may further map this bio around 3042 * submit_bio -> generic_make_request may further map this bio around
3042 */ 3043 */
3043 bio = bio_alloc(GFP_NOIO, 1); 3044 bio = bio_alloc(GFP_NOIO, 1);
3044 3045
3045 bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9); 3046 bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
3046 bio->bi_bdev = bh->b_bdev; 3047 bio->bi_bdev = bh->b_bdev;
3047 bio->bi_io_vec[0].bv_page = bh->b_page; 3048 bio->bi_io_vec[0].bv_page = bh->b_page;
3048 bio->bi_io_vec[0].bv_len = bh->b_size; 3049 bio->bi_io_vec[0].bv_len = bh->b_size;
3049 bio->bi_io_vec[0].bv_offset = bh_offset(bh); 3050 bio->bi_io_vec[0].bv_offset = bh_offset(bh);
3050 3051
3051 bio->bi_vcnt = 1; 3052 bio->bi_vcnt = 1;
3052 bio->bi_size = bh->b_size; 3053 bio->bi_size = bh->b_size;
3053 3054
3054 bio->bi_end_io = end_bio_bh_io_sync; 3055 bio->bi_end_io = end_bio_bh_io_sync;
3055 bio->bi_private = bh; 3056 bio->bi_private = bh;
3056 bio->bi_flags |= bio_flags; 3057 bio->bi_flags |= bio_flags;
3057 3058
3058 /* Take care of bh's that straddle the end of the device */ 3059 /* Take care of bh's that straddle the end of the device */
3059 guard_bh_eod(rw, bio, bh); 3060 guard_bh_eod(rw, bio, bh);
3060 3061
3061 if (buffer_meta(bh)) 3062 if (buffer_meta(bh))
3062 rw |= REQ_META; 3063 rw |= REQ_META;
3063 if (buffer_prio(bh)) 3064 if (buffer_prio(bh))
3064 rw |= REQ_PRIO; 3065 rw |= REQ_PRIO;
3065 3066
3066 bio_get(bio); 3067 bio_get(bio);
3067 submit_bio(rw, bio); 3068 submit_bio(rw, bio);
3068 3069
3069 if (bio_flagged(bio, BIO_EOPNOTSUPP)) 3070 if (bio_flagged(bio, BIO_EOPNOTSUPP))
3070 ret = -EOPNOTSUPP; 3071 ret = -EOPNOTSUPP;
3071 3072
3072 bio_put(bio); 3073 bio_put(bio);
3073 return ret; 3074 return ret;
3074 } 3075 }
3075 EXPORT_SYMBOL_GPL(_submit_bh); 3076 EXPORT_SYMBOL_GPL(_submit_bh);
3076 3077
3077 int submit_bh(int rw, struct buffer_head *bh) 3078 int submit_bh(int rw, struct buffer_head *bh)
3078 { 3079 {
3079 return _submit_bh(rw, bh, 0); 3080 return _submit_bh(rw, bh, 0);
3080 } 3081 }
3081 EXPORT_SYMBOL(submit_bh); 3082 EXPORT_SYMBOL(submit_bh);
3082 3083
3083 /** 3084 /**
3084 * ll_rw_block: low-level access to block devices (DEPRECATED) 3085 * ll_rw_block: low-level access to block devices (DEPRECATED)
3085 * @rw: whether to %READ or %WRITE or maybe %READA (readahead) 3086 * @rw: whether to %READ or %WRITE or maybe %READA (readahead)
3086 * @nr: number of &struct buffer_heads in the array 3087 * @nr: number of &struct buffer_heads in the array
3087 * @bhs: array of pointers to &struct buffer_head 3088 * @bhs: array of pointers to &struct buffer_head
3088 * 3089 *
3089 * ll_rw_block() takes an array of pointers to &struct buffer_heads, and 3090 * ll_rw_block() takes an array of pointers to &struct buffer_heads, and
3090 * requests an I/O operation on them, either a %READ or a %WRITE. The third 3091 * requests an I/O operation on them, either a %READ or a %WRITE. The third
3091 * %READA option is described in the documentation for generic_make_request() 3092 * %READA option is described in the documentation for generic_make_request()
3092 * which ll_rw_block() calls. 3093 * which ll_rw_block() calls.
3093 * 3094 *
3094 * This function drops any buffer that it cannot get a lock on (with the 3095 * This function drops any buffer that it cannot get a lock on (with the
3095 * BH_Lock state bit), any buffer that appears to be clean when doing a write 3096 * BH_Lock state bit), any buffer that appears to be clean when doing a write
3096 * request, and any buffer that appears to be up-to-date when doing read 3097 * request, and any buffer that appears to be up-to-date when doing read
3097 * request. Further it marks as clean buffers that are processed for 3098 * request. Further it marks as clean buffers that are processed for
3098 * writing (the buffer cache won't assume that they are actually clean 3099 * writing (the buffer cache won't assume that they are actually clean
3099 * until the buffer gets unlocked). 3100 * until the buffer gets unlocked).
3100 * 3101 *
3101 * ll_rw_block sets b_end_io to simple completion handler that marks 3102 * ll_rw_block sets b_end_io to simple completion handler that marks
3102 * the buffer up-to-date (if approriate), unlocks the buffer and wakes 3103 * the buffer up-to-date (if approriate), unlocks the buffer and wakes
3103 * any waiters. 3104 * any waiters.
3104 * 3105 *
3105 * All of the buffers must be for the same device, and must also be a 3106 * All of the buffers must be for the same device, and must also be a
3106 * multiple of the current approved size for the device. 3107 * multiple of the current approved size for the device.
3107 */ 3108 */
3108 void ll_rw_block(int rw, int nr, struct buffer_head *bhs[]) 3109 void ll_rw_block(int rw, int nr, struct buffer_head *bhs[])
3109 { 3110 {
3110 int i; 3111 int i;
3111 3112
3112 for (i = 0; i < nr; i++) { 3113 for (i = 0; i < nr; i++) {
3113 struct buffer_head *bh = bhs[i]; 3114 struct buffer_head *bh = bhs[i];
3114 3115
3115 if (!trylock_buffer(bh)) 3116 if (!trylock_buffer(bh))
3116 continue; 3117 continue;
3117 if (rw == WRITE) { 3118 if (rw == WRITE) {
3118 if (test_clear_buffer_dirty(bh)) { 3119 if (test_clear_buffer_dirty(bh)) {
3119 bh->b_end_io = end_buffer_write_sync; 3120 bh->b_end_io = end_buffer_write_sync;
3120 get_bh(bh); 3121 get_bh(bh);
3121 submit_bh(WRITE, bh); 3122 submit_bh(WRITE, bh);
3122 continue; 3123 continue;
3123 } 3124 }
3124 } else { 3125 } else {
3125 if (!buffer_uptodate(bh)) { 3126 if (!buffer_uptodate(bh)) {
3126 bh->b_end_io = end_buffer_read_sync; 3127 bh->b_end_io = end_buffer_read_sync;
3127 get_bh(bh); 3128 get_bh(bh);
3128 submit_bh(rw, bh); 3129 submit_bh(rw, bh);
3129 continue; 3130 continue;
3130 } 3131 }
3131 } 3132 }
3132 unlock_buffer(bh); 3133 unlock_buffer(bh);
3133 } 3134 }
3134 } 3135 }
3135 EXPORT_SYMBOL(ll_rw_block); 3136 EXPORT_SYMBOL(ll_rw_block);
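
/*
 * Illustrative sketch, not part of this file: per the kernel-doc for
 * ll_rw_block() above, buffers that cannot be locked (or that look
 * clean/uptodate for the requested direction) are silently skipped, so a
 * caller typically batches the submissions and then waits on and
 * re-checks only the buffer it actually needs.  "example_read_pair" and
 * its arguments are hypothetical names.
 */
static int example_read_pair(struct buffer_head *bh, struct buffer_head *bh2)
{
        struct buffer_head *bhs[2] = { bh, bh2 };

        ll_rw_block(READ, 2, bhs);      /* bh2 is effectively read-ahead */
        wait_on_buffer(bh);             /* only wait for the one we need */
        if (!buffer_uptodate(bh))
                return -EIO;
        return 0;
}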

void write_dirty_buffer(struct buffer_head *bh, int rw)
{
        lock_buffer(bh);
        if (!test_clear_buffer_dirty(bh)) {
                unlock_buffer(bh);
                return;
        }
        bh->b_end_io = end_buffer_write_sync;
        get_bh(bh);
        submit_bh(rw, bh);
}
EXPORT_SYMBOL(write_dirty_buffer);

/*
 * For a data-integrity writeout, we need to wait upon any in-progress I/O
 * and then start new I/O and then wait upon it.  The caller must have a ref on
 * the buffer_head.
 */
int __sync_dirty_buffer(struct buffer_head *bh, int rw)
{
        int ret = 0;

        WARN_ON(atomic_read(&bh->b_count) < 1);
        lock_buffer(bh);
        if (test_clear_buffer_dirty(bh)) {
                get_bh(bh);
                bh->b_end_io = end_buffer_write_sync;
                ret = submit_bh(rw, bh);
                wait_on_buffer(bh);
                if (!ret && !buffer_uptodate(bh))
                        ret = -EIO;
        } else {
                unlock_buffer(bh);
        }
        return ret;
}
EXPORT_SYMBOL(__sync_dirty_buffer);

int sync_dirty_buffer(struct buffer_head *bh)
{
        return __sync_dirty_buffer(bh, WRITE_SYNC);
}
EXPORT_SYMBOL(sync_dirty_buffer);
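
/*
 * Illustrative sketch, not part of this file: as the comment above
 * __sync_dirty_buffer() says, the caller must already hold a reference on
 * the buffer_head, and a failed data-integrity writeout is reported as
 * -EIO.  "example_write_bh_sync" is a hypothetical name.
 */
static int example_write_bh_sync(struct buffer_head *bh)
{
        int err;

        get_bh(bh);                     /* the reference the API requires */
        err = sync_dirty_buffer(bh);    /* write, wait, -EIO on failure   */
        brelse(bh);
        return err;
}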

/*
 * try_to_free_buffers() checks if all the buffers on this particular page
 * are unused, and releases them if so.
 *
 * Exclusion against try_to_free_buffers may be obtained by either
 * locking the page or by holding its mapping's private_lock.
 *
 * If the page is dirty but all the buffers are clean then we need to
 * be sure to mark the page clean as well.  This is because the page
 * may be against a block device, and a later reattachment of buffers
 * to a dirty page will set *all* buffers dirty.  Which would corrupt
 * filesystem data on the same device.
 *
 * The same applies to regular filesystem pages: if all the buffers are
 * clean then we set the page clean and proceed.  To do that, we require
 * total exclusion from __set_page_dirty_buffers().  That is obtained with
 * private_lock.
 *
 * try_to_free_buffers() is non-blocking.
 */
static inline int buffer_busy(struct buffer_head *bh)
{
        return atomic_read(&bh->b_count) |
                (bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
}

static int
drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
{
        struct buffer_head *head = page_buffers(page);
        struct buffer_head *bh;

        bh = head;
        do {
                if (buffer_write_io_error(bh) && page->mapping)
                        set_bit(AS_EIO, &page->mapping->flags);
                if (buffer_busy(bh))
                        goto failed;
                bh = bh->b_this_page;
        } while (bh != head);

        do {
                struct buffer_head *next = bh->b_this_page;

                if (bh->b_assoc_map)
                        __remove_assoc_queue(bh);
                bh = next;
        } while (bh != head);
        *buffers_to_free = head;
        __clear_page_buffers(page);
        return 1;
failed:
        return 0;
}

int try_to_free_buffers(struct page *page)
{
        struct address_space * const mapping = page->mapping;
        struct buffer_head *buffers_to_free = NULL;
        int ret = 0;

        BUG_ON(!PageLocked(page));
        if (PageWriteback(page))
                return 0;

        if (mapping == NULL) {          /* can this still happen? */
                ret = drop_buffers(page, &buffers_to_free);
                goto out;
        }

        spin_lock(&mapping->private_lock);
        ret = drop_buffers(page, &buffers_to_free);

        /*
         * If the filesystem writes its buffers by hand (eg ext3)
         * then we can have clean buffers against a dirty page.  We
         * clean the page here; otherwise the VM will never notice
         * that the filesystem did any IO at all.
         *
         * Also, during truncate, discard_buffer will have marked all
         * the page's buffers clean.  We discover that here and clean
         * the page also.
         *
         * private_lock must be held over this entire operation in order
         * to synchronise against __set_page_dirty_buffers and prevent the
         * dirty bit from being lost.
         */
        if (ret)
                cancel_dirty_page(page, PAGE_CACHE_SIZE);
        spin_unlock(&mapping->private_lock);
out:
        if (buffers_to_free) {
                struct buffer_head *bh = buffers_to_free;

                do {
                        struct buffer_head *next = bh->b_this_page;
                        free_buffer_head(bh);
                        bh = next;
                } while (bh != buffers_to_free);
        }
        return ret;
}
EXPORT_SYMBOL(try_to_free_buffers);

/*
 * There are no bdflush tunables left.  But distributions are
 * still running obsolete flush daemons, so we terminate them here.
 *
 * Use of bdflush() is deprecated and will be removed in a future kernel.
 * The `flush-X' kernel threads fully replace bdflush daemons and this call.
 */
SYSCALL_DEFINE2(bdflush, int, func, long, data)
{
        static int msg_count;

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;

        if (msg_count < 5) {
                msg_count++;
                printk(KERN_INFO
                        "warning: process `%s' used the obsolete bdflush"
                        " system call\n", current->comm);
                printk(KERN_INFO "Fix your initscripts?\n");
        }

        if (func == 1)
                do_exit(0);
        return 0;
}

/*
 * Buffer-head allocation
 */
static struct kmem_cache *bh_cachep __read_mostly;

/*
 * Once the number of bh's in the machine exceeds this level, we start
 * stripping them in writeback.
 */
static unsigned long max_buffer_heads;

int buffer_heads_over_limit;

struct bh_accounting {
        int nr;                 /* Number of live bh's */
        int ratelimit;          /* Limit cacheline bouncing */
};

static DEFINE_PER_CPU(struct bh_accounting, bh_accounting) = {0, 0};

static void recalc_bh_state(void)
{
        int i;
        int tot = 0;

        if (__this_cpu_inc_return(bh_accounting.ratelimit) - 1 < 4096)
                return;
        __this_cpu_write(bh_accounting.ratelimit, 0);
        for_each_online_cpu(i)
                tot += per_cpu(bh_accounting, i).nr;
        buffer_heads_over_limit = (tot > max_buffer_heads);
}

struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
{
        struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags);
        if (ret) {
                INIT_LIST_HEAD(&ret->b_assoc_buffers);
                preempt_disable();
                __this_cpu_inc(bh_accounting.nr);
                recalc_bh_state();
                preempt_enable();
        }
        return ret;
}
EXPORT_SYMBOL(alloc_buffer_head);

void free_buffer_head(struct buffer_head *bh)
{
        BUG_ON(!list_empty(&bh->b_assoc_buffers));
        kmem_cache_free(bh_cachep, bh);
        preempt_disable();
        __this_cpu_dec(bh_accounting.nr);
        recalc_bh_state();
        preempt_enable();
}
EXPORT_SYMBOL(free_buffer_head);

static void buffer_exit_cpu(int cpu)
{
        int i;
        struct bh_lru *b = &per_cpu(bh_lrus, cpu);

        for (i = 0; i < BH_LRU_SIZE; i++) {
                brelse(b->bhs[i]);
                b->bhs[i] = NULL;
        }
        this_cpu_add(bh_accounting.nr, per_cpu(bh_accounting, cpu).nr);
        per_cpu(bh_accounting, cpu).nr = 0;
}

static int buffer_cpu_notify(struct notifier_block *self,
                              unsigned long action, void *hcpu)
{
        if (action == CPU_DEAD || action == CPU_DEAD_FROZEN)
                buffer_exit_cpu((unsigned long)hcpu);
        return NOTIFY_OK;
}

/**
 * bh_uptodate_or_lock - Test whether the buffer is uptodate
 * @bh: struct buffer_head
 *
 * Return true if the buffer is up-to-date and false,
 * with the buffer locked, if not.
 */
int bh_uptodate_or_lock(struct buffer_head *bh)
{
        if (!buffer_uptodate(bh)) {
                lock_buffer(bh);
                if (!buffer_uptodate(bh))
                        return 0;
                unlock_buffer(bh);
        }
        return 1;
}
EXPORT_SYMBOL(bh_uptodate_or_lock);

/**
 * bh_submit_read - Submit a locked buffer for reading
 * @bh: struct buffer_head
 *
 * Returns zero on success and -EIO on error.
 */
int bh_submit_read(struct buffer_head *bh)
{
        BUG_ON(!buffer_locked(bh));

        if (buffer_uptodate(bh)) {
                unlock_buffer(bh);
                return 0;
        }

        get_bh(bh);
        bh->b_end_io = end_buffer_read_sync;
        submit_bh(READ, bh);
        wait_on_buffer(bh);
        if (buffer_uptodate(bh))
                return 0;
        return -EIO;
}
EXPORT_SYMBOL(bh_submit_read);
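
/*
 * Illustrative sketch, not part of this file: bh_uptodate_or_lock() and
 * bh_submit_read() are meant to be paired, so the buffer is only read
 * from disk when it is not already uptodate.  "example_ensure_uptodate"
 * is a hypothetical name.
 */
static int example_ensure_uptodate(struct buffer_head *bh)
{
        if (!bh_uptodate_or_lock(bh)) {
                /* not uptodate and now locked: read it in and wait */
                if (bh_submit_read(bh))
                        return -EIO;
        }
        return 0;
}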

void __init buffer_init(void)
{
        unsigned long nrpages;

        bh_cachep = kmem_cache_create("buffer_head",
                        sizeof(struct buffer_head), 0,
                                (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
                                SLAB_MEM_SPREAD),
                                NULL);

        /*
         * Limit the bh occupancy to 10% of ZONE_NORMAL
         */
        nrpages = (nr_free_buffer_pages() * 10) / 100;
        max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
        hotcpu_notifier(buffer_cpu_notify, 0);
}
/*
 * Copyright (c) 2003-2006, Cluster File Systems, Inc, info@clusterfs.com
 * Written by Alex Tomas <alex@clusterfs.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public Licens
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-
 */


/*
 * mballoc.c contains the multiblocks allocation routines
 */

#include "ext4_jbd2.h"
#include "mballoc.h"
#include <linux/log2.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <trace/events/ext4.h>

#ifdef CONFIG_EXT4_DEBUG
ushort ext4_mballoc_debug __read_mostly;

module_param_named(mballoc_debug, ext4_mballoc_debug, ushort, 0644);
MODULE_PARM_DESC(mballoc_debug, "Debugging level for ext4's mballoc");
#endif

/*
 * MUSTDO:
 *   - test ext4_ext_search_left() and ext4_ext_search_right()
 *   - search for metadata in few groups
 *
 * TODO v4:
 *   - normalization should take into account whether file is still open
 *   - discard preallocations if no free space left (policy?)
 *   - don't normalize tails
 *   - quota
 *   - reservation for superuser
 *
 * TODO v3:
 *   - bitmap read-ahead (proposed by Oleg Drokin aka green)
 *   - track min/max extents in each group for better group selection
 *   - mb_mark_used() may allocate chunk right after splitting buddy
 *   - tree of groups sorted by number of free blocks
 *   - error handling
 */

/*
 * The allocation request involve request for multiple number of blocks
 * near to the goal(block) value specified.
 *
 * During initialization phase of the allocator we decide to use the
 * group preallocation or inode preallocation depending on the size of
 * the file. The size of the file could be the resulting file size we
 * would have after allocation, or the current file size, which ever
 * is larger. If the size is less than sbi->s_mb_stream_request we
 * select to use the group preallocation. The default value of
 * s_mb_stream_request is 16 blocks. This can also be tuned via
 * /sys/fs/ext4/<partition>/mb_stream_req. The value is represented in
 * terms of number of blocks.
 *
 * The main motivation for having small file use group preallocation is to
 * ensure that we have small files closer together on the disk.
 *
 * First stage the allocator looks at the inode prealloc list,
 * ext4_inode_info->i_prealloc_list, which contains list of prealloc
 * spaces for this particular inode. The inode prealloc space is
 * represented as:
 *
 * pa_lstart -> the logical start block for this prealloc space
 * pa_pstart -> the physical start block for this prealloc space
 * pa_len    -> length for this prealloc space (in clusters)
 * pa_free   -> free space available in this prealloc space (in clusters)
 *
 * The inode preallocation space is used looking at the _logical_ start
 * block. If only the logical file block falls within the range of prealloc
 * space we will consume the particular prealloc space. This makes sure that
 * we have contiguous physical blocks representing the file blocks
 *
 * The important thing to be noted in case of inode prealloc space is that
 * we don't modify the values associated to inode prealloc space except
 * pa_free.
 *
 * If we are not able to find blocks in the inode prealloc space and if we
 * have the group allocation flag set then we look at the locality group
 * prealloc space. These are per CPU prealloc list represented as
 *
 * ext4_sb_info.s_locality_groups[smp_processor_id()]
 *
 * The reason for having a per cpu locality group is to reduce the contention
 * between CPUs. It is possible to get scheduled at this point.
 *
 * The locality group prealloc space is used looking at whether we have
 * enough free space (pa_free) within the prealloc space.
 *
 * If we can't allocate blocks via inode prealloc or/and locality group
 * prealloc then we look at the buddy cache. The buddy cache is represented
 * by ext4_sb_info.s_buddy_cache (struct inode) whose file offset gets
 * mapped to the buddy and bitmap information regarding different
 * groups. The buddy information is attached to buddy cache inode so that
 * we can access them through the page cache. The information regarding
 * each group is loaded via ext4_mb_load_buddy.  The information involve
 * block bitmap and buddy information. The information are stored in the
 * inode as:
 *
 *  {                        page                        }
 *  [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]...
 *
 *
 * one block each for bitmap and buddy information.  So for each group we
 * take up 2 blocks. A page can contain blocks_per_page (PAGE_CACHE_SIZE /
 * blocksize) blocks.  So it can have information regarding groups_per_page
 * which is blocks_per_page/2
 *
 * The buddy cache inode is not stored on disk. The inode is thrown
 * away when the filesystem is unmounted.
 *
 * We look for count number of blocks in the buddy cache. If we were able
 * to locate that many free blocks we return with additional information
 * regarding rest of the contiguous physical block available
 *
 * Before allocating blocks via buddy cache we normalize the request
 * blocks. This ensure we ask for more blocks that we needed. The extra
 * blocks that we get after allocation is added to the respective prealloc
 * list. In case of inode preallocation we follow a list of heuristics
 * based on file size. This can be found in ext4_mb_normalize_request. If
 * we are doing a group prealloc we try to normalize the request to
 * sbi->s_mb_group_prealloc.  The default value of s_mb_group_prealloc is
 * dependent on the cluster size; for non-bigalloc file systems, it is
 * 512 blocks. This can be tuned via
 * /sys/fs/ext4/<partition>/mb_group_prealloc. The value is represented in
 * terms of number of blocks. If we have mounted the file system with -O
 * stripe=<value> option the group prealloc request is normalized to the
 * the smallest multiple of the stripe value (sbi->s_stripe) which is
 * greater than the default mb_group_prealloc.
 *
 * The regular allocator (using the buddy cache) supports a few tunables.
 *
 * /sys/fs/ext4/<partition>/mb_min_to_scan
 * /sys/fs/ext4/<partition>/mb_max_to_scan
 * /sys/fs/ext4/<partition>/mb_order2_req
 *
 * The regular allocator uses buddy scan only if the request len is power of
 * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
 * value of s_mb_order2_reqs can be tuned via
 * /sys/fs/ext4/<partition>/mb_order2_req.  If the request len is equal to
 * stripe size (sbi->s_stripe), we try to search for contiguous block in
 * stripe size. This should result in better allocation on RAID setups. If
 * not, we search in the specific group using bitmap for best extents. The
 * tunable min_to_scan and max_to_scan control the behaviour here.
 * min_to_scan indicate how long the mballoc __must__ look for a best
 * extent and max_to_scan indicates how long the mballoc __can__ look for a
 * best extent in the found extents. Searching for the blocks starts with
 * the group specified as the goal value in allocation context via
 * ac_g_ex. Each group is first checked based on the criteria whether it
 * can be used for allocation. ext4_mb_good_group explains how the groups are
 * checked.
 *
 * Both the prealloc space are getting populated as above. So for the first
 * request we will hit the buddy cache which will result in this prealloc
 * space getting filled. The prealloc space is then later used for the
 * subsequent request.
 */

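/*
 * Illustrative sketch, not part of this file: the policy described in the
 * comment above reduces to comparing the file size in blocks (the larger
 * of the current and the projected size) against sbi->s_mb_stream_request;
 * small files use the per-CPU locality group preallocation, larger files
 * use per-inode preallocation.  "example_use_group_prealloc" is a
 * hypothetical helper.
 */
static inline bool example_use_group_prealloc(struct ext4_sb_info *sbi,
                                              ext4_fsblk_t size_in_blocks)
{
        return size_in_blocks < sbi->s_mb_stream_request;
}
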
/*
 * mballoc operates on the following data:
 *  - on-disk bitmap
 *  - in-core buddy (actually includes buddy and bitmap)
 *  - preallocation descriptors (PAs)
 *
 * there are two types of preallocations:
 *  - inode
 *    assiged to specific inode and can be used for this inode only.
 *    it describes part of inode's space preallocated to specific
 *    physical blocks. any block from that preallocated can be used
 *    independent. the descriptor just tracks number of blocks left
 *    unused. so, before taking some block from descriptor, one must
 *    make sure corresponded logical block isn't allocated yet. this
 *    also means that freeing any block within descriptor's range
 *    must discard all preallocated blocks.
 *  - locality group
 *    assigned to specific locality group which does not translate to
 *    permanent set of inodes: inode can join and leave group. space
 *    from this type of preallocation can be used for any inode. thus
 *    it's consumed from the beginning to the end.
 *
 * relation between them can be expressed as:
 *    in-core buddy = on-disk bitmap + preallocation descriptors
 *
 * this mean blocks mballoc considers used are:
 *  - allocated blocks (persistent)
 *  - preallocated blocks (non-persistent)
 *
 * consistency in mballoc world means that at any time a block is either
 * free or used in ALL structures. notice: "any time" should not be read
 * literally -- time is discrete and delimited by locks.
 *
 * to keep it simple, we don't use block numbers, instead we count number of
 * blocks: how many blocks marked used/free in on-disk bitmap, buddy and PA.
 *
 * all operations can be expressed as:
 *  - init buddy:                       buddy = on-disk + PAs
 *  - new PA:                           buddy += N; PA = N
 *  - use inode PA:                     on-disk += N; PA -= N
 *  - discard inode PA                  buddy -= on-disk - PA; PA = 0
 *  - use locality group PA             on-disk += N; PA -= N
 *  - discard locality group PA         buddy -= PA; PA = 0
 *  note: 'buddy -= on-disk - PA' is used to show that on-disk bitmap
 *        is used in real operation because we can't know actual used
 *        bits from PA, only from on-disk bitmap
 *
 * if we follow this strict logic, then all operations above should be atomic.
 * given some of them can block, we'd have to use something like semaphores
 * killing performance on high-end SMP hardware. let's try to relax it using
 * the following knowledge:
 *  1) if buddy is referenced, it's already initialized
 *  2) while block is used in buddy and the buddy is referenced,
 *     nobody can re-allocate that block
 *  3) we work on bitmaps and '+' actually means 'set bits'. if on-disk has
 *     bit set and PA claims same block, it's OK. IOW, one can set bit in
 *     on-disk bitmap if buddy has same bit set or/and PA covers corresponded
 *     block
 *
 * so, now we're building a concurrency table:
 *  - init buddy vs.
 *    - new PA
 *      blocks for PA are allocated in the buddy, buddy must be referenced
 *      until PA is linked to allocation group to avoid concurrent buddy init
 *    - use inode PA
 *      we need to make sure that either on-disk bitmap or PA has uptodate data
 *      given (3) we care that PA-=N operation doesn't interfere with init
 *    - discard inode PA
 *      the simplest way would be to have buddy initialized by the discard
 *    - use locality group PA
 *      again PA-=N must be serialized with init
 *    - discard locality group PA
 *      the simplest way would be to have buddy initialized by the discard
 *  - new PA vs.
 *    - use inode PA
 *      i_data_sem serializes them
 *    - discard inode PA
 *      discard process must wait until PA isn't used by another process
 *    - use locality group PA
 *      some mutex should serialize them
 *    - discard locality group PA
 *      discard process must wait until PA isn't used by another process
 *  - use inode PA
 *    - use inode PA
 *      i_data_sem or another mutex should serializes them
 *    - discard inode PA
 *      discard process must wait until PA isn't used by another process
 *    - use locality group PA
 *      nothing wrong here -- they're different PAs covering different blocks
 *    - discard locality group PA
 *      discard process must wait until PA isn't used by another process
 *
 * now we're ready to make few consequences:
 *  - PA is referenced and while it is no discard is possible
 *  - PA is referenced until block isn't marked in on-disk bitmap
 *  - PA changes only after on-disk bitmap
 *  - discard must not compete with init. either init is done before
 *    any discard or they're serialized somehow
 *  - buddy init as sum of on-disk bitmap and PAs is done atomically
 *
 * a special case when we've used PA to emptiness. no need to modify buddy
 * in this case, but we should care about concurrent init
 *
 */
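
/*
 * Illustrative sketch, not part of this file: the relation above,
 * "in-core buddy = on-disk bitmap + preallocation descriptors", amounts to
 * setting the bits covered by each PA in a copy of the on-disk bitmap;
 * per rule (3), '+' simply means 'set bits'.  The helper and its
 * parameters are hypothetical; bitmap_copy()/bitmap_set() come from
 * <linux/bitmap.h>.
 */
static void example_add_pa_to_bitmap(unsigned long *incore,
                                     const unsigned long *ondisk,
                                     unsigned int pa_start,
                                     unsigned int pa_len,
                                     unsigned int nbits)
{
        bitmap_copy(incore, ondisk, nbits);     /* persistent (allocated) blocks */
        bitmap_set(incore, pa_start, pa_len);   /* non-persistent PA blocks      */
}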

/*
 * Logic in few words:
 *
 *  - allocation:
 *    load group
 *    find blocks
 *    mark bits in on-disk bitmap
 *    release group
 *
 *  - use preallocation:
 *    find proper PA (per-inode or group)
 *    load group
 *    mark bits in on-disk bitmap
 *    release group
 *    release PA
 *
 *  - free:
 *    load group
 *    mark bits in on-disk bitmap
 *    release group
 *
 *  - discard preallocations in group:
 *    mark PAs deleted
 *    move them onto local list
 *    load on-disk bitmap
 *    load group
 *    remove PA from object (inode or locality group)
 *    mark free blocks in-core
 *
 *  - discard inode's preallocations:
 */

/*
 * Locking rules
 *
 * Locks:
 *  - bitlock on a group        (group)
 *  - object (inode/locality)   (object)
 *  - per-pa lock               (pa)
 *
 * Paths:
 *  - new pa
 *    object
 *    group
 *
 *  - find and use pa:
 *    pa
 *
 *  - release consumed pa:
 *    pa
 *    group
 *    object
 *
 *  - generate in-core bitmap:
 *    group
 *    pa
 *
 *  - discard all for given object (inode, locality group):
 *    object
 *    pa
 *    group
 *
 *  - discard all for given group:
 *    group
 *    pa
 *    group
 *    object
 *
 */
static struct kmem_cache *ext4_pspace_cachep;
static struct kmem_cache *ext4_ac_cachep;
static struct kmem_cache *ext4_free_data_cachep;

/* We create slab caches for groupinfo data structures based on the
 * superblock block size.  There will be one per mounted filesystem for
 * each unique s_blocksize_bits */
#define NR_GRPINFO_CACHES 8
static struct kmem_cache *ext4_groupinfo_caches[NR_GRPINFO_CACHES];

static const char *ext4_groupinfo_slab_names[NR_GRPINFO_CACHES] = {
        "ext4_groupinfo_1k", "ext4_groupinfo_2k", "ext4_groupinfo_4k",
        "ext4_groupinfo_8k", "ext4_groupinfo_16k", "ext4_groupinfo_32k",
        "ext4_groupinfo_64k", "ext4_groupinfo_128k"
};

static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
                                        ext4_group_t group);
static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
368 ext4_group_t group); 368 ext4_group_t group);
369 static void ext4_free_data_callback(struct super_block *sb, 369 static void ext4_free_data_callback(struct super_block *sb,
370 struct ext4_journal_cb_entry *jce, int rc); 370 struct ext4_journal_cb_entry *jce, int rc);
371 371
372 static inline void *mb_correct_addr_and_bit(int *bit, void *addr) 372 static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
373 { 373 {
374 #if BITS_PER_LONG == 64 374 #if BITS_PER_LONG == 64
375 *bit += ((unsigned long) addr & 7UL) << 3; 375 *bit += ((unsigned long) addr & 7UL) << 3;
376 addr = (void *) ((unsigned long) addr & ~7UL); 376 addr = (void *) ((unsigned long) addr & ~7UL);
377 #elif BITS_PER_LONG == 32 377 #elif BITS_PER_LONG == 32
378 *bit += ((unsigned long) addr & 3UL) << 3; 378 *bit += ((unsigned long) addr & 3UL) << 3;
379 addr = (void *) ((unsigned long) addr & ~3UL); 379 addr = (void *) ((unsigned long) addr & ~3UL);
380 #else 380 #else
381 #error "how many bits you are?!" 381 #error "how many bits you are?!"
382 #endif 382 #endif
383 return addr; 383 return addr;
384 } 384 }
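mb_correct_addr_and_bit() exists because ext4_test_bit() and friends need a long-aligned address on some architectures (see the powerpc comment just below): the helper rounds the pointer down to a long boundary and folds the skipped bytes into the bit index, 8 bits per byte, hence the << 3. A standalone sketch of the same arithmetic, assuming a 64-bit long (the function name here is illustrative, not the kernel helper):

#include <stdio.h>
#include <stdint.h>

/* Round addr down to an 8-byte boundary and grow the bit index so that
 * (aligned, bit) still names the same bit as (addr, bit). */
static void *align_addr_and_bit(int *bit, void *addr)
{
	uintptr_t p = (uintptr_t)addr;

	*bit += (int)(p & 7UL) << 3;	/* 8 bits per skipped byte */
	return (void *)(p & ~7UL);
}

int main(void)
{
	unsigned char buf[32];
	int bit = 2;
	void *addr = buf + 5;
	void *aligned = align_addr_and_bit(&bit, addr);

	/* the pointer moved back by (addr - aligned) bytes and the bit
	 * index grew by 8 bits per byte, so both still name one spot */
	printf("moved back %td bytes, bit index is now %d\n",
	       (char *)addr - (char *)aligned, bit);
	return 0;
}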
385 385
386 static inline int mb_test_bit(int bit, void *addr) 386 static inline int mb_test_bit(int bit, void *addr)
387 { 387 {
388 /* 388 /*
389 * ext4_test_bit on architecture like powerpc 389 * ext4_test_bit on architecture like powerpc
390 * needs unsigned long aligned address 390 * needs unsigned long aligned address
391 */ 391 */
392 addr = mb_correct_addr_and_bit(&bit, addr); 392 addr = mb_correct_addr_and_bit(&bit, addr);
393 return ext4_test_bit(bit, addr); 393 return ext4_test_bit(bit, addr);
394 } 394 }
395 395
396 static inline void mb_set_bit(int bit, void *addr) 396 static inline void mb_set_bit(int bit, void *addr)
397 { 397 {
398 addr = mb_correct_addr_and_bit(&bit, addr); 398 addr = mb_correct_addr_and_bit(&bit, addr);
399 ext4_set_bit(bit, addr); 399 ext4_set_bit(bit, addr);
400 } 400 }
401 401
402 static inline void mb_clear_bit(int bit, void *addr) 402 static inline void mb_clear_bit(int bit, void *addr)
403 { 403 {
404 addr = mb_correct_addr_and_bit(&bit, addr); 404 addr = mb_correct_addr_and_bit(&bit, addr);
405 ext4_clear_bit(bit, addr); 405 ext4_clear_bit(bit, addr);
406 } 406 }
407 407
408 static inline int mb_test_and_clear_bit(int bit, void *addr) 408 static inline int mb_test_and_clear_bit(int bit, void *addr)
409 { 409 {
410 addr = mb_correct_addr_and_bit(&bit, addr); 410 addr = mb_correct_addr_and_bit(&bit, addr);
411 return ext4_test_and_clear_bit(bit, addr); 411 return ext4_test_and_clear_bit(bit, addr);
412 } 412 }
413 413
414 static inline int mb_find_next_zero_bit(void *addr, int max, int start) 414 static inline int mb_find_next_zero_bit(void *addr, int max, int start)
415 { 415 {
416 int fix = 0, ret, tmpmax; 416 int fix = 0, ret, tmpmax;
417 addr = mb_correct_addr_and_bit(&fix, addr); 417 addr = mb_correct_addr_and_bit(&fix, addr);
418 tmpmax = max + fix; 418 tmpmax = max + fix;
419 start += fix; 419 start += fix;
420 420
421 ret = ext4_find_next_zero_bit(addr, tmpmax, start) - fix; 421 ret = ext4_find_next_zero_bit(addr, tmpmax, start) - fix;
422 if (ret > max) 422 if (ret > max)
423 return max; 423 return max;
424 return ret; 424 return ret;
425 } 425 }
426 426
427 static inline int mb_find_next_bit(void *addr, int max, int start) 427 static inline int mb_find_next_bit(void *addr, int max, int start)
428 { 428 {
429 int fix = 0, ret, tmpmax; 429 int fix = 0, ret, tmpmax;
430 addr = mb_correct_addr_and_bit(&fix, addr); 430 addr = mb_correct_addr_and_bit(&fix, addr);
431 tmpmax = max + fix; 431 tmpmax = max + fix;
432 start += fix; 432 start += fix;
433 433
434 ret = ext4_find_next_bit(addr, tmpmax, start) - fix; 434 ret = ext4_find_next_bit(addr, tmpmax, start) - fix;
435 if (ret > max) 435 if (ret > max)
436 return max; 436 return max;
437 return ret; 437 return ret;
438 } 438 }
439 439
440 static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) 440 static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
441 { 441 {
442 char *bb; 442 char *bb;
443 443
444 BUG_ON(e4b->bd_bitmap == e4b->bd_buddy); 444 BUG_ON(e4b->bd_bitmap == e4b->bd_buddy);
445 BUG_ON(max == NULL); 445 BUG_ON(max == NULL);
446 446
447 if (order > e4b->bd_blkbits + 1) { 447 if (order > e4b->bd_blkbits + 1) {
448 *max = 0; 448 *max = 0;
449 return NULL; 449 return NULL;
450 } 450 }
451 451
452 /* at order 0 we see each particular block */ 452 /* at order 0 we see each particular block */
453 if (order == 0) { 453 if (order == 0) {
454 *max = 1 << (e4b->bd_blkbits + 3); 454 *max = 1 << (e4b->bd_blkbits + 3);
455 return e4b->bd_bitmap; 455 return e4b->bd_bitmap;
456 } 456 }
457 457
458 bb = e4b->bd_buddy + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order]; 458 bb = e4b->bd_buddy + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order];
459 *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order]; 459 *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order];
460 460
461 return bb; 461 return bb;
462 } 462 }
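mb_find_buddy() only indexes precomputed per-order tables: order 0 is the block bitmap itself with 8 * blocksize bits, and each order k >= 1 occupies a slice of the separate buddy block with half as many bits as order k - 1, packed one after another. A hedged sketch of how such offsets and limits can be derived for 4 KiB blocks (the arrays mimic the role of s_mb_offsets/s_mb_maxs, but the real initialisation lives in ext4_mb_init(), outside this hunk, and the last couple of orders round slightly differently there):

#include <stdio.h>

int main(void)
{
	int blkbits = 12;			/* 4 KiB blocks */
	int max_order = blkbits + 1;
	int offsets[16], maxs[16];
	int order, off = 0;

	/* order 0: the block bitmap itself, one bit per cluster */
	offsets[0] = 0;
	maxs[0] = 8 << blkbits;			/* 32768 bits */

	/* orders 1..blkbits+1 share the buddy block; order k has half
	 * the bits of order k-1 and is packed right after it */
	for (order = 1; order <= max_order; order++) {
		offsets[order] = off;
		maxs[order] = maxs[order - 1] / 2;
		off += maxs[order] / 8;		/* bytes consumed */
	}

	for (order = 0; order <= max_order; order++)
		printf("order %2d: byte offset %5d, %6d bits\n",
		       order, offsets[order], maxs[order]);
	return 0;
}

The slices for orders 1 and up sum to just under one block (2048 + 1024 + ... bytes), which is why one bitmap block plus one buddy block per group is enough.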
463 463
464 #ifdef DOUBLE_CHECK 464 #ifdef DOUBLE_CHECK
465 static void mb_free_blocks_double(struct inode *inode, struct ext4_buddy *e4b, 465 static void mb_free_blocks_double(struct inode *inode, struct ext4_buddy *e4b,
466 int first, int count) 466 int first, int count)
467 { 467 {
468 int i; 468 int i;
469 struct super_block *sb = e4b->bd_sb; 469 struct super_block *sb = e4b->bd_sb;
470 470
471 if (unlikely(e4b->bd_info->bb_bitmap == NULL)) 471 if (unlikely(e4b->bd_info->bb_bitmap == NULL))
472 return; 472 return;
473 assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group)); 473 assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group));
474 for (i = 0; i < count; i++) { 474 for (i = 0; i < count; i++) {
475 if (!mb_test_bit(first + i, e4b->bd_info->bb_bitmap)) { 475 if (!mb_test_bit(first + i, e4b->bd_info->bb_bitmap)) {
476 ext4_fsblk_t blocknr; 476 ext4_fsblk_t blocknr;
477 477
478 blocknr = ext4_group_first_block_no(sb, e4b->bd_group); 478 blocknr = ext4_group_first_block_no(sb, e4b->bd_group);
479 blocknr += EXT4_C2B(EXT4_SB(sb), first + i); 479 blocknr += EXT4_C2B(EXT4_SB(sb), first + i);
480 ext4_grp_locked_error(sb, e4b->bd_group, 480 ext4_grp_locked_error(sb, e4b->bd_group,
481 inode ? inode->i_ino : 0, 481 inode ? inode->i_ino : 0,
482 blocknr, 482 blocknr,
483 "freeing block already freed " 483 "freeing block already freed "
484 "(bit %u)", 484 "(bit %u)",
485 first + i); 485 first + i);
486 } 486 }
487 mb_clear_bit(first + i, e4b->bd_info->bb_bitmap); 487 mb_clear_bit(first + i, e4b->bd_info->bb_bitmap);
488 } 488 }
489 } 489 }
490 490
491 static void mb_mark_used_double(struct ext4_buddy *e4b, int first, int count) 491 static void mb_mark_used_double(struct ext4_buddy *e4b, int first, int count)
492 { 492 {
493 int i; 493 int i;
494 494
495 if (unlikely(e4b->bd_info->bb_bitmap == NULL)) 495 if (unlikely(e4b->bd_info->bb_bitmap == NULL))
496 return; 496 return;
497 assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group)); 497 assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
498 for (i = 0; i < count; i++) { 498 for (i = 0; i < count; i++) {
499 BUG_ON(mb_test_bit(first + i, e4b->bd_info->bb_bitmap)); 499 BUG_ON(mb_test_bit(first + i, e4b->bd_info->bb_bitmap));
500 mb_set_bit(first + i, e4b->bd_info->bb_bitmap); 500 mb_set_bit(first + i, e4b->bd_info->bb_bitmap);
501 } 501 }
502 } 502 }
503 503
504 static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap) 504 static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
505 { 505 {
506 if (memcmp(e4b->bd_info->bb_bitmap, bitmap, e4b->bd_sb->s_blocksize)) { 506 if (memcmp(e4b->bd_info->bb_bitmap, bitmap, e4b->bd_sb->s_blocksize)) {
507 unsigned char *b1, *b2; 507 unsigned char *b1, *b2;
508 int i; 508 int i;
509 b1 = (unsigned char *) e4b->bd_info->bb_bitmap; 509 b1 = (unsigned char *) e4b->bd_info->bb_bitmap;
510 b2 = (unsigned char *) bitmap; 510 b2 = (unsigned char *) bitmap;
511 for (i = 0; i < e4b->bd_sb->s_blocksize; i++) { 511 for (i = 0; i < e4b->bd_sb->s_blocksize; i++) {
512 if (b1[i] != b2[i]) { 512 if (b1[i] != b2[i]) {
513 ext4_msg(e4b->bd_sb, KERN_ERR, 513 ext4_msg(e4b->bd_sb, KERN_ERR,
514 "corruption in group %u " 514 "corruption in group %u "
515 "at byte %u(%u): %x in copy != %x " 515 "at byte %u(%u): %x in copy != %x "
516 "on disk/prealloc", 516 "on disk/prealloc",
517 e4b->bd_group, i, i * 8, b1[i], b2[i]); 517 e4b->bd_group, i, i * 8, b1[i], b2[i]);
518 BUG(); 518 BUG();
519 } 519 }
520 } 520 }
521 } 521 }
522 } 522 }
523 523
524 #else 524 #else
525 static inline void mb_free_blocks_double(struct inode *inode, 525 static inline void mb_free_blocks_double(struct inode *inode,
526 struct ext4_buddy *e4b, int first, int count) 526 struct ext4_buddy *e4b, int first, int count)
527 { 527 {
528 return; 528 return;
529 } 529 }
530 static inline void mb_mark_used_double(struct ext4_buddy *e4b, 530 static inline void mb_mark_used_double(struct ext4_buddy *e4b,
531 int first, int count) 531 int first, int count)
532 { 532 {
533 return; 533 return;
534 } 534 }
535 static inline void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap) 535 static inline void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
536 { 536 {
537 return; 537 return;
538 } 538 }
539 #endif 539 #endif
540 540
541 #ifdef AGGRESSIVE_CHECK 541 #ifdef AGGRESSIVE_CHECK
542 542
543 #define MB_CHECK_ASSERT(assert) \ 543 #define MB_CHECK_ASSERT(assert) \
544 do { \ 544 do { \
545 if (!(assert)) { \ 545 if (!(assert)) { \
546 printk(KERN_EMERG \ 546 printk(KERN_EMERG \
547 "Assertion failure in %s() at %s:%d: \"%s\"\n", \ 547 "Assertion failure in %s() at %s:%d: \"%s\"\n", \
548 function, file, line, # assert); \ 548 function, file, line, # assert); \
549 BUG(); \ 549 BUG(); \
550 } \ 550 } \
551 } while (0) 551 } while (0)
552 552
553 static int __mb_check_buddy(struct ext4_buddy *e4b, char *file, 553 static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
554 const char *function, int line) 554 const char *function, int line)
555 { 555 {
556 struct super_block *sb = e4b->bd_sb; 556 struct super_block *sb = e4b->bd_sb;
557 int order = e4b->bd_blkbits + 1; 557 int order = e4b->bd_blkbits + 1;
558 int max; 558 int max;
559 int max2; 559 int max2;
560 int i; 560 int i;
561 int j; 561 int j;
562 int k; 562 int k;
563 int count; 563 int count;
564 struct ext4_group_info *grp; 564 struct ext4_group_info *grp;
565 int fragments = 0; 565 int fragments = 0;
566 int fstart; 566 int fstart;
567 struct list_head *cur; 567 struct list_head *cur;
568 void *buddy; 568 void *buddy;
569 void *buddy2; 569 void *buddy2;
570 570
571 { 571 {
572 static int mb_check_counter; 572 static int mb_check_counter;
573 if (mb_check_counter++ % 100 != 0) 573 if (mb_check_counter++ % 100 != 0)
574 return 0; 574 return 0;
575 } 575 }
576 576
577 while (order > 1) { 577 while (order > 1) {
578 buddy = mb_find_buddy(e4b, order, &max); 578 buddy = mb_find_buddy(e4b, order, &max);
579 MB_CHECK_ASSERT(buddy); 579 MB_CHECK_ASSERT(buddy);
580 buddy2 = mb_find_buddy(e4b, order - 1, &max2); 580 buddy2 = mb_find_buddy(e4b, order - 1, &max2);
581 MB_CHECK_ASSERT(buddy2); 581 MB_CHECK_ASSERT(buddy2);
582 MB_CHECK_ASSERT(buddy != buddy2); 582 MB_CHECK_ASSERT(buddy != buddy2);
583 MB_CHECK_ASSERT(max * 2 == max2); 583 MB_CHECK_ASSERT(max * 2 == max2);
584 584
585 count = 0; 585 count = 0;
586 for (i = 0; i < max; i++) { 586 for (i = 0; i < max; i++) {
587 587
588 if (mb_test_bit(i, buddy)) { 588 if (mb_test_bit(i, buddy)) {
589 /* only single bit in buddy2 may be 1 */ 589 /* only single bit in buddy2 may be 1 */
590 if (!mb_test_bit(i << 1, buddy2)) { 590 if (!mb_test_bit(i << 1, buddy2)) {
591 MB_CHECK_ASSERT( 591 MB_CHECK_ASSERT(
592 mb_test_bit((i<<1)+1, buddy2)); 592 mb_test_bit((i<<1)+1, buddy2));
593 } else if (!mb_test_bit((i << 1) + 1, buddy2)) { 593 } else if (!mb_test_bit((i << 1) + 1, buddy2)) {
594 MB_CHECK_ASSERT( 594 MB_CHECK_ASSERT(
595 mb_test_bit(i << 1, buddy2)); 595 mb_test_bit(i << 1, buddy2));
596 } 596 }
597 continue; 597 continue;
598 } 598 }
599 599
600 /* both bits in buddy2 must be 1 */ 600 /* both bits in buddy2 must be 1 */
601 MB_CHECK_ASSERT(mb_test_bit(i << 1, buddy2)); 601 MB_CHECK_ASSERT(mb_test_bit(i << 1, buddy2));
602 MB_CHECK_ASSERT(mb_test_bit((i << 1) + 1, buddy2)); 602 MB_CHECK_ASSERT(mb_test_bit((i << 1) + 1, buddy2));
603 603
604 for (j = 0; j < (1 << order); j++) { 604 for (j = 0; j < (1 << order); j++) {
605 k = (i * (1 << order)) + j; 605 k = (i * (1 << order)) + j;
606 MB_CHECK_ASSERT( 606 MB_CHECK_ASSERT(
607 !mb_test_bit(k, e4b->bd_bitmap)); 607 !mb_test_bit(k, e4b->bd_bitmap));
608 } 608 }
609 count++; 609 count++;
610 } 610 }
611 MB_CHECK_ASSERT(e4b->bd_info->bb_counters[order] == count); 611 MB_CHECK_ASSERT(e4b->bd_info->bb_counters[order] == count);
612 order--; 612 order--;
613 } 613 }
614 614
615 fstart = -1; 615 fstart = -1;
616 buddy = mb_find_buddy(e4b, 0, &max); 616 buddy = mb_find_buddy(e4b, 0, &max);
617 for (i = 0; i < max; i++) { 617 for (i = 0; i < max; i++) {
618 if (!mb_test_bit(i, buddy)) { 618 if (!mb_test_bit(i, buddy)) {
619 MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free); 619 MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
620 if (fstart == -1) { 620 if (fstart == -1) {
621 fragments++; 621 fragments++;
622 fstart = i; 622 fstart = i;
623 } 623 }
624 continue; 624 continue;
625 } 625 }
626 fstart = -1; 626 fstart = -1;
627 /* check used bits only */ 627 /* check used bits only */
628 for (j = 0; j < e4b->bd_blkbits + 1; j++) { 628 for (j = 0; j < e4b->bd_blkbits + 1; j++) {
629 buddy2 = mb_find_buddy(e4b, j, &max2); 629 buddy2 = mb_find_buddy(e4b, j, &max2);
630 k = i >> j; 630 k = i >> j;
631 MB_CHECK_ASSERT(k < max2); 631 MB_CHECK_ASSERT(k < max2);
632 MB_CHECK_ASSERT(mb_test_bit(k, buddy2)); 632 MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
633 } 633 }
634 } 634 }
635 MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info)); 635 MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
636 MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments); 636 MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
637 637
638 grp = ext4_get_group_info(sb, e4b->bd_group); 638 grp = ext4_get_group_info(sb, e4b->bd_group);
639 list_for_each(cur, &grp->bb_prealloc_list) { 639 list_for_each(cur, &grp->bb_prealloc_list) {
640 ext4_group_t groupnr; 640 ext4_group_t groupnr;
641 struct ext4_prealloc_space *pa; 641 struct ext4_prealloc_space *pa;
642 pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list); 642 pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
643 ext4_get_group_no_and_offset(sb, pa->pa_pstart, &groupnr, &k); 643 ext4_get_group_no_and_offset(sb, pa->pa_pstart, &groupnr, &k);
644 MB_CHECK_ASSERT(groupnr == e4b->bd_group); 644 MB_CHECK_ASSERT(groupnr == e4b->bd_group);
645 for (i = 0; i < pa->pa_len; i++) 645 for (i = 0; i < pa->pa_len; i++)
646 MB_CHECK_ASSERT(mb_test_bit(k + i, buddy)); 646 MB_CHECK_ASSERT(mb_test_bit(k + i, buddy));
647 } 647 }
648 return 0; 648 return 0;
649 } 649 }
650 #undef MB_CHECK_ASSERT 650 #undef MB_CHECK_ASSERT
651 #define mb_check_buddy(e4b) __mb_check_buddy(e4b, \ 651 #define mb_check_buddy(e4b) __mb_check_buddy(e4b, \
652 __FILE__, __func__, __LINE__) 652 __FILE__, __func__, __LINE__)
653 #else 653 #else
654 #define mb_check_buddy(e4b) 654 #define mb_check_buddy(e4b)
655 #endif 655 #endif
656 656
657 /* 657 /*
658 * Divide blocks started from @first with length @len into 658 * Divide blocks started from @first with length @len into
659 * smaller chunks with power of 2 blocks. 659 * smaller chunks with power of 2 blocks.
660 * Clear the bits in bitmap which the blocks of the chunk(s) covered, 660 * Clear the bits in bitmap which the blocks of the chunk(s) covered,
661 * then increase bb_counters[] for corresponded chunk size. 661 * then increase bb_counters[] for corresponded chunk size.
662 */ 662 */
663 static void ext4_mb_mark_free_simple(struct super_block *sb, 663 static void ext4_mb_mark_free_simple(struct super_block *sb,
664 void *buddy, ext4_grpblk_t first, ext4_grpblk_t len, 664 void *buddy, ext4_grpblk_t first, ext4_grpblk_t len,
665 struct ext4_group_info *grp) 665 struct ext4_group_info *grp)
666 { 666 {
667 struct ext4_sb_info *sbi = EXT4_SB(sb); 667 struct ext4_sb_info *sbi = EXT4_SB(sb);
668 ext4_grpblk_t min; 668 ext4_grpblk_t min;
669 ext4_grpblk_t max; 669 ext4_grpblk_t max;
670 ext4_grpblk_t chunk; 670 ext4_grpblk_t chunk;
671 unsigned short border; 671 unsigned short border;
672 672
673 BUG_ON(len > EXT4_CLUSTERS_PER_GROUP(sb)); 673 BUG_ON(len > EXT4_CLUSTERS_PER_GROUP(sb));
674 674
675 border = 2 << sb->s_blocksize_bits; 675 border = 2 << sb->s_blocksize_bits;
676 676
677 while (len > 0) { 677 while (len > 0) {
678 /* find how many blocks can be covered since this position */ 678 /* find how many blocks can be covered since this position */
679 max = ffs(first | border) - 1; 679 max = ffs(first | border) - 1;
680 680
681 /* find how many blocks of power 2 we need to mark */ 681 /* find how many blocks of power 2 we need to mark */
682 min = fls(len) - 1; 682 min = fls(len) - 1;
683 683
684 if (max < min) 684 if (max < min)
685 min = max; 685 min = max;
686 chunk = 1 << min; 686 chunk = 1 << min;
687 687
688 /* mark multiblock chunks only */ 688 /* mark multiblock chunks only */
689 grp->bb_counters[min]++; 689 grp->bb_counters[min]++;
690 if (min > 0) 690 if (min > 0)
691 mb_clear_bit(first >> min, 691 mb_clear_bit(first >> min,
692 buddy + sbi->s_mb_offsets[min]); 692 buddy + sbi->s_mb_offsets[min]);
693 693
694 len -= chunk; 694 len -= chunk;
695 first += chunk; 695 first += chunk;
696 } 696 }
697 } 697 }
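ext4_mb_mark_free_simple() splits a free extent into the largest power-of-two chunks that are both naturally aligned at the current position and no longer than what remains: ffs(first | border) - 1 is the alignment order of first (capped by border), and fls(len) - 1 is the largest order that still fits. A standalone trace of the same chunking for the extent [5, 16), with ffs/fls built from compiler builtins since fls() is not in the C library (the kernel version additionally clears buddy bits and bumps bb_counters):

#include <stdio.h>

static int ffs_(unsigned int x) { return x ? __builtin_ctz(x) + 1 : 0; }
static int fls_(unsigned int x) { return x ? 32 - __builtin_clz(x) : 0; }

int main(void)
{
	unsigned int first = 5, len = 11;	/* free extent [5, 16) */
	unsigned int border = 1u << 13;		/* 2 << blkbits for 4 KiB blocks */

	while (len > 0) {
		int max = ffs_(first | border) - 1;	/* alignment of 'first' */
		int min = fls_(len) - 1;		/* largest order that fits */
		unsigned int chunk;

		if (max < min)
			min = max;
		chunk = 1u << min;

		printf("chunk of order %d: blocks [%u, %u)\n",
		       min, first, first + chunk);
		first += chunk;
		len -= chunk;
	}
	return 0;
}

For [5, 16) this prints chunks of order 0, 1 and 3, i.e. [5,6), [6,8) and [8,16), exactly the pieces whose buddy bits the kernel function would clear.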
698 698
699 /* 699 /*
700 * Cache the order of the largest free extent we have available in this block 700 * Cache the order of the largest free extent we have available in this block
701 * group. 701 * group.
702 */ 702 */
703 static void 703 static void
704 mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp) 704 mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
705 { 705 {
706 int i; 706 int i;
707 int bits; 707 int bits;
708 708
709 grp->bb_largest_free_order = -1; /* uninit */ 709 grp->bb_largest_free_order = -1; /* uninit */
710 710
711 bits = sb->s_blocksize_bits + 1; 711 bits = sb->s_blocksize_bits + 1;
712 for (i = bits; i >= 0; i--) { 712 for (i = bits; i >= 0; i--) {
713 if (grp->bb_counters[i] > 0) { 713 if (grp->bb_counters[i] > 0) {
714 grp->bb_largest_free_order = i; 714 grp->bb_largest_free_order = i;
715 break; 715 break;
716 } 716 }
717 } 717 }
718 } 718 }
719 719
720 static noinline_for_stack 720 static noinline_for_stack
721 void ext4_mb_generate_buddy(struct super_block *sb, 721 void ext4_mb_generate_buddy(struct super_block *sb,
722 void *buddy, void *bitmap, ext4_group_t group) 722 void *buddy, void *bitmap, ext4_group_t group)
723 { 723 {
724 struct ext4_group_info *grp = ext4_get_group_info(sb, group); 724 struct ext4_group_info *grp = ext4_get_group_info(sb, group);
725 ext4_grpblk_t max = EXT4_CLUSTERS_PER_GROUP(sb); 725 ext4_grpblk_t max = EXT4_CLUSTERS_PER_GROUP(sb);
726 ext4_grpblk_t i = 0; 726 ext4_grpblk_t i = 0;
727 ext4_grpblk_t first; 727 ext4_grpblk_t first;
728 ext4_grpblk_t len; 728 ext4_grpblk_t len;
729 unsigned free = 0; 729 unsigned free = 0;
730 unsigned fragments = 0; 730 unsigned fragments = 0;
731 unsigned long long period = get_cycles(); 731 unsigned long long period = get_cycles();
732 732
733 /* initialize buddy from bitmap which is aggregation 733 /* initialize buddy from bitmap which is aggregation
734 * of on-disk bitmap and preallocations */ 734 * of on-disk bitmap and preallocations */
735 i = mb_find_next_zero_bit(bitmap, max, 0); 735 i = mb_find_next_zero_bit(bitmap, max, 0);
736 grp->bb_first_free = i; 736 grp->bb_first_free = i;
737 while (i < max) { 737 while (i < max) {
738 fragments++; 738 fragments++;
739 first = i; 739 first = i;
740 i = mb_find_next_bit(bitmap, max, i); 740 i = mb_find_next_bit(bitmap, max, i);
741 len = i - first; 741 len = i - first;
742 free += len; 742 free += len;
743 if (len > 1) 743 if (len > 1)
744 ext4_mb_mark_free_simple(sb, buddy, first, len, grp); 744 ext4_mb_mark_free_simple(sb, buddy, first, len, grp);
745 else 745 else
746 grp->bb_counters[0]++; 746 grp->bb_counters[0]++;
747 if (i < max) 747 if (i < max)
748 i = mb_find_next_zero_bit(bitmap, max, i); 748 i = mb_find_next_zero_bit(bitmap, max, i);
749 } 749 }
750 grp->bb_fragments = fragments; 750 grp->bb_fragments = fragments;
751 751
752 if (free != grp->bb_free) { 752 if (free != grp->bb_free) {
753 ext4_grp_locked_error(sb, group, 0, 0, 753 ext4_grp_locked_error(sb, group, 0, 0,
754 "block bitmap and bg descriptor " 754 "block bitmap and bg descriptor "
755 "inconsistent: %u vs %u free clusters", 755 "inconsistent: %u vs %u free clusters",
756 free, grp->bb_free); 756 free, grp->bb_free);
757 /* 757 /*
758 * If we intend to continue, we consider group descriptor 758 * If we intend to continue, we consider group descriptor
759 * corrupt and update bb_free using bitmap value 759 * corrupt and update bb_free using bitmap value
760 */ 760 */
761 grp->bb_free = free; 761 grp->bb_free = free;
762 set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); 762 set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
763 } 763 }
764 mb_set_largest_free_order(sb, grp); 764 mb_set_largest_free_order(sb, grp);
765 765
766 clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state)); 766 clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));
767 767
768 period = get_cycles() - period; 768 period = get_cycles() - period;
769 spin_lock(&EXT4_SB(sb)->s_bal_lock); 769 spin_lock(&EXT4_SB(sb)->s_bal_lock);
770 EXT4_SB(sb)->s_mb_buddies_generated++; 770 EXT4_SB(sb)->s_mb_buddies_generated++;
771 EXT4_SB(sb)->s_mb_generation_time += period; 771 EXT4_SB(sb)->s_mb_generation_time += period;
772 spin_unlock(&EXT4_SB(sb)->s_bal_lock); 772 spin_unlock(&EXT4_SB(sb)->s_bal_lock);
773 } 773 }
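ext4_mb_generate_buddy() walks the combined bitmap as alternating runs: find the next zero bit (start of a free extent), then the next set bit (its end), add the length to free, count one fragment, and hand runs longer than one block to ext4_mb_mark_free_simple(). The same run detection over a byte-per-bit toy bitmap, just to make the free/fragments accounting concrete (the real code works on packed bitmaps through mb_find_next_bit()/mb_find_next_zero_bit()):

#include <stdio.h>

int main(void)
{
	/* 1 = used, 0 = free, one byte per "cluster" for simplicity */
	int bitmap[16] = { 1,1,0,0,0,1,0,1,1,0,0,0,0,1,1,1 };
	int max = 16, i = 0, free = 0, fragments = 0;

	while (i < max && bitmap[i])		/* find first free cluster */
		i++;
	while (i < max) {
		int first = i;

		fragments++;
		while (i < max && !bitmap[i])	/* end of this free run */
			i++;
		free += i - first;
		printf("free extent [%d, %d), length %d\n", first, i, i - first);
		while (i < max && bitmap[i])	/* skip to the next free run */
			i++;
	}
	printf("free = %d, fragments = %d\n", free, fragments);
	return 0;
}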
774 774
775 static void mb_regenerate_buddy(struct ext4_buddy *e4b) 775 static void mb_regenerate_buddy(struct ext4_buddy *e4b)
776 { 776 {
777 int count; 777 int count;
778 int order = 1; 778 int order = 1;
779 void *buddy; 779 void *buddy;
780 780
781 while ((buddy = mb_find_buddy(e4b, order++, &count))) { 781 while ((buddy = mb_find_buddy(e4b, order++, &count))) {
782 ext4_set_bits(buddy, 0, count); 782 ext4_set_bits(buddy, 0, count);
783 } 783 }
784 e4b->bd_info->bb_fragments = 0; 784 e4b->bd_info->bb_fragments = 0;
785 memset(e4b->bd_info->bb_counters, 0, 785 memset(e4b->bd_info->bb_counters, 0,
786 sizeof(*e4b->bd_info->bb_counters) * 786 sizeof(*e4b->bd_info->bb_counters) *
787 (e4b->bd_sb->s_blocksize_bits + 2)); 787 (e4b->bd_sb->s_blocksize_bits + 2));
788 788
789 ext4_mb_generate_buddy(e4b->bd_sb, e4b->bd_buddy, 789 ext4_mb_generate_buddy(e4b->bd_sb, e4b->bd_buddy,
790 e4b->bd_bitmap, e4b->bd_group); 790 e4b->bd_bitmap, e4b->bd_group);
791 } 791 }
792 792
793 /* The buddy information is attached the buddy cache inode 793 /* The buddy information is attached the buddy cache inode
794 * for convenience. The information regarding each group 794 * for convenience. The information regarding each group
795 * is loaded via ext4_mb_load_buddy. The information involve 795 * is loaded via ext4_mb_load_buddy. The information involve
796 * block bitmap and buddy information. The information are 796 * block bitmap and buddy information. The information are
797 * stored in the inode as 797 * stored in the inode as
798 * 798 *
799 * { page } 799 * { page }
800 * [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]... 800 * [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]...
801 * 801 *
802 * 802 *
803 * one block each for bitmap and buddy information. 803 * one block each for bitmap and buddy information.
804 * So for each group we take up 2 blocks. A page can 804 * So for each group we take up 2 blocks. A page can
805 * contain blocks_per_page (PAGE_CACHE_SIZE / blocksize) blocks. 805 * contain blocks_per_page (PAGE_CACHE_SIZE / blocksize) blocks.
806 * So it can have information regarding groups_per_page which 806 * So it can have information regarding groups_per_page which
807 * is blocks_per_page/2 807 * is blocks_per_page/2
808 * 808 *
809 * Locking note: This routine takes the block group lock of all groups 809 * Locking note: This routine takes the block group lock of all groups
810 * for this page; do not hold this lock when calling this routine! 810 * for this page; do not hold this lock when calling this routine!
811 */ 811 */
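The layout described above reduces to index arithmetic: group g owns blocks 2*g (bitmap) and 2*g + 1 (buddy) of the buddy-cache inode, and each page of that inode holds PAGE_CACHE_SIZE / blocksize such blocks. A small sketch of the mapping, assuming 4 KiB pages and 1 KiB blocks so that one page covers two groups:

#include <stdio.h>

int main(void)
{
	int page_size = 4096, blocksize = 1024;
	int blocks_per_page = page_size / blocksize;	/* 4 */
	int group;

	for (group = 0; group < 4; group++) {
		int bitmap_block = group * 2;
		int buddy_block  = group * 2 + 1;

		printf("group %d: bitmap in page %d at byte %d, "
		       "buddy in page %d at byte %d\n",
		       group,
		       bitmap_block / blocks_per_page,
		       (bitmap_block % blocks_per_page) * blocksize,
		       buddy_block / blocks_per_page,
		       (buddy_block % blocks_per_page) * blocksize);
	}
	return 0;
}

When the block size equals the page size there is only one block per page, so a group's bitmap and buddy land in two different pages; that is the case ext4_mb_get_buddy_page_lock() further down handles by locking a second page.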
812 812
813 static int ext4_mb_init_cache(struct page *page, char *incore) 813 static int ext4_mb_init_cache(struct page *page, char *incore)
814 { 814 {
815 ext4_group_t ngroups; 815 ext4_group_t ngroups;
816 int blocksize; 816 int blocksize;
817 int blocks_per_page; 817 int blocks_per_page;
818 int groups_per_page; 818 int groups_per_page;
819 int err = 0; 819 int err = 0;
820 int i; 820 int i;
821 ext4_group_t first_group, group; 821 ext4_group_t first_group, group;
822 int first_block; 822 int first_block;
823 struct super_block *sb; 823 struct super_block *sb;
824 struct buffer_head *bhs; 824 struct buffer_head *bhs;
825 struct buffer_head **bh = NULL; 825 struct buffer_head **bh = NULL;
826 struct inode *inode; 826 struct inode *inode;
827 char *data; 827 char *data;
828 char *bitmap; 828 char *bitmap;
829 struct ext4_group_info *grinfo; 829 struct ext4_group_info *grinfo;
830 830
831 mb_debug(1, "init page %lu\n", page->index); 831 mb_debug(1, "init page %lu\n", page->index);
832 832
833 inode = page->mapping->host; 833 inode = page->mapping->host;
834 sb = inode->i_sb; 834 sb = inode->i_sb;
835 ngroups = ext4_get_groups_count(sb); 835 ngroups = ext4_get_groups_count(sb);
836 blocksize = 1 << inode->i_blkbits; 836 blocksize = 1 << inode->i_blkbits;
837 blocks_per_page = PAGE_CACHE_SIZE / blocksize; 837 blocks_per_page = PAGE_CACHE_SIZE / blocksize;
838 838
839 groups_per_page = blocks_per_page >> 1; 839 groups_per_page = blocks_per_page >> 1;
840 if (groups_per_page == 0) 840 if (groups_per_page == 0)
841 groups_per_page = 1; 841 groups_per_page = 1;
842 842
843 /* allocate buffer_heads to read bitmaps */ 843 /* allocate buffer_heads to read bitmaps */
844 if (groups_per_page > 1) { 844 if (groups_per_page > 1) {
845 i = sizeof(struct buffer_head *) * groups_per_page; 845 i = sizeof(struct buffer_head *) * groups_per_page;
846 bh = kzalloc(i, GFP_NOFS); 846 bh = kzalloc(i, GFP_NOFS);
847 if (bh == NULL) { 847 if (bh == NULL) {
848 err = -ENOMEM; 848 err = -ENOMEM;
849 goto out; 849 goto out;
850 } 850 }
851 } else 851 } else
852 bh = &bhs; 852 bh = &bhs;
853 853
854 first_group = page->index * blocks_per_page / 2; 854 first_group = page->index * blocks_per_page / 2;
855 855
856 /* read all groups the page covers into the cache */ 856 /* read all groups the page covers into the cache */
857 for (i = 0, group = first_group; i < groups_per_page; i++, group++) { 857 for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
858 if (group >= ngroups) 858 if (group >= ngroups)
859 break; 859 break;
860 860
861 grinfo = ext4_get_group_info(sb, group); 861 grinfo = ext4_get_group_info(sb, group);
862 /* 862 /*
863 * If page is uptodate then we came here after online resize 863 * If page is uptodate then we came here after online resize
864 * which added some new uninitialized group info structs, so 864 * which added some new uninitialized group info structs, so
865 * we must skip all initialized uptodate buddies on the page, 865 * we must skip all initialized uptodate buddies on the page,
866 * which may be currently in use by an allocating task. 866 * which may be currently in use by an allocating task.
867 */ 867 */
868 if (PageUptodate(page) && !EXT4_MB_GRP_NEED_INIT(grinfo)) { 868 if (PageUptodate(page) && !EXT4_MB_GRP_NEED_INIT(grinfo)) {
869 bh[i] = NULL; 869 bh[i] = NULL;
870 continue; 870 continue;
871 } 871 }
872 if (!(bh[i] = ext4_read_block_bitmap_nowait(sb, group))) { 872 if (!(bh[i] = ext4_read_block_bitmap_nowait(sb, group))) {
873 err = -ENOMEM; 873 err = -ENOMEM;
874 goto out; 874 goto out;
875 } 875 }
876 mb_debug(1, "read bitmap for group %u\n", group); 876 mb_debug(1, "read bitmap for group %u\n", group);
877 } 877 }
878 878
879 /* wait for I/O completion */ 879 /* wait for I/O completion */
880 for (i = 0, group = first_group; i < groups_per_page; i++, group++) { 880 for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
881 if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i])) { 881 if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i])) {
882 err = -EIO; 882 err = -EIO;
883 goto out; 883 goto out;
884 } 884 }
885 } 885 }
886 886
887 first_block = page->index * blocks_per_page; 887 first_block = page->index * blocks_per_page;
888 for (i = 0; i < blocks_per_page; i++) { 888 for (i = 0; i < blocks_per_page; i++) {
889 group = (first_block + i) >> 1; 889 group = (first_block + i) >> 1;
890 if (group >= ngroups) 890 if (group >= ngroups)
891 break; 891 break;
892 892
893 if (!bh[group - first_group]) 893 if (!bh[group - first_group])
894 /* skip initialized uptodate buddy */ 894 /* skip initialized uptodate buddy */
895 continue; 895 continue;
896 896
897 /* 897 /*
898 * data carry information regarding this 898 * data carry information regarding this
899 * particular group in the format specified 899 * particular group in the format specified
900 * above 900 * above
901 * 901 *
902 */ 902 */
903 data = page_address(page) + (i * blocksize); 903 data = page_address(page) + (i * blocksize);
904 bitmap = bh[group - first_group]->b_data; 904 bitmap = bh[group - first_group]->b_data;
905 905
906 /* 906 /*
907 * We place the buddy block and bitmap block 907 * We place the buddy block and bitmap block
908 * close together 908 * close together
909 */ 909 */
910 if ((first_block + i) & 1) { 910 if ((first_block + i) & 1) {
911 /* this is block of buddy */ 911 /* this is block of buddy */
912 BUG_ON(incore == NULL); 912 BUG_ON(incore == NULL);
913 mb_debug(1, "put buddy for group %u in page %lu/%x\n", 913 mb_debug(1, "put buddy for group %u in page %lu/%x\n",
914 group, page->index, i * blocksize); 914 group, page->index, i * blocksize);
915 trace_ext4_mb_buddy_bitmap_load(sb, group); 915 trace_ext4_mb_buddy_bitmap_load(sb, group);
916 grinfo = ext4_get_group_info(sb, group); 916 grinfo = ext4_get_group_info(sb, group);
917 grinfo->bb_fragments = 0; 917 grinfo->bb_fragments = 0;
918 memset(grinfo->bb_counters, 0, 918 memset(grinfo->bb_counters, 0,
919 sizeof(*grinfo->bb_counters) * 919 sizeof(*grinfo->bb_counters) *
920 (sb->s_blocksize_bits+2)); 920 (sb->s_blocksize_bits+2));
921 /* 921 /*
922 * incore got set to the group block bitmap below 922 * incore got set to the group block bitmap below
923 */ 923 */
924 ext4_lock_group(sb, group); 924 ext4_lock_group(sb, group);
925 /* init the buddy */ 925 /* init the buddy */
926 memset(data, 0xff, blocksize); 926 memset(data, 0xff, blocksize);
927 ext4_mb_generate_buddy(sb, data, incore, group); 927 ext4_mb_generate_buddy(sb, data, incore, group);
928 ext4_unlock_group(sb, group); 928 ext4_unlock_group(sb, group);
929 incore = NULL; 929 incore = NULL;
930 } else { 930 } else {
931 /* this is block of bitmap */ 931 /* this is block of bitmap */
932 BUG_ON(incore != NULL); 932 BUG_ON(incore != NULL);
933 mb_debug(1, "put bitmap for group %u in page %lu/%x\n", 933 mb_debug(1, "put bitmap for group %u in page %lu/%x\n",
934 group, page->index, i * blocksize); 934 group, page->index, i * blocksize);
935 trace_ext4_mb_bitmap_load(sb, group); 935 trace_ext4_mb_bitmap_load(sb, group);
936 936
937 /* see comments in ext4_mb_put_pa() */ 937 /* see comments in ext4_mb_put_pa() */
938 ext4_lock_group(sb, group); 938 ext4_lock_group(sb, group);
939 memcpy(data, bitmap, blocksize); 939 memcpy(data, bitmap, blocksize);
940 940
941 /* mark all preallocated blks used in in-core bitmap */ 941 /* mark all preallocated blks used in in-core bitmap */
942 ext4_mb_generate_from_pa(sb, data, group); 942 ext4_mb_generate_from_pa(sb, data, group);
943 ext4_mb_generate_from_freelist(sb, data, group); 943 ext4_mb_generate_from_freelist(sb, data, group);
944 ext4_unlock_group(sb, group); 944 ext4_unlock_group(sb, group);
945 945
946 /* set incore so that the buddy information can be 946 /* set incore so that the buddy information can be
947 * generated using this 947 * generated using this
948 */ 948 */
949 incore = data; 949 incore = data;
950 } 950 }
951 } 951 }
952 SetPageUptodate(page); 952 SetPageUptodate(page);
953 953
954 out: 954 out:
955 if (bh) { 955 if (bh) {
956 for (i = 0; i < groups_per_page; i++) 956 for (i = 0; i < groups_per_page; i++)
957 brelse(bh[i]); 957 brelse(bh[i]);
958 if (bh != &bhs) 958 if (bh != &bhs)
959 kfree(bh); 959 kfree(bh);
960 } 960 }
961 return err; 961 return err;
962 } 962 }
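Inside ext4_mb_init_cache() the blocks of one page alternate by parity: even absolute block numbers carry a group's bitmap copy, odd ones carry the buddy generated from the bitmap handled just before, which is why incore is filled by the bitmap branch and consumed by the buddy branch. A sketch of just that slot-to-group mapping, assuming four blocks per page:

#include <stdio.h>

int main(void)
{
	int blocks_per_page = 4;
	int page_index = 3;	/* some page of the buddy-cache inode */
	int first_block = page_index * blocks_per_page;
	int i;

	for (i = 0; i < blocks_per_page; i++) {
		int block = first_block + i;
		int group = block >> 1;

		printf("page %d, slot %d -> group %d %s\n",
		       page_index, i, group,
		       (block & 1) ? "buddy" : "bitmap");
	}
	return 0;
}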
963 963
964 /* 964 /*
965 * Lock the buddy and bitmap pages. This make sure other parallel init_group 965 * Lock the buddy and bitmap pages. This make sure other parallel init_group
966 * on the same buddy page doesn't happen whild holding the buddy page lock. 966 * on the same buddy page doesn't happen whild holding the buddy page lock.
967 * Return locked buddy and bitmap pages on e4b struct. If buddy and bitmap 967 * Return locked buddy and bitmap pages on e4b struct. If buddy and bitmap
968 * are on the same page e4b->bd_buddy_page is NULL and return value is 0. 968 * are on the same page e4b->bd_buddy_page is NULL and return value is 0.
969 */ 969 */
970 static int ext4_mb_get_buddy_page_lock(struct super_block *sb, 970 static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
971 ext4_group_t group, struct ext4_buddy *e4b) 971 ext4_group_t group, struct ext4_buddy *e4b)
972 { 972 {
973 struct inode *inode = EXT4_SB(sb)->s_buddy_cache; 973 struct inode *inode = EXT4_SB(sb)->s_buddy_cache;
974 int block, pnum, poff; 974 int block, pnum, poff;
975 int blocks_per_page; 975 int blocks_per_page;
976 struct page *page; 976 struct page *page;
977 977
978 e4b->bd_buddy_page = NULL; 978 e4b->bd_buddy_page = NULL;
979 e4b->bd_bitmap_page = NULL; 979 e4b->bd_bitmap_page = NULL;
980 980
981 blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; 981 blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
982 /* 982 /*
983 * the buddy cache inode stores the block bitmap 983 * the buddy cache inode stores the block bitmap
984 * and buddy information in consecutive blocks. 984 * and buddy information in consecutive blocks.
985 * So for each group we need two blocks. 985 * So for each group we need two blocks.
986 */ 986 */
987 block = group * 2; 987 block = group * 2;
988 pnum = block / blocks_per_page; 988 pnum = block / blocks_per_page;
989 poff = block % blocks_per_page; 989 poff = block % blocks_per_page;
990 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); 990 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
991 if (!page) 991 if (!page)
992 return -EIO; 992 return -EIO;
993 BUG_ON(page->mapping != inode->i_mapping); 993 BUG_ON(page->mapping != inode->i_mapping);
994 e4b->bd_bitmap_page = page; 994 e4b->bd_bitmap_page = page;
995 e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize); 995 e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize);
996 996
997 if (blocks_per_page >= 2) { 997 if (blocks_per_page >= 2) {
998 /* buddy and bitmap are on the same page */ 998 /* buddy and bitmap are on the same page */
999 return 0; 999 return 0;
1000 } 1000 }
1001 1001
1002 block++; 1002 block++;
1003 pnum = block / blocks_per_page; 1003 pnum = block / blocks_per_page;
1004 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); 1004 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
1005 if (!page) 1005 if (!page)
1006 return -EIO; 1006 return -EIO;
1007 BUG_ON(page->mapping != inode->i_mapping); 1007 BUG_ON(page->mapping != inode->i_mapping);
1008 e4b->bd_buddy_page = page; 1008 e4b->bd_buddy_page = page;
1009 return 0; 1009 return 0;
1010 } 1010 }
1011 1011
1012 static void ext4_mb_put_buddy_page_lock(struct ext4_buddy *e4b) 1012 static void ext4_mb_put_buddy_page_lock(struct ext4_buddy *e4b)
1013 { 1013 {
1014 if (e4b->bd_bitmap_page) { 1014 if (e4b->bd_bitmap_page) {
1015 unlock_page(e4b->bd_bitmap_page); 1015 unlock_page(e4b->bd_bitmap_page);
1016 page_cache_release(e4b->bd_bitmap_page); 1016 page_cache_release(e4b->bd_bitmap_page);
1017 } 1017 }
1018 if (e4b->bd_buddy_page) { 1018 if (e4b->bd_buddy_page) {
1019 unlock_page(e4b->bd_buddy_page); 1019 unlock_page(e4b->bd_buddy_page);
1020 page_cache_release(e4b->bd_buddy_page); 1020 page_cache_release(e4b->bd_buddy_page);
1021 } 1021 }
1022 } 1022 }
1023 1023
1024 /* 1024 /*
1025 * Locking note: This routine calls ext4_mb_init_cache(), which takes the 1025 * Locking note: This routine calls ext4_mb_init_cache(), which takes the
1026 * block group lock of all groups for this page; do not hold the BG lock when 1026 * block group lock of all groups for this page; do not hold the BG lock when
1027 * calling this routine! 1027 * calling this routine!
1028 */ 1028 */
1029 static noinline_for_stack 1029 static noinline_for_stack
1030 int ext4_mb_init_group(struct super_block *sb, ext4_group_t group) 1030 int ext4_mb_init_group(struct super_block *sb, ext4_group_t group)
1031 { 1031 {
1032 1032
1033 struct ext4_group_info *this_grp; 1033 struct ext4_group_info *this_grp;
1034 struct ext4_buddy e4b; 1034 struct ext4_buddy e4b;
1035 struct page *page; 1035 struct page *page;
1036 int ret = 0; 1036 int ret = 0;
1037 1037
1038 might_sleep(); 1038 might_sleep();
1039 mb_debug(1, "init group %u\n", group); 1039 mb_debug(1, "init group %u\n", group);
1040 this_grp = ext4_get_group_info(sb, group); 1040 this_grp = ext4_get_group_info(sb, group);
1041 /* 1041 /*
1042 * This ensures that we don't reinit the buddy cache 1042 * This ensures that we don't reinit the buddy cache
1043 * page which map to the group from which we are already 1043 * page which map to the group from which we are already
1044 * allocating. If we are looking at the buddy cache we would 1044 * allocating. If we are looking at the buddy cache we would
1045 * have taken a reference using ext4_mb_load_buddy and that 1045 * have taken a reference using ext4_mb_load_buddy and that
1046 * would have pinned buddy page to page cache. 1046 * would have pinned buddy page to page cache.
1047 * The call to ext4_mb_get_buddy_page_lock will mark the
1048 * page accessed.
1047 */ 1049 */
1048 ret = ext4_mb_get_buddy_page_lock(sb, group, &e4b); 1050 ret = ext4_mb_get_buddy_page_lock(sb, group, &e4b);
1049 if (ret || !EXT4_MB_GRP_NEED_INIT(this_grp)) { 1051 if (ret || !EXT4_MB_GRP_NEED_INIT(this_grp)) {
1050 /* 1052 /*
1051 * somebody initialized the group 1053 * somebody initialized the group
1052 * return without doing anything 1054 * return without doing anything
1053 */ 1055 */
1054 goto err; 1056 goto err;
1055 } 1057 }
1056 1058
1057 page = e4b.bd_bitmap_page; 1059 page = e4b.bd_bitmap_page;
1058 ret = ext4_mb_init_cache(page, NULL); 1060 ret = ext4_mb_init_cache(page, NULL);
1059 if (ret) 1061 if (ret)
1060 goto err; 1062 goto err;
1061 if (!PageUptodate(page)) { 1063 if (!PageUptodate(page)) {
1062 ret = -EIO; 1064 ret = -EIO;
1063 goto err; 1065 goto err;
1064 } 1066 }
1065 mark_page_accessed(page);
1066 1067
1067 if (e4b.bd_buddy_page == NULL) { 1068 if (e4b.bd_buddy_page == NULL) {
1068 /* 1069 /*
1069 * If both the bitmap and buddy are in 1070 * If both the bitmap and buddy are in
1070 * the same page we don't need to force 1071 * the same page we don't need to force
1071 * init the buddy 1072 * init the buddy
1072 */ 1073 */
1073 ret = 0; 1074 ret = 0;
1074 goto err; 1075 goto err;
1075 } 1076 }
1076 /* init buddy cache */ 1077 /* init buddy cache */
1077 page = e4b.bd_buddy_page; 1078 page = e4b.bd_buddy_page;
1078 ret = ext4_mb_init_cache(page, e4b.bd_bitmap); 1079 ret = ext4_mb_init_cache(page, e4b.bd_bitmap);
1079 if (ret) 1080 if (ret)
1080 goto err; 1081 goto err;
1081 if (!PageUptodate(page)) { 1082 if (!PageUptodate(page)) {
1082 ret = -EIO; 1083 ret = -EIO;
1083 goto err; 1084 goto err;
1084 } 1085 }
1085 mark_page_accessed(page);
1086 err: 1086 err:
1087 ext4_mb_put_buddy_page_lock(&e4b); 1087 ext4_mb_put_buddy_page_lock(&e4b);
1088 return ret; 1088 return ret;
1089 } 1089 }
1090 1090
1091 /* 1091 /*
1092 * Locking note: This routine calls ext4_mb_init_cache(), which takes the 1092 * Locking note: This routine calls ext4_mb_init_cache(), which takes the
1093 * block group lock of all groups for this page; do not hold the BG lock when 1093 * block group lock of all groups for this page; do not hold the BG lock when
1094 * calling this routine! 1094 * calling this routine!
1095 */ 1095 */
1096 static noinline_for_stack int 1096 static noinline_for_stack int
1097 ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group, 1097 ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
1098 struct ext4_buddy *e4b) 1098 struct ext4_buddy *e4b)
1099 { 1099 {
1100 int blocks_per_page; 1100 int blocks_per_page;
1101 int block; 1101 int block;
1102 int pnum; 1102 int pnum;
1103 int poff; 1103 int poff;
1104 struct page *page; 1104 struct page *page;
1105 int ret; 1105 int ret;
1106 struct ext4_group_info *grp; 1106 struct ext4_group_info *grp;
1107 struct ext4_sb_info *sbi = EXT4_SB(sb); 1107 struct ext4_sb_info *sbi = EXT4_SB(sb);
1108 struct inode *inode = sbi->s_buddy_cache; 1108 struct inode *inode = sbi->s_buddy_cache;
1109 1109
1110 might_sleep(); 1110 might_sleep();
1111 mb_debug(1, "load group %u\n", group); 1111 mb_debug(1, "load group %u\n", group);
1112 1112
1113 blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize; 1113 blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
1114 grp = ext4_get_group_info(sb, group); 1114 grp = ext4_get_group_info(sb, group);
1115 1115
1116 e4b->bd_blkbits = sb->s_blocksize_bits; 1116 e4b->bd_blkbits = sb->s_blocksize_bits;
1117 e4b->bd_info = grp; 1117 e4b->bd_info = grp;
1118 e4b->bd_sb = sb; 1118 e4b->bd_sb = sb;
1119 e4b->bd_group = group; 1119 e4b->bd_group = group;
1120 e4b->bd_buddy_page = NULL; 1120 e4b->bd_buddy_page = NULL;
1121 e4b->bd_bitmap_page = NULL; 1121 e4b->bd_bitmap_page = NULL;
1122 1122
1123 if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { 1123 if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
1124 /* 1124 /*
1125 * we need full data about the group 1125 * we need full data about the group
1126 * to make a good selection 1126 * to make a good selection
1127 */ 1127 */
1128 ret = ext4_mb_init_group(sb, group); 1128 ret = ext4_mb_init_group(sb, group);
1129 if (ret) 1129 if (ret)
1130 return ret; 1130 return ret;
1131 } 1131 }
1132 1132
1133 /* 1133 /*
1134 * the buddy cache inode stores the block bitmap 1134 * the buddy cache inode stores the block bitmap
1135 * and buddy information in consecutive blocks. 1135 * and buddy information in consecutive blocks.
1136 * So for each group we need two blocks. 1136 * So for each group we need two blocks.
1137 */ 1137 */
1138 block = group * 2; 1138 block = group * 2;
1139 pnum = block / blocks_per_page; 1139 pnum = block / blocks_per_page;
1140 poff = block % blocks_per_page; 1140 poff = block % blocks_per_page;
1141 1141
1142 /* we could use find_or_create_page(), but it locks page 1142 /* we could use find_or_create_page(), but it locks page
1143 * what we'd like to avoid in fast path ... */ 1143 * what we'd like to avoid in fast path ... */
1144 page = find_get_page(inode->i_mapping, pnum);
1144 page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
1145 if (page == NULL || !PageUptodate(page)) { 1145 if (page == NULL || !PageUptodate(page)) {
1146 if (page) 1146 if (page)
1147 /* 1147 /*
1148 * drop the page reference and try 1148 * drop the page reference and try
1149 * to get the page with lock. If we 1149 * to get the page with lock. If we
1150 * are not uptodate that implies 1150 * are not uptodate that implies
1151 * somebody just created the page but 1151 * somebody just created the page but
1152 * is yet to initialize the same. So 1152 * is yet to initialize the same. So
1153 * wait for it to initialize. 1153 * wait for it to initialize.
1154 */ 1154 */
1155 page_cache_release(page); 1155 page_cache_release(page);
1156 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); 1156 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
1157 if (page) { 1157 if (page) {
1158 BUG_ON(page->mapping != inode->i_mapping); 1158 BUG_ON(page->mapping != inode->i_mapping);
1159 if (!PageUptodate(page)) { 1159 if (!PageUptodate(page)) {
1160 ret = ext4_mb_init_cache(page, NULL); 1160 ret = ext4_mb_init_cache(page, NULL);
1161 if (ret) { 1161 if (ret) {
1162 unlock_page(page); 1162 unlock_page(page);
1163 goto err; 1163 goto err;
1164 } 1164 }
1165 mb_cmp_bitmaps(e4b, page_address(page) + 1165 mb_cmp_bitmaps(e4b, page_address(page) +
1166 (poff * sb->s_blocksize)); 1166 (poff * sb->s_blocksize));
1167 } 1167 }
1168 unlock_page(page); 1168 unlock_page(page);
1169 } 1169 }
1170 } 1170 }
1171 if (page == NULL || !PageUptodate(page)) { 1171 if (page == NULL || !PageUptodate(page)) {
1172 ret = -EIO; 1172 ret = -EIO;
1173 goto err; 1173 goto err;
1174 } 1174 }
1175
1176 /* Pages marked accessed already */
1175 e4b->bd_bitmap_page = page; 1177 e4b->bd_bitmap_page = page;
1176 e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize); 1178 e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize);
1177 mark_page_accessed(page);
1178 1179
1179 block++; 1180 block++;
1180 pnum = block / blocks_per_page; 1181 pnum = block / blocks_per_page;
1181 poff = block % blocks_per_page; 1182 poff = block % blocks_per_page;
1182 1183
1183 page = find_get_page(inode->i_mapping, pnum);
1184 page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
1184 if (page == NULL || !PageUptodate(page)) { 1185 if (page == NULL || !PageUptodate(page)) {
1185 if (page) 1186 if (page)
1186 page_cache_release(page); 1187 page_cache_release(page);
1187 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); 1188 page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
1188 if (page) { 1189 if (page) {
1189 BUG_ON(page->mapping != inode->i_mapping); 1190 BUG_ON(page->mapping != inode->i_mapping);
1190 if (!PageUptodate(page)) { 1191 if (!PageUptodate(page)) {
1191 ret = ext4_mb_init_cache(page, e4b->bd_bitmap); 1192 ret = ext4_mb_init_cache(page, e4b->bd_bitmap);
1192 if (ret) { 1193 if (ret) {
1193 unlock_page(page); 1194 unlock_page(page);
1194 goto err; 1195 goto err;
1195 } 1196 }
1196 } 1197 }
1197 unlock_page(page); 1198 unlock_page(page);
1198 } 1199 }
1199 } 1200 }
1200 if (page == NULL || !PageUptodate(page)) { 1201 if (page == NULL || !PageUptodate(page)) {
1201 ret = -EIO; 1202 ret = -EIO;
1202 goto err; 1203 goto err;
1203 } 1204 }
1205
1206 /* Pages marked accessed already */
1204 e4b->bd_buddy_page = page; 1207 e4b->bd_buddy_page = page;
1205 e4b->bd_buddy = page_address(page) + (poff * sb->s_blocksize); 1208 e4b->bd_buddy = page_address(page) + (poff * sb->s_blocksize);
1206 mark_page_accessed(page);
1207 1209
1208 BUG_ON(e4b->bd_bitmap_page == NULL); 1210 BUG_ON(e4b->bd_bitmap_page == NULL);
1209 BUG_ON(e4b->bd_buddy_page == NULL); 1211 BUG_ON(e4b->bd_buddy_page == NULL);
1210 1212
1211 return 0; 1213 return 0;
1212 1214
1213 err: 1215 err:
1214 if (page) 1216 if (page)
1215 page_cache_release(page); 1217 page_cache_release(page);
1216 if (e4b->bd_bitmap_page) 1218 if (e4b->bd_bitmap_page)
1217 page_cache_release(e4b->bd_bitmap_page); 1219 page_cache_release(e4b->bd_bitmap_page);
1218 if (e4b->bd_buddy_page) 1220 if (e4b->bd_buddy_page)
1219 page_cache_release(e4b->bd_buddy_page); 1221 page_cache_release(e4b->bd_buddy_page);
1220 e4b->bd_buddy = NULL; 1222 e4b->bd_buddy = NULL;
1221 e4b->bd_bitmap = NULL; 1223 e4b->bd_bitmap = NULL;
1222 return ret; 1224 return ret;
1223 } 1225 }
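This function carries the mballoc side of the patch: the fast-path lookups now pass FGP_ACCESSED to find_get_page_flags() so the accessed state is handled inside the lookup, and the find_or_create_page() slow path is trusted to have marked a newly created page accessed already, which is what the two "Pages marked accessed already" comments and the dropped mark_page_accessed() calls reflect. As a rough userspace model of that calling-pattern change only (toy types and names, not the page-cache API):

#include <stdio.h>
#include <stdbool.h>

/* Toy model: a "page" with a referenced flag, and a lookup that can
 * optionally mark it accessed in the same step. */
struct toy_page {
	int index;
	bool referenced;
};

#define TOY_ACCESSED 0x01

static struct toy_page cache[4] = {
	{ 0, false }, { 1, false }, { 2, false }, { 3, false },
};

static struct toy_page *toy_find_get_page_flags(int index, int flags)
{
	if (index < 0 || index >= 4)
		return NULL;
	if (flags & TOY_ACCESSED)
		cache[index].referenced = true;	/* folded into the lookup */
	return &cache[index];
}

int main(void)
{
	/* old style: look the page up, then mark it accessed separately;
	 * new style: fold the accessed hint into the lookup itself */
	struct toy_page *p = toy_find_get_page_flags(1, TOY_ACCESSED);

	printf("page %d referenced: %s\n", p->index,
	       p->referenced ? "yes" : "no");
	return 0;
}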
1224 1226
1225 static void ext4_mb_unload_buddy(struct ext4_buddy *e4b) 1227 static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
1226 { 1228 {
1227 if (e4b->bd_bitmap_page) 1229 if (e4b->bd_bitmap_page)
1228 page_cache_release(e4b->bd_bitmap_page); 1230 page_cache_release(e4b->bd_bitmap_page);
1229 if (e4b->bd_buddy_page) 1231 if (e4b->bd_buddy_page)
1230 page_cache_release(e4b->bd_buddy_page); 1232 page_cache_release(e4b->bd_buddy_page);
1231 } 1233 }
1232 1234
1233 1235
1234 static int mb_find_order_for_block(struct ext4_buddy *e4b, int block) 1236 static int mb_find_order_for_block(struct ext4_buddy *e4b, int block)
1235 { 1237 {
1236 int order = 1; 1238 int order = 1;
1237 void *bb; 1239 void *bb;
1238 1240
1239 BUG_ON(e4b->bd_bitmap == e4b->bd_buddy); 1241 BUG_ON(e4b->bd_bitmap == e4b->bd_buddy);
1240 BUG_ON(block >= (1 << (e4b->bd_blkbits + 3))); 1242 BUG_ON(block >= (1 << (e4b->bd_blkbits + 3)));
1241 1243
1242 bb = e4b->bd_buddy; 1244 bb = e4b->bd_buddy;
1243 while (order <= e4b->bd_blkbits + 1) { 1245 while (order <= e4b->bd_blkbits + 1) {
1244 block = block >> 1; 1246 block = block >> 1;
1245 if (!mb_test_bit(block, bb)) { 1247 if (!mb_test_bit(block, bb)) {
1246 /* this block is part of buddy of order 'order' */ 1248 /* this block is part of buddy of order 'order' */
1247 return order; 1249 return order;
1248 } 1250 }
1249 bb += 1 << (e4b->bd_blkbits - order); 1251 bb += 1 << (e4b->bd_blkbits - order);
1250 order++; 1252 order++;
1251 } 1253 }
1252 return 0; 1254 return 0;
1253 } 1255 }
1254 1256
1255 static void mb_clear_bits(void *bm, int cur, int len) 1257 static void mb_clear_bits(void *bm, int cur, int len)
1256 { 1258 {
1257 __u32 *addr; 1259 __u32 *addr;
1258 1260
1259 len = cur + len; 1261 len = cur + len;
1260 while (cur < len) { 1262 while (cur < len) {
1261 if ((cur & 31) == 0 && (len - cur) >= 32) { 1263 if ((cur & 31) == 0 && (len - cur) >= 32) {
1262 /* fast path: clear whole word at once */ 1264 /* fast path: clear whole word at once */
1263 addr = bm + (cur >> 3); 1265 addr = bm + (cur >> 3);
1264 *addr = 0; 1266 *addr = 0;
1265 cur += 32; 1267 cur += 32;
1266 continue; 1268 continue;
1267 } 1269 }
1268 mb_clear_bit(cur, bm); 1270 mb_clear_bit(cur, bm);
1269 cur++; 1271 cur++;
1270 } 1272 }
1271 } 1273 }
1272 1274
1273 /* clear bits in given range 1275 /* clear bits in given range
1274 * will return first found zero bit if any, -1 otherwise 1276 * will return first found zero bit if any, -1 otherwise
1275 */ 1277 */
1276 static int mb_test_and_clear_bits(void *bm, int cur, int len) 1278 static int mb_test_and_clear_bits(void *bm, int cur, int len)
1277 { 1279 {
1278 __u32 *addr; 1280 __u32 *addr;
1279 int zero_bit = -1; 1281 int zero_bit = -1;
1280 1282
1281 len = cur + len; 1283 len = cur + len;
1282 while (cur < len) { 1284 while (cur < len) {
1283 if ((cur & 31) == 0 && (len - cur) >= 32) { 1285 if ((cur & 31) == 0 && (len - cur) >= 32) {
1284 /* fast path: clear whole word at once */ 1286 /* fast path: clear whole word at once */
1285 addr = bm + (cur >> 3); 1287 addr = bm + (cur >> 3);
1286 if (*addr != (__u32)(-1) && zero_bit == -1) 1288 if (*addr != (__u32)(-1) && zero_bit == -1)
1287 zero_bit = cur + mb_find_next_zero_bit(addr, 32, 0); 1289 zero_bit = cur + mb_find_next_zero_bit(addr, 32, 0);
1288 *addr = 0; 1290 *addr = 0;
1289 cur += 32; 1291 cur += 32;
1290 continue; 1292 continue;
1291 } 1293 }
1292 if (!mb_test_and_clear_bit(cur, bm) && zero_bit == -1) 1294 if (!mb_test_and_clear_bit(cur, bm) && zero_bit == -1)
1293 zero_bit = cur; 1295 zero_bit = cur;
1294 cur++; 1296 cur++;
1295 } 1297 }
1296 1298
1297 return zero_bit; 1299 return zero_bit;
1298 } 1300 }
1299 1301
1300 void ext4_set_bits(void *bm, int cur, int len) 1302 void ext4_set_bits(void *bm, int cur, int len)
1301 { 1303 {
1302 __u32 *addr; 1304 __u32 *addr;
1303 1305
1304 len = cur + len; 1306 len = cur + len;
1305 while (cur < len) { 1307 while (cur < len) {
1306 if ((cur & 31) == 0 && (len - cur) >= 32) { 1308 if ((cur & 31) == 0 && (len - cur) >= 32) {
1307 /* fast path: set whole word at once */ 1309 /* fast path: set whole word at once */
1308 addr = bm + (cur >> 3); 1310 addr = bm + (cur >> 3);
1309 *addr = 0xffffffff; 1311 *addr = 0xffffffff;
1310 cur += 32; 1312 cur += 32;
1311 continue; 1313 continue;
1312 } 1314 }
1313 mb_set_bit(cur, bm); 1315 mb_set_bit(cur, bm);
1314 cur++; 1316 cur++;
1315 } 1317 }
1316 } 1318 }
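mb_clear_bits(), mb_test_and_clear_bits() and ext4_set_bits() share one shape: when the cursor is 32-bit aligned and at least 32 bits remain, touch the whole word with a single plain store, otherwise fall back to per-bit operations for the ragged edges (the test-and-clear variant also remembers the first bit that was already clear). A standalone version of the set variant over a uint32_t array (ordinary, non-atomic stores, like the originals):

#include <stdio.h>
#include <stdint.h>

static void set_bit32(uint32_t *bm, int nr)
{
	bm[nr >> 5] |= 1u << (nr & 31);
}

static void set_bits(uint32_t *bm, int cur, int len)
{
	len = cur + len;
	while (cur < len) {
		if ((cur & 31) == 0 && (len - cur) >= 32) {
			/* fast path: set a whole word at once */
			bm[cur >> 5] = 0xffffffff;
			cur += 32;
			continue;
		}
		set_bit32(bm, cur);	/* slow path: one bit at a time */
		cur++;
	}
}

int main(void)
{
	uint32_t bm[4] = { 0, 0, 0, 0 };
	int i;

	set_bits(bm, 20, 60);	/* bits 20..79: tail of word 0, all of
				 * word 1, head of word 2 */
	for (i = 0; i < 4; i++)
		printf("word %d: %08x\n", i, bm[i]);
	return 0;
}

Here set_bits(bm, 20, 60) leaves word 0 as fff00000, word 1 as ffffffff and word 2 as 0000ffff: one partial word, one fast-path word, one partial word.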
1317 1319
1318 /* 1320 /*
1319 * _________________________________________________________________ */ 1321 * _________________________________________________________________ */
1320 1322
1321 static inline int mb_buddy_adjust_border(int* bit, void* bitmap, int side) 1323 static inline int mb_buddy_adjust_border(int* bit, void* bitmap, int side)
1322 { 1324 {
1323 if (mb_test_bit(*bit + side, bitmap)) { 1325 if (mb_test_bit(*bit + side, bitmap)) {
1324 mb_clear_bit(*bit, bitmap); 1326 mb_clear_bit(*bit, bitmap);
1325 (*bit) -= side; 1327 (*bit) -= side;
1326 return 1; 1328 return 1;
1327 } 1329 }
1328 else { 1330 else {
1329 (*bit) += side; 1331 (*bit) += side;
1330 mb_set_bit(*bit, bitmap); 1332 mb_set_bit(*bit, bitmap);
1331 return -1; 1333 return -1;
1332 } 1334 }
1333 } 1335 }

static void mb_buddy_mark_free(struct ext4_buddy *e4b, int first, int last)
{
    int max;
    int order = 1;
    void *buddy = mb_find_buddy(e4b, order, &max);

    while (buddy) {
        void *buddy2;

        /* Bits in range [first; last] are known to be set since
         * corresponding blocks were allocated. Bits in range
         * (first; last) will stay set because they form buddies on
         * upper layer. We just deal with borders if they don't
         * align with upper layer and then go up.
         * Releasing entire group is all about clearing
         * single bit of highest order buddy.
         */

        /* Example:
         * ---------------------------------
         * |   1   |   1   |   1   |   1   |
         * ---------------------------------
         * | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
         * ---------------------------------
         *   0   1   2   3   4   5   6   7
         *      \_____________________/
         *
         * Neither [1] nor [6] is aligned to above layer.
         * Left neighbour [0] is free, so mark it busy,
         * decrease bb_counters and extend range to
         * [0; 6]
         * Right neighbour [7] is busy. It can't be coalesced with [6], so
         * mark [6] free, increase bb_counters and shrink range to
         * [0; 5].
         * Then shift range to [0; 2], go up and do the same.
         */


        if (first & 1)
            e4b->bd_info->bb_counters[order] += mb_buddy_adjust_border(&first, buddy, -1);
        if (!(last & 1))
            e4b->bd_info->bb_counters[order] += mb_buddy_adjust_border(&last, buddy, 1);
        if (first > last)
            break;
        order++;

        if (first == last || !(buddy2 = mb_find_buddy(e4b, order, &max))) {
            mb_clear_bits(buddy, first, last - first + 1);
            e4b->bd_info->bb_counters[order - 1] += last - first + 1;
            break;
        }
        first >>= 1;
        last >>= 1;
        buddy = buddy2;
    }
}

static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
               int first, int count)
{
    int left_is_free = 0;
    int right_is_free = 0;
    int block;
    int last = first + count - 1;
    struct super_block *sb = e4b->bd_sb;

    if (WARN_ON(count == 0))
        return;
    BUG_ON(last >= (sb->s_blocksize << 3));
    assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group));
    /* Don't bother if the block group is corrupt. */
    if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)))
        return;

    mb_check_buddy(e4b);
    mb_free_blocks_double(inode, e4b, first, count);

    e4b->bd_info->bb_free += count;
    if (first < e4b->bd_info->bb_first_free)
        e4b->bd_info->bb_first_free = first;

    /* access memory sequentially: check left neighbour,
     * clear range and then check right neighbour
     */
    if (first != 0)
        left_is_free = !mb_test_bit(first - 1, e4b->bd_bitmap);
    block = mb_test_and_clear_bits(e4b->bd_bitmap, first, count);
    if (last + 1 < EXT4_SB(sb)->s_mb_maxs[0])
        right_is_free = !mb_test_bit(last + 1, e4b->bd_bitmap);

    if (unlikely(block != -1)) {
        ext4_fsblk_t blocknr;

        blocknr = ext4_group_first_block_no(sb, e4b->bd_group);
        blocknr += EXT4_C2B(EXT4_SB(sb), block);
        ext4_grp_locked_error(sb, e4b->bd_group,
                      inode ? inode->i_ino : 0,
                      blocknr,
                      "freeing already freed block "
                      "(bit %u); block bitmap corrupt.",
                      block);
        /* Mark the block group as corrupt. */
        set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT,
            &e4b->bd_info->bb_state);
        mb_regenerate_buddy(e4b);
        goto done;
    }

    /* let's maintain fragments counter */
    if (left_is_free && right_is_free)
        e4b->bd_info->bb_fragments--;
    else if (!left_is_free && !right_is_free)
        e4b->bd_info->bb_fragments++;

    /* buddy[0] == bd_bitmap is a special case, so handle
     * it right away and let mb_buddy_mark_free stay free of
     * zero order checks.
     * Check if neighbours are to be coalesced,
     * adjust bitmap bb_counters and borders appropriately.
     */
    if (first & 1) {
        first += !left_is_free;
        e4b->bd_info->bb_counters[0] += left_is_free ? -1 : 1;
    }
    if (!(last & 1)) {
        last -= !right_is_free;
        e4b->bd_info->bb_counters[0] += right_is_free ? -1 : 1;
    }

    if (first <= last)
        mb_buddy_mark_free(e4b, first >> 1, last >> 1);

done:
    mb_set_largest_free_order(sb, e4b->bd_info);
    mb_check_buddy(e4b);
}

static int mb_find_extent(struct ext4_buddy *e4b, int block,
              int needed, struct ext4_free_extent *ex)
{
    int next = block;
    int max, order;
    void *buddy;

    assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
    BUG_ON(ex == NULL);

    buddy = mb_find_buddy(e4b, 0, &max);
    BUG_ON(buddy == NULL);
    BUG_ON(block >= max);
    if (mb_test_bit(block, buddy)) {
        ex->fe_len = 0;
        ex->fe_start = 0;
        ex->fe_group = 0;
        return 0;
    }

    /* find actual order */
    order = mb_find_order_for_block(e4b, block);
    block = block >> order;

    ex->fe_len = 1 << order;
    ex->fe_start = block << order;
    ex->fe_group = e4b->bd_group;

    /* calc difference from given start */
    next = next - ex->fe_start;
    ex->fe_len -= next;
    ex->fe_start += next;

    while (needed > ex->fe_len &&
           mb_find_buddy(e4b, order, &max)) {

        if (block + 1 >= max)
            break;

        next = (block + 1) * (1 << order);
        if (mb_test_bit(next, e4b->bd_bitmap))
            break;

        order = mb_find_order_for_block(e4b, next);

        block = next >> order;
        ex->fe_len += 1 << order;
    }

    BUG_ON(ex->fe_start + ex->fe_len > (1 << (e4b->bd_blkbits + 3)));
    return ex->fe_len;
}

static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
{
    int ord;
    int mlen = 0;
    int max = 0;
    int cur;
    int start = ex->fe_start;
    int len = ex->fe_len;
    unsigned ret = 0;
    int len0 = len;
    void *buddy;

    BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));
    BUG_ON(e4b->bd_group != ex->fe_group);
    assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group));
    mb_check_buddy(e4b);
    mb_mark_used_double(e4b, start, len);

    e4b->bd_info->bb_free -= len;
    if (e4b->bd_info->bb_first_free == start)
        e4b->bd_info->bb_first_free += len;

    /* let's maintain fragments counter */
    if (start != 0)
        mlen = !mb_test_bit(start - 1, e4b->bd_bitmap);
    if (start + len < EXT4_SB(e4b->bd_sb)->s_mb_maxs[0])
        max = !mb_test_bit(start + len, e4b->bd_bitmap);
    if (mlen && max)
        e4b->bd_info->bb_fragments++;
    else if (!mlen && !max)
        e4b->bd_info->bb_fragments--;

    /* let's maintain buddy itself */
    while (len) {
        ord = mb_find_order_for_block(e4b, start);

        if (((start >> ord) << ord) == start && len >= (1 << ord)) {
            /* the whole chunk may be allocated at once! */
            mlen = 1 << ord;
            buddy = mb_find_buddy(e4b, ord, &max);
            BUG_ON((start >> ord) >= max);
            mb_set_bit(start >> ord, buddy);
            e4b->bd_info->bb_counters[ord]--;
            start += mlen;
            len -= mlen;
            BUG_ON(len < 0);
            continue;
        }

        /* store for history */
        if (ret == 0)
            ret = len | (ord << 16);

        /* we have to split large buddy */
        BUG_ON(ord <= 0);
        buddy = mb_find_buddy(e4b, ord, &max);
        mb_set_bit(start >> ord, buddy);
        e4b->bd_info->bb_counters[ord]--;

        ord--;
        cur = (start >> ord) & ~1U;
        buddy = mb_find_buddy(e4b, ord, &max);
        mb_clear_bit(cur, buddy);
        mb_clear_bit(cur + 1, buddy);
        e4b->bd_info->bb_counters[ord]++;
        e4b->bd_info->bb_counters[ord]++;
    }
    mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);

    ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
    mb_check_buddy(e4b);

    return ret;
}

/*
 * Must be called under group lock!
 */
static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
                    struct ext4_buddy *e4b)
{
    struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
    int ret;

    BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
    BUG_ON(ac->ac_status == AC_STATUS_FOUND);

    ac->ac_b_ex.fe_len = min(ac->ac_b_ex.fe_len, ac->ac_g_ex.fe_len);
    ac->ac_b_ex.fe_logical = ac->ac_g_ex.fe_logical;
    ret = mb_mark_used(e4b, &ac->ac_b_ex);

    /* preallocation can change ac_b_ex, thus we store actually
     * allocated blocks for history */
    ac->ac_f_ex = ac->ac_b_ex;

    ac->ac_status = AC_STATUS_FOUND;
    ac->ac_tail = ret & 0xffff;
    ac->ac_buddy = ret >> 16;

    /*
     * take the page reference. We want the page to be pinned
     * so that we don't get an ext4_mb_init_cache() call for this
     * group until we update the bitmap. That would mean we
     * double allocate blocks. The reference is dropped
     * in ext4_mb_release_context
     */
    ac->ac_bitmap_page = e4b->bd_bitmap_page;
    get_page(ac->ac_bitmap_page);
    ac->ac_buddy_page = e4b->bd_buddy_page;
    get_page(ac->ac_buddy_page);
    /* store last allocated for subsequent stream allocation */
    if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
        spin_lock(&sbi->s_md_lock);
        sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
        sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
        spin_unlock(&sbi->s_md_lock);
    }
}

/*
 * regular allocator, for general purposes allocation
 */

static void ext4_mb_check_limits(struct ext4_allocation_context *ac,
                    struct ext4_buddy *e4b,
                    int finish_group)
{
    struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
    struct ext4_free_extent *bex = &ac->ac_b_ex;
    struct ext4_free_extent *gex = &ac->ac_g_ex;
    struct ext4_free_extent ex;
    int max;

    if (ac->ac_status == AC_STATUS_FOUND)
        return;
    /*
     * We don't want to scan for a whole year
     */
    if (ac->ac_found > sbi->s_mb_max_to_scan &&
        !(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
        ac->ac_status = AC_STATUS_BREAK;
        return;
    }

    /*
     * Haven't found good chunk so far, let's continue
     */
    if (bex->fe_len < gex->fe_len)
        return;

    if ((finish_group || ac->ac_found > sbi->s_mb_min_to_scan)
        && bex->fe_group == e4b->bd_group) {
        /* recheck chunk's availability - we don't know
         * when it was found (within this lock-unlock
         * period or not) */
        max = mb_find_extent(e4b, bex->fe_start, gex->fe_len, &ex);
        if (max >= gex->fe_len) {
            ext4_mb_use_best_found(ac, e4b);
            return;
        }
    }
}

/*
 * The routine checks whether the found extent is good enough. If it is,
 * then the extent gets marked used and flag is set to the context
 * to stop scanning. Otherwise, the extent is compared with the
 * previously found extent and if the new one is better, then it's stored
 * in the context. Later, the best found extent will be used, if
 * mballoc can't find good enough extent.
 *
 * FIXME: real allocation policy is to be designed yet!
 */
static void ext4_mb_measure_extent(struct ext4_allocation_context *ac,
                    struct ext4_free_extent *ex,
                    struct ext4_buddy *e4b)
{
    struct ext4_free_extent *bex = &ac->ac_b_ex;
    struct ext4_free_extent *gex = &ac->ac_g_ex;

    BUG_ON(ex->fe_len <= 0);
    BUG_ON(ex->fe_len > EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
    BUG_ON(ex->fe_start >= EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
    BUG_ON(ac->ac_status != AC_STATUS_CONTINUE);

    ac->ac_found++;

    /*
     * The special case - take what you catch first
     */
    if (unlikely(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
        *bex = *ex;
        ext4_mb_use_best_found(ac, e4b);
        return;
    }

    /*
     * Let's check whether the chunk is good enough
     */
    if (ex->fe_len == gex->fe_len) {
        *bex = *ex;
        ext4_mb_use_best_found(ac, e4b);
        return;
    }

    /*
     * If this is first found extent, just store it in the context
     */
    if (bex->fe_len == 0) {
        *bex = *ex;
        return;
    }

    /*
     * If new found extent is better, store it in the context
     */
    if (bex->fe_len < gex->fe_len) {
        /* if the request isn't satisfied, any found extent
         * larger than previous best one is better */
        if (ex->fe_len > bex->fe_len)
            *bex = *ex;
    } else if (ex->fe_len > gex->fe_len) {
        /* if the request is satisfied, then we try to find
         * an extent that still satisfies the request, but is
         * smaller than previous one */
        if (ex->fe_len < bex->fe_len)
            *bex = *ex;
    }

    ext4_mb_check_limits(ac, e4b, 0);
}

static noinline_for_stack
int ext4_mb_try_best_found(struct ext4_allocation_context *ac,
                    struct ext4_buddy *e4b)
{
    struct ext4_free_extent ex = ac->ac_b_ex;
    ext4_group_t group = ex.fe_group;
    int max;
    int err;

    BUG_ON(ex.fe_len <= 0);
    err = ext4_mb_load_buddy(ac->ac_sb, group, e4b);
    if (err)
        return err;

    ext4_lock_group(ac->ac_sb, group);
    max = mb_find_extent(e4b, ex.fe_start, ex.fe_len, &ex);

    if (max > 0) {
        ac->ac_b_ex = ex;
        ext4_mb_use_best_found(ac, e4b);
    }

    ext4_unlock_group(ac->ac_sb, group);
    ext4_mb_unload_buddy(e4b);

    return 0;
}

static noinline_for_stack
int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
                struct ext4_buddy *e4b)
{
    ext4_group_t group = ac->ac_g_ex.fe_group;
    int max;
    int err;
    struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
    struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
    struct ext4_free_extent ex;

    if (!(ac->ac_flags & EXT4_MB_HINT_TRY_GOAL))
        return 0;
    if (grp->bb_free == 0)
        return 0;

    err = ext4_mb_load_buddy(ac->ac_sb, group, e4b);
    if (err)
        return err;

    if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) {
        ext4_mb_unload_buddy(e4b);
        return 0;
    }

    ext4_lock_group(ac->ac_sb, group);
    max = mb_find_extent(e4b, ac->ac_g_ex.fe_start,
                 ac->ac_g_ex.fe_len, &ex);

    if (max >= ac->ac_g_ex.fe_len && ac->ac_g_ex.fe_len == sbi->s_stripe) {
        ext4_fsblk_t start;

        start = ext4_group_first_block_no(ac->ac_sb, e4b->bd_group) +
            ex.fe_start;
        /* use do_div to get remainder (would be 64-bit modulo) */
        if (do_div(start, sbi->s_stripe) == 0) {
            ac->ac_found++;
            ac->ac_b_ex = ex;
            ext4_mb_use_best_found(ac, e4b);
        }
    } else if (max >= ac->ac_g_ex.fe_len) {
        BUG_ON(ex.fe_len <= 0);
        BUG_ON(ex.fe_group != ac->ac_g_ex.fe_group);
        BUG_ON(ex.fe_start != ac->ac_g_ex.fe_start);
        ac->ac_found++;
        ac->ac_b_ex = ex;
        ext4_mb_use_best_found(ac, e4b);
    } else if (max > 0 && (ac->ac_flags & EXT4_MB_HINT_MERGE)) {
        /* Sometimes, caller may want to merge even small
         * number of blocks to an existing extent */
        BUG_ON(ex.fe_len <= 0);
        BUG_ON(ex.fe_group != ac->ac_g_ex.fe_group);
        BUG_ON(ex.fe_start != ac->ac_g_ex.fe_start);
        ac->ac_found++;
        ac->ac_b_ex = ex;
        ext4_mb_use_best_found(ac, e4b);
    }
    ext4_unlock_group(ac->ac_sb, group);
    ext4_mb_unload_buddy(e4b);

    return 0;
}

/*
 * The routine scans buddy structures (not bitmap!) from given order
 * to max order and tries to find big enough chunk to satisfy the req
 */
static noinline_for_stack
void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac,
                    struct ext4_buddy *e4b)
{
    struct super_block *sb = ac->ac_sb;
    struct ext4_group_info *grp = e4b->bd_info;
    void *buddy;
    int i;
    int k;
    int max;

    BUG_ON(ac->ac_2order <= 0);
    for (i = ac->ac_2order; i <= sb->s_blocksize_bits + 1; i++) {
        if (grp->bb_counters[i] == 0)
            continue;

        buddy = mb_find_buddy(e4b, i, &max);
        BUG_ON(buddy == NULL);

        k = mb_find_next_zero_bit(buddy, max, 0);
        BUG_ON(k >= max);

        ac->ac_found++;

        ac->ac_b_ex.fe_len = 1 << i;
        ac->ac_b_ex.fe_start = k << i;
        ac->ac_b_ex.fe_group = e4b->bd_group;

        ext4_mb_use_best_found(ac, e4b);

        BUG_ON(ac->ac_b_ex.fe_len != ac->ac_g_ex.fe_len);

        if (EXT4_SB(sb)->s_mb_stats)
            atomic_inc(&EXT4_SB(sb)->s_bal_2orders);

        break;
    }
}

/*
 * The routine scans the group and measures all found extents.
 * In order to optimize scanning, caller must pass number of
 * free blocks in the group, so the routine can know upper limit.
 */
static noinline_for_stack
void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
                    struct ext4_buddy *e4b)
{
    struct super_block *sb = ac->ac_sb;
    void *bitmap = e4b->bd_bitmap;
    struct ext4_free_extent ex;
    int i;
    int free;

    free = e4b->bd_info->bb_free;
    BUG_ON(free <= 0);

    i = e4b->bd_info->bb_first_free;

    while (free && ac->ac_status == AC_STATUS_CONTINUE) {
        i = mb_find_next_zero_bit(bitmap,
                        EXT4_CLUSTERS_PER_GROUP(sb), i);
        if (i >= EXT4_CLUSTERS_PER_GROUP(sb)) {
            /*
             * If we have a corrupt bitmap, we won't find any
             * free blocks even though group info says we
             * have free blocks
             */
            ext4_grp_locked_error(sb, e4b->bd_group, 0, 0,
                        "%d free clusters as per "
                        "group info. But bitmap says 0",
                        free);
            break;
        }

        mb_find_extent(e4b, i, ac->ac_g_ex.fe_len, &ex);
        BUG_ON(ex.fe_len <= 0);
        if (free < ex.fe_len) {
            ext4_grp_locked_error(sb, e4b->bd_group, 0, 0,
                        "%d free clusters as per "
                        "group info. But got %d blocks",
                        free, ex.fe_len);
            /*
             * The number of free blocks differs. This mostly
             * indicates that the bitmap is corrupt. So exit
             * without claiming the space.
             */
            break;
        }

        ext4_mb_measure_extent(ac, &ex, e4b);

        i += ex.fe_len;
        free -= ex.fe_len;
    }

    ext4_mb_check_limits(ac, e4b, 1);
}

/*
 * This is a special case for storages like raid5
 * we try to find stripe-aligned chunks for stripe-size-multiple requests
 */
static noinline_for_stack
void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
                    struct ext4_buddy *e4b)
{
    struct super_block *sb = ac->ac_sb;
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    void *bitmap = e4b->bd_bitmap;
    struct ext4_free_extent ex;
    ext4_fsblk_t first_group_block;
    ext4_fsblk_t a;
    ext4_grpblk_t i;
    int max;

    BUG_ON(sbi->s_stripe == 0);

    /* find first stripe-aligned block in group */
    first_group_block = ext4_group_first_block_no(sb, e4b->bd_group);

    a = first_group_block + sbi->s_stripe - 1;
    do_div(a, sbi->s_stripe);
    i = (a * sbi->s_stripe) - first_group_block;

    while (i < EXT4_CLUSTERS_PER_GROUP(sb)) {
        if (!mb_test_bit(i, bitmap)) {
            max = mb_find_extent(e4b, i, sbi->s_stripe, &ex);
            if (max >= sbi->s_stripe) {
                ac->ac_found++;
                ac->ac_b_ex = ex;
                ext4_mb_use_best_found(ac, e4b);
                break;
            }
        }
        i += sbi->s_stripe;
    }
}
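
For illustration only, the round-up arithmetic ext4_mb_scan_aligned() uses to locate the first stripe-aligned block in a group, shown as a standalone sketch with plain 64-bit division in place of the kernel's do_div(); the stripe size and group start used here are made-up values.

/* Illustrative sketch (not kernel code): round the group's first block up to
 * the next stripe boundary and express it as an offset within the group. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t first_group_block = 32768 + 5;  /* assumed group start (not aligned) */
    uint64_t stripe = 16;                    /* assumed stripe size in clusters */

    uint64_t a = (first_group_block + stripe - 1) / stripe; /* round up */
    uint64_t i = a * stripe - first_group_block;            /* offset in group */

    printf("first stripe-aligned cluster is at group offset %llu\n",
           (unsigned long long)i);
    /* The scan then probes offsets i, i + stripe, i + 2*stripe, ... */
    return 0;
}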

/* This is now called BEFORE we load the buddy bitmap. */
static int ext4_mb_good_group(struct ext4_allocation_context *ac,
                ext4_group_t group, int cr)
{
    unsigned free, fragments;
    int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb));
    struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);

    BUG_ON(cr < 0 || cr >= 4);

    free = grp->bb_free;
    if (free == 0)
        return 0;
    if (cr <= 2 && free < ac->ac_g_ex.fe_len)
        return 0;

    if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
        return 0;

    /* We only do this if the grp has never been initialized */
    if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
        int ret = ext4_mb_init_group(ac->ac_sb, group);
        if (ret)
            return 0;
    }

    fragments = grp->bb_fragments;
    if (fragments == 0)
        return 0;

    switch (cr) {
    case 0:
        BUG_ON(ac->ac_2order == 0);

        /* Avoid using the first bg of a flexgroup for data files */
        if ((ac->ac_flags & EXT4_MB_HINT_DATA) &&
            (flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) &&
            ((group % flex_size) == 0))
            return 0;

        if ((ac->ac_2order > ac->ac_sb->s_blocksize_bits+1) ||
            (free / fragments) >= ac->ac_g_ex.fe_len)
            return 1;

        if (grp->bb_largest_free_order < ac->ac_2order)
            return 0;

        return 1;
    case 1:
        if ((free / fragments) >= ac->ac_g_ex.fe_len)
            return 1;
        break;
    case 2:
        if (free >= ac->ac_g_ex.fe_len)
            return 1;
        break;
    case 3:
        return 1;
    default:
        BUG();
    }

    return 0;
}
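
For illustration only, a simplified standalone mirror of the cr-level checks above: it ignores the flex_bg and lazy-init details and just shows what each pass demands of a group before it is worth scanning. The demo_good_group() name and the sample numbers are assumptions, not the kernel function.

/* Illustrative sketch (not kernel code): cr == 0 wants a big enough buddy
 * order or a large average fragment, cr == 1 wants the average free extent
 * to cover the request, cr == 2 only needs enough total free space, and
 * cr == 3 takes any group with free space. */
#include <stdio.h>

static int demo_good_group(int cr, unsigned free, unsigned fragments,
                           unsigned goal_len, int largest_free_order,
                           int two_order)
{
    if (free == 0 || fragments == 0)
        return 0;
    if (cr <= 2 && free < goal_len)
        return 0;

    switch (cr) {
    case 0: /* need a free buddy chunk of order two_order (or large fragments) */
        return largest_free_order >= two_order ||
               free / fragments >= goal_len;
    case 1: /* average free-extent size must cover the request */
        return free / fragments >= goal_len;
    case 2: /* total free space must cover the request */
        return free >= goal_len;
    default: /* cr == 3 */
        return 1;
    }
}

int main(void)
{
    /* assumed numbers: 512 free clusters in 8 fragments, goal of 128 clusters */
    for (int cr = 0; cr < 4; cr++)
        printf("cr=%d -> %s\n", cr,
               demo_good_group(cr, 512, 8, 128, 6, 7) ? "scan" : "skip");
    return 0;
}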

static noinline_for_stack int
ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
{
    ext4_group_t ngroups, group, i;
    int cr;
    int err = 0;
    struct ext4_sb_info *sbi;
    struct super_block *sb;
    struct ext4_buddy e4b;

    sb = ac->ac_sb;
    sbi = EXT4_SB(sb);
    ngroups = ext4_get_groups_count(sb);
    /* non-extent files are limited to low blocks/groups */
    if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)))
        ngroups = sbi->s_blockfile_groups;

    BUG_ON(ac->ac_status == AC_STATUS_FOUND);

    /* first, try the goal */
    err = ext4_mb_find_by_goal(ac, &e4b);
    if (err || ac->ac_status == AC_STATUS_FOUND)
        goto out;

    if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
        goto out;

    /*
     * ac->ac_2order is set only if the fe_len is a power of 2;
     * if ac_2order is set we also set criteria to 0 so that we
     * try exact allocation using buddy.
     */
    i = fls(ac->ac_g_ex.fe_len);
    ac->ac_2order = 0;
    /*
     * We search using buddy data only if the order of the request
     * is greater than or equal to sbi->s_mb_order2_reqs.
     * You can tune it via /sys/fs/ext4/<partition>/mb_order2_req
     */
    if (i >= sbi->s_mb_order2_reqs) {
        /*
         * This should tell if fe_len is exactly a power of 2
         */
        if ((ac->ac_g_ex.fe_len & (~(1 << (i - 1)))) == 0)
            ac->ac_2order = i - 1;
    }

    /* if stream allocation is enabled, use global goal */
    if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
        /* TBD: may be hot point */
        spin_lock(&sbi->s_md_lock);
        ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
        ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
        spin_unlock(&sbi->s_md_lock);
    }

    /* Let's just scan groups to find more or less suitable blocks */
    cr = ac->ac_2order ? 0 : 1;
    /*
     * cr == 0 try to get exact allocation,
     * cr == 3 try to get anything
     */
repeat:
    for (; cr < 4 && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
        ac->ac_criteria = cr;
        /*
         * searching for the right group start
         * from the goal value specified
         */
        group = ac->ac_g_ex.fe_group;

        for (i = 0; i < ngroups; group++, i++) {
            cond_resched();
            /*
             * Artificially restricted ngroups for non-extent
             * files makes group > ngroups possible on first loop.
             */
            if (group >= ngroups)
                group = 0;

            /* This now checks without needing the buddy page */
            if (!ext4_mb_good_group(ac, group, cr))
                continue;

            err = ext4_mb_load_buddy(sb, group, &e4b);
            if (err)
                goto out;

            ext4_lock_group(sb, group);

            /*
             * We need to check again after locking the
             * block group
             */
            if (!ext4_mb_good_group(ac, group, cr)) {
                ext4_unlock_group(sb, group);
                ext4_mb_unload_buddy(&e4b);
                continue;
            }

            ac->ac_groups_scanned++;
            if (cr == 0 && ac->ac_2order < sb->s_blocksize_bits+2)
                ext4_mb_simple_scan_group(ac, &e4b);
            else if (cr == 1 && sbi->s_stripe &&
                    !(ac->ac_g_ex.fe_len % sbi->s_stripe))
                ext4_mb_scan_aligned(ac, &e4b);
            else
                ext4_mb_complex_scan_group(ac, &e4b);

            ext4_unlock_group(sb, group);
            ext4_mb_unload_buddy(&e4b);

            if (ac->ac_status != AC_STATUS_CONTINUE)
                break;
        }
    }

    if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND &&
        !(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
        /*
         * We've been searching too long. Let's try to allocate
         * the best chunk we've found so far
         */

        ext4_mb_try_best_found(ac, &e4b);
        if (ac->ac_status != AC_STATUS_FOUND) {
            /*
             * Someone more lucky has already allocated it.
             * The only thing we can do is just take first
             * found block(s)
            printk(KERN_DEBUG "EXT4-fs: someone won our chunk\n");
             */
            ac->ac_b_ex.fe_group = 0;
            ac->ac_b_ex.fe_start = 0;
            ac->ac_b_ex.fe_len = 0;
            ac->ac_status = AC_STATUS_CONTINUE;
            ac->ac_flags |= EXT4_MB_HINT_FIRST;
            cr = 3;
            atomic_inc(&sbi->s_mb_lost_chunks);
            goto repeat;
        }
    }
out:
    return err;
}
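
For illustration only, how the fls() plus mask test in ext4_mb_regular_allocator() classifies a request length as an exact power of two (which enables the cr == 0 buddy scan). This is a standalone sketch; demo_fls() is a portable stand-in for the kernel's fls(), and the sample lengths are made up.

/* Illustrative sketch (not kernel code): i = fls(len) gives the 1-based index
 * of the highest set bit; clearing that bit and checking for zero tells us
 * whether len was exactly 2^(i-1). */
#include <stdio.h>

static int demo_fls(unsigned int x)   /* highest set bit, 1-based; 0 if x == 0 */
{
    int i = 0;
    while (x) {
        i++;
        x >>= 1;
    }
    return i;
}

int main(void)
{
    unsigned int lens[] = { 1, 6, 8, 24, 64, 100 };

    for (int k = 0; k < 6; k++) {
        unsigned int len = lens[k];
        int i = demo_fls(len);
        int pow2 = (len & ~(1U << (i - 1))) == 0;   /* same test as above */
        printf("fe_len=%3u fls=%d power-of-two=%s -> ac_2order=%d\n",
               len, i, pow2 ? "yes" : "no", pow2 ? i - 1 : 0);
    }
    return 0;
}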

static void *ext4_mb_seq_groups_start(struct seq_file *seq, loff_t *pos)
{
    struct super_block *sb = seq->private;
    ext4_group_t group;

    if (*pos < 0 || *pos >= ext4_get_groups_count(sb))
        return NULL;
    group = *pos + 1;
    return (void *) ((unsigned long) group);
}

static void *ext4_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos)
{
    struct super_block *sb = seq->private;
    ext4_group_t group;

    ++*pos;
    if (*pos < 0 || *pos >= ext4_get_groups_count(sb))
        return NULL;
    group = *pos + 1;
    return (void *) ((unsigned long) group);
}

static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v)
{
    struct super_block *sb = seq->private;
    ext4_group_t group = (ext4_group_t) ((unsigned long) v);
    int i;
    int err, buddy_loaded = 0;
    struct ext4_buddy e4b;
    struct ext4_group_info *grinfo;
    struct sg {
        struct ext4_group_info info;
        ext4_grpblk_t counters[16];
    } sg;

    group--;
    if (group == 0)
        seq_printf(seq, "#%-5s: %-5s %-5s %-5s "
                "[ %-5s %-5s %-5s %-5s %-5s %-5s %-5s "
                "%-5s %-5s %-5s %-5s %-5s %-5s %-5s ]\n",
                "group", "free", "frags", "first",
                "2^0", "2^1", "2^2", "2^3", "2^4", "2^5", "2^6",
                "2^7", "2^8", "2^9", "2^10", "2^11", "2^12", "2^13");

    i = (sb->s_blocksize_bits + 2) * sizeof(sg.info.bb_counters[0]) +
        sizeof(struct ext4_group_info);
    grinfo = ext4_get_group_info(sb, group);
    /* Load the group info in memory only if not already loaded. */
    if (unlikely(EXT4_MB_GRP_NEED_INIT(grinfo))) {
        err = ext4_mb_load_buddy(sb, group, &e4b);
        if (err) {
            seq_printf(seq, "#%-5u: I/O error\n", group);
            return 0;
        }
        buddy_loaded = 1;
    }

    memcpy(&sg, ext4_get_group_info(sb, group), i);

    if (buddy_loaded)
        ext4_mb_unload_buddy(&e4b);

    seq_printf(seq, "#%-5u: %-5u %-5u %-5u [", group, sg.info.bb_free,
            sg.info.bb_fragments, sg.info.bb_first_free);
    for (i = 0; i <= 13; i++)
        seq_printf(seq, " %-5u", i <= sb->s_blocksize_bits + 1 ?
                sg.info.bb_counters[i] : 0);
    seq_printf(seq, " ]\n");

    return 0;
}

static void ext4_mb_seq_groups_stop(struct seq_file *seq, void *v)
{
}

static const struct seq_operations ext4_mb_seq_groups_ops = {
    .start = ext4_mb_seq_groups_start,
    .next = ext4_mb_seq_groups_next,
    .stop = ext4_mb_seq_groups_stop,
    .show = ext4_mb_seq_groups_show,
};

static int ext4_mb_seq_groups_open(struct inode *inode, struct file *file)
{
    struct super_block *sb = PDE_DATA(inode);
    int rc;

    rc = seq_open(file, &ext4_mb_seq_groups_ops);
    if (rc == 0) {
        struct seq_file *m = file->private_data;
        m->private = sb;
    }
    return rc;

}

static const struct file_operations ext4_mb_seq_groups_fops = {
    .owner = THIS_MODULE,
    .open = ext4_mb_seq_groups_open,
    .read = seq_read,
    .llseek = seq_lseek,
    .release = seq_release,
};
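
For illustration only, a usage sketch of the seq_file interface defined above: it backs the per-filesystem mb_groups file, which on typical systems is exposed as /proc/fs/ext4/<device>/mb_groups. The exact path and the device name used below are assumptions for the demo.

/* Illustrative sketch (not kernel code): dump the mb_groups statistics that
 * ext4_mb_seq_groups_show() formats - one header line, then per block group
 * the free clusters, fragments, first free cluster and 2^0..2^13 counters. */
#include <stdio.h>

int main(void)
{
    char line[512];
    FILE *f = fopen("/proc/fs/ext4/sda1/mb_groups", "r");   /* assumed device */

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}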

static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
{
    int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
    struct kmem_cache *cachep = ext4_groupinfo_caches[cache_index];

    BUG_ON(!cachep);
    return cachep;
}

/*
 * Allocate the top-level s_group_info array for the specified number
 * of groups
 */
int ext4_mb_alloc_groupinfo(struct super_block *sb, ext4_group_t ngroups)
{
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    unsigned size;
    struct ext4_group_info ***new_groupinfo;

    size = (ngroups + EXT4_DESC_PER_BLOCK(sb) - 1) >>
        EXT4_DESC_PER_BLOCK_BITS(sb);
    if (size <= sbi->s_group_info_size)
        return 0;

    size = roundup_pow_of_two(sizeof(*sbi->s_group_info) * size);
    new_groupinfo = ext4_kvzalloc(size, GFP_KERNEL);
    if (!new_groupinfo) {
        ext4_msg(sb, KERN_ERR, "can't allocate buddy meta group");
        return -ENOMEM;
    }
    if (sbi->s_group_info) {
        memcpy(new_groupinfo, sbi->s_group_info,
            sbi->s_group_info_size * sizeof(*sbi->s_group_info));
        ext4_kvfree(sbi->s_group_info);
    }
    sbi->s_group_info = new_groupinfo;
    sbi->s_group_info_size = size / sizeof(*sbi->s_group_info);
    ext4_debug("allocated s_groupinfo array for %d meta_bg's\n",
           sbi->s_group_info_size);
    return 0;
}

/* Create and initialize ext4_group_info data for the given group. */
int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
            struct ext4_group_desc *desc)
{
    int i;
    int metalen = 0;
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    struct ext4_group_info **meta_group_info;
    struct kmem_cache *cachep = get_groupinfo_cache(sb->s_blocksize_bits);

    /*
     * First check if this group is the first of a reserved block.
     * If it's true, we have to allocate a new table of pointers
     * to ext4_group_info structures
     */
    if (group % EXT4_DESC_PER_BLOCK(sb) == 0) {
        metalen = sizeof(*meta_group_info) <<
            EXT4_DESC_PER_BLOCK_BITS(sb);
        meta_group_info = kmalloc(metalen, GFP_KERNEL);
        if (meta_group_info == NULL) {
            ext4_msg(sb, KERN_ERR, "can't allocate mem "
                 "for a buddy group");
            goto exit_meta_group_info;
        }
        sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)] =
            meta_group_info;
    }

    meta_group_info =
        sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)];
    i = group & (EXT4_DESC_PER_BLOCK(sb) - 1);

    meta_group_info[i] = kmem_cache_zalloc(cachep, GFP_KERNEL);
    if (meta_group_info[i] == NULL) {
        ext4_msg(sb, KERN_ERR, "can't allocate buddy mem");
        goto exit_group_info;
    }
    set_bit(EXT4_GROUP_INFO_NEED_INIT_BIT,
        &(meta_group_info[i]->bb_state));

    /*
     * initialize bb_free to be able to skip
     * empty groups without initialization
     */
    if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
        meta_group_info[i]->bb_free =
            ext4_free_clusters_after_init(sb, group, desc);
    } else {
        meta_group_info[i]->bb_free =
            ext4_free_group_clusters(sb, desc);
    }

    INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
    init_rwsem(&meta_group_info[i]->alloc_sem);
    meta_group_info[i]->bb_free_root = RB_ROOT;
    meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */

#ifdef DOUBLE_CHECK
    {
        struct buffer_head *bh;
        meta_group_info[i]->bb_bitmap =
            kmalloc(sb->s_blocksize, GFP_KERNEL);
        BUG_ON(meta_group_info[i]->bb_bitmap == NULL);
        bh = ext4_read_block_bitmap(sb, group);
        BUG_ON(bh == NULL);
        memcpy(meta_group_info[i]->bb_bitmap, bh->b_data,
            sb->s_blocksize);
        put_bh(bh);
    }
#endif

    return 0;

exit_group_info:
    /* If a meta_group_info table has been allocated, release it now */
    if (group % EXT4_DESC_PER_BLOCK(sb) == 0) {
        kfree(sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)]);
        sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)] = NULL;
    }
exit_meta_group_info:
    return -ENOMEM;
} /* ext4_mb_add_groupinfo */

static int ext4_mb_init_backend(struct super_block *sb)
{
    ext4_group_t ngroups = ext4_get_groups_count(sb);
    ext4_group_t i;
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    int err;
    struct ext4_group_desc *desc;
    struct kmem_cache *cachep;

    err = ext4_mb_alloc_groupinfo(sb, ngroups);
    if (err)
        return err;

    sbi->s_buddy_cache = new_inode(sb);
    if (sbi->s_buddy_cache == NULL) {
        ext4_msg(sb, KERN_ERR, "can't get new inode");
        goto err_freesgi;
    }
    /* To avoid potentially colliding with an valid on-disk inode number,
     * use EXT4_BAD_INO for the buddy cache inode number.  This inode is
     * not in the inode hash, so it should never be found by iget(), but
     * this will avoid confusion if it ever shows up during debugging. */
    sbi->s_buddy_cache->i_ino = EXT4_BAD_INO;
    EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
    for (i = 0; i < ngroups; i++) {
        desc = ext4_get_group_desc(sb, i, NULL);
        if (desc == NULL) {
            ext4_msg(sb, KERN_ERR, "can't read descriptor %u", i);
            goto err_freebuddy;
        }
        if (ext4_mb_add_groupinfo(sb, i, desc) != 0)
            goto err_freebuddy;
    }

    return 0;

err_freebuddy:
    cachep = get_groupinfo_cache(sb->s_blocksize_bits);
    while (i-- > 0)
        kmem_cache_free(cachep, ext4_get_group_info(sb, i));
    i = sbi->s_group_info_size;
    while (i-- > 0)
        kfree(sbi->s_group_info[i]);
    iput(sbi->s_buddy_cache);
err_freesgi:
    ext4_kvfree(sbi->s_group_info);
    return -ENOMEM;
}

static void ext4_groupinfo_destroy_slabs(void)
{
    int i;

    for (i = 0; i < NR_GRPINFO_CACHES; i++) {
        if (ext4_groupinfo_caches[i])
            kmem_cache_destroy(ext4_groupinfo_caches[i]);
        ext4_groupinfo_caches[i] = NULL;
    }
}

static int ext4_groupinfo_create_slab(size_t size)
{
    static DEFINE_MUTEX(ext4_grpinfo_slab_create_mutex);
    int slab_size;
    int blocksize_bits = order_base_2(size);
    int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
    struct kmem_cache *cachep;

    if (cache_index >= NR_GRPINFO_CACHES)
        return -EINVAL;

    if (unlikely(cache_index < 0))
        cache_index = 0;

    mutex_lock(&ext4_grpinfo_slab_create_mutex);
    if (ext4_groupinfo_caches[cache_index]) {
        mutex_unlock(&ext4_grpinfo_slab_create_mutex);
        return 0;   /* Already created */
    }

    slab_size = offsetof(struct ext4_group_info,
                bb_counters[blocksize_bits + 2]);

    cachep = kmem_cache_create(ext4_groupinfo_slab_names[cache_index],
                    slab_size, 0, SLAB_RECLAIM_ACCOUNT,
                    NULL);

    ext4_groupinfo_caches[cache_index] = cachep;

    mutex_unlock(&ext4_grpinfo_slab_create_mutex);
    if (!cachep) {
        printk(KERN_EMERG
               "EXT4-fs: no memory for groupinfo slab cache\n");
        return -ENOMEM;
    }

    return 0;
}

int ext4_mb_init(struct super_block *sb)
{
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    unsigned i, j;
    unsigned offset;
    unsigned max;
    int ret;

    i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_offsets);

    sbi->s_mb_offsets = kmalloc(i, GFP_KERNEL);
    if (sbi->s_mb_offsets == NULL) {
        ret = -ENOMEM;
        goto out;
    }

    i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_maxs);
    sbi->s_mb_maxs = kmalloc(i, GFP_KERNEL);
    if (sbi->s_mb_maxs == NULL) {
        ret = -ENOMEM;
        goto out;
    }

    ret = ext4_groupinfo_create_slab(sb->s_blocksize);
    if (ret < 0)
        goto out;

    /* order 0 is regular bitmap */
    sbi->s_mb_maxs[0] = sb->s_blocksize << 3;
    sbi->s_mb_offsets[0] = 0;

    i = 1;
    offset = 0;
    max = sb->s_blocksize << 2;
    do {
        sbi->s_mb_offsets[i] = offset;
        sbi->s_mb_maxs[i] = max;
        offset += 1 << (sb->s_blocksize_bits - i);
        max = max >> 1;
        i++;
    } while (i <= sb->s_blocksize_bits + 1);

    spin_lock_init(&sbi->s_md_lock);
    spin_lock_init(&sbi->s_bal_lock);

    sbi->s_mb_max_to_scan = MB_DEFAULT_MAX_TO_SCAN;
    sbi->s_mb_min_to_scan = MB_DEFAULT_MIN_TO_SCAN;
    sbi->s_mb_stats = MB_DEFAULT_STATS;
    sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
    sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
    /*
     * The default group preallocation is 512, which for 4k block
     * sizes translates to 2 megabytes.  However for bigalloc file
     * systems, this is probably too big (i.e, if the cluster size
     * is 1 megabyte, then group preallocation size becomes half a
     * gigabyte!).  As a default, we will keep a two megabyte
     * group pralloc size for cluster sizes up to 64k, and after
     * that, we will force a minimum group preallocation size of
     * 32 clusters.  This translates to 8 megs when the cluster
     * size is 256k, and 32 megs when the cluster size is 1 meg,
     * which seems reasonable as a default.
     */
    sbi->s_mb_group_prealloc = max(MB_DEFAULT_GROUP_PREALLOC >>
                       sbi->s_cluster_bits, 32);
    /*
     * If there is a s_stripe > 1, then we set the s_mb_group_prealloc
     * to the lowest multiple of s_stripe which is bigger than
     * the s_mb_group_prealloc as determined above. We want
     * the preallocation size to be an exact multiple of the
     * RAID stripe size so that preallocations don't fragment
     * the stripes.
     */
    if (sbi->s_stripe > 1) {
        sbi->s_mb_group_prealloc = roundup(
            sbi->s_mb_group_prealloc, sbi->s_stripe);
    }

    sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
    if (sbi->s_locality_groups == NULL) {
        ret = -ENOMEM;
        goto out_free_groupinfo_slab;
    }
    for_each_possible_cpu(i) {
        struct ext4_locality_group *lg;
        lg = per_cpu_ptr(sbi->s_locality_groups, i);
        mutex_init(&lg->lg_mutex);
        for (j = 0; j < PREALLOC_TB_SIZE; j++)
            INIT_LIST_HEAD(&lg->lg_prealloc_list[j]);
        spin_lock_init(&lg->lg_prealloc_lock);
    }

    /* init file for buddy data */
    ret = ext4_mb_init_backend(sb);
    if (ret != 0)
        goto out_free_locality_groups;

    if (sbi->s_proc)
        proc_create_data("mb_groups", S_IRUGO, sbi->s_proc,
                 &ext4_mb_seq_groups_fops, sb);

    return 0;

out_free_locality_groups:
    free_percpu(sbi->s_locality_groups);
    sbi->s_locality_groups = NULL;
out_free_groupinfo_slab:
    ext4_groupinfo_destroy_slabs();
out:
    kfree(sbi->s_mb_offsets);
    sbi->s_mb_offsets = NULL;
    kfree(sbi->s_mb_maxs);
    sbi->s_mb_maxs = NULL;
    return ret;
}

/* need to called with the ext4 group lock held */
static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)
{
    struct ext4_prealloc_space *pa;
    struct list_head *cur, *tmp;
    int count = 0;

    list_for_each_safe(cur, tmp, &grp->bb_prealloc_list) {
        pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
        list_del(&pa->pa_group_list);
        count++;
        kmem_cache_free(ext4_pspace_cachep, pa);
    }
    if (count)
        mb_debug(1, "mballoc: %u PAs left\n", count);

}

int ext4_mb_release(struct super_block *sb)
{
    ext4_group_t ngroups = ext4_get_groups_count(sb);
    ext4_group_t i;
    int num_meta_group_infos;
    struct ext4_group_info *grinfo;
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    struct kmem_cache *cachep = get_groupinfo_cache(sb->s_blocksize_bits);

    if (sbi->s_proc)
        remove_proc_entry("mb_groups", sbi->s_proc);

    if (sbi->s_group_info) {
        for (i = 0; i < ngroups; i++) {
            grinfo = ext4_get_group_info(sb, i);
#ifdef DOUBLE_CHECK
            kfree(grinfo->bb_bitmap);
#endif
            ext4_lock_group(sb, i);
            ext4_mb_cleanup_pa(grinfo);
            ext4_unlock_group(sb, i);
            kmem_cache_free(cachep, grinfo);
        }
        num_meta_group_infos = (ngroups +
                EXT4_DESC_PER_BLOCK(sb) - 1) >>
            EXT4_DESC_PER_BLOCK_BITS(sb);
        for (i = 0; i < num_meta_group_infos; i++)
            kfree(sbi->s_group_info[i]);
        ext4_kvfree(sbi->s_group_info);
    }
    kfree(sbi->s_mb_offsets);
    kfree(sbi->s_mb_maxs);
    if (sbi->s_buddy_cache)
        iput(sbi->s_buddy_cache);
    if (sbi->s_mb_stats) {
        ext4_msg(sb, KERN_INFO,
               "mballoc: %u blocks %u reqs (%u success)",
                atomic_read(&sbi->s_bal_allocated),
                atomic_read(&sbi->s_bal_reqs),
                atomic_read(&sbi->s_bal_success));
        ext4_msg(sb, KERN_INFO,
              "mballoc: %u extents scanned, %u goal hits, "
                "%u 2^N hits, %u breaks, %u lost",
                atomic_read(&sbi->s_bal_ex_scanned),
                atomic_read(&sbi->s_bal_goals),
                atomic_read(&sbi->s_bal_2orders),
                atomic_read(&sbi->s_bal_breaks),
                atomic_read(&sbi->s_mb_lost_chunks));
        ext4_msg(sb, KERN_INFO,
               "mballoc: %lu generated and it took %Lu",
                sbi->s_mb_buddies_generated,
                sbi->s_mb_generation_time);
        ext4_msg(sb, KERN_INFO,
               "mballoc: %u preallocated, %u discarded",
                atomic_read(&sbi->s_mb_preallocated),
                atomic_read(&sbi->s_mb_discarded));
    }

    free_percpu(sbi->s_locality_groups);

    return 0;
}

static inline int ext4_issue_discard(struct super_block *sb,
        ext4_group_t block_group, ext4_grpblk_t cluster, int count)
{
    ext4_fsblk_t discard_block;

    discard_block = (EXT4_C2B(EXT4_SB(sb), cluster) +
             ext4_group_first_block_no(sb, block_group));
    count = EXT4_C2B(EXT4_SB(sb), count);
    trace_ext4_discard_blocks(sb,
            (unsigned long long) discard_block, count);
    return sb_issue_discard(sb, discard_block, count, GFP_NOFS, 0);
}

/*
 * This function is called by the jbd2 layer once the commit has finished,
 * so we know we can free the blocks that were released with that commit.
 */
static void ext4_free_data_callback(struct super_block *sb,
                    struct ext4_journal_cb_entry *jce,
                    int rc)
{
    struct ext4_free_data *entry = (struct ext4_free_data *)jce;
    struct ext4_buddy e4b;
    struct ext4_group_info *db;
    int err, count = 0, count2 = 0;

    mb_debug(1, "gonna free %u blocks in group %u (0x%p):",
         entry->efd_count, entry->efd_group, entry);

    if (test_opt(sb, DISCARD)) {
        err = ext4_issue_discard(sb, entry->efd_group,
                     entry->efd_start_cluster,
                     entry->efd_count);
        if (err && err != -EOPNOTSUPP)
            ext4_msg(sb, KERN_WARNING, "discard request in"
                 " group:%d block:%d count:%d failed"
                 " with %d", entry->efd_group,
                 entry->efd_start_cluster,
                 entry->efd_count, err);
    }

    err = ext4_mb_load_buddy(sb, entry->efd_group, &e4b);
    /* we expect to find existing buddy because it's pinned */
    BUG_ON(err != 0);


    db = e4b.bd_info;
    /* there are blocks to put in buddy to make them really free */
    count += entry->efd_count;
    count2++;
    ext4_lock_group(sb, entry->efd_group);
    /* Take it out of per group rb tree */
    rb_erase(&entry->efd_node, &(db->bb_free_root));
    mb_free_blocks(NULL, &e4b, entry->efd_start_cluster, entry->efd_count);

    /*
     * Clear the trimmed flag for the group so that the next
     * ext4_trim_fs can trim it.
     * If the volume is mounted with -o discard, online discard
     * is supported and the free blocks will be trimmed online.
     */
    if (!test_opt(sb, DISCARD))
        EXT4_MB_GRP_CLEAR_TRIMMED(db);

    if (!db->bb_free_root.rb_node) {
        /* No more items in the per group rb tree
         * balance refcounts from ext4_mb_free_metadata()
         */
        page_cache_release(e4b.bd_buddy_page);
        page_cache_release(e4b.bd_bitmap_page);
    }
    ext4_unlock_group(sb, entry->efd_group);
    kmem_cache_free(ext4_free_data_cachep, entry);
    ext4_mb_unload_buddy(&e4b);

    mb_debug(1, "freed %u blocks in %u structures\n", count, count2);
}

int __init ext4_init_mballoc(void)
{
    ext4_pspace_cachep = KMEM_CACHE(ext4_prealloc_space,
                    SLAB_RECLAIM_ACCOUNT);
    if (ext4_pspace_cachep == NULL)
        return -ENOMEM;

    ext4_ac_cachep = KMEM_CACHE(ext4_allocation_context,
                    SLAB_RECLAIM_ACCOUNT);
    if (ext4_ac_cachep == NULL) {
        kmem_cache_destroy(ext4_pspace_cachep);
        return -ENOMEM;
    }

    ext4_free_data_cachep = KMEM_CACHE(ext4_free_data,
                       SLAB_RECLAIM_ACCOUNT);
    if (ext4_free_data_cachep == NULL) {
        kmem_cache_destroy(ext4_pspace_cachep);
        kmem_cache_destroy(ext4_ac_cachep);
        return -ENOMEM;
    }
    return 0;
}

void ext4_exit_mballoc(void)
{
    /*
     * Wait for completion of call_rcu()'s on ext4_pspace_cachep
     * before destroying the slab cache.
     */
    rcu_barrier();
    kmem_cache_destroy(ext4_pspace_cachep);
    kmem_cache_destroy(ext4_ac_cachep);
    kmem_cache_destroy(ext4_free_data_cachep);
    ext4_groupinfo_destroy_slabs();
}


/*
 * Check quota and mark chosen space (ac->ac_b_ex) non-free in bitmaps
 * Returns 0 if success or error code
 */
static noinline_for_stack int
ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
                handle_t *handle, unsigned int reserv_clstrs)
{
    struct buffer_head *bitmap_bh = NULL;
    struct ext4_group_desc *gdp;
    struct buffer_head *gdp_bh;
    struct ext4_sb_info *sbi;
    struct super_block *sb;
    ext4_fsblk_t block;
    int err, len;

    BUG_ON(ac->ac_status != AC_STATUS_FOUND);
    BUG_ON(ac->ac_b_ex.fe_len <= 0);

    sb = ac->ac_sb;
    sbi = EXT4_SB(sb);

    err = -EIO;
    bitmap_bh = ext4_read_block_bitmap(sb, ac->ac_b_ex.fe_group);
    if (!bitmap_bh)
        goto out_err;

    err = ext4_journal_get_write_access(handle, bitmap_bh);
    if (err)
        goto out_err;

    err = -EIO;
    gdp = ext4_get_group_desc(sb, ac->ac_b_ex.fe_group, &gdp_bh);
    if (!gdp)
        goto out_err;

    ext4_debug("using block group %u(%d)\n", ac->ac_b_ex.fe_group,
            ext4_free_group_clusters(sb, gdp));

    err = ext4_journal_get_write_access(handle, gdp_bh);
    if (err)
        goto out_err;

    block = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);

    len = EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
    if (!ext4_data_block_valid(sbi, block, len)) {
        ext4_error(sb, "Allocating blocks %llu-%llu which overlap "
               "fs metadata", block, block+len);
        /* File system mounted not to panic on error
         * Fix the bitmap and repeat the block allocation
         * We leak some of the blocks here.
         */
        ext4_lock_group(sb, ac->ac_b_ex.fe_group);
        ext4_set_bits(bitmap_bh->b_data, ac->ac_b_ex.fe_start,
                  ac->ac_b_ex.fe_len);
        ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
        err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
        if (!err)
            err = -EAGAIN;
        goto out_err;
    }

    ext4_lock_group(sb, ac->ac_b_ex.fe_group);
#ifdef AGGRESSIVE_CHECK
    {
        int i;
        for (i = 0; i < ac->ac_b_ex.fe_len; i++) {
            BUG_ON(mb_test_bit(ac->ac_b_ex.fe_start + i,
                        bitmap_bh->b_data));
        }
    }
#endif
    ext4_set_bits(bitmap_bh->b_data, ac->ac_b_ex.fe_start,
              ac->ac_b_ex.fe_len);
    if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
        gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
        ext4_free_group_clusters_set(sb, gdp,
                         ext4_free_clusters_after_init(sb,
                        ac->ac_b_ex.fe_group, gdp));
    }
    len = ext4_free_group_clusters(sb, gdp) - ac->ac_b_ex.fe_len;
    ext4_free_group_clusters_set(sb, gdp, len);
    ext4_block_bitmap_csum_set(sb, ac->ac_b_ex.fe_group, gdp, bitmap_bh);
    ext4_group_desc_csum_set(sb, ac->ac_b_ex.fe_group, gdp);

    ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
    percpu_counter_sub(&sbi->s_freeclusters_counter, ac->ac_b_ex.fe_len);
    /*
     * Now reduce the dirty block count also. Should not go negative
     */
    if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
        /* release all the reserved blocks if non delalloc */
        percpu_counter_sub(&sbi->s_dirtyclusters_counter,
                   reserv_clstrs);

    if (sbi->s_log_groups_per_flex) {
        ext4_group_t flex_group = ext4_flex_group(sbi,
                              ac->ac_b_ex.fe_group);
        atomic64_sub(ac->ac_b_ex.fe_len,
                 &sbi->s_flex_groups[flex_group].free_clusters);
    }

    err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
    if (err)
        goto out_err;
    err = ext4_handle_dirty_metadata(handle, NULL, gdp_bh);

out_err:
    brelse(bitmap_bh);
    return err;
}

/*
 * here we normalize request for locality group
 * Group request are normalized to s_mb_group_prealloc, which goes to
 * s_strip if we set the same via mount option.
 * s_mb_group_prealloc can be configured via
 * /sys/fs/ext4/<partition>/mb_group_prealloc
 *
 * XXX: should we try to preallocate more than the group has now?
 */
static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
{
    struct super_block *sb = ac->ac_sb;
    struct ext4_locality_group *lg = ac->ac_lg;

    BUG_ON(lg == NULL);
    ac->ac_g_ex.fe_len = EXT4_SB(sb)->s_mb_group_prealloc;
    mb_debug(1, "#%u: goal %u blocks for locality group\n",
        current->pid, ac->ac_g_ex.fe_len);
}

/*
 * Normalization means making request better in terms of
 * size and alignment
 */
static noinline_for_stack void
ext4_mb_normalize_request(struct ext4_allocation_context *ac,
                struct ext4_allocation_request *ar)
{
    struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
    int bsbits, max;
    ext4_lblk_t end;
    loff_t size, start_off;
    loff_t orig_size __maybe_unused;
    ext4_lblk_t start;
    struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
    struct ext4_prealloc_space *pa;

    /* do normalize only data requests, metadata requests
       do not need preallocation */
    if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
        return;

    /* sometime caller may want exact blocks */
    if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
        return;

    /* caller may indicate that preallocation isn't
     * required (it's a tail, for example) */
    if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
        return;

    if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) {
        ext4_mb_normalize_group_request(ac);
        return ;
    }

    bsbits = ac->ac_sb->s_blocksize_bits;

    /* first, let's learn actual file size
     * given current request is allocated */
    size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
    size = size << bsbits;
    if (size < i_size_read(ac->ac_inode))
        size = i_size_read(ac->ac_inode);
    orig_size = size;

    /* max size of free chunks */
    max = 2 << bsbits;

#define NRL_CHECK_SIZE(req, size, max, chunk_size)  \
        (req <= (size) || max <= (chunk_size))

    /* first, try to predict filesize */
    /* XXX: should this table be tunable? */
    start_off = 0;
    if (size <= 16 * 1024) {
        size = 16 * 1024;
    } else if (size <= 32 * 1024) {
        size = 32 * 1024;
    } else if (size <= 64 * 1024) {
        size = 64 * 1024;
    } else if (size <= 128 * 1024) {
        size = 128 * 1024;
    } else if (size <= 256 * 1024) {
        size = 256 * 1024;
    } else if (size <= 512 * 1024) {
        size = 512 * 1024;
    } else if (size <= 1024 * 1024) {
        size = 1024 * 1024;
    } else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, 2 * 1024)) {
        start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
                        (21 - bsbits)) << 21;
        size = 2 * 1024 * 1024;
    } else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, 4 * 1024)) {
        start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
                            (22 - bsbits)) << 22;
        size = 4 * 1024 * 1024;
    } else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
                    (8<<20)>>bsbits, max, 8 * 1024)) {
        start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
                            (23 - bsbits)) << 23;
        size = 8 * 1024 * 1024;
    } else {
        start_off = (loff_t)ac->ac_o_ex.fe_logical << bsbits;
        size      = ac->ac_o_ex.fe_len << bsbits;
    }
    size = size >> bsbits;
    start = start_off >> bsbits;

    /* don't cover already allocated blocks in selected range */
    if (ar->pleft && start <= ar->lleft) {
        size -= ar->lleft + 1 - start;
        start = ar->lleft + 1;
    }
    if (ar->pright && start + size - 1 >= ar->lright)
        size -= start + size - ar->lright;

    end = start + size;

    /* check we don't cross already preallocated blocks */
    rcu_read_lock();
    list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
        ext4_lblk_t pa_end;

        if (pa->pa_deleted)
            continue;
        spin_lock(&pa->pa_lock);
        if (pa->pa_deleted) {
            spin_unlock(&pa->pa_lock);
            continue;
        }

        pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb),
                          pa->pa_len);

        /* PA must not overlap original request */
        BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end ||
            ac->ac_o_ex.fe_logical < pa->pa_lstart));

        /* skip PAs this normalized request doesn't overlap with */
        if (pa->pa_lstart >= end || pa_end <= start) {
            spin_unlock(&pa->pa_lock);
            continue;
        }
        BUG_ON(pa->pa_lstart <= start && pa_end >= end);

        /* adjust start or end to be adjacent to this pa */
        if (pa_end <= ac->ac_o_ex.fe_logical) {
            BUG_ON(pa_end < start);
            start = pa_end;
        } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) {
            BUG_ON(pa->pa_lstart > end);
            end = pa->pa_lstart;
        }
        spin_unlock(&pa->pa_lock);
    }
    rcu_read_unlock();
    size = end - start;

    /* XXX: extra loop to check we really don't overlap preallocations */
    rcu_read_lock();
    list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
        ext4_lblk_t pa_end;

        spin_lock(&pa->pa_lock);
        if (pa->pa_deleted == 0) {
            pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb),
                              pa->pa_len);
            BUG_ON(!(start >= pa_end || end <= pa->pa_lstart));
        }
        spin_unlock(&pa->pa_lock);
    }
    rcu_read_unlock();

    if (start + size <= ac->ac_o_ex.fe_logical &&
            start > ac->ac_o_ex.fe_logical) {
        ext4_msg(ac->ac_sb, KERN_ERR,
             "start %lu, size %lu, fe_logical %lu",
             (unsigned long) start, (unsigned long) size,
             (unsigned long) ac->ac_o_ex.fe_logical);
    }
    BUG_ON(start + size <= ac->ac_o_ex.fe_logical &&
            start > ac->ac_o_ex.fe_logical);
    BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));

    /* now prepare goal request */

    /* XXX: is it better to align blocks WRT to logical
     * placement or satisfy big request as is */
    ac->ac_g_ex.fe_logical = start;
    ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size);

    /* define goal start in order to merge */
    if (ar->pright && (ar->lright == (start + size))) {
        /* merge to the right */
        ext4_get_group_no_and_offset(ac->ac_sb, ar->pright - size,
                        &ac->ac_f_ex.fe_group,
                        &ac->ac_f_ex.fe_start);
        ac->ac_flags |= EXT4_MB_HINT_TRY_GOAL;
    }
    if (ar->pleft && (ar->lleft + 1 == start)) {
        /* merge to the left */
        ext4_get_group_no_and_offset(ac->ac_sb, ar->pleft + 1,
                        &ac->ac_f_ex.fe_group,
                        &ac->ac_f_ex.fe_start);
        ac->ac_flags |= EXT4_MB_HINT_TRY_GOAL;
    }

    mb_debug(1, "goal: %u(was %u) blocks at %u\n", (unsigned) size,
        (unsigned) orig_size, (unsigned) start);
}
3168 3170
3169 static void ext4_mb_collect_stats(struct ext4_allocation_context *ac) 3171 static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
3170 { 3172 {
3171 struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); 3173 struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
3172 3174
3173 if (sbi->s_mb_stats && ac->ac_g_ex.fe_len > 1) { 3175 if (sbi->s_mb_stats && ac->ac_g_ex.fe_len > 1) {
3174 atomic_inc(&sbi->s_bal_reqs); 3176 atomic_inc(&sbi->s_bal_reqs);
3175 atomic_add(ac->ac_b_ex.fe_len, &sbi->s_bal_allocated); 3177 atomic_add(ac->ac_b_ex.fe_len, &sbi->s_bal_allocated);
3176 if (ac->ac_b_ex.fe_len >= ac->ac_o_ex.fe_len) 3178 if (ac->ac_b_ex.fe_len >= ac->ac_o_ex.fe_len)
3177 atomic_inc(&sbi->s_bal_success); 3179 atomic_inc(&sbi->s_bal_success);
3178 atomic_add(ac->ac_found, &sbi->s_bal_ex_scanned); 3180 atomic_add(ac->ac_found, &sbi->s_bal_ex_scanned);
3179 if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start && 3181 if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
3180 ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group) 3182 ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
3181 atomic_inc(&sbi->s_bal_goals); 3183 atomic_inc(&sbi->s_bal_goals);
3182 if (ac->ac_found > sbi->s_mb_max_to_scan) 3184 if (ac->ac_found > sbi->s_mb_max_to_scan)
3183 atomic_inc(&sbi->s_bal_breaks); 3185 atomic_inc(&sbi->s_bal_breaks);
3184 } 3186 }
3185 3187
3186 if (ac->ac_op == EXT4_MB_HISTORY_ALLOC) 3188 if (ac->ac_op == EXT4_MB_HISTORY_ALLOC)
3187 trace_ext4_mballoc_alloc(ac); 3189 trace_ext4_mballoc_alloc(ac);
3188 else 3190 else
3189 trace_ext4_mballoc_prealloc(ac); 3191 trace_ext4_mballoc_prealloc(ac);
3190 } 3192 }
3191 3193
/*
 * Called on failure; free up any blocks from the inode PA for this
 * context. We don't need this for MB_GROUP_PA because we only change
 * pa_free in ext4_mb_release_context(), but on failure, we've already
 * zeroed out ac->ac_b_ex.fe_len, so group_pa->pa_free is not changed.
 */
static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
{
	struct ext4_prealloc_space *pa = ac->ac_pa;
	struct ext4_buddy e4b;
	int err;

	if (pa == NULL) {
		if (ac->ac_f_ex.fe_len == 0)
			return;
		err = ext4_mb_load_buddy(ac->ac_sb, ac->ac_f_ex.fe_group, &e4b);
		if (err) {
			/*
			 * This should never happen since we pin the
			 * pages in the ext4_allocation_context so
			 * ext4_mb_load_buddy() should never fail.
			 */
			WARN(1, "mb_load_buddy failed (%d)", err);
			return;
		}
		ext4_lock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
		mb_free_blocks(ac->ac_inode, &e4b, ac->ac_f_ex.fe_start,
			       ac->ac_f_ex.fe_len);
		ext4_unlock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
		ext4_mb_unload_buddy(&e4b);
		return;
	}
	if (pa->pa_type == MB_INODE_PA)
		pa->pa_free += ac->ac_b_ex.fe_len;
}

/*
 * use blocks preallocated to inode
 */
static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac,
				struct ext4_prealloc_space *pa)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
	ext4_fsblk_t start;
	ext4_fsblk_t end;
	int len;

	/* found preallocated blocks, use them */
	start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart);
	end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len),
		  start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len));
	len = EXT4_NUM_B2C(sbi, end - start);
	ext4_get_group_no_and_offset(ac->ac_sb, start, &ac->ac_b_ex.fe_group,
					&ac->ac_b_ex.fe_start);
	ac->ac_b_ex.fe_len = len;
	ac->ac_status = AC_STATUS_FOUND;
	ac->ac_pa = pa;

	BUG_ON(start < pa->pa_pstart);
	BUG_ON(end > pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len));
	BUG_ON(pa->pa_free < len);
	pa->pa_free -= len;

	mb_debug(1, "use %llu/%u from inode pa %p\n", start, len, pa);
}

/*
 * use blocks preallocated to locality group
 */
static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac,
				struct ext4_prealloc_space *pa)
{
	unsigned int len = ac->ac_o_ex.fe_len;

	ext4_get_group_no_and_offset(ac->ac_sb, pa->pa_pstart,
					&ac->ac_b_ex.fe_group,
					&ac->ac_b_ex.fe_start);
	ac->ac_b_ex.fe_len = len;
	ac->ac_status = AC_STATUS_FOUND;
	ac->ac_pa = pa;

	/* we don't correct pa_pstart or pa_plen here to avoid
	 * possible race when the group is being loaded concurrently
	 * instead we correct pa later, after blocks are marked
	 * in on-disk bitmap -- see ext4_mb_release_context()
	 * Other CPUs are prevented from allocating from this pa by lg_mutex
	 */
	mb_debug(1, "use %u/%u from group pa %p\n", pa->pa_lstart-len, len, pa);
}

/*
 * Return the prealloc space that have minimal distance
 * from the goal block. @cpa is the prealloc
 * space that is having currently known minimal distance
 * from the goal block.
 */
static struct ext4_prealloc_space *
ext4_mb_check_group_pa(ext4_fsblk_t goal_block,
			struct ext4_prealloc_space *pa,
			struct ext4_prealloc_space *cpa)
{
	ext4_fsblk_t cur_distance, new_distance;

	if (cpa == NULL) {
		atomic_inc(&pa->pa_count);
		return pa;
	}
	cur_distance = abs(goal_block - cpa->pa_pstart);
	new_distance = abs(goal_block - pa->pa_pstart);

	if (cur_distance <= new_distance)
		return cpa;

	/* drop the previous reference */
	atomic_dec(&cpa->pa_count);
	atomic_inc(&pa->pa_count);
	return pa;
}

/*
 * search goal blocks in preallocated space
 */
static noinline_for_stack int
ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
	int order, i;
	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
	struct ext4_locality_group *lg;
	struct ext4_prealloc_space *pa, *cpa = NULL;
	ext4_fsblk_t goal_block;

	/* only data can be preallocated */
	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
		return 0;

	/* first, try per-file preallocation */
	rcu_read_lock();
	list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {

		/* all fields in this condition don't change,
		 * so we can skip locking for them */
		if (ac->ac_o_ex.fe_logical < pa->pa_lstart ||
		    ac->ac_o_ex.fe_logical >= (pa->pa_lstart +
					       EXT4_C2B(sbi, pa->pa_len)))
			continue;

		/* non-extent files can't have physical blocks past 2^32 */
		if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
		    (pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) >
		     EXT4_MAX_BLOCK_FILE_PHYS))
			continue;

		/* found preallocated blocks, use them */
		spin_lock(&pa->pa_lock);
		if (pa->pa_deleted == 0 && pa->pa_free) {
			atomic_inc(&pa->pa_count);
			ext4_mb_use_inode_pa(ac, pa);
			spin_unlock(&pa->pa_lock);
			ac->ac_criteria = 10;
			rcu_read_unlock();
			return 1;
		}
		spin_unlock(&pa->pa_lock);
	}
	rcu_read_unlock();

	/* can we use group allocation? */
	if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC))
		return 0;

	/* inode may have no locality group for some reason */
	lg = ac->ac_lg;
	if (lg == NULL)
		return 0;
	order = fls(ac->ac_o_ex.fe_len) - 1;
	if (order > PREALLOC_TB_SIZE - 1)
		/* The max size of hash table is PREALLOC_TB_SIZE */
		order = PREALLOC_TB_SIZE - 1;

	goal_block = ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex);
	/*
	 * search for the prealloc space that is having
	 * minimal distance from the goal block.
	 */
	for (i = order; i < PREALLOC_TB_SIZE; i++) {
		rcu_read_lock();
		list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[i],
					pa_inode_list) {
			spin_lock(&pa->pa_lock);
			if (pa->pa_deleted == 0 &&
					pa->pa_free >= ac->ac_o_ex.fe_len) {

				cpa = ext4_mb_check_group_pa(goal_block,
								pa, cpa);
			}
			spin_unlock(&pa->pa_lock);
		}
		rcu_read_unlock();
	}
	if (cpa) {
		ext4_mb_use_group_pa(ac, cpa);
		ac->ac_criteria = 20;
		return 1;
	}
	return 0;
}

/*
 * the function goes through all block freed in the group
 * but not yet committed and marks them used in in-core bitmap.
 * buddy must be generated from this bitmap
 * Need to be called with the ext4 group lock held
 */
static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
						ext4_group_t group)
{
	struct rb_node *n;
	struct ext4_group_info *grp;
	struct ext4_free_data *entry;

	grp = ext4_get_group_info(sb, group);
	n = rb_first(&(grp->bb_free_root));

	while (n) {
		entry = rb_entry(n, struct ext4_free_data, efd_node);
		ext4_set_bits(bitmap, entry->efd_start_cluster, entry->efd_count);
		n = rb_next(n);
	}
	return;
}

/*
 * the function goes through all preallocation in this group and marks them
 * used in in-core bitmap. buddy must be generated from this bitmap
 * Need to be called with ext4 group lock held
 */
static noinline_for_stack
void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
					ext4_group_t group)
{
	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
	struct ext4_prealloc_space *pa;
	struct list_head *cur;
	ext4_group_t groupnr;
	ext4_grpblk_t start;
	int preallocated = 0;
	int len;

	/* all form of preallocation discards first load group,
	 * so the only competing code is preallocation use.
	 * we don't need any locking here
	 * notice we do NOT ignore preallocations with pa_deleted
	 * otherwise we could leave used blocks available for
	 * allocation in buddy when concurrent ext4_mb_put_pa()
	 * is dropping preallocation
	 */
	list_for_each(cur, &grp->bb_prealloc_list) {
		pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
		spin_lock(&pa->pa_lock);
		ext4_get_group_no_and_offset(sb, pa->pa_pstart,
					     &groupnr, &start);
		len = pa->pa_len;
		spin_unlock(&pa->pa_lock);
		if (unlikely(len == 0))
			continue;
		BUG_ON(groupnr != group);
		ext4_set_bits(bitmap, start, len);
		preallocated += len;
	}
	mb_debug(1, "prellocated %u for group %u\n", preallocated, group);
}

static void ext4_mb_pa_callback(struct rcu_head *head)
{
	struct ext4_prealloc_space *pa;
	pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);

	BUG_ON(atomic_read(&pa->pa_count));
	BUG_ON(pa->pa_deleted == 0);
	kmem_cache_free(ext4_pspace_cachep, pa);
}

/*
 * drops a reference to preallocated space descriptor
 * if this was the last reference and the space is consumed
 */
static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
			struct super_block *sb, struct ext4_prealloc_space *pa)
{
	ext4_group_t grp;
	ext4_fsblk_t grp_blk;

	/* in this short window concurrent discard can set pa_deleted */
	spin_lock(&pa->pa_lock);
	if (!atomic_dec_and_test(&pa->pa_count) || pa->pa_free != 0) {
		spin_unlock(&pa->pa_lock);
		return;
	}

	if (pa->pa_deleted == 1) {
		spin_unlock(&pa->pa_lock);
		return;
	}

	pa->pa_deleted = 1;
	spin_unlock(&pa->pa_lock);

	grp_blk = pa->pa_pstart;
	/*
	 * If doing group-based preallocation, pa_pstart may be in the
	 * next group when pa is used up
	 */
	if (pa->pa_type == MB_GROUP_PA)
		grp_blk--;

	grp = ext4_get_group_number(sb, grp_blk);

	/*
	 * possible race:
	 *
	 *  P1 (buddy init)                     P2 (regular allocation)
	 *                                      find block B in PA
	 *  copy on-disk bitmap to buddy
	 *                                      mark B in on-disk bitmap
	 *                                      drop PA from group
	 *  mark all PAs in buddy
	 *
	 * thus, P1 initializes buddy with B available. to prevent this
	 * we make "copy" and "mark all PAs" atomic and serialize "drop PA"
	 * against that pair
	 */
	ext4_lock_group(sb, grp);
	list_del(&pa->pa_group_list);
	ext4_unlock_group(sb, grp);

	spin_lock(pa->pa_obj_lock);
	list_del_rcu(&pa->pa_inode_list);
	spin_unlock(pa->pa_obj_lock);

	call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}

/*
 * creates new preallocated space for given inode
 */
static noinline_for_stack int
ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
{
	struct super_block *sb = ac->ac_sb;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_prealloc_space *pa;
	struct ext4_group_info *grp;
	struct ext4_inode_info *ei;

	/* preallocate only when found space is larger then requested */
	BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));

	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
	if (pa == NULL)
		return -ENOMEM;

	if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) {
		int winl;
		int wins;
		int win;
		int offs;

		/* we can't allocate as much as normalizer wants.
		 * so, found space must get proper lstart
		 * to cover original request */
		BUG_ON(ac->ac_g_ex.fe_logical > ac->ac_o_ex.fe_logical);
		BUG_ON(ac->ac_g_ex.fe_len < ac->ac_o_ex.fe_len);

		/* we're limited by original request in that
		 * logical block must be covered any way
		 * winl is window we can move our chunk within */
		winl = ac->ac_o_ex.fe_logical - ac->ac_g_ex.fe_logical;

		/* also, we should cover whole original request */
		wins = EXT4_C2B(sbi, ac->ac_b_ex.fe_len - ac->ac_o_ex.fe_len);

		/* the smallest one defines real window */
		win = min(winl, wins);

		offs = ac->ac_o_ex.fe_logical %
			EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
		if (offs && offs < win)
			win = offs;

		ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical -
			EXT4_NUM_B2C(sbi, win);
		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
	}

	/* preallocation can change ac_b_ex, thus we store actually
	 * allocated blocks for history */
	ac->ac_f_ex = ac->ac_b_ex;

	pa->pa_lstart = ac->ac_b_ex.fe_logical;
	pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
	pa->pa_len = ac->ac_b_ex.fe_len;
	pa->pa_free = pa->pa_len;
	atomic_set(&pa->pa_count, 1);
	spin_lock_init(&pa->pa_lock);
	INIT_LIST_HEAD(&pa->pa_inode_list);
	INIT_LIST_HEAD(&pa->pa_group_list);
	pa->pa_deleted = 0;
	pa->pa_type = MB_INODE_PA;

	mb_debug(1, "new inode pa %p: %llu/%u for %u\n", pa,
			pa->pa_pstart, pa->pa_len, pa->pa_lstart);
	trace_ext4_mb_new_inode_pa(ac, pa);

	ext4_mb_use_inode_pa(ac, pa);
	atomic_add(pa->pa_free, &sbi->s_mb_preallocated);

	ei = EXT4_I(ac->ac_inode);
	grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);

	pa->pa_obj_lock = &ei->i_prealloc_lock;
	pa->pa_inode = ac->ac_inode;

	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);

	spin_lock(pa->pa_obj_lock);
	list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
	spin_unlock(pa->pa_obj_lock);

	return 0;
}

/*
 * creates new preallocated space for locality group inodes belongs to
 */
static noinline_for_stack int
ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
{
	struct super_block *sb = ac->ac_sb;
	struct ext4_locality_group *lg;
	struct ext4_prealloc_space *pa;
	struct ext4_group_info *grp;

	/* preallocate only when found space is larger then requested */
	BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));

	BUG_ON(ext4_pspace_cachep == NULL);
	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
	if (pa == NULL)
		return -ENOMEM;

	/* preallocation can change ac_b_ex, thus we store actually
	 * allocated blocks for history */
	ac->ac_f_ex = ac->ac_b_ex;

	pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
	pa->pa_lstart = pa->pa_pstart;
	pa->pa_len = ac->ac_b_ex.fe_len;
	pa->pa_free = pa->pa_len;
	atomic_set(&pa->pa_count, 1);
	spin_lock_init(&pa->pa_lock);
	INIT_LIST_HEAD(&pa->pa_inode_list);
	INIT_LIST_HEAD(&pa->pa_group_list);
	pa->pa_deleted = 0;
	pa->pa_type = MB_GROUP_PA;

	mb_debug(1, "new group pa %p: %llu/%u for %u\n", pa,
			pa->pa_pstart, pa->pa_len, pa->pa_lstart);
	trace_ext4_mb_new_group_pa(ac, pa);

	ext4_mb_use_group_pa(ac, pa);
	atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated);

	grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);
	lg = ac->ac_lg;
	BUG_ON(lg == NULL);

	pa->pa_obj_lock = &lg->lg_prealloc_lock;
	pa->pa_inode = NULL;

	ext4_lock_group(sb, ac->ac_b_ex.fe_group);
	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
	ext4_unlock_group(sb, ac->ac_b_ex.fe_group);

	/*
	 * We will later add the new pa to the right bucket
	 * after updating the pa_free in ext4_mb_release_context
	 */
	return 0;
}

static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
{
	int err;

	if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
		err = ext4_mb_new_group_pa(ac);
	else
		err = ext4_mb_new_inode_pa(ac);
	return err;
}

/*
 * finds all unused blocks in on-disk bitmap, frees them in
 * in-core bitmap and buddy.
 * @pa must be unlinked from inode and group lists, so that
 * nobody else can find/use it.
 * the caller MUST hold group/inode locks.
 * TODO: optimize the case when there are no in-core structures yet
 */
static noinline_for_stack int
ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
			struct ext4_prealloc_space *pa)
{
	struct super_block *sb = e4b->bd_sb;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	unsigned int end;
	unsigned int next;
	ext4_group_t group;
	ext4_grpblk_t bit;
	unsigned long long grp_blk_start;
	int err = 0;
	int free = 0;

	BUG_ON(pa->pa_deleted == 0);
	ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
	grp_blk_start = pa->pa_pstart - EXT4_C2B(sbi, bit);
	BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
	end = bit + pa->pa_len;

	while (bit < end) {
		bit = mb_find_next_zero_bit(bitmap_bh->b_data, end, bit);
		if (bit >= end)
			break;
		next = mb_find_next_bit(bitmap_bh->b_data, end, bit);
		mb_debug(1, " free preallocated %u/%u in group %u\n",
			 (unsigned) ext4_group_first_block_no(sb, group) + bit,
			 (unsigned) next - bit, (unsigned) group);
		free += next - bit;

		trace_ext4_mballoc_discard(sb, NULL, group, bit, next - bit);
		trace_ext4_mb_release_inode_pa(pa, (grp_blk_start +
						    EXT4_C2B(sbi, bit)),
					       next - bit);
		mb_free_blocks(pa->pa_inode, e4b, bit, next - bit);
		bit = next + 1;
	}
	if (free != pa->pa_free) {
		ext4_msg(e4b->bd_sb, KERN_CRIT,
			 "pa %p: logic %lu, phys. %lu, len %lu",
			 pa, (unsigned long) pa->pa_lstart,
			 (unsigned long) pa->pa_pstart,
			 (unsigned long) pa->pa_len);
		ext4_grp_locked_error(sb, group, 0, 0, "free %u, pa_free %u",
					free, pa->pa_free);
		/*
		 * pa is already deleted so we use the value obtained
		 * from the bitmap and continue.
		 */
	}
	atomic_add(free, &sbi->s_mb_discarded);

	return err;
}

static noinline_for_stack int
ext4_mb_release_group_pa(struct ext4_buddy *e4b,
				struct ext4_prealloc_space *pa)
{
	struct super_block *sb = e4b->bd_sb;
	ext4_group_t group;
	ext4_grpblk_t bit;

	trace_ext4_mb_release_group_pa(sb, pa);
	BUG_ON(pa->pa_deleted == 0);
	ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
	BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
	mb_free_blocks(pa->pa_inode, e4b, bit, pa->pa_len);
	atomic_add(pa->pa_len, &EXT4_SB(sb)->s_mb_discarded);
	trace_ext4_mballoc_discard(sb, NULL, group, bit, pa->pa_len);

	return 0;
}

/*
 * releases all preallocations in given group
 *
 * first, we need to decide discard policy:
 * - when do we discard
 *   1) ENOSPC
 * - how many do we discard
 *   1) how many requested
 */
static noinline_for_stack int
ext4_mb_discard_group_preallocations(struct super_block *sb,
					ext4_group_t group, int needed)
{
	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
	struct buffer_head *bitmap_bh = NULL;
	struct ext4_prealloc_space *pa, *tmp;
	struct list_head list;
	struct ext4_buddy e4b;
	int err;
	int busy = 0;
	int free = 0;

	mb_debug(1, "discard preallocation for group %u\n", group);

	if (list_empty(&grp->bb_prealloc_list))
		return 0;

	bitmap_bh = ext4_read_block_bitmap(sb, group);
	if (bitmap_bh == NULL) {
		ext4_error(sb, "Error reading block bitmap for %u", group);
		return 0;
	}

	err = ext4_mb_load_buddy(sb, group, &e4b);
	if (err) {
		ext4_error(sb, "Error loading buddy information for %u", group);
		put_bh(bitmap_bh);
		return 0;
	}

	if (needed == 0)
		needed = EXT4_CLUSTERS_PER_GROUP(sb) + 1;

	INIT_LIST_HEAD(&list);
repeat:
	ext4_lock_group(sb, group);
	list_for_each_entry_safe(pa, tmp,
				&grp->bb_prealloc_list, pa_group_list) {
		spin_lock(&pa->pa_lock);
		if (atomic_read(&pa->pa_count)) {
			spin_unlock(&pa->pa_lock);
			busy = 1;
			continue;
		}
		if (pa->pa_deleted) {
			spin_unlock(&pa->pa_lock);
			continue;
		}

		/* seems this one can be freed ... */
		pa->pa_deleted = 1;

		/* we can trust pa_free ... */
		free += pa->pa_free;

		spin_unlock(&pa->pa_lock);

		list_del(&pa->pa_group_list);
		list_add(&pa->u.pa_tmp_list, &list);
	}

	/* if we still need more blocks and some PAs were used, try again */
	if (free < needed && busy) {
		busy = 0;
		ext4_unlock_group(sb, group);
		cond_resched();
		goto repeat;
	}

	/* found anything to free? */
	if (list_empty(&list)) {
		BUG_ON(free != 0);
		goto out;
	}

	/* now free all selected PAs */
	list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {

		/* remove from object (inode or locality group) */
		spin_lock(pa->pa_obj_lock);
		list_del_rcu(&pa->pa_inode_list);
		spin_unlock(pa->pa_obj_lock);

		if (pa->pa_type == MB_GROUP_PA)
			ext4_mb_release_group_pa(&e4b, pa);
		else
			ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);

		list_del(&pa->u.pa_tmp_list);
		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
	}

out:
	ext4_unlock_group(sb, group);
	ext4_mb_unload_buddy(&e4b);
	put_bh(bitmap_bh);
	return free;
}

/*
 * releases all non-used preallocated blocks for given inode
 *
 * It's important to discard preallocations under i_data_sem
 * We don't want another block to be served from the prealloc
 * space when we are discarding the inode prealloc space.
 *
 * FIXME!! Make sure it is valid at all the call sites
 */
void ext4_discard_preallocations(struct inode *inode)
{
	struct ext4_inode_info *ei = EXT4_I(inode);
	struct super_block *sb = inode->i_sb;
	struct buffer_head *bitmap_bh = NULL;
	struct ext4_prealloc_space *pa, *tmp;
	ext4_group_t group = 0;
	struct list_head list;
	struct ext4_buddy e4b;
	int err;

	if (!S_ISREG(inode->i_mode)) {
		/*BUG_ON(!list_empty(&ei->i_prealloc_list));*/
		return;
	}

	mb_debug(1, "discard preallocation for inode %lu\n", inode->i_ino);
	trace_ext4_discard_preallocations(inode);

	INIT_LIST_HEAD(&list);

repeat:
	/* first, collect all pa's in the inode */
	spin_lock(&ei->i_prealloc_lock);
	while (!list_empty(&ei->i_prealloc_list)) {
		pa = list_entry(ei->i_prealloc_list.next,
				struct ext4_prealloc_space, pa_inode_list);
		BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);
		spin_lock(&pa->pa_lock);
		if (atomic_read(&pa->pa_count)) {
			/* this shouldn't happen often - nobody should
			 * use preallocation while we're discarding it */
			spin_unlock(&pa->pa_lock);
			spin_unlock(&ei->i_prealloc_lock);
			ext4_msg(sb, KERN_ERR,
				 "uh-oh! used pa while discarding");
			WARN_ON(1);
			schedule_timeout_uninterruptible(HZ);
			goto repeat;

		}
		if (pa->pa_deleted == 0) {
			pa->pa_deleted = 1;
			spin_unlock(&pa->pa_lock);
			list_del_rcu(&pa->pa_inode_list);
			list_add(&pa->u.pa_tmp_list, &list);
			continue;
		}

		/* someone is deleting pa right now */
		spin_unlock(&pa->pa_lock);
		spin_unlock(&ei->i_prealloc_lock);

		/* we have to wait here because pa_deleted
		 * doesn't mean pa is already unlinked from
		 * the list. as we might be called from
		 * ->clear_inode() the inode will get freed
		 * and concurrent thread which is unlinking
		 * pa from inode's list may access already
		 * freed memory, bad-bad-bad */

		/* XXX: if this happens too often, we can
		 * add a flag to force wait only in case
		 * of ->clear_inode(), but not in case of
		 * regular truncate */
		schedule_timeout_uninterruptible(HZ);
		goto repeat;
	}
	spin_unlock(&ei->i_prealloc_lock);

	list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
		BUG_ON(pa->pa_type != MB_INODE_PA);
		group = ext4_get_group_number(sb, pa->pa_pstart);

		err = ext4_mb_load_buddy(sb, group, &e4b);
		if (err) {
			ext4_error(sb, "Error loading buddy information for %u",
					group);
			continue;
		}

		bitmap_bh = ext4_read_block_bitmap(sb, group);
		if (bitmap_bh == NULL) {
			ext4_error(sb, "Error reading block bitmap for %u",
					group);
			ext4_mb_unload_buddy(&e4b);
			continue;
		}

		ext4_lock_group(sb, group);
		list_del(&pa->pa_group_list);
		ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
		ext4_unlock_group(sb, group);

		ext4_mb_unload_buddy(&e4b);
		put_bh(bitmap_bh);

		list_del(&pa->u.pa_tmp_list);
		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
	}
}

#ifdef CONFIG_EXT4_DEBUG
static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
{
	struct super_block *sb = ac->ac_sb;
	ext4_group_t ngroups, i;

	if (!ext4_mballoc_debug ||
	    (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED))
		return;

	ext4_msg(ac->ac_sb, KERN_ERR, "Can't allocate:"
			" Allocation context details:");
	ext4_msg(ac->ac_sb, KERN_ERR, "status %d flags %d",
			ac->ac_status, ac->ac_flags);
	ext4_msg(ac->ac_sb, KERN_ERR, "orig %lu/%lu/%lu@%lu, "
			"goal %lu/%lu/%lu@%lu, "
			"best %lu/%lu/%lu@%lu cr %d",
			(unsigned long)ac->ac_o_ex.fe_group,
			(unsigned long)ac->ac_o_ex.fe_start,
			(unsigned long)ac->ac_o_ex.fe_len,
			(unsigned long)ac->ac_o_ex.fe_logical,
			(unsigned long)ac->ac_g_ex.fe_group,
			(unsigned long)ac->ac_g_ex.fe_start,
			(unsigned long)ac->ac_g_ex.fe_len,
			(unsigned long)ac->ac_g_ex.fe_logical,
			(unsigned long)ac->ac_b_ex.fe_group,
			(unsigned long)ac->ac_b_ex.fe_start,
			(unsigned long)ac->ac_b_ex.fe_len,
			(unsigned long)ac->ac_b_ex.fe_logical,
			(int)ac->ac_criteria);
	ext4_msg(ac->ac_sb, KERN_ERR, "%lu scanned, %d found",
		 ac->ac_ex_scanned, ac->ac_found);
	ext4_msg(ac->ac_sb, KERN_ERR, "groups: ");
	ngroups = ext4_get_groups_count(sb);
	for (i = 0; i < ngroups; i++) {
		struct ext4_group_info *grp = ext4_get_group_info(sb, i);
		struct ext4_prealloc_space *pa;
		ext4_grpblk_t start;
		struct list_head *cur;
		ext4_lock_group(sb, i);
		list_for_each(cur, &grp->bb_prealloc_list) {
			pa = list_entry(cur, struct ext4_prealloc_space,
					pa_group_list);
			spin_lock(&pa->pa_lock);
			ext4_get_group_no_and_offset(sb, pa->pa_pstart,
						     NULL, &start);
			spin_unlock(&pa->pa_lock);
			printk(KERN_ERR "PA:%u:%d:%u \n", i,
			       start, pa->pa_len);
		}
		ext4_unlock_group(sb, i);

		if (grp->bb_free == 0)
			continue;
		printk(KERN_ERR "%u: %d/%d \n",
		       i, grp->bb_free, grp->bb_fragments);
	}
	printk(KERN_ERR "\n");
}
#else
static inline void ext4_mb_show_ac(struct ext4_allocation_context *ac)
{
	return;
}
#endif

/*
 * We use locality group preallocation for small size file. The size of the
 * file is determined by the current size or the resulting size after
 * allocation which ever is larger
 *
 * One can tune this size via /sys/fs/ext4/<partition>/mb_stream_req
 */
static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
	int bsbits = ac->ac_sb->s_blocksize_bits;
	loff_t size, isize;

	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
		return;

	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
		return;

	size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
	isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
		>> bsbits;

	if ((size == isize) &&
	    !ext4_fs_is_busy(sbi) &&
	    (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
		ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
		return;
	}

	if (sbi->s_mb_group_prealloc <= 0) {
		ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
		return;
	}

	/* don't use group allocation for large files */
	size = max(size, isize);
	if (size > sbi->s_mb_stream_request) {
		ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
		return;
	}

	BUG_ON(ac->ac_lg != NULL);
	/*
	 * locality group prealloc space are per cpu. The reason for having
	 * per cpu locality group is to reduce the contention between block
	 * request from multiple CPUs.
	 */
	ac->ac_lg = __this_cpu_ptr(sbi->s_locality_groups);

	/* we're going to use group allocation */
	ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;

	/* serialize all allocations in the group */
	mutex_lock(&ac->ac_lg->lg_mutex);
}
4125 4127
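The policy above reduces to a size comparison against the mb_stream_req tunable: if the larger of the current file size and the size after this allocation exceeds the threshold, the request goes down the per-inode stream path, otherwise it uses the per-CPU locality group. The standalone userspace sketch below (illustrative types and values only, not part of this change) restates that decision in isolation.

/* Minimal sketch of the group-vs-stream policy; the 16-block threshold
 * is just an example value standing in for mb_stream_req. */
#include <stdio.h>

enum alloc_policy { ALLOC_STREAM, ALLOC_GROUP };

static enum alloc_policy pick_policy(unsigned long long request_end_blocks,
				     unsigned long long inode_size_blocks,
				     unsigned long long mb_stream_request)
{
	unsigned long long size = request_end_blocks > inode_size_blocks ?
				  request_end_blocks : inode_size_blocks;

	/* large files avoid group (locality-group) allocation */
	return size > mb_stream_request ? ALLOC_STREAM : ALLOC_GROUP;
}

int main(void)
{
	printf("small file -> %s\n",
	       pick_policy(8, 4, 16) == ALLOC_GROUP ? "group" : "stream");
	printf("large file -> %s\n",
	       pick_policy(64, 128, 16) == ALLOC_GROUP ? "group" : "stream");
	return 0;
}
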
static noinline_for_stack int
ext4_mb_initialize_context(struct ext4_allocation_context *ac,
				struct ext4_allocation_request *ar)
{
	struct super_block *sb = ar->inode->i_sb;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_super_block *es = sbi->s_es;
	ext4_group_t group;
	unsigned int len;
	ext4_fsblk_t goal;
	ext4_grpblk_t block;

	/* we can't allocate > group size */
	len = ar->len;

	/* just a dirty hack to filter too big requests */
	if (len >= EXT4_CLUSTERS_PER_GROUP(sb))
		len = EXT4_CLUSTERS_PER_GROUP(sb);

	/* start searching from the goal */
	goal = ar->goal;
	if (goal < le32_to_cpu(es->s_first_data_block) ||
			goal >= ext4_blocks_count(es))
		goal = le32_to_cpu(es->s_first_data_block);
	ext4_get_group_no_and_offset(sb, goal, &group, &block);

	/* set up allocation goals */
	ac->ac_b_ex.fe_logical = EXT4_LBLK_CMASK(sbi, ar->logical);
	ac->ac_status = AC_STATUS_CONTINUE;
	ac->ac_sb = sb;
	ac->ac_inode = ar->inode;
	ac->ac_o_ex.fe_logical = ac->ac_b_ex.fe_logical;
	ac->ac_o_ex.fe_group = group;
	ac->ac_o_ex.fe_start = block;
	ac->ac_o_ex.fe_len = len;
	ac->ac_g_ex = ac->ac_o_ex;
	ac->ac_flags = ar->flags;

	/* we have to define context: we'll we work with a file or
	 * locality group. this is a policy, actually */
	ext4_mb_group_or_file(ac);

	mb_debug(1, "init ac: %u blocks @ %u, goal %u, flags %x, 2^%d, "
			"left: %u/%u, right %u/%u to %swritable\n",
			(unsigned) ar->len, (unsigned) ar->logical,
			(unsigned) ar->goal, ac->ac_flags, ac->ac_2order,
			(unsigned) ar->lleft, (unsigned) ar->pleft,
			(unsigned) ar->lright, (unsigned) ar->pright,
			atomic_read(&ar->inode->i_writecount) ? "" : "non-");
	return 0;

}

static noinline_for_stack void
ext4_mb_discard_lg_preallocations(struct super_block *sb,
					struct ext4_locality_group *lg,
					int order, int total_entries)
{
	ext4_group_t group = 0;
	struct ext4_buddy e4b;
	struct list_head discard_list;
	struct ext4_prealloc_space *pa, *tmp;

	mb_debug(1, "discard locality group preallocation\n");

	INIT_LIST_HEAD(&discard_list);

	spin_lock(&lg->lg_prealloc_lock);
	list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[order],
						pa_inode_list) {
		spin_lock(&pa->pa_lock);
		if (atomic_read(&pa->pa_count)) {
			/*
			 * This is the pa that we just used
			 * for block allocation. So don't
			 * free that
			 */
			spin_unlock(&pa->pa_lock);
			continue;
		}
		if (pa->pa_deleted) {
			spin_unlock(&pa->pa_lock);
			continue;
		}
		/* only lg prealloc space */
		BUG_ON(pa->pa_type != MB_GROUP_PA);

		/* seems this one can be freed ... */
		pa->pa_deleted = 1;
		spin_unlock(&pa->pa_lock);

		list_del_rcu(&pa->pa_inode_list);
		list_add(&pa->u.pa_tmp_list, &discard_list);

		total_entries--;
		if (total_entries <= 5) {
			/*
			 * we want to keep only 5 entries
			 * allowing it to grow to 8. This
			 * mak sure we don't call discard
			 * soon for this list.
			 */
			break;
		}
	}
	spin_unlock(&lg->lg_prealloc_lock);

	list_for_each_entry_safe(pa, tmp, &discard_list, u.pa_tmp_list) {

		group = ext4_get_group_number(sb, pa->pa_pstart);
		if (ext4_mb_load_buddy(sb, group, &e4b)) {
			ext4_error(sb, "Error loading buddy information for %u",
					group);
			continue;
		}
		ext4_lock_group(sb, group);
		list_del(&pa->pa_group_list);
		ext4_mb_release_group_pa(&e4b, pa);
		ext4_unlock_group(sb, group);

		ext4_mb_unload_buddy(&e4b);
		list_del(&pa->u.pa_tmp_list);
		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
	}
}

/*
 * We have incremented pa_count. So it cannot be freed at this
 * point. Also we hold lg_mutex. So no parallel allocation is
 * possible from this lg. That means pa_free cannot be updated.
 *
 * A parallel ext4_mb_discard_group_preallocations is possible.
 * which can cause the lg_prealloc_list to be updated.
 */

static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
{
	int order, added = 0, lg_prealloc_count = 1;
	struct super_block *sb = ac->ac_sb;
	struct ext4_locality_group *lg = ac->ac_lg;
	struct ext4_prealloc_space *tmp_pa, *pa = ac->ac_pa;

	order = fls(pa->pa_free) - 1;
	if (order > PREALLOC_TB_SIZE - 1)
		/* The max size of hash table is PREALLOC_TB_SIZE */
		order = PREALLOC_TB_SIZE - 1;
	/* Add the prealloc space to lg */
	spin_lock(&lg->lg_prealloc_lock);
	list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order],
						pa_inode_list) {
		spin_lock(&tmp_pa->pa_lock);
		if (tmp_pa->pa_deleted) {
			spin_unlock(&tmp_pa->pa_lock);
			continue;
		}
		if (!added && pa->pa_free < tmp_pa->pa_free) {
			/* Add to the tail of the previous entry */
			list_add_tail_rcu(&pa->pa_inode_list,
						&tmp_pa->pa_inode_list);
			added = 1;
			/*
			 * we want to count the total
			 * number of entries in the list
			 */
		}
		spin_unlock(&tmp_pa->pa_lock);
		lg_prealloc_count++;
	}
	if (!added)
		list_add_tail_rcu(&pa->pa_inode_list,
					&lg->lg_prealloc_list[order]);
	spin_unlock(&lg->lg_prealloc_lock);

	/* Now trim the list to be not more than 8 elements */
	if (lg_prealloc_count > 8) {
		ext4_mb_discard_lg_preallocations(sb, lg,
						  order, lg_prealloc_count);
		return;
	}
	return ;
}

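The bucket a preallocation lands in is simply fls(pa_free) - 1 capped at PREALLOC_TB_SIZE - 1, i.e. the log2 of its remaining free size, so same-sized leftovers end up on the same per-order list. The userspace sketch below (the table size and fls helper are illustrative, not the kernel's own) shows the mapping.

/* Sketch of the order-bucket computation used for lg_prealloc_list. */
#include <stdio.h>

#define PREALLOC_TB_SIZE 10	/* example table size */

static int fls_u32(unsigned int x)	/* highest set bit, 1-based; 0 for x == 0 */
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

static int lg_bucket(unsigned int pa_free)
{
	int order = fls_u32(pa_free) - 1;

	if (order > PREALLOC_TB_SIZE - 1)
		order = PREALLOC_TB_SIZE - 1;	/* cap at the last bucket */
	return order;
}

int main(void)
{
	printf("pa_free=1    -> bucket %d\n", lg_bucket(1));	/* 0 */
	printf("pa_free=500  -> bucket %d\n", lg_bucket(500));	/* 8 */
	printf("pa_free=4096 -> bucket %d\n", lg_bucket(4096));	/* capped at 9 */
	return 0;
}
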
/*
 * release all resource we used in allocation
 */
static int ext4_mb_release_context(struct ext4_allocation_context *ac)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
	struct ext4_prealloc_space *pa = ac->ac_pa;
	if (pa) {
		if (pa->pa_type == MB_GROUP_PA) {
			/* see comment in ext4_mb_use_group_pa() */
			spin_lock(&pa->pa_lock);
			pa->pa_pstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
			pa->pa_lstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
			pa->pa_free -= ac->ac_b_ex.fe_len;
			pa->pa_len -= ac->ac_b_ex.fe_len;
			spin_unlock(&pa->pa_lock);
		}
	}
	if (pa) {
		/*
		 * We want to add the pa to the right bucket.
		 * Remove it from the list and while adding
		 * make sure the list to which we are adding
		 * doesn't grow big.
		 */
		if ((pa->pa_type == MB_GROUP_PA) && likely(pa->pa_free)) {
			spin_lock(pa->pa_obj_lock);
			list_del_rcu(&pa->pa_inode_list);
			spin_unlock(pa->pa_obj_lock);
			ext4_mb_add_n_trim(ac);
		}
		ext4_mb_put_pa(ac, ac->ac_sb, pa);
	}
	if (ac->ac_bitmap_page)
		page_cache_release(ac->ac_bitmap_page);
	if (ac->ac_buddy_page)
		page_cache_release(ac->ac_buddy_page);
	if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
		mutex_unlock(&ac->ac_lg->lg_mutex);
	ext4_mb_collect_stats(ac);
	return 0;
}

static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
{
	ext4_group_t i, ngroups = ext4_get_groups_count(sb);
	int ret;
	int freed = 0;

	trace_ext4_mb_discard_preallocations(sb, needed);
	for (i = 0; i < ngroups && needed > 0; i++) {
		ret = ext4_mb_discard_group_preallocations(sb, i, needed);
		freed += ret;
		needed -= ret;
	}

	return freed;
}

/*
 * Main entry point into mballoc to allocate blocks
 * it tries to use preallocation first, then falls back
 * to usual allocation
 */
ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
				struct ext4_allocation_request *ar, int *errp)
{
	int freed;
	struct ext4_allocation_context *ac = NULL;
	struct ext4_sb_info *sbi;
	struct super_block *sb;
	ext4_fsblk_t block = 0;
	unsigned int inquota = 0;
	unsigned int reserv_clstrs = 0;

	might_sleep();
	sb = ar->inode->i_sb;
	sbi = EXT4_SB(sb);

	trace_ext4_request_blocks(ar);

	/* Allow to use superuser reservation for quota file */
	if (IS_NOQUOTA(ar->inode))
		ar->flags |= EXT4_MB_USE_ROOT_BLOCKS;

	/*
	 * For delayed allocation, we could skip the ENOSPC and
	 * EDQUOT check, as blocks and quotas have been already
	 * reserved when data being copied into pagecache.
	 */
	if (ext4_test_inode_state(ar->inode, EXT4_STATE_DELALLOC_RESERVED))
		ar->flags |= EXT4_MB_DELALLOC_RESERVED;
	else {
		/* Without delayed allocation we need to verify
		 * there is enough free blocks to do block allocation
		 * and verify allocation doesn't exceed the quota limits.
		 */
		while (ar->len &&
			ext4_claim_free_clusters(sbi, ar->len, ar->flags)) {

			/* let others to free the space */
			cond_resched();
			ar->len = ar->len >> 1;
		}
		if (!ar->len) {
			*errp = -ENOSPC;
			return 0;
		}
		reserv_clstrs = ar->len;
		if (ar->flags & EXT4_MB_USE_ROOT_BLOCKS) {
			dquot_alloc_block_nofail(ar->inode,
						 EXT4_C2B(sbi, ar->len));
		} else {
			while (ar->len &&
				dquot_alloc_block(ar->inode,
						  EXT4_C2B(sbi, ar->len))) {

				ar->flags |= EXT4_MB_HINT_NOPREALLOC;
				ar->len--;
			}
		}
		inquota = ar->len;
		if (ar->len == 0) {
			*errp = -EDQUOT;
			goto out;
		}
	}

	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_NOFS);
	if (!ac) {
		ar->len = 0;
		*errp = -ENOMEM;
		goto out;
	}

	*errp = ext4_mb_initialize_context(ac, ar);
	if (*errp) {
		ar->len = 0;
		goto out;
	}

	ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
	if (!ext4_mb_use_preallocated(ac)) {
		ac->ac_op = EXT4_MB_HISTORY_ALLOC;
		ext4_mb_normalize_request(ac, ar);
repeat:
		/* allocate space in core */
		*errp = ext4_mb_regular_allocator(ac);
		if (*errp)
			goto discard_and_exit;

		/* as we've just preallocated more space than
		 * user requested originally, we store allocated
		 * space in a special descriptor */
		if (ac->ac_status == AC_STATUS_FOUND &&
		    ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
			*errp = ext4_mb_new_preallocation(ac);
		if (*errp) {
		discard_and_exit:
			ext4_discard_allocated_blocks(ac);
			goto errout;
		}
	}
	if (likely(ac->ac_status == AC_STATUS_FOUND)) {
		*errp = ext4_mb_mark_diskspace_used(ac, handle, reserv_clstrs);
		if (*errp == -EAGAIN) {
			/*
			 * drop the reference that we took
			 * in ext4_mb_use_best_found
			 */
			ext4_mb_release_context(ac);
			ac->ac_b_ex.fe_group = 0;
			ac->ac_b_ex.fe_start = 0;
			ac->ac_b_ex.fe_len = 0;
			ac->ac_status = AC_STATUS_CONTINUE;
			goto repeat;
		} else if (*errp) {
			ext4_discard_allocated_blocks(ac);
			goto errout;
		} else {
			block = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
			ar->len = ac->ac_b_ex.fe_len;
		}
	} else {
		freed  = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
		if (freed)
			goto repeat;
		*errp = -ENOSPC;
	}

errout:
	if (*errp) {
		ac->ac_b_ex.fe_len = 0;
		ar->len = 0;
		ext4_mb_show_ac(ac);
	}
	ext4_mb_release_context(ac);
out:
	if (ac)
		kmem_cache_free(ext4_ac_cachep, ac);
	if (inquota && ar->len < inquota)
		dquot_free_block(ar->inode, EXT4_C2B(sbi, inquota - ar->len));
	if (!ar->len) {
		if (!ext4_test_inode_state(ar->inode,
					   EXT4_STATE_DELALLOC_RESERVED))
			/* release all the reserved blocks if non delalloc */
			percpu_counter_sub(&sbi->s_dirtyclusters_counter,
						reserv_clstrs);
	}

	trace_ext4_allocate_blocks(ar, (unsigned long long)block);

	return block;
}

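When the non-delalloc path cannot claim the requested number of clusters, it keeps halving the request before giving up with -ENOSPC; only a request that shrinks all the way to zero fails. A small standalone model of that back-off follows (claim() is a stand-in for ext4_claim_free_clusters(), not the kernel helper itself).

/* Model of the "halve ar->len until it can be reserved" loop above. */
#include <stdio.h>

static int claim(unsigned int free_clusters, unsigned int want)
{
	return want <= free_clusters ? 0 : -1;	/* 0 on success, like the kernel helper */
}

static unsigned int shrink_request(unsigned int free_clusters, unsigned int len)
{
	while (len && claim(free_clusters, len))
		len >>= 1;	/* let others free space, then retry smaller */
	return len;		/* 0 means the caller sees -ENOSPC */
}

int main(void)
{
	printf("want 64 with 10 free -> get %u\n", shrink_request(10, 64));	/* 8 */
	printf("want 64 with 0 free  -> get %u\n", shrink_request(0, 64));	/* 0 */
	return 0;
}
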
/*
 * We can merge two free data extents only if the physical blocks
 * are contiguous, AND the extents were freed by the same transaction,
 * AND the blocks are associated with the same group.
 */
static int can_merge(struct ext4_free_data *entry1,
			struct ext4_free_data *entry2)
{
	if ((entry1->efd_tid == entry2->efd_tid) &&
	    (entry1->efd_group == entry2->efd_group) &&
	    ((entry1->efd_start_cluster + entry1->efd_count) == entry2->efd_start_cluster))
		return 1;
	return 0;
}

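The merge rule spelled out in the comment is a three-way AND: same freeing transaction, same block group, physically adjacent. The userspace restatement below uses a stripped-down stand-in for struct ext4_free_data (field names are illustrative) to show which pairs would coalesce.

/* Userspace restatement of the can_merge() condition. */
#include <stdio.h>

struct free_extent {
	unsigned int tid;	/* transaction that freed it */
	unsigned int group;	/* block group */
	unsigned int start;	/* first cluster */
	unsigned int count;	/* number of clusters */
};

static int extents_mergeable(const struct free_extent *a, const struct free_extent *b)
{
	return a->tid == b->tid &&
	       a->group == b->group &&
	       a->start + a->count == b->start;
}

int main(void)
{
	struct free_extent left  = { .tid = 7, .group = 3, .start = 100, .count = 8 };
	struct free_extent right = { .tid = 7, .group = 3, .start = 108, .count = 4 };
	struct free_extent other = { .tid = 9, .group = 3, .start = 112, .count = 4 };

	printf("left+right:  %s\n", extents_mergeable(&left, &right) ? "merge" : "keep apart");
	printf("right+other: %s\n", extents_mergeable(&right, &other) ? "merge" : "keep apart");
	return 0;
}
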
static noinline_for_stack int
ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
		      struct ext4_free_data *new_entry)
{
	ext4_group_t group = e4b->bd_group;
	ext4_grpblk_t cluster;
	struct ext4_free_data *entry;
	struct ext4_group_info *db = e4b->bd_info;
	struct super_block *sb = e4b->bd_sb;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct rb_node **n = &db->bb_free_root.rb_node, *node;
	struct rb_node *parent = NULL, *new_node;

	BUG_ON(!ext4_handle_valid(handle));
	BUG_ON(e4b->bd_bitmap_page == NULL);
	BUG_ON(e4b->bd_buddy_page == NULL);

	new_node = &new_entry->efd_node;
	cluster = new_entry->efd_start_cluster;

	if (!*n) {
		/* first free block exent. We need to
		   protect buddy cache from being freed,
		 * otherwise we'll refresh it from
		 * on-disk bitmap and lose not-yet-available
		 * blocks */
		page_cache_get(e4b->bd_buddy_page);
		page_cache_get(e4b->bd_bitmap_page);
	}
	while (*n) {
		parent = *n;
		entry = rb_entry(parent, struct ext4_free_data, efd_node);
		if (cluster < entry->efd_start_cluster)
			n = &(*n)->rb_left;
		else if (cluster >= (entry->efd_start_cluster + entry->efd_count))
			n = &(*n)->rb_right;
		else {
			ext4_grp_locked_error(sb, group, 0,
				ext4_group_first_block_no(sb, group) +
				EXT4_C2B(sbi, cluster),
				"Block already on to-be-freed list");
			return 0;
		}
	}

	rb_link_node(new_node, parent, n);
	rb_insert_color(new_node, &db->bb_free_root);

	/* Now try to see the extent can be merged to left and right */
	node = rb_prev(new_node);
	if (node) {
		entry = rb_entry(node, struct ext4_free_data, efd_node);
		if (can_merge(entry, new_entry) &&
		    ext4_journal_callback_try_del(handle, &entry->efd_jce)) {
			new_entry->efd_start_cluster = entry->efd_start_cluster;
			new_entry->efd_count += entry->efd_count;
			rb_erase(node, &(db->bb_free_root));
			kmem_cache_free(ext4_free_data_cachep, entry);
		}
	}

	node = rb_next(new_node);
	if (node) {
		entry = rb_entry(node, struct ext4_free_data, efd_node);
		if (can_merge(new_entry, entry) &&
		    ext4_journal_callback_try_del(handle, &entry->efd_jce)) {
			new_entry->efd_count += entry->efd_count;
			rb_erase(node, &(db->bb_free_root));
			kmem_cache_free(ext4_free_data_cachep, entry);
		}
	}
	/* Add the extent to transaction's private list */
	ext4_journal_callback_add(handle, ext4_free_data_callback,
				  &new_entry->efd_jce);
	return 0;
}

/**
 * ext4_free_blocks() -- Free given blocks and update quota
 * @handle:		handle for this transaction
 * @inode:		inode
 * @block:		start physical block to free
 * @count:		number of blocks to count
 * @flags:		flags used by ext4_free_blocks
 */
void ext4_free_blocks(handle_t *handle, struct inode *inode,
		      struct buffer_head *bh, ext4_fsblk_t block,
		      unsigned long count, int flags)
{
	struct buffer_head *bitmap_bh = NULL;
	struct super_block *sb = inode->i_sb;
	struct ext4_group_desc *gdp;
	unsigned int overflow;
	ext4_grpblk_t bit;
	struct buffer_head *gd_bh;
	ext4_group_t block_group;
	struct ext4_sb_info *sbi;
	struct ext4_inode_info *ei = EXT4_I(inode);
	struct ext4_buddy e4b;
	unsigned int count_clusters;
	int err = 0;
	int ret;

	might_sleep();
	if (bh) {
		if (block)
			BUG_ON(block != bh->b_blocknr);
		else
			block = bh->b_blocknr;
	}

	sbi = EXT4_SB(sb);
	if (!(flags & EXT4_FREE_BLOCKS_VALIDATED) &&
	    !ext4_data_block_valid(sbi, block, count)) {
		ext4_error(sb, "Freeing blocks not in datazone - "
			   "block = %llu, count = %lu", block, count);
		goto error_return;
	}

	ext4_debug("freeing block %llu\n", block);
	trace_ext4_free_blocks(inode, block, count, flags);

	if (flags & EXT4_FREE_BLOCKS_FORGET) {
		struct buffer_head *tbh = bh;
		int i;

		BUG_ON(bh && (count > 1));

		for (i = 0; i < count; i++) {
			cond_resched();
			if (!bh)
				tbh = sb_find_get_block(inode->i_sb,
							block + i);
			if (!tbh)
				continue;
			ext4_forget(handle, flags & EXT4_FREE_BLOCKS_METADATA,
				    inode, tbh, block + i);
		}
	}

	/*
	 * We need to make sure we don't reuse the freed block until
	 * after the transaction is committed, which we can do by
	 * treating the block as metadata, below. We make an
	 * exception if the inode is to be written in writeback mode
	 * since writeback mode has weak data consistency guarantees.
	 */
	if (!ext4_should_writeback_data(inode))
		flags |= EXT4_FREE_BLOCKS_METADATA;

	/*
	 * If the extent to be freed does not begin on a cluster
	 * boundary, we need to deal with partial clusters at the
	 * beginning and end of the extent. Normally we will free
	 * blocks at the beginning or the end unless we are explicitly
	 * requested to avoid doing so.
	 */
	overflow = EXT4_PBLK_COFF(sbi, block);
	if (overflow) {
		if (flags & EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER) {
			overflow = sbi->s_cluster_ratio - overflow;
			block += overflow;
			if (count > overflow)
				count -= overflow;
			else
				return;
		} else {
			block -= overflow;
			count += overflow;
		}
	}
	overflow = EXT4_LBLK_COFF(sbi, count);
	if (overflow) {
		if (flags & EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER) {
			if (count > overflow)
				count -= overflow;
			else
				return;
		} else
			count += sbi->s_cluster_ratio - overflow;
	}

do_more:
	overflow = 0;
	ext4_get_group_no_and_offset(sb, block, &block_group, &bit);

	if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(
			ext4_get_group_info(sb, block_group))))
		return;

	/*
	 * Check to see if we are freeing blocks across a group
	 * boundary.
	 */
	if (EXT4_C2B(sbi, bit) + count > EXT4_BLOCKS_PER_GROUP(sb)) {
		overflow = EXT4_C2B(sbi, bit) + count -
			EXT4_BLOCKS_PER_GROUP(sb);
		count -= overflow;
	}
	count_clusters = EXT4_NUM_B2C(sbi, count);
	bitmap_bh = ext4_read_block_bitmap(sb, block_group);
	if (!bitmap_bh) {
		err = -EIO;
		goto error_return;
	}
	gdp = ext4_get_group_desc(sb, block_group, &gd_bh);
	if (!gdp) {
		err = -EIO;
		goto error_return;
	}

	if (in_range(ext4_block_bitmap(sb, gdp), block, count) ||
	    in_range(ext4_inode_bitmap(sb, gdp), block, count) ||
	    in_range(block, ext4_inode_table(sb, gdp),
		     EXT4_SB(sb)->s_itb_per_group) ||
	    in_range(block + count - 1, ext4_inode_table(sb, gdp),
		     EXT4_SB(sb)->s_itb_per_group)) {

		ext4_error(sb, "Freeing blocks in system zone - "
			   "Block = %llu, count = %lu", block, count);
		/* err = 0. ext4_std_error should be a no op */
		goto error_return;
	}

	BUFFER_TRACE(bitmap_bh, "getting write access");
	err = ext4_journal_get_write_access(handle, bitmap_bh);
	if (err)
		goto error_return;

	/*
	 * We are about to modify some metadata. Call the journal APIs
	 * to unshare ->b_data if a currently-committing transaction is
	 * using it
	 */
	BUFFER_TRACE(gd_bh, "get_write_access");
	err = ext4_journal_get_write_access(handle, gd_bh);
	if (err)
		goto error_return;
#ifdef AGGRESSIVE_CHECK
	{
		int i;
		for (i = 0; i < count_clusters; i++)
			BUG_ON(!mb_test_bit(bit + i, bitmap_bh->b_data));
	}
#endif
	trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);

	err = ext4_mb_load_buddy(sb, block_group, &e4b);
	if (err)
		goto error_return;

	if ((flags & EXT4_FREE_BLOCKS_METADATA) && ext4_handle_valid(handle)) {
		struct ext4_free_data *new_entry;
		/*
		 * blocks being freed are metadata. these blocks shouldn't
		 * be used until this transaction is committed
		 */
	retry:
		new_entry = kmem_cache_alloc(ext4_free_data_cachep, GFP_NOFS);
		if (!new_entry) {
			/*
			 * We use a retry loop because
			 * ext4_free_blocks() is not allowed to fail.
			 */
			cond_resched();
			congestion_wait(BLK_RW_ASYNC, HZ/50);
			goto retry;
		}
		new_entry->efd_start_cluster = bit;
		new_entry->efd_group = block_group;
		new_entry->efd_count = count_clusters;
		new_entry->efd_tid = handle->h_transaction->t_tid;

		ext4_lock_group(sb, block_group);
		mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
		ext4_mb_free_metadata(handle, &e4b, new_entry);
	} else {
		/* need to update group_info->bb_free and bitmap
		 * with group lock held. generate_buddy look at
		 * them with group lock_held
		 */
		if (test_opt(sb, DISCARD)) {
			err = ext4_issue_discard(sb, block_group, bit, count);
			if (err && err != -EOPNOTSUPP)
				ext4_msg(sb, KERN_WARNING, "discard request in"
					 " group:%d block:%d count:%lu failed"
					 " with %d", block_group, bit, count,
					 err);
		} else
			EXT4_MB_GRP_CLEAR_TRIMMED(e4b.bd_info);

		ext4_lock_group(sb, block_group);
		mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
		mb_free_blocks(inode, &e4b, bit, count_clusters);
	}

	ret = ext4_free_group_clusters(sb, gdp) + count_clusters;
	ext4_free_group_clusters_set(sb, gdp, ret);
	ext4_block_bitmap_csum_set(sb, block_group, gdp, bitmap_bh);
	ext4_group_desc_csum_set(sb, block_group, gdp);
	ext4_unlock_group(sb, block_group);

	if (sbi->s_log_groups_per_flex) {
		ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
		atomic64_add(count_clusters,
			     &sbi->s_flex_groups[flex_group].free_clusters);
	}

	if (flags & EXT4_FREE_BLOCKS_RESERVE && ei->i_reserved_data_blocks) {
		percpu_counter_add(&sbi->s_dirtyclusters_counter,
				   count_clusters);
		spin_lock(&ei->i_block_reservation_lock);
		if (flags & EXT4_FREE_BLOCKS_METADATA)
			ei->i_reserved_meta_blocks += count_clusters;
		else
			ei->i_reserved_data_blocks += count_clusters;
		spin_unlock(&ei->i_block_reservation_lock);
		if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE))
			dquot_reclaim_block(inode,
					EXT4_C2B(sbi, count_clusters));
	} else if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE))
		dquot_free_block(inode, EXT4_C2B(sbi, count_clusters));
	percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters);

	ext4_mb_unload_buddy(&e4b);

	/* We dirtied the bitmap block */
	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
	err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);

	/* And the group descriptor block */
	BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
	ret = ext4_handle_dirty_metadata(handle, NULL, gd_bh);
	if (!err)
		err = ret;

	if (overflow && !err) {
		block += count;
		count = overflow;
		put_bh(bitmap_bh);
		goto do_more;
	}
error_return:
	brelse(bitmap_bh);
	ext4_std_error(sb, err);
	return;
}

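The partial-cluster handling at the top of ext4_free_blocks() is pure arithmetic: when bigalloc packs several blocks per cluster, an extent that does not start or end on a cluster boundary is widened so whole clusters are freed, unless the caller passed the NOFREE_FIRST/LAST_CLUSTER flags. The sketch below models only the widening branch, with a cluster_ratio of 4 as an example; it is an illustration of the math, not the kernel code path with its flag handling.

/* Arithmetic model of rounding a freed extent out to cluster boundaries. */
#include <stdio.h>

static void widen_to_clusters(unsigned long long *block, unsigned long *count,
			      unsigned int cluster_ratio)
{
	unsigned int off = *block % cluster_ratio;	/* like EXT4_PBLK_COFF() */

	if (off) {			/* round the start down to a cluster */
		*block -= off;
		*count += off;
	}
	off = *count % cluster_ratio;	/* like EXT4_LBLK_COFF() on the length */
	if (off)			/* round the length up to whole clusters */
		*count += cluster_ratio - off;
}

int main(void)
{
	unsigned long long block = 1026;	/* starts 2 blocks into a cluster */
	unsigned long count = 5;

	widen_to_clusters(&block, &count, 4);
	printf("freeing blocks %llu..%llu (%lu blocks, whole clusters)\n",
	       block, block + count - 1, count);	/* 1024..1031, 8 blocks */
	return 0;
}
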
/**
 * ext4_group_add_blocks() -- Add given blocks to an existing group
 * @handle:		handle to this transaction
 * @sb:			super block
 * @block:		start physical block to add to the block group
 * @count:		number of blocks to free
 *
 * This marks the blocks as free in the bitmap and buddy.
 */
int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
			 ext4_fsblk_t block, unsigned long count)
{
	struct buffer_head *bitmap_bh = NULL;
	struct buffer_head *gd_bh;
	ext4_group_t block_group;
	ext4_grpblk_t bit;
	unsigned int i;
	struct ext4_group_desc *desc;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_buddy e4b;
	int err = 0, ret, blk_free_count;
	ext4_grpblk_t blocks_freed;

	ext4_debug("Adding block(s) %llu-%llu\n", block, block + count - 1);

	if (count == 0)
		return 0;

	ext4_get_group_no_and_offset(sb, block, &block_group, &bit);
	/*
	 * Check to see if we are freeing blocks across a group
	 * boundary.
	 */
	if (bit + count > EXT4_BLOCKS_PER_GROUP(sb)) {
		ext4_warning(sb, "too much blocks added to group %u\n",
			     block_group);
		err = -EINVAL;
		goto error_return;
	}

	bitmap_bh = ext4_read_block_bitmap(sb, block_group);
	if (!bitmap_bh) {
		err = -EIO;
		goto error_return;
	}

	desc = ext4_get_group_desc(sb, block_group, &gd_bh);
	if (!desc) {
		err = -EIO;
		goto error_return;
	}

	if (in_range(ext4_block_bitmap(sb, desc), block, count) ||
	    in_range(ext4_inode_bitmap(sb, desc), block, count) ||
	    in_range(block, ext4_inode_table(sb, desc), sbi->s_itb_per_group) ||
	    in_range(block + count - 1, ext4_inode_table(sb, desc),
		     sbi->s_itb_per_group)) {
		ext4_error(sb, "Adding blocks in system zones - "
			   "Block = %llu, count = %lu",
			   block, count);
		err = -EINVAL;
		goto error_return;
	}

	BUFFER_TRACE(bitmap_bh, "getting write access");
	err = ext4_journal_get_write_access(handle, bitmap_bh);
	if (err)
		goto error_return;

	/*
	 * We are about to modify some metadata. Call the journal APIs
	 * to unshare ->b_data if a currently-committing transaction is
	 * using it
	 */
	BUFFER_TRACE(gd_bh, "get_write_access");
	err = ext4_journal_get_write_access(handle, gd_bh);
	if (err)
		goto error_return;

	for (i = 0, blocks_freed = 0; i < count; i++) {
		BUFFER_TRACE(bitmap_bh, "clear bit");
		if (!mb_test_bit(bit + i, bitmap_bh->b_data)) {
			ext4_error(sb, "bit already cleared for block %llu",
				   (ext4_fsblk_t)(block + i));
			BUFFER_TRACE(bitmap_bh, "bit already cleared");
		} else {
			blocks_freed++;
		}
	}

	err = ext4_mb_load_buddy(sb, block_group, &e4b);
	if (err)
		goto error_return;

	/*
	 * need to update group_info->bb_free and bitmap
	 * with group lock held. generate_buddy look at
	 * them with group lock_held
	 */
	ext4_lock_group(sb, block_group);
	mb_clear_bits(bitmap_bh->b_data, bit, count);
	mb_free_blocks(NULL, &e4b, bit, count);
	blk_free_count = blocks_freed + ext4_free_group_clusters(sb, desc);
	ext4_free_group_clusters_set(sb, desc, blk_free_count);
	ext4_block_bitmap_csum_set(sb, block_group, desc, bitmap_bh);
	ext4_group_desc_csum_set(sb, block_group, desc);
	ext4_unlock_group(sb, block_group);
	percpu_counter_add(&sbi->s_freeclusters_counter,
			   EXT4_NUM_B2C(sbi, blocks_freed));

	if (sbi->s_log_groups_per_flex) {
		ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
		atomic64_add(EXT4_NUM_B2C(sbi, blocks_freed),
			     &sbi->s_flex_groups[flex_group].free_clusters);
	}

	ext4_mb_unload_buddy(&e4b);

	/* We dirtied the bitmap block */
	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
	err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);

	/* And the group descriptor block */
	BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
	ret = ext4_handle_dirty_metadata(handle, NULL, gd_bh);
	if (!err)
		err = ret;

error_return:
	brelse(bitmap_bh);
	ext4_std_error(sb, err);
	return err;
}

5020 /** 5022 /**
5021 * ext4_trim_extent -- function to TRIM one single free extent in the group 5023 * ext4_trim_extent -- function to TRIM one single free extent in the group
5022 * @sb: super block for the file system 5024 * @sb: super block for the file system
5023 * @start: starting block of the free extent in the alloc. group 5025 * @start: starting block of the free extent in the alloc. group
5024 * @count: number of blocks to TRIM 5026 * @count: number of blocks to TRIM
5025 * @group: alloc. group we are working with 5027 * @group: alloc. group we are working with
5026 * @e4b: ext4 buddy for the group 5028 * @e4b: ext4 buddy for the group
5027 * 5029 *
5028 * Trim "count" blocks starting at "start" in the "group". To assure that no 5030 * Trim "count" blocks starting at "start" in the "group". To assure that no
5029 * one will allocate those blocks, mark them as used in the buddy bitmap. This must 5031 * one will allocate those blocks, mark them as used in the buddy bitmap. This must
5030 * be called under the group lock. 5032 * be called under the group lock.
5031 */ 5033 */
5032 static int ext4_trim_extent(struct super_block *sb, int start, int count, 5034 static int ext4_trim_extent(struct super_block *sb, int start, int count,
5033 ext4_group_t group, struct ext4_buddy *e4b) 5035 ext4_group_t group, struct ext4_buddy *e4b)
5034 { 5036 {
5035 struct ext4_free_extent ex; 5037 struct ext4_free_extent ex;
5036 int ret = 0; 5038 int ret = 0;
5037 5039
5038 trace_ext4_trim_extent(sb, group, start, count); 5040 trace_ext4_trim_extent(sb, group, start, count);
5039 5041
5040 assert_spin_locked(ext4_group_lock_ptr(sb, group)); 5042 assert_spin_locked(ext4_group_lock_ptr(sb, group));
5041 5043
5042 ex.fe_start = start; 5044 ex.fe_start = start;
5043 ex.fe_group = group; 5045 ex.fe_group = group;
5044 ex.fe_len = count; 5046 ex.fe_len = count;
5045 5047
5046 /* 5048 /*
5047 * Mark blocks used, so no one can reuse them while 5049 * Mark blocks used, so no one can reuse them while
5048 * being trimmed. 5050 * being trimmed.
5049 */ 5051 */
5050 mb_mark_used(e4b, &ex); 5052 mb_mark_used(e4b, &ex);
5051 ext4_unlock_group(sb, group); 5053 ext4_unlock_group(sb, group);
5052 ret = ext4_issue_discard(sb, group, start, count); 5054 ret = ext4_issue_discard(sb, group, start, count);
5053 ext4_lock_group(sb, group); 5055 ext4_lock_group(sb, group);
5054 mb_free_blocks(NULL, e4b, start, ex.fe_len); 5056 mb_free_blocks(NULL, e4b, start, ex.fe_len);
5055 return ret; 5057 return ret;
5056 } 5058 }
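A note on the locking pattern above: ext4_trim_extent() marks the extent as in use in the buddy bitmap while holding the group lock (assert_spin_locked shows it is a spinlock), drops the lock around ext4_issue_discard(), which performs blocking I/O, and re-takes it before handing the blocks back via mb_free_blocks(). Concurrent allocators therefore cannot hand out blocks while they are being discarded, yet the discard itself never runs under the spinlock.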
5057 5059
5058 /** 5060 /**
5059 * ext4_trim_all_free -- function to trim all free space in alloc. group 5061 * ext4_trim_all_free -- function to trim all free space in alloc. group
5060 * @sb: super block for file system 5062 * @sb: super block for file system
5061 * @group: group to be trimmed 5063 * @group: group to be trimmed
5062 * @start: first group block to examine 5064 * @start: first group block to examine
5063 * @max: last group block to examine 5065 * @max: last group block to examine
5064 * @minblocks: minimum extent block count 5066 * @minblocks: minimum extent block count
5065 * 5067 *
5066 * ext4_trim_all_free walks through group's buddy bitmap searching for free 5068 * ext4_trim_all_free walks through group's buddy bitmap searching for free
5067 * extents. When a free extent is found, ext4_trim_extent is called to TRIM 5069 * extents. When a free extent is found, ext4_trim_extent is called to TRIM
5068 * the extent. 5070 * the extent.
5069 * 5071 *
5070 * 5072 *
5071 * ext4_trim_all_free walks through group's block bitmap searching for free 5073 * ext4_trim_all_free walks through group's block bitmap searching for free
5072 * extents. When the free extent is found, mark it as used in group buddy 5074 * extents. When the free extent is found, mark it as used in group buddy
5073 * bitmap. Then issue a TRIM command on this extent and free the extent in 5075 * bitmap. Then issue a TRIM command on this extent and free the extent in
5074 * the group buddy bitmap. This is done until whole group is scanned. 5076 * the group buddy bitmap. This is done until whole group is scanned.
5075 */ 5077 */
5076 static ext4_grpblk_t 5078 static ext4_grpblk_t
5077 ext4_trim_all_free(struct super_block *sb, ext4_group_t group, 5079 ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
5078 ext4_grpblk_t start, ext4_grpblk_t max, 5080 ext4_grpblk_t start, ext4_grpblk_t max,
5079 ext4_grpblk_t minblocks) 5081 ext4_grpblk_t minblocks)
5080 { 5082 {
5081 void *bitmap; 5083 void *bitmap;
5082 ext4_grpblk_t next, count = 0, free_count = 0; 5084 ext4_grpblk_t next, count = 0, free_count = 0;
5083 struct ext4_buddy e4b; 5085 struct ext4_buddy e4b;
5084 int ret = 0; 5086 int ret = 0;
5085 5087
5086 trace_ext4_trim_all_free(sb, group, start, max); 5088 trace_ext4_trim_all_free(sb, group, start, max);
5087 5089
5088 ret = ext4_mb_load_buddy(sb, group, &e4b); 5090 ret = ext4_mb_load_buddy(sb, group, &e4b);
5089 if (ret) { 5091 if (ret) {
5090 ext4_error(sb, "Error in loading buddy " 5092 ext4_error(sb, "Error in loading buddy "
5091 "information for %u", group); 5093 "information for %u", group);
5092 return ret; 5094 return ret;
5093 } 5095 }
5094 bitmap = e4b.bd_bitmap; 5096 bitmap = e4b.bd_bitmap;
5095 5097
5096 ext4_lock_group(sb, group); 5098 ext4_lock_group(sb, group);
5097 if (EXT4_MB_GRP_WAS_TRIMMED(e4b.bd_info) && 5099 if (EXT4_MB_GRP_WAS_TRIMMED(e4b.bd_info) &&
5098 minblocks >= atomic_read(&EXT4_SB(sb)->s_last_trim_minblks)) 5100 minblocks >= atomic_read(&EXT4_SB(sb)->s_last_trim_minblks))
5099 goto out; 5101 goto out;
5100 5102
5101 start = (e4b.bd_info->bb_first_free > start) ? 5103 start = (e4b.bd_info->bb_first_free > start) ?
5102 e4b.bd_info->bb_first_free : start; 5104 e4b.bd_info->bb_first_free : start;
5103 5105
5104 while (start <= max) { 5106 while (start <= max) {
5105 start = mb_find_next_zero_bit(bitmap, max + 1, start); 5107 start = mb_find_next_zero_bit(bitmap, max + 1, start);
5106 if (start > max) 5108 if (start > max)
5107 break; 5109 break;
5108 next = mb_find_next_bit(bitmap, max + 1, start); 5110 next = mb_find_next_bit(bitmap, max + 1, start);
5109 5111
5110 if ((next - start) >= minblocks) { 5112 if ((next - start) >= minblocks) {
5111 ret = ext4_trim_extent(sb, start, 5113 ret = ext4_trim_extent(sb, start,
5112 next - start, group, &e4b); 5114 next - start, group, &e4b);
5113 if (ret && ret != -EOPNOTSUPP) 5115 if (ret && ret != -EOPNOTSUPP)
5114 break; 5116 break;
5115 ret = 0; 5117 ret = 0;
5116 count += next - start; 5118 count += next - start;
5117 } 5119 }
5118 free_count += next - start; 5120 free_count += next - start;
5119 start = next + 1; 5121 start = next + 1;
5120 5122
5121 if (fatal_signal_pending(current)) { 5123 if (fatal_signal_pending(current)) {
5122 count = -ERESTARTSYS; 5124 count = -ERESTARTSYS;
5123 break; 5125 break;
5124 } 5126 }
5125 5127
5126 if (need_resched()) { 5128 if (need_resched()) {
5127 ext4_unlock_group(sb, group); 5129 ext4_unlock_group(sb, group);
5128 cond_resched(); 5130 cond_resched();
5129 ext4_lock_group(sb, group); 5131 ext4_lock_group(sb, group);
5130 } 5132 }
5131 5133
5132 if ((e4b.bd_info->bb_free - free_count) < minblocks) 5134 if ((e4b.bd_info->bb_free - free_count) < minblocks)
5133 break; 5135 break;
5134 } 5136 }
5135 5137
5136 if (!ret) { 5138 if (!ret) {
5137 ret = count; 5139 ret = count;
5138 EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info); 5140 EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info);
5139 } 5141 }
5140 out: 5142 out:
5141 ext4_unlock_group(sb, group); 5143 ext4_unlock_group(sb, group);
5142 ext4_mb_unload_buddy(&e4b); 5144 ext4_mb_unload_buddy(&e4b);
5143 5145
5144 ext4_debug("trimmed %d blocks in the group %d\n", 5146 ext4_debug("trimmed %d blocks in the group %d\n",
5145 count, group); 5147 count, group);
5146 5148
5147 return ret; 5149 return ret;
5148 } 5150 }
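Note that the EXT4_MB_GRP_WAS_TRIMMED / s_last_trim_minblks check near the top of ext4_trim_all_free() lets the whole group be skipped when it has already been trimmed with an equal or finer minimum extent length since the flag was last cleared, so repeated FITRIM requests do not re-discard the same free space.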
5149 5151
5150 /** 5152 /**
5151 * ext4_trim_fs() -- trim ioctl handle function 5153 * ext4_trim_fs() -- trim ioctl handle function
5152 * @sb: superblock for filesystem 5154 * @sb: superblock for filesystem
5153 * @range: fstrim_range structure 5155 * @range: fstrim_range structure
5154 * 5156 *
5155 * start: First Byte to trim 5157 * start: First Byte to trim
5156 * len: number of Bytes to trim from start 5158 * len: number of Bytes to trim from start
5157 * minlen: minimum extent length in Bytes 5159 * minlen: minimum extent length in Bytes
5158 * ext4_trim_fs goes through all allocation groups containing Bytes from 5160 * ext4_trim_fs goes through all allocation groups containing Bytes from
5159 * start to start+len. For each such group the ext4_trim_all_free function 5161 * start to start+len. For each such group the ext4_trim_all_free function
5160 * is invoked to trim all free space. 5162 * is invoked to trim all free space.
5161 */ 5163 */
5162 int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range) 5164 int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
5163 { 5165 {
5164 struct ext4_group_info *grp; 5166 struct ext4_group_info *grp;
5165 ext4_group_t group, first_group, last_group; 5167 ext4_group_t group, first_group, last_group;
5166 ext4_grpblk_t cnt = 0, first_cluster, last_cluster; 5168 ext4_grpblk_t cnt = 0, first_cluster, last_cluster;
5167 uint64_t start, end, minlen, trimmed = 0; 5169 uint64_t start, end, minlen, trimmed = 0;
5168 ext4_fsblk_t first_data_blk = 5170 ext4_fsblk_t first_data_blk =
5169 le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block); 5171 le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
5170 ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es); 5172 ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es);
5171 int ret = 0; 5173 int ret = 0;
5172 5174
5173 start = range->start >> sb->s_blocksize_bits; 5175 start = range->start >> sb->s_blocksize_bits;
5174 end = start + (range->len >> sb->s_blocksize_bits) - 1; 5176 end = start + (range->len >> sb->s_blocksize_bits) - 1;
5175 minlen = EXT4_NUM_B2C(EXT4_SB(sb), 5177 minlen = EXT4_NUM_B2C(EXT4_SB(sb),
5176 range->minlen >> sb->s_blocksize_bits); 5178 range->minlen >> sb->s_blocksize_bits);
5177 5179
5178 if (minlen > EXT4_CLUSTERS_PER_GROUP(sb) || 5180 if (minlen > EXT4_CLUSTERS_PER_GROUP(sb) ||
5179 start >= max_blks || 5181 start >= max_blks ||
5180 range->len < sb->s_blocksize) 5182 range->len < sb->s_blocksize)
5181 return -EINVAL; 5183 return -EINVAL;
5182 if (end >= max_blks) 5184 if (end >= max_blks)
5183 end = max_blks - 1; 5185 end = max_blks - 1;
5184 if (end <= first_data_blk) 5186 if (end <= first_data_blk)
5185 goto out; 5187 goto out;
5186 if (start < first_data_blk) 5188 if (start < first_data_blk)
5187 start = first_data_blk; 5189 start = first_data_blk;
5188 5190
5189 /* Determine first and last group to examine based on start and end */ 5191 /* Determine first and last group to examine based on start and end */
5190 ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) start, 5192 ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) start,
5191 &first_group, &first_cluster); 5193 &first_group, &first_cluster);
5192 ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) end, 5194 ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) end,
5193 &last_group, &last_cluster); 5195 &last_group, &last_cluster);
5194 5196
5195 /* end now represents the last cluster to discard in this group */ 5197 /* end now represents the last cluster to discard in this group */
5196 end = EXT4_CLUSTERS_PER_GROUP(sb) - 1; 5198 end = EXT4_CLUSTERS_PER_GROUP(sb) - 1;
5197 5199
5198 for (group = first_group; group <= last_group; group++) { 5200 for (group = first_group; group <= last_group; group++) {
5199 grp = ext4_get_group_info(sb, group); 5201 grp = ext4_get_group_info(sb, group);
5200 /* We only do this if the grp has never been initialized */ 5202 /* We only do this if the grp has never been initialized */
5201 if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { 5203 if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
5202 ret = ext4_mb_init_group(sb, group); 5204 ret = ext4_mb_init_group(sb, group);
5203 if (ret) 5205 if (ret)
5204 break; 5206 break;
5205 } 5207 }
5206 5208
5207 /* 5209 /*
5208 * For all the groups except the last one, last cluster will 5210 * For all the groups except the last one, last cluster will
5209 * always be EXT4_CLUSTERS_PER_GROUP(sb)-1, so we only need to 5211 * always be EXT4_CLUSTERS_PER_GROUP(sb)-1, so we only need to
5210 * change it for the last group, note that last_cluster is 5212 * change it for the last group, note that last_cluster is
5211 * already computed earlier by ext4_get_group_no_and_offset() 5213 * already computed earlier by ext4_get_group_no_and_offset()
5212 */ 5214 */
5213 if (group == last_group) 5215 if (group == last_group)
5214 end = last_cluster; 5216 end = last_cluster;
5215 5217
5216 if (grp->bb_free >= minlen) { 5218 if (grp->bb_free >= minlen) {
5217 cnt = ext4_trim_all_free(sb, group, first_cluster, 5219 cnt = ext4_trim_all_free(sb, group, first_cluster,
5218 end, minlen); 5220 end, minlen);
5219 if (cnt < 0) { 5221 if (cnt < 0) {
5220 ret = cnt; 5222 ret = cnt;
5221 break; 5223 break;
5222 } 5224 }
5223 trimmed += cnt; 5225 trimmed += cnt;
5224 } 5226 }
5225 5227
5226 /* 5228 /*
5227 * For every group except the first one, we are sure 5229 * For every group except the first one, we are sure
5228 * that the first cluster to discard will be cluster #0. 5230 * that the first cluster to discard will be cluster #0.
5229 */ 5231 */
5230 first_cluster = 0; 5232 first_cluster = 0;
5231 } 5233 }
5232 5234
5233 if (!ret) 5235 if (!ret)
5234 atomic_set(&EXT4_SB(sb)->s_last_trim_minblks, minlen); 5236 atomic_set(&EXT4_SB(sb)->s_last_trim_minblks, minlen);
5235 5237
5236 out: 5238 out:
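For reference, a minimal userspace sketch (not part of this patch) of how ext4_trim_fs() is reached through the FITRIM ioctl. The mount point "/mnt" is a hypothetical example, and start/len/minlen are given in bytes as the doc comment above describes:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

	int main(void)
	{
		struct fstrim_range range;
		int fd = open("/mnt", O_RDONLY);	/* hypothetical ext4 mount point */

		if (fd < 0)
			return 1;

		memset(&range, 0, sizeof(range));
		range.start = 0;		/* first byte to trim */
		range.len = (__u64)-1;		/* trim up to the end of the filesystem */
		range.minlen = 0;		/* no minimum free-extent length */

		if (ioctl(fd, FITRIM, &range) < 0)
			perror("FITRIM");
		else	/* on success the kernel reports the trimmed byte count in range.len */
			printf("%llu bytes trimmed\n", (unsigned long long)range.len);

		close(fd);
		return 0;
	}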
fs/f2fs/checkpoint.c
1 /* 1 /*
2 * fs/f2fs/checkpoint.c 2 * fs/f2fs/checkpoint.c
3 * 3 *
4 * Copyright (c) 2012 Samsung Electronics Co., Ltd. 4 * Copyright (c) 2012 Samsung Electronics Co., Ltd.
5 * http://www.samsung.com/ 5 * http://www.samsung.com/
6 * 6 *
7 * This program is free software; you can redistribute it and/or modify 7 * This program is free software; you can redistribute it and/or modify
8 * it under the terms of the GNU General Public License version 2 as 8 * it under the terms of the GNU General Public License version 2 as
9 * published by the Free Software Foundation. 9 * published by the Free Software Foundation.
10 */ 10 */
11 #include <linux/fs.h> 11 #include <linux/fs.h>
12 #include <linux/bio.h> 12 #include <linux/bio.h>
13 #include <linux/mpage.h> 13 #include <linux/mpage.h>
14 #include <linux/writeback.h> 14 #include <linux/writeback.h>
15 #include <linux/blkdev.h> 15 #include <linux/blkdev.h>
16 #include <linux/f2fs_fs.h> 16 #include <linux/f2fs_fs.h>
17 #include <linux/pagevec.h> 17 #include <linux/pagevec.h>
18 #include <linux/swap.h> 18 #include <linux/swap.h>
19 19
20 #include "f2fs.h" 20 #include "f2fs.h"
21 #include "node.h" 21 #include "node.h"
22 #include "segment.h" 22 #include "segment.h"
23 #include <trace/events/f2fs.h> 23 #include <trace/events/f2fs.h>
24 24
25 static struct kmem_cache *orphan_entry_slab; 25 static struct kmem_cache *orphan_entry_slab;
26 static struct kmem_cache *inode_entry_slab; 26 static struct kmem_cache *inode_entry_slab;
27 27
28 /* 28 /*
29 * We guarantee no failure on the returned page. 29 * We guarantee no failure on the returned page.
30 */ 30 */
31 struct page *grab_meta_page(struct f2fs_sb_info *sbi, pgoff_t index) 31 struct page *grab_meta_page(struct f2fs_sb_info *sbi, pgoff_t index)
32 { 32 {
33 struct address_space *mapping = sbi->meta_inode->i_mapping; 33 struct address_space *mapping = sbi->meta_inode->i_mapping;
34 struct page *page = NULL; 34 struct page *page = NULL;
35 repeat: 35 repeat:
36 page = grab_cache_page(mapping, index); 36 page = grab_cache_page(mapping, index);
37 if (!page) { 37 if (!page) {
38 cond_resched(); 38 cond_resched();
39 goto repeat; 39 goto repeat;
40 } 40 }
41 41
42 /* We wait writeback only inside grab_meta_page() */ 42 /* We wait writeback only inside grab_meta_page() */
43 wait_on_page_writeback(page); 43 wait_on_page_writeback(page);
44 SetPageUptodate(page); 44 SetPageUptodate(page);
45 return page; 45 return page;
46 } 46 }
47 47
48 /* 48 /*
49 * We guarantee no failure on the returned page. 49 * We guarantee no failure on the returned page.
50 */ 50 */
51 struct page *get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index) 51 struct page *get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index)
52 { 52 {
53 struct address_space *mapping = sbi->meta_inode->i_mapping; 53 struct address_space *mapping = sbi->meta_inode->i_mapping;
54 struct page *page; 54 struct page *page;
55 repeat: 55 repeat:
56 page = grab_cache_page(mapping, index); 56 page = grab_cache_page(mapping, index);
57 if (!page) { 57 if (!page) {
58 cond_resched(); 58 cond_resched();
59 goto repeat; 59 goto repeat;
60 } 60 }
61 if (PageUptodate(page)) 61 if (PageUptodate(page))
62 goto out; 62 goto out;
63 63
64 if (f2fs_readpage(sbi, page, index, READ_SYNC)) 64 if (f2fs_readpage(sbi, page, index, READ_SYNC))
65 goto repeat; 65 goto repeat;
66 66
67 lock_page(page); 67 lock_page(page);
68 if (page->mapping != mapping) { 68 if (page->mapping != mapping) {
69 f2fs_put_page(page, 1); 69 f2fs_put_page(page, 1);
70 goto repeat; 70 goto repeat;
71 } 71 }
72 out: 72 out:
73 mark_page_accessed(page);
74 return page; 73 return page;
75 } 74 }
76 75
77 static int f2fs_write_meta_page(struct page *page, 76 static int f2fs_write_meta_page(struct page *page,
78 struct writeback_control *wbc) 77 struct writeback_control *wbc)
79 { 78 {
80 struct inode *inode = page->mapping->host; 79 struct inode *inode = page->mapping->host;
81 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 80 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
82 81
83 /* Should not write any meta pages if any IO error occurred */ 82 /* Should not write any meta pages if any IO error occurred */
84 if (wbc->for_reclaim || 83 if (wbc->for_reclaim ||
85 is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ERROR_FLAG)) { 84 is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ERROR_FLAG)) {
86 dec_page_count(sbi, F2FS_DIRTY_META); 85 dec_page_count(sbi, F2FS_DIRTY_META);
87 wbc->pages_skipped++; 86 wbc->pages_skipped++;
88 set_page_dirty(page); 87 set_page_dirty(page);
89 return AOP_WRITEPAGE_ACTIVATE; 88 return AOP_WRITEPAGE_ACTIVATE;
90 } 89 }
91 90
92 wait_on_page_writeback(page); 91 wait_on_page_writeback(page);
93 92
94 write_meta_page(sbi, page); 93 write_meta_page(sbi, page);
95 dec_page_count(sbi, F2FS_DIRTY_META); 94 dec_page_count(sbi, F2FS_DIRTY_META);
96 unlock_page(page); 95 unlock_page(page);
97 return 0; 96 return 0;
98 } 97 }
99 98
100 static int f2fs_write_meta_pages(struct address_space *mapping, 99 static int f2fs_write_meta_pages(struct address_space *mapping,
101 struct writeback_control *wbc) 100 struct writeback_control *wbc)
102 { 101 {
103 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); 102 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
104 struct block_device *bdev = sbi->sb->s_bdev; 103 struct block_device *bdev = sbi->sb->s_bdev;
105 long written; 104 long written;
106 105
107 if (wbc->for_kupdate) 106 if (wbc->for_kupdate)
108 return 0; 107 return 0;
109 108
110 if (get_pages(sbi, F2FS_DIRTY_META) == 0) 109 if (get_pages(sbi, F2FS_DIRTY_META) == 0)
111 return 0; 110 return 0;
112 111
113 /* if mounting failed, skip writing node pages */ 112 /* if mounting failed, skip writing node pages */
114 mutex_lock(&sbi->cp_mutex); 113 mutex_lock(&sbi->cp_mutex);
115 written = sync_meta_pages(sbi, META, bio_get_nr_vecs(bdev)); 114 written = sync_meta_pages(sbi, META, bio_get_nr_vecs(bdev));
116 mutex_unlock(&sbi->cp_mutex); 115 mutex_unlock(&sbi->cp_mutex);
117 wbc->nr_to_write -= written; 116 wbc->nr_to_write -= written;
118 return 0; 117 return 0;
119 } 118 }
120 119
121 long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type, 120 long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type,
122 long nr_to_write) 121 long nr_to_write)
123 { 122 {
124 struct address_space *mapping = sbi->meta_inode->i_mapping; 123 struct address_space *mapping = sbi->meta_inode->i_mapping;
125 pgoff_t index = 0, end = LONG_MAX; 124 pgoff_t index = 0, end = LONG_MAX;
126 struct pagevec pvec; 125 struct pagevec pvec;
127 long nwritten = 0; 126 long nwritten = 0;
128 struct writeback_control wbc = { 127 struct writeback_control wbc = {
129 .for_reclaim = 0, 128 .for_reclaim = 0,
130 }; 129 };
131 130
132 pagevec_init(&pvec, 0); 131 pagevec_init(&pvec, 0);
133 132
134 while (index <= end) { 133 while (index <= end) {
135 int i, nr_pages; 134 int i, nr_pages;
136 nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, 135 nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
137 PAGECACHE_TAG_DIRTY, 136 PAGECACHE_TAG_DIRTY,
138 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); 137 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
139 if (nr_pages == 0) 138 if (nr_pages == 0)
140 break; 139 break;
141 140
142 for (i = 0; i < nr_pages; i++) { 141 for (i = 0; i < nr_pages; i++) {
143 struct page *page = pvec.pages[i]; 142 struct page *page = pvec.pages[i];
144 lock_page(page); 143 lock_page(page);
145 BUG_ON(page->mapping != mapping); 144 BUG_ON(page->mapping != mapping);
146 BUG_ON(!PageDirty(page)); 145 BUG_ON(!PageDirty(page));
147 clear_page_dirty_for_io(page); 146 clear_page_dirty_for_io(page);
148 if (f2fs_write_meta_page(page, &wbc)) { 147 if (f2fs_write_meta_page(page, &wbc)) {
149 unlock_page(page); 148 unlock_page(page);
150 break; 149 break;
151 } 150 }
152 if (nwritten++ >= nr_to_write) 151 if (nwritten++ >= nr_to_write)
153 break; 152 break;
154 } 153 }
155 pagevec_release(&pvec); 154 pagevec_release(&pvec);
156 cond_resched(); 155 cond_resched();
157 } 156 }
158 157
159 if (nwritten) 158 if (nwritten)
160 f2fs_submit_bio(sbi, type, nr_to_write == LONG_MAX); 159 f2fs_submit_bio(sbi, type, nr_to_write == LONG_MAX);
161 160
162 return nwritten; 161 return nwritten;
163 } 162 }
164 163
165 static int f2fs_set_meta_page_dirty(struct page *page) 164 static int f2fs_set_meta_page_dirty(struct page *page)
166 { 165 {
167 struct address_space *mapping = page->mapping; 166 struct address_space *mapping = page->mapping;
168 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); 167 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
169 168
170 SetPageUptodate(page); 169 SetPageUptodate(page);
171 if (!PageDirty(page)) { 170 if (!PageDirty(page)) {
172 __set_page_dirty_nobuffers(page); 171 __set_page_dirty_nobuffers(page);
173 inc_page_count(sbi, F2FS_DIRTY_META); 172 inc_page_count(sbi, F2FS_DIRTY_META);
174 return 1; 173 return 1;
175 } 174 }
176 return 0; 175 return 0;
177 } 176 }
178 177
179 const struct address_space_operations f2fs_meta_aops = { 178 const struct address_space_operations f2fs_meta_aops = {
180 .writepage = f2fs_write_meta_page, 179 .writepage = f2fs_write_meta_page,
181 .writepages = f2fs_write_meta_pages, 180 .writepages = f2fs_write_meta_pages,
182 .set_page_dirty = f2fs_set_meta_page_dirty, 181 .set_page_dirty = f2fs_set_meta_page_dirty,
183 }; 182 };
184 183
185 int acquire_orphan_inode(struct f2fs_sb_info *sbi) 184 int acquire_orphan_inode(struct f2fs_sb_info *sbi)
186 { 185 {
187 unsigned int max_orphans; 186 unsigned int max_orphans;
188 int err = 0; 187 int err = 0;
189 188
190 /* 189 /*
191 * considering 512 blocks in a segment, 5 blocks are needed for the cp 190 * considering 512 blocks in a segment, 5 blocks are needed for the cp
192 * and log segment summaries. The remaining blocks are used to keep 191 * and log segment summaries. The remaining blocks are used to keep
193 * orphan entries. With one segment reserved for the 192 * orphan entries. With one segment reserved for the
194 * cp pack, we can have at most 1020*507 orphan entries 193 * cp pack, we can have at most 1020*507 orphan entries
195 */ 194 */
196 max_orphans = (sbi->blocks_per_seg - 5) * F2FS_ORPHANS_PER_BLOCK; 195 max_orphans = (sbi->blocks_per_seg - 5) * F2FS_ORPHANS_PER_BLOCK;
197 mutex_lock(&sbi->orphan_inode_mutex); 196 mutex_lock(&sbi->orphan_inode_mutex);
198 if (sbi->n_orphans >= max_orphans) 197 if (sbi->n_orphans >= max_orphans)
199 err = -ENOSPC; 198 err = -ENOSPC;
200 else 199 else
201 sbi->n_orphans++; 200 sbi->n_orphans++;
202 mutex_unlock(&sbi->orphan_inode_mutex); 201 mutex_unlock(&sbi->orphan_inode_mutex);
203 return err; 202 return err;
204 } 203 }
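Worked out with the numbers the comment above assumes (512 blocks per segment, 5 of them needed for the cp and summary blocks, F2FS_ORPHANS_PER_BLOCK = 1020), the limit is max_orphans = (512 - 5) * 1020 = 507 * 1020 = 517,140 entries, matching the 1020*507 figure quoted in the comment.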
205 204
206 void release_orphan_inode(struct f2fs_sb_info *sbi) 205 void release_orphan_inode(struct f2fs_sb_info *sbi)
207 { 206 {
208 mutex_lock(&sbi->orphan_inode_mutex); 207 mutex_lock(&sbi->orphan_inode_mutex);
209 sbi->n_orphans--; 208 sbi->n_orphans--;
210 mutex_unlock(&sbi->orphan_inode_mutex); 209 mutex_unlock(&sbi->orphan_inode_mutex);
211 } 210 }
212 211
213 void add_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) 212 void add_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
214 { 213 {
215 struct list_head *head, *this; 214 struct list_head *head, *this;
216 struct orphan_inode_entry *new = NULL, *orphan = NULL; 215 struct orphan_inode_entry *new = NULL, *orphan = NULL;
217 216
218 mutex_lock(&sbi->orphan_inode_mutex); 217 mutex_lock(&sbi->orphan_inode_mutex);
219 head = &sbi->orphan_inode_list; 218 head = &sbi->orphan_inode_list;
220 list_for_each(this, head) { 219 list_for_each(this, head) {
221 orphan = list_entry(this, struct orphan_inode_entry, list); 220 orphan = list_entry(this, struct orphan_inode_entry, list);
222 if (orphan->ino == ino) 221 if (orphan->ino == ino)
223 goto out; 222 goto out;
224 if (orphan->ino > ino) 223 if (orphan->ino > ino)
225 break; 224 break;
226 orphan = NULL; 225 orphan = NULL;
227 } 226 }
228 retry: 227 retry:
229 new = kmem_cache_alloc(orphan_entry_slab, GFP_ATOMIC); 228 new = kmem_cache_alloc(orphan_entry_slab, GFP_ATOMIC);
230 if (!new) { 229 if (!new) {
231 cond_resched(); 230 cond_resched();
232 goto retry; 231 goto retry;
233 } 232 }
234 new->ino = ino; 233 new->ino = ino;
235 234
236 /* add new_oentry into list which is sorted by inode number */ 235 /* add new_oentry into list which is sorted by inode number */
237 if (orphan) 236 if (orphan)
238 list_add(&new->list, this->prev); 237 list_add(&new->list, this->prev);
239 else 238 else
240 list_add_tail(&new->list, head); 239 list_add_tail(&new->list, head);
241 out: 240 out:
242 mutex_unlock(&sbi->orphan_inode_mutex); 241 mutex_unlock(&sbi->orphan_inode_mutex);
243 } 242 }
244 243
245 void remove_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) 244 void remove_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
246 { 245 {
247 struct list_head *head; 246 struct list_head *head;
248 struct orphan_inode_entry *orphan; 247 struct orphan_inode_entry *orphan;
249 248
250 mutex_lock(&sbi->orphan_inode_mutex); 249 mutex_lock(&sbi->orphan_inode_mutex);
251 head = &sbi->orphan_inode_list; 250 head = &sbi->orphan_inode_list;
252 list_for_each_entry(orphan, head, list) { 251 list_for_each_entry(orphan, head, list) {
253 if (orphan->ino == ino) { 252 if (orphan->ino == ino) {
254 list_del(&orphan->list); 253 list_del(&orphan->list);
255 kmem_cache_free(orphan_entry_slab, orphan); 254 kmem_cache_free(orphan_entry_slab, orphan);
256 sbi->n_orphans--; 255 sbi->n_orphans--;
257 break; 256 break;
258 } 257 }
259 } 258 }
260 mutex_unlock(&sbi->orphan_inode_mutex); 259 mutex_unlock(&sbi->orphan_inode_mutex);
261 } 260 }
262 261
263 static void recover_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) 262 static void recover_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
264 { 263 {
265 struct inode *inode = f2fs_iget(sbi->sb, ino); 264 struct inode *inode = f2fs_iget(sbi->sb, ino);
266 BUG_ON(IS_ERR(inode)); 265 BUG_ON(IS_ERR(inode));
267 clear_nlink(inode); 266 clear_nlink(inode);
268 267
269 /* truncate all the data during iput */ 268 /* truncate all the data during iput */
270 iput(inode); 269 iput(inode);
271 } 270 }
272 271
273 int recover_orphan_inodes(struct f2fs_sb_info *sbi) 272 int recover_orphan_inodes(struct f2fs_sb_info *sbi)
274 { 273 {
275 block_t start_blk, orphan_blkaddr, i, j; 274 block_t start_blk, orphan_blkaddr, i, j;
276 275
277 if (!is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG)) 276 if (!is_set_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG))
278 return 0; 277 return 0;
279 278
280 sbi->por_doing = 1; 279 sbi->por_doing = 1;
281 start_blk = __start_cp_addr(sbi) + 1; 280 start_blk = __start_cp_addr(sbi) + 1;
282 orphan_blkaddr = __start_sum_addr(sbi) - 1; 281 orphan_blkaddr = __start_sum_addr(sbi) - 1;
283 282
284 for (i = 0; i < orphan_blkaddr; i++) { 283 for (i = 0; i < orphan_blkaddr; i++) {
285 struct page *page = get_meta_page(sbi, start_blk + i); 284 struct page *page = get_meta_page(sbi, start_blk + i);
286 struct f2fs_orphan_block *orphan_blk; 285 struct f2fs_orphan_block *orphan_blk;
287 286
288 orphan_blk = (struct f2fs_orphan_block *)page_address(page); 287 orphan_blk = (struct f2fs_orphan_block *)page_address(page);
289 for (j = 0; j < le32_to_cpu(orphan_blk->entry_count); j++) { 288 for (j = 0; j < le32_to_cpu(orphan_blk->entry_count); j++) {
290 nid_t ino = le32_to_cpu(orphan_blk->ino[j]); 289 nid_t ino = le32_to_cpu(orphan_blk->ino[j]);
291 recover_orphan_inode(sbi, ino); 290 recover_orphan_inode(sbi, ino);
292 } 291 }
293 f2fs_put_page(page, 1); 292 f2fs_put_page(page, 1);
294 } 293 }
295 /* clear Orphan Flag */ 294 /* clear Orphan Flag */
296 clear_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG); 295 clear_ckpt_flags(F2FS_CKPT(sbi), CP_ORPHAN_PRESENT_FLAG);
297 sbi->por_doing = 0; 296 sbi->por_doing = 0;
298 return 0; 297 return 0;
299 } 298 }
300 299
301 static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk) 300 static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk)
302 { 301 {
303 struct list_head *head, *this, *next; 302 struct list_head *head, *this, *next;
304 struct f2fs_orphan_block *orphan_blk = NULL; 303 struct f2fs_orphan_block *orphan_blk = NULL;
305 struct page *page = NULL; 304 struct page *page = NULL;
306 unsigned int nentries = 0; 305 unsigned int nentries = 0;
307 unsigned short index = 1; 306 unsigned short index = 1;
308 unsigned short orphan_blocks; 307 unsigned short orphan_blocks;
309 308
310 orphan_blocks = (unsigned short)((sbi->n_orphans + 309 orphan_blocks = (unsigned short)((sbi->n_orphans +
311 (F2FS_ORPHANS_PER_BLOCK - 1)) / F2FS_ORPHANS_PER_BLOCK); 310 (F2FS_ORPHANS_PER_BLOCK - 1)) / F2FS_ORPHANS_PER_BLOCK);
312 311
313 mutex_lock(&sbi->orphan_inode_mutex); 312 mutex_lock(&sbi->orphan_inode_mutex);
314 head = &sbi->orphan_inode_list; 313 head = &sbi->orphan_inode_list;
315 314
316 /* loop over each orphan inode entry and write them into the journal block */ 315 /* loop over each orphan inode entry and write them into the journal block */
317 list_for_each_safe(this, next, head) { 316 list_for_each_safe(this, next, head) {
318 struct orphan_inode_entry *orphan; 317 struct orphan_inode_entry *orphan;
319 318
320 orphan = list_entry(this, struct orphan_inode_entry, list); 319 orphan = list_entry(this, struct orphan_inode_entry, list);
321 320
322 if (nentries == F2FS_ORPHANS_PER_BLOCK) { 321 if (nentries == F2FS_ORPHANS_PER_BLOCK) {
323 /* 322 /*
324 * an orphan block is full (1020 entries), 323 * an orphan block is full (1020 entries),
325 * so we need to flush the current orphan block 324 * so we need to flush the current orphan block
326 * and bring another one into memory 325 * and bring another one into memory
327 */ 326 */
328 orphan_blk->blk_addr = cpu_to_le16(index); 327 orphan_blk->blk_addr = cpu_to_le16(index);
329 orphan_blk->blk_count = cpu_to_le16(orphan_blocks); 328 orphan_blk->blk_count = cpu_to_le16(orphan_blocks);
330 orphan_blk->entry_count = cpu_to_le32(nentries); 329 orphan_blk->entry_count = cpu_to_le32(nentries);
331 set_page_dirty(page); 330 set_page_dirty(page);
332 f2fs_put_page(page, 1); 331 f2fs_put_page(page, 1);
333 index++; 332 index++;
334 start_blk++; 333 start_blk++;
335 nentries = 0; 334 nentries = 0;
336 page = NULL; 335 page = NULL;
337 } 336 }
338 if (page) 337 if (page)
339 goto page_exist; 338 goto page_exist;
340 339
341 page = grab_meta_page(sbi, start_blk); 340 page = grab_meta_page(sbi, start_blk);
342 orphan_blk = (struct f2fs_orphan_block *)page_address(page); 341 orphan_blk = (struct f2fs_orphan_block *)page_address(page);
343 memset(orphan_blk, 0, sizeof(*orphan_blk)); 342 memset(orphan_blk, 0, sizeof(*orphan_blk));
344 page_exist: 343 page_exist:
345 orphan_blk->ino[nentries++] = cpu_to_le32(orphan->ino); 344 orphan_blk->ino[nentries++] = cpu_to_le32(orphan->ino);
346 } 345 }
347 if (!page) 346 if (!page)
348 goto end; 347 goto end;
349 348
350 orphan_blk->blk_addr = cpu_to_le16(index); 349 orphan_blk->blk_addr = cpu_to_le16(index);
351 orphan_blk->blk_count = cpu_to_le16(orphan_blocks); 350 orphan_blk->blk_count = cpu_to_le16(orphan_blocks);
352 orphan_blk->entry_count = cpu_to_le32(nentries); 351 orphan_blk->entry_count = cpu_to_le32(nentries);
353 set_page_dirty(page); 352 set_page_dirty(page);
354 f2fs_put_page(page, 1); 353 f2fs_put_page(page, 1);
355 end: 354 end:
356 mutex_unlock(&sbi->orphan_inode_mutex); 355 mutex_unlock(&sbi->orphan_inode_mutex);
357 } 356 }
358 357
359 static struct page *validate_checkpoint(struct f2fs_sb_info *sbi, 358 static struct page *validate_checkpoint(struct f2fs_sb_info *sbi,
360 block_t cp_addr, unsigned long long *version) 359 block_t cp_addr, unsigned long long *version)
361 { 360 {
362 struct page *cp_page_1, *cp_page_2 = NULL; 361 struct page *cp_page_1, *cp_page_2 = NULL;
363 unsigned long blk_size = sbi->blocksize; 362 unsigned long blk_size = sbi->blocksize;
364 struct f2fs_checkpoint *cp_block; 363 struct f2fs_checkpoint *cp_block;
365 unsigned long long cur_version = 0, pre_version = 0; 364 unsigned long long cur_version = 0, pre_version = 0;
366 size_t crc_offset; 365 size_t crc_offset;
367 __u32 crc = 0; 366 __u32 crc = 0;
368 367
369 /* Read the 1st cp block in this CP pack */ 368 /* Read the 1st cp block in this CP pack */
370 cp_page_1 = get_meta_page(sbi, cp_addr); 369 cp_page_1 = get_meta_page(sbi, cp_addr);
371 370
372 /* get the version number */ 371 /* get the version number */
373 cp_block = (struct f2fs_checkpoint *)page_address(cp_page_1); 372 cp_block = (struct f2fs_checkpoint *)page_address(cp_page_1);
374 crc_offset = le32_to_cpu(cp_block->checksum_offset); 373 crc_offset = le32_to_cpu(cp_block->checksum_offset);
375 if (crc_offset >= blk_size) 374 if (crc_offset >= blk_size)
376 goto invalid_cp1; 375 goto invalid_cp1;
377 376
378 crc = le32_to_cpu(*((__u32 *)((unsigned char *)cp_block + crc_offset))); 377 crc = le32_to_cpu(*((__u32 *)((unsigned char *)cp_block + crc_offset)));
379 if (!f2fs_crc_valid(crc, cp_block, crc_offset)) 378 if (!f2fs_crc_valid(crc, cp_block, crc_offset))
380 goto invalid_cp1; 379 goto invalid_cp1;
381 380
382 pre_version = cur_cp_version(cp_block); 381 pre_version = cur_cp_version(cp_block);
383 382
384 /* Read the 2nd cp block in this CP pack */ 383 /* Read the 2nd cp block in this CP pack */
385 cp_addr += le32_to_cpu(cp_block->cp_pack_total_block_count) - 1; 384 cp_addr += le32_to_cpu(cp_block->cp_pack_total_block_count) - 1;
386 cp_page_2 = get_meta_page(sbi, cp_addr); 385 cp_page_2 = get_meta_page(sbi, cp_addr);
387 386
388 cp_block = (struct f2fs_checkpoint *)page_address(cp_page_2); 387 cp_block = (struct f2fs_checkpoint *)page_address(cp_page_2);
389 crc_offset = le32_to_cpu(cp_block->checksum_offset); 388 crc_offset = le32_to_cpu(cp_block->checksum_offset);
390 if (crc_offset >= blk_size) 389 if (crc_offset >= blk_size)
391 goto invalid_cp2; 390 goto invalid_cp2;
392 391
393 crc = le32_to_cpu(*((__u32 *)((unsigned char *)cp_block + crc_offset))); 392 crc = le32_to_cpu(*((__u32 *)((unsigned char *)cp_block + crc_offset)));
394 if (!f2fs_crc_valid(crc, cp_block, crc_offset)) 393 if (!f2fs_crc_valid(crc, cp_block, crc_offset))
395 goto invalid_cp2; 394 goto invalid_cp2;
396 395
397 cur_version = cur_cp_version(cp_block); 396 cur_version = cur_cp_version(cp_block);
398 397
399 if (cur_version == pre_version) { 398 if (cur_version == pre_version) {
400 *version = cur_version; 399 *version = cur_version;
401 f2fs_put_page(cp_page_2, 1); 400 f2fs_put_page(cp_page_2, 1);
402 return cp_page_1; 401 return cp_page_1;
403 } 402 }
404 invalid_cp2: 403 invalid_cp2:
405 f2fs_put_page(cp_page_2, 1); 404 f2fs_put_page(cp_page_2, 1);
406 invalid_cp1: 405 invalid_cp1:
407 f2fs_put_page(cp_page_1, 1); 406 f2fs_put_page(cp_page_1, 1);
408 return NULL; 407 return NULL;
409 } 408 }
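Note that validate_checkpoint() only returns a CP pack when both copies of the checkpoint block pass their CRC check and carry the same version: the header at cp_addr and the footer located cp_pack_total_block_count - 1 blocks later. A torn or incomplete checkpoint write therefore fails the version comparison and the whole pack is rejected.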
410 409
411 int get_valid_checkpoint(struct f2fs_sb_info *sbi) 410 int get_valid_checkpoint(struct f2fs_sb_info *sbi)
412 { 411 {
413 struct f2fs_checkpoint *cp_block; 412 struct f2fs_checkpoint *cp_block;
414 struct f2fs_super_block *fsb = sbi->raw_super; 413 struct f2fs_super_block *fsb = sbi->raw_super;
415 struct page *cp1, *cp2, *cur_page; 414 struct page *cp1, *cp2, *cur_page;
416 unsigned long blk_size = sbi->blocksize; 415 unsigned long blk_size = sbi->blocksize;
417 unsigned long long cp1_version = 0, cp2_version = 0; 416 unsigned long long cp1_version = 0, cp2_version = 0;
418 unsigned long long cp_start_blk_no; 417 unsigned long long cp_start_blk_no;
419 418
420 sbi->ckpt = kzalloc(blk_size, GFP_KERNEL); 419 sbi->ckpt = kzalloc(blk_size, GFP_KERNEL);
421 if (!sbi->ckpt) 420 if (!sbi->ckpt)
422 return -ENOMEM; 421 return -ENOMEM;
423 /* 422 /*
424 * Finding the valid cp block involves reading both 423 * Finding the valid cp block involves reading both
425 * sets (cp pack 1 and cp pack 2) 424 * sets (cp pack 1 and cp pack 2)
426 */ 425 */
427 cp_start_blk_no = le32_to_cpu(fsb->cp_blkaddr); 426 cp_start_blk_no = le32_to_cpu(fsb->cp_blkaddr);
428 cp1 = validate_checkpoint(sbi, cp_start_blk_no, &cp1_version); 427 cp1 = validate_checkpoint(sbi, cp_start_blk_no, &cp1_version);
429 428
430 /* The second checkpoint pack should start at the next segment */ 429 /* The second checkpoint pack should start at the next segment */
431 cp_start_blk_no += 1 << le32_to_cpu(fsb->log_blocks_per_seg); 430 cp_start_blk_no += 1 << le32_to_cpu(fsb->log_blocks_per_seg);
432 cp2 = validate_checkpoint(sbi, cp_start_blk_no, &cp2_version); 431 cp2 = validate_checkpoint(sbi, cp_start_blk_no, &cp2_version);
433 432
434 if (cp1 && cp2) { 433 if (cp1 && cp2) {
435 if (ver_after(cp2_version, cp1_version)) 434 if (ver_after(cp2_version, cp1_version))
436 cur_page = cp2; 435 cur_page = cp2;
437 else 436 else
438 cur_page = cp1; 437 cur_page = cp1;
439 } else if (cp1) { 438 } else if (cp1) {
440 cur_page = cp1; 439 cur_page = cp1;
441 } else if (cp2) { 440 } else if (cp2) {
442 cur_page = cp2; 441 cur_page = cp2;
443 } else { 442 } else {
444 goto fail_no_cp; 443 goto fail_no_cp;
445 } 444 }
446 445
447 cp_block = (struct f2fs_checkpoint *)page_address(cur_page); 446 cp_block = (struct f2fs_checkpoint *)page_address(cur_page);
448 memcpy(sbi->ckpt, cp_block, blk_size); 447 memcpy(sbi->ckpt, cp_block, blk_size);
449 448
450 f2fs_put_page(cp1, 1); 449 f2fs_put_page(cp1, 1);
451 f2fs_put_page(cp2, 1); 450 f2fs_put_page(cp2, 1);
452 return 0; 451 return 0;
453 452
454 fail_no_cp: 453 fail_no_cp:
455 kfree(sbi->ckpt); 454 kfree(sbi->ckpt);
456 return -EINVAL; 455 return -EINVAL;
457 } 456 }
458 457
459 static int __add_dirty_inode(struct inode *inode, struct dir_inode_entry *new) 458 static int __add_dirty_inode(struct inode *inode, struct dir_inode_entry *new)
460 { 459 {
461 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 460 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
462 struct list_head *head = &sbi->dir_inode_list; 461 struct list_head *head = &sbi->dir_inode_list;
463 struct list_head *this; 462 struct list_head *this;
464 463
465 list_for_each(this, head) { 464 list_for_each(this, head) {
466 struct dir_inode_entry *entry; 465 struct dir_inode_entry *entry;
467 entry = list_entry(this, struct dir_inode_entry, list); 466 entry = list_entry(this, struct dir_inode_entry, list);
468 if (entry->inode == inode) 467 if (entry->inode == inode)
469 return -EEXIST; 468 return -EEXIST;
470 } 469 }
471 list_add_tail(&new->list, head); 470 list_add_tail(&new->list, head);
472 #ifdef CONFIG_F2FS_STAT_FS 471 #ifdef CONFIG_F2FS_STAT_FS
473 sbi->n_dirty_dirs++; 472 sbi->n_dirty_dirs++;
474 #endif 473 #endif
475 return 0; 474 return 0;
476 } 475 }
477 476
478 void set_dirty_dir_page(struct inode *inode, struct page *page) 477 void set_dirty_dir_page(struct inode *inode, struct page *page)
479 { 478 {
480 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 479 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
481 struct dir_inode_entry *new; 480 struct dir_inode_entry *new;
482 481
483 if (!S_ISDIR(inode->i_mode)) 482 if (!S_ISDIR(inode->i_mode))
484 return; 483 return;
485 retry: 484 retry:
486 new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS); 485 new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS);
487 if (!new) { 486 if (!new) {
488 cond_resched(); 487 cond_resched();
489 goto retry; 488 goto retry;
490 } 489 }
491 new->inode = inode; 490 new->inode = inode;
492 INIT_LIST_HEAD(&new->list); 491 INIT_LIST_HEAD(&new->list);
493 492
494 spin_lock(&sbi->dir_inode_lock); 493 spin_lock(&sbi->dir_inode_lock);
495 if (__add_dirty_inode(inode, new)) 494 if (__add_dirty_inode(inode, new))
496 kmem_cache_free(inode_entry_slab, new); 495 kmem_cache_free(inode_entry_slab, new);
497 496
498 inc_page_count(sbi, F2FS_DIRTY_DENTS); 497 inc_page_count(sbi, F2FS_DIRTY_DENTS);
499 inode_inc_dirty_dents(inode); 498 inode_inc_dirty_dents(inode);
500 SetPagePrivate(page); 499 SetPagePrivate(page);
501 spin_unlock(&sbi->dir_inode_lock); 500 spin_unlock(&sbi->dir_inode_lock);
502 } 501 }
503 502
504 void add_dirty_dir_inode(struct inode *inode) 503 void add_dirty_dir_inode(struct inode *inode)
505 { 504 {
506 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 505 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
507 struct dir_inode_entry *new; 506 struct dir_inode_entry *new;
508 retry: 507 retry:
509 new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS); 508 new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS);
510 if (!new) { 509 if (!new) {
511 cond_resched(); 510 cond_resched();
512 goto retry; 511 goto retry;
513 } 512 }
514 new->inode = inode; 513 new->inode = inode;
515 INIT_LIST_HEAD(&new->list); 514 INIT_LIST_HEAD(&new->list);
516 515
517 spin_lock(&sbi->dir_inode_lock); 516 spin_lock(&sbi->dir_inode_lock);
518 if (__add_dirty_inode(inode, new)) 517 if (__add_dirty_inode(inode, new))
519 kmem_cache_free(inode_entry_slab, new); 518 kmem_cache_free(inode_entry_slab, new);
520 spin_unlock(&sbi->dir_inode_lock); 519 spin_unlock(&sbi->dir_inode_lock);
521 } 520 }
522 521
523 void remove_dirty_dir_inode(struct inode *inode) 522 void remove_dirty_dir_inode(struct inode *inode)
524 { 523 {
525 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 524 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
526 struct list_head *head = &sbi->dir_inode_list; 525 struct list_head *head = &sbi->dir_inode_list;
527 struct list_head *this; 526 struct list_head *this;
528 527
529 if (!S_ISDIR(inode->i_mode)) 528 if (!S_ISDIR(inode->i_mode))
530 return; 529 return;
531 530
532 spin_lock(&sbi->dir_inode_lock); 531 spin_lock(&sbi->dir_inode_lock);
533 if (atomic_read(&F2FS_I(inode)->dirty_dents)) { 532 if (atomic_read(&F2FS_I(inode)->dirty_dents)) {
534 spin_unlock(&sbi->dir_inode_lock); 533 spin_unlock(&sbi->dir_inode_lock);
535 return; 534 return;
536 } 535 }
537 536
538 list_for_each(this, head) { 537 list_for_each(this, head) {
539 struct dir_inode_entry *entry; 538 struct dir_inode_entry *entry;
540 entry = list_entry(this, struct dir_inode_entry, list); 539 entry = list_entry(this, struct dir_inode_entry, list);
541 if (entry->inode == inode) { 540 if (entry->inode == inode) {
542 list_del(&entry->list); 541 list_del(&entry->list);
543 kmem_cache_free(inode_entry_slab, entry); 542 kmem_cache_free(inode_entry_slab, entry);
544 #ifdef CONFIG_F2FS_STAT_FS 543 #ifdef CONFIG_F2FS_STAT_FS
545 sbi->n_dirty_dirs--; 544 sbi->n_dirty_dirs--;
546 #endif 545 #endif
547 break; 546 break;
548 } 547 }
549 } 548 }
550 spin_unlock(&sbi->dir_inode_lock); 549 spin_unlock(&sbi->dir_inode_lock);
551 550
552 /* Only from the recovery routine */ 551 /* Only from the recovery routine */
553 if (is_inode_flag_set(F2FS_I(inode), FI_DELAY_IPUT)) { 552 if (is_inode_flag_set(F2FS_I(inode), FI_DELAY_IPUT)) {
554 clear_inode_flag(F2FS_I(inode), FI_DELAY_IPUT); 553 clear_inode_flag(F2FS_I(inode), FI_DELAY_IPUT);
555 iput(inode); 554 iput(inode);
556 } 555 }
557 } 556 }
558 557
559 struct inode *check_dirty_dir_inode(struct f2fs_sb_info *sbi, nid_t ino) 558 struct inode *check_dirty_dir_inode(struct f2fs_sb_info *sbi, nid_t ino)
560 { 559 {
561 struct list_head *head = &sbi->dir_inode_list; 560 struct list_head *head = &sbi->dir_inode_list;
562 struct list_head *this; 561 struct list_head *this;
563 struct inode *inode = NULL; 562 struct inode *inode = NULL;
564 563
565 spin_lock(&sbi->dir_inode_lock); 564 spin_lock(&sbi->dir_inode_lock);
566 list_for_each(this, head) { 565 list_for_each(this, head) {
567 struct dir_inode_entry *entry; 566 struct dir_inode_entry *entry;
568 entry = list_entry(this, struct dir_inode_entry, list); 567 entry = list_entry(this, struct dir_inode_entry, list);
569 if (entry->inode->i_ino == ino) { 568 if (entry->inode->i_ino == ino) {
570 inode = entry->inode; 569 inode = entry->inode;
571 break; 570 break;
572 } 571 }
573 } 572 }
574 spin_unlock(&sbi->dir_inode_lock); 573 spin_unlock(&sbi->dir_inode_lock);
575 return inode; 574 return inode;
576 } 575 }
577 576
578 void sync_dirty_dir_inodes(struct f2fs_sb_info *sbi) 577 void sync_dirty_dir_inodes(struct f2fs_sb_info *sbi)
579 { 578 {
580 struct list_head *head = &sbi->dir_inode_list; 579 struct list_head *head = &sbi->dir_inode_list;
581 struct dir_inode_entry *entry; 580 struct dir_inode_entry *entry;
582 struct inode *inode; 581 struct inode *inode;
583 retry: 582 retry:
584 spin_lock(&sbi->dir_inode_lock); 583 spin_lock(&sbi->dir_inode_lock);
585 if (list_empty(head)) { 584 if (list_empty(head)) {
586 spin_unlock(&sbi->dir_inode_lock); 585 spin_unlock(&sbi->dir_inode_lock);
587 return; 586 return;
588 } 587 }
589 entry = list_entry(head->next, struct dir_inode_entry, list); 588 entry = list_entry(head->next, struct dir_inode_entry, list);
590 inode = igrab(entry->inode); 589 inode = igrab(entry->inode);
591 spin_unlock(&sbi->dir_inode_lock); 590 spin_unlock(&sbi->dir_inode_lock);
592 if (inode) { 591 if (inode) {
593 filemap_flush(inode->i_mapping); 592 filemap_flush(inode->i_mapping);
594 iput(inode); 593 iput(inode);
595 } else { 594 } else {
596 /* 595 /*
597 * We should submit the bio, since there exist several 596 * We should submit the bio, since there exist several
598 * dentry pages under writeback in the freeing inode. 597 * dentry pages under writeback in the freeing inode.
599 */ 598 */
600 f2fs_submit_bio(sbi, DATA, true); 599 f2fs_submit_bio(sbi, DATA, true);
601 } 600 }
602 goto retry; 601 goto retry;
603 } 602 }
604 603
605 /* 604 /*
606 * Freeze all the FS-operations for checkpoint. 605 * Freeze all the FS-operations for checkpoint.
607 */ 606 */
608 static void block_operations(struct f2fs_sb_info *sbi) 607 static void block_operations(struct f2fs_sb_info *sbi)
609 { 608 {
610 struct writeback_control wbc = { 609 struct writeback_control wbc = {
611 .sync_mode = WB_SYNC_ALL, 610 .sync_mode = WB_SYNC_ALL,
612 .nr_to_write = LONG_MAX, 611 .nr_to_write = LONG_MAX,
613 .for_reclaim = 0, 612 .for_reclaim = 0,
614 }; 613 };
615 struct blk_plug plug; 614 struct blk_plug plug;
616 615
617 blk_start_plug(&plug); 616 blk_start_plug(&plug);
618 617
619 retry_flush_dents: 618 retry_flush_dents:
620 mutex_lock_all(sbi); 619 mutex_lock_all(sbi);
621 620
622 /* write all the dirty dentry pages */ 621 /* write all the dirty dentry pages */
623 if (get_pages(sbi, F2FS_DIRTY_DENTS)) { 622 if (get_pages(sbi, F2FS_DIRTY_DENTS)) {
624 mutex_unlock_all(sbi); 623 mutex_unlock_all(sbi);
625 sync_dirty_dir_inodes(sbi); 624 sync_dirty_dir_inodes(sbi);
626 goto retry_flush_dents; 625 goto retry_flush_dents;
627 } 626 }
628 627
629 /* 628 /*
630 * POR: we should ensure that there are no dirty node pages 629 * POR: we should ensure that there are no dirty node pages
631 * until finishing nat/sit flush. 630 * until finishing nat/sit flush.
632 */ 631 */
633 retry_flush_nodes: 632 retry_flush_nodes:
634 mutex_lock(&sbi->node_write); 633 mutex_lock(&sbi->node_write);
635 634
636 if (get_pages(sbi, F2FS_DIRTY_NODES)) { 635 if (get_pages(sbi, F2FS_DIRTY_NODES)) {
637 mutex_unlock(&sbi->node_write); 636 mutex_unlock(&sbi->node_write);
638 sync_node_pages(sbi, 0, &wbc); 637 sync_node_pages(sbi, 0, &wbc);
639 goto retry_flush_nodes; 638 goto retry_flush_nodes;
640 } 639 }
641 blk_finish_plug(&plug); 640 blk_finish_plug(&plug);
642 } 641 }
643 642
644 static void unblock_operations(struct f2fs_sb_info *sbi) 643 static void unblock_operations(struct f2fs_sb_info *sbi)
645 { 644 {
646 mutex_unlock(&sbi->node_write); 645 mutex_unlock(&sbi->node_write);
647 mutex_unlock_all(sbi); 646 mutex_unlock_all(sbi);
648 } 647 }
649 648
650 static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) 649 static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
651 { 650 {
652 struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); 651 struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
653 nid_t last_nid = 0; 652 nid_t last_nid = 0;
654 block_t start_blk; 653 block_t start_blk;
655 struct page *cp_page; 654 struct page *cp_page;
656 unsigned int data_sum_blocks, orphan_blocks; 655 unsigned int data_sum_blocks, orphan_blocks;
657 __u32 crc32 = 0; 656 __u32 crc32 = 0;
658 void *kaddr; 657 void *kaddr;
659 int i; 658 int i;
660 659
661 /* Flush all the NAT/SIT pages */ 660 /* Flush all the NAT/SIT pages */
662 while (get_pages(sbi, F2FS_DIRTY_META)) 661 while (get_pages(sbi, F2FS_DIRTY_META))
663 sync_meta_pages(sbi, META, LONG_MAX); 662 sync_meta_pages(sbi, META, LONG_MAX);
664 663
665 next_free_nid(sbi, &last_nid); 664 next_free_nid(sbi, &last_nid);
666 665
667 /* 666 /*
668 * modify checkpoint 667 * modify checkpoint
669 * version number is already updated 668 * version number is already updated
670 */ 669 */
671 ckpt->elapsed_time = cpu_to_le64(get_mtime(sbi)); 670 ckpt->elapsed_time = cpu_to_le64(get_mtime(sbi));
672 ckpt->valid_block_count = cpu_to_le64(valid_user_blocks(sbi)); 671 ckpt->valid_block_count = cpu_to_le64(valid_user_blocks(sbi));
673 ckpt->free_segment_count = cpu_to_le32(free_segments(sbi)); 672 ckpt->free_segment_count = cpu_to_le32(free_segments(sbi));
674 for (i = 0; i < 3; i++) { 673 for (i = 0; i < 3; i++) {
675 ckpt->cur_node_segno[i] = 674 ckpt->cur_node_segno[i] =
676 cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_NODE)); 675 cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_NODE));
677 ckpt->cur_node_blkoff[i] = 676 ckpt->cur_node_blkoff[i] =
678 cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_NODE)); 677 cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_NODE));
679 ckpt->alloc_type[i + CURSEG_HOT_NODE] = 678 ckpt->alloc_type[i + CURSEG_HOT_NODE] =
680 curseg_alloc_type(sbi, i + CURSEG_HOT_NODE); 679 curseg_alloc_type(sbi, i + CURSEG_HOT_NODE);
681 } 680 }
682 for (i = 0; i < 3; i++) { 681 for (i = 0; i < 3; i++) {
683 ckpt->cur_data_segno[i] = 682 ckpt->cur_data_segno[i] =
684 cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_DATA)); 683 cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_DATA));
685 ckpt->cur_data_blkoff[i] = 684 ckpt->cur_data_blkoff[i] =
686 cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_DATA)); 685 cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_DATA));
687 ckpt->alloc_type[i + CURSEG_HOT_DATA] = 686 ckpt->alloc_type[i + CURSEG_HOT_DATA] =
688 curseg_alloc_type(sbi, i + CURSEG_HOT_DATA); 687 curseg_alloc_type(sbi, i + CURSEG_HOT_DATA);
689 } 688 }
690 689
691 ckpt->valid_node_count = cpu_to_le32(valid_node_count(sbi)); 690 ckpt->valid_node_count = cpu_to_le32(valid_node_count(sbi));
692 ckpt->valid_inode_count = cpu_to_le32(valid_inode_count(sbi)); 691 ckpt->valid_inode_count = cpu_to_le32(valid_inode_count(sbi));
693 ckpt->next_free_nid = cpu_to_le32(last_nid); 692 ckpt->next_free_nid = cpu_to_le32(last_nid);
694 693
695 /* 2 cp + n data seg summary + orphan inode blocks */ 694 /* 2 cp + n data seg summary + orphan inode blocks */
696 data_sum_blocks = npages_for_summary_flush(sbi); 695 data_sum_blocks = npages_for_summary_flush(sbi);
697 if (data_sum_blocks < 3) 696 if (data_sum_blocks < 3)
698 set_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG); 697 set_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG);
699 else 698 else
700 clear_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG); 699 clear_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG);
701 700
702 orphan_blocks = (sbi->n_orphans + F2FS_ORPHANS_PER_BLOCK - 1) 701 orphan_blocks = (sbi->n_orphans + F2FS_ORPHANS_PER_BLOCK - 1)
703 / F2FS_ORPHANS_PER_BLOCK; 702 / F2FS_ORPHANS_PER_BLOCK;
704 ckpt->cp_pack_start_sum = cpu_to_le32(1 + orphan_blocks); 703 ckpt->cp_pack_start_sum = cpu_to_le32(1 + orphan_blocks);
705 704
706 if (is_umount) { 705 if (is_umount) {
707 set_ckpt_flags(ckpt, CP_UMOUNT_FLAG); 706 set_ckpt_flags(ckpt, CP_UMOUNT_FLAG);
708 ckpt->cp_pack_total_block_count = cpu_to_le32(2 + 707 ckpt->cp_pack_total_block_count = cpu_to_le32(2 +
709 data_sum_blocks + orphan_blocks + NR_CURSEG_NODE_TYPE); 708 data_sum_blocks + orphan_blocks + NR_CURSEG_NODE_TYPE);
710 } else { 709 } else {
711 clear_ckpt_flags(ckpt, CP_UMOUNT_FLAG); 710 clear_ckpt_flags(ckpt, CP_UMOUNT_FLAG);
712 ckpt->cp_pack_total_block_count = cpu_to_le32(2 + 711 ckpt->cp_pack_total_block_count = cpu_to_le32(2 +
713 data_sum_blocks + orphan_blocks); 712 data_sum_blocks + orphan_blocks);
714 } 713 }
715 714
716 if (sbi->n_orphans) 715 if (sbi->n_orphans)
717 set_ckpt_flags(ckpt, CP_ORPHAN_PRESENT_FLAG); 716 set_ckpt_flags(ckpt, CP_ORPHAN_PRESENT_FLAG);
718 else 717 else
719 clear_ckpt_flags(ckpt, CP_ORPHAN_PRESENT_FLAG); 718 clear_ckpt_flags(ckpt, CP_ORPHAN_PRESENT_FLAG);
720 719
721 /* update SIT/NAT bitmap */ 720 /* update SIT/NAT bitmap */
722 get_sit_bitmap(sbi, __bitmap_ptr(sbi, SIT_BITMAP)); 721 get_sit_bitmap(sbi, __bitmap_ptr(sbi, SIT_BITMAP));
723 get_nat_bitmap(sbi, __bitmap_ptr(sbi, NAT_BITMAP)); 722 get_nat_bitmap(sbi, __bitmap_ptr(sbi, NAT_BITMAP));
724 723
725 crc32 = f2fs_crc32(ckpt, le32_to_cpu(ckpt->checksum_offset)); 724 crc32 = f2fs_crc32(ckpt, le32_to_cpu(ckpt->checksum_offset));
726 *((__le32 *)((unsigned char *)ckpt + 725 *((__le32 *)((unsigned char *)ckpt +
727 le32_to_cpu(ckpt->checksum_offset))) 726 le32_to_cpu(ckpt->checksum_offset)))
728 = cpu_to_le32(crc32); 727 = cpu_to_le32(crc32);
729 728
730 start_blk = __start_cp_addr(sbi); 729 start_blk = __start_cp_addr(sbi);
731 730
732 /* write out checkpoint buffer at block 0 */ 731 /* write out checkpoint buffer at block 0 */
733 cp_page = grab_meta_page(sbi, start_blk++); 732 cp_page = grab_meta_page(sbi, start_blk++);
734 kaddr = page_address(cp_page); 733 kaddr = page_address(cp_page);
735 memcpy(kaddr, ckpt, (1 << sbi->log_blocksize)); 734 memcpy(kaddr, ckpt, (1 << sbi->log_blocksize));
736 set_page_dirty(cp_page); 735 set_page_dirty(cp_page);
737 f2fs_put_page(cp_page, 1); 736 f2fs_put_page(cp_page, 1);
738 737
739 if (sbi->n_orphans) { 738 if (sbi->n_orphans) {
740 write_orphan_inodes(sbi, start_blk); 739 write_orphan_inodes(sbi, start_blk);
741 start_blk += orphan_blocks; 740 start_blk += orphan_blocks;
742 } 741 }
743 742
744 write_data_summaries(sbi, start_blk); 743 write_data_summaries(sbi, start_blk);
745 start_blk += data_sum_blocks; 744 start_blk += data_sum_blocks;
746 if (is_umount) { 745 if (is_umount) {
747 write_node_summaries(sbi, start_blk); 746 write_node_summaries(sbi, start_blk);
748 start_blk += NR_CURSEG_NODE_TYPE; 747 start_blk += NR_CURSEG_NODE_TYPE;
749 } 748 }
750 749
751 /* writeout checkpoint block */ 750 /* writeout checkpoint block */
752 cp_page = grab_meta_page(sbi, start_blk); 751 cp_page = grab_meta_page(sbi, start_blk);
753 kaddr = page_address(cp_page); 752 kaddr = page_address(cp_page);
754 memcpy(kaddr, ckpt, (1 << sbi->log_blocksize)); 753 memcpy(kaddr, ckpt, (1 << sbi->log_blocksize));
755 set_page_dirty(cp_page); 754 set_page_dirty(cp_page);
756 f2fs_put_page(cp_page, 1); 755 f2fs_put_page(cp_page, 1);
757 756
758 /* wait for previously submitted node/meta page writeback */ 757 /* wait for previously submitted node/meta page writeback */
759 while (get_pages(sbi, F2FS_WRITEBACK)) 758 while (get_pages(sbi, F2FS_WRITEBACK))
760 congestion_wait(BLK_RW_ASYNC, HZ / 50); 759 congestion_wait(BLK_RW_ASYNC, HZ / 50);
761 760
762 filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX); 761 filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX);
763 filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX); 762 filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX);
764 763
765 /* update user_block_counts */ 764 /* update user_block_counts */
766 sbi->last_valid_block_count = sbi->total_valid_block_count; 765 sbi->last_valid_block_count = sbi->total_valid_block_count;
767 sbi->alloc_valid_block_count = 0; 766 sbi->alloc_valid_block_count = 0;
768 767
769 /* Here, we have only one bio carrying the CP pack */ 768 /* Here, we have only one bio carrying the CP pack */
770 sync_meta_pages(sbi, META_FLUSH, LONG_MAX); 769 sync_meta_pages(sbi, META_FLUSH, LONG_MAX);
771 770
772 if (!is_set_ckpt_flags(ckpt, CP_ERROR_FLAG)) { 771 if (!is_set_ckpt_flags(ckpt, CP_ERROR_FLAG)) {
773 clear_prefree_segments(sbi); 772 clear_prefree_segments(sbi);
774 F2FS_RESET_SB_DIRT(sbi); 773 F2FS_RESET_SB_DIRT(sbi);
775 } 774 }
776 } 775 }
777 776
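do_checkpoint() above seals the checkpoint pack by computing a CRC over the checkpoint structure up to checksum_offset and then storing that CRC, converted to little-endian, at the very offset the structure advertises. The user-space sketch below shows only that self-describing-checksum pattern; struct toy_ckpt and toy_crc() are invented stand-ins, not the on-disk f2fs layout or the kernel's f2fs_crc32().

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct toy_ckpt {
	uint64_t version;
	uint32_t total_blocks;
	uint32_t checksum_offset;	/* where the CRC is stored */
	uint32_t crc;			/* lives at checksum_offset */
};

static uint32_t toy_crc(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = ~0u;

	while (len--)			/* simplistic stand-in for f2fs_crc32() */
		crc = (crc << 5) + crc + *p++;
	return crc;
}

int main(void)
{
	struct toy_ckpt ckpt = {
		.version = 42,
		.total_blocks = 128,
		.checksum_offset = offsetof(struct toy_ckpt, crc),
	};
	/* CRC covers everything before checksum_offset ... */
	uint32_t crc = toy_crc(&ckpt, ckpt.checksum_offset);

	/* ... and is stored at the offset the header itself advertises
	 * (the kernel additionally converts it to little-endian first) */
	memcpy((unsigned char *)&ckpt + ckpt.checksum_offset, &crc, sizeof(crc));
	printf("crc=0x%08x stored at offset %u\n",
	       (unsigned)ckpt.crc, (unsigned)ckpt.checksum_offset);
	return 0;
}

The point of the pattern is that a reader can validate the block with nothing but the block itself: read checksum_offset, recompute the CRC over the bytes before it, and compare.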
778 /* 777 /*
779 * We guarantee that this checkpoint procedure will not fail. 778 * We guarantee that this checkpoint procedure will not fail.
780 */ 779 */
781 void write_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) 780 void write_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
782 { 781 {
783 struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); 782 struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
784 unsigned long long ckpt_ver; 783 unsigned long long ckpt_ver;
785 784
786 trace_f2fs_write_checkpoint(sbi->sb, is_umount, "start block_ops"); 785 trace_f2fs_write_checkpoint(sbi->sb, is_umount, "start block_ops");
787 786
788 mutex_lock(&sbi->cp_mutex); 787 mutex_lock(&sbi->cp_mutex);
789 block_operations(sbi); 788 block_operations(sbi);
790 789
791 trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish block_ops"); 790 trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish block_ops");
792 791
793 f2fs_submit_bio(sbi, DATA, true); 792 f2fs_submit_bio(sbi, DATA, true);
794 f2fs_submit_bio(sbi, NODE, true); 793 f2fs_submit_bio(sbi, NODE, true);
795 f2fs_submit_bio(sbi, META, true); 794 f2fs_submit_bio(sbi, META, true);
796 795
797 /* 796 /*
798 * update checkpoint pack index 797 * update checkpoint pack index
799 * Increase the version number so that 798 * Increase the version number so that
800 * SIT entries and seg summaries are written in the correct place 799 * SIT entries and seg summaries are written in the correct place
801 */ 800 */
802 ckpt_ver = cur_cp_version(ckpt); 801 ckpt_ver = cur_cp_version(ckpt);
803 ckpt->checkpoint_ver = cpu_to_le64(++ckpt_ver); 802 ckpt->checkpoint_ver = cpu_to_le64(++ckpt_ver);
804 803
805 /* write cached NAT/SIT entries to NAT/SIT area */ 804 /* write cached NAT/SIT entries to NAT/SIT area */
806 flush_nat_entries(sbi); 805 flush_nat_entries(sbi);
807 flush_sit_entries(sbi); 806 flush_sit_entries(sbi);
808 807
809 /* unlock all the fs_lock[] in do_checkpoint() */ 808 /* unlock all the fs_lock[] in do_checkpoint() */
810 do_checkpoint(sbi, is_umount); 809 do_checkpoint(sbi, is_umount);
811 810
812 unblock_operations(sbi); 811 unblock_operations(sbi);
813 mutex_unlock(&sbi->cp_mutex); 812 mutex_unlock(&sbi->cp_mutex);
814 813
815 trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish checkpoint"); 814 trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish checkpoint");
816 } 815 }
817 816
818 void init_orphan_info(struct f2fs_sb_info *sbi) 817 void init_orphan_info(struct f2fs_sb_info *sbi)
819 { 818 {
820 mutex_init(&sbi->orphan_inode_mutex); 819 mutex_init(&sbi->orphan_inode_mutex);
821 INIT_LIST_HEAD(&sbi->orphan_inode_list); 820 INIT_LIST_HEAD(&sbi->orphan_inode_list);
822 sbi->n_orphans = 0; 821 sbi->n_orphans = 0;
823 } 822 }
824 823
825 int __init create_checkpoint_caches(void) 824 int __init create_checkpoint_caches(void)
826 { 825 {
827 orphan_entry_slab = f2fs_kmem_cache_create("f2fs_orphan_entry", 826 orphan_entry_slab = f2fs_kmem_cache_create("f2fs_orphan_entry",
828 sizeof(struct orphan_inode_entry), NULL); 827 sizeof(struct orphan_inode_entry), NULL);
829 if (unlikely(!orphan_entry_slab)) 828 if (unlikely(!orphan_entry_slab))
830 return -ENOMEM; 829 return -ENOMEM;
831 inode_entry_slab = f2fs_kmem_cache_create("f2fs_dirty_dir_entry", 830 inode_entry_slab = f2fs_kmem_cache_create("f2fs_dirty_dir_entry",
832 sizeof(struct dir_inode_entry), NULL); 831 sizeof(struct dir_inode_entry), NULL);
833 if (unlikely(!inode_entry_slab)) { 832 if (unlikely(!inode_entry_slab)) {
834 kmem_cache_destroy(orphan_entry_slab); 833 kmem_cache_destroy(orphan_entry_slab);
835 return -ENOMEM; 834 return -ENOMEM;
836 } 835 }
837 return 0; 836 return 0;
838 } 837 }
839 838
840 void destroy_checkpoint_caches(void) 839 void destroy_checkpoint_caches(void)
841 { 840 {
842 kmem_cache_destroy(orphan_entry_slab); 841 kmem_cache_destroy(orphan_entry_slab);
843 kmem_cache_destroy(inode_entry_slab); 842 kmem_cache_destroy(inode_entry_slab);
844 } 843 }
845 844
1 /* 1 /*
2 * fs/f2fs/node.c 2 * fs/f2fs/node.c
3 * 3 *
4 * Copyright (c) 2012 Samsung Electronics Co., Ltd. 4 * Copyright (c) 2012 Samsung Electronics Co., Ltd.
5 * http://www.samsung.com/ 5 * http://www.samsung.com/
6 * 6 *
7 * This program is free software; you can redistribute it and/or modify 7 * This program is free software; you can redistribute it and/or modify
8 * it under the terms of the GNU General Public License version 2 as 8 * it under the terms of the GNU General Public License version 2 as
9 * published by the Free Software Foundation. 9 * published by the Free Software Foundation.
10 */ 10 */
11 #include <linux/fs.h> 11 #include <linux/fs.h>
12 #include <linux/f2fs_fs.h> 12 #include <linux/f2fs_fs.h>
13 #include <linux/mpage.h> 13 #include <linux/mpage.h>
14 #include <linux/backing-dev.h> 14 #include <linux/backing-dev.h>
15 #include <linux/blkdev.h> 15 #include <linux/blkdev.h>
16 #include <linux/pagevec.h> 16 #include <linux/pagevec.h>
17 #include <linux/swap.h> 17 #include <linux/swap.h>
18 18
19 #include "f2fs.h" 19 #include "f2fs.h"
20 #include "node.h" 20 #include "node.h"
21 #include "segment.h" 21 #include "segment.h"
22 #include <trace/events/f2fs.h> 22 #include <trace/events/f2fs.h>
23 23
24 static struct kmem_cache *nat_entry_slab; 24 static struct kmem_cache *nat_entry_slab;
25 static struct kmem_cache *free_nid_slab; 25 static struct kmem_cache *free_nid_slab;
26 26
27 static void clear_node_page_dirty(struct page *page) 27 static void clear_node_page_dirty(struct page *page)
28 { 28 {
29 struct address_space *mapping = page->mapping; 29 struct address_space *mapping = page->mapping;
30 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); 30 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
31 unsigned int long flags; 31 unsigned int long flags;
32 32
33 if (PageDirty(page)) { 33 if (PageDirty(page)) {
34 spin_lock_irqsave(&mapping->tree_lock, flags); 34 spin_lock_irqsave(&mapping->tree_lock, flags);
35 radix_tree_tag_clear(&mapping->page_tree, 35 radix_tree_tag_clear(&mapping->page_tree,
36 page_index(page), 36 page_index(page),
37 PAGECACHE_TAG_DIRTY); 37 PAGECACHE_TAG_DIRTY);
38 spin_unlock_irqrestore(&mapping->tree_lock, flags); 38 spin_unlock_irqrestore(&mapping->tree_lock, flags);
39 39
40 clear_page_dirty_for_io(page); 40 clear_page_dirty_for_io(page);
41 dec_page_count(sbi, F2FS_DIRTY_NODES); 41 dec_page_count(sbi, F2FS_DIRTY_NODES);
42 } 42 }
43 ClearPageUptodate(page); 43 ClearPageUptodate(page);
44 } 44 }
45 45
46 static struct page *get_current_nat_page(struct f2fs_sb_info *sbi, nid_t nid) 46 static struct page *get_current_nat_page(struct f2fs_sb_info *sbi, nid_t nid)
47 { 47 {
48 pgoff_t index = current_nat_addr(sbi, nid); 48 pgoff_t index = current_nat_addr(sbi, nid);
49 return get_meta_page(sbi, index); 49 return get_meta_page(sbi, index);
50 } 50 }
51 51
52 static struct page *get_next_nat_page(struct f2fs_sb_info *sbi, nid_t nid) 52 static struct page *get_next_nat_page(struct f2fs_sb_info *sbi, nid_t nid)
53 { 53 {
54 struct page *src_page; 54 struct page *src_page;
55 struct page *dst_page; 55 struct page *dst_page;
56 pgoff_t src_off; 56 pgoff_t src_off;
57 pgoff_t dst_off; 57 pgoff_t dst_off;
58 void *src_addr; 58 void *src_addr;
59 void *dst_addr; 59 void *dst_addr;
60 struct f2fs_nm_info *nm_i = NM_I(sbi); 60 struct f2fs_nm_info *nm_i = NM_I(sbi);
61 61
62 src_off = current_nat_addr(sbi, nid); 62 src_off = current_nat_addr(sbi, nid);
63 dst_off = next_nat_addr(sbi, src_off); 63 dst_off = next_nat_addr(sbi, src_off);
64 64
65 /* get current nat block page with lock */ 65 /* get current nat block page with lock */
66 src_page = get_meta_page(sbi, src_off); 66 src_page = get_meta_page(sbi, src_off);
67 67
68 /* Dirty src_page means that it is already the new target NAT page. */ 68 /* Dirty src_page means that it is already the new target NAT page. */
69 if (PageDirty(src_page)) 69 if (PageDirty(src_page))
70 return src_page; 70 return src_page;
71 71
72 dst_page = grab_meta_page(sbi, dst_off); 72 dst_page = grab_meta_page(sbi, dst_off);
73 73
74 src_addr = page_address(src_page); 74 src_addr = page_address(src_page);
75 dst_addr = page_address(dst_page); 75 dst_addr = page_address(dst_page);
76 memcpy(dst_addr, src_addr, PAGE_CACHE_SIZE); 76 memcpy(dst_addr, src_addr, PAGE_CACHE_SIZE);
77 set_page_dirty(dst_page); 77 set_page_dirty(dst_page);
78 f2fs_put_page(src_page, 1); 78 f2fs_put_page(src_page, 1);
79 79
80 set_to_next_nat(nm_i, nid); 80 set_to_next_nat(nm_i, nid);
81 81
82 return dst_page; 82 return dst_page;
83 } 83 }
84 84
85 /* 85 /*
86 * Readahead NAT pages 86 * Readahead NAT pages
87 */ 87 */
88 static void ra_nat_pages(struct f2fs_sb_info *sbi, int nid) 88 static void ra_nat_pages(struct f2fs_sb_info *sbi, int nid)
89 { 89 {
90 struct address_space *mapping = sbi->meta_inode->i_mapping; 90 struct address_space *mapping = sbi->meta_inode->i_mapping;
91 struct f2fs_nm_info *nm_i = NM_I(sbi); 91 struct f2fs_nm_info *nm_i = NM_I(sbi);
92 struct blk_plug plug; 92 struct blk_plug plug;
93 struct page *page; 93 struct page *page;
94 pgoff_t index; 94 pgoff_t index;
95 int i; 95 int i;
96 96
97 blk_start_plug(&plug); 97 blk_start_plug(&plug);
98 98
99 for (i = 0; i < FREE_NID_PAGES; i++, nid += NAT_ENTRY_PER_BLOCK) { 99 for (i = 0; i < FREE_NID_PAGES; i++, nid += NAT_ENTRY_PER_BLOCK) {
100 if (nid >= nm_i->max_nid) 100 if (nid >= nm_i->max_nid)
101 nid = 0; 101 nid = 0;
102 index = current_nat_addr(sbi, nid); 102 index = current_nat_addr(sbi, nid);
103 103
104 page = grab_cache_page(mapping, index); 104 page = grab_cache_page(mapping, index);
105 if (!page) 105 if (!page)
106 continue; 106 continue;
107 if (PageUptodate(page)) { 107 if (PageUptodate(page)) {
108 f2fs_put_page(page, 1); 108 f2fs_put_page(page, 1);
109 continue; 109 continue;
110 } 110 }
111 if (f2fs_readpage(sbi, page, index, READ)) 111 if (f2fs_readpage(sbi, page, index, READ))
112 continue; 112 continue;
113 113
114 f2fs_put_page(page, 0); 114 f2fs_put_page(page, 0);
115 } 115 }
116 blk_finish_plug(&plug); 116 blk_finish_plug(&plug);
117 } 117 }
118 118
119 static struct nat_entry *__lookup_nat_cache(struct f2fs_nm_info *nm_i, nid_t n) 119 static struct nat_entry *__lookup_nat_cache(struct f2fs_nm_info *nm_i, nid_t n)
120 { 120 {
121 return radix_tree_lookup(&nm_i->nat_root, n); 121 return radix_tree_lookup(&nm_i->nat_root, n);
122 } 122 }
123 123
124 static unsigned int __gang_lookup_nat_cache(struct f2fs_nm_info *nm_i, 124 static unsigned int __gang_lookup_nat_cache(struct f2fs_nm_info *nm_i,
125 nid_t start, unsigned int nr, struct nat_entry **ep) 125 nid_t start, unsigned int nr, struct nat_entry **ep)
126 { 126 {
127 return radix_tree_gang_lookup(&nm_i->nat_root, (void **)ep, start, nr); 127 return radix_tree_gang_lookup(&nm_i->nat_root, (void **)ep, start, nr);
128 } 128 }
129 129
130 static void __del_from_nat_cache(struct f2fs_nm_info *nm_i, struct nat_entry *e) 130 static void __del_from_nat_cache(struct f2fs_nm_info *nm_i, struct nat_entry *e)
131 { 131 {
132 list_del(&e->list); 132 list_del(&e->list);
133 radix_tree_delete(&nm_i->nat_root, nat_get_nid(e)); 133 radix_tree_delete(&nm_i->nat_root, nat_get_nid(e));
134 nm_i->nat_cnt--; 134 nm_i->nat_cnt--;
135 kmem_cache_free(nat_entry_slab, e); 135 kmem_cache_free(nat_entry_slab, e);
136 } 136 }
137 137
138 int is_checkpointed_node(struct f2fs_sb_info *sbi, nid_t nid) 138 int is_checkpointed_node(struct f2fs_sb_info *sbi, nid_t nid)
139 { 139 {
140 struct f2fs_nm_info *nm_i = NM_I(sbi); 140 struct f2fs_nm_info *nm_i = NM_I(sbi);
141 struct nat_entry *e; 141 struct nat_entry *e;
142 int is_cp = 1; 142 int is_cp = 1;
143 143
144 read_lock(&nm_i->nat_tree_lock); 144 read_lock(&nm_i->nat_tree_lock);
145 e = __lookup_nat_cache(nm_i, nid); 145 e = __lookup_nat_cache(nm_i, nid);
146 if (e && !e->checkpointed) 146 if (e && !e->checkpointed)
147 is_cp = 0; 147 is_cp = 0;
148 read_unlock(&nm_i->nat_tree_lock); 148 read_unlock(&nm_i->nat_tree_lock);
149 return is_cp; 149 return is_cp;
150 } 150 }
151 151
152 static struct nat_entry *grab_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid) 152 static struct nat_entry *grab_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid)
153 { 153 {
154 struct nat_entry *new; 154 struct nat_entry *new;
155 155
156 new = kmem_cache_alloc(nat_entry_slab, GFP_ATOMIC); 156 new = kmem_cache_alloc(nat_entry_slab, GFP_ATOMIC);
157 if (!new) 157 if (!new)
158 return NULL; 158 return NULL;
159 if (radix_tree_insert(&nm_i->nat_root, nid, new)) { 159 if (radix_tree_insert(&nm_i->nat_root, nid, new)) {
160 kmem_cache_free(nat_entry_slab, new); 160 kmem_cache_free(nat_entry_slab, new);
161 return NULL; 161 return NULL;
162 } 162 }
163 memset(new, 0, sizeof(struct nat_entry)); 163 memset(new, 0, sizeof(struct nat_entry));
164 nat_set_nid(new, nid); 164 nat_set_nid(new, nid);
165 list_add_tail(&new->list, &nm_i->nat_entries); 165 list_add_tail(&new->list, &nm_i->nat_entries);
166 nm_i->nat_cnt++; 166 nm_i->nat_cnt++;
167 return new; 167 return new;
168 } 168 }
169 169
170 static void cache_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid, 170 static void cache_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid,
171 struct f2fs_nat_entry *ne) 171 struct f2fs_nat_entry *ne)
172 { 172 {
173 struct nat_entry *e; 173 struct nat_entry *e;
174 retry: 174 retry:
175 write_lock(&nm_i->nat_tree_lock); 175 write_lock(&nm_i->nat_tree_lock);
176 e = __lookup_nat_cache(nm_i, nid); 176 e = __lookup_nat_cache(nm_i, nid);
177 if (!e) { 177 if (!e) {
178 e = grab_nat_entry(nm_i, nid); 178 e = grab_nat_entry(nm_i, nid);
179 if (!e) { 179 if (!e) {
180 write_unlock(&nm_i->nat_tree_lock); 180 write_unlock(&nm_i->nat_tree_lock);
181 goto retry; 181 goto retry;
182 } 182 }
183 nat_set_blkaddr(e, le32_to_cpu(ne->block_addr)); 183 nat_set_blkaddr(e, le32_to_cpu(ne->block_addr));
184 nat_set_ino(e, le32_to_cpu(ne->ino)); 184 nat_set_ino(e, le32_to_cpu(ne->ino));
185 nat_set_version(e, ne->version); 185 nat_set_version(e, ne->version);
186 e->checkpointed = true; 186 e->checkpointed = true;
187 } 187 }
188 write_unlock(&nm_i->nat_tree_lock); 188 write_unlock(&nm_i->nat_tree_lock);
189 } 189 }
190 190
191 static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni, 191 static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni,
192 block_t new_blkaddr) 192 block_t new_blkaddr)
193 { 193 {
194 struct f2fs_nm_info *nm_i = NM_I(sbi); 194 struct f2fs_nm_info *nm_i = NM_I(sbi);
195 struct nat_entry *e; 195 struct nat_entry *e;
196 retry: 196 retry:
197 write_lock(&nm_i->nat_tree_lock); 197 write_lock(&nm_i->nat_tree_lock);
198 e = __lookup_nat_cache(nm_i, ni->nid); 198 e = __lookup_nat_cache(nm_i, ni->nid);
199 if (!e) { 199 if (!e) {
200 e = grab_nat_entry(nm_i, ni->nid); 200 e = grab_nat_entry(nm_i, ni->nid);
201 if (!e) { 201 if (!e) {
202 write_unlock(&nm_i->nat_tree_lock); 202 write_unlock(&nm_i->nat_tree_lock);
203 goto retry; 203 goto retry;
204 } 204 }
205 e->ni = *ni; 205 e->ni = *ni;
206 e->checkpointed = true; 206 e->checkpointed = true;
207 BUG_ON(ni->blk_addr == NEW_ADDR); 207 BUG_ON(ni->blk_addr == NEW_ADDR);
208 } else if (new_blkaddr == NEW_ADDR) { 208 } else if (new_blkaddr == NEW_ADDR) {
209 /* 209 /*
210 * when a nid is reallocated, 210 * when a nid is reallocated,
211 * the previous nat entry may remain in the nat cache. 211 * the previous nat entry may remain in the nat cache.
212 * So, reinitialize it with the new information. 212 * So, reinitialize it with the new information.
213 */ 213 */
214 e->ni = *ni; 214 e->ni = *ni;
215 BUG_ON(ni->blk_addr != NULL_ADDR); 215 BUG_ON(ni->blk_addr != NULL_ADDR);
216 } 216 }
217 217
218 if (new_blkaddr == NEW_ADDR) 218 if (new_blkaddr == NEW_ADDR)
219 e->checkpointed = false; 219 e->checkpointed = false;
220 220
221 /* sanity check */ 221 /* sanity check */
222 BUG_ON(nat_get_blkaddr(e) != ni->blk_addr); 222 BUG_ON(nat_get_blkaddr(e) != ni->blk_addr);
223 BUG_ON(nat_get_blkaddr(e) == NULL_ADDR && 223 BUG_ON(nat_get_blkaddr(e) == NULL_ADDR &&
224 new_blkaddr == NULL_ADDR); 224 new_blkaddr == NULL_ADDR);
225 BUG_ON(nat_get_blkaddr(e) == NEW_ADDR && 225 BUG_ON(nat_get_blkaddr(e) == NEW_ADDR &&
226 new_blkaddr == NEW_ADDR); 226 new_blkaddr == NEW_ADDR);
227 BUG_ON(nat_get_blkaddr(e) != NEW_ADDR && 227 BUG_ON(nat_get_blkaddr(e) != NEW_ADDR &&
228 nat_get_blkaddr(e) != NULL_ADDR && 228 nat_get_blkaddr(e) != NULL_ADDR &&
229 new_blkaddr == NEW_ADDR); 229 new_blkaddr == NEW_ADDR);
230 230
231 /* increment version number as the node is removed */ 231 /* increment version number as the node is removed */
232 if (nat_get_blkaddr(e) != NEW_ADDR && new_blkaddr == NULL_ADDR) { 232 if (nat_get_blkaddr(e) != NEW_ADDR && new_blkaddr == NULL_ADDR) {
233 unsigned char version = nat_get_version(e); 233 unsigned char version = nat_get_version(e);
234 nat_set_version(e, inc_node_version(version)); 234 nat_set_version(e, inc_node_version(version));
235 } 235 }
236 236
237 /* change address */ 237 /* change address */
238 nat_set_blkaddr(e, new_blkaddr); 238 nat_set_blkaddr(e, new_blkaddr);
239 __set_nat_cache_dirty(nm_i, e); 239 __set_nat_cache_dirty(nm_i, e);
240 write_unlock(&nm_i->nat_tree_lock); 240 write_unlock(&nm_i->nat_tree_lock);
241 } 241 }
242 242
243 static int try_to_free_nats(struct f2fs_sb_info *sbi, int nr_shrink) 243 static int try_to_free_nats(struct f2fs_sb_info *sbi, int nr_shrink)
244 { 244 {
245 struct f2fs_nm_info *nm_i = NM_I(sbi); 245 struct f2fs_nm_info *nm_i = NM_I(sbi);
246 246
247 if (nm_i->nat_cnt <= NM_WOUT_THRESHOLD) 247 if (nm_i->nat_cnt <= NM_WOUT_THRESHOLD)
248 return 0; 248 return 0;
249 249
250 write_lock(&nm_i->nat_tree_lock); 250 write_lock(&nm_i->nat_tree_lock);
251 while (nr_shrink && !list_empty(&nm_i->nat_entries)) { 251 while (nr_shrink && !list_empty(&nm_i->nat_entries)) {
252 struct nat_entry *ne; 252 struct nat_entry *ne;
253 ne = list_first_entry(&nm_i->nat_entries, 253 ne = list_first_entry(&nm_i->nat_entries,
254 struct nat_entry, list); 254 struct nat_entry, list);
255 __del_from_nat_cache(nm_i, ne); 255 __del_from_nat_cache(nm_i, ne);
256 nr_shrink--; 256 nr_shrink--;
257 } 257 }
258 write_unlock(&nm_i->nat_tree_lock); 258 write_unlock(&nm_i->nat_tree_lock);
259 return nr_shrink; 259 return nr_shrink;
260 } 260 }
261 261
262 /* 262 /*
263 * This function always returns success. 263 * This function always returns success.
264 */ 264 */
265 void get_node_info(struct f2fs_sb_info *sbi, nid_t nid, struct node_info *ni) 265 void get_node_info(struct f2fs_sb_info *sbi, nid_t nid, struct node_info *ni)
266 { 266 {
267 struct f2fs_nm_info *nm_i = NM_I(sbi); 267 struct f2fs_nm_info *nm_i = NM_I(sbi);
268 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); 268 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
269 struct f2fs_summary_block *sum = curseg->sum_blk; 269 struct f2fs_summary_block *sum = curseg->sum_blk;
270 nid_t start_nid = START_NID(nid); 270 nid_t start_nid = START_NID(nid);
271 struct f2fs_nat_block *nat_blk; 271 struct f2fs_nat_block *nat_blk;
272 struct page *page = NULL; 272 struct page *page = NULL;
273 struct f2fs_nat_entry ne; 273 struct f2fs_nat_entry ne;
274 struct nat_entry *e; 274 struct nat_entry *e;
275 int i; 275 int i;
276 276
277 memset(&ne, 0, sizeof(struct f2fs_nat_entry)); 277 memset(&ne, 0, sizeof(struct f2fs_nat_entry));
278 ni->nid = nid; 278 ni->nid = nid;
279 279
280 /* Check nat cache */ 280 /* Check nat cache */
281 read_lock(&nm_i->nat_tree_lock); 281 read_lock(&nm_i->nat_tree_lock);
282 e = __lookup_nat_cache(nm_i, nid); 282 e = __lookup_nat_cache(nm_i, nid);
283 if (e) { 283 if (e) {
284 ni->ino = nat_get_ino(e); 284 ni->ino = nat_get_ino(e);
285 ni->blk_addr = nat_get_blkaddr(e); 285 ni->blk_addr = nat_get_blkaddr(e);
286 ni->version = nat_get_version(e); 286 ni->version = nat_get_version(e);
287 } 287 }
288 read_unlock(&nm_i->nat_tree_lock); 288 read_unlock(&nm_i->nat_tree_lock);
289 if (e) 289 if (e)
290 return; 290 return;
291 291
292 /* Check current segment summary */ 292 /* Check current segment summary */
293 mutex_lock(&curseg->curseg_mutex); 293 mutex_lock(&curseg->curseg_mutex);
294 i = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 0); 294 i = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 0);
295 if (i >= 0) { 295 if (i >= 0) {
296 ne = nat_in_journal(sum, i); 296 ne = nat_in_journal(sum, i);
297 node_info_from_raw_nat(ni, &ne); 297 node_info_from_raw_nat(ni, &ne);
298 } 298 }
299 mutex_unlock(&curseg->curseg_mutex); 299 mutex_unlock(&curseg->curseg_mutex);
300 if (i >= 0) 300 if (i >= 0)
301 goto cache; 301 goto cache;
302 302
303 /* Fill node_info from nat page */ 303 /* Fill node_info from nat page */
304 page = get_current_nat_page(sbi, start_nid); 304 page = get_current_nat_page(sbi, start_nid);
305 nat_blk = (struct f2fs_nat_block *)page_address(page); 305 nat_blk = (struct f2fs_nat_block *)page_address(page);
306 ne = nat_blk->entries[nid - start_nid]; 306 ne = nat_blk->entries[nid - start_nid];
307 node_info_from_raw_nat(ni, &ne); 307 node_info_from_raw_nat(ni, &ne);
308 f2fs_put_page(page, 1); 308 f2fs_put_page(page, 1);
309 cache: 309 cache:
310 /* cache nat entry */ 310 /* cache nat entry */
311 cache_nat_entry(NM_I(sbi), nid, &ne); 311 cache_nat_entry(NM_I(sbi), nid, &ne);
312 } 312 }
313 313
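get_node_info() above resolves a nid in three tiers: the in-memory nat cache first, then the NAT journal held in the current segment summary, and only then the on-disk NAT block, caching whatever it ends up with. A rough user-space sketch of that fallback order follows, with invented arrays standing in for the real f2fs structures.

#include <stdio.h>

#define EX_NIDS 16

static int cache[EX_NIDS];	/* 0 means "not cached" */
static int journal[EX_NIDS];	/* 0 means "no journalled update" */
static int table[EX_NIDS];	/* stand-in for the on-disk NAT block */

static int lookup_blkaddr(int nid)
{
	int addr;

	if (cache[nid])			/* 1) in-memory cache */
		return cache[nid];
	if (journal[nid])		/* 2) journal of not-yet-flushed updates */
		addr = journal[nid];
	else				/* 3) the backing table itself */
		addr = table[nid];
	cache[nid] = addr;		/* cache the answer for next time */
	return addr;
}

int main(void)
{
	table[3] = 100;
	journal[3] = 120;		/* journalled update wins over the table */
	printf("first lookup : %d\n", lookup_blkaddr(3));
	printf("second lookup: %d (now served from the cache)\n",
	       lookup_blkaddr(3));
	return 0;
}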
314 /* 314 /*
315 * The maximum depth is four. 315 * The maximum depth is four.
316 * Offset[0] will have the raw inode offset. 316 * Offset[0] will have the raw inode offset.
317 */ 317 */
318 static int get_node_path(struct f2fs_inode_info *fi, long block, 318 static int get_node_path(struct f2fs_inode_info *fi, long block,
319 int offset[4], unsigned int noffset[4]) 319 int offset[4], unsigned int noffset[4])
320 { 320 {
321 const long direct_index = ADDRS_PER_INODE(fi); 321 const long direct_index = ADDRS_PER_INODE(fi);
322 const long direct_blks = ADDRS_PER_BLOCK; 322 const long direct_blks = ADDRS_PER_BLOCK;
323 const long dptrs_per_blk = NIDS_PER_BLOCK; 323 const long dptrs_per_blk = NIDS_PER_BLOCK;
324 const long indirect_blks = ADDRS_PER_BLOCK * NIDS_PER_BLOCK; 324 const long indirect_blks = ADDRS_PER_BLOCK * NIDS_PER_BLOCK;
325 const long dindirect_blks = indirect_blks * NIDS_PER_BLOCK; 325 const long dindirect_blks = indirect_blks * NIDS_PER_BLOCK;
326 int n = 0; 326 int n = 0;
327 int level = 0; 327 int level = 0;
328 328
329 noffset[0] = 0; 329 noffset[0] = 0;
330 330
331 if (block < direct_index) { 331 if (block < direct_index) {
332 offset[n] = block; 332 offset[n] = block;
333 goto got; 333 goto got;
334 } 334 }
335 block -= direct_index; 335 block -= direct_index;
336 if (block < direct_blks) { 336 if (block < direct_blks) {
337 offset[n++] = NODE_DIR1_BLOCK; 337 offset[n++] = NODE_DIR1_BLOCK;
338 noffset[n] = 1; 338 noffset[n] = 1;
339 offset[n] = block; 339 offset[n] = block;
340 level = 1; 340 level = 1;
341 goto got; 341 goto got;
342 } 342 }
343 block -= direct_blks; 343 block -= direct_blks;
344 if (block < direct_blks) { 344 if (block < direct_blks) {
345 offset[n++] = NODE_DIR2_BLOCK; 345 offset[n++] = NODE_DIR2_BLOCK;
346 noffset[n] = 2; 346 noffset[n] = 2;
347 offset[n] = block; 347 offset[n] = block;
348 level = 1; 348 level = 1;
349 goto got; 349 goto got;
350 } 350 }
351 block -= direct_blks; 351 block -= direct_blks;
352 if (block < indirect_blks) { 352 if (block < indirect_blks) {
353 offset[n++] = NODE_IND1_BLOCK; 353 offset[n++] = NODE_IND1_BLOCK;
354 noffset[n] = 3; 354 noffset[n] = 3;
355 offset[n++] = block / direct_blks; 355 offset[n++] = block / direct_blks;
356 noffset[n] = 4 + offset[n - 1]; 356 noffset[n] = 4 + offset[n - 1];
357 offset[n] = block % direct_blks; 357 offset[n] = block % direct_blks;
358 level = 2; 358 level = 2;
359 goto got; 359 goto got;
360 } 360 }
361 block -= indirect_blks; 361 block -= indirect_blks;
362 if (block < indirect_blks) { 362 if (block < indirect_blks) {
363 offset[n++] = NODE_IND2_BLOCK; 363 offset[n++] = NODE_IND2_BLOCK;
364 noffset[n] = 4 + dptrs_per_blk; 364 noffset[n] = 4 + dptrs_per_blk;
365 offset[n++] = block / direct_blks; 365 offset[n++] = block / direct_blks;
366 noffset[n] = 5 + dptrs_per_blk + offset[n - 1]; 366 noffset[n] = 5 + dptrs_per_blk + offset[n - 1];
367 offset[n] = block % direct_blks; 367 offset[n] = block % direct_blks;
368 level = 2; 368 level = 2;
369 goto got; 369 goto got;
370 } 370 }
371 block -= indirect_blks; 371 block -= indirect_blks;
372 if (block < dindirect_blks) { 372 if (block < dindirect_blks) {
373 offset[n++] = NODE_DIND_BLOCK; 373 offset[n++] = NODE_DIND_BLOCK;
374 noffset[n] = 5 + (dptrs_per_blk * 2); 374 noffset[n] = 5 + (dptrs_per_blk * 2);
375 offset[n++] = block / indirect_blks; 375 offset[n++] = block / indirect_blks;
376 noffset[n] = 6 + (dptrs_per_blk * 2) + 376 noffset[n] = 6 + (dptrs_per_blk * 2) +
377 offset[n - 1] * (dptrs_per_blk + 1); 377 offset[n - 1] * (dptrs_per_blk + 1);
378 offset[n++] = (block / direct_blks) % dptrs_per_blk; 378 offset[n++] = (block / direct_blks) % dptrs_per_blk;
379 noffset[n] = 7 + (dptrs_per_blk * 2) + 379 noffset[n] = 7 + (dptrs_per_blk * 2) +
380 offset[n - 2] * (dptrs_per_blk + 1) + 380 offset[n - 2] * (dptrs_per_blk + 1) +
381 offset[n - 1]; 381 offset[n - 1];
382 offset[n] = block % direct_blks; 382 offset[n] = block % direct_blks;
383 level = 3; 383 level = 3;
384 goto got; 384 goto got;
385 } else { 385 } else {
386 BUG(); 386 BUG();
387 } 387 }
388 got: 388 got:
389 return level; 389 return level;
390 } 390 }
391 391
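get_node_path() above tiers a file block index into at most four levels: addresses held directly in the inode, then the two direct node blocks, then the two single-indirect blocks, and finally the double-indirect tree. The sketch below reproduces only the depth decision with made-up sizes; the real ADDRS_PER_INODE, ADDRS_PER_BLOCK and NIDS_PER_BLOCK come from the f2fs headers.

#include <stddef.h>
#include <stdio.h>

#define EX_ADDRS_PER_INODE	923	/* stand-in values, not the on-disk layout */
#define EX_ADDRS_PER_BLOCK	1018
#define EX_NIDS_PER_BLOCK	1018

static int ex_node_depth(long block)
{
	const long direct_index = EX_ADDRS_PER_INODE;
	const long direct_blks = EX_ADDRS_PER_BLOCK;
	const long indirect_blks = (long)EX_ADDRS_PER_BLOCK * EX_NIDS_PER_BLOCK;
	const long dindirect_blks = indirect_blks * EX_NIDS_PER_BLOCK;

	if (block < direct_index)
		return 0;		/* addressed from the inode itself */
	block -= direct_index;
	if (block < 2 * direct_blks)
		return 1;		/* one of the two direct node blocks */
	block -= 2 * direct_blks;
	if (block < 2 * indirect_blks)
		return 2;		/* behind a single-indirect node */
	block -= 2 * indirect_blks;
	if (block < dindirect_blks)
		return 3;		/* behind the double-indirect node */
	return -1;			/* out of range, like the BUG() above */
}

int main(void)
{
	long samples[] = { 0, 1000, 500000, 5000000 };
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("block %ld -> depth %d\n",
		       samples[i], ex_node_depth(samples[i]));
	return 0;
}

With the real constants the cutoffs differ, but the shape of the decision is the same as in the kernel function.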
392 /* 392 /*
393 * Caller should call f2fs_put_dnode(dn). 393 * Caller should call f2fs_put_dnode(dn).
394 * Also, it should grab and release a mutex by calling mutex_lock_op() and 394 * Also, it should grab and release a mutex by calling mutex_lock_op() and
395 * mutex_unlock_op() only if ro is not set to RDONLY_NODE. 395 * mutex_unlock_op() only if ro is not set to RDONLY_NODE.
396 * In the case of RDONLY_NODE, we don't need to care about the mutex. 396 * In the case of RDONLY_NODE, we don't need to care about the mutex.
397 */ 397 */
398 int get_dnode_of_data(struct dnode_of_data *dn, pgoff_t index, int mode) 398 int get_dnode_of_data(struct dnode_of_data *dn, pgoff_t index, int mode)
399 { 399 {
400 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); 400 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
401 struct page *npage[4]; 401 struct page *npage[4];
402 struct page *parent; 402 struct page *parent;
403 int offset[4]; 403 int offset[4];
404 unsigned int noffset[4]; 404 unsigned int noffset[4];
405 nid_t nids[4]; 405 nid_t nids[4];
406 int level, i; 406 int level, i;
407 int err = 0; 407 int err = 0;
408 408
409 level = get_node_path(F2FS_I(dn->inode), index, offset, noffset); 409 level = get_node_path(F2FS_I(dn->inode), index, offset, noffset);
410 410
411 nids[0] = dn->inode->i_ino; 411 nids[0] = dn->inode->i_ino;
412 npage[0] = dn->inode_page; 412 npage[0] = dn->inode_page;
413 413
414 if (!npage[0]) { 414 if (!npage[0]) {
415 npage[0] = get_node_page(sbi, nids[0]); 415 npage[0] = get_node_page(sbi, nids[0]);
416 if (IS_ERR(npage[0])) 416 if (IS_ERR(npage[0]))
417 return PTR_ERR(npage[0]); 417 return PTR_ERR(npage[0]);
418 } 418 }
419 parent = npage[0]; 419 parent = npage[0];
420 if (level != 0) 420 if (level != 0)
421 nids[1] = get_nid(parent, offset[0], true); 421 nids[1] = get_nid(parent, offset[0], true);
422 dn->inode_page = npage[0]; 422 dn->inode_page = npage[0];
423 dn->inode_page_locked = true; 423 dn->inode_page_locked = true;
424 424
425 /* get indirect or direct nodes */ 425 /* get indirect or direct nodes */
426 for (i = 1; i <= level; i++) { 426 for (i = 1; i <= level; i++) {
427 bool done = false; 427 bool done = false;
428 428
429 if (!nids[i] && mode == ALLOC_NODE) { 429 if (!nids[i] && mode == ALLOC_NODE) {
430 /* alloc new node */ 430 /* alloc new node */
431 if (!alloc_nid(sbi, &(nids[i]))) { 431 if (!alloc_nid(sbi, &(nids[i]))) {
432 err = -ENOSPC; 432 err = -ENOSPC;
433 goto release_pages; 433 goto release_pages;
434 } 434 }
435 435
436 dn->nid = nids[i]; 436 dn->nid = nids[i];
437 npage[i] = new_node_page(dn, noffset[i], NULL); 437 npage[i] = new_node_page(dn, noffset[i], NULL);
438 if (IS_ERR(npage[i])) { 438 if (IS_ERR(npage[i])) {
439 alloc_nid_failed(sbi, nids[i]); 439 alloc_nid_failed(sbi, nids[i]);
440 err = PTR_ERR(npage[i]); 440 err = PTR_ERR(npage[i]);
441 goto release_pages; 441 goto release_pages;
442 } 442 }
443 443
444 set_nid(parent, offset[i - 1], nids[i], i == 1); 444 set_nid(parent, offset[i - 1], nids[i], i == 1);
445 alloc_nid_done(sbi, nids[i]); 445 alloc_nid_done(sbi, nids[i]);
446 done = true; 446 done = true;
447 } else if (mode == LOOKUP_NODE_RA && i == level && level > 1) { 447 } else if (mode == LOOKUP_NODE_RA && i == level && level > 1) {
448 npage[i] = get_node_page_ra(parent, offset[i - 1]); 448 npage[i] = get_node_page_ra(parent, offset[i - 1]);
449 if (IS_ERR(npage[i])) { 449 if (IS_ERR(npage[i])) {
450 err = PTR_ERR(npage[i]); 450 err = PTR_ERR(npage[i]);
451 goto release_pages; 451 goto release_pages;
452 } 452 }
453 done = true; 453 done = true;
454 } 454 }
455 if (i == 1) { 455 if (i == 1) {
456 dn->inode_page_locked = false; 456 dn->inode_page_locked = false;
457 unlock_page(parent); 457 unlock_page(parent);
458 } else { 458 } else {
459 f2fs_put_page(parent, 1); 459 f2fs_put_page(parent, 1);
460 } 460 }
461 461
462 if (!done) { 462 if (!done) {
463 npage[i] = get_node_page(sbi, nids[i]); 463 npage[i] = get_node_page(sbi, nids[i]);
464 if (IS_ERR(npage[i])) { 464 if (IS_ERR(npage[i])) {
465 err = PTR_ERR(npage[i]); 465 err = PTR_ERR(npage[i]);
466 f2fs_put_page(npage[0], 0); 466 f2fs_put_page(npage[0], 0);
467 goto release_out; 467 goto release_out;
468 } 468 }
469 } 469 }
470 if (i < level) { 470 if (i < level) {
471 parent = npage[i]; 471 parent = npage[i];
472 nids[i + 1] = get_nid(parent, offset[i], false); 472 nids[i + 1] = get_nid(parent, offset[i], false);
473 } 473 }
474 } 474 }
475 dn->nid = nids[level]; 475 dn->nid = nids[level];
476 dn->ofs_in_node = offset[level]; 476 dn->ofs_in_node = offset[level];
477 dn->node_page = npage[level]; 477 dn->node_page = npage[level];
478 dn->data_blkaddr = datablock_addr(dn->node_page, dn->ofs_in_node); 478 dn->data_blkaddr = datablock_addr(dn->node_page, dn->ofs_in_node);
479 return 0; 479 return 0;
480 480
481 release_pages: 481 release_pages:
482 f2fs_put_page(parent, 1); 482 f2fs_put_page(parent, 1);
483 if (i > 1) 483 if (i > 1)
484 f2fs_put_page(npage[0], 0); 484 f2fs_put_page(npage[0], 0);
485 release_out: 485 release_out:
486 dn->inode_page = NULL; 486 dn->inode_page = NULL;
487 dn->node_page = NULL; 487 dn->node_page = NULL;
488 return err; 488 return err;
489 } 489 }
490 490
491 static void truncate_node(struct dnode_of_data *dn) 491 static void truncate_node(struct dnode_of_data *dn)
492 { 492 {
493 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); 493 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
494 struct node_info ni; 494 struct node_info ni;
495 495
496 get_node_info(sbi, dn->nid, &ni); 496 get_node_info(sbi, dn->nid, &ni);
497 if (dn->inode->i_blocks == 0) { 497 if (dn->inode->i_blocks == 0) {
498 BUG_ON(ni.blk_addr != NULL_ADDR); 498 BUG_ON(ni.blk_addr != NULL_ADDR);
499 goto invalidate; 499 goto invalidate;
500 } 500 }
501 BUG_ON(ni.blk_addr == NULL_ADDR); 501 BUG_ON(ni.blk_addr == NULL_ADDR);
502 502
503 /* Deallocate node address */ 503 /* Deallocate node address */
504 invalidate_blocks(sbi, ni.blk_addr); 504 invalidate_blocks(sbi, ni.blk_addr);
505 dec_valid_node_count(sbi, dn->inode, 1); 505 dec_valid_node_count(sbi, dn->inode, 1);
506 set_node_addr(sbi, &ni, NULL_ADDR); 506 set_node_addr(sbi, &ni, NULL_ADDR);
507 507
508 if (dn->nid == dn->inode->i_ino) { 508 if (dn->nid == dn->inode->i_ino) {
509 remove_orphan_inode(sbi, dn->nid); 509 remove_orphan_inode(sbi, dn->nid);
510 dec_valid_inode_count(sbi); 510 dec_valid_inode_count(sbi);
511 } else { 511 } else {
512 sync_inode_page(dn); 512 sync_inode_page(dn);
513 } 513 }
514 invalidate: 514 invalidate:
515 clear_node_page_dirty(dn->node_page); 515 clear_node_page_dirty(dn->node_page);
516 F2FS_SET_SB_DIRT(sbi); 516 F2FS_SET_SB_DIRT(sbi);
517 517
518 f2fs_put_page(dn->node_page, 1); 518 f2fs_put_page(dn->node_page, 1);
519 dn->node_page = NULL; 519 dn->node_page = NULL;
520 trace_f2fs_truncate_node(dn->inode, dn->nid, ni.blk_addr); 520 trace_f2fs_truncate_node(dn->inode, dn->nid, ni.blk_addr);
521 } 521 }
522 522
523 static int truncate_dnode(struct dnode_of_data *dn) 523 static int truncate_dnode(struct dnode_of_data *dn)
524 { 524 {
525 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); 525 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
526 struct page *page; 526 struct page *page;
527 527
528 if (dn->nid == 0) 528 if (dn->nid == 0)
529 return 1; 529 return 1;
530 530
531 /* get direct node */ 531 /* get direct node */
532 page = get_node_page(sbi, dn->nid); 532 page = get_node_page(sbi, dn->nid);
533 if (IS_ERR(page) && PTR_ERR(page) == -ENOENT) 533 if (IS_ERR(page) && PTR_ERR(page) == -ENOENT)
534 return 1; 534 return 1;
535 else if (IS_ERR(page)) 535 else if (IS_ERR(page))
536 return PTR_ERR(page); 536 return PTR_ERR(page);
537 537
538 /* Make dnode_of_data for parameter */ 538 /* Make dnode_of_data for parameter */
539 dn->node_page = page; 539 dn->node_page = page;
540 dn->ofs_in_node = 0; 540 dn->ofs_in_node = 0;
541 truncate_data_blocks(dn); 541 truncate_data_blocks(dn);
542 truncate_node(dn); 542 truncate_node(dn);
543 return 1; 543 return 1;
544 } 544 }
545 545
546 static int truncate_nodes(struct dnode_of_data *dn, unsigned int nofs, 546 static int truncate_nodes(struct dnode_of_data *dn, unsigned int nofs,
547 int ofs, int depth) 547 int ofs, int depth)
548 { 548 {
549 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); 549 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
550 struct dnode_of_data rdn = *dn; 550 struct dnode_of_data rdn = *dn;
551 struct page *page; 551 struct page *page;
552 struct f2fs_node *rn; 552 struct f2fs_node *rn;
553 nid_t child_nid; 553 nid_t child_nid;
554 unsigned int child_nofs; 554 unsigned int child_nofs;
555 int freed = 0; 555 int freed = 0;
556 int i, ret; 556 int i, ret;
557 557
558 if (dn->nid == 0) 558 if (dn->nid == 0)
559 return NIDS_PER_BLOCK + 1; 559 return NIDS_PER_BLOCK + 1;
560 560
561 trace_f2fs_truncate_nodes_enter(dn->inode, dn->nid, dn->data_blkaddr); 561 trace_f2fs_truncate_nodes_enter(dn->inode, dn->nid, dn->data_blkaddr);
562 562
563 page = get_node_page(sbi, dn->nid); 563 page = get_node_page(sbi, dn->nid);
564 if (IS_ERR(page)) { 564 if (IS_ERR(page)) {
565 trace_f2fs_truncate_nodes_exit(dn->inode, PTR_ERR(page)); 565 trace_f2fs_truncate_nodes_exit(dn->inode, PTR_ERR(page));
566 return PTR_ERR(page); 566 return PTR_ERR(page);
567 } 567 }
568 568
569 rn = F2FS_NODE(page); 569 rn = F2FS_NODE(page);
570 if (depth < 3) { 570 if (depth < 3) {
571 for (i = ofs; i < NIDS_PER_BLOCK; i++, freed++) { 571 for (i = ofs; i < NIDS_PER_BLOCK; i++, freed++) {
572 child_nid = le32_to_cpu(rn->in.nid[i]); 572 child_nid = le32_to_cpu(rn->in.nid[i]);
573 if (child_nid == 0) 573 if (child_nid == 0)
574 continue; 574 continue;
575 rdn.nid = child_nid; 575 rdn.nid = child_nid;
576 ret = truncate_dnode(&rdn); 576 ret = truncate_dnode(&rdn);
577 if (ret < 0) 577 if (ret < 0)
578 goto out_err; 578 goto out_err;
579 set_nid(page, i, 0, false); 579 set_nid(page, i, 0, false);
580 } 580 }
581 } else { 581 } else {
582 child_nofs = nofs + ofs * (NIDS_PER_BLOCK + 1) + 1; 582 child_nofs = nofs + ofs * (NIDS_PER_BLOCK + 1) + 1;
583 for (i = ofs; i < NIDS_PER_BLOCK; i++) { 583 for (i = ofs; i < NIDS_PER_BLOCK; i++) {
584 child_nid = le32_to_cpu(rn->in.nid[i]); 584 child_nid = le32_to_cpu(rn->in.nid[i]);
585 if (child_nid == 0) { 585 if (child_nid == 0) {
586 child_nofs += NIDS_PER_BLOCK + 1; 586 child_nofs += NIDS_PER_BLOCK + 1;
587 continue; 587 continue;
588 } 588 }
589 rdn.nid = child_nid; 589 rdn.nid = child_nid;
590 ret = truncate_nodes(&rdn, child_nofs, 0, depth - 1); 590 ret = truncate_nodes(&rdn, child_nofs, 0, depth - 1);
591 if (ret == (NIDS_PER_BLOCK + 1)) { 591 if (ret == (NIDS_PER_BLOCK + 1)) {
592 set_nid(page, i, 0, false); 592 set_nid(page, i, 0, false);
593 child_nofs += ret; 593 child_nofs += ret;
594 } else if (ret < 0 && ret != -ENOENT) { 594 } else if (ret < 0 && ret != -ENOENT) {
595 goto out_err; 595 goto out_err;
596 } 596 }
597 } 597 }
598 freed = child_nofs; 598 freed = child_nofs;
599 } 599 }
600 600
601 if (!ofs) { 601 if (!ofs) {
602 /* remove current indirect node */ 602 /* remove current indirect node */
603 dn->node_page = page; 603 dn->node_page = page;
604 truncate_node(dn); 604 truncate_node(dn);
605 freed++; 605 freed++;
606 } else { 606 } else {
607 f2fs_put_page(page, 1); 607 f2fs_put_page(page, 1);
608 } 608 }
609 trace_f2fs_truncate_nodes_exit(dn->inode, freed); 609 trace_f2fs_truncate_nodes_exit(dn->inode, freed);
610 return freed; 610 return freed;
611 611
612 out_err: 612 out_err:
613 f2fs_put_page(page, 1); 613 f2fs_put_page(page, 1);
614 trace_f2fs_truncate_nodes_exit(dn->inode, ret); 614 trace_f2fs_truncate_nodes_exit(dn->inode, ret);
615 return ret; 615 return ret;
616 } 616 }
617 617
618 static int truncate_partial_nodes(struct dnode_of_data *dn, 618 static int truncate_partial_nodes(struct dnode_of_data *dn,
619 struct f2fs_inode *ri, int *offset, int depth) 619 struct f2fs_inode *ri, int *offset, int depth)
620 { 620 {
621 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); 621 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
622 struct page *pages[2]; 622 struct page *pages[2];
623 nid_t nid[3]; 623 nid_t nid[3];
624 nid_t child_nid; 624 nid_t child_nid;
625 int err = 0; 625 int err = 0;
626 int i; 626 int i;
627 int idx = depth - 2; 627 int idx = depth - 2;
628 628
629 nid[0] = le32_to_cpu(ri->i_nid[offset[0] - NODE_DIR1_BLOCK]); 629 nid[0] = le32_to_cpu(ri->i_nid[offset[0] - NODE_DIR1_BLOCK]);
630 if (!nid[0]) 630 if (!nid[0])
631 return 0; 631 return 0;
632 632
633 /* get indirect nodes in the path */ 633 /* get indirect nodes in the path */
634 for (i = 0; i < depth - 1; i++) { 634 for (i = 0; i < depth - 1; i++) {
635 /* reference count will be increased */ 635 /* reference count will be increased */
636 pages[i] = get_node_page(sbi, nid[i]); 636 pages[i] = get_node_page(sbi, nid[i]);
637 if (IS_ERR(pages[i])) { 637 if (IS_ERR(pages[i])) {
638 depth = i + 1; 638 depth = i + 1;
639 err = PTR_ERR(pages[i]); 639 err = PTR_ERR(pages[i]);
640 goto fail; 640 goto fail;
641 } 641 }
642 nid[i + 1] = get_nid(pages[i], offset[i + 1], false); 642 nid[i + 1] = get_nid(pages[i], offset[i + 1], false);
643 } 643 }
644 644
645 /* free direct nodes linked to a partial indirect node */ 645 /* free direct nodes linked to a partial indirect node */
646 for (i = offset[depth - 1]; i < NIDS_PER_BLOCK; i++) { 646 for (i = offset[depth - 1]; i < NIDS_PER_BLOCK; i++) {
647 child_nid = get_nid(pages[idx], i, false); 647 child_nid = get_nid(pages[idx], i, false);
648 if (!child_nid) 648 if (!child_nid)
649 continue; 649 continue;
650 dn->nid = child_nid; 650 dn->nid = child_nid;
651 err = truncate_dnode(dn); 651 err = truncate_dnode(dn);
652 if (err < 0) 652 if (err < 0)
653 goto fail; 653 goto fail;
654 set_nid(pages[idx], i, 0, false); 654 set_nid(pages[idx], i, 0, false);
655 } 655 }
656 656
657 if (offset[depth - 1] == 0) { 657 if (offset[depth - 1] == 0) {
658 dn->node_page = pages[idx]; 658 dn->node_page = pages[idx];
659 dn->nid = nid[idx]; 659 dn->nid = nid[idx];
660 truncate_node(dn); 660 truncate_node(dn);
661 } else { 661 } else {
662 f2fs_put_page(pages[idx], 1); 662 f2fs_put_page(pages[idx], 1);
663 } 663 }
664 offset[idx]++; 664 offset[idx]++;
665 offset[depth - 1] = 0; 665 offset[depth - 1] = 0;
666 fail: 666 fail:
667 for (i = depth - 3; i >= 0; i--) 667 for (i = depth - 3; i >= 0; i--)
668 f2fs_put_page(pages[i], 1); 668 f2fs_put_page(pages[i], 1);
669 669
670 trace_f2fs_truncate_partial_nodes(dn->inode, nid, depth, err); 670 trace_f2fs_truncate_partial_nodes(dn->inode, nid, depth, err);
671 671
672 return err; 672 return err;
673 } 673 }
674 674
675 /* 675 /*
676 * All the block addresses of data and nodes should be nullified. 676 * All the block addresses of data and nodes should be nullified.
677 */ 677 */
678 int truncate_inode_blocks(struct inode *inode, pgoff_t from) 678 int truncate_inode_blocks(struct inode *inode, pgoff_t from)
679 { 679 {
680 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 680 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
681 struct address_space *node_mapping = sbi->node_inode->i_mapping; 681 struct address_space *node_mapping = sbi->node_inode->i_mapping;
682 int err = 0, cont = 1; 682 int err = 0, cont = 1;
683 int level, offset[4], noffset[4]; 683 int level, offset[4], noffset[4];
684 unsigned int nofs = 0; 684 unsigned int nofs = 0;
685 struct f2fs_node *rn; 685 struct f2fs_node *rn;
686 struct dnode_of_data dn; 686 struct dnode_of_data dn;
687 struct page *page; 687 struct page *page;
688 688
689 trace_f2fs_truncate_inode_blocks_enter(inode, from); 689 trace_f2fs_truncate_inode_blocks_enter(inode, from);
690 690
691 level = get_node_path(F2FS_I(inode), from, offset, noffset); 691 level = get_node_path(F2FS_I(inode), from, offset, noffset);
692 restart: 692 restart:
693 page = get_node_page(sbi, inode->i_ino); 693 page = get_node_page(sbi, inode->i_ino);
694 if (IS_ERR(page)) { 694 if (IS_ERR(page)) {
695 trace_f2fs_truncate_inode_blocks_exit(inode, PTR_ERR(page)); 695 trace_f2fs_truncate_inode_blocks_exit(inode, PTR_ERR(page));
696 return PTR_ERR(page); 696 return PTR_ERR(page);
697 } 697 }
698 698
699 set_new_dnode(&dn, inode, page, NULL, 0); 699 set_new_dnode(&dn, inode, page, NULL, 0);
700 unlock_page(page); 700 unlock_page(page);
701 701
702 rn = F2FS_NODE(page); 702 rn = F2FS_NODE(page);
703 switch (level) { 703 switch (level) {
704 case 0: 704 case 0:
705 case 1: 705 case 1:
706 nofs = noffset[1]; 706 nofs = noffset[1];
707 break; 707 break;
708 case 2: 708 case 2:
709 nofs = noffset[1]; 709 nofs = noffset[1];
710 if (!offset[level - 1]) 710 if (!offset[level - 1])
711 goto skip_partial; 711 goto skip_partial;
712 err = truncate_partial_nodes(&dn, &rn->i, offset, level); 712 err = truncate_partial_nodes(&dn, &rn->i, offset, level);
713 if (err < 0 && err != -ENOENT) 713 if (err < 0 && err != -ENOENT)
714 goto fail; 714 goto fail;
715 nofs += 1 + NIDS_PER_BLOCK; 715 nofs += 1 + NIDS_PER_BLOCK;
716 break; 716 break;
717 case 3: 717 case 3:
718 nofs = 5 + 2 * NIDS_PER_BLOCK; 718 nofs = 5 + 2 * NIDS_PER_BLOCK;
719 if (!offset[level - 1]) 719 if (!offset[level - 1])
720 goto skip_partial; 720 goto skip_partial;
721 err = truncate_partial_nodes(&dn, &rn->i, offset, level); 721 err = truncate_partial_nodes(&dn, &rn->i, offset, level);
722 if (err < 0 && err != -ENOENT) 722 if (err < 0 && err != -ENOENT)
723 goto fail; 723 goto fail;
724 break; 724 break;
725 default: 725 default:
726 BUG(); 726 BUG();
727 } 727 }
728 728
729 skip_partial: 729 skip_partial:
730 while (cont) { 730 while (cont) {
731 dn.nid = le32_to_cpu(rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]); 731 dn.nid = le32_to_cpu(rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]);
732 switch (offset[0]) { 732 switch (offset[0]) {
733 case NODE_DIR1_BLOCK: 733 case NODE_DIR1_BLOCK:
734 case NODE_DIR2_BLOCK: 734 case NODE_DIR2_BLOCK:
735 err = truncate_dnode(&dn); 735 err = truncate_dnode(&dn);
736 break; 736 break;
737 737
738 case NODE_IND1_BLOCK: 738 case NODE_IND1_BLOCK:
739 case NODE_IND2_BLOCK: 739 case NODE_IND2_BLOCK:
740 err = truncate_nodes(&dn, nofs, offset[1], 2); 740 err = truncate_nodes(&dn, nofs, offset[1], 2);
741 break; 741 break;
742 742
743 case NODE_DIND_BLOCK: 743 case NODE_DIND_BLOCK:
744 err = truncate_nodes(&dn, nofs, offset[1], 3); 744 err = truncate_nodes(&dn, nofs, offset[1], 3);
745 cont = 0; 745 cont = 0;
746 break; 746 break;
747 747
748 default: 748 default:
749 BUG(); 749 BUG();
750 } 750 }
751 if (err < 0 && err != -ENOENT) 751 if (err < 0 && err != -ENOENT)
752 goto fail; 752 goto fail;
753 if (offset[1] == 0 && 753 if (offset[1] == 0 &&
754 rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]) { 754 rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]) {
755 lock_page(page); 755 lock_page(page);
756 if (page->mapping != node_mapping) { 756 if (page->mapping != node_mapping) {
757 f2fs_put_page(page, 1); 757 f2fs_put_page(page, 1);
758 goto restart; 758 goto restart;
759 } 759 }
760 wait_on_page_writeback(page); 760 wait_on_page_writeback(page);
761 rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK] = 0; 761 rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK] = 0;
762 set_page_dirty(page); 762 set_page_dirty(page);
763 unlock_page(page); 763 unlock_page(page);
764 } 764 }
765 offset[1] = 0; 765 offset[1] = 0;
766 offset[0]++; 766 offset[0]++;
767 nofs += err; 767 nofs += err;
768 } 768 }
769 fail: 769 fail:
770 f2fs_put_page(page, 0); 770 f2fs_put_page(page, 0);
771 trace_f2fs_truncate_inode_blocks_exit(inode, err); 771 trace_f2fs_truncate_inode_blocks_exit(inode, err);
772 return err > 0 ? 0 : err; 772 return err > 0 ? 0 : err;
773 } 773 }
774 774
775 int truncate_xattr_node(struct inode *inode, struct page *page) 775 int truncate_xattr_node(struct inode *inode, struct page *page)
776 { 776 {
777 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 777 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
778 nid_t nid = F2FS_I(inode)->i_xattr_nid; 778 nid_t nid = F2FS_I(inode)->i_xattr_nid;
779 struct dnode_of_data dn; 779 struct dnode_of_data dn;
780 struct page *npage; 780 struct page *npage;
781 781
782 if (!nid) 782 if (!nid)
783 return 0; 783 return 0;
784 784
785 npage = get_node_page(sbi, nid); 785 npage = get_node_page(sbi, nid);
786 if (IS_ERR(npage)) 786 if (IS_ERR(npage))
787 return PTR_ERR(npage); 787 return PTR_ERR(npage);
788 788
789 F2FS_I(inode)->i_xattr_nid = 0; 789 F2FS_I(inode)->i_xattr_nid = 0;
790 790
791 /* need to do checkpoint during fsync */ 791 /* need to do checkpoint during fsync */
792 F2FS_I(inode)->xattr_ver = cur_cp_version(F2FS_CKPT(sbi)); 792 F2FS_I(inode)->xattr_ver = cur_cp_version(F2FS_CKPT(sbi));
793 793
794 set_new_dnode(&dn, inode, page, npage, nid); 794 set_new_dnode(&dn, inode, page, npage, nid);
795 795
796 if (page) 796 if (page)
797 dn.inode_page_locked = 1; 797 dn.inode_page_locked = 1;
798 truncate_node(&dn); 798 truncate_node(&dn);
799 return 0; 799 return 0;
800 } 800 }
801 801
802 /* 802 /*
803 * Caller should grab and release a mutex by calling mutex_lock_op() and 803 * Caller should grab and release a mutex by calling mutex_lock_op() and
804 * mutex_unlock_op(). 804 * mutex_unlock_op().
805 */ 805 */
806 int remove_inode_page(struct inode *inode) 806 int remove_inode_page(struct inode *inode)
807 { 807 {
808 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 808 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
809 struct page *page; 809 struct page *page;
810 nid_t ino = inode->i_ino; 810 nid_t ino = inode->i_ino;
811 struct dnode_of_data dn; 811 struct dnode_of_data dn;
812 int err; 812 int err;
813 813
814 page = get_node_page(sbi, ino); 814 page = get_node_page(sbi, ino);
815 if (IS_ERR(page)) 815 if (IS_ERR(page))
816 return PTR_ERR(page); 816 return PTR_ERR(page);
817 817
818 err = truncate_xattr_node(inode, page); 818 err = truncate_xattr_node(inode, page);
819 if (err) { 819 if (err) {
820 f2fs_put_page(page, 1); 820 f2fs_put_page(page, 1);
821 return err; 821 return err;
822 } 822 }
823 823
824 /* 0 is possible after f2fs_new_inode() has failed */ 824 /* 0 is possible after f2fs_new_inode() has failed */
825 BUG_ON(inode->i_blocks != 0 && inode->i_blocks != 1); 825 BUG_ON(inode->i_blocks != 0 && inode->i_blocks != 1);
826 set_new_dnode(&dn, inode, page, page, ino); 826 set_new_dnode(&dn, inode, page, page, ino);
827 truncate_node(&dn); 827 truncate_node(&dn);
828 return 0; 828 return 0;
829 } 829 }
830 830
831 struct page *new_inode_page(struct inode *inode, const struct qstr *name) 831 struct page *new_inode_page(struct inode *inode, const struct qstr *name)
832 { 832 {
833 struct dnode_of_data dn; 833 struct dnode_of_data dn;
834 834
835 /* allocate inode page for new inode */ 835 /* allocate inode page for new inode */
836 set_new_dnode(&dn, inode, NULL, NULL, inode->i_ino); 836 set_new_dnode(&dn, inode, NULL, NULL, inode->i_ino);
837 837
838 /* caller should f2fs_put_page(page, 1); */ 838 /* caller should f2fs_put_page(page, 1); */
839 return new_node_page(&dn, 0, NULL); 839 return new_node_page(&dn, 0, NULL);
840 } 840 }
841 841
842 struct page *new_node_page(struct dnode_of_data *dn, 842 struct page *new_node_page(struct dnode_of_data *dn,
843 unsigned int ofs, struct page *ipage) 843 unsigned int ofs, struct page *ipage)
844 { 844 {
845 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb); 845 struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
846 struct address_space *mapping = sbi->node_inode->i_mapping; 846 struct address_space *mapping = sbi->node_inode->i_mapping;
847 struct node_info old_ni, new_ni; 847 struct node_info old_ni, new_ni;
848 struct page *page; 848 struct page *page;
849 int err; 849 int err;
850 850
851 if (is_inode_flag_set(F2FS_I(dn->inode), FI_NO_ALLOC)) 851 if (is_inode_flag_set(F2FS_I(dn->inode), FI_NO_ALLOC))
852 return ERR_PTR(-EPERM); 852 return ERR_PTR(-EPERM);
853 853
854 page = grab_cache_page(mapping, dn->nid); 854 page = grab_cache_page(mapping, dn->nid);
855 if (!page) 855 if (!page)
856 return ERR_PTR(-ENOMEM); 856 return ERR_PTR(-ENOMEM);
857 857
858 if (!inc_valid_node_count(sbi, dn->inode, 1)) { 858 if (!inc_valid_node_count(sbi, dn->inode, 1)) {
859 err = -ENOSPC; 859 err = -ENOSPC;
860 goto fail; 860 goto fail;
861 } 861 }
862 862
863 get_node_info(sbi, dn->nid, &old_ni); 863 get_node_info(sbi, dn->nid, &old_ni);
864 864
865 /* Reinitialize old_ni with new node page */ 865 /* Reinitialize old_ni with new node page */
866 BUG_ON(old_ni.blk_addr != NULL_ADDR); 866 BUG_ON(old_ni.blk_addr != NULL_ADDR);
867 new_ni = old_ni; 867 new_ni = old_ni;
868 new_ni.ino = dn->inode->i_ino; 868 new_ni.ino = dn->inode->i_ino;
869 set_node_addr(sbi, &new_ni, NEW_ADDR); 869 set_node_addr(sbi, &new_ni, NEW_ADDR);
870 870
871 fill_node_footer(page, dn->nid, dn->inode->i_ino, ofs, true); 871 fill_node_footer(page, dn->nid, dn->inode->i_ino, ofs, true);
872 set_cold_node(dn->inode, page); 872 set_cold_node(dn->inode, page);
873 SetPageUptodate(page); 873 SetPageUptodate(page);
874 set_page_dirty(page); 874 set_page_dirty(page);
875 875
876 if (ofs == XATTR_NODE_OFFSET) 876 if (ofs == XATTR_NODE_OFFSET)
877 F2FS_I(dn->inode)->i_xattr_nid = dn->nid; 877 F2FS_I(dn->inode)->i_xattr_nid = dn->nid;
878 878
879 dn->node_page = page; 879 dn->node_page = page;
880 if (ipage) 880 if (ipage)
881 update_inode(dn->inode, ipage); 881 update_inode(dn->inode, ipage);
882 else 882 else
883 sync_inode_page(dn); 883 sync_inode_page(dn);
884 if (ofs == 0) 884 if (ofs == 0)
885 inc_valid_inode_count(sbi); 885 inc_valid_inode_count(sbi);
886 886
887 return page; 887 return page;
888 888
889 fail: 889 fail:
890 clear_node_page_dirty(page); 890 clear_node_page_dirty(page);
891 f2fs_put_page(page, 1); 891 f2fs_put_page(page, 1);
892 return ERR_PTR(err); 892 return ERR_PTR(err);
893 } 893 }
894 894
895 /* 895 /*
896 * Caller should act as follows depending on the return value: 896 * Caller should act as follows depending on the return value:
897 * 0: f2fs_put_page(page, 0) 897 * 0: f2fs_put_page(page, 0)
898 * LOCKED_PAGE: f2fs_put_page(page, 1) 898 * LOCKED_PAGE: f2fs_put_page(page, 1)
899 * error: nothing 899 * error: nothing
900 */ 900 */
901 static int read_node_page(struct page *page, int type) 901 static int read_node_page(struct page *page, int type)
902 { 902 {
903 struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb); 903 struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb);
904 struct node_info ni; 904 struct node_info ni;
905 905
906 get_node_info(sbi, page->index, &ni); 906 get_node_info(sbi, page->index, &ni);
907 907
908 if (ni.blk_addr == NULL_ADDR) { 908 if (ni.blk_addr == NULL_ADDR) {
909 f2fs_put_page(page, 1); 909 f2fs_put_page(page, 1);
910 return -ENOENT; 910 return -ENOENT;
911 } 911 }
912 912
913 if (PageUptodate(page)) 913 if (PageUptodate(page))
914 return LOCKED_PAGE; 914 return LOCKED_PAGE;
915 915
916 return f2fs_readpage(sbi, page, ni.blk_addr, type); 916 return f2fs_readpage(sbi, page, ni.blk_addr, type);
917 } 917 }
918 918
919 /* 919 /*
920 * Readahead a node page 920 * Readahead a node page
921 */ 921 */
922 void ra_node_page(struct f2fs_sb_info *sbi, nid_t nid) 922 void ra_node_page(struct f2fs_sb_info *sbi, nid_t nid)
923 { 923 {
924 struct address_space *mapping = sbi->node_inode->i_mapping; 924 struct address_space *mapping = sbi->node_inode->i_mapping;
925 struct page *apage; 925 struct page *apage;
926 int err; 926 int err;
927 927
928 apage = find_get_page(mapping, nid); 928 apage = find_get_page(mapping, nid);
929 if (apage && PageUptodate(apage)) { 929 if (apage && PageUptodate(apage)) {
930 f2fs_put_page(apage, 0); 930 f2fs_put_page(apage, 0);
931 return; 931 return;
932 } 932 }
933 f2fs_put_page(apage, 0); 933 f2fs_put_page(apage, 0);
934 934
935 apage = grab_cache_page(mapping, nid); 935 apage = grab_cache_page(mapping, nid);
936 if (!apage) 936 if (!apage)
937 return; 937 return;
938 938
939 err = read_node_page(apage, READA); 939 err = read_node_page(apage, READA);
940 if (err == 0) 940 if (err == 0)
941 f2fs_put_page(apage, 0); 941 f2fs_put_page(apage, 0);
942 else if (err == LOCKED_PAGE) 942 else if (err == LOCKED_PAGE)
943 f2fs_put_page(apage, 1); 943 f2fs_put_page(apage, 1);
944 } 944 }
945 945
946 struct page *get_node_page(struct f2fs_sb_info *sbi, pgoff_t nid) 946 struct page *get_node_page(struct f2fs_sb_info *sbi, pgoff_t nid)
947 { 947 {
948 struct address_space *mapping = sbi->node_inode->i_mapping; 948 struct address_space *mapping = sbi->node_inode->i_mapping;
949 struct page *page; 949 struct page *page;
950 int err; 950 int err;
951 repeat: 951 repeat:
952 page = grab_cache_page(mapping, nid); 952 page = grab_cache_page(mapping, nid);
953 if (!page) 953 if (!page)
954 return ERR_PTR(-ENOMEM); 954 return ERR_PTR(-ENOMEM);
955 955
956 err = read_node_page(page, READ_SYNC); 956 err = read_node_page(page, READ_SYNC);
957 if (err < 0) 957 if (err < 0)
958 return ERR_PTR(err); 958 return ERR_PTR(err);
959 else if (err == LOCKED_PAGE) 959 else if (err == LOCKED_PAGE)
960 goto got_it; 960 goto got_it;
961 961
962 lock_page(page); 962 lock_page(page);
963 if (!PageUptodate(page)) { 963 if (!PageUptodate(page)) {
964 f2fs_put_page(page, 1); 964 f2fs_put_page(page, 1);
965 return ERR_PTR(-EIO); 965 return ERR_PTR(-EIO);
966 } 966 }
967 if (page->mapping != mapping) { 967 if (page->mapping != mapping) {
968 f2fs_put_page(page, 1); 968 f2fs_put_page(page, 1);
969 goto repeat; 969 goto repeat;
970 } 970 }
971 got_it: 971 got_it:
972 BUG_ON(nid != nid_of_node(page)); 972 BUG_ON(nid != nid_of_node(page));
973 mark_page_accessed(page);
974 return page; 973 return page;
975 } 974 }
976 975
977 /* 976 /*
978 * Return a locked page for the desired node page. 977 * Return a locked page for the desired node page.
979 * Also readahead MAX_RA_NODE node pages. 978 * Also readahead MAX_RA_NODE node pages.
980 */ 979 */
981 struct page *get_node_page_ra(struct page *parent, int start) 980 struct page *get_node_page_ra(struct page *parent, int start)
982 { 981 {
983 struct f2fs_sb_info *sbi = F2FS_SB(parent->mapping->host->i_sb); 982 struct f2fs_sb_info *sbi = F2FS_SB(parent->mapping->host->i_sb);
984 struct address_space *mapping = sbi->node_inode->i_mapping; 983 struct address_space *mapping = sbi->node_inode->i_mapping;
985 struct blk_plug plug; 984 struct blk_plug plug;
986 struct page *page; 985 struct page *page;
987 int err, i, end; 986 int err, i, end;
988 nid_t nid; 987 nid_t nid;
989 988
990 /* First, try getting the desired direct node. */ 989 /* First, try getting the desired direct node. */
991 nid = get_nid(parent, start, false); 990 nid = get_nid(parent, start, false);
992 if (!nid) 991 if (!nid)
993 return ERR_PTR(-ENOENT); 992 return ERR_PTR(-ENOENT);
994 repeat: 993 repeat:
995 page = grab_cache_page(mapping, nid); 994 page = grab_cache_page(mapping, nid);
996 if (!page) 995 if (!page)
997 return ERR_PTR(-ENOMEM); 996 return ERR_PTR(-ENOMEM);
998 997
999 err = read_node_page(page, READ_SYNC); 998 err = read_node_page(page, READ_SYNC);
1000 if (err < 0) 999 if (err < 0)
1001 return ERR_PTR(err); 1000 return ERR_PTR(err);
1002 else if (err == LOCKED_PAGE) 1001 else if (err == LOCKED_PAGE)
1003 goto page_hit; 1002 goto page_hit;
1004 1003
1005 blk_start_plug(&plug); 1004 blk_start_plug(&plug);
1006 1005
1007 /* Then, try readahead for siblings of the desired node */ 1006 /* Then, try readahead for siblings of the desired node */
1008 end = start + MAX_RA_NODE; 1007 end = start + MAX_RA_NODE;
1009 end = min(end, NIDS_PER_BLOCK); 1008 end = min(end, NIDS_PER_BLOCK);
1010 for (i = start + 1; i < end; i++) { 1009 for (i = start + 1; i < end; i++) {
1011 nid = get_nid(parent, i, false); 1010 nid = get_nid(parent, i, false);
1012 if (!nid) 1011 if (!nid)
1013 continue; 1012 continue;
1014 ra_node_page(sbi, nid); 1013 ra_node_page(sbi, nid);
1015 } 1014 }
1016 1015
1017 blk_finish_plug(&plug); 1016 blk_finish_plug(&plug);
1018 1017
1019 lock_page(page); 1018 lock_page(page);
1020 if (page->mapping != mapping) { 1019 if (page->mapping != mapping) {
1021 f2fs_put_page(page, 1); 1020 f2fs_put_page(page, 1);
1022 goto repeat; 1021 goto repeat;
1023 } 1022 }
1024 page_hit: 1023 page_hit:
1025 if (!PageUptodate(page)) { 1024 if (!PageUptodate(page)) {
1026 f2fs_put_page(page, 1); 1025 f2fs_put_page(page, 1);
1027 return ERR_PTR(-EIO); 1026 return ERR_PTR(-EIO);
1028 } 1027 }
1029 mark_page_accessed(page);
1030 return page; 1028 return page;
1031 } 1029 }
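The mark_page_accessed() dropped just above follows the same reasoning as in get_node_page(): the page came from grab_cache_page(), so the accessed hint has already been applied by the allocation path and the explicit call would only repeat it with an atomic update.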
1032 1030
1033 void sync_inode_page(struct dnode_of_data *dn) 1031 void sync_inode_page(struct dnode_of_data *dn)
1034 { 1032 {
1035 if (IS_INODE(dn->node_page) || dn->inode_page == dn->node_page) { 1033 if (IS_INODE(dn->node_page) || dn->inode_page == dn->node_page) {
1036 update_inode(dn->inode, dn->node_page); 1034 update_inode(dn->inode, dn->node_page);
1037 } else if (dn->inode_page) { 1035 } else if (dn->inode_page) {
1038 if (!dn->inode_page_locked) 1036 if (!dn->inode_page_locked)
1039 lock_page(dn->inode_page); 1037 lock_page(dn->inode_page);
1040 update_inode(dn->inode, dn->inode_page); 1038 update_inode(dn->inode, dn->inode_page);
1041 if (!dn->inode_page_locked) 1039 if (!dn->inode_page_locked)
1042 unlock_page(dn->inode_page); 1040 unlock_page(dn->inode_page);
1043 } else { 1041 } else {
1044 update_inode_page(dn->inode); 1042 update_inode_page(dn->inode);
1045 } 1043 }
1046 } 1044 }
1047 1045
1048 int sync_node_pages(struct f2fs_sb_info *sbi, nid_t ino, 1046 int sync_node_pages(struct f2fs_sb_info *sbi, nid_t ino,
1049 struct writeback_control *wbc) 1047 struct writeback_control *wbc)
1050 { 1048 {
1051 struct address_space *mapping = sbi->node_inode->i_mapping; 1049 struct address_space *mapping = sbi->node_inode->i_mapping;
1052 pgoff_t index, end; 1050 pgoff_t index, end;
1053 struct pagevec pvec; 1051 struct pagevec pvec;
1054 int step = ino ? 2 : 0; 1052 int step = ino ? 2 : 0;
1055 int nwritten = 0, wrote = 0; 1053 int nwritten = 0, wrote = 0;
1056 1054
1057 pagevec_init(&pvec, 0); 1055 pagevec_init(&pvec, 0);
1058 1056
1059 next_step: 1057 next_step:
1060 index = 0; 1058 index = 0;
1061 end = LONG_MAX; 1059 end = LONG_MAX;
1062 1060
1063 while (index <= end) { 1061 while (index <= end) {
1064 int i, nr_pages; 1062 int i, nr_pages;
1065 nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, 1063 nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
1066 PAGECACHE_TAG_DIRTY, 1064 PAGECACHE_TAG_DIRTY,
1067 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); 1065 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
1068 if (nr_pages == 0) 1066 if (nr_pages == 0)
1069 break; 1067 break;
1070 1068
1071 for (i = 0; i < nr_pages; i++) { 1069 for (i = 0; i < nr_pages; i++) {
1072 struct page *page = pvec.pages[i]; 1070 struct page *page = pvec.pages[i];
1073 1071
1074 /* 1072 /*
1075 * flushing sequence with step: 1073 * flushing sequence with step:
1076 * 0. indirect nodes 1074 * 0. indirect nodes
1077 * 1. dentry dnodes 1075 * 1. dentry dnodes
1078 * 2. file dnodes 1076 * 2. file dnodes
1079 */ 1077 */
1080 if (step == 0 && IS_DNODE(page)) 1078 if (step == 0 && IS_DNODE(page))
1081 continue; 1079 continue;
1082 if (step == 1 && (!IS_DNODE(page) || 1080 if (step == 1 && (!IS_DNODE(page) ||
1083 is_cold_node(page))) 1081 is_cold_node(page)))
1084 continue; 1082 continue;
1085 if (step == 2 && (!IS_DNODE(page) || 1083 if (step == 2 && (!IS_DNODE(page) ||
1086 !is_cold_node(page))) 1084 !is_cold_node(page)))
1087 continue; 1085 continue;
1088 1086
1089 /* 1087 /*
1090 * If in fsync mode, 1088 * If in fsync mode,
1091 * we should not skip writing node pages. 1089 * we should not skip writing node pages.
1092 */ 1090 */
1093 if (ino && ino_of_node(page) == ino) 1091 if (ino && ino_of_node(page) == ino)
1094 lock_page(page); 1092 lock_page(page);
1095 else if (!trylock_page(page)) 1093 else if (!trylock_page(page))
1096 continue; 1094 continue;
1097 1095
1098 if (unlikely(page->mapping != mapping)) { 1096 if (unlikely(page->mapping != mapping)) {
1099 continue_unlock: 1097 continue_unlock:
1100 unlock_page(page); 1098 unlock_page(page);
1101 continue; 1099 continue;
1102 } 1100 }
1103 if (ino && ino_of_node(page) != ino) 1101 if (ino && ino_of_node(page) != ino)
1104 goto continue_unlock; 1102 goto continue_unlock;
1105 1103
1106 if (!PageDirty(page)) { 1104 if (!PageDirty(page)) {
1107 /* someone wrote it for us */ 1105 /* someone wrote it for us */
1108 goto continue_unlock; 1106 goto continue_unlock;
1109 } 1107 }
1110 1108
1111 if (!clear_page_dirty_for_io(page)) 1109 if (!clear_page_dirty_for_io(page))
1112 goto continue_unlock; 1110 goto continue_unlock;
1113 1111
1114 /* called by fsync() */ 1112 /* called by fsync() */
1115 if (ino && IS_DNODE(page)) { 1113 if (ino && IS_DNODE(page)) {
1116 int mark = !is_checkpointed_node(sbi, ino); 1114 int mark = !is_checkpointed_node(sbi, ino);
1117 set_fsync_mark(page, 1); 1115 set_fsync_mark(page, 1);
1118 if (IS_INODE(page)) 1116 if (IS_INODE(page))
1119 set_dentry_mark(page, mark); 1117 set_dentry_mark(page, mark);
1120 nwritten++; 1118 nwritten++;
1121 } else { 1119 } else {
1122 set_fsync_mark(page, 0); 1120 set_fsync_mark(page, 0);
1123 set_dentry_mark(page, 0); 1121 set_dentry_mark(page, 0);
1124 } 1122 }
1125 mapping->a_ops->writepage(page, wbc); 1123 mapping->a_ops->writepage(page, wbc);
1126 wrote++; 1124 wrote++;
1127 1125
1128 if (--wbc->nr_to_write == 0) 1126 if (--wbc->nr_to_write == 0)
1129 break; 1127 break;
1130 } 1128 }
1131 pagevec_release(&pvec); 1129 pagevec_release(&pvec);
1132 cond_resched(); 1130 cond_resched();
1133 1131
1134 if (wbc->nr_to_write == 0) { 1132 if (wbc->nr_to_write == 0) {
1135 step = 2; 1133 step = 2;
1136 break; 1134 break;
1137 } 1135 }
1138 } 1136 }
1139 1137
1140 if (step < 2) { 1138 if (step < 2) {
1141 step++; 1139 step++;
1142 goto next_step; 1140 goto next_step;
1143 } 1141 }
1144 1142
1145 if (wrote) 1143 if (wrote)
1146 f2fs_submit_bio(sbi, NODE, wbc->sync_mode == WB_SYNC_ALL); 1144 f2fs_submit_bio(sbi, NODE, wbc->sync_mode == WB_SYNC_ALL);
1147 1145
1148 return nwritten; 1146 return nwritten;
1149 } 1147 }
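For readability, the three skip checks in the loop above select pages per step as follows (a restatement of the existing conditions, not new behaviour):

	/* step 0: !IS_DNODE(page)                         -> indirect node pages */
	/* step 1:  IS_DNODE(page) && !is_cold_node(page)  -> dentry dnodes       */
	/* step 2:  IS_DNODE(page) &&  is_cold_node(page)  -> file dnodes         */

When wbc->nr_to_write reaches zero, step is forced to 2 so the outer loop exits instead of starting another pass.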
1150 1148
1151 static int f2fs_write_node_page(struct page *page, 1149 static int f2fs_write_node_page(struct page *page,
1152 struct writeback_control *wbc) 1150 struct writeback_control *wbc)
1153 { 1151 {
1154 struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb); 1152 struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb);
1155 nid_t nid; 1153 nid_t nid;
1156 block_t new_addr; 1154 block_t new_addr;
1157 struct node_info ni; 1155 struct node_info ni;
1158 1156
1159 wait_on_page_writeback(page); 1157 wait_on_page_writeback(page);
1160 1158
1161 /* get old block addr of this node page */ 1159 /* get old block addr of this node page */
1162 nid = nid_of_node(page); 1160 nid = nid_of_node(page);
1163 BUG_ON(page->index != nid); 1161 BUG_ON(page->index != nid);
1164 1162
1165 get_node_info(sbi, nid, &ni); 1163 get_node_info(sbi, nid, &ni);
1166 1164
1167 /* This page is already truncated */ 1165 /* This page is already truncated */
1168 if (ni.blk_addr == NULL_ADDR) { 1166 if (ni.blk_addr == NULL_ADDR) {
1169 dec_page_count(sbi, F2FS_DIRTY_NODES); 1167 dec_page_count(sbi, F2FS_DIRTY_NODES);
1170 unlock_page(page); 1168 unlock_page(page);
1171 return 0; 1169 return 0;
1172 } 1170 }
1173 1171
1174 if (wbc->for_reclaim) { 1172 if (wbc->for_reclaim) {
1175 dec_page_count(sbi, F2FS_DIRTY_NODES); 1173 dec_page_count(sbi, F2FS_DIRTY_NODES);
1176 wbc->pages_skipped++; 1174 wbc->pages_skipped++;
1177 set_page_dirty(page); 1175 set_page_dirty(page);
1178 return AOP_WRITEPAGE_ACTIVATE; 1176 return AOP_WRITEPAGE_ACTIVATE;
1179 } 1177 }
1180 1178
1181 mutex_lock(&sbi->node_write); 1179 mutex_lock(&sbi->node_write);
1182 set_page_writeback(page); 1180 set_page_writeback(page);
1183 write_node_page(sbi, page, nid, ni.blk_addr, &new_addr); 1181 write_node_page(sbi, page, nid, ni.blk_addr, &new_addr);
1184 set_node_addr(sbi, &ni, new_addr); 1182 set_node_addr(sbi, &ni, new_addr);
1185 dec_page_count(sbi, F2FS_DIRTY_NODES); 1183 dec_page_count(sbi, F2FS_DIRTY_NODES);
1186 mutex_unlock(&sbi->node_write); 1184 mutex_unlock(&sbi->node_write);
1187 unlock_page(page); 1185 unlock_page(page);
1188 return 0; 1186 return 0;
1189 } 1187 }
1190 1188
1191 /* 1189 /*
1192 * It is very important to gather dirty pages and write at once, so that we can 1190 * It is very important to gather dirty pages and write at once, so that we can
1193 * submit a big bio without interfering with other data writes. 1191 * submit a big bio without interfering with other data writes.
1194 * By default, 512 pages (2MB) * 3 node types is reasonable. 1192 * By default, 512 pages (2MB) * 3 node types is reasonable.
1195 */ 1193 */
1196 #define COLLECT_DIRTY_NODES 1536 1194 #define COLLECT_DIRTY_NODES 1536
1197 static int f2fs_write_node_pages(struct address_space *mapping, 1195 static int f2fs_write_node_pages(struct address_space *mapping,
1198 struct writeback_control *wbc) 1196 struct writeback_control *wbc)
1199 { 1197 {
1200 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); 1198 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
1201 long nr_to_write = wbc->nr_to_write; 1199 long nr_to_write = wbc->nr_to_write;
1202 1200
1203 /* First check balancing cached NAT entries */ 1201 /* First check balancing cached NAT entries */
1204 if (try_to_free_nats(sbi, NAT_ENTRY_PER_BLOCK)) { 1202 if (try_to_free_nats(sbi, NAT_ENTRY_PER_BLOCK)) {
1205 f2fs_sync_fs(sbi->sb, true); 1203 f2fs_sync_fs(sbi->sb, true);
1206 return 0; 1204 return 0;
1207 } 1205 }
1208 1206
1209 /* collect a number of dirty node pages and write together */ 1207 /* collect a number of dirty node pages and write together */
1210 if (get_pages(sbi, F2FS_DIRTY_NODES) < COLLECT_DIRTY_NODES) 1208 if (get_pages(sbi, F2FS_DIRTY_NODES) < COLLECT_DIRTY_NODES)
1211 return 0; 1209 return 0;
1212 1210
1213 /* if mounting failed, skip writing node pages */ 1211 /* if mounting failed, skip writing node pages */
1214 wbc->nr_to_write = 3 * max_hw_blocks(sbi); 1212 wbc->nr_to_write = 3 * max_hw_blocks(sbi);
1215 sync_node_pages(sbi, 0, wbc); 1213 sync_node_pages(sbi, 0, wbc);
1216 wbc->nr_to_write = nr_to_write - (3 * max_hw_blocks(sbi) - 1214 wbc->nr_to_write = nr_to_write - (3 * max_hw_blocks(sbi) -
1217 wbc->nr_to_write); 1215 wbc->nr_to_write);
1218 return 0; 1216 return 0;
1219 } 1217 }
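The threshold arithmetic behind COLLECT_DIRTY_NODES, spelled out: with 4 KiB pages (which the 2MB figure in the comment implies), 512 pages per node type is 2 MiB, and three node types give 3 * 512 = 1536 dirty node pages before a writeback pass is considered worth submitting as one large bio. The final assignment then restores the caller's budget: nr_to_write - (3 * max_hw_blocks(sbi) - wbc->nr_to_write) charges only the pages actually consumed out of the temporary 3 * max_hw_blocks() allowance.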
1220 1218
1221 static int f2fs_set_node_page_dirty(struct page *page) 1219 static int f2fs_set_node_page_dirty(struct page *page)
1222 { 1220 {
1223 struct address_space *mapping = page->mapping; 1221 struct address_space *mapping = page->mapping;
1224 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb); 1222 struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
1225 1223
1226 SetPageUptodate(page); 1224 SetPageUptodate(page);
1227 if (!PageDirty(page)) { 1225 if (!PageDirty(page)) {
1228 __set_page_dirty_nobuffers(page); 1226 __set_page_dirty_nobuffers(page);
1229 inc_page_count(sbi, F2FS_DIRTY_NODES); 1227 inc_page_count(sbi, F2FS_DIRTY_NODES);
1230 SetPagePrivate(page); 1228 SetPagePrivate(page);
1231 return 1; 1229 return 1;
1232 } 1230 }
1233 return 0; 1231 return 0;
1234 } 1232 }
1235 1233
1236 static void f2fs_invalidate_node_page(struct page *page, unsigned int offset, 1234 static void f2fs_invalidate_node_page(struct page *page, unsigned int offset,
1237 unsigned int length) 1235 unsigned int length)
1238 { 1236 {
1239 struct inode *inode = page->mapping->host; 1237 struct inode *inode = page->mapping->host;
1240 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb); 1238 struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
1241 if (PageDirty(page)) 1239 if (PageDirty(page))
1242 dec_page_count(sbi, F2FS_DIRTY_NODES); 1240 dec_page_count(sbi, F2FS_DIRTY_NODES);
1243 ClearPagePrivate(page); 1241 ClearPagePrivate(page);
1244 } 1242 }
1245 1243
1246 static int f2fs_release_node_page(struct page *page, gfp_t wait) 1244 static int f2fs_release_node_page(struct page *page, gfp_t wait)
1247 { 1245 {
1248 ClearPagePrivate(page); 1246 ClearPagePrivate(page);
1249 return 1; 1247 return 1;
1250 } 1248 }
1251 1249
1252 /* 1250 /*
1253 * Structure of the f2fs node operations 1251 * Structure of the f2fs node operations
1254 */ 1252 */
1255 const struct address_space_operations f2fs_node_aops = { 1253 const struct address_space_operations f2fs_node_aops = {
1256 .writepage = f2fs_write_node_page, 1254 .writepage = f2fs_write_node_page,
1257 .writepages = f2fs_write_node_pages, 1255 .writepages = f2fs_write_node_pages,
1258 .set_page_dirty = f2fs_set_node_page_dirty, 1256 .set_page_dirty = f2fs_set_node_page_dirty,
1259 .invalidatepage = f2fs_invalidate_node_page, 1257 .invalidatepage = f2fs_invalidate_node_page,
1260 .releasepage = f2fs_release_node_page, 1258 .releasepage = f2fs_release_node_page,
1261 }; 1259 };
1262 1260
1263 static struct free_nid *__lookup_free_nid_list(nid_t n, struct list_head *head) 1261 static struct free_nid *__lookup_free_nid_list(nid_t n, struct list_head *head)
1264 { 1262 {
1265 struct list_head *this; 1263 struct list_head *this;
1266 struct free_nid *i; 1264 struct free_nid *i;
1267 list_for_each(this, head) { 1265 list_for_each(this, head) {
1268 i = list_entry(this, struct free_nid, list); 1266 i = list_entry(this, struct free_nid, list);
1269 if (i->nid == n) 1267 if (i->nid == n)
1270 return i; 1268 return i;
1271 } 1269 }
1272 return NULL; 1270 return NULL;
1273 } 1271 }
1274 1272
1275 static void __del_from_free_nid_list(struct free_nid *i) 1273 static void __del_from_free_nid_list(struct free_nid *i)
1276 { 1274 {
1277 list_del(&i->list); 1275 list_del(&i->list);
1278 kmem_cache_free(free_nid_slab, i); 1276 kmem_cache_free(free_nid_slab, i);
1279 } 1277 }
1280 1278
1281 static int add_free_nid(struct f2fs_nm_info *nm_i, nid_t nid, bool build) 1279 static int add_free_nid(struct f2fs_nm_info *nm_i, nid_t nid, bool build)
1282 { 1280 {
1283 struct free_nid *i; 1281 struct free_nid *i;
1284 struct nat_entry *ne; 1282 struct nat_entry *ne;
1285 bool allocated = false; 1283 bool allocated = false;
1286 1284
1287 if (nm_i->fcnt > 2 * MAX_FREE_NIDS) 1285 if (nm_i->fcnt > 2 * MAX_FREE_NIDS)
1288 return -1; 1286 return -1;
1289 1287
1290 /* 0 nid should not be used */ 1288 /* 0 nid should not be used */
1291 if (nid == 0) 1289 if (nid == 0)
1292 return 0; 1290 return 0;
1293 1291
1294 if (!build) 1292 if (!build)
1295 goto retry; 1293 goto retry;
1296 1294
1297 /* do not add allocated nids */ 1295 /* do not add allocated nids */
1298 read_lock(&nm_i->nat_tree_lock); 1296 read_lock(&nm_i->nat_tree_lock);
1299 ne = __lookup_nat_cache(nm_i, nid); 1297 ne = __lookup_nat_cache(nm_i, nid);
1300 if (ne && nat_get_blkaddr(ne) != NULL_ADDR) 1298 if (ne && nat_get_blkaddr(ne) != NULL_ADDR)
1301 allocated = true; 1299 allocated = true;
1302 read_unlock(&nm_i->nat_tree_lock); 1300 read_unlock(&nm_i->nat_tree_lock);
1303 if (allocated) 1301 if (allocated)
1304 return 0; 1302 return 0;
1305 retry: 1303 retry:
1306 i = kmem_cache_alloc(free_nid_slab, GFP_NOFS); 1304 i = kmem_cache_alloc(free_nid_slab, GFP_NOFS);
1307 if (!i) { 1305 if (!i) {
1308 cond_resched(); 1306 cond_resched();
1309 goto retry; 1307 goto retry;
1310 } 1308 }
1311 i->nid = nid; 1309 i->nid = nid;
1312 i->state = NID_NEW; 1310 i->state = NID_NEW;
1313 1311
1314 spin_lock(&nm_i->free_nid_list_lock); 1312 spin_lock(&nm_i->free_nid_list_lock);
1315 if (__lookup_free_nid_list(nid, &nm_i->free_nid_list)) { 1313 if (__lookup_free_nid_list(nid, &nm_i->free_nid_list)) {
1316 spin_unlock(&nm_i->free_nid_list_lock); 1314 spin_unlock(&nm_i->free_nid_list_lock);
1317 kmem_cache_free(free_nid_slab, i); 1315 kmem_cache_free(free_nid_slab, i);
1318 return 0; 1316 return 0;
1319 } 1317 }
1320 list_add_tail(&i->list, &nm_i->free_nid_list); 1318 list_add_tail(&i->list, &nm_i->free_nid_list);
1321 nm_i->fcnt++; 1319 nm_i->fcnt++;
1322 spin_unlock(&nm_i->free_nid_list_lock); 1320 spin_unlock(&nm_i->free_nid_list_lock);
1323 return 1; 1321 return 1;
1324 } 1322 }
1325 1323
1326 static void remove_free_nid(struct f2fs_nm_info *nm_i, nid_t nid) 1324 static void remove_free_nid(struct f2fs_nm_info *nm_i, nid_t nid)
1327 { 1325 {
1328 struct free_nid *i; 1326 struct free_nid *i;
1329 spin_lock(&nm_i->free_nid_list_lock); 1327 spin_lock(&nm_i->free_nid_list_lock);
1330 i = __lookup_free_nid_list(nid, &nm_i->free_nid_list); 1328 i = __lookup_free_nid_list(nid, &nm_i->free_nid_list);
1331 if (i && i->state == NID_NEW) { 1329 if (i && i->state == NID_NEW) {
1332 __del_from_free_nid_list(i); 1330 __del_from_free_nid_list(i);
1333 nm_i->fcnt--; 1331 nm_i->fcnt--;
1334 } 1332 }
1335 spin_unlock(&nm_i->free_nid_list_lock); 1333 spin_unlock(&nm_i->free_nid_list_lock);
1336 } 1334 }
1337 1335
1338 static void scan_nat_page(struct f2fs_nm_info *nm_i, 1336 static void scan_nat_page(struct f2fs_nm_info *nm_i,
1339 struct page *nat_page, nid_t start_nid) 1337 struct page *nat_page, nid_t start_nid)
1340 { 1338 {
1341 struct f2fs_nat_block *nat_blk = page_address(nat_page); 1339 struct f2fs_nat_block *nat_blk = page_address(nat_page);
1342 block_t blk_addr; 1340 block_t blk_addr;
1343 int i; 1341 int i;
1344 1342
1345 i = start_nid % NAT_ENTRY_PER_BLOCK; 1343 i = start_nid % NAT_ENTRY_PER_BLOCK;
1346 1344
1347 for (; i < NAT_ENTRY_PER_BLOCK; i++, start_nid++) { 1345 for (; i < NAT_ENTRY_PER_BLOCK; i++, start_nid++) {
1348 1346
1349 if (start_nid >= nm_i->max_nid) 1347 if (start_nid >= nm_i->max_nid)
1350 break; 1348 break;
1351 1349
1352 blk_addr = le32_to_cpu(nat_blk->entries[i].block_addr); 1350 blk_addr = le32_to_cpu(nat_blk->entries[i].block_addr);
1353 BUG_ON(blk_addr == NEW_ADDR); 1351 BUG_ON(blk_addr == NEW_ADDR);
1354 if (blk_addr == NULL_ADDR) { 1352 if (blk_addr == NULL_ADDR) {
1355 if (add_free_nid(nm_i, start_nid, true) < 0) 1353 if (add_free_nid(nm_i, start_nid, true) < 0)
1356 break; 1354 break;
1357 } 1355 }
1358 } 1356 }
1359 } 1357 }
1360 1358
1361 static void build_free_nids(struct f2fs_sb_info *sbi) 1359 static void build_free_nids(struct f2fs_sb_info *sbi)
1362 { 1360 {
1363 struct f2fs_nm_info *nm_i = NM_I(sbi); 1361 struct f2fs_nm_info *nm_i = NM_I(sbi);
1364 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); 1362 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
1365 struct f2fs_summary_block *sum = curseg->sum_blk; 1363 struct f2fs_summary_block *sum = curseg->sum_blk;
1366 int i = 0; 1364 int i = 0;
1367 nid_t nid = nm_i->next_scan_nid; 1365 nid_t nid = nm_i->next_scan_nid;
1368 1366
1369 /* Enough entries */ 1367 /* Enough entries */
1370 if (nm_i->fcnt > NAT_ENTRY_PER_BLOCK) 1368 if (nm_i->fcnt > NAT_ENTRY_PER_BLOCK)
1371 return; 1369 return;
1372 1370
1373 /* readahead nat pages to be scanned */ 1371 /* readahead nat pages to be scanned */
1374 ra_nat_pages(sbi, nid); 1372 ra_nat_pages(sbi, nid);
1375 1373
1376 while (1) { 1374 while (1) {
1377 struct page *page = get_current_nat_page(sbi, nid); 1375 struct page *page = get_current_nat_page(sbi, nid);
1378 1376
1379 scan_nat_page(nm_i, page, nid); 1377 scan_nat_page(nm_i, page, nid);
1380 f2fs_put_page(page, 1); 1378 f2fs_put_page(page, 1);
1381 1379
1382 nid += (NAT_ENTRY_PER_BLOCK - (nid % NAT_ENTRY_PER_BLOCK)); 1380 nid += (NAT_ENTRY_PER_BLOCK - (nid % NAT_ENTRY_PER_BLOCK));
1383 if (nid >= nm_i->max_nid) 1381 if (nid >= nm_i->max_nid)
1384 nid = 0; 1382 nid = 0;
1385 1383
1386 if (i++ == FREE_NID_PAGES) 1384 if (i++ == FREE_NID_PAGES)
1387 break; 1385 break;
1388 } 1386 }
1389 1387
1390 /* go to the next free nat page to find more free nids */ 1388 /* go to the next free nat page to find more free nids */
1391 nm_i->next_scan_nid = nid; 1389 nm_i->next_scan_nid = nid;
1392 1390
1393 /* find free nids from current sum_pages */ 1391 /* find free nids from current sum_pages */
1394 mutex_lock(&curseg->curseg_mutex); 1392 mutex_lock(&curseg->curseg_mutex);
1395 for (i = 0; i < nats_in_cursum(sum); i++) { 1393 for (i = 0; i < nats_in_cursum(sum); i++) {
1396 block_t addr = le32_to_cpu(nat_in_journal(sum, i).block_addr); 1394 block_t addr = le32_to_cpu(nat_in_journal(sum, i).block_addr);
1397 nid = le32_to_cpu(nid_in_journal(sum, i)); 1395 nid = le32_to_cpu(nid_in_journal(sum, i));
1398 if (addr == NULL_ADDR) 1396 if (addr == NULL_ADDR)
1399 add_free_nid(nm_i, nid, true); 1397 add_free_nid(nm_i, nid, true);
1400 else 1398 else
1401 remove_free_nid(nm_i, nid); 1399 remove_free_nid(nm_i, nid);
1402 } 1400 }
1403 mutex_unlock(&curseg->curseg_mutex); 1401 mutex_unlock(&curseg->curseg_mutex);
1404 } 1402 }
1405 1403
1406 /* 1404 /*
1407 * If this function returns success, the caller can obtain a new nid 1405 * If this function returns success, the caller can obtain a new nid
1408 * from the second parameter of this function. 1406 * from the second parameter of this function.
1409 * The returned nid could be used as an ino as well as a nid when an inode is created. 1407 * The returned nid could be used as an ino as well as a nid when an inode is created.
1410 */ 1408 */
1411 bool alloc_nid(struct f2fs_sb_info *sbi, nid_t *nid) 1409 bool alloc_nid(struct f2fs_sb_info *sbi, nid_t *nid)
1412 { 1410 {
1413 struct f2fs_nm_info *nm_i = NM_I(sbi); 1411 struct f2fs_nm_info *nm_i = NM_I(sbi);
1414 struct free_nid *i = NULL; 1412 struct free_nid *i = NULL;
1415 struct list_head *this; 1413 struct list_head *this;
1416 retry: 1414 retry:
1417 if (sbi->total_valid_node_count + 1 >= nm_i->max_nid) 1415 if (sbi->total_valid_node_count + 1 >= nm_i->max_nid)
1418 return false; 1416 return false;
1419 1417
1420 spin_lock(&nm_i->free_nid_list_lock); 1418 spin_lock(&nm_i->free_nid_list_lock);
1421 1419
1422 /* We should not use stale free nids created by build_free_nids */ 1420 /* We should not use stale free nids created by build_free_nids */
1423 if (nm_i->fcnt && !sbi->on_build_free_nids) { 1421 if (nm_i->fcnt && !sbi->on_build_free_nids) {
1424 BUG_ON(list_empty(&nm_i->free_nid_list)); 1422 BUG_ON(list_empty(&nm_i->free_nid_list));
1425 list_for_each(this, &nm_i->free_nid_list) { 1423 list_for_each(this, &nm_i->free_nid_list) {
1426 i = list_entry(this, struct free_nid, list); 1424 i = list_entry(this, struct free_nid, list);
1427 if (i->state == NID_NEW) 1425 if (i->state == NID_NEW)
1428 break; 1426 break;
1429 } 1427 }
1430 1428
1431 BUG_ON(i->state != NID_NEW); 1429 BUG_ON(i->state != NID_NEW);
1432 *nid = i->nid; 1430 *nid = i->nid;
1433 i->state = NID_ALLOC; 1431 i->state = NID_ALLOC;
1434 nm_i->fcnt--; 1432 nm_i->fcnt--;
1435 spin_unlock(&nm_i->free_nid_list_lock); 1433 spin_unlock(&nm_i->free_nid_list_lock);
1436 return true; 1434 return true;
1437 } 1435 }
1438 spin_unlock(&nm_i->free_nid_list_lock); 1436 spin_unlock(&nm_i->free_nid_list_lock);
1439 1437
1440 /* Let's scan nat pages and their caches to get free nids */ 1438 /* Let's scan nat pages and their caches to get free nids */
1441 mutex_lock(&nm_i->build_lock); 1439 mutex_lock(&nm_i->build_lock);
1442 sbi->on_build_free_nids = 1; 1440 sbi->on_build_free_nids = 1;
1443 build_free_nids(sbi); 1441 build_free_nids(sbi);
1444 sbi->on_build_free_nids = 0; 1442 sbi->on_build_free_nids = 0;
1445 mutex_unlock(&nm_i->build_lock); 1443 mutex_unlock(&nm_i->build_lock);
1446 goto retry; 1444 goto retry;
1447 } 1445 }
1448 1446
1449 /* 1447 /*
1450 * alloc_nid() should be called prior to this function. 1448 * alloc_nid() should be called prior to this function.
1451 */ 1449 */
1452 void alloc_nid_done(struct f2fs_sb_info *sbi, nid_t nid) 1450 void alloc_nid_done(struct f2fs_sb_info *sbi, nid_t nid)
1453 { 1451 {
1454 struct f2fs_nm_info *nm_i = NM_I(sbi); 1452 struct f2fs_nm_info *nm_i = NM_I(sbi);
1455 struct free_nid *i; 1453 struct free_nid *i;
1456 1454
1457 spin_lock(&nm_i->free_nid_list_lock); 1455 spin_lock(&nm_i->free_nid_list_lock);
1458 i = __lookup_free_nid_list(nid, &nm_i->free_nid_list); 1456 i = __lookup_free_nid_list(nid, &nm_i->free_nid_list);
1459 BUG_ON(!i || i->state != NID_ALLOC); 1457 BUG_ON(!i || i->state != NID_ALLOC);
1460 __del_from_free_nid_list(i); 1458 __del_from_free_nid_list(i);
1461 spin_unlock(&nm_i->free_nid_list_lock); 1459 spin_unlock(&nm_i->free_nid_list_lock);
1462 } 1460 }
1463 1461
1464 /* 1462 /*
1465 * alloc_nid() should be called prior to this function. 1463 * alloc_nid() should be called prior to this function.
1466 */ 1464 */
1467 void alloc_nid_failed(struct f2fs_sb_info *sbi, nid_t nid) 1465 void alloc_nid_failed(struct f2fs_sb_info *sbi, nid_t nid)
1468 { 1466 {
1469 struct f2fs_nm_info *nm_i = NM_I(sbi); 1467 struct f2fs_nm_info *nm_i = NM_I(sbi);
1470 struct free_nid *i; 1468 struct free_nid *i;
1471 1469
1472 if (!nid) 1470 if (!nid)
1473 return; 1471 return;
1474 1472
1475 spin_lock(&nm_i->free_nid_list_lock); 1473 spin_lock(&nm_i->free_nid_list_lock);
1476 i = __lookup_free_nid_list(nid, &nm_i->free_nid_list); 1474 i = __lookup_free_nid_list(nid, &nm_i->free_nid_list);
1477 BUG_ON(!i || i->state != NID_ALLOC); 1475 BUG_ON(!i || i->state != NID_ALLOC);
1478 if (nm_i->fcnt > 2 * MAX_FREE_NIDS) { 1476 if (nm_i->fcnt > 2 * MAX_FREE_NIDS) {
1479 __del_from_free_nid_list(i); 1477 __del_from_free_nid_list(i);
1480 } else { 1478 } else {
1481 i->state = NID_NEW; 1479 i->state = NID_NEW;
1482 nm_i->fcnt++; 1480 nm_i->fcnt++;
1483 } 1481 }
1484 spin_unlock(&nm_i->free_nid_list_lock); 1482 spin_unlock(&nm_i->free_nid_list_lock);
1485 } 1483 }
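A sketch of the expected caller pattern for the three helpers above; the function name and the do_the_actual_work() helper are illustrative placeholders, not taken from this diff:

	/* Illustrative only: reserve a nid, then either confirm or return it. */
	static int create_something_sketch(struct f2fs_sb_info *sbi)
	{
		nid_t nid;
		int err;

		if (!alloc_nid(sbi, &nid))		/* no free nid available */
			return -ENOSPC;

		err = do_the_actual_work(sbi, nid);	/* hypothetical helper */
		if (err) {
			alloc_nid_failed(sbi, nid);	/* nid goes back to NID_NEW (or is dropped) */
			return err;
		}

		alloc_nid_done(sbi, nid);		/* remove it from the free nid list for good */
		return 0;
	}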
1486 1484
1487 void recover_node_page(struct f2fs_sb_info *sbi, struct page *page, 1485 void recover_node_page(struct f2fs_sb_info *sbi, struct page *page,
1488 struct f2fs_summary *sum, struct node_info *ni, 1486 struct f2fs_summary *sum, struct node_info *ni,
1489 block_t new_blkaddr) 1487 block_t new_blkaddr)
1490 { 1488 {
1491 rewrite_node_page(sbi, page, sum, ni->blk_addr, new_blkaddr); 1489 rewrite_node_page(sbi, page, sum, ni->blk_addr, new_blkaddr);
1492 set_node_addr(sbi, ni, new_blkaddr); 1490 set_node_addr(sbi, ni, new_blkaddr);
1493 clear_node_page_dirty(page); 1491 clear_node_page_dirty(page);
1494 } 1492 }
1495 1493
1496 int recover_inode_page(struct f2fs_sb_info *sbi, struct page *page) 1494 int recover_inode_page(struct f2fs_sb_info *sbi, struct page *page)
1497 { 1495 {
1498 struct address_space *mapping = sbi->node_inode->i_mapping; 1496 struct address_space *mapping = sbi->node_inode->i_mapping;
1499 struct f2fs_node *src, *dst; 1497 struct f2fs_node *src, *dst;
1500 nid_t ino = ino_of_node(page); 1498 nid_t ino = ino_of_node(page);
1501 struct node_info old_ni, new_ni; 1499 struct node_info old_ni, new_ni;
1502 struct page *ipage; 1500 struct page *ipage;
1503 1501
1504 ipage = grab_cache_page(mapping, ino); 1502 ipage = grab_cache_page(mapping, ino);
1505 if (!ipage) 1503 if (!ipage)
1506 return -ENOMEM; 1504 return -ENOMEM;
1507 1505
1508 /* Should not use this inode from free nid list */ 1506 /* Should not use this inode from free nid list */
1509 remove_free_nid(NM_I(sbi), ino); 1507 remove_free_nid(NM_I(sbi), ino);
1510 1508
1511 get_node_info(sbi, ino, &old_ni); 1509 get_node_info(sbi, ino, &old_ni);
1512 SetPageUptodate(ipage); 1510 SetPageUptodate(ipage);
1513 fill_node_footer(ipage, ino, ino, 0, true); 1511 fill_node_footer(ipage, ino, ino, 0, true);
1514 1512
1515 src = F2FS_NODE(page); 1513 src = F2FS_NODE(page);
1516 dst = F2FS_NODE(ipage); 1514 dst = F2FS_NODE(ipage);
1517 1515
1518 memcpy(dst, src, (unsigned long)&src->i.i_ext - (unsigned long)&src->i); 1516 memcpy(dst, src, (unsigned long)&src->i.i_ext - (unsigned long)&src->i);
1519 dst->i.i_size = 0; 1517 dst->i.i_size = 0;
1520 dst->i.i_blocks = cpu_to_le64(1); 1518 dst->i.i_blocks = cpu_to_le64(1);
1521 dst->i.i_links = cpu_to_le32(1); 1519 dst->i.i_links = cpu_to_le32(1);
1522 dst->i.i_xattr_nid = 0; 1520 dst->i.i_xattr_nid = 0;
1523 1521
1524 new_ni = old_ni; 1522 new_ni = old_ni;
1525 new_ni.ino = ino; 1523 new_ni.ino = ino;
1526 1524
1527 if (!inc_valid_node_count(sbi, NULL, 1)) 1525 if (!inc_valid_node_count(sbi, NULL, 1))
1528 WARN_ON(1); 1526 WARN_ON(1);
1529 set_node_addr(sbi, &new_ni, NEW_ADDR); 1527 set_node_addr(sbi, &new_ni, NEW_ADDR);
1530 inc_valid_inode_count(sbi); 1528 inc_valid_inode_count(sbi);
1531 f2fs_put_page(ipage, 1); 1529 f2fs_put_page(ipage, 1);
1532 return 0; 1530 return 0;
1533 } 1531 }
1534 1532
1535 int restore_node_summary(struct f2fs_sb_info *sbi, 1533 int restore_node_summary(struct f2fs_sb_info *sbi,
1536 unsigned int segno, struct f2fs_summary_block *sum) 1534 unsigned int segno, struct f2fs_summary_block *sum)
1537 { 1535 {
1538 struct f2fs_node *rn; 1536 struct f2fs_node *rn;
1539 struct f2fs_summary *sum_entry; 1537 struct f2fs_summary *sum_entry;
1540 struct page *page; 1538 struct page *page;
1541 block_t addr; 1539 block_t addr;
1542 int i, last_offset; 1540 int i, last_offset;
1543 1541
1544 /* alloc a temporary page for reading node blocks */ 1542 /* alloc a temporary page for reading node blocks */
1545 page = alloc_page(GFP_NOFS | __GFP_ZERO); 1543 page = alloc_page(GFP_NOFS | __GFP_ZERO);
1546 if (!page) 1544 if (!page)
1547 return -ENOMEM; 1545 return -ENOMEM;
1548 lock_page(page); 1546 lock_page(page);
1549 1547
1550 /* scan the node segment */ 1548 /* scan the node segment */
1551 last_offset = sbi->blocks_per_seg; 1549 last_offset = sbi->blocks_per_seg;
1552 addr = START_BLOCK(sbi, segno); 1550 addr = START_BLOCK(sbi, segno);
1553 sum_entry = &sum->entries[0]; 1551 sum_entry = &sum->entries[0];
1554 1552
1555 for (i = 0; i < last_offset; i++, sum_entry++) { 1553 for (i = 0; i < last_offset; i++, sum_entry++) {
1556 /* 1554 /*
1557 * In order to read the next node page, 1555 * In order to read the next node page,
1558 * we must clear the PageUptodate flag. 1556 * we must clear the PageUptodate flag.
1559 */ 1557 */
1560 ClearPageUptodate(page); 1558 ClearPageUptodate(page);
1561 1559
1562 if (f2fs_readpage(sbi, page, addr, READ_SYNC)) 1560 if (f2fs_readpage(sbi, page, addr, READ_SYNC))
1563 goto out; 1561 goto out;
1564 1562
1565 lock_page(page); 1563 lock_page(page);
1566 rn = F2FS_NODE(page); 1564 rn = F2FS_NODE(page);
1567 sum_entry->nid = rn->footer.nid; 1565 sum_entry->nid = rn->footer.nid;
1568 sum_entry->version = 0; 1566 sum_entry->version = 0;
1569 sum_entry->ofs_in_node = 0; 1567 sum_entry->ofs_in_node = 0;
1570 addr++; 1568 addr++;
1571 } 1569 }
1572 unlock_page(page); 1570 unlock_page(page);
1573 out: 1571 out:
1574 __free_pages(page, 0); 1572 __free_pages(page, 0);
1575 return 0; 1573 return 0;
1576 } 1574 }
1577 1575
1578 static bool flush_nats_in_journal(struct f2fs_sb_info *sbi) 1576 static bool flush_nats_in_journal(struct f2fs_sb_info *sbi)
1579 { 1577 {
1580 struct f2fs_nm_info *nm_i = NM_I(sbi); 1578 struct f2fs_nm_info *nm_i = NM_I(sbi);
1581 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); 1579 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
1582 struct f2fs_summary_block *sum = curseg->sum_blk; 1580 struct f2fs_summary_block *sum = curseg->sum_blk;
1583 int i; 1581 int i;
1584 1582
1585 mutex_lock(&curseg->curseg_mutex); 1583 mutex_lock(&curseg->curseg_mutex);
1586 1584
1587 if (nats_in_cursum(sum) < NAT_JOURNAL_ENTRIES) { 1585 if (nats_in_cursum(sum) < NAT_JOURNAL_ENTRIES) {
1588 mutex_unlock(&curseg->curseg_mutex); 1586 mutex_unlock(&curseg->curseg_mutex);
1589 return false; 1587 return false;
1590 } 1588 }
1591 1589
1592 for (i = 0; i < nats_in_cursum(sum); i++) { 1590 for (i = 0; i < nats_in_cursum(sum); i++) {
1593 struct nat_entry *ne; 1591 struct nat_entry *ne;
1594 struct f2fs_nat_entry raw_ne; 1592 struct f2fs_nat_entry raw_ne;
1595 nid_t nid = le32_to_cpu(nid_in_journal(sum, i)); 1593 nid_t nid = le32_to_cpu(nid_in_journal(sum, i));
1596 1594
1597 raw_ne = nat_in_journal(sum, i); 1595 raw_ne = nat_in_journal(sum, i);
1598 retry: 1596 retry:
1599 write_lock(&nm_i->nat_tree_lock); 1597 write_lock(&nm_i->nat_tree_lock);
1600 ne = __lookup_nat_cache(nm_i, nid); 1598 ne = __lookup_nat_cache(nm_i, nid);
1601 if (ne) { 1599 if (ne) {
1602 __set_nat_cache_dirty(nm_i, ne); 1600 __set_nat_cache_dirty(nm_i, ne);
1603 write_unlock(&nm_i->nat_tree_lock); 1601 write_unlock(&nm_i->nat_tree_lock);
1604 continue; 1602 continue;
1605 } 1603 }
1606 ne = grab_nat_entry(nm_i, nid); 1604 ne = grab_nat_entry(nm_i, nid);
1607 if (!ne) { 1605 if (!ne) {
1608 write_unlock(&nm_i->nat_tree_lock); 1606 write_unlock(&nm_i->nat_tree_lock);
1609 goto retry; 1607 goto retry;
1610 } 1608 }
1611 nat_set_blkaddr(ne, le32_to_cpu(raw_ne.block_addr)); 1609 nat_set_blkaddr(ne, le32_to_cpu(raw_ne.block_addr));
1612 nat_set_ino(ne, le32_to_cpu(raw_ne.ino)); 1610 nat_set_ino(ne, le32_to_cpu(raw_ne.ino));
1613 nat_set_version(ne, raw_ne.version); 1611 nat_set_version(ne, raw_ne.version);
1614 __set_nat_cache_dirty(nm_i, ne); 1612 __set_nat_cache_dirty(nm_i, ne);
1615 write_unlock(&nm_i->nat_tree_lock); 1613 write_unlock(&nm_i->nat_tree_lock);
1616 } 1614 }
1617 update_nats_in_cursum(sum, -i); 1615 update_nats_in_cursum(sum, -i);
1618 mutex_unlock(&curseg->curseg_mutex); 1616 mutex_unlock(&curseg->curseg_mutex);
1619 return true; 1617 return true;
1620 } 1618 }
1621 1619
1622 /* 1620 /*
1623 * This function is called during the checkpointing process. 1621 * This function is called during the checkpointing process.
1624 */ 1622 */
1625 void flush_nat_entries(struct f2fs_sb_info *sbi) 1623 void flush_nat_entries(struct f2fs_sb_info *sbi)
1626 { 1624 {
1627 struct f2fs_nm_info *nm_i = NM_I(sbi); 1625 struct f2fs_nm_info *nm_i = NM_I(sbi);
1628 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); 1626 struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
1629 struct f2fs_summary_block *sum = curseg->sum_blk; 1627 struct f2fs_summary_block *sum = curseg->sum_blk;
1630 struct list_head *cur, *n; 1628 struct list_head *cur, *n;
1631 struct page *page = NULL; 1629 struct page *page = NULL;
1632 struct f2fs_nat_block *nat_blk = NULL; 1630 struct f2fs_nat_block *nat_blk = NULL;
1633 nid_t start_nid = 0, end_nid = 0; 1631 nid_t start_nid = 0, end_nid = 0;
1634 bool flushed; 1632 bool flushed;
1635 1633
1636 flushed = flush_nats_in_journal(sbi); 1634 flushed = flush_nats_in_journal(sbi);
1637 1635
1638 if (!flushed) 1636 if (!flushed)
1639 mutex_lock(&curseg->curseg_mutex); 1637 mutex_lock(&curseg->curseg_mutex);
1640 1638
1641 /* 1) flush dirty nat caches */ 1639 /* 1) flush dirty nat caches */
1642 list_for_each_safe(cur, n, &nm_i->dirty_nat_entries) { 1640 list_for_each_safe(cur, n, &nm_i->dirty_nat_entries) {
1643 struct nat_entry *ne; 1641 struct nat_entry *ne;
1644 nid_t nid; 1642 nid_t nid;
1645 struct f2fs_nat_entry raw_ne; 1643 struct f2fs_nat_entry raw_ne;
1646 int offset = -1; 1644 int offset = -1;
1647 block_t new_blkaddr; 1645 block_t new_blkaddr;
1648 1646
1649 ne = list_entry(cur, struct nat_entry, list); 1647 ne = list_entry(cur, struct nat_entry, list);
1650 nid = nat_get_nid(ne); 1648 nid = nat_get_nid(ne);
1651 1649
1652 if (nat_get_blkaddr(ne) == NEW_ADDR) 1650 if (nat_get_blkaddr(ne) == NEW_ADDR)
1653 continue; 1651 continue;
1654 if (flushed) 1652 if (flushed)
1655 goto to_nat_page; 1653 goto to_nat_page;
1656 1654
1657 /* if there is room for nat entries in curseg->sumpage */ 1655 /* if there is room for nat entries in curseg->sumpage */
1658 offset = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 1); 1656 offset = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 1);
1659 if (offset >= 0) { 1657 if (offset >= 0) {
1660 raw_ne = nat_in_journal(sum, offset); 1658 raw_ne = nat_in_journal(sum, offset);
1661 goto flush_now; 1659 goto flush_now;
1662 } 1660 }
1663 to_nat_page: 1661 to_nat_page:
1664 if (!page || (start_nid > nid || nid > end_nid)) { 1662 if (!page || (start_nid > nid || nid > end_nid)) {
1665 if (page) { 1663 if (page) {
1666 f2fs_put_page(page, 1); 1664 f2fs_put_page(page, 1);
1667 page = NULL; 1665 page = NULL;
1668 } 1666 }
1669 start_nid = START_NID(nid); 1667 start_nid = START_NID(nid);
1670 end_nid = start_nid + NAT_ENTRY_PER_BLOCK - 1; 1668 end_nid = start_nid + NAT_ENTRY_PER_BLOCK - 1;
1671 1669
1672 /* 1670 /*
1673 * get the nat block with the dirty flag set, its reference 1671 * get the nat block with the dirty flag set, its reference
1674 * count increased, mapped and locked 1672 * count increased, mapped and locked
1675 */ 1673 */
1676 page = get_next_nat_page(sbi, start_nid); 1674 page = get_next_nat_page(sbi, start_nid);
1677 nat_blk = page_address(page); 1675 nat_blk = page_address(page);
1678 } 1676 }
1679 1677
1680 BUG_ON(!nat_blk); 1678 BUG_ON(!nat_blk);
1681 raw_ne = nat_blk->entries[nid - start_nid]; 1679 raw_ne = nat_blk->entries[nid - start_nid];
1682 flush_now: 1680 flush_now:
1683 new_blkaddr = nat_get_blkaddr(ne); 1681 new_blkaddr = nat_get_blkaddr(ne);
1684 1682
1685 raw_ne.ino = cpu_to_le32(nat_get_ino(ne)); 1683 raw_ne.ino = cpu_to_le32(nat_get_ino(ne));
1686 raw_ne.block_addr = cpu_to_le32(new_blkaddr); 1684 raw_ne.block_addr = cpu_to_le32(new_blkaddr);
1687 raw_ne.version = nat_get_version(ne); 1685 raw_ne.version = nat_get_version(ne);
1688 1686
1689 if (offset < 0) { 1687 if (offset < 0) {
1690 nat_blk->entries[nid - start_nid] = raw_ne; 1688 nat_blk->entries[nid - start_nid] = raw_ne;
1691 } else { 1689 } else {
1692 nat_in_journal(sum, offset) = raw_ne; 1690 nat_in_journal(sum, offset) = raw_ne;
1693 nid_in_journal(sum, offset) = cpu_to_le32(nid); 1691 nid_in_journal(sum, offset) = cpu_to_le32(nid);
1694 } 1692 }
1695 1693
1696 if (nat_get_blkaddr(ne) == NULL_ADDR && 1694 if (nat_get_blkaddr(ne) == NULL_ADDR &&
1697 add_free_nid(NM_I(sbi), nid, false) <= 0) { 1695 add_free_nid(NM_I(sbi), nid, false) <= 0) {
1698 write_lock(&nm_i->nat_tree_lock); 1696 write_lock(&nm_i->nat_tree_lock);
1699 __del_from_nat_cache(nm_i, ne); 1697 __del_from_nat_cache(nm_i, ne);
1700 write_unlock(&nm_i->nat_tree_lock); 1698 write_unlock(&nm_i->nat_tree_lock);
1701 } else { 1699 } else {
1702 write_lock(&nm_i->nat_tree_lock); 1700 write_lock(&nm_i->nat_tree_lock);
1703 __clear_nat_cache_dirty(nm_i, ne); 1701 __clear_nat_cache_dirty(nm_i, ne);
1704 ne->checkpointed = true; 1702 ne->checkpointed = true;
1705 write_unlock(&nm_i->nat_tree_lock); 1703 write_unlock(&nm_i->nat_tree_lock);
1706 } 1704 }
1707 } 1705 }
1708 if (!flushed) 1706 if (!flushed)
1709 mutex_unlock(&curseg->curseg_mutex); 1707 mutex_unlock(&curseg->curseg_mutex);
1710 f2fs_put_page(page, 1); 1708 f2fs_put_page(page, 1);
1711 1709
1712 /* 2) shrink nat caches if necessary */ 1710 /* 2) shrink nat caches if necessary */
1713 try_to_free_nats(sbi, nm_i->nat_cnt - NM_WOUT_THRESHOLD); 1711 try_to_free_nats(sbi, nm_i->nat_cnt - NM_WOUT_THRESHOLD);
1714 } 1712 }
1715 1713
1716 static int init_node_manager(struct f2fs_sb_info *sbi) 1714 static int init_node_manager(struct f2fs_sb_info *sbi)
1717 { 1715 {
1718 struct f2fs_super_block *sb_raw = F2FS_RAW_SUPER(sbi); 1716 struct f2fs_super_block *sb_raw = F2FS_RAW_SUPER(sbi);
1719 struct f2fs_nm_info *nm_i = NM_I(sbi); 1717 struct f2fs_nm_info *nm_i = NM_I(sbi);
1720 unsigned char *version_bitmap; 1718 unsigned char *version_bitmap;
1721 unsigned int nat_segs, nat_blocks; 1719 unsigned int nat_segs, nat_blocks;
1722 1720
1723 nm_i->nat_blkaddr = le32_to_cpu(sb_raw->nat_blkaddr); 1721 nm_i->nat_blkaddr = le32_to_cpu(sb_raw->nat_blkaddr);
1724 1722
1725 /* segment_count_nat includes the pair segment, so divide by 2. */ 1723 /* segment_count_nat includes the pair segment, so divide by 2. */
1726 nat_segs = le32_to_cpu(sb_raw->segment_count_nat) >> 1; 1724 nat_segs = le32_to_cpu(sb_raw->segment_count_nat) >> 1;
1727 nat_blocks = nat_segs << le32_to_cpu(sb_raw->log_blocks_per_seg); 1725 nat_blocks = nat_segs << le32_to_cpu(sb_raw->log_blocks_per_seg);
1728 nm_i->max_nid = NAT_ENTRY_PER_BLOCK * nat_blocks; 1726 nm_i->max_nid = NAT_ENTRY_PER_BLOCK * nat_blocks;
1729 nm_i->fcnt = 0; 1727 nm_i->fcnt = 0;
1730 nm_i->nat_cnt = 0; 1728 nm_i->nat_cnt = 0;
1731 1729
1732 INIT_LIST_HEAD(&nm_i->free_nid_list); 1730 INIT_LIST_HEAD(&nm_i->free_nid_list);
1733 INIT_RADIX_TREE(&nm_i->nat_root, GFP_ATOMIC); 1731 INIT_RADIX_TREE(&nm_i->nat_root, GFP_ATOMIC);
1734 INIT_LIST_HEAD(&nm_i->nat_entries); 1732 INIT_LIST_HEAD(&nm_i->nat_entries);
1735 INIT_LIST_HEAD(&nm_i->dirty_nat_entries); 1733 INIT_LIST_HEAD(&nm_i->dirty_nat_entries);
1736 1734
1737 mutex_init(&nm_i->build_lock); 1735 mutex_init(&nm_i->build_lock);
1738 spin_lock_init(&nm_i->free_nid_list_lock); 1736 spin_lock_init(&nm_i->free_nid_list_lock);
1739 rwlock_init(&nm_i->nat_tree_lock); 1737 rwlock_init(&nm_i->nat_tree_lock);
1740 1738
1741 nm_i->next_scan_nid = le32_to_cpu(sbi->ckpt->next_free_nid); 1739 nm_i->next_scan_nid = le32_to_cpu(sbi->ckpt->next_free_nid);
1742 nm_i->bitmap_size = __bitmap_size(sbi, NAT_BITMAP); 1740 nm_i->bitmap_size = __bitmap_size(sbi, NAT_BITMAP);
1743 version_bitmap = __bitmap_ptr(sbi, NAT_BITMAP); 1741 version_bitmap = __bitmap_ptr(sbi, NAT_BITMAP);
1744 if (!version_bitmap) 1742 if (!version_bitmap)
1745 return -EFAULT; 1743 return -EFAULT;
1746 1744
1747 nm_i->nat_bitmap = kmemdup(version_bitmap, nm_i->bitmap_size, 1745 nm_i->nat_bitmap = kmemdup(version_bitmap, nm_i->bitmap_size,
1748 GFP_KERNEL); 1746 GFP_KERNEL);
1749 if (!nm_i->nat_bitmap) 1747 if (!nm_i->nat_bitmap)
1750 return -ENOMEM; 1748 return -ENOMEM;
1751 return 0; 1749 return 0;
1752 } 1750 }
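A worked example of the sizing above, under assumed defaults that are not visible in this hunk (2 MiB segments, i.e. log_blocks_per_seg = 9, and NAT_ENTRY_PER_BLOCK = 455 for 4 KiB blocks holding 9-byte raw NAT entries): with segment_count_nat = 40, nat_segs = 40 / 2 = 20, nat_blocks = 20 << 9 = 10240, and max_nid = 455 * 10240 = 4,659,200 addressable node ids.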
1753 1751
1754 int build_node_manager(struct f2fs_sb_info *sbi) 1752 int build_node_manager(struct f2fs_sb_info *sbi)
1755 { 1753 {
1756 int err; 1754 int err;
1757 1755
1758 sbi->nm_info = kzalloc(sizeof(struct f2fs_nm_info), GFP_KERNEL); 1756 sbi->nm_info = kzalloc(sizeof(struct f2fs_nm_info), GFP_KERNEL);
1759 if (!sbi->nm_info) 1757 if (!sbi->nm_info)
1760 return -ENOMEM; 1758 return -ENOMEM;
1761 1759
1762 err = init_node_manager(sbi); 1760 err = init_node_manager(sbi);
1763 if (err) 1761 if (err)
1764 return err; 1762 return err;
1765 1763
1766 build_free_nids(sbi); 1764 build_free_nids(sbi);
1767 return 0; 1765 return 0;
1768 } 1766 }
1769 1767
1770 void destroy_node_manager(struct f2fs_sb_info *sbi) 1768 void destroy_node_manager(struct f2fs_sb_info *sbi)
1771 { 1769 {
1772 struct f2fs_nm_info *nm_i = NM_I(sbi); 1770 struct f2fs_nm_info *nm_i = NM_I(sbi);
1773 struct free_nid *i, *next_i; 1771 struct free_nid *i, *next_i;
1774 struct nat_entry *natvec[NATVEC_SIZE]; 1772 struct nat_entry *natvec[NATVEC_SIZE];
1775 nid_t nid = 0; 1773 nid_t nid = 0;
1776 unsigned int found; 1774 unsigned int found;
1777 1775
1778 if (!nm_i) 1776 if (!nm_i)
1779 return; 1777 return;
1780 1778
1781 /* destroy free nid list */ 1779 /* destroy free nid list */
1782 spin_lock(&nm_i->free_nid_list_lock); 1780 spin_lock(&nm_i->free_nid_list_lock);
1783 list_for_each_entry_safe(i, next_i, &nm_i->free_nid_list, list) { 1781 list_for_each_entry_safe(i, next_i, &nm_i->free_nid_list, list) {
1784 BUG_ON(i->state == NID_ALLOC); 1782 BUG_ON(i->state == NID_ALLOC);
1785 __del_from_free_nid_list(i); 1783 __del_from_free_nid_list(i);
1786 nm_i->fcnt--; 1784 nm_i->fcnt--;
1787 } 1785 }
1788 BUG_ON(nm_i->fcnt); 1786 BUG_ON(nm_i->fcnt);
1789 spin_unlock(&nm_i->free_nid_list_lock); 1787 spin_unlock(&nm_i->free_nid_list_lock);
1790 1788
1791 /* destroy nat cache */ 1789 /* destroy nat cache */
1792 write_lock(&nm_i->nat_tree_lock); 1790 write_lock(&nm_i->nat_tree_lock);
1793 while ((found = __gang_lookup_nat_cache(nm_i, 1791 while ((found = __gang_lookup_nat_cache(nm_i,
1794 nid, NATVEC_SIZE, natvec))) { 1792 nid, NATVEC_SIZE, natvec))) {
1795 unsigned idx; 1793 unsigned idx;
1796 for (idx = 0; idx < found; idx++) { 1794 for (idx = 0; idx < found; idx++) {
1797 struct nat_entry *e = natvec[idx]; 1795 struct nat_entry *e = natvec[idx];
1798 nid = nat_get_nid(e) + 1; 1796 nid = nat_get_nid(e) + 1;
1799 __del_from_nat_cache(nm_i, e); 1797 __del_from_nat_cache(nm_i, e);
1800 } 1798 }
1801 } 1799 }
1802 BUG_ON(nm_i->nat_cnt); 1800 BUG_ON(nm_i->nat_cnt);
1803 write_unlock(&nm_i->nat_tree_lock); 1801 write_unlock(&nm_i->nat_tree_lock);
1804 1802
1805 kfree(nm_i->nat_bitmap); 1803 kfree(nm_i->nat_bitmap);
1806 sbi->nm_info = NULL; 1804 sbi->nm_info = NULL;
1807 kfree(nm_i); 1805 kfree(nm_i);
1808 } 1806 }
1809 1807
1810 int __init create_node_manager_caches(void) 1808 int __init create_node_manager_caches(void)
1811 { 1809 {
1812 nat_entry_slab = f2fs_kmem_cache_create("nat_entry", 1810 nat_entry_slab = f2fs_kmem_cache_create("nat_entry",
1813 sizeof(struct nat_entry), NULL); 1811 sizeof(struct nat_entry), NULL);
1814 if (!nat_entry_slab) 1812 if (!nat_entry_slab)
1815 return -ENOMEM; 1813 return -ENOMEM;
1816 1814
1817 free_nid_slab = f2fs_kmem_cache_create("free_nid", 1815 free_nid_slab = f2fs_kmem_cache_create("free_nid",
1818 sizeof(struct free_nid), NULL); 1816 sizeof(struct free_nid), NULL);
1819 if (!free_nid_slab) { 1817 if (!free_nid_slab) {
1820 kmem_cache_destroy(nat_entry_slab); 1818 kmem_cache_destroy(nat_entry_slab);
1821 return -ENOMEM; 1819 return -ENOMEM;
1822 } 1820 }
1823 return 0; 1821 return 0;
1824 } 1822 }
1825 1823
1826 void destroy_node_manager_caches(void) 1824 void destroy_node_manager_caches(void)
1827 { 1825 {
1828 kmem_cache_destroy(free_nid_slab); 1826 kmem_cache_destroy(free_nid_slab);
1829 kmem_cache_destroy(nat_entry_slab); 1827 kmem_cache_destroy(nat_entry_slab);
1830 } 1828 }
1831 1829
1 /* 1 /*
2 FUSE: Filesystem in Userspace 2 FUSE: Filesystem in Userspace
3 Copyright (C) 2001-2008 Miklos Szeredi <miklos@szeredi.hu> 3 Copyright (C) 2001-2008 Miklos Szeredi <miklos@szeredi.hu>
4 4
5 This program can be distributed under the terms of the GNU GPL. 5 This program can be distributed under the terms of the GNU GPL.
6 See the file COPYING. 6 See the file COPYING.
7 */ 7 */
8 8
9 #include "fuse_i.h" 9 #include "fuse_i.h"
10 10
11 #include <linux/pagemap.h> 11 #include <linux/pagemap.h>
12 #include <linux/slab.h> 12 #include <linux/slab.h>
13 #include <linux/kernel.h> 13 #include <linux/kernel.h>
14 #include <linux/sched.h> 14 #include <linux/sched.h>
15 #include <linux/module.h> 15 #include <linux/module.h>
16 #include <linux/compat.h> 16 #include <linux/compat.h>
17 #include <linux/swap.h> 17 #include <linux/swap.h>
18 #include <linux/aio.h> 18 #include <linux/aio.h>
19 #include <linux/falloc.h> 19 #include <linux/falloc.h>
20 20
21 static const struct file_operations fuse_direct_io_file_operations; 21 static const struct file_operations fuse_direct_io_file_operations;
22 22
23 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file, 23 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
24 int opcode, struct fuse_open_out *outargp) 24 int opcode, struct fuse_open_out *outargp)
25 { 25 {
26 struct fuse_open_in inarg; 26 struct fuse_open_in inarg;
27 struct fuse_req *req; 27 struct fuse_req *req;
28 int err; 28 int err;
29 29
30 req = fuse_get_req_nopages(fc); 30 req = fuse_get_req_nopages(fc);
31 if (IS_ERR(req)) 31 if (IS_ERR(req))
32 return PTR_ERR(req); 32 return PTR_ERR(req);
33 33
34 memset(&inarg, 0, sizeof(inarg)); 34 memset(&inarg, 0, sizeof(inarg));
35 inarg.flags = file->f_flags & ~(O_CREAT | O_EXCL | O_NOCTTY); 35 inarg.flags = file->f_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
36 if (!fc->atomic_o_trunc) 36 if (!fc->atomic_o_trunc)
37 inarg.flags &= ~O_TRUNC; 37 inarg.flags &= ~O_TRUNC;
38 req->in.h.opcode = opcode; 38 req->in.h.opcode = opcode;
39 req->in.h.nodeid = nodeid; 39 req->in.h.nodeid = nodeid;
40 req->in.numargs = 1; 40 req->in.numargs = 1;
41 req->in.args[0].size = sizeof(inarg); 41 req->in.args[0].size = sizeof(inarg);
42 req->in.args[0].value = &inarg; 42 req->in.args[0].value = &inarg;
43 req->out.numargs = 1; 43 req->out.numargs = 1;
44 req->out.args[0].size = sizeof(*outargp); 44 req->out.args[0].size = sizeof(*outargp);
45 req->out.args[0].value = outargp; 45 req->out.args[0].value = outargp;
46 fuse_request_send(fc, req); 46 fuse_request_send(fc, req);
47 err = req->out.h.error; 47 err = req->out.h.error;
48 fuse_put_request(fc, req); 48 fuse_put_request(fc, req);
49 49
50 return err; 50 return err;
51 } 51 }
52 52
53 struct fuse_file *fuse_file_alloc(struct fuse_conn *fc) 53 struct fuse_file *fuse_file_alloc(struct fuse_conn *fc)
54 { 54 {
55 struct fuse_file *ff; 55 struct fuse_file *ff;
56 56
57 ff = kmalloc(sizeof(struct fuse_file), GFP_KERNEL); 57 ff = kmalloc(sizeof(struct fuse_file), GFP_KERNEL);
58 if (unlikely(!ff)) 58 if (unlikely(!ff))
59 return NULL; 59 return NULL;
60 60
61 ff->fc = fc; 61 ff->fc = fc;
62 ff->reserved_req = fuse_request_alloc(0); 62 ff->reserved_req = fuse_request_alloc(0);
63 if (unlikely(!ff->reserved_req)) { 63 if (unlikely(!ff->reserved_req)) {
64 kfree(ff); 64 kfree(ff);
65 return NULL; 65 return NULL;
66 } 66 }
67 67
68 INIT_LIST_HEAD(&ff->write_entry); 68 INIT_LIST_HEAD(&ff->write_entry);
69 atomic_set(&ff->count, 0); 69 atomic_set(&ff->count, 0);
70 RB_CLEAR_NODE(&ff->polled_node); 70 RB_CLEAR_NODE(&ff->polled_node);
71 init_waitqueue_head(&ff->poll_wait); 71 init_waitqueue_head(&ff->poll_wait);
72 72
73 spin_lock(&fc->lock); 73 spin_lock(&fc->lock);
74 ff->kh = ++fc->khctr; 74 ff->kh = ++fc->khctr;
75 spin_unlock(&fc->lock); 75 spin_unlock(&fc->lock);
76 76
77 return ff; 77 return ff;
78 } 78 }
79 79
80 void fuse_file_free(struct fuse_file *ff) 80 void fuse_file_free(struct fuse_file *ff)
81 { 81 {
82 fuse_request_free(ff->reserved_req); 82 fuse_request_free(ff->reserved_req);
83 kfree(ff); 83 kfree(ff);
84 } 84 }
85 85
86 struct fuse_file *fuse_file_get(struct fuse_file *ff) 86 struct fuse_file *fuse_file_get(struct fuse_file *ff)
87 { 87 {
88 atomic_inc(&ff->count); 88 atomic_inc(&ff->count);
89 return ff; 89 return ff;
90 } 90 }
91 91
92 static void fuse_release_async(struct work_struct *work) 92 static void fuse_release_async(struct work_struct *work)
93 { 93 {
94 struct fuse_req *req; 94 struct fuse_req *req;
95 struct fuse_conn *fc; 95 struct fuse_conn *fc;
96 struct path path; 96 struct path path;
97 97
98 req = container_of(work, struct fuse_req, misc.release.work); 98 req = container_of(work, struct fuse_req, misc.release.work);
99 path = req->misc.release.path; 99 path = req->misc.release.path;
100 fc = get_fuse_conn(path.dentry->d_inode); 100 fc = get_fuse_conn(path.dentry->d_inode);
101 101
102 fuse_put_request(fc, req); 102 fuse_put_request(fc, req);
103 path_put(&path); 103 path_put(&path);
104 } 104 }
105 105
106 static void fuse_release_end(struct fuse_conn *fc, struct fuse_req *req) 106 static void fuse_release_end(struct fuse_conn *fc, struct fuse_req *req)
107 { 107 {
108 if (fc->destroy_req) { 108 if (fc->destroy_req) {
109 /* 109 /*
110 * If this is a fuseblk mount, then it's possible that 110 * If this is a fuseblk mount, then it's possible that
111 * releasing the path will result in releasing the 111 * releasing the path will result in releasing the
112 * super block and sending the DESTROY request. If 112 * super block and sending the DESTROY request. If
113 * the server is single threaded, this would hang. 113 * the server is single threaded, this would hang.
114 * For this reason do the path_put() in a separate 114 * For this reason do the path_put() in a separate
115 * thread. 115 * thread.
116 */ 116 */
117 atomic_inc(&req->count); 117 atomic_inc(&req->count);
118 INIT_WORK(&req->misc.release.work, fuse_release_async); 118 INIT_WORK(&req->misc.release.work, fuse_release_async);
119 schedule_work(&req->misc.release.work); 119 schedule_work(&req->misc.release.work);
120 } else { 120 } else {
121 path_put(&req->misc.release.path); 121 path_put(&req->misc.release.path);
122 } 122 }
123 } 123 }
124 124
125 static void fuse_file_put(struct fuse_file *ff, bool sync) 125 static void fuse_file_put(struct fuse_file *ff, bool sync)
126 { 126 {
127 if (atomic_dec_and_test(&ff->count)) { 127 if (atomic_dec_and_test(&ff->count)) {
128 struct fuse_req *req = ff->reserved_req; 128 struct fuse_req *req = ff->reserved_req;
129 129
130 if (sync) { 130 if (sync) {
131 req->background = 0; 131 req->background = 0;
132 fuse_request_send(ff->fc, req); 132 fuse_request_send(ff->fc, req);
133 path_put(&req->misc.release.path); 133 path_put(&req->misc.release.path);
134 fuse_put_request(ff->fc, req); 134 fuse_put_request(ff->fc, req);
135 } else { 135 } else {
136 req->end = fuse_release_end; 136 req->end = fuse_release_end;
137 req->background = 1; 137 req->background = 1;
138 fuse_request_send_background(ff->fc, req); 138 fuse_request_send_background(ff->fc, req);
139 } 139 }
140 kfree(ff); 140 kfree(ff);
141 } 141 }
142 } 142 }
143 143
144 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file, 144 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
145 bool isdir) 145 bool isdir)
146 { 146 {
147 struct fuse_open_out outarg; 147 struct fuse_open_out outarg;
148 struct fuse_file *ff; 148 struct fuse_file *ff;
149 int err; 149 int err;
150 int opcode = isdir ? FUSE_OPENDIR : FUSE_OPEN; 150 int opcode = isdir ? FUSE_OPENDIR : FUSE_OPEN;
151 151
152 ff = fuse_file_alloc(fc); 152 ff = fuse_file_alloc(fc);
153 if (!ff) 153 if (!ff)
154 return -ENOMEM; 154 return -ENOMEM;
155 155
156 err = fuse_send_open(fc, nodeid, file, opcode, &outarg); 156 err = fuse_send_open(fc, nodeid, file, opcode, &outarg);
157 if (err) { 157 if (err) {
158 fuse_file_free(ff); 158 fuse_file_free(ff);
159 return err; 159 return err;
160 } 160 }
161 161
162 if (isdir) 162 if (isdir)
163 outarg.open_flags &= ~FOPEN_DIRECT_IO; 163 outarg.open_flags &= ~FOPEN_DIRECT_IO;
164 164
165 ff->fh = outarg.fh; 165 ff->fh = outarg.fh;
166 ff->nodeid = nodeid; 166 ff->nodeid = nodeid;
167 ff->open_flags = outarg.open_flags; 167 ff->open_flags = outarg.open_flags;
168 file->private_data = fuse_file_get(ff); 168 file->private_data = fuse_file_get(ff);
169 169
170 return 0; 170 return 0;
171 } 171 }
172 EXPORT_SYMBOL_GPL(fuse_do_open); 172 EXPORT_SYMBOL_GPL(fuse_do_open);
173 173
174 void fuse_finish_open(struct inode *inode, struct file *file) 174 void fuse_finish_open(struct inode *inode, struct file *file)
175 { 175 {
176 struct fuse_file *ff = file->private_data; 176 struct fuse_file *ff = file->private_data;
177 struct fuse_conn *fc = get_fuse_conn(inode); 177 struct fuse_conn *fc = get_fuse_conn(inode);
178 178
179 if (ff->open_flags & FOPEN_DIRECT_IO) 179 if (ff->open_flags & FOPEN_DIRECT_IO)
180 file->f_op = &fuse_direct_io_file_operations; 180 file->f_op = &fuse_direct_io_file_operations;
181 if (!(ff->open_flags & FOPEN_KEEP_CACHE)) 181 if (!(ff->open_flags & FOPEN_KEEP_CACHE))
182 invalidate_inode_pages2(inode->i_mapping); 182 invalidate_inode_pages2(inode->i_mapping);
183 if (ff->open_flags & FOPEN_NONSEEKABLE) 183 if (ff->open_flags & FOPEN_NONSEEKABLE)
184 nonseekable_open(inode, file); 184 nonseekable_open(inode, file);
185 if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) { 185 if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
186 struct fuse_inode *fi = get_fuse_inode(inode); 186 struct fuse_inode *fi = get_fuse_inode(inode);
187 187
188 spin_lock(&fc->lock); 188 spin_lock(&fc->lock);
189 fi->attr_version = ++fc->attr_version; 189 fi->attr_version = ++fc->attr_version;
190 i_size_write(inode, 0); 190 i_size_write(inode, 0);
191 spin_unlock(&fc->lock); 191 spin_unlock(&fc->lock);
192 fuse_invalidate_attr(inode); 192 fuse_invalidate_attr(inode);
193 } 193 }
194 } 194 }
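The open_flags consumed in fuse_finish_open() are chosen by the userspace server in its reply to the open request. A hedged sketch of the libfuse side, assuming the usual struct fuse_file_info bit-fields (direct_io, keep_cache, nonseekable); it is illustrative and not part of this diff:

	/*
	 * Illustrative libfuse (high-level API) open handler: the bits set
	 * here reach the kernel as FOPEN_DIRECT_IO / FOPEN_KEEP_CACHE /
	 * FOPEN_NONSEEKABLE and drive the branches in fuse_finish_open().
	 */
	static int example_open(const char *path, struct fuse_file_info *fi)
	{
		fi->direct_io = 1;	/* -> FOPEN_DIRECT_IO: switch to direct-IO file ops */
		fi->keep_cache = 0;	/* not kept -> kernel invalidates cached pages on open */
		fi->nonseekable = 1;	/* -> FOPEN_NONSEEKABLE */
		return 0;
	}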
195 195
196 int fuse_open_common(struct inode *inode, struct file *file, bool isdir) 196 int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
197 { 197 {
198 struct fuse_conn *fc = get_fuse_conn(inode); 198 struct fuse_conn *fc = get_fuse_conn(inode);
199 int err; 199 int err;
200 200
201 err = generic_file_open(inode, file); 201 err = generic_file_open(inode, file);
202 if (err) 202 if (err)
203 return err; 203 return err;
204 204
205 err = fuse_do_open(fc, get_node_id(inode), file, isdir); 205 err = fuse_do_open(fc, get_node_id(inode), file, isdir);
206 if (err) 206 if (err)
207 return err; 207 return err;
208 208
209 fuse_finish_open(inode, file); 209 fuse_finish_open(inode, file);
210 210
211 return 0; 211 return 0;
212 } 212 }
213 213
214 static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode) 214 static void fuse_prepare_release(struct fuse_file *ff, int flags, int opcode)
215 { 215 {
216 struct fuse_conn *fc = ff->fc; 216 struct fuse_conn *fc = ff->fc;
217 struct fuse_req *req = ff->reserved_req; 217 struct fuse_req *req = ff->reserved_req;
218 struct fuse_release_in *inarg = &req->misc.release.in; 218 struct fuse_release_in *inarg = &req->misc.release.in;
219 219
220 spin_lock(&fc->lock); 220 spin_lock(&fc->lock);
221 list_del(&ff->write_entry); 221 list_del(&ff->write_entry);
222 if (!RB_EMPTY_NODE(&ff->polled_node)) 222 if (!RB_EMPTY_NODE(&ff->polled_node))
223 rb_erase(&ff->polled_node, &fc->polled_files); 223 rb_erase(&ff->polled_node, &fc->polled_files);
224 spin_unlock(&fc->lock); 224 spin_unlock(&fc->lock);
225 225
226 wake_up_interruptible_all(&ff->poll_wait); 226 wake_up_interruptible_all(&ff->poll_wait);
227 227
228 inarg->fh = ff->fh; 228 inarg->fh = ff->fh;
229 inarg->flags = flags; 229 inarg->flags = flags;
230 req->in.h.opcode = opcode; 230 req->in.h.opcode = opcode;
231 req->in.h.nodeid = ff->nodeid; 231 req->in.h.nodeid = ff->nodeid;
232 req->in.numargs = 1; 232 req->in.numargs = 1;
233 req->in.args[0].size = sizeof(struct fuse_release_in); 233 req->in.args[0].size = sizeof(struct fuse_release_in);
234 req->in.args[0].value = inarg; 234 req->in.args[0].value = inarg;
235 } 235 }
236 236
237 void fuse_release_common(struct file *file, int opcode) 237 void fuse_release_common(struct file *file, int opcode)
238 { 238 {
239 struct fuse_file *ff; 239 struct fuse_file *ff;
240 struct fuse_req *req; 240 struct fuse_req *req;
241 241
242 ff = file->private_data; 242 ff = file->private_data;
243 if (unlikely(!ff)) 243 if (unlikely(!ff))
244 return; 244 return;
245 245
246 req = ff->reserved_req; 246 req = ff->reserved_req;
247 fuse_prepare_release(ff, file->f_flags, opcode); 247 fuse_prepare_release(ff, file->f_flags, opcode);
248 248
249 if (ff->flock) { 249 if (ff->flock) {
250 struct fuse_release_in *inarg = &req->misc.release.in; 250 struct fuse_release_in *inarg = &req->misc.release.in;
251 inarg->release_flags |= FUSE_RELEASE_FLOCK_UNLOCK; 251 inarg->release_flags |= FUSE_RELEASE_FLOCK_UNLOCK;
252 inarg->lock_owner = fuse_lock_owner_id(ff->fc, 252 inarg->lock_owner = fuse_lock_owner_id(ff->fc,
253 (fl_owner_t) file); 253 (fl_owner_t) file);
254 } 254 }
255 /* Hold vfsmount and dentry until release is finished */ 255 /* Hold vfsmount and dentry until release is finished */
256 path_get(&file->f_path); 256 path_get(&file->f_path);
257 req->misc.release.path = file->f_path; 257 req->misc.release.path = file->f_path;
258 258
259 /* 259 /*
260 * Normally this will send the RELEASE request, however if 260 * Normally this will send the RELEASE request, however if
261 * some asynchronous READ or WRITE requests are outstanding, 261 * some asynchronous READ or WRITE requests are outstanding,
262 * the sending will be delayed. 262 * the sending will be delayed.
263 * 263 *
264 * Make the release synchronous if this is a fuseblk mount, 264 * Make the release synchronous if this is a fuseblk mount,
265 * synchronous RELEASE is allowed (and desirable) in this case 265 * synchronous RELEASE is allowed (and desirable) in this case
266 * because the server can be trusted not to screw up. 266 * because the server can be trusted not to screw up.
267 */ 267 */
268 fuse_file_put(ff, ff->fc->destroy_req != NULL); 268 fuse_file_put(ff, ff->fc->destroy_req != NULL);
269 } 269 }
270 270
271 static int fuse_open(struct inode *inode, struct file *file) 271 static int fuse_open(struct inode *inode, struct file *file)
272 { 272 {
273 return fuse_open_common(inode, file, false); 273 return fuse_open_common(inode, file, false);
274 } 274 }
275 275
276 static int fuse_release(struct inode *inode, struct file *file) 276 static int fuse_release(struct inode *inode, struct file *file)
277 { 277 {
278 fuse_release_common(file, FUSE_RELEASE); 278 fuse_release_common(file, FUSE_RELEASE);
279 279
280 /* return value is ignored by VFS */ 280 /* return value is ignored by VFS */
281 return 0; 281 return 0;
282 } 282 }
283 283
284 void fuse_sync_release(struct fuse_file *ff, int flags) 284 void fuse_sync_release(struct fuse_file *ff, int flags)
285 { 285 {
286 WARN_ON(atomic_read(&ff->count) > 1); 286 WARN_ON(atomic_read(&ff->count) > 1);
287 fuse_prepare_release(ff, flags, FUSE_RELEASE); 287 fuse_prepare_release(ff, flags, FUSE_RELEASE);
288 ff->reserved_req->force = 1; 288 ff->reserved_req->force = 1;
289 ff->reserved_req->background = 0; 289 ff->reserved_req->background = 0;
290 fuse_request_send(ff->fc, ff->reserved_req); 290 fuse_request_send(ff->fc, ff->reserved_req);
291 fuse_put_request(ff->fc, ff->reserved_req); 291 fuse_put_request(ff->fc, ff->reserved_req);
292 kfree(ff); 292 kfree(ff);
293 } 293 }
294 EXPORT_SYMBOL_GPL(fuse_sync_release); 294 EXPORT_SYMBOL_GPL(fuse_sync_release);
295 295
/*
 * Scramble the ID space with XTEA, so that the value of the files_struct
 * pointer is not exposed to userspace.
 */
u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
{
	u32 *k = fc->scramble_key;
	u64 v = (unsigned long) id;
	u32 v0 = v;
	u32 v1 = v >> 32;
	u32 sum = 0;
	int i;

	for (i = 0; i < 32; i++) {
		v0 += ((v1 << 4 ^ v1 >> 5) + v1) ^ (sum + k[sum & 3]);
		sum += 0x9E3779B9;
		v1 += ((v0 << 4 ^ v0 >> 5) + v0) ^ (sum + k[sum>>11 & 3]);
	}

	return (u64) v0 + ((u64) v1 << 32);
}

/*
 * Check if page is under writeback
 *
 * This is currently done by walking the list of writepage requests
 * for the inode, which can be pretty inefficient.
 */
static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
{
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);
	struct fuse_req *req;
	bool found = false;

	spin_lock(&fc->lock);
	list_for_each_entry(req, &fi->writepages, writepages_entry) {
		pgoff_t curr_index;

		BUG_ON(req->inode != inode);
		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
		if (curr_index == index) {
			found = true;
			break;
		}
	}
	spin_unlock(&fc->lock);

	return found;
}

/*
 * Wait for page writeback to be completed.
 *
 * Since fuse doesn't rely on the VM writeback tracking, this has to
 * use some other means.
 */
static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
{
	struct fuse_inode *fi = get_fuse_inode(inode);

	wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
	return 0;
}

static int fuse_flush(struct file *file, fl_owner_t id)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_file *ff = file->private_data;
	struct fuse_req *req;
	struct fuse_flush_in inarg;
	int err;

	if (is_bad_inode(inode))
		return -EIO;

	if (fc->no_flush)
		return 0;

	req = fuse_get_req_nofail_nopages(fc, file);
	memset(&inarg, 0, sizeof(inarg));
	inarg.fh = ff->fh;
	inarg.lock_owner = fuse_lock_owner_id(fc, id);
	req->in.h.opcode = FUSE_FLUSH;
	req->in.h.nodeid = get_node_id(inode);
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(inarg);
	req->in.args[0].value = &inarg;
	req->force = 1;
	fuse_request_send(fc, req);
	err = req->out.h.error;
	fuse_put_request(fc, req);
	if (err == -ENOSYS) {
		fc->no_flush = 1;
		err = 0;
	}
	return err;
}

/*
 * Wait for all pending writepages on the inode to finish.
 *
 * This is currently done by blocking further writes with FUSE_NOWRITE
 * and waiting for all sent writes to complete.
 *
 * This must be called under i_mutex, otherwise the FUSE_NOWRITE usage
 * could conflict with truncation.
 */
static void fuse_sync_writes(struct inode *inode)
{
	fuse_set_nowrite(inode);
	fuse_release_nowrite(inode);
}

int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
		      int datasync, int isdir)
{
	struct inode *inode = file->f_mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_file *ff = file->private_data;
	struct fuse_req *req;
	struct fuse_fsync_in inarg;
	int err;

	if (is_bad_inode(inode))
		return -EIO;

	err = filemap_write_and_wait_range(inode->i_mapping, start, end);
	if (err)
		return err;

	if ((!isdir && fc->no_fsync) || (isdir && fc->no_fsyncdir))
		return 0;

	mutex_lock(&inode->i_mutex);

	/*
	 * Start writeback against all dirty pages of the inode, then
	 * wait for all outstanding writes, before sending the FSYNC
	 * request.
	 */
	err = write_inode_now(inode, 0);
	if (err)
		goto out;

	fuse_sync_writes(inode);

	req = fuse_get_req_nopages(fc);
	if (IS_ERR(req)) {
		err = PTR_ERR(req);
		goto out;
	}

	memset(&inarg, 0, sizeof(inarg));
	inarg.fh = ff->fh;
	inarg.fsync_flags = datasync ? 1 : 0;
	req->in.h.opcode = isdir ? FUSE_FSYNCDIR : FUSE_FSYNC;
	req->in.h.nodeid = get_node_id(inode);
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(inarg);
	req->in.args[0].value = &inarg;
	fuse_request_send(fc, req);
	err = req->out.h.error;
	fuse_put_request(fc, req);
	if (err == -ENOSYS) {
		if (isdir)
			fc->no_fsyncdir = 1;
		else
			fc->no_fsync = 1;
		err = 0;
	}
 out:
	mutex_unlock(&inode->i_mutex);
	return err;
}

static int fuse_fsync(struct file *file, loff_t start, loff_t end,
		      int datasync)
{
	return fuse_fsync_common(file, start, end, datasync, 0);
}

void fuse_read_fill(struct fuse_req *req, struct file *file, loff_t pos,
		    size_t count, int opcode)
{
	struct fuse_read_in *inarg = &req->misc.read.in;
	struct fuse_file *ff = file->private_data;

	inarg->fh = ff->fh;
	inarg->offset = pos;
	inarg->size = count;
	inarg->flags = file->f_flags;
	req->in.h.opcode = opcode;
	req->in.h.nodeid = ff->nodeid;
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(struct fuse_read_in);
	req->in.args[0].value = inarg;
	req->out.argvar = 1;
	req->out.numargs = 1;
	req->out.args[0].size = count;
}

static void fuse_release_user_pages(struct fuse_req *req, int write)
{
	unsigned i;

	for (i = 0; i < req->num_pages; i++) {
		struct page *page = req->pages[i];
		if (write)
			set_page_dirty_lock(page);
		put_page(page);
	}
}

/**
 * In case of short read, the caller sets 'pos' to the position of
 * actual end of fuse request in IO request. Otherwise, if bytes_requested
 * == bytes_transferred or rw == WRITE, the caller sets 'pos' to -1.
 *
 * An example:
 * User requested DIO read of 64K. It was splitted into two 32K fuse requests,
 * both submitted asynchronously. The first of them was ACKed by userspace as
 * fully completed (req->out.args[0].size == 32K) resulting in pos == -1. The
 * second request was ACKed as short, e.g. only 1K was read, resulting in
 * pos == 33K.
 *
 * Thus, when all fuse requests are completed, the minimal non-negative 'pos'
 * will be equal to the length of the longest contiguous fragment of
 * transferred data starting from the beginning of IO request.
 */
static void fuse_aio_complete(struct fuse_io_priv *io, int err, ssize_t pos)
{
	int left;

	spin_lock(&io->lock);
	if (err)
		io->err = io->err ? : err;
	else if (pos >= 0 && (io->bytes < 0 || pos < io->bytes))
		io->bytes = pos;

	left = --io->reqs;
	spin_unlock(&io->lock);

	if (!left) {
		long res;

		if (io->err)
			res = io->err;
		else if (io->bytes >= 0 && io->write)
			res = -EIO;
		else {
			res = io->bytes < 0 ? io->size : io->bytes;

			if (!is_sync_kiocb(io->iocb)) {
				struct inode *inode = file_inode(io->iocb->ki_filp);
				struct fuse_conn *fc = get_fuse_conn(inode);
				struct fuse_inode *fi = get_fuse_inode(inode);

				spin_lock(&fc->lock);
				fi->attr_version = ++fc->attr_version;
				spin_unlock(&fc->lock);
			}
		}

		aio_complete(io->iocb, res, 0);
		kfree(io);
	}
}

static void fuse_aio_complete_req(struct fuse_conn *fc, struct fuse_req *req)
{
	struct fuse_io_priv *io = req->io;
	ssize_t pos = -1;

	fuse_release_user_pages(req, !io->write);

	if (io->write) {
		if (req->misc.write.in.size != req->misc.write.out.size)
			pos = req->misc.write.in.offset - io->offset +
				req->misc.write.out.size;
	} else {
		if (req->misc.read.in.size != req->out.args[0].size)
			pos = req->misc.read.in.offset - io->offset +
				req->out.args[0].size;
	}

	fuse_aio_complete(io, req->out.h.error, pos);
}

static size_t fuse_async_req_send(struct fuse_conn *fc, struct fuse_req *req,
				  size_t num_bytes, struct fuse_io_priv *io)
{
	spin_lock(&io->lock);
	io->size += num_bytes;
	io->reqs++;
	spin_unlock(&io->lock);

	req->io = io;
	req->end = fuse_aio_complete_req;

	__fuse_get_request(req);
	fuse_request_send_background(fc, req);

	return num_bytes;
}

static size_t fuse_send_read(struct fuse_req *req, struct fuse_io_priv *io,
			     loff_t pos, size_t count, fl_owner_t owner)
{
	struct file *file = io->file;
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = ff->fc;

	fuse_read_fill(req, file, pos, count, FUSE_READ);
	if (owner != NULL) {
		struct fuse_read_in *inarg = &req->misc.read.in;

		inarg->read_flags |= FUSE_READ_LOCKOWNER;
		inarg->lock_owner = fuse_lock_owner_id(fc, owner);
	}

	if (io->async)
		return fuse_async_req_send(fc, req, count, io);

	fuse_request_send(fc, req);
	return req->out.args[0].size;
}

static void fuse_read_update_size(struct inode *inode, loff_t size,
				  u64 attr_ver)
{
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);

	spin_lock(&fc->lock);
	if (attr_ver == fi->attr_version && size < inode->i_size &&
	    !test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
		fi->attr_version = ++fc->attr_version;
		i_size_write(inode, size);
	}
	spin_unlock(&fc->lock);
}

static int fuse_readpage(struct file *file, struct page *page)
{
	struct fuse_io_priv io = { .async = 0, .file = file };
	struct inode *inode = page->mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_req *req;
	size_t num_read;
	loff_t pos = page_offset(page);
	size_t count = PAGE_CACHE_SIZE;
	u64 attr_ver;
	int err;

	err = -EIO;
	if (is_bad_inode(inode))
		goto out;

	/*
	 * Page writeback can extend beyond the lifetime of the
	 * page-cache page, so make sure we read a properly synced
	 * page.
	 */
	fuse_wait_on_page_writeback(inode, page->index);

	req = fuse_get_req(fc, 1);
	err = PTR_ERR(req);
	if (IS_ERR(req))
		goto out;

	attr_ver = fuse_get_attr_version(fc);

	req->out.page_zeroing = 1;
	req->out.argpages = 1;
	req->num_pages = 1;
	req->pages[0] = page;
	req->page_descs[0].length = count;
	num_read = fuse_send_read(req, &io, pos, count, NULL);
	err = req->out.h.error;
	fuse_put_request(fc, req);

	if (!err) {
		/*
		 * Short read means EOF. If file size is larger, truncate it
		 */
		if (num_read < count)
			fuse_read_update_size(inode, pos + num_read, attr_ver);

		SetPageUptodate(page);
	}

	fuse_invalidate_attr(inode); /* atime changed */
 out:
	unlock_page(page);
	return err;
}

static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
{
	int i;
	size_t count = req->misc.read.in.size;
	size_t num_read = req->out.args[0].size;
	struct address_space *mapping = NULL;

	for (i = 0; mapping == NULL && i < req->num_pages; i++)
		mapping = req->pages[i]->mapping;

	if (mapping) {
		struct inode *inode = mapping->host;

		/*
		 * Short read means EOF. If file size is larger, truncate it
		 */
		if (!req->out.h.error && num_read < count) {
			loff_t pos;

			pos = page_offset(req->pages[0]) + num_read;
			fuse_read_update_size(inode, pos,
					      req->misc.read.attr_ver);
		}
		fuse_invalidate_attr(inode); /* atime changed */
	}

	for (i = 0; i < req->num_pages; i++) {
		struct page *page = req->pages[i];
		if (!req->out.h.error)
			SetPageUptodate(page);
		else
			SetPageError(page);
		unlock_page(page);
		page_cache_release(page);
	}
	if (req->ff)
		fuse_file_put(req->ff, false);
}

static void fuse_send_readpages(struct fuse_req *req, struct file *file)
{
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = ff->fc;
	loff_t pos = page_offset(req->pages[0]);
	size_t count = req->num_pages << PAGE_CACHE_SHIFT;

	req->out.argpages = 1;
	req->out.page_zeroing = 1;
	req->out.page_replace = 1;
	fuse_read_fill(req, file, pos, count, FUSE_READ);
	req->misc.read.attr_ver = fuse_get_attr_version(fc);
	if (fc->async_read) {
		req->ff = fuse_file_get(ff);
		req->end = fuse_readpages_end;
		fuse_request_send_background(fc, req);
	} else {
		fuse_request_send(fc, req);
		fuse_readpages_end(fc, req);
		fuse_put_request(fc, req);
	}
}

struct fuse_fill_data {
	struct fuse_req *req;
	struct file *file;
	struct inode *inode;
	unsigned nr_pages;
};

static int fuse_readpages_fill(void *_data, struct page *page)
{
	struct fuse_fill_data *data = _data;
	struct fuse_req *req = data->req;
	struct inode *inode = data->inode;
	struct fuse_conn *fc = get_fuse_conn(inode);

	fuse_wait_on_page_writeback(inode, page->index);

	if (req->num_pages &&
	    (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
	     (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_read ||
	     req->pages[req->num_pages - 1]->index + 1 != page->index)) {
		int nr_alloc = min_t(unsigned, data->nr_pages,
				     FUSE_MAX_PAGES_PER_REQ);
		fuse_send_readpages(req, data->file);
		if (fc->async_read)
			req = fuse_get_req_for_background(fc, nr_alloc);
		else
			req = fuse_get_req(fc, nr_alloc);

		data->req = req;
		if (IS_ERR(req)) {
			unlock_page(page);
			return PTR_ERR(req);
		}
	}

	if (WARN_ON(req->num_pages >= req->max_pages)) {
		fuse_put_request(fc, req);
		return -EIO;
	}

	page_cache_get(page);
	req->pages[req->num_pages] = page;
	req->page_descs[req->num_pages].length = PAGE_SIZE;
	req->num_pages++;
	data->nr_pages--;
	return 0;
}

static int fuse_readpages(struct file *file, struct address_space *mapping,
			  struct list_head *pages, unsigned nr_pages)
{
	struct inode *inode = mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_fill_data data;
	int err;
	int nr_alloc = min_t(unsigned, nr_pages, FUSE_MAX_PAGES_PER_REQ);

	err = -EIO;
	if (is_bad_inode(inode))
		goto out;

	data.file = file;
	data.inode = inode;
	if (fc->async_read)
		data.req = fuse_get_req_for_background(fc, nr_alloc);
	else
		data.req = fuse_get_req(fc, nr_alloc);
	data.nr_pages = nr_pages;
	err = PTR_ERR(data.req);
	if (IS_ERR(data.req))
		goto out;

	err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data);
	if (!err) {
		if (data.req->num_pages)
			fuse_send_readpages(data.req, file);
		else
			fuse_put_request(fc, data.req);
	}
 out:
	return err;
}

static ssize_t fuse_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
				  unsigned long nr_segs, loff_t pos)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);

	/*
	 * In auto invalidate mode, always update attributes on read.
	 * Otherwise, only update if we attempt to read past EOF (to ensure
	 * i_size is up to date).
	 */
	if (fc->auto_inval_data ||
	    (pos + iov_length(iov, nr_segs) > i_size_read(inode))) {
		int err;
		err = fuse_update_attributes(inode, NULL, iocb->ki_filp, NULL);
		if (err)
			return err;
	}

	return generic_file_aio_read(iocb, iov, nr_segs, pos);
}

static void fuse_write_fill(struct fuse_req *req, struct fuse_file *ff,
			    loff_t pos, size_t count)
{
	struct fuse_write_in *inarg = &req->misc.write.in;
	struct fuse_write_out *outarg = &req->misc.write.out;

	inarg->fh = ff->fh;
	inarg->offset = pos;
	inarg->size = count;
	req->in.h.opcode = FUSE_WRITE;
	req->in.h.nodeid = ff->nodeid;
	req->in.numargs = 2;
	if (ff->fc->minor < 9)
		req->in.args[0].size = FUSE_COMPAT_WRITE_IN_SIZE;
	else
		req->in.args[0].size = sizeof(struct fuse_write_in);
	req->in.args[0].value = inarg;
	req->in.args[1].size = count;
	req->out.numargs = 1;
	req->out.args[0].size = sizeof(struct fuse_write_out);
	req->out.args[0].value = outarg;
}

static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
			      loff_t pos, size_t count, fl_owner_t owner)
{
	struct file *file = io->file;
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = ff->fc;
	struct fuse_write_in *inarg = &req->misc.write.in;

	fuse_write_fill(req, ff, pos, count);
	inarg->flags = file->f_flags;
	if (owner != NULL) {
		inarg->write_flags |= FUSE_WRITE_LOCKOWNER;
		inarg->lock_owner = fuse_lock_owner_id(fc, owner);
	}

	if (io->async)
		return fuse_async_req_send(fc, req, count, io);

	fuse_request_send(fc, req);
	return req->misc.write.out.size;
}

void fuse_write_update_size(struct inode *inode, loff_t pos)
{
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);

	spin_lock(&fc->lock);
	fi->attr_version = ++fc->attr_version;
	if (pos > inode->i_size)
		i_size_write(inode, pos);
	spin_unlock(&fc->lock);
}

static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
				    struct inode *inode, loff_t pos,
				    size_t count)
{
	size_t res;
	unsigned offset;
	unsigned i;
	struct fuse_io_priv io = { .async = 0, .file = file };

	for (i = 0; i < req->num_pages; i++)
		fuse_wait_on_page_writeback(inode, req->pages[i]->index);

	res = fuse_send_write(req, &io, pos, count, NULL);

	offset = req->page_descs[0].offset;
	count = res;
	for (i = 0; i < req->num_pages; i++) {
		struct page *page = req->pages[i];

		if (!req->out.h.error && !offset && count >= PAGE_CACHE_SIZE)
			SetPageUptodate(page);

		if (count > PAGE_CACHE_SIZE - offset)
			count -= PAGE_CACHE_SIZE - offset;
		else
			count = 0;
		offset = 0;

		unlock_page(page);
		page_cache_release(page);
	}

	return res;
}

static ssize_t fuse_fill_write_pages(struct fuse_req *req,
				     struct address_space *mapping,
				     struct iov_iter *ii, loff_t pos)
{
	struct fuse_conn *fc = get_fuse_conn(mapping->host);
	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
	size_t count = 0;
	int err;

	req->in.argpages = 1;
	req->page_descs[0].offset = offset;

	do {
		size_t tmp;
		struct page *page;
		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
		size_t bytes = min_t(size_t, PAGE_CACHE_SIZE - offset,
				     iov_iter_count(ii));

		bytes = min_t(size_t, bytes, fc->max_write - count);

 again:
		err = -EFAULT;
		if (iov_iter_fault_in_readable(ii, bytes))
			break;

		err = -ENOMEM;
		page = grab_cache_page_write_begin(mapping, index, 0);
		if (!page)
			break;

		if (mapping_writably_mapped(mapping))
			flush_dcache_page(page);

		tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes);
		flush_dcache_page(page);

-		mark_page_accessed(page);
-
		if (!tmp) {
			unlock_page(page);
			page_cache_release(page);
			bytes = min(bytes, iov_iter_single_seg_count(ii));
			goto again;
		}

		err = 0;
		req->pages[req->num_pages] = page;
		req->page_descs[req->num_pages].length = tmp;
		req->num_pages++;

		iov_iter_advance(ii, tmp);
		count += tmp;
		pos += tmp;
		offset += tmp;
		if (offset == PAGE_CACHE_SIZE)
			offset = 0;

		if (!fc->big_writes)
			break;
	} while (iov_iter_count(ii) && count < fc->max_write &&
		 req->num_pages < req->max_pages && offset == 0);

	return count > 0 ? count : err;
}

static inline unsigned fuse_wr_pages(loff_t pos, size_t len)
{
	return min_t(unsigned,
		     ((pos + len - 1) >> PAGE_CACHE_SHIFT) -
		     (pos >> PAGE_CACHE_SHIFT) + 1,
		     FUSE_MAX_PAGES_PER_REQ);
}

static ssize_t fuse_perform_write(struct file *file,
				  struct address_space *mapping,
				  struct iov_iter *ii, loff_t pos)
{
	struct inode *inode = mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);
	int err = 0;
	ssize_t res = 0;

	if (is_bad_inode(inode))
		return -EIO;

	if (inode->i_size < pos + iov_iter_count(ii))
		set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);

	do {
		struct fuse_req *req;
		ssize_t count;
		unsigned nr_pages = fuse_wr_pages(pos, iov_iter_count(ii));

		req = fuse_get_req(fc, nr_pages);
		if (IS_ERR(req)) {
			err = PTR_ERR(req);
			break;
		}

		count = fuse_fill_write_pages(req, mapping, ii, pos);
		if (count <= 0) {
			err = count;
		} else {
			size_t num_written;

			num_written = fuse_send_write_pages(req, file, inode,
							    pos, count);
			err = req->out.h.error;
			if (!err) {
				res += num_written;
				pos += num_written;

				/* break out of the loop on short write */
				if (num_written != count)
					err = -EIO;
			}
		}
		fuse_put_request(fc, req);
	} while (!err && iov_iter_count(ii));

	if (res > 0)
		fuse_write_update_size(inode, pos);

	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
	fuse_invalidate_attr(inode);

	return res > 0 ? res : err;
}

static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
				   unsigned long nr_segs, loff_t pos)
{
	struct file *file = iocb->ki_filp;
	struct address_space *mapping = file->f_mapping;
	size_t count = 0;
	size_t ocount = 0;
	ssize_t written = 0;
	ssize_t written_buffered = 0;
	struct inode *inode = mapping->host;
	ssize_t err;
	struct iov_iter i;
	loff_t endbyte = 0;

	WARN_ON(iocb->ki_pos != pos);

	ocount = 0;
	err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
	if (err)
		return err;

	count = ocount;
	mutex_lock(&inode->i_mutex);

	/* We can write back this queue in page reclaim */
	current->backing_dev_info = mapping->backing_dev_info;

	err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
	if (err)
		goto out;

	if (count == 0)
		goto out;

	err = file_remove_suid(file);
	if (err)
		goto out;

	err = file_update_time(file);
	if (err)
		goto out;

	if (file->f_flags & O_DIRECT) {
		written = generic_file_direct_write(iocb, iov, &nr_segs,
						    pos, &iocb->ki_pos,
						    count, ocount);
		if (written < 0 || written == count)
			goto out;

		pos += written;
		count -= written;

		iov_iter_init(&i, iov, nr_segs, count, written);
		written_buffered = fuse_perform_write(file, mapping, &i, pos);
		if (written_buffered < 0) {
			err = written_buffered;
			goto out;
		}
		endbyte = pos + written_buffered - 1;

		err = filemap_write_and_wait_range(file->f_mapping, pos,
						   endbyte);
		if (err)
			goto out;

		invalidate_mapping_pages(file->f_mapping,
					 pos >> PAGE_CACHE_SHIFT,
					 endbyte >> PAGE_CACHE_SHIFT);

		written += written_buffered;
		iocb->ki_pos = pos + written_buffered;
	} else {
		iov_iter_init(&i, iov, nr_segs, count, 0);
		written = fuse_perform_write(file, mapping, &i, pos);
		if (written >= 0)
			iocb->ki_pos = pos + written;
	}
 out:
	current->backing_dev_info = NULL;
	mutex_unlock(&inode->i_mutex);

	return written ? written : err;
}

1169 static inline void fuse_page_descs_length_init(struct fuse_req *req, 1167 static inline void fuse_page_descs_length_init(struct fuse_req *req,
1170 unsigned index, unsigned nr_pages) 1168 unsigned index, unsigned nr_pages)
1171 { 1169 {
1172 int i; 1170 int i;
1173 1171
1174 for (i = index; i < index + nr_pages; i++) 1172 for (i = index; i < index + nr_pages; i++)
1175 req->page_descs[i].length = PAGE_SIZE - 1173 req->page_descs[i].length = PAGE_SIZE -
1176 req->page_descs[i].offset; 1174 req->page_descs[i].offset;
1177 } 1175 }
1178 1176
1179 static inline unsigned long fuse_get_user_addr(const struct iov_iter *ii) 1177 static inline unsigned long fuse_get_user_addr(const struct iov_iter *ii)
1180 { 1178 {
1181 return (unsigned long)ii->iov->iov_base + ii->iov_offset; 1179 return (unsigned long)ii->iov->iov_base + ii->iov_offset;
1182 } 1180 }
1183 1181
1184 static inline size_t fuse_get_frag_size(const struct iov_iter *ii, 1182 static inline size_t fuse_get_frag_size(const struct iov_iter *ii,
1185 size_t max_size) 1183 size_t max_size)
1186 { 1184 {
	return min(iov_iter_single_seg_count(ii), max_size);
}

static int fuse_get_user_pages(struct fuse_req *req, struct iov_iter *ii,
			       size_t *nbytesp, int write)
{
	size_t nbytes = 0;  /* # bytes already packed in req */

	/* Special case for kernel I/O: can copy directly into the buffer */
	if (segment_eq(get_fs(), KERNEL_DS)) {
		unsigned long user_addr = fuse_get_user_addr(ii);
		size_t frag_size = fuse_get_frag_size(ii, *nbytesp);

		if (write)
			req->in.args[1].value = (void *) user_addr;
		else
			req->out.args[0].value = (void *) user_addr;

		iov_iter_advance(ii, frag_size);
		*nbytesp = frag_size;
		return 0;
	}

	while (nbytes < *nbytesp && req->num_pages < req->max_pages) {
		unsigned npages;
		unsigned long user_addr = fuse_get_user_addr(ii);
		unsigned offset = user_addr & ~PAGE_MASK;
		size_t frag_size = fuse_get_frag_size(ii, *nbytesp - nbytes);
		int ret;

		unsigned n = req->max_pages - req->num_pages;
		frag_size = min_t(size_t, frag_size, n << PAGE_SHIFT);

		npages = (frag_size + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
		npages = clamp(npages, 1U, n);

		ret = get_user_pages_fast(user_addr, npages, !write,
					  &req->pages[req->num_pages]);
		if (ret < 0)
			return ret;

		npages = ret;
		frag_size = min_t(size_t, frag_size,
				  (npages << PAGE_SHIFT) - offset);
		iov_iter_advance(ii, frag_size);

		req->page_descs[req->num_pages].offset = offset;
		fuse_page_descs_length_init(req, req->num_pages, npages);

		req->num_pages += npages;
		req->page_descs[req->num_pages - 1].length -=
			(npages << PAGE_SHIFT) - offset - frag_size;

		nbytes += frag_size;
	}

	if (write)
		req->in.argpages = 1;
	else
		req->out.argpages = 1;

	*nbytesp = nbytes;

	return 0;
}
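
/*
 * [Editor's illustration, not part of the kernel source] A worked example of
 * the page-count arithmetic above, assuming PAGE_SIZE == 4096 (PAGE_SHIFT ==
 * 12) and a hypothetical segment:
 *
 *	user_addr ends in 0x300  ->  offset    = 768 bytes
 *	frag_size = 10000 bytes
 *
 *	npages = (10000 + 768 + 4095) >> 12 = 14863 >> 12 = 3
 *
 * Three pages are requested from get_user_pages_fast(); if fewer are pinned,
 * frag_size is trimmed to (npages << PAGE_SHIFT) - offset, and the length of
 * the last page_desc is reduced so the descriptors cover exactly frag_size
 * bytes starting at "offset" within the first page.
 */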

static inline int fuse_iter_npages(const struct iov_iter *ii_p)
{
	struct iov_iter ii = *ii_p;
	int npages = 0;

	while (iov_iter_count(&ii) && npages < FUSE_MAX_PAGES_PER_REQ) {
		unsigned long user_addr = fuse_get_user_addr(&ii);
		unsigned offset = user_addr & ~PAGE_MASK;
		size_t frag_size = iov_iter_single_seg_count(&ii);

		npages += (frag_size + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
		iov_iter_advance(&ii, frag_size);
	}

	return min(npages, FUSE_MAX_PAGES_PER_REQ);
}

ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
		       unsigned long nr_segs, size_t count, loff_t *ppos,
		       int write)
{
	struct file *file = io->file;
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = ff->fc;
	size_t nmax = write ? fc->max_write : fc->max_read;
	loff_t pos = *ppos;
	ssize_t res = 0;
	struct fuse_req *req;
	struct iov_iter ii;

	iov_iter_init(&ii, iov, nr_segs, count, 0);

	if (io->async)
		req = fuse_get_req_for_background(fc, fuse_iter_npages(&ii));
	else
		req = fuse_get_req(fc, fuse_iter_npages(&ii));
	if (IS_ERR(req))
		return PTR_ERR(req);

	while (count) {
		size_t nres;
		fl_owner_t owner = current->files;
		size_t nbytes = min(count, nmax);
		int err = fuse_get_user_pages(req, &ii, &nbytes, write);
		if (err) {
			res = err;
			break;
		}

		if (write)
			nres = fuse_send_write(req, io, pos, nbytes, owner);
		else
			nres = fuse_send_read(req, io, pos, nbytes, owner);

		if (!io->async)
			fuse_release_user_pages(req, !write);
		if (req->out.h.error) {
			if (!res)
				res = req->out.h.error;
			break;
		} else if (nres > nbytes) {
			res = -EIO;
			break;
		}
		count -= nres;
		res += nres;
		pos += nres;
		if (nres != nbytes)
			break;
		if (count) {
			fuse_put_request(fc, req);
			if (io->async)
				req = fuse_get_req_for_background(fc,
					fuse_iter_npages(&ii));
			else
				req = fuse_get_req(fc, fuse_iter_npages(&ii));
			if (IS_ERR(req))
				break;
		}
	}
	if (!IS_ERR(req))
		fuse_put_request(fc, req);
	if (res > 0)
		*ppos = pos;

	return res;
}
EXPORT_SYMBOL_GPL(fuse_direct_io);
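
/*
 * [Editor's note, not part of the kernel source] fuse_direct_io() splits the
 * transfer into requests of at most min(count, fc->max_write or fc->max_read)
 * bytes: each iteration pins the next batch of user pages, sends one FUSE
 * read or write, and stops early on an error or a short transfer
 * (nres != nbytes).  For synchronous callers the pinned pages are released
 * right after the request completes; for async callers (io->async) they are
 * released later, when the request finishes.
 */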

static ssize_t __fuse_direct_read(struct fuse_io_priv *io,
				  const struct iovec *iov,
				  unsigned long nr_segs, loff_t *ppos,
				  size_t count)
{
	ssize_t res;
	struct file *file = io->file;
	struct inode *inode = file_inode(file);

	if (is_bad_inode(inode))
		return -EIO;

	res = fuse_direct_io(io, iov, nr_segs, count, ppos, 0);

	fuse_invalidate_attr(inode);

	return res;
}

static ssize_t fuse_direct_read(struct file *file, char __user *buf,
				size_t count, loff_t *ppos)
{
	struct fuse_io_priv io = { .async = 0, .file = file };
	struct iovec iov = { .iov_base = buf, .iov_len = count };
	return __fuse_direct_read(&io, &iov, 1, ppos, count);
}

static ssize_t __fuse_direct_write(struct fuse_io_priv *io,
				   const struct iovec *iov,
				   unsigned long nr_segs, loff_t *ppos)
{
	struct file *file = io->file;
	struct inode *inode = file_inode(file);
	size_t count = iov_length(iov, nr_segs);
	ssize_t res;

	res = generic_write_checks(file, ppos, &count, 0);
	if (!res)
		res = fuse_direct_io(io, iov, nr_segs, count, ppos, 1);

	fuse_invalidate_attr(inode);

	return res;
}

static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
				 size_t count, loff_t *ppos)
{
	struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count };
	struct inode *inode = file_inode(file);
	ssize_t res;
	struct fuse_io_priv io = { .async = 0, .file = file };

	if (is_bad_inode(inode))
		return -EIO;

	/* Don't allow parallel writes to the same file */
	mutex_lock(&inode->i_mutex);
	res = __fuse_direct_write(&io, &iov, 1, ppos);
	if (res > 0)
		fuse_write_update_size(inode, *ppos);
	mutex_unlock(&inode->i_mutex);

	return res;
}

static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
{
	__free_page(req->pages[0]);
	fuse_file_put(req->ff, false);
}

static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
{
	struct inode *inode = req->inode;
	struct fuse_inode *fi = get_fuse_inode(inode);
	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;

	list_del(&req->writepages_entry);
	dec_bdi_stat(bdi, BDI_WRITEBACK);
	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
	bdi_writeout_inc(bdi);
	wake_up(&fi->page_waitq);
}

/* Called under fc->lock, may release and reacquire it */
static void fuse_send_writepage(struct fuse_conn *fc, struct fuse_req *req)
__releases(fc->lock)
__acquires(fc->lock)
{
	struct fuse_inode *fi = get_fuse_inode(req->inode);
	loff_t size = i_size_read(req->inode);
	struct fuse_write_in *inarg = &req->misc.write.in;

	if (!fc->connected)
		goto out_free;

	if (inarg->offset + PAGE_CACHE_SIZE <= size) {
		inarg->size = PAGE_CACHE_SIZE;
	} else if (inarg->offset < size) {
		inarg->size = size & (PAGE_CACHE_SIZE - 1);
	} else {
		/* Got truncated off completely */
		goto out_free;
	}

	req->in.args[1].size = inarg->size;
	fi->writectr++;
	fuse_request_send_background_locked(fc, req);
	return;

 out_free:
	fuse_writepage_finish(fc, req);
	spin_unlock(&fc->lock);
	fuse_writepage_free(fc, req);
	fuse_put_request(fc, req);
	spin_lock(&fc->lock);
}

/*
 * If fi->writectr is positive (no truncate or fsync going on) send
 * all queued writepage requests.
 *
 * Called with fc->lock
 */
void fuse_flush_writepages(struct inode *inode)
__releases(fc->lock)
__acquires(fc->lock)
{
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);
	struct fuse_req *req;

	while (fi->writectr >= 0 && !list_empty(&fi->queued_writes)) {
		req = list_entry(fi->queued_writes.next, struct fuse_req, list);
		list_del_init(&req->list);
		fuse_send_writepage(fc, req);
	}
}

static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
{
	struct inode *inode = req->inode;
	struct fuse_inode *fi = get_fuse_inode(inode);

	mapping_set_error(inode->i_mapping, req->out.h.error);
	spin_lock(&fc->lock);
	fi->writectr--;
	fuse_writepage_finish(fc, req);
	spin_unlock(&fc->lock);
	fuse_writepage_free(fc, req);
}

static int fuse_writepage_locked(struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct inode *inode = mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_inode *fi = get_fuse_inode(inode);
	struct fuse_req *req;
	struct fuse_file *ff;
	struct page *tmp_page;

	set_page_writeback(page);

	req = fuse_request_alloc_nofs(1);
	if (!req)
		goto err;

	req->background = 1; /* writeback always goes to bg_queue */
	tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
	if (!tmp_page)
		goto err_free;

	spin_lock(&fc->lock);
	BUG_ON(list_empty(&fi->write_files));
	ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
	req->ff = fuse_file_get(ff);
	spin_unlock(&fc->lock);

	fuse_write_fill(req, ff, page_offset(page), 0);

	copy_highpage(tmp_page, page);
	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
	req->in.argpages = 1;
	req->num_pages = 1;
	req->pages[0] = tmp_page;
	req->page_descs[0].offset = 0;
	req->page_descs[0].length = PAGE_SIZE;
	req->end = fuse_writepage_end;
	req->inode = inode;

	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);

	spin_lock(&fc->lock);
	list_add(&req->writepages_entry, &fi->writepages);
	list_add_tail(&req->list, &fi->queued_writes);
	fuse_flush_writepages(inode);
	spin_unlock(&fc->lock);

	end_page_writeback(page);

	return 0;

err_free:
	fuse_request_free(req);
err:
	end_page_writeback(page);
	return -ENOMEM;
}
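
/*
 * [Editor's note, not part of the kernel source] fuse_writepage_locked()
 * copies the page into a freshly allocated tmp_page, queues the write and
 * then calls end_page_writeback() on the original page right away, before
 * the userspace server has acknowledged anything.  The temporary page is
 * accounted as BDI_WRITEBACK/NR_WRITEBACK_TEMP and is only undone in
 * fuse_writepage_end() (via fuse_writepage_finish() and
 * fuse_writepage_free()), which is why this file tracks in-flight writes
 * itself and callers such as fuse_launder_page() and fuse_page_mkwrite()
 * use fuse_wait_on_page_writeback() instead of the generic writeback bit.
 */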

static int fuse_writepage(struct page *page, struct writeback_control *wbc)
{
	int err;

	err = fuse_writepage_locked(page);
	unlock_page(page);

	return err;
}

static int fuse_launder_page(struct page *page)
{
	int err = 0;
	if (clear_page_dirty_for_io(page)) {
		struct inode *inode = page->mapping->host;
		err = fuse_writepage_locked(page);
		if (!err)
			fuse_wait_on_page_writeback(inode, page->index);
	}
	return err;
}

/*
 * Write back dirty pages now, because there may not be any suitable
 * open files later
 */
static void fuse_vma_close(struct vm_area_struct *vma)
{
	filemap_write_and_wait(vma->vm_file->f_mapping);
}

/*
 * Wait for writeback against this page to complete before allowing it
 * to be marked dirty again, and hence written back again, possibly
 * before the previous writepage completed.
 *
 * Block here, instead of in ->writepage(), so that the userspace fs
 * can only block processes actually operating on the filesystem.
 *
 * Otherwise unprivileged userspace fs would be able to block
 * unrelated:
 *
 * - page migration
 * - sync(2)
 * - try_to_free_pages() with order > PAGE_ALLOC_COSTLY_ORDER
 */
static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	/*
	 * Don't use page->mapping as it may become NULL from a
	 * concurrent truncate.
	 */
	struct inode *inode = vma->vm_file->f_mapping->host;

	fuse_wait_on_page_writeback(inode, page->index);
	return 0;
}

static const struct vm_operations_struct fuse_file_vm_ops = {
	.close		= fuse_vma_close,
	.fault		= filemap_fault,
	.page_mkwrite	= fuse_page_mkwrite,
	.remap_pages	= generic_file_remap_pages,
};

static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
		struct inode *inode = file_inode(file);
		struct fuse_conn *fc = get_fuse_conn(inode);
		struct fuse_inode *fi = get_fuse_inode(inode);
		struct fuse_file *ff = file->private_data;
		/*
		 * file may be written through mmap, so chain it onto the
		 * inodes's write_file list
		 */
		spin_lock(&fc->lock);
		if (list_empty(&ff->write_entry))
			list_add(&ff->write_entry, &fi->write_files);
		spin_unlock(&fc->lock);
	}
	file_accessed(file);
	vma->vm_ops = &fuse_file_vm_ops;
	return 0;
}

static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* Can't provide the coherency needed for MAP_SHARED */
	if (vma->vm_flags & VM_MAYSHARE)
		return -ENODEV;

	invalidate_inode_pages2(file->f_mapping);

	return generic_file_mmap(file, vma);
}

static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
				  struct file_lock *fl)
{
	switch (ffl->type) {
	case F_UNLCK:
		break;

	case F_RDLCK:
	case F_WRLCK:
		if (ffl->start > OFFSET_MAX || ffl->end > OFFSET_MAX ||
		    ffl->end < ffl->start)
			return -EIO;

		fl->fl_start = ffl->start;
		fl->fl_end = ffl->end;
		fl->fl_pid = ffl->pid;
		break;

	default:
		return -EIO;
	}
	fl->fl_type = ffl->type;
	return 0;
}

static void fuse_lk_fill(struct fuse_req *req, struct file *file,
			 const struct file_lock *fl, int opcode, pid_t pid,
			 int flock)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_file *ff = file->private_data;
	struct fuse_lk_in *arg = &req->misc.lk_in;

	arg->fh = ff->fh;
	arg->owner = fuse_lock_owner_id(fc, fl->fl_owner);
	arg->lk.start = fl->fl_start;
	arg->lk.end = fl->fl_end;
	arg->lk.type = fl->fl_type;
	arg->lk.pid = pid;
	if (flock)
		arg->lk_flags |= FUSE_LK_FLOCK;
	req->in.h.opcode = opcode;
	req->in.h.nodeid = get_node_id(inode);
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(*arg);
	req->in.args[0].value = arg;
}

static int fuse_getlk(struct file *file, struct file_lock *fl)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_req *req;
	struct fuse_lk_out outarg;
	int err;

	req = fuse_get_req_nopages(fc);
	if (IS_ERR(req))
		return PTR_ERR(req);

	fuse_lk_fill(req, file, fl, FUSE_GETLK, 0, 0);
	req->out.numargs = 1;
	req->out.args[0].size = sizeof(outarg);
	req->out.args[0].value = &outarg;
	fuse_request_send(fc, req);
	err = req->out.h.error;
	fuse_put_request(fc, req);
	if (!err)
		err = convert_fuse_file_lock(&outarg.lk, fl);

	return err;
}

static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_req *req;
	int opcode = (fl->fl_flags & FL_SLEEP) ? FUSE_SETLKW : FUSE_SETLK;
	pid_t pid = fl->fl_type != F_UNLCK ? current->tgid : 0;
	int err;

	if (fl->fl_lmops && fl->fl_lmops->lm_grant) {
		/* NLM needs asynchronous locks, which we don't support yet */
		return -ENOLCK;
	}

	/* Unlock on close is handled by the flush method */
	if (fl->fl_flags & FL_CLOSE)
		return 0;

	req = fuse_get_req_nopages(fc);
	if (IS_ERR(req))
		return PTR_ERR(req);

	fuse_lk_fill(req, file, fl, opcode, pid, flock);
	fuse_request_send(fc, req);
	err = req->out.h.error;
	/* locking is restartable */
	if (err == -EINTR)
		err = -ERESTARTSYS;
	fuse_put_request(fc, req);
	return err;
}

static int fuse_file_lock(struct file *file, int cmd, struct file_lock *fl)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	int err;

	if (cmd == F_CANCELLK) {
		err = 0;
	} else if (cmd == F_GETLK) {
		if (fc->no_lock) {
			posix_test_lock(file, fl);
			err = 0;
		} else
			err = fuse_getlk(file, fl);
	} else {
		if (fc->no_lock)
			err = posix_lock_file(file, fl, NULL);
		else
			err = fuse_setlk(file, fl, 0);
	}
	return err;
}

static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl)
{
	struct inode *inode = file_inode(file);
	struct fuse_conn *fc = get_fuse_conn(inode);
	int err;

	if (fc->no_flock) {
		err = flock_lock_file_wait(file, fl);
	} else {
		struct fuse_file *ff = file->private_data;

		/* emulate flock with POSIX locks */
		fl->fl_owner = (fl_owner_t) file;
		ff->flock = true;
		err = fuse_setlk(file, fl, 1);
	}

	return err;
}
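
/*
 * [Editor's note, not part of the kernel source] When the server does handle
 * flock (fc->no_flock is clear), flock() is emulated through the POSIX lock
 * path: fl_owner is set to the struct file itself so that each open file
 * description acts as a distinct lock owner, matching flock() semantics, and
 * fuse_lk_fill() tags the request with FUSE_LK_FLOCK so the server can tell
 * the two APIs apart.  ff->flock additionally marks the file as flock-locked
 * for later release handling.
 */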

static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
{
	struct inode *inode = mapping->host;
	struct fuse_conn *fc = get_fuse_conn(inode);
	struct fuse_req *req;
	struct fuse_bmap_in inarg;
	struct fuse_bmap_out outarg;
	int err;

	if (!inode->i_sb->s_bdev || fc->no_bmap)
		return 0;

	req = fuse_get_req_nopages(fc);
	if (IS_ERR(req))
		return 0;

	memset(&inarg, 0, sizeof(inarg));
	inarg.block = block;
	inarg.blocksize = inode->i_sb->s_blocksize;
	req->in.h.opcode = FUSE_BMAP;
	req->in.h.nodeid = get_node_id(inode);
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(inarg);
	req->in.args[0].value = &inarg;
	req->out.numargs = 1;
	req->out.args[0].size = sizeof(outarg);
	req->out.args[0].value = &outarg;
	fuse_request_send(fc, req);
	err = req->out.h.error;
	fuse_put_request(fc, req);
	if (err == -ENOSYS)
		fc->no_bmap = 1;

	return err ? 0 : outarg.block;
}

static loff_t fuse_file_llseek(struct file *file, loff_t offset, int whence)
{
	loff_t retval;
	struct inode *inode = file_inode(file);

	/* No i_mutex protection necessary for SEEK_CUR and SEEK_SET */
	if (whence == SEEK_CUR || whence == SEEK_SET)
		return generic_file_llseek(file, offset, whence);

	mutex_lock(&inode->i_mutex);
	retval = fuse_update_attributes(inode, NULL, file, NULL);
	if (!retval)
		retval = generic_file_llseek(file, offset, whence);
	mutex_unlock(&inode->i_mutex);

	return retval;
}

static int fuse_ioctl_copy_user(struct page **pages, struct iovec *iov,
				unsigned int nr_segs, size_t bytes, bool to_user)
{
	struct iov_iter ii;
	int page_idx = 0;

	if (!bytes)
		return 0;

	iov_iter_init(&ii, iov, nr_segs, bytes, 0);

	while (iov_iter_count(&ii)) {
		struct page *page = pages[page_idx++];
		size_t todo = min_t(size_t, PAGE_SIZE, iov_iter_count(&ii));
		void *kaddr;

		kaddr = kmap(page);

		while (todo) {
			char __user *uaddr = ii.iov->iov_base + ii.iov_offset;
			size_t iov_len = ii.iov->iov_len - ii.iov_offset;
			size_t copy = min(todo, iov_len);
			size_t left;

			if (!to_user)
				left = copy_from_user(kaddr, uaddr, copy);
			else
				left = copy_to_user(uaddr, kaddr, copy);

			if (unlikely(left))
				return -EFAULT;

			iov_iter_advance(&ii, copy);
			todo -= copy;
			kaddr += copy;
		}

		kunmap(page);
	}

	return 0;
}

/*
 * CUSE servers compiled on 32bit broke on 64bit kernels because the
 * ABI was defined to be 'struct iovec' which is different on 32bit
 * and 64bit.  Fortunately we can determine which structure the server
 * used from the size of the reply.
 */
static int fuse_copy_ioctl_iovec_old(struct iovec *dst, void *src,
				     size_t transferred, unsigned count,
				     bool is_compat)
{
#ifdef CONFIG_COMPAT
	if (count * sizeof(struct compat_iovec) == transferred) {
		struct compat_iovec *ciov = src;
		unsigned i;

		/*
		 * With this interface a 32bit server cannot support
		 * non-compat (i.e. ones coming from 64bit apps) ioctl
		 * requests
		 */
		if (!is_compat)
			return -EINVAL;

		for (i = 0; i < count; i++) {
			dst[i].iov_base = compat_ptr(ciov[i].iov_base);
			dst[i].iov_len = ciov[i].iov_len;
		}
		return 0;
	}
#endif

	if (count * sizeof(struct iovec) != transferred)
		return -EIO;

	memcpy(dst, src, transferred);
	return 0;
}

/* Make sure iov_length() won't overflow */
static int fuse_verify_ioctl_iov(struct iovec *iov, size_t count)
{
	size_t n;
	u32 max = FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT;

	for (n = 0; n < count; n++, iov++) {
		if (iov->iov_len > (size_t) max)
			return -ENOMEM;
		max -= iov->iov_len;
	}
	return 0;
}
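
/*
 * [Editor's note, not part of the kernel source] The check above bounds the
 * sum of the iov_len values without ever computing that sum: "max" starts at
 * FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT and every segment must fit in what is
 * still left, so the total length can never exceed that limit and the later
 * iov_length() additions in fuse_do_ioctl() cannot overflow.
 */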

static int fuse_copy_ioctl_iovec(struct fuse_conn *fc, struct iovec *dst,
				 void *src, size_t transferred, unsigned count,
				 bool is_compat)
{
	unsigned i;
	struct fuse_ioctl_iovec *fiov = src;

	if (fc->minor < 16) {
		return fuse_copy_ioctl_iovec_old(dst, src, transferred,
						 count, is_compat);
	}

	if (count * sizeof(struct fuse_ioctl_iovec) != transferred)
		return -EIO;

	for (i = 0; i < count; i++) {
		/* Did the server supply an inappropriate value? */
		if (fiov[i].base != (unsigned long) fiov[i].base ||
		    fiov[i].len != (unsigned long) fiov[i].len)
			return -EIO;

		dst[i].iov_base = (void __user *) (unsigned long) fiov[i].base;
		dst[i].iov_len = (size_t) fiov[i].len;

#ifdef CONFIG_COMPAT
		if (is_compat &&
		    (ptr_to_compat(dst[i].iov_base) != fiov[i].base ||
		     (compat_size_t) dst[i].iov_len != fiov[i].len))
			return -EIO;
#endif
	}

	return 0;
}


/*
 * For ioctls, there is no generic way to determine how much memory
 * needs to be read and/or written.  Furthermore, ioctls are allowed
 * to dereference the passed pointer, so the parameter requires deep
 * copying but FUSE has no idea whatsoever about what to copy in or
 * out.
 *
 * This is solved by allowing FUSE server to retry ioctl with
 * necessary in/out iovecs.  Let's assume the ioctl implementation
 * needs to read in the following structure.
 *
 * struct a {
 *	char	*buf;
 *	size_t	buflen;
 * }
 *
 * On the first callout to FUSE server, inarg->in_size and
 * inarg->out_size will be NULL; then, the server completes the ioctl
 * with FUSE_IOCTL_RETRY set in out->flags, out->in_iovs set to 1 and
 * the actual iov array to
 *
 * { { .iov_base = inarg.arg, .iov_len = sizeof(struct a) } }
 *
 * which tells FUSE to copy in the requested area and retry the ioctl.
 * On the second round, the server has access to the structure and
 * from that it can tell what to look for next, so on the invocation,
 * it sets FUSE_IOCTL_RETRY, out->in_iovs to 2 and iov array to
 *
 * { { .iov_base = inarg.arg, .iov_len = sizeof(struct a) },
 *   { .iov_base = a.buf, .iov_len = a.buflen } }
 *
 * FUSE will copy both struct a and the pointed buffer from the
 * process doing the ioctl and retry ioctl with both struct a and the
 * buffer.
 *
 * This time, FUSE server has everything it needs and completes ioctl
 * without FUSE_IOCTL_RETRY which finishes the ioctl call.
 *
 * Copying data out works the same way.
 *
 * Note that if FUSE_IOCTL_UNRESTRICTED is clear, the kernel
 * automatically initializes in and out iovs by decoding @cmd with
 * _IOC_* macros and the server is not allowed to request RETRY.  This
 * limits ioctl data transfers to well-formed ioctls and is the forced
 * behavior for all FUSE servers.
 */
long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
		   unsigned int flags)
{
	struct fuse_file *ff = file->private_data;
	struct fuse_conn *fc = ff->fc;
	struct fuse_ioctl_in inarg = {
		.fh = ff->fh,
		.cmd = cmd,
		.arg = arg,
		.flags = flags
	};
	struct fuse_ioctl_out outarg;
	struct fuse_req *req = NULL;
	struct page **pages = NULL;
	struct iovec *iov_page = NULL;
	struct iovec *in_iov = NULL, *out_iov = NULL;
	unsigned int in_iovs = 0, out_iovs = 0, num_pages = 0, max_pages;
	size_t in_size, out_size, transferred;
	int err;

#if BITS_PER_LONG == 32
	inarg.flags |= FUSE_IOCTL_32BIT;
#else
	if (flags & FUSE_IOCTL_COMPAT)
		inarg.flags |= FUSE_IOCTL_32BIT;
#endif

	/* assume all the iovs returned by client always fits in a page */
	BUILD_BUG_ON(sizeof(struct fuse_ioctl_iovec) * FUSE_IOCTL_MAX_IOV > PAGE_SIZE);

	err = -ENOMEM;
	pages = kcalloc(FUSE_MAX_PAGES_PER_REQ, sizeof(pages[0]), GFP_KERNEL);
	iov_page = (struct iovec *) __get_free_page(GFP_KERNEL);
	if (!pages || !iov_page)
		goto out;

	/*
	 * If restricted, initialize IO parameters as encoded in @cmd.
	 * RETRY from server is not allowed.
	 */
	if (!(flags & FUSE_IOCTL_UNRESTRICTED)) {
		struct iovec *iov = iov_page;

		iov->iov_base = (void __user *)arg;
		iov->iov_len = _IOC_SIZE(cmd);

		if (_IOC_DIR(cmd) & _IOC_WRITE) {
			in_iov = iov;
			in_iovs = 1;
		}

		if (_IOC_DIR(cmd) & _IOC_READ) {
			out_iov = iov;
			out_iovs = 1;
		}
	}

 retry:
	inarg.in_size = in_size = iov_length(in_iov, in_iovs);
	inarg.out_size = out_size = iov_length(out_iov, out_iovs);

	/*
	 * Out data can be used either for actual out data or iovs,
	 * make sure there always is at least one page.
	 */
	out_size = max_t(size_t, out_size, PAGE_SIZE);
	max_pages = DIV_ROUND_UP(max(in_size, out_size), PAGE_SIZE);

	/* make sure there are enough buffer pages and init request with them */
	err = -ENOMEM;
	if (max_pages > FUSE_MAX_PAGES_PER_REQ)
		goto out;
	while (num_pages < max_pages) {
		pages[num_pages] = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);
		if (!pages[num_pages])
			goto out;
		num_pages++;
	}

	req = fuse_get_req(fc, num_pages);
	if (IS_ERR(req)) {
		err = PTR_ERR(req);
		req = NULL;
		goto out;
	}
	memcpy(req->pages, pages, sizeof(req->pages[0]) * num_pages);
	req->num_pages = num_pages;
	fuse_page_descs_length_init(req, 0, req->num_pages);

	/* okay, let's send it to the client */
	req->in.h.opcode = FUSE_IOCTL;
	req->in.h.nodeid = ff->nodeid;
	req->in.numargs = 1;
	req->in.args[0].size = sizeof(inarg);
	req->in.args[0].value = &inarg;
	if (in_size) {
		req->in.numargs++;
		req->in.args[1].size = in_size;
		req->in.argpages = 1;

		err = fuse_ioctl_copy_user(pages, in_iov, in_iovs, in_size,
					   false);
		if (err)
			goto out;
	}

	req->out.numargs = 2;
	req->out.args[0].size = sizeof(outarg);
	req->out.args[0].value = &outarg;
	req->out.args[1].size = out_size;
	req->out.argpages = 1;
	req->out.argvar = 1;

	fuse_request_send(fc, req);
	err = req->out.h.error;
	transferred = req->out.args[1].size;
	fuse_put_request(fc, req);
	req = NULL;
	if (err)
		goto out;

	/* did it ask for retry? */
	if (outarg.flags & FUSE_IOCTL_RETRY) {
		void *vaddr;

		/* no retry if in restricted mode */
		err = -EIO;
		if (!(flags & FUSE_IOCTL_UNRESTRICTED))
			goto out;

		in_iovs = outarg.in_iovs;
		out_iovs = outarg.out_iovs;

		/*
		 * Make sure things are in boundary, separate checks
		 * are to protect against overflow.
		 */
		err = -ENOMEM;
		if (in_iovs > FUSE_IOCTL_MAX_IOV ||
		    out_iovs > FUSE_IOCTL_MAX_IOV ||
		    in_iovs + out_iovs > FUSE_IOCTL_MAX_IOV)
			goto out;

		vaddr = kmap_atomic(pages[0]);
		err = fuse_copy_ioctl_iovec(fc, iov_page, vaddr,
					    transferred, in_iovs + out_iovs,
					    (flags & FUSE_IOCTL_COMPAT) != 0);
		kunmap_atomic(vaddr);
		if (err)
			goto out;

		in_iov = iov_page;
		out_iov = in_iov + in_iovs;

		err = fuse_verify_ioctl_iov(in_iov, in_iovs);
		if (err)
			goto out;

		err = fuse_verify_ioctl_iov(out_iov, out_iovs);
		if (err)
			goto out;

		goto retry;
	}

	err = -EIO;
	if (transferred > inarg.out_size)
		goto out;

	err = fuse_ioctl_copy_user(pages, out_iov, out_iovs, transferred, true);
 out:
	if (req)
		fuse_put_request(fc, req);
	free_page((unsigned long) iov_page);
	while (num_pages)
		__free_page(pages[--num_pages]);
	kfree(pages);

	return err ? err : outarg.result;
}
EXPORT_SYMBOL_GPL(fuse_do_ioctl);
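
/*
 * [Editor's sketch, not part of the kernel source or of any real FUSE/libfuse
 * API] The retry protocol documented above fuse_do_ioctl(), restated as the
 * iovec arrays a server could hand back for the hypothetical "struct a"
 * ioctl from that comment.  All names below (struct a, buf, buflen, arg) are
 * taken from the comment and exist only for illustration; this is
 * userspace-style C, not kernel code.
 */
#if 0	/* illustration only, never compiled */
#include <stddef.h>
#include <sys/uio.h>

struct a {
	char	*buf;
	size_t	buflen;
};

/* Round 1: only the raw ioctl argument pointer is known; ask for struct a. */
static size_t retry_round1(struct iovec *iov, void *arg)
{
	iov[0].iov_base = arg;
	iov[0].iov_len  = sizeof(struct a);
	return 1;			/* becomes out->in_iovs */
}

/* Round 2: struct a has been copied in; also ask for the buffer it names. */
static size_t retry_round2(struct iovec *iov, void *arg, const struct a *a)
{
	iov[0].iov_base = arg;
	iov[0].iov_len  = sizeof(struct a);
	iov[1].iov_base = a->buf;
	iov[1].iov_len  = a->buflen;
	return 2;			/* becomes out->in_iovs */
}

/* Round 3: everything is available; reply without FUSE_IOCTL_RETRY. */
#endif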
2213 2211
2214 long fuse_ioctl_common(struct file *file, unsigned int cmd, 2212 long fuse_ioctl_common(struct file *file, unsigned int cmd,
2215 unsigned long arg, unsigned int flags) 2213 unsigned long arg, unsigned int flags)
2216 { 2214 {
2217 struct inode *inode = file_inode(file); 2215 struct inode *inode = file_inode(file);
2218 struct fuse_conn *fc = get_fuse_conn(inode); 2216 struct fuse_conn *fc = get_fuse_conn(inode);
2219 2217
2220 if (!fuse_allow_current_process(fc)) 2218 if (!fuse_allow_current_process(fc))
2221 return -EACCES; 2219 return -EACCES;
2222 2220
2223 if (is_bad_inode(inode)) 2221 if (is_bad_inode(inode))
2224 return -EIO; 2222 return -EIO;
2225 2223
2226 return fuse_do_ioctl(file, cmd, arg, flags); 2224 return fuse_do_ioctl(file, cmd, arg, flags);
2227 } 2225 }
2228 2226
2229 static long fuse_file_ioctl(struct file *file, unsigned int cmd, 2227 static long fuse_file_ioctl(struct file *file, unsigned int cmd,
2230 unsigned long arg) 2228 unsigned long arg)
2231 { 2229 {
2232 return fuse_ioctl_common(file, cmd, arg, 0); 2230 return fuse_ioctl_common(file, cmd, arg, 0);
2233 } 2231 }
2234 2232
2235 static long fuse_file_compat_ioctl(struct file *file, unsigned int cmd, 2233 static long fuse_file_compat_ioctl(struct file *file, unsigned int cmd,
2236 unsigned long arg) 2234 unsigned long arg)
2237 { 2235 {
2238 return fuse_ioctl_common(file, cmd, arg, FUSE_IOCTL_COMPAT); 2236 return fuse_ioctl_common(file, cmd, arg, FUSE_IOCTL_COMPAT);
2239 } 2237 }
2240 2238
2241 /* 2239 /*
2242 * All files which have been polled are linked to RB tree 2240 * All files which have been polled are linked to RB tree
2243 * fuse_conn->polled_files which is indexed by kh. Walk the tree and 2241 * fuse_conn->polled_files which is indexed by kh. Walk the tree and
2244 * find the matching one. 2242 * find the matching one.
2245 */ 2243 */
2246 static struct rb_node **fuse_find_polled_node(struct fuse_conn *fc, u64 kh, 2244 static struct rb_node **fuse_find_polled_node(struct fuse_conn *fc, u64 kh,
2247 struct rb_node **parent_out) 2245 struct rb_node **parent_out)
2248 { 2246 {
2249 struct rb_node **link = &fc->polled_files.rb_node; 2247 struct rb_node **link = &fc->polled_files.rb_node;
2250 struct rb_node *last = NULL; 2248 struct rb_node *last = NULL;
2251 2249
2252 while (*link) { 2250 while (*link) {
2253 struct fuse_file *ff; 2251 struct fuse_file *ff;
2254 2252
2255 last = *link; 2253 last = *link;
2256 ff = rb_entry(last, struct fuse_file, polled_node); 2254 ff = rb_entry(last, struct fuse_file, polled_node);
2257 2255
2258 if (kh < ff->kh) 2256 if (kh < ff->kh)
2259 link = &last->rb_left; 2257 link = &last->rb_left;
2260 else if (kh > ff->kh) 2258 else if (kh > ff->kh)
2261 link = &last->rb_right; 2259 link = &last->rb_right;
2262 else 2260 else
2263 return link; 2261 return link;
2264 } 2262 }
2265 2263
2266 if (parent_out) 2264 if (parent_out)
2267 *parent_out = last; 2265 *parent_out = last;
2268 return link; 2266 return link;
2269 } 2267 }
2270 2268
2271 /* 2269 /*
2272 * The file is about to be polled. Make sure it's on the polled_files 2270 * The file is about to be polled. Make sure it's on the polled_files
2273 * RB tree. Note that files once added to the polled_files tree are 2271 * RB tree. Note that files once added to the polled_files tree are
2274 * not removed before the file is released. This is because a file 2272 * not removed before the file is released. This is because a file
2275 * polled once is likely to be polled again. 2273 * polled once is likely to be polled again.
2276 */ 2274 */
2277 static void fuse_register_polled_file(struct fuse_conn *fc, 2275 static void fuse_register_polled_file(struct fuse_conn *fc,
2278 struct fuse_file *ff) 2276 struct fuse_file *ff)
2279 { 2277 {
2280 spin_lock(&fc->lock); 2278 spin_lock(&fc->lock);
2281 if (RB_EMPTY_NODE(&ff->polled_node)) { 2279 if (RB_EMPTY_NODE(&ff->polled_node)) {
2282 struct rb_node **link, *parent; 2280 struct rb_node **link, *parent;
2283 2281
2284 link = fuse_find_polled_node(fc, ff->kh, &parent); 2282 link = fuse_find_polled_node(fc, ff->kh, &parent);
2285 BUG_ON(*link); 2283 BUG_ON(*link);
2286 rb_link_node(&ff->polled_node, parent, link); 2284 rb_link_node(&ff->polled_node, parent, link);
2287 rb_insert_color(&ff->polled_node, &fc->polled_files); 2285 rb_insert_color(&ff->polled_node, &fc->polled_files);
2288 } 2286 }
2289 spin_unlock(&fc->lock); 2287 spin_unlock(&fc->lock);
2290 } 2288 }
2291 2289
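fuse_find_polled_node() above follows the usual kernel rbtree idiom: walk down from the root while remembering both the parent and the link slot where a new node would hang, so the caller can either return the match or insert at exactly that slot with rb_link_node() and rb_insert_color(), as fuse_register_polled_file() does. A kernel-style sketch of the same idiom for a hypothetical structure keyed by an integer (demo_node and demo_insert are made-up names, not part of this commit):

	#include <linux/rbtree.h>
	#include <linux/types.h>

	struct demo_node {
		struct rb_node	link;
		u64		key;
	};

	/* Insert @new into @root unless a node with the same key already
	 * exists; returns the existing node in that case, NULL on success. */
	static struct demo_node *demo_insert(struct rb_root *root,
					     struct demo_node *new)
	{
		struct rb_node **p = &root->rb_node, *parent = NULL;

		while (*p) {
			struct demo_node *cur = rb_entry(*p, struct demo_node, link);

			parent = *p;
			if (new->key < cur->key)
				p = &(*p)->rb_left;
			else if (new->key > cur->key)
				p = &(*p)->rb_right;
			else
				return cur;	/* already present */
		}
		rb_link_node(&new->link, parent, p);
		rb_insert_color(&new->link, root);
		return NULL;
	}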
2292 unsigned fuse_file_poll(struct file *file, poll_table *wait) 2290 unsigned fuse_file_poll(struct file *file, poll_table *wait)
2293 { 2291 {
2294 struct fuse_file *ff = file->private_data; 2292 struct fuse_file *ff = file->private_data;
2295 struct fuse_conn *fc = ff->fc; 2293 struct fuse_conn *fc = ff->fc;
2296 struct fuse_poll_in inarg = { .fh = ff->fh, .kh = ff->kh }; 2294 struct fuse_poll_in inarg = { .fh = ff->fh, .kh = ff->kh };
2297 struct fuse_poll_out outarg; 2295 struct fuse_poll_out outarg;
2298 struct fuse_req *req; 2296 struct fuse_req *req;
2299 int err; 2297 int err;
2300 2298
2301 if (fc->no_poll) 2299 if (fc->no_poll)
2302 return DEFAULT_POLLMASK; 2300 return DEFAULT_POLLMASK;
2303 2301
2304 poll_wait(file, &ff->poll_wait, wait); 2302 poll_wait(file, &ff->poll_wait, wait);
2305 inarg.events = (__u32)poll_requested_events(wait); 2303 inarg.events = (__u32)poll_requested_events(wait);
2306 2304
2307 /* 2305 /*
2308 * Ask for notification iff there's someone waiting for it. 2306 * Ask for notification iff there's someone waiting for it.
2309 * The client may ignore the flag and always notify. 2307 * The client may ignore the flag and always notify.
2310 */ 2308 */
2311 if (waitqueue_active(&ff->poll_wait)) { 2309 if (waitqueue_active(&ff->poll_wait)) {
2312 inarg.flags |= FUSE_POLL_SCHEDULE_NOTIFY; 2310 inarg.flags |= FUSE_POLL_SCHEDULE_NOTIFY;
2313 fuse_register_polled_file(fc, ff); 2311 fuse_register_polled_file(fc, ff);
2314 } 2312 }
2315 2313
2316 req = fuse_get_req_nopages(fc); 2314 req = fuse_get_req_nopages(fc);
2317 if (IS_ERR(req)) 2315 if (IS_ERR(req))
2318 return POLLERR; 2316 return POLLERR;
2319 2317
2320 req->in.h.opcode = FUSE_POLL; 2318 req->in.h.opcode = FUSE_POLL;
2321 req->in.h.nodeid = ff->nodeid; 2319 req->in.h.nodeid = ff->nodeid;
2322 req->in.numargs = 1; 2320 req->in.numargs = 1;
2323 req->in.args[0].size = sizeof(inarg); 2321 req->in.args[0].size = sizeof(inarg);
2324 req->in.args[0].value = &inarg; 2322 req->in.args[0].value = &inarg;
2325 req->out.numargs = 1; 2323 req->out.numargs = 1;
2326 req->out.args[0].size = sizeof(outarg); 2324 req->out.args[0].size = sizeof(outarg);
2327 req->out.args[0].value = &outarg; 2325 req->out.args[0].value = &outarg;
2328 fuse_request_send(fc, req); 2326 fuse_request_send(fc, req);
2329 err = req->out.h.error; 2327 err = req->out.h.error;
2330 fuse_put_request(fc, req); 2328 fuse_put_request(fc, req);
2331 2329
2332 if (!err) 2330 if (!err)
2333 return outarg.revents; 2331 return outarg.revents;
2334 if (err == -ENOSYS) { 2332 if (err == -ENOSYS) {
2335 fc->no_poll = 1; 2333 fc->no_poll = 1;
2336 return DEFAULT_POLLMASK; 2334 return DEFAULT_POLLMASK;
2337 } 2335 }
2338 return POLLERR; 2336 return POLLERR;
2339 } 2337 }
2340 EXPORT_SYMBOL_GPL(fuse_file_poll); 2338 EXPORT_SYMBOL_GPL(fuse_file_poll);
2341 2339
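The -ENOSYS handling above is the FUSE "probe once, remember forever" idiom: the first time the server reports the opcode as unimplemented, a connection flag (fc->no_poll here, fc->no_fallocate further down) is set and later callers fail fast without another round trip. A stand-alone sketch of the same idiom with a hypothetical connection flag and a stubbed server call (all names below are invented for illustration):

	#include <errno.h>
	#include <stdbool.h>
	#include <stdio.h>

	/* Hypothetical connection state mirroring fc->no_poll / fc->no_fallocate. */
	struct conn {
		bool no_frobnicate;		/* set once the server reports ENOSYS */
	};

	/* Stand-in for the request round trip; pretend the server lacks the op. */
	static int server_frobnicate(struct conn *c)
	{
		(void)c;
		return -ENOSYS;
	}

	static int do_frobnicate(struct conn *c)
	{
		int err;

		if (c->no_frobnicate)
			return -EOPNOTSUPP;	/* cached: skip the round trip */

		err = server_frobnicate(c);
		if (err == -ENOSYS) {		/* learn it once, remember it */
			c->no_frobnicate = true;
			err = -EOPNOTSUPP;
		}
		return err;
	}

	int main(void)
	{
		struct conn c = { .no_frobnicate = false };

		printf("%d %d\n", do_frobnicate(&c), do_frobnicate(&c));
		return 0;
	}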
2342 /* 2340 /*
2343 * This is called from fuse_handle_notify() on FUSE_NOTIFY_POLL and 2341 * This is called from fuse_handle_notify() on FUSE_NOTIFY_POLL and
2344 * wakes up the poll waiters. 2342 * wakes up the poll waiters.
2345 */ 2343 */
2346 int fuse_notify_poll_wakeup(struct fuse_conn *fc, 2344 int fuse_notify_poll_wakeup(struct fuse_conn *fc,
2347 struct fuse_notify_poll_wakeup_out *outarg) 2345 struct fuse_notify_poll_wakeup_out *outarg)
2348 { 2346 {
2349 u64 kh = outarg->kh; 2347 u64 kh = outarg->kh;
2350 struct rb_node **link; 2348 struct rb_node **link;
2351 2349
2352 spin_lock(&fc->lock); 2350 spin_lock(&fc->lock);
2353 2351
2354 link = fuse_find_polled_node(fc, kh, NULL); 2352 link = fuse_find_polled_node(fc, kh, NULL);
2355 if (*link) { 2353 if (*link) {
2356 struct fuse_file *ff; 2354 struct fuse_file *ff;
2357 2355
2358 ff = rb_entry(*link, struct fuse_file, polled_node); 2356 ff = rb_entry(*link, struct fuse_file, polled_node);
2359 wake_up_interruptible_sync(&ff->poll_wait); 2357 wake_up_interruptible_sync(&ff->poll_wait);
2360 } 2358 }
2361 2359
2362 spin_unlock(&fc->lock); 2360 spin_unlock(&fc->lock);
2363 return 0; 2361 return 0;
2364 } 2362 }
2365 2363
2366 static void fuse_do_truncate(struct file *file) 2364 static void fuse_do_truncate(struct file *file)
2367 { 2365 {
2368 struct inode *inode = file->f_mapping->host; 2366 struct inode *inode = file->f_mapping->host;
2369 struct iattr attr; 2367 struct iattr attr;
2370 2368
2371 attr.ia_valid = ATTR_SIZE; 2369 attr.ia_valid = ATTR_SIZE;
2372 attr.ia_size = i_size_read(inode); 2370 attr.ia_size = i_size_read(inode);
2373 2371
2374 attr.ia_file = file; 2372 attr.ia_file = file;
2375 attr.ia_valid |= ATTR_FILE; 2373 attr.ia_valid |= ATTR_FILE;
2376 2374
2377 fuse_do_setattr(inode, &attr, file); 2375 fuse_do_setattr(inode, &attr, file);
2378 } 2376 }
2379 2377
2380 static inline loff_t fuse_round_up(loff_t off) 2378 static inline loff_t fuse_round_up(loff_t off)
2381 { 2379 {
2382 return round_up(off, FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT); 2380 return round_up(off, FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT);
2383 } 2381 }
2384 2382
2385 static ssize_t 2383 static ssize_t
2386 fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, 2384 fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
2387 loff_t offset, unsigned long nr_segs) 2385 loff_t offset, unsigned long nr_segs)
2388 { 2386 {
2389 ssize_t ret = 0; 2387 ssize_t ret = 0;
2390 struct file *file = iocb->ki_filp; 2388 struct file *file = iocb->ki_filp;
2391 struct fuse_file *ff = file->private_data; 2389 struct fuse_file *ff = file->private_data;
2392 bool async_dio = ff->fc->async_dio; 2390 bool async_dio = ff->fc->async_dio;
2393 loff_t pos = 0; 2391 loff_t pos = 0;
2394 struct inode *inode; 2392 struct inode *inode;
2395 loff_t i_size; 2393 loff_t i_size;
2396 size_t count = iov_length(iov, nr_segs); 2394 size_t count = iov_length(iov, nr_segs);
2397 struct fuse_io_priv *io; 2395 struct fuse_io_priv *io;
2398 2396
2399 pos = offset; 2397 pos = offset;
2400 inode = file->f_mapping->host; 2398 inode = file->f_mapping->host;
2401 i_size = i_size_read(inode); 2399 i_size = i_size_read(inode);
2402 2400
2403 /* optimization for short read */ 2401 /* optimization for short read */
2404 if (async_dio && rw != WRITE && offset + count > i_size) { 2402 if (async_dio && rw != WRITE && offset + count > i_size) {
2405 if (offset >= i_size) 2403 if (offset >= i_size)
2406 return 0; 2404 return 0;
2407 count = min_t(loff_t, count, fuse_round_up(i_size - offset)); 2405 count = min_t(loff_t, count, fuse_round_up(i_size - offset));
2408 } 2406 }
2409 2407
2410 io = kmalloc(sizeof(struct fuse_io_priv), GFP_KERNEL); 2408 io = kmalloc(sizeof(struct fuse_io_priv), GFP_KERNEL);
2411 if (!io) 2409 if (!io)
2412 return -ENOMEM; 2410 return -ENOMEM;
2413 spin_lock_init(&io->lock); 2411 spin_lock_init(&io->lock);
2414 io->reqs = 1; 2412 io->reqs = 1;
2415 io->bytes = -1; 2413 io->bytes = -1;
2416 io->size = 0; 2414 io->size = 0;
2417 io->offset = offset; 2415 io->offset = offset;
2418 io->write = (rw == WRITE); 2416 io->write = (rw == WRITE);
2419 io->err = 0; 2417 io->err = 0;
2420 io->file = file; 2418 io->file = file;
2421 /* 2419 /*
2422 * By default, we want to optimize all I/Os with async request 2420 * By default, we want to optimize all I/Os with async request
2423 * submission to the client filesystem if supported. 2421 * submission to the client filesystem if supported.
2424 */ 2422 */
2425 io->async = async_dio; 2423 io->async = async_dio;
2426 io->iocb = iocb; 2424 io->iocb = iocb;
2427 2425
2428 /* 2426 /*
2429 * We cannot asynchronously extend the size of a file. We have no method 2427 * We cannot asynchronously extend the size of a file. We have no method
2430 * to wait on real async I/O requests, so we must submit this request 2428 * to wait on real async I/O requests, so we must submit this request
2431 * synchronously. 2429 * synchronously.
2432 */ 2430 */
2433 if (!is_sync_kiocb(iocb) && (offset + count > i_size) && rw == WRITE) 2431 if (!is_sync_kiocb(iocb) && (offset + count > i_size) && rw == WRITE)
2434 io->async = false; 2432 io->async = false;
2435 2433
2436 if (rw == WRITE) 2434 if (rw == WRITE)
2437 ret = __fuse_direct_write(io, iov, nr_segs, &pos); 2435 ret = __fuse_direct_write(io, iov, nr_segs, &pos);
2438 else 2436 else
2439 ret = __fuse_direct_read(io, iov, nr_segs, &pos, count); 2437 ret = __fuse_direct_read(io, iov, nr_segs, &pos, count);
2440 2438
2441 if (io->async) { 2439 if (io->async) {
2442 fuse_aio_complete(io, ret < 0 ? ret : 0, -1); 2440 fuse_aio_complete(io, ret < 0 ? ret : 0, -1);
2443 2441
2444 /* we have a non-extending, async request, so return */ 2442 /* we have a non-extending, async request, so return */
2445 if (!is_sync_kiocb(iocb)) 2443 if (!is_sync_kiocb(iocb))
2446 return -EIOCBQUEUED; 2444 return -EIOCBQUEUED;
2447 2445
2448 ret = wait_on_sync_kiocb(iocb); 2446 ret = wait_on_sync_kiocb(iocb);
2449 } else { 2447 } else {
2450 kfree(io); 2448 kfree(io);
2451 } 2449 }
2452 2450
2453 if (rw == WRITE) { 2451 if (rw == WRITE) {
2454 if (ret > 0) 2452 if (ret > 0)
2455 fuse_write_update_size(inode, pos); 2453 fuse_write_update_size(inode, pos);
2456 else if (ret < 0 && offset + count > i_size) 2454 else if (ret < 0 && offset + count > i_size)
2457 fuse_do_truncate(file); 2455 fuse_do_truncate(file);
2458 } 2456 }
2459 2457
2460 return ret; 2458 return ret;
2461 } 2459 }
2462 2460
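On the short-read optimization above: the transfer is clamped to fuse_round_up(i_size - offset), i.e. to end-of-file rounded up to a whole request. With 4 KiB pages and 32 pages per request the rounding unit would be 128 KiB, so a 1 MiB read starting 5 KiB before EOF gets clamped to 128 KiB. A small stand-alone sketch of that arithmetic (both constants are assumptions for illustration, not read from the headers):

	#include <stdio.h>

	#define PAGE_SIZE		4096UL	/* assumed */
	#define MAX_PAGES_PER_REQ	32UL	/* assumed */

	static unsigned long round_up_to(unsigned long x, unsigned long unit)
	{
		return ((x + unit - 1) / unit) * unit;
	}

	int main(void)
	{
		unsigned long unit = MAX_PAGES_PER_REQ * PAGE_SIZE;	/* 128 KiB */
		unsigned long count = 1024 * 1024;			/* requested read */
		unsigned long to_eof = 5 * 1024;			/* i_size - offset */
		unsigned long clamped = round_up_to(to_eof, unit);

		if (clamped < count)
			count = clamped;
		printf("count clamped to %lu bytes\n", count);		/* 131072 */
		return 0;
	}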
2463 static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, 2461 static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
2464 loff_t length) 2462 loff_t length)
2465 { 2463 {
2466 struct fuse_file *ff = file->private_data; 2464 struct fuse_file *ff = file->private_data;
2467 struct inode *inode = file->f_inode; 2465 struct inode *inode = file->f_inode;
2468 struct fuse_inode *fi = get_fuse_inode(inode); 2466 struct fuse_inode *fi = get_fuse_inode(inode);
2469 struct fuse_conn *fc = ff->fc; 2467 struct fuse_conn *fc = ff->fc;
2470 struct fuse_req *req; 2468 struct fuse_req *req;
2471 struct fuse_fallocate_in inarg = { 2469 struct fuse_fallocate_in inarg = {
2472 .fh = ff->fh, 2470 .fh = ff->fh,
2473 .offset = offset, 2471 .offset = offset,
2474 .length = length, 2472 .length = length,
2475 .mode = mode 2473 .mode = mode
2476 }; 2474 };
2477 int err; 2475 int err;
2478 bool lock_inode = !(mode & FALLOC_FL_KEEP_SIZE) || 2476 bool lock_inode = !(mode & FALLOC_FL_KEEP_SIZE) ||
2479 (mode & FALLOC_FL_PUNCH_HOLE); 2477 (mode & FALLOC_FL_PUNCH_HOLE);
2480 2478
2481 if (fc->no_fallocate) 2479 if (fc->no_fallocate)
2482 return -EOPNOTSUPP; 2480 return -EOPNOTSUPP;
2483 2481
2484 if (lock_inode) { 2482 if (lock_inode) {
2485 mutex_lock(&inode->i_mutex); 2483 mutex_lock(&inode->i_mutex);
2486 if (mode & FALLOC_FL_PUNCH_HOLE) { 2484 if (mode & FALLOC_FL_PUNCH_HOLE) {
2487 loff_t endbyte = offset + length - 1; 2485 loff_t endbyte = offset + length - 1;
2488 err = filemap_write_and_wait_range(inode->i_mapping, 2486 err = filemap_write_and_wait_range(inode->i_mapping,
2489 offset, endbyte); 2487 offset, endbyte);
2490 if (err) 2488 if (err)
2491 goto out; 2489 goto out;
2492 2490
2493 fuse_sync_writes(inode); 2491 fuse_sync_writes(inode);
2494 } 2492 }
2495 } 2493 }
2496 2494
2497 if (!(mode & FALLOC_FL_KEEP_SIZE)) 2495 if (!(mode & FALLOC_FL_KEEP_SIZE))
2498 set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); 2496 set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
2499 2497
2500 req = fuse_get_req_nopages(fc); 2498 req = fuse_get_req_nopages(fc);
2501 if (IS_ERR(req)) { 2499 if (IS_ERR(req)) {
2502 err = PTR_ERR(req); 2500 err = PTR_ERR(req);
2503 goto out; 2501 goto out;
2504 } 2502 }
2505 2503
2506 req->in.h.opcode = FUSE_FALLOCATE; 2504 req->in.h.opcode = FUSE_FALLOCATE;
2507 req->in.h.nodeid = ff->nodeid; 2505 req->in.h.nodeid = ff->nodeid;
2508 req->in.numargs = 1; 2506 req->in.numargs = 1;
2509 req->in.args[0].size = sizeof(inarg); 2507 req->in.args[0].size = sizeof(inarg);
2510 req->in.args[0].value = &inarg; 2508 req->in.args[0].value = &inarg;
2511 fuse_request_send(fc, req); 2509 fuse_request_send(fc, req);
2512 err = req->out.h.error; 2510 err = req->out.h.error;
2513 if (err == -ENOSYS) { 2511 if (err == -ENOSYS) {
2514 fc->no_fallocate = 1; 2512 fc->no_fallocate = 1;
2515 err = -EOPNOTSUPP; 2513 err = -EOPNOTSUPP;
2516 } 2514 }
2517 fuse_put_request(fc, req); 2515 fuse_put_request(fc, req);
2518 2516
2519 if (err) 2517 if (err)
2520 goto out; 2518 goto out;
2521 2519
2522 /* we could have extended the file */ 2520 /* we could have extended the file */
2523 if (!(mode & FALLOC_FL_KEEP_SIZE)) 2521 if (!(mode & FALLOC_FL_KEEP_SIZE))
2524 fuse_write_update_size(inode, offset + length); 2522 fuse_write_update_size(inode, offset + length);
2525 2523
2526 if (mode & FALLOC_FL_PUNCH_HOLE) 2524 if (mode & FALLOC_FL_PUNCH_HOLE)
2527 truncate_pagecache_range(inode, offset, offset + length - 1); 2525 truncate_pagecache_range(inode, offset, offset + length - 1);
2528 2526
2529 fuse_invalidate_attr(inode); 2527 fuse_invalidate_attr(inode);
2530 2528
2531 out: 2529 out:
2532 if (!(mode & FALLOC_FL_KEEP_SIZE)) 2530 if (!(mode & FALLOC_FL_KEEP_SIZE))
2533 clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); 2531 clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
2534 2532
2535 if (lock_inode) 2533 if (lock_inode)
2536 mutex_unlock(&inode->i_mutex); 2534 mutex_unlock(&inode->i_mutex);
2537 2535
2538 return err; 2536 return err;
2539 } 2537 }
2540 2538
2541 static const struct file_operations fuse_file_operations = { 2539 static const struct file_operations fuse_file_operations = {
2542 .llseek = fuse_file_llseek, 2540 .llseek = fuse_file_llseek,
2543 .read = do_sync_read, 2541 .read = do_sync_read,
2544 .aio_read = fuse_file_aio_read, 2542 .aio_read = fuse_file_aio_read,
2545 .write = do_sync_write, 2543 .write = do_sync_write,
2546 .aio_write = fuse_file_aio_write, 2544 .aio_write = fuse_file_aio_write,
2547 .mmap = fuse_file_mmap, 2545 .mmap = fuse_file_mmap,
2548 .open = fuse_open, 2546 .open = fuse_open,
2549 .flush = fuse_flush, 2547 .flush = fuse_flush,
2550 .release = fuse_release, 2548 .release = fuse_release,
2551 .fsync = fuse_fsync, 2549 .fsync = fuse_fsync,
2552 .lock = fuse_file_lock, 2550 .lock = fuse_file_lock,
2553 .flock = fuse_file_flock, 2551 .flock = fuse_file_flock,
2554 .splice_read = generic_file_splice_read, 2552 .splice_read = generic_file_splice_read,
2555 .unlocked_ioctl = fuse_file_ioctl, 2553 .unlocked_ioctl = fuse_file_ioctl,
2556 .compat_ioctl = fuse_file_compat_ioctl, 2554 .compat_ioctl = fuse_file_compat_ioctl,
2557 .poll = fuse_file_poll, 2555 .poll = fuse_file_poll,
2558 .fallocate = fuse_file_fallocate, 2556 .fallocate = fuse_file_fallocate,
2559 }; 2557 };
2560 2558
2561 static const struct file_operations fuse_direct_io_file_operations = { 2559 static const struct file_operations fuse_direct_io_file_operations = {
2562 .llseek = fuse_file_llseek, 2560 .llseek = fuse_file_llseek,
2563 .read = fuse_direct_read, 2561 .read = fuse_direct_read,
2564 .write = fuse_direct_write, 2562 .write = fuse_direct_write,
2565 .mmap = fuse_direct_mmap, 2563 .mmap = fuse_direct_mmap,
2566 .open = fuse_open, 2564 .open = fuse_open,
2567 .flush = fuse_flush, 2565 .flush = fuse_flush,
2568 .release = fuse_release, 2566 .release = fuse_release,
2569 .fsync = fuse_fsync, 2567 .fsync = fuse_fsync,
2570 .lock = fuse_file_lock, 2568 .lock = fuse_file_lock,
2571 .flock = fuse_file_flock, 2569 .flock = fuse_file_flock,
2572 .unlocked_ioctl = fuse_file_ioctl, 2570 .unlocked_ioctl = fuse_file_ioctl,
2573 .compat_ioctl = fuse_file_compat_ioctl, 2571 .compat_ioctl = fuse_file_compat_ioctl,
2574 .poll = fuse_file_poll, 2572 .poll = fuse_file_poll,
2575 .fallocate = fuse_file_fallocate, 2573 .fallocate = fuse_file_fallocate,
2576 /* no splice_read */ 2574 /* no splice_read */
2577 }; 2575 };
2578 2576
2579 static const struct address_space_operations fuse_file_aops = { 2577 static const struct address_space_operations fuse_file_aops = {
2580 .readpage = fuse_readpage, 2578 .readpage = fuse_readpage,
2581 .writepage = fuse_writepage, 2579 .writepage = fuse_writepage,
2582 .launder_page = fuse_launder_page, 2580 .launder_page = fuse_launder_page,
2583 .readpages = fuse_readpages, 2581 .readpages = fuse_readpages,
2584 .set_page_dirty = __set_page_dirty_nobuffers, 2582 .set_page_dirty = __set_page_dirty_nobuffers,
2585 .bmap = fuse_bmap, 2583 .bmap = fuse_bmap,
2586 .direct_IO = fuse_direct_IO, 2584 .direct_IO = fuse_direct_IO,
2587 }; 2585 };
2588 2586
2589 void fuse_init_file_inode(struct inode *inode) 2587 void fuse_init_file_inode(struct inode *inode)
2590 { 2588 {
2591 inode->i_fop = &fuse_file_operations; 2589 inode->i_fop = &fuse_file_operations;
2592 inode->i_data.a_ops = &fuse_file_aops; 2590 inode->i_data.a_ops = &fuse_file_aops;
2593 } 2591 }
2594 2592
fs/gfs2/aops.c
1 /* 1 /*
2 * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved. 2 * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved.
3 * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved. 3 * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved.
4 * 4 *
5 * This copyrighted material is made available to anyone wishing to use, 5 * This copyrighted material is made available to anyone wishing to use,
6 * modify, copy, or redistribute it subject to the terms and conditions 6 * modify, copy, or redistribute it subject to the terms and conditions
7 * of the GNU General Public License version 2. 7 * of the GNU General Public License version 2.
8 */ 8 */
9 9
10 #include <linux/sched.h> 10 #include <linux/sched.h>
11 #include <linux/slab.h> 11 #include <linux/slab.h>
12 #include <linux/spinlock.h> 12 #include <linux/spinlock.h>
13 #include <linux/completion.h> 13 #include <linux/completion.h>
14 #include <linux/buffer_head.h> 14 #include <linux/buffer_head.h>
15 #include <linux/pagemap.h> 15 #include <linux/pagemap.h>
16 #include <linux/pagevec.h> 16 #include <linux/pagevec.h>
17 #include <linux/mpage.h> 17 #include <linux/mpage.h>
18 #include <linux/fs.h> 18 #include <linux/fs.h>
19 #include <linux/writeback.h> 19 #include <linux/writeback.h>
20 #include <linux/swap.h> 20 #include <linux/swap.h>
21 #include <linux/gfs2_ondisk.h> 21 #include <linux/gfs2_ondisk.h>
22 #include <linux/backing-dev.h> 22 #include <linux/backing-dev.h>
23 #include <linux/aio.h> 23 #include <linux/aio.h>
24 24
25 #include "gfs2.h" 25 #include "gfs2.h"
26 #include "incore.h" 26 #include "incore.h"
27 #include "bmap.h" 27 #include "bmap.h"
28 #include "glock.h" 28 #include "glock.h"
29 #include "inode.h" 29 #include "inode.h"
30 #include "log.h" 30 #include "log.h"
31 #include "meta_io.h" 31 #include "meta_io.h"
32 #include "quota.h" 32 #include "quota.h"
33 #include "trans.h" 33 #include "trans.h"
34 #include "rgrp.h" 34 #include "rgrp.h"
35 #include "super.h" 35 #include "super.h"
36 #include "util.h" 36 #include "util.h"
37 #include "glops.h" 37 #include "glops.h"
38 38
39 39
40 static void gfs2_page_add_databufs(struct gfs2_inode *ip, struct page *page, 40 static void gfs2_page_add_databufs(struct gfs2_inode *ip, struct page *page,
41 unsigned int from, unsigned int to) 41 unsigned int from, unsigned int to)
42 { 42 {
43 struct buffer_head *head = page_buffers(page); 43 struct buffer_head *head = page_buffers(page);
44 unsigned int bsize = head->b_size; 44 unsigned int bsize = head->b_size;
45 struct buffer_head *bh; 45 struct buffer_head *bh;
46 unsigned int start, end; 46 unsigned int start, end;
47 47
48 for (bh = head, start = 0; bh != head || !start; 48 for (bh = head, start = 0; bh != head || !start;
49 bh = bh->b_this_page, start = end) { 49 bh = bh->b_this_page, start = end) {
50 end = start + bsize; 50 end = start + bsize;
51 if (end <= from || start >= to) 51 if (end <= from || start >= to)
52 continue; 52 continue;
53 if (gfs2_is_jdata(ip)) 53 if (gfs2_is_jdata(ip))
54 set_buffer_uptodate(bh); 54 set_buffer_uptodate(bh);
55 gfs2_trans_add_data(ip->i_gl, bh); 55 gfs2_trans_add_data(ip->i_gl, bh);
56 } 56 }
57 } 57 }
58 58
59 /** 59 /**
60 * gfs2_get_block_noalloc - Fills in a buffer head with details about a block 60 * gfs2_get_block_noalloc - Fills in a buffer head with details about a block
61 * @inode: The inode 61 * @inode: The inode
62 * @lblock: The block number to look up 62 * @lblock: The block number to look up
63 * @bh_result: The buffer head to return the result in 63 * @bh_result: The buffer head to return the result in
64 * @create: Non-zero if we may add block to the file 64 * @create: Non-zero if we may add block to the file
65 * 65 *
66 * Returns: errno 66 * Returns: errno
67 */ 67 */
68 68
69 static int gfs2_get_block_noalloc(struct inode *inode, sector_t lblock, 69 static int gfs2_get_block_noalloc(struct inode *inode, sector_t lblock,
70 struct buffer_head *bh_result, int create) 70 struct buffer_head *bh_result, int create)
71 { 71 {
72 int error; 72 int error;
73 73
74 error = gfs2_block_map(inode, lblock, bh_result, 0); 74 error = gfs2_block_map(inode, lblock, bh_result, 0);
75 if (error) 75 if (error)
76 return error; 76 return error;
77 if (!buffer_mapped(bh_result)) 77 if (!buffer_mapped(bh_result))
78 return -EIO; 78 return -EIO;
79 return 0; 79 return 0;
80 } 80 }
81 81
82 static int gfs2_get_block_direct(struct inode *inode, sector_t lblock, 82 static int gfs2_get_block_direct(struct inode *inode, sector_t lblock,
83 struct buffer_head *bh_result, int create) 83 struct buffer_head *bh_result, int create)
84 { 84 {
85 return gfs2_block_map(inode, lblock, bh_result, 0); 85 return gfs2_block_map(inode, lblock, bh_result, 0);
86 } 86 }
87 87
88 /** 88 /**
89 * gfs2_writepage_common - Common bits of writepage 89 * gfs2_writepage_common - Common bits of writepage
90 * @page: The page to be written 90 * @page: The page to be written
91 * @wbc: The writeback control 91 * @wbc: The writeback control
92 * 92 *
93 * Returns: 1 if writepage is ok, otherwise an error code or zero if no error. 93 * Returns: 1 if writepage is ok, otherwise an error code or zero if no error.
94 */ 94 */
95 95
96 static int gfs2_writepage_common(struct page *page, 96 static int gfs2_writepage_common(struct page *page,
97 struct writeback_control *wbc) 97 struct writeback_control *wbc)
98 { 98 {
99 struct inode *inode = page->mapping->host; 99 struct inode *inode = page->mapping->host;
100 struct gfs2_inode *ip = GFS2_I(inode); 100 struct gfs2_inode *ip = GFS2_I(inode);
101 struct gfs2_sbd *sdp = GFS2_SB(inode); 101 struct gfs2_sbd *sdp = GFS2_SB(inode);
102 loff_t i_size = i_size_read(inode); 102 loff_t i_size = i_size_read(inode);
103 pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; 103 pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
104 unsigned offset; 104 unsigned offset;
105 105
106 if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl))) 106 if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
107 goto out; 107 goto out;
108 if (current->journal_info) 108 if (current->journal_info)
109 goto redirty; 109 goto redirty;
110 /* Is the page fully outside i_size? (truncate in progress) */ 110 /* Is the page fully outside i_size? (truncate in progress) */
111 offset = i_size & (PAGE_CACHE_SIZE-1); 111 offset = i_size & (PAGE_CACHE_SIZE-1);
112 if (page->index > end_index || (page->index == end_index && !offset)) { 112 if (page->index > end_index || (page->index == end_index && !offset)) {
113 page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE); 113 page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE);
114 goto out; 114 goto out;
115 } 115 }
116 return 1; 116 return 1;
117 redirty: 117 redirty:
118 redirty_page_for_writepage(wbc, page); 118 redirty_page_for_writepage(wbc, page);
119 out: 119 out:
120 unlock_page(page); 120 unlock_page(page);
121 return 0; 121 return 0;
122 } 122 }
123 123
124 /** 124 /**
125 * gfs2_writepage - Write page for writeback mappings 125 * gfs2_writepage - Write page for writeback mappings
126 * @page: The page 126 * @page: The page
127 * @wbc: The writeback control 127 * @wbc: The writeback control
128 * 128 *
129 */ 129 */
130 130
131 static int gfs2_writepage(struct page *page, struct writeback_control *wbc) 131 static int gfs2_writepage(struct page *page, struct writeback_control *wbc)
132 { 132 {
133 int ret; 133 int ret;
134 134
135 ret = gfs2_writepage_common(page, wbc); 135 ret = gfs2_writepage_common(page, wbc);
136 if (ret <= 0) 136 if (ret <= 0)
137 return ret; 137 return ret;
138 138
139 return nobh_writepage(page, gfs2_get_block_noalloc, wbc); 139 return nobh_writepage(page, gfs2_get_block_noalloc, wbc);
140 } 140 }
141 141
142 /** 142 /**
143 * __gfs2_jdata_writepage - The core of jdata writepage 143 * __gfs2_jdata_writepage - The core of jdata writepage
144 * @page: The page to write 144 * @page: The page to write
145 * @wbc: The writeback control 145 * @wbc: The writeback control
146 * 146 *
147 * This is shared between writepage and writepages and implements the 147 * This is shared between writepage and writepages and implements the
148 * core of the writepage operation. If a transaction is required then 148 * core of the writepage operation. If a transaction is required then
149 * PageChecked will have been set and the transaction will have 149 * PageChecked will have been set and the transaction will have
150 * already been started before this is called. 150 * already been started before this is called.
151 */ 151 */
152 152
153 static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc) 153 static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc)
154 { 154 {
155 struct inode *inode = page->mapping->host; 155 struct inode *inode = page->mapping->host;
156 struct gfs2_inode *ip = GFS2_I(inode); 156 struct gfs2_inode *ip = GFS2_I(inode);
157 struct gfs2_sbd *sdp = GFS2_SB(inode); 157 struct gfs2_sbd *sdp = GFS2_SB(inode);
158 158
159 if (PageChecked(page)) { 159 if (PageChecked(page)) {
160 ClearPageChecked(page); 160 ClearPageChecked(page);
161 if (!page_has_buffers(page)) { 161 if (!page_has_buffers(page)) {
162 create_empty_buffers(page, inode->i_sb->s_blocksize, 162 create_empty_buffers(page, inode->i_sb->s_blocksize,
163 (1 << BH_Dirty)|(1 << BH_Uptodate)); 163 (1 << BH_Dirty)|(1 << BH_Uptodate));
164 } 164 }
165 gfs2_page_add_databufs(ip, page, 0, sdp->sd_vfs->s_blocksize-1); 165 gfs2_page_add_databufs(ip, page, 0, sdp->sd_vfs->s_blocksize-1);
166 } 166 }
167 return block_write_full_page(page, gfs2_get_block_noalloc, wbc); 167 return block_write_full_page(page, gfs2_get_block_noalloc, wbc);
168 } 168 }
169 169
170 /** 170 /**
171 * gfs2_jdata_writepage - Write complete page 171 * gfs2_jdata_writepage - Write complete page
172 * @page: Page to write 172 * @page: Page to write
173 * 173 *
174 * Returns: errno 174 * Returns: errno
175 * 175 *
176 */ 176 */
177 177
178 static int gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc) 178 static int gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc)
179 { 179 {
180 struct inode *inode = page->mapping->host; 180 struct inode *inode = page->mapping->host;
181 struct gfs2_sbd *sdp = GFS2_SB(inode); 181 struct gfs2_sbd *sdp = GFS2_SB(inode);
182 int ret; 182 int ret;
183 int done_trans = 0; 183 int done_trans = 0;
184 184
185 if (PageChecked(page)) { 185 if (PageChecked(page)) {
186 if (wbc->sync_mode != WB_SYNC_ALL) 186 if (wbc->sync_mode != WB_SYNC_ALL)
187 goto out_ignore; 187 goto out_ignore;
188 ret = gfs2_trans_begin(sdp, RES_DINODE + 1, 0); 188 ret = gfs2_trans_begin(sdp, RES_DINODE + 1, 0);
189 if (ret) 189 if (ret)
190 goto out_ignore; 190 goto out_ignore;
191 done_trans = 1; 191 done_trans = 1;
192 } 192 }
193 ret = gfs2_writepage_common(page, wbc); 193 ret = gfs2_writepage_common(page, wbc);
194 if (ret > 0) 194 if (ret > 0)
195 ret = __gfs2_jdata_writepage(page, wbc); 195 ret = __gfs2_jdata_writepage(page, wbc);
196 if (done_trans) 196 if (done_trans)
197 gfs2_trans_end(sdp); 197 gfs2_trans_end(sdp);
198 return ret; 198 return ret;
199 199
200 out_ignore: 200 out_ignore:
201 redirty_page_for_writepage(wbc, page); 201 redirty_page_for_writepage(wbc, page);
202 unlock_page(page); 202 unlock_page(page);
203 return 0; 203 return 0;
204 } 204 }
205 205
206 /** 206 /**
207 * gfs2_writepages - Write a bunch of dirty pages back to disk 207 * gfs2_writepages - Write a bunch of dirty pages back to disk
208 * @mapping: The mapping to write 208 * @mapping: The mapping to write
209 * @wbc: Write-back control 209 * @wbc: Write-back control
210 * 210 *
211 * Used for both ordered and writeback modes. 211 * Used for both ordered and writeback modes.
212 */ 212 */
213 static int gfs2_writepages(struct address_space *mapping, 213 static int gfs2_writepages(struct address_space *mapping,
214 struct writeback_control *wbc) 214 struct writeback_control *wbc)
215 { 215 {
216 return mpage_writepages(mapping, wbc, gfs2_get_block_noalloc); 216 return mpage_writepages(mapping, wbc, gfs2_get_block_noalloc);
217 } 217 }
218 218
219 /** 219 /**
220 * gfs2_write_jdata_pagevec - Write back a pagevec's worth of pages 220 * gfs2_write_jdata_pagevec - Write back a pagevec's worth of pages
221 * @mapping: The mapping 221 * @mapping: The mapping
222 * @wbc: The writeback control 222 * @wbc: The writeback control
223 * @writepage: The writepage function to call for each page 223 * @writepage: The writepage function to call for each page
224 * @pvec: The vector of pages 224 * @pvec: The vector of pages
225 * @nr_pages: The number of pages to write 225 * @nr_pages: The number of pages to write
226 * 226 *
227 * Returns: non-zero if loop should terminate, zero otherwise 227 * Returns: non-zero if loop should terminate, zero otherwise
228 */ 228 */
229 229
230 static int gfs2_write_jdata_pagevec(struct address_space *mapping, 230 static int gfs2_write_jdata_pagevec(struct address_space *mapping,
231 struct writeback_control *wbc, 231 struct writeback_control *wbc,
232 struct pagevec *pvec, 232 struct pagevec *pvec,
233 int nr_pages, pgoff_t end) 233 int nr_pages, pgoff_t end)
234 { 234 {
235 struct inode *inode = mapping->host; 235 struct inode *inode = mapping->host;
236 struct gfs2_sbd *sdp = GFS2_SB(inode); 236 struct gfs2_sbd *sdp = GFS2_SB(inode);
237 loff_t i_size = i_size_read(inode); 237 loff_t i_size = i_size_read(inode);
238 pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; 238 pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
239 unsigned offset = i_size & (PAGE_CACHE_SIZE-1); 239 unsigned offset = i_size & (PAGE_CACHE_SIZE-1);
240 unsigned nrblocks = nr_pages * (PAGE_CACHE_SIZE/inode->i_sb->s_blocksize); 240 unsigned nrblocks = nr_pages * (PAGE_CACHE_SIZE/inode->i_sb->s_blocksize);
241 int i; 241 int i;
242 int ret; 242 int ret;
243 243
244 ret = gfs2_trans_begin(sdp, nrblocks, nrblocks); 244 ret = gfs2_trans_begin(sdp, nrblocks, nrblocks);
245 if (ret < 0) 245 if (ret < 0)
246 return ret; 246 return ret;
247 247
248 for(i = 0; i < nr_pages; i++) { 248 for(i = 0; i < nr_pages; i++) {
249 struct page *page = pvec->pages[i]; 249 struct page *page = pvec->pages[i];
250 250
251 lock_page(page); 251 lock_page(page);
252 252
253 if (unlikely(page->mapping != mapping)) { 253 if (unlikely(page->mapping != mapping)) {
254 unlock_page(page); 254 unlock_page(page);
255 continue; 255 continue;
256 } 256 }
257 257
258 if (!wbc->range_cyclic && page->index > end) { 258 if (!wbc->range_cyclic && page->index > end) {
259 ret = 1; 259 ret = 1;
260 unlock_page(page); 260 unlock_page(page);
261 continue; 261 continue;
262 } 262 }
263 263
264 if (wbc->sync_mode != WB_SYNC_NONE) 264 if (wbc->sync_mode != WB_SYNC_NONE)
265 wait_on_page_writeback(page); 265 wait_on_page_writeback(page);
266 266
267 if (PageWriteback(page) || 267 if (PageWriteback(page) ||
268 !clear_page_dirty_for_io(page)) { 268 !clear_page_dirty_for_io(page)) {
269 unlock_page(page); 269 unlock_page(page);
270 continue; 270 continue;
271 } 271 }
272 272
273 /* Is the page fully outside i_size? (truncate in progress) */ 273 /* Is the page fully outside i_size? (truncate in progress) */
274 if (page->index > end_index || (page->index == end_index && !offset)) { 274 if (page->index > end_index || (page->index == end_index && !offset)) {
275 page->mapping->a_ops->invalidatepage(page, 0, 275 page->mapping->a_ops->invalidatepage(page, 0,
276 PAGE_CACHE_SIZE); 276 PAGE_CACHE_SIZE);
277 unlock_page(page); 277 unlock_page(page);
278 continue; 278 continue;
279 } 279 }
280 280
281 ret = __gfs2_jdata_writepage(page, wbc); 281 ret = __gfs2_jdata_writepage(page, wbc);
282 282
283 if (ret || (--(wbc->nr_to_write) <= 0)) 283 if (ret || (--(wbc->nr_to_write) <= 0))
284 ret = 1; 284 ret = 1;
285 } 285 }
286 gfs2_trans_end(sdp); 286 gfs2_trans_end(sdp);
287 return ret; 287 return ret;
288 } 288 }
289 289
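gfs2_write_jdata_pagevec() opens one journal transaction for the whole pagevec before any page lock is taken, sized as nr_pages * (PAGE_CACHE_SIZE / blocksize), i.e. one reserved block per file-system block the pages cover. A tiny stand-alone sketch of that sizing (page size, block size and pagevec length are assumed values, not taken from the headers):

	#include <stdio.h>

	int main(void)
	{
		unsigned page_size = 4096;	/* PAGE_CACHE_SIZE, assumed */
		unsigned blocksize = 1024;	/* inode->i_sb->s_blocksize, assumed */
		unsigned nr_pages  = 14;	/* one full pagevec, assumed */
		unsigned nrblocks  = nr_pages * (page_size / blocksize);

		printf("reserve %u journal blocks\n", nrblocks);	/* 56 */
		return 0;
	}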
290 /** 290 /**
291 * gfs2_write_cache_jdata - Like write_cache_pages but different 291 * gfs2_write_cache_jdata - Like write_cache_pages but different
292 * @mapping: The mapping to write 292 * @mapping: The mapping to write
293 * @wbc: The writeback control 293 * @wbc: The writeback control
294 * @writepage: The writepage function to call 294 * @writepage: The writepage function to call
295 * @data: The data to pass to writepage 295 * @data: The data to pass to writepage
296 * 296 *
297 * The reason that we use our own function here is that we need to 297 * The reason that we use our own function here is that we need to
298 * start transactions before we grab page locks. This allows us 298 * start transactions before we grab page locks. This allows us
299 * to get the ordering right. 299 * to get the ordering right.
300 */ 300 */
301 301
302 static int gfs2_write_cache_jdata(struct address_space *mapping, 302 static int gfs2_write_cache_jdata(struct address_space *mapping,
303 struct writeback_control *wbc) 303 struct writeback_control *wbc)
304 { 304 {
305 int ret = 0; 305 int ret = 0;
306 int done = 0; 306 int done = 0;
307 struct pagevec pvec; 307 struct pagevec pvec;
308 int nr_pages; 308 int nr_pages;
309 pgoff_t index; 309 pgoff_t index;
310 pgoff_t end; 310 pgoff_t end;
311 int scanned = 0; 311 int scanned = 0;
312 int range_whole = 0; 312 int range_whole = 0;
313 313
314 pagevec_init(&pvec, 0); 314 pagevec_init(&pvec, 0);
315 if (wbc->range_cyclic) { 315 if (wbc->range_cyclic) {
316 index = mapping->writeback_index; /* Start from prev offset */ 316 index = mapping->writeback_index; /* Start from prev offset */
317 end = -1; 317 end = -1;
318 } else { 318 } else {
319 index = wbc->range_start >> PAGE_CACHE_SHIFT; 319 index = wbc->range_start >> PAGE_CACHE_SHIFT;
320 end = wbc->range_end >> PAGE_CACHE_SHIFT; 320 end = wbc->range_end >> PAGE_CACHE_SHIFT;
321 if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) 321 if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
322 range_whole = 1; 322 range_whole = 1;
323 scanned = 1; 323 scanned = 1;
324 } 324 }
325 325
326 retry: 326 retry:
327 while (!done && (index <= end) && 327 while (!done && (index <= end) &&
328 (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, 328 (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
329 PAGECACHE_TAG_DIRTY, 329 PAGECACHE_TAG_DIRTY,
330 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) { 330 min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
331 scanned = 1; 331 scanned = 1;
332 ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end); 332 ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end);
333 if (ret) 333 if (ret)
334 done = 1; 334 done = 1;
335 if (ret > 0) 335 if (ret > 0)
336 ret = 0; 336 ret = 0;
337 337
338 pagevec_release(&pvec); 338 pagevec_release(&pvec);
339 cond_resched(); 339 cond_resched();
340 } 340 }
341 341
342 if (!scanned && !done) { 342 if (!scanned && !done) {
343 /* 343 /*
344 * We hit the last page and there is more work to be done: wrap 344 * We hit the last page and there is more work to be done: wrap
345 * back to the start of the file 345 * back to the start of the file
346 */ 346 */
347 scanned = 1; 347 scanned = 1;
348 index = 0; 348 index = 0;
349 goto retry; 349 goto retry;
350 } 350 }
351 351
352 if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)) 352 if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
353 mapping->writeback_index = index; 353 mapping->writeback_index = index;
354 return ret; 354 return ret;
355 } 355 }
356 356
357 357
358 /** 358 /**
359 * gfs2_jdata_writepages - Write a bunch of dirty pages back to disk 359 * gfs2_jdata_writepages - Write a bunch of dirty pages back to disk
360 * @mapping: The mapping to write 360 * @mapping: The mapping to write
361 * @wbc: The writeback control 361 * @wbc: The writeback control
362 * 362 *
363 */ 363 */
364 364
365 static int gfs2_jdata_writepages(struct address_space *mapping, 365 static int gfs2_jdata_writepages(struct address_space *mapping,
366 struct writeback_control *wbc) 366 struct writeback_control *wbc)
367 { 367 {
368 struct gfs2_inode *ip = GFS2_I(mapping->host); 368 struct gfs2_inode *ip = GFS2_I(mapping->host);
369 struct gfs2_sbd *sdp = GFS2_SB(mapping->host); 369 struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
370 int ret; 370 int ret;
371 371
372 ret = gfs2_write_cache_jdata(mapping, wbc); 372 ret = gfs2_write_cache_jdata(mapping, wbc);
373 if (ret == 0 && wbc->sync_mode == WB_SYNC_ALL) { 373 if (ret == 0 && wbc->sync_mode == WB_SYNC_ALL) {
374 gfs2_log_flush(sdp, ip->i_gl); 374 gfs2_log_flush(sdp, ip->i_gl);
375 ret = gfs2_write_cache_jdata(mapping, wbc); 375 ret = gfs2_write_cache_jdata(mapping, wbc);
376 } 376 }
377 return ret; 377 return ret;
378 } 378 }
379 379
380 /** 380 /**
381 * stuffed_readpage - Fill in a Linux page with stuffed file data 381 * stuffed_readpage - Fill in a Linux page with stuffed file data
382 * @ip: the inode 382 * @ip: the inode
383 * @page: the page 383 * @page: the page
384 * 384 *
385 * Returns: errno 385 * Returns: errno
386 */ 386 */
387 387
388 static int stuffed_readpage(struct gfs2_inode *ip, struct page *page) 388 static int stuffed_readpage(struct gfs2_inode *ip, struct page *page)
389 { 389 {
390 struct buffer_head *dibh; 390 struct buffer_head *dibh;
391 u64 dsize = i_size_read(&ip->i_inode); 391 u64 dsize = i_size_read(&ip->i_inode);
392 void *kaddr; 392 void *kaddr;
393 int error; 393 int error;
394 394
395 /* 395 /*
396 * Due to the order of unstuffing files and ->fault(), we can be 396 * Due to the order of unstuffing files and ->fault(), we can be
397 * asked for a zero page in the case of a stuffed file being extended, 397 * asked for a zero page in the case of a stuffed file being extended,
398 * so we need to supply one here. It doesn't happen often. 398 * so we need to supply one here. It doesn't happen often.
399 */ 399 */
400 if (unlikely(page->index)) { 400 if (unlikely(page->index)) {
401 zero_user(page, 0, PAGE_CACHE_SIZE); 401 zero_user(page, 0, PAGE_CACHE_SIZE);
402 SetPageUptodate(page); 402 SetPageUptodate(page);
403 return 0; 403 return 0;
404 } 404 }
405 405
406 error = gfs2_meta_inode_buffer(ip, &dibh); 406 error = gfs2_meta_inode_buffer(ip, &dibh);
407 if (error) 407 if (error)
408 return error; 408 return error;
409 409
410 kaddr = kmap_atomic(page); 410 kaddr = kmap_atomic(page);
411 if (dsize > (dibh->b_size - sizeof(struct gfs2_dinode))) 411 if (dsize > (dibh->b_size - sizeof(struct gfs2_dinode)))
412 dsize = (dibh->b_size - sizeof(struct gfs2_dinode)); 412 dsize = (dibh->b_size - sizeof(struct gfs2_dinode));
413 memcpy(kaddr, dibh->b_data + sizeof(struct gfs2_dinode), dsize); 413 memcpy(kaddr, dibh->b_data + sizeof(struct gfs2_dinode), dsize);
414 memset(kaddr + dsize, 0, PAGE_CACHE_SIZE - dsize); 414 memset(kaddr + dsize, 0, PAGE_CACHE_SIZE - dsize);
415 kunmap_atomic(kaddr); 415 kunmap_atomic(kaddr);
416 flush_dcache_page(page); 416 flush_dcache_page(page);
417 brelse(dibh); 417 brelse(dibh);
418 SetPageUptodate(page); 418 SetPageUptodate(page);
419 419
420 return 0; 420 return 0;
421 } 421 }
422 422
423 423
424 /** 424 /**
425 * __gfs2_readpage - readpage 425 * __gfs2_readpage - readpage
426 * @file: The file to read a page for 426 * @file: The file to read a page for
427 * @page: The page to read 427 * @page: The page to read
428 * 428 *
429 * This is the core of gfs2's readpage. It's used by the internal file 429 * This is the core of gfs2's readpage. It's used by the internal file
430 * reading code as in that case we already hold the glock. Also it's 430 * reading code as in that case we already hold the glock. Also it's
431 * called by gfs2_readpage() once the required lock has been granted. 431 * called by gfs2_readpage() once the required lock has been granted.
432 * 432 *
433 */ 433 */
434 434
435 static int __gfs2_readpage(void *file, struct page *page) 435 static int __gfs2_readpage(void *file, struct page *page)
436 { 436 {
437 struct gfs2_inode *ip = GFS2_I(page->mapping->host); 437 struct gfs2_inode *ip = GFS2_I(page->mapping->host);
438 struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host); 438 struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
439 int error; 439 int error;
440 440
441 if (gfs2_is_stuffed(ip)) { 441 if (gfs2_is_stuffed(ip)) {
442 error = stuffed_readpage(ip, page); 442 error = stuffed_readpage(ip, page);
443 unlock_page(page); 443 unlock_page(page);
444 } else { 444 } else {
445 error = mpage_readpage(page, gfs2_block_map); 445 error = mpage_readpage(page, gfs2_block_map);
446 } 446 }
447 447
448 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) 448 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
449 return -EIO; 449 return -EIO;
450 450
451 return error; 451 return error;
452 } 452 }
453 453
454 /** 454 /**
455 * gfs2_readpage - read a page of a file 455 * gfs2_readpage - read a page of a file
456 * @file: The file to read 456 * @file: The file to read
457 * @page: The page of the file 457 * @page: The page of the file
458 * 458 *
459 * This deals with the locking required. We have to unlock and 459 * This deals with the locking required. We have to unlock and
460 * relock the page in order to get the locking in the right 460 * relock the page in order to get the locking in the right
461 * order. 461 * order.
462 */ 462 */
463 463
464 static int gfs2_readpage(struct file *file, struct page *page) 464 static int gfs2_readpage(struct file *file, struct page *page)
465 { 465 {
466 struct address_space *mapping = page->mapping; 466 struct address_space *mapping = page->mapping;
467 struct gfs2_inode *ip = GFS2_I(mapping->host); 467 struct gfs2_inode *ip = GFS2_I(mapping->host);
468 struct gfs2_holder gh; 468 struct gfs2_holder gh;
469 int error; 469 int error;
470 470
471 unlock_page(page); 471 unlock_page(page);
472 gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); 472 gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
473 error = gfs2_glock_nq(&gh); 473 error = gfs2_glock_nq(&gh);
474 if (unlikely(error)) 474 if (unlikely(error))
475 goto out; 475 goto out;
476 error = AOP_TRUNCATED_PAGE; 476 error = AOP_TRUNCATED_PAGE;
477 lock_page(page); 477 lock_page(page);
478 if (page->mapping == mapping && !PageUptodate(page)) 478 if (page->mapping == mapping && !PageUptodate(page))
479 error = __gfs2_readpage(file, page); 479 error = __gfs2_readpage(file, page);
480 else 480 else
481 unlock_page(page); 481 unlock_page(page);
482 gfs2_glock_dq(&gh); 482 gfs2_glock_dq(&gh);
483 out: 483 out:
484 gfs2_holder_uninit(&gh); 484 gfs2_holder_uninit(&gh);
485 if (error && error != AOP_TRUNCATED_PAGE) 485 if (error && error != AOP_TRUNCATED_PAGE)
486 lock_page(page); 486 lock_page(page);
487 return error; 487 return error;
488 } 488 }
489 489
490 /** 490 /**
491 * gfs2_internal_read - read an internal file 491 * gfs2_internal_read - read an internal file
492 * @ip: The gfs2 inode 492 * @ip: The gfs2 inode
493 * @buf: The buffer to fill 493 * @buf: The buffer to fill
494 * @pos: The file position 494 * @pos: The file position
495 * @size: The amount to read 495 * @size: The amount to read
496 * 496 *
497 */ 497 */
498 498
499 int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos, 499 int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
500 unsigned size) 500 unsigned size)
501 { 501 {
502 struct address_space *mapping = ip->i_inode.i_mapping; 502 struct address_space *mapping = ip->i_inode.i_mapping;
503 unsigned long index = *pos / PAGE_CACHE_SIZE; 503 unsigned long index = *pos / PAGE_CACHE_SIZE;
504 unsigned offset = *pos & (PAGE_CACHE_SIZE - 1); 504 unsigned offset = *pos & (PAGE_CACHE_SIZE - 1);
505 unsigned copied = 0; 505 unsigned copied = 0;
506 unsigned amt; 506 unsigned amt;
507 struct page *page; 507 struct page *page;
508 void *p; 508 void *p;
509 509
510 do { 510 do {
511 amt = size - copied; 511 amt = size - copied;
512 if (offset + size > PAGE_CACHE_SIZE) 512 if (offset + size > PAGE_CACHE_SIZE)
513 amt = PAGE_CACHE_SIZE - offset; 513 amt = PAGE_CACHE_SIZE - offset;
514 page = read_cache_page(mapping, index, __gfs2_readpage, NULL); 514 page = read_cache_page(mapping, index, __gfs2_readpage, NULL);
515 if (IS_ERR(page)) 515 if (IS_ERR(page))
516 return PTR_ERR(page); 516 return PTR_ERR(page);
517 p = kmap_atomic(page); 517 p = kmap_atomic(page);
518 memcpy(buf + copied, p + offset, amt); 518 memcpy(buf + copied, p + offset, amt);
519 kunmap_atomic(p); 519 kunmap_atomic(p);
520 mark_page_accessed(page);
521 page_cache_release(page); 520 page_cache_release(page);
522 copied += amt; 521 copied += amt;
523 index++; 522 index++;
524 offset = 0; 523 offset = 0;
525 } while(copied < size); 524 } while(copied < size);
526 (*pos) += size; 525 (*pos) += size;
527 return size; 526 return size;
528 } 527 }
529 528
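The only functional change in the gfs2_internal_read() hunk above is the dropped mark_page_accessed() call; the per-page slicing is untouched. For reference, a small stand-alone walk-through of that slicing for a 96-byte record straddling a 4 KiB page boundary (the sizes are assumptions chosen for illustration):

	#include <stdio.h>

	#define PAGE_CACHE_SIZE 4096u	/* assumed 4 KiB pages */

	/* Mirror the slicing done by gfs2_internal_read for a small record
	 * that straddles a page boundary: 96 bytes starting at offset 4064. */
	int main(void)
	{
		unsigned size = 96, pos = 4064;
		unsigned copied = 0, offset = pos % PAGE_CACHE_SIZE;
		unsigned long index = pos / PAGE_CACHE_SIZE;

		while (copied < size) {
			unsigned amt = size - copied;

			if (offset + size > PAGE_CACHE_SIZE)
				amt = PAGE_CACHE_SIZE - offset;
			printf("page %lu: copy %u bytes from offset %u\n",
			       index, amt, offset);
			copied += amt;
			index++;
			offset = 0;
		}
		return 0;
	}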
530 /** 529 /**
531 * gfs2_readpages - Read a bunch of pages at once 530 * gfs2_readpages - Read a bunch of pages at once
532 * 531 *
533 * Some notes: 532 * Some notes:
534 * 1. This is only for readahead, so we can simply ignore any things 533 * 1. This is only for readahead, so we can simply ignore any things
535 * which are slightly inconvenient (such as locking conflicts between 534 * which are slightly inconvenient (such as locking conflicts between
536 * the page lock and the glock) and return having done no I/O. It's 535 * the page lock and the glock) and return having done no I/O. It's
537 * obviously not something we'd want to do on too regular a basis. 536 * obviously not something we'd want to do on too regular a basis.
538 * Any I/O we ignore at this time will be done via readpage later. 537 * Any I/O we ignore at this time will be done via readpage later.
539 * 2. We don't handle stuffed files here; we let readpage do the honours. 538 * 2. We don't handle stuffed files here; we let readpage do the honours.
540 * 3. mpage_readpages() does most of the heavy lifting in the common case. 539 * 3. mpage_readpages() does most of the heavy lifting in the common case.
541 * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places. 540 * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places.
542 */ 541 */
543 542
544 static int gfs2_readpages(struct file *file, struct address_space *mapping, 543 static int gfs2_readpages(struct file *file, struct address_space *mapping,
545 struct list_head *pages, unsigned nr_pages) 544 struct list_head *pages, unsigned nr_pages)
546 { 545 {
547 struct inode *inode = mapping->host; 546 struct inode *inode = mapping->host;
548 struct gfs2_inode *ip = GFS2_I(inode); 547 struct gfs2_inode *ip = GFS2_I(inode);
549 struct gfs2_sbd *sdp = GFS2_SB(inode); 548 struct gfs2_sbd *sdp = GFS2_SB(inode);
550 struct gfs2_holder gh; 549 struct gfs2_holder gh;
551 int ret; 550 int ret;
552 551
553 gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); 552 gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
554 ret = gfs2_glock_nq(&gh); 553 ret = gfs2_glock_nq(&gh);
555 if (unlikely(ret)) 554 if (unlikely(ret))
556 goto out_uninit; 555 goto out_uninit;
557 if (!gfs2_is_stuffed(ip)) 556 if (!gfs2_is_stuffed(ip))
558 ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map); 557 ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map);
559 gfs2_glock_dq(&gh); 558 gfs2_glock_dq(&gh);
560 out_uninit: 559 out_uninit:
561 gfs2_holder_uninit(&gh); 560 gfs2_holder_uninit(&gh);
562 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) 561 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
563 ret = -EIO; 562 ret = -EIO;
564 return ret; 563 return ret;
565 } 564 }
566 565
567 /** 566 /**
568 * gfs2_write_begin - Begin to write to a file 567 * gfs2_write_begin - Begin to write to a file
569 * @file: The file to write to 568 * @file: The file to write to
570 * @mapping: The mapping in which to write 569 * @mapping: The mapping in which to write
571 * @pos: The file offset at which to start writing 570 * @pos: The file offset at which to start writing
572 * @len: Length of the write 571 * @len: Length of the write
573 * @flags: Various flags 572 * @flags: Various flags
574 * @pagep: Pointer to return the page 573 * @pagep: Pointer to return the page
575 * @fsdata: Pointer to return fs data (unused by GFS2) 574 * @fsdata: Pointer to return fs data (unused by GFS2)
576 * 575 *
577 * Returns: errno 576 * Returns: errno
578 */ 577 */
579 578
580 static int gfs2_write_begin(struct file *file, struct address_space *mapping, 579 static int gfs2_write_begin(struct file *file, struct address_space *mapping,
581 loff_t pos, unsigned len, unsigned flags, 580 loff_t pos, unsigned len, unsigned flags,
582 struct page **pagep, void **fsdata) 581 struct page **pagep, void **fsdata)
583 { 582 {
584 struct gfs2_inode *ip = GFS2_I(mapping->host); 583 struct gfs2_inode *ip = GFS2_I(mapping->host);
585 struct gfs2_sbd *sdp = GFS2_SB(mapping->host); 584 struct gfs2_sbd *sdp = GFS2_SB(mapping->host);
586 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); 585 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
587 unsigned int data_blocks = 0, ind_blocks = 0, rblocks; 586 unsigned int data_blocks = 0, ind_blocks = 0, rblocks;
588 unsigned requested = 0; 587 unsigned requested = 0;
589 int alloc_required; 588 int alloc_required;
590 int error = 0; 589 int error = 0;
591 pgoff_t index = pos >> PAGE_CACHE_SHIFT; 590 pgoff_t index = pos >> PAGE_CACHE_SHIFT;
592 unsigned from = pos & (PAGE_CACHE_SIZE - 1); 591 unsigned from = pos & (PAGE_CACHE_SIZE - 1);
593 struct page *page; 592 struct page *page;
594 593
595 gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh); 594 gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
596 error = gfs2_glock_nq(&ip->i_gh); 595 error = gfs2_glock_nq(&ip->i_gh);
597 if (unlikely(error)) 596 if (unlikely(error))
598 goto out_uninit; 597 goto out_uninit;
599 if (&ip->i_inode == sdp->sd_rindex) { 598 if (&ip->i_inode == sdp->sd_rindex) {
600 error = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE, 599 error = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE,
601 GL_NOCACHE, &m_ip->i_gh); 600 GL_NOCACHE, &m_ip->i_gh);
602 if (unlikely(error)) { 601 if (unlikely(error)) {
603 gfs2_glock_dq(&ip->i_gh); 602 gfs2_glock_dq(&ip->i_gh);
604 goto out_uninit; 603 goto out_uninit;
605 } 604 }
606 } 605 }
607 606
608 alloc_required = gfs2_write_alloc_required(ip, pos, len); 607 alloc_required = gfs2_write_alloc_required(ip, pos, len);
609 608
610 if (alloc_required || gfs2_is_jdata(ip)) 609 if (alloc_required || gfs2_is_jdata(ip))
611 gfs2_write_calc_reserv(ip, len, &data_blocks, &ind_blocks); 610 gfs2_write_calc_reserv(ip, len, &data_blocks, &ind_blocks);
612 611
613 if (alloc_required) { 612 if (alloc_required) {
614 error = gfs2_quota_lock_check(ip); 613 error = gfs2_quota_lock_check(ip);
615 if (error) 614 if (error)
616 goto out_unlock; 615 goto out_unlock;
617 616
618 requested = data_blocks + ind_blocks; 617 requested = data_blocks + ind_blocks;
619 error = gfs2_inplace_reserve(ip, requested, 0); 618 error = gfs2_inplace_reserve(ip, requested, 0);
620 if (error) 619 if (error)
621 goto out_qunlock; 620 goto out_qunlock;
622 } 621 }
623 622
624 rblocks = RES_DINODE + ind_blocks; 623 rblocks = RES_DINODE + ind_blocks;
625 if (gfs2_is_jdata(ip)) 624 if (gfs2_is_jdata(ip))
626 rblocks += data_blocks ? data_blocks : 1; 625 rblocks += data_blocks ? data_blocks : 1;
627 if (ind_blocks || data_blocks) 626 if (ind_blocks || data_blocks)
628 rblocks += RES_STATFS + RES_QUOTA; 627 rblocks += RES_STATFS + RES_QUOTA;
629 if (&ip->i_inode == sdp->sd_rindex) 628 if (&ip->i_inode == sdp->sd_rindex)
630 rblocks += 2 * RES_STATFS; 629 rblocks += 2 * RES_STATFS;
631 if (alloc_required) 630 if (alloc_required)
632 rblocks += gfs2_rg_blocks(ip, requested); 631 rblocks += gfs2_rg_blocks(ip, requested);
633 632
634 error = gfs2_trans_begin(sdp, rblocks, 633 error = gfs2_trans_begin(sdp, rblocks,
635 PAGE_CACHE_SIZE/sdp->sd_sb.sb_bsize); 634 PAGE_CACHE_SIZE/sdp->sd_sb.sb_bsize);
636 if (error) 635 if (error)
637 goto out_trans_fail; 636 goto out_trans_fail;
638 637
639 error = -ENOMEM; 638 error = -ENOMEM;
640 flags |= AOP_FLAG_NOFS; 639 flags |= AOP_FLAG_NOFS;
641 page = grab_cache_page_write_begin(mapping, index, flags); 640 page = grab_cache_page_write_begin(mapping, index, flags);
642 *pagep = page; 641 *pagep = page;
643 if (unlikely(!page)) 642 if (unlikely(!page))
644 goto out_endtrans; 643 goto out_endtrans;
645 644
646 if (gfs2_is_stuffed(ip)) { 645 if (gfs2_is_stuffed(ip)) {
647 error = 0; 646 error = 0;
648 if (pos + len > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) { 647 if (pos + len > sdp->sd_sb.sb_bsize - sizeof(struct gfs2_dinode)) {
649 error = gfs2_unstuff_dinode(ip, page); 648 error = gfs2_unstuff_dinode(ip, page);
650 if (error == 0) 649 if (error == 0)
651 goto prepare_write; 650 goto prepare_write;
652 } else if (!PageUptodate(page)) { 651 } else if (!PageUptodate(page)) {
653 error = stuffed_readpage(ip, page); 652 error = stuffed_readpage(ip, page);
654 } 653 }
655 goto out; 654 goto out;
656 } 655 }
657 656
658 prepare_write: 657 prepare_write:
659 error = __block_write_begin(page, from, len, gfs2_block_map); 658 error = __block_write_begin(page, from, len, gfs2_block_map);
660 out: 659 out:
661 if (error == 0) 660 if (error == 0)
662 return 0; 661 return 0;
663 662
664 unlock_page(page); 663 unlock_page(page);
665 page_cache_release(page); 664 page_cache_release(page);
666 665
667 gfs2_trans_end(sdp); 666 gfs2_trans_end(sdp);
668 if (pos + len > ip->i_inode.i_size) 667 if (pos + len > ip->i_inode.i_size)
669 gfs2_trim_blocks(&ip->i_inode); 668 gfs2_trim_blocks(&ip->i_inode);
670 goto out_trans_fail; 669 goto out_trans_fail;
671 670
672 out_endtrans: 671 out_endtrans:
673 gfs2_trans_end(sdp); 672 gfs2_trans_end(sdp);
674 out_trans_fail: 673 out_trans_fail:
675 if (alloc_required) { 674 if (alloc_required) {
676 gfs2_inplace_release(ip); 675 gfs2_inplace_release(ip);
677 out_qunlock: 676 out_qunlock:
678 gfs2_quota_unlock(ip); 677 gfs2_quota_unlock(ip);
679 } 678 }
680 out_unlock: 679 out_unlock:
681 if (&ip->i_inode == sdp->sd_rindex) { 680 if (&ip->i_inode == sdp->sd_rindex) {
682 gfs2_glock_dq(&m_ip->i_gh); 681 gfs2_glock_dq(&m_ip->i_gh);
683 gfs2_holder_uninit(&m_ip->i_gh); 682 gfs2_holder_uninit(&m_ip->i_gh);
684 } 683 }
685 gfs2_glock_dq(&ip->i_gh); 684 gfs2_glock_dq(&ip->i_gh);
686 out_uninit: 685 out_uninit:
687 gfs2_holder_uninit(&ip->i_gh); 686 gfs2_holder_uninit(&ip->i_gh);
688 return error; 687 return error;
689 } 688 }
690 689
691 /** 690 /**
692 * adjust_fs_space - Adjusts the free space available due to gfs2_grow 691 * adjust_fs_space - Adjusts the free space available due to gfs2_grow
693 * @inode: the rindex inode 692 * @inode: the rindex inode
694 */ 693 */
695 static void adjust_fs_space(struct inode *inode) 694 static void adjust_fs_space(struct inode *inode)
696 { 695 {
697 struct gfs2_sbd *sdp = inode->i_sb->s_fs_info; 696 struct gfs2_sbd *sdp = inode->i_sb->s_fs_info;
698 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); 697 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
699 struct gfs2_inode *l_ip = GFS2_I(sdp->sd_sc_inode); 698 struct gfs2_inode *l_ip = GFS2_I(sdp->sd_sc_inode);
700 struct gfs2_statfs_change_host *m_sc = &sdp->sd_statfs_master; 699 struct gfs2_statfs_change_host *m_sc = &sdp->sd_statfs_master;
701 struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local; 700 struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local;
702 struct buffer_head *m_bh, *l_bh; 701 struct buffer_head *m_bh, *l_bh;
703 u64 fs_total, new_free; 702 u64 fs_total, new_free;
704 703
705 /* Total up the file system space, according to the latest rindex. */ 704 /* Total up the file system space, according to the latest rindex. */
706 fs_total = gfs2_ri_total(sdp); 705 fs_total = gfs2_ri_total(sdp);
707 if (gfs2_meta_inode_buffer(m_ip, &m_bh) != 0) 706 if (gfs2_meta_inode_buffer(m_ip, &m_bh) != 0)
708 return; 707 return;
709 708
710 spin_lock(&sdp->sd_statfs_spin); 709 spin_lock(&sdp->sd_statfs_spin);
711 gfs2_statfs_change_in(m_sc, m_bh->b_data + 710 gfs2_statfs_change_in(m_sc, m_bh->b_data +
712 sizeof(struct gfs2_dinode)); 711 sizeof(struct gfs2_dinode));
713 if (fs_total > (m_sc->sc_total + l_sc->sc_total)) 712 if (fs_total > (m_sc->sc_total + l_sc->sc_total))
714 new_free = fs_total - (m_sc->sc_total + l_sc->sc_total); 713 new_free = fs_total - (m_sc->sc_total + l_sc->sc_total);
715 else 714 else
716 new_free = 0; 715 new_free = 0;
717 spin_unlock(&sdp->sd_statfs_spin); 716 spin_unlock(&sdp->sd_statfs_spin);
718 fs_warn(sdp, "File system extended by %llu blocks.\n", 717 fs_warn(sdp, "File system extended by %llu blocks.\n",
719 (unsigned long long)new_free); 718 (unsigned long long)new_free);
720 gfs2_statfs_change(sdp, new_free, new_free, 0); 719 gfs2_statfs_change(sdp, new_free, new_free, 0);
721 720
722 if (gfs2_meta_inode_buffer(l_ip, &l_bh) != 0) 721 if (gfs2_meta_inode_buffer(l_ip, &l_bh) != 0)
723 goto out; 722 goto out;
724 update_statfs(sdp, m_bh, l_bh); 723 update_statfs(sdp, m_bh, l_bh);
725 brelse(l_bh); 724 brelse(l_bh);
726 out: 725 out:
727 brelse(m_bh); 726 brelse(m_bh);
728 } 727 }
729 728
730 /** 729 /**
731 * gfs2_stuffed_write_end - Write end for stuffed files 730 * gfs2_stuffed_write_end - Write end for stuffed files
732 * @inode: The inode 731 * @inode: The inode
733 * @dibh: The buffer_head containing the on-disk inode 732 * @dibh: The buffer_head containing the on-disk inode
734 * @pos: The file position 733 * @pos: The file position
735 * @len: The length of the write 734 * @len: The length of the write
736 * @copied: How much was actually copied by the VFS 735 * @copied: How much was actually copied by the VFS
737 * @page: The page 736 * @page: The page
738 * 737 *
739 * This copies the data from the page into the inode block after 738 * This copies the data from the page into the inode block after
740 * the inode data structure itself. 739 * the inode data structure itself.
741 * 740 *
742 * Returns: errno 741 * Returns: errno
743 */ 742 */
744 static int gfs2_stuffed_write_end(struct inode *inode, struct buffer_head *dibh, 743 static int gfs2_stuffed_write_end(struct inode *inode, struct buffer_head *dibh,
745 loff_t pos, unsigned len, unsigned copied, 744 loff_t pos, unsigned len, unsigned copied,
746 struct page *page) 745 struct page *page)
747 { 746 {
748 struct gfs2_inode *ip = GFS2_I(inode); 747 struct gfs2_inode *ip = GFS2_I(inode);
749 struct gfs2_sbd *sdp = GFS2_SB(inode); 748 struct gfs2_sbd *sdp = GFS2_SB(inode);
750 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); 749 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
751 u64 to = pos + copied; 750 u64 to = pos + copied;
752 void *kaddr; 751 void *kaddr;
753 unsigned char *buf = dibh->b_data + sizeof(struct gfs2_dinode); 752 unsigned char *buf = dibh->b_data + sizeof(struct gfs2_dinode);
754 753
755 BUG_ON((pos + len) > (dibh->b_size - sizeof(struct gfs2_dinode))); 754 BUG_ON((pos + len) > (dibh->b_size - sizeof(struct gfs2_dinode)));
756 kaddr = kmap_atomic(page); 755 kaddr = kmap_atomic(page);
757 memcpy(buf + pos, kaddr + pos, copied); 756 memcpy(buf + pos, kaddr + pos, copied);
758 memset(kaddr + pos + copied, 0, len - copied); 757 memset(kaddr + pos + copied, 0, len - copied);
759 flush_dcache_page(page); 758 flush_dcache_page(page);
760 kunmap_atomic(kaddr); 759 kunmap_atomic(kaddr);
761 760
762 if (!PageUptodate(page)) 761 if (!PageUptodate(page))
763 SetPageUptodate(page); 762 SetPageUptodate(page);
764 unlock_page(page); 763 unlock_page(page);
765 page_cache_release(page); 764 page_cache_release(page);
766 765
767 if (copied) { 766 if (copied) {
768 if (inode->i_size < to) 767 if (inode->i_size < to)
769 i_size_write(inode, to); 768 i_size_write(inode, to);
770 mark_inode_dirty(inode); 769 mark_inode_dirty(inode);
771 } 770 }
772 771
773 if (inode == sdp->sd_rindex) { 772 if (inode == sdp->sd_rindex) {
774 adjust_fs_space(inode); 773 adjust_fs_space(inode);
775 sdp->sd_rindex_uptodate = 0; 774 sdp->sd_rindex_uptodate = 0;
776 } 775 }
777 776
778 brelse(dibh); 777 brelse(dibh);
779 gfs2_trans_end(sdp); 778 gfs2_trans_end(sdp);
780 if (inode == sdp->sd_rindex) { 779 if (inode == sdp->sd_rindex) {
781 gfs2_glock_dq(&m_ip->i_gh); 780 gfs2_glock_dq(&m_ip->i_gh);
782 gfs2_holder_uninit(&m_ip->i_gh); 781 gfs2_holder_uninit(&m_ip->i_gh);
783 } 782 }
784 gfs2_glock_dq(&ip->i_gh); 783 gfs2_glock_dq(&ip->i_gh);
785 gfs2_holder_uninit(&ip->i_gh); 784 gfs2_holder_uninit(&ip->i_gh);
786 return copied; 785 return copied;
787 } 786 }
788 787
789 /** 788 /**
790 * gfs2_write_end 789 * gfs2_write_end
791 * @file: The file to write to 790 * @file: The file to write to
792 * @mapping: The address space to write to 791 * @mapping: The address space to write to
793 * @pos: The file position 792 * @pos: The file position
794 * @len: The length of the data 793 * @len: The length of the data
795 * @copied: How much was actually copied by the VFS 794 * @copied: How much was actually copied by the VFS
796 * @page: The page that has been written 795 * @page: The page that has been written
797 * @fsdata: The fsdata (unused in GFS2) 796 * @fsdata: The fsdata (unused in GFS2)
798 * 797 *
799 * The main write_end function for GFS2. We have a separate one for 798 * The main write_end function for GFS2. We have a separate one for
800 * stuffed files as they are slightly different, otherwise we just 799 * stuffed files as they are slightly different, otherwise we just
801 * put our locking around the VFS provided functions. 800 * put our locking around the VFS provided functions.
802 * 801 *
803 * Returns: errno 802 * Returns: errno
804 */ 803 */
805 804
806 static int gfs2_write_end(struct file *file, struct address_space *mapping, 805 static int gfs2_write_end(struct file *file, struct address_space *mapping,
807 loff_t pos, unsigned len, unsigned copied, 806 loff_t pos, unsigned len, unsigned copied,
808 struct page *page, void *fsdata) 807 struct page *page, void *fsdata)
809 { 808 {
810 struct inode *inode = page->mapping->host; 809 struct inode *inode = page->mapping->host;
811 struct gfs2_inode *ip = GFS2_I(inode); 810 struct gfs2_inode *ip = GFS2_I(inode);
812 struct gfs2_sbd *sdp = GFS2_SB(inode); 811 struct gfs2_sbd *sdp = GFS2_SB(inode);
813 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode); 812 struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
814 struct buffer_head *dibh; 813 struct buffer_head *dibh;
815 unsigned int from = pos & (PAGE_CACHE_SIZE - 1); 814 unsigned int from = pos & (PAGE_CACHE_SIZE - 1);
816 unsigned int to = from + len; 815 unsigned int to = from + len;
817 int ret; 816 int ret;
818 struct gfs2_trans *tr = current->journal_info; 817 struct gfs2_trans *tr = current->journal_info;
819 BUG_ON(!tr); 818 BUG_ON(!tr);
820 819
821 BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL); 820 BUG_ON(gfs2_glock_is_locked_by_me(ip->i_gl) == NULL);
822 821
823 ret = gfs2_meta_inode_buffer(ip, &dibh); 822 ret = gfs2_meta_inode_buffer(ip, &dibh);
824 if (unlikely(ret)) { 823 if (unlikely(ret)) {
825 unlock_page(page); 824 unlock_page(page);
826 page_cache_release(page); 825 page_cache_release(page);
827 goto failed; 826 goto failed;
828 } 827 }
829 828
830 if (gfs2_is_stuffed(ip)) 829 if (gfs2_is_stuffed(ip))
831 return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page); 830 return gfs2_stuffed_write_end(inode, dibh, pos, len, copied, page);
832 831
833 if (!gfs2_is_writeback(ip)) 832 if (!gfs2_is_writeback(ip))
834 gfs2_page_add_databufs(ip, page, from, to); 833 gfs2_page_add_databufs(ip, page, from, to);
835 834
836 ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata); 835 ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
837 if (tr->tr_num_buf_new) 836 if (tr->tr_num_buf_new)
838 __mark_inode_dirty(inode, I_DIRTY_DATASYNC); 837 __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
839 else 838 else
840 gfs2_trans_add_meta(ip->i_gl, dibh); 839 gfs2_trans_add_meta(ip->i_gl, dibh);
841 840
842 841
843 if (inode == sdp->sd_rindex) { 842 if (inode == sdp->sd_rindex) {
844 adjust_fs_space(inode); 843 adjust_fs_space(inode);
845 sdp->sd_rindex_uptodate = 0; 844 sdp->sd_rindex_uptodate = 0;
846 } 845 }
847 846
848 brelse(dibh); 847 brelse(dibh);
849 failed: 848 failed:
850 gfs2_trans_end(sdp); 849 gfs2_trans_end(sdp);
851 gfs2_inplace_release(ip); 850 gfs2_inplace_release(ip);
852 if (ip->i_res->rs_qa_qd_num) 851 if (ip->i_res->rs_qa_qd_num)
853 gfs2_quota_unlock(ip); 852 gfs2_quota_unlock(ip);
854 if (inode == sdp->sd_rindex) { 853 if (inode == sdp->sd_rindex) {
855 gfs2_glock_dq(&m_ip->i_gh); 854 gfs2_glock_dq(&m_ip->i_gh);
856 gfs2_holder_uninit(&m_ip->i_gh); 855 gfs2_holder_uninit(&m_ip->i_gh);
857 } 856 }
858 gfs2_glock_dq(&ip->i_gh); 857 gfs2_glock_dq(&ip->i_gh);
859 gfs2_holder_uninit(&ip->i_gh); 858 gfs2_holder_uninit(&ip->i_gh);
860 return ret; 859 return ret;
861 } 860 }
862 861
863 /** 862 /**
864 * gfs2_set_page_dirty - Page dirtying function 863 * gfs2_set_page_dirty - Page dirtying function
865 * @page: The page to dirty 864 * @page: The page to dirty
866 * 865 *
867 * Returns: 1 if it dirtied the page, or 0 otherwise 866 * Returns: 1 if it dirtied the page, or 0 otherwise
868 */ 867 */
869 868
870 static int gfs2_set_page_dirty(struct page *page) 869 static int gfs2_set_page_dirty(struct page *page)
871 { 870 {
872 SetPageChecked(page); 871 SetPageChecked(page);
873 return __set_page_dirty_buffers(page); 872 return __set_page_dirty_buffers(page);
874 } 873 }
875 874
876 /** 875 /**
877 * gfs2_bmap - Block map function 876 * gfs2_bmap - Block map function
878 * @mapping: Address space info 877 * @mapping: Address space info
879 * @lblock: The block to map 878 * @lblock: The block to map
880 * 879 *
881 * Returns: The disk address for the block or 0 on hole or error 880 * Returns: The disk address for the block or 0 on hole or error
882 */ 881 */
883 882
884 static sector_t gfs2_bmap(struct address_space *mapping, sector_t lblock) 883 static sector_t gfs2_bmap(struct address_space *mapping, sector_t lblock)
885 { 884 {
886 struct gfs2_inode *ip = GFS2_I(mapping->host); 885 struct gfs2_inode *ip = GFS2_I(mapping->host);
887 struct gfs2_holder i_gh; 886 struct gfs2_holder i_gh;
888 sector_t dblock = 0; 887 sector_t dblock = 0;
889 int error; 888 int error;
890 889
891 error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &i_gh); 890 error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &i_gh);
892 if (error) 891 if (error)
893 return 0; 892 return 0;
894 893
895 if (!gfs2_is_stuffed(ip)) 894 if (!gfs2_is_stuffed(ip))
896 dblock = generic_block_bmap(mapping, lblock, gfs2_block_map); 895 dblock = generic_block_bmap(mapping, lblock, gfs2_block_map);
897 896
898 gfs2_glock_dq_uninit(&i_gh); 897 gfs2_glock_dq_uninit(&i_gh);
899 898
900 return dblock; 899 return dblock;
901 } 900 }
902 901
903 static void gfs2_discard(struct gfs2_sbd *sdp, struct buffer_head *bh) 902 static void gfs2_discard(struct gfs2_sbd *sdp, struct buffer_head *bh)
904 { 903 {
905 struct gfs2_bufdata *bd; 904 struct gfs2_bufdata *bd;
906 905
907 lock_buffer(bh); 906 lock_buffer(bh);
908 gfs2_log_lock(sdp); 907 gfs2_log_lock(sdp);
909 clear_buffer_dirty(bh); 908 clear_buffer_dirty(bh);
910 bd = bh->b_private; 909 bd = bh->b_private;
911 if (bd) { 910 if (bd) {
912 if (!list_empty(&bd->bd_list) && !buffer_pinned(bh)) 911 if (!list_empty(&bd->bd_list) && !buffer_pinned(bh))
913 list_del_init(&bd->bd_list); 912 list_del_init(&bd->bd_list);
914 else 913 else
915 gfs2_remove_from_journal(bh, current->journal_info, 0); 914 gfs2_remove_from_journal(bh, current->journal_info, 0);
916 } 915 }
917 bh->b_bdev = NULL; 916 bh->b_bdev = NULL;
918 clear_buffer_mapped(bh); 917 clear_buffer_mapped(bh);
919 clear_buffer_req(bh); 918 clear_buffer_req(bh);
920 clear_buffer_new(bh); 919 clear_buffer_new(bh);
921 gfs2_log_unlock(sdp); 920 gfs2_log_unlock(sdp);
922 unlock_buffer(bh); 921 unlock_buffer(bh);
923 } 922 }
924 923
925 static void gfs2_invalidatepage(struct page *page, unsigned int offset, 924 static void gfs2_invalidatepage(struct page *page, unsigned int offset,
926 unsigned int length) 925 unsigned int length)
927 { 926 {
928 struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host); 927 struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
929 unsigned int stop = offset + length; 928 unsigned int stop = offset + length;
930 int partial_page = (offset || length < PAGE_CACHE_SIZE); 929 int partial_page = (offset || length < PAGE_CACHE_SIZE);
931 struct buffer_head *bh, *head; 930 struct buffer_head *bh, *head;
932 unsigned long pos = 0; 931 unsigned long pos = 0;
933 932
934 BUG_ON(!PageLocked(page)); 933 BUG_ON(!PageLocked(page));
935 if (!partial_page) 934 if (!partial_page)
936 ClearPageChecked(page); 935 ClearPageChecked(page);
937 if (!page_has_buffers(page)) 936 if (!page_has_buffers(page))
938 goto out; 937 goto out;
939 938
940 bh = head = page_buffers(page); 939 bh = head = page_buffers(page);
941 do { 940 do {
942 if (pos + bh->b_size > stop) 941 if (pos + bh->b_size > stop)
943 return; 942 return;
944 943
945 if (offset <= pos) 944 if (offset <= pos)
946 gfs2_discard(sdp, bh); 945 gfs2_discard(sdp, bh);
947 pos += bh->b_size; 946 pos += bh->b_size;
948 bh = bh->b_this_page; 947 bh = bh->b_this_page;
949 } while (bh != head); 948 } while (bh != head);
950 out: 949 out:
951 if (!partial_page) 950 if (!partial_page)
952 try_to_release_page(page, 0); 951 try_to_release_page(page, 0);
953 } 952 }
954 953
955 /** 954 /**
956 * gfs2_ok_for_dio - check that dio is valid on this file 955 * gfs2_ok_for_dio - check that dio is valid on this file
957 * @ip: The inode 956 * @ip: The inode
958 * @rw: READ or WRITE 957 * @rw: READ or WRITE
959 * @offset: The offset at which we are reading or writing 958 * @offset: The offset at which we are reading or writing
960 * 959 *
961 * Returns: 0 (to ignore the i/o request and thus fall back to buffered i/o) 960 * Returns: 0 (to ignore the i/o request and thus fall back to buffered i/o)
962 * 1 (to accept the i/o request) 961 * 1 (to accept the i/o request)
963 */ 962 */
964 static int gfs2_ok_for_dio(struct gfs2_inode *ip, int rw, loff_t offset) 963 static int gfs2_ok_for_dio(struct gfs2_inode *ip, int rw, loff_t offset)
965 { 964 {
966 /* 965 /*
967 * Should we return an error here? I can't see that O_DIRECT for 966 * Should we return an error here? I can't see that O_DIRECT for
968 * a stuffed file makes any sense. For now we'll silently fall 967 * a stuffed file makes any sense. For now we'll silently fall
969 * back to buffered I/O 968 * back to buffered I/O
970 */ 969 */
971 if (gfs2_is_stuffed(ip)) 970 if (gfs2_is_stuffed(ip))
972 return 0; 971 return 0;
973 972
974 if (offset >= i_size_read(&ip->i_inode)) 973 if (offset >= i_size_read(&ip->i_inode))
975 return 0; 974 return 0;
976 return 1; 975 return 1;
977 } 976 }
978 977
979 978
980 979
981 static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb, 980 static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,
982 const struct iovec *iov, loff_t offset, 981 const struct iovec *iov, loff_t offset,
983 unsigned long nr_segs) 982 unsigned long nr_segs)
984 { 983 {
985 struct file *file = iocb->ki_filp; 984 struct file *file = iocb->ki_filp;
986 struct inode *inode = file->f_mapping->host; 985 struct inode *inode = file->f_mapping->host;
987 struct address_space *mapping = inode->i_mapping; 986 struct address_space *mapping = inode->i_mapping;
988 struct gfs2_inode *ip = GFS2_I(inode); 987 struct gfs2_inode *ip = GFS2_I(inode);
989 struct gfs2_holder gh; 988 struct gfs2_holder gh;
990 int rv; 989 int rv;
991 990
992 /* 991 /*
993 * Deferred lock, even if it's a write, since we do no allocation 992 * Deferred lock, even if it's a write, since we do no allocation
994 * on this path. All we need to change is atime, and this lock mode 993 * on this path. All we need to change is atime, and this lock mode
995 * ensures that other nodes have flushed their buffered read caches 994 * ensures that other nodes have flushed their buffered read caches
996 * (i.e. their page cache entries for this inode). We do not, 995 * (i.e. their page cache entries for this inode). We do not,
997 * unfortunately have the option of only flushing a range like 996 * unfortunately have the option of only flushing a range like
998 * the VFS does. 997 * the VFS does.
999 */ 998 */
1000 gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, &gh); 999 gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, &gh);
1001 rv = gfs2_glock_nq(&gh); 1000 rv = gfs2_glock_nq(&gh);
1002 if (rv) 1001 if (rv)
1003 return rv; 1002 return rv;
1004 rv = gfs2_ok_for_dio(ip, rw, offset); 1003 rv = gfs2_ok_for_dio(ip, rw, offset);
1005 if (rv != 1) 1004 if (rv != 1)
1006 goto out; /* dio not valid, fall back to buffered i/o */ 1005 goto out; /* dio not valid, fall back to buffered i/o */
1007 1006
1008 /* 1007 /*
1009 * Now since we are holding a deferred (CW) lock at this point, you 1008 * Now since we are holding a deferred (CW) lock at this point, you
1010 * might be wondering why this is ever needed. There is a case however 1009 * might be wondering why this is ever needed. There is a case however
1011 * where we've granted a deferred local lock against a cached exclusive 1010 * where we've granted a deferred local lock against a cached exclusive
1012 * glock. That is ok provided all granted local locks are deferred, but 1011 * glock. That is ok provided all granted local locks are deferred, but
1013 * it also means that it is possible to encounter pages which are 1012 * it also means that it is possible to encounter pages which are
1014 * cached and possibly also mapped. So here we check for that and sort 1013 * cached and possibly also mapped. So here we check for that and sort
1015 * them out ahead of the dio. The glock state machine will take care of 1014 * them out ahead of the dio. The glock state machine will take care of
1016 * everything else. 1015 * everything else.
1017 * 1016 *
1018 * If in fact the cached glock state (gl->gl_state) is deferred (CW) in 1017 * If in fact the cached glock state (gl->gl_state) is deferred (CW) in
1019 * the first place, mapping->nrpages will always be zero. 1018 * the first place, mapping->nrpages will always be zero.
1020 */ 1019 */
1021 if (mapping->nrpages) { 1020 if (mapping->nrpages) {
1022 loff_t lstart = offset & (PAGE_CACHE_SIZE - 1); 1021 loff_t lstart = offset & (PAGE_CACHE_SIZE - 1);
1023 loff_t len = iov_length(iov, nr_segs); 1022 loff_t len = iov_length(iov, nr_segs);
1024 loff_t end = PAGE_ALIGN(offset + len) - 1; 1023 loff_t end = PAGE_ALIGN(offset + len) - 1;
1025 1024
1026 rv = 0; 1025 rv = 0;
1027 if (len == 0) 1026 if (len == 0)
1028 goto out; 1027 goto out;
1029 if (test_and_clear_bit(GIF_SW_PAGED, &ip->i_flags)) 1028 if (test_and_clear_bit(GIF_SW_PAGED, &ip->i_flags))
1030 unmap_shared_mapping_range(ip->i_inode.i_mapping, offset, len); 1029 unmap_shared_mapping_range(ip->i_inode.i_mapping, offset, len);
1031 rv = filemap_write_and_wait_range(mapping, lstart, end); 1030 rv = filemap_write_and_wait_range(mapping, lstart, end);
1032 if (rv) 1031 if (rv)
1033 return rv; 1032 return rv;
1034 truncate_inode_pages_range(mapping, lstart, end); 1033 truncate_inode_pages_range(mapping, lstart, end);
1035 } 1034 }
1036 1035
1037 rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, 1036 rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
1038 offset, nr_segs, gfs2_get_block_direct, 1037 offset, nr_segs, gfs2_get_block_direct,
1039 NULL, NULL, 0); 1038 NULL, NULL, 0);
1040 out: 1039 out:
1041 gfs2_glock_dq(&gh); 1040 gfs2_glock_dq(&gh);
1042 gfs2_holder_uninit(&gh); 1041 gfs2_holder_uninit(&gh);
1043 return rv; 1042 return rv;
1044 } 1043 }
1045 1044
1046 /** 1045 /**
1047 * gfs2_releasepage - free the metadata associated with a page 1046 * gfs2_releasepage - free the metadata associated with a page
1048 * @page: the page that's being released 1047 * @page: the page that's being released
1049 * @gfp_mask: passed from Linux VFS, ignored by us 1048 * @gfp_mask: passed from Linux VFS, ignored by us
1050 * 1049 *
1051 * Call try_to_free_buffers() if the buffers in this page can be 1050 * Call try_to_free_buffers() if the buffers in this page can be
1052 * released. 1051 * released.
1053 * 1052 *
1054 * Returns: 0 1053 * Returns: 0
1055 */ 1054 */
1056 1055
1057 int gfs2_releasepage(struct page *page, gfp_t gfp_mask) 1056 int gfs2_releasepage(struct page *page, gfp_t gfp_mask)
1058 { 1057 {
1059 struct address_space *mapping = page->mapping; 1058 struct address_space *mapping = page->mapping;
1060 struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping); 1059 struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
1061 struct buffer_head *bh, *head; 1060 struct buffer_head *bh, *head;
1062 struct gfs2_bufdata *bd; 1061 struct gfs2_bufdata *bd;
1063 1062
1064 if (!page_has_buffers(page)) 1063 if (!page_has_buffers(page))
1065 return 0; 1064 return 0;
1066 1065
1067 gfs2_log_lock(sdp); 1066 gfs2_log_lock(sdp);
1068 spin_lock(&sdp->sd_ail_lock); 1067 spin_lock(&sdp->sd_ail_lock);
1069 head = bh = page_buffers(page); 1068 head = bh = page_buffers(page);
1070 do { 1069 do {
1071 if (atomic_read(&bh->b_count)) 1070 if (atomic_read(&bh->b_count))
1072 goto cannot_release; 1071 goto cannot_release;
1073 bd = bh->b_private; 1072 bd = bh->b_private;
1074 if (bd && bd->bd_tr) 1073 if (bd && bd->bd_tr)
1075 goto cannot_release; 1074 goto cannot_release;
1076 if (buffer_pinned(bh) || buffer_dirty(bh)) 1075 if (buffer_pinned(bh) || buffer_dirty(bh))
1077 goto not_possible; 1076 goto not_possible;
1078 bh = bh->b_this_page; 1077 bh = bh->b_this_page;
1079 } while(bh != head); 1078 } while(bh != head);
1080 spin_unlock(&sdp->sd_ail_lock); 1079 spin_unlock(&sdp->sd_ail_lock);
1081 gfs2_log_unlock(sdp); 1080 gfs2_log_unlock(sdp);
1082 1081
1083 head = bh = page_buffers(page); 1082 head = bh = page_buffers(page);
1084 do { 1083 do {
1085 gfs2_log_lock(sdp); 1084 gfs2_log_lock(sdp);
1086 bd = bh->b_private; 1085 bd = bh->b_private;
1087 if (bd) { 1086 if (bd) {
1088 gfs2_assert_warn(sdp, bd->bd_bh == bh); 1087 gfs2_assert_warn(sdp, bd->bd_bh == bh);
1089 if (!list_empty(&bd->bd_list)) { 1088 if (!list_empty(&bd->bd_list)) {
1090 if (!buffer_pinned(bh)) 1089 if (!buffer_pinned(bh))
1091 list_del_init(&bd->bd_list); 1090 list_del_init(&bd->bd_list);
1092 else 1091 else
1093 bd = NULL; 1092 bd = NULL;
1094 } 1093 }
1095 if (bd) 1094 if (bd)
1096 bd->bd_bh = NULL; 1095 bd->bd_bh = NULL;
1097 bh->b_private = NULL; 1096 bh->b_private = NULL;
1098 } 1097 }
1099 gfs2_log_unlock(sdp); 1098 gfs2_log_unlock(sdp);
1100 if (bd) 1099 if (bd)
1101 kmem_cache_free(gfs2_bufdata_cachep, bd); 1100 kmem_cache_free(gfs2_bufdata_cachep, bd);
1102 1101
1103 bh = bh->b_this_page; 1102 bh = bh->b_this_page;
1104 } while (bh != head); 1103 } while (bh != head);
1105 1104
1106 return try_to_free_buffers(page); 1105 return try_to_free_buffers(page);
1107 1106
1108 not_possible: /* Should never happen */ 1107 not_possible: /* Should never happen */
1109 WARN_ON(buffer_dirty(bh)); 1108 WARN_ON(buffer_dirty(bh));
1110 WARN_ON(buffer_pinned(bh)); 1109 WARN_ON(buffer_pinned(bh));
1111 cannot_release: 1110 cannot_release:
1112 spin_unlock(&sdp->sd_ail_lock); 1111 spin_unlock(&sdp->sd_ail_lock);
1113 gfs2_log_unlock(sdp); 1112 gfs2_log_unlock(sdp);
1114 return 0; 1113 return 0;
1115 } 1114 }
1116 1115
1117 static const struct address_space_operations gfs2_writeback_aops = { 1116 static const struct address_space_operations gfs2_writeback_aops = {
1118 .writepage = gfs2_writepage, 1117 .writepage = gfs2_writepage,
1119 .writepages = gfs2_writepages, 1118 .writepages = gfs2_writepages,
1120 .readpage = gfs2_readpage, 1119 .readpage = gfs2_readpage,
1121 .readpages = gfs2_readpages, 1120 .readpages = gfs2_readpages,
1122 .write_begin = gfs2_write_begin, 1121 .write_begin = gfs2_write_begin,
1123 .write_end = gfs2_write_end, 1122 .write_end = gfs2_write_end,
1124 .bmap = gfs2_bmap, 1123 .bmap = gfs2_bmap,
1125 .invalidatepage = gfs2_invalidatepage, 1124 .invalidatepage = gfs2_invalidatepage,
1126 .releasepage = gfs2_releasepage, 1125 .releasepage = gfs2_releasepage,
1127 .direct_IO = gfs2_direct_IO, 1126 .direct_IO = gfs2_direct_IO,
1128 .migratepage = buffer_migrate_page, 1127 .migratepage = buffer_migrate_page,
1129 .is_partially_uptodate = block_is_partially_uptodate, 1128 .is_partially_uptodate = block_is_partially_uptodate,
1130 .error_remove_page = generic_error_remove_page, 1129 .error_remove_page = generic_error_remove_page,
1131 }; 1130 };
1132 1131
1133 static const struct address_space_operations gfs2_ordered_aops = { 1132 static const struct address_space_operations gfs2_ordered_aops = {
1134 .writepage = gfs2_writepage, 1133 .writepage = gfs2_writepage,
1135 .writepages = gfs2_writepages, 1134 .writepages = gfs2_writepages,
1136 .readpage = gfs2_readpage, 1135 .readpage = gfs2_readpage,
1137 .readpages = gfs2_readpages, 1136 .readpages = gfs2_readpages,
1138 .write_begin = gfs2_write_begin, 1137 .write_begin = gfs2_write_begin,
1139 .write_end = gfs2_write_end, 1138 .write_end = gfs2_write_end,
1140 .set_page_dirty = gfs2_set_page_dirty, 1139 .set_page_dirty = gfs2_set_page_dirty,
1141 .bmap = gfs2_bmap, 1140 .bmap = gfs2_bmap,
1142 .invalidatepage = gfs2_invalidatepage, 1141 .invalidatepage = gfs2_invalidatepage,
1143 .releasepage = gfs2_releasepage, 1142 .releasepage = gfs2_releasepage,
1144 .direct_IO = gfs2_direct_IO, 1143 .direct_IO = gfs2_direct_IO,
1145 .migratepage = buffer_migrate_page, 1144 .migratepage = buffer_migrate_page,
1146 .is_partially_uptodate = block_is_partially_uptodate, 1145 .is_partially_uptodate = block_is_partially_uptodate,
1147 .error_remove_page = generic_error_remove_page, 1146 .error_remove_page = generic_error_remove_page,
1148 }; 1147 };
1149 1148
1150 static const struct address_space_operations gfs2_jdata_aops = { 1149 static const struct address_space_operations gfs2_jdata_aops = {
1151 .writepage = gfs2_jdata_writepage, 1150 .writepage = gfs2_jdata_writepage,
1152 .writepages = gfs2_jdata_writepages, 1151 .writepages = gfs2_jdata_writepages,
1153 .readpage = gfs2_readpage, 1152 .readpage = gfs2_readpage,
1154 .readpages = gfs2_readpages, 1153 .readpages = gfs2_readpages,
1155 .write_begin = gfs2_write_begin, 1154 .write_begin = gfs2_write_begin,
1156 .write_end = gfs2_write_end, 1155 .write_end = gfs2_write_end,
1157 .set_page_dirty = gfs2_set_page_dirty, 1156 .set_page_dirty = gfs2_set_page_dirty,
1158 .bmap = gfs2_bmap, 1157 .bmap = gfs2_bmap,
1159 .invalidatepage = gfs2_invalidatepage, 1158 .invalidatepage = gfs2_invalidatepage,
1160 .releasepage = gfs2_releasepage, 1159 .releasepage = gfs2_releasepage,
1161 .is_partially_uptodate = block_is_partially_uptodate, 1160 .is_partially_uptodate = block_is_partially_uptodate,
1162 .error_remove_page = generic_error_remove_page, 1161 .error_remove_page = generic_error_remove_page,
1163 }; 1162 };
1164 1163
1165 void gfs2_set_aops(struct inode *inode) 1164 void gfs2_set_aops(struct inode *inode)
1166 { 1165 {
1167 struct gfs2_inode *ip = GFS2_I(inode); 1166 struct gfs2_inode *ip = GFS2_I(inode);
1168 1167
1169 if (gfs2_is_writeback(ip)) 1168 if (gfs2_is_writeback(ip))
1170 inode->i_mapping->a_ops = &gfs2_writeback_aops; 1169 inode->i_mapping->a_ops = &gfs2_writeback_aops;
1171 else if (gfs2_is_ordered(ip)) 1170 else if (gfs2_is_ordered(ip))
1172 inode->i_mapping->a_ops = &gfs2_ordered_aops; 1171 inode->i_mapping->a_ops = &gfs2_ordered_aops;
1173 else if (gfs2_is_jdata(ip)) 1172 else if (gfs2_is_jdata(ip))
1174 inode->i_mapping->a_ops = &gfs2_jdata_aops; 1173 inode->i_mapping->a_ops = &gfs2_jdata_aops;
1175 else 1174 else
1176 BUG(); 1175 BUG();
1177 } 1176 }
1178 1177
1179 1178
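fs/gfs2/meta_io.c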
1 /* 1 /*
2 * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved. 2 * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved.
3 * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved. 3 * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved.
4 * 4 *
5 * This copyrighted material is made available to anyone wishing to use, 5 * This copyrighted material is made available to anyone wishing to use,
6 * modify, copy, or redistribute it subject to the terms and conditions 6 * modify, copy, or redistribute it subject to the terms and conditions
7 * of the GNU General Public License version 2. 7 * of the GNU General Public License version 2.
8 */ 8 */
9 9
10 #include <linux/sched.h> 10 #include <linux/sched.h>
11 #include <linux/slab.h> 11 #include <linux/slab.h>
12 #include <linux/spinlock.h> 12 #include <linux/spinlock.h>
13 #include <linux/completion.h> 13 #include <linux/completion.h>
14 #include <linux/buffer_head.h> 14 #include <linux/buffer_head.h>
15 #include <linux/mm.h> 15 #include <linux/mm.h>
16 #include <linux/pagemap.h> 16 #include <linux/pagemap.h>
17 #include <linux/writeback.h> 17 #include <linux/writeback.h>
18 #include <linux/swap.h> 18 #include <linux/swap.h>
19 #include <linux/delay.h> 19 #include <linux/delay.h>
20 #include <linux/bio.h> 20 #include <linux/bio.h>
21 #include <linux/gfs2_ondisk.h> 21 #include <linux/gfs2_ondisk.h>
22 22
23 #include "gfs2.h" 23 #include "gfs2.h"
24 #include "incore.h" 24 #include "incore.h"
25 #include "glock.h" 25 #include "glock.h"
26 #include "glops.h" 26 #include "glops.h"
27 #include "inode.h" 27 #include "inode.h"
28 #include "log.h" 28 #include "log.h"
29 #include "lops.h" 29 #include "lops.h"
30 #include "meta_io.h" 30 #include "meta_io.h"
31 #include "rgrp.h" 31 #include "rgrp.h"
32 #include "trans.h" 32 #include "trans.h"
33 #include "util.h" 33 #include "util.h"
34 #include "trace_gfs2.h" 34 #include "trace_gfs2.h"
35 35
36 static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wbc) 36 static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wbc)
37 { 37 {
38 struct buffer_head *bh, *head; 38 struct buffer_head *bh, *head;
39 int nr_underway = 0; 39 int nr_underway = 0;
40 int write_op = REQ_META | REQ_PRIO | 40 int write_op = REQ_META | REQ_PRIO |
41 (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); 41 (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
42 42
43 BUG_ON(!PageLocked(page)); 43 BUG_ON(!PageLocked(page));
44 BUG_ON(!page_has_buffers(page)); 44 BUG_ON(!page_has_buffers(page));
45 45
46 head = page_buffers(page); 46 head = page_buffers(page);
47 bh = head; 47 bh = head;
48 48
49 do { 49 do {
50 if (!buffer_mapped(bh)) 50 if (!buffer_mapped(bh))
51 continue; 51 continue;
52 /* 52 /*
53 * If it's a fully non-blocking write attempt and we cannot 53 * If it's a fully non-blocking write attempt and we cannot
54 * lock the buffer then redirty the page. Note that this can 54 * lock the buffer then redirty the page. Note that this can
55 * potentially cause a busy-wait loop from flusher thread and kswapd 55 * potentially cause a busy-wait loop from flusher thread and kswapd
56 * activity, but those code paths have their own higher-level 56 * activity, but those code paths have their own higher-level
57 * throttling. 57 * throttling.
58 */ 58 */
59 if (wbc->sync_mode != WB_SYNC_NONE) { 59 if (wbc->sync_mode != WB_SYNC_NONE) {
60 lock_buffer(bh); 60 lock_buffer(bh);
61 } else if (!trylock_buffer(bh)) { 61 } else if (!trylock_buffer(bh)) {
62 redirty_page_for_writepage(wbc, page); 62 redirty_page_for_writepage(wbc, page);
63 continue; 63 continue;
64 } 64 }
65 if (test_clear_buffer_dirty(bh)) { 65 if (test_clear_buffer_dirty(bh)) {
66 mark_buffer_async_write(bh); 66 mark_buffer_async_write(bh);
67 } else { 67 } else {
68 unlock_buffer(bh); 68 unlock_buffer(bh);
69 } 69 }
70 } while ((bh = bh->b_this_page) != head); 70 } while ((bh = bh->b_this_page) != head);
71 71
72 /* 72 /*
73 * The page and its buffers are protected by PageWriteback(), so we can 73 * The page and its buffers are protected by PageWriteback(), so we can
74 * drop the bh refcounts early. 74 * drop the bh refcounts early.
75 */ 75 */
76 BUG_ON(PageWriteback(page)); 76 BUG_ON(PageWriteback(page));
77 set_page_writeback(page); 77 set_page_writeback(page);
78 78
79 do { 79 do {
80 struct buffer_head *next = bh->b_this_page; 80 struct buffer_head *next = bh->b_this_page;
81 if (buffer_async_write(bh)) { 81 if (buffer_async_write(bh)) {
82 submit_bh(write_op, bh); 82 submit_bh(write_op, bh);
83 nr_underway++; 83 nr_underway++;
84 } 84 }
85 bh = next; 85 bh = next;
86 } while (bh != head); 86 } while (bh != head);
87 unlock_page(page); 87 unlock_page(page);
88 88
89 if (nr_underway == 0) 89 if (nr_underway == 0)
90 end_page_writeback(page); 90 end_page_writeback(page);
91 91
92 return 0; 92 return 0;
93 } 93 }
94 94
95 const struct address_space_operations gfs2_meta_aops = { 95 const struct address_space_operations gfs2_meta_aops = {
96 .writepage = gfs2_aspace_writepage, 96 .writepage = gfs2_aspace_writepage,
97 .releasepage = gfs2_releasepage, 97 .releasepage = gfs2_releasepage,
98 }; 98 };
99 99
100 /** 100 /**
101 * gfs2_getbuf - Get a buffer with a given address space 101 * gfs2_getbuf - Get a buffer with a given address space
102 * @gl: the glock 102 * @gl: the glock
103 * @blkno: the block number (filesystem scope) 103 * @blkno: the block number (filesystem scope)
104 * @create: 1 if the buffer should be created 104 * @create: 1 if the buffer should be created
105 * 105 *
106 * Returns: the buffer 106 * Returns: the buffer
107 */ 107 */
108 108
109 struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create) 109 struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create)
110 { 110 {
111 struct address_space *mapping = gfs2_glock2aspace(gl); 111 struct address_space *mapping = gfs2_glock2aspace(gl);
112 struct gfs2_sbd *sdp = gl->gl_sbd; 112 struct gfs2_sbd *sdp = gl->gl_sbd;
113 struct page *page; 113 struct page *page;
114 struct buffer_head *bh; 114 struct buffer_head *bh;
115 unsigned int shift; 115 unsigned int shift;
116 unsigned long index; 116 unsigned long index;
117 unsigned int bufnum; 117 unsigned int bufnum;
118 118
119 shift = PAGE_CACHE_SHIFT - sdp->sd_sb.sb_bsize_shift; 119 shift = PAGE_CACHE_SHIFT - sdp->sd_sb.sb_bsize_shift;
120 index = blkno >> shift; /* convert block to page */ 120 index = blkno >> shift; /* convert block to page */
121 bufnum = blkno - (index << shift); /* block buf index within page */ 121 bufnum = blkno - (index << shift); /* block buf index within page */
122 122
123 if (create) { 123 if (create) {
124 for (;;) { 124 for (;;) {
125 page = grab_cache_page(mapping, index); 125 page = grab_cache_page(mapping, index);
126 if (page) 126 if (page)
127 break; 127 break;
128 yield(); 128 yield();
129 } 129 }
130 } else { 130 } else {
131 page = find_lock_page(mapping, index); 131 page = find_get_page_flags(mapping, index,
132 FGP_LOCK|FGP_ACCESSED);
132 if (!page) 133 if (!page)
133 return NULL; 134 return NULL;
134 } 135 }
135 136
136 if (!page_has_buffers(page)) 137 if (!page_has_buffers(page))
137 create_empty_buffers(page, sdp->sd_sb.sb_bsize, 0); 138 create_empty_buffers(page, sdp->sd_sb.sb_bsize, 0);
138 139
139 /* Locate header for our buffer within our page */ 140 /* Locate header for our buffer within our page */
140 for (bh = page_buffers(page); bufnum--; bh = bh->b_this_page) 141 for (bh = page_buffers(page); bufnum--; bh = bh->b_this_page)
141 /* Do nothing */; 142 /* Do nothing */;
142 get_bh(bh); 143 get_bh(bh);
143 144
144 if (!buffer_mapped(bh)) 145 if (!buffer_mapped(bh))
145 map_bh(bh, sdp->sd_vfs, blkno); 146 map_bh(bh, sdp->sd_vfs, blkno);
146 147
147 unlock_page(page); 148 unlock_page(page);
148 mark_page_accessed(page);
149 page_cache_release(page); 149 page_cache_release(page);
150 150
151 return bh; 151 return bh;
152 } 152 }
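The hunk above is the functional change to gfs2_getbuf(): the non-create lookup now uses find_get_page_flags() with FGP_LOCK|FGP_ACCESSED, so the page comes back locked and already counted as accessed, and the separate mark_page_accessed() call after unlock_page() goes away. A minimal sketch of that lookup pattern, assuming only the find_get_page_flags()/FGP_* interface shown in the hunk (the helper name is hypothetical):

	static struct page *lookup_locked_accessed(struct address_space *mapping,
						   pgoff_t index)
	{
		struct page *page;

		/* FGP_LOCK returns the page locked; FGP_ACCESSED folds the
		 * former mark_page_accessed() call into the lookup itself. */
		page = find_get_page_flags(mapping, index,
					   FGP_LOCK | FGP_ACCESSED);
		if (!page)
			return NULL;

		/* The caller must unlock_page() and page_cache_release()
		 * when done, exactly as gfs2_getbuf() does above. */
		return page;
	}

The create path is unchanged and still loops on grab_cache_page() until a page is obtained.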
153 153
154 static void meta_prep_new(struct buffer_head *bh) 154 static void meta_prep_new(struct buffer_head *bh)
155 { 155 {
156 struct gfs2_meta_header *mh = (struct gfs2_meta_header *)bh->b_data; 156 struct gfs2_meta_header *mh = (struct gfs2_meta_header *)bh->b_data;
157 157
158 lock_buffer(bh); 158 lock_buffer(bh);
159 clear_buffer_dirty(bh); 159 clear_buffer_dirty(bh);
160 set_buffer_uptodate(bh); 160 set_buffer_uptodate(bh);
161 unlock_buffer(bh); 161 unlock_buffer(bh);
162 162
163 mh->mh_magic = cpu_to_be32(GFS2_MAGIC); 163 mh->mh_magic = cpu_to_be32(GFS2_MAGIC);
164 } 164 }
165 165
166 /** 166 /**
167 * gfs2_meta_new - Get a block 167 * gfs2_meta_new - Get a block
168 * @gl: The glock associated with this block 168 * @gl: The glock associated with this block
169 * @blkno: The block number 169 * @blkno: The block number
170 * 170 *
171 * Returns: The buffer 171 * Returns: The buffer
172 */ 172 */
173 173
174 struct buffer_head *gfs2_meta_new(struct gfs2_glock *gl, u64 blkno) 174 struct buffer_head *gfs2_meta_new(struct gfs2_glock *gl, u64 blkno)
175 { 175 {
176 struct buffer_head *bh; 176 struct buffer_head *bh;
177 bh = gfs2_getbuf(gl, blkno, CREATE); 177 bh = gfs2_getbuf(gl, blkno, CREATE);
178 meta_prep_new(bh); 178 meta_prep_new(bh);
179 return bh; 179 return bh;
180 } 180 }
181 181
182 /** 182 /**
183 * gfs2_meta_read - Read a block from disk 183 * gfs2_meta_read - Read a block from disk
184 * @gl: The glock covering the block 184 * @gl: The glock covering the block
185 * @blkno: The block number 185 * @blkno: The block number
186 * @flags: flags 186 * @flags: flags
187 * @bhp: the place where the buffer is returned (NULL on failure) 187 * @bhp: the place where the buffer is returned (NULL on failure)
188 * 188 *
189 * Returns: errno 189 * Returns: errno
190 */ 190 */
191 191
192 int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags, 192 int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags,
193 struct buffer_head **bhp) 193 struct buffer_head **bhp)
194 { 194 {
195 struct gfs2_sbd *sdp = gl->gl_sbd; 195 struct gfs2_sbd *sdp = gl->gl_sbd;
196 struct buffer_head *bh; 196 struct buffer_head *bh;
197 197
198 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) { 198 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) {
199 *bhp = NULL; 199 *bhp = NULL;
200 return -EIO; 200 return -EIO;
201 } 201 }
202 202
203 *bhp = bh = gfs2_getbuf(gl, blkno, CREATE); 203 *bhp = bh = gfs2_getbuf(gl, blkno, CREATE);
204 204
205 lock_buffer(bh); 205 lock_buffer(bh);
206 if (buffer_uptodate(bh)) { 206 if (buffer_uptodate(bh)) {
207 unlock_buffer(bh); 207 unlock_buffer(bh);
208 return 0; 208 return 0;
209 } 209 }
210 bh->b_end_io = end_buffer_read_sync; 210 bh->b_end_io = end_buffer_read_sync;
211 get_bh(bh); 211 get_bh(bh);
212 submit_bh(READ_SYNC | REQ_META | REQ_PRIO, bh); 212 submit_bh(READ_SYNC | REQ_META | REQ_PRIO, bh);
213 if (!(flags & DIO_WAIT)) 213 if (!(flags & DIO_WAIT))
214 return 0; 214 return 0;
215 215
216 wait_on_buffer(bh); 216 wait_on_buffer(bh);
217 if (unlikely(!buffer_uptodate(bh))) { 217 if (unlikely(!buffer_uptodate(bh))) {
218 struct gfs2_trans *tr = current->journal_info; 218 struct gfs2_trans *tr = current->journal_info;
219 if (tr && tr->tr_touched) 219 if (tr && tr->tr_touched)
220 gfs2_io_error_bh(sdp, bh); 220 gfs2_io_error_bh(sdp, bh);
221 brelse(bh); 221 brelse(bh);
222 *bhp = NULL; 222 *bhp = NULL;
223 return -EIO; 223 return -EIO;
224 } 224 }
225 225
226 return 0; 226 return 0;
227 } 227 }
228 228
229 /** 229 /**
230 * gfs2_meta_wait - Wait for a block read from disk to complete 230 * gfs2_meta_wait - Wait for a block read from disk to complete
231 * @sdp: the filesystem 231 * @sdp: the filesystem
232 * @bh: The block to wait for 232 * @bh: The block to wait for
233 * 233 *
234 * Returns: errno 234 * Returns: errno
235 */ 235 */
236 236
237 int gfs2_meta_wait(struct gfs2_sbd *sdp, struct buffer_head *bh) 237 int gfs2_meta_wait(struct gfs2_sbd *sdp, struct buffer_head *bh)
238 { 238 {
239 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) 239 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
240 return -EIO; 240 return -EIO;
241 241
242 wait_on_buffer(bh); 242 wait_on_buffer(bh);
243 243
244 if (!buffer_uptodate(bh)) { 244 if (!buffer_uptodate(bh)) {
245 struct gfs2_trans *tr = current->journal_info; 245 struct gfs2_trans *tr = current->journal_info;
246 if (tr && tr->tr_touched) 246 if (tr && tr->tr_touched)
247 gfs2_io_error_bh(sdp, bh); 247 gfs2_io_error_bh(sdp, bh);
248 return -EIO; 248 return -EIO;
249 } 249 }
250 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) 250 if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
251 return -EIO; 251 return -EIO;
252 252
253 return 0; 253 return 0;
254 } 254 }
255 255
256 void gfs2_remove_from_journal(struct buffer_head *bh, struct gfs2_trans *tr, int meta) 256 void gfs2_remove_from_journal(struct buffer_head *bh, struct gfs2_trans *tr, int meta)
257 { 257 {
258 struct address_space *mapping = bh->b_page->mapping; 258 struct address_space *mapping = bh->b_page->mapping;
259 struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping); 259 struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
260 struct gfs2_bufdata *bd = bh->b_private; 260 struct gfs2_bufdata *bd = bh->b_private;
261 int was_pinned = 0; 261 int was_pinned = 0;
262 262
263 if (test_clear_buffer_pinned(bh)) { 263 if (test_clear_buffer_pinned(bh)) {
264 trace_gfs2_pin(bd, 0); 264 trace_gfs2_pin(bd, 0);
265 atomic_dec(&sdp->sd_log_pinned); 265 atomic_dec(&sdp->sd_log_pinned);
266 list_del_init(&bd->bd_list); 266 list_del_init(&bd->bd_list);
267 if (meta) { 267 if (meta) {
268 gfs2_assert_warn(sdp, sdp->sd_log_num_buf); 268 gfs2_assert_warn(sdp, sdp->sd_log_num_buf);
269 sdp->sd_log_num_buf--; 269 sdp->sd_log_num_buf--;
270 tr->tr_num_buf_rm++; 270 tr->tr_num_buf_rm++;
271 } else { 271 } else {
272 gfs2_assert_warn(sdp, sdp->sd_log_num_databuf); 272 gfs2_assert_warn(sdp, sdp->sd_log_num_databuf);
273 sdp->sd_log_num_databuf--; 273 sdp->sd_log_num_databuf--;
274 tr->tr_num_databuf_rm++; 274 tr->tr_num_databuf_rm++;
275 } 275 }
276 tr->tr_touched = 1; 276 tr->tr_touched = 1;
277 was_pinned = 1; 277 was_pinned = 1;
278 brelse(bh); 278 brelse(bh);
279 } 279 }
280 if (bd) { 280 if (bd) {
281 spin_lock(&sdp->sd_ail_lock); 281 spin_lock(&sdp->sd_ail_lock);
282 if (bd->bd_tr) { 282 if (bd->bd_tr) {
283 gfs2_trans_add_revoke(sdp, bd); 283 gfs2_trans_add_revoke(sdp, bd);
284 } else if (was_pinned) { 284 } else if (was_pinned) {
285 bh->b_private = NULL; 285 bh->b_private = NULL;
286 kmem_cache_free(gfs2_bufdata_cachep, bd); 286 kmem_cache_free(gfs2_bufdata_cachep, bd);
287 } 287 }
288 spin_unlock(&sdp->sd_ail_lock); 288 spin_unlock(&sdp->sd_ail_lock);
289 } 289 }
290 clear_buffer_dirty(bh); 290 clear_buffer_dirty(bh);
291 clear_buffer_uptodate(bh); 291 clear_buffer_uptodate(bh);
292 } 292 }
293 293
294 /** 294 /**
295 * gfs2_meta_wipe - make sure an inode's buffers are no longer dirty/pinned 295 * gfs2_meta_wipe - make sure an inode's buffers are no longer dirty/pinned
296 * @ip: the inode who owns the buffers 296 * @ip: the inode who owns the buffers
297 * @bstart: the first buffer in the run 297 * @bstart: the first buffer in the run
298 * @blen: the number of buffers in the run 298 * @blen: the number of buffers in the run
299 * 299 *
300 */ 300 */
301 301
302 void gfs2_meta_wipe(struct gfs2_inode *ip, u64 bstart, u32 blen) 302 void gfs2_meta_wipe(struct gfs2_inode *ip, u64 bstart, u32 blen)
303 { 303 {
304 struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode); 304 struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
305 struct buffer_head *bh; 305 struct buffer_head *bh;
306 306
307 while (blen) { 307 while (blen) {
308 bh = gfs2_getbuf(ip->i_gl, bstart, NO_CREATE); 308 bh = gfs2_getbuf(ip->i_gl, bstart, NO_CREATE);
309 if (bh) { 309 if (bh) {
310 lock_buffer(bh); 310 lock_buffer(bh);
311 gfs2_log_lock(sdp); 311 gfs2_log_lock(sdp);
312 gfs2_remove_from_journal(bh, current->journal_info, 1); 312 gfs2_remove_from_journal(bh, current->journal_info, 1);
313 gfs2_log_unlock(sdp); 313 gfs2_log_unlock(sdp);
314 unlock_buffer(bh); 314 unlock_buffer(bh);
315 brelse(bh); 315 brelse(bh);
316 } 316 }
317 317
318 bstart++; 318 bstart++;
319 blen--; 319 blen--;
320 } 320 }
321 } 321 }
322 322
323 /** 323 /**
324 * gfs2_meta_indirect_buffer - Get a metadata buffer 324 * gfs2_meta_indirect_buffer - Get a metadata buffer
325 * @ip: The GFS2 inode 325 * @ip: The GFS2 inode
326 * @height: The level of this buf in the metadata (indir addr) tree (if any) 326 * @height: The level of this buf in the metadata (indir addr) tree (if any)
327 * @num: The block number (device relative) of the buffer 327 * @num: The block number (device relative) of the buffer
328 * @bhp: the buffer is returned here 328 * @bhp: the buffer is returned here
329 * 329 *
330 * Returns: errno 330 * Returns: errno
331 */ 331 */
332 332
333 int gfs2_meta_indirect_buffer(struct gfs2_inode *ip, int height, u64 num, 333 int gfs2_meta_indirect_buffer(struct gfs2_inode *ip, int height, u64 num,
334 struct buffer_head **bhp) 334 struct buffer_head **bhp)
335 { 335 {
336 struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode); 336 struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
337 struct gfs2_glock *gl = ip->i_gl; 337 struct gfs2_glock *gl = ip->i_gl;
338 struct buffer_head *bh; 338 struct buffer_head *bh;
339 int ret = 0; 339 int ret = 0;
340 u32 mtype = height ? GFS2_METATYPE_IN : GFS2_METATYPE_DI; 340 u32 mtype = height ? GFS2_METATYPE_IN : GFS2_METATYPE_DI;
341 341
342 ret = gfs2_meta_read(gl, num, DIO_WAIT, &bh); 342 ret = gfs2_meta_read(gl, num, DIO_WAIT, &bh);
343 if (ret == 0 && gfs2_metatype_check(sdp, bh, mtype)) { 343 if (ret == 0 && gfs2_metatype_check(sdp, bh, mtype)) {
344 brelse(bh); 344 brelse(bh);
345 ret = -EIO; 345 ret = -EIO;
346 } 346 }
347 *bhp = bh; 347 *bhp = bh;
348 return ret; 348 return ret;
349 } 349 }
350 350
351 /** 351 /**
352 * gfs2_meta_ra - start readahead on an extent of a file 352 * gfs2_meta_ra - start readahead on an extent of a file
353 * @gl: the glock the blocks belong to 353 * @gl: the glock the blocks belong to
354 * @dblock: the starting disk block 354 * @dblock: the starting disk block
355 * @extlen: the number of blocks in the extent 355 * @extlen: the number of blocks in the extent
356 * 356 *
357 * returns: the first buffer in the extent 357 * returns: the first buffer in the extent
358 */ 358 */
359 359
360 struct buffer_head *gfs2_meta_ra(struct gfs2_glock *gl, u64 dblock, u32 extlen) 360 struct buffer_head *gfs2_meta_ra(struct gfs2_glock *gl, u64 dblock, u32 extlen)
361 { 361 {
362 struct gfs2_sbd *sdp = gl->gl_sbd; 362 struct gfs2_sbd *sdp = gl->gl_sbd;
363 struct buffer_head *first_bh, *bh; 363 struct buffer_head *first_bh, *bh;
364 u32 max_ra = gfs2_tune_get(sdp, gt_max_readahead) >> 364 u32 max_ra = gfs2_tune_get(sdp, gt_max_readahead) >>
365 sdp->sd_sb.sb_bsize_shift; 365 sdp->sd_sb.sb_bsize_shift;
366 366
367 BUG_ON(!extlen); 367 BUG_ON(!extlen);
368 368
369 if (max_ra < 1) 369 if (max_ra < 1)
370 max_ra = 1; 370 max_ra = 1;
371 if (extlen > max_ra) 371 if (extlen > max_ra)
372 extlen = max_ra; 372 extlen = max_ra;
373 373
374 first_bh = gfs2_getbuf(gl, dblock, CREATE); 374 first_bh = gfs2_getbuf(gl, dblock, CREATE);
375 375
376 if (buffer_uptodate(first_bh)) 376 if (buffer_uptodate(first_bh))
377 goto out; 377 goto out;
378 if (!buffer_locked(first_bh)) 378 if (!buffer_locked(first_bh))
379 ll_rw_block(READ_SYNC | REQ_META, 1, &first_bh); 379 ll_rw_block(READ_SYNC | REQ_META, 1, &first_bh);
380 380
381 dblock++; 381 dblock++;
382 extlen--; 382 extlen--;
383 383
384 while (extlen) { 384 while (extlen) {
385 bh = gfs2_getbuf(gl, dblock, CREATE); 385 bh = gfs2_getbuf(gl, dblock, CREATE);
386 386
387 if (!buffer_uptodate(bh) && !buffer_locked(bh)) 387 if (!buffer_uptodate(bh) && !buffer_locked(bh))
388 ll_rw_block(READA | REQ_META, 1, &bh); 388 ll_rw_block(READA | REQ_META, 1, &bh);
389 brelse(bh); 389 brelse(bh);
390 dblock++; 390 dblock++;
391 extlen--; 391 extlen--;
392 if (!buffer_locked(first_bh) && buffer_uptodate(first_bh)) 392 if (!buffer_locked(first_bh) && buffer_uptodate(first_bh))
393 goto out; 393 goto out;
394 } 394 }
395 395
396 wait_on_buffer(first_bh); 396 wait_on_buffer(first_bh);
397 out: 397 out:
398 return first_bh; 398 return first_bh;
399 } 399 }
400 400
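fs/ntfs/attrib.c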
1 /** 1 /**
2 * attrib.c - NTFS attribute operations. Part of the Linux-NTFS project. 2 * attrib.c - NTFS attribute operations. Part of the Linux-NTFS project.
3 * 3 *
4 * Copyright (c) 2001-2012 Anton Altaparmakov and Tuxera Inc. 4 * Copyright (c) 2001-2012 Anton Altaparmakov and Tuxera Inc.
5 * Copyright (c) 2002 Richard Russon 5 * Copyright (c) 2002 Richard Russon
6 * 6 *
7 * This program/include file is free software; you can redistribute it and/or 7 * This program/include file is free software; you can redistribute it and/or
8 * modify it under the terms of the GNU General Public License as published 8 * modify it under the terms of the GNU General Public License as published
9 * by the Free Software Foundation; either version 2 of the License, or 9 * by the Free Software Foundation; either version 2 of the License, or
10 * (at your option) any later version. 10 * (at your option) any later version.
11 * 11 *
12 * This program/include file is distributed in the hope that it will be 12 * This program/include file is distributed in the hope that it will be
13 * useful, but WITHOUT ANY WARRANTY; without even the implied warranty 13 * useful, but WITHOUT ANY WARRANTY; without even the implied warranty
14 * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 14 * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 * GNU General Public License for more details. 15 * GNU General Public License for more details.
16 * 16 *
17 * You should have received a copy of the GNU General Public License 17 * You should have received a copy of the GNU General Public License
18 * along with this program (in the main directory of the Linux-NTFS 18 * along with this program (in the main directory of the Linux-NTFS
19 * distribution in the file COPYING); if not, write to the Free Software 19 * distribution in the file COPYING); if not, write to the Free Software
20 * Foundation,Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 20 * Foundation,Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
21 */ 21 */
22 22
23 #include <linux/buffer_head.h> 23 #include <linux/buffer_head.h>
24 #include <linux/sched.h> 24 #include <linux/sched.h>
25 #include <linux/slab.h> 25 #include <linux/slab.h>
26 #include <linux/swap.h> 26 #include <linux/swap.h>
27 #include <linux/writeback.h> 27 #include <linux/writeback.h>
28 28
29 #include "attrib.h" 29 #include "attrib.h"
30 #include "debug.h" 30 #include "debug.h"
31 #include "layout.h" 31 #include "layout.h"
32 #include "lcnalloc.h" 32 #include "lcnalloc.h"
33 #include "malloc.h" 33 #include "malloc.h"
34 #include "mft.h" 34 #include "mft.h"
35 #include "ntfs.h" 35 #include "ntfs.h"
36 #include "types.h" 36 #include "types.h"
37 37
38 /** 38 /**
39 * ntfs_map_runlist_nolock - map (a part of) a runlist of an ntfs inode 39 * ntfs_map_runlist_nolock - map (a part of) a runlist of an ntfs inode
40 * @ni: ntfs inode for which to map (part of) a runlist 40 * @ni: ntfs inode for which to map (part of) a runlist
41 * @vcn: map runlist part containing this vcn 41 * @vcn: map runlist part containing this vcn
42 * @ctx: active attribute search context if present or NULL if not 42 * @ctx: active attribute search context if present or NULL if not
43 * 43 *
44 * Map the part of a runlist containing the @vcn of the ntfs inode @ni. 44 * Map the part of a runlist containing the @vcn of the ntfs inode @ni.
45 * 45 *
46 * If @ctx is specified, it is an active search context of @ni and its base mft 46 * If @ctx is specified, it is an active search context of @ni and its base mft
47 * record. This is needed when ntfs_map_runlist_nolock() encounters unmapped 47 * record. This is needed when ntfs_map_runlist_nolock() encounters unmapped
48 * runlist fragments and allows their mapping. If you do not have the mft 48 * runlist fragments and allows their mapping. If you do not have the mft
49 * record mapped, you can specify @ctx as NULL and ntfs_map_runlist_nolock() 49 * record mapped, you can specify @ctx as NULL and ntfs_map_runlist_nolock()
50 * will perform the necessary mapping and unmapping. 50 * will perform the necessary mapping and unmapping.
51 * 51 *
52 * Note, ntfs_map_runlist_nolock() saves the state of @ctx on entry and 52 * Note, ntfs_map_runlist_nolock() saves the state of @ctx on entry and
53 * restores it before returning. Thus, @ctx will be left pointing to the same 53 * restores it before returning. Thus, @ctx will be left pointing to the same
54 * attribute on return as on entry. However, the actual pointers in @ctx may 54 * attribute on return as on entry. However, the actual pointers in @ctx may
55 * point to different memory locations on return, so you must remember to reset 55 * point to different memory locations on return, so you must remember to reset
56 * any cached pointers from the @ctx, i.e. after the call to 56 * any cached pointers from the @ctx, i.e. after the call to
57 * ntfs_map_runlist_nolock(), you will probably want to do: 57 * ntfs_map_runlist_nolock(), you will probably want to do:
58 * m = ctx->mrec; 58 * m = ctx->mrec;
59 * a = ctx->attr; 59 * a = ctx->attr;
60 * Assuming you cache ctx->attr in a variable @a of type ATTR_RECORD * and that 60 * Assuming you cache ctx->attr in a variable @a of type ATTR_RECORD * and that
61 * you cache ctx->mrec in a variable @m of type MFT_RECORD *. 61 * you cache ctx->mrec in a variable @m of type MFT_RECORD *.
62 * 62 *
63 * Return 0 on success and -errno on error. There is one special error code 63 * Return 0 on success and -errno on error. There is one special error code
64 * which is not an error as such. This is -ENOENT. It means that @vcn is out 64 * which is not an error as such. This is -ENOENT. It means that @vcn is out
65 * of bounds of the runlist. 65 * of bounds of the runlist.
66 * 66 *
67 * Note the runlist can be NULL after this function returns if @vcn is zero and 67 * Note the runlist can be NULL after this function returns if @vcn is zero and
68 * the attribute has zero allocated size, i.e. there simply is no runlist. 68 * the attribute has zero allocated size, i.e. there simply is no runlist.
69 * 69 *
70 * WARNING: If @ctx is supplied, regardless of whether success or failure is 70 * WARNING: If @ctx is supplied, regardless of whether success or failure is
71 * returned, you need to check IS_ERR(@ctx->mrec) and if 'true' the @ctx 71 * returned, you need to check IS_ERR(@ctx->mrec) and if 'true' the @ctx
72 * is no longer valid, i.e. you need to either call 72 * is no longer valid, i.e. you need to either call
73 * ntfs_attr_reinit_search_ctx() or ntfs_attr_put_search_ctx() on it. 73 * ntfs_attr_reinit_search_ctx() or ntfs_attr_put_search_ctx() on it.
74 * In that case PTR_ERR(@ctx->mrec) will give you the error code for 74 * In that case PTR_ERR(@ctx->mrec) will give you the error code for
75 * why the mapping of the old inode failed. 75 * why the mapping of the old inode failed.
76 * 76 *
77 * Locking: - The runlist described by @ni must be locked for writing on entry 77 * Locking: - The runlist described by @ni must be locked for writing on entry
78 * and is locked on return. Note the runlist will be modified. 78 * and is locked on return. Note the runlist will be modified.
79 * - If @ctx is NULL, the base mft record of @ni must not be mapped on 79 * - If @ctx is NULL, the base mft record of @ni must not be mapped on
80 * entry and it will be left unmapped on return. 80 * entry and it will be left unmapped on return.
81 * - If @ctx is not NULL, the base mft record must be mapped on entry 81 * - If @ctx is not NULL, the base mft record must be mapped on entry
82 * and it will be left mapped on return. 82 * and it will be left mapped on return.
83 */ 83 */
84 int ntfs_map_runlist_nolock(ntfs_inode *ni, VCN vcn, ntfs_attr_search_ctx *ctx) 84 int ntfs_map_runlist_nolock(ntfs_inode *ni, VCN vcn, ntfs_attr_search_ctx *ctx)
85 { 85 {
86 VCN end_vcn; 86 VCN end_vcn;
87 unsigned long flags; 87 unsigned long flags;
88 ntfs_inode *base_ni; 88 ntfs_inode *base_ni;
89 MFT_RECORD *m; 89 MFT_RECORD *m;
90 ATTR_RECORD *a; 90 ATTR_RECORD *a;
91 runlist_element *rl; 91 runlist_element *rl;
92 struct page *put_this_page = NULL; 92 struct page *put_this_page = NULL;
93 int err = 0; 93 int err = 0;
94 bool ctx_is_temporary, ctx_needs_reset; 94 bool ctx_is_temporary, ctx_needs_reset;
95 ntfs_attr_search_ctx old_ctx = { NULL, }; 95 ntfs_attr_search_ctx old_ctx = { NULL, };
96 96
97 ntfs_debug("Mapping runlist part containing vcn 0x%llx.", 97 ntfs_debug("Mapping runlist part containing vcn 0x%llx.",
98 (unsigned long long)vcn); 98 (unsigned long long)vcn);
99 if (!NInoAttr(ni)) 99 if (!NInoAttr(ni))
100 base_ni = ni; 100 base_ni = ni;
101 else 101 else
102 base_ni = ni->ext.base_ntfs_ino; 102 base_ni = ni->ext.base_ntfs_ino;
103 if (!ctx) { 103 if (!ctx) {
104 ctx_is_temporary = ctx_needs_reset = true; 104 ctx_is_temporary = ctx_needs_reset = true;
105 m = map_mft_record(base_ni); 105 m = map_mft_record(base_ni);
106 if (IS_ERR(m)) 106 if (IS_ERR(m))
107 return PTR_ERR(m); 107 return PTR_ERR(m);
108 ctx = ntfs_attr_get_search_ctx(base_ni, m); 108 ctx = ntfs_attr_get_search_ctx(base_ni, m);
109 if (unlikely(!ctx)) { 109 if (unlikely(!ctx)) {
110 err = -ENOMEM; 110 err = -ENOMEM;
111 goto err_out; 111 goto err_out;
112 } 112 }
113 } else { 113 } else {
114 VCN allocated_size_vcn; 114 VCN allocated_size_vcn;
115 115
116 BUG_ON(IS_ERR(ctx->mrec)); 116 BUG_ON(IS_ERR(ctx->mrec));
117 a = ctx->attr; 117 a = ctx->attr;
118 BUG_ON(!a->non_resident); 118 BUG_ON(!a->non_resident);
119 ctx_is_temporary = false; 119 ctx_is_temporary = false;
120 end_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn); 120 end_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn);
121 read_lock_irqsave(&ni->size_lock, flags); 121 read_lock_irqsave(&ni->size_lock, flags);
122 allocated_size_vcn = ni->allocated_size >> 122 allocated_size_vcn = ni->allocated_size >>
123 ni->vol->cluster_size_bits; 123 ni->vol->cluster_size_bits;
124 read_unlock_irqrestore(&ni->size_lock, flags); 124 read_unlock_irqrestore(&ni->size_lock, flags);
125 if (!a->data.non_resident.lowest_vcn && end_vcn <= 0) 125 if (!a->data.non_resident.lowest_vcn && end_vcn <= 0)
126 end_vcn = allocated_size_vcn - 1; 126 end_vcn = allocated_size_vcn - 1;
127 /* 127 /*
128 * If we already have the attribute extent containing @vcn in 128 * If we already have the attribute extent containing @vcn in
129 * @ctx, no need to look it up again. We slightly cheat in 129 * @ctx, no need to look it up again. We slightly cheat in
130 * that if vcn exceeds the allocated size, we will refuse to 130 * that if vcn exceeds the allocated size, we will refuse to
131 * map the runlist below, so there is definitely no need to get 131 * map the runlist below, so there is definitely no need to get
132 * the right attribute extent. 132 * the right attribute extent.
133 */ 133 */
134 if (vcn >= allocated_size_vcn || (a->type == ni->type && 134 if (vcn >= allocated_size_vcn || (a->type == ni->type &&
135 a->name_length == ni->name_len && 135 a->name_length == ni->name_len &&
136 !memcmp((u8*)a + le16_to_cpu(a->name_offset), 136 !memcmp((u8*)a + le16_to_cpu(a->name_offset),
137 ni->name, ni->name_len) && 137 ni->name, ni->name_len) &&
138 sle64_to_cpu(a->data.non_resident.lowest_vcn) 138 sle64_to_cpu(a->data.non_resident.lowest_vcn)
139 <= vcn && end_vcn >= vcn)) 139 <= vcn && end_vcn >= vcn))
140 ctx_needs_reset = false; 140 ctx_needs_reset = false;
141 else { 141 else {
142 /* Save the old search context. */ 142 /* Save the old search context. */
143 old_ctx = *ctx; 143 old_ctx = *ctx;
144 /* 144 /*
145 * If the currently mapped (extent) inode is not the 145 * If the currently mapped (extent) inode is not the
146 * base inode we will unmap it when we reinitialize the 146 * base inode we will unmap it when we reinitialize the
147 * search context which means we need to get a 147 * search context which means we need to get a
148 * reference to the page containing the mapped mft 148 * reference to the page containing the mapped mft
149 * record so we do not accidentally drop changes to the 149 * record so we do not accidentally drop changes to the
150 * mft record when it has not been marked dirty yet. 150 * mft record when it has not been marked dirty yet.
151 */ 151 */
152 if (old_ctx.base_ntfs_ino && old_ctx.ntfs_ino != 152 if (old_ctx.base_ntfs_ino && old_ctx.ntfs_ino !=
153 old_ctx.base_ntfs_ino) { 153 old_ctx.base_ntfs_ino) {
154 put_this_page = old_ctx.ntfs_ino->page; 154 put_this_page = old_ctx.ntfs_ino->page;
155 page_cache_get(put_this_page); 155 page_cache_get(put_this_page);
156 } 156 }
157 /* 157 /*
158 * Reinitialize the search context so we can lookup the 158 * Reinitialize the search context so we can lookup the
159 * needed attribute extent. 159 * needed attribute extent.
160 */ 160 */
161 ntfs_attr_reinit_search_ctx(ctx); 161 ntfs_attr_reinit_search_ctx(ctx);
162 ctx_needs_reset = true; 162 ctx_needs_reset = true;
163 } 163 }
164 } 164 }
165 if (ctx_needs_reset) { 165 if (ctx_needs_reset) {
166 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, 166 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
167 CASE_SENSITIVE, vcn, NULL, 0, ctx); 167 CASE_SENSITIVE, vcn, NULL, 0, ctx);
168 if (unlikely(err)) { 168 if (unlikely(err)) {
169 if (err == -ENOENT) 169 if (err == -ENOENT)
170 err = -EIO; 170 err = -EIO;
171 goto err_out; 171 goto err_out;
172 } 172 }
173 BUG_ON(!ctx->attr->non_resident); 173 BUG_ON(!ctx->attr->non_resident);
174 } 174 }
175 a = ctx->attr; 175 a = ctx->attr;
176 /* 176 /*
177 * Only decompress the mapping pairs if @vcn is inside it. Otherwise 177 * Only decompress the mapping pairs if @vcn is inside it. Otherwise
178 * we get into problems when we try to map an out of bounds vcn because 178 * we get into problems when we try to map an out of bounds vcn because
179 * we then try to map the already mapped runlist fragment and 179 * we then try to map the already mapped runlist fragment and
180 * ntfs_mapping_pairs_decompress() fails. 180 * ntfs_mapping_pairs_decompress() fails.
181 */ 181 */
182 end_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn) + 1; 182 end_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn) + 1;
183 if (unlikely(vcn && vcn >= end_vcn)) { 183 if (unlikely(vcn && vcn >= end_vcn)) {
184 err = -ENOENT; 184 err = -ENOENT;
185 goto err_out; 185 goto err_out;
186 } 186 }
187 rl = ntfs_mapping_pairs_decompress(ni->vol, a, ni->runlist.rl); 187 rl = ntfs_mapping_pairs_decompress(ni->vol, a, ni->runlist.rl);
188 if (IS_ERR(rl)) 188 if (IS_ERR(rl))
189 err = PTR_ERR(rl); 189 err = PTR_ERR(rl);
190 else 190 else
191 ni->runlist.rl = rl; 191 ni->runlist.rl = rl;
192 err_out: 192 err_out:
193 if (ctx_is_temporary) { 193 if (ctx_is_temporary) {
194 if (likely(ctx)) 194 if (likely(ctx))
195 ntfs_attr_put_search_ctx(ctx); 195 ntfs_attr_put_search_ctx(ctx);
196 unmap_mft_record(base_ni); 196 unmap_mft_record(base_ni);
197 } else if (ctx_needs_reset) { 197 } else if (ctx_needs_reset) {
198 /* 198 /*
199 * If there is no attribute list, restoring the search context 199 * If there is no attribute list, restoring the search context
200 * is accomplished simply by copying the saved context back over 200 * is accomplished simply by copying the saved context back over
201 * the caller supplied context. If there is an attribute list, 201 * the caller supplied context. If there is an attribute list,
202 * things are more complicated as we need to deal with mapping 202 * things are more complicated as we need to deal with mapping
203 * of mft records and resulting potential changes in pointers. 203 * of mft records and resulting potential changes in pointers.
204 */ 204 */
205 if (NInoAttrList(base_ni)) { 205 if (NInoAttrList(base_ni)) {
206 /* 206 /*
207 * If the currently mapped (extent) inode is not the 207 * If the currently mapped (extent) inode is not the
208 * one we had before, we need to unmap it and map the 208 * one we had before, we need to unmap it and map the
209 * old one. 209 * old one.
210 */ 210 */
211 if (ctx->ntfs_ino != old_ctx.ntfs_ino) { 211 if (ctx->ntfs_ino != old_ctx.ntfs_ino) {
212 /* 212 /*
213 * If the currently mapped inode is not the 213 * If the currently mapped inode is not the
214 * base inode, unmap it. 214 * base inode, unmap it.
215 */ 215 */
216 if (ctx->base_ntfs_ino && ctx->ntfs_ino != 216 if (ctx->base_ntfs_ino && ctx->ntfs_ino !=
217 ctx->base_ntfs_ino) { 217 ctx->base_ntfs_ino) {
218 unmap_extent_mft_record(ctx->ntfs_ino); 218 unmap_extent_mft_record(ctx->ntfs_ino);
219 ctx->mrec = ctx->base_mrec; 219 ctx->mrec = ctx->base_mrec;
220 BUG_ON(!ctx->mrec); 220 BUG_ON(!ctx->mrec);
221 } 221 }
222 /* 222 /*
223 * If the old mapped inode is not the base 223 * If the old mapped inode is not the base
224 * inode, map it. 224 * inode, map it.
225 */ 225 */
226 if (old_ctx.base_ntfs_ino && 226 if (old_ctx.base_ntfs_ino &&
227 old_ctx.ntfs_ino != 227 old_ctx.ntfs_ino !=
228 old_ctx.base_ntfs_ino) { 228 old_ctx.base_ntfs_ino) {
229 retry_map: 229 retry_map:
230 ctx->mrec = map_mft_record( 230 ctx->mrec = map_mft_record(
231 old_ctx.ntfs_ino); 231 old_ctx.ntfs_ino);
232 /* 232 /*
233 * Something bad has happened. If out 233 * Something bad has happened. If out
234 * of memory retry till it succeeds. 234 * of memory retry till it succeeds.
235 * Any other errors are fatal and we 235 * Any other errors are fatal and we
236 * return the error code in ctx->mrec. 236 * return the error code in ctx->mrec.
237 * Let the caller deal with it... We 237 * Let the caller deal with it... We
238 * just need to fudge things so the 238 * just need to fudge things so the
239 * caller can reinit and/or put the 239 * caller can reinit and/or put the
240 * search context safely. 240 * search context safely.
241 */ 241 */
242 if (IS_ERR(ctx->mrec)) { 242 if (IS_ERR(ctx->mrec)) {
243 if (PTR_ERR(ctx->mrec) == 243 if (PTR_ERR(ctx->mrec) ==
244 -ENOMEM) { 244 -ENOMEM) {
245 schedule(); 245 schedule();
246 goto retry_map; 246 goto retry_map;
247 } else 247 } else
248 old_ctx.ntfs_ino = 248 old_ctx.ntfs_ino =
249 old_ctx. 249 old_ctx.
250 base_ntfs_ino; 250 base_ntfs_ino;
251 } 251 }
252 } 252 }
253 } 253 }
254 /* Update the changed pointers in the saved context. */ 254 /* Update the changed pointers in the saved context. */
255 if (ctx->mrec != old_ctx.mrec) { 255 if (ctx->mrec != old_ctx.mrec) {
256 if (!IS_ERR(ctx->mrec)) 256 if (!IS_ERR(ctx->mrec))
257 old_ctx.attr = (ATTR_RECORD*)( 257 old_ctx.attr = (ATTR_RECORD*)(
258 (u8*)ctx->mrec + 258 (u8*)ctx->mrec +
259 ((u8*)old_ctx.attr - 259 ((u8*)old_ctx.attr -
260 (u8*)old_ctx.mrec)); 260 (u8*)old_ctx.mrec));
261 old_ctx.mrec = ctx->mrec; 261 old_ctx.mrec = ctx->mrec;
262 } 262 }
263 } 263 }
264 /* Restore the search context to the saved one. */ 264 /* Restore the search context to the saved one. */
265 *ctx = old_ctx; 265 *ctx = old_ctx;
266 /* 266 /*
267 * We drop the reference on the page we took earlier. In the 267 * We drop the reference on the page we took earlier. In the
268 * case that IS_ERR(ctx->mrec) is true this means we might lose 268 * case that IS_ERR(ctx->mrec) is true this means we might lose
269 * some changes to the mft record that had been made between 269 * some changes to the mft record that had been made between
270 * the last time it was marked dirty/written out and now. This 270 * the last time it was marked dirty/written out and now. This
271 * at this stage is not a problem as the mapping error is fatal 271 * at this stage is not a problem as the mapping error is fatal
272 * enough that the mft record cannot be written out anyway and 272 * enough that the mft record cannot be written out anyway and
273 * the caller is very likely to shutdown the whole inode 273 * the caller is very likely to shutdown the whole inode
274 * immediately and mark the volume dirty for chkdsk to pick up 274 * immediately and mark the volume dirty for chkdsk to pick up
275 * the pieces anyway. 275 * the pieces anyway.
276 */ 276 */
277 if (put_this_page) 277 if (put_this_page)
278 page_cache_release(put_this_page); 278 page_cache_release(put_this_page);
279 } 279 }
280 return err; 280 return err;
281 } 281 }
282 282
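A minimal caller sketch of the pattern the kernel-doc above asks for when an active @ctx is passed in: re-read any cached pointers after the call and check IS_ERR(@ctx->mrec) whether or not the call succeeded. This is not part of the patch; the example_remap() wrapper and its exact error handling are assumptions.

static int example_remap(ntfs_inode *ni, VCN vcn, ntfs_attr_search_ctx *ctx)
{
        ATTR_RECORD *a;
        int err;

        /* The caller holds ni->runlist.lock for writing. */
        err = ntfs_map_runlist_nolock(ni, vcn, ctx);
        /* Per the WARNING above, check the context even on success. */
        if (IS_ERR(ctx->mrec)) {
                err = PTR_ERR(ctx->mrec);
                /* Make @ctx safe to reuse or to put. */
                ntfs_attr_reinit_search_ctx(ctx);
                return err;
        }
        /* The mft record may have moved; refresh cached pointers. */
        a = ctx->attr;
        ntfs_debug("Attribute type 0x%x after remap.", le32_to_cpu(a->type));
        /* -ENOENT only means @vcn is past the end of the runlist. */
        return err;
}
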
283 /** 283 /**
284 * ntfs_map_runlist - map (a part of) a runlist of an ntfs inode 284 * ntfs_map_runlist - map (a part of) a runlist of an ntfs inode
285 * @ni: ntfs inode for which to map (part of) a runlist 285 * @ni: ntfs inode for which to map (part of) a runlist
286 * @vcn: map runlist part containing this vcn 286 * @vcn: map runlist part containing this vcn
287 * 287 *
288 * Map the part of a runlist containing the @vcn of the ntfs inode @ni. 288 * Map the part of a runlist containing the @vcn of the ntfs inode @ni.
289 * 289 *
290 * Return 0 on success and -errno on error. There is one special error code 290 * Return 0 on success and -errno on error. There is one special error code
291 * which is not an error as such. This is -ENOENT. It means that @vcn is out 291 * which is not an error as such. This is -ENOENT. It means that @vcn is out
292 * of bounds of the runlist. 292 * of bounds of the runlist.
293 * 293 *
294 * Locking: - The runlist must be unlocked on entry and is unlocked on return. 294 * Locking: - The runlist must be unlocked on entry and is unlocked on return.
295 * - This function takes the runlist lock for writing and may modify 295 * - This function takes the runlist lock for writing and may modify
296 * the runlist. 296 * the runlist.
297 */ 297 */
298 int ntfs_map_runlist(ntfs_inode *ni, VCN vcn) 298 int ntfs_map_runlist(ntfs_inode *ni, VCN vcn)
299 { 299 {
300 int err = 0; 300 int err = 0;
301 301
302 down_write(&ni->runlist.lock); 302 down_write(&ni->runlist.lock);
303 /* Make sure someone else didn't do the work while we were sleeping. */ 303 /* Make sure someone else didn't do the work while we were sleeping. */
304 if (likely(ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn) <= 304 if (likely(ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn) <=
305 LCN_RL_NOT_MAPPED)) 305 LCN_RL_NOT_MAPPED))
306 err = ntfs_map_runlist_nolock(ni, vcn, NULL); 306 err = ntfs_map_runlist_nolock(ni, vcn, NULL);
307 up_write(&ni->runlist.lock); 307 up_write(&ni->runlist.lock);
308 return err; 308 return err;
309 } 309 }
310 310
311 /** 311 /**
312 * ntfs_attr_vcn_to_lcn_nolock - convert a vcn into a lcn given an ntfs inode 312 * ntfs_attr_vcn_to_lcn_nolock - convert a vcn into a lcn given an ntfs inode
313 * @ni: ntfs inode of the attribute whose runlist to search 313 * @ni: ntfs inode of the attribute whose runlist to search
314 * @vcn: vcn to convert 314 * @vcn: vcn to convert
315 * @write_locked: true if the runlist is locked for writing 315 * @write_locked: true if the runlist is locked for writing
316 * 316 *
317 * Find the virtual cluster number @vcn in the runlist of the ntfs attribute 317 * Find the virtual cluster number @vcn in the runlist of the ntfs attribute
318 * described by the ntfs inode @ni and return the corresponding logical cluster 318 * described by the ntfs inode @ni and return the corresponding logical cluster
319 * number (lcn). 319 * number (lcn).
320 * 320 *
321 * If the @vcn is not mapped yet, an attempt is made to map the attribute 321 * If the @vcn is not mapped yet, an attempt is made to map the attribute
322 * extent containing the @vcn and the vcn to lcn conversion is retried. 322 * extent containing the @vcn and the vcn to lcn conversion is retried.
323 * 323 *
324 * If @write_locked is true the caller has locked the runlist for writing and 324 * If @write_locked is true the caller has locked the runlist for writing and
325 * if false for reading. 325 * if false for reading.
326 * 326 *
327 * Since lcns must be >= 0, we use negative return codes with special meaning: 327 * Since lcns must be >= 0, we use negative return codes with special meaning:
328 * 328 *
329 * Return code Meaning / Description 329 * Return code Meaning / Description
330 * ========================================== 330 * ==========================================
331 * LCN_HOLE Hole / not allocated on disk. 331 * LCN_HOLE Hole / not allocated on disk.
332 * LCN_ENOENT There is no such vcn in the runlist, i.e. @vcn is out of bounds. 332 * LCN_ENOENT There is no such vcn in the runlist, i.e. @vcn is out of bounds.
333 * LCN_ENOMEM Not enough memory to map runlist. 333 * LCN_ENOMEM Not enough memory to map runlist.
334 * LCN_EIO Critical error (runlist/file is corrupt, i/o error, etc). 334 * LCN_EIO Critical error (runlist/file is corrupt, i/o error, etc).
335 * 335 *
336 * Locking: - The runlist must be locked on entry and is left locked on return. 336 * Locking: - The runlist must be locked on entry and is left locked on return.
337 * - If @write_locked is 'false', i.e. the runlist is locked for reading, 337 * - If @write_locked is 'false', i.e. the runlist is locked for reading,
338 * the lock may be dropped inside the function so you cannot rely on 338 * the lock may be dropped inside the function so you cannot rely on
339 * the runlist still being the same when this function returns. 339 * the runlist still being the same when this function returns.
340 */ 340 */
341 LCN ntfs_attr_vcn_to_lcn_nolock(ntfs_inode *ni, const VCN vcn, 341 LCN ntfs_attr_vcn_to_lcn_nolock(ntfs_inode *ni, const VCN vcn,
342 const bool write_locked) 342 const bool write_locked)
343 { 343 {
344 LCN lcn; 344 LCN lcn;
345 unsigned long flags; 345 unsigned long flags;
346 bool is_retry = false; 346 bool is_retry = false;
347 347
348 BUG_ON(!ni); 348 BUG_ON(!ni);
349 ntfs_debug("Entering for i_ino 0x%lx, vcn 0x%llx, %s_locked.", 349 ntfs_debug("Entering for i_ino 0x%lx, vcn 0x%llx, %s_locked.",
350 ni->mft_no, (unsigned long long)vcn, 350 ni->mft_no, (unsigned long long)vcn,
351 write_locked ? "write" : "read"); 351 write_locked ? "write" : "read");
352 BUG_ON(!NInoNonResident(ni)); 352 BUG_ON(!NInoNonResident(ni));
353 BUG_ON(vcn < 0); 353 BUG_ON(vcn < 0);
354 if (!ni->runlist.rl) { 354 if (!ni->runlist.rl) {
355 read_lock_irqsave(&ni->size_lock, flags); 355 read_lock_irqsave(&ni->size_lock, flags);
356 if (!ni->allocated_size) { 356 if (!ni->allocated_size) {
357 read_unlock_irqrestore(&ni->size_lock, flags); 357 read_unlock_irqrestore(&ni->size_lock, flags);
358 return LCN_ENOENT; 358 return LCN_ENOENT;
359 } 359 }
360 read_unlock_irqrestore(&ni->size_lock, flags); 360 read_unlock_irqrestore(&ni->size_lock, flags);
361 } 361 }
362 retry_remap: 362 retry_remap:
363 /* Convert vcn to lcn. If that fails map the runlist and retry once. */ 363 /* Convert vcn to lcn. If that fails map the runlist and retry once. */
364 lcn = ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn); 364 lcn = ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn);
365 if (likely(lcn >= LCN_HOLE)) { 365 if (likely(lcn >= LCN_HOLE)) {
366 ntfs_debug("Done, lcn 0x%llx.", (long long)lcn); 366 ntfs_debug("Done, lcn 0x%llx.", (long long)lcn);
367 return lcn; 367 return lcn;
368 } 368 }
369 if (lcn != LCN_RL_NOT_MAPPED) { 369 if (lcn != LCN_RL_NOT_MAPPED) {
370 if (lcn != LCN_ENOENT) 370 if (lcn != LCN_ENOENT)
371 lcn = LCN_EIO; 371 lcn = LCN_EIO;
372 } else if (!is_retry) { 372 } else if (!is_retry) {
373 int err; 373 int err;
374 374
375 if (!write_locked) { 375 if (!write_locked) {
376 up_read(&ni->runlist.lock); 376 up_read(&ni->runlist.lock);
377 down_write(&ni->runlist.lock); 377 down_write(&ni->runlist.lock);
378 if (unlikely(ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn) != 378 if (unlikely(ntfs_rl_vcn_to_lcn(ni->runlist.rl, vcn) !=
379 LCN_RL_NOT_MAPPED)) { 379 LCN_RL_NOT_MAPPED)) {
380 up_write(&ni->runlist.lock); 380 up_write(&ni->runlist.lock);
381 down_read(&ni->runlist.lock); 381 down_read(&ni->runlist.lock);
382 goto retry_remap; 382 goto retry_remap;
383 } 383 }
384 } 384 }
385 err = ntfs_map_runlist_nolock(ni, vcn, NULL); 385 err = ntfs_map_runlist_nolock(ni, vcn, NULL);
386 if (!write_locked) { 386 if (!write_locked) {
387 up_write(&ni->runlist.lock); 387 up_write(&ni->runlist.lock);
388 down_read(&ni->runlist.lock); 388 down_read(&ni->runlist.lock);
389 } 389 }
390 if (likely(!err)) { 390 if (likely(!err)) {
391 is_retry = true; 391 is_retry = true;
392 goto retry_remap; 392 goto retry_remap;
393 } 393 }
394 if (err == -ENOENT) 394 if (err == -ENOENT)
395 lcn = LCN_ENOENT; 395 lcn = LCN_ENOENT;
396 else if (err == -ENOMEM) 396 else if (err == -ENOMEM)
397 lcn = LCN_ENOMEM; 397 lcn = LCN_ENOMEM;
398 else 398 else
399 lcn = LCN_EIO; 399 lcn = LCN_EIO;
400 } 400 }
401 if (lcn != LCN_ENOENT) 401 if (lcn != LCN_ENOENT)
402 ntfs_error(ni->vol->sb, "Failed with error code %lli.", 402 ntfs_error(ni->vol->sb, "Failed with error code %lli.",
403 (long long)lcn); 403 (long long)lcn);
404 return lcn; 404 return lcn;
405 } 405 }
406 406
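A hedged sketch of consuming the negative LCN_* codes listed in the kernel-doc above, folding them back into -errno values; example_vcn_to_lcn() and its choice of a read lock are assumptions, not part of the patch.

static int example_vcn_to_lcn(ntfs_inode *ni, VCN vcn, LCN *lcn_ret)
{
        LCN lcn;

        /*
         * The caller holds ni->runlist.lock for reading; with a read lock
         * the helper may drop and retake it, so the runlist can change.
         */
        lcn = ntfs_attr_vcn_to_lcn_nolock(ni, vcn, false);
        if (lcn >= 0 || lcn == LCN_HOLE) {
                /* Mapped cluster, or a sparse hole (lcn == LCN_HOLE). */
                *lcn_ret = lcn;
                return 0;
        }
        if (lcn == LCN_ENOENT)
                return -ENOENT; /* @vcn is out of bounds. */
        if (lcn == LCN_ENOMEM)
                return -ENOMEM; /* Could not map the runlist. */
        return -EIO;            /* Corruption or i/o error. */
}
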
407 /** 407 /**
408 * ntfs_attr_find_vcn_nolock - find a vcn in the runlist of an ntfs inode 408 * ntfs_attr_find_vcn_nolock - find a vcn in the runlist of an ntfs inode
409 * @ni: ntfs inode describing the runlist to search 409 * @ni: ntfs inode describing the runlist to search
410 * @vcn: vcn to find 410 * @vcn: vcn to find
411 * @ctx: active attribute search context if present or NULL if not 411 * @ctx: active attribute search context if present or NULL if not
412 * 412 *
413 * Find the virtual cluster number @vcn in the runlist described by the ntfs 413 * Find the virtual cluster number @vcn in the runlist described by the ntfs
414 * inode @ni and return the address of the runlist element containing the @vcn. 414 * inode @ni and return the address of the runlist element containing the @vcn.
415 * 415 *
416 * If the @vcn is not mapped yet, an attempt is made to map the attribute 416 * If the @vcn is not mapped yet, an attempt is made to map the attribute

417 * extent containing the @vcn and the vcn to lcn conversion is retried. 417 * extent containing the @vcn and the vcn to lcn conversion is retried.
418 * 418 *
419 * If @ctx is specified, it is an active search context of @ni and its base mft 419 * If @ctx is specified, it is an active search context of @ni and its base mft
420 * record. This is needed when ntfs_attr_find_vcn_nolock() encounters unmapped 420 * record. This is needed when ntfs_attr_find_vcn_nolock() encounters unmapped
421 * runlist fragments and allows their mapping. If you do not have the mft 421 * runlist fragments and allows their mapping. If you do not have the mft
422 * record mapped, you can specify @ctx as NULL and ntfs_attr_find_vcn_nolock() 422 * record mapped, you can specify @ctx as NULL and ntfs_attr_find_vcn_nolock()
423 * will perform the necessary mapping and unmapping. 423 * will perform the necessary mapping and unmapping.
424 * 424 *
425 * Note, ntfs_attr_find_vcn_nolock() saves the state of @ctx on entry and 425 * Note, ntfs_attr_find_vcn_nolock() saves the state of @ctx on entry and
426 * restores it before returning. Thus, @ctx will be left pointing to the same 426 * restores it before returning. Thus, @ctx will be left pointing to the same
427 * attribute on return as on entry. However, the actual pointers in @ctx may 427 * attribute on return as on entry. However, the actual pointers in @ctx may
428 * point to different memory locations on return, so you must remember to reset 428 * point to different memory locations on return, so you must remember to reset
429 * any cached pointers from the @ctx, i.e. after the call to 429 * any cached pointers from the @ctx, i.e. after the call to
430 * ntfs_attr_find_vcn_nolock(), you will probably want to do: 430 * ntfs_attr_find_vcn_nolock(), you will probably want to do:
431 * m = ctx->mrec; 431 * m = ctx->mrec;
432 * a = ctx->attr; 432 * a = ctx->attr;
433 * Assuming you cache ctx->attr in a variable @a of type ATTR_RECORD * and that 433 * Assuming you cache ctx->attr in a variable @a of type ATTR_RECORD * and that
434 * you cache ctx->mrec in a variable @m of type MFT_RECORD *. 434 * you cache ctx->mrec in a variable @m of type MFT_RECORD *.
435 * Note you need to distinguish between the lcn of the returned runlist element 435 * Note you need to distinguish between the lcn of the returned runlist element
436 * being >= 0 and LCN_HOLE. In the latter case you have to return zeroes on 436 * being >= 0 and LCN_HOLE. In the latter case you have to return zeroes on
437 * read and allocate clusters on write. 437 * read and allocate clusters on write.
438 * 438 *
439 * Return the runlist element containing the @vcn on success and 439 * Return the runlist element containing the @vcn on success and
440 * ERR_PTR(-errno) on error. You need to test the return value with IS_ERR() 440 * ERR_PTR(-errno) on error. You need to test the return value with IS_ERR()
441 * to decide if the return is success or failure and PTR_ERR() to get to the 441 * to decide if the return is success or failure and PTR_ERR() to get to the
442 * error code if IS_ERR() is true. 442 * error code if IS_ERR() is true.
443 * 443 *
444 * The possible error return codes are: 444 * The possible error return codes are:
445 * -ENOENT - No such vcn in the runlist, i.e. @vcn is out of bounds. 445 * -ENOENT - No such vcn in the runlist, i.e. @vcn is out of bounds.
446 * -ENOMEM - Not enough memory to map runlist. 446 * -ENOMEM - Not enough memory to map runlist.
447 * -EIO - Critical error (runlist/file is corrupt, i/o error, etc). 447 * -EIO - Critical error (runlist/file is corrupt, i/o error, etc).
448 * 448 *
449 * WARNING: If @ctx is supplied, regardless of whether success or failure is 449 * WARNING: If @ctx is supplied, regardless of whether success or failure is
450 * returned, you need to check IS_ERR(@ctx->mrec) and if 'true' the @ctx 450 * returned, you need to check IS_ERR(@ctx->mrec) and if 'true' the @ctx
451 * is no longer valid, i.e. you need to either call 451 * is no longer valid, i.e. you need to either call
452 * ntfs_attr_reinit_search_ctx() or ntfs_attr_put_search_ctx() on it. 452 * ntfs_attr_reinit_search_ctx() or ntfs_attr_put_search_ctx() on it.
453 * In that case PTR_ERR(@ctx->mrec) will give you the error code for 453 * In that case PTR_ERR(@ctx->mrec) will give you the error code for
454 * why the mapping of the old inode failed. 454 * why the mapping of the old inode failed.
455 * 455 *
456 * Locking: - The runlist described by @ni must be locked for writing on entry 456 * Locking: - The runlist described by @ni must be locked for writing on entry
457 * and is locked on return. Note the runlist may be modified when 457 * and is locked on return. Note the runlist may be modified when
458 * needed runlist fragments need to be mapped. 458 * needed runlist fragments need to be mapped.
459 * - If @ctx is NULL, the base mft record of @ni must not be mapped on 459 * - If @ctx is NULL, the base mft record of @ni must not be mapped on
460 * entry and it will be left unmapped on return. 460 * entry and it will be left unmapped on return.
461 * - If @ctx is not NULL, the base mft record must be mapped on entry 461 * - If @ctx is not NULL, the base mft record must be mapped on entry
462 * and it will be left mapped on return. 462 * and it will be left mapped on return.
463 */ 463 */
464 runlist_element *ntfs_attr_find_vcn_nolock(ntfs_inode *ni, const VCN vcn, 464 runlist_element *ntfs_attr_find_vcn_nolock(ntfs_inode *ni, const VCN vcn,
465 ntfs_attr_search_ctx *ctx) 465 ntfs_attr_search_ctx *ctx)
466 { 466 {
467 unsigned long flags; 467 unsigned long flags;
468 runlist_element *rl; 468 runlist_element *rl;
469 int err = 0; 469 int err = 0;
470 bool is_retry = false; 470 bool is_retry = false;
471 471
472 BUG_ON(!ni); 472 BUG_ON(!ni);
473 ntfs_debug("Entering for i_ino 0x%lx, vcn 0x%llx, with%s ctx.", 473 ntfs_debug("Entering for i_ino 0x%lx, vcn 0x%llx, with%s ctx.",
474 ni->mft_no, (unsigned long long)vcn, ctx ? "" : "out"); 474 ni->mft_no, (unsigned long long)vcn, ctx ? "" : "out");
475 BUG_ON(!NInoNonResident(ni)); 475 BUG_ON(!NInoNonResident(ni));
476 BUG_ON(vcn < 0); 476 BUG_ON(vcn < 0);
477 if (!ni->runlist.rl) { 477 if (!ni->runlist.rl) {
478 read_lock_irqsave(&ni->size_lock, flags); 478 read_lock_irqsave(&ni->size_lock, flags);
479 if (!ni->allocated_size) { 479 if (!ni->allocated_size) {
480 read_unlock_irqrestore(&ni->size_lock, flags); 480 read_unlock_irqrestore(&ni->size_lock, flags);
481 return ERR_PTR(-ENOENT); 481 return ERR_PTR(-ENOENT);
482 } 482 }
483 read_unlock_irqrestore(&ni->size_lock, flags); 483 read_unlock_irqrestore(&ni->size_lock, flags);
484 } 484 }
485 retry_remap: 485 retry_remap:
486 rl = ni->runlist.rl; 486 rl = ni->runlist.rl;
487 if (likely(rl && vcn >= rl[0].vcn)) { 487 if (likely(rl && vcn >= rl[0].vcn)) {
488 while (likely(rl->length)) { 488 while (likely(rl->length)) {
489 if (unlikely(vcn < rl[1].vcn)) { 489 if (unlikely(vcn < rl[1].vcn)) {
490 if (likely(rl->lcn >= LCN_HOLE)) { 490 if (likely(rl->lcn >= LCN_HOLE)) {
491 ntfs_debug("Done."); 491 ntfs_debug("Done.");
492 return rl; 492 return rl;
493 } 493 }
494 break; 494 break;
495 } 495 }
496 rl++; 496 rl++;
497 } 497 }
498 if (likely(rl->lcn != LCN_RL_NOT_MAPPED)) { 498 if (likely(rl->lcn != LCN_RL_NOT_MAPPED)) {
499 if (likely(rl->lcn == LCN_ENOENT)) 499 if (likely(rl->lcn == LCN_ENOENT))
500 err = -ENOENT; 500 err = -ENOENT;
501 else 501 else
502 err = -EIO; 502 err = -EIO;
503 } 503 }
504 } 504 }
505 if (!err && !is_retry) { 505 if (!err && !is_retry) {
506 /* 506 /*
507 * If the search context is invalid we cannot map the unmapped 507 * If the search context is invalid we cannot map the unmapped
508 * region. 508 * region.
509 */ 509 */
510 if (IS_ERR(ctx->mrec)) 510 if (IS_ERR(ctx->mrec))
511 err = PTR_ERR(ctx->mrec); 511 err = PTR_ERR(ctx->mrec);
512 else { 512 else {
513 /* 513 /*
514 * The @vcn is in an unmapped region, map the runlist 514 * The @vcn is in an unmapped region, map the runlist
515 * and retry. 515 * and retry.
516 */ 516 */
517 err = ntfs_map_runlist_nolock(ni, vcn, ctx); 517 err = ntfs_map_runlist_nolock(ni, vcn, ctx);
518 if (likely(!err)) { 518 if (likely(!err)) {
519 is_retry = true; 519 is_retry = true;
520 goto retry_remap; 520 goto retry_remap;
521 } 521 }
522 } 522 }
523 if (err == -EINVAL) 523 if (err == -EINVAL)
524 err = -EIO; 524 err = -EIO;
525 } else if (!err) 525 } else if (!err)
526 err = -EIO; 526 err = -EIO;
527 if (err != -ENOENT) 527 if (err != -ENOENT)
528 ntfs_error(ni->vol->sb, "Failed with error code %i.", err); 528 ntfs_error(ni->vol->sb, "Failed with error code %i.", err);
529 return ERR_PTR(err); 529 return ERR_PTR(err);
530 } 530 }
531 531
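Similarly, a hedged sketch of a caller of ntfs_attr_find_vcn_nolock() that makes the >= 0 versus LCN_HOLE distinction and honours the WARNING about @ctx; example_find_vcn() is an assumption, not part of the patch.

static int example_find_vcn(ntfs_inode *ni, VCN vcn, ntfs_attr_search_ctx *ctx)
{
        runlist_element *rl;

        /* The caller holds ni->runlist.lock for writing. */
        rl = ntfs_attr_find_vcn_nolock(ni, vcn, ctx);
        if (IS_ERR(rl))
                return PTR_ERR(rl);     /* -ENOENT, -ENOMEM or -EIO. */
        if (rl->lcn == LCN_HOLE) {
                /* Sparse run: return zeroes on read, allocate on write. */
        } else {
                /* rl->lcn >= 0: @vcn maps to cluster rl->lcn + (vcn - rl->vcn). */
        }
        /* Per the WARNING above, check the context even on success. */
        if (ctx && IS_ERR(ctx->mrec))
                return PTR_ERR(ctx->mrec);
        return 0;
}
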
532 /** 532 /**
533 * ntfs_attr_find - find (next) attribute in mft record 533 * ntfs_attr_find - find (next) attribute in mft record
534 * @type: attribute type to find 534 * @type: attribute type to find
535 * @name: attribute name to find (optional, i.e. NULL means don't care) 535 * @name: attribute name to find (optional, i.e. NULL means don't care)
536 * @name_len: attribute name length (only needed if @name present) 536 * @name_len: attribute name length (only needed if @name present)
537 * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present) 537 * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present)
538 * @val: attribute value to find (optional, resident attributes only) 538 * @val: attribute value to find (optional, resident attributes only)
539 * @val_len: attribute value length 539 * @val_len: attribute value length
540 * @ctx: search context with mft record and attribute to search from 540 * @ctx: search context with mft record and attribute to search from
541 * 541 *
542 * You should not need to call this function directly. Use ntfs_attr_lookup() 542 * You should not need to call this function directly. Use ntfs_attr_lookup()
543 * instead. 543 * instead.
544 * 544 *
545 * ntfs_attr_find() takes a search context @ctx as parameter and searches the 545 * ntfs_attr_find() takes a search context @ctx as parameter and searches the
546 * mft record specified by @ctx->mrec, beginning at @ctx->attr, for an 546 * mft record specified by @ctx->mrec, beginning at @ctx->attr, for an
547 * attribute of @type, optionally @name and @val. 547 * attribute of @type, optionally @name and @val.
548 * 548 *
549 * If the attribute is found, ntfs_attr_find() returns 0 and @ctx->attr will 549 * If the attribute is found, ntfs_attr_find() returns 0 and @ctx->attr will
550 * point to the found attribute. 550 * point to the found attribute.
551 * 551 *
552 * If the attribute is not found, ntfs_attr_find() returns -ENOENT and 552 * If the attribute is not found, ntfs_attr_find() returns -ENOENT and
553 * @ctx->attr will point to the attribute before which the attribute being 553 * @ctx->attr will point to the attribute before which the attribute being
554 * searched for would need to be inserted if such an action were to be desired. 554 * searched for would need to be inserted if such an action were to be desired.
555 * 555 *
556 * On actual error, ntfs_attr_find() returns -EIO. In this case @ctx->attr is 556 * On actual error, ntfs_attr_find() returns -EIO. In this case @ctx->attr is
557 * undefined and in particular do not rely on it not changing. 557 * undefined and in particular do not rely on it not changing.
558 * 558 *
559 * If @ctx->is_first is 'true', the search begins with @ctx->attr itself. If it 559 * If @ctx->is_first is 'true', the search begins with @ctx->attr itself. If it
560 * is 'false', the search begins after @ctx->attr. 560 * is 'false', the search begins after @ctx->attr.
561 * 561 *
562 * If @ic is IGNORE_CASE, the @name comparison is not case sensitive and 562 * If @ic is IGNORE_CASE, the @name comparison is not case sensitive and
563 * @ctx->ntfs_ino must be set to the ntfs inode to which the mft record 563 * @ctx->ntfs_ino must be set to the ntfs inode to which the mft record
564 * @ctx->mrec belongs. This is so we can get at the ntfs volume and hence at 564 * @ctx->mrec belongs. This is so we can get at the ntfs volume and hence at
565 * the upcase table. If @ic is CASE_SENSITIVE, the comparison is case 565 * the upcase table. If @ic is CASE_SENSITIVE, the comparison is case
566 * sensitive. When @name is present, @name_len is the @name length in Unicode 566 * sensitive. When @name is present, @name_len is the @name length in Unicode
567 * characters. 567 * characters.
568 * 568 *
569 * If @name is not present (NULL), we assume that the unnamed attribute is 569 * If @name is not present (NULL), we assume that the unnamed attribute is
570 * being searched for. 570 * being searched for.
571 * 571 *
572 * Finally, the resident attribute value @val is looked for, if present. If 572 * Finally, the resident attribute value @val is looked for, if present. If
573 * @val is not present (NULL), @val_len is ignored. 573 * @val is not present (NULL), @val_len is ignored.
574 * 574 *
575 * ntfs_attr_find() only searches the specified mft record and it ignores the 575 * ntfs_attr_find() only searches the specified mft record and it ignores the
576 * presence of an attribute list attribute (unless it is the one being searched 576 * presence of an attribute list attribute (unless it is the one being searched
577 * for, obviously). If you need to take attribute lists into consideration, 577 * for, obviously). If you need to take attribute lists into consideration,
578 * use ntfs_attr_lookup() instead (see below). This also means that you cannot 578 * use ntfs_attr_lookup() instead (see below). This also means that you cannot
579 * use ntfs_attr_find() to search for extent records of non-resident 579 * use ntfs_attr_find() to search for extent records of non-resident
580 * attributes, as extents with lowest_vcn != 0 are usually described by the 580 * attributes, as extents with lowest_vcn != 0 are usually described by the
581 * attribute list attribute only. - Note that it is possible that the first 581 * attribute list attribute only. - Note that it is possible that the first
582 * extent is only in the attribute list while the last extent is in the base 582 * extent is only in the attribute list while the last extent is in the base
583 * mft record, so do not rely on being able to find the first extent in the 583 * mft record, so do not rely on being able to find the first extent in the
584 * base mft record. 584 * base mft record.
585 * 585 *
586 * Warning: Never use @val when looking for attribute types which can be 586 * Warning: Never use @val when looking for attribute types which can be
587 * non-resident as this most likely will result in a crash! 587 * non-resident as this most likely will result in a crash!
588 */ 588 */
589 static int ntfs_attr_find(const ATTR_TYPE type, const ntfschar *name, 589 static int ntfs_attr_find(const ATTR_TYPE type, const ntfschar *name,
590 const u32 name_len, const IGNORE_CASE_BOOL ic, 590 const u32 name_len, const IGNORE_CASE_BOOL ic,
591 const u8 *val, const u32 val_len, ntfs_attr_search_ctx *ctx) 591 const u8 *val, const u32 val_len, ntfs_attr_search_ctx *ctx)
592 { 592 {
593 ATTR_RECORD *a; 593 ATTR_RECORD *a;
594 ntfs_volume *vol = ctx->ntfs_ino->vol; 594 ntfs_volume *vol = ctx->ntfs_ino->vol;
595 ntfschar *upcase = vol->upcase; 595 ntfschar *upcase = vol->upcase;
596 u32 upcase_len = vol->upcase_len; 596 u32 upcase_len = vol->upcase_len;
597 597
598 /* 598 /*
599 * Iterate over attributes in mft record starting at @ctx->attr, or the 599 * Iterate over attributes in mft record starting at @ctx->attr, or the
600 * attribute following that, if @ctx->is_first is 'true'. 600 * attribute following that, if @ctx->is_first is 'true'.
601 */ 601 */
602 if (ctx->is_first) { 602 if (ctx->is_first) {
603 a = ctx->attr; 603 a = ctx->attr;
604 ctx->is_first = false; 604 ctx->is_first = false;
605 } else 605 } else
606 a = (ATTR_RECORD*)((u8*)ctx->attr + 606 a = (ATTR_RECORD*)((u8*)ctx->attr +
607 le32_to_cpu(ctx->attr->length)); 607 le32_to_cpu(ctx->attr->length));
608 for (;; a = (ATTR_RECORD*)((u8*)a + le32_to_cpu(a->length))) { 608 for (;; a = (ATTR_RECORD*)((u8*)a + le32_to_cpu(a->length))) {
609 if ((u8*)a < (u8*)ctx->mrec || (u8*)a > (u8*)ctx->mrec + 609 if ((u8*)a < (u8*)ctx->mrec || (u8*)a > (u8*)ctx->mrec +
610 le32_to_cpu(ctx->mrec->bytes_allocated)) 610 le32_to_cpu(ctx->mrec->bytes_allocated))
611 break; 611 break;
612 ctx->attr = a; 612 ctx->attr = a;
613 if (unlikely(le32_to_cpu(a->type) > le32_to_cpu(type) || 613 if (unlikely(le32_to_cpu(a->type) > le32_to_cpu(type) ||
614 a->type == AT_END)) 614 a->type == AT_END))
615 return -ENOENT; 615 return -ENOENT;
616 if (unlikely(!a->length)) 616 if (unlikely(!a->length))
617 break; 617 break;
618 if (a->type != type) 618 if (a->type != type)
619 continue; 619 continue;
620 /* 620 /*
621 * If @name is present, compare the two names. If @name is 621 * If @name is present, compare the two names. If @name is
622 * missing, assume we want an unnamed attribute. 622 * missing, assume we want an unnamed attribute.
623 */ 623 */
624 if (!name) { 624 if (!name) {
625 /* The search failed if the found attribute is named. */ 625 /* The search failed if the found attribute is named. */
626 if (a->name_length) 626 if (a->name_length)
627 return -ENOENT; 627 return -ENOENT;
628 } else if (!ntfs_are_names_equal(name, name_len, 628 } else if (!ntfs_are_names_equal(name, name_len,
629 (ntfschar*)((u8*)a + le16_to_cpu(a->name_offset)), 629 (ntfschar*)((u8*)a + le16_to_cpu(a->name_offset)),
630 a->name_length, ic, upcase, upcase_len)) { 630 a->name_length, ic, upcase, upcase_len)) {
631 register int rc; 631 register int rc;
632 632
633 rc = ntfs_collate_names(name, name_len, 633 rc = ntfs_collate_names(name, name_len,
634 (ntfschar*)((u8*)a + 634 (ntfschar*)((u8*)a +
635 le16_to_cpu(a->name_offset)), 635 le16_to_cpu(a->name_offset)),
636 a->name_length, 1, IGNORE_CASE, 636 a->name_length, 1, IGNORE_CASE,
637 upcase, upcase_len); 637 upcase, upcase_len);
638 /* 638 /*
639 * If @name collates before a->name, there is no 639 * If @name collates before a->name, there is no
640 * matching attribute. 640 * matching attribute.
641 */ 641 */
642 if (rc == -1) 642 if (rc == -1)
643 return -ENOENT; 643 return -ENOENT;
644 /* If the strings are not equal, continue search. */ 644 /* If the strings are not equal, continue search. */
645 if (rc) 645 if (rc)
646 continue; 646 continue;
647 rc = ntfs_collate_names(name, name_len, 647 rc = ntfs_collate_names(name, name_len,
648 (ntfschar*)((u8*)a + 648 (ntfschar*)((u8*)a +
649 le16_to_cpu(a->name_offset)), 649 le16_to_cpu(a->name_offset)),
650 a->name_length, 1, CASE_SENSITIVE, 650 a->name_length, 1, CASE_SENSITIVE,
651 upcase, upcase_len); 651 upcase, upcase_len);
652 if (rc == -1) 652 if (rc == -1)
653 return -ENOENT; 653 return -ENOENT;
654 if (rc) 654 if (rc)
655 continue; 655 continue;
656 } 656 }
657 /* 657 /*
658 * The names match or @name not present and attribute is 658 * The names match or @name not present and attribute is
659 * unnamed. If no @val specified, we have found the attribute 659 * unnamed. If no @val specified, we have found the attribute
660 * and are done. 660 * and are done.
661 */ 661 */
662 if (!val) 662 if (!val)
663 return 0; 663 return 0;
664 /* @val is present; compare values. */ 664 /* @val is present; compare values. */
665 else { 665 else {
666 register int rc; 666 register int rc;
667 667
668 rc = memcmp(val, (u8*)a + le16_to_cpu( 668 rc = memcmp(val, (u8*)a + le16_to_cpu(
669 a->data.resident.value_offset), 669 a->data.resident.value_offset),
670 min_t(u32, val_len, le32_to_cpu( 670 min_t(u32, val_len, le32_to_cpu(
671 a->data.resident.value_length))); 671 a->data.resident.value_length)));
672 /* 672 /*
673 * If @val collates before the current attribute's 673 * If @val collates before the current attribute's
674 * value, there is no matching attribute. 674 * value, there is no matching attribute.
675 */ 675 */
676 if (!rc) { 676 if (!rc) {
677 register u32 avl; 677 register u32 avl;
678 678
679 avl = le32_to_cpu( 679 avl = le32_to_cpu(
680 a->data.resident.value_length); 680 a->data.resident.value_length);
681 if (val_len == avl) 681 if (val_len == avl)
682 return 0; 682 return 0;
683 if (val_len < avl) 683 if (val_len < avl)
684 return -ENOENT; 684 return -ENOENT;
685 } else if (rc < 0) 685 } else if (rc < 0)
686 return -ENOENT; 686 return -ENOENT;
687 } 687 }
688 } 688 }
689 ntfs_error(vol->sb, "Inode is corrupt. Run chkdsk."); 689 ntfs_error(vol->sb, "Inode is corrupt. Run chkdsk.");
690 NVolSetErrors(vol); 690 NVolSetErrors(vol);
691 return -EIO; 691 return -EIO;
692 } 692 }
693 693
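As the kernel-doc notes, ntfs_attr_find() should be reached through ntfs_attr_lookup(). A hedged sketch of that path, using the search-context helpers seen earlier in this file; example_lookup(), the unnamed and value-less lookup and the lowest_vcn of 0 are illustrative assumptions, not part of the patch.

static int example_lookup(ntfs_inode *base_ni, MFT_RECORD *m, ATTR_TYPE type)
{
        ntfs_attr_search_ctx *ctx;
        int err;

        ctx = ntfs_attr_get_search_ctx(base_ni, m);
        if (!ctx)
                return -ENOMEM;
        /* Unnamed attribute, no value match, first extent (lowest_vcn 0). */
        err = ntfs_attr_lookup(type, NULL, 0, CASE_SENSITIVE, 0, NULL, 0, ctx);
        if (!err) {
                /*
                 * ctx->attr now points at the found attribute inside
                 * ctx->mrec and is only valid while the context is held.
                 */
        }
        /* Releases @ctx and unmaps any extent mft record it mapped. */
        ntfs_attr_put_search_ctx(ctx);
        return err;     /* 0, -ENOENT if not found, -EIO on corruption. */
}
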
694 /** 694 /**
695 * load_attribute_list - load an attribute list into memory 695 * load_attribute_list - load an attribute list into memory
696 * @vol: ntfs volume from which to read 696 * @vol: ntfs volume from which to read
697 * @runlist: runlist of the attribute list 697 * @runlist: runlist of the attribute list
698 * @al_start: destination buffer 698 * @al_start: destination buffer
699 * @size: size of the destination buffer in bytes 699 * @size: size of the destination buffer in bytes
700 * @initialized_size: initialized size of the attribute list 700 * @initialized_size: initialized size of the attribute list
701 * 701 *
702 * Walk the runlist @runlist and load all clusters from it copying them into 702 * Walk the runlist @runlist and load all clusters from it copying them into
703 * the linear buffer @al. The maximum number of bytes copied to @al is @size 703 * the linear buffer @al. The maximum number of bytes copied to @al is @size
704 * bytes. Note, @size does not need to be a multiple of the cluster size. If 704 * bytes. Note, @size does not need to be a multiple of the cluster size. If
705 * @initialized_size is less than @size, the region in @al between 705 * @initialized_size is less than @size, the region in @al between
706 * @initialized_size and @size will be zeroed and not read from disk. 706 * @initialized_size and @size will be zeroed and not read from disk.
707 * 707 *
708 * Return 0 on success or -errno on error. 708 * Return 0 on success or -errno on error.
709 */ 709 */
710 int load_attribute_list(ntfs_volume *vol, runlist *runlist, u8 *al_start, 710 int load_attribute_list(ntfs_volume *vol, runlist *runlist, u8 *al_start,
711 const s64 size, const s64 initialized_size) 711 const s64 size, const s64 initialized_size)
712 { 712 {
713 LCN lcn; 713 LCN lcn;
714 u8 *al = al_start; 714 u8 *al = al_start;
715 u8 *al_end = al + initialized_size; 715 u8 *al_end = al + initialized_size;
716 runlist_element *rl; 716 runlist_element *rl;
717 struct buffer_head *bh; 717 struct buffer_head *bh;
718 struct super_block *sb; 718 struct super_block *sb;
719 unsigned long block_size; 719 unsigned long block_size;
720 unsigned long block, max_block; 720 unsigned long block, max_block;
721 int err = 0; 721 int err = 0;
722 unsigned char block_size_bits; 722 unsigned char block_size_bits;
723 723
724 ntfs_debug("Entering."); 724 ntfs_debug("Entering.");
725 if (!vol || !runlist || !al || size <= 0 || initialized_size < 0 || 725 if (!vol || !runlist || !al || size <= 0 || initialized_size < 0 ||
726 initialized_size > size) 726 initialized_size > size)
727 return -EINVAL; 727 return -EINVAL;
728 if (!initialized_size) { 728 if (!initialized_size) {
729 memset(al, 0, size); 729 memset(al, 0, size);
730 return 0; 730 return 0;
731 } 731 }
732 sb = vol->sb; 732 sb = vol->sb;
733 block_size = sb->s_blocksize; 733 block_size = sb->s_blocksize;
734 block_size_bits = sb->s_blocksize_bits; 734 block_size_bits = sb->s_blocksize_bits;
735 down_read(&runlist->lock); 735 down_read(&runlist->lock);
736 rl = runlist->rl; 736 rl = runlist->rl;
737 if (!rl) { 737 if (!rl) {
738 ntfs_error(sb, "Cannot read attribute list since runlist is " 738 ntfs_error(sb, "Cannot read attribute list since runlist is "
739 "missing."); 739 "missing.");
740 goto err_out; 740 goto err_out;
741 } 741 }
742 /* Read all clusters specified by the runlist one run at a time. */ 742 /* Read all clusters specified by the runlist one run at a time. */
743 while (rl->length) { 743 while (rl->length) {
744 lcn = ntfs_rl_vcn_to_lcn(rl, rl->vcn); 744 lcn = ntfs_rl_vcn_to_lcn(rl, rl->vcn);
745 ntfs_debug("Reading vcn = 0x%llx, lcn = 0x%llx.", 745 ntfs_debug("Reading vcn = 0x%llx, lcn = 0x%llx.",
746 (unsigned long long)rl->vcn, 746 (unsigned long long)rl->vcn,
747 (unsigned long long)lcn); 747 (unsigned long long)lcn);
748 /* The attribute list cannot be sparse. */ 748 /* The attribute list cannot be sparse. */
749 if (lcn < 0) { 749 if (lcn < 0) {
750 ntfs_error(sb, "ntfs_rl_vcn_to_lcn() failed. Cannot " 750 ntfs_error(sb, "ntfs_rl_vcn_to_lcn() failed. Cannot "
751 "read attribute list."); 751 "read attribute list.");
752 goto err_out; 752 goto err_out;
753 } 753 }
754 block = lcn << vol->cluster_size_bits >> block_size_bits; 754 block = lcn << vol->cluster_size_bits >> block_size_bits;
755 /* Read the run from device in chunks of block_size bytes. */ 755 /* Read the run from device in chunks of block_size bytes. */
756 max_block = block + (rl->length << vol->cluster_size_bits >> 756 max_block = block + (rl->length << vol->cluster_size_bits >>
757 block_size_bits); 757 block_size_bits);
758 ntfs_debug("max_block = 0x%lx.", max_block); 758 ntfs_debug("max_block = 0x%lx.", max_block);
759 do { 759 do {
760 ntfs_debug("Reading block = 0x%lx.", block); 760 ntfs_debug("Reading block = 0x%lx.", block);
761 bh = sb_bread(sb, block); 761 bh = sb_bread(sb, block);
762 if (!bh) { 762 if (!bh) {
763 ntfs_error(sb, "sb_bread() failed. Cannot " 763 ntfs_error(sb, "sb_bread() failed. Cannot "
764 "read attribute list."); 764 "read attribute list.");
765 goto err_out; 765 goto err_out;
766 } 766 }
767 if (al + block_size >= al_end) 767 if (al + block_size >= al_end)
768 goto do_final; 768 goto do_final;
769 memcpy(al, bh->b_data, block_size); 769 memcpy(al, bh->b_data, block_size);
770 brelse(bh); 770 brelse(bh);
771 al += block_size; 771 al += block_size;
772 } while (++block < max_block); 772 } while (++block < max_block);
773 rl++; 773 rl++;
774 } 774 }
775 if (initialized_size < size) { 775 if (initialized_size < size) {
776 initialize: 776 initialize:
777 memset(al_start + initialized_size, 0, size - initialized_size); 777 memset(al_start + initialized_size, 0, size - initialized_size);
778 } 778 }
779 done: 779 done:
780 up_read(&runlist->lock); 780 up_read(&runlist->lock);
781 return err; 781 return err;
782 do_final: 782 do_final:
783 if (al < al_end) { 783 if (al < al_end) {
784 /* 784 /*
785 * Partial block. 785 * Partial block.
786 * 786 *
787 * Note: The attribute list can be smaller than its allocation 787 * Note: The attribute list can be smaller than its allocation
788 * by multiple clusters. This has been encountered by at least 788 * by multiple clusters. This has been encountered by at least
789 * two people running Windows XP, thus we cannot do any 789 * two people running Windows XP, thus we cannot do any
790 * truncation sanity checking here. (AIA) 790 * truncation sanity checking here. (AIA)
791 */ 791 */
792 memcpy(al, bh->b_data, al_end - al); 792 memcpy(al, bh->b_data, al_end - al);
793 brelse(bh); 793 brelse(bh);
794 if (initialized_size < size) 794 if (initialized_size < size)
795 goto initialize; 795 goto initialize;
796 goto done; 796 goto done;
797 } 797 }
798 brelse(bh); 798 brelse(bh);
799 /* Real overflow! */ 799 /* Real overflow! */
800 ntfs_error(sb, "Attribute list buffer overflow. Read attribute list " 800 ntfs_error(sb, "Attribute list buffer overflow. Read attribute list "
801 "is truncated."); 801 "is truncated.");
802 err_out: 802 err_out:
803 err = -EIO; 803 err = -EIO;
804 goto done; 804 goto done;
805 } 805 }
806 806
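A hedged sketch of a load_attribute_list() caller following the buffer contract described above; example_read_attr_list() and the use of the allocation helpers from malloc.h (included at the top of this file) are assumptions, not part of the patch.

static u8 *example_read_attr_list(ntfs_volume *vol, runlist *runlist,
                const s64 size, const s64 initialized_size)
{
        u8 *al;
        int err;

        /* @size covers the whole attribute list value in bytes. */
        al = ntfs_malloc_nofs(size);
        if (!al)
                return ERR_PTR(-ENOMEM);
        err = load_attribute_list(vol, runlist, al, size, initialized_size);
        if (err) {
                ntfs_free(al);
                return ERR_PTR(err);    /* -EINVAL or -EIO. */
        }
        /* Bytes between @initialized_size and @size were zeroed, not read. */
        return al;
}
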
807 /** 807 /**
808 * ntfs_external_attr_find - find an attribute in the attribute list of an inode 808 * ntfs_external_attr_find - find an attribute in the attribute list of an inode
809 * @type: attribute type to find 809 * @type: attribute type to find
810 * @name: attribute name to find (optional, i.e. NULL means don't care) 810 * @name: attribute name to find (optional, i.e. NULL means don't care)
811 * @name_len: attribute name length (only needed if @name present) 811 * @name_len: attribute name length (only needed if @name present)
812 * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present) 812 * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present)
813 * @lowest_vcn: lowest vcn to find (optional, non-resident attributes only) 813 * @lowest_vcn: lowest vcn to find (optional, non-resident attributes only)
814 * @val: attribute value to find (optional, resident attributes only) 814 * @val: attribute value to find (optional, resident attributes only)
815 * @val_len: attribute value length 815 * @val_len: attribute value length
816 * @ctx: search context with mft record and attribute to search from 816 * @ctx: search context with mft record and attribute to search from
817 * 817 *
818 * You should not need to call this function directly. Use ntfs_attr_lookup() 818 * You should not need to call this function directly. Use ntfs_attr_lookup()
819 * instead. 819 * instead.
820 * 820 *
821 * Find an attribute by searching the attribute list for the corresponding 821 * Find an attribute by searching the attribute list for the corresponding
822 * attribute list entry. Having found the entry, map the mft record if the 822 * attribute list entry. Having found the entry, map the mft record if the
823 * attribute is in a different mft record/inode, ntfs_attr_find() the attribute
824 * in there and return it.
825 *
826 * On first search @ctx->ntfs_ino must be the base mft record and @ctx must
827 * have been obtained from a call to ntfs_attr_get_search_ctx(). On subsequent
828 * calls @ctx->ntfs_ino can be any extent inode, too (@ctx->base_ntfs_ino is
829 * then the base inode).
830 *
831 * After finishing with the attribute/mft record you need to call
832 * ntfs_attr_put_search_ctx() to cleanup the search context (unmapping any
833 * mapped inodes, etc).
834 *
835 * If the attribute is found, ntfs_external_attr_find() returns 0 and
836 * @ctx->attr will point to the found attribute. @ctx->mrec will point to the
837 * mft record in which @ctx->attr is located and @ctx->al_entry will point to
838 * the attribute list entry for the attribute.
839 *
840 * If the attribute is not found, ntfs_external_attr_find() returns -ENOENT and
841 * @ctx->attr will point to the attribute in the base mft record before which
842 * the attribute being searched for would need to be inserted if such an action
843 * were to be desired. @ctx->mrec will point to the mft record in which
844 * @ctx->attr is located and @ctx->al_entry will point to the attribute list
845 * entry of the attribute before which the attribute being searched for would
846 * need to be inserted if such an action were to be desired.
847 *
848 * Thus to insert the not found attribute, one wants to add the attribute to
849 * @ctx->mrec (the base mft record) and if there is not enough space, the
850 * attribute should be placed in a newly allocated extent mft record. The
851 * attribute list entry for the inserted attribute should be inserted in the
852 * attribute list attribute at @ctx->al_entry.
853 *
854 * On actual error, ntfs_external_attr_find() returns -EIO. In this case
855 * @ctx->attr is undefined and in particular do not rely on it not changing.
856 */
857 static int ntfs_external_attr_find(const ATTR_TYPE type,
858 const ntfschar *name, const u32 name_len,
859 const IGNORE_CASE_BOOL ic, const VCN lowest_vcn,
860 const u8 *val, const u32 val_len, ntfs_attr_search_ctx *ctx)
861 {
862 ntfs_inode *base_ni, *ni;
863 ntfs_volume *vol;
864 ATTR_LIST_ENTRY *al_entry, *next_al_entry;
865 u8 *al_start, *al_end;
866 ATTR_RECORD *a;
867 ntfschar *al_name;
868 u32 al_name_len;
869 int err = 0;
870 static const char *es = " Unmount and run chkdsk.";
871
872 ni = ctx->ntfs_ino;
873 base_ni = ctx->base_ntfs_ino;
874 ntfs_debug("Entering for inode 0x%lx, type 0x%x.", ni->mft_no, type);
875 if (!base_ni) {
876 /* First call happens with the base mft record. */
877 base_ni = ctx->base_ntfs_ino = ctx->ntfs_ino;
878 ctx->base_mrec = ctx->mrec;
879 }
880 if (ni == base_ni)
881 ctx->base_attr = ctx->attr;
882 if (type == AT_END)
883 goto not_found;
884 vol = base_ni->vol;
885 al_start = base_ni->attr_list;
886 al_end = al_start + base_ni->attr_list_size;
887 if (!ctx->al_entry)
888 ctx->al_entry = (ATTR_LIST_ENTRY*)al_start;
889 /*
890 * Iterate over entries in attribute list starting at @ctx->al_entry,
891 * or the entry following that, if @ctx->is_first is 'true'.
892 */
893 if (ctx->is_first) {
894 al_entry = ctx->al_entry;
895 ctx->is_first = false;
896 } else
897 al_entry = (ATTR_LIST_ENTRY*)((u8*)ctx->al_entry +
898 le16_to_cpu(ctx->al_entry->length));
899 for (;; al_entry = next_al_entry) {
900 /* Out of bounds check. */
901 if ((u8*)al_entry < base_ni->attr_list ||
902 (u8*)al_entry > al_end)
903 break; /* Inode is corrupt. */
904 ctx->al_entry = al_entry;
905 /* Catch the end of the attribute list. */
906 if ((u8*)al_entry == al_end)
907 goto not_found;
908 if (!al_entry->length)
909 break;
910 if ((u8*)al_entry + 6 > al_end || (u8*)al_entry +
911 le16_to_cpu(al_entry->length) > al_end)
912 break;
913 next_al_entry = (ATTR_LIST_ENTRY*)((u8*)al_entry +
914 le16_to_cpu(al_entry->length));
915 if (le32_to_cpu(al_entry->type) > le32_to_cpu(type))
916 goto not_found;
917 if (type != al_entry->type)
918 continue;
919 /*
920 * If @name is present, compare the two names. If @name is
921 * missing, assume we want an unnamed attribute.
922 */
923 al_name_len = al_entry->name_length;
924 al_name = (ntfschar*)((u8*)al_entry + al_entry->name_offset);
925 if (!name) {
926 if (al_name_len)
927 goto not_found;
928 } else if (!ntfs_are_names_equal(al_name, al_name_len, name,
929 name_len, ic, vol->upcase, vol->upcase_len)) {
930 register int rc;
931
932 rc = ntfs_collate_names(name, name_len, al_name,
933 al_name_len, 1, IGNORE_CASE,
934 vol->upcase, vol->upcase_len);
935 /*
936 * If @name collates before al_name, there is no
937 * matching attribute.
938 */
939 if (rc == -1)
940 goto not_found;
941 /* If the strings are not equal, continue search. */
942 if (rc)
943 continue;
944 /*
945 * FIXME: Reverse engineering showed 0, IGNORE_CASE but
946 * that is inconsistent with ntfs_attr_find(). The
947 * subsequent rc checks were also different. Perhaps I
948 * made a mistake in one of the two. Need to recheck
949 * which is correct or at least see what is going on...
950 * (AIA)
951 */
952 rc = ntfs_collate_names(name, name_len, al_name,
953 al_name_len, 1, CASE_SENSITIVE,
954 vol->upcase, vol->upcase_len);
955 if (rc == -1)
956 goto not_found;
957 if (rc)
958 continue;
959 }
960 /*
961 * The names match or @name not present and attribute is
962 * unnamed. Now check @lowest_vcn. Continue search if the
963 * next attribute list entry still fits @lowest_vcn. Otherwise
964 * we have reached the right one or the search has failed.
965 */
966 if (lowest_vcn && (u8*)next_al_entry >= al_start &&
967 (u8*)next_al_entry + 6 < al_end &&
968 (u8*)next_al_entry + le16_to_cpu(
969 next_al_entry->length) <= al_end &&
970 sle64_to_cpu(next_al_entry->lowest_vcn) <=
971 lowest_vcn &&
972 next_al_entry->type == al_entry->type &&
973 next_al_entry->name_length == al_name_len &&
974 ntfs_are_names_equal((ntfschar*)((u8*)
975 next_al_entry +
976 next_al_entry->name_offset),
977 next_al_entry->name_length,
978 al_name, al_name_len, CASE_SENSITIVE,
979 vol->upcase, vol->upcase_len))
980 continue;
981 if (MREF_LE(al_entry->mft_reference) == ni->mft_no) {
982 if (MSEQNO_LE(al_entry->mft_reference) != ni->seq_no) {
983 ntfs_error(vol->sb, "Found stale mft "
984 "reference in attribute list "
985 "of base inode 0x%lx.%s",
986 base_ni->mft_no, es);
987 err = -EIO;
988 break;
989 }
990 } else { /* Mft references do not match. */
991 /* If there is a mapped record unmap it first. */
992 if (ni != base_ni)
993 unmap_extent_mft_record(ni);
994 /* Do we want the base record back? */
995 if (MREF_LE(al_entry->mft_reference) ==
996 base_ni->mft_no) {
997 ni = ctx->ntfs_ino = base_ni;
998 ctx->mrec = ctx->base_mrec;
999 } else {
1000 /* We want an extent record. */
1001 ctx->mrec = map_extent_mft_record(base_ni,
1002 le64_to_cpu(
1003 al_entry->mft_reference), &ni);
1004 if (IS_ERR(ctx->mrec)) {
1005 ntfs_error(vol->sb, "Failed to map "
1006 "extent mft record "
1007 "0x%lx of base inode "
1008 "0x%lx.%s",
1009 MREF_LE(al_entry->
1010 mft_reference),
1011 base_ni->mft_no, es);
1012 err = PTR_ERR(ctx->mrec);
1013 if (err == -ENOENT)
1014 err = -EIO;
1015 /* Cause @ctx to be sanitized below. */
1016 ni = NULL;
1017 break;
1018 }
1019 ctx->ntfs_ino = ni;
1020 }
1021 ctx->attr = (ATTR_RECORD*)((u8*)ctx->mrec +
1022 le16_to_cpu(ctx->mrec->attrs_offset));
1023 }
1024 /*
1025 * ctx->vfs_ino, ctx->mrec, and ctx->attr now point to the
1026 * mft record containing the attribute represented by the
1027 * current al_entry.
1028 */
1029 /*
1030 * We could call into ntfs_attr_find() to find the right
1031 * attribute in this mft record but this would be less
1032 * efficient and not quite accurate as ntfs_attr_find() ignores
1033 * the attribute instance numbers for example which become
1034 * important when one plays with attribute lists. Also,
1035 * because a proper match has been found in the attribute list
1036 * entry above, the comparison can now be optimized. So it is
1037 * worth re-implementing a simplified ntfs_attr_find() here.
1038 */
1039 a = ctx->attr;
1040 /*
1041 * Use a manual loop so we can still use break and continue
1042 * with the same meanings as above.
1043 */
1044 do_next_attr_loop:
1045 if ((u8*)a < (u8*)ctx->mrec || (u8*)a > (u8*)ctx->mrec +
1046 le32_to_cpu(ctx->mrec->bytes_allocated))
1047 break;
1048 if (a->type == AT_END)
1049 break;
1050 if (!a->length)
1051 break;
1052 if (al_entry->instance != a->instance)
1053 goto do_next_attr;
1054 /*
1055 * If the type and/or the name are mismatched between the
1056 * attribute list entry and the attribute record, there is
1057 * corruption so we break and return error EIO.
1058 */
1059 if (al_entry->type != a->type)
1060 break;
1061 if (!ntfs_are_names_equal((ntfschar*)((u8*)a +
1062 le16_to_cpu(a->name_offset)), a->name_length,
1063 al_name, al_name_len, CASE_SENSITIVE,
1064 vol->upcase, vol->upcase_len))
1065 break;
1066 ctx->attr = a;
1067 /*
1068 * If no @val specified or @val specified and it matches, we
1069 * have found it!
1070 */
1071 if (!val || (!a->non_resident && le32_to_cpu(
1072 a->data.resident.value_length) == val_len &&
1073 !memcmp((u8*)a +
1074 le16_to_cpu(a->data.resident.value_offset),
1075 val, val_len))) {
1076 ntfs_debug("Done, found.");
1077 return 0;
1078 }
1079 do_next_attr:
1080 /* Proceed to the next attribute in the current mft record. */
1081 a = (ATTR_RECORD*)((u8*)a + le32_to_cpu(a->length));
1082 goto do_next_attr_loop;
1083 }
1084 if (!err) {
1085 ntfs_error(vol->sb, "Base inode 0x%lx contains corrupt "
1086 "attribute list attribute.%s", base_ni->mft_no,
1087 es);
1088 err = -EIO;
1089 }
1090 if (ni != base_ni) {
1091 if (ni)
1092 unmap_extent_mft_record(ni);
1093 ctx->ntfs_ino = base_ni;
1094 ctx->mrec = ctx->base_mrec;
1095 ctx->attr = ctx->base_attr;
1096 }
1097 if (err != -ENOMEM)
1098 NVolSetErrors(vol);
1099 return err;
1100 not_found:
1101 /*
1102 * If we were looking for AT_END, we reset the search context @ctx and
1103 * use ntfs_attr_find() to seek to the end of the base mft record.
1104 */
1105 if (type == AT_END) {
1106 ntfs_attr_reinit_search_ctx(ctx);
1107 return ntfs_attr_find(AT_END, name, name_len, ic, val, val_len,
1108 ctx);
1109 }
1110 /*
1111 * The attribute was not found. Before we return, we want to ensure
1112 * @ctx->mrec and @ctx->attr indicate the position at which the
1113 * attribute should be inserted in the base mft record. Since we also
1114 * want to preserve @ctx->al_entry we cannot reinitialize the search
1115 * context using ntfs_attr_reinit_search_ctx() as this would set
1116 * @ctx->al_entry to NULL. Thus we do the necessary bits manually (see
1117 * ntfs_attr_init_search_ctx() below). Note, we _only_ preserve
1118 * @ctx->al_entry as the remaining fields (base_*) are identical to
1119 * their non base_ counterparts and we cannot set @ctx->base_attr
1120 * correctly yet as we do not know what @ctx->attr will be set to by
1121 * the call to ntfs_attr_find() below.
1122 */
1123 if (ni != base_ni)
1124 unmap_extent_mft_record(ni);
1125 ctx->mrec = ctx->base_mrec;
1126 ctx->attr = (ATTR_RECORD*)((u8*)ctx->mrec +
1127 le16_to_cpu(ctx->mrec->attrs_offset));
1128 ctx->is_first = true;
1129 ctx->ntfs_ino = base_ni;
1130 ctx->base_ntfs_ino = NULL;
1131 ctx->base_mrec = NULL;
1132 ctx->base_attr = NULL;
1133 /*
1134 * In case there are multiple matches in the base mft record, need to
1135 * keep enumerating until we get an attribute not found response (or
1136 * another error), otherwise we would keep returning the same attribute
1137 * over and over again and all programs using us for enumeration would
1138 * lock up in a tight loop.
1139 */
1140 do {
1141 err = ntfs_attr_find(type, name, name_len, ic, val, val_len,
1142 ctx);
1143 } while (!err);
1144 ntfs_debug("Done, not found.");
1145 return err;
1146 }
1147
1148 /**
1149 * ntfs_attr_lookup - find an attribute in an ntfs inode
1150 * @type: attribute type to find
1151 * @name: attribute name to find (optional, i.e. NULL means don't care)
1152 * @name_len: attribute name length (only needed if @name present)
1153 * @ic: IGNORE_CASE or CASE_SENSITIVE (ignored if @name not present)
1154 * @lowest_vcn: lowest vcn to find (optional, non-resident attributes only)
1155 * @val: attribute value to find (optional, resident attributes only)
1156 * @val_len: attribute value length
1157 * @ctx: search context with mft record and attribute to search from
1158 *
1159 * Find an attribute in an ntfs inode. On first search @ctx->ntfs_ino must
1160 * be the base mft record and @ctx must have been obtained from a call to
1161 * ntfs_attr_get_search_ctx().
1162 *
1163 * This function transparently handles attribute lists and @ctx is used to
1164 * continue searches where they were left off at.
1165 *
1166 * After finishing with the attribute/mft record you need to call
1167 * ntfs_attr_put_search_ctx() to cleanup the search context (unmapping any
1168 * mapped inodes, etc).
1169 *
1170 * Return 0 if the search was successful and -errno if not.
1171 *
1172 * When 0, @ctx->attr is the found attribute and it is in mft record
1173 * @ctx->mrec. If an attribute list attribute is present, @ctx->al_entry is
1174 * the attribute list entry of the found attribute.
1175 *
1176 * When -ENOENT, @ctx->attr is the attribute which collates just after the
1177 * attribute being searched for, i.e. if one wants to add the attribute to the
1178 * mft record this is the correct place to insert it into. If an attribute
1179 * list attribute is present, @ctx->al_entry is the attribute list entry which
1180 * collates just after the attribute list entry of the attribute being searched
1181 * for, i.e. if one wants to add the attribute to the mft record this is the
1182 * correct place to insert its attribute list entry into.
1183 *
1184 * When -errno != -ENOENT, an error occurred during the lookup. @ctx->attr is
1185 * then undefined and in particular you should not rely on it not changing.
1186 */
1187 int ntfs_attr_lookup(const ATTR_TYPE type, const ntfschar *name,
1188 const u32 name_len, const IGNORE_CASE_BOOL ic,
1189 const VCN lowest_vcn, const u8 *val, const u32 val_len,
1190 ntfs_attr_search_ctx *ctx)
1191 {
1192 ntfs_inode *base_ni;
1193
1194 ntfs_debug("Entering.");
1195 BUG_ON(IS_ERR(ctx->mrec));
1196 if (ctx->base_ntfs_ino)
1197 base_ni = ctx->base_ntfs_ino;
1198 else
1199 base_ni = ctx->ntfs_ino;
1200 /* Sanity check, just for debugging really. */
1201 BUG_ON(!base_ni);
1202 if (!NInoAttrList(base_ni) || type == AT_ATTRIBUTE_LIST)
1203 return ntfs_attr_find(type, name, name_len, ic, val, val_len,
1204 ctx);
1205 return ntfs_external_attr_find(type, name, name_len, ic, lowest_vcn,
1206 val, val_len, ctx);
1207 }
1208
1209 /**
1210 * ntfs_attr_init_search_ctx - initialize an attribute search context
1211 * @ctx: attribute search context to initialize
1212 * @ni: ntfs inode with which to initialize the search context
1213 * @mrec: mft record with which to initialize the search context
1214 *
1215 * Initialize the attribute search context @ctx with @ni and @mrec.
1216 */
1217 static inline void ntfs_attr_init_search_ctx(ntfs_attr_search_ctx *ctx,
1218 ntfs_inode *ni, MFT_RECORD *mrec)
1219 {
1220 *ctx = (ntfs_attr_search_ctx) {
1221 .mrec = mrec,
1222 /* Sanity checks are performed elsewhere. */
1223 .attr = (ATTR_RECORD*)((u8*)mrec +
1224 le16_to_cpu(mrec->attrs_offset)),
1225 .is_first = true,
1226 .ntfs_ino = ni,
1227 };
1228 }
1229
1230 /**
1231 * ntfs_attr_reinit_search_ctx - reinitialize an attribute search context
1232 * @ctx: attribute search context to reinitialize
1233 *
1234 * Reinitialize the attribute search context @ctx, unmapping an associated
1235 * extent mft record if present, and initialize the search context again.
1236 *
1237 * This is used when a search for a new attribute is being started to reset
1238 * the search context to the beginning.
1239 */
1240 void ntfs_attr_reinit_search_ctx(ntfs_attr_search_ctx *ctx)
1241 {
1242 if (likely(!ctx->base_ntfs_ino)) {
1243 /* No attribute list. */
1244 ctx->is_first = true;
1245 /* Sanity checks are performed elsewhere. */
1246 ctx->attr = (ATTR_RECORD*)((u8*)ctx->mrec +
1247 le16_to_cpu(ctx->mrec->attrs_offset));
1248 /*
1249 * This needs resetting due to ntfs_external_attr_find() which
1250 * can leave it set despite having zeroed ctx->base_ntfs_ino.
1251 */
1252 ctx->al_entry = NULL;
1253 return;
1254 } /* Attribute list. */
1255 if (ctx->ntfs_ino != ctx->base_ntfs_ino)
1256 unmap_extent_mft_record(ctx->ntfs_ino);
1257 ntfs_attr_init_search_ctx(ctx, ctx->base_ntfs_ino, ctx->base_mrec);
1258 return;
1259 }
1260
1261 /**
1262 * ntfs_attr_get_search_ctx - allocate/initialize a new attribute search context
1263 * @ni: ntfs inode with which to initialize the search context
1264 * @mrec: mft record with which to initialize the search context
1265 *
1266 * Allocate a new attribute search context, initialize it with @ni and @mrec,
1267 * and return it. Return NULL if allocation failed.
1268 */
1269 ntfs_attr_search_ctx *ntfs_attr_get_search_ctx(ntfs_inode *ni, MFT_RECORD *mrec)
1270 {
1271 ntfs_attr_search_ctx *ctx;
1272
1273 ctx = kmem_cache_alloc(ntfs_attr_ctx_cache, GFP_NOFS);
1274 if (ctx)
1275 ntfs_attr_init_search_ctx(ctx, ni, mrec);
1276 return ctx;
1277 }
1278
1279 /**
1280 * ntfs_attr_put_search_ctx - release an attribute search context
1281 * @ctx: attribute search context to free
1282 *
1283 * Release the attribute search context @ctx, unmapping an associated extent
1284 * mft record if present.
1285 */
1286 void ntfs_attr_put_search_ctx(ntfs_attr_search_ctx *ctx)
1287 {
1288 if (ctx->base_ntfs_ino && ctx->ntfs_ino != ctx->base_ntfs_ino)
1289 unmap_extent_mft_record(ctx->ntfs_ino);
1290 kmem_cache_free(ntfs_attr_ctx_cache, ctx);
1291 return;
1292 }
1293
1294 #ifdef NTFS_RW
1295
1296 /**
1297 * ntfs_attr_find_in_attrdef - find an attribute in the $AttrDef system file
1298 * @vol: ntfs volume to which the attribute belongs
1299 * @type: attribute type which to find
1300 *
1301 * Search for the attribute definition record corresponding to the attribute
1302 * @type in the $AttrDef system file.
1303 *
1304 * Return the attribute type definition record if found and NULL if not found.
1305 */
1306 static ATTR_DEF *ntfs_attr_find_in_attrdef(const ntfs_volume *vol,
1307 const ATTR_TYPE type)
1308 {
1309 ATTR_DEF *ad;
1310
1311 BUG_ON(!vol->attrdef);
1312 BUG_ON(!type);
1313 for (ad = vol->attrdef; (u8*)ad - (u8*)vol->attrdef <
1314 vol->attrdef_size && ad->type; ++ad) {
1315 /* We have not found it yet, carry on searching. */
1316 if (likely(le32_to_cpu(ad->type) < le32_to_cpu(type)))
1317 continue;
1318 /* We found the attribute; return it. */
1319 if (likely(ad->type == type))
1320 return ad;
1321 /* We have gone too far already. No point in continuing. */
1322 break;
1323 }
1324 /* Attribute not found. */
1325 ntfs_debug("Attribute type 0x%x not found in $AttrDef.",
1326 le32_to_cpu(type));
1327 return NULL;
1328 }
1329
1330 /**
1331 * ntfs_attr_size_bounds_check - check a size of an attribute type for validity
1332 * @vol: ntfs volume to which the attribute belongs
1333 * @type: attribute type which to check
1334 * @size: size which to check
1335 *
1336 * Check whether the @size in bytes is valid for an attribute of @type on the
1337 * ntfs volume @vol. This information is obtained from $AttrDef system file.
1338 *
1339 * Return 0 if valid, -ERANGE if not valid, or -ENOENT if the attribute is not
1340 * listed in $AttrDef.
1341 */
1342 int ntfs_attr_size_bounds_check(const ntfs_volume *vol, const ATTR_TYPE type,
1343 const s64 size)
1344 {
1345 ATTR_DEF *ad;
1346
1347 BUG_ON(size < 0);
1348 /*
1349 * $ATTRIBUTE_LIST has a maximum size of 256kiB, but this is not
1350 * listed in $AttrDef.
1351 */
1352 if (unlikely(type == AT_ATTRIBUTE_LIST && size > 256 * 1024))
1353 return -ERANGE;
1354 /* Get the $AttrDef entry for the attribute @type. */
1355 ad = ntfs_attr_find_in_attrdef(vol, type);
1356 if (unlikely(!ad))
1357 return -ENOENT;
1358 /* Do the bounds check. */
1359 if (((sle64_to_cpu(ad->min_size) > 0) &&
1360 size < sle64_to_cpu(ad->min_size)) ||
1361 ((sle64_to_cpu(ad->max_size) > 0) && size >
1362 sle64_to_cpu(ad->max_size)))
1363 return -ERANGE;
1364 return 0;
1365 }
1366
1367 /**
1368 * ntfs_attr_can_be_non_resident - check if an attribute can be non-resident
1369 * @vol: ntfs volume to which the attribute belongs
1370 * @type: attribute type which to check
1371 *
1372 * Check whether the attribute of @type on the ntfs volume @vol is allowed to
1373 * be non-resident. This information is obtained from $AttrDef system file.
1374 *
1375 * Return 0 if the attribute is allowed to be non-resident, -EPERM if not, and
1376 * -ENOENT if the attribute is not listed in $AttrDef.
1377 */
1378 int ntfs_attr_can_be_non_resident(const ntfs_volume *vol, const ATTR_TYPE type)
1379 {
1380 ATTR_DEF *ad;
1381
1382 /* Find the attribute definition record in $AttrDef. */
1383 ad = ntfs_attr_find_in_attrdef(vol, type);
1384 if (unlikely(!ad))
1385 return -ENOENT;
1386 /* Check the flags and return the result. */
1387 if (ad->flags & ATTR_DEF_RESIDENT)
1388 return -EPERM;
1389 return 0;
1390 }
1391
1392 /**
1393 * ntfs_attr_can_be_resident - check if an attribute can be resident
1394 * @vol: ntfs volume to which the attribute belongs
1395 * @type: attribute type which to check
1396 *
1397 * Check whether the attribute of @type on the ntfs volume @vol is allowed to
1398 * be resident. This information is derived from our ntfs knowledge and may
1399 * not be completely accurate, especially when user defined attributes are
1400 * present. Basically we allow everything to be resident except for index
1401 * allocation and $EA attributes.
1402 *
1403 * Return 0 if the attribute is allowed to be non-resident and -EPERM if not.
1404 *
1405 * Warning: In the system file $MFT the attribute $Bitmap must be non-resident
1406 * otherwise windows will not boot (blue screen of death)! We cannot
1407 * check for this here as we do not know which inode's $Bitmap is
1408 * being asked about so the caller needs to special case this.
1409 */
1410 int ntfs_attr_can_be_resident(const ntfs_volume *vol, const ATTR_TYPE type)
1411 {
1412 if (type == AT_INDEX_ALLOCATION)
1413 return -EPERM;
1414 return 0;
1415 }
1416
1417 /**
1418 * ntfs_attr_record_resize - resize an attribute record
1419 * @m: mft record containing attribute record
1420 * @a: attribute record to resize
1421 * @new_size: new size in bytes to which to resize the attribute record @a
1422 *
1423 * Resize the attribute record @a, i.e. the resident part of the attribute, in
1424 * the mft record @m to @new_size bytes.
1425 *
1426 * Return 0 on success and -errno on error. The following error codes are
1427 * defined:
1428 * -ENOSPC - Not enough space in the mft record @m to perform the resize.
1429 *
1430 * Note: On error, no modifications have been performed whatsoever.
1431 *
1432 * Warning: If you make a record smaller without having copied all the data you
1433 * are interested in the data may be overwritten.
1434 */
1435 int ntfs_attr_record_resize(MFT_RECORD *m, ATTR_RECORD *a, u32 new_size)
1436 {
1437 ntfs_debug("Entering for new_size %u.", new_size);
1438 /* Align to 8 bytes if it is not already done. */
1439 if (new_size & 7)
1440 new_size = (new_size + 7) & ~7;
1441 /* If the actual attribute length has changed, move things around. */
1442 if (new_size != le32_to_cpu(a->length)) {
1443 u32 new_muse = le32_to_cpu(m->bytes_in_use) -
1444 le32_to_cpu(a->length) + new_size;
1445 /* Not enough space in this mft record. */
1446 if (new_muse > le32_to_cpu(m->bytes_allocated))
1447 return -ENOSPC;
1448 /* Move attributes following @a to their new location. */
1449 memmove((u8*)a + new_size, (u8*)a + le32_to_cpu(a->length),
1450 le32_to_cpu(m->bytes_in_use) - ((u8*)a -
1451 (u8*)m) - le32_to_cpu(a->length));
1452 /* Adjust @m to reflect the change in used space. */
1453 m->bytes_in_use = cpu_to_le32(new_muse);
1454 /* Adjust @a to reflect the new size. */
1455 if (new_size >= offsetof(ATTR_REC, length) + sizeof(a->length))
1456 a->length = cpu_to_le32(new_size);
1457 }
1458 return 0;
1459 }
1460
1461 /**
1462 * ntfs_resident_attr_value_resize - resize the value of a resident attribute
1463 * @m: mft record containing attribute record
1464 * @a: attribute record whose value to resize
1465 * @new_size: new size in bytes to which to resize the attribute value of @a
1466 *
1467 * Resize the value of the attribute @a in the mft record @m to @new_size bytes.
1468 * If the value is made bigger, the newly allocated space is cleared.
1469 *
1470 * Return 0 on success and -errno on error. The following error codes are
1471 * defined:
1472 * -ENOSPC - Not enough space in the mft record @m to perform the resize.
1473 *
1474 * Note: On error, no modifications have been performed whatsoever.
1475 *
1476 * Warning: If you make a record smaller without having copied all the data you
1477 * are interested in the data may be overwritten.
1478 */
1479 int ntfs_resident_attr_value_resize(MFT_RECORD *m, ATTR_RECORD *a,
1480 const u32 new_size)
1481 {
1482 u32 old_size;
1483
1484 /* Resize the resident part of the attribute record. */
1485 if (ntfs_attr_record_resize(m, a,
1486 le16_to_cpu(a->data.resident.value_offset) + new_size))
1487 return -ENOSPC;
1488 /*
1489 * The resize succeeded! If we made the attribute value bigger, clear
1490 * the area between the old size and @new_size.
1491 */
1492 old_size = le32_to_cpu(a->data.resident.value_length);
1493 if (new_size > old_size)
1494 memset((u8*)a + le16_to_cpu(a->data.resident.value_offset) +
1495 old_size, 0, new_size - old_size);
1496 /* Finally update the length of the attribute value. */
1497 a->data.resident.value_length = cpu_to_le32(new_size);
1498 return 0;
1499 }
1500
1501 /**
1502 * ntfs_attr_make_non_resident - convert a resident to a non-resident attribute
1503 * @ni: ntfs inode describing the attribute to convert
1504 * @data_size: size of the resident data to copy to the non-resident attribute
1505 *
1506 * Convert the resident ntfs attribute described by the ntfs inode @ni to a
1507 * non-resident one.
1508 *
1509 * @data_size must be equal to the attribute value size. This is needed since
1510 * we need to know the size before we can map the mft record and our callers
1511 * always know it. The reason we cannot simply read the size from the vfs
1512 * inode i_size is that this is not necessarily uptodate. This happens when
1513 * ntfs_attr_make_non_resident() is called in the ->truncate call path(s).
1514 *
1515 * Return 0 on success and -errno on error. The following error return codes
1516 * are defined:
1517 * -EPERM - The attribute is not allowed to be non-resident.
1518 * -ENOMEM - Not enough memory.
1519 * -ENOSPC - Not enough disk space.
1520 * -EINVAL - Attribute not defined on the volume.
1521 * -EIO - I/o error or other error.
1522 * Note that -ENOSPC is also returned in the case that there is not enough
1523 * space in the mft record to do the conversion. This can happen when the mft
1524 * record is already very full. The caller is responsible for trying to make
1525 * space in the mft record and trying again. FIXME: Do we need a separate
1526 * error return code for this kind of -ENOSPC or is it always worth trying
1527 * again in case the attribute may then fit in a resident state so no need to
1528 * make it non-resident at all? Ho-hum... (AIA)
1529 *
1530 * NOTE to self: No changes in the attribute list are required to move from
1531 * a resident to a non-resident attribute.
1532 *
1533 * Locking: - The caller must hold i_mutex on the inode.
1534 */
1535 int ntfs_attr_make_non_resident(ntfs_inode *ni, const u32 data_size) 1535 int ntfs_attr_make_non_resident(ntfs_inode *ni, const u32 data_size)
1536 { 1536 {
1537 s64 new_size; 1537 s64 new_size;
1538 struct inode *vi = VFS_I(ni); 1538 struct inode *vi = VFS_I(ni);
1539 ntfs_volume *vol = ni->vol; 1539 ntfs_volume *vol = ni->vol;
1540 ntfs_inode *base_ni; 1540 ntfs_inode *base_ni;
1541 MFT_RECORD *m; 1541 MFT_RECORD *m;
1542 ATTR_RECORD *a; 1542 ATTR_RECORD *a;
1543 ntfs_attr_search_ctx *ctx; 1543 ntfs_attr_search_ctx *ctx;
1544 struct page *page; 1544 struct page *page;
1545 runlist_element *rl; 1545 runlist_element *rl;
1546 u8 *kaddr; 1546 u8 *kaddr;
1547 unsigned long flags; 1547 unsigned long flags;
1548 int mp_size, mp_ofs, name_ofs, arec_size, err, err2; 1548 int mp_size, mp_ofs, name_ofs, arec_size, err, err2;
1549 u32 attr_size; 1549 u32 attr_size;
1550 u8 old_res_attr_flags; 1550 u8 old_res_attr_flags;
1551 1551
1552 /* Check that the attribute is allowed to be non-resident. */ 1552 /* Check that the attribute is allowed to be non-resident. */
1553 err = ntfs_attr_can_be_non_resident(vol, ni->type); 1553 err = ntfs_attr_can_be_non_resident(vol, ni->type);
1554 if (unlikely(err)) { 1554 if (unlikely(err)) {
1555 if (err == -EPERM) 1555 if (err == -EPERM)
1556 ntfs_debug("Attribute is not allowed to be " 1556 ntfs_debug("Attribute is not allowed to be "
1557 "non-resident."); 1557 "non-resident.");
1558 else 1558 else
1559 ntfs_debug("Attribute not defined on the NTFS " 1559 ntfs_debug("Attribute not defined on the NTFS "
1560 "volume!"); 1560 "volume!");
1561 return err; 1561 return err;
1562 } 1562 }
1563 /* 1563 /*
1564 * FIXME: Compressed and encrypted attributes are not supported when 1564 * FIXME: Compressed and encrypted attributes are not supported when
1565 * writing and we should never have gotten here for them. 1565 * writing and we should never have gotten here for them.
1566 */ 1566 */
1567 BUG_ON(NInoCompressed(ni)); 1567 BUG_ON(NInoCompressed(ni));
1568 BUG_ON(NInoEncrypted(ni)); 1568 BUG_ON(NInoEncrypted(ni));
1569 /* 1569 /*
1570 * The size needs to be aligned to a cluster boundary for allocation 1570 * The size needs to be aligned to a cluster boundary for allocation
1571 * purposes. 1571 * purposes.
1572 */ 1572 */
1573 new_size = (data_size + vol->cluster_size - 1) & 1573 new_size = (data_size + vol->cluster_size - 1) &
1574 ~(vol->cluster_size - 1); 1574 ~(vol->cluster_size - 1);
1575 if (new_size > 0) { 1575 if (new_size > 0) {
1576 /* 1576 /*
1577 * Will need the page later and since the page lock nests 1577 * Will need the page later and since the page lock nests
1578 * outside all ntfs locks, we need to get the page now. 1578 * outside all ntfs locks, we need to get the page now.
1579 */ 1579 */
1580 page = find_or_create_page(vi->i_mapping, 0, 1580 page = find_or_create_page(vi->i_mapping, 0,
1581 mapping_gfp_mask(vi->i_mapping)); 1581 mapping_gfp_mask(vi->i_mapping));
1582 if (unlikely(!page)) 1582 if (unlikely(!page))
1583 return -ENOMEM; 1583 return -ENOMEM;
1584 /* Start by allocating clusters to hold the attribute value. */ 1584 /* Start by allocating clusters to hold the attribute value. */
1585 rl = ntfs_cluster_alloc(vol, 0, new_size >> 1585 rl = ntfs_cluster_alloc(vol, 0, new_size >>
1586 vol->cluster_size_bits, -1, DATA_ZONE, true); 1586 vol->cluster_size_bits, -1, DATA_ZONE, true);
1587 if (IS_ERR(rl)) { 1587 if (IS_ERR(rl)) {
1588 err = PTR_ERR(rl); 1588 err = PTR_ERR(rl);
1589 ntfs_debug("Failed to allocate cluster%s, error code " 1589 ntfs_debug("Failed to allocate cluster%s, error code "
1590 "%i.", (new_size >> 1590 "%i.", (new_size >>
1591 vol->cluster_size_bits) > 1 ? "s" : "", 1591 vol->cluster_size_bits) > 1 ? "s" : "",
1592 err); 1592 err);
1593 goto page_err_out; 1593 goto page_err_out;
1594 } 1594 }
1595 } else { 1595 } else {
1596 rl = NULL; 1596 rl = NULL;
1597 page = NULL; 1597 page = NULL;
1598 } 1598 }
1599 /* Determine the size of the mapping pairs array. */ 1599 /* Determine the size of the mapping pairs array. */
1600 mp_size = ntfs_get_size_for_mapping_pairs(vol, rl, 0, -1); 1600 mp_size = ntfs_get_size_for_mapping_pairs(vol, rl, 0, -1);
1601 if (unlikely(mp_size < 0)) { 1601 if (unlikely(mp_size < 0)) {
1602 err = mp_size; 1602 err = mp_size;
1603 ntfs_debug("Failed to get size for mapping pairs array, error " 1603 ntfs_debug("Failed to get size for mapping pairs array, error "
1604 "code %i.", err); 1604 "code %i.", err);
1605 goto rl_err_out; 1605 goto rl_err_out;
1606 } 1606 }
1607 down_write(&ni->runlist.lock); 1607 down_write(&ni->runlist.lock);
1608 if (!NInoAttr(ni)) 1608 if (!NInoAttr(ni))
1609 base_ni = ni; 1609 base_ni = ni;
1610 else 1610 else
1611 base_ni = ni->ext.base_ntfs_ino; 1611 base_ni = ni->ext.base_ntfs_ino;
1612 m = map_mft_record(base_ni); 1612 m = map_mft_record(base_ni);
1613 if (IS_ERR(m)) { 1613 if (IS_ERR(m)) {
1614 err = PTR_ERR(m); 1614 err = PTR_ERR(m);
1615 m = NULL; 1615 m = NULL;
1616 ctx = NULL; 1616 ctx = NULL;
1617 goto err_out; 1617 goto err_out;
1618 } 1618 }
1619 ctx = ntfs_attr_get_search_ctx(base_ni, m); 1619 ctx = ntfs_attr_get_search_ctx(base_ni, m);
1620 if (unlikely(!ctx)) { 1620 if (unlikely(!ctx)) {
1621 err = -ENOMEM; 1621 err = -ENOMEM;
1622 goto err_out; 1622 goto err_out;
1623 } 1623 }
1624 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, 1624 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
1625 CASE_SENSITIVE, 0, NULL, 0, ctx); 1625 CASE_SENSITIVE, 0, NULL, 0, ctx);
1626 if (unlikely(err)) { 1626 if (unlikely(err)) {
1627 if (err == -ENOENT) 1627 if (err == -ENOENT)
1628 err = -EIO; 1628 err = -EIO;
1629 goto err_out; 1629 goto err_out;
1630 } 1630 }
1631 m = ctx->mrec; 1631 m = ctx->mrec;
1632 a = ctx->attr; 1632 a = ctx->attr;
1633 BUG_ON(NInoNonResident(ni)); 1633 BUG_ON(NInoNonResident(ni));
1634 BUG_ON(a->non_resident); 1634 BUG_ON(a->non_resident);
1635 /* 1635 /*
1636 * Calculate new offsets for the name and the mapping pairs array. 1636 * Calculate new offsets for the name and the mapping pairs array.
1637 */ 1637 */
1638 if (NInoSparse(ni) || NInoCompressed(ni)) 1638 if (NInoSparse(ni) || NInoCompressed(ni))
1639 name_ofs = (offsetof(ATTR_REC, 1639 name_ofs = (offsetof(ATTR_REC,
1640 data.non_resident.compressed_size) + 1640 data.non_resident.compressed_size) +
1641 sizeof(a->data.non_resident.compressed_size) + 1641 sizeof(a->data.non_resident.compressed_size) +
1642 7) & ~7; 1642 7) & ~7;
1643 else 1643 else
1644 name_ofs = (offsetof(ATTR_REC, 1644 name_ofs = (offsetof(ATTR_REC,
1645 data.non_resident.compressed_size) + 7) & ~7; 1645 data.non_resident.compressed_size) + 7) & ~7;
1646 mp_ofs = (name_ofs + a->name_length * sizeof(ntfschar) + 7) & ~7; 1646 mp_ofs = (name_ofs + a->name_length * sizeof(ntfschar) + 7) & ~7;
1647 /* 1647 /*
1648 * Determine the size of the resident part of the now non-resident 1648 * Determine the size of the resident part of the now non-resident
1649 * attribute record. 1649 * attribute record.
1650 */ 1650 */
1651 arec_size = (mp_ofs + mp_size + 7) & ~7; 1651 arec_size = (mp_ofs + mp_size + 7) & ~7;
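/*
 * For illustration (offsets assumed from the usual 0x40-byte
 * non-resident attribute header, mp_size picked arbitrarily): for an
 * unnamed, non-sparse, non-compressed attribute both name_ofs and
 * mp_ofs come out as 0x40, so with mp_size == 9 the record is sized as
 * arec_size = (0x40 + 9 + 7) & ~7 = 0x50, keeping everything 8-byte
 * aligned.
 */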
1652 /* 1652 /*
1653 * If the page is not uptodate bring it uptodate by copying from the 1653 * If the page is not uptodate bring it uptodate by copying from the
1654 * attribute value. 1654 * attribute value.
1655 */ 1655 */
1656 attr_size = le32_to_cpu(a->data.resident.value_length); 1656 attr_size = le32_to_cpu(a->data.resident.value_length);
1657 BUG_ON(attr_size != data_size); 1657 BUG_ON(attr_size != data_size);
1658 if (page && !PageUptodate(page)) { 1658 if (page && !PageUptodate(page)) {
1659 kaddr = kmap_atomic(page); 1659 kaddr = kmap_atomic(page);
1660 memcpy(kaddr, (u8*)a + 1660 memcpy(kaddr, (u8*)a +
1661 le16_to_cpu(a->data.resident.value_offset), 1661 le16_to_cpu(a->data.resident.value_offset),
1662 attr_size); 1662 attr_size);
1663 memset(kaddr + attr_size, 0, PAGE_CACHE_SIZE - attr_size); 1663 memset(kaddr + attr_size, 0, PAGE_CACHE_SIZE - attr_size);
1664 kunmap_atomic(kaddr); 1664 kunmap_atomic(kaddr);
1665 flush_dcache_page(page); 1665 flush_dcache_page(page);
1666 SetPageUptodate(page); 1666 SetPageUptodate(page);
1667 } 1667 }
1668 /* Backup the attribute flag. */ 1668 /* Backup the attribute flag. */
1669 old_res_attr_flags = a->data.resident.flags; 1669 old_res_attr_flags = a->data.resident.flags;
1670 /* Resize the resident part of the attribute record. */ 1670 /* Resize the resident part of the attribute record. */
1671 err = ntfs_attr_record_resize(m, a, arec_size); 1671 err = ntfs_attr_record_resize(m, a, arec_size);
1672 if (unlikely(err)) 1672 if (unlikely(err))
1673 goto err_out; 1673 goto err_out;
1674 /* 1674 /*
1675 * Convert the resident part of the attribute record to describe a 1675 * Convert the resident part of the attribute record to describe a
1676 * non-resident attribute. 1676 * non-resident attribute.
1677 */ 1677 */
1678 a->non_resident = 1; 1678 a->non_resident = 1;
1679 /* Move the attribute name if it exists and update the offset. */ 1679 /* Move the attribute name if it exists and update the offset. */
1680 if (a->name_length) 1680 if (a->name_length)
1681 memmove((u8*)a + name_ofs, (u8*)a + le16_to_cpu(a->name_offset), 1681 memmove((u8*)a + name_ofs, (u8*)a + le16_to_cpu(a->name_offset),
1682 a->name_length * sizeof(ntfschar)); 1682 a->name_length * sizeof(ntfschar));
1683 a->name_offset = cpu_to_le16(name_ofs); 1683 a->name_offset = cpu_to_le16(name_ofs);
1684 /* Setup the fields specific to non-resident attributes. */ 1684 /* Setup the fields specific to non-resident attributes. */
1685 a->data.non_resident.lowest_vcn = 0; 1685 a->data.non_resident.lowest_vcn = 0;
1686 a->data.non_resident.highest_vcn = cpu_to_sle64((new_size - 1) >> 1686 a->data.non_resident.highest_vcn = cpu_to_sle64((new_size - 1) >>
1687 vol->cluster_size_bits); 1687 vol->cluster_size_bits);
1688 a->data.non_resident.mapping_pairs_offset = cpu_to_le16(mp_ofs); 1688 a->data.non_resident.mapping_pairs_offset = cpu_to_le16(mp_ofs);
1689 memset(&a->data.non_resident.reserved, 0, 1689 memset(&a->data.non_resident.reserved, 0,
1690 sizeof(a->data.non_resident.reserved)); 1690 sizeof(a->data.non_resident.reserved));
1691 a->data.non_resident.allocated_size = cpu_to_sle64(new_size); 1691 a->data.non_resident.allocated_size = cpu_to_sle64(new_size);
1692 a->data.non_resident.data_size = 1692 a->data.non_resident.data_size =
1693 a->data.non_resident.initialized_size = 1693 a->data.non_resident.initialized_size =
1694 cpu_to_sle64(attr_size); 1694 cpu_to_sle64(attr_size);
1695 if (NInoSparse(ni) || NInoCompressed(ni)) { 1695 if (NInoSparse(ni) || NInoCompressed(ni)) {
1696 a->data.non_resident.compression_unit = 0; 1696 a->data.non_resident.compression_unit = 0;
1697 if (NInoCompressed(ni) || vol->major_ver < 3) 1697 if (NInoCompressed(ni) || vol->major_ver < 3)
1698 a->data.non_resident.compression_unit = 4; 1698 a->data.non_resident.compression_unit = 4;
1699 a->data.non_resident.compressed_size = 1699 a->data.non_resident.compressed_size =
1700 a->data.non_resident.allocated_size; 1700 a->data.non_resident.allocated_size;
1701 } else 1701 } else
1702 a->data.non_resident.compression_unit = 0; 1702 a->data.non_resident.compression_unit = 0;
1703 /* Generate the mapping pairs array into the attribute record. */ 1703 /* Generate the mapping pairs array into the attribute record. */
1704 err = ntfs_mapping_pairs_build(vol, (u8*)a + mp_ofs, 1704 err = ntfs_mapping_pairs_build(vol, (u8*)a + mp_ofs,
1705 arec_size - mp_ofs, rl, 0, -1, NULL); 1705 arec_size - mp_ofs, rl, 0, -1, NULL);
1706 if (unlikely(err)) { 1706 if (unlikely(err)) {
1707 ntfs_debug("Failed to build mapping pairs, error code %i.", 1707 ntfs_debug("Failed to build mapping pairs, error code %i.",
1708 err); 1708 err);
1709 goto undo_err_out; 1709 goto undo_err_out;
1710 } 1710 }
1711 /* Setup the in-memory attribute structure to be non-resident. */ 1711 /* Setup the in-memory attribute structure to be non-resident. */
1712 ni->runlist.rl = rl; 1712 ni->runlist.rl = rl;
1713 write_lock_irqsave(&ni->size_lock, flags); 1713 write_lock_irqsave(&ni->size_lock, flags);
1714 ni->allocated_size = new_size; 1714 ni->allocated_size = new_size;
1715 if (NInoSparse(ni) || NInoCompressed(ni)) { 1715 if (NInoSparse(ni) || NInoCompressed(ni)) {
1716 ni->itype.compressed.size = ni->allocated_size; 1716 ni->itype.compressed.size = ni->allocated_size;
1717 if (a->data.non_resident.compression_unit) { 1717 if (a->data.non_resident.compression_unit) {
1718 ni->itype.compressed.block_size = 1U << (a->data. 1718 ni->itype.compressed.block_size = 1U << (a->data.
1719 non_resident.compression_unit + 1719 non_resident.compression_unit +
1720 vol->cluster_size_bits); 1720 vol->cluster_size_bits);
1721 ni->itype.compressed.block_size_bits = 1721 ni->itype.compressed.block_size_bits =
1722 ffs(ni->itype.compressed.block_size) - 1722 ffs(ni->itype.compressed.block_size) -
1723 1; 1723 1;
1724 ni->itype.compressed.block_clusters = 1U << 1724 ni->itype.compressed.block_clusters = 1U <<
1725 a->data.non_resident.compression_unit; 1725 a->data.non_resident.compression_unit;
1726 } else { 1726 } else {
1727 ni->itype.compressed.block_size = 0; 1727 ni->itype.compressed.block_size = 0;
1728 ni->itype.compressed.block_size_bits = 0; 1728 ni->itype.compressed.block_size_bits = 0;
1729 ni->itype.compressed.block_clusters = 0; 1729 ni->itype.compressed.block_clusters = 0;
1730 } 1730 }
1731 vi->i_blocks = ni->itype.compressed.size >> 9; 1731 vi->i_blocks = ni->itype.compressed.size >> 9;
1732 } else 1732 } else
1733 vi->i_blocks = ni->allocated_size >> 9; 1733 vi->i_blocks = ni->allocated_size >> 9;
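/*
 * i_blocks is kept in 512-byte units, hence the ">> 9" above: an
 * allocated_size of 4096 bytes, for example, shows up as i_blocks == 8.
 */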
1734 write_unlock_irqrestore(&ni->size_lock, flags); 1734 write_unlock_irqrestore(&ni->size_lock, flags);
1735 /* 1735 /*
1736 * This needs to be last since the address space operations ->readpage 1736 * This needs to be last since the address space operations ->readpage
1737 * and ->writepage can run concurrently with us as they are not 1737 * and ->writepage can run concurrently with us as they are not
1738 * serialized on i_mutex. Note, we are not allowed to fail once we flip 1738 * serialized on i_mutex. Note, we are not allowed to fail once we flip
1739 * this switch, which is another reason to do this last. 1739 * this switch, which is another reason to do this last.
1740 */ 1740 */
1741 NInoSetNonResident(ni); 1741 NInoSetNonResident(ni);
1742 /* Mark the mft record dirty, so it gets written back. */ 1742 /* Mark the mft record dirty, so it gets written back. */
1743 flush_dcache_mft_record_page(ctx->ntfs_ino); 1743 flush_dcache_mft_record_page(ctx->ntfs_ino);
1744 mark_mft_record_dirty(ctx->ntfs_ino); 1744 mark_mft_record_dirty(ctx->ntfs_ino);
1745 ntfs_attr_put_search_ctx(ctx); 1745 ntfs_attr_put_search_ctx(ctx);
1746 unmap_mft_record(base_ni); 1746 unmap_mft_record(base_ni);
1747 up_write(&ni->runlist.lock); 1747 up_write(&ni->runlist.lock);
1748 if (page) { 1748 if (page) {
1749 set_page_dirty(page); 1749 set_page_dirty(page);
1750 unlock_page(page); 1750 unlock_page(page);
1751 mark_page_accessed(page);
1752 page_cache_release(page); 1751 page_cache_release(page);
1753 } 1752 }
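/*
 * The mark_page_accessed(page) call that used to sit between
 * unlock_page() and page_cache_release() above is intentionally gone:
 * find_or_create_page() now returns the page already marked accessed,
 * so calling mark_page_accessed() again here would be redundant.
 */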
1754 ntfs_debug("Done."); 1753 ntfs_debug("Done.");
1755 return 0; 1754 return 0;
1756 undo_err_out: 1755 undo_err_out:
1757 /* Convert the attribute back into a resident attribute. */ 1756 /* Convert the attribute back into a resident attribute. */
1758 a->non_resident = 0; 1757 a->non_resident = 0;
1759 /* Move the attribute name if it exists and update the offset. */ 1758 /* Move the attribute name if it exists and update the offset. */
1760 name_ofs = (offsetof(ATTR_RECORD, data.resident.reserved) + 1759 name_ofs = (offsetof(ATTR_RECORD, data.resident.reserved) +
1761 sizeof(a->data.resident.reserved) + 7) & ~7; 1760 sizeof(a->data.resident.reserved) + 7) & ~7;
1762 if (a->name_length) 1761 if (a->name_length)
1763 memmove((u8*)a + name_ofs, (u8*)a + le16_to_cpu(a->name_offset), 1762 memmove((u8*)a + name_ofs, (u8*)a + le16_to_cpu(a->name_offset),
1764 a->name_length * sizeof(ntfschar)); 1763 a->name_length * sizeof(ntfschar));
1765 mp_ofs = (name_ofs + a->name_length * sizeof(ntfschar) + 7) & ~7; 1764 mp_ofs = (name_ofs + a->name_length * sizeof(ntfschar) + 7) & ~7;
1766 a->name_offset = cpu_to_le16(name_ofs); 1765 a->name_offset = cpu_to_le16(name_ofs);
1767 arec_size = (mp_ofs + attr_size + 7) & ~7; 1766 arec_size = (mp_ofs + attr_size + 7) & ~7;
1768 /* Resize the resident part of the attribute record. */ 1767 /* Resize the resident part of the attribute record. */
1769 err2 = ntfs_attr_record_resize(m, a, arec_size); 1768 err2 = ntfs_attr_record_resize(m, a, arec_size);
1770 if (unlikely(err2)) { 1769 if (unlikely(err2)) {
1771 /* 1770 /*
1772 * This cannot happen (well if memory corruption is at work it 1771 * This cannot happen (well if memory corruption is at work it
1773 * could happen in theory), but deal with it as well as we can. 1772 * could happen in theory), but deal with it as well as we can.
1774 * If the old size is too small, truncate the attribute, 1773 * If the old size is too small, truncate the attribute,
1775 * otherwise simply give it a larger allocated size. 1774 * otherwise simply give it a larger allocated size.
1776 * FIXME: Should check whether chkdsk complains when the 1775 * FIXME: Should check whether chkdsk complains when the
1777 * allocated size is much bigger than the resident value size. 1776 * allocated size is much bigger than the resident value size.
1778 */ 1777 */
1779 arec_size = le32_to_cpu(a->length); 1778 arec_size = le32_to_cpu(a->length);
1780 if ((mp_ofs + attr_size) > arec_size) { 1779 if ((mp_ofs + attr_size) > arec_size) {
1781 err2 = attr_size; 1780 err2 = attr_size;
1782 attr_size = arec_size - mp_ofs; 1781 attr_size = arec_size - mp_ofs;
1783 ntfs_error(vol->sb, "Failed to undo partial resident " 1782 ntfs_error(vol->sb, "Failed to undo partial resident "
1784 "to non-resident attribute " 1783 "to non-resident attribute "
1785 "conversion. Truncating inode 0x%lx, " 1784 "conversion. Truncating inode 0x%lx, "
1786 "attribute type 0x%x from %i bytes to " 1785 "attribute type 0x%x from %i bytes to "
1787 "%i bytes to maintain metadata " 1786 "%i bytes to maintain metadata "
1788 "consistency. THIS MEANS YOU ARE " 1787 "consistency. THIS MEANS YOU ARE "
1789 "LOSING %i BYTES DATA FROM THIS %s.", 1788 "LOSING %i BYTES DATA FROM THIS %s.",
1790 vi->i_ino, 1789 vi->i_ino,
1791 (unsigned)le32_to_cpu(ni->type), 1790 (unsigned)le32_to_cpu(ni->type),
1792 err2, attr_size, err2 - attr_size, 1791 err2, attr_size, err2 - attr_size,
1793 ((ni->type == AT_DATA) && 1792 ((ni->type == AT_DATA) &&
1794 !ni->name_len) ? "FILE": "ATTRIBUTE"); 1793 !ni->name_len) ? "FILE": "ATTRIBUTE");
1795 write_lock_irqsave(&ni->size_lock, flags); 1794 write_lock_irqsave(&ni->size_lock, flags);
1796 ni->initialized_size = attr_size; 1795 ni->initialized_size = attr_size;
1797 i_size_write(vi, attr_size); 1796 i_size_write(vi, attr_size);
1798 write_unlock_irqrestore(&ni->size_lock, flags); 1797 write_unlock_irqrestore(&ni->size_lock, flags);
1799 } 1798 }
1800 } 1799 }
1801 /* Setup the fields specific to resident attributes. */ 1800 /* Setup the fields specific to resident attributes. */
1802 a->data.resident.value_length = cpu_to_le32(attr_size); 1801 a->data.resident.value_length = cpu_to_le32(attr_size);
1803 a->data.resident.value_offset = cpu_to_le16(mp_ofs); 1802 a->data.resident.value_offset = cpu_to_le16(mp_ofs);
1804 a->data.resident.flags = old_res_attr_flags; 1803 a->data.resident.flags = old_res_attr_flags;
1805 memset(&a->data.resident.reserved, 0, 1804 memset(&a->data.resident.reserved, 0,
1806 sizeof(a->data.resident.reserved)); 1805 sizeof(a->data.resident.reserved));
1807 /* Copy the data from the page back to the attribute value. */ 1806 /* Copy the data from the page back to the attribute value. */
1808 if (page) { 1807 if (page) {
1809 kaddr = kmap_atomic(page); 1808 kaddr = kmap_atomic(page);
1810 memcpy((u8*)a + mp_ofs, kaddr, attr_size); 1809 memcpy((u8*)a + mp_ofs, kaddr, attr_size);
1811 kunmap_atomic(kaddr); 1810 kunmap_atomic(kaddr);
1812 } 1811 }
1813 /* Setup the allocated size in the ntfs inode in case it changed. */ 1812 /* Setup the allocated size in the ntfs inode in case it changed. */
1814 write_lock_irqsave(&ni->size_lock, flags); 1813 write_lock_irqsave(&ni->size_lock, flags);
1815 ni->allocated_size = arec_size - mp_ofs; 1814 ni->allocated_size = arec_size - mp_ofs;
1816 write_unlock_irqrestore(&ni->size_lock, flags); 1815 write_unlock_irqrestore(&ni->size_lock, flags);
1817 /* Mark the mft record dirty, so it gets written back. */ 1816 /* Mark the mft record dirty, so it gets written back. */
1818 flush_dcache_mft_record_page(ctx->ntfs_ino); 1817 flush_dcache_mft_record_page(ctx->ntfs_ino);
1819 mark_mft_record_dirty(ctx->ntfs_ino); 1818 mark_mft_record_dirty(ctx->ntfs_ino);
1820 err_out: 1819 err_out:
1821 if (ctx) 1820 if (ctx)
1822 ntfs_attr_put_search_ctx(ctx); 1821 ntfs_attr_put_search_ctx(ctx);
1823 if (m) 1822 if (m)
1824 unmap_mft_record(base_ni); 1823 unmap_mft_record(base_ni);
1825 ni->runlist.rl = NULL; 1824 ni->runlist.rl = NULL;
1826 up_write(&ni->runlist.lock); 1825 up_write(&ni->runlist.lock);
1827 rl_err_out: 1826 rl_err_out:
1828 if (rl) { 1827 if (rl) {
1829 if (ntfs_cluster_free_from_rl(vol, rl) < 0) { 1828 if (ntfs_cluster_free_from_rl(vol, rl) < 0) {
1830 ntfs_error(vol->sb, "Failed to release allocated " 1829 ntfs_error(vol->sb, "Failed to release allocated "
1831 "cluster(s) in error code path. Run " 1830 "cluster(s) in error code path. Run "
1832 "chkdsk to recover the lost " 1831 "chkdsk to recover the lost "
1833 "cluster(s)."); 1832 "cluster(s).");
1834 NVolSetErrors(vol); 1833 NVolSetErrors(vol);
1835 } 1834 }
1836 ntfs_free(rl); 1835 ntfs_free(rl);
1837 page_err_out: 1836 page_err_out:
1838 unlock_page(page); 1837 unlock_page(page);
1839 page_cache_release(page); 1838 page_cache_release(page);
1840 } 1839 }
1841 if (err == -EINVAL) 1840 if (err == -EINVAL)
1842 err = -EIO; 1841 err = -EIO;
1843 return err; 1842 return err;
1844 } 1843 }
1845 1844
1846 /** 1845 /**
1847 * ntfs_attr_extend_allocation - extend the allocated space of an attribute 1846 * ntfs_attr_extend_allocation - extend the allocated space of an attribute
1848 * @ni: ntfs inode of the attribute whose allocation to extend 1847 * @ni: ntfs inode of the attribute whose allocation to extend
1849 * @new_alloc_size: new size in bytes to which to extend the allocation 1848 * @new_alloc_size: new size in bytes to which to extend the allocation
1850 * @new_data_size: new size in bytes to which to extend the data 1849 * @new_data_size: new size in bytes to which to extend the data
1851 * @data_start: beginning of region which is required to be non-sparse 1850 * @data_start: beginning of region which is required to be non-sparse
1852 * 1851 *
1853 * Extend the allocated space of an attribute described by the ntfs inode @ni 1852 * Extend the allocated space of an attribute described by the ntfs inode @ni
1854 * to @new_alloc_size bytes. If @data_start is -1, the whole extension may be 1853 * to @new_alloc_size bytes. If @data_start is -1, the whole extension may be
1855 * implemented as a hole in the file (as long as both the volume and the ntfs 1854 * implemented as a hole in the file (as long as both the volume and the ntfs
1856 * inode @ni have sparse support enabled). If @data_start is >= 0, then the 1855 * inode @ni have sparse support enabled). If @data_start is >= 0, then the
1857 * region between the old allocated size and @data_start - 1 may be made sparse 1856 * region between the old allocated size and @data_start - 1 may be made sparse
1858 * but the regions between @data_start and @new_alloc_size must be backed by 1857 * but the regions between @data_start and @new_alloc_size must be backed by
1859 * actual clusters. 1858 * actual clusters.
1860 * 1859 *
1861 * If @new_data_size is -1, it is ignored. If it is >= 0, then the data size 1860 * If @new_data_size is -1, it is ignored. If it is >= 0, then the data size
1862 * of the attribute is extended to @new_data_size. Note that the i_size of the 1861 * of the attribute is extended to @new_data_size. Note that the i_size of the
1863 * vfs inode is not updated. Only the data size in the base attribute record 1862 * vfs inode is not updated. Only the data size in the base attribute record
1864 * is updated. The caller has to update i_size separately if this is required. 1863 * is updated. The caller has to update i_size separately if this is required.
1865 * WARNING: It is a BUG() for @new_data_size to be smaller than the old data 1864 * WARNING: It is a BUG() for @new_data_size to be smaller than the old data
1866 * size as well as for @new_data_size to be greater than @new_alloc_size. 1865 * size as well as for @new_data_size to be greater than @new_alloc_size.
1867 * 1866 *
1868 * For resident attributes this involves resizing the attribute record and if 1867 * For resident attributes this involves resizing the attribute record and if
1869 * necessary moving it and/or other attributes into extent mft records and/or 1868 * necessary moving it and/or other attributes into extent mft records and/or
1870 * converting the attribute to a non-resident attribute which in turn involves 1869 * converting the attribute to a non-resident attribute which in turn involves
1871 * extending the allocation of a non-resident attribute as described below. 1870 * extending the allocation of a non-resident attribute as described below.
1872 * 1871 *
1873 * For non-resident attributes this involves allocating clusters in the data 1872 * For non-resident attributes this involves allocating clusters in the data
1874 * zone on the volume (except for regions that are being made sparse) and 1873 * zone on the volume (except for regions that are being made sparse) and
1875 * extending the run list to describe the allocated clusters as well as 1874 * extending the run list to describe the allocated clusters as well as
1876 * updating the mapping pairs array of the attribute. This in turn involves 1875 * updating the mapping pairs array of the attribute. This in turn involves
1877 * resizing the attribute record and if necessary moving it and/or other 1876 * resizing the attribute record and if necessary moving it and/or other
1878 * attributes into extent mft records and/or splitting the attribute record 1877 * attributes into extent mft records and/or splitting the attribute record
1879 * into multiple extent attribute records. 1878 * into multiple extent attribute records.
1880 * 1879 *
1881 * Also, the attribute list attribute is updated if present and in some of the 1880 * Also, the attribute list attribute is updated if present and in some of the
1882 * above cases (the ones where extent mft records/attributes come into play), 1881 * above cases (the ones where extent mft records/attributes come into play),
1883 * an attribute list attribute is created if not already present. 1882 * an attribute list attribute is created if not already present.
1884 * 1883 *
1885 * Return the new allocated size on success and -errno on error. In the case 1884 * Return the new allocated size on success and -errno on error. In the case
1886 * that an error is encountered but a partial extension at least up to 1885 * that an error is encountered but a partial extension at least up to
1887 * @data_start (if present) is possible, the allocation is partially extended 1886 * @data_start (if present) is possible, the allocation is partially extended
1888 * and this is returned. This means the caller must check the returned size to 1887 * and this is returned. This means the caller must check the returned size to
1889 * determine if the extension was partial. If @data_start is -1 then partial 1888 * determine if the extension was partial. If @data_start is -1 then partial
1890 * allocations are not performed. 1889 * allocations are not performed.
1891 * 1890 *
1892 * WARNING: Do not call ntfs_attr_extend_allocation() for $MFT/$DATA. 1891 * WARNING: Do not call ntfs_attr_extend_allocation() for $MFT/$DATA.
1893 * 1892 *
1894 * Locking: This function takes the runlist lock of @ni for writing as well as 1893 * Locking: This function takes the runlist lock of @ni for writing as well as
1895 * locking the mft record of the base ntfs inode. These locks are maintained 1894 * locking the mft record of the base ntfs inode. These locks are maintained
1896 * throughout execution of the function. These locks are required so that the 1895 * throughout execution of the function. These locks are required so that the
1897 * attribute can be resized safely and so that it can for example be converted 1896 * attribute can be resized safely and so that it can for example be converted
1898 * from resident to non-resident safely. 1897 * from resident to non-resident safely.
1899 * 1898 *
1900 * TODO: At present attribute list attribute handling is not implemented. 1899 * TODO: At present attribute list attribute handling is not implemented.
1901 * 1900 *
1902 * TODO: At present it is not safe to call this function for anything other 1901 * TODO: At present it is not safe to call this function for anything other
1903 * than the $DATA attribute(s) of an uncompressed and unencrypted file. 1902 * than the $DATA attribute(s) of an uncompressed and unencrypted file.
1904 */ 1903 */
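/*
 * Hypothetical caller sketch (illustrative only, not code from this
 * driver; "got", "end" and "pos" are made-up names): the return value,
 * not the requested size, tells the caller how far it may write:
 *
 *	s64 got = ntfs_attr_extend_allocation(ni, end, end, pos);
 *	if (got < 0)
 *		return got;	// nothing could be extended
 *	if (got < end)
 *		end = got;	// partial extension: shorten the write
 *
 * since a partial extension reaching at least @data_start is still
 * reported as a (smaller) success.
 */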
1905 s64 ntfs_attr_extend_allocation(ntfs_inode *ni, s64 new_alloc_size, 1904 s64 ntfs_attr_extend_allocation(ntfs_inode *ni, s64 new_alloc_size,
1906 const s64 new_data_size, const s64 data_start) 1905 const s64 new_data_size, const s64 data_start)
1907 { 1906 {
1908 VCN vcn; 1907 VCN vcn;
1909 s64 ll, allocated_size, start = data_start; 1908 s64 ll, allocated_size, start = data_start;
1910 struct inode *vi = VFS_I(ni); 1909 struct inode *vi = VFS_I(ni);
1911 ntfs_volume *vol = ni->vol; 1910 ntfs_volume *vol = ni->vol;
1912 ntfs_inode *base_ni; 1911 ntfs_inode *base_ni;
1913 MFT_RECORD *m; 1912 MFT_RECORD *m;
1914 ATTR_RECORD *a; 1913 ATTR_RECORD *a;
1915 ntfs_attr_search_ctx *ctx; 1914 ntfs_attr_search_ctx *ctx;
1916 runlist_element *rl, *rl2; 1915 runlist_element *rl, *rl2;
1917 unsigned long flags; 1916 unsigned long flags;
1918 int err, mp_size; 1917 int err, mp_size;
1919 u32 attr_len = 0; /* Silence stupid gcc warning. */ 1918 u32 attr_len = 0; /* Silence stupid gcc warning. */
1920 bool mp_rebuilt; 1919 bool mp_rebuilt;
1921 1920
1922 #ifdef DEBUG 1921 #ifdef DEBUG
1923 read_lock_irqsave(&ni->size_lock, flags); 1922 read_lock_irqsave(&ni->size_lock, flags);
1924 allocated_size = ni->allocated_size; 1923 allocated_size = ni->allocated_size;
1925 read_unlock_irqrestore(&ni->size_lock, flags); 1924 read_unlock_irqrestore(&ni->size_lock, flags);
1926 ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, " 1925 ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, "
1927 "old_allocated_size 0x%llx, " 1926 "old_allocated_size 0x%llx, "
1928 "new_allocated_size 0x%llx, new_data_size 0x%llx, " 1927 "new_allocated_size 0x%llx, new_data_size 0x%llx, "
1929 "data_start 0x%llx.", vi->i_ino, 1928 "data_start 0x%llx.", vi->i_ino,
1930 (unsigned)le32_to_cpu(ni->type), 1929 (unsigned)le32_to_cpu(ni->type),
1931 (unsigned long long)allocated_size, 1930 (unsigned long long)allocated_size,
1932 (unsigned long long)new_alloc_size, 1931 (unsigned long long)new_alloc_size,
1933 (unsigned long long)new_data_size, 1932 (unsigned long long)new_data_size,
1934 (unsigned long long)start); 1933 (unsigned long long)start);
1935 #endif 1934 #endif
1936 retry_extend: 1935 retry_extend:
1937 /* 1936 /*
1938 * For non-resident attributes, @start and @new_size need to be aligned 1937 * For non-resident attributes, @start and @new_size need to be aligned
1939 * to cluster boundaries for allocation purposes. 1938 * to cluster boundaries for allocation purposes.
1940 */ 1939 */
1941 if (NInoNonResident(ni)) { 1940 if (NInoNonResident(ni)) {
1942 if (start > 0) 1941 if (start > 0)
1943 start &= ~(s64)vol->cluster_size_mask; 1942 start &= ~(s64)vol->cluster_size_mask;
1944 new_alloc_size = (new_alloc_size + vol->cluster_size - 1) & 1943 new_alloc_size = (new_alloc_size + vol->cluster_size - 1) &
1945 ~(s64)vol->cluster_size_mask; 1944 ~(s64)vol->cluster_size_mask;
1946 } 1945 }
1947 BUG_ON(new_data_size >= 0 && new_data_size > new_alloc_size); 1946 BUG_ON(new_data_size >= 0 && new_data_size > new_alloc_size);
1948 /* Check if new size is allowed in $AttrDef. */ 1947 /* Check if new size is allowed in $AttrDef. */
1949 err = ntfs_attr_size_bounds_check(vol, ni->type, new_alloc_size); 1948 err = ntfs_attr_size_bounds_check(vol, ni->type, new_alloc_size);
1950 if (unlikely(err)) { 1949 if (unlikely(err)) {
1951 /* Only emit errors when the write will fail completely. */ 1950 /* Only emit errors when the write will fail completely. */
1952 read_lock_irqsave(&ni->size_lock, flags); 1951 read_lock_irqsave(&ni->size_lock, flags);
1953 allocated_size = ni->allocated_size; 1952 allocated_size = ni->allocated_size;
1954 read_unlock_irqrestore(&ni->size_lock, flags); 1953 read_unlock_irqrestore(&ni->size_lock, flags);
1955 if (start < 0 || start >= allocated_size) { 1954 if (start < 0 || start >= allocated_size) {
1956 if (err == -ERANGE) { 1955 if (err == -ERANGE) {
1957 ntfs_error(vol->sb, "Cannot extend allocation " 1956 ntfs_error(vol->sb, "Cannot extend allocation "
1958 "of inode 0x%lx, attribute " 1957 "of inode 0x%lx, attribute "
1959 "type 0x%x, because the new " 1958 "type 0x%x, because the new "
1960 "allocation would exceed the " 1959 "allocation would exceed the "
1961 "maximum allowed size for " 1960 "maximum allowed size for "
1962 "this attribute type.", 1961 "this attribute type.",
1963 vi->i_ino, (unsigned) 1962 vi->i_ino, (unsigned)
1964 le32_to_cpu(ni->type)); 1963 le32_to_cpu(ni->type));
1965 } else { 1964 } else {
1966 ntfs_error(vol->sb, "Cannot extend allocation " 1965 ntfs_error(vol->sb, "Cannot extend allocation "
1967 "of inode 0x%lx, attribute " 1966 "of inode 0x%lx, attribute "
1968 "type 0x%x, because this " 1967 "type 0x%x, because this "
1969 "attribute type is not " 1968 "attribute type is not "
1970 "defined on the NTFS volume. " 1969 "defined on the NTFS volume. "
1971 "Possible corruption! You " 1970 "Possible corruption! You "
1972 "should run chkdsk!", 1971 "should run chkdsk!",
1973 vi->i_ino, (unsigned) 1972 vi->i_ino, (unsigned)
1974 le32_to_cpu(ni->type)); 1973 le32_to_cpu(ni->type));
1975 } 1974 }
1976 } 1975 }
1977 /* Translate error code to be POSIX conformant for write(2). */ 1976 /* Translate error code to be POSIX conformant for write(2). */
1978 if (err == -ERANGE) 1977 if (err == -ERANGE)
1979 err = -EFBIG; 1978 err = -EFBIG;
1980 else 1979 else
1981 err = -EIO; 1980 err = -EIO;
1982 return err; 1981 return err;
1983 } 1982 }
1984 if (!NInoAttr(ni)) 1983 if (!NInoAttr(ni))
1985 base_ni = ni; 1984 base_ni = ni;
1986 else 1985 else
1987 base_ni = ni->ext.base_ntfs_ino; 1986 base_ni = ni->ext.base_ntfs_ino;
1988 /* 1987 /*
1989 * We will be modifying both the runlist (if non-resident) and the mft 1988 * We will be modifying both the runlist (if non-resident) and the mft
1990 * record so lock them both down. 1989 * record so lock them both down.
1991 */ 1990 */
1992 down_write(&ni->runlist.lock); 1991 down_write(&ni->runlist.lock);
1993 m = map_mft_record(base_ni); 1992 m = map_mft_record(base_ni);
1994 if (IS_ERR(m)) { 1993 if (IS_ERR(m)) {
1995 err = PTR_ERR(m); 1994 err = PTR_ERR(m);
1996 m = NULL; 1995 m = NULL;
1997 ctx = NULL; 1996 ctx = NULL;
1998 goto err_out; 1997 goto err_out;
1999 } 1998 }
2000 ctx = ntfs_attr_get_search_ctx(base_ni, m); 1999 ctx = ntfs_attr_get_search_ctx(base_ni, m);
2001 if (unlikely(!ctx)) { 2000 if (unlikely(!ctx)) {
2002 err = -ENOMEM; 2001 err = -ENOMEM;
2003 goto err_out; 2002 goto err_out;
2004 } 2003 }
2005 read_lock_irqsave(&ni->size_lock, flags); 2004 read_lock_irqsave(&ni->size_lock, flags);
2006 allocated_size = ni->allocated_size; 2005 allocated_size = ni->allocated_size;
2007 read_unlock_irqrestore(&ni->size_lock, flags); 2006 read_unlock_irqrestore(&ni->size_lock, flags);
2008 /* 2007 /*
2009 * If non-resident, seek to the last extent. If resident, there is 2008 * If non-resident, seek to the last extent. If resident, there is
2010 * only one extent, so seek to that. 2009 * only one extent, so seek to that.
2011 */ 2010 */
2012 vcn = NInoNonResident(ni) ? allocated_size >> vol->cluster_size_bits : 2011 vcn = NInoNonResident(ni) ? allocated_size >> vol->cluster_size_bits :
2013 0; 2012 0;
2014 /* 2013 /*
2015 * Abort if someone did the work whilst we waited for the locks. If we 2014 * Abort if someone did the work whilst we waited for the locks. If we
2016 * just converted the attribute from resident to non-resident it is 2015 * just converted the attribute from resident to non-resident it is
2017 * likely that exactly this has happened already. We cannot quite 2016 * likely that exactly this has happened already. We cannot quite
2018 * abort if we need to update the data size. 2017 * abort if we need to update the data size.
2019 */ 2018 */
2020 if (unlikely(new_alloc_size <= allocated_size)) { 2019 if (unlikely(new_alloc_size <= allocated_size)) {
2021 ntfs_debug("Allocated size already exceeds requested size."); 2020 ntfs_debug("Allocated size already exceeds requested size.");
2022 new_alloc_size = allocated_size; 2021 new_alloc_size = allocated_size;
2023 if (new_data_size < 0) 2022 if (new_data_size < 0)
2024 goto done; 2023 goto done;
2025 /* 2024 /*
2026 * We want the first attribute extent so that we can update the 2025 * We want the first attribute extent so that we can update the
2027 * data size. 2026 * data size.
2028 */ 2027 */
2029 vcn = 0; 2028 vcn = 0;
2030 } 2029 }
2031 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, 2030 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
2032 CASE_SENSITIVE, vcn, NULL, 0, ctx); 2031 CASE_SENSITIVE, vcn, NULL, 0, ctx);
2033 if (unlikely(err)) { 2032 if (unlikely(err)) {
2034 if (err == -ENOENT) 2033 if (err == -ENOENT)
2035 err = -EIO; 2034 err = -EIO;
2036 goto err_out; 2035 goto err_out;
2037 } 2036 }
2038 m = ctx->mrec; 2037 m = ctx->mrec;
2039 a = ctx->attr; 2038 a = ctx->attr;
2040 /* Use goto to reduce indentation. */ 2039 /* Use goto to reduce indentation. */
2041 if (a->non_resident) 2040 if (a->non_resident)
2042 goto do_non_resident_extend; 2041 goto do_non_resident_extend;
2043 BUG_ON(NInoNonResident(ni)); 2042 BUG_ON(NInoNonResident(ni));
2044 /* The total length of the attribute value. */ 2043 /* The total length of the attribute value. */
2045 attr_len = le32_to_cpu(a->data.resident.value_length); 2044 attr_len = le32_to_cpu(a->data.resident.value_length);
2046 /* 2045 /*
2047 * Extend the attribute record to be able to store the new attribute 2046 * Extend the attribute record to be able to store the new attribute
2048 * size. ntfs_attr_record_resize() will not do anything if the size is 2047 * size. ntfs_attr_record_resize() will not do anything if the size is
2049 * not changing. 2048 * not changing.
2050 */ 2049 */
2051 if (new_alloc_size < vol->mft_record_size && 2050 if (new_alloc_size < vol->mft_record_size &&
2052 !ntfs_attr_record_resize(m, a, 2051 !ntfs_attr_record_resize(m, a,
2053 le16_to_cpu(a->data.resident.value_offset) + 2052 le16_to_cpu(a->data.resident.value_offset) +
2054 new_alloc_size)) { 2053 new_alloc_size)) {
2055 /* The resize succeeded! */ 2054 /* The resize succeeded! */
2056 write_lock_irqsave(&ni->size_lock, flags); 2055 write_lock_irqsave(&ni->size_lock, flags);
2057 ni->allocated_size = le32_to_cpu(a->length) - 2056 ni->allocated_size = le32_to_cpu(a->length) -
2058 le16_to_cpu(a->data.resident.value_offset); 2057 le16_to_cpu(a->data.resident.value_offset);
2059 write_unlock_irqrestore(&ni->size_lock, flags); 2058 write_unlock_irqrestore(&ni->size_lock, flags);
2060 if (new_data_size >= 0) { 2059 if (new_data_size >= 0) {
2061 BUG_ON(new_data_size < attr_len); 2060 BUG_ON(new_data_size < attr_len);
2062 a->data.resident.value_length = 2061 a->data.resident.value_length =
2063 cpu_to_le32((u32)new_data_size); 2062 cpu_to_le32((u32)new_data_size);
2064 } 2063 }
2065 goto flush_done; 2064 goto flush_done;
2066 } 2065 }
2067 /* 2066 /*
2068 * We have to drop all the locks so we can call 2067 * We have to drop all the locks so we can call
2069 * ntfs_attr_make_non_resident(). This could be optimised by try- 2068 * ntfs_attr_make_non_resident(). This could be optimised by try-
2070 * locking the first page cache page and only if that fails dropping 2069 * locking the first page cache page and only if that fails dropping
2071 * the locks, locking the page, and redoing all the locking and 2070 * the locks, locking the page, and redoing all the locking and
2072 * lookups. While this would be a huge optimisation, it is not worth 2071 * lookups. While this would be a huge optimisation, it is not worth
2073 * it as this is definitely a slow code path. 2072 * it as this is definitely a slow code path.
2074 */ 2073 */
2075 ntfs_attr_put_search_ctx(ctx); 2074 ntfs_attr_put_search_ctx(ctx);
2076 unmap_mft_record(base_ni); 2075 unmap_mft_record(base_ni);
2077 up_write(&ni->runlist.lock); 2076 up_write(&ni->runlist.lock);
2078 /* 2077 /*
2079 * Not enough space in the mft record, try to make the attribute 2078 * Not enough space in the mft record, try to make the attribute
2080 * non-resident and if successful restart the extension process. 2079 * non-resident and if successful restart the extension process.
2081 */ 2080 */
2082 err = ntfs_attr_make_non_resident(ni, attr_len); 2081 err = ntfs_attr_make_non_resident(ni, attr_len);
2083 if (likely(!err)) 2082 if (likely(!err))
2084 goto retry_extend; 2083 goto retry_extend;
2085 /* 2084 /*
2086 * Could not make non-resident. If this is due to this not being 2085 * Could not make non-resident. If this is due to this not being
2087 * permitted for this attribute type or there not being enough space, 2086 * permitted for this attribute type or there not being enough space,
2088 * try to make other attributes non-resident. Otherwise fail. 2087 * try to make other attributes non-resident. Otherwise fail.
2089 */ 2088 */
2090 if (unlikely(err != -EPERM && err != -ENOSPC)) { 2089 if (unlikely(err != -EPERM && err != -ENOSPC)) {
2091 /* Only emit errors when the write will fail completely. */ 2090 /* Only emit errors when the write will fail completely. */
2092 read_lock_irqsave(&ni->size_lock, flags); 2091 read_lock_irqsave(&ni->size_lock, flags);
2093 allocated_size = ni->allocated_size; 2092 allocated_size = ni->allocated_size;
2094 read_unlock_irqrestore(&ni->size_lock, flags); 2093 read_unlock_irqrestore(&ni->size_lock, flags);
2095 if (start < 0 || start >= allocated_size) 2094 if (start < 0 || start >= allocated_size)
2096 ntfs_error(vol->sb, "Cannot extend allocation of " 2095 ntfs_error(vol->sb, "Cannot extend allocation of "
2097 "inode 0x%lx, attribute type 0x%x, " 2096 "inode 0x%lx, attribute type 0x%x, "
2098 "because the conversion from resident " 2097 "because the conversion from resident "
2099 "to non-resident attribute failed " 2098 "to non-resident attribute failed "
2100 "with error code %i.", vi->i_ino, 2099 "with error code %i.", vi->i_ino,
2101 (unsigned)le32_to_cpu(ni->type), err); 2100 (unsigned)le32_to_cpu(ni->type), err);
2102 if (err != -ENOMEM) 2101 if (err != -ENOMEM)
2103 err = -EIO; 2102 err = -EIO;
2104 goto conv_err_out; 2103 goto conv_err_out;
2105 } 2104 }
2106 /* TODO: Not implemented from here, abort. */ 2105 /* TODO: Not implemented from here, abort. */
2107 read_lock_irqsave(&ni->size_lock, flags); 2106 read_lock_irqsave(&ni->size_lock, flags);
2108 allocated_size = ni->allocated_size; 2107 allocated_size = ni->allocated_size;
2109 read_unlock_irqrestore(&ni->size_lock, flags); 2108 read_unlock_irqrestore(&ni->size_lock, flags);
2110 if (start < 0 || start >= allocated_size) { 2109 if (start < 0 || start >= allocated_size) {
2111 if (err == -ENOSPC) 2110 if (err == -ENOSPC)
2112 ntfs_error(vol->sb, "Not enough space in the mft " 2111 ntfs_error(vol->sb, "Not enough space in the mft "
2113 "record/on disk for the non-resident " 2112 "record/on disk for the non-resident "
2114 "attribute value. This case is not " 2113 "attribute value. This case is not "
2115 "implemented yet."); 2114 "implemented yet.");
2116 else /* if (err == -EPERM) */ 2115 else /* if (err == -EPERM) */
2117 ntfs_error(vol->sb, "This attribute type may not be " 2116 ntfs_error(vol->sb, "This attribute type may not be "
2118 "non-resident. This case is not " 2117 "non-resident. This case is not "
2119 "implemented yet."); 2118 "implemented yet.");
2120 } 2119 }
2121 err = -EOPNOTSUPP; 2120 err = -EOPNOTSUPP;
2122 goto conv_err_out; 2121 goto conv_err_out;
2123 #if 0 2122 #if 0
2124 // TODO: Attempt to make other attributes non-resident. 2123 // TODO: Attempt to make other attributes non-resident.
2125 if (!err) 2124 if (!err)
2126 goto do_resident_extend; 2125 goto do_resident_extend;
2127 /* 2126 /*
2128 * Both the attribute list attribute and the standard information 2127 * Both the attribute list attribute and the standard information
2129 * attribute must remain in the base inode. Thus, if this is one of 2128 * attribute must remain in the base inode. Thus, if this is one of
2130 * these attributes, we have to try to move other attributes out into 2129 * these attributes, we have to try to move other attributes out into
2131 * extent mft records instead. 2130 * extent mft records instead.
2132 */ 2131 */
2133 if (ni->type == AT_ATTRIBUTE_LIST || 2132 if (ni->type == AT_ATTRIBUTE_LIST ||
2134 ni->type == AT_STANDARD_INFORMATION) { 2133 ni->type == AT_STANDARD_INFORMATION) {
2135 // TODO: Attempt to move other attributes into extent mft 2134 // TODO: Attempt to move other attributes into extent mft
2136 // records. 2135 // records.
2137 err = -EOPNOTSUPP; 2136 err = -EOPNOTSUPP;
2138 if (!err) 2137 if (!err)
2139 goto do_resident_extend; 2138 goto do_resident_extend;
2140 goto err_out; 2139 goto err_out;
2141 } 2140 }
2142 // TODO: Attempt to move this attribute to an extent mft record, but 2141 // TODO: Attempt to move this attribute to an extent mft record, but
2143 // only if it is not already the only attribute in an mft record in 2142 // only if it is not already the only attribute in an mft record in
2144 // which case there would be nothing to gain. 2143 // which case there would be nothing to gain.
2145 err = -EOPNOTSUPP; 2144 err = -EOPNOTSUPP;
2146 if (!err) 2145 if (!err)
2147 goto do_resident_extend; 2146 goto do_resident_extend;
2148 /* There is nothing we can do to make enough space. )-: */ 2147 /* There is nothing we can do to make enough space. )-: */
2149 goto err_out; 2148 goto err_out;
2150 #endif 2149 #endif
2151 do_non_resident_extend: 2150 do_non_resident_extend:
2152 BUG_ON(!NInoNonResident(ni)); 2151 BUG_ON(!NInoNonResident(ni));
2153 if (new_alloc_size == allocated_size) { 2152 if (new_alloc_size == allocated_size) {
2154 BUG_ON(vcn); 2153 BUG_ON(vcn);
2155 goto alloc_done; 2154 goto alloc_done;
2156 } 2155 }
2157 /* 2156 /*
2158 * If the data starts after the end of the old allocation, this is a 2157 * If the data starts after the end of the old allocation, this is a
2159 * $DATA attribute and sparse attributes are enabled on the volume and 2158 * $DATA attribute and sparse attributes are enabled on the volume and
2160 * for this inode, then create a sparse region between the old 2159 * for this inode, then create a sparse region between the old
2161 * allocated size and the start of the data. Otherwise simply proceed 2160 * allocated size and the start of the data. Otherwise simply proceed
2162 * with filling the whole space between the old allocated size and the 2161 * with filling the whole space between the old allocated size and the
2163 * new allocated size with clusters. 2162 * new allocated size with clusters.
2164 */ 2163 */
2165 if ((start >= 0 && start <= allocated_size) || ni->type != AT_DATA || 2164 if ((start >= 0 && start <= allocated_size) || ni->type != AT_DATA ||
2166 !NVolSparseEnabled(vol) || NInoSparseDisabled(ni)) 2165 !NVolSparseEnabled(vol) || NInoSparseDisabled(ni))
2167 goto skip_sparse; 2166 goto skip_sparse;
2168 // TODO: This is not implemented yet. We just fill in with real 2167 // TODO: This is not implemented yet. We just fill in with real
2169 // clusters for now... 2168 // clusters for now...
2170 ntfs_debug("Inserting holes is not-implemented yet. Falling back to " 2169 ntfs_debug("Inserting holes is not-implemented yet. Falling back to "
2171 "allocating real clusters instead."); 2170 "allocating real clusters instead.");
2172 skip_sparse: 2171 skip_sparse:
2173 rl = ni->runlist.rl; 2172 rl = ni->runlist.rl;
2174 if (likely(rl)) { 2173 if (likely(rl)) {
2175 /* Seek to the end of the runlist. */ 2174 /* Seek to the end of the runlist. */
2176 while (rl->length) 2175 while (rl->length)
2177 rl++; 2176 rl++;
2178 } 2177 }
2179 /* If this attribute extent is not mapped, map it now. */ 2178 /* If this attribute extent is not mapped, map it now. */
2180 if (unlikely(!rl || rl->lcn == LCN_RL_NOT_MAPPED || 2179 if (unlikely(!rl || rl->lcn == LCN_RL_NOT_MAPPED ||
2181 (rl->lcn == LCN_ENOENT && rl > ni->runlist.rl && 2180 (rl->lcn == LCN_ENOENT && rl > ni->runlist.rl &&
2182 (rl-1)->lcn == LCN_RL_NOT_MAPPED))) { 2181 (rl-1)->lcn == LCN_RL_NOT_MAPPED))) {
2183 if (!rl && !allocated_size) 2182 if (!rl && !allocated_size)
2184 goto first_alloc; 2183 goto first_alloc;
2185 rl = ntfs_mapping_pairs_decompress(vol, a, ni->runlist.rl); 2184 rl = ntfs_mapping_pairs_decompress(vol, a, ni->runlist.rl);
2186 if (IS_ERR(rl)) { 2185 if (IS_ERR(rl)) {
2187 err = PTR_ERR(rl); 2186 err = PTR_ERR(rl);
2188 if (start < 0 || start >= allocated_size) 2187 if (start < 0 || start >= allocated_size)
2189 ntfs_error(vol->sb, "Cannot extend allocation " 2188 ntfs_error(vol->sb, "Cannot extend allocation "
2190 "of inode 0x%lx, attribute " 2189 "of inode 0x%lx, attribute "
2191 "type 0x%x, because the " 2190 "type 0x%x, because the "
2192 "mapping of a runlist " 2191 "mapping of a runlist "
2193 "fragment failed with error " 2192 "fragment failed with error "
2194 "code %i.", vi->i_ino, 2193 "code %i.", vi->i_ino,
2195 (unsigned)le32_to_cpu(ni->type), 2194 (unsigned)le32_to_cpu(ni->type),
2196 err); 2195 err);
2197 if (err != -ENOMEM) 2196 if (err != -ENOMEM)
2198 err = -EIO; 2197 err = -EIO;
2199 goto err_out; 2198 goto err_out;
2200 } 2199 }
2201 ni->runlist.rl = rl; 2200 ni->runlist.rl = rl;
2202 /* Seek to the end of the runlist. */ 2201 /* Seek to the end of the runlist. */
2203 while (rl->length) 2202 while (rl->length)
2204 rl++; 2203 rl++;
2205 } 2204 }
2206 /* 2205 /*
2207 * We now know the runlist of the last extent is mapped and @rl is at 2206 * We now know the runlist of the last extent is mapped and @rl is at
2208 * the end of the runlist. We want to begin allocating clusters 2207 * the end of the runlist. We want to begin allocating clusters
2209 * starting at the last allocated cluster to reduce fragmentation. If 2208 * starting at the last allocated cluster to reduce fragmentation. If
2210 * there are no valid LCNs in the attribute we let the cluster 2209 * there are no valid LCNs in the attribute we let the cluster
2211 * allocator choose the starting cluster. 2210 * allocator choose the starting cluster.
2212 */ 2211 */
2213 /* If the last LCN is a hole or similar, seek back to the last real LCN. */ 2212 /* If the last LCN is a hole or similar, seek back to the last real LCN. */
2214 while (rl->lcn < 0 && rl > ni->runlist.rl) 2213 while (rl->lcn < 0 && rl > ni->runlist.rl)
2215 rl--; 2214 rl--;
2216 first_alloc: 2215 first_alloc:
2217 // FIXME: Need to implement partial allocations so at least part of the 2216 // FIXME: Need to implement partial allocations so at least part of the
2218 // write can be performed when start >= 0. (Needed for POSIX write(2) 2217 // write can be performed when start >= 0. (Needed for POSIX write(2)
2219 // conformance.) 2218 // conformance.)
2220 rl2 = ntfs_cluster_alloc(vol, allocated_size >> vol->cluster_size_bits, 2219 rl2 = ntfs_cluster_alloc(vol, allocated_size >> vol->cluster_size_bits,
2221 (new_alloc_size - allocated_size) >> 2220 (new_alloc_size - allocated_size) >>
2222 vol->cluster_size_bits, (rl && (rl->lcn >= 0)) ? 2221 vol->cluster_size_bits, (rl && (rl->lcn >= 0)) ?
2223 rl->lcn + rl->length : -1, DATA_ZONE, true); 2222 rl->lcn + rl->length : -1, DATA_ZONE, true);
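/*
 * The "rl->lcn + rl->length : -1" argument is only a hint for where the
 * cluster allocator should start searching: if the last real run were,
 * say, LCN 1000 with length 8 (numbers invented for illustration), the
 * search starts at LCN 1008 so the new clusters follow straight on from
 * the old ones; -1 lets the allocator pick its own starting point.
 */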
2224 if (IS_ERR(rl2)) { 2223 if (IS_ERR(rl2)) {
2225 err = PTR_ERR(rl2); 2224 err = PTR_ERR(rl2);
2226 if (start < 0 || start >= allocated_size) 2225 if (start < 0 || start >= allocated_size)
2227 ntfs_error(vol->sb, "Cannot extend allocation of " 2226 ntfs_error(vol->sb, "Cannot extend allocation of "
2228 "inode 0x%lx, attribute type 0x%x, " 2227 "inode 0x%lx, attribute type 0x%x, "
2229 "because the allocation of clusters " 2228 "because the allocation of clusters "
2230 "failed with error code %i.", vi->i_ino, 2229 "failed with error code %i.", vi->i_ino,
2231 (unsigned)le32_to_cpu(ni->type), err); 2230 (unsigned)le32_to_cpu(ni->type), err);
2232 if (err != -ENOMEM && err != -ENOSPC) 2231 if (err != -ENOMEM && err != -ENOSPC)
2233 err = -EIO; 2232 err = -EIO;
2234 goto err_out; 2233 goto err_out;
2235 } 2234 }
2236 rl = ntfs_runlists_merge(ni->runlist.rl, rl2); 2235 rl = ntfs_runlists_merge(ni->runlist.rl, rl2);
2237 if (IS_ERR(rl)) { 2236 if (IS_ERR(rl)) {
2238 err = PTR_ERR(rl); 2237 err = PTR_ERR(rl);
2239 if (start < 0 || start >= allocated_size) 2238 if (start < 0 || start >= allocated_size)
2240 ntfs_error(vol->sb, "Cannot extend allocation of " 2239 ntfs_error(vol->sb, "Cannot extend allocation of "
2241 "inode 0x%lx, attribute type 0x%x, " 2240 "inode 0x%lx, attribute type 0x%x, "
2242 "because the runlist merge failed " 2241 "because the runlist merge failed "
2243 "with error code %i.", vi->i_ino, 2242 "with error code %i.", vi->i_ino,
2244 (unsigned)le32_to_cpu(ni->type), err); 2243 (unsigned)le32_to_cpu(ni->type), err);
2245 if (err != -ENOMEM) 2244 if (err != -ENOMEM)
2246 err = -EIO; 2245 err = -EIO;
2247 if (ntfs_cluster_free_from_rl(vol, rl2)) { 2246 if (ntfs_cluster_free_from_rl(vol, rl2)) {
2248 ntfs_error(vol->sb, "Failed to release allocated " 2247 ntfs_error(vol->sb, "Failed to release allocated "
2249 "cluster(s) in error code path. Run " 2248 "cluster(s) in error code path. Run "
2250 "chkdsk to recover the lost " 2249 "chkdsk to recover the lost "
2251 "cluster(s)."); 2250 "cluster(s).");
2252 NVolSetErrors(vol); 2251 NVolSetErrors(vol);
2253 } 2252 }
2254 ntfs_free(rl2); 2253 ntfs_free(rl2);
2255 goto err_out; 2254 goto err_out;
2256 } 2255 }
2257 ni->runlist.rl = rl; 2256 ni->runlist.rl = rl;
2258 ntfs_debug("Allocated 0x%llx clusters.", (long long)(new_alloc_size - 2257 ntfs_debug("Allocated 0x%llx clusters.", (long long)(new_alloc_size -
2259 allocated_size) >> vol->cluster_size_bits); 2258 allocated_size) >> vol->cluster_size_bits);
2260 /* Find the runlist element with which the attribute extent starts. */ 2259 /* Find the runlist element with which the attribute extent starts. */
2261 ll = sle64_to_cpu(a->data.non_resident.lowest_vcn); 2260 ll = sle64_to_cpu(a->data.non_resident.lowest_vcn);
2262 rl2 = ntfs_rl_find_vcn_nolock(rl, ll); 2261 rl2 = ntfs_rl_find_vcn_nolock(rl, ll);
2263 BUG_ON(!rl2); 2262 BUG_ON(!rl2);
2264 BUG_ON(!rl2->length); 2263 BUG_ON(!rl2->length);
2265 BUG_ON(rl2->lcn < LCN_HOLE); 2264 BUG_ON(rl2->lcn < LCN_HOLE);
2266 mp_rebuilt = false; 2265 mp_rebuilt = false;
2267 /* Get the size for the new mapping pairs array for this extent. */ 2266 /* Get the size for the new mapping pairs array for this extent. */
2268 mp_size = ntfs_get_size_for_mapping_pairs(vol, rl2, ll, -1); 2267 mp_size = ntfs_get_size_for_mapping_pairs(vol, rl2, ll, -1);
2269 if (unlikely(mp_size <= 0)) { 2268 if (unlikely(mp_size <= 0)) {
2270 err = mp_size; 2269 err = mp_size;
2271 if (start < 0 || start >= allocated_size) 2270 if (start < 0 || start >= allocated_size)
2272 ntfs_error(vol->sb, "Cannot extend allocation of " 2271 ntfs_error(vol->sb, "Cannot extend allocation of "
2273 "inode 0x%lx, attribute type 0x%x, " 2272 "inode 0x%lx, attribute type 0x%x, "
2274 "because determining the size for the " 2273 "because determining the size for the "
2275 "mapping pairs failed with error code " 2274 "mapping pairs failed with error code "
2276 "%i.", vi->i_ino, 2275 "%i.", vi->i_ino,
2277 (unsigned)le32_to_cpu(ni->type), err); 2276 (unsigned)le32_to_cpu(ni->type), err);
2278 err = -EIO; 2277 err = -EIO;
2279 goto undo_alloc; 2278 goto undo_alloc;
2280 } 2279 }
2281 /* Extend the attribute record to fit the bigger mapping pairs array. */ 2280 /* Extend the attribute record to fit the bigger mapping pairs array. */
2282 attr_len = le32_to_cpu(a->length); 2281 attr_len = le32_to_cpu(a->length);
2283 err = ntfs_attr_record_resize(m, a, mp_size + 2282 err = ntfs_attr_record_resize(m, a, mp_size +
2284 le16_to_cpu(a->data.non_resident.mapping_pairs_offset)); 2283 le16_to_cpu(a->data.non_resident.mapping_pairs_offset));
2285 if (unlikely(err)) { 2284 if (unlikely(err)) {
2286 BUG_ON(err != -ENOSPC); 2285 BUG_ON(err != -ENOSPC);
2287 // TODO: Deal with this by moving this extent to a new mft 2286 // TODO: Deal with this by moving this extent to a new mft
2288 // record or by starting a new extent in a new mft record, 2287 // record or by starting a new extent in a new mft record,
2289 // possibly by extending this extent partially and filling it 2288 // possibly by extending this extent partially and filling it
2290 // and creating a new extent for the remainder, or by making 2289 // and creating a new extent for the remainder, or by making
2291 // other attributes non-resident and/or by moving other 2290 // other attributes non-resident and/or by moving other
2292 // attributes out of this mft record. 2291 // attributes out of this mft record.
2293 if (start < 0 || start >= allocated_size) 2292 if (start < 0 || start >= allocated_size)
2294 ntfs_error(vol->sb, "Not enough space in the mft " 2293 ntfs_error(vol->sb, "Not enough space in the mft "
2295 "record for the extended attribute " 2294 "record for the extended attribute "
2296 "record. This case is not " 2295 "record. This case is not "
2297 "implemented yet."); 2296 "implemented yet.");
2298 err = -EOPNOTSUPP; 2297 err = -EOPNOTSUPP;
2299 goto undo_alloc; 2298 goto undo_alloc;
2300 } 2299 }
2301 mp_rebuilt = true; 2300 mp_rebuilt = true;
2302 /* Generate the mapping pairs array directly into the attr record. */ 2301 /* Generate the mapping pairs array directly into the attr record. */
2303 err = ntfs_mapping_pairs_build(vol, (u8*)a + 2302 err = ntfs_mapping_pairs_build(vol, (u8*)a +
2304 le16_to_cpu(a->data.non_resident.mapping_pairs_offset), 2303 le16_to_cpu(a->data.non_resident.mapping_pairs_offset),
2305 mp_size, rl2, ll, -1, NULL); 2304 mp_size, rl2, ll, -1, NULL);
2306 if (unlikely(err)) { 2305 if (unlikely(err)) {
2307 if (start < 0 || start >= allocated_size) 2306 if (start < 0 || start >= allocated_size)
2308 ntfs_error(vol->sb, "Cannot extend allocation of " 2307 ntfs_error(vol->sb, "Cannot extend allocation of "
2309 "inode 0x%lx, attribute type 0x%x, " 2308 "inode 0x%lx, attribute type 0x%x, "
2310 "because building the mapping pairs " 2309 "because building the mapping pairs "
2311 "failed with error code %i.", vi->i_ino, 2310 "failed with error code %i.", vi->i_ino,
2312 (unsigned)le32_to_cpu(ni->type), err); 2311 (unsigned)le32_to_cpu(ni->type), err);
2313 err = -EIO; 2312 err = -EIO;
2314 goto undo_alloc; 2313 goto undo_alloc;
2315 } 2314 }
2316 /* Update the highest_vcn. */ 2315 /* Update the highest_vcn. */
2317 a->data.non_resident.highest_vcn = cpu_to_sle64((new_alloc_size >> 2316 a->data.non_resident.highest_vcn = cpu_to_sle64((new_alloc_size >>
2318 vol->cluster_size_bits) - 1); 2317 vol->cluster_size_bits) - 1);
2319 /* 2318 /*
2320 * We now have extended the allocated size of the attribute. Reflect 2319 * We now have extended the allocated size of the attribute. Reflect
2321 * this in the ntfs_inode structure and the attribute record. 2320 * this in the ntfs_inode structure and the attribute record.
2322 */ 2321 */
2323 if (a->data.non_resident.lowest_vcn) { 2322 if (a->data.non_resident.lowest_vcn) {
2324 /* 2323 /*
2325 * We are not in the first attribute extent, switch to it, but 2324 * We are not in the first attribute extent, switch to it, but
2326 * first ensure the changes will make it to disk later. 2325 * first ensure the changes will make it to disk later.
2327 */ 2326 */
2328 flush_dcache_mft_record_page(ctx->ntfs_ino); 2327 flush_dcache_mft_record_page(ctx->ntfs_ino);
2329 mark_mft_record_dirty(ctx->ntfs_ino); 2328 mark_mft_record_dirty(ctx->ntfs_ino);
2330 ntfs_attr_reinit_search_ctx(ctx); 2329 ntfs_attr_reinit_search_ctx(ctx);
2331 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, 2330 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
2332 CASE_SENSITIVE, 0, NULL, 0, ctx); 2331 CASE_SENSITIVE, 0, NULL, 0, ctx);
2333 if (unlikely(err)) 2332 if (unlikely(err))
2334 goto restore_undo_alloc; 2333 goto restore_undo_alloc;
2335 /* @m is not used any more so no need to set it. */ 2334 /* @m is not used any more so no need to set it. */
2336 a = ctx->attr; 2335 a = ctx->attr;
2337 } 2336 }
2338 write_lock_irqsave(&ni->size_lock, flags); 2337 write_lock_irqsave(&ni->size_lock, flags);
2339 ni->allocated_size = new_alloc_size; 2338 ni->allocated_size = new_alloc_size;
2340 a->data.non_resident.allocated_size = cpu_to_sle64(new_alloc_size); 2339 a->data.non_resident.allocated_size = cpu_to_sle64(new_alloc_size);
2341 /* 2340 /*
2342 * FIXME: This would fail if @ni is a directory, $MFT, or an index, 2341 * FIXME: This would fail if @ni is a directory, $MFT, or an index,
2343 * since those can have sparse/compressed set. For example, a directory can be 2342 * since those can have sparse/compressed set. For example, a directory can be
2344 * set compressed even though it is not compressed itself and in that 2343 * set compressed even though it is not compressed itself and in that
2345 * case the bit means that files are to be created compressed in the 2344 * case the bit means that files are to be created compressed in the
2346 * directory... At present this is ok as this code is only called for 2345 * directory... At present this is ok as this code is only called for
2347 * regular files, and only for their $DATA attribute(s). 2346 * regular files, and only for their $DATA attribute(s).
2348 * FIXME: The calculation is wrong if we created a hole above. For now 2347 * FIXME: The calculation is wrong if we created a hole above. For now
2349 * it does not matter as we never create holes. 2348 * it does not matter as we never create holes.
2350 */ 2349 */
2351 if (NInoSparse(ni) || NInoCompressed(ni)) { 2350 if (NInoSparse(ni) || NInoCompressed(ni)) {
2352 ni->itype.compressed.size += new_alloc_size - allocated_size; 2351 ni->itype.compressed.size += new_alloc_size - allocated_size;
2353 a->data.non_resident.compressed_size = 2352 a->data.non_resident.compressed_size =
2354 cpu_to_sle64(ni->itype.compressed.size); 2353 cpu_to_sle64(ni->itype.compressed.size);
2355 vi->i_blocks = ni->itype.compressed.size >> 9; 2354 vi->i_blocks = ni->itype.compressed.size >> 9;
2356 } else 2355 } else
2357 vi->i_blocks = new_alloc_size >> 9; 2356 vi->i_blocks = new_alloc_size >> 9;
2358 write_unlock_irqrestore(&ni->size_lock, flags); 2357 write_unlock_irqrestore(&ni->size_lock, flags);
2359 alloc_done: 2358 alloc_done:
2360 if (new_data_size >= 0) { 2359 if (new_data_size >= 0) {
2361 BUG_ON(new_data_size < 2360 BUG_ON(new_data_size <
2362 sle64_to_cpu(a->data.non_resident.data_size)); 2361 sle64_to_cpu(a->data.non_resident.data_size));
2363 a->data.non_resident.data_size = cpu_to_sle64(new_data_size); 2362 a->data.non_resident.data_size = cpu_to_sle64(new_data_size);
2364 } 2363 }
2365 flush_done: 2364 flush_done:
2366 /* Ensure the changes make it to disk. */ 2365 /* Ensure the changes make it to disk. */
2367 flush_dcache_mft_record_page(ctx->ntfs_ino); 2366 flush_dcache_mft_record_page(ctx->ntfs_ino);
2368 mark_mft_record_dirty(ctx->ntfs_ino); 2367 mark_mft_record_dirty(ctx->ntfs_ino);
2369 done: 2368 done:
2370 ntfs_attr_put_search_ctx(ctx); 2369 ntfs_attr_put_search_ctx(ctx);
2371 unmap_mft_record(base_ni); 2370 unmap_mft_record(base_ni);
2372 up_write(&ni->runlist.lock); 2371 up_write(&ni->runlist.lock);
2373 ntfs_debug("Done, new_allocated_size 0x%llx.", 2372 ntfs_debug("Done, new_allocated_size 0x%llx.",
2374 (unsigned long long)new_alloc_size); 2373 (unsigned long long)new_alloc_size);
2375 return new_alloc_size; 2374 return new_alloc_size;
2376 restore_undo_alloc: 2375 restore_undo_alloc:
2377 if (start < 0 || start >= allocated_size) 2376 if (start < 0 || start >= allocated_size)
2378 ntfs_error(vol->sb, "Cannot complete extension of allocation " 2377 ntfs_error(vol->sb, "Cannot complete extension of allocation "
2379 "of inode 0x%lx, attribute type 0x%x, because " 2378 "of inode 0x%lx, attribute type 0x%x, because "
2380 "lookup of first attribute extent failed with " 2379 "lookup of first attribute extent failed with "
2381 "error code %i.", vi->i_ino, 2380 "error code %i.", vi->i_ino,
2382 (unsigned)le32_to_cpu(ni->type), err); 2381 (unsigned)le32_to_cpu(ni->type), err);
2383 if (err == -ENOENT) 2382 if (err == -ENOENT)
2384 err = -EIO; 2383 err = -EIO;
2385 ntfs_attr_reinit_search_ctx(ctx); 2384 ntfs_attr_reinit_search_ctx(ctx);
2386 if (ntfs_attr_lookup(ni->type, ni->name, ni->name_len, CASE_SENSITIVE, 2385 if (ntfs_attr_lookup(ni->type, ni->name, ni->name_len, CASE_SENSITIVE,
2387 allocated_size >> vol->cluster_size_bits, NULL, 0, 2386 allocated_size >> vol->cluster_size_bits, NULL, 0,
2388 ctx)) { 2387 ctx)) {
2389 ntfs_error(vol->sb, "Failed to find last attribute extent of " 2388 ntfs_error(vol->sb, "Failed to find last attribute extent of "
2390 "attribute in error code path. Run chkdsk to " 2389 "attribute in error code path. Run chkdsk to "
2391 "recover."); 2390 "recover.");
2392 write_lock_irqsave(&ni->size_lock, flags); 2391 write_lock_irqsave(&ni->size_lock, flags);
2393 ni->allocated_size = new_alloc_size; 2392 ni->allocated_size = new_alloc_size;
2394 /* 2393 /*
2395 * FIXME: This would fail if @ni is a directory... See above. 2394 * FIXME: This would fail if @ni is a directory... See above.
2396 * FIXME: The calculation is wrong if we created a hole above. 2395 * FIXME: The calculation is wrong if we created a hole above.
2397 * For now it does not matter as we never create holes. 2396 * For now it does not matter as we never create holes.
2398 */ 2397 */
2399 if (NInoSparse(ni) || NInoCompressed(ni)) { 2398 if (NInoSparse(ni) || NInoCompressed(ni)) {
2400 ni->itype.compressed.size += new_alloc_size - 2399 ni->itype.compressed.size += new_alloc_size -
2401 allocated_size; 2400 allocated_size;
2402 vi->i_blocks = ni->itype.compressed.size >> 9; 2401 vi->i_blocks = ni->itype.compressed.size >> 9;
2403 } else 2402 } else
2404 vi->i_blocks = new_alloc_size >> 9; 2403 vi->i_blocks = new_alloc_size >> 9;
2405 write_unlock_irqrestore(&ni->size_lock, flags); 2404 write_unlock_irqrestore(&ni->size_lock, flags);
2406 ntfs_attr_put_search_ctx(ctx); 2405 ntfs_attr_put_search_ctx(ctx);
2407 unmap_mft_record(base_ni); 2406 unmap_mft_record(base_ni);
2408 up_write(&ni->runlist.lock); 2407 up_write(&ni->runlist.lock);
2409 /* 2408 /*
2410 * The only thing that is now wrong is the allocated size of the 2409 * The only thing that is now wrong is the allocated size of the
2411 * base attribute extent which chkdsk should be able to fix. 2410 * base attribute extent which chkdsk should be able to fix.
2412 */ 2411 */
2413 NVolSetErrors(vol); 2412 NVolSetErrors(vol);
2414 return err; 2413 return err;
2415 } 2414 }
2416 ctx->attr->data.non_resident.highest_vcn = cpu_to_sle64( 2415 ctx->attr->data.non_resident.highest_vcn = cpu_to_sle64(
2417 (allocated_size >> vol->cluster_size_bits) - 1); 2416 (allocated_size >> vol->cluster_size_bits) - 1);
2418 undo_alloc: 2417 undo_alloc:
2419 ll = allocated_size >> vol->cluster_size_bits; 2418 ll = allocated_size >> vol->cluster_size_bits;
2420 if (ntfs_cluster_free(ni, ll, -1, ctx) < 0) { 2419 if (ntfs_cluster_free(ni, ll, -1, ctx) < 0) {
2421 ntfs_error(vol->sb, "Failed to release allocated cluster(s) " 2420 ntfs_error(vol->sb, "Failed to release allocated cluster(s) "
2422 "in error code path. Run chkdsk to recover " 2421 "in error code path. Run chkdsk to recover "
2423 "the lost cluster(s)."); 2422 "the lost cluster(s).");
2424 NVolSetErrors(vol); 2423 NVolSetErrors(vol);
2425 } 2424 }
2426 m = ctx->mrec; 2425 m = ctx->mrec;
2427 a = ctx->attr; 2426 a = ctx->attr;
2428 /* 2427 /*
2429 * If the runlist truncation fails and/or the search context is no 2428 * If the runlist truncation fails and/or the search context is no
2430 * longer valid, we cannot resize the attribute record or build the 2429 * longer valid, we cannot resize the attribute record or build the
2431 * mapping pairs array thus we mark the inode bad so that no access to 2430 * mapping pairs array thus we mark the inode bad so that no access to
2432 * the freed clusters can happen. 2431 * the freed clusters can happen.
2433 */ 2432 */
2434 if (ntfs_rl_truncate_nolock(vol, &ni->runlist, ll) || IS_ERR(m)) { 2433 if (ntfs_rl_truncate_nolock(vol, &ni->runlist, ll) || IS_ERR(m)) {
2435 ntfs_error(vol->sb, "Failed to %s in error code path. Run " 2434 ntfs_error(vol->sb, "Failed to %s in error code path. Run "
2436 "chkdsk to recover.", IS_ERR(m) ? 2435 "chkdsk to recover.", IS_ERR(m) ?
2437 "restore attribute search context" : 2436 "restore attribute search context" :
2438 "truncate attribute runlist"); 2437 "truncate attribute runlist");
2439 NVolSetErrors(vol); 2438 NVolSetErrors(vol);
2440 } else if (mp_rebuilt) { 2439 } else if (mp_rebuilt) {
2441 if (ntfs_attr_record_resize(m, a, attr_len)) { 2440 if (ntfs_attr_record_resize(m, a, attr_len)) {
2442 ntfs_error(vol->sb, "Failed to restore attribute " 2441 ntfs_error(vol->sb, "Failed to restore attribute "
2443 "record in error code path. Run " 2442 "record in error code path. Run "
2444 "chkdsk to recover."); 2443 "chkdsk to recover.");
2445 NVolSetErrors(vol); 2444 NVolSetErrors(vol);
2446 } else /* if (success) */ { 2445 } else /* if (success) */ {
2447 if (ntfs_mapping_pairs_build(vol, (u8*)a + le16_to_cpu( 2446 if (ntfs_mapping_pairs_build(vol, (u8*)a + le16_to_cpu(
2448 a->data.non_resident. 2447 a->data.non_resident.
2449 mapping_pairs_offset), attr_len - 2448 mapping_pairs_offset), attr_len -
2450 le16_to_cpu(a->data.non_resident. 2449 le16_to_cpu(a->data.non_resident.
2451 mapping_pairs_offset), rl2, ll, -1, 2450 mapping_pairs_offset), rl2, ll, -1,
2452 NULL)) { 2451 NULL)) {
2453 ntfs_error(vol->sb, "Failed to restore " 2452 ntfs_error(vol->sb, "Failed to restore "
2454 "mapping pairs array in error " 2453 "mapping pairs array in error "
2455 "code path. Run chkdsk to " 2454 "code path. Run chkdsk to "
2456 "recover."); 2455 "recover.");
2457 NVolSetErrors(vol); 2456 NVolSetErrors(vol);
2458 } 2457 }
2459 flush_dcache_mft_record_page(ctx->ntfs_ino); 2458 flush_dcache_mft_record_page(ctx->ntfs_ino);
2460 mark_mft_record_dirty(ctx->ntfs_ino); 2459 mark_mft_record_dirty(ctx->ntfs_ino);
2461 } 2460 }
2462 } 2461 }
2463 err_out: 2462 err_out:
2464 if (ctx) 2463 if (ctx)
2465 ntfs_attr_put_search_ctx(ctx); 2464 ntfs_attr_put_search_ctx(ctx);
2466 if (m) 2465 if (m)
2467 unmap_mft_record(base_ni); 2466 unmap_mft_record(base_ni);
2468 up_write(&ni->runlist.lock); 2467 up_write(&ni->runlist.lock);
2469 conv_err_out: 2468 conv_err_out:
2470 ntfs_debug("Failed. Returning error code %i.", err); 2469 ntfs_debug("Failed. Returning error code %i.", err);
2471 return err; 2470 return err;
2472 } 2471 }
2473 2472
/**
 * ntfs_attr_set - fill (a part of) an attribute with a byte
 * @ni: ntfs inode describing the attribute to fill
 * @ofs: offset inside the attribute at which to start to fill
 * @cnt: number of bytes to fill
 * @val: the unsigned 8-bit value with which to fill the attribute
 *
 * Fill @cnt bytes of the attribute described by the ntfs inode @ni starting at
 * byte offset @ofs inside the attribute with the constant byte @val.
 *
 * This function is effectively like memset() applied to an ntfs attribute.
 * Note this function actually only operates on the page cache pages belonging
 * to the ntfs attribute and it marks them dirty after doing the memset().
 * Thus it relies on the vm dirty page write code paths to cause the modified
 * pages to be written to the mft record/disk.
 *
 * Return 0 on success and -errno on error. An error code of -ESPIPE means
 * that @ofs + @cnt were outside the end of the attribute and no write was
 * performed.
 */
int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
{
	ntfs_volume *vol = ni->vol;
	struct address_space *mapping;
	struct page *page;
	u8 *kaddr;
	pgoff_t idx, end;
	unsigned start_ofs, end_ofs, size;

	ntfs_debug("Entering for ofs 0x%llx, cnt 0x%llx, val 0x%hx.",
			(long long)ofs, (long long)cnt, val);
	BUG_ON(ofs < 0);
	BUG_ON(cnt < 0);
	if (!cnt)
		goto done;
	/*
	 * FIXME: Compressed and encrypted attributes are not supported when
	 * writing and we should never have gotten here for them.
	 */
	BUG_ON(NInoCompressed(ni));
	BUG_ON(NInoEncrypted(ni));
	mapping = VFS_I(ni)->i_mapping;
	/* Work out the starting index and page offset. */
	idx = ofs >> PAGE_CACHE_SHIFT;
	start_ofs = ofs & ~PAGE_CACHE_MASK;
	/* Work out the ending index and page offset. */
	end = ofs + cnt;
	end_ofs = end & ~PAGE_CACHE_MASK;
	/* If the end is outside the inode size return -ESPIPE. */
	if (unlikely(end > i_size_read(VFS_I(ni)))) {
		ntfs_error(vol->sb, "Request exceeds end of attribute.");
		return -ESPIPE;
	}
	end >>= PAGE_CACHE_SHIFT;
	/* If there is a first partial page, need to do it the slow way. */
	if (start_ofs) {
		page = read_mapping_page(mapping, idx, NULL);
		if (IS_ERR(page)) {
			ntfs_error(vol->sb, "Failed to read first partial "
					"page (error, index 0x%lx).", idx);
			return PTR_ERR(page);
		}
		/*
		 * If the last page is the same as the first page, need to
		 * limit the write to the end offset.
		 */
		size = PAGE_CACHE_SIZE;
		if (idx == end)
			size = end_ofs;
		kaddr = kmap_atomic(page);
		memset(kaddr + start_ofs, val, size - start_ofs);
		flush_dcache_page(page);
		kunmap_atomic(kaddr);
		set_page_dirty(page);
		page_cache_release(page);
		balance_dirty_pages_ratelimited(mapping);
		cond_resched();
		if (idx == end)
			goto done;
		idx++;
	}
	/* Do the whole pages the fast way. */
	for (; idx < end; idx++) {
		/* Find or create the current page. (The page is locked.) */
		page = grab_cache_page(mapping, idx);
		if (unlikely(!page)) {
			ntfs_error(vol->sb, "Insufficient memory to grab "
					"page (index 0x%lx).", idx);
			return -ENOMEM;
		}
		kaddr = kmap_atomic(page);
		memset(kaddr, val, PAGE_CACHE_SIZE);
		flush_dcache_page(page);
		kunmap_atomic(kaddr);
		/*
		 * If the page has buffers, mark them uptodate since buffer
		 * state and not page state is definitive in 2.6 kernels.
		 */
		if (page_has_buffers(page)) {
			struct buffer_head *bh, *head;

			bh = head = page_buffers(page);
			do {
				set_buffer_uptodate(bh);
			} while ((bh = bh->b_this_page) != head);
		}
		/* Now that buffers are uptodate, set the page uptodate, too. */
		SetPageUptodate(page);
		/*
		 * Set the page and all its buffers dirty and mark the inode
		 * dirty, too. The VM will write the page later on.
		 */
		set_page_dirty(page);
		/* Finally unlock and release the page. */
		unlock_page(page);
		page_cache_release(page);
		balance_dirty_pages_ratelimited(mapping);
		cond_resched();
	}
	/* If there is a last partial page, need to do it the slow way. */
	if (end_ofs) {
		page = read_mapping_page(mapping, idx, NULL);
		if (IS_ERR(page)) {
			ntfs_error(vol->sb, "Failed to read last partial page "
					"(error, index 0x%lx).", idx);
			return PTR_ERR(page);
		}
		kaddr = kmap_atomic(page);
		memset(kaddr, val, end_ofs);
		flush_dcache_page(page);
		kunmap_atomic(kaddr);
		set_page_dirty(page);
		page_cache_release(page);
		balance_dirty_pages_ratelimited(mapping);
		cond_resched();
	}
done:
	ntfs_debug("Done.");
	return 0;
}
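
/*
 * Editor's note: an illustrative sketch, not part of the original file. It
 * shows how a caller might use ntfs_attr_set() to zero a byte range of an
 * attribute and handle the -ESPIPE case documented above. The function name
 * is hypothetical.
 */
static inline int ntfs_attr_zero_range_example(ntfs_inode *ni, const s64 ofs,
		const s64 cnt)
{
	int err = ntfs_attr_set(ni, ofs, cnt, 0);

	if (err == -ESPIPE)
		ntfs_debug("Range 0x%llx + 0x%llx lies beyond the end of the "
				"attribute.", (long long)ofs, (long long)cnt);
	return err;
}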

#endif /* NTFS_RW */

/*
 * file.c - NTFS kernel file operations. Part of the Linux-NTFS project.
 *
 * Copyright (c) 2001-2011 Anton Altaparmakov and Tuxera Inc.
 *
 * This program/include file is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as published
 * by the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program/include file is distributed in the hope that it will be
 * useful, but WITHOUT ANY WARRANTY; without even the implied warranty
 * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program (in the main directory of the Linux-NTFS
 * distribution in the file COPYING); if not, write to the Free Software
 * Foundation,Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 */

#include <linux/buffer_head.h>
#include <linux/gfp.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/sched.h>
#include <linux/swap.h>
#include <linux/uio.h>
#include <linux/writeback.h>
#include <linux/aio.h>

#include <asm/page.h>
#include <asm/uaccess.h>

#include "attrib.h"
#include "bitmap.h"
#include "inode.h"
#include "debug.h"
#include "lcnalloc.h"
#include "malloc.h"
#include "mft.h"
#include "ntfs.h"

/**
 * ntfs_file_open - called when an inode is about to be opened
 * @vi: inode to be opened
 * @filp: file structure describing the inode
 *
 * Limit file size to the page cache limit on architectures where unsigned long
 * is 32-bits. This is the most we can do for now without overflowing the page
 * cache page index. Doing it this way means we don't run into problems because
 * of existing too large files. It would be better to allow the user to read
 * the beginning of the file but I doubt very much anyone is going to hit this
 * check on a 32-bit architecture, so there is no point in adding the extra
 * complexity required to support this.
 *
 * On 64-bit architectures, the check is hopefully optimized away by the
 * compiler.
 *
 * After the check passes, just call generic_file_open() to do its work.
 */
static int ntfs_file_open(struct inode *vi, struct file *filp)
{
	if (sizeof(unsigned long) < 8) {
		if (i_size_read(vi) > MAX_LFS_FILESIZE)
			return -EOVERFLOW;
	}
	return generic_file_open(vi, filp);
}
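
/*
 * Editor's note: an illustrative sketch, not part of the original file. It
 * restates the 32-bit guard above as a standalone predicate; the helper name
 * is hypothetical.
 */
static inline bool ntfs_size_fits_page_cache_example(loff_t size)
{
	/* A 32-bit unsigned long means a 32-bit page cache page index. */
	return sizeof(unsigned long) >= 8 || size <= MAX_LFS_FILESIZE;
}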

#ifdef NTFS_RW

/**
 * ntfs_attr_extend_initialized - extend the initialized size of an attribute
 * @ni: ntfs inode of the attribute to extend
 * @new_init_size: requested new initialized size in bytes
 * @cached_page: store any allocated but unused page here
 * @lru_pvec: lru-buffering pagevec of the caller
 *
 * Extend the initialized size of an attribute described by the ntfs inode @ni
 * to @new_init_size bytes. This involves zeroing any non-sparse space between
 * the old initialized size and @new_init_size both in the page cache and on
 * disk (if relevant complete pages are already uptodate in the page cache then
 * these are simply marked dirty).
 *
 * As a side-effect, the file size (vfs inode->i_size) may be incremented as,
 * in the resident attribute case, it is tied to the initialized size and, in
 * the non-resident attribute case, it may not fall below the initialized size.
 *
 * Note that if the attribute is resident, we do not need to touch the page
 * cache at all. This is because if the page cache page is not uptodate we
 * bring it uptodate later, when doing the write to the mft record since we
 * then already have the page mapped. And if the page is uptodate, the
 * non-initialized region will already have been zeroed when the page was
 * brought uptodate and the region may in fact already have been overwritten
 * with new data via mmap() based writes, so we cannot just zero it. And since
 * POSIX specifies that the behaviour of resizing a file whilst it is mmap()ped
 * is unspecified, we choose not to do zeroing and thus we do not need to touch
 * the page at all. For a more detailed explanation see ntfs_truncate() in
 * fs/ntfs/inode.c.
 *
 * Return 0 on success and -errno on error. In the case that an error is
 * encountered it is possible that the initialized size will already have been
 * incremented some way towards @new_init_size but it is guaranteed that if
 * this is the case, the necessary zeroing will also have happened and that all
 * metadata is self-consistent.
 *
 * Locking: i_mutex on the vfs inode corresponding to the ntfs inode @ni must be
 * held by the caller.
 */
static int ntfs_attr_extend_initialized(ntfs_inode *ni, const s64 new_init_size)
{
	s64 old_init_size;
	loff_t old_i_size;
	pgoff_t index, end_index;
	unsigned long flags;
	struct inode *vi = VFS_I(ni);
	ntfs_inode *base_ni;
	MFT_RECORD *m = NULL;
	ATTR_RECORD *a;
	ntfs_attr_search_ctx *ctx = NULL;
	struct address_space *mapping;
	struct page *page = NULL;
	u8 *kattr;
	int err;
	u32 attr_len;

	read_lock_irqsave(&ni->size_lock, flags);
	old_init_size = ni->initialized_size;
	old_i_size = i_size_read(vi);
	BUG_ON(new_init_size > ni->allocated_size);
	read_unlock_irqrestore(&ni->size_lock, flags);
	ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, "
			"old_initialized_size 0x%llx, "
			"new_initialized_size 0x%llx, i_size 0x%llx.",
			vi->i_ino, (unsigned)le32_to_cpu(ni->type),
			(unsigned long long)old_init_size,
			(unsigned long long)new_init_size, old_i_size);
	if (!NInoAttr(ni))
		base_ni = ni;
	else
		base_ni = ni->ext.base_ntfs_ino;
	/* Use goto to reduce indentation and we need the label below anyway. */
	if (NInoNonResident(ni))
		goto do_non_resident_extend;
	BUG_ON(old_init_size != old_i_size);
	m = map_mft_record(base_ni);
	if (IS_ERR(m)) {
		err = PTR_ERR(m);
		m = NULL;
		goto err_out;
	}
	ctx = ntfs_attr_get_search_ctx(base_ni, m);
	if (unlikely(!ctx)) {
		err = -ENOMEM;
		goto err_out;
	}
	err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
			CASE_SENSITIVE, 0, NULL, 0, ctx);
	if (unlikely(err)) {
		if (err == -ENOENT)
			err = -EIO;
		goto err_out;
	}
	m = ctx->mrec;
	a = ctx->attr;
	BUG_ON(a->non_resident);
	/* The total length of the attribute value. */
	attr_len = le32_to_cpu(a->data.resident.value_length);
	BUG_ON(old_i_size != (loff_t)attr_len);
	/*
	 * Do the zeroing in the mft record and update the attribute size in
	 * the mft record.
	 */
	kattr = (u8*)a + le16_to_cpu(a->data.resident.value_offset);
	memset(kattr + attr_len, 0, new_init_size - attr_len);
	a->data.resident.value_length = cpu_to_le32((u32)new_init_size);
	/* Finally, update the sizes in the vfs and ntfs inodes. */
	write_lock_irqsave(&ni->size_lock, flags);
	i_size_write(vi, new_init_size);
	ni->initialized_size = new_init_size;
	write_unlock_irqrestore(&ni->size_lock, flags);
	goto done;
do_non_resident_extend:
	/*
	 * If the new initialized size @new_init_size exceeds the current file
	 * size (vfs inode->i_size), we need to extend the file size to the
	 * new initialized size.
	 */
	if (new_init_size > old_i_size) {
		m = map_mft_record(base_ni);
		if (IS_ERR(m)) {
			err = PTR_ERR(m);
			m = NULL;
			goto err_out;
		}
		ctx = ntfs_attr_get_search_ctx(base_ni, m);
		if (unlikely(!ctx)) {
			err = -ENOMEM;
			goto err_out;
		}
		err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
				CASE_SENSITIVE, 0, NULL, 0, ctx);
		if (unlikely(err)) {
			if (err == -ENOENT)
				err = -EIO;
			goto err_out;
		}
		m = ctx->mrec;
		a = ctx->attr;
		BUG_ON(!a->non_resident);
		BUG_ON(old_i_size != (loff_t)
				sle64_to_cpu(a->data.non_resident.data_size));
		a->data.non_resident.data_size = cpu_to_sle64(new_init_size);
		flush_dcache_mft_record_page(ctx->ntfs_ino);
		mark_mft_record_dirty(ctx->ntfs_ino);
		/* Update the file size in the vfs inode. */
		i_size_write(vi, new_init_size);
		ntfs_attr_put_search_ctx(ctx);
		ctx = NULL;
		unmap_mft_record(base_ni);
		m = NULL;
	}
	mapping = vi->i_mapping;
	index = old_init_size >> PAGE_CACHE_SHIFT;
	end_index = (new_init_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	do {
		/*
		 * Read the page. If the page is not present, this will zero
		 * the uninitialized regions for us.
		 */
		page = read_mapping_page(mapping, index, NULL);
		if (IS_ERR(page)) {
			err = PTR_ERR(page);
			goto init_err_out;
		}
		if (unlikely(PageError(page))) {
			page_cache_release(page);
			err = -EIO;
			goto init_err_out;
		}
		/*
		 * Update the initialized size in the ntfs inode. This is
		 * enough to make ntfs_writepage() work.
		 */
		write_lock_irqsave(&ni->size_lock, flags);
		ni->initialized_size = (s64)(index + 1) << PAGE_CACHE_SHIFT;
		if (ni->initialized_size > new_init_size)
			ni->initialized_size = new_init_size;
		write_unlock_irqrestore(&ni->size_lock, flags);
		/* Set the page dirty so it gets written out. */
		set_page_dirty(page);
		page_cache_release(page);
		/*
		 * Play nice with the vm and the rest of the system. This is
		 * very much needed as we can potentially be modifying the
		 * initialised size from a very small value to a really huge
		 * value, e.g.
		 *	f = open(somefile, O_TRUNC);
		 *	truncate(f, 10GiB);
		 *	seek(f, 10GiB);
		 *	write(f, 1);
		 * And this would mean we would be marking dirty hundreds of
		 * thousands of pages or as in the above example more than
		 * two and a half million pages!
		 *
		 * TODO: For sparse pages could optimize this workload by using
		 * the FsMisc / MiscFs page bit as a "PageIsSparse" bit. This
		 * would be set in readpage for sparse pages and here we would
		 * not need to mark dirty any pages which have this bit set.
		 * The only caveat is that we have to clear the bit everywhere
		 * where we allocate any clusters that lie in the page or that
		 * contain the page.
		 *
		 * TODO: An even greater optimization would be for us to only
		 * call readpage() on pages which are not in sparse regions as
		 * determined from the runlist. This would greatly reduce the
		 * number of pages we read and make dirty in the case of sparse
		 * files.
		 */
		balance_dirty_pages_ratelimited(mapping);
		cond_resched();
	} while (++index < end_index);
	read_lock_irqsave(&ni->size_lock, flags);
	BUG_ON(ni->initialized_size != new_init_size);
	read_unlock_irqrestore(&ni->size_lock, flags);
	/* Now bring in sync the initialized_size in the mft record. */
	m = map_mft_record(base_ni);
	if (IS_ERR(m)) {
		err = PTR_ERR(m);
		m = NULL;
		goto init_err_out;
	}
	ctx = ntfs_attr_get_search_ctx(base_ni, m);
	if (unlikely(!ctx)) {
		err = -ENOMEM;
		goto init_err_out;
	}
	err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
			CASE_SENSITIVE, 0, NULL, 0, ctx);
	if (unlikely(err)) {
		if (err == -ENOENT)
			err = -EIO;
		goto init_err_out;
	}
	m = ctx->mrec;
	a = ctx->attr;
	BUG_ON(!a->non_resident);
	a->data.non_resident.initialized_size = cpu_to_sle64(new_init_size);
done:
	flush_dcache_mft_record_page(ctx->ntfs_ino);
	mark_mft_record_dirty(ctx->ntfs_ino);
	if (ctx)
		ntfs_attr_put_search_ctx(ctx);
	if (m)
		unmap_mft_record(base_ni);
	ntfs_debug("Done, initialized_size 0x%llx, i_size 0x%llx.",
			(unsigned long long)new_init_size, i_size_read(vi));
	return 0;
init_err_out:
	write_lock_irqsave(&ni->size_lock, flags);
	ni->initialized_size = old_init_size;
	write_unlock_irqrestore(&ni->size_lock, flags);
err_out:
	if (ctx)
		ntfs_attr_put_search_ctx(ctx);
	if (m)
		unmap_mft_record(base_ni);
	ntfs_debug("Failed. Returning error code %i.", err);
	return err;
}
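
/*
 * Editor's note: an illustrative sketch, not part of the original file. A
 * buffered-write path would typically extend the initialized size before
 * writing beyond it, roughly as below; the helper name is hypothetical.
 */
static inline int ntfs_extend_initialized_if_needed_example(ntfs_inode *ni,
		loff_t pos, size_t count)
{
	unsigned long flags;
	s64 init_size;

	read_lock_irqsave(&ni->size_lock, flags);
	init_size = ni->initialized_size;
	read_unlock_irqrestore(&ni->size_lock, flags);
	if (pos + count <= init_size)
		return 0;
	return ntfs_attr_extend_initialized(ni, pos + count);
}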

/**
 * ntfs_fault_in_pages_readable -
 *
 * Fault a number of userspace pages into pagetables.
 *
 * Unlike include/linux/pagemap.h::fault_in_pages_readable(), this one copes
 * with more than two userspace pages as well as handling the single page case
 * elegantly.
 *
 * If you find this difficult to understand, then think of the while loop being
 * the following code, except that we do without the integer variable ret:
 *
 *	do {
 *		ret = __get_user(c, uaddr);
 *		uaddr += PAGE_SIZE;
 *	} while (!ret && uaddr < end);
 *
 * Note, the final __get_user() may well run out-of-bounds of the user buffer,
 * but _not_ out-of-bounds of the page the user buffer belongs to, and since
 * this is only a read and not a write, and since it is still in the same page,
 * it should not matter and this makes the code much simpler.
 */
static inline void ntfs_fault_in_pages_readable(const char __user *uaddr,
		int bytes)
{
	const char __user *end;
	volatile char c;

	/* Set @end to the first byte outside the last page we care about. */
	end = (const char __user*)PAGE_ALIGN((unsigned long)uaddr + bytes);

	while (!__get_user(c, uaddr) && (uaddr += PAGE_SIZE, uaddr < end))
		;
}
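
/*
 * Editor's note: an illustrative sketch, not part of the original file. A
 * write path would prefault the source buffer with the helper above before
 * taking page locks, so that the later in-atomic copy does not fault; the
 * wrapper name is hypothetical.
 */
static inline void ntfs_prefault_user_buffer_example(const char __user *buf,
		int bytes)
{
	if (bytes > 0)
		ntfs_fault_in_pages_readable(buf, bytes);
}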

/**
 * ntfs_fault_in_pages_readable_iovec -
 *
 * Same as ntfs_fault_in_pages_readable() but operates on an array of iovecs.
 */
static inline void ntfs_fault_in_pages_readable_iovec(const struct iovec *iov,
		size_t iov_ofs, int bytes)
{
	do {
		const char __user *buf;
		unsigned len;

		buf = iov->iov_base + iov_ofs;
		len = iov->iov_len - iov_ofs;
		if (len > bytes)
			len = bytes;
		ntfs_fault_in_pages_readable(buf, len);
		bytes -= len;
		iov++;
		iov_ofs = 0;
	} while (bytes);
}
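
/*
 * Editor's note: an illustrative sketch, not part of the original file. It
 * shows how a single user buffer can be wrapped in a one-element iovec and
 * prefaulted with the helper above; the wrapper name is hypothetical.
 */
static inline void ntfs_prefault_single_iovec_example(const char __user *buf,
		int bytes)
{
	struct iovec iov = {
		.iov_base = (void __user *)buf,
		.iov_len = bytes,
	};

	if (bytes > 0)
		ntfs_fault_in_pages_readable_iovec(&iov, 0, bytes);
}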

/**
 * __ntfs_grab_cache_pages - obtain a number of locked pages
 * @mapping: address space mapping from which to obtain page cache pages
 * @index: starting index in @mapping at which to begin obtaining pages
 * @nr_pages: number of page cache pages to obtain
 * @pages: array of pages in which to return the obtained page cache pages
 * @cached_page: allocated but as yet unused page
 * @lru_pvec: lru-buffering pagevec of caller
 *
 * Obtain @nr_pages locked page cache pages from the mapping @mapping and
 * starting at index @index.
 *
 * If a page is newly created, add it to lru list
 *
 * Note, the page locks are obtained in ascending page index order.
 */
static inline int __ntfs_grab_cache_pages(struct address_space *mapping,
		pgoff_t index, const unsigned nr_pages, struct page **pages,
		struct page **cached_page)
{
	int err, nr;

	BUG_ON(!nr_pages);
	err = nr = 0;
	do {
		pages[nr] = find_lock_page(mapping, index);
		if (!pages[nr]) {
			if (!*cached_page) {
				*cached_page = page_cache_alloc(mapping);
				if (unlikely(!*cached_page)) {
					err = -ENOMEM;
					goto err_out;
				}
			}
			err = add_to_page_cache_lru(*cached_page, mapping, index,
					GFP_KERNEL);
			if (unlikely(err)) {
				if (err == -EEXIST)
					continue;
				goto err_out;
			}
			pages[nr] = *cached_page;
			*cached_page = NULL;
		}
		index++;
		nr++;
	} while (nr < nr_pages);
out:
	return err;
err_out:
	while (nr > 0) {
		unlock_page(pages[--nr]);
		page_cache_release(pages[nr]);
	}
	goto out;
}
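
/*
 * Editor's note: an illustrative sketch, not part of the original file. It
 * shows the calling convention of __ntfs_grab_cache_pages(): the caller owns
 * @cached_page and must release it if it is left over. The function name and
 * the fixed page count are hypothetical.
 */
static inline int ntfs_grab_two_pages_example(struct address_space *mapping,
		pgoff_t index, struct page **pages)
{
	struct page *cached_page = NULL;
	int err;

	err = __ntfs_grab_cache_pages(mapping, index, 2, pages, &cached_page);
	if (cached_page)
		page_cache_release(cached_page);
	/*
	 * On success, pages[0] and pages[1] are locked and must be unlocked
	 * and released by the caller.
	 */
	return err;
}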

static inline int ntfs_submit_bh_for_read(struct buffer_head *bh)
{
	lock_buffer(bh);
	get_bh(bh);
	bh->b_end_io = end_buffer_read_sync;
	return submit_bh(READ, bh);
}
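
/*
 * Editor's note: an illustrative sketch, not part of the original file. The
 * code below submits reads with ntfs_submit_bh_for_read() and later waits on
 * the buffers; a synchronous equivalent would look roughly like this. The
 * function name is hypothetical.
 */
static inline int ntfs_read_bh_sync_example(struct buffer_head *bh)
{
	int err = ntfs_submit_bh_for_read(bh);

	wait_on_buffer(bh);
	if (!err && unlikely(!buffer_uptodate(bh)))
		err = -EIO;
	return err;
}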
455 455
456 /** 456 /**
457 * ntfs_prepare_pages_for_non_resident_write - prepare pages for receiving data 457 * ntfs_prepare_pages_for_non_resident_write - prepare pages for receiving data
458 * @pages: array of destination pages 458 * @pages: array of destination pages
459 * @nr_pages: number of pages in @pages 459 * @nr_pages: number of pages in @pages
460 * @pos: byte position in file at which the write begins 460 * @pos: byte position in file at which the write begins
461 * @bytes: number of bytes to be written 461 * @bytes: number of bytes to be written
462 * 462 *
463 * This is called for non-resident attributes from ntfs_file_buffered_write() 463 * This is called for non-resident attributes from ntfs_file_buffered_write()
464 * with i_mutex held on the inode (@pages[0]->mapping->host). There are 464 * with i_mutex held on the inode (@pages[0]->mapping->host). There are
465 * @nr_pages pages in @pages which are locked but not kmap()ped. The source 465 * @nr_pages pages in @pages which are locked but not kmap()ped. The source
466 * data has not yet been copied into the @pages. 466 * data has not yet been copied into the @pages.
467 * 467 *
468 * Need to fill any holes with actual clusters, allocate buffers if necessary, 468 * Need to fill any holes with actual clusters, allocate buffers if necessary,
469 * ensure all the buffers are mapped, and bring uptodate any buffers that are 469 * ensure all the buffers are mapped, and bring uptodate any buffers that are
470 * only partially being written to. 470 * only partially being written to.
471 * 471 *
472 * If @nr_pages is greater than one, we are guaranteed that the cluster size is 472 * If @nr_pages is greater than one, we are guaranteed that the cluster size is
473 * greater than PAGE_CACHE_SIZE, that all pages in @pages are entirely inside
474 * the same cluster and that they are the entirety of that cluster, and that
475 * the cluster is sparse, i.e. we need to allocate a cluster to fill the hole.
476 *
477 * i_size is not to be modified yet.
478 *
479 * Return 0 on success or -errno on error.
480 */
481 static int ntfs_prepare_pages_for_non_resident_write(struct page **pages,
482 unsigned nr_pages, s64 pos, size_t bytes)
483 {
484 VCN vcn, highest_vcn = 0, cpos, cend, bh_cpos, bh_cend;
485 LCN lcn;
486 s64 bh_pos, vcn_len, end, initialized_size;
487 sector_t lcn_block;
488 struct page *page;
489 struct inode *vi;
490 ntfs_inode *ni, *base_ni = NULL;
491 ntfs_volume *vol;
492 runlist_element *rl, *rl2;
493 struct buffer_head *bh, *head, *wait[2], **wait_bh = wait;
494 ntfs_attr_search_ctx *ctx = NULL;
495 MFT_RECORD *m = NULL;
496 ATTR_RECORD *a = NULL;
497 unsigned long flags;
498 u32 attr_rec_len = 0;
499 unsigned blocksize, u;
500 int err, mp_size;
501 bool rl_write_locked, was_hole, is_retry;
502 unsigned char blocksize_bits;
503 struct {
504 u8 runlist_merged:1;
505 u8 mft_attr_mapped:1;
506 u8 mp_rebuilt:1;
507 u8 attr_switched:1;
508 } status = { 0, 0, 0, 0 };
509
510 BUG_ON(!nr_pages);
511 BUG_ON(!pages);
512 BUG_ON(!*pages);
513 vi = pages[0]->mapping->host;
514 ni = NTFS_I(vi);
515 vol = ni->vol;
516 ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, start page "
517 "index 0x%lx, nr_pages 0x%x, pos 0x%llx, bytes 0x%zx.",
518 vi->i_ino, ni->type, pages[0]->index, nr_pages,
519 (long long)pos, bytes);
520 blocksize = vol->sb->s_blocksize;
521 blocksize_bits = vol->sb->s_blocksize_bits;
522 u = 0;
523 do {
524 page = pages[u];
525 BUG_ON(!page);
526 /*
527 * create_empty_buffers() will create uptodate/dirty buffers if
528 * the page is uptodate/dirty.
529 */
530 if (!page_has_buffers(page)) {
531 create_empty_buffers(page, blocksize, 0);
532 if (unlikely(!page_has_buffers(page)))
533 return -ENOMEM;
534 }
535 } while (++u < nr_pages);
536 rl_write_locked = false;
537 rl = NULL;
538 err = 0;
539 vcn = lcn = -1;
540 vcn_len = 0;
541 lcn_block = -1;
542 was_hole = false;
543 cpos = pos >> vol->cluster_size_bits;
544 end = pos + bytes;
545 cend = (end + vol->cluster_size - 1) >> vol->cluster_size_bits;
546 /*
547 * Loop over each page and for each page over each buffer. Use goto to
548 * reduce indentation.
549 */
550 u = 0;
551 do_next_page:
552 page = pages[u];
553 bh_pos = (s64)page->index << PAGE_CACHE_SHIFT;
554 bh = head = page_buffers(page);
555 do {
556 VCN cdelta;
557 s64 bh_end;
558 unsigned bh_cofs;
559
560 /* Clear buffer_new on all buffers to reinitialise state. */
561 if (buffer_new(bh))
562 clear_buffer_new(bh);
563 bh_end = bh_pos + blocksize;
564 bh_cpos = bh_pos >> vol->cluster_size_bits;
565 bh_cofs = bh_pos & vol->cluster_size_mask;
566 if (buffer_mapped(bh)) {
567 /*
568 * The buffer is already mapped. If it is uptodate,
569 * ignore it.
570 */
571 if (buffer_uptodate(bh))
572 continue;
573 /*
574 * The buffer is not uptodate. If the page is uptodate
575 * set the buffer uptodate and otherwise ignore it.
576 */
577 if (PageUptodate(page)) {
578 set_buffer_uptodate(bh);
579 continue;
580 }
581 /*
582 * Neither the page nor the buffer are uptodate. If
583 * the buffer is only partially being written to, we
584 * need to read it in before the write, i.e. now.
585 */
586 if ((bh_pos < pos && bh_end > pos) ||
587 (bh_pos < end && bh_end > end)) {
588 /*
589 * If the buffer is fully or partially within
590 * the initialized size, do an actual read.
591 * Otherwise, simply zero the buffer.
592 */
593 read_lock_irqsave(&ni->size_lock, flags);
594 initialized_size = ni->initialized_size;
595 read_unlock_irqrestore(&ni->size_lock, flags);
596 if (bh_pos < initialized_size) {
597 ntfs_submit_bh_for_read(bh);
598 *wait_bh++ = bh;
599 } else {
600 zero_user(page, bh_offset(bh),
601 blocksize);
602 set_buffer_uptodate(bh);
603 }
604 }
605 continue;
606 }
607 /* Unmapped buffer. Need to map it. */
608 bh->b_bdev = vol->sb->s_bdev;
609 /*
610 * If the current buffer is in the same clusters as the map
611 * cache, there is no need to check the runlist again. The
612 * map cache is made up of @vcn, which is the first cached file
613 * cluster, @vcn_len which is the number of cached file
614 * clusters, @lcn is the device cluster corresponding to @vcn,
615 * and @lcn_block is the block number corresponding to @lcn.
616 */
617 cdelta = bh_cpos - vcn;
618 if (likely(!cdelta || (cdelta > 0 && cdelta < vcn_len))) {
619 map_buffer_cached:
620 BUG_ON(lcn < 0);
621 bh->b_blocknr = lcn_block +
622 (cdelta << (vol->cluster_size_bits -
623 blocksize_bits)) +
624 (bh_cofs >> blocksize_bits);
625 set_buffer_mapped(bh);
626 /*
627 * If the page is uptodate so is the buffer. If the
628 * buffer is fully outside the write, we ignore it if
629 * it was already allocated and we mark it dirty so it
630 * gets written out if we allocated it. On the other
631 * hand, if we allocated the buffer but we are not
632 * marking it dirty we set buffer_new so we can do
633 * error recovery.
634 */
635 if (PageUptodate(page)) {
636 if (!buffer_uptodate(bh))
637 set_buffer_uptodate(bh);
638 if (unlikely(was_hole)) {
639 /* We allocated the buffer. */
640 unmap_underlying_metadata(bh->b_bdev,
641 bh->b_blocknr);
642 if (bh_end <= pos || bh_pos >= end)
643 mark_buffer_dirty(bh);
644 else
645 set_buffer_new(bh);
646 }
647 continue;
648 }
649 /* Page is _not_ uptodate. */
650 if (likely(!was_hole)) {
651 /*
652 * Buffer was already allocated. If it is not
653 * uptodate and is only partially being written
654 * to, we need to read it in before the write,
655 * i.e. now.
656 */
657 if (!buffer_uptodate(bh) && bh_pos < end &&
658 bh_end > pos &&
659 (bh_pos < pos ||
660 bh_end > end)) {
661 /*
662 * If the buffer is fully or partially
663 * within the initialized size, do an
664 * actual read. Otherwise, simply zero
665 * the buffer.
666 */
667 read_lock_irqsave(&ni->size_lock,
668 flags);
669 initialized_size = ni->initialized_size;
670 read_unlock_irqrestore(&ni->size_lock,
671 flags);
672 if (bh_pos < initialized_size) {
673 ntfs_submit_bh_for_read(bh);
674 *wait_bh++ = bh;
675 } else {
676 zero_user(page, bh_offset(bh),
677 blocksize);
678 set_buffer_uptodate(bh);
679 }
680 }
681 continue;
682 }
683 /* We allocated the buffer. */
684 unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
685 /*
686 * If the buffer is fully outside the write, zero it,
687 * set it uptodate, and mark it dirty so it gets
688 * written out. If it is partially being written to,
689 * zero region surrounding the write but leave it to
690 * commit write to do anything else. Finally, if the
691 * buffer is fully being overwritten, do nothing.
692 */
693 if (bh_end <= pos || bh_pos >= end) {
694 if (!buffer_uptodate(bh)) {
695 zero_user(page, bh_offset(bh),
696 blocksize);
697 set_buffer_uptodate(bh);
698 }
699 mark_buffer_dirty(bh);
700 continue;
701 }
702 set_buffer_new(bh);
703 if (!buffer_uptodate(bh) &&
704 (bh_pos < pos || bh_end > end)) {
705 u8 *kaddr;
706 unsigned pofs;
707
708 kaddr = kmap_atomic(page);
709 if (bh_pos < pos) {
710 pofs = bh_pos & ~PAGE_CACHE_MASK;
711 memset(kaddr + pofs, 0, pos - bh_pos);
712 }
713 if (bh_end > end) {
714 pofs = end & ~PAGE_CACHE_MASK;
715 memset(kaddr + pofs, 0, bh_end - end);
716 }
717 kunmap_atomic(kaddr);
718 flush_dcache_page(page);
719 }
720 continue;
721 }
722 /*
723 * Slow path: this is the first buffer in the cluster. If it
724 * is outside allocated size and is not uptodate, zero it and
725 * set it uptodate.
726 */
727 read_lock_irqsave(&ni->size_lock, flags);
728 initialized_size = ni->allocated_size;
729 read_unlock_irqrestore(&ni->size_lock, flags);
730 if (bh_pos > initialized_size) {
731 if (PageUptodate(page)) {
732 if (!buffer_uptodate(bh))
733 set_buffer_uptodate(bh);
734 } else if (!buffer_uptodate(bh)) {
735 zero_user(page, bh_offset(bh), blocksize);
736 set_buffer_uptodate(bh);
737 }
738 continue;
739 }
740 is_retry = false;
741 if (!rl) {
742 down_read(&ni->runlist.lock);
743 retry_remap:
744 rl = ni->runlist.rl;
745 }
746 if (likely(rl != NULL)) {
747 /* Seek to element containing target cluster. */
748 while (rl->length && rl[1].vcn <= bh_cpos)
749 rl++;
750 lcn = ntfs_rl_vcn_to_lcn(rl, bh_cpos);
751 if (likely(lcn >= 0)) {
752 /*
753 * Successful remap, setup the map cache and
754 * use that to deal with the buffer.
755 */
756 was_hole = false;
757 vcn = bh_cpos;
758 vcn_len = rl[1].vcn - vcn;
759 lcn_block = lcn << (vol->cluster_size_bits -
760 blocksize_bits);
761 cdelta = 0;
762 /*
763 * If the number of remaining clusters touched
764 * by the write is smaller or equal to the
765 * number of cached clusters, unlock the
766 * runlist as the map cache will be used from
767 * now on.
768 */
769 if (likely(vcn + vcn_len >= cend)) {
770 if (rl_write_locked) {
771 up_write(&ni->runlist.lock);
772 rl_write_locked = false;
773 } else
774 up_read(&ni->runlist.lock);
775 rl = NULL;
776 }
777 goto map_buffer_cached;
778 }
779 } else
780 lcn = LCN_RL_NOT_MAPPED;
781 /*
782 * If it is not a hole and not out of bounds, the runlist is
783 * probably unmapped so try to map it now.
784 */
785 if (unlikely(lcn != LCN_HOLE && lcn != LCN_ENOENT)) {
786 if (likely(!is_retry && lcn == LCN_RL_NOT_MAPPED)) {
787 /* Attempt to map runlist. */
788 if (!rl_write_locked) {
789 /*
790 * We need the runlist locked for
791 * writing, so if it is locked for
792 * reading relock it now and retry in
793 * case it changed whilst we dropped
794 * the lock.
795 */
796 up_read(&ni->runlist.lock);
797 down_write(&ni->runlist.lock);
798 rl_write_locked = true;
799 goto retry_remap;
800 }
801 err = ntfs_map_runlist_nolock(ni, bh_cpos,
802 NULL);
803 if (likely(!err)) {
804 is_retry = true;
805 goto retry_remap;
806 }
807 /*
808 * If @vcn is out of bounds, pretend @lcn is
809 * LCN_ENOENT. As long as the buffer is out
810 * of bounds this will work fine.
811 */
812 if (err == -ENOENT) {
813 lcn = LCN_ENOENT;
814 err = 0;
815 goto rl_not_mapped_enoent;
816 }
817 } else
818 err = -EIO;
819 /* Failed to map the buffer, even after retrying. */
820 bh->b_blocknr = -1;
821 ntfs_error(vol->sb, "Failed to write to inode 0x%lx, "
822 "attribute type 0x%x, vcn 0x%llx, "
823 "vcn offset 0x%x, because its "
824 "location on disk could not be "
825 "determined%s (error code %i).",
826 ni->mft_no, ni->type,
827 (unsigned long long)bh_cpos,
828 (unsigned)bh_pos &
829 vol->cluster_size_mask,
830 is_retry ? " even after retrying" : "",
831 err);
832 break;
833 }
834 rl_not_mapped_enoent:
835 /*
836 * The buffer is in a hole or out of bounds. We need to fill
837 * the hole, unless the buffer is in a cluster which is not
838 * touched by the write, in which case we just leave the buffer
839 * unmapped. This can only happen when the cluster size is
840 * less than the page cache size.
841 */
842 if (unlikely(vol->cluster_size < PAGE_CACHE_SIZE)) {
843 bh_cend = (bh_end + vol->cluster_size - 1) >>
844 vol->cluster_size_bits;
845 if ((bh_cend <= cpos || bh_cpos >= cend)) {
846 bh->b_blocknr = -1;
847 /*
848 * If the buffer is uptodate we skip it. If it
849 * is not but the page is uptodate, we can set
850 * the buffer uptodate. If the page is not
851 * uptodate, we can clear the buffer and set it
852 * uptodate. Whether this is worthwhile is
853 * debatable and this could be removed.
854 */
855 if (PageUptodate(page)) {
856 if (!buffer_uptodate(bh))
857 set_buffer_uptodate(bh);
858 } else if (!buffer_uptodate(bh)) {
859 zero_user(page, bh_offset(bh),
860 blocksize);
861 set_buffer_uptodate(bh);
862 }
863 continue;
864 }
865 }
866 /*
867 * Out of bounds buffer is invalid if it was not really out of
868 * bounds.
869 */
870 BUG_ON(lcn != LCN_HOLE);
871 /*
872 * We need the runlist locked for writing, so if it is locked
873 * for reading relock it now and retry in case it changed
874 * whilst we dropped the lock.
875 */
876 BUG_ON(!rl);
877 if (!rl_write_locked) {
878 up_read(&ni->runlist.lock);
879 down_write(&ni->runlist.lock);
880 rl_write_locked = true;
881 goto retry_remap;
882 }
883 /* Find the previous last allocated cluster. */
884 BUG_ON(rl->lcn != LCN_HOLE);
885 lcn = -1;
886 rl2 = rl;
887 while (--rl2 >= ni->runlist.rl) {
888 if (rl2->lcn >= 0) {
889 lcn = rl2->lcn + rl2->length;
890 break;
891 }
892 }
893 rl2 = ntfs_cluster_alloc(vol, bh_cpos, 1, lcn, DATA_ZONE,
894 false);
895 if (IS_ERR(rl2)) {
896 err = PTR_ERR(rl2);
897 ntfs_debug("Failed to allocate cluster, error code %i.",
898 err);
899 break;
900 }
901 lcn = rl2->lcn;
902 rl = ntfs_runlists_merge(ni->runlist.rl, rl2);
903 if (IS_ERR(rl)) {
904 err = PTR_ERR(rl);
905 if (err != -ENOMEM)
906 err = -EIO;
907 if (ntfs_cluster_free_from_rl(vol, rl2)) {
908 ntfs_error(vol->sb, "Failed to release "
909 "allocated cluster in error "
910 "code path. Run chkdsk to "
911 "recover the lost cluster.");
912 NVolSetErrors(vol);
913 }
914 ntfs_free(rl2);
915 break;
916 }
917 ni->runlist.rl = rl;
918 status.runlist_merged = 1;
919 ntfs_debug("Allocated cluster, lcn 0x%llx.",
920 (unsigned long long)lcn);
921 /* Map and lock the mft record and get the attribute record. */
922 if (!NInoAttr(ni))
923 base_ni = ni;
924 else
925 base_ni = ni->ext.base_ntfs_ino;
926 m = map_mft_record(base_ni);
927 if (IS_ERR(m)) {
928 err = PTR_ERR(m);
929 break;
930 }
931 ctx = ntfs_attr_get_search_ctx(base_ni, m);
932 if (unlikely(!ctx)) {
933 err = -ENOMEM;
934 unmap_mft_record(base_ni);
935 break;
936 }
937 status.mft_attr_mapped = 1;
938 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
939 CASE_SENSITIVE, bh_cpos, NULL, 0, ctx);
940 if (unlikely(err)) {
941 if (err == -ENOENT)
942 err = -EIO;
943 break;
944 }
945 m = ctx->mrec;
946 a = ctx->attr;
947 /*
948 * Find the runlist element with which the attribute extent
949 * starts. Note, we cannot use the _attr_ version because we
950 * have mapped the mft record. That is ok because we know the
951 * runlist fragment must be mapped already to have ever gotten
952 * here, so we can just use the _rl_ version.
953 */
954 vcn = sle64_to_cpu(a->data.non_resident.lowest_vcn);
955 rl2 = ntfs_rl_find_vcn_nolock(rl, vcn);
956 BUG_ON(!rl2);
957 BUG_ON(!rl2->length);
958 BUG_ON(rl2->lcn < LCN_HOLE);
959 highest_vcn = sle64_to_cpu(a->data.non_resident.highest_vcn);
960 /*
961 * If @highest_vcn is zero, calculate the real highest_vcn
962 * (which can really be zero).
963 */
964 if (!highest_vcn)
965 highest_vcn = (sle64_to_cpu(
966 a->data.non_resident.allocated_size) >>
967 vol->cluster_size_bits) - 1;
968 /*
969 * Determine the size of the mapping pairs array for the new
970 * extent, i.e. the old extent with the hole filled.
971 */
972 mp_size = ntfs_get_size_for_mapping_pairs(vol, rl2, vcn,
973 highest_vcn);
974 if (unlikely(mp_size <= 0)) {
975 if (!(err = mp_size))
976 err = -EIO;
977 ntfs_debug("Failed to get size for mapping pairs "
978 "array, error code %i.", err);
979 break;
980 }
981 /*
982 * Resize the attribute record to fit the new mapping pairs
983 * array.
984 */
985 attr_rec_len = le32_to_cpu(a->length);
986 err = ntfs_attr_record_resize(m, a, mp_size + le16_to_cpu(
987 a->data.non_resident.mapping_pairs_offset));
988 if (unlikely(err)) {
989 BUG_ON(err != -ENOSPC);
990 // TODO: Deal with this by using the current attribute
991 // and fill it with as much of the mapping pairs
992 // array as possible. Then loop over each attribute
993 // extent rewriting the mapping pairs arrays as we go
994 // along and if when we reach the end we have not
995 // enough space, try to resize the last attribute
996 // extent and if even that fails, add a new attribute
997 // extent.
998 // We could also try to resize at each step in the hope
999 // that we will not need to rewrite every single extent.
1000 // Note, we may need to decompress some extents to fill
1001 // the runlist as we are walking the extents...
1002 ntfs_error(vol->sb, "Not enough space in the mft "
1003 "record for the extended attribute "
1004 "record. This case is not "
1005 "implemented yet.");
1006 err = -EOPNOTSUPP;
1007 break ;
1008 }
1009 status.mp_rebuilt = 1;
1010 /*
1011 * Generate the mapping pairs array directly into the attribute
1012 * record.
1013 */
1014 err = ntfs_mapping_pairs_build(vol, (u8*)a + le16_to_cpu(
1015 a->data.non_resident.mapping_pairs_offset),
1016 mp_size, rl2, vcn, highest_vcn, NULL);
1017 if (unlikely(err)) {
1018 ntfs_error(vol->sb, "Cannot fill hole in inode 0x%lx, "
1019 "attribute type 0x%x, because building "
1020 "the mapping pairs failed with error "
1021 "code %i.", vi->i_ino,
1022 (unsigned)le32_to_cpu(ni->type), err);
1023 err = -EIO;
1024 break;
1025 }
1026 /* Update the highest_vcn but only if it was not set. */
1027 if (unlikely(!a->data.non_resident.highest_vcn))
1028 a->data.non_resident.highest_vcn =
1029 cpu_to_sle64(highest_vcn);
1030 /*
1031 * If the attribute is sparse/compressed, update the compressed
1032 * size in the ntfs_inode structure and the attribute record.
1033 */
1034 if (likely(NInoSparse(ni) || NInoCompressed(ni))) {
1035 /*
1036 * If we are not in the first attribute extent, switch
1037 * to it, but first ensure the changes will make it to
1038 * disk later.
1039 */
1040 if (a->data.non_resident.lowest_vcn) {
1041 flush_dcache_mft_record_page(ctx->ntfs_ino);
1042 mark_mft_record_dirty(ctx->ntfs_ino);
1043 ntfs_attr_reinit_search_ctx(ctx);
1044 err = ntfs_attr_lookup(ni->type, ni->name,
1045 ni->name_len, CASE_SENSITIVE,
1046 0, NULL, 0, ctx);
1047 if (unlikely(err)) {
1048 status.attr_switched = 1;
1049 break;
1050 }
1051 /* @m is not used any more so do not set it. */
1052 a = ctx->attr;
1053 }
1054 write_lock_irqsave(&ni->size_lock, flags);
1055 ni->itype.compressed.size += vol->cluster_size;
1056 a->data.non_resident.compressed_size =
1057 cpu_to_sle64(ni->itype.compressed.size);
1058 write_unlock_irqrestore(&ni->size_lock, flags);
1059 }
1060 /* Ensure the changes make it to disk. */
1061 flush_dcache_mft_record_page(ctx->ntfs_ino);
1062 mark_mft_record_dirty(ctx->ntfs_ino);
1063 ntfs_attr_put_search_ctx(ctx);
1064 unmap_mft_record(base_ni);
1065 /* Successfully filled the hole. */
1066 status.runlist_merged = 0;
1067 status.mft_attr_mapped = 0;
1068 status.mp_rebuilt = 0;
1069 /* Setup the map cache and use that to deal with the buffer. */
1070 was_hole = true;
1071 vcn = bh_cpos;
1072 vcn_len = 1;
1073 lcn_block = lcn << (vol->cluster_size_bits - blocksize_bits);
1074 cdelta = 0;
1075 /*
1076 * If the number of remaining clusters in the @pages is smaller
1077 * or equal to the number of cached clusters, unlock the
1078 * runlist as the map cache will be used from now on.
1079 */
1080 if (likely(vcn + vcn_len >= cend)) {
1081 up_write(&ni->runlist.lock);
1082 rl_write_locked = false;
1083 rl = NULL;
1084 }
1085 goto map_buffer_cached;
1086 } while (bh_pos += blocksize, (bh = bh->b_this_page) != head);
1087 /* If there are no errors, do the next page. */
1088 if (likely(!err && ++u < nr_pages))
1089 goto do_next_page;
1090 /* If there are no errors, release the runlist lock if we took it. */
1091 if (likely(!err)) {
1092 if (unlikely(rl_write_locked)) {
1093 up_write(&ni->runlist.lock);
1094 rl_write_locked = false;
1095 } else if (unlikely(rl))
1096 up_read(&ni->runlist.lock);
1097 rl = NULL;
1098 }
1099 /* If we issued read requests, let them complete. */
1100 read_lock_irqsave(&ni->size_lock, flags);
1101 initialized_size = ni->initialized_size;
1102 read_unlock_irqrestore(&ni->size_lock, flags);
1103 while (wait_bh > wait) {
1104 bh = *--wait_bh;
1105 wait_on_buffer(bh);
1106 if (likely(buffer_uptodate(bh))) {
1107 page = bh->b_page;
1108 bh_pos = ((s64)page->index << PAGE_CACHE_SHIFT) +
1109 bh_offset(bh);
1110 /*
1111 * If the buffer overflows the initialized size, need
1112 * to zero the overflowing region.
1113 */
1114 if (unlikely(bh_pos + blocksize > initialized_size)) {
1115 int ofs = 0;
1116
1117 if (likely(bh_pos < initialized_size))
1118 ofs = initialized_size - bh_pos;
1119 zero_user_segment(page, bh_offset(bh) + ofs,
1120 blocksize);
1121 }
1122 } else /* if (unlikely(!buffer_uptodate(bh))) */
1123 err = -EIO;
1124 }
1125 if (likely(!err)) {
1126 /* Clear buffer_new on all buffers. */
1127 u = 0;
1128 do {
1129 bh = head = page_buffers(pages[u]);
1130 do {
1131 if (buffer_new(bh))
1132 clear_buffer_new(bh);
1133 } while ((bh = bh->b_this_page) != head);
1134 } while (++u < nr_pages);
1135 ntfs_debug("Done.");
1136 return err;
1137 }
1138 if (status.attr_switched) {
1139 /* Get back to the attribute extent we modified. */
1140 ntfs_attr_reinit_search_ctx(ctx);
1141 if (ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
1142 CASE_SENSITIVE, bh_cpos, NULL, 0, ctx)) {
1143 ntfs_error(vol->sb, "Failed to find required "
1144 "attribute extent of attribute in "
1145 "error code path. Run chkdsk to "
1146 "recover.");
1147 write_lock_irqsave(&ni->size_lock, flags);
1148 ni->itype.compressed.size += vol->cluster_size;
1149 write_unlock_irqrestore(&ni->size_lock, flags);
1150 flush_dcache_mft_record_page(ctx->ntfs_ino);
1151 mark_mft_record_dirty(ctx->ntfs_ino);
1152 /*
1153 * The only thing that is now wrong is the compressed
1154 * size of the base attribute extent which chkdsk
1155 * should be able to fix.
1156 */
1157 NVolSetErrors(vol);
1158 } else {
1159 m = ctx->mrec;
1160 a = ctx->attr;
1161 status.attr_switched = 0;
1162 }
1163 }
1164 /*
1165 * If the runlist has been modified, need to restore it by punching a
1166 * hole into it and we then need to deallocate the on-disk cluster as
1167 * well. Note, we only modify the runlist if we are able to generate a
1168 * new mapping pairs array, i.e. only when the mapped attribute extent
1169 * is not switched.
1170 */
1171 if (status.runlist_merged && !status.attr_switched) {
1172 BUG_ON(!rl_write_locked);
1173 /* Make the file cluster we allocated sparse in the runlist. */
1174 if (ntfs_rl_punch_nolock(vol, &ni->runlist, bh_cpos, 1)) {
1175 ntfs_error(vol->sb, "Failed to punch hole into "
1176 "attribute runlist in error code "
1177 "path. Run chkdsk to recover the "
1178 "lost cluster.");
1179 NVolSetErrors(vol);
1180 } else /* if (success) */ {
1181 status.runlist_merged = 0;
1182 /*
1183 * Deallocate the on-disk cluster we allocated but only
1184 * if we succeeded in punching its vcn out of the
1185 * runlist.
1186 */
1187 down_write(&vol->lcnbmp_lock);
1188 if (ntfs_bitmap_clear_bit(vol->lcnbmp_ino, lcn)) {
1189 ntfs_error(vol->sb, "Failed to release "
1190 "allocated cluster in error "
1191 "code path. Run chkdsk to "
1192 "recover the lost cluster.");
1193 NVolSetErrors(vol);
1194 }
1195 up_write(&vol->lcnbmp_lock);
1196 }
1197 }
1198 /*
1199 * Resize the attribute record to its old size and rebuild the mapping
1200 * pairs array. Note, we only can do this if the runlist has been
1201 * restored to its old state which also implies that the mapped
1202 * attribute extent is not switched.
1203 */
1204 if (status.mp_rebuilt && !status.runlist_merged) {
1205 if (ntfs_attr_record_resize(m, a, attr_rec_len)) {
1206 ntfs_error(vol->sb, "Failed to restore attribute "
1207 "record in error code path. Run "
1208 "chkdsk to recover.");
1209 NVolSetErrors(vol);
1210 } else /* if (success) */ {
1211 if (ntfs_mapping_pairs_build(vol, (u8*)a +
1212 le16_to_cpu(a->data.non_resident.
1213 mapping_pairs_offset), attr_rec_len -
1214 le16_to_cpu(a->data.non_resident.
1215 mapping_pairs_offset), ni->runlist.rl,
1216 vcn, highest_vcn, NULL)) {
1217 ntfs_error(vol->sb, "Failed to restore "
1218 "mapping pairs array in error "
1219 "code path. Run chkdsk to "
1220 "recover.");
1221 NVolSetErrors(vol);
1222 }
1223 flush_dcache_mft_record_page(ctx->ntfs_ino);
1224 mark_mft_record_dirty(ctx->ntfs_ino);
1225 }
1226 }
1227 /* Release the mft record and the attribute. */
1228 if (status.mft_attr_mapped) {
1229 ntfs_attr_put_search_ctx(ctx);
1230 unmap_mft_record(base_ni);
1231 }
1232 /* Release the runlist lock. */
1233 if (rl_write_locked)
1234 up_write(&ni->runlist.lock);
1235 else if (rl)
1236 up_read(&ni->runlist.lock);
1237 /*
1238 * Zero out any newly allocated blocks to avoid exposing stale data.
1239 * If BH_New is set, we know that the block was newly allocated above
1240 * and that it has not been fully zeroed and marked dirty yet.
1241 */
1242 nr_pages = u;
1243 u = 0;
1244 end = bh_cpos << vol->cluster_size_bits;
1245 do {
1246 page = pages[u];
1247 bh = head = page_buffers(page);
1248 do {
1249 if (u == nr_pages &&
1250 ((s64)page->index << PAGE_CACHE_SHIFT) +
1251 bh_offset(bh) >= end)
1252 break;
1253 if (!buffer_new(bh))
1254 continue;
1255 clear_buffer_new(bh);
1256 if (!buffer_uptodate(bh)) {
1257 if (PageUptodate(page))
1258 set_buffer_uptodate(bh);
1259 else {
1260 zero_user(page, bh_offset(bh),
1261 blocksize);
1262 set_buffer_uptodate(bh);
1263 }
1264 }
1265 mark_buffer_dirty(bh);
1266 } while ((bh = bh->b_this_page) != head);
1267 } while (++u <= nr_pages);
1268 ntfs_error(vol->sb, "Failed. Returning error code %i.", err);
1269 return err;
1270 }
1271
1272 /*
1273 * Copy as much as we can into the pages and return the number of bytes which
1274 * were successfully copied. If a fault is encountered then clear the pages
1275 * out to (ofs + bytes) and return the number of bytes which were copied.
1276 */
1277 static inline size_t ntfs_copy_from_user(struct page **pages,
1278 unsigned nr_pages, unsigned ofs, const char __user *buf,
1279 size_t bytes)
1280 {
1281 struct page **last_page = pages + nr_pages;
1282 char *addr;
1283 size_t total = 0;
1284 unsigned len;
1285 int left;
1286
1287 do {
1288 len = PAGE_CACHE_SIZE - ofs;
1289 if (len > bytes)
1290 len = bytes;
1291 addr = kmap_atomic(*pages);
1292 left = __copy_from_user_inatomic(addr + ofs, buf, len);
1293 kunmap_atomic(addr);
1294 if (unlikely(left)) {
1295 /* Do it the slow way. */
1296 addr = kmap(*pages);
1297 left = __copy_from_user(addr + ofs, buf, len);
1298 kunmap(*pages);
1299 if (unlikely(left))
1300 goto err_out;
1301 }
1302 total += len;
1303 bytes -= len;
1304 if (!bytes)
1305 break;
1306 buf += len;
1307 ofs = 0;
1308 } while (++pages < last_page);
1309 out:
1310 return total;
1311 err_out:
1312 total += len - left;
1313 /* Zero the rest of the target like __copy_from_user(). */
1314 while (++pages < last_page) {
1315 bytes -= len;
1316 if (!bytes)
1317 break;
1318 len = PAGE_CACHE_SIZE;
1319 if (len > bytes)
1320 len = bytes;
1321 zero_user(*pages, 0, len);
1322 }
1323 goto out;
1324 }
1325
1326 static size_t __ntfs_copy_from_user_iovec_inatomic(char *vaddr,
1327 const struct iovec *iov, size_t iov_ofs, size_t bytes)
1328 {
1329 size_t total = 0;
1330
1331 while (1) {
1332 const char __user *buf = iov->iov_base + iov_ofs;
1333 unsigned len;
1334 size_t left;
1335
1336 len = iov->iov_len - iov_ofs;
1337 if (len > bytes)
1338 len = bytes;
1339 left = __copy_from_user_inatomic(vaddr, buf, len);
1340 total += len;
1341 bytes -= len;
1342 vaddr += len;
1343 if (unlikely(left)) {
1344 total -= left;
1345 break;
1346 }
1347 if (!bytes)
1348 break;
1349 iov++;
1350 iov_ofs = 0;
1351 }
1352 return total;
1353 }
1354
1355 static inline void ntfs_set_next_iovec(const struct iovec **iovp,
1356 size_t *iov_ofsp, size_t bytes)
1357 {
1358 const struct iovec *iov = *iovp;
1359 size_t iov_ofs = *iov_ofsp;
1360
1361 while (bytes) {
1362 unsigned len;
1363
1364 len = iov->iov_len - iov_ofs;
1365 if (len > bytes)
1366 len = bytes;
1367 bytes -= len;
1368 iov_ofs += len;
1369 if (iov->iov_len == iov_ofs) {
1370 iov++;
1371 iov_ofs = 0;
1372 }
1373 }
1374 *iovp = iov;
1375 *iov_ofsp = iov_ofs;
1376 }
1377
1378 /*
1379 * This has the same side-effects and return value as ntfs_copy_from_user().
1380 * The difference is that on a fault we need to memset the remainder of the
1381 * pages (out to offset + bytes), to emulate ntfs_copy_from_user()'s
1382 * single-segment behaviour.
1383 *
1384 * We call the same helper (__ntfs_copy_from_user_iovec_inatomic()) both when
1385 * atomic and when not atomic. This is ok because it calls
1386 * __copy_from_user_inatomic() and it is ok to call this when non-atomic. In
1387 * fact, the only difference between __copy_from_user_inatomic() and
1388 * __copy_from_user() is that the latter calls might_sleep() and the former
1389 * should not zero the tail of the buffer on error. And on many architectures
1390 * __copy_from_user_inatomic() is just defined to __copy_from_user() so it
1391 * makes no difference at all on those architectures.
1392 */
1393 static inline size_t ntfs_copy_from_user_iovec(struct page **pages,
1394 unsigned nr_pages, unsigned ofs, const struct iovec **iov,
1395 size_t *iov_ofs, size_t bytes)
1396 {
1397 struct page **last_page = pages + nr_pages;
1398 char *addr;
1399 size_t copied, len, total = 0;
1400
1401 do {
1402 len = PAGE_CACHE_SIZE - ofs;
1403 if (len > bytes)
1404 len = bytes;
1405 addr = kmap_atomic(*pages);
1406 copied = __ntfs_copy_from_user_iovec_inatomic(addr + ofs,
1407 *iov, *iov_ofs, len);
1408 kunmap_atomic(addr);
1409 if (unlikely(copied != len)) {
1410 /* Do it the slow way. */
1411 addr = kmap(*pages);
1412 copied = __ntfs_copy_from_user_iovec_inatomic(addr +
1413 ofs, *iov, *iov_ofs, len);
1414 if (unlikely(copied != len))
1415 goto err_out;
1416 kunmap(*pages);
1417 }
1418 total += len;
1419 ntfs_set_next_iovec(iov, iov_ofs, len);
1420 bytes -= len;
1421 if (!bytes)
1422 break;
1423 ofs = 0;
1424 } while (++pages < last_page);
1425 out:
1426 return total;
1427 err_out:
1428 BUG_ON(copied > len);
1429 /* Zero the rest of the target like __copy_from_user(). */
1430 memset(addr + ofs + copied, 0, len - copied);
1431 kunmap(*pages);
1432 total += copied;
1433 ntfs_set_next_iovec(iov, iov_ofs, copied);
1434 while (++pages < last_page) {
1435 bytes -= len;
1436 if (!bytes)
1437 break;
1438 len = PAGE_CACHE_SIZE;
1439 if (len > bytes)
1440 len = bytes;
1441 zero_user(*pages, 0, len);
1442 }
1443 goto out;
1444 }
1445 1445
1446 static inline void ntfs_flush_dcache_pages(struct page **pages, 1446 static inline void ntfs_flush_dcache_pages(struct page **pages,
1447 unsigned nr_pages) 1447 unsigned nr_pages)
1448 { 1448 {
1449 BUG_ON(!nr_pages); 1449 BUG_ON(!nr_pages);
1450 /* 1450 /*
1451 * Warning: Do not do the decrement at the same time as the call to 1451 * Warning: Do not do the decrement at the same time as the call to
1452 * flush_dcache_page() because it is a NULL macro on i386 and hence the 1452 * flush_dcache_page() because it is a NULL macro on i386 and hence the
1453 * decrement never happens so the loop never terminates. 1453 * decrement never happens so the loop never terminates.
1454 */ 1454 */
1455 do { 1455 do {
1456 --nr_pages; 1456 --nr_pages;
1457 flush_dcache_page(pages[nr_pages]); 1457 flush_dcache_page(pages[nr_pages]);
1458 } while (nr_pages > 0); 1458 } while (nr_pages > 0);
1459 } 1459 }
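
The warning in the comment above is worth spelling out: on configurations where flush_dcache_page() expands to an empty macro, its argument expression is discarded at preprocessing time, so any side effect hidden inside the argument disappears with it. The snippet below is a deliberately broken illustration of that hazard, with a no-op macro defined locally purely for demonstration; it is not meant to be built into the kernel.

	/* Deliberately broken illustration of the hazard described above. */
	#define flush_dcache_page(page) do { } while (0)	/* no-op variant */

	static void flush_pages_broken(struct page **pages, unsigned nr_pages)
	{
		do {
			/*
			 * BUG: the no-op macro discards its argument, so
			 * --nr_pages is never evaluated and this loop never
			 * terminates. Keeping the decrement as a separate
			 * statement, as above, makes it unconditional.
			 */
			flush_dcache_page(pages[--nr_pages]);
		} while (nr_pages > 0);
	}
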
1460 1460
1461 /** 1461 /**
1462 * ntfs_commit_pages_after_non_resident_write - commit the received data 1462 * ntfs_commit_pages_after_non_resident_write - commit the received data
1463 * @pages: array of destination pages 1463 * @pages: array of destination pages
1464 * @nr_pages: number of pages in @pages 1464 * @nr_pages: number of pages in @pages
1465 * @pos: byte position in file at which the write begins 1465 * @pos: byte position in file at which the write begins
1466 * @bytes: number of bytes to be written 1466 * @bytes: number of bytes to be written
1467 * 1467 *
1468 * See description of ntfs_commit_pages_after_write(), below. 1468 * See description of ntfs_commit_pages_after_write(), below.
1469 */ 1469 */
1470 static inline int ntfs_commit_pages_after_non_resident_write( 1470 static inline int ntfs_commit_pages_after_non_resident_write(
1471 struct page **pages, const unsigned nr_pages, 1471 struct page **pages, const unsigned nr_pages,
1472 s64 pos, size_t bytes) 1472 s64 pos, size_t bytes)
1473 { 1473 {
1474 s64 end, initialized_size; 1474 s64 end, initialized_size;
1475 struct inode *vi; 1475 struct inode *vi;
1476 ntfs_inode *ni, *base_ni; 1476 ntfs_inode *ni, *base_ni;
1477 struct buffer_head *bh, *head; 1477 struct buffer_head *bh, *head;
1478 ntfs_attr_search_ctx *ctx; 1478 ntfs_attr_search_ctx *ctx;
1479 MFT_RECORD *m; 1479 MFT_RECORD *m;
1480 ATTR_RECORD *a; 1480 ATTR_RECORD *a;
1481 unsigned long flags; 1481 unsigned long flags;
1482 unsigned blocksize, u; 1482 unsigned blocksize, u;
1483 int err; 1483 int err;
1484 1484
1485 vi = pages[0]->mapping->host; 1485 vi = pages[0]->mapping->host;
1486 ni = NTFS_I(vi); 1486 ni = NTFS_I(vi);
1487 blocksize = vi->i_sb->s_blocksize; 1487 blocksize = vi->i_sb->s_blocksize;
1488 end = pos + bytes; 1488 end = pos + bytes;
1489 u = 0; 1489 u = 0;
1490 do { 1490 do {
1491 s64 bh_pos; 1491 s64 bh_pos;
1492 struct page *page; 1492 struct page *page;
1493 bool partial; 1493 bool partial;
1494 1494
1495 page = pages[u]; 1495 page = pages[u];
1496 bh_pos = (s64)page->index << PAGE_CACHE_SHIFT; 1496 bh_pos = (s64)page->index << PAGE_CACHE_SHIFT;
1497 bh = head = page_buffers(page); 1497 bh = head = page_buffers(page);
1498 partial = false; 1498 partial = false;
1499 do { 1499 do {
1500 s64 bh_end; 1500 s64 bh_end;
1501 1501
1502 bh_end = bh_pos + blocksize; 1502 bh_end = bh_pos + blocksize;
1503 if (bh_end <= pos || bh_pos >= end) { 1503 if (bh_end <= pos || bh_pos >= end) {
1504 if (!buffer_uptodate(bh)) 1504 if (!buffer_uptodate(bh))
1505 partial = true; 1505 partial = true;
1506 } else { 1506 } else {
1507 set_buffer_uptodate(bh); 1507 set_buffer_uptodate(bh);
1508 mark_buffer_dirty(bh); 1508 mark_buffer_dirty(bh);
1509 } 1509 }
1510 } while (bh_pos += blocksize, (bh = bh->b_this_page) != head); 1510 } while (bh_pos += blocksize, (bh = bh->b_this_page) != head);
1511 /* 1511 /*
1512 * If all buffers are now uptodate but the page is not, set the 1512 * If all buffers are now uptodate but the page is not, set the
1513 * page uptodate. 1513 * page uptodate.
1514 */ 1514 */
1515 if (!partial && !PageUptodate(page)) 1515 if (!partial && !PageUptodate(page))
1516 SetPageUptodate(page); 1516 SetPageUptodate(page);
1517 } while (++u < nr_pages); 1517 } while (++u < nr_pages);
1518 /* 1518 /*
1519 * Finally, if we do not need to update initialized_size or i_size we 1519 * Finally, if we do not need to update initialized_size or i_size we
1520 * are finished. 1520 * are finished.
1521 */ 1521 */
1522 read_lock_irqsave(&ni->size_lock, flags); 1522 read_lock_irqsave(&ni->size_lock, flags);
1523 initialized_size = ni->initialized_size; 1523 initialized_size = ni->initialized_size;
1524 read_unlock_irqrestore(&ni->size_lock, flags); 1524 read_unlock_irqrestore(&ni->size_lock, flags);
1525 if (end <= initialized_size) { 1525 if (end <= initialized_size) {
1526 ntfs_debug("Done."); 1526 ntfs_debug("Done.");
1527 return 0; 1527 return 0;
1528 } 1528 }
1529 /* 1529 /*
1530 * Update initialized_size/i_size as appropriate, both in the inode and 1530 * Update initialized_size/i_size as appropriate, both in the inode and
1531 * the mft record. 1531 * the mft record.
1532 */ 1532 */
1533 if (!NInoAttr(ni)) 1533 if (!NInoAttr(ni))
1534 base_ni = ni; 1534 base_ni = ni;
1535 else 1535 else
1536 base_ni = ni->ext.base_ntfs_ino; 1536 base_ni = ni->ext.base_ntfs_ino;
1537 /* Map, pin, and lock the mft record. */ 1537 /* Map, pin, and lock the mft record. */
1538 m = map_mft_record(base_ni); 1538 m = map_mft_record(base_ni);
1539 if (IS_ERR(m)) { 1539 if (IS_ERR(m)) {
1540 err = PTR_ERR(m); 1540 err = PTR_ERR(m);
1541 m = NULL; 1541 m = NULL;
1542 ctx = NULL; 1542 ctx = NULL;
1543 goto err_out; 1543 goto err_out;
1544 } 1544 }
1545 BUG_ON(!NInoNonResident(ni)); 1545 BUG_ON(!NInoNonResident(ni));
1546 ctx = ntfs_attr_get_search_ctx(base_ni, m); 1546 ctx = ntfs_attr_get_search_ctx(base_ni, m);
1547 if (unlikely(!ctx)) { 1547 if (unlikely(!ctx)) {
1548 err = -ENOMEM; 1548 err = -ENOMEM;
1549 goto err_out; 1549 goto err_out;
1550 } 1550 }
1551 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, 1551 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
1552 CASE_SENSITIVE, 0, NULL, 0, ctx); 1552 CASE_SENSITIVE, 0, NULL, 0, ctx);
1553 if (unlikely(err)) { 1553 if (unlikely(err)) {
1554 if (err == -ENOENT) 1554 if (err == -ENOENT)
1555 err = -EIO; 1555 err = -EIO;
1556 goto err_out; 1556 goto err_out;
1557 } 1557 }
1558 a = ctx->attr; 1558 a = ctx->attr;
1559 BUG_ON(!a->non_resident); 1559 BUG_ON(!a->non_resident);
1560 write_lock_irqsave(&ni->size_lock, flags); 1560 write_lock_irqsave(&ni->size_lock, flags);
1561 BUG_ON(end > ni->allocated_size); 1561 BUG_ON(end > ni->allocated_size);
1562 ni->initialized_size = end; 1562 ni->initialized_size = end;
1563 a->data.non_resident.initialized_size = cpu_to_sle64(end); 1563 a->data.non_resident.initialized_size = cpu_to_sle64(end);
1564 if (end > i_size_read(vi)) { 1564 if (end > i_size_read(vi)) {
1565 i_size_write(vi, end); 1565 i_size_write(vi, end);
1566 a->data.non_resident.data_size = 1566 a->data.non_resident.data_size =
1567 a->data.non_resident.initialized_size; 1567 a->data.non_resident.initialized_size;
1568 } 1568 }
1569 write_unlock_irqrestore(&ni->size_lock, flags); 1569 write_unlock_irqrestore(&ni->size_lock, flags);
1570 /* Mark the mft record dirty, so it gets written back. */ 1570 /* Mark the mft record dirty, so it gets written back. */
1571 flush_dcache_mft_record_page(ctx->ntfs_ino); 1571 flush_dcache_mft_record_page(ctx->ntfs_ino);
1572 mark_mft_record_dirty(ctx->ntfs_ino); 1572 mark_mft_record_dirty(ctx->ntfs_ino);
1573 ntfs_attr_put_search_ctx(ctx); 1573 ntfs_attr_put_search_ctx(ctx);
1574 unmap_mft_record(base_ni); 1574 unmap_mft_record(base_ni);
1575 ntfs_debug("Done."); 1575 ntfs_debug("Done.");
1576 return 0; 1576 return 0;
1577 err_out: 1577 err_out:
1578 if (ctx) 1578 if (ctx)
1579 ntfs_attr_put_search_ctx(ctx); 1579 ntfs_attr_put_search_ctx(ctx);
1580 if (m) 1580 if (m)
1581 unmap_mft_record(base_ni); 1581 unmap_mft_record(base_ni);
1582 ntfs_error(vi->i_sb, "Failed to update initialized_size/i_size (error " 1582 ntfs_error(vi->i_sb, "Failed to update initialized_size/i_size (error "
1583 "code %i).", err); 1583 "code %i).", err);
1584 if (err != -ENOMEM) 1584 if (err != -ENOMEM)
1585 NVolSetErrors(ni->vol); 1585 NVolSetErrors(ni->vol);
1586 return err; 1586 return err;
1587 } 1587 }
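
The inner buffer loop above decides, per buffer head, whether the buffer lies inside the byte range being written. Expressed positively, a buffer covering [bh_pos, bh_pos + blocksize) is touched by a write to [pos, end) exactly when the two half-open intervals overlap; the loop open-codes the negation of that test. A hypothetical helper making the condition explicit:

	/* Hypothetical helper, for illustration only (not in the patch). */
	static inline bool bh_in_write_range(s64 bh_pos, unsigned blocksize,
			s64 pos, s64 end)
	{
		/* Overlap of [bh_pos, bh_pos + blocksize) with [pos, end). */
		return bh_pos + blocksize > pos && bh_pos < end;
	}
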
1588 1588
1589 /** 1589 /**
1590 * ntfs_commit_pages_after_write - commit the received data 1590 * ntfs_commit_pages_after_write - commit the received data
1591 * @pages: array of destination pages 1591 * @pages: array of destination pages
1592 * @nr_pages: number of pages in @pages 1592 * @nr_pages: number of pages in @pages
1593 * @pos: byte position in file at which the write begins 1593 * @pos: byte position in file at which the write begins
1594 * @bytes: number of bytes to be written 1594 * @bytes: number of bytes to be written
1595 * 1595 *
1596 * This is called from ntfs_file_buffered_write() with i_mutex held on the inode 1596 * This is called from ntfs_file_buffered_write() with i_mutex held on the inode
1597 * (@pages[0]->mapping->host). There are @nr_pages pages in @pages which are 1597 * (@pages[0]->mapping->host). There are @nr_pages pages in @pages which are
1598 * locked but not kmap()ped. The source data has already been copied into the 1598 * locked but not kmap()ped. The source data has already been copied into the
1599 * @page. ntfs_prepare_pages_for_non_resident_write() has been called before 1599 * @page. ntfs_prepare_pages_for_non_resident_write() has been called before
1600 * the data was copied (for non-resident attributes only) and it returned 1600 * the data was copied (for non-resident attributes only) and it returned
1601 * success. 1601 * success.
1602 * 1602 *
1603 * Need to set uptodate and mark dirty all buffers within the boundary of the 1603 * Need to set uptodate and mark dirty all buffers within the boundary of the
1604 * write. If all buffers in a page are uptodate we set the page uptodate, too. 1604 * write. If all buffers in a page are uptodate we set the page uptodate, too.
1605 * 1605 *
1606 * Setting the buffers dirty ensures that they get written out later when 1606 * Setting the buffers dirty ensures that they get written out later when
1607 * ntfs_writepage() is invoked by the VM. 1607 * ntfs_writepage() is invoked by the VM.
1608 * 1608 *
1609 * Finally, we need to update i_size and initialized_size as appropriate both 1609 * Finally, we need to update i_size and initialized_size as appropriate both
1610 * in the inode and the mft record. 1610 * in the inode and the mft record.
1611 * 1611 *
1612 * This is modelled after fs/buffer.c::generic_commit_write(), which marks 1612 * This is modelled after fs/buffer.c::generic_commit_write(), which marks
1613 * buffers uptodate and dirty, sets the page uptodate if all buffers in the 1613 * buffers uptodate and dirty, sets the page uptodate if all buffers in the
1614 * page are uptodate, and updates i_size if the end of io is beyond i_size. In 1614 * page are uptodate, and updates i_size if the end of io is beyond i_size. In
1615 * that case, it also marks the inode dirty. 1615 * that case, it also marks the inode dirty.
1616 * 1616 *
1617 * If things have gone as outlined in 1617 * If things have gone as outlined in
1618 * ntfs_prepare_pages_for_non_resident_write(), we do not need to do any page 1618 * ntfs_prepare_pages_for_non_resident_write(), we do not need to do any page
1619 * content modifications here for non-resident attributes. For resident 1619 * content modifications here for non-resident attributes. For resident
1620 * attributes we need to do the uptodate bringing here which we combine with 1620 * attributes we need to do the uptodate bringing here which we combine with
1621 * the copying into the mft record which means we save one atomic kmap. 1621 * the copying into the mft record which means we save one atomic kmap.
1622 * 1622 *
1623 * Return 0 on success or -errno on error. 1623 * Return 0 on success or -errno on error.
1624 */ 1624 */
1625 static int ntfs_commit_pages_after_write(struct page **pages, 1625 static int ntfs_commit_pages_after_write(struct page **pages,
1626 const unsigned nr_pages, s64 pos, size_t bytes) 1626 const unsigned nr_pages, s64 pos, size_t bytes)
1627 { 1627 {
1628 s64 end, initialized_size; 1628 s64 end, initialized_size;
1629 loff_t i_size; 1629 loff_t i_size;
1630 struct inode *vi; 1630 struct inode *vi;
1631 ntfs_inode *ni, *base_ni; 1631 ntfs_inode *ni, *base_ni;
1632 struct page *page; 1632 struct page *page;
1633 ntfs_attr_search_ctx *ctx; 1633 ntfs_attr_search_ctx *ctx;
1634 MFT_RECORD *m; 1634 MFT_RECORD *m;
1635 ATTR_RECORD *a; 1635 ATTR_RECORD *a;
1636 char *kattr, *kaddr; 1636 char *kattr, *kaddr;
1637 unsigned long flags; 1637 unsigned long flags;
1638 u32 attr_len; 1638 u32 attr_len;
1639 int err; 1639 int err;
1640 1640
1641 BUG_ON(!nr_pages); 1641 BUG_ON(!nr_pages);
1642 BUG_ON(!pages); 1642 BUG_ON(!pages);
1643 page = pages[0]; 1643 page = pages[0];
1644 BUG_ON(!page); 1644 BUG_ON(!page);
1645 vi = page->mapping->host; 1645 vi = page->mapping->host;
1646 ni = NTFS_I(vi); 1646 ni = NTFS_I(vi);
1647 ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, start page " 1647 ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, start page "
1648 "index 0x%lx, nr_pages 0x%x, pos 0x%llx, bytes 0x%zx.", 1648 "index 0x%lx, nr_pages 0x%x, pos 0x%llx, bytes 0x%zx.",
1649 vi->i_ino, ni->type, page->index, nr_pages, 1649 vi->i_ino, ni->type, page->index, nr_pages,
1650 (long long)pos, bytes); 1650 (long long)pos, bytes);
1651 if (NInoNonResident(ni)) 1651 if (NInoNonResident(ni))
1652 return ntfs_commit_pages_after_non_resident_write(pages, 1652 return ntfs_commit_pages_after_non_resident_write(pages,
1653 nr_pages, pos, bytes); 1653 nr_pages, pos, bytes);
1654 BUG_ON(nr_pages > 1); 1654 BUG_ON(nr_pages > 1);
1655 /* 1655 /*
1656 * Attribute is resident, implying it is not compressed, encrypted, or 1656 * Attribute is resident, implying it is not compressed, encrypted, or
1657 * sparse. 1657 * sparse.
1658 */ 1658 */
1659 if (!NInoAttr(ni)) 1659 if (!NInoAttr(ni))
1660 base_ni = ni; 1660 base_ni = ni;
1661 else 1661 else
1662 base_ni = ni->ext.base_ntfs_ino; 1662 base_ni = ni->ext.base_ntfs_ino;
1663 BUG_ON(NInoNonResident(ni)); 1663 BUG_ON(NInoNonResident(ni));
1664 /* Map, pin, and lock the mft record. */ 1664 /* Map, pin, and lock the mft record. */
1665 m = map_mft_record(base_ni); 1665 m = map_mft_record(base_ni);
1666 if (IS_ERR(m)) { 1666 if (IS_ERR(m)) {
1667 err = PTR_ERR(m); 1667 err = PTR_ERR(m);
1668 m = NULL; 1668 m = NULL;
1669 ctx = NULL; 1669 ctx = NULL;
1670 goto err_out; 1670 goto err_out;
1671 } 1671 }
1672 ctx = ntfs_attr_get_search_ctx(base_ni, m); 1672 ctx = ntfs_attr_get_search_ctx(base_ni, m);
1673 if (unlikely(!ctx)) { 1673 if (unlikely(!ctx)) {
1674 err = -ENOMEM; 1674 err = -ENOMEM;
1675 goto err_out; 1675 goto err_out;
1676 } 1676 }
1677 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, 1677 err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len,
1678 CASE_SENSITIVE, 0, NULL, 0, ctx); 1678 CASE_SENSITIVE, 0, NULL, 0, ctx);
1679 if (unlikely(err)) { 1679 if (unlikely(err)) {
1680 if (err == -ENOENT) 1680 if (err == -ENOENT)
1681 err = -EIO; 1681 err = -EIO;
1682 goto err_out; 1682 goto err_out;
1683 } 1683 }
1684 a = ctx->attr; 1684 a = ctx->attr;
1685 BUG_ON(a->non_resident); 1685 BUG_ON(a->non_resident);
1686 /* The total length of the attribute value. */ 1686 /* The total length of the attribute value. */
1687 attr_len = le32_to_cpu(a->data.resident.value_length); 1687 attr_len = le32_to_cpu(a->data.resident.value_length);
1688 i_size = i_size_read(vi); 1688 i_size = i_size_read(vi);
1689 BUG_ON(attr_len != i_size); 1689 BUG_ON(attr_len != i_size);
1690 BUG_ON(pos > attr_len); 1690 BUG_ON(pos > attr_len);
1691 end = pos + bytes; 1691 end = pos + bytes;
1692 BUG_ON(end > le32_to_cpu(a->length) - 1692 BUG_ON(end > le32_to_cpu(a->length) -
1693 le16_to_cpu(a->data.resident.value_offset)); 1693 le16_to_cpu(a->data.resident.value_offset));
1694 kattr = (u8*)a + le16_to_cpu(a->data.resident.value_offset); 1694 kattr = (u8*)a + le16_to_cpu(a->data.resident.value_offset);
1695 kaddr = kmap_atomic(page); 1695 kaddr = kmap_atomic(page);
1696 /* Copy the received data from the page to the mft record. */ 1696 /* Copy the received data from the page to the mft record. */
1697 memcpy(kattr + pos, kaddr + pos, bytes); 1697 memcpy(kattr + pos, kaddr + pos, bytes);
1698 /* Update the attribute length if necessary. */ 1698 /* Update the attribute length if necessary. */
1699 if (end > attr_len) { 1699 if (end > attr_len) {
1700 attr_len = end; 1700 attr_len = end;
1701 a->data.resident.value_length = cpu_to_le32(attr_len); 1701 a->data.resident.value_length = cpu_to_le32(attr_len);
1702 } 1702 }
1703 /* 1703 /*
1704 * If the page is not uptodate, bring the out of bounds area(s) 1704 * If the page is not uptodate, bring the out of bounds area(s)
1705 * uptodate by copying data from the mft record to the page. 1705 * uptodate by copying data from the mft record to the page.
1706 */ 1706 */
1707 if (!PageUptodate(page)) { 1707 if (!PageUptodate(page)) {
1708 if (pos > 0) 1708 if (pos > 0)
1709 memcpy(kaddr, kattr, pos); 1709 memcpy(kaddr, kattr, pos);
1710 if (end < attr_len) 1710 if (end < attr_len)
1711 memcpy(kaddr + end, kattr + end, attr_len - end); 1711 memcpy(kaddr + end, kattr + end, attr_len - end);
1712 /* Zero the region outside the end of the attribute value. */ 1712 /* Zero the region outside the end of the attribute value. */
1713 memset(kaddr + attr_len, 0, PAGE_CACHE_SIZE - attr_len); 1713 memset(kaddr + attr_len, 0, PAGE_CACHE_SIZE - attr_len);
1714 flush_dcache_page(page); 1714 flush_dcache_page(page);
1715 SetPageUptodate(page); 1715 SetPageUptodate(page);
1716 } 1716 }
1717 kunmap_atomic(kaddr); 1717 kunmap_atomic(kaddr);
1718 /* Update initialized_size/i_size if necessary. */ 1718 /* Update initialized_size/i_size if necessary. */
1719 read_lock_irqsave(&ni->size_lock, flags); 1719 read_lock_irqsave(&ni->size_lock, flags);
1720 initialized_size = ni->initialized_size; 1720 initialized_size = ni->initialized_size;
1721 BUG_ON(end > ni->allocated_size); 1721 BUG_ON(end > ni->allocated_size);
1722 read_unlock_irqrestore(&ni->size_lock, flags); 1722 read_unlock_irqrestore(&ni->size_lock, flags);
1723 BUG_ON(initialized_size != i_size); 1723 BUG_ON(initialized_size != i_size);
1724 if (end > initialized_size) { 1724 if (end > initialized_size) {
1725 write_lock_irqsave(&ni->size_lock, flags); 1725 write_lock_irqsave(&ni->size_lock, flags);
1726 ni->initialized_size = end; 1726 ni->initialized_size = end;
1727 i_size_write(vi, end); 1727 i_size_write(vi, end);
1728 write_unlock_irqrestore(&ni->size_lock, flags); 1728 write_unlock_irqrestore(&ni->size_lock, flags);
1729 } 1729 }
1730 /* Mark the mft record dirty, so it gets written back. */ 1730 /* Mark the mft record dirty, so it gets written back. */
1731 flush_dcache_mft_record_page(ctx->ntfs_ino); 1731 flush_dcache_mft_record_page(ctx->ntfs_ino);
1732 mark_mft_record_dirty(ctx->ntfs_ino); 1732 mark_mft_record_dirty(ctx->ntfs_ino);
1733 ntfs_attr_put_search_ctx(ctx); 1733 ntfs_attr_put_search_ctx(ctx);
1734 unmap_mft_record(base_ni); 1734 unmap_mft_record(base_ni);
1735 ntfs_debug("Done."); 1735 ntfs_debug("Done.");
1736 return 0; 1736 return 0;
1737 err_out: 1737 err_out:
1738 if (err == -ENOMEM) { 1738 if (err == -ENOMEM) {
1739 ntfs_warning(vi->i_sb, "Error allocating memory required to " 1739 ntfs_warning(vi->i_sb, "Error allocating memory required to "
1740 "commit the write."); 1740 "commit the write.");
1741 if (PageUptodate(page)) { 1741 if (PageUptodate(page)) {
1742 ntfs_warning(vi->i_sb, "Page is uptodate, setting " 1742 ntfs_warning(vi->i_sb, "Page is uptodate, setting "
1743 "dirty so the write will be retried " 1743 "dirty so the write will be retried "
1744 "later on by the VM."); 1744 "later on by the VM.");
1745 /* 1745 /*
1746 * Put the page on mapping->dirty_pages, but leave its 1746 * Put the page on mapping->dirty_pages, but leave its
1747 * buffers' dirty state as-is. 1747 * buffers' dirty state as-is.
1748 */ 1748 */
1749 __set_page_dirty_nobuffers(page); 1749 __set_page_dirty_nobuffers(page);
1750 err = 0; 1750 err = 0;
1751 } else 1751 } else
1752 ntfs_error(vi->i_sb, "Page is not uptodate. Written " 1752 ntfs_error(vi->i_sb, "Page is not uptodate. Written "
1753 "data has been lost."); 1753 "data has been lost.");
1754 } else { 1754 } else {
1755 ntfs_error(vi->i_sb, "Resident attribute commit write failed " 1755 ntfs_error(vi->i_sb, "Resident attribute commit write failed "
1756 "with error %i.", err); 1756 "with error %i.", err);
1757 NVolSetErrors(ni->vol); 1757 NVolSetErrors(ni->vol);
1758 } 1758 }
1759 if (ctx) 1759 if (ctx)
1760 ntfs_attr_put_search_ctx(ctx); 1760 ntfs_attr_put_search_ctx(ctx);
1761 if (m) 1761 if (m)
1762 unmap_mft_record(base_ni); 1762 unmap_mft_record(base_ni);
1763 return err; 1763 return err;
1764 } 1764 }
1765 1765
1766 static void ntfs_write_failed(struct address_space *mapping, loff_t to) 1766 static void ntfs_write_failed(struct address_space *mapping, loff_t to)
1767 { 1767 {
1768 struct inode *inode = mapping->host; 1768 struct inode *inode = mapping->host;
1769 1769
1770 if (to > inode->i_size) { 1770 if (to > inode->i_size) {
1771 truncate_pagecache(inode, inode->i_size); 1771 truncate_pagecache(inode, inode->i_size);
1772 ntfs_truncate_vfs(inode); 1772 ntfs_truncate_vfs(inode);
1773 } 1773 }
1774 } 1774 }
1775 1775
1776 /** 1776 /**
1777 * ntfs_file_buffered_write - 1777 * ntfs_file_buffered_write -
1778 * 1778 *
1779 * Locking: The vfs is holding ->i_mutex on the inode. 1779 * Locking: The vfs is holding ->i_mutex on the inode.
1780 */ 1780 */
1781 static ssize_t ntfs_file_buffered_write(struct kiocb *iocb, 1781 static ssize_t ntfs_file_buffered_write(struct kiocb *iocb,
1782 const struct iovec *iov, unsigned long nr_segs, 1782 const struct iovec *iov, unsigned long nr_segs,
1783 loff_t pos, loff_t *ppos, size_t count) 1783 loff_t pos, loff_t *ppos, size_t count)
1784 { 1784 {
1785 struct file *file = iocb->ki_filp; 1785 struct file *file = iocb->ki_filp;
1786 struct address_space *mapping = file->f_mapping; 1786 struct address_space *mapping = file->f_mapping;
1787 struct inode *vi = mapping->host; 1787 struct inode *vi = mapping->host;
1788 ntfs_inode *ni = NTFS_I(vi); 1788 ntfs_inode *ni = NTFS_I(vi);
1789 ntfs_volume *vol = ni->vol; 1789 ntfs_volume *vol = ni->vol;
1790 struct page *pages[NTFS_MAX_PAGES_PER_CLUSTER]; 1790 struct page *pages[NTFS_MAX_PAGES_PER_CLUSTER];
1791 struct page *cached_page = NULL; 1791 struct page *cached_page = NULL;
1792 char __user *buf = NULL; 1792 char __user *buf = NULL;
1793 s64 end, ll; 1793 s64 end, ll;
1794 VCN last_vcn; 1794 VCN last_vcn;
1795 LCN lcn; 1795 LCN lcn;
1796 unsigned long flags; 1796 unsigned long flags;
1797 size_t bytes, iov_ofs = 0; /* Offset in the current iovec. */ 1797 size_t bytes, iov_ofs = 0; /* Offset in the current iovec. */
1798 ssize_t status, written; 1798 ssize_t status, written;
1799 unsigned nr_pages; 1799 unsigned nr_pages;
1800 int err; 1800 int err;
1801 1801
1802 ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, " 1802 ntfs_debug("Entering for i_ino 0x%lx, attribute type 0x%x, "
1803 "pos 0x%llx, count 0x%lx.", 1803 "pos 0x%llx, count 0x%lx.",
1804 vi->i_ino, (unsigned)le32_to_cpu(ni->type), 1804 vi->i_ino, (unsigned)le32_to_cpu(ni->type),
1805 (unsigned long long)pos, (unsigned long)count); 1805 (unsigned long long)pos, (unsigned long)count);
1806 if (unlikely(!count)) 1806 if (unlikely(!count))
1807 return 0; 1807 return 0;
1808 BUG_ON(NInoMstProtected(ni)); 1808 BUG_ON(NInoMstProtected(ni));
1809 /* 1809 /*
1810 * If the attribute is not an index root and it is encrypted or 1810 * If the attribute is not an index root and it is encrypted or
1811 * compressed, we cannot write to it yet. Note we need to check for 1811 * compressed, we cannot write to it yet. Note we need to check for
1812 * AT_INDEX_ALLOCATION since this is the type of both directory and 1812 * AT_INDEX_ALLOCATION since this is the type of both directory and
1813 * index inodes. 1813 * index inodes.
1814 */ 1814 */
1815 if (ni->type != AT_INDEX_ALLOCATION) { 1815 if (ni->type != AT_INDEX_ALLOCATION) {
1816 /* If file is encrypted, deny access, just like NT4. */ 1816 /* If file is encrypted, deny access, just like NT4. */
1817 if (NInoEncrypted(ni)) { 1817 if (NInoEncrypted(ni)) {
1818 /* 1818 /*
1819 * Reminder for later: Encrypted files are _always_ 1819 * Reminder for later: Encrypted files are _always_
1820 * non-resident so that the content can always be 1820 * non-resident so that the content can always be
1821 * encrypted. 1821 * encrypted.
1822 */ 1822 */
1823 ntfs_debug("Denying write access to encrypted file."); 1823 ntfs_debug("Denying write access to encrypted file.");
1824 return -EACCES; 1824 return -EACCES;
1825 } 1825 }
1826 if (NInoCompressed(ni)) { 1826 if (NInoCompressed(ni)) {
1827 /* Only unnamed $DATA attribute can be compressed. */ 1827 /* Only unnamed $DATA attribute can be compressed. */
1828 BUG_ON(ni->type != AT_DATA); 1828 BUG_ON(ni->type != AT_DATA);
1829 BUG_ON(ni->name_len); 1829 BUG_ON(ni->name_len);
1830 /* 1830 /*
1831 * Reminder for later: If resident, the data is not 1831 * Reminder for later: If resident, the data is not
1832 * actually compressed. Only on the switch to non- 1832 * actually compressed. Only on the switch to non-
1833 * resident does compression kick in. This is in 1833 * resident does compression kick in. This is in
1834 * contrast to encrypted files (see above). 1834 * contrast to encrypted files (see above).
1835 */ 1835 */
1836 ntfs_error(vi->i_sb, "Writing to compressed files is " 1836 ntfs_error(vi->i_sb, "Writing to compressed files is "
1837 "not implemented yet. Sorry."); 1837 "not implemented yet. Sorry.");
1838 return -EOPNOTSUPP; 1838 return -EOPNOTSUPP;
1839 } 1839 }
1840 } 1840 }
1841 /* 1841 /*
1842 * If a previous ntfs_truncate() failed, repeat it and abort if it 1842 * If a previous ntfs_truncate() failed, repeat it and abort if it
1843 * fails again. 1843 * fails again.
1844 */ 1844 */
1845 if (unlikely(NInoTruncateFailed(ni))) { 1845 if (unlikely(NInoTruncateFailed(ni))) {
1846 inode_dio_wait(vi); 1846 inode_dio_wait(vi);
1847 err = ntfs_truncate(vi); 1847 err = ntfs_truncate(vi);
1848 if (err || NInoTruncateFailed(ni)) { 1848 if (err || NInoTruncateFailed(ni)) {
1849 if (!err) 1849 if (!err)
1850 err = -EIO; 1850 err = -EIO;
1851 ntfs_error(vol->sb, "Cannot perform write to inode " 1851 ntfs_error(vol->sb, "Cannot perform write to inode "
1852 "0x%lx, attribute type 0x%x, because " 1852 "0x%lx, attribute type 0x%x, because "
1853 "ntfs_truncate() failed (error code " 1853 "ntfs_truncate() failed (error code "
1854 "%i).", vi->i_ino, 1854 "%i).", vi->i_ino,
1855 (unsigned)le32_to_cpu(ni->type), err); 1855 (unsigned)le32_to_cpu(ni->type), err);
1856 return err; 1856 return err;
1857 } 1857 }
1858 } 1858 }
1859 /* The first byte after the write. */ 1859 /* The first byte after the write. */
1860 end = pos + count; 1860 end = pos + count;
1861 /* 1861 /*
1862 * If the write goes beyond the allocated size, extend the allocation 1862 * If the write goes beyond the allocated size, extend the allocation
1863 * to cover the whole of the write, rounded up to the nearest cluster. 1863 * to cover the whole of the write, rounded up to the nearest cluster.
1864 */ 1864 */
1865 read_lock_irqsave(&ni->size_lock, flags); 1865 read_lock_irqsave(&ni->size_lock, flags);
1866 ll = ni->allocated_size; 1866 ll = ni->allocated_size;
1867 read_unlock_irqrestore(&ni->size_lock, flags); 1867 read_unlock_irqrestore(&ni->size_lock, flags);
1868 if (end > ll) { 1868 if (end > ll) {
1869 /* Extend the allocation without changing the data size. */ 1869 /* Extend the allocation without changing the data size. */
1870 ll = ntfs_attr_extend_allocation(ni, end, -1, pos); 1870 ll = ntfs_attr_extend_allocation(ni, end, -1, pos);
1871 if (likely(ll >= 0)) { 1871 if (likely(ll >= 0)) {
1872 BUG_ON(pos >= ll); 1872 BUG_ON(pos >= ll);
1873 /* If the extension was partial truncate the write. */ 1873 /* If the extension was partial truncate the write. */
1874 if (end > ll) { 1874 if (end > ll) {
1875 ntfs_debug("Truncating write to inode 0x%lx, " 1875 ntfs_debug("Truncating write to inode 0x%lx, "
1876 "attribute type 0x%x, because " 1876 "attribute type 0x%x, because "
1877 "the allocation was only " 1877 "the allocation was only "
1878 "partially extended.", 1878 "partially extended.",
1879 vi->i_ino, (unsigned) 1879 vi->i_ino, (unsigned)
1880 le32_to_cpu(ni->type)); 1880 le32_to_cpu(ni->type));
1881 end = ll; 1881 end = ll;
1882 count = ll - pos; 1882 count = ll - pos;
1883 } 1883 }
1884 } else { 1884 } else {
1885 err = ll; 1885 err = ll;
1886 read_lock_irqsave(&ni->size_lock, flags); 1886 read_lock_irqsave(&ni->size_lock, flags);
1887 ll = ni->allocated_size; 1887 ll = ni->allocated_size;
1888 read_unlock_irqrestore(&ni->size_lock, flags); 1888 read_unlock_irqrestore(&ni->size_lock, flags);
1889 /* Perform a partial write if possible or fail. */ 1889 /* Perform a partial write if possible or fail. */
1890 if (pos < ll) { 1890 if (pos < ll) {
1891 ntfs_debug("Truncating write to inode 0x%lx, " 1891 ntfs_debug("Truncating write to inode 0x%lx, "
1892 "attribute type 0x%x, because " 1892 "attribute type 0x%x, because "
1893 "extending the allocation " 1893 "extending the allocation "
1894 "failed (error code %i).", 1894 "failed (error code %i).",
1895 vi->i_ino, (unsigned) 1895 vi->i_ino, (unsigned)
1896 le32_to_cpu(ni->type), err); 1896 le32_to_cpu(ni->type), err);
1897 end = ll; 1897 end = ll;
1898 count = ll - pos; 1898 count = ll - pos;
1899 } else { 1899 } else {
1900 ntfs_error(vol->sb, "Cannot perform write to " 1900 ntfs_error(vol->sb, "Cannot perform write to "
1901 "inode 0x%lx, attribute type " 1901 "inode 0x%lx, attribute type "
1902 "0x%x, because extending the " 1902 "0x%x, because extending the "
1903 "allocation failed (error " 1903 "allocation failed (error "
1904 "code %i).", vi->i_ino, 1904 "code %i).", vi->i_ino,
1905 (unsigned) 1905 (unsigned)
1906 le32_to_cpu(ni->type), err); 1906 le32_to_cpu(ni->type), err);
1907 return err; 1907 return err;
1908 } 1908 }
1909 } 1909 }
1910 } 1910 }
1911 written = 0; 1911 written = 0;
1912 /* 1912 /*
1913 * If the write starts beyond the initialized size, extend it up to the 1913 * If the write starts beyond the initialized size, extend it up to the
1914 * beginning of the write and initialize all non-sparse space between 1914 * beginning of the write and initialize all non-sparse space between
1915 * the old initialized size and the new one. This automatically also 1915 * the old initialized size and the new one. This automatically also
1916 * increments the vfs inode->i_size to keep it above or equal to the 1916 * increments the vfs inode->i_size to keep it above or equal to the
1917 * initialized_size. 1917 * initialized_size.
1918 */ 1918 */
1919 read_lock_irqsave(&ni->size_lock, flags); 1919 read_lock_irqsave(&ni->size_lock, flags);
1920 ll = ni->initialized_size; 1920 ll = ni->initialized_size;
1921 read_unlock_irqrestore(&ni->size_lock, flags); 1921 read_unlock_irqrestore(&ni->size_lock, flags);
1922 if (pos > ll) { 1922 if (pos > ll) {
1923 err = ntfs_attr_extend_initialized(ni, pos); 1923 err = ntfs_attr_extend_initialized(ni, pos);
1924 if (err < 0) { 1924 if (err < 0) {
1925 ntfs_error(vol->sb, "Cannot perform write to inode " 1925 ntfs_error(vol->sb, "Cannot perform write to inode "
1926 "0x%lx, attribute type 0x%x, because " 1926 "0x%lx, attribute type 0x%x, because "
1927 "extending the initialized size " 1927 "extending the initialized size "
1928 "failed (error code %i).", vi->i_ino, 1928 "failed (error code %i).", vi->i_ino,
1929 (unsigned)le32_to_cpu(ni->type), err); 1929 (unsigned)le32_to_cpu(ni->type), err);
1930 status = err; 1930 status = err;
1931 goto err_out; 1931 goto err_out;
1932 } 1932 }
1933 } 1933 }
1934 /* 1934 /*
1935 * Determine the number of pages per cluster for non-resident 1935 * Determine the number of pages per cluster for non-resident
1936 * attributes. 1936 * attributes.
1937 */ 1937 */
1938 nr_pages = 1; 1938 nr_pages = 1;
1939 if (vol->cluster_size > PAGE_CACHE_SIZE && NInoNonResident(ni)) 1939 if (vol->cluster_size > PAGE_CACHE_SIZE && NInoNonResident(ni))
1940 nr_pages = vol->cluster_size >> PAGE_CACHE_SHIFT; 1940 nr_pages = vol->cluster_size >> PAGE_CACHE_SHIFT;
1941 /* Finally, perform the actual write. */ 1941 /* Finally, perform the actual write. */
1942 last_vcn = -1; 1942 last_vcn = -1;
1943 if (likely(nr_segs == 1)) 1943 if (likely(nr_segs == 1))
1944 buf = iov->iov_base; 1944 buf = iov->iov_base;
1945 do { 1945 do {
1946 VCN vcn; 1946 VCN vcn;
1947 pgoff_t idx, start_idx; 1947 pgoff_t idx, start_idx;
1948 unsigned ofs, do_pages, u; 1948 unsigned ofs, do_pages, u;
1949 size_t copied; 1949 size_t copied;
1950 1950
1951 start_idx = idx = pos >> PAGE_CACHE_SHIFT; 1951 start_idx = idx = pos >> PAGE_CACHE_SHIFT;
1952 ofs = pos & ~PAGE_CACHE_MASK; 1952 ofs = pos & ~PAGE_CACHE_MASK;
1953 bytes = PAGE_CACHE_SIZE - ofs; 1953 bytes = PAGE_CACHE_SIZE - ofs;
1954 do_pages = 1; 1954 do_pages = 1;
1955 if (nr_pages > 1) { 1955 if (nr_pages > 1) {
1956 vcn = pos >> vol->cluster_size_bits; 1956 vcn = pos >> vol->cluster_size_bits;
1957 if (vcn != last_vcn) { 1957 if (vcn != last_vcn) {
1958 last_vcn = vcn; 1958 last_vcn = vcn;
1959 /* 1959 /*
1960 * Get the lcn of the vcn the write is in. If 1960 * Get the lcn of the vcn the write is in. If
1961 * it is a hole, need to lock down all pages in 1961 * it is a hole, need to lock down all pages in
1962 * the cluster. 1962 * the cluster.
1963 */ 1963 */
1964 down_read(&ni->runlist.lock); 1964 down_read(&ni->runlist.lock);
1965 lcn = ntfs_attr_vcn_to_lcn_nolock(ni, pos >> 1965 lcn = ntfs_attr_vcn_to_lcn_nolock(ni, pos >>
1966 vol->cluster_size_bits, false); 1966 vol->cluster_size_bits, false);
1967 up_read(&ni->runlist.lock); 1967 up_read(&ni->runlist.lock);
1968 if (unlikely(lcn < LCN_HOLE)) { 1968 if (unlikely(lcn < LCN_HOLE)) {
1969 status = -EIO; 1969 status = -EIO;
1970 if (lcn == LCN_ENOMEM) 1970 if (lcn == LCN_ENOMEM)
1971 status = -ENOMEM; 1971 status = -ENOMEM;
1972 else 1972 else
1973 ntfs_error(vol->sb, "Cannot " 1973 ntfs_error(vol->sb, "Cannot "
1974 "perform write to " 1974 "perform write to "
1975 "inode 0x%lx, " 1975 "inode 0x%lx, "
1976 "attribute type 0x%x, " 1976 "attribute type 0x%x, "
1977 "because the attribute " 1977 "because the attribute "
1978 "is corrupt.", 1978 "is corrupt.",
1979 vi->i_ino, (unsigned) 1979 vi->i_ino, (unsigned)
1980 le32_to_cpu(ni->type)); 1980 le32_to_cpu(ni->type));
1981 break; 1981 break;
1982 } 1982 }
1983 if (lcn == LCN_HOLE) { 1983 if (lcn == LCN_HOLE) {
1984 start_idx = (pos & ~(s64) 1984 start_idx = (pos & ~(s64)
1985 vol->cluster_size_mask) 1985 vol->cluster_size_mask)
1986 >> PAGE_CACHE_SHIFT; 1986 >> PAGE_CACHE_SHIFT;
1987 bytes = vol->cluster_size - (pos & 1987 bytes = vol->cluster_size - (pos &
1988 vol->cluster_size_mask); 1988 vol->cluster_size_mask);
1989 do_pages = nr_pages; 1989 do_pages = nr_pages;
1990 } 1990 }
1991 } 1991 }
1992 } 1992 }
1993 if (bytes > count) 1993 if (bytes > count)
1994 bytes = count; 1994 bytes = count;
1995 /* 1995 /*
1996 * Bring in the user page(s) that we will copy from _first_. 1996 * Bring in the user page(s) that we will copy from _first_.
1997 * Otherwise there is a nasty deadlock on copying from the same 1997 * Otherwise there is a nasty deadlock on copying from the same
1998 * page(s) as we are writing to, without it/them being marked 1998 * page(s) as we are writing to, without it/them being marked
1999 * up-to-date. Note, at present there is nothing to stop the 1999 * up-to-date. Note, at present there is nothing to stop the
2000 * pages being swapped out between us bringing them into memory 2000 * pages being swapped out between us bringing them into memory
2001 * and doing the actual copying. 2001 * and doing the actual copying.
2002 */ 2002 */
2003 if (likely(nr_segs == 1)) 2003 if (likely(nr_segs == 1))
2004 ntfs_fault_in_pages_readable(buf, bytes); 2004 ntfs_fault_in_pages_readable(buf, bytes);
2005 else 2005 else
2006 ntfs_fault_in_pages_readable_iovec(iov, iov_ofs, bytes); 2006 ntfs_fault_in_pages_readable_iovec(iov, iov_ofs, bytes);
2007 /* Get and lock @do_pages starting at index @start_idx. */ 2007 /* Get and lock @do_pages starting at index @start_idx. */
2008 status = __ntfs_grab_cache_pages(mapping, start_idx, do_pages, 2008 status = __ntfs_grab_cache_pages(mapping, start_idx, do_pages,
2009 pages, &cached_page); 2009 pages, &cached_page);
2010 if (unlikely(status)) 2010 if (unlikely(status))
2011 break; 2011 break;
2012 /* 2012 /*
2013 * For non-resident attributes, we need to fill any holes with 2013 * For non-resident attributes, we need to fill any holes with
2014 actual clusters and ensure all buffers are mapped. We also 2014 actual clusters and ensure all buffers are mapped. We also
2015 * need to bring uptodate any buffers that are only partially 2015 * need to bring uptodate any buffers that are only partially
2016 * being written to. 2016 * being written to.
2017 */ 2017 */
2018 if (NInoNonResident(ni)) { 2018 if (NInoNonResident(ni)) {
2019 status = ntfs_prepare_pages_for_non_resident_write( 2019 status = ntfs_prepare_pages_for_non_resident_write(
2020 pages, do_pages, pos, bytes); 2020 pages, do_pages, pos, bytes);
2021 if (unlikely(status)) { 2021 if (unlikely(status)) {
2022 loff_t i_size; 2022 loff_t i_size;
2023 2023
2024 do { 2024 do {
2025 unlock_page(pages[--do_pages]); 2025 unlock_page(pages[--do_pages]);
2026 page_cache_release(pages[do_pages]); 2026 page_cache_release(pages[do_pages]);
2027 } while (do_pages); 2027 } while (do_pages);
2028 /* 2028 /*
2029 * The write preparation may have instantiated 2029 * The write preparation may have instantiated
2030 * allocated space outside i_size. Trim this 2030 * allocated space outside i_size. Trim this
2031 * off again. We can ignore any errors in this 2031 * off again. We can ignore any errors in this
2032 case as we will just be wasting a bit of 2032 case as we will just be wasting a bit of
2033 * allocated space, which is not a disaster. 2033 * allocated space, which is not a disaster.
2034 */ 2034 */
2035 i_size = i_size_read(vi); 2035 i_size = i_size_read(vi);
2036 if (pos + bytes > i_size) { 2036 if (pos + bytes > i_size) {
2037 ntfs_write_failed(mapping, pos + bytes); 2037 ntfs_write_failed(mapping, pos + bytes);
2038 } 2038 }
2039 break; 2039 break;
2040 } 2040 }
2041 } 2041 }
2042 u = (pos >> PAGE_CACHE_SHIFT) - pages[0]->index; 2042 u = (pos >> PAGE_CACHE_SHIFT) - pages[0]->index;
2043 if (likely(nr_segs == 1)) { 2043 if (likely(nr_segs == 1)) {
2044 copied = ntfs_copy_from_user(pages + u, do_pages - u, 2044 copied = ntfs_copy_from_user(pages + u, do_pages - u,
2045 ofs, buf, bytes); 2045 ofs, buf, bytes);
2046 buf += copied; 2046 buf += copied;
2047 } else 2047 } else
2048 copied = ntfs_copy_from_user_iovec(pages + u, 2048 copied = ntfs_copy_from_user_iovec(pages + u,
2049 do_pages - u, ofs, &iov, &iov_ofs, 2049 do_pages - u, ofs, &iov, &iov_ofs,
2050 bytes); 2050 bytes);
2051 ntfs_flush_dcache_pages(pages + u, do_pages - u); 2051 ntfs_flush_dcache_pages(pages + u, do_pages - u);
2052 status = ntfs_commit_pages_after_write(pages, do_pages, pos, 2052 status = ntfs_commit_pages_after_write(pages, do_pages, pos,
2053 bytes); 2053 bytes);
2054 if (likely(!status)) { 2054 if (likely(!status)) {
2055 written += copied; 2055 written += copied;
2056 count -= copied; 2056 count -= copied;
2057 pos += copied; 2057 pos += copied;
2058 if (unlikely(copied != bytes)) 2058 if (unlikely(copied != bytes))
2059 status = -EFAULT; 2059 status = -EFAULT;
2060 } 2060 }
2061 do { 2061 do {
2062 unlock_page(pages[--do_pages]); 2062 unlock_page(pages[--do_pages]);
2063 mark_page_accessed(pages[do_pages]);
2064 page_cache_release(pages[do_pages]); 2063 page_cache_release(pages[do_pages]);
2065 } while (do_pages); 2064 } while (do_pages);
2066 if (unlikely(status)) 2065 if (unlikely(status))
2067 break; 2066 break;
2068 balance_dirty_pages_ratelimited(mapping); 2067 balance_dirty_pages_ratelimited(mapping);
2069 cond_resched(); 2068 cond_resched();
2070 } while (count); 2069 } while (count);
2071 err_out: 2070 err_out:
2072 *ppos = pos; 2071 *ppos = pos;
2073 if (cached_page) 2072 if (cached_page)
2074 page_cache_release(cached_page); 2073 page_cache_release(cached_page);
2075 ntfs_debug("Done. Returning %s (written 0x%lx, status %li).", 2074 ntfs_debug("Done. Returning %s (written 0x%lx, status %li).",
2076 written ? "written" : "status", (unsigned long)written, 2075 written ? "written" : "status", (unsigned long)written,
2077 (long)status); 2076 (long)status);
2078 return written ? written : status; 2077 return written ? written : status;
2079 } 2078 }
2080 2079
2081 /** 2080 /**
2082 * ntfs_file_aio_write_nolock - 2081 * ntfs_file_aio_write_nolock -
2083 */ 2082 */
2084 static ssize_t ntfs_file_aio_write_nolock(struct kiocb *iocb, 2083 static ssize_t ntfs_file_aio_write_nolock(struct kiocb *iocb,
2085 const struct iovec *iov, unsigned long nr_segs, loff_t *ppos) 2084 const struct iovec *iov, unsigned long nr_segs, loff_t *ppos)
2086 { 2085 {
2087 struct file *file = iocb->ki_filp; 2086 struct file *file = iocb->ki_filp;
2088 struct address_space *mapping = file->f_mapping; 2087 struct address_space *mapping = file->f_mapping;
2089 struct inode *inode = mapping->host; 2088 struct inode *inode = mapping->host;
2090 loff_t pos; 2089 loff_t pos;
2091 size_t count; /* after file limit checks */ 2090 size_t count; /* after file limit checks */
2092 ssize_t written, err; 2091 ssize_t written, err;
2093 2092
2094 count = 0; 2093 count = 0;
2095 err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ); 2094 err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
2096 if (err) 2095 if (err)
2097 return err; 2096 return err;
2098 pos = *ppos; 2097 pos = *ppos;
2099 /* We can write back this queue in page reclaim. */ 2098 /* We can write back this queue in page reclaim. */
2100 current->backing_dev_info = mapping->backing_dev_info; 2099 current->backing_dev_info = mapping->backing_dev_info;
2101 written = 0; 2100 written = 0;
2102 err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); 2101 err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
2103 if (err) 2102 if (err)
2104 goto out; 2103 goto out;
2105 if (!count) 2104 if (!count)
2106 goto out; 2105 goto out;
2107 err = file_remove_suid(file); 2106 err = file_remove_suid(file);
2108 if (err) 2107 if (err)
2109 goto out; 2108 goto out;
2110 err = file_update_time(file); 2109 err = file_update_time(file);
2111 if (err) 2110 if (err)
2112 goto out; 2111 goto out;
2113 written = ntfs_file_buffered_write(iocb, iov, nr_segs, pos, ppos, 2112 written = ntfs_file_buffered_write(iocb, iov, nr_segs, pos, ppos,
2114 count); 2113 count);
2115 out: 2114 out:
2116 current->backing_dev_info = NULL; 2115 current->backing_dev_info = NULL;
2117 return written ? written : err; 2116 return written ? written : err;
2118 } 2117 }
2119 2118
2120 /** 2119 /**
2121 * ntfs_file_aio_write - 2120 * ntfs_file_aio_write -
2122 */ 2121 */
2123 static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov, 2122 static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
2124 unsigned long nr_segs, loff_t pos) 2123 unsigned long nr_segs, loff_t pos)
2125 { 2124 {
2126 struct file *file = iocb->ki_filp; 2125 struct file *file = iocb->ki_filp;
2127 struct address_space *mapping = file->f_mapping; 2126 struct address_space *mapping = file->f_mapping;
2128 struct inode *inode = mapping->host; 2127 struct inode *inode = mapping->host;
2129 ssize_t ret; 2128 ssize_t ret;
2130 2129
2131 BUG_ON(iocb->ki_pos != pos); 2130 BUG_ON(iocb->ki_pos != pos);
2132 2131
2133 mutex_lock(&inode->i_mutex); 2132 mutex_lock(&inode->i_mutex);
2134 ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos); 2133 ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
2135 mutex_unlock(&inode->i_mutex); 2134 mutex_unlock(&inode->i_mutex);
2136 if (ret > 0) { 2135 if (ret > 0) {
2137 int err = generic_write_sync(file, pos, ret); 2136 int err = generic_write_sync(file, pos, ret);
2138 if (err < 0) 2137 if (err < 0)
2139 ret = err; 2138 ret = err;
2140 } 2139 }
2141 return ret; 2140 return ret;
2142 } 2141 }
2143 2142
2144 /** 2143 /**
2145 * ntfs_file_fsync - sync a file to disk 2144 * ntfs_file_fsync - sync a file to disk
2146 * @filp: file to be synced 2145 * @filp: file to be synced
2147 * @datasync: if non-zero only flush user data and not metadata 2146 * @datasync: if non-zero only flush user data and not metadata
2148 * 2147 *
2149 * Data integrity sync of a file to disk. Used for fsync, fdatasync, and msync 2148 * Data integrity sync of a file to disk. Used for fsync, fdatasync, and msync
2150 * system calls. This function is inspired by fs/buffer.c::file_fsync(). 2149 * system calls. This function is inspired by fs/buffer.c::file_fsync().
2151 * 2150 *
2152 * If @datasync is false, write the mft record and all associated extent mft 2151 * If @datasync is false, write the mft record and all associated extent mft
2153 * records as well as the $DATA attribute and then sync the block device. 2152 * records as well as the $DATA attribute and then sync the block device.
2154 * 2153 *
2155 * If @datasync is true and the attribute is non-resident, we skip the writing 2154 * If @datasync is true and the attribute is non-resident, we skip the writing
2156 * of the mft record and all associated extent mft records (this might still 2155 * of the mft record and all associated extent mft records (this might still
2157 * happen due to the write_inode_now() call). 2156 * happen due to the write_inode_now() call).
2158 * 2157 *
2159 * Also, if @datasync is true, we do not wait on the inode to be written out 2158 * Also, if @datasync is true, we do not wait on the inode to be written out
2160 * but we always wait on the page cache pages to be written out. 2159 * but we always wait on the page cache pages to be written out.
2161 * 2160 *
2162 * Locking: Caller must hold i_mutex on the inode. 2161 * Locking: Caller must hold i_mutex on the inode.
2163 * 2162 *
2164 * TODO: We should probably also write all attribute/index inodes associated 2163 * TODO: We should probably also write all attribute/index inodes associated
2165 * with this inode but since we have no simple way of getting to them we ignore 2164 * with this inode but since we have no simple way of getting to them we ignore
2166 * this problem for now. 2165 * this problem for now.
2167 */ 2166 */
2168 static int ntfs_file_fsync(struct file *filp, loff_t start, loff_t end, 2167 static int ntfs_file_fsync(struct file *filp, loff_t start, loff_t end,
2169 int datasync) 2168 int datasync)
2170 { 2169 {
2171 struct inode *vi = filp->f_mapping->host; 2170 struct inode *vi = filp->f_mapping->host;
2172 int err, ret = 0; 2171 int err, ret = 0;
2173 2172
2174 ntfs_debug("Entering for inode 0x%lx.", vi->i_ino); 2173 ntfs_debug("Entering for inode 0x%lx.", vi->i_ino);
2175 2174
2176 err = filemap_write_and_wait_range(vi->i_mapping, start, end); 2175 err = filemap_write_and_wait_range(vi->i_mapping, start, end);
2177 if (err) 2176 if (err)
2178 return err; 2177 return err;
2179 mutex_lock(&vi->i_mutex); 2178 mutex_lock(&vi->i_mutex);
2180 2179
2181 BUG_ON(S_ISDIR(vi->i_mode)); 2180 BUG_ON(S_ISDIR(vi->i_mode));
2182 if (!datasync || !NInoNonResident(NTFS_I(vi))) 2181 if (!datasync || !NInoNonResident(NTFS_I(vi)))
2183 ret = __ntfs_write_inode(vi, 1); 2182 ret = __ntfs_write_inode(vi, 1);
2184 write_inode_now(vi, !datasync); 2183 write_inode_now(vi, !datasync);
2185 /* 2184 /*
2186 * NOTE: If we were to use mapping->private_list (see ext2 and 2185 * NOTE: If we were to use mapping->private_list (see ext2 and
2187 * fs/buffer.c) for dirty blocks then we could optimize the below to be 2186 * fs/buffer.c) for dirty blocks then we could optimize the below to be
2188 * sync_mapping_buffers(vi->i_mapping). 2187 * sync_mapping_buffers(vi->i_mapping).
2189 */ 2188 */
2190 err = sync_blockdev(vi->i_sb->s_bdev); 2189 err = sync_blockdev(vi->i_sb->s_bdev);
2191 if (unlikely(err && !ret)) 2190 if (unlikely(err && !ret))
2192 ret = err; 2191 ret = err;
2193 if (likely(!ret)) 2192 if (likely(!ret))
2194 ntfs_debug("Done."); 2193 ntfs_debug("Done.");
2195 else 2194 else
2196 ntfs_warning(vi->i_sb, "Failed to f%ssync inode 0x%lx. Error " 2195 ntfs_warning(vi->i_sb, "Failed to f%ssync inode 0x%lx. Error "
2197 "%u.", datasync ? "data" : "", vi->i_ino, -ret); 2196 "%u.", datasync ? "data" : "", vi->i_ino, -ret);
2198 mutex_unlock(&vi->i_mutex); 2197 mutex_unlock(&vi->i_mutex);
2199 return ret; 2198 return ret;
2200 } 2199 }
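
For context on the @datasync distinction documented above: from user space, fsync() reaches this handler with datasync == 0 and fdatasync() with datasync == 1. A minimal user-space sketch, illustrative only ("testfile" is an arbitrary name):

	/* Illustrative user-space example, not part of the patch. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

		if (fd < 0)
			return 1;
		if (write(fd, "data", 4) != 4)
			perror("write");
		if (fsync(fd))		/* flush data and metadata */
			perror("fsync");
		if (fdatasync(fd))	/* flush data (and size updates) only */
			perror("fdatasync");
		close(fd);
		return 0;
	}
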
2201 2200
2202 #endif /* NTFS_RW */ 2201 #endif /* NTFS_RW */
2203 2202
2204 const struct file_operations ntfs_file_ops = { 2203 const struct file_operations ntfs_file_ops = {
2205 .llseek = generic_file_llseek, /* Seek inside file. */ 2204 .llseek = generic_file_llseek, /* Seek inside file. */
2206 .read = do_sync_read, /* Read from file. */ 2205 .read = do_sync_read, /* Read from file. */
2207 .aio_read = generic_file_aio_read, /* Async read from file. */ 2206 .aio_read = generic_file_aio_read, /* Async read from file. */
2208 #ifdef NTFS_RW 2207 #ifdef NTFS_RW
2209 .write = do_sync_write, /* Write to file. */ 2208 .write = do_sync_write, /* Write to file. */
2210 .aio_write = ntfs_file_aio_write, /* Async write to file. */ 2209 .aio_write = ntfs_file_aio_write, /* Async write to file. */
2211 /*.release = ,*/ /* Last file is closed. See 2210 /*.release = ,*/ /* Last file is closed. See
2212 fs/ext2/file.c:: 2211 fs/ext2/file.c::
2213 ext2_release_file() for 2212 ext2_release_file() for
2214 how to use this to discard 2213 how to use this to discard
2215 preallocated space for 2214 preallocated space for
2216 write opened files. */ 2215 write opened files. */
2217 .fsync = ntfs_file_fsync, /* Sync a file to disk. */ 2216 .fsync = ntfs_file_fsync, /* Sync a file to disk. */
2218 /*.aio_fsync = ,*/ /* Sync all outstanding async 2217 /*.aio_fsync = ,*/ /* Sync all outstanding async
2219 i/o operations on a 2218 i/o operations on a
2220 kiocb. */ 2219 kiocb. */
2221 #endif /* NTFS_RW */ 2220 #endif /* NTFS_RW */
2222 /*.ioctl = ,*/ /* Perform function on the 2221 /*.ioctl = ,*/ /* Perform function on the
2223 mounted filesystem. */ 2222 mounted filesystem. */
2224 .mmap = generic_file_mmap, /* Mmap file. */ 2223 .mmap = generic_file_mmap, /* Mmap file. */
2225 .open = ntfs_file_open, /* Open file. */ 2224 .open = ntfs_file_open, /* Open file. */
2226 .splice_read = generic_file_splice_read /* Zero-copy data send with 2225 .splice_read = generic_file_splice_read /* Zero-copy data send with
2227 the data source being on 2226 the data source being on
2228 the ntfs partition. We do 2227 the ntfs partition. We do
2229 not need to care about the 2228 not need to care about the
2230 data destination. */ 2229 data destination. */
2231 /*.sendpage = ,*/ /* Zero-copy data send with 2230 /*.sendpage = ,*/ /* Zero-copy data send with
2232 the data destination being 2231 the data destination being
2233 on the ntfs partition. We 2232 on the ntfs partition. We
2234 do not need to care about 2233 do not need to care about
2235 the data source. */ 2234 the data source. */
2236 }; 2235 };
2237 2236
2238 const struct inode_operations ntfs_file_inode_ops = { 2237 const struct inode_operations ntfs_file_inode_ops = {
2239 #ifdef NTFS_RW 2238 #ifdef NTFS_RW
2240 .setattr = ntfs_setattr, 2239 .setattr = ntfs_setattr,
2241 #endif /* NTFS_RW */ 2240 #endif /* NTFS_RW */
2242 }; 2241 };
2243 2242
2244 const struct file_operations ntfs_empty_file_ops = {}; 2243 const struct file_operations ntfs_empty_file_ops = {};
2245 2244
2246 const struct inode_operations ntfs_empty_inode_ops = {}; 2245 const struct inode_operations ntfs_empty_inode_ops = {};
2247 2246
include/linux/page-flags.h
1 /* 1 /*
2 * Macros for manipulating and testing page->flags 2 * Macros for manipulating and testing page->flags
3 */ 3 */
4 4
5 #ifndef PAGE_FLAGS_H 5 #ifndef PAGE_FLAGS_H
6 #define PAGE_FLAGS_H 6 #define PAGE_FLAGS_H
7 7
8 #include <linux/types.h> 8 #include <linux/types.h>
9 #include <linux/bug.h> 9 #include <linux/bug.h>
10 #include <linux/mmdebug.h> 10 #include <linux/mmdebug.h>
11 #ifndef __GENERATING_BOUNDS_H 11 #ifndef __GENERATING_BOUNDS_H
12 #include <linux/mm_types.h> 12 #include <linux/mm_types.h>
13 #include <generated/bounds.h> 13 #include <generated/bounds.h>
14 #endif /* !__GENERATING_BOUNDS_H */ 14 #endif /* !__GENERATING_BOUNDS_H */
15 15
16 /* 16 /*
17 * Various page->flags bits: 17 * Various page->flags bits:
18 * 18 *
19 * PG_reserved is set for special pages, which can never be swapped out. Some 19 * PG_reserved is set for special pages, which can never be swapped out. Some
20 * of them might not even exist (eg empty_bad_page)... 20 * of them might not even exist (eg empty_bad_page)...
21 * 21 *
22 * The PG_private bitflag is set on pagecache pages if they contain filesystem 22 * The PG_private bitflag is set on pagecache pages if they contain filesystem
23 * specific data (which is normally at page->private). It can be used by 23 * specific data (which is normally at page->private). It can be used by
24 * private allocations for its own usage. 24 * private allocations for its own usage.
25 * 25 *
26 * During initiation of disk I/O, PG_locked is set. This bit is set before I/O 26 * During initiation of disk I/O, PG_locked is set. This bit is set before I/O
27 * and cleared when writeback _starts_ or when read _completes_. PG_writeback 27 * and cleared when writeback _starts_ or when read _completes_. PG_writeback
28 * is set before writeback starts and cleared when it finishes. 28 * is set before writeback starts and cleared when it finishes.
29 * 29 *
30 * PG_locked also pins a page in pagecache, and blocks truncation of the file 30 * PG_locked also pins a page in pagecache, and blocks truncation of the file
31 * while it is held. 31 * while it is held.
32 * 32 *
33 * page_waitqueue(page) is a wait queue of all tasks waiting for the page 33 * page_waitqueue(page) is a wait queue of all tasks waiting for the page
34 * to become unlocked. 34 * to become unlocked.
35 * 35 *
36 * PG_uptodate tells whether the page's contents is valid. When a read 36 * PG_uptodate tells whether the page's contents is valid. When a read
37 * completes, the page becomes uptodate, unless a disk I/O error happened. 37 * completes, the page becomes uptodate, unless a disk I/O error happened.
38 * 38 *
39 * PG_referenced, PG_reclaim are used for page reclaim for anonymous and 39 * PG_referenced, PG_reclaim are used for page reclaim for anonymous and
40 * file-backed pagecache (see mm/vmscan.c). 40 * file-backed pagecache (see mm/vmscan.c).
41 * 41 *
42 * PG_error is set to indicate that an I/O error occurred on this page. 42 * PG_error is set to indicate that an I/O error occurred on this page.
43 * 43 *
44 * PG_arch_1 is an architecture specific page state bit. The generic code 44 * PG_arch_1 is an architecture specific page state bit. The generic code
45 * guarantees that this bit is cleared for a page when it first is entered into 45 * guarantees that this bit is cleared for a page when it first is entered into
46 * the page cache. 46 * the page cache.
47 * 47 *
48 * PG_highmem pages are not permanently mapped into the kernel virtual address 48 * PG_highmem pages are not permanently mapped into the kernel virtual address
49 * space, they need to be kmapped separately for doing IO on the pages. The 49 * space, they need to be kmapped separately for doing IO on the pages. The
50 * struct page (these bits with information) are always mapped into kernel 50 * struct page (these bits with information) are always mapped into kernel
51 * address space... 51 * address space...
52 * 52 *
53 * PG_hwpoison indicates that a page got corrupted in hardware and contains 53 * PG_hwpoison indicates that a page got corrupted in hardware and contains
54 * data with incorrect ECC bits that triggered a machine check. Accessing is 54 * data with incorrect ECC bits that triggered a machine check. Accessing is
55 * not safe since it may cause another machine check. Don't touch! 55 * not safe since it may cause another machine check. Don't touch!
56 */ 56 */
57 57
58 /* 58 /*
59 * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break 59 * Don't use the *_dontuse flags. Use the macros. Otherwise you'll break
60 * locked- and dirty-page accounting. 60 * locked- and dirty-page accounting.
61 * 61 *
62 * The page flags field is split into two parts, the main flags area 62 * The page flags field is split into two parts, the main flags area
63 * which extends from the low bits upwards, and the fields area which 63 * which extends from the low bits upwards, and the fields area which
64 * extends from the high bits downwards. 64 * extends from the high bits downwards.
65 * 65 *
66 * | FIELD | ... | FLAGS | 66 * | FIELD | ... | FLAGS |
67 * N-1 ^ 0 67 * N-1 ^ 0
68 * (NR_PAGEFLAGS) 68 * (NR_PAGEFLAGS)
69 * 69 *
70 * The fields area is reserved for fields mapping zone, node (for NUMA) and 70 * The fields area is reserved for fields mapping zone, node (for NUMA) and
71 * SPARSEMEM section (for variants of SPARSEMEM that require section ids like 71 * SPARSEMEM section (for variants of SPARSEMEM that require section ids like
72 * SPARSEMEM_EXTREME with !SPARSEMEM_VMEMMAP). 72 * SPARSEMEM_EXTREME with !SPARSEMEM_VMEMMAP).
73 */ 73 */
74 enum pageflags { 74 enum pageflags {
75 PG_locked, /* Page is locked. Don't touch. */ 75 PG_locked, /* Page is locked. Don't touch. */
76 PG_error, 76 PG_error,
77 PG_referenced, 77 PG_referenced,
78 PG_uptodate, 78 PG_uptodate,
79 PG_dirty, 79 PG_dirty,
80 PG_lru, 80 PG_lru,
81 PG_active, 81 PG_active,
82 PG_slab, 82 PG_slab,
83 PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ 83 PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
84 PG_arch_1, 84 PG_arch_1,
85 PG_reserved, 85 PG_reserved,
86 PG_private, /* If pagecache, has fs-private data */ 86 PG_private, /* If pagecache, has fs-private data */
87 PG_private_2, /* If pagecache, has fs aux data */ 87 PG_private_2, /* If pagecache, has fs aux data */
88 PG_writeback, /* Page is under writeback */ 88 PG_writeback, /* Page is under writeback */
89 #ifdef CONFIG_PAGEFLAGS_EXTENDED 89 #ifdef CONFIG_PAGEFLAGS_EXTENDED
90 PG_head, /* A head page */ 90 PG_head, /* A head page */
91 PG_tail, /* A tail page */ 91 PG_tail, /* A tail page */
92 #else 92 #else
93 PG_compound, /* A compound page */ 93 PG_compound, /* A compound page */
94 #endif 94 #endif
95 PG_swapcache, /* Swap page: swp_entry_t in private */ 95 PG_swapcache, /* Swap page: swp_entry_t in private */
96 PG_mappedtodisk, /* Has blocks allocated on-disk */ 96 PG_mappedtodisk, /* Has blocks allocated on-disk */
97 PG_reclaim, /* To be reclaimed asap */ 97 PG_reclaim, /* To be reclaimed asap */
98 PG_swapbacked, /* Page is backed by RAM/swap */ 98 PG_swapbacked, /* Page is backed by RAM/swap */
99 PG_unevictable, /* Page is "unevictable" */ 99 PG_unevictable, /* Page is "unevictable" */
100 #ifdef CONFIG_MMU 100 #ifdef CONFIG_MMU
101 PG_mlocked, /* Page is vma mlocked */ 101 PG_mlocked, /* Page is vma mlocked */
102 #endif 102 #endif
103 #ifdef CONFIG_ARCH_USES_PG_UNCACHED 103 #ifdef CONFIG_ARCH_USES_PG_UNCACHED
104 PG_uncached, /* Page has been mapped as uncached */ 104 PG_uncached, /* Page has been mapped as uncached */
105 #endif 105 #endif
106 #ifdef CONFIG_MEMORY_FAILURE 106 #ifdef CONFIG_MEMORY_FAILURE
107 PG_hwpoison, /* hardware poisoned page. Don't touch */ 107 PG_hwpoison, /* hardware poisoned page. Don't touch */
108 #endif 108 #endif
109 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 109 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
110 PG_compound_lock, 110 PG_compound_lock,
111 #endif 111 #endif
112 __NR_PAGEFLAGS, 112 __NR_PAGEFLAGS,
113 113
114 /* Filesystems */ 114 /* Filesystems */
115 PG_checked = PG_owner_priv_1, 115 PG_checked = PG_owner_priv_1,
116 116
117 /* Two page bits are conscripted by FS-Cache to maintain local caching 117 /* Two page bits are conscripted by FS-Cache to maintain local caching
118 * state. These bits are set on pages belonging to the netfs's inodes 118 * state. These bits are set on pages belonging to the netfs's inodes
119 * when those inodes are being locally cached. 119 * when those inodes are being locally cached.
120 */ 120 */
121 PG_fscache = PG_private_2, /* page backed by cache */ 121 PG_fscache = PG_private_2, /* page backed by cache */
122 122
123 /* XEN */ 123 /* XEN */
124 PG_pinned = PG_owner_priv_1, 124 PG_pinned = PG_owner_priv_1,
125 PG_savepinned = PG_dirty, 125 PG_savepinned = PG_dirty,
126 126
127 /* SLOB */ 127 /* SLOB */
128 PG_slob_free = PG_private, 128 PG_slob_free = PG_private,
129 }; 129 };
130 130
131 #ifndef __GENERATING_BOUNDS_H 131 #ifndef __GENERATING_BOUNDS_H
132 132
133 /* 133 /*
134 * Macros to create function definitions for page flags 134 * Macros to create function definitions for page flags
135 */ 135 */
136 #define TESTPAGEFLAG(uname, lname) \ 136 #define TESTPAGEFLAG(uname, lname) \
137 static inline int Page##uname(const struct page *page) \ 137 static inline int Page##uname(const struct page *page) \
138 { return test_bit(PG_##lname, &page->flags); } 138 { return test_bit(PG_##lname, &page->flags); }
139 139
140 #define SETPAGEFLAG(uname, lname) \ 140 #define SETPAGEFLAG(uname, lname) \
141 static inline void SetPage##uname(struct page *page) \ 141 static inline void SetPage##uname(struct page *page) \
142 { set_bit(PG_##lname, &page->flags); } 142 { set_bit(PG_##lname, &page->flags); }
143 143
144 #define CLEARPAGEFLAG(uname, lname) \ 144 #define CLEARPAGEFLAG(uname, lname) \
145 static inline void ClearPage##uname(struct page *page) \ 145 static inline void ClearPage##uname(struct page *page) \
146 { clear_bit(PG_##lname, &page->flags); } 146 { clear_bit(PG_##lname, &page->flags); }
147 147
148 #define __SETPAGEFLAG(uname, lname) \ 148 #define __SETPAGEFLAG(uname, lname) \
149 static inline void __SetPage##uname(struct page *page) \ 149 static inline void __SetPage##uname(struct page *page) \
150 { __set_bit(PG_##lname, &page->flags); } 150 { __set_bit(PG_##lname, &page->flags); }
151 151
152 #define __CLEARPAGEFLAG(uname, lname) \ 152 #define __CLEARPAGEFLAG(uname, lname) \
153 static inline void __ClearPage##uname(struct page *page) \ 153 static inline void __ClearPage##uname(struct page *page) \
154 { __clear_bit(PG_##lname, &page->flags); } 154 { __clear_bit(PG_##lname, &page->flags); }
155 155
156 #define TESTSETFLAG(uname, lname) \ 156 #define TESTSETFLAG(uname, lname) \
157 static inline int TestSetPage##uname(struct page *page) \ 157 static inline int TestSetPage##uname(struct page *page) \
158 { return test_and_set_bit(PG_##lname, &page->flags); } 158 { return test_and_set_bit(PG_##lname, &page->flags); }
159 159
160 #define TESTCLEARFLAG(uname, lname) \ 160 #define TESTCLEARFLAG(uname, lname) \
161 static inline int TestClearPage##uname(struct page *page) \ 161 static inline int TestClearPage##uname(struct page *page) \
162 { return test_and_clear_bit(PG_##lname, &page->flags); } 162 { return test_and_clear_bit(PG_##lname, &page->flags); }
163 163
164 #define __TESTCLEARFLAG(uname, lname) \ 164 #define __TESTCLEARFLAG(uname, lname) \
165 static inline int __TestClearPage##uname(struct page *page) \ 165 static inline int __TestClearPage##uname(struct page *page) \
166 { return __test_and_clear_bit(PG_##lname, &page->flags); } 166 { return __test_and_clear_bit(PG_##lname, &page->flags); }
167 167
168 #define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ 168 #define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \
169 SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname) 169 SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname)
170 170
171 #define __PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \ 171 #define __PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \
172 __SETPAGEFLAG(uname, lname) __CLEARPAGEFLAG(uname, lname) 172 __SETPAGEFLAG(uname, lname) __CLEARPAGEFLAG(uname, lname)
173 173
174 #define PAGEFLAG_FALSE(uname) \ 174 #define PAGEFLAG_FALSE(uname) \
175 static inline int Page##uname(const struct page *page) \ 175 static inline int Page##uname(const struct page *page) \
176 { return 0; } 176 { return 0; }
177 177
178 #define TESTSCFLAG(uname, lname) \ 178 #define TESTSCFLAG(uname, lname) \
179 TESTSETFLAG(uname, lname) TESTCLEARFLAG(uname, lname) 179 TESTSETFLAG(uname, lname) TESTCLEARFLAG(uname, lname)
180 180
181 #define SETPAGEFLAG_NOOP(uname) \ 181 #define SETPAGEFLAG_NOOP(uname) \
182 static inline void SetPage##uname(struct page *page) { } 182 static inline void SetPage##uname(struct page *page) { }
183 183
184 #define CLEARPAGEFLAG_NOOP(uname) \ 184 #define CLEARPAGEFLAG_NOOP(uname) \
185 static inline void ClearPage##uname(struct page *page) { } 185 static inline void ClearPage##uname(struct page *page) { }
186 186
187 #define __CLEARPAGEFLAG_NOOP(uname) \ 187 #define __CLEARPAGEFLAG_NOOP(uname) \
188 static inline void __ClearPage##uname(struct page *page) { } 188 static inline void __ClearPage##uname(struct page *page) { }
189 189
190 #define TESTCLEARFLAG_FALSE(uname) \ 190 #define TESTCLEARFLAG_FALSE(uname) \
191 static inline int TestClearPage##uname(struct page *page) { return 0; } 191 static inline int TestClearPage##uname(struct page *page) { return 0; }
192 192
193 #define __TESTCLEARFLAG_FALSE(uname) \ 193 #define __TESTCLEARFLAG_FALSE(uname) \
194 static inline int __TestClearPage##uname(struct page *page) { return 0; } 194 static inline int __TestClearPage##uname(struct page *page) { return 0; }
195 195
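To make the macro output concrete, here is a hand-expanded sketch of what PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) generates (illustration only; the real definitions come from the macros above and are not written out like this):

	/* Hand expansion of PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty),
	 * shown only to illustrate the generated helpers. */
	static inline int PageDirty(const struct page *page)
		{ return test_bit(PG_dirty, &page->flags); }
	static inline void SetPageDirty(struct page *page)
		{ set_bit(PG_dirty, &page->flags); }
	static inline void ClearPageDirty(struct page *page)
		{ clear_bit(PG_dirty, &page->flags); }
	static inline int TestSetPageDirty(struct page *page)
		{ return test_and_set_bit(PG_dirty, &page->flags); }
	static inline int TestClearPageDirty(struct page *page)
		{ return test_and_clear_bit(PG_dirty, &page->flags); }

The __SETPAGEFLAG/__CLEARPAGEFLAG variants expand the same way but use the non-atomic __set_bit()/__clear_bit(), which is exactly what the __SETPAGEFLAG(Referenced, referenced) added by this patch relies on: the bit can be set cheaply while the page is not yet visible to anyone else.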
196 struct page; /* forward declaration */ 196 struct page; /* forward declaration */
197 197
198 TESTPAGEFLAG(Locked, locked) 198 TESTPAGEFLAG(Locked, locked)
199 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error) 199 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
200 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced) 200 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
201 __SETPAGEFLAG(Referenced, referenced)
201 PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty) 202 PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
202 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru) 203 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
203 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active) 204 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
204 TESTCLEARFLAG(Active, active) 205 TESTCLEARFLAG(Active, active)
205 __PAGEFLAG(Slab, slab) 206 __PAGEFLAG(Slab, slab)
206 PAGEFLAG(Checked, checked) /* Used by some filesystems */ 207 PAGEFLAG(Checked, checked) /* Used by some filesystems */
207 PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */ 208 PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */
208 PAGEFLAG(SavePinned, savepinned); /* Xen */ 209 PAGEFLAG(SavePinned, savepinned); /* Xen */
209 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved) 210 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
210 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) 211 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
211 __SETPAGEFLAG(SwapBacked, swapbacked) 212 __SETPAGEFLAG(SwapBacked, swapbacked)
212 213
213 __PAGEFLAG(SlobFree, slob_free) 214 __PAGEFLAG(SlobFree, slob_free)
214 215
215 /* 216 /*
216 * Private page markings that may be used by the filesystem that owns the page 217 * Private page markings that may be used by the filesystem that owns the page
217 * for its own purposes. 218 * for its own purposes.
218 * - PG_private and PG_private_2 cause releasepage() and co to be invoked 219 * - PG_private and PG_private_2 cause releasepage() and co to be invoked
219 */ 220 */
220 PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private) 221 PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private)
221 __CLEARPAGEFLAG(Private, private) 222 __CLEARPAGEFLAG(Private, private)
222 PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2) 223 PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2)
223 PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1) 224 PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1)
224 225
225 /* 226 /*
226 * Only test-and-set exist for PG_writeback. The unconditional operators are 227 * Only test-and-set exist for PG_writeback. The unconditional operators are
227 * risky: they bypass page accounting. 228 * risky: they bypass page accounting.
228 */ 229 */
229 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) 230 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
230 PAGEFLAG(MappedToDisk, mappedtodisk) 231 PAGEFLAG(MappedToDisk, mappedtodisk)
231 232
232 /* PG_readahead is only used for reads; PG_reclaim is only for writes */ 233 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
233 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim) 234 PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
234 PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim) 235 PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
235 236
236 #ifdef CONFIG_HIGHMEM 237 #ifdef CONFIG_HIGHMEM
237 /* 238 /*
238 * Must use a macro here due to header dependency issues. page_zone() is not 239 * Must use a macro here due to header dependency issues. page_zone() is not
239 * available at this point. 240 * available at this point.
240 */ 241 */
241 #define PageHighMem(__p) is_highmem(page_zone(__p)) 242 #define PageHighMem(__p) is_highmem(page_zone(__p))
242 #else 243 #else
243 PAGEFLAG_FALSE(HighMem) 244 PAGEFLAG_FALSE(HighMem)
244 #endif 245 #endif
245 246
246 #ifdef CONFIG_SWAP 247 #ifdef CONFIG_SWAP
247 PAGEFLAG(SwapCache, swapcache) 248 PAGEFLAG(SwapCache, swapcache)
248 #else 249 #else
249 PAGEFLAG_FALSE(SwapCache) 250 PAGEFLAG_FALSE(SwapCache)
250 SETPAGEFLAG_NOOP(SwapCache) CLEARPAGEFLAG_NOOP(SwapCache) 251 SETPAGEFLAG_NOOP(SwapCache) CLEARPAGEFLAG_NOOP(SwapCache)
251 #endif 252 #endif
252 253
253 PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable) 254 PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable)
254 TESTCLEARFLAG(Unevictable, unevictable) 255 TESTCLEARFLAG(Unevictable, unevictable)
255 256
256 #ifdef CONFIG_MMU 257 #ifdef CONFIG_MMU
257 PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked) 258 PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked)
258 TESTSCFLAG(Mlocked, mlocked) __TESTCLEARFLAG(Mlocked, mlocked) 259 TESTSCFLAG(Mlocked, mlocked) __TESTCLEARFLAG(Mlocked, mlocked)
259 #else 260 #else
260 PAGEFLAG_FALSE(Mlocked) SETPAGEFLAG_NOOP(Mlocked) 261 PAGEFLAG_FALSE(Mlocked) SETPAGEFLAG_NOOP(Mlocked)
261 TESTCLEARFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked) 262 TESTCLEARFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked)
262 #endif 263 #endif
263 264
264 #ifdef CONFIG_ARCH_USES_PG_UNCACHED 265 #ifdef CONFIG_ARCH_USES_PG_UNCACHED
265 PAGEFLAG(Uncached, uncached) 266 PAGEFLAG(Uncached, uncached)
266 #else 267 #else
267 PAGEFLAG_FALSE(Uncached) 268 PAGEFLAG_FALSE(Uncached)
268 #endif 269 #endif
269 270
270 #ifdef CONFIG_MEMORY_FAILURE 271 #ifdef CONFIG_MEMORY_FAILURE
271 PAGEFLAG(HWPoison, hwpoison) 272 PAGEFLAG(HWPoison, hwpoison)
272 TESTSCFLAG(HWPoison, hwpoison) 273 TESTSCFLAG(HWPoison, hwpoison)
273 #define __PG_HWPOISON (1UL << PG_hwpoison) 274 #define __PG_HWPOISON (1UL << PG_hwpoison)
274 #else 275 #else
275 PAGEFLAG_FALSE(HWPoison) 276 PAGEFLAG_FALSE(HWPoison)
276 #define __PG_HWPOISON 0 277 #define __PG_HWPOISON 0
277 #endif 278 #endif
278 279
279 u64 stable_page_flags(struct page *page); 280 u64 stable_page_flags(struct page *page);
280 281
281 static inline int PageUptodate(struct page *page) 282 static inline int PageUptodate(struct page *page)
282 { 283 {
283 int ret = test_bit(PG_uptodate, &(page)->flags); 284 int ret = test_bit(PG_uptodate, &(page)->flags);
284 285
285 /* 286 /*
286 * Must ensure that the data we read out of the page is loaded 287 * Must ensure that the data we read out of the page is loaded
287 * _after_ we've loaded page->flags to check for PageUptodate. 288 * _after_ we've loaded page->flags to check for PageUptodate.
288 * We can skip the barrier if the page is not uptodate, because 289 * We can skip the barrier if the page is not uptodate, because
289 * we wouldn't be reading anything from it. 290 * we wouldn't be reading anything from it.
290 * 291 *
291 * See SetPageUptodate() for the other side of the story. 292 * See SetPageUptodate() for the other side of the story.
292 */ 293 */
293 if (ret) 294 if (ret)
294 smp_rmb(); 295 smp_rmb();
295 296
296 return ret; 297 return ret;
297 } 298 }
298 299
299 static inline void __SetPageUptodate(struct page *page) 300 static inline void __SetPageUptodate(struct page *page)
300 { 301 {
301 smp_wmb(); 302 smp_wmb();
302 __set_bit(PG_uptodate, &(page)->flags); 303 __set_bit(PG_uptodate, &(page)->flags);
303 } 304 }
304 305
305 static inline void SetPageUptodate(struct page *page) 306 static inline void SetPageUptodate(struct page *page)
306 { 307 {
307 /* 308 /*
308 * Memory barrier must be issued before setting the PG_uptodate bit, 309 * Memory barrier must be issued before setting the PG_uptodate bit,
309 * so that all previous stores issued in order to bring the page 310 * so that all previous stores issued in order to bring the page
310 * uptodate are actually visible before PageUptodate becomes true. 311 * uptodate are actually visible before PageUptodate becomes true.
311 */ 312 */
312 smp_wmb(); 313 smp_wmb();
313 set_bit(PG_uptodate, &(page)->flags); 314 set_bit(PG_uptodate, &(page)->flags);
314 } 315 }
315 316
316 CLEARPAGEFLAG(Uptodate, uptodate) 317 CLEARPAGEFLAG(Uptodate, uptodate)
317 318
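To make the barrier pairing concrete, here is a minimal sketch of the two sides as a writer/reader pair. The example_* helpers are made up purely for illustration; the point is that the smp_wmb() in SetPageUptodate() pairs with the smp_rmb() in PageUptodate():

	/* Writer side (e.g. completion of a read): all stores that fill the
	 * page must be visible before PG_uptodate can be observed as set. */
	static void example_fill_and_mark_uptodate(struct page *page,
						   const void *data, size_t len)
	{
		void *kaddr = kmap_atomic(page);

		memcpy(kaddr, data, len);	/* stores that bring the page uptodate */
		kunmap_atomic(kaddr);
		SetPageUptodate(page);		/* smp_wmb(), then set_bit(PG_uptodate) */
	}

	/* Reader side: only touch the data after PageUptodate() returned true;
	 * the smp_rmb() inside PageUptodate() orders the flag load before the
	 * data loads. */
	static int example_copy_if_uptodate(struct page *page, void *buf, size_t len)
	{
		void *kaddr;

		if (!PageUptodate(page))
			return -EIO;		/* illustrative error handling */

		kaddr = kmap_atomic(page);
		memcpy(buf, kaddr, len);
		kunmap_atomic(kaddr);
		return 0;
	}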
318 extern void cancel_dirty_page(struct page *page, unsigned int account_size); 319 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
319 320
320 int test_clear_page_writeback(struct page *page); 321 int test_clear_page_writeback(struct page *page);
321 int __test_set_page_writeback(struct page *page, bool keep_write); 322 int __test_set_page_writeback(struct page *page, bool keep_write);
322 323
323 #define test_set_page_writeback(page) \ 324 #define test_set_page_writeback(page) \
324 __test_set_page_writeback(page, false) 325 __test_set_page_writeback(page, false)
325 #define test_set_page_writeback_keepwrite(page) \ 326 #define test_set_page_writeback_keepwrite(page) \
326 __test_set_page_writeback(page, true) 327 __test_set_page_writeback(page, true)
327 328
328 static inline void set_page_writeback(struct page *page) 329 static inline void set_page_writeback(struct page *page)
329 { 330 {
330 test_set_page_writeback(page); 331 test_set_page_writeback(page);
331 } 332 }
332 333
333 static inline void set_page_writeback_keepwrite(struct page *page) 334 static inline void set_page_writeback_keepwrite(struct page *page)
334 { 335 {
335 test_set_page_writeback_keepwrite(page); 336 test_set_page_writeback_keepwrite(page);
336 } 337 }
337 338
338 #ifdef CONFIG_PAGEFLAGS_EXTENDED 339 #ifdef CONFIG_PAGEFLAGS_EXTENDED
339 /* 340 /*
340 * System with lots of page flags available. This allows separate 341 * System with lots of page flags available. This allows separate
341 * flags for PageHead() and PageTail() checks of compound pages so that bit 342 * flags for PageHead() and PageTail() checks of compound pages so that bit
342 * tests can be used in performance sensitive paths. PageCompound is 343 * tests can be used in performance sensitive paths. PageCompound is
343 * generally not used in hot code paths. 344 * generally not used in hot code paths.
344 */ 345 */
345 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head) 346 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
346 __PAGEFLAG(Tail, tail) 347 __PAGEFLAG(Tail, tail)
347 348
348 static inline int PageCompound(struct page *page) 349 static inline int PageCompound(struct page *page)
349 { 350 {
350 return page->flags & ((1L << PG_head) | (1L << PG_tail)); 351 return page->flags & ((1L << PG_head) | (1L << PG_tail));
351 352
352 } 353 }
353 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 354 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
354 static inline void ClearPageCompound(struct page *page) 355 static inline void ClearPageCompound(struct page *page)
355 { 356 {
356 BUG_ON(!PageHead(page)); 357 BUG_ON(!PageHead(page));
357 ClearPageHead(page); 358 ClearPageHead(page);
358 } 359 }
359 #endif 360 #endif
360 #else 361 #else
361 /* 362 /*
362 * Reduce page flag use as much as possible by overlapping 363 * Reduce page flag use as much as possible by overlapping
363 * compound page flags with the flags used for page cache pages. Possible 364 * compound page flags with the flags used for page cache pages. Possible
364 * because PageCompound is always set for compound pages and not for 365 * because PageCompound is always set for compound pages and not for
365 * pages on the LRU and/or pagecache. 366 * pages on the LRU and/or pagecache.
366 */ 367 */
367 TESTPAGEFLAG(Compound, compound) 368 TESTPAGEFLAG(Compound, compound)
368 __SETPAGEFLAG(Head, compound) __CLEARPAGEFLAG(Head, compound) 369 __SETPAGEFLAG(Head, compound) __CLEARPAGEFLAG(Head, compound)
369 370
370 /* 371 /*
371 * PG_reclaim is used in combination with PG_compound to mark the 372 * PG_reclaim is used in combination with PG_compound to mark the
372 * head and tail of a compound page. This saves one page flag 373 * head and tail of a compound page. This saves one page flag
373 * but makes it impossible to use compound pages for the page cache. 374 * but makes it impossible to use compound pages for the page cache.
374 * The PG_reclaim bit would have to be used for reclaim or readahead 375 * The PG_reclaim bit would have to be used for reclaim or readahead
375 * if compound pages enter the page cache. 376 * if compound pages enter the page cache.
376 * 377 *
377 * PG_compound & PG_reclaim => Tail page 378 * PG_compound & PG_reclaim => Tail page
378 * PG_compound & ~PG_reclaim => Head page 379 * PG_compound & ~PG_reclaim => Head page
379 */ 380 */
380 #define PG_head_mask ((1L << PG_compound)) 381 #define PG_head_mask ((1L << PG_compound))
381 #define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim)) 382 #define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim))
382 383
383 static inline int PageHead(struct page *page) 384 static inline int PageHead(struct page *page)
384 { 385 {
385 return ((page->flags & PG_head_tail_mask) == PG_head_mask); 386 return ((page->flags & PG_head_tail_mask) == PG_head_mask);
386 } 387 }
387 388
388 static inline int PageTail(struct page *page) 389 static inline int PageTail(struct page *page)
389 { 390 {
390 return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask); 391 return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask);
391 } 392 }
392 393
393 static inline void __SetPageTail(struct page *page) 394 static inline void __SetPageTail(struct page *page)
394 { 395 {
395 page->flags |= PG_head_tail_mask; 396 page->flags |= PG_head_tail_mask;
396 } 397 }
397 398
398 static inline void __ClearPageTail(struct page *page) 399 static inline void __ClearPageTail(struct page *page)
399 { 400 {
400 page->flags &= ~PG_head_tail_mask; 401 page->flags &= ~PG_head_tail_mask;
401 } 402 }
402 403
403 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 404 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
404 static inline void ClearPageCompound(struct page *page) 405 static inline void ClearPageCompound(struct page *page)
405 { 406 {
406 BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound)); 407 BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound));
407 clear_bit(PG_compound, &page->flags); 408 clear_bit(PG_compound, &page->flags);
408 } 409 }
409 #endif 410 #endif
410 411
411 #endif /* !PAGEFLAGS_EXTENDED */ 412 #endif /* !PAGEFLAGS_EXTENDED */
412 413
413 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 414 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
414 /* 415 /*
415 * PageHuge() only returns true for hugetlbfs pages, but not for 416 * PageHuge() only returns true for hugetlbfs pages, but not for
416 * normal or transparent huge pages. 417 * normal or transparent huge pages.
417 * 418 *
418 * PageTransHuge() returns true for both transparent huge and 419 * PageTransHuge() returns true for both transparent huge and
419 * hugetlbfs pages, but not normal pages. PageTransHuge() can only be 420 * hugetlbfs pages, but not normal pages. PageTransHuge() can only be
420 * called in the core VM paths where hugetlbfs pages can't exist. 421 * called in the core VM paths where hugetlbfs pages can't exist.
421 */ 422 */
422 static inline int PageTransHuge(struct page *page) 423 static inline int PageTransHuge(struct page *page)
423 { 424 {
424 VM_BUG_ON(PageTail(page)); 425 VM_BUG_ON(PageTail(page));
425 return PageHead(page); 426 return PageHead(page);
426 } 427 }
427 428
428 /* 429 /*
429 * PageTransCompound returns true for both transparent huge pages 430 * PageTransCompound returns true for both transparent huge pages
430 * and hugetlbfs pages, so it should only be called when it's known 431 * and hugetlbfs pages, so it should only be called when it's known
431 * that hugetlbfs pages aren't involved. 432 * that hugetlbfs pages aren't involved.
432 */ 433 */
433 static inline int PageTransCompound(struct page *page) 434 static inline int PageTransCompound(struct page *page)
434 { 435 {
435 return PageCompound(page); 436 return PageCompound(page);
436 } 437 }
437 438
438 /* 439 /*
439 * PageTransTail returns true for both transparent huge pages 440 * PageTransTail returns true for both transparent huge pages
440 * and hugetlbfs pages, so it should only be called when it's known 441 * and hugetlbfs pages, so it should only be called when it's known
441 * that hugetlbfs pages aren't involved. 442 * that hugetlbfs pages aren't involved.
442 */ 443 */
443 static inline int PageTransTail(struct page *page) 444 static inline int PageTransTail(struct page *page)
444 { 445 {
445 return PageTail(page); 446 return PageTail(page);
446 } 447 }
447 448
448 #else 449 #else
449 450
450 static inline int PageTransHuge(struct page *page) 451 static inline int PageTransHuge(struct page *page)
451 { 452 {
452 return 0; 453 return 0;
453 } 454 }
454 455
455 static inline int PageTransCompound(struct page *page) 456 static inline int PageTransCompound(struct page *page)
456 { 457 {
457 return 0; 458 return 0;
458 } 459 }
459 460
460 static inline int PageTransTail(struct page *page) 461 static inline int PageTransTail(struct page *page)
461 { 462 {
462 return 0; 463 return 0;
463 } 464 }
464 #endif 465 #endif
465 466
466 /* 467 /*
467 * If network-based swap is enabled, sl*b must keep track of whether pages 468 * If network-based swap is enabled, sl*b must keep track of whether pages
468 * were allocated from pfmemalloc reserves. 469 * were allocated from pfmemalloc reserves.
469 */ 470 */
470 static inline int PageSlabPfmemalloc(struct page *page) 471 static inline int PageSlabPfmemalloc(struct page *page)
471 { 472 {
472 VM_BUG_ON(!PageSlab(page)); 473 VM_BUG_ON(!PageSlab(page));
473 return PageActive(page); 474 return PageActive(page);
474 } 475 }
475 476
476 static inline void SetPageSlabPfmemalloc(struct page *page) 477 static inline void SetPageSlabPfmemalloc(struct page *page)
477 { 478 {
478 VM_BUG_ON(!PageSlab(page)); 479 VM_BUG_ON(!PageSlab(page));
479 SetPageActive(page); 480 SetPageActive(page);
480 } 481 }
481 482
482 static inline void __ClearPageSlabPfmemalloc(struct page *page) 483 static inline void __ClearPageSlabPfmemalloc(struct page *page)
483 { 484 {
484 VM_BUG_ON(!PageSlab(page)); 485 VM_BUG_ON(!PageSlab(page));
485 __ClearPageActive(page); 486 __ClearPageActive(page);
486 } 487 }
487 488
488 static inline void ClearPageSlabPfmemalloc(struct page *page) 489 static inline void ClearPageSlabPfmemalloc(struct page *page)
489 { 490 {
490 VM_BUG_ON(!PageSlab(page)); 491 VM_BUG_ON(!PageSlab(page));
491 ClearPageActive(page); 492 ClearPageActive(page);
492 } 493 }
493 494
494 #ifdef CONFIG_MMU 495 #ifdef CONFIG_MMU
495 #define __PG_MLOCKED (1 << PG_mlocked) 496 #define __PG_MLOCKED (1 << PG_mlocked)
496 #else 497 #else
497 #define __PG_MLOCKED 0 498 #define __PG_MLOCKED 0
498 #endif 499 #endif
499 500
500 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 501 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
501 #define __PG_COMPOUND_LOCK (1 << PG_compound_lock) 502 #define __PG_COMPOUND_LOCK (1 << PG_compound_lock)
502 #else 503 #else
503 #define __PG_COMPOUND_LOCK 0 504 #define __PG_COMPOUND_LOCK 0
504 #endif 505 #endif
505 506
506 /* 507 /*
507 * Flags checked when a page is freed. Pages being freed should not have 508 * Flags checked when a page is freed. Pages being freed should not have
508 * these flags set. If they are, there is a problem. 509 * these flags set. If they are, there is a problem.
509 */ 510 */
510 #define PAGE_FLAGS_CHECK_AT_FREE \ 511 #define PAGE_FLAGS_CHECK_AT_FREE \
511 (1 << PG_lru | 1 << PG_locked | \ 512 (1 << PG_lru | 1 << PG_locked | \
512 1 << PG_private | 1 << PG_private_2 | \ 513 1 << PG_private | 1 << PG_private_2 | \
513 1 << PG_writeback | 1 << PG_reserved | \ 514 1 << PG_writeback | 1 << PG_reserved | \
514 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ 515 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
515 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ 516 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
516 __PG_COMPOUND_LOCK) 517 __PG_COMPOUND_LOCK)
517 518
518 /* 519 /*
519 * Flags checked when a page is prepped for return by the page allocator. 520 * Flags checked when a page is prepped for return by the page allocator.
520 * Pages being prepped should not have any flags set. If they are set, 521 * Pages being prepped should not have any flags set. If they are set,
521 * there has been a kernel bug or struct page corruption. 522 * there has been a kernel bug or struct page corruption.
522 */ 523 */
523 #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1) 524 #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
524 525
525 #define PAGE_FLAGS_PRIVATE \ 526 #define PAGE_FLAGS_PRIVATE \
526 (1 << PG_private | 1 << PG_private_2) 527 (1 << PG_private | 1 << PG_private_2)
527 /** 528 /**
528 * page_has_private - Determine if page has private stuff 529 * page_has_private - Determine if page has private stuff
529 * @page: The page to be checked 530 * @page: The page to be checked
530 * 531 *
531 * Determine if a page has private stuff, indicating that release routines 532 * Determine if a page has private stuff, indicating that release routines
532 * should be invoked upon it. 533 * should be invoked upon it.
533 */ 534 */
534 static inline int page_has_private(struct page *page) 535 static inline int page_has_private(struct page *page)
535 { 536 {
536 return !!(page->flags & PAGE_FLAGS_PRIVATE); 537 return !!(page->flags & PAGE_FLAGS_PRIVATE);
537 } 538 }
538 539
539 #endif /* !__GENERATING_BOUNDS_H */ 540 #endif /* !__GENERATING_BOUNDS_H */
540 541
541 #endif /* PAGE_FLAGS_H */ 542 #endif /* PAGE_FLAGS_H */
542 543
include/linux/pagemap.h
1 #ifndef _LINUX_PAGEMAP_H 1 #ifndef _LINUX_PAGEMAP_H
2 #define _LINUX_PAGEMAP_H 2 #define _LINUX_PAGEMAP_H
3 3
4 /* 4 /*
5 * Copyright 1995 Linus Torvalds 5 * Copyright 1995 Linus Torvalds
6 */ 6 */
7 #include <linux/mm.h> 7 #include <linux/mm.h>
8 #include <linux/fs.h> 8 #include <linux/fs.h>
9 #include <linux/list.h> 9 #include <linux/list.h>
10 #include <linux/highmem.h> 10 #include <linux/highmem.h>
11 #include <linux/compiler.h> 11 #include <linux/compiler.h>
12 #include <asm/uaccess.h> 12 #include <asm/uaccess.h>
13 #include <linux/gfp.h> 13 #include <linux/gfp.h>
14 #include <linux/bitops.h> 14 #include <linux/bitops.h>
15 #include <linux/hardirq.h> /* for in_interrupt() */ 15 #include <linux/hardirq.h> /* for in_interrupt() */
16 #include <linux/hugetlb_inline.h> 16 #include <linux/hugetlb_inline.h>
17 17
18 /* 18 /*
19 * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page 19 * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
20 * allocation mode flags. 20 * allocation mode flags.
21 */ 21 */
22 enum mapping_flags { 22 enum mapping_flags {
23 AS_EIO = __GFP_BITS_SHIFT + 0, /* IO error on async write */ 23 AS_EIO = __GFP_BITS_SHIFT + 0, /* IO error on async write */
24 AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ 24 AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */
25 AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ 25 AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
26 AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ 26 AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
27 AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */ 27 AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */
28 }; 28 };
29 29
30 static inline void mapping_set_error(struct address_space *mapping, int error) 30 static inline void mapping_set_error(struct address_space *mapping, int error)
31 { 31 {
32 if (unlikely(error)) { 32 if (unlikely(error)) {
33 if (error == -ENOSPC) 33 if (error == -ENOSPC)
34 set_bit(AS_ENOSPC, &mapping->flags); 34 set_bit(AS_ENOSPC, &mapping->flags);
35 else 35 else
36 set_bit(AS_EIO, &mapping->flags); 36 set_bit(AS_EIO, &mapping->flags);
37 } 37 }
38 } 38 }
39 39
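A hedged sketch of how a writeback completion path typically feeds this: record the error on the mapping so a later fsync()/msync() can report it. The handler name is hypothetical; end_page_writeback() is the normal completion call:

	/* Hypothetical I/O completion handler for a page under writeback. */
	static void example_end_page_write(struct page *page, int err)
	{
		if (err)
			mapping_set_error(page->mapping, err);
		end_page_writeback(page);
	}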
40 static inline void mapping_set_unevictable(struct address_space *mapping) 40 static inline void mapping_set_unevictable(struct address_space *mapping)
41 { 41 {
42 set_bit(AS_UNEVICTABLE, &mapping->flags); 42 set_bit(AS_UNEVICTABLE, &mapping->flags);
43 } 43 }
44 44
45 static inline void mapping_clear_unevictable(struct address_space *mapping) 45 static inline void mapping_clear_unevictable(struct address_space *mapping)
46 { 46 {
47 clear_bit(AS_UNEVICTABLE, &mapping->flags); 47 clear_bit(AS_UNEVICTABLE, &mapping->flags);
48 } 48 }
49 49
50 static inline int mapping_unevictable(struct address_space *mapping) 50 static inline int mapping_unevictable(struct address_space *mapping)
51 { 51 {
52 if (mapping) 52 if (mapping)
53 return test_bit(AS_UNEVICTABLE, &mapping->flags); 53 return test_bit(AS_UNEVICTABLE, &mapping->flags);
54 return !!mapping; 54 return !!mapping;
55 } 55 }
56 56
57 static inline void mapping_set_balloon(struct address_space *mapping) 57 static inline void mapping_set_balloon(struct address_space *mapping)
58 { 58 {
59 set_bit(AS_BALLOON_MAP, &mapping->flags); 59 set_bit(AS_BALLOON_MAP, &mapping->flags);
60 } 60 }
61 61
62 static inline void mapping_clear_balloon(struct address_space *mapping) 62 static inline void mapping_clear_balloon(struct address_space *mapping)
63 { 63 {
64 clear_bit(AS_BALLOON_MAP, &mapping->flags); 64 clear_bit(AS_BALLOON_MAP, &mapping->flags);
65 } 65 }
66 66
67 static inline int mapping_balloon(struct address_space *mapping) 67 static inline int mapping_balloon(struct address_space *mapping)
68 { 68 {
69 return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags); 69 return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
70 } 70 }
71 71
72 static inline gfp_t mapping_gfp_mask(struct address_space * mapping) 72 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
73 { 73 {
74 return (__force gfp_t)mapping->flags & __GFP_BITS_MASK; 74 return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
75 } 75 }
76 76
77 /* 77 /*
78 * This is non-atomic. Only to be used before the mapping is activated. 78 * This is non-atomic. Only to be used before the mapping is activated.
79 * Probably needs a barrier... 79 * Probably needs a barrier...
80 */ 80 */
81 static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask) 81 static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
82 { 82 {
83 m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) | 83 m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) |
84 (__force unsigned long)mask; 84 (__force unsigned long)mask;
85 } 85 }
86 86
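As a usage note, some filesystems restrict the allocation mode for a mapping once at inode-setup time so that later page cache allocations inherit it through mapping_gfp_mask(). A minimal sketch (the helper is hypothetical):

	/* Hypothetical inode-setup helper: clear __GFP_FS for this mapping so
	 * page cache allocations done on its behalf cannot recurse into the
	 * filesystem; later allocations pick this up via mapping_gfp_mask(). */
	static void example_restrict_mapping_gfp(struct inode *inode)
	{
		mapping_set_gfp_mask(inode->i_mapping,
				     mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);
	}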
87 /* 87 /*
88 * The page cache can be done in larger chunks than 88 * The page cache can be done in larger chunks than
89 * one page, because it allows for more efficient 89 * one page, because it allows for more efficient
90 * throughput (it can then be mapped into user 90 * throughput (it can then be mapped into user
91 * space in smaller chunks for the same flexibility). 91 * space in smaller chunks for the same flexibility).
92 * 92 *
93 * Or rather, it _will_ be done in larger chunks. 93 * Or rather, it _will_ be done in larger chunks.
94 */ 94 */
95 #define PAGE_CACHE_SHIFT PAGE_SHIFT 95 #define PAGE_CACHE_SHIFT PAGE_SHIFT
96 #define PAGE_CACHE_SIZE PAGE_SIZE 96 #define PAGE_CACHE_SIZE PAGE_SIZE
97 #define PAGE_CACHE_MASK PAGE_MASK 97 #define PAGE_CACHE_MASK PAGE_MASK
98 #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK) 98 #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
99 99
100 #define page_cache_get(page) get_page(page) 100 #define page_cache_get(page) get_page(page)
101 #define page_cache_release(page) put_page(page) 101 #define page_cache_release(page) put_page(page)
102 void release_pages(struct page **pages, int nr, bool cold); 102 void release_pages(struct page **pages, int nr, bool cold);
103 103
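The PAGE_CACHE_* macros above are all that is needed to translate between file byte offsets and page cache indexes; a minimal sketch (helper names are illustrative):

	/* Byte offset -> page index, page index -> byte offset, and the offset
	 * within the page (useful e.g. for partial writes in ->write_begin()). */
	static inline pgoff_t example_pos_to_index(loff_t pos)
	{
		return pos >> PAGE_CACHE_SHIFT;
	}

	static inline loff_t example_index_to_pos(pgoff_t index)
	{
		return (loff_t)index << PAGE_CACHE_SHIFT;
	}

	static inline unsigned int example_offset_in_page(loff_t pos)
	{
		return pos & (PAGE_CACHE_SIZE - 1);
	}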
104 /* 104 /*
105 * speculatively take a reference to a page. 105 * speculatively take a reference to a page.
106 * If the page is free (_count == 0), then _count is untouched, and 0 106 * If the page is free (_count == 0), then _count is untouched, and 0
107 * is returned. Otherwise, _count is incremented by 1 and 1 is returned. 107 * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
108 * 108 *
109 * This function must be called inside the same rcu_read_lock() section as has 109 * This function must be called inside the same rcu_read_lock() section as has
110 * been used to lookup the page in the pagecache radix-tree (or page table): 110 * been used to lookup the page in the pagecache radix-tree (or page table):
111 * this allows allocators to use a synchronize_rcu() to stabilize _count. 111 * this allows allocators to use a synchronize_rcu() to stabilize _count.
112 * 112 *
113 * Unless an RCU grace period has passed, the count of all pages coming out 113 * Unless an RCU grace period has passed, the count of all pages coming out
114 * of the allocator must be considered unstable. page_count may return higher 114 * of the allocator must be considered unstable. page_count may return higher
115 * than expected, and put_page must be able to do the right thing when the 115 * than expected, and put_page must be able to do the right thing when the
116 * page has been finished with, no matter what it is subsequently allocated 116 * page has been finished with, no matter what it is subsequently allocated
117 * for (because put_page is what is used here to drop an invalid speculative 117 * for (because put_page is what is used here to drop an invalid speculative
118 * reference). 118 * reference).
119 * 119 *
120 * This is the interesting part of the lockless pagecache (and lockless 120 * This is the interesting part of the lockless pagecache (and lockless
121 * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page) 121 * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page)
122 * has the following pattern: 122 * has the following pattern:
123 * 1. find page in radix tree 123 * 1. find page in radix tree
124 * 2. conditionally increment refcount 124 * 2. conditionally increment refcount
125 * 3. check the page is still in pagecache (if no, goto 1) 125 * 3. check the page is still in pagecache (if no, goto 1)
126 * 126 *
127 * Remove-side that cares about stability of _count (eg. reclaim) has the 127 * Remove-side that cares about stability of _count (eg. reclaim) has the
128 * following (with tree_lock held for write): 128 * following (with tree_lock held for write):
129 * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg) 129 * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
130 * B. remove page from pagecache 130 * B. remove page from pagecache
131 * C. free the page 131 * C. free the page
132 * 132 *
133 * There are 2 critical interleavings that matter: 133 * There are 2 critical interleavings that matter:
134 * - 2 runs before A: in this case, A sees elevated refcount and bails out 134 * - 2 runs before A: in this case, A sees elevated refcount and bails out
135 * - A runs before 2: in this case, 2 sees zero refcount and retries; 135 * - A runs before 2: in this case, 2 sees zero refcount and retries;
136 * subsequently, B will complete and 1 will find no page, causing the 136 * subsequently, B will complete and 1 will find no page, causing the
137 * lookup to return NULL. 137 * lookup to return NULL.
138 * 138 *
139 * It is possible that between 1 and 2, the page is removed then the exact same 139 * It is possible that between 1 and 2, the page is removed then the exact same
140 * page is inserted into the same position in pagecache. That's OK: the 140 * page is inserted into the same position in pagecache. That's OK: the
141 * old find_get_page using tree_lock could equally have run before or after 141 * old find_get_page using tree_lock could equally have run before or after
142 * such a re-insertion, depending on order that locks are granted. 142 * such a re-insertion, depending on order that locks are granted.
143 * 143 *
144 * Lookups racing against pagecache insertion isn't a big problem: either 1 144 * Lookups racing against pagecache insertion isn't a big problem: either 1
145 * will find the page or it will not. Likewise, the old find_get_page could run 145 * will find the page or it will not. Likewise, the old find_get_page could run
146 * either before the insertion or afterwards, depending on timing. 146 * either before the insertion or afterwards, depending on timing.
147 */ 147 */
148 static inline int page_cache_get_speculative(struct page *page) 148 static inline int page_cache_get_speculative(struct page *page)
149 { 149 {
150 VM_BUG_ON(in_interrupt()); 150 VM_BUG_ON(in_interrupt());
151 151
152 #ifdef CONFIG_TINY_RCU 152 #ifdef CONFIG_TINY_RCU
153 # ifdef CONFIG_PREEMPT_COUNT 153 # ifdef CONFIG_PREEMPT_COUNT
154 VM_BUG_ON(!in_atomic()); 154 VM_BUG_ON(!in_atomic());
155 # endif 155 # endif
156 /* 156 /*
157 * Preempt must be disabled here - we rely on rcu_read_lock doing 157 * Preempt must be disabled here - we rely on rcu_read_lock doing
158 * this for us. 158 * this for us.
159 * 159 *
160 * Pagecache won't be truncated from interrupt context, so if we have 160 * Pagecache won't be truncated from interrupt context, so if we have
161 * found a page in the radix tree here, we have pinned its refcount by 161 * found a page in the radix tree here, we have pinned its refcount by
162 * disabling preempt, and hence no need for the "speculative get" that 162 * disabling preempt, and hence no need for the "speculative get" that
163 * SMP requires. 163 * SMP requires.
164 */ 164 */
165 VM_BUG_ON(page_count(page) == 0); 165 VM_BUG_ON(page_count(page) == 0);
166 atomic_inc(&page->_count); 166 atomic_inc(&page->_count);
167 167
168 #else 168 #else
169 if (unlikely(!get_page_unless_zero(page))) { 169 if (unlikely(!get_page_unless_zero(page))) {
170 /* 170 /*
171 * Either the page has been freed, or will be freed. 171 * Either the page has been freed, or will be freed.
172 * In either case, retry here and the caller should 172 * In either case, retry here and the caller should
173 * do the right thing (see comments above). 173 * do the right thing (see comments above).
174 */ 174 */
175 return 0; 175 return 0;
176 } 176 }
177 #endif 177 #endif
178 VM_BUG_ON(PageTail(page)); 178 VM_BUG_ON(PageTail(page));
179 179
180 return 1; 180 return 1;
181 } 181 }
182 182
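Putting steps 1-3 of the comment together, here is a hedged sketch of the lookup side, roughly what the radix-tree walk in mm/filemap.c does; exceptional entries and radix_tree_deref_retry() handling are deliberately omitted:

	/* Simplified lockless lookup: find the page in the radix tree, take a
	 * speculative reference, then recheck that the same page still sits in
	 * that slot before trusting it. */
	static struct page *example_lockless_lookup(struct address_space *mapping,
						    pgoff_t offset)
	{
		struct page *page;

	repeat:
		rcu_read_lock();
		page = radix_tree_lookup(&mapping->page_tree, offset);	/* step 1 */
		if (page) {
			if (!page_cache_get_speculative(page)) {	/* step 2 */
				rcu_read_unlock();
				goto repeat;
			}
			/* step 3: was the page removed or replaced meanwhile? */
			if (unlikely(page != radix_tree_lookup(&mapping->page_tree,
							       offset))) {
				page_cache_release(page);
				rcu_read_unlock();
				goto repeat;
			}
		}
		rcu_read_unlock();
		return page;
	}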
183 /* 183 /*
184 * Same as above, but add instead of inc (could just be merged) 184 * Same as above, but add instead of inc (could just be merged)
185 */ 185 */
186 static inline int page_cache_add_speculative(struct page *page, int count) 186 static inline int page_cache_add_speculative(struct page *page, int count)
187 { 187 {
188 VM_BUG_ON(in_interrupt()); 188 VM_BUG_ON(in_interrupt());
189 189
190 #if !defined(CONFIG_SMP) && defined(CONFIG_TREE_RCU) 190 #if !defined(CONFIG_SMP) && defined(CONFIG_TREE_RCU)
191 # ifdef CONFIG_PREEMPT_COUNT 191 # ifdef CONFIG_PREEMPT_COUNT
192 VM_BUG_ON(!in_atomic()); 192 VM_BUG_ON(!in_atomic());
193 # endif 193 # endif
194 VM_BUG_ON(page_count(page) == 0); 194 VM_BUG_ON(page_count(page) == 0);
195 atomic_add(count, &page->_count); 195 atomic_add(count, &page->_count);
196 196
197 #else 197 #else
198 if (unlikely(!atomic_add_unless(&page->_count, count, 0))) 198 if (unlikely(!atomic_add_unless(&page->_count, count, 0)))
199 return 0; 199 return 0;
200 #endif 200 #endif
201 VM_BUG_ON(PageCompound(page) && page != compound_head(page)); 201 VM_BUG_ON(PageCompound(page) && page != compound_head(page));
202 202
203 return 1; 203 return 1;
204 } 204 }
205 205
206 static inline int page_freeze_refs(struct page *page, int count) 206 static inline int page_freeze_refs(struct page *page, int count)
207 { 207 {
208 return likely(atomic_cmpxchg(&page->_count, count, 0) == count); 208 return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
209 } 209 }
210 210
211 static inline void page_unfreeze_refs(struct page *page, int count) 211 static inline void page_unfreeze_refs(struct page *page, int count)
212 { 212 {
213 VM_BUG_ON(page_count(page) != 0); 213 VM_BUG_ON(page_count(page) != 0);
214 VM_BUG_ON(count == 0); 214 VM_BUG_ON(count == 0);
215 215
216 atomic_set(&page->_count, count); 216 atomic_set(&page->_count, count);
217 } 217 }
218 218
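These two helpers are the remove side (step A) of the protocol described above. A hedged sketch of how a reclaim-style removal might use them, loosely modelled on the reclaim path; accounting and error handling are omitted:

	/* Freeze the refcount to 0 while holding tree_lock for write; if that
	 * fails, someone (e.g. a lockless lookup) still holds a reference and
	 * the page cannot be removed right now. Assumes refcount 2 == caller
	 * reference + page cache reference. */
	static int example_try_remove_from_cache(struct address_space *mapping,
						 struct page *page)
	{
		spin_lock_irq(&mapping->tree_lock);
		if (!page_freeze_refs(page, 2)) {
			spin_unlock_irq(&mapping->tree_lock);
			return 0;			/* lost the race, keep the page */
		}
		if (unlikely(PageDirty(page))) {
			/* Became dirty under us: put the refs back and give up. */
			page_unfreeze_refs(page, 2);
			spin_unlock_irq(&mapping->tree_lock);
			return 0;
		}
		radix_tree_delete(&mapping->page_tree, page->index);	/* step B */
		mapping->nrpages--;
		spin_unlock_irq(&mapping->tree_lock);
		/* step C: the caller frees the page (refcount is now 0). */
		return 1;
	}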
219 #ifdef CONFIG_NUMA 219 #ifdef CONFIG_NUMA
220 extern struct page *__page_cache_alloc(gfp_t gfp); 220 extern struct page *__page_cache_alloc(gfp_t gfp);
221 #else 221 #else
222 static inline struct page *__page_cache_alloc(gfp_t gfp) 222 static inline struct page *__page_cache_alloc(gfp_t gfp)
223 { 223 {
224 return alloc_pages(gfp, 0); 224 return alloc_pages(gfp, 0);
225 } 225 }
226 #endif 226 #endif
227 227
228 static inline struct page *page_cache_alloc(struct address_space *x) 228 static inline struct page *page_cache_alloc(struct address_space *x)
229 { 229 {
230 return __page_cache_alloc(mapping_gfp_mask(x)); 230 return __page_cache_alloc(mapping_gfp_mask(x));
231 } 231 }
232 232
233 static inline struct page *page_cache_alloc_cold(struct address_space *x) 233 static inline struct page *page_cache_alloc_cold(struct address_space *x)
234 { 234 {
235 return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD); 235 return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
236 } 236 }
237 237
238 static inline struct page *page_cache_alloc_readahead(struct address_space *x) 238 static inline struct page *page_cache_alloc_readahead(struct address_space *x)
239 { 239 {
240 return __page_cache_alloc(mapping_gfp_mask(x) | 240 return __page_cache_alloc(mapping_gfp_mask(x) |
241 __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN); 241 __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN);
242 } 242 }
243 243
244 typedef int filler_t(void *, struct page *); 244 typedef int filler_t(void *, struct page *);
245 245
246 pgoff_t page_cache_next_hole(struct address_space *mapping, 246 pgoff_t page_cache_next_hole(struct address_space *mapping,
247 pgoff_t index, unsigned long max_scan); 247 pgoff_t index, unsigned long max_scan);
248 pgoff_t page_cache_prev_hole(struct address_space *mapping, 248 pgoff_t page_cache_prev_hole(struct address_space *mapping,
249 pgoff_t index, unsigned long max_scan); 249 pgoff_t index, unsigned long max_scan);
250 250
251 #define FGP_ACCESSED 0x00000001
252 #define FGP_LOCK 0x00000002
253 #define FGP_CREAT 0x00000004
254 #define FGP_WRITE 0x00000008
255 #define FGP_NOFS 0x00000010
256 #define FGP_NOWAIT 0x00000020
257
258 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
259 int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask);
260
261 /**
262 * find_get_page - find and get a page reference
263 * @mapping: the address_space to search
264 * @offset: the page index
265 *
266 * Looks up the page cache slot at @mapping & @offset. If there is a
267 * page cache page, it is returned with an increased refcount.
268 *
269 * Otherwise, %NULL is returned.
270 */
271 static inline struct page *find_get_page(struct address_space *mapping,
272 pgoff_t offset)
273 {
274 return pagecache_get_page(mapping, offset, 0, 0, 0);
275 }
276
277 static inline struct page *find_get_page_flags(struct address_space *mapping,
278 pgoff_t offset, int fgp_flags)
279 {
280 return pagecache_get_page(mapping, offset, fgp_flags, 0, 0);
281 }
282
283 /**
284 * find_lock_page - locate, pin and lock a pagecache page
286 * @mapping: the address_space to search
287 * @offset: the page index
288 *
289 * Looks up the page cache slot at @mapping & @offset. If there is a
290 * page cache page, it is returned locked and with an increased
291 * refcount.
292 *
293 * Otherwise, %NULL is returned.
294 *
295 * find_lock_page() may sleep.
296 */
297 static inline struct page *find_lock_page(struct address_space *mapping,
298 pgoff_t offset)
299 {
300 return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0);
301 }
302
303 /**
304 * find_or_create_page - locate or add a pagecache page
305 * @mapping: the page's address_space
306 * @index: the page's index into the mapping
307 * @gfp_mask: page allocation mode
308 *
309 * Looks up the page cache slot at @mapping & @offset. If there is a
310 * page cache page, it is returned locked and with an increased
311 * refcount.
312 *
313 * If the page is not present, a new page is allocated using @gfp_mask
314 * and added to the page cache and the VM's LRU list. The page is
315 * returned locked and with an increased refcount.
316 *
317 * On memory exhaustion, %NULL is returned.
318 *
319 * find_or_create_page() may sleep, even if @gfp_mask specifies an
320 * atomic allocation!
321 */
322 static inline struct page *find_or_create_page(struct address_space *mapping,
323 pgoff_t offset, gfp_t gfp_mask)
324 {
325 return pagecache_get_page(mapping, offset,
326 FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
327 gfp_mask, gfp_mask & GFP_RECLAIM_MASK);
328 }
329
330 /**
331 * grab_cache_page_nowait - returns locked page at given index in given cache
332 * @mapping: target address_space
333 * @index: the page index
334 *
335 * Same as grab_cache_page(), but do not wait if the page is unavailable.
336 * This is intended for speculative data generators, where the data can
337 * be regenerated if the page couldn't be grabbed. This routine should
338 * be safe to call while holding the lock for another page.
339 *
340 * Clear __GFP_FS when allocating the page to avoid recursion into the fs
341 * and deadlock against the caller's locked page.
342 */
343 static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
344 pgoff_t index)
345 {
346 return pagecache_get_page(mapping, index,
347 FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
348 mapping_gfp_mask(mapping),
349 GFP_NOFS);
350 }
351
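The write path is expected to funnel through the same core function. Below is a hedged sketch of how grab_cache_page_write_begin() (declared further down) can be built on pagecache_get_page() after this change; the real version in mm/filemap.c may differ in details such as stable-page handling:

	/* Sketch of a write_begin-style wrapper: find or create the page,
	 * locked, and let pagecache_get_page() initialise the accessed state
	 * non-atomically (FGP_ACCESSED) before the page becomes visible. */
	static inline struct page *example_grab_page_write_begin(
			struct address_space *mapping, pgoff_t index, unsigned flags)
	{
		int fgp_flags = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_ACCESSED;

		if (flags & AOP_FLAG_NOFS)
			fgp_flags |= FGP_NOFS;

		return pagecache_get_page(mapping, index, fgp_flags,
					  mapping_gfp_mask(mapping), GFP_KERNEL);
	}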
251 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset); 352 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
252 struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
253 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset); 353 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset);
254 struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
255 struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
256 gfp_t gfp_mask);
257 unsigned find_get_entries(struct address_space *mapping, pgoff_t start, 354 unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
258 unsigned int nr_entries, struct page **entries, 355 unsigned int nr_entries, struct page **entries,
259 pgoff_t *indices); 356 pgoff_t *indices);
260 unsigned find_get_pages(struct address_space *mapping, pgoff_t start, 357 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
261 unsigned int nr_pages, struct page **pages); 358 unsigned int nr_pages, struct page **pages);
262 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, 359 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
263 unsigned int nr_pages, struct page **pages); 360 unsigned int nr_pages, struct page **pages);
264 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, 361 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
265 int tag, unsigned int nr_pages, struct page **pages); 362 int tag, unsigned int nr_pages, struct page **pages);
266 363
267 struct page *grab_cache_page_write_begin(struct address_space *mapping, 364 struct page *grab_cache_page_write_begin(struct address_space *mapping,
268 pgoff_t index, unsigned flags); 365 pgoff_t index, unsigned flags);
269 366
270 /* 367 /*
271 * Returns locked page at given index in given cache, creating it if needed. 368 * Returns locked page at given index in given cache, creating it if needed.
272 */ 369 */
273 static inline struct page *grab_cache_page(struct address_space *mapping, 370 static inline struct page *grab_cache_page(struct address_space *mapping,
274 pgoff_t index) 371 pgoff_t index)
275 { 372 {
276 return find_or_create_page(mapping, index, mapping_gfp_mask(mapping)); 373 return find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
277 } 374 }
278 375
279 extern struct page * grab_cache_page_nowait(struct address_space *mapping,
280 pgoff_t index);
281 extern struct page * read_cache_page(struct address_space *mapping, 376 extern struct page * read_cache_page(struct address_space *mapping,
282 pgoff_t index, filler_t *filler, void *data); 377 pgoff_t index, filler_t *filler, void *data);
283 extern struct page * read_cache_page_gfp(struct address_space *mapping, 378 extern struct page * read_cache_page_gfp(struct address_space *mapping,
284 pgoff_t index, gfp_t gfp_mask); 379 pgoff_t index, gfp_t gfp_mask);
285 extern int read_cache_pages(struct address_space *mapping, 380 extern int read_cache_pages(struct address_space *mapping,
286 struct list_head *pages, filler_t *filler, void *data); 381 struct list_head *pages, filler_t *filler, void *data);
287 382
288 static inline struct page *read_mapping_page(struct address_space *mapping, 383 static inline struct page *read_mapping_page(struct address_space *mapping,
289 pgoff_t index, void *data) 384 pgoff_t index, void *data)
290 { 385 {
291 filler_t *filler = (filler_t *)mapping->a_ops->readpage; 386 filler_t *filler = (filler_t *)mapping->a_ops->readpage;
292 return read_cache_page(mapping, index, filler, data); 387 return read_cache_page(mapping, index, filler, data);
293 } 388 }
294 389
295 /* 390 /*
296 * Return byte-offset into filesystem object for page. 391 * Return byte-offset into filesystem object for page.
297 */ 392 */
298 static inline loff_t page_offset(struct page *page) 393 static inline loff_t page_offset(struct page *page)
299 { 394 {
300 return ((loff_t)page->index) << PAGE_CACHE_SHIFT; 395 return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
301 } 396 }
302 397
303 static inline loff_t page_file_offset(struct page *page) 398 static inline loff_t page_file_offset(struct page *page)
304 { 399 {
305 return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT; 400 return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
306 } 401 }
307 402
308 extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma, 403 extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
309 unsigned long address); 404 unsigned long address);
310 405
311 static inline pgoff_t linear_page_index(struct vm_area_struct *vma, 406 static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
312 unsigned long address) 407 unsigned long address)
313 { 408 {
314 pgoff_t pgoff; 409 pgoff_t pgoff;
315 if (unlikely(is_vm_hugetlb_page(vma))) 410 if (unlikely(is_vm_hugetlb_page(vma)))
316 return linear_hugepage_index(vma, address); 411 return linear_hugepage_index(vma, address);
317 pgoff = (address - vma->vm_start) >> PAGE_SHIFT; 412 pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
318 pgoff += vma->vm_pgoff; 413 pgoff += vma->vm_pgoff;
319 return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT); 414 return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
320 } 415 }
321 416
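As a small illustration of the index arithmetic, a fault path for a file-backed VMA turns the faulting address into a page cache index and looks it up; a minimal sketch (helper name is hypothetical, and it assumes vma->vm_file is non-NULL):

	/* Hypothetical fault-path lookup for a non-hugetlb, file-backed VMA. */
	static struct page *example_fault_lookup(struct vm_area_struct *vma,
						 unsigned long address)
	{
		pgoff_t pgoff = linear_page_index(vma, address);

		return find_get_page(vma->vm_file->f_mapping, pgoff);
	}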
322 extern void __lock_page(struct page *page); 417 extern void __lock_page(struct page *page);
323 extern int __lock_page_killable(struct page *page); 418 extern int __lock_page_killable(struct page *page);
324 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm, 419 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
325 unsigned int flags); 420 unsigned int flags);
326 extern void unlock_page(struct page *page); 421 extern void unlock_page(struct page *page);
327 422
328 static inline void __set_page_locked(struct page *page) 423 static inline void __set_page_locked(struct page *page)
329 { 424 {
330 __set_bit(PG_locked, &page->flags); 425 __set_bit(PG_locked, &page->flags);
331 } 426 }
332 427
333 static inline void __clear_page_locked(struct page *page) 428 static inline void __clear_page_locked(struct page *page)
334 { 429 {
335 __clear_bit(PG_locked, &page->flags); 430 __clear_bit(PG_locked, &page->flags);
336 } 431 }
337 432
338 static inline int trylock_page(struct page *page) 433 static inline int trylock_page(struct page *page)
339 { 434 {
340 return (likely(!test_and_set_bit_lock(PG_locked, &page->flags))); 435 return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
341 } 436 }
342 437
343 /* 438 /*
344 * lock_page may only be called if we have the page's inode pinned. 439 * lock_page may only be called if we have the page's inode pinned.
345 */ 440 */
346 static inline void lock_page(struct page *page) 441 static inline void lock_page(struct page *page)
347 { 442 {
348 might_sleep(); 443 might_sleep();
349 if (!trylock_page(page)) 444 if (!trylock_page(page))
350 __lock_page(page); 445 __lock_page(page);
351 } 446 }
352 447
353 /* 448 /*
354 * lock_page_killable is like lock_page but can be interrupted by fatal 449 * lock_page_killable is like lock_page but can be interrupted by fatal
355 * signals. It returns 0 if it locked the page and -EINTR if it was 450 * signals. It returns 0 if it locked the page and -EINTR if it was
356 * killed while waiting. 451 * killed while waiting.
357 */ 452 */
358 static inline int lock_page_killable(struct page *page) 453 static inline int lock_page_killable(struct page *page)
359 { 454 {
360 might_sleep(); 455 might_sleep();
361 if (!trylock_page(page)) 456 if (!trylock_page(page))
362 return __lock_page_killable(page); 457 return __lock_page_killable(page);
363 return 0; 458 return 0;
364 } 459 }
365 460
366 /* 461 /*
367 * lock_page_or_retry - Lock the page, unless this would block and the 462 * lock_page_or_retry - Lock the page, unless this would block and the
368 * caller indicated that it can handle a retry. 463 * caller indicated that it can handle a retry.
369 */ 464 */
370 static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm, 465 static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
371 unsigned int flags) 466 unsigned int flags)
372 { 467 {
373 might_sleep(); 468 might_sleep();
374 return trylock_page(page) || __lock_page_or_retry(page, mm, flags); 469 return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
375 } 470 }
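A minimal usage sketch for the lock helpers above; the wrapper function is illustrative and not part of this patch:

/* Sketch: lock a page cache page, giving up on a fatal signal. */
static int example_with_locked_page(struct page *page)
{
	int err;

	err = lock_page_killable(page);	/* sleeps; returns -EINTR if killed */
	if (err)
		return err;

	/* ... operate on the locked page ... */

	unlock_page(page);
	return 0;
}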
376 471
377 /* 472 /*
378 * This is exported only for wait_on_page_locked/wait_on_page_writeback. 473 * This is exported only for wait_on_page_locked/wait_on_page_writeback.
379 * Never use this directly! 474 * Never use this directly!
380 */ 475 */
381 extern void wait_on_page_bit(struct page *page, int bit_nr); 476 extern void wait_on_page_bit(struct page *page, int bit_nr);
382 477
383 extern int wait_on_page_bit_killable(struct page *page, int bit_nr); 478 extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
384 479
385 static inline int wait_on_page_locked_killable(struct page *page) 480 static inline int wait_on_page_locked_killable(struct page *page)
386 { 481 {
387 if (PageLocked(page)) 482 if (PageLocked(page))
388 return wait_on_page_bit_killable(page, PG_locked); 483 return wait_on_page_bit_killable(page, PG_locked);
389 return 0; 484 return 0;
390 } 485 }
391 486
392 /* 487 /*
393 * Wait for a page to be unlocked. 488 * Wait for a page to be unlocked.
394 * 489 *
395 * This must be called with the caller "holding" the page, 490 * This must be called with the caller "holding" the page,
396 * ie with increased "page->count" so that the page won't 491 * ie with increased "page->count" so that the page won't
397 * go away during the wait.. 492 * go away during the wait..
398 */ 493 */
399 static inline void wait_on_page_locked(struct page *page) 494 static inline void wait_on_page_locked(struct page *page)
400 { 495 {
401 if (PageLocked(page)) 496 if (PageLocked(page))
402 wait_on_page_bit(page, PG_locked); 497 wait_on_page_bit(page, PG_locked);
403 } 498 }
404 499
405 /* 500 /*
406 * Wait for a page to complete writeback 501 * Wait for a page to complete writeback
407 */ 502 */
408 static inline void wait_on_page_writeback(struct page *page) 503 static inline void wait_on_page_writeback(struct page *page)
409 { 504 {
410 if (PageWriteback(page)) 505 if (PageWriteback(page))
411 wait_on_page_bit(page, PG_writeback); 506 wait_on_page_bit(page, PG_writeback);
412 } 507 }
413 508
414 extern void end_page_writeback(struct page *page); 509 extern void end_page_writeback(struct page *page);
415 void wait_for_stable_page(struct page *page); 510 void wait_for_stable_page(struct page *page);
416 511
417 /* 512 /*
418 * Add an arbitrary waiter to a page's wait queue 513 * Add an arbitrary waiter to a page's wait queue
419 */ 514 */
420 extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter); 515 extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
421 516
422 /* 517 /*
423 * Fault a userspace page into pagetables. Return non-zero on a fault. 518 * Fault a userspace page into pagetables. Return non-zero on a fault.
424 * 519 *
425 * This assumes that two userspace pages are always sufficient. That's 520 * This assumes that two userspace pages are always sufficient. That's
426 * not true if PAGE_CACHE_SIZE > PAGE_SIZE. 521 * not true if PAGE_CACHE_SIZE > PAGE_SIZE.
427 */ 522 */
428 static inline int fault_in_pages_writeable(char __user *uaddr, int size) 523 static inline int fault_in_pages_writeable(char __user *uaddr, int size)
429 { 524 {
430 int ret; 525 int ret;
431 526
432 if (unlikely(size == 0)) 527 if (unlikely(size == 0))
433 return 0; 528 return 0;
434 529
435 /* 530 /*
436 * Writing zeroes into userspace here is OK, because we know that if 531 * Writing zeroes into userspace here is OK, because we know that if
437 * the zero gets there, we'll be overwriting it. 532 * the zero gets there, we'll be overwriting it.
438 */ 533 */
439 ret = __put_user(0, uaddr); 534 ret = __put_user(0, uaddr);
440 if (ret == 0) { 535 if (ret == 0) {
441 char __user *end = uaddr + size - 1; 536 char __user *end = uaddr + size - 1;
442 537
443 /* 538 /*
444 * If the page was already mapped, this will get a cache miss 539 * If the page was already mapped, this will get a cache miss
445 * for sure, so try to avoid doing it. 540 * for sure, so try to avoid doing it.
446 */ 541 */
447 if (((unsigned long)uaddr & PAGE_MASK) != 542 if (((unsigned long)uaddr & PAGE_MASK) !=
448 ((unsigned long)end & PAGE_MASK)) 543 ((unsigned long)end & PAGE_MASK))
449 ret = __put_user(0, end); 544 ret = __put_user(0, end);
450 } 545 }
451 return ret; 546 return ret;
452 } 547 }
453 548
454 static inline int fault_in_pages_readable(const char __user *uaddr, int size) 549 static inline int fault_in_pages_readable(const char __user *uaddr, int size)
455 { 550 {
456 volatile char c; 551 volatile char c;
457 int ret; 552 int ret;
458 553
459 if (unlikely(size == 0)) 554 if (unlikely(size == 0))
460 return 0; 555 return 0;
461 556
462 ret = __get_user(c, uaddr); 557 ret = __get_user(c, uaddr);
463 if (ret == 0) { 558 if (ret == 0) {
464 const char __user *end = uaddr + size - 1; 559 const char __user *end = uaddr + size - 1;
465 560
466 if (((unsigned long)uaddr & PAGE_MASK) != 561 if (((unsigned long)uaddr & PAGE_MASK) !=
467 ((unsigned long)end & PAGE_MASK)) { 562 ((unsigned long)end & PAGE_MASK)) {
468 ret = __get_user(c, end); 563 ret = __get_user(c, end);
469 (void)c; 564 (void)c;
470 } 565 }
471 } 566 }
472 return ret; 567 return ret;
473 } 568 }
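These prefault helpers are typically called before taking a page lock in the buffered write path, so that the subsequent usercopy (done with page faults disabled) cannot deadlock on the locked page. A hedged sketch of that pattern, with an illustrative function name and at most PAGE_SIZE of data per the comment above:

static int example_prepare_copy(struct page *page,
				const char __user *buf, int bytes)
{
	if (fault_in_pages_readable(buf, bytes))
		return -EFAULT;		/* user buffer not readable */

	lock_page(page);
	/* ... atomic usercopy from buf into the page goes here ... */
	unlock_page(page);
	return 0;
}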
474 569
475 /* 570 /*
476 * Multipage variants of the above prefault helpers, useful if more than 571 * Multipage variants of the above prefault helpers, useful if more than
477 * PAGE_SIZE of data needs to be prefaulted. These are separate from the above 572 * PAGE_SIZE of data needs to be prefaulted. These are separate from the above
478 * functions (which only handle up to PAGE_SIZE) to avoid clobbering the 573 * functions (which only handle up to PAGE_SIZE) to avoid clobbering the
479 * filemap.c hotpaths. 574 * filemap.c hotpaths.
480 */ 575 */
481 static inline int fault_in_multipages_writeable(char __user *uaddr, int size) 576 static inline int fault_in_multipages_writeable(char __user *uaddr, int size)
482 { 577 {
483 int ret = 0; 578 int ret = 0;
484 char __user *end = uaddr + size - 1; 579 char __user *end = uaddr + size - 1;
485 580
486 if (unlikely(size == 0)) 581 if (unlikely(size == 0))
487 return ret; 582 return ret;
488 583
489 /* 584 /*
490 * Writing zeroes into userspace here is OK, because we know that if 585 * Writing zeroes into userspace here is OK, because we know that if
491 * the zero gets there, we'll be overwriting it. 586 * the zero gets there, we'll be overwriting it.
492 */ 587 */
493 while (uaddr <= end) { 588 while (uaddr <= end) {
494 ret = __put_user(0, uaddr); 589 ret = __put_user(0, uaddr);
495 if (ret != 0) 590 if (ret != 0)
496 return ret; 591 return ret;
497 uaddr += PAGE_SIZE; 592 uaddr += PAGE_SIZE;
498 } 593 }
499 594
500 /* Check whether the range spilled into the next page. */ 595 /* Check whether the range spilled into the next page. */
501 if (((unsigned long)uaddr & PAGE_MASK) == 596 if (((unsigned long)uaddr & PAGE_MASK) ==
502 ((unsigned long)end & PAGE_MASK)) 597 ((unsigned long)end & PAGE_MASK))
503 ret = __put_user(0, end); 598 ret = __put_user(0, end);
504 599
505 return ret; 600 return ret;
506 } 601 }
507 602
508 static inline int fault_in_multipages_readable(const char __user *uaddr, 603 static inline int fault_in_multipages_readable(const char __user *uaddr,
509 int size) 604 int size)
510 { 605 {
511 volatile char c; 606 volatile char c;
512 int ret = 0; 607 int ret = 0;
513 const char __user *end = uaddr + size - 1; 608 const char __user *end = uaddr + size - 1;
514 609
515 if (unlikely(size == 0)) 610 if (unlikely(size == 0))
516 return ret; 611 return ret;
517 612
518 while (uaddr <= end) { 613 while (uaddr <= end) {
519 ret = __get_user(c, uaddr); 614 ret = __get_user(c, uaddr);
520 if (ret != 0) 615 if (ret != 0)
521 return ret; 616 return ret;
522 uaddr += PAGE_SIZE; 617 uaddr += PAGE_SIZE;
523 } 618 }
524 619
525 /* Check whether the range spilled into the next page. */ 620 /* Check whether the range spilled into the next page. */
526 if (((unsigned long)uaddr & PAGE_MASK) == 621 if (((unsigned long)uaddr & PAGE_MASK) ==
527 ((unsigned long)end & PAGE_MASK)) { 622 ((unsigned long)end & PAGE_MASK)) {
528 ret = __get_user(c, end); 623 ret = __get_user(c, end);
529 (void)c; 624 (void)c;
530 } 625 }
531 626
532 return ret; 627 return ret;
533 } 628 }
534 629
535 int add_to_page_cache_locked(struct page *page, struct address_space *mapping, 630 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
536 pgoff_t index, gfp_t gfp_mask); 631 pgoff_t index, gfp_t gfp_mask);
537 int add_to_page_cache_lru(struct page *page, struct address_space *mapping, 632 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
538 pgoff_t index, gfp_t gfp_mask); 633 pgoff_t index, gfp_t gfp_mask);
539 extern void delete_from_page_cache(struct page *page); 634 extern void delete_from_page_cache(struct page *page);
540 extern void __delete_from_page_cache(struct page *page); 635 extern void __delete_from_page_cache(struct page *page);
541 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask); 636 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
542 637
543 /* 638 /*
544 * Like add_to_page_cache_locked, but used to add newly allocated pages: 639 * Like add_to_page_cache_locked, but used to add newly allocated pages:
545 * the page is new, so we can just run __set_page_locked() against it. 640 * the page is new, so we can just run __set_page_locked() against it.
546 */ 641 */
547 static inline int add_to_page_cache(struct page *page, 642 static inline int add_to_page_cache(struct page *page,
548 struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask) 643 struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
549 { 644 {
550 int error; 645 int error;
551 646
552 __set_page_locked(page); 647 __set_page_locked(page);
553 error = add_to_page_cache_locked(page, mapping, offset, gfp_mask); 648 error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
554 if (unlikely(error)) 649 if (unlikely(error))
555 __clear_page_locked(page); 650 __clear_page_locked(page);
556 return error; 651 return error;
557 } 652 }
include/linux/swap.h
1 #ifndef _LINUX_SWAP_H 1 #ifndef _LINUX_SWAP_H
2 #define _LINUX_SWAP_H 2 #define _LINUX_SWAP_H
3 3
4 #include <linux/spinlock.h> 4 #include <linux/spinlock.h>
5 #include <linux/linkage.h> 5 #include <linux/linkage.h>
6 #include <linux/mmzone.h> 6 #include <linux/mmzone.h>
7 #include <linux/list.h> 7 #include <linux/list.h>
8 #include <linux/memcontrol.h> 8 #include <linux/memcontrol.h>
9 #include <linux/sched.h> 9 #include <linux/sched.h>
10 #include <linux/node.h> 10 #include <linux/node.h>
11 #include <linux/fs.h> 11 #include <linux/fs.h>
12 #include <linux/atomic.h> 12 #include <linux/atomic.h>
13 #include <linux/page-flags.h> 13 #include <linux/page-flags.h>
14 #include <asm/page.h> 14 #include <asm/page.h>
15 15
16 struct notifier_block; 16 struct notifier_block;
17 17
18 struct bio; 18 struct bio;
19 19
20 #define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */ 20 #define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */
21 #define SWAP_FLAG_PRIO_MASK 0x7fff 21 #define SWAP_FLAG_PRIO_MASK 0x7fff
22 #define SWAP_FLAG_PRIO_SHIFT 0 22 #define SWAP_FLAG_PRIO_SHIFT 0
23 #define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap */ 23 #define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap */
24 #define SWAP_FLAG_DISCARD_ONCE 0x20000 /* discard swap area at swapon-time */ 24 #define SWAP_FLAG_DISCARD_ONCE 0x20000 /* discard swap area at swapon-time */
25 #define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */ 25 #define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */
26 26
27 #define SWAP_FLAGS_VALID (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \ 27 #define SWAP_FLAGS_VALID (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
28 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \ 28 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
29 SWAP_FLAG_DISCARD_PAGES) 29 SWAP_FLAG_DISCARD_PAGES)
30 30
31 static inline int current_is_kswapd(void) 31 static inline int current_is_kswapd(void)
32 { 32 {
33 return current->flags & PF_KSWAPD; 33 return current->flags & PF_KSWAPD;
34 } 34 }
35 35
36 /* 36 /*
37 * MAX_SWAPFILES defines the maximum number of swaptypes: things which can 37 * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
38 * be swapped to. The swap type and the offset into that swap type are 38 * be swapped to. The swap type and the offset into that swap type are
39 * encoded into pte's and into pgoff_t's in the swapcache. Using five bits 39 * encoded into pte's and into pgoff_t's in the swapcache. Using five bits
40 * for the type means that the maximum number of swapcache pages is 27 bits 40 * for the type means that the maximum number of swapcache pages is 27 bits
41 * on 32-bit-pgoff_t architectures. And that assumes that the architecture packs 41 * on 32-bit-pgoff_t architectures. And that assumes that the architecture packs
42 * the type/offset into the pte as 5/27 as well. 42 * the type/offset into the pte as 5/27 as well.
43 */ 43 */
44 #define MAX_SWAPFILES_SHIFT 5 44 #define MAX_SWAPFILES_SHIFT 5
45 45
46 /* 46 /*
47 * Use some of the swap files numbers for other purposes. This 47 * Use some of the swap files numbers for other purposes. This
48 * is a convenient way to hook into the VM to trigger special 48 * is a convenient way to hook into the VM to trigger special
49 * actions on faults. 49 * actions on faults.
50 */ 50 */
51 51
52 /* 52 /*
53 * NUMA node memory migration support 53 * NUMA node memory migration support
54 */ 54 */
55 #ifdef CONFIG_MIGRATION 55 #ifdef CONFIG_MIGRATION
56 #define SWP_MIGRATION_NUM 2 56 #define SWP_MIGRATION_NUM 2
57 #define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) 57 #define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM)
58 #define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) 58 #define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
59 #else 59 #else
60 #define SWP_MIGRATION_NUM 0 60 #define SWP_MIGRATION_NUM 0
61 #endif 61 #endif
62 62
63 /* 63 /*
64 * Handling of hardware poisoned pages with memory corruption. 64 * Handling of hardware poisoned pages with memory corruption.
65 */ 65 */
66 #ifdef CONFIG_MEMORY_FAILURE 66 #ifdef CONFIG_MEMORY_FAILURE
67 #define SWP_HWPOISON_NUM 1 67 #define SWP_HWPOISON_NUM 1
68 #define SWP_HWPOISON MAX_SWAPFILES 68 #define SWP_HWPOISON MAX_SWAPFILES
69 #else 69 #else
70 #define SWP_HWPOISON_NUM 0 70 #define SWP_HWPOISON_NUM 0
71 #endif 71 #endif
72 72
73 #define MAX_SWAPFILES \ 73 #define MAX_SWAPFILES \
74 ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM) 74 ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
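Plugging the definitions above together, and assuming a configuration with both CONFIG_MIGRATION and CONFIG_MEMORY_FAILURE enabled:

/*
 *	1 << MAX_SWAPFILES_SHIFT	= 32	encodable swap "types"
 *	- SWP_MIGRATION_NUM		-  2	(read + write migration entries)
 *	- SWP_HWPOISON_NUM		-  1	(hwpoison entry)
 *	---------------------------------------
 *	MAX_SWAPFILES			= 29	usable swap areas
 *
 * With both options disabled, all 32 types are available for swap areas.
 */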
75 75
76 /* 76 /*
77 * Magic header for a swap area. The first part of the union is 77 * Magic header for a swap area. The first part of the union is
78 * what the swap magic looks like for the old (limited to 128MB) 78 * what the swap magic looks like for the old (limited to 128MB)
79 * swap area format, the second part of the union adds - in the 79 * swap area format, the second part of the union adds - in the
80 * old reserved area - some extra information. Note that the first 80 * old reserved area - some extra information. Note that the first
81 * kilobyte is reserved for boot loader or disk label stuff... 81 * kilobyte is reserved for boot loader or disk label stuff...
82 * 82 *
83 * Having the magic at the end of the PAGE_SIZE makes detecting swap 83 * Having the magic at the end of the PAGE_SIZE makes detecting swap
84 * areas somewhat tricky on machines that support multiple page sizes. 84 * areas somewhat tricky on machines that support multiple page sizes.
85 * For 2.5 we'll probably want to move the magic to just beyond the 85 * For 2.5 we'll probably want to move the magic to just beyond the
86 * bootbits... 86 * bootbits...
87 */ 87 */
88 union swap_header { 88 union swap_header {
89 struct { 89 struct {
90 char reserved[PAGE_SIZE - 10]; 90 char reserved[PAGE_SIZE - 10];
91 char magic[10]; /* SWAP-SPACE or SWAPSPACE2 */ 91 char magic[10]; /* SWAP-SPACE or SWAPSPACE2 */
92 } magic; 92 } magic;
93 struct { 93 struct {
94 char bootbits[1024]; /* Space for disklabel etc. */ 94 char bootbits[1024]; /* Space for disklabel etc. */
95 __u32 version; 95 __u32 version;
96 __u32 last_page; 96 __u32 last_page;
97 __u32 nr_badpages; 97 __u32 nr_badpages;
98 unsigned char sws_uuid[16]; 98 unsigned char sws_uuid[16];
99 unsigned char sws_volume[16]; 99 unsigned char sws_volume[16];
100 __u32 padding[117]; 100 __u32 padding[117];
101 __u32 badpages[1]; 101 __u32 badpages[1];
102 } info; 102 } info;
103 }; 103 };
104 104
105 /* A swap entry has to fit into a "unsigned long", as 105 /* A swap entry has to fit into a "unsigned long", as
106 * the entry is hidden in the "index" field of the 106 * the entry is hidden in the "index" field of the
107 * swapper address space. 107 * swapper address space.
108 */ 108 */
109 typedef struct { 109 typedef struct {
110 unsigned long val; 110 unsigned long val;
111 } swp_entry_t; 111 } swp_entry_t;
112 112
113 /* 113 /*
114 * current->reclaim_state points to one of these when a task is running 114 * current->reclaim_state points to one of these when a task is running
115 * memory reclaim 115 * memory reclaim
116 */ 116 */
117 struct reclaim_state { 117 struct reclaim_state {
118 unsigned long reclaimed_slab; 118 unsigned long reclaimed_slab;
119 }; 119 };
120 120
121 #ifdef __KERNEL__ 121 #ifdef __KERNEL__
122 122
123 struct address_space; 123 struct address_space;
124 struct sysinfo; 124 struct sysinfo;
125 struct writeback_control; 125 struct writeback_control;
126 struct zone; 126 struct zone;
127 127
128 /* 128 /*
129 * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of 129 * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of
130 * disk blocks. A list of swap extents maps the entire swapfile. (Where the 130 * disk blocks. A list of swap extents maps the entire swapfile. (Where the
131 * term `swapfile' refers to either a blockdevice or an IS_REG file. Apart 131 * term `swapfile' refers to either a blockdevice or an IS_REG file. Apart
132 * from setup, they're handled identically. 132 * from setup, they're handled identically.
133 * 133 *
134 * We always assume that blocks are of size PAGE_SIZE. 134 * We always assume that blocks are of size PAGE_SIZE.
135 */ 135 */
136 struct swap_extent { 136 struct swap_extent {
137 struct list_head list; 137 struct list_head list;
138 pgoff_t start_page; 138 pgoff_t start_page;
139 pgoff_t nr_pages; 139 pgoff_t nr_pages;
140 sector_t start_block; 140 sector_t start_block;
141 }; 141 };
142 142
143 /* 143 /*
144 * Max bad pages in the new format.. 144 * Max bad pages in the new format..
145 */ 145 */
146 #define __swapoffset(x) ((unsigned long)&((union swap_header *)0)->x) 146 #define __swapoffset(x) ((unsigned long)&((union swap_header *)0)->x)
147 #define MAX_SWAP_BADPAGES \ 147 #define MAX_SWAP_BADPAGES \
148 ((__swapoffset(magic.magic) - __swapoffset(info.badpages)) / sizeof(int)) 148 ((__swapoffset(magic.magic) - __swapoffset(info.badpages)) / sizeof(int))
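For the common 4K PAGE_SIZE case, the macro above works out to the following (offsets taken from union swap_header earlier in this file; the figures scale with PAGE_SIZE):

/*
 *	__swapoffset(magic.magic)   = PAGE_SIZE - 10            = 4086
 *	__swapoffset(info.badpages) = 1024 + 3*4 + 2*16 + 117*4 = 1536
 *
 *	MAX_SWAP_BADPAGES = (4086 - 1536) / sizeof(int)
 *			  = 2550 / 4
 *			  = 637 bad pages recordable in the swap header
 */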
149 149
150 enum { 150 enum {
151 SWP_USED = (1 << 0), /* is slot in swap_info[] used? */ 151 SWP_USED = (1 << 0), /* is slot in swap_info[] used? */
152 SWP_WRITEOK = (1 << 1), /* ok to write to this swap? */ 152 SWP_WRITEOK = (1 << 1), /* ok to write to this swap? */
153 SWP_DISCARDABLE = (1 << 2), /* blkdev support discard */ 153 SWP_DISCARDABLE = (1 << 2), /* blkdev support discard */
154 SWP_DISCARDING = (1 << 3), /* now discarding a free cluster */ 154 SWP_DISCARDING = (1 << 3), /* now discarding a free cluster */
155 SWP_SOLIDSTATE = (1 << 4), /* blkdev seeks are cheap */ 155 SWP_SOLIDSTATE = (1 << 4), /* blkdev seeks are cheap */
156 SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */ 156 SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */
157 SWP_BLKDEV = (1 << 6), /* its a block device */ 157 SWP_BLKDEV = (1 << 6), /* its a block device */
158 SWP_FILE = (1 << 7), /* set after swap_activate success */ 158 SWP_FILE = (1 << 7), /* set after swap_activate success */
159 SWP_AREA_DISCARD = (1 << 8), /* single-time swap area discards */ 159 SWP_AREA_DISCARD = (1 << 8), /* single-time swap area discards */
160 SWP_PAGE_DISCARD = (1 << 9), /* freed swap page-cluster discards */ 160 SWP_PAGE_DISCARD = (1 << 9), /* freed swap page-cluster discards */
161 /* add others here before... */ 161 /* add others here before... */
162 SWP_SCANNING = (1 << 10), /* refcount in scan_swap_map */ 162 SWP_SCANNING = (1 << 10), /* refcount in scan_swap_map */
163 }; 163 };
164 164
165 #define SWAP_CLUSTER_MAX 32UL 165 #define SWAP_CLUSTER_MAX 32UL
166 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX 166 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
167 167
168 /* 168 /*
169 * Ratio between the present memory in the zone and the "gap" that 169 * Ratio between the present memory in the zone and the "gap" that
170 * we're allowing kswapd to shrink in addition to the per-zone high 170 * we're allowing kswapd to shrink in addition to the per-zone high
171 * wmark, even for zones that already have the high wmark satisfied, 171 * wmark, even for zones that already have the high wmark satisfied,
172 * in order to provide better per-zone lru behavior. We are ok to 172 * in order to provide better per-zone lru behavior. We are ok to
173 * spend not more than 1% of the memory for this zone balancing "gap". 173 * spend not more than 1% of the memory for this zone balancing "gap".
174 */ 174 */
175 #define KSWAPD_ZONE_BALANCE_GAP_RATIO 100 175 #define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
176 176
177 #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */ 177 #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */
178 #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */ 178 #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */
179 #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ 179 #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */
180 #define SWAP_CONT_MAX 0x7f /* Max count, in each swap_map continuation */ 180 #define SWAP_CONT_MAX 0x7f /* Max count, in each swap_map continuation */
181 #define COUNT_CONTINUED 0x80 /* See swap_map continuation for full count */ 181 #define COUNT_CONTINUED 0x80 /* See swap_map continuation for full count */
182 #define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs, in first swap_map */ 182 #define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs, in first swap_map */
183 183
184 /* 184 /*
185 * We use this to track usage of a cluster. A cluster is a block of swap disk 185 * We use this to track usage of a cluster. A cluster is a block of swap disk
186 * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All 186 * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
187 * free clusters are organized into a list. We fetch an entry from the list to 187 * free clusters are organized into a list. We fetch an entry from the list to
188 * get a free cluster. 188 * get a free cluster.
189 * 189 *
190 * The data field stores next cluster if the cluster is free or cluster usage 190 * The data field stores next cluster if the cluster is free or cluster usage
191 * counter otherwise. The flags field determines if a cluster is free. This is 191 * counter otherwise. The flags field determines if a cluster is free. This is
192 * protected by swap_info_struct.lock. 192 * protected by swap_info_struct.lock.
193 */ 193 */
194 struct swap_cluster_info { 194 struct swap_cluster_info {
195 unsigned int data:24; 195 unsigned int data:24;
196 unsigned int flags:8; 196 unsigned int flags:8;
197 }; 197 };
198 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ 198 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
199 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ 199 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
200 200
201 /* 201 /*
202 * We assign a cluster to each CPU, so each CPU can allocate swap entry from 202 * We assign a cluster to each CPU, so each CPU can allocate swap entry from
203 * its own cluster and swapout sequentially. The purpose is to optimize swapout 203 * its own cluster and swapout sequentially. The purpose is to optimize swapout
204 * throughput. 204 * throughput.
205 */ 205 */
206 struct percpu_cluster { 206 struct percpu_cluster {
207 struct swap_cluster_info index; /* Current cluster index */ 207 struct swap_cluster_info index; /* Current cluster index */
208 unsigned int next; /* Likely next allocation offset */ 208 unsigned int next; /* Likely next allocation offset */
209 }; 209 };
210 210
211 /* 211 /*
212 * The in-memory structure used to track swap areas. 212 * The in-memory structure used to track swap areas.
213 */ 213 */
214 struct swap_info_struct { 214 struct swap_info_struct {
215 unsigned long flags; /* SWP_USED etc: see above */ 215 unsigned long flags; /* SWP_USED etc: see above */
216 signed short prio; /* swap priority of this type */ 216 signed short prio; /* swap priority of this type */
217 struct plist_node list; /* entry in swap_active_head */ 217 struct plist_node list; /* entry in swap_active_head */
218 struct plist_node avail_list; /* entry in swap_avail_head */ 218 struct plist_node avail_list; /* entry in swap_avail_head */
219 signed char type; /* strange name for an index */ 219 signed char type; /* strange name for an index */
220 unsigned int max; /* extent of the swap_map */ 220 unsigned int max; /* extent of the swap_map */
221 unsigned char *swap_map; /* vmalloc'ed array of usage counts */ 221 unsigned char *swap_map; /* vmalloc'ed array of usage counts */
222 struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ 222 struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
223 struct swap_cluster_info free_cluster_head; /* free cluster list head */ 223 struct swap_cluster_info free_cluster_head; /* free cluster list head */
224 struct swap_cluster_info free_cluster_tail; /* free cluster list tail */ 224 struct swap_cluster_info free_cluster_tail; /* free cluster list tail */
225 unsigned int lowest_bit; /* index of first free in swap_map */ 225 unsigned int lowest_bit; /* index of first free in swap_map */
226 unsigned int highest_bit; /* index of last free in swap_map */ 226 unsigned int highest_bit; /* index of last free in swap_map */
227 unsigned int pages; /* total of usable pages of swap */ 227 unsigned int pages; /* total of usable pages of swap */
228 unsigned int inuse_pages; /* number of those currently in use */ 228 unsigned int inuse_pages; /* number of those currently in use */
229 unsigned int cluster_next; /* likely index for next allocation */ 229 unsigned int cluster_next; /* likely index for next allocation */
230 unsigned int cluster_nr; /* countdown to next cluster search */ 230 unsigned int cluster_nr; /* countdown to next cluster search */
231 struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */ 231 struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
232 struct swap_extent *curr_swap_extent; 232 struct swap_extent *curr_swap_extent;
233 struct swap_extent first_swap_extent; 233 struct swap_extent first_swap_extent;
234 struct block_device *bdev; /* swap device or bdev of swap file */ 234 struct block_device *bdev; /* swap device or bdev of swap file */
235 struct file *swap_file; /* seldom referenced */ 235 struct file *swap_file; /* seldom referenced */
236 unsigned int old_block_size; /* seldom referenced */ 236 unsigned int old_block_size; /* seldom referenced */
237 #ifdef CONFIG_FRONTSWAP 237 #ifdef CONFIG_FRONTSWAP
238 unsigned long *frontswap_map; /* frontswap in-use, one bit per page */ 238 unsigned long *frontswap_map; /* frontswap in-use, one bit per page */
239 atomic_t frontswap_pages; /* frontswap pages in-use counter */ 239 atomic_t frontswap_pages; /* frontswap pages in-use counter */
240 #endif 240 #endif
241 spinlock_t lock; /* 241 spinlock_t lock; /*
242 * protect map scan related fields like 242 * protect map scan related fields like
243 * swap_map, lowest_bit, highest_bit, 243 * swap_map, lowest_bit, highest_bit,
244 * inuse_pages, cluster_next, 244 * inuse_pages, cluster_next,
245 * cluster_nr, lowest_alloc, 245 * cluster_nr, lowest_alloc,
246 * highest_alloc, free/discard cluster 246 * highest_alloc, free/discard cluster
247 * list. other fields are only changed 247 * list. other fields are only changed
248 * at swapon/swapoff, so are protected 248 * at swapon/swapoff, so are protected
249 * by swap_lock. changing flags need 249 * by swap_lock. changing flags need
250 * hold this lock and swap_lock. If 250 * hold this lock and swap_lock. If
251 * both locks need hold, hold swap_lock 251 * both locks need hold, hold swap_lock
252 * first. 252 * first.
253 */ 253 */
254 struct work_struct discard_work; /* discard worker */ 254 struct work_struct discard_work; /* discard worker */
255 struct swap_cluster_info discard_cluster_head; /* list head of discard clusters */ 255 struct swap_cluster_info discard_cluster_head; /* list head of discard clusters */
256 struct swap_cluster_info discard_cluster_tail; /* list tail of discard clusters */ 256 struct swap_cluster_info discard_cluster_tail; /* list tail of discard clusters */
257 }; 257 };
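A minimal sketch of the lock ordering documented in the struct above (swap_lock is the global lock in mm/swapfile.c; the flag update shown is purely illustrative):

	/* Given struct swap_info_struct *p: */
	spin_lock(&swap_lock);		/* global swap_lock first ... */
	spin_lock(&p->lock);		/* ... then the per-area lock */
	p->flags |= SWP_WRITEOK;	/* changing flags needs both locks held */
	spin_unlock(&p->lock);
	spin_unlock(&swap_lock);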
258 258
259 /* linux/mm/page_alloc.c */ 259 /* linux/mm/page_alloc.c */
260 extern unsigned long totalram_pages; 260 extern unsigned long totalram_pages;
261 extern unsigned long totalreserve_pages; 261 extern unsigned long totalreserve_pages;
262 extern unsigned long dirty_balance_reserve; 262 extern unsigned long dirty_balance_reserve;
263 extern unsigned long nr_free_buffer_pages(void); 263 extern unsigned long nr_free_buffer_pages(void);
264 extern unsigned long nr_free_pagecache_pages(void); 264 extern unsigned long nr_free_pagecache_pages(void);
265 265
266 /* Definition of global_page_state not available yet */ 266 /* Definition of global_page_state not available yet */
267 #define nr_free_pages() global_page_state(NR_FREE_PAGES) 267 #define nr_free_pages() global_page_state(NR_FREE_PAGES)
268 268
269 269
270 /* linux/mm/swap.c */ 270 /* linux/mm/swap.c */
271 extern void lru_cache_add(struct page *); 271 extern void lru_cache_add(struct page *);
272 extern void lru_cache_add_anon(struct page *page); 272 extern void lru_cache_add_anon(struct page *page);
273 extern void lru_cache_add_file(struct page *page); 273 extern void lru_cache_add_file(struct page *page);
274 extern void lru_add_page_tail(struct page *page, struct page *page_tail, 274 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
275 struct lruvec *lruvec, struct list_head *head); 275 struct lruvec *lruvec, struct list_head *head);
276 extern void activate_page(struct page *); 276 extern void activate_page(struct page *);
277 extern void mark_page_accessed(struct page *); 277 extern void mark_page_accessed(struct page *);
278 extern void init_page_accessed(struct page *page);
278 extern void lru_add_drain(void); 279 extern void lru_add_drain(void);
279 extern void lru_add_drain_cpu(int cpu); 280 extern void lru_add_drain_cpu(int cpu);
280 extern void lru_add_drain_all(void); 281 extern void lru_add_drain_all(void);
281 extern void rotate_reclaimable_page(struct page *page); 282 extern void rotate_reclaimable_page(struct page *page);
282 extern void deactivate_page(struct page *page); 283 extern void deactivate_page(struct page *page);
283 extern void swap_setup(void); 284 extern void swap_setup(void);
284 285
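The init_page_accessed() declaration added a few lines above is meant to be called on a freshly allocated page before it becomes visible in the page cache, so its referenced state can be initialised without atomic bit operations; once the page is published, mark_page_accessed() must be used instead. A hedged sketch of such an allocation path (the wrapper function is illustrative, not code from this patch):

static struct page *example_alloc_cache_page(struct address_space *mapping,
					     pgoff_t index, gfp_t gfp)
{
	struct page *page = __page_cache_alloc(gfp);

	if (!page)
		return NULL;

	init_page_accessed(page);	/* non-atomic: page not yet visible */

	if (add_to_page_cache_lru(page, mapping, index, gfp)) {
		page_cache_release(page);
		return NULL;
	}
	return page;	/* locked, in the cache, already marked accessed */
}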
285 extern void add_page_to_unevictable_list(struct page *page); 286 extern void add_page_to_unevictable_list(struct page *page);
286 287
287 /* linux/mm/vmscan.c */ 288 /* linux/mm/vmscan.c */
288 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, 289 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
289 gfp_t gfp_mask, nodemask_t *mask); 290 gfp_t gfp_mask, nodemask_t *mask);
290 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); 291 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
291 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem, 292 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
292 gfp_t gfp_mask, bool noswap); 293 gfp_t gfp_mask, bool noswap);
293 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, 294 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
294 gfp_t gfp_mask, bool noswap, 295 gfp_t gfp_mask, bool noswap,
295 struct zone *zone, 296 struct zone *zone,
296 unsigned long *nr_scanned); 297 unsigned long *nr_scanned);
297 extern unsigned long shrink_all_memory(unsigned long nr_pages); 298 extern unsigned long shrink_all_memory(unsigned long nr_pages);
298 extern int vm_swappiness; 299 extern int vm_swappiness;
299 extern int remove_mapping(struct address_space *mapping, struct page *page); 300 extern int remove_mapping(struct address_space *mapping, struct page *page);
300 extern unsigned long vm_total_pages; 301 extern unsigned long vm_total_pages;
301 302
302 #ifdef CONFIG_NUMA 303 #ifdef CONFIG_NUMA
303 extern int zone_reclaim_mode; 304 extern int zone_reclaim_mode;
304 extern int sysctl_min_unmapped_ratio; 305 extern int sysctl_min_unmapped_ratio;
305 extern int sysctl_min_slab_ratio; 306 extern int sysctl_min_slab_ratio;
306 extern int zone_reclaim(struct zone *, gfp_t, unsigned int); 307 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
307 #else 308 #else
308 #define zone_reclaim_mode 0 309 #define zone_reclaim_mode 0
309 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) 310 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
310 { 311 {
311 return 0; 312 return 0;
312 } 313 }
313 #endif 314 #endif
314 315
315 extern int page_evictable(struct page *page); 316 extern int page_evictable(struct page *page);
316 extern void check_move_unevictable_pages(struct page **, int nr_pages); 317 extern void check_move_unevictable_pages(struct page **, int nr_pages);
317 318
318 extern unsigned long scan_unevictable_pages; 319 extern unsigned long scan_unevictable_pages;
319 extern int scan_unevictable_handler(struct ctl_table *, int, 320 extern int scan_unevictable_handler(struct ctl_table *, int,
320 void __user *, size_t *, loff_t *); 321 void __user *, size_t *, loff_t *);
321 #ifdef CONFIG_NUMA 322 #ifdef CONFIG_NUMA
322 extern int scan_unevictable_register_node(struct node *node); 323 extern int scan_unevictable_register_node(struct node *node);
323 extern void scan_unevictable_unregister_node(struct node *node); 324 extern void scan_unevictable_unregister_node(struct node *node);
324 #else 325 #else
325 static inline int scan_unevictable_register_node(struct node *node) 326 static inline int scan_unevictable_register_node(struct node *node)
326 { 327 {
327 return 0; 328 return 0;
328 } 329 }
329 static inline void scan_unevictable_unregister_node(struct node *node) 330 static inline void scan_unevictable_unregister_node(struct node *node)
330 { 331 {
331 } 332 }
332 #endif 333 #endif
333 334
334 extern int kswapd_run(int nid); 335 extern int kswapd_run(int nid);
335 extern void kswapd_stop(int nid); 336 extern void kswapd_stop(int nid);
336 #ifdef CONFIG_MEMCG 337 #ifdef CONFIG_MEMCG
337 extern int mem_cgroup_swappiness(struct mem_cgroup *mem); 338 extern int mem_cgroup_swappiness(struct mem_cgroup *mem);
338 #else 339 #else
339 static inline int mem_cgroup_swappiness(struct mem_cgroup *mem) 340 static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
340 { 341 {
341 return vm_swappiness; 342 return vm_swappiness;
342 } 343 }
343 #endif 344 #endif
344 #ifdef CONFIG_MEMCG_SWAP 345 #ifdef CONFIG_MEMCG_SWAP
345 extern void mem_cgroup_uncharge_swap(swp_entry_t ent); 346 extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
346 #else 347 #else
347 static inline void mem_cgroup_uncharge_swap(swp_entry_t ent) 348 static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
348 { 349 {
349 } 350 }
350 #endif 351 #endif
351 #ifdef CONFIG_SWAP 352 #ifdef CONFIG_SWAP
352 /* linux/mm/page_io.c */ 353 /* linux/mm/page_io.c */
353 extern int swap_readpage(struct page *); 354 extern int swap_readpage(struct page *);
354 extern int swap_writepage(struct page *page, struct writeback_control *wbc); 355 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
355 extern void end_swap_bio_write(struct bio *bio, int err); 356 extern void end_swap_bio_write(struct bio *bio, int err);
356 extern int __swap_writepage(struct page *page, struct writeback_control *wbc, 357 extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
357 void (*end_write_func)(struct bio *, int)); 358 void (*end_write_func)(struct bio *, int));
358 extern int swap_set_page_dirty(struct page *page); 359 extern int swap_set_page_dirty(struct page *page);
359 extern void end_swap_bio_read(struct bio *bio, int err); 360 extern void end_swap_bio_read(struct bio *bio, int err);
360 361
361 int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, 362 int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
362 unsigned long nr_pages, sector_t start_block); 363 unsigned long nr_pages, sector_t start_block);
363 int generic_swapfile_activate(struct swap_info_struct *, struct file *, 364 int generic_swapfile_activate(struct swap_info_struct *, struct file *,
364 sector_t *); 365 sector_t *);
365 366
366 /* linux/mm/swap_state.c */ 367 /* linux/mm/swap_state.c */
367 extern struct address_space swapper_spaces[]; 368 extern struct address_space swapper_spaces[];
368 #define swap_address_space(entry) (&swapper_spaces[swp_type(entry)]) 369 #define swap_address_space(entry) (&swapper_spaces[swp_type(entry)])
369 extern unsigned long total_swapcache_pages(void); 370 extern unsigned long total_swapcache_pages(void);
370 extern void show_swap_cache_info(void); 371 extern void show_swap_cache_info(void);
371 extern int add_to_swap(struct page *, struct list_head *list); 372 extern int add_to_swap(struct page *, struct list_head *list);
372 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); 373 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
373 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry); 374 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
374 extern void __delete_from_swap_cache(struct page *); 375 extern void __delete_from_swap_cache(struct page *);
375 extern void delete_from_swap_cache(struct page *); 376 extern void delete_from_swap_cache(struct page *);
376 extern void free_page_and_swap_cache(struct page *); 377 extern void free_page_and_swap_cache(struct page *);
377 extern void free_pages_and_swap_cache(struct page **, int); 378 extern void free_pages_and_swap_cache(struct page **, int);
378 extern struct page *lookup_swap_cache(swp_entry_t); 379 extern struct page *lookup_swap_cache(swp_entry_t);
379 extern struct page *read_swap_cache_async(swp_entry_t, gfp_t, 380 extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
380 struct vm_area_struct *vma, unsigned long addr); 381 struct vm_area_struct *vma, unsigned long addr);
381 extern struct page *swapin_readahead(swp_entry_t, gfp_t, 382 extern struct page *swapin_readahead(swp_entry_t, gfp_t,
382 struct vm_area_struct *vma, unsigned long addr); 383 struct vm_area_struct *vma, unsigned long addr);
383 384
384 /* linux/mm/swapfile.c */ 385 /* linux/mm/swapfile.c */
385 extern atomic_long_t nr_swap_pages; 386 extern atomic_long_t nr_swap_pages;
386 extern long total_swap_pages; 387 extern long total_swap_pages;
387 388
388 /* Swap 50% full? Release swapcache more aggressively.. */ 389 /* Swap 50% full? Release swapcache more aggressively.. */
389 static inline bool vm_swap_full(void) 390 static inline bool vm_swap_full(void)
390 { 391 {
391 return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages; 392 return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
392 } 393 }
393 394
394 static inline long get_nr_swap_pages(void) 395 static inline long get_nr_swap_pages(void)
395 { 396 {
396 return atomic_long_read(&nr_swap_pages); 397 return atomic_long_read(&nr_swap_pages);
397 } 398 }
398 399
399 extern void si_swapinfo(struct sysinfo *); 400 extern void si_swapinfo(struct sysinfo *);
400 extern swp_entry_t get_swap_page(void); 401 extern swp_entry_t get_swap_page(void);
401 extern swp_entry_t get_swap_page_of_type(int); 402 extern swp_entry_t get_swap_page_of_type(int);
402 extern int add_swap_count_continuation(swp_entry_t, gfp_t); 403 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
403 extern void swap_shmem_alloc(swp_entry_t); 404 extern void swap_shmem_alloc(swp_entry_t);
404 extern int swap_duplicate(swp_entry_t); 405 extern int swap_duplicate(swp_entry_t);
405 extern int swapcache_prepare(swp_entry_t); 406 extern int swapcache_prepare(swp_entry_t);
406 extern void swap_free(swp_entry_t); 407 extern void swap_free(swp_entry_t);
407 extern void swapcache_free(swp_entry_t, struct page *page); 408 extern void swapcache_free(swp_entry_t, struct page *page);
408 extern int free_swap_and_cache(swp_entry_t); 409 extern int free_swap_and_cache(swp_entry_t);
409 extern int swap_type_of(dev_t, sector_t, struct block_device **); 410 extern int swap_type_of(dev_t, sector_t, struct block_device **);
410 extern unsigned int count_swap_pages(int, int); 411 extern unsigned int count_swap_pages(int, int);
411 extern sector_t map_swap_page(struct page *, struct block_device **); 412 extern sector_t map_swap_page(struct page *, struct block_device **);
412 extern sector_t swapdev_block(int, pgoff_t); 413 extern sector_t swapdev_block(int, pgoff_t);
413 extern int page_swapcount(struct page *); 414 extern int page_swapcount(struct page *);
414 extern struct swap_info_struct *page_swap_info(struct page *); 415 extern struct swap_info_struct *page_swap_info(struct page *);
415 extern int reuse_swap_page(struct page *); 416 extern int reuse_swap_page(struct page *);
416 extern int try_to_free_swap(struct page *); 417 extern int try_to_free_swap(struct page *);
417 struct backing_dev_info; 418 struct backing_dev_info;
418 419
419 #ifdef CONFIG_MEMCG 420 #ifdef CONFIG_MEMCG
420 extern void 421 extern void
421 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout); 422 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout);
422 #else 423 #else
423 static inline void 424 static inline void
424 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout) 425 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
425 { 426 {
426 } 427 }
427 #endif 428 #endif
428 429
429 #else /* CONFIG_SWAP */ 430 #else /* CONFIG_SWAP */
430 431
431 #define swap_address_space(entry) (NULL) 432 #define swap_address_space(entry) (NULL)
432 #define get_nr_swap_pages() 0L 433 #define get_nr_swap_pages() 0L
433 #define total_swap_pages 0L 434 #define total_swap_pages 0L
434 #define total_swapcache_pages() 0UL 435 #define total_swapcache_pages() 0UL
435 #define vm_swap_full() 0 436 #define vm_swap_full() 0
436 437
437 #define si_swapinfo(val) \ 438 #define si_swapinfo(val) \
438 do { (val)->freeswap = (val)->totalswap = 0; } while (0) 439 do { (val)->freeswap = (val)->totalswap = 0; } while (0)
439 /* only sparc can not include linux/pagemap.h in this file 440 /* only sparc can not include linux/pagemap.h in this file
440 * so leave page_cache_release and release_pages undeclared... */ 441 * so leave page_cache_release and release_pages undeclared... */
441 #define free_page_and_swap_cache(page) \ 442 #define free_page_and_swap_cache(page) \
442 page_cache_release(page) 443 page_cache_release(page)
443 #define free_pages_and_swap_cache(pages, nr) \ 444 #define free_pages_and_swap_cache(pages, nr) \
444 release_pages((pages), (nr), false); 445 release_pages((pages), (nr), false);
445 446
446 static inline void show_swap_cache_info(void) 447 static inline void show_swap_cache_info(void)
447 { 448 {
448 } 449 }
449 450
450 #define free_swap_and_cache(swp) is_migration_entry(swp) 451 #define free_swap_and_cache(swp) is_migration_entry(swp)
451 #define swapcache_prepare(swp) is_migration_entry(swp) 452 #define swapcache_prepare(swp) is_migration_entry(swp)
452 453
453 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) 454 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
454 { 455 {
455 return 0; 456 return 0;
456 } 457 }
457 458
458 static inline void swap_shmem_alloc(swp_entry_t swp) 459 static inline void swap_shmem_alloc(swp_entry_t swp)
459 { 460 {
460 } 461 }
461 462
462 static inline int swap_duplicate(swp_entry_t swp) 463 static inline int swap_duplicate(swp_entry_t swp)
463 { 464 {
464 return 0; 465 return 0;
465 } 466 }
466 467
467 static inline void swap_free(swp_entry_t swp) 468 static inline void swap_free(swp_entry_t swp)
468 { 469 {
469 } 470 }
470 471
471 static inline void swapcache_free(swp_entry_t swp, struct page *page) 472 static inline void swapcache_free(swp_entry_t swp, struct page *page)
472 { 473 {
473 } 474 }
474 475
475 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask, 476 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
476 struct vm_area_struct *vma, unsigned long addr) 477 struct vm_area_struct *vma, unsigned long addr)
477 { 478 {
478 return NULL; 479 return NULL;
479 } 480 }
480 481
481 static inline int swap_writepage(struct page *p, struct writeback_control *wbc) 482 static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
482 { 483 {
483 return 0; 484 return 0;
484 } 485 }
485 486
486 static inline struct page *lookup_swap_cache(swp_entry_t swp) 487 static inline struct page *lookup_swap_cache(swp_entry_t swp)
487 { 488 {
488 return NULL; 489 return NULL;
489 } 490 }
490 491
491 static inline int add_to_swap(struct page *page, struct list_head *list) 492 static inline int add_to_swap(struct page *page, struct list_head *list)
492 { 493 {
493 return 0; 494 return 0;
494 } 495 }
495 496
496 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry, 497 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
497 gfp_t gfp_mask) 498 gfp_t gfp_mask)
498 { 499 {
499 return -1; 500 return -1;
500 } 501 }
501 502
502 static inline void __delete_from_swap_cache(struct page *page) 503 static inline void __delete_from_swap_cache(struct page *page)
503 { 504 {
504 } 505 }
505 506
506 static inline void delete_from_swap_cache(struct page *page) 507 static inline void delete_from_swap_cache(struct page *page)
507 { 508 {
508 } 509 }
509 510
510 static inline int page_swapcount(struct page *page) 511 static inline int page_swapcount(struct page *page)
511 { 512 {
512 return 0; 513 return 0;
513 } 514 }
514 515
515 #define reuse_swap_page(page) (page_mapcount(page) == 1) 516 #define reuse_swap_page(page) (page_mapcount(page) == 1)
516 517
517 static inline int try_to_free_swap(struct page *page) 518 static inline int try_to_free_swap(struct page *page)
518 { 519 {
519 return 0; 520 return 0;
520 } 521 }
521 522
522 static inline swp_entry_t get_swap_page(void) 523 static inline swp_entry_t get_swap_page(void)
523 { 524 {
524 swp_entry_t entry; 525 swp_entry_t entry;
525 entry.val = 0; 526 entry.val = 0;
526 return entry; 527 return entry;
527 } 528 }
528 529
529 static inline void 530 static inline void
530 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent) 531 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
531 { 532 {
532 } 533 }
533 534
534 #endif /* CONFIG_SWAP */ 535 #endif /* CONFIG_SWAP */
535 #endif /* __KERNEL__*/ 536 #endif /* __KERNEL__*/
536 #endif /* _LINUX_SWAP_H */ 537 #endif /* _LINUX_SWAP_H */
537 538
mm/filemap.c
1 /* 1 /*
2 * linux/mm/filemap.c 2 * linux/mm/filemap.c
3 * 3 *
4 * Copyright (C) 1994-1999 Linus Torvalds 4 * Copyright (C) 1994-1999 Linus Torvalds
5 */ 5 */
6 6
7 /* 7 /*
8 * This file handles the generic file mmap semantics used by 8 * This file handles the generic file mmap semantics used by
9 * most "normal" filesystems (but you don't /have/ to use this: 9 * most "normal" filesystems (but you don't /have/ to use this:
10 * the NFS filesystem used to do this differently, for example) 10 * the NFS filesystem used to do this differently, for example)
11 */ 11 */
12 #include <linux/export.h> 12 #include <linux/export.h>
13 #include <linux/compiler.h> 13 #include <linux/compiler.h>
14 #include <linux/fs.h> 14 #include <linux/fs.h>
15 #include <linux/uaccess.h> 15 #include <linux/uaccess.h>
16 #include <linux/aio.h> 16 #include <linux/aio.h>
17 #include <linux/capability.h> 17 #include <linux/capability.h>
18 #include <linux/kernel_stat.h> 18 #include <linux/kernel_stat.h>
19 #include <linux/gfp.h> 19 #include <linux/gfp.h>
20 #include <linux/mm.h> 20 #include <linux/mm.h>
21 #include <linux/swap.h> 21 #include <linux/swap.h>
22 #include <linux/mman.h> 22 #include <linux/mman.h>
23 #include <linux/pagemap.h> 23 #include <linux/pagemap.h>
24 #include <linux/file.h> 24 #include <linux/file.h>
25 #include <linux/uio.h> 25 #include <linux/uio.h>
26 #include <linux/hash.h> 26 #include <linux/hash.h>
27 #include <linux/writeback.h> 27 #include <linux/writeback.h>
28 #include <linux/backing-dev.h> 28 #include <linux/backing-dev.h>
29 #include <linux/pagevec.h> 29 #include <linux/pagevec.h>
30 #include <linux/blkdev.h> 30 #include <linux/blkdev.h>
31 #include <linux/security.h> 31 #include <linux/security.h>
32 #include <linux/cpuset.h> 32 #include <linux/cpuset.h>
33 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */ 33 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
34 #include <linux/memcontrol.h> 34 #include <linux/memcontrol.h>
35 #include <linux/cleancache.h> 35 #include <linux/cleancache.h>
36 #include "internal.h" 36 #include "internal.h"
37 37
38 #define CREATE_TRACE_POINTS 38 #define CREATE_TRACE_POINTS
39 #include <trace/events/filemap.h> 39 #include <trace/events/filemap.h>
40 40
41 /* 41 /*
42 * FIXME: remove all knowledge of the buffer layer from the core VM 42 * FIXME: remove all knowledge of the buffer layer from the core VM
43 */ 43 */
44 #include <linux/buffer_head.h> /* for try_to_free_buffers */ 44 #include <linux/buffer_head.h> /* for try_to_free_buffers */
45 45
46 #include <asm/mman.h> 46 #include <asm/mman.h>
47 47
48 /* 48 /*
49 * Shared mappings implemented 30.11.1994. It's not fully working yet, 49 * Shared mappings implemented 30.11.1994. It's not fully working yet,
50 * though. 50 * though.
51 * 51 *
52 * Shared mappings now work. 15.8.1995 Bruno. 52 * Shared mappings now work. 15.8.1995 Bruno.
53 * 53 *
54 * finished 'unifying' the page and buffer cache and SMP-threaded the 54 * finished 'unifying' the page and buffer cache and SMP-threaded the
55 * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com> 55 * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com>
56 * 56 *
57 * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de> 57 * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de>
58 */ 58 */
59 59
60 /* 60 /*
61 * Lock ordering: 61 * Lock ordering:
62 * 62 *
63 * ->i_mmap_mutex (truncate_pagecache) 63 * ->i_mmap_mutex (truncate_pagecache)
64 * ->private_lock (__free_pte->__set_page_dirty_buffers) 64 * ->private_lock (__free_pte->__set_page_dirty_buffers)
65 * ->swap_lock (exclusive_swap_page, others) 65 * ->swap_lock (exclusive_swap_page, others)
 *  ->mapping->tree_lock
 *
 *  ->i_mutex
 *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
 *
 *  ->mmap_sem
 *    ->i_mmap_mutex
 *      ->page_table_lock or pte_lock	(various, mainly in memory.c)
 *        ->mapping->tree_lock	(arch-dependent flush_dcache_mmap_lock)
 *
 *  ->mmap_sem
 *    ->lock_page		(access_process_vm)
 *
 *  ->i_mutex			(generic_file_buffered_write)
 *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
 *
 *  bdi->wb.list_lock
 *    sb_lock			(fs/fs-writeback.c)
 *    ->mapping->tree_lock	(__sync_single_inode)
 *
 *  ->i_mmap_mutex
 *    ->anon_vma.lock		(vma_adjust)
 *
 *  ->anon_vma.lock
 *    ->page_table_lock or pte_lock	(anon_vma_prepare and various)
 *
 *  ->page_table_lock or pte_lock
 *    ->swap_lock		(try_to_unmap_one)
 *    ->private_lock		(try_to_unmap_one)
 *    ->tree_lock		(try_to_unmap_one)
 *    ->zone.lru_lock		(follow_page->mark_page_accessed)
 *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
 *    ->private_lock		(page_remove_rmap->set_page_dirty)
 *    ->tree_lock		(page_remove_rmap->set_page_dirty)
 *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
 *    ->inode->i_lock		(page_remove_rmap->set_page_dirty)
 *    bdi.wb->list_lock		(zap_pte_range->set_page_dirty)
 *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
 *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
 *
 *  ->i_mmap_mutex
 *    ->tasklist_lock		(memory_failure, collect_procs_ao)
 */

/*
 * Delete a page from the page cache and free it. Caller has to make
 * sure the page is locked and that nobody else uses it - or that usage
 * is safe. The caller must hold the mapping's tree_lock.
 */
void __delete_from_page_cache(struct page *page)
{
	struct address_space *mapping = page->mapping;

	trace_mm_filemap_delete_from_page_cache(page);
	/*
	 * if we're uptodate, flush out into the cleancache, otherwise
	 * invalidate any existing cleancache entries. We can't leave
	 * stale data around in the cleancache once our page is gone
	 */
	if (PageUptodate(page) && PageMappedToDisk(page))
		cleancache_put_page(page);
	else
		cleancache_invalidate_page(mapping, page);

	radix_tree_delete(&mapping->page_tree, page->index);
	page->mapping = NULL;
	/* Leave page->index set: truncation lookup relies upon it */
	mapping->nrpages--;
	__dec_zone_page_state(page, NR_FILE_PAGES);
	if (PageSwapBacked(page))
		__dec_zone_page_state(page, NR_SHMEM);
	BUG_ON(page_mapped(page));

	/*
	 * Some filesystems seem to re-dirty the page even after
	 * the VM has canceled the dirty bit (eg ext3 journaling).
	 *
	 * Fix it up by doing a final dirty accounting check after
	 * having removed the page entirely.
	 */
	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
		dec_zone_page_state(page, NR_FILE_DIRTY);
		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
	}
}

/**
 * delete_from_page_cache - delete page from page cache
 * @page: the page which the kernel is trying to remove from page cache
 *
 * This must be called only on pages that have been verified to be in the page
 * cache and locked. It will never put the page into the free list, the caller
 * has a reference on the page.
 */
void delete_from_page_cache(struct page *page)
{
	struct address_space *mapping = page->mapping;
	void (*freepage)(struct page *);

	BUG_ON(!PageLocked(page));

	freepage = mapping->a_ops->freepage;
	spin_lock_irq(&mapping->tree_lock);
	__delete_from_page_cache(page);
	spin_unlock_irq(&mapping->tree_lock);
	mem_cgroup_uncharge_cache_page(page);

	if (freepage)
		freepage(page);
	page_cache_release(page);
}
EXPORT_SYMBOL(delete_from_page_cache);

static int sleep_on_page(void *word)
{
	io_schedule();
	return 0;
}

static int sleep_on_page_killable(void *word)
{
	sleep_on_page(word);
	return fatal_signal_pending(current) ? -EINTR : 0;
}

static int filemap_check_errors(struct address_space *mapping)
{
	int ret = 0;
	/* Check for outstanding write errors */
	if (test_bit(AS_ENOSPC, &mapping->flags) &&
	    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
		ret = -ENOSPC;
	if (test_bit(AS_EIO, &mapping->flags) &&
	    test_and_clear_bit(AS_EIO, &mapping->flags))
		ret = -EIO;
	return ret;
}

/**
 * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
 * @mapping: address space structure to write
 * @start: offset in bytes where the range starts
 * @end: offset in bytes where the range ends (inclusive)
 * @sync_mode: enable synchronous operation
 *
 * Start writeback against all of a mapping's dirty pages that lie
 * within the byte offsets <start, end> inclusive.
 *
 * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
 * opposed to a regular memory cleansing writeback. The difference between
 * these two operations is that if a dirty page/buffer is encountered, it must
 * be waited upon, and not just skipped over.
 */
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
				loff_t end, int sync_mode)
{
	int ret;
	struct writeback_control wbc = {
		.sync_mode = sync_mode,
		.nr_to_write = LONG_MAX,
		.range_start = start,
		.range_end = end,
	};

	if (!mapping_cap_writeback_dirty(mapping))
		return 0;

	ret = do_writepages(mapping, &wbc);
	return ret;
}

static inline int __filemap_fdatawrite(struct address_space *mapping,
	int sync_mode)
{
	return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
}

int filemap_fdatawrite(struct address_space *mapping)
{
	return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
}
EXPORT_SYMBOL(filemap_fdatawrite);

int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
				loff_t end)
{
	return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
}
EXPORT_SYMBOL(filemap_fdatawrite_range);

/**
 * filemap_flush - mostly a non-blocking flush
 * @mapping: target address_space
 *
 * This is a mostly non-blocking flush. Not suitable for data-integrity
 * purposes - I/O may not be started against all dirty pages.
 */
int filemap_flush(struct address_space *mapping)
{
	return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
}
EXPORT_SYMBOL(filemap_flush);

/**
 * filemap_fdatawait_range - wait for writeback to complete
 * @mapping: address space structure to wait for
 * @start_byte: offset in bytes where the range starts
 * @end_byte: offset in bytes where the range ends (inclusive)
 *
 * Walk the list of under-writeback pages of the given address space
 * in the given range and wait for all of them.
 */
int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
			    loff_t end_byte)
{
	pgoff_t index = start_byte >> PAGE_CACHE_SHIFT;
	pgoff_t end = end_byte >> PAGE_CACHE_SHIFT;
	struct pagevec pvec;
	int nr_pages;
	int ret2, ret = 0;

	if (end_byte < start_byte)
		goto out;

	pagevec_init(&pvec, 0);
	while ((index <= end) &&
			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
			PAGECACHE_TAG_WRITEBACK,
			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
		unsigned i;

		for (i = 0; i < nr_pages; i++) {
			struct page *page = pvec.pages[i];

			/* until radix tree lookup accepts end_index */
			if (page->index > end)
				continue;

			wait_on_page_writeback(page);
			if (TestClearPageError(page))
				ret = -EIO;
		}
		pagevec_release(&pvec);
		cond_resched();
	}
out:
	ret2 = filemap_check_errors(mapping);
	if (!ret)
		ret = ret2;

	return ret;
}
EXPORT_SYMBOL(filemap_fdatawait_range);

/**
 * filemap_fdatawait - wait for all under-writeback pages to complete
 * @mapping: address space structure to wait for
 *
 * Walk the list of under-writeback pages of the given address space
 * and wait for all of them.
 */
int filemap_fdatawait(struct address_space *mapping)
{
	loff_t i_size = i_size_read(mapping->host);

	if (i_size == 0)
		return 0;

	return filemap_fdatawait_range(mapping, 0, i_size - 1);
}
EXPORT_SYMBOL(filemap_fdatawait);

int filemap_write_and_wait(struct address_space *mapping)
{
	int err = 0;

	if (mapping->nrpages) {
		err = filemap_fdatawrite(mapping);
		/*
		 * Even if the above returned error, the pages may be
		 * written partially (e.g. -ENOSPC), so we wait for it.
		 * But the -EIO is special case, it may indicate the worst
		 * thing (e.g. bug) happened, so we avoid waiting for it.
		 */
		if (err != -EIO) {
			int err2 = filemap_fdatawait(mapping);
			if (!err)
				err = err2;
		}
	} else {
		err = filemap_check_errors(mapping);
	}
	return err;
}
EXPORT_SYMBOL(filemap_write_and_wait);

/**
 * filemap_write_and_wait_range - write out & wait on a file range
 * @mapping: the address_space for the pages
 * @lstart: offset in bytes where the range starts
 * @lend: offset in bytes where the range ends (inclusive)
 *
 * Write out and wait upon file offsets lstart->lend, inclusive.
 *
 * Note that `lend' is inclusive (describes the last byte to be written) so
 * that this function can be used to write to the very end-of-file (end = -1).
 */
int filemap_write_and_wait_range(struct address_space *mapping,
				 loff_t lstart, loff_t lend)
{
	int err = 0;

	if (mapping->nrpages) {
		err = __filemap_fdatawrite_range(mapping, lstart, lend,
						 WB_SYNC_ALL);
		/* See comment of filemap_write_and_wait() */
		if (err != -EIO) {
			int err2 = filemap_fdatawait_range(mapping,
						lstart, lend);
			if (!err)
				err = err2;
		}
	} else {
		err = filemap_check_errors(mapping);
	}
	return err;
}
EXPORT_SYMBOL(filemap_write_and_wait_range);

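[Editorial aside, not part of this patch: filemap_write_and_wait_range() is the usual data-integrity entry point for filesystem ->fsync() methods. A minimal sketch of such a caller is below; example_fsync() is a made-up name and a real implementation would also write back metadata.]

	/* Illustrative sketch only -- not kernel source from this commit. */
	static int example_fsync(struct file *file, loff_t start, loff_t end,
				 int datasync)
	{
		/* Flush dirty pages in [start, end] and wait for the I/O. */
		return filemap_write_and_wait_range(file->f_mapping, start, end);
	}
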
/**
 * replace_page_cache_page - replace a pagecache page with a new one
 * @old: page to be replaced
 * @new: page to replace with
 * @gfp_mask: allocation mode
 *
 * This function replaces a page in the pagecache with a new one. On
 * success it acquires the pagecache reference for the new page and
 * drops it for the old page. Both the old and new pages must be
 * locked. This function does not add the new page to the LRU, the
 * caller must do that.
 *
 * The remove + add is atomic. The only way this function can fail is
 * memory allocation failure.
 */
int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
{
	int error;

	VM_BUG_ON(!PageLocked(old));
	VM_BUG_ON(!PageLocked(new));
	VM_BUG_ON(new->mapping);

	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
	if (!error) {
		struct address_space *mapping = old->mapping;
		void (*freepage)(struct page *);

		pgoff_t offset = old->index;
		freepage = mapping->a_ops->freepage;

		page_cache_get(new);
		new->mapping = mapping;
		new->index = offset;

		spin_lock_irq(&mapping->tree_lock);
		__delete_from_page_cache(old);
		error = radix_tree_insert(&mapping->page_tree, offset, new);
		BUG_ON(error);
		mapping->nrpages++;
		__inc_zone_page_state(new, NR_FILE_PAGES);
		if (PageSwapBacked(new))
			__inc_zone_page_state(new, NR_SHMEM);
		spin_unlock_irq(&mapping->tree_lock);
		/* mem_cgroup codes must not be called under tree_lock */
		mem_cgroup_replace_page_cache(old, new);
		radix_tree_preload_end();
		if (freepage)
			freepage(old);
		page_cache_release(old);
	}

	return error;
}
EXPORT_SYMBOL_GPL(replace_page_cache_page);

static int page_cache_tree_insert(struct address_space *mapping,
				  struct page *page)
{
	void **slot;
	int error;

	slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
	if (slot) {
		void *p;

		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
		if (!radix_tree_exceptional_entry(p))
			return -EEXIST;
		radix_tree_replace_slot(slot, page);
		mapping->nrpages++;
		return 0;
	}
	error = radix_tree_insert(&mapping->page_tree, page->index, page);
	if (!error)
		mapping->nrpages++;
	return error;
}

/**
 * add_to_page_cache_locked - add a locked page to the pagecache
 * @page: page to add
 * @mapping: the page's address_space
 * @offset: page index
 * @gfp_mask: page allocation mode
 *
 * This function is used to add a page to the pagecache. It must be locked.
 * This function does not add the page to the LRU. The caller must do that.
 */
int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
		pgoff_t offset, gfp_t gfp_mask)
{
	int error;

	VM_BUG_ON(!PageLocked(page));
	VM_BUG_ON(PageSwapBacked(page));

	error = mem_cgroup_cache_charge(page, current->mm,
					gfp_mask & GFP_RECLAIM_MASK);
	if (error)
		return error;

	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
	if (error) {
		mem_cgroup_uncharge_cache_page(page);
		return error;
	}

	page_cache_get(page);
	page->mapping = mapping;
	page->index = offset;

	spin_lock_irq(&mapping->tree_lock);
	error = page_cache_tree_insert(mapping, page);
	radix_tree_preload_end();
	if (unlikely(error))
		goto err_insert;
	__inc_zone_page_state(page, NR_FILE_PAGES);
	spin_unlock_irq(&mapping->tree_lock);
	trace_mm_filemap_add_to_page_cache(page);
	return 0;
err_insert:
	page->mapping = NULL;
	/* Leave page->index set: truncation relies upon it */
	spin_unlock_irq(&mapping->tree_lock);
	mem_cgroup_uncharge_cache_page(page);
	page_cache_release(page);
	return error;
}
EXPORT_SYMBOL(add_to_page_cache_locked);

int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
				pgoff_t offset, gfp_t gfp_mask)
{
	int ret;

	ret = add_to_page_cache(page, mapping, offset, gfp_mask);
	if (ret == 0)
		lru_cache_add_file(page);
	return ret;
}
EXPORT_SYMBOL_GPL(add_to_page_cache_lru);

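[Editorial aside, not part of this patch: a read-side caller typically pairs __page_cache_alloc() with add_to_page_cache_lru() and then issues the read, loosely as in the readahead path. The sketch below is illustrative; example_read_one() is a made-up name and the error handling is simplified.]

	/* Illustrative sketch only -- not kernel source from this commit. */
	static int example_read_one(struct file *filp,
				    struct address_space *mapping, pgoff_t index)
	{
		struct page *page;
		int err;

		/* Allocate a fresh page for the page cache. */
		page = __page_cache_alloc(mapping_gfp_mask(mapping));
		if (!page)
			return -ENOMEM;

		/* Insert it (locked) into the radix tree and onto the LRU. */
		err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
		if (!err)
			err = mapping->a_ops->readpage(filp, page);
		page_cache_release(page);
		return err;
	}
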
#ifdef CONFIG_NUMA
struct page *__page_cache_alloc(gfp_t gfp)
{
	int n;
	struct page *page;

	if (cpuset_do_page_mem_spread()) {
		unsigned int cpuset_mems_cookie;
		do {
			cpuset_mems_cookie = read_mems_allowed_begin();
			n = cpuset_mem_spread_node();
			page = alloc_pages_exact_node(n, gfp, 0);
		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));

		return page;
	}
	return alloc_pages(gfp, 0);
}
EXPORT_SYMBOL(__page_cache_alloc);
#endif

/*
 * In order to wait for pages to become available there must be
 * waitqueues associated with pages. By using a hash table of
 * waitqueues where the bucket discipline is to maintain all
 * waiters on the same queue and wake all when any of the pages
 * become available, and for the woken contexts to check to be
 * sure the appropriate page became available, this saves space
 * at a cost of "thundering herd" phenomena during rare hash
 * collisions.
 */
static wait_queue_head_t *page_waitqueue(struct page *page)
{
	const struct zone *zone = page_zone(page);

	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}

static inline void wake_up_page(struct page *page, int bit)
{
	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
}

void wait_on_page_bit(struct page *page, int bit_nr)
{
	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);

	if (test_bit(bit_nr, &page->flags))
		__wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
							TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(wait_on_page_bit);

int wait_on_page_bit_killable(struct page *page, int bit_nr)
{
	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);

	if (!test_bit(bit_nr, &page->flags))
		return 0;

	return __wait_on_bit(page_waitqueue(page), &wait,
			     sleep_on_page_killable, TASK_KILLABLE);
}

/**
 * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
 * @page: Page defining the wait queue of interest
 * @waiter: Waiter to add to the queue
 *
 * Add an arbitrary @waiter to the wait queue for the nominated @page.
 */
void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
{
	wait_queue_head_t *q = page_waitqueue(page);
	unsigned long flags;

	spin_lock_irqsave(&q->lock, flags);
	__add_wait_queue(q, waiter);
	spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL_GPL(add_page_wait_queue);

/**
 * unlock_page - unlock a locked page
 * @page: the page
 *
 * Unlocks the page and wakes up sleepers in ___wait_on_page_locked().
 * Also wakes sleepers in wait_on_page_writeback() because the wakeup
 * mechananism between PageLocked pages and PageWriteback pages is shared.
 * But that's OK - sleepers in wait_on_page_writeback() just go back to sleep.
 *
 * The mb is necessary to enforce ordering between the clear_bit and the read
 * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()).
 */
void unlock_page(struct page *page)
{
	VM_BUG_ON(!PageLocked(page));
	clear_bit_unlock(PG_locked, &page->flags);
	smp_mb__after_clear_bit();
	wake_up_page(page, PG_locked);
}
EXPORT_SYMBOL(unlock_page);

/**
 * end_page_writeback - end writeback against a page
 * @page: the page
 */
void end_page_writeback(struct page *page)
{
	if (TestClearPageReclaim(page))
		rotate_reclaimable_page(page);

	if (!test_clear_page_writeback(page))
		BUG();

	smp_mb__after_clear_bit();
	wake_up_page(page, PG_writeback);
}
EXPORT_SYMBOL(end_page_writeback);

/**
 * __lock_page - get a lock on the page, assuming we need to sleep to get it
 * @page: the page to lock
 */
void __lock_page(struct page *page)
{
	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
							TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(__lock_page);

int __lock_page_killable(struct page *page)
{
	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

	return __wait_on_bit_lock(page_waitqueue(page), &wait,
					sleep_on_page_killable, TASK_KILLABLE);
}
EXPORT_SYMBOL_GPL(__lock_page_killable);

int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
			 unsigned int flags)
{
	if (flags & FAULT_FLAG_ALLOW_RETRY) {
		/*
		 * CAUTION! In this case, mmap_sem is not released
		 * even though return 0.
		 */
		if (flags & FAULT_FLAG_RETRY_NOWAIT)
			return 0;

		up_read(&mm->mmap_sem);
		if (flags & FAULT_FLAG_KILLABLE)
			wait_on_page_locked_killable(page);
		else
			wait_on_page_locked(page);
		return 0;
	} else {
		if (flags & FAULT_FLAG_KILLABLE) {
			int ret;

			ret = __lock_page_killable(page);
			if (ret) {
				up_read(&mm->mmap_sem);
				return 0;
			}
		} else
			__lock_page(page);
		return 1;
	}
}

/**
 * page_cache_next_hole - find the next hole (not-present entry)
 * @mapping: mapping
 * @index: index
 * @max_scan: maximum range to search
 *
 * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the
 * lowest indexed hole.
 *
 * Returns: the index of the hole if found, otherwise returns an index
 * outside of the set specified (in which case 'return - index >=
 * max_scan' will be true). In rare cases of index wrap-around, 0 will
 * be returned.
 *
 * page_cache_next_hole may be called under rcu_read_lock. However,
 * like radix_tree_gang_lookup, this will not atomically search a
 * snapshot of the tree at a single point in time. For example, if a
 * hole is created at index 5, then subsequently a hole is created at
 * index 10, page_cache_next_hole covering both indexes may return 10
 * if called under rcu_read_lock.
 */
pgoff_t page_cache_next_hole(struct address_space *mapping,
			     pgoff_t index, unsigned long max_scan)
{
	unsigned long i;

	for (i = 0; i < max_scan; i++) {
		struct page *page;

		page = radix_tree_lookup(&mapping->page_tree, index);
		if (!page || radix_tree_exceptional_entry(page))
			break;
		index++;
		if (index == 0)
			break;
	}

	return index;
}
EXPORT_SYMBOL(page_cache_next_hole);

/**
 * page_cache_prev_hole - find the prev hole (not-present entry)
 * @mapping: mapping
 * @index: index
 * @max_scan: maximum range to search
 *
 * Search backwards in the range [max(index-max_scan+1, 0), index] for
 * the first hole.
 *
 * Returns: the index of the hole if found, otherwise returns an index
 * outside of the set specified (in which case 'index - return >=
 * max_scan' will be true). In rare cases of wrap-around, ULONG_MAX
 * will be returned.
 *
 * page_cache_prev_hole may be called under rcu_read_lock. However,
 * like radix_tree_gang_lookup, this will not atomically search a
 * snapshot of the tree at a single point in time. For example, if a
 * hole is created at index 10, then subsequently a hole is created at
 * index 5, page_cache_prev_hole covering both indexes may return 5 if
 * called under rcu_read_lock.
 */
pgoff_t page_cache_prev_hole(struct address_space *mapping,
			     pgoff_t index, unsigned long max_scan)
{
	unsigned long i;

	for (i = 0; i < max_scan; i++) {
		struct page *page;

		page = radix_tree_lookup(&mapping->page_tree, index);
		if (!page || radix_tree_exceptional_entry(page))
			break;
		index--;
		if (index == ULONG_MAX)
			break;
	}

	return index;
}
EXPORT_SYMBOL(page_cache_prev_hole);

/**
 * find_get_entry - find and get a page cache entry
 * @mapping: the address_space to search
 * @offset: the page cache index
 *
 * Looks up the page cache slot at @mapping & @offset. If there is a
 * page cache page, it is returned with an increased refcount.
 *
 * If the slot holds a shadow entry of a previously evicted page, it
 * is returned.
 *
 * Otherwise, %NULL is returned.
 */
struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
{
	void **pagep;
	struct page *page;

	rcu_read_lock();
repeat:
	page = NULL;
	pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
	if (pagep) {
		page = radix_tree_deref_slot(pagep);
		if (unlikely(!page))
			goto out;
		if (radix_tree_exception(page)) {
			if (radix_tree_deref_retry(page))
				goto repeat;
			/*
			 * Otherwise, shmem/tmpfs must be storing a swap entry
			 * here as an exceptional entry: so return it without
			 * attempting to raise page count.
			 */
			goto out;
		}
		if (!page_cache_get_speculative(page))
			goto repeat;

		/*
		 * Has the page moved?
		 * This is part of the lockless pagecache protocol. See
		 * include/linux/pagemap.h for details.
		 */
		if (unlikely(page != *pagep)) {
			page_cache_release(page);
			goto repeat;
		}
	}
out:
	rcu_read_unlock();

	return page;
}
EXPORT_SYMBOL(find_get_entry);

/**
- * find_get_page - find and get a page reference
- * @mapping: the address_space to search
- * @offset: the page index
- *
- * Looks up the page cache slot at @mapping & @offset. If there is a
- * page cache page, it is returned with an increased refcount.
- *
- * Otherwise, %NULL is returned.
- */
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
-{
-	struct page *page = find_get_entry(mapping, offset);
-
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
-	return page;
-}
-EXPORT_SYMBOL(find_get_page);
-
-/**
 * find_lock_entry - locate, pin and lock a page cache entry
 * @mapping: the address_space to search
 * @offset: the page cache index
 *
 * Looks up the page cache slot at @mapping & @offset. If there is a
 * page cache page, it is returned locked and with an increased
 * refcount.
 *
 * If the slot holds a shadow entry of a previously evicted page, it
 * is returned.
 *
 * Otherwise, %NULL is returned.
 *
 * find_lock_entry() may sleep.
 */
struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)
{
	struct page *page;

repeat:
	page = find_get_entry(mapping, offset);
	if (page && !radix_tree_exception(page)) {
		lock_page(page);
		/* Has the page been truncated? */
		if (unlikely(page->mapping != mapping)) {
			unlock_page(page);
			page_cache_release(page);
			goto repeat;
		}
		VM_BUG_ON(page->index != offset);
	}
	return page;
}
EXPORT_SYMBOL(find_lock_entry);

/**
- * find_lock_page - locate, pin and lock a pagecache page
+ * pagecache_get_page - find and get a page reference
 * @mapping: the address_space to search
 * @offset: the page index
+ * @fgp_flags: PCG flags
+ * @gfp_mask: gfp mask to use if a page is to be allocated
 *
- * Looks up the page cache slot at @mapping & @offset. If there is a
- * page cache page, it is returned locked and with an increased
- * refcount.
+ * Looks up the page cache slot at @mapping & @offset.
 *
- * Otherwise, %NULL is returned.
+ * PCG flags modify how the page is returned
 *
- * find_lock_page() may sleep.
- */
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
-{
-	struct page *page = find_lock_entry(mapping, offset);
-
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
-	return page;
-}
-EXPORT_SYMBOL(find_lock_page);
-
-/**
- * find_or_create_page - locate or add a pagecache page
- * @mapping: the page's address_space
- * @index: the page's index into the mapping
- * @gfp_mask: page allocation mode
+ * FGP_ACCESSED: the page will be marked accessed
+ * FGP_LOCK: Page is return locked
+ * FGP_CREAT: If page is not present then a new page is allocated using
+ *		@gfp_mask and added to the page cache and the VM's LRU
+ *		list. The page is returned locked and with an increased
+ *		refcount. Otherwise, %NULL is returned.
 *
- * Looks up the page cache slot at @mapping & @offset. If there is a
- * page cache page, it is returned locked and with an increased
- * refcount.
+ * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
+ * if the GFP flags specified for FGP_CREAT are atomic.
 *
- * If the page is not present, a new page is allocated using @gfp_mask
- * and added to the page cache and the VM's LRU list. The page is
- * returned locked and with an increased refcount.
- *
- * On memory exhaustion, %NULL is returned.
- *
- * find_or_create_page() may sleep, even if @gfp_flags specifies an
- * atomic allocation!
+ * If there is a page cache page, it is returned with an increased refcount.
 */
-struct page *find_or_create_page(struct address_space *mapping,
-		pgoff_t index, gfp_t gfp_mask)
+struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
+	int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask)
{
	struct page *page;
-	int err;
+
repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp_mask);
+	page = find_get_entry(mapping, offset);
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	if (!page)
+		goto no_page;
+
+	if (fgp_flags & FGP_LOCK) {
+		if (fgp_flags & FGP_NOWAIT) {
+			if (!trylock_page(page)) {
+				page_cache_release(page);
+				return NULL;
+			}
+		} else {
+			lock_page(page);
+		}
+
+		/* Has the page been truncated? */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
+		}
+		VM_BUG_ON(page->index != offset);
+	}
+
+	if (page && (fgp_flags & FGP_ACCESSED))
+		mark_page_accessed(page);
+
+no_page:
+	if (!page && (fgp_flags & FGP_CREAT)) {
+		int err;
+		if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
+			cache_gfp_mask |= __GFP_WRITE;
+		if (fgp_flags & FGP_NOFS) {
+			cache_gfp_mask &= ~__GFP_FS;
+			radix_gfp_mask &= ~__GFP_FS;
+		}
+
+		page = __page_cache_alloc(cache_gfp_mask);
		if (!page)
			return NULL;
-		/*
-		 * We want a regular kernel memory (not highmem or DMA etc)
-		 * allocation for the radix tree nodes, but we need to honour
-		 * the context-specific requirements the caller has asked for.
-		 * GFP_RECLAIM_MASK collects those requirements.
-		 */
-		err = add_to_page_cache_lru(page, mapping, index,
-			(gfp_mask & GFP_RECLAIM_MASK));
+
+		if (WARN_ON_ONCE(!(fgp_flags & FGP_LOCK)))
+			fgp_flags |= FGP_LOCK;
+
+		/* Init accessed so avoid atomic mark_page_accessed later */
+		if (fgp_flags & FGP_ACCESSED)
+			init_page_accessed(page);
+
+		err = add_to_page_cache_lru(page, mapping, offset, radix_gfp_mask);
		if (unlikely(err)) {
			page_cache_release(page);
			page = NULL;
			if (err == -EEXIST)
				goto repeat;
		}
	}
+
	return page;
}
-EXPORT_SYMBOL(find_or_create_page);
+EXPORT_SYMBOL(pagecache_get_page);

979 * find_get_entries - gang pagecache lookup 978 * find_get_entries - gang pagecache lookup
980 * @mapping: The address_space to search 979 * @mapping: The address_space to search
981 * @start: The starting page cache index 980 * @start: The starting page cache index
982 * @nr_entries: The maximum number of entries 981 * @nr_entries: The maximum number of entries
983 * @entries: Where the resulting entries are placed 982 * @entries: Where the resulting entries are placed
984 * @indices: The cache indices corresponding to the entries in @entries 983 * @indices: The cache indices corresponding to the entries in @entries
985 * 984 *
986 * find_get_entries() will search for and return a group of up to 985 * find_get_entries() will search for and return a group of up to
987 * @nr_entries entries in the mapping. The entries are placed at 986 * @nr_entries entries in the mapping. The entries are placed at
988 * @entries. find_get_entries() takes a reference against any actual 987 * @entries. find_get_entries() takes a reference against any actual
989 * pages it returns. 988 * pages it returns.
990 * 989 *
991 * The search returns a group of mapping-contiguous page cache entries 990 * The search returns a group of mapping-contiguous page cache entries
992 * with ascending indexes. There may be holes in the indices due to 991 * with ascending indexes. There may be holes in the indices due to
993 * not-present pages. 992 * not-present pages.
994 * 993 *
995 * Any shadow entries of evicted pages are included in the returned 994 * Any shadow entries of evicted pages are included in the returned
996 * array. 995 * array.
997 * 996 *
998 * find_get_entries() returns the number of pages and shadow entries 997 * find_get_entries() returns the number of pages and shadow entries
999 * which were found. 998 * which were found.
1000 */ 999 */
1001 unsigned find_get_entries(struct address_space *mapping, 1000 unsigned find_get_entries(struct address_space *mapping,
1002 pgoff_t start, unsigned int nr_entries, 1001 pgoff_t start, unsigned int nr_entries,
1003 struct page **entries, pgoff_t *indices) 1002 struct page **entries, pgoff_t *indices)
1004 { 1003 {
1005 void **slot; 1004 void **slot;
1006 unsigned int ret = 0; 1005 unsigned int ret = 0;
1007 struct radix_tree_iter iter; 1006 struct radix_tree_iter iter;
1008 1007
1009 if (!nr_entries) 1008 if (!nr_entries)
1010 return 0; 1009 return 0;
1011 1010
1012 rcu_read_lock(); 1011 rcu_read_lock();
1013 restart: 1012 restart:
1014 radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) { 1013 radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
1015 struct page *page; 1014 struct page *page;
1016 repeat: 1015 repeat:
1017 page = radix_tree_deref_slot(slot); 1016 page = radix_tree_deref_slot(slot);
1018 if (unlikely(!page)) 1017 if (unlikely(!page))
1019 continue; 1018 continue;
1020 if (radix_tree_exception(page)) { 1019 if (radix_tree_exception(page)) {
1021 if (radix_tree_deref_retry(page)) 1020 if (radix_tree_deref_retry(page))
1022 goto restart; 1021 goto restart;
1023 /* 1022 /*
1024 * Otherwise, we must be storing a swap entry 1023 * Otherwise, we must be storing a swap entry
1025 * here as an exceptional entry: so return it 1024 * here as an exceptional entry: so return it
1026 * without attempting to raise page count. 1025 * without attempting to raise page count.
1027 */ 1026 */
1028 goto export; 1027 goto export;
1029 } 1028 }
1030 if (!page_cache_get_speculative(page)) 1029 if (!page_cache_get_speculative(page))
1031 goto repeat; 1030 goto repeat;
1032 1031
1033 /* Has the page moved? */ 1032 /* Has the page moved? */
1034 if (unlikely(page != *slot)) { 1033 if (unlikely(page != *slot)) {
1035 page_cache_release(page); 1034 page_cache_release(page);
1036 goto repeat; 1035 goto repeat;
1037 } 1036 }
1038 export: 1037 export:
1039 indices[ret] = iter.index; 1038 indices[ret] = iter.index;
1040 entries[ret] = page; 1039 entries[ret] = page;
1041 if (++ret == nr_entries) 1040 if (++ret == nr_entries)
1042 break; 1041 break;
1043 } 1042 }
1044 rcu_read_unlock(); 1043 rcu_read_unlock();
1045 return ret; 1044 return ret;
1046 } 1045 }
1047 1046
1048 /** 1047 /**
1049 * find_get_pages - gang pagecache lookup 1048 * find_get_pages - gang pagecache lookup
1050 * @mapping: The address_space to search 1049 * @mapping: The address_space to search
1051 * @start: The starting page index 1050 * @start: The starting page index
1052 * @nr_pages: The maximum number of pages 1051 * @nr_pages: The maximum number of pages
1053 * @pages: Where the resulting pages are placed 1052 * @pages: Where the resulting pages are placed
1054 * 1053 *
1055 * find_get_pages() will search for and return a group of up to 1054 * find_get_pages() will search for and return a group of up to
1056 * @nr_pages pages in the mapping. The pages are placed at @pages. 1055 * @nr_pages pages in the mapping. The pages are placed at @pages.
1057 * find_get_pages() takes a reference against the returned pages. 1056 * find_get_pages() takes a reference against the returned pages.
1058 * 1057 *
1059 * The search returns a group of mapping-contiguous pages with ascending 1058 * The search returns a group of mapping-contiguous pages with ascending
1060 * indexes. There may be holes in the indices due to not-present pages. 1059 * indexes. There may be holes in the indices due to not-present pages.
1061 * 1060 *
1062 * find_get_pages() returns the number of pages which were found. 1061 * find_get_pages() returns the number of pages which were found.
1063 */ 1062 */
1064 unsigned find_get_pages(struct address_space *mapping, pgoff_t start, 1063 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
1065 unsigned int nr_pages, struct page **pages) 1064 unsigned int nr_pages, struct page **pages)
1066 { 1065 {
1067 struct radix_tree_iter iter; 1066 struct radix_tree_iter iter;
1068 void **slot; 1067 void **slot;
1069 unsigned ret = 0; 1068 unsigned ret = 0;
1070 1069
1071 if (unlikely(!nr_pages)) 1070 if (unlikely(!nr_pages))
1072 return 0; 1071 return 0;
1073 1072
1074 rcu_read_lock(); 1073 rcu_read_lock();
1075 restart: 1074 restart:
1076 radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) { 1075 radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
1077 struct page *page; 1076 struct page *page;
1078 repeat: 1077 repeat:
1079 page = radix_tree_deref_slot(slot); 1078 page = radix_tree_deref_slot(slot);
1080 if (unlikely(!page)) 1079 if (unlikely(!page))
1081 continue; 1080 continue;
1082 1081
1083 if (radix_tree_exception(page)) { 1082 if (radix_tree_exception(page)) {
1084 if (radix_tree_deref_retry(page)) { 1083 if (radix_tree_deref_retry(page)) {
1085 /* 1084 /*
1086 * Transient condition which can only trigger 1085 * Transient condition which can only trigger
1087 * when entry at index 0 moves out of or back 1086 * when entry at index 0 moves out of or back
1088 * to root: none yet gotten, safe to restart. 1087 * to root: none yet gotten, safe to restart.
1089 */ 1088 */
1090 WARN_ON(iter.index); 1089 WARN_ON(iter.index);
1091 goto restart; 1090 goto restart;
1092 } 1091 }
1093 /* 1092 /*
1094 * Otherwise, shmem/tmpfs must be storing a swap entry 1093 * Otherwise, shmem/tmpfs must be storing a swap entry
1095 * here as an exceptional entry: so skip over it - 1094 * here as an exceptional entry: so skip over it -
1096 * we only reach this from invalidate_mapping_pages(). 1095 * we only reach this from invalidate_mapping_pages().
1097 */ 1096 */
1098 continue; 1097 continue;
1099 } 1098 }
1100 1099
1101 if (!page_cache_get_speculative(page)) 1100 if (!page_cache_get_speculative(page))
1102 goto repeat; 1101 goto repeat;
1103 1102
1104 /* Has the page moved? */ 1103 /* Has the page moved? */
1105 if (unlikely(page != *slot)) { 1104 if (unlikely(page != *slot)) {
1106 page_cache_release(page); 1105 page_cache_release(page);
1107 goto repeat; 1106 goto repeat;
1108 } 1107 }
1109 1108
1110 pages[ret] = page; 1109 pages[ret] = page;
1111 if (++ret == nr_pages) 1110 if (++ret == nr_pages)
1112 break; 1111 break;
1113 } 1112 }
1114 1113
1115 rcu_read_unlock(); 1114 rcu_read_unlock();
1116 return ret; 1115 return ret;
1117 } 1116 }
1118 1117
1119 /** 1118 /**
1120 * find_get_pages_contig - gang contiguous pagecache lookup 1119 * find_get_pages_contig - gang contiguous pagecache lookup
1121 * @mapping: The address_space to search 1120 * @mapping: The address_space to search
1122 * @index: The starting page index 1121 * @index: The starting page index
1123 * @nr_pages: The maximum number of pages 1122 * @nr_pages: The maximum number of pages
1124 * @pages: Where the resulting pages are placed 1123 * @pages: Where the resulting pages are placed
1125 * 1124 *
1126 * find_get_pages_contig() works exactly like find_get_pages(), except 1125 * find_get_pages_contig() works exactly like find_get_pages(), except
1127 * that the returned number of pages are guaranteed to be contiguous. 1126 * that the returned number of pages are guaranteed to be contiguous.
1128 * 1127 *
1129 * find_get_pages_contig() returns the number of pages which were found. 1128 * find_get_pages_contig() returns the number of pages which were found.
1130 */ 1129 */
1131 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index, 1130 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
1132 unsigned int nr_pages, struct page **pages) 1131 unsigned int nr_pages, struct page **pages)
1133 { 1132 {
1134 struct radix_tree_iter iter; 1133 struct radix_tree_iter iter;
1135 void **slot; 1134 void **slot;
1136 unsigned int ret = 0; 1135 unsigned int ret = 0;
1137 1136
1138 if (unlikely(!nr_pages)) 1137 if (unlikely(!nr_pages))
1139 return 0; 1138 return 0;
1140 1139
1141 rcu_read_lock(); 1140 rcu_read_lock();
1142 restart: 1141 restart:
1143 radix_tree_for_each_contig(slot, &mapping->page_tree, &iter, index) { 1142 radix_tree_for_each_contig(slot, &mapping->page_tree, &iter, index) {
1144 struct page *page; 1143 struct page *page;
1145 repeat: 1144 repeat:
1146 page = radix_tree_deref_slot(slot); 1145 page = radix_tree_deref_slot(slot);
1147 /* The hole, there no reason to continue */ 1146 /* The hole, there no reason to continue */
1148 if (unlikely(!page)) 1147 if (unlikely(!page))
1149 break; 1148 break;
1150 1149
1151 if (radix_tree_exception(page)) { 1150 if (radix_tree_exception(page)) {
1152 if (radix_tree_deref_retry(page)) { 1151 if (radix_tree_deref_retry(page)) {
1153 /* 1152 /*
1154 * Transient condition which can only trigger 1153 * Transient condition which can only trigger
1155 * when entry at index 0 moves out of or back 1154 * when entry at index 0 moves out of or back
1156 * to root: none yet gotten, safe to restart. 1155 * to root: none yet gotten, safe to restart.
1157 */ 1156 */
1158 goto restart; 1157 goto restart;
1159 } 1158 }
1160 /* 1159 /*
1161 * Otherwise, shmem/tmpfs must be storing a swap entry 1160 * Otherwise, shmem/tmpfs must be storing a swap entry
1162 * here as an exceptional entry: so stop looking for 1161 * here as an exceptional entry: so stop looking for
1163 * contiguous pages. 1162 * contiguous pages.
1164 */ 1163 */
1165 break; 1164 break;
1166 } 1165 }
1167 1166
1168 if (!page_cache_get_speculative(page)) 1167 if (!page_cache_get_speculative(page))
1169 goto repeat; 1168 goto repeat;
1170 1169
1171 /* Has the page moved? */ 1170 /* Has the page moved? */
1172 if (unlikely(page != *slot)) { 1171 if (unlikely(page != *slot)) {
1173 page_cache_release(page); 1172 page_cache_release(page);
1174 goto repeat; 1173 goto repeat;
1175 } 1174 }
1176 1175
1177 /* 1176 /*
1178 * must check mapping and index after taking the ref. 1177 * must check mapping and index after taking the ref.
1179 * otherwise we can get both false positives and false 1178 * otherwise we can get both false positives and false
1180 * negatives, which is just confusing to the caller. 1179 * negatives, which is just confusing to the caller.
1181 */ 1180 */
1182 if (page->mapping == NULL || page->index != iter.index) { 1181 if (page->mapping == NULL || page->index != iter.index) {
1183 page_cache_release(page); 1182 page_cache_release(page);
1184 break; 1183 break;
1185 } 1184 }
1186 1185
1187 pages[ret] = page; 1186 pages[ret] = page;
1188 if (++ret == nr_pages) 1187 if (++ret == nr_pages)
1189 break; 1188 break;
1190 } 1189 }
1191 rcu_read_unlock(); 1190 rcu_read_unlock();
1192 return ret; 1191 return ret;
1193 } 1192 }
1194 EXPORT_SYMBOL(find_get_pages_contig); 1193 EXPORT_SYMBOL(find_get_pages_contig);
1195 1194
1196 /** 1195 /**
1197 * find_get_pages_tag - find and return pages that match @tag 1196 * find_get_pages_tag - find and return pages that match @tag
1198 * @mapping: the address_space to search 1197 * @mapping: the address_space to search
1199 * @index: the starting page index 1198 * @index: the starting page index
1200 * @tag: the tag index 1199 * @tag: the tag index
1201 * @nr_pages: the maximum number of pages 1200 * @nr_pages: the maximum number of pages
1202 * @pages: where the resulting pages are placed 1201 * @pages: where the resulting pages are placed
1203 * 1202 *
1204 * Like find_get_pages, except we only return pages which are tagged with 1203 * Like find_get_pages, except we only return pages which are tagged with
1205 * @tag. We update @index to index the next page for the traversal. 1204 * @tag. We update @index to index the next page for the traversal.
1206 */ 1205 */
1207 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index, 1206 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
1208 int tag, unsigned int nr_pages, struct page **pages) 1207 int tag, unsigned int nr_pages, struct page **pages)
1209 { 1208 {
1210 struct radix_tree_iter iter; 1209 struct radix_tree_iter iter;
1211 void **slot; 1210 void **slot;
1212 unsigned ret = 0; 1211 unsigned ret = 0;
1213 1212
1214 if (unlikely(!nr_pages)) 1213 if (unlikely(!nr_pages))
1215 return 0; 1214 return 0;
1216 1215
1217 rcu_read_lock(); 1216 rcu_read_lock();
1218 restart: 1217 restart:
1219 radix_tree_for_each_tagged(slot, &mapping->page_tree, 1218 radix_tree_for_each_tagged(slot, &mapping->page_tree,
1220 &iter, *index, tag) { 1219 &iter, *index, tag) {
1221 struct page *page; 1220 struct page *page;
1222 repeat: 1221 repeat:
1223 page = radix_tree_deref_slot(slot); 1222 page = radix_tree_deref_slot(slot);
1224 if (unlikely(!page)) 1223 if (unlikely(!page))
1225 continue; 1224 continue;
1226 1225
1227 if (radix_tree_exception(page)) { 1226 if (radix_tree_exception(page)) {
1228 if (radix_tree_deref_retry(page)) { 1227 if (radix_tree_deref_retry(page)) {
1229 /* 1228 /*
1230 * Transient condition which can only trigger 1229 * Transient condition which can only trigger
1231 * when entry at index 0 moves out of or back 1230 * when entry at index 0 moves out of or back
1232 * to root: none yet gotten, safe to restart. 1231 * to root: none yet gotten, safe to restart.
1233 */ 1232 */
1234 goto restart; 1233 goto restart;
1235 } 1234 }
1236 /* 1235 /*
1237 * This function is never used on a shmem/tmpfs 1236 * This function is never used on a shmem/tmpfs
1238 * mapping, so a swap entry won't be found here. 1237 * mapping, so a swap entry won't be found here.
1239 */ 1238 */
1240 BUG(); 1239 BUG();
1241 } 1240 }
1242 1241
1243 if (!page_cache_get_speculative(page)) 1242 if (!page_cache_get_speculative(page))
1244 goto repeat; 1243 goto repeat;
1245 1244
1246 /* Has the page moved? */ 1245 /* Has the page moved? */
1247 if (unlikely(page != *slot)) { 1246 if (unlikely(page != *slot)) {
1248 page_cache_release(page); 1247 page_cache_release(page);
1249 goto repeat; 1248 goto repeat;
1250 } 1249 }
1251 1250
1252 pages[ret] = page; 1251 pages[ret] = page;
1253 if (++ret == nr_pages) 1252 if (++ret == nr_pages)
1254 break; 1253 break;
1255 } 1254 }
1256 1255
1257 rcu_read_unlock(); 1256 rcu_read_unlock();
1258 1257
1259 if (ret) 1258 if (ret)
1260 *index = pages[ret - 1]->index + 1; 1259 *index = pages[ret - 1]->index + 1;
1261 1260
1262 return ret; 1261 return ret;
1263 } 1262 }
1264 EXPORT_SYMBOL(find_get_pages_tag); 1263 EXPORT_SYMBOL(find_get_pages_tag);
1265 1264
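For context, find_get_pages_tag() is the batched lookup used by writeback-style walkers, and every page it returns carries a reference taken under RCU. A minimal sketch of a caller, not part of this patch and with an illustrative function name, could look like:

	/* Illustrative sketch only: visit every dirty page in a mapping. */
	static void example_visit_dirty(struct address_space *mapping)
	{
		struct page *pages[16];
		pgoff_t index = 0;
		unsigned nr, i;

		while ((nr = find_get_pages_tag(mapping, &index,
						PAGECACHE_TAG_DIRTY,
						ARRAY_SIZE(pages), pages))) {
			for (i = 0; i < nr; i++) {
				/* ... inspect or write back pages[i] ... */
				page_cache_release(pages[i]);
			}
		}
	}
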
/**
 * grab_cache_page_nowait - returns locked page at given index in given cache
 * @mapping: target address_space
 * @index: the page index
 *
 * Same as grab_cache_page(), but do not wait if the page is unavailable.
 * This is intended for speculative data generators, where the data can
 * be regenerated if the page couldn't be grabbed. This routine should
 * be safe to call while holding the lock for another page.
 *
 * Clear __GFP_FS when allocating the page to avoid recursion into the fs
 * and deadlock against the caller's locked page.
 */
struct page *
grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
{
	struct page *page = find_get_page(mapping, index);

	if (page) {
		if (trylock_page(page))
			return page;
		page_cache_release(page);
		return NULL;
	}
	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
		page_cache_release(page);
		page = NULL;
	}
	return page;
}
EXPORT_SYMBOL(grab_cache_page_nowait);

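As the comment above says, grab_cache_page_nowait() is meant for speculative data generators that can regenerate the data if the page cannot be grabbed without blocking. A hedged sketch of that calling pattern, illustrative only and not part of the patch, might be:

	/* Illustrative sketch only: opportunistically fill a page. */
	static void example_try_fill(struct address_space *mapping, pgoff_t index)
	{
		struct page *page = grab_cache_page_nowait(mapping, index);

		if (!page)
			return;	/* could not grab it without blocking; try later */

		/* ... generate the data into the locked page ... */
		SetPageUptodate(page);
		unlock_page(page);
		page_cache_release(page);
	}
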
/*
 * CD/DVDs are error prone. When a medium error occurs, the driver may fail
 * a _large_ part of the i/o request. Imagine the worst scenario:
 *
 *      ---R__________________________________________B__________
 *         ^ reading here                             ^ bad block(assume 4k)
 *
 * read(R) => miss => readahead(R...B) => media error => frustrating retries
 * => failing the whole request => read(R) => read(R+1) =>
 * readahead(R+1...B+1) => bang => read(R+2) => read(R+3) =>
 * readahead(R+3...B+2) => bang => read(R+3) => read(R+4) =>
 * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
 *
 * It is going insane. Fix it by quickly scaling down the readahead size.
 */
static void shrink_readahead_size_eio(struct file *filp,
					struct file_ra_state *ra)
{
	ra->ra_pages /= 4;
}

/**
 * do_generic_file_read - generic file read routine
 * @filp: the file to read
 * @ppos: current file position
 * @desc: read_descriptor
 * @actor: read method
 *
 * This is a generic file read routine, and uses the
 * mapping->a_ops->readpage() function for the actual low-level stuff.
 *
 * This is really ugly. But the goto's actually try to clarify some
 * of the logic when it comes to error handling etc.
 */
static void do_generic_file_read(struct file *filp, loff_t *ppos,
		read_descriptor_t *desc, read_actor_t actor)
{
	struct address_space *mapping = filp->f_mapping;
	struct inode *inode = mapping->host;
	struct file_ra_state *ra = &filp->f_ra;
	pgoff_t index;
	pgoff_t last_index;
	pgoff_t prev_index;
	unsigned long offset;      /* offset into pagecache page */
	unsigned int prev_offset;
	int error;

	index = *ppos >> PAGE_CACHE_SHIFT;
	prev_index = ra->prev_pos >> PAGE_CACHE_SHIFT;
	prev_offset = ra->prev_pos & (PAGE_CACHE_SIZE-1);
	last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
	offset = *ppos & ~PAGE_CACHE_MASK;

	for (;;) {
		struct page *page;
		pgoff_t end_index;
		loff_t isize;
		unsigned long nr, ret;

		cond_resched();
find_page:
		page = find_get_page(mapping, index);
		if (!page) {
			page_cache_sync_readahead(mapping,
					ra, filp,
					index, last_index - index);
			page = find_get_page(mapping, index);
			if (unlikely(page == NULL))
				goto no_cached_page;
		}
		if (PageReadahead(page)) {
			page_cache_async_readahead(mapping,
					ra, filp, page,
					index, last_index - index);
		}
		if (!PageUptodate(page)) {
			if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
					!mapping->a_ops->is_partially_uptodate)
				goto page_not_up_to_date;
			if (!trylock_page(page))
				goto page_not_up_to_date;
			/* Did it get truncated before we got the lock? */
			if (!page->mapping)
				goto page_not_up_to_date_locked;
			if (!mapping->a_ops->is_partially_uptodate(page,
								desc, offset))
				goto page_not_up_to_date_locked;
			unlock_page(page);
		}
page_ok:
		/*
		 * i_size must be checked after we know the page is Uptodate.
		 *
		 * Checking i_size after the check allows us to calculate
		 * the correct value for "nr", which means the zero-filled
		 * part of the page is not copied back to userspace (unless
		 * another truncate extends the file - this is desired though).
		 */

		isize = i_size_read(inode);
		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
		if (unlikely(!isize || index > end_index)) {
			page_cache_release(page);
			goto out;
		}

		/* nr is the maximum number of bytes to copy from this page */
		nr = PAGE_CACHE_SIZE;
		if (index == end_index) {
			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
			if (nr <= offset) {
				page_cache_release(page);
				goto out;
			}
		}
		nr = nr - offset;

		/* If users can be writing to this page using arbitrary
		 * virtual addresses, take care about potential aliasing
		 * before reading the page on the kernel side.
		 */
		if (mapping_writably_mapped(mapping))
			flush_dcache_page(page);

		/*
		 * When a sequential read accesses a page several times,
		 * only mark it as accessed the first time.
		 */
		if (prev_index != index || offset != prev_offset)
			mark_page_accessed(page);
		prev_index = index;

		/*
		 * Ok, we have the page, and it's up-to-date, so
		 * now we can copy it to user space...
		 *
		 * The actor routine returns how many bytes were actually used..
		 * NOTE! This may not be the same as how much of a user buffer
		 * we filled up (we may be padding etc), so we can only update
		 * "pos" here (the actor routine has to update the user buffer
		 * pointers and the remaining count).
		 */
		ret = actor(desc, page, offset, nr);
		offset += ret;
		index += offset >> PAGE_CACHE_SHIFT;
		offset &= ~PAGE_CACHE_MASK;
		prev_offset = offset;

		page_cache_release(page);
		if (ret == nr && desc->count)
			continue;
		goto out;

page_not_up_to_date:
		/* Get exclusive access to the page ... */
		error = lock_page_killable(page);
		if (unlikely(error))
			goto readpage_error;

page_not_up_to_date_locked:
		/* Did it get truncated before we got the lock? */
		if (!page->mapping) {
			unlock_page(page);
			page_cache_release(page);
			continue;
		}

		/* Did somebody else fill it already? */
		if (PageUptodate(page)) {
			unlock_page(page);
			goto page_ok;
		}

readpage:
		/*
		 * A previous I/O error may have been due to temporary
		 * failures, eg. multipath errors.
		 * PG_error will be set again if readpage fails.
		 */
		ClearPageError(page);
		/* Start the actual read. The read will unlock the page. */
		error = mapping->a_ops->readpage(filp, page);

		if (unlikely(error)) {
			if (error == AOP_TRUNCATED_PAGE) {
				page_cache_release(page);
				goto find_page;
			}
			goto readpage_error;
		}

		if (!PageUptodate(page)) {
			error = lock_page_killable(page);
			if (unlikely(error))
				goto readpage_error;
			if (!PageUptodate(page)) {
				if (page->mapping == NULL) {
					/*
					 * invalidate_mapping_pages got it
					 */
					unlock_page(page);
					page_cache_release(page);
					goto find_page;
				}
				unlock_page(page);
				shrink_readahead_size_eio(filp, ra);
				error = -EIO;
				goto readpage_error;
			}
			unlock_page(page);
		}

		goto page_ok;

readpage_error:
		/* UHHUH! A synchronous read error occurred. Report it */
		desc->error = error;
		page_cache_release(page);
		goto out;

no_cached_page:
		/*
		 * Ok, it wasn't cached, so we need to create a new
		 * page..
		 */
		page = page_cache_alloc_cold(mapping);
		if (!page) {
			desc->error = -ENOMEM;
			goto out;
		}
		error = add_to_page_cache_lru(page, mapping,
						index, GFP_KERNEL);
		if (error) {
			page_cache_release(page);
			if (error == -EEXIST)
				goto find_page;
			desc->error = error;
			goto out;
		}
		goto readpage;
	}

out:
	ra->prev_pos = prev_index;
	ra->prev_pos <<= PAGE_CACHE_SHIFT;
	ra->prev_pos |= prev_offset;

	*ppos = ((loff_t)index << PAGE_CACHE_SHIFT) + offset;
	file_accessed(filp);
}

int file_read_actor(read_descriptor_t *desc, struct page *page,
			unsigned long offset, unsigned long size)
{
	char *kaddr;
	unsigned long left, count = desc->count;

	if (size > count)
		size = count;

	/*
	 * Faults on the destination of a read are common, so do it before
	 * taking the kmap.
	 */
	if (!fault_in_pages_writeable(desc->arg.buf, size)) {
		kaddr = kmap_atomic(page);
		left = __copy_to_user_inatomic(desc->arg.buf,
						kaddr + offset, size);
		kunmap_atomic(kaddr);
		if (left == 0)
			goto success;
	}

	/* Do it the slow way */
	kaddr = kmap(page);
	left = __copy_to_user(desc->arg.buf, kaddr + offset, size);
	kunmap(page);

	if (left) {
		size -= left;
		desc->error = -EFAULT;
	}
success:
	desc->count = count - size;
	desc->written += size;
	desc->arg.buf += size;
	return size;
}

/*
 * Performs necessary checks before doing a write
 * @iov: io vector request
 * @nr_segs: number of segments in the iovec
 * @count: number of bytes to write
 * @access_flags: type of access: %VERIFY_READ or %VERIFY_WRITE
 *
 * Adjust number of segments and amount of bytes to write (nr_segs should be
 * properly initialized first). Returns appropriate error code that caller
 * should return or zero in case that write should be allowed.
 */
int generic_segment_checks(const struct iovec *iov,
			unsigned long *nr_segs, size_t *count, int access_flags)
{
	unsigned long seg;
	size_t cnt = 0;
	for (seg = 0; seg < *nr_segs; seg++) {
		const struct iovec *iv = &iov[seg];

		/*
		 * If any segment has a negative length, or the cumulative
		 * length ever wraps negative then return -EINVAL.
		 */
		cnt += iv->iov_len;
		if (unlikely((ssize_t)(cnt|iv->iov_len) < 0))
			return -EINVAL;
		if (access_ok(access_flags, iv->iov_base, iv->iov_len))
			continue;
		if (seg == 0)
			return -EFAULT;
		*nr_segs = seg;
		cnt -= iv->iov_len;	/* This segment is no good */
		break;
	}
	*count = cnt;
	return 0;
}
EXPORT_SYMBOL(generic_segment_checks);

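generic_segment_checks() trims the iovec at the first inaccessible segment and reports the total number of bytes that may be copied. A hedged sketch of how a write path might call it, illustrative only and not part of the patch:

	/* Illustrative sketch only: validate user iovecs before a write. */
	static ssize_t example_prepare_write(const struct iovec *iov,
					     unsigned long nr_segs)
	{
		size_t count;
		int err;

		/* VERIFY_READ: a write copies *from* the user buffers */
		err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
		if (err)
			return err;
		return count;	/* bytes that may safely be copied */
	}
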
/**
 * generic_file_aio_read - generic filesystem read routine
 * @iocb: kernel I/O control block
 * @iov: io vector request
 * @nr_segs: number of segments in the iovec
 * @pos: current file position
 *
 * This is the "read()" routine for all filesystems
 * that can use the page cache directly.
 */
ssize_t
generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
		unsigned long nr_segs, loff_t pos)
{
	struct file *filp = iocb->ki_filp;
	ssize_t retval;
	unsigned long seg = 0;
	size_t count;
	loff_t *ppos = &iocb->ki_pos;

	count = 0;
	retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
	if (retval)
		return retval;

	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
	if (filp->f_flags & O_DIRECT) {
		loff_t size;
		struct address_space *mapping;
		struct inode *inode;

		mapping = filp->f_mapping;
		inode = mapping->host;
		if (!count)
			goto out; /* skip atime */
		size = i_size_read(inode);
		if (pos < size) {
			retval = filemap_write_and_wait_range(mapping, pos,
					pos + iov_length(iov, nr_segs) - 1);
			if (!retval) {
				retval = mapping->a_ops->direct_IO(READ, iocb,
							iov, pos, nr_segs);
			}
			if (retval > 0) {
				*ppos = pos + retval;
				count -= retval;
			}

			/*
			 * Btrfs can have a short DIO read if we encounter
			 * compressed extents, so if there was an error, or if
			 * we've already read everything we wanted to, or if
			 * there was a short read because we hit EOF, go ahead
			 * and return. Otherwise fallthrough to buffered io for
			 * the rest of the read.
			 */
			if (retval < 0 || !count || *ppos >= size) {
				file_accessed(filp);
				goto out;
			}
		}
	}

	count = retval;
	for (seg = 0; seg < nr_segs; seg++) {
		read_descriptor_t desc;
		loff_t offset = 0;

		/*
		 * If we did a short DIO read we need to skip the section of the
		 * iov that we've already read data into.
		 */
		if (count) {
			if (count > iov[seg].iov_len) {
				count -= iov[seg].iov_len;
				continue;
			}
			offset = count;
			count = 0;
		}

		desc.written = 0;
		desc.arg.buf = iov[seg].iov_base + offset;
		desc.count = iov[seg].iov_len - offset;
		if (desc.count == 0)
			continue;
		desc.error = 0;
		do_generic_file_read(filp, ppos, &desc, file_read_actor);
		retval += desc.written;
		if (desc.error) {
			retval = retval ?: desc.error;
			break;
		}
		if (desc.count > 0)
			break;
	}
out:
	return retval;
}
EXPORT_SYMBOL(generic_file_aio_read);

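Filesystems normally reach generic_file_aio_read() through their file_operations rather than calling it directly. A hedged sketch of the usual wiring in a kernel of this vintage, with an illustrative struct name and not part of this patch:

	static const struct file_operations example_file_ops = {
		.llseek		= generic_file_llseek,
		.read		= do_sync_read,
		.aio_read	= generic_file_aio_read,
		.mmap		= generic_file_mmap,
	};
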
#ifdef CONFIG_MMU
/**
 * page_cache_read - adds requested page to the page cache if not already there
 * @file: file to read
 * @offset: page index
 *
 * This adds the requested page to the page cache if it isn't already there,
 * and schedules an I/O to read in its contents from disk.
 */
static int page_cache_read(struct file *file, pgoff_t offset)
{
	struct address_space *mapping = file->f_mapping;
	struct page *page;
	int ret;

	do {
		page = page_cache_alloc_cold(mapping);
		if (!page)
			return -ENOMEM;

		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
		if (ret == 0)
			ret = mapping->a_ops->readpage(file, page);
		else if (ret == -EEXIST)
			ret = 0; /* losing race to add is OK */

		page_cache_release(page);

	} while (ret == AOP_TRUNCATED_PAGE);

	return ret;
}

#define MMAP_LOTSAMISS (100)

/*
 * Synchronous readahead happens when we don't even find
 * a page in the page cache at all.
 */
static void do_sync_mmap_readahead(struct vm_area_struct *vma,
				   struct file_ra_state *ra,
				   struct file *file,
				   pgoff_t offset)
{
	unsigned long ra_pages;
	struct address_space *mapping = file->f_mapping;

	/* If we don't want any read-ahead, don't bother */
	if (vma->vm_flags & VM_RAND_READ)
		return;
	if (!ra->ra_pages)
		return;

	if (vma->vm_flags & VM_SEQ_READ) {
		page_cache_sync_readahead(mapping, ra, file, offset,
					  ra->ra_pages);
		return;
	}

	/* Avoid banging the cache line if not needed */
	if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
		ra->mmap_miss++;

	/*
	 * Do we miss much more than hit in this file? If so,
	 * stop bothering with read-ahead. It will only hurt.
	 */
	if (ra->mmap_miss > MMAP_LOTSAMISS)
		return;

	/*
	 * mmap read-around
	 */
	ra_pages = max_sane_readahead(ra->ra_pages);
	ra->start = max_t(long, 0, offset - ra_pages / 2);
	ra->size = ra_pages;
	ra->async_size = ra_pages / 4;
	ra_submit(ra, mapping, file);
}

/*
 * Asynchronous readahead happens when we find the page and PG_readahead,
 * so we want to possibly extend the readahead further..
 */
static void do_async_mmap_readahead(struct vm_area_struct *vma,
				    struct file_ra_state *ra,
				    struct file *file,
				    struct page *page,
				    pgoff_t offset)
{
	struct address_space *mapping = file->f_mapping;

	/* If we don't want any read-ahead, don't bother */
	if (vma->vm_flags & VM_RAND_READ)
		return;
	if (ra->mmap_miss > 0)
		ra->mmap_miss--;
	if (PageReadahead(page))
		page_cache_async_readahead(mapping, ra, file,
					   page, offset, ra->ra_pages);
}

/**
 * filemap_fault - read in file data for page fault handling
 * @vma: vma in which the fault was taken
 * @vmf: struct vm_fault containing details of the fault
 *
 * filemap_fault() is invoked via the vma operations vector for a
 * mapped memory region to read in file data during a page fault.
 *
 * The goto's are kind of ugly, but this streamlines the normal case of having
 * it in the page cache, and handles the special cases reasonably without
 * having a lot of duplicated code.
 */
int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	int error;
	struct file *file = vma->vm_file;
	struct address_space *mapping = file->f_mapping;
	struct file_ra_state *ra = &file->f_ra;
	struct inode *inode = mapping->host;
	pgoff_t offset = vmf->pgoff;
	struct page *page;
	pgoff_t size;
	int ret = 0;

	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	if (offset >= size)
		return VM_FAULT_SIGBUS;

	/*
	 * Do we have something in the page cache already?
	 */
	page = find_get_page(mapping, offset);
	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
		/*
		 * We found the page, so try async readahead before
		 * waiting for the lock.
		 */
		do_async_mmap_readahead(vma, ra, file, page, offset);
	} else if (!page) {
		/* No page in the page cache at all */
		do_sync_mmap_readahead(vma, ra, file, offset);
		count_vm_event(PGMAJFAULT);
		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
		ret = VM_FAULT_MAJOR;
retry_find:
		page = find_get_page(mapping, offset);
		if (!page)
			goto no_cached_page;
	}

	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
		page_cache_release(page);
		return ret | VM_FAULT_RETRY;
	}

	/* Did it get truncated? */
	if (unlikely(page->mapping != mapping)) {
		unlock_page(page);
		put_page(page);
		goto retry_find;
	}
	VM_BUG_ON(page->index != offset);

	/*
	 * We have a locked page in the page cache, now we need to check
	 * that it's up-to-date. If not, it is going to be due to an error.
	 */
	if (unlikely(!PageUptodate(page)))
		goto page_not_uptodate;

	/*
	 * Found the page and have a reference on it.
	 * We must recheck i_size under page lock.
	 */
	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	if (unlikely(offset >= size)) {
		unlock_page(page);
		page_cache_release(page);
		return VM_FAULT_SIGBUS;
	}

	vmf->page = page;
	return ret | VM_FAULT_LOCKED;

no_cached_page:
	/*
	 * We're only likely to ever get here if MADV_RANDOM is in
	 * effect.
	 */
	error = page_cache_read(file, offset);

	/*
	 * The page we want has now been added to the page cache.
	 * In the unlikely event that someone removed it in the
	 * meantime, we'll just come back here and read it again.
	 */
	if (error >= 0)
		goto retry_find;

	/*
	 * An error return from page_cache_read can result if the
	 * system is low on memory, or a problem occurs while trying
	 * to schedule I/O.
	 */
	if (error == -ENOMEM)
		return VM_FAULT_OOM;
	return VM_FAULT_SIGBUS;

page_not_uptodate:
	/*
	 * Umm, take care of errors if the page isn't up-to-date.
	 * Try to re-read it _once_. We do this synchronously,
	 * because there really aren't any performance issues here
	 * and we need to check for errors.
	 */
	ClearPageError(page);
	error = mapping->a_ops->readpage(file, page);
	if (!error) {
		wait_on_page_locked(page);
		if (!PageUptodate(page))
			error = -EIO;
	}
	page_cache_release(page);

	if (!error || error == AOP_TRUNCATED_PAGE)
		goto retry_find;

	/* Things didn't work out. Return zero to tell the mm layer so. */
	shrink_readahead_size_eio(file, ra);
	return VM_FAULT_SIGBUS;
}
EXPORT_SYMBOL(filemap_fault);

int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct inode *inode = file_inode(vma->vm_file);
	int ret = VM_FAULT_LOCKED;

	sb_start_pagefault(inode->i_sb);
	file_update_time(vma->vm_file);
	lock_page(page);
	if (page->mapping != inode->i_mapping) {
		unlock_page(page);
		ret = VM_FAULT_NOPAGE;
		goto out;
	}
	/*
	 * We mark the page dirty already here so that when freeze is in
	 * progress, we are guaranteed that writeback during freezing will
	 * see the dirty page and writeprotect it again.
	 */
	set_page_dirty(page);
	wait_for_stable_page(page);
out:
	sb_end_pagefault(inode->i_sb);
	return ret;
}
EXPORT_SYMBOL(filemap_page_mkwrite);

const struct vm_operations_struct generic_file_vm_ops = {
	.fault		= filemap_fault,
	.page_mkwrite	= filemap_page_mkwrite,
	.remap_pages	= generic_file_remap_pages,
};

/* This is used for a general mmap of a disk file */

int generic_file_mmap(struct file * file, struct vm_area_struct * vma)
{
	struct address_space *mapping = file->f_mapping;

	if (!mapping->a_ops->readpage)
		return -ENOEXEC;
	file_accessed(file);
	vma->vm_ops = &generic_file_vm_ops;
	return 0;
}

/*
 * This is for filesystems which do not implement ->writepage.
 */
int generic_file_readonly_mmap(struct file *file, struct vm_area_struct *vma)
{
	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
		return -EINVAL;
	return generic_file_mmap(file, vma);
}
#else
int generic_file_mmap(struct file * file, struct vm_area_struct * vma)
{
	return -ENOSYS;
}
int generic_file_readonly_mmap(struct file * file, struct vm_area_struct * vma)
{
	return -ENOSYS;
}
#endif /* CONFIG_MMU */

EXPORT_SYMBOL(generic_file_mmap);
EXPORT_SYMBOL(generic_file_readonly_mmap);

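Filesystems that want their own page_mkwrite handling typically still reuse filemap_fault() instead of taking generic_file_mmap() wholesale. A hedged sketch of that pattern, with made-up names and not part of this patch:

	static const struct vm_operations_struct example_vm_ops = {
		.fault		= filemap_fault,
		.page_mkwrite	= filemap_page_mkwrite,
		.remap_pages	= generic_file_remap_pages,
	};

	static int example_mmap(struct file *file, struct vm_area_struct *vma)
	{
		file_accessed(file);
		vma->vm_ops = &example_vm_ops;
		return 0;
	}
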
static struct page *wait_on_page_read(struct page *page)
{
	if (!IS_ERR(page)) {
		wait_on_page_locked(page);
		if (!PageUptodate(page)) {
			page_cache_release(page);
			page = ERR_PTR(-EIO);
		}
	}
	return page;
}

static struct page *__read_cache_page(struct address_space *mapping,
				pgoff_t index,
				int (*filler)(void *, struct page *),
				void *data,
				gfp_t gfp)
{
	struct page *page;
	int err;
repeat:
	page = find_get_page(mapping, index);
	if (!page) {
		page = __page_cache_alloc(gfp | __GFP_COLD);
		if (!page)
			return ERR_PTR(-ENOMEM);
		err = add_to_page_cache_lru(page, mapping, index, gfp);
		if (unlikely(err)) {
			page_cache_release(page);
			if (err == -EEXIST)
				goto repeat;
			/* Presumably ENOMEM for radix tree node */
			return ERR_PTR(err);
		}
		err = filler(data, page);
		if (err < 0) {
			page_cache_release(page);
			page = ERR_PTR(err);
		} else {
			page = wait_on_page_read(page);
		}
	}
	return page;
}

static struct page *do_read_cache_page(struct address_space *mapping,
				pgoff_t index,
				int (*filler)(void *, struct page *),
				void *data,
				gfp_t gfp)

{
	struct page *page;
	int err;

retry:
	page = __read_cache_page(mapping, index, filler, data, gfp);
	if (IS_ERR(page))
		return page;
	if (PageUptodate(page))
		goto out;

	lock_page(page);
	if (!page->mapping) {
		unlock_page(page);
		page_cache_release(page);
		goto retry;
	}
	if (PageUptodate(page)) {
		unlock_page(page);
		goto out;
	}
	err = filler(data, page);
	if (err < 0) {
		page_cache_release(page);
		return ERR_PTR(err);
	} else {
		page = wait_on_page_read(page);
		if (IS_ERR(page))
			return page;
	}
out:
	mark_page_accessed(page);
	return page;
}

/**
 * read_cache_page - read into page cache, fill it if needed
 * @mapping: the page's address_space
 * @index: the page index
 * @filler: function to perform the read
 * @data: first arg to filler(data, page) function, often left as NULL
 *
 * Read into the page cache. If a page already exists, and PageUptodate() is
 * not set, try to fill the page and wait for it to become unlocked.
 *
 * If the page does not get brought uptodate, return -EIO.
 */
struct page *read_cache_page(struct address_space *mapping,
				pgoff_t index,
				int (*filler)(void *, struct page *),
				void *data)
{
	return do_read_cache_page(mapping, index, filler, data, mapping_gfp_mask(mapping));
}
EXPORT_SYMBOL(read_cache_page);

2139 /** 2105 /**
2140 * read_cache_page_gfp - read into page cache, using specified page allocation flags. 2106 * read_cache_page_gfp - read into page cache, using specified page allocation flags.
2141 * @mapping: the page's address_space 2107 * @mapping: the page's address_space
2142 * @index: the page index 2108 * @index: the page index
2143 * @gfp: the page allocator flags to use if allocating 2109 * @gfp: the page allocator flags to use if allocating
2144 * 2110 *
2145 * This is the same as "read_mapping_page(mapping, index, NULL)", but with 2111 * This is the same as "read_mapping_page(mapping, index, NULL)", but with
2146 * any new page allocations done using the specified allocation flags. 2112 * any new page allocations done using the specified allocation flags.
2147 * 2113 *
2148 * If the page does not get brought uptodate, return -EIO. 2114 * If the page does not get brought uptodate, return -EIO.
2149 */ 2115 */
2150 struct page *read_cache_page_gfp(struct address_space *mapping, 2116 struct page *read_cache_page_gfp(struct address_space *mapping,
2151 pgoff_t index, 2117 pgoff_t index,
2152 gfp_t gfp) 2118 gfp_t gfp)
2153 { 2119 {
2154 filler_t *filler = (filler_t *)mapping->a_ops->readpage; 2120 filler_t *filler = (filler_t *)mapping->a_ops->readpage;
2155 2121
2156 return do_read_cache_page(mapping, index, filler, NULL, gfp); 2122 return do_read_cache_page(mapping, index, filler, NULL, gfp);
2157 } 2123 }
2158 EXPORT_SYMBOL(read_cache_page_gfp); 2124 EXPORT_SYMBOL(read_cache_page_gfp);
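As a usage illustration only (the helper name below is hypothetical), a filesystem can hand its own ->readpage to read_cache_page() the same way read_cache_page_gfp() above does; the page that comes back is uptodate, referenced, and already marked accessed by do_read_cache_page():

	/*
	 * Illustrative sketch: read page @index of @mapping through the page
	 * cache, using the mapping's own ->readpage as the filler.
	 */
	static struct page *example_read_meta_page(struct address_space *mapping,
						   pgoff_t index)
	{
		filler_t *filler = (filler_t *)mapping->a_ops->readpage;
		struct page *page;

		page = read_cache_page(mapping, index, filler, NULL);
		if (IS_ERR(page))
			return page;		/* typically ERR_PTR(-EIO) */

		/* Uptodate and marked accessed; drop with page_cache_release(). */
		return page;
	}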
2159 2125
2160 static size_t __iovec_copy_from_user_inatomic(char *vaddr, 2126 static size_t __iovec_copy_from_user_inatomic(char *vaddr,
2161 const struct iovec *iov, size_t base, size_t bytes) 2127 const struct iovec *iov, size_t base, size_t bytes)
2162 { 2128 {
2163 size_t copied = 0, left = 0; 2129 size_t copied = 0, left = 0;
2164 2130
2165 while (bytes) { 2131 while (bytes) {
2166 char __user *buf = iov->iov_base + base; 2132 char __user *buf = iov->iov_base + base;
2167 int copy = min(bytes, iov->iov_len - base); 2133 int copy = min(bytes, iov->iov_len - base);
2168 2134
2169 base = 0; 2135 base = 0;
2170 left = __copy_from_user_inatomic(vaddr, buf, copy); 2136 left = __copy_from_user_inatomic(vaddr, buf, copy);
2171 copied += copy; 2137 copied += copy;
2172 bytes -= copy; 2138 bytes -= copy;
2173 vaddr += copy; 2139 vaddr += copy;
2174 iov++; 2140 iov++;
2175 2141
2176 if (unlikely(left)) 2142 if (unlikely(left))
2177 break; 2143 break;
2178 } 2144 }
2179 return copied - left; 2145 return copied - left;
2180 } 2146 }
2181 2147
2182 /* 2148 /*
2183 * Copy as much as we can into the page and return the number of bytes which 2149 * Copy as much as we can into the page and return the number of bytes which
2184 * were successfully copied. If a fault is encountered then return the number of 2150 * were successfully copied. If a fault is encountered then return the number of
2185 * bytes which were copied. 2151 * bytes which were copied.
2186 */ 2152 */
2187 size_t iov_iter_copy_from_user_atomic(struct page *page, 2153 size_t iov_iter_copy_from_user_atomic(struct page *page,
2188 struct iov_iter *i, unsigned long offset, size_t bytes) 2154 struct iov_iter *i, unsigned long offset, size_t bytes)
2189 { 2155 {
2190 char *kaddr; 2156 char *kaddr;
2191 size_t copied; 2157 size_t copied;
2192 2158
2193 kaddr = kmap_atomic(page); 2159 kaddr = kmap_atomic(page);
2194 if (likely(i->nr_segs == 1)) { 2160 if (likely(i->nr_segs == 1)) {
2195 int left; 2161 int left;
2196 char __user *buf = i->iov->iov_base + i->iov_offset; 2162 char __user *buf = i->iov->iov_base + i->iov_offset;
2197 left = __copy_from_user_inatomic(kaddr + offset, buf, bytes); 2163 left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
2198 copied = bytes - left; 2164 copied = bytes - left;
2199 } else { 2165 } else {
2200 copied = __iovec_copy_from_user_inatomic(kaddr + offset, 2166 copied = __iovec_copy_from_user_inatomic(kaddr + offset,
2201 i->iov, i->iov_offset, bytes); 2167 i->iov, i->iov_offset, bytes);
2202 } 2168 }
2203 kunmap_atomic(kaddr); 2169 kunmap_atomic(kaddr);
2204 2170
2205 return copied; 2171 return copied;
2206 } 2172 }
2207 EXPORT_SYMBOL(iov_iter_copy_from_user_atomic); 2173 EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
2208 2174
2209 /* 2175 /*
2210 * This has the same side effects and return value as 2176 * This has the same side effects and return value as
2211 * iov_iter_copy_from_user_atomic(). 2177 * iov_iter_copy_from_user_atomic().
2212 * The difference is that it attempts to resolve faults. 2178 * The difference is that it attempts to resolve faults.
2213 * Page must not be locked. 2179 * Page must not be locked.
2214 */ 2180 */
2215 size_t iov_iter_copy_from_user(struct page *page, 2181 size_t iov_iter_copy_from_user(struct page *page,
2216 struct iov_iter *i, unsigned long offset, size_t bytes) 2182 struct iov_iter *i, unsigned long offset, size_t bytes)
2217 { 2183 {
2218 char *kaddr; 2184 char *kaddr;
2219 size_t copied; 2185 size_t copied;
2220 2186
2221 kaddr = kmap(page); 2187 kaddr = kmap(page);
2222 if (likely(i->nr_segs == 1)) { 2188 if (likely(i->nr_segs == 1)) {
2223 int left; 2189 int left;
2224 char __user *buf = i->iov->iov_base + i->iov_offset; 2190 char __user *buf = i->iov->iov_base + i->iov_offset;
2225 left = __copy_from_user(kaddr + offset, buf, bytes); 2191 left = __copy_from_user(kaddr + offset, buf, bytes);
2226 copied = bytes - left; 2192 copied = bytes - left;
2227 } else { 2193 } else {
2228 copied = __iovec_copy_from_user_inatomic(kaddr + offset, 2194 copied = __iovec_copy_from_user_inatomic(kaddr + offset,
2229 i->iov, i->iov_offset, bytes); 2195 i->iov, i->iov_offset, bytes);
2230 } 2196 }
2231 kunmap(page); 2197 kunmap(page);
2232 return copied; 2198 return copied;
2233 } 2199 }
2234 EXPORT_SYMBOL(iov_iter_copy_from_user); 2200 EXPORT_SYMBOL(iov_iter_copy_from_user);
2235 2201
2236 void iov_iter_advance(struct iov_iter *i, size_t bytes) 2202 void iov_iter_advance(struct iov_iter *i, size_t bytes)
2237 { 2203 {
2238 BUG_ON(i->count < bytes); 2204 BUG_ON(i->count < bytes);
2239 2205
2240 if (likely(i->nr_segs == 1)) { 2206 if (likely(i->nr_segs == 1)) {
2241 i->iov_offset += bytes; 2207 i->iov_offset += bytes;
2242 i->count -= bytes; 2208 i->count -= bytes;
2243 } else { 2209 } else {
2244 const struct iovec *iov = i->iov; 2210 const struct iovec *iov = i->iov;
2245 size_t base = i->iov_offset; 2211 size_t base = i->iov_offset;
2246 unsigned long nr_segs = i->nr_segs; 2212 unsigned long nr_segs = i->nr_segs;
2247 2213
2248 /* 2214 /*
2249 * The !iov->iov_len check ensures we skip over unlikely 2215 * The !iov->iov_len check ensures we skip over unlikely
2250 * zero-length segments (without overrunning the iovec). 2216 * zero-length segments (without overrunning the iovec).
2251 */ 2217 */
2252 while (bytes || unlikely(i->count && !iov->iov_len)) { 2218 while (bytes || unlikely(i->count && !iov->iov_len)) {
2253 int copy; 2219 int copy;
2254 2220
2255 copy = min(bytes, iov->iov_len - base); 2221 copy = min(bytes, iov->iov_len - base);
2256 BUG_ON(!i->count || i->count < copy); 2222 BUG_ON(!i->count || i->count < copy);
2257 i->count -= copy; 2223 i->count -= copy;
2258 bytes -= copy; 2224 bytes -= copy;
2259 base += copy; 2225 base += copy;
2260 if (iov->iov_len == base) { 2226 if (iov->iov_len == base) {
2261 iov++; 2227 iov++;
2262 nr_segs--; 2228 nr_segs--;
2263 base = 0; 2229 base = 0;
2264 } 2230 }
2265 } 2231 }
2266 i->iov = iov; 2232 i->iov = iov;
2267 i->iov_offset = base; 2233 i->iov_offset = base;
2268 i->nr_segs = nr_segs; 2234 i->nr_segs = nr_segs;
2269 } 2235 }
2270 } 2236 }
2271 EXPORT_SYMBOL(iov_iter_advance); 2237 EXPORT_SYMBOL(iov_iter_advance);
2272 2238
2273 /* 2239 /*
2274 * Fault in the first iovec of the given iov_iter, to a maximum length 2240 * Fault in the first iovec of the given iov_iter, to a maximum length
2275 * of bytes. Returns 0 on success, or non-zero if the memory could not be 2241 * of bytes. Returns 0 on success, or non-zero if the memory could not be
2276 * accessed (ie. because it is an invalid address). 2242 * accessed (ie. because it is an invalid address).
2277 * 2243 *
2278 * writev-intensive code may want this to prefault several iovecs -- that 2244 * writev-intensive code may want this to prefault several iovecs -- that
2279 * would be possible (callers must not rely on the fact that _only_ the 2245 * would be possible (callers must not rely on the fact that _only_ the
2280 * first iovec will be faulted with the current implementation). 2246 * first iovec will be faulted with the current implementation).
2281 */ 2247 */
2282 int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes) 2248 int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
2283 { 2249 {
2284 char __user *buf = i->iov->iov_base + i->iov_offset; 2250 char __user *buf = i->iov->iov_base + i->iov_offset;
2285 bytes = min(bytes, i->iov->iov_len - i->iov_offset); 2251 bytes = min(bytes, i->iov->iov_len - i->iov_offset);
2286 return fault_in_pages_readable(buf, bytes); 2252 return fault_in_pages_readable(buf, bytes);
2287 } 2253 }
2288 EXPORT_SYMBOL(iov_iter_fault_in_readable); 2254 EXPORT_SYMBOL(iov_iter_fault_in_readable);
2289 2255
2290 /* 2256 /*
2291 * Return the count of just the current iov_iter segment. 2257 * Return the count of just the current iov_iter segment.
2292 */ 2258 */
2293 size_t iov_iter_single_seg_count(const struct iov_iter *i) 2259 size_t iov_iter_single_seg_count(const struct iov_iter *i)
2294 { 2260 {
2295 const struct iovec *iov = i->iov; 2261 const struct iovec *iov = i->iov;
2296 if (i->nr_segs == 1) 2262 if (i->nr_segs == 1)
2297 return i->count; 2263 return i->count;
2298 else 2264 else
2299 return min(i->count, iov->iov_len - i->iov_offset); 2265 return min(i->count, iov->iov_len - i->iov_offset);
2300 } 2266 }
2301 EXPORT_SYMBOL(iov_iter_single_seg_count); 2267 EXPORT_SYMBOL(iov_iter_single_seg_count);
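The iov_iter helpers above are meant to be used together as a prefault/copy/advance sequence. The sketch below (function name made up) is illustrative only and condenses the pattern that generic_perform_write() implements in full further down:

	/*
	 * Illustrative only: prefault the user buffer, copy what we can
	 * atomically, then advance the iterator by the amount copied.
	 */
	static ssize_t example_copy_iov_to_page(struct page *page, struct iov_iter *i,
						unsigned long offset, size_t bytes)
	{
		size_t copied;

		/* Must not be done with the destination page locked (deadlock). */
		if (iov_iter_fault_in_readable(i, bytes))
			return -EFAULT;

		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
		iov_iter_advance(i, copied);

		/*
		 * copied == 0 means the prefaulted data went away again; callers
		 * such as generic_perform_write() retry with a limit of
		 * iov_iter_single_seg_count() to avoid livelocking.
		 */
		return copied;
	}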
2302 2268
2303 /* 2269 /*
2304 * Performs necessary checks before doing a write 2270 * Performs necessary checks before doing a write
2305 * 2271 *
2306 * Can adjust writing position or amount of bytes to write. 2272 * Can adjust writing position or amount of bytes to write.
2307 * Returns the appropriate error code that the caller should return, or 2273 * Returns the appropriate error code that the caller should return, or
2308 * zero if the write should be allowed. 2274 * zero if the write should be allowed.
2309 */ 2275 */
2310 inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk) 2276 inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk)
2311 { 2277 {
2312 struct inode *inode = file->f_mapping->host; 2278 struct inode *inode = file->f_mapping->host;
2313 unsigned long limit = rlimit(RLIMIT_FSIZE); 2279 unsigned long limit = rlimit(RLIMIT_FSIZE);
2314 2280
2315 if (unlikely(*pos < 0)) 2281 if (unlikely(*pos < 0))
2316 return -EINVAL; 2282 return -EINVAL;
2317 2283
2318 if (!isblk) { 2284 if (!isblk) {
2319 /* FIXME: this is for backwards compatibility with 2.4 */ 2285 /* FIXME: this is for backwards compatibility with 2.4 */
2320 if (file->f_flags & O_APPEND) 2286 if (file->f_flags & O_APPEND)
2321 *pos = i_size_read(inode); 2287 *pos = i_size_read(inode);
2322 2288
2323 if (limit != RLIM_INFINITY) { 2289 if (limit != RLIM_INFINITY) {
2324 if (*pos >= limit) { 2290 if (*pos >= limit) {
2325 send_sig(SIGXFSZ, current, 0); 2291 send_sig(SIGXFSZ, current, 0);
2326 return -EFBIG; 2292 return -EFBIG;
2327 } 2293 }
2328 if (*count > limit - (typeof(limit))*pos) { 2294 if (*count > limit - (typeof(limit))*pos) {
2329 *count = limit - (typeof(limit))*pos; 2295 *count = limit - (typeof(limit))*pos;
2330 } 2296 }
2331 } 2297 }
2332 } 2298 }
2333 2299
2334 /* 2300 /*
2335 * LFS rule 2301 * LFS rule
2336 */ 2302 */
2337 if (unlikely(*pos + *count > MAX_NON_LFS && 2303 if (unlikely(*pos + *count > MAX_NON_LFS &&
2338 !(file->f_flags & O_LARGEFILE))) { 2304 !(file->f_flags & O_LARGEFILE))) {
2339 if (*pos >= MAX_NON_LFS) { 2305 if (*pos >= MAX_NON_LFS) {
2340 return -EFBIG; 2306 return -EFBIG;
2341 } 2307 }
2342 if (*count > MAX_NON_LFS - (unsigned long)*pos) { 2308 if (*count > MAX_NON_LFS - (unsigned long)*pos) {
2343 *count = MAX_NON_LFS - (unsigned long)*pos; 2309 *count = MAX_NON_LFS - (unsigned long)*pos;
2344 } 2310 }
2345 } 2311 }
2346 2312
2347 /* 2313 /*
2348 * Are we about to exceed the fs block limit ? 2314 * Are we about to exceed the fs block limit ?
2349 * 2315 *
2350 * If we have written data it becomes a short write. If we have 2316 * If we have written data it becomes a short write. If we have
2351 * exceeded without writing data we send a signal and return EFBIG. 2317 * exceeded without writing data we send a signal and return EFBIG.
2352 * Linus frestrict idea will clean these up nicely.. 2318 * Linus frestrict idea will clean these up nicely..
2353 */ 2319 */
2354 if (likely(!isblk)) { 2320 if (likely(!isblk)) {
2355 if (unlikely(*pos >= inode->i_sb->s_maxbytes)) { 2321 if (unlikely(*pos >= inode->i_sb->s_maxbytes)) {
2356 if (*count || *pos > inode->i_sb->s_maxbytes) { 2322 if (*count || *pos > inode->i_sb->s_maxbytes) {
2357 return -EFBIG; 2323 return -EFBIG;
2358 } 2324 }
2359 /* zero-length writes at ->s_maxbytes are OK */ 2325 /* zero-length writes at ->s_maxbytes are OK */
2360 } 2326 }
2361 2327
2362 if (unlikely(*pos + *count > inode->i_sb->s_maxbytes)) 2328 if (unlikely(*pos + *count > inode->i_sb->s_maxbytes))
2363 *count = inode->i_sb->s_maxbytes - *pos; 2329 *count = inode->i_sb->s_maxbytes - *pos;
2364 } else { 2330 } else {
2365 #ifdef CONFIG_BLOCK 2331 #ifdef CONFIG_BLOCK
2366 loff_t isize; 2332 loff_t isize;
2367 if (bdev_read_only(I_BDEV(inode))) 2333 if (bdev_read_only(I_BDEV(inode)))
2368 return -EPERM; 2334 return -EPERM;
2369 isize = i_size_read(inode); 2335 isize = i_size_read(inode);
2370 if (*pos >= isize) { 2336 if (*pos >= isize) {
2371 if (*count || *pos > isize) 2337 if (*count || *pos > isize)
2372 return -ENOSPC; 2338 return -ENOSPC;
2373 } 2339 }
2374 2340
2375 if (*pos + *count > isize) 2341 if (*pos + *count > isize)
2376 *count = isize - *pos; 2342 *count = isize - *pos;
2377 #else 2343 #else
2378 return -EPERM; 2344 return -EPERM;
2379 #endif 2345 #endif
2380 } 2346 }
2381 return 0; 2347 return 0;
2382 } 2348 }
2383 EXPORT_SYMBOL(generic_write_checks); 2349 EXPORT_SYMBOL(generic_write_checks);
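A hedged sketch of a typical caller (the function name is hypothetical). For a regular file isblk is 0, and the caller must honour both the possibly moved position and the possibly shrunken count:

	/*
	 * Illustrative only: generic_write_checks() may move *pos (O_APPEND),
	 * shrink *count (RLIMIT_FSIZE, MAX_NON_LFS, s_maxbytes) or fail outright.
	 */
	static int example_prepare_write(struct file *file, loff_t *pos, size_t *count)
	{
		int err = generic_write_checks(file, pos, count, 0 /* not a blkdev */);

		if (err)
			return err;	/* e.g. -EINVAL or -EFBIG */
		if (*count == 0)
			return 0;	/* nothing left to write after clamping */

		/* ... continue with the buffered or direct write path ... */
		return 0;
	}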
2384 2350
2385 int pagecache_write_begin(struct file *file, struct address_space *mapping, 2351 int pagecache_write_begin(struct file *file, struct address_space *mapping,
2386 loff_t pos, unsigned len, unsigned flags, 2352 loff_t pos, unsigned len, unsigned flags,
2387 struct page **pagep, void **fsdata) 2353 struct page **pagep, void **fsdata)
2388 { 2354 {
2389 const struct address_space_operations *aops = mapping->a_ops; 2355 const struct address_space_operations *aops = mapping->a_ops;
2390 2356
2391 return aops->write_begin(file, mapping, pos, len, flags, 2357 return aops->write_begin(file, mapping, pos, len, flags,
2392 pagep, fsdata); 2358 pagep, fsdata);
2393 } 2359 }
2394 EXPORT_SYMBOL(pagecache_write_begin); 2360 EXPORT_SYMBOL(pagecache_write_begin);
2395 2361
2396 int pagecache_write_end(struct file *file, struct address_space *mapping, 2362 int pagecache_write_end(struct file *file, struct address_space *mapping,
2397 loff_t pos, unsigned len, unsigned copied, 2363 loff_t pos, unsigned len, unsigned copied,
2398 struct page *page, void *fsdata) 2364 struct page *page, void *fsdata)
2399 { 2365 {
2400 const struct address_space_operations *aops = mapping->a_ops; 2366 const struct address_space_operations *aops = mapping->a_ops;
2401 2367
2402 mark_page_accessed(page);
2403 return aops->write_end(file, mapping, pos, len, copied, page, fsdata); 2368 return aops->write_end(file, mapping, pos, len, copied, page, fsdata);
2404 } 2369 }
2405 EXPORT_SYMBOL(pagecache_write_end); 2370 EXPORT_SYMBOL(pagecache_write_end);
2406 2371
2407 ssize_t 2372 ssize_t
2408 generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov, 2373 generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
2409 unsigned long *nr_segs, loff_t pos, loff_t *ppos, 2374 unsigned long *nr_segs, loff_t pos, loff_t *ppos,
2410 size_t count, size_t ocount) 2375 size_t count, size_t ocount)
2411 { 2376 {
2412 struct file *file = iocb->ki_filp; 2377 struct file *file = iocb->ki_filp;
2413 struct address_space *mapping = file->f_mapping; 2378 struct address_space *mapping = file->f_mapping;
2414 struct inode *inode = mapping->host; 2379 struct inode *inode = mapping->host;
2415 ssize_t written; 2380 ssize_t written;
2416 size_t write_len; 2381 size_t write_len;
2417 pgoff_t end; 2382 pgoff_t end;
2418 2383
2419 if (count != ocount) 2384 if (count != ocount)
2420 *nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count); 2385 *nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count);
2421 2386
2422 write_len = iov_length(iov, *nr_segs); 2387 write_len = iov_length(iov, *nr_segs);
2423 end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT; 2388 end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;
2424 2389
2425 written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1); 2390 written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
2426 if (written) 2391 if (written)
2427 goto out; 2392 goto out;
2428 2393
2429 /* 2394 /*
2430 * After a write we want buffered reads to be sure to go to disk to get 2395 * After a write we want buffered reads to be sure to go to disk to get
2431 * the new data. We invalidate clean cached page from the region we're 2396 * the new data. We invalidate clean cached page from the region we're
2432 * about to write. We do this *before* the write so that we can return 2397 * about to write. We do this *before* the write so that we can return
2433 * without clobbering -EIOCBQUEUED from ->direct_IO(). 2398 * without clobbering -EIOCBQUEUED from ->direct_IO().
2434 */ 2399 */
2435 if (mapping->nrpages) { 2400 if (mapping->nrpages) {
2436 written = invalidate_inode_pages2_range(mapping, 2401 written = invalidate_inode_pages2_range(mapping,
2437 pos >> PAGE_CACHE_SHIFT, end); 2402 pos >> PAGE_CACHE_SHIFT, end);
2438 /* 2403 /*
2439 * If a page can not be invalidated, return 0 to fall back 2404 * If a page can not be invalidated, return 0 to fall back
2440 * to buffered write. 2405 * to buffered write.
2441 */ 2406 */
2442 if (written) { 2407 if (written) {
2443 if (written == -EBUSY) 2408 if (written == -EBUSY)
2444 return 0; 2409 return 0;
2445 goto out; 2410 goto out;
2446 } 2411 }
2447 } 2412 }
2448 2413
2449 written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs); 2414 written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
2450 2415
2451 /* 2416 /*
2452 * Finally, try again to invalidate clean pages which might have been 2417 * Finally, try again to invalidate clean pages which might have been
2453 * cached by non-direct readahead, or faulted in by get_user_pages() 2418 * cached by non-direct readahead, or faulted in by get_user_pages()
2454 * if the source of the write was an mmap'ed region of the file 2419 * if the source of the write was an mmap'ed region of the file
2455 * we're writing. Either one is a pretty crazy thing to do, 2420 * we're writing. Either one is a pretty crazy thing to do,
2456 * so we don't support it 100%. If this invalidation 2421 * so we don't support it 100%. If this invalidation
2457 * fails, tough, the write still worked... 2422 * fails, tough, the write still worked...
2458 */ 2423 */
2459 if (mapping->nrpages) { 2424 if (mapping->nrpages) {
2460 invalidate_inode_pages2_range(mapping, 2425 invalidate_inode_pages2_range(mapping,
2461 pos >> PAGE_CACHE_SHIFT, end); 2426 pos >> PAGE_CACHE_SHIFT, end);
2462 } 2427 }
2463 2428
2464 if (written > 0) { 2429 if (written > 0) {
2465 pos += written; 2430 pos += written;
2466 if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) { 2431 if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
2467 i_size_write(inode, pos); 2432 i_size_write(inode, pos);
2468 mark_inode_dirty(inode); 2433 mark_inode_dirty(inode);
2469 } 2434 }
2470 *ppos = pos; 2435 *ppos = pos;
2471 } 2436 }
2472 out: 2437 out:
2473 return written; 2438 return written;
2474 } 2439 }
2475 EXPORT_SYMBOL(generic_file_direct_write); 2440 EXPORT_SYMBOL(generic_file_direct_write);
2476 2441
2477 /* 2442 /*
2478 * Find or create a page at the given pagecache position. Return the locked 2443 * Find or create a page at the given pagecache position. Return the locked
2479 * page. This function is specifically for buffered writes. 2444 * page. This function is specifically for buffered writes.
2480 */ 2445 */
2481 struct page *grab_cache_page_write_begin(struct address_space *mapping, 2446 struct page *grab_cache_page_write_begin(struct address_space *mapping,
2482 pgoff_t index, unsigned flags) 2447 pgoff_t index, unsigned flags)
2483 { 2448 {
2484 int status;
2485 gfp_t gfp_mask;
2486 struct page *page; 2449 struct page *page;
2487 gfp_t gfp_notmask = 0; 2450 int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;
2488 2451
2489 gfp_mask = mapping_gfp_mask(mapping);
2490 if (mapping_cap_account_dirty(mapping))
2491 gfp_mask |= __GFP_WRITE;
2492 if (flags & AOP_FLAG_NOFS) 2452 if (flags & AOP_FLAG_NOFS)
2493 gfp_notmask = __GFP_FS; 2453 fgp_flags |= FGP_NOFS;
2494 repeat: 2454
2495 page = find_lock_page(mapping, index); 2455 page = pagecache_get_page(mapping, index, fgp_flags,
2456 mapping_gfp_mask(mapping),
2457 GFP_KERNEL);
2496 if (page) 2458 if (page)
2497 goto found; 2459 wait_for_stable_page(page);
2498 2460
2499 page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
2500 if (!page)
2501 return NULL;
2502 status = add_to_page_cache_lru(page, mapping, index,
2503 GFP_KERNEL & ~gfp_notmask);
2504 if (unlikely(status)) {
2505 page_cache_release(page);
2506 if (status == -EEXIST)
2507 goto repeat;
2508 return NULL;
2509 }
2510 found:
2511 wait_for_stable_page(page);
2512 return page; 2461 return page;
2513 } 2462 }
2514 EXPORT_SYMBOL(grab_cache_page_write_begin); 2463 EXPORT_SYMBOL(grab_cache_page_write_begin);
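Because FGP_ACCESSED is now part of the flags passed to pagecache_get_page(), the page returned here is already marked accessed, so ->write_begin implementations no longer need their own mark_page_accessed() call. The sketch below is illustrative, in the spirit of simple_write_begin(); it is not code from this patch:

	/*
	 * Illustrative ->write_begin: the page comes back locked, waited for
	 * stability and already marked accessed via FGP_ACCESSED.
	 */
	static int example_write_begin(struct file *file, struct address_space *mapping,
				       loff_t pos, unsigned len, unsigned flags,
				       struct page **pagep, void **fsdata)
	{
		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
		struct page *page;

		page = grab_cache_page_write_begin(mapping, index, flags);
		if (!page)
			return -ENOMEM;

		if (!PageUptodate(page) && len != PAGE_CACHE_SIZE) {
			unsigned from = pos & (PAGE_CACHE_SIZE - 1);

			/* Zero the parts the copy will not overwrite. */
			zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
		}

		*pagep = page;
		return 0;
	}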
2515 2464
2516 static ssize_t generic_perform_write(struct file *file, 2465 static ssize_t generic_perform_write(struct file *file,
2517 struct iov_iter *i, loff_t pos) 2466 struct iov_iter *i, loff_t pos)
2518 { 2467 {
2519 struct address_space *mapping = file->f_mapping; 2468 struct address_space *mapping = file->f_mapping;
2520 const struct address_space_operations *a_ops = mapping->a_ops; 2469 const struct address_space_operations *a_ops = mapping->a_ops;
2521 long status = 0; 2470 long status = 0;
2522 ssize_t written = 0; 2471 ssize_t written = 0;
2523 unsigned int flags = 0; 2472 unsigned int flags = 0;
2524 2473
2525 /* 2474 /*
2526 * Copies from kernel address space cannot fail (NFSD is a big user). 2475 * Copies from kernel address space cannot fail (NFSD is a big user).
2527 */ 2476 */
2528 if (segment_eq(get_fs(), KERNEL_DS)) 2477 if (segment_eq(get_fs(), KERNEL_DS))
2529 flags |= AOP_FLAG_UNINTERRUPTIBLE; 2478 flags |= AOP_FLAG_UNINTERRUPTIBLE;
2530 2479
2531 do { 2480 do {
2532 struct page *page; 2481 struct page *page;
2533 unsigned long offset; /* Offset into pagecache page */ 2482 unsigned long offset; /* Offset into pagecache page */
2534 unsigned long bytes; /* Bytes to write to page */ 2483 unsigned long bytes; /* Bytes to write to page */
2535 size_t copied; /* Bytes copied from user */ 2484 size_t copied; /* Bytes copied from user */
2536 void *fsdata; 2485 void *fsdata;
2537 2486
2538 offset = (pos & (PAGE_CACHE_SIZE - 1)); 2487 offset = (pos & (PAGE_CACHE_SIZE - 1));
2539 bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, 2488 bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
2540 iov_iter_count(i)); 2489 iov_iter_count(i));
2541 2490
2542 again: 2491 again:
2543 /* 2492 /*
2544 * Bring in the user page that we will copy from _first_. 2493 * Bring in the user page that we will copy from _first_.
2545 * Otherwise there's a nasty deadlock on copying from the 2494 * Otherwise there's a nasty deadlock on copying from the
2546 * same page as we're writing to, without it being marked 2495 * same page as we're writing to, without it being marked
2547 * up-to-date. 2496 * up-to-date.
2548 * 2497 *
2549 * Not only is this an optimisation, but it is also required 2498 * Not only is this an optimisation, but it is also required
2550 * to check that the address is actually valid, when atomic 2499 * to check that the address is actually valid, when atomic
2551 * usercopies are used, below. 2500 * usercopies are used, below.
2552 */ 2501 */
2553 if (unlikely(iov_iter_fault_in_readable(i, bytes))) { 2502 if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
2554 status = -EFAULT; 2503 status = -EFAULT;
2555 break; 2504 break;
2556 } 2505 }
2557 2506
2558 status = a_ops->write_begin(file, mapping, pos, bytes, flags, 2507 status = a_ops->write_begin(file, mapping, pos, bytes, flags,
2559 &page, &fsdata); 2508 &page, &fsdata);
2560 if (unlikely(status)) 2509 if (unlikely(status < 0))
2561 break; 2510 break;
2562 2511
2563 if (mapping_writably_mapped(mapping)) 2512 if (mapping_writably_mapped(mapping))
2564 flush_dcache_page(page); 2513 flush_dcache_page(page);
2565 2514
2566 copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); 2515 copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
2567 flush_dcache_page(page); 2516 flush_dcache_page(page);
2568 2517
2569 mark_page_accessed(page);
2570 status = a_ops->write_end(file, mapping, pos, bytes, copied, 2518 status = a_ops->write_end(file, mapping, pos, bytes, copied,
2571 page, fsdata); 2519 page, fsdata);
2572 if (unlikely(status < 0)) 2520 if (unlikely(status < 0))
2573 break; 2521 break;
2574 copied = status; 2522 copied = status;
2575 2523
2576 cond_resched(); 2524 cond_resched();
2577 2525
2578 iov_iter_advance(i, copied); 2526 iov_iter_advance(i, copied);
2579 if (unlikely(copied == 0)) { 2527 if (unlikely(copied == 0)) {
2580 /* 2528 /*
2581 * If we were unable to copy any data at all, we must 2529 * If we were unable to copy any data at all, we must
2582 * fall back to a single segment length write. 2530 * fall back to a single segment length write.
2583 * 2531 *
2584 * If we didn't fallback here, we could livelock 2532 * If we didn't fallback here, we could livelock
2585 * because not all segments in the iov can be copied at 2533 * because not all segments in the iov can be copied at
2586 * once without a pagefault. 2534 * once without a pagefault.
2587 */ 2535 */
2588 bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, 2536 bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
2589 iov_iter_single_seg_count(i)); 2537 iov_iter_single_seg_count(i));
2590 goto again; 2538 goto again;
2591 } 2539 }
2592 pos += copied; 2540 pos += copied;
2593 written += copied; 2541 written += copied;
2594 2542
2595 balance_dirty_pages_ratelimited(mapping); 2543 balance_dirty_pages_ratelimited(mapping);
2596 if (fatal_signal_pending(current)) { 2544 if (fatal_signal_pending(current)) {
2597 status = -EINTR; 2545 status = -EINTR;
2598 break; 2546 break;
2599 } 2547 }
2600 } while (iov_iter_count(i)); 2548 } while (iov_iter_count(i));
2601 2549
2602 return written ? written : status; 2550 return written ? written : status;
2603 } 2551 }
2604 2552
2605 ssize_t 2553 ssize_t
2606 generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov, 2554 generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
2607 unsigned long nr_segs, loff_t pos, loff_t *ppos, 2555 unsigned long nr_segs, loff_t pos, loff_t *ppos,
2608 size_t count, ssize_t written) 2556 size_t count, ssize_t written)
2609 { 2557 {
2610 struct file *file = iocb->ki_filp; 2558 struct file *file = iocb->ki_filp;
2611 ssize_t status; 2559 ssize_t status;
2612 struct iov_iter i; 2560 struct iov_iter i;
2613 2561
2614 iov_iter_init(&i, iov, nr_segs, count, written); 2562 iov_iter_init(&i, iov, nr_segs, count, written);
2615 status = generic_perform_write(file, &i, pos); 2563 status = generic_perform_write(file, &i, pos);
2616 2564
2617 if (likely(status >= 0)) { 2565 if (likely(status >= 0)) {
2618 written += status; 2566 written += status;
2619 *ppos = pos + status; 2567 *ppos = pos + status;
2620 } 2568 }
2621 2569
2622 return written ? written : status; 2570 return written ? written : status;
2623 } 2571 }
2624 EXPORT_SYMBOL(generic_file_buffered_write); 2572 EXPORT_SYMBOL(generic_file_buffered_write);
2625 2573
2626 /** 2574 /**
2627 * __generic_file_aio_write - write data to a file 2575 * __generic_file_aio_write - write data to a file
2628 * @iocb: IO state structure (file, offset, etc.) 2576 * @iocb: IO state structure (file, offset, etc.)
2629 * @iov: vector with data to write 2577 * @iov: vector with data to write
2630 * @nr_segs: number of segments in the vector 2578 * @nr_segs: number of segments in the vector
2631 * @ppos: position where to write 2579 * @ppos: position where to write
2632 * 2580 *
2633 * This function does all the work needed for actually writing data to a 2581 * This function does all the work needed for actually writing data to a
2634 * file. It does all basic checks, removes SUID from the file, updates 2582 * file. It does all basic checks, removes SUID from the file, updates
2635 * modification times and calls proper subroutines depending on whether we 2583 * modification times and calls proper subroutines depending on whether we
2636 * do direct IO or a standard buffered write. 2584 * do direct IO or a standard buffered write.
2637 * 2585 *
2638 * It expects i_mutex to be grabbed unless we work on a block device or similar 2586 * It expects i_mutex to be grabbed unless we work on a block device or similar
2639 * object which does not need locking at all. 2587 * object which does not need locking at all.
2640 * 2588 *
2641 * This function does *not* take care of syncing data in case of O_SYNC write. 2589 * This function does *not* take care of syncing data in case of O_SYNC write.
2642 * A caller has to handle it. This is mainly due to the fact that we want to 2590 * A caller has to handle it. This is mainly due to the fact that we want to
2643 * avoid syncing under i_mutex. 2591 * avoid syncing under i_mutex.
2644 */ 2592 */
2645 ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, 2593 ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
2646 unsigned long nr_segs, loff_t *ppos) 2594 unsigned long nr_segs, loff_t *ppos)
2647 { 2595 {
2648 struct file *file = iocb->ki_filp; 2596 struct file *file = iocb->ki_filp;
2649 struct address_space * mapping = file->f_mapping; 2597 struct address_space * mapping = file->f_mapping;
2650 size_t ocount; /* original count */ 2598 size_t ocount; /* original count */
2651 size_t count; /* after file limit checks */ 2599 size_t count; /* after file limit checks */
2652 struct inode *inode = mapping->host; 2600 struct inode *inode = mapping->host;
2653 loff_t pos; 2601 loff_t pos;
2654 ssize_t written; 2602 ssize_t written;
2655 ssize_t err; 2603 ssize_t err;
2656 2604
2657 ocount = 0; 2605 ocount = 0;
2658 err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ); 2606 err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
2659 if (err) 2607 if (err)
2660 return err; 2608 return err;
2661 2609
2662 count = ocount; 2610 count = ocount;
2663 pos = *ppos; 2611 pos = *ppos;
2664 2612
2665 /* We can write back this queue in page reclaim */ 2613 /* We can write back this queue in page reclaim */
2666 current->backing_dev_info = mapping->backing_dev_info; 2614 current->backing_dev_info = mapping->backing_dev_info;
2667 written = 0; 2615 written = 0;
2668 2616
2669 err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)); 2617 err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
2670 if (err) 2618 if (err)
2671 goto out; 2619 goto out;
2672 2620
2673 if (count == 0) 2621 if (count == 0)
2674 goto out; 2622 goto out;
2675 2623
2676 err = file_remove_suid(file); 2624 err = file_remove_suid(file);
2677 if (err) 2625 if (err)
2678 goto out; 2626 goto out;
2679 2627
2680 err = file_update_time(file); 2628 err = file_update_time(file);
2681 if (err) 2629 if (err)
2682 goto out; 2630 goto out;
2683 2631
2684 /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ 2632 /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
2685 if (unlikely(file->f_flags & O_DIRECT)) { 2633 if (unlikely(file->f_flags & O_DIRECT)) {
2686 loff_t endbyte; 2634 loff_t endbyte;
2687 ssize_t written_buffered; 2635 ssize_t written_buffered;
2688 2636
2689 written = generic_file_direct_write(iocb, iov, &nr_segs, pos, 2637 written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
2690 ppos, count, ocount); 2638 ppos, count, ocount);
2691 if (written < 0 || written == count) 2639 if (written < 0 || written == count)
2692 goto out; 2640 goto out;
2693 /* 2641 /*
2694 * direct-io write to a hole: fall through to buffered I/O 2642 * direct-io write to a hole: fall through to buffered I/O
2695 * for completing the rest of the request. 2643 * for completing the rest of the request.
2696 */ 2644 */
2697 pos += written; 2645 pos += written;
2698 count -= written; 2646 count -= written;
2699 written_buffered = generic_file_buffered_write(iocb, iov, 2647 written_buffered = generic_file_buffered_write(iocb, iov,
2700 nr_segs, pos, ppos, count, 2648 nr_segs, pos, ppos, count,
2701 written); 2649 written);
2702 /* 2650 /*
2703 * If generic_file_buffered_write() returned a synchronous error 2651 * If generic_file_buffered_write() returned a synchronous error
2704 * then we want to return the number of bytes which were 2652 * then we want to return the number of bytes which were
2705 * direct-written, or the error code if that was zero. Note 2653 * direct-written, or the error code if that was zero. Note
2706 * that this differs from normal direct-io semantics, which 2654 * that this differs from normal direct-io semantics, which
2707 * will return -EFOO even if some bytes were written. 2655 * will return -EFOO even if some bytes were written.
2708 */ 2656 */
2709 if (written_buffered < 0) { 2657 if (written_buffered < 0) {
2710 err = written_buffered; 2658 err = written_buffered;
2711 goto out; 2659 goto out;
2712 } 2660 }
2713 2661
2714 /* 2662 /*
2715 * We need to ensure that the page cache pages are written to 2663 * We need to ensure that the page cache pages are written to
2716 * disk and invalidated to preserve the expected O_DIRECT 2664 * disk and invalidated to preserve the expected O_DIRECT
2717 * semantics. 2665 * semantics.
2718 */ 2666 */
2719 endbyte = pos + written_buffered - written - 1; 2667 endbyte = pos + written_buffered - written - 1;
2720 err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte); 2668 err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
2721 if (err == 0) { 2669 if (err == 0) {
2722 written = written_buffered; 2670 written = written_buffered;
2723 invalidate_mapping_pages(mapping, 2671 invalidate_mapping_pages(mapping,
2724 pos >> PAGE_CACHE_SHIFT, 2672 pos >> PAGE_CACHE_SHIFT,
2725 endbyte >> PAGE_CACHE_SHIFT); 2673 endbyte >> PAGE_CACHE_SHIFT);
2726 } else { 2674 } else {
2727 /* 2675 /*
2728 * We don't know how much we wrote, so just return 2676 * We don't know how much we wrote, so just return
2729 * the number of bytes which were direct-written 2677 * the number of bytes which were direct-written
2730 */ 2678 */
2731 } 2679 }
2732 } else { 2680 } else {
2733 written = generic_file_buffered_write(iocb, iov, nr_segs, 2681 written = generic_file_buffered_write(iocb, iov, nr_segs,
2734 pos, ppos, count, written); 2682 pos, ppos, count, written);
2735 } 2683 }
2736 out: 2684 out:
2737 current->backing_dev_info = NULL; 2685 current->backing_dev_info = NULL;
2738 return written ? written : err; 2686 return written ? written : err;
2739 } 2687 }
2740 EXPORT_SYMBOL(__generic_file_aio_write); 2688 EXPORT_SYMBOL(__generic_file_aio_write);
2741 2689
2742 /** 2690 /**
2743 * generic_file_aio_write - write data to a file 2691 * generic_file_aio_write - write data to a file
2744 * @iocb: IO state structure 2692 * @iocb: IO state structure
2745 * @iov: vector with data to write 2693 * @iov: vector with data to write
2746 * @nr_segs: number of segments in the vector 2694 * @nr_segs: number of segments in the vector
2747 * @pos: position in file where to write 2695 * @pos: position in file where to write
2748 * 2696 *
2749 * This is a wrapper around __generic_file_aio_write() to be used by most 2697 * This is a wrapper around __generic_file_aio_write() to be used by most
2750 * filesystems. It takes care of syncing the file in case of O_SYNC file 2698 * filesystems. It takes care of syncing the file in case of O_SYNC file
2751 * and acquires i_mutex as needed. 2699 * and acquires i_mutex as needed.
2752 */ 2700 */
2753 ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov, 2701 ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
2754 unsigned long nr_segs, loff_t pos) 2702 unsigned long nr_segs, loff_t pos)
2755 { 2703 {
2756 struct file *file = iocb->ki_filp; 2704 struct file *file = iocb->ki_filp;
2757 struct inode *inode = file->f_mapping->host; 2705 struct inode *inode = file->f_mapping->host;
2758 ssize_t ret; 2706 ssize_t ret;
2759 2707
2760 BUG_ON(iocb->ki_pos != pos); 2708 BUG_ON(iocb->ki_pos != pos);
2761 2709
2762 mutex_lock(&inode->i_mutex); 2710 mutex_lock(&inode->i_mutex);
2763 ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos); 2711 ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
2764 mutex_unlock(&inode->i_mutex); 2712 mutex_unlock(&inode->i_mutex);
2765 2713
2766 if (ret > 0) { 2714 if (ret > 0) {
1 /* 1 /*
2 * Resizable virtual memory filesystem for Linux. 2 * Resizable virtual memory filesystem for Linux.
3 * 3 *
4 * Copyright (C) 2000 Linus Torvalds. 4 * Copyright (C) 2000 Linus Torvalds.
5 * 2000 Transmeta Corp. 5 * 2000 Transmeta Corp.
6 * 2000-2001 Christoph Rohland 6 * 2000-2001 Christoph Rohland
7 * 2000-2001 SAP AG 7 * 2000-2001 SAP AG
8 * 2002 Red Hat Inc. 8 * 2002 Red Hat Inc.
9 * Copyright (C) 2002-2011 Hugh Dickins. 9 * Copyright (C) 2002-2011 Hugh Dickins.
10 * Copyright (C) 2011 Google Inc. 10 * Copyright (C) 2011 Google Inc.
11 * Copyright (C) 2002-2005 VERITAS Software Corporation. 11 * Copyright (C) 2002-2005 VERITAS Software Corporation.
12 * Copyright (C) 2004 Andi Kleen, SuSE Labs 12 * Copyright (C) 2004 Andi Kleen, SuSE Labs
13 * 13 *
14 * Extended attribute support for tmpfs: 14 * Extended attribute support for tmpfs:
15 * Copyright (c) 2004, Luke Kenneth Casson Leighton <lkcl@lkcl.net> 15 * Copyright (c) 2004, Luke Kenneth Casson Leighton <lkcl@lkcl.net>
16 * Copyright (c) 2004 Red Hat, Inc., James Morris <jmorris@redhat.com> 16 * Copyright (c) 2004 Red Hat, Inc., James Morris <jmorris@redhat.com>
17 * 17 *
18 * tiny-shmem: 18 * tiny-shmem:
19 * Copyright (c) 2004, 2008 Matt Mackall <mpm@selenic.com> 19 * Copyright (c) 2004, 2008 Matt Mackall <mpm@selenic.com>
20 * 20 *
21 * This file is released under the GPL. 21 * This file is released under the GPL.
22 */ 22 */
23 23
24 #include <linux/fs.h> 24 #include <linux/fs.h>
25 #include <linux/init.h> 25 #include <linux/init.h>
26 #include <linux/vfs.h> 26 #include <linux/vfs.h>
27 #include <linux/mount.h> 27 #include <linux/mount.h>
28 #include <linux/ramfs.h> 28 #include <linux/ramfs.h>
29 #include <linux/pagemap.h> 29 #include <linux/pagemap.h>
30 #include <linux/file.h> 30 #include <linux/file.h>
31 #include <linux/mm.h> 31 #include <linux/mm.h>
32 #include <linux/export.h> 32 #include <linux/export.h>
33 #include <linux/swap.h> 33 #include <linux/swap.h>
34 #include <linux/aio.h> 34 #include <linux/aio.h>
35 35
36 static struct vfsmount *shm_mnt; 36 static struct vfsmount *shm_mnt;
37 37
38 #ifdef CONFIG_SHMEM 38 #ifdef CONFIG_SHMEM
39 /* 39 /*
40 * This virtual memory filesystem is heavily based on the ramfs. It 40 * This virtual memory filesystem is heavily based on the ramfs. It
41 * extends ramfs by the ability to use swap and honor resource limits 41 * extends ramfs by the ability to use swap and honor resource limits
42 * which makes it a completely usable filesystem. 42 * which makes it a completely usable filesystem.
43 */ 43 */
44 44
45 #include <linux/xattr.h> 45 #include <linux/xattr.h>
46 #include <linux/exportfs.h> 46 #include <linux/exportfs.h>
47 #include <linux/posix_acl.h> 47 #include <linux/posix_acl.h>
48 #include <linux/generic_acl.h> 48 #include <linux/generic_acl.h>
49 #include <linux/mman.h> 49 #include <linux/mman.h>
50 #include <linux/string.h> 50 #include <linux/string.h>
51 #include <linux/slab.h> 51 #include <linux/slab.h>
52 #include <linux/backing-dev.h> 52 #include <linux/backing-dev.h>
53 #include <linux/shmem_fs.h> 53 #include <linux/shmem_fs.h>
54 #include <linux/writeback.h> 54 #include <linux/writeback.h>
55 #include <linux/blkdev.h> 55 #include <linux/blkdev.h>
56 #include <linux/pagevec.h> 56 #include <linux/pagevec.h>
57 #include <linux/percpu_counter.h> 57 #include <linux/percpu_counter.h>
58 #include <linux/falloc.h> 58 #include <linux/falloc.h>
59 #include <linux/splice.h> 59 #include <linux/splice.h>
60 #include <linux/security.h> 60 #include <linux/security.h>
61 #include <linux/swapops.h> 61 #include <linux/swapops.h>
62 #include <linux/mempolicy.h> 62 #include <linux/mempolicy.h>
63 #include <linux/namei.h> 63 #include <linux/namei.h>
64 #include <linux/ctype.h> 64 #include <linux/ctype.h>
65 #include <linux/migrate.h> 65 #include <linux/migrate.h>
66 #include <linux/highmem.h> 66 #include <linux/highmem.h>
67 #include <linux/seq_file.h> 67 #include <linux/seq_file.h>
68 #include <linux/magic.h> 68 #include <linux/magic.h>
69 69
70 #include <asm/uaccess.h> 70 #include <asm/uaccess.h>
71 #include <asm/pgtable.h> 71 #include <asm/pgtable.h>
72 72
73 #define BLOCKS_PER_PAGE (PAGE_CACHE_SIZE/512) 73 #define BLOCKS_PER_PAGE (PAGE_CACHE_SIZE/512)
74 #define VM_ACCT(size) (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT) 74 #define VM_ACCT(size) (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
75 75
76 /* Pretend that each entry is of this size in directory's i_size */ 76 /* Pretend that each entry is of this size in directory's i_size */
77 #define BOGO_DIRENT_SIZE 20 77 #define BOGO_DIRENT_SIZE 20
78 78
79 /* Symlink up to this size is kmalloc'ed instead of using a swappable page */ 79 /* Symlink up to this size is kmalloc'ed instead of using a swappable page */
80 #define SHORT_SYMLINK_LEN 128 80 #define SHORT_SYMLINK_LEN 128
81 81
82 /* 82 /*
83 * shmem_fallocate communicates with shmem_fault or shmem_writepage via 83 * shmem_fallocate communicates with shmem_fault or shmem_writepage via
84 * inode->i_private (with i_mutex making sure that it has only one user at 84 * inode->i_private (with i_mutex making sure that it has only one user at
85 * a time): we would prefer not to enlarge the shmem inode just for that. 85 * a time): we would prefer not to enlarge the shmem inode just for that.
86 */ 86 */
87 struct shmem_falloc { 87 struct shmem_falloc {
88 wait_queue_head_t *waitq; /* faults into hole wait for punch to end */ 88 wait_queue_head_t *waitq; /* faults into hole wait for punch to end */
89 pgoff_t start; /* start of range currently being fallocated */ 89 pgoff_t start; /* start of range currently being fallocated */
90 pgoff_t next; /* the next page offset to be fallocated */ 90 pgoff_t next; /* the next page offset to be fallocated */
91 pgoff_t nr_falloced; /* how many new pages have been fallocated */ 91 pgoff_t nr_falloced; /* how many new pages have been fallocated */
92 pgoff_t nr_unswapped; /* how often writepage refused to swap out */ 92 pgoff_t nr_unswapped; /* how often writepage refused to swap out */
93 }; 93 };
94 94
95 /* Flag allocation requirements to shmem_getpage */ 95 /* Flag allocation requirements to shmem_getpage */
96 enum sgp_type { 96 enum sgp_type {
97 SGP_READ, /* don't exceed i_size, don't allocate page */ 97 SGP_READ, /* don't exceed i_size, don't allocate page */
98 SGP_CACHE, /* don't exceed i_size, may allocate page */ 98 SGP_CACHE, /* don't exceed i_size, may allocate page */
99 SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */ 99 SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
100 SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */ 100 SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
101 SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */ 101 SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
102 }; 102 };
103 103
104 #ifdef CONFIG_TMPFS 104 #ifdef CONFIG_TMPFS
105 static unsigned long shmem_default_max_blocks(void) 105 static unsigned long shmem_default_max_blocks(void)
106 { 106 {
107 return totalram_pages / 2; 107 return totalram_pages / 2;
108 } 108 }
109 109
110 static unsigned long shmem_default_max_inodes(void) 110 static unsigned long shmem_default_max_inodes(void)
111 { 111 {
112 return min(totalram_pages - totalhigh_pages, totalram_pages / 2); 112 return min(totalram_pages - totalhigh_pages, totalram_pages / 2);
113 } 113 }
114 #endif 114 #endif
115 115
116 static bool shmem_should_replace_page(struct page *page, gfp_t gfp); 116 static bool shmem_should_replace_page(struct page *page, gfp_t gfp);
117 static int shmem_replace_page(struct page **pagep, gfp_t gfp, 117 static int shmem_replace_page(struct page **pagep, gfp_t gfp,
118 struct shmem_inode_info *info, pgoff_t index); 118 struct shmem_inode_info *info, pgoff_t index);
119 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, 119 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
120 struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type); 120 struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type);
121 121
122 static inline int shmem_getpage(struct inode *inode, pgoff_t index, 122 static inline int shmem_getpage(struct inode *inode, pgoff_t index,
123 struct page **pagep, enum sgp_type sgp, int *fault_type) 123 struct page **pagep, enum sgp_type sgp, int *fault_type)
124 { 124 {
125 return shmem_getpage_gfp(inode, index, pagep, sgp, 125 return shmem_getpage_gfp(inode, index, pagep, sgp,
126 mapping_gfp_mask(inode->i_mapping), fault_type); 126 mapping_gfp_mask(inode->i_mapping), fault_type);
127 } 127 }
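For orientation, a simplified sketch (not the literal function in this tree) of how shmem's own ->write_begin drives shmem_getpage() with SGP_WRITE:

	/* Simplified sketch: tmpfs finds or allocates the page via shmem_getpage(). */
	static int example_shmem_write_begin(struct file *file,
					     struct address_space *mapping,
					     loff_t pos, unsigned len, unsigned flags,
					     struct page **pagep, void **fsdata)
	{
		struct inode *inode = mapping->host;
		pgoff_t index = pos >> PAGE_CACHE_SHIFT;

		/* SGP_WRITE: may extend i_size and may return a !Uptodate page. */
		return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
	}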
128 128
129 static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb) 129 static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
130 { 130 {
131 return sb->s_fs_info; 131 return sb->s_fs_info;
132 } 132 }
133 133
134 /* 134 /*
135 * shmem_file_setup pre-accounts the whole fixed size of a VM object, 135 * shmem_file_setup pre-accounts the whole fixed size of a VM object,
136 * for shared memory and for shared anonymous (/dev/zero) mappings 136 * for shared memory and for shared anonymous (/dev/zero) mappings
137 * (unless MAP_NORESERVE and sysctl_overcommit_memory <= 1), 137 * (unless MAP_NORESERVE and sysctl_overcommit_memory <= 1),
138 * consistent with the pre-accounting of private mappings ... 138 * consistent with the pre-accounting of private mappings ...
139 */ 139 */
140 static inline int shmem_acct_size(unsigned long flags, loff_t size) 140 static inline int shmem_acct_size(unsigned long flags, loff_t size)
141 { 141 {
142 return (flags & VM_NORESERVE) ? 142 return (flags & VM_NORESERVE) ?
143 0 : security_vm_enough_memory_mm(current->mm, VM_ACCT(size)); 143 0 : security_vm_enough_memory_mm(current->mm, VM_ACCT(size));
144 } 144 }
145 145
146 static inline void shmem_unacct_size(unsigned long flags, loff_t size) 146 static inline void shmem_unacct_size(unsigned long flags, loff_t size)
147 { 147 {
148 if (!(flags & VM_NORESERVE)) 148 if (!(flags & VM_NORESERVE))
149 vm_unacct_memory(VM_ACCT(size)); 149 vm_unacct_memory(VM_ACCT(size));
150 } 150 }
151 151
152 /* 152 /*
153 * ... whereas tmpfs objects are accounted incrementally as 153 * ... whereas tmpfs objects are accounted incrementally as
154 * pages are allocated, in order to allow huge sparse files. 154 * pages are allocated, in order to allow huge sparse files.
155 * shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM, 155 * shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
156 * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM. 156 * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
157 */ 157 */
158 static inline int shmem_acct_block(unsigned long flags) 158 static inline int shmem_acct_block(unsigned long flags)
159 { 159 {
160 return (flags & VM_NORESERVE) ? 160 return (flags & VM_NORESERVE) ?
161 security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_CACHE_SIZE)) : 0; 161 security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_CACHE_SIZE)) : 0;
162 } 162 }
163 163
164 static inline void shmem_unacct_blocks(unsigned long flags, long pages) 164 static inline void shmem_unacct_blocks(unsigned long flags, long pages)
165 { 165 {
166 if (flags & VM_NORESERVE) 166 if (flags & VM_NORESERVE)
167 vm_unacct_memory(pages * VM_ACCT(PAGE_CACHE_SIZE)); 167 vm_unacct_memory(pages * VM_ACCT(PAGE_CACHE_SIZE));
168 } 168 }
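A small illustrative pairing of the accounting helpers above (names other than the helpers themselves are made up); as the comment above notes, shmem_getpage reports this failure as -ENOSPC:

	/* Illustrative only: charge one block before allocating, undo on failure. */
	static int example_charge_one_page(struct shmem_inode_info *info)
	{
		struct page *page;

		if (shmem_acct_block(info->flags))
			return -ENOSPC;

		page = alloc_page(GFP_KERNEL);	/* stand-in for shmem's allocator */
		if (!page) {
			shmem_unacct_blocks(info->flags, 1);
			return -ENOMEM;
		}

		/* ... add to the page cache and sbinfo->used_blocks accounting ... */
		return 0;
	}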
169 169
170 static const struct super_operations shmem_ops; 170 static const struct super_operations shmem_ops;
171 static const struct address_space_operations shmem_aops; 171 static const struct address_space_operations shmem_aops;
172 static const struct file_operations shmem_file_operations; 172 static const struct file_operations shmem_file_operations;
173 static const struct inode_operations shmem_inode_operations; 173 static const struct inode_operations shmem_inode_operations;
174 static const struct inode_operations shmem_dir_inode_operations; 174 static const struct inode_operations shmem_dir_inode_operations;
175 static const struct inode_operations shmem_special_inode_operations; 175 static const struct inode_operations shmem_special_inode_operations;
176 static const struct vm_operations_struct shmem_vm_ops; 176 static const struct vm_operations_struct shmem_vm_ops;
177 177
178 static struct backing_dev_info shmem_backing_dev_info __read_mostly = { 178 static struct backing_dev_info shmem_backing_dev_info __read_mostly = {
179 .ra_pages = 0, /* No readahead */ 179 .ra_pages = 0, /* No readahead */
180 .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED, 180 .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED,
181 }; 181 };
182 182
183 static LIST_HEAD(shmem_swaplist); 183 static LIST_HEAD(shmem_swaplist);
184 static DEFINE_MUTEX(shmem_swaplist_mutex); 184 static DEFINE_MUTEX(shmem_swaplist_mutex);
185 185
186 static int shmem_reserve_inode(struct super_block *sb) 186 static int shmem_reserve_inode(struct super_block *sb)
187 { 187 {
188 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 188 struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
189 if (sbinfo->max_inodes) { 189 if (sbinfo->max_inodes) {
190 spin_lock(&sbinfo->stat_lock); 190 spin_lock(&sbinfo->stat_lock);
191 if (!sbinfo->free_inodes) { 191 if (!sbinfo->free_inodes) {
192 spin_unlock(&sbinfo->stat_lock); 192 spin_unlock(&sbinfo->stat_lock);
193 return -ENOSPC; 193 return -ENOSPC;
194 } 194 }
195 sbinfo->free_inodes--; 195 sbinfo->free_inodes--;
196 spin_unlock(&sbinfo->stat_lock); 196 spin_unlock(&sbinfo->stat_lock);
197 } 197 }
198 return 0; 198 return 0;
199 } 199 }
200 200
201 static void shmem_free_inode(struct super_block *sb) 201 static void shmem_free_inode(struct super_block *sb)
202 { 202 {
203 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 203 struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
204 if (sbinfo->max_inodes) { 204 if (sbinfo->max_inodes) {
205 spin_lock(&sbinfo->stat_lock); 205 spin_lock(&sbinfo->stat_lock);
206 sbinfo->free_inodes++; 206 sbinfo->free_inodes++;
207 spin_unlock(&sbinfo->stat_lock); 207 spin_unlock(&sbinfo->stat_lock);
208 } 208 }
209 } 209 }
210 210
/**
 * shmem_recalc_inode - recalculate the block usage of an inode
 * @inode: inode to recalc
 *
 * We have to calculate the free blocks since the mm can drop
 * undirtied hole pages behind our back.
 *
 * But normally info->alloced == inode->i_mapping->nrpages + info->swapped
 * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped)
 *
 * It has to be called with the spinlock held.
 */
static void shmem_recalc_inode(struct inode *inode)
{
	struct shmem_inode_info *info = SHMEM_I(inode);
	long freed;

	freed = info->alloced - info->swapped - inode->i_mapping->nrpages;
	if (freed > 0) {
		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
		if (sbinfo->max_blocks)
			percpu_counter_add(&sbinfo->used_blocks, -freed);
		info->alloced -= freed;
		inode->i_blocks -= freed * BLOCKS_PER_PAGE;
		shmem_unacct_blocks(info->flags, freed);
	}
}

/*
 * Replace item expected in radix tree by a new item, while holding tree lock.
 */
static int shmem_radix_tree_replace(struct address_space *mapping,
			pgoff_t index, void *expected, void *replacement)
{
	void **pslot;
	void *item;

	VM_BUG_ON(!expected);
	VM_BUG_ON(!replacement);
	pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
	if (!pslot)
		return -ENOENT;
	item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock);
	if (item != expected)
		return -ENOENT;
	radix_tree_replace_slot(pslot, replacement);
	return 0;
}

/*
 * Sometimes, before we decide whether to proceed or to fail, we must check
 * that an entry was not already brought back from swap by a racing thread.
 *
 * Checking page is not enough: by the time a SwapCache page is locked, it
 * might be reused, and again be SwapCache, using the same swap as before.
 */
static bool shmem_confirm_swap(struct address_space *mapping,
			       pgoff_t index, swp_entry_t swap)
{
	void *item;

	rcu_read_lock();
	item = radix_tree_lookup(&mapping->page_tree, index);
	rcu_read_unlock();
	return item == swp_to_radix_entry(swap);
}

/*
 * Like add_to_page_cache_locked, but error if expected item has gone.
 */
static int shmem_add_to_page_cache(struct page *page,
				   struct address_space *mapping,
				   pgoff_t index, gfp_t gfp, void *expected)
{
	int error;

	VM_BUG_ON(!PageLocked(page));
	VM_BUG_ON(!PageSwapBacked(page));

	page_cache_get(page);
	page->mapping = mapping;
	page->index = index;

	spin_lock_irq(&mapping->tree_lock);
	if (!expected)
		error = radix_tree_insert(&mapping->page_tree, index, page);
	else
		error = shmem_radix_tree_replace(mapping, index, expected,
						 page);
	if (!error) {
		mapping->nrpages++;
		__inc_zone_page_state(page, NR_FILE_PAGES);
		__inc_zone_page_state(page, NR_SHMEM);
		spin_unlock_irq(&mapping->tree_lock);
	} else {
		page->mapping = NULL;
		spin_unlock_irq(&mapping->tree_lock);
		page_cache_release(page);
	}
	return error;
}

/*
 * Like delete_from_page_cache, but substitutes swap for page.
 */
static void shmem_delete_from_page_cache(struct page *page, void *radswap)
{
	struct address_space *mapping = page->mapping;
	int error;

	spin_lock_irq(&mapping->tree_lock);
	error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
	page->mapping = NULL;
	mapping->nrpages--;
	__dec_zone_page_state(page, NR_FILE_PAGES);
	__dec_zone_page_state(page, NR_SHMEM);
	spin_unlock_irq(&mapping->tree_lock);
	page_cache_release(page);
	BUG_ON(error);
}

/*
 * Remove swap entry from radix tree, free the swap and its page cache.
 */
static int shmem_free_swap(struct address_space *mapping,
			   pgoff_t index, void *radswap)
{
	void *old;

	spin_lock_irq(&mapping->tree_lock);
	old = radix_tree_delete_item(&mapping->page_tree, index, radswap);
	spin_unlock_irq(&mapping->tree_lock);
	if (old != radswap)
		return -ENOENT;
	free_swap_and_cache(radix_to_swp_entry(radswap));
	return 0;
}

/*
 * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
 */
void shmem_unlock_mapping(struct address_space *mapping)
{
	struct pagevec pvec;
	pgoff_t indices[PAGEVEC_SIZE];
	pgoff_t index = 0;

	pagevec_init(&pvec, 0);
	/*
	 * Minor point, but we might as well stop if someone else SHM_LOCKs it.
	 */
	while (!mapping_unevictable(mapping)) {
		/*
		 * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
		 * has finished, if it hits a row of PAGEVEC_SIZE swap entries.
		 */
		pvec.nr = find_get_entries(mapping, index,
					   PAGEVEC_SIZE, pvec.pages, indices);
		if (!pvec.nr)
			break;
		index = indices[pvec.nr - 1] + 1;
		pagevec_remove_exceptionals(&pvec);
		check_move_unevictable_pages(pvec.pages, pvec.nr);
		pagevec_release(&pvec);
		cond_resched();
	}
}

/*
 * Remove range of pages and swap entries from radix tree, and free them.
 * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
 */
static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
							 bool unfalloc)
{
	struct address_space *mapping = inode->i_mapping;
	struct shmem_inode_info *info = SHMEM_I(inode);
	pgoff_t start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	pgoff_t end = (lend + 1) >> PAGE_CACHE_SHIFT;
	unsigned int partial_start = lstart & (PAGE_CACHE_SIZE - 1);
	unsigned int partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
	struct pagevec pvec;
	pgoff_t indices[PAGEVEC_SIZE];
	long nr_swaps_freed = 0;
	pgoff_t index;
	int i;

	if (lend == -1)
		end = -1;	/* unsigned, so actually very big */

	pagevec_init(&pvec, 0);
	index = start;
	while (index < end) {
		pvec.nr = find_get_entries(mapping, index,
			min(end - index, (pgoff_t)PAGEVEC_SIZE),
			pvec.pages, indices);
		if (!pvec.nr)
			break;
		mem_cgroup_uncharge_start();
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			index = indices[i];
			if (index >= end)
				break;

			if (radix_tree_exceptional_entry(page)) {
				if (unfalloc)
					continue;
				nr_swaps_freed += !shmem_free_swap(mapping,
								index, page);
				continue;
			}

			if (!trylock_page(page))
				continue;
			if (!unfalloc || !PageUptodate(page)) {
				if (page->mapping == mapping) {
					VM_BUG_ON(PageWriteback(page));
					truncate_inode_page(mapping, page);
				}
			}
			unlock_page(page);
		}
		pagevec_remove_exceptionals(&pvec);
		pagevec_release(&pvec);
		mem_cgroup_uncharge_end();
		cond_resched();
		index++;
	}

	if (partial_start) {
		struct page *page = NULL;
		shmem_getpage(inode, start - 1, &page, SGP_READ, NULL);
		if (page) {
			unsigned int top = PAGE_CACHE_SIZE;
			if (start > end) {
				top = partial_end;
				partial_end = 0;
			}
			zero_user_segment(page, partial_start, top);
			set_page_dirty(page);
			unlock_page(page);
			page_cache_release(page);
		}
	}
	if (partial_end) {
		struct page *page = NULL;
		shmem_getpage(inode, end, &page, SGP_READ, NULL);
		if (page) {
			zero_user_segment(page, 0, partial_end);
			set_page_dirty(page);
			unlock_page(page);
			page_cache_release(page);
		}
	}
	if (start >= end)
		return;

	index = start;
	while (index < end) {
		cond_resched();

		pvec.nr = find_get_entries(mapping, index,
				min(end - index, (pgoff_t)PAGEVEC_SIZE),
				pvec.pages, indices);
		if (!pvec.nr) {
			/* If all gone or hole-punch or unfalloc, we're done */
			if (index == start || end != -1)
				break;
			/* But if truncating, restart to make sure all gone */
			index = start;
			continue;
		}
		mem_cgroup_uncharge_start();
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			index = indices[i];
			if (index >= end)
				break;

			if (radix_tree_exceptional_entry(page)) {
				if (unfalloc)
					continue;
				if (shmem_free_swap(mapping, index, page)) {
					/* Swap was replaced by page: retry */
					index--;
					break;
				}
				nr_swaps_freed++;
				continue;
			}

			lock_page(page);
			if (!unfalloc || !PageUptodate(page)) {
				if (page->mapping == mapping) {
					VM_BUG_ON(PageWriteback(page));
					truncate_inode_page(mapping, page);
				} else {
					/* Page was replaced by swap: retry */
					unlock_page(page);
					index--;
					break;
				}
			}
			unlock_page(page);
		}
		pagevec_remove_exceptionals(&pvec);
		pagevec_release(&pvec);
		mem_cgroup_uncharge_end();
		index++;
	}

	spin_lock(&info->lock);
	info->swapped -= nr_swaps_freed;
	shmem_recalc_inode(inode);
	spin_unlock(&info->lock);
}

void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
{
	shmem_undo_range(inode, lstart, lend, false);
	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
}
EXPORT_SYMBOL_GPL(shmem_truncate_range);

static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
{
	struct inode *inode = dentry->d_inode;
	int error;

	error = inode_change_ok(inode, attr);
	if (error)
		return error;

	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
		loff_t oldsize = inode->i_size;
		loff_t newsize = attr->ia_size;

		if (newsize != oldsize) {
			i_size_write(inode, newsize);
			inode->i_ctime = inode->i_mtime = CURRENT_TIME;
		}
		if (newsize < oldsize) {
			loff_t holebegin = round_up(newsize, PAGE_SIZE);
			unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);
			shmem_truncate_range(inode, newsize, (loff_t)-1);
			/* unmap again to remove racily COWed private pages */
			unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);
		}
	}

	setattr_copy(inode, attr);
#ifdef CONFIG_TMPFS_POSIX_ACL
	if (attr->ia_valid & ATTR_MODE)
		error = generic_acl_chmod(inode);
#endif
	return error;
}

static void shmem_evict_inode(struct inode *inode)
{
	struct shmem_inode_info *info = SHMEM_I(inode);

	if (inode->i_mapping->a_ops == &shmem_aops) {
		shmem_unacct_size(info->flags, inode->i_size);
		inode->i_size = 0;
		shmem_truncate_range(inode, 0, (loff_t)-1);
		if (!list_empty(&info->swaplist)) {
			mutex_lock(&shmem_swaplist_mutex);
			list_del_init(&info->swaplist);
			mutex_unlock(&shmem_swaplist_mutex);
		}
	} else
		kfree(info->symlink);

	simple_xattrs_free(&info->xattrs);
	WARN_ON(inode->i_blocks);
	shmem_free_inode(inode->i_sb);
	clear_inode(inode);
}

/*
 * If swap found in inode, free it and move page from swapcache to filecache.
 */
static int shmem_unuse_inode(struct shmem_inode_info *info,
			     swp_entry_t swap, struct page **pagep)
{
	struct address_space *mapping = info->vfs_inode.i_mapping;
	void *radswap;
	pgoff_t index;
	gfp_t gfp;
	int error = 0;

	radswap = swp_to_radix_entry(swap);
	index = radix_tree_locate_item(&mapping->page_tree, radswap);
	if (index == -1)
		return 0;

	/*
	 * Move _head_ to start search for next from here.
	 * But be careful: shmem_evict_inode checks list_empty without taking
	 * mutex, and there's an instant in list_move_tail when info->swaplist
	 * would appear empty, if it were the only one on shmem_swaplist.
	 */
	if (shmem_swaplist.next != &info->swaplist)
		list_move_tail(&shmem_swaplist, &info->swaplist);

	gfp = mapping_gfp_mask(mapping);
	if (shmem_should_replace_page(*pagep, gfp)) {
		mutex_unlock(&shmem_swaplist_mutex);
		error = shmem_replace_page(pagep, gfp, info, index);
		mutex_lock(&shmem_swaplist_mutex);
		/*
		 * We needed to drop mutex to make that restrictive page
		 * allocation, but the inode might have been freed while we
		 * dropped it: although a racing shmem_evict_inode() cannot
		 * complete without emptying the radix_tree, our page lock
		 * on this swapcache page is not enough to prevent that -
		 * free_swap_and_cache() of our swap entry will only
		 * trylock_page(), removing swap from radix_tree whatever.
		 *
		 * We must not proceed to shmem_add_to_page_cache() if the
		 * inode has been freed, but of course we cannot rely on
		 * inode or mapping or info to check that. However, we can
		 * safely check if our swap entry is still in use (and here
		 * it can't have got reused for another page): if it's still
		 * in use, then the inode cannot have been freed yet, and we
		 * can safely proceed (if it's no longer in use, that tells
		 * nothing about the inode, but we don't need to unuse swap).
		 */
		if (!page_swapcount(*pagep))
			error = -ENOENT;
	}

	/*
	 * We rely on shmem_swaplist_mutex, not only to protect the swaplist,
	 * but also to hold up shmem_evict_inode(): so inode cannot be freed
	 * beneath us (pagelock doesn't help until the page is in pagecache).
	 */
	if (!error)
		error = shmem_add_to_page_cache(*pagep, mapping, index,
						GFP_NOWAIT, radswap);
	if (error != -ENOMEM) {
		/*
		 * Truncation and eviction use free_swap_and_cache(), which
		 * only does trylock page: if we raced, best clean up here.
		 */
		delete_from_swap_cache(*pagep);
		set_page_dirty(*pagep);
		if (!error) {
			spin_lock(&info->lock);
			info->swapped--;
			spin_unlock(&info->lock);
			swap_free(swap);
		}
		error = 1;	/* not an error, but entry was found */
	}
	return error;
}

/*
 * Search through swapped inodes to find and replace swap by page.
 */
int shmem_unuse(swp_entry_t swap, struct page *page)
{
	struct list_head *this, *next;
	struct shmem_inode_info *info;
	int found = 0;
	int error = 0;

	/*
	 * There's a faint possibility that swap page was replaced before
	 * caller locked it: caller will come back later with the right page.
	 */
	if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val))
		goto out;

	/*
	 * Charge page using GFP_KERNEL while we can wait, before taking
	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
	 * Charged back to the user (not to caller) when swap account is used.
	 */
	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
	if (error)
		goto out;
	/* No radix_tree_preload: swap entry keeps a place for page in tree */

	mutex_lock(&shmem_swaplist_mutex);
	list_for_each_safe(this, next, &shmem_swaplist) {
		info = list_entry(this, struct shmem_inode_info, swaplist);
		if (info->swapped)
			found = shmem_unuse_inode(info, swap, &page);
		else
			list_del_init(&info->swaplist);
		cond_resched();
		if (found)
			break;
	}
	mutex_unlock(&shmem_swaplist_mutex);

	if (found < 0)
		error = found;
out:
	unlock_page(page);
	page_cache_release(page);
	return error;
}

/*
 * Move the page from the page cache to the swap cache.
 */
static int shmem_writepage(struct page *page, struct writeback_control *wbc)
{
	struct shmem_inode_info *info;
	struct address_space *mapping;
	struct inode *inode;
	swp_entry_t swap;
	pgoff_t index;

	BUG_ON(!PageLocked(page));
	mapping = page->mapping;
	index = page->index;
	inode = mapping->host;
	info = SHMEM_I(inode);
	if (info->flags & VM_LOCKED)
		goto redirty;
	if (!total_swap_pages)
		goto redirty;

	/*
	 * shmem_backing_dev_info's capabilities prevent regular writeback or
	 * sync from ever calling shmem_writepage; but a stacking filesystem
	 * might use ->writepage of its underlying filesystem, in which case
	 * tmpfs should write out to swap only in response to memory pressure,
	 * and not for the writeback threads or sync.
	 */
	if (!wbc->for_reclaim) {
		WARN_ON_ONCE(1);	/* Still happens? Tell us about it! */
		goto redirty;
	}

	/*
	 * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
	 * value into swapfile.c, the only way we can correctly account for a
	 * fallocated page arriving here is now to initialize it and write it.
	 *
	 * That's okay for a page already fallocated earlier, but if we have
	 * not yet completed the fallocation, then (a) we want to keep track
	 * of this page in case we have to undo it, and (b) it may not be a
	 * good idea to continue anyway, once we're pushing into swap. So
	 * reactivate the page, and let shmem_fallocate() quit when too many.
	 */
	if (!PageUptodate(page)) {
		if (inode->i_private) {
			struct shmem_falloc *shmem_falloc;
			spin_lock(&inode->i_lock);
			shmem_falloc = inode->i_private;
			if (shmem_falloc &&
			    !shmem_falloc->waitq &&
			    index >= shmem_falloc->start &&
			    index < shmem_falloc->next)
				shmem_falloc->nr_unswapped++;
			else
				shmem_falloc = NULL;
			spin_unlock(&inode->i_lock);
			if (shmem_falloc)
				goto redirty;
		}
		clear_highpage(page);
		flush_dcache_page(page);
		SetPageUptodate(page);
	}

	swap = get_swap_page();
	if (!swap.val)
		goto redirty;

	/*
	 * Add inode to shmem_unuse()'s list of swapped-out inodes,
	 * if it's not already there. Do it now before the page is
	 * moved to swap cache, when its pagelock no longer protects
	 * the inode from eviction. But don't unlock the mutex until
	 * we've incremented swapped, because shmem_unuse_inode() will
	 * prune a !swapped inode from the swaplist under this mutex.
	 */
	mutex_lock(&shmem_swaplist_mutex);
	if (list_empty(&info->swaplist))
		list_add_tail(&info->swaplist, &shmem_swaplist);

	if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
		swap_shmem_alloc(swap);
		shmem_delete_from_page_cache(page, swp_to_radix_entry(swap));

		spin_lock(&info->lock);
		info->swapped++;
		shmem_recalc_inode(inode);
		spin_unlock(&info->lock);

		mutex_unlock(&shmem_swaplist_mutex);
		BUG_ON(page_mapped(page));
		swap_writepage(page, wbc);
		return 0;
	}

	mutex_unlock(&shmem_swaplist_mutex);
	swapcache_free(swap, NULL);
redirty:
	set_page_dirty(page);
	if (wbc->for_reclaim)
		return AOP_WRITEPAGE_ACTIVATE;	/* Return with page locked */
	unlock_page(page);
	return 0;
}

#ifdef CONFIG_NUMA
#ifdef CONFIG_TMPFS
static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
{
	char buffer[64];

	if (!mpol || mpol->mode == MPOL_DEFAULT)
		return;		/* show nothing */

	mpol_to_str(buffer, sizeof(buffer), mpol);

	seq_printf(seq, ",mpol=%s", buffer);
}

static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
{
	struct mempolicy *mpol = NULL;
	if (sbinfo->mpol) {
		spin_lock(&sbinfo->stat_lock);	/* prevent replace/use races */
		mpol = sbinfo->mpol;
		mpol_get(mpol);
		spin_unlock(&sbinfo->stat_lock);
	}
	return mpol;
}
#endif /* CONFIG_TMPFS */

static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
			struct shmem_inode_info *info, pgoff_t index)
{
	struct vm_area_struct pvma;
	struct page *page;

	/* Create a pseudo vma that just contains the policy */
	pvma.vm_start = 0;
	/* Bias interleave by inode number to distribute better across nodes */
	pvma.vm_pgoff = index + info->vfs_inode.i_ino;
	pvma.vm_ops = NULL;
	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);

	page = swapin_readahead(swap, gfp, &pvma, 0);

	/* Drop reference taken by mpol_shared_policy_lookup() */
	mpol_cond_put(pvma.vm_policy);

	return page;
}

static struct page *shmem_alloc_page(gfp_t gfp,
			struct shmem_inode_info *info, pgoff_t index)
{
	struct vm_area_struct pvma;
	struct page *page;

	/* Create a pseudo vma that just contains the policy */
	pvma.vm_start = 0;
	/* Bias interleave by inode number to distribute better across nodes */
	pvma.vm_pgoff = index + info->vfs_inode.i_ino;
	pvma.vm_ops = NULL;
	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);

	page = alloc_page_vma(gfp, &pvma, 0);

	/* Drop reference taken by mpol_shared_policy_lookup() */
	mpol_cond_put(pvma.vm_policy);

	return page;
}
#else /* !CONFIG_NUMA */
#ifdef CONFIG_TMPFS
static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
{
}
#endif /* CONFIG_TMPFS */

static inline struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
			struct shmem_inode_info *info, pgoff_t index)
{
	return swapin_readahead(swap, gfp, NULL, 0);
}

static inline struct page *shmem_alloc_page(gfp_t gfp,
			struct shmem_inode_info *info, pgoff_t index)
{
	return alloc_page(gfp);
}
#endif /* CONFIG_NUMA */

#if !defined(CONFIG_NUMA) || !defined(CONFIG_TMPFS)
static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
{
	return NULL;
}
#endif

/*
 * When a page is moved from swapcache to shmem filecache (either by the
 * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of
 * shmem_unuse_inode()), it may have been read in earlier from swap, in
 * ignorance of the mapping it belongs to. If that mapping has special
 * constraints (like the gma500 GEM driver, which requires RAM below 4GB),
 * we may need to copy to a suitable page before moving to filecache.
 *
 * In a future release, this may well be extended to respect cpuset and
 * NUMA mempolicy, and applied also to anonymous pages in do_swap_page();
 * but for now it is a simple matter of zone.
 */
static bool shmem_should_replace_page(struct page *page, gfp_t gfp)
{
	return page_zonenum(page) > gfp_zone(gfp);
}

static int shmem_replace_page(struct page **pagep, gfp_t gfp,
				struct shmem_inode_info *info, pgoff_t index)
{
	struct page *oldpage, *newpage;
	struct address_space *swap_mapping;
	pgoff_t swap_index;
	int error;

	oldpage = *pagep;
	swap_index = page_private(oldpage);
	swap_mapping = page_mapping(oldpage);

	/*
	 * We have arrived here because our zones are constrained, so don't
	 * limit chance of success by further cpuset and node constraints.
	 */
	gfp &= ~GFP_CONSTRAINT_MASK;
	newpage = shmem_alloc_page(gfp, info, index);
	if (!newpage)
		return -ENOMEM;

	page_cache_get(newpage);
	copy_highpage(newpage, oldpage);
	flush_dcache_page(newpage);

	__set_page_locked(newpage);
	SetPageUptodate(newpage);
	SetPageSwapBacked(newpage);
	set_page_private(newpage, swap_index);
	SetPageSwapCache(newpage);

	/*
	 * Our caller will very soon move newpage out of swapcache, but it's
	 * a nice clean interface for us to replace oldpage by newpage there.
	 */
	spin_lock_irq(&swap_mapping->tree_lock);
	error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
								   newpage);
	if (!error) {
		__inc_zone_page_state(newpage, NR_FILE_PAGES);
		__dec_zone_page_state(oldpage, NR_FILE_PAGES);
	}
	spin_unlock_irq(&swap_mapping->tree_lock);

	if (unlikely(error)) {
		/*
		 * Is this possible? I think not, now that our callers check
		 * both PageSwapCache and page_private after getting page lock;
		 * but be defensive. Reverse old to newpage for clear and free.
		 */
		oldpage = newpage;
	} else {
		mem_cgroup_replace_page_cache(oldpage, newpage);
		lru_cache_add_anon(newpage);
		*pagep = newpage;
	}

	ClearPageSwapCache(oldpage);
	set_page_private(oldpage, 0);

	unlock_page(oldpage);
	page_cache_release(oldpage);
	page_cache_release(oldpage);
	return error;
}

/*
 * shmem_getpage_gfp - find page in cache, or get from swap, or allocate
 *
 * If we allocate a new one we do not mark it dirty. That's up to the
 * vm. If we swap it in we mark it dirty since we also free the swap
 * entry since a page cannot live in both the swap and page cache
 */
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
	struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type)
{
	struct address_space *mapping = inode->i_mapping;
	struct shmem_inode_info *info;
	struct shmem_sb_info *sbinfo;
	struct page *page;
	swp_entry_t swap;
	int error;
	int once = 0;
	int alloced = 0;

	if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
		return -EFBIG;
repeat:
	swap.val = 0;
	page = find_lock_entry(mapping, index);
	if (radix_tree_exceptional_entry(page)) {
		swap = radix_to_swp_entry(page);
		page = NULL;
	}

	if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
	    ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
		error = -EINVAL;
		goto failed;
	}

	/* fallocated page? */
	if (page && !PageUptodate(page)) {
		if (sgp != SGP_READ)
			goto clear;
		unlock_page(page);
		page_cache_release(page);
		page = NULL;
	}
	if (page || (sgp == SGP_READ && !swap.val)) {
		*pagep = page;
		return 0;
	}

	/*
	 * Fast cache lookup did not find it:
	 * bring it back from swap or allocate.
	 */
	info = SHMEM_I(inode);
	sbinfo = SHMEM_SB(inode->i_sb);

	if (swap.val) {
		/* Look it up and read it in.. */
		page = lookup_swap_cache(swap);
		if (!page) {
			/* here we actually do the io */
			if (fault_type)
				*fault_type |= VM_FAULT_MAJOR;
			page = shmem_swapin(swap, gfp, info, index);
			if (!page) {
				error = -ENOMEM;
				goto failed;
			}
		}

		/* We have to do this with page locked to prevent races */
		lock_page(page);
		if (!PageSwapCache(page) || page_private(page) != swap.val ||
		    !shmem_confirm_swap(mapping, index, swap)) {
			error = -EEXIST;	/* try again */
			goto unlock;
		}
		if (!PageUptodate(page)) {
			error = -EIO;
			goto failed;
		}
		wait_on_page_writeback(page);

		if (shmem_should_replace_page(page, gfp)) {
			error = shmem_replace_page(&page, gfp, info, index);
			if (error)
				goto failed;
		}

		error = mem_cgroup_cache_charge(page, current->mm,
						gfp & GFP_RECLAIM_MASK);
		if (!error) {
			error = shmem_add_to_page_cache(page, mapping, index,
						gfp, swp_to_radix_entry(swap));
			/*
			 * We already confirmed swap under page lock, and make
			 * no memory allocation here, so usually no possibility
			 * of error; but free_swap_and_cache() only trylocks a
			 * page, so it is just possible that the entry has been
			 * truncated or holepunched since swap was confirmed.
			 * shmem_undo_range() will have done some of the
			 * unaccounting, now delete_from_swap_cache() will do
			 * the rest (including mem_cgroup_uncharge_swapcache).
			 * Reset swap.val? No, leave it so "failed" goes back to
			 * "repeat": reading a hole and writing should succeed.
			 */
			if (error)
				delete_from_swap_cache(page);
		}
		if (error)
			goto failed;

		spin_lock(&info->lock);
		info->swapped--;
		shmem_recalc_inode(inode);
		spin_unlock(&info->lock);

		delete_from_swap_cache(page);
		set_page_dirty(page);
		swap_free(swap);

	} else {
		if (shmem_acct_block(info->flags)) {
			error = -ENOSPC;
			goto failed;
		}
		if (sbinfo->max_blocks) {
			if (percpu_counter_compare(&sbinfo->used_blocks,
						sbinfo->max_blocks) >= 0) {
				error = -ENOSPC;
				goto unacct;
			}
			percpu_counter_inc(&sbinfo->used_blocks);
		}

		page = shmem_alloc_page(gfp, info, index);
		if (!page) {
			error = -ENOMEM;
			goto decused;
		}

		__SetPageSwapBacked(page);
		__set_page_locked(page);
		error = mem_cgroup_cache_charge(page, current->mm,
						gfp & GFP_RECLAIM_MASK);
		if (error)
			goto decused;
		error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
		if (!error) {
			error = shmem_add_to_page_cache(page, mapping, index,
							gfp, NULL);
			radix_tree_preload_end();
		}
		if (error) {
			mem_cgroup_uncharge_cache_page(page);
			goto decused;
		}
		lru_cache_add_anon(page);

		spin_lock(&info->lock);
		info->alloced++;
		inode->i_blocks += BLOCKS_PER_PAGE;
		shmem_recalc_inode(inode);
		spin_unlock(&info->lock);
		alloced = true;

		/*
		 * Let SGP_FALLOC use the SGP_WRITE optimization on a new page.
		 */
		if (sgp == SGP_FALLOC)
			sgp = SGP_WRITE;
clear:
		/*
		 * Let SGP_WRITE caller clear ends if write does not fill page;
		 * but SGP_FALLOC on a page fallocated earlier must initialize
		 * it now, lest undo on failure cancel our earlier guarantee.
		 */
		if (sgp != SGP_WRITE) {
			clear_highpage(page);
			flush_dcache_page(page);
			SetPageUptodate(page);
		}
		if (sgp == SGP_DIRTY)
			set_page_dirty(page);
	}

	/* Perhaps the file has been truncated since we checked */
	if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
	    ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
		error = -EINVAL;
		if (alloced)
			goto trunc;
		else
			goto failed;
	}
	*pagep = page;
	return 0;

	/*
	 * Error recovery.
	 */
trunc:
	info = SHMEM_I(inode);
	ClearPageDirty(page);
	delete_from_page_cache(page);
	spin_lock(&info->lock);
	info->alloced--;
	inode->i_blocks -= BLOCKS_PER_PAGE;
	spin_unlock(&info->lock);
decused:
	sbinfo = SHMEM_SB(inode->i_sb);
	if (sbinfo->max_blocks)
		percpu_counter_add(&sbinfo->used_blocks, -1);
unacct:
	shmem_unacct_blocks(info->flags, 1);
failed:
	if (swap.val && error != -EINVAL &&
	    !shmem_confirm_swap(mapping, index, swap))
		error = -EEXIST;
unlock:
	if (page) {
		unlock_page(page);
		page_cache_release(page);
	}
	if (error == -ENOSPC && !once++) {
		info = SHMEM_I(inode);
		spin_lock(&info->lock);
		shmem_recalc_inode(inode);
		spin_unlock(&info->lock);
		goto repeat;
	}
	if (error == -EEXIST)	/* from above or from radix_tree_insert */
		goto repeat;
	return error;
}

1240 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) 1240 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
1241 { 1241 {
1242 struct inode *inode = file_inode(vma->vm_file); 1242 struct inode *inode = file_inode(vma->vm_file);
1243 int error; 1243 int error;
1244 int ret = VM_FAULT_LOCKED; 1244 int ret = VM_FAULT_LOCKED;
1245 1245
1246 /* 1246 /*
1247 * Trinity finds that probing a hole which tmpfs is punching can 1247 * Trinity finds that probing a hole which tmpfs is punching can
1248 * prevent the hole-punch from ever completing: which in turn 1248 * prevent the hole-punch from ever completing: which in turn
1249 * locks writers out with its hold on i_mutex. So refrain from 1249 * locks writers out with its hold on i_mutex. So refrain from
1250 * faulting pages into the hole while it's being punched. Although 1250 * faulting pages into the hole while it's being punched. Although
1251 * shmem_undo_range() does remove the additions, it may be unable to 1251 * shmem_undo_range() does remove the additions, it may be unable to
1252 * keep up, as each new page needs its own unmap_mapping_range() call, 1252 * keep up, as each new page needs its own unmap_mapping_range() call,
1253 * and the i_mmap tree grows ever slower to scan if new vmas are added. 1253 * and the i_mmap tree grows ever slower to scan if new vmas are added.
1254 * 1254 *
1255 * It does not matter if we sometimes reach this check just before the 1255 * It does not matter if we sometimes reach this check just before the
1256 * hole-punch begins, so that one fault then races with the punch: 1256 * hole-punch begins, so that one fault then races with the punch:
1257 * we just need to make racing faults a rare case. 1257 * we just need to make racing faults a rare case.
1258 * 1258 *
1259 * The implementation below would be much simpler if we just used a 1259 * The implementation below would be much simpler if we just used a
1260 * standard mutex or completion: but we cannot take i_mutex in fault, 1260 * standard mutex or completion: but we cannot take i_mutex in fault,
1261 * and bloating every shmem inode for this unlikely case would be sad. 1261 * and bloating every shmem inode for this unlikely case would be sad.
1262 */ 1262 */
1263 if (unlikely(inode->i_private)) { 1263 if (unlikely(inode->i_private)) {
1264 struct shmem_falloc *shmem_falloc; 1264 struct shmem_falloc *shmem_falloc;
1265 1265
1266 spin_lock(&inode->i_lock); 1266 spin_lock(&inode->i_lock);
1267 shmem_falloc = inode->i_private; 1267 shmem_falloc = inode->i_private;
1268 if (shmem_falloc && 1268 if (shmem_falloc &&
1269 shmem_falloc->waitq && 1269 shmem_falloc->waitq &&
1270 vmf->pgoff >= shmem_falloc->start && 1270 vmf->pgoff >= shmem_falloc->start &&
1271 vmf->pgoff < shmem_falloc->next) { 1271 vmf->pgoff < shmem_falloc->next) {
1272 wait_queue_head_t *shmem_falloc_waitq; 1272 wait_queue_head_t *shmem_falloc_waitq;
1273 DEFINE_WAIT(shmem_fault_wait); 1273 DEFINE_WAIT(shmem_fault_wait);
1274 1274
1275 ret = VM_FAULT_NOPAGE; 1275 ret = VM_FAULT_NOPAGE;
1276 if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) && 1276 if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
1277 !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) { 1277 !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
1278 /* It's polite to up mmap_sem if we can */ 1278 /* It's polite to up mmap_sem if we can */
1279 up_read(&vma->vm_mm->mmap_sem); 1279 up_read(&vma->vm_mm->mmap_sem);
1280 ret = VM_FAULT_RETRY; 1280 ret = VM_FAULT_RETRY;
1281 } 1281 }
1282 1282
1283 shmem_falloc_waitq = shmem_falloc->waitq; 1283 shmem_falloc_waitq = shmem_falloc->waitq;
1284 prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait, 1284 prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
1285 TASK_UNINTERRUPTIBLE); 1285 TASK_UNINTERRUPTIBLE);
1286 spin_unlock(&inode->i_lock); 1286 spin_unlock(&inode->i_lock);
1287 schedule(); 1287 schedule();
1288 1288
1289 /* 1289 /*
1290 * shmem_falloc_waitq points into the shmem_fallocate() 1290 * shmem_falloc_waitq points into the shmem_fallocate()
1291 * stack of the hole-punching task: shmem_falloc_waitq 1291 * stack of the hole-punching task: shmem_falloc_waitq
1292 * is usually invalid by the time we reach here, but 1292 * is usually invalid by the time we reach here, but
1293 * finish_wait() does not dereference it in that case; 1293 * finish_wait() does not dereference it in that case;
1294 * though i_lock needed lest racing with wake_up_all(). 1294 * though i_lock needed lest racing with wake_up_all().
1295 */ 1295 */
1296 spin_lock(&inode->i_lock); 1296 spin_lock(&inode->i_lock);
1297 finish_wait(shmem_falloc_waitq, &shmem_fault_wait); 1297 finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
1298 spin_unlock(&inode->i_lock); 1298 spin_unlock(&inode->i_lock);
1299 return ret; 1299 return ret;
1300 } 1300 }
1301 spin_unlock(&inode->i_lock); 1301 spin_unlock(&inode->i_lock);
1302 } 1302 }
1303 1303
1304 error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret); 1304 error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
1305 if (error) 1305 if (error)
1306 return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS); 1306 return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
1307 1307
1308 if (ret & VM_FAULT_MAJOR) { 1308 if (ret & VM_FAULT_MAJOR) {
1309 count_vm_event(PGMAJFAULT); 1309 count_vm_event(PGMAJFAULT);
1310 mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); 1310 mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
1311 } 1311 }
1312 return ret; 1312 return ret;
1313 } 1313 }
1314 1314
1315 #ifdef CONFIG_NUMA 1315 #ifdef CONFIG_NUMA
1316 static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol) 1316 static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
1317 { 1317 {
1318 struct inode *inode = file_inode(vma->vm_file); 1318 struct inode *inode = file_inode(vma->vm_file);
1319 return mpol_set_shared_policy(&SHMEM_I(inode)->policy, vma, mpol); 1319 return mpol_set_shared_policy(&SHMEM_I(inode)->policy, vma, mpol);
1320 } 1320 }
1321 1321
1322 static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, 1322 static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
1323 unsigned long addr) 1323 unsigned long addr)
1324 { 1324 {
1325 struct inode *inode = file_inode(vma->vm_file); 1325 struct inode *inode = file_inode(vma->vm_file);
1326 pgoff_t index; 1326 pgoff_t index;
1327 1327
1328 index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; 1328 index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
1329 return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index); 1329 return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index);
1330 } 1330 }
1331 #endif 1331 #endif
1332 1332
1333 int shmem_lock(struct file *file, int lock, struct user_struct *user) 1333 int shmem_lock(struct file *file, int lock, struct user_struct *user)
1334 { 1334 {
1335 struct inode *inode = file_inode(file); 1335 struct inode *inode = file_inode(file);
1336 struct shmem_inode_info *info = SHMEM_I(inode); 1336 struct shmem_inode_info *info = SHMEM_I(inode);
1337 int retval = -ENOMEM; 1337 int retval = -ENOMEM;
1338 1338
1339 spin_lock(&info->lock); 1339 spin_lock(&info->lock);
1340 if (lock && !(info->flags & VM_LOCKED)) { 1340 if (lock && !(info->flags & VM_LOCKED)) {
1341 if (!user_shm_lock(inode->i_size, user)) 1341 if (!user_shm_lock(inode->i_size, user))
1342 goto out_nomem; 1342 goto out_nomem;
1343 info->flags |= VM_LOCKED; 1343 info->flags |= VM_LOCKED;
1344 mapping_set_unevictable(file->f_mapping); 1344 mapping_set_unevictable(file->f_mapping);
1345 } 1345 }
1346 if (!lock && (info->flags & VM_LOCKED) && user) { 1346 if (!lock && (info->flags & VM_LOCKED) && user) {
1347 user_shm_unlock(inode->i_size, user); 1347 user_shm_unlock(inode->i_size, user);
1348 info->flags &= ~VM_LOCKED; 1348 info->flags &= ~VM_LOCKED;
1349 mapping_clear_unevictable(file->f_mapping); 1349 mapping_clear_unevictable(file->f_mapping);
1350 } 1350 }
1351 retval = 0; 1351 retval = 0;
1352 1352
1353 out_nomem: 1353 out_nomem:
1354 spin_unlock(&info->lock); 1354 spin_unlock(&info->lock);
1355 return retval; 1355 return retval;
1356 } 1356 }
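For context, shmem_lock() above is reached from the SysV IPC SHM_LOCK/SHM_UNLOCK path for tmpfs-backed segments, charging the locked size against the caller's limit via user_shm_lock(). A minimal userspace sketch (segment size and mode are arbitrary; SHM_LOCK needs CAP_IPC_LOCK or enough RLIMIT_MEMLOCK headroom):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	int id = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);

	if (id < 0) {
		perror("shmget");
		return 1;
	}
	if (shmctl(id, SHM_LOCK, NULL) != 0)	/* ends up in shmem_lock(file, 1, user) */
		perror("SHM_LOCK");
	else
		shmctl(id, SHM_UNLOCK, NULL);	/* shmem_lock(file, 0, user) */
	shmctl(id, IPC_RMID, NULL);		/* mark the segment for removal */
	return 0;
}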
1357 1357
1358 static int shmem_mmap(struct file *file, struct vm_area_struct *vma) 1358 static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
1359 { 1359 {
1360 file_accessed(file); 1360 file_accessed(file);
1361 vma->vm_ops = &shmem_vm_ops; 1361 vma->vm_ops = &shmem_vm_ops;
1362 return 0; 1362 return 0;
1363 } 1363 }
1364 1364
1365 static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir, 1365 static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir,
1366 umode_t mode, dev_t dev, unsigned long flags) 1366 umode_t mode, dev_t dev, unsigned long flags)
1367 { 1367 {
1368 struct inode *inode; 1368 struct inode *inode;
1369 struct shmem_inode_info *info; 1369 struct shmem_inode_info *info;
1370 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 1370 struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
1371 1371
1372 if (shmem_reserve_inode(sb)) 1372 if (shmem_reserve_inode(sb))
1373 return NULL; 1373 return NULL;
1374 1374
1375 inode = new_inode(sb); 1375 inode = new_inode(sb);
1376 if (inode) { 1376 if (inode) {
1377 inode->i_ino = get_next_ino(); 1377 inode->i_ino = get_next_ino();
1378 inode_init_owner(inode, dir, mode); 1378 inode_init_owner(inode, dir, mode);
1379 inode->i_blocks = 0; 1379 inode->i_blocks = 0;
1380 inode->i_mapping->backing_dev_info = &shmem_backing_dev_info; 1380 inode->i_mapping->backing_dev_info = &shmem_backing_dev_info;
1381 inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; 1381 inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
1382 inode->i_generation = get_seconds(); 1382 inode->i_generation = get_seconds();
1383 info = SHMEM_I(inode); 1383 info = SHMEM_I(inode);
1384 memset(info, 0, (char *)inode - (char *)info); 1384 memset(info, 0, (char *)inode - (char *)info);
1385 spin_lock_init(&info->lock); 1385 spin_lock_init(&info->lock);
1386 info->flags = flags & VM_NORESERVE; 1386 info->flags = flags & VM_NORESERVE;
1387 INIT_LIST_HEAD(&info->swaplist); 1387 INIT_LIST_HEAD(&info->swaplist);
1388 simple_xattrs_init(&info->xattrs); 1388 simple_xattrs_init(&info->xattrs);
1389 cache_no_acl(inode); 1389 cache_no_acl(inode);
1390 1390
1391 switch (mode & S_IFMT) { 1391 switch (mode & S_IFMT) {
1392 default: 1392 default:
1393 inode->i_op = &shmem_special_inode_operations; 1393 inode->i_op = &shmem_special_inode_operations;
1394 init_special_inode(inode, mode, dev); 1394 init_special_inode(inode, mode, dev);
1395 break; 1395 break;
1396 case S_IFREG: 1396 case S_IFREG:
1397 inode->i_mapping->a_ops = &shmem_aops; 1397 inode->i_mapping->a_ops = &shmem_aops;
1398 inode->i_op = &shmem_inode_operations; 1398 inode->i_op = &shmem_inode_operations;
1399 inode->i_fop = &shmem_file_operations; 1399 inode->i_fop = &shmem_file_operations;
1400 mpol_shared_policy_init(&info->policy, 1400 mpol_shared_policy_init(&info->policy,
1401 shmem_get_sbmpol(sbinfo)); 1401 shmem_get_sbmpol(sbinfo));
1402 break; 1402 break;
1403 case S_IFDIR: 1403 case S_IFDIR:
1404 inc_nlink(inode); 1404 inc_nlink(inode);
1405 /* Some things misbehave if size == 0 on a directory */ 1405 /* Some things misbehave if size == 0 on a directory */
1406 inode->i_size = 2 * BOGO_DIRENT_SIZE; 1406 inode->i_size = 2 * BOGO_DIRENT_SIZE;
1407 inode->i_op = &shmem_dir_inode_operations; 1407 inode->i_op = &shmem_dir_inode_operations;
1408 inode->i_fop = &simple_dir_operations; 1408 inode->i_fop = &simple_dir_operations;
1409 break; 1409 break;
1410 case S_IFLNK: 1410 case S_IFLNK:
1411 /* 1411 /*
1412 * Must not load anything in the rbtree, 1412 * Must not load anything in the rbtree,
1413 * mpol_free_shared_policy will not be called. 1413 * mpol_free_shared_policy will not be called.
1414 */ 1414 */
1415 mpol_shared_policy_init(&info->policy, NULL); 1415 mpol_shared_policy_init(&info->policy, NULL);
1416 break; 1416 break;
1417 } 1417 }
1418 } else 1418 } else
1419 shmem_free_inode(sb); 1419 shmem_free_inode(sb);
1420 return inode; 1420 return inode;
1421 } 1421 }
1422 1422
1423 bool shmem_mapping(struct address_space *mapping) 1423 bool shmem_mapping(struct address_space *mapping)
1424 { 1424 {
1425 return mapping->backing_dev_info == &shmem_backing_dev_info; 1425 return mapping->backing_dev_info == &shmem_backing_dev_info;
1426 } 1426 }
1427 1427
1428 #ifdef CONFIG_TMPFS 1428 #ifdef CONFIG_TMPFS
1429 static const struct inode_operations shmem_symlink_inode_operations; 1429 static const struct inode_operations shmem_symlink_inode_operations;
1430 static const struct inode_operations shmem_short_symlink_operations; 1430 static const struct inode_operations shmem_short_symlink_operations;
1431 1431
1432 #ifdef CONFIG_TMPFS_XATTR 1432 #ifdef CONFIG_TMPFS_XATTR
1433 static int shmem_initxattrs(struct inode *, const struct xattr *, void *); 1433 static int shmem_initxattrs(struct inode *, const struct xattr *, void *);
1434 #else 1434 #else
1435 #define shmem_initxattrs NULL 1435 #define shmem_initxattrs NULL
1436 #endif 1436 #endif
1437 1437
1438 1438    static int
1439 1439    shmem_write_begin(struct file *file, struct address_space *mapping,
1440 1440    		loff_t pos, unsigned len, unsigned flags,
1441 1441    		struct page **pagep, void **fsdata)
1442 1442    {
     1443 +  	int ret;
1443 1444    	struct inode *inode = mapping->host;
1444 1445    	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
1445      -  	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
     1446 +  	ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
     1447 +  	if (ret == 0 && *pagep)
     1448 +  		init_page_accessed(*pagep);
     1449 +  	return ret;
1446 1450    }
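The hunk above is the shmem side of the change: the page handed back for SGP_WRITE is marked accessed straight away with the non-atomic helper, so the generic write path no longer needs a later atomic mark_page_accessed() just to record the first reference. init_page_accessed() itself is introduced elsewhere in this series; roughly, as a sketch of its intended shape rather than a copy from this diff:

/*
 * Approximate shape of the helper this hunk relies on: record the first
 * "accessed" hint with a non-atomic flag update, which is only safe while
 * no other user can race on the page flags (a freshly allocated, still
 * locked page).
 */
void init_page_accessed(struct page *page)
{
	if (!PageReferenced(page))
		__SetPageReferenced(page);	/* non-atomic counterpart of SetPageReferenced() */
}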
1447 1451
1448 static int 1452 static int
1449 shmem_write_end(struct file *file, struct address_space *mapping, 1453 shmem_write_end(struct file *file, struct address_space *mapping,
1450 loff_t pos, unsigned len, unsigned copied, 1454 loff_t pos, unsigned len, unsigned copied,
1451 struct page *page, void *fsdata) 1455 struct page *page, void *fsdata)
1452 { 1456 {
1453 struct inode *inode = mapping->host; 1457 struct inode *inode = mapping->host;
1454 1458
1455 if (pos + copied > inode->i_size) 1459 if (pos + copied > inode->i_size)
1456 i_size_write(inode, pos + copied); 1460 i_size_write(inode, pos + copied);
1457 1461
1458 if (!PageUptodate(page)) { 1462 if (!PageUptodate(page)) {
1459 if (copied < PAGE_CACHE_SIZE) { 1463 if (copied < PAGE_CACHE_SIZE) {
1460 unsigned from = pos & (PAGE_CACHE_SIZE - 1); 1464 unsigned from = pos & (PAGE_CACHE_SIZE - 1);
1461 zero_user_segments(page, 0, from, 1465 zero_user_segments(page, 0, from,
1462 from + copied, PAGE_CACHE_SIZE); 1466 from + copied, PAGE_CACHE_SIZE);
1463 } 1467 }
1464 SetPageUptodate(page); 1468 SetPageUptodate(page);
1465 } 1469 }
1466 set_page_dirty(page); 1470 set_page_dirty(page);
1467 unlock_page(page); 1471 unlock_page(page);
1468 page_cache_release(page); 1472 page_cache_release(page);
1469 1473
1470 return copied; 1474 return copied;
1471 } 1475 }
1472 1476
1473 static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_t *desc, read_actor_t actor) 1477 static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_t *desc, read_actor_t actor)
1474 { 1478 {
1475 struct inode *inode = file_inode(filp); 1479 struct inode *inode = file_inode(filp);
1476 struct address_space *mapping = inode->i_mapping; 1480 struct address_space *mapping = inode->i_mapping;
1477 pgoff_t index; 1481 pgoff_t index;
1478 unsigned long offset; 1482 unsigned long offset;
1479 enum sgp_type sgp = SGP_READ; 1483 enum sgp_type sgp = SGP_READ;
1480 1484
1481 /* 1485 /*
1482 * Might this read be for a stacking filesystem? Then when reading 1486 * Might this read be for a stacking filesystem? Then when reading
1483 * holes of a sparse file, we actually need to allocate those pages, 1487 * holes of a sparse file, we actually need to allocate those pages,
1484 * and even mark them dirty, so it cannot exceed the max_blocks limit. 1488 * and even mark them dirty, so it cannot exceed the max_blocks limit.
1485 */ 1489 */
1486 if (segment_eq(get_fs(), KERNEL_DS)) 1490 if (segment_eq(get_fs(), KERNEL_DS))
1487 sgp = SGP_DIRTY; 1491 sgp = SGP_DIRTY;
1488 1492
1489 index = *ppos >> PAGE_CACHE_SHIFT; 1493 index = *ppos >> PAGE_CACHE_SHIFT;
1490 offset = *ppos & ~PAGE_CACHE_MASK; 1494 offset = *ppos & ~PAGE_CACHE_MASK;
1491 1495
1492 for (;;) { 1496 for (;;) {
1493 struct page *page = NULL; 1497 struct page *page = NULL;
1494 pgoff_t end_index; 1498 pgoff_t end_index;
1495 unsigned long nr, ret; 1499 unsigned long nr, ret;
1496 loff_t i_size = i_size_read(inode); 1500 loff_t i_size = i_size_read(inode);
1497 1501
1498 end_index = i_size >> PAGE_CACHE_SHIFT; 1502 end_index = i_size >> PAGE_CACHE_SHIFT;
1499 if (index > end_index) 1503 if (index > end_index)
1500 break; 1504 break;
1501 if (index == end_index) { 1505 if (index == end_index) {
1502 nr = i_size & ~PAGE_CACHE_MASK; 1506 nr = i_size & ~PAGE_CACHE_MASK;
1503 if (nr <= offset) 1507 if (nr <= offset)
1504 break; 1508 break;
1505 } 1509 }
1506 1510
1507 desc->error = shmem_getpage(inode, index, &page, sgp, NULL); 1511 desc->error = shmem_getpage(inode, index, &page, sgp, NULL);
1508 if (desc->error) { 1512 if (desc->error) {
1509 if (desc->error == -EINVAL) 1513 if (desc->error == -EINVAL)
1510 desc->error = 0; 1514 desc->error = 0;
1511 break; 1515 break;
1512 } 1516 }
1513 if (page) 1517 if (page)
1514 unlock_page(page); 1518 unlock_page(page);
1515 1519
1516 /* 1520 /*
1517 * We must evaluate after, since reads (unlike writes) 1521 * We must evaluate after, since reads (unlike writes)
1518 * are called without i_mutex protection against truncate 1522 * are called without i_mutex protection against truncate
1519 */ 1523 */
1520 nr = PAGE_CACHE_SIZE; 1524 nr = PAGE_CACHE_SIZE;
1521 i_size = i_size_read(inode); 1525 i_size = i_size_read(inode);
1522 end_index = i_size >> PAGE_CACHE_SHIFT; 1526 end_index = i_size >> PAGE_CACHE_SHIFT;
1523 if (index == end_index) { 1527 if (index == end_index) {
1524 nr = i_size & ~PAGE_CACHE_MASK; 1528 nr = i_size & ~PAGE_CACHE_MASK;
1525 if (nr <= offset) { 1529 if (nr <= offset) {
1526 if (page) 1530 if (page)
1527 page_cache_release(page); 1531 page_cache_release(page);
1528 break; 1532 break;
1529 } 1533 }
1530 } 1534 }
1531 nr -= offset; 1535 nr -= offset;
1532 1536
1533 if (page) { 1537 if (page) {
1534 /* 1538 /*
1535 * If users can be writing to this page using arbitrary 1539 * If users can be writing to this page using arbitrary
1536 * virtual addresses, take care about potential aliasing 1540 * virtual addresses, take care about potential aliasing
1537 * before reading the page on the kernel side. 1541 * before reading the page on the kernel side.
1538 */ 1542 */
1539 if (mapping_writably_mapped(mapping)) 1543 if (mapping_writably_mapped(mapping))
1540 flush_dcache_page(page); 1544 flush_dcache_page(page);
1541 /* 1545 /*
1542 * Mark the page accessed if we read the beginning. 1546 * Mark the page accessed if we read the beginning.
1543 */ 1547 */
1544 if (!offset) 1548 if (!offset)
1545 mark_page_accessed(page); 1549 mark_page_accessed(page);
1546 } else { 1550 } else {
1547 page = ZERO_PAGE(0); 1551 page = ZERO_PAGE(0);
1548 page_cache_get(page); 1552 page_cache_get(page);
1549 } 1553 }
1550 1554
1551 /* 1555 /*
1552 * Ok, we have the page, and it's up-to-date, so 1556 * Ok, we have the page, and it's up-to-date, so
1553 * now we can copy it to user space... 1557 * now we can copy it to user space...
1554 * 1558 *
1555 * The actor routine returns how many bytes were actually used.. 1559 * The actor routine returns how many bytes were actually used..
1556 * NOTE! This may not be the same as how much of a user buffer 1560 * NOTE! This may not be the same as how much of a user buffer
1557 * we filled up (we may be padding etc), so we can only update 1561 * we filled up (we may be padding etc), so we can only update
1558 * "pos" here (the actor routine has to update the user buffer 1562 * "pos" here (the actor routine has to update the user buffer
1559 * pointers and the remaining count). 1563 * pointers and the remaining count).
1560 */ 1564 */
1561 ret = actor(desc, page, offset, nr); 1565 ret = actor(desc, page, offset, nr);
1562 offset += ret; 1566 offset += ret;
1563 index += offset >> PAGE_CACHE_SHIFT; 1567 index += offset >> PAGE_CACHE_SHIFT;
1564 offset &= ~PAGE_CACHE_MASK; 1568 offset &= ~PAGE_CACHE_MASK;
1565 1569
1566 page_cache_release(page); 1570 page_cache_release(page);
1567 if (ret != nr || !desc->count) 1571 if (ret != nr || !desc->count)
1568 break; 1572 break;
1569 1573
1570 cond_resched(); 1574 cond_resched();
1571 } 1575 }
1572 1576
1573 *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset; 1577 *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
1574 file_accessed(filp); 1578 file_accessed(filp);
1575 } 1579 }
1576 1580
1577 static ssize_t shmem_file_aio_read(struct kiocb *iocb, 1581 static ssize_t shmem_file_aio_read(struct kiocb *iocb,
1578 const struct iovec *iov, unsigned long nr_segs, loff_t pos) 1582 const struct iovec *iov, unsigned long nr_segs, loff_t pos)
1579 { 1583 {
1580 struct file *filp = iocb->ki_filp; 1584 struct file *filp = iocb->ki_filp;
1581 ssize_t retval; 1585 ssize_t retval;
1582 unsigned long seg; 1586 unsigned long seg;
1583 size_t count; 1587 size_t count;
1584 loff_t *ppos = &iocb->ki_pos; 1588 loff_t *ppos = &iocb->ki_pos;
1585 1589
1586 retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE); 1590 retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
1587 if (retval) 1591 if (retval)
1588 return retval; 1592 return retval;
1589 1593
1590 for (seg = 0; seg < nr_segs; seg++) { 1594 for (seg = 0; seg < nr_segs; seg++) {
1591 read_descriptor_t desc; 1595 read_descriptor_t desc;
1592 1596
1593 desc.written = 0; 1597 desc.written = 0;
1594 desc.arg.buf = iov[seg].iov_base; 1598 desc.arg.buf = iov[seg].iov_base;
1595 desc.count = iov[seg].iov_len; 1599 desc.count = iov[seg].iov_len;
1596 if (desc.count == 0) 1600 if (desc.count == 0)
1597 continue; 1601 continue;
1598 desc.error = 0; 1602 desc.error = 0;
1599 do_shmem_file_read(filp, ppos, &desc, file_read_actor); 1603 do_shmem_file_read(filp, ppos, &desc, file_read_actor);
1600 retval += desc.written; 1604 retval += desc.written;
1601 if (desc.error) { 1605 if (desc.error) {
1602 retval = retval ?: desc.error; 1606 retval = retval ?: desc.error;
1603 break; 1607 break;
1604 } 1608 }
1605 if (desc.count > 0) 1609 if (desc.count > 0)
1606 break; 1610 break;
1607 } 1611 }
1608 return retval; 1612 return retval;
1609 } 1613 }
1610 1614
1611 static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos, 1615 static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
1612 struct pipe_inode_info *pipe, size_t len, 1616 struct pipe_inode_info *pipe, size_t len,
1613 unsigned int flags) 1617 unsigned int flags)
1614 { 1618 {
1615 struct address_space *mapping = in->f_mapping; 1619 struct address_space *mapping = in->f_mapping;
1616 struct inode *inode = mapping->host; 1620 struct inode *inode = mapping->host;
1617 unsigned int loff, nr_pages, req_pages; 1621 unsigned int loff, nr_pages, req_pages;
1618 struct page *pages[PIPE_DEF_BUFFERS]; 1622 struct page *pages[PIPE_DEF_BUFFERS];
1619 struct partial_page partial[PIPE_DEF_BUFFERS]; 1623 struct partial_page partial[PIPE_DEF_BUFFERS];
1620 struct page *page; 1624 struct page *page;
1621 pgoff_t index, end_index; 1625 pgoff_t index, end_index;
1622 loff_t isize, left; 1626 loff_t isize, left;
1623 int error, page_nr; 1627 int error, page_nr;
1624 struct splice_pipe_desc spd = { 1628 struct splice_pipe_desc spd = {
1625 .pages = pages, 1629 .pages = pages,
1626 .partial = partial, 1630 .partial = partial,
1627 .nr_pages_max = PIPE_DEF_BUFFERS, 1631 .nr_pages_max = PIPE_DEF_BUFFERS,
1628 .flags = flags, 1632 .flags = flags,
1629 .ops = &page_cache_pipe_buf_ops, 1633 .ops = &page_cache_pipe_buf_ops,
1630 .spd_release = spd_release_page, 1634 .spd_release = spd_release_page,
1631 }; 1635 };
1632 1636
1633 isize = i_size_read(inode); 1637 isize = i_size_read(inode);
1634 if (unlikely(*ppos >= isize)) 1638 if (unlikely(*ppos >= isize))
1635 return 0; 1639 return 0;
1636 1640
1637 left = isize - *ppos; 1641 left = isize - *ppos;
1638 if (unlikely(left < len)) 1642 if (unlikely(left < len))
1639 len = left; 1643 len = left;
1640 1644
1641 if (splice_grow_spd(pipe, &spd)) 1645 if (splice_grow_spd(pipe, &spd))
1642 return -ENOMEM; 1646 return -ENOMEM;
1643 1647
1644 index = *ppos >> PAGE_CACHE_SHIFT; 1648 index = *ppos >> PAGE_CACHE_SHIFT;
1645 loff = *ppos & ~PAGE_CACHE_MASK; 1649 loff = *ppos & ~PAGE_CACHE_MASK;
1646 req_pages = (len + loff + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; 1650 req_pages = (len + loff + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
1647 nr_pages = min(req_pages, pipe->buffers); 1651 nr_pages = min(req_pages, pipe->buffers);
1648 1652
1649 spd.nr_pages = find_get_pages_contig(mapping, index, 1653 spd.nr_pages = find_get_pages_contig(mapping, index,
1650 nr_pages, spd.pages); 1654 nr_pages, spd.pages);
1651 index += spd.nr_pages; 1655 index += spd.nr_pages;
1652 error = 0; 1656 error = 0;
1653 1657
1654 while (spd.nr_pages < nr_pages) { 1658 while (spd.nr_pages < nr_pages) {
1655 error = shmem_getpage(inode, index, &page, SGP_CACHE, NULL); 1659 error = shmem_getpage(inode, index, &page, SGP_CACHE, NULL);
1656 if (error) 1660 if (error)
1657 break; 1661 break;
1658 unlock_page(page); 1662 unlock_page(page);
1659 spd.pages[spd.nr_pages++] = page; 1663 spd.pages[spd.nr_pages++] = page;
1660 index++; 1664 index++;
1661 } 1665 }
1662 1666
1663 index = *ppos >> PAGE_CACHE_SHIFT; 1667 index = *ppos >> PAGE_CACHE_SHIFT;
1664 nr_pages = spd.nr_pages; 1668 nr_pages = spd.nr_pages;
1665 spd.nr_pages = 0; 1669 spd.nr_pages = 0;
1666 1670
1667 for (page_nr = 0; page_nr < nr_pages; page_nr++) { 1671 for (page_nr = 0; page_nr < nr_pages; page_nr++) {
1668 unsigned int this_len; 1672 unsigned int this_len;
1669 1673
1670 if (!len) 1674 if (!len)
1671 break; 1675 break;
1672 1676
1673 this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff); 1677 this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff);
1674 page = spd.pages[page_nr]; 1678 page = spd.pages[page_nr];
1675 1679
1676 if (!PageUptodate(page) || page->mapping != mapping) { 1680 if (!PageUptodate(page) || page->mapping != mapping) {
1677 error = shmem_getpage(inode, index, &page, 1681 error = shmem_getpage(inode, index, &page,
1678 SGP_CACHE, NULL); 1682 SGP_CACHE, NULL);
1679 if (error) 1683 if (error)
1680 break; 1684 break;
1681 unlock_page(page); 1685 unlock_page(page);
1682 page_cache_release(spd.pages[page_nr]); 1686 page_cache_release(spd.pages[page_nr]);
1683 spd.pages[page_nr] = page; 1687 spd.pages[page_nr] = page;
1684 } 1688 }
1685 1689
1686 isize = i_size_read(inode); 1690 isize = i_size_read(inode);
1687 end_index = (isize - 1) >> PAGE_CACHE_SHIFT; 1691 end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
1688 if (unlikely(!isize || index > end_index)) 1692 if (unlikely(!isize || index > end_index))
1689 break; 1693 break;
1690 1694
1691 if (end_index == index) { 1695 if (end_index == index) {
1692 unsigned int plen; 1696 unsigned int plen;
1693 1697
1694 plen = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; 1698 plen = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
1695 if (plen <= loff) 1699 if (plen <= loff)
1696 break; 1700 break;
1697 1701
1698 this_len = min(this_len, plen - loff); 1702 this_len = min(this_len, plen - loff);
1699 len = this_len; 1703 len = this_len;
1700 } 1704 }
1701 1705
1702 spd.partial[page_nr].offset = loff; 1706 spd.partial[page_nr].offset = loff;
1703 spd.partial[page_nr].len = this_len; 1707 spd.partial[page_nr].len = this_len;
1704 len -= this_len; 1708 len -= this_len;
1705 loff = 0; 1709 loff = 0;
1706 spd.nr_pages++; 1710 spd.nr_pages++;
1707 index++; 1711 index++;
1708 } 1712 }
1709 1713
1710 while (page_nr < nr_pages) 1714 while (page_nr < nr_pages)
1711 page_cache_release(spd.pages[page_nr++]); 1715 page_cache_release(spd.pages[page_nr++]);
1712 1716
1713 if (spd.nr_pages) 1717 if (spd.nr_pages)
1714 error = splice_to_pipe(pipe, &spd); 1718 error = splice_to_pipe(pipe, &spd);
1715 1719
1716 splice_shrink_spd(&spd); 1720 splice_shrink_spd(&spd);
1717 1721
1718 if (error > 0) { 1722 if (error > 0) {
1719 *ppos += error; 1723 *ppos += error;
1720 file_accessed(in); 1724 file_accessed(in);
1721 } 1725 }
1722 return error; 1726 return error;
1723 } 1727 }
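shmem_file_splice_read() above feeds tmpfs page cache pages straight into a pipe without copying through a user buffer. A hedged userspace sketch of the interface it serves (the file path is made up; one end of splice() must be a pipe):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int pipefd[2];
	int fd = open("/dev/shm/splice-example", O_RDONLY);

	if (fd < 0 || pipe(pipefd) != 0)
		return 1;
	/* Move up to 64KB of tmpfs pages into the pipe without a user-space copy. */
	splice(fd, NULL, pipefd[1], NULL, 64 * 1024, SPLICE_F_MOVE);
	close(pipefd[0]);
	close(pipefd[1]);
	close(fd);
	return 0;
}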
1724 1728
1725 /* 1729 /*
1726 * llseek SEEK_DATA or SEEK_HOLE through the radix_tree. 1730 * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
1727 */ 1731 */
1728 static pgoff_t shmem_seek_hole_data(struct address_space *mapping, 1732 static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
1729 pgoff_t index, pgoff_t end, int whence) 1733 pgoff_t index, pgoff_t end, int whence)
1730 { 1734 {
1731 struct page *page; 1735 struct page *page;
1732 struct pagevec pvec; 1736 struct pagevec pvec;
1733 pgoff_t indices[PAGEVEC_SIZE]; 1737 pgoff_t indices[PAGEVEC_SIZE];
1734 bool done = false; 1738 bool done = false;
1735 int i; 1739 int i;
1736 1740
1737 pagevec_init(&pvec, 0); 1741 pagevec_init(&pvec, 0);
1738 pvec.nr = 1; /* start small: we may be there already */ 1742 pvec.nr = 1; /* start small: we may be there already */
1739 while (!done) { 1743 while (!done) {
1740 pvec.nr = find_get_entries(mapping, index, 1744 pvec.nr = find_get_entries(mapping, index,
1741 pvec.nr, pvec.pages, indices); 1745 pvec.nr, pvec.pages, indices);
1742 if (!pvec.nr) { 1746 if (!pvec.nr) {
1743 if (whence == SEEK_DATA) 1747 if (whence == SEEK_DATA)
1744 index = end; 1748 index = end;
1745 break; 1749 break;
1746 } 1750 }
1747 for (i = 0; i < pvec.nr; i++, index++) { 1751 for (i = 0; i < pvec.nr; i++, index++) {
1748 if (index < indices[i]) { 1752 if (index < indices[i]) {
1749 if (whence == SEEK_HOLE) { 1753 if (whence == SEEK_HOLE) {
1750 done = true; 1754 done = true;
1751 break; 1755 break;
1752 } 1756 }
1753 index = indices[i]; 1757 index = indices[i];
1754 } 1758 }
1755 page = pvec.pages[i]; 1759 page = pvec.pages[i];
1756 if (page && !radix_tree_exceptional_entry(page)) { 1760 if (page && !radix_tree_exceptional_entry(page)) {
1757 if (!PageUptodate(page)) 1761 if (!PageUptodate(page))
1758 page = NULL; 1762 page = NULL;
1759 } 1763 }
1760 if (index >= end || 1764 if (index >= end ||
1761 (page && whence == SEEK_DATA) || 1765 (page && whence == SEEK_DATA) ||
1762 (!page && whence == SEEK_HOLE)) { 1766 (!page && whence == SEEK_HOLE)) {
1763 done = true; 1767 done = true;
1764 break; 1768 break;
1765 } 1769 }
1766 } 1770 }
1767 pagevec_remove_exceptionals(&pvec); 1771 pagevec_remove_exceptionals(&pvec);
1768 pagevec_release(&pvec); 1772 pagevec_release(&pvec);
1769 pvec.nr = PAGEVEC_SIZE; 1773 pvec.nr = PAGEVEC_SIZE;
1770 cond_resched(); 1774 cond_resched();
1771 } 1775 }
1772 return index; 1776 return index;
1773 } 1777 }
1774 1778
1775 static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence) 1779 static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
1776 { 1780 {
1777 struct address_space *mapping = file->f_mapping; 1781 struct address_space *mapping = file->f_mapping;
1778 struct inode *inode = mapping->host; 1782 struct inode *inode = mapping->host;
1779 pgoff_t start, end; 1783 pgoff_t start, end;
1780 loff_t new_offset; 1784 loff_t new_offset;
1781 1785
1782 if (whence != SEEK_DATA && whence != SEEK_HOLE) 1786 if (whence != SEEK_DATA && whence != SEEK_HOLE)
1783 return generic_file_llseek_size(file, offset, whence, 1787 return generic_file_llseek_size(file, offset, whence,
1784 MAX_LFS_FILESIZE, i_size_read(inode)); 1788 MAX_LFS_FILESIZE, i_size_read(inode));
1785 mutex_lock(&inode->i_mutex); 1789 mutex_lock(&inode->i_mutex);
1786 /* We're holding i_mutex so we can access i_size directly */ 1790 /* We're holding i_mutex so we can access i_size directly */
1787 1791
1788 if (offset < 0) 1792 if (offset < 0)
1789 offset = -EINVAL; 1793 offset = -EINVAL;
1790 else if (offset >= inode->i_size) 1794 else if (offset >= inode->i_size)
1791 offset = -ENXIO; 1795 offset = -ENXIO;
1792 else { 1796 else {
1793 start = offset >> PAGE_CACHE_SHIFT; 1797 start = offset >> PAGE_CACHE_SHIFT;
1794 end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; 1798 end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
1795 new_offset = shmem_seek_hole_data(mapping, start, end, whence); 1799 new_offset = shmem_seek_hole_data(mapping, start, end, whence);
1796 new_offset <<= PAGE_CACHE_SHIFT; 1800 new_offset <<= PAGE_CACHE_SHIFT;
1797 if (new_offset > offset) { 1801 if (new_offset > offset) {
1798 if (new_offset < inode->i_size) 1802 if (new_offset < inode->i_size)
1799 offset = new_offset; 1803 offset = new_offset;
1800 else if (whence == SEEK_DATA) 1804 else if (whence == SEEK_DATA)
1801 offset = -ENXIO; 1805 offset = -ENXIO;
1802 else 1806 else
1803 offset = inode->i_size; 1807 offset = inode->i_size;
1804 } 1808 }
1805 } 1809 }
1806 1810
1807 if (offset >= 0) 1811 if (offset >= 0)
1808 offset = vfs_setpos(file, offset, MAX_LFS_FILESIZE); 1812 offset = vfs_setpos(file, offset, MAX_LFS_FILESIZE);
1809 mutex_unlock(&inode->i_mutex); 1813 mutex_unlock(&inode->i_mutex);
1810 return offset; 1814 return offset;
1811 } 1815 }
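shmem_file_llseek() and shmem_seek_hole_data() above implement SEEK_DATA/SEEK_HOLE for tmpfs by walking the radix tree rather than a block map. A small usage sketch (hypothetical path; lseek() fails with ENXIO when no data follows the offset):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/shm/sparse-example", O_RDONLY);
	off_t data, hole;

	if (fd < 0)
		return 1;
	data = lseek(fd, 0, SEEK_DATA);		/* first page-cache page at or after offset 0 */
	if (data >= 0) {
		hole = lseek(fd, data, SEEK_HOLE);	/* end of that data extent */
		printf("data at %lld, hole at %lld\n", (long long)data, (long long)hole);
	}
	close(fd);
	return 0;
}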
1812 1816
1813 static long shmem_fallocate(struct file *file, int mode, loff_t offset, 1817 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
1814 loff_t len) 1818 loff_t len)
1815 { 1819 {
1816 struct inode *inode = file_inode(file); 1820 struct inode *inode = file_inode(file);
1817 struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); 1821 struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
1818 struct shmem_falloc shmem_falloc; 1822 struct shmem_falloc shmem_falloc;
1819 pgoff_t start, index, end; 1823 pgoff_t start, index, end;
1820 int error; 1824 int error;
1821 1825
1822 mutex_lock(&inode->i_mutex); 1826 mutex_lock(&inode->i_mutex);
1823 1827
1824 if (mode & FALLOC_FL_PUNCH_HOLE) { 1828 if (mode & FALLOC_FL_PUNCH_HOLE) {
1825 struct address_space *mapping = file->f_mapping; 1829 struct address_space *mapping = file->f_mapping;
1826 loff_t unmap_start = round_up(offset, PAGE_SIZE); 1830 loff_t unmap_start = round_up(offset, PAGE_SIZE);
1827 loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1; 1831 loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
1828 DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq); 1832 DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
1829 1833
1830 shmem_falloc.waitq = &shmem_falloc_waitq; 1834 shmem_falloc.waitq = &shmem_falloc_waitq;
1831 shmem_falloc.start = unmap_start >> PAGE_SHIFT; 1835 shmem_falloc.start = unmap_start >> PAGE_SHIFT;
1832 shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; 1836 shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
1833 spin_lock(&inode->i_lock); 1837 spin_lock(&inode->i_lock);
1834 inode->i_private = &shmem_falloc; 1838 inode->i_private = &shmem_falloc;
1835 spin_unlock(&inode->i_lock); 1839 spin_unlock(&inode->i_lock);
1836 1840
1837 if ((u64)unmap_end > (u64)unmap_start) 1841 if ((u64)unmap_end > (u64)unmap_start)
1838 unmap_mapping_range(mapping, unmap_start, 1842 unmap_mapping_range(mapping, unmap_start,
1839 1 + unmap_end - unmap_start, 0); 1843 1 + unmap_end - unmap_start, 0);
1840 shmem_truncate_range(inode, offset, offset + len - 1); 1844 shmem_truncate_range(inode, offset, offset + len - 1);
1841 /* No need to unmap again: hole-punching leaves COWed pages */ 1845 /* No need to unmap again: hole-punching leaves COWed pages */
1842 1846
1843 spin_lock(&inode->i_lock); 1847 spin_lock(&inode->i_lock);
1844 inode->i_private = NULL; 1848 inode->i_private = NULL;
1845 wake_up_all(&shmem_falloc_waitq); 1849 wake_up_all(&shmem_falloc_waitq);
1846 spin_unlock(&inode->i_lock); 1850 spin_unlock(&inode->i_lock);
1847 error = 0; 1851 error = 0;
1848 goto out; 1852 goto out;
1849 } 1853 }
1850 1854
1851 /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */ 1855 /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
1852 error = inode_newsize_ok(inode, offset + len); 1856 error = inode_newsize_ok(inode, offset + len);
1853 if (error) 1857 if (error)
1854 goto out; 1858 goto out;
1855 1859
1856 start = offset >> PAGE_CACHE_SHIFT; 1860 start = offset >> PAGE_CACHE_SHIFT;
1857 end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; 1861 end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
1858 /* Try to avoid a swapstorm if len is impossible to satisfy */ 1862 /* Try to avoid a swapstorm if len is impossible to satisfy */
1859 if (sbinfo->max_blocks && end - start > sbinfo->max_blocks) { 1863 if (sbinfo->max_blocks && end - start > sbinfo->max_blocks) {
1860 error = -ENOSPC; 1864 error = -ENOSPC;
1861 goto out; 1865 goto out;
1862 } 1866 }
1863 1867
1864 shmem_falloc.waitq = NULL; 1868 shmem_falloc.waitq = NULL;
1865 shmem_falloc.start = start; 1869 shmem_falloc.start = start;
1866 shmem_falloc.next = start; 1870 shmem_falloc.next = start;
1867 shmem_falloc.nr_falloced = 0; 1871 shmem_falloc.nr_falloced = 0;
1868 shmem_falloc.nr_unswapped = 0; 1872 shmem_falloc.nr_unswapped = 0;
1869 spin_lock(&inode->i_lock); 1873 spin_lock(&inode->i_lock);
1870 inode->i_private = &shmem_falloc; 1874 inode->i_private = &shmem_falloc;
1871 spin_unlock(&inode->i_lock); 1875 spin_unlock(&inode->i_lock);
1872 1876
1873 for (index = start; index < end; index++) { 1877 for (index = start; index < end; index++) {
1874 struct page *page; 1878 struct page *page;
1875 1879
1876 /* 1880 /*
1877 * Good, the fallocate(2) manpage permits EINTR: we may have 1881 * Good, the fallocate(2) manpage permits EINTR: we may have
1878 * been interrupted because we are using up too much memory. 1882 * been interrupted because we are using up too much memory.
1879 */ 1883 */
1880 if (signal_pending(current)) 1884 if (signal_pending(current))
1881 error = -EINTR; 1885 error = -EINTR;
1882 else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced) 1886 else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced)
1883 error = -ENOMEM; 1887 error = -ENOMEM;
1884 else 1888 else
1885 error = shmem_getpage(inode, index, &page, SGP_FALLOC, 1889 error = shmem_getpage(inode, index, &page, SGP_FALLOC,
1886 NULL); 1890 NULL);
1887 if (error) { 1891 if (error) {
1888 /* Remove the !PageUptodate pages we added */ 1892 /* Remove the !PageUptodate pages we added */
1889 shmem_undo_range(inode, 1893 shmem_undo_range(inode,
1890 (loff_t)start << PAGE_CACHE_SHIFT, 1894 (loff_t)start << PAGE_CACHE_SHIFT,
1891 (loff_t)index << PAGE_CACHE_SHIFT, true); 1895 (loff_t)index << PAGE_CACHE_SHIFT, true);
1892 goto undone; 1896 goto undone;
1893 } 1897 }
1894 1898
1895 /* 1899 /*
1896 * Inform shmem_writepage() how far we have reached. 1900 * Inform shmem_writepage() how far we have reached.
1897 * No need for lock or barrier: we have the page lock. 1901 * No need for lock or barrier: we have the page lock.
1898 */ 1902 */
1899 shmem_falloc.next++; 1903 shmem_falloc.next++;
1900 if (!PageUptodate(page)) 1904 if (!PageUptodate(page))
1901 shmem_falloc.nr_falloced++; 1905 shmem_falloc.nr_falloced++;
1902 1906
1903 /* 1907 /*
1904 * If !PageUptodate, leave it that way so that freeable pages 1908 * If !PageUptodate, leave it that way so that freeable pages
1905 * can be recognized if we need to rollback on error later. 1909 * can be recognized if we need to rollback on error later.
1906 * But set_page_dirty so that memory pressure will swap rather 1910 * But set_page_dirty so that memory pressure will swap rather
1907 * than free the pages we are allocating (and SGP_CACHE pages 1911 * than free the pages we are allocating (and SGP_CACHE pages
1908 * might still be clean: we now need to mark those dirty too). 1912 * might still be clean: we now need to mark those dirty too).
1909 */ 1913 */
1910 set_page_dirty(page); 1914 set_page_dirty(page);
1911 unlock_page(page); 1915 unlock_page(page);
1912 page_cache_release(page); 1916 page_cache_release(page);
1913 cond_resched(); 1917 cond_resched();
1914 } 1918 }
1915 1919
1916 if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) 1920 if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
1917 i_size_write(inode, offset + len); 1921 i_size_write(inode, offset + len);
1918 inode->i_ctime = CURRENT_TIME; 1922 inode->i_ctime = CURRENT_TIME;
1919 undone: 1923 undone:
1920 spin_lock(&inode->i_lock); 1924 spin_lock(&inode->i_lock);
1921 inode->i_private = NULL; 1925 inode->i_private = NULL;
1922 spin_unlock(&inode->i_lock); 1926 spin_unlock(&inode->i_lock);
1923 out: 1927 out:
1924 mutex_unlock(&inode->i_mutex); 1928 mutex_unlock(&inode->i_mutex);
1925 return error; 1929 return error;
1926 } 1930 }
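shmem_fallocate() above serves both preallocation (the SGP_FALLOC loop) and hole punching (the FALLOC_FL_PUNCH_HOLE branch, whose on-stack waitqueue pairs with the fault-side wait in shmem_fault() earlier). A hedged usage sketch with made-up path and sizes; note that hole punching must be combined with FALLOC_FL_KEEP_SIZE:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/shm/falloc-example", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return 1;
	/* Preallocate 4MB: each index goes through the SGP_FALLOC loop above. */
	fallocate(fd, 0, 0, 4 << 20);
	/* Punch a 1MB hole at offset 1MB: unmap, truncate the range, wake waiters. */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 1 << 20, 1 << 20);
	close(fd);
	return 0;
}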
1927 1931
1928 static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf) 1932 static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
1929 { 1933 {
1930 struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb); 1934 struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
1931 1935
1932 buf->f_type = TMPFS_MAGIC; 1936 buf->f_type = TMPFS_MAGIC;
1933 buf->f_bsize = PAGE_CACHE_SIZE; 1937 buf->f_bsize = PAGE_CACHE_SIZE;
1934 buf->f_namelen = NAME_MAX; 1938 buf->f_namelen = NAME_MAX;
1935 if (sbinfo->max_blocks) { 1939 if (sbinfo->max_blocks) {
1936 buf->f_blocks = sbinfo->max_blocks; 1940 buf->f_blocks = sbinfo->max_blocks;
1937 buf->f_bavail = 1941 buf->f_bavail =
1938 buf->f_bfree = sbinfo->max_blocks - 1942 buf->f_bfree = sbinfo->max_blocks -
1939 percpu_counter_sum(&sbinfo->used_blocks); 1943 percpu_counter_sum(&sbinfo->used_blocks);
1940 } 1944 }
1941 if (sbinfo->max_inodes) { 1945 if (sbinfo->max_inodes) {
1942 buf->f_files = sbinfo->max_inodes; 1946 buf->f_files = sbinfo->max_inodes;
1943 buf->f_ffree = sbinfo->free_inodes; 1947 buf->f_ffree = sbinfo->free_inodes;
1944 } 1948 }
1945 /* else leave those fields 0 like simple_statfs */ 1949 /* else leave those fields 0 like simple_statfs */
1946 return 0; 1950 return 0;
1947 } 1951 }
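The f_blocks/f_bfree and f_files/f_ffree values filled in above are what df and statvfs() report for a tmpfs mount. A quick check from userspace (mount point assumed to be /dev/shm):

#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
	struct statvfs sv;

	if (statvfs("/dev/shm", &sv) != 0)
		return 1;
	/* f_frsize corresponds to the f_bsize (PAGE_CACHE_SIZE) set in shmem_statfs(). */
	printf("size %llu bytes, free %llu bytes, inodes %llu free of %llu\n",
	       (unsigned long long)sv.f_blocks * sv.f_frsize,
	       (unsigned long long)sv.f_bfree * sv.f_frsize,
	       (unsigned long long)sv.f_ffree,
	       (unsigned long long)sv.f_files);
	return 0;
}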
1948 1952
1949 /* 1953 /*
1950 * File creation. Allocate an inode, and we're done.. 1954 * File creation. Allocate an inode, and we're done..
1951 */ 1955 */
1952 static int 1956 static int
1953 shmem_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) 1957 shmem_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
1954 { 1958 {
1955 struct inode *inode; 1959 struct inode *inode;
1956 int error = -ENOSPC; 1960 int error = -ENOSPC;
1957 1961
1958 inode = shmem_get_inode(dir->i_sb, dir, mode, dev, VM_NORESERVE); 1962 inode = shmem_get_inode(dir->i_sb, dir, mode, dev, VM_NORESERVE);
1959 if (inode) { 1963 if (inode) {
1960 #ifdef CONFIG_TMPFS_POSIX_ACL 1964 #ifdef CONFIG_TMPFS_POSIX_ACL
1961 error = generic_acl_init(inode, dir); 1965 error = generic_acl_init(inode, dir);
1962 if (error) { 1966 if (error) {
1963 iput(inode); 1967 iput(inode);
1964 return error; 1968 return error;
1965 } 1969 }
1966 #endif 1970 #endif
1967 error = security_inode_init_security(inode, dir, 1971 error = security_inode_init_security(inode, dir,
1968 &dentry->d_name, 1972 &dentry->d_name,
1969 shmem_initxattrs, NULL); 1973 shmem_initxattrs, NULL);
1970 if (error) { 1974 if (error) {
1971 if (error != -EOPNOTSUPP) { 1975 if (error != -EOPNOTSUPP) {
1972 iput(inode); 1976 iput(inode);
1973 return error; 1977 return error;
1974 } 1978 }
1975 } 1979 }
1976 1980
1977 error = 0; 1981 error = 0;
1978 dir->i_size += BOGO_DIRENT_SIZE; 1982 dir->i_size += BOGO_DIRENT_SIZE;
1979 dir->i_ctime = dir->i_mtime = CURRENT_TIME; 1983 dir->i_ctime = dir->i_mtime = CURRENT_TIME;
1980 d_instantiate(dentry, inode); 1984 d_instantiate(dentry, inode);
1981 dget(dentry); /* Extra count - pin the dentry in core */ 1985 dget(dentry); /* Extra count - pin the dentry in core */
1982 } 1986 }
1983 return error; 1987 return error;
1984 } 1988 }
1985 1989
1986 static int 1990 static int
1987 shmem_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) 1991 shmem_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
1988 { 1992 {
1989 struct inode *inode; 1993 struct inode *inode;
1990 int error = -ENOSPC; 1994 int error = -ENOSPC;
1991 1995
1992 inode = shmem_get_inode(dir->i_sb, dir, mode, 0, VM_NORESERVE); 1996 inode = shmem_get_inode(dir->i_sb, dir, mode, 0, VM_NORESERVE);
1993 if (inode) { 1997 if (inode) {
1994 error = security_inode_init_security(inode, dir, 1998 error = security_inode_init_security(inode, dir,
1995 NULL, 1999 NULL,
1996 shmem_initxattrs, NULL); 2000 shmem_initxattrs, NULL);
1997 if (error) { 2001 if (error) {
1998 if (error != -EOPNOTSUPP) { 2002 if (error != -EOPNOTSUPP) {
1999 iput(inode); 2003 iput(inode);
2000 return error; 2004 return error;
2001 } 2005 }
2002 } 2006 }
2003 #ifdef CONFIG_TMPFS_POSIX_ACL 2007 #ifdef CONFIG_TMPFS_POSIX_ACL
2004 error = generic_acl_init(inode, dir); 2008 error = generic_acl_init(inode, dir);
2005 if (error) { 2009 if (error) {
2006 iput(inode); 2010 iput(inode);
2007 return error; 2011 return error;
2008 } 2012 }
2009 #else 2013 #else
2010 error = 0; 2014 error = 0;
2011 #endif 2015 #endif
2012 d_tmpfile(dentry, inode); 2016 d_tmpfile(dentry, inode);
2013 } 2017 }
2014 return error; 2018 return error;
2015 } 2019 }
2016 2020
2017 static int shmem_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) 2021 static int shmem_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
2018 { 2022 {
2019 int error; 2023 int error;
2020 2024
2021 if ((error = shmem_mknod(dir, dentry, mode | S_IFDIR, 0))) 2025 if ((error = shmem_mknod(dir, dentry, mode | S_IFDIR, 0)))
2022 return error; 2026 return error;
2023 inc_nlink(dir); 2027 inc_nlink(dir);
2024 return 0; 2028 return 0;
2025 } 2029 }
2026 2030
2027 static int shmem_create(struct inode *dir, struct dentry *dentry, umode_t mode, 2031 static int shmem_create(struct inode *dir, struct dentry *dentry, umode_t mode,
2028 bool excl) 2032 bool excl)
2029 { 2033 {
2030 return shmem_mknod(dir, dentry, mode | S_IFREG, 0); 2034 return shmem_mknod(dir, dentry, mode | S_IFREG, 0);
2031 } 2035 }
2032 2036
2033 /* 2037 /*
2034 * Link a file.. 2038 * Link a file..
2035 */ 2039 */
2036 static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry) 2040 static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
2037 { 2041 {
2038 struct inode *inode = old_dentry->d_inode; 2042 struct inode *inode = old_dentry->d_inode;
2039 int ret; 2043 int ret;
2040 2044
2041 /* 2045 /*
2042 * No ordinary (disk based) filesystem counts links as inodes; 2046 * No ordinary (disk based) filesystem counts links as inodes;
2043 * but each new link needs a new dentry, pinning lowmem, and 2047 * but each new link needs a new dentry, pinning lowmem, and
2044 * tmpfs dentries cannot be pruned until they are unlinked. 2048 * tmpfs dentries cannot be pruned until they are unlinked.
2045 */ 2049 */
2046 ret = shmem_reserve_inode(inode->i_sb); 2050 ret = shmem_reserve_inode(inode->i_sb);
2047 if (ret) 2051 if (ret)
2048 goto out; 2052 goto out;
2049 2053
2050 dir->i_size += BOGO_DIRENT_SIZE; 2054 dir->i_size += BOGO_DIRENT_SIZE;
2051 inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME; 2055 inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
2052 inc_nlink(inode); 2056 inc_nlink(inode);
2053 ihold(inode); /* New dentry reference */ 2057 ihold(inode); /* New dentry reference */
2054 dget(dentry); /* Extra pinning count for the created dentry */ 2058 dget(dentry); /* Extra pinning count for the created dentry */
2055 d_instantiate(dentry, inode); 2059 d_instantiate(dentry, inode);
2056 out: 2060 out:
2057 return ret; 2061 return ret;
2058 } 2062 }
2059 2063
2060 static int shmem_unlink(struct inode *dir, struct dentry *dentry) 2064 static int shmem_unlink(struct inode *dir, struct dentry *dentry)
2061 { 2065 {
2062 struct inode *inode = dentry->d_inode; 2066 struct inode *inode = dentry->d_inode;
2063 2067
2064 if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)) 2068 if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode))
2065 shmem_free_inode(inode->i_sb); 2069 shmem_free_inode(inode->i_sb);
2066 2070
2067 dir->i_size -= BOGO_DIRENT_SIZE; 2071 dir->i_size -= BOGO_DIRENT_SIZE;
2068 inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME; 2072 inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
2069 drop_nlink(inode); 2073 drop_nlink(inode);
2070 dput(dentry); /* Undo the count from "create" - this does all the work */ 2074 dput(dentry); /* Undo the count from "create" - this does all the work */
2071 return 0; 2075 return 0;
2072 } 2076 }
2073 2077
2074 static int shmem_rmdir(struct inode *dir, struct dentry *dentry) 2078 static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
2075 { 2079 {
2076 if (!simple_empty(dentry)) 2080 if (!simple_empty(dentry))
2077 return -ENOTEMPTY; 2081 return -ENOTEMPTY;
2078 2082
2079 drop_nlink(dentry->d_inode); 2083 drop_nlink(dentry->d_inode);
2080 drop_nlink(dir); 2084 drop_nlink(dir);
2081 return shmem_unlink(dir, dentry); 2085 return shmem_unlink(dir, dentry);
2082 } 2086 }
2083 2087
2084 /* 2088 /*
2085 * The VFS layer already does all the dentry stuff for rename, 2089 * The VFS layer already does all the dentry stuff for rename,
2086 * we just have to decrement the usage count for the target if 2090 * we just have to decrement the usage count for the target if
2087 * it exists so that the VFS layer correctly free's it when it 2091 * it exists so that the VFS layer correctly free's it when it
2088 * gets overwritten. 2092 * gets overwritten.
2089 */ 2093 */
2090 static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry) 2094 static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry)
2091 { 2095 {
2092 struct inode *inode = old_dentry->d_inode; 2096 struct inode *inode = old_dentry->d_inode;
2093 int they_are_dirs = S_ISDIR(inode->i_mode); 2097 int they_are_dirs = S_ISDIR(inode->i_mode);
2094 2098
2095 if (!simple_empty(new_dentry)) 2099 if (!simple_empty(new_dentry))
2096 return -ENOTEMPTY; 2100 return -ENOTEMPTY;
2097 2101
2098 if (new_dentry->d_inode) { 2102 if (new_dentry->d_inode) {
2099 (void) shmem_unlink(new_dir, new_dentry); 2103 (void) shmem_unlink(new_dir, new_dentry);
2100 if (they_are_dirs) 2104 if (they_are_dirs)
2101 drop_nlink(old_dir); 2105 drop_nlink(old_dir);
2102 } else if (they_are_dirs) { 2106 } else if (they_are_dirs) {
2103 drop_nlink(old_dir); 2107 drop_nlink(old_dir);
2104 inc_nlink(new_dir); 2108 inc_nlink(new_dir);
2105 } 2109 }
2106 2110
2107 old_dir->i_size -= BOGO_DIRENT_SIZE; 2111 old_dir->i_size -= BOGO_DIRENT_SIZE;
2108 new_dir->i_size += BOGO_DIRENT_SIZE; 2112 new_dir->i_size += BOGO_DIRENT_SIZE;
2109 old_dir->i_ctime = old_dir->i_mtime = 2113 old_dir->i_ctime = old_dir->i_mtime =
2110 new_dir->i_ctime = new_dir->i_mtime = 2114 new_dir->i_ctime = new_dir->i_mtime =
2111 inode->i_ctime = CURRENT_TIME; 2115 inode->i_ctime = CURRENT_TIME;
2112 return 0; 2116 return 0;
2113 } 2117 }
2114 2118
2115 static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *symname) 2119 static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *symname)
2116 { 2120 {
2117 int error; 2121 int error;
2118 int len; 2122 int len;
2119 struct inode *inode; 2123 struct inode *inode;
2120 struct page *page; 2124 struct page *page;
2121 char *kaddr; 2125 char *kaddr;
2122 struct shmem_inode_info *info; 2126 struct shmem_inode_info *info;
2123 2127
2124 len = strlen(symname) + 1; 2128 len = strlen(symname) + 1;
2125 if (len > PAGE_CACHE_SIZE) 2129 if (len > PAGE_CACHE_SIZE)
2126 return -ENAMETOOLONG; 2130 return -ENAMETOOLONG;
2127 2131
2128 inode = shmem_get_inode(dir->i_sb, dir, S_IFLNK|S_IRWXUGO, 0, VM_NORESERVE); 2132 inode = shmem_get_inode(dir->i_sb, dir, S_IFLNK|S_IRWXUGO, 0, VM_NORESERVE);
2129 if (!inode) 2133 if (!inode)
2130 return -ENOSPC; 2134 return -ENOSPC;
2131 2135
2132 error = security_inode_init_security(inode, dir, &dentry->d_name, 2136 error = security_inode_init_security(inode, dir, &dentry->d_name,
2133 shmem_initxattrs, NULL); 2137 shmem_initxattrs, NULL);
2134 if (error) { 2138 if (error) {
2135 if (error != -EOPNOTSUPP) { 2139 if (error != -EOPNOTSUPP) {
2136 iput(inode); 2140 iput(inode);
2137 return error; 2141 return error;
2138 } 2142 }
2139 error = 0; 2143 error = 0;
2140 } 2144 }
2141 2145
2142 info = SHMEM_I(inode); 2146 info = SHMEM_I(inode);
2143 inode->i_size = len-1; 2147 inode->i_size = len-1;
2144 if (len <= SHORT_SYMLINK_LEN) { 2148 if (len <= SHORT_SYMLINK_LEN) {
2145 info->symlink = kmemdup(symname, len, GFP_KERNEL); 2149 info->symlink = kmemdup(symname, len, GFP_KERNEL);
2146 if (!info->symlink) { 2150 if (!info->symlink) {
2147 iput(inode); 2151 iput(inode);
2148 return -ENOMEM; 2152 return -ENOMEM;
2149 } 2153 }
2150 inode->i_op = &shmem_short_symlink_operations; 2154 inode->i_op = &shmem_short_symlink_operations;
2151 } else { 2155 } else {
2152 error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL); 2156 error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL);
2153 if (error) { 2157 if (error) {
2154 iput(inode); 2158 iput(inode);
2155 return error; 2159 return error;
2156 } 2160 }
2157 inode->i_mapping->a_ops = &shmem_aops; 2161 inode->i_mapping->a_ops = &shmem_aops;
2158 inode->i_op = &shmem_symlink_inode_operations; 2162 inode->i_op = &shmem_symlink_inode_operations;
2159 kaddr = kmap_atomic(page); 2163 kaddr = kmap_atomic(page);
2160 memcpy(kaddr, symname, len); 2164 memcpy(kaddr, symname, len);
2161 kunmap_atomic(kaddr); 2165 kunmap_atomic(kaddr);
2162 SetPageUptodate(page); 2166 SetPageUptodate(page);
2163 set_page_dirty(page); 2167 set_page_dirty(page);
2164 unlock_page(page); 2168 unlock_page(page);
2165 page_cache_release(page); 2169 page_cache_release(page);
2166 } 2170 }
2167 dir->i_size += BOGO_DIRENT_SIZE; 2171 dir->i_size += BOGO_DIRENT_SIZE;
2168 dir->i_ctime = dir->i_mtime = CURRENT_TIME; 2172 dir->i_ctime = dir->i_mtime = CURRENT_TIME;
2169 d_instantiate(dentry, inode); 2173 d_instantiate(dentry, inode);
2170 dget(dentry); 2174 dget(dentry);
2171 return 0; 2175 return 0;
2172 } 2176 }
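
Worth noting about the two branches above: a target short enough for the inline buffer is copied with kmemdup() and served by shmem_short_symlink_operations, while a longer target is written into page 0 of the inode's mapping and later read back through shmem_follow_link(). The cutoff, SHORT_SYMLINK_LEN, is defined earlier in the file. Either way the link behaves identically from user space, as in this small illustrative program (paths and sizes are arbitrary):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* Illustration only: whichever storage path tmpfs picks, readlink()
	 * returns the same target.  /dev/shm is normally a tmpfs mount. */
	int main(void)
	{
		char target[300], buf[4096];
		ssize_t n;

		memset(target, 'x', sizeof(target) - 1);
		target[sizeof(target) - 1] = '\0';

		symlink("short-target", "/dev/shm/short_link");
		symlink(target, "/dev/shm/long_link");

		n = readlink("/dev/shm/long_link", buf, sizeof(buf) - 1);
		if (n > 0)
			printf("long link target is %zd bytes\n", n);

		unlink("/dev/shm/short_link");
		unlink("/dev/shm/long_link");
		return 0;
	}
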
2173 2177
2174 static void *shmem_follow_short_symlink(struct dentry *dentry, struct nameidata *nd) 2178 static void *shmem_follow_short_symlink(struct dentry *dentry, struct nameidata *nd)
2175 { 2179 {
2176 nd_set_link(nd, SHMEM_I(dentry->d_inode)->symlink); 2180 nd_set_link(nd, SHMEM_I(dentry->d_inode)->symlink);
2177 return NULL; 2181 return NULL;
2178 } 2182 }
2179 2183
2180 static void *shmem_follow_link(struct dentry *dentry, struct nameidata *nd) 2184 static void *shmem_follow_link(struct dentry *dentry, struct nameidata *nd)
2181 { 2185 {
2182 struct page *page = NULL; 2186 struct page *page = NULL;
2183 int error = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ, NULL); 2187 int error = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ, NULL);
2184 nd_set_link(nd, error ? ERR_PTR(error) : kmap(page)); 2188 nd_set_link(nd, error ? ERR_PTR(error) : kmap(page));
2185 if (page) 2189 if (page)
2186 unlock_page(page); 2190 unlock_page(page);
2187 return page; 2191 return page;
2188 } 2192 }
2189 2193
2190 static void shmem_put_link(struct dentry *dentry, struct nameidata *nd, void *cookie) 2194 static void shmem_put_link(struct dentry *dentry, struct nameidata *nd, void *cookie)
2191 { 2195 {
2192 if (!IS_ERR(nd_get_link(nd))) { 2196 if (!IS_ERR(nd_get_link(nd))) {
2193 struct page *page = cookie; 2197 struct page *page = cookie;
2194 kunmap(page); 2198 kunmap(page);
2195 mark_page_accessed(page); 2199 mark_page_accessed(page);
2196 page_cache_release(page); 2200 page_cache_release(page);
2197 } 2201 }
2198 } 2202 }
2199 2203
2200 #ifdef CONFIG_TMPFS_XATTR 2204 #ifdef CONFIG_TMPFS_XATTR
2201 /* 2205 /*
2202 * Superblocks without xattr inode operations may get some security.* xattr 2206 * Superblocks without xattr inode operations may get some security.* xattr
2203 * support from the LSM "for free". As soon as we have any other xattrs 2207 * support from the LSM "for free". As soon as we have any other xattrs
2204 * like ACLs, we also need to implement the security.* handlers at 2208 * like ACLs, we also need to implement the security.* handlers at
2205 * filesystem level, though. 2209 * filesystem level, though.
2206 */ 2210 */
2207 2211
2208 /* 2212 /*
2209 * Callback for security_inode_init_security() for acquiring xattrs. 2213 * Callback for security_inode_init_security() for acquiring xattrs.
2210 */ 2214 */
2211 static int shmem_initxattrs(struct inode *inode, 2215 static int shmem_initxattrs(struct inode *inode,
2212 const struct xattr *xattr_array, 2216 const struct xattr *xattr_array,
2213 void *fs_info) 2217 void *fs_info)
2214 { 2218 {
2215 struct shmem_inode_info *info = SHMEM_I(inode); 2219 struct shmem_inode_info *info = SHMEM_I(inode);
2216 const struct xattr *xattr; 2220 const struct xattr *xattr;
2217 struct simple_xattr *new_xattr; 2221 struct simple_xattr *new_xattr;
2218 size_t len; 2222 size_t len;
2219 2223
2220 for (xattr = xattr_array; xattr->name != NULL; xattr++) { 2224 for (xattr = xattr_array; xattr->name != NULL; xattr++) {
2221 new_xattr = simple_xattr_alloc(xattr->value, xattr->value_len); 2225 new_xattr = simple_xattr_alloc(xattr->value, xattr->value_len);
2222 if (!new_xattr) 2226 if (!new_xattr)
2223 return -ENOMEM; 2227 return -ENOMEM;
2224 2228
2225 len = strlen(xattr->name) + 1; 2229 len = strlen(xattr->name) + 1;
2226 new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len, 2230 new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len,
2227 GFP_KERNEL); 2231 GFP_KERNEL);
2228 if (!new_xattr->name) { 2232 if (!new_xattr->name) {
2229 kfree(new_xattr); 2233 kfree(new_xattr);
2230 return -ENOMEM; 2234 return -ENOMEM;
2231 } 2235 }
2232 2236
2233 memcpy(new_xattr->name, XATTR_SECURITY_PREFIX, 2237 memcpy(new_xattr->name, XATTR_SECURITY_PREFIX,
2234 XATTR_SECURITY_PREFIX_LEN); 2238 XATTR_SECURITY_PREFIX_LEN);
2235 memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN, 2239 memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN,
2236 xattr->name, len); 2240 xattr->name, len);
2237 2241
2238 simple_xattr_list_add(&info->xattrs, new_xattr); 2242 simple_xattr_list_add(&info->xattrs, new_xattr);
2239 } 2243 }
2240 2244
2241 return 0; 2245 return 0;
2242 } 2246 }
2243 2247
2244 static const struct xattr_handler *shmem_xattr_handlers[] = { 2248 static const struct xattr_handler *shmem_xattr_handlers[] = {
2245 #ifdef CONFIG_TMPFS_POSIX_ACL 2249 #ifdef CONFIG_TMPFS_POSIX_ACL
2246 &generic_acl_access_handler, 2250 &generic_acl_access_handler,
2247 &generic_acl_default_handler, 2251 &generic_acl_default_handler,
2248 #endif 2252 #endif
2249 NULL 2253 NULL
2250 }; 2254 };
2251 2255
2252 static int shmem_xattr_validate(const char *name) 2256 static int shmem_xattr_validate(const char *name)
2253 { 2257 {
2254 struct { const char *prefix; size_t len; } arr[] = { 2258 struct { const char *prefix; size_t len; } arr[] = {
2255 { XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN }, 2259 { XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN },
2256 { XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN } 2260 { XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN }
2257 }; 2261 };
2258 int i; 2262 int i;
2259 2263
2260 for (i = 0; i < ARRAY_SIZE(arr); i++) { 2264 for (i = 0; i < ARRAY_SIZE(arr); i++) {
2261 size_t preflen = arr[i].len; 2265 size_t preflen = arr[i].len;
2262 if (strncmp(name, arr[i].prefix, preflen) == 0) { 2266 if (strncmp(name, arr[i].prefix, preflen) == 0) {
2263 if (!name[preflen]) 2267 if (!name[preflen])
2264 return -EINVAL; 2268 return -EINVAL;
2265 return 0; 2269 return 0;
2266 } 2270 }
2267 } 2271 }
2268 return -EOPNOTSUPP; 2272 return -EOPNOTSUPP;
2269 } 2273 }
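
shmem_xattr_validate() above admits only the security.* and trusted.* namespaces and insists on a non-empty name after the prefix (-EINVAL otherwise); anything else is rejected with -EOPNOTSUPP so it cannot shadow the generic system.* handling. The same check in a stand-alone sketch, with the usual prefix strings assumed:

	#include <stdio.h>
	#include <string.h>

	/* Mirror of the validation logic above, for illustration only. */
	static int validate(const char *name)
	{
		static const struct { const char *prefix; } arr[] = {
			{ "security." },
			{ "trusted." },
		};
		size_t i;

		for (i = 0; i < sizeof(arr) / sizeof(arr[0]); i++) {
			size_t preflen = strlen(arr[i].prefix);
			if (strncmp(name, arr[i].prefix, preflen) == 0)
				return name[preflen] ? 0 : -22;	/* -EINVAL */
		}
		return -95;					/* -EOPNOTSUPP */
	}

	int main(void)
	{
		printf("%d\n", validate("security.selinux"));	/* 0: accepted */
		printf("%d\n", validate("security."));		/* -22: empty suffix */
		printf("%d\n", validate("user.comment"));	/* -95: unsupported */
		return 0;
	}
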
2270 2274
2271 static ssize_t shmem_getxattr(struct dentry *dentry, const char *name, 2275 static ssize_t shmem_getxattr(struct dentry *dentry, const char *name,
2272 void *buffer, size_t size) 2276 void *buffer, size_t size)
2273 { 2277 {
2274 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); 2278 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode);
2275 int err; 2279 int err;
2276 2280
2277 /* 2281 /*
2278 * If this is a request for a synthetic attribute in the system.* 2282 * If this is a request for a synthetic attribute in the system.*
2279 * namespace use the generic infrastructure to resolve a handler 2283 * namespace use the generic infrastructure to resolve a handler
2280 * for it via sb->s_xattr. 2284 * for it via sb->s_xattr.
2281 */ 2285 */
2282 if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) 2286 if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN))
2283 return generic_getxattr(dentry, name, buffer, size); 2287 return generic_getxattr(dentry, name, buffer, size);
2284 2288
2285 err = shmem_xattr_validate(name); 2289 err = shmem_xattr_validate(name);
2286 if (err) 2290 if (err)
2287 return err; 2291 return err;
2288 2292
2289 return simple_xattr_get(&info->xattrs, name, buffer, size); 2293 return simple_xattr_get(&info->xattrs, name, buffer, size);
2290 } 2294 }
2291 2295
2292 static int shmem_setxattr(struct dentry *dentry, const char *name, 2296 static int shmem_setxattr(struct dentry *dentry, const char *name,
2293 const void *value, size_t size, int flags) 2297 const void *value, size_t size, int flags)
2294 { 2298 {
2295 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); 2299 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode);
2296 int err; 2300 int err;
2297 2301
2298 /* 2302 /*
2299 * If this is a request for a synthetic attribute in the system.* 2303 * If this is a request for a synthetic attribute in the system.*
2300 * namespace use the generic infrastructure to resolve a handler 2304 * namespace use the generic infrastructure to resolve a handler
2301 * for it via sb->s_xattr. 2305 * for it via sb->s_xattr.
2302 */ 2306 */
2303 if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) 2307 if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN))
2304 return generic_setxattr(dentry, name, value, size, flags); 2308 return generic_setxattr(dentry, name, value, size, flags);
2305 2309
2306 err = shmem_xattr_validate(name); 2310 err = shmem_xattr_validate(name);
2307 if (err) 2311 if (err)
2308 return err; 2312 return err;
2309 2313
2310 return simple_xattr_set(&info->xattrs, name, value, size, flags); 2314 return simple_xattr_set(&info->xattrs, name, value, size, flags);
2311 } 2315 }
2312 2316
2313 static int shmem_removexattr(struct dentry *dentry, const char *name) 2317 static int shmem_removexattr(struct dentry *dentry, const char *name)
2314 { 2318 {
2315 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); 2319 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode);
2316 int err; 2320 int err;
2317 2321
2318 /* 2322 /*
2319 * If this is a request for a synthetic attribute in the system.* 2323 * If this is a request for a synthetic attribute in the system.*
2320 * namespace use the generic infrastructure to resolve a handler 2324 * namespace use the generic infrastructure to resolve a handler
2321 * for it via sb->s_xattr. 2325 * for it via sb->s_xattr.
2322 */ 2326 */
2323 if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN)) 2327 if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN))
2324 return generic_removexattr(dentry, name); 2328 return generic_removexattr(dentry, name);
2325 2329
2326 err = shmem_xattr_validate(name); 2330 err = shmem_xattr_validate(name);
2327 if (err) 2331 if (err)
2328 return err; 2332 return err;
2329 2333
2330 return simple_xattr_remove(&info->xattrs, name); 2334 return simple_xattr_remove(&info->xattrs, name);
2331 } 2335 }
2332 2336
2333 static ssize_t shmem_listxattr(struct dentry *dentry, char *buffer, size_t size) 2337 static ssize_t shmem_listxattr(struct dentry *dentry, char *buffer, size_t size)
2334 { 2338 {
2335 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode); 2339 struct shmem_inode_info *info = SHMEM_I(dentry->d_inode);
2336 return simple_xattr_list(&info->xattrs, buffer, size); 2340 return simple_xattr_list(&info->xattrs, buffer, size);
2337 } 2341 }
2338 #endif /* CONFIG_TMPFS_XATTR */ 2342 #endif /* CONFIG_TMPFS_XATTR */
2339 2343
2340 static const struct inode_operations shmem_short_symlink_operations = { 2344 static const struct inode_operations shmem_short_symlink_operations = {
2341 .readlink = generic_readlink, 2345 .readlink = generic_readlink,
2342 .follow_link = shmem_follow_short_symlink, 2346 .follow_link = shmem_follow_short_symlink,
2343 #ifdef CONFIG_TMPFS_XATTR 2347 #ifdef CONFIG_TMPFS_XATTR
2344 .setxattr = shmem_setxattr, 2348 .setxattr = shmem_setxattr,
2345 .getxattr = shmem_getxattr, 2349 .getxattr = shmem_getxattr,
2346 .listxattr = shmem_listxattr, 2350 .listxattr = shmem_listxattr,
2347 .removexattr = shmem_removexattr, 2351 .removexattr = shmem_removexattr,
2348 #endif 2352 #endif
2349 }; 2353 };
2350 2354
2351 static const struct inode_operations shmem_symlink_inode_operations = { 2355 static const struct inode_operations shmem_symlink_inode_operations = {
2352 .readlink = generic_readlink, 2356 .readlink = generic_readlink,
2353 .follow_link = shmem_follow_link, 2357 .follow_link = shmem_follow_link,
2354 .put_link = shmem_put_link, 2358 .put_link = shmem_put_link,
2355 #ifdef CONFIG_TMPFS_XATTR 2359 #ifdef CONFIG_TMPFS_XATTR
2356 .setxattr = shmem_setxattr, 2360 .setxattr = shmem_setxattr,
2357 .getxattr = shmem_getxattr, 2361 .getxattr = shmem_getxattr,
2358 .listxattr = shmem_listxattr, 2362 .listxattr = shmem_listxattr,
2359 .removexattr = shmem_removexattr, 2363 .removexattr = shmem_removexattr,
2360 #endif 2364 #endif
2361 }; 2365 };
2362 2366
2363 static struct dentry *shmem_get_parent(struct dentry *child) 2367 static struct dentry *shmem_get_parent(struct dentry *child)
2364 { 2368 {
2365 return ERR_PTR(-ESTALE); 2369 return ERR_PTR(-ESTALE);
2366 } 2370 }
2367 2371
2368 static int shmem_match(struct inode *ino, void *vfh) 2372 static int shmem_match(struct inode *ino, void *vfh)
2369 { 2373 {
2370 __u32 *fh = vfh; 2374 __u32 *fh = vfh;
2371 __u64 inum = fh[2]; 2375 __u64 inum = fh[2];
2372 inum = (inum << 32) | fh[1]; 2376 inum = (inum << 32) | fh[1];
2373 return ino->i_ino == inum && fh[0] == ino->i_generation; 2377 return ino->i_ino == inum && fh[0] == ino->i_generation;
2374 } 2378 }
2375 2379
2376 static struct dentry *shmem_fh_to_dentry(struct super_block *sb, 2380 static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
2377 struct fid *fid, int fh_len, int fh_type) 2381 struct fid *fid, int fh_len, int fh_type)
2378 { 2382 {
2379 struct inode *inode; 2383 struct inode *inode;
2380 struct dentry *dentry = NULL; 2384 struct dentry *dentry = NULL;
2381 u64 inum; 2385 u64 inum;
2382 2386
2383 if (fh_len < 3) 2387 if (fh_len < 3)
2384 return NULL; 2388 return NULL;
2385 2389
2386 inum = fid->raw[2]; 2390 inum = fid->raw[2];
2387 inum = (inum << 32) | fid->raw[1]; 2391 inum = (inum << 32) | fid->raw[1];
2388 2392
2389 inode = ilookup5(sb, (unsigned long)(inum + fid->raw[0]), 2393 inode = ilookup5(sb, (unsigned long)(inum + fid->raw[0]),
2390 shmem_match, fid->raw); 2394 shmem_match, fid->raw);
2391 if (inode) { 2395 if (inode) {
2392 dentry = d_find_alias(inode); 2396 dentry = d_find_alias(inode);
2393 iput(inode); 2397 iput(inode);
2394 } 2398 }
2395 2399
2396 return dentry; 2400 return dentry;
2397 } 2401 }
2398 2402
2399 static int shmem_encode_fh(struct inode *inode, __u32 *fh, int *len, 2403 static int shmem_encode_fh(struct inode *inode, __u32 *fh, int *len,
2400 struct inode *parent) 2404 struct inode *parent)
2401 { 2405 {
2402 if (*len < 3) { 2406 if (*len < 3) {
2403 *len = 3; 2407 *len = 3;
2404 return FILEID_INVALID; 2408 return FILEID_INVALID;
2405 } 2409 }
2406 2410
2407 if (inode_unhashed(inode)) { 2411 if (inode_unhashed(inode)) {
2408 /* Unfortunately insert_inode_hash is not idempotent, 2412 /* Unfortunately insert_inode_hash is not idempotent,
2409 * so as we hash inodes here rather than at creation 2413 * so as we hash inodes here rather than at creation
2410 * time, we need a lock to ensure we only try 2414 * time, we need a lock to ensure we only try
2411 * to do it once 2415 * to do it once
2412 */ 2416 */
2413 static DEFINE_SPINLOCK(lock); 2417 static DEFINE_SPINLOCK(lock);
2414 spin_lock(&lock); 2418 spin_lock(&lock);
2415 if (inode_unhashed(inode)) 2419 if (inode_unhashed(inode))
2416 __insert_inode_hash(inode, 2420 __insert_inode_hash(inode,
2417 inode->i_ino + inode->i_generation); 2421 inode->i_ino + inode->i_generation);
2418 spin_unlock(&lock); 2422 spin_unlock(&lock);
2419 } 2423 }
2420 2424
2421 fh[0] = inode->i_generation; 2425 fh[0] = inode->i_generation;
2422 fh[1] = inode->i_ino; 2426 fh[1] = inode->i_ino;
2423 fh[2] = ((__u64)inode->i_ino) >> 32; 2427 fh[2] = ((__u64)inode->i_ino) >> 32;
2424 2428
2425 *len = 3; 2429 *len = 3;
2426 return 1; 2430 return 1;
2427 } 2431 }
2428 2432
2429 static const struct export_operations shmem_export_ops = { 2433 static const struct export_operations shmem_export_ops = {
2430 .get_parent = shmem_get_parent, 2434 .get_parent = shmem_get_parent,
2431 .encode_fh = shmem_encode_fh, 2435 .encode_fh = shmem_encode_fh,
2432 .fh_to_dentry = shmem_fh_to_dentry, 2436 .fh_to_dentry = shmem_fh_to_dentry,
2433 }; 2437 };
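
For NFS export, shmem_encode_fh() packs the inode identity into three 32-bit words: fh[0] holds i_generation, fh[1] the low half of i_ino and fh[2] the high half; shmem_match() reassembles the 64-bit inode number in the same order when the handle comes back. A stand-alone round-trip of that packing (plain C, illustrative only):

	#include <assert.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Pack/unpack in the same layout as shmem_encode_fh()/shmem_match(). */
	static void encode(uint32_t fh[3], uint64_t ino, uint32_t generation)
	{
		fh[0] = generation;
		fh[1] = (uint32_t)ino;		/* low 32 bits  */
		fh[2] = (uint32_t)(ino >> 32);	/* high 32 bits */
	}

	static uint64_t decode_ino(const uint32_t fh[3])
	{
		uint64_t inum = fh[2];

		return (inum << 32) | fh[1];
	}

	int main(void)
	{
		uint32_t fh[3];
		uint64_t ino = 0x123456789abcdefULL;

		encode(fh, ino, 42);
		assert(decode_ino(fh) == ino);
		assert(fh[0] == 42);
		printf("round-trip ok: %#llx\n", (unsigned long long)decode_ino(fh));
		return 0;
	}
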
2434 2438
2435 static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, 2439 static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
2436 bool remount) 2440 bool remount)
2437 { 2441 {
2438 char *this_char, *value, *rest; 2442 char *this_char, *value, *rest;
2439 struct mempolicy *mpol = NULL; 2443 struct mempolicy *mpol = NULL;
2440 uid_t uid; 2444 uid_t uid;
2441 gid_t gid; 2445 gid_t gid;
2442 2446
2443 while (options != NULL) { 2447 while (options != NULL) {
2444 this_char = options; 2448 this_char = options;
2445 for (;;) { 2449 for (;;) {
2446 /* 2450 /*
2447 * NUL-terminate this option: unfortunately, 2451 * NUL-terminate this option: unfortunately,
2448 * mount options form a comma-separated list, 2452 * mount options form a comma-separated list,
2449 * but mpol's nodelist may also contain commas. 2453 * but mpol's nodelist may also contain commas.
2450 */ 2454 */
2451 options = strchr(options, ','); 2455 options = strchr(options, ',');
2452 if (options == NULL) 2456 if (options == NULL)
2453 break; 2457 break;
2454 options++; 2458 options++;
2455 if (!isdigit(*options)) { 2459 if (!isdigit(*options)) {
2456 options[-1] = '\0'; 2460 options[-1] = '\0';
2457 break; 2461 break;
2458 } 2462 }
2459 } 2463 }
2460 if (!*this_char) 2464 if (!*this_char)
2461 continue; 2465 continue;
2462 if ((value = strchr(this_char,'=')) != NULL) { 2466 if ((value = strchr(this_char,'=')) != NULL) {
2463 *value++ = 0; 2467 *value++ = 0;
2464 } else { 2468 } else {
2465 printk(KERN_ERR 2469 printk(KERN_ERR
2466 "tmpfs: No value for mount option '%s'\n", 2470 "tmpfs: No value for mount option '%s'\n",
2467 this_char); 2471 this_char);
2468 goto error; 2472 goto error;
2469 } 2473 }
2470 2474
2471 if (!strcmp(this_char,"size")) { 2475 if (!strcmp(this_char,"size")) {
2472 unsigned long long size; 2476 unsigned long long size;
2473 size = memparse(value,&rest); 2477 size = memparse(value,&rest);
2474 if (*rest == '%') { 2478 if (*rest == '%') {
2475 size <<= PAGE_SHIFT; 2479 size <<= PAGE_SHIFT;
2476 size *= totalram_pages; 2480 size *= totalram_pages;
2477 do_div(size, 100); 2481 do_div(size, 100);
2478 rest++; 2482 rest++;
2479 } 2483 }
2480 if (*rest) 2484 if (*rest)
2481 goto bad_val; 2485 goto bad_val;
2482 sbinfo->max_blocks = 2486 sbinfo->max_blocks =
2483 DIV_ROUND_UP(size, PAGE_CACHE_SIZE); 2487 DIV_ROUND_UP(size, PAGE_CACHE_SIZE);
2484 } else if (!strcmp(this_char,"nr_blocks")) { 2488 } else if (!strcmp(this_char,"nr_blocks")) {
2485 sbinfo->max_blocks = memparse(value, &rest); 2489 sbinfo->max_blocks = memparse(value, &rest);
2486 if (*rest) 2490 if (*rest)
2487 goto bad_val; 2491 goto bad_val;
2488 } else if (!strcmp(this_char,"nr_inodes")) { 2492 } else if (!strcmp(this_char,"nr_inodes")) {
2489 sbinfo->max_inodes = memparse(value, &rest); 2493 sbinfo->max_inodes = memparse(value, &rest);
2490 if (*rest) 2494 if (*rest)
2491 goto bad_val; 2495 goto bad_val;
2492 } else if (!strcmp(this_char,"mode")) { 2496 } else if (!strcmp(this_char,"mode")) {
2493 if (remount) 2497 if (remount)
2494 continue; 2498 continue;
2495 sbinfo->mode = simple_strtoul(value, &rest, 8) & 07777; 2499 sbinfo->mode = simple_strtoul(value, &rest, 8) & 07777;
2496 if (*rest) 2500 if (*rest)
2497 goto bad_val; 2501 goto bad_val;
2498 } else if (!strcmp(this_char,"uid")) { 2502 } else if (!strcmp(this_char,"uid")) {
2499 if (remount) 2503 if (remount)
2500 continue; 2504 continue;
2501 uid = simple_strtoul(value, &rest, 0); 2505 uid = simple_strtoul(value, &rest, 0);
2502 if (*rest) 2506 if (*rest)
2503 goto bad_val; 2507 goto bad_val;
2504 sbinfo->uid = make_kuid(current_user_ns(), uid); 2508 sbinfo->uid = make_kuid(current_user_ns(), uid);
2505 if (!uid_valid(sbinfo->uid)) 2509 if (!uid_valid(sbinfo->uid))
2506 goto bad_val; 2510 goto bad_val;
2507 } else if (!strcmp(this_char,"gid")) { 2511 } else if (!strcmp(this_char,"gid")) {
2508 if (remount) 2512 if (remount)
2509 continue; 2513 continue;
2510 gid = simple_strtoul(value, &rest, 0); 2514 gid = simple_strtoul(value, &rest, 0);
2511 if (*rest) 2515 if (*rest)
2512 goto bad_val; 2516 goto bad_val;
2513 sbinfo->gid = make_kgid(current_user_ns(), gid); 2517 sbinfo->gid = make_kgid(current_user_ns(), gid);
2514 if (!gid_valid(sbinfo->gid)) 2518 if (!gid_valid(sbinfo->gid))
2515 goto bad_val; 2519 goto bad_val;
2516 } else if (!strcmp(this_char,"mpol")) { 2520 } else if (!strcmp(this_char,"mpol")) {
2517 mpol_put(mpol); 2521 mpol_put(mpol);
2518 mpol = NULL; 2522 mpol = NULL;
2519 if (mpol_parse_str(value, &mpol)) 2523 if (mpol_parse_str(value, &mpol))
2520 goto bad_val; 2524 goto bad_val;
2521 } else { 2525 } else {
2522 printk(KERN_ERR "tmpfs: Bad mount option %s\n", 2526 printk(KERN_ERR "tmpfs: Bad mount option %s\n",
2523 this_char); 2527 this_char);
2524 goto error; 2528 goto error;
2525 } 2529 }
2526 } 2530 }
2527 sbinfo->mpol = mpol; 2531 sbinfo->mpol = mpol;
2528 return 0; 2532 return 0;
2529 2533
2530 bad_val: 2534 bad_val:
2531 printk(KERN_ERR "tmpfs: Bad value '%s' for mount option '%s'\n", 2535 printk(KERN_ERR "tmpfs: Bad value '%s' for mount option '%s'\n",
2532 value, this_char); 2536 value, this_char);
2533 error: 2537 error:
2534 mpol_put(mpol); 2538 mpol_put(mpol);
2535 return 1; 2539 return 1;
2536 2540
2537 } 2541 }
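
The "size=" option accepts either an absolute byte count (via memparse(), so k/m/g suffixes are understood) or a percentage of RAM; in the percentage case the value is scaled by totalram_pages before being rounded up to a block count. A worked example of that branch, with a hypothetical machine size plugged in (PAGE_CACHE_SIZE equals PAGE_SIZE on this kernel, so the sketch uses PAGE_SIZE throughout):

	#include <stdio.h>

	#define PAGE_SHIFT	12
	#define PAGE_SIZE	(1ULL << PAGE_SHIFT)
	#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

	int main(void)
	{
		/* Assume 2 GiB of RAM, i.e. 524288 4 KiB pages (hypothetical). */
		unsigned long long totalram_pages = 524288;
		unsigned long long size = 50;		/* "size=50%" parses to 50 */

		/* Same arithmetic as the '%' branch above. */
		size <<= PAGE_SHIFT;
		size *= totalram_pages;
		size /= 100;				/* do_div(size, 100) */

		printf("max_blocks = %llu\n", DIV_ROUND_UP(size, PAGE_SIZE));
		/* Prints 262144, i.e. half of the 524288 pages. */
		return 0;
	}
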
2538 2542
2539 static int shmem_remount_fs(struct super_block *sb, int *flags, char *data) 2543 static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
2540 { 2544 {
2541 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 2545 struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
2542 struct shmem_sb_info config = *sbinfo; 2546 struct shmem_sb_info config = *sbinfo;
2543 unsigned long inodes; 2547 unsigned long inodes;
2544 int error = -EINVAL; 2548 int error = -EINVAL;
2545 2549
2546 config.mpol = NULL; 2550 config.mpol = NULL;
2547 if (shmem_parse_options(data, &config, true)) 2551 if (shmem_parse_options(data, &config, true))
2548 return error; 2552 return error;
2549 2553
2550 spin_lock(&sbinfo->stat_lock); 2554 spin_lock(&sbinfo->stat_lock);
2551 inodes = sbinfo->max_inodes - sbinfo->free_inodes; 2555 inodes = sbinfo->max_inodes - sbinfo->free_inodes;
2552 if (percpu_counter_compare(&sbinfo->used_blocks, config.max_blocks) > 0) 2556 if (percpu_counter_compare(&sbinfo->used_blocks, config.max_blocks) > 0)
2553 goto out; 2557 goto out;
2554 if (config.max_inodes < inodes) 2558 if (config.max_inodes < inodes)
2555 goto out; 2559 goto out;
2556 /* 2560 /*
2557 * Those tests disallow limited->unlimited while any are in use; 2561 * Those tests disallow limited->unlimited while any are in use;
2558 * but we must separately disallow unlimited->limited, because 2562 * but we must separately disallow unlimited->limited, because
2559 * in that case we have no record of how much is already in use. 2563 * in that case we have no record of how much is already in use.
2560 */ 2564 */
2561 if (config.max_blocks && !sbinfo->max_blocks) 2565 if (config.max_blocks && !sbinfo->max_blocks)
2562 goto out; 2566 goto out;
2563 if (config.max_inodes && !sbinfo->max_inodes) 2567 if (config.max_inodes && !sbinfo->max_inodes)
2564 goto out; 2568 goto out;
2565 2569
2566 error = 0; 2570 error = 0;
2567 sbinfo->max_blocks = config.max_blocks; 2571 sbinfo->max_blocks = config.max_blocks;
2568 sbinfo->max_inodes = config.max_inodes; 2572 sbinfo->max_inodes = config.max_inodes;
2569 sbinfo->free_inodes = config.max_inodes - inodes; 2573 sbinfo->free_inodes = config.max_inodes - inodes;
2570 2574
2571 /* 2575 /*
2572 * Preserve previous mempolicy unless mpol remount option was specified. 2576 * Preserve previous mempolicy unless mpol remount option was specified.
2573 */ 2577 */
2574 if (config.mpol) { 2578 if (config.mpol) {
2575 mpol_put(sbinfo->mpol); 2579 mpol_put(sbinfo->mpol);
2576 sbinfo->mpol = config.mpol; /* transfers initial ref */ 2580 sbinfo->mpol = config.mpol; /* transfers initial ref */
2577 } 2581 }
2578 out: 2582 out:
2579 spin_unlock(&sbinfo->stat_lock); 2583 spin_unlock(&sbinfo->stat_lock);
2580 return error; 2584 return error;
2581 } 2585 }
2582 2586
2583 static int shmem_show_options(struct seq_file *seq, struct dentry *root) 2587 static int shmem_show_options(struct seq_file *seq, struct dentry *root)
2584 { 2588 {
2585 struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb); 2589 struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb);
2586 2590
2587 if (sbinfo->max_blocks != shmem_default_max_blocks()) 2591 if (sbinfo->max_blocks != shmem_default_max_blocks())
2588 seq_printf(seq, ",size=%luk", 2592 seq_printf(seq, ",size=%luk",
2589 sbinfo->max_blocks << (PAGE_CACHE_SHIFT - 10)); 2593 sbinfo->max_blocks << (PAGE_CACHE_SHIFT - 10));
2590 if (sbinfo->max_inodes != shmem_default_max_inodes()) 2594 if (sbinfo->max_inodes != shmem_default_max_inodes())
2591 seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes); 2595 seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes);
2592 if (sbinfo->mode != (S_IRWXUGO | S_ISVTX)) 2596 if (sbinfo->mode != (S_IRWXUGO | S_ISVTX))
2593 seq_printf(seq, ",mode=%03ho", sbinfo->mode); 2597 seq_printf(seq, ",mode=%03ho", sbinfo->mode);
2594 if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID)) 2598 if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID))
2595 seq_printf(seq, ",uid=%u", 2599 seq_printf(seq, ",uid=%u",
2596 from_kuid_munged(&init_user_ns, sbinfo->uid)); 2600 from_kuid_munged(&init_user_ns, sbinfo->uid));
2597 if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID)) 2601 if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
2598 seq_printf(seq, ",gid=%u", 2602 seq_printf(seq, ",gid=%u",
2599 from_kgid_munged(&init_user_ns, sbinfo->gid)); 2603 from_kgid_munged(&init_user_ns, sbinfo->gid));
2600 shmem_show_mpol(seq, sbinfo->mpol); 2604 shmem_show_mpol(seq, sbinfo->mpol);
2601 return 0; 2605 return 0;
2602 } 2606 }
2603 #endif /* CONFIG_TMPFS */ 2607 #endif /* CONFIG_TMPFS */
2604 2608
2605 static void shmem_put_super(struct super_block *sb) 2609 static void shmem_put_super(struct super_block *sb)
2606 { 2610 {
2607 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 2611 struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
2608 2612
2609 percpu_counter_destroy(&sbinfo->used_blocks); 2613 percpu_counter_destroy(&sbinfo->used_blocks);
2610 mpol_put(sbinfo->mpol); 2614 mpol_put(sbinfo->mpol);
2611 kfree(sbinfo); 2615 kfree(sbinfo);
2612 sb->s_fs_info = NULL; 2616 sb->s_fs_info = NULL;
2613 } 2617 }
2614 2618
2615 int shmem_fill_super(struct super_block *sb, void *data, int silent) 2619 int shmem_fill_super(struct super_block *sb, void *data, int silent)
2616 { 2620 {
2617 struct inode *inode; 2621 struct inode *inode;
2618 struct shmem_sb_info *sbinfo; 2622 struct shmem_sb_info *sbinfo;
2619 int err = -ENOMEM; 2623 int err = -ENOMEM;
2620 2624
2621 /* Round up to L1_CACHE_BYTES to resist false sharing */ 2625 /* Round up to L1_CACHE_BYTES to resist false sharing */
2622 sbinfo = kzalloc(max((int)sizeof(struct shmem_sb_info), 2626 sbinfo = kzalloc(max((int)sizeof(struct shmem_sb_info),
2623 L1_CACHE_BYTES), GFP_KERNEL); 2627 L1_CACHE_BYTES), GFP_KERNEL);
2624 if (!sbinfo) 2628 if (!sbinfo)
2625 return -ENOMEM; 2629 return -ENOMEM;
2626 2630
2627 sbinfo->mode = S_IRWXUGO | S_ISVTX; 2631 sbinfo->mode = S_IRWXUGO | S_ISVTX;
2628 sbinfo->uid = current_fsuid(); 2632 sbinfo->uid = current_fsuid();
2629 sbinfo->gid = current_fsgid(); 2633 sbinfo->gid = current_fsgid();
2630 sb->s_fs_info = sbinfo; 2634 sb->s_fs_info = sbinfo;
2631 2635
2632 #ifdef CONFIG_TMPFS 2636 #ifdef CONFIG_TMPFS
2633 /* 2637 /*
2634 * Per default we only allow half of the physical ram per 2638 * Per default we only allow half of the physical ram per
2635 * tmpfs instance, limiting inodes to one per page of lowmem; 2639 * tmpfs instance, limiting inodes to one per page of lowmem;
2636 * but the internal instance is left unlimited. 2640 * but the internal instance is left unlimited.
2637 */ 2641 */
2638 if (!(sb->s_flags & MS_KERNMOUNT)) { 2642 if (!(sb->s_flags & MS_KERNMOUNT)) {
2639 sbinfo->max_blocks = shmem_default_max_blocks(); 2643 sbinfo->max_blocks = shmem_default_max_blocks();
2640 sbinfo->max_inodes = shmem_default_max_inodes(); 2644 sbinfo->max_inodes = shmem_default_max_inodes();
2641 if (shmem_parse_options(data, sbinfo, false)) { 2645 if (shmem_parse_options(data, sbinfo, false)) {
2642 err = -EINVAL; 2646 err = -EINVAL;
2643 goto failed; 2647 goto failed;
2644 } 2648 }
2645 } else { 2649 } else {
2646 sb->s_flags |= MS_NOUSER; 2650 sb->s_flags |= MS_NOUSER;
2647 } 2651 }
2648 sb->s_export_op = &shmem_export_ops; 2652 sb->s_export_op = &shmem_export_ops;
2649 sb->s_flags |= MS_NOSEC; 2653 sb->s_flags |= MS_NOSEC;
2650 #else 2654 #else
2651 sb->s_flags |= MS_NOUSER; 2655 sb->s_flags |= MS_NOUSER;
2652 #endif 2656 #endif
2653 2657
2654 spin_lock_init(&sbinfo->stat_lock); 2658 spin_lock_init(&sbinfo->stat_lock);
2655 if (percpu_counter_init(&sbinfo->used_blocks, 0)) 2659 if (percpu_counter_init(&sbinfo->used_blocks, 0))
2656 goto failed; 2660 goto failed;
2657 sbinfo->free_inodes = sbinfo->max_inodes; 2661 sbinfo->free_inodes = sbinfo->max_inodes;
2658 2662
2659 sb->s_maxbytes = MAX_LFS_FILESIZE; 2663 sb->s_maxbytes = MAX_LFS_FILESIZE;
2660 sb->s_blocksize = PAGE_CACHE_SIZE; 2664 sb->s_blocksize = PAGE_CACHE_SIZE;
2661 sb->s_blocksize_bits = PAGE_CACHE_SHIFT; 2665 sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
2662 sb->s_magic = TMPFS_MAGIC; 2666 sb->s_magic = TMPFS_MAGIC;
2663 sb->s_op = &shmem_ops; 2667 sb->s_op = &shmem_ops;
2664 sb->s_time_gran = 1; 2668 sb->s_time_gran = 1;
2665 #ifdef CONFIG_TMPFS_XATTR 2669 #ifdef CONFIG_TMPFS_XATTR
2666 sb->s_xattr = shmem_xattr_handlers; 2670 sb->s_xattr = shmem_xattr_handlers;
2667 #endif 2671 #endif
2668 #ifdef CONFIG_TMPFS_POSIX_ACL 2672 #ifdef CONFIG_TMPFS_POSIX_ACL
2669 sb->s_flags |= MS_POSIXACL; 2673 sb->s_flags |= MS_POSIXACL;
2670 #endif 2674 #endif
2671 2675
2672 inode = shmem_get_inode(sb, NULL, S_IFDIR | sbinfo->mode, 0, VM_NORESERVE); 2676 inode = shmem_get_inode(sb, NULL, S_IFDIR | sbinfo->mode, 0, VM_NORESERVE);
2673 if (!inode) 2677 if (!inode)
2674 goto failed; 2678 goto failed;
2675 inode->i_uid = sbinfo->uid; 2679 inode->i_uid = sbinfo->uid;
2676 inode->i_gid = sbinfo->gid; 2680 inode->i_gid = sbinfo->gid;
2677 sb->s_root = d_make_root(inode); 2681 sb->s_root = d_make_root(inode);
2678 if (!sb->s_root) 2682 if (!sb->s_root)
2679 goto failed; 2683 goto failed;
2680 return 0; 2684 return 0;
2681 2685
2682 failed: 2686 failed:
2683 shmem_put_super(sb); 2687 shmem_put_super(sb);
2684 return err; 2688 return err;
2685 } 2689 }
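
As the comment above says, a user-visible tmpfs mount defaults to half of physical RAM and caps inodes at roughly one per page of lowmem, while kernel-internal mounts (MS_KERNMOUNT) stay unlimited and get MS_NOUSER. For a feel of the numbers, a back-of-the-envelope calculation under an assumed 8 GiB / 4 KiB-page configuration:

	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical machine: 8 GiB of RAM, 4 KiB pages. */
		unsigned long long ram_bytes = 8ULL << 30;
		unsigned long long page_size = 4096;
		unsigned long long totalram_pages = ram_bytes / page_size;

		/* Default limit is half of RAM, expressed in pages/blocks. */
		unsigned long long max_blocks = totalram_pages / 2;

		printf("default size: %llu blocks (%llu MiB)\n",
		       max_blocks, max_blocks * page_size >> 20);
		/* Prints 1048576 blocks (4096 MiB). */
		return 0;
	}
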
2686 2690
2687 static struct kmem_cache *shmem_inode_cachep; 2691 static struct kmem_cache *shmem_inode_cachep;
2688 2692
2689 static struct inode *shmem_alloc_inode(struct super_block *sb) 2693 static struct inode *shmem_alloc_inode(struct super_block *sb)
2690 { 2694 {
2691 struct shmem_inode_info *info; 2695 struct shmem_inode_info *info;
2692 info = kmem_cache_alloc(shmem_inode_cachep, GFP_KERNEL); 2696 info = kmem_cache_alloc(shmem_inode_cachep, GFP_KERNEL);
2693 if (!info) 2697 if (!info)
2694 return NULL; 2698 return NULL;
2695 return &info->vfs_inode; 2699 return &info->vfs_inode;
2696 } 2700 }
2697 2701
2698 static void shmem_destroy_callback(struct rcu_head *head) 2702 static void shmem_destroy_callback(struct rcu_head *head)
2699 { 2703 {
2700 struct inode *inode = container_of(head, struct inode, i_rcu); 2704 struct inode *inode = container_of(head, struct inode, i_rcu);
2701 kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode)); 2705 kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
2702 } 2706 }
2703 2707
2704 static void shmem_destroy_inode(struct inode *inode) 2708 static void shmem_destroy_inode(struct inode *inode)
2705 { 2709 {
2706 if (S_ISREG(inode->i_mode)) 2710 if (S_ISREG(inode->i_mode))
2707 mpol_free_shared_policy(&SHMEM_I(inode)->policy); 2711 mpol_free_shared_policy(&SHMEM_I(inode)->policy);
2708 call_rcu(&inode->i_rcu, shmem_destroy_callback); 2712 call_rcu(&inode->i_rcu, shmem_destroy_callback);
2709 } 2713 }
2710 2714
2711 static void shmem_init_inode(void *foo) 2715 static void shmem_init_inode(void *foo)
2712 { 2716 {
2713 struct shmem_inode_info *info = foo; 2717 struct shmem_inode_info *info = foo;
2714 inode_init_once(&info->vfs_inode); 2718 inode_init_once(&info->vfs_inode);
2715 } 2719 }
2716 2720
2717 static int shmem_init_inodecache(void) 2721 static int shmem_init_inodecache(void)
2718 { 2722 {
2719 shmem_inode_cachep = kmem_cache_create("shmem_inode_cache", 2723 shmem_inode_cachep = kmem_cache_create("shmem_inode_cache",
2720 sizeof(struct shmem_inode_info), 2724 sizeof(struct shmem_inode_info),
2721 0, SLAB_PANIC, shmem_init_inode); 2725 0, SLAB_PANIC, shmem_init_inode);
2722 return 0; 2726 return 0;
2723 } 2727 }
2724 2728
2725 static void shmem_destroy_inodecache(void) 2729 static void shmem_destroy_inodecache(void)
2726 { 2730 {
2727 kmem_cache_destroy(shmem_inode_cachep); 2731 kmem_cache_destroy(shmem_inode_cachep);
2728 } 2732 }
2729 2733
2730 static const struct address_space_operations shmem_aops = { 2734 static const struct address_space_operations shmem_aops = {
2731 .writepage = shmem_writepage, 2735 .writepage = shmem_writepage,
2732 .set_page_dirty = __set_page_dirty_no_writeback, 2736 .set_page_dirty = __set_page_dirty_no_writeback,
2733 #ifdef CONFIG_TMPFS 2737 #ifdef CONFIG_TMPFS
2734 .write_begin = shmem_write_begin, 2738 .write_begin = shmem_write_begin,
2735 .write_end = shmem_write_end, 2739 .write_end = shmem_write_end,
2736 #endif 2740 #endif
2737 .migratepage = migrate_page, 2741 .migratepage = migrate_page,
2738 .error_remove_page = generic_error_remove_page, 2742 .error_remove_page = generic_error_remove_page,
2739 }; 2743 };
2740 2744
2741 static const struct file_operations shmem_file_operations = { 2745 static const struct file_operations shmem_file_operations = {
2742 .mmap = shmem_mmap, 2746 .mmap = shmem_mmap,
2743 #ifdef CONFIG_TMPFS 2747 #ifdef CONFIG_TMPFS
2744 .llseek = shmem_file_llseek, 2748 .llseek = shmem_file_llseek,
2745 .read = do_sync_read, 2749 .read = do_sync_read,
2746 .write = do_sync_write, 2750 .write = do_sync_write,
2747 .aio_read = shmem_file_aio_read, 2751 .aio_read = shmem_file_aio_read,
2748 .aio_write = generic_file_aio_write, 2752 .aio_write = generic_file_aio_write,
2749 .fsync = noop_fsync, 2753 .fsync = noop_fsync,
2750 .splice_read = shmem_file_splice_read, 2754 .splice_read = shmem_file_splice_read,
2751 .splice_write = generic_file_splice_write, 2755 .splice_write = generic_file_splice_write,
2752 .fallocate = shmem_fallocate, 2756 .fallocate = shmem_fallocate,
2753 #endif 2757 #endif
2754 }; 2758 };
2755 2759
2756 static const struct inode_operations shmem_inode_operations = { 2760 static const struct inode_operations shmem_inode_operations = {
2757 .setattr = shmem_setattr, 2761 .setattr = shmem_setattr,
2758 #ifdef CONFIG_TMPFS_XATTR 2762 #ifdef CONFIG_TMPFS_XATTR
2759 .setxattr = shmem_setxattr, 2763 .setxattr = shmem_setxattr,
2760 .getxattr = shmem_getxattr, 2764 .getxattr = shmem_getxattr,
2761 .listxattr = shmem_listxattr, 2765 .listxattr = shmem_listxattr,
2762 .removexattr = shmem_removexattr, 2766 .removexattr = shmem_removexattr,
2763 #endif 2767 #endif
2764 }; 2768 };
2765 2769
2766 static const struct inode_operations shmem_dir_inode_operations = { 2770 static const struct inode_operations shmem_dir_inode_operations = {
2767 #ifdef CONFIG_TMPFS 2771 #ifdef CONFIG_TMPFS
2768 .create = shmem_create, 2772 .create = shmem_create,
2769 .lookup = simple_lookup, 2773 .lookup = simple_lookup,
2770 .link = shmem_link, 2774 .link = shmem_link,
2771 .unlink = shmem_unlink, 2775 .unlink = shmem_unlink,
2772 .symlink = shmem_symlink, 2776 .symlink = shmem_symlink,
2773 .mkdir = shmem_mkdir, 2777 .mkdir = shmem_mkdir,
2774 .rmdir = shmem_rmdir, 2778 .rmdir = shmem_rmdir,
2775 .mknod = shmem_mknod, 2779 .mknod = shmem_mknod,
2776 .rename = shmem_rename, 2780 .rename = shmem_rename,
2777 .tmpfile = shmem_tmpfile, 2781 .tmpfile = shmem_tmpfile,
2778 #endif 2782 #endif
2779 #ifdef CONFIG_TMPFS_XATTR 2783 #ifdef CONFIG_TMPFS_XATTR
2780 .setxattr = shmem_setxattr, 2784 .setxattr = shmem_setxattr,
2781 .getxattr = shmem_getxattr, 2785 .getxattr = shmem_getxattr,
2782 .listxattr = shmem_listxattr, 2786 .listxattr = shmem_listxattr,
2783 .removexattr = shmem_removexattr, 2787 .removexattr = shmem_removexattr,
2784 #endif 2788 #endif
2785 #ifdef CONFIG_TMPFS_POSIX_ACL 2789 #ifdef CONFIG_TMPFS_POSIX_ACL
2786 .setattr = shmem_setattr, 2790 .setattr = shmem_setattr,
2787 #endif 2791 #endif
2788 }; 2792 };
2789 2793
2790 static const struct inode_operations shmem_special_inode_operations = { 2794 static const struct inode_operations shmem_special_inode_operations = {
2791 #ifdef CONFIG_TMPFS_XATTR 2795 #ifdef CONFIG_TMPFS_XATTR
2792 .setxattr = shmem_setxattr, 2796 .setxattr = shmem_setxattr,
2793 .getxattr = shmem_getxattr, 2797 .getxattr = shmem_getxattr,
2794 .listxattr = shmem_listxattr, 2798 .listxattr = shmem_listxattr,
2795 .removexattr = shmem_removexattr, 2799 .removexattr = shmem_removexattr,
2796 #endif 2800 #endif
2797 #ifdef CONFIG_TMPFS_POSIX_ACL 2801 #ifdef CONFIG_TMPFS_POSIX_ACL
2798 .setattr = shmem_setattr, 2802 .setattr = shmem_setattr,
2799 #endif 2803 #endif
2800 }; 2804 };
2801 2805
2802 static const struct super_operations shmem_ops = { 2806 static const struct super_operations shmem_ops = {
2803 .alloc_inode = shmem_alloc_inode, 2807 .alloc_inode = shmem_alloc_inode,
2804 .destroy_inode = shmem_destroy_inode, 2808 .destroy_inode = shmem_destroy_inode,
2805 #ifdef CONFIG_TMPFS 2809 #ifdef CONFIG_TMPFS
2806 .statfs = shmem_statfs, 2810 .statfs = shmem_statfs,
2807 .remount_fs = shmem_remount_fs, 2811 .remount_fs = shmem_remount_fs,
2808 .show_options = shmem_show_options, 2812 .show_options = shmem_show_options,
2809 #endif 2813 #endif
2810 .evict_inode = shmem_evict_inode, 2814 .evict_inode = shmem_evict_inode,
2811 .drop_inode = generic_delete_inode, 2815 .drop_inode = generic_delete_inode,
2812 .put_super = shmem_put_super, 2816 .put_super = shmem_put_super,
2813 }; 2817 };
2814 2818
2815 static const struct vm_operations_struct shmem_vm_ops = { 2819 static const struct vm_operations_struct shmem_vm_ops = {
2816 .fault = shmem_fault, 2820 .fault = shmem_fault,
2817 #ifdef CONFIG_NUMA 2821 #ifdef CONFIG_NUMA
2818 .set_policy = shmem_set_policy, 2822 .set_policy = shmem_set_policy,
2819 .get_policy = shmem_get_policy, 2823 .get_policy = shmem_get_policy,
2820 #endif 2824 #endif
2821 .remap_pages = generic_file_remap_pages, 2825 .remap_pages = generic_file_remap_pages,
2822 }; 2826 };
2823 2827
2824 static struct dentry *shmem_mount(struct file_system_type *fs_type, 2828 static struct dentry *shmem_mount(struct file_system_type *fs_type,
2825 int flags, const char *dev_name, void *data) 2829 int flags, const char *dev_name, void *data)
2826 { 2830 {
2827 return mount_nodev(fs_type, flags, data, shmem_fill_super); 2831 return mount_nodev(fs_type, flags, data, shmem_fill_super);
2828 } 2832 }
2829 2833
2830 static struct file_system_type shmem_fs_type = { 2834 static struct file_system_type shmem_fs_type = {
2831 .owner = THIS_MODULE, 2835 .owner = THIS_MODULE,
2832 .name = "tmpfs", 2836 .name = "tmpfs",
2833 .mount = shmem_mount, 2837 .mount = shmem_mount,
2834 .kill_sb = kill_litter_super, 2838 .kill_sb = kill_litter_super,
2835 .fs_flags = FS_USERNS_MOUNT, 2839 .fs_flags = FS_USERNS_MOUNT,
2836 }; 2840 };
2837 2841
2838 int __init shmem_init(void) 2842 int __init shmem_init(void)
2839 { 2843 {
2840 int error; 2844 int error;
2841 2845
2842 /* If rootfs called this, don't re-init */ 2846 /* If rootfs called this, don't re-init */
2843 if (shmem_inode_cachep) 2847 if (shmem_inode_cachep)
2844 return 0; 2848 return 0;
2845 2849
2846 error = bdi_init(&shmem_backing_dev_info); 2850 error = bdi_init(&shmem_backing_dev_info);
2847 if (error) 2851 if (error)
2848 goto out4; 2852 goto out4;
2849 2853
2850 error = shmem_init_inodecache(); 2854 error = shmem_init_inodecache();
2851 if (error) 2855 if (error)
2852 goto out3; 2856 goto out3;
2853 2857
2854 error = register_filesystem(&shmem_fs_type); 2858 error = register_filesystem(&shmem_fs_type);
2855 if (error) { 2859 if (error) {
2856 printk(KERN_ERR "Could not register tmpfs\n"); 2860 printk(KERN_ERR "Could not register tmpfs\n");
2857 goto out2; 2861 goto out2;
2858 } 2862 }
2859 2863
2860 shm_mnt = kern_mount(&shmem_fs_type); 2864 shm_mnt = kern_mount(&shmem_fs_type);
2861 if (IS_ERR(shm_mnt)) { 2865 if (IS_ERR(shm_mnt)) {
2862 error = PTR_ERR(shm_mnt); 2866 error = PTR_ERR(shm_mnt);
2863 printk(KERN_ERR "Could not kern_mount tmpfs\n"); 2867 printk(KERN_ERR "Could not kern_mount tmpfs\n");
2864 goto out1; 2868 goto out1;
2865 } 2869 }
2866 return 0; 2870 return 0;
2867 2871
2868 out1: 2872 out1:
2869 unregister_filesystem(&shmem_fs_type); 2873 unregister_filesystem(&shmem_fs_type);
2870 out2: 2874 out2:
2871 shmem_destroy_inodecache(); 2875 shmem_destroy_inodecache();
2872 out3: 2876 out3:
2873 bdi_destroy(&shmem_backing_dev_info); 2877 bdi_destroy(&shmem_backing_dev_info);
2874 out4: 2878 out4:
2875 shm_mnt = ERR_PTR(error); 2879 shm_mnt = ERR_PTR(error);
2876 return error; 2880 return error;
2877 } 2881 }
2878 2882
2879 #else /* !CONFIG_SHMEM */ 2883 #else /* !CONFIG_SHMEM */
2880 2884
2881 /* 2885 /*
2882 * tiny-shmem: simple shmemfs and tmpfs using ramfs code 2886 * tiny-shmem: simple shmemfs and tmpfs using ramfs code
2883 * 2887 *
2884 * This is intended for small system where the benefits of the full 2888 * This is intended for small system where the benefits of the full
2885 * shmem code (swap-backed and resource-limited) are outweighed by 2889 * shmem code (swap-backed and resource-limited) are outweighed by
2886 * their complexity. On systems without swap this code should be 2890 * their complexity. On systems without swap this code should be
2887 * effectively equivalent, but much lighter weight. 2891 * effectively equivalent, but much lighter weight.
2888 */ 2892 */
2889 2893
2890 static struct file_system_type shmem_fs_type = { 2894 static struct file_system_type shmem_fs_type = {
2891 .name = "tmpfs", 2895 .name = "tmpfs",
2892 .mount = ramfs_mount, 2896 .mount = ramfs_mount,
2893 .kill_sb = kill_litter_super, 2897 .kill_sb = kill_litter_super,
2894 .fs_flags = FS_USERNS_MOUNT, 2898 .fs_flags = FS_USERNS_MOUNT,
2895 }; 2899 };
2896 2900
2897 int __init shmem_init(void) 2901 int __init shmem_init(void)
2898 { 2902 {
2899 BUG_ON(register_filesystem(&shmem_fs_type) != 0); 2903 BUG_ON(register_filesystem(&shmem_fs_type) != 0);
2900 2904
2901 shm_mnt = kern_mount(&shmem_fs_type); 2905 shm_mnt = kern_mount(&shmem_fs_type);
2902 BUG_ON(IS_ERR(shm_mnt)); 2906 BUG_ON(IS_ERR(shm_mnt));
2903 2907
2904 return 0; 2908 return 0;
2905 } 2909 }
2906 2910
2907 int shmem_unuse(swp_entry_t swap, struct page *page) 2911 int shmem_unuse(swp_entry_t swap, struct page *page)
2908 { 2912 {
2909 return 0; 2913 return 0;
2910 } 2914 }
2911 2915
2912 int shmem_lock(struct file *file, int lock, struct user_struct *user) 2916 int shmem_lock(struct file *file, int lock, struct user_struct *user)
2913 { 2917 {
2914 return 0; 2918 return 0;
2915 } 2919 }
2916 2920
2917 void shmem_unlock_mapping(struct address_space *mapping) 2921 void shmem_unlock_mapping(struct address_space *mapping)
2918 { 2922 {
2919 } 2923 }
2920 2924
2921 void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend) 2925 void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
2922 { 2926 {
2923 truncate_inode_pages_range(inode->i_mapping, lstart, lend); 2927 truncate_inode_pages_range(inode->i_mapping, lstart, lend);
2924 } 2928 }
2925 EXPORT_SYMBOL_GPL(shmem_truncate_range); 2929 EXPORT_SYMBOL_GPL(shmem_truncate_range);
2926 2930
2927 #define shmem_vm_ops generic_file_vm_ops 2931 #define shmem_vm_ops generic_file_vm_ops
2928 #define shmem_file_operations ramfs_file_operations 2932 #define shmem_file_operations ramfs_file_operations
2929 #define shmem_get_inode(sb, dir, mode, dev, flags) ramfs_get_inode(sb, dir, mode, dev) 2933 #define shmem_get_inode(sb, dir, mode, dev, flags) ramfs_get_inode(sb, dir, mode, dev)
2930 #define shmem_acct_size(flags, size) 0 2934 #define shmem_acct_size(flags, size) 0
2931 #define shmem_unacct_size(flags, size) do {} while (0) 2935 #define shmem_unacct_size(flags, size) do {} while (0)
2932 2936
2933 #endif /* CONFIG_SHMEM */ 2937 #endif /* CONFIG_SHMEM */
2934 2938
2935 /* common code */ 2939 /* common code */
2936 2940
2937 static struct dentry_operations anon_ops = { 2941 static struct dentry_operations anon_ops = {
2938 .d_dname = simple_dname 2942 .d_dname = simple_dname
2939 }; 2943 };
2940 2944
2941 /** 2945 /**
2942 * shmem_file_setup - get an unlinked file living in tmpfs 2946 * shmem_file_setup - get an unlinked file living in tmpfs
2943 * @name: name for dentry (to be seen in /proc/<pid>/maps 2947 * @name: name for dentry (to be seen in /proc/<pid>/maps
2944 * @size: size to be set for the file 2948 * @size: size to be set for the file
2945 * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size 2949 * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size
2946 */ 2950 */
2947 struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags) 2951 struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags)
2948 { 2952 {
2949 struct file *res; 2953 struct file *res;
2950 struct inode *inode; 2954 struct inode *inode;
2951 struct path path; 2955 struct path path;
2952 struct super_block *sb; 2956 struct super_block *sb;
2953 struct qstr this; 2957 struct qstr this;
2954 2958
2955 if (IS_ERR(shm_mnt)) 2959 if (IS_ERR(shm_mnt))
2956 return ERR_CAST(shm_mnt); 2960 return ERR_CAST(shm_mnt);
2957 2961
2958 if (size < 0 || size > MAX_LFS_FILESIZE) 2962 if (size < 0 || size > MAX_LFS_FILESIZE)
2959 return ERR_PTR(-EINVAL); 2963 return ERR_PTR(-EINVAL);
2960 2964
2961 if (shmem_acct_size(flags, size)) 2965 if (shmem_acct_size(flags, size))
2962 return ERR_PTR(-ENOMEM); 2966 return ERR_PTR(-ENOMEM);
2963 2967
2964 res = ERR_PTR(-ENOMEM); 2968 res = ERR_PTR(-ENOMEM);
2965 this.name = name; 2969 this.name = name;
2966 this.len = strlen(name); 2970 this.len = strlen(name);
2967 this.hash = 0; /* will go */ 2971 this.hash = 0; /* will go */
2968 sb = shm_mnt->mnt_sb; 2972 sb = shm_mnt->mnt_sb;
2969 path.dentry = d_alloc_pseudo(sb, &this); 2973 path.dentry = d_alloc_pseudo(sb, &this);
2970 if (!path.dentry) 2974 if (!path.dentry)
2971 goto put_memory; 2975 goto put_memory;
2972 d_set_d_op(path.dentry, &anon_ops); 2976 d_set_d_op(path.dentry, &anon_ops);
2973 path.mnt = mntget(shm_mnt); 2977 path.mnt = mntget(shm_mnt);
2974 2978
2975 res = ERR_PTR(-ENOSPC); 2979 res = ERR_PTR(-ENOSPC);
2976 inode = shmem_get_inode(sb, NULL, S_IFREG | S_IRWXUGO, 0, flags); 2980 inode = shmem_get_inode(sb, NULL, S_IFREG | S_IRWXUGO, 0, flags);
2977 if (!inode) 2981 if (!inode)
2978 goto put_dentry; 2982 goto put_dentry;
2979 2983
2980 d_instantiate(path.dentry, inode); 2984 d_instantiate(path.dentry, inode);
2981 inode->i_size = size; 2985 inode->i_size = size;
2982 clear_nlink(inode); /* It is unlinked */ 2986 clear_nlink(inode); /* It is unlinked */
2983 res = ERR_PTR(ramfs_nommu_expand_for_mapping(inode, size)); 2987 res = ERR_PTR(ramfs_nommu_expand_for_mapping(inode, size));
2984 if (IS_ERR(res)) 2988 if (IS_ERR(res))
2985 goto put_dentry; 2989 goto put_dentry;
2986 2990
2987 res = alloc_file(&path, FMODE_WRITE | FMODE_READ, 2991 res = alloc_file(&path, FMODE_WRITE | FMODE_READ,
2988 &shmem_file_operations); 2992 &shmem_file_operations);
2989 if (IS_ERR(res)) 2993 if (IS_ERR(res))
2990 goto put_dentry; 2994 goto put_dentry;
2991 2995
2992 return res; 2996 return res;
2993 2997
2994 put_dentry: 2998 put_dentry:
2995 path_put(&path); 2999 path_put(&path);
2996 put_memory: 3000 put_memory:
2997 shmem_unacct_size(flags, size); 3001 shmem_unacct_size(flags, size);
2998 return res; 3002 return res;
2999 } 3003 }
3000 EXPORT_SYMBOL_GPL(shmem_file_setup); 3004 EXPORT_SYMBOL_GPL(shmem_file_setup);
3001 3005
3002 /** 3006 /**
3003 * shmem_zero_setup - setup a shared anonymous mapping 3007 * shmem_zero_setup - setup a shared anonymous mapping
3004 * @vma: the vma to be mmapped is prepared by do_mmap_pgoff 3008 * @vma: the vma to be mmapped is prepared by do_mmap_pgoff
3005 */ 3009 */
3006 int shmem_zero_setup(struct vm_area_struct *vma) 3010 int shmem_zero_setup(struct vm_area_struct *vma)
3007 { 3011 {
3008 struct file *file; 3012 struct file *file;
3009 loff_t size = vma->vm_end - vma->vm_start; 3013 loff_t size = vma->vm_end - vma->vm_start;
3010 3014
3011 file = shmem_file_setup("dev/zero", size, vma->vm_flags); 3015 file = shmem_file_setup("dev/zero", size, vma->vm_flags);
3012 if (IS_ERR(file)) 3016 if (IS_ERR(file))
3013 return PTR_ERR(file); 3017 return PTR_ERR(file);
3014 3018
3015 if (vma->vm_file) 3019 if (vma->vm_file)
3016 fput(vma->vm_file); 3020 fput(vma->vm_file);
3017 vma->vm_file = file; 3021 vma->vm_file = file;
3018 vma->vm_ops = &shmem_vm_ops; 3022 vma->vm_ops = &shmem_vm_ops;
3019 return 0; 3023 return 0;
3020 } 3024 }
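
shmem_zero_setup() is what backs shared anonymous memory: the vma prepared by do_mmap_pgoff() for a MAP_SHARED | MAP_ANONYMOUS mapping is given an unlinked tmpfs file ("dev/zero") and shmem_vm_ops, so faults are served from the page cache. From user space the path can be exercised with an ordinary mmap(), for example:

	#define _GNU_SOURCE	/* for MAP_ANONYMOUS on older glibc */
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 1 << 20;	/* 1 MiB */

		/* A shared anonymous mapping; in the kernel this vma is handed
		 * to shmem_zero_setup() and backed by an unlinked tmpfs file. */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		memset(p, 0xab, len);	/* faults in tmpfs-backed pages */
		printf("first byte: %#x\n", p[0] & 0xff);

		munmap(p, len);
		return 0;
	}
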
3021 3025
3022 /** 3026 /**
3023 * shmem_read_mapping_page_gfp - read into page cache, using specified page allocation flags. 3027 * shmem_read_mapping_page_gfp - read into page cache, using specified page allocation flags.
3024 * @mapping: the page's address_space 3028 * @mapping: the page's address_space
3025 * @index: the page index 3029 * @index: the page index
3026 * @gfp: the page allocator flags to use if allocating 3030 * @gfp: the page allocator flags to use if allocating
3027 * 3031 *
3028 * This behaves as a tmpfs "read_cache_page_gfp(mapping, index, gfp)", 3032 * This behaves as a tmpfs "read_cache_page_gfp(mapping, index, gfp)",
3029 * with any new page allocations done using the specified allocation flags. 3033 * with any new page allocations done using the specified allocation flags.
3030 * But read_cache_page_gfp() uses the ->readpage() method: which does not 3034 * But read_cache_page_gfp() uses the ->readpage() method: which does not
3031 * suit tmpfs, since it may have pages in swapcache, and needs to find those 3035 * suit tmpfs, since it may have pages in swapcache, and needs to find those
3032 * for itself; although drivers/gpu/drm i915 and ttm rely upon this support. 3036 * for itself; although drivers/gpu/drm i915 and ttm rely upon this support.
3033 * 3037 *
3034 * i915_gem_object_get_pages_gtt() mixes __GFP_NORETRY | __GFP_NOWARN in 3038 * i915_gem_object_get_pages_gtt() mixes __GFP_NORETRY | __GFP_NOWARN in
3035 * with the mapping_gfp_mask(), to avoid OOMing the machine unnecessarily. 3039 * with the mapping_gfp_mask(), to avoid OOMing the machine unnecessarily.
3036 */ 3040 */
3037 struct page *shmem_read_mapping_page_gfp(struct address_space *mapping, 3041 struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
3038 pgoff_t index, gfp_t gfp) 3042 pgoff_t index, gfp_t gfp)
3039 { 3043 {
3040 #ifdef CONFIG_SHMEM 3044 #ifdef CONFIG_SHMEM
3041 struct inode *inode = mapping->host; 3045 struct inode *inode = mapping->host;
3042 struct page *page; 3046 struct page *page;
3043 int error; 3047 int error;
3044 3048
3045 BUG_ON(mapping->a_ops != &shmem_aops); 3049 BUG_ON(mapping->a_ops != &shmem_aops);
3046 error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE, gfp, NULL); 3050 error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE, gfp, NULL);
3047 if (error) 3051 if (error)
3048 page = ERR_PTR(error); 3052 page = ERR_PTR(error);
3049 else 3053 else
3050 unlock_page(page); 3054 unlock_page(page);
3051 return page; 3055 return page;
3052 #else 3056 #else
3053 /* 3057 /*
3054 * The tiny !SHMEM case uses ramfs without swap 3058 * The tiny !SHMEM case uses ramfs without swap
3055 */ 3059 */
3056 return read_cache_page_gfp(mapping, index, gfp); 3060 return read_cache_page_gfp(mapping, index, gfp);
3057 #endif 3061 #endif
3058 } 3062 }
3059 EXPORT_SYMBOL_GPL(shmem_read_mapping_page_gfp); 3063 EXPORT_SYMBOL_GPL(shmem_read_mapping_page_gfp);
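
The kernel-doc above describes the intended use: drivers that keep objects in tmpfs (i915, ttm) look pages up through this helper rather than read_cache_page_gfp(), and can relax the gfp mask so a failed allocation falls back gracefully instead of pushing the machine toward OOM. A hedged sketch of such a caller follows; it is not taken from any driver and the helper name is made up:

	#include <linux/err.h>
	#include <linux/gfp.h>
	#include <linux/pagemap.h>
	#include <linux/shmem_fs.h>

	/* Illustrative caller only; error handling trimmed to the essentials. */
	static int sample_touch_object_page(struct address_space *mapping,
					    pgoff_t index)
	{
		gfp_t gfp = mapping_gfp_mask(mapping);
		struct page *page;

		/* Prefer failing the lookup over aggressive reclaim/OOM. */
		gfp |= __GFP_NORETRY | __GFP_NOWARN;

		page = shmem_read_mapping_page_gfp(mapping, index, gfp);
		if (IS_ERR(page))
			return PTR_ERR(page);

		/* ... operate on the uptodate, unlocked page here ... */

		page_cache_release(page);	/* drop the reference shmem_getpage took */
		return 0;
	}
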

mm/swap.c

1 /* 1 /*
2 * linux/mm/swap.c 2 * linux/mm/swap.c
3 * 3 *
4 * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds 4 * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
5 */ 5 */
6 6
7 /* 7 /*
8 * This file contains the default values for the operation of the 8 * This file contains the default values for the operation of the
9 * Linux VM subsystem. Fine-tuning documentation can be found in 9 * Linux VM subsystem. Fine-tuning documentation can be found in
10 * Documentation/sysctl/vm.txt. 10 * Documentation/sysctl/vm.txt.
11 * Started 18.12.91 11 * Started 18.12.91
12 * Swap aging added 23.2.95, Stephen Tweedie. 12 * Swap aging added 23.2.95, Stephen Tweedie.
13 * Buffermem limits added 12.3.98, Rik van Riel. 13 * Buffermem limits added 12.3.98, Rik van Riel.
14 */ 14 */
15 15
16 #include <linux/mm.h> 16 #include <linux/mm.h>
17 #include <linux/sched.h> 17 #include <linux/sched.h>
18 #include <linux/kernel_stat.h> 18 #include <linux/kernel_stat.h>
19 #include <linux/swap.h> 19 #include <linux/swap.h>
20 #include <linux/mman.h> 20 #include <linux/mman.h>
21 #include <linux/pagemap.h> 21 #include <linux/pagemap.h>
22 #include <linux/pagevec.h> 22 #include <linux/pagevec.h>
23 #include <linux/init.h> 23 #include <linux/init.h>
24 #include <linux/export.h> 24 #include <linux/export.h>
25 #include <linux/mm_inline.h> 25 #include <linux/mm_inline.h>
26 #include <linux/percpu_counter.h> 26 #include <linux/percpu_counter.h>
27 #include <linux/percpu.h> 27 #include <linux/percpu.h>
28 #include <linux/cpu.h> 28 #include <linux/cpu.h>
29 #include <linux/notifier.h> 29 #include <linux/notifier.h>
30 #include <linux/backing-dev.h> 30 #include <linux/backing-dev.h>
31 #include <linux/memcontrol.h> 31 #include <linux/memcontrol.h>
32 #include <linux/gfp.h> 32 #include <linux/gfp.h>
33 #include <linux/uio.h> 33 #include <linux/uio.h>
34 #include <linux/hugetlb.h> 34 #include <linux/hugetlb.h>
35 35
36 #include "internal.h" 36 #include "internal.h"
37 37
38 #define CREATE_TRACE_POINTS 38 #define CREATE_TRACE_POINTS
39 #include <trace/events/pagemap.h> 39 #include <trace/events/pagemap.h>
40 40
41 /* How many pages do we try to swap or page in/out together? */ 41 /* How many pages do we try to swap or page in/out together? */
42 int page_cluster; 42 int page_cluster;
43 43
44 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec); 44 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
45 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs); 45 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
46 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs); 46 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
47 47
48 /* 48 /*
49 * This path almost never happens for VM activity - pages are normally 49 * This path almost never happens for VM activity - pages are normally
50 * freed via pagevecs. But it gets used by networking. 50 * freed via pagevecs. But it gets used by networking.
51 */ 51 */
52 static void __page_cache_release(struct page *page) 52 static void __page_cache_release(struct page *page)
53 { 53 {
54 if (PageLRU(page)) { 54 if (PageLRU(page)) {
55 struct zone *zone = page_zone(page); 55 struct zone *zone = page_zone(page);
56 struct lruvec *lruvec; 56 struct lruvec *lruvec;
57 unsigned long flags; 57 unsigned long flags;
58 58
59 spin_lock_irqsave(&zone->lru_lock, flags); 59 spin_lock_irqsave(&zone->lru_lock, flags);
60 lruvec = mem_cgroup_page_lruvec(page, zone); 60 lruvec = mem_cgroup_page_lruvec(page, zone);
61 VM_BUG_ON(!PageLRU(page)); 61 VM_BUG_ON(!PageLRU(page));
62 __ClearPageLRU(page); 62 __ClearPageLRU(page);
63 del_page_from_lru_list(page, lruvec, page_off_lru(page)); 63 del_page_from_lru_list(page, lruvec, page_off_lru(page));
64 spin_unlock_irqrestore(&zone->lru_lock, flags); 64 spin_unlock_irqrestore(&zone->lru_lock, flags);
65 } 65 }
66 } 66 }
67 67
68 static void __put_single_page(struct page *page) 68 static void __put_single_page(struct page *page)
69 { 69 {
70 __page_cache_release(page); 70 __page_cache_release(page);
71 free_hot_cold_page(page, false); 71 free_hot_cold_page(page, false);
72 } 72 }
73 73
74 static void __put_compound_page(struct page *page) 74 static void __put_compound_page(struct page *page)
75 { 75 {
76 compound_page_dtor *dtor; 76 compound_page_dtor *dtor;
77 77
78 __page_cache_release(page); 78 __page_cache_release(page);
79 dtor = get_compound_page_dtor(page); 79 dtor = get_compound_page_dtor(page);
80 (*dtor)(page); 80 (*dtor)(page);
81 } 81 }
82 82
83 static void put_compound_page(struct page *page) 83 static void put_compound_page(struct page *page)
84 { 84 {
85 if (unlikely(PageTail(page))) { 85 if (unlikely(PageTail(page))) {
86 /* __split_huge_page_refcount can run under us */ 86 /* __split_huge_page_refcount can run under us */
87 struct page *page_head = compound_head(page); 87 struct page *page_head = compound_head(page);
88 88
89 if (likely(page != page_head && 89 if (likely(page != page_head &&
90 get_page_unless_zero(page_head))) { 90 get_page_unless_zero(page_head))) {
91 unsigned long flags; 91 unsigned long flags;
92 92
93 /* 93 /*
94 * THP can not break up slab pages so avoid taking 94 * THP can not break up slab pages so avoid taking
95 * compound_lock(). Slab performs non-atomic bit ops 95 * compound_lock(). Slab performs non-atomic bit ops
96 * on page->flags for better performance. In particular 96 * on page->flags for better performance. In particular
97 * slab_unlock() in slub used to be a hot path. It is 97 * slab_unlock() in slub used to be a hot path. It is
98 * still hot on arches that do not support 98 * still hot on arches that do not support
99 * this_cpu_cmpxchg_double(). 99 * this_cpu_cmpxchg_double().
100 */ 100 */
101 if (PageSlab(page_head) || PageHeadHuge(page_head)) { 101 if (PageSlab(page_head) || PageHeadHuge(page_head)) {
102 if (likely(PageTail(page))) { 102 if (likely(PageTail(page))) {
103 /* 103 /*
104 * __split_huge_page_refcount 104 * __split_huge_page_refcount
105 * cannot race here. 105 * cannot race here.
106 */ 106 */
107 VM_BUG_ON(!PageHead(page_head)); 107 VM_BUG_ON(!PageHead(page_head));
108 atomic_dec(&page->_mapcount); 108 atomic_dec(&page->_mapcount);
109 if (put_page_testzero(page_head)) 109 if (put_page_testzero(page_head))
110 VM_BUG_ON(1); 110 VM_BUG_ON(1);
111 if (put_page_testzero(page_head)) 111 if (put_page_testzero(page_head))
112 __put_compound_page(page_head); 112 __put_compound_page(page_head);
113 return; 113 return;
114 } else 114 } else
115 /* 115 /*
116 * __split_huge_page_refcount 116 * __split_huge_page_refcount
117 * run before us, "page" was a 117 * run before us, "page" was a
118 * THP tail. The split 118 * THP tail. The split
119 * page_head has been freed 119 * page_head has been freed
120 * and reallocated as slab or 120 * and reallocated as slab or
121 * hugetlbfs page of smaller 121 * hugetlbfs page of smaller
122 * order (only possible if 122 * order (only possible if
123 * reallocated as slab on 123 * reallocated as slab on
124 * x86). 124 * x86).
125 */ 125 */
126 goto skip_lock; 126 goto skip_lock;
127 } 127 }
128 /* 128 /*
129 * page_head wasn't a dangling pointer but it 129 * page_head wasn't a dangling pointer but it
130 * may not be a head page anymore by the time 130 * may not be a head page anymore by the time
131 * we obtain the lock. That is ok as long as it 131 * we obtain the lock. That is ok as long as it
132 * can't be freed from under us. 132 * can't be freed from under us.
133 */ 133 */
134 flags = compound_lock_irqsave(page_head); 134 flags = compound_lock_irqsave(page_head);
135 if (unlikely(!PageTail(page))) { 135 if (unlikely(!PageTail(page))) {
136 /* __split_huge_page_refcount run before us */ 136 /* __split_huge_page_refcount run before us */
137 compound_unlock_irqrestore(page_head, flags); 137 compound_unlock_irqrestore(page_head, flags);
138 skip_lock: 138 skip_lock:
139 if (put_page_testzero(page_head)) { 139 if (put_page_testzero(page_head)) {
140 /* 140 /*
141 * The head page may have been 141 * The head page may have been
142 * freed and reallocated as a 142 * freed and reallocated as a
143 * compound page of smaller 143 * compound page of smaller
144 * order and then freed again. 144 * order and then freed again.
145 * All we know is that it 145 * All we know is that it
146 * cannot have become: a THP 146 * cannot have become: a THP
147 * page, a compound page of 147 * page, a compound page of
148 * higher order, a tail page. 148 * higher order, a tail page.
149 * That is because we still 149 * That is because we still
150 * hold the refcount of the 150 * hold the refcount of the
151 * split THP tail and 151 * split THP tail and
152 * page_head was the THP head 152 * page_head was the THP head
153 * before the split. 153 * before the split.
154 */ 154 */
155 if (PageHead(page_head)) 155 if (PageHead(page_head))
156 __put_compound_page(page_head); 156 __put_compound_page(page_head);
157 else 157 else
158 __put_single_page(page_head); 158 __put_single_page(page_head);
159 } 159 }
160 out_put_single: 160 out_put_single:
161 if (put_page_testzero(page)) 161 if (put_page_testzero(page))
162 __put_single_page(page); 162 __put_single_page(page);
163 return; 163 return;
164 } 164 }
165 VM_BUG_ON(page_head != page->first_page); 165 VM_BUG_ON(page_head != page->first_page);
166 /* 166 /*
167 * We can release the refcount taken by 167 * We can release the refcount taken by
168 * get_page_unless_zero() now that 168 * get_page_unless_zero() now that
169 * __split_huge_page_refcount() is blocked on 169 * __split_huge_page_refcount() is blocked on
170 * the compound_lock. 170 * the compound_lock.
171 */ 171 */
172 if (put_page_testzero(page_head)) 172 if (put_page_testzero(page_head))
173 VM_BUG_ON(1); 173 VM_BUG_ON(1);
174 /* __split_huge_page_refcount will wait now */ 174 /* __split_huge_page_refcount will wait now */
175 VM_BUG_ON(page_mapcount(page) <= 0); 175 VM_BUG_ON(page_mapcount(page) <= 0);
176 atomic_dec(&page->_mapcount); 176 atomic_dec(&page->_mapcount);
177 VM_BUG_ON(atomic_read(&page_head->_count) <= 0); 177 VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
178 VM_BUG_ON(atomic_read(&page->_count) != 0); 178 VM_BUG_ON(atomic_read(&page->_count) != 0);
179 compound_unlock_irqrestore(page_head, flags); 179 compound_unlock_irqrestore(page_head, flags);
180 180
181 if (put_page_testzero(page_head)) { 181 if (put_page_testzero(page_head)) {
182 if (PageHead(page_head)) 182 if (PageHead(page_head))
183 __put_compound_page(page_head); 183 __put_compound_page(page_head);
184 else 184 else
185 __put_single_page(page_head); 185 __put_single_page(page_head);
186 } 186 }
187 } else { 187 } else {
188 /* page_head is a dangling pointer */ 188 /* page_head is a dangling pointer */
189 VM_BUG_ON(PageTail(page)); 189 VM_BUG_ON(PageTail(page));
190 goto out_put_single; 190 goto out_put_single;
191 } 191 }
192 } else if (put_page_testzero(page)) { 192 } else if (put_page_testzero(page)) {
193 if (PageHead(page)) 193 if (PageHead(page))
194 __put_compound_page(page); 194 __put_compound_page(page);
195 else 195 else
196 __put_single_page(page); 196 __put_single_page(page);
197 } 197 }
198 } 198 }
199 199
200 void put_page(struct page *page) 200 void put_page(struct page *page)
201 { 201 {
202 if (unlikely(PageCompound(page))) 202 if (unlikely(PageCompound(page)))
203 put_compound_page(page); 203 put_compound_page(page);
204 else if (put_page_testzero(page)) 204 else if (put_page_testzero(page))
205 __put_single_page(page); 205 __put_single_page(page);
206 } 206 }
207 EXPORT_SYMBOL(put_page); 207 EXPORT_SYMBOL(put_page);
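As a usage note, not part of this diff: the usual lifetime pattern around put_page() is that a lookup helper hands back the page with its refcount elevated and the caller drops that reference when finished. A minimal sketch, assuming a mapping/index pair that may or may not have a page present (the wrapper name is illustrative):

	static void example_peek_page(struct address_space *mapping, pgoff_t index)
	{
		/* find_get_page() returns the page with an elevated refcount */
		struct page *page = find_get_page(mapping, index);

		if (!page)
			return;
		/* ... inspect the page while the reference pins it ... */
		put_page(page);		/* may free the page via the paths above */
	}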
208 208
209 /* 209 /*
210 * This function is exported but must not be called by anything other 210 * This function is exported but must not be called by anything other
211 * than get_page(). It implements the slow path of get_page(). 211 * than get_page(). It implements the slow path of get_page().
212 */ 212 */
213 bool __get_page_tail(struct page *page) 213 bool __get_page_tail(struct page *page)
214 { 214 {
215 /* 215 /*
216 * This takes care of get_page() if run on a tail page 216 * This takes care of get_page() if run on a tail page
217 * returned by one of the get_user_pages/follow_page variants. 217 * returned by one of the get_user_pages/follow_page variants.
218 * get_user_pages/follow_page itself doesn't need the compound 218 * get_user_pages/follow_page itself doesn't need the compound
219 * lock because it runs __get_page_tail_foll() under the 219 * lock because it runs __get_page_tail_foll() under the
220 * proper PT lock that already serializes against 220 * proper PT lock that already serializes against
221 * split_huge_page(). 221 * split_huge_page().
222 */ 222 */
223 unsigned long flags; 223 unsigned long flags;
224 bool got = false; 224 bool got = false;
225 struct page *page_head = compound_head(page); 225 struct page *page_head = compound_head(page);
226 226
227 if (likely(page != page_head && get_page_unless_zero(page_head))) { 227 if (likely(page != page_head && get_page_unless_zero(page_head))) {
228 /* See the comment in put_compound_page(). */ 228 /* See the comment in put_compound_page(). */
229 if (PageSlab(page_head) || PageHeadHuge(page_head)) { 229 if (PageSlab(page_head) || PageHeadHuge(page_head)) {
230 if (likely(PageTail(page))) { 230 if (likely(PageTail(page))) {
231 /* 231 /*
232 * This is a hugetlbfs page or a slab 232 * This is a hugetlbfs page or a slab
233 * page. __split_huge_page_refcount 233 * page. __split_huge_page_refcount
234 * cannot race here. 234 * cannot race here.
235 */ 235 */
236 VM_BUG_ON(!PageHead(page_head)); 236 VM_BUG_ON(!PageHead(page_head));
237 __get_page_tail_foll(page, false); 237 __get_page_tail_foll(page, false);
238 return true; 238 return true;
239 } else { 239 } else {
240 /* 240 /*
241 * __split_huge_page_refcount run 241 * __split_huge_page_refcount run
242 * before us, "page" was a THP 242 * before us, "page" was a THP
243 * tail. The split page_head has been 243 * tail. The split page_head has been
244 * freed and reallocated as slab or 244 * freed and reallocated as slab or
245 * hugetlbfs page of smaller order 245 * hugetlbfs page of smaller order
246 * (only possible if reallocated as 246 * (only possible if reallocated as
247 * slab on x86). 247 * slab on x86).
248 */ 248 */
249 put_page(page_head); 249 put_page(page_head);
250 return false; 250 return false;
251 } 251 }
252 } 252 }
253 253
254 /* 254 /*
255 * page_head wasn't a dangling pointer but it 255 * page_head wasn't a dangling pointer but it
256 * may not be a head page anymore by the time 256 * may not be a head page anymore by the time
257 * we obtain the lock. That is ok as long as it 257 * we obtain the lock. That is ok as long as it
258 * can't be freed from under us. 258 * can't be freed from under us.
259 */ 259 */
260 flags = compound_lock_irqsave(page_head); 260 flags = compound_lock_irqsave(page_head);
261 /* here __split_huge_page_refcount won't run anymore */ 261 /* here __split_huge_page_refcount won't run anymore */
262 if (likely(PageTail(page))) { 262 if (likely(PageTail(page))) {
263 __get_page_tail_foll(page, false); 263 __get_page_tail_foll(page, false);
264 got = true; 264 got = true;
265 } 265 }
266 compound_unlock_irqrestore(page_head, flags); 266 compound_unlock_irqrestore(page_head, flags);
267 if (unlikely(!got)) 267 if (unlikely(!got))
268 put_page(page_head); 268 put_page(page_head);
269 } 269 }
270 return got; 270 return got;
271 } 271 }
272 EXPORT_SYMBOL(__get_page_tail); 272 EXPORT_SYMBOL(__get_page_tail);
273 273
274 /** 274 /**
275 * put_pages_list() - release a list of pages 275 * put_pages_list() - release a list of pages
276 * @pages: list of pages threaded on page->lru 276 * @pages: list of pages threaded on page->lru
277 * 277 *
278 * Release a list of pages which are strung together on page.lru. Currently 278 * Release a list of pages which are strung together on page.lru. Currently
279 * used by read_cache_pages() and related error recovery code. 279 * used by read_cache_pages() and related error recovery code.
280 */ 280 */
281 void put_pages_list(struct list_head *pages) 281 void put_pages_list(struct list_head *pages)
282 { 282 {
283 while (!list_empty(pages)) { 283 while (!list_empty(pages)) {
284 struct page *victim; 284 struct page *victim;
285 285
286 victim = list_entry(pages->prev, struct page, lru); 286 victim = list_entry(pages->prev, struct page, lru);
287 list_del(&victim->lru); 287 list_del(&victim->lru);
288 page_cache_release(victim); 288 page_cache_release(victim);
289 } 289 }
290 } 290 }
291 EXPORT_SYMBOL(put_pages_list); 291 EXPORT_SYMBOL(put_pages_list);
292 292
293 /* 293 /*
294 * get_kernel_pages() - pin kernel pages in memory 294 * get_kernel_pages() - pin kernel pages in memory
295 * @kiov: An array of struct kvec structures 295 * @kiov: An array of struct kvec structures
296 * @nr_segs: number of segments to pin 296 * @nr_segs: number of segments to pin
297 * @write: pinning for read/write, currently ignored 297 * @write: pinning for read/write, currently ignored
298 * @pages: array that receives pointers to the pages pinned. 298 * @pages: array that receives pointers to the pages pinned.
299 * Should be at least nr_segs long. 299 * Should be at least nr_segs long.
300 * 300 *
301 * Returns number of pages pinned. This may be fewer than the number 301 * Returns number of pages pinned. This may be fewer than the number
302 * requested. If nr_segs is 0 or negative, returns 0. If no pages 302 * requested. If nr_segs is 0 or negative, returns 0. If no pages
303 * were pinned, returns -errno. Each page returned must be released 303 * were pinned, returns -errno. Each page returned must be released
304 * with a put_page() call when it is finished with. 304 * with a put_page() call when it is finished with.
305 */ 305 */
306 int get_kernel_pages(const struct kvec *kiov, int nr_segs, int write, 306 int get_kernel_pages(const struct kvec *kiov, int nr_segs, int write,
307 struct page **pages) 307 struct page **pages)
308 { 308 {
309 int seg; 309 int seg;
310 310
311 for (seg = 0; seg < nr_segs; seg++) { 311 for (seg = 0; seg < nr_segs; seg++) {
312 if (WARN_ON(kiov[seg].iov_len != PAGE_SIZE)) 312 if (WARN_ON(kiov[seg].iov_len != PAGE_SIZE))
313 return seg; 313 return seg;
314 314
315 pages[seg] = kmap_to_page(kiov[seg].iov_base); 315 pages[seg] = kmap_to_page(kiov[seg].iov_base);
316 page_cache_get(pages[seg]); 316 page_cache_get(pages[seg]);
317 } 317 }
318 318
319 return seg; 319 return seg;
320 } 320 }
321 EXPORT_SYMBOL_GPL(get_kernel_pages); 321 EXPORT_SYMBOL_GPL(get_kernel_pages);
322 322
323 /* 323 /*
324 * get_kernel_page() - pin a kernel page in memory 324 * get_kernel_page() - pin a kernel page in memory
325 * @start: starting kernel address 325 * @start: starting kernel address
326 * @write: pinning for read/write, currently ignored 326 * @write: pinning for read/write, currently ignored
327 * @pages: array that receives pointer to the page pinned. 327 * @pages: array that receives pointer to the page pinned.
328 * Must have room for at least one page pointer. 328 * Must have room for at least one page pointer.
329 * 329 *
330 * Returns 1 if page is pinned. If the page was not pinned, returns 330 * Returns 1 if page is pinned. If the page was not pinned, returns
331 * -errno. The page returned must be released with a put_page() call 331 * -errno. The page returned must be released with a put_page() call
332 * when it is finished with. 332 * when it is finished with.
333 */ 333 */
334 int get_kernel_page(unsigned long start, int write, struct page **pages) 334 int get_kernel_page(unsigned long start, int write, struct page **pages)
335 { 335 {
336 const struct kvec kiov = { 336 const struct kvec kiov = {
337 .iov_base = (void *)start, 337 .iov_base = (void *)start,
338 .iov_len = PAGE_SIZE 338 .iov_len = PAGE_SIZE
339 }; 339 };
340 340
341 return get_kernel_pages(&kiov, 1, write, pages); 341 return get_kernel_pages(&kiov, 1, write, pages);
342 } 342 }
343 EXPORT_SYMBOL_GPL(get_kernel_page); 343 EXPORT_SYMBOL_GPL(get_kernel_page);
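A minimal usage sketch for the single-page wrapper, assuming buf points into a page-backed kernel mapping; the wrapper name and error handling are illustrative, not from this file:

	static int example_pin_kernel_buffer(void *buf)
	{
		struct page *page;

		if (get_kernel_page((unsigned long)buf, 0, &page) < 1)
			return -EFAULT;
		/* ... hand "page" to code that expects a struct page ... */
		put_page(page);		/* drop the reference taken above */
		return 0;
	}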
344 344
345 static void pagevec_lru_move_fn(struct pagevec *pvec, 345 static void pagevec_lru_move_fn(struct pagevec *pvec,
346 void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg), 346 void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
347 void *arg) 347 void *arg)
348 { 348 {
349 int i; 349 int i;
350 struct zone *zone = NULL; 350 struct zone *zone = NULL;
351 struct lruvec *lruvec; 351 struct lruvec *lruvec;
352 unsigned long flags = 0; 352 unsigned long flags = 0;
353 353
354 for (i = 0; i < pagevec_count(pvec); i++) { 354 for (i = 0; i < pagevec_count(pvec); i++) {
355 struct page *page = pvec->pages[i]; 355 struct page *page = pvec->pages[i];
356 struct zone *pagezone = page_zone(page); 356 struct zone *pagezone = page_zone(page);
357 357
358 if (pagezone != zone) { 358 if (pagezone != zone) {
359 if (zone) 359 if (zone)
360 spin_unlock_irqrestore(&zone->lru_lock, flags); 360 spin_unlock_irqrestore(&zone->lru_lock, flags);
361 zone = pagezone; 361 zone = pagezone;
362 spin_lock_irqsave(&zone->lru_lock, flags); 362 spin_lock_irqsave(&zone->lru_lock, flags);
363 } 363 }
364 364
365 lruvec = mem_cgroup_page_lruvec(page, zone); 365 lruvec = mem_cgroup_page_lruvec(page, zone);
366 (*move_fn)(page, lruvec, arg); 366 (*move_fn)(page, lruvec, arg);
367 } 367 }
368 if (zone) 368 if (zone)
369 spin_unlock_irqrestore(&zone->lru_lock, flags); 369 spin_unlock_irqrestore(&zone->lru_lock, flags);
370 release_pages(pvec->pages, pvec->nr, pvec->cold); 370 release_pages(pvec->pages, pvec->nr, pvec->cold);
371 pagevec_reinit(pvec); 371 pagevec_reinit(pvec);
372 } 372 }
373 373
374 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec, 374 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
375 void *arg) 375 void *arg)
376 { 376 {
377 int *pgmoved = arg; 377 int *pgmoved = arg;
378 378
379 if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { 379 if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
380 enum lru_list lru = page_lru_base_type(page); 380 enum lru_list lru = page_lru_base_type(page);
381 list_move_tail(&page->lru, &lruvec->lists[lru]); 381 list_move_tail(&page->lru, &lruvec->lists[lru]);
382 (*pgmoved)++; 382 (*pgmoved)++;
383 } 383 }
384 } 384 }
385 385
386 /* 386 /*
387 * pagevec_move_tail() must be called with IRQ disabled. 387 * pagevec_move_tail() must be called with IRQ disabled.
388 * Otherwise this may cause nasty races. 388 * Otherwise this may cause nasty races.
389 */ 389 */
390 static void pagevec_move_tail(struct pagevec *pvec) 390 static void pagevec_move_tail(struct pagevec *pvec)
391 { 391 {
392 int pgmoved = 0; 392 int pgmoved = 0;
393 393
394 pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved); 394 pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved);
395 __count_vm_events(PGROTATED, pgmoved); 395 __count_vm_events(PGROTATED, pgmoved);
396 } 396 }
397 397
398 /* 398 /*
399 * Writeback is about to end against a page which has been marked for immediate 399 * Writeback is about to end against a page which has been marked for immediate
400 * reclaim. If it still appears to be reclaimable, move it to the tail of the 400 * reclaim. If it still appears to be reclaimable, move it to the tail of the
401 * inactive list. 401 * inactive list.
402 */ 402 */
403 void rotate_reclaimable_page(struct page *page) 403 void rotate_reclaimable_page(struct page *page)
404 { 404 {
405 if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) && 405 if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
406 !PageUnevictable(page) && PageLRU(page)) { 406 !PageUnevictable(page) && PageLRU(page)) {
407 struct pagevec *pvec; 407 struct pagevec *pvec;
408 unsigned long flags; 408 unsigned long flags;
409 409
410 page_cache_get(page); 410 page_cache_get(page);
411 local_irq_save(flags); 411 local_irq_save(flags);
412 pvec = &__get_cpu_var(lru_rotate_pvecs); 412 pvec = &__get_cpu_var(lru_rotate_pvecs);
413 if (!pagevec_add(pvec, page)) 413 if (!pagevec_add(pvec, page))
414 pagevec_move_tail(pvec); 414 pagevec_move_tail(pvec);
415 local_irq_restore(flags); 415 local_irq_restore(flags);
416 } 416 }
417 } 417 }
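For context, a hedged sketch of the kind of writeback-completion path that feeds this helper: reclaim sets PG_reclaim on a page it wants back quickly, and when writeback finishes the completion code rotates it (end_page_writeback() does something along these lines; the helper below is illustrative):

	static void example_writeback_done(struct page *page)
	{
		/* PG_reclaim set by reclaim means "rotate me when writeback ends" */
		if (TestClearPageReclaim(page))
			rotate_reclaimable_page(page);
	}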
418 418
419 static void update_page_reclaim_stat(struct lruvec *lruvec, 419 static void update_page_reclaim_stat(struct lruvec *lruvec,
420 int file, int rotated) 420 int file, int rotated)
421 { 421 {
422 struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; 422 struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
423 423
424 reclaim_stat->recent_scanned[file]++; 424 reclaim_stat->recent_scanned[file]++;
425 if (rotated) 425 if (rotated)
426 reclaim_stat->recent_rotated[file]++; 426 reclaim_stat->recent_rotated[file]++;
427 } 427 }
428 428
429 static void __activate_page(struct page *page, struct lruvec *lruvec, 429 static void __activate_page(struct page *page, struct lruvec *lruvec,
430 void *arg) 430 void *arg)
431 { 431 {
432 if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { 432 if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
433 int file = page_is_file_cache(page); 433 int file = page_is_file_cache(page);
434 int lru = page_lru_base_type(page); 434 int lru = page_lru_base_type(page);
435 435
436 del_page_from_lru_list(page, lruvec, lru); 436 del_page_from_lru_list(page, lruvec, lru);
437 SetPageActive(page); 437 SetPageActive(page);
438 lru += LRU_ACTIVE; 438 lru += LRU_ACTIVE;
439 add_page_to_lru_list(page, lruvec, lru); 439 add_page_to_lru_list(page, lruvec, lru);
440 trace_mm_lru_activate(page, page_to_pfn(page)); 440 trace_mm_lru_activate(page, page_to_pfn(page));
441 441
442 __count_vm_event(PGACTIVATE); 442 __count_vm_event(PGACTIVATE);
443 update_page_reclaim_stat(lruvec, file, 1); 443 update_page_reclaim_stat(lruvec, file, 1);
444 } 444 }
445 } 445 }
446 446
447 #ifdef CONFIG_SMP 447 #ifdef CONFIG_SMP
448 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs); 448 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
449 449
450 static void activate_page_drain(int cpu) 450 static void activate_page_drain(int cpu)
451 { 451 {
452 struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu); 452 struct pagevec *pvec = &per_cpu(activate_page_pvecs, cpu);
453 453
454 if (pagevec_count(pvec)) 454 if (pagevec_count(pvec))
455 pagevec_lru_move_fn(pvec, __activate_page, NULL); 455 pagevec_lru_move_fn(pvec, __activate_page, NULL);
456 } 456 }
457 457
458 static bool need_activate_page_drain(int cpu) 458 static bool need_activate_page_drain(int cpu)
459 { 459 {
460 return pagevec_count(&per_cpu(activate_page_pvecs, cpu)) != 0; 460 return pagevec_count(&per_cpu(activate_page_pvecs, cpu)) != 0;
461 } 461 }
462 462
463 void activate_page(struct page *page) 463 void activate_page(struct page *page)
464 { 464 {
465 if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { 465 if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
466 struct pagevec *pvec = &get_cpu_var(activate_page_pvecs); 466 struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);
467 467
468 page_cache_get(page); 468 page_cache_get(page);
469 if (!pagevec_add(pvec, page)) 469 if (!pagevec_add(pvec, page))
470 pagevec_lru_move_fn(pvec, __activate_page, NULL); 470 pagevec_lru_move_fn(pvec, __activate_page, NULL);
471 put_cpu_var(activate_page_pvecs); 471 put_cpu_var(activate_page_pvecs);
472 } 472 }
473 } 473 }
474 474
475 #else 475 #else
476 static inline void activate_page_drain(int cpu) 476 static inline void activate_page_drain(int cpu)
477 { 477 {
478 } 478 }
479 479
480 static bool need_activate_page_drain(int cpu) 480 static bool need_activate_page_drain(int cpu)
481 { 481 {
482 return false; 482 return false;
483 } 483 }
484 484
485 void activate_page(struct page *page) 485 void activate_page(struct page *page)
486 { 486 {
487 struct zone *zone = page_zone(page); 487 struct zone *zone = page_zone(page);
488 488
489 spin_lock_irq(&zone->lru_lock); 489 spin_lock_irq(&zone->lru_lock);
490 __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL); 490 __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
491 spin_unlock_irq(&zone->lru_lock); 491 spin_unlock_irq(&zone->lru_lock);
492 } 492 }
493 #endif 493 #endif
494 494
495 static void __lru_cache_activate_page(struct page *page) 495 static void __lru_cache_activate_page(struct page *page)
496 { 496 {
497 struct pagevec *pvec = &get_cpu_var(lru_add_pvec); 497 struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
498 int i; 498 int i;
499 499
500 /* 500 /*
501 * Search backwards on the optimistic assumption that the page being 501 * Search backwards on the optimistic assumption that the page being
502 * activated has just been added to this pagevec. Note that only 502 * activated has just been added to this pagevec. Note that only
503 * the local pagevec is examined as a !PageLRU page could be in the 503 * the local pagevec is examined as a !PageLRU page could be in the
504 * process of being released, reclaimed, migrated or on a remote 504 * process of being released, reclaimed, migrated or on a remote
505 * pagevec that is currently being drained. Furthermore, marking 505 * pagevec that is currently being drained. Furthermore, marking
506 * a remote pagevec's page PageActive potentially hits a race where 506 * a remote pagevec's page PageActive potentially hits a race where
507 * a page is marked PageActive just after it is added to the inactive 507 * a page is marked PageActive just after it is added to the inactive
508 * list causing accounting errors and BUG_ON checks to trigger. 508 * list causing accounting errors and BUG_ON checks to trigger.
509 */ 509 */
510 for (i = pagevec_count(pvec) - 1; i >= 0; i--) { 510 for (i = pagevec_count(pvec) - 1; i >= 0; i--) {
511 struct page *pagevec_page = pvec->pages[i]; 511 struct page *pagevec_page = pvec->pages[i];
512 512
513 if (pagevec_page == page) { 513 if (pagevec_page == page) {
514 SetPageActive(page); 514 SetPageActive(page);
515 break; 515 break;
516 } 516 }
517 } 517 }
518 518
519 put_cpu_var(lru_add_pvec); 519 put_cpu_var(lru_add_pvec);
520 } 520 }
521 521
522 /* 522 /*
523 * Mark a page as having seen activity. 523 * Mark a page as having seen activity.
524 * 524 *
525 * inactive,unreferenced -> inactive,referenced 525 * inactive,unreferenced -> inactive,referenced
526 * inactive,referenced -> active,unreferenced 526 * inactive,referenced -> active,unreferenced
527 * active,unreferenced -> active,referenced 527 * active,unreferenced -> active,referenced
528 */ 528 */
529 void mark_page_accessed(struct page *page) 529 void mark_page_accessed(struct page *page)
530 { 530 {
531 if (!PageActive(page) && !PageUnevictable(page) && 531 if (!PageActive(page) && !PageUnevictable(page) &&
532 PageReferenced(page)) { 532 PageReferenced(page)) {
533 533
534 /* 534 /*
535 * If the page is on the LRU, queue it for activation via 535 * If the page is on the LRU, queue it for activation via
536 * activate_page_pvecs. Otherwise, assume the page is on a 536 * activate_page_pvecs. Otherwise, assume the page is on a
537 * pagevec, mark it active and it'll be moved to the active 537 * pagevec, mark it active and it'll be moved to the active
538 * LRU on the next drain. 538 * LRU on the next drain.
539 */ 539 */
540 if (PageLRU(page)) 540 if (PageLRU(page))
541 activate_page(page); 541 activate_page(page);
542 else 542 else
543 __lru_cache_activate_page(page); 543 __lru_cache_activate_page(page);
544 ClearPageReferenced(page); 544 ClearPageReferenced(page);
545 } else if (!PageReferenced(page)) { 545 } else if (!PageReferenced(page)) {
546 SetPageReferenced(page); 546 SetPageReferenced(page);
547 } 547 }
548 } 548 }
549 EXPORT_SYMBOL(mark_page_accessed); 549 EXPORT_SYMBOL(mark_page_accessed);
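To make the ladder above concrete, a tiny illustrative sequence (the helper is hypothetical): two accesses take an inactive, unreferenced page all the way to a pending activation.

	static void example_touch_twice(struct page *page)
	{
		mark_page_accessed(page);	/* inactive,unreferenced -> inactive,referenced */
		mark_page_accessed(page);	/* inactive,referenced   -> active,unreferenced  */
	}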
550 550
551 /*
552 * Used in place of mark_page_accessed() on a page that is not yet visible
553 * to others, while it is still safe to use non-atomic bit operations.
554 */
555 void init_page_accessed(struct page *page)
556 {
557 if (!PageReferenced(page))
558 __SetPageReferenced(page);
559 }
560 EXPORT_SYMBOL(init_page_accessed);
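A minimal sketch of the intended calling pattern described in the changelog; the helper name and gfp handling are illustrative. The non-atomic init_page_accessed() is only legal while the page is still private to its allocator; once the page has been published to the page cache, only the atomic mark_page_accessed() may be used.

	static struct page *example_grab_page(struct address_space *mapping,
					      pgoff_t index, gfp_t gfp)
	{
		struct page *page = __page_cache_alloc(gfp);

		if (!page)
			return NULL;

		/* page not yet visible: non-atomic PG_referenced is safe */
		init_page_accessed(page);

		if (add_to_page_cache_lru(page, mapping, index, gfp)) {
			page_cache_release(page);
			return NULL;
		}
		/* now visible: use mark_page_accessed() from here on */
		return page;
	}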
561
551 static void __lru_cache_add(struct page *page) 562 static void __lru_cache_add(struct page *page)
552 { 563 {
553 struct pagevec *pvec = &get_cpu_var(lru_add_pvec); 564 struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
554 565
555 page_cache_get(page); 566 page_cache_get(page);
556 if (!pagevec_space(pvec)) 567 if (!pagevec_space(pvec))
557 __pagevec_lru_add(pvec); 568 __pagevec_lru_add(pvec);
558 pagevec_add(pvec, page); 569 pagevec_add(pvec, page);
559 put_cpu_var(lru_add_pvec); 570 put_cpu_var(lru_add_pvec);
560 } 571 }
561 572
562 /** 573 /**
563 * lru_cache_add_anon - add an anonymous page to the LRU lists 574 * lru_cache_add_anon - add an anonymous page to the LRU lists
564 * @page: the page to add 575 * @page: the page to add
565 */ 576 */
566 void lru_cache_add_anon(struct page *page) 577 void lru_cache_add_anon(struct page *page)
567 { 578 {
568 if (PageActive(page)) 579 if (PageActive(page))
569 ClearPageActive(page); 580 ClearPageActive(page);
570 __lru_cache_add(page); 581 __lru_cache_add(page);
571 } 582 }
572 583
573 void lru_cache_add_file(struct page *page) 584 void lru_cache_add_file(struct page *page)
574 { 585 {
575 if (PageActive(page)) 586 if (PageActive(page))
576 ClearPageActive(page); 587 ClearPageActive(page);
577 __lru_cache_add(page); 588 __lru_cache_add(page);
578 } 589 }
579 EXPORT_SYMBOL(lru_cache_add_file); 590 EXPORT_SYMBOL(lru_cache_add_file);
580 591
581 /** 592 /**
582 * lru_cache_add - add a page to a page list 593 * lru_cache_add - add a page to a page list
583 * @page: the page to be added to the LRU. 594 * @page: the page to be added to the LRU.
584 * 595 *
585 * Queue the page for addition to the LRU via pagevec. The decision on whether 596 * Queue the page for addition to the LRU via pagevec. The decision on whether
586 * to add the page to the [in]active [file|anon] list is deferred until the 597 * to add the page to the [in]active [file|anon] list is deferred until the
587 * pagevec is drained. This gives the caller of lru_cache_add() a chance to 598 * pagevec is drained. This gives the caller of lru_cache_add() a chance to
588 * have the page added to the active list using mark_page_accessed(). 599 * have the page added to the active list using mark_page_accessed().
589 */ 600 */
590 void lru_cache_add(struct page *page) 601 void lru_cache_add(struct page *page)
591 { 602 {
592 VM_BUG_ON(PageActive(page) && PageUnevictable(page)); 603 VM_BUG_ON(PageActive(page) && PageUnevictable(page));
593 VM_BUG_ON(PageLRU(page)); 604 VM_BUG_ON(PageLRU(page));
594 __lru_cache_add(page); 605 __lru_cache_add(page);
595 } 606 }
596 607
597 /** 608 /**
598 * add_page_to_unevictable_list - add a page to the unevictable list 609 * add_page_to_unevictable_list - add a page to the unevictable list
599 * @page: the page to be added to the unevictable list 610 * @page: the page to be added to the unevictable list
600 * 611 *
601 * Add page directly to its zone's unevictable list. To avoid races with 612 * Add page directly to its zone's unevictable list. To avoid races with
602 * tasks that might be making the page evictable, through eg. munlock, 613 * tasks that might be making the page evictable, through eg. munlock,
603 * munmap or exit, while it's not on the lru, we want to add the page 614 * munmap or exit, while it's not on the lru, we want to add the page
604 * while it's locked or otherwise "invisible" to other tasks. This is 615 * while it's locked or otherwise "invisible" to other tasks. This is
605 * difficult to do when using the pagevec cache, so bypass that. 616 * difficult to do when using the pagevec cache, so bypass that.
606 */ 617 */
607 void add_page_to_unevictable_list(struct page *page) 618 void add_page_to_unevictable_list(struct page *page)
608 { 619 {
609 struct zone *zone = page_zone(page); 620 struct zone *zone = page_zone(page);
610 struct lruvec *lruvec; 621 struct lruvec *lruvec;
611 622
612 spin_lock_irq(&zone->lru_lock); 623 spin_lock_irq(&zone->lru_lock);
613 lruvec = mem_cgroup_page_lruvec(page, zone); 624 lruvec = mem_cgroup_page_lruvec(page, zone);
614 ClearPageActive(page); 625 ClearPageActive(page);
615 SetPageUnevictable(page); 626 SetPageUnevictable(page);
616 SetPageLRU(page); 627 SetPageLRU(page);
617 add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE); 628 add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
618 spin_unlock_irq(&zone->lru_lock); 629 spin_unlock_irq(&zone->lru_lock);
619 } 630 }
620 631
621 /* 632 /*
622 * If the page can not be invalidated, it is moved to the 633 * If the page can not be invalidated, it is moved to the
623 * inactive list to speed up its reclaim. It is moved to the 634 * inactive list to speed up its reclaim. It is moved to the
624 * head of the list, rather than the tail, to give the flusher 635 * head of the list, rather than the tail, to give the flusher
625 * threads some time to write it out, as this is much more 636 * threads some time to write it out, as this is much more
626 * effective than the single-page writeout from reclaim. 637 * effective than the single-page writeout from reclaim.
627 * 638 *
628 * If the page isn't mapped and is dirty or under writeback, it can be 639 * If the page isn't mapped and is dirty or under writeback, it can be
629 * reclaimed asap by setting PG_reclaim. 640 * reclaimed asap by setting PG_reclaim.
630 * 641 *
631 * 1. active, mapped page -> none 642 * 1. active, mapped page -> none
632 * 2. active, dirty/writeback page -> inactive, head, PG_reclaim 643 * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
633 * 3. inactive, mapped page -> none 644 * 3. inactive, mapped page -> none
634 * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim 645 * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
635 * 5. inactive, clean -> inactive, tail 646 * 5. inactive, clean -> inactive, tail
636 * 6. Others -> none 647 * 6. Others -> none
637 * 648 *
638 * In case 4 the page is moved to the head of the inactive list because 649 * In case 4 the page is moved to the head of the inactive list because
639 * the VM expects the flusher threads to write it out, which is much more 650 * the VM expects the flusher threads to write it out, which is much more
640 * effective than the single-page writeout from reclaim. 651 * effective than the single-page writeout from reclaim.
641 */ 652 */
642 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec, 653 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
643 void *arg) 654 void *arg)
644 { 655 {
645 int lru, file; 656 int lru, file;
646 bool active; 657 bool active;
647 658
648 if (!PageLRU(page)) 659 if (!PageLRU(page))
649 return; 660 return;
650 661
651 if (PageUnevictable(page)) 662 if (PageUnevictable(page))
652 return; 663 return;
653 664
654 /* Some processes are using the page */ 665 /* Some processes are using the page */
655 if (page_mapped(page)) 666 if (page_mapped(page))
656 return; 667 return;
657 668
658 active = PageActive(page); 669 active = PageActive(page);
659 file = page_is_file_cache(page); 670 file = page_is_file_cache(page);
660 lru = page_lru_base_type(page); 671 lru = page_lru_base_type(page);
661 672
662 del_page_from_lru_list(page, lruvec, lru + active); 673 del_page_from_lru_list(page, lruvec, lru + active);
663 ClearPageActive(page); 674 ClearPageActive(page);
664 ClearPageReferenced(page); 675 ClearPageReferenced(page);
665 add_page_to_lru_list(page, lruvec, lru); 676 add_page_to_lru_list(page, lruvec, lru);
666 677
667 if (PageWriteback(page) || PageDirty(page)) { 678 if (PageWriteback(page) || PageDirty(page)) {
668 /* 679 /*
669 * PG_reclaim could be raced with end_page_writeback 680 * PG_reclaim could be raced with end_page_writeback
670 * It can make readahead confusing. But race window 681 * It can make readahead confusing. But race window
671 * is _really_ small and it's non-critical problem. 682 * is _really_ small and it's non-critical problem.
672 */ 683 */
673 SetPageReclaim(page); 684 SetPageReclaim(page);
674 } else { 685 } else {
675 /* 686 /*
676 * The page's writeback ended while it was on the pagevec, 687 * The page's writeback ended while it was on the pagevec,
677 * so move the page to the tail of the inactive list. 688 * so move the page to the tail of the inactive list.
678 */ 689 */
679 list_move_tail(&page->lru, &lruvec->lists[lru]); 690 list_move_tail(&page->lru, &lruvec->lists[lru]);
680 __count_vm_event(PGROTATED); 691 __count_vm_event(PGROTATED);
681 } 692 }
682 693
683 if (active) 694 if (active)
684 __count_vm_event(PGDEACTIVATE); 695 __count_vm_event(PGDEACTIVATE);
685 update_page_reclaim_stat(lruvec, file, 0); 696 update_page_reclaim_stat(lruvec, file, 0);
686 } 697 }
687 698
688 /* 699 /*
689 * Drain pages out of the cpu's pagevecs. 700 * Drain pages out of the cpu's pagevecs.
690 * Either "cpu" is the current CPU, and preemption has already been 701 * Either "cpu" is the current CPU, and preemption has already been
691 * disabled; or "cpu" is being hot-unplugged, and is already dead. 702 * disabled; or "cpu" is being hot-unplugged, and is already dead.
692 */ 703 */
693 void lru_add_drain_cpu(int cpu) 704 void lru_add_drain_cpu(int cpu)
694 { 705 {
695 struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu); 706 struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu);
696 707
697 if (pagevec_count(pvec)) 708 if (pagevec_count(pvec))
698 __pagevec_lru_add(pvec); 709 __pagevec_lru_add(pvec);
699 710
700 pvec = &per_cpu(lru_rotate_pvecs, cpu); 711 pvec = &per_cpu(lru_rotate_pvecs, cpu);
701 if (pagevec_count(pvec)) { 712 if (pagevec_count(pvec)) {
702 unsigned long flags; 713 unsigned long flags;
703 714
704 /* No harm done if a racing interrupt already did this */ 715 /* No harm done if a racing interrupt already did this */
705 local_irq_save(flags); 716 local_irq_save(flags);
706 pagevec_move_tail(pvec); 717 pagevec_move_tail(pvec);
707 local_irq_restore(flags); 718 local_irq_restore(flags);
708 } 719 }
709 720
710 pvec = &per_cpu(lru_deactivate_pvecs, cpu); 721 pvec = &per_cpu(lru_deactivate_pvecs, cpu);
711 if (pagevec_count(pvec)) 722 if (pagevec_count(pvec))
712 pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); 723 pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
713 724
714 activate_page_drain(cpu); 725 activate_page_drain(cpu);
715 } 726 }
716 727
717 /** 728 /**
718 * deactivate_page - forcefully deactivate a page 729 * deactivate_page - forcefully deactivate a page
719 * @page: page to deactivate 730 * @page: page to deactivate
720 * 731 *
721 * This function hints the VM that @page is a good reclaim candidate, 732 * This function hints the VM that @page is a good reclaim candidate,
722 * for example if its invalidation fails due to the page being dirty 733 * for example if its invalidation fails due to the page being dirty
723 * or under writeback. 734 * or under writeback.
724 */ 735 */
725 void deactivate_page(struct page *page) 736 void deactivate_page(struct page *page)
726 { 737 {
727 /* 738 /*
728 * In a workload with many unevictable pages (such as mprotect'd memory), 739 * In a workload with many unevictable pages (such as mprotect'd memory),
729 * deactivating unevictable pages to accelerate reclaim is pointless. 740 * deactivating unevictable pages to accelerate reclaim is pointless.
730 */ 741 */
731 if (PageUnevictable(page)) 742 if (PageUnevictable(page))
732 return; 743 return;
733 744
734 if (likely(get_page_unless_zero(page))) { 745 if (likely(get_page_unless_zero(page))) {
735 struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs); 746 struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
736 747
737 if (!pagevec_add(pvec, page)) 748 if (!pagevec_add(pvec, page))
738 pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL); 749 pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
739 put_cpu_var(lru_deactivate_pvecs); 750 put_cpu_var(lru_deactivate_pvecs);
740 } 751 }
741 } 752 }
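A hedged sketch of the typical caller, modelled on the invalidation path: if an invalidation attempt fails because the page is dirty or under writeback, the page is deactivated instead. invalidate_inode_page() is the existing helper from mm/truncate.c; the wrapper itself is illustrative.

	static void example_try_to_drop(struct page *page)
	{
		if (!trylock_page(page))
			return;
		if (!invalidate_inode_page(page))
			deactivate_page(page);	/* hint: good reclaim candidate */
		unlock_page(page);
	}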
742 753
743 void lru_add_drain(void) 754 void lru_add_drain(void)
744 { 755 {
745 lru_add_drain_cpu(get_cpu()); 756 lru_add_drain_cpu(get_cpu());
746 put_cpu(); 757 put_cpu();
747 } 758 }
748 759
749 static void lru_add_drain_per_cpu(struct work_struct *dummy) 760 static void lru_add_drain_per_cpu(struct work_struct *dummy)
750 { 761 {
751 lru_add_drain(); 762 lru_add_drain();
752 } 763 }
753 764
754 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work); 765 static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
755 766
756 void lru_add_drain_all(void) 767 void lru_add_drain_all(void)
757 { 768 {
758 static DEFINE_MUTEX(lock); 769 static DEFINE_MUTEX(lock);
759 static struct cpumask has_work; 770 static struct cpumask has_work;
760 int cpu; 771 int cpu;
761 772
762 mutex_lock(&lock); 773 mutex_lock(&lock);
763 get_online_cpus(); 774 get_online_cpus();
764 cpumask_clear(&has_work); 775 cpumask_clear(&has_work);
765 776
766 for_each_online_cpu(cpu) { 777 for_each_online_cpu(cpu) {
767 struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); 778 struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
768 779
769 if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) || 780 if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
770 pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) || 781 pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
771 pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) || 782 pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
772 need_activate_page_drain(cpu)) { 783 need_activate_page_drain(cpu)) {
773 INIT_WORK(work, lru_add_drain_per_cpu); 784 INIT_WORK(work, lru_add_drain_per_cpu);
774 schedule_work_on(cpu, work); 785 schedule_work_on(cpu, work);
775 cpumask_set_cpu(cpu, &has_work); 786 cpumask_set_cpu(cpu, &has_work);
776 } 787 }
777 } 788 }
778 789
779 for_each_cpu(cpu, &has_work) 790 for_each_cpu(cpu, &has_work)
780 flush_work(&per_cpu(lru_add_drain_work, cpu)); 791 flush_work(&per_cpu(lru_add_drain_work, cpu));
781 792
782 put_online_cpus(); 793 put_online_cpus();
783 mutex_unlock(&lock); 794 mutex_unlock(&lock);
784 } 795 }
785 796
786 /* 797 /*
787 * Batched page_cache_release(). Decrement the reference count on all the 798 * Batched page_cache_release(). Decrement the reference count on all the
788 * passed pages. If it fell to zero then remove the page from the LRU and 799 * passed pages. If it fell to zero then remove the page from the LRU and
789 * free it. 800 * free it.
790 * 801 *
791 * Avoid taking zone->lru_lock if possible, but if it is taken, retain it 802 * Avoid taking zone->lru_lock if possible, but if it is taken, retain it
792 * for the remainder of the operation. 803 * for the remainder of the operation.
793 * 804 *
794 * The locking in this function is against shrink_inactive_list(): we recheck 805 * The locking in this function is against shrink_inactive_list(): we recheck
795 * the page count inside the lock to see whether shrink_inactive_list() 806 * the page count inside the lock to see whether shrink_inactive_list()
796 * grabbed the page via the LRU. If it did, give up: shrink_inactive_list() 807 * grabbed the page via the LRU. If it did, give up: shrink_inactive_list()
797 * will free it. 808 * will free it.
798 */ 809 */
799 void release_pages(struct page **pages, int nr, bool cold) 810 void release_pages(struct page **pages, int nr, bool cold)
800 { 811 {
801 int i; 812 int i;
802 LIST_HEAD(pages_to_free); 813 LIST_HEAD(pages_to_free);
803 struct zone *zone = NULL; 814 struct zone *zone = NULL;
804 struct lruvec *lruvec; 815 struct lruvec *lruvec;
805 unsigned long uninitialized_var(flags); 816 unsigned long uninitialized_var(flags);
806 817
807 for (i = 0; i < nr; i++) { 818 for (i = 0; i < nr; i++) {
808 struct page *page = pages[i]; 819 struct page *page = pages[i];
809 820
810 if (unlikely(PageCompound(page))) { 821 if (unlikely(PageCompound(page))) {
811 if (zone) { 822 if (zone) {
812 spin_unlock_irqrestore(&zone->lru_lock, flags); 823 spin_unlock_irqrestore(&zone->lru_lock, flags);
813 zone = NULL; 824 zone = NULL;
814 } 825 }
815 put_compound_page(page); 826 put_compound_page(page);
816 continue; 827 continue;
817 } 828 }
818 829
819 if (!put_page_testzero(page)) 830 if (!put_page_testzero(page))
820 continue; 831 continue;
821 832
822 if (PageLRU(page)) { 833 if (PageLRU(page)) {
823 struct zone *pagezone = page_zone(page); 834 struct zone *pagezone = page_zone(page);
824 835
825 if (pagezone != zone) { 836 if (pagezone != zone) {
826 if (zone) 837 if (zone)
827 spin_unlock_irqrestore(&zone->lru_lock, 838 spin_unlock_irqrestore(&zone->lru_lock,
828 flags); 839 flags);
829 zone = pagezone; 840 zone = pagezone;
830 spin_lock_irqsave(&zone->lru_lock, flags); 841 spin_lock_irqsave(&zone->lru_lock, flags);
831 } 842 }
832 843
833 lruvec = mem_cgroup_page_lruvec(page, zone); 844 lruvec = mem_cgroup_page_lruvec(page, zone);
834 VM_BUG_ON(!PageLRU(page)); 845 VM_BUG_ON(!PageLRU(page));
835 __ClearPageLRU(page); 846 __ClearPageLRU(page);
836 del_page_from_lru_list(page, lruvec, page_off_lru(page)); 847 del_page_from_lru_list(page, lruvec, page_off_lru(page));
837 } 848 }
838 849
839 /* Clear Active bit in case of parallel mark_page_accessed */ 850 /* Clear Active bit in case of parallel mark_page_accessed */
840 __ClearPageActive(page); 851 __ClearPageActive(page);
841 852
842 list_add(&page->lru, &pages_to_free); 853 list_add(&page->lru, &pages_to_free);
843 } 854 }
844 if (zone) 855 if (zone)
845 spin_unlock_irqrestore(&zone->lru_lock, flags); 856 spin_unlock_irqrestore(&zone->lru_lock, flags);
846 857
847 free_hot_cold_page_list(&pages_to_free, cold); 858 free_hot_cold_page_list(&pages_to_free, cold);
848 } 859 }
849 EXPORT_SYMBOL(release_pages); 860 EXPORT_SYMBOL(release_pages);
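As a usage note, a caller holding an array of page references can drop them in one pass; the zone lru_lock is then taken once per run of same-zone pages rather than once per page. A minimal sketch (the wrapper name is illustrative):

	static void example_put_batch(struct page **pages, int nr)
	{
		/* equivalent to nr page_cache_release() calls, but batched */
		release_pages(pages, nr, false);	/* false: treat the pages as cache-hot */
	}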
850 861
851 /* 862 /*
852 * The pages which we're about to release may be in the deferred lru-addition 863 * The pages which we're about to release may be in the deferred lru-addition
853 * queues. That would prevent them from really being freed right now. That's 864 * queues. That would prevent them from really being freed right now. That's
854 * OK from a correctness point of view but is inefficient - those pages may be 865 * OK from a correctness point of view but is inefficient - those pages may be
855 * cache-warm and we want to give them back to the page allocator ASAP. 866 * cache-warm and we want to give them back to the page allocator ASAP.
856 * 867 *
857 * So __pagevec_release() will drain those queues here. __pagevec_lru_add() 868 * So __pagevec_release() will drain those queues here. __pagevec_lru_add()
858 * and __pagevec_lru_add_active() call release_pages() directly to avoid 869 * and __pagevec_lru_add_active() call release_pages() directly to avoid
859 * mutual recursion. 870 * mutual recursion.
860 */ 871 */
861 void __pagevec_release(struct pagevec *pvec) 872 void __pagevec_release(struct pagevec *pvec)
862 { 873 {
863 lru_add_drain(); 874 lru_add_drain();
864 release_pages(pvec->pages, pagevec_count(pvec), pvec->cold); 875 release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
865 pagevec_reinit(pvec); 876 pagevec_reinit(pvec);
866 } 877 }
867 EXPORT_SYMBOL(__pagevec_release); 878 EXPORT_SYMBOL(__pagevec_release);
868 879
869 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 880 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
870 /* used by __split_huge_page_refcount() */ 881 /* used by __split_huge_page_refcount() */
871 void lru_add_page_tail(struct page *page, struct page *page_tail, 882 void lru_add_page_tail(struct page *page, struct page *page_tail,
872 struct lruvec *lruvec, struct list_head *list) 883 struct lruvec *lruvec, struct list_head *list)
873 { 884 {
874 const int file = 0; 885 const int file = 0;
875 886
876 VM_BUG_ON(!PageHead(page)); 887 VM_BUG_ON(!PageHead(page));
877 VM_BUG_ON(PageCompound(page_tail)); 888 VM_BUG_ON(PageCompound(page_tail));
878 VM_BUG_ON(PageLRU(page_tail)); 889 VM_BUG_ON(PageLRU(page_tail));
879 VM_BUG_ON(NR_CPUS != 1 && 890 VM_BUG_ON(NR_CPUS != 1 &&
880 !spin_is_locked(&lruvec_zone(lruvec)->lru_lock)); 891 !spin_is_locked(&lruvec_zone(lruvec)->lru_lock));
881 892
882 if (!list) 893 if (!list)
883 SetPageLRU(page_tail); 894 SetPageLRU(page_tail);
884 895
885 if (likely(PageLRU(page))) 896 if (likely(PageLRU(page)))
886 list_add_tail(&page_tail->lru, &page->lru); 897 list_add_tail(&page_tail->lru, &page->lru);
887 else if (list) { 898 else if (list) {
888 /* page reclaim is reclaiming a huge page */ 899 /* page reclaim is reclaiming a huge page */
889 get_page(page_tail); 900 get_page(page_tail);
890 list_add_tail(&page_tail->lru, list); 901 list_add_tail(&page_tail->lru, list);
891 } else { 902 } else {
892 struct list_head *list_head; 903 struct list_head *list_head;
893 /* 904 /*
894 * Head page has not yet been counted, as an hpage, 905 * Head page has not yet been counted, as an hpage,
895 * so we must account for each subpage individually. 906 * so we must account for each subpage individually.
896 * 907 *
897 * Use the standard add function to put page_tail on the list, 908 * Use the standard add function to put page_tail on the list,
898 * but then correct its position so they all end up in order. 909 * but then correct its position so they all end up in order.
899 */ 910 */
900 add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail)); 911 add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail));
901 list_head = page_tail->lru.prev; 912 list_head = page_tail->lru.prev;
902 list_move_tail(&page_tail->lru, list_head); 913 list_move_tail(&page_tail->lru, list_head);
903 } 914 }
904 915
905 if (!PageUnevictable(page)) 916 if (!PageUnevictable(page))
906 update_page_reclaim_stat(lruvec, file, PageActive(page_tail)); 917 update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
907 } 918 }
908 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 919 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
909 920
910 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, 921 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
911 void *arg) 922 void *arg)
912 { 923 {
913 int file = page_is_file_cache(page); 924 int file = page_is_file_cache(page);
914 int active = PageActive(page); 925 int active = PageActive(page);
915 enum lru_list lru = page_lru(page); 926 enum lru_list lru = page_lru(page);
916 927
917 VM_BUG_ON(PageLRU(page)); 928 VM_BUG_ON(PageLRU(page));
918 929
919 SetPageLRU(page); 930 SetPageLRU(page);
920 add_page_to_lru_list(page, lruvec, lru); 931 add_page_to_lru_list(page, lruvec, lru);
921 update_page_reclaim_stat(lruvec, file, active); 932 update_page_reclaim_stat(lruvec, file, active);
922 trace_mm_lru_insertion(page, page_to_pfn(page), lru, trace_pagemap_flags(page)); 933 trace_mm_lru_insertion(page, page_to_pfn(page), lru, trace_pagemap_flags(page));
923 } 934 }
924 935
925 /* 936 /*
926 * Add the passed pages to the LRU, then drop the caller's refcount 937 * Add the passed pages to the LRU, then drop the caller's refcount
927 * on them. Reinitialises the caller's pagevec. 938 * on them. Reinitialises the caller's pagevec.
928 */ 939 */
929 void __pagevec_lru_add(struct pagevec *pvec) 940 void __pagevec_lru_add(struct pagevec *pvec)
930 { 941 {
931 pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL); 942 pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
932 } 943 }
933 EXPORT_SYMBOL(__pagevec_lru_add); 944 EXPORT_SYMBOL(__pagevec_lru_add);
934 945
935 /** 946 /**
936 * pagevec_lookup_entries - gang pagecache lookup 947 * pagevec_lookup_entries - gang pagecache lookup
937 * @pvec: Where the resulting entries are placed 948 * @pvec: Where the resulting entries are placed
938 * @mapping: The address_space to search 949 * @mapping: The address_space to search
939 * @start: The starting entry index 950 * @start: The starting entry index
940 * @nr_pages: The maximum number of entries to return 951 * @nr_pages: The maximum number of entries to return
941 * @indices: The cache indices corresponding to the entries in @pvec 952 * @indices: The cache indices corresponding to the entries in @pvec
942 * 953 *
943 * pagevec_lookup_entries() will search for and return a group of up 954 * pagevec_lookup_entries() will search for and return a group of up
944 * to @nr_pages pages and shadow entries in the mapping. All 955 * to @nr_pages pages and shadow entries in the mapping. All
945 * entries are placed in @pvec. pagevec_lookup_entries() takes a 956 * entries are placed in @pvec. pagevec_lookup_entries() takes a
946 * reference against actual pages in @pvec. 957 * reference against actual pages in @pvec.
947 * 958 *
948 * The search returns a group of mapping-contiguous entries with 959 * The search returns a group of mapping-contiguous entries with
949 * ascending indexes. There may be holes in the indices due to 960 * ascending indexes. There may be holes in the indices due to
950 * not-present entries. 961 * not-present entries.
951 * 962 *
952 * pagevec_lookup_entries() returns the number of entries which were 963 * pagevec_lookup_entries() returns the number of entries which were
953 * found. 964 * found.
954 */ 965 */
955 unsigned pagevec_lookup_entries(struct pagevec *pvec, 966 unsigned pagevec_lookup_entries(struct pagevec *pvec,
956 struct address_space *mapping, 967 struct address_space *mapping,
957 pgoff_t start, unsigned nr_pages, 968 pgoff_t start, unsigned nr_pages,
958 pgoff_t *indices) 969 pgoff_t *indices)
959 { 970 {
960 pvec->nr = find_get_entries(mapping, start, nr_pages, 971 pvec->nr = find_get_entries(mapping, start, nr_pages,
961 pvec->pages, indices); 972 pvec->pages, indices);
962 return pagevec_count(pvec); 973 return pagevec_count(pvec);
963 } 974 }
964 975
965 /** 976 /**
966 * pagevec_remove_exceptionals - pagevec exceptionals pruning 977 * pagevec_remove_exceptionals - pagevec exceptionals pruning
967 * @pvec: The pagevec to prune 978 * @pvec: The pagevec to prune
968 * 979 *
969 * pagevec_lookup_entries() fills both pages and exceptional radix 980 * pagevec_lookup_entries() fills both pages and exceptional radix
970 * tree entries into the pagevec. This function prunes all 981 * tree entries into the pagevec. This function prunes all
971 * exceptionals from @pvec without leaving holes, so that it can be 982 * exceptionals from @pvec without leaving holes, so that it can be
972 * passed on to page-only pagevec operations. 983 * passed on to page-only pagevec operations.
973 */ 984 */
974 void pagevec_remove_exceptionals(struct pagevec *pvec) 985 void pagevec_remove_exceptionals(struct pagevec *pvec)
975 { 986 {
976 int i, j; 987 int i, j;
977 988
978 for (i = 0, j = 0; i < pagevec_count(pvec); i++) { 989 for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
979 struct page *page = pvec->pages[i]; 990 struct page *page = pvec->pages[i];
980 if (!radix_tree_exceptional_entry(page)) 991 if (!radix_tree_exceptional_entry(page))
981 pvec->pages[j++] = page; 992 pvec->pages[j++] = page;
982 } 993 }
983 pvec->nr = j; 994 pvec->nr = j;
984 } 995 }
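A minimal sketch of how the two helpers above combine, loosely following the truncate/invalidate pattern; the scanner and its bookkeeping are illustrative. Shadow entries are pruned so that only real pages, which hold references, reach pagevec_release().

	static void example_scan(struct address_space *mapping)
	{
		pgoff_t indices[PAGEVEC_SIZE];
		struct pagevec pvec;
		pgoff_t index = 0;

		pagevec_init(&pvec, 0);
		while (pagevec_lookup_entries(&pvec, mapping, index,
					      PAGEVEC_SIZE, indices)) {
			/* continue after the last entry seen in this batch */
			index = indices[pagevec_count(&pvec) - 1] + 1;

			/* drop shadow entries; only real pages hold a reference */
			pagevec_remove_exceptionals(&pvec);
			pagevec_release(&pvec);
			cond_resched();
		}
	}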
985 996
986 /** 997 /**
987 * pagevec_lookup - gang pagecache lookup 998 * pagevec_lookup - gang pagecache lookup
988 * @pvec: Where the resulting pages are placed 999 * @pvec: Where the resulting pages are placed
989 * @mapping: The address_space to search 1000 * @mapping: The address_space to search
990 * @start: The starting page index 1001 * @start: The starting page index
991 * @nr_pages: The maximum number of pages 1002 * @nr_pages: The maximum number of pages
992 * 1003 *
993 * pagevec_lookup() will search for and return a group of up to @nr_pages pages 1004 * pagevec_lookup() will search for and return a group of up to @nr_pages pages
994 * in the mapping. The pages are placed in @pvec. pagevec_lookup() takes a 1005 * in the mapping. The pages are placed in @pvec. pagevec_lookup() takes a
995 * reference against the pages in @pvec. 1006 * reference against the pages in @pvec.
996 * 1007 *
997 * The search returns a group of mapping-contiguous pages with ascending 1008 * The search returns a group of mapping-contiguous pages with ascending
998 * indexes. There may be holes in the indices due to not-present pages. 1009 * indexes. There may be holes in the indices due to not-present pages.
999 * 1010 *
1000 * pagevec_lookup() returns the number of pages which were found. 1011 * pagevec_lookup() returns the number of pages which were found.
1001 */ 1012 */
1002 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping, 1013 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
1003 pgoff_t start, unsigned nr_pages) 1014 pgoff_t start, unsigned nr_pages)
1004 { 1015 {
1005 pvec->nr = find_get_pages(mapping, start, nr_pages, pvec->pages); 1016 pvec->nr = find_get_pages(mapping, start, nr_pages, pvec->pages);
1006 return pagevec_count(pvec); 1017 return pagevec_count(pvec);
1007 } 1018 }
1008 EXPORT_SYMBOL(pagevec_lookup); 1019 EXPORT_SYMBOL(pagevec_lookup);
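A minimal usage sketch; the walker and example_visit_page() are assumed, not part of this file. Up to PAGEVEC_SIZE pages are gathered per iteration, the index is advanced past the last page seen, and the lookup references are dropped with pagevec_release().

	static void example_walk_mapping(struct address_space *mapping)
	{
		struct pagevec pvec;
		pgoff_t index = 0;
		int i;

		pagevec_init(&pvec, 0);
		while (pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE)) {
			for (i = 0; i < pagevec_count(&pvec); i++) {
				struct page *page = pvec.pages[i];

				index = page->index + 1;
				example_visit_page(page);	/* assumed per-page work */
			}
			pagevec_release(&pvec);	/* drop the lookup references */
			cond_resched();
		}
	}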
1009 1020
1010 unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping, 1021 unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
1011 pgoff_t *index, int tag, unsigned nr_pages) 1022 pgoff_t *index, int tag, unsigned nr_pages)
1012 { 1023 {
1013 pvec->nr = find_get_pages_tag(mapping, index, tag, 1024 pvec->nr = find_get_pages_tag(mapping, index, tag,
1014 nr_pages, pvec->pages); 1025 nr_pages, pvec->pages);
1015 return pagevec_count(pvec); 1026 return pagevec_count(pvec);
1016 } 1027 }
1017 EXPORT_SYMBOL(pagevec_lookup_tag); 1028 EXPORT_SYMBOL(pagevec_lookup_tag);
1018 1029
1019 /* 1030 /*
1020 * Perform any setup for the swap system 1031 * Perform any setup for the swap system
1021 */ 1032 */
1022 void __init swap_setup(void) 1033 void __init swap_setup(void)
1023 { 1034 {
1024 unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT); 1035 unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT);
1025 #ifdef CONFIG_SWAP 1036 #ifdef CONFIG_SWAP
1026 int i; 1037 int i;
1027 1038
1028 bdi_init(swapper_spaces[0].backing_dev_info); 1039 bdi_init(swapper_spaces[0].backing_dev_info);
1029 for (i = 0; i < MAX_SWAPFILES; i++) { 1040 for (i = 0; i < MAX_SWAPFILES; i++) {
1030 spin_lock_init(&swapper_spaces[i].tree_lock); 1041 spin_lock_init(&swapper_spaces[i].tree_lock);
1031 INIT_LIST_HEAD(&swapper_spaces[i].i_mmap_nonlinear); 1042 INIT_LIST_HEAD(&swapper_spaces[i].i_mmap_nonlinear);
1032 } 1043 }
1033 #endif 1044 #endif
1034 1045
1035 /* Use a smaller cluster for small-memory machines */ 1046 /* Use a smaller cluster for small-memory machines */
1036 if (megs < 16) 1047 if (megs < 16)
1037 page_cluster = 2; 1048 page_cluster = 2;
1038 else 1049 else
1039 page_cluster = 3; 1050 page_cluster = 3;
1040 /* 1051 /*
1041 * Right now other parts of the system mean that we 1052 * Right now other parts of the system mean that we
1042 * _really_ don't want to cluster much more 1053 * _really_ don't want to cluster much more
1043 */ 1054 */
1044 } 1055 }
1045 1056