Commit 72c270612bd33192fa836ad0f2939af1ca218292

Authored by Kent Overstreet
1 parent 279afbad4e

bcache: Write out full stripes

Now that we're tracking dirty data per stripe, we can add two
optimizations for raid5/6:

 * If a stripe is already dirty, force writes to that stripe to
   writeback mode - to help build up full stripes of dirty data

 * When flushing dirty data, preferentially write out full stripes first
   if there are any.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
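
The two optimizations above are implemented in the request and writeback code among the files below. As a rough, self-contained sketch of the second one only (invented names and sizes, not the patch's code): model the backing device as an array of per-stripe dirty-sector counts and flush completely dirty stripes before partially dirty ones.

/* Illustrative model only - not code from this commit. */
#include <stdio.h>

#define NR_STRIPES	8
#define STRIPE_SECTORS	16		/* sectors per stripe */

static unsigned stripe_sectors_dirty[NR_STRIPES];

/* Return a fully dirty stripe if one exists, otherwise any dirty stripe,
 * otherwise -1: the "write out full stripes first" preference. */
static int pick_stripe_to_flush(void)
{
	int fallback = -1;

	for (int i = 0; i < NR_STRIPES; i++) {
		if (stripe_sectors_dirty[i] == STRIPE_SECTORS)
			return i;		/* full stripe: no read-modify-write */
		if (stripe_sectors_dirty[i] && fallback < 0)
			fallback = i;
	}
	return fallback;
}

int main(void)
{
	stripe_sectors_dirty[2] = 5;			/* partially dirty */
	stripe_sectors_dirty[6] = STRIPE_SECTORS;	/* completely dirty */

	printf("flush stripe %d first\n", pick_stripe_to_flush());
	return 0;
}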

Showing 9 changed files with 121 additions and 37 deletions

drivers/md/bcache/bcache.h
#ifndef _BCACHE_H
#define _BCACHE_H

/*
 * SOME HIGH LEVEL CODE DOCUMENTATION:
 *
 * Bcache mostly works with cache sets, cache devices, and backing devices.
 *
 * Support for multiple cache devices hasn't quite been finished off yet, but
 * it's about 95% plumbed through. A cache set and its cache devices is sort of
 * like a md raid array and its component devices. Most of the code doesn't care
 * about individual cache devices, the main abstraction is the cache set.
 *
 * Multiple cache devices is intended to give us the ability to mirror dirty
 * cached data and metadata, without mirroring clean cached data.
 *
 * Backing devices are different, in that they have a lifetime independent of a
 * cache set. When you register a newly formatted backing device it'll come up
 * in passthrough mode, and then you can attach and detach a backing device from
 * a cache set at runtime - while it's mounted and in use. Detaching implicitly
 * invalidates any cached data for that backing device.
 *
 * A cache set can have multiple (many) backing devices attached to it.
 *
 * There's also flash only volumes - this is the reason for the distinction
 * between struct cached_dev and struct bcache_device. A flash only volume
 * works much like a bcache device that has a backing device, except the
 * "cached" data is always dirty. The end result is that we get thin
 * provisioning with very little additional code.
 *
 * Flash only volumes work but they're not production ready because the moving
 * garbage collector needs more work. More on that later.
 *
 * BUCKETS/ALLOCATION:
 *
 * Bcache is primarily designed for caching, which means that in normal
 * operation all of our available space will be allocated. Thus, we need an
 * efficient way of deleting things from the cache so we can write new things to
 * it.
 *
 * To do this, we first divide the cache device up into buckets. A bucket is the
 * unit of allocation; they're typically around 1 mb - anywhere from 128k to 2M+
 * works efficiently.
 *
 * Each bucket has a 16 bit priority, and an 8 bit generation associated with
 * it. The gens and priorities for all the buckets are stored contiguously and
 * packed on disk (in a linked list of buckets - aside from the superblock, all
 * of bcache's metadata is stored in buckets).
 *
 * The priority is used to implement an LRU. We reset a bucket's priority when
 * we allocate it or on cache it, and every so often we decrement the priority
 * of each bucket. It could be used to implement something more sophisticated,
 * if anyone ever gets around to it.
 *
 * The generation is used for invalidating buckets. Each pointer also has an 8
 * bit generation embedded in it; for a pointer to be considered valid, its gen
 * must match the gen of the bucket it points into. Thus, to reuse a bucket all
 * we have to do is increment its gen (and write its new gen to disk; we batch
 * this up).
 *
 * Bcache is entirely COW - we never write twice to a bucket, even buckets that
 * contain metadata (including btree nodes).
 *
 * THE BTREE:
 *
 * Bcache is in large part design around the btree.
 *
 * At a high level, the btree is just an index of key -> ptr tuples.
 *
 * Keys represent extents, and thus have a size field. Keys also have a variable
 * number of pointers attached to them (potentially zero, which is handy for
 * invalidating the cache).
 *
 * The key itself is an inode:offset pair. The inode number corresponds to a
 * backing device or a flash only volume. The offset is the ending offset of the
 * extent within the inode - not the starting offset; this makes lookups
 * slightly more convenient.
 *
 * Pointers contain the cache device id, the offset on that device, and an 8 bit
 * generation number. More on the gen later.
 *
 * Index lookups are not fully abstracted - cache lookups in particular are
 * still somewhat mixed in with the btree code, but things are headed in that
 * direction.
 *
 * Updates are fairly well abstracted, though. There are two different ways of
 * updating the btree; insert and replace.
 *
 * BTREE_INSERT will just take a list of keys and insert them into the btree -
 * overwriting (possibly only partially) any extents they overlap with. This is
 * used to update the index after a write.
 *
 * BTREE_REPLACE is really cmpxchg(); it inserts a key into the btree iff it is
 * overwriting a key that matches another given key. This is used for inserting
 * data into the cache after a cache miss, and for background writeback, and for
 * the moving garbage collector.
 *
 * There is no "delete" operation; deleting things from the index is
 * accomplished by either by invalidating pointers (by incrementing a bucket's
 * gen) or by inserting a key with 0 pointers - which will overwrite anything
 * previously present at that location in the index.
 *
 * This means that there are always stale/invalid keys in the btree. They're
 * filtered out by the code that iterates through a btree node, and removed when
 * a btree node is rewritten.
 *
 * BTREE NODES:
 *
 * Our unit of allocation is a bucket, and we we can't arbitrarily allocate and
 * free smaller than a bucket - so, that's how big our btree nodes are.
 *
 * (If buckets are really big we'll only use part of the bucket for a btree node
 * - no less than 1/4th - but a bucket still contains no more than a single
 * btree node. I'd actually like to change this, but for now we rely on the
 * bucket's gen for deleting btree nodes when we rewrite/split a node.)
 *
 * Anyways, btree nodes are big - big enough to be inefficient with a textbook
 * btree implementation.
 *
 * The way this is solved is that btree nodes are internally log structured; we
 * can append new keys to an existing btree node without rewriting it. This
 * means each set of keys we write is sorted, but the node is not.
 *
 * We maintain this log structure in memory - keeping 1Mb of keys sorted would
 * be expensive, and we have to distinguish between the keys we have written and
 * the keys we haven't. So to do a lookup in a btree node, we have to search
 * each sorted set. But we do merge written sets together lazily, so the cost of
 * these extra searches is quite low (normally most of the keys in a btree node
 * will be in one big set, and then there'll be one or two sets that are much
 * smaller).
 *
 * This log structure makes bcache's btree more of a hybrid between a
 * conventional btree and a compacting data structure, with some of the
 * advantages of both.
 *
 * GARBAGE COLLECTION:
 *
 * We can't just invalidate any bucket - it might contain dirty data or
 * metadata. If it once contained dirty data, other writes might overwrite it
 * later, leaving no valid pointers into that bucket in the index.
 *
 * Thus, the primary purpose of garbage collection is to find buckets to reuse.
 * It also counts how much valid data it each bucket currently contains, so that
 * allocation can reuse buckets sooner when they've been mostly overwritten.
 *
 * It also does some things that are really internal to the btree
 * implementation. If a btree node contains pointers that are stale by more than
 * some threshold, it rewrites the btree node to avoid the bucket's generation
 * wrapping around. It also merges adjacent btree nodes if they're empty enough.
 *
 * THE JOURNAL:
 *
 * Bcache's journal is not necessary for consistency; we always strictly
 * order metadata writes so that the btree and everything else is consistent on
 * disk in the event of an unclean shutdown, and in fact bcache had writeback
 * caching (with recovery from unclean shutdown) before journalling was
 * implemented.
 *
 * Rather, the journal is purely a performance optimization; we can't complete a
 * write until we've updated the index on disk, otherwise the cache would be
 * inconsistent in the event of an unclean shutdown. This means that without the
 * journal, on random write workloads we constantly have to update all the leaf
 * nodes in the btree, and those writes will be mostly empty (appending at most
 * a few keys each) - highly inefficient in terms of amount of metadata writes,
 * and it puts more strain on the various btree resorting/compacting code.
 *
 * The journal is just a log of keys we've inserted; on startup we just reinsert
 * all the keys in the open journal entries. That means that when we're updating
 * a node in the btree, we can wait until a 4k block of keys fills up before
 * writing them out.
 *
 * For simplicity, we only journal updates to leaf nodes; updates to parent
 * nodes are rare enough (since our leaf nodes are huge) that it wasn't worth
 * the complexity to deal with journalling them (in particular, journal replay)
 * - updates to non leaf nodes just happen synchronously (see btree_split()).
 */
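
The bucket-generation scheme described under BUCKETS/ALLOCATION above is small enough to model directly. A self-contained sketch (names invented here, not bcache's):

/* Illustrative model of gen-based invalidation - not bcache code. */
#include <stdbool.h>
#include <stdint.h>

struct toy_bucket {
	uint8_t gen;		/* current generation of the bucket */
};

struct toy_ptr {
	unsigned bucket;	/* which bucket the data lives in */
	uint8_t gen;		/* generation the pointer was created with */
};

/* A pointer is only valid while its gen matches the bucket's gen. */
static bool ptr_valid(const struct toy_bucket *buckets, struct toy_ptr p)
{
	return buckets[p.bucket].gen == p.gen;
}

/* "Deleting" everything in a bucket is just bumping its gen; every
 * existing pointer into it instantly becomes stale. */
static void invalidate_bucket(struct toy_bucket *buckets, unsigned i)
{
	buckets[i].gen++;
}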

#define pr_fmt(fmt) "bcache: %s() " fmt "\n", __func__

#include <linux/bio.h>
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/types.h>
#include <linux/workqueue.h>

#include "util.h"
#include "closure.h"

struct bucket {
	atomic_t	pin;
	uint16_t	prio;
	uint8_t		gen;
	uint8_t		disk_gen;
	uint8_t		last_gc; /* Most out of date gen in the btree */
	uint8_t		gc_gen;
	uint16_t	gc_mark;
};

/*
 * I'd use bitfields for these, but I don't trust the compiler not to screw me
 * as multiple threads touch struct bucket without locking
 */

BITMASK(GC_MARK, struct bucket, gc_mark, 0, 2);
#define GC_MARK_RECLAIMABLE	0
#define GC_MARK_DIRTY		1
#define GC_MARK_METADATA	2
BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 14);
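
BITMASK() is pulled in via util.h above (it isn't defined in this header); roughly, for each (name, type, field, offset, size) it generates a getter and a SET_ accessor over the named integer field. A hand-written equivalent of the two uses above, for illustration only:

/* Roughly what BITMASK(GC_MARK, struct bucket, gc_mark, 0, 2) and
 * BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 14) give you:
 * two small fields packed into the 16 bit gc_mark word. The real
 * accessors are macro-generated; these are hand-written equivalents. */
static inline unsigned toy_GC_MARK(const struct bucket *b)
{
	return b->gc_mark & 0x3;		/* bits 0..1 */
}

static inline void toy_SET_GC_MARK(struct bucket *b, unsigned v)
{
	b->gc_mark = (b->gc_mark & ~0x3) | (v & 0x3);
}

static inline unsigned toy_GC_SECTORS_USED(const struct bucket *b)
{
	return (b->gc_mark >> 2) & 0x3fff;	/* bits 2..15 */
}

static inline void toy_SET_GC_SECTORS_USED(struct bucket *b, unsigned v)
{
	b->gc_mark = (b->gc_mark & 0x3) | ((v & 0x3fff) << 2);
}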

struct bkey {
	uint64_t	high;
	uint64_t	low;
	uint64_t	ptr[];
};

/* Enough for a key with 6 pointers */
#define BKEY_PAD	8

#define BKEY_PADDED(key)					\
	union { struct bkey key; uint64_t key ## _pad[BKEY_PAD]; }
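
A padded key is just a union that reserves BKEY_PAD (8) 64-bit words: two for the bkey header (high, low) plus up to six pointer entries, matching the comment above. An illustrative use (not from the tree):

/* A key embedded with room for up to 6 pointers (illustrative). */
struct toy_insert {
	BKEY_PADDED(key);
};

/* 8 x 64-bit words = 2 header words (high, low) + 6 ptr slots. */
_Static_assert(sizeof(struct toy_insert) == BKEY_PAD * sizeof(uint64_t),
	       "padded key should be exactly BKEY_PAD u64s");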

/* Version 0: Cache device
 * Version 1: Backing device
 * Version 2: Seed pointer into btree node checksum
 * Version 3: Cache device with new UUID format
 * Version 4: Backing device with data offset
 */
#define BCACHE_SB_VERSION_CDEV			0
#define BCACHE_SB_VERSION_BDEV			1
#define BCACHE_SB_VERSION_CDEV_WITH_UUID	3
#define BCACHE_SB_VERSION_BDEV_WITH_OFFSET	4
#define BCACHE_SB_MAX_VERSION			4

#define SB_SECTOR		8
#define SB_SIZE			4096
#define SB_LABEL_SIZE		32
#define SB_JOURNAL_BUCKETS	256U
/* SB_JOURNAL_BUCKETS must be divisible by BITS_PER_LONG */
#define MAX_CACHES_PER_SET	8

#define BDEV_DATA_START_DEFAULT	16	/* sectors */

struct cache_sb {
	uint64_t	csum;
	uint64_t	offset;	/* sector where this sb was written */
	uint64_t	version;

	uint8_t		magic[16];

	uint8_t		uuid[16];
	union {
		uint8_t		set_uuid[16];
		uint64_t	set_magic;
	};
	uint8_t		label[SB_LABEL_SIZE];

	uint64_t	flags;
	uint64_t	seq;
	uint64_t	pad[8];

	union {
		struct {
			/* Cache devices */
			uint64_t	nbuckets;	/* device size */

			uint16_t	block_size;	/* sectors */
			uint16_t	bucket_size;	/* sectors */

			uint16_t	nr_in_set;
			uint16_t	nr_this_dev;
		};
		struct {
			/* Backing devices */
			uint64_t	data_offset;

			/*
			 * block_size from the cache device section is still used by
			 * backing devices, so don't add anything here until we fix
			 * things to not need it for backing devices anymore
			 */
		};
	};

	uint32_t	last_mount;	/* time_t */

	uint16_t	first_bucket;
	union {
		uint16_t	njournal_buckets;
		uint16_t	keys;
	};
	uint64_t	d[SB_JOURNAL_BUCKETS];	/* journal buckets */
};

BITMASK(CACHE_SYNC, struct cache_sb, flags, 0, 1);
BITMASK(CACHE_DISCARD, struct cache_sb, flags, 1, 1);
BITMASK(CACHE_REPLACEMENT, struct cache_sb, flags, 2, 3);
#define CACHE_REPLACEMENT_LRU		0U
#define CACHE_REPLACEMENT_FIFO		1U
#define CACHE_REPLACEMENT_RANDOM	2U

BITMASK(BDEV_CACHE_MODE, struct cache_sb, flags, 0, 4);
#define CACHE_MODE_WRITETHROUGH		0U
#define CACHE_MODE_WRITEBACK		1U
#define CACHE_MODE_WRITEAROUND		2U
#define CACHE_MODE_NONE			3U
BITMASK(BDEV_STATE, struct cache_sb, flags, 61, 2);
#define BDEV_STATE_NONE			0U
#define BDEV_STATE_CLEAN		1U
#define BDEV_STATE_DIRTY		2U
#define BDEV_STATE_STALE		3U

/* Version 1: Seed pointer into btree node checksum
 */
#define BCACHE_BSET_VERSION	1

/*
 * This is the on disk format for btree nodes - a btree node on disk is a list
 * of these; within each set the keys are sorted
 */
struct bset {
	uint64_t	csum;
	uint64_t	magic;
	uint64_t	seq;
	uint32_t	version;
	uint32_t	keys;

	union {
		struct bkey	start[0];
		uint64_t	d[0];
	};
};

/*
 * On disk format for priorities and gens - see super.c near prio_write() for
 * more.
 */
struct prio_set {
	uint64_t	csum;
	uint64_t	magic;
	uint64_t	seq;
	uint32_t	version;
	uint32_t	pad;

	uint64_t	next_bucket;

	struct bucket_disk {
		uint16_t	prio;
		uint8_t		gen;
	} __attribute((packed)) data[];
};

struct uuid_entry {
	union {
		struct {
			uint8_t		uuid[16];
			uint8_t		label[32];
			uint32_t	first_reg;
			uint32_t	last_reg;
			uint32_t	invalidated;

			uint32_t	flags;
			/* Size of flash only volumes */
			uint64_t	sectors;
		};

		uint8_t	pad[128];
	};
};

BITMASK(UUID_FLASH_ONLY, struct uuid_entry, flags, 0, 1);

#include "journal.h"
#include "stats.h"
struct search;
struct btree;
struct keybuf;

struct keybuf_key {
	struct rb_node	node;
	BKEY_PADDED(key);
	void		*private;
};

typedef bool (keybuf_pred_fn)(struct keybuf *, struct bkey *);

struct keybuf {
-	keybuf_pred_fn	*key_predicate;
-
	struct bkey	last_scanned;
	spinlock_t	lock;

	/*
	 * Beginning and end of range in rb tree - so that we can skip taking
	 * lock and checking the rb tree when we need to check for overlapping
	 * keys.
	 */
	struct bkey	start;
	struct bkey	end;

	struct rb_root	keys;

#define KEYBUF_NR	100
	DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
};
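
This hunk drops key_predicate from struct keybuf: with this commit the predicate is supplied by the caller each time the keybuf is refilled, which is what lets writeback choose between scanning for any dirty key and scanning only for keys in completely dirty stripes. The shape of such a predicate is shown below; KEY_DIRTY() is one of the bkey field accessors defined further down in bcache's headers (not in the excerpt above), and the fragment is illustrative rather than the patch's code.

/* Illustrative predicate with the keybuf_pred_fn signature above. */
static bool toy_dirty_pred(struct keybuf *buf, struct bkey *k)
{
	return KEY_DIRTY(k);
}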

struct bio_split_pool {
	struct bio_set		*bio_split;
	mempool_t		*bio_split_hook;
};

struct bio_split_hook {
	struct closure		cl;
	struct bio_split_pool	*p;
	struct bio		*bio;
	bio_end_io_t		*bi_end_io;
	void			*bi_private;
};

struct bcache_device {
	struct closure		cl;

	struct kobject		kobj;

	struct cache_set	*c;
	unsigned		id;
#define BCACHEDEVNAME_SIZE	12
	char			name[BCACHEDEVNAME_SIZE];

	struct gendisk		*disk;

	/* If nonzero, we're closing */
	atomic_t		closing;

	/* If nonzero, we're detaching/unregistering from cache set */
	atomic_t		detaching;

	uint64_t		nr_stripes;
	unsigned		stripe_size_bits;
	atomic_t		*stripe_sectors_dirty;

	unsigned long		sectors_dirty_last;
	long			sectors_dirty_derivative;

	mempool_t		*unaligned_bvec;
	struct bio_set		*bio_split;

	unsigned		data_csum:1;

	int (*cache_miss)(struct btree *, struct search *,
			  struct bio *, unsigned);
	int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);

	struct bio_split_pool	bio_split_hook;
};
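
The three stripe fields above (added by the parent commit) are what this patch builds on: stripe_size_bits makes the stripe index a shift, and stripe_sectors_dirty holds one dirty-sector count per stripe. A sketch of how an extent's sectors map onto those counters (hypothetical helper, not the tree's function):

/* Hypothetical sketch: account nr_sectors starting at 'offset' as dirty
 * in the per-stripe counters of a bcache_device. */
static void toy_mark_sectors_dirty(struct bcache_device *d, uint64_t offset,
				   unsigned nr_sectors)
{
	unsigned stripe_sectors = 1U << d->stripe_size_bits;

	while (nr_sectors) {
		uint64_t stripe = offset >> d->stripe_size_bits;
		unsigned in_stripe = offset & (stripe_sectors - 1);
		unsigned s = min(nr_sectors, stripe_sectors - in_stripe);

		/* one atomic_t counter of dirty sectors per stripe */
		atomic_add(s, d->stripe_sectors_dirty + stripe);

		offset	   += s;
		nr_sectors -= s;
	}
}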

struct io {
	/* Used to track sequential IO so it can be skipped */
	struct hlist_node	hash;
	struct list_head	lru;

	unsigned long		jiffies;
	unsigned		sequential;
	sector_t		last;
};

struct cached_dev {
	struct list_head	list;
	struct bcache_device	disk;
	struct block_device	*bdev;

	struct cache_sb		sb;
	struct bio		sb_bio;
	struct bio_vec		sb_bv[1];
	struct closure_with_waitlist sb_write;

	/* Refcount on the cache set. Always nonzero when we're caching. */
	atomic_t		count;
	struct work_struct	detach;

	/*
	 * Device might not be running if it's dirty and the cache set hasn't
	 * showed up yet.
	 */
	atomic_t		running;

	/*
	 * Writes take a shared lock from start to finish; scanning for dirty
	 * data to refill the rb tree requires an exclusive lock.
	 */
	struct rw_semaphore	writeback_lock;

	/*
	 * Nonzero, and writeback has a refcount (d->count), iff there is dirty
	 * data in the cache. Protected by writeback_lock; must have an
	 * shared lock to set and exclusive lock to clear.
	 */
	atomic_t		has_dirty;

	struct ratelimit	writeback_rate;
	struct delayed_work	writeback_rate_update;

	/*
	 * Internal to the writeback code, so read_dirty() can keep track of
	 * where it's at.
	 */
	sector_t		last_read;

	/* Number of writeback bios in flight */
	atomic_t		in_flight;
	struct closure_with_timer writeback;
	struct closure_waitlist	writeback_wait;

	struct keybuf		writeback_keys;

	/* For tracking sequential IO */
#define RECENT_IO_BITS	7
#define RECENT_IO	(1 << RECENT_IO_BITS)
	struct io		io[RECENT_IO];
	struct hlist_head	io_hash[RECENT_IO + 1];
	struct list_head	io_lru;
	spinlock_t		io_lock;

	struct cache_accounting	accounting;

	/* The rest of this all shows up in sysfs */
	unsigned		sequential_cutoff;
	unsigned		readahead;

	unsigned		sequential_merge:1;
	unsigned		verify:1;

+	unsigned		partial_stripes_expensive:1;
	unsigned		writeback_metadata:1;
	unsigned		writeback_running:1;
	unsigned char		writeback_percent;
	unsigned		writeback_delay;

	int			writeback_rate_change;
	int64_t			writeback_rate_derivative;
	uint64_t		writeback_rate_target;

	unsigned		writeback_rate_update_seconds;
	unsigned		writeback_rate_d_term;
	unsigned		writeback_rate_p_term_inverse;
	unsigned		writeback_rate_d_smooth;
};
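
partial_stripes_expensive is the one line this file gains: a per-device hint (intended for raid5/6 backing devices, where a partial-stripe write costs a read-modify-write) that the first optimization in the commit message keys off. Conceptually the write path asks something like the following before deciding whether to cache a write; this is a hypothetical helper, the real check lives in the request/writeback code elsewhere in this commit.

/* Hypothetical sketch of the "force writes to an already-dirty stripe
 * into writeback mode" decision. */
static bool toy_should_writeback(struct cached_dev *dc, uint64_t offset,
				 unsigned nr_sectors)
{
	uint64_t stripe = offset >> dc->disk.stripe_size_bits;
	uint64_t last = (offset + nr_sectors - 1) >> dc->disk.stripe_size_bits;

	if (!dc->partial_stripes_expensive)
		return false;

	/* If any stripe this write touches already has dirty sectors,
	 * cache the write so the stripe keeps filling up. */
	for (; stripe <= last; stripe++)
		if (atomic_read(dc->disk.stripe_sectors_dirty + stripe))
			return true;

	return false;
}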
549 548
550 enum alloc_watermarks { 549 enum alloc_watermarks {
551 WATERMARK_PRIO, 550 WATERMARK_PRIO,
552 WATERMARK_METADATA, 551 WATERMARK_METADATA,
553 WATERMARK_MOVINGGC, 552 WATERMARK_MOVINGGC,
554 WATERMARK_NONE, 553 WATERMARK_NONE,
555 WATERMARK_MAX 554 WATERMARK_MAX
556 }; 555 };
557 556
558 struct cache { 557 struct cache {
559 struct cache_set *set; 558 struct cache_set *set;
560 struct cache_sb sb; 559 struct cache_sb sb;
561 struct bio sb_bio; 560 struct bio sb_bio;
562 struct bio_vec sb_bv[1]; 561 struct bio_vec sb_bv[1];
563 562
564 struct kobject kobj; 563 struct kobject kobj;
565 struct block_device *bdev; 564 struct block_device *bdev;
566 565
567 unsigned watermark[WATERMARK_MAX]; 566 unsigned watermark[WATERMARK_MAX];
568 567
569 struct task_struct *alloc_thread; 568 struct task_struct *alloc_thread;
570 569
571 struct closure prio; 570 struct closure prio;
572 struct prio_set *disk_buckets; 571 struct prio_set *disk_buckets;
573 572
574 /* 573 /*
575 * When allocating new buckets, prio_write() gets first dibs - since we 574 * When allocating new buckets, prio_write() gets first dibs - since we
576 * may not be allocate at all without writing priorities and gens. 575 * may not be allocate at all without writing priorities and gens.
577 * prio_buckets[] contains the last buckets we wrote priorities to (so 576 * prio_buckets[] contains the last buckets we wrote priorities to (so
578 * gc can mark them as metadata), prio_next[] contains the buckets 577 * gc can mark them as metadata), prio_next[] contains the buckets
579 * allocated for the next prio write. 578 * allocated for the next prio write.
580 */ 579 */
581 uint64_t *prio_buckets; 580 uint64_t *prio_buckets;
582 uint64_t *prio_last_buckets; 581 uint64_t *prio_last_buckets;
583 582
584 /* 583 /*
585 * free: Buckets that are ready to be used 584 * free: Buckets that are ready to be used
586 * 585 *
587 * free_inc: Incoming buckets - these are buckets that currently have 586 * free_inc: Incoming buckets - these are buckets that currently have
588 * cached data in them, and we can't reuse them until after we write 587 * cached data in them, and we can't reuse them until after we write
589 * their new gen to disk. After prio_write() finishes writing the new 588 * their new gen to disk. After prio_write() finishes writing the new
590 * gens/prios, they'll be moved to the free list (and possibly discarded 589 * gens/prios, they'll be moved to the free list (and possibly discarded
591 * in the process) 590 * in the process)
592 * 591 *
593 * unused: GC found nothing pointing into these buckets (possibly 592 * unused: GC found nothing pointing into these buckets (possibly
594 * because all the data they contained was overwritten), so we only 593 * because all the data they contained was overwritten), so we only
595 * need to discard them before they can be moved to the free list. 594 * need to discard them before they can be moved to the free list.
596 */ 595 */
597 DECLARE_FIFO(long, free); 596 DECLARE_FIFO(long, free);
598 DECLARE_FIFO(long, free_inc); 597 DECLARE_FIFO(long, free_inc);
599 DECLARE_FIFO(long, unused); 598 DECLARE_FIFO(long, unused);
600 599
601 size_t fifo_last_bucket; 600 size_t fifo_last_bucket;
602 601
603 /* Allocation stuff: */ 602 /* Allocation stuff: */
604 struct bucket *buckets; 603 struct bucket *buckets;
605 604
606 DECLARE_HEAP(struct bucket *, heap); 605 DECLARE_HEAP(struct bucket *, heap);
607 606
608 /* 607 /*
609 * max(gen - disk_gen) for all buckets. When it gets too big we have to 608 * max(gen - disk_gen) for all buckets. When it gets too big we have to
610 * call prio_write() to keep gens from wrapping. 609 * call prio_write() to keep gens from wrapping.
611 */ 610 */
612 uint8_t need_save_prio; 611 uint8_t need_save_prio;
613 unsigned gc_move_threshold; 612 unsigned gc_move_threshold;
614 613
615 /* 614 /*
616 * If nonzero, we know we aren't going to find any buckets to invalidate 615 * If nonzero, we know we aren't going to find any buckets to invalidate
617 * until a gc finishes - otherwise we could pointlessly burn a ton of 616 * until a gc finishes - otherwise we could pointlessly burn a ton of
618 * cpu 617 * cpu
619 */ 618 */
620 unsigned invalidate_needs_gc:1; 619 unsigned invalidate_needs_gc:1;
621 620
622 bool discard; /* Get rid of? */ 621 bool discard; /* Get rid of? */
623 622
624 /* 623 /*
625 * We preallocate structs for issuing discards to buckets, and keep them 624 * We preallocate structs for issuing discards to buckets, and keep them
626 * on this list when they're not in use; do_discard() issues discards 625 * on this list when they're not in use; do_discard() issues discards
627 * whenever there's work to do and is called by free_some_buckets() and 626 * whenever there's work to do and is called by free_some_buckets() and
628 * when a discard finishes. 627 * when a discard finishes.
629 */ 628 */
630 atomic_t discards_in_flight; 629 atomic_t discards_in_flight;
631 struct list_head discards; 630 struct list_head discards;
632 631
633 struct journal_device journal; 632 struct journal_device journal;
634 633
635 /* The rest of this all shows up in sysfs */ 634 /* The rest of this all shows up in sysfs */
636 #define IO_ERROR_SHIFT 20 635 #define IO_ERROR_SHIFT 20
637 atomic_t io_errors; 636 atomic_t io_errors;
638 atomic_t io_count; 637 atomic_t io_count;
639 638
640 atomic_long_t meta_sectors_written; 639 atomic_long_t meta_sectors_written;
641 atomic_long_t btree_sectors_written; 640 atomic_long_t btree_sectors_written;
642 atomic_long_t sectors_written; 641 atomic_long_t sectors_written;
643 642
644 struct bio_split_pool bio_split_hook; 643 struct bio_split_pool bio_split_hook;
645 }; 644 };
646 645
647 struct gc_stat { 646 struct gc_stat {
648 size_t nodes; 647 size_t nodes;
649 size_t key_bytes; 648 size_t key_bytes;
650 649
651 size_t nkeys; 650 size_t nkeys;
652 uint64_t data; /* sectors */ 651 uint64_t data; /* sectors */
653 uint64_t dirty; /* sectors */ 652 uint64_t dirty; /* sectors */
654 unsigned in_use; /* percent */ 653 unsigned in_use; /* percent */
655 }; 654 };
656 655
657 /* 656 /*
658 * Flag bits, for how the cache set is shutting down, and what phase it's at: 657 * Flag bits, for how the cache set is shutting down, and what phase it's at:
659 * 658 *
660 * CACHE_SET_UNREGISTERING means we're not just shutting down, we're detaching 659 * CACHE_SET_UNREGISTERING means we're not just shutting down, we're detaching
661 * all the backing devices first (their cached data gets invalidated, and they 660 * all the backing devices first (their cached data gets invalidated, and they
662 * won't automatically reattach). 661 * won't automatically reattach).
663 * 662 *
664 * CACHE_SET_STOPPING always gets set first when we're closing down a cache set; 663 * CACHE_SET_STOPPING always gets set first when we're closing down a cache set;
665 * we'll continue to run normally for awhile with CACHE_SET_STOPPING set (i.e. 664 * we'll continue to run normally for awhile with CACHE_SET_STOPPING set (i.e.
666 * flushing dirty data). 665 * flushing dirty data).
667 * 666 *
668 * CACHE_SET_STOPPING_2 gets set at the last phase, when it's time to shut down 667 * CACHE_SET_STOPPING_2 gets set at the last phase, when it's time to shut down
669 * the allocation thread. 668 * the allocation thread.
670 */ 669 */
671 #define CACHE_SET_UNREGISTERING 0 670 #define CACHE_SET_UNREGISTERING 0
672 #define CACHE_SET_STOPPING 1 671 #define CACHE_SET_STOPPING 1
673 #define CACHE_SET_STOPPING_2 2 672 #define CACHE_SET_STOPPING_2 2
674 673
675 struct cache_set { 674 struct cache_set {
676 struct closure cl; 675 struct closure cl;
677 676
678 struct list_head list; 677 struct list_head list;
679 struct kobject kobj; 678 struct kobject kobj;
680 struct kobject internal; 679 struct kobject internal;
681 struct dentry *debug; 680 struct dentry *debug;
682 struct cache_accounting accounting; 681 struct cache_accounting accounting;
683 682
684 unsigned long flags; 683 unsigned long flags;
685 684
686 struct cache_sb sb; 685 struct cache_sb sb;
687 686
688 struct cache *cache[MAX_CACHES_PER_SET]; 687 struct cache *cache[MAX_CACHES_PER_SET];
689 struct cache *cache_by_alloc[MAX_CACHES_PER_SET]; 688 struct cache *cache_by_alloc[MAX_CACHES_PER_SET];
690 int caches_loaded; 689 int caches_loaded;
691 690
692 struct bcache_device **devices; 691 struct bcache_device **devices;
693 struct list_head cached_devs; 692 struct list_head cached_devs;
694 uint64_t cached_dev_sectors; 693 uint64_t cached_dev_sectors;
695 struct closure caching; 694 struct closure caching;
696 695
697 struct closure_with_waitlist sb_write; 696 struct closure_with_waitlist sb_write;
698 697
699 mempool_t *search; 698 mempool_t *search;
700 mempool_t *bio_meta; 699 mempool_t *bio_meta;
701 struct bio_set *bio_split; 700 struct bio_set *bio_split;
702 701
703 /* For the btree cache */ 702 /* For the btree cache */
704 struct shrinker shrink; 703 struct shrinker shrink;
705 704
706 /* For the btree cache and anything allocation related */ 705 /* For the btree cache and anything allocation related */
707 struct mutex bucket_lock; 706 struct mutex bucket_lock;
708 707
709 /* log2(bucket_size), in sectors */ 708 /* log2(bucket_size), in sectors */
710 unsigned short bucket_bits; 709 unsigned short bucket_bits;
711 710
712 /* log2(block_size), in sectors */ 711 /* log2(block_size), in sectors */
713 unsigned short block_bits; 712 unsigned short block_bits;
714 713
715 /* 714 /*
716 * Default number of pages for a new btree node - may be less than a 715 * Default number of pages for a new btree node - may be less than a
717 * full bucket 716 * full bucket
718 */ 717 */
719 unsigned btree_pages; 718 unsigned btree_pages;
720 719
721 /* 720 /*
722 * Lists of struct btrees; lru is the list for structs that have memory 721 * Lists of struct btrees; lru is the list for structs that have memory
723 * allocated for actual btree node, freed is for structs that do not. 722 * allocated for actual btree node, freed is for structs that do not.
724 * 723 *
725 * We never free a struct btree, except on shutdown - we just put it on 724 * We never free a struct btree, except on shutdown - we just put it on
726 * the btree_cache_freed list and reuse it later. This simplifies the 725 * the btree_cache_freed list and reuse it later. This simplifies the
727 * code, and it doesn't cost us much memory as the memory usage is 726 * code, and it doesn't cost us much memory as the memory usage is
728 * dominated by buffers that hold the actual btree node data and those 727 * dominated by buffers that hold the actual btree node data and those
729 * can be freed - and the number of struct btrees allocated is 728 * can be freed - and the number of struct btrees allocated is
730 * effectively bounded. 729 * effectively bounded.
731 * 730 *
732 * btree_cache_freeable effectively is a small cache - we use it because 731 * btree_cache_freeable effectively is a small cache - we use it because
733 * high order page allocations can be rather expensive, and it's quite 732 * high order page allocations can be rather expensive, and it's quite
734 * common to delete and allocate btree nodes in quick succession. It 733 * common to delete and allocate btree nodes in quick succession. It
735 * should never grow past ~2-3 nodes in practice. 734 * should never grow past ~2-3 nodes in practice.
736 */ 735 */
737 struct list_head btree_cache; 736 struct list_head btree_cache;
738 struct list_head btree_cache_freeable; 737 struct list_head btree_cache_freeable;
739 struct list_head btree_cache_freed; 738 struct list_head btree_cache_freed;
740 739
741 /* Number of elements in btree_cache + btree_cache_freeable lists */ 740 /* Number of elements in btree_cache + btree_cache_freeable lists */
742 unsigned bucket_cache_used; 741 unsigned bucket_cache_used;
743 742
744 /* 743 /*
745 * If we need to allocate memory for a new btree node and that 744 * If we need to allocate memory for a new btree node and that
746 * allocation fails, we can cannibalize another node in the btree cache 745 * allocation fails, we can cannibalize another node in the btree cache
747 * to satisfy the allocation. However, only one thread can be doing this 746 * to satisfy the allocation. However, only one thread can be doing this
748 * at a time, for obvious reasons - try_harder and try_wait are 747 * at a time, for obvious reasons - try_harder and try_wait are
749 * basically a lock for this that we can wait on asynchronously. The 748 * basically a lock for this that we can wait on asynchronously. The
750 * btree_root() macro releases the lock when it returns. 749 * btree_root() macro releases the lock when it returns.
751 */ 750 */
752 struct closure *try_harder; 751 struct closure *try_harder;
753 struct closure_waitlist try_wait; 752 struct closure_waitlist try_wait;
754 uint64_t try_harder_start; 753 uint64_t try_harder_start;
755 754
756 /* 755 /*
757 * When we free a btree node, we increment the gen of the bucket the 756 * When we free a btree node, we increment the gen of the bucket the
758 * node is in - but we can't rewrite the prios and gens until we 757 * node is in - but we can't rewrite the prios and gens until we
759 * finished whatever it is we were doing, otherwise after a crash the 758 * finished whatever it is we were doing, otherwise after a crash the
760 * btree node would be freed but for say a split, we might not have the 759 * btree node would be freed but for say a split, we might not have the
761 * pointers to the new nodes inserted into the btree yet. 760 * pointers to the new nodes inserted into the btree yet.
762 * 761 *
763 * This is a refcount that blocks prio_write() until the new keys are 762 * This is a refcount that blocks prio_write() until the new keys are
764 * written. 763 * written.
765 */ 764 */
766 atomic_t prio_blocked; 765 atomic_t prio_blocked;
767 struct closure_waitlist bucket_wait; 766 struct closure_waitlist bucket_wait;
768 767
769 /* 768 /*
770 * For any bio we don't skip we subtract the number of sectors from 769 * For any bio we don't skip we subtract the number of sectors from
771 * rescale; when it hits 0 we rescale all the bucket priorities. 770 * rescale; when it hits 0 we rescale all the bucket priorities.
772 */ 771 */
773 atomic_t rescale; 772 atomic_t rescale;
774 /* 773 /*
775 * When we invalidate buckets, we use both the priority and the amount 774 * When we invalidate buckets, we use both the priority and the amount
776 * of good data to determine which buckets to reuse first - to weight 775 * of good data to determine which buckets to reuse first - to weight
777 * those together consistently we keep track of the smallest nonzero 776 * those together consistently we keep track of the smallest nonzero
778 * priority of any bucket. 777 * priority of any bucket.
779 */ 778 */
780 uint16_t min_prio; 779 uint16_t min_prio;
781 780
782 /* 781 /*
783 * max(gen - gc_gen) for all buckets. When it gets too big we have to gc 782 * max(gen - gc_gen) for all buckets. When it gets too big we have to gc
784 * to keep gens from wrapping around. 783 * to keep gens from wrapping around.
785 */ 784 */
786 uint8_t need_gc; 785 uint8_t need_gc;
787 struct gc_stat gc_stats; 786 struct gc_stat gc_stats;
788 size_t nbuckets; 787 size_t nbuckets;
789 788
790 struct closure_with_waitlist gc; 789 struct closure_with_waitlist gc;
791 /* Where in the btree gc currently is */ 790 /* Where in the btree gc currently is */
792 struct bkey gc_done; 791 struct bkey gc_done;
793 792
794 /* 793 /*
795 * The allocation code needs gc_mark in struct bucket to be correct, but 794 * The allocation code needs gc_mark in struct bucket to be correct, but
796 * it's not while a gc is in progress. Protected by bucket_lock. 795 * it's not while a gc is in progress. Protected by bucket_lock.
797 */ 796 */
798 int gc_mark_valid; 797 int gc_mark_valid;
799 798
800 /* Counts how many sectors bio_insert has added to the cache */ 799 /* Counts how many sectors bio_insert has added to the cache */
801 atomic_t sectors_to_gc; 800 atomic_t sectors_to_gc;
802 801
803 struct closure moving_gc; 802 struct closure moving_gc;
804 struct closure_waitlist moving_gc_wait; 803 struct closure_waitlist moving_gc_wait;
805 struct keybuf moving_gc_keys; 804 struct keybuf moving_gc_keys;
806 /* Number of moving GC bios in flight */ 805 /* Number of moving GC bios in flight */
807 atomic_t in_flight; 806 atomic_t in_flight;
808 807
809 struct btree *root; 808 struct btree *root;
810 809
811 #ifdef CONFIG_BCACHE_DEBUG 810 #ifdef CONFIG_BCACHE_DEBUG
812 struct btree *verify_data; 811 struct btree *verify_data;
813 struct mutex verify_lock; 812 struct mutex verify_lock;
814 #endif 813 #endif
815 814
816 unsigned nr_uuids; 815 unsigned nr_uuids;
817 struct uuid_entry *uuids; 816 struct uuid_entry *uuids;
818 BKEY_PADDED(uuid_bucket); 817 BKEY_PADDED(uuid_bucket);
819 struct closure_with_waitlist uuid_write; 818 struct closure_with_waitlist uuid_write;
820 819
821 /* 820 /*
822 * A btree node on disk could have too many bsets for an iterator to fit 821 * A btree node on disk could have too many bsets for an iterator to fit
823 * on the stack - have to dynamically allocate them 822 * on the stack - have to dynamically allocate them
824 */ 823 */
825 mempool_t *fill_iter; 824 mempool_t *fill_iter;
826 825
827 /* 826 /*
828 * btree_sort() is a merge sort and requires temporary space - single 827 * btree_sort() is a merge sort and requires temporary space - single
829 * element mempool 828 * element mempool
830 */ 829 */
831 struct mutex sort_lock; 830 struct mutex sort_lock;
832 struct bset *sort; 831 struct bset *sort;
833 unsigned sort_crit_factor; 832 unsigned sort_crit_factor;
834 833
835 /* List of buckets we're currently writing data to */ 834 /* List of buckets we're currently writing data to */
836 struct list_head data_buckets; 835 struct list_head data_buckets;
837 spinlock_t data_bucket_lock; 836 spinlock_t data_bucket_lock;
838 837
839 struct journal journal; 838 struct journal journal;
840 839
841 #define CONGESTED_MAX 1024 840 #define CONGESTED_MAX 1024
842 unsigned congested_last_us; 841 unsigned congested_last_us;
843 atomic_t congested; 842 atomic_t congested;
844 843
845 /* The rest of this all shows up in sysfs */ 844 /* The rest of this all shows up in sysfs */
846 unsigned congested_read_threshold_us; 845 unsigned congested_read_threshold_us;
847 unsigned congested_write_threshold_us; 846 unsigned congested_write_threshold_us;
848 847
849 spinlock_t sort_time_lock; 848 spinlock_t sort_time_lock;
850 struct time_stats sort_time; 849 struct time_stats sort_time;
851 struct time_stats btree_gc_time; 850 struct time_stats btree_gc_time;
852 struct time_stats btree_split_time; 851 struct time_stats btree_split_time;
853 spinlock_t btree_read_time_lock; 852 spinlock_t btree_read_time_lock;
854 struct time_stats btree_read_time; 853 struct time_stats btree_read_time;
855 struct time_stats try_harder_time; 854 struct time_stats try_harder_time;
856 855
857 atomic_long_t cache_read_races; 856 atomic_long_t cache_read_races;
858 atomic_long_t writeback_keys_done; 857 atomic_long_t writeback_keys_done;
859 atomic_long_t writeback_keys_failed; 858 atomic_long_t writeback_keys_failed;
860 unsigned error_limit; 859 unsigned error_limit;
861 unsigned error_decay; 860 unsigned error_decay;
862 unsigned short journal_delay_ms; 861 unsigned short journal_delay_ms;
863 unsigned verify:1; 862 unsigned verify:1;
864 unsigned key_merging_disabled:1; 863 unsigned key_merging_disabled:1;
865 unsigned gc_always_rewrite:1; 864 unsigned gc_always_rewrite:1;
866 unsigned shrinker_disabled:1; 865 unsigned shrinker_disabled:1;
867 unsigned copy_gc_enabled:1; 866 unsigned copy_gc_enabled:1;
868 867
869 #define BUCKET_HASH_BITS 12 868 #define BUCKET_HASH_BITS 12
870 struct hlist_head bucket_hash[1 << BUCKET_HASH_BITS]; 869 struct hlist_head bucket_hash[1 << BUCKET_HASH_BITS];
871 }; 870 };
872 871
873 static inline bool key_merging_disabled(struct cache_set *c) 872 static inline bool key_merging_disabled(struct cache_set *c)
874 { 873 {
875 #ifdef CONFIG_BCACHE_DEBUG 874 #ifdef CONFIG_BCACHE_DEBUG
876 return c->key_merging_disabled; 875 return c->key_merging_disabled;
877 #else 876 #else
878 return 0; 877 return 0;
879 #endif 878 #endif
880 } 879 }
881 880
882 static inline bool SB_IS_BDEV(const struct cache_sb *sb) 881 static inline bool SB_IS_BDEV(const struct cache_sb *sb)
883 { 882 {
884 return sb->version == BCACHE_SB_VERSION_BDEV 883 return sb->version == BCACHE_SB_VERSION_BDEV
885 || sb->version == BCACHE_SB_VERSION_BDEV_WITH_OFFSET; 884 || sb->version == BCACHE_SB_VERSION_BDEV_WITH_OFFSET;
886 } 885 }
887 886
888 struct bbio { 887 struct bbio {
889 unsigned submit_time_us; 888 unsigned submit_time_us;
890 union { 889 union {
891 struct bkey key; 890 struct bkey key;
892 uint64_t _pad[3]; 891 uint64_t _pad[3];
893 /* 892 /*
894 * We only need pad = 3 here because we only ever carry around a 893 * We only need pad = 3 here because we only ever carry around a
895 * single pointer - i.e. the pointer we're doing io to/from. 894 * single pointer - i.e. the pointer we're doing io to/from.
896 */ 895 */
897 }; 896 };
898 struct bio bio; 897 struct bio bio;
899 }; 898 };
900 899
901 static inline unsigned local_clock_us(void) 900 static inline unsigned local_clock_us(void)
902 { 901 {
903 return local_clock() >> 10; 902 return local_clock() >> 10;
904 } 903 }
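local_clock() returns nanoseconds; the >> 10 above divides by 1024, a cheap approximation of microseconds for the *_us timing fields in this file. A standalone sketch of the shortcut (illustration only, not bcache code):

#include <stdio.h>

int main(void)
{
        unsigned long long ns = 5000000;        /* 5 ms */

        /* same trick as local_clock_us(): shift instead of dividing by 1000 */
        printf("%llu ~us (exact would be %llu)\n", ns >> 10, ns / 1000);   /* 4882 vs 5000 */
        return 0;
}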
905 904
906 #define BTREE_PRIO USHRT_MAX 905 #define BTREE_PRIO USHRT_MAX
907 #define INITIAL_PRIO 32768 906 #define INITIAL_PRIO 32768
908 907
909 #define btree_bytes(c) ((c)->btree_pages * PAGE_SIZE) 908 #define btree_bytes(c) ((c)->btree_pages * PAGE_SIZE)
910 #define btree_blocks(b) \ 909 #define btree_blocks(b) \
911 ((unsigned) (KEY_SIZE(&b->key) >> (b)->c->block_bits)) 910 ((unsigned) (KEY_SIZE(&b->key) >> (b)->c->block_bits))
912 911
913 #define btree_default_blocks(c) \ 912 #define btree_default_blocks(c) \
914 ((unsigned) ((PAGE_SECTORS * (c)->btree_pages) >> (c)->block_bits)) 913 ((unsigned) ((PAGE_SECTORS * (c)->btree_pages) >> (c)->block_bits))
915 914
916 #define bucket_pages(c) ((c)->sb.bucket_size / PAGE_SECTORS) 915 #define bucket_pages(c) ((c)->sb.bucket_size / PAGE_SECTORS)
917 #define bucket_bytes(c) ((c)->sb.bucket_size << 9) 916 #define bucket_bytes(c) ((c)->sb.bucket_size << 9)
918 #define block_bytes(c) ((c)->sb.block_size << 9) 917 #define block_bytes(c) ((c)->sb.block_size << 9)
919 918
920 #define __set_bytes(i, k) (sizeof(*(i)) + (k) * sizeof(uint64_t)) 919 #define __set_bytes(i, k) (sizeof(*(i)) + (k) * sizeof(uint64_t))
921 #define set_bytes(i) __set_bytes(i, i->keys) 920 #define set_bytes(i) __set_bytes(i, i->keys)
922 921
923 #define __set_blocks(i, k, c) DIV_ROUND_UP(__set_bytes(i, k), block_bytes(c)) 922 #define __set_blocks(i, k, c) DIV_ROUND_UP(__set_bytes(i, k), block_bytes(c))
924 #define set_blocks(i, c) __set_blocks(i, (i)->keys, c) 923 #define set_blocks(i, c) __set_blocks(i, (i)->keys, c)
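A worked example of the __set_bytes()/__set_blocks() arithmetic above, as a userspace sketch: the header and block sizes are made-up stand-ins for sizeof(struct bset) and block_bytes(c), chosen only to make the numbers concrete.

#include <stdint.h>
#include <stdio.h>

#define EXAMPLE_HEADER_BYTES    32      /* stand-in for sizeof(struct bset) */
#define EXAMPLE_BLOCK_BYTES     4096    /* stand-in for block_bytes(c) */

int main(void)
{
        unsigned keys = 100;
        /* mirrors __set_bytes(): header plus one u64 per key */
        unsigned bytes = EXAMPLE_HEADER_BYTES + keys * sizeof(uint64_t);
        /* mirrors __set_blocks(): DIV_ROUND_UP to whole blocks */
        unsigned blocks = (bytes + EXAMPLE_BLOCK_BYTES - 1) / EXAMPLE_BLOCK_BYTES;

        printf("%u bytes -> %u block(s)\n", bytes, blocks);     /* 832 bytes -> 1 block */
        return 0;
}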
925 924
926 #define node(i, j) ((struct bkey *) ((i)->d + (j))) 925 #define node(i, j) ((struct bkey *) ((i)->d + (j)))
927 #define end(i) node(i, (i)->keys) 926 #define end(i) node(i, (i)->keys)
928 927
929 #define index(i, b) \ 928 #define index(i, b) \
930 ((size_t) (((void *) i - (void *) (b)->sets[0].data) / \ 929 ((size_t) (((void *) i - (void *) (b)->sets[0].data) / \
931 block_bytes(b->c))) 930 block_bytes(b->c)))
932 931
933 #define btree_data_space(b) (PAGE_SIZE << (b)->page_order) 932 #define btree_data_space(b) (PAGE_SIZE << (b)->page_order)
934 933
935 #define prios_per_bucket(c) \ 934 #define prios_per_bucket(c) \
936 ((bucket_bytes(c) - sizeof(struct prio_set)) / \ 935 ((bucket_bytes(c) - sizeof(struct prio_set)) / \
937 sizeof(struct bucket_disk)) 936 sizeof(struct bucket_disk))
938 #define prio_buckets(c) \ 937 #define prio_buckets(c) \
939 DIV_ROUND_UP((size_t) (c)->sb.nbuckets, prios_per_bucket(c)) 938 DIV_ROUND_UP((size_t) (c)->sb.nbuckets, prios_per_bucket(c))
940 939
941 #define JSET_MAGIC 0x245235c1a3625032ULL 940 #define JSET_MAGIC 0x245235c1a3625032ULL
942 #define PSET_MAGIC 0x6750e15f87337f91ULL 941 #define PSET_MAGIC 0x6750e15f87337f91ULL
943 #define BSET_MAGIC 0x90135c78b99e07f5ULL 942 #define BSET_MAGIC 0x90135c78b99e07f5ULL
944 943
945 #define jset_magic(c) ((c)->sb.set_magic ^ JSET_MAGIC) 944 #define jset_magic(c) ((c)->sb.set_magic ^ JSET_MAGIC)
946 #define pset_magic(c) ((c)->sb.set_magic ^ PSET_MAGIC) 945 #define pset_magic(c) ((c)->sb.set_magic ^ PSET_MAGIC)
947 #define bset_magic(c) ((c)->sb.set_magic ^ BSET_MAGIC) 946 #define bset_magic(c) ((c)->sb.set_magic ^ BSET_MAGIC)
948 947
949 /* Bkey fields: all units are in sectors */ 948 /* Bkey fields: all units are in sectors */
950 949
951 #define KEY_FIELD(name, field, offset, size) \ 950 #define KEY_FIELD(name, field, offset, size) \
952 BITMASK(name, struct bkey, field, offset, size) 951 BITMASK(name, struct bkey, field, offset, size)
953 952
954 #define PTR_FIELD(name, offset, size) \ 953 #define PTR_FIELD(name, offset, size) \
955 static inline uint64_t name(const struct bkey *k, unsigned i) \ 954 static inline uint64_t name(const struct bkey *k, unsigned i) \
956 { return (k->ptr[i] >> offset) & ~(((uint64_t) ~0) << size); } \ 955 { return (k->ptr[i] >> offset) & ~(((uint64_t) ~0) << size); } \
957 \ 956 \
958 static inline void SET_##name(struct bkey *k, unsigned i, uint64_t v)\ 957 static inline void SET_##name(struct bkey *k, unsigned i, uint64_t v)\
959 { \ 958 { \
960 k->ptr[i] &= ~(~((uint64_t) ~0 << size) << offset); \ 959 k->ptr[i] &= ~(~((uint64_t) ~0 << size) << offset); \
961 k->ptr[i] |= v << offset; \ 960 k->ptr[i] |= v << offset; \
962 } 961 }
963 962
964 KEY_FIELD(KEY_PTRS, high, 60, 3) 963 KEY_FIELD(KEY_PTRS, high, 60, 3)
965 KEY_FIELD(HEADER_SIZE, high, 58, 2) 964 KEY_FIELD(HEADER_SIZE, high, 58, 2)
966 KEY_FIELD(KEY_CSUM, high, 56, 2) 965 KEY_FIELD(KEY_CSUM, high, 56, 2)
967 KEY_FIELD(KEY_PINNED, high, 55, 1) 966 KEY_FIELD(KEY_PINNED, high, 55, 1)
968 KEY_FIELD(KEY_DIRTY, high, 36, 1) 967 KEY_FIELD(KEY_DIRTY, high, 36, 1)
969 968
970 KEY_FIELD(KEY_SIZE, high, 20, 16) 969 KEY_FIELD(KEY_SIZE, high, 20, 16)
971 KEY_FIELD(KEY_INODE, high, 0, 20) 970 KEY_FIELD(KEY_INODE, high, 0, 20)
972 971
973 /* Next time I change the on disk format, KEY_OFFSET() won't be 64 bits */ 972 /* Next time I change the on disk format, KEY_OFFSET() won't be 64 bits */
974 973
975 static inline uint64_t KEY_OFFSET(const struct bkey *k) 974 static inline uint64_t KEY_OFFSET(const struct bkey *k)
976 { 975 {
977 return k->low; 976 return k->low;
978 } 977 }
979 978
980 static inline void SET_KEY_OFFSET(struct bkey *k, uint64_t v) 979 static inline void SET_KEY_OFFSET(struct bkey *k, uint64_t v)
981 { 980 {
982 k->low = v; 981 k->low = v;
983 } 982 }
984 983
985 PTR_FIELD(PTR_DEV, 51, 12) 984 PTR_FIELD(PTR_DEV, 51, 12)
986 PTR_FIELD(PTR_OFFSET, 8, 43) 985 PTR_FIELD(PTR_OFFSET, 8, 43)
987 PTR_FIELD(PTR_GEN, 0, 8) 986 PTR_FIELD(PTR_GEN, 0, 8)
988 987
989 #define PTR_CHECK_DEV ((1 << 12) - 1) 988 #define PTR_CHECK_DEV ((1 << 12) - 1)
990 989
991 #define PTR(gen, offset, dev) \ 990 #define PTR(gen, offset, dev) \
992 ((((uint64_t) dev) << 51) | ((uint64_t) offset) << 8 | gen) 991 ((((uint64_t) dev) << 51) | ((uint64_t) offset) << 8 | gen)
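To make the pointer encoding concrete, here is a small userspace sketch of the same bit layout (gen in bits 0-7, offset in bits 8-50, dev in bits 51-62); the example_* names are illustrative, not part of bcache.

#include <stdint.h>
#include <stdio.h>

/* same packing as PTR() above */
static uint64_t example_ptr(uint64_t gen, uint64_t offset, uint64_t dev)
{
        return (dev << 51) | (offset << 8) | gen;
}

/* same extraction as the accessors generated by PTR_FIELD() */
static uint64_t example_field(uint64_t ptr, unsigned shift, unsigned size)
{
        return (ptr >> shift) & ~(~(uint64_t) 0 << size);
}

int main(void)
{
        uint64_t p = example_ptr(3, 123456, 1);

        printf("dev=%llu offset=%llu gen=%llu\n",
               (unsigned long long) example_field(p, 51, 12),   /* PTR_DEV    -> 1      */
               (unsigned long long) example_field(p, 8, 43),    /* PTR_OFFSET -> 123456 */
               (unsigned long long) example_field(p, 0, 8));    /* PTR_GEN    -> 3      */
        return 0;
}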
993 992
994 static inline size_t sector_to_bucket(struct cache_set *c, sector_t s) 993 static inline size_t sector_to_bucket(struct cache_set *c, sector_t s)
995 { 994 {
996 return s >> c->bucket_bits; 995 return s >> c->bucket_bits;
997 } 996 }
998 997
999 static inline sector_t bucket_to_sector(struct cache_set *c, size_t b) 998 static inline sector_t bucket_to_sector(struct cache_set *c, size_t b)
1000 { 999 {
1001 return ((sector_t) b) << c->bucket_bits; 1000 return ((sector_t) b) << c->bucket_bits;
1002 } 1001 }
1003 1002
1004 static inline sector_t bucket_remainder(struct cache_set *c, sector_t s) 1003 static inline sector_t bucket_remainder(struct cache_set *c, sector_t s)
1005 { 1004 {
1006 return s & (c->sb.bucket_size - 1); 1005 return s & (c->sb.bucket_size - 1);
1007 } 1006 }
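With bucket_size a power of two, the three helpers above are just shifts and masks. A sketch with concrete numbers, assuming a 1024-sector bucket (bucket_bits = 10):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        unsigned bucket_bits = 10;                      /* assumed: 1024-sector buckets */
        uint64_t bucket_size = 1ULL << bucket_bits;
        uint64_t sector = 10000;

        uint64_t bucket    = sector >> bucket_bits;             /* sector_to_bucket() -> 9    */
        uint64_t remainder = sector & (bucket_size - 1);        /* bucket_remainder() -> 784  */
        uint64_t start     = bucket << bucket_bits;             /* bucket_to_sector() -> 9216 */

        printf("bucket %llu starts at sector %llu, offset in bucket %llu\n",
               (unsigned long long) bucket,
               (unsigned long long) start,
               (unsigned long long) remainder);
        return 0;
}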
1008 1007
1009 static inline struct cache *PTR_CACHE(struct cache_set *c, 1008 static inline struct cache *PTR_CACHE(struct cache_set *c,
1010 const struct bkey *k, 1009 const struct bkey *k,
1011 unsigned ptr) 1010 unsigned ptr)
1012 { 1011 {
1013 return c->cache[PTR_DEV(k, ptr)]; 1012 return c->cache[PTR_DEV(k, ptr)];
1014 } 1013 }
1015 1014
1016 static inline size_t PTR_BUCKET_NR(struct cache_set *c, 1015 static inline size_t PTR_BUCKET_NR(struct cache_set *c,
1017 const struct bkey *k, 1016 const struct bkey *k,
1018 unsigned ptr) 1017 unsigned ptr)
1019 { 1018 {
1020 return sector_to_bucket(c, PTR_OFFSET(k, ptr)); 1019 return sector_to_bucket(c, PTR_OFFSET(k, ptr));
1021 } 1020 }
1022 1021
1023 static inline struct bucket *PTR_BUCKET(struct cache_set *c, 1022 static inline struct bucket *PTR_BUCKET(struct cache_set *c,
1024 const struct bkey *k, 1023 const struct bkey *k,
1025 unsigned ptr) 1024 unsigned ptr)
1026 { 1025 {
1027 return PTR_CACHE(c, k, ptr)->buckets + PTR_BUCKET_NR(c, k, ptr); 1026 return PTR_CACHE(c, k, ptr)->buckets + PTR_BUCKET_NR(c, k, ptr);
1028 } 1027 }
1029 1028
1030 /* Btree key macros */ 1029 /* Btree key macros */
1031 1030
1032 /* 1031 /*
1033 * The high bit being set is a relic from when we used it to do binary 1032 * The high bit being set is a relic from when we used it to do binary
1034 * searches - it told you where a key started. It's not used anymore, 1033 * searches - it told you where a key started. It's not used anymore,
1035 * and can probably be safely dropped. 1034 * and can probably be safely dropped.
1036 */ 1035 */
1037 #define KEY(dev, sector, len) \ 1036 #define KEY(dev, sector, len) \
1038 ((struct bkey) { \ 1037 ((struct bkey) { \
1039 .high = (1ULL << 63) | ((uint64_t) (len) << 20) | (dev), \ 1038 .high = (1ULL << 63) | ((uint64_t) (len) << 20) | (dev), \
1040 .low = (sector) \ 1039 .low = (sector) \
1041 }) 1040 })
1042 1041
1043 static inline void bkey_init(struct bkey *k) 1042 static inline void bkey_init(struct bkey *k)
1044 { 1043 {
1045 *k = KEY(0, 0, 0); 1044 *k = KEY(0, 0, 0);
1046 } 1045 }
1047 1046
1048 #define KEY_START(k) (KEY_OFFSET(k) - KEY_SIZE(k)) 1047 #define KEY_START(k) (KEY_OFFSET(k) - KEY_SIZE(k))
1049 #define START_KEY(k) KEY(KEY_INODE(k), KEY_START(k), 0) 1048 #define START_KEY(k) KEY(KEY_INODE(k), KEY_START(k), 0)
1050 #define MAX_KEY KEY(~(~0 << 20), ((uint64_t) ~0) >> 1, 0) 1049 #define MAX_KEY KEY(~(~0 << 20), ((uint64_t) ~0) >> 1, 0)
1051 #define ZERO_KEY KEY(0, 0, 0) 1050 #define ZERO_KEY KEY(0, 0, 0)
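One detail worth spelling out: the offset stored in a bkey is the *end* of the extent, so KEY_START() subtracts the size to get back to the first sector. A userspace sketch using a stand-in struct (example_bkey is not the real struct bkey):

#include <stdint.h>
#include <stdio.h>

struct example_bkey { uint64_t high, low; };    /* stand-in for struct bkey */

#define EXAMPLE_KEY(dev, sector, len)                                   \
        ((struct example_bkey) {                                        \
                .high = (1ULL << 63) | ((uint64_t) (len) << 20) | (dev),\
                .low = (sector)                                         \
        })

int main(void)
{
        /* inode 5, 16-sector extent ending at sector 116 */
        struct example_bkey k = EXAMPLE_KEY(5, 116, 16);

        uint64_t size  = (k.high >> 20) & 0xffff;       /* KEY_SIZE  */
        uint64_t inode = k.high & 0xfffff;              /* KEY_INODE */

        printf("inode=%llu covers sectors [%llu, %llu)\n",
               (unsigned long long) inode,
               (unsigned long long) (k.low - size),     /* KEY_START  */
               (unsigned long long) k.low);             /* KEY_OFFSET */
        return 0;
}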
1052 1051
1053 /* 1052 /*
1054 * This is used for various on disk data structures - cache_sb, prio_set, bset, 1053 * This is used for various on disk data structures - cache_sb, prio_set, bset,
1055 * jset: The checksum is _always_ the first 8 bytes of these structs 1054 * jset: The checksum is _always_ the first 8 bytes of these structs
1056 */ 1055 */
1057 #define csum_set(i) \ 1056 #define csum_set(i) \
1058 bch_crc64(((void *) (i)) + sizeof(uint64_t), \ 1057 bch_crc64(((void *) (i)) + sizeof(uint64_t), \
1059 ((void *) end(i)) - (((void *) (i)) + sizeof(uint64_t))) 1058 ((void *) end(i)) - (((void *) (i)) + sizeof(uint64_t)))
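The convention is that the csum field occupies the leading 8 bytes of the struct and is excluded from its own computation. A layout-only sketch (example_checksum() is a trivial stand-in for bch_crc64(), used purely to show which bytes are covered):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct example_set {
        uint64_t csum;          /* always the first 8 bytes */
        uint64_t keys[4];       /* everything after the csum is covered */
};

static uint64_t example_checksum(const void *data, size_t len)
{
        const uint8_t *p = data;
        uint64_t sum = 0;

        while (len--)
                sum = sum * 31 + *p++;  /* NOT bch_crc64() - illustration only */
        return sum;
}

int main(void)
{
        struct example_set s = { 0, { 1, 2, 3, 4 } };

        /* mirrors csum_set(): start 8 bytes in, run to the end of the keys */
        s.csum = example_checksum((char *) &s + sizeof(uint64_t),
                                  sizeof(s) - sizeof(uint64_t));
        printf("stored csum %llu\n", (unsigned long long) s.csum);
        return 0;
}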
1060 1059
1061 /* Error handling macros */ 1060 /* Error handling macros */
1062 1061
1063 #define btree_bug(b, ...) \ 1062 #define btree_bug(b, ...) \
1064 do { \ 1063 do { \
1065 if (bch_cache_set_error((b)->c, __VA_ARGS__)) \ 1064 if (bch_cache_set_error((b)->c, __VA_ARGS__)) \
1066 dump_stack(); \ 1065 dump_stack(); \
1067 } while (0) 1066 } while (0)
1068 1067
1069 #define cache_bug(c, ...) \ 1068 #define cache_bug(c, ...) \
1070 do { \ 1069 do { \
1071 if (bch_cache_set_error(c, __VA_ARGS__)) \ 1070 if (bch_cache_set_error(c, __VA_ARGS__)) \
1072 dump_stack(); \ 1071 dump_stack(); \
1073 } while (0) 1072 } while (0)
1074 1073
1075 #define btree_bug_on(cond, b, ...) \ 1074 #define btree_bug_on(cond, b, ...) \
1076 do { \ 1075 do { \
1077 if (cond) \ 1076 if (cond) \
1078 btree_bug(b, __VA_ARGS__); \ 1077 btree_bug(b, __VA_ARGS__); \
1079 } while (0) 1078 } while (0)
1080 1079
1081 #define cache_bug_on(cond, c, ...) \ 1080 #define cache_bug_on(cond, c, ...) \
1082 do { \ 1081 do { \
1083 if (cond) \ 1082 if (cond) \
1084 cache_bug(c, __VA_ARGS__); \ 1083 cache_bug(c, __VA_ARGS__); \
1085 } while (0) 1084 } while (0)
1086 1085
1087 #define cache_set_err_on(cond, c, ...) \ 1086 #define cache_set_err_on(cond, c, ...) \
1088 do { \ 1087 do { \
1089 if (cond) \ 1088 if (cond) \
1090 bch_cache_set_error(c, __VA_ARGS__); \ 1089 bch_cache_set_error(c, __VA_ARGS__); \
1091 } while (0) 1090 } while (0)
1092 1091
1093 /* Looping macros */ 1092 /* Looping macros */
1094 1093
1095 #define for_each_cache(ca, cs, iter) \ 1094 #define for_each_cache(ca, cs, iter) \
1096 for (iter = 0; ca = cs->cache[iter], iter < (cs)->sb.nr_in_set; iter++) 1095 for (iter = 0; ca = cs->cache[iter], iter < (cs)->sb.nr_in_set; iter++)
1097 1096
1098 #define for_each_bucket(b, ca) \ 1097 #define for_each_bucket(b, ca) \
1099 for (b = (ca)->buckets + (ca)->sb.first_bucket; \ 1098 for (b = (ca)->buckets + (ca)->sb.first_bucket; \
1100 b < (ca)->buckets + (ca)->sb.nbuckets; b++) 1099 b < (ca)->buckets + (ca)->sb.nbuckets; b++)
1101 1100
1102 static inline void __bkey_put(struct cache_set *c, struct bkey *k) 1101 static inline void __bkey_put(struct cache_set *c, struct bkey *k)
1103 { 1102 {
1104 unsigned i; 1103 unsigned i;
1105 1104
1106 for (i = 0; i < KEY_PTRS(k); i++) 1105 for (i = 0; i < KEY_PTRS(k); i++)
1107 atomic_dec_bug(&PTR_BUCKET(c, k, i)->pin); 1106 atomic_dec_bug(&PTR_BUCKET(c, k, i)->pin);
1108 } 1107 }
1109 1108
1110 static inline void cached_dev_put(struct cached_dev *dc) 1109 static inline void cached_dev_put(struct cached_dev *dc)
1111 { 1110 {
1112 if (atomic_dec_and_test(&dc->count)) 1111 if (atomic_dec_and_test(&dc->count))
1113 schedule_work(&dc->detach); 1112 schedule_work(&dc->detach);
1114 } 1113 }
1115 1114
1116 static inline bool cached_dev_get(struct cached_dev *dc) 1115 static inline bool cached_dev_get(struct cached_dev *dc)
1117 { 1116 {
1118 if (!atomic_inc_not_zero(&dc->count)) 1117 if (!atomic_inc_not_zero(&dc->count))
1119 return false; 1118 return false;
1120 1119
1121 /* Paired with the mb in cached_dev_attach */ 1120 /* Paired with the mb in cached_dev_attach */
1122 smp_mb__after_atomic_inc(); 1121 smp_mb__after_atomic_inc();
1123 return true; 1122 return true;
1124 } 1123 }
1125 1124
1126 /* 1125 /*
1127 * bucket_gc_gen() returns the difference between the bucket's current gen and 1126 * bucket_gc_gen() returns the difference between the bucket's current gen and
1128 * the oldest gen of any pointer into that bucket in the btree (last_gc). 1127 * the oldest gen of any pointer into that bucket in the btree (last_gc).
1129 * 1128 *
1130 * bucket_disk_gen() returns the difference between the current gen and the gen 1129 * bucket_disk_gen() returns the difference between the current gen and the gen
1131 * on disk; they're both used to make sure gens don't wrap around. 1130 * on disk; they're both used to make sure gens don't wrap around.
1132 */ 1131 */
1133 1132
1134 static inline uint8_t bucket_gc_gen(struct bucket *b) 1133 static inline uint8_t bucket_gc_gen(struct bucket *b)
1135 { 1134 {
1136 return b->gen - b->last_gc; 1135 return b->gen - b->last_gc;
1137 } 1136 }
1138 1137
1139 static inline uint8_t bucket_disk_gen(struct bucket *b) 1138 static inline uint8_t bucket_disk_gen(struct bucket *b)
1140 { 1139 {
1141 return b->gen - b->disk_gen; 1140 return b->gen - b->disk_gen;
1142 } 1141 }
1143 1142
1144 #define BUCKET_GC_GEN_MAX 96U 1143 #define BUCKET_GC_GEN_MAX 96U
1145 #define BUCKET_DISK_GEN_MAX 64U 1144 #define BUCKET_DISK_GEN_MAX 64U
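Because gens are 8-bit counters, the subtractions above stay correct across wraparound as long as the distance is kept well under 256, which is what the BUCKET_*_GEN_MAX limits above are for. A tiny sketch of the wraparound arithmetic:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint8_t gen = 2, last_gc = 250;         /* gen has wrapped past 255 */

        /* same arithmetic as bucket_gc_gen(): still a distance of 8 */
        printf("%u\n", (unsigned) (uint8_t) (gen - last_gc));
        return 0;
}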
1146 1145
1147 #define kobj_attribute_write(n, fn) \ 1146 #define kobj_attribute_write(n, fn) \
1148 static struct kobj_attribute ksysfs_##n = __ATTR(n, S_IWUSR, NULL, fn) 1147 static struct kobj_attribute ksysfs_##n = __ATTR(n, S_IWUSR, NULL, fn)
1149 1148
1150 #define kobj_attribute_rw(n, show, store) \ 1149 #define kobj_attribute_rw(n, show, store) \
1151 static struct kobj_attribute ksysfs_##n = \ 1150 static struct kobj_attribute ksysfs_##n = \
1152 __ATTR(n, S_IWUSR|S_IRUSR, show, store) 1151 __ATTR(n, S_IWUSR|S_IRUSR, show, store)
1153 1152
1154 static inline void wake_up_allocators(struct cache_set *c) 1153 static inline void wake_up_allocators(struct cache_set *c)
1155 { 1154 {
1156 struct cache *ca; 1155 struct cache *ca;
1157 unsigned i; 1156 unsigned i;
1158 1157
1159 for_each_cache(ca, c, i) 1158 for_each_cache(ca, c, i)
1160 wake_up_process(ca->alloc_thread); 1159 wake_up_process(ca->alloc_thread);
1161 } 1160 }
1162 1161
1163 /* Forward declarations */ 1162 /* Forward declarations */
1164 1163
1165 void bch_count_io_errors(struct cache *, int, const char *); 1164 void bch_count_io_errors(struct cache *, int, const char *);
1166 void bch_bbio_count_io_errors(struct cache_set *, struct bio *, 1165 void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
1167 int, const char *); 1166 int, const char *);
1168 void bch_bbio_endio(struct cache_set *, struct bio *, int, const char *); 1167 void bch_bbio_endio(struct cache_set *, struct bio *, int, const char *);
1169 void bch_bbio_free(struct bio *, struct cache_set *); 1168 void bch_bbio_free(struct bio *, struct cache_set *);
1170 struct bio *bch_bbio_alloc(struct cache_set *); 1169 struct bio *bch_bbio_alloc(struct cache_set *);
1171 1170
1172 struct bio *bch_bio_split(struct bio *, int, gfp_t, struct bio_set *); 1171 struct bio *bch_bio_split(struct bio *, int, gfp_t, struct bio_set *);
1173 void bch_generic_make_request(struct bio *, struct bio_split_pool *); 1172 void bch_generic_make_request(struct bio *, struct bio_split_pool *);
1174 void __bch_submit_bbio(struct bio *, struct cache_set *); 1173 void __bch_submit_bbio(struct bio *, struct cache_set *);
1175 void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned); 1174 void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);
1176 1175
1177 uint8_t bch_inc_gen(struct cache *, struct bucket *); 1176 uint8_t bch_inc_gen(struct cache *, struct bucket *);
1178 void bch_rescale_priorities(struct cache_set *, int); 1177 void bch_rescale_priorities(struct cache_set *, int);
1179 bool bch_bucket_add_unused(struct cache *, struct bucket *); 1178 bool bch_bucket_add_unused(struct cache *, struct bucket *);
1180 1179
1181 long bch_bucket_alloc(struct cache *, unsigned, struct closure *); 1180 long bch_bucket_alloc(struct cache *, unsigned, struct closure *);
1182 void bch_bucket_free(struct cache_set *, struct bkey *); 1181 void bch_bucket_free(struct cache_set *, struct bkey *);
1183 1182
1184 int __bch_bucket_alloc_set(struct cache_set *, unsigned, 1183 int __bch_bucket_alloc_set(struct cache_set *, unsigned,
1185 struct bkey *, int, struct closure *); 1184 struct bkey *, int, struct closure *);
1186 int bch_bucket_alloc_set(struct cache_set *, unsigned, 1185 int bch_bucket_alloc_set(struct cache_set *, unsigned,
1187 struct bkey *, int, struct closure *); 1186 struct bkey *, int, struct closure *);
1188 1187
1189 __printf(2, 3) 1188 __printf(2, 3)
1190 bool bch_cache_set_error(struct cache_set *, const char *, ...); 1189 bool bch_cache_set_error(struct cache_set *, const char *, ...);
1191 1190
1192 void bch_prio_write(struct cache *); 1191 void bch_prio_write(struct cache *);
1193 void bch_write_bdev_super(struct cached_dev *, struct closure *); 1192 void bch_write_bdev_super(struct cached_dev *, struct closure *);
1194 1193
1195 extern struct workqueue_struct *bcache_wq, *bch_gc_wq; 1194 extern struct workqueue_struct *bcache_wq, *bch_gc_wq;
1196 extern const char * const bch_cache_modes[]; 1195 extern const char * const bch_cache_modes[];
1197 extern struct mutex bch_register_lock; 1196 extern struct mutex bch_register_lock;
1198 extern struct list_head bch_cache_sets; 1197 extern struct list_head bch_cache_sets;
1199 1198
1200 extern struct kobj_type bch_cached_dev_ktype; 1199 extern struct kobj_type bch_cached_dev_ktype;
1201 extern struct kobj_type bch_flash_dev_ktype; 1200 extern struct kobj_type bch_flash_dev_ktype;
1202 extern struct kobj_type bch_cache_set_ktype; 1201 extern struct kobj_type bch_cache_set_ktype;
1203 extern struct kobj_type bch_cache_set_internal_ktype; 1202 extern struct kobj_type bch_cache_set_internal_ktype;
1204 extern struct kobj_type bch_cache_ktype; 1203 extern struct kobj_type bch_cache_ktype;
1205 1204
1206 void bch_cached_dev_release(struct kobject *); 1205 void bch_cached_dev_release(struct kobject *);
1207 void bch_flash_dev_release(struct kobject *); 1206 void bch_flash_dev_release(struct kobject *);
1208 void bch_cache_set_release(struct kobject *); 1207 void bch_cache_set_release(struct kobject *);
1209 void bch_cache_release(struct kobject *); 1208 void bch_cache_release(struct kobject *);
1210 1209
1211 int bch_uuid_write(struct cache_set *); 1210 int bch_uuid_write(struct cache_set *);
1212 void bcache_write_super(struct cache_set *); 1211 void bcache_write_super(struct cache_set *);
1213 1212
1214 int bch_flash_dev_create(struct cache_set *c, uint64_t size); 1213 int bch_flash_dev_create(struct cache_set *c, uint64_t size);
1215 1214
1216 int bch_cached_dev_attach(struct cached_dev *, struct cache_set *); 1215 int bch_cached_dev_attach(struct cached_dev *, struct cache_set *);
1217 void bch_cached_dev_detach(struct cached_dev *); 1216 void bch_cached_dev_detach(struct cached_dev *);
1218 void bch_cached_dev_run(struct cached_dev *); 1217 void bch_cached_dev_run(struct cached_dev *);
1219 void bcache_device_stop(struct bcache_device *); 1218 void bcache_device_stop(struct bcache_device *);
1220 1219
1221 void bch_cache_set_unregister(struct cache_set *); 1220 void bch_cache_set_unregister(struct cache_set *);
1222 void bch_cache_set_stop(struct cache_set *); 1221 void bch_cache_set_stop(struct cache_set *);
1223 1222
1224 struct cache_set *bch_cache_set_alloc(struct cache_sb *); 1223 struct cache_set *bch_cache_set_alloc(struct cache_sb *);
1225 void bch_btree_cache_free(struct cache_set *); 1224 void bch_btree_cache_free(struct cache_set *);
1226 int bch_btree_cache_alloc(struct cache_set *); 1225 int bch_btree_cache_alloc(struct cache_set *);
1227 void bch_moving_init_cache_set(struct cache_set *); 1226 void bch_moving_init_cache_set(struct cache_set *);
1228 1227
1229 int bch_cache_allocator_start(struct cache *ca); 1228 int bch_cache_allocator_start(struct cache *ca);
1230 void bch_cache_allocator_exit(struct cache *ca); 1229 void bch_cache_allocator_exit(struct cache *ca);
1231 int bch_cache_allocator_init(struct cache *ca); 1230 int bch_cache_allocator_init(struct cache *ca);
1232 1231
1233 void bch_debug_exit(void); 1232 void bch_debug_exit(void);
1234 int bch_debug_init(struct kobject *); 1233 int bch_debug_init(struct kobject *);
1235 void bch_writeback_exit(void); 1234 void bch_writeback_exit(void);
1236 int bch_writeback_init(void); 1235 int bch_writeback_init(void);
1237 void bch_request_exit(void); 1236 void bch_request_exit(void);
1238 int bch_request_init(void); 1237 int bch_request_init(void);
1239 void bch_btree_exit(void); 1238 void bch_btree_exit(void);
1240 int bch_btree_init(void); 1239 int bch_btree_init(void);
1241 1240
1242 #endif /* _BCACHE_H */ 1241 #endif /* _BCACHE_H */
drivers/md/bcache/btree.c
1 /* 1 /*
2 * Copyright (C) 2010 Kent Overstreet <kent.overstreet@gmail.com> 2 * Copyright (C) 2010 Kent Overstreet <kent.overstreet@gmail.com>
3 * 3 *
4 * Uses a block device as cache for other block devices; optimized for SSDs. 4 * Uses a block device as cache for other block devices; optimized for SSDs.
5 * All allocation is done in buckets, which should match the erase block size 5 * All allocation is done in buckets, which should match the erase block size
6 * of the device. 6 * of the device.
7 * 7 *
8 * Buckets containing cached data are kept on a heap sorted by priority; 8 * Buckets containing cached data are kept on a heap sorted by priority;
9 * bucket priority is increased on cache hit, and periodically all the buckets 9 * bucket priority is increased on cache hit, and periodically all the buckets
10 * on the heap have their priority scaled down. This currently is just used as 10 * on the heap have their priority scaled down. This currently is just used as
11 * an LRU but in the future should allow for more intelligent heuristics. 11 * an LRU but in the future should allow for more intelligent heuristics.
12 * 12 *
13 * Buckets have an 8 bit counter; freeing is accomplished by incrementing the 13 * Buckets have an 8 bit counter; freeing is accomplished by incrementing the
14 * counter. Garbage collection is used to remove stale pointers. 14 * counter. Garbage collection is used to remove stale pointers.
15 * 15 *
16 * Indexing is done via a btree; nodes are not necessarily fully sorted, rather 16 * Indexing is done via a btree; nodes are not necessarily fully sorted, rather
17 * as keys are inserted we only sort the pages that have not yet been written. 17 * as keys are inserted we only sort the pages that have not yet been written.
18 * When garbage collection is run, we resort the entire node. 18 * When garbage collection is run, we resort the entire node.
19 * 19 *
20 * All configuration is done via sysfs; see Documentation/bcache.txt. 20 * All configuration is done via sysfs; see Documentation/bcache.txt.
21 */ 21 */
22 22
23 #include "bcache.h" 23 #include "bcache.h"
24 #include "btree.h" 24 #include "btree.h"
25 #include "debug.h" 25 #include "debug.h"
26 #include "request.h" 26 #include "request.h"
27 #include "writeback.h" 27 #include "writeback.h"
28 28
29 #include <linux/slab.h> 29 #include <linux/slab.h>
30 #include <linux/bitops.h> 30 #include <linux/bitops.h>
31 #include <linux/hash.h> 31 #include <linux/hash.h>
32 #include <linux/prefetch.h> 32 #include <linux/prefetch.h>
33 #include <linux/random.h> 33 #include <linux/random.h>
34 #include <linux/rcupdate.h> 34 #include <linux/rcupdate.h>
35 #include <trace/events/bcache.h> 35 #include <trace/events/bcache.h>
36 36
37 /* 37 /*
38 * Todo: 38 * Todo:
39 * register_bcache: Return errors out to userspace correctly 39 * register_bcache: Return errors out to userspace correctly
40 * 40 *
41 * Writeback: don't undirty key until after a cache flush 41 * Writeback: don't undirty key until after a cache flush
42 * 42 *
43 * Create an iterator for key pointers 43 * Create an iterator for key pointers
44 * 44 *
45 * On btree write error, mark bucket such that it won't be freed from the cache 45 * On btree write error, mark bucket such that it won't be freed from the cache
46 * 46 *
47 * Journalling: 47 * Journalling:
48 * Check for bad keys in replay 48 * Check for bad keys in replay
49 * Propagate barriers 49 * Propagate barriers
50 * Refcount journal entries in journal_replay 50 * Refcount journal entries in journal_replay
51 * 51 *
52 * Garbage collection: 52 * Garbage collection:
53 * Finish incremental gc 53 * Finish incremental gc
54 * Gc should free old UUIDs, data for invalid UUIDs 54 * Gc should free old UUIDs, data for invalid UUIDs
55 * 55 *
56 * Provide a way to list backing device UUIDs we have data cached for, and 56 * Provide a way to list backing device UUIDs we have data cached for, and
57 * probably how long it's been since we've seen them, and a way to invalidate 57 * probably how long it's been since we've seen them, and a way to invalidate
58 * dirty data for devices that will never be attached again 58 * dirty data for devices that will never be attached again
59 * 59 *
60 * Keep 1 min/5 min/15 min statistics of how busy a block device has been, so 60 * Keep 1 min/5 min/15 min statistics of how busy a block device has been, so
61 * that based on that and how much dirty data we have we can keep writeback 61 * that based on that and how much dirty data we have we can keep writeback
62 * from being starved 62 * from being starved
63 * 63 *
64 * Add a tracepoint or somesuch to watch for writeback starvation 64 * Add a tracepoint or somesuch to watch for writeback starvation
65 * 65 *
66 * When btree depth > 1 and splitting an interior node, we have to make sure 66 * When btree depth > 1 and splitting an interior node, we have to make sure
67 * alloc_bucket() cannot fail. This should be true but is not completely 67 * alloc_bucket() cannot fail. This should be true but is not completely
68 * obvious. 68 * obvious.
69 * 69 *
70 * Make sure all allocations get charged to the root cgroup 70 * Make sure all allocations get charged to the root cgroup
71 * 71 *
72 * Plugging? 72 * Plugging?
73 * 73 *
74 * If data write is less than hard sector size of ssd, round up offset in open 74 * If data write is less than hard sector size of ssd, round up offset in open
75 * bucket to the next whole sector 75 * bucket to the next whole sector
76 * 76 *
77 * Also lookup by cgroup in get_open_bucket() 77 * Also lookup by cgroup in get_open_bucket()
78 * 78 *
79 * Superblock needs to be fleshed out for multiple cache devices 79 * Superblock needs to be fleshed out for multiple cache devices
80 * 80 *
81 * Add a sysfs tunable for the number of writeback IOs in flight 81 * Add a sysfs tunable for the number of writeback IOs in flight
82 * 82 *
83 * Add a sysfs tunable for the number of open data buckets 83 * Add a sysfs tunable for the number of open data buckets
84 * 84 *
85 * IO tracking: Can we track when one process is doing io on behalf of another? 85 * IO tracking: Can we track when one process is doing io on behalf of another?
86 * IO tracking: Don't use just an average, weigh more recent stuff higher 86 * IO tracking: Don't use just an average, weigh more recent stuff higher
87 * 87 *
88 * Test module load/unload 88 * Test module load/unload
89 */ 89 */
90 90
91 static const char * const op_types[] = { 91 static const char * const op_types[] = {
92 "insert", "replace" 92 "insert", "replace"
93 }; 93 };
94 94
95 static const char *op_type(struct btree_op *op) 95 static const char *op_type(struct btree_op *op)
96 { 96 {
97 return op_types[op->type]; 97 return op_types[op->type];
98 } 98 }
99 99
100 #define MAX_NEED_GC 64 100 #define MAX_NEED_GC 64
101 #define MAX_SAVE_PRIO 72 101 #define MAX_SAVE_PRIO 72
102 102
103 #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) 103 #define PTR_DIRTY_BIT (((uint64_t) 1 << 36))
104 104
105 #define PTR_HASH(c, k) \ 105 #define PTR_HASH(c, k) \
106 (((k)->ptr[0] >> c->bucket_bits) | PTR_GEN(k, 0)) 106 (((k)->ptr[0] >> c->bucket_bits) | PTR_GEN(k, 0))
107 107
108 struct workqueue_struct *bch_gc_wq; 108 struct workqueue_struct *bch_gc_wq;
109 static struct workqueue_struct *btree_io_wq; 109 static struct workqueue_struct *btree_io_wq;
110 110
111 void bch_btree_op_init_stack(struct btree_op *op) 111 void bch_btree_op_init_stack(struct btree_op *op)
112 { 112 {
113 memset(op, 0, sizeof(struct btree_op)); 113 memset(op, 0, sizeof(struct btree_op));
114 closure_init_stack(&op->cl); 114 closure_init_stack(&op->cl);
115 op->lock = -1; 115 op->lock = -1;
116 bch_keylist_init(&op->keys); 116 bch_keylist_init(&op->keys);
117 } 117 }
118 118
119 /* Btree key manipulation */ 119 /* Btree key manipulation */
120 120
121 static void bkey_put(struct cache_set *c, struct bkey *k, int level) 121 static void bkey_put(struct cache_set *c, struct bkey *k, int level)
122 { 122 {
123 if ((level && KEY_OFFSET(k)) || !level) 123 if ((level && KEY_OFFSET(k)) || !level)
124 __bkey_put(c, k); 124 __bkey_put(c, k);
125 } 125 }
126 126
127 /* Btree IO */ 127 /* Btree IO */
128 128
129 static uint64_t btree_csum_set(struct btree *b, struct bset *i) 129 static uint64_t btree_csum_set(struct btree *b, struct bset *i)
130 { 130 {
131 uint64_t crc = b->key.ptr[0]; 131 uint64_t crc = b->key.ptr[0];
132 void *data = (void *) i + 8, *end = end(i); 132 void *data = (void *) i + 8, *end = end(i);
133 133
134 crc = bch_crc64_update(crc, data, end - data); 134 crc = bch_crc64_update(crc, data, end - data);
135 return crc ^ 0xffffffffffffffffULL; 135 return crc ^ 0xffffffffffffffffULL;
136 } 136 }
137 137
138 void bch_btree_node_read_done(struct btree *b) 138 void bch_btree_node_read_done(struct btree *b)
139 { 139 {
140 const char *err = "bad btree header"; 140 const char *err = "bad btree header";
141 struct bset *i = b->sets[0].data; 141 struct bset *i = b->sets[0].data;
142 struct btree_iter *iter; 142 struct btree_iter *iter;
143 143
144 iter = mempool_alloc(b->c->fill_iter, GFP_NOWAIT); 144 iter = mempool_alloc(b->c->fill_iter, GFP_NOWAIT);
145 iter->size = b->c->sb.bucket_size / b->c->sb.block_size; 145 iter->size = b->c->sb.bucket_size / b->c->sb.block_size;
146 iter->used = 0; 146 iter->used = 0;
147 147
148 if (!i->seq) 148 if (!i->seq)
149 goto err; 149 goto err;
150 150
151 for (; 151 for (;
152 b->written < btree_blocks(b) && i->seq == b->sets[0].data->seq; 152 b->written < btree_blocks(b) && i->seq == b->sets[0].data->seq;
153 i = write_block(b)) { 153 i = write_block(b)) {
154 err = "unsupported bset version"; 154 err = "unsupported bset version";
155 if (i->version > BCACHE_BSET_VERSION) 155 if (i->version > BCACHE_BSET_VERSION)
156 goto err; 156 goto err;
157 157
158 err = "bad btree header"; 158 err = "bad btree header";
159 if (b->written + set_blocks(i, b->c) > btree_blocks(b)) 159 if (b->written + set_blocks(i, b->c) > btree_blocks(b))
160 goto err; 160 goto err;
161 161
162 err = "bad magic"; 162 err = "bad magic";
163 if (i->magic != bset_magic(b->c)) 163 if (i->magic != bset_magic(b->c))
164 goto err; 164 goto err;
165 165
166 err = "bad checksum"; 166 err = "bad checksum";
167 switch (i->version) { 167 switch (i->version) {
168 case 0: 168 case 0:
169 if (i->csum != csum_set(i)) 169 if (i->csum != csum_set(i))
170 goto err; 170 goto err;
171 break; 171 break;
172 case BCACHE_BSET_VERSION: 172 case BCACHE_BSET_VERSION:
173 if (i->csum != btree_csum_set(b, i)) 173 if (i->csum != btree_csum_set(b, i))
174 goto err; 174 goto err;
175 break; 175 break;
176 } 176 }
177 177
178 err = "empty set"; 178 err = "empty set";
179 if (i != b->sets[0].data && !i->keys) 179 if (i != b->sets[0].data && !i->keys)
180 goto err; 180 goto err;
181 181
182 bch_btree_iter_push(iter, i->start, end(i)); 182 bch_btree_iter_push(iter, i->start, end(i));
183 183
184 b->written += set_blocks(i, b->c); 184 b->written += set_blocks(i, b->c);
185 } 185 }
186 186
187 err = "corrupted btree"; 187 err = "corrupted btree";
188 for (i = write_block(b); 188 for (i = write_block(b);
189 index(i, b) < btree_blocks(b); 189 index(i, b) < btree_blocks(b);
190 i = ((void *) i) + block_bytes(b->c)) 190 i = ((void *) i) + block_bytes(b->c))
191 if (i->seq == b->sets[0].data->seq) 191 if (i->seq == b->sets[0].data->seq)
192 goto err; 192 goto err;
193 193
194 bch_btree_sort_and_fix_extents(b, iter); 194 bch_btree_sort_and_fix_extents(b, iter);
195 195
196 i = b->sets[0].data; 196 i = b->sets[0].data;
197 err = "short btree key"; 197 err = "short btree key";
198 if (b->sets[0].size && 198 if (b->sets[0].size &&
199 bkey_cmp(&b->key, &b->sets[0].end) < 0) 199 bkey_cmp(&b->key, &b->sets[0].end) < 0)
200 goto err; 200 goto err;
201 201
202 if (b->written < btree_blocks(b)) 202 if (b->written < btree_blocks(b))
203 bch_bset_init_next(b); 203 bch_bset_init_next(b);
204 out: 204 out:
205 mempool_free(iter, b->c->fill_iter); 205 mempool_free(iter, b->c->fill_iter);
206 return; 206 return;
207 err: 207 err:
208 set_btree_node_io_error(b); 208 set_btree_node_io_error(b);
209 bch_cache_set_error(b->c, "%s at bucket %zu, block %zu, %u keys", 209 bch_cache_set_error(b->c, "%s at bucket %zu, block %zu, %u keys",
210 err, PTR_BUCKET_NR(b->c, &b->key, 0), 210 err, PTR_BUCKET_NR(b->c, &b->key, 0),
211 index(i, b), i->keys); 211 index(i, b), i->keys);
212 goto out; 212 goto out;
213 } 213 }
214 214
215 static void btree_node_read_endio(struct bio *bio, int error) 215 static void btree_node_read_endio(struct bio *bio, int error)
216 { 216 {
217 struct closure *cl = bio->bi_private; 217 struct closure *cl = bio->bi_private;
218 closure_put(cl); 218 closure_put(cl);
219 } 219 }
220 220
221 void bch_btree_node_read(struct btree *b) 221 void bch_btree_node_read(struct btree *b)
222 { 222 {
223 uint64_t start_time = local_clock(); 223 uint64_t start_time = local_clock();
224 struct closure cl; 224 struct closure cl;
225 struct bio *bio; 225 struct bio *bio;
226 226
227 trace_bcache_btree_read(b); 227 trace_bcache_btree_read(b);
228 228
229 closure_init_stack(&cl); 229 closure_init_stack(&cl);
230 230
231 bio = bch_bbio_alloc(b->c); 231 bio = bch_bbio_alloc(b->c);
232 bio->bi_rw = REQ_META|READ_SYNC; 232 bio->bi_rw = REQ_META|READ_SYNC;
233 bio->bi_size = KEY_SIZE(&b->key) << 9; 233 bio->bi_size = KEY_SIZE(&b->key) << 9;
234 bio->bi_end_io = btree_node_read_endio; 234 bio->bi_end_io = btree_node_read_endio;
235 bio->bi_private = &cl; 235 bio->bi_private = &cl;
236 236
237 bch_bio_map(bio, b->sets[0].data); 237 bch_bio_map(bio, b->sets[0].data);
238 238
239 bch_submit_bbio(bio, b->c, &b->key, 0); 239 bch_submit_bbio(bio, b->c, &b->key, 0);
240 closure_sync(&cl); 240 closure_sync(&cl);
241 241
242 if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) 242 if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
243 set_btree_node_io_error(b); 243 set_btree_node_io_error(b);
244 244
245 bch_bbio_free(bio, b->c); 245 bch_bbio_free(bio, b->c);
246 246
247 if (btree_node_io_error(b)) 247 if (btree_node_io_error(b))
248 goto err; 248 goto err;
249 249
250 bch_btree_node_read_done(b); 250 bch_btree_node_read_done(b);
251 251
252 spin_lock(&b->c->btree_read_time_lock); 252 spin_lock(&b->c->btree_read_time_lock);
253 bch_time_stats_update(&b->c->btree_read_time, start_time); 253 bch_time_stats_update(&b->c->btree_read_time, start_time);
254 spin_unlock(&b->c->btree_read_time_lock); 254 spin_unlock(&b->c->btree_read_time_lock);
255 255
256 return; 256 return;
257 err: 257 err:
258 bch_cache_set_error(b->c, "io error reading bucket %lu", 258 bch_cache_set_error(b->c, "io error reading bucket %lu",
259 PTR_BUCKET_NR(b->c, &b->key, 0)); 259 PTR_BUCKET_NR(b->c, &b->key, 0));
260 } 260 }
261 261
262 static void btree_complete_write(struct btree *b, struct btree_write *w) 262 static void btree_complete_write(struct btree *b, struct btree_write *w)
263 { 263 {
264 if (w->prio_blocked && 264 if (w->prio_blocked &&
265 !atomic_sub_return(w->prio_blocked, &b->c->prio_blocked)) 265 !atomic_sub_return(w->prio_blocked, &b->c->prio_blocked))
266 wake_up_allocators(b->c); 266 wake_up_allocators(b->c);
267 267
268 if (w->journal) { 268 if (w->journal) {
269 atomic_dec_bug(w->journal); 269 atomic_dec_bug(w->journal);
270 __closure_wake_up(&b->c->journal.wait); 270 __closure_wake_up(&b->c->journal.wait);
271 } 271 }
272 272
273 w->prio_blocked = 0; 273 w->prio_blocked = 0;
274 w->journal = NULL; 274 w->journal = NULL;
275 } 275 }
276 276
277 static void __btree_node_write_done(struct closure *cl) 277 static void __btree_node_write_done(struct closure *cl)
278 { 278 {
279 struct btree *b = container_of(cl, struct btree, io.cl); 279 struct btree *b = container_of(cl, struct btree, io.cl);
280 struct btree_write *w = btree_prev_write(b); 280 struct btree_write *w = btree_prev_write(b);
281 281
282 bch_bbio_free(b->bio, b->c); 282 bch_bbio_free(b->bio, b->c);
283 b->bio = NULL; 283 b->bio = NULL;
284 btree_complete_write(b, w); 284 btree_complete_write(b, w);
285 285
286 if (btree_node_dirty(b)) 286 if (btree_node_dirty(b))
287 queue_delayed_work(btree_io_wq, &b->work, 287 queue_delayed_work(btree_io_wq, &b->work,
288 msecs_to_jiffies(30000)); 288 msecs_to_jiffies(30000));
289 289
290 closure_return(cl); 290 closure_return(cl);
291 } 291 }
292 292
293 static void btree_node_write_done(struct closure *cl) 293 static void btree_node_write_done(struct closure *cl)
294 { 294 {
295 struct btree *b = container_of(cl, struct btree, io.cl); 295 struct btree *b = container_of(cl, struct btree, io.cl);
296 struct bio_vec *bv; 296 struct bio_vec *bv;
297 int n; 297 int n;
298 298
299 __bio_for_each_segment(bv, b->bio, n, 0) 299 __bio_for_each_segment(bv, b->bio, n, 0)
300 __free_page(bv->bv_page); 300 __free_page(bv->bv_page);
301 301
302 __btree_node_write_done(cl); 302 __btree_node_write_done(cl);
303 } 303 }
304 304
305 static void btree_node_write_endio(struct bio *bio, int error) 305 static void btree_node_write_endio(struct bio *bio, int error)
306 { 306 {
307 struct closure *cl = bio->bi_private; 307 struct closure *cl = bio->bi_private;
308 struct btree *b = container_of(cl, struct btree, io.cl); 308 struct btree *b = container_of(cl, struct btree, io.cl);
309 309
310 if (error) 310 if (error)
311 set_btree_node_io_error(b); 311 set_btree_node_io_error(b);
312 312
313 bch_bbio_count_io_errors(b->c, bio, error, "writing btree"); 313 bch_bbio_count_io_errors(b->c, bio, error, "writing btree");
314 closure_put(cl); 314 closure_put(cl);
315 } 315 }
316 316
317 static void do_btree_node_write(struct btree *b) 317 static void do_btree_node_write(struct btree *b)
318 { 318 {
319 struct closure *cl = &b->io.cl; 319 struct closure *cl = &b->io.cl;
320 struct bset *i = b->sets[b->nsets].data; 320 struct bset *i = b->sets[b->nsets].data;
321 BKEY_PADDED(key) k; 321 BKEY_PADDED(key) k;
322 322
323 i->version = BCACHE_BSET_VERSION; 323 i->version = BCACHE_BSET_VERSION;
324 i->csum = btree_csum_set(b, i); 324 i->csum = btree_csum_set(b, i);
325 325
326 BUG_ON(b->bio); 326 BUG_ON(b->bio);
327 b->bio = bch_bbio_alloc(b->c); 327 b->bio = bch_bbio_alloc(b->c);
328 328
329 b->bio->bi_end_io = btree_node_write_endio; 329 b->bio->bi_end_io = btree_node_write_endio;
330 b->bio->bi_private = &b->io.cl; 330 b->bio->bi_private = &b->io.cl;
331 b->bio->bi_rw = REQ_META|WRITE_SYNC; 331 b->bio->bi_rw = REQ_META|WRITE_SYNC;
332 b->bio->bi_size = set_blocks(i, b->c) * block_bytes(b->c); 332 b->bio->bi_size = set_blocks(i, b->c) * block_bytes(b->c);
333 bch_bio_map(b->bio, i); 333 bch_bio_map(b->bio, i);
334 334
335 bkey_copy(&k.key, &b->key); 335 bkey_copy(&k.key, &b->key);
336 SET_PTR_OFFSET(&k.key, 0, PTR_OFFSET(&k.key, 0) + bset_offset(b, i)); 336 SET_PTR_OFFSET(&k.key, 0, PTR_OFFSET(&k.key, 0) + bset_offset(b, i));
337 337
338 if (!bch_bio_alloc_pages(b->bio, GFP_NOIO)) { 338 if (!bch_bio_alloc_pages(b->bio, GFP_NOIO)) {
339 int j; 339 int j;
340 struct bio_vec *bv; 340 struct bio_vec *bv;
341 void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1)); 341 void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
342 342
343 bio_for_each_segment(bv, b->bio, j) 343 bio_for_each_segment(bv, b->bio, j)
344 memcpy(page_address(bv->bv_page), 344 memcpy(page_address(bv->bv_page),
345 base + j * PAGE_SIZE, PAGE_SIZE); 345 base + j * PAGE_SIZE, PAGE_SIZE);
346 346
347 bch_submit_bbio(b->bio, b->c, &k.key, 0); 347 bch_submit_bbio(b->bio, b->c, &k.key, 0);
348 348
349 continue_at(cl, btree_node_write_done, NULL); 349 continue_at(cl, btree_node_write_done, NULL);
350 } else { 350 } else {
351 b->bio->bi_vcnt = 0; 351 b->bio->bi_vcnt = 0;
352 bch_bio_map(b->bio, i); 352 bch_bio_map(b->bio, i);
353 353
354 bch_submit_bbio(b->bio, b->c, &k.key, 0); 354 bch_submit_bbio(b->bio, b->c, &k.key, 0);
355 355
356 closure_sync(cl); 356 closure_sync(cl);
357 __btree_node_write_done(cl); 357 __btree_node_write_done(cl);
358 } 358 }
359 } 359 }
360 360
361 void bch_btree_node_write(struct btree *b, struct closure *parent) 361 void bch_btree_node_write(struct btree *b, struct closure *parent)
362 { 362 {
363 struct bset *i = b->sets[b->nsets].data; 363 struct bset *i = b->sets[b->nsets].data;
364 364
365 trace_bcache_btree_write(b); 365 trace_bcache_btree_write(b);
366 366
367 BUG_ON(current->bio_list); 367 BUG_ON(current->bio_list);
368 BUG_ON(b->written >= btree_blocks(b)); 368 BUG_ON(b->written >= btree_blocks(b));
369 BUG_ON(b->written && !i->keys); 369 BUG_ON(b->written && !i->keys);
370 BUG_ON(b->sets->data->seq != i->seq); 370 BUG_ON(b->sets->data->seq != i->seq);
371 bch_check_key_order(b, i); 371 bch_check_key_order(b, i);
372 372
373 cancel_delayed_work(&b->work); 373 cancel_delayed_work(&b->work);
374 374
375 /* If caller isn't waiting for write, parent refcount is cache set */ 375 /* If caller isn't waiting for write, parent refcount is cache set */
376 closure_lock(&b->io, parent ?: &b->c->cl); 376 closure_lock(&b->io, parent ?: &b->c->cl);
377 377
378 clear_bit(BTREE_NODE_dirty, &b->flags); 378 clear_bit(BTREE_NODE_dirty, &b->flags);
379 change_bit(BTREE_NODE_write_idx, &b->flags); 379 change_bit(BTREE_NODE_write_idx, &b->flags);
380 380
381 do_btree_node_write(b); 381 do_btree_node_write(b);
382 382
383 b->written += set_blocks(i, b->c); 383 b->written += set_blocks(i, b->c);
384 atomic_long_add(set_blocks(i, b->c) * b->c->sb.block_size, 384 atomic_long_add(set_blocks(i, b->c) * b->c->sb.block_size,
385 &PTR_CACHE(b->c, &b->key, 0)->btree_sectors_written); 385 &PTR_CACHE(b->c, &b->key, 0)->btree_sectors_written);
386 386
387 bch_btree_sort_lazy(b); 387 bch_btree_sort_lazy(b);
388 388
389 if (b->written < btree_blocks(b)) 389 if (b->written < btree_blocks(b))
390 bch_bset_init_next(b); 390 bch_bset_init_next(b);
391 } 391 }
392 392
393 static void btree_node_write_work(struct work_struct *w) 393 static void btree_node_write_work(struct work_struct *w)
394 { 394 {
395 struct btree *b = container_of(to_delayed_work(w), struct btree, work); 395 struct btree *b = container_of(to_delayed_work(w), struct btree, work);
396 396
397 rw_lock(true, b, b->level); 397 rw_lock(true, b, b->level);
398 398
399 if (btree_node_dirty(b)) 399 if (btree_node_dirty(b))
400 bch_btree_node_write(b, NULL); 400 bch_btree_node_write(b, NULL);
401 rw_unlock(true, b); 401 rw_unlock(true, b);
402 } 402 }
403 403
404 static void bch_btree_leaf_dirty(struct btree *b, struct btree_op *op) 404 static void bch_btree_leaf_dirty(struct btree *b, struct btree_op *op)
405 { 405 {
406 struct bset *i = b->sets[b->nsets].data; 406 struct bset *i = b->sets[b->nsets].data;
407 struct btree_write *w = btree_current_write(b); 407 struct btree_write *w = btree_current_write(b);
408 408
409 BUG_ON(!b->written); 409 BUG_ON(!b->written);
410 BUG_ON(!i->keys); 410 BUG_ON(!i->keys);
411 411
412 if (!btree_node_dirty(b)) 412 if (!btree_node_dirty(b))
413 queue_delayed_work(btree_io_wq, &b->work, 30 * HZ); 413 queue_delayed_work(btree_io_wq, &b->work, 30 * HZ);
414 414
415 set_btree_node_dirty(b); 415 set_btree_node_dirty(b);
416 416
417 if (op && op->journal) { 417 if (op && op->journal) {
418 if (w->journal && 418 if (w->journal &&
419 journal_pin_cmp(b->c, w, op)) { 419 journal_pin_cmp(b->c, w, op)) {
420 atomic_dec_bug(w->journal); 420 atomic_dec_bug(w->journal);
421 w->journal = NULL; 421 w->journal = NULL;
422 } 422 }
423 423
424 if (!w->journal) { 424 if (!w->journal) {
425 w->journal = op->journal; 425 w->journal = op->journal;
426 atomic_inc(w->journal); 426 atomic_inc(w->journal);
427 } 427 }
428 } 428 }
429 429
430 /* Force write if set is too big */ 430 /* Force write if set is too big */
431 if (set_bytes(i) > PAGE_SIZE - 48 && 431 if (set_bytes(i) > PAGE_SIZE - 48 &&
432 !current->bio_list) 432 !current->bio_list)
433 bch_btree_node_write(b, NULL); 433 bch_btree_node_write(b, NULL);
434 } 434 }
435 435
436 /* 436 /*
437 * Btree in memory cache - allocation/freeing 437 * Btree in memory cache - allocation/freeing
438 * mca -> memory cache 438 * mca -> memory cache
439 */ 439 */
440 440
441 static void mca_reinit(struct btree *b) 441 static void mca_reinit(struct btree *b)
442 { 442 {
443 unsigned i; 443 unsigned i;
444 444
445 b->flags = 0; 445 b->flags = 0;
446 b->written = 0; 446 b->written = 0;
447 b->nsets = 0; 447 b->nsets = 0;
448 448
449 for (i = 0; i < MAX_BSETS; i++) 449 for (i = 0; i < MAX_BSETS; i++)
450 b->sets[i].size = 0; 450 b->sets[i].size = 0;
451 /* 451 /*
452 * Second loop starts at 1 because b->sets[0]->data is the memory we 452 * Second loop starts at 1 because b->sets[0]->data is the memory we
453 * allocated 453 * allocated
454 */ 454 */
455 for (i = 1; i < MAX_BSETS; i++) 455 for (i = 1; i < MAX_BSETS; i++)
456 b->sets[i].data = NULL; 456 b->sets[i].data = NULL;
457 } 457 }
458 458
459 #define mca_reserve(c) (((c->root && c->root->level) \ 459 #define mca_reserve(c) (((c->root && c->root->level) \
460 ? c->root->level : 1) * 8 + 16) 460 ? c->root->level : 1) * 8 + 16)
461 #define mca_can_free(c) \ 461 #define mca_can_free(c) \
462 max_t(int, 0, c->bucket_cache_used - mca_reserve(c)) 462 max_t(int, 0, c->bucket_cache_used - mca_reserve(c))
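For a sense of scale: with a depth-2 root, mca_reserve() holds 2 * 8 + 16 = 32 nodes back from the shrinker, and mca_can_free() is whatever the cache holds beyond that. A sketch of the same arithmetic with assumed numbers:

#include <stdio.h>

int main(void)
{
        int root_level = 2;                     /* assumed btree depth */
        int bucket_cache_used = 50;             /* assumed cached nodes */

        int reserve  = (root_level ? root_level : 1) * 8 + 16;         /* 32 */
        int can_free = bucket_cache_used - reserve;                    /* 18 */

        printf("reserve=%d can_free=%d\n", reserve, can_free > 0 ? can_free : 0);
        return 0;
}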
463 463
464 static void mca_data_free(struct btree *b) 464 static void mca_data_free(struct btree *b)
465 { 465 {
466 struct bset_tree *t = b->sets; 466 struct bset_tree *t = b->sets;
467 BUG_ON(!closure_is_unlocked(&b->io.cl)); 467 BUG_ON(!closure_is_unlocked(&b->io.cl));
468 468
469 if (bset_prev_bytes(b) < PAGE_SIZE) 469 if (bset_prev_bytes(b) < PAGE_SIZE)
470 kfree(t->prev); 470 kfree(t->prev);
471 else 471 else
472 free_pages((unsigned long) t->prev, 472 free_pages((unsigned long) t->prev,
473 get_order(bset_prev_bytes(b))); 473 get_order(bset_prev_bytes(b)));
474 474
475 if (bset_tree_bytes(b) < PAGE_SIZE) 475 if (bset_tree_bytes(b) < PAGE_SIZE)
476 kfree(t->tree); 476 kfree(t->tree);
477 else 477 else
478 free_pages((unsigned long) t->tree, 478 free_pages((unsigned long) t->tree,
479 get_order(bset_tree_bytes(b))); 479 get_order(bset_tree_bytes(b)));
480 480
481 free_pages((unsigned long) t->data, b->page_order); 481 free_pages((unsigned long) t->data, b->page_order);
482 482
483 t->prev = NULL; 483 t->prev = NULL;
484 t->tree = NULL; 484 t->tree = NULL;
485 t->data = NULL; 485 t->data = NULL;
486 list_move(&b->list, &b->c->btree_cache_freed); 486 list_move(&b->list, &b->c->btree_cache_freed);
487 b->c->bucket_cache_used--; 487 b->c->bucket_cache_used--;
488 } 488 }
489 489
490 static void mca_bucket_free(struct btree *b) 490 static void mca_bucket_free(struct btree *b)
491 { 491 {
492 BUG_ON(btree_node_dirty(b)); 492 BUG_ON(btree_node_dirty(b));
493 493
494 b->key.ptr[0] = 0; 494 b->key.ptr[0] = 0;
495 hlist_del_init_rcu(&b->hash); 495 hlist_del_init_rcu(&b->hash);
496 list_move(&b->list, &b->c->btree_cache_freeable); 496 list_move(&b->list, &b->c->btree_cache_freeable);
497 } 497 }
498 498
499 static unsigned btree_order(struct bkey *k) 499 static unsigned btree_order(struct bkey *k)
500 { 500 {
501 return ilog2(KEY_SIZE(k) / PAGE_SECTORS ?: 1); 501 return ilog2(KEY_SIZE(k) / PAGE_SECTORS ?: 1);
502 } 502 }
503 503
504 static void mca_data_alloc(struct btree *b, struct bkey *k, gfp_t gfp) 504 static void mca_data_alloc(struct btree *b, struct bkey *k, gfp_t gfp)
505 { 505 {
506 struct bset_tree *t = b->sets; 506 struct bset_tree *t = b->sets;
507 BUG_ON(t->data); 507 BUG_ON(t->data);
508 508
509 b->page_order = max_t(unsigned, 509 b->page_order = max_t(unsigned,
510 ilog2(b->c->btree_pages), 510 ilog2(b->c->btree_pages),
511 btree_order(k)); 511 btree_order(k));
512 512
513 t->data = (void *) __get_free_pages(gfp, b->page_order); 513 t->data = (void *) __get_free_pages(gfp, b->page_order);
514 if (!t->data) 514 if (!t->data)
515 goto err; 515 goto err;
516 516
517 t->tree = bset_tree_bytes(b) < PAGE_SIZE 517 t->tree = bset_tree_bytes(b) < PAGE_SIZE
518 ? kmalloc(bset_tree_bytes(b), gfp) 518 ? kmalloc(bset_tree_bytes(b), gfp)
519 : (void *) __get_free_pages(gfp, get_order(bset_tree_bytes(b))); 519 : (void *) __get_free_pages(gfp, get_order(bset_tree_bytes(b)));
520 if (!t->tree) 520 if (!t->tree)
521 goto err; 521 goto err;
522 522
523 t->prev = bset_prev_bytes(b) < PAGE_SIZE 523 t->prev = bset_prev_bytes(b) < PAGE_SIZE
524 ? kmalloc(bset_prev_bytes(b), gfp) 524 ? kmalloc(bset_prev_bytes(b), gfp)
525 : (void *) __get_free_pages(gfp, get_order(bset_prev_bytes(b))); 525 : (void *) __get_free_pages(gfp, get_order(bset_prev_bytes(b)));
526 if (!t->prev) 526 if (!t->prev)
527 goto err; 527 goto err;
528 528
529 list_move(&b->list, &b->c->btree_cache); 529 list_move(&b->list, &b->c->btree_cache);
530 b->c->bucket_cache_used++; 530 b->c->bucket_cache_used++;
531 return; 531 return;
532 err: 532 err:
533 mca_data_free(b); 533 mca_data_free(b);
534 } 534 }
535 535
536 static struct btree *mca_bucket_alloc(struct cache_set *c, 536 static struct btree *mca_bucket_alloc(struct cache_set *c,
537 struct bkey *k, gfp_t gfp) 537 struct bkey *k, gfp_t gfp)
538 { 538 {
539 struct btree *b = kzalloc(sizeof(struct btree), gfp); 539 struct btree *b = kzalloc(sizeof(struct btree), gfp);
540 if (!b) 540 if (!b)
541 return NULL; 541 return NULL;
542 542
543 init_rwsem(&b->lock); 543 init_rwsem(&b->lock);
544 lockdep_set_novalidate_class(&b->lock); 544 lockdep_set_novalidate_class(&b->lock);
545 INIT_LIST_HEAD(&b->list); 545 INIT_LIST_HEAD(&b->list);
546 INIT_DELAYED_WORK(&b->work, btree_node_write_work); 546 INIT_DELAYED_WORK(&b->work, btree_node_write_work);
547 b->c = c; 547 b->c = c;
548 closure_init_unlocked(&b->io); 548 closure_init_unlocked(&b->io);
549 549
550 mca_data_alloc(b, k, gfp); 550 mca_data_alloc(b, k, gfp);
551 return b; 551 return b;
552 } 552 }
553 553
554 static int mca_reap(struct btree *b, struct closure *cl, unsigned min_order) 554 static int mca_reap(struct btree *b, struct closure *cl, unsigned min_order)
555 { 555 {
556 lockdep_assert_held(&b->c->bucket_lock); 556 lockdep_assert_held(&b->c->bucket_lock);
557 557
558 if (!down_write_trylock(&b->lock)) 558 if (!down_write_trylock(&b->lock))
559 return -ENOMEM; 559 return -ENOMEM;
560 560
561 if (b->page_order < min_order) { 561 if (b->page_order < min_order) {
562 rw_unlock(true, b); 562 rw_unlock(true, b);
563 return -ENOMEM; 563 return -ENOMEM;
564 } 564 }
565 565
566 BUG_ON(btree_node_dirty(b) && !b->sets[0].data); 566 BUG_ON(btree_node_dirty(b) && !b->sets[0].data);
567 567
568 if (cl && btree_node_dirty(b)) 568 if (cl && btree_node_dirty(b))
569 bch_btree_node_write(b, NULL); 569 bch_btree_node_write(b, NULL);
570 570
571 if (cl) 571 if (cl)
572 closure_wait_event_async(&b->io.wait, cl, 572 closure_wait_event_async(&b->io.wait, cl,
573 atomic_read(&b->io.cl.remaining) == -1); 573 atomic_read(&b->io.cl.remaining) == -1);
574 574
575 if (btree_node_dirty(b) || 575 if (btree_node_dirty(b) ||
576 !closure_is_unlocked(&b->io.cl) || 576 !closure_is_unlocked(&b->io.cl) ||
577 work_pending(&b->work.work)) { 577 work_pending(&b->work.work)) {
578 rw_unlock(true, b); 578 rw_unlock(true, b);
579 return -EAGAIN; 579 return -EAGAIN;
580 } 580 }
581 581
582 return 0; 582 return 0;
583 } 583 }
584 584
585 static int bch_mca_shrink(struct shrinker *shrink, struct shrink_control *sc) 585 static int bch_mca_shrink(struct shrinker *shrink, struct shrink_control *sc)
586 { 586 {
587 struct cache_set *c = container_of(shrink, struct cache_set, shrink); 587 struct cache_set *c = container_of(shrink, struct cache_set, shrink);
588 struct btree *b, *t; 588 struct btree *b, *t;
589 unsigned long i, nr = sc->nr_to_scan; 589 unsigned long i, nr = sc->nr_to_scan;
590 590
591 if (c->shrinker_disabled) 591 if (c->shrinker_disabled)
592 return 0; 592 return 0;
593 593
594 if (c->try_harder) 594 if (c->try_harder)
595 return 0; 595 return 0;
596 596
597 /* 597 /*
598 * If nr == 0, we're supposed to return the number of items we have 598 * If nr == 0, we're supposed to return the number of items we have
599 * cached. Not allowed to return -1. 599 * cached. Not allowed to return -1.
600 */ 600 */
601 if (!nr) 601 if (!nr)
602 return mca_can_free(c) * c->btree_pages; 602 return mca_can_free(c) * c->btree_pages;
603 603
604 /* Return -1 if we can't do anything right now */ 604 /* Return -1 if we can't do anything right now */
605 if (sc->gfp_mask & __GFP_WAIT) 605 if (sc->gfp_mask & __GFP_WAIT)
606 mutex_lock(&c->bucket_lock); 606 mutex_lock(&c->bucket_lock);
607 else if (!mutex_trylock(&c->bucket_lock)) 607 else if (!mutex_trylock(&c->bucket_lock))
608 return -1; 608 return -1;
609 609
610 nr /= c->btree_pages; 610 nr /= c->btree_pages;
611 nr = min_t(unsigned long, nr, mca_can_free(c)); 611 nr = min_t(unsigned long, nr, mca_can_free(c));
612 612
613 i = 0; 613 i = 0;
614 list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) { 614 list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
615 if (!nr) 615 if (!nr)
616 break; 616 break;
617 617
618 if (++i > 3 && 618 if (++i > 3 &&
619 !mca_reap(b, NULL, 0)) { 619 !mca_reap(b, NULL, 0)) {
620 mca_data_free(b); 620 mca_data_free(b);
621 rw_unlock(true, b); 621 rw_unlock(true, b);
622 --nr; 622 --nr;
623 } 623 }
624 } 624 }
625 625
626 /* 626 /*
627 * Can happen right when we first start up, before we've read in any 627 * Can happen right when we first start up, before we've read in any
628 * btree nodes 628 * btree nodes
629 */ 629 */
630 if (list_empty(&c->btree_cache)) 630 if (list_empty(&c->btree_cache))
631 goto out; 631 goto out;
632 632
633 for (i = 0; nr && i < c->bucket_cache_used; i++) { 633 for (i = 0; nr && i < c->bucket_cache_used; i++) {
634 b = list_first_entry(&c->btree_cache, struct btree, list); 634 b = list_first_entry(&c->btree_cache, struct btree, list);
635 list_rotate_left(&c->btree_cache); 635 list_rotate_left(&c->btree_cache);
636 636
637 if (!b->accessed && 637 if (!b->accessed &&
638 !mca_reap(b, NULL, 0)) { 638 !mca_reap(b, NULL, 0)) {
639 mca_bucket_free(b); 639 mca_bucket_free(b);
640 mca_data_free(b); 640 mca_data_free(b);
641 rw_unlock(true, b); 641 rw_unlock(true, b);
642 --nr; 642 --nr;
643 } else 643 } else
644 b->accessed = 0; 644 b->accessed = 0;
645 } 645 }
646 out: 646 out:
647 nr = mca_can_free(c) * c->btree_pages; 647 nr = mca_can_free(c) * c->btree_pages;
648 mutex_unlock(&c->bucket_lock); 648 mutex_unlock(&c->bucket_lock);
649 return nr; 649 return nr;
650 } 650 }
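
bch_mca_shrink() follows the shrinker contract of this kernel era: nr_to_scan == 0 means "report roughly how many objects you could free", a nonzero count means "try to free about this many", and -1 means "no progress possible right now" (here, when the bucket lock can't be taken without sleeping). A toy, userspace-only sketch of that contract, with illustrative names rather than the real struct shrinker API:

#include <stddef.h>

struct toy_cache {
        size_t nr_cached;       /* objects currently held */
        int    locked;          /* nonzero: can't touch the cache right now */
};

/* Returns the remaining object count, or -1 if nothing could be done. */
static long toy_shrink(struct toy_cache *c, unsigned long nr_to_scan)
{
        if (!nr_to_scan)
                return (long) c->nr_cached;     /* just report the count */

        if (c->locked)
                return -1;                      /* try again later */

        while (nr_to_scan-- && c->nr_cached)
                c->nr_cached--;                 /* "free" one object */

        return (long) c->nr_cached;
}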
651 651
652 void bch_btree_cache_free(struct cache_set *c) 652 void bch_btree_cache_free(struct cache_set *c)
653 { 653 {
654 struct btree *b; 654 struct btree *b;
655 struct closure cl; 655 struct closure cl;
656 closure_init_stack(&cl); 656 closure_init_stack(&cl);
657 657
658 if (c->shrink.list.next) 658 if (c->shrink.list.next)
659 unregister_shrinker(&c->shrink); 659 unregister_shrinker(&c->shrink);
660 660
661 mutex_lock(&c->bucket_lock); 661 mutex_lock(&c->bucket_lock);
662 662
663 #ifdef CONFIG_BCACHE_DEBUG 663 #ifdef CONFIG_BCACHE_DEBUG
664 if (c->verify_data) 664 if (c->verify_data)
665 list_move(&c->verify_data->list, &c->btree_cache); 665 list_move(&c->verify_data->list, &c->btree_cache);
666 #endif 666 #endif
667 667
668 list_splice(&c->btree_cache_freeable, 668 list_splice(&c->btree_cache_freeable,
669 &c->btree_cache); 669 &c->btree_cache);
670 670
671 while (!list_empty(&c->btree_cache)) { 671 while (!list_empty(&c->btree_cache)) {
672 b = list_first_entry(&c->btree_cache, struct btree, list); 672 b = list_first_entry(&c->btree_cache, struct btree, list);
673 673
674 if (btree_node_dirty(b)) 674 if (btree_node_dirty(b))
675 btree_complete_write(b, btree_current_write(b)); 675 btree_complete_write(b, btree_current_write(b));
676 clear_bit(BTREE_NODE_dirty, &b->flags); 676 clear_bit(BTREE_NODE_dirty, &b->flags);
677 677
678 mca_data_free(b); 678 mca_data_free(b);
679 } 679 }
680 680
681 while (!list_empty(&c->btree_cache_freed)) { 681 while (!list_empty(&c->btree_cache_freed)) {
682 b = list_first_entry(&c->btree_cache_freed, 682 b = list_first_entry(&c->btree_cache_freed,
683 struct btree, list); 683 struct btree, list);
684 list_del(&b->list); 684 list_del(&b->list);
685 cancel_delayed_work_sync(&b->work); 685 cancel_delayed_work_sync(&b->work);
686 kfree(b); 686 kfree(b);
687 } 687 }
688 688
689 mutex_unlock(&c->bucket_lock); 689 mutex_unlock(&c->bucket_lock);
690 } 690 }
691 691
692 int bch_btree_cache_alloc(struct cache_set *c) 692 int bch_btree_cache_alloc(struct cache_set *c)
693 { 693 {
694 unsigned i; 694 unsigned i;
695 695
696 /* XXX: doesn't check for errors */ 696 /* XXX: doesn't check for errors */
697 697
698 closure_init_unlocked(&c->gc); 698 closure_init_unlocked(&c->gc);
699 699
700 for (i = 0; i < mca_reserve(c); i++) 700 for (i = 0; i < mca_reserve(c); i++)
701 mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL); 701 mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL);
702 702
703 list_splice_init(&c->btree_cache, 703 list_splice_init(&c->btree_cache,
704 &c->btree_cache_freeable); 704 &c->btree_cache_freeable);
705 705
706 #ifdef CONFIG_BCACHE_DEBUG 706 #ifdef CONFIG_BCACHE_DEBUG
707 mutex_init(&c->verify_lock); 707 mutex_init(&c->verify_lock);
708 708
709 c->verify_data = mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL); 709 c->verify_data = mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL);
710 710
711 if (c->verify_data && 711 if (c->verify_data &&
712 c->verify_data->sets[0].data) 712 c->verify_data->sets[0].data)
713 list_del_init(&c->verify_data->list); 713 list_del_init(&c->verify_data->list);
714 else 714 else
715 c->verify_data = NULL; 715 c->verify_data = NULL;
716 #endif 716 #endif
717 717
718 c->shrink.shrink = bch_mca_shrink; 718 c->shrink.shrink = bch_mca_shrink;
719 c->shrink.seeks = 4; 719 c->shrink.seeks = 4;
720 c->shrink.batch = c->btree_pages * 2; 720 c->shrink.batch = c->btree_pages * 2;
721 register_shrinker(&c->shrink); 721 register_shrinker(&c->shrink);
722 722
723 return 0; 723 return 0;
724 } 724 }
725 725
726 /* Btree in memory cache - hash table */ 726 /* Btree in memory cache - hash table */
727 727
728 static struct hlist_head *mca_hash(struct cache_set *c, struct bkey *k) 728 static struct hlist_head *mca_hash(struct cache_set *c, struct bkey *k)
729 { 729 {
730 return &c->bucket_hash[hash_32(PTR_HASH(c, k), BUCKET_HASH_BITS)]; 730 return &c->bucket_hash[hash_32(PTR_HASH(c, k), BUCKET_HASH_BITS)];
731 } 731 }
732 732
733 static struct btree *mca_find(struct cache_set *c, struct bkey *k) 733 static struct btree *mca_find(struct cache_set *c, struct bkey *k)
734 { 734 {
735 struct btree *b; 735 struct btree *b;
736 736
737 rcu_read_lock(); 737 rcu_read_lock();
738 hlist_for_each_entry_rcu(b, mca_hash(c, k), hash) 738 hlist_for_each_entry_rcu(b, mca_hash(c, k), hash)
739 if (PTR_HASH(c, &b->key) == PTR_HASH(c, k)) 739 if (PTR_HASH(c, &b->key) == PTR_HASH(c, k))
740 goto out; 740 goto out;
741 b = NULL; 741 b = NULL;
742 out: 742 out:
743 rcu_read_unlock(); 743 rcu_read_unlock();
744 return b; 744 return b;
745 } 745 }
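
mca_find() is an ordinary chained hash lookup: hash the on-disk pointer to pick a bucket in c->bucket_hash, then walk the short chain under rcu_read_lock() comparing PTR_HASH values. A self-contained sketch of the same shape, with a made-up hash function and plain next pointers instead of RCU-protected hlists:

#include <stddef.h>
#include <stdint.h>

#define HASH_BITS       8
#define HASH_BUCKETS    (1u << HASH_BITS)

struct hnode {
        struct hnode *next;
        uint64_t      key;
};

/* Multiplicative hash into a small table; the constant is just a
 * well-mixed odd number, not the kernel's hash_32(). */
static unsigned toy_hash(uint64_t key)
{
        return (unsigned) ((key * 0x9e3779b97f4a7c15ull) >> (64 - HASH_BITS));
}

static struct hnode *toy_find(struct hnode *table[HASH_BUCKETS], uint64_t key)
{
        struct hnode *n;

        for (n = table[toy_hash(key)]; n; n = n->next)
                if (n->key == key)
                        return n;
        return NULL;
}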
746 746
747 static struct btree *mca_cannibalize(struct cache_set *c, struct bkey *k, 747 static struct btree *mca_cannibalize(struct cache_set *c, struct bkey *k,
748 int level, struct closure *cl) 748 int level, struct closure *cl)
749 { 749 {
750 int ret = -ENOMEM; 750 int ret = -ENOMEM;
751 struct btree *i; 751 struct btree *i;
752 752
753 trace_bcache_btree_cache_cannibalize(c); 753 trace_bcache_btree_cache_cannibalize(c);
754 754
755 if (!cl) 755 if (!cl)
756 return ERR_PTR(-ENOMEM); 756 return ERR_PTR(-ENOMEM);
757 757
758 /* 758 /*
759 * Trying to free up some memory - i.e. reuse some btree nodes - may 759 * Trying to free up some memory - i.e. reuse some btree nodes - may
760 * require initiating IO to flush the dirty part of the node. If we're 760 * require initiating IO to flush the dirty part of the node. If we're
761 * running under generic_make_request(), that IO will never finish and 761 * running under generic_make_request(), that IO will never finish and
762 * we would deadlock. Returning -EAGAIN causes the cache lookup code to 762 * we would deadlock. Returning -EAGAIN causes the cache lookup code to
763 * punt to workqueue and retry. 763 * punt to workqueue and retry.
764 */ 764 */
765 if (current->bio_list) 765 if (current->bio_list)
766 return ERR_PTR(-EAGAIN); 766 return ERR_PTR(-EAGAIN);
767 767
768 if (c->try_harder && c->try_harder != cl) { 768 if (c->try_harder && c->try_harder != cl) {
769 closure_wait_event_async(&c->try_wait, cl, !c->try_harder); 769 closure_wait_event_async(&c->try_wait, cl, !c->try_harder);
770 return ERR_PTR(-EAGAIN); 770 return ERR_PTR(-EAGAIN);
771 } 771 }
772 772
773 c->try_harder = cl; 773 c->try_harder = cl;
774 c->try_harder_start = local_clock(); 774 c->try_harder_start = local_clock();
775 retry: 775 retry:
776 list_for_each_entry_reverse(i, &c->btree_cache, list) { 776 list_for_each_entry_reverse(i, &c->btree_cache, list) {
777 int r = mca_reap(i, cl, btree_order(k)); 777 int r = mca_reap(i, cl, btree_order(k));
778 if (!r) 778 if (!r)
779 return i; 779 return i;
780 if (r != -ENOMEM) 780 if (r != -ENOMEM)
781 ret = r; 781 ret = r;
782 } 782 }
783 783
784 if (ret == -EAGAIN && 784 if (ret == -EAGAIN &&
785 closure_blocking(cl)) { 785 closure_blocking(cl)) {
786 mutex_unlock(&c->bucket_lock); 786 mutex_unlock(&c->bucket_lock);
787 closure_sync(cl); 787 closure_sync(cl);
788 mutex_lock(&c->bucket_lock); 788 mutex_lock(&c->bucket_lock);
789 goto retry; 789 goto retry;
790 } 790 }
791 791
792 return ERR_PTR(ret); 792 return ERR_PTR(ret);
793 } 793 }
794 794
795 /* 795 /*
796 * We can only have one thread cannibalizing other cached btree nodes at a time, 796 * We can only have one thread cannibalizing other cached btree nodes at a time,
797 * or we'll deadlock. We use an open-coded mutex to ensure that, which 797 * or we'll deadlock. We use an open-coded mutex to ensure that, which
798 * cannibalize_bucket() will take. This means every time we unlock the root of 798 * cannibalize_bucket() will take. This means every time we unlock the root of
799 * the btree, we need to release this lock if we have it held. 799 * the btree, we need to release this lock if we have it held.
800 */ 800 */
801 void bch_cannibalize_unlock(struct cache_set *c, struct closure *cl) 801 void bch_cannibalize_unlock(struct cache_set *c, struct closure *cl)
802 { 802 {
803 if (c->try_harder == cl) { 803 if (c->try_harder == cl) {
804 bch_time_stats_update(&c->try_harder_time, c->try_harder_start); 804 bch_time_stats_update(&c->try_harder_time, c->try_harder_start);
805 c->try_harder = NULL; 805 c->try_harder = NULL;
806 __closure_wake_up(&c->try_wait); 806 __closure_wake_up(&c->try_wait);
807 } 807 }
808 } 808 }
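
c->try_harder is effectively a hand-rolled, closure-aware mutex: the first closure to start cannibalizing records itself as the owner, later callers see a different owner, park on try_wait and get -EAGAIN, and the owner clears the field and wakes everyone in bch_cannibalize_unlock(). A rough pthread equivalent of that single-owner-or-bail scheme (types and names are illustrative, not the closure API):

#include <errno.h>
#include <pthread.h>

struct cannibalize_lock {
        pthread_mutex_t lock;
        pthread_cond_t  wait;
        void           *owner;          /* NULL when free */
};

static int cannibalize_trylock(struct cannibalize_lock *l, void *me)
{
        int ret = 0;

        pthread_mutex_lock(&l->lock);
        if (!l->owner || l->owner == me)
                l->owner = me;          /* we own it (possibly re-entering) */
        else
                ret = -EAGAIN;          /* someone else is cannibalizing */
        pthread_mutex_unlock(&l->lock);

        return ret;
}

/* A caller that got -EAGAIN can block here until the lock is free. */
static void cannibalize_wait(struct cannibalize_lock *l)
{
        pthread_mutex_lock(&l->lock);
        while (l->owner)
                pthread_cond_wait(&l->wait, &l->lock);
        pthread_mutex_unlock(&l->lock);
}

static void cannibalize_unlock(struct cannibalize_lock *l, void *me)
{
        pthread_mutex_lock(&l->lock);
        if (l->owner == me) {
                l->owner = NULL;
                pthread_cond_broadcast(&l->wait);       /* wake the waiters */
        }
        pthread_mutex_unlock(&l->lock);
}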
809 809
810 static struct btree *mca_alloc(struct cache_set *c, struct bkey *k, 810 static struct btree *mca_alloc(struct cache_set *c, struct bkey *k,
811 int level, struct closure *cl) 811 int level, struct closure *cl)
812 { 812 {
813 struct btree *b; 813 struct btree *b;
814 814
815 lockdep_assert_held(&c->bucket_lock); 815 lockdep_assert_held(&c->bucket_lock);
816 816
817 if (mca_find(c, k)) 817 if (mca_find(c, k))
818 return NULL; 818 return NULL;
819 819
820 /* btree_free() doesn't free memory; it sticks the node on the end of 820 /* btree_free() doesn't free memory; it sticks the node on the end of
821 * the list. Check if there are any freed nodes there: 821 * the list. Check if there are any freed nodes there:
822 */ 822 */
823 list_for_each_entry(b, &c->btree_cache_freeable, list) 823 list_for_each_entry(b, &c->btree_cache_freeable, list)
824 if (!mca_reap(b, NULL, btree_order(k))) 824 if (!mca_reap(b, NULL, btree_order(k)))
825 goto out; 825 goto out;
826 826
827 /* We never free struct btree itself, just the memory that holds the on 827 /* We never free struct btree itself, just the memory that holds the on
828 * disk node. Check the freed list before allocating a new one: 828 * disk node. Check the freed list before allocating a new one:
829 */ 829 */
830 list_for_each_entry(b, &c->btree_cache_freed, list) 830 list_for_each_entry(b, &c->btree_cache_freed, list)
831 if (!mca_reap(b, NULL, 0)) { 831 if (!mca_reap(b, NULL, 0)) {
832 mca_data_alloc(b, k, __GFP_NOWARN|GFP_NOIO); 832 mca_data_alloc(b, k, __GFP_NOWARN|GFP_NOIO);
833 if (!b->sets[0].data) 833 if (!b->sets[0].data)
834 goto err; 834 goto err;
835 else 835 else
836 goto out; 836 goto out;
837 } 837 }
838 838
839 b = mca_bucket_alloc(c, k, __GFP_NOWARN|GFP_NOIO); 839 b = mca_bucket_alloc(c, k, __GFP_NOWARN|GFP_NOIO);
840 if (!b) 840 if (!b)
841 goto err; 841 goto err;
842 842
843 BUG_ON(!down_write_trylock(&b->lock)); 843 BUG_ON(!down_write_trylock(&b->lock));
844 if (!b->sets->data) 844 if (!b->sets->data)
845 goto err; 845 goto err;
846 out: 846 out:
847 BUG_ON(!closure_is_unlocked(&b->io.cl)); 847 BUG_ON(!closure_is_unlocked(&b->io.cl));
848 848
849 bkey_copy(&b->key, k); 849 bkey_copy(&b->key, k);
850 list_move(&b->list, &c->btree_cache); 850 list_move(&b->list, &c->btree_cache);
851 hlist_del_init_rcu(&b->hash); 851 hlist_del_init_rcu(&b->hash);
852 hlist_add_head_rcu(&b->hash, mca_hash(c, k)); 852 hlist_add_head_rcu(&b->hash, mca_hash(c, k));
853 853
854 lock_set_subclass(&b->lock.dep_map, level + 1, _THIS_IP_); 854 lock_set_subclass(&b->lock.dep_map, level + 1, _THIS_IP_);
855 b->level = level; 855 b->level = level;
856 856
857 mca_reinit(b); 857 mca_reinit(b);
858 858
859 return b; 859 return b;
860 err: 860 err:
861 if (b) 861 if (b)
862 rw_unlock(true, b); 862 rw_unlock(true, b);
863 863
864 b = mca_cannibalize(c, k, level, cl); 864 b = mca_cannibalize(c, k, level, cl);
865 if (!IS_ERR(b)) 865 if (!IS_ERR(b))
866 goto out; 866 goto out;
867 867
868 return b; 868 return b;
869 } 869 }
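
mca_alloc() works through progressively more expensive sources: a node on btree_cache_freeable whose buffers are still attached, a node on btree_cache_freed that needs its data reallocated, a brand new allocation, and only then cannibalizing a node that is still in use. A small freelist-first allocator sketch of the same ladder (toy types; the cannibalize step is only hinted at in the comment):

#include <stdlib.h>

struct buf {
        struct buf *next;
        void       *data;
        size_t      size;
};

/* Reuse a cached buffer of sufficient size, otherwise fall back to a
 * fresh allocation. A real cache would add a third step behind a lock -
 * stealing ("cannibalizing") an in-use buffer - as mca_cannibalize()
 * does above. */
static struct buf *buf_alloc(struct buf **freelist, size_t size)
{
        struct buf **p, *b;

        for (p = freelist; (b = *p) != NULL; p = &b->next)
                if (b->size >= size) {
                        *p = b->next;           /* unlink from the freelist */
                        return b;
                }

        b = malloc(sizeof(*b));
        if (!b)
                return NULL;

        b->data = malloc(size);
        if (!b->data) {
                free(b);
                return NULL;
        }
        b->size = size;
        b->next = NULL;
        return b;
}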
870 870
871 /** 871 /**
872 * bch_btree_node_get - find a btree node in the cache and lock it, reading it 872 * bch_btree_node_get - find a btree node in the cache and lock it, reading it
873 * in from disk if necessary. 873 * in from disk if necessary.
874 * 874 *
875 * If IO is necessary, it uses the closure embedded in struct btree_op to wait; 875 * If IO is necessary, it uses the closure embedded in struct btree_op to wait;
876 * if that closure is in non blocking mode, will return -EAGAIN. 876 * if that closure is in non blocking mode, will return -EAGAIN.
877 * 877 *
878 * The btree node will have either a read or a write lock held, depending on 878 * The btree node will have either a read or a write lock held, depending on
879 * level and op->lock. 879 * level and op->lock.
880 */ 880 */
881 struct btree *bch_btree_node_get(struct cache_set *c, struct bkey *k, 881 struct btree *bch_btree_node_get(struct cache_set *c, struct bkey *k,
882 int level, struct btree_op *op) 882 int level, struct btree_op *op)
883 { 883 {
884 int i = 0; 884 int i = 0;
885 bool write = level <= op->lock; 885 bool write = level <= op->lock;
886 struct btree *b; 886 struct btree *b;
887 887
888 BUG_ON(level < 0); 888 BUG_ON(level < 0);
889 retry: 889 retry:
890 b = mca_find(c, k); 890 b = mca_find(c, k);
891 891
892 if (!b) { 892 if (!b) {
893 if (current->bio_list) 893 if (current->bio_list)
894 return ERR_PTR(-EAGAIN); 894 return ERR_PTR(-EAGAIN);
895 895
896 mutex_lock(&c->bucket_lock); 896 mutex_lock(&c->bucket_lock);
897 b = mca_alloc(c, k, level, &op->cl); 897 b = mca_alloc(c, k, level, &op->cl);
898 mutex_unlock(&c->bucket_lock); 898 mutex_unlock(&c->bucket_lock);
899 899
900 if (!b) 900 if (!b)
901 goto retry; 901 goto retry;
902 if (IS_ERR(b)) 902 if (IS_ERR(b))
903 return b; 903 return b;
904 904
905 bch_btree_node_read(b); 905 bch_btree_node_read(b);
906 906
907 if (!write) 907 if (!write)
908 downgrade_write(&b->lock); 908 downgrade_write(&b->lock);
909 } else { 909 } else {
910 rw_lock(write, b, level); 910 rw_lock(write, b, level);
911 if (PTR_HASH(c, &b->key) != PTR_HASH(c, k)) { 911 if (PTR_HASH(c, &b->key) != PTR_HASH(c, k)) {
912 rw_unlock(write, b); 912 rw_unlock(write, b);
913 goto retry; 913 goto retry;
914 } 914 }
915 BUG_ON(b->level != level); 915 BUG_ON(b->level != level);
916 } 916 }
917 917
918 b->accessed = 1; 918 b->accessed = 1;
919 919
920 for (; i <= b->nsets && b->sets[i].size; i++) { 920 for (; i <= b->nsets && b->sets[i].size; i++) {
921 prefetch(b->sets[i].tree); 921 prefetch(b->sets[i].tree);
922 prefetch(b->sets[i].data); 922 prefetch(b->sets[i].data);
923 } 923 }
924 924
925 for (; i <= b->nsets; i++) 925 for (; i <= b->nsets; i++)
926 prefetch(b->sets[i].data); 926 prefetch(b->sets[i].data);
927 927
928 if (btree_node_io_error(b)) { 928 if (btree_node_io_error(b)) {
929 rw_unlock(write, b); 929 rw_unlock(write, b);
930 return ERR_PTR(-EIO); 930 return ERR_PTR(-EIO);
931 } 931 }
932 932
933 BUG_ON(!b->written); 933 BUG_ON(!b->written);
934 934
935 return b; 935 return b;
936 } 936 }
937 937
938 static void btree_node_prefetch(struct cache_set *c, struct bkey *k, int level) 938 static void btree_node_prefetch(struct cache_set *c, struct bkey *k, int level)
939 { 939 {
940 struct btree *b; 940 struct btree *b;
941 941
942 mutex_lock(&c->bucket_lock); 942 mutex_lock(&c->bucket_lock);
943 b = mca_alloc(c, k, level, NULL); 943 b = mca_alloc(c, k, level, NULL);
944 mutex_unlock(&c->bucket_lock); 944 mutex_unlock(&c->bucket_lock);
945 945
946 if (!IS_ERR_OR_NULL(b)) { 946 if (!IS_ERR_OR_NULL(b)) {
947 bch_btree_node_read(b); 947 bch_btree_node_read(b);
948 rw_unlock(true, b); 948 rw_unlock(true, b);
949 } 949 }
950 } 950 }
951 951
952 /* Btree alloc */ 952 /* Btree alloc */
953 953
954 static void btree_node_free(struct btree *b, struct btree_op *op) 954 static void btree_node_free(struct btree *b, struct btree_op *op)
955 { 955 {
956 unsigned i; 956 unsigned i;
957 957
958 trace_bcache_btree_node_free(b); 958 trace_bcache_btree_node_free(b);
959 959
960 /* 960 /*
961 * The BUG_ON() in btree_node_get() implies that we must have a write 961 * The BUG_ON() in btree_node_get() implies that we must have a write
962 * lock on parent to free or even invalidate a node 962 * lock on parent to free or even invalidate a node
963 */ 963 */
964 BUG_ON(op->lock <= b->level); 964 BUG_ON(op->lock <= b->level);
965 BUG_ON(b == b->c->root); 965 BUG_ON(b == b->c->root);
966 966
967 if (btree_node_dirty(b)) 967 if (btree_node_dirty(b))
968 btree_complete_write(b, btree_current_write(b)); 968 btree_complete_write(b, btree_current_write(b));
969 clear_bit(BTREE_NODE_dirty, &b->flags); 969 clear_bit(BTREE_NODE_dirty, &b->flags);
970 970
971 cancel_delayed_work(&b->work); 971 cancel_delayed_work(&b->work);
972 972
973 mutex_lock(&b->c->bucket_lock); 973 mutex_lock(&b->c->bucket_lock);
974 974
975 for (i = 0; i < KEY_PTRS(&b->key); i++) { 975 for (i = 0; i < KEY_PTRS(&b->key); i++) {
976 BUG_ON(atomic_read(&PTR_BUCKET(b->c, &b->key, i)->pin)); 976 BUG_ON(atomic_read(&PTR_BUCKET(b->c, &b->key, i)->pin));
977 977
978 bch_inc_gen(PTR_CACHE(b->c, &b->key, i), 978 bch_inc_gen(PTR_CACHE(b->c, &b->key, i),
979 PTR_BUCKET(b->c, &b->key, i)); 979 PTR_BUCKET(b->c, &b->key, i));
980 } 980 }
981 981
982 bch_bucket_free(b->c, &b->key); 982 bch_bucket_free(b->c, &b->key);
983 mca_bucket_free(b); 983 mca_bucket_free(b);
984 mutex_unlock(&b->c->bucket_lock); 984 mutex_unlock(&b->c->bucket_lock);
985 } 985 }
986 986
987 struct btree *bch_btree_node_alloc(struct cache_set *c, int level, 987 struct btree *bch_btree_node_alloc(struct cache_set *c, int level,
988 struct closure *cl) 988 struct closure *cl)
989 { 989 {
990 BKEY_PADDED(key) k; 990 BKEY_PADDED(key) k;
991 struct btree *b = ERR_PTR(-EAGAIN); 991 struct btree *b = ERR_PTR(-EAGAIN);
992 992
993 mutex_lock(&c->bucket_lock); 993 mutex_lock(&c->bucket_lock);
994 retry: 994 retry:
995 if (__bch_bucket_alloc_set(c, WATERMARK_METADATA, &k.key, 1, cl)) 995 if (__bch_bucket_alloc_set(c, WATERMARK_METADATA, &k.key, 1, cl))
996 goto err; 996 goto err;
997 997
998 SET_KEY_SIZE(&k.key, c->btree_pages * PAGE_SECTORS); 998 SET_KEY_SIZE(&k.key, c->btree_pages * PAGE_SECTORS);
999 999
1000 b = mca_alloc(c, &k.key, level, cl); 1000 b = mca_alloc(c, &k.key, level, cl);
1001 if (IS_ERR(b)) 1001 if (IS_ERR(b))
1002 goto err_free; 1002 goto err_free;
1003 1003
1004 if (!b) { 1004 if (!b) {
1005 cache_bug(c, 1005 cache_bug(c,
1006 "Tried to allocate bucket that was in btree cache"); 1006 "Tried to allocate bucket that was in btree cache");
1007 __bkey_put(c, &k.key); 1007 __bkey_put(c, &k.key);
1008 goto retry; 1008 goto retry;
1009 } 1009 }
1010 1010
1011 b->accessed = 1; 1011 b->accessed = 1;
1012 bch_bset_init_next(b); 1012 bch_bset_init_next(b);
1013 1013
1014 mutex_unlock(&c->bucket_lock); 1014 mutex_unlock(&c->bucket_lock);
1015 1015
1016 trace_bcache_btree_node_alloc(b); 1016 trace_bcache_btree_node_alloc(b);
1017 return b; 1017 return b;
1018 err_free: 1018 err_free:
1019 bch_bucket_free(c, &k.key); 1019 bch_bucket_free(c, &k.key);
1020 __bkey_put(c, &k.key); 1020 __bkey_put(c, &k.key);
1021 err: 1021 err:
1022 mutex_unlock(&c->bucket_lock); 1022 mutex_unlock(&c->bucket_lock);
1023 1023
1024 trace_bcache_btree_node_alloc_fail(b); 1024 trace_bcache_btree_node_alloc_fail(b);
1025 return b; 1025 return b;
1026 } 1026 }
1027 1027
1028 static struct btree *btree_node_alloc_replacement(struct btree *b, 1028 static struct btree *btree_node_alloc_replacement(struct btree *b,
1029 struct closure *cl) 1029 struct closure *cl)
1030 { 1030 {
1031 struct btree *n = bch_btree_node_alloc(b->c, b->level, cl); 1031 struct btree *n = bch_btree_node_alloc(b->c, b->level, cl);
1032 if (!IS_ERR_OR_NULL(n)) 1032 if (!IS_ERR_OR_NULL(n))
1033 bch_btree_sort_into(b, n); 1033 bch_btree_sort_into(b, n);
1034 1034
1035 return n; 1035 return n;
1036 } 1036 }
1037 1037
1038 /* Garbage collection */ 1038 /* Garbage collection */
1039 1039
1040 uint8_t __bch_btree_mark_key(struct cache_set *c, int level, struct bkey *k) 1040 uint8_t __bch_btree_mark_key(struct cache_set *c, int level, struct bkey *k)
1041 { 1041 {
1042 uint8_t stale = 0; 1042 uint8_t stale = 0;
1043 unsigned i; 1043 unsigned i;
1044 struct bucket *g; 1044 struct bucket *g;
1045 1045
1046 /* 1046 /*
1047 * ptr_invalid() can't return true for the keys that mark btree nodes as 1047 * ptr_invalid() can't return true for the keys that mark btree nodes as
1048 * freed, but since ptr_bad() returns true we'll never actually use them 1048 * freed, but since ptr_bad() returns true we'll never actually use them
1049 * for anything and thus we don't want to mark their pointers here 1049 * for anything and thus we don't want to mark their pointers here
1050 */ 1050 */
1051 if (!bkey_cmp(k, &ZERO_KEY)) 1051 if (!bkey_cmp(k, &ZERO_KEY))
1052 return stale; 1052 return stale;
1053 1053
1054 for (i = 0; i < KEY_PTRS(k); i++) { 1054 for (i = 0; i < KEY_PTRS(k); i++) {
1055 if (!ptr_available(c, k, i)) 1055 if (!ptr_available(c, k, i))
1056 continue; 1056 continue;
1057 1057
1058 g = PTR_BUCKET(c, k, i); 1058 g = PTR_BUCKET(c, k, i);
1059 1059
1060 if (gen_after(g->gc_gen, PTR_GEN(k, i))) 1060 if (gen_after(g->gc_gen, PTR_GEN(k, i)))
1061 g->gc_gen = PTR_GEN(k, i); 1061 g->gc_gen = PTR_GEN(k, i);
1062 1062
1063 if (ptr_stale(c, k, i)) { 1063 if (ptr_stale(c, k, i)) {
1064 stale = max(stale, ptr_stale(c, k, i)); 1064 stale = max(stale, ptr_stale(c, k, i));
1065 continue; 1065 continue;
1066 } 1066 }
1067 1067
1068 cache_bug_on(GC_MARK(g) && 1068 cache_bug_on(GC_MARK(g) &&
1069 (GC_MARK(g) == GC_MARK_METADATA) != (level != 0), 1069 (GC_MARK(g) == GC_MARK_METADATA) != (level != 0),
1070 c, "inconsistent ptrs: mark = %llu, level = %i", 1070 c, "inconsistent ptrs: mark = %llu, level = %i",
1071 GC_MARK(g), level); 1071 GC_MARK(g), level);
1072 1072
1073 if (level) 1073 if (level)
1074 SET_GC_MARK(g, GC_MARK_METADATA); 1074 SET_GC_MARK(g, GC_MARK_METADATA);
1075 else if (KEY_DIRTY(k)) 1075 else if (KEY_DIRTY(k))
1076 SET_GC_MARK(g, GC_MARK_DIRTY); 1076 SET_GC_MARK(g, GC_MARK_DIRTY);
1077 1077
1078 /* guard against overflow */ 1078 /* guard against overflow */
1079 SET_GC_SECTORS_USED(g, min_t(unsigned, 1079 SET_GC_SECTORS_USED(g, min_t(unsigned,
1080 GC_SECTORS_USED(g) + KEY_SIZE(k), 1080 GC_SECTORS_USED(g) + KEY_SIZE(k),
1081 (1 << 14) - 1)); 1081 (1 << 14) - 1));
1082 1082
1083 BUG_ON(!GC_SECTORS_USED(g)); 1083 BUG_ON(!GC_SECTORS_USED(g));
1084 } 1084 }
1085 1085
1086 return stale; 1086 return stale;
1087 } 1087 }
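
The GC_SECTORS_USED update above is a saturating add: the field is 14 bits wide in the packed bucket state, so repeated marks clamp at (1 << 14) - 1 instead of wrapping. The same clamp in isolation, assuming a 14-bit counter:

#include <stdint.h>

#define GC_SECTORS_MAX  ((1u << 14) - 1)        /* 14-bit packed field */

/* Saturating add: never lets the counter wrap past its field width. */
static inline uint16_t sectors_used_add(uint16_t used, uint32_t key_sectors)
{
        uint32_t sum = (uint32_t) used + key_sectors;

        return sum > GC_SECTORS_MAX ? GC_SECTORS_MAX : (uint16_t) sum;
}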
1088 1088
1089 #define btree_mark_key(b, k) __bch_btree_mark_key(b->c, b->level, k) 1089 #define btree_mark_key(b, k) __bch_btree_mark_key(b->c, b->level, k)
1090 1090
1091 static int btree_gc_mark_node(struct btree *b, unsigned *keys, 1091 static int btree_gc_mark_node(struct btree *b, unsigned *keys,
1092 struct gc_stat *gc) 1092 struct gc_stat *gc)
1093 { 1093 {
1094 uint8_t stale = 0; 1094 uint8_t stale = 0;
1095 unsigned last_dev = -1; 1095 unsigned last_dev = -1;
1096 struct bcache_device *d = NULL; 1096 struct bcache_device *d = NULL;
1097 struct bkey *k; 1097 struct bkey *k;
1098 struct btree_iter iter; 1098 struct btree_iter iter;
1099 struct bset_tree *t; 1099 struct bset_tree *t;
1100 1100
1101 gc->nodes++; 1101 gc->nodes++;
1102 1102
1103 for_each_key_filter(b, k, &iter, bch_ptr_invalid) { 1103 for_each_key_filter(b, k, &iter, bch_ptr_invalid) {
1104 if (last_dev != KEY_INODE(k)) { 1104 if (last_dev != KEY_INODE(k)) {
1105 last_dev = KEY_INODE(k); 1105 last_dev = KEY_INODE(k);
1106 1106
1107 d = KEY_INODE(k) < b->c->nr_uuids 1107 d = KEY_INODE(k) < b->c->nr_uuids
1108 ? b->c->devices[last_dev] 1108 ? b->c->devices[last_dev]
1109 : NULL; 1109 : NULL;
1110 } 1110 }
1111 1111
1112 stale = max(stale, btree_mark_key(b, k)); 1112 stale = max(stale, btree_mark_key(b, k));
1113 1113
1114 if (bch_ptr_bad(b, k)) 1114 if (bch_ptr_bad(b, k))
1115 continue; 1115 continue;
1116 1116
1117 *keys += bkey_u64s(k); 1117 *keys += bkey_u64s(k);
1118 1118
1119 gc->key_bytes += bkey_u64s(k); 1119 gc->key_bytes += bkey_u64s(k);
1120 gc->nkeys++; 1120 gc->nkeys++;
1121 1121
1122 gc->data += KEY_SIZE(k); 1122 gc->data += KEY_SIZE(k);
1123 if (KEY_DIRTY(k)) 1123 if (KEY_DIRTY(k))
1124 gc->dirty += KEY_SIZE(k); 1124 gc->dirty += KEY_SIZE(k);
1125 } 1125 }
1126 1126
1127 for (t = b->sets; t <= &b->sets[b->nsets]; t++) 1127 for (t = b->sets; t <= &b->sets[b->nsets]; t++)
1128 btree_bug_on(t->size && 1128 btree_bug_on(t->size &&
1129 bset_written(b, t) && 1129 bset_written(b, t) &&
1130 bkey_cmp(&b->key, &t->end) < 0, 1130 bkey_cmp(&b->key, &t->end) < 0,
1131 b, "found short btree key in gc"); 1131 b, "found short btree key in gc");
1132 1132
1133 return stale; 1133 return stale;
1134 } 1134 }
1135 1135
1136 static struct btree *btree_gc_alloc(struct btree *b, struct bkey *k, 1136 static struct btree *btree_gc_alloc(struct btree *b, struct bkey *k,
1137 struct btree_op *op) 1137 struct btree_op *op)
1138 { 1138 {
1139 /* 1139 /*
1140 * We block priorities from being written for the duration of garbage 1140 * We block priorities from being written for the duration of garbage
1141 * collection, so we can't sleep in btree_alloc() -> 1141 * collection, so we can't sleep in btree_alloc() ->
1142 * bch_bucket_alloc_set(), or we'd risk deadlock - so we don't pass it 1142 * bch_bucket_alloc_set(), or we'd risk deadlock - so we don't pass it
1143 * our closure. 1143 * our closure.
1144 */ 1144 */
1145 struct btree *n = btree_node_alloc_replacement(b, NULL); 1145 struct btree *n = btree_node_alloc_replacement(b, NULL);
1146 1146
1147 if (!IS_ERR_OR_NULL(n)) { 1147 if (!IS_ERR_OR_NULL(n)) {
1148 swap(b, n); 1148 swap(b, n);
1149 __bkey_put(b->c, &b->key); 1149 __bkey_put(b->c, &b->key);
1150 1150
1151 memcpy(k->ptr, b->key.ptr, 1151 memcpy(k->ptr, b->key.ptr,
1152 sizeof(uint64_t) * KEY_PTRS(&b->key)); 1152 sizeof(uint64_t) * KEY_PTRS(&b->key));
1153 1153
1154 btree_node_free(n, op); 1154 btree_node_free(n, op);
1155 up_write(&n->lock); 1155 up_write(&n->lock);
1156 } 1156 }
1157 1157
1158 return b; 1158 return b;
1159 } 1159 }
1160 1160
1161 /* 1161 /*
1162 * Leaving this at 2 until we've got incremental garbage collection done; it 1162 * Leaving this at 2 until we've got incremental garbage collection done; it
1163 * could be higher (and has been tested with 4) except that garbage collection 1163 * could be higher (and has been tested with 4) except that garbage collection
1164 * could take much longer, adversely affecting latency. 1164 * could take much longer, adversely affecting latency.
1165 */ 1165 */
1166 #define GC_MERGE_NODES 2U 1166 #define GC_MERGE_NODES 2U
1167 1167
1168 struct gc_merge_info { 1168 struct gc_merge_info {
1169 struct btree *b; 1169 struct btree *b;
1170 struct bkey *k; 1170 struct bkey *k;
1171 unsigned keys; 1171 unsigned keys;
1172 }; 1172 };
1173 1173
1174 static void btree_gc_coalesce(struct btree *b, struct btree_op *op, 1174 static void btree_gc_coalesce(struct btree *b, struct btree_op *op,
1175 struct gc_stat *gc, struct gc_merge_info *r) 1175 struct gc_stat *gc, struct gc_merge_info *r)
1176 { 1176 {
1177 unsigned nodes = 0, keys = 0, blocks; 1177 unsigned nodes = 0, keys = 0, blocks;
1178 int i; 1178 int i;
1179 1179
1180 while (nodes < GC_MERGE_NODES && r[nodes].b) 1180 while (nodes < GC_MERGE_NODES && r[nodes].b)
1181 keys += r[nodes++].keys; 1181 keys += r[nodes++].keys;
1182 1182
1183 blocks = btree_default_blocks(b->c) * 2 / 3; 1183 blocks = btree_default_blocks(b->c) * 2 / 3;
1184 1184
1185 if (nodes < 2 || 1185 if (nodes < 2 ||
1186 __set_blocks(b->sets[0].data, keys, b->c) > blocks * (nodes - 1)) 1186 __set_blocks(b->sets[0].data, keys, b->c) > blocks * (nodes - 1))
1187 return; 1187 return;
1188 1188
1189 for (i = nodes - 1; i >= 0; --i) { 1189 for (i = nodes - 1; i >= 0; --i) {
1190 if (r[i].b->written) 1190 if (r[i].b->written)
1191 r[i].b = btree_gc_alloc(r[i].b, r[i].k, op); 1191 r[i].b = btree_gc_alloc(r[i].b, r[i].k, op);
1192 1192
1193 if (r[i].b->written) 1193 if (r[i].b->written)
1194 return; 1194 return;
1195 } 1195 }
1196 1196
1197 for (i = nodes - 1; i > 0; --i) { 1197 for (i = nodes - 1; i > 0; --i) {
1198 struct bset *n1 = r[i].b->sets->data; 1198 struct bset *n1 = r[i].b->sets->data;
1199 struct bset *n2 = r[i - 1].b->sets->data; 1199 struct bset *n2 = r[i - 1].b->sets->data;
1200 struct bkey *k, *last = NULL; 1200 struct bkey *k, *last = NULL;
1201 1201
1202 keys = 0; 1202 keys = 0;
1203 1203
1204 if (i == 1) { 1204 if (i == 1) {
1205 /* 1205 /*
1206 * Last node we're not getting rid of - we're getting 1206 * Last node we're not getting rid of - we're getting
1207 * rid of the node at r[0]. Have to try and fit all of 1207 * rid of the node at r[0]. Have to try and fit all of
1208 * the remaining keys into this node; we can't ensure 1208 * the remaining keys into this node; we can't ensure
1209 * they will always fit due to rounding and variable 1209 * they will always fit due to rounding and variable
1210 * length keys (shouldn't be possible in practice, 1210 * length keys (shouldn't be possible in practice,
1211 * though) 1211 * though)
1212 */ 1212 */
1213 if (__set_blocks(n1, n1->keys + r->keys, 1213 if (__set_blocks(n1, n1->keys + r->keys,
1214 b->c) > btree_blocks(r[i].b)) 1214 b->c) > btree_blocks(r[i].b))
1215 return; 1215 return;
1216 1216
1217 keys = n2->keys; 1217 keys = n2->keys;
1218 last = &r->b->key; 1218 last = &r->b->key;
1219 } else 1219 } else
1220 for (k = n2->start; 1220 for (k = n2->start;
1221 k < end(n2); 1221 k < end(n2);
1222 k = bkey_next(k)) { 1222 k = bkey_next(k)) {
1223 if (__set_blocks(n1, n1->keys + keys + 1223 if (__set_blocks(n1, n1->keys + keys +
1224 bkey_u64s(k), b->c) > blocks) 1224 bkey_u64s(k), b->c) > blocks)
1225 break; 1225 break;
1226 1226
1227 last = k; 1227 last = k;
1228 keys += bkey_u64s(k); 1228 keys += bkey_u64s(k);
1229 } 1229 }
1230 1230
1231 BUG_ON(__set_blocks(n1, n1->keys + keys, 1231 BUG_ON(__set_blocks(n1, n1->keys + keys,
1232 b->c) > btree_blocks(r[i].b)); 1232 b->c) > btree_blocks(r[i].b));
1233 1233
1234 if (last) { 1234 if (last) {
1235 bkey_copy_key(&r[i].b->key, last); 1235 bkey_copy_key(&r[i].b->key, last);
1236 bkey_copy_key(r[i].k, last); 1236 bkey_copy_key(r[i].k, last);
1237 } 1237 }
1238 1238
1239 memcpy(end(n1), 1239 memcpy(end(n1),
1240 n2->start, 1240 n2->start,
1241 (void *) node(n2, keys) - (void *) n2->start); 1241 (void *) node(n2, keys) - (void *) n2->start);
1242 1242
1243 n1->keys += keys; 1243 n1->keys += keys;
1244 1244
1245 memmove(n2->start, 1245 memmove(n2->start,
1246 node(n2, keys), 1246 node(n2, keys),
1247 (void *) end(n2) - (void *) node(n2, keys)); 1247 (void *) end(n2) - (void *) node(n2, keys));
1248 1248
1249 n2->keys -= keys; 1249 n2->keys -= keys;
1250 1250
1251 r[i].keys = n1->keys; 1251 r[i].keys = n1->keys;
1252 r[i - 1].keys = n2->keys; 1252 r[i - 1].keys = n2->keys;
1253 } 1253 }
1254 1254
1255 btree_node_free(r->b, op); 1255 btree_node_free(r->b, op);
1256 up_write(&r->b->lock); 1256 up_write(&r->b->lock);
1257 1257
1258 trace_bcache_btree_gc_coalesce(nodes); 1258 trace_bcache_btree_gc_coalesce(nodes);
1259 1259
1260 gc->nodes--; 1260 gc->nodes--;
1261 nodes--; 1261 nodes--;
1262 1262
1263 memmove(&r[0], &r[1], sizeof(struct gc_merge_info) * nodes); 1263 memmove(&r[0], &r[1], sizeof(struct gc_merge_info) * nodes);
1264 memset(&r[nodes], 0, sizeof(struct gc_merge_info)); 1264 memset(&r[nodes], 0, sizeof(struct gc_merge_info));
1265 } 1265 }
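
btree_gc_coalesce() only merges a window of adjacent nodes when all of their keys would fit into one fewer node filled to roughly 2/3, so the merged result isn't immediately split again. The admission check on its own, with a caller-supplied stand-in for __set_blocks():

#include <stdbool.h>

/* Coalesce `nodes` adjacent nodes only if their combined payload fits in
 * (nodes - 1) nodes filled to about 2/3 - mirroring the
 * "blocks * (nodes - 1)" test above. `blocks_for_keys` is an assumed
 * helper, not the kernel function. */
static bool worth_coalescing(unsigned nodes, unsigned total_keys,
                             unsigned node_blocks,
                             unsigned (*blocks_for_keys)(unsigned))
{
        unsigned target = node_blocks * 2 / 3;

        if (nodes < 2)
                return false;

        return blocks_for_keys(total_keys) <= target * (nodes - 1);
}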
1266 1266
1267 static int btree_gc_recurse(struct btree *b, struct btree_op *op, 1267 static int btree_gc_recurse(struct btree *b, struct btree_op *op,
1268 struct closure *writes, struct gc_stat *gc) 1268 struct closure *writes, struct gc_stat *gc)
1269 { 1269 {
1270 void write(struct btree *r) 1270 void write(struct btree *r)
1271 { 1271 {
1272 if (!r->written) 1272 if (!r->written)
1273 bch_btree_node_write(r, &op->cl); 1273 bch_btree_node_write(r, &op->cl);
1274 else if (btree_node_dirty(r)) 1274 else if (btree_node_dirty(r))
1275 bch_btree_node_write(r, writes); 1275 bch_btree_node_write(r, writes);
1276 1276
1277 up_write(&r->lock); 1277 up_write(&r->lock);
1278 } 1278 }
1279 1279
1280 int ret = 0, stale; 1280 int ret = 0, stale;
1281 unsigned i; 1281 unsigned i;
1282 struct gc_merge_info r[GC_MERGE_NODES]; 1282 struct gc_merge_info r[GC_MERGE_NODES];
1283 1283
1284 memset(r, 0, sizeof(r)); 1284 memset(r, 0, sizeof(r));
1285 1285
1286 while ((r->k = bch_next_recurse_key(b, &b->c->gc_done))) { 1286 while ((r->k = bch_next_recurse_key(b, &b->c->gc_done))) {
1287 r->b = bch_btree_node_get(b->c, r->k, b->level - 1, op); 1287 r->b = bch_btree_node_get(b->c, r->k, b->level - 1, op);
1288 1288
1289 if (IS_ERR(r->b)) { 1289 if (IS_ERR(r->b)) {
1290 ret = PTR_ERR(r->b); 1290 ret = PTR_ERR(r->b);
1291 break; 1291 break;
1292 } 1292 }
1293 1293
1294 r->keys = 0; 1294 r->keys = 0;
1295 stale = btree_gc_mark_node(r->b, &r->keys, gc); 1295 stale = btree_gc_mark_node(r->b, &r->keys, gc);
1296 1296
1297 if (!b->written && 1297 if (!b->written &&
1298 (r->b->level || stale > 10 || 1298 (r->b->level || stale > 10 ||
1299 b->c->gc_always_rewrite)) 1299 b->c->gc_always_rewrite))
1300 r->b = btree_gc_alloc(r->b, r->k, op); 1300 r->b = btree_gc_alloc(r->b, r->k, op);
1301 1301
1302 if (r->b->level) 1302 if (r->b->level)
1303 ret = btree_gc_recurse(r->b, op, writes, gc); 1303 ret = btree_gc_recurse(r->b, op, writes, gc);
1304 1304
1305 if (ret) { 1305 if (ret) {
1306 write(r->b); 1306 write(r->b);
1307 break; 1307 break;
1308 } 1308 }
1309 1309
1310 bkey_copy_key(&b->c->gc_done, r->k); 1310 bkey_copy_key(&b->c->gc_done, r->k);
1311 1311
1312 if (!b->written) 1312 if (!b->written)
1313 btree_gc_coalesce(b, op, gc, r); 1313 btree_gc_coalesce(b, op, gc, r);
1314 1314
1315 if (r[GC_MERGE_NODES - 1].b) 1315 if (r[GC_MERGE_NODES - 1].b)
1316 write(r[GC_MERGE_NODES - 1].b); 1316 write(r[GC_MERGE_NODES - 1].b);
1317 1317
1318 memmove(&r[1], &r[0], 1318 memmove(&r[1], &r[0],
1319 sizeof(struct gc_merge_info) * (GC_MERGE_NODES - 1)); 1319 sizeof(struct gc_merge_info) * (GC_MERGE_NODES - 1));
1320 1320
1321 /* When we've got incremental GC working, we'll want to do 1321 /* When we've got incremental GC working, we'll want to do
1322 * if (should_resched()) 1322 * if (should_resched())
1323 * return -EAGAIN; 1323 * return -EAGAIN;
1324 */ 1324 */
1325 cond_resched(); 1325 cond_resched();
1326 #if 0 1326 #if 0
1327 if (need_resched()) { 1327 if (need_resched()) {
1328 ret = -EAGAIN; 1328 ret = -EAGAIN;
1329 break; 1329 break;
1330 } 1330 }
1331 #endif 1331 #endif
1332 } 1332 }
1333 1333
1334 for (i = 1; i < GC_MERGE_NODES && r[i].b; i++) 1334 for (i = 1; i < GC_MERGE_NODES && r[i].b; i++)
1335 write(r[i].b); 1335 write(r[i].b);
1336 1336
1337 /* Might have freed some children, must remove their keys */ 1337 /* Might have freed some children, must remove their keys */
1338 if (!b->written) 1338 if (!b->written)
1339 bch_btree_sort(b); 1339 bch_btree_sort(b);
1340 1340
1341 return ret; 1341 return ret;
1342 } 1342 }
1343 1343
1344 static int bch_btree_gc_root(struct btree *b, struct btree_op *op, 1344 static int bch_btree_gc_root(struct btree *b, struct btree_op *op,
1345 struct closure *writes, struct gc_stat *gc) 1345 struct closure *writes, struct gc_stat *gc)
1346 { 1346 {
1347 struct btree *n = NULL; 1347 struct btree *n = NULL;
1348 unsigned keys = 0; 1348 unsigned keys = 0;
1349 int ret = 0, stale = btree_gc_mark_node(b, &keys, gc); 1349 int ret = 0, stale = btree_gc_mark_node(b, &keys, gc);
1350 1350
1351 if (b->level || stale > 10) 1351 if (b->level || stale > 10)
1352 n = btree_node_alloc_replacement(b, NULL); 1352 n = btree_node_alloc_replacement(b, NULL);
1353 1353
1354 if (!IS_ERR_OR_NULL(n)) 1354 if (!IS_ERR_OR_NULL(n))
1355 swap(b, n); 1355 swap(b, n);
1356 1356
1357 if (b->level) 1357 if (b->level)
1358 ret = btree_gc_recurse(b, op, writes, gc); 1358 ret = btree_gc_recurse(b, op, writes, gc);
1359 1359
1360 if (!b->written || btree_node_dirty(b)) { 1360 if (!b->written || btree_node_dirty(b)) {
1361 bch_btree_node_write(b, n ? &op->cl : NULL); 1361 bch_btree_node_write(b, n ? &op->cl : NULL);
1362 } 1362 }
1363 1363
1364 if (!IS_ERR_OR_NULL(n)) { 1364 if (!IS_ERR_OR_NULL(n)) {
1365 closure_sync(&op->cl); 1365 closure_sync(&op->cl);
1366 bch_btree_set_root(b); 1366 bch_btree_set_root(b);
1367 btree_node_free(n, op); 1367 btree_node_free(n, op);
1368 rw_unlock(true, b); 1368 rw_unlock(true, b);
1369 } 1369 }
1370 1370
1371 return ret; 1371 return ret;
1372 } 1372 }
1373 1373
1374 static void btree_gc_start(struct cache_set *c) 1374 static void btree_gc_start(struct cache_set *c)
1375 { 1375 {
1376 struct cache *ca; 1376 struct cache *ca;
1377 struct bucket *b; 1377 struct bucket *b;
1378 unsigned i; 1378 unsigned i;
1379 1379
1380 if (!c->gc_mark_valid) 1380 if (!c->gc_mark_valid)
1381 return; 1381 return;
1382 1382
1383 mutex_lock(&c->bucket_lock); 1383 mutex_lock(&c->bucket_lock);
1384 1384
1385 c->gc_mark_valid = 0; 1385 c->gc_mark_valid = 0;
1386 c->gc_done = ZERO_KEY; 1386 c->gc_done = ZERO_KEY;
1387 1387
1388 for_each_cache(ca, c, i) 1388 for_each_cache(ca, c, i)
1389 for_each_bucket(b, ca) { 1389 for_each_bucket(b, ca) {
1390 b->gc_gen = b->gen; 1390 b->gc_gen = b->gen;
1391 if (!atomic_read(&b->pin)) 1391 if (!atomic_read(&b->pin))
1392 SET_GC_MARK(b, GC_MARK_RECLAIMABLE); 1392 SET_GC_MARK(b, GC_MARK_RECLAIMABLE);
1393 } 1393 }
1394 1394
1395 mutex_unlock(&c->bucket_lock); 1395 mutex_unlock(&c->bucket_lock);
1396 } 1396 }
1397 1397
1398 size_t bch_btree_gc_finish(struct cache_set *c) 1398 size_t bch_btree_gc_finish(struct cache_set *c)
1399 { 1399 {
1400 size_t available = 0; 1400 size_t available = 0;
1401 struct bucket *b; 1401 struct bucket *b;
1402 struct cache *ca; 1402 struct cache *ca;
1403 unsigned i; 1403 unsigned i;
1404 1404
1405 mutex_lock(&c->bucket_lock); 1405 mutex_lock(&c->bucket_lock);
1406 1406
1407 set_gc_sectors(c); 1407 set_gc_sectors(c);
1408 c->gc_mark_valid = 1; 1408 c->gc_mark_valid = 1;
1409 c->need_gc = 0; 1409 c->need_gc = 0;
1410 1410
1411 if (c->root) 1411 if (c->root)
1412 for (i = 0; i < KEY_PTRS(&c->root->key); i++) 1412 for (i = 0; i < KEY_PTRS(&c->root->key); i++)
1413 SET_GC_MARK(PTR_BUCKET(c, &c->root->key, i), 1413 SET_GC_MARK(PTR_BUCKET(c, &c->root->key, i),
1414 GC_MARK_METADATA); 1414 GC_MARK_METADATA);
1415 1415
1416 for (i = 0; i < KEY_PTRS(&c->uuid_bucket); i++) 1416 for (i = 0; i < KEY_PTRS(&c->uuid_bucket); i++)
1417 SET_GC_MARK(PTR_BUCKET(c, &c->uuid_bucket, i), 1417 SET_GC_MARK(PTR_BUCKET(c, &c->uuid_bucket, i),
1418 GC_MARK_METADATA); 1418 GC_MARK_METADATA);
1419 1419
1420 for_each_cache(ca, c, i) { 1420 for_each_cache(ca, c, i) {
1421 uint64_t *i; 1421 uint64_t *i;
1422 1422
1423 ca->invalidate_needs_gc = 0; 1423 ca->invalidate_needs_gc = 0;
1424 1424
1425 for (i = ca->sb.d; i < ca->sb.d + ca->sb.keys; i++) 1425 for (i = ca->sb.d; i < ca->sb.d + ca->sb.keys; i++)
1426 SET_GC_MARK(ca->buckets + *i, GC_MARK_METADATA); 1426 SET_GC_MARK(ca->buckets + *i, GC_MARK_METADATA);
1427 1427
1428 for (i = ca->prio_buckets; 1428 for (i = ca->prio_buckets;
1429 i < ca->prio_buckets + prio_buckets(ca) * 2; i++) 1429 i < ca->prio_buckets + prio_buckets(ca) * 2; i++)
1430 SET_GC_MARK(ca->buckets + *i, GC_MARK_METADATA); 1430 SET_GC_MARK(ca->buckets + *i, GC_MARK_METADATA);
1431 1431
1432 for_each_bucket(b, ca) { 1432 for_each_bucket(b, ca) {
1433 b->last_gc = b->gc_gen; 1433 b->last_gc = b->gc_gen;
1434 c->need_gc = max(c->need_gc, bucket_gc_gen(b)); 1434 c->need_gc = max(c->need_gc, bucket_gc_gen(b));
1435 1435
1436 if (!atomic_read(&b->pin) && 1436 if (!atomic_read(&b->pin) &&
1437 GC_MARK(b) == GC_MARK_RECLAIMABLE) { 1437 GC_MARK(b) == GC_MARK_RECLAIMABLE) {
1438 available++; 1438 available++;
1439 if (!GC_SECTORS_USED(b)) 1439 if (!GC_SECTORS_USED(b))
1440 bch_bucket_add_unused(ca, b); 1440 bch_bucket_add_unused(ca, b);
1441 } 1441 }
1442 } 1442 }
1443 } 1443 }
1444 1444
1445 mutex_unlock(&c->bucket_lock); 1445 mutex_unlock(&c->bucket_lock);
1446 return available; 1446 return available;
1447 } 1447 }
1448 1448
1449 static void bch_btree_gc(struct closure *cl) 1449 static void bch_btree_gc(struct closure *cl)
1450 { 1450 {
1451 struct cache_set *c = container_of(cl, struct cache_set, gc.cl); 1451 struct cache_set *c = container_of(cl, struct cache_set, gc.cl);
1452 int ret; 1452 int ret;
1453 unsigned long available; 1453 unsigned long available;
1454 struct gc_stat stats; 1454 struct gc_stat stats;
1455 struct closure writes; 1455 struct closure writes;
1456 struct btree_op op; 1456 struct btree_op op;
1457 uint64_t start_time = local_clock(); 1457 uint64_t start_time = local_clock();
1458 1458
1459 trace_bcache_gc_start(c); 1459 trace_bcache_gc_start(c);
1460 1460
1461 memset(&stats, 0, sizeof(struct gc_stat)); 1461 memset(&stats, 0, sizeof(struct gc_stat));
1462 closure_init_stack(&writes); 1462 closure_init_stack(&writes);
1463 bch_btree_op_init_stack(&op); 1463 bch_btree_op_init_stack(&op);
1464 op.lock = SHRT_MAX; 1464 op.lock = SHRT_MAX;
1465 1465
1466 btree_gc_start(c); 1466 btree_gc_start(c);
1467 1467
1468 atomic_inc(&c->prio_blocked); 1468 atomic_inc(&c->prio_blocked);
1469 1469
1470 ret = btree_root(gc_root, c, &op, &writes, &stats); 1470 ret = btree_root(gc_root, c, &op, &writes, &stats);
1471 closure_sync(&op.cl); 1471 closure_sync(&op.cl);
1472 closure_sync(&writes); 1472 closure_sync(&writes);
1473 1473
1474 if (ret) { 1474 if (ret) {
1475 pr_warn("gc failed!"); 1475 pr_warn("gc failed!");
1476 continue_at(cl, bch_btree_gc, bch_gc_wq); 1476 continue_at(cl, bch_btree_gc, bch_gc_wq);
1477 } 1477 }
1478 1478
1479 /* Possibly wait for new UUIDs or whatever to hit disk */ 1479 /* Possibly wait for new UUIDs or whatever to hit disk */
1480 bch_journal_meta(c, &op.cl); 1480 bch_journal_meta(c, &op.cl);
1481 closure_sync(&op.cl); 1481 closure_sync(&op.cl);
1482 1482
1483 available = bch_btree_gc_finish(c); 1483 available = bch_btree_gc_finish(c);
1484 1484
1485 atomic_dec(&c->prio_blocked); 1485 atomic_dec(&c->prio_blocked);
1486 wake_up_allocators(c); 1486 wake_up_allocators(c);
1487 1487
1488 bch_time_stats_update(&c->btree_gc_time, start_time); 1488 bch_time_stats_update(&c->btree_gc_time, start_time);
1489 1489
1490 stats.key_bytes *= sizeof(uint64_t); 1490 stats.key_bytes *= sizeof(uint64_t);
1491 stats.dirty <<= 9; 1491 stats.dirty <<= 9;
1492 stats.data <<= 9; 1492 stats.data <<= 9;
1493 stats.in_use = (c->nbuckets - available) * 100 / c->nbuckets; 1493 stats.in_use = (c->nbuckets - available) * 100 / c->nbuckets;
1494 memcpy(&c->gc_stats, &stats, sizeof(struct gc_stat)); 1494 memcpy(&c->gc_stats, &stats, sizeof(struct gc_stat));
1495 1495
1496 trace_bcache_gc_end(c); 1496 trace_bcache_gc_end(c);
1497 1497
1498 continue_at(cl, bch_moving_gc, bch_gc_wq); 1498 continue_at(cl, bch_moving_gc, bch_gc_wq);
1499 } 1499 }
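
Before publishing the per-run statistics, bch_btree_gc() converts its raw counters: key_bytes was accumulated in 64-bit words, dirty and data in 512-byte sectors, and in_use is a whole-cache percentage. The same conversions in isolation (field names are illustrative, and nbuckets is assumed nonzero):

#include <stdint.h>

struct gc_totals {
        uint64_t key_u64s;      /* counted in 64-bit words during the walk */
        uint64_t dirty_sectors;
        uint64_t data_sectors;
};

/* words -> bytes, 512-byte sectors -> bytes, buckets -> percent in use */
static void gc_totals_convert(const struct gc_totals *t,
                              uint64_t nbuckets, uint64_t available,
                              uint64_t *key_bytes, uint64_t *dirty_bytes,
                              uint64_t *data_bytes, unsigned *in_use_pct)
{
        *key_bytes   = t->key_u64s * sizeof(uint64_t);
        *dirty_bytes = t->dirty_sectors << 9;
        *data_bytes  = t->data_sectors << 9;
        *in_use_pct  = (unsigned) ((nbuckets - available) * 100 / nbuckets);
}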
1500 1500
1501 void bch_queue_gc(struct cache_set *c) 1501 void bch_queue_gc(struct cache_set *c)
1502 { 1502 {
1503 closure_trylock_call(&c->gc.cl, bch_btree_gc, bch_gc_wq, &c->cl); 1503 closure_trylock_call(&c->gc.cl, bch_btree_gc, bch_gc_wq, &c->cl);
1504 } 1504 }
1505 1505
1506 /* Initial partial gc */ 1506 /* Initial partial gc */
1507 1507
1508 static int bch_btree_check_recurse(struct btree *b, struct btree_op *op, 1508 static int bch_btree_check_recurse(struct btree *b, struct btree_op *op,
1509 unsigned long **seen) 1509 unsigned long **seen)
1510 { 1510 {
1511 int ret; 1511 int ret;
1512 unsigned i; 1512 unsigned i;
1513 struct bkey *k; 1513 struct bkey *k;
1514 struct bucket *g; 1514 struct bucket *g;
1515 struct btree_iter iter; 1515 struct btree_iter iter;
1516 1516
1517 for_each_key_filter(b, k, &iter, bch_ptr_invalid) { 1517 for_each_key_filter(b, k, &iter, bch_ptr_invalid) {
1518 for (i = 0; i < KEY_PTRS(k); i++) { 1518 for (i = 0; i < KEY_PTRS(k); i++) {
1519 if (!ptr_available(b->c, k, i)) 1519 if (!ptr_available(b->c, k, i))
1520 continue; 1520 continue;
1521 1521
1522 g = PTR_BUCKET(b->c, k, i); 1522 g = PTR_BUCKET(b->c, k, i);
1523 1523
1524 if (!__test_and_set_bit(PTR_BUCKET_NR(b->c, k, i), 1524 if (!__test_and_set_bit(PTR_BUCKET_NR(b->c, k, i),
1525 seen[PTR_DEV(k, i)]) || 1525 seen[PTR_DEV(k, i)]) ||
1526 !ptr_stale(b->c, k, i)) { 1526 !ptr_stale(b->c, k, i)) {
1527 g->gen = PTR_GEN(k, i); 1527 g->gen = PTR_GEN(k, i);
1528 1528
1529 if (b->level) 1529 if (b->level)
1530 g->prio = BTREE_PRIO; 1530 g->prio = BTREE_PRIO;
1531 else if (g->prio == BTREE_PRIO) 1531 else if (g->prio == BTREE_PRIO)
1532 g->prio = INITIAL_PRIO; 1532 g->prio = INITIAL_PRIO;
1533 } 1533 }
1534 } 1534 }
1535 1535
1536 btree_mark_key(b, k); 1536 btree_mark_key(b, k);
1537 } 1537 }
1538 1538
1539 if (b->level) { 1539 if (b->level) {
1540 k = bch_next_recurse_key(b, &ZERO_KEY); 1540 k = bch_next_recurse_key(b, &ZERO_KEY);
1541 1541
1542 while (k) { 1542 while (k) {
1543 struct bkey *p = bch_next_recurse_key(b, k); 1543 struct bkey *p = bch_next_recurse_key(b, k);
1544 if (p) 1544 if (p)
1545 btree_node_prefetch(b->c, p, b->level - 1); 1545 btree_node_prefetch(b->c, p, b->level - 1);
1546 1546
1547 ret = btree(check_recurse, k, b, op, seen); 1547 ret = btree(check_recurse, k, b, op, seen);
1548 if (ret) 1548 if (ret)
1549 return ret; 1549 return ret;
1550 1550
1551 k = p; 1551 k = p;
1552 } 1552 }
1553 } 1553 }
1554 1554
1555 return 0; 1555 return 0;
1556 } 1556 }
1557 1557
1558 int bch_btree_check(struct cache_set *c, struct btree_op *op) 1558 int bch_btree_check(struct cache_set *c, struct btree_op *op)
1559 { 1559 {
1560 int ret = -ENOMEM; 1560 int ret = -ENOMEM;
1561 unsigned i; 1561 unsigned i;
1562 unsigned long *seen[MAX_CACHES_PER_SET]; 1562 unsigned long *seen[MAX_CACHES_PER_SET];
1563 1563
1564 memset(seen, 0, sizeof(seen)); 1564 memset(seen, 0, sizeof(seen));
1565 1565
1566 for (i = 0; c->cache[i]; i++) { 1566 for (i = 0; c->cache[i]; i++) {
1567 size_t n = DIV_ROUND_UP(c->cache[i]->sb.nbuckets, 8); 1567 size_t n = DIV_ROUND_UP(c->cache[i]->sb.nbuckets, 8);
1568 seen[i] = kmalloc(n, GFP_KERNEL); 1568 seen[i] = kmalloc(n, GFP_KERNEL);
1569 if (!seen[i]) 1569 if (!seen[i])
1570 goto err; 1570 goto err;
1571 1571
1572 /* Disables the seen array until prio_read() uses it too */ 1572 /* Disables the seen array until prio_read() uses it too */
1573 memset(seen[i], 0xFF, n); 1573 memset(seen[i], 0xFF, n);
1574 } 1574 }
1575 1575
1576 ret = btree_root(check_recurse, c, op, seen); 1576 ret = btree_root(check_recurse, c, op, seen);
1577 err: 1577 err:
1578 for (i = 0; i < MAX_CACHES_PER_SET; i++) 1578 for (i = 0; i < MAX_CACHES_PER_SET; i++)
1579 kfree(seen[i]); 1579 kfree(seen[i]);
1580 return ret; 1580 return ret;
1581 } 1581 }
1582 1582
1583 /* Btree insertion */ 1583 /* Btree insertion */
1584 1584
1585 static void shift_keys(struct btree *b, struct bkey *where, struct bkey *insert) 1585 static void shift_keys(struct btree *b, struct bkey *where, struct bkey *insert)
1586 { 1586 {
1587 struct bset *i = b->sets[b->nsets].data; 1587 struct bset *i = b->sets[b->nsets].data;
1588 1588
1589 memmove((uint64_t *) where + bkey_u64s(insert), 1589 memmove((uint64_t *) where + bkey_u64s(insert),
1590 where, 1590 where,
1591 (void *) end(i) - (void *) where); 1591 (void *) end(i) - (void *) where);
1592 1592
1593 i->keys += bkey_u64s(insert); 1593 i->keys += bkey_u64s(insert);
1594 bkey_copy(where, insert); 1594 bkey_copy(where, insert);
1595 bch_bset_fix_lookup_table(b, where); 1595 bch_bset_fix_lookup_table(b, where);
1596 } 1596 }
1597 1597
1598 static bool fix_overlapping_extents(struct btree *b, 1598 static bool fix_overlapping_extents(struct btree *b,
1599 struct bkey *insert, 1599 struct bkey *insert,
1600 struct btree_iter *iter, 1600 struct btree_iter *iter,
1601 struct btree_op *op) 1601 struct btree_op *op)
1602 { 1602 {
1603 void subtract_dirty(struct bkey *k, uint64_t offset, int sectors) 1603 void subtract_dirty(struct bkey *k, uint64_t offset, int sectors)
1604 { 1604 {
1605 if (KEY_DIRTY(k)) 1605 if (KEY_DIRTY(k))
1606 bcache_dev_sectors_dirty_add(b->c, KEY_INODE(k), 1606 bcache_dev_sectors_dirty_add(b->c, KEY_INODE(k),
1607 offset, -sectors); 1607 offset, -sectors);
1608 } 1608 }
1609 1609
1610 uint64_t old_offset; 1610 uint64_t old_offset;
1611 unsigned old_size, sectors_found = 0; 1611 unsigned old_size, sectors_found = 0;
1612 1612
1613 while (1) { 1613 while (1) {
1614 struct bkey *k = bch_btree_iter_next(iter); 1614 struct bkey *k = bch_btree_iter_next(iter);
1615 if (!k || 1615 if (!k ||
1616 bkey_cmp(&START_KEY(k), insert) >= 0) 1616 bkey_cmp(&START_KEY(k), insert) >= 0)
1617 break; 1617 break;
1618 1618
1619 if (bkey_cmp(k, &START_KEY(insert)) <= 0) 1619 if (bkey_cmp(k, &START_KEY(insert)) <= 0)
1620 continue; 1620 continue;
1621 1621
1622 old_offset = KEY_START(k); 1622 old_offset = KEY_START(k);
1623 old_size = KEY_SIZE(k); 1623 old_size = KEY_SIZE(k);
1624 1624
1625 /* 1625 /*
1626 * We might overlap with 0 size extents; we can't skip these 1626 * We might overlap with 0 size extents; we can't skip these
1627 * because if they're in the set we're inserting to we have to 1627 * because if they're in the set we're inserting to we have to
1628 * adjust them so they don't overlap with the key we're 1628 * adjust them so they don't overlap with the key we're
1629 * inserting. But we don't want to check them for BTREE_REPLACE 1629 * inserting. But we don't want to check them for BTREE_REPLACE
1630 * operations. 1630 * operations.
1631 */ 1631 */
1632 1632
1633 if (op->type == BTREE_REPLACE && 1633 if (op->type == BTREE_REPLACE &&
1634 KEY_SIZE(k)) { 1634 KEY_SIZE(k)) {
1635 /* 1635 /*
1636 * k might have been split since we inserted/found the 1636 * k might have been split since we inserted/found the
1637 * key we're replacing 1637 * key we're replacing
1638 */ 1638 */
1639 unsigned i; 1639 unsigned i;
1640 uint64_t offset = KEY_START(k) - 1640 uint64_t offset = KEY_START(k) -
1641 KEY_START(&op->replace); 1641 KEY_START(&op->replace);
1642 1642
1643 /* But it must be a subset of the replace key */ 1643 /* But it must be a subset of the replace key */
1644 if (KEY_START(k) < KEY_START(&op->replace) || 1644 if (KEY_START(k) < KEY_START(&op->replace) ||
1645 KEY_OFFSET(k) > KEY_OFFSET(&op->replace)) 1645 KEY_OFFSET(k) > KEY_OFFSET(&op->replace))
1646 goto check_failed; 1646 goto check_failed;
1647 1647
1648 /* We didn't find a key that we were supposed to */ 1648 /* We didn't find a key that we were supposed to */
1649 if (KEY_START(k) > KEY_START(insert) + sectors_found) 1649 if (KEY_START(k) > KEY_START(insert) + sectors_found)
1650 goto check_failed; 1650 goto check_failed;
1651 1651
1652 if (KEY_PTRS(&op->replace) != KEY_PTRS(k)) 1652 if (KEY_PTRS(&op->replace) != KEY_PTRS(k))
1653 goto check_failed; 1653 goto check_failed;
1654 1654
1655 /* skip past gen */ 1655 /* skip past gen */
1656 offset <<= 8; 1656 offset <<= 8;
1657 1657
1658 BUG_ON(!KEY_PTRS(&op->replace)); 1658 BUG_ON(!KEY_PTRS(&op->replace));
1659 1659
1660 for (i = 0; i < KEY_PTRS(&op->replace); i++) 1660 for (i = 0; i < KEY_PTRS(&op->replace); i++)
1661 if (k->ptr[i] != op->replace.ptr[i] + offset) 1661 if (k->ptr[i] != op->replace.ptr[i] + offset)
1662 goto check_failed; 1662 goto check_failed;
1663 1663
1664 sectors_found = KEY_OFFSET(k) - KEY_START(insert); 1664 sectors_found = KEY_OFFSET(k) - KEY_START(insert);
1665 } 1665 }
1666 1666
1667 if (bkey_cmp(insert, k) < 0 && 1667 if (bkey_cmp(insert, k) < 0 &&
1668 bkey_cmp(&START_KEY(insert), &START_KEY(k)) > 0) { 1668 bkey_cmp(&START_KEY(insert), &START_KEY(k)) > 0) {
1669 /* 1669 /*
1670 * We overlapped in the middle of an existing key: that 1670 * We overlapped in the middle of an existing key: that
1671 * means we have to split the old key. But we have to do 1671 * means we have to split the old key. But we have to do
1672 * slightly different things depending on whether the 1672 * slightly different things depending on whether the
1673 * old key has been written out yet. 1673 * old key has been written out yet.
1674 */ 1674 */
1675 1675
1676 struct bkey *top; 1676 struct bkey *top;
1677 1677
1678 subtract_dirty(k, KEY_START(insert), KEY_SIZE(insert)); 1678 subtract_dirty(k, KEY_START(insert), KEY_SIZE(insert));
1679 1679
1680 if (bkey_written(b, k)) { 1680 if (bkey_written(b, k)) {
1681 /* 1681 /*
1682 * We insert a new key to cover the top of the 1682 * We insert a new key to cover the top of the
1683 * old key, and the old key is modified in place 1683 * old key, and the old key is modified in place
1684 * to represent the bottom split. 1684 * to represent the bottom split.
1685 * 1685 *
1686 * It's completely arbitrary whether the new key 1686 * It's completely arbitrary whether the new key
1687 * is the top or the bottom, but it has to match 1687 * is the top or the bottom, but it has to match
1688 * up with what btree_sort_fixup() does - it 1688 * up with what btree_sort_fixup() does - it
1689 * doesn't check for this kind of overlap, it 1689 * doesn't check for this kind of overlap, it
1690 * depends on us inserting a new key for the top 1690 * depends on us inserting a new key for the top
1691 * here. 1691 * here.
1692 */ 1692 */
1693 top = bch_bset_search(b, &b->sets[b->nsets], 1693 top = bch_bset_search(b, &b->sets[b->nsets],
1694 insert); 1694 insert);
1695 shift_keys(b, top, k); 1695 shift_keys(b, top, k);
1696 } else { 1696 } else {
1697 BKEY_PADDED(key) temp; 1697 BKEY_PADDED(key) temp;
1698 bkey_copy(&temp.key, k); 1698 bkey_copy(&temp.key, k);
1699 shift_keys(b, k, &temp.key); 1699 shift_keys(b, k, &temp.key);
1700 top = bkey_next(k); 1700 top = bkey_next(k);
1701 } 1701 }
1702 1702
1703 bch_cut_front(insert, top); 1703 bch_cut_front(insert, top);
1704 bch_cut_back(&START_KEY(insert), k); 1704 bch_cut_back(&START_KEY(insert), k);
1705 bch_bset_fix_invalidated_key(b, k); 1705 bch_bset_fix_invalidated_key(b, k);
1706 return false; 1706 return false;
1707 } 1707 }
1708 1708
1709 if (bkey_cmp(insert, k) < 0) { 1709 if (bkey_cmp(insert, k) < 0) {
1710 bch_cut_front(insert, k); 1710 bch_cut_front(insert, k);
1711 } else { 1711 } else {
1712 if (bkey_written(b, k) && 1712 if (bkey_written(b, k) &&
1713 bkey_cmp(&START_KEY(insert), &START_KEY(k)) <= 0) { 1713 bkey_cmp(&START_KEY(insert), &START_KEY(k)) <= 0) {
1714 /* 1714 /*
1715 * Completely overwrote, so we don't have to 1715 * Completely overwrote, so we don't have to
1716 * invalidate the binary search tree 1716 * invalidate the binary search tree
1717 */ 1717 */
1718 bch_cut_front(k, k); 1718 bch_cut_front(k, k);
1719 } else { 1719 } else {
1720 __bch_cut_back(&START_KEY(insert), k); 1720 __bch_cut_back(&START_KEY(insert), k);
1721 bch_bset_fix_invalidated_key(b, k); 1721 bch_bset_fix_invalidated_key(b, k);
1722 } 1722 }
1723 } 1723 }
1724 1724
1725 subtract_dirty(k, old_offset, old_size - KEY_SIZE(k)); 1725 subtract_dirty(k, old_offset, old_size - KEY_SIZE(k));
1726 } 1726 }
1727 1727
1728 check_failed: 1728 check_failed:
1729 if (op->type == BTREE_REPLACE) { 1729 if (op->type == BTREE_REPLACE) {
1730 if (!sectors_found) { 1730 if (!sectors_found) {
1731 op->insert_collision = true; 1731 op->insert_collision = true;
1732 return true; 1732 return true;
1733 } else if (sectors_found < KEY_SIZE(insert)) { 1733 } else if (sectors_found < KEY_SIZE(insert)) {
1734 SET_KEY_OFFSET(insert, KEY_OFFSET(insert) - 1734 SET_KEY_OFFSET(insert, KEY_OFFSET(insert) -
1735 (KEY_SIZE(insert) - sectors_found)); 1735 (KEY_SIZE(insert) - sectors_found));
1736 SET_KEY_SIZE(insert, sectors_found); 1736 SET_KEY_SIZE(insert, sectors_found);
1737 } 1737 }
1738 } 1738 }
1739 1739
1740 return false; 1740 return false;
1741 } 1741 }
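
Stripped of the dirty accounting and BTREE_REPLACE checks, fix_overlapping_extents() is interval arithmetic on half-open ranges: cut the front of an old extent, cut its back, or split it in two when the new extent lands strictly inside it. A self-contained sketch of that arithmetic on toy [start, end) extents rather than bkeys:

#include <stdbool.h>
#include <stdint.h>

struct extent {
        uint64_t start, end;    /* half-open [start, end), in sectors */
};

/* Trim `old` so it no longer overlaps `new_`. Returns true and fills in
 * `split` when `new_` sits strictly inside `old`, which forces a split
 * into a bottom half (left in `old`) and a top half (in `split`). */
static bool extent_fixup(struct extent *old, const struct extent *new_,
                         struct extent *split)
{
        if (new_->end <= old->start || new_->start >= old->end)
                return false;                   /* no overlap */

        if (new_->start > old->start && new_->end < old->end) {
                split->start = new_->end;       /* top piece */
                split->end   = old->end;
                old->end     = new_->start;     /* bottom piece */
                return true;
        }

        if (new_->start <= old->start)
                old->start = new_->end;         /* cut the front */
        else
                old->end = new_->start;         /* cut the back */

        if (old->start > old->end)
                old->start = old->end;          /* fully covered: now empty */
        return false;
}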
1742 1742
1743 static bool btree_insert_key(struct btree *b, struct btree_op *op, 1743 static bool btree_insert_key(struct btree *b, struct btree_op *op,
1744 struct bkey *k) 1744 struct bkey *k)
1745 { 1745 {
1746 struct bset *i = b->sets[b->nsets].data; 1746 struct bset *i = b->sets[b->nsets].data;
1747 struct bkey *m, *prev; 1747 struct bkey *m, *prev;
1748 unsigned status = BTREE_INSERT_STATUS_INSERT; 1748 unsigned status = BTREE_INSERT_STATUS_INSERT;
1749 1749
1750 BUG_ON(bkey_cmp(k, &b->key) > 0); 1750 BUG_ON(bkey_cmp(k, &b->key) > 0);
1751 BUG_ON(b->level && !KEY_PTRS(k)); 1751 BUG_ON(b->level && !KEY_PTRS(k));
1752 BUG_ON(!b->level && !KEY_OFFSET(k)); 1752 BUG_ON(!b->level && !KEY_OFFSET(k));
1753 1753
1754 if (!b->level) { 1754 if (!b->level) {
1755 struct btree_iter iter; 1755 struct btree_iter iter;
1756 struct bkey search = KEY(KEY_INODE(k), KEY_START(k), 0); 1756 struct bkey search = KEY(KEY_INODE(k), KEY_START(k), 0);
1757 1757
1758 /* 1758 /*
1759 * bset_search() returns the first key that is strictly greater 1759 * bset_search() returns the first key that is strictly greater
1760 * than the search key - but for back merging, we want to find 1760 * than the search key - but for back merging, we want to find
1761 * the first key that is greater than or equal to KEY_START(k) - 1761 * the first key that is greater than or equal to KEY_START(k) -
1762 * unless KEY_START(k) is 0. 1762 * unless KEY_START(k) is 0.
1763 */ 1763 */
1764 if (KEY_OFFSET(&search)) 1764 if (KEY_OFFSET(&search))
1765 SET_KEY_OFFSET(&search, KEY_OFFSET(&search) - 1); 1765 SET_KEY_OFFSET(&search, KEY_OFFSET(&search) - 1);
1766 1766
1767 prev = NULL; 1767 prev = NULL;
1768 m = bch_btree_iter_init(b, &iter, &search); 1768 m = bch_btree_iter_init(b, &iter, &search);
1769 1769
1770 if (fix_overlapping_extents(b, k, &iter, op)) 1770 if (fix_overlapping_extents(b, k, &iter, op))
1771 return false; 1771 return false;
1772 1772
1773 while (m != end(i) && 1773 while (m != end(i) &&
1774 bkey_cmp(k, &START_KEY(m)) > 0) 1774 bkey_cmp(k, &START_KEY(m)) > 0)
1775 prev = m, m = bkey_next(m); 1775 prev = m, m = bkey_next(m);
1776 1776
1777 if (key_merging_disabled(b->c)) 1777 if (key_merging_disabled(b->c))
1778 goto insert; 1778 goto insert;
1779 1779
1780 /* prev is in the tree, if we merge we're done */ 1780 /* prev is in the tree, if we merge we're done */
1781 status = BTREE_INSERT_STATUS_BACK_MERGE; 1781 status = BTREE_INSERT_STATUS_BACK_MERGE;
1782 if (prev && 1782 if (prev &&
1783 bch_bkey_try_merge(b, prev, k)) 1783 bch_bkey_try_merge(b, prev, k))
1784 goto merged; 1784 goto merged;
1785 1785
1786 status = BTREE_INSERT_STATUS_OVERWROTE; 1786 status = BTREE_INSERT_STATUS_OVERWROTE;
1787 if (m != end(i) && 1787 if (m != end(i) &&
1788 KEY_PTRS(m) == KEY_PTRS(k) && !KEY_SIZE(m)) 1788 KEY_PTRS(m) == KEY_PTRS(k) && !KEY_SIZE(m))
1789 goto copy; 1789 goto copy;
1790 1790
1791 status = BTREE_INSERT_STATUS_FRONT_MERGE; 1791 status = BTREE_INSERT_STATUS_FRONT_MERGE;
1792 if (m != end(i) && 1792 if (m != end(i) &&
1793 bch_bkey_try_merge(b, k, m)) 1793 bch_bkey_try_merge(b, k, m))
1794 goto copy; 1794 goto copy;
1795 } else 1795 } else
1796 m = bch_bset_search(b, &b->sets[b->nsets], k); 1796 m = bch_bset_search(b, &b->sets[b->nsets], k);
1797 1797
1798 insert: shift_keys(b, m, k); 1798 insert: shift_keys(b, m, k);
1799 copy: bkey_copy(m, k); 1799 copy: bkey_copy(m, k);
1800 merged: 1800 merged:
1801 if (KEY_DIRTY(k)) 1801 if (KEY_DIRTY(k))
1802 bcache_dev_sectors_dirty_add(b->c, KEY_INODE(k), 1802 bcache_dev_sectors_dirty_add(b->c, KEY_INODE(k),
1803 KEY_START(k), KEY_SIZE(k)); 1803 KEY_START(k), KEY_SIZE(k));
1804 1804
1805 bch_check_keys(b, "%u for %s", status, op_type(op)); 1805 bch_check_keys(b, "%u for %s", status, op_type(op));
1806 1806
1807 if (b->level && !KEY_OFFSET(k)) 1807 if (b->level && !KEY_OFFSET(k))
1808 btree_current_write(b)->prio_blocked++; 1808 btree_current_write(b)->prio_blocked++;
1809 1809
1810 trace_bcache_btree_insert_key(b, k, op->type, status); 1810 trace_bcache_btree_insert_key(b, k, op->type, status);
1811 1811
1812 return true; 1812 return true;
1813 } 1813 }
1814 1814
1815 bool bch_btree_insert_keys(struct btree *b, struct btree_op *op) 1815 bool bch_btree_insert_keys(struct btree *b, struct btree_op *op)
1816 { 1816 {
1817 bool ret = false; 1817 bool ret = false;
1818 struct bkey *k; 1818 struct bkey *k;
1819 unsigned oldsize = bch_count_data(b); 1819 unsigned oldsize = bch_count_data(b);
1820 1820
1821 while ((k = bch_keylist_pop(&op->keys))) { 1821 while ((k = bch_keylist_pop(&op->keys))) {
1822 bkey_put(b->c, k, b->level); 1822 bkey_put(b->c, k, b->level);
1823 ret |= btree_insert_key(b, op, k); 1823 ret |= btree_insert_key(b, op, k);
1824 } 1824 }
1825 1825
1826 BUG_ON(bch_count_data(b) < oldsize); 1826 BUG_ON(bch_count_data(b) < oldsize);
1827 return ret; 1827 return ret;
1828 } 1828 }
1829 1829
1830 bool bch_btree_insert_check_key(struct btree *b, struct btree_op *op, 1830 bool bch_btree_insert_check_key(struct btree *b, struct btree_op *op,
1831 struct bio *bio) 1831 struct bio *bio)
1832 { 1832 {
1833 bool ret = false; 1833 bool ret = false;
1834 uint64_t btree_ptr = b->key.ptr[0]; 1834 uint64_t btree_ptr = b->key.ptr[0];
1835 unsigned long seq = b->seq; 1835 unsigned long seq = b->seq;
1836 BKEY_PADDED(k) tmp; 1836 BKEY_PADDED(k) tmp;
1837 1837
1838 rw_unlock(false, b); 1838 rw_unlock(false, b);
1839 rw_lock(true, b, b->level); 1839 rw_lock(true, b, b->level);
1840 1840
1841 if (b->key.ptr[0] != btree_ptr || 1841 if (b->key.ptr[0] != btree_ptr ||
1842 b->seq != seq + 1 || 1842 b->seq != seq + 1 ||
1843 should_split(b)) 1843 should_split(b))
1844 goto out; 1844 goto out;
1845 1845
1846 op->replace = KEY(op->inode, bio_end(bio), bio_sectors(bio)); 1846 op->replace = KEY(op->inode, bio_end(bio), bio_sectors(bio));
1847 1847
1848 SET_KEY_PTRS(&op->replace, 1); 1848 SET_KEY_PTRS(&op->replace, 1);
1849 get_random_bytes(&op->replace.ptr[0], sizeof(uint64_t)); 1849 get_random_bytes(&op->replace.ptr[0], sizeof(uint64_t));
1850 1850
1851 SET_PTR_DEV(&op->replace, 0, PTR_CHECK_DEV); 1851 SET_PTR_DEV(&op->replace, 0, PTR_CHECK_DEV);
1852 1852
1853 bkey_copy(&tmp.k, &op->replace); 1853 bkey_copy(&tmp.k, &op->replace);
1854 1854
1855 BUG_ON(op->type != BTREE_INSERT); 1855 BUG_ON(op->type != BTREE_INSERT);
1856 BUG_ON(!btree_insert_key(b, op, &tmp.k)); 1856 BUG_ON(!btree_insert_key(b, op, &tmp.k));
1857 ret = true; 1857 ret = true;
1858 out: 1858 out:
1859 downgrade_write(&b->lock); 1859 downgrade_write(&b->lock);
1860 return ret; 1860 return ret;
1861 } 1861 }
1862 1862
1863 static int btree_split(struct btree *b, struct btree_op *op) 1863 static int btree_split(struct btree *b, struct btree_op *op)
1864 { 1864 {
1865 bool split, root = b == b->c->root; 1865 bool split, root = b == b->c->root;
1866 struct btree *n1, *n2 = NULL, *n3 = NULL; 1866 struct btree *n1, *n2 = NULL, *n3 = NULL;
1867 uint64_t start_time = local_clock(); 1867 uint64_t start_time = local_clock();
1868 1868
1869 if (b->level) 1869 if (b->level)
1870 set_closure_blocking(&op->cl); 1870 set_closure_blocking(&op->cl);
1871 1871
1872 n1 = btree_node_alloc_replacement(b, &op->cl); 1872 n1 = btree_node_alloc_replacement(b, &op->cl);
1873 if (IS_ERR(n1)) 1873 if (IS_ERR(n1))
1874 goto err; 1874 goto err;
1875 1875
1876 split = set_blocks(n1->sets[0].data, n1->c) > (btree_blocks(b) * 4) / 5; 1876 split = set_blocks(n1->sets[0].data, n1->c) > (btree_blocks(b) * 4) / 5;
1877 1877
1878 if (split) { 1878 if (split) {
1879 unsigned keys = 0; 1879 unsigned keys = 0;
1880 1880
1881 trace_bcache_btree_node_split(b, n1->sets[0].data->keys); 1881 trace_bcache_btree_node_split(b, n1->sets[0].data->keys);
1882 1882
1883 n2 = bch_btree_node_alloc(b->c, b->level, &op->cl); 1883 n2 = bch_btree_node_alloc(b->c, b->level, &op->cl);
1884 if (IS_ERR(n2)) 1884 if (IS_ERR(n2))
1885 goto err_free1; 1885 goto err_free1;
1886 1886
1887 if (root) { 1887 if (root) {
1888 n3 = bch_btree_node_alloc(b->c, b->level + 1, &op->cl); 1888 n3 = bch_btree_node_alloc(b->c, b->level + 1, &op->cl);
1889 if (IS_ERR(n3)) 1889 if (IS_ERR(n3))
1890 goto err_free2; 1890 goto err_free2;
1891 } 1891 }
1892 1892
1893 bch_btree_insert_keys(n1, op); 1893 bch_btree_insert_keys(n1, op);
1894 1894
1895 /* Has to be a linear search because we don't have an auxiliary 1895 /* Has to be a linear search because we don't have an auxiliary
1896 * search tree yet 1896 * search tree yet
1897 */ 1897 */
1898 1898
1899 while (keys < (n1->sets[0].data->keys * 3) / 5) 1899 while (keys < (n1->sets[0].data->keys * 3) / 5)
1900 keys += bkey_u64s(node(n1->sets[0].data, keys)); 1900 keys += bkey_u64s(node(n1->sets[0].data, keys));
1901 1901
1902 bkey_copy_key(&n1->key, node(n1->sets[0].data, keys)); 1902 bkey_copy_key(&n1->key, node(n1->sets[0].data, keys));
1903 keys += bkey_u64s(node(n1->sets[0].data, keys)); 1903 keys += bkey_u64s(node(n1->sets[0].data, keys));
1904 1904
1905 n2->sets[0].data->keys = n1->sets[0].data->keys - keys; 1905 n2->sets[0].data->keys = n1->sets[0].data->keys - keys;
1906 n1->sets[0].data->keys = keys; 1906 n1->sets[0].data->keys = keys;
1907 1907
1908 memcpy(n2->sets[0].data->start, 1908 memcpy(n2->sets[0].data->start,
1909 end(n1->sets[0].data), 1909 end(n1->sets[0].data),
1910 n2->sets[0].data->keys * sizeof(uint64_t)); 1910 n2->sets[0].data->keys * sizeof(uint64_t));
1911 1911
1912 bkey_copy_key(&n2->key, &b->key); 1912 bkey_copy_key(&n2->key, &b->key);
1913 1913
1914 bch_keylist_add(&op->keys, &n2->key); 1914 bch_keylist_add(&op->keys, &n2->key);
1915 bch_btree_node_write(n2, &op->cl); 1915 bch_btree_node_write(n2, &op->cl);
1916 rw_unlock(true, n2); 1916 rw_unlock(true, n2);
1917 } else { 1917 } else {
1918 trace_bcache_btree_node_compact(b, n1->sets[0].data->keys); 1918 trace_bcache_btree_node_compact(b, n1->sets[0].data->keys);
1919 1919
1920 bch_btree_insert_keys(n1, op); 1920 bch_btree_insert_keys(n1, op);
1921 } 1921 }
1922 1922
1923 bch_keylist_add(&op->keys, &n1->key); 1923 bch_keylist_add(&op->keys, &n1->key);
1924 bch_btree_node_write(n1, &op->cl); 1924 bch_btree_node_write(n1, &op->cl);
1925 1925
1926 if (n3) { 1926 if (n3) {
1927 bkey_copy_key(&n3->key, &MAX_KEY); 1927 bkey_copy_key(&n3->key, &MAX_KEY);
1928 bch_btree_insert_keys(n3, op); 1928 bch_btree_insert_keys(n3, op);
1929 bch_btree_node_write(n3, &op->cl); 1929 bch_btree_node_write(n3, &op->cl);
1930 1930
1931 closure_sync(&op->cl); 1931 closure_sync(&op->cl);
1932 bch_btree_set_root(n3); 1932 bch_btree_set_root(n3);
1933 rw_unlock(true, n3); 1933 rw_unlock(true, n3);
1934 } else if (root) { 1934 } else if (root) {
1935 op->keys.top = op->keys.bottom; 1935 op->keys.top = op->keys.bottom;
1936 closure_sync(&op->cl); 1936 closure_sync(&op->cl);
1937 bch_btree_set_root(n1); 1937 bch_btree_set_root(n1);
1938 } else { 1938 } else {
1939 unsigned i; 1939 unsigned i;
1940 1940
1941 bkey_copy(op->keys.top, &b->key); 1941 bkey_copy(op->keys.top, &b->key);
1942 bkey_copy_key(op->keys.top, &ZERO_KEY); 1942 bkey_copy_key(op->keys.top, &ZERO_KEY);
1943 1943
1944 for (i = 0; i < KEY_PTRS(&b->key); i++) { 1944 for (i = 0; i < KEY_PTRS(&b->key); i++) {
1945 uint8_t g = PTR_BUCKET(b->c, &b->key, i)->gen + 1; 1945 uint8_t g = PTR_BUCKET(b->c, &b->key, i)->gen + 1;
1946 1946
1947 SET_PTR_GEN(op->keys.top, i, g); 1947 SET_PTR_GEN(op->keys.top, i, g);
1948 } 1948 }
1949 1949
1950 bch_keylist_push(&op->keys); 1950 bch_keylist_push(&op->keys);
1951 closure_sync(&op->cl); 1951 closure_sync(&op->cl);
1952 atomic_inc(&b->c->prio_blocked); 1952 atomic_inc(&b->c->prio_blocked);
1953 } 1953 }
1954 1954
1955 rw_unlock(true, n1); 1955 rw_unlock(true, n1);
1956 btree_node_free(b, op); 1956 btree_node_free(b, op);
1957 1957
1958 bch_time_stats_update(&b->c->btree_split_time, start_time); 1958 bch_time_stats_update(&b->c->btree_split_time, start_time);
1959 1959
1960 return 0; 1960 return 0;
1961 err_free2: 1961 err_free2:
1962 __bkey_put(n2->c, &n2->key); 1962 __bkey_put(n2->c, &n2->key);
1963 btree_node_free(n2, op); 1963 btree_node_free(n2, op);
1964 rw_unlock(true, n2); 1964 rw_unlock(true, n2);
1965 err_free1: 1965 err_free1:
1966 __bkey_put(n1->c, &n1->key); 1966 __bkey_put(n1->c, &n1->key);
1967 btree_node_free(n1, op); 1967 btree_node_free(n1, op);
1968 rw_unlock(true, n1); 1968 rw_unlock(true, n1);
1969 err: 1969 err:
1970 if (n3 == ERR_PTR(-EAGAIN) || 1970 if (n3 == ERR_PTR(-EAGAIN) ||
1971 n2 == ERR_PTR(-EAGAIN) || 1971 n2 == ERR_PTR(-EAGAIN) ||
1972 n1 == ERR_PTR(-EAGAIN)) 1972 n1 == ERR_PTR(-EAGAIN))
1973 return -EAGAIN; 1973 return -EAGAIN;
1974 1974
1975 pr_warn("couldn't split"); 1975 pr_warn("couldn't split");
1976 return -ENOMEM; 1976 return -ENOMEM;
1977 } 1977 }
1978 1978
1979 static int bch_btree_insert_recurse(struct btree *b, struct btree_op *op, 1979 static int bch_btree_insert_recurse(struct btree *b, struct btree_op *op,
1980 struct keylist *stack_keys) 1980 struct keylist *stack_keys)
1981 { 1981 {
1982 if (b->level) { 1982 if (b->level) {
1983 int ret; 1983 int ret;
1984 struct bkey *insert = op->keys.bottom; 1984 struct bkey *insert = op->keys.bottom;
1985 struct bkey *k = bch_next_recurse_key(b, &START_KEY(insert)); 1985 struct bkey *k = bch_next_recurse_key(b, &START_KEY(insert));
1986 1986
1987 if (!k) { 1987 if (!k) {
1988 btree_bug(b, "no key to recurse on at level %i/%i", 1988 btree_bug(b, "no key to recurse on at level %i/%i",
1989 b->level, b->c->root->level); 1989 b->level, b->c->root->level);
1990 1990
1991 op->keys.top = op->keys.bottom; 1991 op->keys.top = op->keys.bottom;
1992 return -EIO; 1992 return -EIO;
1993 } 1993 }
1994 1994
1995 if (bkey_cmp(insert, k) > 0) { 1995 if (bkey_cmp(insert, k) > 0) {
1996 unsigned i; 1996 unsigned i;
1997 1997
1998 if (op->type == BTREE_REPLACE) { 1998 if (op->type == BTREE_REPLACE) {
1999 __bkey_put(b->c, insert); 1999 __bkey_put(b->c, insert);
2000 op->keys.top = op->keys.bottom; 2000 op->keys.top = op->keys.bottom;
2001 op->insert_collision = true; 2001 op->insert_collision = true;
2002 return 0; 2002 return 0;
2003 } 2003 }
2004 2004
2005 for (i = 0; i < KEY_PTRS(insert); i++) 2005 for (i = 0; i < KEY_PTRS(insert); i++)
2006 atomic_inc(&PTR_BUCKET(b->c, insert, i)->pin); 2006 atomic_inc(&PTR_BUCKET(b->c, insert, i)->pin);
2007 2007
2008 bkey_copy(stack_keys->top, insert); 2008 bkey_copy(stack_keys->top, insert);
2009 2009
2010 bch_cut_back(k, insert); 2010 bch_cut_back(k, insert);
2011 bch_cut_front(k, stack_keys->top); 2011 bch_cut_front(k, stack_keys->top);
2012 2012
2013 bch_keylist_push(stack_keys); 2013 bch_keylist_push(stack_keys);
2014 } 2014 }
2015 2015
2016 ret = btree(insert_recurse, k, b, op, stack_keys); 2016 ret = btree(insert_recurse, k, b, op, stack_keys);
2017 if (ret) 2017 if (ret)
2018 return ret; 2018 return ret;
2019 } 2019 }
2020 2020
2021 if (!bch_keylist_empty(&op->keys)) { 2021 if (!bch_keylist_empty(&op->keys)) {
2022 if (should_split(b)) { 2022 if (should_split(b)) {
2023 if (op->lock <= b->c->root->level) { 2023 if (op->lock <= b->c->root->level) {
2024 BUG_ON(b->level); 2024 BUG_ON(b->level);
2025 op->lock = b->c->root->level + 1; 2025 op->lock = b->c->root->level + 1;
2026 return -EINTR; 2026 return -EINTR;
2027 } 2027 }
2028 return btree_split(b, op); 2028 return btree_split(b, op);
2029 } 2029 }
2030 2030
2031 BUG_ON(write_block(b) != b->sets[b->nsets].data); 2031 BUG_ON(write_block(b) != b->sets[b->nsets].data);
2032 2032
2033 if (bch_btree_insert_keys(b, op)) { 2033 if (bch_btree_insert_keys(b, op)) {
2034 if (!b->level) 2034 if (!b->level)
2035 bch_btree_leaf_dirty(b, op); 2035 bch_btree_leaf_dirty(b, op);
2036 else 2036 else
2037 bch_btree_node_write(b, &op->cl); 2037 bch_btree_node_write(b, &op->cl);
2038 } 2038 }
2039 } 2039 }
2040 2040
2041 return 0; 2041 return 0;
2042 } 2042 }
2043 2043
2044 int bch_btree_insert(struct btree_op *op, struct cache_set *c) 2044 int bch_btree_insert(struct btree_op *op, struct cache_set *c)
2045 { 2045 {
2046 int ret = 0; 2046 int ret = 0;
2047 struct keylist stack_keys; 2047 struct keylist stack_keys;
2048 2048
2049 /* 2049 /*
2050 * Don't want to block with the btree locked unless we have to, 2050 * Don't want to block with the btree locked unless we have to,
2051 * otherwise we get deadlocks with try_harder and between split/gc 2051 * otherwise we get deadlocks with try_harder and between split/gc
2052 */ 2052 */
2053 clear_closure_blocking(&op->cl); 2053 clear_closure_blocking(&op->cl);
2054 2054
2055 BUG_ON(bch_keylist_empty(&op->keys)); 2055 BUG_ON(bch_keylist_empty(&op->keys));
2056 bch_keylist_copy(&stack_keys, &op->keys); 2056 bch_keylist_copy(&stack_keys, &op->keys);
2057 bch_keylist_init(&op->keys); 2057 bch_keylist_init(&op->keys);
2058 2058
2059 while (!bch_keylist_empty(&stack_keys) || 2059 while (!bch_keylist_empty(&stack_keys) ||
2060 !bch_keylist_empty(&op->keys)) { 2060 !bch_keylist_empty(&op->keys)) {
2061 if (bch_keylist_empty(&op->keys)) { 2061 if (bch_keylist_empty(&op->keys)) {
2062 bch_keylist_add(&op->keys, 2062 bch_keylist_add(&op->keys,
2063 bch_keylist_pop(&stack_keys)); 2063 bch_keylist_pop(&stack_keys));
2064 op->lock = 0; 2064 op->lock = 0;
2065 } 2065 }
2066 2066
2067 ret = btree_root(insert_recurse, c, op, &stack_keys); 2067 ret = btree_root(insert_recurse, c, op, &stack_keys);
2068 2068
2069 if (ret == -EAGAIN) { 2069 if (ret == -EAGAIN) {
2070 ret = 0; 2070 ret = 0;
2071 closure_sync(&op->cl); 2071 closure_sync(&op->cl);
2072 } else if (ret) { 2072 } else if (ret) {
2073 struct bkey *k; 2073 struct bkey *k;
2074 2074
2075 pr_err("error %i trying to insert key for %s", 2075 pr_err("error %i trying to insert key for %s",
2076 ret, op_type(op)); 2076 ret, op_type(op));
2077 2077
2078 while ((k = bch_keylist_pop(&stack_keys) ?: 2078 while ((k = bch_keylist_pop(&stack_keys) ?:
2079 bch_keylist_pop(&op->keys))) 2079 bch_keylist_pop(&op->keys)))
2080 bkey_put(c, k, 0); 2080 bkey_put(c, k, 0);
2081 } 2081 }
2082 } 2082 }
2083 2083
2084 bch_keylist_free(&stack_keys); 2084 bch_keylist_free(&stack_keys);
2085 2085
2086 if (op->journal) 2086 if (op->journal)
2087 atomic_dec_bug(op->journal); 2087 atomic_dec_bug(op->journal);
2088 op->journal = NULL; 2088 op->journal = NULL;
2089 return ret; 2089 return ret;
2090 } 2090 }
2091 2091
2092 void bch_btree_set_root(struct btree *b) 2092 void bch_btree_set_root(struct btree *b)
2093 { 2093 {
2094 unsigned i; 2094 unsigned i;
2095 2095
2096 trace_bcache_btree_set_root(b); 2096 trace_bcache_btree_set_root(b);
2097 2097
2098 BUG_ON(!b->written); 2098 BUG_ON(!b->written);
2099 2099
2100 for (i = 0; i < KEY_PTRS(&b->key); i++) 2100 for (i = 0; i < KEY_PTRS(&b->key); i++)
2101 BUG_ON(PTR_BUCKET(b->c, &b->key, i)->prio != BTREE_PRIO); 2101 BUG_ON(PTR_BUCKET(b->c, &b->key, i)->prio != BTREE_PRIO);
2102 2102
2103 mutex_lock(&b->c->bucket_lock); 2103 mutex_lock(&b->c->bucket_lock);
2104 list_del_init(&b->list); 2104 list_del_init(&b->list);
2105 mutex_unlock(&b->c->bucket_lock); 2105 mutex_unlock(&b->c->bucket_lock);
2106 2106
2107 b->c->root = b; 2107 b->c->root = b;
2108 __bkey_put(b->c, &b->key); 2108 __bkey_put(b->c, &b->key);
2109 2109
2110 bch_journal_meta(b->c, NULL); 2110 bch_journal_meta(b->c, NULL);
2111 } 2111 }
2112 2112
2113 /* Cache lookup */ 2113 /* Cache lookup */
2114 2114
2115 static int submit_partial_cache_miss(struct btree *b, struct btree_op *op, 2115 static int submit_partial_cache_miss(struct btree *b, struct btree_op *op,
2116 struct bkey *k) 2116 struct bkey *k)
2117 { 2117 {
2118 struct search *s = container_of(op, struct search, op); 2118 struct search *s = container_of(op, struct search, op);
2119 struct bio *bio = &s->bio.bio; 2119 struct bio *bio = &s->bio.bio;
2120 int ret = 0; 2120 int ret = 0;
2121 2121
2122 while (!ret && 2122 while (!ret &&
2123 !op->lookup_done) { 2123 !op->lookup_done) {
2124 unsigned sectors = INT_MAX; 2124 unsigned sectors = INT_MAX;
2125 2125
2126 if (KEY_INODE(k) == op->inode) { 2126 if (KEY_INODE(k) == op->inode) {
2127 if (KEY_START(k) <= bio->bi_sector) 2127 if (KEY_START(k) <= bio->bi_sector)
2128 break; 2128 break;
2129 2129
2130 sectors = min_t(uint64_t, sectors, 2130 sectors = min_t(uint64_t, sectors,
2131 KEY_START(k) - bio->bi_sector); 2131 KEY_START(k) - bio->bi_sector);
2132 } 2132 }
2133 2133
2134 ret = s->d->cache_miss(b, s, bio, sectors); 2134 ret = s->d->cache_miss(b, s, bio, sectors);
2135 } 2135 }
2136 2136
2137 return ret; 2137 return ret;
2138 } 2138 }
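submit_partial_cache_miss() above only fills the gap in front of the key: if the bio currently starts before KEY_START(k), that leading range is a miss and is handed to cache_miss() before the hit is serviced. A minimal standalone illustration of the sector clamp, with illustrative numbers:

#include <limits.h>
#include <stdio.h>

int main(void)
{
        unsigned long long bi_sector = 100;  /* where the bio currently starts  */
        unsigned long long key_start = 150;  /* KEY_START(k) of the next extent */
        unsigned long long sectors = INT_MAX;

        /* mirrors the min_t(uint64_t, ...) clamp above */
        if (key_start - bi_sector < sectors)
                sectors = key_start - bi_sector;

        printf("%llu miss sectors to submit before the hit\n", sectors);  /* 50 */
        return 0;
}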
2139 2139
2140 /* 2140 /*
2141 * Read from a single key, handling the initial cache miss if the key starts in 2141 * Read from a single key, handling the initial cache miss if the key starts in
2142 * the middle of the bio 2142 * the middle of the bio
2143 */ 2143 */
2144 static int submit_partial_cache_hit(struct btree *b, struct btree_op *op, 2144 static int submit_partial_cache_hit(struct btree *b, struct btree_op *op,
2145 struct bkey *k) 2145 struct bkey *k)
2146 { 2146 {
2147 struct search *s = container_of(op, struct search, op); 2147 struct search *s = container_of(op, struct search, op);
2148 struct bio *bio = &s->bio.bio; 2148 struct bio *bio = &s->bio.bio;
2149 unsigned ptr; 2149 unsigned ptr;
2150 struct bio *n; 2150 struct bio *n;
2151 2151
2152 int ret = submit_partial_cache_miss(b, op, k); 2152 int ret = submit_partial_cache_miss(b, op, k);
2153 if (ret || op->lookup_done) 2153 if (ret || op->lookup_done)
2154 return ret; 2154 return ret;
2155 2155
2156 /* XXX: figure out best pointer - for multiple cache devices */ 2156 /* XXX: figure out best pointer - for multiple cache devices */
2157 ptr = 0; 2157 ptr = 0;
2158 2158
2159 PTR_BUCKET(b->c, k, ptr)->prio = INITIAL_PRIO; 2159 PTR_BUCKET(b->c, k, ptr)->prio = INITIAL_PRIO;
2160 2160
2161 while (!op->lookup_done && 2161 while (!op->lookup_done &&
2162 KEY_INODE(k) == op->inode && 2162 KEY_INODE(k) == op->inode &&
2163 bio->bi_sector < KEY_OFFSET(k)) { 2163 bio->bi_sector < KEY_OFFSET(k)) {
2164 struct bkey *bio_key; 2164 struct bkey *bio_key;
2165 sector_t sector = PTR_OFFSET(k, ptr) + 2165 sector_t sector = PTR_OFFSET(k, ptr) +
2166 (bio->bi_sector - KEY_START(k)); 2166 (bio->bi_sector - KEY_START(k));
2167 unsigned sectors = min_t(uint64_t, INT_MAX, 2167 unsigned sectors = min_t(uint64_t, INT_MAX,
2168 KEY_OFFSET(k) - bio->bi_sector); 2168 KEY_OFFSET(k) - bio->bi_sector);
2169 2169
2170 n = bch_bio_split(bio, sectors, GFP_NOIO, s->d->bio_split); 2170 n = bch_bio_split(bio, sectors, GFP_NOIO, s->d->bio_split);
2171 if (!n) 2171 if (!n)
2172 return -EAGAIN; 2172 return -EAGAIN;
2173 2173
2174 if (n == bio) 2174 if (n == bio)
2175 op->lookup_done = true; 2175 op->lookup_done = true;
2176 2176
2177 bio_key = &container_of(n, struct bbio, bio)->key; 2177 bio_key = &container_of(n, struct bbio, bio)->key;
2178 2178
2179 /* 2179 /*
2180 * The bucket we're reading from might be reused while our bio 2180 * The bucket we're reading from might be reused while our bio
2181 * is in flight, and we could then end up reading the wrong 2181 * is in flight, and we could then end up reading the wrong
2182 * data. 2182 * data.
2183 * 2183 *
2184 * We guard against this by checking (in cache_read_endio()) if 2184 * We guard against this by checking (in cache_read_endio()) if
2185 * the pointer is stale again; if so, we treat it as an error 2185 * the pointer is stale again; if so, we treat it as an error
2186 * and reread from the backing device (but we don't pass that 2186 * and reread from the backing device (but we don't pass that
2187 * error up anywhere). 2187 * error up anywhere).
2188 */ 2188 */
2189 2189
2190 bch_bkey_copy_single_ptr(bio_key, k, ptr); 2190 bch_bkey_copy_single_ptr(bio_key, k, ptr);
2191 SET_PTR_OFFSET(bio_key, 0, sector); 2191 SET_PTR_OFFSET(bio_key, 0, sector);
2192 2192
2193 n->bi_end_io = bch_cache_read_endio; 2193 n->bi_end_io = bch_cache_read_endio;
2194 n->bi_private = &s->cl; 2194 n->bi_private = &s->cl;
2195 2195
2196 __bch_submit_bbio(n, b->c); 2196 __bch_submit_bbio(n, b->c);
2197 } 2197 }
2198 2198
2199 return 0; 2199 return 0;
2200 } 2200 }
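The stale-pointer guard described in the comment above is enforced on completion by bch_cache_read_endio(), which lives in request.c and is not part of this hunk. A hedged sketch of what that check looks like, assuming ptr_stale() and bch_bbio_endio() behave as they do elsewhere in bcache:

static void bch_cache_read_endio(struct bio *bio, int error)
{
        struct bbio *b = container_of(bio, struct bbio, bio);
        struct closure *cl = bio->bi_private;
        struct search *s = container_of(cl, struct search, cl);

        /*
         * If the bucket this bio_key points into was reused while the read
         * was in flight, record a soft error so the data is reread from the
         * backing device instead of being returned stale.
         */
        if (error)
                s->error = error;
        else if (ptr_stale(s->op.c, &b->key, 0))
                s->error = -EINTR;

        bch_bbio_endio(s->op.c, bio, error, "reading from cache");
}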
2201 2201
2202 int bch_btree_search_recurse(struct btree *b, struct btree_op *op) 2202 int bch_btree_search_recurse(struct btree *b, struct btree_op *op)
2203 { 2203 {
2204 struct search *s = container_of(op, struct search, op); 2204 struct search *s = container_of(op, struct search, op);
2205 struct bio *bio = &s->bio.bio; 2205 struct bio *bio = &s->bio.bio;
2206 2206
2207 int ret = 0; 2207 int ret = 0;
2208 struct bkey *k; 2208 struct bkey *k;
2209 struct btree_iter iter; 2209 struct btree_iter iter;
2210 bch_btree_iter_init(b, &iter, &KEY(op->inode, bio->bi_sector, 0)); 2210 bch_btree_iter_init(b, &iter, &KEY(op->inode, bio->bi_sector, 0));
2211 2211
2212 do { 2212 do {
2213 k = bch_btree_iter_next_filter(&iter, b, bch_ptr_bad); 2213 k = bch_btree_iter_next_filter(&iter, b, bch_ptr_bad);
2214 if (!k) { 2214 if (!k) {
2215 /* 2215 /*
2216 * b->key would be exactly what we want, except that 2216 * b->key would be exactly what we want, except that
2217 * pointers to btree nodes have nonzero size - we 2217 * pointers to btree nodes have nonzero size - we
2218 * wouldn't go far enough 2218 * wouldn't go far enough
2219 */ 2219 */
2220 2220
2221 ret = submit_partial_cache_miss(b, op, 2221 ret = submit_partial_cache_miss(b, op,
2222 &KEY(KEY_INODE(&b->key), 2222 &KEY(KEY_INODE(&b->key),
2223 KEY_OFFSET(&b->key), 0)); 2223 KEY_OFFSET(&b->key), 0));
2224 break; 2224 break;
2225 } 2225 }
2226 2226
2227 ret = b->level 2227 ret = b->level
2228 ? btree(search_recurse, k, b, op) 2228 ? btree(search_recurse, k, b, op)
2229 : submit_partial_cache_hit(b, op, k); 2229 : submit_partial_cache_hit(b, op, k);
2230 } while (!ret && 2230 } while (!ret &&
2231 !op->lookup_done); 2231 !op->lookup_done);
2232 2232
2233 return ret; 2233 return ret;
2234 } 2234 }
2235 2235
2236 /* Keybuf code */ 2236 /* Keybuf code */
2237 2237
2238 static inline int keybuf_cmp(struct keybuf_key *l, struct keybuf_key *r) 2238 static inline int keybuf_cmp(struct keybuf_key *l, struct keybuf_key *r)
2239 { 2239 {
2240 /* Overlapping keys compare equal */ 2240 /* Overlapping keys compare equal */
2241 if (bkey_cmp(&l->key, &START_KEY(&r->key)) <= 0) 2241 if (bkey_cmp(&l->key, &START_KEY(&r->key)) <= 0)
2242 return -1; 2242 return -1;
2243 if (bkey_cmp(&START_KEY(&l->key), &r->key) >= 0) 2243 if (bkey_cmp(&START_KEY(&l->key), &r->key) >= 0)
2244 return 1; 2244 return 1;
2245 return 0; 2245 return 0;
2246 } 2246 }
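Because overlapping keys compare equal here, RB_INSERT() in bch_btree_refill_keybuf() below treats a key whose extent overlaps one already buffered as a duplicate and sends it straight back to the freelist. A standalone illustration of the comparison with the start/end sectors spelled out (assumes both keys share an inode):

#include <stdio.h>

/* same ordering rule as keybuf_cmp(): order by extent, overlap == equal */
static int extent_cmp(unsigned l_start, unsigned l_end,
                      unsigned r_start, unsigned r_end)
{
        if (l_end <= r_start)
                return -1;
        if (l_start >= r_end)
                return 1;
        return 0;
}

int main(void)
{
        printf("%d\n", extent_cmp(0, 10, 5, 15));   /*  0: overlap, duplicate    */
        printf("%d\n", extent_cmp(0, 10, 10, 20));  /* -1: adjacent, no overlap  */
        return 0;
}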
2247 2247
2248 static inline int keybuf_nonoverlapping_cmp(struct keybuf_key *l, 2248 static inline int keybuf_nonoverlapping_cmp(struct keybuf_key *l,
2249 struct keybuf_key *r) 2249 struct keybuf_key *r)
2250 { 2250 {
2251 return clamp_t(int64_t, bkey_cmp(&l->key, &r->key), -1, 1); 2251 return clamp_t(int64_t, bkey_cmp(&l->key, &r->key), -1, 1);
2252 } 2252 }
2253 2253
2254 static int bch_btree_refill_keybuf(struct btree *b, struct btree_op *op, 2254 static int bch_btree_refill_keybuf(struct btree *b, struct btree_op *op,
2255 struct keybuf *buf, struct bkey *end) 2255 struct keybuf *buf, struct bkey *end,
2256 keybuf_pred_fn *pred)
2256 { 2257 {
2257 struct btree_iter iter; 2258 struct btree_iter iter;
2258 bch_btree_iter_init(b, &iter, &buf->last_scanned); 2259 bch_btree_iter_init(b, &iter, &buf->last_scanned);
2259 2260
2260 while (!array_freelist_empty(&buf->freelist)) { 2261 while (!array_freelist_empty(&buf->freelist)) {
2261 struct bkey *k = bch_btree_iter_next_filter(&iter, b, 2262 struct bkey *k = bch_btree_iter_next_filter(&iter, b,
2262 bch_ptr_bad); 2263 bch_ptr_bad);
2263 2264
2264 if (!b->level) { 2265 if (!b->level) {
2265 if (!k) { 2266 if (!k) {
2266 buf->last_scanned = b->key; 2267 buf->last_scanned = b->key;
2267 break; 2268 break;
2268 } 2269 }
2269 2270
2270 buf->last_scanned = *k; 2271 buf->last_scanned = *k;
2271 if (bkey_cmp(&buf->last_scanned, end) >= 0) 2272 if (bkey_cmp(&buf->last_scanned, end) >= 0)
2272 break; 2273 break;
2273 2274
2274 if (buf->key_predicate(buf, k)) { 2275 if (pred(buf, k)) {
2275 struct keybuf_key *w; 2276 struct keybuf_key *w;
2276 2277
2277 spin_lock(&buf->lock); 2278 spin_lock(&buf->lock);
2278 2279
2279 w = array_alloc(&buf->freelist); 2280 w = array_alloc(&buf->freelist);
2280 2281
2281 w->private = NULL; 2282 w->private = NULL;
2282 bkey_copy(&w->key, k); 2283 bkey_copy(&w->key, k);
2283 2284
2284 if (RB_INSERT(&buf->keys, w, node, keybuf_cmp)) 2285 if (RB_INSERT(&buf->keys, w, node, keybuf_cmp))
2285 array_free(&buf->freelist, w); 2286 array_free(&buf->freelist, w);
2286 2287
2287 spin_unlock(&buf->lock); 2288 spin_unlock(&buf->lock);
2288 } 2289 }
2289 } else { 2290 } else {
2290 if (!k) 2291 if (!k)
2291 break; 2292 break;
2292 2293
2293 btree(refill_keybuf, k, b, op, buf, end); 2294 btree(refill_keybuf, k, b, op, buf, end, pred);
2294 /* 2295 /*
2295 * Might get an error here, but can't really do anything 2296 * Might get an error here, but can't really do anything
2296 * and it'll get logged elsewhere. Just read what we 2297 * and it'll get logged elsewhere. Just read what we
2297 * can. 2298 * can.
2298 */ 2299 */
2299 2300
2300 if (bkey_cmp(&buf->last_scanned, end) >= 0) 2301 if (bkey_cmp(&buf->last_scanned, end) >= 0)
2301 break; 2302 break;
2302 2303
2303 cond_resched(); 2304 cond_resched();
2304 } 2305 }
2305 } 2306 }
2306 2307
2307 return 0; 2308 return 0;
2308 } 2309 }
2309 2310
2310 void bch_refill_keybuf(struct cache_set *c, struct keybuf *buf, 2311 void bch_refill_keybuf(struct cache_set *c, struct keybuf *buf,
2311 struct bkey *end) 2312 struct bkey *end, keybuf_pred_fn *pred)
2312 { 2313 {
2313 struct bkey start = buf->last_scanned; 2314 struct bkey start = buf->last_scanned;
2314 struct btree_op op; 2315 struct btree_op op;
2315 bch_btree_op_init_stack(&op); 2316 bch_btree_op_init_stack(&op);
2316 2317
2317 cond_resched(); 2318 cond_resched();
2318 2319
2319 btree_root(refill_keybuf, c, &op, buf, end); 2320 btree_root(refill_keybuf, c, &op, buf, end, pred);
2320 closure_sync(&op.cl); 2321 closure_sync(&op.cl);
2321 2322
2322 pr_debug("found %s keys from %llu:%llu to %llu:%llu", 2323 pr_debug("found %s keys from %llu:%llu to %llu:%llu",
2323 RB_EMPTY_ROOT(&buf->keys) ? "no" : 2324 RB_EMPTY_ROOT(&buf->keys) ? "no" :
2324 array_freelist_empty(&buf->freelist) ? "some" : "a few", 2325 array_freelist_empty(&buf->freelist) ? "some" : "a few",
2325 KEY_INODE(&start), KEY_OFFSET(&start), 2326 KEY_INODE(&start), KEY_OFFSET(&start),
2326 KEY_INODE(&buf->last_scanned), KEY_OFFSET(&buf->last_scanned)); 2327 KEY_INODE(&buf->last_scanned), KEY_OFFSET(&buf->last_scanned));
2327 2328
2328 spin_lock(&buf->lock); 2329 spin_lock(&buf->lock);
2329 2330
2330 if (!RB_EMPTY_ROOT(&buf->keys)) { 2331 if (!RB_EMPTY_ROOT(&buf->keys)) {
2331 struct keybuf_key *w; 2332 struct keybuf_key *w;
2332 w = RB_FIRST(&buf->keys, struct keybuf_key, node); 2333 w = RB_FIRST(&buf->keys, struct keybuf_key, node);
2333 buf->start = START_KEY(&w->key); 2334 buf->start = START_KEY(&w->key);
2334 2335
2335 w = RB_LAST(&buf->keys, struct keybuf_key, node); 2336 w = RB_LAST(&buf->keys, struct keybuf_key, node);
2336 buf->end = w->key; 2337 buf->end = w->key;
2337 } else { 2338 } else {
2338 buf->start = MAX_KEY; 2339 buf->start = MAX_KEY;
2339 buf->end = MAX_KEY; 2340 buf->end = MAX_KEY;
2340 } 2341 }
2341 2342
2342 spin_unlock(&buf->lock); 2343 spin_unlock(&buf->lock);
2343 } 2344 }
2344 2345
2345 static void __bch_keybuf_del(struct keybuf *buf, struct keybuf_key *w) 2346 static void __bch_keybuf_del(struct keybuf *buf, struct keybuf_key *w)
2346 { 2347 {
2347 rb_erase(&w->node, &buf->keys); 2348 rb_erase(&w->node, &buf->keys);
2348 array_free(&buf->freelist, w); 2349 array_free(&buf->freelist, w);
2349 } 2350 }
2350 2351
2351 void bch_keybuf_del(struct keybuf *buf, struct keybuf_key *w) 2352 void bch_keybuf_del(struct keybuf *buf, struct keybuf_key *w)
2352 { 2353 {
2353 spin_lock(&buf->lock); 2354 spin_lock(&buf->lock);
2354 __bch_keybuf_del(buf, w); 2355 __bch_keybuf_del(buf, w);
2355 spin_unlock(&buf->lock); 2356 spin_unlock(&buf->lock);
2356 } 2357 }
2357 2358
2358 bool bch_keybuf_check_overlapping(struct keybuf *buf, struct bkey *start, 2359 bool bch_keybuf_check_overlapping(struct keybuf *buf, struct bkey *start,
2359 struct bkey *end) 2360 struct bkey *end)
2360 { 2361 {
2361 bool ret = false; 2362 bool ret = false;
2362 struct keybuf_key *p, *w, s; 2363 struct keybuf_key *p, *w, s;
2363 s.key = *start; 2364 s.key = *start;
2364 2365
2365 if (bkey_cmp(end, &buf->start) <= 0 || 2366 if (bkey_cmp(end, &buf->start) <= 0 ||
2366 bkey_cmp(start, &buf->end) >= 0) 2367 bkey_cmp(start, &buf->end) >= 0)
2367 return false; 2368 return false;
2368 2369
2369 spin_lock(&buf->lock); 2370 spin_lock(&buf->lock);
2370 w = RB_GREATER(&buf->keys, s, node, keybuf_nonoverlapping_cmp); 2371 w = RB_GREATER(&buf->keys, s, node, keybuf_nonoverlapping_cmp);
2371 2372
2372 while (w && bkey_cmp(&START_KEY(&w->key), end) < 0) { 2373 while (w && bkey_cmp(&START_KEY(&w->key), end) < 0) {
2373 p = w; 2374 p = w;
2374 w = RB_NEXT(w, node); 2375 w = RB_NEXT(w, node);
2375 2376
2376 if (p->private) 2377 if (p->private)
2377 ret = true; 2378 ret = true;
2378 else 2379 else
2379 __bch_keybuf_del(buf, p); 2380 __bch_keybuf_del(buf, p);
2380 } 2381 }
2381 2382
2382 spin_unlock(&buf->lock); 2383 spin_unlock(&buf->lock);
2383 return ret; 2384 return ret;
2384 } 2385 }
2385 2386
2386 struct keybuf_key *bch_keybuf_next(struct keybuf *buf) 2387 struct keybuf_key *bch_keybuf_next(struct keybuf *buf)
2387 { 2388 {
2388 struct keybuf_key *w; 2389 struct keybuf_key *w;
2389 spin_lock(&buf->lock); 2390 spin_lock(&buf->lock);
2390 2391
2391 w = RB_FIRST(&buf->keys, struct keybuf_key, node); 2392 w = RB_FIRST(&buf->keys, struct keybuf_key, node);
2392 2393
2393 while (w && w->private) 2394 while (w && w->private)
2394 w = RB_NEXT(w, node); 2395 w = RB_NEXT(w, node);
2395 2396
2396 if (w) 2397 if (w)
2397 w->private = ERR_PTR(-EINTR); 2398 w->private = ERR_PTR(-EINTR);
2398 2399
2399 spin_unlock(&buf->lock); 2400 spin_unlock(&buf->lock);
2400 return w; 2401 return w;
2401 } 2402 }
2402 2403
2403 struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *c, 2404 struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *c,
2404 struct keybuf *buf, 2405 struct keybuf *buf,
2405 struct bkey *end) 2406 struct bkey *end,
2407 keybuf_pred_fn *pred)
2406 { 2408 {
2407 struct keybuf_key *ret; 2409 struct keybuf_key *ret;
2408 2410
2409 while (1) { 2411 while (1) {
2410 ret = bch_keybuf_next(buf); 2412 ret = bch_keybuf_next(buf);
2411 if (ret) 2413 if (ret)
2412 break; 2414 break;
2413 2415
2414 if (bkey_cmp(&buf->last_scanned, end) >= 0) { 2416 if (bkey_cmp(&buf->last_scanned, end) >= 0) {
2415 pr_debug("scan finished"); 2417 pr_debug("scan finished");
2416 break; 2418 break;
2417 } 2419 }
2418 2420
2419 bch_refill_keybuf(c, buf, end); 2421 bch_refill_keybuf(c, buf, end, pred);
2420 } 2422 }
2421 2423
2422 return ret; 2424 return ret;
2423 } 2425 }
2424 2426
2425 void bch_keybuf_init(struct keybuf *buf, keybuf_pred_fn *fn) 2427 void bch_keybuf_init(struct keybuf *buf)
2426 { 2428 {
2427 buf->key_predicate = fn;
2428 buf->last_scanned = MAX_KEY; 2429 buf->last_scanned = MAX_KEY;
2429 buf->keys = RB_ROOT; 2430 buf->keys = RB_ROOT;
2430 2431
2431 spin_lock_init(&buf->lock); 2432 spin_lock_init(&buf->lock);
2432 array_allocator_init(&buf->freelist); 2433 array_allocator_init(&buf->freelist);
2433 } 2434 }
2434 2435
2435 void bch_btree_exit(void) 2436 void bch_btree_exit(void)
2436 { 2437 {
2437 if (btree_io_wq) 2438 if (btree_io_wq)
2438 destroy_workqueue(btree_io_wq); 2439 destroy_workqueue(btree_io_wq);
2439 if (bch_gc_wq) 2440 if (bch_gc_wq)
2440 destroy_workqueue(bch_gc_wq); 2441 destroy_workqueue(bch_gc_wq);
2441 } 2442 }
2442 2443
2443 int __init bch_btree_init(void) 2444 int __init bch_btree_init(void)
2444 { 2445 {
2445 if (!(bch_gc_wq = create_singlethread_workqueue("bch_btree_gc")) || 2446 if (!(bch_gc_wq = create_singlethread_workqueue("bch_btree_gc")) ||
2446 !(btree_io_wq = create_singlethread_workqueue("bch_btree_io"))) 2447 !(btree_io_wq = create_singlethread_workqueue("bch_btree_io")))
2447 return -ENOMEM; 2448 return -ENOMEM;
2448 2449
2449 return 0; 2450 return 0;
2450 } 2451 }
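The keybuf changes above are the API side of this commit: bch_keybuf_init() no longer stores a predicate, and callers now pass a keybuf_pred_fn to bch_refill_keybuf() / bch_keybuf_next_rescan() on every scan. A hedged caller-side sketch of the new shape (the predicate name, the dirty-extent check, and the dc/end variables are illustrative, not taken from this hunk):

static bool dirty_pred(struct keybuf *buf, struct bkey *k)
{
        return KEY_DIRTY(k);    /* e.g. writeback only wants dirty extents */
}

/* setup: no predicate is bound to the buffer any more */
bch_keybuf_init(&dc->writeback_keys);

/* each refill now says what it is looking for */
bch_refill_keybuf(dc->disk.c, &dc->writeback_keys, &end, dirty_pred);

Passing the predicate per refill lets the same keybuf be scanned with different criteria, which is presumably what the full-stripe-first writeback described in the commit message builds on (that code is in writeback.c, outside this file).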
drivers/md/bcache/btree.h
1 #ifndef _BCACHE_BTREE_H 1 #ifndef _BCACHE_BTREE_H
2 #define _BCACHE_BTREE_H 2 #define _BCACHE_BTREE_H
3 3
4 /* 4 /*
5 * THE BTREE: 5 * THE BTREE:
6 * 6 *
7 * At a high level, bcache's btree is a relatively standard b+ tree. All keys and 7 * At a high level, bcache's btree is a relatively standard b+ tree. All keys and
8 * pointers are in the leaves; interior nodes only have pointers to the child 8 * pointers are in the leaves; interior nodes only have pointers to the child
9 * nodes. 9 * nodes.
10 * 10 *
11 * In the interior nodes, a struct bkey always points to a child btree node, and 11 * In the interior nodes, a struct bkey always points to a child btree node, and
12 * the key is the highest key in the child node - except that the highest key in 12 * the key is the highest key in the child node - except that the highest key in
13 * an interior node is always MAX_KEY. The size field refers to the size on disk 13 * an interior node is always MAX_KEY. The size field refers to the size on disk
14 * of the child node - this would allow us to have variable sized btree nodes 14 * of the child node - this would allow us to have variable sized btree nodes
15 * (handy for keeping the depth of the btree 1 by expanding just the root). 15 * (handy for keeping the depth of the btree 1 by expanding just the root).
16 * 16 *
17 * Btree nodes are themselves log structured, but this is hidden fairly 17 * Btree nodes are themselves log structured, but this is hidden fairly
18 * thoroughly. Btree nodes on disk will in practice have extents that overlap 18 * thoroughly. Btree nodes on disk will in practice have extents that overlap
19 * (because they were written at different times), but in memory we never have 19 * (because they were written at different times), but in memory we never have
20 * overlapping extents - when we read in a btree node from disk, the first thing 20 * overlapping extents - when we read in a btree node from disk, the first thing
21 * we do is resort all the sets of keys with a mergesort, and in the same pass 21 * we do is resort all the sets of keys with a mergesort, and in the same pass
22 * we check for overlapping extents and adjust them appropriately. 22 * we check for overlapping extents and adjust them appropriately.
23 * 23 *
24 * struct btree_op is a central interface to the btree code. It's used for 24 * struct btree_op is a central interface to the btree code. It's used for
25 * specifying read vs. write locking, and the embedded closure is used for 25 * specifying read vs. write locking, and the embedded closure is used for
26 * waiting on IO or reserve memory. 26 * waiting on IO or reserve memory.
27 * 27 *
28 * BTREE CACHE: 28 * BTREE CACHE:
29 * 29 *
30 * Btree nodes are cached in memory; traversing the btree might require reading 30 * Btree nodes are cached in memory; traversing the btree might require reading
31 * in btree nodes which is handled mostly transparently. 31 * in btree nodes which is handled mostly transparently.
32 * 32 *
33 * bch_btree_node_get() looks up a btree node in the cache and reads it in from 33 * bch_btree_node_get() looks up a btree node in the cache and reads it in from
34 * disk if necessary. This function is almost never called directly though - the 34 * disk if necessary. This function is almost never called directly though - the
35 * btree() macro is used to get a btree node, call some function on it, and 35 * btree() macro is used to get a btree node, call some function on it, and
36 * unlock the node after the function returns. 36 * unlock the node after the function returns.
37 * 37 *
38 * The root is special cased - it's taken out of the cache's lru (thus pinning 38 * The root is special cased - it's taken out of the cache's lru (thus pinning
39 * it in memory), so we can find the root of the btree by just dereferencing a 39 * it in memory), so we can find the root of the btree by just dereferencing a
40 * pointer instead of looking it up in the cache. This makes locking a bit 40 * pointer instead of looking it up in the cache. This makes locking a bit
41 * tricky, since the root pointer is protected by the lock in the btree node it 41 * tricky, since the root pointer is protected by the lock in the btree node it
42 * points to - the btree_root() macro handles this. 42 * points to - the btree_root() macro handles this.
43 * 43 *
44 * In various places we must be able to allocate memory for multiple btree nodes 44 * In various places we must be able to allocate memory for multiple btree nodes
45 * in order to make forward progress. To do this we use the btree cache itself 45 * in order to make forward progress. To do this we use the btree cache itself
46 * as a reserve; if __get_free_pages() fails, we'll find a node in the btree 46 * as a reserve; if __get_free_pages() fails, we'll find a node in the btree
47 * cache we can reuse. We can't allow more than one thread to be doing this at a 47 * cache we can reuse. We can't allow more than one thread to be doing this at a
48 * time, so there's a lock, implemented by a pointer to the btree_op closure - 48 * time, so there's a lock, implemented by a pointer to the btree_op closure -
49 * this allows the btree_root() macro to implicitly release this lock. 49 * this allows the btree_root() macro to implicitly release this lock.
50 * 50 *
51 * BTREE IO: 51 * BTREE IO:
52 * 52 *
53 * Btree nodes never have to be explicitly read in; bch_btree_node_get() handles 53 * Btree nodes never have to be explicitly read in; bch_btree_node_get() handles
54 * this. 54 * this.
55 * 55 *
56 * For writing, we have two btree_write structs embedded in struct btree - one 56 * For writing, we have two btree_write structs embedded in struct btree - one
57 * write in flight, and one being set up, and we toggle between them. 57 * write in flight, and one being set up, and we toggle between them.
58 * 58 *
59 * Writing is done with a single function - bch_btree_write() really serves two 59 * Writing is done with a single function - bch_btree_write() really serves two
60 * different purposes and should be broken up into two different functions. When 60 * different purposes and should be broken up into two different functions. When
61 * passing now = false, it merely indicates that the node is now dirty - calling 61 * passing now = false, it merely indicates that the node is now dirty - calling
62 * it ensures that the dirty keys will be written at some point in the future. 62 * it ensures that the dirty keys will be written at some point in the future.
63 * 63 *
64 * When passing now = true, bch_btree_write() causes a write to happen 64 * When passing now = true, bch_btree_write() causes a write to happen
65 * "immediately" (if there was already a write in flight, it'll cause the write 65 * "immediately" (if there was already a write in flight, it'll cause the write
66 * to happen as soon as the previous write completes). It returns immediately 66 * to happen as soon as the previous write completes). It returns immediately
67 * though - but it takes a refcount on the closure in struct btree_op you passed 67 * though - but it takes a refcount on the closure in struct btree_op you passed
68 * to it, so a closure_sync() later can be used to wait for the write to 68 * to it, so a closure_sync() later can be used to wait for the write to
69 * complete. 69 * complete.
70 * 70 *
71 * This is handy because btree_split() and garbage collection can issue writes 71 * This is handy because btree_split() and garbage collection can issue writes
72 * in parallel, reducing the amount of time they have to hold write locks. 72 * in parallel, reducing the amount of time they have to hold write locks.
73 * 73 *
74 * LOCKING: 74 * LOCKING:
75 * 75 *
76 * When traversing the btree, we may need write locks starting at some level - 76 * When traversing the btree, we may need write locks starting at some level -
77 * inserting a key into the btree will typically only require a write lock on 77 * inserting a key into the btree will typically only require a write lock on
78 * the leaf node. 78 * the leaf node.
79 * 79 *
80 * This is specified with the lock field in struct btree_op; lock = 0 means we 80 * This is specified with the lock field in struct btree_op; lock = 0 means we
81 * take write locks at level <= 0, i.e. only leaf nodes. bch_btree_node_get() 81 * take write locks at level <= 0, i.e. only leaf nodes. bch_btree_node_get()
82 * checks this field and returns the node with the appropriate lock held. 82 * checks this field and returns the node with the appropriate lock held.
83 * 83 *
84 * If, after traversing the btree, the insertion code discovers it has to split 84 * If, after traversing the btree, the insertion code discovers it has to split
85 * then it must restart from the root and take new locks - to do this it changes 85 * then it must restart from the root and take new locks - to do this it changes
86 * the lock field and returns -EINTR, which causes the btree_root() macro to 86 * the lock field and returns -EINTR, which causes the btree_root() macro to
87 * loop. 87 * loop.
88 * 88 *
89 * Handling cache misses requires a different mechanism for upgrading to a write 89 * Handling cache misses requires a different mechanism for upgrading to a write
90 * lock. We do cache lookups with only a read lock held, but if we get a cache 90 * lock. We do cache lookups with only a read lock held, but if we get a cache
91 * miss and we wish to insert this data into the cache, we have to insert a 91 * miss and we wish to insert this data into the cache, we have to insert a
92 * placeholder key to detect races - otherwise, we could race with a write and 92 * placeholder key to detect races - otherwise, we could race with a write and
93 * overwrite the data that was just written to the cache with stale data from 93 * overwrite the data that was just written to the cache with stale data from
94 * the backing device. 94 * the backing device.
95 * 95 *
96 * For this we use a sequence number that write locks and unlocks increment - to 96 * For this we use a sequence number that write locks and unlocks increment - to
97 * insert the check key it unlocks the btree node and then takes a write lock, 97 * insert the check key it unlocks the btree node and then takes a write lock,
98 * and fails if the sequence number doesn't match. 98 * and fails if the sequence number doesn't match.
99 */ 99 */
100 100
101 #include "bset.h" 101 #include "bset.h"
102 #include "debug.h" 102 #include "debug.h"
103 103
104 struct btree_write { 104 struct btree_write {
105 atomic_t *journal; 105 atomic_t *journal;
106 106
107 /* If btree_split() frees a btree node, it writes a new pointer to that 107 /* If btree_split() frees a btree node, it writes a new pointer to that
108 * btree node indicating it was freed; it takes a refcount on 108 * btree node indicating it was freed; it takes a refcount on
109 * c->prio_blocked because we can't write the gens until the new 109 * c->prio_blocked because we can't write the gens until the new
110 * pointer is on disk. This allows btree_write_endio() to release the 110 * pointer is on disk. This allows btree_write_endio() to release the
111 * refcount that btree_split() took. 111 * refcount that btree_split() took.
112 */ 112 */
113 int prio_blocked; 113 int prio_blocked;
114 }; 114 };
115 115
116 struct btree { 116 struct btree {
117 /* Hottest entries first */ 117 /* Hottest entries first */
118 struct hlist_node hash; 118 struct hlist_node hash;
119 119
120 /* Key/pointer for this btree node */ 120 /* Key/pointer for this btree node */
121 BKEY_PADDED(key); 121 BKEY_PADDED(key);
122 122
123 /* Single bit - set when accessed, cleared by shrinker */ 123 /* Single bit - set when accessed, cleared by shrinker */
124 unsigned long accessed; 124 unsigned long accessed;
125 unsigned long seq; 125 unsigned long seq;
126 struct rw_semaphore lock; 126 struct rw_semaphore lock;
127 struct cache_set *c; 127 struct cache_set *c;
128 128
129 unsigned long flags; 129 unsigned long flags;
130 uint16_t written; /* would be nice to kill */ 130 uint16_t written; /* would be nice to kill */
131 uint8_t level; 131 uint8_t level;
132 uint8_t nsets; 132 uint8_t nsets;
133 uint8_t page_order; 133 uint8_t page_order;
134 134
135 /* 135 /*
136 * Set of sorted keys - the real btree node - plus a binary search tree 136 * Set of sorted keys - the real btree node - plus a binary search tree
137 * 137 *
138 * sets[0] is special; set[0]->tree, set[0]->prev and set[0]->data point 138 * sets[0] is special; set[0]->tree, set[0]->prev and set[0]->data point
139 * to the memory we have allocated for this btree node. Additionally, 139 * to the memory we have allocated for this btree node. Additionally,
140 * set[0]->data points to the entire btree node as it exists on disk. 140 * set[0]->data points to the entire btree node as it exists on disk.
141 */ 141 */
142 struct bset_tree sets[MAX_BSETS]; 142 struct bset_tree sets[MAX_BSETS];
143 143
144 /* For outstanding btree writes, used as a lock - protects write_idx */ 144 /* For outstanding btree writes, used as a lock - protects write_idx */
145 struct closure_with_waitlist io; 145 struct closure_with_waitlist io;
146 146
147 struct list_head list; 147 struct list_head list;
148 struct delayed_work work; 148 struct delayed_work work;
149 149
150 struct btree_write writes[2]; 150 struct btree_write writes[2];
151 struct bio *bio; 151 struct bio *bio;
152 }; 152 };
153 153
154 #define BTREE_FLAG(flag) \ 154 #define BTREE_FLAG(flag) \
155 static inline bool btree_node_ ## flag(struct btree *b) \ 155 static inline bool btree_node_ ## flag(struct btree *b) \
156 { return test_bit(BTREE_NODE_ ## flag, &b->flags); } \ 156 { return test_bit(BTREE_NODE_ ## flag, &b->flags); } \
157 \ 157 \
158 static inline void set_btree_node_ ## flag(struct btree *b) \ 158 static inline void set_btree_node_ ## flag(struct btree *b) \
159 { set_bit(BTREE_NODE_ ## flag, &b->flags); } \ 159 { set_bit(BTREE_NODE_ ## flag, &b->flags); } \
160 160
161 enum btree_flags { 161 enum btree_flags {
162 BTREE_NODE_io_error, 162 BTREE_NODE_io_error,
163 BTREE_NODE_dirty, 163 BTREE_NODE_dirty,
164 BTREE_NODE_write_idx, 164 BTREE_NODE_write_idx,
165 }; 165 };
166 166
167 BTREE_FLAG(io_error); 167 BTREE_FLAG(io_error);
168 BTREE_FLAG(dirty); 168 BTREE_FLAG(dirty);
169 BTREE_FLAG(write_idx); 169 BTREE_FLAG(write_idx);
170 170
171 static inline struct btree_write *btree_current_write(struct btree *b) 171 static inline struct btree_write *btree_current_write(struct btree *b)
172 { 172 {
173 return b->writes + btree_node_write_idx(b); 173 return b->writes + btree_node_write_idx(b);
174 } 174 }
175 175
176 static inline struct btree_write *btree_prev_write(struct btree *b) 176 static inline struct btree_write *btree_prev_write(struct btree *b)
177 { 177 {
178 return b->writes + (btree_node_write_idx(b) ^ 1); 178 return b->writes + (btree_node_write_idx(b) ^ 1);
179 } 179 }
180 180
181 static inline unsigned bset_offset(struct btree *b, struct bset *i) 181 static inline unsigned bset_offset(struct btree *b, struct bset *i)
182 { 182 {
183 return (((size_t) i) - ((size_t) b->sets->data)) >> 9; 183 return (((size_t) i) - ((size_t) b->sets->data)) >> 9;
184 } 184 }
185 185
186 static inline struct bset *write_block(struct btree *b) 186 static inline struct bset *write_block(struct btree *b)
187 { 187 {
188 return ((void *) b->sets[0].data) + b->written * block_bytes(b->c); 188 return ((void *) b->sets[0].data) + b->written * block_bytes(b->c);
189 } 189 }
190 190
191 static inline bool bset_written(struct btree *b, struct bset_tree *t) 191 static inline bool bset_written(struct btree *b, struct bset_tree *t)
192 { 192 {
193 return t->data < write_block(b); 193 return t->data < write_block(b);
194 } 194 }
195 195
196 static inline bool bkey_written(struct btree *b, struct bkey *k) 196 static inline bool bkey_written(struct btree *b, struct bkey *k)
197 { 197 {
198 return k < write_block(b)->start; 198 return k < write_block(b)->start;
199 } 199 }
200 200
201 static inline void set_gc_sectors(struct cache_set *c) 201 static inline void set_gc_sectors(struct cache_set *c)
202 { 202 {
203 atomic_set(&c->sectors_to_gc, c->sb.bucket_size * c->nbuckets / 8); 203 atomic_set(&c->sectors_to_gc, c->sb.bucket_size * c->nbuckets / 8);
204 } 204 }
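set_gc_sectors() sets the budget of sectors that can be written before incremental garbage collection is kicked again: 1/8 of the cache's capacity. Illustrative arithmetic (made-up numbers): with bucket_size = 1024 sectors (512 KiB buckets) and nbuckets = 2^20, sectors_to_gc = 1024 * 2^20 / 8 = 2^27 sectors, i.e. roughly 64 GiB written between GC triggers.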
205 205
206 static inline bool bch_ptr_invalid(struct btree *b, const struct bkey *k) 206 static inline bool bch_ptr_invalid(struct btree *b, const struct bkey *k)
207 { 207 {
208 return __bch_ptr_invalid(b->c, b->level, k); 208 return __bch_ptr_invalid(b->c, b->level, k);
209 } 209 }
210 210
211 static inline struct bkey *bch_btree_iter_init(struct btree *b, 211 static inline struct bkey *bch_btree_iter_init(struct btree *b,
212 struct btree_iter *iter, 212 struct btree_iter *iter,
213 struct bkey *search) 213 struct bkey *search)
214 { 214 {
215 return __bch_btree_iter_init(b, iter, search, b->sets); 215 return __bch_btree_iter_init(b, iter, search, b->sets);
216 } 216 }
217 217
218 /* Looping macros */ 218 /* Looping macros */
219 219
220 #define for_each_cached_btree(b, c, iter) \ 220 #define for_each_cached_btree(b, c, iter) \
221 for (iter = 0; \ 221 for (iter = 0; \
222 iter < ARRAY_SIZE((c)->bucket_hash); \ 222 iter < ARRAY_SIZE((c)->bucket_hash); \
223 iter++) \ 223 iter++) \
224 hlist_for_each_entry_rcu((b), (c)->bucket_hash + iter, hash) 224 hlist_for_each_entry_rcu((b), (c)->bucket_hash + iter, hash)
225 225
226 #define for_each_key_filter(b, k, iter, filter) \ 226 #define for_each_key_filter(b, k, iter, filter) \
227 for (bch_btree_iter_init((b), (iter), NULL); \ 227 for (bch_btree_iter_init((b), (iter), NULL); \
228 ((k) = bch_btree_iter_next_filter((iter), b, filter));) 228 ((k) = bch_btree_iter_next_filter((iter), b, filter));)
229 229
230 #define for_each_key(b, k, iter) \ 230 #define for_each_key(b, k, iter) \
231 for (bch_btree_iter_init((b), (iter), NULL); \ 231 for (bch_btree_iter_init((b), (iter), NULL); \
232 ((k) = bch_btree_iter_next(iter));) 232 ((k) = bch_btree_iter_next(iter));)
233 233
234 /* Recursing down the btree */ 234 /* Recursing down the btree */
235 235
236 struct btree_op { 236 struct btree_op {
237 struct closure cl; 237 struct closure cl;
238 struct cache_set *c; 238 struct cache_set *c;
239 239
240 /* Journal entry we have a refcount on */ 240 /* Journal entry we have a refcount on */
241 atomic_t *journal; 241 atomic_t *journal;
242 242
243 /* Bio to be inserted into the cache */ 243 /* Bio to be inserted into the cache */
244 struct bio *cache_bio; 244 struct bio *cache_bio;
245 245
246 unsigned inode; 246 unsigned inode;
247 247
248 uint16_t write_prio; 248 uint16_t write_prio;
249 249
250 /* Btree level at which we start taking write locks */ 250 /* Btree level at which we start taking write locks */
251 short lock; 251 short lock;
252 252
253 /* Btree insertion type */ 253 /* Btree insertion type */
254 enum { 254 enum {
255 BTREE_INSERT, 255 BTREE_INSERT,
256 BTREE_REPLACE 256 BTREE_REPLACE
257 } type:8; 257 } type:8;
258 258
259 unsigned csum:1; 259 unsigned csum:1;
260 unsigned skip:1; 260 unsigned skip:1;
261 unsigned flush_journal:1; 261 unsigned flush_journal:1;
262 262
263 unsigned insert_data_done:1; 263 unsigned insert_data_done:1;
264 unsigned lookup_done:1; 264 unsigned lookup_done:1;
265 unsigned insert_collision:1; 265 unsigned insert_collision:1;
266 266
267 /* Anything after this point won't get zeroed in do_bio_hook() */ 267 /* Anything after this point won't get zeroed in do_bio_hook() */
268 268
269 /* Keys to be inserted */ 269 /* Keys to be inserted */
270 struct keylist keys; 270 struct keylist keys;
271 BKEY_PADDED(replace); 271 BKEY_PADDED(replace);
272 }; 272 };
273 273
274 enum { 274 enum {
275 BTREE_INSERT_STATUS_INSERT, 275 BTREE_INSERT_STATUS_INSERT,
276 BTREE_INSERT_STATUS_BACK_MERGE, 276 BTREE_INSERT_STATUS_BACK_MERGE,
277 BTREE_INSERT_STATUS_OVERWROTE, 277 BTREE_INSERT_STATUS_OVERWROTE,
278 BTREE_INSERT_STATUS_FRONT_MERGE, 278 BTREE_INSERT_STATUS_FRONT_MERGE,
279 }; 279 };
280 280
281 void bch_btree_op_init_stack(struct btree_op *); 281 void bch_btree_op_init_stack(struct btree_op *);
282 282
283 static inline void rw_lock(bool w, struct btree *b, int level) 283 static inline void rw_lock(bool w, struct btree *b, int level)
284 { 284 {
285 w ? down_write_nested(&b->lock, level + 1) 285 w ? down_write_nested(&b->lock, level + 1)
286 : down_read_nested(&b->lock, level + 1); 286 : down_read_nested(&b->lock, level + 1);
287 if (w) 287 if (w)
288 b->seq++; 288 b->seq++;
289 } 289 }
290 290
291 static inline void rw_unlock(bool w, struct btree *b) 291 static inline void rw_unlock(bool w, struct btree *b)
292 { 292 {
293 #ifdef CONFIG_BCACHE_EDEBUG 293 #ifdef CONFIG_BCACHE_EDEBUG
294 unsigned i; 294 unsigned i;
295 295
296 if (w && b->key.ptr[0]) 296 if (w && b->key.ptr[0])
297 for (i = 0; i <= b->nsets; i++) 297 for (i = 0; i <= b->nsets; i++)
298 bch_check_key_order(b, b->sets[i].data); 298 bch_check_key_order(b, b->sets[i].data);
299 #endif 299 #endif
300 300
301 if (w) 301 if (w)
302 b->seq++; 302 b->seq++;
303 (w ? up_write : up_read)(&b->lock); 303 (w ? up_write : up_read)(&b->lock);
304 } 304 }
305 305
306 #define insert_lock(s, b) ((b)->level <= (s)->lock) 306 #define insert_lock(s, b) ((b)->level <= (s)->lock)
307 307
308 /* 308 /*
309 * These macros are for recursing down the btree - they handle the details of 309 * These macros are for recursing down the btree - they handle the details of
310 * locking and looking up nodes in the cache for you. They're best treated as 310 * locking and looking up nodes in the cache for you. They're best treated as
311 * mere syntax when reading code that uses them. 311 * mere syntax when reading code that uses them.
312 * 312 *
313 * op->lock determines whether we take a read or a write lock at a given depth. 313 * op->lock determines whether we take a read or a write lock at a given depth.
314 * If you've got a read lock and find that you need a write lock (i.e. you're 314 * If you've got a read lock and find that you need a write lock (i.e. you're
315 * going to have to split), set op->lock and return -EINTR; btree_root() will 315 * going to have to split), set op->lock and return -EINTR; btree_root() will
316 * call you again and you'll have the correct lock. 316 * call you again and you'll have the correct lock.
317 */ 317 */
318 318
319 /** 319 /**
320 * btree - recurse down the btree on a specified key 320 * btree - recurse down the btree on a specified key
321 * @fn: function to call, which will be passed the child node 321 * @fn: function to call, which will be passed the child node
322 * @key: key to recurse on 322 * @key: key to recurse on
323 * @b: parent btree node 323 * @b: parent btree node
324 * @op: pointer to struct btree_op 324 * @op: pointer to struct btree_op
325 */ 325 */
326 #define btree(fn, key, b, op, ...) \ 326 #define btree(fn, key, b, op, ...) \
327 ({ \ 327 ({ \
328 int _r, l = (b)->level - 1; \ 328 int _r, l = (b)->level - 1; \
329 bool _w = l <= (op)->lock; \ 329 bool _w = l <= (op)->lock; \
330 struct btree *_b = bch_btree_node_get((b)->c, key, l, op); \ 330 struct btree *_b = bch_btree_node_get((b)->c, key, l, op); \
331 if (!IS_ERR(_b)) { \ 331 if (!IS_ERR(_b)) { \
332 _r = bch_btree_ ## fn(_b, op, ##__VA_ARGS__); \ 332 _r = bch_btree_ ## fn(_b, op, ##__VA_ARGS__); \
333 rw_unlock(_w, _b); \ 333 rw_unlock(_w, _b); \
334 } else \ 334 } else \
335 _r = PTR_ERR(_b); \ 335 _r = PTR_ERR(_b); \
336 _r; \ 336 _r; \
337 }) 337 })
338 338
339 /** 339 /**
340 * btree_root - call a function on the root of the btree 340 * btree_root - call a function on the root of the btree
341 * @fn: function to call, which will be passed the child node 341 * @fn: function to call, which will be passed the child node
342 * @c: cache set 342 * @c: cache set
343 * @op: pointer to struct btree_op 343 * @op: pointer to struct btree_op
344 */ 344 */
345 #define btree_root(fn, c, op, ...) \ 345 #define btree_root(fn, c, op, ...) \
346 ({ \ 346 ({ \
347 int _r = -EINTR; \ 347 int _r = -EINTR; \
348 do { \ 348 do { \
349 struct btree *_b = (c)->root; \ 349 struct btree *_b = (c)->root; \
350 bool _w = insert_lock(op, _b); \ 350 bool _w = insert_lock(op, _b); \
351 rw_lock(_w, _b, _b->level); \ 351 rw_lock(_w, _b, _b->level); \
352 if (_b == (c)->root && \ 352 if (_b == (c)->root && \
353 _w == insert_lock(op, _b)) \ 353 _w == insert_lock(op, _b)) \
354 _r = bch_btree_ ## fn(_b, op, ##__VA_ARGS__); \ 354 _r = bch_btree_ ## fn(_b, op, ##__VA_ARGS__); \
355 rw_unlock(_w, _b); \ 355 rw_unlock(_w, _b); \
356 bch_cannibalize_unlock(c, &(op)->cl); \ 356 bch_cannibalize_unlock(c, &(op)->cl); \
357 } while (_r == -EINTR); \ 357 } while (_r == -EINTR); \
358 \ 358 \
359 _r; \ 359 _r; \
360 }) 360 })
361 361
362 static inline bool should_split(struct btree *b) 362 static inline bool should_split(struct btree *b)
363 { 363 {
364 struct bset *i = write_block(b); 364 struct bset *i = write_block(b);
365 return b->written >= btree_blocks(b) || 365 return b->written >= btree_blocks(b) ||
366 (i->seq == b->sets[0].data->seq && 366 (i->seq == b->sets[0].data->seq &&
367 b->written + __set_blocks(i, i->keys + 15, b->c) 367 b->written + __set_blocks(i, i->keys + 15, b->c)
368 > btree_blocks(b)); 368 > btree_blocks(b));
369 } 369 }
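
/*
 * Illustration only, not part of this header: a hypothetical recursive
 * function honoring the -EINTR retry contract described in the comment
 * above.  If it reaches a full node while holding only a read lock, it
 * widens op->lock and bails out, so btree_root() restarts the descent
 * with write locks held from that level down.
 */
static inline int bch_btree_example(struct btree *b, struct btree_op *op)
{
	if (should_split(b) && !insert_lock(op, b)) {
		op->lock = b->level;	/* write locks at this level and below */
		return -EINTR;		/* btree_root() will call us again */
	}

	/* ... real work goes here, recursing via the btree() macro ... */
	return 0;
}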
370 370
371 void bch_btree_node_read(struct btree *); 371 void bch_btree_node_read(struct btree *);
372 void bch_btree_node_read_done(struct btree *); 372 void bch_btree_node_read_done(struct btree *);
373 void bch_btree_node_write(struct btree *, struct closure *); 373 void bch_btree_node_write(struct btree *, struct closure *);
374 374
375 void bch_cannibalize_unlock(struct cache_set *, struct closure *); 375 void bch_cannibalize_unlock(struct cache_set *, struct closure *);
376 void bch_btree_set_root(struct btree *); 376 void bch_btree_set_root(struct btree *);
377 struct btree *bch_btree_node_alloc(struct cache_set *, int, struct closure *); 377 struct btree *bch_btree_node_alloc(struct cache_set *, int, struct closure *);
378 struct btree *bch_btree_node_get(struct cache_set *, struct bkey *, 378 struct btree *bch_btree_node_get(struct cache_set *, struct bkey *,
379 int, struct btree_op *); 379 int, struct btree_op *);
380 380
381 bool bch_btree_insert_keys(struct btree *, struct btree_op *); 381 bool bch_btree_insert_keys(struct btree *, struct btree_op *);
382 bool bch_btree_insert_check_key(struct btree *, struct btree_op *, 382 bool bch_btree_insert_check_key(struct btree *, struct btree_op *,
383 struct bio *); 383 struct bio *);
384 int bch_btree_insert(struct btree_op *, struct cache_set *); 384 int bch_btree_insert(struct btree_op *, struct cache_set *);
385 385
386 int bch_btree_search_recurse(struct btree *, struct btree_op *); 386 int bch_btree_search_recurse(struct btree *, struct btree_op *);
387 387
388 void bch_queue_gc(struct cache_set *); 388 void bch_queue_gc(struct cache_set *);
389 size_t bch_btree_gc_finish(struct cache_set *); 389 size_t bch_btree_gc_finish(struct cache_set *);
390 void bch_moving_gc(struct closure *); 390 void bch_moving_gc(struct closure *);
391 int bch_btree_check(struct cache_set *, struct btree_op *); 391 int bch_btree_check(struct cache_set *, struct btree_op *);
392 uint8_t __bch_btree_mark_key(struct cache_set *, int, struct bkey *); 392 uint8_t __bch_btree_mark_key(struct cache_set *, int, struct bkey *);
393 393
394 void bch_keybuf_init(struct keybuf *, keybuf_pred_fn *); 394 void bch_keybuf_init(struct keybuf *);
395 void bch_refill_keybuf(struct cache_set *, struct keybuf *, struct bkey *); 395 void bch_refill_keybuf(struct cache_set *, struct keybuf *, struct bkey *,
396 keybuf_pred_fn *);
396 bool bch_keybuf_check_overlapping(struct keybuf *, struct bkey *, 397 bool bch_keybuf_check_overlapping(struct keybuf *, struct bkey *,
397 struct bkey *); 398 struct bkey *);
398 void bch_keybuf_del(struct keybuf *, struct keybuf_key *); 399 void bch_keybuf_del(struct keybuf *, struct keybuf_key *);
399 struct keybuf_key *bch_keybuf_next(struct keybuf *); 400 struct keybuf_key *bch_keybuf_next(struct keybuf *);
400 struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *, 401 struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *, struct keybuf *,
401 struct keybuf *, struct bkey *); 402 struct bkey *, keybuf_pred_fn *);
402 403
403 #endif 404 #endif
404 405
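The keybuf rework visible above no longer binds the filter at initialization time: bch_keybuf_init() loses its keybuf_pred_fn argument, and the predicate is instead passed to bch_refill_keybuf() and bch_keybuf_next_rescan() on each call (as bch_dump_read() and read_moving() do below). A minimal sketch of the new calling convention; example_pred() and example_scan() are hypothetical names, not part of the patch:

static bool example_pred(struct keybuf *buf, struct bkey *k)
{
	return KEY_DIRTY(k);	/* e.g. only collect dirty keys */
}

static void example_scan(struct cache_set *c, struct keybuf *buf)
{
	struct keybuf_key *w;

	bch_keybuf_init(buf);			/* no predicate stored any more */
	buf->last_scanned = KEY(0, 0, 0);

	while ((w = bch_keybuf_next_rescan(c, buf, &MAX_KEY, example_pred))) {
		/* ... use w->key ... */
		bch_keybuf_del(buf, w);
	}
}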
drivers/md/bcache/debug.c
1 /* 1 /*
2 * Assorted bcache debug code 2 * Assorted bcache debug code
3 * 3 *
4 * Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com> 4 * Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com>
5 * Copyright 2012 Google, Inc. 5 * Copyright 2012 Google, Inc.
6 */ 6 */
7 7
8 #include "bcache.h" 8 #include "bcache.h"
9 #include "btree.h" 9 #include "btree.h"
10 #include "debug.h" 10 #include "debug.h"
11 #include "request.h" 11 #include "request.h"
12 12
13 #include <linux/console.h> 13 #include <linux/console.h>
14 #include <linux/debugfs.h> 14 #include <linux/debugfs.h>
15 #include <linux/module.h> 15 #include <linux/module.h>
16 #include <linux/random.h> 16 #include <linux/random.h>
17 #include <linux/seq_file.h> 17 #include <linux/seq_file.h>
18 18
19 static struct dentry *debug; 19 static struct dentry *debug;
20 20
21 const char *bch_ptr_status(struct cache_set *c, const struct bkey *k) 21 const char *bch_ptr_status(struct cache_set *c, const struct bkey *k)
22 { 22 {
23 unsigned i; 23 unsigned i;
24 24
25 for (i = 0; i < KEY_PTRS(k); i++) 25 for (i = 0; i < KEY_PTRS(k); i++)
26 if (ptr_available(c, k, i)) { 26 if (ptr_available(c, k, i)) {
27 struct cache *ca = PTR_CACHE(c, k, i); 27 struct cache *ca = PTR_CACHE(c, k, i);
28 size_t bucket = PTR_BUCKET_NR(c, k, i); 28 size_t bucket = PTR_BUCKET_NR(c, k, i);
29 size_t r = bucket_remainder(c, PTR_OFFSET(k, i)); 29 size_t r = bucket_remainder(c, PTR_OFFSET(k, i));
30 30
31 if (KEY_SIZE(k) + r > c->sb.bucket_size) 31 if (KEY_SIZE(k) + r > c->sb.bucket_size)
32 return "bad, length too big"; 32 return "bad, length too big";
33 if (bucket < ca->sb.first_bucket) 33 if (bucket < ca->sb.first_bucket)
34 return "bad, short offset"; 34 return "bad, short offset";
35 if (bucket >= ca->sb.nbuckets) 35 if (bucket >= ca->sb.nbuckets)
36 return "bad, offset past end of device"; 36 return "bad, offset past end of device";
37 if (ptr_stale(c, k, i)) 37 if (ptr_stale(c, k, i))
38 return "stale"; 38 return "stale";
39 } 39 }
40 40
41 if (!bkey_cmp(k, &ZERO_KEY)) 41 if (!bkey_cmp(k, &ZERO_KEY))
42 return "bad, null key"; 42 return "bad, null key";
43 if (!KEY_PTRS(k)) 43 if (!KEY_PTRS(k))
44 return "bad, no pointers"; 44 return "bad, no pointers";
45 if (!KEY_SIZE(k)) 45 if (!KEY_SIZE(k))
46 return "zeroed key"; 46 return "zeroed key";
47 return ""; 47 return "";
48 } 48 }
49 49
50 int bch_bkey_to_text(char *buf, size_t size, const struct bkey *k) 50 int bch_bkey_to_text(char *buf, size_t size, const struct bkey *k)
51 { 51 {
52 unsigned i = 0; 52 unsigned i = 0;
53 char *out = buf, *end = buf + size; 53 char *out = buf, *end = buf + size;
54 54
55 #define p(...) (out += scnprintf(out, end - out, __VA_ARGS__)) 55 #define p(...) (out += scnprintf(out, end - out, __VA_ARGS__))
56 56
57 p("%llu:%llu len %llu -> [", KEY_INODE(k), KEY_OFFSET(k), KEY_SIZE(k)); 57 p("%llu:%llu len %llu -> [", KEY_INODE(k), KEY_OFFSET(k), KEY_SIZE(k));
58 58
59 if (KEY_PTRS(k)) 59 if (KEY_PTRS(k))
60 while (1) { 60 while (1) {
61 p("%llu:%llu gen %llu", 61 p("%llu:%llu gen %llu",
62 PTR_DEV(k, i), PTR_OFFSET(k, i), PTR_GEN(k, i)); 62 PTR_DEV(k, i), PTR_OFFSET(k, i), PTR_GEN(k, i));
63 63
64 if (++i == KEY_PTRS(k)) 64 if (++i == KEY_PTRS(k))
65 break; 65 break;
66 66
67 p(", "); 67 p(", ");
68 } 68 }
69 69
70 p("]"); 70 p("]");
71 71
72 if (KEY_DIRTY(k)) 72 if (KEY_DIRTY(k))
73 p(" dirty"); 73 p(" dirty");
74 if (KEY_CSUM(k)) 74 if (KEY_CSUM(k))
75 p(" cs%llu %llx", KEY_CSUM(k), k->ptr[1]); 75 p(" cs%llu %llx", KEY_CSUM(k), k->ptr[1]);
76 #undef p 76 #undef p
77 return out - buf; 77 return out - buf;
78 } 78 }
79 79
80 int bch_btree_to_text(char *buf, size_t size, const struct btree *b) 80 int bch_btree_to_text(char *buf, size_t size, const struct btree *b)
81 { 81 {
82 return scnprintf(buf, size, "%zu level %i/%i", 82 return scnprintf(buf, size, "%zu level %i/%i",
83 PTR_BUCKET_NR(b->c, &b->key, 0), 83 PTR_BUCKET_NR(b->c, &b->key, 0),
84 b->level, b->c->root ? b->c->root->level : -1); 84 b->level, b->c->root ? b->c->root->level : -1);
85 } 85 }
86 86
87 #if defined(CONFIG_BCACHE_DEBUG) || defined(CONFIG_BCACHE_EDEBUG) 87 #if defined(CONFIG_BCACHE_DEBUG) || defined(CONFIG_BCACHE_EDEBUG)
88 88
89 static bool skipped_backwards(struct btree *b, struct bkey *k) 89 static bool skipped_backwards(struct btree *b, struct bkey *k)
90 { 90 {
91 return bkey_cmp(k, (!b->level) 91 return bkey_cmp(k, (!b->level)
92 ? &START_KEY(bkey_next(k)) 92 ? &START_KEY(bkey_next(k))
93 : bkey_next(k)) > 0; 93 : bkey_next(k)) > 0;
94 } 94 }
95 95
96 static void dump_bset(struct btree *b, struct bset *i) 96 static void dump_bset(struct btree *b, struct bset *i)
97 { 97 {
98 struct bkey *k; 98 struct bkey *k;
99 unsigned j; 99 unsigned j;
100 char buf[80]; 100 char buf[80];
101 101
102 for (k = i->start; k < end(i); k = bkey_next(k)) { 102 for (k = i->start; k < end(i); k = bkey_next(k)) {
103 bch_bkey_to_text(buf, sizeof(buf), k); 103 bch_bkey_to_text(buf, sizeof(buf), k);
104 printk(KERN_ERR "block %zu key %zi/%u: %s", index(i, b), 104 printk(KERN_ERR "block %zu key %zi/%u: %s", index(i, b),
105 (uint64_t *) k - i->d, i->keys, buf); 105 (uint64_t *) k - i->d, i->keys, buf);
106 106
107 for (j = 0; j < KEY_PTRS(k); j++) { 107 for (j = 0; j < KEY_PTRS(k); j++) {
108 size_t n = PTR_BUCKET_NR(b->c, k, j); 108 size_t n = PTR_BUCKET_NR(b->c, k, j);
109 printk(" bucket %zu", n); 109 printk(" bucket %zu", n);
110 110
111 if (n >= b->c->sb.first_bucket && n < b->c->sb.nbuckets) 111 if (n >= b->c->sb.first_bucket && n < b->c->sb.nbuckets)
112 printk(" prio %i", 112 printk(" prio %i",
113 PTR_BUCKET(b->c, k, j)->prio); 113 PTR_BUCKET(b->c, k, j)->prio);
114 } 114 }
115 115
116 printk(" %s\n", bch_ptr_status(b->c, k)); 116 printk(" %s\n", bch_ptr_status(b->c, k));
117 117
118 if (bkey_next(k) < end(i) && 118 if (bkey_next(k) < end(i) &&
119 skipped_backwards(b, k)) 119 skipped_backwards(b, k))
120 printk(KERN_ERR "Key skipped backwards\n"); 120 printk(KERN_ERR "Key skipped backwards\n");
121 } 121 }
122 } 122 }
123 123
124 #endif 124 #endif
125 125
126 #ifdef CONFIG_BCACHE_DEBUG 126 #ifdef CONFIG_BCACHE_DEBUG
127 127
128 void bch_btree_verify(struct btree *b, struct bset *new) 128 void bch_btree_verify(struct btree *b, struct bset *new)
129 { 129 {
130 struct btree *v = b->c->verify_data; 130 struct btree *v = b->c->verify_data;
131 struct closure cl; 131 struct closure cl;
132 closure_init_stack(&cl); 132 closure_init_stack(&cl);
133 133
134 if (!b->c->verify) 134 if (!b->c->verify)
135 return; 135 return;
136 136
137 closure_wait_event(&b->io.wait, &cl, 137 closure_wait_event(&b->io.wait, &cl,
138 atomic_read(&b->io.cl.remaining) == -1); 138 atomic_read(&b->io.cl.remaining) == -1);
139 139
140 mutex_lock(&b->c->verify_lock); 140 mutex_lock(&b->c->verify_lock);
141 141
142 bkey_copy(&v->key, &b->key); 142 bkey_copy(&v->key, &b->key);
143 v->written = 0; 143 v->written = 0;
144 v->level = b->level; 144 v->level = b->level;
145 145
146 bch_btree_node_read(v); 146 bch_btree_node_read(v);
147 closure_wait_event(&v->io.wait, &cl, 147 closure_wait_event(&v->io.wait, &cl,
148 atomic_read(&b->io.cl.remaining) == -1); 148 atomic_read(&b->io.cl.remaining) == -1);
149 149
150 if (new->keys != v->sets[0].data->keys || 150 if (new->keys != v->sets[0].data->keys ||
151 memcmp(new->start, 151 memcmp(new->start,
152 v->sets[0].data->start, 152 v->sets[0].data->start,
153 (void *) end(new) - (void *) new->start)) { 153 (void *) end(new) - (void *) new->start)) {
154 unsigned i, j; 154 unsigned i, j;
155 155
156 console_lock(); 156 console_lock();
157 157
158 printk(KERN_ERR "*** original memory node:\n"); 158 printk(KERN_ERR "*** original memory node:\n");
159 for (i = 0; i <= b->nsets; i++) 159 for (i = 0; i <= b->nsets; i++)
160 dump_bset(b, b->sets[i].data); 160 dump_bset(b, b->sets[i].data);
161 161
162 printk(KERN_ERR "*** sorted memory node:\n"); 162 printk(KERN_ERR "*** sorted memory node:\n");
163 dump_bset(b, new); 163 dump_bset(b, new);
164 164
165 printk(KERN_ERR "*** on disk node:\n"); 165 printk(KERN_ERR "*** on disk node:\n");
166 dump_bset(v, v->sets[0].data); 166 dump_bset(v, v->sets[0].data);
167 167
168 for (j = 0; j < new->keys; j++) 168 for (j = 0; j < new->keys; j++)
169 if (new->d[j] != v->sets[0].data->d[j]) 169 if (new->d[j] != v->sets[0].data->d[j])
170 break; 170 break;
171 171
172 console_unlock(); 172 console_unlock();
173 panic("verify failed at %u\n", j); 173 panic("verify failed at %u\n", j);
174 } 174 }
175 175
176 mutex_unlock(&b->c->verify_lock); 176 mutex_unlock(&b->c->verify_lock);
177 } 177 }
178 178
179 static void data_verify_endio(struct bio *bio, int error) 179 static void data_verify_endio(struct bio *bio, int error)
180 { 180 {
181 struct closure *cl = bio->bi_private; 181 struct closure *cl = bio->bi_private;
182 closure_put(cl); 182 closure_put(cl);
183 } 183 }
184 184
185 void bch_data_verify(struct search *s) 185 void bch_data_verify(struct search *s)
186 { 186 {
187 char name[BDEVNAME_SIZE]; 187 char name[BDEVNAME_SIZE];
188 struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); 188 struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
189 struct closure *cl = &s->cl; 189 struct closure *cl = &s->cl;
190 struct bio *check; 190 struct bio *check;
191 struct bio_vec *bv; 191 struct bio_vec *bv;
192 int i; 192 int i;
193 193
194 if (!s->unaligned_bvec) 194 if (!s->unaligned_bvec)
195 bio_for_each_segment(bv, s->orig_bio, i) 195 bio_for_each_segment(bv, s->orig_bio, i)
196 bv->bv_offset = 0, bv->bv_len = PAGE_SIZE; 196 bv->bv_offset = 0, bv->bv_len = PAGE_SIZE;
197 197
198 check = bio_clone(s->orig_bio, GFP_NOIO); 198 check = bio_clone(s->orig_bio, GFP_NOIO);
199 if (!check) 199 if (!check)
200 return; 200 return;
201 201
202 if (bch_bio_alloc_pages(check, GFP_NOIO)) 202 if (bch_bio_alloc_pages(check, GFP_NOIO))
203 goto out_put; 203 goto out_put;
204 204
205 check->bi_rw = READ_SYNC; 205 check->bi_rw = READ_SYNC;
206 check->bi_private = cl; 206 check->bi_private = cl;
207 check->bi_end_io = data_verify_endio; 207 check->bi_end_io = data_verify_endio;
208 208
209 closure_bio_submit(check, cl, &dc->disk); 209 closure_bio_submit(check, cl, &dc->disk);
210 closure_sync(cl); 210 closure_sync(cl);
211 211
212 bio_for_each_segment(bv, s->orig_bio, i) { 212 bio_for_each_segment(bv, s->orig_bio, i) {
213 void *p1 = kmap(bv->bv_page); 213 void *p1 = kmap(bv->bv_page);
214 void *p2 = kmap(check->bi_io_vec[i].bv_page); 214 void *p2 = kmap(check->bi_io_vec[i].bv_page);
215 215
216 if (memcmp(p1 + bv->bv_offset, 216 if (memcmp(p1 + bv->bv_offset,
217 p2 + bv->bv_offset, 217 p2 + bv->bv_offset,
218 bv->bv_len)) 218 bv->bv_len))
219 printk(KERN_ERR 219 printk(KERN_ERR
220 "bcache (%s): verify failed at sector %llu\n", 220 "bcache (%s): verify failed at sector %llu\n",
221 bdevname(dc->bdev, name), 221 bdevname(dc->bdev, name),
222 (uint64_t) s->orig_bio->bi_sector); 222 (uint64_t) s->orig_bio->bi_sector);
223 223
224 kunmap(bv->bv_page); 224 kunmap(bv->bv_page);
225 kunmap(check->bi_io_vec[i].bv_page); 225 kunmap(check->bi_io_vec[i].bv_page);
226 } 226 }
227 227
228 __bio_for_each_segment(bv, check, i, 0) 228 __bio_for_each_segment(bv, check, i, 0)
229 __free_page(bv->bv_page); 229 __free_page(bv->bv_page);
230 out_put: 230 out_put:
231 bio_put(check); 231 bio_put(check);
232 } 232 }
233 233
234 #endif 234 #endif
235 235
236 #ifdef CONFIG_BCACHE_EDEBUG 236 #ifdef CONFIG_BCACHE_EDEBUG
237 237
238 unsigned bch_count_data(struct btree *b) 238 unsigned bch_count_data(struct btree *b)
239 { 239 {
240 unsigned ret = 0; 240 unsigned ret = 0;
241 struct btree_iter iter; 241 struct btree_iter iter;
242 struct bkey *k; 242 struct bkey *k;
243 243
244 if (!b->level) 244 if (!b->level)
245 for_each_key(b, k, &iter) 245 for_each_key(b, k, &iter)
246 ret += KEY_SIZE(k); 246 ret += KEY_SIZE(k);
247 return ret; 247 return ret;
248 } 248 }
249 249
250 static void vdump_bucket_and_panic(struct btree *b, const char *fmt, 250 static void vdump_bucket_and_panic(struct btree *b, const char *fmt,
251 va_list args) 251 va_list args)
252 { 252 {
253 unsigned i; 253 unsigned i;
254 char buf[80]; 254 char buf[80];
255 255
256 console_lock(); 256 console_lock();
257 257
258 for (i = 0; i <= b->nsets; i++) 258 for (i = 0; i <= b->nsets; i++)
259 dump_bset(b, b->sets[i].data); 259 dump_bset(b, b->sets[i].data);
260 260
261 vprintk(fmt, args); 261 vprintk(fmt, args);
262 262
263 console_unlock(); 263 console_unlock();
264 264
265 bch_btree_to_text(buf, sizeof(buf), b); 265 bch_btree_to_text(buf, sizeof(buf), b);
266 panic("at %s\n", buf); 266 panic("at %s\n", buf);
267 } 267 }
268 268
269 void bch_check_key_order_msg(struct btree *b, struct bset *i, 269 void bch_check_key_order_msg(struct btree *b, struct bset *i,
270 const char *fmt, ...) 270 const char *fmt, ...)
271 { 271 {
272 struct bkey *k; 272 struct bkey *k;
273 273
274 if (!i->keys) 274 if (!i->keys)
275 return; 275 return;
276 276
277 for (k = i->start; bkey_next(k) < end(i); k = bkey_next(k)) 277 for (k = i->start; bkey_next(k) < end(i); k = bkey_next(k))
278 if (skipped_backwards(b, k)) { 278 if (skipped_backwards(b, k)) {
279 va_list args; 279 va_list args;
280 va_start(args, fmt); 280 va_start(args, fmt);
281 281
282 vdump_bucket_and_panic(b, fmt, args); 282 vdump_bucket_and_panic(b, fmt, args);
283 va_end(args); 283 va_end(args);
284 } 284 }
285 } 285 }
286 286
287 void bch_check_keys(struct btree *b, const char *fmt, ...) 287 void bch_check_keys(struct btree *b, const char *fmt, ...)
288 { 288 {
289 va_list args; 289 va_list args;
290 struct bkey *k, *p = NULL; 290 struct bkey *k, *p = NULL;
291 struct btree_iter iter; 291 struct btree_iter iter;
292 292
293 if (b->level) 293 if (b->level)
294 return; 294 return;
295 295
296 for_each_key(b, k, &iter) { 296 for_each_key(b, k, &iter) {
297 if (p && bkey_cmp(&START_KEY(p), &START_KEY(k)) > 0) { 297 if (p && bkey_cmp(&START_KEY(p), &START_KEY(k)) > 0) {
298 printk(KERN_ERR "Keys out of order:\n"); 298 printk(KERN_ERR "Keys out of order:\n");
299 goto bug; 299 goto bug;
300 } 300 }
301 301
302 if (bch_ptr_invalid(b, k)) 302 if (bch_ptr_invalid(b, k))
303 continue; 303 continue;
304 304
305 if (p && bkey_cmp(p, &START_KEY(k)) > 0) { 305 if (p && bkey_cmp(p, &START_KEY(k)) > 0) {
306 printk(KERN_ERR "Overlapping keys:\n"); 306 printk(KERN_ERR "Overlapping keys:\n");
307 goto bug; 307 goto bug;
308 } 308 }
309 p = k; 309 p = k;
310 } 310 }
311 return; 311 return;
312 bug: 312 bug:
313 va_start(args, fmt); 313 va_start(args, fmt);
314 vdump_bucket_and_panic(b, fmt, args); 314 vdump_bucket_and_panic(b, fmt, args);
315 va_end(args); 315 va_end(args);
316 } 316 }
317 317
318 #endif 318 #endif
319 319
320 #ifdef CONFIG_DEBUG_FS 320 #ifdef CONFIG_DEBUG_FS
321 321
322 /* XXX: cache set refcounting */ 322 /* XXX: cache set refcounting */
323 323
324 struct dump_iterator { 324 struct dump_iterator {
325 char buf[PAGE_SIZE]; 325 char buf[PAGE_SIZE];
326 size_t bytes; 326 size_t bytes;
327 struct cache_set *c; 327 struct cache_set *c;
328 struct keybuf keys; 328 struct keybuf keys;
329 }; 329 };
330 330
331 static bool dump_pred(struct keybuf *buf, struct bkey *k) 331 static bool dump_pred(struct keybuf *buf, struct bkey *k)
332 { 332 {
333 return true; 333 return true;
334 } 334 }
335 335
336 static ssize_t bch_dump_read(struct file *file, char __user *buf, 336 static ssize_t bch_dump_read(struct file *file, char __user *buf,
337 size_t size, loff_t *ppos) 337 size_t size, loff_t *ppos)
338 { 338 {
339 struct dump_iterator *i = file->private_data; 339 struct dump_iterator *i = file->private_data;
340 ssize_t ret = 0; 340 ssize_t ret = 0;
341 char kbuf[80]; 341 char kbuf[80];
342 342
343 while (size) { 343 while (size) {
344 struct keybuf_key *w; 344 struct keybuf_key *w;
345 unsigned bytes = min(i->bytes, size); 345 unsigned bytes = min(i->bytes, size);
346 346
347 int err = copy_to_user(buf, i->buf, bytes); 347 int err = copy_to_user(buf, i->buf, bytes);
348 if (err) 348 if (err)
349 return err; 349 return err;
350 350
351 ret += bytes; 351 ret += bytes;
352 buf += bytes; 352 buf += bytes;
353 size -= bytes; 353 size -= bytes;
354 i->bytes -= bytes; 354 i->bytes -= bytes;
355 memmove(i->buf, i->buf + bytes, i->bytes); 355 memmove(i->buf, i->buf + bytes, i->bytes);
356 356
357 if (i->bytes) 357 if (i->bytes)
358 break; 358 break;
359 359
360 w = bch_keybuf_next_rescan(i->c, &i->keys, &MAX_KEY); 360 w = bch_keybuf_next_rescan(i->c, &i->keys, &MAX_KEY, dump_pred);
361 if (!w) 361 if (!w)
362 break; 362 break;
363 363
364 bch_bkey_to_text(kbuf, sizeof(kbuf), &w->key); 364 bch_bkey_to_text(kbuf, sizeof(kbuf), &w->key);
365 i->bytes = snprintf(i->buf, PAGE_SIZE, "%s\n", kbuf); 365 i->bytes = snprintf(i->buf, PAGE_SIZE, "%s\n", kbuf);
366 bch_keybuf_del(&i->keys, w); 366 bch_keybuf_del(&i->keys, w);
367 } 367 }
368 368
369 return ret; 369 return ret;
370 } 370 }
371 371
372 static int bch_dump_open(struct inode *inode, struct file *file) 372 static int bch_dump_open(struct inode *inode, struct file *file)
373 { 373 {
374 struct cache_set *c = inode->i_private; 374 struct cache_set *c = inode->i_private;
375 struct dump_iterator *i; 375 struct dump_iterator *i;
376 376
377 i = kzalloc(sizeof(struct dump_iterator), GFP_KERNEL); 377 i = kzalloc(sizeof(struct dump_iterator), GFP_KERNEL);
378 if (!i) 378 if (!i)
379 return -ENOMEM; 379 return -ENOMEM;
380 380
381 file->private_data = i; 381 file->private_data = i;
382 i->c = c; 382 i->c = c;
383 bch_keybuf_init(&i->keys, dump_pred); 383 bch_keybuf_init(&i->keys);
384 i->keys.last_scanned = KEY(0, 0, 0); 384 i->keys.last_scanned = KEY(0, 0, 0);
385 385
386 return 0; 386 return 0;
387 } 387 }
388 388
389 static int bch_dump_release(struct inode *inode, struct file *file) 389 static int bch_dump_release(struct inode *inode, struct file *file)
390 { 390 {
391 kfree(file->private_data); 391 kfree(file->private_data);
392 return 0; 392 return 0;
393 } 393 }
394 394
395 static const struct file_operations cache_set_debug_ops = { 395 static const struct file_operations cache_set_debug_ops = {
396 .owner = THIS_MODULE, 396 .owner = THIS_MODULE,
397 .open = bch_dump_open, 397 .open = bch_dump_open,
398 .read = bch_dump_read, 398 .read = bch_dump_read,
399 .release = bch_dump_release 399 .release = bch_dump_release
400 }; 400 };
401 401
402 void bch_debug_init_cache_set(struct cache_set *c) 402 void bch_debug_init_cache_set(struct cache_set *c)
403 { 403 {
404 if (!IS_ERR_OR_NULL(debug)) { 404 if (!IS_ERR_OR_NULL(debug)) {
405 char name[50]; 405 char name[50];
406 snprintf(name, 50, "bcache-%pU", c->sb.set_uuid); 406 snprintf(name, 50, "bcache-%pU", c->sb.set_uuid);
407 407
408 c->debug = debugfs_create_file(name, 0400, debug, c, 408 c->debug = debugfs_create_file(name, 0400, debug, c,
409 &cache_set_debug_ops); 409 &cache_set_debug_ops);
410 } 410 }
411 } 411 }
412 412
413 #endif 413 #endif
414 414
415 /* Fuzz tester has rotted: */ 415 /* Fuzz tester has rotted: */
416 #if 0 416 #if 0
417 417
418 static ssize_t btree_fuzz(struct kobject *k, struct kobj_attribute *a, 418 static ssize_t btree_fuzz(struct kobject *k, struct kobj_attribute *a,
419 const char *buffer, size_t size) 419 const char *buffer, size_t size)
420 { 420 {
421 void dump(struct btree *b) 421 void dump(struct btree *b)
422 { 422 {
423 struct bset *i; 423 struct bset *i;
424 424
425 for (i = b->sets[0].data; 425 for (i = b->sets[0].data;
426 index(i, b) < btree_blocks(b) && 426 index(i, b) < btree_blocks(b) &&
427 i->seq == b->sets[0].data->seq; 427 i->seq == b->sets[0].data->seq;
428 i = ((void *) i) + set_blocks(i, b->c) * block_bytes(b->c)) 428 i = ((void *) i) + set_blocks(i, b->c) * block_bytes(b->c))
429 dump_bset(b, i); 429 dump_bset(b, i);
430 } 430 }
431 431
432 struct cache_sb *sb; 432 struct cache_sb *sb;
433 struct cache_set *c; 433 struct cache_set *c;
434 struct btree *all[3], *b, *fill, *orig; 434 struct btree *all[3], *b, *fill, *orig;
435 int j; 435 int j;
436 436
437 struct btree_op op; 437 struct btree_op op;
438 bch_btree_op_init_stack(&op); 438 bch_btree_op_init_stack(&op);
439 439
440 sb = kzalloc(sizeof(struct cache_sb), GFP_KERNEL); 440 sb = kzalloc(sizeof(struct cache_sb), GFP_KERNEL);
441 if (!sb) 441 if (!sb)
442 return -ENOMEM; 442 return -ENOMEM;
443 443
444 sb->bucket_size = 128; 444 sb->bucket_size = 128;
445 sb->block_size = 4; 445 sb->block_size = 4;
446 446
447 c = bch_cache_set_alloc(sb); 447 c = bch_cache_set_alloc(sb);
448 if (!c) 448 if (!c)
449 return -ENOMEM; 449 return -ENOMEM;
450 450
451 for (j = 0; j < 3; j++) { 451 for (j = 0; j < 3; j++) {
452 BUG_ON(list_empty(&c->btree_cache)); 452 BUG_ON(list_empty(&c->btree_cache));
453 all[j] = list_first_entry(&c->btree_cache, struct btree, list); 453 all[j] = list_first_entry(&c->btree_cache, struct btree, list);
454 list_del_init(&all[j]->list); 454 list_del_init(&all[j]->list);
455 455
456 all[j]->key = KEY(0, 0, c->sb.bucket_size); 456 all[j]->key = KEY(0, 0, c->sb.bucket_size);
457 bkey_copy_key(&all[j]->key, &MAX_KEY); 457 bkey_copy_key(&all[j]->key, &MAX_KEY);
458 } 458 }
459 459
460 b = all[0]; 460 b = all[0];
461 fill = all[1]; 461 fill = all[1];
462 orig = all[2]; 462 orig = all[2];
463 463
464 while (1) { 464 while (1) {
465 for (j = 0; j < 3; j++) 465 for (j = 0; j < 3; j++)
466 all[j]->written = all[j]->nsets = 0; 466 all[j]->written = all[j]->nsets = 0;
467 467
468 bch_bset_init_next(b); 468 bch_bset_init_next(b);
469 469
470 while (1) { 470 while (1) {
471 struct bset *i = write_block(b); 471 struct bset *i = write_block(b);
472 struct bkey *k = op.keys.top; 472 struct bkey *k = op.keys.top;
473 unsigned rand; 473 unsigned rand;
474 474
475 bkey_init(k); 475 bkey_init(k);
476 rand = get_random_int(); 476 rand = get_random_int();
477 477
478 op.type = rand & 1 478 op.type = rand & 1
479 ? BTREE_INSERT 479 ? BTREE_INSERT
480 : BTREE_REPLACE; 480 : BTREE_REPLACE;
481 rand >>= 1; 481 rand >>= 1;
482 482
483 SET_KEY_SIZE(k, bucket_remainder(c, rand)); 483 SET_KEY_SIZE(k, bucket_remainder(c, rand));
484 rand >>= c->bucket_bits; 484 rand >>= c->bucket_bits;
485 rand &= 1024 * 512 - 1; 485 rand &= 1024 * 512 - 1;
486 rand += c->sb.bucket_size; 486 rand += c->sb.bucket_size;
487 SET_KEY_OFFSET(k, rand); 487 SET_KEY_OFFSET(k, rand);
488 #if 0 488 #if 0
489 SET_KEY_PTRS(k, 1); 489 SET_KEY_PTRS(k, 1);
490 #endif 490 #endif
491 bch_keylist_push(&op.keys); 491 bch_keylist_push(&op.keys);
492 bch_btree_insert_keys(b, &op); 492 bch_btree_insert_keys(b, &op);
493 493
494 if (should_split(b) || 494 if (should_split(b) ||
495 set_blocks(i, b->c) != 495 set_blocks(i, b->c) !=
496 __set_blocks(i, i->keys + 15, b->c)) { 496 __set_blocks(i, i->keys + 15, b->c)) {
497 i->csum = csum_set(i); 497 i->csum = csum_set(i);
498 498
499 memcpy(write_block(fill), 499 memcpy(write_block(fill),
500 i, set_bytes(i)); 500 i, set_bytes(i));
501 501
502 b->written += set_blocks(i, b->c); 502 b->written += set_blocks(i, b->c);
503 fill->written = b->written; 503 fill->written = b->written;
504 if (b->written == btree_blocks(b)) 504 if (b->written == btree_blocks(b))
505 break; 505 break;
506 506
507 bch_btree_sort_lazy(b); 507 bch_btree_sort_lazy(b);
508 bch_bset_init_next(b); 508 bch_bset_init_next(b);
509 } 509 }
510 } 510 }
511 511
512 memcpy(orig->sets[0].data, 512 memcpy(orig->sets[0].data,
513 fill->sets[0].data, 513 fill->sets[0].data,
514 btree_bytes(c)); 514 btree_bytes(c));
515 515
516 bch_btree_sort(b); 516 bch_btree_sort(b);
517 fill->written = 0; 517 fill->written = 0;
518 bch_btree_node_read_done(fill); 518 bch_btree_node_read_done(fill);
519 519
520 if (b->sets[0].data->keys != fill->sets[0].data->keys || 520 if (b->sets[0].data->keys != fill->sets[0].data->keys ||
521 memcmp(b->sets[0].data->start, 521 memcmp(b->sets[0].data->start,
522 fill->sets[0].data->start, 522 fill->sets[0].data->start,
523 b->sets[0].data->keys * sizeof(uint64_t))) { 523 b->sets[0].data->keys * sizeof(uint64_t))) {
524 struct bset *i = b->sets[0].data; 524 struct bset *i = b->sets[0].data;
525 struct bkey *k, *l; 525 struct bkey *k, *l;
526 526
527 for (k = i->start, 527 for (k = i->start,
528 l = fill->sets[0].data->start; 528 l = fill->sets[0].data->start;
529 k < end(i); 529 k < end(i);
530 k = bkey_next(k), l = bkey_next(l)) 530 k = bkey_next(k), l = bkey_next(l))
531 if (bkey_cmp(k, l) || 531 if (bkey_cmp(k, l) ||
532 KEY_SIZE(k) != KEY_SIZE(l)) { 532 KEY_SIZE(k) != KEY_SIZE(l)) {
533 char buf1[80]; 533 char buf1[80];
534 char buf2[80]; 534 char buf2[80];
535 535
536 bch_bkey_to_text(buf1, sizeof(buf1), k); 536 bch_bkey_to_text(buf1, sizeof(buf1), k);
537 bch_bkey_to_text(buf2, sizeof(buf2), l); 537 bch_bkey_to_text(buf2, sizeof(buf2), l);
538 538
539 pr_err("key %zi differs: %s != %s", 539 pr_err("key %zi differs: %s != %s",
540 (uint64_t *) k - i->d, 540 (uint64_t *) k - i->d,
541 buf1, buf2); 541 buf1, buf2);
542 } 542 }
543 543
544 for (j = 0; j < 3; j++) { 544 for (j = 0; j < 3; j++) {
545 pr_err("**** Set %i ****", j); 545 pr_err("**** Set %i ****", j);
546 dump(all[j]); 546 dump(all[j]);
547 } 547 }
548 panic("\n"); 548 panic("\n");
549 } 549 }
550 550
551 pr_info("fuzz complete: %i keys", b->sets[0].data->keys); 551 pr_info("fuzz complete: %i keys", b->sets[0].data->keys);
552 } 552 }
553 } 553 }
554 554
555 kobj_attribute_write(fuzz, btree_fuzz); 555 kobj_attribute_write(fuzz, btree_fuzz);
556 #endif 556 #endif
557 557
558 void bch_debug_exit(void) 558 void bch_debug_exit(void)
559 { 559 {
560 if (!IS_ERR_OR_NULL(debug)) 560 if (!IS_ERR_OR_NULL(debug))
561 debugfs_remove_recursive(debug); 561 debugfs_remove_recursive(debug);
562 } 562 }
563 563
564 int __init bch_debug_init(struct kobject *kobj) 564 int __init bch_debug_init(struct kobject *kobj)
565 { 565 {
566 int ret = 0; 566 int ret = 0;
567 #if 0 567 #if 0
568 ret = sysfs_create_file(kobj, &ksysfs_fuzz.attr); 568 ret = sysfs_create_file(kobj, &ksysfs_fuzz.attr);
569 if (ret) 569 if (ret)
570 return ret; 570 return ret;
571 #endif 571 #endif
572 572
573 debug = debugfs_create_dir("bcache", NULL); 573 debug = debugfs_create_dir("bcache", NULL);
574 return ret; 574 return ret;
575 } 575 }
576 576
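For reference, the per-cache-set file registered by bch_debug_init_cache_set() above emits one formatted bkey per line through bch_dump_read(). A hypothetical userspace reader, assuming debugfs is mounted at the usual /sys/kernel/debug and with <uuid> standing in for the cache set UUID:

#include <stdio.h>

int main(void)
{
	/* <uuid> is a placeholder; substitute the real cache set UUID. */
	FILE *f = fopen("/sys/kernel/debug/bcache/bcache-<uuid>", "r");
	char line[128];

	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* "inode:offset len N -> [...]" per key */

	fclose(f);
	return 0;
}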
drivers/md/bcache/movinggc.c
1 /* 1 /*
2 * Moving/copying garbage collector 2 * Moving/copying garbage collector
3 * 3 *
4 * Copyright 2012 Google, Inc. 4 * Copyright 2012 Google, Inc.
5 */ 5 */
6 6
7 #include "bcache.h" 7 #include "bcache.h"
8 #include "btree.h" 8 #include "btree.h"
9 #include "debug.h" 9 #include "debug.h"
10 #include "request.h" 10 #include "request.h"
11 11
12 #include <trace/events/bcache.h> 12 #include <trace/events/bcache.h>
13 13
14 struct moving_io { 14 struct moving_io {
15 struct keybuf_key *w; 15 struct keybuf_key *w;
16 struct search s; 16 struct search s;
17 struct bbio bio; 17 struct bbio bio;
18 }; 18 };
19 19
20 static bool moving_pred(struct keybuf *buf, struct bkey *k) 20 static bool moving_pred(struct keybuf *buf, struct bkey *k)
21 { 21 {
22 struct cache_set *c = container_of(buf, struct cache_set, 22 struct cache_set *c = container_of(buf, struct cache_set,
23 moving_gc_keys); 23 moving_gc_keys);
24 unsigned i; 24 unsigned i;
25 25
26 for (i = 0; i < KEY_PTRS(k); i++) { 26 for (i = 0; i < KEY_PTRS(k); i++) {
27 struct cache *ca = PTR_CACHE(c, k, i); 27 struct cache *ca = PTR_CACHE(c, k, i);
28 struct bucket *g = PTR_BUCKET(c, k, i); 28 struct bucket *g = PTR_BUCKET(c, k, i);
29 29
30 if (GC_SECTORS_USED(g) < ca->gc_move_threshold) 30 if (GC_SECTORS_USED(g) < ca->gc_move_threshold)
31 return true; 31 return true;
32 } 32 }
33 33
34 return false; 34 return false;
35 } 35 }
36 36
37 /* Moving GC - IO loop */ 37 /* Moving GC - IO loop */
38 38
39 static void moving_io_destructor(struct closure *cl) 39 static void moving_io_destructor(struct closure *cl)
40 { 40 {
41 struct moving_io *io = container_of(cl, struct moving_io, s.cl); 41 struct moving_io *io = container_of(cl, struct moving_io, s.cl);
42 kfree(io); 42 kfree(io);
43 } 43 }
44 44
45 static void write_moving_finish(struct closure *cl) 45 static void write_moving_finish(struct closure *cl)
46 { 46 {
47 struct moving_io *io = container_of(cl, struct moving_io, s.cl); 47 struct moving_io *io = container_of(cl, struct moving_io, s.cl);
48 struct bio *bio = &io->bio.bio; 48 struct bio *bio = &io->bio.bio;
49 struct bio_vec *bv = bio_iovec_idx(bio, bio->bi_vcnt); 49 struct bio_vec *bv = bio_iovec_idx(bio, bio->bi_vcnt);
50 50
51 while (bv-- != bio->bi_io_vec) 51 while (bv-- != bio->bi_io_vec)
52 __free_page(bv->bv_page); 52 __free_page(bv->bv_page);
53 53
54 if (io->s.op.insert_collision) 54 if (io->s.op.insert_collision)
55 trace_bcache_gc_copy_collision(&io->w->key); 55 trace_bcache_gc_copy_collision(&io->w->key);
56 56
57 bch_keybuf_del(&io->s.op.c->moving_gc_keys, io->w); 57 bch_keybuf_del(&io->s.op.c->moving_gc_keys, io->w);
58 58
59 atomic_dec_bug(&io->s.op.c->in_flight); 59 atomic_dec_bug(&io->s.op.c->in_flight);
60 closure_wake_up(&io->s.op.c->moving_gc_wait); 60 closure_wake_up(&io->s.op.c->moving_gc_wait);
61 61
62 closure_return_with_destructor(cl, moving_io_destructor); 62 closure_return_with_destructor(cl, moving_io_destructor);
63 } 63 }
64 64
65 static void read_moving_endio(struct bio *bio, int error) 65 static void read_moving_endio(struct bio *bio, int error)
66 { 66 {
67 struct moving_io *io = container_of(bio->bi_private, 67 struct moving_io *io = container_of(bio->bi_private,
68 struct moving_io, s.cl); 68 struct moving_io, s.cl);
69 69
70 if (error) 70 if (error)
71 io->s.error = error; 71 io->s.error = error;
72 72
73 bch_bbio_endio(io->s.op.c, bio, error, "reading data to move"); 73 bch_bbio_endio(io->s.op.c, bio, error, "reading data to move");
74 } 74 }
75 75
76 static void moving_init(struct moving_io *io) 76 static void moving_init(struct moving_io *io)
77 { 77 {
78 struct bio *bio = &io->bio.bio; 78 struct bio *bio = &io->bio.bio;
79 79
80 bio_init(bio); 80 bio_init(bio);
81 bio_get(bio); 81 bio_get(bio);
82 bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); 82 bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
83 83
84 bio->bi_size = KEY_SIZE(&io->w->key) << 9; 84 bio->bi_size = KEY_SIZE(&io->w->key) << 9;
85 bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&io->w->key), 85 bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&io->w->key),
86 PAGE_SECTORS); 86 PAGE_SECTORS);
87 bio->bi_private = &io->s.cl; 87 bio->bi_private = &io->s.cl;
88 bio->bi_io_vec = bio->bi_inline_vecs; 88 bio->bi_io_vec = bio->bi_inline_vecs;
89 bch_bio_map(bio, NULL); 89 bch_bio_map(bio, NULL);
90 } 90 }
91 91
92 static void write_moving(struct closure *cl) 92 static void write_moving(struct closure *cl)
93 { 93 {
94 struct search *s = container_of(cl, struct search, cl); 94 struct search *s = container_of(cl, struct search, cl);
95 struct moving_io *io = container_of(s, struct moving_io, s); 95 struct moving_io *io = container_of(s, struct moving_io, s);
96 96
97 if (!s->error) { 97 if (!s->error) {
98 moving_init(io); 98 moving_init(io);
99 99
100 io->bio.bio.bi_sector = KEY_START(&io->w->key); 100 io->bio.bio.bi_sector = KEY_START(&io->w->key);
101 s->op.lock = -1; 101 s->op.lock = -1;
102 s->op.write_prio = 1; 102 s->op.write_prio = 1;
103 s->op.cache_bio = &io->bio.bio; 103 s->op.cache_bio = &io->bio.bio;
104 104
105 s->writeback = KEY_DIRTY(&io->w->key); 105 s->writeback = KEY_DIRTY(&io->w->key);
106 s->op.csum = KEY_CSUM(&io->w->key); 106 s->op.csum = KEY_CSUM(&io->w->key);
107 107
108 s->op.type = BTREE_REPLACE; 108 s->op.type = BTREE_REPLACE;
109 bkey_copy(&s->op.replace, &io->w->key); 109 bkey_copy(&s->op.replace, &io->w->key);
110 110
111 closure_init(&s->op.cl, cl); 111 closure_init(&s->op.cl, cl);
112 bch_insert_data(&s->op.cl); 112 bch_insert_data(&s->op.cl);
113 } 113 }
114 114
115 continue_at(cl, write_moving_finish, NULL); 115 continue_at(cl, write_moving_finish, NULL);
116 } 116 }
117 117
118 static void read_moving_submit(struct closure *cl) 118 static void read_moving_submit(struct closure *cl)
119 { 119 {
120 struct search *s = container_of(cl, struct search, cl); 120 struct search *s = container_of(cl, struct search, cl);
121 struct moving_io *io = container_of(s, struct moving_io, s); 121 struct moving_io *io = container_of(s, struct moving_io, s);
122 struct bio *bio = &io->bio.bio; 122 struct bio *bio = &io->bio.bio;
123 123
124 bch_submit_bbio(bio, s->op.c, &io->w->key, 0); 124 bch_submit_bbio(bio, s->op.c, &io->w->key, 0);
125 125
126 continue_at(cl, write_moving, bch_gc_wq); 126 continue_at(cl, write_moving, bch_gc_wq);
127 } 127 }
128 128
129 static void read_moving(struct closure *cl) 129 static void read_moving(struct closure *cl)
130 { 130 {
131 struct cache_set *c = container_of(cl, struct cache_set, moving_gc); 131 struct cache_set *c = container_of(cl, struct cache_set, moving_gc);
132 struct keybuf_key *w; 132 struct keybuf_key *w;
133 struct moving_io *io; 133 struct moving_io *io;
134 struct bio *bio; 134 struct bio *bio;
135 135
136 /* XXX: if we error, background writeback could stall indefinitely */ 136 /* XXX: if we error, background writeback could stall indefinitely */
137 137
138 while (!test_bit(CACHE_SET_STOPPING, &c->flags)) { 138 while (!test_bit(CACHE_SET_STOPPING, &c->flags)) {
139 w = bch_keybuf_next_rescan(c, &c->moving_gc_keys, &MAX_KEY); 139 w = bch_keybuf_next_rescan(c, &c->moving_gc_keys,
140 &MAX_KEY, moving_pred);
140 if (!w) 141 if (!w)
141 break; 142 break;
142 143
143 io = kzalloc(sizeof(struct moving_io) + sizeof(struct bio_vec) 144 io = kzalloc(sizeof(struct moving_io) + sizeof(struct bio_vec)
144 * DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS), 145 * DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
145 GFP_KERNEL); 146 GFP_KERNEL);
146 if (!io) 147 if (!io)
147 goto err; 148 goto err;
148 149
149 w->private = io; 150 w->private = io;
150 io->w = w; 151 io->w = w;
151 io->s.op.inode = KEY_INODE(&w->key); 152 io->s.op.inode = KEY_INODE(&w->key);
152 io->s.op.c = c; 153 io->s.op.c = c;
153 154
154 moving_init(io); 155 moving_init(io);
155 bio = &io->bio.bio; 156 bio = &io->bio.bio;
156 157
157 bio->bi_rw = READ; 158 bio->bi_rw = READ;
158 bio->bi_end_io = read_moving_endio; 159 bio->bi_end_io = read_moving_endio;
159 160
160 if (bch_bio_alloc_pages(bio, GFP_KERNEL)) 161 if (bch_bio_alloc_pages(bio, GFP_KERNEL))
161 goto err; 162 goto err;
162 163
163 trace_bcache_gc_copy(&w->key); 164 trace_bcache_gc_copy(&w->key);
164 165
165 closure_call(&io->s.cl, read_moving_submit, NULL, &c->gc.cl); 166 closure_call(&io->s.cl, read_moving_submit, NULL, &c->gc.cl);
166 167
167 if (atomic_inc_return(&c->in_flight) >= 64) { 168 if (atomic_inc_return(&c->in_flight) >= 64) {
168 closure_wait_event(&c->moving_gc_wait, cl, 169 closure_wait_event(&c->moving_gc_wait, cl,
169 atomic_read(&c->in_flight) < 64); 170 atomic_read(&c->in_flight) < 64);
170 continue_at(cl, read_moving, bch_gc_wq); 171 continue_at(cl, read_moving, bch_gc_wq);
171 } 172 }
172 } 173 }
173 174
174 if (0) { 175 if (0) {
175 err: if (!IS_ERR_OR_NULL(w->private)) 176 err: if (!IS_ERR_OR_NULL(w->private))
176 kfree(w->private); 177 kfree(w->private);
177 178
178 bch_keybuf_del(&c->moving_gc_keys, w); 179 bch_keybuf_del(&c->moving_gc_keys, w);
179 } 180 }
180 181
181 closure_return(cl); 182 closure_return(cl);
182 } 183 }
183 184
184 static bool bucket_cmp(struct bucket *l, struct bucket *r) 185 static bool bucket_cmp(struct bucket *l, struct bucket *r)
185 { 186 {
186 return GC_SECTORS_USED(l) < GC_SECTORS_USED(r); 187 return GC_SECTORS_USED(l) < GC_SECTORS_USED(r);
187 } 188 }
188 189
189 static unsigned bucket_heap_top(struct cache *ca) 190 static unsigned bucket_heap_top(struct cache *ca)
190 { 191 {
191 return GC_SECTORS_USED(heap_peek(&ca->heap)); 192 return GC_SECTORS_USED(heap_peek(&ca->heap));
192 } 193 }
193 194
194 void bch_moving_gc(struct closure *cl) 195 void bch_moving_gc(struct closure *cl)
195 { 196 {
196 struct cache_set *c = container_of(cl, struct cache_set, gc.cl); 197 struct cache_set *c = container_of(cl, struct cache_set, gc.cl);
197 struct cache *ca; 198 struct cache *ca;
198 struct bucket *b; 199 struct bucket *b;
199 unsigned i; 200 unsigned i;
200 201
201 if (!c->copy_gc_enabled) 202 if (!c->copy_gc_enabled)
202 closure_return(cl); 203 closure_return(cl);
203 204
204 mutex_lock(&c->bucket_lock); 205 mutex_lock(&c->bucket_lock);
205 206
206 for_each_cache(ca, c, i) { 207 for_each_cache(ca, c, i) {
207 unsigned sectors_to_move = 0; 208 unsigned sectors_to_move = 0;
208 unsigned reserve_sectors = ca->sb.bucket_size * 209 unsigned reserve_sectors = ca->sb.bucket_size *
209 min(fifo_used(&ca->free), ca->free.size / 2); 210 min(fifo_used(&ca->free), ca->free.size / 2);
210 211
211 ca->heap.used = 0; 212 ca->heap.used = 0;
212 213
213 for_each_bucket(b, ca) { 214 for_each_bucket(b, ca) {
214 if (!GC_SECTORS_USED(b)) 215 if (!GC_SECTORS_USED(b))
215 continue; 216 continue;
216 217
217 if (!heap_full(&ca->heap)) { 218 if (!heap_full(&ca->heap)) {
218 sectors_to_move += GC_SECTORS_USED(b); 219 sectors_to_move += GC_SECTORS_USED(b);
219 heap_add(&ca->heap, b, bucket_cmp); 220 heap_add(&ca->heap, b, bucket_cmp);
220 } else if (bucket_cmp(b, heap_peek(&ca->heap))) { 221 } else if (bucket_cmp(b, heap_peek(&ca->heap))) {
221 sectors_to_move -= bucket_heap_top(ca); 222 sectors_to_move -= bucket_heap_top(ca);
222 sectors_to_move += GC_SECTORS_USED(b); 223 sectors_to_move += GC_SECTORS_USED(b);
223 224
224 ca->heap.data[0] = b; 225 ca->heap.data[0] = b;
225 heap_sift(&ca->heap, 0, bucket_cmp); 226 heap_sift(&ca->heap, 0, bucket_cmp);
226 } 227 }
227 } 228 }
228 229
229 while (sectors_to_move > reserve_sectors) { 230 while (sectors_to_move > reserve_sectors) {
230 heap_pop(&ca->heap, b, bucket_cmp); 231 heap_pop(&ca->heap, b, bucket_cmp);
231 sectors_to_move -= GC_SECTORS_USED(b); 232 sectors_to_move -= GC_SECTORS_USED(b);
232 } 233 }
233 234
234 ca->gc_move_threshold = bucket_heap_top(ca); 235 ca->gc_move_threshold = bucket_heap_top(ca);
235 236
236 pr_debug("threshold %u", ca->gc_move_threshold); 237 pr_debug("threshold %u", ca->gc_move_threshold);
237 } 238 }
238 239
239 mutex_unlock(&c->bucket_lock); 240 mutex_unlock(&c->bucket_lock);
240 241
241 c->moving_gc_keys.last_scanned = ZERO_KEY; 242 c->moving_gc_keys.last_scanned = ZERO_KEY;
242 243
243 closure_init(&c->moving_gc, cl); 244 closure_init(&c->moving_gc, cl);
244 read_moving(&c->moving_gc); 245 read_moving(&c->moving_gc);
245 246
246 closure_return(cl); 247 closure_return(cl);
247 } 248 }
248 249
249 void bch_moving_init_cache_set(struct cache_set *c) 250 void bch_moving_init_cache_set(struct cache_set *c)
250 { 251 {
251 bch_keybuf_init(&c->moving_gc_keys, moving_pred); 252 bch_keybuf_init(&c->moving_gc_keys);
252 } 253 }
253 254
drivers/md/bcache/request.c
1 /* 1 /*
2 * Main bcache entry point - handle a read or a write request and decide what to 2 * Main bcache entry point - handle a read or a write request and decide what to
3 * do with it; the make_request functions are called by the block layer. 3 * do with it; the make_request functions are called by the block layer.
4 * 4 *
5 * Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com> 5 * Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com>
6 * Copyright 2012 Google, Inc. 6 * Copyright 2012 Google, Inc.
7 */ 7 */
8 8
9 #include "bcache.h" 9 #include "bcache.h"
10 #include "btree.h" 10 #include "btree.h"
11 #include "debug.h" 11 #include "debug.h"
12 #include "request.h" 12 #include "request.h"
13 #include "writeback.h" 13 #include "writeback.h"
14 14
15 #include <linux/cgroup.h> 15 #include <linux/cgroup.h>
16 #include <linux/module.h> 16 #include <linux/module.h>
17 #include <linux/hash.h> 17 #include <linux/hash.h>
18 #include <linux/random.h> 18 #include <linux/random.h>
19 #include "blk-cgroup.h" 19 #include "blk-cgroup.h"
20 20
21 #include <trace/events/bcache.h> 21 #include <trace/events/bcache.h>
22 22
23 #define CUTOFF_CACHE_ADD 95 23 #define CUTOFF_CACHE_ADD 95
24 #define CUTOFF_CACHE_READA 90 24 #define CUTOFF_CACHE_READA 90
25 #define CUTOFF_WRITEBACK 50
26 #define CUTOFF_WRITEBACK_SYNC 75
27 25
28 struct kmem_cache *bch_search_cache; 26 struct kmem_cache *bch_search_cache;
29 27
30 static void check_should_skip(struct cached_dev *, struct search *); 28 static void check_should_skip(struct cached_dev *, struct search *);
31 29
32 /* Cgroup interface */ 30 /* Cgroup interface */
33 31
34 #ifdef CONFIG_CGROUP_BCACHE 32 #ifdef CONFIG_CGROUP_BCACHE
35 static struct bch_cgroup bcache_default_cgroup = { .cache_mode = -1 }; 33 static struct bch_cgroup bcache_default_cgroup = { .cache_mode = -1 };
36 34
37 static struct bch_cgroup *cgroup_to_bcache(struct cgroup *cgroup) 35 static struct bch_cgroup *cgroup_to_bcache(struct cgroup *cgroup)
38 { 36 {
39 struct cgroup_subsys_state *css; 37 struct cgroup_subsys_state *css;
40 return cgroup && 38 return cgroup &&
41 (css = cgroup_subsys_state(cgroup, bcache_subsys_id)) 39 (css = cgroup_subsys_state(cgroup, bcache_subsys_id))
42 ? container_of(css, struct bch_cgroup, css) 40 ? container_of(css, struct bch_cgroup, css)
43 : &bcache_default_cgroup; 41 : &bcache_default_cgroup;
44 } 42 }
45 43
46 struct bch_cgroup *bch_bio_to_cgroup(struct bio *bio) 44 struct bch_cgroup *bch_bio_to_cgroup(struct bio *bio)
47 { 45 {
48 struct cgroup_subsys_state *css = bio->bi_css 46 struct cgroup_subsys_state *css = bio->bi_css
49 ? cgroup_subsys_state(bio->bi_css->cgroup, bcache_subsys_id) 47 ? cgroup_subsys_state(bio->bi_css->cgroup, bcache_subsys_id)
50 : task_subsys_state(current, bcache_subsys_id); 48 : task_subsys_state(current, bcache_subsys_id);
51 49
52 return css 50 return css
53 ? container_of(css, struct bch_cgroup, css) 51 ? container_of(css, struct bch_cgroup, css)
54 : &bcache_default_cgroup; 52 : &bcache_default_cgroup;
55 } 53 }
56 54
57 static ssize_t cache_mode_read(struct cgroup *cgrp, struct cftype *cft, 55 static ssize_t cache_mode_read(struct cgroup *cgrp, struct cftype *cft,
58 struct file *file, 56 struct file *file,
59 char __user *buf, size_t nbytes, loff_t *ppos) 57 char __user *buf, size_t nbytes, loff_t *ppos)
60 { 58 {
61 char tmp[1024]; 59 char tmp[1024];
62 int len = bch_snprint_string_list(tmp, PAGE_SIZE, bch_cache_modes, 60 int len = bch_snprint_string_list(tmp, PAGE_SIZE, bch_cache_modes,
63 cgroup_to_bcache(cgrp)->cache_mode + 1); 61 cgroup_to_bcache(cgrp)->cache_mode + 1);
64 62
65 if (len < 0) 63 if (len < 0)
66 return len; 64 return len;
67 65
68 return simple_read_from_buffer(buf, nbytes, ppos, tmp, len); 66 return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
69 } 67 }
70 68
71 static int cache_mode_write(struct cgroup *cgrp, struct cftype *cft, 69 static int cache_mode_write(struct cgroup *cgrp, struct cftype *cft,
72 const char *buf) 70 const char *buf)
73 { 71 {
74 int v = bch_read_string_list(buf, bch_cache_modes); 72 int v = bch_read_string_list(buf, bch_cache_modes);
75 if (v < 0) 73 if (v < 0)
76 return v; 74 return v;
77 75
78 cgroup_to_bcache(cgrp)->cache_mode = v - 1; 76 cgroup_to_bcache(cgrp)->cache_mode = v - 1;
79 return 0; 77 return 0;
80 } 78 }
81 79
82 static u64 bch_verify_read(struct cgroup *cgrp, struct cftype *cft) 80 static u64 bch_verify_read(struct cgroup *cgrp, struct cftype *cft)
83 { 81 {
84 return cgroup_to_bcache(cgrp)->verify; 82 return cgroup_to_bcache(cgrp)->verify;
85 } 83 }
86 84
87 static int bch_verify_write(struct cgroup *cgrp, struct cftype *cft, u64 val) 85 static int bch_verify_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
88 { 86 {
89 cgroup_to_bcache(cgrp)->verify = val; 87 cgroup_to_bcache(cgrp)->verify = val;
90 return 0; 88 return 0;
91 } 89 }
92 90
93 static u64 bch_cache_hits_read(struct cgroup *cgrp, struct cftype *cft) 91 static u64 bch_cache_hits_read(struct cgroup *cgrp, struct cftype *cft)
94 { 92 {
95 struct bch_cgroup *bcachecg = cgroup_to_bcache(cgrp); 93 struct bch_cgroup *bcachecg = cgroup_to_bcache(cgrp);
96 return atomic_read(&bcachecg->stats.cache_hits); 94 return atomic_read(&bcachecg->stats.cache_hits);
97 } 95 }
98 96
99 static u64 bch_cache_misses_read(struct cgroup *cgrp, struct cftype *cft) 97 static u64 bch_cache_misses_read(struct cgroup *cgrp, struct cftype *cft)
100 { 98 {
101 struct bch_cgroup *bcachecg = cgroup_to_bcache(cgrp); 99 struct bch_cgroup *bcachecg = cgroup_to_bcache(cgrp);
102 return atomic_read(&bcachecg->stats.cache_misses); 100 return atomic_read(&bcachecg->stats.cache_misses);
103 } 101 }
104 102
105 static u64 bch_cache_bypass_hits_read(struct cgroup *cgrp, 103 static u64 bch_cache_bypass_hits_read(struct cgroup *cgrp,
106 struct cftype *cft) 104 struct cftype *cft)
107 { 105 {
108 struct bch_cgroup *bcachecg = cgroup_to_bcache(cgrp); 106 struct bch_cgroup *bcachecg = cgroup_to_bcache(cgrp);
109 return atomic_read(&bcachecg->stats.cache_bypass_hits); 107 return atomic_read(&bcachecg->stats.cache_bypass_hits);
110 } 108 }
111 109
112 static u64 bch_cache_bypass_misses_read(struct cgroup *cgrp, 110 static u64 bch_cache_bypass_misses_read(struct cgroup *cgrp,
113 struct cftype *cft) 111 struct cftype *cft)
114 { 112 {
115 struct bch_cgroup *bcachecg = cgroup_to_bcache(cgrp); 113 struct bch_cgroup *bcachecg = cgroup_to_bcache(cgrp);
116 return atomic_read(&bcachecg->stats.cache_bypass_misses); 114 return atomic_read(&bcachecg->stats.cache_bypass_misses);
117 } 115 }
118 116
119 static struct cftype bch_files[] = { 117 static struct cftype bch_files[] = {
120 { 118 {
121 .name = "cache_mode", 119 .name = "cache_mode",
122 .read = cache_mode_read, 120 .read = cache_mode_read,
123 .write_string = cache_mode_write, 121 .write_string = cache_mode_write,
124 }, 122 },
125 { 123 {
126 .name = "verify", 124 .name = "verify",
127 .read_u64 = bch_verify_read, 125 .read_u64 = bch_verify_read,
128 .write_u64 = bch_verify_write, 126 .write_u64 = bch_verify_write,
129 }, 127 },
130 { 128 {
131 .name = "cache_hits", 129 .name = "cache_hits",
132 .read_u64 = bch_cache_hits_read, 130 .read_u64 = bch_cache_hits_read,
133 }, 131 },
134 { 132 {
135 .name = "cache_misses", 133 .name = "cache_misses",
136 .read_u64 = bch_cache_misses_read, 134 .read_u64 = bch_cache_misses_read,
137 }, 135 },
138 { 136 {
139 .name = "cache_bypass_hits", 137 .name = "cache_bypass_hits",
140 .read_u64 = bch_cache_bypass_hits_read, 138 .read_u64 = bch_cache_bypass_hits_read,
141 }, 139 },
142 { 140 {
143 .name = "cache_bypass_misses", 141 .name = "cache_bypass_misses",
144 .read_u64 = bch_cache_bypass_misses_read, 142 .read_u64 = bch_cache_bypass_misses_read,
145 }, 143 },
146 { } /* terminate */ 144 { } /* terminate */
147 }; 145 };
148 146
149 static void init_bch_cgroup(struct bch_cgroup *cg) 147 static void init_bch_cgroup(struct bch_cgroup *cg)
150 { 148 {
151 cg->cache_mode = -1; 149 cg->cache_mode = -1;
152 } 150 }
153 151
154 static struct cgroup_subsys_state *bcachecg_create(struct cgroup *cgroup) 152 static struct cgroup_subsys_state *bcachecg_create(struct cgroup *cgroup)
155 { 153 {
156 struct bch_cgroup *cg; 154 struct bch_cgroup *cg;
157 155
158 cg = kzalloc(sizeof(*cg), GFP_KERNEL); 156 cg = kzalloc(sizeof(*cg), GFP_KERNEL);
159 if (!cg) 157 if (!cg)
160 return ERR_PTR(-ENOMEM); 158 return ERR_PTR(-ENOMEM);
161 init_bch_cgroup(cg); 159 init_bch_cgroup(cg);
162 return &cg->css; 160 return &cg->css;
163 } 161 }
164 162
165 static void bcachecg_destroy(struct cgroup *cgroup) 163 static void bcachecg_destroy(struct cgroup *cgroup)
166 { 164 {
167 struct bch_cgroup *cg = cgroup_to_bcache(cgroup); 165 struct bch_cgroup *cg = cgroup_to_bcache(cgroup);
168 free_css_id(&bcache_subsys, &cg->css); 166 free_css_id(&bcache_subsys, &cg->css);
169 kfree(cg); 167 kfree(cg);
170 } 168 }
171 169
172 struct cgroup_subsys bcache_subsys = { 170 struct cgroup_subsys bcache_subsys = {
173 .create = bcachecg_create, 171 .create = bcachecg_create,
174 .destroy = bcachecg_destroy, 172 .destroy = bcachecg_destroy,
175 .subsys_id = bcache_subsys_id, 173 .subsys_id = bcache_subsys_id,
176 .name = "bcache", 174 .name = "bcache",
177 .module = THIS_MODULE, 175 .module = THIS_MODULE,
178 }; 176 };
179 EXPORT_SYMBOL_GPL(bcache_subsys); 177 EXPORT_SYMBOL_GPL(bcache_subsys);
180 #endif 178 #endif
181 179
182 static unsigned cache_mode(struct cached_dev *dc, struct bio *bio) 180 static unsigned cache_mode(struct cached_dev *dc, struct bio *bio)
183 { 181 {
184 #ifdef CONFIG_CGROUP_BCACHE 182 #ifdef CONFIG_CGROUP_BCACHE
185 int r = bch_bio_to_cgroup(bio)->cache_mode; 183 int r = bch_bio_to_cgroup(bio)->cache_mode;
186 if (r >= 0) 184 if (r >= 0)
187 return r; 185 return r;
188 #endif 186 #endif
189 return BDEV_CACHE_MODE(&dc->sb); 187 return BDEV_CACHE_MODE(&dc->sb);
190 } 188 }
191 189
192 static bool verify(struct cached_dev *dc, struct bio *bio) 190 static bool verify(struct cached_dev *dc, struct bio *bio)
193 { 191 {
194 #ifdef CONFIG_CGROUP_BCACHE 192 #ifdef CONFIG_CGROUP_BCACHE
195 if (bch_bio_to_cgroup(bio)->verify) 193 if (bch_bio_to_cgroup(bio)->verify)
196 return true; 194 return true;
197 #endif 195 #endif
198 return dc->verify; 196 return dc->verify;
199 } 197 }
200 198
201 static void bio_csum(struct bio *bio, struct bkey *k) 199 static void bio_csum(struct bio *bio, struct bkey *k)
202 { 200 {
203 struct bio_vec *bv; 201 struct bio_vec *bv;
204 uint64_t csum = 0; 202 uint64_t csum = 0;
205 int i; 203 int i;
206 204
207 bio_for_each_segment(bv, bio, i) { 205 bio_for_each_segment(bv, bio, i) {
208 void *d = kmap(bv->bv_page) + bv->bv_offset; 206 void *d = kmap(bv->bv_page) + bv->bv_offset;
209 csum = bch_crc64_update(csum, d, bv->bv_len); 207 csum = bch_crc64_update(csum, d, bv->bv_len);
210 kunmap(bv->bv_page); 208 kunmap(bv->bv_page);
211 } 209 }
212 210
213 k->ptr[KEY_PTRS(k)] = csum & (~0ULL >> 1); 211 k->ptr[KEY_PTRS(k)] = csum & (~0ULL >> 1);
214 } 212 }
215 213
216 /* Insert data into cache */ 214 /* Insert data into cache */
217 215
218 static void bio_invalidate(struct closure *cl) 216 static void bio_invalidate(struct closure *cl)
219 { 217 {
220 struct btree_op *op = container_of(cl, struct btree_op, cl); 218 struct btree_op *op = container_of(cl, struct btree_op, cl);
221 struct bio *bio = op->cache_bio; 219 struct bio *bio = op->cache_bio;
222 220
223 pr_debug("invalidating %i sectors from %llu", 221 pr_debug("invalidating %i sectors from %llu",
224 bio_sectors(bio), (uint64_t) bio->bi_sector); 222 bio_sectors(bio), (uint64_t) bio->bi_sector);
225 223
226 while (bio_sectors(bio)) { 224 while (bio_sectors(bio)) {
227 unsigned len = min(bio_sectors(bio), 1U << 14); 225 unsigned len = min(bio_sectors(bio), 1U << 14);
228 226
229 if (bch_keylist_realloc(&op->keys, 0, op->c)) 227 if (bch_keylist_realloc(&op->keys, 0, op->c))
230 goto out; 228 goto out;
231 229
232 bio->bi_sector += len; 230 bio->bi_sector += len;
233 bio->bi_size -= len << 9; 231 bio->bi_size -= len << 9;
234 232
235 bch_keylist_add(&op->keys, 233 bch_keylist_add(&op->keys,
236 &KEY(op->inode, bio->bi_sector, len)); 234 &KEY(op->inode, bio->bi_sector, len));
237 } 235 }
238 236
239 op->insert_data_done = true; 237 op->insert_data_done = true;
240 bio_put(bio); 238 bio_put(bio);
241 out: 239 out:
242 continue_at(cl, bch_journal, bcache_wq); 240 continue_at(cl, bch_journal, bcache_wq);
243 } 241 }
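/*
 * [Editor's note: illustration, not part of this patch.]  bio_invalidate()
 * above walks the bio in chunks of at most 1U << 14 sectors, advancing
 * bi_sector first and then emitting a pointerless KEY(inode, end, len) for
 * each chunk.  A minimal userspace sketch of just that chunking arithmetic
 * (the 40000-sector size is made up for the example):
 */
#include <stdio.h>

int main(void)
{
	unsigned remaining = 40000;		/* sectors left to invalidate */

	while (remaining) {
		unsigned len = remaining < (1U << 14) ? remaining : 1U << 14;

		/* bio_invalidate() would add KEY(inode, end_sector, len) here */
		printf("chunk of %u sectors\n", len);	/* 16384, 16384, 7232 */
		remaining -= len;
	}
	return 0;
}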
244 242
245 struct open_bucket { 243 struct open_bucket {
246 struct list_head list; 244 struct list_head list;
247 struct task_struct *last; 245 struct task_struct *last;
248 unsigned sectors_free; 246 unsigned sectors_free;
249 BKEY_PADDED(key); 247 BKEY_PADDED(key);
250 }; 248 };
251 249
252 void bch_open_buckets_free(struct cache_set *c) 250 void bch_open_buckets_free(struct cache_set *c)
253 { 251 {
254 struct open_bucket *b; 252 struct open_bucket *b;
255 253
256 while (!list_empty(&c->data_buckets)) { 254 while (!list_empty(&c->data_buckets)) {
257 b = list_first_entry(&c->data_buckets, 255 b = list_first_entry(&c->data_buckets,
258 struct open_bucket, list); 256 struct open_bucket, list);
259 list_del(&b->list); 257 list_del(&b->list);
260 kfree(b); 258 kfree(b);
261 } 259 }
262 } 260 }
263 261
264 int bch_open_buckets_alloc(struct cache_set *c) 262 int bch_open_buckets_alloc(struct cache_set *c)
265 { 263 {
266 int i; 264 int i;
267 265
268 spin_lock_init(&c->data_bucket_lock); 266 spin_lock_init(&c->data_bucket_lock);
269 267
270 for (i = 0; i < 6; i++) { 268 for (i = 0; i < 6; i++) {
271 struct open_bucket *b = kzalloc(sizeof(*b), GFP_KERNEL); 269 struct open_bucket *b = kzalloc(sizeof(*b), GFP_KERNEL);
272 if (!b) 270 if (!b)
273 return -ENOMEM; 271 return -ENOMEM;
274 272
275 list_add(&b->list, &c->data_buckets); 273 list_add(&b->list, &c->data_buckets);
276 } 274 }
277 275
278 return 0; 276 return 0;
279 } 277 }
280 278
281 /* 279 /*
282 * We keep multiple buckets open for writes, and try to segregate different 280 * We keep multiple buckets open for writes, and try to segregate different
283 * write streams for better cache utilization: first we look for a bucket where 281 * write streams for better cache utilization: first we look for a bucket where
284 * the last write to it was sequential with the current write, and failing that 282 * the last write to it was sequential with the current write, and failing that
285 * we look for a bucket that was last used by the same task. 283 * we look for a bucket that was last used by the same task.
286 * 284 *
 287 * The idea is that if you've got multiple tasks pulling data into the cache at the 285 * The idea is that if you've got multiple tasks pulling data into the cache at the
288 * same time, you'll get better cache utilization if you try to segregate their 286 * same time, you'll get better cache utilization if you try to segregate their
289 * data and preserve locality. 287 * data and preserve locality.
290 * 288 *
 291 * For example, say you're starting Firefox at the same time you're copying a 289 * For example, say you're starting Firefox at the same time you're copying a
292 * bunch of files. Firefox will likely end up being fairly hot and stay in the 290 * bunch of files. Firefox will likely end up being fairly hot and stay in the
293 * cache awhile, but the data you copied might not be; if you wrote all that 291 * cache awhile, but the data you copied might not be; if you wrote all that
294 * data to the same buckets it'd get invalidated at the same time. 292 * data to the same buckets it'd get invalidated at the same time.
295 * 293 *
296 * Both of those tasks will be doing fairly random IO so we can't rely on 294 * Both of those tasks will be doing fairly random IO so we can't rely on
297 * detecting sequential IO to segregate their data, but going off of the task 295 * detecting sequential IO to segregate their data, but going off of the task
298 * should be a sane heuristic. 296 * should be a sane heuristic.
299 */ 297 */
300 static struct open_bucket *pick_data_bucket(struct cache_set *c, 298 static struct open_bucket *pick_data_bucket(struct cache_set *c,
301 const struct bkey *search, 299 const struct bkey *search,
302 struct task_struct *task, 300 struct task_struct *task,
303 struct bkey *alloc) 301 struct bkey *alloc)
304 { 302 {
305 struct open_bucket *ret, *ret_task = NULL; 303 struct open_bucket *ret, *ret_task = NULL;
306 304
307 list_for_each_entry_reverse(ret, &c->data_buckets, list) 305 list_for_each_entry_reverse(ret, &c->data_buckets, list)
308 if (!bkey_cmp(&ret->key, search)) 306 if (!bkey_cmp(&ret->key, search))
309 goto found; 307 goto found;
310 else if (ret->last == task) 308 else if (ret->last == task)
311 ret_task = ret; 309 ret_task = ret;
312 310
313 ret = ret_task ?: list_first_entry(&c->data_buckets, 311 ret = ret_task ?: list_first_entry(&c->data_buckets,
314 struct open_bucket, list); 312 struct open_bucket, list);
315 found: 313 found:
316 if (!ret->sectors_free && KEY_PTRS(alloc)) { 314 if (!ret->sectors_free && KEY_PTRS(alloc)) {
317 ret->sectors_free = c->sb.bucket_size; 315 ret->sectors_free = c->sb.bucket_size;
318 bkey_copy(&ret->key, alloc); 316 bkey_copy(&ret->key, alloc);
319 bkey_init(alloc); 317 bkey_init(alloc);
320 } 318 }
321 319
322 if (!ret->sectors_free) 320 if (!ret->sectors_free)
323 ret = NULL; 321 ret = NULL;
324 322
325 return ret; 323 return ret;
326 } 324 }
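/*
 * [Editor's note: illustration, not part of this patch.]  A self-contained
 * userspace model of the selection order pick_data_bucket() uses above:
 * prefer the bucket whose last write ended where this one starts, then a
 * bucket last written by the same task, then the least recently used one.
 * All names here (struct model_bucket, pick(), etc.) are invented for the
 * sketch and do not exist in bcache.
 */
#include <stdio.h>

#define NBUCKETS 3

struct model_bucket {
	unsigned long long next_offset;	/* where the last write to it ended */
	int last_task;			/* id of the task that wrote it last */
	unsigned long lru;		/* larger = more recently used */
};

static struct model_bucket buckets[NBUCKETS];
static unsigned long tick;

static struct model_bucket *pick(unsigned long long offset, int task)
{
	struct model_bucket *b, *task_match = NULL, *lru = &buckets[0];

	for (b = buckets; b < buckets + NBUCKETS; b++) {
		if (b->next_offset == offset)
			goto found;		/* sequential continuation */
		if (b->last_task == task)
			task_match = b;
		if (b->lru < lru->lru)
			lru = b;
	}
	b = task_match ? task_match : lru;
found:
	b->last_task = task;
	b->lru = ++tick;
	return b;
}

int main(void)
{
	unsigned long long off1 = 0, off2 = 1000000;
	int i;

	/* Task 1 writes sequentially, task 2 scatters its writes; the model
	 * keeps each task's data in its own bucket, as the comment above
	 * describes. */
	for (i = 0; i < 4; i++) {
		struct model_bucket *b;

		b = pick(off1, 1);
		off1 += 8;
		b->next_offset = off1;
		printf("task 1 -> bucket %ld\n", (long)(b - buckets));

		b = pick(off2 + i * 64, 2);
		b->next_offset = off2 + i * 64 + 8;
		printf("task 2 -> bucket %ld\n", (long)(b - buckets));
	}
	return 0;
}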
327 325
328 /* 326 /*
 329 * Allocates some space in the cache to write to, sets k to point to the newly 327 * Allocates some space in the cache to write to, sets k to point to the newly
330 * allocated space, and updates KEY_SIZE(k) and KEY_OFFSET(k) (to point to the 328 * allocated space, and updates KEY_SIZE(k) and KEY_OFFSET(k) (to point to the
331 * end of the newly allocated space). 329 * end of the newly allocated space).
332 * 330 *
 333 * May allocate fewer sectors than @sectors; KEY_SIZE(k) indicates how many 331 * May allocate fewer sectors than @sectors; KEY_SIZE(k) indicates how many
334 * sectors were actually allocated. 332 * sectors were actually allocated.
335 * 333 *
336 * If s->writeback is true, will not fail. 334 * If s->writeback is true, will not fail.
337 */ 335 */
338 static bool bch_alloc_sectors(struct bkey *k, unsigned sectors, 336 static bool bch_alloc_sectors(struct bkey *k, unsigned sectors,
339 struct search *s) 337 struct search *s)
340 { 338 {
341 struct cache_set *c = s->op.c; 339 struct cache_set *c = s->op.c;
342 struct open_bucket *b; 340 struct open_bucket *b;
343 BKEY_PADDED(key) alloc; 341 BKEY_PADDED(key) alloc;
344 struct closure cl, *w = NULL; 342 struct closure cl, *w = NULL;
345 unsigned i; 343 unsigned i;
346 344
347 if (s->writeback) { 345 if (s->writeback) {
348 closure_init_stack(&cl); 346 closure_init_stack(&cl);
349 w = &cl; 347 w = &cl;
350 } 348 }
351 349
352 /* 350 /*
353 * We might have to allocate a new bucket, which we can't do with a 351 * We might have to allocate a new bucket, which we can't do with a
354 * spinlock held. So if we have to allocate, we drop the lock, allocate 352 * spinlock held. So if we have to allocate, we drop the lock, allocate
355 * and then retry. KEY_PTRS() indicates whether alloc points to 353 * and then retry. KEY_PTRS() indicates whether alloc points to
356 * allocated bucket(s). 354 * allocated bucket(s).
357 */ 355 */
358 356
359 bkey_init(&alloc.key); 357 bkey_init(&alloc.key);
360 spin_lock(&c->data_bucket_lock); 358 spin_lock(&c->data_bucket_lock);
361 359
362 while (!(b = pick_data_bucket(c, k, s->task, &alloc.key))) { 360 while (!(b = pick_data_bucket(c, k, s->task, &alloc.key))) {
363 unsigned watermark = s->op.write_prio 361 unsigned watermark = s->op.write_prio
364 ? WATERMARK_MOVINGGC 362 ? WATERMARK_MOVINGGC
365 : WATERMARK_NONE; 363 : WATERMARK_NONE;
366 364
367 spin_unlock(&c->data_bucket_lock); 365 spin_unlock(&c->data_bucket_lock);
368 366
369 if (bch_bucket_alloc_set(c, watermark, &alloc.key, 1, w)) 367 if (bch_bucket_alloc_set(c, watermark, &alloc.key, 1, w))
370 return false; 368 return false;
371 369
372 spin_lock(&c->data_bucket_lock); 370 spin_lock(&c->data_bucket_lock);
373 } 371 }
374 372
375 /* 373 /*
376 * If we had to allocate, we might race and not need to allocate the 374 * If we had to allocate, we might race and not need to allocate the
 377 * second time we call pick_data_bucket(). If we allocated a bucket but 375 * second time we call pick_data_bucket(). If we allocated a bucket but
378 * didn't use it, drop the refcount bch_bucket_alloc_set() took: 376 * didn't use it, drop the refcount bch_bucket_alloc_set() took:
379 */ 377 */
380 if (KEY_PTRS(&alloc.key)) 378 if (KEY_PTRS(&alloc.key))
381 __bkey_put(c, &alloc.key); 379 __bkey_put(c, &alloc.key);
382 380
383 for (i = 0; i < KEY_PTRS(&b->key); i++) 381 for (i = 0; i < KEY_PTRS(&b->key); i++)
384 EBUG_ON(ptr_stale(c, &b->key, i)); 382 EBUG_ON(ptr_stale(c, &b->key, i));
385 383
386 /* Set up the pointer to the space we're allocating: */ 384 /* Set up the pointer to the space we're allocating: */
387 385
388 for (i = 0; i < KEY_PTRS(&b->key); i++) 386 for (i = 0; i < KEY_PTRS(&b->key); i++)
389 k->ptr[i] = b->key.ptr[i]; 387 k->ptr[i] = b->key.ptr[i];
390 388
391 sectors = min(sectors, b->sectors_free); 389 sectors = min(sectors, b->sectors_free);
392 390
393 SET_KEY_OFFSET(k, KEY_OFFSET(k) + sectors); 391 SET_KEY_OFFSET(k, KEY_OFFSET(k) + sectors);
394 SET_KEY_SIZE(k, sectors); 392 SET_KEY_SIZE(k, sectors);
395 SET_KEY_PTRS(k, KEY_PTRS(&b->key)); 393 SET_KEY_PTRS(k, KEY_PTRS(&b->key));
396 394
397 /* 395 /*
398 * Move b to the end of the lru, and keep track of what this bucket was 396 * Move b to the end of the lru, and keep track of what this bucket was
399 * last used for: 397 * last used for:
400 */ 398 */
401 list_move_tail(&b->list, &c->data_buckets); 399 list_move_tail(&b->list, &c->data_buckets);
402 bkey_copy_key(&b->key, k); 400 bkey_copy_key(&b->key, k);
403 b->last = s->task; 401 b->last = s->task;
404 402
405 b->sectors_free -= sectors; 403 b->sectors_free -= sectors;
406 404
407 for (i = 0; i < KEY_PTRS(&b->key); i++) { 405 for (i = 0; i < KEY_PTRS(&b->key); i++) {
408 SET_PTR_OFFSET(&b->key, i, PTR_OFFSET(&b->key, i) + sectors); 406 SET_PTR_OFFSET(&b->key, i, PTR_OFFSET(&b->key, i) + sectors);
409 407
410 atomic_long_add(sectors, 408 atomic_long_add(sectors,
411 &PTR_CACHE(c, &b->key, i)->sectors_written); 409 &PTR_CACHE(c, &b->key, i)->sectors_written);
412 } 410 }
413 411
414 if (b->sectors_free < c->sb.block_size) 412 if (b->sectors_free < c->sb.block_size)
415 b->sectors_free = 0; 413 b->sectors_free = 0;
416 414
417 /* 415 /*
418 * k takes refcounts on the buckets it points to until it's inserted 416 * k takes refcounts on the buckets it points to until it's inserted
419 * into the btree, but if we're done with this bucket we just transfer 417 * into the btree, but if we're done with this bucket we just transfer
 420 * the open bucket's refcount. 418 * the open bucket's refcount.
421 */ 419 */
422 if (b->sectors_free) 420 if (b->sectors_free)
423 for (i = 0; i < KEY_PTRS(&b->key); i++) 421 for (i = 0; i < KEY_PTRS(&b->key); i++)
424 atomic_inc(&PTR_BUCKET(c, &b->key, i)->pin); 422 atomic_inc(&PTR_BUCKET(c, &b->key, i)->pin);
425 423
426 spin_unlock(&c->data_bucket_lock); 424 spin_unlock(&c->data_bucket_lock);
427 return true; 425 return true;
428 } 426 }
429 427
430 static void bch_insert_data_error(struct closure *cl) 428 static void bch_insert_data_error(struct closure *cl)
431 { 429 {
432 struct btree_op *op = container_of(cl, struct btree_op, cl); 430 struct btree_op *op = container_of(cl, struct btree_op, cl);
433 431
434 /* 432 /*
435 * Our data write just errored, which means we've got a bunch of keys to 433 * Our data write just errored, which means we've got a bunch of keys to
 436 * insert that point to data that wasn't successfully written. 434 * insert that point to data that wasn't successfully written.
437 * 435 *
438 * We don't have to insert those keys but we still have to invalidate 436 * We don't have to insert those keys but we still have to invalidate
439 * that region of the cache - so, if we just strip off all the pointers 437 * that region of the cache - so, if we just strip off all the pointers
440 * from the keys we'll accomplish just that. 438 * from the keys we'll accomplish just that.
441 */ 439 */
442 440
443 struct bkey *src = op->keys.bottom, *dst = op->keys.bottom; 441 struct bkey *src = op->keys.bottom, *dst = op->keys.bottom;
444 442
445 while (src != op->keys.top) { 443 while (src != op->keys.top) {
446 struct bkey *n = bkey_next(src); 444 struct bkey *n = bkey_next(src);
447 445
448 SET_KEY_PTRS(src, 0); 446 SET_KEY_PTRS(src, 0);
449 bkey_copy(dst, src); 447 bkey_copy(dst, src);
450 448
451 dst = bkey_next(dst); 449 dst = bkey_next(dst);
452 src = n; 450 src = n;
453 } 451 }
454 452
455 op->keys.top = dst; 453 op->keys.top = dst;
456 454
457 bch_journal(cl); 455 bch_journal(cl);
458 } 456 }
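/*
 * [Editor's note, not part of this patch.]  The compaction above works
 * because n = bkey_next(src) is taken while src still carries its pointers,
 * so src keeps advancing by each key's original size, while after
 * SET_KEY_PTRS(src, 0) the copy written at dst occupies only the key header.
 * dst therefore falls behind src, the stripped keys end up densely packed,
 * and op->keys.top is pulled back to the new end.
 */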
459 457
460 static void bch_insert_data_endio(struct bio *bio, int error) 458 static void bch_insert_data_endio(struct bio *bio, int error)
461 { 459 {
462 struct closure *cl = bio->bi_private; 460 struct closure *cl = bio->bi_private;
463 struct btree_op *op = container_of(cl, struct btree_op, cl); 461 struct btree_op *op = container_of(cl, struct btree_op, cl);
464 struct search *s = container_of(op, struct search, op); 462 struct search *s = container_of(op, struct search, op);
465 463
466 if (error) { 464 if (error) {
467 /* TODO: We could try to recover from this. */ 465 /* TODO: We could try to recover from this. */
468 if (s->writeback) 466 if (s->writeback)
469 s->error = error; 467 s->error = error;
470 else if (s->write) 468 else if (s->write)
471 set_closure_fn(cl, bch_insert_data_error, bcache_wq); 469 set_closure_fn(cl, bch_insert_data_error, bcache_wq);
472 else 470 else
473 set_closure_fn(cl, NULL, NULL); 471 set_closure_fn(cl, NULL, NULL);
474 } 472 }
475 473
476 bch_bbio_endio(op->c, bio, error, "writing data to cache"); 474 bch_bbio_endio(op->c, bio, error, "writing data to cache");
477 } 475 }
478 476
479 static void bch_insert_data_loop(struct closure *cl) 477 static void bch_insert_data_loop(struct closure *cl)
480 { 478 {
481 struct btree_op *op = container_of(cl, struct btree_op, cl); 479 struct btree_op *op = container_of(cl, struct btree_op, cl);
482 struct search *s = container_of(op, struct search, op); 480 struct search *s = container_of(op, struct search, op);
483 struct bio *bio = op->cache_bio, *n; 481 struct bio *bio = op->cache_bio, *n;
484 482
485 if (op->skip) 483 if (op->skip)
486 return bio_invalidate(cl); 484 return bio_invalidate(cl);
487 485
488 if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0) { 486 if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0) {
489 set_gc_sectors(op->c); 487 set_gc_sectors(op->c);
490 bch_queue_gc(op->c); 488 bch_queue_gc(op->c);
491 } 489 }
492 490
493 do { 491 do {
494 unsigned i; 492 unsigned i;
495 struct bkey *k; 493 struct bkey *k;
496 struct bio_set *split = s->d 494 struct bio_set *split = s->d
497 ? s->d->bio_split : op->c->bio_split; 495 ? s->d->bio_split : op->c->bio_split;
498 496
499 /* 1 for the device pointer and 1 for the chksum */ 497 /* 1 for the device pointer and 1 for the chksum */
500 if (bch_keylist_realloc(&op->keys, 498 if (bch_keylist_realloc(&op->keys,
501 1 + (op->csum ? 1 : 0), 499 1 + (op->csum ? 1 : 0),
502 op->c)) 500 op->c))
503 continue_at(cl, bch_journal, bcache_wq); 501 continue_at(cl, bch_journal, bcache_wq);
504 502
505 k = op->keys.top; 503 k = op->keys.top;
506 bkey_init(k); 504 bkey_init(k);
507 SET_KEY_INODE(k, op->inode); 505 SET_KEY_INODE(k, op->inode);
508 SET_KEY_OFFSET(k, bio->bi_sector); 506 SET_KEY_OFFSET(k, bio->bi_sector);
509 507
510 if (!bch_alloc_sectors(k, bio_sectors(bio), s)) 508 if (!bch_alloc_sectors(k, bio_sectors(bio), s))
511 goto err; 509 goto err;
512 510
513 n = bch_bio_split(bio, KEY_SIZE(k), GFP_NOIO, split); 511 n = bch_bio_split(bio, KEY_SIZE(k), GFP_NOIO, split);
514 if (!n) { 512 if (!n) {
515 __bkey_put(op->c, k); 513 __bkey_put(op->c, k);
516 continue_at(cl, bch_insert_data_loop, bcache_wq); 514 continue_at(cl, bch_insert_data_loop, bcache_wq);
517 } 515 }
518 516
519 n->bi_end_io = bch_insert_data_endio; 517 n->bi_end_io = bch_insert_data_endio;
520 n->bi_private = cl; 518 n->bi_private = cl;
521 519
522 if (s->writeback) { 520 if (s->writeback) {
523 SET_KEY_DIRTY(k, true); 521 SET_KEY_DIRTY(k, true);
524 522
525 for (i = 0; i < KEY_PTRS(k); i++) 523 for (i = 0; i < KEY_PTRS(k); i++)
526 SET_GC_MARK(PTR_BUCKET(op->c, k, i), 524 SET_GC_MARK(PTR_BUCKET(op->c, k, i),
527 GC_MARK_DIRTY); 525 GC_MARK_DIRTY);
528 } 526 }
529 527
530 SET_KEY_CSUM(k, op->csum); 528 SET_KEY_CSUM(k, op->csum);
531 if (KEY_CSUM(k)) 529 if (KEY_CSUM(k))
532 bio_csum(n, k); 530 bio_csum(n, k);
533 531
534 trace_bcache_cache_insert(k); 532 trace_bcache_cache_insert(k);
535 bch_keylist_push(&op->keys); 533 bch_keylist_push(&op->keys);
536 534
537 n->bi_rw |= REQ_WRITE; 535 n->bi_rw |= REQ_WRITE;
538 bch_submit_bbio(n, op->c, k, 0); 536 bch_submit_bbio(n, op->c, k, 0);
539 } while (n != bio); 537 } while (n != bio);
540 538
541 op->insert_data_done = true; 539 op->insert_data_done = true;
542 continue_at(cl, bch_journal, bcache_wq); 540 continue_at(cl, bch_journal, bcache_wq);
543 err: 541 err:
544 /* bch_alloc_sectors() blocks if s->writeback = true */ 542 /* bch_alloc_sectors() blocks if s->writeback = true */
545 BUG_ON(s->writeback); 543 BUG_ON(s->writeback);
546 544
547 /* 545 /*
548 * But if it's not a writeback write we'd rather just bail out if 546 * But if it's not a writeback write we'd rather just bail out if
 549 * there aren't any buckets ready to write to - it might take a while and 547 * there aren't any buckets ready to write to - it might take a while and
550 * we might be starving btree writes for gc or something. 548 * we might be starving btree writes for gc or something.
551 */ 549 */
552 550
553 if (s->write) { 551 if (s->write) {
554 /* 552 /*
555 * Writethrough write: We can't complete the write until we've 553 * Writethrough write: We can't complete the write until we've
556 * updated the index. But we don't want to delay the write while 554 * updated the index. But we don't want to delay the write while
557 * we wait for buckets to be freed up, so just invalidate the 555 * we wait for buckets to be freed up, so just invalidate the
558 * rest of the write. 556 * rest of the write.
559 */ 557 */
560 op->skip = true; 558 op->skip = true;
561 return bio_invalidate(cl); 559 return bio_invalidate(cl);
562 } else { 560 } else {
563 /* 561 /*
564 * From a cache miss, we can just insert the keys for the data 562 * From a cache miss, we can just insert the keys for the data
565 * we have written or bail out if we didn't do anything. 563 * we have written or bail out if we didn't do anything.
566 */ 564 */
567 op->insert_data_done = true; 565 op->insert_data_done = true;
568 bio_put(bio); 566 bio_put(bio);
569 567
570 if (!bch_keylist_empty(&op->keys)) 568 if (!bch_keylist_empty(&op->keys))
571 continue_at(cl, bch_journal, bcache_wq); 569 continue_at(cl, bch_journal, bcache_wq);
572 else 570 else
573 closure_return(cl); 571 closure_return(cl);
574 } 572 }
575 } 573 }
576 574
577 /** 575 /**
578 * bch_insert_data - stick some data in the cache 576 * bch_insert_data - stick some data in the cache
579 * 577 *
580 * This is the starting point for any data to end up in a cache device; it could 578 * This is the starting point for any data to end up in a cache device; it could
581 * be from a normal write, or a writeback write, or a write to a flash only 579 * be from a normal write, or a writeback write, or a write to a flash only
582 * volume - it's also used by the moving garbage collector to compact data in 580 * volume - it's also used by the moving garbage collector to compact data in
583 * mostly empty buckets. 581 * mostly empty buckets.
584 * 582 *
585 * It first writes the data to the cache, creating a list of keys to be inserted 583 * It first writes the data to the cache, creating a list of keys to be inserted
586 * (if the data had to be fragmented there will be multiple keys); after the 584 * (if the data had to be fragmented there will be multiple keys); after the
587 * data is written it calls bch_journal, and after the keys have been added to 585 * data is written it calls bch_journal, and after the keys have been added to
588 * the next journal write they're inserted into the btree. 586 * the next journal write they're inserted into the btree.
589 * 587 *
590 * It inserts the data in op->cache_bio; bi_sector is used for the key offset, 588 * It inserts the data in op->cache_bio; bi_sector is used for the key offset,
591 * and op->inode is used for the key inode. 589 * and op->inode is used for the key inode.
592 * 590 *
593 * If op->skip is true, instead of inserting the data it invalidates the region 591 * If op->skip is true, instead of inserting the data it invalidates the region
594 * of the cache represented by op->cache_bio and op->inode. 592 * of the cache represented by op->cache_bio and op->inode.
595 */ 593 */
596 void bch_insert_data(struct closure *cl) 594 void bch_insert_data(struct closure *cl)
597 { 595 {
598 struct btree_op *op = container_of(cl, struct btree_op, cl); 596 struct btree_op *op = container_of(cl, struct btree_op, cl);
599 597
600 bch_keylist_init(&op->keys); 598 bch_keylist_init(&op->keys);
601 bio_get(op->cache_bio); 599 bio_get(op->cache_bio);
602 bch_insert_data_loop(cl); 600 bch_insert_data_loop(cl);
603 } 601 }
604 602
605 void bch_btree_insert_async(struct closure *cl) 603 void bch_btree_insert_async(struct closure *cl)
606 { 604 {
607 struct btree_op *op = container_of(cl, struct btree_op, cl); 605 struct btree_op *op = container_of(cl, struct btree_op, cl);
608 struct search *s = container_of(op, struct search, op); 606 struct search *s = container_of(op, struct search, op);
609 607
610 if (bch_btree_insert(op, op->c)) { 608 if (bch_btree_insert(op, op->c)) {
611 s->error = -ENOMEM; 609 s->error = -ENOMEM;
612 op->insert_data_done = true; 610 op->insert_data_done = true;
613 } 611 }
614 612
615 if (op->insert_data_done) { 613 if (op->insert_data_done) {
616 bch_keylist_free(&op->keys); 614 bch_keylist_free(&op->keys);
617 closure_return(cl); 615 closure_return(cl);
618 } else 616 } else
619 continue_at(cl, bch_insert_data_loop, bcache_wq); 617 continue_at(cl, bch_insert_data_loop, bcache_wq);
620 } 618 }
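/*
 * [Editor's summary, not part of this patch.]  Putting the closures in this
 * file together, a data insert travels roughly as:
 *
 *   bch_insert_data()
 *     -> bch_insert_data_loop()    allocate cache space, split the bio,
 *                                  submit the data writes, build op->keys
 *       -> bch_journal()           add the completed keys to the journal
 *         -> bch_btree_insert_async()
 *                                  insert the journalled keys into the btree,
 *                                  then either finish (insert_data_done) or
 *                                  continue at bch_insert_data_loop() for the
 *                                  rest of the bio
 */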
621 619
622 /* Common code for the make_request functions */ 620 /* Common code for the make_request functions */
623 621
624 static void request_endio(struct bio *bio, int error) 622 static void request_endio(struct bio *bio, int error)
625 { 623 {
626 struct closure *cl = bio->bi_private; 624 struct closure *cl = bio->bi_private;
627 625
628 if (error) { 626 if (error) {
629 struct search *s = container_of(cl, struct search, cl); 627 struct search *s = container_of(cl, struct search, cl);
630 s->error = error; 628 s->error = error;
631 /* Only cache read errors are recoverable */ 629 /* Only cache read errors are recoverable */
632 s->recoverable = false; 630 s->recoverable = false;
633 } 631 }
634 632
635 bio_put(bio); 633 bio_put(bio);
636 closure_put(cl); 634 closure_put(cl);
637 } 635 }
638 636
639 void bch_cache_read_endio(struct bio *bio, int error) 637 void bch_cache_read_endio(struct bio *bio, int error)
640 { 638 {
641 struct bbio *b = container_of(bio, struct bbio, bio); 639 struct bbio *b = container_of(bio, struct bbio, bio);
642 struct closure *cl = bio->bi_private; 640 struct closure *cl = bio->bi_private;
643 struct search *s = container_of(cl, struct search, cl); 641 struct search *s = container_of(cl, struct search, cl);
644 642
645 /* 643 /*
646 * If the bucket was reused while our bio was in flight, we might have 644 * If the bucket was reused while our bio was in flight, we might have
647 * read the wrong data. Set s->error but not error so it doesn't get 645 * read the wrong data. Set s->error but not error so it doesn't get
648 * counted against the cache device, but we'll still reread the data 646 * counted against the cache device, but we'll still reread the data
649 * from the backing device. 647 * from the backing device.
650 */ 648 */
651 649
652 if (error) 650 if (error)
653 s->error = error; 651 s->error = error;
654 else if (ptr_stale(s->op.c, &b->key, 0)) { 652 else if (ptr_stale(s->op.c, &b->key, 0)) {
655 atomic_long_inc(&s->op.c->cache_read_races); 653 atomic_long_inc(&s->op.c->cache_read_races);
656 s->error = -EINTR; 654 s->error = -EINTR;
657 } 655 }
658 656
659 bch_bbio_endio(s->op.c, bio, error, "reading from cache"); 657 bch_bbio_endio(s->op.c, bio, error, "reading from cache");
660 } 658 }
661 659
662 static void bio_complete(struct search *s) 660 static void bio_complete(struct search *s)
663 { 661 {
664 if (s->orig_bio) { 662 if (s->orig_bio) {
665 int cpu, rw = bio_data_dir(s->orig_bio); 663 int cpu, rw = bio_data_dir(s->orig_bio);
666 unsigned long duration = jiffies - s->start_time; 664 unsigned long duration = jiffies - s->start_time;
667 665
668 cpu = part_stat_lock(); 666 cpu = part_stat_lock();
669 part_round_stats(cpu, &s->d->disk->part0); 667 part_round_stats(cpu, &s->d->disk->part0);
670 part_stat_add(cpu, &s->d->disk->part0, ticks[rw], duration); 668 part_stat_add(cpu, &s->d->disk->part0, ticks[rw], duration);
671 part_stat_unlock(); 669 part_stat_unlock();
672 670
673 trace_bcache_request_end(s, s->orig_bio); 671 trace_bcache_request_end(s, s->orig_bio);
674 bio_endio(s->orig_bio, s->error); 672 bio_endio(s->orig_bio, s->error);
675 s->orig_bio = NULL; 673 s->orig_bio = NULL;
676 } 674 }
677 } 675 }
678 676
679 static void do_bio_hook(struct search *s) 677 static void do_bio_hook(struct search *s)
680 { 678 {
681 struct bio *bio = &s->bio.bio; 679 struct bio *bio = &s->bio.bio;
682 memcpy(bio, s->orig_bio, sizeof(struct bio)); 680 memcpy(bio, s->orig_bio, sizeof(struct bio));
683 681
684 bio->bi_end_io = request_endio; 682 bio->bi_end_io = request_endio;
685 bio->bi_private = &s->cl; 683 bio->bi_private = &s->cl;
686 atomic_set(&bio->bi_cnt, 3); 684 atomic_set(&bio->bi_cnt, 3);
687 } 685 }
688 686
689 static void search_free(struct closure *cl) 687 static void search_free(struct closure *cl)
690 { 688 {
691 struct search *s = container_of(cl, struct search, cl); 689 struct search *s = container_of(cl, struct search, cl);
692 bio_complete(s); 690 bio_complete(s);
693 691
694 if (s->op.cache_bio) 692 if (s->op.cache_bio)
695 bio_put(s->op.cache_bio); 693 bio_put(s->op.cache_bio);
696 694
697 if (s->unaligned_bvec) 695 if (s->unaligned_bvec)
698 mempool_free(s->bio.bio.bi_io_vec, s->d->unaligned_bvec); 696 mempool_free(s->bio.bio.bi_io_vec, s->d->unaligned_bvec);
699 697
700 closure_debug_destroy(cl); 698 closure_debug_destroy(cl);
701 mempool_free(s, s->d->c->search); 699 mempool_free(s, s->d->c->search);
702 } 700 }
703 701
704 static struct search *search_alloc(struct bio *bio, struct bcache_device *d) 702 static struct search *search_alloc(struct bio *bio, struct bcache_device *d)
705 { 703 {
706 struct bio_vec *bv; 704 struct bio_vec *bv;
707 struct search *s = mempool_alloc(d->c->search, GFP_NOIO); 705 struct search *s = mempool_alloc(d->c->search, GFP_NOIO);
708 memset(s, 0, offsetof(struct search, op.keys)); 706 memset(s, 0, offsetof(struct search, op.keys));
709 707
710 __closure_init(&s->cl, NULL); 708 __closure_init(&s->cl, NULL);
711 709
712 s->op.inode = d->id; 710 s->op.inode = d->id;
713 s->op.c = d->c; 711 s->op.c = d->c;
714 s->d = d; 712 s->d = d;
715 s->op.lock = -1; 713 s->op.lock = -1;
716 s->task = current; 714 s->task = current;
717 s->orig_bio = bio; 715 s->orig_bio = bio;
718 s->write = (bio->bi_rw & REQ_WRITE) != 0; 716 s->write = (bio->bi_rw & REQ_WRITE) != 0;
719 s->op.flush_journal = (bio->bi_rw & REQ_FLUSH) != 0; 717 s->op.flush_journal = (bio->bi_rw & REQ_FLUSH) != 0;
720 s->op.skip = (bio->bi_rw & REQ_DISCARD) != 0; 718 s->op.skip = (bio->bi_rw & REQ_DISCARD) != 0;
721 s->recoverable = 1; 719 s->recoverable = 1;
722 s->start_time = jiffies; 720 s->start_time = jiffies;
723 do_bio_hook(s); 721 do_bio_hook(s);
724 722
725 if (bio->bi_size != bio_segments(bio) * PAGE_SIZE) { 723 if (bio->bi_size != bio_segments(bio) * PAGE_SIZE) {
726 bv = mempool_alloc(d->unaligned_bvec, GFP_NOIO); 724 bv = mempool_alloc(d->unaligned_bvec, GFP_NOIO);
727 memcpy(bv, bio_iovec(bio), 725 memcpy(bv, bio_iovec(bio),
728 sizeof(struct bio_vec) * bio_segments(bio)); 726 sizeof(struct bio_vec) * bio_segments(bio));
729 727
730 s->bio.bio.bi_io_vec = bv; 728 s->bio.bio.bi_io_vec = bv;
731 s->unaligned_bvec = 1; 729 s->unaligned_bvec = 1;
732 } 730 }
733 731
734 return s; 732 return s;
735 } 733 }
736 734
737 static void btree_read_async(struct closure *cl) 735 static void btree_read_async(struct closure *cl)
738 { 736 {
739 struct btree_op *op = container_of(cl, struct btree_op, cl); 737 struct btree_op *op = container_of(cl, struct btree_op, cl);
740 738
741 int ret = btree_root(search_recurse, op->c, op); 739 int ret = btree_root(search_recurse, op->c, op);
742 740
743 if (ret == -EAGAIN) 741 if (ret == -EAGAIN)
744 continue_at(cl, btree_read_async, bcache_wq); 742 continue_at(cl, btree_read_async, bcache_wq);
745 743
746 closure_return(cl); 744 closure_return(cl);
747 } 745 }
748 746
749 /* Cached devices */ 747 /* Cached devices */
750 748
751 static void cached_dev_bio_complete(struct closure *cl) 749 static void cached_dev_bio_complete(struct closure *cl)
752 { 750 {
753 struct search *s = container_of(cl, struct search, cl); 751 struct search *s = container_of(cl, struct search, cl);
754 struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); 752 struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
755 753
756 search_free(cl); 754 search_free(cl);
757 cached_dev_put(dc); 755 cached_dev_put(dc);
758 } 756 }
759 757
760 /* Process reads */ 758 /* Process reads */
761 759
762 static void cached_dev_read_complete(struct closure *cl) 760 static void cached_dev_read_complete(struct closure *cl)
763 { 761 {
764 struct search *s = container_of(cl, struct search, cl); 762 struct search *s = container_of(cl, struct search, cl);
765 763
766 if (s->op.insert_collision) 764 if (s->op.insert_collision)
767 bch_mark_cache_miss_collision(s); 765 bch_mark_cache_miss_collision(s);
768 766
769 if (s->op.cache_bio) { 767 if (s->op.cache_bio) {
770 int i; 768 int i;
771 struct bio_vec *bv; 769 struct bio_vec *bv;
772 770
773 __bio_for_each_segment(bv, s->op.cache_bio, i, 0) 771 __bio_for_each_segment(bv, s->op.cache_bio, i, 0)
774 __free_page(bv->bv_page); 772 __free_page(bv->bv_page);
775 } 773 }
776 774
777 cached_dev_bio_complete(cl); 775 cached_dev_bio_complete(cl);
778 } 776 }
779 777
780 static void request_read_error(struct closure *cl) 778 static void request_read_error(struct closure *cl)
781 { 779 {
782 struct search *s = container_of(cl, struct search, cl); 780 struct search *s = container_of(cl, struct search, cl);
783 struct bio_vec *bv; 781 struct bio_vec *bv;
784 int i; 782 int i;
785 783
786 if (s->recoverable) { 784 if (s->recoverable) {
787 /* Retry from the backing device: */ 785 /* Retry from the backing device: */
788 trace_bcache_read_retry(s->orig_bio); 786 trace_bcache_read_retry(s->orig_bio);
789 787
790 s->error = 0; 788 s->error = 0;
791 bv = s->bio.bio.bi_io_vec; 789 bv = s->bio.bio.bi_io_vec;
792 do_bio_hook(s); 790 do_bio_hook(s);
793 s->bio.bio.bi_io_vec = bv; 791 s->bio.bio.bi_io_vec = bv;
794 792
795 if (!s->unaligned_bvec) 793 if (!s->unaligned_bvec)
796 bio_for_each_segment(bv, s->orig_bio, i) 794 bio_for_each_segment(bv, s->orig_bio, i)
797 bv->bv_offset = 0, bv->bv_len = PAGE_SIZE; 795 bv->bv_offset = 0, bv->bv_len = PAGE_SIZE;
798 else 796 else
799 memcpy(s->bio.bio.bi_io_vec, 797 memcpy(s->bio.bio.bi_io_vec,
800 bio_iovec(s->orig_bio), 798 bio_iovec(s->orig_bio),
801 sizeof(struct bio_vec) * 799 sizeof(struct bio_vec) *
802 bio_segments(s->orig_bio)); 800 bio_segments(s->orig_bio));
803 801
804 /* XXX: invalidate cache */ 802 /* XXX: invalidate cache */
805 803
806 closure_bio_submit(&s->bio.bio, &s->cl, s->d); 804 closure_bio_submit(&s->bio.bio, &s->cl, s->d);
807 } 805 }
808 806
809 continue_at(cl, cached_dev_read_complete, NULL); 807 continue_at(cl, cached_dev_read_complete, NULL);
810 } 808 }
811 809
812 static void request_read_done(struct closure *cl) 810 static void request_read_done(struct closure *cl)
813 { 811 {
814 struct search *s = container_of(cl, struct search, cl); 812 struct search *s = container_of(cl, struct search, cl);
815 struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); 813 struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
816 814
817 /* 815 /*
818 * s->cache_bio != NULL implies that we had a cache miss; cache_bio now 816 * s->cache_bio != NULL implies that we had a cache miss; cache_bio now
819 * contains data ready to be inserted into the cache. 817 * contains data ready to be inserted into the cache.
820 * 818 *
821 * First, we copy the data we just read from cache_bio's bounce buffers 819 * First, we copy the data we just read from cache_bio's bounce buffers
822 * to the buffers the original bio pointed to: 820 * to the buffers the original bio pointed to:
823 */ 821 */
824 822
825 if (s->op.cache_bio) { 823 if (s->op.cache_bio) {
826 struct bio_vec *src, *dst; 824 struct bio_vec *src, *dst;
827 unsigned src_offset, dst_offset, bytes; 825 unsigned src_offset, dst_offset, bytes;
828 void *dst_ptr; 826 void *dst_ptr;
829 827
830 bio_reset(s->op.cache_bio); 828 bio_reset(s->op.cache_bio);
831 s->op.cache_bio->bi_sector = s->cache_miss->bi_sector; 829 s->op.cache_bio->bi_sector = s->cache_miss->bi_sector;
832 s->op.cache_bio->bi_bdev = s->cache_miss->bi_bdev; 830 s->op.cache_bio->bi_bdev = s->cache_miss->bi_bdev;
833 s->op.cache_bio->bi_size = s->cache_bio_sectors << 9; 831 s->op.cache_bio->bi_size = s->cache_bio_sectors << 9;
834 bch_bio_map(s->op.cache_bio, NULL); 832 bch_bio_map(s->op.cache_bio, NULL);
835 833
836 src = bio_iovec(s->op.cache_bio); 834 src = bio_iovec(s->op.cache_bio);
837 dst = bio_iovec(s->cache_miss); 835 dst = bio_iovec(s->cache_miss);
838 src_offset = src->bv_offset; 836 src_offset = src->bv_offset;
839 dst_offset = dst->bv_offset; 837 dst_offset = dst->bv_offset;
840 dst_ptr = kmap(dst->bv_page); 838 dst_ptr = kmap(dst->bv_page);
841 839
842 while (1) { 840 while (1) {
843 if (dst_offset == dst->bv_offset + dst->bv_len) { 841 if (dst_offset == dst->bv_offset + dst->bv_len) {
844 kunmap(dst->bv_page); 842 kunmap(dst->bv_page);
845 dst++; 843 dst++;
846 if (dst == bio_iovec_idx(s->cache_miss, 844 if (dst == bio_iovec_idx(s->cache_miss,
847 s->cache_miss->bi_vcnt)) 845 s->cache_miss->bi_vcnt))
848 break; 846 break;
849 847
850 dst_offset = dst->bv_offset; 848 dst_offset = dst->bv_offset;
851 dst_ptr = kmap(dst->bv_page); 849 dst_ptr = kmap(dst->bv_page);
852 } 850 }
853 851
854 if (src_offset == src->bv_offset + src->bv_len) { 852 if (src_offset == src->bv_offset + src->bv_len) {
855 src++; 853 src++;
856 if (src == bio_iovec_idx(s->op.cache_bio, 854 if (src == bio_iovec_idx(s->op.cache_bio,
857 s->op.cache_bio->bi_vcnt)) 855 s->op.cache_bio->bi_vcnt))
858 BUG(); 856 BUG();
859 857
860 src_offset = src->bv_offset; 858 src_offset = src->bv_offset;
861 } 859 }
862 860
863 bytes = min(dst->bv_offset + dst->bv_len - dst_offset, 861 bytes = min(dst->bv_offset + dst->bv_len - dst_offset,
864 src->bv_offset + src->bv_len - src_offset); 862 src->bv_offset + src->bv_len - src_offset);
865 863
866 memcpy(dst_ptr + dst_offset, 864 memcpy(dst_ptr + dst_offset,
867 page_address(src->bv_page) + src_offset, 865 page_address(src->bv_page) + src_offset,
868 bytes); 866 bytes);
869 867
870 src_offset += bytes; 868 src_offset += bytes;
871 dst_offset += bytes; 869 dst_offset += bytes;
872 } 870 }
873 871
874 bio_put(s->cache_miss); 872 bio_put(s->cache_miss);
875 s->cache_miss = NULL; 873 s->cache_miss = NULL;
876 } 874 }
877 875
878 if (verify(dc, &s->bio.bio) && s->recoverable) 876 if (verify(dc, &s->bio.bio) && s->recoverable)
879 bch_data_verify(s); 877 bch_data_verify(s);
880 878
881 bio_complete(s); 879 bio_complete(s);
882 880
883 if (s->op.cache_bio && 881 if (s->op.cache_bio &&
884 !test_bit(CACHE_SET_STOPPING, &s->op.c->flags)) { 882 !test_bit(CACHE_SET_STOPPING, &s->op.c->flags)) {
885 s->op.type = BTREE_REPLACE; 883 s->op.type = BTREE_REPLACE;
886 closure_call(&s->op.cl, bch_insert_data, NULL, cl); 884 closure_call(&s->op.cl, bch_insert_data, NULL, cl);
887 } 885 }
888 886
889 continue_at(cl, cached_dev_read_complete, NULL); 887 continue_at(cl, cached_dev_read_complete, NULL);
890 } 888 }
891 889
892 static void request_read_done_bh(struct closure *cl) 890 static void request_read_done_bh(struct closure *cl)
893 { 891 {
894 struct search *s = container_of(cl, struct search, cl); 892 struct search *s = container_of(cl, struct search, cl);
895 struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); 893 struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
896 894
897 bch_mark_cache_accounting(s, !s->cache_miss, s->op.skip); 895 bch_mark_cache_accounting(s, !s->cache_miss, s->op.skip);
898 trace_bcache_read(s->orig_bio, !s->cache_miss, s->op.skip); 896 trace_bcache_read(s->orig_bio, !s->cache_miss, s->op.skip);
899 897
900 if (s->error) 898 if (s->error)
901 continue_at_nobarrier(cl, request_read_error, bcache_wq); 899 continue_at_nobarrier(cl, request_read_error, bcache_wq);
902 else if (s->op.cache_bio || verify(dc, &s->bio.bio)) 900 else if (s->op.cache_bio || verify(dc, &s->bio.bio))
903 continue_at_nobarrier(cl, request_read_done, bcache_wq); 901 continue_at_nobarrier(cl, request_read_done, bcache_wq);
904 else 902 else
905 continue_at_nobarrier(cl, cached_dev_read_complete, NULL); 903 continue_at_nobarrier(cl, cached_dev_read_complete, NULL);
906 } 904 }
907 905
908 static int cached_dev_cache_miss(struct btree *b, struct search *s, 906 static int cached_dev_cache_miss(struct btree *b, struct search *s,
909 struct bio *bio, unsigned sectors) 907 struct bio *bio, unsigned sectors)
910 { 908 {
911 int ret = 0; 909 int ret = 0;
912 unsigned reada; 910 unsigned reada;
913 struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); 911 struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
914 struct bio *miss; 912 struct bio *miss;
915 913
916 miss = bch_bio_split(bio, sectors, GFP_NOIO, s->d->bio_split); 914 miss = bch_bio_split(bio, sectors, GFP_NOIO, s->d->bio_split);
917 if (!miss) 915 if (!miss)
918 return -EAGAIN; 916 return -EAGAIN;
919 917
920 if (miss == bio) 918 if (miss == bio)
921 s->op.lookup_done = true; 919 s->op.lookup_done = true;
922 920
923 miss->bi_end_io = request_endio; 921 miss->bi_end_io = request_endio;
924 miss->bi_private = &s->cl; 922 miss->bi_private = &s->cl;
925 923
926 if (s->cache_miss || s->op.skip) 924 if (s->cache_miss || s->op.skip)
927 goto out_submit; 925 goto out_submit;
928 926
929 if (miss != bio || 927 if (miss != bio ||
930 (bio->bi_rw & REQ_RAHEAD) || 928 (bio->bi_rw & REQ_RAHEAD) ||
931 (bio->bi_rw & REQ_META) || 929 (bio->bi_rw & REQ_META) ||
932 s->op.c->gc_stats.in_use >= CUTOFF_CACHE_READA) 930 s->op.c->gc_stats.in_use >= CUTOFF_CACHE_READA)
933 reada = 0; 931 reada = 0;
934 else { 932 else {
935 reada = min(dc->readahead >> 9, 933 reada = min(dc->readahead >> 9,
936 sectors - bio_sectors(miss)); 934 sectors - bio_sectors(miss));
937 935
938 if (bio_end(miss) + reada > bdev_sectors(miss->bi_bdev)) 936 if (bio_end(miss) + reada > bdev_sectors(miss->bi_bdev))
939 reada = bdev_sectors(miss->bi_bdev) - bio_end(miss); 937 reada = bdev_sectors(miss->bi_bdev) - bio_end(miss);
940 } 938 }
941 939
942 s->cache_bio_sectors = bio_sectors(miss) + reada; 940 s->cache_bio_sectors = bio_sectors(miss) + reada;
943 s->op.cache_bio = bio_alloc_bioset(GFP_NOWAIT, 941 s->op.cache_bio = bio_alloc_bioset(GFP_NOWAIT,
944 DIV_ROUND_UP(s->cache_bio_sectors, PAGE_SECTORS), 942 DIV_ROUND_UP(s->cache_bio_sectors, PAGE_SECTORS),
945 dc->disk.bio_split); 943 dc->disk.bio_split);
946 944
947 if (!s->op.cache_bio) 945 if (!s->op.cache_bio)
948 goto out_submit; 946 goto out_submit;
949 947
950 s->op.cache_bio->bi_sector = miss->bi_sector; 948 s->op.cache_bio->bi_sector = miss->bi_sector;
951 s->op.cache_bio->bi_bdev = miss->bi_bdev; 949 s->op.cache_bio->bi_bdev = miss->bi_bdev;
952 s->op.cache_bio->bi_size = s->cache_bio_sectors << 9; 950 s->op.cache_bio->bi_size = s->cache_bio_sectors << 9;
953 951
954 s->op.cache_bio->bi_end_io = request_endio; 952 s->op.cache_bio->bi_end_io = request_endio;
955 s->op.cache_bio->bi_private = &s->cl; 953 s->op.cache_bio->bi_private = &s->cl;
956 954
957 /* btree_search_recurse()'s btree iterator is no good anymore */ 955 /* btree_search_recurse()'s btree iterator is no good anymore */
958 ret = -EINTR; 956 ret = -EINTR;
959 if (!bch_btree_insert_check_key(b, &s->op, s->op.cache_bio)) 957 if (!bch_btree_insert_check_key(b, &s->op, s->op.cache_bio))
960 goto out_put; 958 goto out_put;
961 959
962 bch_bio_map(s->op.cache_bio, NULL); 960 bch_bio_map(s->op.cache_bio, NULL);
963 if (bch_bio_alloc_pages(s->op.cache_bio, __GFP_NOWARN|GFP_NOIO)) 961 if (bch_bio_alloc_pages(s->op.cache_bio, __GFP_NOWARN|GFP_NOIO))
964 goto out_put; 962 goto out_put;
965 963
966 s->cache_miss = miss; 964 s->cache_miss = miss;
967 bio_get(s->op.cache_bio); 965 bio_get(s->op.cache_bio);
968 966
969 closure_bio_submit(s->op.cache_bio, &s->cl, s->d); 967 closure_bio_submit(s->op.cache_bio, &s->cl, s->d);
970 968
971 return ret; 969 return ret;
972 out_put: 970 out_put:
973 bio_put(s->op.cache_bio); 971 bio_put(s->op.cache_bio);
974 s->op.cache_bio = NULL; 972 s->op.cache_bio = NULL;
975 out_submit: 973 out_submit:
976 closure_bio_submit(miss, &s->cl, s->d); 974 closure_bio_submit(miss, &s->cl, s->d);
977 return ret; 975 return ret;
978 } 976 }
979 977
980 static void request_read(struct cached_dev *dc, struct search *s) 978 static void request_read(struct cached_dev *dc, struct search *s)
981 { 979 {
982 struct closure *cl = &s->cl; 980 struct closure *cl = &s->cl;
983 981
984 check_should_skip(dc, s); 982 check_should_skip(dc, s);
985 closure_call(&s->op.cl, btree_read_async, NULL, cl); 983 closure_call(&s->op.cl, btree_read_async, NULL, cl);
986 984
987 continue_at(cl, request_read_done_bh, NULL); 985 continue_at(cl, request_read_done_bh, NULL);
988 } 986 }
989 987
990 /* Process writes */ 988 /* Process writes */
991 989
992 static void cached_dev_write_complete(struct closure *cl) 990 static void cached_dev_write_complete(struct closure *cl)
993 { 991 {
994 struct search *s = container_of(cl, struct search, cl); 992 struct search *s = container_of(cl, struct search, cl);
995 struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); 993 struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
996 994
997 up_read_non_owner(&dc->writeback_lock); 995 up_read_non_owner(&dc->writeback_lock);
998 cached_dev_bio_complete(cl); 996 cached_dev_bio_complete(cl);
999 } 997 }
1000 998
1001 static bool should_writeback(struct cached_dev *dc, struct bio *bio)
1002 {
1003 unsigned threshold = (bio->bi_rw & REQ_SYNC)
1004 ? CUTOFF_WRITEBACK_SYNC
1005 : CUTOFF_WRITEBACK;
1006
1007 return !atomic_read(&dc->disk.detaching) &&
1008 cache_mode(dc, bio) == CACHE_MODE_WRITEBACK &&
1009 dc->disk.c->gc_stats.in_use < threshold;
1010 }
1011
1012 static void request_write(struct cached_dev *dc, struct search *s) 999 static void request_write(struct cached_dev *dc, struct search *s)
1013 { 1000 {
1014 struct closure *cl = &s->cl; 1001 struct closure *cl = &s->cl;
1015 struct bio *bio = &s->bio.bio; 1002 struct bio *bio = &s->bio.bio;
1016 struct bkey start, end; 1003 struct bkey start, end;
1017 start = KEY(dc->disk.id, bio->bi_sector, 0); 1004 start = KEY(dc->disk.id, bio->bi_sector, 0);
1018 end = KEY(dc->disk.id, bio_end(bio), 0); 1005 end = KEY(dc->disk.id, bio_end(bio), 0);
1019 1006
1020 bch_keybuf_check_overlapping(&s->op.c->moving_gc_keys, &start, &end); 1007 bch_keybuf_check_overlapping(&s->op.c->moving_gc_keys, &start, &end);
1021 1008
1022 check_should_skip(dc, s); 1009 check_should_skip(dc, s);
1023 down_read_non_owner(&dc->writeback_lock); 1010 down_read_non_owner(&dc->writeback_lock);
1024 1011
1025 if (bch_keybuf_check_overlapping(&dc->writeback_keys, &start, &end)) { 1012 if (bch_keybuf_check_overlapping(&dc->writeback_keys, &start, &end)) {
1026 s->op.skip = false; 1013 s->op.skip = false;
1027 s->writeback = true; 1014 s->writeback = true;
1028 } 1015 }
1029 1016
1030 if (bio->bi_rw & REQ_DISCARD) 1017 if (bio->bi_rw & REQ_DISCARD)
1031 goto skip; 1018 goto skip;
1032 1019
1020 if (should_writeback(dc, s->orig_bio,
1021 cache_mode(dc, bio),
1022 s->op.skip)) {
1023 s->op.skip = false;
1024 s->writeback = true;
1025 }
1026
1033 if (s->op.skip) 1027 if (s->op.skip)
1034 goto skip; 1028 goto skip;
1035
1036 if (should_writeback(dc, s->orig_bio))
1037 s->writeback = true;
1038 1029
1039 trace_bcache_write(s->orig_bio, s->writeback, s->op.skip); 1030 trace_bcache_write(s->orig_bio, s->writeback, s->op.skip);
1040 1031
1041 if (!s->writeback) { 1032 if (!s->writeback) {
1042 s->op.cache_bio = bio_clone_bioset(bio, GFP_NOIO, 1033 s->op.cache_bio = bio_clone_bioset(bio, GFP_NOIO,
1043 dc->disk.bio_split); 1034 dc->disk.bio_split);
1044 1035
1045 closure_bio_submit(bio, cl, s->d); 1036 closure_bio_submit(bio, cl, s->d);
1046 } else { 1037 } else {
1047 s->op.cache_bio = bio; 1038 s->op.cache_bio = bio;
1048 bch_writeback_add(dc); 1039 bch_writeback_add(dc);
1049 } 1040 }
1050 out: 1041 out:
1051 closure_call(&s->op.cl, bch_insert_data, NULL, cl); 1042 closure_call(&s->op.cl, bch_insert_data, NULL, cl);
1052 continue_at(cl, cached_dev_write_complete, NULL); 1043 continue_at(cl, cached_dev_write_complete, NULL);
1053 skip: 1044 skip:
1054 s->op.skip = true; 1045 s->op.skip = true;
1055 s->op.cache_bio = s->orig_bio; 1046 s->op.cache_bio = s->orig_bio;
1056 bio_get(s->op.cache_bio); 1047 bio_get(s->op.cache_bio);
1057 1048
1058 if ((bio->bi_rw & REQ_DISCARD) && 1049 if ((bio->bi_rw & REQ_DISCARD) &&
1059 !blk_queue_discard(bdev_get_queue(dc->bdev))) 1050 !blk_queue_discard(bdev_get_queue(dc->bdev)))
1060 goto out; 1051 goto out;
1061 1052
1062 closure_bio_submit(bio, cl, s->d); 1053 closure_bio_submit(bio, cl, s->d);
1063 goto out; 1054 goto out;
1064 } 1055 }
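/*
 * [Editor's note: illustration, not part of this patch.]  The two-argument
 * should_writeback() removed above is replaced by a four-argument version
 * defined elsewhere in this series, called before the skip check so that it
 * can veto a bypass decision.  Based on that call site and the commit
 * message ("if a stripe is already dirty, force writes to that stripe to
 * writeback mode"), here is a runnable userspace model of the decision it
 * has to make -- the stripe_dirty test, the check ordering, and the
 * placeholder cutoff values are assumptions, not the patch's actual helper:
 */
#include <stdbool.h>
#include <stdio.h>

enum { MODEL_WRITETHROUGH, MODEL_WRITEBACK, MODEL_WRITEAROUND, MODEL_NONE };

#define MODEL_CUTOFF_WRITEBACK		50	/* % of cache in use */
#define MODEL_CUTOFF_WRITEBACK_SYNC	75

static bool model_should_writeback(unsigned cache_mode, bool detaching,
				   unsigned in_use, bool stripe_dirty,
				   bool would_skip, bool sync)
{
	/* Same gates as the helper removed above */
	if (cache_mode != MODEL_WRITEBACK || detaching)
		return false;

	/* New behaviour: a write touching an already-dirty stripe goes to
	 * writeback regardless, to help build up full stripes */
	if (stripe_dirty)
		return true;

	/* Otherwise a request that check_should_skip() chose to bypass
	 * stays bypassed */
	if (would_skip)
		return false;

	return in_use < (sync ? MODEL_CUTOFF_WRITEBACK_SYNC
			      : MODEL_CUTOFF_WRITEBACK);
}

int main(void)
{
	/* A large sequential write marked for bypass still goes to writeback
	 * once its stripe already has dirty data... */
	printf("dirty stripe: %d\n",
	       model_should_writeback(MODEL_WRITEBACK, false, 40,
				      true, true, false));
	/* ...but with a clean stripe the bypass stands. */
	printf("clean stripe: %d\n",
	       model_should_writeback(MODEL_WRITEBACK, false, 40,
				      false, true, false));
	return 0;
}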
1065 1056
1066 static void request_nodata(struct cached_dev *dc, struct search *s) 1057 static void request_nodata(struct cached_dev *dc, struct search *s)
1067 { 1058 {
1068 struct closure *cl = &s->cl; 1059 struct closure *cl = &s->cl;
1069 struct bio *bio = &s->bio.bio; 1060 struct bio *bio = &s->bio.bio;
1070 1061
1071 if (bio->bi_rw & REQ_DISCARD) { 1062 if (bio->bi_rw & REQ_DISCARD) {
1072 request_write(dc, s); 1063 request_write(dc, s);
1073 return; 1064 return;
1074 } 1065 }
1075 1066
1076 if (s->op.flush_journal) 1067 if (s->op.flush_journal)
1077 bch_journal_meta(s->op.c, cl); 1068 bch_journal_meta(s->op.c, cl);
1078 1069
1079 closure_bio_submit(bio, cl, s->d); 1070 closure_bio_submit(bio, cl, s->d);
1080 1071
1081 continue_at(cl, cached_dev_bio_complete, NULL); 1072 continue_at(cl, cached_dev_bio_complete, NULL);
1082 } 1073 }
1083 1074
1084 /* Cached devices - read & write stuff */ 1075 /* Cached devices - read & write stuff */
1085 1076
1086 unsigned bch_get_congested(struct cache_set *c) 1077 unsigned bch_get_congested(struct cache_set *c)
1087 { 1078 {
1088 int i; 1079 int i;
1089 long rand; 1080 long rand;
1090 1081
1091 if (!c->congested_read_threshold_us && 1082 if (!c->congested_read_threshold_us &&
1092 !c->congested_write_threshold_us) 1083 !c->congested_write_threshold_us)
1093 return 0; 1084 return 0;
1094 1085
1095 i = (local_clock_us() - c->congested_last_us) / 1024; 1086 i = (local_clock_us() - c->congested_last_us) / 1024;
1096 if (i < 0) 1087 if (i < 0)
1097 return 0; 1088 return 0;
1098 1089
1099 i += atomic_read(&c->congested); 1090 i += atomic_read(&c->congested);
1100 if (i >= 0) 1091 if (i >= 0)
1101 return 0; 1092 return 0;
1102 1093
1103 i += CONGESTED_MAX; 1094 i += CONGESTED_MAX;
1104 1095
1105 if (i > 0) 1096 if (i > 0)
1106 i = fract_exp_two(i, 6); 1097 i = fract_exp_two(i, 6);
1107 1098
1108 rand = get_random_int(); 1099 rand = get_random_int();
1109 i -= bitmap_weight(&rand, BITS_PER_LONG); 1100 i -= bitmap_weight(&rand, BITS_PER_LONG);
1110 1101
1111 return i > 0 ? i : 1; 1102 return i > 0 ? i : 1;
1112 } 1103 }
1113 1104
1114 static void add_sequential(struct task_struct *t) 1105 static void add_sequential(struct task_struct *t)
1115 { 1106 {
1116 ewma_add(t->sequential_io_avg, 1107 ewma_add(t->sequential_io_avg,
1117 t->sequential_io, 8, 0); 1108 t->sequential_io, 8, 0);
1118 1109
1119 t->sequential_io = 0; 1110 t->sequential_io = 0;
1120 } 1111 }
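/*
 * [Editor's note: illustration, not part of this patch.]  add_sequential()
 * folds the task's just-finished sequential byte count into
 * t->sequential_io_avg via ewma_add(..., 8, 0).  The macro itself is not in
 * this hunk; assuming the usual integer EWMA with weight 8 and factor 0,
 * i.e. avg = (avg * 7 + val) / 8, a tiny sketch of how a burst of sequential
 * IO keeps the average elevated for a while afterwards:
 */
#include <stdio.h>

static unsigned long ewma8(unsigned long avg, unsigned long val)
{
	return (avg * 7 + val) / 8;
}

int main(void)
{
	unsigned long avg = 0;
	unsigned long io[] = { 1 << 20, 1 << 20, 4096, 4096, 4096 };
	unsigned i;

	for (i = 0; i < sizeof(io) / sizeof(io[0]); i++) {
		avg = ewma8(avg, io[i]);
		printf("sequential_io_avg ~= %lu\n", avg);
	}
	return 0;
}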
1121 1112
1122 static struct hlist_head *iohash(struct cached_dev *dc, uint64_t k) 1113 static struct hlist_head *iohash(struct cached_dev *dc, uint64_t k)
1123 { 1114 {
1124 return &dc->io_hash[hash_64(k, RECENT_IO_BITS)]; 1115 return &dc->io_hash[hash_64(k, RECENT_IO_BITS)];
1125 } 1116 }
1126 1117
1127 static void check_should_skip(struct cached_dev *dc, struct search *s) 1118 static void check_should_skip(struct cached_dev *dc, struct search *s)
1128 { 1119 {
1129 struct cache_set *c = s->op.c; 1120 struct cache_set *c = s->op.c;
1130 struct bio *bio = &s->bio.bio; 1121 struct bio *bio = &s->bio.bio;
1131 unsigned mode = cache_mode(dc, bio); 1122 unsigned mode = cache_mode(dc, bio);
1132 unsigned sectors, congested = bch_get_congested(c); 1123 unsigned sectors, congested = bch_get_congested(c);
1133 1124
1134 if (atomic_read(&dc->disk.detaching) || 1125 if (atomic_read(&dc->disk.detaching) ||
1135 c->gc_stats.in_use > CUTOFF_CACHE_ADD || 1126 c->gc_stats.in_use > CUTOFF_CACHE_ADD ||
1136 (bio->bi_rw & REQ_DISCARD)) 1127 (bio->bi_rw & REQ_DISCARD))
1137 goto skip; 1128 goto skip;
1138 1129
1139 if (mode == CACHE_MODE_NONE || 1130 if (mode == CACHE_MODE_NONE ||
1140 (mode == CACHE_MODE_WRITEAROUND && 1131 (mode == CACHE_MODE_WRITEAROUND &&
1141 (bio->bi_rw & REQ_WRITE))) 1132 (bio->bi_rw & REQ_WRITE)))
1142 goto skip; 1133 goto skip;
1143 1134
1144 if (bio->bi_sector & (c->sb.block_size - 1) || 1135 if (bio->bi_sector & (c->sb.block_size - 1) ||
1145 bio_sectors(bio) & (c->sb.block_size - 1)) { 1136 bio_sectors(bio) & (c->sb.block_size - 1)) {
1146 pr_debug("skipping unaligned io"); 1137 pr_debug("skipping unaligned io");
1147 goto skip; 1138 goto skip;
1148 } 1139 }
1149 1140
1150 if (!congested && !dc->sequential_cutoff) 1141 if (!congested && !dc->sequential_cutoff)
1151 goto rescale; 1142 goto rescale;
1152 1143
1153 if (!congested && 1144 if (!congested &&
1154 mode == CACHE_MODE_WRITEBACK && 1145 mode == CACHE_MODE_WRITEBACK &&
1155 (bio->bi_rw & REQ_WRITE) && 1146 (bio->bi_rw & REQ_WRITE) &&
1156 (bio->bi_rw & REQ_SYNC)) 1147 (bio->bi_rw & REQ_SYNC))
1157 goto rescale; 1148 goto rescale;
1158 1149
1159 if (dc->sequential_merge) { 1150 if (dc->sequential_merge) {
1160 struct io *i; 1151 struct io *i;
1161 1152
1162 spin_lock(&dc->io_lock); 1153 spin_lock(&dc->io_lock);
1163 1154
1164 hlist_for_each_entry(i, iohash(dc, bio->bi_sector), hash) 1155 hlist_for_each_entry(i, iohash(dc, bio->bi_sector), hash)
1165 if (i->last == bio->bi_sector && 1156 if (i->last == bio->bi_sector &&
1166 time_before(jiffies, i->jiffies)) 1157 time_before(jiffies, i->jiffies))
1167 goto found; 1158 goto found;
1168 1159
1169 i = list_first_entry(&dc->io_lru, struct io, lru); 1160 i = list_first_entry(&dc->io_lru, struct io, lru);
1170 1161
1171 add_sequential(s->task); 1162 add_sequential(s->task);
1172 i->sequential = 0; 1163 i->sequential = 0;
1173 found: 1164 found:
1174 if (i->sequential + bio->bi_size > i->sequential) 1165 if (i->sequential + bio->bi_size > i->sequential)
1175 i->sequential += bio->bi_size; 1166 i->sequential += bio->bi_size;
1176 1167
1177 i->last = bio_end(bio); 1168 i->last = bio_end(bio);
1178 i->jiffies = jiffies + msecs_to_jiffies(5000); 1169 i->jiffies = jiffies + msecs_to_jiffies(5000);
1179 s->task->sequential_io = i->sequential; 1170 s->task->sequential_io = i->sequential;
1180 1171
1181 hlist_del(&i->hash); 1172 hlist_del(&i->hash);
1182 hlist_add_head(&i->hash, iohash(dc, i->last)); 1173 hlist_add_head(&i->hash, iohash(dc, i->last));
1183 list_move_tail(&i->lru, &dc->io_lru); 1174 list_move_tail(&i->lru, &dc->io_lru);
1184 1175
1185 spin_unlock(&dc->io_lock); 1176 spin_unlock(&dc->io_lock);
1186 } else { 1177 } else {
1187 s->task->sequential_io = bio->bi_size; 1178 s->task->sequential_io = bio->bi_size;
1188 1179
1189 add_sequential(s->task); 1180 add_sequential(s->task);
1190 } 1181 }
1191 1182
1192 sectors = max(s->task->sequential_io, 1183 sectors = max(s->task->sequential_io,
1193 s->task->sequential_io_avg) >> 9; 1184 s->task->sequential_io_avg) >> 9;
1194 1185
1195 if (dc->sequential_cutoff && 1186 if (dc->sequential_cutoff &&
1196 sectors >= dc->sequential_cutoff >> 9) { 1187 sectors >= dc->sequential_cutoff >> 9) {
1197 trace_bcache_bypass_sequential(s->orig_bio); 1188 trace_bcache_bypass_sequential(s->orig_bio);
1198 goto skip; 1189 goto skip;
1199 } 1190 }
1200 1191
1201 if (congested && sectors >= congested) { 1192 if (congested && sectors >= congested) {
1202 trace_bcache_bypass_congested(s->orig_bio); 1193 trace_bcache_bypass_congested(s->orig_bio);
1203 goto skip; 1194 goto skip;
1204 } 1195 }
1205 1196
1206 rescale: 1197 rescale:
1207 bch_rescale_priorities(c, bio_sectors(bio)); 1198 bch_rescale_priorities(c, bio_sectors(bio));
1208 return; 1199 return;
1209 skip: 1200 skip:
1210 bch_mark_sectors_bypassed(s, bio_sectors(bio)); 1201 bch_mark_sectors_bypassed(s, bio_sectors(bio));
1211 s->op.skip = true; 1202 s->op.skip = true;
1212 } 1203 }
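/*
 * [Editor's note: illustration, not part of this patch.]  The
 * sequential_merge branch above remembers, per recent IO, where the last bio
 * ended (dc->io_hash, with a ~5s expiry and LRU reuse); a bio that starts
 * exactly there is treated as a continuation and its size accumulates into
 * that stream's sequential counter, which then feeds the sequential_cutoff
 * bypass test.  A stripped-down userspace model of that bookkeeping (no
 * hashing, expiry, EWMA or congestion; all names invented):
 */
#include <stdbool.h>
#include <stdio.h>

#define SECTOR_SHIFT	9
#define MODEL_CUTOFF	(4 << 20)	/* stand-in for dc->sequential_cutoff */
#define NSTREAMS	4

struct stream {
	unsigned long long last;	/* end sector of the previous bio  */
	unsigned long sequential;	/* bytes accumulated in the stream */
};

static struct stream streams[NSTREAMS];

static bool model_bypass(unsigned long long sector, unsigned bytes)
{
	struct stream *s = NULL;
	unsigned i;

	for (i = 0; i < NSTREAMS; i++)
		if (streams[i].last == sector) {
			s = &streams[i];	/* continues an earlier bio */
			break;
		}
	if (!s) {
		s = &streams[0];		/* kernel recycles the LRU entry */
		s->sequential = 0;
	}

	s->sequential += bytes;
	s->last = sector + (bytes >> SECTOR_SHIFT);

	return s->sequential >= MODEL_CUTOFF;
}

int main(void)
{
	unsigned long long sector = 0;
	unsigned i;

	/* Contiguous 1MB bios: with a 4MB cutoff, the fourth one (and every
	 * one after it) bypasses the cache. */
	for (i = 0; i < 6; i++) {
		printf("bio %u: %s\n", i,
		       model_bypass(sector, 1 << 20) ? "bypass" : "cache");
		sector += (1 << 20) >> SECTOR_SHIFT;
	}
	return 0;
}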
1213 1204
1214 static void cached_dev_make_request(struct request_queue *q, struct bio *bio) 1205 static void cached_dev_make_request(struct request_queue *q, struct bio *bio)
1215 { 1206 {
1216 struct search *s; 1207 struct search *s;
1217 struct bcache_device *d = bio->bi_bdev->bd_disk->private_data; 1208 struct bcache_device *d = bio->bi_bdev->bd_disk->private_data;
1218 struct cached_dev *dc = container_of(d, struct cached_dev, disk); 1209 struct cached_dev *dc = container_of(d, struct cached_dev, disk);
1219 int cpu, rw = bio_data_dir(bio); 1210 int cpu, rw = bio_data_dir(bio);
1220 1211
1221 cpu = part_stat_lock(); 1212 cpu = part_stat_lock();
1222 part_stat_inc(cpu, &d->disk->part0, ios[rw]); 1213 part_stat_inc(cpu, &d->disk->part0, ios[rw]);
1223 part_stat_add(cpu, &d->disk->part0, sectors[rw], bio_sectors(bio)); 1214 part_stat_add(cpu, &d->disk->part0, sectors[rw], bio_sectors(bio));
1224 part_stat_unlock(); 1215 part_stat_unlock();
1225 1216
1226 bio->bi_bdev = dc->bdev; 1217 bio->bi_bdev = dc->bdev;
1227 bio->bi_sector += dc->sb.data_offset; 1218 bio->bi_sector += dc->sb.data_offset;
1228 1219
1229 if (cached_dev_get(dc)) { 1220 if (cached_dev_get(dc)) {
1230 s = search_alloc(bio, d); 1221 s = search_alloc(bio, d);
1231 trace_bcache_request_start(s, bio); 1222 trace_bcache_request_start(s, bio);
1232 1223
1233 if (!bio_has_data(bio)) 1224 if (!bio_has_data(bio))
1234 request_nodata(dc, s); 1225 request_nodata(dc, s);
1235 else if (rw) 1226 else if (rw)
1236 request_write(dc, s); 1227 request_write(dc, s);
1237 else 1228 else
1238 request_read(dc, s); 1229 request_read(dc, s);
1239 } else { 1230 } else {
1240 if ((bio->bi_rw & REQ_DISCARD) && 1231 if ((bio->bi_rw & REQ_DISCARD) &&
1241 !blk_queue_discard(bdev_get_queue(dc->bdev))) 1232 !blk_queue_discard(bdev_get_queue(dc->bdev)))
1242 bio_endio(bio, 0); 1233 bio_endio(bio, 0);
1243 else 1234 else
1244 bch_generic_make_request(bio, &d->bio_split_hook); 1235 bch_generic_make_request(bio, &d->bio_split_hook);
1245 } 1236 }
1246 } 1237 }
1247 1238
1248 static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode, 1239 static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
1249 unsigned int cmd, unsigned long arg) 1240 unsigned int cmd, unsigned long arg)
1250 { 1241 {
1251 struct cached_dev *dc = container_of(d, struct cached_dev, disk); 1242 struct cached_dev *dc = container_of(d, struct cached_dev, disk);
1252 return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg); 1243 return __blkdev_driver_ioctl(dc->bdev, mode, cmd, arg);
1253 } 1244 }
1254 1245
1255 static int cached_dev_congested(void *data, int bits) 1246 static int cached_dev_congested(void *data, int bits)
1256 { 1247 {
1257 struct bcache_device *d = data; 1248 struct bcache_device *d = data;
1258 struct cached_dev *dc = container_of(d, struct cached_dev, disk); 1249 struct cached_dev *dc = container_of(d, struct cached_dev, disk);
1259 struct request_queue *q = bdev_get_queue(dc->bdev); 1250 struct request_queue *q = bdev_get_queue(dc->bdev);
1260 int ret = 0; 1251 int ret = 0;
1261 1252
1262 if (bdi_congested(&q->backing_dev_info, bits)) 1253 if (bdi_congested(&q->backing_dev_info, bits))
1263 return 1; 1254 return 1;
1264 1255
1265 if (cached_dev_get(dc)) { 1256 if (cached_dev_get(dc)) {
1266 unsigned i; 1257 unsigned i;
1267 struct cache *ca; 1258 struct cache *ca;
1268 1259
1269 for_each_cache(ca, d->c, i) { 1260 for_each_cache(ca, d->c, i) {
1270 q = bdev_get_queue(ca->bdev); 1261 q = bdev_get_queue(ca->bdev);
1271 ret |= bdi_congested(&q->backing_dev_info, bits); 1262 ret |= bdi_congested(&q->backing_dev_info, bits);
1272 } 1263 }
1273 1264
1274 cached_dev_put(dc); 1265 cached_dev_put(dc);
1275 } 1266 }
1276 1267
1277 return ret; 1268 return ret;
1278 } 1269 }
1279 1270
1280 void bch_cached_dev_request_init(struct cached_dev *dc) 1271 void bch_cached_dev_request_init(struct cached_dev *dc)
1281 { 1272 {
1282 struct gendisk *g = dc->disk.disk; 1273 struct gendisk *g = dc->disk.disk;
1283 1274
1284 g->queue->make_request_fn = cached_dev_make_request; 1275 g->queue->make_request_fn = cached_dev_make_request;
1285 g->queue->backing_dev_info.congested_fn = cached_dev_congested; 1276 g->queue->backing_dev_info.congested_fn = cached_dev_congested;
1286 dc->disk.cache_miss = cached_dev_cache_miss; 1277 dc->disk.cache_miss = cached_dev_cache_miss;
1287 dc->disk.ioctl = cached_dev_ioctl; 1278 dc->disk.ioctl = cached_dev_ioctl;
1288 } 1279 }
1289 1280
1290 /* Flash backed devices */ 1281 /* Flash backed devices */
1291 1282
1292 static int flash_dev_cache_miss(struct btree *b, struct search *s, 1283 static int flash_dev_cache_miss(struct btree *b, struct search *s,
1293 struct bio *bio, unsigned sectors) 1284 struct bio *bio, unsigned sectors)
1294 { 1285 {
1295 /* Zero fill bio */ 1286 /* Zero fill bio */
1296 1287
1297 while (bio->bi_idx != bio->bi_vcnt) { 1288 while (bio->bi_idx != bio->bi_vcnt) {
1298 struct bio_vec *bv = bio_iovec(bio); 1289 struct bio_vec *bv = bio_iovec(bio);
1299 unsigned j = min(bv->bv_len >> 9, sectors); 1290 unsigned j = min(bv->bv_len >> 9, sectors);
1300 1291
1301 void *p = kmap(bv->bv_page); 1292 void *p = kmap(bv->bv_page);
1302 memset(p + bv->bv_offset, 0, j << 9); 1293 memset(p + bv->bv_offset, 0, j << 9);
1303 kunmap(bv->bv_page); 1294 kunmap(bv->bv_page);
1304 1295
1305 bv->bv_len -= j << 9; 1296 bv->bv_len -= j << 9;
1306 bv->bv_offset += j << 9; 1297 bv->bv_offset += j << 9;
1307 1298
1308 if (bv->bv_len) 1299 if (bv->bv_len)
1309 return 0; 1300 return 0;
1310 1301
1311 bio->bi_sector += j; 1302 bio->bi_sector += j;
1312 bio->bi_size -= j << 9; 1303 bio->bi_size -= j << 9;
1313 1304
1314 bio->bi_idx++; 1305 bio->bi_idx++;
1315 sectors -= j; 1306 sectors -= j;
1316 } 1307 }
1317 1308
1318 s->op.lookup_done = true; 1309 s->op.lookup_done = true;
1319 1310
1320 return 0; 1311 return 0;
1321 } 1312 }
1322 1313
1323 static void flash_dev_make_request(struct request_queue *q, struct bio *bio) 1314 static void flash_dev_make_request(struct request_queue *q, struct bio *bio)
1324 { 1315 {
1325 struct search *s; 1316 struct search *s;
1326 struct closure *cl; 1317 struct closure *cl;
1327 struct bcache_device *d = bio->bi_bdev->bd_disk->private_data; 1318 struct bcache_device *d = bio->bi_bdev->bd_disk->private_data;
1328 int cpu, rw = bio_data_dir(bio); 1319 int cpu, rw = bio_data_dir(bio);
1329 1320
1330 cpu = part_stat_lock(); 1321 cpu = part_stat_lock();
1331 part_stat_inc(cpu, &d->disk->part0, ios[rw]); 1322 part_stat_inc(cpu, &d->disk->part0, ios[rw]);
1332 part_stat_add(cpu, &d->disk->part0, sectors[rw], bio_sectors(bio)); 1323 part_stat_add(cpu, &d->disk->part0, sectors[rw], bio_sectors(bio));
1333 part_stat_unlock(); 1324 part_stat_unlock();
1334 1325
1335 s = search_alloc(bio, d); 1326 s = search_alloc(bio, d);
1336 cl = &s->cl; 1327 cl = &s->cl;
1337 bio = &s->bio.bio; 1328 bio = &s->bio.bio;
1338 1329
1339 trace_bcache_request_start(s, bio); 1330 trace_bcache_request_start(s, bio);
1340 1331
1341 if (bio_has_data(bio) && !rw) { 1332 if (bio_has_data(bio) && !rw) {
1342 closure_call(&s->op.cl, btree_read_async, NULL, cl); 1333 closure_call(&s->op.cl, btree_read_async, NULL, cl);
1343 } else if (bio_has_data(bio) || s->op.skip) { 1334 } else if (bio_has_data(bio) || s->op.skip) {
1344 bch_keybuf_check_overlapping(&s->op.c->moving_gc_keys, 1335 bch_keybuf_check_overlapping(&s->op.c->moving_gc_keys,
1345 &KEY(d->id, bio->bi_sector, 0), 1336 &KEY(d->id, bio->bi_sector, 0),
1346 &KEY(d->id, bio_end(bio), 0)); 1337 &KEY(d->id, bio_end(bio), 0));
1347 1338
1348 s->writeback = true; 1339 s->writeback = true;
1349 s->op.cache_bio = bio; 1340 s->op.cache_bio = bio;
1350 1341
1351 closure_call(&s->op.cl, bch_insert_data, NULL, cl); 1342 closure_call(&s->op.cl, bch_insert_data, NULL, cl);
1352 } else { 1343 } else {
1353 /* No data - probably a cache flush */ 1344 /* No data - probably a cache flush */
1354 if (s->op.flush_journal) 1345 if (s->op.flush_journal)
1355 bch_journal_meta(s->op.c, cl); 1346 bch_journal_meta(s->op.c, cl);
1356 } 1347 }
1357 1348
1358 continue_at(cl, search_free, NULL); 1349 continue_at(cl, search_free, NULL);
1359 } 1350 }
1360 1351
1361 static int flash_dev_ioctl(struct bcache_device *d, fmode_t mode, 1352 static int flash_dev_ioctl(struct bcache_device *d, fmode_t mode,
1362 unsigned int cmd, unsigned long arg) 1353 unsigned int cmd, unsigned long arg)
1363 { 1354 {
1364 return -ENOTTY; 1355 return -ENOTTY;
1365 } 1356 }
1366 1357
1367 static int flash_dev_congested(void *data, int bits) 1358 static int flash_dev_congested(void *data, int bits)
1368 { 1359 {
1369 struct bcache_device *d = data; 1360 struct bcache_device *d = data;
1370 struct request_queue *q; 1361 struct request_queue *q;
1371 struct cache *ca; 1362 struct cache *ca;
1372 unsigned i; 1363 unsigned i;
1373 int ret = 0; 1364 int ret = 0;
1374 1365
1375 for_each_cache(ca, d->c, i) { 1366 for_each_cache(ca, d->c, i) {
1376 q = bdev_get_queue(ca->bdev); 1367 q = bdev_get_queue(ca->bdev);
1377 ret |= bdi_congested(&q->backing_dev_info, bits); 1368 ret |= bdi_congested(&q->backing_dev_info, bits);
1378 } 1369 }
1379 1370
1380 return ret; 1371 return ret;
1381 } 1372 }
1382 1373
1383 void bch_flash_dev_request_init(struct bcache_device *d) 1374 void bch_flash_dev_request_init(struct bcache_device *d)
1384 { 1375 {
1385 struct gendisk *g = d->disk; 1376 struct gendisk *g = d->disk;
1386 1377
1387 g->queue->make_request_fn = flash_dev_make_request; 1378 g->queue->make_request_fn = flash_dev_make_request;
1388 g->queue->backing_dev_info.congested_fn = flash_dev_congested; 1379 g->queue->backing_dev_info.congested_fn = flash_dev_congested;
1389 d->cache_miss = flash_dev_cache_miss; 1380 d->cache_miss = flash_dev_cache_miss;
1390 d->ioctl = flash_dev_ioctl; 1381 d->ioctl = flash_dev_ioctl;
1391 } 1382 }
1392 1383
1393 void bch_request_exit(void) 1384 void bch_request_exit(void)
1394 { 1385 {
1395 #ifdef CONFIG_CGROUP_BCACHE 1386 #ifdef CONFIG_CGROUP_BCACHE
1396 cgroup_unload_subsys(&bcache_subsys); 1387 cgroup_unload_subsys(&bcache_subsys);
1397 #endif 1388 #endif
1398 if (bch_search_cache) 1389 if (bch_search_cache)
1399 kmem_cache_destroy(bch_search_cache); 1390 kmem_cache_destroy(bch_search_cache);
1400 } 1391 }
1401 1392
1402 int __init bch_request_init(void) 1393 int __init bch_request_init(void)
1403 { 1394 {
1404 bch_search_cache = KMEM_CACHE(search, 0); 1395 bch_search_cache = KMEM_CACHE(search, 0);
1405 if (!bch_search_cache) 1396 if (!bch_search_cache)
1406 return -ENOMEM; 1397 return -ENOMEM;
1407 1398
1408 #ifdef CONFIG_CGROUP_BCACHE 1399 #ifdef CONFIG_CGROUP_BCACHE
1409 cgroup_load_subsys(&bcache_subsys); 1400 cgroup_load_subsys(&bcache_subsys);
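The request.c hunk above is mostly unchanged context; the most self-contained piece of logic in it is the bypass decision near the top of the hunk (the sequential-cutoff and congestion checks). Below is a minimal standalone sketch of that decision, not kernel code: struct example_task, should_bypass() and the sample values are invented for illustration, and it uses the same >> 9 shift as the diff to convert byte counts to 512-byte sectors.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Per-task sequential I/O tracking, mirroring the two fields read above. */
struct example_task {
	uint64_t sequential_io;		/* bytes in the current sequential run */
	uint64_t sequential_io_avg;	/* decaying average of previous runs */
};

/*
 * Bypass the cache when the larger of the two counters, converted to
 * sectors, reaches the sequential cutoff (given in bytes) or, when the
 * cache is congested, the congestion threshold (given in sectors).
 */
static bool should_bypass(const struct example_task *t,
			  uint64_t sequential_cutoff, uint64_t congested)
{
	uint64_t run = t->sequential_io > t->sequential_io_avg
		? t->sequential_io : t->sequential_io_avg;
	uint64_t sectors = run >> 9;

	if (sequential_cutoff && sectors >= sequential_cutoff >> 9)
		return true;			/* long sequential run */

	if (congested && sectors >= congested)
		return true;			/* congested cache */

	return false;
}

int main(void)
{
	struct example_task t = {
		.sequential_io	   = 8 << 20,	/* 8 MiB run so far */
		.sequential_io_avg = 1 << 20,
	};

	/* 4 MiB cutoff, no congestion: the run exceeds the cutoff. */
	printf("bypass: %d\n", should_bypass(&t, 4 << 20, 0));
	return 0;
}

With these sample numbers should_bypass() returns true, which corresponds to the trace_bcache_bypass_sequential()/goto skip path shown in the hunk above.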
drivers/md/bcache/sysfs.c
1 /* 1 /*
2 * bcache sysfs interfaces 2 * bcache sysfs interfaces
3 * 3 *
4 * Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com> 4 * Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com>
5 * Copyright 2012 Google, Inc. 5 * Copyright 2012 Google, Inc.
6 */ 6 */
7 7
8 #include "bcache.h" 8 #include "bcache.h"
9 #include "sysfs.h" 9 #include "sysfs.h"
10 #include "btree.h" 10 #include "btree.h"
11 #include "request.h" 11 #include "request.h"
12 #include "writeback.h" 12 #include "writeback.h"
13 13
14 #include <linux/blkdev.h> 14 #include <linux/blkdev.h>
15 #include <linux/sort.h> 15 #include <linux/sort.h>
16 16
17 static const char * const cache_replacement_policies[] = { 17 static const char * const cache_replacement_policies[] = {
18 "lru", 18 "lru",
19 "fifo", 19 "fifo",
20 "random", 20 "random",
21 NULL 21 NULL
22 }; 22 };
23 23
24 write_attribute(attach); 24 write_attribute(attach);
25 write_attribute(detach); 25 write_attribute(detach);
26 write_attribute(unregister); 26 write_attribute(unregister);
27 write_attribute(stop); 27 write_attribute(stop);
28 write_attribute(clear_stats); 28 write_attribute(clear_stats);
29 write_attribute(trigger_gc); 29 write_attribute(trigger_gc);
30 write_attribute(prune_cache); 30 write_attribute(prune_cache);
31 write_attribute(flash_vol_create); 31 write_attribute(flash_vol_create);
32 32
33 read_attribute(bucket_size); 33 read_attribute(bucket_size);
34 read_attribute(block_size); 34 read_attribute(block_size);
35 read_attribute(nbuckets); 35 read_attribute(nbuckets);
36 read_attribute(tree_depth); 36 read_attribute(tree_depth);
37 read_attribute(root_usage_percent); 37 read_attribute(root_usage_percent);
38 read_attribute(priority_stats); 38 read_attribute(priority_stats);
39 read_attribute(btree_cache_size); 39 read_attribute(btree_cache_size);
40 read_attribute(btree_cache_max_chain); 40 read_attribute(btree_cache_max_chain);
41 read_attribute(cache_available_percent); 41 read_attribute(cache_available_percent);
42 read_attribute(written); 42 read_attribute(written);
43 read_attribute(btree_written); 43 read_attribute(btree_written);
44 read_attribute(metadata_written); 44 read_attribute(metadata_written);
45 read_attribute(active_journal_entries); 45 read_attribute(active_journal_entries);
46 46
47 sysfs_time_stats_attribute(btree_gc, sec, ms); 47 sysfs_time_stats_attribute(btree_gc, sec, ms);
48 sysfs_time_stats_attribute(btree_split, sec, us); 48 sysfs_time_stats_attribute(btree_split, sec, us);
49 sysfs_time_stats_attribute(btree_sort, ms, us); 49 sysfs_time_stats_attribute(btree_sort, ms, us);
50 sysfs_time_stats_attribute(btree_read, ms, us); 50 sysfs_time_stats_attribute(btree_read, ms, us);
51 sysfs_time_stats_attribute(try_harder, ms, us); 51 sysfs_time_stats_attribute(try_harder, ms, us);
52 52
53 read_attribute(btree_nodes); 53 read_attribute(btree_nodes);
54 read_attribute(btree_used_percent); 54 read_attribute(btree_used_percent);
55 read_attribute(average_key_size); 55 read_attribute(average_key_size);
56 read_attribute(dirty_data); 56 read_attribute(dirty_data);
57 read_attribute(bset_tree_stats); 57 read_attribute(bset_tree_stats);
58 58
59 read_attribute(state); 59 read_attribute(state);
60 read_attribute(cache_read_races); 60 read_attribute(cache_read_races);
61 read_attribute(writeback_keys_done); 61 read_attribute(writeback_keys_done);
62 read_attribute(writeback_keys_failed); 62 read_attribute(writeback_keys_failed);
63 read_attribute(io_errors); 63 read_attribute(io_errors);
64 read_attribute(congested); 64 read_attribute(congested);
65 rw_attribute(congested_read_threshold_us); 65 rw_attribute(congested_read_threshold_us);
66 rw_attribute(congested_write_threshold_us); 66 rw_attribute(congested_write_threshold_us);
67 67
68 rw_attribute(sequential_cutoff); 68 rw_attribute(sequential_cutoff);
69 rw_attribute(sequential_merge); 69 rw_attribute(sequential_merge);
70 rw_attribute(data_csum); 70 rw_attribute(data_csum);
71 rw_attribute(cache_mode); 71 rw_attribute(cache_mode);
72 rw_attribute(writeback_metadata); 72 rw_attribute(writeback_metadata);
73 rw_attribute(writeback_running); 73 rw_attribute(writeback_running);
74 rw_attribute(writeback_percent); 74 rw_attribute(writeback_percent);
75 rw_attribute(writeback_delay); 75 rw_attribute(writeback_delay);
76 rw_attribute(writeback_rate); 76 rw_attribute(writeback_rate);
77 77
78 rw_attribute(writeback_rate_update_seconds); 78 rw_attribute(writeback_rate_update_seconds);
79 rw_attribute(writeback_rate_d_term); 79 rw_attribute(writeback_rate_d_term);
80 rw_attribute(writeback_rate_p_term_inverse); 80 rw_attribute(writeback_rate_p_term_inverse);
81 rw_attribute(writeback_rate_d_smooth); 81 rw_attribute(writeback_rate_d_smooth);
82 read_attribute(writeback_rate_debug); 82 read_attribute(writeback_rate_debug);
83 83
84 read_attribute(stripe_size);
85 read_attribute(partial_stripes_expensive);
86
84 rw_attribute(synchronous); 87 rw_attribute(synchronous);
85 rw_attribute(journal_delay_ms); 88 rw_attribute(journal_delay_ms);
86 rw_attribute(discard); 89 rw_attribute(discard);
87 rw_attribute(running); 90 rw_attribute(running);
88 rw_attribute(label); 91 rw_attribute(label);
89 rw_attribute(readahead); 92 rw_attribute(readahead);
90 rw_attribute(io_error_limit); 93 rw_attribute(io_error_limit);
91 rw_attribute(io_error_halflife); 94 rw_attribute(io_error_halflife);
92 rw_attribute(verify); 95 rw_attribute(verify);
93 rw_attribute(key_merging_disabled); 96 rw_attribute(key_merging_disabled);
94 rw_attribute(gc_always_rewrite); 97 rw_attribute(gc_always_rewrite);
95 rw_attribute(freelist_percent); 98 rw_attribute(freelist_percent);
96 rw_attribute(cache_replacement_policy); 99 rw_attribute(cache_replacement_policy);
97 rw_attribute(btree_shrinker_disabled); 100 rw_attribute(btree_shrinker_disabled);
98 rw_attribute(copy_gc_enabled); 101 rw_attribute(copy_gc_enabled);
99 rw_attribute(size); 102 rw_attribute(size);
100 103
101 SHOW(__bch_cached_dev) 104 SHOW(__bch_cached_dev)
102 { 105 {
103 struct cached_dev *dc = container_of(kobj, struct cached_dev, 106 struct cached_dev *dc = container_of(kobj, struct cached_dev,
104 disk.kobj); 107 disk.kobj);
105 const char *states[] = { "no cache", "clean", "dirty", "inconsistent" }; 108 const char *states[] = { "no cache", "clean", "dirty", "inconsistent" };
106 109
107 #define var(stat) (dc->stat) 110 #define var(stat) (dc->stat)
108 111
109 if (attr == &sysfs_cache_mode) 112 if (attr == &sysfs_cache_mode)
110 return bch_snprint_string_list(buf, PAGE_SIZE, 113 return bch_snprint_string_list(buf, PAGE_SIZE,
111 bch_cache_modes + 1, 114 bch_cache_modes + 1,
112 BDEV_CACHE_MODE(&dc->sb)); 115 BDEV_CACHE_MODE(&dc->sb));
113 116
114 sysfs_printf(data_csum, "%i", dc->disk.data_csum); 117 sysfs_printf(data_csum, "%i", dc->disk.data_csum);
115 var_printf(verify, "%i"); 118 var_printf(verify, "%i");
116 var_printf(writeback_metadata, "%i"); 119 var_printf(writeback_metadata, "%i");
117 var_printf(writeback_running, "%i"); 120 var_printf(writeback_running, "%i");
118 var_print(writeback_delay); 121 var_print(writeback_delay);
119 var_print(writeback_percent); 122 var_print(writeback_percent);
120 sysfs_print(writeback_rate, dc->writeback_rate.rate); 123 sysfs_print(writeback_rate, dc->writeback_rate.rate);
121 124
122 var_print(writeback_rate_update_seconds); 125 var_print(writeback_rate_update_seconds);
123 var_print(writeback_rate_d_term); 126 var_print(writeback_rate_d_term);
124 var_print(writeback_rate_p_term_inverse); 127 var_print(writeback_rate_p_term_inverse);
125 var_print(writeback_rate_d_smooth); 128 var_print(writeback_rate_d_smooth);
126 129
127 if (attr == &sysfs_writeback_rate_debug) { 130 if (attr == &sysfs_writeback_rate_debug) {
128 char dirty[20]; 131 char dirty[20];
129 char derivative[20]; 132 char derivative[20];
130 char target[20]; 133 char target[20];
131 bch_hprint(dirty, 134 bch_hprint(dirty,
132 bcache_dev_sectors_dirty(&dc->disk) << 9); 135 bcache_dev_sectors_dirty(&dc->disk) << 9);
133 bch_hprint(derivative, dc->writeback_rate_derivative << 9); 136 bch_hprint(derivative, dc->writeback_rate_derivative << 9);
134 bch_hprint(target, dc->writeback_rate_target << 9); 137 bch_hprint(target, dc->writeback_rate_target << 9);
135 138
136 return sprintf(buf, 139 return sprintf(buf,
137 "rate:\t\t%u\n" 140 "rate:\t\t%u\n"
138 "change:\t\t%i\n" 141 "change:\t\t%i\n"
139 "dirty:\t\t%s\n" 142 "dirty:\t\t%s\n"
140 "derivative:\t%s\n" 143 "derivative:\t%s\n"
141 "target:\t\t%s\n", 144 "target:\t\t%s\n",
142 dc->writeback_rate.rate, 145 dc->writeback_rate.rate,
143 dc->writeback_rate_change, 146 dc->writeback_rate_change,
144 dirty, derivative, target); 147 dirty, derivative, target);
145 } 148 }
146 149
147 sysfs_hprint(dirty_data, 150 sysfs_hprint(dirty_data,
148 bcache_dev_sectors_dirty(&dc->disk) << 9); 151 bcache_dev_sectors_dirty(&dc->disk) << 9);
149 152
153 sysfs_hprint(stripe_size, (1 << dc->disk.stripe_size_bits) << 9);
154 var_printf(partial_stripes_expensive, "%u");
155
150 var_printf(sequential_merge, "%i"); 156 var_printf(sequential_merge, "%i");
151 var_hprint(sequential_cutoff); 157 var_hprint(sequential_cutoff);
152 var_hprint(readahead); 158 var_hprint(readahead);
153 159
154 sysfs_print(running, atomic_read(&dc->running)); 160 sysfs_print(running, atomic_read(&dc->running));
155 sysfs_print(state, states[BDEV_STATE(&dc->sb)]); 161 sysfs_print(state, states[BDEV_STATE(&dc->sb)]);
156 162
157 if (attr == &sysfs_label) { 163 if (attr == &sysfs_label) {
158 memcpy(buf, dc->sb.label, SB_LABEL_SIZE); 164 memcpy(buf, dc->sb.label, SB_LABEL_SIZE);
159 buf[SB_LABEL_SIZE + 1] = '\0'; 165 buf[SB_LABEL_SIZE + 1] = '\0';
160 strcat(buf, "\n"); 166 strcat(buf, "\n");
161 return strlen(buf); 167 return strlen(buf);
162 } 168 }
163 169
164 #undef var 170 #undef var
165 return 0; 171 return 0;
166 } 172 }
167 SHOW_LOCKED(bch_cached_dev) 173 SHOW_LOCKED(bch_cached_dev)
168 174
169 STORE(__cached_dev) 175 STORE(__cached_dev)
170 { 176 {
171 struct cached_dev *dc = container_of(kobj, struct cached_dev, 177 struct cached_dev *dc = container_of(kobj, struct cached_dev,
172 disk.kobj); 178 disk.kobj);
173 unsigned v = size; 179 unsigned v = size;
174 struct cache_set *c; 180 struct cache_set *c;
175 181
176 #define d_strtoul(var) sysfs_strtoul(var, dc->var) 182 #define d_strtoul(var) sysfs_strtoul(var, dc->var)
177 #define d_strtoi_h(var) sysfs_hatoi(var, dc->var) 183 #define d_strtoi_h(var) sysfs_hatoi(var, dc->var)
178 184
179 sysfs_strtoul(data_csum, dc->disk.data_csum); 185 sysfs_strtoul(data_csum, dc->disk.data_csum);
180 d_strtoul(verify); 186 d_strtoul(verify);
181 d_strtoul(writeback_metadata); 187 d_strtoul(writeback_metadata);
182 d_strtoul(writeback_running); 188 d_strtoul(writeback_running);
183 d_strtoul(writeback_delay); 189 d_strtoul(writeback_delay);
184 sysfs_strtoul_clamp(writeback_rate, 190 sysfs_strtoul_clamp(writeback_rate,
185 dc->writeback_rate.rate, 1, 1000000); 191 dc->writeback_rate.rate, 1, 1000000);
186 sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent, 0, 40); 192 sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent, 0, 40);
187 193
188 d_strtoul(writeback_rate_update_seconds); 194 d_strtoul(writeback_rate_update_seconds);
189 d_strtoul(writeback_rate_d_term); 195 d_strtoul(writeback_rate_d_term);
190 d_strtoul(writeback_rate_p_term_inverse); 196 d_strtoul(writeback_rate_p_term_inverse);
191 sysfs_strtoul_clamp(writeback_rate_p_term_inverse, 197 sysfs_strtoul_clamp(writeback_rate_p_term_inverse,
192 dc->writeback_rate_p_term_inverse, 1, INT_MAX); 198 dc->writeback_rate_p_term_inverse, 1, INT_MAX);
193 d_strtoul(writeback_rate_d_smooth); 199 d_strtoul(writeback_rate_d_smooth);
194 200
195 d_strtoul(sequential_merge); 201 d_strtoul(sequential_merge);
196 d_strtoi_h(sequential_cutoff); 202 d_strtoi_h(sequential_cutoff);
197 d_strtoi_h(readahead); 203 d_strtoi_h(readahead);
198 204
199 if (attr == &sysfs_clear_stats) 205 if (attr == &sysfs_clear_stats)
200 bch_cache_accounting_clear(&dc->accounting); 206 bch_cache_accounting_clear(&dc->accounting);
201 207
202 if (attr == &sysfs_running && 208 if (attr == &sysfs_running &&
203 strtoul_or_return(buf)) 209 strtoul_or_return(buf))
204 bch_cached_dev_run(dc); 210 bch_cached_dev_run(dc);
205 211
206 if (attr == &sysfs_cache_mode) { 212 if (attr == &sysfs_cache_mode) {
207 ssize_t v = bch_read_string_list(buf, bch_cache_modes + 1); 213 ssize_t v = bch_read_string_list(buf, bch_cache_modes + 1);
208 214
209 if (v < 0) 215 if (v < 0)
210 return v; 216 return v;
211 217
212 if ((unsigned) v != BDEV_CACHE_MODE(&dc->sb)) { 218 if ((unsigned) v != BDEV_CACHE_MODE(&dc->sb)) {
213 SET_BDEV_CACHE_MODE(&dc->sb, v); 219 SET_BDEV_CACHE_MODE(&dc->sb, v);
214 bch_write_bdev_super(dc, NULL); 220 bch_write_bdev_super(dc, NULL);
215 } 221 }
216 } 222 }
217 223
218 if (attr == &sysfs_label) { 224 if (attr == &sysfs_label) {
219 memcpy(dc->sb.label, buf, SB_LABEL_SIZE); 225 memcpy(dc->sb.label, buf, SB_LABEL_SIZE);
220 bch_write_bdev_super(dc, NULL); 226 bch_write_bdev_super(dc, NULL);
221 if (dc->disk.c) { 227 if (dc->disk.c) {
222 memcpy(dc->disk.c->uuids[dc->disk.id].label, 228 memcpy(dc->disk.c->uuids[dc->disk.id].label,
223 buf, SB_LABEL_SIZE); 229 buf, SB_LABEL_SIZE);
224 bch_uuid_write(dc->disk.c); 230 bch_uuid_write(dc->disk.c);
225 } 231 }
226 } 232 }
227 233
228 if (attr == &sysfs_attach) { 234 if (attr == &sysfs_attach) {
229 if (bch_parse_uuid(buf, dc->sb.set_uuid) < 16) 235 if (bch_parse_uuid(buf, dc->sb.set_uuid) < 16)
230 return -EINVAL; 236 return -EINVAL;
231 237
232 list_for_each_entry(c, &bch_cache_sets, list) { 238 list_for_each_entry(c, &bch_cache_sets, list) {
233 v = bch_cached_dev_attach(dc, c); 239 v = bch_cached_dev_attach(dc, c);
234 if (!v) 240 if (!v)
235 return size; 241 return size;
236 } 242 }
237 243
238 pr_err("Can't attach %s: cache set not found", buf); 244 pr_err("Can't attach %s: cache set not found", buf);
239 size = v; 245 size = v;
240 } 246 }
241 247
242 if (attr == &sysfs_detach && dc->disk.c) 248 if (attr == &sysfs_detach && dc->disk.c)
243 bch_cached_dev_detach(dc); 249 bch_cached_dev_detach(dc);
244 250
245 if (attr == &sysfs_stop) 251 if (attr == &sysfs_stop)
246 bcache_device_stop(&dc->disk); 252 bcache_device_stop(&dc->disk);
247 253
248 return size; 254 return size;
249 } 255 }
250 256
251 STORE(bch_cached_dev) 257 STORE(bch_cached_dev)
252 { 258 {
253 struct cached_dev *dc = container_of(kobj, struct cached_dev, 259 struct cached_dev *dc = container_of(kobj, struct cached_dev,
254 disk.kobj); 260 disk.kobj);
255 261
256 mutex_lock(&bch_register_lock); 262 mutex_lock(&bch_register_lock);
257 size = __cached_dev_store(kobj, attr, buf, size); 263 size = __cached_dev_store(kobj, attr, buf, size);
258 264
259 if (attr == &sysfs_writeback_running) 265 if (attr == &sysfs_writeback_running)
260 bch_writeback_queue(dc); 266 bch_writeback_queue(dc);
261 267
262 if (attr == &sysfs_writeback_percent) 268 if (attr == &sysfs_writeback_percent)
263 schedule_delayed_work(&dc->writeback_rate_update, 269 schedule_delayed_work(&dc->writeback_rate_update,
264 dc->writeback_rate_update_seconds * HZ); 270 dc->writeback_rate_update_seconds * HZ);
265 271
266 mutex_unlock(&bch_register_lock); 272 mutex_unlock(&bch_register_lock);
267 return size; 273 return size;
268 } 274 }
269 275
270 static struct attribute *bch_cached_dev_files[] = { 276 static struct attribute *bch_cached_dev_files[] = {
271 &sysfs_attach, 277 &sysfs_attach,
272 &sysfs_detach, 278 &sysfs_detach,
273 &sysfs_stop, 279 &sysfs_stop,
274 #if 0 280 #if 0
275 &sysfs_data_csum, 281 &sysfs_data_csum,
276 #endif 282 #endif
277 &sysfs_cache_mode, 283 &sysfs_cache_mode,
278 &sysfs_writeback_metadata, 284 &sysfs_writeback_metadata,
279 &sysfs_writeback_running, 285 &sysfs_writeback_running,
280 &sysfs_writeback_delay, 286 &sysfs_writeback_delay,
281 &sysfs_writeback_percent, 287 &sysfs_writeback_percent,
282 &sysfs_writeback_rate, 288 &sysfs_writeback_rate,
283 &sysfs_writeback_rate_update_seconds, 289 &sysfs_writeback_rate_update_seconds,
284 &sysfs_writeback_rate_d_term, 290 &sysfs_writeback_rate_d_term,
285 &sysfs_writeback_rate_p_term_inverse, 291 &sysfs_writeback_rate_p_term_inverse,
286 &sysfs_writeback_rate_d_smooth, 292 &sysfs_writeback_rate_d_smooth,
287 &sysfs_writeback_rate_debug, 293 &sysfs_writeback_rate_debug,
288 &sysfs_dirty_data, 294 &sysfs_dirty_data,
295 &sysfs_stripe_size,
296 &sysfs_partial_stripes_expensive,
289 &sysfs_sequential_cutoff, 297 &sysfs_sequential_cutoff,
290 &sysfs_sequential_merge, 298 &sysfs_sequential_merge,
291 &sysfs_clear_stats, 299 &sysfs_clear_stats,
292 &sysfs_running, 300 &sysfs_running,
293 &sysfs_state, 301 &sysfs_state,
294 &sysfs_label, 302 &sysfs_label,
295 &sysfs_readahead, 303 &sysfs_readahead,
296 #ifdef CONFIG_BCACHE_DEBUG 304 #ifdef CONFIG_BCACHE_DEBUG
297 &sysfs_verify, 305 &sysfs_verify,
298 #endif 306 #endif
299 NULL 307 NULL
300 }; 308 };
301 KTYPE(bch_cached_dev); 309 KTYPE(bch_cached_dev);
302 310
303 SHOW(bch_flash_dev) 311 SHOW(bch_flash_dev)
304 { 312 {
305 struct bcache_device *d = container_of(kobj, struct bcache_device, 313 struct bcache_device *d = container_of(kobj, struct bcache_device,
306 kobj); 314 kobj);
307 struct uuid_entry *u = &d->c->uuids[d->id]; 315 struct uuid_entry *u = &d->c->uuids[d->id];
308 316
309 sysfs_printf(data_csum, "%i", d->data_csum); 317 sysfs_printf(data_csum, "%i", d->data_csum);
310 sysfs_hprint(size, u->sectors << 9); 318 sysfs_hprint(size, u->sectors << 9);
311 319
312 if (attr == &sysfs_label) { 320 if (attr == &sysfs_label) {
313 memcpy(buf, u->label, SB_LABEL_SIZE); 321 memcpy(buf, u->label, SB_LABEL_SIZE);
314 buf[SB_LABEL_SIZE + 1] = '\0'; 322 buf[SB_LABEL_SIZE + 1] = '\0';
315 strcat(buf, "\n"); 323 strcat(buf, "\n");
316 return strlen(buf); 324 return strlen(buf);
317 } 325 }
318 326
319 return 0; 327 return 0;
320 } 328 }
321 329
322 STORE(__bch_flash_dev) 330 STORE(__bch_flash_dev)
323 { 331 {
324 struct bcache_device *d = container_of(kobj, struct bcache_device, 332 struct bcache_device *d = container_of(kobj, struct bcache_device,
325 kobj); 333 kobj);
326 struct uuid_entry *u = &d->c->uuids[d->id]; 334 struct uuid_entry *u = &d->c->uuids[d->id];
327 335
328 sysfs_strtoul(data_csum, d->data_csum); 336 sysfs_strtoul(data_csum, d->data_csum);
329 337
330 if (attr == &sysfs_size) { 338 if (attr == &sysfs_size) {
331 uint64_t v; 339 uint64_t v;
332 strtoi_h_or_return(buf, v); 340 strtoi_h_or_return(buf, v);
333 341
334 u->sectors = v >> 9; 342 u->sectors = v >> 9;
335 bch_uuid_write(d->c); 343 bch_uuid_write(d->c);
336 set_capacity(d->disk, u->sectors); 344 set_capacity(d->disk, u->sectors);
337 } 345 }
338 346
339 if (attr == &sysfs_label) { 347 if (attr == &sysfs_label) {
340 memcpy(u->label, buf, SB_LABEL_SIZE); 348 memcpy(u->label, buf, SB_LABEL_SIZE);
341 bch_uuid_write(d->c); 349 bch_uuid_write(d->c);
342 } 350 }
343 351
344 if (attr == &sysfs_unregister) { 352 if (attr == &sysfs_unregister) {
345 atomic_set(&d->detaching, 1); 353 atomic_set(&d->detaching, 1);
346 bcache_device_stop(d); 354 bcache_device_stop(d);
347 } 355 }
348 356
349 return size; 357 return size;
350 } 358 }
351 STORE_LOCKED(bch_flash_dev) 359 STORE_LOCKED(bch_flash_dev)
352 360
353 static struct attribute *bch_flash_dev_files[] = { 361 static struct attribute *bch_flash_dev_files[] = {
354 &sysfs_unregister, 362 &sysfs_unregister,
355 #if 0 363 #if 0
356 &sysfs_data_csum, 364 &sysfs_data_csum,
357 #endif 365 #endif
358 &sysfs_label, 366 &sysfs_label,
359 &sysfs_size, 367 &sysfs_size,
360 NULL 368 NULL
361 }; 369 };
362 KTYPE(bch_flash_dev); 370 KTYPE(bch_flash_dev);
363 371
364 SHOW(__bch_cache_set) 372 SHOW(__bch_cache_set)
365 { 373 {
366 unsigned root_usage(struct cache_set *c) 374 unsigned root_usage(struct cache_set *c)
367 { 375 {
368 unsigned bytes = 0; 376 unsigned bytes = 0;
369 struct bkey *k; 377 struct bkey *k;
370 struct btree *b; 378 struct btree *b;
371 struct btree_iter iter; 379 struct btree_iter iter;
372 380
373 goto lock_root; 381 goto lock_root;
374 382
375 do { 383 do {
376 rw_unlock(false, b); 384 rw_unlock(false, b);
377 lock_root: 385 lock_root:
378 b = c->root; 386 b = c->root;
379 rw_lock(false, b, b->level); 387 rw_lock(false, b, b->level);
380 } while (b != c->root); 388 } while (b != c->root);
381 389
382 for_each_key_filter(b, k, &iter, bch_ptr_bad) 390 for_each_key_filter(b, k, &iter, bch_ptr_bad)
383 bytes += bkey_bytes(k); 391 bytes += bkey_bytes(k);
384 392
385 rw_unlock(false, b); 393 rw_unlock(false, b);
386 394
387 return (bytes * 100) / btree_bytes(c); 395 return (bytes * 100) / btree_bytes(c);
388 } 396 }
389 397
390 size_t cache_size(struct cache_set *c) 398 size_t cache_size(struct cache_set *c)
391 { 399 {
392 size_t ret = 0; 400 size_t ret = 0;
393 struct btree *b; 401 struct btree *b;
394 402
395 mutex_lock(&c->bucket_lock); 403 mutex_lock(&c->bucket_lock);
396 list_for_each_entry(b, &c->btree_cache, list) 404 list_for_each_entry(b, &c->btree_cache, list)
397 ret += 1 << (b->page_order + PAGE_SHIFT); 405 ret += 1 << (b->page_order + PAGE_SHIFT);
398 406
399 mutex_unlock(&c->bucket_lock); 407 mutex_unlock(&c->bucket_lock);
400 return ret; 408 return ret;
401 } 409 }
402 410
403 unsigned cache_max_chain(struct cache_set *c) 411 unsigned cache_max_chain(struct cache_set *c)
404 { 412 {
405 unsigned ret = 0; 413 unsigned ret = 0;
406 struct hlist_head *h; 414 struct hlist_head *h;
407 415
408 mutex_lock(&c->bucket_lock); 416 mutex_lock(&c->bucket_lock);
409 417
410 for (h = c->bucket_hash; 418 for (h = c->bucket_hash;
411 h < c->bucket_hash + (1 << BUCKET_HASH_BITS); 419 h < c->bucket_hash + (1 << BUCKET_HASH_BITS);
412 h++) { 420 h++) {
413 unsigned i = 0; 421 unsigned i = 0;
414 struct hlist_node *p; 422 struct hlist_node *p;
415 423
416 hlist_for_each(p, h) 424 hlist_for_each(p, h)
417 i++; 425 i++;
418 426
419 ret = max(ret, i); 427 ret = max(ret, i);
420 } 428 }
421 429
422 mutex_unlock(&c->bucket_lock); 430 mutex_unlock(&c->bucket_lock);
423 return ret; 431 return ret;
424 } 432 }
425 433
426 unsigned btree_used(struct cache_set *c) 434 unsigned btree_used(struct cache_set *c)
427 { 435 {
428 return div64_u64(c->gc_stats.key_bytes * 100, 436 return div64_u64(c->gc_stats.key_bytes * 100,
429 (c->gc_stats.nodes ?: 1) * btree_bytes(c)); 437 (c->gc_stats.nodes ?: 1) * btree_bytes(c));
430 } 438 }
431 439
432 unsigned average_key_size(struct cache_set *c) 440 unsigned average_key_size(struct cache_set *c)
433 { 441 {
434 return c->gc_stats.nkeys 442 return c->gc_stats.nkeys
435 ? div64_u64(c->gc_stats.data, c->gc_stats.nkeys) 443 ? div64_u64(c->gc_stats.data, c->gc_stats.nkeys)
436 : 0; 444 : 0;
437 } 445 }
438 446
439 struct cache_set *c = container_of(kobj, struct cache_set, kobj); 447 struct cache_set *c = container_of(kobj, struct cache_set, kobj);
440 448
441 sysfs_print(synchronous, CACHE_SYNC(&c->sb)); 449 sysfs_print(synchronous, CACHE_SYNC(&c->sb));
442 sysfs_print(journal_delay_ms, c->journal_delay_ms); 450 sysfs_print(journal_delay_ms, c->journal_delay_ms);
443 sysfs_hprint(bucket_size, bucket_bytes(c)); 451 sysfs_hprint(bucket_size, bucket_bytes(c));
444 sysfs_hprint(block_size, block_bytes(c)); 452 sysfs_hprint(block_size, block_bytes(c));
445 sysfs_print(tree_depth, c->root->level); 453 sysfs_print(tree_depth, c->root->level);
446 sysfs_print(root_usage_percent, root_usage(c)); 454 sysfs_print(root_usage_percent, root_usage(c));
447 455
448 sysfs_hprint(btree_cache_size, cache_size(c)); 456 sysfs_hprint(btree_cache_size, cache_size(c));
449 sysfs_print(btree_cache_max_chain, cache_max_chain(c)); 457 sysfs_print(btree_cache_max_chain, cache_max_chain(c));
450 sysfs_print(cache_available_percent, 100 - c->gc_stats.in_use); 458 sysfs_print(cache_available_percent, 100 - c->gc_stats.in_use);
451 459
452 sysfs_print_time_stats(&c->btree_gc_time, btree_gc, sec, ms); 460 sysfs_print_time_stats(&c->btree_gc_time, btree_gc, sec, ms);
453 sysfs_print_time_stats(&c->btree_split_time, btree_split, sec, us); 461 sysfs_print_time_stats(&c->btree_split_time, btree_split, sec, us);
454 sysfs_print_time_stats(&c->sort_time, btree_sort, ms, us); 462 sysfs_print_time_stats(&c->sort_time, btree_sort, ms, us);
455 sysfs_print_time_stats(&c->btree_read_time, btree_read, ms, us); 463 sysfs_print_time_stats(&c->btree_read_time, btree_read, ms, us);
456 sysfs_print_time_stats(&c->try_harder_time, try_harder, ms, us); 464 sysfs_print_time_stats(&c->try_harder_time, try_harder, ms, us);
457 465
458 sysfs_print(btree_used_percent, btree_used(c)); 466 sysfs_print(btree_used_percent, btree_used(c));
459 sysfs_print(btree_nodes, c->gc_stats.nodes); 467 sysfs_print(btree_nodes, c->gc_stats.nodes);
460 sysfs_hprint(dirty_data, c->gc_stats.dirty); 468 sysfs_hprint(dirty_data, c->gc_stats.dirty);
461 sysfs_hprint(average_key_size, average_key_size(c)); 469 sysfs_hprint(average_key_size, average_key_size(c));
462 470
463 sysfs_print(cache_read_races, 471 sysfs_print(cache_read_races,
464 atomic_long_read(&c->cache_read_races)); 472 atomic_long_read(&c->cache_read_races));
465 473
466 sysfs_print(writeback_keys_done, 474 sysfs_print(writeback_keys_done,
467 atomic_long_read(&c->writeback_keys_done)); 475 atomic_long_read(&c->writeback_keys_done));
468 sysfs_print(writeback_keys_failed, 476 sysfs_print(writeback_keys_failed,
469 atomic_long_read(&c->writeback_keys_failed)); 477 atomic_long_read(&c->writeback_keys_failed));
470 478
471 /* See count_io_errors for why 88 */ 479 /* See count_io_errors for why 88 */
472 sysfs_print(io_error_halflife, c->error_decay * 88); 480 sysfs_print(io_error_halflife, c->error_decay * 88);
473 sysfs_print(io_error_limit, c->error_limit >> IO_ERROR_SHIFT); 481 sysfs_print(io_error_limit, c->error_limit >> IO_ERROR_SHIFT);
474 482
475 sysfs_hprint(congested, 483 sysfs_hprint(congested,
476 ((uint64_t) bch_get_congested(c)) << 9); 484 ((uint64_t) bch_get_congested(c)) << 9);
477 sysfs_print(congested_read_threshold_us, 485 sysfs_print(congested_read_threshold_us,
478 c->congested_read_threshold_us); 486 c->congested_read_threshold_us);
479 sysfs_print(congested_write_threshold_us, 487 sysfs_print(congested_write_threshold_us,
480 c->congested_write_threshold_us); 488 c->congested_write_threshold_us);
481 489
482 sysfs_print(active_journal_entries, fifo_used(&c->journal.pin)); 490 sysfs_print(active_journal_entries, fifo_used(&c->journal.pin));
483 sysfs_printf(verify, "%i", c->verify); 491 sysfs_printf(verify, "%i", c->verify);
484 sysfs_printf(key_merging_disabled, "%i", c->key_merging_disabled); 492 sysfs_printf(key_merging_disabled, "%i", c->key_merging_disabled);
485 sysfs_printf(gc_always_rewrite, "%i", c->gc_always_rewrite); 493 sysfs_printf(gc_always_rewrite, "%i", c->gc_always_rewrite);
486 sysfs_printf(btree_shrinker_disabled, "%i", c->shrinker_disabled); 494 sysfs_printf(btree_shrinker_disabled, "%i", c->shrinker_disabled);
487 sysfs_printf(copy_gc_enabled, "%i", c->copy_gc_enabled); 495 sysfs_printf(copy_gc_enabled, "%i", c->copy_gc_enabled);
488 496
489 if (attr == &sysfs_bset_tree_stats) 497 if (attr == &sysfs_bset_tree_stats)
490 return bch_bset_print_stats(c, buf); 498 return bch_bset_print_stats(c, buf);
491 499
492 return 0; 500 return 0;
493 } 501 }
494 SHOW_LOCKED(bch_cache_set) 502 SHOW_LOCKED(bch_cache_set)
495 503
496 STORE(__bch_cache_set) 504 STORE(__bch_cache_set)
497 { 505 {
498 struct cache_set *c = container_of(kobj, struct cache_set, kobj); 506 struct cache_set *c = container_of(kobj, struct cache_set, kobj);
499 507
500 if (attr == &sysfs_unregister) 508 if (attr == &sysfs_unregister)
501 bch_cache_set_unregister(c); 509 bch_cache_set_unregister(c);
502 510
503 if (attr == &sysfs_stop) 511 if (attr == &sysfs_stop)
504 bch_cache_set_stop(c); 512 bch_cache_set_stop(c);
505 513
506 if (attr == &sysfs_synchronous) { 514 if (attr == &sysfs_synchronous) {
507 bool sync = strtoul_or_return(buf); 515 bool sync = strtoul_or_return(buf);
508 516
509 if (sync != CACHE_SYNC(&c->sb)) { 517 if (sync != CACHE_SYNC(&c->sb)) {
510 SET_CACHE_SYNC(&c->sb, sync); 518 SET_CACHE_SYNC(&c->sb, sync);
511 bcache_write_super(c); 519 bcache_write_super(c);
512 } 520 }
513 } 521 }
514 522
515 if (attr == &sysfs_flash_vol_create) { 523 if (attr == &sysfs_flash_vol_create) {
516 int r; 524 int r;
517 uint64_t v; 525 uint64_t v;
518 strtoi_h_or_return(buf, v); 526 strtoi_h_or_return(buf, v);
519 527
520 r = bch_flash_dev_create(c, v); 528 r = bch_flash_dev_create(c, v);
521 if (r) 529 if (r)
522 return r; 530 return r;
523 } 531 }
524 532
525 if (attr == &sysfs_clear_stats) { 533 if (attr == &sysfs_clear_stats) {
526 atomic_long_set(&c->writeback_keys_done, 0); 534 atomic_long_set(&c->writeback_keys_done, 0);
527 atomic_long_set(&c->writeback_keys_failed, 0); 535 atomic_long_set(&c->writeback_keys_failed, 0);
528 536
529 memset(&c->gc_stats, 0, sizeof(struct gc_stat)); 537 memset(&c->gc_stats, 0, sizeof(struct gc_stat));
530 bch_cache_accounting_clear(&c->accounting); 538 bch_cache_accounting_clear(&c->accounting);
531 } 539 }
532 540
533 if (attr == &sysfs_trigger_gc) 541 if (attr == &sysfs_trigger_gc)
534 bch_queue_gc(c); 542 bch_queue_gc(c);
535 543
536 if (attr == &sysfs_prune_cache) { 544 if (attr == &sysfs_prune_cache) {
537 struct shrink_control sc; 545 struct shrink_control sc;
538 sc.gfp_mask = GFP_KERNEL; 546 sc.gfp_mask = GFP_KERNEL;
539 sc.nr_to_scan = strtoul_or_return(buf); 547 sc.nr_to_scan = strtoul_or_return(buf);
540 c->shrink.shrink(&c->shrink, &sc); 548 c->shrink.shrink(&c->shrink, &sc);
541 } 549 }
542 550
543 sysfs_strtoul(congested_read_threshold_us, 551 sysfs_strtoul(congested_read_threshold_us,
544 c->congested_read_threshold_us); 552 c->congested_read_threshold_us);
545 sysfs_strtoul(congested_write_threshold_us, 553 sysfs_strtoul(congested_write_threshold_us,
546 c->congested_write_threshold_us); 554 c->congested_write_threshold_us);
547 555
548 if (attr == &sysfs_io_error_limit) 556 if (attr == &sysfs_io_error_limit)
549 c->error_limit = strtoul_or_return(buf) << IO_ERROR_SHIFT; 557 c->error_limit = strtoul_or_return(buf) << IO_ERROR_SHIFT;
550 558
551 /* See count_io_errors() for why 88 */ 559 /* See count_io_errors() for why 88 */
552 if (attr == &sysfs_io_error_halflife) 560 if (attr == &sysfs_io_error_halflife)
553 c->error_decay = strtoul_or_return(buf) / 88; 561 c->error_decay = strtoul_or_return(buf) / 88;
554 562
555 sysfs_strtoul(journal_delay_ms, c->journal_delay_ms); 563 sysfs_strtoul(journal_delay_ms, c->journal_delay_ms);
556 sysfs_strtoul(verify, c->verify); 564 sysfs_strtoul(verify, c->verify);
557 sysfs_strtoul(key_merging_disabled, c->key_merging_disabled); 565 sysfs_strtoul(key_merging_disabled, c->key_merging_disabled);
558 sysfs_strtoul(gc_always_rewrite, c->gc_always_rewrite); 566 sysfs_strtoul(gc_always_rewrite, c->gc_always_rewrite);
559 sysfs_strtoul(btree_shrinker_disabled, c->shrinker_disabled); 567 sysfs_strtoul(btree_shrinker_disabled, c->shrinker_disabled);
560 sysfs_strtoul(copy_gc_enabled, c->copy_gc_enabled); 568 sysfs_strtoul(copy_gc_enabled, c->copy_gc_enabled);
561 569
562 return size; 570 return size;
563 } 571 }
564 STORE_LOCKED(bch_cache_set) 572 STORE_LOCKED(bch_cache_set)
565 573
566 SHOW(bch_cache_set_internal) 574 SHOW(bch_cache_set_internal)
567 { 575 {
568 struct cache_set *c = container_of(kobj, struct cache_set, internal); 576 struct cache_set *c = container_of(kobj, struct cache_set, internal);
569 return bch_cache_set_show(&c->kobj, attr, buf); 577 return bch_cache_set_show(&c->kobj, attr, buf);
570 } 578 }
571 579
572 STORE(bch_cache_set_internal) 580 STORE(bch_cache_set_internal)
573 { 581 {
574 struct cache_set *c = container_of(kobj, struct cache_set, internal); 582 struct cache_set *c = container_of(kobj, struct cache_set, internal);
575 return bch_cache_set_store(&c->kobj, attr, buf, size); 583 return bch_cache_set_store(&c->kobj, attr, buf, size);
576 } 584 }
577 585
578 static void bch_cache_set_internal_release(struct kobject *k) 586 static void bch_cache_set_internal_release(struct kobject *k)
579 { 587 {
580 } 588 }
581 589
582 static struct attribute *bch_cache_set_files[] = { 590 static struct attribute *bch_cache_set_files[] = {
583 &sysfs_unregister, 591 &sysfs_unregister,
584 &sysfs_stop, 592 &sysfs_stop,
585 &sysfs_synchronous, 593 &sysfs_synchronous,
586 &sysfs_journal_delay_ms, 594 &sysfs_journal_delay_ms,
587 &sysfs_flash_vol_create, 595 &sysfs_flash_vol_create,
588 596
589 &sysfs_bucket_size, 597 &sysfs_bucket_size,
590 &sysfs_block_size, 598 &sysfs_block_size,
591 &sysfs_tree_depth, 599 &sysfs_tree_depth,
592 &sysfs_root_usage_percent, 600 &sysfs_root_usage_percent,
593 &sysfs_btree_cache_size, 601 &sysfs_btree_cache_size,
594 &sysfs_cache_available_percent, 602 &sysfs_cache_available_percent,
595 603
596 &sysfs_average_key_size, 604 &sysfs_average_key_size,
597 &sysfs_dirty_data, 605 &sysfs_dirty_data,
598 606
599 &sysfs_io_error_limit, 607 &sysfs_io_error_limit,
600 &sysfs_io_error_halflife, 608 &sysfs_io_error_halflife,
601 &sysfs_congested, 609 &sysfs_congested,
602 &sysfs_congested_read_threshold_us, 610 &sysfs_congested_read_threshold_us,
603 &sysfs_congested_write_threshold_us, 611 &sysfs_congested_write_threshold_us,
604 &sysfs_clear_stats, 612 &sysfs_clear_stats,
605 NULL 613 NULL
606 }; 614 };
607 KTYPE(bch_cache_set); 615 KTYPE(bch_cache_set);
608 616
609 static struct attribute *bch_cache_set_internal_files[] = { 617 static struct attribute *bch_cache_set_internal_files[] = {
610 &sysfs_active_journal_entries, 618 &sysfs_active_journal_entries,
611 619
612 sysfs_time_stats_attribute_list(btree_gc, sec, ms) 620 sysfs_time_stats_attribute_list(btree_gc, sec, ms)
613 sysfs_time_stats_attribute_list(btree_split, sec, us) 621 sysfs_time_stats_attribute_list(btree_split, sec, us)
614 sysfs_time_stats_attribute_list(btree_sort, ms, us) 622 sysfs_time_stats_attribute_list(btree_sort, ms, us)
615 sysfs_time_stats_attribute_list(btree_read, ms, us) 623 sysfs_time_stats_attribute_list(btree_read, ms, us)
616 sysfs_time_stats_attribute_list(try_harder, ms, us) 624 sysfs_time_stats_attribute_list(try_harder, ms, us)
617 625
618 &sysfs_btree_nodes, 626 &sysfs_btree_nodes,
619 &sysfs_btree_used_percent, 627 &sysfs_btree_used_percent,
620 &sysfs_btree_cache_max_chain, 628 &sysfs_btree_cache_max_chain,
621 629
622 &sysfs_bset_tree_stats, 630 &sysfs_bset_tree_stats,
623 &sysfs_cache_read_races, 631 &sysfs_cache_read_races,
624 &sysfs_writeback_keys_done, 632 &sysfs_writeback_keys_done,
625 &sysfs_writeback_keys_failed, 633 &sysfs_writeback_keys_failed,
626 634
627 &sysfs_trigger_gc, 635 &sysfs_trigger_gc,
628 &sysfs_prune_cache, 636 &sysfs_prune_cache,
629 #ifdef CONFIG_BCACHE_DEBUG 637 #ifdef CONFIG_BCACHE_DEBUG
630 &sysfs_verify, 638 &sysfs_verify,
631 &sysfs_key_merging_disabled, 639 &sysfs_key_merging_disabled,
632 #endif 640 #endif
633 &sysfs_gc_always_rewrite, 641 &sysfs_gc_always_rewrite,
634 &sysfs_btree_shrinker_disabled, 642 &sysfs_btree_shrinker_disabled,
635 &sysfs_copy_gc_enabled, 643 &sysfs_copy_gc_enabled,
636 NULL 644 NULL
637 }; 645 };
638 KTYPE(bch_cache_set_internal); 646 KTYPE(bch_cache_set_internal);
639 647
640 SHOW(__bch_cache) 648 SHOW(__bch_cache)
641 { 649 {
642 struct cache *ca = container_of(kobj, struct cache, kobj); 650 struct cache *ca = container_of(kobj, struct cache, kobj);
643 651
644 sysfs_hprint(bucket_size, bucket_bytes(ca)); 652 sysfs_hprint(bucket_size, bucket_bytes(ca));
645 sysfs_hprint(block_size, block_bytes(ca)); 653 sysfs_hprint(block_size, block_bytes(ca));
646 sysfs_print(nbuckets, ca->sb.nbuckets); 654 sysfs_print(nbuckets, ca->sb.nbuckets);
647 sysfs_print(discard, ca->discard); 655 sysfs_print(discard, ca->discard);
648 sysfs_hprint(written, atomic_long_read(&ca->sectors_written) << 9); 656 sysfs_hprint(written, atomic_long_read(&ca->sectors_written) << 9);
649 sysfs_hprint(btree_written, 657 sysfs_hprint(btree_written,
650 atomic_long_read(&ca->btree_sectors_written) << 9); 658 atomic_long_read(&ca->btree_sectors_written) << 9);
651 sysfs_hprint(metadata_written, 659 sysfs_hprint(metadata_written,
652 (atomic_long_read(&ca->meta_sectors_written) + 660 (atomic_long_read(&ca->meta_sectors_written) +
653 atomic_long_read(&ca->btree_sectors_written)) << 9); 661 atomic_long_read(&ca->btree_sectors_written)) << 9);
654 662
655 sysfs_print(io_errors, 663 sysfs_print(io_errors,
656 atomic_read(&ca->io_errors) >> IO_ERROR_SHIFT); 664 atomic_read(&ca->io_errors) >> IO_ERROR_SHIFT);
657 665
658 sysfs_print(freelist_percent, ca->free.size * 100 / 666 sysfs_print(freelist_percent, ca->free.size * 100 /
659 ((size_t) ca->sb.nbuckets)); 667 ((size_t) ca->sb.nbuckets));
660 668
661 if (attr == &sysfs_cache_replacement_policy) 669 if (attr == &sysfs_cache_replacement_policy)
662 return bch_snprint_string_list(buf, PAGE_SIZE, 670 return bch_snprint_string_list(buf, PAGE_SIZE,
663 cache_replacement_policies, 671 cache_replacement_policies,
664 CACHE_REPLACEMENT(&ca->sb)); 672 CACHE_REPLACEMENT(&ca->sb));
665 673
666 if (attr == &sysfs_priority_stats) { 674 if (attr == &sysfs_priority_stats) {
667 int cmp(const void *l, const void *r) 675 int cmp(const void *l, const void *r)
668 { return *((uint16_t *) r) - *((uint16_t *) l); } 676 { return *((uint16_t *) r) - *((uint16_t *) l); }
669 677
670 size_t n = ca->sb.nbuckets, i, unused, btree; 678 size_t n = ca->sb.nbuckets, i, unused, btree;
671 uint64_t sum = 0; 679 uint64_t sum = 0;
672 /* Compute 31 quantiles */ 680 /* Compute 31 quantiles */
673 uint16_t q[31], *p, *cached; 681 uint16_t q[31], *p, *cached;
674 ssize_t ret; 682 ssize_t ret;
675 683
676 cached = p = vmalloc(ca->sb.nbuckets * sizeof(uint16_t)); 684 cached = p = vmalloc(ca->sb.nbuckets * sizeof(uint16_t));
677 if (!p) 685 if (!p)
678 return -ENOMEM; 686 return -ENOMEM;
679 687
680 mutex_lock(&ca->set->bucket_lock); 688 mutex_lock(&ca->set->bucket_lock);
681 for (i = ca->sb.first_bucket; i < n; i++) 689 for (i = ca->sb.first_bucket; i < n; i++)
682 p[i] = ca->buckets[i].prio; 690 p[i] = ca->buckets[i].prio;
683 mutex_unlock(&ca->set->bucket_lock); 691 mutex_unlock(&ca->set->bucket_lock);
684 692
685 sort(p, n, sizeof(uint16_t), cmp, NULL); 693 sort(p, n, sizeof(uint16_t), cmp, NULL);
686 694
687 while (n && 695 while (n &&
688 !cached[n - 1]) 696 !cached[n - 1])
689 --n; 697 --n;
690 698
691 unused = ca->sb.nbuckets - n; 699 unused = ca->sb.nbuckets - n;
692 700
693 while (cached < p + n && 701 while (cached < p + n &&
694 *cached == BTREE_PRIO) 702 *cached == BTREE_PRIO)
695 cached++; 703 cached++;
696 704
697 btree = cached - p; 705 btree = cached - p;
698 n -= btree; 706 n -= btree;
699 707
700 for (i = 0; i < n; i++) 708 for (i = 0; i < n; i++)
701 sum += INITIAL_PRIO - cached[i]; 709 sum += INITIAL_PRIO - cached[i];
702 710
703 if (n) 711 if (n)
704 do_div(sum, n); 712 do_div(sum, n);
705 713
706 for (i = 0; i < ARRAY_SIZE(q); i++) 714 for (i = 0; i < ARRAY_SIZE(q); i++)
707 q[i] = INITIAL_PRIO - cached[n * (i + 1) / 715 q[i] = INITIAL_PRIO - cached[n * (i + 1) /
708 (ARRAY_SIZE(q) + 1)]; 716 (ARRAY_SIZE(q) + 1)];
709 717
710 vfree(p); 718 vfree(p);
711 719
712 ret = scnprintf(buf, PAGE_SIZE, 720 ret = scnprintf(buf, PAGE_SIZE,
713 "Unused: %zu%%\n" 721 "Unused: %zu%%\n"
714 "Metadata: %zu%%\n" 722 "Metadata: %zu%%\n"
715 "Average: %llu\n" 723 "Average: %llu\n"
716 "Sectors per Q: %zu\n" 724 "Sectors per Q: %zu\n"
717 "Quantiles: [", 725 "Quantiles: [",
718 unused * 100 / (size_t) ca->sb.nbuckets, 726 unused * 100 / (size_t) ca->sb.nbuckets,
719 btree * 100 / (size_t) ca->sb.nbuckets, sum, 727 btree * 100 / (size_t) ca->sb.nbuckets, sum,
720 n * ca->sb.bucket_size / (ARRAY_SIZE(q) + 1)); 728 n * ca->sb.bucket_size / (ARRAY_SIZE(q) + 1));
721 729
722 for (i = 0; i < ARRAY_SIZE(q); i++) 730 for (i = 0; i < ARRAY_SIZE(q); i++)
723 ret += scnprintf(buf + ret, PAGE_SIZE - ret, 731 ret += scnprintf(buf + ret, PAGE_SIZE - ret,
724 "%u ", q[i]); 732 "%u ", q[i]);
725 ret--; 733 ret--;
726 734
727 ret += scnprintf(buf + ret, PAGE_SIZE - ret, "]\n"); 735 ret += scnprintf(buf + ret, PAGE_SIZE - ret, "]\n");
728 736
729 return ret; 737 return ret;
730 } 738 }
731 739
732 return 0; 740 return 0;
733 } 741 }
734 SHOW_LOCKED(bch_cache) 742 SHOW_LOCKED(bch_cache)
735 743
736 STORE(__bch_cache) 744 STORE(__bch_cache)
737 { 745 {
738 struct cache *ca = container_of(kobj, struct cache, kobj); 746 struct cache *ca = container_of(kobj, struct cache, kobj);
739 747
740 if (attr == &sysfs_discard) { 748 if (attr == &sysfs_discard) {
741 bool v = strtoul_or_return(buf); 749 bool v = strtoul_or_return(buf);
742 750
743 if (blk_queue_discard(bdev_get_queue(ca->bdev))) 751 if (blk_queue_discard(bdev_get_queue(ca->bdev)))
744 ca->discard = v; 752 ca->discard = v;
745 753
746 if (v != CACHE_DISCARD(&ca->sb)) { 754 if (v != CACHE_DISCARD(&ca->sb)) {
747 SET_CACHE_DISCARD(&ca->sb, v); 755 SET_CACHE_DISCARD(&ca->sb, v);
748 bcache_write_super(ca->set); 756 bcache_write_super(ca->set);
749 } 757 }
750 } 758 }
751 759
752 if (attr == &sysfs_cache_replacement_policy) { 760 if (attr == &sysfs_cache_replacement_policy) {
753 ssize_t v = bch_read_string_list(buf, cache_replacement_policies); 761 ssize_t v = bch_read_string_list(buf, cache_replacement_policies);
754 762
755 if (v < 0) 763 if (v < 0)
756 return v; 764 return v;
757 765
758 if ((unsigned) v != CACHE_REPLACEMENT(&ca->sb)) { 766 if ((unsigned) v != CACHE_REPLACEMENT(&ca->sb)) {
759 mutex_lock(&ca->set->bucket_lock); 767 mutex_lock(&ca->set->bucket_lock);
760 SET_CACHE_REPLACEMENT(&ca->sb, v); 768 SET_CACHE_REPLACEMENT(&ca->sb, v);
761 mutex_unlock(&ca->set->bucket_lock); 769 mutex_unlock(&ca->set->bucket_lock);
762 770
763 bcache_write_super(ca->set); 771 bcache_write_super(ca->set);
764 } 772 }
765 } 773 }
766 774
767 if (attr == &sysfs_freelist_percent) { 775 if (attr == &sysfs_freelist_percent) {
768 DECLARE_FIFO(long, free); 776 DECLARE_FIFO(long, free);
769 long i; 777 long i;
770 size_t p = strtoul_or_return(buf); 778 size_t p = strtoul_or_return(buf);
771 779
772 p = clamp_t(size_t, 780 p = clamp_t(size_t,
773 ((size_t) ca->sb.nbuckets * p) / 100, 781 ((size_t) ca->sb.nbuckets * p) / 100,
774 roundup_pow_of_two(ca->sb.nbuckets) >> 9, 782 roundup_pow_of_two(ca->sb.nbuckets) >> 9,
775 ca->sb.nbuckets / 2); 783 ca->sb.nbuckets / 2);
776 784
777 if (!init_fifo_exact(&free, p, GFP_KERNEL)) 785 if (!init_fifo_exact(&free, p, GFP_KERNEL))
778 return -ENOMEM; 786 return -ENOMEM;
779 787
780 mutex_lock(&ca->set->bucket_lock); 788 mutex_lock(&ca->set->bucket_lock);
781 789
782 fifo_move(&free, &ca->free); 790 fifo_move(&free, &ca->free);
783 fifo_swap(&free, &ca->free); 791 fifo_swap(&free, &ca->free);
784 792
785 mutex_unlock(&ca->set->bucket_lock); 793 mutex_unlock(&ca->set->bucket_lock);
786 794
787 while (fifo_pop(&free, i)) 795 while (fifo_pop(&free, i))
788 atomic_dec(&ca->buckets[i].pin); 796 atomic_dec(&ca->buckets[i].pin);
789 797
790 free_fifo(&free); 798 free_fifo(&free);
791 } 799 }
792 800
793 if (attr == &sysfs_clear_stats) { 801 if (attr == &sysfs_clear_stats) {
794 atomic_long_set(&ca->sectors_written, 0); 802 atomic_long_set(&ca->sectors_written, 0);
795 atomic_long_set(&ca->btree_sectors_written, 0); 803 atomic_long_set(&ca->btree_sectors_written, 0);
796 atomic_long_set(&ca->meta_sectors_written, 0); 804 atomic_long_set(&ca->meta_sectors_written, 0);
797 atomic_set(&ca->io_count, 0); 805 atomic_set(&ca->io_count, 0);
798 atomic_set(&ca->io_errors, 0); 806 atomic_set(&ca->io_errors, 0);
799 } 807 }
800 808
801 return size; 809 return size;
802 } 810 }
803 STORE_LOCKED(bch_cache) 811 STORE_LOCKED(bch_cache)
804 812
805 static struct attribute *bch_cache_files[] = { 813 static struct attribute *bch_cache_files[] = {
806 &sysfs_bucket_size, 814 &sysfs_bucket_size,
807 &sysfs_block_size, 815 &sysfs_block_size,
808 &sysfs_nbuckets, 816 &sysfs_nbuckets,
809 &sysfs_priority_stats, 817 &sysfs_priority_stats,
810 &sysfs_discard, 818 &sysfs_discard,
811 &sysfs_written, 819 &sysfs_written,
812 &sysfs_btree_written, 820 &sysfs_btree_written,
813 &sysfs_metadata_written, 821 &sysfs_metadata_written,
814 &sysfs_io_errors, 822 &sysfs_io_errors,
815 &sysfs_clear_stats, 823 &sysfs_clear_stats,
816 &sysfs_freelist_percent, 824 &sysfs_freelist_percent,
817 &sysfs_cache_replacement_policy, 825 &sysfs_cache_replacement_policy,
818 NULL 826 NULL
819 }; 827 };
820 KTYPE(bch_cache); 828 KTYPE(bch_cache);
821 829
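Two of the attributes in the sysfs.c hunk above are new with this commit: the read-only stripe_size and partial_stripes_expensive entries added to the cached_dev show routine and to bch_cached_dev_files[]. Judging from the sysfs_hprint(stripe_size, (1 << dc->disk.stripe_size_bits) << 9) line above and the KEY_START(k) >> dc->disk.stripe_size_bits computation in the writeback.c hunk below, the stripe size is kept as a power-of-two shift counted in 512-byte sectors, so the value reported to userspace is (1 << stripe_size_bits) sectors scaled to bytes. A tiny standalone sketch of that arithmetic, with the helper name and the example shift invented for illustration:

#include <stdint.h>
#include <stdio.h>

/* Stripe size in bytes = (sectors per stripe) * 512. */
static uint64_t stripe_size_bytes(unsigned stripe_size_bits)
{
	return ((uint64_t)1 << stripe_size_bits) << 9;
}

int main(void)
{
	/* A shift of 10 would mean 1024-sector stripes, i.e. 512 KiB. */
	printf("%llu bytes\n",
	       (unsigned long long)stripe_size_bytes(10));
	return 0;
}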
drivers/md/bcache/writeback.c
1 /* 1 /*
2 * background writeback - scan btree for dirty data and write it to the backing 2 * background writeback - scan btree for dirty data and write it to the backing
3 * device 3 * device
4 * 4 *
5 * Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com> 5 * Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com>
6 * Copyright 2012 Google, Inc. 6 * Copyright 2012 Google, Inc.
7 */ 7 */
8 8
9 #include "bcache.h" 9 #include "bcache.h"
10 #include "btree.h" 10 #include "btree.h"
11 #include "debug.h" 11 #include "debug.h"
12 #include "writeback.h" 12 #include "writeback.h"
13 13
14 #include <trace/events/bcache.h> 14 #include <trace/events/bcache.h>
15 15
16 static struct workqueue_struct *dirty_wq; 16 static struct workqueue_struct *dirty_wq;
17 17
18 static void read_dirty(struct closure *); 18 static void read_dirty(struct closure *);
19 19
20 struct dirty_io { 20 struct dirty_io {
21 struct closure cl; 21 struct closure cl;
22 struct cached_dev *dc; 22 struct cached_dev *dc;
23 struct bio bio; 23 struct bio bio;
24 }; 24 };
25 25
26 /* Rate limiting */ 26 /* Rate limiting */
27 27
28 static void __update_writeback_rate(struct cached_dev *dc) 28 static void __update_writeback_rate(struct cached_dev *dc)
29 { 29 {
30 struct cache_set *c = dc->disk.c; 30 struct cache_set *c = dc->disk.c;
31 uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size; 31 uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size;
32 uint64_t cache_dirty_target = 32 uint64_t cache_dirty_target =
33 div_u64(cache_sectors * dc->writeback_percent, 100); 33 div_u64(cache_sectors * dc->writeback_percent, 100);
34 34
35 int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev), 35 int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
36 c->cached_dev_sectors); 36 c->cached_dev_sectors);
37 37
38 /* PD controller */ 38 /* PD controller */
39 39
40 int change = 0; 40 int change = 0;
41 int64_t error; 41 int64_t error;
42 int64_t dirty = bcache_dev_sectors_dirty(&dc->disk); 42 int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
43 int64_t derivative = dirty - dc->disk.sectors_dirty_last; 43 int64_t derivative = dirty - dc->disk.sectors_dirty_last;
44 44
45 dc->disk.sectors_dirty_last = dirty; 45 dc->disk.sectors_dirty_last = dirty;
46 46
47 derivative *= dc->writeback_rate_d_term; 47 derivative *= dc->writeback_rate_d_term;
48 derivative = clamp(derivative, -dirty, dirty); 48 derivative = clamp(derivative, -dirty, dirty);
49 49
50 derivative = ewma_add(dc->disk.sectors_dirty_derivative, derivative, 50 derivative = ewma_add(dc->disk.sectors_dirty_derivative, derivative,
51 dc->writeback_rate_d_smooth, 0); 51 dc->writeback_rate_d_smooth, 0);
52 52
53 /* Avoid divide by zero */ 53 /* Avoid divide by zero */
54 if (!target) 54 if (!target)
55 goto out; 55 goto out;
56 56
57 error = div64_s64((dirty + derivative - target) << 8, target); 57 error = div64_s64((dirty + derivative - target) << 8, target);
58 58
59 change = div_s64((dc->writeback_rate.rate * error) >> 8, 59 change = div_s64((dc->writeback_rate.rate * error) >> 8,
60 dc->writeback_rate_p_term_inverse); 60 dc->writeback_rate_p_term_inverse);
61 61
62 /* Don't increase writeback rate if the device isn't keeping up */ 62 /* Don't increase writeback rate if the device isn't keeping up */
63 if (change > 0 && 63 if (change > 0 &&
64 time_after64(local_clock(), 64 time_after64(local_clock(),
65 dc->writeback_rate.next + 10 * NSEC_PER_MSEC)) 65 dc->writeback_rate.next + 10 * NSEC_PER_MSEC))
66 change = 0; 66 change = 0;
67 67
68 dc->writeback_rate.rate = 68 dc->writeback_rate.rate =
69 clamp_t(int64_t, dc->writeback_rate.rate + change, 69 clamp_t(int64_t, dc->writeback_rate.rate + change,
70 1, NSEC_PER_MSEC); 70 1, NSEC_PER_MSEC);
71 out: 71 out:
72 dc->writeback_rate_derivative = derivative; 72 dc->writeback_rate_derivative = derivative;
73 dc->writeback_rate_change = change; 73 dc->writeback_rate_change = change;
74 dc->writeback_rate_target = target; 74 dc->writeback_rate_target = target;
75 75
76 schedule_delayed_work(&dc->writeback_rate_update, 76 schedule_delayed_work(&dc->writeback_rate_update,
77 dc->writeback_rate_update_seconds * HZ); 77 dc->writeback_rate_update_seconds * HZ);
78 } 78 }
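
The function above is a proportional-derivative controller: the error is how far the dirty-sector count is from this device's share of the target, and the derivative is the (smoothed) rate of change of dirty data. A minimal userspace sketch of the same arithmetic follows; the constants mirror the dc-> fields initialized later in this file, the ewma smoothing step is omitted, and the shifts are written as multiply/divide for clarity — it is an illustration, not the kernel code.

/* Simplified userspace model of __update_writeback_rate()'s PD step. */
#include <stdint.h>
#include <stdio.h>

#define P_TERM_INVERSE	64		/* mirrors writeback_rate_p_term_inverse */
#define D_TERM		16		/* mirrors writeback_rate_d_term */
#define RATE_MAX	1000000		/* NSEC_PER_MSEC in the kernel */

static int64_t clamp64(int64_t v, int64_t lo, int64_t hi)
{
	return v < lo ? lo : v > hi ? hi : v;
}

/* New rate given the current rate, dirty sectors now and last time, and target. */
static int64_t pd_step(int64_t rate, int64_t dirty, int64_t last_dirty,
		       int64_t target)
{
	int64_t derivative = (dirty - last_dirty) * D_TERM;
	int64_t error, change;

	derivative = clamp64(derivative, -dirty, dirty);

	if (!target)
		return rate;

	/* error is in 1/256ths of the target, as in the kernel's << 8 / >> 8 */
	error = (dirty + derivative - target) * 256 / target;
	change = rate * error / 256 / P_TERM_INVERSE;

	return clamp64(rate + change, 1, RATE_MAX);
}

int main(void)
{
	/* 10000 dirty sectors against an 8000-sector target: rate rises */
	printf("%lld\n", (long long)pd_step(1024, 10000, 9000, 8000));
	return 0;
}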
79 79
80 static void update_writeback_rate(struct work_struct *work) 80 static void update_writeback_rate(struct work_struct *work)
81 { 81 {
82 struct cached_dev *dc = container_of(to_delayed_work(work), 82 struct cached_dev *dc = container_of(to_delayed_work(work),
83 struct cached_dev, 83 struct cached_dev,
84 writeback_rate_update); 84 writeback_rate_update);
85 85
86 down_read(&dc->writeback_lock); 86 down_read(&dc->writeback_lock);
87 87
88 if (atomic_read(&dc->has_dirty) && 88 if (atomic_read(&dc->has_dirty) &&
89 dc->writeback_percent) 89 dc->writeback_percent)
90 __update_writeback_rate(dc); 90 __update_writeback_rate(dc);
91 91
92 up_read(&dc->writeback_lock); 92 up_read(&dc->writeback_lock);
93 } 93 }
94 94
95 static unsigned writeback_delay(struct cached_dev *dc, unsigned sectors) 95 static unsigned writeback_delay(struct cached_dev *dc, unsigned sectors)
96 { 96 {
97 if (atomic_read(&dc->disk.detaching) || 97 if (atomic_read(&dc->disk.detaching) ||
98 !dc->writeback_percent) 98 !dc->writeback_percent)
99 return 0; 99 return 0;
100 100
101 return bch_next_delay(&dc->writeback_rate, sectors * 10000000ULL); 101 return bch_next_delay(&dc->writeback_rate, sectors * 10000000ULL);
102 } 102 }
103 103
104 /* Background writeback */ 104 /* Background writeback */
105 105
106 static bool dirty_pred(struct keybuf *buf, struct bkey *k) 106 static bool dirty_pred(struct keybuf *buf, struct bkey *k)
107 { 107 {
108 return KEY_DIRTY(k); 108 return KEY_DIRTY(k);
109 } 109 }
110 110
111 static bool dirty_full_stripe_pred(struct keybuf *buf, struct bkey *k)
112 {
113 uint64_t stripe;
114 unsigned nr_sectors = KEY_SIZE(k);
115 struct cached_dev *dc = container_of(buf, struct cached_dev,
116 writeback_keys);
117 unsigned stripe_size = 1 << dc->disk.stripe_size_bits;
118
119 if (!KEY_DIRTY(k))
120 return false;
121
122 stripe = KEY_START(k) >> dc->disk.stripe_size_bits;
123 while (1) {
124 if (atomic_read(dc->disk.stripe_sectors_dirty + stripe) !=
125 stripe_size)
126 return false;
127
128 if (nr_sectors <= stripe_size)
129 return true;
130
131 nr_sectors -= stripe_size;
132 stripe++;
133 }
134 }
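
dirty_full_stripe_pred() only admits a key into the writeback keybuf if every stripe the key overlaps is completely dirty, which is what lets the refill path below flush full stripes first. A standalone version of that walk is sketched here; the 64-sector stripe size and the plain counter array are assumptions for illustration, where the kernel reads the atomic per-stripe counters.

/* Sketch: is [start, start + nr_sectors) covered only by full stripes? */
#include <stdbool.h>
#include <stdint.h>

#define STRIPE_SIZE_BITS 6			/* 64-sector stripes, assumed */
#define STRIPE_SIZE	 (1U << STRIPE_SIZE_BITS)

static bool covers_only_full_stripes(const unsigned *stripe_sectors_dirty,
				     uint64_t start, unsigned nr_sectors)
{
	uint64_t stripe = start >> STRIPE_SIZE_BITS;

	while (1) {
		if (stripe_sectors_dirty[stripe] != STRIPE_SIZE)
			return false;
		if (nr_sectors <= STRIPE_SIZE)
			return true;
		nr_sectors -= STRIPE_SIZE;
		stripe++;
	}
}

int main(void)
{
	unsigned dirty[3] = { STRIPE_SIZE, STRIPE_SIZE, 10 };

	/* 128 sectors starting at 0 cover stripes 0-1, both full: true */
	return covers_only_full_stripes(dirty, 0, 128) ? 0 : 1;
}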
135
111 static void dirty_init(struct keybuf_key *w) 136 static void dirty_init(struct keybuf_key *w)
112 { 137 {
113 struct dirty_io *io = w->private; 138 struct dirty_io *io = w->private;
114 struct bio *bio = &io->bio; 139 struct bio *bio = &io->bio;
115 140
116 bio_init(bio); 141 bio_init(bio);
117 if (!io->dc->writeback_percent) 142 if (!io->dc->writeback_percent)
118 bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); 143 bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
119 144
120 bio->bi_size = KEY_SIZE(&w->key) << 9; 145 bio->bi_size = KEY_SIZE(&w->key) << 9;
121 bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS); 146 bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS);
122 bio->bi_private = w; 147 bio->bi_private = w;
123 bio->bi_io_vec = bio->bi_inline_vecs; 148 bio->bi_io_vec = bio->bi_inline_vecs;
124 bch_bio_map(bio, NULL); 149 bch_bio_map(bio, NULL);
125 } 150 }
126 151
127 static void refill_dirty(struct closure *cl) 152 static void refill_dirty(struct closure *cl)
128 { 153 {
129 struct cached_dev *dc = container_of(cl, struct cached_dev, 154 struct cached_dev *dc = container_of(cl, struct cached_dev,
130 writeback.cl); 155 writeback.cl);
131 struct keybuf *buf = &dc->writeback_keys; 156 struct keybuf *buf = &dc->writeback_keys;
132 bool searched_from_start = false; 157 bool searched_from_start = false;
133 struct bkey end = MAX_KEY; 158 struct bkey end = MAX_KEY;
134 SET_KEY_INODE(&end, dc->disk.id); 159 SET_KEY_INODE(&end, dc->disk.id);
135 160
136 if (!atomic_read(&dc->disk.detaching) && 161 if (!atomic_read(&dc->disk.detaching) &&
137 !dc->writeback_running) 162 !dc->writeback_running)
138 closure_return(cl); 163 closure_return(cl);
139 164
140 down_write(&dc->writeback_lock); 165 down_write(&dc->writeback_lock);
141 166
142 if (!atomic_read(&dc->has_dirty)) { 167 if (!atomic_read(&dc->has_dirty)) {
143 SET_BDEV_STATE(&dc->sb, BDEV_STATE_CLEAN); 168 SET_BDEV_STATE(&dc->sb, BDEV_STATE_CLEAN);
144 bch_write_bdev_super(dc, NULL); 169 bch_write_bdev_super(dc, NULL);
145 170
146 up_write(&dc->writeback_lock); 171 up_write(&dc->writeback_lock);
147 closure_return(cl); 172 closure_return(cl);
148 } 173 }
149 174
150 if (bkey_cmp(&buf->last_scanned, &end) >= 0) { 175 if (bkey_cmp(&buf->last_scanned, &end) >= 0) {
151 buf->last_scanned = KEY(dc->disk.id, 0, 0); 176 buf->last_scanned = KEY(dc->disk.id, 0, 0);
152 searched_from_start = true; 177 searched_from_start = true;
153 } 178 }
154 179
155 bch_refill_keybuf(dc->disk.c, buf, &end); 180 if (dc->partial_stripes_expensive) {
181 uint64_t i;
156 182
183 for (i = 0; i < dc->disk.nr_stripes; i++)
184 if (atomic_read(dc->disk.stripe_sectors_dirty + i) ==
185 1 << dc->disk.stripe_size_bits)
186 goto full_stripes;
187
188 goto normal_refill;
189 full_stripes:
190 bch_refill_keybuf(dc->disk.c, buf, &end,
191 dirty_full_stripe_pred);
192 } else {
193 normal_refill:
194 bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred);
195 }
196
157 if (bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start) { 197 if (bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start) {
158 /* Searched the entire btree - delay awhile */ 198 /* Searched the entire btree - delay awhile */
159 199
160 if (RB_EMPTY_ROOT(&buf->keys)) { 200 if (RB_EMPTY_ROOT(&buf->keys)) {
161 atomic_set(&dc->has_dirty, 0); 201 atomic_set(&dc->has_dirty, 0);
162 cached_dev_put(dc); 202 cached_dev_put(dc);
163 } 203 }
164 204
165 if (!atomic_read(&dc->disk.detaching)) 205 if (!atomic_read(&dc->disk.detaching))
166 closure_delay(&dc->writeback, dc->writeback_delay * HZ); 206 closure_delay(&dc->writeback, dc->writeback_delay * HZ);
167 } 207 }
168 208
169 up_write(&dc->writeback_lock); 209 up_write(&dc->writeback_lock);
170 210
171 ratelimit_reset(&dc->writeback_rate); 211 ratelimit_reset(&dc->writeback_rate);
172 212
173 /* Punt to workqueue only so we don't recurse and blow the stack */ 213 /* Punt to workqueue only so we don't recurse and blow the stack */
174 continue_at(cl, read_dirty, dirty_wq); 214 continue_at(cl, read_dirty, dirty_wq);
175 } 215 }
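
When partial stripes are expensive, refill_dirty() first scans the per-stripe dirty counters: if any stripe is completely dirty it refills the keybuf with dirty_full_stripe_pred so those stripes are written out first, otherwise it falls back to dirty_pred so writeback never stalls waiting for a full stripe to appear. A sketch of that scan, in isolation:

/* Sketch of the full-stripe scan refill_dirty() does before refilling. */
#include <stdbool.h>
#include <stdint.h>

static bool any_full_stripe(const unsigned *stripe_sectors_dirty,
			    uint64_t nr_stripes, unsigned stripe_size)
{
	for (uint64_t i = 0; i < nr_stripes; i++)
		if (stripe_sectors_dirty[i] == stripe_size)
			return true;	/* refill with dirty_full_stripe_pred */

	return false;			/* refill with plain dirty_pred */
}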
176 216
177 void bch_writeback_queue(struct cached_dev *dc) 217 void bch_writeback_queue(struct cached_dev *dc)
178 { 218 {
179 if (closure_trylock(&dc->writeback.cl, &dc->disk.cl)) { 219 if (closure_trylock(&dc->writeback.cl, &dc->disk.cl)) {
180 if (!atomic_read(&dc->disk.detaching)) 220 if (!atomic_read(&dc->disk.detaching))
181 closure_delay(&dc->writeback, dc->writeback_delay * HZ); 221 closure_delay(&dc->writeback, dc->writeback_delay * HZ);
182 222
183 continue_at(&dc->writeback.cl, refill_dirty, dirty_wq); 223 continue_at(&dc->writeback.cl, refill_dirty, dirty_wq);
184 } 224 }
185 } 225 }
186 226
187 void bch_writeback_add(struct cached_dev *dc) 227 void bch_writeback_add(struct cached_dev *dc)
188 { 228 {
189 if (!atomic_read(&dc->has_dirty) && 229 if (!atomic_read(&dc->has_dirty) &&
190 !atomic_xchg(&dc->has_dirty, 1)) { 230 !atomic_xchg(&dc->has_dirty, 1)) {
191 atomic_inc(&dc->count); 231 atomic_inc(&dc->count);
192 232
193 if (BDEV_STATE(&dc->sb) != BDEV_STATE_DIRTY) { 233 if (BDEV_STATE(&dc->sb) != BDEV_STATE_DIRTY) {
194 SET_BDEV_STATE(&dc->sb, BDEV_STATE_DIRTY); 234 SET_BDEV_STATE(&dc->sb, BDEV_STATE_DIRTY);
195 /* XXX: should do this synchronously */ 235 /* XXX: should do this synchronously */
196 bch_write_bdev_super(dc, NULL); 236 bch_write_bdev_super(dc, NULL);
197 } 237 }
198 238
199 bch_writeback_queue(dc); 239 bch_writeback_queue(dc);
200 240
201 if (dc->writeback_percent) 241 if (dc->writeback_percent)
202 schedule_delayed_work(&dc->writeback_rate_update, 242 schedule_delayed_work(&dc->writeback_rate_update,
203 dc->writeback_rate_update_seconds * HZ); 243 dc->writeback_rate_update_seconds * HZ);
204 } 244 }
205 } 245 }
206 246
207 void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned inode, 247 void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned inode,
208 uint64_t offset, int nr_sectors) 248 uint64_t offset, int nr_sectors)
209 { 249 {
210 struct bcache_device *d = c->devices[inode]; 250 struct bcache_device *d = c->devices[inode];
211 unsigned stripe_size, stripe_offset; 251 unsigned stripe_size, stripe_offset;
212 uint64_t stripe; 252 uint64_t stripe;
213 253
214 if (!d) 254 if (!d)
215 return; 255 return;
216 256
217 stripe_size = 1 << d->stripe_size_bits; 257 stripe_size = 1 << d->stripe_size_bits;
218 stripe = offset >> d->stripe_size_bits; 258 stripe = offset >> d->stripe_size_bits;
219 stripe_offset = offset & (stripe_size - 1); 259 stripe_offset = offset & (stripe_size - 1);
220 260
221 while (nr_sectors) { 261 while (nr_sectors) {
222 int s = min_t(unsigned, abs(nr_sectors), 262 int s = min_t(unsigned, abs(nr_sectors),
223 stripe_size - stripe_offset); 263 stripe_size - stripe_offset);
224 264
225 if (nr_sectors < 0) 265 if (nr_sectors < 0)
226 s = -s; 266 s = -s;
227 267
228 atomic_add(s, d->stripe_sectors_dirty + stripe); 268 atomic_add(s, d->stripe_sectors_dirty + stripe);
229 nr_sectors -= s; 269 nr_sectors -= s;
230 stripe_offset = 0; 270 stripe_offset = 0;
231 stripe++; 271 stripe++;
232 } 272 }
233 } 273 }
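
bcache_dev_sectors_dirty_add() spreads a signed sector count across the per-stripe counters, splitting at stripe boundaries; negative counts are how sectors are marked clean again. A userspace model of the same split, with an assumed 64-sector stripe size and plain ints in place of atomics:

/* Userspace model of per-stripe dirty accounting. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define STRIPE_SIZE_BITS 6
#define STRIPE_SIZE	 (1 << STRIPE_SIZE_BITS)

static void sectors_dirty_add(int *stripe_sectors_dirty,
			      uint64_t offset, int nr_sectors)
{
	uint64_t stripe = offset >> STRIPE_SIZE_BITS;
	unsigned stripe_offset = offset & (STRIPE_SIZE - 1);

	while (nr_sectors) {
		int room = STRIPE_SIZE - stripe_offset;
		int s = abs(nr_sectors) < room ? abs(nr_sectors) : room;

		if (nr_sectors < 0)
			s = -s;

		stripe_sectors_dirty[stripe] += s;
		nr_sectors -= s;
		stripe_offset = 0;
		stripe++;
	}
}

int main(void)
{
	int dirty[4] = { 0 };

	/* 100 dirty sectors starting at sector 30 span stripes 0, 1 and 2 */
	sectors_dirty_add(dirty, 30, 100);
	printf("%d %d %d\n", dirty[0], dirty[1], dirty[2]);	/* 34 64 2 */
	return 0;
}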
234 274
235 /* Background writeback - IO loop */ 275 /* Background writeback - IO loop */
236 276
237 static void dirty_io_destructor(struct closure *cl) 277 static void dirty_io_destructor(struct closure *cl)
238 { 278 {
239 struct dirty_io *io = container_of(cl, struct dirty_io, cl); 279 struct dirty_io *io = container_of(cl, struct dirty_io, cl);
240 kfree(io); 280 kfree(io);
241 } 281 }
242 282
243 static void write_dirty_finish(struct closure *cl) 283 static void write_dirty_finish(struct closure *cl)
244 { 284 {
245 struct dirty_io *io = container_of(cl, struct dirty_io, cl); 285 struct dirty_io *io = container_of(cl, struct dirty_io, cl);
246 struct keybuf_key *w = io->bio.bi_private; 286 struct keybuf_key *w = io->bio.bi_private;
247 struct cached_dev *dc = io->dc; 287 struct cached_dev *dc = io->dc;
248 struct bio_vec *bv = bio_iovec_idx(&io->bio, io->bio.bi_vcnt); 288 struct bio_vec *bv = bio_iovec_idx(&io->bio, io->bio.bi_vcnt);
249 289
250 while (bv-- != io->bio.bi_io_vec) 290 while (bv-- != io->bio.bi_io_vec)
251 __free_page(bv->bv_page); 291 __free_page(bv->bv_page);
252 292
253 /* This is kind of a dumb way of signalling errors. */ 293 /* This is kind of a dumb way of signalling errors. */
254 if (KEY_DIRTY(&w->key)) { 294 if (KEY_DIRTY(&w->key)) {
255 unsigned i; 295 unsigned i;
256 struct btree_op op; 296 struct btree_op op;
257 bch_btree_op_init_stack(&op); 297 bch_btree_op_init_stack(&op);
258 298
259 op.type = BTREE_REPLACE; 299 op.type = BTREE_REPLACE;
260 bkey_copy(&op.replace, &w->key); 300 bkey_copy(&op.replace, &w->key);
261 301
262 SET_KEY_DIRTY(&w->key, false); 302 SET_KEY_DIRTY(&w->key, false);
263 bch_keylist_add(&op.keys, &w->key); 303 bch_keylist_add(&op.keys, &w->key);
264 304
265 for (i = 0; i < KEY_PTRS(&w->key); i++) 305 for (i = 0; i < KEY_PTRS(&w->key); i++)
266 atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin); 306 atomic_inc(&PTR_BUCKET(dc->disk.c, &w->key, i)->pin);
267 307
268 bch_btree_insert(&op, dc->disk.c); 308 bch_btree_insert(&op, dc->disk.c);
269 closure_sync(&op.cl); 309 closure_sync(&op.cl);
270 310
271 if (op.insert_collision) 311 if (op.insert_collision)
272 trace_bcache_writeback_collision(&w->key); 312 trace_bcache_writeback_collision(&w->key);
273 313
274 atomic_long_inc(op.insert_collision 314 atomic_long_inc(op.insert_collision
275 ? &dc->disk.c->writeback_keys_failed 315 ? &dc->disk.c->writeback_keys_failed
276 : &dc->disk.c->writeback_keys_done); 316 : &dc->disk.c->writeback_keys_done);
277 } 317 }
278 318
279 bch_keybuf_del(&dc->writeback_keys, w); 319 bch_keybuf_del(&dc->writeback_keys, w);
280 atomic_dec_bug(&dc->in_flight); 320 atomic_dec_bug(&dc->in_flight);
281 321
282 closure_wake_up(&dc->writeback_wait); 322 closure_wake_up(&dc->writeback_wait);
283 323
284 closure_return_with_destructor(cl, dirty_io_destructor); 324 closure_return_with_destructor(cl, dirty_io_destructor);
285 } 325 }
286 326
287 static void dirty_endio(struct bio *bio, int error) 327 static void dirty_endio(struct bio *bio, int error)
288 { 328 {
289 struct keybuf_key *w = bio->bi_private; 329 struct keybuf_key *w = bio->bi_private;
290 struct dirty_io *io = w->private; 330 struct dirty_io *io = w->private;
291 331
292 if (error) 332 if (error)
293 SET_KEY_DIRTY(&w->key, false); 333 SET_KEY_DIRTY(&w->key, false);
294 334
295 closure_put(&io->cl); 335 closure_put(&io->cl);
296 } 336 }
297 337
298 static void write_dirty(struct closure *cl) 338 static void write_dirty(struct closure *cl)
299 { 339 {
300 struct dirty_io *io = container_of(cl, struct dirty_io, cl); 340 struct dirty_io *io = container_of(cl, struct dirty_io, cl);
301 struct keybuf_key *w = io->bio.bi_private; 341 struct keybuf_key *w = io->bio.bi_private;
302 342
303 dirty_init(w); 343 dirty_init(w);
304 io->bio.bi_rw = WRITE; 344 io->bio.bi_rw = WRITE;
305 io->bio.bi_sector = KEY_START(&w->key); 345 io->bio.bi_sector = KEY_START(&w->key);
306 io->bio.bi_bdev = io->dc->bdev; 346 io->bio.bi_bdev = io->dc->bdev;
307 io->bio.bi_end_io = dirty_endio; 347 io->bio.bi_end_io = dirty_endio;
308 348
309 closure_bio_submit(&io->bio, cl, &io->dc->disk); 349 closure_bio_submit(&io->bio, cl, &io->dc->disk);
310 350
311 continue_at(cl, write_dirty_finish, dirty_wq); 351 continue_at(cl, write_dirty_finish, dirty_wq);
312 } 352 }
313 353
314 static void read_dirty_endio(struct bio *bio, int error) 354 static void read_dirty_endio(struct bio *bio, int error)
315 { 355 {
316 struct keybuf_key *w = bio->bi_private; 356 struct keybuf_key *w = bio->bi_private;
317 struct dirty_io *io = w->private; 357 struct dirty_io *io = w->private;
318 358
319 bch_count_io_errors(PTR_CACHE(io->dc->disk.c, &w->key, 0), 359 bch_count_io_errors(PTR_CACHE(io->dc->disk.c, &w->key, 0),
320 error, "reading dirty data from cache"); 360 error, "reading dirty data from cache");
321 361
322 dirty_endio(bio, error); 362 dirty_endio(bio, error);
323 } 363 }
324 364
325 static void read_dirty_submit(struct closure *cl) 365 static void read_dirty_submit(struct closure *cl)
326 { 366 {
327 struct dirty_io *io = container_of(cl, struct dirty_io, cl); 367 struct dirty_io *io = container_of(cl, struct dirty_io, cl);
328 368
329 closure_bio_submit(&io->bio, cl, &io->dc->disk); 369 closure_bio_submit(&io->bio, cl, &io->dc->disk);
330 370
331 continue_at(cl, write_dirty, dirty_wq); 371 continue_at(cl, write_dirty, dirty_wq);
332 } 372 }
333 373
334 static void read_dirty(struct closure *cl) 374 static void read_dirty(struct closure *cl)
335 { 375 {
336 struct cached_dev *dc = container_of(cl, struct cached_dev, 376 struct cached_dev *dc = container_of(cl, struct cached_dev,
337 writeback.cl); 377 writeback.cl);
338 unsigned delay = writeback_delay(dc, 0); 378 unsigned delay = writeback_delay(dc, 0);
339 struct keybuf_key *w; 379 struct keybuf_key *w;
340 struct dirty_io *io; 380 struct dirty_io *io;
341 381
342 /* 382 /*
343 * XXX: if we error, background writeback just spins. Should use some 383 * XXX: if we error, background writeback just spins. Should use some
344 * mempools. 384 * mempools.
345 */ 385 */
346 386
347 while (1) { 387 while (1) {
348 w = bch_keybuf_next(&dc->writeback_keys); 388 w = bch_keybuf_next(&dc->writeback_keys);
349 if (!w) 389 if (!w)
350 break; 390 break;
351 391
352 BUG_ON(ptr_stale(dc->disk.c, &w->key, 0)); 392 BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
353 393
354 if (delay > 0 && 394 if (delay > 0 &&
355 (KEY_START(&w->key) != dc->last_read || 395 (KEY_START(&w->key) != dc->last_read ||
356 jiffies_to_msecs(delay) > 50)) { 396 jiffies_to_msecs(delay) > 50)) {
357 w->private = NULL; 397 w->private = NULL;
358 398
359 closure_delay(&dc->writeback, delay); 399 closure_delay(&dc->writeback, delay);
360 continue_at(cl, read_dirty, dirty_wq); 400 continue_at(cl, read_dirty, dirty_wq);
361 } 401 }
362 402
363 dc->last_read = KEY_OFFSET(&w->key); 403 dc->last_read = KEY_OFFSET(&w->key);
364 404
365 io = kzalloc(sizeof(struct dirty_io) + sizeof(struct bio_vec) 405 io = kzalloc(sizeof(struct dirty_io) + sizeof(struct bio_vec)
366 * DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS), 406 * DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
367 GFP_KERNEL); 407 GFP_KERNEL);
368 if (!io) 408 if (!io)
369 goto err; 409 goto err;
370 410
371 w->private = io; 411 w->private = io;
372 io->dc = dc; 412 io->dc = dc;
373 413
374 dirty_init(w); 414 dirty_init(w);
375 io->bio.bi_sector = PTR_OFFSET(&w->key, 0); 415 io->bio.bi_sector = PTR_OFFSET(&w->key, 0);
376 io->bio.bi_bdev = PTR_CACHE(dc->disk.c, 416 io->bio.bi_bdev = PTR_CACHE(dc->disk.c,
377 &w->key, 0)->bdev; 417 &w->key, 0)->bdev;
378 io->bio.bi_rw = READ; 418 io->bio.bi_rw = READ;
379 io->bio.bi_end_io = read_dirty_endio; 419 io->bio.bi_end_io = read_dirty_endio;
380 420
381 if (bch_bio_alloc_pages(&io->bio, GFP_KERNEL)) 421 if (bch_bio_alloc_pages(&io->bio, GFP_KERNEL))
382 goto err_free; 422 goto err_free;
383 423
384 trace_bcache_writeback(&w->key); 424 trace_bcache_writeback(&w->key);
385 425
386 closure_call(&io->cl, read_dirty_submit, NULL, &dc->disk.cl); 426 closure_call(&io->cl, read_dirty_submit, NULL, &dc->disk.cl);
387 427
388 delay = writeback_delay(dc, KEY_SIZE(&w->key)); 428 delay = writeback_delay(dc, KEY_SIZE(&w->key));
389 429
390 atomic_inc(&dc->in_flight); 430 atomic_inc(&dc->in_flight);
391 431
392 if (!closure_wait_event(&dc->writeback_wait, cl, 432 if (!closure_wait_event(&dc->writeback_wait, cl,
393 atomic_read(&dc->in_flight) < 64)) 433 atomic_read(&dc->in_flight) < 64))
394 continue_at(cl, read_dirty, dirty_wq); 434 continue_at(cl, read_dirty, dirty_wq);
395 } 435 }
396 436
397 if (0) { 437 if (0) {
398 err_free: 438 err_free:
399 kfree(w->private); 439 kfree(w->private);
400 err: 440 err:
401 bch_keybuf_del(&dc->writeback_keys, w); 441 bch_keybuf_del(&dc->writeback_keys, w);
402 } 442 }
403 443
404 refill_dirty(cl); 444 refill_dirty(cl);
405 } 445 }
406 446
407 /* Init */ 447 /* Init */
408 448
409 static int bch_btree_sectors_dirty_init(struct btree *b, struct btree_op *op, 449 static int bch_btree_sectors_dirty_init(struct btree *b, struct btree_op *op,
410 struct cached_dev *dc) 450 struct cached_dev *dc)
411 { 451 {
412 struct bkey *k; 452 struct bkey *k;
413 struct btree_iter iter; 453 struct btree_iter iter;
414 454
415 bch_btree_iter_init(b, &iter, &KEY(dc->disk.id, 0, 0)); 455 bch_btree_iter_init(b, &iter, &KEY(dc->disk.id, 0, 0));
416 while ((k = bch_btree_iter_next_filter(&iter, b, bch_ptr_bad))) 456 while ((k = bch_btree_iter_next_filter(&iter, b, bch_ptr_bad)))
417 if (!b->level) { 457 if (!b->level) {
418 if (KEY_INODE(k) > dc->disk.id) 458 if (KEY_INODE(k) > dc->disk.id)
419 break; 459 break;
420 460
421 if (KEY_DIRTY(k)) 461 if (KEY_DIRTY(k))
422 bcache_dev_sectors_dirty_add(b->c, dc->disk.id, 462 bcache_dev_sectors_dirty_add(b->c, dc->disk.id,
423 KEY_START(k), 463 KEY_START(k),
424 KEY_SIZE(k)); 464 KEY_SIZE(k));
425 } else { 465 } else {
426 btree(sectors_dirty_init, k, b, op, dc); 466 btree(sectors_dirty_init, k, b, op, dc);
427 if (KEY_INODE(k) > dc->disk.id) 467 if (KEY_INODE(k) > dc->disk.id)
428 break; 468 break;
429 469
430 cond_resched(); 470 cond_resched();
431 } 471 }
432 472
433 return 0; 473 return 0;
434 } 474 }
435 475
436 void bch_sectors_dirty_init(struct cached_dev *dc) 476 void bch_sectors_dirty_init(struct cached_dev *dc)
437 { 477 {
438 struct btree_op op; 478 struct btree_op op;
439 479
440 bch_btree_op_init_stack(&op); 480 bch_btree_op_init_stack(&op);
441 btree_root(sectors_dirty_init, dc->disk.c, &op, dc); 481 btree_root(sectors_dirty_init, dc->disk.c, &op, dc);
442 } 482 }
443 483
444 void bch_cached_dev_writeback_init(struct cached_dev *dc) 484 void bch_cached_dev_writeback_init(struct cached_dev *dc)
445 { 485 {
446 closure_init_unlocked(&dc->writeback); 486 closure_init_unlocked(&dc->writeback);
447 init_rwsem(&dc->writeback_lock); 487 init_rwsem(&dc->writeback_lock);
448 488
449 bch_keybuf_init(&dc->writeback_keys, dirty_pred); 489 bch_keybuf_init(&dc->writeback_keys);
450 490
451 dc->writeback_metadata = true; 491 dc->writeback_metadata = true;
452 dc->writeback_running = true; 492 dc->writeback_running = true;
453 dc->writeback_percent = 10; 493 dc->writeback_percent = 10;
454 dc->writeback_delay = 30; 494 dc->writeback_delay = 30;
455 dc->writeback_rate.rate = 1024; 495 dc->writeback_rate.rate = 1024;
456 496
457 dc->writeback_rate_update_seconds = 30; 497 dc->writeback_rate_update_seconds = 30;
458 dc->writeback_rate_d_term = 16; 498 dc->writeback_rate_d_term = 16;
459 dc->writeback_rate_p_term_inverse = 64; 499 dc->writeback_rate_p_term_inverse = 64;
460 dc->writeback_rate_d_smooth = 8; 500 dc->writeback_rate_d_smooth = 8;
461 501
462 INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate); 502 INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate);
463 schedule_delayed_work(&dc->writeback_rate_update, 503 schedule_delayed_work(&dc->writeback_rate_update,
464 dc->writeback_rate_update_seconds * HZ); 504 dc->writeback_rate_update_seconds * HZ);
465 } 505 }
466 506
467 void bch_writeback_exit(void) 507 void bch_writeback_exit(void)
468 { 508 {
469 if (dirty_wq) 509 if (dirty_wq)
470 destroy_workqueue(dirty_wq); 510 destroy_workqueue(dirty_wq);
471 } 511 }
472 512
473 int __init bch_writeback_init(void) 513 int __init bch_writeback_init(void)
474 { 514 {
475 dirty_wq = create_singlethread_workqueue("bcache_writeback"); 515 dirty_wq = create_singlethread_workqueue("bcache_writeback");
476 if (!dirty_wq) 516 if (!dirty_wq)
477 return -ENOMEM; 517 return -ENOMEM;
478 518
479 return 0; 519 return 0;
480 } 520 }
481 521
drivers/md/bcache/writeback.h
1 #ifndef _BCACHE_WRITEBACK_H 1 #ifndef _BCACHE_WRITEBACK_H
2 #define _BCACHE_WRITEBACK_H 2 #define _BCACHE_WRITEBACK_H
3 3
4 #define CUTOFF_WRITEBACK 40
5 #define CUTOFF_WRITEBACK_SYNC 70
6
4 static inline uint64_t bcache_dev_sectors_dirty(struct bcache_device *d) 7 static inline uint64_t bcache_dev_sectors_dirty(struct bcache_device *d)
5 { 8 {
6 uint64_t i, ret = 0; 9 uint64_t i, ret = 0;
7 10
8 for (i = 0; i < d->nr_stripes; i++) 11 for (i = 0; i < d->nr_stripes; i++)
9 ret += atomic_read(d->stripe_sectors_dirty + i); 12 ret += atomic_read(d->stripe_sectors_dirty + i);
10 13
11 return ret; 14 return ret;
15 }
16
17 static inline bool bcache_dev_stripe_dirty(struct bcache_device *d,
18 uint64_t offset,
19 unsigned nr_sectors)
20 {
21 uint64_t stripe = offset >> d->stripe_size_bits;
22
23 while (1) {
24 if (atomic_read(d->stripe_sectors_dirty + stripe))
25 return true;
26
27 if (nr_sectors <= 1 << d->stripe_size_bits)
28 return false;
29
30 nr_sectors -= 1 << d->stripe_size_bits;
31 stripe++;
32 }
33 }
34
35 static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
36 unsigned cache_mode, bool would_skip)
37 {
38 unsigned in_use = dc->disk.c->gc_stats.in_use;
39
40 if (cache_mode != CACHE_MODE_WRITEBACK ||
41 atomic_read(&dc->disk.detaching) ||
42 in_use > CUTOFF_WRITEBACK_SYNC)
43 return false;
44
45 if (dc->partial_stripes_expensive &&
46 bcache_dev_stripe_dirty(&dc->disk, bio->bi_sector,
47 bio_sectors(bio)))
48 return true;
49
50 if (would_skip)
51 return false;
52
53 return bio->bi_rw & REQ_SYNC ||
54 in_use <= CUTOFF_WRITEBACK;
12 } 55 }
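
should_writeback() is where the commit's first optimization lands: when partial stripes are expensive (raid5/6), a write that touches a stripe which already has dirty data is forced into writeback regardless of the usual would_skip/REQ_SYNC heuristics, as long as the cache isn't past CUTOFF_WRITEBACK_SYNC. A standalone model of the decision is sketched below; the stripe check is passed in as a precomputed flag rather than walking the counters, and the percentages mirror the cutoffs defined above.

/* Standalone model of the should_writeback() decision. */
#include <stdbool.h>

#define CUTOFF_WRITEBACK	40
#define CUTOFF_WRITEBACK_SYNC	70

static bool should_writeback(bool writeback_mode, bool detaching,
			     unsigned in_use_percent,
			     bool partial_stripes_expensive,
			     bool stripe_already_dirty,
			     bool would_skip, bool sync_write)
{
	if (!writeback_mode || detaching ||
	    in_use_percent > CUTOFF_WRITEBACK_SYNC)
		return false;

	/* Keep building up full stripes of dirty data */
	if (partial_stripes_expensive && stripe_already_dirty)
		return true;

	if (would_skip)
		return false;

	return sync_write || in_use_percent <= CUTOFF_WRITEBACK;
}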
13 56
14 void bcache_dev_sectors_dirty_add(struct cache_set *, unsigned, uint64_t, int); 57 void bcache_dev_sectors_dirty_add(struct cache_set *, unsigned, uint64_t, int);
15 void bch_writeback_queue(struct cached_dev *); 58 void bch_writeback_queue(struct cached_dev *);
16 void bch_writeback_add(struct cached_dev *); 59 void bch_writeback_add(struct cached_dev *);
17 60
18 void bch_sectors_dirty_init(struct cached_dev *dc); 61 void bch_sectors_dirty_init(struct cached_dev *dc);
19 void bch_cached_dev_writeback_init(struct cached_dev *); 62 void bch_cached_dev_writeback_init(struct cached_dev *);
20 63
21 #endif 64 #endif
22 65