mm: frontswap: config and doc files

This patch (4 of 4) adds configuration and documentation files, including a FAQ.

[v14: updated docs/FAQ to use zcache and RAMster as examples]
[v10: no change]
[v9: akpm@linux-foundation.org: sysfs->debugfs; no longer need Doc/ABI file]
[v8: rebase to 3.0-rc4]
[v7: rebase to 3.0-rc3]
[v6: rebase to 3.0-rc1]
[v5: change config default to n]
[v4: rebase to 2.6.39]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Acked-by: Jan Beulich <JBeulich@novell.com>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Rik Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Dan Magenheimer · Konrad Rzeszutek Wilk
1 parent 29f233cfff
Showing 3 changed files with 296 additions and 0 deletions
Documentation/vm/frontswap.txt
mm/Kconfig
mm/Makefile
+Frontswap provides a "transcendent memory" interface for swap pages.
+In some environments, dramatic performance savings may be obtained because
+swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
+
+(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends"
+and the only necessary changes to the core kernel for transcendent memory;
+all other supporting code -- the "backends" -- is implemented as drivers.
+See the LWN.net article "Transcendent memory in a nutshell" for a detailed
+overview of frontswap and related kernel parts:
+https://lwn.net/Articles/454795/ )
+
+Frontswap is so named because it can be thought of as the opposite of
+a "backing" store for a swap device.  The storage is assumed to be
+a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
+to the requirements of transcendent memory (such as Xen's "tmem", or
+in-kernel compressed memory, aka "zcache", or future RAM-like devices);
+this pseudo-RAM device is not directly accessible or addressable by the
+kernel and is of unknown and possibly time-varying size.  The driver
+links itself to frontswap by calling frontswap_register_ops to set the
+frontswap_ops funcs appropriately and the functions it provides must
+conform to certain policies as follows:
+
+An "init" prepares the device to receive frontswap pages associated
+with the specified swap device number (aka "type").  A "put_page" will
+copy the page to transcendent memory and associate it with the type and
+offset associated with the page. A "get_page" will copy the page, if found,
+from transcendent memory into kernel memory, but will NOT remove the page
+from transcendent memory.  An "invalidate_page" will remove the page
+from transcendent memory and an "invalidate_area" will remove ALL pages
+associated with the swap type (e.g., like swapoff) and notify the "device"
+to refuse further puts with that swap type.
+
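As a rough illustration only, the policies above can be modelled in plain
user-space C.  Every toy_* name and size below is hypothetical and exists
only for this sketch; a real backend would instead fill in frontswap_ops
and call frontswap_register_ops, whose exact signatures are not shown here.

```c
/* Toy user-space model of the backend policies; NOT kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_TYPES     2		/* hypothetical number of swap devices */
#define TOY_SLOTS     8		/* hypothetical pages per swap device  */

struct toy_slot {
	bool used;
	unsigned char data[TOY_PAGE_SIZE];
};

static struct toy_slot toy_store[TOY_TYPES][TOY_SLOTS];
static bool toy_enabled[TOY_TYPES];

/* "init": start accepting puts for this swap type. */
void toy_init(unsigned type)
{
	assert(type < TOY_TYPES);
	memset(toy_store[type], 0, sizeof(toy_store[type]));
	toy_enabled[type] = true;
}

/* "put_page": copy the page in; 0 on success, -1 if rejected. */
int toy_put_page(unsigned type, unsigned long offset, const void *page)
{
	if (type >= TOY_TYPES || offset >= TOY_SLOTS || !toy_enabled[type])
		return -1;
	memcpy(toy_store[type][offset].data, page, TOY_PAGE_SIZE);
	toy_store[type][offset].used = true;
	return 0;
}

/* "get_page": copy the page out if found; the stored copy remains. */
int toy_get_page(unsigned type, unsigned long offset, void *page)
{
	if (type >= TOY_TYPES || offset >= TOY_SLOTS ||
	    !toy_store[type][offset].used)
		return -1;
	memcpy(page, toy_store[type][offset].data, TOY_PAGE_SIZE);
	return 0;
}

/* "invalidate_page": drop one page. */
void toy_invalidate_page(unsigned type, unsigned long offset)
{
	if (type < TOY_TYPES && offset < TOY_SLOTS)
		toy_store[type][offset].used = false;
}

/* "invalidate_area": drop every page of the type and refuse new puts. */
void toy_invalidate_area(unsigned type)
{
	if (type >= TOY_TYPES)
		return;
	memset(toy_store[type], 0, sizeof(toy_store[type]));
	toy_enabled[type] = false;
}
```

In this model a get after a put returns the same bytes any number of times,
and after toy_invalidate_area() both gets and new puts for that type fail.
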
+Once a page is successfully put, a matching get on the page will normally
+succeed.  So when the kernel finds itself in a situation where it needs
+to swap out a page, it first attempts to use frontswap.  If the put returns
+success, the data has been successfully saved to transcendent memory and
+a disk write and, if the data is later read back, a disk read are avoided.
+If a put returns failure, transcendent memory has rejected the data, and the
+page can be written to swap as usual.
+
+If a backend chooses, frontswap can be configured as a "writethrough
+cache" by calling frontswap_writethrough().  In this mode, the reduction
+in swap device writes is lost (and also a non-trivial performance advantage)
+in order to allow the backend to arbitrarily "reclaim" space used to
+store frontswap pages to more completely manage its memory usage.
+
+Note that if a page is put and the page already exists in transcendent memory
+(a "duplicate" put), either the put succeeds and the data is overwritten,
+or the put fails AND the page is invalidated.  This ensures stale data may
+never be obtained from frontswap.
+
+If properly configured, monitoring of frontswap is done via debugfs in
+the /sys/kernel/debug/frontswap directory.  The effectiveness of
+frontswap can be measured (across all swap devices) with:
+
+failed_puts	- how many put attempts have failed
+gets		- how many gets were attempted (all should succeed)
+succ_puts	- how many put attempts have succeeded
+invalidates	- how many invalidates were attempted
+
+A backend implementation may provide additional metrics.
+
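A minimal user-space sketch of reading these counters follows.  The file
names come from the list above; the helper names read_counter and
put_accept_ratio, and the idea of summarising the counters as an acceptance
ratio, are assumptions made for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one counter file, e.g. <dir>/succ_puts.  Returns 0 on success,
 * -1 if the file is missing or unparsable (e.g. debugfs not mounted).
 */
int read_counter(const char *dir, const char *name, unsigned long long *val)
{
	char path[256];
	FILE *f;
	int n, ok;

	assert(dir && name && val);
	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || n >= (int)sizeof(path))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Fraction of put attempts the backend accepted; 0.0 if none were made. */
double put_accept_ratio(unsigned long long succ_puts,
			unsigned long long failed_puts)
{
	unsigned long long total = succ_puts + failed_puts;

	return total ? (double)succ_puts / (double)total : 0.0;
}
```

A caller would pass "/sys/kernel/debug/frontswap" as dir, read succ_puts and
failed_puts, and feed both values to put_accept_ratio(); reading debugfs
typically requires root.
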
+FAQ
+
+1) Where's the value?
+
+When a workload starts swapping, performance falls through the floor.
+Frontswap significantly increases performance in many such workloads by
+providing a clean, dynamic interface to read and write swap pages to
+"transcendent memory" that is otherwise not directly addressable to the kernel.
+This interface is ideal when data is transformed to a different form
+and size (such as with compression) or secretly moved (as might be
+useful for write-balancing for some RAM-like devices).  Swap pages (and
+evicted page-cache pages) are a great use for this kind of slower-than-RAM-
+but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
+cleancache) interface to transcendent memory provides a nice way to read
+and write -- and indirectly "name" -- the pages.
+
+Frontswap -- and cleancache -- with a fairly small impact on the kernel,
+provides a huge amount of flexibility for more dynamic, flexible RAM
+utilization in various system configurations:
+
+In the single kernel case, aka "zcache", pages are compressed and
+stored in local memory, thus increasing the total anonymous pages
+that can be safely kept in RAM.  Zcache essentially trades off CPU
+cycles used in compression/decompression for better memory utilization.
+Benchmarks have shown little or no impact when memory pressure is
+low while providing a significant performance improvement (25%+)
+on some workloads under high memory pressure.
+
+"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
+support for clustered systems.  Frontswap pages are locally compressed
+as in zcache, but then "remotified" to another system's RAM.  This
+allows RAM to be dynamically load-balanced back-and-forth as needed,
+i.e. when system A is overcommitted, it can swap to system B, and
+vice versa.  RAMster can also be configured as a memory server so
+many servers in a cluster can swap, dynamically as needed, to a single
+server configured with a large amount of RAM... without pre-configuring
+how much of the RAM is available for each of the clients!
+
+In the virtual case, the whole point of virtualization is to statistically
+multiplex physical resources across the varying demands of multiple
+virtual machines.  This is really hard to do with RAM and efforts to do
+it well with no kernel changes have essentially failed (except in some
+well-publicized special-case workloads).
+Specifically, the Xen Transcendent Memory backend allows otherwise
+"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
+virtual machines, but the pages can be compressed and deduplicated to
+optimize RAM utilization.  And when guest OSes are induced to surrender
+underutilized RAM (e.g. with "selfballooning"), sudden unexpected
+memory pressure may result in swapping; frontswap allows those pages
+to be swapped to and from hypervisor RAM (if overall host system memory
+conditions allow), thus mitigating the potentially awful performance impact
+of unplanned swapping.
+
+A KVM implementation is underway and has been RFC'ed to lkml.  And,
+using frontswap, investigation is also underway on the use of NVM as
+a memory extension technology.
+
+2) Sure there may be performance advantages in some situations, but
+   what's the space/time overhead of frontswap?
+
+If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
+nothingness and the only overhead is a few extra bytes per swapon'ed
+swap device.  If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
+registers, one extra global variable is compared to zero for
+every swap page read or written.  If CONFIG_FRONTSWAP is enabled
+AND a frontswap backend registers AND the backend fails every "put"
+request (i.e. provides no memory despite claiming it might),
+CPU overhead is still negligible -- and since every frontswap fail
+precedes a swap page write-to-disk, the system is highly likely
+to be I/O bound and using a small fraction of a percent of a CPU
+will be irrelevant anyway.
+
+As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
+registers, one bit is allocated for every swap page for every swap
+device that is swapon'd.  This is added to the EIGHT bits (which
+were sixteen until about 2.6.34) that the kernel already allocates
+for every swap page for every swap device that is swapon'd.  (Hugh
+Dickins has observed that frontswap could probably steal one of
+the existing eight bits, but let's worry about that minor optimization
+later.)  For very large swap disks (which are rare) on a standard
+4K pagesize, this is 1MB per 32GB swap.
+
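The 1MB figure can be checked with a few lines of arithmetic.  The helper
name map_bytes_needed is hypothetical; it simply applies the "one bit per
swap page" rule stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of bitmap needed at one bit per swap page, rounded up. */
uint64_t map_bytes_needed(uint64_t swap_bytes, uint64_t page_size)
{
	uint64_t pages;

	assert(page_size != 0);
	pages = swap_bytes / page_size;	/* 32GB / 4K = 8M pages */
	return (pages + 7) / 8;		/* 8M bits   = 1MB      */
}
```

For 32GB of swap and 4K pages this gives 2^35 / 2^12 = 2^23 pages, and
2^23 bits / 8 = 2^20 bytes, i.e. 1MB.
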
+When swap pages are stored in transcendent memory instead of written
+out to disk, there is a side effect that this may create more memory
+pressure that can potentially outweigh the other advantages.  A
+backend, such as zcache, must implement policies to carefully (but
+dynamically) manage memory limits to ensure this doesn't happen.
+
+3) OK, how about a quick overview of what this frontswap patch does
+   in terms that a kernel hacker can grok?
+
+Let's assume that a frontswap "backend" has registered during
+kernel initialization; this registration indicates that this
+frontswap backend has access to some "memory" that is not directly
+accessible by the kernel.  Exactly how much memory it provides is
+entirely dynamic and random.
+
+Whenever a swap-device is swapon'd, frontswap_init() is called,
+passing the swap device number (aka "type") as a parameter.
+This notifies frontswap to expect attempts to "put" swap pages
+associated with that number.
+
+Whenever the swap subsystem is readying a page to write to a swap
+device (cf. swap_writepage()), frontswap_put_page is called.  Frontswap
+consults with the frontswap backend and if the backend says it does NOT
+have room, frontswap_put_page returns -1 and the kernel swaps the page
+to the swap device as normal.  Note that the response from the frontswap
+backend is unpredictable to the kernel; it may choose to never accept a
+page, it could accept every ninth page, or it might accept every
+page.  But if the backend does accept a page, the data from the page
+has already been copied and associated with the type and offset,
+and the backend guarantees the persistence of the data.  In this case,
+frontswap sets a bit in the "frontswap_map" for the swap device
+corresponding to the page offset on the swap device to which it would
+otherwise have written the data.
+
+When the swap subsystem needs to swap-in a page (swap_readpage()),
+it first calls frontswap_get_page() which checks the frontswap_map to
+see if the page was earlier accepted by the frontswap backend.  If
+it was, the page of data is filled from the frontswap backend and
+the swap-in is complete.  If not, the normal swap-in code is
+executed to obtain the page of data from the real swap device.
+
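The swap-out and swap-in paths just described can be sketched as a small
user-space model.  This is not the kernel code: the arrays standing in for
the swap device and the backend, the "accept only even offsets" policy, and
all function names are assumptions chosen to make the control flow visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SZ  4096
#define NR_SLOTS 16	/* hypothetical swap device size in pages */

static unsigned char disk[NR_SLOTS][PAGE_SZ];	/* stands in for the swap device */
static unsigned char tmem[NR_SLOTS][PAGE_SZ];	/* stands in for the backend     */
static unsigned char map_bits[(NR_SLOTS + 7) / 8];	/* toy "frontswap_map"   */
unsigned long disk_writes, disk_reads;

static bool map_test(unsigned long off)
{
	return (map_bits[off / 8] >> (off % 8)) & 1u;
}

static void map_set(unsigned long off)
{
	map_bits[off / 8] |= (unsigned char)(1u << (off % 8));
}

static void map_clear(unsigned long off)
{
	map_bits[off / 8] &= (unsigned char)~(1u << (off % 8));
}

/* Hypothetical backend policy: accept only even offsets. */
static int backend_put(unsigned long off, const void *page)
{
	if (off % 2)
		return -1;
	memcpy(tmem[off], page, PAGE_SZ);
	return 0;
}

/* Swap-out: try the backend first, fall back to the device. */
void toy_swap_out(unsigned long off, const void *page)
{
	assert(off < NR_SLOTS);
	if (backend_put(off, page) == 0) {
		map_set(off);		/* remember the backend holds it */
		return;
	}
	map_clear(off);			/* any older backend copy is stale */
	memcpy(disk[off], page, PAGE_SZ);
	disk_writes++;
}

/* Swap-in: consult the map, then use the backend or the device. */
void toy_swap_in(unsigned long off, void *page)
{
	assert(off < NR_SLOTS);
	if (map_test(off)) {
		memcpy(page, tmem[off], PAGE_SZ);
		return;
	}
	memcpy(page, disk[off], PAGE_SZ);
	disk_reads++;
}
```

Pages accepted by the toy backend never touch the disk counters; rejected
pages take the ordinary path in both directions.
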
+So every time the frontswap backend accepts a page, a swap device write
+and (potentially) a swap device read are replaced by a "frontswap backend
+put" and (possibly) a "frontswap backend get", which are presumably much
+faster.
+
+4) Can't frontswap be configured as a "special" swap device that is
+   just higher priority than any real swap device (e.g. like zswap,
+   or maybe swap-over-nbd/NFS)?
+
+No.  First, the existing swap subsystem doesn't allow for any kind of
+swap hierarchy.  Perhaps it could be rewritten to accommodate a hierarchy,
+but this would require fairly drastic changes.  Even if it were
+rewritten, the existing swap subsystem uses the block I/O layer which
+assumes a swap device is fixed size and any page in it is linearly
+addressable.  Frontswap barely touches the existing swap subsystem,
+and works around the constraints of the block I/O subsystem to provide
+a great deal of flexibility and dynamicity.
+
+For example, the acceptance of any swap page by the frontswap backend is
+entirely unpredictable. This is critical to the definition of frontswap
+backends because it grants completely dynamic discretion to the
+backend.  In zcache, one cannot know a priori how compressible a page is.
+"Poorly" compressible pages can be rejected, and "poorly" can itself be
+defined dynamically depending on current memory constraints.
+
+Further, frontswap is entirely synchronous whereas a real swap
+device is, by definition, asynchronous and uses block I/O.  The
+block I/O layer is not only unnecessary, but may perform "optimizations"
+that are inappropriate for a RAM-oriented device including delaying
+the write of some pages for a significant amount of time.  Synchrony is
+required to ensure the dynamicity of the backend and to avoid thorny race
+conditions that would unnecessarily and greatly complicate frontswap
+and/or the block I/O subsystem.  That said, only the initial "put"
+and "get" operations need be synchronous.  A separate asynchronous thread
+is free to manipulate the pages stored by frontswap.  For example,
+the "remotification" thread in RAMster uses standard asynchronous
+kernel sockets to move compressed frontswap pages to a remote machine.
+Similarly, a KVM guest-side implementation could do in-guest compression
+and use "batched" hypercalls.
+
+In a virtualized environment, the dynamicity allows the hypervisor
+(or host OS) to do "intelligent overcommit".  For example, it can
+choose to accept pages only until host-swapping might be imminent,
+then force guests to do their own swapping.
+
+There is a downside to the transcendent memory specifications for
+frontswap:  Since any "put" might fail, there must always be a real
+slot on a real swap device to swap the page.  Thus frontswap must be
+implemented as a "shadow" to every swapon'd device with the potential
+capability of holding every page that the swap device might have held
+and the possibility that it might hold no pages at all.  This means
+that frontswap cannot contain more pages than the total of swapon'd
+swap devices.  For example, if NO swap device is configured on some
+installation, frontswap is useless.  Swapless portable devices
+can still use frontswap but a backend for such devices must configure
+some kind of "ghost" swap device and ensure that it is never used.
+
+5) Why this weird definition about "duplicate puts"?  If a page
+   has been previously successfully put, can't it always be
+   successfully overwritten?
+
+Nearly always it can, but no, sometimes it cannot.  Consider an example
+where data is compressed and the original 4K page has been compressed
+to 1K.  Now an attempt is made to overwrite the page with data that
+is non-compressible and so would take the entire 4K.  But the backend
+has no more space.  In this case, the put must be rejected.  Whenever
+frontswap rejects a put that would overwrite, it also must invalidate
+the old data and ensure that it is no longer accessible.  Since the
+swap subsystem then writes the new data to the real swap device,
+this is the correct course of action to ensure coherency.
+
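The 1K-then-4K scenario above can be modelled with a tiny space-accounting
sketch.  The names zput, zhas and zpool_free, and the 4096-byte pool, are
hypothetical; the point is only that a rejected duplicate put also drops
the old copy, so stale data can never be returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_OFFSETS 4		/* hypothetical number of swap slots */

struct zslot {
	bool   present;
	size_t size;		/* bytes the compressed page occupies */
};

static struct zslot zslots[NR_OFFSETS];
size_t zpool_free = 4096;	/* hypothetical bytes left in the backend */

/*
 * Put a page whose compressed size is `size`.  Returns 0 on success.
 * On failure returns -1; if an older copy existed it has been dropped.
 */
int zput(unsigned long off, size_t size)
{
	struct zslot *s;

	assert(off < NR_OFFSETS);
	s = &zslots[off];
	if (s->present) {	/* duplicate put: release the old copy */
		zpool_free += s->size;
		s->present = false;
	}
	if (size > zpool_free)
		return -1;	/* rejected; the old copy is already gone */
	zpool_free -= size;
	s->present = true;
	s->size = size;
	return 0;
}

/* True if the backend currently holds a copy for this offset. */
bool zhas(unsigned long off)
{
	assert(off < NR_OFFSETS);
	return zslots[off].present;
}
```

With the pool filled by a 1K page at offset 0 and a 3K page at offset 1, a
4K overwrite of offset 0 is rejected and offset 0 is no longer present, so
the caller must write the new data to the real swap device.
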
+6) What is frontswap_shrink for?
+
+When the (non-frontswap) swap subsystem swaps out a page to a real
+swap device, that page is only taking up low-value pre-allocated disk
+space.  But if frontswap has placed a page in transcendent memory, that
+page may be taking up valuable real estate.  The frontswap_shrink
+routine allows code outside of the swap subsystem to force pages out
+of the memory managed by frontswap and back into kernel-addressable memory.
+For example, in RAMster, a "suction driver" thread will attempt
+to "repatriate" pages sent to a remote machine back to the local machine;
+this is driven using the frontswap_shrink mechanism when memory pressure
+subsides.
+
+7) Why does the frontswap patch create the new include file swapfile.h?
+
+The frontswap code depends on some swap-subsystem-internal data
+structures that have, over the years, moved back and forth between
+static and global.  This seemed a reasonable compromise:  Define
+them as global but declare them in a new include file that isn't
+included by the large number of source files that include swap.h.
+
+Dan Magenheimer, last updated April 9, 2012
@@ -379,4 +379,21 @@
 	  in a negligible performance hit.
  
 	  If unsure, say Y to enable cleancache
+
+config FRONTSWAP
+	bool "Enable frontswap to cache swap pages if tmem is present"
+	depends on SWAP
+	default n
+	help
+	  Frontswap is so named because it can be thought of as the opposite
+	  of a "backing" store for a swap device.  The data is stored into
+	  "transcendent memory", memory that is not directly accessible or
+	  addressable by the kernel and is of unknown and possibly
+	  time-varying size.  When space in transcendent memory is available,
+	  a significant swap I/O reduction may be achieved.  When none is
+	  available, all frontswap calls are reduced to a single pointer-
+	  compare-against-NULL resulting in a negligible performance hit
+	  and swap data is stored as normal on the matching swap device.
+
+	  If unsure, say Y to enable frontswap.
@@ -26,6 +26,7 @@
  
 obj-$(CONFIG_BOUNCE)	+= bounce.o
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
	1	+Frontswap provides a "transcendent memory" interface for swap pages.
	2	+In some environments, dramatic performance savings may be obtained because
	3	+swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
	4	+
	5	+(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends"
	6	+and the only necessary changes to the core kernel for transcendent memory;
	7	+all other supporting code -- the "backends" -- is implemented as drivers.
	8	+See the LWN.net article "Transcendent memory in a nutshell" for a detailed
	9	+overview of frontswap and related kernel parts:
	10	+https://lwn.net/Articles/454795/ )
	11	+
	12	+Frontswap is so named because it can be thought of as the opposite of
	13	+a "backing" store for a swap device. The storage is assumed to be
	14	+a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
	15	+to the requirements of transcendent memory (such as Xen's "tmem", or
	16	+in-kernel compressed memory, aka "zcache", or future RAM-like devices);
	17	+this pseudo-RAM device is not directly accessible or addressable by the
	18	+kernel and is of unknown and possibly time-varying size. The driver
	19	+links itself to frontswap by calling frontswap_register_ops to set the
	20	+frontswap_ops funcs appropriately and the functions it provides must
	21	+conform to certain policies as follows:
	22	+
	23	+An "init" prepares the device to receive frontswap pages associated
	24	+with the specified swap device number (aka "type"). A "put_page" will
	25	+copy the page to transcendent memory and associate it with the type and
	26	+offset associated with the page. A "get_page" will copy the page, if found,
	27	+from transcendent memory into kernel memory, but will NOT remove the page
	28	+from from transcendent memory. An "invalidate_page" will remove the page
	29	+from transcendent memory and an "invalidate_area" will remove ALL pages
	30	+associated with the swap type (e.g., like swapoff) and notify the "device"
	31	+to refuse further puts with that swap type.
	32	+
	33	+Once a page is successfully put, a matching get on the page will normally
	34	+succeed. So when the kernel finds itself in a situation where it needs
	35	+to swap out a page, it first attempts to use frontswap. If the put returns
	36	+success, the data has been successfully saved to transcendent memory and
	37	+a disk write and, if the data is later read back, a disk read are avoided.
	38	+If a put returns failure, transcendent memory has rejected the data, and the
	39	+page can be written to swap as usual.
	40	+
	41	+If a backend chooses, frontswap can be configured as a "writethrough
	42	+cache" by calling frontswap_writethrough(). In this mode, the reduction
	43	+in swap device writes is lost (and also a non-trivial performance advantage)
	44	+in order to allow the backend to arbitrarily "reclaim" space used to
	45	+store frontswap pages to more completely manage its memory usage.
	46	+
	47	+Note that if a page is put and the page already exists in transcendent memory
	48	+(a "duplicate" put), either the put succeeds and the data is overwritten,
	49	+or the put fails AND the page is invalidated. This ensures stale data may
	50	+never be obtained from frontswap.
	51	+
	52	+If properly configured, monitoring of frontswap is done via debugfs in
	53	+the /sys/kernel/debug/frontswap directory. The effectiveness of
	54	+frontswap can be measured (across all swap devices) with:
	55	+
	56	+failed_puts - how many put attempts have failed
	57	+gets - how many gets were attempted (all should succeed)
	58	+succ_puts - how many put attempts have succeeded
	59	+invalidates - how many invalidates were attempted
	60	+
	61	+A backend implementation may provide additional metrics.
	62	+
	63	+FAQ
	64	+
	65	+1) Where's the value?
	66	+
	67	+When a workload starts swapping, performance falls through the floor.
	68	+Frontswap significantly increases performance in many such workloads by
	69	+providing a clean, dynamic interface to read and write swap pages to
	70	+"transcendent memory" that is otherwise not directly addressable to the kernel.
	71	+This interface is ideal when data is transformed to a different form
	72	+and size (such as with compression) or secretly moved (as might be
	73	+useful for write-balancing for some RAM-like devices). Swap pages (and
	74	+evicted page-cache pages) are a great use for this kind of slower-than-RAM-
	75	+but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
	76	+cleancache) interface to transcendent memory provides a nice way to read
	77	+and write -- and indirectly "name" -- the pages.
	78	+
	79	+Frontswap -- and cleancache -- with a fairly small impact on the kernel,
	80	+provides a huge amount of flexibility for more dynamic, flexible RAM
	81	+utilization in various system configurations:
	82	+
	83	+In the single kernel case, aka "zcache", pages are compressed and
	84	+stored in local memory, thus increasing the total anonymous pages
	85	+that can be safely kept in RAM. Zcache essentially trades off CPU
	86	+cycles used in compression/decompression for better memory utilization.
	87	+Benchmarks have shown little or no impact when memory pressure is
	88	+low while providing a significant performance improvement (25%+)
	89	+on some workloads under high memory pressure.
	90	+
	91	+"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
	92	+support for clustered systems. Frontswap pages are locally compressed
	93	+as in zcache, but then "remotified" to another system's RAM. This
	94	+allows RAM to be dynamically load-balanced back-and-forth as needed,
	95	+i.e. when system A is overcommitted, it can swap to system B, and
	96	+vice versa. RAMster can also be configured as a memory server so
	97	+many servers in a cluster can swap, dynamically as needed, to a single
	98	+server configured with a large amount of RAM... without pre-configuring
	99	+how much of the RAM is available for each of the clients!
	100	+
	101	+In the virtual case, the whole point of virtualization is to statistically
	102	+multiplex physical resources acrosst the varying demands of multiple
	103	+virtual machines. This is really hard to do with RAM and efforts to do
	104	+it well with no kernel changes have essentially failed (except in some
	105	+well-publicized special-case workloads).
	106	+Specifically, the Xen Transcendent Memory backend allows otherwise
	107	+"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
	108	+virtual machines, but the pages can be compressed and deduplicated to
	109	+optimize RAM utilization. And when guest OS's are induced to surrender
	110	+underutilized RAM (e.g. with "selfballooning"), sudden unexpected
	111	+memory pressure may result in swapping; frontswap allows those pages
	112	+to be swapped to and from hypervisor RAM (if overall host system memory
	113	+conditions allow), thus mitigating the potentially awful performance impact
	114	+of unplanned swapping.
	115	+
	116	+A KVM implementation is underway and has been RFC'ed to lkml. And,
	117	+using frontswap, investigation is also underway on the use of NVM as
	118	+a memory extension technology.
	119	+
	120	+2) Sure there may be performance advantages in some situations, but
	121	+ what's the space/time overhead of frontswap?
	122	+
	123	+If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
	124	+nothingness and the only overhead is a few extra bytes per swapon'ed
	125	+swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
	126	+registers, there is one extra global variable compared to zero for
	127	+every swap page read or written. If CONFIG_FRONTSWAP is enabled
	128	+AND a frontswap backend registers AND the backend fails every "put"
	129	+request (i.e. provides no memory despite claiming it might),
	130	+CPU overhead is still negligible -- and since every frontswap fail
	131	+precedes a swap page write-to-disk, the system is highly likely
	132	+to be I/O bound and using a small fraction of a percent of a CPU
	133	+will be irrelevant anyway.
	134	+
	135	+As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
	136	+registers, one bit is allocated for every swap page for every swap
	137	+device that is swapon'd. This is added to the EIGHT bits (which
	138	+was sixteen until about 2.6.34) that the kernel already allocates
	139	+for every swap page for every swap device that is swapon'd. (Hugh
	140	+Dickins has observed that frontswap could probably steal one of
	141	+the existing eight bits, but let's worry about that minor optimization
	142	+later.) For very large swap disks (which are rare) on a standard
	143	+4K pagesize, this is 1MB per 32GB swap.
	144	+
	145	+When swap pages are stored in transcendent memory instead of written
	146	+out to disk, there is a side effect that this may create more memory
	147	+pressure that can potentially outweigh the other advantages. A
	148	+backend, such as zcache, must implement policies to carefully (but
	149	+dynamically) manage memory limits to ensure this doesn't happen.
	150	+
	151	+3) OK, how about a quick overview of what this frontswap patch does
	152	+ in terms that a kernel hacker can grok?
	153	+
	154	+Let's assume that a frontswap "backend" has registered during
	155	+kernel initialization; this registration indicates that this
	156	+frontswap backend has access to some "memory" that is not directly
	157	+accessible by the kernel. Exactly how much memory it provides is
	158	+entirely dynamic and random.
	159	+
	160	+Whenever a swap-device is swapon'd frontswap_init() is called,
	161	+passing the swap device number (aka "type") as a parameter.
	162	+This notifies frontswap to expect attempts to "put" swap pages
	163	+associated with that number.
	164	+
	165	+Whenever the swap subsystem is readying a page to write to a swap
	166	+device (c.f swap_writepage()), frontswap_put_page is called. Frontswap
	167	+consults with the frontswap backend and if the backend says it does NOT
	168	+have room, frontswap_put_page returns -1 and the kernel swaps the page
	169	+to the swap device as normal. Note that the response from the frontswap
	170	+backend is unpredictable to the kernel; it may choose to never accept a
	171	+page, it could accept every ninth page, or it might accept every
	172	+page. But if the backend does accept a page, the data from the page
	173	+has already been copied and associated with the type and offset,
	174	+and the backend guarantees the persistence of the data. In this case,
	175	+frontswap sets a bit in the "frontswap_map" for the swap device
	176	+corresponding to the page offset on the swap device to which it would
	177	+otherwise have written the data.
	178	+
	179	+When the swap subsystem needs to swap-in a page (swap_readpage()),
	180	+it first calls frontswap_get_page() which checks the frontswap_map to
	181	+see if the page was earlier accepted by the frontswap backend. If
	182	+it was, the page of data is filled from the frontswap backend and
	183	+the swap-in is complete. If not, the normal swap-in code is
	184	+executed to obtain the page of data from the real swap device.
	185	+
	186	+So every time the frontswap backend accepts a page, a swap device read
	187	+and (potentially) a swap device write are replaced by a "frontswap backend
	188	+put" and (possibly) a "frontswap backend get", which are presumably much
	189	+faster.
	190	+
	191	+4) Can't frontswap be configured as a "special" swap device that is
	192	+ just higher priority than any real swap device (e.g. like zswap,
	193	+ or maybe swap-over-nbd/NFS)?
	194	+
	195	+No. First, the existing swap subsystem doesn't allow for any kind of
	196	+swap hierarchy. Perhaps it could be rewritten to accomodate a hierarchy,
	197	+but this would require fairly drastic changes. Even if it were
	198	+rewritten, the existing swap subsystem uses the block I/O layer which
	199	+assumes a swap device is fixed size and any page in it is linearly
	200	+addressable. Frontswap barely touches the existing swap subsystem,
	201	+and works around the constraints of the block I/O subsystem to provide
	202	+a great deal of flexibility and dynamicity.
	203	+
	204	+For example, the acceptance of any swap page by the frontswap backend is
	205	+entirely unpredictable. This is critical to the definition of frontswap
	206	+backends because it grants completely dynamic discretion to the
	207	+backend. In zcache, one cannot know a priori how compressible a page is.
	208	+"Poorly" compressible pages can be rejected, and "poorly" can itself be
	209	+defined dynamically depending on current memory constraints.
	210	+
	211	+Further, frontswap is entirely synchronous whereas a real swap
	212	+device is, by definition, asynchronous and uses block I/O. The
	213	+block I/O layer is not only unnecessary, but may perform "optimizations"
	214	+that are inappropriate for a RAM-oriented device including delaying
	215	+the write of some pages for a significant amount of time. Synchrony is
	216	+required to ensure the dynamicity of the backend and to avoid thorny race
	217	+conditions that would unnecessarily and greatly complicate frontswap
	218	+and/or the block I/O subsystem. That said, only the initial "put"
	219	+and "get" operations need be synchronous. A separate asynchronous thread
	220	+is free to manipulate the pages stored by frontswap. For example,
	221	+the "remotification" thread in RAMster uses standard asynchronous
	222	+kernel sockets to move compressed frontswap pages to a remote machine.
	223	+Similarly, a KVM guest-side implementation could do in-guest compression
	224	+and use "batched" hypercalls.
	225	+
	226	+In a virtualized environment, the dynamicity allows the hypervisor
	227	+(or host OS) to do "intelligent overcommit". For example, it can
	228	+choose to accept pages only until host-swapping might be imminent,
	229	+then force guests to do their own swapping.
	230	+
	231	+There is a downside to the transcendent memory specifications for
	232	+frontswap: Since any "put" might fail, there must always be a real
	233	+slot on a real swap device to swap the page. Thus frontswap must be
	234	+implemented as a "shadow" to every swapon'd device with the potential
	235	+capability of holding every page that the swap device might have held
	236	+and the possibility that it might hold no pages at all. This means
	237	+that frontswap cannot contain more pages than the total of swapon'd
	238	+swap devices. For example, if NO swap device is configured on some
	239	+installation, frontswap is useless. Swapless portable devices
	240	+can still use frontswap but a backend for such devices must configure
	241	+some kind of "ghost" swap device and ensure that it is never used.
	242	+
	243	+5) Why this weird definition about "duplicate puts"? If a page
	244	+ has been previously successfully put, can't it always be
	245	+ successfully overwritten?
	246	+
	247	+Nearly always it can, but no, sometimes it cannot. Consider an example
	248	+where data is compressed and the original 4K page has been compressed
	249	+to 1K. Now an attempt is made to overwrite the page with data that
	250	+is non-compressible and so would take the entire 4K. But the backend
	251	+has no more space. In this case, the put must be rejected. Whenever
	252	+frontswap rejects a put that would overwrite, it also must invalidate
	253	+the old data and ensure that it is no longer accessible. Since the
	254	+swap subsystem then writes the new data to the real swap device,
	255	+this is the correct course of action to ensure coherency.
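
The duplicate-put rule can be sketched in user space as follows. The
names (struct toy_backend, toy_put(), toy_invalidate(), toy_has_page())
are hypothetical and the byte accounting is a simplifying assumption;
the point is only that a rejected overwrite also discards the stale
copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 8

/* Hypothetical backend: per-offset stored size, 0 means nothing stored. */
struct toy_backend {
	size_t capacity;		/* total bytes available */
	size_t used;			/* bytes currently in use */
	size_t slot_size[NR_SLOTS];
};

/* Drop whatever is stored at this offset. */
static void toy_invalidate(struct toy_backend *b, unsigned int offset)
{
	assert(offset < NR_SLOTS);
	b->used -= b->slot_size[offset];
	b->slot_size[offset] = 0;
}

static bool toy_has_page(const struct toy_backend *b, unsigned int offset)
{
	assert(offset < NR_SLOTS);
	return b->slot_size[offset] != 0;
}

/*
 * Store new_size bytes at offset, overwriting any previous copy.
 * On rejection the old copy is invalidated as well, so a later get
 * cannot return stale data.
 */
static bool toy_put(struct toy_backend *b, unsigned int offset, size_t new_size)
{
	size_t old_size, free_after_drop;

	assert(offset < NR_SLOTS);
	assert(new_size > 0);
	old_size = b->slot_size[offset];
	free_after_drop = b->capacity - b->used + old_size;

	if (new_size > free_after_drop) {
		toy_invalidate(b, offset);	/* reject and drop stale data */
		return false;
	}
	b->used = b->used - old_size + new_size;
	b->slot_size[offset] = new_size;
	return true;
}
```

With a 4096-byte pool holding a 1024-byte page at offset 0 and a
3072-byte page at offset 1, an attempt to overwrite offset 0 with 4096
incompressible bytes is rejected, and offset 0 no longer holds any page
afterwards; the only valid copy is then the one the swap subsystem
writes to the real swap device.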
	256	+
	257	+6) What is frontswap_shrink for?
	258	+
	259	+When the (non-frontswap) swap subsystem swaps out a page to a real
	260	+swap device, that page is only taking up low-value pre-allocated disk
	261	+space. But if frontswap has placed a page in transcendent memory, that
	262	+page may be taking up valuable real estate. The frontswap_shrink
	263	+routine allows code outside of the swap subsystem to force pages out
	264	+of the memory managed by frontswap and back into kernel-addressable memory.
	265	+For example, in RAMster, a "suction driver" thread will attempt
	266	+to "repatriate" pages sent to a remote machine back to the local machine;
	267	+this is driven using the frontswap_shrink mechanism when memory pressure
	268	+subsides.
	269	+
	270	+7) Why does the frontswap patch create the new include file swapfile.h?
	271	+
	272	+The frontswap code depends on some swap-subsystem-internal data
	273	+structures that have, over the years, moved back and forth between
	274	+static and global. This seemed a reasonable compromise: Define
	275	+them as global but declare them in a new include file that isn't
	276	+included by the large number of source files that include swap.h.
	277	+
	278	+Dan Magenheimer, last updated April 9, 2012
...	...	@@ -379,4 +379,21 @@
379	379	in a negligible performance hit.
380	380
381	381	If unsure, say Y to enable cleancache
	382	+
	383	+config FRONTSWAP
	384	+ bool "Enable frontswap to cache swap pages if tmem is present"
	385	+ depends on SWAP
	386	+ default n
	387	+ help
	388	+ Frontswap is so named because it can be thought of as the opposite
	389	+ of a "backing" store for a swap device. The data is stored into
	390	+ "transcendent memory", memory that is not directly accessible or
	391	+ addressable by the kernel and is of unknown and possibly
	392	+ time-varying size. When space in transcendent memory is available,
	393	+ a significant swap I/O reduction may be achieved. When none is
	394	+ available, all frontswap calls are reduced to a single pointer-
	395	+ compare-against-NULL resulting in a negligible performance hit
	396	+ and swap data is stored as normal on the matching swap device.
	397	+
	398	+ If unsure, say Y to enable frontswap.
...	...	@@ -26,6 +26,7 @@
26	26
27	27	obj-$(CONFIG_BOUNCE) += bounce.o
28	28	obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
	29	+obj-$(CONFIG_FRONTSWAP) += frontswap.o
29	30	obj-$(CONFIG_HAS_DMA) += dmapool.o
30	31	obj-$(CONFIG_HUGETLBFS) += hugetlb.o
31	32	obj-$(CONFIG_NUMA) += mempolicy.o