Commit 27c6aec214264992603526d47da9dabddf3521b3

Authored by Dan Magenheimer
Committed by Konrad Rzeszutek Wilk
1 parent 29f233cfff

mm: frontswap: config and doc files

This patch (4 of 4) adds configuration and documentation files including a FAQ.

[v14: updated docs/FAQ to use zcache and RAMster as examples]
[v10: no change]
[v9: akpm@linux-foundation.org: sysfs->debugfs; no longer need Doc/ABI file]
[v8: rebase to 3.0-rc4]
[v7: rebase to 3.0-rc3]
[v6: rebase to 3.0-rc1]
[v5: change config default to n]
[v4: rebase to 2.6.39]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Acked-by: Jan Beulich <JBeulich@novell.com>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Rik Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Showing 3 changed files with 296 additions and 0 deletions

Documentation/vm/frontswap.txt
  1 +Frontswap provides a "transcendent memory" interface for swap pages.
  2 +In some environments, dramatic performance gains may be obtained because
  3 +swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
  4 +
  5 +(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends"
  6 +and the only necessary changes to the core kernel for transcendent memory;
  7 +all other supporting code -- the "backends" -- is implemented as drivers.
  8 +See the LWN.net article "Transcendent memory in a nutshell" for a detailed
  9 +overview of frontswap and related kernel parts:
  10 +https://lwn.net/Articles/454795/ )
  11 +
  12 +Frontswap is so named because it can be thought of as the opposite of
  13 +a "backing" store for a swap device. The storage is assumed to be
  14 +a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
  15 +to the requirements of transcendent memory (such as Xen's "tmem", or
  16 +in-kernel compressed memory, aka "zcache", or future RAM-like devices);
  17 +this pseudo-RAM device is not directly accessible or addressable by the
  18 +kernel and is of unknown and possibly time-varying size. The driver
  19 +links itself to frontswap by calling frontswap_register_ops to set the
  20 +frontswap_ops funcs appropriately and the functions it provides must
  21 +conform to certain policies as follows:
  22 +
  23 +An "init" prepares the device to receive frontswap pages associated
  24 +with the specified swap device number (aka "type"). A "put_page" will
  25 +copy the page to transcendent memory and associate it with the type and
  26 +offset associated with the page. A "get_page" will copy the page, if found,
  27 +from transcendent memory into kernel memory, but will NOT remove the page
  28 +from transcendent memory. An "invalidate_page" will remove the page
  29 +from transcendent memory and an "invalidate_area" will remove ALL pages
  30 +associated with the swap type (e.g., like swapoff) and notify the "device"
  31 +to refuse further puts with that swap type.
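
The operations above can be sketched in userspace as a table of function pointers backed by a toy in-memory pool. This is only an illustration of the described semantics: the struct toy_frontswap_ops layout, the toy_* names, the signatures, and the fixed sizes are all assumptions made for this sketch, not the kernel's actual definitions.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096  /* assumed page size */
#define TOY_MAX_TYPES 2     /* assumed number of swap devices ("types") */
#define TOY_SLOTS     8     /* assumed pages per swap device */

/* Hypothetical ops table with one entry per operation described above. */
struct toy_frontswap_ops {
    void (*init)(unsigned type);
    int  (*put_page)(unsigned type, unsigned long offset, const void *page);
    int  (*get_page)(unsigned type, unsigned long offset, void *page);
    void (*invalidate_page)(unsigned type, unsigned long offset);
    void (*invalidate_area)(unsigned type);
};

/* Toy "transcendent memory": a fixed array of page copies per swap type. */
static struct toy_area {
    int accepting;                  /* set by init, cleared by invalidate_area */
    int present[TOY_SLOTS];
    unsigned char data[TOY_SLOTS][TOY_PAGE_SIZE];
} toy_pool[TOY_MAX_TYPES];

static int toy_valid(unsigned type, unsigned long offset)
{
    return type < TOY_MAX_TYPES && offset < TOY_SLOTS;
}

static void toy_init(unsigned type)
{
    assert(type < TOY_MAX_TYPES);
    memset(&toy_pool[type], 0, sizeof(toy_pool[type]));
    toy_pool[type].accepting = 1;
}

/* Copy the page in and associate it with (type, offset). */
static int toy_put_page(unsigned type, unsigned long offset, const void *page)
{
    if (!toy_valid(type, offset) || !toy_pool[type].accepting)
        return -1;
    memcpy(toy_pool[type].data[offset], page, TOY_PAGE_SIZE);
    toy_pool[type].present[offset] = 1;
    return 0;
}

/* Copy the page out if found; the stored copy is NOT removed. */
static int toy_get_page(unsigned type, unsigned long offset, void *page)
{
    if (!toy_valid(type, offset) || !toy_pool[type].present[offset])
        return -1;
    memcpy(page, toy_pool[type].data[offset], TOY_PAGE_SIZE);
    return 0;
}

static void toy_invalidate_page(unsigned type, unsigned long offset)
{
    if (toy_valid(type, offset))
        toy_pool[type].present[offset] = 0;
}

/* Drop ALL pages of this type and refuse further puts for it. */
static void toy_invalidate_area(unsigned type)
{
    if (type >= TOY_MAX_TYPES)
        return;
    memset(toy_pool[type].present, 0, sizeof(toy_pool[type].present));
    toy_pool[type].accepting = 0;
}

const struct toy_frontswap_ops toy_ops = {
    .init            = toy_init,
    .put_page        = toy_put_page,
    .get_page        = toy_get_page,
    .invalidate_page = toy_invalidate_page,
    .invalidate_area = toy_invalidate_area,
};
```

A caller would go through toy_ops only, e.g. toy_ops.init(0) followed by toy_ops.put_page(0, 3, buf); a second get_page on the same offset still succeeds until the page or the whole area is invalidated.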
  32 +
  33 +Once a page is successfully put, a matching get on the page will normally
  34 +succeed. So when the kernel finds itself in a situation where it needs
  35 +to swap out a page, it first attempts to use frontswap. If the put returns
  36 +success, the data has been successfully saved to transcendent memory and
  37 +a disk write and, if the data is later read back, a disk read are avoided.
  38 +If a put returns failure, transcendent memory has rejected the data, and the
  39 +page can be written to swap as usual.
  40 +
  41 +If a backend chooses, frontswap can be configured as a "writethrough
  42 +cache" by calling frontswap_writethrough(). In this mode, the reduction
  43 +in swap device writes is lost (and also a non-trivial performance advantage)
  44 +in order to allow the backend to arbitrarily "reclaim" space used to
  45 +store frontswap pages to more completely manage its memory usage.
  46 +
  47 +Note that if a page is put and the page already exists in transcendent memory
  48 +(a "duplicate" put), either the put succeeds and the data is overwritten,
  49 +or the put fails AND the page is invalidated. This ensures stale data may
  50 +never be obtained from frontswap.
  51 +
  52 +If properly configured, monitoring of frontswap is done via debugfs in
  53 +the /sys/kernel/debug/frontswap directory. The effectiveness of
  54 +frontswap can be measured (across all swap devices) with:
  55 +
  56 +failed_puts - how many put attempts have failed
  57 +gets - how many gets were attempted (all should succeed)
  58 +succ_puts - how many put attempts have succeeded
  59 +invalidates - how many invalidates were attempted
  60 +
  61 +A backend implementation may provide additional metrics.
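
As a rough illustration of how these counters might be consumed, the userspace sketch below reads one counter file and derives a put success ratio. It assumes debugfs is mounted at /sys/kernel/debug and that each counter file holds a single decimal number; the helper names read_counter and put_success_ratio are invented for this example.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one decimal counter from dir/name (for example
 * "/sys/kernel/debug/frontswap" and "succ_puts").
 * Returns 0 on success, -1 if the file is missing or unparsable.
 */
int read_counter(const char *dir, const char *name, unsigned long long *value)
{
    char path[512];
    FILE *f;
    int n, ok;

    assert(dir && name && value);
    n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%llu", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Fraction of put attempts the backend accepted; 0.0 if none were attempted. */
double put_success_ratio(unsigned long long succ_puts,
                         unsigned long long failed_puts)
{
    unsigned long long total = succ_puts + failed_puts;

    return total ? (double)succ_puts / (double)total : 0.0;
}
```

A monitoring tool could call read_counter() for succ_puts and failed_puts and report put_success_ratio(); a ratio near zero suggests the backend is rejecting most pages and swap I/O is not being reduced.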
  62 +
  63 +FAQ
  64 +
  65 +1) Where's the value?
  66 +
  67 +When a workload starts swapping, performance falls through the floor.
  68 +Frontswap significantly increases performance in many such workloads by
  69 +providing a clean, dynamic interface to read and write swap pages to
  70 +"transcendent memory" that is otherwise not directly addressable to the kernel.
  71 +This interface is ideal when data is transformed to a different form
  72 +and size (such as with compression) or secretly moved (as might be
  73 +useful for write-balancing for some RAM-like devices). Swap pages (and
  74 +evicted page-cache pages) are a great use for this kind of slower-than-RAM-
  75 +but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
  76 +cleancache) interface to transcendent memory provides a nice way to read
  77 +and write -- and indirectly "name" -- the pages.
  78 +
  79 +Frontswap -- and cleancache -- with a fairly small impact on the kernel,
  80 +provides a huge amount of flexibility for more dynamic, flexible RAM
  81 +utilization in various system configurations:
  82 +
  83 +In the single kernel case, aka "zcache", pages are compressed and
  84 +stored in local memory, thus increasing the total anonymous pages
  85 +that can be safely kept in RAM. Zcache essentially trades off CPU
  86 +cycles used in compression/decompression for better memory utilization.
  87 +Benchmarks have shown little or no impact when memory pressure is
  88 +low while providing a significant performance improvement (25%+)
  89 +on some workloads under high memory pressure.
  90 +
  91 +"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
  92 +support for clustered systems. Frontswap pages are locally compressed
  93 +as in zcache, but then "remotified" to another system's RAM. This
  94 +allows RAM to be dynamically load-balanced back-and-forth as needed,
  95 +i.e. when system A is overcommitted, it can swap to system B, and
  96 +vice versa. RAMster can also be configured as a memory server so
  97 +many servers in a cluster can swap, dynamically as needed, to a single
  98 +server configured with a large amount of RAM... without pre-configuring
  99 +how much of the RAM is available for each of the clients!
  100 +
  101 +In the virtual case, the whole point of virtualization is to statistically
  102 +multiplex physical resources across the varying demands of multiple
  103 +virtual machines. This is really hard to do with RAM and efforts to do
  104 +it well with no kernel changes have essentially failed (except in some
  105 +well-publicized special-case workloads).
  106 +Specifically, the Xen Transcendent Memory backend allows otherwise
  107 +"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
  108 +virtual machines, but the pages can be compressed and deduplicated to
  109 +optimize RAM utilization. And when guest OS's are induced to surrender
  110 +underutilized RAM (e.g. with "selfballooning"), sudden unexpected
  111 +memory pressure may result in swapping; frontswap allows those pages
  112 +to be swapped to and from hypervisor RAM (if overall host system memory
  113 +conditions allow), thus mitigating the potentially awful performance impact
  114 +of unplanned swapping.
  115 +
  116 +A KVM implementation is underway and has been RFC'ed to lkml. And,
  117 +using frontswap, investigation is also underway on the use of NVM as
  118 +a memory extension technology.
  119 +
  120 +2) Sure there may be performance advantages in some situations, but
  121 + what's the space/time overhead of frontswap?
  122 +
  123 +If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
  124 +nothingness and the only overhead is a few extra bytes per swapon'ed
  125 +swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
  126 +registers, there is one extra global variable compared to zero for
  127 +every swap page read or written. If CONFIG_FRONTSWAP is enabled
  128 +AND a frontswap backend registers AND the backend fails every "put"
  129 +request (i.e. provides no memory despite claiming it might),
  130 +CPU overhead is still negligible -- and since every frontswap fail
  131 +precedes a swap page write-to-disk, the system is highly likely
  132 +to be I/O bound and using a small fraction of a percent of a CPU
  133 +will be irrelevant anyway.
  134 +
  135 +As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
  136 +registers, one bit is allocated for every swap page for every swap
  137 +device that is swapon'd. This is added to the EIGHT bits (which
  138 +was sixteen until about 2.6.34) that the kernel already allocates
  139 +for every swap page for every swap device that is swapon'd. (Hugh
  140 +Dickins has observed that frontswap could probably steal one of
  141 +the existing eight bits, but let's worry about that minor optimization
  142 +later.) For very large swap disks (which are rare) on a standard
  143 +4K pagesize, this is 1MB per 32GB swap.
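
The space figures above can be checked with a few lines of arithmetic. The helper names below are made up for this sketch; the only inputs taken from the text are one frontswap bit and eight existing bits per swap page.

```c
#include <assert.h>
#include <stdint.h>

/* Number of page-sized slots on a swap device of the given size. */
uint64_t swap_pages(uint64_t swap_bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return swap_bytes / page_size;
}

/* Bytes for a one-bit-per-page frontswap map, rounded up to a whole byte. */
uint64_t frontswap_map_bytes(uint64_t swap_bytes, uint64_t page_size)
{
    return (swap_pages(swap_bytes, page_size) + 7) / 8;
}

/* Bytes for the existing eight-bits-per-page swap map. */
uint64_t swap_map_bytes(uint64_t swap_bytes, uint64_t page_size)
{
    return swap_pages(swap_bytes, page_size);
}
```

For a 32GB device with 4K pages there are 8,388,608 swap pages, so frontswap_map_bytes(32ULL << 30, 4096) is 1,048,576 bytes (1MB), on top of the 8MB that the existing per-page byte already costs.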
  144 +
  145 +When swap pages are stored in transcendent memory instead of written
  146 +out to disk, there is a side effect that this may create more memory
  147 +pressure that can potentially outweigh the other advantages. A
  148 +backend, such as zcache, must implement policies to carefully (but
  149 +dynamically) manage memory limits to ensure this doesn't happen.
  150 +
  151 +3) OK, how about a quick overview of what this frontswap patch does
  152 + in terms that a kernel hacker can grok?
  153 +
  154 +Let's assume that a frontswap "backend" has registered during
  155 +kernel initialization; this registration indicates that this
  156 +frontswap backend has access to some "memory" that is not directly
  157 +accessible by the kernel. Exactly how much memory it provides is
  158 +entirely dynamic and random.
  159 +
  160 +Whenever a swap-device is swapon'd frontswap_init() is called,
  161 +passing the swap device number (aka "type") as a parameter.
  162 +This notifies frontswap to expect attempts to "put" swap pages
  163 +associated with that number.
  164 +
  165 +Whenever the swap subsystem is readying a page to write to a swap
  166 +device (cf. swap_writepage()), frontswap_put_page is called. Frontswap
  167 +consults with the frontswap backend and if the backend says it does NOT
  168 +have room, frontswap_put_page returns -1 and the kernel swaps the page
  169 +to the swap device as normal. Note that the response from the frontswap
  170 +backend is unpredictable to the kernel; it may choose to never accept a
  171 +page, it could accept every ninth page, or it might accept every
  172 +page. But if the backend does accept a page, the data from the page
  173 +has already been copied and associated with the type and offset,
  174 +and the backend guarantees the persistence of the data. In this case,
  175 +frontswap sets a bit in the "frontswap_map" for the swap device
  176 +corresponding to the page offset on the swap device to which it would
  177 +otherwise have written the data.
  178 +
  179 +When the swap subsystem needs to swap-in a page (swap_readpage()),
  180 +it first calls frontswap_get_page() which checks the frontswap_map to
  181 +see if the page was earlier accepted by the frontswap backend. If
  182 +it was, the page of data is filled from the frontswap backend and
  183 +the swap-in is complete. If not, the normal swap-in code is
  184 +executed to obtain the page of data from the real swap device.
  185 +
  186 +So every time the frontswap backend accepts a page, a swap device write
  187 +and (potentially) a swap device read are replaced by a "frontswap backend
  188 +put" and (possibly) a "frontswap backend get", which are presumably much
  189 +faster.
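
The swap-out and swap-in flow described in this answer can be modeled in userspace as follows. This is a simplified sketch, not the kernel code: the sim_* types and functions, the tiny page size, and the sample backend that accepts only even offsets are all assumptions chosen to show the frontswap_map bit and the fallback to the real swap device.

```c
#include <assert.h>
#include <string.h>

#define SIM_PAGE_SIZE 64   /* tiny page size to keep the sketch small */
#define SIM_PAGES     16   /* slots on the simulated swap device */

/* Hypothetical backend interface: 0 on success, -1 on rejection. */
struct sim_backend {
    int (*put)(unsigned long offset, const void *page);
    int (*get)(unsigned long offset, void *page);
};

struct sim_swap_device {
    const struct sim_backend *backend;  /* NULL when no backend registered */
    unsigned char frontswap_map[(SIM_PAGES + 7) / 8];
    unsigned char disk[SIM_PAGES][SIM_PAGE_SIZE];
    unsigned long disk_writes, disk_reads;
};

static int map_test(const struct sim_swap_device *d, unsigned long off)
{
    return (d->frontswap_map[off / 8] >> (off % 8)) & 1;
}

static void map_set(struct sim_swap_device *d, unsigned long off)
{
    d->frontswap_map[off / 8] |= (unsigned char)(1u << (off % 8));
}

static void map_clear(struct sim_swap_device *d, unsigned long off)
{
    d->frontswap_map[off / 8] &= (unsigned char)~(1u << (off % 8));
}

/* Swap-out: try the backend first, fall back to the real device. */
void sim_swap_out(struct sim_swap_device *d, unsigned long off, const void *page)
{
    assert(off < SIM_PAGES);
    if (d->backend && d->backend->put(off, page) == 0) {
        map_set(d, off);        /* remember the backend holds this page */
        return;
    }
    map_clear(d, off);          /* no usable frontswap copy for this slot */
    memcpy(d->disk[off], page, SIM_PAGE_SIZE);
    d->disk_writes++;
}

/* Swap-in: consult the map, then the backend, else read the real device. */
void sim_swap_in(struct sim_swap_device *d, unsigned long off, void *page)
{
    assert(off < SIM_PAGES);
    if (d->backend && map_test(d, off) && d->backend->get(off, page) == 0)
        return;
    memcpy(page, d->disk[off], SIM_PAGE_SIZE);
    d->disk_reads++;
}

/* Sample backend (assumption): accepts pages at even offsets only. */
static unsigned char even_store[SIM_PAGES][SIM_PAGE_SIZE];

static int even_put(unsigned long off, const void *page)
{
    if (off % 2)
        return -1;
    memcpy(even_store[off], page, SIM_PAGE_SIZE);
    return 0;
}

static int even_get(unsigned long off, void *page)
{
    memcpy(page, even_store[off], SIM_PAGE_SIZE);
    return 0;
}

const struct sim_backend even_backend = { even_put, even_get };
```

With even_backend attached, a page swapped out at offset 2 never touches the simulated disk, while a page at offset 3 is rejected by the backend and takes the normal disk path in both directions.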
  190 +
  191 +4) Can't frontswap be configured as a "special" swap device that is
  192 + just higher priority than any real swap device (e.g. like zswap,
  193 + or maybe swap-over-nbd/NFS)?
  194 +
  195 +No. First, the existing swap subsystem doesn't allow for any kind of
  196 +swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,
  197 +but this would require fairly drastic changes. Even if it were
  198 +rewritten, the existing swap subsystem uses the block I/O layer which
  199 +assumes a swap device is fixed size and any page in it is linearly
  200 +addressable. Frontswap barely touches the existing swap subsystem,
  201 +and works around the constraints of the block I/O subsystem to provide
  202 +a great deal of flexibility and dynamicity.
  203 +
  204 +For example, the acceptance of any swap page by the frontswap backend is
  205 +entirely unpredictable. This is critical to the definition of frontswap
  206 +backends because it grants completely dynamic discretion to the
  207 +backend. In zcache, one cannot know a priori how compressible a page is.
  208 +"Poorly" compressible pages can be rejected, and "poorly" can itself be
  209 +defined dynamically depending on current memory constraints.
  210 +
  211 +Further, frontswap is entirely synchronous whereas a real swap
  212 +device is, by definition, asynchronous and uses block I/O. The
  213 +block I/O layer is not only unnecessary, but may perform "optimizations"
  214 +that are inappropriate for a RAM-oriented device including delaying
  215 +the write of some pages for a significant amount of time. Synchrony is
  216 +required to ensure the dynamicity of the backend and to avoid thorny race
  217 +conditions that would unnecessarily and greatly complicate frontswap
  218 +and/or the block I/O subsystem. That said, only the initial "put"
  219 +and "get" operations need be synchronous. A separate asynchronous thread
  220 +is free to manipulate the pages stored by frontswap. For example,
  221 +the "remotification" thread in RAMster uses standard asynchronous
  222 +kernel sockets to move compressed frontswap pages to a remote machine.
  223 +Similarly, a KVM guest-side implementation could do in-guest compression
  224 +and use "batched" hypercalls.
  225 +
  226 +In a virtualized environment, the dynamicity allows the hypervisor
  227 +(or host OS) to do "intelligent overcommit". For example, it can
  228 +choose to accept pages only until host-swapping might be imminent,
  229 +then force guests to do their own swapping.
  230 +
  231 +There is a downside to the transcendent memory specifications for
  232 +frontswap: Since any "put" might fail, there must always be a real
  233 +slot on a real swap device to swap the page. Thus frontswap must be
  234 +implemented as a "shadow" to every swapon'd device with the potential
  235 +capability of holding every page that the swap device might have held
  236 +and the possibility that it might hold no pages at all. This means
  237 +that frontswap cannot contain more pages than the total of swapon'd
  238 +swap devices. For example, if NO swap device is configured on some
  239 +installation, frontswap is useless. Swapless portable devices
  240 +can still use frontswap but a backend for such devices must configure
  241 +some kind of "ghost" swap device and ensure that it is never used.
  242 +
  243 +5) Why this weird definition about "duplicate puts"? If a page
  244 + has been previously successfully put, can't it always be
  245 + successfully overwritten?
  246 +
  247 +Nearly always it can, but no, sometimes it cannot. Consider an example
  248 +where data is compressed and the original 4K page has been compressed
  249 +to 1K. Now an attempt is made to overwrite the page with data that
  250 +is non-compressible and so would take the entire 4K. But the backend
  251 +has no more space. In this case, the put must be rejected. Whenever
  252 +frontswap rejects a put that would overwrite, it also must invalidate
  253 +the old data and ensure that it is no longer accessible. Since the
  254 +swap subsystem then writes the new data to the real swap device,
  255 +this is the correct course of action to ensure coherency.
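
The 1K-then-4K example above can be reproduced with a small byte-budget model. The sketch does no real compression: the caller passes in the compressed size directly, and the dup_* names and the slot count are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DUP_SLOTS 4         /* assumed number of page offsets */

struct dup_pool {
    size_t capacity;        /* total bytes the backend may use */
    size_t used;            /* bytes currently in use */
    size_t size[DUP_SLOTS]; /* compressed size per offset; 0 means absent */
};

/*
 * Store a page whose compressed form takes compressed_size bytes.
 * Returns 0 if stored, -1 if rejected. On a rejected duplicate put the
 * old copy is invalidated too, so stale data can never be read back.
 */
int dup_put(struct dup_pool *p, unsigned long offset, size_t compressed_size)
{
    size_t old;

    if (offset >= DUP_SLOTS || compressed_size == 0)
        return -1;

    /* Release any old copy first: it is either replaced or invalidated. */
    old = p->size[offset];
    assert(p->used >= old);
    p->used -= old;
    p->size[offset] = 0;

    if (p->used + compressed_size > p->capacity)
        return -1;          /* rejected; the old data is already gone */

    p->size[offset] = compressed_size;
    p->used += compressed_size;
    return 0;
}

/* Nonzero if the pool currently holds a copy for this offset. */
int dup_present(const struct dup_pool *p, unsigned long offset)
{
    return offset < DUP_SLOTS && p->size[offset] != 0;
}
```

With a 4096-byte pool holding a 1024-byte page at offset 0 and a 3072-byte page at offset 1, dup_put(&pool, 0, 4096) fails and dup_present(&pool, 0) then reports the old copy as gone, which matches the rule that the caller must fall back to the real swap device.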
  256 +
  257 +6) What is frontswap_shrink for?
  258 +
  259 +When the (non-frontswap) swap subsystem swaps out a page to a real
  260 +swap device, that page is only taking up low-value pre-allocated disk
  261 +space. But if frontswap has placed a page in transcendent memory, that
  262 +page may be taking up valuable real estate. The frontswap_shrink
  263 +routine allows code outside of the swap subsystem to force pages out
  264 +of the memory managed by frontswap and back into kernel-addressable memory.
  265 +For example, in RAMster, a "suction driver" thread will attempt
  266 +to "repatriate" pages sent to a remote machine back to the local machine;
  267 +this is driven using the frontswap_shrink mechanism when memory pressure
  268 +subsides.
  269 +
  270 +7) Why does the frontswap patch create the new include file swapfile.h?
  271 +
  272 +The frontswap code depends on some swap-subsystem-internal data
  273 +structures that have, over the years, moved back and forth between
  274 +static and global. This seemed a reasonable compromise: Define
  275 +them as global but declare them in a new include file that isn't
  276 +included by the large number of source files that include swap.h.
  277 +
  278 +Dan Magenheimer, last updated April 9, 2012
... ... @@ -379,4 +379,21 @@
379 379 in a negligible performance hit.
380 380  
381 381 If unsure, say Y to enable cleancache
  382 +
  383 +config FRONTSWAP
  384 + bool "Enable frontswap to cache swap pages if tmem is present"
  385 + depends on SWAP
  386 + default n
  387 + help
  388 + Frontswap is so named because it can be thought of as the opposite
  389 + of a "backing" store for a swap device. The data is stored into
  390 + "transcendent memory", memory that is not directly accessible or
  391 + addressable by the kernel and is of unknown and possibly
  392 + time-varying size. When space in transcendent memory is available,
  393 + a significant swap I/O reduction may be achieved. When none is
  394 + available, all frontswap calls are reduced to a single pointer-
  395 + compare-against-NULL resulting in a negligible performance hit
  396 + and swap data is stored as normal on the matching swap device.
  397 +
  398 + If unsure, say Y to enable frontswap.
... ... @@ -26,6 +26,7 @@
26 26  
27 27 obj-$(CONFIG_BOUNCE) += bounce.o
28 28 obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
  29 +obj-$(CONFIG_FRONTSWAP) += frontswap.o
29 30 obj-$(CONFIG_HAS_DMA) += dmapool.o
30 31 obj-$(CONFIG_HUGETLBFS) += hugetlb.o
31 32 obj-$(CONFIG_NUMA) += mempolicy.o