Commit 27c6aec214264992603526d47da9dabddf3521b3
Committed by Konrad Rzeszutek Wilk
1 parent 29f233cfff
Exists in master and in 20 other branches
mm: frontswap: config and doc files
This patch (4 of 4) adds configuration and documentation files, including a FAQ.

[v14: updated docs/FAQ to use zcache and RAMster as examples]
[v10: no change]
[v9: akpm@linux-foundation.org: sysfs->debugfs; no longer need Doc/ABI file]
[v8: rebase to 3.0-rc4]
[v7: rebase to 3.0-rc3]
[v6: rebase to 3.0-rc1]
[v5: change config default to n]
[v4: rebase to 2.6.39]

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Acked-by: Jan Beulich <JBeulich@novell.com>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Rik Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Showing 3 changed files with 296 additions and 0 deletions
Documentation/vm/frontswap.txt
1 | +Frontswap provides a "transcendent memory" interface for swap pages. | |
2 | +In some environments, dramatic performance savings may be obtained because | |
3 | +swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. | |
4 | + | |
5 | +(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends" | |
6 | +and the only necessary changes to the core kernel for transcendent memory; | |
7 | +all other supporting code -- the "backends" -- is implemented as drivers. | |
8 | +See the LWN.net article "Transcendent memory in a nutshell" for a detailed | |
9 | +overview of frontswap and related kernel parts: | |
10 | +https://lwn.net/Articles/454795/ ) | |
11 | + | |
12 | +Frontswap is so named because it can be thought of as the opposite of | |
13 | +a "backing" store for a swap device. The storage is assumed to be | |
14 | +a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming | |
15 | +to the requirements of transcendent memory (such as Xen's "tmem", or | |
16 | +in-kernel compressed memory, aka "zcache", or future RAM-like devices); | |
17 | +this pseudo-RAM device is not directly accessible or addressable by the | |
18 | +kernel and is of unknown and possibly time-varying size. The driver | |
19 | +links itself to frontswap by calling frontswap_register_ops to set the | |
20 | +frontswap_ops funcs appropriately and the functions it provides must | |
21 | +conform to certain policies as follows: | |
22 | + | |
23 | +An "init" prepares the device to receive frontswap pages associated | |
24 | +with the specified swap device number (aka "type"). A "put_page" will | |
25 | +copy the page to transcendent memory and associate it with the type and | |
26 | +offset associated with the page. A "get_page" will copy the page, if found, | |
27 | +from transcendent memory into kernel memory, but will NOT remove the page | |
28 | +from transcendent memory. An "invalidate_page" will remove the page | |
29 | +from transcendent memory and an "invalidate_area" will remove ALL pages | |
30 | +associated with the swap type (e.g., like swapoff) and notify the "device" | |
31 | +to refuse further puts with that swap type. | |
32 | + | |
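The five backend operations described above can be sketched as a user-space toy model. This is only an illustration of the stated policies, not kernel code: the class name ToyBackend, the dictionary storage and the Python method signatures are assumptions made for the example, and the -1 failure value simply mirrors the return value the FAQ mentions for frontswap_put_page.

```python
class ToyBackend:
    """User-space toy model of the five frontswap backend operations."""

    def __init__(self):
        self.pages = {}      # (swap_type, offset) -> private copy of the page
        self.active = set()  # swap types currently allowed to put pages

    def init(self, swap_type):
        # Prepare to receive pages for this swap device number ("type").
        self.active.add(swap_type)

    def put_page(self, swap_type, offset, data):
        # Copy the page in; return 0 on success, -1 if the put is refused.
        if swap_type not in self.active:
            return -1
        self.pages[(swap_type, offset)] = bytes(data)
        return 0

    def get_page(self, swap_type, offset):
        # Hand back the page if found; the stored copy is NOT removed.
        return self.pages.get((swap_type, offset))

    def invalidate_page(self, swap_type, offset):
        # Remove a single page; a missing page is not an error here.
        self.pages.pop((swap_type, offset), None)

    def invalidate_area(self, swap_type):
        # Drop ALL pages of this type and refuse further puts for it.
        for key in [k for k in self.pages if k[0] == swap_type]:
            del self.pages[key]
        self.active.discard(swap_type)
```

In this model a get after a successful put keeps returning the data until an invalidate, and a put after invalidate_area is refused until init is called again for that type.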
33 | +Once a page is successfully put, a matching get on the page will normally | |
34 | +succeed. So when the kernel finds itself in a situation where it needs | |
35 | +to swap out a page, it first attempts to use frontswap. If the put returns | |
36 | +success, the data has been successfully saved to transcendent memory and | |
37 | +a disk write and, if the data is later read back, a disk read are avoided. | |
38 | +If a put returns failure, transcendent memory has rejected the data, and the | |
39 | +page can be written to swap as usual. | |
40 | + | |
41 | +If a backend chooses, frontswap can be configured as a "writethrough | |
42 | +cache" by calling frontswap_writethrough(). In this mode, the reduction | |
43 | +in swap device writes is lost (and also a non-trivial performance advantage) | |
44 | +in order to allow the backend to arbitrarily "reclaim" space used to | |
45 | +store frontswap pages to more completely manage its memory usage. | |
46 | + | |
47 | +Note that if a page is put and the page already exists in transcendent memory | |
48 | +(a "duplicate" put), either the put succeeds and the data is overwritten, | |
49 | +or the put fails AND the page is invalidated. This ensures stale data may | |
50 | +never be obtained from frontswap. | |
51 | + | |
52 | +If properly configured, monitoring of frontswap is done via debugfs in | |
53 | +the /sys/kernel/debug/frontswap directory. The effectiveness of | |
54 | +frontswap can be measured (across all swap devices) with: | |
55 | + | |
56 | +failed_puts - how many put attempts have failed | |
57 | +gets - how many gets were attempted (all should succeed) | |
58 | +succ_puts - how many put attempts have succeeded | |
59 | +invalidates - how many invalidates were attempted | |
60 | + | |
61 | +A backend implementation may provide additional metrics. | |
62 | + | |
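A minimal sketch of reading the four counters listed above from user space. It assumes debugfs is mounted at /sys/kernel/debug, that the caller has permission to read it, and that each counter file holds a single decimal integer; the helper names read_frontswap_counters and put_success_ratio are hypothetical.

```python
from pathlib import Path

# Counter names as listed in the documentation text above.
COUNTERS = ("failed_puts", "gets", "succ_puts", "invalidates")


def read_frontswap_counters(debug_dir="/sys/kernel/debug/frontswap"):
    """Return {counter_name: int} for every counter that could be read."""
    stats = {}
    for name in COUNTERS:
        path = Path(debug_dir) / name
        try:
            stats[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # debugfs not mounted, file missing, no permission,
            # or the content is not a plain integer (assumption violated).
            continue
    return stats


def put_success_ratio(stats):
    """Fraction of put attempts the backend accepted, or None if no puts."""
    succ = stats.get("succ_puts", 0)
    total = succ + stats.get("failed_puts", 0)
    return succ / total if total else None
```

A ratio close to 1.0 would suggest the backend is absorbing most swap-outs; a ratio near 0 would suggest most pages still go to the swap device.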
63 | +FAQ | |
64 | + | |
65 | +1) Where's the value? | |
66 | + | |
67 | +When a workload starts swapping, performance falls through the floor. | |
68 | +Frontswap significantly increases performance in many such workloads by | |
69 | +providing a clean, dynamic interface to read and write swap pages to | |
70 | +"transcendent memory" that is otherwise not directly addressable to the kernel. | |
71 | +This interface is ideal when data is transformed to a different form | |
72 | +and size (such as with compression) or secretly moved (as might be | |
73 | +useful for write-balancing for some RAM-like devices). Swap pages (and | |
74 | +evicted page-cache pages) are a great use for this kind of slower-than-RAM- | |
75 | +but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and | |
76 | +cleancache) interface to transcendent memory provides a nice way to read | |
77 | +and write -- and indirectly "name" -- the pages. | |
78 | + | |
79 | +Frontswap -- and cleancache -- with a fairly small impact on the kernel, | |
80 | +provides a huge amount of flexibility for more dynamic RAM | |
81 | +utilization in various system configurations: | |
82 | + | |
83 | +In the single kernel case, aka "zcache", pages are compressed and | |
84 | +stored in local memory, thus increasing the total anonymous pages | |
85 | +that can be safely kept in RAM. Zcache essentially trades off CPU | |
86 | +cycles used in compression/decompression for better memory utilization. | |
87 | +Benchmarks have shown little or no impact when memory pressure is | |
88 | +low while providing a significant performance improvement (25%+) | |
89 | +on some workloads under high memory pressure. | |
90 | + | |
91 | +"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory | |
92 | +support for clustered systems. Frontswap pages are locally compressed | |
93 | +as in zcache, but then "remotified" to another system's RAM. This | |
94 | +allows RAM to be dynamically load-balanced back-and-forth as needed, | |
95 | +i.e. when system A is overcommitted, it can swap to system B, and | |
96 | +vice versa. RAMster can also be configured as a memory server so | |
97 | +many servers in a cluster can swap, dynamically as needed, to a single | |
98 | +server configured with a large amount of RAM... without pre-configuring | |
99 | +how much of the RAM is available for each of the clients! | |
100 | + | |
101 | +In the virtual case, the whole point of virtualization is to statistically | |
102 | +multiplex physical resources across the varying demands of multiple | |
103 | +virtual machines. This is really hard to do with RAM and efforts to do | |
104 | +it well with no kernel changes have essentially failed (except in some | |
105 | +well-publicized special-case workloads). | |
106 | +Specifically, the Xen Transcendent Memory backend allows otherwise | |
107 | +"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple | |
108 | +virtual machines, but the pages can be compressed and deduplicated to | |
109 | +optimize RAM utilization. And when guest OS's are induced to surrender | |
110 | +underutilized RAM (e.g. with "selfballooning"), sudden unexpected | |
111 | +memory pressure may result in swapping; frontswap allows those pages | |
112 | +to be swapped to and from hypervisor RAM (if overall host system memory | |
113 | +conditions allow), thus mitigating the potentially awful performance impact | |
114 | +of unplanned swapping. | |
115 | + | |
116 | +A KVM implementation is underway and has been RFC'ed to lkml. And, | |
117 | +using frontswap, investigation is also underway on the use of NVM as | |
118 | +a memory extension technology. | |
119 | + | |
120 | +2) Sure there may be performance advantages in some situations, but | |
121 | + what's the space/time overhead of frontswap? | |
122 | + | |
123 | +If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into | |
124 | +nothingness and the only overhead is a few extra bytes per swapon'ed | |
125 | +swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend" | |
126 | +registers, there is one extra global variable compared to zero for | |
127 | +every swap page read or written. If CONFIG_FRONTSWAP is enabled | |
128 | +AND a frontswap backend registers AND the backend fails every "put" | |
129 | +request (i.e. provides no memory despite claiming it might), | |
130 | +CPU overhead is still negligible -- and since every frontswap fail | |
131 | +precedes a swap page write-to-disk, the system is highly likely | |
132 | +to be I/O bound and using a small fraction of a percent of a CPU | |
133 | +will be irrelevant anyway. | |
134 | + | |
135 | +As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend | |
136 | +registers, one bit is allocated for every swap page for every swap | |
137 | +device that is swapon'd. This is added to the EIGHT bits (which | |
138 | +was sixteen until about 2.6.34) that the kernel already allocates | |
139 | +for every swap page for every swap device that is swapon'd. (Hugh | |
140 | +Dickins has observed that frontswap could probably steal one of | |
141 | +the existing eight bits, but let's worry about that minor optimization | |
142 | +later.) For very large swap disks (which are rare) on a standard | |
143 | +4K pagesize, this is 1MB per 32GB swap. | |
144 | + | |
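The "1MB per 32GB swap" figure above follows from one bit per 4K swap page. A small sketch of the arithmetic, with the function name frontswap_map_bytes being hypothetical and the sizes taken as binary units (GiB, MiB):

```python
GIB = 1 << 30
MIB = 1 << 20


def frontswap_map_bytes(swap_bytes, page_size=4096):
    """Bytes needed for a one-bit-per-swap-page map, rounded up."""
    pages = swap_bytes // page_size
    return (pages + 7) // 8  # eight pages tracked per byte
```

frontswap_map_bytes(32 * GIB) evaluates to 1048576 bytes, i.e. 1 MiB: 32 GiB / 4 KiB is 8388608 pages, and 8388608 bits is 1048576 bytes. The eight bits per page that the text says the kernel already allocates would come to 8 MiB for the same device.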
145 | +When swap pages are stored in transcendent memory instead of written | |
146 | +out to disk, there is a side effect that this may create more memory | |
147 | +pressure that can potentially outweigh the other advantages. A | |
148 | +backend, such as zcache, must implement policies to carefully (but | |
149 | +dynamically) manage memory limits to ensure this doesn't happen. | |
150 | + | |
151 | +3) OK, how about a quick overview of what this frontswap patch does | |
152 | + in terms that a kernel hacker can grok? | |
153 | + | |
154 | +Let's assume that a frontswap "backend" has registered during | |
155 | +kernel initialization; this registration indicates that this | |
156 | +frontswap backend has access to some "memory" that is not directly | |
157 | +accessible by the kernel. Exactly how much memory it provides is | |
158 | +entirely dynamic and random. | |
159 | + | |
160 | +Whenever a swap-device is swapon'd, frontswap_init() is called, | |
161 | +passing the swap device number (aka "type") as a parameter. | |
162 | +This notifies frontswap to expect attempts to "put" swap pages | |
163 | +associated with that number. | |
164 | + | |
165 | +Whenever the swap subsystem is readying a page to write to a swap | |
166 | +device (cf. swap_writepage()), frontswap_put_page is called. Frontswap | |
167 | +consults with the frontswap backend and if the backend says it does NOT | |
168 | +have room, frontswap_put_page returns -1 and the kernel swaps the page | |
169 | +to the swap device as normal. Note that the response from the frontswap | |
170 | +backend is unpredictable to the kernel; it may choose to never accept a | |
171 | +page, it could accept every ninth page, or it might accept every | |
172 | +page. But if the backend does accept a page, the data from the page | |
173 | +has already been copied and associated with the type and offset, | |
174 | +and the backend guarantees the persistence of the data. In this case, | |
175 | +frontswap sets a bit in the "frontswap_map" for the swap device | |
176 | +corresponding to the page offset on the swap device to which it would | |
177 | +otherwise have written the data. | |
178 | + | |
179 | +When the swap subsystem needs to swap-in a page (swap_readpage()), | |
180 | +it first calls frontswap_get_page() which checks the frontswap_map to | |
181 | +see if the page was earlier accepted by the frontswap backend. If | |
182 | +it was, the page of data is filled from the frontswap backend and | |
183 | +the swap-in is complete. If not, the normal swap-in code is | |
184 | +executed to obtain the page of data from the real swap device. | |
185 | + | |
186 | +So every time the frontswap backend accepts a page, a swap device write | |
187 | +and (potentially) a swap device read are replaced by a "frontswap backend | |
188 | +put" and (possibly) a "frontswap backend get", which are presumably much | |
189 | +faster. | |
190 | + | |
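The swap-out and swap-in flow described in this answer can be sketched as a toy model of a single swap device. Everything here is illustrative: ToySwap, the accepts policy callback, the set used as frontswap_map and the two dictionaries standing in for the backend and the disk are assumptions for the example, not the kernel's data structures.

```python
class ToySwap:
    """Toy model of one swap device shadowed by frontswap."""

    def __init__(self, accepts):
        self.accepts = accepts      # hypothetical backend policy: offset -> bool
        self.frontswap_map = set()  # offsets whose data lives in the backend
        self.backend = {}           # offset -> page held by the backend
        self.disk = {}              # offset -> page on the real swap device

    def swap_writepage(self, offset, data):
        # Try frontswap first; fall back to the real device on rejection.
        if self.accepts(offset):
            self.backend[offset] = bytes(data)
            self.frontswap_map.add(offset)
            return "frontswap"
        # Put rejected: make sure no stale copy stays reachable,
        # then write to the real swap device as usual.
        self.backend.pop(offset, None)
        self.frontswap_map.discard(offset)
        self.disk[offset] = bytes(data)
        return "disk"

    def swap_readpage(self, offset):
        # The map decides where the page comes from.
        if offset in self.frontswap_map:
            return self.backend[offset]
        return self.disk[offset]
```

For instance, ToySwap(lambda off: off % 9 == 0) models a backend that accepts only offsets divisible by nine; the caller of swap_readpage never needs to know which path was taken on the way out.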
191 | +4) Can't frontswap be configured as a "special" swap device that is | |
192 | + just higher priority than any real swap device (e.g. like zswap, | |
193 | + or maybe swap-over-nbd/NFS)? | |
194 | + | |
195 | +No. First, the existing swap subsystem doesn't allow for any kind of | |
196 | +swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, | |
197 | +but this would require fairly drastic changes. Even if it were | |
198 | +rewritten, the existing swap subsystem uses the block I/O layer which | |
199 | +assumes a swap device is fixed size and any page in it is linearly | |
200 | +addressable. Frontswap barely touches the existing swap subsystem, | |
201 | +and works around the constraints of the block I/O subsystem to provide | |
202 | +a great deal of flexibility and dynamicity. | |
203 | + | |
204 | +For example, the acceptance of any swap page by the frontswap backend is | |
205 | +entirely unpredictable. This is critical to the definition of frontswap | |
206 | +backends because it grants completely dynamic discretion to the | |
207 | +backend. In zcache, one cannot know a priori how compressible a page is. | |
208 | +"Poorly" compressible pages can be rejected, and "poorly" can itself be | |
209 | +defined dynamically depending on current memory constraints. | |
210 | + | |
211 | +Further, frontswap is entirely synchronous whereas a real swap | |
212 | +device is, by definition, asynchronous and uses block I/O. The | |
213 | +block I/O layer is not only unnecessary, but may perform "optimizations" | |
214 | +that are inappropriate for a RAM-oriented device including delaying | |
215 | +the write of some pages for a significant amount of time. Synchrony is | |
216 | +required to ensure the dynamicity of the backend and to avoid thorny race | |
217 | +conditions that would unnecessarily and greatly complicate frontswap | |
218 | +and/or the block I/O subsystem. That said, only the initial "put" | |
219 | +and "get" operations need be synchronous. A separate asynchronous thread | |
220 | +is free to manipulate the pages stored by frontswap. For example, | |
221 | +the "remotification" thread in RAMster uses standard asynchronous | |
222 | +kernel sockets to move compressed frontswap pages to a remote machine. | |
223 | +Similarly, a KVM guest-side implementation could do in-guest compression | |
224 | +and use "batched" hypercalls. | |
225 | + | |
226 | +In a virtualized environment, the dynamicity allows the hypervisor | |
227 | +(or host OS) to do "intelligent overcommit". For example, it can | |
228 | +choose to accept pages only until host-swapping might be imminent, | |
229 | +then force guests to do their own swapping. | |
230 | + | |
231 | +There is a downside to the transcendent memory specifications for | |
232 | +frontswap: Since any "put" might fail, there must always be a real | |
233 | +slot on a real swap device to swap the page. Thus frontswap must be | |
234 | +implemented as a "shadow" to every swapon'd device with the potential | |
235 | +capability of holding every page that the swap device might have held | |
236 | +and the possibility that it might hold no pages at all. This means | |
237 | +that frontswap cannot contain more pages than the total of swapon'd | |
238 | +swap devices. For example, if NO swap device is configured on some | |
239 | +installation, frontswap is useless. Swapless portable devices | |
240 | +can still use frontswap but a backend for such devices must configure | |
241 | +some kind of "ghost" swap device and ensure that it is never used. | |
242 | + | |
243 | +5) Why this weird definition about "duplicate puts"? If a page | |
244 | + has been previously successfully put, can't it always be | |
245 | + successfully overwritten? | |
246 | + | |
247 | +Nearly always it can, but no, sometimes it cannot. Consider an example | |
248 | +where data is compressed and the original 4K page has been compressed | |
249 | +to 1K. Now an attempt is made to overwrite the page with data that | |
250 | +is non-compressible and so would take the entire 4K. But the backend | |
251 | +has no more space. In this case, the put must be rejected. Whenever | |
252 | +frontswap rejects a put that would overwrite, it also must invalidate | |
253 | +the old data and ensure that it is no longer accessible. Since the | |
254 | +swap subsystem then writes the new data to the real swap device, | |
255 | +this is the correct course of action to ensure coherency. | |
256 | + | |
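The duplicate-put rule from this answer can be sketched with a toy compressed store. This is a hedged illustration only: ToyCompressedBackend, its fixed byte budget and the use of zlib are assumptions for the example and do not reflect how zcache is actually implemented.

```python
import zlib


class ToyCompressedBackend:
    """Toy zcache-like store with a fixed byte budget (an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity  # total bytes available for compressed pages
        self.store = {}           # offset -> compressed page

    def used(self):
        return sum(len(blob) for blob in self.store.values())

    def put(self, offset, page):
        blob = zlib.compress(page)
        # The old copy is dropped whether or not the new one fits, so a
        # failed overwrite can never leave stale data behind.
        self.store.pop(offset, None)
        if self.used() + len(blob) > self.capacity:
            return -1  # rejected; the old data is already invalidated
        self.store[offset] = blob
        return 0

    def get(self, offset):
        blob = self.store.get(offset)
        return None if blob is None else zlib.decompress(blob)
```

With a small budget, a highly compressible page is accepted, while overwriting it with incompressible data of the same size is rejected, and a later get for that offset finds nothing, so the caller has to fall back to the copy written to the real swap device.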
257 | +6) What is frontswap_shrink for? | |
258 | + | |
259 | +When the (non-frontswap) swap subsystem swaps out a page to a real | |
260 | +swap device, that page is only taking up low-value pre-allocated disk | |
261 | +space. But if frontswap has placed a page in transcendent memory, that | |
262 | +page may be taking up valuable real estate. The frontswap_shrink | |
263 | +routine allows code outside of the swap subsystem to force pages out | |
264 | +of the memory managed by frontswap and back into kernel-addressable memory. | |
265 | +For example, in RAMster, a "suction driver" thread will attempt | |
266 | +to "repatriate" pages sent to a remote machine back to the local machine; | |
267 | +this is driven using the frontswap_shrink mechanism when memory pressure | |
268 | +subsides. | |
269 | + | |
270 | +7) Why does the frontswap patch create the new include file swapfile.h? | |
271 | + | |
272 | +The frontswap code depends on some swap-subsystem-internal data | |
273 | +structures that have, over the years, moved back and forth between | |
274 | +static and global. This seemed a reasonable compromise: Define | |
275 | +them as global but declare them in a new include file that isn't | |
276 | +included by the large number of source files that include swap.h. | |
277 | + | |
278 | +Dan Magenheimer, last updated April 9, 2012 |
mm/Kconfig
... | ... | @@ -379,4 +379,21 @@ |
379 | 379 | in a negligible performance hit. |
380 | 380 | |
381 | 381 | If unsure, say Y to enable cleancache |
382 | + | |
383 | +config FRONTSWAP | |
384 | + bool "Enable frontswap to cache swap pages if tmem is present" | |
385 | + depends on SWAP | |
386 | + default n | |
387 | + help | |
388 | + Frontswap is so named because it can be thought of as the opposite | |
389 | + of a "backing" store for a swap device. The data is stored into | |
390 | + "transcendent memory", memory that is not directly accessible or | |
391 | + addressable by the kernel and is of unknown and possibly | |
392 | + time-varying size. When space in transcendent memory is available, | |
393 | + a significant swap I/O reduction may be achieved. When none is | |
394 | + available, all frontswap calls are reduced to a single pointer- | |
395 | + compare-against-NULL resulting in a negligible performance hit | |
396 | + and swap data is stored as normal on the matching swap device. | |
397 | + | |
398 | + If unsure, say Y to enable frontswap. |
mm/Makefile
... | ... | @@ -26,6 +26,7 @@ |
26 | 26 | |
27 | 27 | obj-$(CONFIG_BOUNCE) += bounce.o |
28 | 28 | obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o |
29 | +obj-$(CONFIG_FRONTSWAP) += frontswap.o | |
29 | 30 | obj-$(CONFIG_HAS_DMA) += dmapool.o |
30 | 31 | obj-$(CONFIG_HUGETLBFS) += hugetlb.o |
31 | 32 | obj-$(CONFIG_NUMA) += mempolicy.o |