Commit a3fe778c7895cd847d23c25ad566d83346282a77

Authored by Linus Torvalds

Merge tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm

Pull frontswap feature from Konrad Rzeszutek Wilk:
 "Frontswap provides a "transcendent memory" interface for swap pages.
  In some environments, dramatic performance savings may be obtained
  because swapped pages are saved in RAM (or a RAM-like device) instead
  of a swap disk.  This tag provides the basic infrastructure along with
  some changes to the existing backends."

Fix up trivial conflict in mm/Makefile due to removal of swap token code
changing a line next to the new frontswap entry.

This pull request came in before the merge window even opened; it got
delayed until after the merge window because I wanted to make sure it had
actual users.  Apparently IBM is using this on their embedded side, and
Jan Beulich says that it's already made available for SLES and openSUSE
users.

Also acked by Rik van Riel, and Konrad points to other people liking it
too.  So in it goes.

By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2)
via Konrad Rzeszutek Wilk
* tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
  frontswap: s/put_page/store/g s/get_page/load
  MAINTAINER: Add myself for the frontswap API
  mm: frontswap: config and doc files
  mm: frontswap: core frontswap functionality
  mm: frontswap: core swap subsystem hooks and headers
  mm: frontswap: add frontswap header file

Showing 13 changed files

Documentation/vm/frontswap.txt
  1 +Frontswap provides a "transcendent memory" interface for swap pages.
  2 +In some environments, dramatic performance savings may be obtained because
  3 +swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
  4 +
  5 +(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends"
  6 +and the only necessary changes to the core kernel for transcendent memory;
  7 +all other supporting code -- the "backends" -- is implemented as drivers.
  8 +See the LWN.net article "Transcendent memory in a nutshell" for a detailed
  9 +overview of frontswap and related kernel parts:
  10 +https://lwn.net/Articles/454795/ )
  11 +
  12 +Frontswap is so named because it can be thought of as the opposite of
  13 +a "backing" store for a swap device. The storage is assumed to be
  14 +a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
  15 +to the requirements of transcendent memory (such as Xen's "tmem", or
  16 +in-kernel compressed memory, aka "zcache", or future RAM-like devices);
  17 +this pseudo-RAM device is not directly accessible or addressable by the
  18 +kernel and is of unknown and possibly time-varying size. A backend
  19 +driver links itself to frontswap by calling frontswap_register_ops to set
  20 +the frontswap_ops functions appropriately; the functions it provides must
  21 +conform to the following policies:
  22 +
  23 +An "init" prepares the device to receive frontswap pages associated
  24 +with the specified swap device number (aka "type"). A "store" will
  25 +copy the page to transcendent memory and associate it with the page's
  26 +type and offset. A "load" will copy the page, if found, from transcendent
  27 +memory into kernel memory, but will NOT remove the page from transcendent
  28 +memory. An "invalidate_page" will remove the page
  29 +from transcendent memory and an "invalidate_area" will remove ALL pages
  30 +associated with the swap type (e.g., like swapoff) and notify the "device"
  31 +to refuse further stores with that swap type.
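
As an illustration, here is a minimal backend skeleton registering itself
against the frontswap_ops definition added in include/linux/frontswap.h
below. All "mybackend" names are hypothetical; only the frontswap types
and calls come from this patch:

	#include <linux/module.h>
	#include <linux/frontswap.h>

	static void mybackend_init(unsigned type)
	{
		/* prepare to receive frontswap pages for this swap type */
	}

	static int mybackend_store(unsigned type, pgoff_t offset,
				   struct page *page)
	{
		/* copy the page into backend-managed memory;
		   return 0 on success, -1 to reject the page */
		return -1;
	}

	static int mybackend_load(unsigned type, pgoff_t offset,
				  struct page *page)
	{
		/* fill the page from backend-managed memory; 0 on success */
		return -1;
	}

	static void mybackend_invalidate_page(unsigned type, pgoff_t offset)
	{
		/* remove the single page for this type/offset */
	}

	static void mybackend_invalidate_area(unsigned type)
	{
		/* remove ALL pages for this swap type */
	}

	static struct frontswap_ops mybackend_frontswap_ops = {
		.init = mybackend_init,
		.store = mybackend_store,
		.load = mybackend_load,
		.invalidate_page = mybackend_invalidate_page,
		.invalidate_area = mybackend_invalidate_area,
	};

	static int __init mybackend_setup(void)
	{
		/* previous ops are returned, allowing backends to chain */
		frontswap_register_ops(&mybackend_frontswap_ops);
		return 0;
	}
	module_init(mybackend_setup);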
  32 +
  33 +Once a page is successfully stored, a matching load on the page will normally
  34 +succeed. So when the kernel finds itself in a situation where it needs
  35 +to swap out a page, it first attempts to use frontswap. If the store returns
  36 +success, the data has been successfully saved to transcendent memory and
  37 +a disk write and, if the data is later read back, a disk read are avoided.
  38 +If a store returns failure, transcendent memory has rejected the data, and the
  39 +page can be written to swap as usual.
  40 +
  41 +If a backend chooses, frontswap can be configured as a "writethrough
  42 +cache" by calling frontswap_writethrough(). In this mode, the reduction
  43 +in swap device writes is lost (and also a non-trivial performance advantage)
  44 +in order to allow the backend to arbitrarily "reclaim" space used to
  45 +store frontswap pages to more completely manage its memory usage.
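
For example (hypothetical backend code), this mode is enabled with a
single call during backend setup:

	/* trade away the write savings in exchange for reclaim freedom */
	frontswap_writethrough(true);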
  46 +
  47 +Note that if a page is stored and the page already exists in transcendent memory
  48 +(a "duplicate" store), either the store succeeds and the data is overwritten,
  49 +or the store fails AND the page is invalidated. This ensures stale data may
  50 +never be obtained from frontswap.
  51 +
  52 +If debugfs is properly configured, frontswap can be monitored via
  53 +the /sys/kernel/debug/frontswap directory. The effectiveness of
  54 +frontswap can be measured (across all swap devices) with:
  55 +
  56 +failed_stores - how many store attempts have failed
  57 +loads - how many loads were attempted (all should succeed)
  58 +succ_stores - how many store attempts have succeeded
  59 +invalidates - how many invalidates were attempted
  60 +
  61 +A backend implementation may provide additional metrics.
  62 +
  63 +FAQ
  64 +
  65 +1) Where's the value?
  66 +
  67 +When a workload starts swapping, performance falls through the floor.
  68 +Frontswap significantly increases performance in many such workloads by
  69 +providing a clean, dynamic interface to read and write swap pages to
  70 +"transcendent memory" that is otherwise not directly addressable to the kernel.
  71 +This interface is ideal when data is transformed to a different form
  72 +and size (such as with compression) or secretly moved (as might be
  73 +useful for write-balancing for some RAM-like devices). Swap pages (and
  74 +evicted page-cache pages) are a great use for this kind of slower-than-RAM-
  75 +but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
  76 +cleancache) interface to transcendent memory provides a nice way to read
  77 +and write -- and indirectly "name" -- the pages.
  78 +
  79 +With a fairly small impact on the kernel, frontswap -- and cleancache --
  80 +provide a huge amount of flexibility for more dynamic, flexible RAM
  81 +utilization in various system configurations:
  82 +
  83 +In the single kernel case, aka "zcache", pages are compressed and
  84 +stored in local memory, thus increasing the total anonymous pages
  85 +that can be safely kept in RAM. Zcache essentially trades off CPU
  86 +cycles used in compression/decompression for better memory utilization.
  87 +Benchmarks have shown little or no impact when memory pressure is
  88 +low while providing a significant performance improvement (25%+)
  89 +on some workloads under high memory pressure.
  90 +
  91 +"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
  92 +support for clustered systems. Frontswap pages are locally compressed
  93 +as in zcache, but then "remotified" to another system's RAM. This
  94 +allows RAM to be dynamically load-balanced back-and-forth as needed,
  95 +i.e. when system A is overcommitted, it can swap to system B, and
  96 +vice versa. RAMster can also be configured as a memory server so
  97 +many servers in a cluster can swap, dynamically as needed, to a single
  98 +server configured with a large amount of RAM... without pre-configuring
  99 +how much of the RAM is available for each of the clients!
  100 +
  101 +In the virtual case, the whole point of virtualization is to statistically
  102 +multiplex physical resources across the varying demands of multiple
  103 +virtual machines. This is really hard to do with RAM and efforts to do
  104 +it well with no kernel changes have essentially failed (except in some
  105 +well-publicized special-case workloads).
  106 +Specifically, the Xen Transcendent Memory backend allows otherwise
  107 +"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
  108 +virtual machines, but the pages can be compressed and deduplicated to
  109 +optimize RAM utilization. And when guest OS's are induced to surrender
  110 +underutilized RAM (e.g. with "selfballooning"), sudden unexpected
  111 +memory pressure may result in swapping; frontswap allows those pages
  112 +to be swapped to and from hypervisor RAM (if overall host system memory
  113 +conditions allow), thus mitigating the potentially awful performance impact
  114 +of unplanned swapping.
  115 +
  116 +A KVM implementation is underway and has been RFC'ed to lkml. And,
  117 +using frontswap, investigation is also underway on the use of NVM as
  118 +a memory extension technology.
  119 +
  120 +2) Sure there may be performance advantages in some situations, but
  121 + what's the space/time overhead of frontswap?
  122 +
  123 +If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
  124 +nothingness and the only overhead is a few extra bytes per swapon'd
  125 +swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
  126 +registers, the only added cost is one comparison of a global variable
  127 +against zero for every swap page read or written. If CONFIG_FRONTSWAP is enabled
  128 +AND a frontswap backend registers AND the backend fails every "store"
  129 +request (i.e. provides no memory despite claiming it might),
  130 +CPU overhead is still negligible -- and since every frontswap fail
  131 +precedes a swap page write-to-disk, the system is highly likely
  132 +to be I/O bound and using a small fraction of a percent of a CPU
  133 +will be irrelevant anyway.
  134 +
  135 +As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
  136 +registers, one bit is allocated for every swap page for every swap
  137 +device that is swapon'd. This is added to the EIGHT bits (which
  138 +was sixteen until about 2.6.34) that the kernel already allocates
  139 +for every swap page for every swap device that is swapon'd. (Hugh
  140 +Dickins has observed that frontswap could probably steal one of
  141 +the existing eight bits, but let's worry about that minor optimization
  142 +later.) For very large swap disks (which are rare) on a standard
  143 +4K pagesize, this is 1MB per 32GB swap (32GB/4K = 8M pages = 8M bits = 1MB).
  144 +
  145 +When swap pages are stored in transcendent memory instead of written
  146 +out to disk, there is a side effect that this may create more memory
  147 +pressure that can potentially outweigh the other advantages. A
  148 +backend, such as zcache, must implement policies to carefully (but
  149 +dynamically) manage memory limits to ensure this doesn't happen.
  150 +
  151 +3) OK, how about a quick overview of what this frontswap patch does
  152 + in terms that a kernel hacker can grok?
  153 +
  154 +Let's assume that a frontswap "backend" has registered during
  155 +kernel initialization; this registration indicates that this
  156 +frontswap backend has access to some "memory" that is not directly
  157 +accessible by the kernel. Exactly how much memory it provides is
  158 +entirely dynamic and random.
  159 +
  160 +Whenever a swap device is swapon'd, frontswap_init() is called,
  161 +passing the swap device number (aka "type") as a parameter.
  162 +This notifies frontswap to expect attempts to "store" swap pages
  163 +associated with that number.
  164 +
  165 +Whenever the swap subsystem is readying a page to write to a swap
  166 +device (cf. swap_writepage()), frontswap_store is called. Frontswap
  167 +consults with the frontswap backend and if the backend says it does NOT
  168 +have room, frontswap_store returns -1 and the kernel swaps the page
  169 +to the swap device as normal. Note that the response from the frontswap
  170 +backend is unpredictable to the kernel; it may choose to never accept a
  171 +page, it could accept every ninth page, or it might accept every
  172 +page. But if the backend does accept a page, the data from the page
  173 +has already been copied and associated with the type and offset,
  174 +and the backend guarantees the persistence of the data. In this case,
  175 +frontswap sets a bit in the "frontswap_map" for the swap device
  176 +corresponding to the page offset on the swap device to which it would
  177 +otherwise have written the data.
  178 +
  179 +When the swap subsystem needs to swap-in a page (swap_readpage()),
  180 +it first calls frontswap_load() which checks the frontswap_map to
  181 +see if the page was earlier accepted by the frontswap backend. If
  182 +it was, the page of data is filled from the frontswap backend and
  183 +the swap-in is complete. If not, the normal swap-in code is
  184 +executed to obtain the page of data from the real swap device.
  185 +
  186 +So every time the frontswap backend accepts a page, a swap device write
  187 +and (potentially) a later swap device read are replaced by a "frontswap
  188 +backend store" and (possibly) a "frontswap backend load", which are
  189 +presumably much faster.
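
In code terms, the hook added at the top of swap_writepage() (see the
mm/page_io.c hunk further below) boils down to roughly:

	if (frontswap_store(page) == 0) {
		/* backend accepted the page: the "write" is already done */
		set_page_writeback(page);
		unlock_page(page);
		end_page_writeback(page);
		goto out;	/* no bio is built, no disk I/O is issued */
	}
	/* store rejected: fall through to the normal block I/O path */

The mirror-image hook in swap_readpage() tries frontswap_load() before
issuing a disk read.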
  190 +
  191 +4) Can't frontswap be configured as a "special" swap device that is
  192 + just higher priority than any real swap device (e.g. like zswap,
  193 + or maybe swap-over-nbd/NFS)?
  194 +
  195 +No. First, the existing swap subsystem doesn't allow for any kind of
  196 +swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,
  197 +but this would require fairly drastic changes. Even if it were
  198 +rewritten, the existing swap subsystem uses the block I/O layer which
  199 +assumes a swap device is fixed size and any page in it is linearly
  200 +addressable. Frontswap barely touches the existing swap subsystem,
  201 +and works around the constraints of the block I/O subsystem to provide
  202 +a great deal of flexibility and dynamicity.
  203 +
  204 +For example, the acceptance of any swap page by the frontswap backend is
  205 +entirely unpredictable. This is critical to the definition of frontswap
  206 +backends because it grants completely dynamic discretion to the
  207 +backend. In zcache, one cannot know a priori how compressible a page is.
  208 +"Poorly" compressible pages can be rejected, and "poorly" can itself be
  209 +defined dynamically depending on current memory constraints.
  210 +
  211 +Further, frontswap is entirely synchronous whereas a real swap
  212 +device is, by definition, asynchronous and uses block I/O. The
  213 +block I/O layer is not only unnecessary, but may perform "optimizations"
  214 +that are inappropriate for a RAM-oriented device including delaying
  215 +the write of some pages for a significant amount of time. Synchrony is
  216 +required to ensure the dynamicity of the backend and to avoid thorny race
  217 +conditions that would unnecessarily and greatly complicate frontswap
  218 +and/or the block I/O subsystem. That said, only the initial "store"
  219 +and "load" operations need be synchronous. A separate asynchronous thread
  220 +is free to manipulate the pages stored by frontswap. For example,
  221 +the "remotification" thread in RAMster uses standard asynchronous
  222 +kernel sockets to move compressed frontswap pages to a remote machine.
  223 +Similarly, a KVM guest-side implementation could do in-guest compression
  224 +and use "batched" hypercalls.
  225 +
  226 +In a virtualized environment, the dynamicity allows the hypervisor
  227 +(or host OS) to do "intelligent overcommit". For example, it can
  228 +choose to accept pages only until host-swapping might be imminent,
  229 +then force guests to do their own swapping.
  230 +
  231 +There is a downside to the transcendent memory specifications for
  232 +frontswap: Since any "store" might fail, there must always be a real
  233 +slot on a real swap device to swap the page. Thus frontswap must be
  234 +implemented as a "shadow" to every swapon'd device with the potential
  235 +capability of holding every page that the swap device might have held
  236 +and the possibility that it might hold no pages at all. This means
  237 +that frontswap cannot contain more pages than the total capacity of the
  238 +swapon'd swap devices. For example, if NO swap device is configured on some
  239 +installation, frontswap is useless. Swapless portable devices
  240 +can still use frontswap but a backend for such devices must configure
  241 +some kind of "ghost" swap device and ensure that it is never used.
  242 +
  243 +5) Why this weird definition about "duplicate stores"? If a page
  244 + has been previously successfully stored, can't it always be
  245 + successfully overwritten?
  246 +
  247 +Nearly always it can, but no, sometimes it cannot. Consider an example
  248 +where data is compressed and the original 4K page has been compressed
  249 +to 1K. Now an attempt is made to overwrite the page with data that
  250 +is non-compressible and so would take the entire 4K. But the backend
  251 +has no more space. In this case, the store must be rejected. Whenever
  252 +frontswap rejects a store that would overwrite, it also must invalidate
  253 +the old data and ensure that it is no longer accessible. Since the
  254 +swap subsystem then writes the new data to the real swap device,
  255 +this is the correct course of action to ensure coherency.
  256 +
  257 +6) What is frontswap_shrink for?
  258 +
  259 +When the (non-frontswap) swap subsystem swaps out a page to a real
  260 +swap device, that page is only taking up low-value pre-allocated disk
  261 +space. But if frontswap has placed a page in transcendent memory, that
  262 +page may be taking up valuable real estate. The frontswap_shrink
  263 +routine allows code outside of the swap subsystem to force pages out
  264 +of the memory managed by frontswap and back into kernel-addressable memory.
  265 +For example, in RAMster, a "suction driver" thread will attempt
  266 +to "repatriate" pages sent to a remote machine back to the local machine;
  267 +this is driven using the frontswap_shrink mechanism when memory pressure
  268 +subsides.
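
A hypothetical caller, using only the interfaces added by this patch,
might look like:

	/* memory pressure has subsided: shrink frontswap usage to half
	   of the pages it currently holds across all swap devices */
	unsigned long target = frontswap_curr_pages() / 2;

	frontswap_shrink(target);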
  269 +
  270 +7) Why does the frontswap patch create the new include file swapfile.h?
  271 +
  272 +The frontswap code depends on some swap-subsystem-internal data
  273 +structures that have, over the years, moved back and forth between
  274 +static and global. This seemed a reasonable compromise: Define
  275 +them as global but declare them in a new include file that isn't
  276 +included by the large number of source files that include swap.h.
  277 +
  278 +Dan Magenheimer, last updated April 9, 2012
MAINTAINERS
... ... @@ -2930,6 +2930,13 @@
2930 2930 F: include/linux/freezer.h
2931 2931 F: kernel/freezer.c
2932 2932  
  2933 +FRONTSWAP API
  2934 +M: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  2935 +L: linux-kernel@vger.kernel.org
  2936 +S: Maintained
  2937 +F: mm/frontswap.c
  2938 +F: include/linux/frontswap.h
  2939 +
2933 2940 FS-CACHE: LOCAL CACHING FOR NETWORK FILESYSTEMS
2934 2941 M: David Howells <dhowells@redhat.com>
2935 2942 L: linux-cachefs@redhat.com
drivers/staging/ramster/zcache-main.c
... ... @@ -3002,7 +3002,7 @@
3002 3002 return oid;
3003 3003 }
3004 3004  
3005   -static int zcache_frontswap_put_page(unsigned type, pgoff_t offset,
  3005 +static int zcache_frontswap_store(unsigned type, pgoff_t offset,
3006 3006 struct page *page)
3007 3007 {
3008 3008 u64 ind64 = (u64)offset;
... ... @@ -3025,7 +3025,7 @@
3025 3025  
3026 3026 /* returns 0 if the page was successfully gotten from frontswap, -1 if
3027 3027 * it was not present (should never happen!) */
3028   -static int zcache_frontswap_get_page(unsigned type, pgoff_t offset,
  3028 +static int zcache_frontswap_load(unsigned type, pgoff_t offset,
3029 3029 struct page *page)
3030 3030 {
3031 3031 u64 ind64 = (u64)offset;
... ... @@ -3080,8 +3080,8 @@
3080 3080 }
3081 3081  
3082 3082 static struct frontswap_ops zcache_frontswap_ops = {
3083   - .put_page = zcache_frontswap_put_page,
3084   - .get_page = zcache_frontswap_get_page,
  3083 + .store = zcache_frontswap_store,
  3084 + .load = zcache_frontswap_load,
3085 3085 .invalidate_page = zcache_frontswap_flush_page,
3086 3086 .invalidate_area = zcache_frontswap_flush_area,
3087 3087 .init = zcache_frontswap_init
drivers/staging/zcache/zcache-main.c
... ... @@ -1835,7 +1835,7 @@
1835 1835 * Swizzling increases objects per swaptype, increasing tmem concurrency
1836 1836 * for heavy swaploads. Later, larger nr_cpus -> larger SWIZ_BITS
1837 1837 * Setting SWIZ_BITS to 27 basically reconstructs the swap entry from
1838   - * frontswap_get_page(), but has side-effects. Hence using 8.
  1838 + * frontswap_load(), but has side-effects. Hence using 8.
1839 1839 */
1840 1840 #define SWIZ_BITS 8
1841 1841 #define SWIZ_MASK ((1 << SWIZ_BITS) - 1)
... ... @@ -1849,7 +1849,7 @@
1849 1849 return oid;
1850 1850 }
1851 1851  
1852   -static int zcache_frontswap_put_page(unsigned type, pgoff_t offset,
  1852 +static int zcache_frontswap_store(unsigned type, pgoff_t offset,
1853 1853 struct page *page)
1854 1854 {
1855 1855 u64 ind64 = (u64)offset;
... ... @@ -1870,7 +1870,7 @@
1870 1870  
1871 1871 /* returns 0 if the page was successfully gotten from frontswap, -1 if
1872 1872 * it was not present (should never happen!) */
1873   -static int zcache_frontswap_get_page(unsigned type, pgoff_t offset,
  1873 +static int zcache_frontswap_load(unsigned type, pgoff_t offset,
1874 1874 struct page *page)
1875 1875 {
1876 1876 u64 ind64 = (u64)offset;
... ... @@ -1919,8 +1919,8 @@
1919 1919 }
1920 1920  
1921 1921 static struct frontswap_ops zcache_frontswap_ops = {
1922   - .put_page = zcache_frontswap_put_page,
1923   - .get_page = zcache_frontswap_get_page,
  1922 + .store = zcache_frontswap_store,
  1923 + .load = zcache_frontswap_load,
1924 1924 .invalidate_page = zcache_frontswap_flush_page,
1925 1925 .invalidate_area = zcache_frontswap_flush_area,
1926 1926 .init = zcache_frontswap_init
drivers/xen/tmem.c
... ... @@ -269,7 +269,7 @@
269 269 }
270 270  
271 271 /* returns 0 if the page was successfully put into frontswap, -1 if not */
272   -static int tmem_frontswap_put_page(unsigned type, pgoff_t offset,
  272 +static int tmem_frontswap_store(unsigned type, pgoff_t offset,
273 273 struct page *page)
274 274 {
275 275 u64 ind64 = (u64)offset;
... ... @@ -295,7 +295,7 @@
295 295 * returns 0 if the page was successfully gotten from frontswap, -1 if
296 296 * it was not present (should never happen!)
297 297 */
298   -static int tmem_frontswap_get_page(unsigned type, pgoff_t offset,
  298 +static int tmem_frontswap_load(unsigned type, pgoff_t offset,
299 299 struct page *page)
300 300 {
301 301 u64 ind64 = (u64)offset;
... ... @@ -362,8 +362,8 @@
362 362 __setup("nofrontswap", no_frontswap);
363 363  
364 364 static struct frontswap_ops __initdata tmem_frontswap_ops = {
365   - .put_page = tmem_frontswap_put_page,
366   - .get_page = tmem_frontswap_get_page,
  365 + .store = tmem_frontswap_store,
  366 + .load = tmem_frontswap_load,
367 367 .invalidate_page = tmem_frontswap_flush_page,
368 368 .invalidate_area = tmem_frontswap_flush_area,
369 369 .init = tmem_frontswap_init
include/linux/frontswap.h
  1 +#ifndef _LINUX_FRONTSWAP_H
  2 +#define _LINUX_FRONTSWAP_H
  3 +
  4 +#include <linux/swap.h>
  5 +#include <linux/mm.h>
  6 +#include <linux/bitops.h>
  7 +
  8 +struct frontswap_ops {
  9 + void (*init)(unsigned);
  10 + int (*store)(unsigned, pgoff_t, struct page *);
  11 + int (*load)(unsigned, pgoff_t, struct page *);
  12 + void (*invalidate_page)(unsigned, pgoff_t);
  13 + void (*invalidate_area)(unsigned);
  14 +};
  15 +
  16 +extern bool frontswap_enabled;
  17 +extern struct frontswap_ops
  18 + frontswap_register_ops(struct frontswap_ops *ops);
  19 +extern void frontswap_shrink(unsigned long);
  20 +extern unsigned long frontswap_curr_pages(void);
  21 +extern void frontswap_writethrough(bool);
  22 +
  23 +extern void __frontswap_init(unsigned type);
  24 +extern int __frontswap_store(struct page *page);
  25 +extern int __frontswap_load(struct page *page);
  26 +extern void __frontswap_invalidate_page(unsigned, pgoff_t);
  27 +extern void __frontswap_invalidate_area(unsigned);
  28 +
  29 +#ifdef CONFIG_FRONTSWAP
  30 +
  31 +static inline bool frontswap_test(struct swap_info_struct *sis, pgoff_t offset)
  32 +{
  33 + bool ret = false;
  34 +
  35 + if (frontswap_enabled && sis->frontswap_map)
  36 + ret = test_bit(offset, sis->frontswap_map);
  37 + return ret;
  38 +}
  39 +
  40 +static inline void frontswap_set(struct swap_info_struct *sis, pgoff_t offset)
  41 +{
  42 + if (frontswap_enabled && sis->frontswap_map)
  43 + set_bit(offset, sis->frontswap_map);
  44 +}
  45 +
  46 +static inline void frontswap_clear(struct swap_info_struct *sis, pgoff_t offset)
  47 +{
  48 + if (frontswap_enabled && sis->frontswap_map)
  49 + clear_bit(offset, sis->frontswap_map);
  50 +}
  51 +
  52 +static inline void frontswap_map_set(struct swap_info_struct *p,
  53 + unsigned long *map)
  54 +{
  55 + p->frontswap_map = map;
  56 +}
  57 +
  58 +static inline unsigned long *frontswap_map_get(struct swap_info_struct *p)
  59 +{
  60 + return p->frontswap_map;
  61 +}
  62 +#else
  63 +/* all inline routines become no-ops and all externs are ignored */
  64 +
  65 +#define frontswap_enabled (0)
  66 +
  67 +static inline bool frontswap_test(struct swap_info_struct *sis, pgoff_t offset)
  68 +{
  69 + return false;
  70 +}
  71 +
  72 +static inline void frontswap_set(struct swap_info_struct *sis, pgoff_t offset)
  73 +{
  74 +}
  75 +
  76 +static inline void frontswap_clear(struct swap_info_struct *sis, pgoff_t offset)
  77 +{
  78 +}
  79 +
  80 +static inline void frontswap_map_set(struct swap_info_struct *p,
  81 + unsigned long *map)
  82 +{
  83 +}
  84 +
  85 +static inline unsigned long *frontswap_map_get(struct swap_info_struct *p)
  86 +{
  87 + return NULL;
  88 +}
  89 +#endif
  90 +
  91 +static inline int frontswap_store(struct page *page)
  92 +{
  93 + int ret = -1;
  94 +
  95 + if (frontswap_enabled)
  96 + ret = __frontswap_store(page);
  97 + return ret;
  98 +}
  99 +
  100 +static inline int frontswap_load(struct page *page)
  101 +{
  102 + int ret = -1;
  103 +
  104 + if (frontswap_enabled)
  105 + ret = __frontswap_load(page);
  106 + return ret;
  107 +}
  108 +
  109 +static inline void frontswap_invalidate_page(unsigned type, pgoff_t offset)
  110 +{
  111 + if (frontswap_enabled)
  112 + __frontswap_invalidate_page(type, offset);
  113 +}
  114 +
  115 +static inline void frontswap_invalidate_area(unsigned type)
  116 +{
  117 + if (frontswap_enabled)
  118 + __frontswap_invalidate_area(type);
  119 +}
  120 +
  121 +static inline void frontswap_init(unsigned type)
  122 +{
  123 + if (frontswap_enabled)
  124 + __frontswap_init(type);
  125 +}
  126 +
  127 +#endif /* _LINUX_FRONTSWAP_H */
include/linux/swap.h
... ... @@ -197,6 +197,10 @@
197 197 struct block_device *bdev; /* swap device or bdev of swap file */
198 198 struct file *swap_file; /* seldom referenced */
199 199 unsigned int old_block_size; /* seldom referenced */
  200 +#ifdef CONFIG_FRONTSWAP
  201 + unsigned long *frontswap_map; /* frontswap in-use, one bit per page */
  202 + atomic_t frontswap_pages; /* frontswap pages in-use counter */
  203 +#endif
200 204 };
201 205  
202 206 struct swap_list_t {
include/linux/swapfile.h
  1 +#ifndef _LINUX_SWAPFILE_H
  2 +#define _LINUX_SWAPFILE_H
  3 +
  4 +/*
  5 + * these were static in swapfile.c but frontswap.c needs them and we don't
  6 + * want to expose them to the dozens of source files that include swap.h
  7 + */
  8 +extern spinlock_t swap_lock;
  9 +extern struct swap_list_t swap_list;
  10 +extern struct swap_info_struct *swap_info[];
  11 +extern int try_to_unuse(unsigned int, bool, unsigned long);
  12 +
  13 +#endif /* _LINUX_SWAPFILE_H */
mm/Kconfig
... ... @@ -389,4 +389,21 @@
389 389 in a negligible performance hit.
390 390  
391 391 If unsure, say Y to enable cleancache
  392 +
  393 +config FRONTSWAP
  394 + bool "Enable frontswap to cache swap pages if tmem is present"
  395 + depends on SWAP
  396 + default n
  397 + help
  398 + Frontswap is so named because it can be thought of as the opposite
  399 + of a "backing" store for a swap device. The data is stored into
  400 + "transcendent memory", memory that is not directly accessible or
  401 + addressable by the kernel and is of unknown and possibly
  402 + time-varying size. When space in transcendent memory is available,
  403 + a significant swap I/O reduction may be achieved. When none is
  404 + available, all frontswap calls are reduced to a single pointer-
  405 + compare-against-NULL resulting in a negligible performance hit
  406 + and swap data is stored as normal on the matching swap device.
  407 +
  408 + If unsure, say Y to enable frontswap.
mm/Makefile
... ... @@ -29,6 +29,7 @@
29 29  
30 30 obj-$(CONFIG_BOUNCE) += bounce.o
31 31 obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o
  32 +obj-$(CONFIG_FRONTSWAP) += frontswap.o
32 33 obj-$(CONFIG_HAS_DMA) += dmapool.o
33 34 obj-$(CONFIG_HUGETLBFS) += hugetlb.o
34 35 obj-$(CONFIG_NUMA) += mempolicy.o
mm/frontswap.c
  1 +/*
  2 + * Frontswap frontend
  3 + *
  4 + * This code provides the generic "frontend" layer to call a matching
  5 + * "backend" driver implementation of frontswap. See
  6 + * Documentation/vm/frontswap.txt for more information.
  7 + *
  8 + * Copyright (C) 2009-2012 Oracle Corp. All rights reserved.
  9 + * Author: Dan Magenheimer
  10 + *
  11 + * This work is licensed under the terms of the GNU GPL, version 2.
  12 + */
  13 +
  14 +#include <linux/mm.h>
  15 +#include <linux/mman.h>
  16 +#include <linux/swap.h>
  17 +#include <linux/swapops.h>
  18 +#include <linux/proc_fs.h>
  19 +#include <linux/security.h>
  20 +#include <linux/capability.h>
  21 +#include <linux/module.h>
  22 +#include <linux/uaccess.h>
  23 +#include <linux/debugfs.h>
  24 +#include <linux/frontswap.h>
  25 +#include <linux/swapfile.h>
  26 +
  27 +/*
  28 + * frontswap_ops is set by frontswap_register_ops to contain the pointers
  29 + * to the frontswap "backend" implementation functions.
  30 + */
  31 +static struct frontswap_ops frontswap_ops __read_mostly;
  32 +
  33 +/*
  34 + * This global enablement flag reduces overhead on systems where frontswap_ops
  35 + * has not been registered, so is preferred to the slower alternative: a
  36 + * function call that checks a non-global.
  37 + */
  38 +bool frontswap_enabled __read_mostly;
  39 +EXPORT_SYMBOL(frontswap_enabled);
  40 +
  41 +/*
  42 + * If enabled, frontswap_store will return failure even on success. As
  43 + * a result, the swap subsystem will always write the page to swap, in
  44 + * effect converting frontswap into a writethrough cache. In this mode,
  45 + * there is no direct reduction in swap writes, but a frontswap backend
  46 + * can unilaterally "reclaim" any pages in use with no data loss, thus
  47 + * providing increased control over maximum memory usage due to frontswap.
  48 + */
  49 +static bool frontswap_writethrough_enabled __read_mostly;
  50 +
  51 +#ifdef CONFIG_DEBUG_FS
  52 +/*
  53 + * Counters available via /sys/kernel/debug/frontswap (if debugfs is
  54 + * properly configured). These are for information only so are not protected
  55 + * against increment races.
  56 + */
  57 +static u64 frontswap_loads;
  58 +static u64 frontswap_succ_stores;
  59 +static u64 frontswap_failed_stores;
  60 +static u64 frontswap_invalidates;
  61 +
  62 +static inline void inc_frontswap_loads(void) {
  63 + frontswap_loads++;
  64 +}
  65 +static inline void inc_frontswap_succ_stores(void) {
  66 + frontswap_succ_stores++;
  67 +}
  68 +static inline void inc_frontswap_failed_stores(void) {
  69 + frontswap_failed_stores++;
  70 +}
  71 +static inline void inc_frontswap_invalidates(void) {
  72 + frontswap_invalidates++;
  73 +}
  74 +#else
  75 +static inline void inc_frontswap_loads(void) { }
  76 +static inline void inc_frontswap_succ_stores(void) { }
  77 +static inline void inc_frontswap_failed_stores(void) { }
  78 +static inline void inc_frontswap_invalidates(void) { }
  79 +#endif
  80 +/*
  81 + * Register operations for frontswap, returning the previous ops, thus
  82 + * allowing detection of multiple backends and possible nesting.
  83 + */
  84 +struct frontswap_ops frontswap_register_ops(struct frontswap_ops *ops)
  85 +{
  86 + struct frontswap_ops old = frontswap_ops;
  87 +
  88 + frontswap_ops = *ops;
  89 + frontswap_enabled = true;
  90 + return old;
  91 +}
  92 +EXPORT_SYMBOL(frontswap_register_ops);
  93 +
  94 +/*
  95 + * Enable/disable frontswap writethrough (see above).
  96 + */
  97 +void frontswap_writethrough(bool enable)
  98 +{
  99 + frontswap_writethrough_enabled = enable;
  100 +}
  101 +EXPORT_SYMBOL(frontswap_writethrough);
  102 +
  103 +/*
  104 + * Called when a swap device is swapon'd.
  105 + */
  106 +void __frontswap_init(unsigned type)
  107 +{
  108 + struct swap_info_struct *sis = swap_info[type];
  109 +
  110 + BUG_ON(sis == NULL);
  111 + if (sis->frontswap_map == NULL)
  112 + return;
  113 + if (frontswap_enabled)
  114 + (*frontswap_ops.init)(type);
  115 +}
  116 +EXPORT_SYMBOL(__frontswap_init);
  117 +
  118 +/*
  119 + * "Store" data from a page to frontswap and associate it with the page's
  120 + * swaptype and offset. Page must be locked and in the swap cache.
  121 + * If frontswap already contains a page with matching swaptype and
  122 + * offset, the frontswap implementation may either overwrite the data and
  123 + * return success or invalidate the page from frontswap and return failure.
  124 + */
  125 +int __frontswap_store(struct page *page)
  126 +{
  127 + int ret = -1, dup = 0;
  128 + swp_entry_t entry = { .val = page_private(page), };
  129 + int type = swp_type(entry);
  130 + struct swap_info_struct *sis = swap_info[type];
  131 + pgoff_t offset = swp_offset(entry);
  132 +
  133 + BUG_ON(!PageLocked(page));
  134 + BUG_ON(sis == NULL);
  135 + if (frontswap_test(sis, offset))
  136 + dup = 1;
  137 + ret = (*frontswap_ops.store)(type, offset, page);
  138 + if (ret == 0) {
  139 + frontswap_set(sis, offset);
  140 + inc_frontswap_succ_stores();
  141 + if (!dup)
  142 + atomic_inc(&sis->frontswap_pages);
  143 + } else if (dup) {
  144 + /*
  145 + * a failed dup always results in automatic invalidation of
  146 + * the (older) page from frontswap
  147 + */
  148 + frontswap_clear(sis, offset);
  149 + atomic_dec(&sis->frontswap_pages);
  150 + inc_frontswap_failed_stores();
  151 + } else
  152 + inc_frontswap_failed_stores();
  153 + if (frontswap_writethrough_enabled)
  154 + /* report failure so swap also writes to swap device */
  155 + ret = -1;
  156 + return ret;
  157 +}
  158 +EXPORT_SYMBOL(__frontswap_store);
  159 +
  160 +/*
  161 + * "Get" data from frontswap associated with swaptype and offset that were
  162 + * specified when the data was put to frontswap and use it to fill the
  163 + * specified page with data. Page must be locked and in the swap cache.
  164 + */
  165 +int __frontswap_load(struct page *page)
  166 +{
  167 + int ret = -1;
  168 + swp_entry_t entry = { .val = page_private(page), };
  169 + int type = swp_type(entry);
  170 + struct swap_info_struct *sis = swap_info[type];
  171 + pgoff_t offset = swp_offset(entry);
  172 +
  173 + BUG_ON(!PageLocked(page));
  174 + BUG_ON(sis == NULL);
  175 + if (frontswap_test(sis, offset))
  176 + ret = (*frontswap_ops.load)(type, offset, page);
  177 + if (ret == 0)
  178 + inc_frontswap_loads();
  179 + return ret;
  180 +}
  181 +EXPORT_SYMBOL(__frontswap_load);
  182 +
  183 +/*
  184 + * Invalidate any data from frontswap associated with the specified swaptype
  185 + * and offset so that a subsequent "get" will fail.
  186 + */
  187 +void __frontswap_invalidate_page(unsigned type, pgoff_t offset)
  188 +{
  189 + struct swap_info_struct *sis = swap_info[type];
  190 +
  191 + BUG_ON(sis == NULL);
  192 + if (frontswap_test(sis, offset)) {
  193 + (*frontswap_ops.invalidate_page)(type, offset);
  194 + atomic_dec(&sis->frontswap_pages);
  195 + frontswap_clear(sis, offset);
  196 + inc_frontswap_invalidates();
  197 + }
  198 +}
  199 +EXPORT_SYMBOL(__frontswap_invalidate_page);
  200 +
  201 +/*
  202 + * Invalidate all data from frontswap associated with all offsets for the
  203 + * specified swaptype.
  204 + */
  205 +void __frontswap_invalidate_area(unsigned type)
  206 +{
  207 + struct swap_info_struct *sis = swap_info[type];
  208 +
  209 + BUG_ON(sis == NULL);
  210 + if (sis->frontswap_map == NULL)
  211 + return;
  212 + (*frontswap_ops.invalidate_area)(type);
  213 + atomic_set(&sis->frontswap_pages, 0);
  214 + memset(sis->frontswap_map, 0, sis->max / sizeof(long));
  215 +}
  216 +EXPORT_SYMBOL(__frontswap_invalidate_area);
  217 +
  218 +/*
  219 + * Frontswap, like a true swap device, may unnecessarily retain pages
  220 + * under certain circumstances; "shrink" frontswap is essentially a
  221 + * "partial swapoff" and works by calling try_to_unuse to attempt to
  222 + * unuse enough frontswap pages to attempt to -- subject to memory
  223 + * constraints -- reduce the number of pages in frontswap to the
  224 + * number given in the parameter target_pages.
  225 + */
  226 +void frontswap_shrink(unsigned long target_pages)
  227 +{
  228 + struct swap_info_struct *si = NULL;
  229 + int si_frontswap_pages;
  230 + unsigned long total_pages = 0, total_pages_to_unuse;
  231 + unsigned long pages = 0, pages_to_unuse = 0;
  232 + int type;
  233 + bool locked = false;
  234 +
  235 + /*
  236 + * we don't want to hold swap_lock while doing a very
  237 + * lengthy try_to_unuse, but swap_list may change
  238 + * so restart scan from swap_list.head each time
  239 + */
  240 + spin_lock(&swap_lock);
  241 + locked = true;
  242 + total_pages = 0;
  243 + for (type = swap_list.head; type >= 0; type = si->next) {
  244 + si = swap_info[type];
  245 + total_pages += atomic_read(&si->frontswap_pages);
  246 + }
  247 + if (total_pages <= target_pages)
  248 + goto out;
  249 + total_pages_to_unuse = total_pages - target_pages;
  250 + for (type = swap_list.head; type >= 0; type = si->next) {
  251 + si = swap_info[type];
  252 + si_frontswap_pages = atomic_read(&si->frontswap_pages);
  253 + if (total_pages_to_unuse < si_frontswap_pages)
  254 + pages = pages_to_unuse = total_pages_to_unuse;
  255 + else {
  256 + pages = si_frontswap_pages;
  257 + pages_to_unuse = 0; /* unuse all */
  258 + }
  259 + /* ensure there is enough RAM to fetch pages from frontswap */
  260 + if (security_vm_enough_memory_mm(current->mm, pages))
  261 + continue;
  262 + vm_unacct_memory(pages);
  263 + break;
  264 + }
  265 + if (type < 0)
  266 + goto out;
  267 + locked = false;
  268 + spin_unlock(&swap_lock);
  269 + try_to_unuse(type, true, pages_to_unuse);
  270 +out:
  271 + if (locked)
  272 + spin_unlock(&swap_lock);
  273 + return;
  274 +}
  275 +EXPORT_SYMBOL(frontswap_shrink);
  276 +
  277 +/*
  278 + * Count and return the number of frontswap pages across all
  279 + * swap devices. This is exported so that backend drivers can
  280 + * determine current usage without reading debugfs.
  281 + */
  282 +unsigned long frontswap_curr_pages(void)
  283 +{
  284 + int type;
  285 + unsigned long totalpages = 0;
  286 + struct swap_info_struct *si = NULL;
  287 +
  288 + spin_lock(&swap_lock);
  289 + for (type = swap_list.head; type >= 0; type = si->next) {
  290 + si = swap_info[type];
  291 + totalpages += atomic_read(&si->frontswap_pages);
  292 + }
  293 + spin_unlock(&swap_lock);
  294 + return totalpages;
  295 +}
  296 +EXPORT_SYMBOL(frontswap_curr_pages);
  297 +
  298 +static int __init init_frontswap(void)
  299 +{
  300 +#ifdef CONFIG_DEBUG_FS
  301 + struct dentry *root = debugfs_create_dir("frontswap", NULL);
  302 + if (root == NULL)
  303 + return -ENXIO;
  304 + debugfs_create_u64("loads", S_IRUGO, root, &frontswap_loads);
  305 + debugfs_create_u64("succ_stores", S_IRUGO, root, &frontswap_succ_stores);
  306 + debugfs_create_u64("failed_stores", S_IRUGO, root,
  307 + &frontswap_failed_stores);
  308 + debugfs_create_u64("invalidates", S_IRUGO,
  309 + root, &frontswap_invalidates);
  310 +#endif
  311 + return 0;
  312 +}
  313 +
  314 +module_init(init_frontswap);
mm/page_io.c
... ... @@ -18,6 +18,7 @@
18 18 #include <linux/bio.h>
19 19 #include <linux/swapops.h>
20 20 #include <linux/writeback.h>
  21 +#include <linux/frontswap.h>
21 22 #include <asm/pgtable.h>
22 23  
23 24 static struct bio *get_swap_bio(gfp_t gfp_flags,
... ... @@ -98,6 +99,12 @@
98 99 unlock_page(page);
99 100 goto out;
100 101 }
  102 + if (frontswap_store(page) == 0) {
  103 + set_page_writeback(page);
  104 + unlock_page(page);
  105 + end_page_writeback(page);
  106 + goto out;
  107 + }
101 108 bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
102 109 if (bio == NULL) {
103 110 set_page_dirty(page);
... ... @@ -122,6 +129,11 @@
122 129  
123 130 VM_BUG_ON(!PageLocked(page));
124 131 VM_BUG_ON(PageUptodate(page));
  132 + if (frontswap_load(page) == 0) {
  133 + SetPageUptodate(page);
  134 + unlock_page(page);
  135 + goto out;
  136 + }
125 137 bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
126 138 if (bio == NULL) {
127 139 unlock_page(page);
mm/swapfile.c
... ... @@ -31,6 +31,8 @@
31 31 #include <linux/memcontrol.h>
32 32 #include <linux/poll.h>
33 33 #include <linux/oom.h>
  34 +#include <linux/frontswap.h>
  35 +#include <linux/swapfile.h>
34 36  
35 37 #include <asm/pgtable.h>
36 38 #include <asm/tlbflush.h>
... ... @@ -42,7 +44,7 @@
42 44 static void free_swap_count_continuations(struct swap_info_struct *);
43 45 static sector_t map_swap_entry(swp_entry_t, struct block_device**);
44 46  
45   -static DEFINE_SPINLOCK(swap_lock);
  47 +DEFINE_SPINLOCK(swap_lock);
46 48 static unsigned int nr_swapfiles;
47 49 long nr_swap_pages;
48 50 long total_swap_pages;
49 51  
... ... @@ -53,9 +55,9 @@
53 55 static const char Bad_offset[] = "Bad swap offset entry ";
54 56 static const char Unused_offset[] = "Unused swap offset entry ";
55 57  
56   -static struct swap_list_t swap_list = {-1, -1};
  58 +struct swap_list_t swap_list = {-1, -1};
57 59  
58   -static struct swap_info_struct *swap_info[MAX_SWAPFILES];
  60 +struct swap_info_struct *swap_info[MAX_SWAPFILES];
59 61  
60 62 static DEFINE_MUTEX(swapon_mutex);
61 63  
... ... @@ -556,6 +558,7 @@
556 558 swap_list.next = p->type;
557 559 nr_swap_pages++;
558 560 p->inuse_pages--;
  561 + frontswap_invalidate_page(p->type, offset);
559 562 if ((p->flags & SWP_BLKDEV) &&
560 563 disk->fops->swap_slot_free_notify)
561 564 disk->fops->swap_slot_free_notify(p->bdev, offset);
562 565  
... ... @@ -985,11 +988,12 @@
985 988 }
986 989  
987 990 /*
988   - * Scan swap_map from current position to next entry still in use.
  991 + * Scan swap_map (or frontswap_map if frontswap parameter is true)
  992 + * from current position to next entry still in use.
989 993 * Recycle to start on reaching the end, returning 0 when empty.
990 994 */
991 995 static unsigned int find_next_to_unuse(struct swap_info_struct *si,
992   - unsigned int prev)
  996 + unsigned int prev, bool frontswap)
993 997 {
994 998 unsigned int max = si->max;
995 999 unsigned int i = prev;
... ... @@ -1015,6 +1019,12 @@
1015 1019 prev = 0;
1016 1020 i = 1;
1017 1021 }
  1022 + if (frontswap) {
  1023 + if (frontswap_test(si, i))
  1024 + break;
  1025 + else
  1026 + continue;
  1027 + }
1018 1028 count = si->swap_map[i];
1019 1029 if (count && swap_count(count) != SWAP_MAP_BAD)
1020 1030 break;
1021 1031  
... ... @@ -1026,8 +1036,12 @@
1026 1036 * We completely avoid races by reading each swap page in advance,
1027 1037 * and then search for the process using it. All the necessary
1028 1038 * page table adjustments can then be made atomically.
  1039 + *
  1040 + * if the boolean frontswap is true, only unuse pages_to_unuse pages;
  1041 + * pages_to_unuse==0 means all pages; ignored if frontswap is false
1029 1042 */
1030   -static int try_to_unuse(unsigned int type)
  1043 +int try_to_unuse(unsigned int type, bool frontswap,
  1044 + unsigned long pages_to_unuse)
1031 1045 {
1032 1046 struct swap_info_struct *si = swap_info[type];
1033 1047 struct mm_struct *start_mm;
... ... @@ -1060,7 +1074,7 @@
1060 1074 * one pass through swap_map is enough, but not necessarily:
1061 1075 * there are races when an instance of an entry might be missed.
1062 1076 */
1063   - while ((i = find_next_to_unuse(si, i)) != 0) {
  1077 + while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
1064 1078 if (signal_pending(current)) {
1065 1079 retval = -EINTR;
1066 1080 break;
... ... @@ -1227,6 +1241,10 @@
1227 1241 * interactive performance.
1228 1242 */
1229 1243 cond_resched();
  1244 + if (frontswap && pages_to_unuse > 0) {
  1245 + if (!--pages_to_unuse)
  1246 + break;
  1247 + }
1230 1248 }
1231 1249  
1232 1250 mmput(start_mm);
... ... @@ -1486,7 +1504,8 @@
1486 1504 }
1487 1505  
1488 1506 static void enable_swap_info(struct swap_info_struct *p, int prio,
1489   - unsigned char *swap_map)
  1507 + unsigned char *swap_map,
  1508 + unsigned long *frontswap_map)
1490 1509 {
1491 1510 int i, prev;
1492 1511  
... ... @@ -1496,6 +1515,7 @@
1496 1515 else
1497 1516 p->prio = --least_priority;
1498 1517 p->swap_map = swap_map;
  1518 + frontswap_map_set(p, frontswap_map);
1499 1519 p->flags |= SWP_WRITEOK;
1500 1520 nr_swap_pages += p->pages;
1501 1521 total_swap_pages += p->pages;
... ... @@ -1512,6 +1532,7 @@
1512 1532 swap_list.head = swap_list.next = p->type;
1513 1533 else
1514 1534 swap_info[prev]->next = p->type;
  1535 + frontswap_init(p->type);
1515 1536 spin_unlock(&swap_lock);
1516 1537 }
1517 1538  
... ... @@ -1585,7 +1606,7 @@
1585 1606 spin_unlock(&swap_lock);
1586 1607  
1587 1608 oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
1588   - err = try_to_unuse(type);
  1609 + err = try_to_unuse(type, false, 0); /* force all pages to be unused */
1589 1610 compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);
1590 1611  
1591 1612 if (err) {
... ... @@ -1596,7 +1617,7 @@
1596 1617 * sys_swapoff for this swap_info_struct at this point.
1597 1618 */
1598 1619 /* re-insert swap space back into swap_list */
1599   - enable_swap_info(p, p->prio, p->swap_map);
  1620 + enable_swap_info(p, p->prio, p->swap_map, frontswap_map_get(p));
1600 1621 goto out_dput;
1601 1622 }
1602 1623  
1603 1624  
... ... @@ -1622,9 +1643,11 @@
1622 1643 swap_map = p->swap_map;
1623 1644 p->swap_map = NULL;
1624 1645 p->flags = 0;
  1646 + frontswap_invalidate_area(type);
1625 1647 spin_unlock(&swap_lock);
1626 1648 mutex_unlock(&swapon_mutex);
1627 1649 vfree(swap_map);
  1650 + vfree(frontswap_map_get(p));
1628 1651 /* Destroy swap account information */
1629 1652 swap_cgroup_swapoff(type);
1630 1653  
... ... @@ -1988,6 +2011,7 @@
1988 2011 sector_t span;
1989 2012 unsigned long maxpages;
1990 2013 unsigned char *swap_map = NULL;
  2014 + unsigned long *frontswap_map = NULL;
1991 2015 struct page *page = NULL;
1992 2016 struct inode *inode = NULL;
1993 2017  
... ... @@ -2071,6 +2095,9 @@
2071 2095 error = nr_extents;
2072 2096 goto bad_swap;
2073 2097 }
  2098 + /* frontswap enabled? set up bit-per-page map for frontswap */
  2099 + if (frontswap_enabled)
  2100 + frontswap_map = vzalloc(maxpages / sizeof(long));
2074 2101  
2075 2102 if (p->bdev) {
2076 2103 if (blk_queue_nonrot(bdev_get_queue(p->bdev))) {
2077 2104  
2078 2105  
... ... @@ -2086,14 +2113,15 @@
2086 2113 if (swap_flags & SWAP_FLAG_PREFER)
2087 2114 prio =
2088 2115 (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
2089   - enable_swap_info(p, prio, swap_map);
  2116 + enable_swap_info(p, prio, swap_map, frontswap_map);
2090 2117  
2091 2118 printk(KERN_INFO "Adding %uk swap on %s. "
2092   - "Priority:%d extents:%d across:%lluk %s%s\n",
  2119 + "Priority:%d extents:%d across:%lluk %s%s%s\n",
2093 2120 p->pages<<(PAGE_SHIFT-10), name, p->prio,
2094 2121 nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10),
2095 2122 (p->flags & SWP_SOLIDSTATE) ? "SS" : "",
2096   - (p->flags & SWP_DISCARDABLE) ? "D" : "");
  2123 + (p->flags & SWP_DISCARDABLE) ? "D" : "",
  2124 + (frontswap_map) ? "FS" : "");
2097 2125  
2098 2126 mutex_unlock(&swapon_mutex);
2099 2127 atomic_inc(&proc_poll_event);