08 Apr, 2014
40 commits
-
Signed-off-by: SeongJae Park
Cc: Seth Jennings
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
zram is ram based block device and can be used by backend of filesystem.
When filesystem deletes a file, it normally doesn't do anything on data
block of that file. It just marks on metadata of that file. This
behavior has no problem on disk based block device, but has problems on
ram based block device, since we can't free memory used for data block.
To overcome this disadvantage, there is REQ_DISCARD functionality. If
block device support REQ_DISCARD and filesystem is mounted with discard
option, filesystem sends REQ_DISCARD to block device whenever some data
blocks are discarded. All we have to do is to handle this request.This patch implements to flag up QUEUE_FLAG_DISCARD and handle this
REQ_DISCARD request. With it, we can free memory used by zram if it isn't
used.[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Joonsoo Kim
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Sergey Senozhatsky
Cc: Jerome Marchand
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
sysfs.txt documentation lists the following requirements:
- The buffer will always be PAGE_SIZE bytes in length. On i386, this
is 4096.- show() methods should return the number of bytes printed into the
buffer. This is the return value of scnprintf().- show() should always use scnprintf().
Use scnprintf() in show() functions.
Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When we initialized zcomp with single, we couldn't change
max_comp_streams without zram reset but current interface doesn't show
any error to user and even it changes max_comp_streams's value without
any effect so it would make user very confusing.This patch prevents max_comp_streams's change when zcomp was initialized
as single zcomp and emit the error to user(ex, echo).[akpm@linux-foundation.org: don't return with the lock held, per Sergey]
[fengguang.wu@intel.com: fix coccinelle warnings]
Signed-off-by: Minchan Kim
Cc: Nitin Gupta
Cc: Jerome Marchand
Acked-by: Sergey Senozhatsky
Signed-off-by: Fengguang Wu
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Instead of returning just NULL, return ERR_PTR from zcomp_create() if
compressing backend creation has failed. ERR_PTR(-EINVAL) for unsupported
compression algorithm request, ERR_PTR(-ENOMEM) for allocation (zcomp or
compression stream) error.Perform IS_ERR() check of returned from zcomp_create() value in
disksize_store() and set return code to PTR_ERR().Change suggested by Jerome Marchand.
[akpm@linux-foundation.org: clean up error recovery flow]
Signed-off-by: Sergey Senozhatsky
Reported-by: Jerome Marchand
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
While fixing lockdep spew of ->init_lock reported by Sasha Levin [1],
Minchan Kim noted [2] that it's better to move compression backend
allocation (using GPF_KERNEL) out of the ->init_lock lock, same way as
with zram_meta_alloc(), in order to prevent the same lockdep spew.[1] https://lkml.org/lkml/2014/2/27/337
[2] https://lkml.org/lkml/2014/3/3/32Signed-off-by: Sergey Senozhatsky
Reported-by: Minchan Kim
Acked-by: Minchan Kim
Cc: Sasha Levin
Acked-by: Jerome Marchand
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Introduce LZ4 compression backend and make it available for selection.
LZ4 support is optional and requires user to set ZRAM_LZ4_COMPRESS config
option. The default compression backend is LZO.TEST
(x86_64, core i5, 2 cores + 2 hyperthreading, zram disk size 1G,
ext4 file system, 3 compression streams)iozone -t 3 -R -r 16K -s 60M -I +Z
Test LZO LZ4
----------------------------------------------
Initial write 1642744.62 1317005.09
Rewrite 2498980.88 1800645.16
Read 3957026.38 5877043.75
Re-read 3950997.38 5861847.00
Reverse Read 2937114.56 5047384.00
Stride read 2948163.19 4929587.38
Random read 3292692.69 4880793.62
Mixed workload 1545602.62 3502940.38
Random write 2448039.75 1758786.25
Pwrite 1670051.03 1338329.69
Pread 2530682.00 5097177.62
Fwrite 3232085.62 3275942.56
Fread 6306880.25 6645271.12So on my system LZ4 is slower in write-only tests, while it performs
better in read-only and mixed (reads + writes) tests.Official LZ4 benchmarks available here http://code.google.com/p/lz4/
(linux kernel uses revision r90).Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add and document `comp_algorithm' device attribute. This attribute allows
to show supported compression and currently selected compression
algorithms:cat /sys/block/zram0/comp_algorithm
[lzo] lz4and change selected compression algorithm:
echo lzo > /sys/block/zram0/comp_algorithmSigned-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch allows to change max_comp_streams on initialised zcomp.
Introduce zcomp set_max_streams() knob, zcomp_strm_multi_set_max_streams()
and zcomp_strm_single_set_max_streams() callbacks to change streams limit
for zcomp_strm_multi and zcomp_strm_single, accordingly. set_max_streams
for single steam zcomp does nothing.If user has lowered the limit, then zcomp_strm_multi_set_max_streams()
attempts to immediately free extra streams (as much as it can, depending
on idle streams availability).Note, this patch does not allow to change stream 'policy' from single to
multi stream (or vice versa) on already initialised compression backend.Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Existing zram (zcomp) implementation has only one compression stream
(buffer and algorithm private part), so in order to prevent data
corruption only one write (compress operation) can use this compression
stream, forcing all concurrent write operations to wait for stream lock
to be released. This patch changes zcomp to keep a compression streams
list of user-defined size (via sysfs device attr). Each write operation
still exclusively holds compression stream, the difference is that we
can have N write operations (depending on size of streams list)
executing in parallel. See TEST section later in commit message for
performance data.Introduce struct zcomp_strm_multi and a set of functions to manage
zcomp_strm stream access. zcomp_strm_multi has a list of idle
zcomp_strm structs, spinlock to protect idle list and wait queue, making
it possible to perform parallel compressions.The following set of functions added:
- zcomp_strm_multi_find()/zcomp_strm_multi_release()
find and release a compression stream, implement required locking
- zcomp_strm_multi_create()/zcomp_strm_multi_destroy()
create and destroy zcomp_strm_multizcomp ->strm_find() and ->strm_release() callbacks are set during
initialisation to zcomp_strm_multi_find()/zcomp_strm_multi_release()
correspondingly.Each time zcomp issues a zcomp_strm_multi_find() call, the following set
of operations performed:- spin lock strm_lock
- if idle list is not empty, remove zcomp_strm from idle list, spin
unlock and return zcomp stream pointer to caller
- if idle list is empty, current adds itself to wait queue. it will be
awaken by zcomp_strm_multi_release() caller.zcomp_strm_multi_release():
- spin lock strm_lock
- add zcomp stream to idle list
- spin unlock, wake up sleeperMinchan Kim reported that spinlock-based locking scheme has demonstrated
a severe perfomance regression for single compression stream case,
comparing to mutex-based (see https://lkml.org/lkml/2014/2/18/16)base spinlock mutex
==Initial write ==Initial write ==Initial write
records: 5 records: 5 records: 5
avg: 1642424.35 avg: 699610.40 avg: 1655583.71
std: 39890.95(2.43%) std: 232014.19(33.16%) std: 52293.96
max: 1690170.94 max: 1163473.45 max: 1697164.75
min: 1568669.52 min: 573429.88 min: 1553410.23
==Rewrite ==Rewrite ==Rewrite
records: 5 records: 5 records: 5
avg: 1611775.39 avg: 501406.64 avg: 1684419.11
std: 17144.58(1.06%) std: 15354.41(3.06%) std: 18367.42
max: 1641800.95 max: 531356.78 max: 1706445.84
min: 1593515.27 min: 488817.78 min: 1655335.73When only one compression stream available, mutex with spin on owner
tends to perform much better than frequent wait_event()/wake_up(). This
is why single stream implemented as a special case with mutex locking.Introduce and document zram device attribute max_comp_streams. This
attr shows and stores current zcomp's max number of zcomp streams
(max_strm). Extend zcomp's zcomp_create() with `max_strm' parameter.
`max_strm' limits the number of zcomp_strm structs in compression
backend's idle list (max_comp_streams).max_comp_streams used during initialisation as follows:
-- passing to zcomp_create() max_strm equals to 1 will initialise zcomp
using single compression stream zcomp_strm_single (mutex-based locking).
-- passing to zcomp_create() max_strm greater than 1 will initialise zcomp
using multi compression stream zcomp_strm_multi (spinlock-based locking).default max_comp_streams value is 1, meaning that zram with single stream
will be initialised.Later patch will introduce configuration knob to change max_comp_streams
on already initialised and used zcomp.TEST
iozone -t 3 -R -r 16K -s 60M -I +Ztest base 1 strm (mutex) 3 strm (spinlock)
-----------------------------------------------------------------------
Initial write 589286.78 583518.39 718011.05
Rewrite 604837.97 596776.38 1515125.72
Random write 584120.11 595714.58 1388850.25
Pwrite 535731.17 541117.38 739295.27
Fwrite 1418083.88 1478612.72 1484927.06Usage example:
set max_comp_streams to 4
echo 4 > /sys/block/zram0/max_comp_streamsshow current max_comp_streams (default value is 1).
cat /sys/block/zram0/max_comp_streamsSigned-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This is preparation patch to add multi stream support to zcomp.
Introduce struct zcomp_strm_single and a set of functions to manage
zcomp_strm stream access. zcomp_strm_single implements single compession
stream, same way as current zcomp implementation. This moves zcomp_strm
stream control and locking from zcomp, so compressing backend zcomp is not
aware of required locking.Single and multi streams require different locking schemes. Minchan Kim
reported that spinlock-based locking scheme (which is used in multi stream
implementation) has demonstrated a severe perfomance regression for single
compression stream case, comparing to mutex-based. see
https://lkml.org/lkml/2014/2/18/16The following set of functions added:
- zcomp_strm_single_find()/zcomp_strm_single_release()
find and release a compression stream, implement required locking
- zcomp_strm_single_create()/zcomp_strm_single_destroy()
create and destroy zcomp_strm_singleNew ->strm_find() and ->strm_release() callbacks added to zcomp, which are
set to zcomp_strm_single_find() and zcomp_strm_single_release() during
initialisation. Instead of direct locking and zcomp_strm access from
zcomp_strm_find() and zcomp_strm_release(), zcomp now calls ->strm_find()
and ->strm_release() correspondingly.Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Do not perform direct LZO compress/decompress calls, initialise
and use zcomp LZO backend (single compression stream) instead.[akpm@linux-foundation.org: resolve conflicts with zram-delete-zram_init_device-fix.patch]
Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
ZRAM performs direct LZO compression algorithm calls, making it the one
and only option. While LZO is generally performs well, LZ4 algorithm
tends to have a faster decompression (see http://code.google.com/p/lz4/
for full report)Name Ratio C.speed D.speed
MB/s MB/s
LZ4 (r101) 2.084 422 1820
LZO 2.06 2.106 414 600Thus, users who have mostly read (decompress) usage scenarious or mixed
workflow (writes with relatively high read ops number) will benefit from
using LZ4 compression backend.Introduce compressing backend abstraction zcomp in order to support
multiple compression algorithms with the following set of operations:.create
.destroy
.compress
.decompressSchematically zram write() usually contains the following steps:
0) preparation (decompression of partioal IO, etc.)
1) lock buffer_lock mutex (protects meta compress buffers)
2) compress (using meta compress buffers)
3) alloc and map zs_pool object
4) copy compressed data (from meta compress buffers) to object allocated by 3)
5) free previous pool page, assign a new one
6) unlock buffer_lock mutexAs we can see, compressing buffers must remain untouched from 1) to 4),
because, otherwise, concurrent write() can overwrite data. At the same
time, zram_meta must be aware of a) specific compression algorithm memory
requirements and b) necessary locking to protect compression buffers. To
remove requirement a) new struct zcomp_strm introduced, which contains a
compress/decompress `buffer' and compression algorithm `private' part.
While struct zcomp implements zcomp_strm stream handling and locking and
removes requirement b) from zram meta. zcomp ->create() and ->destroy(),
respectively, allocate and deallocate algorithm specific zcomp_strm
`private' part.Every zcomp has zcomp stream and mutex to protect its compression stream.
Stream usage semantics remains the same -- only one write can hold stream
lock and use its buffers. zcomp_strm_find() turns caller into exclusive
user of a stream (holding stream mutex until zram release stream), and
zcomp_strm_release() makes zcomp stream available (unlock the stream
mutex). Hence no concurrent write (compression) operations possible at
the moment.iozone -t 3 -R -r 16K -s 60M -I +Z
test base patched
--------------------------------------------------
Initial write 597992.91 591660.58
Rewrite 609674.34 616054.97
Read 2404771.75 2452909.12
Re-read 2459216.81 2470074.44
Reverse Read 1652769.66 1589128.66
Stride read 2202441.81 2202173.31
Random read 2236311.47 2276565.31
Mixed workload 1423760.41 1709760.06
Random write 579584.08 615933.86
Pwrite 597550.02 594933.70
Pread 1703672.53 1718126.72
Fwrite 1330497.06 1461054.00
Fread 3922851.00 3957242.62Usage examples:
comp = zcomp_create(NAME) /* NAME e.g. "lzo" */
which initialises compressing backend if requested algorithm is supported.
Compress:
zstrm = zcomp_strm_find(comp)
zcomp_compress(comp, zstrm, src, &dst_len)
[..] /* copy compressed data */
zcomp_strm_release(comp, zstrm)Decompress:
zcomp_decompress(comp, src, src_len, dst);Free compessing backend and its zcomp stream:
zcomp_destroy(comp)Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
allocate new `zram_meta' in disksize_store() only for uninitialised zram
device, saving a number of allocations and deallocations in case if
disksize_store() was called on currently used device. at the same time
zram_meta stack variable is not necessary, because we can set ->meta
directly. there is also no need in setting QUEUE_FLAG_NONROT queue on
every disksize_store(), set it once during device creation.[minchan@kernel.org: handle zram->meta alloc fail case]
[minchan@kernel.org: prevent lockdep spew of init_lock]
Signed-off-by: Sergey Senozhatsky
Signed-off-by: Minchan Kim
Acked-by: Jerome Marchand
Cc: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Document `failed_reads' and `failed_writes' device attributes.
Remove info about `discard' - there is no such zram attr.Signed-off-by: Sergey Senozhatsky
Cc: Minchan Kim
Cc: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Move zram warning about disksize and size of memory correlation to zram
documentation.Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
struct table `count' member is not used.
Signed-off-by: Sergey Senozhatsky
Cc: Minchan Kim
Acked-by: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
zram accounted but did not report numbers of failed read and write
queries. make these stats available as failed_reads and failed_writes
attrs.Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Acked-by: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Introduce ZRAM_ATTR_RO macro that generates device_attribute and default
ATTR show() function for existing atomic64_t zram stats.Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This is a preparation patch for stats code duplication removal.
1) use atomic64_t for `pages_zero' and `pages_stored' zram stats.
2) `compr_size' and `pages_zero' struct zram_stats members did not
follow the existing device attr naming scheme: zram_stats.ATTR has
ATTR_show() function. rename them:-- compr_size -> compr_data_size
-- pages_zero -> zero_pagesMinchan Kim's note:
If we really have trouble with atomic stat operation, we could
change it with percpu_counter so that it could solve atomic overhead and
unnecessary memory space by introducing unsigned long instead of 64bit
atomic_t.Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Acked-by: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Remove `good' and `bad' compressed sub-requests stats. RW request may
cause a number of RW sub-requests. zram used to account `good' compressed
sub-queries (with compressed size less than 50% of original size), `bad'
compressed sub-queries (with compressed size greater that 75% of original
size), leaving sub-requests with compression size between 50% and 75% of
original size not accounted and not reported. zram already accounts each
sub-request's compression size so we can calculate real device compression
ratio.Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Acked-by: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Do not pass rw argument down the __zram_make_request() -> zram_bvec_rw()
chain, decode it in zram_bvec_rw() instead. Besides, this is the place
where we distinguish READ and WRITE bio data directions, so account zram
RW stats here, instead of __zram_make_request(). This also allows to
account a real number of zram READ/WRITE operations, not just requests
(single RW request may cause a number of zram RW ops with separate
locking, compression/decompression, etc).Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Acked-by: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Introduce init_done() helper function which allows us to drop `init_done'
struct zram member. init_done() uses the fact that ->init_done == 1
equals to ->meta != NULL.Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Acked-by: Jerome Marchand
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A new dump_page() routine was recently added, and marked
EXPORT_SYMBOL_GPL. dump_page() was also added to the VM_BUG_ON_PAGE()
macro, and so the end result is that non-GPL code can no longer call
get_page() and a few other routines.This only happens if the kernel was compiled with CONFIG_DEBUG_VM.
Change dump_page() to be EXPORT_SYMBOL.
Longer explanation:
Prior to commit 309381feaee5 ("mm: dump page when hitting a VM_BUG_ON
using VM_BUG_ON_PAGE") , it was possible to build MIT-licensed (non-GPL)
drivers on Fedora. Fedora is semi-unique, in that it sets
CONFIG_VM_DEBUG.Because Fedora sets CONFIG_VM_DEBUG, they end up pulling in dump_page(),
via VM_BUG_ON_PAGE, via get_page(). As one of the authors of NVIDIA's
new, open source, "UVM-Lite" kernel module, I originally choose to use
the kernel's get_page() routine from within nvidia_uvm_page_cache.c,
because get_page() has always seemed to be very clearly intended for use
by non-GPL, driver code.So I'm hoping that making get_page() widely accessible again will not be
too controversial. We did check with Fedora first, and they responded
(https://bugzilla.redhat.com/show_bug.cgi?id=1074710#c3) that we should
try to get upstream changed, before asking Fedora to change. Their
reasoning seems beneficial to Linux: leaving CONFIG_DEBUG_VM set allows
Fedora to help catch mm bugs.Signed-off-by: John Hubbard
Cc: Sasha Levin
Cc: Josh Boyer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
LAST_CPUPID_MASK is calculated using LAST_CPUPID_WIDTH. However
LAST_CPUPID_WIDTH itself can be 0. (when LAST_CPUPID_NOT_IN_PAGE_FLAGS is
set). In such a case LAST_CPUPID_MASK turns out to be 0.But with recent commit 1ae71d0319: (mm: numa: bugfix for
LAST_CPUPID_NOT_IN_PAGE_FLAGS) if LAST_CPUPID_MASK is 0,
page_cpupid_xchg_last() and page_cpupid_reset_last() causes
page->_last_cpupid to be set to 0.This causes performance regression. Its almost as if numa_balancing is
off.Fix LAST_CPUPID_MASK by using LAST_CPUPID_SHIFT instead of
LAST_CPUPID_WIDTH.Some performance numbers and perf stats with and without the fix.
(3.14-rc6)
----------
numa01Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
12,27,462 cs [100.00%]
2,41,957 migrations [100.00%]
1,68,01,713 faults [100.00%]
7,99,35,29,041 cache-misses
98,808 migrate:mm_migrate_pages [100.00%]1407.690148814 seconds time elapsed
numa02
Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':
63,065 cs [100.00%]
14,364 migrations [100.00%]
2,08,118 faults [100.00%]
25,32,59,404 cache-misses
12 migrate:mm_migrate_pages [100.00%]63.840827219 seconds time elapsed
(3.14-rc6 with fix)
-------------------
numa01Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
9,68,911 cs [100.00%]
1,01,414 migrations [100.00%]
88,38,697 faults [100.00%]
4,42,92,51,042 cache-misses
4,25,060 migrate:mm_migrate_pages [100.00%]685.965331189 seconds time elapsed
numa02
Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':
17,543 cs [100.00%]
2,962 migrations [100.00%]
1,17,843 faults [100.00%]
11,80,61,644 cache-misses
12,358 migrate:mm_migrate_pages [100.00%]20.380132343 seconds time elapsed
Signed-off-by: Srikar Dronamraju
Cc: Liu Ping Fan
Reviewed-by: Aneesh Kumar K.V
Cc: Peter Zijlstra
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
s/MADV_NODUMP/MADV_DONTDUMP/
Signed-off-by: Zhang Yanfei
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
ra_submit with a single function call.Move ra_submit to internal.h and inline it to save some stack. Thanks
to Andrew Morton for commenting different versions.Signed-off-by: Fabian Frederick
Suggested-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When I decrease the value of nr_hugepage in procfs a lot, softlockup
happens. It is because there is no chance of context switch during this
process.On the other hand, when I allocate a large number of hugepages, there is
some chance of context switch. Hence softlockup doesn't happen during
this process. So it's necessary to add the context switch in the
freeing process as same as allocating process to avoid softlockup.When I freed 12 TB hugapages with kernel-2.6.32-358.el6, the freeing
process occupied a CPU over 150 seconds and following softlockup message
appeared twice or more.$ echo 6000000 > /proc/sys/vm/nr_hugepages
$ cat /proc/sys/vm/nr_hugepages
6000000
$ grep ^Huge /proc/meminfo
HugePages_Total: 6000000
HugePages_Free: 6000000
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
$ echo 0 > /proc/sys/vm/nr_hugepagesBUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ...
Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
free_pool_huge_page+0xb8/0xd0
set_max_huge_pages+0x128/0x190
hugetlb_sysctl_handler_common+0x113/0x140
hugetlb_sysctl_handler+0x1e/0x20
proc_sys_call_handler+0x97/0xd0
proc_sys_write+0x14/0x20
vfs_write+0xb8/0x1a0
sys_write+0x51/0x90
__audit_syscall_exit+0x265/0x290
system_call_fastpath+0x16/0x1bI have not confirmed this problem with upstream kernels because I am not
able to prepare the machine equipped with 12TB memory now. However I
confirmed that the amount of decreasing hugepages was directly
proportional to the amount of required time.I measured required times on a smaller machine. It showed 130-145
hugepages decreased in a millisecond.Amount of decreasing Required time Decreasing rate
hugepages (msec) (pages/msec)
------------------------------------------------------------
10,000 pages == 20GB 70 - 74 135-142
30,000 pages == 60GB 208 - 229 131-144It means decrement of 6TB hugepages will trigger softlockup with the
default threshold 20sec, in this decreasing rate.Signed-off-by: Masayoshi Mizuma
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Wanpeng Li
Cc: Aneesh Kumar
Cc: KOSAKI Motohiro
Cc: Naoya Horiguchi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Replace ((phys_addr_t)(x) << PAGE_SHIFT) by pfn macro.
Signed-off-by: Fabian Frederick
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This is a small cleanup.
Signed-off-by: Emil Medve
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There's only one caller of set_page_dirty_balance() and that will call it
with page_mkwrite == 0.The page_mkwrite argument was unused since commit b827e496c893 "mm: close
page_mkwrite races".Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
the pages other than its check_page parameter (which is already locked).The BUG_ON in mlock_vma_page() is not documented and its purpose is
somewhat unclear, but apparently it serializes against page migration,
which could otherwise fail to transfer the PG_mlocked flag. This would
not be fatal, as the page would be eventually encountered again, but
NR_MLOCK accounting would become distorted nevertheless. This patch adds
a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
effect.The call site try_to_unmap_cluster() is fixed so that for page !=
check_page, trylock_page() is attempted (to avoid possible deadlocks as we
already have check_page locked) and mlock_vma_page() is performed only
upon success. If the page lock cannot be obtained, the page is left
without PG_mlocked, which is again not a problem in the whole unevictable
memory design.Signed-off-by: Vlastimil Babka
Signed-off-by: Bob Liu
Reported-by: Sasha Levin
Cc: Wanpeng Li
Cc: Michel Lespinasse
Cc: KOSAKI Motohiro
Acked-by: Rik van Riel
Cc: David Rientjes
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: Joonsoo Kim
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
On NUMA systems, a node may start thrashing cache or even swap anonymous
pages while there are still free pages on remote nodes.This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone
allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect
of fair allocation policy").Before those changes, the allocator would first try all allowed zones,
including those on remote nodes, before waking any kswapds. But now,
the allocator fastpath doubles as the fairness pass, which in turn can
only consider the local node to prevent remote spilling based on
exhausted fairness batches alone. Remote nodes are only considered in
the slowpath, after the kswapds are woken up. But if remote nodes still
have free memory, kswapd should not be woken to rebalance the local node
or it may thrash cash or swap prematurely.Fix this by adding one more unfair pass over the zonelist that is
allowed to spill to remote nodes after the local fairness pass fails but
before entering the slowpath and waking the kswapds.This also gets rid of the GFP_THISNODE exemption from the fairness
protocol because the unfair pass is no longer tied to kswapd, which
GFP_THISNODE is not allowed to wake up.However, because remote spills can be more frequent now - we prefer them
over local kswapd reclaim - the allocation batches on remote nodes could
underflow more heavily. When resetting the batches, use
atomic_long_read() directly instead of zone_page_state() to calculate the
delta as the latter filters negative counter values.Signed-off-by: Johannes Weiner
Acked-by: Rik van Riel
Acked-by: Mel Gorman
Cc: [3.12+]Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mem_cgroup_newpage_charge is used only for charging anonymous memory so
it is better to rename it to mem_cgroup_charge_anon.mem_cgroup_cache_charge is used for file backed memory so rename it to
mem_cgroup_charge_file.Signed-off-by: Michal Hocko
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Some callsites pass a memcg directly, some callsites pass an mm that
then has to be translated to a memcg. This makes for a terrible
function interface.Just push the mm-to-memcg translation into the respective callsites and
always pass a memcg to mem_cgroup_try_charge().[mhocko@suse.cz: add charge mm helper]
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__mem_cgroup_try_charge duplicates get_mem_cgroup_from_mm for charges
which came without a memcg. The only reason seems to be a tiny
optimization when css_tryget is not called if the charge can be consumed
from the stock. Nevertheless css_tryget is very cheap since it has been
reworked to use per-cpu counting so this optimization doesn't give us
anything these days.So let's drop the code duplication so that the code is more readable.
Signed-off-by: Michal Hocko
Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Instead of returning NULL from try_get_mem_cgroup_from_mm() when the mm
owner is exiting, just return root_mem_cgroup. This makes sense for all
callsites and gets rid of some of them having to fallback manually.[fengguang.wu@intel.com: fix warnings]
Signed-off-by: Johannes Weiner
Signed-off-by: Fengguang Wu
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Users pass either a mm that has been established under task lock, or use
a verified current->mm, which means the task can't be exiting.Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Only page cache charges can happen without an mm context, so push this
special case out of the inner core and into the cache charge function.An ancient comment explains that the mm can also be NULL in case the
task is currently being migrated, but that is not actually true with the
current case, so just remove it.Signed-off-by: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mem_cgroup_charge_common() is used by both cache and anon pages, but
most of its body only applies to anon pages and the remainder is not
worth having in a separate function.Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds