30 Dec, 2014
1 commit
-
Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
cache allocation where possible") has added a separate parameter for
specifying the gfp mask for radix tree allocations.

Not only is this less than optimal from the API point of view because it
is error prone, it is also buggy currently because
grab_cache_page_write_begin is using GFP_KERNEL for radix tree and if
fgp_flags doesn't contain FGP_NOFS (mostly controlled by fs by
AOP_FLAG_NOFS flag) but the mapping_gfp_mask has __GFP_FS cleared then
the radix tree allocation wouldn't obey the restriction and might
recurse into filesystem and cause deadlocks. This is the case for most
filesystems unfortunately because only ext4 and gfs2 are using
AOP_FLAG_NOFS.

Let's simply remove the radix_gfp_mask parameter because the allocation
context is the same for both the page cache and the radix tree. Just make
sure that the radix tree gets only the sane subset of the mask (e.g. do
not pass __GFP_WRITE).

Long term it is preferable to convert the remaining users of
AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this
interface even further.

Reported-by: Dave Chinner
Signed-off-by: Michal Hocko
Signed-off-by: Linus Torvalds
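A hedged sketch of the idea above (illustrative only, not the exact mainline change): derive the radix-tree allocation mask from the single mapping gfp mask, so a cleared __GFP_FS is honoured for both allocations, while page-cache-only bits such as __GFP_WRITE are stripped. The helper name is hypothetical.

#include <linux/gfp.h>

/* hypothetical helper illustrating the single-mask approach */
static inline gfp_t radix_tree_gfp(gfp_t mapping_gfp)
{
	/* keep the reclaim restrictions (e.g. a cleared __GFP_FS),
	 * drop page-cache-only hints the radix tree has no use for */
	return mapping_gfp & ~__GFP_WRITE;
}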
23 Dec, 2014
1 commit
-
This reverts commit c8475d144abb1e62958cc5ec281d2a9e161c1946.
There are several bug reports [1][2] which point to this commit as a potential
cause [3].

Let's revert it until we figure out what's going on.
[1] https://lkml.org/lkml/2014/11/14/342
[2] https://lkml.org/lkml/2014/12/22/213
[3] https://lkml.org/lkml/2014/12/9/741

Signed-off-by: Kirill A. Shutemov
Reported-by: Sasha Levin
Acked-by: Davidlohr Bueso
Cc: Hugh Dickins
Cc: Oleg Nesterov
Cc: Peter Zijlstra (Intel)
Cc: Rik van Riel
Cc: Srikar Dronamraju
Cc: Mel Gorman
Signed-off-by: Linus Torvalds
21 Dec, 2014
1 commit
-
Pull ACCESS_ONCE cleanup preparation from Christian Borntraeger:
"kernel: Provide READ_ONCE and ASSIGN_ONCEAs discussed on LKML http://marc.info/?i=54611D86.4040306%40de.ibm.com
ACCESS_ONCE might fail with specific compilers for non-scalar
accesses.

Here is a set of patches to tackle that problem.

The first patch introduces READ_ONCE and ASSIGN_ONCE. If the data
structure is larger than the machine word size memcpy is used and a
warning is emitted. The next patches fix up several in-tree users of
ACCESS_ONCE on non-scalar types.

This does not yet contain a patch that forces ACCESS_ONCE to work only
on scalar types. This is targeted for the next merge window as Linux
next already contains new offenders regarding ACCESS_ONCE vs.
non-scalar types"* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
s390/kvm: REPLACE barrier fixup with READ_ONCE
arm/spinlock: Replace ACCESS_ONCE with READ_ONCE
arm64/spinlock: Replace ACCESS_ONCE READ_ONCE
mips/gup: Replace ACCESS_ONCE with READ_ONCE
x86/gup: Replace ACCESS_ONCE with READ_ONCE
x86/spinlock: Replace ACCESS_ONCE with READ_ONCE
mm: replace ACCESS_ONCE with READ_ONCE or barriers
kernel: Provide READ_ONCE and ASSIGN_ONCE
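A standalone, hedged sketch of the READ_ONCE idea from the pull message above (simplified; the in-kernel macro differs in detail): a single volatile access for word-sized scalars, with a byte-wise copy fallback for larger objects.

#include <stdio.h>
#include <string.h>

#define READ_ONCE_SKETCH(x) ({						\
	typeof(x) __val;						\
	if (sizeof(x) <= sizeof(long))					\
		__val = *(const volatile typeof(x) *)&(x);		\
	else								\
		memcpy((void *)&__val, (const void *)&(x), sizeof(x));	\
	__val; })

struct pair { long a, b; };

static long shared_counter = 42;
static struct pair shared_pair = { 1, 2 };

int main(void)
{
	long v = READ_ONCE_SKETCH(shared_counter);	/* one volatile load */
	struct pair p = READ_ONCE_SKETCH(shared_pair);	/* falls back to memcpy */

	printf("%ld %ld %ld\n", v, p.a, p.b);
	return 0;
}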
20 Dec, 2014
1 commit
-
Pull vfs pile #3 from Al Viro:
"Assorted fixes and patches from the last cycle"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
[regression] chunk lost from bd9b51
vfs: make mounts and mountstats honor root dir like mountinfo does
vfs: cleanup show_mountinfo
init: fix read-write root mount
unfuck binfmt_misc.c (broken by commit e6084d4)
vm_area_operations: kill ->migrate()
new helper: iter_is_iovec()
move_extent_per_page(): get rid of unused w_flags
lustre: get rid of playing with ->fs
btrfs: filp_open() returns ERR_PTR() on failure, not NULL...
19 Dec, 2014
4 commits
-
Currently the functions in zsmalloc.c are not arranged in a readable and
reasonable sequence. As more and more functions are added, we may
run into the following inconvenience. For example:

Current functions:
void zs_init()
{
}

static void get_maxobj_per_zspage()
{
}

Then I want to add a func_1() which is called from zs_init(), and this
newly added function func_1() will use get_maxobj_per_zspage(), which is
defined below zs_init():

void func_1()
{
get_maxobj_per_zspage()
}

void zs_init()
{
func_1()
}

static void get_maxobj_per_zspage()
{
}

This will cause a compilation error, so we must add a declaration:
static void get_maxobj_per_zspage();
before func_1() if we do not put get_maxobj_per_zspage() before
func_1().

In addition, putting the module_[init|exit] functions at the bottom of the
file conforms to the usual convention.

So, this patch adjusts the function sequence as:
/* helper functions */
...
obj_location_to_handle()
...

/* Some exported functions */
...

zs_map_object()
zs_unmap_object()

zs_malloc()
zs_free()

zs_init()
zs_exit()

Signed-off-by: Ganesh Mahendran
Cc: Nitin Gupta
Acked-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Belatedly document the changes in commit f0c6d4d295e4 ("mm: introduce
do_shared_fault() and drop do_fault()").

Cc: Andi Kleen
Cc: Bob Liu
Cc: Dave Hansen
Cc: "Kirill A. Shutemov"
Cc: Matthew Wilcox
Cc: Mel Gorman
Cc: Naoya Horiguchi
Cc: Rik van Riel
Cc: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When the system boots up, in the dmesg logs we can see the memory
statistics along with the total reserved memory, as below:

Memory: 458840k/458840k available, 65448k reserved, 0K highmem

When CMA is enabled, the total reserved memory still remains the same.
However, the CMA memory is not considered as reserved. But, when we see
/proc/meminfo, the CMA memory is part of free memory. This creates
confusion. This patch corrects the problem by properly subtracting the
CMA reserved memory from the total reserved memory in the dmesg logs.

Below is the dmesg snapshot from an arm based device with 512MB RAM and
a single 12MB CMA region.

Before this change:
Memory: 458840k/458840k available, 65448k reserved, 0K highmem

After this change:
Memory: 458840k/458840k available, 53160k reserved, 12288k cma-reserved, 0K highmem

Signed-off-by: Pintu Kumar
Signed-off-by: Vishnu Pratap Singh
Acked-by: Michal Nazarewicz
Cc: Rafael Aquini
Cc: Jerome Marchand
Cc: Marek Szyprowski
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When nodes is true, nsc->mask2 has already been filtered by nsc->mask1,
which has already factored in node_states[N_MEMORY].

Signed-off-by: Zhihui Zhang
Cc: Mel Gorman
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Dec, 2014
2 commits
-
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)

Let's change the code to access the page table elements with
READ_ONCE that does implicit scalar accesses for the gup code.

mm_find_pmd is tricky, because m68k and sparc(32bit) define pmd_t
as array of longs. This code requires just that the pmd_present
and pmd_trans_huge check are done on the same value, so a barrier
is sufficient.

A similar case is in handle_pte_fault. On ppc44x the word size is
32 bit, but a pte is 64 bit. A barrier is ok as well.

Signed-off-by: Christian Borntraeger
Cc: linux-mm@kvack.org
Acked-by: Paul E. McKenney
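A hedged kernel-style sketch of the mm_find_pmd pattern the entry above describes (illustrative rather than verbatim): take one snapshot of the pmd so both checks see the same value, with barrier() preventing the compiler from re-reading *pmd in between.

	pmd_t pmde = *pmd;	/* one snapshot of the (possibly multi-word) pmd */
	barrier();		/* don't let the compiler reload *pmd below */
	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
		pmd = NULL;	/* not a usable page-table pmd */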
-
Dave Hansen reports that commit fb7332a9fedf ("mmu_gather: move minimal
range calculations into generic code") caused a performance problem:

"tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and
tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch
applied, but does not show up at all on the commit before"

and the reason is that Will moved the test for whether we need to flush
from tlb_flush_mmu() into tlb_flush_mmu_tlbonly(). But that meant that
tlb_flush_mmu_free() basically lost that check.

Move it back into tlb_flush_mmu() where it belongs, so that it covers
both tlb_flush_mmu_tlbonly() _and_ tlb_flush_mmu_free().

Reported-and-tested-by: Dave Hansen
Acked-by: Will Deacon
Signed-off-by: Linus Torvalds
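A hedged sketch of the structure the commit describes, not the verbatim fix; using tlb->end as the "anything gathered?" test is an assumption based on the ranges introduced by fb7332a9fedf. The point is that the early-out sits in tlb_flush_mmu() so it guards both the TLB invalidation and the page freeing.

void tlb_flush_mmu(struct mmu_gather *tlb)
{
	if (!tlb->end)			/* nothing gathered: skip both steps */
		return;

	tlb_flush_mmu_tlbonly(tlb);	/* invalidate the TLB range */
	tlb_flush_mmu_free(tlb);	/* free the batched pages */
}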
17 Dec, 2014
3 commits
-
the only instance this method has ever grown was one in kernfs -
one that calls ->migrate() of another vm_ops if it exists.

Signed-off-by: Al Viro
-
Signed-off-by: Al Viro
-
Pull nfsd updates from Bruce Fields:
"A comparatively quieter cycle for nfsd this time, but still with two
larger changes:

- RPC server scalability improvements from Jeff Layton (using RCU
instead of a spinlock to find idle threads).

- server-side NFSv4.2 ALLOCATE/DEALLOCATE support from Anna
Schumaker, enabling fallocate on new clients"

* 'for-3.19' of git://linux-nfs.org/~bfields/linux: (32 commits)
nfsd4: fix xdr4 count of server in fs_location4
nfsd4: fix xdr4 inclusion of escaped char
sunrpc/cache: convert to use string_escape_str()
sunrpc: only call test_bit once in svc_xprt_received
fs: nfsd: Fix signedness bug in compare_blob
sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt
sunrpc: convert to lockless lookup of queued server threads
sunrpc: fix potential races in pool_stats collection
sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it
sunrpc: require svc_create callers to pass in meaningful shutdown routine
sunrpc: have svc_wake_up only deal with pool 0
sunrpc: convert sp_task_pending flag to use atomic bitops
sunrpc: move rq_cachetype field to better optimize space
sunrpc: move rq_splice_ok flag into rq_flags
sunrpc: move rq_dropme flag into rq_flags
sunrpc: move rq_usedeferral flag to rq_flags
sunrpc: move rq_local field to rq_flags
sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it
nfsd: minor off by one checks in __write_versions()
sunrpc: release svc_pool_map reference when serv allocation fails
...
16 Dec, 2014
1 commit
-
Pull drm updates from Dave Airlie:
"Highlights:- AMD KFD driver merge
This is the AMD HSA interface for exposing a lowlevel interface for
GPGPU use. They have an open source userspace built on top of this
interface, and the code looks as good as it was going to get out of
tree.

- Initial atomic modesetting work
The need for an atomic modesetting interface to allow userspace to
try and send a complete set of modesetting state to the driver has
arisen, and been suffering from neglect this past year. No more,
the start of the common code and changes for msm driver to use it
are in this tree. Ongoing work to get the userspace ioctl finished
and the code clean will probably wait until next kernel.

- DisplayID 1.3 and tiled monitor exposed to userspace.
Tiled monitor property is now exposed for userspace to make use of.
- Rockchip drm driver merged.
- imx gpu driver moved out of staging
Other stuff:
- core:
panel - MIPI DSI + new panels.
expose suggested x/y properties for virtual GPUs

- i915:
Initial Skylake (SKL) support
gen3/4 reset work
start of dri1/ums removal
infoframe tracking
fixes for lots of things.

- nouveau:
tegra k1 voltage support
GM204 modesetting support
GT21x memory reclocking work

- radeon:
CI dpm fixes
GPUVM improvements
Initial DPM fan control

- rcar-du:
HDMI support added
removed some support for old boards
slave encoder driver for Analog Devices adv7511

- exynos:
Exynos4415 SoC support

- msm:
a4xx gpu support
atomic helper conversion

- tegra:
iommu support
universal plane support
ganged-mode DSI support

- sti:
HDMI i2c improvements

- vmwgfx:
some late fixes.

- qxl:
use suggested x/y properties"

* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
drm: sti: fix module compilation issue
drm/i915: save/restore GMBUS freq across suspend/resume on gen4
drm: sti: correctly cleanup CRTC and planes
drm: sti: add HQVDP plane
drm: sti: add cursor plane
drm: sti: enable auxiliary CRTC
drm: sti: fix delay in VTG programming
drm: sti: prepare sti_tvout to support auxiliary crtc
drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
drm: sti: fix hdmi avi infoframe
drm: sti: remove event lock while disabling vblank
drm: sti: simplify gdp code
drm: sti: clear all mixer control
drm: sti: remove gpio for HDMI hot plug detection
drm: sti: allow to change hdmi ddc i2c adapter
drm/doc: Document drm_add_modes_noedid() usage
drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
drm: Zero out DRM object memory upon cleanup
drm/i915/bdw: Fix the write setting up the WIZ hashing mode
...
15 Dec, 2014
1 commit
-
Pull aio updates from Benjamin LaHaise.
* git://git.kvack.org/~bcrl/aio-next:
aio: Skip timer for io_getevents if timeout=0
aio: Make it possible to remap aio ring
14 Dec, 2014
25 commits
-
There are actually two issues this patch addresses. Let me start with
the one I tried to solve in the beginning.

So, in the checkpoint-restore project (criu) we try to dump tasks'
state and restore one back exactly as it was. One of the tasks' state
bits is the rings set up with the io_setup() call. There are (almost) no problems
in dumping them, but there is a problem restoring them -- if I dump a task
with aio ring originally mapped at address A, I want to restore one
back at exactly the same address A. Unfortunately, the io_setup() does
not allow for that -- it mmaps the ring at whatever place mm finds
appropriate (it calls do_mmap_pgoff() with zero address and without
the MAP_FIXED flag).

To make restore possible I'm going to mremap() the freshly created ring
into the address A (under which it was seen before dump). The problem is
that the ring's virtual address is passed back to the user-space as the
context ID and this ID is then used as search key by all the other io_foo()
calls. Reworking this ID to be just some integer doesn't seem to work, as
this value is already used by libaio as a pointer using which this library
accesses memory for aio meta-data.

So, to make restore work we need to make sure that

a) ring is mapped at desired virtual address
b) kioctx->user_id matches this value

Having said that, the patch makes mremap() on aio region update the
kioctx's user_id and mmap_base values.

Here appears the 2nd issue I mentioned in the beginning of this mail.
If (regardless of the C/R dances I do) someone creates an io context
with io_setup(), then mremap()-s the ring and then destroys the context,
the kill_ioctx() routine will call munmap() on wrong (old) address.
This will result in a) aio ring remaining in memory and b) some other
vma get unexpectedly unmapped.

What do you think?
Signed-off-by: Pavel Emelyanov
Acked-by: Dmitry Monakhov
Signed-off-by: Benjamin LaHaise
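A hedged userspace sketch of the restore flow described above. restore_addr and ring_len are illustrative; the real ring length depends on the nr_events passed to io_setup() and must match the kernel's mapping.

#define _GNU_SOURCE
#include <linux/aio_abi.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int remap_ring(void *restore_addr, size_t ring_len)
{
	aio_context_t ctx = 0;

	/* the returned ctx is the ring's current mapping address */
	if (syscall(SYS_io_setup, 128, &ctx) < 0)
		return -1;

	/* move the ring to where the dumped task had it; with this patch the
	 * kernel updates kioctx->user_id/mmap_base to follow the move */
	void *ring = mremap((void *)ctx, ring_len, ring_len,
			    MREMAP_MAYMOVE | MREMAP_FIXED, restore_addr);
	return ring == MAP_FAILED ? -1 : 0;
}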
-
kmemleak will add allocations as objects to a pool. The memory allocated
for each object in this pool is periodically searched for pointers to
other allocated objects. This only works for memory that is mapped into
the kernel's virtual address space, which happens not to be the case for
most CMA regions.

Furthermore, CMA regions are typically used to store data transferred to
or from a device and therefore don't contain pointers to other objects.

Without this, the kernel crashes on the first execution of the
scan_gray_list() because it tries to access highmem. Perhaps a more
appropriate fix would be to reject any object that can't map to a kernel
virtual address?

[akpm@linux-foundation.org: add comment]
[akpm@linux-foundation.org: fix comment, per Catalin]
[sfr@canb.auug.org.au: include linux/io.h for phys_to_virt()]
Signed-off-by: Thierry Reding
Cc: Michal Nazarewicz
Cc: Marek Szyprowski
Cc: Joonsoo Kim
Cc: "Aneesh Kumar K.V"
Cc: Catalin Marinas
Signed-off-by: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
If we fail to allocate from the current node's stock, we look for free
objects on other nodes before calling the page allocator (see
get_any_partial). While checking other nodes we respect cpuset
constraints by calling cpuset_zone_allowed. We enforce hardwall check.
As a result, we will fall back to the page allocator even if there are some
pages cached on other nodes, but the current cpuset doesn't have them set.
However, the page allocator uses softwall check for kernel allocations,
so it may allocate from one of the other nodes in this case.

Therefore we should use softwall cpuset check in get_any_partial to
conform with the cpuset check in the page allocator.

Signed-off-by: Vladimir Davydov
Acked-by: Zefan Li
Acked-by: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
fallback_alloc is called on kmalloc if the preferred node doesn't have
free or partial slabs and there are no pages on the node's free list
(GFP_THISNODE allocations fail). Before invoking the reclaimer it tries
to locate a free or partial slab on other allowed nodes' lists. While
iterating over the preferred node's zonelist it skips those zones which
hardwall cpuset check returns false for. That means that for a task bound
to a specific node using cpusets fallback_alloc will always ignore free
slabs on other nodes and go directly to the reclaimer, which, however, may
allocate from other nodes if cpuset.mem_hardwall is unset (default). As a
result, we may get lists of free slabs grow without bounds on other nodes,
which is bad, because inactive slabs are only evicted by cache_reap at a
very slow rate and cannot be dropped forcefully.

To reproduce the issue, run a process that will walk over a directory tree
with lots of files inside a cpuset bound to a node that constantly
experiences memory pressure. Look at num_slabs vs active_slabs growth as
reported by /proc/slabinfo.

To avoid this we should use softwall cpuset check in fallback_alloc.

Signed-off-by: Vladimir Davydov
Signed-off-by: Vladimir Davydov
Acked-by: Zefan Li
Acked-by: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When zbud is initialized through the zpool wrapper, pool->ops which
points to user-defined operations is always set regardless of whether it
is specified from the upper layer. This causes zbud_reclaim_page() to
iterate its loop for evicting pool pages out without any gain.

This patch sets the user-defined ops only when it is needed, so that
zbud_reclaim_page() can bail out the reclamation loop earlier if there
are no user-defined operations specified.

Signed-off-by: Heesub Shin
Acked-by: Dan Streetman
Cc: Seth Jennings
Cc: Sunae Seo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
free_percpu() tests whether its argument is NULL and then returns
immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring
Cc: Seth Jennings
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
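The resulting cleanup looks roughly like the following diff ("pool" is a placeholder name); free_percpu() already returns early for a NULL pointer, so the guard adds nothing:

-	if (pool)
-		free_percpu(pool);
+	free_percpu(pool);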
-
zswap_cpu_init/zswap_comp_exit/zswap_entry_cache_create are only called by
__init init_zswap().

Signed-off-by: Mahendran Ganesh
Cc: Seth Jennings
Cc: Dan Streetman
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
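A hedged sketch of the cleanup this implies (the body and return type are illustrative): helpers reached only from the __init path can be marked __init themselves, so their text is discarded once boot-time initialization is done.

static int __init zswap_cpu_init(void)
{
	/* ... one-time setup, e.g. registering the cpu notifier ... */
	return 0;
}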
-
In zs_create_pool(), we allocate more memory than sizeof(struct zs_pool):

ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);

This patch allocates memory of exactly the needed size.
Signed-off-by: Ganesh Mahendran
Acked-by: Minchan Kim
Cc: Nitin Gupta
Cc: Dan Streetman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
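Roughly, as a diff (the allocation call shown is illustrative of the pattern, not a verbatim quote of zs_create_pool()):

-	ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);
-	pool = kzalloc(ovhd_size, GFP_KERNEL);
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);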
-
In zs_create_pool(), prev_class is assigned (ZS_SIZE_CLASSES - 1) times.
And prev_class only references the previous size_class, so we do
not need the unnecessary assignment.

This patch assigns *prev_class* when a new size_class structure is
allocated and uses prev_class to check whether the first class has been
allocated.

[akpm@linux-foundation.org: remove now-unused ZS_SIZE_CLASSES]
Signed-off-by: Ganesh Mahendran
Cc: Minchan Kim
Cc: Nitin Gupta
Reviewed-by: Dan Streetman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
I sent a patch [1] for unnecessary check in zsmalloc. And Minchan Kim
found zsmalloc even does not support allocating an obj with the size of
ZS_MAX_ALLOC_SIZE in some situations.

For example, in a system with 64KB PAGE_SIZE and 32-bit physical addresses:
ZS_MIN_ALLOC_SIZE is 32 bytes which is calculated by:
MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
ZS_MAX_ALLOC_SIZE is 64KB(in current code, is PAGE_SIZE)
ZS_SIZE_CLASS_DELTA is 256 bytes
So, ZS_SIZE_CLASSES = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) /
ZS_SIZE_CLASS_DELTA + 1
= 256

In zs_create_pool(), the max size obj which can be allocated will be:
ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA = 32 + 255*256 = 65312

We can see that 65312 < 65536 (ZS_MAX_ALLOC_SIZE). So we can NOT
allocate objs with size ZS_MAX_ALLOC_SIZE(65536) which we promise upper
users we can do.

[1] http://lkml.iu.edu/hypermail/linux/kernel/1411.2/03835.html
[2] http://lkml.iu.edu/hypermail/linux/kernel/1411.2/04534.html

This patch fixes this issue by dynamically calculating zs_size_classes when
the module is loaded, and by allocating a buffer of size ZS_MAX_ALLOC_SIZE.
Then the max obj (size ZS_MAX_ALLOC_SIZE) can be stored in it.

[akpm@linux-foundation.org: restore ZS_SIZE_CLASSES to fix bisectability]
Signed-off-by: Mahendran Ganesh
Suggested-by: Minchan Kim
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
kunmap_atomic() should be given the virtual address returned by kmap_atomic().
However, some pieces of code in zsmalloc pass a modified address, not the
one returned by kmap_atomic(), to kunmap_atomic().

It happens to work because zsmalloc only modifies the address within the
PAGE_SIZE boundary, so it works with the current kmap_atomic implementation.
But it's still fragile against potential changes of kmap_atomic, so let's
correct it.

I got a subtle bug when I implemented a new feature of zsmalloc
(compaction) due to a link's mishandling (the link was over page
boundary). Although it was totally my mistake, it took a while to find
the cause because an unpredictable kmapped address was unmapped causing an
almost random crash.

Signed-off-by: Minchan Kim
Cc: Nitin Gupta
Cc: Sergey Senozhatsky
Cc: Dan Streetman
Cc: Seth Jennings
Cc: Jerome Marchand
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
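A hedged kernel-style sketch of the rule the patch enforces (the names "entry", "off" and the struct are illustrative): derive whatever pointers you need from the mapping, but hand kunmap_atomic() the exact address kmap_atomic() returned.

	void *vaddr = kmap_atomic(page);	/* base address of the mapping */
	struct link_free *entry = vaddr + off;	/* deriving pointers for access is fine */

	/* ... use *entry ... */

	kunmap_atomic(vaddr);	/* unmap with the original address, not "entry" */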
-
Mahendran Ganesh reported that zpool-enabled zsmalloc should not call
zpool_unregister_driver() from zs_init() if cpu notifier registration has
failed, because error handling is performed before we register the driver
via zpool_register_driver() call.

Factor out cpu notifier registration and unregistration code and fix
zs_init() error handling.

link: http://lkml.iu.edu//hypermail/linux/kernel/1411.1/04156.html
[akpm@linux-foundation.org: squash bogus gcc warning]
[akpm@linux-foundation.org: use __init and __exit]
Signed-off-by: Sergey Senozhatsky
Reported-by: Mahendran Ganesh
Cc: Minchan Kim
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
zsmalloc has many size_classes to reduce fragmentation and they are in 16
bytes unit, for example, 16, 32, 48, etc., if PAGE_SIZE is 4096. And,
zsmalloc has a constraint that each zspage has 4 pages at maximum.

In this situation, we can see an interesting aspect. Let's think about
size_class for 1488, 1472, ..., 1376. To prevent external fragmentation,
they use 4 pages per zspage and so they can all contain 11 objects at
maximum.

16384 (4096 * 4) = 1488 * 11 + remains
16384 (4096 * 4) = 1472 * 11 + remains
16384 (4096 * 4) = ...
16384 (4096 * 4) = 1376 * 11 + remains

It means that they have the same characteristics and classification between
them isn't needed. If we use one size_class for them, we can reduce
fragmentation and save some memory since both the 1488 and 1472 sized
classes can only fit 11 objects into 4 pages, and an object that's 1472
bytes can fit into an object that's 1488 bytes, so merging these classes to
always use objects that are 1488 bytes will reduce the total number of
size classes. And reducing the total number of size classes reduces
overall fragmentation, because a wider range of compressed pages can fit
into a single size class, leaving fewer unused objects in each size class.

For this purpose, this patch implements size_class merging. If there is a
size_class that has the same pages_per_zspage and the same number of objects
per zspage as the previous size_class, we don't create a new size_class.
Instead, we reuse the previous size_class with the same characteristics. This
way, the above example sizes (1488, 1472, ..., 1376) use just one size_class,
so we get much better memory utilization.

Below is the result of my simple test.
TEST ENV: EXT4 on zram, mount with discard option
WORKLOAD: untar kernel source code, remove directory in descending order
in size (drivers arch fs sound include net Documentation firmware kernel tools)

Each line represents orig_data_size, compr_data_size, mem_used_total,
fragmentation overhead (mem_used - compr_data_size) and overhead ratio
(overhead to compr_data_size), respectively, after untar and remove
operation is executed.

* untar-nomerge.out
orig_size compr_size used_size overhead overhead_ratio
525.88MB 199.16MB 210.23MB 11.08MB 5.56%
288.32MB 97.43MB 105.63MB 8.20MB 8.41%
177.32MB 61.12MB 69.40MB 8.28MB 13.55%
146.47MB 47.32MB 56.10MB 8.78MB 18.55%
124.16MB 38.85MB 48.41MB 9.55MB 24.58%
103.93MB 31.68MB 40.93MB 9.25MB 29.21%
84.34MB 22.86MB 32.72MB 9.86MB 43.13%
66.87MB 14.83MB 23.83MB 9.00MB 60.70%
60.67MB 11.11MB 18.60MB 7.49MB 67.48%
55.86MB 8.83MB 16.61MB 7.77MB 88.03%
53.32MB 8.01MB 15.32MB 7.31MB 91.24%

* untar-merge.out
orig_size compr_size used_size overhead overhead_ratio
526.23MB 199.18MB 209.81MB 10.64MB 5.34%
288.68MB 97.45MB 104.08MB 6.63MB 6.80%
177.68MB 61.14MB 66.93MB 5.79MB 9.47%
146.83MB 47.34MB 52.79MB 5.45MB 11.51%
124.52MB 38.87MB 44.30MB 5.43MB 13.96%
104.29MB 31.70MB 36.83MB 5.13MB 16.19%
84.70MB 22.88MB 27.92MB 5.04MB 22.04%
67.11MB 14.83MB 19.26MB 4.43MB 29.86%
60.82MB 11.10MB 14.90MB 3.79MB 34.17%
55.90MB 8.82MB 12.61MB 3.79MB 42.97%
53.32MB 8.01MB 11.73MB 3.73MB 46.53%

As you can see from the above results, the merged one has better utilization
(overhead ratio, 5th column) and uses less memory (mem_used_total, 3rd column).

Signed-off-by: Joonsoo Kim
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Jerome Marchand
Cc: Sergey Senozhatsky
Reviewed-by: Dan Streetman
Cc: Luigi Semenzato
Cc:
Cc: "seungho1.park"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
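A small standalone sketch of the merging criterion described above: size classes that need the same number of pages per zspage and fit the same number of objects into it carry no extra information, so they can share one size_class (the values assume 4K pages, matching the example).

#include <stdio.h>

#define PAGE_SZ 4096

/* objects of `size` bytes that fit into a zspage of `pages` pages */
static int objs_per_zspage(int size, int pages)
{
	return (pages * PAGE_SZ) / size;
}

int main(void)
{
	/* 1376..1488 in 16-byte steps all fit 11 objects into a 4-page
	 * zspage, so with merging they collapse into one class */
	for (int size = 1376; size <= 1488; size += 16)
		printf("size %4d -> %d objects per 4-page zspage\n",
		       size, objs_per_zspage(size, 4));
	return 0;
}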
-
Remove unused mem_cgroup_lru_names_not_uptodate() and move BUILD_BUG_ON()
to the beginning of memcg_stat_show().

This was partially found by using a static code analysis program called
cppcheck.

Signed-off-by: Rickard Strandqvist
Cc: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Suppose task @t that belongs to a memory cgroup @memcg is going to
allocate an object from a kmem cache @c. The copy of @c corresponding to
@memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory
cgroup destruction we can access the memory cgroup's copy of the cache
after it was destroyed:

CPU0                                CPU1
----                                ----
[ current=@t
  @mc->memcg_params->nr_pages=0 ]

kmem_cache_alloc(@c):
  call memcg_kmem_get_cache(@c);
  proceed to allocation from @mc:
    alloc a page for @mc:
      ...
                                    move @t from @memcg
                                    destroy @memcg:
                                      mem_cgroup_css_offline(@memcg):
                                        memcg_unregister_all_caches(@memcg):
                                          kmem_cache_destroy(@mc)
      add page to @mc
We could fix this issue by taking a reference to a per-memcg cache, but
that would require adding a per-cpu reference counter to per-memcg caches,
which would look cumbersome.

Instead, let's take a reference to a memory cgroup, which already has a
per-cpu reference counter, in the beginning of kmem_cache_alloc to be
dropped in the end, and move per memcg caches destruction from css offline
to css free. As a side effect, per-memcg caches will be destroyed not one
by one, but all at once when the last page accounted to the memory cgroup
is freed. This doesn't sound like a high price for code readability though.

Note, this patch does add some overhead to the kmem_cache_alloc hot path,
but it is pretty negligible - it's just a function call plus a per cpu
counter decrement, which is comparable to what we already have in
memcg_kmem_get_cache. Besides, it's only relevant if there are memory
cgroups with kmem accounting enabled. I don't think we can find a way to
handle this race w/o it, because alloc_page called from kmem_cache_alloc
may sleep so we can't flush all pending kmallocs w/o reference counting.

Signed-off-by: Vladimir Davydov
Acked-by: Christoph Lameter
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
test_mem_cgroup_node_reclaimable() is used only when MAX_NUMNODES > 1, so
move it inside the corresponding #if block.

[akpm@linux-foundation.org: clean up layout]
Signed-off-by: Michele Curti
Acked-by: Michal Hocko
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
A random seek IO benchmark appeared to regress because of a change to
readahead but the real problem was the benchmark. To ensure the IO
request accessed disk, it used fadvise(FADV_DONTNEED) on a block boundary
(512K) but the hint is ignored by the kernel. This is correct but not
necessarily obvious behaviour. As much as I dislike comment patches, the
explanation for this behaviour predates current git history. Clarify why
it behaves like this in case someone "fixes" fadvise or readahead for the
wrong reasons.

Signed-off-by: Mel Gorman
Cc: Michael Kerrisk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Read memory barriers must follow the read operations.
Signed-off-by: Dmitry Vyukov
Cc: Eric Dumazet
Acked-by: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
After the previous patch we can remove the PT_TRACE_EXIT check in
oom_scan_process_thread(), it was added to handle the case when the
coredumping was "frozen" by ptrace, but it doesn't really work. If
nothing else, we would need to check all threads which could share the
same ->mm to make it more or less correct.

Signed-off-by: Oleg Nesterov
Cc: Cong Wang
Cc: David Rientjes
Acked-by: Michal Hocko
Cc: "Rafael J. Wysocki"
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
oom_kill.c assumes that PF_EXITING task should exit and free the memory
soon. This is wrong in many ways and one important case is the coredump.
A task can sleep in exit_mm() "forever" while the coredumping sub-thread
can need more memory.

Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account,
adding a new trivial helper for that.

Note: this is only the first step, this patch doesn't try to solve other
problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can
participate in coredump after it was already observed in PF_EXITING state,
so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set.
fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so
out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it.
And even the name/usage of the new helper is confusing, an exiting thread
can only free its ->mm if it is the only/last task in the thread group.

[akpm@linux-foundation.org: add comment]
Signed-off-by: Oleg Nesterov
Cc: Cong Wang
Acked-by: David Rientjes
Acked-by: Michal Hocko
Cc: "Rafael J. Wysocki"
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Since commit 01cefaef40c4 ("mm: provide more accurate estimation
of pages occupied by memmap"), the pages for the highmem zones' memmap
are allocated from lowmem, so there is no need to reserve memmap space
for highmem.

A 2G DDR3 arm platform:
On node 0 totalpages: 524288
free_area_init_node: node 0, pgdat 80ccd380, node_mem_map 80d38000
DMA zone: 3568 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 456704 pages, LIFO batch:31
HighMem zone: 528 pages used for memmap
HighMem zone: 67584 pages, LIFO batch:15

On node 0 totalpages: 524288
free_area_init_node: node 0, pgdat 80cd6f40, node_mem_map 80d42000
DMA zone: 3568 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 456704 pages, LIFO batch:31
HighMem zone: 67584 pages, LIFO batch:15

Signed-off-by: Hongbo Zhong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were
created for use on pages almost certainly mapped into userspace. But
nowadays compaction often applies them to unmapped page cache pages: which
may exacerbate contention on i_mmap_rwsem quite unnecessarily, since
try_to_unmap_file() makes no preliminary page_mapped() check.

Now check page_mapped() in __unmap_and_move(); and avoid repeating the
same overhead in rmap_walk_file() - don't remove_migration_ptes() when we
never inserted any.

(The PageAnon(page) comment blocks now look even sillier than before, but
clean that up on some other occasion. And note in passing that
try_to_unmap_one() does not use a migration entry when PageSwapCache, so
remove_migration_ptes() will then not update that swap entry to newpage
pte: not a big deal, but something else to clean up later.)

Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex"
conversion to i_mmap_rwsem, that "The biggest winner of these changes is
migration": a part of the reason might be all of that unnecessary taking
of i_mmap_mutex in page migration; and it's rather a shame that I didn't
get around to sending this patch in before his - this one is much less
useful after Davidlohr's conversion to rwsem, but still good.

Signed-off-by: Hugh Dickins
Cc: Davidlohr Bueso
Cc: Rik van Riel
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The slab shrinkers are currently invoked from the zonelist walkers in
kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
eligible LRU pages and assemble a nodemask to pass to NUMA-aware
shrinkers, which then again have to walk over the nodemask. This is
redundant code, extra runtime work, and fairly inaccurate when it comes to
the estimation of actually scannable LRU pages. The code duplication will
only get worse when making the shrinkers cgroup-aware and requiring them
to have out-of-band cgroup hierarchy walks as well.

Instead, invoke the shrinkers from shrink_zone(), which is where all
reclaimers end up, to avoid this duplication.

Take the count for eligible LRU pages out of get_scan_count(), which
considers many more factors than just the availability of swap space, like
zone_reclaimable_pages() currently does. Accumulate the number over all
visited lruvecs to get the per-zone value.

Some nodes have multiple zones due to memory addressing restrictions. To
avoid putting too much pressure on the shrinkers, only invoke them once
for each such node, using the class zone of the allocation as the pivot
zone.

For now, this integrates the slab shrinking better into the reclaim logic
and gets rid of duplicative invocations from kswapd, direct reclaim, and
zone reclaim. It also prepares for cgroup-awareness, allowing
memcg-capable shrinkers to be added at the lruvec level without much
duplication of both code and runtime work.

This changes kswapd behavior, which used to invoke the shrinkers for each
zone, but with scan ratios gathered from the entire node, resulting in
meaningless pressure quantities on multi-zone nodes.

Zone reclaim behavior also changes. It used to shrink slabs until the
same amount of pages were shrunk as were reclaimed from the LRUs. Now it
merely invokes the shrinkers once with the zone's scan ratio, which makes
the shrinkers go easier on caches that implement aging and would prefer
feeding back pressure from recently used slab objects to unused LRU pages.

[vdavydov@parallels.com: assure class zone is populated]
Signed-off-by: Johannes Weiner
Cc: Dave Chinner
Signed-off-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
These flushes deal with sequence number overflows, such as for long lived
threads. These are rare, but interesting from a debugging PoV. As such,
display the number of flushes when vmacache debugging is enabled.

Signed-off-by: Davidlohr Bueso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The extended memory to store page owner information is initialized some time
later than the page allocator starts. Until then, many pages
can be allocated and they have no owner information. This makes debugging
with page owner harder, so some fixup will be helpful.

This patch fixes up this situation by setting fake owner information
immediately after page extension is initialized. Information doesn't tell
the right owner, but, at least, it can tell whether page is allocated or
not, more correctly.

In my testing, this patch catches 13343 early allocated pages, although
they are mostly allocated from page extension feature. Anyway, after
that, there are no pages left that are allocated but have no page owner
flag.

Signed-off-by: Joonsoo Kim
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Minchan Kim
Cc: Dave Hansen
Cc: Michal Nazarewicz
Cc: Jungsoo Son
Cc: Ingo Molnar
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds