28 Oct, 2016
1 commit
-
No, KASAN may not be able to co-exist with HOTPLUG_MEMORY at runtime,
but for build testing there is no reason not to allow them together.

This hopefully means better build coverage and fewer embarrassing silly
problems like the one fixed by commit 9db4f36e82c2 ("mm: remove unused
variable in memory hotplug") in the future.

Cc: Stephen Rothwell
Cc: Andrey Ryabinin
Cc: Alexander Potapenko
Signed-off-by: Linus Torvalds
27 Aug, 2016
1 commit
-
The current wording of the COMPACTION Kconfig help text doesn't
emphasise that disabling COMPACTION might cripple the page allocator,
which relies quite heavily on compaction for high-order requests; an
unexpected OOM can happen without it. Make sure we are vocal about
that.

Link: http://lkml.kernel.org/r/20160823091726.GK23577@dhcp22.suse.cz
Signed-off-by: Michal Hocko
Cc: Markus Trippelsdorf
Cc: Mel Gorman
Cc: Joonsoo Kim
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Aug, 2016
1 commit
-
At present it is obvious that memory online and offline will fail when
KASAN is enabled. So add a condition to limit memory hotplug when
KASAN is enabled.

Link: http://lkml.kernel.org/r/1470063651-29519-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang
Cc: Andrey Ryabinin
Cc: Alexander Potapenko
Cc: Dmitry Vyukov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Jul, 2016
1 commit
-
When it was first introduced CONFIG_ZONE_DEVICE depended on disabling
CONFIG_ZONE_DMA, a configuration choice reserved for "experts".
However, now that the ZONE_DMA conflict has been eliminated it no longer
makes sense to require CONFIG_EXPERT.

Link: http://lkml.kernel.org/r/146687646274.39261.14267596518720371009.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Reported-by: Eric Sandeen
Reported-by: Jeff Moyer
Acked-by: Jeff Moyer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Jul, 2016
1 commit
-
For file mappings, we don't deposit page tables on THP allocation
because it's not strictly required to implement split_huge_pmd(): we
can just clear the pmd and let the following page faults reconstruct
the page table.

But Power makes use of the deposited page table to address an MMU
quirk. Let's hide THP page cache, including huge tmpfs, under a
separate config option, so it can be forbidden on Power.

We can revert the patch later once a solution for Power is found.
Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Cc: Aneesh Kumar K.V
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 May, 2016
1 commit
-
When we have !NO_BOOTMEM, the deferred page struct initialization
doesn't work well because the pages reserved in bootmem are released
to the page allocator unconditionally. It causes memory corruption and
eventually a system crash.

As Mel suggested, bootmem is retiring slowly. We fix the issue by
simply hiding DEFERRED_STRUCT_PAGE_INIT when bootmem is enabled.

Link: http://lkml.kernel.org/r/1460602170-5821-1-git-send-email-gwshan@linux.vnet.ibm.com
Signed-off-by: Gavin Shan
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 May, 2016
1 commit
-
Per the suggestion from Michal Hocko [1], DEFERRED_STRUCT_PAGE_INIT
requires some ordering wrt other initialization operations, e.g.
page_ext_init has to happen after the whole memmap is initialized
properly.

For SPARSEMEM this requires waiting for page_alloc_init_late. Other
memory models (e.g. flatmem) might have different initialization
layouts (page_ext_init_flatmem). Currently DEFERRED_STRUCT_PAGE_INIT
depends on MEMORY_HOTPLUG, which in turn

	depends on SPARSEMEM || X86_64_ACPI_NUMA
	depends on ARCH_ENABLE_MEMORY_HOTPLUG

and X86_64_ACPI_NUMA depends on NUMA, which in turn disables the
FLATMEM memory model:

	config ARCH_FLATMEM_ENABLE
		def_bool y
		depends on X86_32 && !NUMA

so FLATMEM is ruled out via the dependency maze. Be explicit and
disable FLATMEM for DEFERRED_STRUCT_PAGE_INIT so that we do not
reintroduce subtle initialization bugs.

[1] http://lkml.kernel.org/r/20160523073157.GD2278@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1464027356-32282-1-git-send-email-yang.shi@linaro.org
Signed-off-by: Yang Shi
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 May, 2016
2 commits
-
I've been receiving increasingly concerned notes from 0day about how
much my recent changes have been bloating the radix tree. Make it
happier by only including multiorder support if
CONFIG_TRANSPARENT_HUGEPAGES is set.

This is an independent Kconfig option, so other radix tree users can
also set it if they have a need.

Signed-off-by: Matthew Wilcox
Reviewed-by: Ross Zwisler
Cc: Konstantin Khlebnikov
Cc: Kirill Shutemov
Cc: Jan Kara
Cc: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This patch introduces z3fold, a special purpose allocator for storing
compressed pages. It is designed to store up to three compressed pages
per physical page. It is a ZBUD derivative which allows for a higher
compression ratio while keeping the simplicity and determinism of its
predecessor.

This patch comes as a follow-up to the discussions at the Embedded
Linux Conference in San Diego related to the talk [1]. The outcome of
these discussions was that it would be good to have a compressed page
allocator as stable and deterministic as zbud, with a higher
compression ratio.

To keep the determinism and simplicity, z3fold, just like zbud, always
stores an integral number of compressed pages per page, but it can
store up to 3 pages, unlike zbud which can store at most 2. Therefore
the compression ratio goes to around 2.6x while zbud's is around 1.7x.

The patch is based on the latest linux.git tree.

This version has been updated after testing on various simulators
(e.g. ARM Versatile Express, MIPS Malta, x86_64/Haswell) and basing on
comments from Dan Streetman [3].

[1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
[2] https://lkml.org/lkml/2016/4/21/799
[3] https://lkml.org/lkml/2016/5/4/852
Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
Signed-off-by: Vitaly Wool
Cc: Seth Jennings
Cc: Dan Streetman
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
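
An illustrative aside, not from the commit: z3fold registers as a zpool
backend, so in-kernel users reach it through the zpool API. A minimal
sketch, assuming the backend type string "z3fold"; the pool name and
sizes below are made up:

    #include <linux/gfp.h>
    #include <linux/string.h>
    #include <linux/zpool.h>

    /* Sketch: create a z3fold-backed pool, store one buffer, tear down. */
    static int z3fold_demo(void)
    {
            struct zpool *pool;
            unsigned long handle;
            void *buf;

            pool = zpool_create_pool("z3fold", "demo", GFP_KERNEL, NULL);
            if (!pool)
                    return -ENOMEM;

            if (zpool_malloc(pool, 1024, GFP_KERNEL, &handle)) {
                    zpool_destroy_pool(pool);
                    return -ENOMEM;
            }

            buf = zpool_map_handle(pool, handle, ZPOOL_MM_RW);
            memset(buf, 0, 1024);           /* compressed data would go here */
            zpool_unmap_handle(pool, handle);

            zpool_free(pool, handle);
            zpool_destroy_pool(pool);
            return 0;
    }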
20 May, 2016
2 commits
-
This patchset continues the work I started with commit 31bc3858ea3e
("memory-hotplug: add automatic onlining policy for the newly added
memory").

Initially I was going to stop there and bring the policy setting logic
to userspace. I met two issues on the way:

1) It is possible to have memory hotplugged at boot (e.g. with QEMU).
   These blocks stay offlined if we turn the onlining policy on by
   userspace.

2) My attempt to bring this policy setting to systemd failed; systemd
   maintainers suggest to change the default in the kernel or ... to
   use tmpfiles.d to alter the policy (which looks like a hack to me):
   https://github.com/systemd/systemd/pull/2938

Here I suggest to add a config option to set the default value for the
policy and a kernel command line parameter to make the override.

This patch (of 2):

Introduce a config option to set the default value for the memory
hotplug onlining policy (/sys/devices/system/memory/auto_online_blocks).
The reasons one would want to turn this option on are to have early
onlining for hotpluggable memory available at boot and to not require
any userspace actions to make memory hotplug work.

[akpm@linux-foundation.org: tweak Kconfig text]
Signed-off-by: Vitaly Kuznetsov
Cc: Jonathan Corbet
Cc: Dan Williams
Cc: "Kirill A. Shutemov"
Cc: Mel Gorman
Cc: David Vrabel
Cc: David Rientjes
Cc: Igor Mammedov
Cc: Lennart Poettering
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Now we have the IS_ENABLED helper to check if a Kconfig option is
enabled or not, so ZONE_DMA_FLAG is no longer useful.

And, the use of ZONE_DMA_FLAG in slab looks pointless according to the
comment [1] from Johannes Weiner, so remove it; ORing the passed-in
flags with the cache gfp flags is already done in kmem_getpages().

[1] https://lkml.org/lkml/2014/9/25/553
Link: http://lkml.kernel.org/r/1462381297-11009-1-git-send-email-yang.shi@linaro.org
Signed-off-by: Yang Shi
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
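
For reference, this is the pattern that makes a dedicated flag variable
redundant; a minimal sketch (the helper name is mine, not from the
patch):

    #include <linux/kconfig.h>
    #include <linux/types.h>

    /* IS_ENABLED(CONFIG_FOO) evaluates to 1 when FOO is y or m, else 0,
     * so a runtime flag mirroring a Kconfig symbol carries no extra
     * information. */
    static inline bool dma_zone_configured(void)
    {
            return IS_ENABLED(CONFIG_ZONE_DMA);
    }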
21 Mar, 2016
1 commit
-
Pull x86 protection key support from Ingo Molnar:
"This tree adds support for a new memory protection hardware feature
that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

There's a background article at LWN.net:
https://lwn.net/Articles/643797/

The gist is that protection keys allow the encoding of
user-controllable permission masks in the pte. So instead of having a
fixed protection mask in the pte (which needs a system call to change
and works on a per page basis), the user can map a (handful of)
protection mask variants and can change the masks at runtime relatively
cheaply, without having to change every single page in the affected
virtual memory range.

This allows the dynamic switching of the protection bits of large
amounts of virtual memory, via user-space instructions. It also
allows more precise control of MMU permission bits: for example the
executable bit is separate from the read bit (see more about that
below).

This tree adds the MM infrastructure and low level x86 glue needed for
that, plus it adds a high level API to make use of protection keys -
if a user-space application calls:

mmap(..., PROT_EXEC);
or
mprotect(ptr, sz, PROT_EXEC);

(note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
this special case, and will set a special protection key on this
memory range. It also sets the appropriate bits in the Protection
Keys User Rights (PKRU) register so that the memory becomes unreadable
and unwritable.

So using protection keys the kernel is able to implement 'true'
PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
PROT_READ as well. Unreadable executable mappings have security
advantages: they cannot be read via information leaks to figure out
ASLR details, nor can they be scanned for ROP gadgets - and they
cannot be used by exploits for data purposes either.

We know about no user-space code that relies on pure PROT_EXEC
mappings today, but binary loaders could start making use of this new
feature to map binaries and libraries in a more secure fashion.

There is other pending pkeys work that offers more high level system
call APIs to manage protection keys - but those are not part of this
pull request.

Right now there's a Kconfig that controls this feature
(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
(like most x86 CPU feature enablement code that has no runtime
overhead), but it's not user-configurable at the moment. If there's
any serious problem with this then we can make it configurable and/or
flip the default"

* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
mm/core, x86/mm/pkeys: Add execute-only protection keys support
x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
x86/mm/pkeys: Allow kernel to modify user pkey rights register
x86/fpu: Allow setting of XSAVE state
x86/mm: Factor out LDT init from context init
mm/core, x86/mm/pkeys: Add arch_validate_pkey()
mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
x86/mm/pkeys: Add Kconfig prompt to existing config option
x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
x86/mm/pkeys: Dump PKRU with other kernel registers
mm/core, x86/mm/pkeys: Differentiate instruction fetches
x86/mm/pkeys: Optimize fault handling in access_error()
mm/core: Do not enforce PKEY permissions on remote mm access
um, pkeys: Add UML arch_*_access_permitted() methods
mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
x86/mm/gup: Simplify get_user_pages() PTE bit handling
...
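
As a userspace illustration (mine, not from the pull request), the
PROT_EXEC-only special case described above can be requested with a
plain mmap() call; a minimal sketch:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            /* PROT_EXEC without PROT_READ/PROT_WRITE: on pkeys-capable
             * kernels and CPUs this is backed by an execute-only
             * protection key, so even reads of the mapping fault. */
            void *p = mmap(NULL, 4096, PROT_EXEC,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* Dereferencing p here would SIGSEGV with pkeys in effect;
             * without pkeys, PROT_EXEC implies PROT_READ. */
            munmap(p, 4096);
            return 0;
    }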
18 Mar, 2016
3 commits
-
The primary use case for devm_memremap_pages() is to allocate a memmap
array from persistent memory. That capability requires vmem_altmap,
which requires SPARSEMEM_VMEMMAP.

Also, without SPARSEMEM_VMEMMAP the addition of ZONE_DEVICE expands
ZONES_WIDTH and triggers the

"Unfortunate NUMA and NUMA Balancing config, growing page-frame for
last_cpupid."

...warning in mm/memory.c. SPARSEMEM_VMEMMAP=n && ZONE_DEVICE=y is not
a configuration we should worry about supporting.

Signed-off-by: Dan Williams
Reported-by: Vlastimil Babka
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
mm zones that are bumping up against the current maximum limit of 4
zones, i.e. 2 bits in page->flags for the GFP_ZONE_TABLE.

The GFP_ZONE_TABLE poses an interesting constraint since
include/linux/gfp.h gets included by the 32-bit portion of a 64-bit
build. We need to be careful to only build the table for zones that
have a corresponding gfp_t flag. GFP_ZONES_SHIFT is introduced for this
purpose. This patch does not attempt to solve the problem of adding a
new zone that also has a corresponding GFP_ flag.

Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and
SPARSEMEM_VMEMMAP, implies that SECTIONS_WIDTH is zero. In other words,
even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE it is free to
consume another bit in page->flags (expand ZONES_WIDTH) with room to
spare.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
Signed-off-by: Dan Williams
Reported-by: Mark
Reported-by: Vlastimil Babka
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Joonsoo Kim
Cc: Dave Hansen
Cc: Sudip Mukherjee
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
MEMORY_HOTPLUG already depends on ARCH_ENABLE_MEMORY_HOTPLUG, which is
selected by the supported architectures, so the following arch
dependency is unnecessary.

Signed-off-by: Yang Shi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
19 Feb, 2016
1 commit
-
The syscall-level code is passed a protection key and needs to
return an appropriate error code if the protection key is bogus.
We will be using this in subsequent patches.

Note that this also begins a series of arch-specific calls that
we need to expose in otherwise arch-independent code. We create
a linux/pkeys.h header where we will put *all* the stubs for
these functions.

Signed-off-by: Dave Hansen
Reviewed-by: Thomas Gleixner
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Dave Hansen
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210232.774EEAAB@viggo.jf.intel.com
Signed-off-by: Ingo Molnar
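
A hedged sketch of the stub arrangement this describes; the guard
symbol and stub bodies below are illustrative, not a copy of the
upstream header:

    /* linux/pkeys.h, sketched: arch-independent code calls these, and
     * architectures without protection keys get trivial fallbacks. */
    #ifdef CONFIG_ARCH_HAS_PKEYS
    #include <asm/pkeys.h>
    #else /* !CONFIG_ARCH_HAS_PKEYS */

    #define arch_max_pkey() (1)

    static inline bool arch_validate_pkey(int pkey)
    {
            /* Without hardware keys only the default key 0 is valid. */
            return pkey == 0;
    }

    #endif /* !CONFIG_ARCH_HAS_PKEYS */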
18 Feb, 2016
1 commit
-
vma->vm_flags is an 'unsigned long', so has space for 32 flags
on 32-bit architectures. The high 32 bits are unused on 64-bit
platforms. We've steered away from using the unused high VMA
bits for things because we would have difficulty supporting it
on 32-bit.

Protection Keys are not available in 32-bit mode, so there is
no concern about supporting this feature in 32-bit mode or on
32-bit CPUs.

This patch carves out 4 bits from the high half of
vma->vm_flags and allows architectures to set a config option
to make them available.

Sparse complains about these constants unless we explicitly
call them "UL".

Signed-off-by: Dave Hansen
Reviewed-by: Thomas Gleixner
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Dan Williams
Cc: Dave Hansen
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Jan Kara
Cc: Kirill A. Shutemov
Cc: Konstantin Khlebnikov
Cc: Linus Torvalds
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Sasha Levin
Cc: Valentin Rothberg
Cc: Vladimir Davydov
Cc: Vlastimil Babka
Cc: Xie XiuQi
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210208.81AF00D5@viggo.jf.intel.com
Signed-off-by: Ingo Molnar
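
A hedged sketch of the carve-out (bit numbers per the description
above; the exact macro spelling upstream may differ):

    #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
    /* Bits 32..35 of vma->vm_flags exist only on 64-bit. The explicit
     * UL suffix is what keeps sparse quiet about the wide shift. */
    #define VM_HIGH_ARCH_BIT_0      32
    #define VM_HIGH_ARCH_BIT_1      33
    #define VM_HIGH_ARCH_BIT_2      34
    #define VM_HIGH_ARCH_BIT_3      35
    #define VM_HIGH_ARCH_0          (1UL << VM_HIGH_ARCH_BIT_0)
    #define VM_HIGH_ARCH_1          (1UL << VM_HIGH_ARCH_BIT_1)
    #define VM_HIGH_ARCH_2          (1UL << VM_HIGH_ARCH_BIT_2)
    #define VM_HIGH_ARCH_3          (1UL << VM_HIGH_ARCH_BIT_3)
    #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */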
06 Feb, 2016
1 commit
-
The description mentions kswapd threads, while the deferred struct page
initialization is actually done by one-off "pgdatinitX" threads.

Fix the description so that users are not confused about pgdatinit
threads using CPU after boot instead of kswapd.

Signed-off-by: Vlastimil Babka
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Nov, 2015
1 commit
-
Hugh has pointed out that the compound_head() call can be unsafe in
some contexts. Here's one example:

CPU0                                   CPU1

isolate_migratepages_block()
  page_count()
    compound_head()
      !!PageTail() == true
                                       put_page()
                                         tail->first_page = NULL
      head = tail->first_page
                                       alloc_pages(__GFP_COMP)
                                         prep_compound_page()
                                           tail->first_page = head
                                           __SetPageTail(p);
      !!PageTail() == true

The race is purely theoretical. I don't think it's possible to trigger
it in practice. But who knows.

We can fix the race by changing how PageTail() and compound_head() are
encoded within struct page, so that we can update them in one shot.

The patch introduces page->compound_head into the third double word
block in front of compound_dtor and compound_order. Bit 0 encodes
PageTail() and the rest of the bits are a pointer to the head page if
bit zero is set.

The patch moves page->pmd_huge_pte out of the word, just in case an
architecture defines pgtable_t as something that can have bit 0 set.

hugetlb_cgroup uses page->lru.next in the second tail page to store a
pointer to struct hugetlb_cgroup. The patch switches it to use
page->private in the second tail page instead. The space is free since
->first_page is removed from the union.

The patch also opens the possibility to remove the
HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the
first tail page to store a struct hugetlb_cgroup pointer. But that's
out of scope of this patch.

That means page->compound_head shares storage space with:

 - page->lru.next;
 - page->next;
 - page->rcu_head.next;

That's too long a list to be absolutely sure, but it looks like nobody
uses bit 0 of the word.

page->rcu_head.next is guaranteed [1] to have bit 0 clear as long as we
use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
future call_rcu_lazy() is not allowed, as it makes use of the bit and
we could get a false positive PageTail().

[1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com
Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Reviewed-by: Andrea Arcangeli
Cc: Hugh Dickins
Cc: David Rientjes
Cc: Vlastimil Babka
Acked-by: Paul E. McKenney
Cc: Aneesh Kumar K.V
Cc: Andi Kleen
Cc: Christoph Lameter
Cc: Joonsoo Kim
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
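
The encoding is compact enough to sketch. This mirrors the description
above (bit 0 = tail marker, remaining bits = head pointer); the my_
prefixes mark it as an illustration rather than the exact upstream
helpers:

    #include <linux/compiler.h>
    #include <linux/mm_types.h>

    static inline struct page *my_compound_head(struct page *page)
    {
            unsigned long head = READ_ONCE(page->compound_head);

            if (head & 1)                           /* tail page? */
                    return (struct page *)(head - 1);
            return page;
    }

    static inline void my_set_compound_head(struct page *tail,
                                            struct page *head)
    {
            /* A single word store updates "am I a tail?" and "who is
             * my head?" in one shot, which closes the race above. */
            WRITE_ONCE(tail->compound_head, (unsigned long)head + 1);
    }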
12 Sep, 2015
1 commit
-
Pull media updates from Mauro Carvalho Chehab:
"A series of patches that move part of the code used to allocate memory
from the media subsystem to the mm subsystem"

[ The mm parts have been acked by VM people, and the series was
  apparently in -mm for a while - Linus ]

* tag 'media/v4.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] drm/exynos: Convert g2d_userptr_get_dma_addr() to use get_vaddr_frames()
[media] media: vb2: Remove unused functions
[media] media: vb2: Convert vb2_dc_get_userptr() to use frame vector
[media] media: vb2: Convert vb2_vmalloc_get_userptr() to use frame vector
[media] media: vb2: Convert vb2_dma_sg_get_userptr() to use frame vector
[media] vb2: Provide helpers for mapping virtual addresses
[media] media: omap_vout: Convert omap_vout_uservirt_to_phys() to use get_vaddr_pfns()
[media] mm: Provide new get_vaddr_frames() helper
[media] vb2: Push mmap_sem down to memops
11 Sep, 2015
1 commit
-
Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the
system efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory
provided by the kernel is /proc/PID/{clear_refs,smaps}: the user can
clear the access bit for all pages mapped to a particular process by
writing 1 to clear_refs, wait for some time, and then count
smaps:Referenced. However, this method has two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting the bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the
page, and it is cleared whenever the page is accessed either through
page tables (it is cleared in page_referenced() in this case) or using
the read(2) system call (mark_page_accessed()). Thus by setting the
Idle flag for pages of a particular workload, which can be found e.g.
by reading /proc/PID/pagemap, waiting for some time to let the workload
access its working set, and then reading the bitmap file, one can
estimate the amount of pages that are not used by the workload.

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap
file. If page_referenced() is called on a Young page, it will add 1 to
its return value, therefore concealing the fact that the Access bit was
cleared.

Note, since there is no room for extra page flags on 32 bit, this
feature uses extended page flags when compiled on 32 bit.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov
Reviewed-by: Andres Lagar-Cavilla
Cc: Minchan Kim
Cc: Raghavendra K T
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: David Rientjes
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
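
A hedged userspace sketch of that workflow: each 64-bit word of the
bitmap covers 64 pages, indexed by PFN (the PFN below is made up;
error handling trimmed):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            uint64_t pfn  = 123456;          /* from /proc/PID/pagemap */
            off_t    off  = (off_t)(pfn / 64) * 8;
            uint64_t mask = 1ULL << (pfn % 64);
            uint64_t word = mask;
            int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);

            if (fd < 0)
                    return 1;

            pwrite(fd, &word, sizeof(word), off);  /* mark the page Idle */
            /* ... let the workload run for a while ... */
            pread(fd, &word, sizeof(word), off);   /* re-read Idle bits */
            printf("page %s idle\n", (word & mask) ? "still" : "no longer");
            close(fd);
            return 0;
    }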
09 Sep, 2015
1 commit
-
Pull libnvdimm updates from Dan Williams:
"This update has successfully completed a 0day-kbuild run and has
appeared in a linux-next release. The changes outside of the typical
drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
the introduction of ZONE_DEVICE + devm_memremap_pages().

Summary:

 - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
   mechanism for adding device-driver-discovered memory regions to the
   kernel's direct map.

   This facility is used by the pmem driver to enable pfn_to_page()
   operations on the page frames returned by DAX ('direct_access' in
   'struct block_device_operations').

   For now, the 'memmap' allocation for these "device" pages comes
   from "System RAM". Support for allocating the memmap from device
   memory will arrive in a later kernel.

 - Introduce memremap() to replace usages of ioremap_cache() and
   ioremap_wt(). memremap() drops the __iomem annotation for these
   mappings to memory that do not have i/o side effects. The
   replacement of ioremap_cache() with memremap() is limited to the
   pmem driver to ease merging the api change in v4.3.

   Completion of the conversion is targeted for v4.4.

 - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
   driver, update the VFS DAX implementation and PMEM api to provide
   persistence guarantees for kernel operations on a DAX mapping.

 - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
   cacheable to improve performance.

 - Miscellaneous updates and fixes to libnvdimm including support for
   issuing "address range scrub" commands, clarifying the optimal
   'sector size' of pmem devices, a clarification of the usage of the
   ACPI '_STA' (status) property for DIMM devices, and other minor
   fixes"

* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
libnvdimm, pmem: direct map legacy pmem by default
libnvdimm, pmem: 'struct page' for pmem
libnvdimm, pfn: 'struct page' provider infrastructure
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
add devm_memremap_pages
mm: ZONE_DEVICE for "device memory"
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
dax: drop size parameter to ->direct_access()
nd_blk: change aperture mapping from WC to WB
nvdimm: change to use generic kvfree()
pmem, dax: have direct_access use __pmem annotation
dax: update I/O path to do proper PMEM flushing
pmem: add copy_from_iter_pmem() and clear_pmem()
pmem, x86: clean up conditional pmem includes
pmem: remove layer when calling arch_has_wmb_pmem()
pmem, x86: move x86 PMEM API to new pmem.h header
libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
pmem: switch to devm_ allocations
devres: add devm_memremap
libnvdimm, btt: write and validate parent_uuid
...
28 Aug, 2015
1 commit
-
While pmem is usable as a block device or via DAX mappings to userspace
there are several usage scenarios that cannot target pmem due to its
lack of struct page coverage. In preparation for "hot plugging" pmem
into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly "device memory" can be removed at will by userspace
unbinding the driver of the device.

Having a separate zone prevents allocation and otherwise marks these
pages as distinct from typical uniform memory. Device memory has
different lifetime and performance characteristics than RAM. However,
since we have run out of ZONES_SHIFT bits this functionality currently
depends on sacrificing ZONE_DMA.

Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Dave Hansen
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Jerome Glisse
[hch: various simplifications in the arch interface]
Signed-off-by: Christoph Hellwig
Signed-off-by: Dan Williams
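
A hedged driver-side sketch of the new facility (the 4.3-era call took
just a device and a resource; the header location and error handling
here are assumptions):

    #include <linux/device.h>
    #include <linux/err.h>
    #include <linux/io.h>    /* devm_memremap_pages() home in v4.3 (assumed) */

    /* Sketch: give a device memory range struct page coverage. */
    static int pmem_map_example(struct device *dev, struct resource *res)
    {
            void *addr = devm_memremap_pages(dev, res);

            if (IS_ERR(addr))
                    return PTR_ERR(addr);

            /* The range now sits in ZONE_DEVICE, so pfn_to_page() and
             * friends work on it; cleanup is devres-managed. */
            return 0;
    }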
17 Aug, 2015
1 commit
-
Provide a new function get_vaddr_frames(). This function maps virtual
addresses from a given start and fills the given array with page frame
numbers of the corresponding pages. If the given start belongs to a
normal vma, the function grabs a reference to each of the pages to pin
them in memory. If start belongs to a VM_IO | VM_PFNMAP vma, we don't
touch page structures. The caller must make sure pfns aren't reused for
anything else while he is using them.

This function is created for various drivers to simplify handling of
their buffers.

Signed-off-by: Jan Kara
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Acked-by: Andrew Morton
Signed-off-by: Hans Verkuil
Signed-off-by: Mauro Carvalho Chehab
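
A hedged sketch of the intended driver-side usage, based on the
frame_vector helpers this series adds (4.3-era signature assumed:
get_vaddr_frames() taking write/force bools):

    #include <linux/mm.h>
    #include <linux/printk.h>

    /* Sketch: pin nr_frames of a user buffer starting at 'start'. */
    static int pin_user_buffer(unsigned long start, unsigned int nr_frames)
    {
            struct frame_vector *vec;
            int ret;

            vec = frame_vector_create(nr_frames);
            if (!vec)
                    return -ENOMEM;

            /* write=true, force=false: the device will write the buffer */
            ret = get_vaddr_frames(start, nr_frames, true, false, vec);
            if (ret > 0) {
                    /* NULL for VM_IO/VM_PFNMAP backed buffers, where only
                     * the raw pfns (frame_vector_pfns()) may be used. */
                    if (frame_vector_to_pages(vec))
                            pr_debug("backed by regular pages\n");

                    /* ... hand frame_vector_pfns(vec) to the device ... */

                    put_vaddr_frames(vec);      /* unpin when done */
                    ret = 0;
            }

            frame_vector_destroy(vec);
            return ret < 0 ? ret : 0;
    }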
24 Jul, 2015
1 commit
-
commit 106542e7987c ("fs: Remove ext3 filesystem driver") removed ext3
and JBD, hence remove the superfluous condition.

Signed-off-by: Valentin Rothberg
Signed-off-by: Jan Kara
01 Jul, 2015
1 commit
-
This patch initialises all low memory struct pages and 2G of the
highest zone on each node during memory initialisation if
CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be
set yet but will be available in a later patch. Parallel initialisation
of struct page depends on some features from memory hotplug and it is
necessary to alter section annotations.

Signed-off-by: Mel Gorman
Tested-by: Nate Zimmer
Tested-by: Waiman Long
Tested-by: Daniel J Blueman
Acked-by: Pekka Enberg
Cc: Robin Holt
Cc: Nate Zimmer
Cc: Dave Hansen
Cc: Waiman Long
Cc: Scott Norton
Cc: "Luck, Tony"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
25 Jun, 2015
1 commit
-
RAS user space tools like rasdaemon, which are based on trace events,
could receive mce error events, but no memory recovery result event.
So, I want to add this event to make this scenario complete.

This patch adds an event to the ras group for memory-failure.

The output looks like below:

# tracer: nop
#
# entries-in-buffer/entries-written: 2/2   #P:24
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
   mce-inject-13150 [001] ....   277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed

[xiexiuqi@huawei.com: fix build error]
Signed-off-by: Xie XiuQi
Reviewed-by: Naoya Horiguchi
Acked-by: Steven Rostedt
Cc: Tony Luck
Cc: Chen Gong
Cc: Jim Davis
Signed-off-by: Xie XiuQi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Apr, 2015
1 commit
-
I've noticed that there are no interfaces exposed by CMA which would
let me fuzz what's going on in there.

This small patchset exposes some information out to userspace, plus
adds the ability to trigger allocation and freeing from userspace.

This patch (of 3):

Implement a simple debugfs interface to expose information about CMA
areas in the system.

Useful for testing/sanity checks for CMA since it was impossible to
previously retrieve this information in userspace.

Signed-off-by: Sasha Levin
Acked-by: Joonsoo Kim
Cc: Marek Szyprowski
Cc: Laura Abbott
Cc: Konrad Rzeszutek Wilk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Feb, 2015
1 commit
-
Pull kconfig updates from Michal Marek:
"Yann E Morin was supposed to take over kconfig maintainership, but
this hasn't happened. So I'm sending a few kconfig patches that I
collected:

 - Fix for missing va_end in kconfig
 - merge_config.sh displays usage if given too few arguments
 - s/boolean/bool/ in Kconfig files for consistency, with the plan to
   only support bool in the future"

* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
kconfig: use va_end to match corresponding va_start
merge_config.sh: Display usage if given too few arguments
kconfig: use bool instead of boolean for type definition attributes
13 Feb, 2015
1 commit
-
Keeping the fragmentation of zsmalloc at a low level is our target.
But now we still need to add debug code in zsmalloc to get the
quantitative data.

This patch adds a new configuration CONFIG_ZSMALLOC_STAT to enable the
statistics collection for developers. Currently only the object
statistics in each class are collected. Users can get the information
via debugfs:

cat /sys/kernel/debug/zsmalloc/zram0/...

For example:

After I copied "jdk-8u25-linux-x64.tar.gz" to zram with an ext4
filesystem:
class size obj_allocated obj_used pages_used
0 32 0 0 0
1 48 256 12 3
2 64 64 14 1
3 80 51 7 1
4 96 128 5 3
5 112 73 5 2
6 128 32 4 1
7 144 0 0 0
8 160 0 0 0
9 176 0 0 0
10 192 0 0 0
11 208 0 0 0
12 224 0 0 0
13 240 0 0 0
14 256 16 1 1
15 272 15 9 1
16 288 0 0 0
17 304 0 0 0
18 320 0 0 0
19 336 0 0 0
20 352 0 0 0
21 368 0 0 0
22 384 0 0 0
23 400 0 0 0
24 416 0 0 0
25 432 0 0 0
26 448 0 0 0
27 464 0 0 0
28 480 0 0 0
29 496 33 1 4
30 512 0 0 0
31 528 0 0 0
32 544 0 0 0
33 560 0 0 0
34 576 0 0 0
35 592 0 0 0
36 608 0 0 0
37 624 0 0 0
38 640 0 0 0
40 672 0 0 0
42 704 0 0 0
43 720 17 1 3
44 736 0 0 0
46 768 0 0 0
49 816 0 0 0
51 848 0 0 0
52 864 14 1 3
54 896 0 0 0
57 944 13 1 3
58 960 0 0 0
62 1024 4 1 1
66 1088 15 2 4
67 1104 0 0 0
71 1168 0 0 0
74 1216 0 0 0
76 1248 0 0 0
83 1360 3 1 1
91 1488 11 1 4
94 1536 0 0 0
100 1632 5 1 2
107 1744 0 0 0
111 1808 9 1 4
126 2048 4 4 2
144 2336 7 3 4
151 2448 0 0 0
168 2720 15 15 10
190 3072 28 27 21
202 3264 0 0 0
254 4096 36209 36209 36209

Total 37022 36326 36288

We can calculate the overall fragmentation from the last line:

Total 37022 36326 36288

(37022 - 36326) / 37022 = 1.88%

Also, by analysing the objects allocated in every class we know why we
got such low fragmentation: most of the allocated objects are in class
254. And there is only 1 page in a class 254 zspage. So no
fragmentation will be introduced by allocating objs in class 254.

And in the future, we can collect other zsmalloc statistics as we need
and analyse them.

Signed-off-by: Ganesh Mahendran
Suggested-by: Minchan Kim
Acked-by: Minchan Kim
Cc: Nitin Gupta
Cc: Seth Jennings
Cc: Dan Streetman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Jan, 2015
2 commits
-
Support for keyword 'boolean' will be dropped later on.
No functional change.
Reference: http://lkml.kernel.org/r/cover.1418003065.git.cj@linux.com
Signed-off-by: Christoph Jaeger
Signed-off-by: Michal Marek
-
SRCU does not need to be compiled in by default in all cases. For
tinification efforts, not compiling SRCU unless necessary is desirable.

The current patch tries to make compiling SRCU optional by introducing
a new Kconfig option CONFIG_SRCU which is selected when any of the
components making use of SRCU are selected.

If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.

   text    data     bss     dec     hex filename
   2007       0       0    2007     7d7 kernel/rcu/srcu.o

Size of arch/powerpc/boot/zImage changes from

   text    data     bss     dec     hex filename
 831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
 829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after

so the savings are about ~2000 bytes.
Signed-off-by: Pranith Kumar
CC: Paul E. McKenney
CC: Josh Triplett
CC: Lai Jiangshan
Signed-off-by: Paul E. McKenney
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
10 Oct, 2014
2 commits
-
Always mark pages with PageBalloon even if balloon compaction is
disabled and expose this mark in /proc/kpageflags as KPF_BALLOON.

Also this patch adds three counters into /proc/vmstat:
"balloon_inflate", "balloon_deflate" and "balloon_migrate". They
accumulate balloon activity. The current size of the balloon is
(balloon_inflate - balloon_deflate) pages.

All generic balloon code is now gathered under the option
CONFIG_MEMORY_BALLOON. It should be selected by a ballooning driver
which wants to use this feature. Currently virtio-balloon is the only
user.

Signed-off-by: Konstantin Khlebnikov
Cc: Rafael Aquini
Cc: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This series implements general forms of get_user_pages_fast and
__get_user_pages_fast in core code and activates them for arm and
arm64.

These are required for Transparent HugePages to function correctly, as
a futex on a THP tail will otherwise result in an infinite loop (due to
the core implementation of __get_user_pages_fast always returning 0).

Unfortunately, a futex on a THP tail can be quite common for certain
workloads; thus THP is unreliable without a __get_user_pages_fast
implementation.

This series may also be beneficial for direct-IO heavy workloads and
certain KVM workloads.

This patch (of 6):

get_user_pages_fast() attempts to pin user pages by walking the page
tables directly and avoids taking locks. Thus the walker needs to be
protected from page table pages being freed from under it, and needs to
block any THP splits.

One way to achieve this is to have the walker disable interrupts, and
rely on IPIs from the TLB flushing code blocking before the page table
pages are freed.

On some platforms we have hardware broadcast of TLB invalidations, thus
the TLB flushing code doesn't necessarily need to broadcast IPIs; and
spuriously broadcasting IPIs can hurt system performance if done too
often.

This problem has been solved on PowerPC and Sparc by batching up page
table pages belonging to more than one mm_user, then scheduling an
rcu_sched callback to free the pages. This RCU page table free logic
has been promoted to core code and is activated when one enables
HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement their
own get_user_pages_fast routines.

The RCU page table free logic coupled with an IPI broadcast on THP
split (which is a rare event) allows one to protect a page table walker
by merely disabling the interrupts during the walk.

This patch provides a general RCU implementation of get_user_pages_fast
that can be used by architectures that perform hardware broadcast of
TLB invalidations.

It is based heavily on the PowerPC implementation by Nick Piggin.
[akpm@linux-foundation.org: various comment fixes]
Signed-off-by: Steve Capper
Tested-by: Dann Frazier
Reviewed-by: Catalin Marinas
Acked-by: Hugh Dickins
Cc: Russell King
Cc: Mark Rutland
Cc: Mel Gorman
Cc: Will Deacon
Cc: Christoffer Dall
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
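
For context, a hedged caller-side sketch (signature as of this era: a
write flag rather than the later gup_flags; the wrapper name is mine):

    #include <linux/mm.h>
    #include <linux/printk.h>

    /* Sketch: pin user pages without taking mmap_sem. */
    static int pin_pages_fast(unsigned long uaddr, int nr_pages,
                              struct page **pages)
    {
            /* write=1: we intend to dirty the pages. May pin fewer
             * pages than asked (even 0); callers fall back to the slow
             * get_user_pages() path for the remainder. */
            int pinned = get_user_pages_fast(uaddr, nr_pages, 1, pages);

            if (pinned < nr_pages)
                    pr_debug("fast path pinned %d of %d pages\n",
                             pinned, nr_pages);
            return pinned;
    }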
07 Aug, 2014
3 commits
-
Change zswap to use the zpool api instead of directly using zbud. Add
a boot-time param to allow selecting which zpool implementation to use,
with zbud as the default.

Signed-off-by: Dan Streetman
Tested-by: Seth Jennings
Cc: Weijie Yang
Cc: Minchan Kim
Cc: Nitin Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add zpool api.

zpool provides an interface for memory storage, typically of compressed
memory. Users can select what backend to use; currently the only
implementations are zbud, a low density implementation with up to two
compressed pages per storage page, and zsmalloc, a higher density
implementation with multiple compressed pages per storage page.

Signed-off-by: Dan Streetman
Tested-by: Seth Jennings
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Weijie Yang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently, there are two users of CMA functionality: one is the DMA
subsystem and the other is KVM on powerpc. They have their own code to
manage the CMA reserved area even though they look really similar.
From my guess, it is caused by some needs on bitmap management. The
KVM side wants to maintain a bitmap not for 1 page, but for a larger
size. Eventually it uses a bitmap where one bit represents 64 pages.

When I implement CMA related patches, I should change those two places
to apply my change, and it seems painful to me. I want to change this
situation and reduce future code management overhead through this
patch.

This change could also help developers who want to use CMA in their
new feature development, since they can use CMA easily without copying
& pasting this reserved area management code.

In previous patches, we have prepared some features to generalize CMA
reserved area management and now it's time to do it. This patch moves
the core functions to mm/cma.c and changes the DMA APIs to use these
functions.

There is no functional change in the DMA APIs.
Signed-off-by: Joonsoo Kim
Acked-by: Michal Nazarewicz
Acked-by: Zhang Yanfei
Acked-by: Minchan Kim
Reviewed-by: Aneesh Kumar K.V
Cc: Alexander Graf
Cc: Aneesh Kumar K.V
Cc: Gleb Natapov
Acked-by: Marek Szyprowski
Tested-by: Marek Szyprowski
Cc: Paolo Bonzini
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
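
A hedged sketch of the generalized interface this creates
(era-appropriate signatures assumed; the 16 MiB reservation and all
names here are made up):

    #include <linux/cma.h>
    #include <linux/init.h>
    #include <linux/sizes.h>

    static struct cma *my_cma;

    /* Early boot: reserve 16 MiB; base/limit of 0 let CMA pick a range,
     * order_per_bit 0 means one bitmap bit per page. */
    static int __init my_cma_reserve(void)
    {
            return cma_declare_contiguous(0, SZ_16M, 0, 0, 0, false,
                                          &my_cma);
    }

    /* Driver runtime: allocate and free contiguous pages. */
    static struct page *my_cma_get_pages(int count)
    {
            struct page *pages = cma_alloc(my_cma, count, 0);

            if (pages)
                    cma_release(my_cma, pages, count);  /* demo only */
            return pages;
    }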
05 Jun, 2014
3 commits
-
Now, we can build zsmalloc as a module because unmap_kernel_range was
exported.

Signed-off-by: Minchan Kim
Cc: Nitin Gupta
Cc: Sergey Senozhatsky
Cc: Jerome Marchand
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
CONFIG_CROSS_MEMORY_ATTACH adds a couple of syscalls, process_vm_readv
and process_vm_writev, a kind of IPC for copying data between
processes. Currently this option is placed inside "Processor type and
features".

This patch moves it into "General setup" (where all other
arch-independent syscalls and ipc features are placed) and changes the
prompt string to something less cryptic.

Signed-off-by: Konstantin Khlebnikov
Cc: Christopher Yeoh
Cc: Davidlohr Bueso
Cc: Hugh Dickins
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently hugepage migration is available for all archs which support
pmd-level hugepages, but testing is done only for x86_64 and there are
bugs for other archs. So to avoid breaking such archs, this patch
limits the availability strictly to x86_64 until developers of other
archs get interested in enabling this feature.

Simply disabling hugepage migration on non-x86_64 archs is not enough
to fix the reported problem where sys_move_pages() hits the BUG_ON() in
follow_page(FOLL_GET), so let's fix this by checking whether hugepage
migration is supported in vma_migratable().

Signed-off-by: Naoya Horiguchi
Reported-by: Michael Ellerman
Tested-by: Michael Ellerman
Acked-by: Hugh Dickins
Cc: Benjamin Herrenschmidt
Cc: Tony Luck
Cc: Russell King
Cc: Martin Schwidefsky
Cc: James Hogan
Cc: Ralf Baechle
Cc: David Miller
Cc: [3.12+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
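
A hedged sketch of the vma_migratable() check the last paragraph
describes (the arch opt-in symbol follows the pattern this series
introduces; the body and my_ prefix are illustrative):

    #include <linux/mm.h>

    /* Sketch: refuse migration of hugetlb vmas unless the architecture
     * has opted in to hugepage migration (x86_64 only, per this patch). */
    static inline bool my_vma_migratable(struct vm_area_struct *vma)
    {
            if (vma->vm_flags & (VM_IO | VM_PFNMAP))
                    return false;

    #ifndef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
            if (vma->vm_flags & VM_HUGETLB)
                    return false;
    #endif
            return true;
    }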