05 Jun, 2014

40 commits

  • kmem_cache_{create,destroy,shrink} need to get a stable value of the
    cpu/node online mask, because they init/destroy/access per-cpu/per-node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of a race, as described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves this purpose, but
    it's a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some undesirable limitations on locking order and
    can't be used just like get_online_cpus. That's why in patch 1 I
    substitute it with get/put_online_mems, which work exactly like
    get/put_online_cpus except that they block memory hotplug rather than
    cpu hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
    myself, because it used an rw semaphore for get/put_online_mems,
    making them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of the online
    nodes mask won't be able to proceed concurrently. Also, it imposes some
    strong locking ordering rules, which narrows down the set of its usage
    scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus but for memory hotplug, i.e. executing code inside
    a get/put_online_mems section will guarantee a stable value of online
    nodes, present pages, etc.

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
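
    A minimal usage sketch of the new interface described above (the header
    placement and the helper function are illustrative assumptions, not code
    from the patch):

      #include <linux/memory_hotplug.h>   /* assumed home of get/put_online_mems() */
      #include <linux/nodemask.h>

      static void walk_stable_online_nodes(void)
      {
              int nid;

              get_online_mems();      /* blocks memory hotplug, like get_online_cpus() */
              for_each_online_node(nid) {
                      /* the online nodes mask and per-node data are stable here */
              }
              put_online_mems();      /* re-enables memory hotplug */
      }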

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It is only used in slab and should not be used anywhere else, so there
    is no need to export it.

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • pgdat->reclaim_nodes tracks whether a remote node is allowed to be
    reclaimed by zone_reclaim due to its distance. As zone_reclaim_mode is
    expected to be rarely enabled, it is unreasonable for all machines to
    take the penalty. Fortunately, the zone_reclaim_mode() path is already
    slow and it is the path that takes the hit.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Zhang Yanfei
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When it was introduced, zone_reclaim_mode made sense because NUMA
    distances were punishing and workloads were generally partitioned to fit
    into a NUMA node. NUMA machines are now common, but few workloads are
    NUMA-aware, and it's routine to see major performance degradation due to
    zone_reclaim_mode being enabled, yet relatively few people can identify
    the problem.

    Those that require zone_reclaim_mode are likely to be able to detect
    when it needs to be enabled and tune appropriately, so let's have a
    sensible default for the bulk of users.

    This patch (of 2):

    zone_reclaim_mode causes processes to prefer reclaiming memory from the
    local node instead of spilling over to other nodes. This made sense
    initially when NUMA machines were almost exclusively HPC and the
    workload was partitioned into nodes. The NUMA penalties were
    sufficiently high to justify reclaiming the memory. On current machines
    and workloads it is often the case that zone_reclaim_mode destroys
    performance but not all users know how to detect this. Favour the
    common case and disable it by default. Users that are sophisticated
    enough to know they need zone_reclaim_mode will detect it.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Zhang Yanfei
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • HugeTLB is limited to allocating hugepages whose size is less than
    MAX_ORDER order. This is because HugeTLB allocates hugepages via the
    buddy allocator. Gigantic pages (that is, pages whose size is greater
    than MAX_ORDER order) have to be allocated at boottime.

    However, boottime allocation has at least two serious problems. First,
    it doesn't support NUMA and second, gigantic pages allocated at boottime
    can't be freed.

    This commit solves both issues by adding support for allocating gigantic
    pages during runtime. It works just like regular sized hugepages,
    meaning that the interface in sysfs is the same, it supports NUMA, and
    gigantic pages can be freed.

    For example, on x86_64 gigantic pages are 1GB in size. To allocate two
    1G gigantic pages on node 1, one can do:

    # echo 2 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    And to free them all:

    # echo 0 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    The one problem with gigantic page allocation at runtime is that it
    can't be serviced by the buddy allocator. To overcome that problem,
    this commit scans all zones of a node looking for a large enough
    contiguous region. When one is found, it's allocated using CMA, that
    is, we call alloc_contig_range() to do the actual allocation. For
    example, on x86_64 we scan all zones looking for a 1GB contiguous
    region. When one is found, it's allocated by alloc_contig_range().
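
    A rough sketch of that allocation path (the function and variable names
    are illustrative, not the exact ones added by the commit):

      /* Illustrative only: walk a zone in gigantic-page-sized steps and try
       * to grab one step as a single contiguous block via CMA's allocator. */
      static struct page *try_alloc_gigantic(struct zone *z, unsigned int order)
      {
              unsigned long nr_pages = 1UL << order;
              unsigned long pfn;

              for (pfn = ALIGN(z->zone_start_pfn, nr_pages);
                   pfn + nr_pages <= zone_end_pfn(z); pfn += nr_pages) {
                      /* alloc_contig_range() migrates/reclaims pages in
                       * [pfn, pfn + nr_pages) and returns 0 on success. */
                      if (!alloc_contig_range(pfn, pfn + nr_pages,
                                              MIGRATE_MOVABLE))
                              return pfn_to_page(pfn);
              }
              return NULL;
      }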

    One expected issue with that approach is that such gigantic contiguous
    regions tend to vanish as the system runs. The best way to avoid this
    for now is to make gigantic page allocations very early during system
    boot, say from an init script. Other possible optimizations include
    using compaction, which is supported by CMA but is not explicitly used
    by this commit.

    It's also important to note the following:

    1. Gigantic pages allocated at boottime by the hugepages= command-line
    option can be freed at runtime just fine

    2. This commit adds support for gigantic pages only to x86_64. The
    reason is that I don't have access to nor experience with other archs.
    The code is arch-independent though, so it should be simple to add
    support for different archs

    3. I didn't add support for hugepage overcommit, that is, allocating
    a gigantic page on demand when
    /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
    think it's reasonable to do the hard and long work required for
    allocating a gigantic page at fault time. But it should be simple
    to add this if wanted.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Luiz Capitulino
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • Next commit will add new code which will want to call
    for_each_node_mask_to_alloc() macro. Move it, its buddy
    for_each_node_mask_to_free() and their dependencies up in the file so the
    new code can use them. This is just code movement, no logic change.

    Signed-off-by: Luiz Capitulino
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Yasuaki Ishimatsu
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Rik van Riel
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • Hugepages never get the PG_reserved bit set, so don't clear it.

    However, note that if the bit gets mistakenly set, free_pages_check()
    will catch it.

    Signed-off-by: Luiz Capitulino
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • Signed-off-by: Luiz Capitulino
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Yasuaki Ishimatsu
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Rik van Riel
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • The HugeTLB subsystem uses the buddy allocator to allocate hugepages
    during runtime. This means that hugepage allocation during runtime is
    limited to MAX_ORDER order. For archs supporting gigantic pages (that
    is, page sizes greater than MAX_ORDER), this in turn means that those
    pages can't be allocated at runtime.

    HugeTLB supports gigantic page allocation during boottime, via the boot
    allocator. To this end the kernel provides the command-line options
    hugepagesz= and hugepages=, which can be used to instruct the kernel to
    allocate N gigantic pages during boot.

    For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages
    can be allocated and freed at runtime. If one wants to allocate 1G
    gigantic pages, this has to be done at boot via the hugepagesz= and
    hugepages= command-line options.

    Now, gigantic page allocation at boottime has two serious problems:

    1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
    evenly distributes boottime allocated hugepages among nodes.

    For example, suppose you have a four-node NUMA machine and want
    to allocate four 1G gigantic pages at boottime. The kernel will
    allocate one gigantic page per node.

    On the other hand, we do have users who want to be able to specify
    which NUMA node gigantic pages should be allocated from, so that they
    can place virtual machines on a specific NUMA node.

    2. Gigantic pages allocated at boottime can't be freed

    At this point it's important to observe that regular hugepages allocated
    at runtime don't have those problems. This is because the HugeTLB sysfs
    interface for runtime allocation supports NUMA, and runtime-allocated
    pages can be freed just fine via the buddy allocator.

    This series adds support for allocating gigantic pages at runtime. It
    does so by allocating gigantic pages via CMA instead of the buddy
    allocator. Releasing gigantic pages is also supported via CMA. As this
    series builds on top of the existing HugeTLB interface, it makes
    allocating and releasing gigantic pages work just like regular sized
    hugepages. This also means that NUMA support just works.

    For example, to allocate two 1G gigantic pages on node 1, one can do:

    # echo 2 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    And, to release all gigantic pages on the same node:

    # echo 0 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    Please, refer to patch 5/5 for full technical details.

    Finally, please note that this series is a follow up for a previous series
    that tried to extend the command-line options set to be NUMA aware:

    http://marc.info/?l=linux-mm&m=139593335312191&w=2

    During the discussion of that series it was agreed that having runtime
    allocation support for gigantic pages was a better solution.

    This patch (of 5):

    This function is going to be used by non-init code in a future
    commit.

    Signed-off-by: Luiz Capitulino
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: Marcelo Tosatti
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • Fix a coccinelle error regarding usage of IS_ERR and PTR_ERR instead of
    PTR_ERR_OR_ZERO.
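
    The pattern in question, as a generic sketch (the pointer name is
    illustrative):

      /* Before: open-coded IS_ERR()/PTR_ERR() pair */
      if (IS_ERR(p))
              return PTR_ERR(p);
      return 0;

      /* After: the equivalent helper from <linux/err.h> */
      return PTR_ERR_OR_ZERO(p);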

    Signed-off-by: Duan Jiong
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Duan Jiong
     
  • Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It seems we all agree that information about SECTIONs, e.g. section size
    and sections per memory block, should be kept as kernel internals and
    not exposed to userspace.

    This patch updates Documentation/memory-hotplug.txt to refer to memory
    blocks instead of memory sections where appropriate and adds a
    paragraph to explain that memory blocks are made of memory sections.
    The documentation update was mostly provided by Nathan.

    Also, end_phys_index in the code is actually not the end section id but
    the end memory block id, which should always be the same as phys_index,
    so it is removed here.

    Signed-off-by: Li Zhong
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhong
     
  • I recently added a patch to let folks pass a "reason" string to
    dump_page(), which gets dumped out along with the page's data. This
    essentially saves the bug reader a trip into the source to figure out
    why we BUG_ON()'d.

    The new VM_BUG_ON_PAGE() passes in NULL for "reason". It seems like we
    might as well pass the BUG_ON() condition if we have it. This will
    bloat kernels a bit with ~160 new strings, but this is all under a
    debugging option anyway.

    page:ffffea0008560280 count:1 mapcount:0 mapping:(null) index:0x0
    page flags: 0xbfffc0000000001(locked)
    page dumped because: VM_BUG_ON_PAGE(PageLocked(page))
    ------------[ cut here ]------------
    kernel BUG at /home/davehans/linux.git/mm/filemap.c:464!
    invalid opcode: 0000 [#1] SMP
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.14.0+ #251
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    ...
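
    An approximate sketch of the resulting macro (the real definition lives
    in include/linux/mmdebug.h; treat the exact shape as an assumption):

      #include <linux/stringify.h>

      /* The stringified condition becomes the "reason" handed to dump_page(). */
      #define VM_BUG_ON_PAGE(cond, page)                                      \
              do {                                                            \
                      if (unlikely(cond)) {                                   \
                              dump_page(page,                                 \
                                        "VM_BUG_ON_PAGE(" __stringify(cond) ")"); \
                              BUG();                                          \
                      }                                                       \
              } while (0)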

    [akpm@linux-foundation.org: include stringify.h]
    Signed-off-by: Dave Hansen
    Acked-by: Kirill A. Shutemov
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Per-memcg swappiness and oom killing can currently not be tweaked on a
    memcg that is part of a hierarchy, but not the root of that hierarchy.
    Users have complained that they can't configure this when they turned on
    hierarchy mode. In fact, with hierarchy mode becoming the default, this
    restriction disables the tunables entirely.

    But there is no good reason for this restriction. The settings for
    swappiness and OOM killing are taken from whichever memcg's limit
    triggered reclaim and OOM invocation, regardless of its position in the
    hierarchy tree.

    Allow setting swappiness on any group. The knob on the root memcg
    already reads the global VM swappiness; make it writable as well.

    Allow disabling the OOM killer on any non-root memcg.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory obtained via mempool_alloc is not always zeroed even when
    called with __GFP_ZERO. Add a note and VM_BUG_ON statement to make
    that clear.
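
    A sketch of the resulting check at the top of mempool_alloc() (the note
    below switches it from VM_BUG_ON to VM_WARN_ON_ONCE):

      /* Elements recycled from the pool's reserve bypass the underlying
       * allocator, so __GFP_ZERO cannot be honoured; warn callers who ask. */
      VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);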

    [akpm@linux-foundation.org: use VM_WARN_ON_ONCE]
    Signed-off-by: Sebastian Ott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Ott
     
  • Add VM_WARN_ON() and VM_WARN_ON_ONCE(): variants of WARN_ON() and
    WARN_ON_ONCE() that are dependent on CONFIG_DEBUG_VM.
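
    A sketch of the additions to include/linux/mmdebug.h (the
    !CONFIG_DEBUG_VM fallback is assumed to compile the condition away, as
    the existing VM_BUG_ON variants do):

      #ifdef CONFIG_DEBUG_VM
      #define VM_WARN_ON(cond)        WARN_ON(cond)
      #define VM_WARN_ON_ONCE(cond)   WARN_ON_ONCE(cond)
      #else
      #define VM_WARN_ON(cond)        BUILD_BUG_ON_INVALID(cond)
      #define VM_WARN_ON_ONCE(cond)   BUILD_BUG_ON_INVALID(cond)
      #endif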

    Cc: Sebastian Ott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • It was using a mix of pr_foo() and printk(KERN_ERR ...).

    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • It doesn't make sense to have two assert checks for each invariant: one
    for printing and one for BUG().

    Let's trigger BUG() if we print an error message.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • dma_generic_alloc_coherent() first attempts to allocate with
    dma_alloc_from_contiguous() if CONFIG_DMA_CMA is enabled. But the
    memory region allocated that way may not fit within the device's DMA
    mask. This change makes it fall back to the usual alloc_pages_node()
    allocation in such cases.
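
    A sketch of the added check in dma_generic_alloc_coherent() (shape
    approximate; page, count, size and dma_mask are assumed to be the
    function's existing locals):

      page = dma_alloc_from_contiguous(dev, count, get_order(size));
      if (page && page_to_phys(page) + size > dma_mask) {
              /* CMA gave us memory the device cannot address: release it
               * and fall through to the normal alloc_pages_node() path. */
              dma_release_from_contiguous(dev, page, count);
              page = NULL;
      }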

    Signed-off-by: Akinobu Mita
    Cc: Marek Szyprowski
    Cc: Konrad Rzeszutek Wilk
    Cc: David Woodhouse
    Cc: Don Dutile
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Currently, "cma=" kernel parameter is used to specify the size of CMA,
    but we can't specify where it is located. We want to locate CMA below
    4GB for devices only supporting 32-bit addressing on 64-bit systems
    without iommu.

    This enables to specify the placement of CMA by extending "cma=" kernel
    parameter.

    Examples:
    1. locate 64MB of CMA below 4GB with "cma=64M@0-4G"
    2. locate 64MB of CMA exactly at 512MB with "cma=64M@512M"

    Note that the DMA contiguous memory allocator on x86 assumes that
    page_address() works for the pages to allocate. So this change limits
    the end address of the contiguous memory area to max_pfn_mapped, via
    the argument of dma_contiguous_reserve(), to prevent it from being
    placed in the highmem area.

    Signed-off-by: Akinobu Mita
    Cc: Marek Szyprowski
    Cc: Konrad Rzeszutek Wilk
    Cc: David Woodhouse
    Cc: Don Dutile
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This introduces memblock_alloc_range(), which allocates memory from
    memblock within the specified range of physical addresses. I would like
    to use this function to specify the location of CMA.
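
    A usage sketch (the signature follows the description above; the size,
    alignment and range values are illustrative):

      /* Ask memblock for 64MB, aligned to 1MB, anywhere below 4GB;
       * a return value of 0 means no suitable range was found. */
      phys_addr_t base = memblock_alloc_range(64ULL << 20, 1UL << 20,
                                              0, 4ULL << 30);
      if (!base)
              pr_warn("no free range below 4GB\n");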

    Signed-off-by: Akinobu Mita
    Cc: Marek Szyprowski
    Cc: Konrad Rzeszutek Wilk
    Cc: David Woodhouse
    Cc: Don Dutile
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This adds support for the DMA Contiguous Memory Allocator to
    intel-iommu. This change enables dma_alloc_coherent() to allocate large
    contiguous memory.

    It is achieved in the same way nommu_dma_ops currently does it, i.e.
    memory is first allocated with dma_alloc_from_contiguous(), with
    alloc_pages() used as a fallback.

    Signed-off-by: Akinobu Mita
    Cc: Marek Szyprowski
    Cc: Konrad Rzeszutek Wilk
    Cc: David Woodhouse
    Cc: Don Dutile
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • DMA Contiguous Memory Allocator support on x86 is disabled when the
    swiotlb config option is enabled, so DMA CMA is always disabled on
    x86_64 because swiotlb is always enabled there. This attempts to
    support DMA CMA with the swiotlb config option enabled.

    The contiguous memory allocator on x86 is integrated into
    dma_generic_alloc_coherent(), which is the .alloc callback in
    nommu_dma_ops for dma_alloc_coherent().

    x86_swiotlb_alloc_coherent(), the .alloc callback in swiotlb_dma_ops,
    first tries to allocate with dma_generic_alloc_coherent() and then
    calls swiotlb_alloc_coherent() as a fallback.

    The main part of supporting DMA CMA with swiotlb is changing
    x86_swiotlb_free_coherent(), the .free callback in swiotlb_dma_ops for
    dma_free_coherent(), so that it can distinguish memory allocated by
    dma_generic_alloc_coherent() from memory allocated by
    swiotlb_alloc_coherent() and release it with
    dma_generic_free_coherent(), which can handle contiguous memory. This
    change requires making is_swiotlb_buffer() a global function.

    This also needs to change the .free callback in the dma_map_ops for
    amd_gart and sta2x11, because these dma_ops also use
    dma_generic_alloc_coherent().
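
    A sketch of the free-path dispatch described above (shape approximate;
    the 2014-era .free callback signature with struct dma_attrs is assumed):

      static void x86_swiotlb_free_coherent(struct device *dev, size_t size,
                                            void *vaddr, dma_addr_t dma_addr,
                                            struct dma_attrs *attrs)
      {
              /* swiotlb bounce-buffer memory goes back to swiotlb; anything
               * else (including CMA memory) takes the generic path. */
              if (is_swiotlb_buffer(dma_to_phys(dev, dma_addr)))
                      swiotlb_free_coherent(dev, size, vaddr, dma_addr);
              else
                      dma_generic_free_coherent(dev, size, vaddr, dma_addr,
                                                attrs);
      }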

    Signed-off-by: Akinobu Mita
    Acked-by: Marek Szyprowski
    Acked-by: Konrad Rzeszutek Wilk
    Cc: David Woodhouse
    Cc: Don Dutile
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This patchset enhances the DMA Contiguous Memory Allocator on x86.

    Currently DMA CMA is only supported with the pci-nommu dma_map_ops, and
    furthermore it can't be enabled on x86_64. But I would like to allocate
    large contiguous memory with dma_alloc_coherent() and hand it to a
    device that requires it, regardless of which dma mapping implementation
    is actually used in the system.

    So this makes it work with the swiotlb and intel-iommu dma_map_ops,
    too. It also extends the "cma=" kernel parameter to specify a placement
    constraint by the physical address range of memory allocations. For
    example, "cma=64M@0-4G" makes CMA allocate memory below 4GB, which is
    required for devices that only support 32-bit addressing on 64-bit
    systems without an iommu.

    This patch (of 5):

    Calling dma_alloc_coherent() with __GFP_ZERO must return zeroed memory.

    But when the contiguous memory allocator (CMA) is enabled on x86 and
    the memory region is allocated by dma_alloc_from_contiguous(), it
    doesn't return zeroed memory, because dma_generic_alloc_coherent()
    forgets to fill the memory region with zeros when it was allocated by
    dma_alloc_from_contiguous().

    Most implementations of dma_alloc_coherent() return zeroed memory
    regardless of whether __GFP_ZERO is specified. So this fixes it by
    unconditionally zeroing the allocated memory region.
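
    A sketch of the fix (placement and local names approximate):

      /* At the end of dma_generic_alloc_coherent(): zero the region
       * unconditionally, so CMA-backed allocations are cleared as well. */
      memset(page_address(page), 0, size);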

    Alternatively, we could fix dma_alloc_from_contiguous() to return
    zeroed-out memory and remove memset() from all callers of it. But we
    can't simply remove the memset on arm, because __dma_clear_buffer() is
    used there to ensure cache flushing and it is used in many places. Of
    course we could do a redundant memset in dma_alloc_from_contiguous(),
    but I think this patch has less impact as a fix for this problem.

    Signed-off-by: Akinobu Mita
    Cc: Marek Szyprowski
    Cc: Konrad Rzeszutek Wilk
    Cc: David Woodhouse
    Cc: Don Dutile
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • For single-threaded workloads, we can avoid flushing and iterating
    through the entire list of tasks, making the whole function a lot
    faster and requiring only a single atomic read of mm_users.

    Signed-off-by: Davidlohr Bueso
    Suggested-by: Oleg Nesterov
    Cc: Aswin Chandramouleeswaran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Introduce a CONFIG_DEBUG_VM_VMACACHE option to enable counting the cache
    hit rate -- exported in /proc/vmstat.

    Any update to the caching scheme needs this kind of data, so this saves
    some work re-implementing the counting each time.

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Prior to this change, we would decide whether to force scan an LRU
    during reclaim based on whether that LRU itself was too small for the
    current priority. However, this can lead to the file LRU getting force
    scanned even if there are a lot of anonymous pages we can reclaim,
    leading to hot file pages getting needlessly reclaimed.

    To address this, we instead only force scan when none of the reclaimable
    LRUs are big enough.

    This gives huge improvements with zswap. For example, when doing a -j20
    kernel build in a 500MB container with zswap enabled, runtime (in
    seconds) is greatly reduced:

    x without this change
    + with this change
    N Min Max Median Avg Stddev
    x 5 700.997 790.076 763.928 754.05 39.59493
    + 5 141.634 197.899 155.706 161.9 21.270224
    Difference at 95.0% confidence
    -592.15 +/- 46.3521
    -78.5293% +/- 6.14709%
    (Student's t, pooled s = 31.7819)

    Should also give some improvements in regular (non-zswap) swap cases.

    Yes, hughd found significant speedup using regular swap, with several
    memcgs under pressure; and it should also be effective in the non-memcg
    case, whenever one or another zone LRU is forced too small.

    Signed-off-by: Suleiman Souhlal
    Signed-off-by: Hugh Dickins
    Cc: Suleiman Souhlal
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Cc: Michal Hocko
    Cc: Yuanhan Liu
    Cc: Seth Jennings
    Cc: Bob Liu
    Cc: Minchan Kim
    Cc: Luigi Semenzato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suleiman Souhlal
     
  • clear_refs_write() is called earlier than clear_soft_dirty(), and it is
    more natural to clear VM_SOFTDIRTY (which belongs to the VMA entry, not
    the PTEs) that early instead of clearing it deeper inside the call
    chain.
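
    A sketch of the move (inside clear_refs_write()'s VMA walk; the
    surrounding control flow is elided and the variable names are
    approximate):

      /* Drop the per-VMA flag up front, rather than deep inside
       * clear_soft_dirty() during the page-table walk. */
      if (type == CLEAR_REFS_SOFT_DIRTY)
              vma->vm_flags &= ~VM_SOFTDIRTY;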

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • pte_file_mksoft_dirty takes its argument by value and returns the
    modified result, thus we need to assign @ptfile here; otherwise it is a
    no-op, which may lead to loss of the soft-dirty bit.
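
    The fix in sketch form:

      /* Before (a no-op): the modified pte returned by value is discarded */
      pte_file_mksoft_dirty(ptfile);

      /* After: assign the result back so the soft-dirty bit actually sticks */
      ptfile = pte_file_mksoft_dirty(ptfile);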

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Hugh reported:

    | I noticed your soft_dirty work in install_file_pte(): which looked
    | good at first, until I realized that it's propagating the soft_dirty
    | of a pte it's about to zap completely, to the unrelated entry it's
    | about to insert in its place. Which seems very odd to me.

    Indeed this code ends up being a nop -- pte_file_mksoft_dirty()
    operates on a pte_t argument and returns a new pte_t which is never
    used afterwards. After looking further, I think what we need is to
    soft-dirtify all newly remapped file pages, because they should look
    like a new mapping to the memory tracker.

    Signed-off-by: Cyrill Gorcunov
    Reported-by: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Currently, to allocate a page that should be charged to kmemcg (e.g.
    threadinfo), we pass the __GFP_KMEMCG flag to the page allocator. The
    page allocated is then to be freed by free_memcg_kmem_pages. Apart from
    looking asymmetrical, this also requires intrusion into the general
    allocation path. So let's introduce separate functions that will
    alloc/free pages charged to kmemcg.

    The new functions are called alloc_kmem_pages and free_kmem_pages. They
    should be used when the caller actually would like to use kmalloc, but
    has to fall back to the page allocator because the allocation is large.
    They only differ from alloc_pages and free_pages in that besides
    allocating or freeing pages they also charge them to the kmem resource
    counter of the current memory cgroup.
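
    A usage sketch (the signatures follow the description above and should
    be treated as approximate):

      /* Allocate a 4-page (order-2) buffer charged to the current memcg's
       * kmem counter, then free and uncharge it. */
      struct page *page = alloc_kmem_pages(GFP_KERNEL, 2);

      if (page) {
              void *buf = page_address(page);
              /* ... use buf ... */
              free_kmem_pages((unsigned long)buf, 2);
      }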

    [sfr@canb.auug.org.au: export kmalloc_order() to modules]
    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have only a few places where we actually want to charge kmem, so
    instead of intruding into the general page allocation path with
    __GFP_KMEMCG it's better to explicitly charge kmem there. All kmem
    charges will be easier to follow that way.

    This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
    from memcg caches' allocflags. Instead, it makes the slab allocation
    path call memcg_charge_kmem directly, getting the memcg to charge from
    the cache's memcg params.

    This also eliminates any possibility of misaccounting an allocation
    going from one memcg's cache to another memcg, because now we always
    charge slabs against the memcg the cache belongs to. That's why this
    patch removes the big comment above memcg_kmem_get_cache.

    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
    got bumped in that exit path. Now there are two, and a bunch of gotos.
    ALLOC_SLOWPATH can now get bumped more than once during a single call
    to __slab_alloc(), which is pretty bogus. Here's the sequence:

    1. Enter __slab_alloc(), fall through all the way to the
    stat(s, ALLOC_SLOWPATH);
    2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
    new_slab (goto #1)
    3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
    (goto #2)
    4. Fall through in the same path we did before all the way to
    stat(s, ALLOC_SLOWPATH)
    5. bump ALLOC_REFILL stat, then return

    Doing this is obviously bogus. It keeps us from being able to
    accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means
    that the total number of allocs always exceeds the total number of
    frees.

    This patch moves the stat(s, ALLOC_SLOWPATH) call so that it is made
    from the same place that __slab_alloc() is called. This makes it much
    less likely that ALLOC_SLOWPATH will get botched again in the spaghetti
    code inside __slab_alloc().

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • When the slab or slub allocators cannot allocate additional slab pages,
    they emit diagnostic information to the kernel log such as current
    number of slabs, number of objects, active objects, etc. This is always
    coupled with a page allocation failure warning since it is controlled by
    !__GFP_NOWARN.

    Suppress this out-of-memory warning if the allocator is configured
    without debug support. The page allocation failure warning will already
    indicate that it is a failed slab allocation, the order, and the gfp
    mask, so the extra output is only useful for diagnosing allocator
    issues.

    Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
    allocator, there is no functional change with this patch. If debug is
    disabled, however, the warnings are now suppressed.

    Signed-off-by: David Rientjes
    Cc: Pekka Enberg
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Inspired by Joe Perches' suggestion in the ntfs logging clean-up.

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Cc: Joe Perches
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • All printk(KERN_foo ...) calls converted to pr_foo().

    Default printk() calls converted to pr_warn().

    Format fragments coalesced.
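
    An illustrative before/after (the messages and variable are made up):

      /* Before: explicit level, and one stray message without a level */
      printk(KERN_ERR "slab: cache %s: something went wrong\n", name);
      printk("stray message without a level\n");

      /* After */
      pr_err("slab: cache %s: something went wrong\n", name);
      pr_warn("stray message without a level\n");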

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Cc: Joe Perches
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • On a system with 2TiB of RAM, current x86_64 has a 128M section size,
    and one memory_block includes only one section, so there will be 16400
    entries under /sys/devices/system/memory/.

    The current code tries to use the block id to find the block pointer in
    /sys for any section, and reuse that block pointer. That lookup takes
    some time even after commit 7c243c7168dc ("mm: speedup in
    __early_pfn_to_nid"), which skips the search in that case during boot.

    So the solution could be to increase the block size, just like the SGI
    UV system did (hard coded to 2g).

    This patch probes the block size to make it match the mmio remap size.
    For example, Intel Nehalem and later systems will have the memory
    ranges [0, TOML), [4g, TOMH]. If the memory hole is 2g and the total is
    128g, TOM will be 2g and TOM2 will be 130g.

    We could use 2g as the block size instead of the default 128M. That
    will reduce the number of entries in /sys/devices/system/memory/.

    On a 6TiB system this will reduce boot time by 35 seconds.

    Signed-off-by: Yinghai Lu
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
    faults on x86. Care is taken such that _PAGE_NUMA is used only in
    situations where the VMA flags distinguish between NUMA hinting faults
    and prot_none faults. This decision was x86-specific, and conceptually
    it is difficult, requiring special casing to distinguish between
    PROTNONE and NUMA ptes based on context.

    Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
    between an entry that is really unmapped and a page that is protected
    for NUMA hinting faults, since if the PTE is not present a fault will
    be trapped either way.

    Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
    This patch shrinks the maximum possible swap size and uses the bit to
    uniquely distinguish between NUMA hinting ptes and swap ptes.

    Signed-off-by: Mel Gorman
    Cc: David Vrabel
    Cc: Ingo Molnar
    Cc: Peter Anvin
    Cc: Fengguang Wu
    Cc: Linus Torvalds
    Cc: Steven Noonan
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Srikar Dronamraju
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • 32-bit support for NUMA is an oddity on its own, but with automatic NUMA
    balancing on top there is a reasonable risk that the CPUPID information
    cannot be stored in the page flags. This patch removes support for
    automatic NUMA balancing on 32-bit x86.

    Signed-off-by: Mel Gorman
    Cc: David Vrabel
    Cc: Ingo Molnar
    Cc: Peter Anvin
    Cc: Fengguang Wu
    Cc: Linus Torvalds
    Cc: Steven Noonan
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Srikar Dronamraju
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Description by Jan Kara:
    "A lot of older filesystems don't properly flush volatile disk caches
    on fsync(2) which can lead to loss of fsynced data after power failure.

    This patch makes generic_file_fsync() issue a proper cache flush to fix
    the problem. A sysadmin can use /sys/devices/.../cache_type to tell the
    system it should not send the cache flush."
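
    A sketch of the added step at the end of generic_file_fsync() (error
    handling elided; the call shape reflects the blkdev_issue_flush()
    signature of that era and should be treated as approximate):

      /* After writing back data and the inode, flush the device's volatile
       * write cache so the fsynced data survives a power failure. */
      err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);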

    [akpm@linux-foundation.org: nuke ifdef]
    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Fabian Frederick
    Suggested-by: Jan Kara
    Suggested-by: Christoph Hellwig
    Cc: Jan Kara
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick