21 Feb, 2016

1 commit

  • Pull powerpc fixes from Michael Ellerman:
    - Fix build error on 32-bit with checkpoint restart from Aneesh Kumar
    - Fix dedotify for binutils >= 2.26 from Andreas Schwab
    - Don't trace hcalls on offline CPUs from Denis Kirjanov
    - eeh: Fix stale cached primary bus from Gavin Shan
    - eeh: Fix stale PE primary bus from Gavin Shan
    - mm: Fix Multi hit ERAT cause by recent THP update from Aneesh Kumar K.V
    - ioda: Set "read" permission when "write" is set from Alexey Kardashevskiy

    * tag 'powerpc-4.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/ioda: Set "read" permission when "write" is set
    powerpc/mm: Fix Multi hit ERAT cause by recent THP update
    powerpc/powernv: Fix stale PE primary bus
    powerpc/eeh: Fix stale cached primary bus
    powerpc/pseries: Don't trace hcalls on offline CPUs
    powerpc: Fix dedotify for binutils >= 2.26
    powerpc/book3s_32: Fix build error with checkpoint restart

    Linus Torvalds
     

19 Feb, 2016

4 commits

    When slub_debug alloc_calls_show is enabled we will try to track the
    location and user of each slab object on every online node, so the
    kmem_cache_node structure and cpu_cache/cpu_slub shouldn't be freed
    until the last reference to the sysfs file is dropped.

    This fixes the following panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
    IP: list_locations+0x169/0x4e0
    PGD 257304067 PUD 438456067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30
    Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
    task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000
    RIP: list_locations+0x169/0x4e0
    Call Trace:
    alloc_calls_show+0x1d/0x30
    slab_attr_show+0x1b/0x30
    sysfs_read_file+0x9a/0x1a0
    vfs_read+0x9c/0x170
    SyS_read+0x58/0xb0
    system_call_fastpath+0x16/0x1b
    Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10
    CR2: 0000000000000020

    Separated __kmem_cache_release from __kmem_cache_shutdown; the former
    is now called from slab_kmem_cache_release (after the last reference
    to the sysfs file object has been dropped).

    Reintroduced locking in free_partial, as a sysfs file might access the
    cache's partial list after shutdown - a partial revert of commit
    69cb8e6b7c29 ("slub: free slabs without holding locks"). Zap
    __remove_partial and use remove_partial (w/o underscores), as
    free_partial now takes the list_lock, which is a partial revert of
    commit 1e4dd9461fab ("slub: do not assert not having lock in removing
    freed partial").

    Signed-off-by: Dmitry Safonov
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     
    Currently an incorrect default hugepage pool size is reported by
    /proc/sys/vm/nr_hugepages when the number of pages for the default
    huge page size is specified twice.

    When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
    indicates the current number of pre-allocated huge pages of the default
    size. Basically /proc/sys/vm/nr_hugepages displays default_hstate->
    max_huge_pages and after boot time pre-allocation, max_huge_pages should
    equal the number of pre-allocated pages (nr_hugepages).

    Test case:

    Note that this is specific to x86 architecture.

    Boot the kernel with command line option 'default_hugepagesz=1G
    hugepages=X hugepagesz=2M hugepages=Y hugepagesz=1G hugepages=Z'. After
    boot, 'cat /proc/sys/vm/nr_hugepages' and 'sysctl -a | grep hugepages'
    returns the value X. However, dmesg output shows that Z huge pages were
    pre-allocated.

    So, the root cause of the problem here is that the global variable
    default_hstate_max_huge_pages is set if a default huge page size is
    specified (directly or indirectly) on the command line. After the command
    line processing in hugetlb_init, if default_hstate_max_huge_pages is set,
    the value is assigned to default_hstate.max_huge_pages. However,
    default_hstate.max_huge_pages may have already been set based on the
    number of pre-allocated huge pages of default_hstate size.

    The solution to this problem is that if hstate->max_huge_pages is
    already set, it should not be overwritten with the global
    max_huge_pages value. Basically, if 'hugepages' is specified multiple
    times on the command line for a specific supported huge page size,
    the proc layer should report the last specified value.
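
    A small check program along these lines (an illustrative sketch, not
    part of the patch) makes the mismatch visible: /proc/sys/vm/nr_hugepages
    and the HugePages_Total line in /proc/meminfo both describe the default
    huge page size pool, so after the fix they agree, while with the bug
    present the sysctl reports X although Z pages were actually
    pre-allocated.

    #include <stdio.h>

    int main(void)
    {
            FILE *f;
            char line[256];
            long nr = -1, total = -1;

            f = fopen("/proc/sys/vm/nr_hugepages", "r");
            if (f) {
                    if (fscanf(f, "%ld", &nr) != 1)
                            nr = -1;
                    fclose(f);
            }

            f = fopen("/proc/meminfo", "r");
            if (f) {
                    while (fgets(line, sizeof(line), f))
                            if (sscanf(line, "HugePages_Total: %ld", &total) == 1)
                                    break;
                    fclose(f);
            }

            printf("nr_hugepages = %ld, HugePages_Total = %ld%s\n",
                   nr, total, nr == total ? "" : " (mismatch)");
            return 0;
    }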

    Signed-off-by: Vaishali Thakkar
    Reviewed-by: Naoya Horiguchi
    Cc: Mike Kravetz
    Cc: Hillf Danton
    Cc: Kirill A. Shutemov
    Cc: Dave Hansen
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vaishali Thakkar
     
  • Grazvydas Ignotas has reported a regression in remap_file_pages()
    emulation.

    Testcase:
    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define SIZE (4096 * 3)

    int main(int argc, char **argv)
    {
            unsigned long *p;
            long i;

            p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return -1;
            }

            for (i = 0; i < SIZE / 4096; i++)
                    p[i * 4096 / sizeof(*p)] = i;

            if (remap_file_pages(p, 4096, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            assert(p[0] == 1);

            munmap(p, SIZE);

            return 0;
    }

    The second remap_file_pages() fails with -EINVAL.

    The reason is that the remap_file_pages() emulation assumes that the
    target vma covers the whole area we want to over-map. That assumption
    is broken by the first remap_file_pages() call: it splits the area
    into two vmas.

    The solution is to also check the next adjacent vmas and whether they
    map the same file with the same flags.

    Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Grazvydas Ignotas
    Tested-by: Grazvydas Ignotas
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    DAX doesn't deposit pgtables when it maps huge pages, so there is
    nothing to withdraw. This can lead to a crash.

    Signed-off-by: Kirill A. Shutemov
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Feb, 2016

1 commit

    With ppc64 we use the deposited pgtable_t to store the hash pte slot
    information. We should not withdraw the deposited pgtable_t without
    marking the pmd none. This ensures that low level hash fault handling
    will skip this huge pte and we will handle it at upper levels.

    A recent change to pmd splitting changed the above in order to handle
    the race between pmd split and exit_mmap. The race is explained below.

    Consider the following race:

    CPU0                                        CPU1
    shrink_page_list()
      add_to_swap()
        split_huge_page_to_list()
          __split_huge_pmd_locked()
            pmdp_huge_clear_flush_notify()
            // pmd_none() == true
                                                exit_mmap()
                                                  unmap_vmas()
                                                    zap_pmd_range()
                                                    // no action on pmd since
                                                    //   pmd_none() == true
            pmd_populate()

    As a result the THP will not be freed. The leak is detected by check_mm():

    BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512

    The above required us to not mark pmd none during a pmd split.

    The fix for ppc is to clear the huge pte of _PAGE_USER, so that the
    low level fault handling code skips this pte. At a higher level we do
    take the ptl lock. That should serialize us against the pmd split.
    Once the lock is acquired we do check the pmd again using pmd_same.
    That should always return false for us and hence we should retry the
    access. We do the pmd_same check in all cases after taking the ptl
    with THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
    huge_pmd_set_accessed).
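
    The recheck pattern referred to above looks roughly like this (an
    illustrative sketch of those THP fault paths, not the ppc patch itself):

    ptl = pmd_lock(mm, pmd);
    if (unlikely(!pmd_same(*pmd, orig_pmd))) {
            /* pmd changed under us, e.g. due to a parallel split:
             * drop the lock and let the access be retried */
            spin_unlock(ptl);
            goto out;
    }
    /* pmd is stable here; safe to update the huge pte */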

    Also make sure we wait for the irq disable section in other cpus to
    finish before flipping a huge pte entry with a regular pmd entry. Code
    paths like find_linux_pte_or_hugepte depend on irq disable to get
    a stable pte_t pointer. A parallel thp split needs to make sure we
    don't convert a pmd pte to a regular pmd entry without waiting for the
    irq disable section to finish.

    Fixes: eef1b3ba053a ("thp: implement split_huge_pmd()")
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman

    Aneesh Kumar K.V
     

12 Feb, 2016

5 commits

  • [akpm@linux-foundation.org: s/threshhold/threshold/]
    Signed-off-by: Vineet Gupta
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     
    This showed up on ARC when running LMBench bw_mem tests, as an
    Overlapping TLB Machine Check Exception triggered due to an STLB entry
    (2M pages) overlapping some NTLB entry (regular 8K page).

    bw_mem 2m touches a large chunk of vaddr, creating NTLB entries. In
    the interim khugepaged kicks in, collapsing the contiguous ptes into a
    single pmd. pmdp_collapse_flush()->flush_pmd_tlb_range() is called to
    flush out NTLB entries for the ptes. On ARC this (by design) can only
    shoot down STLB entries (for the pmd). The stray NTLB entries cause
    the overlap with the subsequent STLB entry for the collapsed page. So
    make pmdp_collapse_flush() call the pte flush interface, not the pmd
    flush.

    Note that originally all thp flush call sites in generic code called
    flush_tlb_range() leaving it to architecture to implement the flush for
    pte and/or pmd. Commit 12ebc1581ad11454 changed this by calling a new
    opt-in API flush_pmd_tlb_range() which made the semantics more explicit
    but failed to distinguish the pte vs pmd flush in generic code, which is
    what this patch fixes.

    Note that ARC could be fixed w/o touching the generic
    pmdp_collapse_flush() by defining an ARC-specific version, but that
    defeats the purpose of the generic version, plus semantically this is
    the right thing to do.

    Fixes STAR 9000961194: LMBench on AXS103 triggering duplicate TLB
    exceptions with super pages

    Fixes: 12ebc1581ad11454 ("mm,thp: introduce flush_pmd_tlb_range")
    Signed-off-by: Vineet Gupta
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: [4.4]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     
    We need to use post-decrement to get percpu_counter_destroy() called on
    &wb->stat[0]. Moreover, the pre-decrement would cause infinite
    out-of-bounds accesses if the setup code failed at i==0.
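
    For illustration, a minimal userspace model of the unwind loop
    (hypothetical setup/destroy steps, nothing from the kernel source):
    with post-decrement the loop also covers element 0 and terminates even
    when the very first setup step fails.

    #include <stdio.h>

    #define NR 4

    static void destroy(unsigned int i)
    {
            printf("destroy %u\n", i);
    }

    int main(void)
    {
            unsigned int i, fail_at = 3;    /* pretend setup of element 3 fails */

            for (i = 0; i < NR; i++) {
                    if (i == fail_at)
                            goto out_destroy;
                    printf("setup %u\n", i);
            }
            return 0;

    out_destroy:
            while (i--)                     /* post-decrement: destroys 2, 1, 0 */
                    destroy(i);
            /*
             * "while (--i)" would skip element 0, and a failure at i == 0
             * would wrap the unsigned counter and walk out of bounds.
             */
            return 1;
    }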

    Signed-off-by: Rasmus Villemoes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
    DAX implements split_huge_pmd() by clearing the pmd. This simple
    approach reduces memory overhead, as we don't need to deposit a page
    table on huge page mapping to make split_huge_pmd() never-fail. The
    PTE table can be allocated and populated later, on page fault from the
    backing store.

    But one side effect is that we have to check if the pmd is pmd_none()
    after split_huge_pmd(). In most places we do this already to deal with
    parallel MADV_DONTNEED.

    But I found two call sites which are not affected by MADV_DONTNEED
    (due to down_write(mmap_sem)), but need to have the check to work with
    DAX properly.

    Signed-off-by: Kirill A. Shutemov
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add missing kernel-doc notation for function parameter 'gfp_mask' to fix
    kernel-doc warning.

    mm/filemap.c:1898: warning: No description found for parameter 'gfp_mask'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

06 Feb, 2016

12 commits

    We need to iterate over split_queue, not the local empty list, to get
    anything split from the shrinker.

    Fixes: e3ae19535c66 ("thp: limit number of object to scan on deferred_split_scan()")
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    The sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if
    the anon_vma appeared between lock and unlock. We have to check
    anon_vma first or call anon_vma_prepare() to be sure that it's there.
    There are only a few users of these legacy helpers. Let's get rid of
    them.

    This patch fixes an anon_vma lock imbalance in validate_mm(). A write
    lock isn't required here, a read lock is enough.

    It also reorders expand_downwards/expand_upwards: security_mmap_addr()
    and the wrapping-around check don't have to be under the anon_vma lock.

    Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Dmitry Vyukov
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
    at runtime") has added the runtime gigantic page allocation via
    alloc_contig_range(), making this support available only when CONFIG_CMA
    is enabled. Because it doesn't depend on MIGRATE_CMA pageblocks and the
    associated infrastructure, it is possible with a few simple adjustments to
    require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA.

    After this patch, alloc_contig_range() and related functions are
    available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION
    enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows
    supporting runtime gigantic pages without the CMA-specific checks in
    page allocator fastpaths.

    Signed-off-by: Vlastimil Babka
    Cc: Luiz Capitulino
    Cc: Kirill A. Shutemov
    Cc: Zhang Yanfei
    Cc: Yasuaki Ishimatsu
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    Cc: Hillf Danton
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Attempting to preallocate 1G gigantic huge pages at boot time with
    "hugepagesz=1G hugepages=1" on the kernel command line will prevent
    booting with the following:

    kernel BUG at mm/hugetlb.c:1218!

    When mapcount accounting was reworked, the setting of
    compound_mapcount_ptr in prep_compound_gigantic_page was overlooked. As
    a result, the validation of mapcount in free_huge_page fails.

    The "BUG_ON" checks in free_huge_page were also changed to
    "VM_BUG_ON_PAGE" to assist with debugging.

    Fixes: 53f9263baba69 ("mm: rework mapcount accounting to enable 4k mapping of THPs")
    Signed-off-by: Mike Kravetz
    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Tested-by: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Jerome Marchand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
    Calling isolate_lru_page() is wrong and shouldn't happen, but it is not
    necessarily fatal: the page just will not be isolated if it's not on
    the LRU.

    Let's downgrade the VM_BUG_ON_PAGE() to WARN_RATELIMIT().

    Signed-off-by: Kirill A. Shutemov
    Cc: Dmitry Vyukov
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Maybe I'm missing some point, but I don't see a reason why we try to
    queue pages from non-migratable VMAs.

    This testcase steps on VM_BUG_ON_PAGE() in isolate_lru_page():

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <numaif.h>

    #define SIZE 0x2000

    int foo;

    int main()
    {
            int fd;
            char *p;
            unsigned long mask = 2;

            fd = open("/dev/sg0", O_RDWR);
            p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
            /* Faultin pages */
            foo = p[0] + p[0x1000];
            mbind(p, SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE | MPOL_MF_STRICT);
            return 0;
    }

    The only case when we can queue pages from such a VMA is MPOL_MF_STRICT
    plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL for a VMA which has pages on the
    LRU, but the gfp mask is not suitable for migration (see the
    mapping_gfp_mask() check in vma_migratable()). That looks like a bug
    to me.

    Let's filter out non-migratable vma at start of queue_pages_test_walk()
    and go to queue_pages_pte_range() only if MPOL_MF_MOVE or
    MPOL_MF_MOVE_ALL flag is set.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Dmitry Vyukov
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Jan Stancek has reported that the system occasionally hangs after the
    "oom01" testcase from LTP triggers OOM. Judging from the observation
    that there is a kworker thread doing memory allocation and that the
    values between "Node 0 Normal free:" and "Node 0 Normal:" differ while
    hanging, vmstat is not up-to-date for some reason.

    According to commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to
    discover memory reclaim doesn't make any progress"), it was meant to
    force the kworker thread to take a short sleep, but by mistake it used
    schedule_timeout(1). We missed that schedule_timeout() in state
    TASK_RUNNING doesn't do anything.

    Fix it by using schedule_timeout_uninterruptible(1) which forces the
    kworker thread to take a short sleep in order to make sure that vmstat
    is up-to-date.
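
    For reference, the difference boils down to the task state (a sketch of
    the two calls as described above, not the patch itself):

    /* Buggy: the task state is TASK_RUNNING, so this returns almost
     * immediately instead of sleeping for a jiffy. */
    schedule_timeout(1);

    /* Fixed: sets TASK_UNINTERRUPTIBLE before scheduling, so the kworker
     * actually sleeps and vmstat gets a chance to become up-to-date. */
    schedule_timeout_uninterruptible(1);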

    Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
    Signed-off-by: Tetsuo Handa
    Reported-by: Jan Stancek
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Cristopher Lameter
    Cc: Joonsoo Kim
    Cc: Arkadiusz Miskiewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
    Commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again and
    shut down on idle") made vmstat_shepherd deferrable. vmstat_update
    itself is still using a standard timer which might interrupt the idle
    task. This is possible because "mm, vmstat: make quiet_vmstat lighter"
    removed cancel_delayed_work from quiet_vmstat.

    Change vmstat_work to use DEFERRABLE_WORK to prevent pointless wakeups
    from the idle context.

    Acked-by: Christoph Lameter
    Signed-off-by: Michal Hocko
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Mike has reported a considerable overhead of refresh_cpu_vm_stats from
    the idle entry during pipe test:

    12.89% [kernel] [k] refresh_cpu_vm_stats.isra.12
    4.75% [kernel] [k] __schedule
    4.70% [kernel] [k] mutex_unlock
    3.14% [kernel] [k] __switch_to

    This is caused by commit 0eb77e988032 ("vmstat: make vmstat_updater
    deferrable again and shut down on idle") which has placed quiet_vmstat
    into cpu_idle_loop. The main reason here seems to be that the idle
    entry has to go over all zones and perform atomic operations for each
    vmstat entry even though there might be no per cpu diffs. This is a
    pointless overhead for _each_ idle entry.

    Make sure that quiet_vmstat is as light as possible.

    First of all it doesn't make any sense to do any local sync if the
    current cpu is already set in oncpu_stat_off because vmstat_update puts
    itself there only if there is nothing to do.

    Then we can check need_update which should be a cheap way to check for
    potential per-cpu diffs and only then do refresh_cpu_vm_stats.

    The original patch also did cancel_delayed_work which we are not doing
    here. There are two reasons for that. Firstly cancel_delayed_work from
    idle context will blow up on RT kernels (reported by Mike):

    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rt3 #7
    Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
    Call Trace:
    dump_stack+0x49/0x67
    ___might_sleep+0xf5/0x180
    rt_spin_lock+0x20/0x50
    try_to_grab_pending+0x69/0x240
    cancel_delayed_work+0x26/0xe0
    quiet_vmstat+0x75/0xa0
    cpu_idle_loop+0x38/0x3e0
    cpu_startup_entry+0x13/0x20
    start_secondary+0x114/0x140

    And secondly, even on !RT kernels it might add some non-trivial overhead
    which is not necessary. Even if the vmstat worker wakes up and preempts
    idle then it will most likely be a single-shot noop because the stats
    were already synced and so it would end up on the oncpu_stat_off anyway.
    We just need to teach both vmstat_shepherd and vmstat_update to stop
    scheduling the worker if there is nothing to do.

    [mgalbraith@suse.de: cancel pending work of the cpu_stat_off CPU]
    Signed-off-by: Michal Hocko
    Reported-by: Mike Galbraith
    Acked-by: Christoph Lameter
    Signed-off-by: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The description mentions kswapd threads, while the deferred struct page
    initialization is actually done by one-off "pgdatinitX" threads.

    Fix the description so that users are not potentially confused about
    pgdatinit threads using CPU after boot instead of kswapd.

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • At the moment memblock_phys_mem_size() is marked as __init, and so is
    discarded after boot. This is different from most of the memblock
    functions which are marked __init_memblock, and are only discarded after
    boot if memory hotplug is not configured.

    To allow for upcoming code which will need memblock_phys_mem_size() in
    the hotplug path, change it from __init to __init_memblock.

    Signed-off-by: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
    The mmap_sem held for reading in validate_mm, called from expand_stack,
    is not enough to prevent the augmented rbtree's rb_subtree_gap
    information from changing under us, because expand_stack may be
    running concurrently from other threads which also hold the mmap_sem
    for reading.

    The augmented rbtree is updated with vma_gap_update under the
    page_table_lock, so use it in browse_rb() too to avoid false positives.

    Signed-off-by: Andrea Arcangeli
    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Cc: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

04 Feb, 2016

10 commits

  • Merge fixes from Andrew Morton:
    "18 fixes"

    [ The 18 fixes turned into 17 commits, because one of the fixes was a
    fix for another patch in the series that I just folded in by editing
    the patch manually - hopefully correctly - Linus ]

    * emailed patches from Andrew Morton :
    mm: fix memory leak in copy_huge_pmd()
    drivers/hwspinlock: fix race between radix tree insertion and lookup
    radix-tree: fix race in gang lookup
    mm/vmpressure.c: fix subtree pressure detection
    mm: polish virtual memory accounting
    mm: warn about VmData over RLIMIT_DATA
    Documentation: cgroup-v2: add memory.stat::sock description
    mm: memcontrol: drop superfluous entry in the per-memcg stats array
    drivers/scsi/sg.c: mark VMA as VM_IO to prevent migration
    proc: revert /proc/<pid>/maps [stack:TID] annotation
    numa: fix /proc/<pid>/numa_maps for hugetlbfs on s390
    MAINTAINERS: update Seth email
    ocfs2/cluster: fix memory leak in o2hb_region_release
    lib/test-string_helpers.c: fix and improve string_get_size() tests
    thp: limit number of object to scan on deferred_split_scan()
    thp: change deferred_split_count() to return number of THP in queue
    thp: make split_queue per-node

    Linus Torvalds
     
  • Trinity is now hitting the WARN_ON_ONCE we added in v3.15 commit
    cda540ace6a1 ("mm: get_user_pages(write,force) refuse to COW in shared
    areas"). The warning has served its purpose, nobody was harmed by that
    change, so just remove the warning to generate less noise from Trinity.

    Which reminds me of the comment I wrongly left behind with that commit
    (but was spotted at the time by Kirill), which has since moved into a
    separate function, and become even more obscure: delete it.

    Reported-by: Dave Jones
    Suggested-by: Kirill A. Shutemov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We allocate a pgtable but do not attach it to anything if the PMD is in
    a DAX VMA, causing it to leak.

    We certainly try to not free pgtables associated with the huge zero page
    if the zero page is in a DAX VMA, so I think this is the right solution.
    This needs to be properly audited.

    Signed-off-by: Matthew Wilcox
    Cc: Dan Williams
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • When vmpressure is called for the entire subtree under pressure we
    mistakenly use vmpressure->scanned instead of vmpressure->tree_scanned
    when checking if vmpressure work is to be scheduled. This results in
    suppressing all vmpressure events in the legacy cgroup hierarchy. Fix it.

    Fixes: 8e8ae645249b ("mm: memcontrol: hook up vmpressure to socket pressure")
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • * add VM_STACK as alias for VM_GROWSUP/DOWN depending on architecture
    * always account VMAs with flag VM_STACK as stack (as it was before)
    * cleanup classifying helpers
    * update comments and documentation

    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Sudip Mukherjee
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch provides a way of working around a slight regression
    introduced by commit 84638335900f ("mm: rework virtual memory
    accounting").

    Before that commit RLIMIT_DATA had control only over the size of the
    brk region. But that change has caused problems with all existing
    versions of valgrind, because they set RLIMIT_DATA to zero.

    This patch fixes the rlimit check (the limit is actually in bytes, not
    pages) and by default turns it into a warning which is printed on the
    first VmData misuse:

    "mmap: top (795): VmData 516096 exceed data ulimit 512000. Will be forbidden soon."

    Behavior is controlled by the boot param ignore_rlimit_data=y/n and by
    the sysfs file /sys/module/kernel/parameters/ignore_rlimit_data. For
    now it is set to "y".
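
    A minimal userspace demonstration (an illustrative sketch, not part of
    the patch): cap RLIMIT_DATA, then create a private writable mapping,
    which is accounted as VmData since commit 84638335900f. Depending on
    ignore_rlimit_data the mmap() either succeeds and the warning above
    shows up in the kernel log, or it fails with ENOMEM.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void)
    {
            struct rlimit rl = { .rlim_cur = 1 << 20, .rlim_max = 1 << 20 };
            void *p;

            setrlimit(RLIMIT_DATA, &rl);    /* cap VmData at 1 MB */
            p = mmap(NULL, 4 << 20, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    printf("mmap: %s (limit enforced)\n", strerror(errno));
            else
                    printf("mmap succeeded, check dmesg for the VmData warning\n");
            return 0;
    }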

    [akpm@linux-foundation.org: tweak kernel-parameters.txt text[
    Signed-off-by: Konstantin Khlebnikov
    Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus
    Reported-by: Christian Borntraeger
    Cc: Cyrill Gorcunov
    Cc: Linus Torvalds
    Cc: Vegard Nossum
    Cc: Peter Zijlstra
    Cc: Vladimir Davydov
    Cc: Andy Lutomirski
    Cc: Quentin Casasnovas
    Cc: Kees Cook
    Cc: Willy Tarreau
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    Commit b76437579d13 ("procfs: mark thread stack correctly in
    proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps.

    Finding the task of a stack VMA requires walking the entire thread list,
    turning this into quadratic behavior: a thousand threads means a
    thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a
    million combinations.

    The cost is not in proportion to the usefulness as described in the
    patch.

    Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
    /proc/<pid>/numa_maps) usable again for higher thread counts.

    The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained,
    as identifying the stack VMA there is an O(1) operation.
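
    For example, the per-task view can still be used to find a thread's own
    stack (a small sketch, not from the patch; the exact annotation depends
    on the kernel version):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Print the [stack]-annotated lines of this thread's per-task maps
     * file; the per-task lookup is O(1) as noted above. */
    static void *show_stack_vma(void *arg)
    {
            char path[64], line[512];
            FILE *f;

            (void)arg;
            snprintf(path, sizeof(path), "/proc/self/task/%ld/maps",
                     (long)syscall(SYS_gettid));
            f = fopen(path, "r");
            if (!f)
                    return NULL;
            while (fgets(line, sizeof(line), f))
                    if (strstr(line, "[stack]"))
                            fputs(line, stdout);
            fclose(f);
            return NULL;
    }

    int main(void)
    {
            pthread_t t;

            pthread_create(&t, NULL, show_stack_vma, NULL);
            pthread_join(t, NULL);
            return 0;
    }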

    Siddhesh said:
    "The end users needed a way to identify thread stacks programmatically and
    there wasn't a way to do that. I'm afraid I no longer remember (or have
    access to the resources that would aid my memory since I changed
    employers) the details of their requirement. However, I did do this on my
    own time because I thought it was an interesting project for me and nobody
    really gave any feedback then as to its utility, so as far as I am
    concerned you could roll back the main thread maps information since the
    information is available in the thread-specific files"

    Signed-off-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Siddhesh Poyarekar
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    If we have a lot of pages in the queue to be split, deferred_split_scan()
    can spend an unreasonable amount of time under the spinlock with
    interrupts disabled.

    Let's cap the number of pages to split per scan at sc->nr_to_scan.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    I got the meaning of shrinker::count_objects() wrong: it should return
    the number of potentially freeable objects, which does not necessarily
    correlate with freeable memory.

    Returning 256 per THP in the queue is not reasonable:
    shrinker::scan_objects() is never called with nr_to_scan > 128 in my
    setup.

    Let's return 1 per THP and correct scan_objects accordingly.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Andrea Arcangeli suggested to make split queue per-node to improve
    scalability. Let's do it.

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

02 Feb, 2016

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "1/ Fixes to the libnvdimm 'pfn' device that establishes a reserved
    area for storing a struct page array.

    2/ Fixes for dax operations on a raw block device to prevent pagecache
    collisions with dax mappings.

    3/ A fix for pfn_t usage in vm_insert_mixed that led to a null
    pointer de-reference.

    These have received build success notification from the kbuild robot
    across 153 configs and pass the latest ndctl tests"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    phys_to_pfn_t: use phys_addr_t
    mm: fix pfn_t to page conversion in vm_insert_mixed
    block: use DAX for partition table reads
    block: revert runtime dax control of the raw block device
    fs, block: force direct-I/O for dax-enabled block devices
    devm_memremap_pages: fix vmem_altmap lifetime + alignment handling
    libnvdimm, pfn: fix restoring memmap location
    libnvdimm: fix mode determination for e820 devices

    Linus Torvalds
     

01 Feb, 2016

1 commit

  • pfn_t_to_page() honors the flags in the pfn_t value to determine if a
    pfn is backed by a page. However, vm_insert_mixed() was originally
    written to use pfn_valid() to make this determination. To restore the
    old/correct behavior, ignore the pfn_t flags in the !pfn_t_devmap() case
    and fall back to trusting pfn_valid().

    Fixes: 01c8f1c44b83 ("mm, dax, gpu: convert vm_insert_mixed to pfn_t")
    Cc: Dave Hansen
    Cc: David Airlie
    Reported-by: Tomi Valkeinen
    Tested-by: Tomi Valkeinen
    Signed-off-by: Dan Williams

    Dan Williams
     

30 Jan, 2016

1 commit


27 Jan, 2016

1 commit


25 Jan, 2016

1 commit

  • If we detect that there is nothing to do just set the flag and do not
    check if it was already set before. Races really do not matter. If the
    flag is set by any code then the shepherd will start dealing with the
    situation and reenable the vmstat workers when necessary again.

    Since commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again
    and shut down on idle") quiet_vmstat might update cpu_stat_off and mark
    a particular cpu to be handled by vmstat_shepherd. This might trigger a
    VM_BUG_ON in vmstat_update because the work item might have been
    sleeping during the idle period and see the cpu_stat_off updated after
    the wake up. The VM_BUG_ON is therefore misleading and no longer
    appropriate. Moreover it doesn't really provide any protection from
    real bugs, because vmstat_shepherd will simply reschedule the
    vmstat_work anytime it sees a particular cpu set, or vmstat_update
    would do the same from the worker context directly. Even when the two
    race, the result wouldn't be incorrect, as the counter update is fully
    idempotent.

    Reported-by: Sasha Levin
    Signed-off-by: Christoph Lameter
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

24 Jan, 2016

1 commit

  • Pull final vfs updates from Al Viro:

    - The ->i_mutex wrappers (with small prereq in lustre)

    - a fix for too early freeing of symlink bodies on shmem (they need to
    be RCU-delayed) (-stable fodder)

    - followup to dedupe stuff merged this cycle

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: abort dedupe loop if fatal signals are pending
    make sure that freeing shmem fast symlinks is RCU-delayed
    wrappers for ->i_mutex access
    lustre: remove unused declaration

    Linus Torvalds
     

23 Jan, 2016

1 commit

  • There are many locations that do

    if (memory_was_allocated_by_vmalloc)
            vfree(ptr);
    else
            kfree(ptr);

    but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
    using is_vmalloc_addr(). Unless callers have special reasons, we can
    replace this branch with kvfree(). Please check and reply if you found
    problems.
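
    With kvfree() the branch above collapses to a single call:

    kvfree(ptr);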

    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Jan Kara
    Acked-by: Russell King
    Reviewed-by: Andreas Dilger
    Acked-by: "Rafael J. Wysocki"
    Acked-by: David Rientjes
    Cc: "Luck, Tony"
    Cc: Oleg Drokin
    Cc: Boris Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa