Doug / smarc-fsl-linux-kernel | Embedian Git Server

09 Oct, 2012

40 commits

e90bdb7f5 memory-hotplug: update memory block's state and notify userspace

remove_memory() is called when a memory device is hot-removed. However,
userspace is not notified even though the memory is offlined. This patch
therefore updates the memory block's state and sends a notification to
userspace.

Additionally, the memory device may contain more than one memory block.
If the memory block has been offlined, __offline_pages() will fail. So we
should try to offline one memory block at a time.

Thus remove_memory() also checks each memory block's state, so there is no
need to check the memory block's state before calling remove_memory().
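
As a rough illustration of the per-block walk described above, here is a small user-space model in C. It is only a sketch: struct mem_block, offline_block() and remove_device() are hypothetical names invented for this example, not the kernel functions touched by the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum block_state { BLOCK_ONLINE, BLOCK_OFFLINE };

struct mem_block {
    enum block_state state;
    int notifications;      /* number of state-change events sent */
};

/* Offline one block; a block that is already offline is skipped. */
static bool offline_block(struct mem_block *blk)
{
    if (blk->state == BLOCK_OFFLINE)
        return false;
    blk->state = BLOCK_OFFLINE;
    blk->notifications++;   /* stand-in for notifying userspace */
    return true;
}

/* Walk the device one block at a time; return how many blocks changed. */
static size_t remove_device(struct mem_block *blocks, size_t n)
{
    size_t changed = 0;

    for (size_t i = 0; i < n; i++)
        if (offline_block(&blocks[i]))
            changed++;
    return changed;
}
```

Checking the state inside the walk is what lets the caller skip its own state check: a block that was offlined earlier is simply passed over and generates no second notification.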

Signed-off-by: Wen Congyang
Signed-off-by: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: Jiang Liu
Cc: Len Brown
Cc: Christoph Lameter
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Congyang
2012-10-09 15:23:02 +0800
a16cee10c memory-hotplug: preparation to notify memory block's state at memory hot remove

remove_memory() is called in two cases:
1. echo offline >/sys/devices/system/memory/memoryXX/state
2. hot remove a memory device

In the 1st case, the memory block's state is changed and a notification of
the state change is sent to userland after remove_memory() is called, so
the user can notice that the memory block has changed.

But in the 2nd case, the memory block's state is not changed and no
notification is sent to userspace even though remove_memory() is called,
so the user cannot notice that the memory block has changed.

To prepare for adding the notification at memory hot remove, the patch
splits the two cases as follows:
1st case uses offline_pages() for offlining memory.
2nd case uses remove_memory() for offlining memory, changing the memory
block's state, and sending the notification.

The patch does not yet implement the notification in remove_memory().

Signed-off-by: Wen Congyang
Signed-off-by: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: Jiang Liu
Cc: Len Brown
Cc: Christoph Lameter
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Congyang
2012-10-09 15:23:02 +0800
c22331166 mm: avoid section mismatch warning for memblock_type_name

The following section mismatch warning is thrown during the build:

WARNING: vmlinux.o(.text+0x32408f): Section mismatch in reference from the function memblock_type_name() to the variable .meminit.data:memblock
The function memblock_type_name() references
the variable __meminitdata memblock.
This is often because memblock_type_name lacks a __meminitdata
annotation or the annotation of memblock is wrong.

This is because memblock_type_name() references the memblock variable,
which is annotated __meminitdata. Hence the warning (even if the function
is inline).

[akpm@linux-foundation.org: remove inline]
Signed-off-by: Raghavendra D Prabhu
Cc: Tejun Heo
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Raghavendra D Prabhu
2012-10-09 15:23:01 +0800
3e648ebe0 make GFP_NOTRACK definition unconditional

There was a general sentiment in a recent discussion (See
https://lkml.org/lkml/2012/9/18/258) that the __GFP flags should be
defined unconditionally. Currently, the only offender is GFP_NOTRACK,
which is conditional on KMEMCHECK.

Signed-off-by: Glauber Costa
Acked-by: Christoph Lameter
Cc: Mel Gorman
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Glauber Costa
2012-10-09 15:23:01 +0800
beb51eaa8 cma: decrease cc.nr_migratepages after reclaiming pagelist

reclaim_clean_pages_from_list() reclaims clean pages before migration, so
cc.nr_migratepages should be updated. Currently this causes no problem, but
the value would be wrong if we tried to use it in the future.
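
The bookkeeping can be pictured with a small user-space model. This is a sketch under stated assumptions: struct ex_compact_control, ex_reclaim_clean() and ex_prepare_migration() are hypothetical stand-ins, and a page is reduced to a single "dirty" flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the counter kept in the compaction control. */
struct ex_compact_control {
    size_t nr_migratepages;
};

/*
 * Drop clean pages from the list (modelled as an array of dirty flags),
 * keeping the dirty ones at the front. Returns the number of pages dropped.
 */
static size_t ex_reclaim_clean(bool *dirty, size_t n)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
        if (dirty[i])
            dirty[kept++] = true;
    return n - kept;
}

/* Reclaim clean pages first, then keep the counter in sync with the list. */
static void ex_prepare_migration(struct ex_compact_control *cc,
                                 bool *dirty, size_t n)
{
    size_t reclaimed;

    cc->nr_migratepages = n;
    reclaimed = ex_reclaim_clean(dirty, n);
    cc->nr_migratepages -= reclaimed;   /* the update the patch adds */
}
```

Without the final subtraction the counter would still report the original list length even though some pages are no longer queued for migration.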

Signed-off-by: Minchan Kim
Acked-by: Mel Gorman
Cc: Michal Nazarewicz
Cc: Bartlomiej Zolnierkiewicz
Cc: Marek Szyprowski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2012-10-09 15:23:01 +0800
e46a28790 CMA: migrate mlocked pages

Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
contiguous memory space.

This patch makes mlocked pages be migrated out. Of course, it can affect
realtime processes, but in the CMA use case, a contiguous memory allocation
failure is far worse than variable access latency to an mlocked page while
CMA is running. Anyone who wants a realtime system shouldn't enable CMA
because stalls can still happen at random times.

[akpm@linux-foundation.org: tweak comment text, per Mel]
Signed-off-by: Minchan Kim
Acked-by: Mel Gorman
Cc: Michal Nazarewicz
Cc: Bartlomiej Zolnierkiewicz
Cc: Marek Szyprowski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2012-10-09 15:23:00 +0800
7a71932d5 kpageflags: fix wrong KPF_THP on non-huge compound pages

KPF_THP can be set on non-huge compound pages (like slab pages or pages
allocated by drivers with __GFP_COMP) because PageTransCompound only
checks PG_head and PG_tail. Obviously this is a bug and breaks user space
applications which look for thp via /proc/kpageflags.

This patch rules out setting KPF_THP wrongly by additionally checking
PageLRU on the head pages.
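
The difference between the old and new checks can be sketched with a simplified page model. Everything here is illustrative: struct ex_page and the ex_* helpers are hypothetical, and only the three properties mentioned above (head, tail, LRU) are modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified page descriptor. */
struct ex_page {
    bool head;                    /* models PG_head */
    bool tail;                    /* models PG_tail */
    bool lru;                     /* models PG_lru */
    const struct ex_page *first;  /* head page, only used for tail pages */
};

/* Models a check that looks only at the head/tail bits. */
static bool ex_trans_compound(const struct ex_page *p)
{
    return p->head || p->tail;
}

static const struct ex_page *ex_compound_head(const struct ex_page *p)
{
    return p->tail ? p->first : p;
}

/* Old behaviour: any compound page is reported as THP. */
static bool ex_old_is_thp(const struct ex_page *p)
{
    return ex_trans_compound(p);
}

/* New behaviour: the head page must also be on the LRU. */
static bool ex_new_is_thp(const struct ex_page *p)
{
    return ex_trans_compound(p) && ex_compound_head(p)->lru;
}
```

In this model a slab-like compound page (head set, not on the LRU) is reported as THP by the old check and rejected by the new one, while a tail page of a real huge page still passes because its head is on the LRU.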

Signed-off-by: Naoya Horiguchi
Acked-by: KOSAKI Motohiro
Acked-by: David Rientjes
Reviewed-by: Fengguang Wu
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Naoya Horiguchi
2012-10-09 15:23:00 +0800
cd8ed2a45 fs/fs-writeback.c: remove unneccesary parameter of __writeback_single_inode()

The parameter 'wb' is never used in this function.

Signed-off-by: Yan Hong
Acked-by: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yan Hong
2012-10-09 15:23:00 +0800
c462f179e mm/memory.c: fix typo in comment

Signed-off-by: Robert P. J. Day
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robert P. J. Day
2012-10-09 15:22:59 +0800
8befedfe6 mm: remove unevictable_pgs_mlockfreed

Simply remove UNEVICTABLE_MLOCKFREED and the unevictable_pgs_mlockfreed line
from /proc/vmstat: Johannes and Mel point out that it was very unlikely to
have been used by any tool, and of course we can restore it easily enough
if that turns out to be wrong.

Signed-off-by: Hugh Dickins
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: Michel Lespinasse
Cc: Ying Han
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-10-09 15:22:59 +0800
5a8838138 memory-hotplug: fix zone stat mismatch

During memory-hotplug, I found that NR_ISOLATED_[ANON|FILE] keep increasing,
causing the kernel to hang. When the system doesn't have enough free
pages, it enters reclaim but never reclaims any pages because
too_many_isolated() returns true, so it loops forever.

The cause is that when we do memory-hotadd after memory-remove,
__zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
although the vm_stat_diff of all CPUs still have values.

In addition, when we offline all pages of the zone, we reset them in
zone_pcp_reset() without draining, so we lose some zone stat items.
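
Why a reset without draining leaves a counter stuck can be seen in a small model. This is a sketch, assuming a counter made of one global value plus per-CPU deltas that are folded in later; struct ex_zone_stat and the ex_* functions are hypothetical and do not mirror the kernel's data layout.

```c
#include <stddef.h>

#define EX_NR_CPUS 2

/* Hypothetical model of one zone counter. */
struct ex_zone_stat {
    long global;              /* value that readers of the counter see */
    long diff[EX_NR_CPUS];    /* per-CPU deltas not yet folded in */
};

/* Fold every pending per-CPU delta into the global value. */
static void ex_drain(struct ex_zone_stat *s)
{
    for (size_t cpu = 0; cpu < EX_NR_CPUS; cpu++) {
        s->global += s->diff[cpu];
        s->diff[cpu] = 0;
    }
}

/* Problematic reset: pending deltas are zeroed and therefore lost. */
static void ex_reset_without_drain(struct ex_zone_stat *s)
{
    for (size_t cpu = 0; cpu < EX_NR_CPUS; cpu++)
        s->diff[cpu] = 0;
}

/* Reset that drains first, so no pending update is dropped. */
static void ex_reset_with_drain(struct ex_zone_stat *s)
{
    ex_drain(s);
    ex_reset_without_drain(s);
}
```

With a global value of 10 and pending deltas of -4 and -6, the plain reset leaves the counter at 10 forever, which is the kind of drift that keeps a check like too_many_isolated() returning true; draining first brings it back to 0.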

Reviewed-by: Wen Congyang
Signed-off-by: Minchan Kim
Cc: Kamezawa Hiroyuki
Cc: Yasuaki Ishimatsu
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2012-10-09 15:22:59 +0800
082708072 mm: revert 0def08e3 ("mm/mempolicy.c: check return code of check_range")

Revert commit 0def08e3acc2 because check_range() can't fail in
migrate_to_node() given its current use cases.

Quote from Johannes

: I think it makes sense to revert. Not because of the semantics, but I
: just don't see how check_range() could even fail for this callsite:
:
: 1. we pass mm->mmap->vm_start in there, so we should not fail due to
: find_vma()
:
: 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
: and so can not fail
:
: 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
: continue until addr == end, so we never fail with -EIO

I also added a new VM_BUG_ON to catch future migrate_to_node() use cases
that might pass MPOL_MF_STRICT.

Suggested-by: Johannes Weiner
Signed-off-by: Minchan Kim
Acked-by: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Christoph Lameter
Cc: David Rientjes
Cc: Vasiliy Kulikov
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2012-10-09 15:22:58 +0800
6bdb913f0 mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end

In order to allow sleeping during invalidate_page mmu notifier calls, we
need to avoid calling it while holding the PT lock. In addition to its direct
calls, invalidate_page can also be called as a substitute for a change_pte
call, in case the notifier client hasn't implemented change_pte.

This patch drops the invalidate_page call from change_pte, and instead
wraps all calls to change_pte with invalidate_range_start and
invalidate_range_end calls.

Note that change_pte still cannot sleep after this patch, and that clients
implementing change_pte should not take action on it in case the number of
outstanding invalidate_range_start calls is larger than one, otherwise
they might miss a later invalidation.
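
One way a notifier client could follow the rule in the note above is to count outstanding range invalidations and ignore change_pte while more than one is in flight. The following is a user-space sketch only: struct ex_client and the ex_* callbacks are hypothetical and are not the mmu_notifier_ops interface.

```c
/* Hypothetical notifier client; field and function names are illustrative. */
struct ex_client {
    int active_invalidations;     /* outstanding range_start calls */
    unsigned long secondary_pte;  /* 0 means "no secondary mapping" */
};

static void ex_range_start(struct ex_client *c)
{
    c->active_invalidations++;
    c->secondary_pte = 0;         /* drop the secondary mapping */
}

static void ex_range_end(struct ex_client *c)
{
    c->active_invalidations--;
}

/* Returns 1 if the new pte was installed, 0 if the update was ignored. */
static int ex_change_pte(struct ex_client *c, unsigned long new_pte)
{
    /*
     * One outstanding invalidation is the pair wrapping this very call.
     * More than one means another range is being invalidated as well,
     * so installing a mapping now could miss that later invalidation.
     */
    if (c->active_invalidations > 1)
        return 0;
    c->secondary_pte = new_pte;
    return 1;
}
```

The client stays correct when it ignores the update, because the surrounding invalidate_range_start has already removed the stale mapping; it only gives up the optimisation of installing the new one early.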

Signed-off-by: Haggai Eran
Cc: Andrea Arcangeli
Cc: Sagi Grimberg
Cc: Peter Zijlstra
Cc: Xiao Guangrong
Cc: Or Gerlitz
Cc: Haggai Eran
Cc: Shachar Raindel
Cc: Liran Liss
Cc: Christoph Lameter
Cc: Avi Kivity
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Haggai Eran
2012-10-09 15:22:58 +0800
2ec74c3ef mm: move all mmu notifier invocations to be done outside the PT lock

In order to allow sleeping during mmu notifier calls, we need to avoid
invoking them under the page table spinlock. This patch solves the
problem by calling invalidate_page notification after releasing the lock
(but before freeing the page itself), or by wrapping the page invalidation
with calls to invalidate_range_start and invalidate_range_end.

To prevent accidental changes to the invalidate_range_end arguments after
the call to invalidate_range_start, the patch introduces a convention of
saving the arguments in consistently named locals:

	unsigned long mmun_start;	/* For mmu_notifiers */
	unsigned long mmun_end;		/* For mmu_notifiers */

	...

	mmun_start = ...
	mmun_end = ...
	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

	...

	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

The patch changes code to use this convention for all calls to
mmu_notifier_invalidate_range_start/end, except those where the calls are
close enough so that anyone who glances at the code can see the values
aren't changing.

This patchset is a preliminary step towards the on-demand paging design to
be added to the RDMA stack.

Why do we want on-demand paging for Infiniband?

Applications register memory with an RDMA adapter using system calls,
and subsequently post IO operations that refer to the corresponding
virtual addresses directly to HW. Until now, this was achieved by
pinning the memory during the registration calls. The goal of on demand
paging is to avoid pinning the pages of registered memory regions (MRs).
This will allow users the same flexibility they get when swapping any
other part of their process's address space. Instead of requiring the
entire MR to fit in physical memory, we can allow the MR to be larger,
and only fit the current working set in physical memory.

Why should anyone care? What problems are users currently experiencing?

This can make programming with RDMA much simpler. Today, developers
that are working with more data than their RAM can hold need either to
deregister and reregister memory regions throughout their process's
life, or keep a single memory region and copy the data to it. On demand
paging will allow these developers to register a single MR at the
beginning of their process's life, and let the operating system manage
which pages need to be fetched at a given time. In the future, we
might be able to provide a single memory access key for each process
that would expose the entire process's address space as one large memory
region, and the developers wouldn't need to register memory regions at
all.

Is there any prospect that any other subsystems will utilise these
infrastructural changes? If so, which and how, etc?

As for other subsystems, I understand that XPMEM wanted to sleep in
MMU notifiers, as Christoph Lameter wrote at
http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
perhaps Andrea knows about other use cases.

Scheduling in mmu notifications is required since we need to sync the
hardware with the secondary page tables change. A TLB flush of an IO
device is inherently slower than a CPU TLB flush, so our design works by
sending the invalidation request to the device, and waiting for an
interrupt before exiting the mmu notifier handler.

Avi said:

kvm may be a buyer. kvm::mmu_lock, which serializes guest page
faults, also protects long operations such as destroying large ranges.
It would be good to convert it into a spinlock, but as it is used inside
mmu notifiers, this cannot be done.

(there are alternatives, such as keeping the spinlock and using a
generation counter to do the teardown in O(1), which is what the "may"
is doing up there).

[akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
Signed-off-by: Andrea Arcangeli
Signed-off-by: Sagi Grimberg
Signed-off-by: Haggai Eran
Cc: Peter Zijlstra
Cc: Xiao Guangrong
Cc: Or Gerlitz
Cc: Haggai Eran
Cc: Shachar Raindel
Cc: Liran Liss
Cc: Christoph Lameter
Cc: Avi Kivity
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sagi Grimberg
2012-10-09 15:22:58 +0800
36e4f20af hugetlb: do not use vma_hugecache_offset() for vma_prio_tree_foreach

Commit 0c176d52b0b2 ("mm: hugetlb: fix pgoff computation when unmapping
page from vma") fixed the pgoff calculation but replaced it with
vma_hugecache_offset(), which is not appropriate for offsets used for
vma_prio_tree_foreach() because that one expects an index in page units
rather than in huge_page_shift units.

Johannes said:

: The resulting index may not be too big, but it can be too small: assume
: hpage size of 2M and the address to unmap to be 0x200000. This is regular
: page index 512 and hpage index 1. If you have a VMA that maps the file
: only starting at the second huge page, that VMAs vm_pgoff will be 512 but
: you ask for offset 1 and miss it even though it does map the page of
: interest. hugetlb_cow() will try to unmap, miss the vma, and retry the
: cow until the allocation succeeds or the skipped vma(s) go away.
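
The arithmetic in the quote can be checked with a few lines of C. This is a sketch assuming 4 KiB base pages and 2 MiB huge pages; the EX_* constants, struct ex_vma and the helper names are made up for the example.

```c
#define EX_PAGE_SHIFT  12   /* assumed 4 KiB base pages */
#define EX_HPAGE_SHIFT 21   /* assumed 2 MiB huge pages */

/* Index of a file offset in base-page units. */
static unsigned long ex_index_in_pages(unsigned long offset)
{
    return offset >> EX_PAGE_SHIFT;
}

/* Index of the same offset in huge-page units. */
static unsigned long ex_index_in_hpages(unsigned long offset)
{
    return offset >> EX_HPAGE_SHIFT;
}

/* Hypothetical VMA covering file pages [vm_pgoff, vm_pgoff + nr_pages). */
struct ex_vma {
    unsigned long vm_pgoff;   /* in base-page units */
    unsigned long nr_pages;
};

static int ex_vma_covers(const struct ex_vma *vma, unsigned long pgoff)
{
    return pgoff >= vma->vm_pgoff && pgoff < vma->vm_pgoff + vma->nr_pages;
}
```

For offset 0x200000 the base-page index is 512 and the huge-page index is 1. A VMA whose vm_pgoff is 512 is found when searching with 512 but missed when searching with 1, which is the lookup failure the quote describes.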

Signed-off-by: Michal Hocko
Acked-by: Hillf Danton
Cc: Mel Gorman
Cc: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli
Cc: David Rientjes
Acked-by: Johannes Weiner
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2012-10-09 15:22:57 +0800
027ef6c87 mm: thp: fix pmd_present for split_huge_page and PROT_NONE with THP

In many places !pmd_present has been converted to pmd_none. For pmds
that's equivalent and pmd_none is quicker so using pmd_none is better.

However (unless we delete pmd_present) we should provide an accurate
pmd_present too. This will avoid the risk of code thinking the pmd is non
present because it's under __split_huge_page_map, see the pmd_mknotpresent
there and the comment above it.

If the page has been mprotected as PROT_NONE, it would also lead to a
pmd_present false negative in the same way as the race with
split_huge_page.

Because the PSE bit stays on at all times (both during split_huge_page and
when the _PAGE_PROTNONE bit gets set), we could check only for the PSE bit,
but checking the PROTNONE bit too is still good as a reminder that
pmd_present must always take PROT_NONE into account.

This explains a non-reproducible BUG_ON that was occasionally reported on
the lists.

The same issue exists in pmd_large: it would go wrong both with PROT_NONE
and when racing with split_huge_page.
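
The false negative can be reproduced with plain bit masks. This is a user-space sketch: the EX_PAGE_* values are assumptions modelled on the x86 layout and are used here only for illustration, and the ex_* helpers are hypothetical rather than the kernel's pmd accessors.

```c
/* Assumed flag bits, for illustration only. */
#define EX_PAGE_PRESENT  0x001UL
#define EX_PAGE_PSE      0x080UL
#define EX_PAGE_PROTNONE 0x100UL

/* Check that looks only at the present bit. */
static int ex_old_pmd_present(unsigned long flags)
{
    return (flags & EX_PAGE_PRESENT) != 0;
}

/* Check that also accepts PROTNONE and PSE, as described above. */
static int ex_new_pmd_present(unsigned long flags)
{
    return (flags & (EX_PAGE_PRESENT | EX_PAGE_PROTNONE | EX_PAGE_PSE)) != 0;
}

/* Models clearing the present bit while a huge page is being split. */
static unsigned long ex_mknotpresent(unsigned long flags)
{
    return flags & ~EX_PAGE_PRESENT;
}
```

A huge pmd that is temporarily marked not-present during a split keeps its PSE bit, and a PROT_NONE huge pmd has PSE and PROTNONE set but not PRESENT. The narrow check reports both as absent, while the wider check reports them as present and still returns false for an empty entry.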

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Cc: Johannes Weiner
Cc: Hugh Dickins
Cc: Mel Gorman
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2012-10-09 15:22:57 +0800
00ea8990a memory.txt: remove stray information

Andi removed some outdated documentation from Documentation/memory.txt
back in 2009 with commit 3b2b9a875ddc ("Documentation/memory.txt: remove
some very outdated recommendations"), but the resulting document is not
in good shape either.

It seems to me like we are not losing anything by completely removing the
file now.

Signed-off-by: Jiri Kosina
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Kosina
2012-10-09 15:22:57 +0800
957f822a0 mm, numa: reclaim from all nodes within reclaim distance

RECLAIM_DISTANCE represents the distance between nodes at which it is
deemed too costly to allocate from; it's preferred to try to reclaim from
a local zone before falling back to allocating on a remote node with such
a distance.

To do this, zone_reclaim_mode is set if the distance between any two
nodes on the system is greater than this distance. This, however, ends
up causing the page allocator to reclaim from every zone regardless of
its affinity.

What we really want is to reclaim only from zones that are closer than
RECLAIM_DISTANCE. This patch adds a nodemask to each node that
represents the set of nodes that are within this distance. During the
zone iteration, if the bit for a zone's node is set for the local node,
then reclaim is attempted; otherwise, the zone is skipped.
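
The per-node mask can be sketched in user space with a distance matrix and one bitmask per node. The node count, the threshold value and all ex_* names below are assumptions chosen for the example, not the kernel's definitions.

```c
#define EX_MAX_NODES        4
#define EX_RECLAIM_DISTANCE 30   /* assumed threshold for this example */

/*
 * Build one mask per node: bit j of masks[i] is set when node j is
 * closer to node i than EX_RECLAIM_DISTANCE.
 */
static void ex_build_reclaim_masks(int dist[EX_MAX_NODES][EX_MAX_NODES],
                                   unsigned int masks[EX_MAX_NODES])
{
    for (int i = 0; i < EX_MAX_NODES; i++) {
        masks[i] = 0;
        for (int j = 0; j < EX_MAX_NODES; j++)
            if (dist[i][j] < EX_RECLAIM_DISTANCE)
                masks[i] |= 1u << j;
    }
}

/* During zone iteration: reclaim only if the zone's node is in the mask. */
static int ex_should_reclaim(const unsigned int masks[EX_MAX_NODES],
                             int local_node, int zone_node)
{
    return (int)((masks[local_node] >> zone_node) & 1u);
}
```

With two pairs of nearby nodes (distance 20 within a pair, 40 across pairs), node 0 reclaims from zones on nodes 0 and 1 and skips zones on nodes 2 and 3, instead of reclaiming from every zone.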

[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
Signed-off-by: David Rientjes
Cc: Mel Gorman
Cc: Minchan Kim
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2012-10-09 15:22:56 +0800
a0c5e813f mm: remove free_page_mlock

We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So
remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
checking it, reporting "BUG: Bad page state" if it's ever found set.
Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0.

Signed-off-by: Hugh Dickins
Acked-by: Mel Gorman
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: Michel Lespinasse
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-10-09 15:22:56 +0800
e6c509f85 mm: use clear_page_mlock() in page_remove_rmap()

We had thought that pages could no longer get freed while still marked as
mlocked; but Johannes Weiner posted this program to demonstrate that
truncating an mlocked private file mapping containing COWed pages is still
mishandled:

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	char *map;
	int fd;

	system("grep mlockfreed /proc/vmstat");
	fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR);
	unlink("chigurh");
	ftruncate(fd, 4096);
	map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
	map[0] = 11;
	mlock(map, sizeof(fd));
	ftruncate(fd, 0);
	close(fd);
	munlock(map, sizeof(fd));
	munmap(map, 4096);
	system("grep mlockfreed /proc/vmstat");
	return 0;
}

The anon COWed pages are not caught by truncation's clear_page_mlock() of
the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
look out for them there in page_remove_rmap(). Indeed, why should
truncation or invalidation be doing the clear_page_mlock() when removing
from pagecache? mlock is a property of mapping in userspace, not a
property of pagecache: an mlocked unmapped page is nonsensical.

Reported-by: Johannes Weiner
Signed-off-by: Hugh Dickins
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Michel Lespinasse
Cc: Ying Han
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-10-09 15:22:56 +0800
39b5f29ac mm: remove vma arg from page_evictable

page_evictable(page, vma) is an irritant: almost all its callers pass
NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
explicitly in the couple of places it's needed. But in those places we
don't even need page_evictable() itself! They're dealing with a freshly
allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

Signed-off-by: Hugh Dickins
Acked-by: Mel Gorman
Cc: Rik van Riel
Acked-by: Johannes Weiner
Cc: Michel Lespinasse
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-10-09 15:22:55 +0800
ec4d9f626 mm: fix invalidate_complete_page2() lock ordering

In fuzzing with trinity, lockdep protested "possible irq lock inversion
dependency detected" when isolate_lru_page() reenabled interrupts while
still holding the supposedly irq-safe tree_lock:

invalidate_inode_pages2
  invalidate_complete_page2
    spin_lock_irq(&mapping->tree_lock)
    clear_page_mlock
      isolate_lru_page
        spin_unlock_irq(&zone->lru_lock)

isolate_lru_page() is correct to enable interrupts unconditionally:
invalidate_complete_page2() is incorrect to call clear_page_mlock() while
holding tree_lock, which is supposed to nest inside lru_lock.

Both truncate_complete_page() and invalidate_complete_page() call
clear_page_mlock() before taking tree_lock to remove page from radix_tree.
I guess invalidate_complete_page2() preferred to test PageDirty (again)
under tree_lock before committing to the munlock; but since the page has
already been unmapped, its state is already somewhat inconsistent, and no
worse if clear_page_mlock() moved up.

Reported-by: Sasha Levin
Deciphered-by: Andrew Morton
Signed-off-by: Hugh Dickins
Acked-by: Mel Gorman
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: Michel Lespinasse
Cc: Ying Han
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-10-09 15:22:55 +0800
7ffc0edc4 memcg: move mem_cgroup_is_root upwards

kmem code uses this function and it is better to not use forward
declarations for static inline functions as some (older) compilers don't
like it:

gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux)

mm/memcontrol.c:421: warning: `mem_cgroup_is_root' declared inline after being called
mm/memcontrol.c:421: warning: previous declaration of `mem_cgroup_is_root' was here

Signed-off-by: Michal Hocko
Cc: Glauber Costa
Cc: Sachin Kamat
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2012-10-09 15:22:55 +0800
4bd2c1ee4 memcg: cleanup kmem tcp ifdefs

TCP kmem accounting is currently guarded by CONFIG_MEMCG_KMEM ifdefs but
the code is not used if !CONFIG_INET so we should rather test for both.
The same applies to net/sock.h, net/ip.h and net/tcp_memcontrol.h but
let's keep those outside of any ifdefs because it is considered safer wrt.
future maintainability.

Tested with
- CONFIG_INET && CONFIG_MEMCG_KMEM
- !CONFIG_INET && CONFIG_MEMCG_KMEM
- CONFIG_INET && !CONFIG_MEMCG_KMEM
- !CONFIG_INET && !CONFIG_MEMCG_KMEM

Signed-off-by: Sachin Kamat
Signed-off-by: Michal Hocko
Cc: Glauber Costa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2012-10-09 15:22:54 +0800
1939c557b memcg: trivial fixes for Documentation/cgroups/memory.txt

While reading through Documentation/cgroups/memory.txt, I found a number
of minor wordos and typos. The patch below is a conservative handling of
some of these: it provides just a number of "obviously correct" fixes to
the English that improve the readability of the document somewhat.
Obviously some more significant fixes need to be made to the document, but
some of those may not be in the "obvious correct" category.

Signed-off-by: Michael Kerrisk
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Kerrisk
2012-10-09 15:22:54 +0800
7f1290f2f mm: fix-up zone present pages

I think zone->present_pages indicates the pages that the buddy system can
manage. It should be:

zone->present_pages = spanned pages - absent pages - bootmem pages,

but is now:

zone->present_pages = spanned pages - absent pages - memmap pages.

spanned pages: total size, including holes.
absent pages: holes.
bootmem pages: pages used in system boot, managed by bootmem allocator.
memmap pages: pages used by page structs.

This may make zone->present_pages smaller than it should be. For example,
if numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, its memmap and other
bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's
present_pages should be spanned pages - absent pages, but currently the
memmap pages are also subtracted (free_area_init_core()), even though they
are actually allocated from ZONE_MOVABLE. When offlining all memory of a
zone, this makes zone->present_pages go below 0. Because present_pages is
an unsigned long, it actually becomes a very large integer. That in turn
makes zone->watermark[WMARK_MIN] a large integer
(setup_per_zone_wmarks()), then makes totalreserve_pages a large integer
(calculate_totalreserve_pages()), and finally causes memory allocation
failures when forking a process (__vm_enough_memory()).

[root@localhost ~]# dmesg
-bash: fork: Cannot allocate memory
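
The wrap-around at the root of this failure is ordinary unsigned arithmetic and can be shown in isolation. The page counts below are invented for the example, and ex_present_after_offline() is a hypothetical helper, not kernel code.

```c
#include <limits.h>

/* Subtract the offlined pages from an unsigned page count. */
static unsigned long ex_present_after_offline(unsigned long present,
                                              unsigned long offlined)
{
    return present - offlined;   /* wraps if offlined > present */
}

/* Zone size as it should be counted: spanned minus absent pages. */
static unsigned long ex_present_expected(unsigned long spanned,
                                         unsigned long absent)
{
    return spanned - absent;
}

/* Zone size with memmap pages wrongly subtracted as well. */
static unsigned long ex_present_buggy(unsigned long spanned,
                                      unsigned long absent,
                                      unsigned long memmap)
{
    return spanned - absent - memmap;
}
```

Take a zone with 1000 spanned pages, no holes, and 16 memmap pages that really live in another zone. The miscounted size is 984, so offlining all 1000 pages yields 984 - 1000, which wraps to ULONG_MAX - 15 instead of 0. Every value derived from it, such as a watermark, then becomes huge as well.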

I think the bug described in

http://marc.info/?l=linux-mm&m=134502182714186&w=2

is also caused by wrong zone present pages.

This patch intends to fix up zone->present_pages when memory is freed to
the buddy system on x86_64 and IA64 platforms.

Signed-off-by: Jianguo Wu
Signed-off-by: Jiang Liu
Reported-by: Petr Tesarik
Tested-by: Petr Tesarik
Cc: "Luck, Tony"
Cc: Mel Gorman
Cc: Yinghai Lu
Cc: Minchan Kim
Cc: Johannes Weiner
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jianguo Wu
2012-10-09 15:22:54 +0800
05106e6a5 mm: enable CONFIG_COMPACTION by default

Now that lumpy reclaim has been removed, compaction is the only way left
to free up contiguous memory areas. It is time to just enable
CONFIG_COMPACTION by default.

Signed-off-by: Rik van Riel
Cc: Mel Gorman
Acked-by: Rafael Aquini
Acked-by: Johannes Weiner
Acked-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rik van Riel
2012-10-09 15:22:53 +0800
eab1eef99 mm: thp: fix the update_mmu_cache() last argument passing in mm/huge_memory.c

The update_mmu_cache() takes a pointer (to pte_t by default) as the last
argument but the huge_memory.c passes a pmd_t value. The patch changes
the argument to the pmd_t * pointer.

Signed-off-by: Catalin Marinas
Signed-off-by: Steve Capper
Signed-off-by: Will Deacon
Cc: Arnd Bergmann
Reviewed-by: Kirill A. Shutemov
Cc: Michal Hocko
Cc: Gerald Schaefer
Reviewed-by: Andrea Arcangeli
Cc: Chris Metcalf
Cc: Ralf Baechle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Catalin Marinas
2012-10-09 15:22:53 +0800
2d28a2275 mm: thp: fix the pmd_clear() arguments in pmdp_get_and_clear()

The CONFIG_TRANSPARENT_HUGEPAGE implementation of pmdp_get_and_clear()
calls pmd_clear() with 3 arguments instead of 1.

This happens only for !__HAVE_ARCH_PMDP_GET_AND_CLEAR which doesn't seem
to happen because x86 defines this and it uses pmd_update.

[mhocko@suse.cz: changelog addition]
Signed-off-by: Catalin Marinas
Signed-off-by: Steve Capper
Signed-off-by: Will Deacon
Cc: Arnd Bergmann
Reviewed-by: Michal Hocko
Reviewed-by: Kirill A. Shutemov
Cc: Gerald Schaefer
Reviewed-by: Andrea Arcangeli
Cc: Chris Metcalf
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Catalin Marinas
2012-10-09 15:22:53 +0800
e3b4126c5 thp: khugepaged_prealloc_page() forgot to reset the page alloc indicator

If NUMA is enabled, the indicator is not reset if the previous page
request failed, causing us to trigger the BUG_ON() in
khugepaged_alloc_page().
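
The shape of the problem can be modelled with a pointer that doubles as a failure indicator. This is a sketch under assumptions: EX_ALLOC_FAILED, ex_prealloc() and ex_alloc_precondition_ok() are hypothetical names, and the allocator's BUG_ON() is modelled as a precondition that returns false instead of crashing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical marker meaning "the previous allocation attempt failed". */
static int ex_alloc_failed_marker;
#define EX_ALLOC_FAILED ((void *)&ex_alloc_failed_marker)

/*
 * Prepare for the next allocation attempt. Returns false when the caller
 * should give up for now. Allows one retry after a failure.
 */
static bool ex_prealloc(void **hpage, bool *wait)
{
    if (*hpage == EX_ALLOC_FAILED) {
        if (!*wait)
            return false;
        *wait = false;
        *hpage = NULL;   /* the reset the commit message says was missing */
    }
    return true;
}

/* Models the allocator's assertion: the slot must be empty on entry. */
static bool ex_alloc_precondition_ok(void *const *hpage)
{
    return *hpage == NULL;
}
```

If the failure marker were left in place, the next allocation attempt would start with a non-empty slot and trip the assertion. Clearing it in the preparation step keeps the retry path consistent with the first attempt.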

Signed-off-by: Xiao Guangrong
Cc: Hugh Dickins
Cc: Andrea Arcangeli
Cc: Michel Lespinasse
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xiao Guangrong
2012-10-09 15:22:52 +0800
74c08f982 memory-hotplug: don't replace lowmem pages with highmem

The changelog for commit 6a6dccba2fdc ("mm: cma: don't replace lowmem
pages with highmem") mentioned that lowmem pages can be replaced by
highmem pages during CMA migration. 6a6dccba2fdc fixed that issue.

Quote from that changelog:

: The filesystem layer expects pages in the block device's mapping to not
: be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
: currently replace lowmem pages with highmem pages, leading to crashes in
: filesystem code such as the one below:
:
: Unable to handle kernel NULL pointer dereference at virtual address 00000400
: pgd = c0c98000
: [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
: Internal error: Oops: 817 [#1] PREEMPT SMP ARM
: CPU: 0 Not tainted (3.5.0-rc5+ #80)
: PC is at __memzero+0x24/0x80
: ...
: Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
: Backtrace:
: [] (ext4_getblk+0x0/0x180) from [] (ext4_bread+0x1c/0x98)
: [] (ext4_bread+0x0/0x98) from [] (ext4_mkdir+0x160/0x3bc)
: r4:c15337f0
: [] (ext4_mkdir+0x0/0x3bc) from [] (vfs_mkdir+0x8c/0x98)
: [] (vfs_mkdir+0x0/0x98) from [] (sys_mkdirat+0x74/0xac)
: r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
: [] (sys_mkdirat+0x0/0xac) from [] (sys_mkdir+0x20/0x24)
: r6:beccdcf0 r5:00074000 r4:beccdbbc
: [] (sys_mkdir+0x0/0x24) from [] (ret_fast_syscall+0x0/0x30)

Memory-hotplug has the same problem as CMA, so the same fix can be applied
to memory-hotplug as well.

Fix it by reusing that fix.

Signed-off-by: Minchan Kim
Cc: Kamezawa Hiroyuki
Reviewed-by: Yasuaki Ishimatsu
Acked-by: Michal Nazarewicz
Cc: Marek Szyprowski
Cc: Wen Congyang
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2012-10-09 15:22:52 +0800
723a0644a mm/page_alloc: refactor out __alloc_contig_migrate_alloc()

__alloc_contig_migrate_alloc() can be used by memory-hotplug so refactor
it out (move + rename as a common name) into page_isolation.c.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Minchan Kim
Cc: Kamezawa Hiroyuki
Reviewed-by: Yasuaki Ishimatsu
Acked-by: Michal Nazarewicz
Cc: Marek Szyprowski
Cc: Wen Congyang
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2012-10-09 15:22:52 +0800
3f6d4caeb mm/hugetlb.c: remove duplicate inclusion of header file

Signed-off-by: Sachin Kamat
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sachin Kamat
2012-10-09 15:22:51 +0800
62997027c mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity

Compaction caches if a pageblock was scanned and no pages were isolated so
that the pageblocks can be skipped in the future to reduce scanning. This
information is not cleared by the page allocator based on activity due to
the impact it would have on the page allocator fast paths. Hence there is
a requirement that something clear the cache or pageblocks will be skipped
forever. Currently the cache is cleared if there were a number of recent
allocation failures and it has not been cleared within the last 5 seconds.
Time-based decisions like this are terrible as they have no relationship
to VM activity and are basically a big hammer.

Unfortunately, accurate heuristics would add cost to some hot paths so
this patch implements a rough heuristic. There are two cases where the
cache is cleared.

1. If a !kswapd process completes a compaction cycle (migrate and free
scanner meet), the zone is marked compact_blockskip_flush. When kswapd
goes to sleep, it will clear the cache. This is expected to be the
common case where the cache is cleared. It does not really matter if
kswapd happens to be asleep or going to sleep when the flag is set as
it will be woken on the next allocation request.

2. If there have been multiple failures recently and compaction just
finished being deferred then a process will clear the cache and start a
full scan. This situation happens if there are multiple high-order
allocation requests under heavy memory pressure.

The clearing of the PG_migrate_skip bits and other scans is inherently
racy but the race is harmless. For allocations that can fail such as THP,
they will simply fail. For requests that cannot fail, they will retry the
allocation. Tests indicated that scanning rates were roughly similar to
when the time-based heuristic was used and the allocation success rates
were similar.

Signed-off-by: Mel Gorman
Cc: Rik van Riel
Cc: Richard Davies
Cc: Shaohua Li
Cc: Avi Kivity
Cc: Rafael Aquini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-10-09 15:22:51 +0800
c89511ab2 mm: compaction: Restart compaction from near where it left off

This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced. When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

This patch caches where the migration and free scanner should start from
on subsequent compaction invocations using the pageblock-skip information.
When compaction starts it begins from the cached restart points and will
update the cached restart points until a page is isolated or a pageblock
is skipped that would have been scanned by synchronous compaction.

Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Richard Davies
Cc: Shaohua Li
Cc: Avi Kivity
Acked-by: Rafael Aquini
Cc: Fengguang Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-10-09 15:22:50 +0800
bb13ffeb9 mm: compaction: cache if a pageblock was scanned and no pages were isolated

When compaction was implemented it was known that scanning could
potentially be excessive. The ideal was that a counter be maintained for
each pageblock but maintaining this information would incur a severe
penalty due to a shared writable cache line. It has reached the point
where the scanning costs are a serious problem, particularly on
long-lived systems where a large process starts and allocates a large
number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock is
marked to be skipped in the future. When scanning, this bit is checked
before any scanning takes place and the block skipped if set.
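
The skip-bit idea can be illustrated with a small model. This is a sketch under assumptions, not the patch itself: `struct zone_model`, `scan_zone()`, `NR_BLOCKS` and `PAGES_PER_BLOCK` are hypothetical, and a plain `bool` array stands in for the PG_migrate_skip pageblock flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_BLOCKS 8          /* hypothetical zone size in pageblocks */
#define PAGES_PER_BLOCK 512  /* illustrative; not taken from the patch */

struct zone_model {
    int  isolatable[NR_BLOCKS]; /* pages a scanner could isolate per block */
    bool skip[NR_BLOCKS];       /* stands in for the PG_migrate_skip bit */
};

/*
 * Scan every pageblock once. Blocks with the skip bit set are passed over
 * without being scanned; blocks that yield nothing are marked for next time.
 * Returns the number of pages scanned in this pass.
 */
static long scan_zone(struct zone_model *z, long *isolated)
{
    long scanned = 0;

    assert(z != NULL && isolated != NULL);
    *isolated = 0;
    for (size_t i = 0; i < NR_BLOCKS; i++) {
        if (z->skip[i])
            continue;              /* checked before any scanning */
        scanned += PAGES_PER_BLOCK;
        if (z->isolatable[i] == 0)
            z->skip[i] = true;     /* nothing isolated: skip in future */
        else
            *isolated += z->isolatable[i];
    }
    return scanned;
}
```

If only two of the eight blocks hold isolatable pages, the first pass scans all eight blocks and the second pass scans just those two, which is the scanning reduction the bit is meant to buy.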

The main difficulty with a patch like this is "when to ignore the cached
information?" If it's ignored too often, the scanning rates will still be
excessive. If the information is too stale then allocations will fail
that might have otherwise succeeded. In this patch

o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
be discarded if it's at least 5 seconds since the last time the cache
was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a
better event. Depending solely on multiple allocation failures still
allows excessive scanning when THP allocations are failing in quick
succession due to memory pressure. Waiting until memory pressure is
relieved would cause compaction to continually fail instead of using
reclaim/compaction to try to allocate the page. The time-based mechanism is
clumsy but a better option is not obvious.

Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Richard Davies
Cc: Shaohua Li
Cc: Avi Kivity
Acked-by: Rafael Aquini
Cc: Fengguang Wu
Cc: Michal Nazarewicz
Cc: Bartlomiej Zolnierkiewicz
Cc: Kyungmin Park
Cc: Mark Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-10-09 15:22:50 +0800
753341a4b revert "mm: have order > 0 compaction start off where it left"

This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start
off where it left") and commit de74f1cc ("mm: have order > 0 compaction
start near a pageblock with free pages"). These patches were a good
idea and tests confirmed that they massively reduced the amount of
scanning but the implementation is complex and tricky to understand. A
later patch will cache what pageblocks should be skipped and
reimplements the concept of compact_cached_free_pfn on top for both
migration and free scanners.

Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Richard Davies
Cc: Shaohua Li
Cc: Avi Kivity
Acked-by: Rafael Aquini
Acked-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-10-09 15:22:50 +0800
f40d1e42b mm: compaction: acquire the zone->lock as late as possible

Compaction's free scanner acquires the zone->lock when checking for
PageBuddy pages and isolating them. It does this even if there are no
PageBuddy pages in the range.

This patch defers acquiring the zone lock for as long as possible. In the
event there are no free pages in the pageblock then the lock will not be
acquired at all which reduces contention on zone->lock.
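
A single-threaded sketch of deferred lock acquisition. `struct fake_lock` and `isolate_free_in_block()` are hypothetical names used only to count how often the lock is taken; a real concurrent implementation would presumably have to recheck the page state once the lock is held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock that only records how often it is taken. */
struct fake_lock {
    bool held;
    int  acquisitions;
};

static void fake_lock_acquire(struct fake_lock *l)
{
    assert(!l->held);
    l->held = true;
    l->acquisitions++;
}

static void fake_lock_release(struct fake_lock *l)
{
    assert(l->held);
    l->held = false;
}

/*
 * Count the free pages in one pageblock. The lock is taken only when the
 * first free page is seen, so a block with no free pages never touches it.
 */
static int isolate_free_in_block(const bool *page_is_free, size_t nr_pages,
                                 struct fake_lock *zone_lock)
{
    int isolated = 0;
    bool locked = false;

    for (size_t i = 0; i < nr_pages; i++) {
        if (!page_is_free[i])
            continue;          /* unlocked check: nothing to protect yet */
        if (!locked) {
            fake_lock_acquire(zone_lock);
            locked = true;
        }
        isolated++;
    }
    if (locked)
        fake_lock_release(zone_lock);
    return isolated;
}
```

A block with no free pages leaves `acquisitions` at zero, which is the contention saving the commit describes.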

Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Richard Davies
Cc: Shaohua Li
Cc: Avi Kivity
Acked-by: Rafael Aquini
Acked-by: Minchan Kim
Tested-by: Peter Ujfalusi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-10-09 15:22:49 +0800
2a1402aa0 mm: compaction: acquire the zone->lru_lock as late as possible

Richard Davies and Shaohua Li have both reported lock contention problems
in compaction on the zone and LRU locks as well as significant amounts of
time being spent in compaction. This series aims to reduce lock
contention and scanning rates to reduce that CPU usage. Richard reported
at https://lkml.org/lkml/2012/9/21/91 that this series made a big
difference to a problem he reported in August:

http://marc.info/?l=kvm&m=134511507015614&w=2

Patch 1 defers acquiring the zone->lru_lock as long as possible.

Patch 2 defers acquiring the zone->lock as long as possible.

Patch 3 reverts Rik's "skip-free" patches as the core concept gets
reimplemented later and the remaining patches are easier to
understand if this is reverted first.

Patch 4 adds a pageblock-skip bit to the pageblock flags to cache what
pageblocks should be skipped by the migrate and free scanners.
This drastically reduces the amount of scanning compaction has
to do.

Patch 5 reimplements something similar to Rik's idea except it uses the
pageblock-skip information to decide where the scanners should
restart from and does not need to wrap around.

I tested this on 3.6-rc6 + linux-next/akpm. Kernels tested were

akpm-20120920 3.6-rc6 + linux-next/akpm as of September 20th, 2012
lesslock Patches 1-6
revert Patches 1-7
cachefail Patches 1-8
skipuseless Patches 1-9

Stress high-order allocation tests looked ok. Success rates are more or
less the same with the full series applied but there is an expectation
that there is less opportunity to race with other allocation requests if
there is less scanning. The time to complete the tests did not vary much
and is uninteresting, as were the vmstat statistics, so I will not
present them here.

Using ftrace I recorded how much scanning was done by compaction and got this

                               3.6.0-rc6     3.6.0-rc6     3.6.0-rc6     3.6.0-rc6     3.6.0-rc6
                           akpm-20120920      lockless   revert-v2r2     cachefail   skipuseless

Total free scanned             360753976     515414028     565479007      17103281      18916589
Total free isolated              2852429       3597369       4048601        670493        727840
Total free efficiency            0.0079%       0.0070%       0.0072%       0.0392%       0.0385%
Total migrate scanned          247728664     822729112    1004645830      17946827      14118903
Total migrate isolated           2555324       3245937       3437501        616359        658616
Total migrate efficiency         0.0103%       0.0039%       0.0034%       0.0343%       0.0466%

The efficiency is worthless because of the nature of the test and the
number of failures. The really interesting point as far as this patch
series is concerned is the number of pages scanned. Note that reverting
Rik's patches massively increases the number of pages scanned indicating
that those patches really did make a difference to CPU usage.

However, caching what pageblocks should be skipped has a much higher
impact. With patches 1-8 applied, free page and migrate page scanning are
both reduced by 95% in comparison to the akpm kernel. If the basic
concept of Rik's patches is implemented on top, then the free scanner
barely changed but migrate scanning was further reduced. That
said, tests on 3.6-rc5 indicated that the last patch had greater impact
than what was measured here so it is a bit variable.
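
The scanning reduction can be recomputed from the figures in the table. This is a small sketch with hypothetical helper names (`reduction_percent()`, `isolate_ratio()`). Comparing the akpm-20120920 and cachefail columns, it gives roughly 95% for the free scanner and about 93% for the migrate scanner. The efficiency rows appear to be isolated/scanned expressed as a plain ratio, for example 2852429 / 360753976, which is about 0.0079.

```c
#include <assert.h>

/* Percentage by which `after` is smaller than `before`. */
static double reduction_percent(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (1.0 - after / before);
}

/* Isolated pages per page scanned, as a plain ratio. */
static double isolate_ratio(double isolated, double scanned)
{
    assert(scanned > 0.0);
    return isolated / scanned;
}
```

For example, `reduction_percent(360753976.0, 17103281.0)` reproduces the free-scanner comparison between the akpm kernel and patches 1-8.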

One way or the other, this series has a large impact on the amount of
scanning compaction does when there is a storm of THP allocations.

This patch:

Compaction's migrate scanner acquires the zone->lru_lock when scanning a
range of pages looking for LRU pages to acquire. It does this even if
there are no LRU pages in the range. If multiple processes are compacting
then this can cause severe locking contention. To make matters worse
commit b2eef8c0 ("mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration") releases the lru_lock every
SWAP_CLUSTER_MAX pages that are scanned.

This patch makes two changes to how the migrate scanner acquires the LRU
lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages
if the lock is contended. This reduces the number of times it
unnecessarily disables and re-enables IRQs. The second is that it defers
acquiring the LRU lock for as long as possible. If there are no LRU pages
or the only LRU pages are transhuge then the LRU lock will not be acquired
at all which reduces contention on zone->lru_lock.
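
Both changes to the migrate scanner can be sketched together in a single-threaded model. `struct lru_lock_model`, its `contended` flag and `isolate_lru_pages()` are hypothetical, and the value 32 for SWAP_CLUSTER_MAX is assumed for the sketch rather than taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SWAP_CLUSTER_MAX 32 /* value assumed for this sketch */

/* Hypothetical lock model: contention is a flag set by the caller. */
struct lru_lock_model {
    bool held;
    bool contended;   /* simulates another CPU waiting for the lock */
    int  acquisitions;
};

/*
 * Walk nr_pages pages and count LRU pages. The lock is taken lazily on the
 * first LRU page, and at each SWAP_CLUSTER_MAX boundary it is dropped only
 * if the lock is contended.
 */
static int isolate_lru_pages(const bool *page_on_lru, size_t nr_pages,
                             struct lru_lock_model *lock)
{
    int isolated = 0;

    assert(page_on_lru != NULL && lock != NULL && !lock->held);
    for (size_t i = 0; i < nr_pages; i++) {
        if (i > 0 && i % SWAP_CLUSTER_MAX == 0 &&
            lock->held && lock->contended)
            lock->held = false;         /* give waiters a chance */
        if (!page_on_lru[i])
            continue;
        if (!lock->held) {
            lock->held = true;
            lock->acquisitions++;
        }
        isolated++;
    }
    lock->held = false;
    return isolated;
}
```

Over 96 LRU pages the uncontended case takes the lock once, the contended case takes it three times, and a range with no LRU pages never takes it.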

[minchan@kernel.org: augment comment]
[akpm@linux-foundation.org: tweak comment text]
Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Richard Davies
Cc: Shaohua Li
Cc: Avi Kivity
Acked-by: Rafael Aquini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-10-09 15:22:49 +0800
661c4cb9b mm: compaction: Update try_to_compact_pages() kerneldoc comment

Parameters were added without documentation, tut tut.

Signed-off-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-10-09 15:22:49 +0800