13 Jan, 2012

1 commit

  • This is preparation for removing the PCG_ACCT_LRU flag in page_cgroup
    and reducing atomic ops/complexity in memcg LRU handling.

    In some cases, pages are added to the LRU before being charged to a
    memcg, and so are not classified into a memory cgroup at LRU addition.
    At present, the LRU where the page should be added is determined by a
    bit in page_cgroup->flags together with pc->mem_cgroup. I'd like to
    remove the check of that flag.

    To handle that case: pc->mem_cgroup may contain a stale pointer if
    pages are added to the LRU before classification. This patch resets
    pc->mem_cgroup to root_mem_cgroup before LRU additions.
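
    A minimal sketch of that reset step (the helper name
    mem_cgroup_reset_owner comes from the build-fix notes below; this body
    is an assumption, not the exact mainline code):

    static inline void mem_cgroup_reset_owner(struct page *page)
    {
            struct page_cgroup *pc;

            if (mem_cgroup_disabled())
                    return;

            pc = lookup_page_cgroup(page);
            /*
             * An uncharged page heading for the LRU may still carry a
             * stale pc->mem_cgroup from a previous owner: point it at
             * root_mem_cgroup so LRU placement stays sane.
             */
            if (!PageCgroupUsed(pc))
                    pc->mem_cgroup = root_mem_cgroup;
    }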

    [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
    [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
    [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
    [hughd@google.com: stop oops in mem_cgroup_reset_owner()]
    [hughd@google.com: fix page migration to reset_owner]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

01 Nov, 2011

1 commit

  • test_set_oom_score_adj() was introduced in 72788c385604 ("oom: replace
    PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
    current's oom_score_adj for ksm and swapoff without requiring an
    additional per-process flag.

    Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
    then reinstate the previous value is racy since it's possible that
    userspace can set the value to something else itself before the old value
    is reinstated. That results in userspace setting current's oom_score_adj
    to a different value and then the kernel immediately setting it back to
    its previous value without notification.

    To fix this, a new compare_swap_oom_score_adj() function is introduced
    with the same semantics as a compare-and-swap (CAS) instruction, such
    as CMPXCHG on x86. It is used to reinstate the previous value of
    oom_score_adj if and only if the present value is the same as the old
    value.
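
    A sketch of the helper as described (siglock serializes against
    userspace writers; the exact mainline body may differ slightly):

    void compare_swap_oom_score_adj(int old_val, int new_val)
    {
            struct sighand_struct *sighand = current->sighand;

            spin_lock_irq(&sighand->siglock);
            if (current->signal->oom_score_adj == old_val)
                    current->signal->oom_score_adj = new_val;
            spin_unlock_irq(&sighand->siglock);
    }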

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

16 Jun, 2011

1 commit

  • Andrea Righi reported a case where an exiting task can race against
    ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily
    triggering a NULL pointer dereference in ksmd.

    ksm_scan.mm_slot == &ksm_mm_head with only one registered mm

    CPU 1 (__ksm_exit)                      CPU 2 (scan_get_next_rmap_item)
                                            list_empty() is false
    lock
    slot == &ksm_mm_head
    list_del(slot->mm_list)
    (list now empty)
    unlock
                                            lock
                                            slot = list_entry(slot->mm_list.next)
                                            (list is empty, so slot is still ksm_mm_head)
                                            unlock
                                            slot->mm == NULL ... Oops

    Close this race by revalidating that the new slot is not simply the list
    head again.
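
    The fix in scan_get_next_rmap_item(), roughly (a re-check after the
    lock is retaken; comment paraphrased):

    spin_lock(&ksm_mmlist_lock);
    slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
    ksm_scan.mm_slot = slot;
    spin_unlock(&ksm_mmlist_lock);
    /*
     * Although we tested list_empty() earlier, a racing __ksm_exit
     * of the last mm on the list may have removed it since then.
     */
    if (slot == &ksm_mm_head)
            return NULL;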

    Andrea's test case:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define BUFSIZE getpagesize()

    int main(int argc, char **argv)
    {
            void *ptr;

            /* posix_memalign() returns an error number, not -1 */
            if (posix_memalign(&ptr, getpagesize(), BUFSIZE) != 0) {
                    perror("posix_memalign");
                    exit(1);
            }
            if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
                    perror("madvise");
                    exit(1);
            }
            *(char *)NULL = 0;

            return 0;
    }

    Reported-by: Andrea Righi
    Tested-by: Andrea Righi
    Cc: Andrea Arcangeli
    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 May, 2011

1 commit

  • There's a kernel-wide shortage of per-process flags, so it's always
    helpful to trim one when possible without incurring a significant penalty.
    It's even more important when you're planning on adding a per-process
    flag yourself, which I plan to do shortly for transparent hugepages.

    PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
    tendency to allocate large amounts of memory and should be preferred for
    killing over other tasks. We'd rather immediately kill the task making
    the errant syscall rather than penalizing an innocent task.

    This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
    setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.

    The process's old oom_score_adj is stored and then set to
    OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old
    value is then reinstated when the process should no longer be considered a
    high priority for oom killing.
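
    A simplified sketch of the save-and-set pattern the new helper
    provides (the mainline version carries a little more bookkeeping):

    int test_set_oom_score_adj(int new_val)
    {
            struct sighand_struct *sighand = current->sighand;
            int old_val;

            spin_lock_irq(&sighand->siglock);
            old_val = current->signal->oom_score_adj;
            current->signal->oom_score_adj = new_val;
            spin_unlock_irq(&sighand->siglock);

            return old_val; /* caller reinstates this when done */
    }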

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Izik Eidus
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

14 Jan, 2011

6 commits

  • It was hard to explain the page counts which were causing new LTP tests
    of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally.
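
    The remedy is a single call when ksmd begins another full scan; the
    placement sketched here (start of a full pass in
    scan_get_next_rmap_item()) is a paraphrase of the fix:

    if (ksm_scan.address == 0) {
            /*
             * Pages can hang around on per-cpu pagevecs, their raised
             * page count preventing write_protect_page from merging
             * them; drain them to the LRU once per full scan.
             */
            lru_add_drain_all();
    }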

    Signed-off-by: Hugh Dickins
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Clean up some code with a common compound_trans_head() helper.
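
    The helper, roughly as introduced: a THP-aware compound_head() that
    rechecks PageTail after reading first_page, since a concurrent split
    could leave the head pointer dangling (comment paraphrased):

    static inline struct page *compound_trans_head(struct page *page)
    {
            if (PageTail(page)) {
                    struct page *head = page->first_page;
                    smp_rmb();
                    /*
                     * If PageTail is still set after the barrier, the
                     * head pointer we read cannot be dangling.
                     */
                    if (PageTail(page))
                            return head;
            }
            return page;
    }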

    Signed-off-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Marcelo Tosatti
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This makes KSM fully operational with THP pages. Subpages are scanned
    while the hugepage is still in place and delivering maximum cpu
    performance, and only if there's a match and we're going to
    deduplicate memory is the hugepage containing the matching subpage
    split.

    There will be no false sharing between ksmd and khugepaged.
    khugepaged won't collapse 2m virtual regions with KSM pages inside.
    ksmd also should only split pages when the checksum matches and we're
    likely to split a hugepage for some long-lived ksm page (the usual
    ksm heuristic to avoid sharing pages that get de-cowed).
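
    A simplified sketch of the split step using only core helpers (the
    exact call site and guards in the patch differ; treat this as an
    illustration, not the patch itself):

    if (PageTransCompound(page)) {
            struct page *head = compound_trans_head(page);

            /* pin the head so a racing free can't pull it from under us */
            if (get_page_unless_zero(head)) {
                    int err = PageAnon(head) ? split_huge_page(head) : 1;
                    put_page(head);
                    if (err)
                            return; /* split failed: retry on a later scan */
            }
    }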

    Signed-off-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • It's unclear why schedule-friendly kernel threads can't simply be
    taken off the CPU by the scheduler itself. But it's safer to stop
    them, as they can trigger memory allocation; and if kswapd also
    freezes itself to avoid generating I/O, they have to as well.
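
    The shape of a freezable ksmd main loop under this change (a sketch
    following mm/ksm.c's structure; details may differ):

    static int ksm_scan_thread(void *nothing)
    {
            set_freezable();        /* let the freezer stop ksmd */
            set_user_nice(current, 5);

            while (!kthread_should_stop()) {
                    try_to_freeze();        /* park here across suspend */
                    mutex_lock(&ksm_thread_mutex);
                    if (ksmd_should_run())
                            ksm_do_scan(ksm_thread_pages_to_scan);
                    mutex_unlock(&ksm_thread_mutex);

                    if (ksmd_should_run())
                            schedule_timeout_interruptible(
                                    msecs_to_jiffies(ksm_thread_sleep_millisecs));
                    else
                            wait_event_freezable(ksm_thread_wait,
                                    ksmd_should_run() || kthread_should_stop());
            }
            return 0;
    }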

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Skip transhuge pages in ksm for now.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • When a swapcache page is replaced by a ksm page, it's best to free that
    swap immediately.

    Reported-by: Andrea Arcangeli
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

03 Dec, 2010

1 commit

  • commit 62b61f611e ("ksm: memory hotremove migration only") caused the
    following new lockdep warning.

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    -------------------------------------------------------
    bash/1621 is trying to acquire lock:
    ((memory_chain).rwsem){.+.+.+}, at: []
    __blocking_notifier_call_chain+0x69/0xc0

    but task is already holding lock:
    (ksm_thread_mutex){+.+.+.}, at: []
    ksm_memory_callback+0x3a/0xc0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (ksm_thread_mutex){+.+.+.}:
    [] lock_acquire+0xaa/0x140
    [] __mutex_lock_common+0x44/0x3f0
    [] mutex_lock_nested+0x48/0x60
    [] ksm_memory_callback+0x3a/0xc0
    [] notifier_call_chain+0x8c/0xe0
    [] __blocking_notifier_call_chain+0x7e/0xc0
    [] blocking_notifier_call_chain+0x16/0x20
    [] memory_notify+0x1b/0x20
    [] remove_memory+0x1cc/0x5f0
    [] memory_block_change_state+0xfd/0x1a0
    [] store_mem_state+0xe2/0xf0
    [] sysdev_store+0x20/0x30
    [] sysfs_write_file+0xe6/0x170
    [] vfs_write+0xc8/0x190
    [] sys_write+0x54/0x90
    [] system_call_fastpath+0x16/0x1b

    -> #0 ((memory_chain).rwsem){.+.+.+}:
    [] __lock_acquire+0x155a/0x1600
    [] lock_acquire+0xaa/0x140
    [] down_read+0x51/0xa0
    [] __blocking_notifier_call_chain+0x69/0xc0
    [] blocking_notifier_call_chain+0x16/0x20
    [] memory_notify+0x1b/0x20
    [] remove_memory+0x56e/0x5f0
    [] memory_block_change_state+0xfd/0x1a0
    [] store_mem_state+0xe2/0xf0
    [] sysdev_store+0x20/0x30
    [] sysfs_write_file+0xe6/0x170
    [] vfs_write+0xc8/0x190
    [] sys_write+0x54/0x90
    [] system_call_fastpath+0x16/0x1b

    But it's a false positive. Both memory_chain.rwsem and ksm_thread_mutex
    have an outer lock (mem_hotplug_mutex). So they cannot deadlock.

    Thus, this patch annotates that ksm_thread_mutex is not a deadlock
    source.
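
    The annotation in ksm_memory_callback(), roughly (comment paraphrased
    from the update noted below):

    case MEM_GOING_OFFLINE:
            /*
             * ksm_thread_mutex is taken inside the notifier chain here,
             * and the notifier chain inside ksm_thread_mutex on unlock;
             * that's safe because both nest inside mem_hotplug_mutex,
             * so tell lockdep with SINGLE_DEPTH_NESTING.
             */
            mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING);
            break;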

    [akpm@linux-foundation.org: update comment, from Hugh]
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

05 Oct, 2010

1 commit

  • Building under memory pressure, with KSM on 2.6.36-rc5, collapsed with
    an internal compiler error: typically indicating an error in swapping.

    Perhaps there's a timing issue which makes it now more likely, perhaps
    it's just a long time since I tried for so long: this bug goes back to
    KSM swapping in 2.6.33.

    Notice how reuse_swap_page() allows an exclusive page to be reused, but
    only does SetPageDirty if it can delete it from swap cache right then -
    if it's currently under Writeback, it has to be left in cache and we
    don't SetPageDirty, but the page can be reused. Fine, the dirty bit
    will get set in the pte; but notice how zap_pte_range() does not bother
    to transfer pte_dirty to page_dirty when unmapping a PageAnon.

    If KSM chooses to share such a page, it will look like a clean copy of
    swapcache, and not be written out to swap when its memory is needed;
    then stale data is read back from swap when it's needed again.

    We could fix this in reuse_swap_page() (or even refuse to reuse a
    page under writeback), but it's more honest to fix my oversight in
    KSM's write_protect_page(). Several days of testing on three machines
    confirms that this fixes the issue they showed.
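
    The fix in write_protect_page(), in essence: carry the pte's dirty
    bit over to the struct page before write-protecting (surrounding
    lines abbreviated):

    entry = ptep_clear_flush(vma, addr, ptep);
    /* ... page_count vs known-users check as before ... */
    if (pte_dirty(entry))
            set_page_dirty(page);   /* don't lose the dirty bit */
    entry = pte_mkclean(pte_wrprotect(entry));
    set_pte_at_notify(mm, addr, ptep, entry);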

    Signed-off-by: Hugh Dickins
    Cc: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Sep, 2010

1 commit

  • The pte_same check is reliable only if the swap entry remains pinned
    (by the page lock on swapcache). We also have to ensure the swapcache
    isn't removed before we take the page lock, as try_to_free_swap()
    won't care about the page pin.

    One of the possible impacts of this patch is that a KSM-shared page can
    point to the anon_vma of another process, which could exit before the page
    is freed.

    This can leave a page with a pointer to a recycled anon_vma object, or
    worse, a pointer to something that is no longer an anon_vma.

    [riel@redhat.com: changelog help]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

10 Aug, 2010

4 commits

  • Use statically allocated memory instead of dynamically allocated
    memory for mm_slots_hash.

    Use hash_ptr() instead of division for bucket calculation.
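
    The result, sketched: a fixed-size table and a multiplicative hash of
    the mm pointer:

    #define MM_SLOTS_HASH_SHIFT 10
    #define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)
    static struct hlist_head mm_slots_hash[MM_SLOTS_HASH_HEADS];

    /* bucket selection, replacing the old division/modulo */
    bucket = &mm_slots_hash[hash_ptr(mm, MM_SLOTS_HASH_SHIFT)];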

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Izik Eidus
    Cc: Avi Kivity
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • KSM reference counts can cause an anon_vma to exist after the process
    it belongs to has already exited. Because the anon_vma lock now lives
    in the root anon_vma, we need to ensure that the root anon_vma stays
    around until after all the "child" anon_vmas have been freed.

    The obvious way to do this is to have a "child" anon_vma take a reference
    to the root in anon_vma_fork. When the anon_vma is freed at munmap or
    process exit, we drop the refcount in anon_vma_unlink and possibly free
    the root anon_vma.

    The KSM anon_vma reference count function also needs to be modified to
    deal with the possibility of freeing 2 levels of anon_vma. The easiest
    way to do this is to break out the KSM magic and make it generic.

    When compiling without CONFIG_KSM, this code is compiled out.

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Tested-by: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Always (and only) lock the root (oldest) anon_vma whenever we do something
    in an anon_vma. The recently introduced anon_vma scalability is due to
    the rmap code scanning only the VMAs that need to be scanned. Many common
    operations still took the anon_vma lock on the root anon_vma, so always
    taking that lock is not expected to introduce any scalability issues.

    However, always taking the same lock does mean we only need to take one
    lock, which means rmap_walk on pages from any anon_vma in the vma is
    excluded from occurring during an munmap, expand_stack or other operation
    that needs to exclude rmap_walk and similar functions.

    Also add the proper locking to vma_adjust.
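
    After this change the lock wrappers take the root's lock (sketch):

    static inline void anon_vma_lock(struct anon_vma *anon_vma)
    {
            spin_lock(&anon_vma->root->lock);
    }

    static inline void anon_vma_unlock(struct anon_vma *anon_vma)
    {
            spin_unlock(&anon_vma->root->lock);
    }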

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Substitute a direct call of spin_lock(anon_vma->lock) with an inline
    function doing exactly the same.

    This makes it easier to do the substitution to the root anon_vma lock in a
    following patch.

    We will deal with the handful of special locks (nested, dec_and_lock, etc)
    separately.
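
    The wrapper as first introduced here, still taking the anon_vma's own
    lock (the switch to the root's lock happens in the preceding entry):

    static inline void anon_vma_lock(struct anon_vma *anon_vma)
    {
            spin_lock(&anon_vma->lock);
    }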

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 May, 2010

1 commit

  • For clarity of review, KSM and page migration have separate refcounts on
    the anon_vma. While clear, this is a waste of memory. This patch gets
    KSM and page migration to share their toys in a spirit of harmony.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

25 Apr, 2010

1 commit

  • The follow_page() function can potentially return -EFAULT so I added
    checks for this.

    Also I silenced an uninitialized variable warning on my version of gcc
    (version 4.3.2).
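
    The check, roughly: follow_page() can return an ERR_PTR as well as
    NULL, so callers test both (FOLL_GET as used in ksm.c's scan loop):

    page = follow_page(vma, addr, FOLL_GET);
    if (IS_ERR_OR_NULL(page))
            break;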

    Signed-off-by: Dan Carpenter
    Acked-by: Rik van Riel
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     

25 Mar, 2010

1 commit

  • ksm.c's write_protect_page implements a lockless means of verifying
    that a page has no users unaccounted for via other kernel tracking
    means. It does this by removing the writable pte with TLB flushes,
    checking the page_count against the total known users, and then using
    set_pte_at_notify to make it a read-only entry.

    An unneeded mmu_notifier callout is made in the case where the count
    of known users does not match the page_count. In that event, we are
    inserting the identical pte and there is no need for
    set_pte_at_notify; the simpler set_pte_at suffices.
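
    The relevant tail of write_protect_page() after this change, roughly:

    entry = ptep_clear_flush(vma, addr, ptep);
    if (page_mapcount(page) + 1 + swapped != page_count(page)) {
            /* re-inserting the identical pte: no notifier needed */
            set_pte_at(mm, addr, ptep, entry);
            goto out_unlock;
    }
    entry = pte_wrprotect(entry);
    set_pte_at_notify(mm, addr, ptep, entry);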

    Signed-off-by: Robin Holt
    Acked-by: Izik Eidus
    Acked-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

07 Mar, 2010

1 commit

  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach into the tens
    of thousands. Real workloads are still a factor of 10 less
    process-intensive than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.
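
    The linking object at the heart of the change (as declared in the
    patch; lock comments paraphrased):

    struct anon_vma_chain {
            struct vm_area_struct *vma;
            struct anon_vma *anon_vma;
            struct list_head same_vma;      /* locked by mmap_sem & page_table_lock */
            struct list_head same_anon_vma; /* locked by anon_vma->lock */
    };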

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, and there never seems to be any
    spike in system time. The anon_vma lock contention appears to be
    resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

16 Dec, 2009

14 commits

  • Now that ksm pages are swappable, and the known holes plugged, remove
    mention of unswappable kernel pages from KSM documentation and comments.

    Remove the totalram_pages/4 initialization of max_kernel_pages. In fact,
    remove max_kernel_pages altogether - we can reinstate it if removal turns
    out to break someone's script; but if we later want to limit KSM's memory
    usage, limiting the stable nodes would not be an effective approach.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The previous patch enables page migration of ksm pages, but that soon gets
    into trouble: not surprising, since we're using the ksm page lock to lock
    operations on its stable_node, but page migration switches the page whose
    lock is to be used for that. Another layer of locking would fix it, but
    do we need that yet?

    Do we actually need page migration of ksm pages? Yes, memory hotremove
    needs to offline sections of memory: and since we stopped allocating ksm
    pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
    candidates for migration.

    But KSM is currently unconscious of NUMA issues, happily merging pages
    from different NUMA nodes: at present the rule must be, not to use
    MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration of
    ksm pages does not make sense yet.

    So, to complete support for ksm swapping we need to make hotremove
    safe. ksm_memory_callback() takes ksm_thread_mutex when
    MEM_GOING_OFFLINE and releases it when MEM_OFFLINE or
    MEM_CANCEL_OFFLINE. But if mapped pages are freed before migration
    reaches them, stable_nodes may be left still pointing to struct pages
    which have been removed from the system: the stable_node needs to
    identify a page by pfn rather than by page pointer, so it can safely
    prune them when MEM_OFFLINE.

    And make NUMA migration skip PageKsm pages where it skips PageReserved.
    But it's only when we reach unmap_and_move() that the page lock is taken
    and we can be sure that raised pagecount has prevented a PageAnon from
    being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
    page when offlining (has sufficient locking) but reject it otherwise.
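
    The hotplug callback's shape under this scheme (a sketch; error paths
    and exact details may differ):

    static int ksm_memory_callback(struct notifier_block *self,
                                   unsigned long action, void *arg)
    {
            struct memory_notify *mn = arg;

            switch (action) {
            case MEM_GOING_OFFLINE:
                    mutex_lock(&ksm_thread_mutex);  /* hold ksmd off */
                    break;
            case MEM_OFFLINE:
                    /* stable_nodes know pfns, so stale ones can be pruned */
                    ksm_check_stable_tree(mn->start_pfn,
                                          mn->start_pfn + mn->nr_pages);
                    /* fallthrough */
            case MEM_CANCEL_OFFLINE:
                    mutex_unlock(&ksm_thread_mutex);
                    break;
            }
            return NOTIFY_OK;
    }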

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A side-effect of making ksm pages swappable is that they have to be placed
    on the LRUs: which then exposes them to isolate_lru_page() and hence to
    page migration.

    Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
    rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
    consolidation with existing code is possible, but don't attempt that yet
    (try_to_unmap needs to handle nonlinears, but migration pte removal does
    not).

    rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
    remove_anon_migration_ptes() which it replaces, avoids calling
    page_lock_anon_vma(), because that includes a page_mapped() test which
    fails when all migration ptes are in place. That was valid when NUMA page
    migration was introduced (holding mmap_sem provided the missing guarantee
    that anon_vma's slab had not already been destroyed), but I believe not
    valid in the memory hotremove case added since.

    For now do the same as before, and consider the best way to fix that
    unlikely race later on. When fixed, we can probably use rmap_walk() on
    hwpoisoned ksm pages too: for now, they remain among hwpoison's various
    exceptions (its PageKsm test comes before the page is locked, but its
    page_lock_anon_vma fails safely if an anon gets upgraded).
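
    The dispatch itself is simple (sketch following the description
    above):

    int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
                    struct vm_area_struct *, unsigned long, void *),
                    void *arg)
    {
            VM_BUG_ON(!PageLocked(page));

            if (unlikely(PageKsm(page)))
                    return rmap_walk_ksm(page, rmap_one, arg);
            else if (PageAnon(page))
                    return rmap_walk_anon(page, rmap_one, arg);
            else
                    return rmap_walk_file(page, rmap_one, arg);
    }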

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When ksm pages were unswappable, it made no sense to include them in mem
    cgroup accounting; but now that they are swappable (although I see no
    strict logical connection) the principle of least surprise implies that
    they should be accounted (with the usual dissatisfaction, that a shared
    page is accounted to only one of the cgroups using it).

    This patch was intended to add mem cgroup accounting where necessary; but
    turned inside out, it now avoids allocating a ksm page, instead upgrading
    an anon page to ksm - which brings its existing mem cgroup accounting with
    it. Thus mem cgroups don't appear in the patch at all.

    This upgrade from PageAnon to PageKsm takes place under page lock (via a
    somewhat hacky NULL kpage interface), and audit showed only one place
    which needed to cope with the race - page_referenced() is sometimes used
    without page lock, so page_lock_anon_vma() needs an ACCESS_ONCE() to be
    sure of getting anon_vma and flags together (no problem if the page goes
    ksm an instant after, the integrity of that anon_vma list is unaffected).
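
    The one audited fix, roughly: read page->mapping once so that flags
    and pointer come from the same snapshot:

    anon_mapping = (unsigned long)ACCESS_ONCE(page->mapping);
    if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
            goto out;
    anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);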

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's a lamentable flaw in KSM swapping: the stable_node holds a
    reference to the ksm page, so the page to be freed cannot actually be
    freed until ksmd works its way around to removing the last rmap_item from
    its stable_node. Which in some configurations may take minutes: not quite
    responsive enough for memory reclaim. And we don't want to twist KSM and
    its locking more tightly into the rest of mm. What a pity.

    But although the stable_node needs to hold a pointer to the ksm page, does
    it actually need to raise the reference count of that page?

    No. It would need to do so if struct pages were ordinary kmalloc'ed
    objects; but they are more stable than that, and reused in particular ways
    according to particular rules.

    Access to stable_node from its pointer in struct page is no problem, so
    long as we never free a stable_node before the ksm page itself has been
    freed. Access to struct page from its pointer in stable_node: reintroduce
    get_ksm_page(), and let that peep out through its keyhole (the stable_node
    pointer to ksm page), to see if that struct page still holds the right key
    to open it (the ksm page mapping pointer back to this stable_node).

    This relies upon the established way in which free_hot_cold_page() sets an
    anon (including ksm) page->mapping to NULL; and relies upon no other user
    of a struct page to put something which looks like the original
    stable_node pointer (with two low bits also set) into page->mapping. It
    also needs get_page_unless_zero() technique pioneered by speculative
    pagecache; and uses rcu_read_lock() to keep the guarantees that gives.

    There are several drivers which put pointers of their own into page->
    mapping; but none of those could coincide with our stable_node pointers,
    since KSM won't free a stable_node until it sees that the page has gone.

    The only problem case found is the pagetable spinlock USE_SPLIT_PTLOCKS
    places in struct page (my own abuse): to accommodate GENERIC_LOCKBREAK's
    break_lock on 32-bit, that spans both page->private and page->mapping.
    Since break_lock is only 0 or 1, again no confusion for get_ksm_page().

    But what of DEBUG_SPINLOCK on 64-bit bigendian? When owner_cpu is 3
    (matching PageKsm low bits), it might see 0xdead4ead00000003 in page->
    mapping, which might coincide? We could get around that by... but a
    better answer is to suppress USE_SPLIT_PTLOCKS when DEBUG_SPINLOCK or
    DEBUG_LOCK_ALLOC, to stop bloating sizeof(struct page) in their case -
    already proposed in an earlier mm/Kconfig patch.
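
    The keyhole itself, sketched (mainline rechecks the mapping after
    taking the reference):

    static struct page *get_ksm_page(struct stable_node *stable_node)
    {
            struct page *page = stable_node->page;
            void *expected_mapping = (void *)stable_node +
                            (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);

            rcu_read_lock();
            if (page->mapping != expected_mapping)
                    goto stale;
            if (!get_page_unless_zero(page))
                    goto stale;
            if (page->mapping != expected_mapping) {
                    put_page(page);
                    goto stale;
            }
            rcu_read_unlock();
            return page;
    stale:
            rcu_read_unlock();
            remove_node_from_stable_tree(stable_node);
            return NULL;
    }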

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • For full functionality, page_referenced_one() and try_to_unmap_one() need
    to know the vma: to pass vma down to arch-dependent flushes, or to observe
    VM_LOCKED or VM_EXEC. But KSM keeps no record of vma: nor can it, since
    vmas get split and merged without its knowledge.

    Instead, note page's anon_vma in its rmap_item when adding to stable tree:
    all the vmas which might map that page are listed by its anon_vma.

    page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma,
    first to find the probable vma, that which matches rmap_item's mm; but if
    that is not enough to locate all instances, traverse again to try the
    others. This catches those occasions when fork has duplicated a pte of a
    ksm page, but ksmd has not yet come around to assign it an rmap_item.

    But each rmap_item in the stable tree which refers to an anon_vma needs to
    take a reference to it. Andrea's anon_vma design cleverly avoided a
    reference count (an anon_vma was free when its list of vmas was empty),
    but KSM now needs to add that. Is a 32-bit count sufficient? I believe
    so - the anon_vma is only free when both count is 0 and list is empty.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Initial implementation for swapping out KSM's shared pages: add
    page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
    faced with a PageKsm page.

    Most of what's needed can be got from the rmap_items listed from the
    stable_node of the ksm page, without discovering the actual vma: so in
    this patch just fake up a struct vma for page_referenced_one() or
    try_to_unmap_one(), then refine that in the next patch.

    Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
    implicit there (being only set with VM_SHARED, already excluded), but
    let's make it explicit, to help justify the lack of nonlinear unmap.

    Rely on the page lock to protect against concurrent modifications to that
    page's node of the stable tree.

    The awkward part is not swapout but swapin: do_swap_page() and
    page_add_anon_rmap() now have to allow for new possibilities - perhaps a
    ksm page still in swapcache, perhaps a swapcache page associated with one
    location in one anon_vma now needed for another location or anon_vma.
    (And the vma might even be no longer VM_MERGEABLE when that happens.)

    ksm_might_need_to_copy() checks for that case, and supplies a duplicate
    page when necessary, simply leaving it to a subsequent pass of ksmd to
    rediscover the identity and merge them back into one ksm page.
    Disappointingly primitive: but the alternative would have to accumulate
    unswappable info about the swapped out ksm pages, limiting swappability.

    Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
    particular case it was handling, so just use it instead.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When KSM merges an mlocked page, it has been forgetting to munlock it:
    that's been left to free_page_mlock(), which reports it in /proc/vmstat as
    unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
    whinges "Page flag mlocked set for process" in mmotm, whereas mainline is
    silently forgiving). Call munlock_vma_page() to fix that.
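
    The fix in try_to_merge_one_page(), roughly (the mainline version
    also moves the mlock over to the ksm page; abbreviated here):

    if ((vma->vm_flags & VM_LOCKED) && kpage && !err)
            munlock_vma_page(page);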

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Add a pointer to the ksm page into struct stable_node, holding a reference
    to the page while the node exists. Put a pointer to the stable_node into
    the ksm page's ->mapping.

    Then we don't need get_ksm_page() while traversing the stable tree: the
    page to compare against is sure to be present and correct, even if it's no
    longer visible through any of its existing rmap_items.

    And we can handle the forked ksm page case more efficiently: no need to
    memcmp our way through the tree to find its match.
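
    The back-pointer encoding, sketched (PAGE_MAPPING_KSM set alongside
    PAGE_MAPPING_ANON marks the page as ksm):

    static inline void set_page_stable_node(struct page *page,
                                            struct stable_node *stable_node)
    {
            page->mapping = (void *)stable_node +
                            (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
    }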

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Though we still do well to keep rmap_items in the unstable tree without a
    separate tree_item at the node, for several reasons it becomes awkward to
    keep rmap_items in the stable tree without a separate stable_node: lack of
    space in the nicely-sized rmap_item, the need for an anchor as rmap_items
    are removed, the need for a node even when temporarily no rmap_items are
    attached to it.

    So declare struct stable_node (rb_node to place it in the tree and
    hlist_head for the rmap_items hanging off it), and convert stable tree
    handling to use it: without yet taking advantage of it. Note how one
    stable_tree_insert() of a node now has _two_ stable_tree_append()s of the
    two rmap_items being merged.
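
    The new node, as declared at this point in the series (the page
    back-pointer arrives in the entry above):

    struct stable_node {
            struct rb_node node;            /* place in the stable rbtree */
            struct hlist_head hlist;        /* rmap_items hanging off it */
    };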

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Free up a pointer in struct rmap_item, by making the mm_slot's rmap_list a
    singly-linked list: we always traverse that list sequentially, and we
    don't even lose any prefetches (but should consider adding a few later).
    Name it rmap_list throughout.

    Do we need to free up that pointer? Not immediately, and in the end, we
    could continue to avoid it with a union; but having done the conversion,
    let's keep it this way, since there's no downside, and maybe we'll want
    more in future (struct rmap_item is a cache-friendly 32 bytes on 32-bit
    and 64 bytes on 64-bit, so we shall want to avoid expanding it).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Cleanup: make argument names more consistent from cmp_and_merge_page()
    down to replace_page(), so that it's easier to follow the rmap_item's page
    and the matching tree_page and the merged kpage through that code.

    In some places, e.g. break_cow(), pass rmap_item instead of separate mm
    and address.

    cmp_and_merge_page() initializes tree_page to NULL, to avoid a "may be
    used uninitialized" warning seen in one config by Anil SB.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There is no need for replace_page() to calculate a write-protected
    prot: vm_page_prot must already be write-protected for an anonymous
    page (see mm/memory.c do_anonymous_page() for similar reliance on
    vm_page_prot).

    There is no need for try_to_merge_one_page() to get_page and put_page on
    newpage and oldpage: in every case we already hold a reference to each of
    them.

    But some instinct makes me move try_to_merge_one_page()'s unlock_page of
    oldpage down after replace_page(): that doesn't increase contention on the
    ksm page, and makes thinking about the transition easier.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 1. remove_rmap_item_from_tree() is called as a precaution from
    various places: don't dirty the rmap_item cacheline unnecessarily,
    just mask the flags out of the address when they have been set.

    2. First get_next_rmap_item() removes an unstable rmap_item from its tree,
    then shortly afterwards cmp_and_merge_page() removes a stable rmap_item
    from its tree: it's easier just to do both at once (but definitely keep
    the BUG_ON(age > 1) which guards against a future omission).

    3. When cmp_and_merge_page() moves an rmap_item from unstable to stable
    tree, it does its own rb_erase() and accounting: that's better
    expressed by remove_rmap_item_from_tree().

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Nov, 2009

1 commit

  • KSM needs a cond_resched() for CONFIG_PREEMPT_NONE, in its unbounded
    search of the unstable tree. The stable tree cases already have one,
    and originally there was one down inside get_user_pages();
    but I missed it when I converted to follow_page() instead.
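
    The fix is one line in the unstable tree walk (context abbreviated):

    while (*new) {
            struct rmap_item *tree_rmap_item;
            int ret;

            cond_resched();         /* unbounded walk: stay preemptible */
            /* ... memcmp the pages and descend as before ... */
    }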

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

08 Oct, 2009

1 commit

  • Adjust the max_kernel_pages default to a quarter of totalram_pages,
    instead of nr_free_buffer_pages() / 4: the KSM pages themselves come from
    highmem, and even on a 16GB PAE machine, 4GB of KSM pages would only be
    pinning 32MB of lowmem with their rmap_items, so no need for the more
    obscure calculation (nor for its own special init function).

    There is no way for the user to switch KSM on if CONFIG_SYSFS is not
    enabled, so in that case default "run" to KSM_RUN_MERGE.

    Update KSM Documentation and Kconfig to reflect the new defaults.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 Sep, 2009

1 commit

  • Now that ksm is in mainline it is better to change the default values
    to better fit most users.

    This patch changes the ksm default values to be:

    ksm_thread_pages_to_scan = 100 (instead of 200)
    ksm_thread_sleep_millisecs = 20 (like before)
    ksm_run = KSM_RUN_STOP (instead of KSM_RUN_MERGE - meaning ksm is
    disabled by default)
    ksm_max_kernel_pages = nr_free_buffer_pages / 4 (instead of 2046)

    The important aspect of this patch is that it disables ksm by default,
    and sets the number of kernel pages that can be allocated to a
    reasonable number.

    Signed-off-by: Izik Eidus
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Izik Eidus