13 Jan, 2012

1 commit

  • This is preparation for removing the PCG_ACCT_LRU flag in page_cgroup
    and reducing atomic ops/complexity in memcg LRU handling.

    In some cases, pages are added to the LRU before being charged to a
    memcg, and so are not classified into a memory cgroup at LRU addition.
    At present, the LRU where the page should be added is determined by a
    bit in page_cgroup->flags together with pc->mem_cgroup. I'd like to
    remove the check of that flag.

    To handle that case: pc->mem_cgroup may contain a stale pointer if
    pages are added to the LRU before classification. This patch resets
    pc->mem_cgroup to root_mem_cgroup before LRU additions.
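
    A minimal sketch of that reset step (the helper name
    mem_cgroup_reset_owner comes from the build-fix notes below; this body
    is an assumption, not the exact mainline code):

    static inline void mem_cgroup_reset_owner(struct page *page)
    {
            struct page_cgroup *pc;

            if (mem_cgroup_disabled())
                    return;

            pc = lookup_page_cgroup(page);
            /*
             * An uncharged page heading for the LRU may still carry a
             * stale pc->mem_cgroup from a previous owner: point it at
             * root_mem_cgroup so LRU placement stays sane.
             */
            if (!PageCgroupUsed(pc))
                    pc->mem_cgroup = root_mem_cgroup;
    }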

    [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
    [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
    [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
    [hughd@google.com: stop oops in mem_cgroup_reset_owner()]
    [hughd@google.com: fix page migration to reset_owner]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

01 Nov, 2011

1 commit

  • test_set_oom_score_adj() was introduced in 72788c385604 ("oom: replace
    PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
    current's oom_score_adj for ksm and swapoff without requiring an
    additional per-process flag.

    Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
    then reinstate the previous value is racy since it's possible that
    userspace can set the value to something else itself before the old value
    is reinstated. That results in userspace setting current's oom_score_adj
    to a different value and then the kernel immediately setting it back to
    its previous value without notification.

    To fix this, a new compare_swap_oom_score_adj() function is introduced
    with the same semantics as a compare-and-swap (CAS) instruction, such
    as CMPXCHG on x86. It is used to reinstate the previous value of
    oom_score_adj if and only if the present value is the same as the old
    value.
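
    A sketch of the helper as described (siglock serializes against
    userspace writers; the exact mainline body may differ slightly):

    void compare_swap_oom_score_adj(int old_val, int new_val)
    {
            struct sighand_struct *sighand = current->sighand;

            spin_lock_irq(&sighand->siglock);
            if (current->signal->oom_score_adj == old_val)
                    current->signal->oom_score_adj = new_val;
            spin_unlock_irq(&sighand->siglock);
    }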

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

16 Jun, 2011

1 commit

  • Andrea Righi reported a case where an exiting task can race against
    ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily
    triggering a NULL pointer dereference in ksmd.

    ksm_scan.mm_slot == &ksm_mm_head with only one registered mm

    CPU 1 (__ksm_exit)                      CPU 2 (scan_get_next_rmap_item)
                                            list_empty() is false
    lock
    slot == &ksm_mm_head
    list_del(slot->mm_list)
    (list now empty)
    unlock
                                            lock
                                            slot = list_entry(slot->mm_list.next)
                                            (list is empty, so slot is still ksm_mm_head)
                                            unlock
                                            slot->mm == NULL ... Oops

    Close this race by revalidating that the new slot is not simply the list
    head again.
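
    The fix in scan_get_next_rmap_item(), roughly (a re-check after the
    lock is retaken; comment paraphrased):

    spin_lock(&ksm_mmlist_lock);
    slot = list_entry(slot->mm_list.next, struct mm_slot, mm_list);
    ksm_scan.mm_slot = slot;
    spin_unlock(&ksm_mmlist_lock);
    /*
     * Although we tested list_empty() earlier, a racing __ksm_exit
     * of the last mm on the list may have removed it since then.
     */
    if (slot == &ksm_mm_head)
            return NULL;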

    Andrea's test case:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define BUFSIZE getpagesize()

    int main(int argc, char **argv)
    {
            void *ptr;

            /* posix_memalign() returns an error number, not -1 */
            if (posix_memalign(&ptr, getpagesize(), BUFSIZE) != 0) {
                    perror("posix_memalign");
                    exit(1);
            }
            if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
                    perror("madvise");
                    exit(1);
            }
            *(char *)NULL = 0;

            return 0;
    }

    Reported-by: Andrea Righi
    Tested-by: Andrea Righi
    Cc: Andrea Arcangeli
    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 May, 2011

1 commit

  • There's a kernel-wide shortage of per-process flags, so it's always
    helpful to trim one when possible without incurring a significant penalty.
    It's even more important when you're planning on adding a per-process
    flag yourself, which I plan to do shortly for transparent hugepages.

    PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
    tendency to allocate large amounts of memory and should be preferred for
    killing over other tasks. We'd rather immediately kill the task making
    the errant syscall rather than penalizing an innocent task.

    This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
    setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.

    The process's old oom_score_adj is stored and then set to
    OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old
    value is then reinstated when the process should no longer be considered a
    high priority for oom killing.
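
    A simplified sketch of the save-and-set pattern the new helper
    provides (the mainline version carries a little more bookkeeping):

    int test_set_oom_score_adj(int new_val)
    {
            struct sighand_struct *sighand = current->sighand;
            int old_val;

            spin_lock_irq(&sighand->siglock);
            old_val = current->signal->oom_score_adj;
            current->signal->oom_score_adj = new_val;
            spin_unlock_irq(&sighand->siglock);

            return old_val; /* caller reinstates this when done */
    }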

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Izik Eidus
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

14 Jan, 2011

6 commits

  • It was hard to explain the page counts which were causing new LTP tests
    of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally.
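
    The remedy is a single call when ksmd begins another full scan; the
    placement sketched here (start of a full pass in
    scan_get_next_rmap_item()) is a paraphrase of the fix:

    if (ksm_scan.address == 0) {
            /*
             * Pages can hang around on per-cpu pagevecs, their raised
             * page count preventing write_protect_page from merging
             * them; drain them to the LRU once per full scan.
             */
            lru_add_drain_all();
    }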

    Signed-off-by: Hugh Dickins
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Clean up some code with a common compound_trans_head() helper.
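
    The helper, roughly as introduced: a THP-aware compound_head() that
    rechecks PageTail after reading first_page, since a concurrent split
    could leave the head pointer dangling (comment paraphrased):

    static inline struct page *compound_trans_head(struct page *page)
    {
            if (PageTail(page)) {
                    struct page *head = page->first_page;
                    smp_rmb();
                    /*
                     * If PageTail is still set after the barrier, the
                     * head pointer we read cannot be dangling.
                     */
                    if (PageTail(page))
                            return head;
            }
            return page;
    }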

    Signed-off-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Marcelo Tosatti
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This makes KSM fully operational with THP pages. Subpages are scanned
    while the hugepage is still in place and delivering maximum cpu
    performance, and only if there's a match and we're going to
    deduplicate memory is the hugepage containing the matching subpage
    split.

    There will be no false sharing between ksmd and khugepaged.
    khugepaged won't collapse 2m virtual regions with KSM pages inside.
    ksmd also should only split pages when the checksum matches and we're
    likely to split a hugepage for some long-lived ksm page (the usual
    ksm heuristic to avoid sharing pages that get de-cowed).
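
    A simplified sketch of the split step using only core helpers (the
    exact call site and guards in the patch differ; treat this as an
    illustration, not the patch itself):

    if (PageTransCompound(page)) {
            struct page *head = compound_trans_head(page);

            /* pin the head so a racing free can't pull it from under us */
            if (get_page_unless_zero(head)) {
                    int err = PageAnon(head) ? split_huge_page(head) : 1;
                    put_page(head);
                    if (err)
                            return; /* split failed: retry on a later scan */
            }
    }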

    Signed-off-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • It's unclear why schedule-friendly kernel threads can't simply be
    taken off the CPU by the scheduler itself. But it's safer to stop
    them, as they can trigger memory allocation; and if kswapd also
    freezes itself to avoid generating I/O, they have to as well.
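
    The shape of a freezable ksmd main loop under this change (a sketch
    following mm/ksm.c's structure; details may differ):

    static int ksm_scan_thread(void *nothing)
    {
            set_freezable();        /* let the freezer stop ksmd */
            set_user_nice(current, 5);

            while (!kthread_should_stop()) {
                    try_to_freeze();        /* park here across suspend */
                    mutex_lock(&ksm_thread_mutex);
                    if (ksmd_should_run())
                            ksm_do_scan(ksm_thread_pages_to_scan);
                    mutex_unlock(&ksm_thread_mutex);

                    if (ksmd_should_run())
                            schedule_timeout_interruptible(
                                    msecs_to_jiffies(ksm_thread_sleep_millisecs));
                    else
                            wait_event_freezable(ksm_thread_wait,
                                    ksmd_should_run() || kthread_should_stop());
            }
            return 0;
    }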

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Skip transhuge pages in ksm for now.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • When a swapcache page is replaced by a ksm page, it's best to free that
    swap immediately.

    Reported-by: Andrea Arcangeli
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

03 Dec, 2010

1 commit

  • commit 62b61f611e ("ksm: memory hotremove migration only") caused the
    following new lockdep warning.

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    -------------------------------------------------------
    bash/1621 is trying to acquire lock:
    ((memory_chain).rwsem){.+.+.+}, at: []
    __blocking_notifier_call_chain+0x69/0xc0

    but task is already holding lock:
    (ksm_thread_mutex){+.+.+.}, at: []
    ksm_memory_callback+0x3a/0xc0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (ksm_thread_mutex){+.+.+.}:
    [] lock_acquire+0xaa/0x140
    [] __mutex_lock_common+0x44/0x3f0
    [] mutex_lock_nested+0x48/0x60
    [] ksm_memory_callback+0x3a/0xc0
    [] notifier_call_chain+0x8c/0xe0
    [] __blocking_notifier_call_chain+0x7e/0xc0
    [] blocking_notifier_call_chain+0x16/0x20
    [] memory_notify+0x1b/0x20
    [] remove_memory+0x1cc/0x5f0
    [] memory_block_change_state+0xfd/0x1a0
    [] store_mem_state+0xe2/0xf0
    [] sysdev_store+0x20/0x30
    [] sysfs_write_file+0xe6/0x170
    [] vfs_write+0xc8/0x190
    [] sys_write+0x54/0x90
    [] system_call_fastpath+0x16/0x1b

    -> #0 ((memory_chain).rwsem){.+.+.+}:
    [] __lock_acquire+0x155a/0x1600
    [] lock_acquire+0xaa/0x140
    [] down_read+0x51/0xa0
    [] __blocking_notifier_call_chain+0x69/0xc0
    [] blocking_notifier_call_chain+0x16/0x20
    [] memory_notify+0x1b/0x20
    [] remove_memory+0x56e/0x5f0
    [] memory_block_change_state+0xfd/0x1a0
    [] store_mem_state+0xe2/0xf0
    [] sysdev_store+0x20/0x30
    [] sysfs_write_file+0xe6/0x170
    [] vfs_write+0xc8/0x190
    [] sys_write+0x54/0x90
    [] system_call_fastpath+0x16/0x1b

    But it's a false positive. Both memory_chain.rwsem and ksm_thread_mutex
    have an outer lock (mem_hotplug_mutex). So they cannot deadlock.

    Thus, this patch annotates that ksm_thread_mutex is not a deadlock
    source.
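
    The annotation in ksm_memory_callback(), roughly (comment paraphrased
    from the update noted below):

    case MEM_GOING_OFFLINE:
            /*
             * ksm_thread_mutex is taken inside the notifier chain here,
             * and the notifier chain inside ksm_thread_mutex on unlock;
             * that's safe because both nest inside mem_hotplug_mutex,
             * so tell lockdep with SINGLE_DEPTH_NESTING.
             */
            mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING);
            break;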

    [akpm@linux-foundation.org: update comment, from Hugh]
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

05 Oct, 2010

1 commit

  • Building under memory pressure, with KSM on 2.6.36-rc5, collapsed with
    an internal compiler error: typically indicating an error in swapping.

    Perhaps there's a timing issue which makes it now more likely, perhaps
    it's just a long time since I tried for so long: this bug goes back to
    KSM swapping in 2.6.33.

    Notice how reuse_swap_page() allows an exclusive page to be reused, but
    only does SetPageDirty if it can delete it from swap cache right then -
    if it's currently under Writeback, it has to be left in cache and we
    don't SetPageDirty, but the page can be reused. Fine, the dirty bit
    will get set in the pte; but notice how zap_pte_range() does not bother
    to transfer pte_dirty to page_dirty when unmapping a PageAnon.

    If KSM chooses to share such a page, it will look like a clean copy of
    swapcache, and not be written out to swap when its memory is needed;
    then stale data is read back from swap when it's needed again.

    We could fix this in reuse_swap_page() (or even refuse to reuse a
    page under writeback), but it's more honest to fix my oversight in
    KSM's write_protect_page(). Several days of testing on three machines
    confirms that this fixes the issue they showed.
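
    The fix in write_protect_page(), in essence: carry the pte's dirty
    bit over to the struct page before write-protecting (surrounding
    lines abbreviated):

    entry = ptep_clear_flush(vma, addr, ptep);
    /* ... page_count vs known-users check as before ... */
    if (pte_dirty(entry))
            set_page_dirty(page);   /* don't lose the dirty bit */
    entry = pte_mkclean(pte_wrprotect(entry));
    set_pte_at_notify(mm, addr, ptep, entry);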

    Signed-off-by: Hugh Dickins
    Cc: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Sep, 2010

1 commit

  • The pte_same check is reliable only if the swap entry remains pinned
    (by the page lock on swapcache). We also have to ensure the swapcache
    isn't removed before we take the page lock, as try_to_free_swap()
    won't care about the page pin.

    One of the possible impacts of this patch is that a KSM-shared page can
    point to the anon_vma of another process, which could exit before the page
    is freed.

    This can leave a page with a pointer to a recycled anon_vma object, or
    worse, a pointer to something that is no longer an anon_vma.

    [riel@redhat.com: changelog help]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

10 Aug, 2010

4 commits

  • Use statically allocated memory instead of dynamically allocated
    memory for mm_slots_hash.

    Use hash_ptr() instead of division for bucket calculation.
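
    The result, sketched: a fixed-size table and a multiplicative hash of
    the mm pointer:

    #define MM_SLOTS_HASH_SHIFT 10
    #define MM_SLOTS_HASH_HEADS (1 << MM_SLOTS_HASH_SHIFT)
    static struct hlist_head mm_slots_hash[MM_SLOTS_HASH_HEADS];

    /* bucket selection, replacing the old division/modulo */
    bucket = &mm_slots_hash[hash_ptr(mm, MM_SLOTS_HASH_SHIFT)];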

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Izik Eidus
    Cc: Avi Kivity
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • KSM reference counts can cause an anon_vma to exist after the process
    it belongs to has already exited. Because the anon_vma lock now lives
    in the root anon_vma, we need to ensure that the root anon_vma stays
    around until after all the "child" anon_vmas have been freed.

    The obvious way to do this is to have a "child" anon_vma take a reference
    to the root in anon_vma_fork. When the anon_vma is freed at munmap or
    process exit, we drop the refcount in anon_vma_unlink and possibly free
    the root anon_vma.

    The KSM anon_vma reference count function also needs to be modified to
    deal with the possibility of freeing 2 levels of anon_vma. The easiest
    way to do this is to break out the KSM magic and make it generic.

    When compiling without CONFIG_KSM, this code is compiled out.

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Tested-by: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Always (and only) lock the root (oldest) anon_vma whenever we do something
    in an anon_vma. The recently introduced anon_vma scalability is due to
    the rmap code scanning only the VMAs that need to be scanned. Many common
    operations still took the anon_vma lock on the root anon_vma, so always
    taking that lock is not expected to introduce any scalability issues.

    However, always taking the same lock does mean we only need to take one
    lock, which means rmap_walk on pages from any anon_vma in the vma is
    excluded from occurring during an munmap, expand_stack or other operation
    that needs to exclude rmap_walk and similar functions.

    Also add the proper locking to vma_adjust.
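
    After this change the lock wrappers take the root's lock (sketch):

    static inline void anon_vma_lock(struct anon_vma *anon_vma)
    {
            spin_lock(&anon_vma->root->lock);
    }

    static inline void anon_vma_unlock(struct anon_vma *anon_vma)
    {
            spin_unlock(&anon_vma->root->lock);
    }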

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Substitute a direct call of spin_lock(anon_vma->lock) with an inline
    function doing exactly the same.

    This makes it easier to do the substitution to the root anon_vma lock in a
    following patch.

    We will deal with the handful of special locks (nested, dec_and_lock, etc)
    separately.
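
    The wrapper as first introduced here, still taking the anon_vma's own
    lock (the switch to the root's lock happens in the preceding entry):

    static inline void anon_vma_lock(struct anon_vma *anon_vma)
    {
            spin_lock(&anon_vma->lock);
    }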

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 May, 2010

1 commit

  • For clarity of review, KSM and page migration have separate refcounts on
    the anon_vma. While clear, this is a waste of memory. This patch gets
    KSM and page migration to share their toys in a spirit of harmony.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

25 Apr, 2010

1 commit

  • The follow_page() function can potentially return -EFAULT so I added
    checks for this.

    Also I silenced an uninitialized variable warning on my version of gcc
    (version 4.3.2).
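
    The check, roughly: follow_page() can return an ERR_PTR as well as
    NULL, so callers test both (FOLL_GET as used in ksm.c's scan loop):

    page = follow_page(vma, addr, FOLL_GET);
    if (IS_ERR_OR_NULL(page))
            break;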

    Signed-off-by: Dan Carpenter
    Acked-by: Rik van Riel
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     

25 Mar, 2010

1 commit

  • ksm.c's write_protect_page implements a lockless means of verifying
    that a page has no users unaccounted for via other kernel tracking
    means. It does this by removing the writable pte with TLB flushes,
    checking the page_count against the total known users, and then using
    set_pte_at_notify to make it a read-only entry.

    An unneeded mmu_notifier callout is made in the case where the count
    of known users does not match the page_count. In that event, we are
    inserting the identical pte and there is no need for
    set_pte_at_notify; the simpler set_pte_at suffices.
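
    The relevant tail of write_protect_page() after this change, roughly:

    entry = ptep_clear_flush(vma, addr, ptep);
    if (page_mapcount(page) + 1 + swapped != page_count(page)) {
            /* re-inserting the identical pte: no notifier needed */
            set_pte_at(mm, addr, ptep, entry);
            goto out_unlock;
    }
    entry = pte_wrprotect(entry);
    set_pte_at_notify(mm, addr, ptep, entry);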

    Signed-off-by: Robin Holt
    Acked-by: Izik Eidus
    Acked-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

07 Mar, 2010

1 commit

  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach into the tens
    of thousands. Real workloads are still a factor of 10 less
    process-intensive than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.
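
    The linking object at the heart of the change (as declared in the
    patch; lock comments paraphrased):

    struct anon_vma_chain {
            struct vm_area_struct *vma;
            struct anon_vma *anon_vma;
            struct list_head same_vma;      /* locked by mmap_sem & page_table_lock */
            struct list_head same_anon_vma; /* locked by anon_vma->lock */
    };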

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, and there never seems to be any
    spike in system time. The anon_vma lock contention appears to be
    resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

16 Dec, 2009

14 commits

  • Now that ksm pages are swappable, and the known holes plugged, remove
    mention of unswappable kernel pages from KSM documentation and comments.

    Remove the totalram_pages/4 initialization of max_kernel_pages. In fact,
    remove max_kernel_pages altogether - we can reinstate it if removal turns
    out to break someone's script; but if we later want to limit KSM's memory
    usage, limiting the stable nodes would not be an effective approach.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The previous patch enables page migration of ksm pages, but that soon gets
    into trouble: not surprising, since we're using the ksm page lock to lock
    operations on its stable_node, but page migration switches the page whose
    lock is to be used for that. Another layer of locking would fix it, but
    do we need that yet?

    Do we actually need page migration of ksm pages? Yes, memory hotremove
    needs to offline sections of memory: and since we stopped allocating ksm
    pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
    candidates for migration.

    But KSM is currently unconscious of NUMA issues, happily merging pages
    from different NUMA nodes: at present the rule must be, not to use
    MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration of
    ksm pages does not make sense yet.

    So, to complete support for ksm swapping we need to make hotremove
    safe. ksm_memory_callback() takes ksm_thread_mutex when
    MEM_GOING_OFFLINE and releases it when MEM_OFFLINE or
    MEM_CANCEL_OFFLINE. But if mapped pages are freed before migration
    reaches them, stable_nodes may be left still pointing to struct pages
    which have been removed from the system: the stable_node needs to
    identify a page by pfn rather than by page pointer, so it can safely
    prune them when MEM_OFFLINE.

    And make NUMA migration skip PageKsm pages where it skips PageReserved.
    But it's only when we reach unmap_and_move() that the page lock is taken
    and we can be sure that raised pagecount has prevented a PageAnon from
    being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
    page when offlining (has sufficient locking) but reject it otherwise.
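
    The hotplug callback's shape under this scheme (a sketch; error paths
    and exact details may differ):

    static int ksm_memory_callback(struct notifier_block *self,
                                   unsigned long action, void *arg)
    {
            struct memory_notify *mn = arg;

            switch (action) {
            case MEM_GOING_OFFLINE:
                    mutex_lock(&ksm_thread_mutex);  /* hold ksmd off */
                    break;
            case MEM_OFFLINE:
                    /* stable_nodes know pfns, so stale ones can be pruned */
                    ksm_check_stable_tree(mn->start_pfn,
                                          mn->start_pfn + mn->nr_pages);
                    /* fallthrough */
            case MEM_CANCEL_OFFLINE:
                    mutex_unlock(&ksm_thread_mutex);
                    break;
            }
            return NOTIFY_OK;
    }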

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A side-effect of making ksm pages swappable is that they have to be placed
    on the LRUs: which then exposes them to isolate_lru_page() and hence to
    page migration.

    Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
    rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
    consolidation with existing code is possible, but don't attempt that yet
    (try_to_unmap needs to handle nonlinears, but migration pte removal does
    not).

    rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
    remove_anon_migration_ptes() which it replaces, avoids calling
    page_lock_anon_vma(), because that includes a page_mapped() test which
    fails when all migration ptes are in place. That was valid when NUMA page
    migration was introduced (holding mmap_sem provided the missing guarantee
    that anon_vma's slab had not already been destroyed), but I believe not
    valid in the memory hotremove case added since.

    For now do the same as before, and consider the best way to fix that
    unlikely race later on. When fixed, we can probably use rmap_walk() on
    hwpoisoned ksm pages too: for now, they remain among hwpoison's various
    exceptions (its PageKsm test comes before the page is locked, but its
    page_lock_anon_vma fails safely if an anon gets upgraded).
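
    The dispatch itself is simple (sketch following the description
    above):

    int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
                    struct vm_area_struct *, unsigned long, void *),
                    void *arg)
    {
            VM_BUG_ON(!PageLocked(page));

            if (unlikely(PageKsm(page)))
                    return rmap_walk_ksm(page, rmap_one, arg);
            else if (PageAnon(page))
                    return rmap_walk_anon(page, rmap_one, arg);
            else
                    return rmap_walk_file(page, rmap_one, arg);
    }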

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When ksm pages were unswappable, it made no sense to include them in mem
    cgroup accounting; but now that they are swappable (although I see no
    strict logical connection) the principle of least surprise implies that
    they should be accounted (with the usual dissatisfaction, that a shared
    page is accounted to only one of the cgroups using it).

    This patch was intended to add mem cgroup accounting where necessary; but
    turned inside out, it now avoids allocating a ksm page, instead upgrading
    an anon page to ksm - which brings its existing mem cgroup accounting with
    it. Thus mem cgroups don't appear in the patch at all.

    This upgrade from PageAnon to PageKsm takes place under page lock (via a
    somewhat hacky NULL kpage interface), and audit showed only one place
    which needed to cope with the race - page_referenced() is sometimes used
    without page lock, so page_lock_anon_vma() needs an ACCESS_ONCE() to be
    sure of getting anon_vma and flags together (no problem if the page goes
    ksm an instant after, the integrity of that anon_vma list is unaffected).
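
    The one audited fix, roughly: read page->mapping once so that flags
    and pointer come from the same snapshot:

    anon_mapping = (unsigned long)ACCESS_ONCE(page->mapping);
    if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
            goto out;
    anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);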

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's a lamentable flaw in KSM swapping: the stable_node holds a
    reference to the ksm page, so the page to be freed cannot actually be
    freed until ksmd works its way around to removing the last rmap_item from
    its stable_node. Which in some configurations may take minutes: not quite
    responsive enough for memory reclaim. And we don't want to twist KSM and
    its locking more tightly into the rest of mm. What a pity.

    But although the stable_node needs to hold a pointer to the ksm page, does
    it actually need to raise the reference count of that page?

    No. It would need to do so if struct pages were ordinary kmalloc'ed
    objects; but they are more stable than that, and reused in particular ways
    according to particular rules.

    Access to stable_node from its pointer in struct page is no problem, so
    long as we never free a stable_node before the ksm page itself has been
    freed. Access to struct page from its pointer in stable_node: reintroduce
    get_ksm_page(), and let that peep out through its keyhole (the stable_node
    pointer to ksm page), to see if that struct page still holds the right key
    to open it (the ksm page mapping pointer back to this stable_node).

    This relies upon the established way in which free_hot_cold_page() sets an
    anon (including ksm) page->mapping to NULL; and relies upon no other user
    of a struct page to put something which looks like the original
    stable_node pointer (with two low bits also set) into page->mapping. It
    also needs get_page_unless_zero() technique pioneered by speculative
    pagecache; and uses rcu_read_lock() to keep the guarantees that gives.

    There are several drivers which put pointers of their own into page->
    mapping; but none of those could coincide with our stable_node pointers,
    since KSM won't free a stable_node until it sees that the page has gone.

    The only problem case found is the pagetable spinlock USE_SPLIT_PTLOCKS
    places in struct page (my own abuse): to accommodate GENERIC_LOCKBREAK's
    break_lock on 32-bit, that spans both page->private and page->mapping.
    Since break_lock is only 0 or 1, again no confusion for get_ksm_page().

    But what of DEBUG_SPINLOCK on 64-bit bigendian? When owner_cpu is 3
    (matching PageKsm low bits), it might see 0xdead4ead00000003 in page->
    mapping, which might coincide? We could get around that by... but a
    better answer is to suppress USE_SPLIT_PTLOCKS when DEBUG_SPINLOCK or
    DEBUG_LOCK_ALLOC, to stop bloating sizeof(struct page) in their case -
    already proposed in an earlier mm/Kconfig patch.
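
    The keyhole itself, sketched (mainline rechecks the mapping after
    taking the reference):

    static struct page *get_ksm_page(struct stable_node *stable_node)
    {
            struct page *page = stable_node->page;
            void *expected_mapping = (void *)stable_node +
                            (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);

            rcu_read_lock();
            if (page->mapping != expected_mapping)
                    goto stale;
            if (!get_page_unless_zero(page))
                    goto stale;
            if (page->mapping != expected_mapping) {
                    put_page(page);
                    goto stale;
            }
            rcu_read_unlock();
            return page;
    stale:
            rcu_read_unlock();
            remove_node_from_stable_tree(stable_node);
            return NULL;
    }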

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • For full functionality, page_referenced_one() and try_to_unmap_one() need
    to know the vma: to pass vma down to arch-dependent flushes, or to observe
    VM_LOCKED or VM_EXEC. But KSM keeps no record of vma: nor can it, since
    vmas get split and merged without its knowledge.

    Instead, note page's anon_vma in its rmap_item when adding to stable tree:
    all the vmas which might map that page are listed by its anon_vma.

    page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma,
    first to find the probable vma, that which matches rmap_item's mm; but if
    that is not enough to locate all instances, traverse again to try the
    others. This catches those occasions when fork has duplicated a pte of a
    ksm page, but ksmd has not yet come around to assign it an rmap_item.

    But each rmap_item in the stable tree which refers to an anon_vma needs to
    take a reference to it. Andrea's anon_vma design cleverly avoided a
    reference count (an anon_vma was free when its list of vmas was empty),
    but KSM now needs to add that. Is a 32-bit count sufficient? I believe
    so - the anon_vma is only free when both count is 0 and list is empty.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Initial implementation for swapping out KSM's shared pages: add
    page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
    faced with a PageKsm page.

    Most of what's needed can be got from the rmap_items listed from the
    stable_node of the ksm page, without discovering the actual vma: so in
    this patch just fake up a struct vma for page_referenced_one() or
    try_to_unmap_one(), then refine that in the next patch.

    Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
    implicit there (being only set with VM_SHARED, already excluded), but
    let's make it explicit, to help justify the lack of nonlinear unmap.

    Rely on the page lock to protect against concurrent modifications to that
    page's node of the stable tree.

    The awkward part is not swapout but swapin: do_swap_page() and
    page_add_anon_rmap() now have to allow for new possibilities - perhaps a
    ksm page still in swapcache, perhaps a swapcache page associated with one
    location in one anon_vma now needed for another location or anon_vma.
    (And the vma might even be no longer VM_MERGEABLE when that happens.)

    ksm_might_need_to_copy() checks for that case, and supplies a duplicate
    page when necessary, simply leaving it to a subsequent pass of ksmd to
    rediscover the identity and merge them back into one ksm page.
    Disappointingly primitive: but the alternative would have to accumulate
    unswappable info about the swapped out ksm pages, limiting swappability.

    Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
    particular case it was handling, so just use it instead.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When KSM merges an mlocked page, it has been forgetting to munlock it:
    that's been left to free_page_mlock(), which reports it in /proc/vmstat as
    unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
    whinges "Page flag mlocked set for process" in mmotm, whereas mainline is
    silently forgiving). Call munlock_vma_page() to fix that.
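
    The fix in try_to_merge_one_page(), roughly (the mainline version
    also moves the mlock over to the ksm page; abbreviated here):

    if ((vma->vm_flags & VM_LOCKED) && kpage && !err)
            munlock_vma_page(page);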

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Add a pointer to the ksm page into struct stable_node, holding a reference
    to the page while the node exists. Put a pointer to the stable_node into
    the ksm page's ->mapping.

    Then we don't need get_ksm_page() while traversing the stable tree: the
    page to compare against is sure to be present and correct, even if it's no
    longer visible through any of its existing rmap_items.

    And we can handle the forked ksm page case more efficiently: no need to
    memcmp our way through the tree to find its match.
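
    The back-pointer encoding, sketched (PAGE_MAPPING_KSM set alongside
    PAGE_MAPPING_ANON marks the page as ksm):

    static inline void set_page_stable_node(struct page *page,
                                            struct stable_node *stable_node)
    {
            page->mapping = (void *)stable_node +
                            (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
    }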

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Though we still do well to keep rmap_items in the unstable tree without a
    separate tree_item at the node, for several reasons it becomes awkward to
    keep rmap_items in the stable tree without a separate stable_node: lack of
    space in the nicely-sized rmap_item, the need for an anchor as rmap_items
    are removed, the need for a node even when temporarily no rmap_items are
    attached to it.

    So declare struct stable_node (rb_node to place it in the tree and
    hlist_head for the rmap_items hanging off it), and convert stable tree
    handling to use it: without yet taking advantage of it. Note how one
    stable_tree_insert() of a node now has _two_ stable_tree_append()s of the
    two rmap_items being merged.
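
    The new node, as declared at this point in the series (the page
    back-pointer arrives in the entry above):

    struct stable_node {
            struct rb_node node;            /* place in the stable rbtree */
            struct hlist_head hlist;        /* rmap_items hanging off it */
    };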

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Free up a pointer in struct rmap_item, by making the mm_slot's rmap_list a
    singly-linked list: we always traverse that list sequentially, and we
    don't even lose any prefetches (but should consider adding a few later).
    Name it rmap_list throughout.

    Do we need to free up that pointer? Not immediately, and in the end, we
    could continue to avoid it with a union; but having done the conversion,
    let's keep it this way, since there's no downside, and maybe we'll want
    more in future (struct rmap_item is a cache-friendly 32 bytes on 32-bit
    and 64 bytes on 64-bit, so we shall want to avoid expanding it).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Cleanup: make argument names more consistent from cmp_and_merge_page()
    down to replace_page(), so that it's easier to follow the rmap_item's page
    and the matching tree_page and the merged kpage through that code.

    In some places, e.g. break_cow(), pass rmap_item instead of separate mm
    and address.

    cmp_and_merge_page() initializes tree_page to NULL, to avoid a "may be
    used uninitialized" warning seen in one config by Anil SB.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There is no need for replace_page() to calculate a write-protected
    prot: vm_page_prot must already be write-protected for an anonymous
    page (see mm/memory.c do_anonymous_page() for similar reliance on
    vm_page_prot).

    There is no need for try_to_merge_one_page() to get_page and put_page on
    newpage and oldpage: in every case we already hold a reference to each of
    them.

    But some instinct makes me move try_to_merge_one_page()'s unlock_page of
    oldpage down after replace_page(): that doesn't increase contention on the
    ksm page, and makes thinking about the transition easier.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 1. remove_rmap_item_from_tree() is called as a precaution from
    various places: don't dirty the rmap_item cacheline unnecessarily,
    just mask the flags out of the address when they have been set.

    2. First get_next_rmap_item() removes an unstable rmap_item from its tree,
    then shortly afterwards cmp_and_merge_page() removes a stable rmap_item
    from its tree: it's easier just to do both at once (but definitely keep
    the BUG_ON(age > 1) which guards against a future omission).

    3. When cmp_and_merge_page() moves an rmap_item from unstable to stable
    tree, it does its own rb_erase() and accounting: that's better
    expressed by remove_rmap_item_from_tree().

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Nov, 2009

1 commit

  • KSM needs a cond_resched() for CONFIG_PREEMPT_NONE, in its unbounded
    search of the unstable tree. The stable tree cases already have one,
    and originally there was one down inside get_user_pages();
    but I missed it when I converted to follow_page() instead.
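
    The fix is one line in the unstable tree walk (context abbreviated):

    while (*new) {
            struct rmap_item *tree_rmap_item;
            int ret;

            cond_resched();         /* unbounded walk: stay preemptible */
            /* ... memcmp the pages and descend as before ... */
    }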

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

08 Oct, 2009

1 commit

  • Adjust the max_kernel_pages default to a quarter of totalram_pages,
    instead of nr_free_buffer_pages() / 4: the KSM pages themselves come from
    highmem, and even on a 16GB PAE machine, 4GB of KSM pages would only be
    pinning 32MB of lowmem with their rmap_items, so no need for the more
    obscure calculation (nor for its own special init function).

    There is no way for the user to switch KSM on if CONFIG_SYSFS is not
    enabled, so in that case default "run" to KSM_RUN_MERGE.

    Update KSM Documentation and Kconfig to reflect the new defaults.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 Sep, 2009

1 commit

  • Now that ksm is in mainline it is better to change the default values
    to better fit most users.

    This patch changes the ksm default values to be:

    ksm_thread_pages_to_scan = 100 (instead of 200)
    ksm_thread_sleep_millisecs = 20 (like before)
    ksm_run = KSM_RUN_STOP (instead of KSM_RUN_MERGE - meaning ksm is
    disabled by default)
    ksm_max_kernel_pages = nr_free_buffer_pages / 4 (instead of 2046)

    The important aspect of this patch is that it disables ksm by default,
    and sets the number of kernel pages that can be allocated to a
    reasonable number.

    Signed-off-by: Izik Eidus
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Izik Eidus