05 Oct, 2010

1 commit

  • Building under memory pressure, with KSM on 2.6.36-rc5, collapsed with
    an internal compiler error: typically indicating an error in swapping.

    Perhaps there's a timing issue which makes it now more likely, perhaps
    it's just a long time since I tried for so long: this bug goes back to
    KSM swapping in 2.6.33.

    Notice how reuse_swap_page() allows an exclusive page to be reused, but
    only does SetPageDirty if it can delete it from swap cache right then -
    if it's currently under Writeback, it has to be left in cache and we
    don't SetPageDirty, but the page can be reused. Fine, the dirty bit
    will get set in the pte; but notice how zap_pte_range() does not bother
    to transfer pte_dirty to page_dirty when unmapping a PageAnon.

    If KSM chooses to share such a page, it will look like a clean copy of
    swapcache, and not be written out to swap when its memory is needed;
    then stale data read back from swap when it's needed again.

    We could fix this in reuse_swap_page() (or even refuse to reuse a
    page under writeback), but it's more honest to fix my oversight in
    KSM's write_protect_page(). Several days of testing on three machines
    confirms that this fixes the issue they showed.

    Signed-off-by: Hugh Dickins
    Cc: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Sep, 2010

1 commit

  • The pte_same check is reliable only if the swap entry remains pinned (by
    the page lock on swapcache). We've also to ensure the swapcache isn't
    removed before we take the lock as try_to_free_swap won't care about the
    page pin.

    One of the possible impacts of this patch is that a KSM-shared page can
    point to the anon_vma of another process, which could exit before the page
    is freed.

    This can leave a page with a pointer to a recycled anon_vma object, or
    worse, a pointer to something that is no longer an anon_vma.

    [riel@redhat.com: changelog help]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

10 Aug, 2010

4 commits

  • Use compile-allocated memory instead of dynamic allocated memory for
    mm_slots_hash.

    Use hash_ptr() instead divisions for bucket calculation.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Izik Eidus
    Cc: Avi Kivity
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • KSM reference counts can cause an anon_vma to exist after the processe it
    belongs to have already exited. Because the anon_vma lock now lives in
    the root anon_vma, we need to ensure that the root anon_vma stays around
    until after all the "child" anon_vmas have been freed.

    The obvious way to do this is to have a "child" anon_vma take a reference
    to the root in anon_vma_fork. When the anon_vma is freed at munmap or
    process exit, we drop the refcount in anon_vma_unlink and possibly free
    the root anon_vma.

    The KSM anon_vma reference count function also needs to be modified to
    deal with the possibility of freeing 2 levels of anon_vma. The easiest
    way to do this is to break out the KSM magic and make it generic.

    When compiling without CONFIG_KSM, this code is compiled out.

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Tested-by: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Always (and only) lock the root (oldest) anon_vma whenever we do something
    in an anon_vma. The recently introduced anon_vma scalability is due to
    the rmap code scanning only the VMAs that need to be scanned. Many common
    operations still took the anon_vma lock on the root anon_vma, so always
    taking that lock is not expected to introduce any scalability issues.

    However, always taking the same lock does mean we only need to take one
    lock, which means rmap_walk on pages from any anon_vma in the vma is
    excluded from occurring during an munmap, expand_stack or other operation
    that needs to exclude rmap_walk and similar functions.

    Also add the proper locking to vma_adjust.

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Subsitute a direct call of spin_lock(anon_vma->lock) with an inline
    function doing exactly the same.

    This makes it easier to do the substitution to the root anon_vma lock in a
    following patch.

    We will deal with the handful of special locks (nested, dec_and_lock, etc)
    separately.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 May, 2010

1 commit

  • For clarity of review, KSM and page migration have separate refcounts on
    the anon_vma. While clear, this is a waste of memory. This patch gets
    KSM and page migration to share their toys in a spirit of harmony.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

25 Apr, 2010

1 commit

  • The follow_page() function can potentially return -EFAULT so I added
    checks for this.

    Also I silenced an uninitialized variable warning on my version of gcc
    (version 4.3.2).

    Signed-off-by: Dan Carpenter
    Acked-by: Rik van Riel
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     

25 Mar, 2010

1 commit

  • ksm.c's write_protect_page implements a lockless means of verifying a page
    does not have any users of the page which are not accounted for via other
    kernel tracking means. It does this by removing the writable pte with TLB
    flushes, checking the page_count against the total known users, and then
    using set_pte_at_notify to make it a read-only entry.

    An unneeded mmu_notifier callout is made in the case where the known users
    does not match the page_count. In that event, we are inserting the
    identical pte and there is no need for the set_pte_at_notify, but rather
    the simpler set_pte_at suffices.

    Signed-off-by: Robin Holt
    Acked-by: Izik Eidus
    Acked-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

07 Mar, 2010

1 commit

  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

16 Dec, 2009

14 commits

  • Now that ksm pages are swappable, and the known holes plugged, remove
    mention of unswappable kernel pages from KSM documentation and comments.

    Remove the totalram_pages/4 initialization of max_kernel_pages. In fact,
    remove max_kernel_pages altogether - we can reinstate it if removal turns
    out to break someone's script; but if we later want to limit KSM's memory
    usage, limiting the stable nodes would not be an effective approach.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The previous patch enables page migration of ksm pages, but that soon gets
    into trouble: not surprising, since we're using the ksm page lock to lock
    operations on its stable_node, but page migration switches the page whose
    lock is to be used for that. Another layer of locking would fix it, but
    do we need that yet?

    Do we actually need page migration of ksm pages? Yes, memory hotremove
    needs to offline sections of memory: and since we stopped allocating ksm
    pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
    candidates for migration.

    But KSM is currently unconscious of NUMA issues, happily merging pages
    from different NUMA nodes: at present the rule must be, not to use
    MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration of
    ksm pages does not make sense yet.

    So, to complete support for ksm swapping we need to make hotremove safe.
    ksm_memory_callback() take ksm_thread_mutex when MEM_GOING_OFFLINE and
    release it when MEM_OFFLINE or MEM_CANCEL_OFFLINE. But if mapped pages
    are freed before migration reaches them, stable_nodes may be left still
    pointing to struct pages which have been removed from the system: the
    stable_node needs to identify a page by pfn rather than page pointer, then
    it can safely prune them when MEM_OFFLINE.

    And make NUMA migration skip PageKsm pages where it skips PageReserved.
    But it's only when we reach unmap_and_move() that the page lock is taken
    and we can be sure that raised pagecount has prevented a PageAnon from
    being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
    page when offlining (has sufficient locking) but reject it otherwise.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A side-effect of making ksm pages swappable is that they have to be placed
    on the LRUs: which then exposes them to isolate_lru_page() and hence to
    page migration.

    Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
    rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
    consolidation with existing code is possible, but don't attempt that yet
    (try_to_unmap needs to handle nonlinears, but migration pte removal does
    not).

    rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
    remove_anon_migration_ptes() which it replaces, avoids calling
    page_lock_anon_vma(), because that includes a page_mapped() test which
    fails when all migration ptes are in place. That was valid when NUMA page
    migration was introduced (holding mmap_sem provided the missing guarantee
    that anon_vma's slab had not already been destroyed), but I believe not
    valid in the memory hotremove case added since.

    For now do the same as before, and consider the best way to fix that
    unlikely race later on. When fixed, we can probably use rmap_walk() on
    hwpoisoned ksm pages too: for now, they remain among hwpoison's various
    exceptions (its PageKsm test comes before the page is locked, but its
    page_lock_anon_vma fails safely if an anon gets upgraded).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When ksm pages were unswappable, it made no sense to include them in mem
    cgroup accounting; but now that they are swappable (although I see no
    strict logical connection) the principle of least surprise implies that
    they should be accounted (with the usual dissatisfaction, that a shared
    page is accounted to only one of the cgroups using it).

    This patch was intended to add mem cgroup accounting where necessary; but
    turned inside out, it now avoids allocating a ksm page, instead upgrading
    an anon page to ksm - which brings its existing mem cgroup accounting with
    it. Thus mem cgroups don't appear in the patch at all.

    This upgrade from PageAnon to PageKsm takes place under page lock (via a
    somewhat hacky NULL kpage interface), and audit showed only one place
    which needed to cope with the race - page_referenced() is sometimes used
    without page lock, so page_lock_anon_vma() needs an ACCESS_ONCE() to be
    sure of getting anon_vma and flags together (no problem if the page goes
    ksm an instant after, the integrity of that anon_vma list is unaffected).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's a lamentable flaw in KSM swapping: the stable_node holds a
    reference to the ksm page, so the page to be freed cannot actually be
    freed until ksmd works its way around to removing the last rmap_item from
    its stable_node. Which in some configurations may take minutes: not quite
    responsive enough for memory reclaim. And we don't want to twist KSM and
    its locking more tightly into the rest of mm. What a pity.

    But although the stable_node needs to hold a pointer to the ksm page, does
    it actually need to raise the reference count of that page?

    No. It would need to do so if struct pages were ordinary kmalloc'ed
    objects; but they are more stable than that, and reused in particular ways
    according to particular rules.

    Access to stable_node from its pointer in struct page is no problem, so
    long as we never free a stable_node before the ksm page itself has been
    freed. Access to struct page from its pointer in stable_node: reintroduce
    get_ksm_page(), and let that peep out through its keyhole (the stable_node
    pointer to ksm page), to see if that struct page still holds the right key
    to open it (the ksm page mapping pointer back to this stable_node).

    This relies upon the established way in which free_hot_cold_page() sets an
    anon (including ksm) page->mapping to NULL; and relies upon no other user
    of a struct page to put something which looks like the original
    stable_node pointer (with two low bits also set) into page->mapping. It
    also needs get_page_unless_zero() technique pioneered by speculative
    pagecache; and uses rcu_read_lock() to keep the guarantees that gives.

    There are several drivers which put pointers of their own into page->
    mapping; but none of those could coincide with our stable_node pointers,
    since KSM won't free a stable_node until it sees that the page has gone.

    The only problem case found is the pagetable spinlock USE_SPLIT_PTLOCKS
    places in struct page (my own abuse): to accommodate GENERIC_LOCKBREAK's
    break_lock on 32-bit, that spans both page->private and page->mapping.
    Since break_lock is only 0 or 1, again no confusion for get_ksm_page().

    But what of DEBUG_SPINLOCK on 64-bit bigendian? When owner_cpu is 3
    (matching PageKsm low bits), it might see 0xdead4ead00000003 in page->
    mapping, which might coincide? We could get around that by... but a
    better answer is to suppress USE_SPLIT_PTLOCKS when DEBUG_SPINLOCK or
    DEBUG_LOCK_ALLOC, to stop bloating sizeof(struct page) in their case -
    already proposed in an earlier mm/Kconfig patch.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • For full functionality, page_referenced_one() and try_to_unmap_one() need
    to know the vma: to pass vma down to arch-dependent flushes, or to observe
    VM_LOCKED or VM_EXEC. But KSM keeps no record of vma: nor can it, since
    vmas get split and merged without its knowledge.

    Instead, note page's anon_vma in its rmap_item when adding to stable tree:
    all the vmas which might map that page are listed by its anon_vma.

    page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma,
    first to find the probable vma, that which matches rmap_item's mm; but if
    that is not enough to locate all instances, traverse again to try the
    others. This catches those occasions when fork has duplicated a pte of a
    ksm page, but ksmd has not yet come around to assign it an rmap_item.

    But each rmap_item in the stable tree which refers to an anon_vma needs to
    take a reference to it. Andrea's anon_vma design cleverly avoided a
    reference count (an anon_vma was free when its list of vmas was empty),
    but KSM now needs to add that. Is a 32-bit count sufficient? I believe
    so - the anon_vma is only free when both count is 0 and list is empty.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Initial implementation for swapping out KSM's shared pages: add
    page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
    faced with a PageKsm page.

    Most of what's needed can be got from the rmap_items listed from the
    stable_node of the ksm page, without discovering the actual vma: so in
    this patch just fake up a struct vma for page_referenced_one() or
    try_to_unmap_one(), then refine that in the next patch.

    Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
    implicit there (being only set with VM_SHARED, already excluded), but
    let's make it explicit, to help justify the lack of nonlinear unmap.

    Rely on the page lock to protect against concurrent modifications to that
    page's node of the stable tree.

    The awkward part is not swapout but swapin: do_swap_page() and
    page_add_anon_rmap() now have to allow for new possibilities - perhaps a
    ksm page still in swapcache, perhaps a swapcache page associated with one
    location in one anon_vma now needed for another location or anon_vma.
    (And the vma might even be no longer VM_MERGEABLE when that happens.)

    ksm_might_need_to_copy() checks for that case, and supplies a duplicate
    page when necessary, simply leaving it to a subsequent pass of ksmd to
    rediscover the identity and merge them back into one ksm page.
    Disappointingly primitive: but the alternative would have to accumulate
    unswappable info about the swapped out ksm pages, limiting swappability.

    Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
    particular case it was handling, so just use it instead.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When KSM merges an mlocked page, it has been forgetting to munlock it:
    that's been left to free_page_mlock(), which reports it in /proc/vmstat as
    unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
    whinges "Page flag mlocked set for process" in mmotm, whereas mainline is
    silently forgiving). Call munlock_vma_page() to fix that.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Add a pointer to the ksm page into struct stable_node, holding a reference
    to the page while the node exists. Put a pointer to the stable_node into
    the ksm page's ->mapping.

    Then we don't need get_ksm_page() while traversing the stable tree: the
    page to compare against is sure to be present and correct, even if it's no
    longer visible through any of its existing rmap_items.

    And we can handle the forked ksm page case more efficiently: no need to
    memcmp our way through the tree to find its match.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Though we still do well to keep rmap_items in the unstable tree without a
    separate tree_item at the node, for several reasons it becomes awkward to
    keep rmap_items in the stable tree without a separate stable_node: lack of
    space in the nicely-sized rmap_item, the need for an anchor as rmap_items
    are removed, the need for a node even when temporarily no rmap_items are
    attached to it.

    So declare struct stable_node (rb_node to place it in the tree and
    hlist_head for the rmap_items hanging off it), and convert stable tree
    handling to use it: without yet taking advantage of it. Note how one
    stable_tree_insert() of a node now has _two_ stable_tree_append()s of the
    two rmap_items being merged.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Free up a pointer in struct rmap_item, by making the mm_slot's rmap_list a
    singly-linked list: we always traverse that list sequentially, and we
    don't even lose any prefetches (but should consider adding a few later).
    Name it rmap_list throughout.

    Do we need to free up that pointer? Not immediately, and in the end, we
    could continue to avoid it with a union; but having done the conversion,
    let's keep it this way, since there's no downside, and maybe we'll want
    more in future (struct rmap_item is a cache-friendly 32 bytes on 32-bit
    and 64 bytes on 64-bit, so we shall want to avoid expanding it).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Cleanup: make argument names more consistent from cmp_and_merge_page()
    down to replace_page(), so that it's easier to follow the rmap_item's page
    and the matching tree_page and the merged kpage through that code.

    In some places, e.g. break_cow(), pass rmap_item instead of separate mm
    and address.

    cmp_and_merge_page() initialize tree_page to NULL, to avoid a "may be used
    uninitialized" warning seen in one config by Anil SB.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There is no need for replace_page() to calculate a write-protected prot
    vm_page_prot must already be write-protected for an anonymous page (see
    mm/memory.c do_anonymous_page() for similar reliance on vm_page_prot).

    There is no need for try_to_merge_one_page() to get_page and put_page on
    newpage and oldpage: in every case we already hold a reference to each of
    them.

    But some instinct makes me move try_to_merge_one_page()'s unlock_page of
    oldpage down after replace_page(): that doesn't increase contention on the
    ksm page, and makes thinking about the transition easier.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 1. remove_rmap_item_from_tree() is called as a precaution from
    various places: don't dirty the rmap_item cacheline unnecessarily,
    just mask the flags out of the address when they have been set.

    2. First get_next_rmap_item() removes an unstable rmap_item from its tree,
    then shortly afterwards cmp_and_merge_page() removes a stable rmap_item
    from its tree: it's easier just to do both at once (but definitely keep
    the BUG_ON(age > 1) which guards against a future omission).

    3. When cmp_and_merge_page() moves an rmap_item from unstable to stable
    tree, it does its own rb_erase() and accounting: that's better
    expressed by remove_rmap_item_from_tree().

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Nov, 2009

1 commit

  • KSM needs a cond_resched() for CONFIG_PREEMPT_NONE, in its unbounded
    search of the unstable tree. The stable tree cases already have one,
    and originally there was one down inside get_user_pages();
    but I missed it when I converted to follow_page() instead.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

08 Oct, 2009

1 commit

  • Adjust the max_kernel_pages default to a quarter of totalram_pages,
    instead of nr_free_buffer_pages() / 4: the KSM pages themselves come from
    highmem, and even on a 16GB PAE machine, 4GB of KSM pages would only be
    pinning 32MB of lowmem with their rmap_items, so no need for the more
    obscure calculation (nor for its own special init function).

    There is no way for the user to switch KSM on if CONFIG_SYSFS is not
    enabled, so in that case default run to KSM_RUN_MERGE.

    Update KSM Documentation and Kconfig to reflect the new defaults.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 Sep, 2009

1 commit

  • Now that ksm is in mainline it is better to change the default values to
    better fit to most of the users.

    This patch change the ksm default values to be:

    ksm_thread_pages_to_scan = 100 (instead of 200)
    ksm_thread_sleep_millisecs = 20 (like before)
    ksm_run = KSM_RUN_STOP (instead of KSM_RUN_MERGE - meaning ksm is
    disabled by default)
    ksm_max_kernel_pages = nr_free_buffer_pages / 4 (instead of 2046)

    The important aspect of this patch is: it disables ksm by default, and sets
    the number of the kernel_pages that can be allocated to be a reasonable
    number.

    Signed-off-by: Izik Eidus
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Izik Eidus
     

22 Sep, 2009

13 commits

  • Just as the swapoff system call allocates many pages of RAM to various
    processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run"
    (unmerge) is liable to allocate many pages of RAM to various processes,
    perhaps triggering OOM; and each is normally run from a modest admin
    process (swapoff or shell), easily repeated until it succeeds.

    So treat unmerge_and_remove_all_rmap_items() in the same way that we treat
    try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both
    with that, to ask the OOM killer to kill them first, to prevent them from
    spawning more and more OOM kills.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A few cleanups, given the munlock fix: the comment on ksm_test_exit() no
    longer applies, and it can be made private to ksm.c; there's no more
    reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • At present KSM is just a waste of space if you don't have CONFIG_SYSFS=y
    to provide the /sys/kernel/mm/ksm files to tune and activate it.

    Make KSM depend on SYSFS? Could do, but it might be better to provide
    some defaults so that KSM works out-of-the-box, ready for testers to
    madvise MADV_MERGEABLE, even without SYSFS.

    Though anyone serious is likely to want to retune the numbers to their
    taste once they have experience; and whether these settings ever reach
    2.6.32 can be discussed along the way.

    Save 1kB from tiny kernels by #ifdef'ing the SYSFS side of it.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Rawhide users have reported hang at startup when cryptsetup is run: the
    same problem can be simply reproduced by running a program int main() {
    mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }

    The problem is that exit_mmap() applies munlock_vma_pages_all() to
    clean up VM_LOCKED areas, and its current implementation (stupidly)
    tries to fault in absent pages, for example where PROT_NONE prevented
    them being faulted in when mlocking. Whereas the "ksm: fix oom
    deadlock" patch, knowing there's a race by which KSM might try to fault
    in pages after exit_mmap() had finally zapped the range, backs out of
    such faults doing nothing when its ksm_test_exit() notices mm_users 0.

    So revert that part of "ksm: fix oom deadlock" which moved the
    ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
    and remove those ksm_test_exit() checks from the page fault paths, so
    allowing the munlocking to proceed without interference.

    ksm_exit, if there are rmap_items still chained on this mm slot, takes
    mmap_sem write side: so preventing KSM from working on an mm while
    exit_mmap runs. And KSM will bail out as soon as it notices that
    mm_users is already zero, thanks to its internal ksm_test_exit checks.
    So that when a task is killed by OOM killer or the user, KSM will not
    indefinitely prevent it from running exit_mmap to release its memory.

    This does break a part of what "ksm: fix oom deadlock" was trying to
    achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even
    when ksmd itself has to cancel a KSM page, it is possible that the
    first OOM-kill victim would be the KSM process being faulted: then its
    memory won't be freed until a second victim has been selected (freeing
    memory for the unmerging fault to complete).

    But the OOM killer is already liable to kill a second victim once the
    intended victim's p->mm goes to NULL: so there's not much point in
    rejecting this KSM patch before fixing that OOM behaviour. It is very
    much more important to allow KSM users to boot up, than to haggle over
    an unlikely and poorly supported OOM case.

    We also intend to fix munlocking to not fault pages: at which point
    this patch _could_ be reverted; though that would be controversial, so
    we hope to find a better solution.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Justin M. Forbes
    Acked-for-now-by: Hugh Dickins
    Cc: Izik Eidus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • There's a now-obvious deadlock in KSM's out-of-memory handling:
    imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
    trying to allocate a page to break KSM in an mm which becomes the
    OOM victim (quite likely in the unmerge case): it's killed and goes
    to exit, and hangs there waiting to acquire ksm_thread_mutex.

    Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
    though that made everything else: perhaps use mmap_sem somehow?
    And part of the answer lies in the comments on unmerge_ksm_pages:
    __ksm_exit should also leave all the rmap_item removal to ksmd.

    But there's a fundamental problem, that KSM relies upon mmap_sem to
    guarantee the consistency of the mm it's dealing with, yet exit_mmap
    tears down an mm without taking mmap_sem. And bumping mm_users won't
    help at all, that just ensures that the pages the OOM killer assumes
    are on their way to being freed will not be freed.

    The best answer seems to be, to move the ksm_exit callout from just
    before exit_mmap, to the middle of exit_mmap: after the mm's pages
    have been freed (if the mmu_gather is flushed), but before its page
    tables and vma structures have been freed; and down_write,up_write
    mmap_sem there to serialize with KSM's own reliance on mmap_sem.

    But KSM then needs to be careful, whenever it downs mmap_sem, to
    check that the mm is not already exiting: there's a danger of using
    find_vma on a layout that's being torn apart, or writing into page
    tables which have been freed for reuse; and even do_anonymous_page
    and __do_fault need to check they're not being called by break_ksm
    to reinstate a pte after zap_pte_range has zapped that page table.

    Though it might be clearer to add an exiting flag, set while holding
    mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
    a zapped pte. All we need is to check whether mm_users is 0 - but
    must remember that ksmd may detect that before __ksm_exit is reached.
    So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.

    __ksm_exit now has to leave clearing up the rmap_items to ksmd,
    that needs ksm_thread_mutex; but shift the exiting mm just after the
    ksm_scan cursor so that it will soon be dealt with. __ksm_enter raise
    mm_count to hold the mm_struct, ksmd's exit processing (exactly like
    its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it,
    similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd).

    But also give __ksm_exit a fast path: when there's no complication
    (no rmap_items attached to mm and it's not at the ksm_scan cursor),
    it can safely do all the exiting work itself. This is not just an
    optimization: when ksmd is not running, the raised mm_count would
    otherwise leak mm_structs.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Do some housekeeping in ksm.c, to help make the next patch easier
    to understand: remove the function remove_mm_from_lists, distributing
    its code to its callsites scan_get_next_rmap_item and __ksm_exit.

    That turns out to be a win in scan_get_next_rmap_item: move its
    remove_trailing_rmap_items and cursor advancement up, and it becomes
    simpler than before. __ksm_exit becomes messier, but will change
    again; and moving its remove_trailing_rmap_items up lets us strengthen
    the unstable tree item's age condition in remove_rmap_item_from_tree.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • break_ksm has been looping endlessly ignoring VM_FAULT_OOM: that should
    only be a problem for ksmd when a memory control group imposes limits
    (normally the OOM killer will kill others with an mm until it succeeds);
    but in general (especially for MADV_UNMERGEABLE and KSM_RUN_UNMERGE) we
    do need to route the error (or kill) back to the caller (or sighandling).

    Test signal_pending in unmerge_ksm_pages, which could be a lengthy
    procedure if it has to spill into swap: returning -ERESTARTSYS so that
    trivial signals will restart but fatals will terminate (is that right?
    we do different things in different places in mm, none exactly this).

    unmerge_and_remove_all_rmap_items was forgetting to lock when going
    down the mm_list: fix that. Whether it's successful or not, reset
    ksm_scan cursor to head; but only if it's successful, reset seqnr
    (shown in full_scans) - page counts will have gone down to zero.

    This patch leaves a significant OOM deadlock, but it's a good step
    on the way, and that deadlock is fixed in a subsequent patch.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 1. We don't use __break_cow entry point now: merge it into break_cow.
    2. remove_all_slot_rmap_items is just a special case of
    remove_trailing_rmap_items: use the latter instead.
    3. Extend comment on unmerge_ksm_pages and rmap_items.
    4. try_to_merge_two_pages should use try_to_merge_with_ksm_page
    instead of duplicating its code; and so swap them around.
    5. Comment on cmp_and_merge_page described last year's: update it.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • ksm_scan_thread already sleeps in wait_event_interruptible until setting
    ksm_run activates it; but if there's nothing on its list to look at, i.e.
    nobody has yet said madvise MADV_MERGEABLE, it's a shame to be clocking
    up system time and full_scans: ksmd_should_run added to check that too.

    And move the mutex_lock out around it: the new counts showed that when
    ksm_run is stopped, a little work often got done afterwards, because it
    had been read before taking the mutex.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We kept agreeing not to bother about the unswappable shared KSM pages
    which later become unshared by others: observation suggests they're not
    a significant proportion. But they are disadvantageous, and it is easier
    to break COW to replace them by swappable pages, than offer statistics
    to show that they don't matter; then we can stop worrying about them.

    Doing this in ksm_do_scan, they don't go through cmp_and_merge_page on
    this pass: give them a good chance of getting into the unstable tree
    on the next pass, or back into the stable, by computing checksum now.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The pages_shared and pages_sharing counts give a good picture of how
    successful KSM is at sharing; but no clue to how much wasted work it's
    doing to get there. Add pages_unshared (count of unique pages waiting
    in the unstable tree, hoping to find a mate) and pages_volatile.

    pages_volatile is harder to define. It includes those pages changing
    too fast to get into the unstable tree, but also whatever other edge
    conditions prevent a page getting into the trees: a high value may
    deserve investigation. Don't try to calculate it from the various
    conditions: it's the total of rmap_items less those accounted for.

    Also show full_scans: the number of completed scans of everything
    registered in the mm list.

    The locking for all these counts is simply ksm_thread_mutex.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The pages_shared count is incremented and decremented when adding a node
    to and removing a node from the stable tree: easy to understand. But the
    pages_sharing count was hard to follow, being adjusted in various places:
    increment and decrement it when adding to and removing from the stable tree.

    And the pages_sharing variable used to include the pages_shared, then those
    were subtracted when shown in the pages_sharing sysfs file: now keep it as
    an exclusive count of leaves hanging off the stable tree nodes, throughout.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We're not implementing swapping of KSM pages in its first release;
    but when that follows, "kernel_pages_allocated" will be a very poor
    name for the sysfs file showing number of nodes in the stable tree:
    rename that to "pages_shared" throughout.

    But we already have a "pages_shared", counting those page slots
    sharing the shared pages: first rename that to... "pages_sharing".

    What will become of "max_kernel_pages" when the pages shared can
    be swapped? I guess it will just be removed, so keep that name.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins