16 Dec, 2009

40 commits

  • It has no references outside memory_hotplug.c.

    Cc: "Rafael J. Wysocki"
    Cc: Andi Kleen
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Now that ksm pages are swappable, and the known holes plugged, remove
    mention of unswappable kernel pages from KSM documentation and comments.

    Remove the totalram_pages/4 initialization of max_kernel_pages. In fact,
    remove max_kernel_pages altogether - we can reinstate it if removal turns
    out to break someone's script; but if we later want to limit KSM's memory
    usage, limiting the stable nodes would not be an effective approach.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The previous patch enables page migration of ksm pages, but that soon gets
    into trouble: not surprising, since we're using the ksm page lock to lock
    operations on its stable_node, but page migration switches the page whose
    lock is to be used for that. Another layer of locking would fix it, but
    do we need that yet?

    Do we actually need page migration of ksm pages? Yes, memory hotremove
    needs to offline sections of memory: and since we stopped allocating ksm
    pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
    candidates for migration.

    But KSM is currently unconscious of NUMA issues, happily merging pages
    from different NUMA nodes: at present the rule must be, not to use
    MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration of
    ksm pages does not make sense yet.

    So, to complete support for ksm swapping we need to make hotremove safe.
    Make ksm_memory_callback() take ksm_thread_mutex when MEM_GOING_OFFLINE and
    release it when MEM_OFFLINE or MEM_CANCEL_OFFLINE. But if mapped pages
    are freed before migration reaches them, stable_nodes may be left still
    pointing to struct pages which have been removed from the system: the
    stable_node needs to identify a page by pfn rather than page pointer, then
    it can safely prune them when MEM_OFFLINE.

    And make NUMA migration skip PageKsm pages where it skips PageReserved.
    But it's only when we reach unmap_and_move() that the page lock is taken
    and we can be sure that raised pagecount has prevented a PageAnon from
    being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
    page when offlining (has sufficient locking) but reject it otherwise.
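
    The pfn point above can be illustrated with a minimal userspace sketch.
    It assumes a simplified stable_node kept on a plain list rather than an
    rbtree; the struct layout and prune_offlined() are hypothetical names for
    illustration, not the kernel's code. The idea shown is only that a range
    check on an integer pfn never dereferences a struct page which may
    already be gone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for KSM's stable_node. */
struct stable_node {
    unsigned long kpfn;          /* page frame number of the ksm page */
    struct stable_node *next;    /* plain list here, not an rbtree */
};

/*
 * Unlink every node whose pfn lies in [start_pfn, end_pfn), the range
 * being offlined. Only integers are compared, so no struct page is
 * touched. Returns the number of nodes pruned.
 */
static size_t prune_offlined(struct stable_node **head,
                             unsigned long start_pfn,
                             unsigned long end_pfn)
{
    size_t pruned = 0;
    struct stable_node **link = head;

    while (*link) {
        struct stable_node *node = *link;

        if (node->kpfn >= start_pfn && node->kpfn < end_pfn) {
            *link = node->next;   /* unlink; caller owns the storage */
            node->next = NULL;
            pruned++;
        } else {
            link = &node->next;
        }
    }
    return pruned;
}
```

    In this sketch the caller would hold whatever lock protects the list
    (ksm_thread_mutex in the description above) around the call.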

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A side-effect of making ksm pages swappable is that they have to be placed
    on the LRUs: which then exposes them to isolate_lru_page() and hence to
    page migration.

    Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
    rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
    consolidation with existing code is possible, but don't attempt that yet
    (try_to_unmap needs to handle nonlinears, but migration pte removal does
    not).

    rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
    remove_anon_migration_ptes() which it replaces, avoids calling
    page_lock_anon_vma(), because that includes a page_mapped() test which
    fails when all migration ptes are in place. That was valid when NUMA page
    migration was introduced (holding mmap_sem provided the missing guarantee
    that anon_vma's slab had not already been destroyed), but I believe not
    valid in the memory hotremove case added since.

    For now do the same as before, and consider the best way to fix that
    unlikely race later on. When fixed, we can probably use rmap_walk() on
    hwpoisoned ksm pages too: for now, they remain among hwpoison's various
    exceptions (its PageKsm test comes before the page is locked, but its
    page_lock_anon_vma fails safely if an anon gets upgraded).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • But ksm swapping does require one small change in mem cgroup handling.
    When do_swap_page()'s call to ksm_might_need_to_copy() does indeed
    substitute a duplicate page to accommodate a different anon_vma (or a
    different index), that page escaped mem cgroup accounting, because of the
    !PageSwapCache check in mem_cgroup_try_charge_swapin().

    That was returning success without charging, on the assumption that
    pte_same() would fail after, which is not the case here. Originally I
    proposed that success, so that an unshrinkable mem cgroup at its limit
    would not fail unnecessarily; but that's a minor point, and there are
    plenty of other places where we may fail an overallocation which might
    later prove unnecessary. So just go ahead and do what all the other
    exceptions do: proceed to charge current mm.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When ksm pages were unswappable, it made no sense to include them in mem
    cgroup accounting; but now that they are swappable (although I see no
    strict logical connection) the principle of least surprise implies that
    they should be accounted (with the usual dissatisfaction, that a shared
    page is accounted to only one of the cgroups using it).

    This patch was intended to add mem cgroup accounting where necessary; but
    turned inside out, it now avoids allocating a ksm page, instead upgrading
    an anon page to ksm - which brings its existing mem cgroup accounting with
    it. Thus mem cgroups don't appear in the patch at all.

    This upgrade from PageAnon to PageKsm takes place under page lock (via a
    somewhat hacky NULL kpage interface), and audit showed only one place
    which needed to cope with the race - page_referenced() is sometimes used
    without page lock, so page_lock_anon_vma() needs an ACCESS_ONCE() to be
    sure of getting anon_vma and flags together (no problem if the page goes
    ksm an instant after, the integrity of that anon_vma list is unaffected).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's a lamentable flaw in KSM swapping: the stable_node holds a
    reference to the ksm page, so the page to be freed cannot actually be
    freed until ksmd works its way around to removing the last rmap_item from
    its stable_node. Which in some configurations may take minutes: not quite
    responsive enough for memory reclaim. And we don't want to twist KSM and
    its locking more tightly into the rest of mm. What a pity.

    But although the stable_node needs to hold a pointer to the ksm page, does
    it actually need to raise the reference count of that page?

    No. It would need to do so if struct pages were ordinary kmalloc'ed
    objects; but they are more stable than that, and reused in particular ways
    according to particular rules.

    Access to stable_node from its pointer in struct page is no problem, so
    long as we never free a stable_node before the ksm page itself has been
    freed. Access to struct page from its pointer in stable_node: reintroduce
    get_ksm_page(), and let that peep out through its keyhole (the stable_node
    pointer to ksm page), to see if that struct page still holds the right key
    to open it (the ksm page mapping pointer back to this stable_node).

    This relies upon the established way in which free_hot_cold_page() sets an
    anon (including ksm) page->mapping to NULL; and relies upon no other user
    of a struct page to put something which looks like the original
    stable_node pointer (with two low bits also set) into page->mapping. It
    also needs the get_page_unless_zero() technique pioneered by speculative
    pagecache; and uses rcu_read_lock() to keep the guarantees that gives.

    There are several drivers which put pointers of their own into page->
    mapping; but none of those could coincide with our stable_node pointers,
    since KSM won't free a stable_node until it sees that the page has gone.

    The only problem case found is the pagetable spinlock USE_SPLIT_PTLOCKS
    places in struct page (my own abuse): to accommodate GENERIC_LOCKBREAK's
    break_lock on 32-bit, that spans both page->private and page->mapping.
    Since break_lock is only 0 or 1, again no confusion for get_ksm_page().

    But what of DEBUG_SPINLOCK on 64-bit bigendian? When owner_cpu is 3
    (matching PageKsm low bits), it might see 0xdead4ead00000003 in page->
    mapping, which might coincide? We could get around that by... but a
    better answer is to suppress USE_SPLIT_PTLOCKS when DEBUG_SPINLOCK or
    DEBUG_LOCK_ALLOC, to stop bloating sizeof(struct page) in their case -
    already proposed in an earlier mm/Kconfig patch.
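
    The keyhole check can be sketched in userspace roughly as follows. This
    is a single-threaded illustration under stated assumptions: the struct
    layouts are invented for the example, the reference count is a plain int
    rather than an atomic, and the rcu_read_lock() protection mentioned above
    is omitted. It is not the kernel's get_ksm_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_MAPPING_ANON  ((uintptr_t)1)
#define PAGE_MAPPING_KSM   ((uintptr_t)2)
#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

/* Hypothetical, heavily simplified struct page. */
struct page {
    uintptr_t mapping;   /* pointer value plus two low flag bits */
    int count;           /* reference count; 0 means the page is free */
};

/* Hypothetical stable_node: points at the page without pinning it. */
struct stable_node {
    struct page *page;
};

/* Take a reference only if the page is not already free. */
static bool get_page_unless_zero(struct page *page)
{
    if (page->count == 0)
        return false;
    page->count++;
    return true;
}

static void put_page(struct page *page)
{
    page->count--;
}

/* The "key": the node's own address with both low bits set. */
static uintptr_t expected_mapping(struct stable_node *node)
{
    return (uintptr_t)node | PAGE_MAPPING_FLAGS;
}

/* Return the pinned ksm page, or NULL if the page has moved on. */
static struct page *get_ksm_page(struct stable_node *node)
{
    struct page *page = node->page;
    uintptr_t key = expected_mapping(node);

    if (page->mapping != key)
        return NULL;
    if (!get_page_unless_zero(page))
        return NULL;
    if (page->mapping != key) {   /* recheck now that it is pinned */
        put_page(page);
        return NULL;
    }
    return page;
}
```

    The recheck after pinning matters only with concurrency, which this
    sketch does not model; it is kept to show the shape of the technique.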

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • For full functionality, page_referenced_one() and try_to_unmap_one() need
    to know the vma: to pass vma down to arch-dependent flushes, or to observe
    VM_LOCKED or VM_EXEC. But KSM keeps no record of vma: nor can it, since
    vmas get split and merged without its knowledge.

    Instead, note page's anon_vma in its rmap_item when adding to stable tree:
    all the vmas which might map that page are listed by its anon_vma.

    page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma,
    first to find the probable vma, that which matches rmap_item's mm; but if
    that is not enough to locate all instances, traverse again to try the
    others. This catches those occasions when fork has duplicated a pte of a
    ksm page, but ksmd has not yet come around to assign it an rmap_item.

    But each rmap_item in the stable tree which refers to an anon_vma needs to
    take a reference to it. Andrea's anon_vma design cleverly avoided a
    reference count (an anon_vma was free when its list of vmas was empty),
    but KSM now needs to add that. Is a 32-bit count sufficient? I believe
    so - the anon_vma is only free when both count is 0 and list is empty.
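
    The freeing rule in the last sentence can be sketched as below. The
    field and function names are hypothetical, the list of vmas is reduced
    to a counter, and no locking or atomics are shown; it only illustrates
    that whichever of the two conditions is satisfied last decides when the
    anon_vma may go.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified anon_vma. */
struct anon_vma {
    unsigned int ksm_refcount;  /* references held by stable rmap_items */
    unsigned int nr_vmas;       /* stand-in for the list of vmas */
};

/* Free only when both the count is 0 and the list is empty. */
static bool anon_vma_can_free(const struct anon_vma *av)
{
    return av->ksm_refcount == 0 && av->nr_vmas == 0;
}

static void ksm_get_anon_vma(struct anon_vma *av)
{
    av->ksm_refcount++;
}

/* Drop a KSM reference; returns true if the anon_vma may now be freed. */
static bool ksm_put_anon_vma(struct anon_vma *av)
{
    av->ksm_refcount--;
    return anon_vma_can_free(av);
}

/* Unlink one vma; returns true if the anon_vma may now be freed. */
static bool anon_vma_unlink_vma(struct anon_vma *av)
{
    av->nr_vmas--;
    return anon_vma_can_free(av);
}
```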

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Initial implementation for swapping out KSM's shared pages: add
    page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
    faced with a PageKsm page.

    Most of what's needed can be got from the rmap_items listed from the
    stable_node of the ksm page, without discovering the actual vma: so in
    this patch just fake up a struct vma for page_referenced_one() or
    try_to_unmap_one(), then refine that in the next patch.

    Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
    implicit there (being only set with VM_SHARED, already excluded), but
    let's make it explicit, to help justify the lack of nonlinear unmap.

    Rely on the page lock to protect against concurrent modifications to that
    page's node of the stable tree.

    The awkward part is not swapout but swapin: do_swap_page() and
    page_add_anon_rmap() now have to allow for new possibilities - perhaps a
    ksm page still in swapcache, perhaps a swapcache page associated with one
    location in one anon_vma now needed for another location or anon_vma.
    (And the vma might even be no longer VM_MERGEABLE when that happens.)

    ksm_might_need_to_copy() checks for that case, and supplies a duplicate
    page when necessary, simply leaving it to a subsequent pass of ksmd to
    rediscover the identity and merge them back into one ksm page.
    Disappointingly primitive: but the alternative would have to accumulate
    unswappable info about the swapped out ksm pages, limiting swappability.

    Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
    particular case it was handling, so just use it instead.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When KSM merges an mlocked page, it has been forgetting to munlock it:
    that's been left to free_page_mlock(), which reports it in /proc/vmstat as
    unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
    whinges "Page flag mlocked set for process" in mmotm, whereas mainline is
    silently forgiving). Call munlock_vma_page() to fix that.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Add a pointer to the ksm page into struct stable_node, holding a reference
    to the page while the node exists. Put a pointer to the stable_node into
    the ksm page's ->mapping.

    Then we don't need get_ksm_page() while traversing the stable tree: the
    page to compare against is sure to be present and correct, even if it's no
    longer visible through any of its existing rmap_items.

    And we can handle the forked ksm page case more efficiently: no need to
    memcmp our way through the tree to find its match.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Though we still do well to keep rmap_items in the unstable tree without a
    separate tree_item at the node, for several reasons it becomes awkward to
    keep rmap_items in the stable tree without a separate stable_node: lack of
    space in the nicely-sized rmap_item, the need for an anchor as rmap_items
    are removed, the need for a node even when temporarily no rmap_items are
    attached to it.

    So declare struct stable_node (rb_node to place it in the tree and
    hlist_head for the rmap_items hanging off it), and convert stable tree
    handling to use it: without yet taking advantage of it. Note how one
    stable_tree_insert() of a node now has _two_ stable_tree_append()s of the
    two rmap_items being merged.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Free up a pointer in struct rmap_item, by making the mm_slot's rmap_list a
    singly-linked list: we always traverse that list sequentially, and we
    don't even lose any prefetches (but should consider adding a few later).
    Name it rmap_list throughout.

    Do we need to free up that pointer? Not immediately, and in the end, we
    could continue to avoid it with a union; but having done the conversion,
    let's keep it this way, since there's no downside, and maybe we'll want
    more in future (struct rmap_item is a cache-friendly 32 bytes on 32-bit
    and 64 bytes on 64-bit, so we shall want to avoid expanding it).
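
    A singly-linked rmap_list of this kind can be handled with the usual
    pointer-to-pointer idiom, sketched here in userspace. The struct is a
    cut-down, hypothetical rmap_item and the helper names are invented for
    the example; the point is that sequential traversal needs no back
    pointer, even for removal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down rmap_item: one forward pointer, no prev. */
struct rmap_item {
    struct rmap_item *rmap_list;  /* next item for the same mm */
    unsigned long address;
};

/* Insert item keeping the list sorted by address. */
static void rmap_list_insert(struct rmap_item **link, struct rmap_item *item)
{
    while (*link && (*link)->address < item->address)
        link = &(*link)->rmap_list;
    item->rmap_list = *link;
    *link = item;
}

/* Remove item if present; the link we walked through is all we need. */
static void rmap_list_remove(struct rmap_item **link, struct rmap_item *item)
{
    while (*link && *link != item)
        link = &(*link)->rmap_list;
    if (*link) {
        *link = item->rmap_list;
        item->rmap_list = NULL;
    }
}
```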

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Cleanup: make argument names more consistent from cmp_and_merge_page()
    down to replace_page(), so that it's easier to follow the rmap_item's page
    and the matching tree_page and the merged kpage through that code.

    In some places, e.g. break_cow(), pass rmap_item instead of separate mm
    and address.

    In cmp_and_merge_page(), initialize tree_page to NULL, to avoid a "may be used
    uninitialized" warning seen in one config by Anil SB.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There is no need for replace_page() to calculate a write-protected prot:
    vm_page_prot must already be write-protected for an anonymous page (see
    mm/memory.c do_anonymous_page() for similar reliance on vm_page_prot).

    There is no need for try_to_merge_one_page() to get_page and put_page on
    newpage and oldpage: in every case we already hold a reference to each of
    them.

    But some instinct makes me move try_to_merge_one_page()'s unlock_page of
    oldpage down after replace_page(): that doesn't increase contention on the
    ksm page, and makes thinking about the transition easier.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 1. remove_rmap_item_from_tree() is called as a precaution from
    various places: don't dirty the rmap_item cacheline unnecessarily,
    just mask the flags out of the address when they have been set.

    2. First get_next_rmap_item() removes an unstable rmap_item from its tree,
    then shortly afterwards cmp_and_merge_page() removes a stable rmap_item
    from its tree: it's easier just to do both at once (but definitely keep
    the BUG_ON(age > 1) which guards against a future omission).

    3. When cmp_and_merge_page() moves an rmap_item from unstable to stable
    tree, it does its own rb_erase() and accounting: that's better
    expressed by remove_rmap_item_from_tree().

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fix a small inconsistency between ">" and ">=".

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Now all callers of reclaim set swap_cluster_max to SWAP_CLUSTER_MAX,
    so we can remove it completely.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In the old days we didn't have sc.nr_to_reclaim, and that led to
    sc.swap_cluster_max being misused.

    A huge sc.swap_cluster_max creates unnecessary OOM risk with no
    performance benefit.

    Now we can stop that insanity.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • shrink_all_zones() was introduced by commit d6277db4ab (swsusp: rework
    memory shrinker) to improve hibernate performance, and
    sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
    memory for suspend).

    Commit a06fe4d307 said:

    Without the patch:
    Freed 14600 pages in 1749 jiffies = 32.61 MB/s (Anomolous!)
    Freed 88563 pages in 14719 jiffies = 23.50 MB/s
    Freed 205734 pages in 32389 jiffies = 24.81 MB/s

    With the patch:
    Freed 68252 pages in 496 jiffies = 537.52 MB/s
    Freed 116464 pages in 569 jiffies = 798.54 MB/s
    Freed 209699 pages in 705 jiffies = 1161.89 MB/s

    At that time, the patch was well worthwhile. However, modern hardware
    trends and recent VM improvements have eroded its value. For several
    reasons, I think we should remove shrink_all_zones() altogether.

    Details:

    1) In the old days, shrink_zone()'s slowness was mainly caused by a
    stupid io-throttle when there was no i/o congestion.
    But the current shrink_zone() is sane, not slow.

    2) shrink_all_zones() tries to shrink all pages at once, but that doesn't
    work well on a NUMA system.
    Example:
    The system has 4GB of memory, each node has 2GB, and hibernate needs 1GB.

    optimal)
    steal 500MB from each node.
    shrink_all_zones)
    steal 1GB from node-0.

    Oh, the cache balancing logic is broken. ;)
    Unfortunately, desktop systems have moved to NUMA nowadays.
    (Side note: if hibernate requires 2GB, shrink_all_zones() never succeeds
    on the above machine.)

    3) If a node has several pages with I/O in flight, shrink_all_zones()
    gives a pretty bad result.

    Scenario) hibernate needs 1GB

    1) shrink_all_zones() tries to reclaim 1GB from Node-0
    2) but it only reclaims 990MB
    3) stupidly, shrink_all_zones() then tries to reclaim 1GB from Node-1
    4) it reclaims 990MB

    Oh, well. It reclaimed twice as much as required.
    On the other hand, the current shrink_zone() has sane bailing-out logic,
    so it doesn't over-reclaim, and that risk goes away.

    4) The SplitLRU VM always maintains the active/inactive ratio very
    carefully. Shrinking only the inactive list breaks that assumption: it
    creates unnecessary OOM risk and is obviously suboptimal.

    Now shrink_all_memory() is only a wrapper function around
    do_try_to_free_pages(). That improves reviewability and debuggability,
    and solves the above problems.

    Side note: unifying the reclaim logic has two good side effects.
    - It fixes a recursive reclaim bug in shrink_all_memory(): it forgot to
    use PF_MEMALLOC, which meant the system could get stuck in a deadlock.
    - shrink_all_memory() now has lockdep awareness, which improves
    debuggability.
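
    The over-reclaim arithmetic in point 3 can be reproduced with a toy
    model. Everything here is an assumption made for illustration: two
    nodes, amounts in MB, a batch of 32, and two invented functions that
    stand in for "ask each node for the whole target" versus "reclaim in
    batches and bail out once the target is met". Neither function is
    kernel code.

```c
#include <assert.h>

#define NR_NODES 2

/*
 * Toy model of the old behaviour: every node is asked for the whole
 * target, and progress is only checked after a node is finished.
 */
static unsigned long reclaim_per_node_target(const unsigned long avail[NR_NODES],
                                             unsigned long target)
{
    unsigned long total = 0;

    for (int n = 0; n < NR_NODES; n++) {
        total += avail[n] < target ? avail[n] : target;
        if (total >= target)
            break;
    }
    return total;
}

/*
 * Toy model of batched reclaim with bail-out: take one batch at a time
 * from each node in turn and stop as soon as the target is reached.
 */
static unsigned long reclaim_with_bailout(const unsigned long avail[NR_NODES],
                                          unsigned long target,
                                          unsigned long batch)
{
    unsigned long left[NR_NODES];
    unsigned long total = 0;
    int progress = 1;

    for (int n = 0; n < NR_NODES; n++)
        left[n] = avail[n];

    while (total < target && progress) {
        progress = 0;
        for (int n = 0; n < NR_NODES && total < target; n++) {
            unsigned long take = left[n] < batch ? left[n] : batch;

            left[n] -= take;
            total += take;
            if (take)
                progress = 1;
        }
    }
    return total;
}
```

    With 990MB reclaimable on each node and a 1000MB target, the first
    model ends up at 1980MB, while the batched model overshoots by less
    than one batch.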

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Acked-by: Rafael J. Wysocki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, sc.swap_cluster_max has two meanings:

    1) the reclaim batch size, as isolate_lru_pages()'s argument
    2) the reclaim bailing-out threshold

    The two meanings are pretty much unrelated, so let's separate them.
    This patch doesn't change any behavior.
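
    A rough userspace sketch of the separation, assuming a cut-down
    scan_control with only the three fields needed here and an invented
    shrink() helper: swap_cluster_max is used purely as the batch size,
    while nr_to_reclaim alone decides when to bail out.

```c
#include <assert.h>

/* Cut-down, illustrative scan_control; the real one has more fields. */
struct scan_control {
    unsigned long nr_to_reclaim;     /* bail out once this much is reclaimed */
    unsigned long swap_cluster_max;  /* batch size per isolation pass */
    unsigned long nr_reclaimed;      /* running total */
};

/*
 * Hypothetical helper: reclaim from a pool of `available` pages in
 * batches of swap_cluster_max until nr_to_reclaim is reached or the
 * pool is empty. Returns the total reclaimed.
 */
static unsigned long shrink(struct scan_control *sc, unsigned long available)
{
    while (available > 0 && sc->nr_reclaimed < sc->nr_to_reclaim) {
        unsigned long batch = available < sc->swap_cluster_max ?
                              available : sc->swap_cluster_max;

        available -= batch;
        sc->nr_reclaimed += batch;
    }
    return sc->nr_reclaimed;
}
```

    Changing the threshold in this model never changes how many pages are
    taken per pass, and vice versa.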

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Describe NUMA node symlink created for CPUs when CONFIG_NUMA is set.

    Signed-off-by: Alex Chiang
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • You can discover which CPUs belong to a NUMA node by examining
    /sys/devices/system/node/node#/

    However, it's not convenient to go in the other direction, when looking at
    /sys/devices/system/cpu/cpu#/

    Yes, you can muck about in sysfs, but adding these symlinks makes life a
    lot more convenient.

    Signed-off-by: Alex Chiang
    Acked-by: David Rientjes
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Cc: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • By returning early if the node is not online, we can unindent the
    interesting code by two levels.

    No functional change.

    Signed-off-by: Alex Chiang
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Cc: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • By returning early if the node is not online, we can unindent the
    interesting code by one level.

    No functional change.

    Signed-off-by: Alex Chiang
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Cc: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • Commit c04fc586c (mm: show node to memory section relationship with
    symlinks in sysfs) created symlinks from nodes to memory sections, e.g.

    /sys/devices/system/node/node1/memory135 -> ../../memory/memory135

    If you're examining the memory section though and are wondering what node
    it might belong to, you can find it by grovelling around in sysfs, but
    it's a little cumbersome.

    Add a reverse symlink for each memory section that points back to the
    node to which it belongs.

    Signed-off-by: Alex Chiang
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Acked-by: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • When do_nonlinear_fault() realizes that the page table must have been
    corrupted for it to have been called, it does print_bad_pte() and returns
    ... VM_FAULT_OOM, which is hard to understand.

    It made some sense when I did it for 2.6.15, when do_page_fault() just
    killed the current process; but nowadays it lets the OOM killer decide who
    to kill - so page table corruption in one process would be liable to kill
    another.

    Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee that
    the process will be killed, but is good enough for such a rare
    abnormality, accompanied as it is by the "BUG: Bad page map" message.

    And recent HWPOISON work has copied that code into do_swap_page(), when it
    finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • CONFIG_DEBUG_SPINLOCK adds 12 or 16 bytes to a 32- or 64-bit spinlock_t,
    and CONFIG_DEBUG_LOCK_ALLOC adds another 12 or 24 bytes to it: lockdep
    enables both of those, and CONFIG_LOCK_STAT adds 8 or 16 bytes to that.

    When 2.6.15 placed the split page table lock inside struct page (usually
    sized 32 or 56 bytes), only CONFIG_DEBUG_SPINLOCK was a possibility, and
    we ignored the enlargement (but fitted in CONFIG_GENERIC_LOCKBREAK's 4 by
    letting the spinlock_t occupy both page->private and page->mapping).

    Should these debugging options be allowed to double the size of a struct
    page, when only one minority use of the page (as a page table) needs to
    fit a spinlock in there? Perhaps not.

    Take the easy way out: switch off SPLIT_PTLOCK_CPUS when DEBUG_SPINLOCK or
    DEBUG_LOCK_ALLOC is in force. I've sometimes tried to be cleverer,
    kmallocing a cacheline for the spinlock when it doesn't fit, but given up
    each time. Falling back to mm->page_table_lock (as we do when ptlock is
    not split) lets lockdep check out the strictest path anyway.

    And now that some arches allow 8192 cpus, use 999999 for infinity.

    (What has this got to do with KSM swapping? It doesn't care about the
    size of struct page, but may care about random junk in page->mapping - to
    be explained separately later.)

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM swapping will know where page_referenced_one() and try_to_unmap_one()
    should look. It could hack page->index to get them to do what it wants,
    but it seems cleaner now to pass the address down to them.

    Make the same change to page_mkclean_one(), since it follows the same
    pattern; but there's no real need in its case.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove three degrees of obfuscation, left over from when we had
    CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
    CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
    built when CONFIG_MMU, so don't need such conditions at all.

    Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
    169 defconfigs: leave those to evolve in due course.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's contorted mlock/munlock handling in try_to_unmap_anon() and
    try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping.
    Simplify it by moving it all down into try_to_unmap_one().

    One thing is then lost, try_to_munlock()'s distinction between when no vma
    holds the page mlocked, and when a vma does mlock it, but we could not get
    mmap_sem to set the page flag. But its only caller takes no interest in
    that distinction (and is better testing SWAP_MLOCK anyway), so let's keep
    the code simple and return SWAP_AGAIN for both cases.

    try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
    amusing: once unravelled, it turns out to have been choosing between two
    different ways of doing the same nothing. Ah, no, one way was actually
    returning SWAP_FAIL when it meant to return SWAP_SUCCESS.

    [kosaki.motohiro@jp.fujitsu.com: comment adding to mlocking in try_to_unmap_one]
    [akpm@linux-foundation.org: remove test of MLOCK_PAGES]
    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: KOSAKI Motohiro
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
    in page->mapping, with the higher bits a pointer to the anon_vma; and have
    defined PageKsm(page) as that with NULL anon_vma.

    But KSM swapping will need to store a pointer there: so in preparation for
    that, now define PAGE_MAPPING_FLAGS as the low two bits, including
    PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
    other use for the bit emerges).

    Declare page_rmapping(page) to return the pointer part of page->mapping,
    and page_anon_vma(page) to return the anon_vma pointer when that's what it
    is. Use these in a few appropriate places: notably, unuse_vma() has been
    testing page->mapping, but is better to be testing page_anon_vma() (cases
    may be added in which flag bits are set without any pointer).
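
    The encoding described above can be sketched in userspace as follows.
    The macro and helper names follow the text, but the struct layouts are
    simplified stand-ins and the bodies are an illustration written for this
    example, not copied from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_MAPPING_ANON  ((uintptr_t)1)
#define PAGE_MAPPING_KSM   ((uintptr_t)2)
#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

/* Simplified stand-ins; alignment keeps the two low pointer bits clear. */
struct anon_vma { int dummy; };
struct page { uintptr_t mapping; };

static bool PageAnon(const struct page *page)
{
    return (page->mapping & PAGE_MAPPING_ANON) != 0;
}

/* KSM pages have both low bits set. */
static bool PageKsm(const struct page *page)
{
    return (page->mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_FLAGS;
}

/* The pointer part of page->mapping, whatever it points to. */
static void *page_rmapping(const struct page *page)
{
    return (void *)(page->mapping & ~PAGE_MAPPING_FLAGS);
}

/* The anon_vma pointer, but only when that is what the mapping holds. */
static struct anon_vma *page_anon_vma(const struct page *page)
{
    if ((page->mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
        return NULL;
    return page_rmapping(page);
}
```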

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
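
The flag scheme described in the commit above can be sketched in userspace C. This is a minimal illustration, not the kernel's code: the flag and accessor names come from the commit message, but the struct layouts, the numeric flag values, and the set_anon_mapping() helper are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Names follow the commit message; the values are assumed for this sketch. */
#define PAGE_MAPPING_ANON  ((uintptr_t)1)
#define PAGE_MAPPING_KSM   ((uintptr_t)2)
#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

struct anon_vma { int dummy; };      /* hypothetical placeholder type */
struct page { uintptr_t mapping; };  /* simplified: pointer plus low flag bits */

/* Hypothetical helper: store a pointer with the anon (and optionally ksm) bits. */
static inline void set_anon_mapping(struct page *page, struct anon_vma *av, bool ksm)
{
    uintptr_t bits = (uintptr_t)av;

    /* The low two bits must be free, i.e. the pointer is 4-byte aligned. */
    assert((bits & PAGE_MAPPING_FLAGS) == 0);
    bits |= PAGE_MAPPING_ANON;
    if (ksm)
        bits |= PAGE_MAPPING_KSM;   /* KSM is always set along with ANON */
    page->mapping = bits;
}

static inline bool PageAnon(const struct page *page)
{
    return (page->mapping & PAGE_MAPPING_ANON) != 0;
}

static inline bool PageKsm(const struct page *page)
{
    return (page->mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_FLAGS;
}

/* Pointer part of page->mapping, whatever the flag bits say. */
static inline void *page_rmapping(const struct page *page)
{
    return (void *)(page->mapping & ~PAGE_MAPPING_FLAGS);
}

/* The anon_vma pointer, but only when that is what the pointer part is. */
static inline struct anon_vma *page_anon_vma(const struct page *page)
{
    if ((page->mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
        return NULL;
    return page_rmapping(page);
}
```

In this sketch a ksm page still reports PageAnon(), but page_anon_vma() returns NULL for it, which is why a caller testing page->mapping directly could be misled by flag bits alone.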
     
  • If reclaim fails to make sufficient progress, the priority is raised.
    Once the priority is higher, kswapd starts waiting on congestion.
    However, if the zone is below the min watermark then kswapd needs to
    continue working without delay as there is a danger of an increased rate
    of GFP_ATOMIC allocation failure.

    This patch changes the conditions under which kswapd waits on congestion
    by only going to sleep if the min watermarks are being met.

    [mel@csn.ul.ie: add stats to track how relevant the logic is]
    [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
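
The rule in the commit above (wait on congestion only while the min watermarks are met) can be sketched as a small decision function. The zone_sketch struct, its field names, and both function names are hypothetical; the real logic lives in kswapd's reclaim loop and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a zone for this sketch. */
struct zone_sketch {
    unsigned long free_pages;
    unsigned long min_wmark;
};

/* True only if every zone still has at least its min watermark free. */
static bool min_watermarks_met(const struct zone_sketch *zones, size_t nr)
{
    assert(zones != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++) {
        if (zones[i].free_pages < zones[i].min_wmark)
            return false;
    }
    return true;
}

/*
 * Wait on congestion only when reclaim progress was insufficient (the
 * priority has been raised) AND no zone is below its min watermark.
 * Below the min watermark, keep reclaiming without delay.
 */
static bool should_wait_on_congestion(const struct zone_sketch *zones,
                                      size_t nr, bool priority_raised)
{
    return priority_raised && min_watermarks_met(zones, nr);
}
```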
     
  • After kswapd balances all zones in a pgdat, it goes to sleep. In the
    event of no IO congestion, kswapd can go to sleep very shortly after the
    high watermark was reached. If there is a constant stream of allocations
    from parallel processes, it can mean that kswapd went to sleep too quickly
    and the high watermark is not being maintained for a sufficient length of
    time.

    This patch makes kswapd go to sleep as a two-stage process. It first
    tries to sleep for HZ/10. If it is woken up by another process or the
    high watermark is no longer met, it's considered a premature sleep and
    kswapd continues work. Otherwise it goes fully to sleep.

    This adds more counters to distinguish between fast and slow breaches of
    watermarks. A "fast" premature sleep is one where the low watermark was
    hit in a very short time after kswapd going to sleep. A "slow" premature
    sleep indicates that the high watermark was breached after a very short
    interval.

    Signed-off-by: Mel Gorman
    Cc: Frans Pop
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
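
The two-stage sleep in the commit above can be sketched as a classification of what happened during the short HZ/10 nap. The enum, the function name, and the parameters are hypothetical. The sketch also assumes that an early wake-up counts as the "fast" case (on the guess that the waker is an allocator that hit the low watermark); the commit message does not spell that mapping out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome names for this sketch. */
enum sleep_outcome {
    SLEEP_FULLY,      /* short sleep was undisturbed: go fully to sleep */
    PREMATURE_FAST,   /* woken early: low watermark assumed hit quickly */
    PREMATURE_SLOW    /* timed out, but the high watermark is not met   */
};

/*
 * Classify the result of the first-stage sleep.
 * woken_early: another process woke kswapd before HZ/10 elapsed.
 * free_pages / low_wmark / high_wmark: simplified per-node totals.
 */
static enum sleep_outcome classify_short_sleep(bool woken_early,
                                               unsigned long free_pages,
                                               unsigned long low_wmark,
                                               unsigned long high_wmark)
{
    assert(low_wmark <= high_wmark);

    if (woken_early || free_pages < low_wmark)
        return PREMATURE_FAST;
    if (free_pages < high_wmark)
        return PREMATURE_SLOW;
    return SLEEP_FULLY;
}
```

In either premature case kswapd would continue working and bump the matching counter; only SLEEP_FULLY leads to the second, unbounded sleep.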
     
  • When the code jumps to the `out' label, `referenced' is still zero, so
    there is no need to check it.

    Signed-off-by: Huang Shijie
    Acked-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Just simplify the code when `mlocked' is true.

    Signed-off-by: Huang Shijie
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Fix the comment for try_to_unmap_anon() with the new arguments.

    Signed-off-by: Huang Shijie
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Commit 543ade1fc9 ("Streamline generic_file_* interfaces and filemap
    cleanups") removed generic_file_write() from filemap. Update the comment
    in vmscan's pageout() to refer to __generic_file_aio_write() instead.

    Signed-off-by: Vincent Li
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Li
     
  • It seems that page_io.c doesn't really need to know that
    page_private(page) is the swp_entry 'val'. Rework map_swap_page() to do
    what its name says and map a page to a page offset in the swap space.

    The only other caller of map_swap_page() is internal to mm/swapfile.c and
    it does want to map a swap entry to the 'sector'. So rename
    map_swap_page() to map_swap_entry(), make it 'static' and implement
    map_swap_page() as a wrapper around that.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
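
The wrapper arrangement described in the commit above can be sketched as follows. The function names map_swap_page() and map_swap_entry() come from the commit message; everything else is an assumption for illustration: the types are simplified stand-ins, the signatures omit arguments the kernel versions may take, and the entry encoding and the "single extent starting at sector 0" mapping are invented so the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long sector_sketch_t;        /* stand-in for sector_t */
typedef struct { unsigned long val; } swp_entry_t;

/* Simplified page: only the private word that holds the swp_entry 'val'. */
struct page { unsigned long private_val; };

/* Hypothetical encoding: the top 5 bits are the swap type, the rest the offset. */
#define SKETCH_TYPE_BITS 5
#define SKETCH_OFFSET_MASK \
    ((1UL << (sizeof(unsigned long) * 8 - SKETCH_TYPE_BITS)) - 1)

static unsigned long page_private(const struct page *page)
{
    return page->private_val;
}

/*
 * Internal helper: map a swap entry to a 'sector'. The real extent lookup
 * is omitted; this sketch pretends there is one extent starting at 0.
 */
static sector_sketch_t map_swap_entry(swp_entry_t entry)
{
    return (sector_sketch_t)(entry.val & SKETCH_OFFSET_MASK);
}

/*
 * Public wrapper: callers pass a page and never need to know that
 * page_private(page) happens to be the swp_entry 'val'.
 */
sector_sketch_t map_swap_page(const struct page *page)
{
    swp_entry_t entry;

    assert(page != NULL);
    entry.val = page_private(page);
    return map_swap_entry(entry);
}
```

Keeping map_swap_entry() static confines knowledge of the entry encoding to one file, while the page-based wrapper is the only interface other code sees.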
     
  • Reorder (and comment) the fields of swap_info_struct, to make better
    use of its cachelines: it's good for swap_duplicate() in particular
    if unsigned int max and swap_map are near the start.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins