22 Sep, 2009

40 commits

  • page_is_file_cache() has been used for both boolean checks and LRU
    arithmetic, which was always a bit weird.

    Now that page_lru_base_type() exists for LRU arithmetic, make
    page_is_file_cache() a real predicate function and adjust the
    boolean-using callsites to drop those pesky double negations.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Instead of abusing page_is_file_cache() for LRU list index arithmetic, add
    another helper with a more appropriate name and convert the non-boolean
    users of page_is_file_cache() accordingly.

    This new helper gives the LRU base type a page is supposed to live on,
    inactive anon or inactive file.

    [hugh.dickins@tiscali.co.uk: convert del_page_from_lru() also]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
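
    A minimal sketch of how the two helpers might fit together, loosely
    following the mm_inline.h of that era (simplified, not the verbatim
    kernel code):

      static inline int page_is_file_cache(struct page *page)
      {
              /* a real predicate now: returns 0 or 1 */
              return !PageSwapBacked(page);
      }

      static inline enum lru_list page_lru_base_type(struct page *page)
      {
              /* the inactive LRU list this page is supposed to live on */
              if (page_is_file_cache(page))
                      return LRU_INACTIVE_FILE;
              return LRU_INACTIVE_ANON;
      }

    LRU arithmetic then starts from the base type (adding LRU_ACTIVE for an
    active page, for instance) instead of adding the old pseudo-boolean
    return value.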
     
  • Remove double negations where the operand is already boolean.

    Signed-off-by: Johannes Weiner
    Cc: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The kzalloc mempool zeros items when they are initially allocated, but
    does not rezero used items that are returned to the pool. Consequently
    mempool_alloc()s may return non-zeroed memory.

    Since there are/were only two in-tree users for
    mempool_create_kzalloc_pool(), and 'fixing' this in a way that will
    re-zero used (but not new) items before first use is non-trivial, just
    remove it.

    Signed-off-by: Sage Weil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sage Weil
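
    A sketch of the resulting rule for callers (the element type and pool
    names are made up for illustration):

      struct my_elem *e = mempool_alloc(pool, GFP_NOIO);

      if (!e)
              return -ENOMEM;
      /*
       * Only elements freshly allocated by kzalloc come back zeroed; an
       * element recycled through mempool_free() still holds whatever the
       * previous user left in it, so clear it explicitly before use.
       */
      memset(e, 0, sizeof(*e));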
     
  • The kzalloc mempool does not re-zero items that have been used and then
    returned to the pool. Manually zero the allocated multipath_bh instead.

    Acked-by: Neil Brown
    Signed-off-by: Sage Weil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sage Weil
     
  • Fix the following 'make includecheck' warning:

    mm/nommu.c: internal.h is included more than once.

    Signed-off-by: Jaswinder Singh Rajput
    Cc: David Howells
    Acked-by: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaswinder Singh Rajput
     
  • Fix the following 'make includecheck' warning:

    mm/shmem.c: linux/vfs.h is included more than once.

    Signed-off-by: Jaswinder Singh Rajput
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaswinder Singh Rajput
     
    After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
    only a context that has set the SWAP_HAS_CACHE flag via swapcache_prepare()
    or get_swap_page() will call add_to_swap_cache(), so add_to_swap_cache()
    doesn't return -EEXIST any more.

    Even though it doesn't return -EEXIST, it is conceptually bad behaviour to
    call swapcache_prepare() in the -EEXIST case, because it means clearing
    the SWAP_HAS_CACHE flag while the entry is in the swap cache.

    This patch removes the now-redundant code and comments from the callers,
    and adds a VM_BUG_ON() and some comments to the error path of
    add_to_swap_cache().

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
    After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
    read_swap_cache_async() will busy-wait while an entry is not yet in the
    swap cache but has the SWAP_HAS_CACHE flag set.

    Such entries can exist on the add/delete paths of the swap cache. On the
    add path, add_to_swap_cache() is called soon after the SWAP_HAS_CACHE flag
    is set, and on the delete path, swapcache_free() (which clears the
    SWAP_HAS_CACHE flag) is called soon after __delete_from_swap_cache(). So
    the busy-wait works well in most cases.

    But this mechanism can cause a soft lockup if add_to_swap_cache() sleeps
    and read_swap_cache_async() tries to swap in the same entry on the same
    cpu.

    This patch calls radix_tree_preload() before swapcache_prepare() and
    divides add_to_swap_cache() into two parts: the radix_tree_preload() part
    and the radix_tree_insert() part (defined as __add_to_swap_cache()).

    Signed-off-by: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
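
    A simplified sketch of the resulting ordering (error handling trimmed,
    and not the verbatim mm/swap_state.c code):

      err = radix_tree_preload(gfp_mask & GFP_KERNEL);
      if (err)
              return err;

      err = swapcache_prepare(entry);          /* sets SWAP_HAS_CACHE */
      if (!err) {
              /* the insertion half, split out as __add_to_swap_cache(),
               * no longer sleeps between prepare and insert */
              err = __add_to_swap_cache(page, entry);
              if (err)
                      swapcache_free(entry, NULL);
      }
      radix_tree_preload_end();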
     
  • Knowing tracepoints exist is not quite the same as knowing what they
    should be used for. This patch adds a document giving a basic description
    of the kmem tracepoints and why they might be useful to a performance
    analyst.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …lysis with tracepoints

    The documentation for ftrace, events and tracepoints is pretty extensive.
    Similarly, the perf PCL tools' --help output is there, and the code is
    simple enough to figure out what most of the switches mean. However,
    pulling the discrete bits and pieces together and translating that into
    "how do I solve a problem" requires a fair amount of imagination.

    This patch adds a simple document intended to get someone started on that
    path.

    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Rik van Riel <riel@redhat.com>
    Reviewed-by: Ingo Molnar <mingo@elte.hu>
    Cc: Larry Woodman <lwoodman@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Li Ming Chun <macli@brc.ubc.ca>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • This patch adds a simple post-processing script for the
    page-allocator-related trace events. It can be used to give an indication
    of who the most allocator-intensive processes are and how often the zone
    lock was taken during the tracing period. Example output looks like

    Process                    Pages     Pages     Pages     Pages      PCPU      PCPU      PCPU  Fragment  Fragment   MigType  Fragment  Fragment   Unknown
    details                   allocd    allocd     freed     freed     pages    drains   refills  Fallback   Causing   Changed    Severe  Moderate
                                    under lock    direct   pagevec     drain
    swapper-0                      0         0         2         0         0         0         0         0         0         0         0         0         0
    Xorg-3770                  10603      5952      3685      6978      5996       194       192         0         0         0         0         0         0
    modprobe-21397                51         0         0        86        31         1         0         0         0         0         0         0         0
    xchat-5370                   228        93         0         0         0         0         3         0         0         0         0         0         0
    awesome-4317                  32        32         0         0         0         0        32         0         0         0         0         0         0
    thinkfan-3863                  2         0         1         1         0         0         0         0         0         0         0         0         0
    hald-addon-stor-3935           2         0         0         0         0         0         0         0         0         0         0         0         0
    akregator-4506                 1         1         0         0         0         0         1         0         0         0         0         0         0
    xmms-14888                     0         0         1         0         0         0         0         0         0         0         0         0         0
    khelper-12                     1         0         0         0         0         0         0         0         0         0         0         0         0

    Optionally, the output can include information on the parent, or aggregate
    based on process name instead of on each PID. Example output including
    parent information, with the PID otherwise stripped out, looks something
    like:

    Process                        Pages     Pages     Pages     Pages      PCPU      PCPU      PCPU  Fragment  Fragment   MigType  Fragment  Fragment   Unknown
    details                       allocd    allocd     freed     freed     pages    drains   refills  Fallback   Causing   Changed    Severe  Moderate
                                        under lock    direct   pagevec     drain
    gdm-3756 :: Xorg-3770           3796      2976        99      3813      3224       104        98         0         0         0         0         0         0
    init-1 :: hald-3892                1         0         0         0         0         0         0         0         0         0         0         0         0
    git-21447 :: editor-21448          4         0         4         0         0         0         0         0         0         0         0         0         0

    This says that Xorg allocated 3796 pages and its parent process is gdm
    with a PID of 3756.

    The postprocessor parses the text output of tracing. While there is a
    binary format, the expectation is that the binary output can be readily
    translated into text and post-processed offline. Obviously if the text
    format changes, the parser will break but the regular expression parser is
    fairly rudimentary so should be readily adjustable.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    The page allocation trace event reports that a page was successfully
    allocated but it does not specify where it came from. When analysing
    performance, it can be important to distinguish between pages coming from
    the per-cpu allocator and pages coming from the buddy lists, as the latter
    requires the zone lock to be taken and more data structures to be
    examined.

    This patch adds a trace event for __rmqueue reporting when a page is being
    allocated from the buddy lists. It distinguishes between being called to
    refill the per-cpu lists or whether it is a high-order allocation.
    Similarly, this patch adds an event to catch when the PCP lists are being
    drained a little and pages are going back to the buddy lists.

    This is trickier to draw conclusions from but high activity on those
    events could explain why there were a large number of cache misses on a
    page-allocator-intensive workload. The coalescing and splitting of
    buddies involves a lot of writing of page metadata and cache line bounces
    not to mention the acquisition of an interrupt-safe lock necessary to
    enter this path.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Fragmentation avoidance depends on being able to use free pages from lists
    of the appropriate migrate type. In the event this is not possible,
    __rmqueue_fallback() selects a different list and in some circumstances
    changes the migratetype of the pageblock. Simplistically, the more times
    this event occurs, the more likely fragmentation will later be a problem,
    at least for hugepage allocation, but there are other considerations such
    as the order of the page being split to satisfy the allocation.

    This patch adds a trace event for __rmqueue_fallback() that reports what
    page is being used for the fallback, the orders of relevant pages, the
    desired migratetype and the migratetype of the lists being used, whether
    the pageblock changed type and whether this event is important with
    respect to fragmentation avoidance or not. This information can be used
    to help analyse fragmentation avoidance and help decide whether
    min_free_kbytes should be increased or not.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch adds trace events for the allocation and freeing of pages,
    including the freeing of pagevecs. Using the events, it will be known
    what struct page and pfns are being allocated and freed and what the call
    site was in many cases.

    The page alloc tracepoints can be used as an indicator of whether the
    workload was heavily dependent on the page allocator or not. You can make
    a guess based on vmstat but you can't get a per-process breakdown.
    Depending on the call path, the call_site for page allocation may be
    __get_free_pages() instead of a useful callsite. Instead of passing down
    a return address as slab debugging does, the user should enable the
    stacktrace and sym-addr trace options to get a proper stack trace.

    The pagevec free tracepoint has a different use case. It can be used to
    get an idea of how many pages are being dumped off the LRU and whether it
    is kswapd doing the work or a process doing direct reclaim.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
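
    For readers unfamiliar with how such events are declared, a trimmed-down
    sketch in the TRACE_EVENT() style of include/trace/events/kmem.h (field
    list and output format simplified, not the verbatim definition):

      TRACE_EVENT(mm_page_alloc,

              TP_PROTO(struct page *page, unsigned int order,
                       gfp_t gfp_flags, int migratetype),

              TP_ARGS(page, order, gfp_flags, migratetype),

              TP_STRUCT__entry(
                      __field(struct page *, page)
                      __field(unsigned int,  order)
                      __field(gfp_t,         gfp_flags)
                      __field(int,           migratetype)
              ),

              TP_fast_assign(
                      __entry->page        = page;
                      __entry->order       = order;
                      __entry->gfp_flags   = gfp_flags;
                      __entry->migratetype = migratetype;
              ),

              TP_printk("page=%p pfn=%lu order=%u migratetype=%d",
                        __entry->page, page_to_pfn(__entry->page),
                        __entry->order, __entry->migratetype)
      );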
     
  • The function free_cold_page() has no callers so delete it.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 96177299416dbccb73b54e6b344260154a445375 ("Drop free_pages()")
    modified nr_free_pages() to return 'unsigned long' instead of 'unsigned
    int'. This made the casts to 'unsigned long' in most callers superfluous,
    so remove them.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Christoph Lameter
    Acked-by: Ingo Molnar
    Acked-by: Russell King
    Acked-by: David S. Miller
    Acked-by: Kyle McMartin
    Acked-by: WANG Cong
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Haavard Skinnemoen
    Cc: Mikael Starvik
    Cc: "Luck, Tony"
    Cc: Hirokazu Takata
    Cc: Ralf Baechle
    Cc: David Howells
    Acked-by: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Chris Zankel
    Cc: Michal Simek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
    /proc/kcore has its own routine to access the vmalloc area. It can be
    replaced with vread(), and by doing so /proc/kcore can access the vmalloc
    area safely.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: WANG Cong
    Cc: Mike Smith
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
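
    The new path is essentially "bounce through a kernel buffer with vread()";
    a simplified sketch, where tsz, start and buffer stand for the chunk size,
    the kernel virtual address and the user-space destination:

      char *buf = kzalloc(tsz, GFP_KERNEL);

      if (!buf)
              return -ENOMEM;
      vread(buf, (char *)start, tsz);    /* safe even across vmalloc holes */
      if (copy_to_user(buffer, buf, tsz)) {
              kfree(buf);
              return -EFAULT;
      }
      kfree(buf);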
     
    vread/vwrite access the vmalloc area without checking whether a page is
    actually there. In most cases this works well.

    Historically, the only caller of get_vm_area() was IOREMAP and there was
    no memory hole within a vm_struct's [addr...addr + size - PAGE_SIZE]
    range (the -PAGE_SIZE accounts for a guard page).

    Since the per-cpu-alloc patch, get_vm_area() is also used to reserve a
    contiguous virtual address range that is remapped _later_, so there can
    now be holes in otherwise valid vmalloc areas on the vm_struct list.
    Skipping such holes (unmapped pages) is therefore necessary. This patch
    updates vread/vwrite() to avoid memory holes.

    Routines which access the vmalloc area without knowing what a given
    address is used for are
    - /proc/kcore
    - /dev/kmem

    kcore checks for IOREMAP, /dev/kmem doesn't. After this patch, IOREMAP is
    checked and /dev/kmem will avoid reading from or writing to it. Fixes to
    /proc/kcore follow in the next patch in the series.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: WANG Cong
    Cc: Mike Smith
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
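
    The core of the new read-side behaviour, condensed into a sketch
    (vread_one_page() is a made-up name; see the comment for what the real
    code does differently):

      /*
       * Heavily simplified: the real vread() walks the vmlist under its
       * lock, skips VM_IOREMAP areas, and copies through the page found
       * by vmalloc_to_page() rather than straight from the address.
       */
      static void vread_one_page(char *buf, char *addr)
      {
              if (vmalloc_to_page(addr))          /* mapped: copy the data */
                      memcpy(buf, addr, PAGE_SIZE);
              else                                /* hole: hand back zeroes */
                      memset(buf, 0, PAGE_SIZE);
      }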
     
    A vmap area should be purged after the vm_struct is removed from the list,
    because vread/vwrite etc. believe the range is valid while it's on the
    vm_struct list.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: WANG Cong
    Cc: Mike Smith
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • …uring __rmqueue_fallback

    When there are no pages of a target migratetype free, the page allocator
    selects a high-order block of another migratetype to allocate from. When
    the order of the page taken is greater than pageblock_order, all
    pageblocks within that high-order page should change migratetype so that
    pages are later freed to the correct free-lists.

    The current behaviour is that pageblocks change migratetype only if the
    order being split matches the pageblock_order. When pageblock_order <
    MAX_ORDER-1, ownership is not changed correctly, pages are later freed to
    the incorrect list, and this impacts fragmentation avoidance.

    This patch changes all pageblocks within the high-order page being split
    to the correct migratetype. Without the patch, allocation success rates
    for hugepages under stress were about 59% of physical memory on x86-64.
    With the patch applied, this goes up to 65%.

    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
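
    The fix boils down to walking every pageblock inside the high-order page
    and stamping the new migratetype on each, roughly along the lines of the
    helper this patch introduces (simplified, not necessarily verbatim):

      static void change_pageblock_range(struct page *pageblock_page,
                                         int start_order, int migratetype)
      {
              int nr_pageblocks = 1 << (start_order - pageblock_order);

              while (nr_pageblocks--) {
                      set_pageblock_migratetype(pageblock_page, migratetype);
                      pageblock_page += pageblock_nr_pages;
              }
      }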
     
    Right now, if you inadvertently pass NULL to kmem_cache_create() at boot
    time, it crashes much later after boot somewhere deep inside sysfs, which
    makes it very non-obvious to figure out what's going on.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
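
    One way to catch this early is a check right at the top of
    kmem_cache_create(); a sketch, not necessarily the exact check the patch
    adds:

      if (WARN_ON(!name))
              return NULL;    /* fail loudly now, not deep inside sysfs later */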
     
    The patch makes clear_refs more versatile by adding the option to select
    anonymous pages or file-backed pages for clearing. This addition has a
    measurable impact on user space application performance, as it decreases
    the number of pagewalks in scenarios where one is only interested in a
    specific type of page (anonymous or file-mapped).

    The patch adds anonymous and file-backed filters to the clear_refs
    interface.

    echo 1 > /proc/PID/clear_refs resets the bits on all pages
    echo 2 > /proc/PID/clear_refs resets the bits on anonymous pages only
    echo 3 > /proc/PID/clear_refs resets the bits on file backed pages only

    Any other value is ignored

    Signed-off-by: Moussa A. Ba
    Signed-off-by: Jared E. Hulbert
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Moussa A. Ba
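
    The filtering amounts to a per-VMA check during the pagewalk, along these
    lines (a sketch; the constant names are illustrative, with values
    mirroring the echo interface above):

      enum clear_refs_types {
              CLEAR_REFS_ALL = 1,
              CLEAR_REFS_ANON,
              CLEAR_REFS_MAPPED,
      };

      /* inside the VMA loop of the clear_refs write handler: */
      if (type == CLEAR_REFS_ANON && vma->vm_file)
              continue;               /* skip file-backed mappings */
      if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
              continue;               /* skip anonymous mappings */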
     
  • mremap move's use of ksm_madvise() was assuming -ENOMEM on failure,
    because ksm_madvise used to say -EAGAIN for that; but ksm_madvise now says
    -ENOMEM (letting madvise convert that to -EAGAIN), and can also say
    -ERESTARTSYS when signalled: so pass the error from ksm_madvise.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
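
    The change at the mremap call site amounts to, roughly (sketch):

      err = ksm_madvise(vma, old_addr, old_addr + old_len,
                        MADV_UNMERGEABLE, &vm_flags);
      if (err)
              return err;     /* previously this was hard-coded to -ENOMEM */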
     
  • Just as the swapoff system call allocates many pages of RAM to various
    processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run"
    (unmerge) is liable to allocate many pages of RAM to various processes,
    perhaps triggering OOM; and each is normally run from a modest admin
    process (swapoff or shell), easily repeated until it succeeds.

    So treat unmerge_and_remove_all_rmap_items() in the same way that we treat
    try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both
    with that, to ask the OOM killer to kill them first, to prevent them from
    spawning more and more OOM kills.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
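
    The bracketing pattern looks roughly like this (sketch):

      current->flags |= PF_OOM_ORIGIN;    /* prefer this task as OOM victim */
      err = unmerge_and_remove_all_rmap_items();
      current->flags &= ~PF_OOM_ORIGIN;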
     
  • A few cleanups, given the munlock fix: the comment on ksm_test_exit() no
    longer applies, and it can be made private to ksm.c; there's no more
    reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM originally stood for Kernel Shared Memory: but the kernel has long
    supported shared memory, and VM_SHARED and VM_MAYSHARE vmas, and KSM is
    something else. So we switched to saying "merge" instead of "share".

    But Chris Wright points out that this is confusing where mmap.c merges
    adjacent vmas: most especially in the name VM_MERGEABLE_FLAGS, used by
    is_mergeable_vma() to let vmas be merged despite flags being different.

    Call it VMA_MERGE_DESPITE_FLAGS? Perhaps, but at present it consists
    only of VM_CAN_NONLINEAR: so for now it's clearer on all sides to use
    that directly, with a comment on it in is_mergeable_vma().

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Add Documentation/vm/ksm.txt: how to use the Kernel Samepage Merging feature

    Signed-off-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Randy Dunlap
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • At present KSM is just a waste of space if you don't have CONFIG_SYSFS=y
    to provide the /sys/kernel/mm/ksm files to tune and activate it.

    Make KSM depend on SYSFS? Could do, but it might be better to provide
    some defaults so that KSM works out-of-the-box, ready for testers to
    madvise MADV_MERGEABLE, even without SYSFS.

    Though anyone serious is likely to want to retune the numbers to their
    taste once they have experience; and whether these settings ever reach
    2.6.32 can be discussed along the way.

    Save 1kB from tiny kernels by #ifdef'ing the SYSFS side of it.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    Rawhide users have reported a hang at startup when cryptsetup is run: the
    same problem can be simply reproduced by running a program like
    int main() { mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }

    The problem is that exit_mmap() applies munlock_vma_pages_all() to
    clean up VM_LOCKED areas, and its current implementation (stupidly)
    tries to fault in absent pages, for example where PROT_NONE prevented
    them being faulted in when mlocking. Whereas the "ksm: fix oom
    deadlock" patch, knowing there's a race by which KSM might try to fault
    in pages after exit_mmap() had finally zapped the range, backs out of
    such faults doing nothing when its ksm_test_exit() notices mm_users 0.

    So revert that part of "ksm: fix oom deadlock" which moved the
    ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
    and remove those ksm_test_exit() checks from the page fault paths, so
    allowing the munlocking to proceed without interference.

    ksm_exit, if there are rmap_items still chained on this mm slot, takes
    mmap_sem write side: so preventing KSM from working on an mm while
    exit_mmap runs. And KSM will bail out as soon as it notices that
    mm_users is already zero, thanks to its internal ksm_test_exit checks.
    So that when a task is killed by OOM killer or the user, KSM will not
    indefinitely prevent it from running exit_mmap to release its memory.

    This does break a part of what "ksm: fix oom deadlock" was trying to
    achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even
    when ksmd itself has to cancel a KSM page, it is possible that the
    first OOM-kill victim would be the KSM process being faulted: then its
    memory won't be freed until a second victim has been selected (freeing
    memory for the unmerging fault to complete).

    But the OOM killer is already liable to kill a second victim once the
    intended victim's p->mm goes to NULL: so there's not much point in
    rejecting this KSM patch before fixing that OOM behaviour. It is very
    much more important to allow KSM users to boot up, than to haggle over
    an unlikely and poorly supported OOM case.

    We also intend to fix munlocking to not fault pages: at which point
    this patch _could_ be reverted; though that would be controversial, so
    we hope to find a better solution.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Justin M. Forbes
    Acked-for-now-by: Hugh Dickins
    Cc: Izik Eidus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • There's a now-obvious deadlock in KSM's out-of-memory handling:
    imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
    trying to allocate a page to break KSM in an mm which becomes the
    OOM victim (quite likely in the unmerge case): it's killed and goes
    to exit, and hangs there waiting to acquire ksm_thread_mutex.

    Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
    though that made everything else: perhaps use mmap_sem somehow?
    And part of the answer lies in the comments on unmerge_ksm_pages:
    __ksm_exit should also leave all the rmap_item removal to ksmd.

    But there's a fundamental problem, that KSM relies upon mmap_sem to
    guarantee the consistency of the mm it's dealing with, yet exit_mmap
    tears down an mm without taking mmap_sem. And bumping mm_users won't
    help at all, that just ensures that the pages the OOM killer assumes
    are on their way to being freed will not be freed.

    The best answer seems to be, to move the ksm_exit callout from just
    before exit_mmap, to the middle of exit_mmap: after the mm's pages
    have been freed (if the mmu_gather is flushed), but before its page
    tables and vma structures have been freed; and down_write,up_write
    mmap_sem there to serialize with KSM's own reliance on mmap_sem.

    But KSM then needs to be careful, whenever it downs mmap_sem, to
    check that the mm is not already exiting: there's a danger of using
    find_vma on a layout that's being torn apart, or writing into page
    tables which have been freed for reuse; and even do_anonymous_page
    and __do_fault need to check they're not being called by break_ksm
    to reinstate a pte after zap_pte_range has zapped that page table.

    Though it might be clearer to add an exiting flag, set while holding
    mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
    a zapped pte. All we need is to check whether mm_users is 0 - but
    must remember that ksmd may detect that before __ksm_exit is reached.
    So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.

    __ksm_exit now has to leave clearing up the rmap_items to ksmd, which
    needs ksm_thread_mutex; but it shifts the exiting mm just after the
    ksm_scan cursor so that it will soon be dealt with. __ksm_enter raises
    mm_count to hold the mm_struct; ksmd's exit processing (exactly like
    its processing when it finds all VM_MERGEABLEs unmapped) mmdrops it,
    with a similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd).

    But also give __ksm_exit a fast path: when there's no complication
    (no rmap_items attached to mm and it's not at the ksm_scan cursor),
    it can safely do all the exiting work itself. This is not just an
    optimization: when ksmd is not running, the raised mm_count would
    otherwise leak mm_structs.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Do some housekeeping in ksm.c, to help make the next patch easier
    to understand: remove the function remove_mm_from_lists, distributing
    its code to its callsites scan_get_next_rmap_item and __ksm_exit.

    That turns out to be a win in scan_get_next_rmap_item: move its
    remove_trailing_rmap_items and cursor advancement up, and it becomes
    simpler than before. __ksm_exit becomes messier, but will change
    again; and moving its remove_trailing_rmap_items up lets us strengthen
    the unstable tree item's age condition in remove_rmap_item_from_tree.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • break_ksm has been looping endlessly ignoring VM_FAULT_OOM: that should
    only be a problem for ksmd when a memory control group imposes limits
    (normally the OOM killer will kill others with an mm until it succeeds);
    but in general (especially for MADV_UNMERGEABLE and KSM_RUN_UNMERGE) we
    do need to route the error (or kill) back to the caller (or sighandling).

    Test signal_pending in unmerge_ksm_pages, which could be a lengthy
    procedure if it has to spill into swap: returning -ERESTARTSYS so that
    trivial signals will restart but fatals will terminate (is that right?
    we do different things in different places in mm, none exactly this).

    unmerge_and_remove_all_rmap_items was forgetting to lock when going
    down the mm_list: fix that. Whether it's successful or not, reset
    ksm_scan cursor to head; but only if it's successful, reset seqnr
    (shown in full_scans) - page counts will have gone down to zero.

    This patch leaves a significant OOM deadlock, but it's a good step
    on the way, and that deadlock is fixed in a subsequent patch.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
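
    The signal handling described above reduces to a check at the top of the
    per-page loop, roughly:

      if (signal_pending(current))
              err = -ERESTARTSYS;   /* trivial signals restart, fatals kill */
      else
              err = break_ksm(vma, addr);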
     
  • 1. We don't use __break_cow entry point now: merge it into break_cow.
    2. remove_all_slot_rmap_items is just a special case of
    remove_trailing_rmap_items: use the latter instead.
    3. Extend comment on unmerge_ksm_pages and rmap_items.
    4. try_to_merge_two_pages should use try_to_merge_with_ksm_page
    instead of duplicating its code; and so swap them around.
    5. Comment on cmp_and_merge_page described last year's: update it.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • ksm_scan_thread already sleeps in wait_event_interruptible until setting
    ksm_run activates it; but if there's nothing on its list to look at, i.e.
    nobody has yet said madvise MADV_MERGEABLE, it's a shame to be clocking
    up system time and full_scans: ksmd_should_run added to check that too.

    And move the mutex_lock out around it: the new counts showed that when
    ksm_run is stopped, a little work often got done afterwards, because it
    had been read before taking the mutex.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We kept agreeing not to bother about the unswappable shared KSM pages
    which later become unshared by others: observation suggests they're not
    a significant proportion. But they are disadvantageous, and it is easier
    to break COW to replace them by swappable pages, than offer statistics
    to show that they don't matter; then we can stop worrying about them.

    Doing this in ksm_do_scan, they don't go through cmp_and_merge_page on
    this pass: give them a good chance of getting into the unstable tree
    on the next pass, or back into the stable, by computing checksum now.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The pages_shared and pages_sharing counts give a good picture of how
    successful KSM is at sharing; but no clue to how much wasted work it's
    doing to get there. Add pages_unshared (count of unique pages waiting
    in the unstable tree, hoping to find a mate) and pages_volatile.

    pages_volatile is harder to define. It includes those pages changing
    too fast to get into the unstable tree, but also whatever other edge
    conditions prevent a page getting into the trees: a high value may
    deserve investigation. Don't try to calculate it from the various
    conditions: it's the total of rmap_items less those accounted for.

    Also show full_scans: the number of completed scans of everything
    registered in the mm list.

    The locking for all these counts is simply ksm_thread_mutex.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
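
    pages_volatile is thus derived at read time rather than tracked directly,
    along the lines of:

      long volatile_pages = ksm_rmap_items - ksm_pages_shared
                          - ksm_pages_sharing - ksm_pages_unshared;

      if (volatile_pages < 0)         /* counters may be momentarily skewed */
              volatile_pages = 0;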
     
  • The pages_shared count is incremented and decremented when adding a node
    to and removing a node from the stable tree: easy to understand. But the
    pages_sharing count was hard to follow, being adjusted in various places:
    increment and decrement it when adding to and removing from the stable tree.

    And the pages_sharing variable used to include the pages_shared, then those
    were subtracted when shown in the pages_sharing sysfs file: now keep it as
    an exclusive count of leaves hanging off the stable tree nodes, throughout.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We're not implementing swapping of KSM pages in its first release;
    but when that follows, "kernel_pages_allocated" will be a very poor
    name for the sysfs file showing number of nodes in the stable tree:
    rename that to "pages_shared" throughout.

    But we already have a "pages_shared", counting those page slots
    sharing the shared pages: first rename that to... "pages_sharing".

    What will become of "max_kernel_pages" when the pages shared can
    be swapped? I guess it will just be removed, so keep that name.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • ksm should try not to disturb other tasks as much as possible.

    Signed-off-by: Izik Eidus
    Cc: Chris Wright
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Izik Eidus