07 Mar, 2010

7 commits

  • The VM currently assumes that an inactive, mapped and referenced file page
    is in use and promotes it to the active list.

    However, every mapped file page starts out like this and thus a problem
    arises when workloads create a stream of such pages that are used only for
    a short time. By flooding the active list with those pages, the VM
    quickly gets into trouble finding eligible reclaim candidates. The result
    is long allocation latencies and eviction of the wrong pages.

    This patch reuses the PG_referenced page flag (used for unmapped file
    pages) to implement a usage detection that scales with the speed of LRU
    list cycling (i.e. memory pressure).

    If the scanner encounters such a page, the flag is set and the page is
    cycled once more on the inactive list. Only if it comes back with another
    page table reference is it activated; otherwise it is reclaimed as 'not
    recently used cache'.

    This effectively changes the minimum lifetime of a used-once mapped file
    page from a full memory cycle to an inactive list cycle, which allows it
    to occur in linear streams without affecting the stable working set of the
    system.
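
    The logic can be sketched with a small standalone model (names here are
    illustrative only; the real code operates on struct page, PG_referenced
    and the LRU lists in the VM's reclaim path):

    #include <stdbool.h>

    /* Toy model of a mapped file page sitting on the inactive list. */
    struct model_page {
            bool referenced_flag;           /* models PG_referenced */
    };

    enum scan_decision { KEEP_ON_INACTIVE, ACTIVATE, RECLAIM };

    /*
     * Called when the inactive-list scanner reaches the page;
     * ptes_referenced says whether a page table reference was found
     * since the previous scan.
     */
    static enum scan_decision scan_mapped_file_page(struct model_page *p,
                                                    bool ptes_referenced)
    {
            if (!ptes_referenced)
                    return RECLAIM;         /* not recently used cache */
            if (p->referenced_flag)
                    return ACTIVATE;        /* second sighting: real activity */
            p->referenced_flag = true;      /* first sighting: one more lap */
            return KEEP_ON_INACTIVE;
    }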

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When a VMA is in an inconsistent state during setup or teardown, the worst
    that can happen is that the rmap code will not be able to find the page.

    The mapping is in the process of being torn down (PTEs just got
    invalidated by munmap), or set up (no PTEs have been instantiated yet).

    It is also impossible for the rmap code to follow a pointer to an already
    freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
    teardown code needs to take before the VMA is removed from the anon_vma
    chain.

    Hence, we should not need the VM_LOCK_RMAP locking at all.

    Signed-off-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • When the parent process breaks the COW on a page, both the original page
    (mapped in the child) and the new page (mapped in the parent) end up in
    the same anon_vma. Generally this won't be a problem, but for some
    workloads it could preserve the O(N) rmap scanning complexity.

    A simple fix is to ensure that, when a page mapped only in the child gets
    reused in do_wp_page(), the page is moved to the child's own exclusive
    anon_vma, since the child is already the exclusive owner.
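
    A minimal standalone model of the idea (illustrative names only; the real
    change touches do_wp_page() and the anon_vma handling):

    /* Toy model of a page and the vma that just write-faulted on it. */
    struct model_anon_vma { int id; };

    struct model_page {
            struct model_anon_vma *anon_vma;
            int mapcount;           /* number of ptes mapping the page */
    };

    struct model_vma {
            struct model_anon_vma *anon_vma;  /* this process's own anon_vma */
    };

    /*
     * On a write fault that reuses the page in place (no copy needed):
     * if we are the exclusive owner, re-point the page at our own
     * anon_vma so later rmap walks only visit this process.
     */
    static void reuse_exclusive_page(struct model_page *page,
                                     struct model_vma *vma)
    {
            if (page->mapcount == 1 && page->anon_vma != vma->anon_vma)
                    page->anon_vma = vma->anon_vma;
    }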

    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • When an anonymous page is inherited from a parent process, the
    vma->anon_vma can differ from the page's anon_vma. This can trip up
    __page_check_anon_rmap, which is indirectly called from do_swap_page().

    Remove that obsolete check to prevent an oops.

    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach into the tens
    of thousands. Real workloads are still a factor of 10 less
    process-intensive than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parent's anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust()
    can fail if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.
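
    The new linking is easiest to see in the connector object it introduces:
    an anon_vma_chain ties one VMA to one anon_vma and sits on two lists, one
    walked from the VMA (e.g. at fork) and one walked from the anon_vma (by
    rmap). The sketch below approximates the structure; the authoritative
    definition lives in the kernel headers (include/linux/rmap.h).

    #include <linux/list.h>         /* struct list_head */

    struct vm_area_struct;
    struct anon_vma;

    /* Sketch only; field names may differ slightly from the real code. */
    struct anon_vma_chain {
            struct vm_area_struct *vma;
            struct anon_vma *anon_vma;
            struct list_head same_vma;      /* on the vma's list of chains      */
            struct list_head same_anon_vma; /* on the anon_vma's list of chains */
    };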

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time; there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • A frequent question from users about memory management is how much swap
    is used by each process. This information would also give some hints to
    the oom-killer.

    Although we can count the number of swap entries per process by scanning
    /proc/<pid>/smaps, that is very slow and not suitable for the usual
    process information tools such as 'ps' or 'top' (which are already slow
    enough).

    This patch adds a swap-entry counter to mm_counter and updates it on
    every swap event. The information is exported via /proc/<pid>/status as

    [kamezawa@bluextal memory]$ cat /proc/self/status
    Name: cat
    State: R (running)
    Tgid: 2910
    Pid: 2910
    PPid: 2823
    TracerPid: 0
    Uid: 500 500 500 500
    Gid: 500 500 500 500
    FDSize: 256
    Groups: 500
    VmPeak: 82696 kB
    VmSize: 82696 kB
    VmLck: 0 kB
    VmHWM: 432 kB
    VmRSS: 432 kB
    VmData: 172 kB
    VmStk: 84 kB
    VmExe: 48 kB
    VmLib: 1568 kB
    VmPTE: 40 kB
    VmSwap: 0 kB
    Reviewed-by: Minchan Kim
    Reviewed-by: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, the per-mm statistics counters are defined by macros in
    sched.h.

    This patch modifies that to:
    - define them in mm.h as inline functions
    - use an array instead of macro-based name generation.

    This patch is meant to reduce the size of a later patch that modifies the
    per-mm counter implementation.
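
    A sketch of the array-based counters described above (names approximate
    the definitions this patch introduces; see mm.h / mm_types.h for the real
    ones):

    enum {
            MM_FILEPAGES,
            MM_ANONPAGES,
            NR_MM_COUNTERS
    };

    struct mm_rss_stat {
            long count[NR_MM_COUNTERS];
    };

    /* Inline accessors replace the old macro-generated names. */
    static inline void add_mm_counter_sketch(struct mm_rss_stat *rss,
                                             int member, long value)
    {
            rss->count[member] += value;
    }

    static inline unsigned long get_mm_counter_sketch(struct mm_rss_stat *rss,
                                                      int member)
    {
            long val = rss->count[member];

            return val > 0 ? (unsigned long)val : 0;
    }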

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

16 Dec, 2009

14 commits

  • In the global VM, FILE_MAPPED is used, but memcg uses MAPPED_FILE. This
    makes grepping difficult. Replace memcg's MAPPED_FILE with FILE_MAPPED.

    Also, in the global VM, mapped shared memory is accounted as FILE_MAPPED,
    but memcg doesn't do this; fix it.

    Note: page_is_file_cache() just checks SwapBacked or not, so we also need
    to check PageAnon.
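
    The note boils down to a check along these lines (a standalone
    illustration, not the memcg code itself):

    #include <stdbool.h>

    /* Toy page: shmem pages are swap-backed but not anon, so testing
     * "file cache or not" alone is not enough for FILE_MAPPED here. */
    struct model_page {
            bool anon;              /* models PageAnon()       */
            bool swap_backed;       /* models PageSwapBacked() */
    };

    /* Should a mapped page be accounted as FILE_MAPPED? */
    static bool account_as_file_mapped(const struct model_page *p)
    {
            return !p->anon;        /* mapped shmem counts, anon does not */
    }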

    Cc: Balbir Singh
    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • SWAP_MLOCK means "we marked the page as PG_MLOCK, please move it to the
    unevictable LRU". So the following code is easily misread:

        if (vma->vm_flags & VM_LOCKED) {
                ret = SWAP_MLOCK;
                goto out_unmap;
        }

    Plus, if the VMA doesn't have VM_LOCKED, we don't need to check whether
    mlock_vma_page() needs to be called at all.

    Also, add some commentary to try_to_unmap_one().

    Acked-by: Hugh Dickins
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • A side-effect of making ksm pages swappable is that they have to be placed
    on the LRUs: which then exposes them to isolate_lru_page() and hence to
    page migration.

    Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
    rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
    consolidation with existing code is possible, but don't attempt that yet
    (try_to_unmap needs to handle nonlinears, but migration pte removal does
    not).

    rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
    remove_anon_migration_ptes() which it replaces, avoids calling
    page_lock_anon_vma(), because that includes a page_mapped() test which
    fails when all migration ptes are in place. That was valid when NUMA page
    migration was introduced (holding mmap_sem provided the missing guarantee
    that anon_vma's slab had not already been destroyed), but I believe not
    valid in the memory hotremove case added since.

    For now do the same as before, and consider the best way to fix that
    unlikely race later on. When fixed, we can probably use rmap_walk() on
    hwpoisoned ksm pages too: for now, they remain among hwpoison's various
    exceptions (its PageKsm test comes before the page is locked, but its
    page_lock_anon_vma fails safely if an anon gets upgraded).
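
    The dispatch itself is simple; a simplified standalone model follows (the
    real rmap_walk() in mm/rmap.c takes a struct page plus the callback and
    argument used by migration):

    enum page_kind { KIND_KSM, KIND_ANON, KIND_FILE };

    typedef int (*rmap_walker_t)(void *page, void *arg);

    /* Pick the walker that matches the page type, as described above. */
    static int rmap_walk_model(enum page_kind kind, void *page, void *arg,
                               rmap_walker_t walk_ksm,
                               rmap_walker_t walk_anon,
                               rmap_walker_t walk_file)
    {
            if (kind == KIND_KSM)
                    return walk_ksm(page, arg);
            if (kind == KIND_ANON)
                    return walk_anon(page, arg);
            return walk_file(page, arg);
    }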

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When ksm pages were unswappable, it made no sense to include them in mem
    cgroup accounting; but now that they are swappable (although I see no
    strict logical connection) the principle of least surprise implies that
    they should be accounted (with the usual dissatisfaction, that a shared
    page is accounted to only one of the cgroups using it).

    This patch was intended to add mem cgroup accounting where necessary; but
    turned inside out, it now avoids allocating a ksm page, instead upgrading
    an anon page to ksm - which brings its existing mem cgroup accounting with
    it. Thus mem cgroups don't appear in the patch at all.

    This upgrade from PageAnon to PageKsm takes place under page lock (via a
    somewhat hacky NULL kpage interface), and audit showed only one place
    which needed to cope with the race - page_referenced() is sometimes used
    without page lock, so page_lock_anon_vma() needs an ACCESS_ONCE() to be
    sure of getting anon_vma and flags together (no problem if the page goes
    ksm an instant after, the integrity of that anon_vma list is unaffected).

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • For full functionality, page_referenced_one() and try_to_unmap_one() need
    to know the vma: to pass vma down to arch-dependent flushes, or to observe
    VM_LOCKED or VM_EXEC. But KSM keeps no record of vma: nor can it, since
    vmas get split and merged without its knowledge.

    Instead, note page's anon_vma in its rmap_item when adding to stable tree:
    all the vmas which might map that page are listed by its anon_vma.

    page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma,
    first to find the probable vma, that which matches rmap_item's mm; but if
    that is not enough to locate all instances, traverse again to try the
    others. This catches those occasions when fork has duplicated a pte of a
    ksm page, but ksmd has not yet come around to assign it an rmap_item.

    But each rmap_item in the stable tree which refers to an anon_vma needs to
    take a reference to it. Andrea's anon_vma design cleverly avoided a
    reference count (an anon_vma was free when its list of vmas was empty),
    but KSM now needs to add that. Is a 32-bit count sufficient? I believe
    so - the anon_vma is only free when both count is 0 and list is empty.
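
    The freeing rule can be stated as a tiny standalone model (illustrative
    names only):

    #include <stdbool.h>

    struct anon_vma_model {
            int ksm_refcount;  /* references held by stable-tree rmap_items */
            int nr_vmas;       /* length of the anon_vma's vma list         */
    };

    /* An anon_vma may only be freed when both conditions hold. */
    static bool anon_vma_may_free(const struct anon_vma_model *av)
    {
            return av->ksm_refcount == 0 && av->nr_vmas == 0;
    }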

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Initial implementation for swapping out KSM's shared pages: add
    page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
    faced with a PageKsm page.

    Most of what's needed can be got from the rmap_items listed from the
    stable_node of the ksm page, without discovering the actual vma: so in
    this patch just fake up a struct vma for page_referenced_one() or
    try_to_unmap_one(), then refine that in the next patch.

    Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
    implicit there (being only set with VM_SHARED, already excluded), but
    let's make it explicit, to help justify the lack of nonlinear unmap.

    Rely on the page lock to protect against concurrent modifications to that
    page's node of the stable tree.

    The awkward part is not swapout but swapin: do_swap_page() and
    page_add_anon_rmap() now have to allow for new possibilities - perhaps a
    ksm page still in swapcache, perhaps a swapcache page associated with one
    location in one anon_vma now needed for another location or anon_vma.
    (And the vma might even be no longer VM_MERGEABLE when that happens.)

    ksm_might_need_to_copy() checks for that case, and supplies a duplicate
    page when necessary, simply leaving it to a subsequent pass of ksmd to
    rediscover the identity and merge them back into one ksm page.
    Disappointingly primitive: but the alternative would have to accumulate
    unswappable info about the swapped out ksm pages, limiting swappability.

    Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
    particular case it was handling, so just use it instead.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM swapping will know where page_referenced_one() and try_to_unmap_one()
    should look. It could hack page->index to get them to do what it wants,
    but it seems cleaner now to pass the address down to them.

    Make the same change to page_mkclean_one(), since it follows the same
    pattern; but there's no real need in its case.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove three degrees of obfuscation, left over from when we had
    CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
    CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
    built when CONFIG_MMU, so don't need such conditions at all.

    Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
    169 defconfigs: leave those to evolve in due course.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's contorted mlock/munlock handling in try_to_unmap_anon() and
    try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping.
    Simplify it by moving it all down into try_to_unmap_one().

    One thing is then lost, try_to_munlock()'s distinction between when no vma
    holds the page mlocked, and when a vma does mlock it, but we could not get
    mmap_sem to set the page flag. But its only caller takes no interest in
    that distinction (and is better testing SWAP_MLOCK anyway), so let's keep
    the code simple and return SWAP_AGAIN for both cases.

    try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
    amusing: once unravelled, it turns out to have been choosing between two
    different ways of doing the same nothing. Ah, no, one way was actually
    returning SWAP_FAIL when it meant to return SWAP_SUCCESS.

    [kosaki.motohiro@jp.fujitsu.com: comment adding to mlocking in try_to_unmap_one]
    [akpm@linux-foundation.org: remove test of MLOCK_PAGES]
    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: KOSAKI Motohiro
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
    in page->mapping, with the higher bits a pointer to the anon_vma; and have
    defined PageKsm(page) as that with NULL anon_vma.

    But KSM swapping will need to store a pointer there: so in preparation for
    that, now define PAGE_MAPPING_FLAGS as the low two bits, including
    PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
    other use for the bit emerges).

    Declare page_rmapping(page) to return the pointer part of page->mapping,
    and page_anon_vma(page) to return the anon_vma pointer when that's what it
    is. Use these in a few appropriate places: notably, unuse_vma() has been
    testing page->mapping, but is better off testing page_anon_vma() (cases
    may be added in which flag bits are set without any pointer).
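
    A sketch of the low-bit tagging described above (constants and helpers
    approximate what the patch defines; the model struct stands in for
    struct page):

    #include <stdint.h>

    #define PAGE_MAPPING_ANON       0x1UL
    #define PAGE_MAPPING_KSM        0x2UL   /* for now, only set with ANON */
    #define PAGE_MAPPING_FLAGS      (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

    struct model_page { uintptr_t mapping; };       /* models page->mapping */

    /* Pointer part of page->mapping, tag bits masked off. */
    static inline void *page_rmapping_sketch(const struct model_page *p)
    {
            return (void *)(p->mapping & ~PAGE_MAPPING_FLAGS);
    }

    /* The anon_vma pointer, but only for a plain anon (non-ksm) mapping. */
    static inline void *page_anon_vma_sketch(const struct model_page *p)
    {
            if ((p->mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                    return NULL;
            return (void *)(p->mapping & ~PAGE_MAPPING_FLAGS);
    }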

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When the code jumps to the `out' label, `referenced' is still zero, so
    there is no need to check it.

    Signed-off-by: Huang Shijie
    Acked-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Just simplify the code when `mlocked' is true.

    Signed-off-by: Huang Shijie
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Fix the comment for try_to_unmap_anon() with the new arguments.

    Signed-off-by: Huang Shijie
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Swap is duplicated (reference count incremented by one) whenever the same
    swap page is inserted into another mm (when forking finds a swap entry in
    place of a pte, or when reclaim unmaps a pte to insert the swap entry).

    swap_info_struct's vmalloc'ed swap_map is the array of these reference
    counts: but what happens when the unsigned short (or unsigned char since
    the preceding patch) is full? (and its high bit is kept for a cache flag)

    We then lose track of it, never freeing, leaving it in use until swapoff:
    at which point we _hope_ that a single pass will have found all instances,
    assume there are no more, and will lose user data if we're wrong.

    Swapping of KSM pages has not yet been enabled; but it is implemented,
    and makes it very easy for a user to overflow the maximum swap count:
    possible with ordinary process pages, but unlikely, even when pid_max
    has been raised from PID_MAX_DEFAULT.

    This patch implements swap count continuations: when the count overflows,
    a continuation page is allocated and linked to the original vmalloc'ed
    map page, and this is used to hold the continuation counts for that entry
    and its neighbours. These continuation pages are seldom referenced:
    the common paths all work on the original swap_map, only referring to
    a continuation page when the low "digit" of a count is incremented or
    decremented through SWAP_MAP_MAX.
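
    A toy model of the counting scheme (greatly simplified: the real code
    keeps the low "digit" in the swap_map byte and chains extra pages for the
    rest, but the arithmetic is the same base-N carry/borrow):

    #define LOW_DIGIT_MAX   0x3f    /* illustrative limit of the in-map digit */

    struct swap_count_model {
            unsigned char low;      /* models the swap_map entry's count   */
            unsigned long high;     /* models counts in continuation pages */
    };

    static void swap_count_inc(struct swap_count_model *c)
    {
            if (c->low == LOW_DIGIT_MAX) {  /* rare: spill into continuation */
                    c->high++;
                    c->low = 0;
            } else {
                    c->low++;               /* common case: main map only */
            }
    }

    static void swap_count_dec(struct swap_count_model *c)
    {
            if (c->low == 0) {              /* rare: borrow from continuation */
                    c->high--;
                    c->low = LOW_DIGIT_MAX;
            } else {
                    c->low--;
            }
    }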

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

02 Oct, 2009

1 commit


24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

2 commits

  • page_dup_rmap(), used on each mapped page when forking, was originally
    just an inline atomic_inc of mapcount. 2.6.22 added CONFIG_DEBUG_VM
    out-of-line checks to it, which would need to be ever-so-slightly
    complicated to allow for the PageKsm() we're about to define.

    But I think these checks never caught anything. And if it's coding errors
    we're worried about, such checks should be in page_remove_rmap() too, not
    just when forking; whereas if it's pagetable corruption we're worried
    about, then they shouldn't be limited to CONFIG_DEBUG_VM.

    Oh, just revert page_dup_rmap() to an inline atomic_inc of mapcount.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Signed-off-by: Izik Eidus
    Cc: Nick Piggin
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_remove_rmap() has multiple PageAnon() tests and it has deep nesting.
    Clean this up.

    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Reviewed-by: Wu Fengguang
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

16 Sep, 2009

4 commits

  • Add the high level memory handler that poisons pages
    that got corrupted by hardware (typically by a two bit flip in a DIMM
    or a cache) on the Linux level. The goal is to prevent everyone
    from accessing these pages in the future.

    This is done at the VM level by marking a page hwpoisoned and taking the
    appropriate action based on the type of page it is.

    The code that does this is portable and lives in mm/memory-failure.c

    To quote the overview comment:

    High level machine check handler. Handles pages reported by the
    hardware as being corrupted usually due to a 2bit ECC memory or cache
    failure.

    This focuses on pages detected as corrupted in the background.
    When the current CPU tries to consume corruption the currently
    running process can just be killed directly instead. This implies
    that if the error cannot be handled for some reason it's safe to
    just ignore it because no corruption has been consumed yet. Instead
    when that happens another machine check will happen.

    Handles page cache pages in various states. The tricky part here is that
    we can access any page asynchronously with respect to other VM users,
    because memory failures could happen anytime and anywhere, possibly
    violating some of their assumptions. This is why this code has to be
    extremely careful. Generally it tries to use normal locking rules, as in
    get the standard locks, even if that means the error handling takes
    potentially a long time.

    Some of the operations here are somewhat inefficient and have non-linear
    algorithmic complexity, because the data structures have not been
    optimized for this case. This is in particular the case for the mapping
    from a vma to a process. Since this case is expected to be rare we hope
    we can get away with this.

    There are in principle two strategies to kill processes on poison:
    - just unmap the data and wait for an actual reference before killing
    - kill as soon as corruption is detected.
    Both have advantages and disadvantages and should be used in different
    situations. Right now both are implemented and can be switched with a
    new sysctl, vm.memory_failure_early_kill. The default is early kill.

    The patch does some rmap data structure walking on its own to collect
    processes to kill. This is unusual because normally all rmap data
    structure knowledge is in rmap.c only. I put it here for now to keep
    everything together, and rmap knowledge has been seeping out anyway.

    Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
    Nick Piggin (who did a lot of great work) and others.

    Cc: npiggin@suse.de
    Cc: riel@redhat.com
    Signed-off-by: Andi Kleen
    Acked-by: Rik van Riel
    Reviewed-by: Hidehiro Kawai

    Andi Kleen
     
  • When a page has the poison bit set replace the PTE with a poison entry.
    This causes the right error handling to be done later when a process runs
    into it.

    v2: add a new flag to not do that (needed for the memory-failure handler
    later) (Fengguang)
    v3: remove unnecessary is_migration_entry() test (Fengguang, Minchan)

    Reviewed-by: Minchan Kim
    Reviewed-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • try_to_unmap currently has multiple modes (migration, munlock, normal
    unmap) which are selected by magic flag variables. The logic is not very
    straightforward, because each of these flags changes multiple behaviours
    (e.g. migration turns off aging, not only sets up migration ptes, etc.),
    and the different flags interact in magic ways.

    A later patch in this series adds another mode to try_to_unmap, so this
    would quickly become unmanageable.

    Replace the different flags with an action code (migration, munlock,
    munmap) and some additional flags as modifiers (ignore mlock, ignore
    aging). This makes the logic more straightforward and allows easier
    extension to new behaviours. Change all the callers to declare what they
    want to do.

    This patch is supposed to be a nop in behaviour. If anyone can prove it
    is not, that would be a bug.
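
    The resulting interface looks roughly like this (values approximate the
    enum this patch adds to include/linux/rmap.h):

    enum ttu_flags {
            TTU_UNMAP               = 0,        /* action: normal unmap */
            TTU_MIGRATION           = 1,        /* action: set up migration ptes */
            TTU_MUNLOCK             = 2,        /* action: munlock mode */
            TTU_ACTION_MASK         = 0xff,

            TTU_IGNORE_MLOCK        = (1 << 8), /* modifier: ignore mlock */
            TTU_IGNORE_ACCESS       = (1 << 9), /* modifier: ignore aging */
    };

    #define TTU_ACTION(x)   ((x) & TTU_ACTION_MASK)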

    Cc: Lee.Schermerhorn@hp.com
    Cc: npiggin@suse.de

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Needed for later patch that walks rmap entries on its own.

    This used to be very frowned upon, but memory-failure.c does
    some rather specialized rmap walking and rmap has been stable
    for quite some time, so I think it's ok now to export it.

    Signed-off-by: Andi Kleen

    Andi Kleen
     

27 Aug, 2009

1 commit

  • An mlocked page might lose the isolation race. This causes the page to
    clear PG_mlocked while it remains in a VM_LOCKED vma. This means it can
    be put onto the [in]active list. We can rescue it by using try_to_unmap()
    in shrink_page_list().

    But now, as Wu Fengguang pointed out, vmscan has a bug. If the page has
    PG_referenced, it can't reach try_to_unmap() in shrink_page_list() but is
    put into the active list. If the page is referenced repeatedly, it can
    remain on the [in]active list without ever being moved to the unevictable
    list.

    This patch fixes it.

    Reported-by: Wu Fengguang
    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

19 Jun, 2009

1 commit

  • Add file RSS tracking per memory cgroup.

    We currently don't track file RSS; the RSS we report is actually anon
    RSS. All the file-mapped pages come in through the page cache and get
    accounted there. This patch adds support for accounting file RSS pages.
    It should

    1. Help improve the metrics reported by the memory resource controller
    2. Form the basis for a future shared memory accounting heuristic that
       has been proposed by Kamezawa.

    Unfortunately, we cannot rename the existing "rss" keyword used in
    memory.stat to "anon_rss". We do, however, add "mapped_file" data and
    hope to educate the end user through documentation.

    [hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops]
    Signed-off-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Dhaval Giani
    Cc: Daisuke Nishimura
    Cc: YAMAMOTO Takashi
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

17 Jun, 2009

2 commits

  • Collect vma->vm_flags of the VMAs that actually referenced the page.

    This is preparing for more informed reclaim heuristics, eg. to protect
    executable file pages more aggressively. For now only the VM_EXEC bit
    will be used by the caller.

    Thanks to Johannes, Peter and Minchan for all the good tips.

    Acked-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
    configurability is unnecessary.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Acked-by: Minchan Kim
    Cc: David Woodhouse
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

22 May, 2009

1 commit

  • My old address will shut down in a few days time: remove it from the tree,
    and add a tmpfs (shmem filesystem) maintainer entry with the new address.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Feb, 2009

1 commit

  • When I tested the following program, I found that the mlocked counter is
    strange: it cannot free some mlocked pages.

    This is because try_to_unmap_file() doesn't check the real page mappings
    in the vmas.

    That is because the goal of an address_space for a file is to find all
    processes into which the file's specific interval is mapped. It is
    related to the file's interval, not to pages.

    Even if the page isn't really mapped by the vma, it returns SWAP_MLOCK
    since the vma has VM_LOCKED, then calls try_to_mlock_page. After this the
    mlocked counter is increased again.

    A COWed anon page in a file-backed vma could be such a case. This patch
    resolves it.

    -- my test program --

    #include <sys/mman.h>

    int main()
    {
            mlockall(MCL_CURRENT);
            return 0;
    }

    -- before --

    root@barrios-target-linux:~# cat /proc/meminfo | egrep 'Mlo|Unev'
    Unevictable: 0 kB
    Mlocked: 0 kB

    -- after --

    root@barrios-target-linux:~# cat /proc/meminfo | egrep 'Mlo|Unev'
    Unevictable: 8 kB
    Mlocked: 8 kB

    Signed-off-by: MinChan Kim
    Acked-by: Lee Schermerhorn
    Acked-by: KOSAKI Motohiro
    Tested-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    MinChan Kim
     

07 Jan, 2009

5 commits

  • Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
    And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
    page_dup_rmap(): we're trying to be more resilient about that than BUGs.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Now that bad pages are kept out of circulation, there is no need for the
    infamous page_remove_rmap() BUG() - once that page is freed, its negative
    mapcount will issue a "Bad page state" message and the page won't be
    freed. Removing the BUG() allows more info, on subsequent pages, to be
    gathered.

    We do have more info about the page at this point than bad_page() can know
    - notably, what the pmd is, which might pinpoint something like low 64kB
    corruption - but page_remove_rmap() isn't given the address to find that.

    In practice, there is only one call to page_remove_rmap() which has ever
    reported anything, that from zap_pte_range() (usually on exit, sometimes
    on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek"
    message and leave it all to zap_pte_range().

    mm/memory.c already has a hardly used print_bad_pte() function, showing
    some of the appropriate info: extend it to show what we want for the rmap
    case: pte info, page info (when there is a page) and vma info to compare.
    zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
    use if it works that out for itself.

    Some of this info is also shown in bad_page()'s "Bad page state" message.
    Keep them separate, but adjust them to match each other as far as
    possible. Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
    there too.

    print_bad_pte() shows current->comm unconditionally (though it should get
    repeated in the usually irrelevant stack trace): sorry, I misled Nick
    Piggin to make it conditional on vm_mm == current->mm, but current->mm is
    already NULL in the exit case. Usually current->comm is good, though
    exceptionally it may not be that of the mm (when "swapoff" for example).

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Moving lru_cache_add_active_or_unevictable() into page_add_new_anon_rmap()
    was good but stupid: we can and should SetPageSwapBacked() there too; and
    we know for sure that this anonymous, swap-backed page is not file cache.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_lock_anon_vma() and page_unlock_anon_vma() were made available to
    show_page_path() in vmscan.c; but now that has been removed, make them
    static in rmap.c again, they're better kept private if possible.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always
    appear together. Save some symbol table space and some jumping around by
    removing lru_cache_add_active_or_unevictable(), folding its code into
    page_add_new_anon_rmap(): like how we add file pages to lru just after
    adding them to page cache.

    Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and
    change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as
    originally intended.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins