24 Jun, 2009

2 commits

  • If a kthread happens to use get_user_pages() on an mm (as KSM does),
    there's a chance that it will end up trying to read in a swap page, then
    oops in grab_swap_token() because the kthread has no mm: GUP passes down
    the right mm, so grab_swap_token() ought to be using it.

    We have not identified a stronger case than KSM's daemon (not yet in
    mainline), but the issue must have come up before, since RHEL has included
    a fix for this for years (though a different fix, they just back out of
    grab_swap_token if current->mm is unset: which is what we first proposed,
    but using the right mm here seems more correct).
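
    [ For illustration, a minimal sketch of the idea; the exact lines are
    assumed rather than quoted from the patch: grab_swap_token() takes the
    faulting mm as an argument, and the swap-fault path passes down the mm
    that GUP supplied instead of dereferencing current->mm. ]

        /* swap-fault path, which already has the right mm: */
        grab_swap_token(mm);    /* was grab_swap_token(), which used current->mm */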

    Reported-by: Izik Eidus
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Indeed FOLL_WRITE matches FAULT_FLAG_WRITE, matches GUP_FLAGS_WRITE,
    and it's tempting to devise a set of Grand Unified Paging flags;
    but not today. So until then, let's rely upon the compiler to spot
    the coincidence, "rather than have that subtle dependency and a
    comment for it" - as you remarked in another context yesterday.
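
    [ A sketch of the kind of explicit conversion this describes; variable
    names are assumed from the surrounding GUP code. The compiler can fold
    the translation away wherever the flag values happen to coincide: ]

        unsigned int fault_flags = 0;

        if (foll_flags & FOLL_WRITE)
                fault_flags |= FAULT_FLAG_WRITE;    /* folds to a no-op if the values match */

        ret = handle_mm_fault(mm, vma, start, fault_flags);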

    Signed-off-by: Hugh Dickins
    Acked-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

22 Jun, 2009

2 commits

  • This allows the callers to now pass down the full set of FAULT_FLAG_xyz
    flags to handle_mm_fault(). All callers have been (mechanically)
    converted to the new calling convention; there's almost certainly room
    for architectures to clean up their code and then add FAULT_FLAG_RETRY
    when that support is added.
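
    [ A sketch of a mechanically converted caller -- a hypothetical arch
    fault handler; the 'write' flag and the label are illustrative: ]

        fault = handle_mm_fault(mm, vma, address,
                                write ? FAULT_FLAG_WRITE : 0);
        if (fault & VM_FAULT_OOM)
                goto out_of_memory;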

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The fault handling routines really want more fine-grained flags than a
    single "was it a write fault" boolean - the callers will want to set
    flags like "you can return a retry error" etc.

    And that's actually how the VM works internally, but right now the
    top-level fault handling functions in mm/memory.c all pass just the
    'write_access' boolean around.

    This switches them over to pass around the FAULT_FLAG_xyzzy 'flags'
    variable instead. The 'write_access' calling convention still exists
    for the exported 'handle_mm_fault()' function, but that is next.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Jun, 2009

4 commits

  • Analogous to follow_phys(), add a helper that looks up the PFN at a
    user virtual address in an IO mapping or a raw PFN mapping.
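
    [ A usage sketch, assuming the helper is named follow_pfn() by analogy
    with follow_phys(); error handling abbreviated: ]

        unsigned long pfn;

        if (follow_pfn(vma, address, &pfn))
                return -EINVAL;         /* not an IO/PFN mapping, or no pte there */
        /* pfn now identifies the physical frame backing 'address' */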

    Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Acked-by: Magnus Damm
    Cc: Hans Verkuil
    Cc: Paul Mundt
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Acked-by: Magnus Damm
    Cc: Hans Verkuil
    Cc: Paul Mundt
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A generic readonly page table lookup helper to map an address space and an
    address from it to a pte.
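
    [ A usage sketch; the helper name follow_pte() and the convention of
    handing back the pte together with the lock that guards it are assumed: ]

        pte_t *ptep, pte;
        spinlock_t *ptl;

        if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
                return -EINVAL;
        pte = *ptep;                    /* read-only snapshot of the entry */
        pte_unmap_unlock(ptep, ptl);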

    Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Acked-by: Magnus Damm
    Cc: Hans Verkuil
    Cc: Paul Mundt
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Move more documentation for get_user_pages_fast into the new kerneldoc comment.
    Add some comments for get_user_pages as well.

    Also, move get_user_pages_fast declaration up to get_user_pages. It wasn't
    there initially because it was once a static inline function.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Nick Piggin
    Cc: Andy Grover
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

03 May, 2009

2 commits

  • Change page_mkwrite to allow implementations to return with the page
    locked, and also change its callers (in page fault paths) to hold the
    lock until the page is marked dirty. This allows the filesystem to have
    full control of page dirtying events coming from the VM.

    Rather than simply hold the page locked over the page_mkwrite call, we
    call page_mkwrite with the page unlocked and allow callers to return with
    it locked, so filesystems can avoid LOR conditions with page lock.

    The problem with the current scheme is this: a filesystem that wants to
    associate some metadata with a page as long as the page is dirty, will
    perform this manipulation in its ->page_mkwrite. It currently then must
    return with the page unlocked and may not hold any other locks (according
    to existing page_mkwrite convention).

    In this window, the VM could write out the page, clearing page-dirty. The
    filesystem has no good way to detect that a dirty pte is about to be
    attached, so it will happily write out the page, at which point, the
    filesystem may manipulate the metadata to reflect that the page is no
    longer dirty.

    It is not always possible to perform the required metadata manipulation in
    ->set_page_dirty, because that function cannot block or fail. The
    filesystem may need to allocate some data structure, for example.

    And the VM cannot mark the pte dirty before page_mkwrite, because
    page_mkwrite is allowed to fail, so we must not allow any window where the
    page could be written to if page_mkwrite does fail.

    This solution of holding the page locked over the 3 critical operations
    (page_mkwrite, setting the pte dirty, and finally setting the page dirty)
    closes out races nicely, preventing page cleaning for writeout from being
    initiated in that window. This provides the filesystem with a strong
    synchronisation against the VM here.
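
    [ A minimal sketch of a filesystem ->page_mkwrite() under the new
    convention; myfs_page_mkwrite() is hypothetical, and VM_FAULT_LOCKED
    tells the caller that the page is being returned still locked: ]

        static int myfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
        {
                struct page *page = vmf->page;

                lock_page(page);
                if (page->mapping != vma->vm_file->f_mapping) {
                        unlock_page(page);
                        return VM_FAULT_NOPAGE;     /* page was truncated meanwhile */
                }
                /* attach dirty-tracking metadata here, while the page is locked */
                return VM_FAULT_LOCKED;             /* caller keeps the lock until the page is dirtied */
        }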

    - Sage needs this race closed for ceph filesystem.
    - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
    - I need it for fsblock.
    - I suspect other filesystems may need it too (eg. btrfs).
    - I have converted buffer.c to the new locking. Even simple block allocation
    under dirty pages might be susceptible to i_size changing under partial page
    at the end of file (we also have a buffer.c-side problem here, but it cannot
    be fixed properly without this patch).
    - Other filesystems (eg. NFS, maybe btrfs) will need to change their
    page_mkwrite functions themselves.

    [ This also moves page_mkwrite another step closer to fault, which should
    eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
    filesystem calldown and page lock/unlock cycle in __do_fault. ]

    [akpm@linux-foundation.org: fix derefs of NULL ->mapping]
    Cc: Sage Weil
    Cc: Trond Myklebust
    Signed-off-by: Nick Piggin
    Cc: Valdis Kletnieks
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • By the time the memory cgroup code is notified about a swapin we
    already hold a reference on the fault page.

    If the cgroup callback fails, make sure to unlock AND release the page
    reference which was taken by lookup_swap_cache(), or we leak the reference.
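
    [ A sketch of the corrected error path in do_swap_page(); 'ptr' is the
    memcg handle from the surrounding code, and the exact labels are assumed: ]

        if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
                ret = VM_FAULT_OOM;
                unlock_page(page);
                page_cache_release(page);   /* drop the reference from lookup_swap_cache() */
                goto out;
        }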

    Signed-off-by: Johannes Weiner
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Apr, 2009

3 commits

  • Change the page_mkwrite prototype to take a struct vm_fault, and return
    VM_FAULT_xxx flags. There should be no functional change.

    This makes it possible to return much more detailed error information to
    the VM (and also can provide more information eg. virtual_address to the
    driver, which might be important in some special cases).

    This is required for a subsequent fix. And will also make it easier to
    merge page_mkwrite() with fault() in future.
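
    [ The shape of the change, for reference; the old form is reconstructed
    from context: ]

        /* old: */
        int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
        /* new: fault details arrive in vm_fault, VM_FAULT_xxx flags come back */
        int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);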

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Trond Myklebust
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Artem Bityutskiy
    Cc: Felix Blyakher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • At first look, mark_page_accessed() in follow_page() seems a bit strange.
    It seems pte_mkyoung() would be more consistent with other kernel code.

    However, it is intentional. The commit log said:

    ------------------------------------------------
    commit 9e45f61d69be9024a2e6bef3831fb04d90fac7a8
    Author: akpm
    Date: Fri Aug 15 07:24:59 2003 +0000

    [PATCH] Use mark_page_accessed() in follow_page()

    Touching a page via follow_page() counts as a reference so we should be
    either setting the referenced bit in the pte or running mark_page_accessed().

    Altering the pte is tricky because we haven't implemented an atomic
    pte_mkyoung(). And mark_page_accessed() is better anyway because it has more
    aging state: it can move the page onto the active list.

    BKrev: 3f3c8acbplT8FbwBVGtth7QmnqWkIw
    ------------------------------------------------

    The atomic issue is still true nowadays. Adding a comment helps readers
    understand the code's intention, so it is better to have one.
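
    [ The relevant fragment of follow_page(), roughly, with the comment this
    patch adds; flag names follow the GUP code of this era: ]

        if (flags & FOLL_TOUCH) {
                if ((flags & FOLL_WRITE) &&
                    !pte_dirty(pte) && !PageDirty(page))
                        set_page_dirty(page);
                /*
                 * pte_mkyoung() would be more correct here, but atomic care
                 * is needed to avoid losing the dirty bit: it is easier to
                 * use mark_page_accessed().
                 */
                mark_page_accessed(page);
        }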

    [akpm@linux-foundation.org: clarify text]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • commit bf3f3bc5e734706730c12a323f9b2068052aa1f0 (mm: don't
    mark_page_accessed in fault path) only removed the mark_page_accessed() in
    filemap_fault().

    Therefore, swap-backed pages and file-backed pages have inconsistent
    behavior. mark_page_accessed() should be removed from do_swap_page().

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

13 Mar, 2009

1 commit

  • Impact: fix false positive PAT warnings - also fix VirtualBox hang

    Use of vma->vm_pgoff to identify the pfnmaps that are fully
    mapped at mmap time is broken. vm_pgoff is set by generic mmap
    code even for cases where drivers are setting up the mappings
    at the fault time.

    The problem was originally reported here:

    http://marc.info/?l=linux-kernel&m=123383810628583&w=2

    Change is_linear_pfn_mapping logic to overload VM_INSERTPAGE
    flag along with VM_PFNMAP to mean full PFNMAP setup at mmap
    time.

    Problem also tracked at:

    http://bugzilla.kernel.org/show_bug.cgi?id=12800
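
    [ A sketch of the overloaded test; the helper macro name is assumed, the
    point being that both flags together mean "fully mapped at mmap time": ]

        #define VM_PFN_AT_MMAP  (VM_PFNMAP | VM_INSERTPAGE)     /* name assumed */

        static inline int is_linear_pfn_mapping(struct vm_area_struct *vma)
        {
                return (vma->vm_flags & VM_PFN_AT_MMAP) == VM_PFN_AT_MMAP;
        }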

    Reported-by: Thomas Hellstrom
    Tested-by: Frans Pop
    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Cc: Nick Piggin
    Cc: "ebiederm@xmission.com"
    Cc: # only for 2.6.29.1, not .28
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Pallipadi, Venkatesh
     

06 Feb, 2009

1 commit

  • Fix do_wp_page for VM_MIXEDMAP mappings.

    In the case where pfn_valid returns 0 for a pfn at the beginning of
    do_wp_page and the mapping is not shared writable, the code branches to
    label `gotten:' with old_page == NULL.

    In case the vma is locked (vma->vm_flags & VM_LOCKED), lock_page,
    clear_page_mlock, and unlock_page try to access the old_page.

    This patch checks whether old_page is valid before it is dereferenced.

    The regression was introduced by "mlock: mlocked pages are unevictable"
    (commit b291f000393f5a0b679012b39d79fbc85c018233).

    Signed-off-by: Carsten Otte
    Cc: Nick Piggin
    Cc: Heiko Carstens
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     

14 Jan, 2009

2 commits

  • Impact: cleanup

    Change the protection parameter for track_pfn_vma_new() into a pgprot_t pointer.
    A subsequent patch changes the x86 PAT handling to return a compatible
    memtype in pgprot_t, if what was requested cannot be allowed due to conflicts.
    No functionality change in this patch.
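
    [ The shape of the interface change; argument names are assumed: ]

        /* old: the requested protection is passed by value */
        int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t prot,
                              unsigned long pfn, unsigned long size);
        /* new: PAT may write a compatible memtype back through the pointer */
        int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t *prot,
                              unsigned long pfn, unsigned long size);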

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar

    venkatesh.pallipadi@intel.com
     
  • Impact: fix (harmless) double-free of memtype entries and avoid warning

    On track_pfn_vma_new() failure, reset the vm_flags so that there will be
    no second cleanup happening when upper level routines call unmap_vmas().

    This patch fixes part of the bug reported here:

    http://marc.info/?l=linux-kernel&m=123108883716357&w=2

    Specifically the error message:

    X:5010 freeing invalid memtype d0000000-d0101000

    is due to multiple frees on the error path and will not happen with the patch below.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar

    venkatesh.pallipadi@intel.com
     

12 Jan, 2009

1 commit

  • Some code (nfs/sunrpc) uses socket ops on kernel memory while holding
    the mmap_sem. This is safe because kernel memory doesn't get paged out,
    so we'll never actually fault, and the might_fault() annotations
    only generate false positives.

    Reported-by: "J. Bruce Fields"
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Jan, 2009

5 commits

  • Fix swapin charge operation of memcg.

    Now, memcg has hooks to the swap-out operation and checks whether SwapCache
    is really unused or not. That check depends on the contents of struct page,
    i.e. if PageAnon(page) && page_mapped(page), the page is recognized as
    still-in-use.

    Now, reuse_swap_page() calls delete_from_swap_cache() before establishment
    of any rmap. Then, in the following sequence

    (Page fault with WRITE)
    try_charge() (charge += PAGESIZE)
    commit_charge() (Check page_cgroup is used or not..)
    reuse_swap_page()
    -> delete_from_swapcache()
    -> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE)
    ......
    New charge is uncharged soon....
    To avoid this, move commit_charge() after page_mapcount() goes up to 1.
    By this,

    try_charge() (usage += PAGESIZE)
    reuse_swap_page() (may usage -= PAGESIZE if PCG_USED is set)
    commit_charge() (If page_cgroup is not marked as PCG_USED,
    add new charge.)
    Accounting will be correct.

    Changelog (v2) -> (v3)
    - fixed invalid charge to swp_entry==0.
    - updated documentation.
    Changelog (v1) -> (v2)
    - fixed comment.

    [nishimura@mxp.nes.nec.co.jp: swap accounting leak doc fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Daisuke Nishimura
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch changed gfp_mask
    of callers of charge to be GFP_HIGHUSER_MOVABLE for showing what will
    happen at memory reclaim.

    But in recent discussion, it's NACKed because it sounds ugly.

    This patch reverts it and adds some cleanup to the gfp_mask of the
    callers of charge. No behavior change, but it needs review before
    generating hunks deeper in the queue.

    This patch also adds explanation to meaning of gfp_mask passed to charge
    functions in memcontrol.h.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch implements a per-cgroup limit for usage of memory+swap. Although
    there is SwapCache, double counting of swap-cache and swap-entry is
    avoided.

    Mem+Swap controller works as following.
    - memory usage is limited by memory.limit_in_bytes.
    - memory + swap usage is limited by memory.memsw_limit_in_bytes.

    This has following benefits.
    - A user can limit total resource usage of mem+swap.

    Without this, because memory resource controller doesn't take care of
    usage of swap, a process can exhaust all the swap (by memory leak.)
    We can avoid this case.

    And Swap is shared resource but it cannot be reclaimed (goes back to memory)
    until it's used. This characteristic can be trouble when the memory
    is divided into some parts by cpuset or memcg.
    Assume group A and group B.
    After some application executes, the system can be..

    Group A -- very large free memory space but occupy 99% of swap.
    Group B -- under memory shortage but cannot use swap...it's nearly full.

    Ability to set appropriate swap limit for each group is required.

    Maybe someone wonders "why not swap but mem+swap?"

    - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
    to move account from memory to swap...there is no change in usage of
    mem+swap.

    In other words, when we want to limit the usage of swap without affecting
    global LRU, mem+swap limit is better than just limiting swap.

    Accounting target information is stored in swap_cgroup which is
    per swap entry record.

    Charge is done as following.
    map
    - charge page and memsw.

    unmap
    - uncharge page/memsw if not SwapCache.

    swap-out (__delete_from_swap_cache)
    - uncharge page
    - record mem_cgroup information to swap_cgroup.

    swap-in (do_swap_page)
    - charged as page and memsw.
    record in swap_cgroup is cleared.
    memsw accounting is decremented.

    swap-free (swap_free())
    - if swap entry is freed, memsw is uncharged by PAGE_SIZE.

    There are people who work in never-swap environments and consider swap as
    something bad. For such people, this mem+swap controller extension is just an
    overhead. This overhead is avoided by config or boot option.
    (see Kconfig. detail is not in this patch.)

    TODO:
    - maybe more optimization can be done in the swap-in path. (but not very safe.)
    But we just do simple accounting at this stage.

    [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
    [hugh@veritas.com: memswap controller core swapcache fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Fix misuse of GFP_KERNEL.

    Now, most of callers of mem_cgroup_charge_xxx functions uses GFP_KERNEL.

    I think that this is from the fact that page_cgroup *was* dynamically
    allocated.

    But now, we allocate all page_cgroup at boot. And
    mem_cgroup_try_to_free_pages() reclaim memory from GFP_HIGHUSER_MOVABLE +
    specified GFP_RECLAIM_MASK.

    * This is because we just want to reduce memory usage.
    "Where we should reclaim from ?" is not a problem in memcg.

    This patch modifies gfp masks to be GFP_HIGHUSER_MOVABLE if possible.

    Note: This patch is not for fixing behavior but for showing sane information
    in source code.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There is a small race in do_swap_page(). When the page swapped-in is
    charged, the mapcount can be greater than 0. But, at the same time, some
    process (which shares it) may call unmap, making the mapcount go from 1 to 0,
    and the page is uncharged.

    CPUA                                CPUB
                                        mapcount == 1.
    (1) charge if mapcount==0           zap_pte_range()
                                        (2) mapcount 1 => 0.
                                        (3) uncharge(). (success)
    (4) set page's rmap()
        mapcount 0=>1

    Then, this swap page's account is leaked.

    For fixing this, I added a new interface.
    - charge
    account to res_counter by PAGE_SIZE and try to free pages if necessary.
    - commit
    register page_cgroup and add to LRU if necessary.
    - cancel
    uncharge PAGE_SIZE because of do_swap_page failure.

    CPUA
    (1) charge (always)
    (2) set page's rmap (mapcount > 0)
    (3) commit charge was necessary or not after set_pte().

    This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
    Usual mem_cgroup_charge_common() does charge -> commit at a time.
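
    [ A sketch of the three-step interface described above; treat the exact
    signatures as illustrative: ]

        /* charge: account PAGE_SIZE to the res_counter, reclaiming if necessary */
        int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
                                         gfp_t gfp_mask, struct mem_cgroup **ptr);
        /* commit: register the page_cgroup and add it to the LRU if necessary */
        void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr);
        /* cancel: uncharge PAGE_SIZE when do_swap_page() fails after the charge */
        void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);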

    And this patch also adds following function to clarify all charges.

    - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
    called against newly allocated anon pages.

    - mem_cgroup_charge_migrate_fixup()
    called only from remove_migration_ptes().
    we'll have to rewrite this later.(this patch just keeps old behavior)
    This function will be removed by additional patch to make migration
    clearer.

    Good for clarifying "what we do"

    Then, we have 4 following charge points.
    - newpage
    - swap-in
    - add-to-cache.
    - migration.

    [akpm@linux-foundation.org: add missing inline directives to stubs]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

15 commits

  • The initial implementation of checking TIF_MEMDIE covers the cases of OOM
    killing. If the process has been OOM killed, the TIF_MEMDIE is set and it
    return immediately. This patch includes:

    1. add the case that the SIGKILL is sent by user processes. The
    process can try to get_user_pages() unlimited memory even if a user
    process has sent a SIGKILL to it (maybe a monitor finds the process
    exceeding its memory limit and tries to kill it). In the old
    implementation, the SIGKILL won't be handled until the get_user_pages()
    returns.

    2. change the return value to be ERESTARTSYS. It makes no sense to
    return ENOMEM if get_user_pages() returns because it got a SIGKILL
    signal. The general convention for a system call interrupted by a
    signal is ERESTARTNOSYS, so the current return value is consistent
    with that.
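
    [ A sketch of the check this adds inside the __get_user_pages() loop;
    placement is assumed: ]

        if (unlikely(fatal_signal_pending(current)))
                return i ? i : -ERESTARTSYS;    /* bail out once SIGKILL is pending */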

    Lee:

    An unfortunate side effect of "make-get_user_pages-interruptible" is that
    it prevents a SIGKILL'd task from munlock-ing pages that it had mlocked,
    resulting in freeing of mlocked pages. Freeing of mlocked pages, in
    itself, is not so bad. We just count them now--altho' I had hoped to
    remove this stat and add PG_MLOCKED to the free pages flags check.

    However, consider pages in shared libraries mapped by more than one task
    that a task mlocked--e.g., via mlockall(). If the task that mlocked the
    pages exits via SIGKILL, these pages would be left mlocked and
    unevictable.

    Proposed fix:

    Add another GUP flag to ignore sigkill when calling get_user_pages from
    munlock()--similar to Kosaki Motohiro's 'IGNORE_VMA_PERMISSIONS flag for
    the same purpose. We are not actually allocating memory in this case,
    which "make-get_user_pages-interruptible" intends to avoid. We're just
    munlocking pages that are already resident and mapped, and we're reusing
    get_user_pages() to access those pages.

    ?? Maybe we should combine 'IGNORE_VMA_PERMISSIONS and '_IGNORE_SIGKILL
    into a single flag: GUP_FLAGS_MUNLOCK ???

    [Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock]
    Signed-off-by: Paul Menage
    Signed-off-by: Ying Han
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Lee Schermerhorn
    Cc: Rohit Seth
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • bad_page() and rmap Eeek messages have said KERN_EMERG for a few years,
    which I've followed in print_bad_pte(). These are serious system errors,
    on a par with BUGs, but they're not quite emergencies, and we do our best
    to carry on: say KERN_ALERT "BUG: " like the x86 oops does.

    And remove the "Trying to fix it up, but a reboot is needed" line: it's
    not untrue, but I hope the KERN_ALERT "BUG: " conveys as much.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • print_bad_pte() and bad_page() might each need ratelimiting - especially
    for their dump_stacks, almost never of interest, yet not quite
    dispensable. Correlating corruption across neighbouring entries can be
    very helpful, so allow a burst of 60 reports before keeping quiet for the
    remainder of that minute (or allow a steady drip of one report per
    second).
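
    [ A condensed sketch of that policy, as it might sit at the top of
    print_bad_pte(); the static names are assumed: ]

        static unsigned long resume, nr_shown, nr_unshown;

        /* allow a burst of 60 reports, then keep quiet for the rest of the minute */
        if (nr_shown == 60) {
                if (time_before(jiffies, resume)) {
                        nr_unshown++;
                        return;
                }
                if (nr_unshown) {
                        printk(KERN_ALERT "BUG: Bad page map: %lu messages suppressed\n",
                               nr_unshown);
                        nr_unshown = 0;
                }
                nr_shown = 0;
        }
        if (nr_shown++ == 0)
                resume = jiffies + 60 * HZ;     /* ...or a steady drip of one per second */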

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
    And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
    page_dup_rmap(): we're trying to be more resilient about that than BUGs.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Complete zap_pte_range()'s coverage of bad pagetable entries by calling
    print_bad_pte() on a pte_file in a linear vma and on a bad swap entry.
    That needs free_swap_and_cache() to tell it, which will also have shown
    one of those "swap_free" errors (but with much less information).
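
    [ A sketch of the added checks in zap_pte_range(); the surrounding loop
    is omitted and print_bad_pte() is shown with its extended arguments: ]

        if (pte_file(ptent)) {
                if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
                        print_bad_pte(vma, addr, ptent, NULL);  /* pte_file in a linear vma */
        } else {
                if (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent))))
                        print_bad_pte(vma, addr, ptent, NULL);  /* bad swap entry */
        }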

    Similar checks in fork's copy_one_pte()? No, that would be more noisy
    than helpful: we'll see them when parent and child exec or exit.

    Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR
    case, that could only be a bug in sys_remap_file_pages(), not a bad pte.
    VM_FAULT_OOM rather than VM_FAULT_SIGBUS? Well, okay, that is consistent
    with what happens if do_swap_page() operates a bad swap entry; but don't
    we have patches to be more careful about killing when VM_FAULT_OOM?

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • print_bad_pte() is so far being called only when zap_pte_range() finds
    negative page_mapcount, or there's a fault on a pte_file where it does not
    belong. That's weak coverage when we suspect pagetable corruption.

    Originally, it was called when vm_normal_page() found an invalid pfn: but
    pfn_valid is expensive on some architectures and configurations, so 2.6.24
    put that under CONFIG_DEBUG_VM (which doesn't help in the field), then
    2.6.26 replaced it by a VM_BUG_ON (likewise).

    Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper test
    than pfn_valid(): memmap_init_zone() (used in bootup and hotplug) keeps a
    __read_mostly note of the highest_memmap_pfn, and vm_normal_page() then
    checks the pfn against that. We could call this pfn_plausible() or pfn_sane(), but I
    doubt we'll need it elsewhere: of course it's not reliable, but gives much
    stronger pagetable validation on many boxes.

    Also use print_bad_pte() when the pte_special bit is found outside a
    VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON.
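
    [ A sketch of the cheaper test in vm_normal_page(); highest_memmap_pfn is
    the __read_mostly note mentioned above: ]

        if (unlikely(pfn > highest_memmap_pfn)) {
                print_bad_pte(vma, addr, pte, NULL);
                return NULL;
        }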

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Now that bad pages are kept out of circulation, there is no need for the
    infamous page_remove_rmap() BUG() - once that page is freed, its negative
    mapcount will issue a "Bad page state" message and the page won't be
    freed. Removing the BUG() allows more info, on subsequent pages, to be
    gathered.

    We do have more info about the page at this point than bad_page() can know
    - notably, what the pmd is, which might pinpoint something like low 64kB
    corruption - but page_remove_rmap() isn't given the address to find that.

    In practice, there is only one call to page_remove_rmap() which has ever
    reported anything, that from zap_pte_range() (usually on exit, sometimes
    on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek"
    message and leave it all to zap_pte_range().

    mm/memory.c already has a hardly used print_bad_pte() function, showing
    some of the appropriate info: extend it to show what we want for the rmap
    case: pte info, page info (when there is a page) and vma info to compare.
    zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
    use if it works that out for itself.

    Some of this info is also shown in bad_page()'s "Bad page state" message.
    Keep them separate, but adjust them to match each other as far as
    possible. Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
    there too.

    print_bad_pte() shows current->comm unconditionally (though it should get
    repeated in the usually irrelevant stack trace): sorry, I misled Nick
    Piggin to make it conditional on vm_mm == current->mm, but current->mm is
    already NULL in the exit case. Usually current->comm is good, though
    exceptionally it may not be that of the mm (when "swapoff" for example).

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • sparse outputs the following warnings:

    mm/memory.c:2936:8: warning: incorrect type in assignment (different address spaces)
    mm/memory.c:2936:8: expected void *maddr
    mm/memory.c:2936:8: got void [noderef]

    Clean this up.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • remove_exclusive_swap_page(): its problem is in living up to its name.

    It doesn't matter if someone else has a reference to the page (raised
    page_count); it doesn't matter if the page is mapped into userspace
    (raised page_mapcount - though that hints it may be worth keeping the
    swap): all that matters is that there be no more references to the swap
    (and no writeback in progress).

    swapoff (try_to_unuse) has been removing pages from swapcache for years,
    with no concern for page count or page mapcount, and we used to have a
    comment in lookup_swap_cache() recognizing that: if you go for a page of
    swapcache, you'll get the right page, but it could have been removed from
    swapcache by the time you get page lock.

    So, give up asking for exclusivity: get rid of
    remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
    remove_exclusive_swap_page_count() which were spawned for the recent LRU
    work: replace them by the simpler try_to_free_swap() which just checks
    page_swapcount().

    Similarly, remove the page_count limitation from free_swap_and_cache(),
    but assume that it's worth holding on to the swap if page is mapped and
    swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()?
    It would be consistent, but I think we probably have enough for now.
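
    [ A sketch of the simpler replacement, close to what the description
    implies; details are assumed: ]

        int try_to_free_swap(struct page *page)
        {
                VM_BUG_ON(!PageLocked(page));

                if (!PageSwapCache(page))
                        return 0;
                if (PageWriteback(page))
                        return 0;
                if (page_swapcount(page))       /* only the swap count matters now */
                        return 0;

                delete_from_swap_cache(page);
                SetPageDirty(page);
                return 1;
        }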

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A good place to free up old swap is where do_wp_page(), or do_swap_page(),
    is about to redirty the page: the data on disk is then stale and won't be
    read again; and if we do decide to write the page out later, using the
    previous swap location makes an unnecessary disk seek very likely.

    So give can_share_swap_page() the side-effect of delete_from_swap_cache()
    when it safely can. And can_share_swap_page() was always a misleading
    name, the more so if it has a side-effect: rename it reuse_swap_page().

    Irrelevant cleanup nearby: remove swap_token_default_timeout definition
    from swap.h: it's used nowhere.
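
    [ A sketch of the renamed helper with its new side effect; the exact
    conditions are assumed: ]

        int reuse_swap_page(struct page *page)
        {
                int count;

                VM_BUG_ON(!PageLocked(page));
                count = page_mapcount(page);
                if (count <= 1 && PageSwapCache(page)) {
                        count += page_swapcount(page);
                        if (count == 1 && !PageWriteback(page)) {
                                /* about to be redirtied: the swap copy is stale */
                                delete_from_swap_cache(page);
                                SetPageDirty(page);
                        }
                }
                return count == 1;
        }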

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • An application may rely on get_user_pages() to give it pages writable from
    userspace and shared with a driver, GUP breaking COW if necessary. It may
    mprotect() the pages' writability, off and on, from time to time.

    Normally this works fine (so long as the app does not fork); but just
    occasionally, under memory pressure, a readonly pte in a newly writable
    area is COWed unnecessarily, breaking the link with the driver: because
    do_wp_page() does trylock_page, and falls back to COW whenever that fails.

    For reliable behaviour in the unshared case, when the trylock_page fails,
    now unlock pagetable, lock page and relock pagetable, before deciding
    whether Copy-On-Write is really necessary.
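
    [ A sketch of the new fallback in do_wp_page() when trylock_page() fails;
    variable names are taken from the surrounding code and assumed: ]

        if (!trylock_page(old_page)) {
                page_cache_get(old_page);
                pte_unmap_unlock(page_table, ptl);      /* drop the pagetable lock */
                lock_page(old_page);                    /* sleep for the page lock */
                page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
                if (!pte_same(*page_table, orig_pte)) { /* pte changed meanwhile */
                        unlock_page(old_page);
                        page_cache_release(old_page);
                        goto unlock;
                }
                page_cache_release(old_page);
        }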

    Reported-by: Zhou Yingchao
    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • do_wp_page()'s VM_FAULT_WRITE return value tells __get_user_pages() that
    COW has been done if necessary, though it may be leaving the pte without
    write permission - for the odd case of forced writing to a readonly vma
    for ptrace. At present GUP then retries the follow_page() without asking
    for write permission, to escape an endless loop when forced.

    But an application may be relying on GUP to guarantee a writable page
    which won't be COWed again when written from userspace, whereas a race
    here might leave a readonly pte in place? Change the VM_FAULT_WRITE
    handling to ask follow_page() for write permission again, except in that
    odd case of forced writing to a readonly vma.
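
    [ A sketch of the revised handling in __get_user_pages(); the
    forced-write-to-readonly-vma exception is the !VM_WRITE test: ]

        if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
                foll_flags &= ~FOLL_WRITE;      /* ptrace forced write: don't insist on a writable pte */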

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Moving lru_cache_add_active_or_unevictable() into page_add_new_anon_rmap()
    was good but stupid: we can and should SetPageSwapBacked() there too; and
    we know for sure that this anonymous, swap-backed page is not file cache.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always
    appear together. Save some symbol table space and some jumping around by
    removing lru_cache_add_active_or_unevictable(), folding its code into
    page_add_new_anon_rmap(): like how we add file pages to lru just after
    adding them to page cache.

    Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and
    change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as
    originally intended.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make the pte-level function in apply_to_range be called in lazy mmu mode,
    so that any pagetable modifications can be batched.
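
    [ A sketch of the change inside apply_to_pte_range(); the loop body is
    abbreviated to the callback invocation: ]

        arch_enter_lazy_mmu_mode();             /* batch the pagetable updates made by fn() */
        do {
                err = fn(pte++, token, addr, data);
                if (err)
                        break;
        } while (addr += PAGE_SIZE, addr != end);
        arch_leave_lazy_mmu_mode();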

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Johannes Weiner
    Cc: Nick Piggin
    Cc: Venkatesh Pallipadi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge