07 Jan, 2009

17 commits

    The initial implementation of checking TIF_MEMDIE covers the OOM-kill
    case: if the process has been OOM killed, TIF_MEMDIE is set and
    get_user_pages() returns immediately. This patch includes:

    1. Add the case where the SIGKILL is sent by a user process. The
    process can keep pulling in unlimited memory through get_user_pages()
    even after a user process has sent it a SIGKILL (perhaps a monitor
    noticed the process exceeding its memory limit and tried to kill it).
    In the old implementation, the SIGKILL was not handled until
    get_user_pages() returned.

    2. Change the return value to ERESTARTSYS. It makes no sense to
    return ENOMEM when get_user_pages() is cut short by a SIGKILL.
    The general convention for a system call interrupted by a signal
    is ERESTARTSYS, so the new return value is consistent with that;
    see the sketch below.
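
    A minimal sketch of the idea (not the exact upstream diff), assuming the
    existing fatal_signal_pending() helper and the usual __get_user_pages()
    main loop:

        /*
         * Bail out of the gup loop once a fatal signal such as SIGKILL
         * is pending, returning the pages pinned so far, or
         * -ERESTARTSYS if none were pinned yet.
         */
        if (unlikely(fatal_signal_pending(current)))
                return i ? i : -ERESTARTSYS;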

    Lee:

    An unfortunate side effect of "make-get_user_pages-interruptible" is that
    it prevents a SIGKILL'd task from munlock-ing pages that it had mlocked,
    resulting in freeing of mlocked pages. Freeing of mlocked pages, in
    itself, is not so bad. We just count them now--altho' I had hoped to
    remove this stat and add PG_MLOCKED to the free pages flags check.

    However, consider pages in shared libraries mapped by more than one task
    that a task mlocked--e.g., via mlockall(). If the task that mlocked the
    pages exits via SIGKILL, these pages would be left mlocked and
    unevictable.

    Proposed fix:

    Add another GUP flag to ignore SIGKILL when calling get_user_pages()
    from munlock() - similar to Kosaki Motohiro's 'IGNORE_VMA_PERMISSIONS'
    flag for the same purpose. We are not actually allocating memory in
    this case, which is what "make-get_user_pages-interruptible" intends
    to avoid. We're just munlocking pages that are already resident and
    mapped, and we're reusing get_user_pages() to access those pages.

    ?? Maybe we should combine 'IGNORE_VMA_PERMISSIONS' and
    'IGNORE_SIGKILL' into a single flag: GUP_FLAGS_MUNLOCK ???

    [Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock]
    Signed-off-by: Paul Menage
    Signed-off-by: Ying Han
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Lee Schermerhorn
    Cc: Rohit Seth
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • bad_page() and rmap Eeek messages have said KERN_EMERG for a few years,
    which I've followed in print_bad_pte(). These are serious system errors,
    on a par with BUGs, but they're not quite emergencies, and we do our best
    to carry on: say KERN_ALERT "BUG: " like the x86 oops does.

    And remove the "Trying to fix it up, but a reboot is needed" line: it's
    not untrue, but I hope the KERN_ALERT "BUG: " conveys as much.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • print_bad_pte() and bad_page() might each need ratelimiting - especially
    for their dump_stacks, almost never of interest, yet not quite
    dispensable. Correlating corruption across neighbouring entries can be
    very helpful, so allow a burst of 60 reports before keeping quiet for the
    remainder of that minute (or allow a steady drip of one report per
    second).
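
    Roughly the shape such a limiter could take (a sketch, not the exact
    patch), using plain jiffies arithmetic inside the reporting function:

        static unsigned long resume;    /* when to resume after a burst */
        static unsigned long nr_shown;
        static unsigned long nr_unshown;

        if (nr_shown == 60) {
                if (time_before(jiffies, resume)) {
                        nr_unshown++;   /* quiet for the rest of the minute */
                        return;
                }
                if (nr_unshown) {
                        printk(KERN_ALERT "BUG: %lu messages suppressed\n",
                               nr_unshown);
                        nr_unshown = 0;
                }
                nr_shown = 0;
        }
        if (nr_shown++ == 0)
                resume = jiffies + 60 * HZ;  /* start of a new minute's burst */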

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
    And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
    page_dup_rmap(): we're trying to be more resilient about that than BUGs.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Complete zap_pte_range()'s coverage of bad pagetable entries by calling
    print_bad_pte() on a pte_file in a linear vma and on a bad swap entry.
    That needs free_swap_and_cache() to tell it, which will also have shown
    one of those "swap_free" errors (but with much less information).
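
    A hedged sketch of the idea for the non-present pte path of
    zap_pte_range() (exact placement assumed):

        if (pte_file(ptent)) {
                /* a file pte only belongs in a nonlinear vma */
                if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
                        print_bad_pte(vma, addr, ptent, NULL);
        } else {
                /* free_swap_and_cache() now reports a bad swap entry */
                if (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent))))
                        print_bad_pte(vma, addr, ptent, NULL);
        }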

    Similar checks in fork's copy_one_pte()? No, that would be more noisy
    than helpful: we'll see them when parent and child exec or exit.

    Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR
    case, that could only be a bug in sys_remap_file_pages(), not a bad pte.
    VM_FAULT_OOM rather than VM_FAULT_SIGBUS? Well, okay, that is consistent
    with what happens when do_swap_page() operates on a bad swap entry; but don't
    we have patches to be more careful about killing when VM_FAULT_OOM?

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • print_bad_pte() is so far being called only when zap_pte_range() finds
    negative page_mapcount, or there's a fault on a pte_file where it does not
    belong. That's weak coverage when we suspect pagetable corruption.

    Originally, it was called when vm_normal_page() found an invalid pfn: but
    pfn_valid is expensive on some architectures and configurations, so 2.6.24
    put that under CONFIG_DEBUG_VM (which doesn't help in the field), then
    2.6.26 replaced it by a VM_BUG_ON (likewise).

    Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper test
    than pfn_valid(): memmap_init_zone() (used in bootup and hotplug) keeps a
    __read_mostly note of the highest_memmap_pfn, and vm_normal_page() then
    checks the pfn against that. We could call this pfn_plausible() or
    pfn_sane(), but I doubt we'll need it elsewhere: of course it's not
    reliable, but it gives much stronger pagetable validation on many boxes.

    Also use print_bad_pte() when the pte_special bit is found outside a
    VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON.
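
    A hedged sketch of the cheaper test described above:

        /* recorded by memmap_init_zone() at bootup and memory hotplug */
        unsigned long highest_memmap_pfn __read_mostly;

        /* ... then in vm_normal_page(), instead of pfn_valid()/VM_BUG_ON: */
        if (unlikely(pfn > highest_memmap_pfn)) {
                print_bad_pte(vma, addr, pte, NULL);
                return NULL;
        }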

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Now that bad pages are kept out of circulation, there is no need for the
    infamous page_remove_rmap() BUG() - once that page is freed, its negative
    mapcount will issue a "Bad page state" message and the page won't be
    freed. Removing the BUG() allows more info, on subsequent pages, to be
    gathered.

    We do have more info about the page at this point than bad_page() can know
    - notably, what the pmd is, which might pinpoint something like low 64kB
    corruption - but page_remove_rmap() isn't given the address to find that.

    In practice, there is only one call to page_remove_rmap() which has ever
    reported anything, that from zap_pte_range() (usually on exit, sometimes
    on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek"
    message and leave it all to zap_pte_range().

    mm/memory.c already has a hardly used print_bad_pte() function, showing
    some of the appropriate info: extend it to show what we want for the rmap
    case: pte info, page info (when there is a page) and vma info to compare.
    zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
    use if it works that out for itself.

    Some of this info is also shown in bad_page()'s "Bad page state" message.
    Keep them separate, but adjust them to match each other as far as
    possible. Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE
    there too.

    print_bad_pte() shows current->comm unconditionally (though it should get
    repeated in the usually irrelevant stack trace): sorry, I misled Nick
    Piggin into making it conditional on vm_mm == current->mm, but current->mm
    is already NULL in the exit case. Usually current->comm is good, though
    exceptionally it may not be that of the mm (during "swapoff", for example).

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    sparse emits the following warnings:

    mm/memory.c:2936:8: warning: incorrect type in assignment (different address spaces)
    mm/memory.c:2936:8: expected void *maddr
    mm/memory.c:2936:8: got void [noderef] <asn:2>*

    Clean this up.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • remove_exclusive_swap_page(): its problem is in living up to its name.

    It doesn't matter if someone else has a reference to the page (raised
    page_count); it doesn't matter if the page is mapped into userspace
    (raised page_mapcount - though that hints it may be worth keeping the
    swap): all that matters is that there be no more references to the swap
    (and no writeback in progress).

    swapoff (try_to_unuse) has been removing pages from swapcache for years,
    with no concern for page count or page mapcount, and we used to have a
    comment in lookup_swap_cache() recognizing that: if you go for a page of
    swapcache, you'll get the right page, but it could have been removed from
    swapcache by the time you get page lock.

    So, give up asking for exclusivity: get rid of
    remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
    remove_exclusive_swap_page_count() which were spawned for the recent LRU
    work: replace them by the simpler try_to_free_swap() which just checks
    page_swapcount().
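
    A sketch of what the simpler replacement might look like, assuming the
    caller already holds the page lock:

        /*
         * Free the swap entry backing this page, but only if nothing else
         * references the swap and no writeback is in progress: page count
         * and mapcount no longer matter.
         */
        int try_to_free_swap(struct page *page)
        {
                VM_BUG_ON(!PageLocked(page));

                if (!PageSwapCache(page))
                        return 0;
                if (PageWriteback(page))
                        return 0;
                if (page_swapcount(page))
                        return 0;

                delete_from_swap_cache(page);
                SetPageDirty(page);
                return 1;
        }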

    Similarly, remove the page_count limitation from free_swap_and_cache(),
    but assume that it's worth holding on to the swap if page is mapped and
    swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()?
    It would be consistent, but I think we probably have enough for now.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A good place to free up old swap is where do_wp_page(), or do_swap_page(),
    is about to redirty the page: the data on disk is then stale and won't be
    read again; and if we do decide to write the page out later, using the
    previous swap location makes an unnecessary disk seek very likely.

    So give can_share_swap_page() the side-effect of delete_from_swap_cache()
    when it safely can. And can_share_swap_page() was always a misleading
    name, the more so if it has a side-effect: rename it reuse_swap_page().
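
    A hedged sketch of the renamed helper with its new side-effect:

        /*
         * Can we reuse this page for the write rather than COWing it?
         * Only if we are the sole user of page and swap; in that case,
         * drop the now-stale swap cache entry before redirtying.
         */
        int reuse_swap_page(struct page *page)
        {
                int count;

                VM_BUG_ON(!PageLocked(page));
                count = page_mapcount(page);
                if (count <= 1 && PageSwapCache(page)) {
                        count += page_swapcount(page);
                        if (count == 1 && !PageWriteback(page)) {
                                delete_from_swap_cache(page);
                                SetPageDirty(page);
                        }
                }
                return count == 1;
        }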

    Irrelevant cleanup nearby: remove swap_token_default_timeout definition
    from swap.h: it's used nowhere.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • An application may rely on get_user_pages() to give it pages writable from
    userspace and shared with a driver, GUP breaking COW if necessary. It may
    mprotect() the pages' writability, off and on, from time to time.

    Normally this works fine (so long as the app does not fork); but just
    occasionally, under memory pressure, a readonly pte in a newly writable
    area is COWed unnecessarily, breaking the link with the driver: because
    do_wp_page() does trylock_page, and falls back to COW whenever that fails.

    For reliable behaviour in the unshared case, when the trylock_page fails,
    now unlock pagetable, lock page and relock pagetable, before deciding
    whether Copy-On-Write is really necessary.
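
    A hedged sketch of that locking dance in do_wp_page() (locals named as
    in mm/memory.c, details assumed):

        if (!trylock_page(old_page)) {
                page_cache_get(old_page);
                pte_unmap_unlock(page_table, ptl);   /* drop pagetable lock */
                lock_page(old_page);                 /* sleep for page lock */
                page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
                if (!pte_same(*page_table, orig_pte)) {
                        /* pte changed while unlocked: nothing more to do */
                        unlock_page(old_page);
                        page_cache_release(old_page);
                        goto unlock;
                }
                page_cache_release(old_page);
        }
        /* only now decide whether Copy-On-Write is really needed */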

    Reported-by: Zhou Yingchao
    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • do_wp_page()'s VM_FAULT_WRITE return value tells __get_user_pages() that
    COW has been done if necessary, though it may be leaving the pte without
    write permission - for the odd case of forced writing to a readonly vma
    for ptrace. At present GUP then retries the follow_page() without asking
    for write permission, to escape an endless loop when forced.

    But an application may be relying on GUP to guarantee a writable page
    which won't be COWed again when written from userspace, whereas a race
    here might leave a readonly pte in place? Change the VM_FAULT_WRITE
    handling to ask follow_page() for write permission again, except in that
    odd case of forced writing to a readonly vma.
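
    A hedged sketch of the changed VM_FAULT_WRITE handling in
    __get_user_pages():

        /*
         * Keep asking follow_page() for write permission after the COW,
         * except when ptrace is force-writing into a vma that is not
         * writable at all.
         */
        if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
                foll_flags &= ~FOLL_WRITE;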

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Moving lru_cache_add_active_or_unevictable() into page_add_new_anon_rmap()
    was good but stupid: we can and should SetPageSwapBacked() there too; and
    we know for sure that this anonymous, swap-backed page is not file cache.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always
    appear together. Save some symbol table space and some jumping around by
    removing lru_cache_add_active_or_unevictable(), folding its code into
    page_add_new_anon_rmap(): like how we add file pages to lru just after
    adding them to page cache.

    Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and
    change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as
    originally intended.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make the pte-level function in apply_to_range be called in lazy mmu mode,
    so that any pagetable modifications can be batched.
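
    A hedged sketch of the change in apply_to_pte_range():

        arch_enter_lazy_mmu_mode();
        do {
                err = fn(pte++, token, addr, data);  /* caller's pte callback */
                if (err)
                        break;
        } while (addr += PAGE_SIZE, addr != end);
        arch_leave_lazy_mmu_mode();                  /* flush batched updates */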

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Johannes Weiner
    Cc: Nick Piggin
    Cc: Venkatesh Pallipadi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     
    File pages mapped only in sequentially read mappings are perfect reclaim
    candidates.

    This patch makes these mappings behave like weak references, their pages
    will be reclaimed unless they have a strong reference from a normal
    mapping as well.

    It changes the reclaim and the unmap path where they check if the page has
    been referenced. In both cases, accesses through sequentially read
    mappings will be ignored.
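
    A hedged sketch of the check in page_referenced_one(), assuming the
    VM_SEQ_READ hint set by madvise(MADV_SEQUENTIAL):

        if (ptep_clear_flush_young_notify(vma, address, pte)) {
                /* a sequential-read mapping must not keep the page alive */
                if (likely(!(vma->vm_flags & VM_SEQ_READ)))
                        referenced++;
        }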

    Benchmark results from KOSAKI Motohiro:

    http://marc.info/?l=linux-mm&m=122485301925098&w=2

    Signed-off-by: Johannes Weiner
    Signed-off-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
    unmap-time if the pte is young has a number of problems.

    mark_page_accessed is supposed to be roughly the equivalent of a young pte
    for unmapped references. Unfortunately it doesn't come with any context:
    after being called, reclaim doesn't know who or why the page was touched.

    So calling mark_page_accessed not only adds extra lru or PG_referenced
    manipulations for pages that are already going to have pte_young ptes anyway,
    but it also adds these references which are difficult to work with from the
    context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not
    wish to contribute to the page being referenced).

    Then, simply doing SetPageReferenced when zapping a pte and finding it is
    young, is not a really good solution either. SetPageReferenced does not
    correctly promote the page to the active list for example. So after removing
    mark_page_accessed from the fault path, several mmap()+touch+munmap() would
    have a very different result from several read(2) calls for example, which
    is not really desirable.

    Signed-off-by: Nick Piggin
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

06 Jan, 2009

1 commit

    We used to have a rather schizophrenic set of checks for NULL ->i_op even
    though it had been eliminated years ago. You'd need to go out of your
    way to set it to NULL explicitly _and_ a bunch of code would die on
    such inodes anyway. After killing two remaining places that still
    did that bogosity, all that crap can go away.

    Signed-off-by: Al Viro

    Al Viro
     

31 Dec, 2008

1 commit

  • * 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits)
    stacktrace: provide save_stack_trace_tsk() weak alias
    rcu: provide RCU options on non-preempt architectures too
    printk: fix discarding message when recursion_bug
    futex: clean up futex_(un)lock_pi fault handling
    "Tree RCU": scalable classic RCU implementation
    futex: rename field in futex_q to clarify single waiter semantics
    x86/swiotlb: add default swiotlb_arch_range_needs_mapping
    x86/swiotlb: add default phys<->bus conversion
    x86: unify pci iommu setup and allow swiotlb to compile for 32 bit
    x86: add swiotlb allocation functions
    swiotlb: consolidate swiotlb info message printing
    swiotlb: support bouncing of HighMem pages
    swiotlb: factor out copy to/from device
    swiotlb: add arch hook to force mapping
    swiotlb: allow architectures to override phys<->bus<->phys conversions
    swiotlb: add comment where we handle the overflow of a dma mask on 32 bit
    rcu: fix rcutorture behavior during reboot
    resources: skip sanity check of busy resources
    swiotlb: move some definitions to header
    swiotlb: allow architectures to override swiotlb pool allocation
    ...

    Fix up trivial conflicts in
    arch/x86/kernel/Makefile
    arch/x86/mm/init_32.c
    include/linux/hardirq.h
    as per Ingo's suggestions.

    Linus Torvalds
     

20 Dec, 2008

3 commits


19 Dec, 2008

3 commits

  • Impact: Introduces new hooks, which are currently null.

    Introduce generic hooks in remap_pfn_range and vm_insert_pfn and
    corresponding copy and free routines with reserve and free tracking.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: H. Peter Anvin

    venkatesh.pallipadi@intel.com
     
  • Impact: New currently unused interface.

    Add a generic interface to follow the pfn in a pfnmap vma range. This is
    used by one of the subsequent x86 PAT related patches to keep track of
    memory types for vma regions across vma copy and free.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: H. Peter Anvin

    venkatesh.pallipadi@intel.com
     
  • Impact: Code transformation, new functions added should have no effect.

    Drivers use mmap followed by pgprot_* and remap_pfn_range or vm_insert_pfn,
    in order to export reserved memory to userspace. Currently, such mappings
    are not tracked and hence not kept consistent with other mappings
    (/dev/mem, pci resource, ioremap) for the same memory that may exist in
    the system.

    The following patchset adds x86 PAT attribute tracking and untracking for
    pfnmap related APIs.

    The first three patches in the patchset change the generic mm code to fit
    in this tracking. The last four patches are x86 specific, to make things
    work with the x86 PAT code. The patchset also introduces the
    pgprot_writecombine interface, which gives a write-combine mapping when
    enabled, falling back to pgprot_noncached otherwise.

    This patch:

    While working on x86 PAT, we faced some hurdles with tracking
    remap_pfn_range() regions, as we do not have any information to say
    whether a PFNMAP mapping is linear for the entire vma range or is
    made up of smaller-granularity regions within the vma.

    A simple solution to this is to use vm_pgoff as an indicator of a
    linear mapping over the vma region. Currently, remap_pfn_range()
    only sets vm_pgoff for COW mappings. The patch below changes the
    logic and sets vm_pgoff irrespective of COW. This will still not
    be enough for the case where the pfn is zero (vma region mapped to
    physical address zero), but for all the other cases we can look at
    pfnmap VMAs and say whether the mapping covers the entire vma region
    or not; see the sketch below.
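
    A hedged sketch of the convention (the helper shown is hypothetical,
    named here only for illustration):

        /* in remap_pfn_range(): record the base pfn unconditionally */
        vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
        vma->vm_pgoff = pfn;

        /*
         * Hypothetical helper: with the convention above, a non-zero
         * vm_pgoff on a PFNMAP vma marks a linear mapping of the whole
         * range (pfn 0 remains the known blind spot).
         */
        static inline int vma_is_linear_pfnmap(struct vm_area_struct *vma)
        {
                return (vma->vm_flags & VM_PFNMAP) && vma->vm_pgoff;
        }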

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: H. Peter Anvin

    venkatesh.pallipadi@intel.com
     

28 Oct, 2008

1 commit


21 Oct, 2008

1 commit


20 Oct, 2008

8 commits

    There are not-on-LRU pages which can be mapped, and they are not worth
    accounting (because we can't shrink them, and dirty code would be needed
    to handle the special cases). We'd like to make use of the usual
    objrmap/radix-tree protocol, and we don't want to account pages that are
    outside the VM's control.

    When special_mapping_fault() is called, page->mapping tends to be NULL,
    and the page gets charged as an anonymous page. insert_page() also
    handles some special pages from drivers.

    This patch avoids accounting such special pages.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    While page-cache charge/uncharge is done under the page lock, swap-cache
    charging isn't (an anonymous page is charged when it's newly allocated).

    This patch moves do_swap_page()'s charge() call under the lock. I don't
    see any actual problem *now*, but this fix will be good for the future,
    avoiding an unnecessarily racy state.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Rework Posix error return for mlock().

    Posix requires error code for mlock*() system calls for some conditions
    that differ from what kernel low level functions, such as
    get_user_pages(), return for those conditions. For more info, see:

    http://marc.info/?l=linux-kernel&m=121750892930775&w=2

    This patch provides the same translation of get_user_pages() error codes
    to the POSIX-specified error codes, in the context of the mlock rework
    for the unevictable LRU; see the sketch below.
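
    A hedged sketch of the translation (exact mapping assumed):

        /*
         * Translate low-level errors into what POSIX specifies for
         * mlock(2): "no valid mapping there" must be ENOMEM, "could not
         * lock it right now" must be EAGAIN.
         */
        static int __mlock_posix_error_return(long retval)
        {
                if (retval == -EFAULT)
                        retval = -ENOMEM;
                else if (retval == -ENOMEM)
                        retval = -EAGAIN;
                return retval;
        }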

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
    This change is intended to make mlock() error returns correct.
    make_pages_present() is a lower-level function used by more than mlock().
    Subsequent patch[es] will add this error return fixup in an mlock-specific
    path.

    Cc: KOSAKI Motohiro
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • In the fault paths that install new anonymous pages, check whether the
    page is evictable or not using lru_cache_add_active_or_unevictable(). If
    the page is evictable, just add it to the active lru list [via the pagevec
    cache], else add it to the unevictable list.

    This "proactive" culling in the fault path mimics the handling of mlocked
    pages in Nick Piggin's series to keep mlocked pages off the lru lists.
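
    A hedged sketch of the culling helper used by the anonymous fault
    paths:

        void lru_cache_add_active_or_unevictable(struct page *page,
                                        struct vm_area_struct *vma)
        {
                if (page_evictable(page, vma))
                        /* active: the faulting task just touched it */
                        lru_cache_add_lru(page,
                                LRU_ACTIVE + page_is_file_cache(page));
                else
                        add_page_to_unevictable_list(page);
        }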

    Notes:

    1) This patch is optional--e.g., if one is concerned about the
    additional test in the fault path. We can defer the moving of
    nonreclaimable pages until when vmscan [shrink_*_list()]
    encounters them. Vmscan will only need to handle such pages
    once, but if there are a lot of them it could impact system
    performance.

    2) The 'vma' argument to page_evictable() is required to notice that
    we're faulting a page into an mlock()ed vma w/o having to scan the
    page's rmap in the fault path. Culling mlock()ed anon pages is
    currently the only reason for this patch.

    3) We can't cull swap pages in read_swap_cache_async() because the
    vma argument doesn't necessarily correspond to the swap cache
    offset passed in by swapin_readahead(). This could [did!] result
    in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
    cull in this path.

    4) Move set_pte_at() to after where we add the page to the lru, to
    keep it hidden from other tasks that might walk the page table.
    We already do it in this order in do_anonymous_page(). And,
    these are COW'd anon pages. Is this safe?

    [riel@redhat.com: undo an overzealous code cleanup]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Make sure that mlocked pages also live on the unevictable LRU, so kswapd
    will not scan them over and over again.

    This is achieved through various strategies:

    1) add yet another page flag--PG_mlocked--to indicate that
    the page is locked for efficient testing in vmscan and,
    optionally, fault path. This allows early culling of
    unevictable pages, preventing them from getting to
    page_referenced()/try_to_unmap(). Also allows separate
    accounting of mlock'd pages, as Nick's original patch
    did.

    Note: Nick's original mlock patch used a PG_mlocked
    flag. I had removed this in favor of the PG_unevictable
    flag + an mlock_count [new page struct member]. I
    restored the PG_mlocked flag to eliminate the new
    count field.

    2) add the mlock/unevictable infrastructure to mm/mlock.c,
    with internal APIs in mm/internal.h. This is a rework
    of Nick's original patch to these files, taking into
    account that mlocked pages are now kept on unevictable
    LRU list.

    3) update vmscan.c:page_evictable() to check PageMlocked()
    and, if vma passed in, the vm_flags. Note that the vma
    will only be passed in for new pages in the fault path;
    and then only if the "cull unevictable pages in fault
    path" patch is included.

    4) add try_to_unlock() to rmap.c to walk a page's rmap and
    ClearPageMlocked() if no other vmas have it mlocked.
    Reuses as much of try_to_unmap() as possible. This
    effectively replaces the use of one of the lru list links
    as an mlock count. If this mechanism lets pages in mlocked
    vmas leak through w/o PG_mlocked set [I don't know that it
    does], we should catch them later in try_to_unmap(). One
    hopes this will be rare, as it will be relatively expensive.

    Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
    Signed-off-by: Nick Piggin

    splitlru: introduce __get_user_pages():

    The new munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
    because the current get_user_pages() can't grab PROT_NONE pages, and
    therefore PROT_NONE pages can't be munlocked; see the sketch below.
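
    A hedged sketch of how munlock's page walk might call the new
    interface (signature as assumed for this series):

        /*
         * Ask for the pages without honouring vma protections, so that
         * even a PROT_NONE region can have its pages munlocked.
         */
        int flags = GUP_FLAGS_IGNORE_VMA_PERMISSIONS;

        ret = __get_user_pages(current, mm, addr, nr_pages,
                               flags, pages, NULL);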

    [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
    [akpm@linux-foundation.org: untangle patch interdependencies]
    [akpm@linux-foundation.org: fix things after out-of-order merging]
    [hugh@veritas.com: fix page-flags mess]
    [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
    [kosaki.motohiro@jp.fujitsu.com: build fix]
    [kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
    [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Matt Mackall
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.
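
    Roughly the shape of the split (a sketch; the unevictable list from the
    later patches in this series is shown too):

        enum lru_list {
                LRU_INACTIVE_ANON,      /* anonymous, tmpfs, swap backed */
                LRU_ACTIVE_ANON,
                LRU_INACTIVE_FILE,      /* backed by a real filesystem */
                LRU_ACTIVE_FILE,
                LRU_UNEVICTABLE,        /* mlocked etc., never scanned */
                NR_LRU_LISTS
        };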

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Define page_file_cache() function to answer the question:
    is page backed by a file?

    Originally part of Rik van Riel's split-lru patch. Extracted to make
    available for other, independent reclaim patches.

    Moved inline function to linux/mm_inline.h where it will be needed by
    subsequent "split LRU" and "noreclaim" patches.

    Unfortunately this needs to use a page flag, since the PG_swapbacked state
    needs to be preserved all the way to the point where the page is last
    removed from the LRU. Trying to derive the status from other info in the
    page resulted in wrong VM statistics in earlier split VM patchsets.

    The total number of page flags in use on a 32 bit machine after this patch
    is 19.
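
    A minimal sketch of the predicate built on that flag (in-tree it is
    spelled page_is_file_cache()):

        static inline int page_is_file_cache(struct page *page)
        {
                /* file backed is simply "not swap backed" */
                return !PageSwapBacked(page);
        }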

    [akpm@linux-foundation.org: fix up out-of-order merge fallout]
    [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: MinChan Kim
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

11 Sep, 2008

1 commit


05 Aug, 2008

2 commits

    Converting the page lock to the new locking bitops requires a change of
    page flag operation naming, so we might as well convert it to something
    nicer (!TestSetPageLocked_Lock => trylock_page, SetPageLocked =>
    set_page_locked).

    This also facilitates lockdep annotation of the page lock.
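
    A hedged sketch of the renamed trylock helper:

        static inline int trylock_page(struct page *page)
        {
                /* 1 if we took PG_locked, 0 if it was already held */
                return !test_and_set_bit_lock(PG_locked, &page->flags);
        }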

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Halesh says:

    Please find below a test case provided to test mlock.

    Test Case :
    ===========================

    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void)
    {
        int fd, ret, i = 0;
        char *addr, *addr1 = NULL;
        unsigned int page_size;
        struct rlimit rlim;

        if (0 != geteuid()) {
            printf("Execute this pgm as root\n");
            exit(1);
        }

        /* create a file */
        if ((fd = open("mmap_test.c", O_RDWR|O_CREAT, 0755)) == -1) {
            printf("cant create test file\n");
            exit(1);
        }

        page_size = sysconf(_SC_PAGE_SIZE);

        /* set the MEMLOCK limit */
        rlim.rlim_cur = 2000;
        rlim.rlim_max = 2000;

        if ((ret = setrlimit(RLIMIT_MEMLOCK, &rlim)) != 0) {
            printf("Cant change limit values\n");
            exit(1);
        }

        addr = 0;
        while (1) {
            /* map a page into memory each time */
            if ((addr = (char *) mmap(addr, page_size, PROT_READ |
                    PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED) {
                printf("cant do mmap on file\n");
                exit(1);
            }

            if (0 == i)
                addr1 = addr;
            i++;
            errno = 0;
            /* lock the mapped memory pagewise */
            if ((ret = mlock((char *)addr, 1500)) == -1) {
                printf("errno value is %d\n", errno);
                printf("cant lock maped region\n");
                exit(1);
            }
            addr = addr + page_size;
        }
    }
    ======================================================

    This test case results in an mlock() failure with errno 14, that is
    EFAULT, but it has nowhere been specified that mlock() will return
    EFAULT. When I tested the same on older kernels like 2.6.18, I got the
    correct result, i.e. errno 12 (ENOMEM).

    I think that in the mlock(2) source, setting errno to ENOMEM has been
    missed in do_mlock(), on mlock_fixup() failure.

    SUSv3 requires the following behavior from mlock(2).

    [ENOMEM]
    Some or all of the address range specified by the addr and
    len arguments does not correspond to valid mapped pages
    in the address space of the process.

    [EAGAIN]
    Some or all of the memory identified by the operation could not
    be locked when the call was made.

    This rule isn't so nice and is slightly strange, but many people think
    POSIX/SUS compliance is important.

    Reported-by: Halesh Sadashiv
    Tested-by: Halesh Sadashiv
    Signed-off-by: KOSAKI Motohiro
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

02 Aug, 2008

1 commit


31 Jul, 2008

1 commit