22 Sep, 2009

15 commits

  • Move highest_memmap_pfn __read_mostly from page_alloc.c next to zero_pfn
    __read_mostly in memory.c: to help them share a cacheline, since they're
    very often tested together in vm_normal_page().
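
    A minimal sketch of the intended layout in mm/memory.c (assuming the two
    declarations simply sit next to each other so they land in one cacheline):

        unsigned long highest_memmap_pfn __read_mostly;
        unsigned long zero_pfn __read_mostly;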

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
    those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.

    Contrary to how I'd imagined it, there's nothing ugly about this, just a
    zero_pfn test built into one or another block of vm_normal_page().

    But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
    my_zero_pfn() inlines. Should the mremap move_pte() shuffling of MIPS
    ZERO_PAGEs, which we did from 2.6.17 to 2.6.19, be reinstated too? Not
    unless someone shouts for it: it would have to take vm_flags to weed out
    some cases.
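
    A hedged sketch of the generic fallbacks (MIPS overrides both, since it
    must test a range of coloured zero pages):

        static inline int is_zero_pfn(unsigned long pfn)
        {
                extern unsigned long zero_pfn;
                return pfn == zero_pfn;
        }

        static inline unsigned long my_zero_pfn(unsigned long addr)
        {
                extern unsigned long zero_pfn;
                return zero_pfn;
        }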

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • __get_user_pages() has been taking its own GUP flags, then processing
    them into FOLL flags for follow_page(). Though oddly named, the FOLL
    flags are more widely used, so pass them to __get_user_pages() now.
    Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.

    (The patch to __get_user_pages() looks peculiar, with both gup_flags
    and foll_flags: the gup_flags remain constant; but as before there's
    an exceptional case, out of scope of the patch, in which foll_flags
    per page have FOLL_WRITE masked off.)

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
    advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
    using in 2.6.24. And there were a couple of regression reports on LKML.

    Following suggestions from Linus, reinstate do_anonymous_page() use of
    the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
    with (map)count updates - let vm_normal_page() regard it as abnormal.

    Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
    most powerpc): that's not essential, but minimizes additional branches
    (keeping them in the unlikely pte_special case); and incidentally
    excludes mips (some models of which needed eight colours of ZERO_PAGE
    to avoid costly exceptions).

    Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
    callers won't want to make exceptions for it, so increment its count
    there. Changes to mlock and migration? Happily, they seem not to be
    needed.

    In most places it's quicker to check the pfn than the struct page
    address: prepare a __read_mostly zero_pfn for that. Does get_dump_page()
    still need its ZERO_PAGE check? Probably not, but keep it anyway.
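
    A hedged, simplified sketch of the resulting read-fault path in
    do_anonymous_page() (locking and the racing-pte recheck omitted):

        if (!(flags & FAULT_FLAG_WRITE)) {
                /* Map the zero page with a special pte: vm_normal_page()
                 * treats it as abnormal, so no (map)count is touched. */
                entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
                                              vma->vm_page_prot));
                set_pte_at(mm, address, page_table, entry);
                return 0;
        }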

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • do_anonymous_page() has been wrong to dirty the pte regardless.
    If it's not going to mark the pte writable, then it won't help
    to mark it dirty here, and clogs up memory with pages which will
    need swap instead of being thrown away. Especially wrong if no
    overcommit is chosen, and this vma is not yet VM_ACCOUNTed -
    we could exceed the limit and OOM despite no overcommit.
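
    A hedged sketch of the corrected logic - dirty (and make writable) the
    pte only when the vma itself is writable:

        entry = mk_pte(page, vma->vm_page_prot);
        if (vma->vm_flags & VM_WRITE)
                entry = pte_mkwrite(pte_mkdirty(entry));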

    Signed-off-by: Hugh Dickins
    Cc:
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • follow_hugetlb_page() shouldn't be guessing about the coredump case
    either: pass the foll_flags down to it, instead of just the write bit.

    Remove that obscure huge_zeropage_ok() test. The decision is easy,
    though unlike the non-huge case - here vm_ops->fault is always set.
    But we know that a fault would serve up zeroes, unless there's
    already a hugetlbfs pagecache page to back the range.

    (Alternatively, since hugetlb pages aren't swapped out under pressure,
    you could save more dump space by arguing that a page not yet faulted
    into this process cannot be relevant to the dump; but that would be
    more surprising.)

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The "FOLL_ANON optimization" and its use_zero_page() test have caused
    confusion and bugs: why does it test VM_SHARED? For the very good but
    unsatisfying reason that VMware crashed without it. As we look to maybe
    reinstating anonymous use of the ZERO_PAGE, we need to sort this out.

    Easily done: it's silly for __get_user_pages() and follow_page() to
    be guessing whether it's safe to assume that they're being used for
    a coredump (which can take a shortcut snapshot where other uses must
    handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.

    get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.
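
    A hedged sketch of how follow_page() can then answer a FOLL_DUMP request
    when there is no page - return an error only where a fault would surely
    have served up zeroes, so the dumper leaves a hole instead of faulting
    pages in:

        no_page_table:
                if ((flags & FOLL_DUMP) &&
                    (!vma->vm_ops || !vma->vm_ops->fault))
                        return ERR_PTR(-EFAULT);
                return page;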

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In preparation for the next patch, add a simple get_dump_page(addr)
    interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
    get_user_pages() directly. They're not interested in errors: they
    just want to use holes as much as possible, to save space and make
    sure that the data is aligned where the headers said it would be.

    Oh, and don't use that horrid DUMP_SEEK(off) macro!
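
    A hedged sketch of the helper as it ends up once the series' FOLL flags
    are in place (not the exact patch): any failure or hole simply returns
    NULL, which the dumpers treat as "write nothing here":

        struct page *get_dump_page(unsigned long addr)
        {
                struct vm_area_struct *vma;
                struct page *page;

                if (__get_user_pages(current, current->mm, addr, 1,
                                FOLL_FORCE | FOLL_DUMP | FOLL_GET,
                                &page, &vma) < 1)
                        return NULL;
                flush_cache_page(vma, addr, page_to_pfn(page));
                return page;
        }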

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • GUP_FLAGS_IGNORE_VMA_PERMISSIONS and GUP_FLAGS_IGNORE_SIGKILL were
    flags added solely to prevent __get_user_pages() from doing some of
    what it usually does, in the munlock case: we can now remove them.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove double negations where the operand is already boolean.
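
    Purely for illustration (a hypothetical hunk, not the actual one), the
    kind of change meant:

        bool dirty;

        dirty = !!PageDirty(page);      /* before: redundant double negation */
        dirty = PageDirty(page);        /* after: the operand is already boolean */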

    Signed-off-by: Johannes Weiner
    Cc: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Rawhide users have reported a hang at startup when cryptsetup is run:
    the same problem can be reproduced simply by running a program whose
    main() does nothing but call mlockall(MCL_CURRENT | MCL_FUTURE) and
    return, shown below.
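
    The reproducer as a complete program:

        #include <sys/mman.h>

        int main(void)
        {
                mlockall(MCL_CURRENT | MCL_FUTURE);
                return 0;
        }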

    The problem is that exit_mmap() applies munlock_vma_pages_all() to
    clean up VM_LOCKED areas, and its current implementation (stupidly)
    tries to fault in absent pages, for example where PROT_NONE prevented
    them being faulted in when mlocking. Whereas the "ksm: fix oom
    deadlock" patch, knowing there's a race by which KSM might try to fault
    in pages after exit_mmap() had finally zapped the range, backs out of
    such faults doing nothing when its ksm_test_exit() notices mm_users 0.

    So revert that part of "ksm: fix oom deadlock" which moved the
    ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
    and remove those ksm_test_exit() checks from the page fault paths, so
    allowing the munlocking to proceed without interference.

    ksm_exit, if there are rmap_items still chained on this mm slot, takes
    mmap_sem write side, thus preventing KSM from working on an mm while
    exit_mmap runs. And KSM will bail out as soon as it notices that
    mm_users is already zero, thanks to its internal ksm_test_exit checks.
    So when a task is killed by the OOM killer or by the user, KSM will not
    indefinitely prevent it from running exit_mmap to release its memory.

    This does break a part of what "ksm: fix oom deadlock" was trying to
    achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm/run), and even
    when ksmd itself has to cancel a KSM page, it is possible that the
    first OOM-kill victim would be the KSM process being faulted: then its
    memory won't be freed until a second victim has been selected (freeing
    memory for the unmerging fault to complete).

    But the OOM killer is already liable to kill a second victim once the
    intended victim's p->mm goes to NULL: so there's not much point in
    rejecting this KSM patch before fixing that OOM behaviour. It is very
    much more important to allow KSM users to boot up, than to haggle over
    an unlikely and poorly supported OOM case.

    We also intend to fix munlocking to not fault pages: at which point
    this patch _could_ be reverted; though that would be controversial, so
    we hope to find a better solution.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Justin M. Forbes
    Acked-for-now-by: Hugh Dickins
    Cc: Izik Eidus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • There's a now-obvious deadlock in KSM's out-of-memory handling:
    imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
    trying to allocate a page to break KSM in an mm which becomes the
    OOM victim (quite likely in the unmerge case): it's killed and goes
    to exit, and hangs there waiting to acquire ksm_thread_mutex.

    Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
    though that made everything else: perhaps use mmap_sem somehow?
    And part of the answer lies in the comments on unmerge_ksm_pages:
    __ksm_exit should also leave all the rmap_item removal to ksmd.

    But there's a fundamental problem, that KSM relies upon mmap_sem to
    guarantee the consistency of the mm it's dealing with, yet exit_mmap
    tears down an mm without taking mmap_sem. And bumping mm_users won't
    help at all, that just ensures that the pages the OOM killer assumes
    are on their way to being freed will not be freed.

    The best answer seems to be, to move the ksm_exit callout from just
    before exit_mmap, to the middle of exit_mmap: after the mm's pages
    have been freed (if the mmu_gather is flushed), but before its page
    tables and vma structures have been freed; and down_write,up_write
    mmap_sem there to serialize with KSM's own reliance on mmap_sem.

    But KSM then needs to be careful, whenever it downs mmap_sem, to
    check that the mm is not already exiting: there's a danger of using
    find_vma on a layout that's being torn apart, or writing into page
    tables which have been freed for reuse; and even do_anonymous_page
    and __do_fault need to check they're not being called by break_ksm
    to reinstate a pte after zap_pte_range has zapped that page table.

    Though it might be clearer to add an exiting flag, set while holding
    mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
    a zapped pte. All we need is to check whether mm_users is 0 - but we
    must remember that ksmd may detect that before __ksm_exit is reached.
    So ksm_test_exit(mm) is added to comment such checks on mm->mm_users.
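
    A hedged sketch of what such a check amounts to (assuming it simply
    reads mm_users):

        static inline bool ksm_test_exit(struct mm_struct *mm)
        {
                return atomic_read(&mm->mm_users) == 0;
        }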

    __ksm_exit now has to leave clearing up the rmap_items to ksmd, since
    that needs ksm_thread_mutex; but it shifts the exiting mm to just after
    the ksm_scan cursor so that it will soon be dealt with. __ksm_enter
    raises mm_count to hold the mm_struct, and ksmd's exit processing
    (exactly like its processing when it finds all VM_MERGEABLEs unmapped)
    mmdrops it; a similar procedure applies for KSM_RUN_UNMERGE (which has
    stopped ksmd).

    But also give __ksm_exit a fast path: when there's no complication
    (no rmap_items attached to mm and it's not at the ksm_scan cursor),
    it can safely do all the exiting work itself. This is not just an
    optimization: when ksmd is not running, the raised mm_count would
    otherwise leak mm_structs.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM will need to identify its kernel merged pages unambiguously, and
    /proc/kpageflags will probably like to do so too.

    Since KSM will only be substituting anonymous pages, statistics are best
    preserved by making a PageKsm page a special PageAnon page: one with no
    anon_vma.

    But KSM then needs its own page_add_ksm_rmap() - keep it in ksm.h near
    PageKsm; and do_wp_page() must COW them, unlike singly mapped PageAnons.
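
    A hedged sketch of the two helpers this implies for ksm.h - a PageKsm
    page is a PageAnon page whose page->mapping carries only the anon bit,
    i.e. no anon_vma pointer:

        static inline int PageKsm(struct page *page)
        {
                return ((unsigned long)page->mapping == PAGE_MAPPING_ANON);
        }

        static inline void page_add_ksm_rmap(struct page *page)
        {
                if (atomic_inc_and_test(&page->_mapcount)) {
                        page->mapping = (void *) PAGE_MAPPING_ANON;
                        __inc_zone_page_state(page, NR_ANON_PAGES);
                }
        }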

    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Signed-off-by: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_dup_rmap(), used on each mapped page when forking, was originally
    just an inline atomic_inc of mapcount. 2.6.22 added CONFIG_DEBUG_VM
    out-of-line checks to it, which would need to be ever-so-slightly
    complicated to allow for the PageKsm() we're about to define.

    But I think these checks never caught anything. And if it's coding errors
    we're worried about, such checks should be in page_remove_rmap() too, not
    just when forking; whereas if it's pagetable corruption we're worried
    about, then they shouldn't be limited to CONFIG_DEBUG_VM.

    Oh, just revert page_dup_rmap() to an inline atomic_inc of mapcount.
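
    That is, back to something like:

        static inline void page_dup_rmap(struct page *page)
        {
                atomic_inc(&page->_mapcount);
        }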

    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Signed-off-by: Izik Eidus
    Cc: Nick Piggin
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM is a Linux driver that allows identical memory pages to be shared
    dynamically between one or more processes.

    Unlike traditional page sharing, which is established when the memory is
    allocated, KSM does it dynamically after the memory has been created.
    Memory is periodically scanned; identical pages are identified and merged.

    The sharing is transparent to the processes that use it.

    KSM is highly important for hypervisors (KVM), where in production
    environments there may be many copies of the same data in host memory.
    This kind of data can be: similar kernels, libraries, caches, and so on.

    Even though KSM was written for KVM, any userspace application that
    wants to share its data can try it.

    KSM may be useful for any application that has similar (page-aligned)
    data structures in memory: KSM will find this data and merge it into one
    copy, and even if a copy is later changed and therefore copy-on-written,
    KSM will merge it again as soon as it becomes identical again.

    Another reason to consider using KSM is that it can greatly simplify the
    userspace code of applications that want to use shared private data:
    instead of the application managing a shared area, KSM does it for the
    application, and even writes to this data are allowed without any
    synchronization on the application's part.

    KSM was designed to be a loadable module that doesn't change the VM code
    of Linux.

    This patch:

    The set_pte_at_notify() macro allows setting a pte in the shadow page
    table directly, instead of flushing the shadow page table entry and then
    getting vmexit to set it. It uses a new change_pte() callback to do so.

    set_pte_at_notify() is an optimization for KVM, and for other users of
    mmu_notifiers, for COW pages. It is useful for KVM when KSM is used,
    because it allows KVM to map the KSM page into the shadow page table
    directly, at the same time as Linux maps the page into the host page
    table, instead of having to take a vmexit and only then map it.

    Users of mmu_notifiers who don't implement new mmu_notifier_change_pte()
    callback will just receive the mmu_notifier_invalidate_page() callback.
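
    A hedged sketch of the macro's shape under CONFIG_MMU_NOTIFIER (without
    it, set_pte_at_notify() would simply collapse to set_pte_at()):

        #define set_pte_at_notify(__mm, __address, __ptep, __pte)       \
        ({                                                              \
                struct mm_struct *___mm = __mm;                         \
                unsigned long ___address = __address;                   \
                pte_t ___pte = __pte;                                   \
                                                                        \
                set_pte_at(___mm, ___address, __ptep, ___pte);          \
                mmu_notifier_change_pte(___mm, ___address, ___pte);     \
        })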

    Signed-off-by: Izik Eidus
    Signed-off-by: Chris Wright
    Signed-off-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Izik Eidus
     

28 Jul, 2009

1 commit

  • mm: Pass virtual address to [__]p{te,ud,md}_free_tlb()

    Upcoming patches to support the new 64-bit "BookE" powerpc architecture
    will need to have the virtual address corresponding to the PTE page when
    freeing it, due to the way the HW table walker works.

    Basically, the TLB can be loaded with "large" pages that cover the whole
    virtual space (well, sort-of, half of it actually) represented by a PTE
    page, and which contain an "indirect" bit indicating that this TLB entry
    RPN points to an array of PTEs from which the TLB can then create direct
    entries. Thus, in order to invalidate those when PTE pages are deleted,
    we need the virtual address to pass to tlbilx or tlbivax instructions.

    The old trick of sticking it somewhere in the PTE page's struct page
    sucks too much; the address is almost readily available at all call
    sites, and almost everybody implements these as macros, so we may as
    well add the argument everywhere. I added it to the pmd and pud variants
    for consistency.
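
    A hedged sketch of what a call site and the asm-generic wrapper look
    like once the address is threaded through:

        /* caller, e.g. in free_pte_range(): */
        pte_free_tlb(tlb, token, addr);

        /* asm-generic wrapper gaining the extra argument: */
        #define pte_free_tlb(tlb, ptep, address)                \
                do {                                            \
                        tlb->need_flush = 1;                    \
                        __pte_free_tlb(tlb, ptep, address);     \
                } while (0)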

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: David Howells [MN10300 & FRV]
    Acked-by: Nick Piggin
    Acked-by: Martin Schwidefsky [s390]
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     

24 Jun, 2009

2 commits

  • If a kthread happens to use get_user_pages() on an mm (as KSM does),
    there's a chance that it will end up trying to read in a swap page, then
    oops in grab_swap_token() because the kthread has no mm: GUP passes down
    the right mm, so grab_swap_token() ought to be using it.

    We have not identified a stronger case than KSM's daemon (not yet in
    mainline), but the issue must have come up before, since RHEL has included
    a fix for this for years (though a different fix, they just back out of
    grab_swap_token if current->mm is unset: which is what we first proposed,
    but using the right mm here seems more correct).
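
    A hedged sketch of the shape of the change: grab_swap_token() now takes
    the mm explicitly, and do_swap_page() passes the mm being faulted rather
    than relying on current->mm:

        void grab_swap_token(struct mm_struct *mm);  /* was: grab_swap_token(void) */

        /* in do_swap_page(): */
        grab_swap_token(mm);    /* safe even when current is a kthread with no mm */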

    Reported-by: Izik Eidus
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Indeed FOLL_WRITE matches FAULT_FLAG_WRITE, matches GUP_FLAGS_WRITE,
    and it's tempting to devise a set of Grand Unified Paging flags;
    but not today. So until then, let's rely upon the compiler to spot
    the coincidence, "rather than have that subtle dependency and a
    comment for it" - as you remarked in another context yesterday.

    Signed-off-by: Hugh Dickins
    Acked-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

22 Jun, 2009

2 commits

  • This allows the callers to now pass down the full set of FAULT_FLAG_xyz
    flags to handle_mm_fault(). All callers have been (mechanically)
    converted to the new calling convention; there's almost certainly room
    for architectures to clean up their code and then add FAULT_FLAG_RETRY
    when that support is added.
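
    A hedged sketch of the typical caller-side change in an arch fault
    handler:

        /* before: */
        fault = handle_mm_fault(mm, vma, address, write);

        /* after the mechanical conversion: */
        fault = handle_mm_fault(mm, vma, address,
                                write ? FAULT_FLAG_WRITE : 0);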

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The fault handling routines really want more fine-grained flags than a
    single "was it a write fault" boolean - the callers will want to set
    flags like "you can return a retry error" etc.

    And that's actually how the VM works internally, but right now the
    top-level fault handling functions in mm/memory.c all pass just the
    'write_access' boolean around.

    This switches them over to pass around the FAULT_FLAG_xyzzy 'flags'
    variable instead. The 'write_access' calling convention still exists
    for the exported 'handle_mm_fault()' function, but that is next.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Jun, 2009

4 commits

  • Analogous to follow_phys(), add a helper that looks up the PFN at a
    user virtual address in an IO mapping or a raw PFN mapping.

    Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Acked-by: Magnus Damm
    Cc: Hans Verkuil
    Cc: Paul Mundt
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Acked-by: Magnus Damm
    Cc: Hans Verkuil
    Cc: Paul Mundt
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A generic read-only page table lookup helper that maps an address space
    and an address within it to a pte.
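
    A hedged sketch of the helper's shape, and of how a caller such as the
    follow_pfn() above might use it - follow_pte() returns with the pte
    mapped and its lock held:

        static int follow_pte(struct mm_struct *mm, unsigned long address,
                              pte_t **ptepp, spinlock_t **ptlp);

        /* e.g. inside follow_pfn(vma, address, &pfn): */
        pte_t *ptep;
        spinlock_t *ptl;

        if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
                return -EINVAL;         /* no pte to report */
        *pfn = pte_pfn(*ptep);
        pte_unmap_unlock(ptep, ptl);
        return 0;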

    Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Acked-by: Magnus Damm
    Cc: Hans Verkuil
    Cc: Paul Mundt
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Move more documentation for get_user_pages_fast into the new kerneldoc comment.
    Add some comments for get_user_pages as well.

    Also, move get_user_pages_fast declaration up to get_user_pages. It wasn't
    there initially because it was once a static inline function.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Nick Piggin
    Cc: Andy Grover
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

03 May, 2009

2 commits

  • Change page_mkwrite to allow implementations to return with the page
    locked, and also change its callers (in page fault paths) to hold the
    lock until the page is marked dirty. This allows the filesystem to have
    full control of page dirtying events coming from the VM.

    Rather than simply hold the page locked over the page_mkwrite call, we
    call page_mkwrite with the page unlocked and allow callers to return with
    it locked, so filesystems can avoid LOR conditions with page lock.

    The problem with the current scheme is this: a filesystem that wants to
    associate some metadata with a page as long as the page is dirty, will
    perform this manipulation in its ->page_mkwrite. It currently then must
    return with the page unlocked and may not hold any other locks (according
    to existing page_mkwrite convention).

    In this window, the VM could write out the page, clearing page-dirty. The
    filesystem has no good way to detect that a dirty pte is about to be
    attached, so it will happily write out the page, at which point, the
    filesystem may manipulate the metadata to reflect that the page is no
    longer dirty.

    It is not always possible to perform the required metadata manipulation in
    ->set_page_dirty, because that function cannot block or fail. The
    filesystem may need to allocate some data structure, for example.

    And the VM cannot mark the pte dirty before page_mkwrite, because
    page_mkwrite is allowed to fail, so we must not allow any window where the
    page could be written to if page_mkwrite does fail.

    This solution of holding the page locked over the 3 critical operations
    (page_mkwrite, setting the pte dirty, and finally setting the page dirty)
    closes out races nicely, preventing page cleaning for writeout being
    initiated in that window. This provides the filesystem with a strong
    synchronisation against the VM here.
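
    A hedged sketch of a filesystem's ->page_mkwrite() under the new
    convention (example_page_mkwrite is a hypothetical hook; it assumes the
    VM_FAULT_LOCKED return described here): lock the page, revalidate it
    against truncation, do the metadata work, and return with the page still
    locked so the caller can set the pte and the page dirty before unlocking:

        static int example_page_mkwrite(struct vm_area_struct *vma,
                                        struct vm_fault *vmf)
        {
                struct page *page = vmf->page;

                lock_page(page);
                if (page->mapping != vma->vm_file->f_mapping) {
                        unlock_page(page);      /* truncated under us */
                        return VM_FAULT_NOPAGE; /* let the fault be retried */
                }
                /* ... allocate/attach per-page dirty metadata here ... */
                return VM_FAULT_LOCKED;         /* page stays locked for the caller */
        }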

    - Sage needs this race closed for ceph filesystem.
    - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
    - I need it for fsblock.
    - I suspect other filesystems may need it too (eg. btrfs).
    - I have converted buffer.c to the new locking. Even simple block allocation
    under dirty pages might be susceptible to i_size changing under partial page
    at the end of file (we also have a buffer.c-side problem here, but it cannot
    be fixed properly without this patch).
    - Other filesystems (eg. NFS, maybe btrfs) will need to change their
    page_mkwrite functions themselves.

    [ This also moves page_mkwrite another step closer to fault, which should
    eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
    filesystem calldown and page lock/unlock cycle in __do_fault. ]

    [akpm@linux-foundation.org: fix derefs of NULL ->mapping]
    Cc: Sage Weil
    Cc: Trond Myklebust
    Signed-off-by: Nick Piggin
    Cc: Valdis Kletnieks
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • By the time the memory cgroup code is notified about a swapin we
    already hold a reference on the fault page.

    If the cgroup callback fails, make sure to unlock AND release the page
    reference which was taken by lookup_swap_cache(), or we leak the reference.
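
    A hedged sketch of the corrected error path in do_swap_page():

        if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
                ret = VM_FAULT_OOM;
                goto out_page;          /* not just "goto out" */
        }
        /* ... normal mapping path ... */
        out_page:
                unlock_page(page);
        out_release:
                page_cache_release(page);  /* drop the lookup_swap_cache() reference */
                return ret;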

    Signed-off-by: Johannes Weiner
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Apr, 2009

3 commits

  • Change the page_mkwrite prototype to take a struct vm_fault, and return
    VM_FAULT_xxx flags. There should be no functional change.

    This makes it possible to return much more detailed error information to
    the VM (and also can provide more information eg. virtual_address to the
    driver, which might be important in some special cases).

    This is required for a subsequent fix. And will also make it easier to
    merge page_mkwrite() with fault() in future.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Trond Myklebust
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Artem Bityutskiy
    Cc: Felix Blyakher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • At first look, mark_page_accessed() in follow_page() seems a bit strange:
    it seems pte_mkyoung() would be more consistent with other kernel code.

    However, it is intentional. The commit log said:

    ------------------------------------------------
    commit 9e45f61d69be9024a2e6bef3831fb04d90fac7a8
    Author: akpm
    Date: Fri Aug 15 07:24:59 2003 +0000

    [PATCH] Use mark_page_accessed() in follow_page()

    Touching a page via follow_page() counts as a reference so we should be
    either setting the referenced bit in the pte or running mark_page_accessed().

    Altering the pte is tricky because we haven't implemented an atomic
    pte_mkyoung(). And mark_page_accessed() is better anyway because it has more
    aging state: it can move the page onto the active list.

    BKrev: 3f3c8acbplT8FbwBVGtth7QmnqWkIw
    ------------------------------------------------

    The atomic issue is still true today; adding a comment helps to make the
    code's intention clear.

    [akpm@linux-foundation.org: clarify text]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • commit bf3f3bc5e734706730c12a323f9b2068052aa1f0 (mm: don't
    mark_page_accessed in fault path) only removed the mark_page_accessed()
    call in filemap_fault().

    Therefore, swap-backed pages and file-backed pages have inconsistent
    behavior. mark_page_accessed() should be removed from do_swap_page().

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

13 Mar, 2009

1 commit

  • Impact: fix false positive PAT warnings - also fix VirtualBox hang

    Use of vma->vm_pgoff to identify the pfnmaps that are fully
    mapped at mmap time is broken. vm_pgoff is set by generic mmap
    code even for cases where drivers are setting up the mappings
    at the fault time.

    The problem was originally reported here:

    http://marc.info/?l=linux-kernel&m=123383810628583&w=2

    Change is_linear_pfn_mapping logic to overload VM_INSERTPAGE
    flag along with VM_PFNMAP to mean full PFNMAP setup at mmap
    time.

    Problem also tracked at:

    http://bugzilla.kernel.org/show_bug.cgi?id=12800

    Reported-by: Thomas Hellstrom
    Tested-by: Frans Pop
    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Cc: Nick Piggin
    Cc: "ebiederm@xmission.com"
    Cc: # only for 2.6.29.1, not .28
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Pallipadi, Venkatesh
     

06 Feb, 2009

1 commit

  • Fix do_wp_page for VM_MIXEDMAP mappings.

    In the case where pfn_valid returns 0 for a pfn at the beginning of
    do_wp_page and the mapping is not shared writable, the code branches to
    label `gotten:' with old_page == NULL.

    In case the vma is locked (vma->vm_flags & VM_LOCKED), lock_page,
    clear_page_mlock, and unlock_page try to access the old_page.

    This patch checks whether old_page is valid before it is dereferenced.

    The regression was introduced by "mlock: mlocked pages are unevictable"
    (commit b291f000393f5a0b679012b39d79fbc85c018233).

    Signed-off-by: Carsten Otte
    Cc: Nick Piggin
    Cc: Heiko Carstens
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     

14 Jan, 2009

2 commits

  • Impact: cleanup

    Change the protection parameter for track_pfn_vma_new() into a pgprot_t
    pointer. A subsequent patch changes the x86 PAT handling to return a
    compatible memtype in the pgprot_t, if what was requested cannot be
    allowed due to conflicts. No functionality change in this patch.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar

    venkatesh.pallipadi@intel.com
     
  • Impact: fix (harmless) double-free of memtype entries and avoid warning

    On track_pfn_vma_new() failure, reset the vm_flags so that there will be
    no second cleanup happening when upper level routines call unmap_vmas().

    This patch fixes part of the bug reported here:

    http://marc.info/?l=linux-kernel&m=123108883716357&w=2

    Specifically, the error message:

        X:5010 freeing invalid memtype d0000000-d0101000

    is due to multiple frees on the error path, and will not happen with the
    patch below.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar

    venkatesh.pallipadi@intel.com
     

12 Jan, 2009

1 commit

  • Some code (nfs/sunrpc) uses socket ops on kernel memory while holding
    the mmap_sem. This is safe because kernel memory doesn't get paged out,
    so we'll never actually fault; but the might_fault() annotations then
    generate false positives.
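
    A hedged sketch of the fix this implies in might_fault(), assuming the
    check is simply an early return for KERNEL_DS accesses:

        void might_fault(void)
        {
                /* Kernel memory doesn't get paged out, so it never faults:
                 * skip the annotation to avoid the false positives above. */
                if (segment_eq(get_fs(), KERNEL_DS))
                        return;

                might_sleep();
                if (current->mm)
                        might_lock_read(&current->mm->mmap_sem);
        }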

    Reported-by: "J. Bruce Fields"
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Jan, 2009

3 commits

  • Fix swapin charge operation of memcg.

    Now, memcg has hooks into the swap-out operation and checks whether a
    SwapCache page is really unused or not. That check depends on the
    contents of struct page, i.e. if PageAnon(page) && page_mapped(page),
    the page is recognized as still in use.

    Now, reuse_swap_page() calls delete_from_swap_cache() before any rmap is
    established. Then, in the following sequence:

    (Page fault with WRITE)
    try_charge() (charge += PAGESIZE)
    commit_charge() (Check page_cgroup is used or not..)
    reuse_swap_page()
    -> delete_from_swapcache()
    -> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE)
    ......
    New charge is uncharged soon....
    To avoid this, move commit_charge() after page_mapcount() goes up to 1.
    By this,

    try_charge() (usage += PAGESIZE)
    reuse_swap_page() (may usage -= PAGESIZE if PCG_USED is set)
    commit_charge() (If page_cgroup is not marked as PCG_USED,
    add new charge.)
    Accounting will be correct.

    Changelog (v2) -> (v3)
    - fixed invalid charge to swp_entry==0.
    - updated documentation.
    Changelog (v1) -> (v2)
    - fixed comment.

    [nishimura@mxp.nes.nec.co.jp: swap accounting leak doc fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Daisuke Nishimura
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch, changed the
    gfp_mask of callers of charge to GFP_HIGHUSER_MOVABLE, to show what will
    happen at memory reclaim.

    But in recent discussion it was NACKed because it sounds ugly.

    This patch reverts it and adds some cleanup to the gfp_mask of callers
    of charge. No behavior change, but it needs review before it generates
    conflicting hunks deeper in the queue.

    This patch also adds an explanation of the meaning of the gfp_mask
    passed to the charge functions in memcontrol.h.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch implements a per-cgroup limit on the usage of memory+swap.
    For SwapCache, double counting of the swap-cache and the swap-entry is
    avoided.

    The Mem+Swap controller works as follows:
    - memory usage is limited by memory.limit_in_bytes.
    - memory + swap usage is limited by memory.memsw.limit_in_bytes.

    This has the following benefit:
    - A user can limit the total resource usage of mem+swap.

    Without this, because the memory resource controller doesn't take care
    of swap usage, a process can exhaust all the swap (by a memory leak);
    with this, we can avoid that case.

    And swap is a shared resource, but it cannot be reclaimed (returned to
    memory) until it's used. This characteristic can be trouble when memory
    is divided into parts by cpuset or memcg.
    Assume group A and group B.
    After some applications have run, the system can be:

    Group A -- very large free memory space, but occupying 99% of swap.
    Group B -- under memory shortage, but cannot use swap... it's nearly full.

    The ability to set an appropriate swap limit for each group is required.

    Maybe someone will wonder, "why mem+swap rather than just swap?"

    - The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
    moving the account from memory to swap... there is no change in the
    usage of mem+swap.

    In other words, when we want to limit the usage of swap without affecting
    the global LRU, a mem+swap limit is better than just limiting swap.

    Accounting target information is stored in swap_cgroup, which is a
    per-swap-entry record.

    Charging is done as follows:
    map
    - charge page and memsw.

    unmap
    - uncharge page/memsw if not SwapCache.

    swap-out (__delete_from_swap_cache)
    - uncharge page
    - record mem_cgroup information to swap_cgroup.

    swap-in (do_swap_page)
    - charge as page and memsw; the record in swap_cgroup is cleared and
    the memsw accounting is decremented.

    swap-free (swap_free())
    - if swap entry is freed, memsw is uncharged by PAGE_SIZE.

    There are people who work in never-swap environments and consider swap
    as something bad. For such people, this mem+swap controller extension is
    just overhead, and that overhead can be avoided by a config or boot
    option (see Kconfig; the details are not in this patch).

    TODO:
    - maybe more optimization can be done in the swap-in path (but it's not
    very safe). For now we just do simple accounting at this stage.

    [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
    [hugh@veritas.com: memswap controller core swapcache fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki