30 Oct, 2005

40 commits

  • check_user_page_readable is a problematic variant of follow_page. It's used
    only by oprofile's i386 and arm backtrace code, at interrupt time, to
    establish whether a userspace stackframe is currently readable.

    This is problematic, because we want to push the page_table_lock down inside
    follow_page, and later split it; whereas oprofile is doing a spin_trylock on
    it (in the i386 case, forgotten in the arm case), and needs that to pin
    perhaps two pages spanned by the stackframe (which might be covered by
    different locks when we split).

    I think oprofile is going about this in the wrong way: it doesn't need to know
    the area is readable (neither i386 nor arm uses read protection of user
    pages), it doesn't need to pin the memory, it should simply
    __copy_from_user_inatomic, and see if that succeeds or not. Sorry, but I've
    not got around to devising the sparse __user annotations for this.

    Then we can eliminate check_user_page_readable, and return to a single
    follow_page without the __follow_page variants.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • rmap's page_check_address descend without page_table_lock. First just
    pte_offset_map in case there's no pte present worth locking for, then take
    page_table_lock for the full check, and pass ptl back to caller in the same
    style as pte_offset_map_lock. __xip_unmap, page_referenced_one and
    try_to_unmap_one use pte_unmap_unlock. try_to_unmap_cluster also.

    page_check_address reformatted to avoid progressive indentation. No use is
    made of its one error code, return NULL when it fails.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Small fix to the PageReserved patch: the mips ZERO_PAGE(address) depends on
    address, so __xip_unmap is wrong to initialize page with that before address
    is initialized; and in fact must re-evaluate it each iteration.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove the page_table_lock from around the calls to unmap_vmas, and replace
    the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
    now safe to descend without page_table_lock.

    Don't attempt fancy locking for hugepages, just take page_table_lock in
    unmap_hugepage_range. Which makes zap_hugepage_range, and the hugetlb test in
    zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor
    does unmap_vmas have much use for its mm arg now.

    The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
    page_table_lock: if they're implemented at all, they typically come down to
    flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
    (which we already audited for the mprotect case).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In most places the descent from pgd to pud to pmd to pte holds mmap_sem
    (exclusively or not), which ensures that free_pgtables cannot be freeing page
    tables from any level at the same time. But truncation and reverse mapping
    descend without mmap_sem.

    No problem: just make sure that a vma is unlinked from its prio_tree (or
    nonlinear list) and from its anon_vma list, after zapping the vma, but before
    freeing its page tables. Then neither vmtruncate nor rmap can reach that vma
    whose page tables are now volatile (nor do they need to reach it, since all
    its page entries have been zapped by this stage).

    The i_mmap_lock and anon_vma->lock already serialize this correctly; but the
    locking hierarchy is such that we cannot take them while holding
    page_table_lock. Well, we're trying to push that down anyway. So in this
    patch, move anon_vma_unlink and unlink_file_vma into free_pgtables, at the
    same time as moving page_table_lock around calls to unmap_vmas.

    tlb_gather_mmu and tlb_finish_mmu then fall outside the page_table_lock, but
    we made them preempt_disable and preempt_enable earlier; and a long source
    audit of all the architectures has shown no problem with removing
    page_table_lock from them. free_pgtables doesn't need page_table_lock for
    itself, nor for what it calls; tlb->mm->nr_ptes is usually protected by
    page_table_lock, but partly by non-exclusive mmap_sem - here it's decremented
    with exclusive mmap_sem, or mm_users 0. update_hiwater_rss and
    vm_unacct_memory don't need page_table_lock either.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There was one small but very significant change in the previous patch:
    mprotect's flush_tlb_range fell outside the page_table_lock: as it is in 2.4,
    but that doesn't prove it safe in 2.6.

    On some architectures flush_tlb_range comes to the same as flush_tlb_mm, which
    has always been called from outside page_table_lock in dup_mmap, and is so
    proved safe. Others required a deeper audit: I could find no reliance on
    page_table_lock in any; but in ia64 and parisc found some code which looks a
    bit as if it might want preemption disabled. That won't do any actual harm,
    so pending a decision from the maintainers, disable preemption there.

    Remove comments on page_table_lock from flush_tlb_mm, flush_tlb_range and
    flush_tlb_page entries in cachetlb.txt: they were rather misleading (what
    generic code does is different from what usually happens), the rules are now
    changing, and it's not yet clear where we'll end up (will the generic
    tlb_flush_mmu happen always under lock? never under lock? or sometimes under
    and sometimes not?).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Convert those common loops using page_table_lock on the outside and
    pte_offset_map within to use just pte_offset_map_lock within instead.

    These all hold mmap_sem (some exclusively, some not), so at no level can a
    page table be whipped away from beneath them. But whereas pte_alloc loops
    tested with the "atomic" pmd_present, these loops are testing with pmd_none,
    which on i386 PAE tests both lower and upper halves.

    That's now unsafe, so add a cast into pmd_none to test only the vital lower
    half: we lose a little sensitivity to a corrupt middle directory, but not
    enough to worry about. It appears that i386 and UML were the only
    architectures vulnerable in this way, and pgd and pud no problem.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
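
A minimal userspace sketch of the pmd_none change described in the commit above. It assumes the danger is a reader seeing the two 32-bit halves of a 64-bit entry from different moments. pmd_t and pmd_none come from the commit message; pmd_none_full and pmd_none_low are hypothetical names for the old and new tests, and the bit values are made up for illustration.

```c
#include <stdint.h>

/* A 64-bit middle-directory entry, as on i386 PAE. */
typedef struct { uint64_t pmd; } pmd_t;

/* Old-style test: examines all 64 bits, which takes two 32-bit loads on i386. */
static int pmd_none_full(pmd_t x)
{
    return x.pmd == 0;
}

/* New-style test: the cast keeps only the lower half, readable in one load. */
static int pmd_none_low(pmd_t x)
{
    return (uint32_t)x.pmd == 0;
}
```

With an entry whose upper half is set but whose lower half is still zero, pmd_none_full reports "populated" while pmd_none_low reports "none". The same property is why a corrupt entry with garbage only in the upper half goes unnoticed, which is the small loss of sensitivity the message mentions.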
     
  • On the page fault path, the patch before last pushed acquiring the
    page_table_lock down to the head of handle_pte_fault (though it's also taken
    and dropped earlier when a new page table has to be allocated).

    Now delete that line, read "entry = *pte" without it, and go off to this or
    that page fault handler on the basis of this unlocked peek. Usually the
    handler can proceed without the lock, relying on the subsequent locked
    pte_same or pte_none test to back out when necessary; though do_wp_page needs
    the lock immediately, and do_file_page doesn't check (if there's a race,
    install_page just zaps the entry and reinstalls it).

    But on those architectures (notably i386 with PAE) whose pte is too big to be
    read atomically, if SMP or preemption is enabled, do_swap_page and
    do_file_page might cause irretrievable damage if passed a Frankenstein entry
    stitched together from unrelated parts. In those configs, "pte_unmap_same"
    has to take page_table_lock, validate orig_pte still the same, and drop
    page_table_lock before unmapping, before proceeding.

    Use pte_offset_map_lock and pte_unmap_unlock throughout the handlers; but lock
    avoidance leaves more lone maps and unmaps than elsewhere.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
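
The "unlocked peek, then locked pte_same test" pattern from the commit above can be sketched in userspace roughly as follows. This is an illustration under assumptions, not the kernel code: struct pt_sketch, its toy spinlock ptl, and install_if_unchanged are hypothetical, and only pte_same is a name taken from the message.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct { uint64_t val; } pte_t;

/* Hypothetical stand-in for one page table entry plus the lock guarding it. */
struct pt_sketch {
    atomic_flag ptl;   /* toy spinlock */
    pte_t pte;
};

static int pte_same(pte_t a, pte_t b)
{
    return a.val == b.val;
}

/*
 * The caller peeked at the entry without the lock (orig_pte) and did its
 * slow work unlocked. Install new_pte only if the entry is still what was
 * seen; otherwise back out. Returns 1 if installed, 0 if a race was lost.
 */
static int install_if_unchanged(struct pt_sketch *pt, pte_t orig_pte, pte_t new_pte)
{
    int installed = 0;

    while (atomic_flag_test_and_set_explicit(&pt->ptl, memory_order_acquire))
        ;   /* spin until the lock is ours */
    if (pte_same(pt->pte, orig_pte)) {
        pt->pte = new_pte;
        installed = 1;
    }
    atomic_flag_clear_explicit(&pt->ptl, memory_order_release);
    return installed;
}
```

The sketch covers only the case where the entry can be read in one piece. Where it cannot, the message says the comparison has to happen under the lock before the peeked value is acted on at all.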
     
  • Convert those few architectures which are calling pud_alloc, pmd_alloc,
    pte_alloc_map on a user mm, not to take the page_table_lock first, nor drop it
    after. Each of these can continue to use pte_alloc_map, no need to change
    over to pte_alloc_map_lock, they're neither racy nor swappable.

    In the sparc64 io_remap_pfn_range, flush_tlb_range then falls outside of the
    page_table_lock: that's okay, on sparc64 it's like flush_tlb_mm, and that has
    always been called from outside of page_table_lock in dup_mmap.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Second step in pushing down the page_table_lock. Remove the temporary
    bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
    to hold page_table_lock, whether it's on init_mm or a user mm; take
    page_table_lock internally to check if a racing task already allocated.

    Convert their callers from common code. But avoid coming back to change them
    again later: instead of moving the spin_lock(&mm->page_table_lock) down,
    switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
    encapsulate the mapping+locking and unlocking+unmapping together, and in the
    end may use alternatives to the mm page_table_lock itself.

    These callers all hold mmap_sem (some exclusively, some not), so at no level
    can a page table be whipped away from beneath them; and pte_alloc uses the
    "atomic" pmd_present to test whether it needs to allocate. It appears that on
    all arches we can safely descend without page_table_lock.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
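
A rough userspace sketch of "take page_table_lock internally to check if a racing task already allocated", as described in the commit above. All names here (struct mm_sketch, ptl_lock, ptl_unlock, pte_alloc_slow, pte_alloc) are hypothetical stand-ins, the spinlock is a toy built on C11 atomics, and calloc stands in for page table allocation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct mm_sketch {
    atomic_flag page_table_lock;   /* toy spinlock */
    void *_Atomic pmd_entry;       /* NULL means no page table here yet */
    int nr_ptes;                   /* protected by page_table_lock */
};

static void ptl_lock(struct mm_sketch *mm)
{
    while (atomic_flag_test_and_set_explicit(&mm->page_table_lock,
                                             memory_order_acquire))
        ;
}

static void ptl_unlock(struct mm_sketch *mm)
{
    atomic_flag_clear_explicit(&mm->page_table_lock, memory_order_release);
}

/* Slow path: the caller does NOT hold the lock. Returns 0, or -1 if out of memory. */
static int pte_alloc_slow(struct mm_sketch *mm)
{
    void *table = calloc(1, 4096);   /* allocate before taking the lock */

    if (!table)
        return -1;
    ptl_lock(mm);
    if (atomic_load(&mm->pmd_entry)) {
        free(table);                 /* a racing task already allocated */
    } else {
        atomic_store(&mm->pmd_entry, table);
        mm->nr_ptes++;
    }
    ptl_unlock(mm);
    return 0;
}

/* Fast path: test first, call the slow path only when allocation is needed. */
static void *pte_alloc(struct mm_sketch *mm)
{
    void *table;

    if (!atomic_load(&mm->pmd_entry) && pte_alloc_slow(mm) < 0)
        return NULL;
    table = atomic_load(&mm->pmd_entry);
    assert(table != NULL);           /* once populated, never cleared in this sketch */
    return table;
}
```

Calling pte_alloc_slow a second time models losing the race: the spare table is freed and nr_ptes stays at 1.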
     
  • It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only
    calling out-of-line __pud_alloc or __pmd_alloc if allocation needed,
    pte_alloc_map and pte_alloc_kernel are entirely out-of-line. Though it does
    add a little to kernel size, change them to macros testing inline, calling
    __pte_alloc or __pte_alloc_kernel to allocate out-of-line. Mark none of them
    as fastcalls, leave that to CONFIG_REGPARM or not.

    It also seems more natural for the out-of-line functions to leave the offset
    calculation and map to the inline, which has to do it anyway for the common
    case. At least mremap move wants __pte_alloc without _map.

    Macros rather than inline functions, certainly to avoid the header file issues
    which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any
    architectures I haven't built would have other such problems.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • First step in pushing down the page_table_lock. init_mm.page_table_lock has
    been used throughout the architectures (usually for ioremap): not to serialize
    kernel address space allocation (that's usually vmlist_lock), but because
    pud_alloc, pmd_alloc, pte_alloc_kernel expect the caller to hold it.

    Reverse that: don't lock or unlock init_mm.page_table_lock in any of the
    architectures; instead rely on pud_alloc, pmd_alloc, pte_alloc_kernel to take
    and drop it when allocating a new one, to check lest a racing task already
    did. Similarly no page_table_lock in vmalloc's map_vm_area.

    Some temporary ugliness in __pud_alloc and __pmd_alloc: since they also handle
    user mms, which are converted only by a later patch, for now they have to lock
    differently according to whether or not it's init_mm.

    If sources get muddled, there's a danger that an arch source taking
    init_mm.page_table_lock will be mixed with common source also taking it (or
    neither take it). So break the rules and make another change, which should
    break the build for such a mismatch: remove the redundant mm arg from
    pte_alloc_kernel (ppc64 scrapped its distinct ioremap_mm in 2.6.13).

    Exceptions: arm26 used pte_alloc_kernel on user mm, now pte_alloc_map; ia64
    used pte_alloc_map on init_mm, now pte_alloc_kernel; parisc had bad args to
    pmd_alloc and pte_alloc_kernel in unused USE_HPPA_IOREMAP code; ppc64
    map_io_page forgot to unlock on failure; ppc mmu_mapin_ram and ppc64 im_free
    took page_table_lock for no good reason.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • ia64 has expand_backing_store function for growing its Register Backing Store
    vma upwards. But more complete code for this purpose is found in the
    CONFIG_STACK_GROWSUP part of mm/mmap.c. Uglify its #ifdefs further to provide
    expand_upwards for ia64 as well as expand_stack for parisc.

    The Register Backing Store vma should be marked VM_ACCOUNT. Implement the
    intention of growing it only a page at a time, instead of passing an address
    outside of the vma to handle_mm_fault, with unknown consequences.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Slight and timid rearrangement of mm_struct: hiwater_rss and hiwater_vm were
    tacked on the end, but it seems better to keep them near _file_rss, _anon_rss
    and total_vm, in the same cacheline on those arches verified.

    There are likely to be more profitable rearrangements, but less obvious (is it
    good or bad that saved_auxv[AT_VECTOR_SIZE] isolates cpu_vm_mask and context
    from many others?), needing serious instrumentation.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • update_mem_hiwater has attracted various criticisms, in particular from those
    concerned with mm scalability. Originally it was called whenever rss or
    total_vm got raised. Then many of those callsites were replaced by a timer
    tick call from account_system_time. Now Frank van Maarseveen reports that to
    be found inadequate. How about this? Works for Frank.

    Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
    update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
    mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
    by 1): those are hot paths. Do the opposite, update only when about to lower
    rss (usually by many), or just before final accounting in do_exit. Handle
    mm->hiwater_vm in the same way, though it's much less of an issue. Demand
    that whoever collects these hiwater statistics do the work of taking the
    maximum with rss or total_vm.

    And there has been no collector of these hiwater statistics in the tree. The
    new convention needs an example, so match Frank's usage by adding a VmPeak
    line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
    (High-Water-Mark or High-Water-Memory).

    There was a particular anomaly during mremap move, that hiwater_vm might be
    captured too high. A fleeting such anomaly remains, but it's quickly
    corrected now, whereas before it would stick.

    What locking? None: if the app is racy then these statistics will be racy,
    it's not worth any overhead to make them exact. But whenever it suits,
    hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
    page_table_lock (for now) or with preemption disabled (later on): without
    going to any trouble, minimize the time between reading current values and
    updating, to minimize those occasions when a racing thread bumps a count up
    and back down in between.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
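
A minimal userspace sketch of the convention described in the update_hiwater commit above: the peak is recorded only just before rss is lowered, and whoever reports it takes the maximum with the live value. update_hiwater_rss and hiwater_rss are named in the message; struct mm_sketch, fault_in_page, zap_pages and report_hwm are hypothetical, and the body of update_hiwater_rss is an assumption about what the macro does.

```c
#include <assert.h>

/* Hypothetical stand-in for the few mm_struct fields involved. */
struct mm_sketch {
    unsigned long rss;           /* current resident pages */
    unsigned long hiwater_rss;   /* recorded peak, allowed to lag behind rss */
};

/* Record the peak only when it is about to be lost. */
static void update_hiwater_rss(struct mm_sketch *mm)
{
    if (mm->hiwater_rss < mm->rss)
        mm->hiwater_rss = mm->rss;
}

/* Hot path: raising rss by one does not touch hiwater_rss. */
static void fault_in_page(struct mm_sketch *mm)
{
    mm->rss++;
}

/* Colder path: capture the peak, then lower rss by many pages at once. */
static void zap_pages(struct mm_sketch *mm, unsigned long nr)
{
    assert(nr <= mm->rss);
    update_hiwater_rss(mm);
    mm->rss -= nr;
}

/* The collector does the remaining work: maximum of recorded peak and live value. */
static unsigned long report_hwm(const struct mm_sketch *mm)
{
    return mm->hiwater_rss > mm->rss ? mm->hiwater_rss : mm->rss;
}
```

After ten calls to fault_in_page, hiwater_rss is still 0 but report_hwm returns 10. The stored field catches up only when zap_pages runs.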
     
  • There used to be just one call to zap_pte, but it shouldn't be inline now
    there are two. Check for the common case pte_none before calling, and move
    its rss accounting up into install_page or install_file_pte - which helps the
    next patch.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Cleanup: relieve do_mremap from its surfeit of current->mms.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Small adjustment: do_swap_page should report its !pte_same race as a major
    fault if it had to read into swap cache, because whatever raced with it will
    have found page already in cache and reported minor fault.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Small adjustment: zap_pte_range decrement its rss counts from 0 then finally
    add, avoiding negations - we don't have or need a sub_mm_rss.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
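
One way to picture "decrement from 0 then finally add" is the userspace sketch below. Local counts start at zero and go negative as pages are zapped, then a single add per counter applies them, so no subtracting helper is needed. struct mm_sketch, counter_add and zap_range_sketch are hypothetical names; the kernel's mm_counter access macros are not reproduced here.

```c
#include <assert.h>

struct mm_sketch {
    long file_rss;
    long anon_rss;
};

/* Hypothetical add-only helper: the only counter update this sketch needs. */
static void counter_add(long *counter, long delta)
{
    *counter += delta;
}

/* Zap nr pages; page_is_anon[i] is nonzero when page i is anonymous. */
static void zap_range_sketch(struct mm_sketch *mm, const int *page_is_anon, int nr)
{
    long file_rss = 0, anon_rss = 0;   /* local deltas, zero or negative */
    int i;

    assert(nr >= 0);
    for (i = 0; i < nr; i++) {
        if (page_is_anon[i])
            anon_rss--;
        else
            file_rss--;
    }
    /* One update per counter, adding a negative delta. */
    counter_add(&mm->file_rss, file_rss);
    counter_add(&mm->anon_rss, anon_rss);
}
```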
     
  • Small adjustment, following Nick's suggestion: it's more straightforward for
    copy_pte_range to let copy_one_pte do the rss incrementation, than use an
    index it passed back. Saves a #define, and 16 bytes of .text.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove PageReserved() calls from core code by tightening VM_RESERVED
    handling in mm/ to cover PageReserved functionality.

    PageReserved special casing is removed from get_page and put_page.

    All setting and clearing of PageReserved is retained, and it is now flagged
    in the page_alloc checks to help ensure we don't introduce any refcount
    based freeing of Reserved pages.

    MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
    deprecated. We never completely handled it correctly anyway, and it can be
    reintroduced in future if required (Hugh has a proof of concept).

    Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
    arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
    be trivially removed.

    Last real user of PageReserved is swsusp, which uses PageReserved to
    determine whether a struct page points to valid memory or not. This still
    needs to be addressed (a generic page_is_ram() should work).

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
    thus mapcounted and count towards shared rss). These writes to the struct
    page could cause excessive cacheline bouncing on big systems. There are a
    number of ways this could be addressed if it is an issue.

    Signed-off-by: Nick Piggin

    Refcount bug fix for filemap_xip.c

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Please, please now delete the Atari CONFIG_STRAM_SWAP code. It may be
    excellent and ingenious code, but its reference to swap_vfsmnt betrays that it
    hasn't been built since 2.5.1 (four years old come December), it's delving
    deep into matters which are the preserve of core mm code, its only purpose is
    to give the more conscientious mm guys an anxiety attack from time to time;
    yet we keep on breaking it more and more.

    If you want to use RAM for swap, then if the MTD driver does not already
    provide just what you need, I'm sure David could be persuaded to add the
    extra. But you'd also like to be able to allocate extents of that swap for
    other use: we can give you a core interface for that if you need. But unbuilt
    for four years suggests to me that there's no need at all.

    I cannot swear the patch below won't break your build, but believe so.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The sh64 hugetlbpage.c seems to be erroneous, left over from a bygone age,
    clashing with the common hugetlb.c. Replace it by a copy of the sh
    hugetlbpage.c. Except, delete that mk_pte_huge macro neither uses.

    Signed-off-by: Hugh Dickins
    Acked-by: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • One anomaly remains from when Andrea rationalized the responsibilities of
    mmap_sem and page_table_lock: in dup_mmap we add vmas to the child holding its
    page_table_lock, but not the mmap_sem which normally guards the vma list and
    rbtree. Which could be an issue for unuse_mm: though since it just walks down
    the list (today with page_table_lock, tomorrow not), it's probably okay. Will
    need a memory barrier? Oh, keep it simple, Nick and I agreed, no harm in
    taking child's mmap_sem here.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Use the parent's oldmm throughout dup_mmap, instead of perversely going back
    to current->mm. (Can you hear the sigh of relief from those mpnts? Usually I
    squash them, but not today.)

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • tlb_finish_mmu used to batch zap_pte_range's update of mm rss, which may be
    worthwhile if the mm is contended, and would reduce atomic operations if the
    counts were atomic. Let zap_pte_range now batch its updates to file_rss and
    anon_rss, per page-table in case we drop the lock outside; and copy_pte_range
    batch them too.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I was lazy when we added anon_rss, and chose to change as few places as
    possible. So currently each anonymous page has to be counted twice, in rss
    and in anon_rss. Which won't be so good if those are atomic counts in some
    configurations.

    Change that around: keep file_rss and anon_rss separately, and add them
    together (with get_mm_rss macro) when the total is needed - reading two
    atomics is much cheaper than updating two atomics. And update anon_rss
    upfront, typically in memory.c, not tucked away in page_add_anon_rmap.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
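
The accounting change in the commit above can be sketched in userspace as follows: each page is counted once, in file_rss or in anon_rss, and the total is computed on demand. get_mm_rss, file_rss and anon_rss are names from the message; struct mm_sketch and the map/unmap helpers are hypothetical, and plain longs stand in for whatever type the counters have.

```c
#include <assert.h>

struct mm_sketch {
    long file_rss;   /* file-backed resident pages */
    long anon_rss;   /* anonymous resident pages */
};

/* Total rss is derived by reading both counts, never stored separately. */
static long get_mm_rss(const struct mm_sketch *mm)
{
    return mm->file_rss + mm->anon_rss;
}

/* Each mapping event updates exactly one counter. */
static void map_file_page(struct mm_sketch *mm)
{
    mm->file_rss++;
}

static void map_anon_page(struct mm_sketch *mm)
{
    mm->anon_rss++;
}

static void unmap_anon_page(struct mm_sketch *mm)
{
    assert(mm->anon_rss > 0);   /* going negative would be a bug to fix */
    mm->anon_rss--;
}
```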
     
  • How is anon_rss initialized? In dup_mmap, and by mm_alloc's memset; but
    that's not so good if an mm_counter_t is a special type. And how is rss
    initialized? By set_mm_counter, all over the place. Come on, we just need to
    initialize them both at once by set_mm_counter in mm_init (which follows the
    memcpy when forking).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • zap_pte_range has been counting the pages it frees in tlb->freed, then
    tlb_finish_mmu has used that to update the mm's rss. That got stranger when I
    added anon_rss, yet updated it by a different route; and stranger when rss and
    anon_rss became mm_counters with special access macros. And it would no
    longer be viable if we're relying on page_table_lock to stabilize the
    mm_counter, but calling tlb_finish_mmu outside that lock.

    Remove the mmu_gather's freed field, let tlb_finish_mmu stick to its own
    business, just decrement the rss mm_counter in zap_pte_range (yes, there was
    some point to batching the update, and a subsequent patch restores that). And
    forget the anal paranoia of first reading the counter to avoid going negative
    - if rss does go negative, just fix that bug.

    Remove the mmu_gather's flushes and avoided_flushes from arm and arm26: no use
    was being made of them. But arm26 alone was actually using the freed, in the
    way some others use need_flush: give it a need_flush. arm26 seems to prefer
    spaces to tabs here: respect that.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • tlb_is_full_mm? What does that mean? The TLB is full? No, it means that the
    mm's last user has gone and the whole mm is being torn down. And it's an
    inline function because sparc64 uses a different (slightly better)
    "tlb_frozen" name for the flag others call "fullmm".

    And now the ptep_get_and_clear_full macro used in zap_pte_range refers
    directly to tlb->fullmm, which would be wrong for sparc64. Rather than
    correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change
    sparc64 to just use the same poor name as everyone else - is that okay?

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • tlb_gather_mmu dates from before kernel preemption was allowed, and uses
    smp_processor_id or __get_cpu_var to find its per-cpu mmu_gather. That works
    because it's currently only called after getting page_table_lock, which is not
    dropped until after the matching tlb_finish_mmu. But don't rely on that, it
    will soon change: now disable preemption internally by proper get_cpu_var in
    tlb_gather_mmu, put_cpu_var in tlb_finish_mmu.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Speeding up mremap's moving of ptes has never been a priority, but the locking
    will get more complicated shortly, and is already too baroque.

    Scrap the current one-by-one moving, do an extent at a time: curtailed by end
    of src and dst pmds (have to use PMD_SIZE: the way pmd_addr_end gets elided
    doesn't match this usage), and by latency considerations.

    One nice property of the old method is lost: it never allocated a page table
    unless absolutely necessary, so you could free empty page tables by mremapping
    to and fro. Whereas this way, it allocates a dst table wherever there was a
    src table. I keep diving in to reinstate the old behaviour, then come out
    preferring not to clutter how it now is.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
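
A sketch of how each extent might be sized when moving "an extent at a time", following the three limits the commit above names: end of the source pmd, end of the destination pmd, and a latency cap. The constants are illustrative assumptions (a 2 MiB pmd, 4 KiB pages, a cap of 64 pages), move_extent is a hypothetical name, and address wraparound near the top of the address space is ignored.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL
#define PMD_SIZE         (1UL << 21)              /* assumed: 2 MiB per pmd */
#define PMD_MASK         (~(PMD_SIZE - 1))
#define LATENCY_LIMIT    (64 * SKETCH_PAGE_SIZE)  /* assumed cap per step */

/* How many bytes to move in one step, starting at old_addr and new_addr. */
static unsigned long move_extent(unsigned long old_addr, unsigned long old_end,
                                 unsigned long new_addr)
{
    unsigned long next, extent;

    assert(old_addr < old_end);

    /* Stop at the end of the source pmd, or at old_end if that comes first. */
    next = (old_addr + PMD_SIZE) & PMD_MASK;
    if (next > old_end)
        next = old_end;
    extent = next - old_addr;

    /* Stop at the end of the destination pmd too. */
    next = (new_addr + PMD_SIZE) & PMD_MASK;
    if (extent > next - new_addr)
        extent = next - new_addr;

    /* Bound the work done in one step. */
    if (extent > LATENCY_LIMIT)
        extent = LATENCY_LIMIT;
    return extent;
}
```

A caller would loop, advancing both addresses by the returned extent until old_addr reaches old_end.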
     
  • Impose a little more consistency on the page fault handlers do_wp_page,
    do_swap_page, do_anonymous_page, do_no_page, do_file_page: why not pass their
    arguments in the same order, called the same names?

    break_cow is all very well, but what it did was inlined elsewhere: easier to
    compare if it's brought back into do_wp_page.

    do_file_page's fallback to do_no_page dates from a time when we were testing
    pte_file by using it wherever possible: currently it's peculiar to nonlinear
    vmas, so just check that. BUG_ON if not? Better not, it's probably page
    table corruption, so just show the pte: hmm, there's a pte_ERROR macro, let's
    use that for do_wp_page's invalid pfn too.

    Hah! Someone in the ppc64 world noticed pte_ERROR was unused so removed it:
    restored (and say "pud" not "pmd" in its pud_ERROR).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • exit_mmap resets various mm_struct fields, but the mm is well on its way out,
    and none of those fields matter by this point.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Divide remove_vm_struct into two parts: first anon_vma_unlink plus
    unlink_file_vma, to unlink the vma from the list and tree by which rmap or
    vmtruncate might find it; then remove_vma to close, fput and free.

    The intention here is to do the anon_vma_unlink and unlink_file_vma earlier,
    in free_pgtables before freeing any page tables: so we can be sure that any
    page tables traversed by rmap and vmtruncate are stable (and other, ordinary
    cases are stabilized by holding mmap_sem).

    This will be crucial to traversing pgd, pud, pmd without page_table_lock. But
    testing the split-out patch showed that lifting the page_table_lock is
    symbiotically necessary to make this change - the lock ordering is wrong to
    move those unlinks into free_pgtables while it's under ptlock.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • unmap_vma doesn't amount to much, let's put it inside unmap_vma_list. Except
    it doesn't unmap anything, unmap_region just did the unmapping: rename it to
    remove_vma_list.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The original vm_stat_account has fallen into disuse, with only one user, and
    only one user of vm_stat_unaccount. It's easier to keep track if we convert
    them all to __vm_stat_account, then free it from its __shackles.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • do_anonymous_page's pte_wrprotect causes some confusion: in such a case,
    vm_page_prot must already be forcing COW, so must omit write permission, and
    so the pte_wrprotect is redundant. Replace it by a comment to that effect,
    and reword the comment on unuse_pte which also caused confusion.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • zap_pte_range already avoids wasting time to mark_page_accessed on anon pages:
    it can also skip anon set_page_dirty - the page only needs to be marked dirty
    if shared with another mm, but that will say pte_dirty too.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Use latency breaking in msync_pte_range like that in copy_pte_range, instead
    of the ugly CONFIG_PREEMPT filemap_msync alternatives.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins