30 Oct, 2005

40 commits

  • Please, please now delete the Atari CONFIG_STRAM_SWAP code. It may be
    excellent and ingenious code, but its reference to swap_vfsmnt betrays that it
    hasn't been built since 2.5.1 (four years old come December), it's delving
    deep into matters which are the preserve of core mm code, its only purpose is
    to give the more conscientious mm guys an anxiety attack from time to time;
    yet we keep on breaking it more and more.

    If you want to use RAM for swap, then if the MTD driver does not already
    provide just what you need, I'm sure David could be persuaded to add the
    extra. But you'd also like to be able to allocate extents of that swap for
    other use: we can give you a core interface for that if you need. But unbuilt
    for four years suggests to me that there's no need at all.

    I cannot swear the patch below won't break your build, but I believe it won't.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The sh64 hugetlbpage.c seems to be erroneous, left over from a bygone age,
    clashing with the common hugetlb.c. Replace it by a copy of the sh
    hugetlbpage.c. Except delete the mk_pte_huge macro, which neither arch uses.

    Signed-off-by: Hugh Dickins
    Acked-by: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • One anomaly remains from when Andrea rationalized the responsibilities of
    mmap_sem and page_table_lock: in dup_mmap we add vmas to the child holding its
    page_table_lock, but not the mmap_sem which normally guards the vma list and
    rbtree. Which could be an issue for unuse_mm: though since it just walks down
    the list (today with page_table_lock, tomorrow not), it's probably okay. Will
    need a memory barrier? Oh, keep it simple, Nick and I agreed, no harm in
    taking child's mmap_sem here.
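
    Roughly, that just means wrapping the vma-copying loop in the child's
    mmap_sem as well as the parent's. A minimal sketch (the surrounding fork
    code is compressed here and the loop body elided):

        static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
        {
                struct vm_area_struct *mpnt;
                int retval = 0;

                down_write(&oldmm->mmap_sem);
                /*
                 * Not strictly needed, since nobody else can see the child mm
                 * yet, but it keeps the "mmap_sem guards the vma list and
                 * rbtree" rule uniform:
                 */
                down_write(&mm->mmap_sem);
                for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
                        /* ... copy each vma into the child, as before ... */
                }
                up_write(&mm->mmap_sem);
                up_write(&oldmm->mmap_sem);
                return retval;
        }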

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Use the parent's oldmm throughout dup_mmap, instead of perversely going back
    to current->mm. (Can you hear the sigh of relief from those mpnts? Usually I
    squash them, but not today.)

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • tlb_finish_mmu used to batch zap_pte_range's update of mm rss, which may be
    worthwhile if the mm is contended, and would reduce atomic operations if the
    counts were atomic. Let zap_pte_range now batch its updates to file_rss and
    anon_rss, per page-table in case we drop the lock outside; and copy_pte_range
    batch them too.
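
    The batching pattern looks roughly like this sketch (swap entries, dirty and
    accessed bookkeeping and the actual page freeing are elided; the
    add_mm_counter() flush at the end assumes the usual mm_counter accessors):

        static void zap_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
                        unsigned long addr, unsigned long end)
        {
                struct mm_struct *mm = tlb->mm;
                int file_rss = 0;       /* local, unlocked tallies, */
                int anon_rss = 0;       /* flushed once per page table */
                pte_t *pte;

                pte = pte_offset_map(pmd, addr);
                do {
                        pte_t ptent = *pte;
                        struct page *page;

                        if (pte_none(ptent) || !pte_present(ptent))
                                continue;
                        page = pfn_to_page(pte_pfn(ptent));
                        ptep_get_and_clear(mm, addr, pte);
                        if (PageAnon(page))
                                anon_rss--;
                        else
                                file_rss--;
                        /* ... tlb_remove_page(tlb, page) etc. ... */
                } while (pte++, addr += PAGE_SIZE, addr != end);
                pte_unmap(pte - 1);

                /* one update of the shared counters per page table */
                if (file_rss)
                        add_mm_counter(mm, file_rss, file_rss);
                if (anon_rss)
                        add_mm_counter(mm, anon_rss, anon_rss);
        }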

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I was lazy when we added anon_rss, and chose to change as few places as
    possible. So currently each anonymous page has to be counted twice, in rss
    and in anon_rss. Which won't be so good if those are atomic counts in some
    configurations.

    Change that around: keep file_rss and anon_rss separately, and add them
    together (with get_mm_rss macro) when the total is needed - reading two
    atomics is much cheaper than updating two atomics. And update anon_rss
    upfront, typically in memory.c, not tucked away in page_add_anon_rmap.
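
    The accessor is nothing more than the sum of the two counters; a sketch,
    assuming the existing get_mm_counter() macro:

        #define get_mm_rss(mm)                                          \
                (get_mm_counter(mm, file_rss) + get_mm_counter(mm, anon_rss))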

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • How is anon_rss initialized? In dup_mmap, and by mm_alloc's memset; but
    that's not so good if an mm_counter_t is a special type. And how is rss
    initialized? By set_mm_counter, all over the place. Come on, we just need to
    initialize them both at once by set_mm_counter in mm_init (which follows the
    memcpy when forking).
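
    In other words, something along these lines in mm_init(), which still works
    if mm_counter_t later becomes an atomic type (a sketch of the two lines, not
    the whole function):

        /* in mm_init(), after the memcpy of the parent mm on fork: */
        set_mm_counter(mm, rss, 0);
        set_mm_counter(mm, anon_rss, 0);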

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • zap_pte_range has been counting the pages it frees in tlb->freed, then
    tlb_finish_mmu has used that to update the mm's rss. That got stranger when I
    added anon_rss, yet updated it by a different route; and stranger when rss and
    anon_rss became mm_counters with special access macros. And it would no
    longer be viable if we're relying on page_table_lock to stabilize the
    mm_counter, but calling tlb_finish_mmu outside that lock.

    Remove the mmu_gather's freed field, let tlb_finish_mmu stick to its own
    business, just decrement the rss mm_counter in zap_pte_range (yes, there was
    some point to batching the update, and a subsequent patch restores that). And
    forget the anal paranoia of first reading the counter to avoid going negative
    - if rss does go negative, just fix that bug.

    Remove the mmu_gather's flushes and avoided_flushes from arm and arm26: no use
    was being made of them. But arm26 alone was actually using the freed, in the
    way some others use need_flush: give it a need_flush. arm26 seems to prefer
    spaces to tabs here: respect that.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • tlb_is_full_mm? What does that mean? The TLB is full? No, it means that the
    mm's last user has gone and the whole mm is being torn down. And it's an
    inline function because sparc64 uses a different (slightly better)
    "tlb_frozen" name for the flag others call "fullmm".

    And now the ptep_get_and_clear_full macro used in zap_pte_range refers
    directly to tlb->fullmm, which would be wrong for sparc64. Rather than
    correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change
    sparc64 to just use the same poor name as everyone else - is that okay?
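
    For reference, the generic fallback simply ignores the "full" hint, and the
    call site passes tlb->fullmm (shown as a sketch of the asm-generic form;
    architectures can override it to exploit the whole-mm teardown case):

        #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
        #define ptep_get_and_clear_full(mm, addr, ptep, full)           \
                ptep_get_and_clear(mm, addr, ptep)
        #endif

        /* call site in zap_pte_range: */
        ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);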

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • tlb_gather_mmu dates from before kernel preemption was allowed, and uses
    smp_processor_id or __get_cpu_var to find its per-cpu mmu_gather. That works
    because it's currently only called after getting page_table_lock, which is not
    dropped until after the matching tlb_finish_mmu. But don't rely on that, it
    will soon change: now disable preemption internally by proper get_cpu_var in
    tlb_gather_mmu, put_cpu_var in tlb_finish_mmu.
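
    A minimal sketch of that change in the generic mmu_gather code (other
    initialization and the flush details are elided; field names follow the
    common asm-generic/tlb.h):

        static inline struct mmu_gather *
        tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
        {
                /* get_cpu_var disables preemption while we own the gather */
                struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

                tlb->mm = mm;
                tlb->fullmm = full_mm_flush;
                return tlb;
        }

        static inline void
        tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
        {
                tlb_flush_mmu(tlb, start, end);
                /* ... check_pgt_cache() etc. ... */
                put_cpu_var(mmu_gathers);       /* re-enables preemption */
        }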

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Speeding up mremap's moving of ptes has never been a priority, but the locking
    will get more complicated shortly, and is already too baroque.

    Scrap the current one-by-one moving, do an extent at a time: curtailed by end
    of src and dst pmds (have to use PMD_SIZE: the way pmd_addr_end gets elided
    doesn't match this usage), and by latency considerations.

    One nice property of the old method is lost: it never allocated a page table
    unless absolutely necessary, so you could free empty page tables by mremapping
    to and fro. Whereas this way, it allocates a dst table wherever there was a
    src table. I keep diving in to reinstate the old behaviour, then come out
    preferring not to clutter how it now is.
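
    The new loop is roughly of this shape (a sketch: helper names like
    get_old_pmd/alloc_new_pmd and the LATENCY_LIMIT cap are assumptions standing
    in for the real details):

        for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
                cond_resched();
                /* curtail the extent at the end of the source pmd ... */
                next = (old_addr + PMD_SIZE) & PMD_MASK;
                if (next > old_end)
                        next = old_end;
                extent = next - old_addr;

                old_pmd = get_old_pmd(mm, old_addr);    /* may be absent */
                if (!old_pmd)
                        continue;
                new_pmd = alloc_new_pmd(mm, new_addr);  /* always allocated */
                if (!new_pmd)
                        break;

                /* ... and at the end of the destination pmd, plus a latency cap */
                next = (new_addr + PMD_SIZE) & PMD_MASK;
                if (extent > next - new_addr)
                        extent = next - new_addr;
                if (extent > LATENCY_LIMIT)
                        extent = LATENCY_LIMIT;

                move_ptes(vma, old_pmd, old_addr, old_addr + extent,
                                new_vma, new_pmd, new_addr);
        }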

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Impose a little more consistency on the page fault handlers do_wp_page,
    do_swap_page, do_anonymous_page, do_no_page, do_file_page: why not pass their
    arguments in the same order, called the same names?

    break_cow is all very well, but what it did was inlined elsewhere: easier to
    compare if it's brought back into do_wp_page.

    do_file_page's fallback to do_no_page dates from a time when we were testing
    pte_file by using it wherever possible: currently it's peculiar to nonlinear
    vmas, so just check that. BUG_ON if not? Better not, it's probably page
    table corruption, so just show the pte: hmm, there's a pte_ERROR macro, let's
    use that for do_wp_page's invalid pfn too.

    Hah! Someone in the ppc64 world noticed pte_ERROR was unused so removed it:
    restored (and say "pud" not "pmd" in its pud_ERROR).
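
    The invalid-pfn case in do_wp_page then looks something like this sketch
    (the variable names and the exact failure path are assumptions):

        if (unlikely(!pfn_valid(pfn))) {
                /*
                 * Almost certainly page table corruption: don't BUG,
                 * just report the bad pte and fail the fault.
                 */
                pte_ERROR(orig_pte);
                ret = VM_FAULT_OOM;
                goto unlock;
        }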

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • exit_mmap resets various mm_struct fields, but the mm is well on its way out,
    and none of those fields matter by this point.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Divide remove_vm_struct into two parts: first anon_vma_unlink plus
    unlink_file_vma, to unlink the vma from the list and tree by which rmap or
    vmtruncate might find it; then remove_vma to close, fput and free.

    The intention here is to do the anon_vma_unlink and unlink_file_vma earlier,
    in free_pgtables before freeing any page tables: so we can be sure that any
    page tables traversed by rmap and vmtruncate are stable (and other, ordinary
    cases are stabilized by holding mmap_sem).

    This will be crucial to traversing pgd,pud,pmd without page_table_lock. But
    testing the split-out patch showed that lifting the page_table_lock is
    symbiotically necessary to make this change - the lock ordering is wrong to
    move those unlinks into free_pgtables while it's under ptlock.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • unmap_vma doesn't amount to much, let's put it inside unmap_vma_list. Except
    it doesn't unmap anything, unmap_region just did the unmapping: rename it to
    remove_vma_list.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The original vm_stat_account has fallen into disuse, with only one user, and
    only one user of vm_stat_unaccount. It's easier to keep track if we convert
    them all to __vm_stat_account, then free it from its __shackles.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • do_anonymous_page's pte_wrprotect causes some confusion: in such a case,
    vm_page_prot must already be forcing COW, so must omit write permission, and
    so the pte_wrprotect is redundant. Replace it by a comment to that effect,
    and reword the comment on unuse_pte which also caused confusion.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • zap_pte_range already avoids wasting time to mark_page_accessed on anon pages:
    it can also skip anon set_page_dirty - the page only needs to be marked dirty
    if shared with another mm, but that will say pte_dirty too.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Use latency breaking in msync_pte_range like that in copy_pte_range, instead
    of the ugly CONFIG_PREEMPT filemap_msync alternatives.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • My latency breaking in copy_pte_range didn't work as intended: instead of
    checking at regularish intervals, after the first interval it checked every
    time around the loop, too impatient to be preempted. Fix that.
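
    That is, reset the counter when the interval check fires, instead of letting
    it stay saturated (a sketch of the loop as intended; the weights and lock
    variables are restated here from the surrounding code, so treat them as
    approximate):

        do {
                if (progress >= 32) {
                        progress = 0;   /* the missing reset */
                        if (need_resched() ||
                            need_lockbreak(src_ptl) || need_lockbreak(dst_ptl))
                                break;  /* drop locks, cond_resched(), retry */
                }
                if (pte_none(*src_pte)) {
                        progress++;
                        continue;
                }
                copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr);
                progress += 8;
        } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);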

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch adds some stack dumps if the slab logic is processing slab
    blocks from the wrong node. This is necessary in order to detect
    situations like the one encountered by Petr.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Martin Hicks' page cache reclaim patch added the 'may_swap' flag to the
    scan_control struct; and modified shrink_list() not to add anon pages to
    the swap cache if may_swap is not asserted.

    Ref: http://marc.theaimsgroup.com/?l=linux-mm&m=111461480725322&w=4

    However, further down, if the page is mapped, shrink_list() calls
    try_to_unmap() which will call try_to_unmap_one() via try_to_unmap_anon().
    try_to_unmap_one() will BUG_ON() an anon page that is NOT in the swap
    cache. Martin says he never encountered this path in his testing, but
    agrees that it might happen.

    This patch modifies shrink_list() to skip anon pages that are not already
    in the swap cache when !may_swap, rather than just not adding them to the
    cache.
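
    So the anon-page handling in shrink_list() becomes roughly (a sketch of the
    resulting logic, not the verbatim diff):

        /*
         * Anonymous memory has no backing store until it reaches the swap
         * cache; when swapping is not allowed, skip such pages entirely.
         */
        if (PageAnon(page) && !PageSwapCache(page)) {
                if (!sc->may_swap)
                        goto keep_locked;
                if (!add_to_swap(page))
                        goto activate_locked;
        }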

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
    This is not actually a problem, but sync_page_range() is in use as an
    exported function for filesystems.

    The msync_xxx naming is more readable, at least to me.

    Signed-off-by: OGAWA Hirofumi
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Most of them can never be triggered and were only for development.

    Signed-off-by: "Andi Kleen"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • The NUMA policy code predated nodemask_t so it used open coded bitmaps.
    Convert everything to nodemask_t. Big patch, but shouldn't have any actual
    behaviour changes (except I removed one unnecessary check against
    node_online_map and one unnecessary BUG_ON).
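
    For illustration, the nodemask_t API that the open-coded bitmaps are
    converted to looks like this (a generic usage sketch, not lifted from the
    patch; do_something_per_node is hypothetical):

        nodemask_t nodes = NODE_MASK_NONE;

        node_set(nid, nodes);                   /* was: set_bit(nid, bitmap)   */
        if (nodes_empty(nodes))                 /* was: bitmap_empty(...)      */
                return -EINVAL;
        for_each_node_mask(nid, nodes)          /* was: open-coded bit walking */
                do_something_per_node(nid);     /* hypothetical per-node work  */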

    Signed-off-by: "Andi Kleen"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Set the low water mark for hot pages in pcp to zero.

    (akpm: for the life of me I cannot remember why we created pcp->low. Neither
    can Martin and the changelog is silent. Maybe it was just a brainfart, but I
    have this feeling that there was a reason. If not, we should remove the
    fields completely. We'll see.)

    Signed-off-by: Rohit Seth
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth, Rohit
     
  • Increase the page allocator's per-cpu magazines from 1/4MB to 1/2MB.

    Over 100+ runs of a workload, the difference in the mean is about 2%, and
    the best results for both are almost the same; though the maximum variation
    in results with 1/2MB is only 2.2%, whereas with 1/4MB it is 12%.

    Signed-off-by: Rohit Seth
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth, Rohit
     
  • It turns out that the original swap token implementation, by Song Jiang, only
    enforced the swap token while the task holding the token is handling a page
    fault. This patch approximates that, without adding an additional flag to the
    mm_struct, by checking whether the mm->mmap_sem is held for reading, like the
    page fault code does.
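
    In effect, the reclaim side only honours the token while its holder looks to
    be in the middle of a fault, along these lines (a sketch; the read-lock test
    is the helper added by the rwsem patch further down):

        /* in the rmap reference test: pretend the page was referenced if the
         * mm holding the swap token appears to be handling a page fault */
        if (mm != current->mm && has_swap_token(mm) &&
                        sem_is_read_locked(&mm->mmap_sem))
                referenced++;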

    This patch has the effect of automatically, and gradually, disabling the
    enforcement of the swap token when there is little or no paging going on, and
    "turning up" the intensity of the swap token code the more the task holding
    the token is thrashing.

    Thanks to Song Jiang for pointing out this aspect of the token based thrashing
    control concept.

    The new code shows a slight degradation over the old swap token code, but
    still a big win over running without the swap token.

    2.6.12+ swap token disabled

    $ for i in `seq 10` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done
    101.74user 23.13system 8:26.91elapsed 24%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (38597major+430315minor)pagefaults 0swaps
    101.98user 24.91system 8:03.06elapsed 26%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (33939major+430457minor)pagefaults 0swaps
    101.93user 22.12system 7:34.90elapsed 27%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (33166major+421267minor)pagefaults 0swaps
    101.82user 22.38system 8:31.40elapsed 24%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (39338major+433262minor)pagefaults 0swaps

    2.6.12+ swap token enabled, timeout 300 seconds

    $ for i in `seq 4` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done
    102.58user 16.08system 3:41.44elapsed 53%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (19707major+285786minor)pagefaults 0swaps
    102.07user 19.56system 4:00.64elapsed 50%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (19012major+299259minor)pagefaults 0swaps
    102.64user 18.25system 4:07.31elapsed 48%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (21990major+304831minor)pagefaults 0swaps
    101.39user 19.41system 5:15.81elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (24850major+323321minor)pagefaults 0swaps

    2.6.12+ with new swap token code, timeout 300 seconds

    $ for i in `seq 4` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done
    101.87user 24.66system 5:53.20elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (26848major+363497minor)pagefaults 0swaps
    102.83user 19.95system 4:17.25elapsed 47%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (19946major+305722minor)pagefaults 0swaps
    102.09user 19.46system 5:12.57elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (25461major+334994minor)pagefaults 0swaps
    101.67user 20.61system 4:52.97elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (22190major+329508minor)pagefaults 0swaps

    Signed-off-by: Rik Van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik Van Riel
     
  • Add sem_is_read/write_locked functions to the read/write semaphores, along the
    same lines as the *_is_locked spinlock functions. The swap token tuning patch
    uses sem_is_read_locked; sem_is_write_locked is added for completeness.
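
    For the spinlock-based rwsem this can be read straight out of the activity
    count (a sketch assuming the rwsem-spinlock layout, where activity is
    positive for readers and -1 for a writer; the XADD-based variants need an
    equivalent test on their count word):

        static inline int sem_is_read_locked(struct rw_semaphore *sem)
        {
                return (sem->activity > 0);     /* one or more readers hold it */
        }

        static inline int sem_is_write_locked(struct rw_semaphore *sem)
        {
                return (sem->activity == -1);   /* a writer holds it */
        }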

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik Van Riel
     
    barrier.h uses barrier() in the non-SMP case, but doesn't include compiler.h.

    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ivan Kokshaysky
     
  • Add missing compensation for (HZ == 250) != (1 << SHIFT_HZ) in
    second_overflow().

    Signed-off-by: YOSHIFUJI Hideaki
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YOSHIFUJI Hideaki
     
  • This patch adds

    vmalloc_node(size, node) -> Allocate necessary memory on the specified node

    and

    get_vm_area_node(size, flags, node)

    and the other functions that it depends on.
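
    A usage sketch: allocate node-local memory where possible (the fallback to a
    plain vmalloc is just an illustrative pattern, not part of the patch):

        void *buf;

        buf = vmalloc_node(size, node);         /* prefer memory on "node" */
        if (!buf)
                buf = vmalloc(size);            /* fall back to any node */
        if (!buf)
                return -ENOMEM;
        /* ... use buf ... */
        vfree(buf);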

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Linus Torvalds
     
  • Patch from Nicolas Pitre

    Since vmlinux.lds.S is preprocessed, we can use the defines already
    present in asm/memory.h (allowed by patch #3060) for the XIP kernel link
    address instead of relying on a duplicated Makefile hardcoded value, and
    also get rid of its dependency on awk to handle it at the same time.

    While at it let's clean XIP stuff even further and make things clearer
    in head.S with a nice code reduction.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Russell King

    Nicolas Pitre
     
  • Patch from Nicolas Pitre

    This patch allows for assorted type of cleanups by letting assembly code
    use the same set of defines for constant values and avoid duplicated
    definitions that might not always be in sync, or that might simply be
    confusing due to the different names for the same thing.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Russell King

    Nicolas Pitre
     
  • Linus Torvalds
     
  • Signed-off-by: Ralf Baechle

    Ralf Baechle
     
  • Some boards declare prom_free_prom_memory as a void function but the
    caller free_initmem() expects a return value.

    Fix those up and return 0 instead, just like everyone else does.
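
    For a board with nothing to reclaim that becomes roughly (a sketch; the
    unsigned long return value is what free_initmem() adds into its total):

        unsigned long __init prom_free_prom_memory(void)
        {
                /* nothing to free on this board */
                return 0;
        }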

    Signed-off-by: Arthur Othieno
    Signed-off-by: Ralf Baechle

    Arthur Othieno
     
  • by emulation of a full FPU.

    Signed-off-by: Ralf Baechle

    Ralf Baechle
     
  • Signed-off-by: Ralf Baechle

    Ralf Baechle