07 Jan, 2006

2 commits

  • Optimise page_state manipulations by introducing interrupt-unsafe accessors
    to page_state fields. Callers must provide their own locking (either
    disable interrupts or not update from interrupt context).

    Switch over the hot callsites that can easily be moved under interrupts-off
    sections.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
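
    A minimal sketch of the pattern this introduces (the helper name and the
    counter field here are assumptions for illustration): the double-underscore
    accessor does a plain, non-atomic per-cpu update, and the caller supplies
    the interrupt exclusion.

        unsigned long flags;

        local_irq_save(flags);
        /* __inc_page_state does no locking of its own: safe only
         * because we have interrupts off on this cpu */
        __inc_page_state(pgactivate);
        local_irq_restore(flags);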
     
  • Optimise rmap functions by minimising atomic operations when we know there
    will be no concurrent modifications.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
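
    A sketch of the kind of saving meant here, assuming a helper along the
    lines of page_add_new_anon_rmap (the __page_set_anon_rmap name is also an
    assumption): a freshly allocated anonymous page is not yet visible to any
    other CPU, so its mapcount can be set directly instead of with an atomic
    read-modify-write.

        void page_add_new_anon_rmap(struct page *page,
                struct vm_area_struct *vma, unsigned long address)
        {
                /* page is new: no concurrent mapcount updates possible,
                 * so avoid the atomic_inc_and_test that
                 * page_add_anon_rmap needs */
                atomic_set(&page->_mapcount, 0);        /* -1 -> 0 */
                __page_set_anon_rmap(page, vma, address);
        }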
     

29 Nov, 2005

2 commits

  • Some users (hi Zwane) have seen a problem when running a workload that
    eats nearly all of physical memory - the system does an OOM kill, even
    when there is still a lot of swap free.

    The problem appears to be a very big task that is holding the swap
    token, and the VM has a very hard time finding any other page in the
    system that is swappable.

    Instead of ignoring the swap token when sc->priority reaches 0, we could
    simply take the swap token away from the memory hog and make sure we
    don't give it back to the memory hog for a few seconds.

    This patch resolves the problem Zwane ran into.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
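
    A sketch of the behaviour described, with the surrounding code and exact
    variable handling assumed: once reclaim is desperate (priority 0), evict
    the token holder and stamp a time before which grab_swap_token() is
    assumed to refuse the same mm.

        if (sc->priority == 0 && mm == swap_token_mm) {
                swap_token_mm = NULL;   /* take the token away */
                /* don't give it back to this memory hog for a while */
                swap_token_timeout = jiffies + swap_token_default_timeout;
        }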
     
  • This replaces the (in my opinion horrible) VM_UNPAGED logic with very
    explicit support for a "remapped page range" aka VM_PFNMAP. It allows a
    VM area to contain an arbitrary range of page table entries that the VM
    never touches, and never considers to be normal pages.

    Any user of "remap_pfn_range()" automatically gets this new
    functionality, and doesn't even have to mark the pages reserved or
    indeed mark them any other way. It just works. As a side effect, doing
    mmap() on /dev/mem works for arbitrary ranges.

    Sparc update from David in the next commit.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
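
    remap_pfn_range() is the real interface here; a minimal driver mmap method
    using it might look like the following (the device and its physical base
    address are made up):

        static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
        {
                unsigned long size = vma->vm_end - vma->vm_start;
                unsigned long pfn = MYDEV_PHYS_BASE >> PAGE_SHIFT;

                /* remap_pfn_range marks the vma VM_PFNMAP itself: the
                 * pages need not be marked reserved or anything else */
                if (remap_pfn_range(vma, vma->vm_start, pfn, size,
                                        vma->vm_page_prot))
                        return -EAGAIN;
                return 0;
        }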
     

23 Nov, 2005

2 commits

  • copy_one_pte needs to copy the anonymous COWed pages in a VM_UNPAGED area,
    zap_pte_range needs to free them, do_wp_page needs to COW them: just like
    ordinary pages, not like the unpaged.

    But recognizing them is a little subtle: because PageReserved is no longer a
    condition for remap_pfn_range, we can now mmap all of /dev/mem (whether the
    distro permits, and whether it's advisable on this or that architecture, is
    another matter). So if we can see a PageAnon, it may not be ours to mess with
    (or may be ours from elsewhere in the address space). I suspect there's an
    entertaining insoluble self-referential problem here, but the page_is_anon
    function does a good practical job, and MAP_PRIVATE PROT_WRITE VM_UNPAGED will
    always be an odd choice.

    In updating the comment on page_address_in_vma, I noticed a potential NULL
    dereference in a path we don't actually take, and fixed it.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
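
    The page_is_anon test mentioned above is, in rough outline, a heuristic of
    this shape (a sketch reconstructed from the description, not the verbatim
    function):

        /* Is this an anonymous page belonging to this vma at this
         * address, rather than a foreign page we merely have mapped
         * (e.g. via /dev/mem)? */
        static inline int page_is_anon(struct page *page,
                struct vm_area_struct *vma, unsigned long addr)
        {
                return page && PageAnon(page) && page_mapped(page) &&
                        page_address_in_vma(page, vma) == addr;
        }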
     
  • There's one peculiar use of VM_RESERVED which the previous patch left behind:
    because VM_NONLINEAR's try_to_unmap_cluster uses vm_private_data as a swapout
    cursor, but should never meet VM_RESERVED vmas, it was a way of extending
    VM_NONLINEAR to VM_RESERVED vmas using vm_private_data for some other purpose.
    But that's an empty set - they don't have the populate function required. So
    just throw away those VM_RESERVED tests.

    But one more interesting test in rmap.c has to go too: try_to_unmap_one
    will want to swap out an anonymous page from a VM_RESERVED or VM_UNPAGED
    area.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 Oct, 2005

8 commits

  • Updated several references to page_table_lock in common code comments.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A couple of oddities were guarded by page_table_lock, and are no longer
    properly guarded when that lock is split.

    The mm_counters of file_rss and anon_rss: make those an atomic_t, or an
    atomic64_t where the architecture supports it. Definitions by courtesy of
    Christoph Lameter, who spent considerable effort on more scalable ways of
    counting, but found insufficient benefit in practice.

    And adding an mm with swap to the mmlist for swapoff: the list is well-
    guarded by its own lock, but the list_empty check now has to be repeated
    inside it.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
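
    Under this scheme the counter helpers reduce to plain atomic operations,
    roughly as below (a sketch; the atomic64_t variants are analogous):

        typedef atomic_t mm_counter_t;

        #define set_mm_counter(mm, member, value) \
                atomic_set(&(mm)->_##member, value)
        #define get_mm_counter(mm, member) \
                ((unsigned long)atomic_read(&(mm)->_##member))
        #define add_mm_counter(mm, member, value) \
                atomic_add(value, &(mm)->_##member)
        #define inc_mm_counter(mm, member) atomic_inc(&(mm)->_##member)
        #define dec_mm_counter(mm, member) atomic_dec(&(mm)->_##member)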
     
  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page: with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled.

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let the preprocessor compare that
    with NR_CPUS. But I don't think it's worth being user-configurable: for
    good testing of both split and unsplit configs, split now at 4 cpus, and
    perhaps change that to 8 later.

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
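
    The shape of the split, sketched: when enabled, the spinlock guarding a
    page table's entries lives in the struct page of the page table page
    itself, and pte_offset_map_lock picks it up from there (approximate
    definitions; the CONFIG_SPLIT_PTLOCK_CPUS name follows the description
    above).

        #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
        /* the lock is tucked into the pte page's struct page */
        #define pte_lockptr(mm, pmd)    (&pmd_page(*(pmd))->ptl)
        #else
        /* unsplit: fall back to the mm-wide page_table_lock */
        #define pte_lockptr(mm, pmd)    (&(mm)->page_table_lock)
        #endif

        #define pte_offset_map_lock(mm, pmd, address, ptlp)     \
        ({                                                      \
                spinlock_t *__ptl = pte_lockptr(mm, pmd);       \
                pte_t *__pte = pte_offset_map(pmd, address);    \
                spin_lock(__ptl);                               \
                *(ptlp) = __ptl;                                \
                __pte;                                          \
        })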
     
  • rmap's page_check_address now descends without page_table_lock. First just
    pte_offset_map in case there's no pte present worth locking for, then take
    page_table_lock for the full check, and pass ptl back to the caller in the
    same style as pte_offset_map_lock. __xip_unmap, page_referenced_one and
    try_to_unmap_one use pte_unmap_unlock; try_to_unmap_cluster does too.

    page_check_address is reformatted to avoid progressive indentation. No use
    was made of its one error code, so it now returns NULL when it fails.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
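
    Callers now follow the pte_offset_map_lock idiom, roughly:

        spinlock_t *ptl;
        pte_t *pte;

        pte = page_check_address(page, mm, address, &ptl);
        if (!pte)
                goto out;       /* page not mapped here after all */
        /* ... examine or modify the pte, with ptl held ... */
        pte_unmap_unlock(pte, ptl);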
     
  • update_mem_hiwater has attracted various criticisms, in particular from those
    concerned with mm scalability. Originally it was called whenever rss or
    total_vm got raised. Then many of those callsites were replaced by a timer
    tick call from account_system_time. Now Frank van Maarseveen reports that
    to be inadequate as well. How about this? Works for Frank.

    Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
    update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
    mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
    by 1): those are hot paths. Do the opposite, update only when about to lower
    rss (usually by many), or just before final accounting in do_exit. Handle
    mm->hiwater_vm in the same way, though it's much less of an issue. Demand
    that whoever collects these hiwater statistics do the work of taking the
    maximum with rss or total_vm.

    And there has been no collector of these hiwater statistics in the tree.
    The new convention needs an example, so match Frank's usage by adding a
    VmPeak line above VmSize to /proc/<pid>/status, and also a VmHWM line
    above VmRSS (High-Water-Mark or High-Water-Memory).

    There was a particular anomaly during mremap move, that hiwater_vm might be
    captured too high. A fleeting such anomaly remains, but it's quickly
    corrected now, whereas before it would stick.

    What locking? None: if the app is racy then these statistics will be racy,
    it's not worth any overhead to make them exact. But whenever it suits,
    hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
    page_table_lock (for now) or with preemption disabled (later on): without
    going to any trouble, minimize the time between reading current values and
    updating, to minimize those occasions when a racing thread bumps a count up
    and back down in between.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
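
    The macros themselves are then one-liners, approximately:

        #define update_hiwater_rss(mm) do {                     \
                if ((mm)->hiwater_rss < get_mm_rss(mm))         \
                        (mm)->hiwater_rss = get_mm_rss(mm);     \
        } while (0)

        #define update_hiwater_vm(mm) do {                      \
                if ((mm)->hiwater_vm < (mm)->total_vm)          \
                        (mm)->hiwater_vm = (mm)->total_vm;      \
        } while (0)

    and a collector reports max(mm->hiwater_rss, get_mm_rss(mm)), taking the
    maximum itself as the convention demands.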
     
  • Remove PageReserved() calls from core code by tightening VM_RESERVED
    handling in mm/ to cover PageReserved functionality.

    PageReserved special casing is removed from get_page and put_page.

    All setting and clearing of PageReserved is retained, and it is now flagged
    in the page_alloc checks to help ensure we don't introduce any
    refcount-based freeing of Reserved pages.

    MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
    deprecated. We never completely handled it correctly anyway, and it can be
    reintroduced in future if required (Hugh has a proof of concept).

    Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
    arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
    be trivially removed.

    Last real user of PageReserved is swsusp, which uses PageReserved to
    determine whether a struct page points to valid memory or not. This still
    needs to be addressed (a generic page_is_ram() should work).

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
    thus mapcounted and counted towards shared rss). These writes to the struct
    page could cause excessive cacheline bouncing on big systems. There are a
    number of ways this could be addressed if it is an issue.

    Signed-off-by: Nick Piggin

    Refcount bug fix for filemap_xip.c

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
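
    A sketch of the page_alloc check referred to (placement and surrounding
    code assumed): freeing a PG_reserved page through the refcount path is
    treated as a bug rather than silently special-cased.

        /* in the free path's sanity checks */
        if (PageReserved(page))
                bad_page(__FUNCTION__, page);   /* never refcount-free
                                                   a Reserved page */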
     
  • I was lazy when we added anon_rss, and chose to change as few places as
    possible. So currently each anonymous page has to be counted twice, in rss
    and in anon_rss. Which won't be so good if those are atomic counts in some
    configurations.

    Change that around: keep file_rss and anon_rss separately, and add them
    together (with get_mm_rss macro) when the total is needed - reading two
    atomics is much cheaper than updating two atomics. And update anon_rss
    upfront, typically in memory.c, not tucked away in page_add_anon_rmap.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
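
    So the total is only computed on demand, along the lines of:

        #define get_mm_rss(mm) \
                (get_mm_counter(mm, file_rss) + get_mm_counter(mm, anon_rss))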
     
  • It turns out that the original swap token implementation, by Song Jiang,
    only enforced the swap token while the task holding the token was handling
    a page fault. This patch approximates that, without adding an additional
    flag to the mm_struct, by checking whether the mm->mmap_sem is held for
    reading, like the page fault code does.

    This patch has the effect of automatically, and gradually, disabling the
    enforcement of the swap token when there is little or no paging going on, and
    "turning up" the intensity of the swap token code the more the task holding
    the token is thrashing.

    Thanks to Song Jiang for pointing out this aspect of the token based thrashing
    control concept.

    The new code shows a slight degradation over the old swap token code, but
    still a big win over running without the swap token.

    2.6.12+ swap token disabled

    $ for i in `seq 10` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done
    101.74user 23.13system 8:26.91elapsed 24%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (38597major+430315minor)pagefaults 0swaps
    101.98user 24.91system 8:03.06elapsed 26%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (33939major+430457minor)pagefaults 0swaps
    101.93user 22.12system 7:34.90elapsed 27%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (33166major+421267minor)pagefaults 0swaps
    101.82user 22.38system 8:31.40elapsed 24%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (39338major+433262minor)pagefaults 0swaps

    2.6.12+ swap token enabled, timeout 300 seconds

    $ for i in `seq 4` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done
    102.58user 16.08system 3:41.44elapsed 53%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (19707major+285786minor)pagefaults 0swaps
    102.07user 19.56system 4:00.64elapsed 50%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (19012major+299259minor)pagefaults 0swaps
    102.64user 18.25system 4:07.31elapsed 48%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (21990major+304831minor)pagefaults 0swaps
    101.39user 19.41system 5:15.81elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (24850major+323321minor)pagefaults 0swaps

    2.6.12+ with new swap token code, timeout 300 seconds

    $ for i in `seq 4` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done
    101.87user 24.66system 5:53.20elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (26848major+363497minor)pagefaults 0swaps
    102.83user 19.95system 4:17.25elapsed 47%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (19946major+305722minor)pagefaults 0swaps
    102.09user 19.46system 5:12.57elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (25461major+334994minor)pagefaults 0swaps
    101.67user 20.61system 4:52.97elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (22190major+329508minor)pagefaults 0swaps

    Signed-off-by: Rik Van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik Van Riel
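
    A sketch of the check described, at a page_referenced_one-style call site
    (surrounding details assumed): the token only shields its holder while
    that mm's mmap_sem is held, as it is for reading throughout a page fault.

        /* pretend the page was referenced if the token holder is
         * mid-fault, so reclaim leaves it alone */
        if (mm != current->mm && has_swap_token(mm) &&
                        rwsem_is_locked(&mm->mmap_sem))
                referenced++;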
     

05 Sep, 2005

5 commits

  • Thanks to Bill Irwin for pointing this out.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Microoptimise page_add_anon_rmap. Although these expressions are used only in
    the taken branch of the if() statement, the compiler can't reorder them inside
    because atomic_inc_and_test is a barrier.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
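
    That is, the patch moves the work into the branch by hand, since the
    compiler may not sink it past the atomic (a sketch reconstructed from the
    description; the field manipulation details are assumed):

        if (atomic_inc_and_test(&page->_mapcount)) {
                /* first mapping: only now compute and store the
                 * anon_vma pointer and index */
                struct anon_vma *anon_vma = vma->anon_vma;

                page->index = linear_page_index(vma, address);
                page->mapping = (struct address_space *)
                        ((unsigned long)anon_vma + PAGE_MAPPING_ANON);
                inc_page_state(nr_mapped);
        }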
     
  • Just be clear that VM_RESERVED pages here are a bug, and the test is not there
    because they are expected.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Remove the three get_mm_counter(mm, rss) tests from rmap.c: there was a
    time when testing rss was important to avoid a particular race between
    dup_mmap and the anonmm rmap; but now it's just a rather silly pseudo-
    optimization, made even more obscure by the get_mm_counter macro.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The idea of a swap_device_lock per device, and a swap_list_lock over them all,
    is appealing; but in practice almost every holder of swap_device_lock must
    already hold swap_list_lock, which defeats the purpose of the split.

    The only exceptions have been swap_duplicate, valid_swaphandles and an
    untrodden path in try_to_unuse (plus a few places added in this series).
    valid_swaphandles doesn't show up high in profiles, but swap_duplicate does
    demand attention. However, with the hold time in get_swap_pages so much
    reduced, I've not yet found a load and set of swap device priorities to show
    even swap_duplicate benefitting from the split. Certainly the split is mere
    overhead in the common case of a single swap device.

    So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock
    (generally we seem to prefer an _ in the name, and not hide in a macro).

    If someone can show a regression in swap_duplicate, then probably we should
    add a hashlock for the swap_map entries alone (shorts being anatomic), so as
    to help the case of the single swap device too.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
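
    The before/after locking pattern, schematically:

        /* before: two-level locking, nearly always nested anyway */
        swap_list_lock();
        swap_device_lock(p);
        /* ... touch swap_list and p->swap_map ... */
        swap_device_unlock(p);
        swap_list_unlock();

        /* after: one spinlock_t swap_lock covers both */
        spin_lock(&swap_lock);
        /* ... touch swap_list and p->swap_map ... */
        spin_unlock(&swap_lock);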
     

24 Jun, 2005

1 commit

  • - generic_file* file operations no longer have an xip/non-xip split
    - filemap_xip.c implements a new set of fops that require the get_xip_page
      aop to work properly; all the new fops are exported GPL-only (we don't
      want to see them used by anything but GPL modules)
    - __xip_unmap now uses page_check_address, which is no longer static in
      rmap.c, and is declared in linux/rmap.h
    - mm/filemap.h is now much cleaner, plainly containing just Linus'
      inline funcs moved here from filemap.c
    - fix includes in filemap_xip to make it build cleanly on i386

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
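
    A sketch of how a filesystem opts in, using the fops exported by
    filemap_xip.c (the filesystem-side names are assumptions):

        struct file_operations myfs_xip_file_operations = {
                .read   = xip_file_read,
                .write  = xip_file_write,
                .mmap   = xip_file_mmap,
        };

        struct address_space_operations myfs_aops = {
                /* get_xip_page is the aop the new fops require */
                .get_xip_page   = myfs_get_xip_page,
        };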
     

22 Jun, 2005

1 commit

  • Remember that ironic get_user_pages race, when the raised page_count on a
    swapped-out page led do_wp_page to decide that it had to copy on write,
    and so substituted a different page into userspace? 2.6.7 onwards have
    Andrea's solution, where try_to_unmap_one backs out if it finds page_count
    raised.

    Which works, but is unsatisfying (rmap.c has no other page_count heuristics),
    and was found a few months ago to hang an intensive page migration test. A
    year ago I was hesitant to engage page_mapcount, now it seems the right fix.

    So remove the page_count hack from try_to_unmap_one; and use activate_page in
    unuse_mm when dropping lock, to replace its secondary effect of helping
    swapoff to make progress in that case.

    Simplify can_share_swap_page (now called only on anonymous pages) to check
    page_mapcount + page_swapcount == 1: still needs the page lock to stabilize
    their (pessimistic) sum, but does not need swapper_space.tree_lock for that.

    In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap,
    to keep the sum on the high side, and correct at the point when
    can_share_swap_page is called.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
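
    The simplified test is essentially the following (a sketch; the page lock
    keeps the pessimistic sum stable):

        /* Can we reuse this anonymous page rather than COW it?
         * Only if ours is the only reference, counting both the
         * ptes mapping it and the swap entries holding it. */
        BUG_ON(!PageLocked(page));
        return page_mapcount(page) + page_swapcount(page) == 1;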
     

25 May, 2005

1 commit

  • try_to_unmap_cluster() does:

        for (pte = pte_offset_map(pmd, address);
                        address < end; pte++, address += PAGE_SIZE) {
                ...
        }

        pte_unmap(pte);

    It may take a little staring to notice, but pte can actually fall off the
    end of the pte page in this iteration, which makes life difficult for
    kmap_atomic() and the users not expecting it to BUG(). Of course, we're
    somewhat lucky in that arithmetic elsewhere in the function guarantees that
    at least one iteration is made, lest this force larger rearrangements to be
    made. This issue and patch also apply to non-mm mainline and, with
    trivial adjustments, at least two related kernels.

    Discovered during internal testing at Oracle.

    Signed-off-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    William Lee Irwin III
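
    The fix is simply to unmap the last pte actually visited, one entry back
    from where the loop leaves the pointer (a sketch of the corrected loop):

        for (pte = pte_offset_map(pmd, address);
                        address < end; pte++, address += PAGE_SIZE) {
                /* ... try to unmap *pte ... */
        }
        /* pte now points one past the last entry examined, possibly
         * off the end of the pte page: step back before unmapping */
        pte_unmap(pte - 1);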
     

01 May, 2005

1 commit

  • mm/rmap.c:page_referenced_one() and mm/rmap.c:try_to_unmap_one() contain
    identical code that

    - takes mm->page_table_lock;
    - drills through page tables;
    - checks that the correct pte is reached.

    Coalesce this into page_check_address().

    Signed-off-by: Nikita Danilov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikita Danilov
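
    The coalesced helper has roughly this shape (a sketch of the version
    introduced here, before the later variant that hands the lock back to the
    caller):

        pte_t *page_check_address(struct page *page, struct mm_struct *mm,
                                  unsigned long address)
        {
                pgd_t *pgd;
                pud_t *pud;
                pmd_t *pmd;
                pte_t *pte;

                spin_lock(&mm->page_table_lock);        /* 1. lock  */
                pgd = pgd_offset(mm, address);          /* 2. drill */
                if (!pgd_present(*pgd))
                        goto out;
                pud = pud_offset(pgd, address);
                if (!pud_present(*pud))
                        goto out;
                pmd = pmd_offset(pud, address);
                if (!pmd_present(*pmd))
                        goto out;
                pte = pte_offset_map(pmd, address);
                /* 3. check the correct pte was reached */
                if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte))
                        return pte;     /* caller unmaps and unlocks */
                pte_unmap(pte);
        out:
                spin_unlock(&mm->page_table_lock);
                return NULL;
        }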
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds