30 Oct, 2005

2 commits

  • pgdat->node_size_lock is basically only needed in one place in the normal
    code: show_mem(), which is the arch-specific sysrq-m printing function.

    Strictly speaking, the architectures not doing memory hotplug do not need this
    locking in show_mem(). However, they are all included for completeness. This
    should also make any future consolidation of all of the implementations a
    little more straightforward.

    This lock is also held in the sparsemem code during a memory removal, as
    sections are invalidated. This is the place where pfn_valid() is made false
    for a memory area that's being removed. The lock is only required when doing
    pfn_valid() operations on memory for which the user does not already hold a
    reference on the page, such as in show_mem().
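
    A minimal sketch of that pattern (the show_node_mem() helper is hypothetical;
    the lock, fields and pfn helpers are the ones discussed above):

    /*
     * Hold pgdat->node_size_lock while doing pfn_valid() on pfns whose
     * pages we do not yet hold a reference on, so a concurrent memory
     * removal cannot invalidate the section under us.  Illustrative only;
     * the surrounding show_mem() details are arch-specific.
     */
    static void show_node_mem(pg_data_t *pgdat)
    {
            unsigned long flags, pfn, end;

            spin_lock_irqsave(&pgdat->node_size_lock, flags);
            end = pgdat->node_start_pfn + pgdat->node_spanned_pages;
            for (pfn = pgdat->node_start_pfn; pfn < end; pfn++) {
                    if (!pfn_valid(pfn))
                            continue;
                    /* ... inspect pfn_to_page(pfn), accumulate counters ... */
            }
            spin_unlock_irqrestore(&pgdat->node_size_lock, flags);
    }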

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page: with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled.

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let the preprocessor compare that with
    NR_CPUS. But I don't think it's worth being user-configurable: for good
    testing of both split and unsplit configs, split now at 4 cpus, and perhaps
    change that to 8 later.

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.
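
    A simplified sketch of the approach, with the preprocessor choosing the lock
    at build time as described above (illustrative, not the exact code added):

    #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
    /* the lock lives in the struct page of the page table page */
    #define pte_lockptr(mm, pmd)    (&pmd_page(*(pmd))->ptl)
    #else
    /* fall back to the single per-mm lock */
    #define pte_lockptr(mm, pmd)    (&(mm)->page_table_lock)
    #endif

    /* fault paths then lock only the page table page they touch, e.g.: */
    spinlock_t *ptl;
    pte_t *pte = pte_offset_map_lock(mm, pmd, address, &ptl);
    /* ... examine or modify the pte ... */
    pte_unmap_unlock(pte, ptl);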

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Sep, 2005

1 commit

  • Add a clone operation for pgd updates.

    This helps complete the encapsulation of updates to page tables (or pages
    about to become page tables) into accessor functions rather than using
    memcpy() to duplicate them. This is both generally good for consistency
    and also necessary for running in a hypervisor which requires explicit
    updates to page table entries.

    The new function is:

    clone_pgd_range(pgd_t *dst, pgd_t *src, int count);

    dst - pointer to pgd range anywhere on a pgd page
    src - pointer to pgd range anywhere on a pgd page
    count - the number of pgds to copy.

    dst and src can be on the same page, but the range must not overlap
    and must not cross a page boundary.

    Note that I omitted using this call to copy pgd entries into the
    software suspend page root, since this is not technically a live paging
    structure, rather it is used on resume from suspend. CC'ing Pavel in case
    he has any feedback on this.

    Thanks to Chris Wright for noticing that this could be more optimal in
    PAE compiles by eliminating the memset.
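
    On architectures with no special requirements the accessor can be a thin
    wrapper around memcpy(); a sketch of such a default, with an illustrative
    call that copies the kernel pgd entries into a freshly allocated pgd:

    /* copy 'count' pgd entries from src to dst (same page, non-overlapping) */
    #define clone_pgd_range(dst, src, count) \
            memcpy((dst), (src), (count) * sizeof(pgd_t))

    /* e.g. duplicate the kernel mappings from swapper_pg_dir */
    clone_pgd_range(pgd + USER_PTRS_PER_PGD,
                    swapper_pg_dir + USER_PTRS_PER_PGD,
                    PTRS_PER_PGD - USER_PTRS_PER_PGD);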

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     

26 Jun, 2005

1 commit


24 Jun, 2005

2 commits

  • This helps a lot when debugging out-of-memory problems - it is especially
    useful for seeing whether all the memory has been sucked into slab, etc.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin J. Bligh
     
  • This patch effectively eliminates direct use of pgdat->node_mem_map outside
    of the DISCONTIG code. On a flat memory system, these fields aren't
    currently used, nor are they on a sparsemem system.

    There was also a node_mem_map(nid) macro on many architectures. Its use,
    along with the use of ->node_mem_map itself, was not consistent. It has
    been removed in favor of two new, more explicit, arch-independent macros:

    pgdat_page_nr(pgdat, pagenr)
    nid_page_nr(nid, pagenr)

    I called them "pgdat" and "nid" because we overload the term "node" to mean
    "NUMA node", "DISCONTIG node" or "pg_data_t" in very confusing ways. I
    believe the newer names are much clearer.

    These macros can be overridden in the sparsemem case with a theoretically
    slower operation using node_start_pfn and pfn_to_page(), instead. We could
    make this the only behavior if people want, but I don't want to change too
    much at once. One thing at a time.
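
    A sketch of the two macros and the sparsemem-style override mentioned above
    (the config symbol selecting between the variants is illustrative here):

    #ifdef CONFIG_FLAT_NODE_MEM_MAP
    /* flat/discontig: index directly into the node's mem_map */
    #define pgdat_page_nr(pgdat, pagenr)    ((pgdat)->node_mem_map + (pagenr))
    #else
    /* sparsemem-style: go via the pfn instead of ->node_mem_map */
    #define pgdat_page_nr(pgdat, pagenr)    pfn_to_page((pgdat)->node_start_pfn + (pagenr))
    #endif

    #define nid_page_nr(nid, pagenr)        pgdat_page_nr(NODE_DATA(nid), (pagenr))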

    This patch removes more code than it adds.

    Compile tested on alpha, alpha discontig, arm, arm-discontig, i386, i386
    generic, NUMAQ, Summit, ppc64, ppc64 discontig, and x86_64. Full list
    here: http://sr71.net/patches/2.6.12/2.6.12-rc1-mhp2/configs/

    Boot tested on NUMAQ, x86 SMP and ppc64 power4/5 LPARs.

    Signed-off-by: Dave Hansen
    Signed-off-by: Martin J. Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

20 Apr, 2005

1 commit

  • Recent woes with some arches needing their own pgd_addr_end macro; the 4-level
    clear_page_range regression since 2.6.10's clear_page_tables; and the
    long-standing, well-known inefficiency of searching throughout the higher-level
    page tables for those few entries to clear and free: all can be blamed on
    ignoring the list of vmas when we free page tables.

    Replace exit_mmap's clear_page_range of the total user address space by
    free_pgtables operating on the mm's vma list; unmap_region uses it in the same
    way, giving floor and ceiling beyond which it may not free tables. This
    brings lmbench fork/exec/sh numbers back to 2.6.10 (unless preempt is enabled,
    in which case latency fixes spoil unmap_vmas throughput).
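
    A simplified sketch of the shape of that interface and its walk over the
    vma list (hugepage ranges and other special cases are omitted):

    void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
                       unsigned long floor, unsigned long ceiling)
    {
            while (vma) {
                    struct vm_area_struct *next = vma->vm_next;

                    /*
                     * Free page tables covering this vma, but never below
                     * floor nor above the next vma's start (or ceiling).
                     */
                    free_pgd_range(tlb, vma->vm_start, vma->vm_end,
                                   floor, next ? next->vm_start : ceiling);
                    vma = next;
            }
    }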

    Beware: the do_mmap_pgoff driver failure case must now use unmap_region
    instead of zap_page_range, since a page table might have been allocated, and
    can only be freed while it is touched by some vma.

    Move free_pgtables from mmap.c to memory.c, where its lower levels are adapted
    from the clear_page_range levels. (Most of free_pgtables' old code was
    actually for a non-existent case, prev not properly set up, dating from before
    hch gave us split_vma.) Pass mmu_gather** in the public interfaces, since we
    might want to add latency lockdrops later; but no attempt to do so yet, going
    by vma should itself reduce latency.

    But what if is_hugepage_only_range? Those ia64 and ppc64 cases need careful
    examination: put that off until a later patch of the series.

    What of x86_64's 32bit vdso page __map_syscall32 maps outside any vma?

    And the range to sparc64's flush_tlb_pgtables? It's less clear to me now that
    we need to do more than is done here - every PMD_SIZE ever occupied will be
    flushed, do we really have to flush every PGDIR_SIZE ever partially occupied?
    A shame to complicate it unnecessarily.

    Special thanks to David Miller for time spent repairing my ceilings.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds