31 Oct, 2005

1 commit

  • Add CONFIG_X86_32 for i386. This allows selecting options that only apply
    to 32-bit systems.

    (X86 && !X86_64) becomes X86_32
    (X86 || X86_64) becomes X86

    Signed-off-by: Brian Gerst
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Gerst
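
    As a rough illustration of what the new symbol buys (a sketch, not a hunk from
    the patch; the helper name is made up), 32-bit-only code can now test one
    symbol instead of two:

        /* Before: 32-bit-only code had to combine two symbols. */
        #if defined(CONFIG_X86) && !defined(CONFIG_X86_64)
        extern void setup_32bit_only(void);     /* hypothetical helper */
        #endif

        /* After: the dedicated symbol says it directly. */
        #ifdef CONFIG_X86_32
        extern void setup_32bit_only(void);
        #endif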
     

30 Oct, 2005

39 commits

  • Here is a set of ppc64-specific patches that at least allow
    compilation/booting with the following configurations:

    FLATMEM
    SPARSEMEM
    SPARSEMEM + MEMORY_HOTPLUG

    Signed-off-by: Mike Kravetz
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Adds the necessary support for non-NUMA hot-add of highmem to an existing
    zone on i386.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • pgdat->node_size_lock is basically only needed in one place in the normal
    code: show_mem(), which is the arch-specific sysrq-m printing function.

    Strictly speaking, the architectures not doing memory hotplug do not need this
    locking in show_mem(). However, they are all included for completeness. This
    should also make any future consolidation of all of the implementations a
    little more straightforward.

    This lock is also held in the sparsemem code during a memory removal, as
    sections are invalidated. This is the place where pfn_valid() is made false
    for a memory area that's being removed. The lock is only required when doing
    pfn_valid() operations on memory for which the user does not already hold a
    page reference, such as in show_mem().

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page: with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled.

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let preprocessor compare that with
    NR_CPUS. But I don't think it's worth being user-configurable: for good
    testing of both split and unsplit configs, split now at 4 cpus, and perhaps
    change that to 8 later.

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
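
    The caller-visible side of the split lock is the pte_offset_map_lock /
    pte_unmap_unlock pairing this series introduces; a minimal sketch of the
    usage pattern (the work done under the lock is elided):

        spinlock_t *ptl;
        pte_t *pte;

        /* Takes the per-page-table spinlock (or mm->page_table_lock
         * when split ptlock is not configured). */
        pte = pte_offset_map_lock(mm, pmd, address, &ptl);
        if (pte_present(*pte)) {
                /* ... operate on the entry while its page table is locked ... */
        }
        pte_unmap_unlock(pte, ptl);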
     
  • In worrying over the various pte operations in different architectures, I came
    across some unused functions in UML: remove mprotect_kernel_vm,
    protect_vm_page and addr_pte.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's usually a good reason when a pte is examined without the lock; but it
    makes me nervous when the pointer is dereferenced more than once.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
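
    The concern is a pte changing under a concurrent fault between two reads; a
    hedged sketch of the safer idiom (the tests and the action taken here are
    only illustrative):

        /* Risky: *ptep is dereferenced twice and may change in between. */
        if (pte_present(*ptep) && pte_young(*ptep))
                mark_accessed();                /* hypothetical action */

        /* Safer: snapshot the entry once, then test the local copy. */
        pte_t entry = *ptep;
        if (pte_present(entry) && pte_young(entry))
                mark_accessed();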
     
  • The cris v32 switch_mm guards get_mmu_context with next->page_table_lock: a
    good thing it's not really SMP yet, since get_mmu_context messes with global
    variables affecting other mms. Replace it with a global mmu_context_lock.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's a worrying function translation_exists in parisc cacheflush.h,
    unaffected by split ptlock since flush_dcache_page is using it on some other
    mm, without any relevant lock. Oh well, make it slightly more robust by
    factoring the pfn check within it. And it looked liable to confuse a
    camouflaged swap or file entry with a good pte: fix that too.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Prepare arm for the split page_table_lock: three issues.

    Signal handling's preserve and restore of iwmmxt context currently involves
    reading and writing that context to and from user space, while holding
    page_table_lock to secure the user page(s) against kswapd. If we split the
    lock, then the structure might span two pages, secured by different locks.
    It's much simpler just to read into and write from a kernel stack buffer,
    copying that out and in without locking (the
    structure is 160 bytes in size, and here we're near the top of the kernel
    stack). Or would the overhead be noticeable?

    arm_syscall's cmpxchg emulation uses pte_offset_map_lock, instead of
    pte_offset_map and mm-wide page_table_lock; and strictly, it should now also
    take mmap_sem before descending to pmd, to guard against another thread
    munmapping, and the page table being pulled out from beneath this thread.

    Updated two comments in fault-armv.c. adjust_pte is interesting, since its
    modification of a pte in one part of the mm depends on the lock held when
    calling update_mmu_cache for a pte in some other part of that mm. This can't
    be done with a split page_table_lock (and we've already taken the lowest lock
    in the hierarchy here): so we'll have to disable split on arm, unless
    CONFIG_CPU_CACHE_VIPT ensures adjust_pte is never used.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
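
    A hedged sketch of the stack-buffer alternative floated above (the 160-byte
    size comes from the text; the function and the coprocessor dump helper are
    hypothetical, not the actual arm signal code):

        #define IWMMXT_CTX_SIZE 160     /* size quoted above */

        static int preserve_iwmmxt_ctx(void __user *uptr)
        {
                char buf[IWMMXT_CTX_SIZE];      /* near the top of the kernel stack */

                dump_iwmmxt_state(buf);         /* hypothetical: write context to buffer */
                /* Copy out without holding any page table lock. */
                return copy_to_user(uptr, buf, sizeof(buf)) ? -EFAULT : 0;
        }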
     
  • Use pte_offset_map_lock, instead of pte_offset_map (or inappropriate
    pte_offset_kernel) and mm-wide page_table_lock, in sundry arch places.

    The i386 vm86 mark_screen_rdonly: yes, there was and is an assumption that the
    screen fits inside the one page table, as indeed it does.

    The sh __do_page_fault: which handles both kernel faults (without lock) and
    user mm faults (locked - though it used set_pte without locking before).

    The sh64 flush_cache_range and helpers: which wrongly thought callers held
    page_table_lock before (only its tlb_start_vma did, and no longer does so);
    moved the flush loop down, and adjusted the large versus small range decision
    to consider a range which spans page tables as large.

    Signed-off-by: Hugh Dickins
    Acked-by: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • check_user_page_readable is a problematic variant of follow_page. It's used
    only by oprofile's i386 and arm backtrace code, at interrupt time, to
    establish whether a userspace stackframe is currently readable.

    This is problematic, because we want to push the page_table_lock down inside
    follow_page, and later split it; whereas oprofile is doing a spin_trylock on
    it (in the i386 case, forgotten in the arm case), and needs that to pin
    perhaps two pages spanned by the stackframe (which might be covered by
    different locks when we split).

    I think oprofile is going about this in the wrong way: it doesn't need to know
    the area is readable (neither i386 nor arm uses read protection of user
    pages), it doesn't need to pin the memory, it should simply
    __copy_from_user_inatomic, and see if that succeeds or not. Sorry, but I've
    not got around to devising the sparse __user annotations for this.

    Then we can eliminate check_user_page_readable, and return to a single
    follow_page without the __follow_page variants.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
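
    The suggested approach amounts to attempting the copy at interrupt time and
    checking the result, rather than probing readability first; a minimal sketch
    (the stackframe layout and helper are simplified and hypothetical):

        struct frame_head {             /* hypothetical user stackframe layout */
                unsigned long ebp;
                unsigned long ret;
        };

        static int read_user_frame(const struct frame_head __user *uframe,
                                   struct frame_head *kframe)
        {
                /* Non-zero return means some bytes could not be copied. */
                if (__copy_from_user_inatomic(kframe, uframe, sizeof(*kframe)))
                        return -EFAULT;
                return 0;
        }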
     
  • There was one small but very significant change in the previous patch:
    mprotect's flush_tlb_range fell outside the page_table_lock: as it is in 2.4,
    but that doesn't prove it safe in 2.6.

    On some architectures flush_tlb_range comes to the same as flush_tlb_mm, which
    has always been called from outside page_table_lock in dup_mmap, and is so
    proved safe. Others required a deeper audit: I could find no reliance on
    page_table_lock in any; but in ia64 and parisc found some code which looks a
    bit as if it might want preemption disabled. That won't do any actual harm,
    so pending a decision from the maintainers, disable preemption there.

    Remove comments on page_table_lock from flush_tlb_mm, flush_tlb_range and
    flush_tlb_page entries in cachetlb.txt: they were rather misleading (what
    generic code does is different from what usually happens), the rules are now
    changing, and it's not yet clear where we'll end up (will the generic
    tlb_flush_mmu happen always under lock? never under lock? or sometimes under
    and sometimes not?).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
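
    The stop-gap for ia64 and parisc mentioned above amounts to bracketing the
    suspect flush code with the preemption API; a minimal sketch (the flush body
    itself is elided):

        preempt_disable();
        /* ... per-cpu TLB flushing that may assume it stays on this cpu ... */
        preempt_enable();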
     
  • Convert those few architectures which are calling pud_alloc, pmd_alloc,
    pte_alloc_map on a user mm, not to take the page_table_lock first, nor drop it
    after. Each of these can continue to use pte_alloc_map, no need to change
    over to pte_alloc_map_lock, they're neither racy nor swappable.

    In the sparc64 io_remap_pfn_range, flush_tlb_range then falls outside of the
    page_table_lock: that's okay, on sparc64 it's like flush_tlb_mm, and that has
    always been called from outside of page_table_lock in dup_mmap.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • First step in pushing down the page_table_lock. init_mm.page_table_lock has
    been used throughout the architectures (usually for ioremap): not to serialize
    kernel address space allocation (that's usually vmlist_lock), but because
    pud_alloc, pmd_alloc, and pte_alloc_kernel expect the caller to hold it.

    Reverse that: don't lock or unlock init_mm.page_table_lock in any of the
    architectures; instead rely on pud_alloc, pmd_alloc, and pte_alloc_kernel to
    take and drop it when allocating a new one, to check lest a racing task already
    did. Similarly no page_table_lock in vmalloc's map_vm_area.

    Some temporary ugliness in __pud_alloc and __pmd_alloc: since they also handle
    user mms, which are converted only by a later patch, for now they have to lock
    differently according to whether or not it's init_mm.

    If sources get muddled, there's a danger that an arch source taking
    init_mm.page_table_lock will be mixed with common source also taking it (or
    neither take it). So break the rules and make another change, which should
    break the build for such a mismatch: remove the redundant mm arg from
    pte_alloc_kernel (ppc64 scrapped its distinct ioremap_mm in 2.6.13).

    Exceptions: arm26 used pte_alloc_kernel on user mm, now pte_alloc_map; ia64
    used pte_alloc_map on init_mm, now pte_alloc_kernel; parisc had bad args to
    pmd_alloc and pte_alloc_kernel in unused USE_HPPA_IOREMAP code; ppc64
    map_io_page forgot to unlock on failure; ppc mmu_mapin_ram and ppc64 im_free
    took page_table_lock for no good reason.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
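
    The caller-visible effect, roughly: ioremap-style code no longer takes
    init_mm.page_table_lock, and pte_alloc_kernel loses its mm argument. A hedged
    sketch of the after-state (the remap helper here is simplified, not an actual
    hunk from the patch):

        static int map_one_page(pmd_t *pmd, unsigned long addr,
                                unsigned long pfn, pgprot_t prot)
        {
                /* pte_alloc_kernel now takes and drops the lock internally. */
                pte_t *pte = pte_alloc_kernel(pmd, addr);

                if (!pte)
                        return -ENOMEM;
                set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
                return 0;
        }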
     
  • ia64 has expand_backing_store function for growing its Register Backing Store
    vma upwards. But more complete code for this purpose is found in the
    CONFIG_STACK_GROWSUP part of mm/mmap.c. Uglify its #ifdefs further to provide
    expand_upwards for ia64 as well as expand_stack for parisc.

    The Register Backing Store vma should be marked VM_ACCOUNT. Implement the
    intention of growing it only a page at a time, instead of passing an address
    outside of the vma to handle_mm_fault, with unknown consequences.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove PageReserved() calls from core code by tightening VM_RESERVED
    handling in mm/ to cover PageReserved functionality.

    PageReserved special casing is removed from get_page and put_page.

    All setting and clearing of PageReserved is retained, and it is now flagged
    in the page_alloc checks to help ensure we don't introduce any refcount
    based freeing of Reserved pages.

    MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
    deprecated. We never completely handled it correctly anyway, and it can be
    reintroduced in future if required (Hugh has a proof of concept).

    Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
    arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
    be trivially removed.

    Last real user of PageReserved is swsusp, which uses PageReserved to
    determine whether a struct page points to valid memory or not. This still
    needs to be addressed (a generic page_is_ram() should work).

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
    thus mapcounted and counted towards shared rss). These writes to the struct
    page could cause excessive cacheline bouncing on big systems. There are a
    number of ways this could be addressed if it is an issue.

    Signed-off-by: Nick Piggin

    Refcount bug fix for filemap_xip.c

    Signed-off-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Please, please now delete the Atari CONFIG_STRAM_SWAP code. It may be
    excellent and ingenious code, but its reference to swap_vfsmnt betrays that it
    hasn't been built since 2.5.1 (four years old come December), it's delving
    deep into matters which are the preserve of core mm code, its only purpose is
    to give the more conscientious mm guys an anxiety attack from time to time;
    yet we keep on breaking it more and more.

    If you want to use RAM for swap, then if the MTD driver does not already
    provide just what you need, I'm sure David could be persuaded to add the
    extra. But you'd also like to be able to allocate extents of that swap for
    other use: we can give you a core interface for that if you need. But unbuilt
    for four years suggests to me that there's no need at all.

    I cannot swear the patch below won't break your build, but I believe it won't.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The sh64 hugetlbpage.c seems to be erroneous, left over from a bygone age,
    clashing with the common hugetlb.c. Replace it by a copy of the sh
    hugetlbpage.c. Except, delete the mk_pte_huge macro, which neither uses.

    Signed-off-by: Hugh Dickins
    Acked-by: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • How is anon_rss initialized? In dup_mmap, and by mm_alloc's memset; but
    that's not so good if an mm_counter_t is a special type. And how is rss
    initialized? By set_mm_counter, all over the place. Come on, we just need to
    initialize them both at once by set_mm_counter in mm_init (which follows the
    memcpy when forking).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
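
    A minimal sketch of the consolidated initialization described above (mm_init
    is abbreviated to just the counter setup):

        static struct mm_struct *mm_init(struct mm_struct *mm)
        {
                /* ... other fields are set up elsewhere in the real function ... */
                set_mm_counter(mm, rss, 0);
                set_mm_counter(mm, anon_rss, 0);
                /* ... */
                return mm;
        }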
     
  • zap_pte_range has been counting the pages it frees in tlb->freed, then
    tlb_finish_mmu has used that to update the mm's rss. That got stranger when I
    added anon_rss, yet updated it by a different route; and stranger when rss and
    anon_rss became mm_counters with special access macros. And it would no
    longer be viable if we're relying on page_table_lock to stabilize the
    mm_counter, but calling tlb_finish_mmu outside that lock.

    Remove the mmu_gather's freed field, let tlb_finish_mmu stick to its own
    business, just decrement the rss mm_counter in zap_pte_range (yes, there was
    some point to batching the update, and a subsequent patch restores that). And
    forget the anal paranoia of first reading the counter to avoid going negative
    - if rss does go negative, just fix that bug.

    Remove the mmu_gather's flushes and avoided_flushes from arm and arm26: no use
    was being made of them. But arm26 alone was actually using the freed, in the
    way some others use need_flush: give it a need_flush. arm26 seems to prefer
    spaces to tabs here: respect that.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • tlb_is_full_mm? What does that mean? The TLB is full? No, it means that the
    mm's last user has gone and the whole mm is being torn down. And it's an
    inline function because sparc64 uses a different (slightly better)
    "tlb_frozen" name for the flag others call "fullmm".

    And now the ptep_get_and_clear_full macro used in zap_pte_range refers
    directly to tlb->fullmm, which would be wrong for sparc64. Rather than
    correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change
    sparc64 to just use the same poor name as everyone else - is that okay?

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The original vm_stat_account has fallen into disuse, with only one user, and
    only one user of vm_stat_unaccount. It's easier to keep track if we convert
    them all to __vm_stat_account, then free it from its __shackles.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Linus Torvalds
     
  • Patch from Nicolas Pitre

    Since vmlinux.lds.S is preprocessed, we can use the defines already
    present in asm/memory.h (allowed by patch #3060) for the XIP kernel link
    address instead of relying on a duplicated Makefile hardcoded value, and
    also get rid of its dependency on awk to handle it at the same time.

    While at it let's clean XIP stuff even further and make things clearer
    in head.S with a nice code reduction.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Russell King

    Nicolas Pitre
     
  • Patch from Nicolas Pitre

    This patch allows for assorted type of cleanups by letting assembly code
    use the same set of defines for constant values and avoid duplicated
    definitions that might not always be in sync, or that might simply be
    confusing due to the different names for the same thing.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Russell King

    Nicolas Pitre
     
  • Signed-off-by: Ralf Baechle

    Ralf Baechle
     
  • Some boards declare prom_free_prom_memory as a void function but the
    caller free_initmem() expects a return value.

    Fix those up and return 0 instead, just like everyone else does.

    Signed-off-by: Arthur Othieno
    Signed-off-by: Ralf Baechle

    Arthur Othieno
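
    In other words, the affected boards' stubs change from a void function to one
    that reports zero bytes reclaimed; a minimal sketch of the fixed-up shape (the
    board file is left unspecified):

        /* Was: void prom_free_prom_memory(void) { } */
        unsigned long prom_free_prom_memory(void)
        {
                /* Nothing to free on this board; report zero bytes reclaimed. */
                return 0;
        }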
     
  • by emulation of a full FPU.

    Signed-off-by: Ralf Baechle

    Ralf Baechle
     
  • Signed-off-by: Ralf Baechle

    Ralf Baechle
     
  • Signed-off-by: Ralf Baechle

    Ralf Baechle
     
  • Signed-off-by: Ralf Baechle

    Ralf Baechle
     
  • Signed-off-by: Ralf Baechle

    Ralf Baechle
     
  • Signed-off-by: Ralf Baechle

    Ralf Baechle
     
  • Prefetching may be fatal on some systems if we're prefetching beyond the
    end of memory. It's also a seriously bad idea on non-dma-coherent systems.

    Signed-off-by: Ralf Baechle

    Ralf Baechle
     
  • Limit the number of cpu type options in the cpu menu to just those
    types that are actually available for the selected platform.

    Signed-off-by: Ralf Baechle

    Ralf Baechle
     
  • CFE 1.2.5 and earlier fails to turn on the ExpMemEn bit in the
    PCIFeatureControl register, which means that DMA does not work
    beyond physical address 01_0000_0000, ergo to DRAM beyond 1GB.

    With ExpMemEn turned on, 01_0000_0000-0f_ffff_ffff is mapped,
    so DMA works for up to 61 GB of DRAM.

    Will be fixed in CFE 1.2.6 (yet to be released).

    Signed-Off-By: Andy Isaacson
    Signed-off-by: Ralf Baechle

    Andrew Isaacson
     
  • PCI support code for PLX 7250 PCI-X tunnel on BCM91480B BigSur board.

    Signed-Off-By: Andy Isaacson
    Signed-off-by: Ralf Baechle

    Andrew Isaacson
     
  • Signed-Off-By: Andy Isaacson
    Signed-off-by: Ralf Baechle

    Andrew Isaacson
     
  • Expand SB1 cache error handling by adding SB1_CEX_ALWAYS_FATAL and
    SB1_CEX_STALL, allowing configurable behavior on cache errors.

    Signed-Off-By: Andy Isaacson
    Signed-off-by: Ralf Baechle

    Andrew Isaacson