07 Jan, 2009

1 commit

  • commit 8308c54d7e312f7a03e2ce2057d0837e6fe3843f ("generic: redefine
    resource_size_t as phys_addr_t") made CONFIG_RESOURCES_64BIT obsolete, but
    didn't remove it. Remove it.
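
    The change in question boils down to a typedef in
    include/linux/types.h; roughly (a sketch from memory, including the
    phys_addr_t definition it builds on):

        #ifdef CONFIG_PHYS_ADDR_T_64BIT
        typedef u64 phys_addr_t;
        #else
        typedef u32 phys_addr_t;
        #endif

        typedef phys_addr_t resource_size_t;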

    Signed-off-by: Geert Uytterhoeven
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

20 Oct, 2008

1 commit

  • When the system contains lots of mlocked or otherwise unevictable pages,
    the pageout code (kswapd) can spend lots of time scanning over these
    pages. Worse still, the presence of lots of unevictable pages can confuse
    kswapd into thinking that more aggressive pageout modes are required,
    resulting in all kinds of bad behaviour.

    This patch provides the infrastructure to manage pages excluded from
    reclaim, i.e. hidden from vmscan. It is based on a patch by Larry
    Woodman of Red Hat, reworked to maintain "unevictable" pages on a
    separate per-zone LRU list, to "hide" them from vmscan.

    Kosaki Motohiro added support for the memory controller unevictable
    lru list.

    Pages on the unevictable list have both PG_unevictable and PG_lru set.
    Thus, PG_unevictable is analogous to and mutually exclusive with
    PG_active--it specifies which LRU list the page is on.

    The unevictable infrastructure is enabled by a new mm Kconfig option
    [CONFIG_]UNEVICTABLE_LRU.

    A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
    not a page may be evictable. Subsequent patches will add the various
    !evictable tests. We'll want to keep these tests light-weight for use in
    shrink_active_list() and, possibly, the fault path.

    To avoid races between tasks putting pages [back] onto an LRU list and
    tasks that might be moving the page from non-evictable to evictable state,
    the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
    -- tests the "evictability" of a page after placing it on the LRU, before
    dropping the reference. If the page has become unevictable,
    putback_lru_page() will redo the 'putback', thus moving the page to the
    unevictable list. This way, we avoid "stranding" evictable pages on the
    unevictable list.
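
    A condensed sketch of that recheck (paraphrased from the mm/vmscan.c
    logic this patch adds; helper names are the patch's, the body is
    simplified and not verbatim):

        void putback_lru_page(struct page *page)
        {
        redo:
                if (page_evictable(page, NULL)) {
                        /* Evictable: back onto the normal LRU lists. */
                        lru_cache_add_lru(page, page_lru(page));
                } else {
                        /* Unevictable: onto the per-zone unevictable
                           list, hidden from vmscan. */
                        add_page_to_unevictable_list(page);
                }
                /* The page's state may have changed while we were
                   putting it back.  If it landed on the unevictable
                   list but has become evictable, isolate it and redo
                   the putback, so no evictable page is stranded on the
                   unevictable list. */
                if (PageUnevictable(page) && page_evictable(page, NULL) &&
                    !isolate_lru_page(page))
                        goto redo;
                put_page(page);         /* drop the isolation reference */
        }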

    [akpm@linux-foundation.org: fix fallout from out-of-order merge]
    [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
    [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
    [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
    [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
    [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
    [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
    [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Benjamin Kidwell
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

17 Oct, 2008

2 commits

  • * 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails
    softirq, warning fix: correct a format to avoid a warning
    softirqs, debug: preemption check
    x86, pci-hotplug, calgary / rio: fix EBDA ioremap()
    IO resources, x86: ioremap sanity check to catch mapping requests exceeding, fix
    IO resources, x86: ioremap sanity check to catch mapping requests exceeding the BAR sizes
    softlockup: Documentation/sysctl/kernel.txt: fix softlockup_thresh description
    dmi scan: warn about too early calls to dmi_check_system()
    generic: redefine resource_size_t as phys_addr_t
    generic: make PFN_PHYS explicitly return phys_addr_t
    generic: add phys_addr_t for holding physical addresses
    softirq: allocate less vectors
    IO resources: fix/remove printk
    printk: robustify printk, update comment
    printk: robustify printk, fix #2
    printk: robustify printk, fix
    printk: robustify printk

    Fixed up conflicts in:
    arch/powerpc/include/asm/types.h
    arch/powerpc/platforms/Kconfig.cputype
    manually.

    Linus Torvalds
     
  • Using "def_bool n" is pointless, simply using bool here appears more
    appropriate.

    Further, retaining such options that don't have a prompt and aren't
    selected by anything seems also at least questionable.
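
    The cleanup is mechanical; an illustrative hunk of the kind this
    patch applies (the option name here is hypothetical):

        -config EXAMPLE_OPTION
        -       def_bool n
        +config EXAMPLE_OPTION
        +       bool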

    Signed-off-by: Jan Beulich
    Cc: Ingo Molnar
    Cc: Tony Luck
    Cc: Thomas Gleixner
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     

29 Jul, 2008

1 commit

  • With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
    pages. There are secondary MMUs (with secondary sptes and secondary
    tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
    spte in mmu-notifier context, I mean "secondary pte". In the GRU case
    there's no actual secondary pte and there's only a secondary tlb,
    because the GRU secondary MMU has no knowledge of sptes and every
    secondary tlb miss event in the MMU always generates a page fault that
    has to be resolved by the CPU (this is not the case for KVM, where a
    secondary tlb miss will walk sptes in hardware and will refill the
    secondary tlb transparently to software if the corresponding spte is
    present). The same way zap_page_range has to invalidate the pte before
    freeing the page, the spte (and secondary tlb) must also be
    invalidated before any page is freed and reused.

    Currently we take a page_count pin on every page mapped by sptes, but
    that means the pages can't be swapped while they're mapped by any
    spte, because they're part of the guest working set. Furthermore, a
    spte unmap event can immediately lead to a page being freed when the
    pin is released (requiring the same complex and relatively slow
    tlb_gather smp-safe logic we have in zap_page_range, which can be
    avoided entirely if the spte unmap event doesn't require an unpin of
    the page previously mapped in the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
    know when the VM is swapping or freeing or doing anything on the
    primary MMU, so that the secondary MMU code can drop sptes before the
    pages are freed, avoiding all page pinning and allowing 100% reliable
    swapping of guest physical address space. Furthermore, it spares the
    secondary-MMU teardown code from implementing tlb_gather-like logic
    as in zap_page_range, which would require many IPIs to flush other
    cpus' tlbs for each fixed number of sptes unmapped.

    To give an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect), the secondary MMU mappings
    will be invalidated, and the next secondary-mmu page fault will call
    get_user_pages; if it calls get_user_pages with write=1 that will
    trigger a do_wp_page, and it'll re-establish an updated spte or
    secondary-tlb mapping on the copied page. Or it will set up a readonly
    spte or readonly tlb mapping if it's a guest read, i.e. if it calls
    get_user_pages with write=0. This is just an example.

    This makes it possible to map any page pointed to by any pte (and in
    turn visible in the primary CPU MMU) into a secondary MMU (be it a
    pure tlb like the GRU, or a full MMU with both sptes and a secondary
    tlb like kvm's shadow-pagetable layer), or into a software remote-DMA
    scheme like XPMEM (hence the need to schedule in XPMEM code to send
    the invalidate to the remote node, while there's no need to schedule
    in kvm/gru as it's an immediate event like invalidating a primary-mmu
    pte).
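
    The notifier itself is a small ops table; roughly (a sketch abridged
    from memory of the include/linux/mmu_notifier.h this patch adds):

        struct mmu_notifier_ops {
                /* mm is going away: tear down all secondary mappings */
                void (*release)(struct mmu_notifier *mn,
                                struct mm_struct *mm);
                /* test-and-clear the accessed state of the sptes */
                int (*clear_flush_young)(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long address);
                /* a single pte was invalidated: drop the spte */
                void (*invalidate_page)(struct mmu_notifier *mn,
                                        struct mm_struct *mm,
                                        unsigned long address);
                /* bracket a range invalidation; see dependency 1) below */
                void (*invalidate_range_start)(struct mmu_notifier *mn,
                                               struct mm_struct *mm,
                                               unsigned long start,
                                               unsigned long end);
                void (*invalidate_range_end)(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long start,
                                             unsigned long end);
        };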

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to
    keep track of whether the VM is in the middle of an
    invalidate_range_begin/end critical section, with an atomic counter
    increased in range_begin and decreased in range_end. No secondary MMU
    page fault is allowed to map any spte or secondary tlb reference
    while the VM is in the middle of range_begin/end, as any page
    returned by get_user_pages in that critical section could later be
    freed without any further ->invalidate_page notification
    (invalidate_range_begin/end works on ranges and ->invalidate_page
    isn't called immediately before freeing the page). To stop all page
    freeing and pagetable overwrites, the mmap_sem must be taken in write
    mode and all other anon_vma/i_mmap locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be
    enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
    advantage of mmu notifiers, but this already makes it possible to
    compile a KVM external module against a kernel with mmu notifiers
    enabled, and from the next pull from kvm.git we'll start using them.
    GRU/XPMEM will also be able to continue development by enabling KVM=m
    in their config, until they submit all GRU/XPMEM GPLv2 code to the
    mainline kernel. Then they can also enable MMU_NOTIFIERS in the same
    way KVM does it (even if KVM=n). This guarantees nobody selects
    MMU_NOTIFIER=y if KVM, GRU and XPMEM are all =n.
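
    In Kconfig terms, the arrangement described in (2) looks roughly like
    this (an illustrative sketch; file locations and the KVM prompt text
    are from memory):

        # mm/Kconfig: no prompt, so MMU_NOTIFIER can only be selected
        config MMU_NOTIFIER
                bool

        # arch/x86/kvm/Kconfig: each user pulls the notifiers in
        config KVM
                tristate "Kernel-based Virtual Machine (KVM) support"
                select MMU_NOTIFIER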

    The mmu_notifier_register call can fail because mm_take_all_locks may
    be interrupted by a signal and return -EINTR. Because
    mmu_notifier_register is used at driver startup, a failure can be
    gracefully handled. Here is an example of the change applied to kvm
    to register the mmu notifiers. Usually when a driver starts up, other
    allocations are required anyway and -ENOMEM failure paths already
    exist.

    struct kvm *kvm_arch_create_vm(void)
    {
            struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +       int err;

            if (!kvm)
                    return ERR_PTR(-ENOMEM);

            INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +       if (err) {
    +               kfree(kvm);
    +               return ERR_PTR(err);
    +       }
    +
            return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would
    prevent the kernel from compiling after these changes on non-x86
    archs (x86 just happened not to need them).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Jul, 2008

1 commit

  • Implement get_user_pages_fast without locking in the fastpath on x86.

    Do an optimistic lockless pagetable walk, without taking any page
    table locks or even mmap_sem. Page table existence is guaranteed by
    turning interrupts off (combined with the fact that we're always
    looking up the current mm, this means we can do the lockless page
    table walk within the constraints of the TLB shootdown design).
    Basically we can do the lockless pagetable walk for much the same
    reason the CPU's pagetable walker does not have to take any locks to
    find present ptes.

    This patch (combined with the subsequent ones to convert direct IO to use
    it) was found to give about 10% performance improvement on a 2 socket 8
    core Intel Xeon system running an OLTP workload on DB2 v9.5.

    "To test the effects of the patch, an OLTP workload was run on an IBM
    x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
    2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
    runs with and without the patch resulted in an overall performance
    benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
    __up_read and __down_read routines that is seen during thread contention
    for system resources was reduced from 2.8% down to .05%. Monitoring the
    /proc/vmstat output from the patched run showed that the counter for
    fast_gup contained a very high number while the fast_gup_slow value was
    zero."

    (fast_gup is the old name for get_user_pages_fast, fast_gup_slow is a
    counter we had for the number of times the slowpath was invoked).

    The main reason for the improvement is that DB2 has multiple threads each
    issuing direct-IO. Direct-IO uses get_user_pages, and thus the threads
    contend the mmap_sem cacheline, and can also contend on page table locks.

    I would anticipate larger performance gains on larger systems;
    however, I think DB2 uses an adaptive mix of threads and processes,
    so it could be that thread contention remains pretty constant as
    machine size increases. In that case, we're stuck with "only" a 10%
    gain.

    The downside of using get_user_pages_fast is that if there is no pte
    with the correct permissions for the access, we end up falling back
    to get_user_pages, and so get_user_pages_fast is a bit of extra work.
    However, this should not be the common case in most
    performance-critical code.
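
    As a usage sketch, a direct-IO style caller would pin its buffer like
    this (the get_user_pages_fast() signature is the one introduced here;
    the wrapper function itself is illustrative):

        /* Pin the pages backing a user buffer before issuing DMA. */
        static int pin_user_buffer(unsigned long uaddr, int nr_pages,
                                   struct page **pages)
        {
                /* Lockless fast path; internally falls back to the
                   mmap_sem-taking get_user_pages() when a pte with the
                   needed permissions is not present. */
                return get_user_pages_fast(uaddr, nr_pages, 1 /* write */,
                                           pages);
        }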

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: Kconfig fix]
    [akpm@linux-foundation.org: Makefile fix/cleanup]
    [akpm@linux-foundation.org: warning fix]
    Signed-off-by: Nick Piggin
    Cc: Dave Kleikamp
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Andi Kleen
    Cc: Dave Kleikamp
    Cc: Badari Pulavarty
    Cc: Zach Brown
    Cc: Jens Axboe
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

25 Jul, 2008

1 commit

  • We'd like to support CONFIG_MEMORY_HOTREMOVE on s390, which depends on
    CONFIG_MIGRATION. So far, CONFIG_MIGRATION is only available with NUMA
    support.

    This patch makes CONFIG_MIGRATION selectable for architectures that define
    ARCH_ENABLE_MEMORY_HOTREMOVE. When MIGRATION is enabled w/o NUMA, the
    kernel won't compile because migrate_vmas() does not know about
    vm_ops->migrate() and vma_migratable() does not know about policy_zone.
    To fix this, those two functions can be restricted to '#ifdef CONFIG_NUMA'
    because they are not being used w/o NUMA. vma_migratable() is moved over
    from migrate.h to mempolicy.h.
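
    The resulting mm/Kconfig entry looks roughly like this (a sketch;
    help text omitted):

        config MIGRATION
                bool "Page migration"
                def_bool y
                depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE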

    [kosaki.motohiro@jp.fujitsu.com: build fix]
    Acked-by: Christoph Lameter
    Signed-off-by: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

15 Jul, 2008

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/hskinnemoen/avr32-2.6: (31 commits)
    avr32: Fix typo of IFSR in a comment in the PIO header file
    avr32: Power Management support ("standby" and "mem" modes)
    avr32: Add system device for the internal interrupt controller (intc)
    avr32: Add simple SRAM allocator
    avr32: Enable SDRAMC clock at startup
    rtc-at32ap700x: Enable wakeup
    macb: Basic suspend/resume support
    atmel_serial: Drain console TX shifter before suspending
    atmel_serial: Fix build on avr32 with CONFIG_PM enabled
    avr32: Use a quicklist for PTE allocation as well
    avr32: Use a quicklist for PGD allocation
    avr32: Cover the kernel page tables in the user PGDs
    avr32: Store virtual addresses in the PGD
    avr32: Remove useless zeroing of swapper_pg_dir at startup
    avr32: Clean up and optimize the TLB operations
    avr32: Rename at32ap.c -> pdc.c
    avr32: Move setup_platform() into chip-specific file
    avr32: Kill special exception handler sections
    avr32: Kill unneeded #include from asm/mmu_context.h
    avr32: Clean up time.c #includes
    ...

    Linus Torvalds
     

28 Apr, 2008

1 commit

  • Having separate page flags for the head and the tail of a compound page allows
    the compiler to use bitops instead of operations on a word to check for a tail
    page. That is f.e. important for virt_to_head_page() which is used in
    various critical code paths (kfree for example):

    Code for PageTail(page)

    Before:

    mov    (%rdi),%rdx        # page->flags
    mov    %rdx,%rax          # 3 bytes
    and    $0x12000,%eax      # 5 bytes
    cmp    $0x12000,%rax      # 6 bytes
    je     897

    After:

    mov    (%rdi),%rax
    test   $0x40,%ah          # 3 bytes
    jne    887

    So we go from 14 bytes to 3 bytes and from 3 instructions to one. From the
    use of 2 registers we go to none.

    We can only use page flags for this if we have page flags available.
    This patch introduces CONFIG_PAGEFLAGS_EXTENDED, which is set if page
    flags are not scarce (they are scarce when SPARSEMEM uses page flags
    for its section id on 32 bit NUMA platforms).
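
    A simplified sketch of the two PageTail() encodings (paraphrased from
    include/linux/page-flags.h; the real code is generated by macros):

        #ifdef CONFIG_PAGEFLAGS_EXTENDED
        /* Dedicated bit: the single-instruction test shown above. */
        static inline int PageTail(struct page *page)
        {
                return test_bit(PG_tail, &page->flags);
        }
        #else
        /* Scarce flags: a tail page is encoded as PG_compound and
           PG_reclaim both set, hence the mask-and-compare above. */
        #define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim))
        static inline int PageTail(struct page *page)
        {
                return (page->flags & PG_head_tail_mask) == PG_head_tail_mask;
        }
        #endif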

    Additional page flag definitions can be added to the CONFIG_PAGEFLAGS_EXTENDED
    section in page-flags.h if the functionality depends on PAGEFLAGS_EXTENDED or
    if more page flag overlapping tricks are used for the !PAGEFLAGS_EXTENDED
    fallback (the upcoming virtual compound patch may hook in here and Rik's/Lee's
    additional page flags to solve the reclaim issues could also be added there
    [hint... hint... where are these patchsets?]).

    Avoiding the overlaying of PG_reclaim also clears the way for possible
    use of compound pages for the pagecache or on the LRU.

    Signed-off-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

18 Dec, 2007

1 commit

  • SPARSEMEM_VMEMMAP needs to be a selectable config option to support
    building the kernel both with and without sparsemem vmemmap support. This
    selection is desirable for platforms which could be configured one way for
    platform specific builds and the other for multi-platform builds.
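
    The resulting mm/Kconfig shape, roughly (a sketch; help text
    omitted): platforms opt in by selecting the _ENABLE symbol, and the
    user-visible option depends on it:

        config SPARSEMEM_VMEMMAP_ENABLE
                def_bool n

        config SPARSEMEM_VMEMMAP
                bool "Sparse Memory virtual memmap"
                depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE
                default y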

    Signed-off-by: Miguel Botón
    Signed-off-by: Geoff Levand
    Acked-by: Yasunori Goto
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geoff Levand
     

18 Oct, 2007

1 commit

  • * 'xen-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
    xfs: eagerly remove vmap mappings to avoid upsetting Xen
    xen: add some debug output for failed multicalls
    xen: fix incorrect vcpu_register_vcpu_info hypercall argument
    xen: ask the hypervisor how much space it needs reserved
    xen: lock pte pages while pinning/unpinning
    xen: deal with stale cr3 values when unpinning pagetables
    xen: add batch completion callbacks
    xen: yield to IPI target if necessary
    Clean up duplicate includes in arch/i386/xen/
    remove dead code in pgtable_cache_init
    paravirt: clean up lazy mode handling
    paravirt: refactor struct paravirt_ops into smaller pv_*_ops

    Linus Torvalds
     

17 Oct, 2007

3 commits

  • When a pagetable is created, it is made globally visible in the rmap
    prio tree before it is pinned via arch_dup_mmap(), and remains in the
    rmap tree while it is unpinned with arch_exit_mmap().

    This means that other CPUs may race with the pinning/unpinning
    process and see a pte in the window between when it gets marked RO
    and when it is actually pinned, causing any pte updates to fail with
    write-protect faults.

    As a result, all pte pages must be properly locked, and only unlocked
    once the pinning/unpinning process has finished.

    In order to avoid taking spinlocks for the whole pagetable - which may
    overflow the PREEMPT_BITS portion of preempt counter - it locks and pins
    each pte page individually, and then finally pins the whole pagetable.

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Andrew Morton
    Cc: Andi Kleen
    Cc: Keir Fraser
    Cc: Jan Beulich

    Jeremy Fitzhardinge
     
  • Logic:
    - Set all pages in [start, end) to the isolated migration type; this
    makes all free pages in the range not-for-use.
    - Migrate all LRU pages in the range.
    - Test whether the refcount of every page in the range is zero.
    (A pseudo-flow of these steps is sketched below, after the todo
    list.)

    Todo:
    - Allocate migration destination pages from a better area.
    - Confirm that a page with page_count(page) == 0 && PageReserved(page)
    is safe to be freed.. (I don't like this kind of page but..)
    - Find out which pages cannot be migrated.
    - More runtime testing.
    - Use reclaim for unplugging other memory-type areas.
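
    The promised pseudo-flow of the logic (simplified; the helper names
    follow the page-isolation interfaces of this series, not verbatim
    code):

        int offline_pages(unsigned long start_pfn, unsigned long end_pfn)
        {
                int ret;

                /* 1. Mark the range isolated: its free pages become
                   not-for-use by the page allocator. */
                ret = start_isolate_page_range(start_pfn, end_pfn);
                if (ret)
                        return ret;

                /* 2. Migrate every LRU page out of the range. */
                do_migrate_range(start_pfn, end_pfn);

                /* 3. Check that no page in the range is still referenced. */
                ret = test_pages_isolated(start_pfn, end_pfn);
                if (ret) {
                        /* Still in use: roll back the isolation. */
                        undo_isolate_page_range(start_pfn, end_pfn);
                        return ret;
                }
                return 0;       /* range can now be unplugged */
        }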

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Convert the common vmemmap population into initialisation helpers for
    use by architecture vmemmap populators. All architectures
    implementing the SPARSEMEM_VMEMMAP variant supply an
    architecture-specific vmemmap_populate() initialiser, which may make
    use of the helpers.

    This allows us to clean up and remove the initialisation Kconfig entries.
    With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to
    indicate use of that variant.
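
    With this, an architecture's involvement reduces to one hook plus the
    Kconfig opt-in; roughly (signature as I recall it from this series):

        /* Supplied by each SPARSEMEM_VMEMMAP arch, typically built from
           the generic vmemmap pgd/pud/pmd/pte population helpers. */
        int vmemmap_populate(struct page *start_page,
                             unsigned long nr_pages, int node);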

    Signed-off-by: Andy Whitcroft
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

07 Oct, 2007

1 commit

  • When pinning and unpinning pagetables, we must protect them against
    being used by other CPUs, lest they see the pagetable in an
    intermediate read-only-but-not-pinned state.

    When using split pte locks, doing this properly would require taking
    all the pte locks for the pagetable while pinning, but this may
    overflow the PREEMPT_BITS part of the preempt counter if the process
    has mapped more than about 512M of memory (each pte lock held bumps
    the preempt count, PREEMPT_BITS leaves room for roughly 256 nested
    locks, and with PAE each pte page covers 2MB of address space).

    However, failing to take the pte locks causes write-protect faults
    when the pageout code is trying to clear the Access bit on a pte
    which is part of a freshly created, still-being-pinned process after
    fork.

    This is a short-term fix until the problem is solved properly.

    Signed-off-by: Jeremy Fitzhardinge
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Andrew Morton
    Cc: Andi Kleen
    Cc: Keir Fraser
    Cc: Jan Beulich
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     

18 Jul, 2007

1 commit

  • The bounce buffer logic is included on systems that do not need it.
    If a system does not have zones like ZONE_DMA and ZONE_HIGHMEM that
    can lead to the use of bounce buffers, then there is no need to
    reserve memory pools etc. This is true f.e. for SGI Altix.

    Also nicifies the Makefile and gets rid of the tricky "and" there.

    Signed-off-by: Christoph Lameter
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

17 Jul, 2007

2 commits

  • * master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6: (68 commits)
    sh: sh-rtc support for SH7709.
    sh: Revert __xdiv64_32 size change.
    sh: Update r7785rp defconfig.
    sh: Export div symbols for GCC 4.2 and ST GCC.
    sh: fix race in parallel out-of-tree build
    sh: Kill off dead mach.c for hp6xx.
    sh: hd64461.h cleanup and added comments.
    sh: Update the alignment when 4K stacks are used.
    sh: Add a .bss.page_aligned section for 4K stacks.
    sh: Don't let SH-4A clobber SH-4 CFLAGS.
    sh: Add parport stub for SuperIO ports.
    sh: Drop -Wa,-dsp for DSP tuning.
    sh: Update dreamcast defconfig.
    fb: pvr2fb: A few more __devinit annotations for PCI.
    fb: pvr2fb: Fix up section mismatch warnings.
    sh: Select IPR-IRQ for SH7091.
    sh: Correct __xdiv64_32/div64_32 return value size.
    sh: Fix timer-tmu build for SH-3.
    sh: Add cpu and mach links to CLEAN_FILES.
    sh: Preliminary support for the SH-X3 CPU.
    ...

    Linus Torvalds
     
  • Make some offending drivers depend on it and set CONFIG_ARCH_NO_VIRT_TO_BUS
    for ppc64 so that we don't build those drivers.

    This gets PowerPC allmodconfig and allyesconfig much closer to building.

    Signed-off-by: Stephen Rothwell
    Cc: Al Viro
    Acked-by: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     

08 May, 2007

1 commit

  • On x86_64 this cuts allocation overhead for page table pages down to
    a fraction (kernel compile / editing load; TSC-based measurement of
    time spent in each function, listed as calls total-time(min/avg/max)):

    no quicklist

    pte_alloc 1569048 4.3s(401ns/2.7us/179.7us)
    pmd_alloc 780988 2.1s(337ns/2.7us/86.1us)
    pud_alloc 780072 2.2s(424ns/2.8us/300.6us)
    pgd_alloc 260022 1s(920ns/4us/263.1us)

    quicklist:

    pte_alloc 452436 573.4ms(8ns/1.3us/121.1us)
    pmd_alloc 196204 174.5ms(7ns/889ns/46.1us)
    pud_alloc 195688 172.4ms(7ns/881ns/151.3us)
    pgd_alloc 65228 9.8ms(8ns/150ns/6.1us)

    pgd allocations are the most complex, and there we see the most
    dramatic improvement (maybe we can cut down the number of pgds cached
    somewhat?). But even the pte allocations still see a doubling of
    performance.

    1. Proven code from the IA64 arch.

    The method used here has been fine-tuned for years and
    is NUMA aware. It is based on the knowledge that accesses
    to page table pages are sparse in nature. Taking a page
    off the freelists instead of allocating a zeroed page
    reduces the number of cachelines touched,
    in addition to getting rid of the slab overhead. So
    performance improves. This is particularly useful if pgds
    contain standard mappings. We can save on the teardown
    and setup of such a page if we have some on the quicklists.
    This includes avoiding list operations that are otherwise
    necessary on alloc and free to track pgds.

    2. Lightweight alternative to using slab to manage page-size pages

    Slab overhead is significant, and even page allocator use
    is pretty heavyweight. The use of a per-cpu quicklist
    means that we touch only two cachelines for an allocation.
    There is no need to access the page_struct (unless arch code
    needs to fiddle around with it). So the fast path just
    means bringing in one cacheline at the beginning of the
    page. That same cacheline may then be used to store the
    page table entry. Or a second cacheline may be used
    if the page table entry is not in the first cacheline of
    the page. The current code will zero the page, which means
    touching 32 cachelines (assuming 128-byte cachelines). We
    get down from 32 to 2 cachelines in the fast path.

    3. x86_64 gets lightweight page table page management.

    This will allow x86_64 arch code to repopulate pgds
    and other page table entries faster. The list operations
    for pgds are reduced in the same way as for i386, to the
    point where a pgd is allocated from the page allocator
    and, when freed, goes back to the page allocator. A pgd
    can pass through the quicklists without having to be
    reinitialized.

    4. Consolidation of code from multiple arches

    So far arches have had their own implementations of quicklist
    management. This patch moves that feature into the core, allowing
    easier maintenance and consistent management of quicklists.

    Page table pages have the characteristic that they are typically zero
    or in a known state when they are freed. This is usually exactly the
    same state as needed after allocation. So it makes sense to build a
    list of freed page table pages and then consume the pages already in
    use first. Those pages have already been initialized correctly (thus
    no need to zero them) and are likely already cached in such a way
    that the MMU can use them most effectively. Page table pages are used
    in a sparse way, so zeroing them on allocation is not too useful.

    Such an implementation already exists for ia64. However, that
    implementation did not support constructors and destructors as needed
    by i386 / x86_64. It also only supported a single quicklist. The
    implementation here has constructor and destructor support as well as
    the ability for an arch to specify how many quicklists are needed.

    Quicklists are defined by an arch defining CONFIG_QUICKLIST. If more than one
    quicklist is necessary then we can define NR_QUICK for additional lists. F.e.
    i386 needs two and thus has

    config NR_QUICK
    int
    default 2

    If an arch has requested quicklist support then pages can be allocated
    from the quicklist (or from the page allocator if the quicklist is
    empty) via:

    quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>)

    Page table pages can be freed using:

    quicklist_free(<quicklist-nr>, <destructor>, <page>)

    Pages must have a definite state after allocation and before
    they are freed. If no constructor is specified then pages
    will be zeroed on allocation and must be zeroed before they are
    freed.

    If a constructor is used then the constructor will establish
    a definite page state. F.e. the i386 and x86_64 pgd constructors
    establish certain mappings.

    Constructors and destructors can also be used to track the pages.
    i386 and x86_64 use a list of pgds in order to be able to dynamically
    update standard mappings.
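
    As a usage sketch (the QUICK_PGD index and the ctor body are
    illustrative; the quicklist calls follow the API described above):

        /* Constructor: give a fresh pgd its standard kernel mappings. */
        static void pgd_ctor(void *pgd)
        {
                /* e.g. copy the kernel entries from the reference pgd */
        }

        pgd_t *pgd_alloc(struct mm_struct *mm)
        {
                return quicklist_alloc(QUICK_PGD, GFP_KERNEL, pgd_ctor);
        }

        void pgd_free(pgd_t *pgd)
        {
                /* Back onto the per-cpu quicklist; it returns from
                   quicklist_alloc() still constructed. */
                quicklist_free(QUICK_PGD, NULL, pgd);
        }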

    Signed-off-by: Christoph Lameter
    Cc: "David S. Miller"
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

12 Feb, 2007

3 commits

  • As Andi pointed out: CONFIG_GENERIC_ISA_DMA only disables the ISA DMA
    channel management. Other functionality may still expect GFP_DMA to
    provide memory below 16M. So we need to make sure that CONFIG_ZONE_DMA
    is set independent of CONFIG_GENERIC_ISA_DMA. Undo the modifications
    to mm/Kconfig where we made ZONE_DMA dependent on GENERIC_ISA_DMA and
    set these explicitly in each arch's Kconfig.

    Reviews must occur for each arch in order to determine if ZONE_DMA
    can be switched off. It can only be switched off if we know that all
    devices supported by a platform are capable of performing DMA
    transfers to all of memory (some arches already support this: uml,
    avr32, sh, sh64, parisc and IA64/Altix).

    In order to switch ZONE_DMA off conditionally, one would have to
    establish a scheme by which one can ensure that no drivers are
    enabled that are only capable of doing I/O to a part of memory, or
    one needs to provide an alternate means of performing an allocation
    from a specific range of memory (like that provided by
    alloc_pages_range()) and ensure that all drivers use that call. In
    that case the arch's dma_alloc_coherent() may need to be modified to
    call alloc_pages_range() instead of relying on GFP_DMA.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make ZONE_DMA optional in core code.

    - ifdef all code for ZONE_DMA and related definitions following the example
    for ZONE_DMA32 and ZONE_HIGHMEM.

    - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
    0.

    - Modify the VM statistics to work correctly without a DMA zone.

    - Modify slab to not create DMA slabs if there is no ZONE_DMA.

    [akpm@osdl.org: cleanup]
    [jdike@addtoit.com: build fix]
    [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch simply defines CONFIG_ZONE_DMA for all arches. We later do special
    things with CONFIG_ZONE_DMA after the VM and an arch are prepared to work
    without ZONE_DMA.

    CONFIG_ZONE_DMA can be defined in two ways depending on how an architecture
    handles ISA DMA.

    First, if CONFIG_GENERIC_ISA_DMA is set by the arch, then we know
    that the arch needs ZONE_DMA because ISA DMA devices are supported.
    We can catch this in mm/Kconfig and do not need to modify arch code.

    Second, arches may use ZONE_DMA in an unknown way. We set
    CONFIG_ZONE_DMA for all arches that do not set CONFIG_GENERIC_ISA_DMA
    in order to ensure backwards compatibility. The arches may later
    undefine ZONE_DMA if their arch code has been verified not to depend
    on ZONE_DMA.
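
    A sketch of the two cases in Kconfig terms (illustrative):

        # mm/Kconfig: arches with ISA DMA get ZONE_DMA automatically
        config ZONE_DMA
                def_bool y
                depends on GENERIC_ISA_DMA

        # arch/<foo>/Kconfig: everyone else sets it explicitly until
        # verified not to need it
        config ZONE_DMA
                bool
                default y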

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Oct, 2006

1 commit

  • Create a Kconfig namespace for MEMORY_HOTPLUG_RESERVE and
    MEMORY_HOTPLUG_SPARSE. This is needed to create a distinction between
    the two paths. Selecting the high-level option MEMORY_HOTPLUG will
    get you MEMORY_HOTPLUG_SPARSE if you have sparsemem enabled, or
    MEMORY_HOTPLUG_RESERVE if you are on x86_64 with discontig and ACPI
    NUMA support.
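
    In Kconfig terms, roughly (the SPARSE entry follows mm/Kconfig of the
    time; the RESERVE dependencies are abridged from memory):

        # mm/Kconfig
        config MEMORY_HOTPLUG_SPARSE
                def_bool y
                depends on SPARSEMEM && MEMORY_HOTPLUG

        # arch/x86_64/Kconfig: the discontig + ACPI NUMA path
        config MEMORY_HOTPLUG_RESERVE
                def_bool y
                depends on MEMORY_HOTPLUG && DISCONTIGMEM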

    Signed-off-by: Keith Mannthey
    Cc: KAMEZAWA Hiroyuki
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keith Mannthey
     

30 Jun, 2006

2 commits

  • * master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6:
    [PATCH] i386: export memory more than 4G through /proc/iomem
    [PATCH] 64bit Resource: finally enable 64bit resource sizes
    [PATCH] 64bit Resource: convert a few remaining drivers to use resource_size_t where needed
    [PATCH] 64bit resource: change pnp core to use resource_size_t
    [PATCH] 64bit resource: change pci core and arch code to use resource_size_t
    [PATCH] 64bit resource: change resource core to use resource_size_t
    [PATCH] 64bit resource: introduce resource_size_t for the start and end of struct resource
    [PATCH] 64bit resource: fix up printks for resources in misc drivers
    [PATCH] 64bit resource: fix up printks for resources in arch and core code
    [PATCH] 64bit resource: fix up printks for resources in pcmcia drivers
    [PATCH] 64bit resource: fix up printks for resources in video drivers
    [PATCH] 64bit resource: fix up printks for resources in ide drivers
    [PATCH] 64bit resource: fix up printks for resources in mtd drivers
    [PATCH] 64bit resource: fix up printks for resources in pci core and hotplug drivers
    [PATCH] 64bit resource: fix up printks for resources in networks drivers
    [PATCH] 64bit resource: fix up printks for resources in sound drivers
    [PATCH] 64bit resource: C99 changes for struct resource declarations

    Fixed up trivial conflict in drivers/ide/pci/cmd64x.c (the printk that
    was changed by the 64-bit resources had been deleted in the meantime ;)

    Linus Torvalds
     
  • The memory hotplug code of i386 adds memory only to highmem. So, if
    CONFIG_HIGHMEM is not set, CONFIG_MEMORY_HOTPLUG shouldn't be set
    either. Otherwise, it causes a compile error.

    In addition, many architectures can't use the memory hotplug feature
    yet. So, I introduce CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG.
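
    A sketch of the gating this introduces (dependencies abridged):

        # mm/Kconfig
        config MEMORY_HOTPLUG
                bool "Allow for memory hot-add"
                depends on HOTPLUG && ARCH_ENABLE_MEMORY_HOTPLUG

        # arch/<foo>/Kconfig: only capable arches provide the symbol
        config ARCH_ENABLE_MEMORY_HOTPLUG
                def_bool y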

    Signed-off-by: Yasunori Goto
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     

23 Jun, 2006

1 commit

  • Use the migration entries for page migration

    This modifies the migration code to use the new migration entries. It now
    becomes possible to migrate anonymous pages without having to add a swap
    entry.

    We add a couple of new functions to replace migration entries with the proper
    ptes.

    We cannot take the tree_lock for migrating anonymous pages anymore. However,
    we know that we hold the only remaining reference to the page when the page
    count reaches 1.
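
    A sketch of the entry encoding (modeled on the swapops.h helpers this
    work introduces; condensed):

        /* A migration entry is a swap-style non-present pte recording
           the pfn of the page being migrated and whether the original
           pte was writable. */
        static inline swp_entry_t make_migration_entry(struct page *page,
                                                       int write)
        {
                return swp_entry(write ? SWP_MIGRATION_WRITE :
                                 SWP_MIGRATION_READ, page_to_pfn(page));
        }

        static inline int is_migration_entry(swp_entry_t entry)
        {
                return swp_type(entry) == SWP_MIGRATION_READ ||
                       swp_type(entry) == SWP_MIGRATION_WRITE;
        }

    Faults on such a pte wait for the migration to finish, and the new
    remove_migration_ptes() rewrites the entries into real ptes pointing
    at the page's new location.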

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter