22 Dec, 2009

1 commit

  • The injector filter requires stable_page_flags() which is supplied
    by procfs. So make it dependent on that.

    Also add ifdefs around the filter code in memory-failure.c so that
    when the filter is disabled due to missing dependencies the whole
    code still builds.

    Reported-by: Ingo Molnar
    Signed-off-by: Andi Kleen

    Andi Kleen
     

17 Dec, 2009

1 commit

  • In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be
    skipped by the compiler. We do this as the minimum mmap address doesn't make
    any sense in NOMMU mode.

    mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode.
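
    A minimal sketch of why clamping to zero lets the compiler skip the tests,
    not the actual kernel header. SKETCH_CONFIG_MMU, sketch_min_addr and
    sketch_addr_too_low() are hypothetical names, and the 65536 default is
    only illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for CONFIG_MMU; leave it undefined to model NOMMU. */
#ifdef SKETCH_CONFIG_MMU
/* With an MMU the limit is a run-time tunable (value is illustrative). */
unsigned long sketch_min_addr = 65536UL;
#else
/* NOMMU: clamp the limit to a compile-time constant zero. */
#define sketch_min_addr 0UL
#endif

/* Returns true when a mapping at addr would be rejected as too low. */
bool sketch_addr_too_low(unsigned long addr)
{
    /* With the constant 0UL this is "addr < 0", always false for unsigned. */
    return addr < sketch_min_addr;
}
```

    In the NOMMU build the comparison is a constant false, so an optimizing
    compiler can drop the branch and anything reachable only through it.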

    Signed-off-by: David Howells
    Acked-by: Eric Paris
    Signed-off-by: James Morris

    David Howells
     

16 Dec, 2009

5 commits

  • Signed-off-by: Andi Kleen

    Andi Kleen
     
  • When specified, only poison pages if ((page_flags & mask) == value).

    - corrupt-filter-flags-mask
    - corrupt-filter-flags-value

    This allows stress testing of many kinds of pages.

    Strictly speaking, buddy pages require taking the zone lock, to avoid
    setting PG_hwpoison on a "was buddy but now allocated to someone" page.
    However we can just do nothing because we set PG_locked in the beginning,
    this prevents the page allocator from allocating it to someone. (It will
    BUG() on the unexpected PG_locked, which is fine for hwpoison testing.)
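
    The filter condition above can be sketched as a small predicate. The
    function name is hypothetical, and treating a zero mask as "no filter
    specified" is an assumption made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when a page passes the filter and
 * may be poisoned. A zero mask is assumed to mean "no filter specified".
 */
bool sketch_flags_filter_pass(uint64_t page_flags, uint64_t mask, uint64_t value)
{
    if (mask == 0)
        return true;
    /* Only the bits selected by mask are compared against value. */
    return (page_flags & mask) == value;
}
```

    Setting a mask bit with the matching value bit clear selects pages that do
    not have that flag, so the same pair of knobs can include or exclude a
    page type.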

    [AK: Add select PROC_PAGE_MONITOR to satisfy dependency]

    CC: Nick Piggin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
     
  • Now that ksm pages are swappable, and the known holes plugged, remove
    mention of unswappable kernel pages from KSM documentation and comments.

    Remove the totalram_pages/4 initialization of max_kernel_pages. In fact,
    remove max_kernel_pages altogether - we can reinstate it if removal turns
    out to break someone's script; but if we later want to limit KSM's memory
    usage, limiting the stable nodes would not be an effective approach.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • CONFIG_DEBUG_SPINLOCK adds 12 or 16 bytes to a 32- or 64-bit spinlock_t,
    and CONFIG_DEBUG_LOCK_ALLOC adds another 12 or 24 bytes to it: lockdep
    enables both of those, and CONFIG_LOCK_STAT adds 8 or 16 bytes to that.

    When 2.6.15 placed the split page table lock inside struct page (usually
    sized 32 or 56 bytes), only CONFIG_DEBUG_SPINLOCK was a possibility, and
    we ignored the enlargement (but fitted in CONFIG_GENERIC_LOCKBREAK's 4 by
    letting the spinlock_t occupy both page->private and page->mapping).

    Should these debugging options be allowed to double the size of a struct
    page, when only one minority use of the page (as a page table) needs to
    fit a spinlock in there? Perhaps not.

    Take the easy way out: switch off SPLIT_PTLOCK_CPUS when DEBUG_SPINLOCK or
    DEBUG_LOCK_ALLOC is in force. I've sometimes tried to be cleverer,
    kmallocing a cacheline for the spinlock when it doesn't fit, but given up
    each time. Falling back to mm->page_table_lock (as we do when ptlock is
    not split) lets lockdep check out the strictest path anyway.

    And now that some arches allow 8192 cpus, use 999999 for infinity.
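
    A rough sketch of how a CPU-count threshold with 999999 as "infinity" can
    switch the split lock off. The names are hypothetical, and the threshold
    of 4 used in the usage note is only an illustrative value.

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold no real CPU count reaches, so split locks are never used. */
#define SKETCH_SPLIT_PTLOCK_NEVER 999999

/* Split page table locks are used once the CPU count reaches the threshold. */
bool sketch_use_split_ptlocks(int nr_cpus, int split_ptlock_cpus)
{
    return nr_cpus >= split_ptlock_cpus;
}
```

    With a threshold of 4, an 8-cpu build splits the lock; with the "never"
    threshold even an 8192-cpu build falls back to mm->page_table_lock.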

    (What has this got to do with KSM swapping? It doesn't care about the
    size of struct page, but may care about random junk in page->mapping - to
    be explained separately later.)

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove three degrees of obfuscation, left over from when we had
    CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
    CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
    built when CONFIG_MMU, so don't need such conditions at all.

    Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
    169 defconfigs: leave those to evolve in due course.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

18 Nov, 2009

1 commit

  • Allow memory hotplug and hibernation in the same kernel

    Memory hotplug and hibernation were exclusive in Kconfig. This is
    obviously a problem for distribution kernels that want to support both in
    the same image.

    After some discussions with Rafael and others, the only problem is with
    parallel memory hotadd or removal while a hibernation operation is in
    progress. It was also working for s390 before.

    This patch removes the Kconfig level exclusion, and simply makes the
    memory add / remove functions grab the pm_mutex to exclude against
    hibernation.

    Fixes a regression - old kernels didn't exclude memory hotadd and
    hibernation.

    Signed-off-by: Andi Kleen
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Cc: Yasunori Goto
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

29 Oct, 2009

2 commits

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc/ppc64: Use preempt_schedule_irq instead of preempt_schedule
    powerpc: Minor cleanup to lib/Kconfig.debug
    powerpc: Minor cleanup to sound/ppc/Kconfig
    powerpc: Minor cleanup to init/Kconfig
    powerpc: Limit memory hotplug support to PPC64 Book-3S machines
    powerpc: Limit hugetlbfs support to PPC64 Book-3S machines
    powerpc: Fix compile errors found by new ppc64e_defconfig
    powerpc: Add a Book-3E 64-bit defconfig
    powerpc/booke: Fix xmon single step on PowerPC Book-E
    powerpc: Align vDSO base address
    powerpc: Fix segment mapping in vdso32
    powerpc/iseries: Remove compiler version dependent hack
    powerpc/perf_events: Fix priority of MSR HV vs PR bits
    powerpc/5200: Update defconfigs
    drivers/serial/mpc52xx_uart.c: Use UPIO_MEM rather than SERIAL_IO_MEM
    powerpc/boot/dts: drop obsolete 'fsl5200-clocking'
    of: Remove nested function
    mpc5200: support for the MAN mpc5200 based board mucmc52
    mpc5200: support for the MAN mpc5200 based board uc101

    Linus Torvalds
     
  • Currently, sparsemem is only available if EXPERIMENTAL is enabled.
    However, it hasn't ever been marked experimental.

    It's been about four years since sparsemem was merged, and we have
    platforms which depend on it; allow architectures to decide whether
    sparsemem should be the default memory model.

    Signed-off-by: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russell King
     

27 Oct, 2009

1 commit


08 Oct, 2009

1 commit

  • Adjust the max_kernel_pages default to a quarter of totalram_pages,
    instead of nr_free_buffer_pages() / 4: the KSM pages themselves come from
    highmem, and even on a 16GB PAE machine, 4GB of KSM pages would only be
    pinning 32MB of lowmem with their rmap_items, so no need for the more
    obscure calculation (nor for its own special init function).
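
    The figures above can be checked with a little arithmetic. This sketch
    assumes 4 KiB pages and a 32-byte rmap_item; the 32-byte size is inferred
    from the 4GB and 32MB figures rather than stated in the message, and the
    function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Default limit: a quarter of all RAM pages. */
uint64_t sketch_default_max_kernel_pages(uint64_t totalram_pages)
{
    return totalram_pages / 4;
}

/* Lowmem pinned by the rmap_items that track ksm_pages KSM pages. */
uint64_t sketch_rmap_lowmem_bytes(uint64_t ksm_pages, uint64_t rmap_item_bytes)
{
    return ksm_pages * rmap_item_bytes;
}
```

    For 16 GiB of RAM there are 4194304 pages of 4 KiB, so the default is
    1048576 pages (4 GiB), and 1048576 * 32 bytes is 32 MiB of rmap_items.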

    There is no way for the user to switch KSM on if CONFIG_SYSFS is not
    enabled, so in that case default run to KSM_RUN_MERGE.

    Update KSM Documentation and Kconfig to reflect the new defaults.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Sep, 2009

1 commit

  • This build failure triggers:

    In file included from include/linux/suspend.h:8,
    from arch/x86/kernel/asm-offsets_32.c:11,
    from arch/x86/kernel/asm-offsets.c:2:
    include/linux/mm.h:503:2: error: #error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS

    This happens because the hwpoison page flag made us run out of page
    flags on 32-bit.

    Don't turn on hwpoison on 32-bit NUMA (it's rare in any
    case).
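
    The #error above is a bit-budget check: the section, node and zone fields
    share one word with the page flags. A sketch of that check as a plain
    function follows; the function name and the widths used in the usage note
    are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* True when the three fields fit in the bits left over by the page flags. */
bool sketch_page_flags_fit(int bits_per_long, int nr_pageflags,
                           int sections_width, int nodes_width, int zones_width)
{
    return sections_width + nodes_width + zones_width
           <= bits_per_long - nr_pageflags;
}
```

    With a 32-bit word, 22 flags and fields totalling 8 bits, everything fits
    (8 <= 10). Raising the flag count to 25 leaves only 7 bits, which is the
    kind of overflow the build error reports.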

    Also clean up the Kconfig dependencies in the generic MM
    code by introducing ARCH_SUPPORTS_MEMORY_FAILURE.

    Signed-off-by: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Linus Torvalds
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

2 commits

  • Add Documentation/vm/ksm.txt: how to use the Kernel Samepage Merging feature

    Signed-off-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Randy Dunlap
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch presents the mm interface to a dummy version of ksm.c, for
    better scrutiny of that interface: the real ksm.c follows later.

    When CONFIG_KSM is not set, madvise(2) rejects MADV_MERGEABLE and
    MADV_UNMERGEABLE with EINVAL, since that seems more helpful than
    pretending that they can be serviced. But when CONFIG_KSM=y, accept them
    even if KSM is not currently running, and even on areas which KSM will not
    touch (e.g. hugetlb or shared file or special driver mappings).

    Like other madvices, report ENOMEM despite success if any area in the
    range is unmapped, and use EAGAIN to report out of memory.
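
    From user space, the behaviour described above might be exercised as in
    the following sketch. sketch_mark_mergeable() is a hypothetical wrapper;
    madvise(2) and MADV_MERGEABLE are the real interface, and the fallback
    branch only covers headers that predate the flag.

```c
/* Expose madvise() and MAP_ANONYMOUS under strict -std modes. */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Ask KSM to consider [addr, addr + len) for merging.
 * Returns 0 on success, otherwise the errno value set by madvise(2):
 * EINVAL if the kernel lacks CONFIG_KSM, ENOMEM if part of the range is
 * unmapped, EAGAIN if the kernel ran out of memory.
 */
int sketch_mark_mergeable(void *addr, size_t len)
{
#ifdef MADV_MERGEABLE
    if (madvise(addr, len, MADV_MERGEABLE) == 0)
        return 0;
    return errno;
#else
    (void)addr;
    (void)len;
    return EINVAL;
#endif
}
```

    A caller would typically treat EINVAL as "KSM not available" and carry on,
    since the advice is only an optimization.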

    Define vma flag VM_MERGEABLE to identify an area on which KSM may try
    merging pages: leave it to ksm_madvise() to decide whether to set it.
    Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain
    VM_MERGEABLE areas, to minimize callouts when forking or exiting.

    Based upon earlier patches by Chris Wright and Izik Eidus.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Signed-off-by: Izik Eidus
    Cc: Michael Kerrisk
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Sep, 2009

3 commits

  • Useful for some testing scenarios, although specific testing is often
    done better through MADV_POISON

    This can be done with the x86 level MCE injector too, but this interface
    allows it to be done independently from low level x86 changes.

    v2: Add module license (Haicheng Li)

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Add the high level memory handler that poisons pages
    that got corrupted by hardware (typically by a two bit flip in a DIMM
    or a cache) on the Linux level. The goal is to prevent everyone
    from accessing these pages in the future.

    This is done at the VM level by marking a page hwpoisoned
    and doing the appropriate action based on the type of page
    it is.

    The code that does this is portable and lives in mm/memory-failure.c

    To quote the overview comment:

    High level machine check handler. Handles pages reported by the
    hardware as being corrupted usually due to a 2bit ECC memory or cache
    failure.

    This focuses on pages detected as corrupted in the background.
    When the current CPU tries to consume corruption the currently
    running process can just be killed directly instead. This implies
    that if the error cannot be handled for some reason it's safe to
    just ignore it because no corruption has been consumed yet. Instead
    when that happens another machine check will happen.

    Handles page cache pages in various states. The tricky part
    here is that we can access any page asynchronous to other VM
    users, because memory failures could happen anytime and anywhere,
    possibly violating some of their assumptions. This is why this code
    has to be extremely careful. Generally it tries to use normal locking
    rules, as in get the standard locks, even if that means the
    error handling takes potentially a long time.

    Some of the operations here are somewhat inefficient and have non
    linear algorithmic complexity, because the data structures have not
    been optimized for this case. This is in particular the case
    for the mapping from a vma to a process. Since this case is expected
    to be rare we hope we can get away with this.

    There are in principle two strategies to kill processes on poison:
    - just unmap the data and wait for an actual reference before
    killing
    - kill as soon as corruption is detected.
    Both have advantages and disadvantages and should be used
    in different situations. Right now both are implemented and can
    be switched with a new sysctl vm.memory_failure_early_kill
    The default is early kill.

    The patch does some rmap data structure walking on its own to collect
    processes to kill. This is unusual because normally all rmap data structure
    knowledge is in rmap.c only. I put it here for now to keep
    everything together and rmap knowledge has been seeping out anyways

    Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
    Nick Piggin (who did a lot of great work) and others.

    Cc: npiggin@suse.de
    Cc: riel@redhat.com
    Signed-off-by: Andi Kleen
    Acked-by: Rik van Riel
    Reviewed-by: Hidehiro Kawai

    Andi Kleen
     
  • * 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, pat: Fix cacheflush address in change_page_attr_set_clr()
    mm: remove !NUMA condition from PAGEFLAGS_EXTENDED condition set
    x86: Fix earlyprintk=dbgp for machines without NX
    x86, pat: Sanity check remap_pfn_range for RAM region
    x86, pat: Lookup the protection from memtype list on vm_insert_pfn()
    x86, pat: Add lookup_memtype to get the current memtype of a paddr
    x86, pat: Use page flags to track memtypes of RAM pages
    x86, pat: Generalize the use of page flag PG_uncached
    x86, pat: Add rbtree to do quick lookup in memtype tracking
    x86, pat: Add PAT reserve free to io_mapping* APIs
    x86, pat: New i/f for driver to request memtype for IO regions
    x86, pat: ioremap to follow same PAT restrictions as other PAT users
    x86, pat: Keep identity maps consistent with mmaps even when pat_disabled
    x86, mtrr: make mtrr_aps_delayed_init static bool
    x86, pat/mtrr: Rendezvous all the cpus for MTRR/PAT init
    generic-ipi: Allow cpus not yet online to call smp_call_function with irqs disabled
    x86: Fix an incorrect argument of reserve_bootmem()
    x86: Fix system crash when loading with "reservetop" parameter

    Linus Torvalds
     

01 Sep, 2009

1 commit

  • CONFIG_PAGEFLAGS_EXTENDED disables a trick to conserve pageflags.
    This trick is intended to be enabled when the pressure on page flags
    is very high.

    The previous condition was:

    - depends on 64BIT || SPARSEMEM_VMEMMAP || !NUMA || !SPARSEMEM

    ... however, the sparsemem code already has a way to crowd out the
    node number from the pageflags, which means that !NUMA actually
    doesn't contribute to hard pageflags exhaustion.

    This is required for the new PG_uncached flag to not cause pageflags
    exhaustion on x86_32 + PAE + SPARSEMEM + !NUMA.

    Signed-off-by: H. Peter Anvin
    LKML-Reference:
    Cc: Venkatesh Pallipadi
    Cc: Suresh Siddha

    H. Peter Anvin
     

17 Aug, 2009

1 commit

  • Currently SELinux enforcement of controls on the ability to map low memory
    is determined by the mmap_min_addr tunable. This patch causes SELinux to
    ignore the tunable and instead use a separate Kconfig option specific to how
    much space the LSM should protect.

    The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
    permissions will always protect the amount of low memory designated by
    CONFIG_LSM_MMAP_MIN_ADDR.

    This allows users who need to disable the mmap_min_addr controls (usual reason
    being they run WINE as a non-root user) to do so and still have SELinux
    controls preventing confined domains (like a web server) from being able to
    map some area of low memory.

    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     

17 Jun, 2009

2 commits

  • * akpm: (182 commits)
    fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
    fbdev: *bfin*: fix __dev{init,exit} markings
    fbdev: *bfin*: drop unnecessary calls to memset
    fbdev: bfin-t350mcqb-fb: drop unused local variables
    fbdev: blackfin has __raw I/O accessors, so use them in fb.h
    fbdev: s1d13xxxfb: add accelerated bitblt functions
    tcx: use standard fields for framebuffer physical address and length
    fbdev: add support for handoff from firmware to hw framebuffers
    intelfb: fix a bug when changing video timing
    fbdev: use framebuffer_release() for freeing fb_info structures
    radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
    s3c-fb: CPUFREQ frequency scaling support
    s3c-fb: fix resource releasing on error during probing
    carminefb: fix possible access beyond end of carmine_modedb[]
    acornfb: remove fb_mmap function
    mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
    mb862xxfb: restrict compliation of platform driver to PPC
    Samsung SoC Framebuffer driver: add Alpha Channel support
    atmel-lcdc: fix pixclock upper bound detection
    offb: use framebuffer_alloc() to allocate fb_info struct
    ...

    Manually fix up conflicts due to kmemcheck in mm/slab.c

    Linus Torvalds
     
  • Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
    configurability is unnecessary.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Acked-by: Minchan Kim
    Cc: David Woodhouse
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

16 Jun, 2009

1 commit


04 Jun, 2009

1 commit


07 May, 2009

1 commit

  • NOMMU mmap() has an option controlled by a sysctl variable that determines
    whether the allocations made by do_mmap_private() should have the excess
    space trimmed off and returned to the allocator. Make the initial setting
    of this variable a Kconfig configuration option.

    The reason there can be excess space is that the allocator only allocates
    in power-of-2 size chunks, but mmap()'s can be made in sizes that aren't a
    power of 2.
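
    As a worked example of where the excess comes from, this sketch rounds a
    request up to the next power-of-2 number of pages and reports what is left
    over. Both function names are hypothetical, and counting in whole pages is
    a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two that is >= pages. */
size_t sketch_pow2_chunk_pages(size_t pages)
{
    size_t chunk = 1;

    /* Reject zero and sizes whose rounded-up chunk would overflow. */
    assert(pages >= 1 && pages <= SIZE_MAX / 2 + 1);
    while (chunk < pages)
        chunk <<= 1;
    return chunk;
}

/* Pages allocated beyond what the mapping asked for. */
size_t sketch_excess_pages(size_t pages)
{
    return sketch_pow2_chunk_pages(pages) - pages;
}
```

    A 5-page mapping gets an 8-page chunk and so carries 3 excess pages, while
    an 8-page mapping carries none.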

    There are two alternatives:

    (1) Keep the excess as dead space. The dead space then remains unused for the
    lifetime of the mapping. Mappings of shared objects such as libc, ld.so
    or busybox's text segment may retain their dead space forever.

    (2) Return the excess to the allocator. This means that the dead space is
    limited to less than a page per mapping, but it means that for a transient
    process, there's more chance of fragmentation as the excess space may be
    reused fairly quickly.

    During the boot process, a lot of transient processes are created, and
    this can cause a lot of fragmentation as the pagecache and various slabs
    grow greatly during this time.

    By turning off the trimming of excess space during boot and disabling
    batching of frees, Coldfire can manage to boot.

    A better way of doing things might be to have /sbin/init turn this option
    off. By that point libc, ld.so and init - which are all long-duration
    processes - have all been loaded and trimmed.

    Reported-by: Lanttor Guo
    Signed-off-by: David Howells
    Tested-by: Lanttor Guo
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

14 Apr, 2009

1 commit


01 Apr, 2009

2 commits

  • Make CONFIG_UNEVICTABLE_LRU available when CONFIG_MMU=n. There's no logical
    reason it shouldn't be available, and it can be used for ramfs.

    Signed-off-by: David Howells
    Reviewed-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Cc: Greg Ungerer
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Enrik Berkhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • The mlock() facility does not exist for NOMMU since all mappings are
    effectively locked anyway, so we don't make the bits available when
    they're not useful.

    Signed-off-by: David Howells
    Reviewed-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Cc: Greg Ungerer
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Enrik Berkhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

07 Jan, 2009

1 commit

  • commit 8308c54d7e312f7a03e2ce2057d0837e6fe3843f ("generic: redefine
    resource_size_t as phys_addr_t") made CONFIG_RESOURCES_64BIT obsolete, but
    didn't remove it. Remove it.

    Signed-off-by: Geert Uytterhoeven
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

20 Oct, 2008

1 commit

  • When the system contains lots of mlocked or otherwise unevictable pages,
    the pageout code (kswapd) can spend lots of time scanning over these
    pages. Worse still, the presence of lots of unevictable pages can confuse
    kswapd into thinking that more aggressive pageout modes are required,
    resulting in all kinds of bad behaviour.

    Infrastructure to manage pages excluded from reclaim--i.e., hidden from
    vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
    maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
    them from vmscan.

    Kosaki Motohiro added the support for the memory controller unevictable
    lru list.

    Pages on the unevictable list have both PG_unevictable and PG_lru set.
    Thus, PG_unevictable is analogous to and mutually exclusive with
    PG_active--it specifies which LRU list the page is on.

    The unevictable infrastructure is enabled by a new mm Kconfig option
    [CONFIG_]UNEVICTABLE_LRU.

    A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
    not a page may be evictable. Subsequent patches will add the various
    !evictable tests. We'll want to keep these tests light-weight for use in
    shrink_active_list() and, possibly, the fault path.

    To avoid races between tasks putting pages [back] onto an LRU list and
    tasks that might be moving the page from non-evictable to evictable state,
    the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
    -- tests the "evictability" of a page after placing it on the LRU, before
    dropping the reference. If the page has become unevictable,
    putback_lru_page() will redo the 'putback', thus moving the page to the
    unevictable list. This way, we avoid "stranding" evictable pages on the
    unevictable list.

    [akpm@linux-foundation.org: fix fallout from out-of-order merge]
    [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
    [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
    [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
    [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
    [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
    [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
    [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Benjamin Kidwell
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

17 Oct, 2008

2 commits

  • * 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails
    softirq, warning fix: correct a format to avoid a warning
    softirqs, debug: preemption check
    x86, pci-hotplug, calgary / rio: fix EBDA ioremap()
    IO resources, x86: ioremap sanity check to catch mapping requests exceeding, fix
    IO resources, x86: ioremap sanity check to catch mapping requests exceeding the BAR sizes
    softlockup: Documentation/sysctl/kernel.txt: fix softlockup_thresh description
    dmi scan: warn about too early calls to dmi_check_system()
    generic: redefine resource_size_t as phys_addr_t
    generic: make PFN_PHYS explicitly return phys_addr_t
    generic: add phys_addr_t for holding physical addresses
    softirq: allocate less vectors
    IO resources: fix/remove printk
    printk: robustify printk, update comment
    printk: robustify printk, fix #2
    printk: robustify printk, fix
    printk: robustify printk

    Fixed up conflicts in:
    arch/powerpc/include/asm/types.h
    arch/powerpc/platforms/Kconfig.cputype
    manually.

    Linus Torvalds
     
  • Using "def_bool n" is pointless, simply using bool here appears more
    appropriate.

    Further, retaining such options that don't have a prompt and aren't
    selected by anything seems also at least questionable.

    Signed-off-by: Jan Beulich
    Cc: Ingo Molnar
    Cc: Tony Luck
    Cc: Thomas Gleixner
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     

14 Sep, 2008

1 commit


12 Aug, 2008

1 commit


29 Jul, 2008

1 commit

    With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In GRU case there's no
    actual secondary pte and there's only a secondary tlb because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case of KVM where a secondary tlb miss will
    walk sptes in hardware and it will refill the secondary tlb transparently
    to software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page being freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it spares the code that tears down the
    mappings of the secondary MMU from implementing logic like tlb_gather in
    zap_page_range, which would require many IPIs to flush other cpu tlbs for
    each fixed number of sptes unmapped.

    To make an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will be
    invalidated, and the next secondary-mmu-page-fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, and it'll re-establish an updated
    spte or secondary-tlb-mapping on the copied page. Or it will set up a
    readonly spte or readonly tlb mapping if it's a guest-read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows mapping any page pointed to by any pte (and in turn visible in
    the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU, or a
    full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
    with kvm), or a remote DMA in software like XPMEM (hence the need to
    schedule in XPMEM code to send the invalidate to the remote node, while
    there is no need to schedule in kvm/gru as it's an immediate event like
    invalidating a primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track if the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter increased in range_begin and
    decreased in range_end. No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows compiling a KVM external module
    against a kernel with mmu notifiers enabled and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.
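
    The counter scheme from dependency 1 can be pictured with a small
    userspace model. This is only a sketch under stated assumptions: the
    toy_* names are hypothetical, C11 atomics stand in for the kernel's
    atomic_t, and a real driver would presumably also have to recheck the
    counter under its own lock before installing a mapping, which is not
    modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-mm state kept by a secondary MMU driver. */
struct toy_secondary_mmu {
	atomic_int invalidate_count;	/* > 0 while inside range_begin/end */
};

/* Model of ->invalidate_range_begin: enter the critical section. */
static void toy_range_begin(struct toy_secondary_mmu *m)
{
	atomic_fetch_add(&m->invalidate_count, 1);
}

/* Model of ->invalidate_range_end: leave the critical section. */
static void toy_range_end(struct toy_secondary_mmu *m)
{
	int old = atomic_fetch_sub(&m->invalidate_count, 1);

	/* Every end must pair with an earlier begin. */
	assert(old > 0);
}

/* A secondary MMU fault may only map a page when no invalidate is running. */
static bool toy_may_map(struct toy_secondary_mmu *m)
{
	return atomic_load(&m->invalidate_count) == 0;
}
```

    Because the value is a counter rather than a flag, overlapping
    invalidations nest: mapping stays forbidden until the last range_end
    has run.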

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_register
    is used when a driver starts up, a failure can be handled gracefully.
    Here is an example of the change applied to kvm to register the mmu
    notifiers. Usually when a driver starts up, other allocations are
    required anyway and -ENOMEM failure paths exist already.

     struct kvm *kvm_arch_create_vm(void)
     {
             struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +        int err;

             if (!kvm)
                     return ERR_PTR(-ENOMEM);

             INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +        kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +        err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +        if (err) {
    +                kfree(kvm);
    +                return ERR_PTR(err);
    +        }
    +
             return kvm;
     }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes whose absence would
    prevent the kernel from compiling after these changes on non-x86 archs
    (x86 didn't need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Jul, 2008

1 commit

  • Implement get_user_pages_fast without locking in the fastpath on x86.

    Do an optimistic lockless pagetable walk, without taking mmap_sem or any
    page table locks. Page table existence is guaranteed by turning interrupts
    off (combined with the fact that we're always looking up the current mm,
    this means we can do the lockless page table walk within the constraints
    of the TLB shootdown design). Basically we can do this lockless pagetable
    walk in a similar manner to the way the CPU's pagetable walker does not
    have to take any locks to find present ptes.

    This patch (combined with the subsequent ones to convert direct IO to use
    it) was found to give a performance improvement of about 10% on a
    2-socket, 8-core Intel Xeon system running an OLTP workload on DB2 v9.5.

    "To test the effects of the patch, an OLTP workload was run on an IBM
    x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
    2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
    runs with and without the patch resulted in an overall performance
    benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
    __up_read and __down_read routines that is seen during thread contention
    for system resources was reduced from 2.8% down to .05%. Monitoring the
    /proc/vmstat output from the patched run showed that the counter for
    fast_gup contained a very high number while the fast_gup_slow value was
    zero."

    (fast_gup is the old name for get_user_pages_fast, fast_gup_slow is a
    counter we had for the number of times the slowpath was invoked).

    The main reason for the improvement is that DB2 has multiple threads each
    issuing direct-IO. Direct-IO uses get_user_pages, and thus the threads
    contend on the mmap_sem cacheline, and can also contend on page table
    locks.

    I would anticipate larger performance gains on larger systems; however, I
    think DB2 uses an adaptive mix of threads and processes, so it could be
    that thread contention remains pretty constant as machine size increases.
    In which case, we're stuck with "only" a 10% gain.

    The downside of using get_user_pages_fast is that if there is not a pte
    with the correct permissions for the access, we end up falling back to
    get_user_pages, so the get_user_pages_fast attempt is a bit of extra
    work. However, this should not be the common case in most
    performance-critical code.
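
    The fallback described above can be pictured with a small userspace
    model. This is only a sketch: toy_pte, toy_gup_fast and the other names
    are hypothetical, no kernel API is used, and the real fast path also has
    to deal with disabling interrupts and taking page references, none of
    which are modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page table entry; not the kernel's pte_t. */
struct toy_pte {
	bool present;
	bool writable;
};

/* Counts slow-path invocations, in the spirit of the old fast_gup_slow. */
static size_t toy_slow_calls;

/* Fast path: walk without locks, stop at the first unsuitable entry. */
static size_t toy_gup_fast_only(const struct toy_pte *pt, size_t n, bool write)
{
	size_t i;

	assert(pt != NULL);
	for (i = 0; i < n; i++) {
		if (!pt[i].present || (write && !pt[i].writable))
			break;
	}
	return i;	/* number of entries handled locklessly */
}

/* Slow-path stand-in: the real one would take mmap_sem and fault pages in. */
static size_t toy_gup_slow(struct toy_pte *pt, size_t n, bool write)
{
	size_t i;

	toy_slow_calls++;
	for (i = 0; i < n; i++) {
		pt[i].present = true;
		if (write)
			pt[i].writable = true;
	}
	return n;
}

/* Try the fast path first; hand whatever is left to the slow path. */
static size_t toy_gup_fast(struct toy_pte *pt, size_t n, bool write)
{
	size_t done = toy_gup_fast_only(pt, n, write);

	if (done < n)
		done += toy_gup_slow(pt + done, n - done, write);
	return done;
}
```

    In this model the lockless walk is wasted effort only for the entries
    it scanned before hitting the unsuitable one; once the slow path has
    populated an entry, later calls with the same access stay on the fast
    path.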

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: Kconfig fix]
    [akpm@linux-foundation.org: Makefile fix/cleanup]
    [akpm@linux-foundation.org: warning fix]
    Signed-off-by: Nick Piggin
    Cc: Dave Kleikamp
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Andi Kleen
    Cc: Dave Kleikamp
    Cc: Badari Pulavarty
    Cc: Zach Brown
    Cc: Jens Axboe
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

25 Jul, 2008

1 commit

  • We'd like to support CONFIG_MEMORY_HOTREMOVE on s390, which depends on
    CONFIG_MIGRATION. So far, CONFIG_MIGRATION is only available with NUMA
    support.

    This patch makes CONFIG_MIGRATION selectable for architectures that define
    ARCH_ENABLE_MEMORY_HOTREMOVE. When MIGRATION is enabled w/o NUMA, the
    kernel won't compile because migrate_vmas() does not know about
    vm_ops->migrate() and vma_migratable() does not know about policy_zone.
    To fix this, those two functions can be restricted to '#ifdef CONFIG_NUMA'
    because they are not being used w/o NUMA. vma_migratable() is moved over
    from migrate.h to mempolicy.h.

    [kosaki.motohiro@jp.fujitsu.com: build fix]
    Acked-by: Christoph Lameter
    Signed-off-by: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

15 Jul, 2008

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/hskinnemoen/avr32-2.6: (31 commits)
    avr32: Fix typo of IFSR in a comment in the PIO header file
    avr32: Power Management support ("standby" and "mem" modes)
    avr32: Add system device for the internal interrupt controller (intc)
    avr32: Add simple SRAM allocator
    avr32: Enable SDRAMC clock at startup
    rtc-at32ap700x: Enable wakeup
    macb: Basic suspend/resume support
    atmel_serial: Drain console TX shifter before suspending
    atmel_serial: Fix build on avr32 with CONFIG_PM enabled
    avr32: Use a quicklist for PTE allocation as well
    avr32: Use a quicklist for PGD allocation
    avr32: Cover the kernel page tables in the user PGDs
    avr32: Store virtual addresses in the PGD
    avr32: Remove useless zeroing of swapper_pg_dir at startup
    avr32: Clean up and optimize the TLB operations
    avr32: Rename at32ap.c -> pdc.c
    avr32: Move setup_platform() into chip-specific file
    avr32: Kill special exception handler sections
    avr32: Kill unneeded #include from asm/mmu_context.h
    avr32: Clean up time.c #includes
    ...

    Linus Torvalds
     

14 Jul, 2008

1 commit