13 Mar, 2010

2 commits

  • - introduce dump_page() to print the page info for debugging some error
    condition.

    - convert three mm users: bad_page(), print_bad_pte() and memory offline
    failure.

    - print an extra field: the symbolic names of page->flags

    Example dump_page() output:

    [ 157.521694] page:ffffea0000a7cba8 count:2 mapcount:1 mapping:ffff88001c901791 index:0x147
    [ 157.525570] page flags: 0x100000000100068(uptodate|lru|active|swapbacked)

    Signed-off-by: Wu Fengguang
    Cc: Ingo Molnar
    Cc: Alex Chiang
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
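    A hedged sketch of how an error path might use the new helper; the
    surrounding function and its name are illustrative, only dump_page()
    itself comes from the patch:

        #include <linux/mm.h>
        #include <linux/kernel.h>

        /* Illustrative only: report a suspicious page, then dump its state. */
        static void report_bad_page(struct page *page, const char *reason)
        {
                printk(KERN_ALERT "BUG: bad page state: %s\n", reason);
                dump_page(page); /* count, mapcount, mapping, index and flag names */
        }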
     
  • __zone_pcp_update() iterates over NR_CPUS instead of limiting the access
    to the possible cpus. This might result in access to uninitialized areas
    as the per cpu allocator only populates the per cpu memory for possible
    cpus.

    This problem was created as a result of the dynamic allocation of pagesets
    from percpu memory that went in during the merge window - commit
    99dcc3e5a94ed491fbef402831d8c0bbb267f995 ("this_cpu: Page allocator
    conversion").

    Signed-off-by: Thomas Gleixner
    Acked-by: Pekka Enberg
    Acked-by: Tejun Heo
    Acked-by: Christoph Lameter
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
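    A hedged sketch of the difference; the helper name is illustrative, but
    for_each_possible_cpu() and per_cpu_ptr() are the standard interfaces for
    touching only the per-cpu areas the percpu allocator actually populated:

        #include <linux/cpumask.h>
        #include <linux/percpu.h>
        #include <linux/mmzone.h>

        /* Illustrative: walk only the possible cpus, not 0..NR_CPUS-1. */
        static void reset_zone_pagesets(struct zone *zone)
        {
                int cpu;

                for_each_possible_cpu(cpu) {    /* was: for (cpu = 0; cpu < NR_CPUS; cpu++) */
                        struct per_cpu_pageset *pset = per_cpu_ptr(zone->pageset, cpu);

                        pset->pcp.count = 0;    /* illustrative re-initialisation */
                }
        }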
     

07 Mar, 2010

6 commits

  • free_area_init_nodes() emits pfn ranges for all zones on the system.
    There may be no pages on a higher zone, however, due to memory limitations
    or the use of the mem= kernel parameter. For example:

    Zone PFN ranges:
    DMA 0x00000001 -> 0x00001000
    DMA32 0x00001000 -> 0x00100000
    Normal 0x00100000 -> 0x00100000

    The implementation copies the previous zone's highest pfn, if any, as the
    next zone's lowest pfn. If its highest pfn is then greater than the
    amount of addressable memory, the upper memory limit is used instead.
    Thus, both the lowest and highest possible pfn for higher zones without
    memory may be the same.

    The pfn range for zones without memory is now shown as "empty" instead.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
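    A hedged sketch of the reporting logic described above (names are
    illustrative, not lifted from a particular tree): a zone whose lowest and
    highest possible pfn are equal holds no pages, so it is labelled "empty"
    rather than printed as a degenerate range:

        #include <linux/kernel.h>

        static void __init report_zone_pfn_range(const char *name,
                                                 unsigned long low_pfn,
                                                 unsigned long high_pfn)
        {
                if (low_pfn == high_pfn)
                        printk(KERN_INFO "  %-8s empty\n", name);
                else
                        printk(KERN_INFO "  %-8s %#010lx -> %#010lx\n",
                               name, low_pfn, high_pfn);
        }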
     
  • There are quite a few GFP_KERNEL memory allocations made during
    suspend/hibernation and resume that may cause the system to hang, because
    the I/O operations they depend on cannot be completed due to the
    underlying devices being suspended.

    Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
    gfp_allowed_mask before suspend/hibernation and restoring the original
    values of these bits in gfp_allowed_mask during the subsequent resume.

    [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
    Signed-off-by: Rafael J. Wysocki
    Reported-by: Maxim Levitsky
    Cc: Sebastian Ott
    Cc: Benjamin Herrenschmidt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
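    A hedged sketch of the idea; the helper names are illustrative and not
    necessarily those the patch introduced, but gfp_allowed_mask, __GFP_IO and
    __GFP_FS are the real symbols involved:

        #include <linux/gfp.h>

        static gfp_t saved_gfp_mask;

        static void suspend_restrict_gfp_mask(void)
        {
                saved_gfp_mask = gfp_allowed_mask;
                gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS); /* no I/O while devices sleep */
        }

        static void resume_restore_gfp_mask(void)
        {
                gfp_allowed_mask = saved_gfp_mask;          /* allow I/O and FS callbacks again */
        }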
     
  • commit e815af95 ("change all_unreclaimable zone member to flags") changed
    the all_unreclaimable member into a bit flag. But it had an undesirable
    side effect: free_one_page() is one of the hottest paths in the Linux
    kernel, and increasing the atomic ops in it can reduce kernel performance
    a bit.

    Thus, this patch partially reverts that commit. At the very least,
    all_unreclaimable shouldn't share a memory word with other zone flags.

    [akpm@linux-foundation.org: fix patch interaction]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
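    A hedged sketch of the effect: with all_unreclaimable back as its own word
    in struct zone, the free hot path can use a plain store instead of an
    atomic bit operation on the shared zone->flags word (the accessor below is
    illustrative of the shape of the change, not the patch itself):

        #include <linux/mmzone.h>

        static inline void zone_mark_reclaimable(struct zone *zone)
        {
                zone->all_unreclaimable = 0;    /* plain write, no locked bit op */
        }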
     
  • free_hot_page() is just a wrapper around free_hot_cold_page() with
    parameter 'cold = 0'. After adding a clear comment for
    free_hot_cold_page(), it is reasonable to remove this level of indirection.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Cc: Americo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
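    As the description says, the dropped wrapper was essentially the
    following, so callers can simply invoke free_hot_cold_page(page, 0)
    themselves (a sketch, not the exact removed hunk):

        void free_hot_page(struct page *page)
        {
                free_hot_cold_page(page, 0);    /* cold == 0: page is likely cache-hot */
        }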
     
  • Move the call of trace_mm_page_free_direct() from free_hot_page() to
    free_hot_cold_page(). This is clearer and keeps it next to
    kmemcheck_free_shadow(), mirroring what is done in __free_pages_ok().

    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     
  • trace_mm_page_free_direct() is called in function __free_pages(). But it
    is called again in free_hot_page() if order == 0, producing duplicate
    records in the trace file for the mm_page_free_direct event, as below:

    K-PID CPU# TIMESTAMP FUNCTION
    gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
    gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0

    This patch removes the first call and adds a call to
    trace_mm_page_free_direct() in __free_pages_ok().

    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
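    A hedged sketch of the resulting flow: __free_pages() no longer emits the
    event itself, so each underlying free path traces exactly once:

        void __free_pages(struct page *page, unsigned int order)
        {
                if (put_page_testzero(page)) {
                        if (order == 0)
                                free_hot_page(page);          /* traces via the pcp free path */
                        else
                                __free_pages_ok(page, order); /* now calls trace_mm_page_free_direct() */
                }
        }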
     

04 Mar, 2010

1 commit

  • Merge branch 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
    early_res: Need to save the allocation name in drop_range_partial()
    sparsemem: Fix compilation on PowerPC
    early_res: Add free_early_partial()
    x86: Fix non-bootmem compilation on PowerPC
    core: Move early_res from arch/x86 to kernel/
    x86: Add find_fw_memmap_area
    Move round_up/down to kernel.h
    x86: Make 32bit support NO_BOOTMEM
    early_res: Enhance check_and_double_early_res
    x86: Move back find_e820_area to e820.c
    x86: Add find_early_area_size
    x86: Separate early_res related code from e820.c
    x86: Move bios page reserve early to head32/64.c
    sparsemem: Put mem map for one node together.
    sparsemem: Put usemap for one node together
    x86: Make 64 bit use early_res instead of bootmem before slab
    x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA
    x86: Make early_node_mem get mem > 4 GB if possible
    x86: Dynamically increase early_res array size
    x86: Introduce max_early_res and early_res_count
    ...

    Linus Torvalds
     

22 Feb, 2010

1 commit

  • This fixes build errors seen on some non-x86 platforms (PowerPC, for example):

    mm/page_alloc.c: In function '__alloc_memory_core_early':
    mm/page_alloc.c:3468: error: implicit declaration of function 'find_early_area'
    mm/page_alloc.c:3483: error: implicit declaration of function 'reserve_early_without_check'

    The function is only needed when CONFIG_NO_BOOTMEM is set.

    Signed-off-by: Yinghai Lu
    Cc: Andrew Morton
    Cc: Johannes Weiner
    Cc: Mel Gorman
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Yinghai Lu
     

30 Jan, 2010

1 commit

  • After memory pressure has forced it to dip into the reserves, 2.6.32's
    5f8dcc21211a3d4e3a7a5ca366b469fb88117f61 "page-allocator: split per-cpu
    list into one-list-per-migrate-type" has been returning MIGRATE_RESERVE
    pages to the MIGRATE_MOVABLE free_list: in some sense depleting reserves.

    Fix that in the most straightforward way (which, considering the overheads
    of alternative approaches, is Mel's preference): the right migratetype is
    already in page_private(page), but free_pcppages_bulk() wasn't using it.

    How did this bug show up? As a 20% slowdown in my tmpfs loop kbuild
    swapping tests, on PowerMac G5 with SLUB allocator. Bisecting to that
    commit was easy, but explaining the magnitude of the slowdown not easy.

    The same effect appears, but much less markedly, with SLAB, and even
    less markedly on other machines (the PowerMac divides into fewer zones
    than x86, I think that may be a factor). We guess that lumpy reclaim
    of short-lived high-order pages is implicated in some way, and probably
    this bug has been tickling a poor decision somewhere in page reclaim.

    But instrumentation hasn't told me much, I've run out of time and
    imagination to determine exactly what's going on, and shouldn't hold up
    the fix any longer: it's valid, and might even fix other misbehaviours.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
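    A sketch of the fix inside free_pcppages_bulk(), per the description above
    (context trimmed): free each page back according to its own migratetype,
    stashed in page_private() when it entered the pcp list, rather than the
    migratetype of the pcp list being drained:

        /* before: */
        __free_one_page(page, zone, 0, migratetype);         /* type of the pcp list */
        /* after: the MIGRATE_MOVABLE list may hold MIGRATE_RESERVE pages */
        __free_one_page(page, zone, 0, page_private(page));  /* the page's own type */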
     

17 Jan, 2010

2 commits

  • commit f2260e6b (page allocator: update NR_FREE_PAGES only as necessary)
    introduced a minor regression: if __rmqueue() fails, the NR_FREE_PAGES
    stat goes wrong. This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Reported-by: Huang Shijie
    Reviewed-by: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
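    A hedged sketch of the fix in the order > 0 path of buffered_rmqueue():
    only account pages as removed once __rmqueue() has actually returned one
    (surrounding locking shown for context and may differ per tree):

        page = __rmqueue(zone, order, migratetype);
        spin_unlock(&zone->lock);
        if (!page)
                goto failed;
        __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));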
     
  • The current check for 'backward merging' within add_active_range() does
    not seem correct. start_pfn must be compared against
    early_node_map[i].start_pfn (and NOT against .end_pfn) to find out whether
    the new region is backward-mergeable with the existing range.

    Signed-off-by: Kazuhisa Ichikawa
    Acked-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kazuhisa Ichikawa
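    A hedged reconstruction of the corrected check in add_active_range(), per
    the description above:

        /* Merge backward if suitable */
        if (start_pfn < early_node_map[i].start_pfn &&   /* was compared against .end_pfn */
            end_pfn >= early_node_map[i].start_pfn) {
                early_node_map[i].start_pfn = start_pfn; /* grow the existing range downwards */
                return;
        }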
     

05 Jan, 2010

1 commit

  • Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

    This drastically reduces the size of struct zone for systems with large
    amounts of processors and allows placement of critical variables of struct
    zone in one cacheline even on very large systems.

    Another effect is that the pagesets of one processor are placed near one
    another. If multiple pagesets from different zones fit into one cacheline
    then additional cacheline fetches can be avoided on the hot paths when
    allocating memory from multiple zones.

    Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
    are reduced and we can drop the zone_pcp macro.

    Hotplug handling is also simplified since cpu alloc can bring up and
    shut down cpu areas for a specific cpu as a whole. So there is no need to
    allocate or free individual pagesets.

    V7-V8:
    - Explain chicken-and-egg dilemma with percpu allocator.

    V4-V5:
    - Fix up cases where per_cpu_ptr is called before irq disable
    - Integrate the bootstrap logic that was separate before.

    tj: Build failure in pageset_cpuup_callback() due to missing ret
    variable fixed.

    Reviewed-by: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
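    A hedged sketch of the struct zone change and the resulting hot-path
    access (exact field placement and naming differ per tree):

        /* was: struct per_cpu_pageset *pageset[NR_CPUS];  -- one pointer per possible cpu */
        struct per_cpu_pageset __percpu *pageset;          /* now: one percpu pointer from alloc_percpu() */

        /* hot-path access to the local cpu's lists: */
        struct per_cpu_pages *pcp = &this_cpu_ptr(zone->pageset)->pcp;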
     

23 Dec, 2009

1 commit

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (36 commits)
    powerpc/gc/wii: Remove get_irq_desc()
    powerpc/gc/wii: hlwd-pic: convert irq_desc.lock to raw_spinlock
    powerpc/gamecube/wii: Fix off-by-one error in ugecon/usbgecko_udbg
    powerpc/mpic: Fix problem that affinity is not updated
    powerpc/mm: Fix stupid bug in subpge protection handling
    powerpc/iseries: use DECLARE_COMPLETION_ONSTACK for non-constant completion
    powerpc: Fix MSI support on U4 bridge PCIe slot
    powerpc: Handle VSX alignment faults correctly in little-endian mode
    powerpc/mm: Fix typo of cpumask_clear_cpu()
    powerpc/mm: Fix hash_utils_64.c compile errors with DEBUG enabled.
    powerpc: Convert BUG() to use unreachable()
    powerpc/pseries: Make declarations of cpu_hotplug_driver_lock() ANSI compatible.
    powerpc/pseries: Don't panic when H_PROD fails during cpu-online.
    powerpc/mm: Fix a WARN_ON() with CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_VM
    powerpc/defconfigs: Set HZ=100 on pseries and ppc64 defconfigs
    powerpc/defconfigs: Disable token ring in powerpc defconfigs
    powerpc/defconfigs: Reduce 64bit vmlinux by making acenic and cramfs modules
    powerpc/pseries: Select XICS and PCI_MSI PSERIES
    powerpc/85xx: Wrong variable returned on error
    powerpc/iseries: Convert to proc_fops
    ...

    Linus Torvalds
     

20 Dec, 2009

1 commit

  • Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, irq: Allow 0xff for /proc/irq/[n]/smp_affinity on an 8-cpu system
    Makefile: Unexport LC_ALL instead of clearing it
    x86: Fix objdump version check in arch/x86/tools/chkobjdump.awk
    x86: Reenable TSC sync check at boot, even with NONSTOP_TSC
    x86: Don't use POSIX character classes in gen-insn-attr-x86.awk
    Makefile: set LC_CTYPE, LC_COLLATE, LC_NUMERIC to C
    x86: Increase MAX_EARLY_RES; insufficient on 32-bit NUMA
    x86: Fix checking of SRAT when node 0 ram is not from 0
    x86, cpuid: Add "volatile" to asm in native_cpuid()
    x86, msr: msrs_alloc/free for CONFIG_SMP=n
    x86, amd: Get multi-node CPU info from NodeId MSR instead of PCI config space
    x86: Add IA32_TSC_AUX MSR and use it
    x86, msr/cpuid: Register enough minors for the MSR and CPUID drivers
    initramfs: add missing decompressor error check
    bzip2: Add missing checks for malloc returning NULL
    bzip2/lzma/gzip: pre-boot malloc doesn't return NULL on failure

    Linus Torvalds
     

18 Dec, 2009

1 commit

  • Memory balloon drivers can allocate a large amount of memory which is not
    movable but could be freed to accommodate memory hotplug remove.

    Prior to calling the memory hotplug notifier chain the memory in the
    pageblock is isolated. Currently, if the migrate type is not
    MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal
    for that page range to fail.

    Rather than failing pageblock isolation if the migratetype is not
    MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock,
    and not on the LRU, are owned by a registered balloon driver (or other
    entity) using a notifier chain. If all of the non-movable pages are owned
    by a balloon, they can be freed later through the memory notifier chain
    and the range can still be isolated in set_migratetype_isolate().

    Signed-off-by: Robert Jennings
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: Brian King
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Benjamin Herrenschmidt

    Robert Jennings
     

17 Dec, 2009

2 commits

  • Found one system that boots from socket1 instead of socket0, and its SRAT gets rejected...

    [ 0.000000] SRAT: Node 1 PXM 0 0-a0000
    [ 0.000000] SRAT: Node 1 PXM 0 100000-80000000
    [ 0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
    [ 0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
    [ 0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
    [ 0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
    [ 0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
    [ 0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
    [ 0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
    [ 0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
    ...
    [ 0.000000] NUMA: Allocated memnodemap from 500000 - 701040
    [ 0.000000] NUMA: Using 20 for the hash shift.
    [ 0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
    [ 0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
    [ 0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
    [ 0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
    [ 0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
    [ 0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
    [ 0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
    [ 0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
    [ 0.000000] SRAT: SRAT not used.

    The early_node_map is not sorted, because node0, with a non-zero start, comes first.

    So sort it right away after all regions are registered.

    This also fixes a regression introduced by 8716273c (x86: Export srat physical topology).

    -v2: handle the cross-node case more robustly, e.g. node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
    -v3: update comments.

    Reported-and-tested-by: Jens Axboe
    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
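    A hedged sketch of sorting early_node_map by start_pfn with the kernel's
    sort() helper, as the description suggests (names follow the existing
    node_active_region/early_node_map conventions; details may differ):

        #include <linux/sort.h>

        static int __init cmp_node_active_region(const void *a, const void *b)
        {
                const struct node_active_region *ra = a, *rb = b;

                /* compare rather than subtract pfns to avoid overflow */
                if (ra->start_pfn > rb->start_pfn)
                        return 1;
                if (ra->start_pfn < rb->start_pfn)
                        return -1;
                return 0;
        }

        static void __init sort_node_map(void)
        {
                sort(early_node_map, (size_t)nr_nodemap_entries,
                     sizeof(struct node_active_region),
                     cmp_node_active_region, NULL);
        }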
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
    HWPOISON: Remove stray phrase in a comment
    HWPOISON: Try to allocate migration page on the same node
    HWPOISON: Don't do early filtering if filter is disabled
    HWPOISON: Add a madvise() injector for soft page offlining
    HWPOISON: Add soft page offline support
    HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
    HWPOISON: Use new shake_page in memory_failure
    HWPOISON: Use correct name for MADV_HWPOISON in documentation
    HWPOISON: mention HWPoison in Kconfig entry
    HWPOISON: Use get_user_page_fast in hwpoison madvise
    HWPOISON: add an interface to switch off/on all the page filters
    HWPOISON: add memory cgroup filter
    memcg: add accessor to mem_cgroup.css
    memcg: rename and export try_get_mem_cgroup_from_page()
    HWPOISON: add page flags filter
    mm: export stable page flags
    HWPOISON: limit hwpoison injector to known page types
    HWPOISON: add fs/device filters
    HWPOISON: return 0 to indicate success reliably
    HWPOISON: make semantics of IGNORED/DELAYED clear
    ...

    Linus Torvalds
     

16 Dec, 2009

3 commits

  • Fix node-oriented allocation handling in oom-kill.c. I myself think of
    this as a bugfix, not as an enhancement.

    These days, things have changed:
    - alloc_pages() takes a nodemask as its argument, via __alloc_pages_nodemask().
    - mempolicy doesn't maintain its own private zonelists.
    (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)

    So the current oom-killer's check function is wrong.

    This patch does the following:
    - check the nodemask: if a nodemask is given and it doesn't cover all of
    node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
    - scan all zonelists under the nodemask: if the scan hits cpuset's wall,
    this failure is from the cpuset.
    And
    - modify the caller of out_of_memory() not to invoke the oom killer if
    __GFP_THISNODE is set. This doesn't change "current" behavior: callers
    using __GFP_THISNODE should handle "page allocation failure" by themselves.

    - handle the __GFP_NOFAIL + __GFP_THISNODE path.
    This is something like a FIXME, but this gfp mask combination is not used now.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc: Daisuke Nishimura
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
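    A hedged sketch of the two checks described, using the real nodemask and
    cpuset helpers (the surrounding constraint function is trimmed):

        /* A nodemask that doesn't cover all memory nodes means mempolicy. */
        if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
                return CONSTRAINT_MEMORY_POLICY;

        /* Otherwise, see whether the failure came from hitting cpuset's wall. */
        for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask)
                if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
                        return CONSTRAINT_CPUSET;

        return CONSTRAINT_NONE;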
     
  • Most free pages in the buddy system have no PG_buddy set.
    Introduce is_free_buddy_page() for detecting them reliably.

    CC: Nick Piggin
    CC: Mel Gorman
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
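    A hedged sketch of how such a check can work: walk up the orders and see
    whether this page falls inside a block whose head page has PG_buddy set at
    a sufficient order (locking and the page_order() helper are shown for
    context; details differ per tree):

        bool is_free_buddy_page(struct page *page)
        {
                struct zone *zone = page_zone(page);
                unsigned long pfn = page_to_pfn(page);
                unsigned long flags;
                int order;

                spin_lock_irqsave(&zone->lock, flags);
                for (order = 0; order < MAX_ORDER; order++) {
                        struct page *head = page - (pfn & ((1 << order) - 1));

                        if (PageBuddy(head) && page_order(head) >= order)
                                break;
                }
                spin_unlock_irqrestore(&zone->lock, flags);

                return order < MAX_ORDER;
        }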
     
  • Remove three degrees of obfuscation, left over from when we had
    CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
    CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
    built when CONFIG_MMU, so don't need such conditions at all.

    Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
    169 defconfigs: leave those to evolve in due course.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Nov, 2009

2 commits

  • Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 ("page allocator:
    calculate the alloc_flags for allocation only once") altered watermark
    logic slightly by allowing rt_tasks that are handling an interrupt to set
    ALLOC_HARDER. This patch brings the watermark logic more in line with
    2.6.30.

    This change results in a reduction in the number of high-order GFP_ATOMIC
    allocation failures reported. See
    http://www.gossamer-threads.com/lists/linux/kernel/1144153

    [rientjes@google.com: Spotted the problem]
    Signed-off-by: Mel Gorman
    Reviewed-by: Pekka Enberg
    Reviewed-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a direct reclaim makes no forward progress, it considers whether it
    should go OOM or not. Whether OOM is triggered or not, it may retry the
    allocation afterwards. In times past, this would always wake kswapd as
    well but currently, kswapd is not woken up after direct reclaim fails.
    For order-0 allocations, this makes little difference but if there is a
    heavy mix of higher-order allocations that direct reclaim is failing for,
    it might mean that kswapd is not rewoken for higher orders as much as it
    did previously.

    This patch wakes up kswapd when an allocation is being retried after a
    direct reclaim failure. It would be expected that kswapd is already
    awake, but this has the effect of telling kswapd to reclaim at the higher
    order as well.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Reviewed-by: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 Oct, 2009

1 commit

  • Revert

    commit 71de1ccbe1fb40203edd3beb473f8580d917d2ca
    Author: KOSAKI Motohiro
    AuthorDate: Mon Sep 21 17:01:31 2009 -0700
    Commit: Linus Torvalds
    CommitDate: Tue Sep 22 07:17:27 2009 -0700

    mm: oom analysis: add buffer cache information to show_free_areas()

    show_free_areas() is called during page allocation failures, and page
    allocation failures can occur in any calling context.

    But nr_blockdev_pages() takes VFS locks which should not be taken from
    hard IRQ context (at least). The result is lockdep warnings (and
    deadlockability) during page allocation failures.

    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

24 Sep, 2009

2 commits

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places at arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

22 Sep, 2009

10 commits

  • Move highest_memmap_pfn __read_mostly from page_alloc.c next to zero_pfn
    __read_mostly in memory.c: to help them share a cacheline, since they're
    very often tested together in vm_normal_page().

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When round-robin freeing pages from the PCP lists, empty lists may be
    encountered. In the event one of the lists has more pages than another,
    there may be numerous checks for list_empty() which is undesirable. This
    patch maintains a count of pages to free which is incremented when empty
    lists are encountered. The intention is that more pages will then be
    freed from fuller lists than the empty ones reducing the number of empty
    list checks in the free path.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Reviewed-by: Minchan Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The following two patches remove searching in the page allocator fast-path
    by maintaining multiple free-lists in the per-cpu structure. At the time
    the search was introduced, increasing the per-cpu structures would waste a
    lot of memory as per-cpu structures were statically allocated at
    compile-time. This is no longer the case.

    The patches are as follows. They are based on mmotm-2009-08-27.

    Patch 1 adds multiple lists to struct per_cpu_pages, one per
    migratetype that can be stored on the PCP lists.

    Patch 2 notes that the pcpu drain path checks empty lists multiple times. The
    patch reduces the number of checks by maintaining a count of free
    lists encountered. Lists containing pages will then free multiple
    pages in batches

    The patches were tested with kernbench, netperf udp/tcp, hackbench and
    sysbench. The netperf tests were not bound to any CPU in particular and
    were run such that the results should be 99% confidence that the reported
    results are within 1% of the estimated mean. sysbench was run with a
    postgres background and read-only tests. Similar to netperf, it was run
    multiple times so that its results are within 1% at 99% confidence. The
    patches were tested on x86, x86-64 and ppc64 as follows:

    x86: Intel Pentium D 3GHz with 8G RAM (no-brand machine)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 1.34% to 2.28% gain
    netperf-tcp - 0.45% to 1.22% gain
    hackbench - Small variances, very close to noise
    sysbench - Very small gains

    x86-64: AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 1.83% to 10.42% gains
    netperf-tcp - Not conclusive until buffer >= PAGE_SIZE
    4096 +15.83%
    8192 + 0.34% (not significant)
    16384 + 1%
    hackbench - Small gains, very close to noise
    sysbench - 0.79% to 1.6% gain

    ppc64: PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 2-3% gain for almost all buffer sizes tested
    netperf-tcp - losses on small buffers, gains on larger buffers
    possibly indicates some bad caching effect.
    hackbench - No significant difference
    sysbench - 2-4% gain

    This patch:

    Currently the per-cpu page allocator searches the PCP list for pages of
    the correct migrate-type to reduce the possibility of pages being
    inappropriately placed from a fragmentation perspective. This search is
    potentially expensive in a fast-path and undesirable. Splitting the
    per-cpu list into multiple lists increases the size of a per-cpu structure
    and this was potentially a major problem at the time the search was
    introduced. This problem has been mitigated as now only the necessary
    number of structures is allocated for the running system.

    This patch replaces a list search in the per-cpu allocator with one list
    per migrate type. The potential snag with this approach is when bulk
    freeing pages. We round-robin free pages based on migrate type which has
    little bearing on the cache hotness of the page and potentially checks
    empty lists repeatedly in the event the majority of PCP pages are of one
    type.

    Signed-off-by: Mel Gorman
    Acked-by: Nick Piggin
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
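    A hedged sketch of the structural change described in Patch 1:
    per_cpu_pages grows one list per migratetype kept on the pcp lists instead
    of a single mixed list (field comments are illustrative):

        struct per_cpu_pages {
                int count;      /* number of pages in the lists */
                int high;       /* high watermark, emptying needed */
                int batch;      /* chunk size for buddy add/remove */

                /* was: struct list_head list;  -- a single mixed list */
                struct list_head lists[MIGRATE_PCPTYPES]; /* one per pcp migratetype */
        };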
     
  • For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
    which case shrink_list() _still_ calls isolate_pages() with the much
    larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan
    rate by up to 32 times.

    For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
    So when shrink_zone() expects to scan 4 pages in the active/inactive list,
    the active list will be scanned 4 pages, while the inactive list will be
    (over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that could break
    the balance between the two lists.

    It can further impact the scan of anon active list, due to the anon
    active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():

    inactive anon list over scanned => inactive_anon_is_low() == TRUE
    => shrink_active_list()
    => active anon list over scanned

    So the end result may be

    - anon inactive => over scanned
    - anon active => over scanned (maybe not as much)
    - file inactive => over scanned
    - file active => under scanned (relatively)

    The accesses to nr_saved_scan are not lock protected and so not 100%
    accurate; however, we can tolerate small errors and the resulting small
    imbalances in scan rates between zones.

    Cc: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • This is being done by allowing boot time allocations to specify that they
    may want a sub-page sized amount of memory.

    Overall this seems more consistent with the other hash table allocations,
    and allows making two supposedly mm-only variables really mm-only
    (nr_{kernel,all}_pages).

    Signed-off-by: Jan Beulich
    Cc: Ingo Molnar
    Cc: "Eric W. Biederman"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • After anti-fragmentation was merged, a bug was reported whereby devices
    that depended on high-order atomic allocations were failing. The solution
    was to preserve a property in the buddy allocator which tended to keep the
    minimum number of free pages in the zone at the lower physical addresses
    and contiguous. To preserve this property, MIGRATE_RESERVE was introduced
    and a number of pageblocks at the start of a zone would be marked
    "reserve", the number of which depended on min_free_kbytes.

    Anti-fragmentation works by avoiding the mixing of page migratetypes
    within the same pageblock. One way of helping this is to increase
    min_free_kbytes, because it becomes less likely that it will be necessary
    to place pages of different migratetypes within the same pageblock.
    However, because the number of MIGRATE_RESERVE pageblocks is unbounded,
    the free memory is kept there in large contiguous blocks instead of
    helping anti-fragmentation as much as it should. With the page-allocator
    tracepoint patches applied, it was found during anti-fragmentation tests
    that the number of fragmentation-related events was far higher than
    expected even with min_free_kbytes at higher values.

    This patch limits the number of MIGRATE_RESERVE blocks that exist per zone
    to two. For example, with a sufficient min_free_kbytes, 4MB of memory
    will be kept aside on x86-64 and remain more or less free and
    contiguous for the system's uptime. This should be sufficient for devices
    depending on high-order atomic allocations while helping fragmentation
    control when min_free_kbytes is tuned appropriately. A side-effect of
    this patch is that the reserve variable is converted to an int, as unsigned
    long was the wrong type to use when ensuring that only the required number
    of reserve blocks is created.

    With the patches applied, fragmentation-related events as measured by the
    page allocator tracepoints were significantly reduced when running some
    fragmentation stress-tests on systems with min_free_kbytes tuned to a
    value appropriate for hugepage allocations at runtime. On x86, the events
    recorded were reduced by 99.8%, on x86-64 by 99.72% and on ppc64 by
    99.83%.

    Signed-off-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
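    A hedged sketch of the cap in setup_zone_migrate_reserve(): the number of
    reserve pageblocks is still derived from the watermark, but is now clamped
    (the derivation line is illustrative context):

        reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >> pageblock_order;
        reserve = min(2, reserve);      /* at most two MIGRATE_RESERVE pageblocks per zone */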
     
  • The page allocation trace event reports that a page was successfully
    allocated but it does not specify where it came from. When analysing
    performance, it can be important to distinguish between pages coming from
    the per-cpu allocator and pages coming from the buddy lists as the latter
    requires the zone lock to the taken and more data structures to be
    examined.

    This patch adds a trace event for __rmqueue reporting when a page is being
    allocated from the buddy lists. It distinguishes between being called to
    refill the per-cpu lists and being called for a high-order allocation.
    Similarly, this patch adds an event to catch when the PCP lists are being
    drained a little and pages are going back to the buddy lists.

    This is trickier to draw conclusions from but high activity on those
    events could explain why there were a large number of cache misses on a
    page-allocator-intensive workload. The coalescing and splitting of
    buddies involves a lot of writing of page metadata and cache line bounces
    not to mention the acquisition of an interrupt-safe lock necessary to
    enter this path.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Fragmentation avoidance depends on being able to use free pages from lists
    of the appropriate migrate type. In the event this is not possible,
    __rmqueue_fallback() selects a different list and in some circumstances
    changes the migratetype of the pageblock. Simplistically, the more times
    this event occurs, the more likely that fragmentation will be a problem
    later for hugepage allocation at least but there are other considerations
    such as the order of the page being split to satisfy the allocation.

    This patch adds a trace event for __rmqueue_fallback() that reports what
    page is being used for the fallback, the orders of relevant pages, the
    desired migratetype and the migratetype of the lists being used, whether
    the pageblock changed type and whether this event is important with
    respect to fragmentation avoidance or not. This information can be used
    to help analyse fragmentation avoidance and help decide whether
    min_free_kbytes should be increased or not.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch adds trace events for the allocation and freeing of pages,
    including the freeing of pagevecs. Using the events, it will be known
    what struct page and pfns are being allocated and freed and what the call
    site was in many cases.

    The page alloc tracepoints can be used as an indicator of whether the
    workload was heavily dependent on the page allocator or not. You can make
    a guess based on vmstat, but you can't get a per-process breakdown.
    Depending on the call path, the call_site for page allocation may be
    __get_free_pages() instead of a useful callsite. Instead of passing down
    a return address similar to slab debugging, the user should enable the
    stacktrace and seg-addr options to get a proper stack trace.

    The pagevec free tracepoint has a different usecase. It can be used to
    get an idea of how many pages are being dumped off the LRU and whether it
    is kswapd doing the work or a process doing direct reclaim.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The function free_cold_page() has no callers so delete it.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman