13 Mar, 2010

2 commits

  • - introduce dump_page() to print the page info for debugging some error
    condition.

    - convert three mm users: bad_page(), print_bad_pte() and memory offline
    failure.

    - print an extra field: the symbolic names of page->flags

    Example dump_page() output:

    [ 157.521694] page:ffffea0000a7cba8 count:2 mapcount:1 mapping:ffff88001c901791 index:0x147
    [ 157.525570] page flags: 0x100000000100068(uptodate|lru|active|swapbacked)

    Signed-off-by: Wu Fengguang
    Cc: Ingo Molnar
    Cc: Alex Chiang
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
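    A hedged sketch of how an error path might use the new helper; the
    surrounding function and its name are illustrative, only dump_page()
    itself comes from the patch:

        #include <linux/mm.h>
        #include <linux/kernel.h>

        /* Illustrative only: report a suspicious page, then dump its state. */
        static void report_bad_page(struct page *page, const char *reason)
        {
                printk(KERN_ALERT "BUG: bad page state: %s\n", reason);
                dump_page(page); /* count, mapcount, mapping, index and flag names */
        }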
     
  • __zone_pcp_update() iterates over NR_CPUS instead of limiting the access
    to the possible cpus. This might result in access to uninitialized areas
    as the per cpu allocator only populates the per cpu memory for possible
    cpus.

    This problem was created as a result of the dynamic allocation of pagesets
    from percpu memory that went in during the merge window - commit
    99dcc3e5a94ed491fbef402831d8c0bbb267f995 ("this_cpu: Page allocator
    conversion").

    Signed-off-by: Thomas Gleixner
    Acked-by: Pekka Enberg
    Acked-by: Tejun Heo
    Acked-by: Christoph Lameter
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
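    A hedged sketch of the difference; the helper name is illustrative, but
    for_each_possible_cpu() and per_cpu_ptr() are the standard interfaces for
    touching only the per-cpu areas the percpu allocator actually populated:

        #include <linux/cpumask.h>
        #include <linux/percpu.h>
        #include <linux/mmzone.h>

        /* Illustrative: walk only the possible cpus, not 0..NR_CPUS-1. */
        static void reset_zone_pagesets(struct zone *zone)
        {
                int cpu;

                for_each_possible_cpu(cpu) {    /* was: for (cpu = 0; cpu < NR_CPUS; cpu++) */
                        struct per_cpu_pageset *pset = per_cpu_ptr(zone->pageset, cpu);

                        pset->pcp.count = 0;    /* illustrative re-initialisation */
                }
        }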
     

07 Mar, 2010

6 commits

  • free_area_init_nodes() emits pfn ranges for all zones on the system.
    There may be no pages on a higher zone, however, due to memory limitations
    or the use of the mem= kernel parameter. For example:

    Zone PFN ranges:
    DMA 0x00000001 -> 0x00001000
    DMA32 0x00001000 -> 0x00100000
    Normal 0x00100000 -> 0x00100000

    The implementation copies the previous zone's highest pfn, if any, as the
    next zone's lowest pfn. If its highest pfn is then greater than the
    amount of addressable memory, the upper memory limit is used instead.
    Thus, both the lowest and highest possible pfn for higher zones without
    memory may be the same.

    The pfn range for zones without memory is now shown as "empty" instead.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
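    A hedged sketch of the reporting logic described above (names are
    illustrative, not lifted from a particular tree): a zone whose lowest and
    highest possible pfn are equal holds no pages, so it is labelled "empty"
    rather than printed as a degenerate range:

        #include <linux/kernel.h>

        static void __init report_zone_pfn_range(const char *name,
                                                 unsigned long low_pfn,
                                                 unsigned long high_pfn)
        {
                if (low_pfn == high_pfn)
                        printk(KERN_INFO "  %-8s empty\n", name);
                else
                        printk(KERN_INFO "  %-8s %#010lx -> %#010lx\n",
                               name, low_pfn, high_pfn);
        }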
     
  • There are quite a few GFP_KERNEL memory allocations made during
    suspend/hibernation and resume that may cause the system to hang, because
    the I/O operations they depend on cannot be completed due to the
    underlying devices being suspended.

    Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
    gfp_allowed_mask before suspend/hibernation and restoring the original
    values of these bits in gfp_allowed_mask during the subsequent resume.

    [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
    Signed-off-by: Rafael J. Wysocki
    Reported-by: Maxim Levitsky
    Cc: Sebastian Ott
    Cc: Benjamin Herrenschmidt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
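    A hedged sketch of the idea; the helper names are illustrative and not
    necessarily those the patch introduced, but gfp_allowed_mask, __GFP_IO and
    __GFP_FS are the real symbols involved:

        #include <linux/gfp.h>

        static gfp_t saved_gfp_mask;

        static void suspend_restrict_gfp_mask(void)
        {
                saved_gfp_mask = gfp_allowed_mask;
                gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS); /* no I/O while devices sleep */
        }

        static void resume_restore_gfp_mask(void)
        {
                gfp_allowed_mask = saved_gfp_mask;          /* allow I/O and FS callbacks again */
        }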
     
  • commit e815af95 ("change all_unreclaimable zone member to flags") changed
    the all_unreclaimable member into a bit flag. But it had an undesirable
    side effect: free_one_page() is one of the hottest paths in the Linux
    kernel, and increasing the atomic ops in it can reduce kernel performance
    a bit.

    Thus, this patch partially reverts that commit. At the very least,
    all_unreclaimable shouldn't share a memory word with other zone flags.

    [akpm@linux-foundation.org: fix patch interaction]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
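    A hedged sketch of the effect: with all_unreclaimable back as its own word
    in struct zone, the free hot path can use a plain store instead of an
    atomic bit operation on the shared zone->flags word (the accessor below is
    illustrative of the shape of the change, not the patch itself):

        #include <linux/mmzone.h>

        static inline void zone_mark_reclaimable(struct zone *zone)
        {
                zone->all_unreclaimable = 0;    /* plain write, no locked bit op */
        }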
     
  • free_hot_page() is just a wrapper around free_hot_cold_page() with
    parameter 'cold = 0'. After adding a clear comment for
    free_hot_cold_page(), it is reasonable to remove this level of indirection.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Cc: Americo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
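    As the description says, the dropped wrapper was essentially the
    following, so callers can simply invoke free_hot_cold_page(page, 0)
    themselves (a sketch, not the exact removed hunk):

        void free_hot_page(struct page *page)
        {
                free_hot_cold_page(page, 0);    /* cold == 0: page is likely cache-hot */
        }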
     
  • Move the call of trace_mm_page_free_direct() from free_hot_page() to
    free_hot_cold_page(). This is clearer and keeps it next to
    kmemcheck_free_shadow(), mirroring what is done in __free_pages_ok().

    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     
  • trace_mm_page_free_direct() is called in function __free_pages(). But it
    is called again in free_hot_page() if order == 0, producing duplicate
    records in the trace file for the mm_page_free_direct event, as below:

    K-PID CPU# TIMESTAMP FUNCTION
    gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
    gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0

    This patch removes the first call and adds a call to
    trace_mm_page_free_direct() in __free_pages_ok().

    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
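    A hedged sketch of the resulting flow: __free_pages() no longer emits the
    event itself, so each underlying free path traces exactly once:

        void __free_pages(struct page *page, unsigned int order)
        {
                if (put_page_testzero(page)) {
                        if (order == 0)
                                free_hot_page(page);          /* traces via the pcp free path */
                        else
                                __free_pages_ok(page, order); /* now calls trace_mm_page_free_direct() */
                }
        }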
     

04 Mar, 2010

1 commit

  • Merge branch 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
    early_res: Need to save the allocation name in drop_range_partial()
    sparsemem: Fix compilation on PowerPC
    early_res: Add free_early_partial()
    x86: Fix non-bootmem compilation on PowerPC
    core: Move early_res from arch/x86 to kernel/
    x86: Add find_fw_memmap_area
    Move round_up/down to kernel.h
    x86: Make 32bit support NO_BOOTMEM
    early_res: Enhance check_and_double_early_res
    x86: Move back find_e820_area to e820.c
    x86: Add find_early_area_size
    x86: Separate early_res related code from e820.c
    x86: Move bios page reserve early to head32/64.c
    sparsemem: Put mem map for one node together.
    sparsemem: Put usemap for one node together
    x86: Make 64 bit use early_res instead of bootmem before slab
    x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA
    x86: Make early_node_mem get mem > 4 GB if possible
    x86: Dynamically increase early_res array size
    x86: Introduce max_early_res and early_res_count
    ...

    Linus Torvalds
     

22 Feb, 2010

1 commit

  • This fixes build errors seen on some non-x86 platforms (PowerPC, for example):

    mm/page_alloc.c: In function '__alloc_memory_core_early':
    mm/page_alloc.c:3468: error: implicit declaration of function 'find_early_area'
    mm/page_alloc.c:3483: error: implicit declaration of function 'reserve_early_without_check'

    The function is only needed when CONFIG_NO_BOOTMEM is set.

    Signed-off-by: Yinghai Lu
    Cc: Andrew Morton
    Cc: Johannes Weiner
    Cc: Mel Gorman
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Yinghai Lu
     

30 Jan, 2010

1 commit

  • After memory pressure has forced it to dip into the reserves, 2.6.32's
    5f8dcc21211a3d4e3a7a5ca366b469fb88117f61 "page-allocator: split per-cpu
    list into one-list-per-migrate-type" has been returning MIGRATE_RESERVE
    pages to the MIGRATE_MOVABLE free_list: in some sense depleting reserves.

    Fix that in the most straightforward way (which, considering the overheads
    of alternative approaches, is Mel's preference): the right migratetype is
    already in page_private(page), but free_pcppages_bulk() wasn't using it.

    How did this bug show up? As a 20% slowdown in my tmpfs loop kbuild
    swapping tests, on PowerMac G5 with SLUB allocator. Bisecting to that
    commit was easy, but explaining the magnitude of the slowdown not easy.

    The same effect appears, but much less markedly, with SLAB, and even
    less markedly on other machines (the PowerMac divides into fewer zones
    than x86, I think that may be a factor). We guess that lumpy reclaim
    of short-lived high-order pages is implicated in some way, and probably
    this bug has been tickling a poor decision somewhere in page reclaim.

    But instrumentation hasn't told me much, I've run out of time and
    imagination to determine exactly what's going on, and shouldn't hold up
    the fix any longer: it's valid, and might even fix other misbehaviours.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
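    A sketch of the fix inside free_pcppages_bulk(), per the description above
    (context trimmed): free each page back according to its own migratetype,
    stashed in page_private() when it entered the pcp list, rather than the
    migratetype of the pcp list being drained:

        /* before: */
        __free_one_page(page, zone, 0, migratetype);         /* type of the pcp list */
        /* after: the MIGRATE_MOVABLE list may hold MIGRATE_RESERVE pages */
        __free_one_page(page, zone, 0, page_private(page));  /* the page's own type */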
     

17 Jan, 2010

2 commits

  • commit f2260e6b (page allocator: update NR_FREE_PAGES only as necessary)
    introduced a minor regression: if __rmqueue() fails, the NR_FREE_PAGES
    stat goes wrong. This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Reported-by: Huang Shijie
    Reviewed-by: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
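    A hedged sketch of the fix in the order > 0 path of buffered_rmqueue():
    only account pages as removed once __rmqueue() has actually returned one
    (surrounding locking shown for context and may differ per tree):

        page = __rmqueue(zone, order, migratetype);
        spin_unlock(&zone->lock);
        if (!page)
                goto failed;
        __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));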
     
  • The current check for 'backward merging' within add_active_range() does
    not seem correct. start_pfn must be compared against
    early_node_map[i].start_pfn (and NOT against .end_pfn) to find out whether
    the new region is backward-mergeable with the existing range.

    Signed-off-by: Kazuhisa Ichikawa
    Acked-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kazuhisa Ichikawa
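    A hedged reconstruction of the corrected check in add_active_range(), per
    the description above:

        /* Merge backward if suitable */
        if (start_pfn < early_node_map[i].start_pfn &&   /* was compared against .end_pfn */
            end_pfn >= early_node_map[i].start_pfn) {
                early_node_map[i].start_pfn = start_pfn; /* grow the existing range downwards */
                return;
        }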
     

05 Jan, 2010

1 commit

  • Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

    This drastically reduces the size of struct zone for systems with large
    amounts of processors and allows placement of critical variables of struct
    zone in one cacheline even on very large systems.

    Another effect is that the pagesets of one processor are placed near one
    another. If multiple pagesets from different zones fit into one cacheline
    then additional cacheline fetches can be avoided on the hot paths when
    allocating memory from multiple zones.

    Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
    are reduced and we can drop the zone_pcp macro.

    Hotplug handling is also simplified since cpu alloc can bring up and
    shut down cpu areas for a specific cpu as a whole. So there is no need to
    allocate or free individual pagesets.

    V7-V8:
    - Explain chicken-and-egg dilemma with percpu allocator.

    V4-V5:
    - Fix up cases where per_cpu_ptr is called before irq disable
    - Integrate the bootstrap logic that was separate before.

    tj: Build failure in pageset_cpuup_callback() due to missing ret
    variable fixed.

    Reviewed-by: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
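    A hedged sketch of the struct zone change and the resulting hot-path
    access (exact field placement and naming differ per tree):

        /* was: struct per_cpu_pageset *pageset[NR_CPUS];  -- one pointer per possible cpu */
        struct per_cpu_pageset __percpu *pageset;          /* now: one percpu pointer from alloc_percpu() */

        /* hot-path access to the local cpu's lists: */
        struct per_cpu_pages *pcp = &this_cpu_ptr(zone->pageset)->pcp;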
     

23 Dec, 2009

1 commit

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (36 commits)
    powerpc/gc/wii: Remove get_irq_desc()
    powerpc/gc/wii: hlwd-pic: convert irq_desc.lock to raw_spinlock
    powerpc/gamecube/wii: Fix off-by-one error in ugecon/usbgecko_udbg
    powerpc/mpic: Fix problem that affinity is not updated
    powerpc/mm: Fix stupid bug in subpge protection handling
    powerpc/iseries: use DECLARE_COMPLETION_ONSTACK for non-constant completion
    powerpc: Fix MSI support on U4 bridge PCIe slot
    powerpc: Handle VSX alignment faults correctly in little-endian mode
    powerpc/mm: Fix typo of cpumask_clear_cpu()
    powerpc/mm: Fix hash_utils_64.c compile errors with DEBUG enabled.
    powerpc: Convert BUG() to use unreachable()
    powerpc/pseries: Make declarations of cpu_hotplug_driver_lock() ANSI compatible.
    powerpc/pseries: Don't panic when H_PROD fails during cpu-online.
    powerpc/mm: Fix a WARN_ON() with CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_VM
    powerpc/defconfigs: Set HZ=100 on pseries and ppc64 defconfigs
    powerpc/defconfigs: Disable token ring in powerpc defconfigs
    powerpc/defconfigs: Reduce 64bit vmlinux by making acenic and cramfs modules
    powerpc/pseries: Select XICS and PCI_MSI PSERIES
    powerpc/85xx: Wrong variable returned on error
    powerpc/iseries: Convert to proc_fops
    ...

    Linus Torvalds
     

20 Dec, 2009

1 commit

  • Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, irq: Allow 0xff for /proc/irq/[n]/smp_affinity on an 8-cpu system
    Makefile: Unexport LC_ALL instead of clearing it
    x86: Fix objdump version check in arch/x86/tools/chkobjdump.awk
    x86: Reenable TSC sync check at boot, even with NONSTOP_TSC
    x86: Don't use POSIX character classes in gen-insn-attr-x86.awk
    Makefile: set LC_CTYPE, LC_COLLATE, LC_NUMERIC to C
    x86: Increase MAX_EARLY_RES; insufficient on 32-bit NUMA
    x86: Fix checking of SRAT when node 0 ram is not from 0
    x86, cpuid: Add "volatile" to asm in native_cpuid()
    x86, msr: msrs_alloc/free for CONFIG_SMP=n
    x86, amd: Get multi-node CPU info from NodeId MSR instead of PCI config space
    x86: Add IA32_TSC_AUX MSR and use it
    x86, msr/cpuid: Register enough minors for the MSR and CPUID drivers
    initramfs: add missing decompressor error check
    bzip2: Add missing checks for malloc returning NULL
    bzip2/lzma/gzip: pre-boot malloc doesn't return NULL on failure

    Linus Torvalds
     

18 Dec, 2009

1 commit

  • Memory balloon drivers can allocate a large amount of memory which is not
    movable but could be freed to accommodate memory hotplug remove.

    Prior to calling the memory hotplug notifier chain the memory in the
    pageblock is isolated. Currently, if the migrate type is not
    MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal
    for that page range to fail.

    Rather than failing pageblock isolation if the migratetype is not
    MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock,
    and not on the LRU, are owned by a registered balloon driver (or other
    entity) using a notifier chain. If all of the non-movable pages are owned
    by a balloon, they can be freed later through the memory notifier chain
    and the range can still be isolated in set_migratetype_isolate().

    Signed-off-by: Robert Jennings
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: Brian King
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Benjamin Herrenschmidt

    Robert Jennings
     

17 Dec, 2009

2 commits

  • Found one system that boots from socket1 instead of socket0, and its SRAT gets rejected...

    [ 0.000000] SRAT: Node 1 PXM 0 0-a0000
    [ 0.000000] SRAT: Node 1 PXM 0 100000-80000000
    [ 0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
    [ 0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
    [ 0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
    [ 0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
    [ 0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
    [ 0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
    [ 0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
    [ 0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
    ...
    [ 0.000000] NUMA: Allocated memnodemap from 500000 - 701040
    [ 0.000000] NUMA: Using 20 for the hash shift.
    [ 0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
    [ 0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
    [ 0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
    [ 0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
    [ 0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
    [ 0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
    [ 0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
    [ 0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
    [ 0.000000] SRAT: SRAT not used.

    The early_node_map is not sorted, because node0, with a non-zero start, comes first.

    So sort it right away after all regions are registered.

    This also fixes a regression introduced by 8716273c (x86: Export srat physical topology).

    -v2: handle the cross-node case more robustly, e.g. node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
    -v3: update comments.

    Reported-and-tested-by: Jens Axboe
    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
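    A hedged sketch of sorting early_node_map by start_pfn with the kernel's
    sort() helper, as the description suggests (names follow the existing
    node_active_region/early_node_map conventions; details may differ):

        #include <linux/sort.h>

        static int __init cmp_node_active_region(const void *a, const void *b)
        {
                const struct node_active_region *ra = a, *rb = b;

                /* compare rather than subtract pfns to avoid overflow */
                if (ra->start_pfn > rb->start_pfn)
                        return 1;
                if (ra->start_pfn < rb->start_pfn)
                        return -1;
                return 0;
        }

        static void __init sort_node_map(void)
        {
                sort(early_node_map, (size_t)nr_nodemap_entries,
                     sizeof(struct node_active_region),
                     cmp_node_active_region, NULL);
        }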
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
    HWPOISON: Remove stray phrase in a comment
    HWPOISON: Try to allocate migration page on the same node
    HWPOISON: Don't do early filtering if filter is disabled
    HWPOISON: Add a madvise() injector for soft page offlining
    HWPOISON: Add soft page offline support
    HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
    HWPOISON: Use new shake_page in memory_failure
    HWPOISON: Use correct name for MADV_HWPOISON in documentation
    HWPOISON: mention HWPoison in Kconfig entry
    HWPOISON: Use get_user_page_fast in hwpoison madvise
    HWPOISON: add an interface to switch off/on all the page filters
    HWPOISON: add memory cgroup filter
    memcg: add accessor to mem_cgroup.css
    memcg: rename and export try_get_mem_cgroup_from_page()
    HWPOISON: add page flags filter
    mm: export stable page flags
    HWPOISON: limit hwpoison injector to known page types
    HWPOISON: add fs/device filters
    HWPOISON: return 0 to indicate success reliably
    HWPOISON: make semantics of IGNORED/DELAYED clear
    ...

    Linus Torvalds
     

16 Dec, 2009

3 commits

  • Fix node-oriented allocation handling in oom-kill.c. I myself think of
    this as a bugfix, not as an enhancement.

    These days, things have changed:
    - alloc_pages() takes a nodemask as its argument, via __alloc_pages_nodemask().
    - mempolicy doesn't maintain its own private zonelists.
    (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)

    So the current oom-killer's check function is wrong.

    This patch does the following:
    - check the nodemask: if a nodemask is given and it doesn't cover all of
    node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
    - scan all zonelists under the nodemask: if the scan hits cpuset's wall,
    this failure is from the cpuset.
    And
    - modify the caller of out_of_memory() not to invoke the oom killer if
    __GFP_THISNODE is set. This doesn't change "current" behavior: callers
    using __GFP_THISNODE should handle "page allocation failure" by themselves.

    - handle the __GFP_NOFAIL + __GFP_THISNODE path.
    This is something like a FIXME, but this gfp mask combination is not used now.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc: Daisuke Nishimura
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
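    A hedged sketch of the two checks described, using the real nodemask and
    cpuset helpers (the surrounding constraint function is trimmed):

        /* A nodemask that doesn't cover all memory nodes means mempolicy. */
        if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
                return CONSTRAINT_MEMORY_POLICY;

        /* Otherwise, see whether the failure came from hitting cpuset's wall. */
        for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask)
                if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
                        return CONSTRAINT_CPUSET;

        return CONSTRAINT_NONE;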
     
  • Most free pages in the buddy system have no PG_buddy set.
    Introduce is_free_buddy_page() for detecting them reliably.

    CC: Nick Piggin
    CC: Mel Gorman
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
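    A hedged sketch of how such a check can work: walk up the orders and see
    whether this page falls inside a block whose head page has PG_buddy set at
    a sufficient order (locking and the page_order() helper are shown for
    context; details differ per tree):

        bool is_free_buddy_page(struct page *page)
        {
                struct zone *zone = page_zone(page);
                unsigned long pfn = page_to_pfn(page);
                unsigned long flags;
                int order;

                spin_lock_irqsave(&zone->lock, flags);
                for (order = 0; order < MAX_ORDER; order++) {
                        struct page *head = page - (pfn & ((1 << order) - 1));

                        if (PageBuddy(head) && page_order(head) >= order)
                                break;
                }
                spin_unlock_irqrestore(&zone->lock, flags);

                return order < MAX_ORDER;
        }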
     
  • Remove three degrees of obfuscation, left over from when we had
    CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
    CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
    built when CONFIG_MMU, so don't need such conditions at all.

    Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
    169 defconfigs: leave those to evolve in due course.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Nov, 2009

2 commits

  • Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 ("page allocator:
    calculate the alloc_flags for allocation only once") altered watermark
    logic slightly by allowing rt_tasks that are handling an interrupt to set
    ALLOC_HARDER. This patch brings the watermark logic more in line with
    2.6.30.

    This change results in a reduction in the number of high-order GFP_ATOMIC
    allocation failures reported. See
    http://www.gossamer-threads.com/lists/linux/kernel/1144153

    [rientjes@google.com: Spotted the problem]
    Signed-off-by: Mel Gorman
    Reviewed-by: Pekka Enberg
    Reviewed-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a direct reclaim makes no forward progress, it considers whether it
    should go OOM or not. Whether OOM is triggered or not, it may retry the
    allocation afterwards. In times past, this would always wake kswapd as
    well but currently, kswapd is not woken up after direct reclaim fails.
    For order-0 allocations, this makes little difference but if there is a
    heavy mix of higher-order allocations that direct reclaim is failing for,
    it might mean that kswapd is not rewoken for higher orders as much as it
    did previously.

    This patch wakes up kswapd when an allocation is being retried after a
    direct reclaim failure. It would be expected that kswapd is already
    awake, but this has the effect of telling kswapd to reclaim at the higher
    order as well.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Reviewed-by: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 Oct, 2009

1 commit

  • Revert

    commit 71de1ccbe1fb40203edd3beb473f8580d917d2ca
    Author: KOSAKI Motohiro
    AuthorDate: Mon Sep 21 17:01:31 2009 -0700
    Commit: Linus Torvalds
    CommitDate: Tue Sep 22 07:17:27 2009 -0700

    mm: oom analysis: add buffer cache information to show_free_areas()

    show_free_areas() is called during page allocation failures, and page
    allocation failures can occur in any calling context.

    But nr_blockdev_pages() takes VFS locks which should not be taken from
    hard IRQ context (at least). The result is lockdep warnings (and
    deadlockability) during page allocation failures.

    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

24 Sep, 2009

2 commits

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places at arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

22 Sep, 2009

10 commits

  • Move highest_memmap_pfn __read_mostly from page_alloc.c next to zero_pfn
    __read_mostly in memory.c: to help them share a cacheline, since they're
    very often tested together in vm_normal_page().

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When round-robin freeing pages from the PCP lists, empty lists may be
    encountered. In the event one of the lists has more pages than another,
    there may be numerous checks for list_empty() which is undesirable. This
    patch maintains a count of pages to free which is incremented when empty
    lists are encountered. The intention is that more pages will then be
    freed from fuller lists than the empty ones reducing the number of empty
    list checks in the free path.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Reviewed-by: Minchan Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The following two patches remove searching in the page allocator fast-path
    by maintaining multiple free-lists in the per-cpu structure. At the time
    the search was introduced, increasing the per-cpu structures would waste a
    lot of memory as per-cpu structures were statically allocated at
    compile-time. This is no longer the case.

    The patches are as follows. They are based on mmotm-2009-08-27.

    Patch 1 adds multiple lists to struct per_cpu_pages, one per
    migratetype that can be stored on the PCP lists.

    Patch 2 notes that the pcpu drain path checks empty lists multiple times. The
    patch reduces the number of checks by maintaining a count of free
    lists encountered. Lists containing pages will then free multiple
    pages in batches

    The patches were tested with kernbench, netperf udp/tcp, hackbench and
    sysbench. The netperf tests were not bound to any CPU in particular and
    were run such that the results should be 99% confidence that the reported
    results are within 1% of the estimated mean. sysbench was run with a
    postgres background and read-only tests. Similar to netperf, it was run
    multiple times so that its results are within 1% at 99% confidence. The
    patches were tested on x86, x86-64 and ppc64 as follows:

    x86: Intel Pentium D 3GHz with 8G RAM (no-brand machine)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 1.34% to 2.28% gain
    netperf-tcp - 0.45% to 1.22% gain
    hackbench - Small variances, very close to noise
    sysbench - Very small gains

    x86-64: AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 1.83% to 10.42% gains
    netperf-tcp - Not conclusive until buffer >= PAGE_SIZE
    4096 +15.83%
    8192 + 0.34% (not significant)
    16384 + 1%
    hackbench - Small gains, very close to noise
    sysbench - 0.79% to 1.6% gain

    ppc64: PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation)
    kernbench - No significant difference, variance well within noise
    netperf-udp - 2-3% gain for almost all buffer sizes tested
    netperf-tcp - losses on small buffers, gains on larger buffers
    possibly indicates some bad caching effect.
    hackbench - No significant difference
    sysbench - 2-4% gain

    This patch:

    Currently the per-cpu page allocator searches the PCP list for pages of
    the correct migrate-type to reduce the possibility of pages being
    inappropriately placed from a fragmentation perspective. This search is
    potentially expensive in a fast-path and undesirable. Splitting the
    per-cpu list into multiple lists increases the size of a per-cpu structure
    and this was potentially a major problem at the time the search was
    introduced. This problem has been mitigated as now only the necessary
    number of structures is allocated for the running system.

    This patch replaces a list search in the per-cpu allocator with one list
    per migrate type. The potential snag with this approach is when bulk
    freeing pages. We round-robin free pages based on migrate type which has
    little bearing on the cache hotness of the page and potentially checks
    empty lists repeatedly in the event the majority of PCP pages are of one
    type.

    Signed-off-by: Mel Gorman
    Acked-by: Nick Piggin
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
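    A hedged sketch of the structural change described in Patch 1:
    per_cpu_pages grows one list per migratetype kept on the pcp lists instead
    of a single mixed list (field comments are illustrative):

        struct per_cpu_pages {
                int count;      /* number of pages in the lists */
                int high;       /* high watermark, emptying needed */
                int batch;      /* chunk size for buddy add/remove */

                /* was: struct list_head list;  -- a single mixed list */
                struct list_head lists[MIGRATE_PCPTYPES]; /* one per pcp migratetype */
        };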
     
  • For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
    which case shrink_list() _still_ calls isolate_pages() with the much
    larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan
    rate by up to 32 times.

    For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
    So when shrink_zone() expects to scan 4 pages in the active/inactive list,
    the active list will be scanned 4 pages, while the inactive list will be
    (over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that could break
    the balance between the two lists.

    It can further impact the scan of anon active list, due to the anon
    active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():

    inactive anon list over scanned => inactive_anon_is_low() == TRUE
    => shrink_active_list()
    => active anon list over scanned

    So the end result may be

    - anon inactive => over scanned
    - anon active => over scanned (maybe not as much)
    - file inactive => over scanned
    - file active => under scanned (relatively)

    The accesses to nr_saved_scan are not lock protected and so not 100%
    accurate; however, we can tolerate small errors and the resulting small
    imbalances in scan rates between zones.

    Cc: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • This is being done by allowing boot time allocations to specify that they
    may want a sub-page sized amount of memory.

    Overall this seems more consistent with the other hash table allocations,
    and allows making two supposedly mm-only variables really mm-only
    (nr_{kernel,all}_pages).

    Signed-off-by: Jan Beulich
    Cc: Ingo Molnar
    Cc: "Eric W. Biederman"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • After anti-fragmentation was merged, a bug was reported whereby devices
    that depended on high-order atomic allocations were failing. The solution
    was to preserve a property in the buddy allocator which tended to keep the
    minimum number of free pages in the zone at the lower physical addresses
    and contiguous. To preserve this property, MIGRATE_RESERVE was introduced
    and a number of pageblocks at the start of a zone would be marked
    "reserve", the number of which depended on min_free_kbytes.

    Anti-fragmentation works by avoiding the mixing of page migratetypes
    within the same pageblock. One way of helping this is to increase
    min_free_kbytes, because it becomes less likely that it will be necessary
    to place pages of different migratetypes within the same pageblock.
    However, because the number of MIGRATE_RESERVE pageblocks is unbounded,
    the free memory is kept there in large contiguous blocks instead of
    helping anti-fragmentation as much as it should. With the page-allocator
    tracepoint patches applied, it was found during anti-fragmentation tests
    that the number of fragmentation-related events was far higher than
    expected even with min_free_kbytes at higher values.

    This patch limits the number of MIGRATE_RESERVE blocks that exist per zone
    to two. For example, with a sufficient min_free_kbytes, 4MB of memory
    will be kept aside on x86-64 and remain more or less free and
    contiguous for the system's uptime. This should be sufficient for devices
    depending on high-order atomic allocations while helping fragmentation
    control when min_free_kbytes is tuned appropriately. A side-effect of
    this patch is that the reserve variable is converted to an int, as unsigned
    long was the wrong type to use when ensuring that only the required number
    of reserve blocks is created.

    With the patches applied, fragmentation-related events as measured by the
    page allocator tracepoints were significantly reduced when running some
    fragmentation stress-tests on systems with min_free_kbytes tuned to a
    value appropriate for hugepage allocations at runtime. On x86, the events
    recorded were reduced by 99.8%, on x86-64 by 99.72% and on ppc64 by
    99.83%.

    Signed-off-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
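    A hedged sketch of the cap in setup_zone_migrate_reserve(): the number of
    reserve pageblocks is still derived from the watermark, but is now clamped
    (the derivation line is illustrative context):

        reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >> pageblock_order;
        reserve = min(2, reserve);      /* at most two MIGRATE_RESERVE pageblocks per zone */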
     
  • The page allocation trace event reports that a page was successfully
    allocated but it does not specify where it came from. When analysing
    performance, it can be important to distinguish between pages coming from
    the per-cpu allocator and pages coming from the buddy lists as the latter
    requires the zone lock to the taken and more data structures to be
    examined.

    This patch adds a trace event for __rmqueue reporting when a page is being
    allocated from the buddy lists. It distinguishes between being called to
    refill the per-cpu lists and being called for a high-order allocation.
    Similarly, this patch adds an event to catch when the PCP lists are being
    drained a little and pages are going back to the buddy lists.

    This is trickier to draw conclusions from but high activity on those
    events could explain why there were a large number of cache misses on a
    page-allocator-intensive workload. The coalescing and splitting of
    buddies involves a lot of writing of page metadata and cache line bounces
    not to mention the acquisition of an interrupt-safe lock necessary to
    enter this path.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Fragmentation avoidance depends on being able to use free pages from lists
    of the appropriate migrate type. In the event this is not possible,
    __rmqueue_fallback() selects a different list and in some circumstances
    changes the migratetype of the pageblock. Simplistically, the more times
    this event occurs, the more likely that fragmentation will be a problem
    later for hugepage allocation at least but there are other considerations
    such as the order of the page being split to satisfy the allocation.

    This patch adds a trace event for __rmqueue_fallback() that reports what
    page is being used for the fallback, the orders of relevant pages, the
    desired migratetype and the migratetype of the lists being used, whether
    the pageblock changed type and whether this event is important with
    respect to fragmentation avoidance or not. This information can be used
    to help analyse fragmentation avoidance and help decide whether
    min_free_kbytes should be increased or not.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch adds trace events for the allocation and freeing of pages,
    including the freeing of pagevecs. Using the events, it will be known
    what struct page and pfns are being allocated and freed and what the call
    site was in many cases.

    The page alloc tracepoints can be used as an indicator of whether the
    workload was heavily dependent on the page allocator or not. You can make
    a guess based on vmstat, but you can't get a per-process breakdown.
    Depending on the call path, the call_site for page allocation may be
    __get_free_pages() instead of a useful callsite. Instead of passing down
    a return address similar to slab debugging, the user should enable the
    stacktrace and seg-addr options to get a proper stack trace.

    The pagevec free tracepoint has a different usecase. It can be used to
    get an idea of how many pages are being dumped off the LRU and whether it
    is kswapd doing the work or a process doing direct reclaim.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The function free_cold_page() has no callers so delete it.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman