08 Jul, 2008

9 commits


05 Jul, 2008

5 commits

  • Flags considered internal to the mempolicy kernel code are stored as part
    of the "flags" member of struct mempolicy.

    Before exposing a policy type to userspace via get_mempolicy(), these
    internal flags must be masked. Flags exposed to userspace, however,
    should still be returned to the user.

    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
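
    The fix amounts to masking the policy's flags word with the set of
    userspace-visible mode flags before reporting it. A minimal userspace
    sketch of that masking pattern; the struct and flag values below are
    invented stand-ins modelled on the kernel's struct mempolicy and MPOL_*
    flags, not the kernel code itself:

      #include <stdio.h>

      /* Invented stand-ins for the kernel's MPOL_* flags. */
      #define MPOL_F_STATIC_NODES   (1 << 0)   /* user-visible mode flag       */
      #define MPOL_F_RELATIVE_NODES (1 << 1)   /* user-visible mode flag       */
      #define MPOL_MODE_FLAGS (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)
      #define MPOL_F_INTERNAL       (1 << 8)   /* internal-only, must not leak */

      struct mempolicy_mock {
              int mode;                /* policy type, e.g. "bind"          */
              unsigned short flags;    /* mix of internal and user-visible  */
      };

      /* What get_mempolicy() should report: the mode plus only those flags
       * that userspace is allowed to see. */
      static int policy_for_user(const struct mempolicy_mock *pol)
      {
              return pol->mode | (pol->flags & MPOL_MODE_FLAGS);
      }

      int main(void)
      {
              struct mempolicy_mock pol = {
                      .mode  = 2,
                      .flags = MPOL_F_STATIC_NODES | MPOL_F_INTERNAL,
              };

              /* Prints 0x3: the internal bit has been masked off. */
              printf("reported policy: %#x\n", (unsigned)policy_for_user(&pol));
              return 0;
      }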
     
  • get_user_pages() must not return the error when i != 0. When pages !=
    NULL we have i get_page()'ed pages.

    Signed-off-by: Oleg Nesterov
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
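
    A small userspace sketch of the return convention this fix restores:
    once at least one page has been pinned, report the partial count and let
    the caller retry, and only return the error code when the very first
    page fails. The helper names are invented, not the kernel's:

      #include <errno.h>
      #include <stdio.h>

      /* Invented per-page pinning step: in this mock it succeeds for the
       * first three pages and then fails. */
      static int pin_one_page(int idx)
      {
              return idx < 3 ? 0 : -EFAULT;
      }

      static int pin_pages_mock(int nr_pages)
      {
              int i;

              for (i = 0; i < nr_pages; i++) {
                      int ret = pin_one_page(i);

                      /* The point of the fix: with i pages already pinned,
                       * return i, not the error. */
                      if (ret)
                              return i ? i : ret;
              }
              return i;
      }

      int main(void)
      {
              printf("pin_pages_mock(2) = %d\n", pin_pages_mock(2)); /* 2 */
              printf("pin_pages_mock(8) = %d\n", pin_pages_mock(8)); /* 3, not -EFAULT */
              return 0;
      }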
     
  • Dirty page accounting accurately measures the amount of dirty pages in
    writable shared mappings by mapping the pages RO (as indicated by
    vma_wants_writenotify). We then trap on first write and call
    set_page_dirty() on the page, after which we map the page RW and
    continue execution.

    When we launder dirty pages, we call clear_page_dirty_for_io() which
    both clears the dirty flag and maps the page RO again before we start
    writeout so that the story can repeat itself.

    vma_wants_writenotify() excludes VM_PFNMAP on the basis that we cannot
    do the regular dirty page stuff on raw PFNs and the memory isn't going
    anywhere anyway.

    The recently introduced VM_MIXEDMAP mixes both !pfn_valid() and
    pfn_valid() pages in a single mapping.

    We can't do dirty page accounting on !pfn_valid() pages as stated
    above, and mapping them RO causes them to be COW'ed on write, which
    breaks VM_SHARED semantics.

    Excluding VM_MIXEDMAP in vma_wants_writenotify() would mean we don't do
    the regular dirty page accounting for the pfn_valid() pages, which
    would bring back all the head-aches from inaccurate dirty page
    accounting.

    So instead, we let the !pfn_valid() pages get mapped RO, but fix them
    up unconditionally in the fault path.

    Signed-off-by: Peter Zijlstra
    Cc: Nick Piggin
    Acked-by: Hugh Dickins
    Cc: "Jared Hulbert"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
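
    The resulting write-fault policy can be summed up as a small decision
    table. Below is a hedged userspace paraphrase of it (the flag values and
    helper are mock-ups, and the private-mapping case is simplified -- it
    ignores the existing sole-owner reuse shortcut):

      #include <stdbool.h>
      #include <stdio.h>

      #define VM_WRITE  (1u << 0)   /* mock flag values */
      #define VM_SHARED (1u << 1)

      enum wp_action { REUSE_WRITABLE, COW, DIRTY_ACCOUNT_THEN_REUSE };

      /* Paraphrase of the behaviour described above: in a shared writable
       * mapping, a page with no struct page behind it (!pfn_valid) is simply
       * made writable again -- COW would break VM_SHARED semantics and dirty
       * accounting is impossible on raw PFNs anyway. */
      static enum wp_action wp_fault_action(unsigned int vm_flags, bool pfn_valid)
      {
              bool shared_writable =
                      (vm_flags & (VM_WRITE | VM_SHARED)) == (VM_WRITE | VM_SHARED);

              if (!shared_writable)
                      return COW;     /* simplified private-mapping case */
              return pfn_valid ? DIRTY_ACCOUNT_THEN_REUSE : REUSE_WRITABLE;
      }

      int main(void)
      {
              printf("%d\n", wp_fault_action(VM_WRITE | VM_SHARED, false)); /* 0 */
              printf("%d\n", wp_fault_action(VM_WRITE | VM_SHARED, true));  /* 2 */
              printf("%d\n", wp_fault_action(VM_WRITE, false));             /* 1 */
              return 0;
      }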
     
  • Remove all clameter@sgi.com addresses from the kernel tree since they will
    become invalid on June 27th. Change my maintainer email address for the
    slab allocators to cl@linux-foundation.org (which will be the new email
    address for the future).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Stephen Rothwell
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slub: Do not use 192 byte sized cache if minimum alignment is 128 byte

    Linus Torvalds
     

04 Jul, 2008

2 commits

  • The non-NUMA case of build_zonelist_cache() would initialize the
    zlcache_ptr for both node_zonelists[] to NULL.

    Which is problematic, since non-NUMA only has a single node_zonelists[]
    entry, and trying to zero the non-existent second one just overwrote the
    nr_zones field instead.

    As kswapd uses this value to determine what reclaim work is necessary,
    the result is that kswapd never reclaims. This causes processes to
    stall frequently in low-memory situations as they always direct reclaim.
    This patch initialises zlcache_ptr correctly.

    Signed-off-by: Mel Gorman
    Tested-by: Dan Williams
    [ Simplified patch a bit ]
    Signed-off-by: Linus Torvalds

    Mel Gorman
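
    The overwrite is easy to see from the layout: with a single-element
    node_zonelists[], the non-existent second entry aliases whatever field
    follows the array. A self-contained sketch with mock structures (field
    names loosely follow the kernel's, sizes do not):

      #include <stdio.h>

      /* Mock of the non-NUMA pg_data_t layout: exactly one zonelist, with
       * nr_zones sitting right behind the array in memory. */
      struct zonelist_mock { void *zlcache_ptr; };

      struct pglist_data_mock {
              struct zonelist_mock node_zonelists[1];
              int nr_zones;
      };

      #define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

      int main(void)
      {
              struct pglist_data_mock pgdat = { .nr_zones = 3 };

              /* The buggy code wrote node_zonelists[1].zlcache_ptr = NULL,
               * i.e. past the end of the one-element array -- right on top
               * of nr_zones.  The fix: only touch entries that exist. */
              for (size_t i = 0; i < ARRAY_SIZE(pgdat.node_zonelists); i++)
                      pgdat.node_zonelists[i].zlcache_ptr = NULL;

              printf("nr_zones = %d\n", pgdat.nr_zones);   /* still 3 */
              return 0;
      }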
     
  • The 192 byte cache is not necessary if we have a minimum alignment of 128
    bytes. If it were used, 192 would be aligned up to the next 128 byte
    boundary, resulting in another 256 byte cache. Two 256 byte kmalloc caches
    cause sysfs to complain about a duplicate entry.

    MIPS needs 128 byte aligned kmalloc caches and spits out warnings on boot without
    this patch.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
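
    The arithmetic behind the duplicate: rounding 192 up to a 128 byte
    boundary gives 256, so a "192 byte" cache with that minimum alignment is
    just a second 256 byte cache. A two-line check:

      #include <stdio.h>

      /* Round x up to the next multiple of a (a must be a power of two). */
      #define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

      int main(void)
      {
              printf("ALIGN_UP(192, 128) = %d\n", ALIGN_UP(192, 128)); /* 256 */
              printf("ALIGN_UP(192,   8) = %d\n", ALIGN_UP(192, 8));   /* 192 */
              return 0;
      }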
     

25 Jun, 2008

1 commit

  • This patch adds an API for doing read-modify-write updates to a pte's
    protection bits which may race against hardware updates to the pte.
    After reading the pte, the hardware may asynchronously set the accessed
    or dirty bits on a pte, which would be lost when writing back the
    modified pte value.

    The existing technique to handle this race is to use
    ptep_get_and_clear() to atomically fetch the old pte value and clear it
    in memory. This has the effect of marking the pte as non-present,
    which will prevent the hardware from updating its state. When the new
    value is written back, the pte will be present again, and the hardware
    can resume updating the access/dirty flags.

    When running in a virtualized environment, pagetable updates are
    relatively expensive, since they generally involve some trap into the
    hypervisor. To mitigate the cost of these updates, we tend to batch
    them.

    However, because of the atomic nature of ptep_get_and_clear(), it is
    inherently non-batchable. This new interface allows batching by
    giving the underlying implementation enough information to open a
    transaction between the read and write phases:

    ptep_modify_prot_start() returns the current pte value, and puts the
    pte entry into a state where either the hardware will not update the
    pte, or if it does, the updates will be preserved on commit.

    ptep_modify_prot_commit() writes back the updated pte and makes sure that
    any hardware updates made since ptep_modify_prot_start() are
    preserved.

    ptep_modify_prot_start() and _commit() must be exactly paired, and
    used while holding the appropriate pte lock. They do not protect
    against other software updates of the pte in any way.

    The current implementations of ptep_modify_prot_start and _commit are
    functionally unchanged from before: _start() uses ptep_get_and_clear() to
    fetch the pte and zero the entry, preventing any hardware updates.
    _commit() simply writes the new pte value back knowing that the
    hardware has not updated the pte in the meantime.

    The only current user of this interface is mprotect.

    Signed-off-by: Jeremy Fitzhardinge
    Acked-by: Linus Torvalds
    Acked-by: Hugh Dickins
    Signed-off-by: Ingo Molnar

    Jeremy Fitzhardinge
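
    The usage pattern is a three-step read/modify/commit transaction. Below
    is a self-contained toy model of that protocol -- the types and helpers
    are mock-ups, not the kernel's ptep_modify_prot_start()/_commit(), but
    the control flow mirrors the mprotect-style usage described above:

      #include <stdint.h>
      #include <stdio.h>

      /* Toy "pte": a word with present/write/dirty bits. */
      typedef uint64_t pte_mock_t;
      #define PTE_PRESENT (1ull << 0)
      #define PTE_WRITE   (1ull << 1)
      #define PTE_DIRTY   (1ull << 2)

      /* start: atomically fetch the old value and clear the entry, so the
       * "hardware" cannot set accessed/dirty bits while we work on it. */
      static pte_mock_t modify_prot_start(pte_mock_t *ptep)
      {
              pte_mock_t old = *ptep;

              *ptep = 0;              /* non-present: hardware updates blocked */
              return old;
      }

      /* commit: write the new value back; any hardware updates made since
       * start() must be preserved (trivially true here, none can happen). */
      static void modify_prot_commit(pte_mock_t *ptep, pte_mock_t pte)
      {
              *ptep = pte;
      }

      int main(void)
      {
              pte_mock_t pte = PTE_PRESENT | PTE_WRITE | PTE_DIRTY;
              pte_mock_t ent;

              ent = modify_prot_start(&pte);
              ent &= ~PTE_WRITE;              /* the "pte_modify(ent, newprot)" step */
              modify_prot_commit(&pte, ent);

              /* present + dirty, no longer writable */
              printf("pte = %#llx\n", (unsigned long long)pte);
              return 0;
      }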
     

24 Jun, 2008

2 commits

  • There is a race in the COW logic. It contains a shortcut to avoid the
    COW and reuse the page if we have the sole reference on the page,
    however it is possible to have two racing do_wp_page()ers with one
    causing the other to mistakenly believe it is safe to take the shortcut
    when it is not. This could lead to data corruption.

    Process 1 and process 2 each have a wp pte of the same anon page (ie.
    one forked the other). The page's mapcount is 2. Then they both
    attempt to write to it around the same time...

    proc1                           proc2 thr1                      proc2 thr2
    CPU0                            CPU1                            CPU3
    do_wp_page()                    do_wp_page()
                                     trylock_page()
                                      can_share_swap_page()
                                       load page mapcount (==2)
                                      reuse = 0
                                     pte unlock
                                     copy page to new_page
                                     pte lock
                                     page_remove_rmap(page);

     trylock_page()
      can_share_swap_page()
       load page mapcount (==1)
      reuse = 1
     ptep_set_access_flags (allow W)

                                     write private key into page
                                                                     read from page
                                     ptep_clear_flush()
                                     set_pte_at(pte of new_page)

    Fix this by moving the page_remove_rmap of the old page after the pte
    clear and flush. Potentially the entire branch could be moved down
    here, but in order to stay consistent, I won't (should probably move all
    the *_mm_counter stuff with one patch).

    Signed-off-by: Nick Piggin
    Acked-by: Hugh Dickins
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f ("Reinstate ZERO_PAGE
    optimization in 'get_user_pages()' and fix XIP") broke vmware, as
    reported by Jeff Chua:

    "This broke vmware 6.0.4.
    Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
    /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"

    and the reason seems to be that there's an old bug in how we handle
    FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
    triggered if the whole page table was missing, nobody had apparently hit
    it before.

    The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
    not just for whole missing page tables, but for individual pages as
    well, and exposed this problem.

    This fixes it by making the test for when FOLL_ANON is used more
    careful, and also makes the code easier to read and understand by moving
    the logic to a separate inline function.

    Reported-and-tested-by: Jeff Chua
    Signed-off-by: Linus Torvalds

    Linus Torvalds
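
    Roughly, the new helper answers "may FOLL_ANON (and hence the shared
    zero page) be used for this vma?", refusing for VM_SHARED as described
    above. A hedged userspace paraphrase of such a predicate -- the exact
    conditions in the real inline function may differ, and the flag values
    and struct here are mock-ups:

      #include <stdbool.h>
      #include <stdio.h>

      #define VM_SHARED (1u << 0)   /* mock flag values */
      #define VM_LOCKED (1u << 1)

      struct vma_mock {
              unsigned int vm_flags;
              bool has_fault_handler;   /* stand-in for "has a ->fault op" */
      };

      /* Only a purely anonymous, non-locked, non-shared mapping may take the
       * zero-page shortcut: locked mappings must really be populated, and
       * shared mappings must go through the page tables so every user of
       * the mapping sees the same page. */
      static bool may_use_zero_page(const struct vma_mock *vma)
      {
              if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
                      return false;
              return !vma->has_fault_handler;
      }

      int main(void)
      {
              struct vma_mock anon   = { 0, false };
              struct vma_mock shared = { VM_SHARED, true };

              printf("anon:   %d\n", may_use_zero_page(&anon));   /* 1 */
              printf("shared: %d\n", may_use_zero_page(&shared)); /* 0 */
              return 0;
      }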
     

22 Jun, 2008

2 commits

  • The zonelist patches caused the loop that checks for available
    objects in permitted zones to not terminate immediately. One object
    per zone per allocation may be allocated and then abandoned.

    Break the loop when we have successfully allocated one object.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch changes the function reserve_bootmem_node() from void to int,
    returning -ENOMEM if the allocation fails.

    This fixes a build problem on x86 with CONFIG_KEXEC=y and
    CONFIG_NEED_MULTIPLE_NODES=y

    Signed-off-by: Bernhard Walle
    Reported-by: Adrian Bunk
    Signed-off-by: Linus Torvalds

    Bernhard Walle
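
    The practical consequence is that callers can now propagate the failure
    instead of continuing with an unreserved region. A small mock of the
    calling pattern the int return enables (the names below are invented,
    not the kernel's):

      #include <errno.h>
      #include <stdio.h>

      /* Invented stand-in for the reservation call; pretend the range is
       * already taken so the -ENOMEM path is exercised. */
      static int reserve_region_mock(unsigned long physaddr, unsigned long size)
      {
              (void)physaddr;
              (void)size;
              return -ENOMEM;
      }

      static int setup_reserved_region(void)
      {
              int ret = reserve_region_mock(0x1000000UL, 0x400000UL);

              if (ret < 0) {
                      fprintf(stderr, "reservation failed: %d\n", ret);
                      return ret;       /* propagate instead of ignoring */
              }
              return 0;
      }

      int main(void)
      {
              return setup_reserved_region() ? 1 : 0;
      }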
     

21 Jun, 2008

1 commit

  • KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
    557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed
    the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
    generally now populate the VM with real empty pages needlessly.

    We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
    since fault handling no longer uses ZERO_PAGE for new anonymous pages,
    we now need to handle that special case in follow_page() instead.

    In particular, the removal of ZERO_PAGE effectively removed the core
    file writing optimization where we would skip writing pages that had not
    been populated at all, and increased memory pressure a lot by allocating
    all those useless newly zeroed pages.

    This reinstates the optimization by making the unmapped PTE case the
    same as for a non-existent page table, which already did this correctly.

    While at it, this also fixes the XIP case for follow_page(), where the
    caller could not differentiate between the case of a page that simply
    could not be used (because it had no "struct page" associated with it)
    and a page that just wasn't mapped.

    We do that by simply returning an error pointer for pages that could not
    be turned into a "struct page *". The error is arbitrarily picked to be
    EFAULT, since that was what get_user_pages() already used for the
    equivalent IO-mapped page case.

    [ Also removed an impossible test for pte_offset_map_lock() failing:
    that's not how that function works ]

    Acked-by: Oleg Nesterov
    Acked-by: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Linus Torvalds

    Linus Torvalds
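
    The error-pointer convention this relies on lets one pointer-sized
    return value carry three cases: a usable page, "nothing mapped" (NULL),
    and "mapped but unusable" (ERR_PTR(-EFAULT)). A self-contained sketch
    with local copies of the helpers (the kernel's live in linux/err.h);
    follow_page_mock() is an invented stand-in:

      #include <errno.h>
      #include <stdio.h>

      /* Local equivalents of the kernel's error-pointer helpers. */
      static inline void *ERR_PTR(long error) { return (void *)error; }
      static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
      static inline int IS_ERR(const void *ptr)
      {
              return (unsigned long)ptr >= (unsigned long)-4095;
      }

      struct page_mock { int id; };

      static struct page_mock *follow_page_mock(int what)
      {
              static struct page_mock page = { 42 };

              switch (what) {
              case 0:  return &page;                /* usable page             */
              case 1:  return NULL;                 /* unmapped: use ZERO_PAGE */
              default: return ERR_PTR(-EFAULT);     /* mapped, no struct page  */
              }
      }

      int main(void)
      {
              for (int what = 0; what < 3; what++) {
                      struct page_mock *p = follow_page_mock(what);

                      if (IS_ERR(p))
                              printf("error %ld\n", PTR_ERR(p));
                      else if (!p)
                              printf("unmapped -> substitute ZERO_PAGE\n");
                      else
                              printf("page %d\n", p->id);
              }
              return 0;
      }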
     

13 Jun, 2008

2 commits

  • We need this at least for huge page detection for now, because powerpc
    needs the vm_area_struct to be able to determine whether a virtual address
    is referring to a huge page (its pmd_huge() doesn't work).

    It might also come in handy for some of the other users.

    Signed-off-by: Dave Hansen
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • "Smarter retry of costly-order allocations" patch series change behaver of
    do_try_to_free_pages(). But unfortunately ret variable type was
    unchanged.

    Thus an overflow is possible.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    kosaki.motohiro@jp.fujitsu.com
     

12 Jun, 2008

1 commit

  • This implements a few changes on top of the recent kobjsize() refactoring
    introduced by commit 6cfd53fc03670c7a544a56d441eb1a6cc800d72b.

    As Christoph points out:

    virt_to_head_page cannot return NULL. virt_to_page also
    does not return NULL. pfn_valid() needs to be used to
    figure out if a page is valid. Otherwise the page struct
    reference that was returned may have PageReserved() set
    to indicate that it is not a valid page.

    As discussed further in the thread, virt_addr_valid() is the preferable
    way to validate the object pointer in this case. In addition to fixing
    up the reserved page case, it also has the benefit of encapsulating the
    hack introduced by commit 4016a1390d07f15b267eecb20e76a48fd5c524ef on
    the impacted platforms, allowing us to get rid of the extra checking in
    kobjsize() for the platforms that don't perform this type of bizarre
    memory_end abuse (every nommu platform that isn't blackfin). If blackfin
    decides to get in line with every other platform and use PageReserved
    for the DMA pages in question, kobjsize() will also continue to work
    fine.

    It also turns out that compound_order() will give us back 0-order for
    non-head pages, so we can get rid of the PageCompound check and just
    use compound_order() directly. Clean that up while we're at it.

    Signed-off-by: Paul Mundt
    Reviewed-by: Christoph Lameter
    Acked-by: David Howells
    Signed-off-by: Linus Torvalds

    Paul Mundt
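
    Put together, the sizing logic reduces to three cases: invalid pointer,
    slab object, or (compound) page. A userspace paraphrase of that decision
    -- the struct and helper below are mock-ups of virt_addr_valid(),
    PageSlab() and compound_order(), not the kernel code:

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdio.h>

      #define PAGE_SIZE_MOCK 4096u

      struct page_mock {
              bool slab;              /* stand-in for PageSlab()               */
              unsigned int order;     /* stand-in for compound_order(): 0 for
                                         ordinary pages and for non-head pages */
      };

      static size_t kobjsize_mock(const struct page_mock *page, size_t slab_size)
      {
              if (!page)                      /* "!virt_addr_valid(objp)" case  */
                      return 0;
              if (page->slab)                 /* kmalloc()ed: let ksize() do it */
                      return slab_size;
              return (size_t)PAGE_SIZE_MOCK << page->order;
      }

      int main(void)
      {
              struct page_mock slab_page = { .slab = true,  .order = 0 };
              struct page_mock anon_page = { .slab = false, .order = 2 };

              printf("%zu\n", kobjsize_mock(NULL, 0));         /* 0     */
              printf("%zu\n", kobjsize_mock(&slab_page, 96));  /* 96    */
              printf("%zu\n", kobjsize_mock(&anon_page, 0));   /* 16384 */
              return 0;
      }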
     

10 Jun, 2008

2 commits

  • Now we are using register_e820_active_regions() instead of calling
    add_active_range() directly, so the end_pfn value recorded in
    early_node_map can differ from node_end_pfn.

    So we need to make shrink_active_range() smarter.

    shrink_active_range() is a generic MM function in mm/page_alloc.c but
    it is only used on 32-bit x86. Should we move it back to some file in
    arch/x86?

    Signed-off-by: Yinghai Lu
    Signed-off-by: Ingo Molnar

    Yinghai Lu
     
  • Minor source code cleanup of page flags in mm/page_alloc.c.
    Move the definition of the groups of bits to page-flags.h.

    The purpose of this clean up is that the next patch will
    conditionally add a page flag to the groups. Doing that
    in a header file is cleaner than adding #ifdefs to the
    C code.

    Signed-off-by: Russ Anderson
    Signed-off-by: Linus Torvalds

    Russ Anderson
     

07 Jun, 2008

3 commits

  • kobjsize() has been abusing page->index as a method for sorting out
    compound order, which blows up both for page cache pages, and SLOB's
    reuse of the index in struct slob_page.

    Presently we are not able to accurately size arbitrary pointers that
    don't come from kmalloc(), so the best we can do is sort out the
    compound order from the head page if it's a compound page, or default
    to 0-order if it's impossible to ksize() the object.

    Obviously this leaves quite a bit to be desired in terms of object
    sizing accuracy, but the behaviour is unchanged over the existing
    implementation, while fixing the page->index oopses originally reported
    here:

    http://marc.info/?l=linux-mm&m=121127773325245&w=2

    Accuracy could also be improved by having SLUB and SLOB both set PG_slab
    on ksizeable pages, rather than just handling the __GFP_COMP cases
    regardless of the PG_slab setting, as made possible by Pekka's
    patches:

    http://marc.info/?l=linux-kernel&m=121139439900534&w=2
    http://marc.info/?l=linux-kernel&m=121139440000537&w=2
    http://marc.info/?l=linux-kernel&m=121139440000540&w=2

    This is primarily a bugfix for nommu systems for 2.6.26, with the aim
    being to gradually kill off kobjsize() and its particular brand of
    object abuse entirely.

    Reviewed-by: Pekka Enberg
    Signed-off-by: Paul Mundt
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mundt
     
  • Fix a regression introduced by

    commit 4cc6028d4040f95cdb590a87db478b42b8be0508
    Author: Jiri Kosina
    Date: Wed Feb 6 22:39:44 2008 +0100

    brk: check the lower bound properly

    The check in sys_brk() on minimum value the brk might have must take
    CONFIG_COMPAT_BRK setting into account. When this option is turned on
    (i.e. we support ancient legacy binaries, e.g. libc5-linked stuff), the
    lower bound on brk value is mm->end_code, otherwise the brk start is
    allowed to be arbitrarily shifted.

    Signed-off-by: Jiri Kosina
    Tested-by: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
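
    The check itself is a one-line conditional on the lower bound. A
    userspace paraphrase of it, using a mock mm struct and a plain flag in
    place of the CONFIG_COMPAT_BRK #ifdef:

      #include <stdio.h>

      struct mm_mock {
              unsigned long end_code;    /* end of the text segment */
              unsigned long start_brk;   /* start of the brk area   */
      };

      /* With COMPAT_BRK (legacy libc5-style binaries) brk may go as low as
       * the end of the text segment; otherwise only the possibly shifted
       * brk start bounds it. */
      static int brk_below_minimum(const struct mm_mock *mm,
                                   unsigned long brk, int compat_brk)
      {
              unsigned long min_brk = compat_brk ? mm->end_code : mm->start_brk;

              return brk < min_brk;
      }

      int main(void)
      {
              struct mm_mock mm = { .end_code = 0x0804c000UL,
                                    .start_brk = 0x09a00000UL };

              printf("%d\n", brk_below_minimum(&mm, 0x08050000UL, 1)); /* 0: ok      */
              printf("%d\n", brk_below_minimum(&mm, 0x08050000UL, 0)); /* 1: too low */
              return 0;
      }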
     
  • =============================================
    [ INFO: possible recursive locking detected ]
    2.6.26-rc4 #30
    ---------------------------------------------
    heap-overflow/2250 is trying to acquire lock:
    (&mm->page_table_lock){--..}, at: [] .copy_hugetlb_page_range+0x108/0x280

    but task is already holding lock:
    (&mm->page_table_lock){--..}, at: [] .copy_hugetlb_page_range+0xfc/0x280

    other info that might help us debug this:
    3 locks held by heap-overflow/2250:
    #0: (&mm->mmap_sem){----}, at: [] .dup_mm+0x134/0x410
    #1: (&mm->mmap_sem/1){--..}, at: [] .dup_mm+0x144/0x410
    #2: (&mm->page_table_lock){--..}, at: [] .copy_hugetlb_page_range+0xfc/0x280

    stack backtrace:
    Call Trace:
    [c00000003b2774e0] [c000000000010ce4] .show_stack+0x74/0x1f0 (unreliable)
    [c00000003b2775a0] [c0000000003f10e0] .dump_stack+0x20/0x34
    [c00000003b277620] [c0000000000889bc] .__lock_acquire+0xaac/0x1080
    [c00000003b277740] [c000000000089000] .lock_acquire+0x70/0xb0
    [c00000003b2777d0] [c0000000003ee15c] ._spin_lock+0x4c/0x80
    [c00000003b277870] [c0000000000cf2e8] .copy_hugetlb_page_range+0x108/0x280
    [c00000003b277950] [c0000000000bcaa8] .copy_page_range+0x558/0x790
    [c00000003b277ac0] [c000000000050fe0] .dup_mm+0x2d0/0x410
    [c00000003b277ba0] [c000000000051d24] .copy_process+0xb94/0x1020
    [c00000003b277ca0] [c000000000052244] .do_fork+0x94/0x310
    [c00000003b277db0] [c000000000011240] .sys_clone+0x60/0x80
    [c00000003b277e30] [c0000000000078c4] .ppc_clone+0x8/0xc

    The fix is to do the lockdep annotation the same way that mm/memory.c
    copy_page_range() does.

    Acked-by: KOSAKI Motohiro
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
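
    For reference, the annotation pattern borrowed from copy_page_range()
    looks roughly like the fragment below (a paraphrase of the approach, not
    the verbatim patch): the second same-class lock is taken with
    spin_lock_nested() and SINGLE_DEPTH_NESTING so lockdep treats it as an
    intentional nesting rather than recursion.

      spin_lock(&dst->page_table_lock);
      spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
      /* ... copy the huge pte from src to dst ... */
      spin_unlock(&src->page_table_lock);
      spin_unlock(&dst->page_table_lock);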
     

03 Jun, 2008

1 commit


27 May, 2008

1 commit


25 May, 2008

5 commits

  • Trying to add memory via add_memory() from within an initcall function
    results in

    bootmem alloc of 163840 bytes failed!
    Kernel panic - not syncing: Out of memory

    This is caused by zone_wait_table_init() which uses system_state to decide
    if it should use the bootmem allocator or not.

    When initcalls are handled the system_state is still SYSTEM_BOOTING but
    the bootmem allocator doesn't work anymore. So the allocation will fail.

    To fix this, use slab_is_available() as the indicator instead, as we do
    everywhere else.

    [akpm@linux-foundation.org: coding-style fix]
    Reviewed-by: Andy Whitcroft
    Cc: Dave Hansen
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Yasunori Goto
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
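
    The decision the fix changes can be reduced to one small chooser. A
    userspace mock of it; the allocator functions below merely stand in for
    alloc_bootmem_node() and vmalloc():

      #include <stdbool.h>
      #include <stdio.h>

      static void *bootmem_alloc_mock(size_t size)
      {
              printf("bootmem alloc of %zu bytes\n", size);
              return (void *)1;
      }

      static void *vmalloc_mock(size_t size)
      {
              printf("vmalloc of %zu bytes\n", size);
              return (void *)2;
      }

      /* Choose the allocator by asking whether slab is up, not by looking at
       * system_state: during initcalls system_state is still SYSTEM_BOOTING
       * even though bootmem is already unavailable. */
      static void *wait_table_alloc(size_t size, bool slab_available)
      {
              return slab_available ? vmalloc_mock(size)
                                    : bootmem_alloc_mock(size);
      }

      int main(void)
      {
              wait_table_alloc(163840, false);  /* very early boot           */
              wait_table_alloc(163840, true);   /* initcall / memory hotplug */
              return 0;
      }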
     
  • When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing
    panics when trying node local allocations:

    BUG: unable to handle kernel NULL pointer dereference at 0000034c
    IP: [] get_page_from_freelist+0x4a/0x18e
    *pdpt = 00000000013a7001 *pde = 0000000000000000
    Oops: 0000 [#1] SMP
    Modules linked in:

    Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82)
    EIP: 0060:[] EFLAGS: 00010282 CPU: 0
    EIP is at get_page_from_freelist+0x4a/0x18e
    EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000
    ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000)
    Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0
    f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000
    00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0
    Call Trace:
    [] __alloc_pages_internal+0x99/0x378
    [] __alloc_pages+0x7/0x9
    [] kmem_getpages+0x66/0xef
    [] cache_grow+0x8f/0x123
    [] ____cache_alloc_node+0xb9/0xe4
    [] kmem_cache_alloc_node+0x92/0xd2
    [] setup_cpu_cache+0xaf/0x177
    [] kmem_cache_create+0x2c8/0x353
    [] kmem_cache_init+0x1ce/0x3ad
    [] start_kernel+0x178/0x1ee

    This occurs when we are scanning the zonelists looking for a ZONE_NORMAL
    page. In this system there is only ZONE_DMA and ZONE_NORMAL memory on
    node 0, all other nodes are mapped above 4GB physical. Here is a dump
    of the zonelists from this system:

    zonelists pgdat=c1400000
    0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0
    1: c14006c0:2 c1400360:1 c1400000:0
    zonelists pgdat=f7c00000
    0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0
    1: f7c006c0:2
    zonelists pgdat=f7e00000
    0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0
    1: f7e006c0:2

    When performing a node local allocation we call get_page_from_freelist()
    looking for a page. It in turn calls first_zones_zonelist() which returns
    a preferred_zone. Where there are no applicable zones this will be NULL.
    However we use this unconditionally, leading to this panic.

    Where there are no applicable zones there is no possibility of a successful
    allocation, so simply fail the allocation.

    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
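
    The shape of the fix is a plain NULL check before the preferred zone is
    ever dereferenced. A self-contained mock of that guard (the zone list
    handling below is invented, not the kernel's zoneref machinery):

      #include <stdio.h>

      struct zone_mock { int id; };

      /* Mock of "find the first applicable zone": may legitimately find none. */
      static struct zone_mock *first_allowed_zone(struct zone_mock **zonelist, int n)
      {
              for (int i = 0; i < n; i++)
                      if (zonelist[i])
                              return zonelist[i];
              return NULL;
      }

      static struct zone_mock *alloc_pages_mock(struct zone_mock **zonelist, int n)
      {
              struct zone_mock *preferred_zone = first_allowed_zone(zonelist, n);

              /* No applicable zone: fail the allocation instead of oopsing. */
              if (!preferred_zone)
                      return NULL;
              return preferred_zone;    /* pretend the allocation happened here */
      }

      int main(void)
      {
              struct zone_mock normal   = { 1 };
              struct zone_mock *ok[]    = { &normal };
              struct zone_mock *empty[] = { NULL };

              printf("%s\n", alloc_pages_mock(ok, 1)    ? "allocated" : "failed");
              printf("%s\n", alloc_pages_mock(empty, 1) ? "allocated" : "failed");
              return 0;
      }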
     
  • The atomic_t type is 32bit but a 64bit system can have more than 2^32
    pages of virtual address space available. Without this we overflow on
    ludicrously large mappings.

    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • In a zone's present pages number, account for all pages occupied by the
    memory map, including a partial one.

    Signed-off-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Take out an assertion to allow ->fault handlers to service PFNMAP regions.
    This is required to reimplement .nopfn handlers with .fault handlers and
    subsequently remove nopfn.

    Signed-off-by: Nick Piggin
    Acked-by: Jes Sorensen
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

23 May, 2008

1 commit

  • Add a WARN_ON for pages that have neither PageSlab nor PageCompound set, to
    catch the worst abusers of ksize() in the kernel.

    Acked-by: Christoph Lameter
    Cc: Matt Mackall
    Signed-off-by: Pekka Enberg

    Pekka Enberg
     

21 May, 2008

1 commit

  • There is a race between the time a device is created with device_create()
    and the time its drvdata is set with a call to dev_set_drvdata(): in that
    window a sysfs file could be opened while the drvdata is still NULL,
    causing all sorts of bad things to happen.

    This patch fixes the problem by using the new function,
    device_create_vargs().

    Many thanks to Arthur Jones for reporting the bug,
    and testing patches out.

    Cc: Kay Sievers
    Cc: Arthur Jones
    Cc: Peter Zijlstra
    Cc: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

20 May, 2008

1 commit

  • Although slob_alloc() returns NULL on failure, __kmalloc_node() then
    returns NULL + align. Because align can vary, it is very hard to debug
    an out-of-pages condition when NULL is not returned.

    We have to return NULL when no page is available.

    [penberg@cs.helsinki.fi: fix formatting as suggested by Matt.]
    Acked-by: Matt Mackall
    Signed-off-by: MinChan Kim
    Signed-off-by: Pekka Enberg

    MinChan Kim
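
    A userspace sketch of the fix's shape: bail out with NULL before the
    alignment offset is applied, so callers see a real NULL rather than a
    small bogus pointer (the allocator below is a mock that always fails):

      #include <stddef.h>
      #include <stdio.h>

      /* Mock slob_alloc(): pretend we are out of pages. */
      static unsigned int *slob_alloc_mock(size_t size)
      {
              (void)size;
              return NULL;
      }

      static void *kmalloc_node_mock(size_t size, size_t align)
      {
              unsigned int *m = slob_alloc_mock(size + align);

              /* The fix: return NULL here.  Returning "NULL + align" instead
               * hands back a small non-NULL garbage pointer that is very
               * hard to recognise as an allocation failure. */
              if (!m)
                      return NULL;

              *m = (unsigned int)size;          /* size header           */
              return (char *)m + align;         /* object follows header */
      }

      int main(void)
      {
              void *p = kmalloc_node_mock(32, 8);

              printf("%s\n", p ? "got object" : "allocation failed (NULL)");
              return 0;
      }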