05 Jul, 2008

5 commits

  • Flags considered internal to the mempolicy kernel code are stored as part
    of the "flags" member of struct mempolicy.

    Before exposing a policy type to userspace via get_mempolicy(), these
    internal flags must be masked. Flags exposed to userspace, however,
    should still be returned to the user.

    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
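
    Illustrative sketch, not the patch itself: one way the masking described
    above could look. The DEMO_* flag names and bit values are hypothetical
    and chosen only for this example; the kernel's real flag layout is not
    reproduced here.

    ```c
    /* Hypothetical flag layout for illustration; not the kernel's values. */
    #define DEMO_F_INTERNAL   0x0001u  /* kernel-internal bookkeeping bit */
    #define DEMO_F_USER_A     0x4000u  /* flag the user may set and read back */
    #define DEMO_F_USER_B     0x8000u  /* another user-visible flag */
    #define DEMO_USER_FLAGS   (DEMO_F_USER_A | DEMO_F_USER_B)

    /* Combine the policy mode with only the user-visible subset of flags. */
    static unsigned int demo_policy_for_user(unsigned int mode, unsigned int flags)
    {
        return mode | (flags & DEMO_USER_FLAGS);
    }
    ```

    Masking with an allow-list of user-visible bits, rather than clearing
    known internal bits, means a newly added internal flag cannot leak by
    accident.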
     
  • get_user_pages() must not return the error when i != 0. When pages !=
    NULL we have i get_page()'ed pages.

    Signed-off-by: Oleg Nesterov
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
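
    Illustrative sketch, not the patch itself: the return convention described
    above, where an error is reported only if no pages were pinned at all.
    demo_pin_one() and demo_pin_pages() are hypothetical helpers, not kernel
    functions.

    ```c
    #include <errno.h>

    /*
     * Hypothetical stand-in for pinning one page: succeeds for indexes below
     * fail_at and fails with -EFAULT from there on.
     */
    static int demo_pin_one(int index, int fail_at)
    {
        return index < fail_at ? 0 : -EFAULT;
    }

    /* Returns the number of pages pinned, or an error only if none were. */
    static int demo_pin_pages(int nr, int fail_at)
    {
        int i;

        for (i = 0; i < nr; i++) {
            int err = demo_pin_one(i, fail_at);

            if (err)
                return i ? i : err;   /* report partial progress first */
        }
        return i;
    }
    ```

    Returning the count lets the caller release exactly the pages that were
    already pinned instead of leaking their references.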
     
  • Dirty page accounting accurately measures the amount of dirty pages in
    writable shared mappings by mapping the pages RO (as indicated by
    vma_wants_writenotify). We then trap on first write and call
    set_page_dirty() on the page, after which we map the page RW and
    continue execution.

    When we launder dirty pages, we call clear_page_dirty_for_io(), which
    both clears the dirty flag and maps the page RO again before we start
    writeout, so that the story can repeat itself.

    vma_wants_writenotify() excludes VM_PFNMAP on the basis that we cannot
    do the regular dirty page stuff on raw PFNs and the memory isn't going
    anywhere anyway.

    The recently introduced VM_MIXEDMAP mixes both !pfn_valid() and
    pfn_valid() pages in a single mapping.

    We can't do dirty page accounting on !pfn_valid() pages as stated
    above, and mapping them RO causes them to be COW'ed on write, which
    breaks VM_SHARED semantics.

    Excluding VM_MIXEDMAP in vma_wants_writenotify() would mean we don't do
    the regular dirty page accounting for the pfn_valid() pages, which
    would bring back all the headaches from inaccurate dirty page
    accounting.

    So instead, we let the !pfn_valid() pages get mapped RO, but fix them
    up unconditionally in the fault path.

    Signed-off-by: Peter Zijlstra
    Cc: Nick Piggin
    Acked-by: Hugh Dickins
    Cc: "Jared Hulbert"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Remove all clameter@sgi.com addresses from the kernel tree since they will
    become invalid on June 27th. Change my maintainer email address for the
    slab allocators to cl@linux-foundation.org (which will be the new email
    address for the future).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Stephen Rothwell
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slub: Do not use 192 byte sized cache if minimum alignment is 128 byte

    Linus Torvalds
     

04 Jul, 2008

2 commits

  • The non-NUMA case of build_zonelist_cache() would initialize the
    zlcache_ptr for both node_zonelists[] to NULL.

    This is problematic, since non-NUMA only has a single node_zonelists[]
    entry, and trying to zero the non-existent second one just overwrote the
    nr_zones field instead.

    As kswapd uses this value to determine what reclaim work is necessary,
    the result is that kswapd never reclaims. This causes processes to
    stall frequently in low-memory situations as they always direct reclaim.
    This patch initialises zlcache_ptr correctly.

    Signed-off-by: Mel Gorman
    Tested-by: Dan Williams
    [ Simplified patch a bit ]
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The 192 byte cache is not necessary if we have a basic alignment of 128
    bytes. If it were used, the 192 byte size would be aligned up to the next
    128 byte boundary, which would result in a second 256 byte cache. Two 256
    byte kmalloc caches cause sysfs to complain about a duplicate entry.

    MIPS needs 128 byte aligned kmalloc caches and spits out warnings on boot
    without this patch.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
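
    Illustrative sketch, not the patch itself: the alignment arithmetic behind
    the duplicate cache. demo_align_up() and demo_cache_is_redundant() are
    hypothetical helpers written for this example.

    ```c
    #include <stddef.h>

    /* Round size up to the next multiple of align (align is a power of two). */
    static size_t demo_align_up(size_t size, size_t align)
    {
        return (size + align - 1) & ~(align - 1);
    }

    /* Nonzero if an odd-sized cache would collapse onto the next power-of-two cache. */
    static int demo_cache_is_redundant(size_t size, size_t next_pow2, size_t min_align)
    {
        return demo_align_up(size, min_align) == next_pow2;
    }
    ```

    With a minimum alignment of 128, demo_align_up(192, 128) yields 256, so a
    192 byte cache would duplicate the 256 byte one; with a minimum alignment
    of 64 the size stays at 192 and the cache remains distinct.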
     

24 Jun, 2008

2 commits

  • There is a race in the COW logic. It contains a shortcut to avoid the
    COW and reuse the page if we have the sole reference on the page,
    however it is possible to have two racing do_wp_page()ers with one
    causing the other to mistakenly believe it is safe to take the shortcut
    when it is not. This could lead to data corruption.

    Process 1 and process 2 each have a wp pte of the same anon page (ie.
    one forked the other). The page's mapcount is 2. Then they both
    attempt to write to it around the same time...

    proc1                     proc2 thr1                  proc2 thr2
    CPU0                      CPU1                        CPU3
    do_wp_page()              do_wp_page()
                              trylock_page()
                                can_share_swap_page()
                                  load page mapcount (==2)
                                reuse = 0
                              pte unlock
                              copy page to new_page
                              pte lock
                              page_remove_rmap(page);
    trylock_page()
      can_share_swap_page()
        load page mapcount (==1)
      reuse = 1
    ptep_set_access_flags (allow W)

    write private key into page
                                                          read from page
                              ptep_clear_flush()
                              set_pte_at(pte of new_page)

    Fix this by moving the page_remove_rmap of the old page after the pte
    clear and flush. Potentially the entire branch could be moved down
    here, but in order to stay consistent, I won't (should probably move all
    the *_mm_counter stuff with one patch).

    Signed-off-by: Nick Piggin
    Acked-by: Hugh Dickins
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f ("Reinstate ZERO_PAGE
    optimization in 'get_user_pages()' and fix XIP") broke vmware, as
    reported by Jeff Chua:

    "This broke vmware 6.0.4.
    Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
    /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"

    and the reason seems to be that there's an old bug in how we handle
    FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
    triggered if the whole page table was missing, nobody had apparently hit
    it before.

    The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
    not just for whole missing page tables, but for individual pages as
    well, and exposed this problem.

    This fixes it by making the test for when FOLL_ANON is used more
    careful, and also makes the code easier to read and understand by moving
    the logic to a separate inline function.

    Reported-and-tested-by: Jeff Chua
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Jun, 2008

2 commits

  • The zonelist patches caused the loop that checks for available
    objects in permitted zones to not terminate immediately. One object
    per zone per allocation may be allocated and then abandoned.

    Break the loop when we have successfully allocated one object.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch changes the function reserve_bootmem_node() from void to int,
    returning -ENOMEM if the allocation fails.

    This fixes a build problem on x86 with CONFIG_KEXEC=y and
    CONFIG_NEED_MULTIPLE_NODES=y

    Signed-off-by: Bernhard Walle
    Reported-by: Adrian Bunk
    Signed-off-by: Linus Torvalds

    Bernhard Walle
     

21 Jun, 2008

1 commit

  • KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
    557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed
    the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
    generally now populate the VM with real empty pages needlessly.

    We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
    since fault handling no longer uses ZERO_PAGE for new anonymous pages,
    we now need to handle that special case in follow_page() instead.

    In particular, the removal of ZERO_PAGE effectively removed the core
    file writing optimization where we would skip writing pages that had not
    been populated at all, and increased memory pressure a lot by allocating
    all those useless newly zeroed pages.

    This reinstates the optimization by making the unmapped PTE case the
    same as for a non-existent page table, which already did this correctly.

    While at it, this also fixes the XIP case for follow_page(), where the
    caller could not differentiate between the case of a page that simply
    could not be used (because it had no "struct page" associated with it)
    and a page that just wasn't mapped.

    We do that by simply returning an error pointer for pages that could not
    be turned into a "struct page *". The error is arbitrarily picked to be
    EFAULT, since that was what get_user_pages() already used for the
    equivalent IO-mapped page case.

    [ Also removed an impossible test for pte_offset_map_lock() failing:
    that's not how that function works ]

    Acked-by: Oleg Nesterov
    Acked-by: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Linus Torvalds

    Linus Torvalds
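
    Illustrative sketch, not the patch itself: a userspace imitation of the
    error-pointer convention described above, which lets one return value
    distinguish "not mapped" (NULL), "no struct page" (an encoded errno) and
    a usable pointer. All demo_* names and the DEMO_MAX_ERRNO bound are
    assumptions made for this example.

    ```c
    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>

    #define DEMO_MAX_ERRNO 4095   /* assumed upper bound on errno values */

    /* Encode a negative errno value in a pointer. */
    static inline void *demo_err_ptr(long err)
    {
        return (void *)(intptr_t)err;
    }

    /* Recover the errno value from an encoded pointer. */
    static inline long demo_ptr_err(const void *p)
    {
        return (long)(intptr_t)p;
    }

    /* Nonzero if the pointer lies in the top DEMO_MAX_ERRNO addresses. */
    static inline int demo_is_err(const void *p)
    {
        return (uintptr_t)p >= (uintptr_t)(-(intptr_t)DEMO_MAX_ERRNO);
    }

    enum demo_lookup { DEMO_NOT_MAPPED, DEMO_NO_STRUCT_PAGE, DEMO_HAVE_PAGE };

    /* Tell the three lookup outcomes apart from a single return value. */
    static enum demo_lookup demo_classify(const void *page)
    {
        if (!page)
            return DEMO_NOT_MAPPED;
        if (demo_is_err(page))
            return DEMO_NO_STRUCT_PAGE;
        return DEMO_HAVE_PAGE;
    }
    ```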
     

13 Jun, 2008

2 commits

  • We need this at least for huge page detection for now, because powerpc
    needs the vm_area_struct to be able to determine whether a virtual address
    is referring to a huge page (its pmd_huge() doesn't work).

    It might also come in handy for some of the other users.

    Signed-off-by: Dave Hansen
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • The "Smarter retry of costly-order allocations" patch series changed the
    behavior of do_try_to_free_pages(), but unfortunately the type of the
    ret variable was left unchanged.

    Thus an overflow is possible.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    kosaki.motohiro@jp.fujitsu.com
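
    Illustrative sketch, not the patch itself: why the variable type matters
    when a function starts returning a page count rather than a small status
    value. demo_fits_in_int() is a hypothetical helper for this example.

    ```c
    #include <limits.h>

    /* Nonzero if a page count held in an unsigned long also fits in an int. */
    static int demo_fits_in_int(unsigned long nr_pages)
    {
        return nr_pages <= (unsigned long)INT_MAX;
    }
    ```

    Any count above INT_MAX cannot be represented once stored in an int, so
    the variable holding it needs to be as wide as the value being returned.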
     

12 Jun, 2008

1 commit

  • This implements a few changes on top of the recent kobjsize() refactoring
    introduced by commit 6cfd53fc03670c7a544a56d441eb1a6cc800d72b.

    As Christoph points out:

    virt_to_head_page cannot return NULL. virt_to_page also
    does not return NULL. pfn_valid() needs to be used to
    figure out if a page is valid. Otherwise the page struct
    reference that was returned may have PageReserved() set
    to indicate that it is not a valid page.

    As discussed further in the thread, virt_addr_valid() is the preferable
    way to validate the object pointer in this case. In addition to fixing
    up the reserved page case, it also has the benefit of encapsulating the
    hack introduced by commit 4016a1390d07f15b267eecb20e76a48fd5c524ef on
    the impacted platforms, allowing us to get rid of the extra checking in
    kobjsize() for the platforms that don't perform this type of bizarre
    memory_end abuse (every nommu platform that isn't blackfin). If blackfin
    decides to get in line with every other platform and use PageReserved
    for the DMA pages in question, kobjsize() will also continue to work
    fine.

    It also turns out that compound_order() will give us back 0-order for
    non-head pages, so we can get rid of the PageCompound check and just
    use compound_order() directly. Clean that up while we're at it.

    Signed-off-by: Paul Mundt
    Reviewed-by: Christoph Lameter
    Acked-by: David Howells
    Signed-off-by: Linus Torvalds

    Paul Mundt
     

10 Jun, 2008

1 commit

  • Minor source code cleanup of page flags in mm/page_alloc.c.
    Move the definition of the groups of bits to page-flags.h.

    The purpose of this clean up is that the next patch will
    conditionally add a page flag to the groups. Doing that
    in a header file is cleaner than adding #ifdefs to the
    C code.

    Signed-off-by: Russ Anderson
    Signed-off-by: Linus Torvalds

    Russ Anderson
     

07 Jun, 2008

3 commits

  • kobjsize() has been abusing page->index as a method for sorting out
    compound order, which blows up both for page cache pages, and SLOB's
    reuse of the index in struct slob_page.

    Presently we are not able to accurately size arbitrary pointers that
    don't come from kmalloc(), so the best we can do is sort out the
    compound order from the head page if it's a compound page, or default
    to 0-order if it's impossible to ksize() the object.

    Obviously this leaves quite a bit to be desired in terms of object
    sizing accuracy, but the behaviour is unchanged over the existing
    implementation, while fixing the page->index oopses originally reported
    here:

    http://marc.info/?l=linux-mm&m=121127773325245&w=2

    Accuracy could also be improved by having SLUB and SLOB both set PG_slab
    on ksizeable pages, rather than just handling the __GFP_COMP cases
    regardless of the PG_slab setting, as made possible by Pekka's
    patches:

    http://marc.info/?l=linux-kernel&m=121139439900534&w=2
    http://marc.info/?l=linux-kernel&m=121139440000537&w=2
    http://marc.info/?l=linux-kernel&m=121139440000540&w=2

    This is primarily a bugfix for nommu systems for 2.6.26, with the aim
    being to gradually kill off kobjsize() and its particular brand of
    object abuse entirely.

    Reviewed-by: Pekka Enberg
    Signed-off-by: Paul Mundt
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mundt
     
  • Fix a regression introduced by

    commit 4cc6028d4040f95cdb590a87db478b42b8be0508
    Author: Jiri Kosina
    Date: Wed Feb 6 22:39:44 2008 +0100

    brk: check the lower bound properly

    The check in sys_brk() on the minimum value brk may take must account for
    the CONFIG_COMPAT_BRK setting. When this option is turned on
    (i.e. we support ancient legacy binaries, e.g. libc5-linked stuff), the
    lower bound on brk value is mm->end_code, otherwise the brk start is
    allowed to be arbitrarily shifted.

    Signed-off-by: Jiri Kosina
    Tested-by: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
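
    Illustrative sketch, not the patch itself: a lower-bound check that
    depends on the compat setting. struct demo_mm and demo_brk_allowed() are
    hypothetical, the compat option is modelled as a runtime flag rather than
    a build-time option, and using start_brk as the bound in the non-compat
    case is an assumption of this example.

    ```c
    /* Hypothetical subset of the address-space layout used by the check. */
    struct demo_mm {
        unsigned long end_code;   /* end of the text segment */
        unsigned long start_brk;  /* heap start; assumed bound without compat */
    };

    /* Returns nonzero if the requested brk is at or above the lower bound. */
    static int demo_brk_allowed(const struct demo_mm *mm, unsigned long brk,
                                int compat_brk)
    {
        unsigned long min_brk = compat_brk ? mm->end_code : mm->start_brk;

        return brk >= min_brk;
    }
    ```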
     
  • =============================================
    [ INFO: possible recursive locking detected ]
    2.6.26-rc4 #30
    ---------------------------------------------
    heap-overflow/2250 is trying to acquire lock:
    (&mm->page_table_lock){--..}, at: [] .copy_hugetlb_page_range+0x108/0x280

    but task is already holding lock:
    (&mm->page_table_lock){--..}, at: [] .copy_hugetlb_page_range+0xfc/0x280

    other info that might help us debug this:
    3 locks held by heap-overflow/2250:
    #0: (&mm->mmap_sem){----}, at: [] .dup_mm+0x134/0x410
    #1: (&mm->mmap_sem/1){--..}, at: [] .dup_mm+0x144/0x410
    #2: (&mm->page_table_lock){--..}, at: [] .copy_hugetlb_page_range+0xfc/0x280

    stack backtrace:
    Call Trace:
    [c00000003b2774e0] [c000000000010ce4] .show_stack+0x74/0x1f0 (unreliable)
    [c00000003b2775a0] [c0000000003f10e0] .dump_stack+0x20/0x34
    [c00000003b277620] [c0000000000889bc] .__lock_acquire+0xaac/0x1080
    [c00000003b277740] [c000000000089000] .lock_acquire+0x70/0xb0
    [c00000003b2777d0] [c0000000003ee15c] ._spin_lock+0x4c/0x80
    [c00000003b277870] [c0000000000cf2e8] .copy_hugetlb_page_range+0x108/0x280
    [c00000003b277950] [c0000000000bcaa8] .copy_page_range+0x558/0x790
    [c00000003b277ac0] [c000000000050fe0] .dup_mm+0x2d0/0x410
    [c00000003b277ba0] [c000000000051d24] .copy_process+0xb94/0x1020
    [c00000003b277ca0] [c000000000052244] .do_fork+0x94/0x310
    [c00000003b277db0] [c000000000011240] .sys_clone+0x60/0x80
    [c00000003b277e30] [c0000000000078c4] .ppc_clone+0x8/0xc

    The fix is to add the lockdep annotation in the same way that
    copy_page_range() in mm/memory.c does.

    Acked-by: KOSAKI Motohiro
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

27 May, 2008

1 commit


25 May, 2008

5 commits

  • Trying to add memory via add_memory() from within an initcall function
    results in

    bootmem alloc of 163840 bytes failed!
    Kernel panic - not syncing: Out of memory

    This is caused by zone_wait_table_init() which uses system_state to decide
    if it should use the bootmem allocator or not.

    When initcalls are handled the system_state is still SYSTEM_BOOTING but
    the bootmem allocator doesn't work anymore. So the allocation will fail.

    To fix this, use slab_is_available() as the indicator instead, as we do
    everywhere else.

    [akpm@linux-foundation.org: coding-style fix]
    Reviewed-by: Andy Whitcroft
    Cc: Dave Hansen
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Yasunori Goto
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing
    panics when trying node local allocations:

    BUG: unable to handle kernel NULL pointer dereference at 0000034c
    IP: [] get_page_from_freelist+0x4a/0x18e
    *pdpt = 00000000013a7001 *pde = 0000000000000000
    Oops: 0000 [#1] SMP
    Modules linked in:

    Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82)
    EIP: 0060:[] EFLAGS: 00010282 CPU: 0
    EIP is at get_page_from_freelist+0x4a/0x18e
    EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000
    ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000)
    Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0
    f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000
    00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0
    Call Trace:
    [] __alloc_pages_internal+0x99/0x378
    [] __alloc_pages+0x7/0x9
    [] kmem_getpages+0x66/0xef
    [] cache_grow+0x8f/0x123
    [] ____cache_alloc_node+0xb9/0xe4
    [] kmem_cache_alloc_node+0x92/0xd2
    [] setup_cpu_cache+0xaf/0x177
    [] kmem_cache_create+0x2c8/0x353
    [] kmem_cache_init+0x1ce/0x3ad
    [] start_kernel+0x178/0x1ee

    This occurs when we are scanning the zonelists looking for a ZONE_NORMAL
    page. In this system there is only ZONE_DMA and ZONE_NORMAL memory on
    node 0, all other nodes are mapped above 4GB physical. Here is a dump
    of the zonelists from this system:

    zonelists pgdat=c1400000
    0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0
    1: c14006c0:2 c1400360:1 c1400000:0
    zonelists pgdat=f7c00000
    0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0
    1: f7c006c0:2
    zonelists pgdat=f7e00000
    0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0
    1: f7e006c0:2

    When performing a node local allocation we call get_page_from_freelist()
    looking for a page. It in turn calls first_zones_zonelist() which returns
    a preferred_zone. Where there are no applicable zones this will be NULL.
    However we use this unconditionally, leading to this panic.

    Where there are no applicable zones there is no possibility of a successful
    allocation, so simply fail the allocation.

    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • The atomic_t type is 32bit but a 64bit system can have more than 2^32
    pages of virtual address space available. Without this we overflow on
    ludicrously large mappings.

    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
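
    Illustrative sketch, not the patch itself: the arithmetic behind the
    overflow. The helpers are hypothetical, and the 4 KiB page size used in
    the note below (a shift of 12) is an assumption of this example.

    ```c
    #include <stdint.h>

    /* Number of pages needed to cover `bytes`, for a power-of-two page size. */
    static uint64_t demo_pages_for(uint64_t bytes, unsigned int page_shift)
    {
        return (bytes + ((UINT64_C(1) << page_shift) - 1)) >> page_shift;
    }

    /* Nonzero if the page count can be stored in a signed 32-bit counter. */
    static int demo_fits_counter32(uint64_t pages)
    {
        return pages <= (uint64_t)INT32_MAX;
    }
    ```

    With 4 KiB pages, a 16 TiB mapping (2^44 bytes) covers 2^32 pages, which
    no longer fits in a 32-bit counter, while a 1 GiB mapping (2^18 pages)
    still does.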
     
  • In a zone's present pages number, account for all pages occupied by the
    memory map, including a partially filled page.

    Signed-off-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
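
    Illustrative sketch, not the patch itself: counting the pages a memory
    map occupies by rounding up, so that a partially filled last page is
    included. demo_memmap_pages() and its parameters are hypothetical.

    ```c
    #include <stddef.h>

    /* Pages occupied by a memory map of nr_entries descriptors, rounding up. */
    static size_t demo_memmap_pages(size_t nr_entries, size_t entry_size,
                                    size_t page_size)
    {
        size_t bytes = nr_entries * entry_size;

        return (bytes + page_size - 1) / page_size;
    }
    ```

    For example, 100 descriptors of 56 bytes take 5600 bytes, which is one
    full 4096 byte page plus a partial one, so two pages are accounted.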
     
  • Take out an assertion to allow ->fault handlers to service PFNMAP regions.
    This is required to reimplement .nopfn handlers with .fault handlers and
    subsequently remove nopfn.

    Signed-off-by: Nick Piggin
    Acked-by: Jes Sorensen
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

23 May, 2008

1 commit

  • Add a WARN_ON for pages that have neither PageSlab nor PageCompound set,
    to catch the worst abusers of ksize() in the kernel.

    Acked-by: Christoph Lameter
    Cc: Matt Mackall
    Signed-off-by: Pekka Enberg

    Pekka Enberg
     

21 May, 2008

1 commit

  • There is a race from when a device is created with device_create() and
    then the drvdata is set with a call to dev_set_drvdata() in which a
    sysfs file could be open, yet the drvdata will be NULL, causing all
    sorts of bad things to happen.

    This patch fixes the problem by using the new function,
    device_create_vargs().

    Many thanks to Arthur Jones for reporting the bug,
    and testing patches out.

    Cc: Kay Sievers
    Cc: Arthur Jones
    Cc: Peter Zijlstra
    Cc: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

20 May, 2008

1 commit

  • Even when slob_alloc() returns NULL, __kmalloc_node() returns NULL + align.
    Because align can vary, an out-of-pages failure is very hard to debug
    unless NULL itself is returned.

    We have to return NULL when no page is available.

    [penberg@cs.helsinki.fi: fix formatting as suggested by Matt.]
    Acked-by: Matt Mackall
    Signed-off-by: MinChan Kim
    Signed-off-by: Pekka Enberg

    MinChan Kim
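
    Illustrative sketch, not the patch itself: checking for allocation
    failure before applying a header offset. The demo_* functions are
    hypothetical userspace stand-ins built on malloc(); align is assumed to
    be at least sizeof(size_t) so that the header fits in front of the
    object.

    ```c
    #include <stddef.h>
    #include <stdlib.h>

    /*
     * Hypothetical allocator that keeps a size header in front of the object.
     * `simulate_failure` stands in for the backing allocator running out of
     * pages.
     */
    static void *demo_alloc_with_header(size_t size, size_t align,
                                        int simulate_failure)
    {
        unsigned char *m = simulate_failure ? NULL : malloc(size + align);

        if (!m)
            return NULL;          /* check before applying the offset */
        *(size_t *)m = size;      /* header records the object size */
        return m + align;         /* caller sees the address past the header */
    }

    /* Release an object obtained from demo_alloc_with_header(). */
    static void demo_free_with_header(void *obj, size_t align)
    {
        if (obj)
            free((unsigned char *)obj - align);
    }
    ```

    Without the check, a failed allocation would be returned as a small
    non-NULL address equal to align, which callers testing for NULL would
    treat as success.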
     

15 May, 2008

6 commits

  • Trying to online a new memory section that was added via memory hotplug
    sometimes results in crashes when the new pages are added via __free_page.
    The reason is that the pageblock bitmap isn't initialized and hence
    contains random data. That means that get_pageblock_migratetype()
    also returns random data, and therefore

    list_add(&page->lru,
    &zone->free_area[order].free_list[migratetype]);

    in __free_one_page() tries to do a list_add to something that isn't even
    necessarily a list.

    This happens since 86051ca5eaf5e560113ec7673462804c54284456 ("mm: fix
    usemap initialization") which makes sure that the pageblock bitmap gets
    only initialized for pages present in a zone. Unfortunately for hot-added
    memory the zones "grow" after the memmap and the pageblock memmap have
    been initialized, which means that the new pages have an uninitialized
    bitmap. To solve this the calls to grow_zone_span() and grow_pgdat_span()
    are moved to __add_zone() just before the initialization happens.

    The patch also moves the two functions since __add_zone() is the only
    caller and I didn't want to add a forward declaration.

    Signed-off-by: Heiko Carstens
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • There is a defect in mprotect, which lets the user change the page cache
    type bits by-passing the kernel reserve_memtype and free_memtype
    wrappers. Fix the problem by not letting mprotect change the PAT bits.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Venki Pallipadi
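
    Illustrative sketch, not the patch itself: applying a requested
    protection change while preserving bits the caller must not alter. The
    DEMO_* bit values are hypothetical; real cache-attribute bits are
    architecture-specific.

    ```c
    /* Hypothetical bit layout; real cache-attribute bits are arch-specific. */
    #define DEMO_PROT_READ    0x01ul
    #define DEMO_PROT_WRITE   0x02ul
    #define DEMO_CACHE_BITS   0x18ul   /* cache-type bits owned by the kernel */

    /* Apply a requested protection while keeping the existing cache-type bits. */
    static unsigned long demo_prot_modify(unsigned long oldprot,
                                          unsigned long newprot)
    {
        return (oldprot & DEMO_CACHE_BITS) | (newprot & ~DEMO_CACHE_BITS);
    }
    ```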
     
  • Add a check to online_pages() to test for failure of
    walk_memory_resource(). This fixes a condition where a failure
    of walk_memory_resource() can lead to online_pages() returning
    success without the requested pages being onlined.

    Signed-off-by: Geoff Levand
    Cc: Yasunori Goto
    Cc: KAMEZAWA Hiroyuki
    Cc: Dave Hansen
    Cc: Keith Mannthey
    Cc: Christoph Lameter
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geoff Levand
     
  • __add_zone calls memmap_init_zone twice if memory gets attached to an empty
    zone. Once via init_currently_empty_zone and once explicitly right after that
    call.

    This does not currently appear to be a bug; however, the call is superfluous
    and might lead to subtle bugs if memmap_init_zone gets changed. So make sure
    it is called only once.

    Cc: Yasunori Goto
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Dave Hansen
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • filemap_fault will go into an infinite loop if ->readpage() fails
    asynchronously.

    AFAICS the bug was introduced by this commit, which removed the wait after the
    final readpage:

    commit d00806b183152af6d24f46f0c33f14162ca1262a
    Author: Nick Piggin
    Date: Thu Jul 19 01:46:57 2007 -0700

    mm: fix fault vs invalidate race for linear mappings

    Fix by reintroducing the wait_on_page_locked() after ->readpage() to make sure
    the page is up-to-date before jumping back to the beginning of the function.

    I've noticed this while testing nfs exporting on fuse. The patch
    fixes it.

    Signed-off-by: Miklos Szeredi
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • There is a possible data race in the page table walking code. After the split
    ptlock patches, it actually seems to have been introduced to the core code, but
    even before that I think it would have impacted some architectures (powerpc
    and sparc64, at least, walk the page tables without taking locks eg. see
    find_linux_pte()).

    The race is as follows:
    The pte page is allocated, zeroed, and its struct page gets its spinlock
    initialized. The mm-wide ptl is then taken, and then the pte page is inserted
    into the pagetables.

    At this point, the spinlock is not guaranteed to have ordered the previous
    stores to initialize the pte page with the subsequent store to put it in the
    page tables. So another Linux page table walker might be walking down (without
    any locks, because we have split-leaf-ptls), and find that new pte we've
    inserted. It might try to take the spinlock before the store from the other
    CPU initializes it. And subsequently it might read a pte_t out before stores
    from the other CPU have cleared the memory.

    There are also similar races in higher levels of the page tables. They
    obviously don't involve the spinlock, but could see uninitialized memory.

    Arch code and hardware pagetable walkers that walk the pagetables without
    locks could see similar uninitialized memory problems, regardless of whether
    split ptes are enabled or not.

    I prefer to put the barriers in core code, because that's where the higher
    level logic happens, but the page table accessors are per-arch, and open-coding
    them everywhere I don't think is an option. I'll put the read-side barriers
    in alpha arch code for now (other architectures perform data-dependent loads
    in order).

    Signed-off-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Nick Piggin
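
    Illustrative sketch, not the patch itself: a userspace analogue of the
    initialize-then-publish ordering described above, written with C11
    atomics instead of the kernel's barrier primitives. struct demo_table,
    demo_slot and the two functions are hypothetical.

    ```c
    #include <stdatomic.h>
    #include <stddef.h>

    struct demo_table {
        int entries[4];
    };

    /* Slot through which a writer makes a new table visible to lockless readers. */
    static _Atomic(struct demo_table *) demo_slot;

    /* Writer: initialize fully, then publish with release ordering. */
    static void demo_publish(struct demo_table *t)
    {
        for (size_t i = 0; i < 4; i++)
            t->entries[i] = 0;
        atomic_store_explicit(&demo_slot, t, memory_order_release);
    }

    /* Reader: acquire ordering ensures the initialized contents are seen. */
    static struct demo_table *demo_lookup(void)
    {
        return atomic_load_explicit(&demo_slot, memory_order_acquire);
    }
    ```

    The release store keeps the initializing stores from being reordered
    after the pointer becomes visible; the acquire load is the conservative
    reader-side counterpart, stronger than the data-dependency ordering the
    commit relies on for most architectures.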
     

13 May, 2008

2 commits


09 May, 2008

2 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    Revert "relay: fix splice problem"
    docbook: fix bio missing parameter
    block: use unitialized_var() in bio_alloc_bioset()
    block: avoid duplicate calls to get_part() in disk stat code
    cfq-iosched: make io priorities inherit CPU scheduling class as well as nice
    block: optimize generic_unplug_device()
    block: get rid of likely/unlikely predictions in merge logic
    vfs: splice remove_suid() cleanup
    cfq-iosched: fix RCU race in the cfq io_context destructor handling
    block: adjust tagging function queue bit locking
    block: sysfs store function needs to grab queue_lock and use queue_flag_*()

    Linus Torvalds
     
  • any_slab_objects() does an atomic_read on an atomic_long_t, this
    fixes it to use atomic_long_read instead.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     

07 May, 2008

2 commits

  • generic_file_splice_write() duplicates remove_suid() just because it
    doesn't hold i_mutex. But it grabs i_mutex inside splice_from_pipe()
    anyway, so this is rather pointless.

    Move locking to generic_file_splice_write() and call remove_suid() and
    __splice_from_pipe() instead.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
     
  • Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32.

    That came from 9fc34113f6880b215cbea4e7017fc818700384c2 x86: debug pmd_bad();
    but we understand now that the typecasting was wrong for PAE in the previous
    version: pagetable pages above 4GB looked bad and stopped Arjan from booting.

    And revert that cded932b75ab0a5f9181ee3da34a0a488d1a14fd x86: fix pmd_bad
    and pud_bad to support huge pages. It was the wrong way round: we shouldn't
    weaken every pmd_bad and pud_bad check to let huge pages slip through - in
    part they check that we _don't_ have a huge page where it's not expected.

    Put the x86 pmd_bad() and pud_bad() definitions back to what they have long
    been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking
    junk in the upper word is good; and x86_64 should follow x86_32's stricter
    comparison, to stop thinking any subset of required bits is good); but that
    should be a later patch.

    Fix Hans' good observation that follow_page() will never find pmd_huge()
    because that would have already failed the pmd_bad test: test pmd_huge in
    between the pmd_none and pmd_bad tests. Tighten x86's pmd_huge() check?
    No, once it's a hugepage entry, it can get quite far from a good pmd: for
    example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.

    However... though follow_page() contains this and another test for huge
    pages, so it's nice to keep it working on them, where does it actually get
    called on a huge page? get_user_pages() checks is_vm_hugetlb_page(vma) to
    call alternative hugetlb processing, as do unmap_vmas() and others.

    Signed-off-by: Hugh Dickins
    Earlier-version-tested-by: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Jeff Chua
    Cc: Hans Rosenfeld
    Cc: Arjan van de Ven
    Signed-off-by: Linus Torvalds

    Hugh Dickins
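
    Illustrative sketch, not the patch itself: the test ordering described
    above, with the huge-page check placed between the "none" and "bad"
    checks. The struct and enum are hypothetical models, not the x86 page
    table definitions.

    ```c
    /* Hypothetical entry states for illustration only. */
    enum demo_pmd_state { DEMO_PMD_NONE, DEMO_PMD_HUGE, DEMO_PMD_BAD, DEMO_PMD_TABLE };

    struct demo_pmd {
        int present;   /* entry is populated */
        int huge;      /* entry maps a huge page directly */
        int sane;      /* entry looks like a valid pointer to a pte page */
    };

    /* Classify an entry, testing for a huge page before the "bad" check. */
    static enum demo_pmd_state demo_classify_pmd(const struct demo_pmd *pmd)
    {
        if (!pmd->present)
            return DEMO_PMD_NONE;
        if (pmd->huge)
            return DEMO_PMD_HUGE;   /* would fail the sanity test below */
        if (!pmd->sane)
            return DEMO_PMD_BAD;
        return DEMO_PMD_TABLE;
    }
    ```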