23 Mar, 2011

1 commit

  • When the user writes a negative value into /proc/sys/vm/nr_hugepages, the
    kernel allocates as many hugepages as possible and then updates
    /proc/meminfo to reflect this.

    This patch changes the behavior so that negative input leaves the
    nr_hugepages value unchanged (see the sketch after this entry).

    Signed-off-by: Petr Holasek
    Signed-off-by: Anton Arapov
    Reviewed-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Acked-by: Eric B Munson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Holasek
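
    A note on the mechanism, with a minimal userspace demo (illustrative
    only, not part of the patch): the handler parses the input into an
    unsigned long, so a negative string such as "-1" wraps around to
    ULONG_MAX, which becomes the pool size the kernel then tries to reach.

        /* demo: parsing "-1" as an unsigned long wraps to ULONG_MAX */
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
                unsigned long count = strtoul("-1", NULL, 10);

                /* prints 18446744073709551615 on 64-bit - the "as many
                 * hugepages as possible" target described above */
                printf("requested pool size: %lu\n", count);
                return 0;
        }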
     

14 Jan, 2011

5 commits

  • When parsing changes to the huge page pool sizes made from userspace via
    the sysfs interface, bogus input values are covered up by
    nr_hugepages_store_common() and nr_overcommit_hugepages_store() returning
    0 when strict_strtoul() returns an error. Because a store that consumes 0
    bytes makes userspace retry the write, this can cause an infinite loop in
    the nr_hugepages_store code. This patch changes the return value for
    these functions to -EINVAL when strict_strtoul() returns an error (see
    the sketch after this entry).

    Signed-off-by: Eric B Munson
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Cc: Eric B Munson
    Cc: Michal Hocko
    Cc: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
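
    A minimal sketch of the fix described above; the function name matches
    mm/hugetlb.c of that era, but the body is condensed and illustrative:

        static ssize_t nr_hugepages_store_common(/* ... */
                                                 const char *buf, size_t len)
        {
                unsigned long count;
                int err;

                err = strict_strtoul(buf, 10, &count);
                if (err)
                        return -EINVAL; /* was "return 0": a store consuming
                                         * 0 bytes makes userspace retry */

                /* ... adjust the pool toward 'count' pages ... */
                return len;             /* consume the whole write on success */
        }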
     
  • Huge pages with order >= MAX_ORDER must be allocated at boot via the
    kernel command line; they cannot be allocated or freed once the kernel is
    up and running. Currently we allow values to be written to the sysfs and
    sysctl files controlling pool size for these huge page sizes. This patch
    makes the store functions for nr_hugepages and nr_overcommit_hugepages
    return -EINVAL when the pool for a page size >= MAX_ORDER is changed (see
    the sketch after this entry).

    [akpm@linux-foundation.org: avoid multiple return paths in nr_hugepages_store_common()]
    [caiqian@redhat.com: add checking in hugetlb_overcommit_handler()]
    Signed-off-by: Eric B Munson
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
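
    A sketch of the guard this adds, assuming 'h' is the struct hstate being
    resized; illustrative rather than the literal diff:

        /* gigantic pages are set up at boot and cannot be resized later */
        if (h->order >= MAX_ORDER)
                return -EINVAL;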
     
  • proc_doulongvec_minmax() may fail if the given buffer doesn't represent a
    valid number. If we provide something invalid, the resulting value
    (nr_overcommit_huge_pages in this case) is left set to a random value
    from the stack.

    The issue was introduced by a3d0c6aa when the default handler was
    replaced by a helper function that does not check the return value.
    A sketch of the fix follows this entry.

    Reproducer:
    echo "" > /proc/sys/vm/nr_overcommit_hugepages

    [akpm@linux-foundation.org: correctly propagate proc_doulongvec_minmax return code]
    Signed-off-by: Michal Hocko
    Cc: CAI Qian
    Cc: Nishanth Aravamudan
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
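
    A sketch of the corrected flow in the handler, using the helper's real
    signature of that era but an illustrative body:

        ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
        if (ret)
                goto out;   /* propagate the error instead of committing the
                             * uninitialized stack value the table points at */
        /* ... only now is the parsed value committed ... */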
     
  • The NODEMASK_ALLOC macro may dynamically allocate memory for its second
    argument ('nodes_allowed' in this context).

    In nr_hugepages_store_common() we may abort early if strict_strtoul()
    fails, but in that case we do not free the memory already allocated to
    'nodes_allowed', causing a memory leak.

    This patch closes the leak by freeing the memory in the error path (see
    the sketch after this entry).

    [akpm@linux-foundation.org: use NODEMASK_FREE, per Minchan Kim]
    Signed-off-by: Jesper Juhl
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
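
    A sketch of the closed error path; the macro names are those used by
    mm/hugetlb.c of that era, the body is illustrative:

        NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);

        err = strict_strtoul(buf, 10, &count);
        if (err) {
                NODEMASK_FREE(nodes_allowed);   /* previously leaked */
                return -EINVAL;
        }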
     
  • Move the copy/clear_huge_page functions to common code to share between
    hugetlb.c and huge_memory.c.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

03 Dec, 2010

1 commit

  • Have hugetlb_fault() call unlock_page(page) only if it had previously
    called lock_page(page).

    Running the libhugetlbfs test suite with CONFIG_DEBUG_VM=y tripped the
    VM_BUG_ON(!PageLocked(page)) in unlock_page(), which had been called by
    hugetlb_fault() when page == pagecache_page. This patch remedies the
    problem (see the sketch after this entry).

    Signed-off-by: Dean Nelson
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dean Nelson
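
    A sketch of the fix: hugetlb_fault() locks 'page' only when it differs
    from pagecache_page, so the unlock must be made symmetric. Illustrative:

        if (page != pagecache_page)
                unlock_page(page);      /* only unlock what we locked */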
     

08 Oct, 2010

9 commits

  • This fixes a problem introduced with the hugetlb hwpoison handling.

    The user space SIGBUS signalling wants to know the size of the hugepage
    that caused a HWPOISON fault.

    Unfortunately the architecture page fault handlers do not have easy
    access to the struct page.

    Pass the information out in the fault error code instead.

    I added a separate VM_FAULT_HWPOISON_LARGE bit for this case and encoded
    the hpage index in some free upper bits of the fault code. The small
    page hwpoison case stays with the VM_FAULT_HWPOISON name to minimize
    changes.

    Also add code to hugetlb.h to convert that index into a page shift.

    It will be used in a further patch (see the sketch after this entry).

    Cc: Naoya Horiguchi
    Cc: fengguang.wu@intel.com
    Signed-off-by: Andi Kleen

    Andi Kleen
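
    A hedged sketch of the encoding; the macro names mirror the VM_FAULT_*
    definitions added around this time, but treat the exact values as
    illustrative:

        #define VM_FAULT_HWPOISON_LARGE 0x0020  /* hwpoison on a huge page */
        #define VM_FAULT_SET_HINDEX(x)  ((x) << 12)          /* pack hstate index */
        #define VM_FAULT_GET_HINDEX(x)  (((x) >> 12) & 0xf)  /* unpack it */

        /* hugetlb.h side: convert the index back into a page shift */
        int shift = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));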
     
  • Fixes a warning reported by Stephen Rothwell:

    mm/hugetlb.c:2950: warning: 'is_hugepage_on_freelist' defined but not used

    for the !CONFIG_MEMORY_FAILURE case.

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Currently error recovery for a free hugepage works only for the
    MF_COUNT_INCREASED case. This patch enables the !MF_COUNT_INCREASED case.

    Free hugepages can be handled directly by alloc_huge_page() and
    dequeue_hwpoisoned_huge_page(), and both of them are protected
    by hugetlb_lock, so there is no race between them.

    Note that this patch defines the refcount of a HWPoisoned hugepage
    dequeued from the freelist to be 1, deviating from the present 0, so that
    we can avoid a race between unpoison and memory failure on a free
    hugepage. This is reasonable because, unlike free buddy pages, a free
    hugepage is governed by hugetlbfs even after error handling finishes.
    It also makes the unpoison code added in a later patch cleaner.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • Currently alloc_huge_page() raises the page refcount outside
    hugetlb_lock, which causes a race when dequeue_hwpoisoned_huge_page()
    runs concurrently with alloc_huge_page(). To avoid the race, this patch
    moves set_page_refcounted() inside hugetlb_lock (see the sketch after
    this entry).

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Wu Fengguang
    Acked-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
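
    A sketch of the reordering, assuming the dequeue path of
    alloc_huge_page() from that era; illustrative, not the literal diff:

        spin_lock(&hugetlb_lock);
        page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
        if (page)
                set_page_refcounted(page);  /* now atomic with respect to
                                             * dequeue_hwpoisoned_huge_page() */
        spin_unlock(&hugetlb_lock);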
     
  • This check is necessary to avoid a race between dequeue and allocation,
    which can cause a free hugepage to be dequeued twice and make the kernel
    unstable.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Wu Fengguang
    Acked-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch extends the page migration code to support hugepage migration.
    One of the potential users of this feature is soft offlining, which
    is triggered by corrected memory errors (added by the next patch).

    Todo:
    - there are other users of page migration, such as memory policy,
    memory hotplug and memory compaction.
    They are not ready for hugepage support for now.

    ChangeLog since v4:
    - define migrate_huge_pages()
    - remove changes on isolation/putback_lru_page()

    ChangeLog since v2:
    - refactor isolate/putback_lru_page() to handle hugepage
    - add comment about race on unmap_and_move_huge_page()

    ChangeLog since v1:
    - divide migration code path for hugepage
    - define routine checking migration swap entry for hugetlb
    - replace "goto" with "if/else" in remove_migration_pte()

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch modifies hugepage copy functions to have only destination
    and source hugepages as arguments for later use.
    The old ones are renamed from copy_{gigantic,huge}_page() to
    copy_user_{gigantic,huge}_page().
    This naming convention is consistent with that between copy_highpage()
    and copy_user_highpage().

    ChangeLog since v4:
    - add blank line between local declaration and code
    - remove unnecessary might_sleep()

    ChangeLog since v2:
    - change copy_huge_page() from macro to inline dummy function
    to avoid compile warning when !CONFIG_HUGETLB_PAGE.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • We can't use the existing hugepage allocation functions to allocate a
    hugepage for page migration, because page migration can happen
    asynchronously from the running processes, and page migration users
    should call the allocation function with physical addresses (not virtual
    addresses) as arguments.

    ChangeLog since v3:
    - unify alloc_buddy_huge_page() and alloc_buddy_huge_page_node()

    ChangeLog since v2:
    - remove unnecessary get/put_mems_allowed() (thanks to David Rientjes)

    ChangeLog since v1:
    - add comment on top of alloc_huge_page_no_vma()

    Signed-off-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Signed-off-by: Jun'ichi Nomura
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • Since the PageHWPoison() check is for avoiding a hwpoisoned page
    remaining in the pagecache mapped to a process, it should be done in the
    "found in pagecache" branch, not in the common path.
    Otherwise, metadata corruption occurs if a memory failure happens between
    alloc_huge_page() and lock_page(), because the page fault fails with
    metadata changes (such as refcount, mapcount, etc.) left behind.

    This patch moves the check to the "found in pagecache" branch and fixes
    the problem.

    ChangeLog since v2:
    - remove retry check in "new allocation" path.
    - make description more detailed
    - change patch name from "HWPOISON, hugetlb: move PG_HWPoison bit check"

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Mel Gorman
    Reviewed-by: Wu Fengguang
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     

13 Aug, 2010

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
    hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()
    hwpoison: rename CONFIG
    HWPOISON, hugetlb: support hwpoison injection for hugepage
    HWPOISON, hugetlb: detect hwpoison in hugetlb code
    HWPOISON, hugetlb: isolate corrupted hugepage
    HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
    HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
    HWPOISON, hugetlb: enable error handling path for hugepage
    hugetlb, rmap: add reverse mapping for hugepage
    hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h

    Fix up trivial conflicts in mm/memory-failure.c

    Linus Torvalds
     

11 Aug, 2010

5 commits

  • This patch fixes a possible deadlock in hugepage lock_page()
    by adding a missing unlock_page().

    libhugetlbfs test will hit this bug when the next patch in this
    patchset ("hugetlb, HWPOISON: move PG_HWPoison bit check") is applied.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Fengguang Wu
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch enables hwpoison injection through the debug/hwpoison
    interfaces, with which we can test memory error handling for free or
    reserved hugepages (which cannot be tested by the madvise() injector).

    [AK: Export PageHuge too for the injection module]
    Signed-off-by: Naoya Horiguchi
    Cc: Andrew Morton
    Acked-by: Fengguang Wu
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch enables blocking access to a hwpoisoned hugepage, and
    also enables blocking its unmapping.

    Dependency:
    "HWPOISON, hugetlb: enable error handling path for hugepage"

    Signed-off-by: Naoya Horiguchi
    Cc: Andrew Morton
    Acked-by: Fengguang Wu
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • If the error hugepage is not in use, we can fully recover from the error
    by dequeuing it from the freelist, so return RECOVERY.
    Otherwise, whether or not we can recover depends on user processes,
    so return DELAYED.

    Dependency:
    "HWPOISON, hugetlb: enable error handling path for hugepage"

    Signed-off-by: Naoya Horiguchi
    Cc: Andrew Morton
    Acked-by: Fengguang Wu
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch adds reverse mapping feature for hugepage by introducing
    mapcount for shared/private-mapped hugepage and anon_vma for
    private-mapped hugepage.

    While hugepage is not currently swappable, reverse mapping can be useful
    for memory error handler.

    Without this patch, the memory error handler cannot identify processes
    using the bad hugepage nor unmap it from them. That is:
    - for a shared hugepage:
    we can collect processes using the hugepage through the pagecache,
    but cannot unmap the hugepage because of the lack of mapcount.
    - for a privately mapped hugepage:
    we can neither collect processes nor unmap the hugepage.
    This patch solves these problems.

    This patch includes the bug fix given by commit 23be7468e8, so it
    reverts that commit.

    Dependency:
    "hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h"

    ChangeLog since May 24.
    - create hugetlb_inline.h and move is_vm_hugetlb_page() into it.
    - move functions setting up anon_vma for hugepage into mm/rmap.c.

    ChangeLog since May 13.
    - rebased to 2.6.34
    - fix logic error (in case that private mapping and shared mapping coexist)
    - move is_vm_hugetlb_page() into include/linux/mm.h to use this function
    from linear_page_index()
    - define and use linear_hugepage_index() instead of compound_order()
    - use page_move_anon_rmap() in hugetlb_cow()
    - copy exclusive switch of __set_page_anon_rmap() into hugepage counterpart.
    - revert commit 23be7468 completely

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Acked-by: Fengguang Wu
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     

10 Aug, 2010

1 commit

  • When a copy-on-write occurs, we take one of two paths in handle_mm_fault:
    through handle_pte_fault for normal pages, or through hugetlb_fault for
    huge pages.

    In the normal page case, we eventually get to do_wp_page and call mmu
    notifiers via ptep_clear_flush_notify. There is no callout to the mmu
    notifiers in the huge page case. This patch fixes that.

    Signed-off-by: Doug Doan
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Doan
     

25 May, 2010

1 commit

  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old unallowed bits later. But along the way, the allocator may find that
    there is no node from which to allocate memory.

    The reason is that, when cpuset rebinds the task's mempolicy, it clears
    the nodes which the allocator can allocate pages on, for example
    (mpol: mempolicy):

    task1                    task1's mpol    task2
    alloc page               1
    alloc on node0? NO       1
                             1               change mems from 1 to 0
                             1               rebind task1's mpol
                             0-1             set new bits
                             0               clear disallowed bits
    alloc on node1? NO       0
    ...
    can't alloc page
    goto oom

    This patch fixes this problem by expanding the nodes range first (set
    newly allowed bits) and shrinking it lazily (clear newly disallowed
    bits). We use a variable to tell the write-side task that a read-side
    task is reading the nodemask, and the write-side task clears newly
    disallowed nodes only after the read-side task ends its current memory
    allocation (see the sketch after this entry).

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
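
    A hedged sketch of the protocol; get_mems_allowed()/put_mems_allowed()
    are the helpers this patch introduces, and the bodies here are
    illustrative:

        /* read side (the allocator): mark the nodemask as in use */
        get_mems_allowed();
        page = __alloc_pages_nodemask(gfp, order, zonelist, nodemask);
        put_mems_allowed();

        /* write side (the cpuset update): grow first, shrink lazily */
        nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); /* set new bits */
        /* ... wait until the task is outside get/put_mems_allowed() ... */
        tsk->mems_allowed = *newmems;   /* clear newly disallowed bits */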
     

12 May, 2010

1 commit

  • Ordinarily, an application using hugetlbfs will create mappings with
    reserves. For shared mappings, these pages are reserved before mmap()
    returns success, and for private mappings, the caller process is
    guaranteed the pages; a child process that cannot get the pages gets
    killed with SIGBUS.

    An application that uses MAP_NORESERVE gets no reservations and mmap()
    will always succeed at the risk the page will not be available at fault
    time. This might be used for example on very large sparse mappings where
    the developer is confident the necessary huge pages exist to satisfy all
    faults even though the whole mapping cannot be backed by huge pages.
    Unfortunately, if an allocation does fail, VM_FAULT_OOM is returned to the
    fault handler which proceeds to trigger the OOM-killer. This is
    unhelpful.

    Even without hugetlbfs mounted, a user using mmap() can trivially trigger
    the OOM-killer because VM_FAULT_OOM is returned (an example program can
    be provided if desired - it's a whopping 24 lines long). It could be
    considered a DoS available to an unprivileged user.

    This patch alters hugetlbfs to kill a process that uses MAP_NORESERVE
    where huge pages were not available, with SIGBUS instead of triggering
    the OOM killer (see the sketch after this entry).

    This change affects hugetlb_cow() as well. I feel there is a failure case
    in there, but I didn't create one. It would need a fairly specific target
    in terms of the faulting application and the hugepage pool size. The
    hugetlb_no_page() path is much easier to hit, but both might as well be
    closed.

    Signed-off-by: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
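
    An illustrative userspace sketch of the MAP_NORESERVE case described
    above; the hugetlbfs mount point and file name are assumptions. After
    this patch, a fault that cannot be satisfied delivers SIGBUS to the
    process instead of waking the OOM killer:

        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                size_t len = 2UL * 1024 * 1024; /* one 2MB huge page */
                int fd = open("/mnt/huge/demo", O_CREAT | O_RDWR, 0644);
                char *p;

                if (fd < 0)
                        return 1;

                /* no reservation: mmap() succeeds even with an empty pool */
                p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_NORESERVE, fd, 0);
                if (p == MAP_FAILED)
                        return 1;

                p[0] = 1;       /* SIGBUS here if no huge page is available */

                munmap(p, len);
                close(fd);
                return 0;
        }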
     

25 Apr, 2010

1 commit

  • If a futex key happens to be located within a huge page mapped
    MAP_PRIVATE, get_futex_key() can go into an infinite loop waiting for a
    page->mapping that will never exist.

    See https://bugzilla.redhat.com/show_bug.cgi?id=552257 for more details
    about the problem.

    This patch makes page->mapping a poisoned value that includes
    PAGE_MAPPING_ANON for huge pages mapped MAP_PRIVATE. This is enough for
    futex to continue, but because of PAGE_MAPPING_ANON the poisoned value is
    not dereferenced or otherwise used by futex. No other part of the VM
    should be dereferencing the page->mapping of a hugetlbfs page, as its
    page cache is not on the LRU.

    This patch fixes the problem with the test case described in the bugzilla.

    [akpm@linux-foundation.org: mel cant spel]
    Signed-off-by: Mel Gorman
    Acked-by: Peter Zijlstra
    Acked-by: Darren Hart
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

30 Mar, 2010

1 commit

  • include cleanup: Update gfp.h and slab.h includes to prepare for breaking
    implicit slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from the tests in
    step 7, I'm fairly confident about the coverage of this conversion
    patch. If there is a breakage, it's likely to be something in one of
    the arch headers, which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

02 Mar, 2010

1 commit

  • * 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (100 commits)
    ARM: Eliminate decompressor -Dstatic= PIC hack
    ARM: 5958/1: ARM: U300: fix inverted clk round rate
    ARM: 5956/1: misplaced parentheses
    ARM: 5955/1: ep93xx: move timer defines into core.c and document
    ARM: 5954/1: ep93xx: move gpio interrupt support to gpio.c
    ARM: 5953/1: ep93xx: fix broken build of clock.c
    ARM: 5952/1: ARM: MM: Add ARM_L1_CACHE_SHIFT_6 for handle inside each ARCH Kconfig
    ARM: 5949/1: NUC900 add gpio virtual memory map
    ARM: 5948/1: Enable timer0 to time4 clock support for nuc910
    ARM: 5940/2: ARM: MMCI: remove custom DBG macro and printk
    ARM: make_coherent(): fix problems with highpte, part 2
    MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself
    ARM: 5945/1: ep93xx: include correct irq.h in core.c
    ARM: 5933/1: amba-pl011: support hardware flow control
    ARM: 5930/1: Add PKMAP area description to memory.txt.
    ARM: 5929/1: Add checks to detect overlap of memory regions.
    ARM: 5928/1: Change type of VMALLOC_END to unsigned long.
    ARM: 5927/1: Make delimiters of DMA area globally visibly.
    ARM: 5926/1: Add "Virtual kernel memory..." printout.
    ARM: 5920/1: OMAP4: Enable L2 Cache
    ...

    Fix up trivial conflict in arch/arm/mach-mx25/clock.c

    Linus Torvalds
     

21 Feb, 2010

1 commit

  • On VIVT ARM, when we have multiple shared mappings of the same file
    in the same MM, we need to ensure that we have coherency across all
    copies. We do this via make_coherent() by making the pages
    uncacheable.

    This used to work fine, until we allowed highmem with highpte - we
    now have a page table which is mapped as required, and is not available
    for modification via update_mmu_cache().

    Ralf Baechle suggested getting rid of the PTE value passed to
    update_mmu_cache():

    On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
    to construct a pointer to the pte again. Passing a pte_t * is much
    more elegant. Maybe we might even replace the pte argument with the
    pte_t?

    Ben Herrenschmidt would also like the pte pointer for PowerPC:

    Passing the ptep in there is exactly what I want. I want that
    -instead- of the PTE value, because I have issue on some ppc cases,
    for I$/D$ coherency, where set_pte_at() may decide to mask out the
    _PAGE_EXEC.

    So, pass the mapped page table pointer into update_mmu_cache(), and
    remove the PTE value, updating all implementations and call sites to
    suit (see the sketch after this entry).

    Includes a fix from Stephen Rothwell:

    sparc: fix fallout from update_mmu_cache API change

    Signed-off-by: Stephen Rothwell

    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Russell King

    Russell King
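
    A sketch of the interface change described above; these are the old and
    new prototypes as the description gives them, shown side by side for
    clarity:

        /* before: implementations received the PTE value, which is
         * unusable when the page table is mapped on demand (highpte) */
        void update_mmu_cache(struct vm_area_struct *vma,
                              unsigned long addr, pte_t pte);

        /* after: pass a pointer to the mapped PTE so each architecture
         * can read - or re-walk to - the entry itself */
        void update_mmu_cache(struct vm_area_struct *vma,
                              unsigned long addr, pte_t *ptep);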
     

03 Feb, 2010

1 commit

  • hugetlb_sysfs_add_hstate is called by hugetlb_register_node directly
    during init and also indirectly via sysfs after init.

    This patch removes the __init tag from hugetlb_sysfs_add_hstate.

    Signed-off-by: Jeff Mahoney
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     

16 Dec, 2009

6 commits

  • If a user asks for a hugepage pool resize but specifies a large number,
    the machine can begin thrashing. In response, they might hit ctrl-c, but
    signals are ignored and the pool resize continues until it fails an
    allocation. This can take a considerable amount of time, so this patch
    aborts a pool resize if a signal is pending (see the sketch after this
    entry).

    Suggested by Dave Hansen.

    Signed-off-by: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
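
    A sketch of the check inside the set_max_huge_pages() grow loop;
    the placement is illustrative, not the literal diff:

        while (count > persistent_huge_pages(h)) {
                if (signal_pending(current))
                        goto out;       /* user hit ctrl-c: stop growing */
                /* ... allocate one more huge page ... */
        }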
     
  • When the owner of a mapping fails COW because a child process is holding
    a reference, the child processes' VMAs are walked and the page is
    unmapped. The i_mmap_lock is taken for the unmapping of the page but not
    for the walking of the prio_tree. In theory, that tree could be changing
    if the lock is not held. This patch takes the i_mmap_lock properly for
    the duration of the prio_tree walk (see the sketch after this entry).

    [hugh.dickins@tiscali.co.uk: Spotted the problem in the first place]
    Signed-off-by: Mel Gorman
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
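
    A sketch of the fixed walk, assuming the prio_tree-era
    unmap_ref_private(); illustrative:

        spin_lock(&mapping->i_mmap_lock);
        vma_prio_tree_foreach(iter_vma, &iter, &mapping->i_mmap,
                              pgoff, pgoff) {
                if (iter_vma != vma)
                        unmap_hugepage_range(iter_vma, address,
                                             address + huge_page_size(h),
                                             page);
        }
        spin_unlock(&mapping->i_mmap_lock);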
     
  • hugetlb_fault() takes the mm->page_table_lock spinlock then calls
    hugetlb_cow(). If the alloc_huge_page() in hugetlb_cow() fails due to an
    insufficient huge page pool it calls unmap_ref_private() with the
    mm->page_table_lock held. unmap_ref_private() then calls
    unmap_hugepage_range() which tries to acquire the mm->page_table_lock.

    [] print_circular_bug_tail+0x80/0x9f
    [] ? check_noncircular+0xb0/0xe8
    [] __lock_acquire+0x956/0xc0e
    [] lock_acquire+0xee/0x12e
    [] ? unmap_hugepage_range+0x3e/0x84
    [] ? unmap_hugepage_range+0x3e/0x84
    [] _spin_lock+0x40/0x89
    [] ? unmap_hugepage_range+0x3e/0x84
    [] ? alloc_huge_page+0x218/0x318
    [] unmap_hugepage_range+0x3e/0x84
    [] hugetlb_cow+0x1e2/0x3f4
    [] ? hugetlb_fault+0x453/0x4f6
    [] hugetlb_fault+0x480/0x4f6
    [] follow_hugetlb_page+0x116/0x2d9
    [] ? _spin_unlock_irq+0x3a/0x5c
    [] __get_user_pages+0x2a3/0x427
    [] get_user_pages+0x3e/0x54
    [] get_user_pages_fast+0x170/0x1b5
    [] dio_get_page+0x64/0x14a
    [] __blockdev_direct_IO+0x4b7/0xb31
    [] blkdev_direct_IO+0x58/0x6e
    [] ? blkdev_get_blocks+0x0/0xb8
    [] generic_file_aio_read+0xdd/0x528
    [] ? avc_has_perm+0x66/0x8c
    [] do_sync_read+0xf5/0x146
    [] ? autoremove_wake_function+0x0/0x5a
    [] ? security_file_permission+0x24/0x3a
    [] vfs_read+0xb5/0x126
    [] ? fget_light+0x5e/0xf8
    [] sys_read+0x54/0x8c
    [] system_call_fastpath+0x16/0x1b

    This can be fixed by dropping the mm->page_table_lock around the call to
    unmap_ref_private(); it is dropped right below in the normal path anyway.
    However, earlier in that function, it's also possible to call into the
    page allocator with the same spinlock held.

    What this patch does is drop the spinlock before the page allocator is
    potentially entered. The check for page allocation failure can be made
    without the page_table_lock, as can the copy of the huge page. Even if
    the PTE changed while the spinlock was dropped, the consequence is only
    that a huge page is copied unnecessarily. This resolves both the double
    taking of the lock and sleeping with the spinlock held (see the sketch
    after this entry).

    [mel@csn.ul.ie: Cover also the case where process can sleep with spinlock]
    Signed-off-by: Larry Woodman
    Signed-off-by: Mel Gorman
    Acked-by: Adam Litke
    Cc: Andy Whitcroft
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Larry Woodman
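
    A sketch of the reordering in hugetlb_cow(); illustrative. The lock is
    dropped before anything that can sleep or re-take it:

        page_cache_get(old_page);
        spin_unlock(&mm->page_table_lock);  /* drop before allocating */

        new_page = alloc_huge_page(vma, address, outside_reserve);
        /* ... handle failure (possibly via unmap_ref_private()) and
         * copy the page without the spinlock held ... */

        spin_lock(&mm->page_table_lock);    /* re-take before updating PTEs */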
     
  • Objects passed to NODEMASK_ALLOC() are relatively small in size and are
    backed by slab caches that are not of large order, traditionally never
    greater than PAGE_ALLOC_COSTLY_ORDER.

    Thus, using GFP_KERNEL for these allocations on large machines when
    CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in
    the allocation attempt, each time invoking both direct reclaim or the oom
    killer.

    This is of particular interest when using NODEMASK_ALLOC() from a
    mempolicy context (either directly in mm/mempolicy.c or the mempolicy
    constrained hugetlb allocations) since the oom killer always kills current
    when allocations are constrained by mempolicies. So for all present use
    cases in the kernel, current would end up being oom killed when direct
    reclaim fails. That would allow the NODEMASK_ALLOC() to succeed but
    current would have sacrificed itself upon returning.

    This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on
    CONFIG_NODES_SHIFT > 8; this parameter is a nop on other configurations.
    All current use cases, either directly from hugetlb code or indirectly
    via NODEMASK_SCRATCH(), use __GFP_NORETRY to avoid direct reclaim and the
    oom killer when the slab allocator needs to allocate additional pages.

    The side-effect of this change is that all current use cases of either
    NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling
    when the allocation fails (it never fails for CONFIG_NODES_SHIFT <= 8,
    where the nodemask is placed on the stack). See the sketch after this
    entry.
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
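
    A sketch of the resulting call pattern, per the description above; the
    gfp argument is the new part, and the -ENOMEM handling is now mandatory:

        NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
        if (!nodes_allowed)
                return -ENOMEM; /* only possible when CONFIG_NODES_SHIFT > 8 */

        /* ... use *nodes_allowed ... */
        NODEMASK_FREE(nodes_allowed);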
     
  • Register per node hstate sysfs attributes only for nodes with memory.
    Globally replace "all online nodes" with "all nodes with memory" in
    mm/hugetlb.c. Suggested by David Rientjes.

    A subsequent patch will handle adding/removing of per node hstate sysfs
    attributes when nodes transition to/from memoryless state via memory
    hotplug.

    NOTE: this patch has not been tested with memoryless nodes.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Acked-by: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Add the per huge page size control/query attributes to the per node
    sysdevs:

    /sys/devices/system/node/node<nid>/hugepages/hugepages-<size>/
        nr_hugepages       - r/w
        free_huge_pages    - r/o
        surplus_huge_pages - r/o

    The patch attempts to re-use/share as much of the existing global hstate
    attribute initialization and handling, and the "nodes_allowed" constraint
    processing as possible.

    Calling set_max_huge_pages() with no node indicates a change to global
    hstate parameters. In this case, any non-default task mempolicy will be
    used to generate the nodes_allowed mask. A valid node id indicates an
    update to that node's hstate parameters, and the count argument specifies
    the target count for the specified node. From this info, we compute the
    target global count for the hstate and construct a nodes_allowed node
    mask containing only the specified node.

    Setting the node specific nr_hugepages via the per node attribute
    effectively ignores any task mempolicy or cpuset constraints.

    With this patch:

    (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
    ./ ../ free_hugepages nr_hugepages surplus_hugepages

    Starting from:
    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 0
    Node 2 HugePages_Free: 0
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0
    vm.nr_hugepages = 0

    Allocate 16 persistent huge pages on node 2:
    (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

    [Note that this is equivalent to:
    numactl -m 2 hugeadmin --pool-pages-min 2M:+16
    ]

    Yields:
    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 16
    Node 2 HugePages_Free: 16
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0
    vm.nr_hugepages = 16

    Global controls work as expected--reduce pool to 8 persistent huge pages:
    (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 8
    Node 2 HugePages_Free: 8
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0

    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn