13 Dec, 2012

40 commits

  • We have two different implementations of the is_zero_pfn() and
    my_zero_pfn() helpers: one for architectures with zero page coloring and
    one for those without.

    Let's consolidate them in a single common header.
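
    A rough sketch (not the exact consolidated code) of what the generic
    helpers look like for an architecture without zero page coloring, in
    terms of the kernel's global zero_pfn:

    extern unsigned long zero_pfn;

    static inline int is_zero_pfn(unsigned long pfn)
    {
            return pfn == zero_pfn;      /* single shared zero page */
    }

    static inline unsigned long my_zero_pfn(unsigned long addr)
    {
            return zero_pfn;             /* without coloring, addr is ignored */
    }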

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Fix the warning from __list_del_entry() which is triggered when a process
    tries to do free_huge_page() for a hwpoisoned hugepage.

    free_huge_page() can be called for hwpoisoned hugepage from
    unpoison_memory(). This function gets refcount once and clears
    PageHWPoison, and then puts refcount twice to return the hugepage back to
    free pool. The second put_page() finally reaches free_huge_page().
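
    In code form, the sequence described above is roughly (a simplified
    sketch of the unpoison path, not the exact upstream code):

    get_page(page);             /* gets the refcount once                      */
    ClearPageHWPoison(page);    /* clears PageHWPoison                         */
    put_page(page);             /* puts the refcount twice...                  */
    put_page(page);             /* ...the second put reaches free_huge_page() */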

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Memory error handling on hugepages can break an RSS counter, which emits
    a message like "Bad rss-counter state mm:ffff88040abecac0 idx:1 val:-1".
    This is because PageAnon returns true for a hugepage (this behavior is
    necessary for reverse mapping to work on hugetlbfs).

    [akpm@linux-foundation.org: clean up code layout]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When a process which used a hwpoisoned hugepage tries to exit() or
    munmap(), the kernel can print out a "bad pmd" message because the page
    table walker in free_pgtables() encounters the 'hwpoisoned entry' on the
    pmd.

    This is because currently we fail to clear the hwpoisoned entry in
    __unmap_hugepage_range(), so this patch simply does it.
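
    A hedged sketch of the kind of handling this adds to the pte loop in
    __unmap_hugepage_range() (the helper names are assumed from the existing
    hugetlb code, and this is not the exact upstream hunk):

    pte = huge_ptep_get(ptep);
    if (unlikely(!pte_present(pte))) {
            /* clear the hwpoisoned swap-style entry here instead of
             * leaving it behind for free_pgtables() to trip over */
            if (is_hugetlb_entry_hwpoisoned(pte))
                    huge_pte_clear(mm, address, ptep);
            continue;
    }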

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • expand_stack() runs with a shared mmap_sem lock. Because of this, there
    could be multiple concurrent stack expansions in the same mm, which may
    cause problems in the vma gap update code.

    I propose to solve this by taking the mm->page_table_lock around such vma
    expansions, in order to avoid the concurrency issue. We only have to
    worry about concurrent expand_stack() calls here, since we hold a shared
    mmap_sem lock and all vma modifications other than expand_stack() are done
    under an exclusive mmap_sem lock.

    I previously tried to achieve the same effect by making sure all growable
    vmas in a given mm would share the same anon_vma, which we already lock
    here. However this turned out to be difficult - all of the schemes I
    tried for refcounting the growable anon_vma and clearing turned out ugly.
    So, I'm now proposing only the minimal fix.

    The overhead of taking the page table lock during stack expansion is
    expected to be small: glibc doesn't use expandable stacks for the threads
    it creates, so having multiple growable stacks is actually uncommon and we
    don't expect the page table lock to get bounced between threads.
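
    As a minimal sketch of the idea (not the exact upstream diff), the vma
    growth ends up bracketed by the page table lock:

    /* serialize concurrent stack expansions done under the shared mmap_sem */
    spin_lock(&vma->vm_mm->page_table_lock);
    vma->vm_end = address;      /* grow the stack vma (expand_upwards case) */
    /* ... update the rbtree gap information derived from the new size ...  */
    spin_unlock(&vma->vm_mm->page_table_lock);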

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The mm given to __mem_cgroup_count_vm_event() cannot be NULL because the
    function is either called from the page fault path or vma->vm_mm is used.
    So the check can be dropped.

    The check was introduced by commit 456f998ec817 ("memcg: add the
    pagefault count into memcg stats") because the originally proposed patch
    used current->mm for shmem but this has been changed to vma->vm_mm later
    on without the check being removed (thanks to Hugh for this
    recollection).

    Signed-off-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Ying Han
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Revert 3.5's commit f21f8062201f ("tmpfs: revert SEEK_DATA and
    SEEK_HOLE") to reinstate 4fb5ef089b28 ("tmpfs: support SEEK_DATA and
    SEEK_HOLE"), with the intervening additional arg to
    generic_file_llseek_size().

    In 3.8, ext4 is expected to join btrfs, ocfs2 and xfs with proper
    SEEK_DATA and SEEK_HOLE support; and a good case has now been made for
    it on tmpfs, so let's join the party.

    It's quite easy for tmpfs to scan the radix_tree to support llseek's new
    SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are
    still on my mind (in particular, the !PageUptodate-ness of pages
    fallocated but still unwritten).
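
    For illustration, a small userspace program (the file path below is just
    an example of a tmpfs mount) showing what the reinstated support means
    for a sparse tmpfs file:

    #define _GNU_SOURCE             /* for SEEK_DATA / SEEK_HOLE */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/dev/shm/sparse-test", O_RDWR | O_CREAT | O_TRUNC, 0600);
            if (fd < 0)
                    return 1;

            /* one byte of data at 1 MiB, leaving a hole in front of it */
            pwrite(fd, "x", 1, 1 << 20);

            off_t data = lseek(fd, 0, SEEK_DATA);   /* first data: around 1 MiB */
            off_t hole = lseek(fd, 0, SEEK_HOLE);   /* first hole: offset 0     */

            printf("data at %lld, hole at %lld\n", (long long)data, (long long)hole);
            close(fd);
            return 0;
    }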

    [akpm@linux-foundation.org: fix warning with CONFIG_TMPFS=n]
    Signed-off-by: Hugh Dickins
    Cc: Dave Chinner
    Cc: Jaegeuk Hanse
    Cc: "Theodore Ts'o"
    Cc: Zheng Liu
    Cc: Jeff liu
    Cc: Paul Eggert
    Cc: Christoph Hellwig
    Cc: Josef Bacik
    Cc: Andi Kleen
    Cc: Andreas Dilger
    Cc: Marco Stornelli
    Cc: Chris Mason
    Cc: Sunil Mushran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If SPARSEMEM is enabled, page structures are not built for non-existing
    pages (holes) within a zone, so provide a more accurate estimate of the
    pages occupied by the memmap when there are big holes within the zone.

    Also, the memmap for highmem zones is allocated from lowmem, so charge
    nr_kernel_pages for that.
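
    Roughly, the new estimate (the calc_memmap_size() helper mentioned in the
    note below) looks like this sketch:

    static unsigned long __paging_init calc_memmap_size(unsigned long spanned_pages,
                                                         unsigned long present_pages)
    {
            unsigned long pages = spanned_pages;

            /* With SPARSEMEM, memmap is only built for present sections, so
             * base the estimate on present_pages once the holes get big. */
            if (IS_ENABLED(CONFIG_SPARSEMEM) &&
                spanned_pages > present_pages + (present_pages >> 4))
                    pages = present_pages;

            return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
    }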

    [akpm@linux-foundation.org: mark calc_memmap_size __paging_init]
    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Tested-by: Jianguo Wu
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • The buffer_head comes from kmem_cache_zalloc(), so there is no need to
    zero its fields.

    Signed-off-by: Yan Hong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yan Hong
     
  • It makes no sense to inline an exported function.

    Signed-off-by: Yan Hong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yan Hong
     
  • Signed-off-by: Yan Hong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yan Hong
     
  • Currently a zone's present_pages is calculated as below, which is
    inaccurate and may cause trouble for memory hotplug.

    spanned_pages - absent_pages - memmap_pages - dma_reserve.

    During fixing bugs caused by inaccurate zone->present_pages, we found
    zone->present_pages has been abused. The field zone->present_pages may
    have different meanings in different contexts:

    1) pages existing in a zone.
    2) pages managed by the buddy system.

    For more discussions about the issue, please refer to:
    http://lkml.org/lkml/2012/11/5/866
    https://patchwork.kernel.org/patch/1346751/

    This patchset tries to introduce a new field named "managed_pages" to
    struct zone, which counts "pages managed by the buddy system", and to
    revert zone->present_pages to counting "physical pages existing in a
    zone", which also keeps it consistent with pgdat->node_present_pages.

    We will set an initial value for zone->managed_pages in function
    free_area_init_core() and will adjust it later if the initial value is
    inaccurate.

    For DMA/normal zones, the initial value is set to:

    (spanned_pages - absent_pages - memmap_pages - dma_reserve)

    Later zone->managed_pages will be adjusted to the accurate value when the
    bootmem allocator frees all free pages to the buddy system in function
    free_all_bootmem_node() and free_all_bootmem().

    The bootmem allocator doesn't touch highmem pages, so highmem zones'
    managed_pages is set to the accurate value "spanned_pages - absent_pages"
    in function free_area_init_core() and won't be updated anymore.

    This patch also adds a new field "managed_pages" to /proc/zoneinfo
    and sysrq showmem.
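
    A hedged sketch of how the initial values end up being set in
    free_area_init_core() (the local variable names are assumptions based on
    the surrounding code, not a quote of the patch):

    zone->spanned_pages = size;              /* spanned_pages                */
    zone->present_pages = realsize;          /* spanned_pages - absent_pages */
    if (is_highmem_idx(j))
            /* the bootmem allocator never touches highmem: final value */
            zone->managed_pages = realsize;
    else
            /* corrected later when bootmem frees its pages to the buddy */
            zone->managed_pages = realsize - memmap_pages - dma_reserve;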

    [akpm@linux-foundation.org: small comment tweaks]
    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Maciej Rutecki
    Tested-by: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • out_of_memory() is a globally defined function to call the oom killer.
    x86, sh, and powerpc all use a function of the same name within file scope
    in their respective fault.c unnecessarily. Inline the functions into the
    pagefault handlers to clean the code up.

    Signed-off-by: David Rientjes
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Reviewed-by: Michal Hocko
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • out_of_memory() will already cause current to schedule if it has not been
    killed, so doing it again in pagefault_out_of_memory() is redundant.
    Remove it.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • To lock the entire system from parallel oom killing, it's possible to pass
    in a zonelist with all zones rather than using for_each_populated_zone()
    for the iteration. This obsoletes try_set_system_oom() and
    clear_system_oom() so that they can be removed.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Now that memory management can handle a movable node, or nodes which
    don't have any normal memory, we can dynamically configure and add a
    movable node by:

    onlining ZONE_MOVABLE memory on a previously offline node
    offlining the last normal memory, which results in a node with no normal memory

    A movable node is very important for power saving, hardware partitioning
    and highly-available systems (hardware fault management).

    Signed-off-by: Lai Jiangshan
    Tested-by: Yasuaki Ishimatsu
    Signed-off-by: Wen Congyang
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Yinghai Lu
    Cc: Rusty Russell
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • We need a node which contains only movable memory. This feature is very
    important for node hotplug. If a node has normal or high memory, that
    memory may be used by the kernel and the node can't be offlined. If the
    node contains only movable memory, we can offline the memory and then the
    node.

    Everything is now prepared, so we can actually introduce N_MEMORY, and
    add CONFIG_MOVABLE_NODE so it can be used for a movable-dedicated node.

    [akpm@linux-foundation.org: fix Kconfig text]
    Signed-off-by: Lai Jiangshan
    Tested-by: Yasuaki Ishimatsu
    Signed-off-by: Wen Congyang
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Yinghai Lu
    Cc: Rusty Russell
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • While profiling numa/core v16 with cgroup_disable=memory on the command
    line, I noticed mem_cgroup_count_vm_event() still showed up as high as
    0.60% in perftop.

    This occurs because the function is called extremely often even when memcg
    is disabled.

    To fix this, inline the check for mem_cgroup_disabled() so we avoid the
    unnecessary function call if memcg is disabled.
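
    The shape of the fix is roughly an inline wrapper in the header, so the
    out-of-line call is skipped entirely when memcg is disabled (a sketch,
    not the exact upstream patch):

    static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
                                                 enum vm_event_item idx)
    {
            if (mem_cgroup_disabled())
                    return;
            __mem_cgroup_count_vm_event(mm, idx);
    }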

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Glauber Costa
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Cc: Glauber Costa
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • While reviewing the source code, I found a comment which mentions that a
    vma's start address can be changed after f_op->mmap(). I didn't verify
    whether this is really possible, because there are so many f_op->mmap()
    implementations. But if some mmap() implementation does change the vma's
    start address, it is a potential error situation, because we have already
    prepared the prev vma, rb_link and rb_parent, and these are related to
    the original address.

    So add a WARN_ON_ONCE to detect whether this situation really happens.
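
    A hedged sketch of what the check looks like in mmap_region() right after
    the driver's mmap hook runs (the local variable names are assumed from
    that function):

    error = file->f_op->mmap(file, vma);
    if (error)
            goto unmap_and_free_vma;

    /* prev, rb_link and rb_parent were computed for the original address;
     * catch any f_op->mmap() implementation that silently moved the vma */
    WARN_ON_ONCE(addr != vma->vm_start);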

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Since commit 628f42355389 ("memcg: limit change shrink usage") both
    res_counter_write() and write_strategy_fn have been unused. This patch
    deletes them both.

    Signed-off-by: Greg Thelen
    Cc: Glauber Costa
    Cc: Tejun Heo
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Update nodemasks management for N_MEMORY.

    [lliubbo@gmail.com: fix build]
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Since we introduced N_MEMORY, we update the initialization of node_states
    accordingly.
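
    The conversion done throughout this series follows one simple pattern; a
    hypothetical example (not a specific upstream hunk) is shown below:

    int nid;

    /* visit nodes that have any memory, not just normal/high memory */
    for_each_node_state(nid, N_MEMORY) {        /* was: N_HIGH_MEMORY */
            /* ... per-node work ... */
    }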

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Lin Feng
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Christoph Lameter
    Signed-off-by: Wen Congyang
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Christoph Lameter
    Signed-off-by: Wen Congyang
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • We have N_NORMAL_MEMORY standing for the nodes that have normal memory
    (zone_type <= ZONE_NORMAL) and N_HIGH_MEMORY standing for the nodes that
    have normal or high memory, but no node state standing for the nodes
    that have any memory. So introduce N_MEMORY for that.
    Acked-by: Christoph Lameter
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • __alloc_contig_migrate_range() should use all possible ways to get all
    the pages migrated from the given memory range, so pruning the per-CPU
    LRU lists for all CPUs is required, regardless of the cost of such an
    operation. Otherwise some pages which are stuck on a per-CPU LRU list
    might be missed by the migration procedure, causing the contiguous
    allocation to fail.
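
    The step the text argues for boils down to draining the per-CPU LRU
    pagevecs on every CPU before isolating pages from the range; a one-line
    sketch (whether the actual fix open-codes this or reaches it via a
    helper is not shown here):

    lru_add_drain_all();        /* flush per-CPU LRU pagevecs on all CPUs */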

    Reported-by: SeongHwan Yoon
    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     
  • compact_capture_page() is only used if compaction is enabled so it should
    be moved into the corresponding #ifdef.

    Signed-off-by: Thierry Reding
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thierry Reding
     
  • The pmd value is stable only with mm->page_table_lock taken. After taking
    the lock we need to check that nobody has modified the pmd before
    changing it.
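
    The resulting pattern looks roughly like this sketch (the error label and
    surrounding context are assumed, not quoted from the patch):

    spin_lock(&mm->page_table_lock);
    if (unlikely(!pmd_same(*pmd, orig_pmd))) {
            /* someone changed the pmd while the lock was not held */
            spin_unlock(&mm->page_table_lock);
            goto out;
    }
    /* ... safe to update the pmd here ... */
    spin_unlock(&mm->page_table_lock);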

    Signed-off-by: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: David Rientjes
    Reviewed-by: Bob Liu
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • By default the kernel tries to use the huge zero page on a read page
    fault. It's possible to disable the huge zero page by writing 0, or
    enable it back by writing 1:

    echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page
    echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • hzp_alloc is incremented every time a huge zero page is successfully
    allocated. It includes allocations which were dropped due to a race with
    another allocation. Note that it doesn't count every map of the huge zero
    page, only its allocation.

    hzp_alloc_failed is incremented if the kernel fails to allocate a huge
    zero page and falls back to using small pages.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov