25 May, 2013

1 commit

  • Fix printk format warnings in mm/memory_hotplug.c by using "%pa":

    mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 2 has type 'resource_size_t' [-Wformat]
    mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 3 has type 'resource_size_t' [-Wformat]

    Signed-off-by: Randy Dunlap
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

30 Apr, 2013

4 commits

  • PFN_PHYS() evaluates to a phys_addr_t, which can be u32 or u64.
    Fix the build warning that occurs when phys_addr_t is u32.

    mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 2 has type 'unsigned int' [-Wformat]: => 1685:3
    mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]: => 1685:3

    Signed-off-by: Randy Dunlap
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • __remove_pages() is only necessary for CONFIG_MEMORY_HOTREMOVE. PowerPC
    pseries will return -EOPNOTSUPP if unsupported.

    Adding an #ifdef makes several other functions it depends on unnecessary
    as well, which saves .text size when the option is disabled (it is
    disabled in most defconfigs besides powerpc, including x86).
    remove_memory_block() becomes static since it is not referenced outside
    of drivers/base/memory.c.

    Build tested on x86 and powerpc with CONFIG_MEMORY_HOTREMOVE both enabled
    and disabled.

    Signed-off-by: David Rientjes
    Acked-by: Toshi Kani
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Greg Kroah-Hartman
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Change __remove_pages() to call release_mem_region_adjustable(). This
    allows a requested memory range to be released from the iomem_resource
    table even if it does not exactly match a resource entry, as long as it
    fits within one. The resource entries initialized at bootup usually
    cover whole contiguous memory ranges and may not match the size of
    memory hot-delete requests.

    If release_mem_region_adjustable() fails, __remove_pages() emits a
    warning message and continues, as was the case with
    release_mem_region(). release_mem_region(), which is defined as
    __release_region(), emits a warning message and returns no error since
    it is a void function.

    Signed-off-by: Toshi Kani
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: David Rientjes
    Cc: Ram Pai
    Cc: T Makphaibulchoke
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Jiang Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • Fix a typo "end_pft" in the comment of walk_memory_range().

    Signed-off-by: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

23 Mar, 2013

1 commit


14 Mar, 2013

1 commit

  • remove_memory() calls walk_memory_range() with [start_pfn, end_pfn), where
    end_pfn is exclusive in this range. Therefore, end_pfn needs to be set to
    the next page of the end address.

    Signed-off-by: Toshi Kani
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Kamezawa Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

24 Feb, 2013

21 commits

  • Replace open coded pgdat_end_pfn() with helper function.

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Remove open coding of ensure_zone_is_initialized().

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • ensure_zone_is_initialized() checks whether a zone is in an empty,
    uninitialized state (typically occurring after it is created during
    memory hotplug) and, if so, calls init_currently_empty_zone() to
    initialize the zone.

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
    duplication.

    This also switches to using them in compaction (where an additional
    variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
    kmemleak.

    Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
    because I expect that at some point the synchronization issues with
    start_pfn & spanned_pages will need fixing, either by actually using the
    seqlock or by clever memory barrier usage.

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • No functional change, but the only purpose of the offlining argument to
    migrate_pages() etc. was to ensure that __unmap_and_move() could migrate
    a KSM page for memory hotremove (which took ksm_thread_mutex) but not
    for other callers. Now all cases are safe, so remove the argument.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Function put_page_bootmem() is used to free pages allocated by the
    bootmem allocator, so it should increase totalram_pages when freeing
    pages into the buddy system.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • When a node is offlined, there is no memory or cpu on the node. If a
    sleeping task runs on a cpu of this node, it will be migrated to a cpu
    on another node. So we can clear the cpu-to-node mapping.

    [akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit]
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • try_offline_node() will be needed in the tristate
    drivers/acpi/processor_driver.c.

    The node will be offlined when all memory/cpus on the node have been
    hot-removed. So we need the function try_offline_node() in the
    cpu-hotplug path.

    If memory hotplug is disabled and cpu hotplug is enabled:

    1. No memory on the node:
    we don't online the node, and the cpu's node is the nearest node.

    2. The node contains some memory:
    the node has been onlined, and the cpu's node is still needed
    to migrate the sleeping task on the cpu to the same node.

    So we do nothing in try_offline_node() in this case.

    [rientjes@google.com: export the function try_offline_node() fix]
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Len Brown
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • Since there is no way to guarantee that the address of pgdat/zone is not
    on the stack of any kernel thread or used by other kernel objects
    without reference counting or some other synchronizing method, we cannot
    reset node_data and free pgdat when offlining a node. Just reset pgdat
    to 0 and reuse the memory when the node is onlined again.

    The problem is suggested by Kamezawa Hiroyuki. The idea is from Wen
    Congyang.

    NOTE: If we don't reset pgdat to 0, the WARN_ON in free_area_init_node()
    will be triggered.

    [akpm@linux-foundation.org: fix warning when CONFIG_NEED_MULTIPLE_NODES=n]
    [akpm@linux-foundation.org: fix the warning again again]
    Signed-off-by: Tang Chen
    Reviewed-by: Wen Congyang
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • We call hotadd_new_pgdat() to allocate memory to store node_data. So we
    should free it when removing a node.

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Reviewed-by: Kamezawa Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • Introduce a new function try_offline_node() to remove the sysfs file of
    a node when all memory sections of the node have been removed. If some
    memory sections of the node have not been removed, this function does
    nothing.

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • When memory is added, we update zone's and pgdat's start_pfn and
    spanned_pages in __add_zone(). So we should revert them when the memory
    is removed.

    The patch adds a new function __remove_zone() to do this.

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even
    if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • In __remove_section(), we take pgdat_resize_lock when calling
    sparse_remove_one_section(). This lock disables irqs. But we don't need
    to hold it across the whole function. If we do any work to free page
    tables in free_section_usemap(), we need to call flush_tlb_all(), which
    requires irqs to be enabled. Otherwise the WARN_ON_ONCE() in
    smp_call_function_many() will be triggered.

    If we lock the whole sparse_remove_one_section(), then we come to this call trace:

    ------------[ cut here ]------------
    WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
    Hardware name: PRIMEQUEST 1800E
    ......
    Call Trace:
    smp_call_function_many+0xbd/0x260
    smp_call_function+0x3b/0x50
    on_each_cpu+0x3b/0xc0
    flush_tlb_all+0x1c/0x20
    remove_pagetable+0x14e/0x1d0
    vmemmap_free+0x18/0x20
    sparse_remove_one_section+0xf7/0x100
    __remove_section+0xa2/0xb0
    __remove_pages+0xa0/0xd0
    arch_remove_memory+0x6b/0xc0
    remove_memory+0xb8/0xf0
    acpi_memory_device_remove+0x53/0x96
    acpi_device_remove+0x90/0xb2
    __device_release_driver+0x7c/0xf0
    device_release_driver+0x2f/0x50
    acpi_bus_remove+0x32/0x6d
    acpi_bus_trim+0x91/0x102
    acpi_bus_hot_remove_device+0x88/0x16b
    acpi_os_execute_deferred+0x27/0x34
    process_one_work+0x20e/0x5c0
    worker_thread+0x12e/0x370
    kthread+0xee/0x100
    ret_from_fork+0x7c/0xb0
    ---[ end trace 25e85300f542aa01 ]---

    Signed-off-by: Tang Chen
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • To remove a sparse-vmemmap memmap region that was allocated from
    bootmem, the region needs to be registered by get_page_bootmem(). So
    the patch searches the pages of the virtual mapping and registers them
    with get_page_bootmem().

    NOTE: register_page_bootmem_memmap() is not implemented for ia64,
    ppc, s390, and sparc. So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE
    and revert register_page_bootmem_info_node() when the platform doesn't
    support it.

    It's implemented by adding a new Kconfig option named
    CONFIG_HAVE_BOOTMEM_INFO_NODE, which is automatically selected by
    architectures that fully support memory hotplug (currently only
    x86_64).

    Since we have two config options, MEMORY_HOTPLUG and MEMORY_HOTREMOVE,
    used for memory hot-add and hot-remove respectively, and the code in
    register_page_bootmem_info_node() is only used to collect information
    for hot-remove, move it under MEMORY_HOTREMOVE.

    page_isolation.c, selected by MEMORY_ISOLATION under MEMORY_HOTPLUG,
    is a similar case; move it too.

    [mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE]
    [linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()]
    [mhocko@suse.cz: remove the arch specific functions without any implementation]
    [linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed]
    [rientjes@google.com: fix defined but not used warning]
    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Tang Chen
    Reviewed-by: Wu Jianguo
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Michal Hocko
    Signed-off-by: Lin Feng
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • To remove memory, we need to remove its page tables. But this is
    architecture dependent, so the patch introduces arch_remove_memory()
    for removing page tables. For now it only calls __remove_pages().

    Note: __remove_pages() is not implemented for some architectures
    (I don't know how to implement it for s390).

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start,
    type} sysfs files are created. But there is no code to remove these
    files. This patch implements the function to remove them.

    We cannot free firmware_map_entry which is allocated by bootmem because
    there is no way to do so when the system is up. But we can at least
    remember the address of that memory and reuse the storage when the
    memory is added next time.

    This patch also introduces a new list map_entries_bootmem to link the
    map entries allocated by bootmem when they are removed, and a lock to
    protect it. And these entries will be reused when the memory is
    hot-added again.

    The idea was suggested by Andrew Morton.

    NOTE: It is unsafe to return an entry pointer and release the
    map_entries_lock. So we should not hold the map_entries_lock
    separately in firmware_map_find_entry() and
    firmware_map_remove_entry(). Hold the map_entries_lock across find
    and remove /sys/firmware/memmap/X operation.

    And also, users of these two functions need to be careful to
    hold the lock when using these two functions.

    [tangchen@cn.fujitsu.com: Hold spinlock across find|remove /sys operation]
    [tangchen@cn.fujitsu.com: fix the wrong comments of map_entries]
    [tangchen@cn.fujitsu.com: reuse the storage of /sys/firmware/memmap/X/ allocated by bootmem]
    [tangchen@cn.fujitsu.com: fix section mismatch problem]
    [tangchen@cn.fujitsu.com: fix the doc format in drivers/firmware/memmap.c]
    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Tang Chen
    Reviewed-by: Kamezawa Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Lai Jiangshan
    Cc: Tang Chen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Julian Calaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • Offlining memory blocks and checking whether memory blocks are offlined
    are very similar operations. This patch introduces a new function to
    remove the redundant code.

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Reviewed-by: Kamezawa Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • We remove the memory like this:

    1. lock memory hotplug
    2. offline a memory block
    3. unlock memory hotplug
    4. repeat 1-3 to offline all memory blocks
    5. lock memory hotplug
    6. remove memory(TODO)
    7. unlock memory hotplug

    All memory blocks must be offlined before removing memory. But we don't
    hold the lock across the whole operation, so we should check whether all
    memory blocks are offlined before step 6. Otherwise, the kernel may
    panic.

    Offlining a memory block and removing a memory device can be two
    different operations. Users can just offline some memory blocks without
    removing the memory device. For this purpose, the kernel already takes
    lock_memory_hotplug() in __offline_pages(). To reuse that code for
    memory hot-remove, we repeat steps 1-3 to offline all the memory blocks,
    repeatedly locking and unlocking memory hotplug, but do not hold the
    memory hotplug lock across the whole operation.

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Tang Chen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • Memory can't be offlined in arbitrary order when CONFIG_MEMCG is
    selected. For example: there is a memory device on node 1 with address
    range [1G, 1.5G). You will find 4 new directories, memory8, memory9,
    memory10, and memory11, under the directory
    /sys/devices/system/memory/.

    If CONFIG_MEMCG is selected, we allocate memory to store page cgroups
    when we online pages. When we online memory8, the memory storing its
    page cgroups is not provided by this memory device. But when we online
    memory9, the memory storing its page cgroups may be provided by
    memory8. So we can't offline memory8 first; we should offline the
    memory in the reverse order.

    When the memory device is hot-removed, we automatically offline the
    memory provided by the device. But we don't know which memory was
    onlined first, so offlining memory may fail. In that case, iterate
    twice to offline the memory. 1st iteration: offline every non-primary
    memory block. 2nd iteration: offline the primary (i.e. first added)
    memory block.

    This idea is suggested by KOSAKI Motohiro.

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • Remove one redundant check of res.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

19 Dec, 2012

1 commit


17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week, but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions, but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

14 Dec, 2012

2 commits

  • Merge misc VM changes from Andrew Morton:
    "The rest of most-of-MM. The other MM bits await a slab merge.

    This patch includes the addition of a huge zero_page. Not a
    performance boost, but it can save large amounts of physical memory in
    some situations.

    Also a bunch of Fujitsu engineers are working on memory hotplug.
    Which, as it turns out, was badly broken. About half of their patches
    are included here; the remainder are 3.8 material."

    However, this merge disables CONFIG_MOVABLE_NODE, which was totally
    broken. We don't add new features with "default y", nor do we add
    Kconfig questions that are incomprehensible to most people without any
    help text. Does the feature even make sense without compaction or
    memory hotplug?

    * akpm: (54 commits)
    mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic()
    mm/memory.c: remove unused code from do_wp_page()
    asm-generic, mm: pgtable: consolidate zero page helpers
    mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage
    hwpoison, hugetlbfs: fix RSS-counter warning
    hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage
    mm: protect against concurrent vma expansion
    memcg: do not check for mm in __mem_cgroup_count_vm_event
    tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
    mm: provide more accurate estimation of pages occupied by memmap
    fs/buffer.c: remove redundant initialization in alloc_page_buffers()
    fs/buffer.c: do not inline exported function
    writeback: fix a typo in comment
    mm: introduce new field "managed_pages" to struct zone
    mm, oom: remove statically defined arch functions of same name
    mm, oom: remove redundant sleep in pagefault oom handler
    mm, oom: cleanup pagefault oom handler
    memory_hotplug: allow online/offline memory to result movable node
    numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
    mm, memcg: avoid unnecessary function call when memcg is disabled
    ...

    Linus Torvalds
     
  • Pull trivial branch from Jiri Kosina:
    "Usual stuff -- comment/printk typo fixes, documentation updates, dead
    code elimination."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    HOWTO: fix double words typo
    x86 mtrr: fix comment typo in mtrr_bp_init
    propagate name change to comments in kernel source
    doc: Update the name of profiling based on sysfs
    treewide: Fix typos in various drivers
    treewide: Fix typos in various Kconfig
    wireless: mwifiex: Fix typo in wireless/mwifiex driver
    messages: i2o: Fix typo in messages/i2o
    scripts/kernel-doc: check that non-void fcts describe their return value
    Kernel-doc: Convention: Use a "Return" section to describe return values
    radeon: Fix typo and copy/paste error in comments
    doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
    various: Fix spelling of "asynchronous" in comments.
    Fix misspellings of "whether" in comments.
    eisa: Fix spelling of "asynchronous".
    various: Fix spelling of "registered" in comments.
    doc: fix quite a few typos within Documentation
    target: iscsi: fix comment typos in target/iscsi drivers
    treewide: fix typo of "suport" in various comments and Kconfig
    treewide: fix typo of "suppport" in various comments
    ...

    Linus Torvalds
     

13 Dec, 2012

3 commits

  • Currently a zone's present_pages is calculated as below, which is
    inaccurate and may cause trouble for memory hotplug.

    spanned_pages - absent_pages - memmap_pages - dma_reserve.

    While fixing bugs caused by the inaccurate zone->present_pages, we found
    that zone->present_pages has been abused. The field zone->present_pages
    may have different meanings in different contexts:

    1) pages existing in a zone.
    2) pages managed by the buddy system.

    For more discussions about the issue, please refer to:
    http://lkml.org/lkml/2012/11/5/866
    https://patchwork.kernel.org/patch/1346751/

    This patchset introduces a new field named "managed_pages" to struct
    zone, which counts "pages managed by the buddy system", and reverts
    zone->present_pages to counting "physical pages existing in a zone",
    which also keeps it consistent with pgdat->node_present_pages.

    We will set an initial value for zone->managed_pages in function
    free_area_init_core() and will adjust it later if the initial value is
    inaccurate.

    For DMA/normal zones, the initial value is set to:

    (spanned_pages - absent_pages - memmap_pages - dma_reserve)

    Later zone->managed_pages will be adjusted to the accurate value when the
    bootmem allocator frees all free pages to the buddy system in function
    free_all_bootmem_node() and free_all_bootmem().

    The bootmem allocator doesn't touch highmem pages, so highmem zones'
    managed_pages is set to the accurate value "spanned_pages - absent_pages"
    in function free_area_init_core() and won't be updated anymore.

    This patch also adds a new field "managed_pages" to /proc/zoneinfo
    and sysrq showmem.

    [akpm@linux-foundation.org: small comment tweaks]
    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Maciej Rutecki
    Tested-by: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Now that memory management can handle movable nodes, or nodes which
    don't have any normal memory, we can dynamically configure and add a
    movable node by:

    onlining ZONE_MOVABLE memory from a previously offline node
    offlining the last normal memory, which results in a node without
    normal memory

    Movable nodes are very important for power saving, hardware
    partitioning, and high-availability systems (hardware fault
    management).

    Signed-off-by: Lai Jiangshan
    Tested-by: Yasuaki Ishimatsu
    Signed-off-by: Wen Congyang
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Yinghai Lu
    Cc: Rusty Russell
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • Update nodemasks management for N_MEMORY.

    [lliubbo@gmail.com: fix build]
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

12 Dec, 2012

5 commits

  • The old memory hotplug code and the new online/movable support may leave
    an online node without any normal memory, but memory management misbehaves
    on nodes which are online but have no normal memory. For example, a task
    bound to such a node may fail every kernel allocation and thus be unable
    to create tasks or other kernel objects.

    So we disable non-normal-memory nodes here; we will enable them once we
    are prepared.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Yinghai Lu
    Cc: Rusty Russell
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • Make online_movable/online_kernel able to empty a zone, or to move memory
    into an empty zone.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Yinghai Lu
    Cc: Rusty Russell
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • Add online_movable and online_kernel for logical memory hotplug. This is
    the dynamic version of "movablecore" & "kernelcore".

    It is introduced for the same reasons as "movablecore" & "kernelcore",
    but it works dynamically, at runtime:

    o We can configure memory as kernelcore or movablecore after boot.

    When the userspace workload increases and we need more hugepages, we can
    use "online_movable" to add memory and let the system use more THP
    (transparent huge pages); vice versa when the kernel workload increases.

    This also helps virtualization dynamically configure host/guest memory,
    saving memory (reducing waste).

    Memory capacity on Demand

    o When a new node is physically onlined after boot, we need to use
    "online_movable" or "online_kernel" to configure/partition it as expected
    when we logically online it.

    This configuration also helps physical memory migration.

    o All the benefits of the existing "movablecore" & "kernelcore".

    o Preparing for movable-node, which is very important for power saving,
    hardware partitioning and high-availability systems (hardware fault
    management).

    (Note, we don't introduce movable-node here.)

    Action behavior:
    When a memory block/memory section is onlined by "online_movable", the
    kernel will not take direct references to pages in that block, so we can
    remove that memory whenever needed.

    When it is onlined by "online_kernel", the kernel can use it.
    When it is onlined by "online", the zone type doesn't change.

    Current constraints:
    Only a memory block which is adjacent to ZONE_MOVABLE can be onlined
    from ZONE_NORMAL to ZONE_MOVABLE.

    [akpm@linux-foundation.org: use min_t, cleanups]
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Yinghai Lu
    Cc: Rusty Russell
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • If we hot-remove memory only and leave the CPUs alive, the corresponding
    node will not be removed, but node_start_pfn and node_spanned_pages in
    pg_data will be reset to 0. In that case, when we hot-add the memory back
    next time, node_start_pfn will always stay 0 because no pfn is less than
    0. After that, if we hot-remove the memory again, it will cause a kernel
    panic in find_biggest_section_pfn() when it tries to scan all the pfns.

    The zone will also have the same problem.

    This patch sets start_pfn to the start_pfn of the section being added when
    spanned_pages of the zone or pg_data is 0.

    ---How to reproduce---

    1. hot-add a container with some memory and cpus;
    2. hot-remove the container's memory, and leave cpus there;
    3. hot-add these memory again;
    4. hot-remove them again;

    then, the kernel will panic.

    ---Call trace---

    BUG: unable to handle kernel paging request at 00000fff82a8cc38
    IP: [] find_biggest_section_pfn+0xe5/0x180
    ......
    Call Trace:
    [] __remove_zone+0x184/0x1b0
    [] __remove_section+0x8c/0xb0
    [] __remove_pages+0xe7/0x120
    [] arch_remove_memory+0x2c/0x80
    [] remove_memory+0x56/0x90
    [] acpi_memory_device_remove_memory+0x48/0x73
    [] acpi_memory_device_notify+0x153/0x274
    [] acpi_ev_notify_dispatch+0x41/0x5f
    [] acpi_os_execute_deferred+0x27/0x34
    [] process_one_work+0x219/0x680
    [] worker_thread+0x12e/0x320
    [] kthread+0xc6/0xd0
    [] kernel_thread_helper+0x4/0x10
    ......
    ---[ end trace 96d845dbf33fee11 ]---

    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Currently memory_hotplug only manages node_states[N_HIGH_MEMORY]; it
    forgets to manage node_states[N_NORMAL_MEMORY], which may leave
    node_states[N_NORMAL_MEMORY] incorrect.

    For example, if a node is empty before onlining and we online memory that
    is in ZONE_NORMAL, then after onlining node_states[N_HIGH_MEMORY] is
    correct but node_states[N_NORMAL_MEMORY] is not: the online code doesn't
    add the newly onlined node to node_states[N_NORMAL_MEMORY].

    The same thing happens when offlining (the offline code doesn't clear the
    node from node_states[N_NORMAL_MEMORY] when needed). Some memory
    management code depends on node_states[N_NORMAL_MEMORY], so we have to
    fix it up.

    We add node_states_check_changes_online() and
    node_states_check_changes_offline() to detect whether
    node_states[N_HIGH_MEMORY] and node_states[N_NORMAL_MEMORY] change during
    hotplug.

    We also add @status_change_nid_normal to struct memory_notify, so the
    memory hotplug callbacks know whether node_states[N_NORMAL_MEMORY] has
    changed. (We could add a @flags field and reuse @status_change_nid
    instead of introducing @status_change_nid_normal, but that would add much
    more complexity to the memory hotplug callback in every subsystem. So
    introducing @status_change_nid_normal is better, and it doesn't change
    the semantics of @status_change_nid.)

    Signed-off-by: Lai Jiangshan
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Yasuaki Ishimatsu
    Cc: Rob Landley
    Cc: Jiang Liu
    Cc: Kay Sievers
    Cc: Greg Kroah-Hartman
    Cc: Mel Gorman
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan