24 Feb, 2013

40 commits

  • Since MCE is an x86 concept, and this code is in mm/, it would be better
    to use the name num_poisoned_pages instead of mce_bad_pages.

    [akpm@linux-foundation.org: fix mm/sparse.c]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Borislav Petkov
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • There are too many return points randomly intermingled with some "goto
    done" return points. So adjust the function structure, one for the
    success path, the other for the failure path. Use atomic_long_inc
    instead of atomic_long_add.

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Andrew Morton
    Cc: Borislav Petkov
    Cc: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • When doing

    $ echo paddr > /sys/devices/system/memory/soft_offline_page

    to offline a *free* page, the value of mce_bad_pages is incremented and
    the HWPoison flag is set on the page, but the page is still managed by
    the page buddy allocator.

    $ cat /proc/meminfo | grep HardwareCorrupted

    shows the value.

    If we offline the same page again, the value of mce_bad_pages is
    incremented *again*, so the value is now incorrect. This assumes the
    page is still free during this short time.

    soft_offline_page()
    get_any_page()
    "else if (is_free_buddy_page(p))" branch return 0
    "goto done";
    "atomic_long_add(1, &mce_bad_pages);"

    This patch:

    Move the poisoned page check to the beginning of the function in order
    to fix the error.
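
    As a rough userspace illustration of the fix, the sketch below models
    the counter and the HWPoison flag. The names model_page and
    model_soft_offline_page() are hypothetical, and returning -EBUSY for
    an already poisoned page is an assumption of this sketch, not
    something stated above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the flag that matters here. */
struct model_page {
	bool hwpoison;
};

/* Models the global counter reported as HardwareCorrupted. */
static long model_num_poisoned_pages;

int model_soft_offline_page(struct model_page *p)
{
	assert(p != NULL);

	/* Check first, so a repeated request cannot count the page twice. */
	if (p->hwpoison)
		return -EBUSY;	/* assumed error code */

	p->hwpoison = true;
	model_num_poisoned_pages++;
	return 0;
}
```

    With the check at the top, offlining the same free page twice leaves
    the counter at 1 instead of 2.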

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Tested-by: Naoya Horiguchi
    Cc: Borislav Petkov
    Cc: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Several functions test MIGRATE_ISOLATE, and some of them are hot paths, but
    MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION (i.e.,
    CMA, memory-hotplug and memory-failure), which is not a common config
    option. So let's not add unnecessary overhead and code when we don't
    enable CONFIG_MEMORY_ISOLATION.

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Function put_page_bootmem() is used to free pages allocated by bootmem
    allocator, so it should increase totalram_pages when freeing pages into
    the buddy system.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Now all users of "number of pages managed by the buddy system" have been
    converted to use zone->managed_pages, so set zone->present_pages to what
    it should be:

    present_pages = spanned_pages - absent_pages;
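
    The formula above can be sketched with a small userspace model. The
    struct model_zone type and its absent_pages field are hypothetical
    and exist only for this illustration; they are not the kernel's
    struct zone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the per-zone page counters. */
struct model_zone {
	unsigned long spanned_pages;	/* whole pfn span, holes included */
	unsigned long absent_pages;	/* pages that fall into holes */
	unsigned long present_pages;	/* pages that physically exist */
};

void model_set_present_pages(struct model_zone *z)
{
	assert(z != NULL && z->absent_pages <= z->spanned_pages);

	z->present_pages = z->spanned_pages - z->absent_pages;
}
```

    For example, a zone spanning 1000 pages with 40 pages of holes ends
    up with present_pages == 960.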

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Now we have zone->managed_pages for "pages managed by the buddy system
    in the zone", so replace zone->present_pages with zone->managed_pages if
    what the user really wants is number of allocatable pages.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • …emblock_overlaps_region().

    The definition of struct movablecore_map is protected by
    CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
    is not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
    movablecore_map in memblock_overlaps_region().

    Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Tang Chen
     
  • We now provide an option for users who don't want to specify physical
    memory addresses on the kernel command line.

    /*
    * For movablemem_map=acpi:
    *
    * SRAT: |_____| |_____| |_________| |_________| ......
    * node id: 0 1 1 2
    * hotpluggable: n y y n
    * movablemem_map: |_____| |_________|
    *
    * Using movablemem_map, we can prevent memblock from allocating memory
    * on ZONE_MOVABLE at boot time.
    */

    So the user just specifies movablemem_map=acpi, and the kernel will use
    the hotpluggable info in SRAT to determine which memory ranges should be
    set as ZONE_MOVABLE.

    If all the memory ranges in SRAT are hotpluggable, then no memory can be
    used by the kernel. But before SRAT is parsed, memblock has already
    reserved some memory ranges for other purposes, such as the kernel
    image. We cannot prevent the kernel from using this memory, so we need
    to exclude these ranges even if the memory is hotpluggable.

    Furthermore, there could be several memory ranges in the single node
    which the kernel resides in. We may skip one range that has memory
    reserved by memblock, but if the rest of the memory is too small, then
    the kernel will fail to boot. So, make the whole node which the kernel
    resides in un-hotpluggable. Then the kernel has enough memory to use.

    NOTE: Using this option will degrade NUMA performance because the
    whole node will be set as ZONE_MOVABLE, and the kernel cannot use memory
    on it. If users don't want to lose NUMA performance, just don't use
    it.
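
    The selection rule described above can be sketched in userspace C as
    follows. Everything here is a hypothetical model: struct
    model_srat_range, MODEL_MAX_NODES and model_apply_acpi_map() are
    invented for illustration, and treating "has memblock-reserved
    memory" as the test for "the kernel resides in this node" is an
    assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_NODES 8	/* hypothetical limit for this sketch */

struct model_srat_range {
	int nid;		/* node id from SRAT */
	bool hotpluggable;	/* hotpluggable bit from SRAT */
	bool reserved;		/* memblock already reserved memory here */
	bool movable;		/* output: covered by movablemem_map */
};

/* Returns the number of ranges marked movable. */
size_t model_apply_acpi_map(struct model_srat_range *r, size_t n)
{
	bool kernel_node[MODEL_MAX_NODES] = { false };
	size_t i, marked = 0;

	assert(r != NULL || n == 0);

	/* Pass 1: a node holding reserved memory is treated as a kernel node. */
	for (i = 0; i < n; i++) {
		assert(r[i].nid >= 0 && r[i].nid < MODEL_MAX_NODES);
		if (r[i].reserved)
			kernel_node[r[i].nid] = true;
	}

	/* Pass 2: hotpluggable ranges on all other nodes become movable. */
	for (i = 0; i < n; i++) {
		r[i].movable = r[i].hotpluggable && !kernel_node[r[i].nid];
		if (r[i].movable)
			marked++;
	}
	return marked;
}
```

    Excluding the whole node in pass 1, rather than only the reserved
    range, mirrors the point made above: the remaining ranges of that
    node stay usable by the kernel.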

    [akpm@linux-foundation.org: fix warning]
    [akpm@linux-foundation.org: use strcmp()]
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Len Brown
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • When implementing movablemem_map boot option, we introduced an array
    movablemem_map.map[] to store the memory ranges to be set as
    ZONE_MOVABLE.

    Since ZONE_MOVABLE is the last zone of a node, if the user didn't specify
    the whole node memory range, we need to extend it to the node end so
    that we can use it to prevent memblock from allocating memory in the
    ranges the user didn't specify.

    We now implement movablemem_map boot option like this:

    /*
    * For movablemem_map=nn[KMG]@ss[KMG]:
    *
    * SRAT: |_____| |_____| |_________| |_________| ......
    * node id: 0 1 1 2
    * user specified: |__| |___|
    * movablemem_map: |___| |_________| |______| ......
    *
    * Using movablemem_map, we can prevent memblock from allocating memory
    * on ZONE_MOVABLE at boot time.
    *
    * NOTE: In this case, SRAT info will be ignored.
    */

    [akpm@linux-foundation.org: clean up code, fix build warning]
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Len Brown
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • On Linux, pages used by the kernel cannot be migrated. As a result,
    if a memory range is used by the kernel, it cannot be hot-removed. So if
    we want to hot-remove memory, we should prevent the kernel from using it.

    The current way to prevent this is to specify a memory range with the
    movablemem_map boot option and set it as ZONE_MOVABLE.

    But while the system is booting, memblock allocates memory and reserves
    it for the kernel. memblock is already working before we parse SRAT and
    know the node memory ranges, so it may allocate memory in ranges that
    are to be set as ZONE_MOVABLE. This memory can be used by the kernel and
    will never be freed.

    So, let's parse SRAT before memblock is called for the first time, which
    is early enough.

    The first call of memblock_find_in_range_node() is in:

    setup_arch()
    |-->setup_real_mode()

    So this patch adds a function early_parse_srat() to parse SRAT, and
    calls it before setup_real_mode() is called.

    NOTE:

    1) early_parse_srat() is called before numa_init(), and has initialized
    numa_meminfo. So DO NOT clear numa_nodes_parsed in numa_init() and DO
    NOT zero numa_meminfo in numa_init(), otherwise we will lose memory
    numa info.

    2) I don't know why the original acpi_numa_init() uses the count of
    memory affinities parsed from SRAT as its return value. So I added a
    static variable srat_mem_cnt to remember this count and use it as the
    return value of the new acpi_numa_init().

    [mhocko@suse.cz: parse SRAT before memblock is ready fix]
    Signed-off-by: Tang Chen
    Reviewed-by: Wen Congyang
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Len Brown
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Ensure the bootmem will not allocate memory from areas that may be
    ZONE_MOVABLE. The map info is from movablecore_map boot option.

    Signed-off-by: Tang Chen
    Reviewed-by: Wen Congyang
    Reviewed-by: Lai Jiangshan
    Tested-by: Lin Feng
    Cc: Wu Jianguo
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • If kernelcore or movablecore is specified at the same time as
    movablemem_map, movablemem_map takes priority. This patch makes
    find_zone_movable_pfns_for_nodes() calculate zone_movable_pfn[] with
    the limit from zone_movable_limit[].

    Signed-off-by: Tang Chen
    Reviewed-by: Wen Congyang
    Cc: Wu Jianguo
    Reviewed-by: Lai Jiangshan
    Tested-by: Lin Feng
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE
    limit from the movablemem_map boot option for all nodes. The function
    sanitize_zone_movable_limit() finds out to which node each range in
    movablemem_map.map[] belongs, and calculates the low boundary of
    ZONE_MOVABLE for each node.
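
    One way to picture the low-boundary calculation is the userspace
    sketch below. It is not the kernel function:
    model_zone_movable_limit() and struct model_pfn_range are hypothetical
    names, ranges are assumed to be half-open [start_pfn, end_pfn), and 0
    is assumed to mean "no limit for this node", so a limit at pfn 0
    cannot be represented.

```c
#include <assert.h>
#include <stddef.h>

struct model_pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

/*
 * Returns the lowest pfn at which any map range intersects the node,
 * or 0 if no map range intersects it.
 */
unsigned long model_zone_movable_limit(const struct model_pfn_range *node,
				       const struct model_pfn_range *map,
				       size_t nr)
{
	unsigned long limit = 0;
	size_t i;

	assert(node != NULL && (map != NULL || nr == 0));

	for (i = 0; i < nr; i++) {
		unsigned long start;

		/* Skip ranges that do not intersect the node at all. */
		if (map[i].end_pfn <= node->start_pfn ||
		    map[i].start_pfn >= node->end_pfn)
			continue;

		/* Clamp the range start to the node start. */
		start = map[i].start_pfn > node->start_pfn ?
			map[i].start_pfn : node->start_pfn;

		if (limit == 0 || start < limit)
			limit = start;
	}
	return limit;
}
```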

    Signed-off-by: Tang Chen
    Signed-off-by: Liu Jiang
    Reviewed-by: Wen Congyang
    Cc: Wu Jianguo
    Reviewed-by: Lai Jiangshan
    Tested-by: Lin Feng
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Add functions to parse the movablemem_map boot option. Since the option
    could be specified more than once, all the maps will be stored in the
    global variable movablemem_map.map array.

    We also keep the array in monotonically increasing order by start_pfn
    and merge all overlapping ranges.
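
    The sorted insert with merging can be sketched in userspace C as
    below. This is only an illustration under stated assumptions:
    struct model_map, MODEL_MAP_MAX and model_map_insert() are
    hypothetical names, ranges are half-open [start_pfn, end_pfn), and
    ranges that merely touch are merged as well, which the text above
    does not specify.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAP_MAX 32	/* hypothetical capacity */

struct model_range {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

struct model_map {
	size_t nr;
	struct model_range map[MODEL_MAP_MAX];
};

/* Insert [start, end), keeping map[] sorted and free of overlaps. */
int model_map_insert(struct model_map *m, unsigned long start,
		     unsigned long end)
{
	size_t first, last, i;

	assert(m != NULL && start < end);

	/* First entry that ends at or after the new range's start. */
	for (first = 0; first < m->nr; first++)
		if (m->map[first].end_pfn >= start)
			break;

	/* Entries [first, last) overlap or touch; absorb them. */
	for (last = first; last < m->nr; last++) {
		if (m->map[last].start_pfn > end)
			break;
		if (m->map[last].start_pfn < start)
			start = m->map[last].start_pfn;
		if (m->map[last].end_pfn > end)
			end = m->map[last].end_pfn;
	}

	if (last == first) {
		/* Nothing to merge: open a slot at index first. */
		if (m->nr == MODEL_MAP_MAX)
			return -1;
		for (i = m->nr; i > first; i--)
			m->map[i] = m->map[i - 1];
		m->nr++;
	} else {
		/* Collapse the absorbed entries into one slot. */
		size_t removed = last - first - 1;

		for (i = last; i < m->nr; i++)
			m->map[i - removed] = m->map[i];
		m->nr -= removed;
	}

	m->map[first].start_pfn = start;
	m->map[first].end_pfn = end;
	return 0;
}
```

    Because the array is always sorted and disjoint before an insert, a
    single left-to-right pass is enough to find every entry that the new
    range has to absorb.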

    [akpm@linux-foundation.org: improve comment]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: remove unneeded parens]
    Signed-off-by: Tang Chen
    Signed-off-by: Lai Jiangshan
    Reviewed-by: Wen Congyang
    Tested-by: Lin Feng
    Cc: Wu Jianguo
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • During the implementation of SRAT support, we met a problem. In
    setup_arch(), we have the following call series:

    1) memblock is ready;
    2) some functions use memblock to allocate memory;
    3) parse ACPI tables, such as SRAT.

    Before 3), we don't know which memory is hotpluggable, and as a result,
    we cannot prevent memblock from allocating hotpluggable memory. So, in
    2), there could be some hotpluggable memory allocated by memblock.

    Now, we are trying to parse SRAT earlier, before memblock is ready. But
    I think we need more investigation on this topic. So in this v5, I
    dropped all the SRAT support, and v5 is just the same as v3, and it is
    based on 3.8-rc3.

    As we planned, we will support getting info from SRAT without users'
    participation at last. And we will post another patch-set to do so.

    And also, I think for now, we can add this boot option as the first step
    of supporting movable node. Since Linux cannot migrate the direct
    mapped pages, the only way for now is to limit the whole node containing
    only movable memory.

    Using SRAT is one way. But even if we can use SRAT, users still need an
    interface to enable/disable this functionality if they don't want to
    lose their NUMA performance. So I think a user interface is always
    needed.

    For now, users can disable this functionality by not specifying the boot
    option. Later, we will post SRAT support, and add another option value
    "movablecore_map=acpi" to use SRAT.

    This patch:

    If the system can create a movable node, in which all memory of the
    node is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate
    memory for the node's pg_data_t on that node. So, use
    memblock_alloc_try_nid() instead of memblock_alloc_nid() to retry when
    the first allocation fails.

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tang Chen
    Signed-off-by: Jiang Liu
    Cc: Wu Jianguo
    Cc: Wen Congyang
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu)
    will return -1. As a result, cpumask_of_node(nid) will return NULL. In
    this case, find_next_bit() in for_each_cpu will get a NULL pointer and
    cause panic.

    Here is a call trace:
    Call Trace:

    select_fallback_rq+0x71/0x190
    try_to_wake_up+0x2cb/0x2f0
    wake_up_process+0x15/0x20
    hrtimer_wakeup+0x22/0x30
    __run_hrtimer+0x83/0x320
    hrtimer_interrupt+0x106/0x280
    smp_apic_timer_interrupt+0x69/0x99
    apic_timer_interrupt+0x6f/0x80

    There is a sleeping hrtimer process whose cpu has already been
    offlined. When it is woken up, it tries to find another cpu to run on
    and gets a nid of -1. As a result, cpumask_of_node(-1) returns NULL and
    causes a kernel panic.

    This patch fixes the problem by checking whether the nid is -1. If the
    nid is not -1, a cpu on the same node will be picked. Otherwise, an
    online cpu on another node will be picked.
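
    The selection order can be sketched in userspace C as follows. The
    names struct model_cpu, MODEL_NO_NODE and
    model_select_fallback_cpu() are hypothetical, and the sketch leaves
    out everything else the scheduler considers (for example the task's
    allowed cpus); it only shows the nid == -1 guard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NO_NODE (-1)	/* nid of an offlined cpu */

struct model_cpu {
	int nid;
	bool online;
};

/* Returns the index of the chosen cpu, or -1 if no cpu is online. */
int model_select_fallback_cpu(const struct model_cpu *cpus, int nr_cpus,
			      int cpu)
{
	int nid, i;

	assert(cpus != NULL && cpu >= 0 && cpu < nr_cpus);

	nid = cpus[cpu].nid;

	/* Only look at the node when the nid is valid. */
	if (nid != MODEL_NO_NODE) {
		for (i = 0; i < nr_cpus; i++)
			if (cpus[i].online && cpus[i].nid == nid)
				return i;
	}

	/* Otherwise, or if the node has no online cpu, take any online cpu. */
	for (i = 0; i < nr_cpus; i++)
		if (cpus[i].online)
			return i;

	return -1;
}
```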

    Signed-off-by: Tang Chen
    Signed-off-by: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • When the node is offlined, there is no memory/cpu on the node. If a
    sleeping task ran on a cpu of this node, it will be migrated to a cpu
    on another node. So we can clear the cpu-to-node mapping.

    [akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit]
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • The node will be offlined when all memory/cpu on the node is hotremoved.
    So we should try to offline the node when hotremoving a cpu on the node.

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Len Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • try_offline_node() will be needed in the tristate
    drivers/acpi/processor_driver.c.

    The node will be offlined when all memory/cpu on the node have been
    hotremoved. So we need the function try_offline_node() in cpu-hotplug
    path.

    If the memory-hotplug is disabled, and cpu-hotplug is enabled

    1. no memory on the node
    we don't online the node, and cpu's node is the nearest node.

    2. the node contains some memory
    the node has been onlined, and cpu's node is still needed
    to migrate the sleep task on the cpu to the same node.

    So we do nothing in try_offline_node() in this case.

    [rientjes@google.com: export the function try_offline_node() fix]
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Len Brown
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • When a cpu is hotplugged, we call acpi_map_cpu2node() in
    _acpi_map_lsapic() to store the cpu's node and apicid's node. But we
    don't clear the cpu's node in acpi_unmap_lsapic() when this cpu is
    hotremoved. If the node is also hotremoved, we will get the following
    messages:

    kernel BUG at include/linux/gfp.h:329!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
    Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
    RIP: 0010:[] [] allocate_slab+0x28d/0x300
    RSP: 0018:ffff88078a049cf8 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
    RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
    R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
    FS: 00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
    Call Trace:
    new_slab+0x30/0x1b0
    __slab_alloc+0x358/0x4c0
    kmem_cache_alloc_node_trace+0xb4/0x1e0
    alloc_fair_sched_group+0xd0/0x1b0
    sched_create_group+0x3e/0x110
    sched_autogroup_create_attach+0x4d/0x180
    sys_setsid+0xd4/0xf0
    system_call_fastpath+0x16/0x1b
    Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
    RIP [] allocate_slab+0x28d/0x300
    RSP
    ---[ end trace adf84c90f3fea3e5 ]---

    The reason is that the cpu's node is not NUMA_NO_NODE, so we call
    alloc_pages_exact_node() to allocate memory on the node, but the node is
    offlined.

    If the node is onlined, we still need the cpu's node. For example, a task
    on the cpu is sleeping when the cpu is hotremoved. We will choose
    another cpu to run this task when it is woken up. If we know the cpu's
    node, we will choose a cpu on the same node first. So we should clear
    the cpu-to-node mapping when the node is offlined.

    This patch only clears apicid-to-node mapping when the cpu is
    hotremoved.

    [akpm@linux-foundation.org: fix section error]
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • is_valid_nodemask() was introduced by commit 19770b32609b ("mm: filter
    based on a nodemask as well as a gfp_mask"), but it does not match its
    comments, because it does not check zones whose index is > policy_zone.

    Also, commit b377fd3982ad ("Apply memory policies to top two highest
    zones when highest zone is ZONE_MOVABLE") tells us that if the
    highest zone is ZONE_MOVABLE, we should also apply memory policies to
    it. So ZONE_MOVABLE should be a valid zone for policies, and
    is_valid_nodemask() needs to be changed to match.

    Fix: check all zones, even those whose zoneid is > policy_zone. Use
    nodes_intersects() instead of open code to check it.
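
    The idea of replacing the per-zone loop with a single intersection
    test can be sketched in userspace C. The type model_nodemask_t (one
    bit per node) and model_is_valid_nodemask() are hypothetical, and
    intersecting with "the set of nodes that have memory" is an
    assumption of this sketch about what the policy nodemask is checked
    against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical nodemask: bit n set means node n is in the mask. */
typedef unsigned long model_nodemask_t;

/*
 * A policy nodemask is considered valid when at least one of its nodes
 * has memory in any zone, ZONE_MOVABLE included.
 */
bool model_is_valid_nodemask(const model_nodemask_t *nodemask,
			     model_nodemask_t nodes_with_memory)
{
	assert(nodemask != NULL);

	return (*nodemask & nodes_with_memory) != 0;
}
```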

    Reported-by: Wen Congyang
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tang Chen
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • usemap could also be allocated as compound pages, so compound pages
    should also be considered when freeing memmap.

    If we don't fix it, there could be problems when we free vmemmap
    pagetables which are stored in compound pages. The old pagetables will
    not be freed properly, and when we add the memory again, no new
    pagetable will be created. The old pagetable entry is then used, and
    the kernel will panic.

    The call trace is like the following:

    BUG: unable to handle kernel paging request at ffffea0040000000
    IP: [] sparse_add_one_section+0xef/0x166
    PGD 7ff7d4067 PUD 78e035067 PMD 78e11d067 PTE 0
    Oops: 0002 [#1] SMP
    Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg lpc_ich mfd_core i2c_i801 i2c_core i7core_edac edac_core ioatdma e1000e igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
    CPU 0
    Pid: 4, comm: kworker/0:0 Tainted: G W 3.8.0-rc3-phy-hot-remove+ #3 FUJITSU-SV PRIMEQUEST 1800E/SB
    RIP: 0010:[] [] sparse_add_one_section+0xef/0x166
    RSP: 0018:ffff8807bdcb35d8 EFLAGS: 00010006
    RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000200000
    RDX: ffff88078df01148 RSI: 0000000000000282 RDI: ffffea0040000000
    RBP: ffff8807bdcb3618 R08: 4cf05005b019467a R09: 0cd98fa09631467a
    R10: 0000000000000000 R11: 0000000000030e20 R12: 0000000000008000
    R13: ffffea0040000000 R14: ffff88078df66248 R15: ffff88078ea13b10
    FS: 0000000000000000(0000) GS:ffff8807c1a00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffffea0040000000 CR3: 0000000001c0c000 CR4: 00000000000007f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kworker/0:0 (pid: 4, threadinfo ffff8807bdcb2000, task ffff8807bde18000)
    Call Trace:
    __add_pages+0x85/0x120
    arch_add_memory+0x71/0xf0
    add_memory+0xd6/0x1f0
    acpi_memory_device_add+0x170/0x20c
    acpi_device_probe+0x50/0x18a
    really_probe+0x6c/0x320
    driver_probe_device+0x47/0xa0
    __device_attach+0x53/0x60
    bus_for_each_drv+0x6c/0xa0
    device_attach+0xa8/0xc0
    bus_probe_device+0xb0/0xe0
    device_add+0x301/0x570
    device_register+0x1e/0x30
    acpi_device_register+0x1d8/0x27c
    acpi_add_single_object+0x1df/0x2b9
    acpi_bus_check_add+0x112/0x18f
    acpi_ns_walk_namespace+0x105/0x255
    acpi_walk_namespace+0xcf/0x118
    acpi_bus_scan+0x5b/0x7c
    acpi_bus_add+0x2a/0x2c
    container_notify_cb+0x112/0x1a9
    acpi_ev_notify_dispatch+0x46/0x61
    acpi_os_execute_deferred+0x27/0x34
    process_one_work+0x20e/0x5c0
    worker_thread+0x12e/0x370
    kthread+0xee/0x100
    ret_from_fork+0x7c/0xb0
    Code: 00 00 48 89 df 48 89 45 c8 e8 3e 71 b1 ff 48 89 c2 48 8b 75 c8 b8 ef ff ff ff f6 02 01 75 4b 49 63 cc 31 c0 4c 89 ef 48 c1 e1 06 aa 48 8b 02 48 83 c8 01 48 85 d2 48 89 02 74 29 a8 01 74 25
    RIP [] sparse_add_one_section+0xef/0x166
    RSP
    CR2: ffffea0040000000
    ---[ end trace e7f94e3a34c442d4 ]---
    Kernel panic - not syncing: Fatal exception

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • Since there is no way to guarantee that the address of pgdat/zone is not
    on the stack of any kernel thread or used by other kernel objects without
    reference counting or another synchronizing method, we cannot reset
    node_data and free pgdat when offlining a node. Just reset pgdat to 0
    and reuse the memory when the node is online again.

    The problem is suggested by Kamezawa Hiroyuki. The idea is from Wen
    Congyang.

    NOTE: If we don't reset pgdat to 0, the WARN_ON in free_area_init_node()
    will be triggered.

    [akpm@linux-foundation.org: fix warning when CONFIG_NEED_MULTIPLE_NODES=n]
    [akpm@linux-foundation.org: fix the warning again again]
    Signed-off-by: Tang Chen
    Reviewed-by: Wen Congyang
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • We call hotadd_new_pgdat() to allocate memory to store node_data. So we
    should free it when removing a node.

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Reviewed-by: Kamezawa Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • Introduce a new function try_offline_node() to remove sysfs file of node
    when all memory sections of this node are removed. If some memory
    sections of this node are not removed, this function does nothing.

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • When memory is added, we update zone's and pgdat's start_pfn and
    spanned_pages in __add_zone(). So we should revert them when the memory
    is removed.

    The patch adds a new function __remove_zone() to do this.

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even
    if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Introduce a new API vmemmap_free() to free and remove vmemmap
    pagetables. Since pagetable implementations differ, each architecture
    has to provide its own version of vmemmap_free(), just like
    vmemmap_populate().

    Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.

    [mhocko@suse.cz: fix implicit declaration of remove_pagetable]
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Jianguo Wu
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Search the page tables for the removed memory and clear the
    corresponding entries on the x86_64 architecture.

    [akpm@linux-foundation.org: make kernel_physical_mapping_remove() static]
    Signed-off-by: Wen Congyang
    Signed-off-by: Jianguo Wu
    Signed-off-by: Jiang Liu
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • When memory is removed, the corresponding pagetables should also be
    removed. This patch introduces some common APIs to support vmemmap
    pagetable and x86_64 architecture direct mapping pagetable removing.

    The pages backing the virtual mapping of removed memory cannot all be
    freed if some pages used as PGD/PUD cover not only the removed memory
    but also other memory. So this patch uses the following way to check
    whether a page can be freed or not.

    1) When removing memory, the page structs of the removed memory are
    filled with 0xFD.

    2) When all page structs on a PT/PMD are filled with 0xFD, the PT/PMD
    can be cleared. In this case, the page used as PT/PMD can be freed.

    For direct mapping pages, update direct_pages_count[level] when their
    pagetables are freed. Do not free the pages themselves again, because
    they were already freed when offlining.

    For vmemmap pages, free the pages and their pagetables.

    For larger pages, do not split them into smaller ones because there is
    no way to know if the larger page has been split. As a result, there is
    no way to decide when to split. We deal with the larger pages in the
    following way:

    1) For direct mapped pages, all the pages were freed when they were
    offlined. And since memory offline is done section by section, all
    the memory ranges being removed are aligned to PAGE_SIZE. So we only
    need to deal with unaligned pages when freeing vmemmap pages.

    2) For vmemmap pages being used to store page structs, if part of the
    larger page is still in use, just fill the unused part with 0xFD. And
    when the whole page is filled with 0xFD, then free the larger page.
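
    The fill-and-check rule above can be illustrated with a minimal userspace
    sketch. It is not the kernel code: MODEL_PAGE_SIZE, MODEL_FILL and all
    function names are hypothetical, and a plain byte buffer stands in for a
    page that holds page structs.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define MODEL_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */
    #define MODEL_FILL      0xFD   /* fill pattern from the commit message */

    /* Mark the byte range [off, off + len) of a page as no longer in use. */
    static void mark_unused(unsigned char *page, size_t off, size_t len)
    {
        assert(off + len <= MODEL_PAGE_SIZE);
        memset(page + off, MODEL_FILL, len);
    }

    /* True when every byte of the page carries the fill pattern. */
    static bool page_fully_unused(const unsigned char *page)
    {
        for (size_t i = 0; i < MODEL_PAGE_SIZE; i++)
            if (page[i] != MODEL_FILL)
                return false;
        return true;
    }

    /*
     * Release part of a page. Returns true when the whole page is now
     * filled with the pattern, i.e. the caller may free the page itself.
     */
    static bool release_range(unsigned char *page, size_t off, size_t len)
    {
        mark_unused(page, off, len);
        return page_fully_unused(page);
    }
    ```

    In this model, releasing the first half of a zeroed page returns false
    and releasing the second half returns true, which is the point at which
    the larger page would be freed.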

    [akpm@linux-foundation.org: fix typo in comment]
    [tangchen@cn.fujitsu.com: do not calculate direct mapping pages when freeing vmemmap pagetables]
    [tangchen@cn.fujitsu.com: do not free direct mapping pages twice]
    [tangchen@cn.fujitsu.com: do not free page split from hugepage one by one]
    [tangchen@cn.fujitsu.com: do not split pages when freeing pagetable pages]
    [akpm@linux-foundation.org: use pmd_page_vaddr()]
    [akpm@linux-foundation.org: fix used-uninitialised bug]
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Jianguo Wu
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • In __remove_section(), we locked pgdat_resize_lock when calling
    sparse_remove_one_section(). This lock disables irqs. But we don't
    need to lock the whole function. If we do some work to free pagetables
    in free_section_usemap(), we need to call flush_tlb_all(), which needs
    irqs enabled. Otherwise the WARN_ON_ONCE() in smp_call_function_many()
    will be triggered.

    If we lock the whole sparse_remove_one_section(), then we get this call trace:

    ------------[ cut here ]------------
    WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
    Hardware name: PRIMEQUEST 1800E
    ......
    Call Trace:
    smp_call_function_many+0xbd/0x260
    smp_call_function+0x3b/0x50
    on_each_cpu+0x3b/0xc0
    flush_tlb_all+0x1c/0x20
    remove_pagetable+0x14e/0x1d0
    vmemmap_free+0x18/0x20
    sparse_remove_one_section+0xf7/0x100
    __remove_section+0xa2/0xb0
    __remove_pages+0xa0/0xd0
    arch_remove_memory+0x6b/0xc0
    remove_memory+0xb8/0xf0
    acpi_memory_device_remove+0x53/0x96
    acpi_device_remove+0x90/0xb2
    __device_release_driver+0x7c/0xf0
    device_release_driver+0x2f/0x50
    acpi_bus_remove+0x32/0x6d
    acpi_bus_trim+0x91/0x102
    acpi_bus_hot_remove_device+0x88/0x16b
    acpi_os_execute_deferred+0x27/0x34
    process_one_work+0x20e/0x5c0
    worker_thread+0x12e/0x370
    kthread+0xee/0x100
    ret_from_fork+0x7c/0xb0
    ---[ end trace 25e85300f542aa01 ]---
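
    The difference between the two lock scopes can be modelled in userspace.
    This is only a sketch under stated assumptions: a boolean stands in for
    the IRQ state that pgdat_resize_lock changes, a counter stands in for the
    warning, and every model_* and remove_section_* name is hypothetical.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    static bool irqs_off;   /* models the IRQ state of the current CPU */
    static int warnings;    /* counts modelled warning hits */

    /* Stand-ins for taking and dropping pgdat_resize_lock. */
    static void model_lock(void)   { irqs_off = true; }
    static void model_unlock(void) { assert(irqs_off); irqs_off = false; }

    /* Stand-in for flush_tlb_all(): must not run with IRQs disabled. */
    static void model_flush_tlb_all(void)
    {
        if (irqs_off)
            warnings++;
    }

    /* Stand-in for the pagetable freeing done during section removal. */
    static void model_free_pagetables(void)
    {
        model_flush_tlb_all();
    }

    /* Old shape: the lock covers the whole removal, including the flush. */
    static void remove_section_wide_lock(void)
    {
        model_lock();
        /* ... detach the section's bookkeeping ... */
        model_free_pagetables();
        model_unlock();
    }

    /* New shape: only the bookkeeping update runs under the lock. */
    static void remove_section_narrow_lock(void)
    {
        model_lock();
        /* ... detach the section's bookkeeping ... */
        model_unlock();
        model_free_pagetables();
    }
    ```

    Calling remove_section_wide_lock() increments the warning counter in this
    model, while remove_section_narrow_lock() leaves it untouched.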

    Signed-off-by: Tang Chen
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • To remove a sparse-vmemmap memmap region that was allocated from
    bootmem, the region needs to be registered by get_page_bootmem(). So
    the patch searches the pages of the virtual mapping and registers them
    by get_page_bootmem().

    NOTE: register_page_bootmem_memmap() is not implemented for ia64,
    ppc, s390, and sparc. So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE
    and revert register_page_bootmem_info_node() when platform doesn't
    support it.

    It's implemented by adding a new Kconfig option named
    CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected
    by archs that fully support the memory-hotplug feature (currently only
    x86_64).

    Since we have 2 config options called MEMORY_HOTPLUG and
    MEMORY_HOTREMOVE used for memory hot-add and hot-remove separately,
    and the code in register_page_bootmem_info_node() is only used for
    collecting information for hot-remove, place it under
    MEMORY_HOTREMOVE.

    page_isolation.c, selected by MEMORY_ISOLATION under MEMORY_HOTPLUG,
    is a similar case, so move it too.

    [mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE]
    [linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()]
    [mhocko@suse.cz: remove the arch specific functions without any implementation]
    [linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed]
    [rientjes@google.com: fix defined but not used warning]
    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Tang Chen
    Reviewed-by: Wu Jianguo
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Michal Hocko
    Signed-off-by: Lin Feng
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • For removing memory, we need to remove page tables. But it depends on
    architecture. So the patch introduces arch_remove_memory() for removing
    page tables. For now it only calls __remove_pages().

    Note: __remove_pages() is not implemented for some architectures
    (I don't know how to implement it for s390).

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start,
    type} sysfs files are created. But there is no code to remove these
    files. This patch implements the function to remove them.

    We cannot free firmware_map_entry which is allocated by bootmem because
    there is no way to do so when the system is up. But we can at least
    remember the address of that memory and reuse the storage when the
    memory is added next time.

    This patch also introduces a new list map_entries_bootmem to link the
    map entries allocated by bootmem when they are removed, and a lock to
    protect it. And these entries will be reused when the memory is
    hot-added again.

    The idea was suggested by Andrew Morton.

    NOTE: It is unsafe to return an entry pointer and release the
    map_entries_lock. So we should not hold the map_entries_lock
    separately in firmware_map_find_entry() and
    firmware_map_remove_entry(). Hold the map_entries_lock across find
    and remove /sys/firmware/memmap/X operation.

    And also, users of these two functions need to be careful to
    hold the lock when using these two functions.
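
    A minimal userspace sketch of the two ideas above: find and remove
    happen inside one critical section, and entries that came from bootmem
    are parked for reuse instead of being freed. It is not the code in
    drivers/firmware/memmap.c. The struct, the list heads, the single model
    lock and all function names are assumptions made for illustration.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct map_entry {
        unsigned long long start, end;
        bool from_bootmem;          /* cannot be freed at runtime */
        struct map_entry *next;
    };

    static struct map_entry *live;    /* entries currently exported */
    static struct map_entry *parked;  /* removed bootmem entries kept for reuse */
    static bool lock_held;            /* models map_entries_lock */

    static void model_lock(void)   { assert(!lock_held); lock_held = true; }
    static void model_unlock(void) { assert(lock_held);  lock_held = false; }

    /* Caller must hold the lock; the result is only valid while it is held. */
    static struct map_entry **find_slot_locked(unsigned long long start,
                                               unsigned long long end)
    {
        assert(lock_held);
        for (struct map_entry **pp = &live; *pp; pp = &(*pp)->next)
            if ((*pp)->start == start && (*pp)->end == end)
                return pp;
        return NULL;
    }

    /* Find and unlink in one critical section; park bootmem entries. */
    static bool remove_entry(unsigned long long start, unsigned long long end)
    {
        struct map_entry **pp, *e;

        model_lock();
        pp = find_slot_locked(start, end);
        if (!pp) {
            model_unlock();
            return false;
        }
        e = *pp;
        *pp = e->next;
        if (e->from_bootmem) {
            e->next = parked;
            parked = e;
        }
        /* Other entries stay owned by the caller in this model. */
        model_unlock();
        return true;
    }

    /* On hot-add, move a parked entry with the same range back to the live list. */
    static struct map_entry *reuse_parked(unsigned long long start,
                                          unsigned long long end)
    {
        struct map_entry *found = NULL;

        model_lock();
        for (struct map_entry **pp = &parked; *pp; pp = &(*pp)->next) {
            if ((*pp)->start == start && (*pp)->end == end) {
                found = *pp;
                *pp = found->next;
                found->next = live;
                live = found;
                break;
            }
        }
        model_unlock();
        return found;
    }
    ```

    Because find_slot_locked() never drops the lock before the entry is
    unlinked, no other path can observe or free the entry between the lookup
    and the removal.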

    [tangchen@cn.fujitsu.com: Hold spinlock across find|remove /sys operation]
    [tangchen@cn.fujitsu.com: fix the wrong comments of map_entries]
    [tangchen@cn.fujitsu.com: reuse the storage of /sys/firmware/memmap/X/ allocated by bootmem]
    [tangchen@cn.fujitsu.com: fix section mismatch problem]
    [tangchen@cn.fujitsu.com: fix the doc format in drivers/firmware/memmap.c]
    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Tang Chen
    Reviewed-by: Kamezawa Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Lai Jiangshan
    Cc: Tang Chen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Julian Calaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • Offlining memory blocks and checking whether memory blocks are offlined
    are very similar. This patch introduces a new function to remove the
    redundant code.
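
    One way to share such code is a single walker that takes a callback, so
    that offlining and checking differ only in the function passed in. The
    following userspace sketch shows that shape. The struct and all names
    are hypothetical and do not claim to match the function the patch adds.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct mem_block { bool online; };

    typedef int (*block_fn)(struct mem_block *b, void *arg);

    /* Call fn on every block; stop at the first nonzero return value. */
    static int walk_blocks(struct mem_block *blocks, size_t n,
                           block_fn fn, void *arg)
    {
        assert(fn != NULL);
        for (size_t i = 0; i < n; i++) {
            int ret = fn(&blocks[i], arg);
            if (ret)
                return ret;
        }
        return 0;
    }

    /* Callback 1: take the block offline. */
    static int offline_cb(struct mem_block *b, void *arg)
    {
        (void)arg;
        b->online = false;
        return 0;
    }

    /* Callback 2: report an error if the block is still online. */
    static int is_offlined_cb(struct mem_block *b, void *arg)
    {
        (void)arg;
        return b->online ? -1 : 0;
    }
    ```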

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Reviewed-by: Kamezawa Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • We remove the memory like this:

    1. lock memory hotplug
    2. offline a memory block
    3. unlock memory hotplug
    4. repeat 1-3 to offline all memory blocks
    5. lock memory hotplug
    6. remove memory(TODO)
    7. unlock memory hotplug

    All memory blocks must be offlined before removing memory. But we don't
    hold the lock in the whole operation. So we should check whether all
    memory blocks are offlined before step 6. Otherwise, the kernel may
    panic.

    Offlining a memory block and removing a memory device can be two
    different operations. Users can just offline some memory blocks without
    removing the memory device. For this purpose, the kernel already takes
    lock_memory_hotplug() in __offline_pages(). To reuse the code for
    memory hot-remove, we repeat steps 1-3 to offline all the memory blocks,
    repeatedly locking and unlocking memory hotplug, rather than holding the
    memory hotplug lock for the whole operation.
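
    The sequence of steps can be sketched in userspace as follows. This is a
    model, not the kernel implementation: a counter stands in for the memory
    hotplug lock, the busy flag simulates an offline failure, and all names
    are hypothetical. The sketch deliberately ignores offline errors in the
    loop so that the recheck under the lock is what catches them.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct mem_block {
        bool online;
        bool busy;      /* model: offlining this block fails */
    };

    static int lock_depth;  /* models the memory hotplug lock */

    static void model_lock_hotplug(void)   { lock_depth++; }
    static void model_unlock_hotplug(void) { assert(lock_depth > 0); lock_depth--; }

    /* Steps 1-3: each block is offlined under its own lock/unlock pair. */
    static int offline_block(struct mem_block *b)
    {
        int ret = 0;

        model_lock_hotplug();
        if (b->busy)
            ret = -1;
        else
            b->online = false;
        model_unlock_hotplug();
        return ret;
    }

    static int remove_memory_model(struct mem_block *blocks, size_t n)
    {
        size_t i;

        /* Step 4: repeat steps 1-3 for every block. */
        for (i = 0; i < n; i++)
            (void)offline_block(&blocks[i]);

        model_lock_hotplug();                   /* step 5 */
        for (i = 0; i < n; i++) {
            if (blocks[i].online) {
                /* The lock was dropped in between, so recheck here. */
                model_unlock_hotplug();
                return -1;
            }
        }
        /* Step 6: the actual removal would happen here. */
        model_unlock_hotplug();                 /* step 7 */
        return 0;
    }
    ```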

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Tang Chen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • Memory can't be offlined when CONFIG_MEMCG is selected. For example:
    there is a memory device on node 1. The address range is [1G, 1.5G).
    You will find 4 new directories memory8, memory9, memory10, and memory11
    under the directory /sys/devices/system/memory/.

    If CONFIG_MEMCG is selected, we will allocate memory to store page
    cgroup when we online pages. When we online memory8, the memory storing
    its page cgroup is not provided by this memory device. But when we
    online memory9, the memory storing its page cgroup may be provided by
    memory8. So we can't offline memory8 now. We should offline the memory
    in the reversed order.

    When the memory device is hotremoved, we will auto offline memory
    provided by this memory device. But we don't know which memory is
    onlined first, so offlining memory may fail. In such a case, iterate
    twice to offline the memory. 1st iteration: offline every non-primary
    memory block. 2nd iteration: offline the primary (i.e. first added)
    memory block.
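
    A userspace sketch of the two-pass approach, assuming a single primary
    block that hosts the page cgroup data of the others. The host field, the
    failure rule and all names are hypothetical; the first pass here simply
    tolerates failures and the second pass retries, which is one possible
    way to realise the two iterations described above.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct mem_block {
        bool online;
        int host;   /* index of the block storing this block's data, or -1 */
    };

    /* A block cannot go offline while an online block still stores data in it. */
    static int try_offline(struct mem_block *blocks, size_t n, size_t i)
    {
        assert(i < n);
        if (!blocks[i].online)
            return 0;
        for (size_t j = 0; j < n; j++)
            if (j != i && blocks[j].online && blocks[j].host == (int)i)
                return -1;
        blocks[i].online = false;
        return 0;
    }

    /* Pass 1 tolerates failures; pass 2 retries and reports the first error. */
    static int offline_all(struct mem_block *blocks, size_t n)
    {
        bool failed = false;
        size_t i;

        for (i = 0; i < n; i++)
            if (try_offline(blocks, n, i))
                failed = true;
        if (!failed)
            return 0;

        for (i = 0; i < n; i++)
            if (try_offline(blocks, n, i))
                return -1;
        return 0;
    }
    ```

    With four blocks where block 0 hosts the data of blocks 1 to 3, the
    first pass offlines blocks 1 to 3 and fails on block 0, and the second
    pass then offlines block 0.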

    This idea was suggested by KOSAKI Motohiro.

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • Remove one redundant check of res.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
    multiple, however there was no equivalent code in mm_populate(), which
    caused issues.

    This could be fixed by introducing the same rounding in mm_populate(),
    however I think it's preferable to make do_mmap_pgoff() return populate
    as a size rather than as a boolean, so we don't have to duplicate the
    size rounding logic in mm_populate().
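
    The design choice can be illustrated with a small userspace model in
    which the rounding happens in exactly one place and the populate step
    receives an already rounded length. The function names, the 4 KiB page
    size and the return values are assumptions of this sketch, not the
    signatures of do_mmap_pgoff() or mm_populate().

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define MODEL_PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */

    /* Round a length up to the next multiple of the page size. */
    static unsigned long page_align_up(unsigned long len)
    {
        return (len + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
    }

    /*
     * Models the mapping step: rounds len once and reports, as a size,
     * how much should be populated (0 means "do not populate").
     * Returns the mapped length.
     */
    static unsigned long model_do_mmap(unsigned long len, bool want_populate,
                                       unsigned long *populate)
    {
        unsigned long rounded = page_align_up(len);

        assert(populate != NULL);
        *populate = want_populate ? rounded : 0;
        return rounded;
    }

    /* Models the populate step: uses the size as given, with no rounding. */
    static unsigned long model_populate_pages(unsigned long populate_len)
    {
        return populate_len / MODEL_PAGE_SIZE;
    }
    ```

    Had populate been a boolean, model_populate_pages() would need its own
    copy of page_align_up() to cover the final partial page.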

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse