15 Apr, 2015

1 commit

  • There's a deadlock when concurrently hot-adding memory through the probe
    interface and switching a memory block from offline to online.

    When hot-adding memory via the probe interface, add_memory() first takes
    mem_hotplug_begin() and then device_lock() is later taken when registering
    the newly initialized memory block. This creates a lock dependency of (1)
    mem_hotplug.lock (2) dev->mutex.

    When switching a memory block from offline to online, dev->mutex is first
    grabbed in device_online() when the write(2) transitions an existing
    memory block from offline to online, and then online_pages() will take
    mem_hotplug_begin().

    This creates a lock inversion between mem_hotplug.lock and dev->mutex.
    Vitaly reports that this deadlock can happen when kworker handling a probe
    event races with systemd-udevd switching a memory block's state.

    This patch requires the state transition to take mem_hotplug_begin()
    before dev->mutex. Hot-adding memory via the probe interface creates a
    memory block while holding mem_hotplug_begin(), so there is no way to
    take dev->mutex first in this case.

    online_pages() and offline_pages() are only called when transitioning
    memory block state. We now require that mem_hotplug_begin() is taken
    before calling them -- this requires exporting the mem_hotplug_begin() and
    mem_hotplug_done() to generic code. In all hot-add and hot-remove cases,
    mem_hotplug_begin() is done prior to device_online(). This is all that is
    needed to avoid the deadlock.
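
    The inversion described above can be illustrated with a tiny order
    checker. This is only a userspace sketch, not kernel code: the enum
    names, the seen[][] table and both functions are hypothetical. It
    records which lock was taken while another was held and reports when
    the opposite order shows up.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock identifiers, used only by this sketch. */
enum lock_id { MEM_HOTPLUG_LOCK, DEV_MUTEX, NR_LOCKS };

/* seen[a][b] is true once some path has taken lock b while holding a. */
static bool seen[NR_LOCKS][NR_LOCKS];

/* Forget all recorded orderings. */
static void reset_lock_orders(void)
{
	memset(seen, 0, sizeof(seen));
}

/*
 * Record that `inner` is acquired while `outer` is held.
 * Returns false if the opposite order was recorded earlier, which is
 * the AB-BA pattern that can deadlock two concurrent paths.
 */
static bool record_lock_order(enum lock_id outer, enum lock_id inner)
{
	if (seen[inner][outer])
		return false;
	seen[outer][inner] = true;
	return true;
}
```

    With the probe path recorded as (MEM_HOTPLUG_LOCK, DEV_MUTEX), the old
    online path (DEV_MUTEX, MEM_HOTPLUG_LOCK) is rejected, while the order
    required by this patch is accepted.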

    Signed-off-by: David Rientjes
    Reported-by: Vitaly Kuznetsov
    Tested-by: Vitaly Kuznetsov
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: "K. Y. Srinivasan"
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Cc: Vlastimil Babka
    Cc: Zhang Zhen
    Cc: Vladimir Davydov
    Cc: Wang Nan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

10 Oct, 2014

1 commit

  • Currently memory-hotplug has two limits:

    1. If the memory block is in ZONE_NORMAL, you can change it to
    ZONE_MOVABLE, but this memory block must be adjacent to ZONE_MOVABLE.

    2. If the memory block is in ZONE_MOVABLE, you can change it to
    ZONE_NORMAL, but this memory block must be adjacent to ZONE_NORMAL.

    With this patch, it is easy to tell which zone a memory block can be
    onlined to, without having to know the two limits above.
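
    The two limits can be written as adjacency checks. The sketch below is
    an illustration only: struct pfn_range and both helpers are
    hypothetical, and it assumes ZONE_MOVABLE sits directly above
    ZONE_NORMAL.

```c
#include <stdbool.h>

/* Half-open page frame range [start, end); hypothetical type. */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/* Limit 1: a ZONE_NORMAL block may move up only if it borders ZONE_MOVABLE. */
static bool can_shift_to_movable(const struct pfn_range *block,
				 const struct pfn_range *movable)
{
	return block->end == movable->start;
}

/* Limit 2: a ZONE_MOVABLE block may move down only if it borders ZONE_NORMAL. */
static bool can_shift_to_normal(const struct pfn_range *block,
				const struct pfn_range *normal)
{
	return block->start == normal->end;
}
```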

    Updated the related Documentation.

    [akpm@linux-foundation.org: use conventional comment layout]
    [akpm@linux-foundation.org: fix build with CONFIG_MEMORY_HOTREMOVE=n]
    [akpm@linux-foundation.org: remove unused local zone_prev]
    Signed-off-by: Zhang Zhen
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Naoya Horiguchi
    Cc: Wang Nan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Zhen
     

07 Aug, 2014

2 commits

  • This series of patches fixes a problem that occurs when memory is added
    in an unusual order. For example, on an x86_64 machine booted with
    "mem=400M" and with 2GiB of memory installed, the following commands
    cause a problem:

    # echo 0x40000000 > /sys/devices/system/memory/probe
    [ 28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff]
    # echo 0x48000000 > /sys/devices/system/memory/probe
    [ 28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff]
    # echo online_movable > /sys/devices/system/memory/memory9/state
    # echo 0x50000000 > /sys/devices/system/memory/probe
    [ 29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff]
    # echo 0x58000000 > /sys/devices/system/memory/probe
    [ 29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff]
    # echo online_movable > /sys/devices/system/memory/memory11/state
    # echo online> /sys/devices/system/memory/memory8/state
    # echo online> /sys/devices/system/memory/memory10/state
    # echo offline> /sys/devices/system/memory/memory9/state
    [ 30.558819] Offlined Pages 32768
    # free
                 total                used    free  shared  buffers  cached
    Mem:        780588   18014398509432020  830552       0        0   51180
    -/+ buffers/cache:   18014398509380840  881732
    Swap:            0                   0       0

    This is because the above commands probe higher memory after onlining a
    section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
    for systems without ZONE_HIGHMEM) to overlap ZONE_MOVABLE.

    After the second online_movable, the problem can be observed from
    zoneinfo:

    # cat /proc/zoneinfo
    ...
    Node 0, zone Movable
    pages free 65491
    min 250
    low 312
    high 375
    scanned 0
    spanned 18446744073709518848
    present 65536
    managed 65536
    ...
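
    The "spanned" value above equals 2^64 - 32768, which is what an
    unsigned end - start subtraction yields when the end lies 32768 pages
    below the start. The sketch below only demonstrates that arithmetic;
    the function name and the page frame numbers used with it are
    illustrative and are not taken from the kernel.

```c
#include <stdint.h>

/* Pages spanned by a zone, computed the naive way as end - start. */
static uint64_t spanned_pages(uint64_t zone_start_pfn, uint64_t zone_end_pfn)
{
	/* Unsigned subtraction wraps modulo 2^64 when end < start. */
	return zone_end_pfn - zone_start_pfn;
}
```

    For example, spanned_pages(0x50000, 0x48000) gives
    18446744073709518848, the value shown in the zoneinfo output.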

    This series of patches solves the problem by checking ZONE_MOVABLE when
    choosing a zone for new memory. If the new memory is inside or higher
    than ZONE_MOVABLE, it goes there instead.

    After applying this series of patches, free and zoneinfo report the
    following (after offlining memory9):

    bash-4.2# free
                 total       used       free     shared    buffers     cached
    Mem:        780956      80112     700844          0          0      51180
    -/+ buffers/cache:      28932     752024
    Swap:            0          0          0

    bash-4.2# cat /proc/zoneinfo

    Node 0, zone DMA
    pages free 3389
    min 14
    low 17
    high 21
    scanned 0
    spanned 4095
    present 3998
    managed 3977
    nr_free_pages 3389
    ...
    start_pfn: 1
    inactive_ratio: 1
    Node 0, zone DMA32
    pages free 73724
    min 341
    low 426
    high 511
    scanned 0
    spanned 98304
    present 98304
    managed 92958
    nr_free_pages 73724
    ...
    start_pfn: 4096
    inactive_ratio: 1
    Node 0, zone Normal
    pages free 32630
    min 120
    low 150
    high 180
    scanned 0
    spanned 32768
    present 32768
    managed 32768
    nr_free_pages 32630
    ...
    start_pfn: 262144
    inactive_ratio: 1
    Node 0, zone Movable
    pages free 65476
    min 241
    low 301
    high 361
    scanned 0
    spanned 98304
    present 65536
    managed 65536
    nr_free_pages 65476
    ...
    start_pfn: 294912
    inactive_ratio: 1

    This patch (of 7):

    Introduce zone_for_memory() in arch independent code for
    arch_add_memory() use.

    Many arch_add_memory() functions simply select ZONE_HIGHMEM or
    ZONE_NORMAL and add new memory into it. However, with the existence of
    ZONE_MOVABLE, the selection method should be carefully considered: if
    new, higher memory is added after ZONE_MOVABLE is set up, the default
    zone and ZONE_MOVABLE may overlap each other.

    should_add_memory_movable() checks the status of ZONE_MOVABLE. If it
    already contains memory, the address of the new memory is compared
    with that of the movable memory. If the new memory is higher than the
    movable memory, it should be added to ZONE_MOVABLE instead of the
    default zone.
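
    The check described above can be sketched in plain C. This is not the
    kernel implementation: struct zone_span, the enum and both functions
    are hypothetical stand-ins, and the rule follows the description
    (memory inside or above a non-empty ZONE_MOVABLE goes to ZONE_MOVABLE).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the zones involved in the decision. */
enum zone_choice { CHOICE_DEFAULT_ZONE, CHOICE_ZONE_MOVABLE };

/* Minimal description of ZONE_MOVABLE: where it starts and its size. */
struct zone_span {
	unsigned long start_pfn;
	unsigned long nr_pages;
};

/* True if new memory starting at new_start_pfn belongs in ZONE_MOVABLE. */
static bool should_add_movable(const struct zone_span *movable,
			       unsigned long new_start_pfn)
{
	/* An empty ZONE_MOVABLE cannot overlap anything. */
	if (movable->nr_pages == 0)
		return false;
	return new_start_pfn >= movable->start_pfn;
}

/* Pick the zone for new memory, falling back to the default zone. */
static enum zone_choice pick_zone(const struct zone_span *movable,
				  unsigned long new_start_pfn)
{
	return should_add_movable(movable, new_start_pfn) ?
		CHOICE_ZONE_MOVABLE : CHOICE_DEFAULT_ZONE;
}
```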

    Signed-off-by: Wang Nan
    Cc: Zhang Yanfei
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: "Mel Gorman"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Nan
     
  • In store_mem_state(), we have:

    ...
    else if (!strncmp(buf, "offline", min_t(int, count, 7)))
            online_type = -1;
    ...
    case -1:
            ret = device_offline(&mem->dev);
            break;
    ...

    Here, "offline" is hard coded as -1.

    This patch does the following renaming:

    ONLINE_KEEP -> MMOP_ONLINE_KEEP
    ONLINE_KERNEL -> MMOP_ONLINE_KERNEL
    ONLINE_MOVABLE -> MMOP_ONLINE_MOVABLE

    and introduces MMOP_OFFLINE = -1 to avoid hard coding.
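
    A userspace sketch of parsing the state string into named constants
    follows. Only MMOP_OFFLINE = -1 comes from the description above; the
    other numeric values, token_eq() and parse_mem_state() are assumptions
    made for illustration.

```c
#include <errno.h>
#include <string.h>

/* Only MMOP_OFFLINE = -1 is stated above; other values are assumed. */
enum {
	MMOP_OFFLINE = -1,
	MMOP_ONLINE_KEEP = 0,
	MMOP_ONLINE_KERNEL = 1,
	MMOP_ONLINE_MOVABLE = 2,
};

/* Compare buf with word, tolerating one trailing newline in buf. */
static int token_eq(const char *buf, const char *word)
{
	size_t n = strlen(word);

	if (strncmp(buf, word, n) != 0)
		return 0;
	return buf[n] == '\0' || (buf[n] == '\n' && buf[n + 1] == '\0');
}

/* Translate a state string into an MMOP_* value; -EINVAL if unknown. */
static int parse_mem_state(const char *buf, int *online_type)
{
	if (token_eq(buf, "online_kernel"))
		*online_type = MMOP_ONLINE_KERNEL;
	else if (token_eq(buf, "online_movable"))
		*online_type = MMOP_ONLINE_MOVABLE;
	else if (token_eq(buf, "online"))
		*online_type = MMOP_ONLINE_KEEP;
	else if (token_eq(buf, "offline"))
		*online_type = MMOP_OFFLINE;
	else
		return -EINVAL;
	return 0;
}
```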

    Signed-off-by: Tang Chen
    Cc: Hu Tao
    Cc: Greg Kroah-Hartman
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Gu Zheng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     

05 Jun, 2014

1 commit

  • kmem_cache_{create,destroy,shrink} need to get a stable value of
    cpu/node online mask, because they init/destroy/access per-cpu/node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of race as described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves for the purpose, but
    it's a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some limitations to locking order, which are not
    desirable, and can't be used just like get_online_cpus. That's why in
    patch 1 I substitute it with get/put_online_mems, which work exactly
    like get/put_online_cpus except they block not cpu, but memory hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
    myself, because it used an rw semaphore for get/put_online_mems,
    making them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of online nodes
    mask won't be able to proceed concurrently. Also, it imposes some
    strong locking ordering rules on it, which narrows down the set of its
    usage scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus, but for memory hotplug, i.e. executing code
    inside a get/put_online_mems section guarantees a stable value of
    online nodes, present pages, etc.
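
    The intended semantics can be modelled without threads. The sketch
    below is a single-threaded model, not the kernel implementation: all
    names are hypothetical, and where the real functions would sleep, the
    model simply returns false.

```c
#include <stdbool.h>

/* Single-threaded model of the reader/hotplug exclusion. */
struct mem_hotplug_model {
	int readers;		/* sections currently inside get/put */
	bool hotplug_active;	/* a hotplug operation is in progress */
};

/* Enter a read-side section; fails while hotplug is in progress. */
static bool model_get_online_mems(struct mem_hotplug_model *m)
{
	if (m->hotplug_active)
		return false;	/* the real code would wait here */
	m->readers++;
	return true;
}

/* Leave a read-side section. */
static void model_put_online_mems(struct mem_hotplug_model *m)
{
	m->readers--;
}

/* Start a hotplug operation; fails while any reader is inside. */
static bool model_hotplug_begin(struct mem_hotplug_model *m)
{
	if (m->readers > 0 || m->hotplug_active)
		return false;	/* the real code would wait here */
	m->hotplug_active = true;
	return true;
}

/* Finish a hotplug operation. */
static void model_hotplug_done(struct mem_hotplug_model *m)
{
	m->hotplug_active = false;
}
```

    Unlike a plain mutex, several readers may be inside their sections at
    the same time; only hotplug itself is excluded.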

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Nov, 2013

2 commits

  • The following functions

    - sparse_add_one_section()
    - kmalloc_section_memmap()
    - __kmalloc_section_memmap()
    - __kfree_section_memmap()

    are always invoked to operate on one memory section, so it is
    redundant to always pass a nr_pages parameter, which is the number of
    pages in one section. We can use the predefined macro
    PAGES_PER_SECTION directly instead of passing the parameter.

    Signed-off-by: Zhang Yanfei
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • cpu_up() has #ifdef CONFIG_MEMORY_HOTPLUG code blocks, which call
    mem_online_node() to put its node online if offlined and then call
    build_all_zonelists() to initialize the zone list.

    These steps are specific to memory hotplug, and should be managed in
    mm/memory_hotplug.c. lock_memory_hotplug() should also be held across
    all of these steps.

    For this reason, this patch replaces mem_online_node() with
    try_online_node(), which performs all of these steps with
    lock_memory_hotplug() held. try_online_node() is named after
    try_offline_node() as they have a similar purpose.

    There is no functional change in this patch.

    Signed-off-by: Toshi Kani
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

02 Jun, 2013

3 commits

  • Move the definitions of offline_pages() and remove_memory()
    for CONFIG_MEMORY_HOTREMOVE to memory_hotplug.h, where they belong,
    and make them static inline.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • Now that the memory offlining should be taken care of by the
    companion device offlining code in acpi_scan_hot_remove(), the
    ACPI memory hotplug driver doesn't need to offline it in
    remove_memory() any more. Moreover, since the return value of
    remove_memory() is not used, it's better to make it be a void
    function and trigger a BUG() if the memory scheduled for removal is
    not offline.

    Change the code in accordance with the above observations.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Toshi Kani

    Rafael J. Wysocki
     
  • Since offline_memory_block(mem) is functionally equivalent to
    device_offline(&mem->dev), make the only caller of the former use
    the latter instead and drop offline_memory_block() entirely.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Greg Kroah-Hartman
    Acked-by: Toshi Kani

    Rafael J. Wysocki
     

12 May, 2013

1 commit

  • During ACPI memory hotplug configuration bind memory blocks residing
    in modules removable through the standard ACPI mechanism to struct
    acpi_device objects associated with ACPI namespace objects
    representing those modules. Accordingly, unbind those memory blocks
    from the struct acpi_device objects when the memory modules in
    question are being removed.

    When "offline" operation for devices representing memory blocks is
    introduced, this will allow the ACPI core's device hot-remove code to
    use it to carry out remove_memory() for those memory blocks and check
    the results of that before it actually removes the modules holding
    them from the system.

    Since walk_memory_range() is used for accessing all memory blocks
    corresponding to a given ACPI namespace object, it is exported from
    memory_hotplug.c so that the code in acpi_memhotplug.c can use it.

    Signed-off-by: Rafael J. Wysocki
    Tested-by: Vasilis Liaskovitis
    Reviewed-by: Toshi Kani

    Rafael J. Wysocki
     

30 Apr, 2013

1 commit

  • __remove_pages() is only necessary for CONFIG_MEMORY_HOTREMOVE. PowerPC
    pseries will return -EOPNOTSUPP if unsupported.

    Adding an #ifdef causes several other functions it depends on to also
    become unnecessary, which saves in .text when disabled (it's disabled in
    most defconfigs besides powerpc, including x86). remove_memory_block()
    becomes static since it is not referenced outside of
    drivers/base/memory.c.

    Build tested on x86 and powerpc with CONFIG_MEMORY_HOTREMOVE both enabled
    and disabled.

    Signed-off-by: David Rientjes
    Acked-by: Toshi Kani
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Greg Kroah-Hartman
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

24 Feb, 2013

5 commits

  • try_offline_node() will be needed in the tristate
    drivers/acpi/processor_driver.c.

    The node will be offlined when all memory/cpus on the node have been
    hot-removed. So we need the function try_offline_node() in the
    cpu-hotplug path.

    If memory-hotplug is disabled and cpu-hotplug is enabled:

    1. no memory on the node
    we don't online the node, and the cpu's node is the nearest node.

    2. the node contains some memory
    the node has been onlined, and the cpu's node is still needed
    to migrate the sleeping tasks on the cpu to the same node.

    So we do nothing in try_offline_node() in this case.

    [rientjes@google.com: export the function try_offline_node() fix]
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Len Brown
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • Introduce a new function try_offline_node() to remove sysfs file of node
    when all memory sections of this node are removed. If some memory
    sections of this node are not removed, this function does nothing.

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • To remove a sparse-vmemmap memmap region that was allocated from
    bootmem, the region needs to be registered by get_page_bootmem(). So
    the patch searches the pages of the virtual mapping and registers them
    with get_page_bootmem().

    NOTE: register_page_bootmem_memmap() is not implemented for ia64,
    ppc, s390, and sparc. So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE
    and revert register_page_bootmem_info_node() when platform doesn't
    support it.

    It's implemented by adding a new Kconfig option named
    CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected
    by archs that fully support the memory-hotplug feature (currently only
    x86_64).

    Since we have 2 config options, MEMORY_HOTPLUG and MEMORY_HOTREMOVE,
    used for memory hot-add and hot-remove separately, and the code in
    register_page_bootmem_info_node() is only used for collecting
    information for hot-remove, place it under MEMORY_HOTREMOVE.

    page_isolation.c, selected by MEMORY_ISOLATION under MEMORY_HOTPLUG,
    is a similar case, so move it too.

    [mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE]
    [linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()]
    [mhocko@suse.cz: remove the arch specific functions without any implementation]
    [linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed]
    [rientjes@google.com: fix defined but not used warning]
    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Tang Chen
    Reviewed-by: Wu Jianguo
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Michal Hocko
    Signed-off-by: Lin Feng
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • For removing memory, we need to remove page tables, but how to do that
    depends on the architecture. So the patch introduces
    arch_remove_memory() for removing page tables. For now it only calls
    __remove_pages().

    Note: __remove_pages() is not implemented for some architectures
    (I don't know how to implement it for s390).

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • We remove the memory like this:

    1. lock memory hotplug
    2. offline a memory block
    3. unlock memory hotplug
    4. repeat 1-3 to offline all memory blocks
    5. lock memory hotplug
    6. remove memory(TODO)
    7. unlock memory hotplug

    All memory blocks must be offlined before removing memory. But we don't
    hold the lock for the whole operation. So we should check whether all
    memory blocks are offlined before step 6. Otherwise, the kernel may
    panic.

    Offlining a memory block and removing a memory device can be two
    different operations. Users can just offline some memory blocks without
    removing the memory device. For this purpose, the kernel holds
    lock_memory_hotplug() in __offline_pages(). To reuse the code for
    memory hot-remove, we repeat steps 1-3 to offline all the memory
    blocks, repeatedly locking and unlocking memory hotplug, rather than
    holding the memory hotplug lock for the whole operation.
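
    The sequence above, including the re-check before step 6, can be
    modelled in userspace. This is a sketch under assumptions: the lock is
    a plain flag, the block states are an array, and every name is
    hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum block_state { BLOCK_ONLINE, BLOCK_OFFLINE };

/* Stand-in for the memory hotplug lock; a flag is enough for the model. */
static bool hotplug_locked;

static void lock_hotplug(void)   { hotplug_locked = true; }
static void unlock_hotplug(void) { hotplug_locked = false; }

/* Steps 1-4: offline every block, taking the lock once per block. */
static void offline_all_blocks(enum block_state *blocks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		lock_hotplug();
		blocks[i] = BLOCK_OFFLINE;
		unlock_hotplug();
	}
}

/* Steps 5-7: re-check under the lock, since it was dropped in between. */
static int remove_device(const enum block_state *blocks, size_t n)
{
	int ret = 0;

	lock_hotplug();
	for (size_t i = 0; i < n; i++) {
		if (blocks[i] != BLOCK_OFFLINE) {
			ret = -EBUSY;	/* a block was onlined again */
			break;
		}
	}
	/* Step 6, the actual removal, would run here when ret == 0. */
	unlock_hotplug();
	return ret;
}
```

    If a block is onlined again in the window between step 4 and step 5,
    remove_device() returns -EBUSY instead of removing memory that is
    still in use.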

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Tang Chen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     

12 Dec, 2012

1 commit

  • Add online_movable and online_kernel for logical memory hotplug. This is
    the dynamic version of "movablecore" & "kernelcore".

    The motive for introducing it is the same as for "movablecore" &
    "kernelcore", except that it is dynamic and applies at run time:

    o We can configure memory as kernelcore or movablecore after boot.

    When the userspace workload increases and we need more hugepages, we
    can use "online_movable" to add memory and allow the system to use
    more THP (transparent huge pages), and vice versa when the kernel
    workload increases.

    It also helps virtualization to configure host/guest memory
    dynamically, to save memory (reduce waste).

    Memory capacity on Demand

    o When a new node is physically onlined after boot, we need to use
    "online_movable" or "online_kernel" to configure/partition it as
    expected when we logically online it.

    This configuration also helps with physical memory migration.

    o All the same benefits as the existing "movablecore" & "kernelcore".

    o Preparing for movable-node, which is very important for power saving,
    hardware partitioning and high-availability systems (hardware fault
    management).

    (Note, we don't introduce movable-node here.)

    Action behavior:
    When a memoryblock/memorysection is onlined by "online_movable", the
    kernel will not hold direct references to the pages of the memoryblock,
    thus we can remove that memory any time when needed.

    When it is onlined by "online_kernel", the kernel can use it.
    When it is onlined by "online", the zone type doesn't change.

    Current constraints:
    Only a memoryblock which is adjacent to ZONE_MOVABLE
    can be onlined from ZONE_NORMAL to ZONE_MOVABLE.

    [akpm@linux-foundation.org: use min_t, cleanups]
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Yinghai Lu
    Cc: Rusty Russell
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

09 Oct, 2012

2 commits

  • remove_memory() will be called when hot-removing a memory device. But
    when it offlines memory, userspace cannot notice it. So the patch
    updates the memory block's state and sends a notification to userspace.

    Additionally, the memory device may contain more than one memory block.
    If a memory block has already been offlined, __offline_pages() will
    fail. So we should try to offline one memory block at a time.

    Thus remove_memory() also checks each memory block's state, so there is
    no need to check the memory block's state before calling
    remove_memory().

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Len Brown
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • remove_memory() is called in two cases:
    1. echo offline >/sys/devices/system/memory/memoryXX/state
    2. hot remove a memory device

    In the 1st case, the memory block's state is changed and the
    notification that the memory block's state changed is sent to userland
    after calling remove_memory(). So the user can notice that the memory
    block has changed.

    But in the 2nd case, the memory block's state is not changed and the
    notification is not sent to userspace even though remove_memory() is
    called. So the user cannot notice that the memory block has changed.

    To prepare for adding the notification at memory hot remove, the patch
    makes the following changes:
    1st case uses offline_pages() for offlining memory.
    2nd case uses remove_memory() for offlining memory, changing the memory
    block's state and sending the notification.

    The patch does not implement the notification in remove_memory().

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Len Brown
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     

05 Mar, 2012

1 commit

  • If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
    other BUG variant in a static inline (i.e. not in a #define) then
    that header really should be including <linux/bug.h> and not just
    expecting it to be implicitly present.

    We can make this change risk-free, since if the files using these
    headers didn't have exposure to linux/bug.h already, they would have
    been causing compile failures/warnings.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

26 Jul, 2011

1 commit

  • This patch contains online_page_callback and appropriate functions for
    registering/unregistering online page callbacks. It allows
    machine-specific tasks to be done during the online page stage, which
    is required to implement memory hotplug in virtual machines. Currently
    this patch is required by the latest memory hotplug support for Xen
    balloon driver patch, which will be posted soon.

    Additionally, the original online_page() function was split into the
    following functions doing "atomic" operations:

    - __online_page_set_limits() - set new limits for memory management code,
    - __online_page_increment_counters() - increment totalram_pages and totalhigh_pages,
    - __online_page_free() - free page to allocator.

    It was done to:
    - not duplicate existing code,
    - ease hotplug code development by usage of a well-defined interface,
    - avoid stupid bugs which are unavoidable when the same code
    (by design) is developed in many places.
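
    A userspace sketch of such a callback hook follows. It is not the
    kernel interface: struct page_sketch, the function names and the
    -EINVAL convention are assumptions for illustration. It shows a
    default handler, a machine-specific replacement, and registration that
    refuses to overwrite a callback somebody else has installed.

```c
#include <errno.h>

/* Hypothetical stand-in for struct page. */
struct page_sketch {
	int freed;	/* handed to the allocator */
	int ballooned;	/* kept back by a machine-specific driver */
};

typedef void (*online_page_cb)(struct page_sketch *page);

/* Default behaviour: free the page to the allocator. */
static void generic_online_page_sketch(struct page_sketch *page)
{
	page->freed = 1;
}

/* Example machine-specific behaviour: keep the page in a balloon. */
static void balloon_online_page_sketch(struct page_sketch *page)
{
	page->ballooned = 1;
}

static online_page_cb online_page_callback = generic_online_page_sketch;

/* Install cb; fails if a non-default callback is already installed. */
static int set_online_page_cb(online_page_cb cb)
{
	if (online_page_callback != generic_online_page_sketch)
		return -EINVAL;
	online_page_callback = cb;
	return 0;
}

/* Restore the default; only the current owner may do so. */
static int restore_online_page_cb(online_page_cb cb)
{
	if (online_page_callback != cb)
		return -EINVAL;
	online_page_callback = generic_online_page_sketch;
	return 0;
}

/* Online one page through whichever callback is installed. */
static void online_one_page(struct page_sketch *page)
{
	(*online_page_callback)(page);	/* explicit indirect-call syntax */
}
```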

    [akpm@linux-foundation.org: use explicit indirect-call syntax]
    Signed-off-by: Daniel Kiper
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Ian Campbell
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Kiper
     

15 Jan, 2011

1 commit


14 Jan, 2011

1 commit

  • PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can
    be added to page->flags without overflowing (because of the sparse section
    bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also
    has to move the memory hotplug code from _mapcount to lru.next to avoid
    any risk of clashes. We can't use lru.next for PG_buddy removal, but
    memory hotplug can use lru.next even more easily than the mapcount
    instead.
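
    The idea of encoding the buddy state in the mapcount field instead of
    a page flag can be sketched as follows. The struct and helper names
    are hypothetical; the value -2 comes from the description above, and
    the sketch assumes -1 is the field's idle value.

```c
#include <stdbool.h>

/* Value that marks a free page owned by the buddy allocator. */
#define BUDDY_MAPCOUNT_VALUE (-2)

/* Hypothetical model of the relevant part of struct page. */
struct page_model {
	int mapcount;	/* assumed: -1 when idle, -2 when in the buddy system */
};

/* Test, set and clear the buddy state without using a page flag bit. */
static bool page_is_buddy(const struct page_model *p)
{
	return p->mapcount == BUDDY_MAPCOUNT_VALUE;
}

static void set_page_buddy(struct page_model *p)
{
	p->mapcount = BUDDY_MAPCOUNT_VALUE;
}

static void clear_page_buddy(struct page_model *p)
{
	p->mapcount = -1;
}
```

    Because the state lives in a field that is otherwise unused for free
    pages, one bit of page->flags becomes available for other uses.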

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

11 Jan, 2011

1 commit

  • Now, memory_hotplug_(un)lock() is used for add/remove/offline pages
    to avoid races with hibernation. But it should be held in
    online_pages(), too; otherwise the locking is asymmetric.

    There are cases where one has to avoid a race between a memory hotplug
    notifier and one's own local code, and between hotplug and hotplug.
    This adds a generic solution for avoiding such races. Seen another
    way, taking the lock here has no big impact: onlining pages tends to
    be done by udev scripts et al., one memory section at a time.

    So it's better to take the lock here, too.

    Cc: # 2.6.37
    Reviewed-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Pekka Enberg

    KAMEZAWA Hiroyuki
     

03 Dec, 2010

1 commit

  • Presently hwpoison is using lock_system_sleep() to prevent a race with
    memory hotplug. However lock_system_sleep() is a no-op if
    CONFIG_HIBERNATION=n. Therefore we need a new lock.

    Signed-off-by: KOSAKI Motohiro
    Cc: Andi Kleen
    Cc: Kamezawa Hiroyuki
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

27 Oct, 2010

1 commit

  • Now, the sysfs interface of memory hotplug shows whether the section is
    removable or not. But it checks only the migratetype of pages and
    doesn't check the details of the cluster of pages.

    Next, memory hotplug's set_migratetype_isolate() has the same kind of
    check, too.

    This patch adds the function __count_unmovable_pages() and makes the
    above 2 checks use the same logic. Then, is_removable and the hotremove
    code use the same logic. No changes in the hotremove logic itself.

    TODO: need to find a way to check RECLAIMABLE. But, thinking about it
    a bit, calling shrink_slab() against a range before starting memory
    hotremove sounds better. If so, this patch's logic doesn't need to be
    changed.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Michal Hocko
    Cc: Wu Fengguang
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

25 May, 2010

1 commit

  • Enable users to online CPUs even if the CPUs belong to a numa node which
    doesn't have onlined local memory.

    The zonelists (pg_data_t.node_zonelists[]) of a numa node are created
    either during system boot/init, or at the time local memory is onlined.
    For a numa node without onlined local memory, its zonelists are not
    initialized at present. As a result, any memory allocation operations
    executed by CPUs within this node will fail. In fact, an out-of-memory
    error is triggered when attempting to online CPUs before memory comes
    online.

    This patch tries to create zonelists for such numa nodes, so that
    memory allocation for this node can fall back to other nodes.

    [akpm@linux-foundation.org: remove unneeded export]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: minskey guo
    Cc: Minchan Kim
    Cc: Yasunori Goto
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    minskey guo
     

16 Dec, 2009

1 commit


23 Sep, 2009

1 commit

  • Originally, walk_memory_resource() was introduced to traverse all memory
    of "System RAM" for detecting memory hotplug/unplug ranges. For doing
    so, the flags IORESOURCE_MEM|IORESOURCE_BUSY were used and this was
    enough for memory hotplug.

    But when used for another purpose, /proc/kcore, this may include some
    firmware areas marked as IORESOURCE_BUSY | IORESOURCE_MEM. This patch
    makes the check strict to find busy "System RAM".

    Note: PPC64 keeps its own walk_memory_resource(), which walks through
    ppc64's lmb information. Because the old kclist_add() is called per
    lmb, this patch ultimately makes no difference in behavior.

    And this patch removes the CONFIG_MEMORY_HOTPLUG check from this
    function. Because pfn_valid() just shows "there is memmap or not" and
    cannot be used for "there is physical memory or not", this function is
    generally useful for scanning physical memory ranges.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Cc: Américo Wang
    Cc: David Rientjes
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

1 commit

  • Show node to memory section relationship with symlinks in sysfs

    Add /sys/devices/system/node/nodeX/memoryY symlinks for all
    the memory sections located on nodeX. For example:
    /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
    indicates that memory section 135 resides on node1.
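
    A small userspace sketch of how a tool might build the path of such a
    symlink from a node and section number. The helper name
    toy_node_memory_link() is hypothetical; only the path layout is taken
    from the example above.

```c
#include <stdio.h>

/*
 * Write the node-side symlink path for (node, section) into buf.
 * Returns 0 on success, -1 if the buffer is too small or formatting fails.
 */
static int toy_node_memory_link(char *buf, size_t len,
                                unsigned int node, unsigned int section)
{
    int n = snprintf(buf, len,
                     "/sys/devices/system/node/node%u/memory%u",
                     node, section);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

    Calling it with node 1 and section 135 yields the path used in the
    example above.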

    Also revises documentation to cover this change as well as updating
    Documentation/ABI/testing/sysfs-devices-memory to include descriptions
    of memory hotremove files 'phys_device', 'phys_index', and 'state'
    that were previously not described there.

    In addition to it always being a good policy to provide users with
    the maximum possible amount of physical location information for
    resources that can be hot-added and/or hot-removed, the following
    are some (but likely not all) of the user benefits provided by
    this change.
    Immediate:
    - Provides information needed to determine the specific node
      on which a defective DIMM is located. This will reduce system
      downtime when the node or defective DIMM is swapped out.
    - Prevents unintended onlining of a memory section that was
      previously offlined due to a defective DIMM. This could happen
      during node hot-add when the user or node hot-add assist script
      onlines _all_ offlined sections due to user or script inability
      to identify the specific memory sections located on the hot-added
      node. The consequences of reintroducing the defective memory
      could be ugly.
    - Provides information needed to vary the amount and distribution
      of memory on specific nodes for testing or debugging purposes.
    Future:
    - Will provide information needed to identify the memory
      sections that need to be offlined prior to physical removal
      of a specific node.

    Symlink creation during boot was tested on 2-node x86_64, 2-node
    ppc64, and 2-node ia64 systems. Symlink creation during physical
    memory hot-add tested on a 2-node x86_64 system.

    Signed-off-by: Gary Hade
    Signed-off-by: Badari Pulavarty
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gary Hade
     

25 Jul, 2008

2 commits

  • Memory may be hot-removed on a per-memory-block basis, particularly on
    POWER where the SPARSEMEM section size often matches the memory-block
    size. A user-level agent must be able to identify which sections of
    memory are likely to be removable before attempting the potentially
    expensive operation. This patch adds a file called "removable" to the
    memory directory in sysfs to help such an agent. In this patch, a memory
    block is considered removable if:

    o It contains only MOVABLE pageblocks
    o It contains only pageblocks with free pages regardless of pageblock type

    On the other hand, a memory block starting with a PageReserved() page will
    never be considered removable. Without this patch, the user-agent is
    forced to choose at random which memory block to remove.
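
    The criteria can be modelled roughly as follows. This is a userspace
    sketch, not the patch's code: all toy_* names are hypothetical, and it
    assumes one particular reading of the two conditions, namely that every
    pageblock in the memory block must be either MOVABLE or entirely free.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_migratetype { TOY_UNMOVABLE, TOY_RECLAIMABLE, TOY_MOVABLE };

/* Hypothetical summary of one pageblock. */
struct toy_pageblock {
    enum toy_migratetype type;
    bool all_free;            /* every page in the pageblock is free */
};

/* Hypothetical summary of one memory block. */
struct toy_memory_block {
    bool starts_reserved;     /* first page is a reserved page */
    const struct toy_pageblock *blocks;
    size_t nr_blocks;
};

/* Returns true if the block looks removable under the assumed reading. */
static bool toy_block_removable(const struct toy_memory_block *mb)
{
    if (mb->starts_reserved)
        return false;

    for (size_t i = 0; i < mb->nr_blocks; i++) {
        const struct toy_pageblock *pb = &mb->blocks[i];

        /* A pageblock passes if it is MOVABLE or completely free. */
        if (pb->type != TOY_MOVABLE && !pb->all_free)
            return false;
    }
    return true;
}
```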

    Sample output of the sysfs files:

    ./memory/memory0/removable: 0
    ./memory/memory1/removable: 0
    ./memory/memory2/removable: 0
    ./memory/memory3/removable: 0
    ./memory/memory4/removable: 0
    ./memory/memory5/removable: 0
    ./memory/memory6/removable: 0
    ./memory/memory7/removable: 1
    ./memory/memory8/removable: 0
    ./memory/memory9/removable: 0
    ./memory/memory10/removable: 0
    ./memory/memory11/removable: 0
    ./memory/memory12/removable: 0
    ./memory/memory13/removable: 0
    ./memory/memory14/removable: 0
    ./memory/memory15/removable: 0
    ./memory/memory16/removable: 0
    ./memory/memory17/removable: 1
    ./memory/memory18/removable: 1
    ./memory/memory19/removable: 1
    ./memory/memory20/removable: 1
    ./memory/memory21/removable: 1
    ./memory/memory22/removable: 1
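
    A user-level agent could pick candidate blocks from a listing like the
    one above. The sketch below parses one line in exactly that
    "path: value" form; the function name is hypothetical, and the line
    format is assumed from the sample listing, not from the content of the
    sysfs file itself.

```c
#include <stdio.h>

/*
 * Parse a line such as "./memory/memory7/removable: 1".
 * On success stores the block number and the 0/1 flag and returns 0;
 * returns -1 if the line does not match the expected form.
 */
static int toy_parse_removable_line(const char *line,
                                    unsigned int *block, int *removable)
{
    unsigned int b;
    int r;

    if (sscanf(line, "./memory/memory%u/removable: %d", &b, &r) != 2)
        return -1;
    if (r != 0 && r != 1)
        return -1;

    *block = b;
    *removable = r;
    return 0;
}
```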

    Signed-off-by: Badari Pulavarty
    Signed-off-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • - Change some naming
    * Magic -> types
    * MIX_INFO -> MIX_SECTION_INFO
    * Change definition of bootmem type from direct hex value

    - __free_pages_bootmem() becomes __meminit.

    Signed-off-by: Yasunori Goto
    Cc: Andy Whitcroft
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     

09 Jun, 2008

1 commit

  • The ehea driver was recently changed[1] to use walk_memory_resource() to
    detect the system's memory layout. However, walk_memory_resource() is
    available only when memory hotplug is enabled. So CONFIG_EHEA was
    made to depend on MEMORY_HOTPLUG [2], but it is inappropriate for a
    network driver to have such a dependency.

    Make the declaration of walk_memory_resource() and its powerpc
    implementation (ehea is powerpc-specific) unconditionally available.

    [1] 48cfb14f8b89d4d5b3df6c16f08b258686fb12ad
    "ehea: Add DLPAR memory remove support"

    [2] fb7b6ca2b6b7c23b52be143bdd5f55a23b9780c8
    "ehea: Add dependency to Kconfig"

    Signed-off-by: Nathan Lynch
    Acked-by: Badari Pulavarty
    Signed-off-by: Paul Mackerras

    Nathan Lynch
     

28 Apr, 2008

2 commits

  • This patch set frees pages that were allocated by bootmem, for the
    benefit of memory hot-remove. Some memory management structures are
    allocated by bootmem, e.g. memmap.

    To remove memory physically, some of them must be freed, depending on
    the circumstances. This patch set lays the groundwork for freeing those
    pages, and frees memmaps.

    My basic idea is to use the remaining members of struct page to remember
    information about the users of bootmem (section number or node id). When
    a section is being removed, the kernel can check it. With this
    information, several issues can be solved.

    1) When the memmap of the section being removed was allocated by
    bootmem on another section, it should (and can) be freed.
    2) When the memmap of the section being removed was allocated on the
    same section, it shouldn't be freed, because the section must already
    be logically offlined and all of its pages must be isolated from the
    page allocator. If it were freed, the page allocator might use memory
    that will soon be removed physically.
    3) When the section being removed holds another section's memmap, the
    kernel will easily be able to show the user which section should be
    removed before it. (Not implemented yet.)
    4) In case 2) above, page isolation will be able to check for and skip
    the memmap's pages during logical memory offline (offline_pages()).
    The current page isolation code fails in this case because such a page
    is just a reserved page, and the code can't tell whether the page can
    be removed or not. With this patch, it will be able to.
    (Not implemented yet.)
    5) Node information such as pgdat has similar issues, which can be
    solved in the same way.
    (Not implemented yet, but the node id is remembered in the pages.)

    Fortunately, the current bootmem allocator just keeps the PageReserved
    flag and doesn't use any other members of struct page. The users of
    bootmem don't use them either.

    This patch:

    This registers information, namely the node or section id. The kernel
    can then distinguish which node/section uses the pages allocated by
    bootmem. This is the basis for hot-removing sections or nodes.
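
    A toy model of the idea, covering cases 1) and 2) above. Everything
    here is a hypothetical userspace sketch: the struct, the enum and the
    function names are invented for illustration and do not reflect how the
    patch actually stores the information in struct page.

```c
#include <stdbool.h>

enum toy_bootmem_type { TOY_NONE, TOY_SECTION_INFO, TOY_NODE_INFO };

/* Hypothetical stand-in for the few page fields the idea needs. */
struct toy_page {
    unsigned long home_section;   /* section the page physically lives in */
    enum toy_bootmem_type type;   /* kind of bootmem user owning the page */
    unsigned long info;           /* section number or node id of that user */
};

/* Remember which section or node uses this bootmem-allocated page. */
static void toy_register_bootmem_info(struct toy_page *page,
                                      enum toy_bootmem_type type,
                                      unsigned long info)
{
    page->type = type;
    page->info = info;
}

/*
 * Cases 1) and 2): a memmap page belonging to the section being removed
 * may be freed only if it physically lives in a different section.
 */
static bool toy_memmap_page_can_be_freed(const struct toy_page *page,
                                         unsigned long removing_section)
{
    if (page->type != TOY_SECTION_INFO || page->info != removing_section)
        return false;   /* not the memmap of the section being removed */

    return page->home_section != removing_section;
}
```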

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • Generic helper function to remove section mappings and sysfs entries for the
    section of the memory we are removing. offline_pages() correctly adjusted
    zone and marked the pages reserved.

    TODO: Yasunori Goto is working on patches to free up allocations from bootmem.

    Signed-off-by: Badari Pulavarty
    Acked-by: Yasunori Goto
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     

17 Oct, 2007

4 commits

  • Currently, the arch-dependent code around CONFIG_MEMORY_HOTREMOVE is a
    mess. This patch cleans it up. This is against 2.6.23-rc6-mm1.

    - fix compile failure on ia64 in the CONFIG_MEMORY_HOTPLUG && !CONFIG_MEMORY_HOTREMOVE case.
    - For !CONFIG_MEMORY_HOTREMOVE, add a generic no-op remove_memory(),
    which returns -EINVAL.
    - remove remove_pages(), which was only used in powerpc.
    - remove the no-op remove_memory() in i386, sh, sparc64, x86_64.

    - only powerpc returned -ENOSYS for (no-op) memory hot-remove; change it
    to return -EINVAL.
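
    The generic no-op pattern can be sketched like this. The macro
    TOY_MEMORY_HOTREMOVE and the function toy_remove_memory() are
    hypothetical stand-ins for the kernel option and remove_memory(); the
    argument types are an assumption as well.

```c
#include <errno.h>

/* TOY_MEMORY_HOTREMOVE stands in for the kernel configuration option. */
#ifdef TOY_MEMORY_HOTREMOVE
/* A real implementation would be provided elsewhere. */
int toy_remove_memory(unsigned long long start, unsigned long long size);
#else
/* Generic fallback: hot-remove is not supported, so always fail. */
static inline int toy_remove_memory(unsigned long long start,
                                    unsigned long long size)
{
    (void)start;
    (void)size;
    return -EINVAL;
}
#endif
```

    Callers can then use the function unconditionally and only need to
    handle the error return, instead of each architecture carrying its own
    stub.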

    Note:
    Currently, only ia64 supports CONFIG_MEMORY_HOTREMOVE. I welcome other
    archs if there are requirements and testers.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Logic:
    - set all pages in [start,end) to the isolated migration type.
    With this, all free pages in the range become unavailable for use.
    - Migrate all LRU pages in the range.
    - Test whether the refcnt of every page in the range is zero.
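
    The three steps can be modelled with a toy userspace sketch. All names
    are hypothetical, and migration is assumed to always succeed and to
    leave the old page unreferenced, which is a simplification of what real
    page migration does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state for the model. */
struct toy_hp_page {
    bool isolated;   /* withheld from the allocator */
    bool on_lru;     /* in use, but migratable */
    int refcnt;      /* remaining references */
};

/* Returns 0 if [start, end) can be offlined, -1 if some page is busy. */
static int toy_offline_range(struct toy_hp_page *pages,
                             size_t start, size_t end)
{
    /* Step 1: isolate the range so its free pages are not handed out. */
    for (size_t i = start; i < end; i++)
        pages[i].isolated = true;

    /* Step 2: migrate LRU pages; the old page ends up unreferenced. */
    for (size_t i = start; i < end; i++) {
        if (pages[i].on_lru) {
            pages[i].on_lru = false;
            pages[i].refcnt = 0;
        }
    }

    /* Step 3: every page must be unreferenced; otherwise undo and fail. */
    for (size_t i = start; i < end; i++) {
        if (pages[i].refcnt != 0) {
            for (size_t j = start; j < end; j++)
                pages[j].isolated = false;
            return -1;
        }
    }
    return 0;
}
```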

    Todo:
    - allocate migration destination pages from a better area.
    - confirm that a page with page_count(page) == 0 && PageReserved(page) is safe to be freed.
    (I don't like this kind of page, but...)
    - Find out which pages cannot be migrated.
    - more running tests.
    - Use reclaim for unplugging other memory type areas.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • A cleanup patch for the "scanning memory resource [start, end)" operation.

    Currently, the find_next_system_ram() function is used in memory
    hotplug, but this interface is not easy to use and the resulting code is
    complicated.

    This patch adds the walk_memory_resource(start,len,arg,func) function.
    The function 'func' is called for each valid memory resource range in
    [start, start+len).
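
    The callback-style interface can be illustrated with a userspace model.
    The table of RAM ranges, the toy_* names and the choice to stop at the
    first non-zero callback return are assumptions made for this sketch; it
    is not the kernel implementation.

```c
#include <stddef.h>

/* Hypothetical half-open RAM range [start, end). */
struct toy_range {
    unsigned long start;
    unsigned long end;
};

typedef int (*toy_walk_fn)(unsigned long start, unsigned long len, void *arg);

/*
 * Call func for every part of [start, start + len) that overlaps a RAM
 * range in the table. Stops and returns the first non-zero result.
 */
static int toy_walk_memory_resource(const struct toy_range *ram, size_t nr,
                                    unsigned long start, unsigned long len,
                                    void *arg, toy_walk_fn func)
{
    unsigned long end = start + len;

    for (size_t i = 0; i < nr; i++) {
        unsigned long s = ram[i].start > start ? ram[i].start : start;
        unsigned long e = ram[i].end < end ? ram[i].end : end;
        int ret;

        if (s >= e)
            continue;   /* no overlap with this RAM range */

        ret = func(s, e - s, arg);
        if (ret)
            return ret;
    }
    return 0;
}

/* Example callback: add up the length of all valid ranges seen. */
static int toy_count_len(unsigned long start, unsigned long len, void *arg)
{
    (void)start;
    *(unsigned long *)arg += len;
    return 0;
}
```

    A caller passes its own state through 'arg', so the range-scanning loop
    does not have to be repeated at every call site.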

    [pbadari@us.ibm.com: Error handling in walk_memory_resource()]
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch cleans up duplicate includes in
    include/linux/memory_hotplug.h

    Signed-off-by: Jesper Juhl
    Acked-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl