02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.

    This patch is based on work done by Thomas Gleixner, Kate Stewart, and
    Philippe Ombredanne.
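    For illustration only (not part of the patch): the identifier is a
    single machine-readable comment at the top of each file, so a
    compliance tool only has to read that line rather than parse full
    boilerplate. A minimal sketch, with a hypothetical spdx_license()
    helper that falls back to the kernel default:

```c
/* SPDX-License-Identifier: GPL-2.0 */
#include <assert.h>
#include <string.h>

/* Return the SPDX identifier found on a file's first line, or the
 * kernel's default license when no tag is present. */
const char *spdx_license(const char *first_line)
{
	static const char tag[] = "SPDX-License-Identifier: ";
	const char *p = strstr(first_line, tag);

	/* files with no tag are implicitly under the kernel default */
	return p ? p + sizeof(tag) - 1 : "GPL-2.0";
}
```

    This mirrors the rationale above: an untagged file and a file tagged
    'GPL-2.0' resolve to the same license, but only the tagged one is
    explicit for scanners.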

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and where references to a
    license had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to apply to a
    file was done in a spreadsheet of side-by-side results from the output
    of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis, with 60,537
    files assessed. Kate Stewart did a file-by-file comparison of the
    scanner results in the spreadsheet to determine which SPDX license
    identifier(s) should be applied to each file. She confirmed any
    determination that was not immediately clear with lawyers working with
    the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Sep, 2017

1 commit

  • There are new users of memory hotplug emerging. Some of them require
    a different subset of arch_add_memory. There are some which only require
    allocation of struct pages without mapping those pages to the kernel
    address space. We currently have __add_pages for that purpose. But this
    is rather low-level and not very suitable for code outside of memory
    hotplug. E.g. x86_64 wants to update max_pfn, which should be done by
    the caller. Introduce add_pages(), which takes care of those details if
    they are needed. Each architecture should define its own implementation
    and select CONFIG_ARCH_HAS_ADD_PAGES. All others use the currently
    existing __add_pages.
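    A rough sketch of the intended split, using stand-in names (the
    max_pfn handling models the x86_64 case described above;
    __add_pages_demo is a stub, not the kernel function):

```c
#include <assert.h>

unsigned long max_pfn;	/* stand-in for the x86_64 global */

/* stand-in for the low-level __add_pages(): allocates struct pages only */
int __add_pages_demo(unsigned long start_pfn, unsigned long nr_pages)
{
	(void)start_pfn;
	(void)nr_pages;
	return 0;
}

/* add_pages(): the higher-level entry that also performs the arch
 * bookkeeping (here, the max_pfn update) on behalf of the caller */
int add_pages_demo(unsigned long start_pfn, unsigned long nr_pages)
{
	int ret = __add_pages_demo(start_pfn, nr_pages);

	if (!ret && start_pfn + nr_pages > max_pfn)
		max_pfn = start_pfn + nr_pages;
	return ret;
}
```

    Callers outside memory hotplug then use the wrapper and never touch
    the arch detail directly; architectures without special needs keep
    using the low-level path.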

    Link: http://lkml.kernel.org/r/20170817000548.32038-7-jglisse@redhat.com
    Signed-off-by: Michal Hocko
    Signed-off-by: Jérôme Glisse
    Acked-by: Balbir Singh
    Cc: Aneesh Kumar
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Sep, 2017

1 commit

  • Prior to commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate
    hotadded memory to zones until online") we used to allow changing the
    valid zone types of a memory block if it is adjacent to a different zone
    type.

    This fact was reflected in memoryNN/valid_zones by the ordering of
    printed zones. The first one was default (echo online > memoryNN/state)
    and the other one could be onlined explicitly by online_{movable,kernel}.

    This behavior was removed by the said patch and as such the ordering was
    not all that important. In most cases a kernel zone would be default
    anyway. The only exception is movable_node handled by "mm,
    memory_hotplug: support movable_node for hotpluggable nodes".

    Let's reintroduce this behavior because a later patch will remove the
    zone overlap restriction, and so the user will be allowed to online a
    kernel or movable block regardless of its placement. The original
    behavior will then become significant again, because it would be
    non-trivial for users to see which zone is the default to online into.

    The implementation is really simple. Pull zone selection out of
    move_pfn_range into a zone_for_pfn_range helper and use it in
    show_valid_zones to display the zone for default onlining, followed by
    both kernel and movable zones if they are allowed. The default online
    zone is not duplicated.
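    The shape of that refactoring can be sketched with modeled types (the
    real helper takes pfn-range arguments; all names below are
    illustrative stand-ins, not kernel definitions):

```c
#include <assert.h>

enum demo_zone { DEMO_ZONE_NORMAL, DEMO_ZONE_MOVABLE };
enum demo_online { DEMO_ONLINE_KEEP, DEMO_ONLINE_KERNEL, DEMO_ONLINE_MOVABLE };

/* one place decides the zone: show_valid_zones() can call it to print
 * the default first, and move_pfn_range() to perform the actual move */
enum demo_zone zone_for_pfn_range_demo(enum demo_online type, int movable_node)
{
	if (type == DEMO_ONLINE_MOVABLE)
		return DEMO_ZONE_MOVABLE;
	if (type == DEMO_ONLINE_KERNEL)
		return DEMO_ZONE_NORMAL;
	/* DEMO_ONLINE_KEEP: the default zone the user sees listed first */
	return movable_node ? DEMO_ZONE_MOVABLE : DEMO_ZONE_NORMAL;
}
```

    Because both call sites share the helper, the ordering shown in
    valid_zones cannot drift from what onlining would actually do.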

    Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Reza Arbab
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc:
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Jul, 2017

7 commits

  • movable_node_is_enabled is defined in memblock proper while it is
    initialized from memory hotplug proper. This is quite messy and it
    creates a dependency between the two, so move movable_node along with
    the helper functions to memory_hotplug.

    To make it more entertaining, the kernel parameter is ignored unless
    CONFIG_HAVE_MEMBLOCK_NODE_MAP=y, because we do not have the node
    information for each memblock otherwise. So let's warn when the option
    is disabled.

    Link: http://lkml.kernel.org/r/20170529114141.536-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Reza Arbab
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Jerome Glisse
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Chen Yucong
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • zone_for_memory no longer has any users, and neither does the whole
    zone shifting infrastructure, so drop them all.

    This shouldn't introduce any functional changes.

    Link: http://lkml.kernel.org/r/20170515085827.16474-15-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • arch_add_memory gets a for_device argument which then controls whether
    we want to create memblocks for the created memory sections. Simplify
    the logic by stating directly whether we want memblocks, rather than
    going through a pointless negation. This also makes the API easier to
    understand, because it is clear what we want rather than a non-telling
    for_device which can mean anything.

    This shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: Dan Williams
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Heiko Carstens has noticed that he can generate overlapping zones for
    ZONE_DMA and ZONE_NORMAL:

    DMA [mem 0x0000000000000000-0x000000007fffffff]
    Normal [mem 0x0000000080000000-0x000000017fffffff]

    $ cat /sys/devices/system/memory/block_size_bytes
    10000000
    $ cat /sys/devices/system/memory/memory5/valid_zones
    DMA
    $ echo 0 > /sys/devices/system/memory/memory5/online
    $ cat /sys/devices/system/memory/memory5/valid_zones
    Normal
    $ echo 1 > /sys/devices/system/memory/memory5/online
    Normal

    $ cat /proc/zoneinfo
    Node 0, zone DMA
    spanned 524288
    Reported-by: Heiko Carstens
    Tested-by: Heiko Carstens
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The current memory hotplug implementation relies on having all the
    struct pages associated with a zone/node during the physical hotplug
    phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the
    vast majority of cases this means that they are added to ZONE_NORMAL.
    This has been so since 9d99aaa31f59 ("[PATCH] x86_64: Support memory
    hotadd without sparsemem") and it wasn't a big deal back then because
    movable onlining didn't exist yet.

    Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
    onlining 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable
    memory and portion memory") and then things got more complicated.
    Rather than reconsidering the zone association which was no longer
    needed (because the memory hotplug already depended on SPARSEMEM) a
    convoluted semantic of zone shifting has been developed. Only the
    currently last memblock or the one adjacent to the zone_movable can be
    onlined movable. This essentially means that the online type changes as
    the new memblocks are added.

    Let's simulate memory hot online manually
    $ echo 0x100000000 > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory32/valid_zones
    Normal Movable

    $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable

    $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable

    $ echo online_movable > /sys/devices/system/memory/memory34/state
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable Normal

    This is an awkward semantic because a udev event is sent as soon as the
    block is onlined and a udev handler might want to online it based on
    some policy (e.g. association with a node) but it will inherently race
    with new blocks showing up.

    This patch changes the physical online phase to not associate pages with
    any zone at all. All the pages are just marked reserved and wait for
    the onlining phase to be associated with the zone as per the online
    request. There are only two requirements:

    - existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap

    - ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses

    The latter is not an inherent requirement; it preserves the current
    behavior and makes the code slightly simpler, and it can be changed in
    the future.
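    The overlap requirement boils down to an interval-intersection test;
    the follow-up note below mentions simplifying a zone_intersects()
    helper of roughly this shape (modeled here with plain pfn bounds, not
    kernel structures):

```c
#include <assert.h>
#include <stdbool.h>

/* modeled check: does [start_pfn, start_pfn + nr_pages) overlap the
 * zone span [zone_start, zone_end)?  Two half-open ranges intersect
 * iff each starts before the other ends. */
bool zone_intersects_demo(unsigned long zone_start, unsigned long zone_end,
			  unsigned long start_pfn, unsigned long nr_pages)
{
	if (zone_start == zone_end || !nr_pages)	/* empty span */
		return false;
	return zone_start < start_pfn + nr_pages && start_pfn < zone_end;
}
```

    Onlining into a zone is refused whenever the requested range would
    intersect the other (Normal vs. Movable) zone's span.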

    This means that the same physical online steps as above will lead to
    the following state:

    Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable

    Implementation:
    The current move_pfn_range is reimplemented to check the above
    requirements (allow_online_pfn_range) and then update the respective
    zone (move_pfn_range_to_zone) and the pgdat, and link all the pages in
    the pfn range with the zone/node. __add_pages is updated to not require
    the zone and only initializes sections in the range. This allowed
    simplifying the arch_add_memory code (s390 could get rid of quite some
    code).

    devm_memremap_pages is the only user of arch_add_memory which relies on
    the zone association, because it hooks into memory hotplug only half
    way. It uses it to associate the new memory with ZONE_DEVICE but
    doesn't allow it to be {on,off}lined via sysfs. This means that this
    particular code path has to call move_pfn_range_to_zone explicitly.

    The original zone shifting code is kept in place and will be removed in
    the follow up patch for an easier review.

    Please note that this patch also changes the original behavior:
    offlining a memory block adjacent to another zone (Normal vs. Movable)
    used to allow changing its movable type. This will be handled later.

    [richard.weiyang@gmail.com: simplify zone_intersects()]
    Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
    [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
    Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
    [akpm@linux-foundation.org: remove unused local `i']
    Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Wei Yang
    Tested-by: Dan Williams
    Tested-by: Reza Arbab
    Acked-by: Heiko Carstens # For s390 bits
    Acked-by: Vlastimil Babka
    Cc: Martin Schwidefsky
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __pageblock_pfn_to_page currently has two users: set_zone_contiguous,
    which checks whether the given zone contains holes, and
    pageblock_pfn_to_page, which then carefully returns the first valid
    page from the given pfn range for the given zone. This doesn't handle
    zones which are not fully populated, though. Memory pageblocks can be
    offlined or might not have been onlined yet. In such a case the zone
    should be considered to have holes, otherwise pfn walkers can touch
    and play with offline pages.

    Current callers of pageblock_pfn_to_page in compaction seem to work
    properly right now because they only isolate PageBuddy
    (isolate_freepages_block) or PageLRU resp. __PageMovable
    (isolate_migratepages_block) which will be always false for these pages.
    It would be safer to skip these pages altogether, though.

    In order to do this patch adds a new memory section state
    (SECTION_IS_ONLINE) which is set in memory_present (during boot time) or
    in online_pages_range during the memory hotplug. Similarly
    offline_mem_sections clears the bit and it is called when the memory
    range is offlined.

    A pfn_to_online_page helper is then added which checks the mem section
    and only returns a page if it is onlined already.

    Use the new helper in __pageblock_pfn_to_page and skip the whole page
    block in such a case.

    [mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil),
    mark sections online after all struct pages are initialized in
    online_pages_range (Vlastimil)]
    Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Device memory hotplug hooks into regular memory hotplug only half way.
    It needs memory sections to track struct pages but there is no
    need/desire to associate those sections with memory blocks and export
    them to the userspace via sysfs because they cannot be onlined anyway.

    This is currently expressed by for_device argument to arch_add_memory
    which then makes sure to associate the given memory range with
    ZONE_DEVICE. register_new_memory then relies on is_zone_device_section
    to distinguish special memory hotplug from the regular one. While this
    works now, later patches in this series want to move __add_zone outside
    of arch_add_memory path so we have to come up with something else.

    Add want_memblock down the __add_pages path and use it to control
    whether the section->memblock association should be done.
    arch_add_memory then trivially wants a memblock for everything but
    for_device hotplug.

    remove_memory_section doesn't need is_zone_device_section either. We
    can simply skip all the memblock specific cleanup if there is no
    memblock for the given section.

    This shouldn't introduce any functional change.
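    A sketch of the control flow, with stand-in names and a counter in
    place of the real sysfs registration:

```c
#include <assert.h>
#include <stdbool.h>

int memblocks_registered;	/* counts sysfs memory-block registrations */

int add_section_demo(bool want_memblock)
{
	/* struct pages for the section are always set up ... */
	if (want_memblock)
		memblocks_registered++;	/* ... the sysfs block only on request */
	return 0;
}

int arch_add_memory_demo(bool for_device)
{
	/* regular hotplug wants a memblock; device hotplug does not */
	return add_section_demo(!for_device);
}
```

    The cleanup side is symmetric: with no memblock registered for a
    device section, there is nothing memblock-specific to tear down, so
    is_zone_device_section becomes unnecessary.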

    Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: Dan Williams
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 Feb, 2017

1 commit

  • Reading a sysfs "memoryN/valid_zones" file leads to the following oops
    when the first page of a range is not backed by struct page.
    show_valid_zones() assumes that 'start_pfn' is always valid for
    page_zone().

    BUG: unable to handle kernel paging request at ffffea017a000000
    IP: show_valid_zones+0x6f/0x160

    This issue may happen on x86-64 systems with 64GiB or more memory,
    since their memory block size is bumped up to 2GiB. [1] An example of
    such a system is described below. 0x3240000000 is only aligned by 1GiB
    and this memory block starts from 0x3200000000, which is not backed by
    struct page.

    BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable

    Since test_pages_in_a_zone() already checks holes, fix this issue by
    extending this function to return 'valid_start' and 'valid_end' for a
    given range. show_valid_zones() then proceeds with the valid range.
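    The shape of the extended helper can be modeled over a tiny pfn range
    (demo_backed stands in for pfn_valid(); the real function also checks
    zone membership):

```c
#include <assert.h>
#include <stdbool.h>

/* tiny pfn range; true == backed by struct page */
bool demo_backed[16] = { [4] = true, [5] = true, [6] = true, [7] = true };

/* first backed pfn in [start, end); == end means the block is all hole */
unsigned long demo_valid_start(unsigned long start, unsigned long end)
{
	while (start < end && !demo_backed[start])
		start++;
	return start;
}

/* one past the last backed pfn in [start, end) */
unsigned long demo_valid_end(unsigned long start, unsigned long end)
{
	while (end > start && !demo_backed[end - 1])
		end--;
	return end;
}
```

    show_valid_zones() then starts from the reported valid_start instead
    of blindly dereferencing the first pfn of the block.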

    [1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

    Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Cc: Greg Kroah-Hartman
    Cc: Zhang Zhen
    Cc: Reza Arbab
    Cc: David Rientjes
    Cc: Dan Williams
    Cc: [4.4+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

25 Jan, 2017

1 commit

  • online_{kernel|movable} is used to change the memory zone to
    ZONE_{NORMAL|MOVABLE} and online the memory.

    To check whether the memory zone can be changed, zone_can_shift() is
    used. Currently the function returns a negative integer value, a
    positive integer value, or 0. When the function returns a negative or
    positive value, it means that the memory zone can be changed to
    ZONE_{NORMAL|MOVABLE}.

    But when the function returns 0, there are two meanings.

    One of the meanings is that the memory zone does not need to be changed.
    For example, when memory is in ZONE_NORMAL and onlined by online_kernel
    the memory zone does not need to be changed.

    Another meaning is that the memory zone cannot be changed. When memory
    is in ZONE_NORMAL and onlined by online_movable, the memory zone may
    not be changed to ZONE_MOVABLE due to the memory online limitation (see
    Documentation/memory-hotplug.txt). In this case, memory must not be
    onlined.

    The patch changes the return type of zone_can_shift() so that the
    memory online operation fails when the memory zone cannot be changed,
    as follows:

    Before applying patch:
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320
    # echo online_movable > memory4097/state
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 8388608
    managed 8388608

    The online_movable operation succeeded, but the memory was onlined as
    ZONE_NORMAL, not ZONE_MOVABLE.

    After applying patch:
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320
    # echo online_movable > memory4097/state
    bash: echo: write error: Invalid argument
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320

    The online_movable operation failed because changing the memory zone
    from ZONE_NORMAL to ZONE_MOVABLE failed.
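    The interface change can be modeled as follows: instead of an int
    whose 0 conflated "no shift needed" with "shift impossible", success
    is reported explicitly and the shift amount separately (the names and
    the shift_ok condition are illustrative stand-ins):

```c
#include <assert.h>
#include <stdbool.h>

/* modeled zone_can_shift(): returns success explicitly; the shift
 * amount goes through an out-parameter */
bool zone_can_shift_demo(int cur_zone, int target_zone, bool shift_ok,
			 int *shift)
{
	if (cur_zone == target_zone) {
		*shift = 0;	/* nothing to change: still a success */
		return true;
	}
	if (!shift_ok)		/* previously also "0", ambiguously */
		return false;	/* now the online request must fail */
	*shift = target_zone - cur_zone;
	return true;
}
```

    With the boolean in place, online_movable on memory that cannot move
    to ZONE_MOVABLE returns an error (EINVAL to the sysfs writer) instead
    of silently onlining into ZONE_NORMAL.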

    Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online")
    Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Reviewed-by: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     

27 Jul, 2016

1 commit

  • When memory is onlined, we are only able to rezone from ZONE_MOVABLE to
    ZONE_KERNEL, or from (ZONE_MOVABLE - 1) to ZONE_MOVABLE.

    To be more flexible, use the following criteria instead; to online
    memory from zone X into zone Y,

    * Any zones between X and Y must be unused.
    * If X is lower than Y, the onlined memory must lie at the end of X.
    * If X is higher than Y, the onlined memory must lie at the start of X.

    Add zone_can_shift() to make this determination.
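    The three criteria can be sketched over modeled zone spans (start/end
    pfns, where an empty span means the zone is unused; these are not the
    kernel's data structures):

```c
#include <assert.h>
#include <stdbool.h>

struct demo_zone { unsigned long start, end; };	/* start == end: unused */

/* modeled determination for onlining [pfn, pfn + nr) from zone index x
 * into zone index y */
bool demo_can_shift(const struct demo_zone *z, int x, int y,
		    unsigned long pfn, unsigned long nr)
{
	int lo = x < y ? x : y, hi = x < y ? y : x, i;

	for (i = lo + 1; i < hi; i++)	/* zones between X and Y unused? */
		if (z[i].start != z[i].end)
			return false;
	if (x < y)			/* range must lie at the end of X */
		return pfn + nr == z[x].end;
	if (x > y)			/* range must lie at the start of X */
		return pfn == z[x].start;
	return true;			/* same zone: trivially allowed */
}
```

    This generalizes the old rule, which only allowed shifting between
    ZONE_MOVABLE and its immediate neighbor.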

    Link: http://lkml.kernel.org/r/1462816419-4479-3-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Greg Kroah-Hartman
    Cc: Daniel Kiper
    Cc: Dan Williams
    Cc: Vlastimil Babka
    Cc: Tang Chen
    Cc: Joonsoo Kim
    Cc: David Vrabel
    Cc: Vitaly Kuznetsov
    Cc: David Rientjes
    Cc: Andrew Banman
    Cc: Chen Yucong
    Cc: Yasunori Goto
    Cc: Zhang Zhen
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     

28 May, 2016

1 commit

  • The register_page_bootmem_info_node() function needs to be marked __init
    in order to avoid a new warning introduced by commit f65e91df25aa ("mm:
    use early_pfn_to_nid in register_page_bootmem_info_node").

    Otherwise you'll get a warning about how a non-init function calls
    early_pfn_to_nid (which is __meminit).

    Cc: Yang Shi
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

20 May, 2016

1 commit


16 Mar, 2016

2 commits

  • There is a performance drop report due to hugepage allocation, in
    which half of the CPU time is spent in pageblock_pfn_to_page() during
    compaction [1].

    In that workload, compaction is triggered to make hugepages, but most
    pageblocks are unavailable for compaction due to the pageblock type
    and skip bit, so compaction usually fails. The most costly operation
    in this case is finding a valid pageblock while scanning the whole
    zone range. To check if a pageblock is valid to compact, a valid pfn
    within the pageblock is required, and we can obtain it by calling
    pageblock_pfn_to_page(). This function checks whether the pageblock is
    in a single zone and returns a valid pfn if possible. The problem is
    that we need to perform this check every time before scanning a
    pageblock, even if we re-visit it, and this turns out to be very
    expensive in this workload.

    Although we have no way to skip this pageblock check on systems where
    holes exist at arbitrary positions, we can use a cached value for zone
    contiguity and just do pfn_to_page() on systems without holes. This
    optimization considerably speeds up the above workload.
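    The caching idea in miniature: a boolean computed once by something
    like set_zone_contiguous() lets the hot path skip the per-pageblock
    validation entirely (all names below are stand-ins):

```c
#include <assert.h>
#include <stdbool.h>

bool zone_contiguous;	/* computed once: "this zone has no holes" */
int expensive_checks;	/* counts slow-path invocations */

bool pageblock_valid_slow(unsigned long pfn)
{
	(void)pfn;
	expensive_checks++;
	return true;	/* stand-in for the real single-zone/hole checks */
}

bool pageblock_valid_demo(unsigned long pfn)
{
	if (zone_contiguous)	/* cached: holes are impossible, skip */
		return true;
	return pageblock_valid_slow(pfn);
}
```

    The cached flag must be invalidated (and recomputed) when hotplug
    changes the zone's span, which is why the error path has to restore
    zone->contiguous, as noted below.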

    Before vs After
    Max: 1096 MB/s vs 1325 MB/s
    Min: 635 MB/s vs 1015 MB/s
    Avg: 899 MB/s vs 1194 MB/s

    Avg is improved by roughly 30% [2].

    [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
    [2]: https://lkml.org/lkml/2015/12/9/23

    [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
    Signed-off-by: Joonsoo Kim
    Reported-by: Aaron Lu
    Acked-by: Vlastimil Babka
    Tested-by: Aaron Lu
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, all newly added memory blocks remain in the 'offline' state
    unless someone onlines them, so some Linux distributions carry special
    udev rules like:

    SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

    to make this happen automatically. This is not a great solution for
    virtual machines where memory hotplug is being used to address high
    memory pressure situations, as such onlining is slow and a userspace
    process doing it (udev) has a chance of being killed by the OOM killer,
    as it will probably need to allocate some memory.

    Introduce a default policy for newly added memory blocks via the
    /sys/devices/system/memory/auto_online_blocks file, with two possible
    values: "offline", which preserves the current behavior, and "online",
    which causes all newly added memory blocks to go online as soon as
    they're added. The default is "offline".

    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Daniel Kiper
    Cc: Jonathan Corbet
    Cc: Greg Kroah-Hartman
    Cc: Daniel Kiper
    Cc: Dan Williams
    Cc: Tang Chen
    Cc: David Vrabel
    Acked-by: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: Mel Gorman
    Cc: "K. Y. Srinivasan"
    Cc: Igor Mammedov
    Cc: Kay Sievers
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     

16 Jan, 2016

1 commit

  • In support of providing struct page for large persistent memory
    capacities, use struct vmem_altmap to change the default policy for
    allocating memory for the memmap array. The default vmemmap_populate()
    allocates page table storage area from the page allocator. Given
    persistent memory capacities relative to DRAM it may not be feasible to
    store the memmap in 'System Memory'. Instead vmem_altmap represents
    pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
    requests.

    Signed-off-by: Dan Williams
    Reported-by: kbuild test robot
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

23 Oct, 2015

1 commit

  • Add add_memory_resource() to add memory using an existing "System RAM"
    resource. This is useful if the memory region is being located by
    finding a free resource slot with allocate_resource().

    Xen guests will make use of this in their balloon driver to hotplug
    arbitrary amounts of memory in response to toolstack requests.

    Signed-off-by: David Vrabel
    Reviewed-by: Daniel Kiper
    Reviewed-by: Tang Chen

    David Vrabel
     

28 Aug, 2015

1 commit

  • While pmem is usable as a block device or via DAX mappings to
    userspace, there are several usage scenarios that cannot target pmem
    due to its lack of struct page coverage. In preparation for "hot
    plugging" pmem into the vmemmap, add ZONE_DEVICE as a new zone to tag
    these pages separately from the ones that are subject to standard page
    allocations. Importantly, "device memory" can be removed at will by
    userspace unbinding the driver of the device.

    Having a separate zone prevents allocation and otherwise marks these
    pages as distinct from typical uniform memory. Device memory has
    different lifetime and performance characteristics than RAM. However,
    since we have run out of ZONES_SHIFT bits, this functionality currently
    depends on sacrificing ZONE_DMA.

    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jerome Glisse
    [hch: various simplifications in the arch interface]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

15 Apr, 2015

1 commit

  • There's a deadlock when concurrently hot-adding memory through the probe
    interface and switching a memory block from offline to online.

    When hot-adding memory via the probe interface, add_memory() first takes
    mem_hotplug_begin() and then device_lock() is later taken when registering
    the newly initialized memory block. This creates a lock dependency of (1)
    mem_hotplug.lock (2) dev->mutex.

    When switching a memory block from offline to online, dev->mutex is first
    grabbed in device_online() when the write(2) transitions an existing
    memory block from offline to online, and then online_pages() will take
    mem_hotplug_begin().

    This creates a lock inversion between mem_hotplug.lock and dev->mutex.
    Vitaly reports that this deadlock can happen when a kworker handling a
    probe event races with systemd-udevd switching a memory block's state.

    This patch requires the state transition to take mem_hotplug_begin()
    before dev->mutex. Hot-adding memory via the probe interface creates a
    memory block while holding mem_hotplug_begin(), so there is no way to
    take dev->mutex first in this case.

    online_pages() and offline_pages() are only called when transitioning
    memory block state. We now require that mem_hotplug_begin() is taken
    before calling them -- this requires exporting the mem_hotplug_begin() and
    mem_hotplug_done() to generic code. In all hot-add and hot-remove cases,
    mem_hotplug_begin() is done prior to device_online(). This is all that is
    needed to avoid the deadlock.
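    The ordering discipline can be modeled by recording lock acquisitions
    (plain integers stand in for mem_hotplug.lock and dev->mutex):

```c
#include <assert.h>

enum { MEM_HOTPLUG_LOCK, DEV_MUTEX };

int acquired[2];
int depth;

void take(int lock) { acquired[depth++] = lock; }
void drop(void) { depth--; }

/* the write(2) "online" transition path, after the fix: the hotplug
 * lock now comes in front of the device lock, matching the probe path */
void device_online_demo(void)
{
	take(MEM_HOTPLUG_LOCK);	/* mem_hotplug_begin() */
	take(DEV_MUTEX);	/* device lock */
	/* online_pages() runs here with both locks held */
	drop();
	drop();			/* mem_hotplug_done() */
}
```

    Because every path acquires the two lock classes in the same global
    order, the inversion (and thus the deadlock) cannot occur.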

    Signed-off-by: David Rientjes
    Reported-by: Vitaly Kuznetsov
    Tested-by: Vitaly Kuznetsov
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: "K. Y. Srinivasan"
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Cc: Vlastimil Babka
    Cc: Zhang Zhen
    Cc: Vladimir Davydov
    Cc: Wang Nan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

10 Oct, 2014

1 commit

  • Currently memory-hotplug has two limits:

    1. If the memory block is in ZONE_NORMAL, you can change it to
    ZONE_MOVABLE, but this memory block must be adjacent to ZONE_MOVABLE.

    2. If the memory block is in ZONE_MOVABLE, you can change it to
    ZONE_NORMAL, but this memory block must be adjacent to ZONE_NORMAL.

    With this patch, we can easily see which zones a memory block can be
    onlined to, without needing to know the above two limits.

    Updated the related Documentation.

    [akpm@linux-foundation.org: use conventional comment layout]
    [akpm@linux-foundation.org: fix build with CONFIG_MEMORY_HOTREMOVE=n]
    [akpm@linux-foundation.org: remove unused local zone_prev]
    Signed-off-by: Zhang Zhen
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Naoya Horiguchi
    Cc: Wang Nan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Zhen
     

07 Aug, 2014

2 commits

  • This series of patches fixes a problem that occurs when memory is added
    in an unexpected order. For example, on an x86_64 machine booted with
    "mem=400M" but with 2GiB of memory installed, the following commands
    trigger the problem:

    # echo 0x40000000 > /sys/devices/system/memory/probe
    [ 28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff]
    # echo 0x48000000 > /sys/devices/system/memory/probe
    [ 28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff]
    # echo online_movable > /sys/devices/system/memory/memory9/state
    # echo 0x50000000 > /sys/devices/system/memory/probe
    [ 29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff]
    # echo 0x58000000 > /sys/devices/system/memory/probe
    [ 29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff]
    # echo online_movable > /sys/devices/system/memory/memory11/state
    # echo online> /sys/devices/system/memory/memory8/state
    # echo online> /sys/devices/system/memory/memory10/state
    # echo offline> /sys/devices/system/memory/memory9/state
    [ 30.558819] Offlined Pages 32768
    # free
    total used free shared buffers cached
    Mem: 780588 18014398509432020 830552 0 0 51180
    -/+ buffers/cache: 18014398509380840 881732
    Swap: 0 0 0

    This is because the above commands probe higher memory after onlining a
    section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
    on systems without ZONE_HIGHMEM) to overlap ZONE_MOVABLE.

    After the second online_movable, the problem can be observed from
    zoneinfo:

    # cat /proc/zoneinfo
    ...
    Node 0, zone Movable
    pages free 65491
    min 250
    low 312
    high 375
    scanned 0
    spanned 18446744073709518848
    present 65536
    managed 65536
    ...

    This series of patches solves the problem by checking ZONE_MOVABLE when
    choosing a zone for new memory. If the new memory lies inside or above
    ZONE_MOVABLE, it goes there instead.

    After applying this series of patches, the free and zoneinfo results
    are as follows (after offlining memory9):

    bash-4.2# free
    total used free shared buffers cached
    Mem: 780956 80112 700844 0 0 51180
    -/+ buffers/cache: 28932 752024
    Swap: 0 0 0

    bash-4.2# cat /proc/zoneinfo

    Node 0, zone DMA
    pages free 3389
    min 14
    low 17
    high 21
    scanned 0
    spanned 4095
    present 3998
    managed 3977
    nr_free_pages 3389
    ...
    start_pfn: 1
    inactive_ratio: 1
    Node 0, zone DMA32
    pages free 73724
    min 341
    low 426
    high 511
    scanned 0
    spanned 98304
    present 98304
    managed 92958
    nr_free_pages 73724
    ...
    start_pfn: 4096
    inactive_ratio: 1
    Node 0, zone Normal
    pages free 32630
    min 120
    low 150
    high 180
    scanned 0
    spanned 32768
    present 32768
    managed 32768
    nr_free_pages 32630
    ...
    start_pfn: 262144
    inactive_ratio: 1
    Node 0, zone Movable
    pages free 65476
    min 241
    low 301
    high 361
    scanned 0
    spanned 98304
    present 65536
    managed 65536
    nr_free_pages 65476
    ...
    start_pfn: 294912
    inactive_ratio: 1

    This patch (of 7):

    Introduce zone_for_memory() in arch independent code for
    arch_add_memory() use.

    Most arch_add_memory() implementations simply select ZONE_HIGHMEM or
    ZONE_NORMAL and add the new memory to it. However, with the existence
    of ZONE_MOVABLE, the selection must be made carefully: if new, higher
    memory is added after ZONE_MOVABLE has been set up, the default zone
    and ZONE_MOVABLE may overlap each other.

    should_add_memory_movable() checks the status of ZONE_MOVABLE. If it
    already contains memory, it compares the address of the new memory
    with that of the movable memory. If the new memory is higher than the
    movable memory, it is added to ZONE_MOVABLE instead of the default
    zone.
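
    The selection rule reads roughly as follows in a toy model (the names
    and the pfn-based comparison are illustrative, not the kernel code):
    new memory starting at or above a non-empty ZONE_MOVABLE goes to
    ZONE_MOVABLE, anything else to the default zone.

```python
ZONE_NORMAL, ZONE_MOVABLE = "Normal", "Movable"

def zone_for_memory(start_pfn, movable_start_pfn):
    """movable_start_pfn is None while ZONE_MOVABLE contains no memory."""
    if movable_start_pfn is not None and start_pfn >= movable_start_pfn:
        return ZONE_MOVABLE      # inside or above ZONE_MOVABLE
    return ZONE_NORMAL           # default zone
```

    With the pfns from the zoneinfo output above, memory probed at or
    above pfn 294912 would land in ZONE_MOVABLE instead of overlapping it.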

    Signed-off-by: Wang Nan
    Cc: Zhang Yanfei
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: "Mel Gorman"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Nan
     
  • In store_mem_state(), we have:

    ...
    else if (!strncmp(buf, "offline", min_t(int, count, 7)))
            online_type = -1;
    ...
    case -1:
            ret = device_offline(&mem->dev);
            break;
    ...

    Here, "offline" is hard coded as -1.

    This patch does the following renaming:

    ONLINE_KEEP -> MMOP_ONLINE_KEEP
    ONLINE_KERNEL -> MMOP_ONLINE_KERNEL
    ONLINE_MOVABLE -> MMOP_ONLINE_MOVABLE

    and introduces MMOP_OFFLINE = -1 to avoid hard coding.
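
    A sketch of the resulting constants (the numeric values of the
    MMOP_ONLINE_* entries below are illustrative; only MMOP_OFFLINE = -1
    is fixed by the patch):

```python
from enum import IntEnum

class MMOP(IntEnum):
    OFFLINE = -1          # replaces the hard-coded -1
    ONLINE_KEEP = 0       # online without changing the zone type
    ONLINE_KERNEL = 1
    ONLINE_MOVABLE = 2

def parse_state(buf):
    # Mirrors the strncmp() chain in store_mem_state(); the longer
    # "online_*" names must be matched before their "online" prefix.
    for name, op in (("online_kernel", MMOP.ONLINE_KERNEL),
                     ("online_movable", MMOP.ONLINE_MOVABLE),
                     ("online", MMOP.ONLINE_KEEP),
                     ("offline", MMOP.OFFLINE)):
        if buf.startswith(name):
            return op
    raise ValueError(buf)
```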

    Signed-off-by: Tang Chen
    Cc: Hu Tao
    Cc: Greg Kroah-Hartman
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Gu Zheng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     

05 Jun, 2014

1 commit

  • kmem_cache_{create,destroy,shrink} need to get a stable value of
    cpu/node online mask, because they init/destroy/access per-cpu/node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of race as described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves the purpose, but
    it's a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some undesirable limitations on locking order and
    can't be used just like get_online_cpus. That's why patch 1
    substitutes it with get/put_online_mems, which work exactly like
    get/put_online_cpus except they block memory hotplug rather than cpu
    hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
    myself, because it used an rw semaphore for get/put_online_mems,
    making them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of online nodes
    mask won't be able to proceed concurrently. Also, it imposes some
    strong locking ordering rules on it, which narrows down the set of its
    usage scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus, but for memory hotplug, i.e. executing code
    inside a get/put_online_mems section guarantees a stable value of
    online nodes, present pages, etc.

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
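
    The intended semantics can be sketched with a reader count and a
    condition variable (a simplified user-space model under assumed
    semantics, not the kernel implementation): readers in a
    get/put_online_mems section run concurrently, while a hotplug writer
    waits for them to drain.

```python
import threading

class MemHotplugLock:
    """Readers (get/put_online_mems) run concurrently; a hotplug
    writer (mem_hotplug_begin/done) waits for them to drain."""
    def __init__(self):
        self._cv = threading.Condition()
        self._readers = 0
        self._writer = False

    def get_online_mems(self):
        with self._cv:
            while self._writer:           # block while hotplug runs
                self._cv.wait()
            self._readers += 1

    def put_online_mems(self):
        with self._cv:
            self._readers -= 1
            self._cv.notify_all()

    def mem_hotplug_begin(self):
        with self._cv:
            while self._writer or self._readers:
                self._cv.wait()           # wait for readers to drain
            self._writer = True

    def mem_hotplug_done(self):
        with self._cv:
            self._writer = False
            self._cv.notify_all()
```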

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Nov, 2013

2 commits

  • For below functions,

    - sparse_add_one_section()
    - kmalloc_section_memmap()
    - __kmalloc_section_memmap()
    - __kfree_section_memmap()

    they are always invoked to operate on one memory section, so it is
    redundant to pass an nr_pages parameter giving the number of pages in
    one section. We can use the predefined macro PAGES_PER_SECTION
    directly instead of passing the parameter.
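
    As a toy illustration (the SECTION_SIZE_BITS/PAGE_SHIFT values below
    are the common x86_64 ones, used only for the arithmetic): with
    128 MiB sections and 4 KiB pages the per-call argument is always the
    same constant, so it can be dropped.

```python
PAGE_SHIFT = 12           # 4 KiB pages
SECTION_SIZE_BITS = 27    # 128 MiB sections
PAGES_PER_SECTION = 1 << (SECTION_SIZE_BITS - PAGE_SHIFT)

def kmalloc_section_memmap():
    # was: kmalloc_section_memmap(..., nr_pages); the count is now
    # implied by the section size (sketch: one slot per page struct)
    return [None] * PAGES_PER_SECTION
```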

    Signed-off-by: Zhang Yanfei
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • cpu_up() has #ifdef CONFIG_MEMORY_HOTPLUG code blocks, which call
    mem_online_node() to put its node online if offlined and then call
    build_all_zonelists() to initialize the zone list.

    These steps are specific to memory hotplug and should be managed in
    mm/memory_hotplug.c. lock_memory_hotplug() should also be held across
    the whole sequence.

    For this reason, this patch replaces mem_online_node() with
    try_online_node(), which performs the whole steps with
    lock_memory_hotplug() held. try_online_node() is named after
    try_offline_node() as they have similar purpose.

    There is no functional change in this patch.

    Signed-off-by: Toshi Kani
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

02 Jun, 2013

3 commits

  • Move the definitions of offline_pages() and remove_memory()
    for CONFIG_MEMORY_HOTREMOVE to memory_hotplug.h, where they belong,
    and make them static inline.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • Now that the memory offlining should be taken care of by the
    companion device offlining code in acpi_scan_hot_remove(), the
    ACPI memory hotplug driver doesn't need to offline it in
    remove_memory() any more. Moreover, since the return value of
    remove_memory() is not used, it's better to make it a void function
    and trigger a BUG() if the memory scheduled for removal is not
    offline.

    Change the code in accordance with the above observations.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Toshi Kani

    Rafael J. Wysocki
     
  • Since offline_memory_block(mem) is functionally equivalent to
    device_offline(&mem->dev), make the only caller of the former use
    the latter instead and drop offline_memory_block() entirely.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Greg Kroah-Hartman
    Acked-by: Toshi Kani

    Rafael J. Wysocki
     

12 May, 2013

1 commit

  • During ACPI memory hotplug configuration bind memory blocks residing
    in modules removable through the standard ACPI mechanism to struct
    acpi_device objects associated with ACPI namespace objects
    representing those modules. Accordingly, unbind those memory blocks
    from the struct acpi_device objects when the memory modules in
    question are being removed.

    When "offline" operation for devices representing memory blocks is
    introduced, this will allow the ACPI core's device hot-remove code to
    use it to carry out remove_memory() for those memory blocks and check
    the results of that before it actually removes the modules holding
    them from the system.

    Since walk_memory_range() is used for accessing all memory blocks
    corresponding to a given ACPI namespace object, it is exported from
    memory_hotplug.c so that the code in acpi_memhotplug.c can use it.

    Signed-off-by: Rafael J. Wysocki
    Tested-by: Vasilis Liaskovitis
    Reviewed-by: Toshi Kani

    Rafael J. Wysocki
     

30 Apr, 2013

1 commit

  • __remove_pages() is only necessary for CONFIG_MEMORY_HOTREMOVE. PowerPC
    pseries will return -EOPNOTSUPP if unsupported.

    Adding an #ifdef causes several other functions it depends on to also
    become unnecessary, which saves in .text when disabled (it's disabled in
    most defconfigs besides powerpc, including x86). remove_memory_block()
    becomes static since it is not referenced outside of
    drivers/base/memory.c.

    Build tested on x86 and powerpc with CONFIG_MEMORY_HOTREMOVE both enabled
    and disabled.

    Signed-off-by: David Rientjes
    Acked-by: Toshi Kani
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Greg Kroah-Hartman
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

24 Feb, 2013

5 commits

  • try_offline_node() will be needed in the tristate
    drivers/acpi/processor_driver.c.

    The node will be offlined when all memory/cpus on the node have been
    hot-removed, so we need try_offline_node() in the cpu-hotplug path.

    If memory hotplug is disabled and cpu hotplug is enabled:

    1. The node has no memory:
    we don't online the node, and the cpu's node is the nearest node.

    2. The node contains some memory:
    the node has been onlined, and the cpu's node is still needed so that
    sleeping tasks on the cpu can be migrated to the same node.

    So we do nothing in try_offline_node() in this case.

    [rientjes@google.com: export the function try_offline_node() fix]
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Len Brown
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • Introduce a new function try_offline_node() to remove the sysfs file
    of a node when all memory sections of the node have been removed. If
    some memory sections of the node are not removed, this function does
    nothing.

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • To remove a sparse-vmemmap memmap region that was allocated from
    bootmem, the region needs to be registered with get_page_bootmem().
    So the patch walks the pages of the virtual mapping and registers
    them with get_page_bootmem().

    NOTE: register_page_bootmem_memmap() is not implemented for ia64,
    ppc, s390, and sparc. So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE
    and revert register_page_bootmem_info_node() when platform doesn't
    support it.

    It's implemented by adding a new Kconfig option named
    CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected
    by architectures that fully support memory hotplug (currently only
    x86_64).

    Since we have two config options, MEMORY_HOTPLUG and MEMORY_HOTREMOVE,
    used for memory hot-add and hot-remove separately, and the code in
    register_page_bootmem_info_node() is only used to collect information
    for hot-remove, it now resides under MEMORY_HOTREMOVE.

    page_isolation.c, selected by MEMORY_ISOLATION under MEMORY_HOTPLUG,
    is a similar case, so move it too.

    [mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE]
    [linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()]
    [mhocko@suse.cz: remove the arch specific functions without any implementation]
    [linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed]
    [rientjes@google.com: fix defined but not used warning]
    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Tang Chen
    Reviewed-by: Wu Jianguo
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Michal Hocko
    Signed-off-by: Lin Feng
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • For removing memory, we need to remove page tables. But that is
    architecture dependent, so this patch introduces arch_remove_memory()
    for removing page tables. For now it only calls __remove_pages().

    Note: __remove_pages() is not implemented for some architectures
    (I don't know how to implement it for s390).

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • We remove the memory like this:

    1. lock memory hotplug
    2. offline a memory block
    3. unlock memory hotplug
    4. repeat 1-3 to offline all memory blocks
    5. lock memory hotplug
    6. remove memory(TODO)
    7. unlock memory hotplug

    All memory blocks must be offlined before removing memory. But we
    don't hold the lock across the whole operation, so we should check
    whether all memory blocks are offlined before step 6. Otherwise, the
    kernel may panic.

    Offlining a memory block and removing a memory device can be two
    different operations. Users can offline some memory blocks without
    removing the memory device. For this purpose, the kernel already
    takes lock_memory_hotplug() in __offline_pages(). To reuse that code
    for memory hot-remove, we repeat steps 1-3 to offline all the memory
    blocks, repeatedly locking and unlocking memory hotplug, rather than
    holding the memory hotplug lock across the whole operation.
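
    The sequence above, and the extra check it requires, can be sketched
    as follows (an illustrative model only; dict-based blocks stand in
    for struct memory_block, and the lock models
    lock/unlock_memory_hotplug):

```python
import threading

mem_hotplug_lock = threading.Lock()   # models lock/unlock_memory_hotplug

def remove_memory(blocks):
    # Steps 1-4: offline each block under its own lock/unlock cycle.
    for blk in blocks:
        with mem_hotplug_lock:
            blk["state"] = "offline"
    # Steps 5-7: the lock was dropped in between, so re-check that every
    # block is still offline before actually removing the memory.
    with mem_hotplug_lock:
        if any(b["state"] != "offline" for b in blocks):
            raise RuntimeError("block re-onlined during removal")
        return "removed"
```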

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Tang Chen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     

12 Dec, 2012

1 commit

  • Add online_movable and online_kernel for logical memory hotplug. This
    is the dynamic version of "movablecore" & "kernelcore".

    We have the same reasons for introducing it as for "movablecore" &
    "kernelcore", but it works dynamically, at run time:

    o We can configure memory as kernelcore or movablecore after boot.

    When the userspace workload increases and we need more hugepages, we
    can use "online_movable" to add memory and let the system use more
    THP (transparent huge pages); vice versa when the kernel workload
    increases.

    This also helps virtualization dynamically configure host/guest
    memory, to save memory (reduce waste).

    Memory capacity on Demand

    o When a new node is physically brought online after boot, we need to
    use "online_movable" or "online_kernel" to configure/partition it as
    expected when we logically online it.

    This configuration also helps physical memory migration.

    o All the same benefits as the existing "movablecore" & "kernelcore".

    o Prepares for movable-node, which is very important for power
    saving, hardware partitioning and highly available systems (hardware
    fault management).

    (Note: we don't introduce movable-node here.)

    Action behavior:
    When a memory block/section is onlined with "online_movable", the
    kernel will not hold direct references to the pages of that memory
    block, so we can remove the memory at any time when needed.

    When it is onlined with "online_kernel", the kernel can use it.
    When it is onlined with "online", the zone type does not change.

    Current constraint:
    Only a memory block adjacent to ZONE_MOVABLE can be onlined from
    ZONE_NORMAL to ZONE_MOVABLE.
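
    The adjacency constraint can be stated as a one-line predicate (a
    sketch in pfn terms; the real check is more involved): the block must
    end exactly where ZONE_MOVABLE begins, so the zones stay contiguous.

```python
def can_online_movable(block_start_pfn, block_end_pfn, movable_start_pfn):
    # A block may move from ZONE_NORMAL to ZONE_MOVABLE only if it ends
    # exactly at the start of ZONE_MOVABLE, i.e. it is adjacent to it.
    return block_end_pfn == movable_start_pfn
```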

    [akpm@linux-foundation.org: use min_t, cleanups]
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Yinghai Lu
    Cc: Rusty Russell
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

09 Oct, 2012

2 commits

  • remove_memory() will be called when hot-removing a memory device. But
    even though the memory is offlined, userspace cannot notice it. So
    this patch updates the memory block's state and sends a notification
    to userspace.

    Additionally, the memory device may contain more than one memory
    block. If one of its memory blocks has already been offlined,
    __offline_pages() will fail, so we should try to offline one memory
    block at a time.

    Thus remove_memory() also checks each memory block's state, so there
    is no need to check a memory block's state before calling
    remove_memory().

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Len Brown
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • remove_memory() is called in two cases:
    1. echo offline >/sys/devices/system/memory/memoryXX/state
    2. hot remove a memory device

    In the 1st case, the memory block's state is changed, and a
    notification that the state changed is sent to userland after
    remove_memory() is called, so the user can notice the change.

    But in the 2nd case, the memory block's state is not changed and the
    notification is not sent to userspace either, even though
    remove_memory() is called. So the user cannot notice that the memory
    block changed.

    To add the notification at memory hot remove, this patch prepares as
    follows:
    The 1st case uses offline_pages() for offlining memory.
    The 2nd case uses remove_memory() for offlining memory, changing the
    memory block's state, and sending the notification.

    The patch does not yet implement the notification in remove_memory().

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Len Brown
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang