04 May, 2017

1 commit

  • kswapd is woken to reclaim a node based on a failed allocation request
    from any eligible zone. Once reclaiming in balance_pgdat(), it will
    continue reclaiming until there is an eligible zone available for the
    zone it was woken for. kswapd tracks what zone it was recently woken
    for in pgdat->kswapd_classzone_idx. If it has not been woken recently,
    this zone will be 0.

    However, the decision on whether to sleep is made on
    kswapd_classzone_idx which is 0 without a recent wakeup request and that
    classzone does not account for lowmem reserves. This allows kswapd to
    sleep when a small, low zone such as ZONE_DMA is balanced for a GFP_DMA
    request even if a stream of allocations cannot use that zone. While
    kswapd may be woken again shortly, there are two
    consequences -- the pgdat bits that control congestion are cleared
    prematurely and direct reclaim is more likely as kswapd slept
    prematurely.

    This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
    invalid index) when there have been no recent wakeups. If there are no
    wakeups, it'll decide whether to sleep based on the highest possible
    zone available (MAX_NR_ZONES - 1). It then becomes critical that the
    "pgdat balanced" decisions during reclaim and when deciding to sleep are
    the same. If there is a mismatch, kswapd can stay awake continually
    trying to balance tiny zones.
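
    As a rough standalone sketch of the sentinel idea (simplified zone list,
    not the kernel's definitions), the index kswapd balances for can be
    resolved like this:

    /* Standalone sketch: resolve which zone index kswapd should balance for. */
    enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, MAX_NR_ZONES };

    enum zone_type effective_classzone_idx(enum zone_type kswapd_classzone_idx)
    {
            if (kswapd_classzone_idx == MAX_NR_ZONES)   /* no recent wakeup pending */
                    return MAX_NR_ZONES - 1;            /* assume the highest possible zone */
            return kswapd_classzone_idx;                /* zone from the last wakeup */
    }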

    simoop was used to evaluate it again. Two of the preparation patches
    regressed the workload so they are included as the second set of
    results. Otherwise this patch looks artificially excellent.

                                   4.11.0-rc1            4.11.0-rc1            4.11.0-rc1
                                      vanilla              clear-v2          keepawake-v2
    Amean  p50-Read        21670074.18 (  0.00%) 19786774.76 (  8.69%) 22668332.52 ( -4.61%)
    Amean  p95-Read        25456267.64 (  0.00%) 24101956.27 (  5.32%) 26738688.00 ( -5.04%)
    Amean  p99-Read        29369064.73 (  0.00%) 27691872.71 (  5.71%) 30991404.52 ( -5.52%)
    Amean  p50-Write            1390.30 (  0.00%)     1011.91 ( 27.22%)      924.91 ( 33.47%)
    Amean  p95-Write          412901.57 (  0.00%)    34874.98 ( 91.55%)     1362.62 ( 99.67%)
    Amean  p99-Write         6668722.09 (  0.00%)   575449.60 ( 91.37%)    16854.04 ( 99.75%)
    Amean  p50-Allocation      78714.31 (  0.00%)    84246.26 ( -7.03%)    74729.74 (  5.06%)
    Amean  p95-Allocation     175533.51 (  0.00%)   400058.43 (-127.91%)  101609.74 ( 42.11%)
    Amean  p99-Allocation     247003.02 (  0.00%) 10905600.00 (-4315.17%) 125765.57 ( 49.08%)

    With this patch on top, write and allocation latencies are massively
    improved. The read latencies are slightly impaired but it's worth
    noting that this is mostly due to the IO scheduler and not directly
    related to reclaim. The vmstats are a bit of a mix but the relevant
    ones are as follows:

                                4.10.0-rc7      4.10.0-rc7      4.10.0-rc7
                            mmots-20170209     clear-v1r25 keepawake-v1r25
    Swap Ins                           0               0               0
    Swap Outs                          0             608               0
    Direct pages scanned         6910672         3132699         6357298
    Kswapd pages scanned        57036946        82488665        56986286
    Kswapd pages reclaimed      55993488        63474329        55939113
    Direct pages reclaimed       6905990         2964843         6352115
    Kswapd efficiency                98%             76%             98%
    Kswapd velocity            12494.375       17597.507       12488.065
    Direct efficiency                99%             94%             99%
    Direct velocity             1513.835         668.306        1393.148
    Page writes by reclaim         0.000     4410243.000           0.000
    Page writes file                   0         4409635               0
    Page writes anon                   0             608               0
    Page reclaim immediate       1036792        14175203         1042571

                                4.11.0-rc1      4.11.0-rc1      4.11.0-rc1
                                   vanilla        clear-v2    keepawake-v2
    Swap Ins                           0              12               0
    Swap Outs                          0             838               0
    Direct pages scanned         6579706         3237270         6256811
    Kswapd pages scanned        61853702        79961486        54837791
    Kswapd pages reclaimed      60768764        60755788        53849586
    Direct pages reclaimed       6579055         2987453         6256151
    Kswapd efficiency                98%             75%             98%
    Page writes by reclaim         0.000     4389496.000           0.000
    Page writes file                   0         4388658               0
    Page writes anon                   0             838               0
    Page reclaim immediate       1073573        14473009          982507

    Swap-outs are equivalent to baseline.

    Direct reclaim is reduced but not eliminated. It's worth noting that
    there are two periods of direct reclaim for this workload. The first is
    when the workload switches from preparing the files to running the
    actual test itself: a lot of file IO followed by a lot of allocs that
    reclaims heavily for a brief window. While direct reclaim is lower with
    clear-v2, it is due to kswapd scanning aggressively and trying to
    reclaim the world, which is not the right thing to do. With the patches
    applied, there is still direct reclaim, but it comes from the phase
    change from "creating work files" to starting multiple threads that
    allocate a lot of anonymous memory faster than kswapd can reclaim it.

    Scanning/reclaim efficiency is restored by this patch.

    Page writes from reclaim context are back at 0 which is ideal.

    The number of pages immediately reclaimed after IO completes is slightly
    improved, but it is expected that this will vary.

    On UMA, there is almost no change so this is not expected to be a
    universal win.

    [mgorman@suse.de: fix ->kswapd_classzone_idx initialization]
    Link: http://lkml.kernel.org/r/20170406174538.5msrznj6nt6qpbx5@suse.de
    Link: http://lkml.kernel.org/r/20170309075657.25121-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Shantanu Goel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Mar, 2017

1 commit

  • Commit bfc8c90139eb ("mem-hotplug: implement get/put_online_mems")
    introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
    in order to allow similar semantics for memory hotplug like for cpu
    hotplug.

    The corresponding functions for cpu hotplug are get/put_online_cpus()
    and cpu_hotplug_begin/done().

    However, the commit did not introduce functions that serialize memory
    hotplug operations the way cpu_maps_update_begin/done() does for cpu
    hotplug.

    This basically leaves mem_hotplug.active_writer unprotected and allows
    concurrent writers to modify it, which may lead to problems as outlined
    by commit f931ab479dd2 ("mm: fix devm_memremap_pages crash, use
    mem_hotplug_{begin, done}").

    That commit was extended again with commit b5d24fda9c3d ("mm,
    devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
    done}") which serializes memory hotplug operations for some call sites
    by using the device_hotplug lock.

    In addition, commit 3fc21924100b ("mm: validate device_hotplug is held
    for memory hotplug") added a sanity check to mem_hotplug_begin() to
    verify that the device_hotplug lock is held.

    This in turn triggers the following warning on s390:

    WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
    Call Trace:
    assert_held_device_hotplug+0x40/0x58)
    mem_hotplug_begin+0x34/0xc8
    add_memory_resource+0x7e/0x1f8
    add_memory+0xda/0x130
    add_memory_merged+0x15c/0x178
    sclp_detect_standby_memory+0x2ae/0x2f8
    do_one_initcall+0xa2/0x150
    kernel_init_freeable+0x228/0x2d8
    kernel_init+0x2a/0x140
    kernel_thread_starter+0x6/0xc

    One possible fix would be to add more lock_device_hotplug() and
    unlock_device_hotplug() calls around each call site of
    mem_hotplug_begin/end(). But that would give the device_hotplug lock
    additional semantics that it should not have (serializing memory hotplug
    operations).

    Instead, add a new memory_add_remove_lock which has similar semantics
    to cpu_add_remove_lock for cpu hotplug.

    To keep things hopefully a bit easier the lock will be locked and unlocked
    within the mem_hotplug_begin/end() functions.
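
    The shape of the change can be sketched as follows (the existing bodies
    of both functions are elided; treat this as an outline rather than the
    exact upstream hunk):

    /* Sketch: serialize whole add/remove operations, like cpu_add_remove_lock,
     * taken and released inside mem_hotplug_begin/end() themselves. */
    static DEFINE_MUTEX(memory_add_remove_lock);

    void mem_hotplug_begin(void)
    {
            mutex_lock(&memory_add_remove_lock);
            /* ... existing begin logic ... */
    }

    void mem_hotplug_done(void)
    {
            /* ... existing done logic ... */
            mutex_unlock(&memory_add_remove_lock);
    }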

    Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
    Signed-off-by: Heiko Carstens
    Reported-by: Sebastian Ott
    Acked-by: Dan Williams
    Acked-by: Rafael J. Wysocki
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Ben Hutchings
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

25 Feb, 2017

5 commits

  • Commit 31bc3858ea3e ("add automatic onlining policy for the newly added
    memory") provides the capability to have added memory automatically
    onlined during add, but this appears to be slightly broken.

    The current implementation uses walk_memory_range() to call
    online_memory_block, which uses memory_block_change_state() to online
    the memory. Instead, we should be calling device_online() for the
    memory block in online_memory_block(). This would online the memory
    (device_online() calls the memory bus online routine
    memory_subsys_online(), which in turn calls memory_block_change_state())
    and properly update the device struct's offline flag.

    As a result of the current implementation, attempting to remove a memory
    block after adding it using auto online fails. This is because doing a
    remove, for instance

    echo offline > /sys/devices/system/memory/memoryXXX/state

    uses device_offline() which checks the dev->offline flag.
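
    A minimal sketch of the idea (not the exact upstream diff): let the walk
    callback go through the driver core so dev->offline is updated along the
    way.

    /* Sketch: online each block via device_online() so the device core
     * updates dev->offline and then calls memory_subsys_online() ->
     * memory_block_change_state(). */
    static int online_memory_block(struct memory_block *mem, void *arg)
    {
            return device_online(&mem->dev);
    }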

    Link: http://lkml.kernel.org/r/20170222220744.8119.19687.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com
    Signed-off-by: Nathan Fontenot
    Cc: Michael Ellerman
    Cc: Michael Roth
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Fontenot
     
  • When mainline introduced commit a96dfddbcc04 ("base/memory, hotplug: fix
    a kernel oops in show_valid_zones()"), it obtained the valid start and
    end pfn from the given pfn range. The valid start pfn can fix the
    actual issue, but it introduced another issue. The valid end pfn will
    may exceed the given end_pfn.

    Although the incorrect overflow will not result in actual problem at
    present, but I think it need to be fixed.

    [toshi.kani@hpe.com: remove assumption that end_pfn is aligned by MAX_ORDER_NR_PAGES]
    Fixes: a96dfddbcc04 ("base/memory, hotplug: fix a kernel oops in show_valid_zones()")
    Link: http://lkml.kernel.org/r/1486467299-22648-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Signed-off-by: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • We had considered all of the non-lru pages as unmovable before commit
    bda807d44454 ("mm: migrate: support non-lru movable page migration").
    But now some non-lru pages, such as zsmalloc and virtio-balloon pages,
    have also become movable. So we can offline such blocks by using non-lru
    page migration.

    This patch straightforwardly adds non-lru migration code, which means
    adding non-lru related code to the functions which scan over pfn and
    collect pages to be migrated and isolate them before migration.

    Signed-off-by: Yisheng Xie
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Hanjun Guo
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Taku Izumi
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yisheng Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
  • It has no modular callers.

    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • mem_hotplug_begin() assumes that it can set mem_hotplug.active_writer
    and run the hotplug process without racing another thread. Validate
    this assumption with a lockdep assertion.

    Link: http://lkml.kernel.org/r/148693886229.16345.1770484669403334689.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Ben Hutchings
    Cc: Michal Hocko
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Greg Kroah-Hartman
    Cc: Masayoshi Mizuma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

23 Feb, 2017

1 commit

  • To identify that page-table pages are allocated from the bootmem
    allocator, a magic number is set in page->lru.next.

    But the page->lru list is initialized in reserve_bootmem_region(), so
    when free_pagetable() is called it cannot find the magic number in the
    pages, and it frees them with free_reserved_page() instead of
    put_page_bootmem().

    But if the pages are allocated from the bootmem allocator and used as
    page tables, they have the private flag set. So before freeing the
    pages, we should clear the private flag via put_page_bootmem().

    Before applying the commit 7bfec6f47bb0 ("mm, page_alloc: check multiple
    page fields with a single branch"), we could find the following visible
    issue:

    BUG: Bad page state in process kworker/u1024:1
    page:ffffea103cfd8040 count:0 mapcount:0 mappi
    flags: 0x6fffff80000800(private)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags: 0x800(private)

    Call Trace:
    [...] dump_stack+0x63/0x87
    [...] bad_page+0x114/0x130
    [...] free_pages_prepare+0x299/0x2d0
    [...] free_hot_cold_page+0x31/0x150
    [...] __free_pages+0x25/0x30
    [...] free_pagetable+0x6f/0xb4
    [...] remove_pagetable+0x379/0x7ff
    [...] vmemmap_free+0x10/0x20
    [...] sparse_remove_one_section+0x149/0x180
    [...] __remove_pages+0x2e9/0x4f0
    [...] arch_remove_memory+0x63/0xc0
    [...] remove_memory+0x8c/0xc0
    [...] acpi_memory_device_remove+0x79/0xa5
    [...] acpi_bus_trim+0x5a/0x8d
    [...] acpi_bus_trim+0x38/0x8d
    [...] acpi_device_hotplug+0x1b7/0x418
    [...] acpi_hotplug_work_fn+0x1e/0x29
    [...] process_one_work+0x152/0x400
    [...] worker_thread+0x125/0x4b0
    [...] kthread+0xd8/0xf0
    [...] ret_from_fork+0x22/0x40

    And the issue still silently occurs.

    page->freelist is never used until the page-table pages allocated from
    the bootmem allocator are freed, so the patch sets the magic number in
    page->freelist instead of page->lru.next.
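
    A hedged sketch of the resulting helpers (simplified; type validation
    and the existing bootmem bookkeeping are omitted):

    /* Sketch: stash the bootmem info type in page->freelist, which is
     * otherwise unused for these pages, instead of page->lru.next. */
    void get_page_bootmem(unsigned long info, struct page *page, unsigned long type)
    {
            page->freelist = (void *)type;
            SetPagePrivate(page);
            set_page_private(page, info);
            page_ref_inc(page);
    }

    void put_page_bootmem(struct page *page)
    {
            /* the type stored in page->freelist is normally validated here */
            if (page_ref_dec_return(page) == 1) {
                    page->freelist = NULL;
                    ClearPagePrivate(page);
                    set_page_private(page, 0);
                    /* ... hand the page back to bootmem/memblock accounting ... */
            }
    }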

    [isimatu.yasuaki@jp.fujitsu.com: fix merge issue]
    Link: http://lkml.kernel.org/r/722b1cc4-93ac-dd8b-2be2-7a7e313b3b0b@gmail.com
    Link: http://lkml.kernel.org/r/2c29bd9f-5b67-02d0-18a3-8828e78bbb6f@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     

04 Feb, 2017

2 commits

  • Reading a sysfs "memoryN/valid_zones" file leads to the following oops
    when the first page of a range is not backed by struct page.
    show_valid_zones() assumes that 'start_pfn' is always valid for
    page_zone().

    BUG: unable to handle kernel paging request at ffffea017a000000
    IP: show_valid_zones+0x6f/0x160

    This issue may happen on x86-64 systems with 64GiB or more memory since
    their memory block size is bumped up to 2GiB. [1] An example of such
    systems is described below. 0x3240000000 is only aligned by 1GiB and
    this memory block starts from 0x3200000000, which is not backed by
    struct page.

    BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable

    Since test_pages_in_a_zone() already checks holes, fix this issue by
    extending this function to return 'valid_start' and 'valid_end' for a
    given range. show_valid_zones() then proceeds with the valid range.

    [1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

    Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Cc: Greg Kroah-Hartman
    Cc: Zhang Zhen
    Cc: Reza Arbab
    Cc: David Rientjes
    Cc: Dan Williams
    Cc: [4.4+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • Patch series "fix a kernel oops when reading sysfs valid_zones", v2.

    A sysfs memory file is created for each 2GiB memory block on x86-64 when
    the system has 64GiB or more memory. [1] When the start address of a
    memory block is not backed by struct page, i.e. a memory range is not
    aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
    kernel oops. This issue was observed on multiple x86-64 systems with
    more than 64GiB of memory. This patch-set fixes this issue.

    Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
    test the start section.

    Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
    to return valid [start, end).

    Note for stable kernels: The memory block size change was made by commit
    bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
    systems"), which was accepted to 3.9. However, this patch-set depends
    on (and fixes) the change to test_pages_in_a_zone() made by commit
    5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
    test_pages_in_a_zone()"), which was accepted to 4.4.

    So, I recommend that we backport it up to 4.4.

    [1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

    This patch (of 2):

    test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
    section since 'sec_end_pfn' is set equal to 'pfn'. Since this function
    is called for testing the range of a sysfs memory file, 'start_pfn' is
    always aligned by section.

    Fix it by properly setting 'sec_end_pfn' to the next section pfn.

    Also make sure that this function returns 1 only when the range belongs
    to a zone.
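
    As a standalone illustration of the alignment change (an example section
    size is assumed; this is not the kernel's code):

    /* With a section-aligned start_pfn, aligning "start_pfn + 1" up to the
     * next section boundary makes the first iteration cover the start
     * section instead of skipping it. */
    #define PAGES_PER_SECTION  (1UL << 15)   /* example: 128MB sections with 4K pages */
    #define SECTION_ALIGN_UP(pfn) \
            (((pfn) + PAGES_PER_SECTION - 1) & ~(PAGES_PER_SECTION - 1))

    unsigned long first_section_end(unsigned long start_pfn)
    {
            return SECTION_ALIGN_UP(start_pfn + 1);  /* not SECTION_ALIGN_UP(start_pfn) */
    }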

    Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Cc: Andrew Banman
    Cc: Reza Arbab
    Cc: Greg KH
    Cc: [4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

25 Jan, 2017

1 commit

  • online_{kernel|movable} is used to change the memory zone to
    ZONE_{NORMAL|MOVABLE} and online the memory.

    To check whether the memory zone can be changed, zone_can_shift() is
    used. Currently the function returns a negative integer, a positive
    integer, or 0. When the function returns a negative or positive value,
    it means that the memory zone can be changed to ZONE_{NORMAL|MOVABLE}.

    But when the function returns 0, there are two meanings.

    One of the meanings is that the memory zone does not need to be changed.
    For example, when memory is in ZONE_NORMAL and onlined by online_kernel
    the memory zone does not need to be changed.

    Another meaning is that the memory zone cannot be changed. When memory
    is in ZONE_NORMAL and onlined by online_movable, the memory zone may
    not be changed to ZONE_MOVABLE due to a memory online limitation (see
    Documentation/memory-hotplug.txt). In this case, memory must not be
    onlined.

    The patch changes the return type of zone_can_shift() so that memory
    online operation fails when memory zone cannot be changed as follows:

    Before applying patch:
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320
    # echo online_movable > memory4097/state
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 8388608
    managed 8388608

    The online_movable operation succeeded, but the memory was onlined as
    ZONE_NORMAL, not ZONE_MOVABLE.

    After applying patch:
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320
    # echo online_movable > memory4097/state
    bash: echo: write error: Invalid argument
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320

    The online_movable operation failed because the memory zone could not
    be changed from ZONE_NORMAL to ZONE_MOVABLE.

    Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online")
    Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Reviewed-by: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     

13 Dec, 2016

1 commit

  • In commit c5320926e370 ("mem-hotplug: introduce movable_node boot
    option"), the memblock allocation direction is changed to bottom-up and
    then back to top-down like this:

    1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
    2. memblock_set_bottom_up(false), called by x86's numa_init().

    Even though (1) occurs in generic mm code, it is wrapped by #ifdef
    CONFIG_MOVABLE_NODE, which depends on X86_64.

    This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
    things will be unbalanced. (1) will happen for them, but (2) will not.

    This toggle was added in the first place because x86 has a delay between
    adding memblocks and marking them as hotpluggable. Since other arches
    do this marking either immediately or not at all, they do not require
    the bottom-up toggle.

    So, resolve things by moving (1) from cmdline_parse_movable_node() to
    x86's setup_arch(), immediately after the movable_node parameter has
    been parsed.
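
    Sketched, the x86 side of the move looks roughly like this (placement
    inside setup_arch() is approximate):

    /* arch/x86 setup_arch(), after early parameters (and thus movable_node)
     * have been parsed but before memblocks are marked hotpluggable. */
    #ifdef CONFIG_MEMORY_HOTPLUG
            if (movable_node_is_enabled())
                    memblock_set_bottom_up(true);
    #endif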

    Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Acked-by: Balbir Singh
    Cc: "Aneesh Kumar K.V"
    Cc: "H. Peter Anvin"
    Cc: Alistair Popple
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Bharata B Rao
    Cc: Frank Rowand
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Nathan Fontenot
    Cc: Paul Mackerras
    Cc: Rob Herring
    Cc: Stewart Smith
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     

28 Oct, 2016

2 commits

  • When I removed the per-zone bitlock hashed waitqueues in commit
    9dcb8b685fc3 ("mm: remove per-zone hashtable of bitlock waitqueues"), I
    removed all the magic hotplug memory initialization of said waitqueues
    too.

    But when I actually _tested_ the resulting build, I stupidly assumed
    that "allmodconfig" would enable memory hotplug. And it doesn't,
    because it enables KASAN instead, which then disables hotplug memory
    support.

    As a result, my build test of the per-zone waitqueues was totally
    broken, and I didn't notice that the compiler warns about the now unused
    iterator variable 'i'.

    I guess I should be happy that that seems to be the worst breakage from
    my clearly horribly failed test coverage.

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: the zone lookup
    also hurts the regular use of "unlock_page()" a lot, because the zone
    lookup ends up forcing several unnecessary cache misses and generates
    horrible code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Oct, 2016

1 commit

  • In dissolve_free_huge_pages(), free hugepages will be dissolved without
    making sure that there are enough of them left to satisfy hugepage
    reservations.

    Fix this by adding a return value to dissolve_free_huge_pages() and
    checking h->free_huge_pages vs. h->resv_huge_pages. Note that this may
    lead to the situation where dissolve_free_huge_page() returns an error
    and all free hugepages that were dissolved before that error are lost,
    while the memory block still cannot be set offline.
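
    The heart of the check can be sketched like this (the error value and
    the surrounding locking are assumptions for illustration, not the exact
    upstream code):

    static int dissolve_free_huge_page(struct page *page)
    {
            struct hstate *h = page_hstate(page);

            /* refuse to dissolve a free hugepage out of the reserved pool */
            if (h->free_huge_pages - h->resv_huge_pages == 0)
                    return -ENOMEM;

            /* ... remove the page from the free lists and release it ... */
            return 0;
    }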

    Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rui Teng
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

29 Sep, 2016

1 commit

  • 9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
    prevents allocating from an empty nodemask, but as David points out, it is
    still wrong. As node_online_map may include memoryless nodes, only
    allocating from these nodes is meaningless.

    This patch uses node_states[N_MEMORY] mask to prevent the above case.

    Fixes: 9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
    Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
    Link: http://lkml.kernel.org/r/1474447117.28370.6.camel@TP420
    Signed-off-by: Li Zhong
    Suggested-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: John Allen
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhong
     

20 Sep, 2016

1 commit

  • Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest
    neighbor node when mem-offline") introduced new_node_page() for memory
    hotplug.

    In new_node_page(), the nid is cleared before calling
    __alloc_pages_nodemask(). But if it is the only node of the system, and
    the first round allocation fails, it will not be able to get memory from
    an empty nodemask, and will trigger oom.

    The patch checks whether it is the last node in the system, and if it
    is, does not clear the nid in the nodemask.
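
    A hedged sketch of the guard (the nodemask helpers are the real API, but
    the exact placement inside new_node_page() is illustrative):

    /* Build the fallback mask, but keep the source node in it when it is
     * the only online node, so the allocation mask never ends up empty. */
    nodemask_t nmask = node_online_map;

    if (nodes_weight(nmask) > 1)
            node_clear(nid, nmask);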

    Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
    Link: http://lkml.kernel.org/r/1473044391.4250.19.camel@TP420
    Signed-off-by: Li Zhong
    Reported-by: John Allen
    Acked-by: Vlastimil Babka
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhong
     

12 Aug, 2016

1 commit

  • The following oops occurs after a pgdat is hotadded:

    Unable to handle kernel paging request for data at address 0x00c30001
    Faulting instruction address: 0xc00000000022f8f4
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
    CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.8.0-rc1-device #110
    task: c000000000ef3080 task.stack: c000000000f6c000
    NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
    REGS: c000000000f6fa50 TRAP: 0300 Tainted: G W (4.8.0-rc1-device)
    MSR: 800000010280b033 CR: 84002028 XER: 20000000
    CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
    NIP refresh_cpu_vm_stats+0x1a4/0x2f0
    LR refresh_cpu_vm_stats+0x1f8/0x2f0
    Call Trace:
    refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)

    Add per_cpu_nodestats initialization to the hotplug codepath.

    Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Cc: Mel Gorman
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     

29 Jul, 2016

3 commits

  • If we offline a node, allocate the new page from a nearest neighbor
    node instead of the current node or other remote nodes, because
    re-migrating is a waste of time and the distance to the remote nodes is
    often very large.

    Also use GFP_HIGHUSER_MOVABLE to alloc new page if the zone is movable
    zone or highmem zone.

    Link: http://lkml.kernel.org/r/5795E18B.5060302@huawei.com
    Signed-off-by: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • kswapd goes through some complex steps trying to figure out if it should
    stay awake based on the classzone_idx and the requested order. It is
    unnecessarily complex and passes in an invalid classzone_idx to
    balance_pgdat(). What matters most of all is whether a larger order has
    been requested and whether kswapd successfully reclaimed at the previous
    order. This patch irons out the logic to check just that and the end
    result is less headache inducing.

    Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters but the
    retry logic uses the zone counters, which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate; the potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

2 commits

  • When memory is onlined, we are only able to rezone from ZONE_MOVABLE to
    ZONE_KERNEL, or from (ZONE_MOVABLE - 1) to ZONE_MOVABLE.

    To be more flexible, use the following criteria instead; to online
    memory from zone X into zone Y,

    * Any zones between X and Y must be unused.
    * If X is lower than Y, the onlined memory must lie at the end of X.
    * If X is higher than Y, the onlined memory must lie at the start of X.

    Add zone_can_shift() to make this determination.
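
    The criteria can be illustrated with a small standalone function (plain
    integers and a toy zone table, not the kernel's data structures):

    #include <stdbool.h>

    /* Toy model: zone i spans pfns [start, end) and may be unused (empty). */
    struct zinfo { unsigned long start, end; bool used; };

    /* Can pfn range [start, start + nr) be onlined from zone x into zone y? */
    bool can_shift(const struct zinfo *z, int x, int y,
                   unsigned long start, unsigned long nr)
    {
            int lo = (x < y) ? x + 1 : y + 1;
            int hi = (x < y) ? y - 1 : x - 1;

            for (int i = lo; i <= hi; i++)      /* zones strictly between X and Y */
                    if (z[i].used)
                            return false;
            if (x < y)                          /* range must sit at the end of X */
                    return start + nr == z[x].end;
            if (x > y)                          /* range must sit at the start of X */
                    return start == z[x].start;
            return true;                        /* x == y: nothing to shift */
    }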

    Link: http://lkml.kernel.org/r/1462816419-4479-3-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Greg Kroah-Hartman
    Cc: Daniel Kiper
    Cc: Dan Williams
    Cc: Vlastimil Babka
    Cc: Tang Chen
    Cc: Joonsoo Kim
    Cc: David Vrabel
    Cc: Vitaly Kuznetsov
    Cc: David Rientjes
    Cc: Andrew Banman
    Cc: Chen Yucong
    Cc: Yasunori Goto
    Cc: Zhang Zhen
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     
  • Add move_pfn_range(), a wrapper to call move_pfn_range_left() or
    move_pfn_range_right().

    No functional change. This will be utilized by a later patch.

    Link: http://lkml.kernel.org/r/1462816419-4479-2-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Greg Kroah-Hartman
    Cc: Daniel Kiper
    Cc: Dan Williams
    Cc: Vlastimil Babka
    Cc: Tang Chen
    Cc: Joonsoo Kim
    Cc: David Vrabel
    Cc: Vitaly Kuznetsov
    Cc: David Rientjes
    Cc: Andrew Banman
    Cc: Chen Yucong
    Cc: Yasunori Goto
    Cc: Zhang Zhen
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     

28 May, 2016

2 commits

  • The register_page_bootmem_info_node() function needs to be marked __init
    in order to avoid a new warning introduced by commit f65e91df25aa ("mm:
    use early_pfn_to_nid in register_page_bootmem_info_node").

    Otherwise you'll get a warning about how a non-init function calls
    early_pfn_to_nid (which is __meminit)

    Cc: Yang Shi
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • register_page_bootmem_info_node() is invoked in mem_init(), so it will
    be called before page_alloc_init_late() if DEFERRED_STRUCT_PAGE_INIT is
    enabled. But, pfn_to_nid() depends on memmap which won't be fully setup
    until page_alloc_init_late() is done, so replace pfn_to_nid() by
    early_pfn_to_nid().

    Link: http://lkml.kernel.org/r/1464210007-30930-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

20 May, 2016

3 commits

  • CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE specifies the default value for the
    memory hotplug onlining policy. Add a command line parameter to make it
    possible to override the default. It may come in handy for debugging
    and testing purposes.

    Signed-off-by: Vitaly Kuznetsov
    Cc: Jonathan Corbet
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: David Vrabel
    Cc: David Rientjes
    Cc: Igor Mammedov
    Cc: Lennart Poettering
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • This patchset continues the work I started with commit 31bc3858ea3e
    ("memory-hotplug: add automatic onlining policy for the newly added
    memory").

    Initially I was going to stop there and bring the policy setting logic
    to userspace. I met two issues on this way:

    1) It is possible to have memory hotplugged at boot (e.g. with QEMU).
    These blocks stay offlined if we turn the onlining policy on by
    userspace.

    2) My attempt to bring this policy setting to systemd failed, systemd
    maintainers suggest to change the default in kernel or ... to use
    tmpfiles.d to alter the policy (which looks like a hack to me):
    https://github.com/systemd/systemd/pull/2938

    Here I suggest to add a config option to set the default value for the
    policy and a kernel command line parameter to make the override.

    This patch (of 2):

    Introduce config option to set the default value for memory hotplug
    onlining policy (/sys/devices/system/memory/auto_online_blocks). The
    reasons one would want to turn this option on are to have early onlining
    for hotpluggable memory available at boot and to not require any
    userspace actions to make memory hotplug work.

    [akpm@linux-foundation.org: tweak Kconfig text]
    Signed-off-by: Vitaly Kuznetsov
    Cc: Jonathan Corbet
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: David Vrabel
    Cc: David Rientjes
    Cc: Igor Mammedov
    Cc: Lennart Poettering
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • Make is_mem_section_removable() return bool to improve readability,
    since this particular function only uses either one or zero as its
    return value.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

18 Mar, 2016

5 commits

  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • online_pages() simply returns an error value if
    memory_notify(MEM_GOING_ONLINE, &arg) returns a value that is not what we
    want for successfully onlining target pages. This patch aims to print
    more failure information in online_pages(), like offline_pages() does.

    This patch also converts printk(KERN_<level>) to pr_<level>(), and moves
    __offline_pages() to not print failure information with KERN_INFO
    according to David Rientjes's suggestion[1].

    [1] https://lkml.org/lkml/2016/2/24/1094

    Signed-off-by: Chen Yucong
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
  • The success of CMA allocation largely depends on the success of
    migration, and a key factor of it is the page reference count. Until
    now, page references have been manipulated by directly calling atomic
    functions, so we cannot track who manipulates them and where. That makes
    it hard to find the actual reason for a CMA allocation failure. CMA
    allocation should be guaranteed to succeed, so finding the offending
    place is really important.

    In this patch, call sites where the page reference is manipulated are
    converted to the newly introduced wrapper functions. This is a
    preparation step for adding a tracepoint to each page reference
    manipulation function. With this facility, we can easily find the reason
    for a CMA allocation failure. There is no functional change in this
    patch.
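
    The wrapper pattern can be sketched as follows (illustrative only; the
    real helpers live in a dedicated header and gain tracepoints in a later
    patch):

    /* Route every reference-count access through one helper so a tracepoint
     * can later be attached in a single place. */
    static inline int page_ref_count(struct page *page)
    {
            return atomic_read(&page->_count);
    }

    static inline void page_ref_inc(struct page *page)
    {
            atomic_inc(&page->_count);
            /* a trace_page_ref_inc()-style tracepoint would hook in here */
    }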

    In addition, this patch also converts reference read sites. It will
    help a second step that renames page._count to something else and
    prevents later attempts to access it directly (suggested by Andrew).

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We can reuse the nid we've determined instead of repeated pfn_to_nid()
    usages. Also zone_to_nid() should be a bit cheaper in general than
    pfn_to_nid().

    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Memory compaction can be currently performed in several contexts:

    - kswapd balancing a zone after a high-order allocation failure
    - direct compaction to satisfy a high-order allocation, including THP
    page fault attempts
    - khugepaged trying to collapse a hugepage
    - manually from /proc

    The purpose of compaction is two-fold. The obvious purpose is to
    satisfy a (pending or future) high-order allocation, and is easy to
    evaluate. The other purpose is to keep overall memory fragmentation low
    and help the anti-fragmentation mechanism. The success wrt the latter
    purpose is more difficult to evaluate.

    The current situation wrt the purposes has a few drawbacks:

    - compaction is invoked only when a high-order page or hugepage is not
    available (or manually). This might be too late for the purposes of
    keeping memory fragmentation low.
    - direct compaction increases latency of allocations. Again, it would
    be better if compaction was performed asynchronously to keep
    fragmentation low, before the allocation itself comes.
    - (a special case of the previous) the cost of compaction during THP
    page faults can easily offset the benefits of THP.
    - kswapd compaction appears to be complex, fragile and not working in
    some scenarios. It could also end up compacting for a high-order
    allocation request when it should be reclaiming memory for a later
    order-0 request.

    To improve the situation, we should be able to benefit from an
    equivalent of kswapd, but for compaction - i.e. a background thread
    which responds to fragmentation and the need for high-order allocations
    (including hugepages) somewhat proactively.

    One possibility is to extend the responsibilities of kswapd, which could
    however complicate its design too much. It should be better to let
    kswapd handle reclaim, as order-0 allocations are often more critical
    than high-order ones.

    Another possibility is to extend khugepaged, but this kthread is a
    single instance and tied to THP configs.

    This patch goes with the option of a new set of per-node kthreads called
    kcompactd, and lays the foundations, without introducing any new
    tunables. The lifecycle mimics kswapd kthreads, including the memory
    hotplug hooks.
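
    A hedged sketch of the per-node kthread shape (the field and helper
    names are assumptions modelled on the kswapd analogue):

    /* One kcompactd kthread per node, parked until background compaction is
     * requested for that node; only sync compaction is used. */
    static int kcompactd(void *p)
    {
            pg_data_t *pgdat = (pg_data_t *)p;

            set_freezable();
            while (!kthread_should_stop()) {
                    wait_event_freezable(pgdat->kcompactd_wait,
                                         kcompactd_work_requested(pgdat));
                    kcompactd_do_work(pgdat);
            }
            return 0;
    }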

    For compaction, kcompactd uses the standard compaction_suitable() and
    compact_finished() criteria and the deferred compaction functionality.
    Unlike direct compaction, it uses only sync compaction, as there's no
    allocation latency to minimize.

    This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
    compact/reclaim loop for high-order pages will be replaced by waking up
    kcompactd in the next patch with the description of what's wrong with
    the old approach.

    Waking up of the kcompactd threads is also tied to kswapd activity and
    follows these rules:
    - we don't want to affect any fastpaths, so wake up kcompactd only from
    the slowpath, as it's done for kswapd
    - if kswapd is doing reclaim, it's more important than compaction, so
    don't invoke kcompactd until kswapd goes to sleep
    - the target order used for kswapd is passed to kcompactd

    Possible future uses for kcompactd include the ability to wake up
    kcompactd on demand in special situations, such as when hugepages are
    not available (currently not done due to __GFP_NO_KSWAPD) or when a
    fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
    possible to perform periodic compaction with kcompactd.

    [arnd@arndb.de: fix build errors with kcompactd]
    [paul.gortmaker@windriver.com: don't use modular references for non modular code]
    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul Gortmaker
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

16 Mar, 2016

2 commits

  • There is a report of a performance drop due to hugepage allocation,
    where half of the cpu time is spent in pageblock_pfn_to_page() during
    compaction [1].

    In that workload, compaction is triggered to make hugepages, but most
    pageblocks are unavailable for compaction due to the pageblock type and
    skip bit, so compaction usually fails. The most costly operation in this
    case is finding a valid pageblock while scanning the whole zone range.
    To check whether a pageblock is valid to compact, a valid pfn within the
    pageblock is required, and we can obtain it by calling
    pageblock_pfn_to_page(). This function checks whether the pageblock is
    in a single zone and returns a valid pfn if possible. The problem is
    that we need to do this check every time before scanning a pageblock,
    even if we re-visit it, and this turns out to be very expensive in this
    workload.

    Although we have no way to skip this pageblock check on systems where
    holes exist at arbitrary positions, we can use a cached value for zone
    continuity and just do pfn_to_page() on systems where no holes exist.
    This optimization considerably speeds up the above workload.
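
    The caching idea reduces to a thin wrapper, sketched below (simplified;
    the slow path and the code that sets and clears zone->contiguous are
    omitted):

    /* If the zone is known to be hole-free (cached in zone->contiguous),
     * skip the expensive per-pageblock validity walk. */
    static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
                                                     unsigned long end_pfn,
                                                     struct zone *zone)
    {
            if (zone->contiguous)
                    return pfn_to_page(start_pfn);

            return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
    }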

    Before vs After
    Max: 1096 MB/s vs 1325 MB/s
    Min:  635 MB/s vs 1015 MB/s
    Avg:  899 MB/s vs 1194 MB/s

    Avg is improved by roughly 30% [2].

    [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
    [2]: https://lkml.org/lkml/2015/12/9/23

    [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
    Signed-off-by: Joonsoo Kim
    Reported-by: Aaron Lu
    Acked-by: Vlastimil Babka
    Tested-by: Aaron Lu
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, all newly added memory blocks remain in 'offline' state
    unless someone onlines them; some Linux distributions carry special udev
    rules like:

    SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

    to make this happen automatically. This is not a great solution for
    virtual machines where memory hotplug is being used to address high
    memory pressure situations as such onlining is slow and a userspace
    process doing this (udev) has a chance of being killed by the OOM killer
    as it will probably require to allocate some memory.

    Introduce default policy for the newly added memory blocks in
    /sys/devices/system/memory/auto_online_blocks file with two possible
    values: "offline" which preserves the current behavior and "online"
    which causes all newly added memory blocks to go online as soon as
    they're added. The default is "offline".

    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Daniel Kiper
    Cc: Jonathan Corbet
    Cc: Greg Kroah-Hartman
    Cc: Daniel Kiper
    Cc: Dan Williams
    Cc: Tang Chen
    Cc: David Vrabel
    Acked-by: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: Mel Gorman
    Cc: "K. Y. Srinivasan"
    Cc: Igor Mammedov
    Cc: Kay Sievers
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     

30 Jan, 2016

1 commit

  • Set IORESOURCE_SYSTEM_RAM in struct resource.flags of "System
    RAM" entries.

    Signed-off-by: Toshi Kani
    Signed-off-by: Borislav Petkov
    Acked-by: David Vrabel # xen
    Cc: Andrew Banman
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: David Rientjes
    Cc: Denys Vlasenko
    Cc: Gu Zheng
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Luis R. Rodriguez
    Cc: Mel Gorman
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Tang Chen
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1453841853-11383-9-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Toshi Kani
     

16 Jan, 2016

1 commit

  • In support of providing struct page for large persistent memory
    capacities, use struct vmem_altmap to change the default policy for
    allocating memory for the memmap array. The default vmemmap_populate()
    allocates page table storage area from the page allocator. Given
    persistent memory capacities relative to DRAM it may not be feasible to
    store the memmap in 'System Memory'. Instead vmem_altmap represents
    pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
    requests.

    Signed-off-by: Dan Williams
    Reported-by: kbuild test robot
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

15 Jan, 2016

1 commit

  • An out-of-memory condition is not a bug, and while we can't add new
    memory in such a case, crashing the system seems wrong. Propagating the
    return value from register_memory_resource() requires an interface
    change.

    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Igor Mammedov
    Acked-by: David Rientjes
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: Sheng Yong
    Cc: Zhu Guihua
    Cc: Dan Williams
    Cc: David Vrabel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     

30 Dec, 2015

1 commit

  • test_pages_in_a_zone() does not account for the possibility of missing
    sections in the given pfn range. pfn_valid_within always returns 1 when
    CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
    sections to pass the test, leading to a kernel oops.

    Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
    for missing sections before proceeding into the zone-check code.

    This also prevents a crash from offlining memory devices with missing
    sections. Despite this, it may be a good idea to keep the related patch
    '[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
    missing sections' because missing sections in a memory block may lead to
    other problems not covered by the scope of this fix.

    Signed-off-by: Andrew Banman
    Acked-by: Alex Thorlton
    Cc: Russ Anderson
    Cc: Alex Thorlton
    Cc: Yinghai Lu
    Cc: Greg KH
    Cc: Seth Jennings
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Banman