09 Feb, 2017

2 commits

  • commit a96dfddbcc04336bbed50dc2b24823e45e09e80c upstream.

    Reading a sysfs "memoryN/valid_zones" file leads to the following oops
    when the first page of a range is not backed by struct page.
    show_valid_zones() assumes that 'start_pfn' is always valid for
    page_zone().

    BUG: unable to handle kernel paging request at ffffea017a000000
    IP: show_valid_zones+0x6f/0x160

    This issue may happen on x86-64 systems with 64GiB or more memory since
    their memory block size is bumped up to 2GiB. [1] An example of such
    systems is described below. 0x3240000000 is only aligned by 1GiB and
    this memory block starts from 0x3200000000, which is not backed by
    struct page.

    BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable

    Since test_pages_in_a_zone() already checks holes, fix this issue by
    extending this function to return 'valid_start' and 'valid_end' for a
    given range. show_valid_zones() then proceeds with the valid range.
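
    To make the clamping concrete, here is a minimal, self-contained C
    sketch (not kernel code; helper names are illustrative) of the
    arithmetic for the example above: a 2GiB memory block whose start is
    below the first byte of usable RAM, and the valid [start, end) pfn
    range that the extended test_pages_in_a_zone() would report.

        #include <stdio.h>
        #include <stdint.h>

        #define PAGE_SHIFT   12
        #define BLOCK_BYTES  (2ULL << 30)   /* 2GiB memory block on large x86-64 */

        int main(void)
        {
            uint64_t ram_start   = 0x3240000000ULL;                /* first usable byte (e820) */
            uint64_t block_start = ram_start & ~(BLOCK_BYTES - 1); /* 0x3200000000 */
            uint64_t block_end   = block_start + BLOCK_BYTES;

            /* pfns covered by the sysfs memory block */
            uint64_t start_pfn = block_start >> PAGE_SHIFT;
            uint64_t end_pfn   = block_end >> PAGE_SHIFT;

            /* pages below ram_start have no struct page, so page_zone(start_pfn) oopses */
            uint64_t valid_start = ram_start >> PAGE_SHIFT;
            uint64_t valid_end   = end_pfn;

            printf("block pfns [%#llx, %#llx)\n",
                   (unsigned long long)start_pfn, (unsigned long long)end_pfn);
            printf("valid pfns [%#llx, %#llx)\n",
                   (unsigned long long)valid_start, (unsigned long long)valid_end);
            return 0;
        }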

    [1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

    Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Cc: Greg Kroah-Hartman
    Cc: Zhang Zhen
    Cc: Reza Arbab
    Cc: David Rientjes
    Cc: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • commit deb88a2a19e85842d79ba96b05031739ec327ff4 upstream.

    Patch series "fix a kernel oops when reading sysfs valid_zones", v2.

    A sysfs memory file is created for each 2GiB memory block on x86-64 when
    the system has 64GiB or more memory. [1] When the start address of a
    memory block is not backed by struct page, i.e. a memory range is not
    aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
    kernel oops. This issue was observed on multiple x86-64 systems with
    more than 64GiB of memory. This patch-set fixes this issue.

    Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
    test the start section.

    Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
    to return valid [start, end).

    Note for stable kernels: The memory block size change was made by commit
    bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
    systems"), which was accepted to 3.9. However, this patch-set depends
    on (and fixes) the change to test_pages_in_a_zone() made by commit
    5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
    test_pages_in_a_zone()"), which was accepted to 4.4.

    So, I recommend that we backport it up to 4.4.

    [1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

    This patch (of 2):

    test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
    section since 'sec_end_pfn' is set equal to 'pfn'. Since this function
    is called for testing the range of a sysfs memory file, 'start_pfn' is
    always aligned by section.

    Fix it by properly setting 'sec_end_pfn' to the next section pfn.

    Also make sure that this function returns 1 only when the range belongs
    to a zone.
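
    As a minimal illustration of the off-by-one (plain C, with the kernel's
    SECTION_ALIGN_UP() redefined locally so the example is self-contained;
    PAGES_PER_SECTION is 32768 for 128MiB sections and 4KiB pages): rounding
    a section-aligned start_pfn up to a section boundary yields start_pfn
    itself, so the first section is never tested, while rounding pfn + 1 up
    yields the next boundary.

        #include <stdio.h>

        #define PAGES_PER_SECTION  (1UL << 15)
        #define SECTION_ALIGN_UP(pfn) \
                (((pfn) + PAGES_PER_SECTION - 1) & ~(PAGES_PER_SECTION - 1))

        int main(void)
        {
            unsigned long start_pfn = 0x3240000;                   /* section-aligned */
            unsigned long old = SECTION_ALIGN_UP(start_pfn);       /* == start_pfn */
            unsigned long fix = SECTION_ALIGN_UP(start_pfn + 1);   /* next boundary */

            printf("old sec_end_pfn %#lx: first section gets %lu pages tested\n",
                   old, old - start_pfn);
            printf("new sec_end_pfn %#lx: first section gets %lu pages tested\n",
                   fix, fix - start_pfn);
            return 0;
        }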

    Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Cc: Andrew Banman
    Cc: Reza Arbab
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Toshi Kani
     

01 Feb, 2017

1 commit

  • commit 8a1f780e7f28c7c1d640118242cf68d528c456cd upstream.

    online_{kernel|movable} is used to change the memory zone to
    ZONE_{NORMAL|MOVABLE} and online the memory.

    To check whether the memory zone can be changed, zone_can_shift() is
    used. Currently the function returns a negative integer, a positive
    integer, or 0. When the function returns a negative or positive value,
    it means that the memory zone can be changed to ZONE_{NORMAL|MOVABLE}.

    But when the function returns 0, there are two meanings.

    One of the meanings is that the memory zone does not need to be changed.
    For example, when memory is in ZONE_NORMAL and onlined by online_kernel
    the memory zone does not need to be changed.

    Another meaning is that the memory zone cannot be changed. When memory
    is in ZONE_NORMAL and onlined by online_movable, the memory zone may
    not be changed to ZONE_MOVABLE due to memory online limitations (see
    Documentation/memory-hotplug.txt). In this case, memory must not be
    onlined.

    The patch changes the return type of zone_can_shift() so that memory
    online operation fails when memory zone cannot be changed as follows:

    Before applying patch:
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320
    # echo online_movable > memory4097/state
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 8388608
    managed 8388608

    The online_movable operation succeeded, but the memory was onlined as
    ZONE_NORMAL, not ZONE_MOVABLE.

    After applying patch:
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320
    # echo online_movable > memory4097/state
    bash: echo: write error: Invalid argument
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320

    The online_movable operation failed because the memory zone could not
    be changed from ZONE_NORMAL to ZONE_MOVABLE.
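
    A small self-contained C model of the interface change (not the kernel
    source; the zone names and the shiftability rule below are simplified
    stand-ins): the ambiguous tri-state integer return is replaced by a
    boolean success flag plus an out-parameter, so "cannot shift" is no
    longer conflated with "no shift needed".

        #include <stdbool.h>
        #include <stdio.h>

        enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_MOVABLE };

        /* Stand-in rule: only shifts between adjacent zones are allowed. */
        static bool zone_can_shift(enum zone_type cur, enum zone_type target,
                                   int *zone_shift)
        {
            int shift = (int)target - (int)cur;

            if (shift < -1 || shift > 1)
                return false;       /* cannot shift: the online must fail */

            *zone_shift = shift;    /* may be 0: already in the right zone */
            return true;
        }

        int main(void)
        {
            int shift;

            if (!zone_can_shift(ZONE_DMA, ZONE_MOVABLE, &shift))
                puts("online_movable: -EINVAL (zone cannot be changed)");
            if (zone_can_shift(ZONE_NORMAL, ZONE_MOVABLE, &shift))
                printf("online_movable ok, zone_shift=%d\n", shift);
            return 0;
        }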

    Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online")
    Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Reviewed-by: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yasuaki Ishimatsu
     

28 Oct, 2016

2 commits

  • When I removed the per-zone bitlock hashed waitqueues in commit
    9dcb8b685fc3 ("mm: remove per-zone hashtable of bitlock waitqueues"), I
    removed all the magic hotplug memory initialization of said waitqueues
    too.

    But when I actually _tested_ the resulting build, I stupidly assumed
    that "allmodconfig" would enable memory hotplug. And it doesn't,
    because it enables KASAN instead, which then disables hotplug memory
    support.

    As a result, my build test of the per-zone waitqueues was totally
    broken, and I didn't notice that the compiler warns about the now unused
    iterator variable 'i'.

    I guess I should be happy that that seems to be the worst breakage from
    my clearly horribly failed test coverage.

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: the zone lookup
    also hurts the regular use of "unlock_page()" a lot, because the zone
    lookup ends up forcing several unnecessary cache misses and generates
    horrible code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Oct, 2016

1 commit

  • In dissolve_free_huge_pages(), free hugepages will be dissolved without
    making sure that there are enough of them left to satisfy hugepage
    reservations.

    Fix this by adding a return value to dissolve_free_huge_pages() and
    checking h->free_huge_pages vs. h->resv_huge_pages. Note that this may
    lead to the situation where dissolve_free_huge_page() returns an error
    and all free hugepages that were dissolved before that error are lost,
    while the memory block still cannot be set offline.
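
    A minimal model in plain C (not the hugetlb code; field and error names
    follow the commit message, the rest is illustrative) of the added check:
    refuse to dissolve free hugepages when that would dip into the pool
    backing outstanding reservations.

        #include <stdio.h>

        #define EBUSY 16

        struct hstate_model {
            unsigned long free_huge_pages;   /* free pages in the pool     */
            unsigned long resv_huge_pages;   /* pages promised to mappings */
        };

        /* Dissolve 'nr' free hugepages unless reservations would be violated. */
        static int dissolve_free_huge_pages_model(struct hstate_model *h,
                                                  unsigned long nr)
        {
            if (h->free_huge_pages - h->resv_huge_pages < nr)
                return -EBUSY;      /* offlining the memory block must fail */

            h->free_huge_pages -= nr;
            return 0;
        }

        int main(void)
        {
            struct hstate_model h = { .free_huge_pages = 4, .resv_huge_pages = 3 };

            printf("dissolve 1 -> %d\n", dissolve_free_huge_pages_model(&h, 1));
            printf("dissolve 2 -> %d\n", dissolve_free_huge_pages_model(&h, 2));
            return 0;
        }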

    Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com
    Signed-off-by: Gerald Schaefer
    Acked-by: Michal Hocko
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K . V"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rui Teng
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

29 Sep, 2016

1 commit

  • 9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
    prevents allocating from an empty nodemask, but as David points out, it is
    still wrong. As node_online_map may include memoryless nodes, allocating
    only from such nodes is meaningless.

    This patch uses node_states[N_MEMORY] mask to prevent the above case.
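
    A self-contained sketch of the nodemask handling (node_states[N_MEMORY],
    nodes_weight() and node_clear() are the real kernel names; the plain
    bitmask and helpers below are stand-ins for nodemask_t): start from the
    nodes that actually have memory and drop the node being offlined only if
    another candidate remains.

        #include <stdio.h>

        typedef unsigned long nodemask_model_t;     /* one bit per node */

        static int nodes_weight_model(nodemask_model_t m)
        {
            return __builtin_popcountl(m);
        }

        /* Mask that new_node_page() should hand to the allocator. */
        static nodemask_model_t pick_allocation_mask(nodemask_model_t nodes_with_memory,
                                                     int offlining_nid)
        {
            nodemask_model_t nmask = nodes_with_memory;   /* node_states[N_MEMORY] */

            /* Clear the node being offlined only if some other node remains. */
            if (nodes_weight_model(nmask) > 1)
                nmask &= ~(1UL << offlining_nid);         /* node_clear(nid, nmask) */

            return nmask;
        }

        int main(void)
        {
            /* nodes 0 and 2 have memory, node 1 is memoryless; offline node 0 */
            printf("mask = %#lx\n", pick_allocation_mask((1UL << 0) | (1UL << 2), 0));

            /* only node 0 has memory: keep it, or the nodemask would be empty */
            printf("mask = %#lx\n", pick_allocation_mask(1UL << 0, 0));
            return 0;
        }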

    Fixes: 9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
    Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
    Link: http://lkml.kernel.org/r/1474447117.28370.6.camel@TP420
    Signed-off-by: Li Zhong
    Suggested-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: John Allen
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhong
     

20 Sep, 2016

1 commit

  • Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest
    neighbor node when mem-offline") introduced new_node_page() for memory
    hotplug.

    In new_node_page(), the nid is cleared before calling
    __alloc_pages_nodemask(). But if it is the only node of the system, and
    the first round allocation fails, it will not be able to get memory from
    an empty nodemask, and will trigger oom.

    The patch checks whether it is the last node in the system, and if it
    is, does not clear the nid in the nodemask.

    Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
    Link: http://lkml.kernel.org/r/1473044391.4250.19.camel@TP420
    Signed-off-by: Li Zhong
    Reported-by: John Allen
    Acked-by: Vlastimil Babka
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhong
     

12 Aug, 2016

1 commit

  • The following oops occurs after a pgdat is hotadded:

    Unable to handle kernel paging request for data at address 0x00c30001
    Faulting instruction address: 0xc00000000022f8f4
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
    CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.8.0-rc1-device #110
    task: c000000000ef3080 task.stack: c000000000f6c000
    NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
    REGS: c000000000f6fa50 TRAP: 0300 Tainted: G W (4.8.0-rc1-device)
    MSR: 800000010280b033 CR: 84002028 XER: 20000000
    CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
    NIP refresh_cpu_vm_stats+0x1a4/0x2f0
    LR refresh_cpu_vm_stats+0x1f8/0x2f0
    Call Trace:
    refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)

    Add per_cpu_nodestats initialization to the hotplug codepath.

    Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Cc: Mel Gorman
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     

29 Jul, 2016

3 commits

  • If we offline a node, allocate the new page from a nearest neighbor node
    instead of the current node or other remote nodes, because re-migration
    is a waste of time and the distance to remote nodes is often very
    large.

    Also use GFP_HIGHUSER_MOVABLE to allocate the new page if the zone is a
    movable zone or highmem zone.
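
    A rough, self-contained model of the node choice in plain C (the kernel
    walks a nodemask with next_node_in(); the array and function name below
    are stand-ins): pick the next node id, wrapping around, that still has
    memory.

        #include <stdbool.h>
        #include <stdio.h>

        #define MAX_NUMNODES 4

        static int pick_target_node(const bool has_memory[MAX_NUMNODES],
                                    int offlining_nid)
        {
            for (int step = 1; step < MAX_NUMNODES; step++) {
                int nid = (offlining_nid + step) % MAX_NUMNODES;

                if (has_memory[nid])
                    return nid;   /* allocate here, with GFP_HIGHUSER_MOVABLE
                                     if the page was in a movable/highmem zone */
            }
            return -1;            /* no other node with memory */
        }

        int main(void)
        {
            bool has_memory[MAX_NUMNODES] = { true, false, true, true };

            printf("offline node 2 -> migrate to node %d\n",
                   pick_target_node(has_memory, 2));
            return 0;
        }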

    Link: http://lkml.kernel.org/r/5795E18B.5060302@huawei.com
    Signed-off-by: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • kswapd goes through some complex steps trying to figure out if it should
    stay awake based on the classzone_idx and the requested order. It is
    unnecessarily complex and passes in an invalid classzone_idx to
    balance_pgdat(). What matters most of all is whether a larger order has
    been requested and whether kswapd successfully reclaimed at the previous
    order. This patch irons out the logic to check just that and the end
    result is less headache inducing.

    Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being per-node, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

2 commits

  • When memory is onlined, we are only able to rezone from ZONE_MOVABLE to
    ZONE_KERNEL, or from (ZONE_MOVABLE - 1) to ZONE_MOVABLE.

    To be more flexible, use the following criteria instead; to online
    memory from zone X into zone Y,

    * Any zones between X and Y must be unused.
    * If X is lower than Y, the onlined memory must lie at the end of X.
    * If X is higher than Y, the onlined memory must lie at the start of X.

    Add zone_can_shift() to make this determination, as sketched below.
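
    A simplified, self-contained model of the three criteria (plain C; zone
    spans are bare [start, end) pfn pairs, and locking, sparse sections and
    the real struct zone are omitted):

        #include <stdbool.h>
        #include <stdio.h>

        struct zone_span { unsigned long start_pfn, end_pfn; };   /* [start, end) */

        static bool zone_is_empty_model(const struct zone_span *z)
        {
            return z->start_pfn == z->end_pfn;
        }

        /* Can [pfn, pfn + nr_pages) be moved from zone index x to zone index y? */
        static bool range_can_shift(const struct zone_span *zones, int x, int y,
                                    unsigned long pfn, unsigned long nr_pages)
        {
            int lo = x < y ? x : y, hi = x < y ? y : x;

            for (int i = lo + 1; i < hi; i++)
                if (!zone_is_empty_model(&zones[i]))
                    return false;           /* a zone in between is in use */

            if (x < y)
                return pfn + nr_pages == zones[x].end_pfn;  /* must end X   */
            if (x > y)
                return pfn == zones[x].start_pfn;           /* must start X */
            return true;                                    /* same zone    */
        }

        int main(void)
        {
            /* zone 0 = Normal [0x1000, 0x3000), zone 1 = Movable (empty) */
            struct zone_span zones[] = { { 0x1000, 0x3000 }, { 0x3000, 0x3000 } };

            printf("%d\n", range_can_shift(zones, 0, 1, 0x2800, 0x800));  /* 1 */
            printf("%d\n", range_can_shift(zones, 0, 1, 0x1000, 0x800));  /* 0 */
            return 0;
        }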

    Link: http://lkml.kernel.org/r/1462816419-4479-3-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Greg Kroah-Hartman
    Cc: Daniel Kiper
    Cc: Dan Williams
    Cc: Vlastimil Babka
    Cc: Tang Chen
    Cc: Joonsoo Kim
    Cc: David Vrabel
    Cc: Vitaly Kuznetsov
    Cc: David Rientjes
    Cc: Andrew Banman
    Cc: Chen Yucong
    Cc: Yasunori Goto
    Cc: Zhang Zhen
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     
  • Add move_pfn_range(), a wrapper to call move_pfn_range_left() or
    move_pfn_range_right().

    No functional change. This will be utilized by a later patch.

    Link: http://lkml.kernel.org/r/1462816419-4479-2-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Greg Kroah-Hartman
    Cc: Daniel Kiper
    Cc: Dan Williams
    Cc: Vlastimil Babka
    Cc: Tang Chen
    Cc: Joonsoo Kim
    Cc: David Vrabel
    Cc: Vitaly Kuznetsov
    Cc: David Rientjes
    Cc: Andrew Banman
    Cc: Chen Yucong
    Cc: Yasunori Goto
    Cc: Zhang Zhen
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     

28 May, 2016

2 commits

  • The register_page_bootmem_info_node() function needs to be marked __init
    in order to avoid a new warning introduced by commit f65e91df25aa ("mm:
    use early_pfn_to_nid in register_page_bootmem_info_node").

    Otherwise you'll get a warning about how a non-init function calls
    early_pfn_to_nid (which is __meminit)

    Cc: Yang Shi
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • register_page_bootmem_info_node() is invoked in mem_init(), so it will
    be called before page_alloc_init_late() if DEFERRED_STRUCT_PAGE_INIT is
    enabled. But, pfn_to_nid() depends on memmap which won't be fully setup
    until page_alloc_init_late() is done, so replace pfn_to_nid() by
    early_pfn_to_nid().

    Link: http://lkml.kernel.org/r/1464210007-30930-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

20 May, 2016

3 commits

  • CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE specifies the default value for the
    memory hotplug onlining policy. Add a command line parameter to make it
    possible to override the default. It may come handy for debug and
    testing purposes.

    Signed-off-by: Vitaly Kuznetsov
    Cc: Jonathan Corbet
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: David Vrabel
    Cc: David Rientjes
    Cc: Igor Mammedov
    Cc: Lennart Poettering
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • This patchset continues the work I started with commit 31bc3858ea3e
    ("memory-hotplug: add automatic onlining policy for the newly added
    memory").

    Initially I was going to stop there and bring the policy-setting logic
    to userspace. I ran into two issues along the way:

    1) It is possible to have memory hotplugged at boot (e.g. with QEMU).
    These blocks stay offline if we turn the onlining policy on from
    userspace.

    2) My attempt to bring this policy setting to systemd failed; systemd
    maintainers suggest changing the default in the kernel or ... using
    tmpfiles.d to alter the policy (which looks like a hack to me):
    https://github.com/systemd/systemd/pull/2938

    Here I suggest adding a config option to set the default value for the
    policy and a kernel command line parameter to override it.

    This patch (of 2):

    Introduce config option to set the default value for memory hotplug
    onlining policy (/sys/devices/system/memory/auto_online_blocks). The
    reasons one would want to turn this option on are to have early onlining
    for hotpluggable memory available at boot and to avoid requiring any
    userspace action to make memory hotplug work.

    [akpm@linux-foundation.org: tweak Kconfig text]
    Signed-off-by: Vitaly Kuznetsov
    Cc: Jonathan Corbet
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: David Vrabel
    Cc: David Rientjes
    Cc: Igor Mammedov
    Cc: Lennart Poettering
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • Make is_mem_section_removable() return bool to improve readability due
    to this particular function only using either one or zero as its return
    value.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

18 Mar, 2016

5 commits

  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • online_pages() simply returns an error value if
    memory_notify(MEM_GOING_ONLINE, &arg) returns a value that is not what we
    want for successfully onlining the target pages. This patch aims to print
    more failure information in online_pages(), like offline_pages() does.

    This patch also converts printk(KERN_<level>) to pr_<level>(), and moves
    __offline_pages() away from printing failure information with KERN_INFO,
    according to David Rientjes's suggestion [1].

    [1] https://lkml.org/lkml/2016/2/24/1094

    Signed-off-by: Chen Yucong
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
  • The success of CMA allocation largely depends on the success of
    migration, and a key factor of that is the page reference count. Until
    now, the page reference count has been manipulated by directly calling
    atomic functions, so we cannot track who manipulates it and where. That
    makes it hard to find the actual reason for a CMA allocation failure.
    CMA allocation should be guaranteed to succeed, so finding the offending
    place is really important.

    In this patch, call sites where the page reference count is manipulated
    are converted to the newly introduced wrapper functions. This is a
    preparation step for adding a tracepoint to each page reference
    manipulation function. With this facility, we can easily find the reason
    for a CMA allocation failure. There is no functional change in this
    patch.

    In addition, this patch also converts reference read sites. It will
    help a second step that renames page._count to something else and
    prevents later attempts to access it directly (suggested by Andrew).
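
    The shape of the change, sketched as self-contained C with stdatomic
    standing in for the kernel's atomic_t (the kernel's wrappers ended up in
    include/linux/page_ref.h; the trace hook below is just a placeholder):
    every direct access goes through a named wrapper, giving one central
    place to attach tracing later.

        #include <stdatomic.h>
        #include <stdio.h>

        struct page_model {
            atomic_int _count;      /* never touched directly by callers */
        };

        /* Single choke point where a tracepoint can be attached later. */
        static void trace_page_ref(const char *op, int value)
        {
            printf("page_ref %s -> %d\n", op, value);
        }

        static int page_ref_count(struct page_model *p)
        {
            return atomic_load(&p->_count);
        }

        static void page_ref_inc(struct page_model *p)
        {
            trace_page_ref("inc", atomic_fetch_add(&p->_count, 1) + 1);
        }

        static int page_ref_dec_and_test(struct page_model *p)
        {
            int val = atomic_fetch_sub(&p->_count, 1) - 1;

            trace_page_ref("dec", val);
            return val == 0;
        }

        int main(void)
        {
            struct page_model page = { ._count = 1 };

            page_ref_inc(&page);
            if (page_ref_dec_and_test(&page))
                puts("free the page");
            printf("count = %d\n", page_ref_count(&page));
            return 0;
        }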

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We can reuse the nid we've determined instead of repeated pfn_to_nid()
    usages. Also zone_to_nid() should be a bit cheaper in general than
    pfn_to_nid().

    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Memory compaction can be currently performed in several contexts:

    - kswapd balancing a zone after a high-order allocation failure
    - direct compaction to satisfy a high-order allocation, including THP
    page fault attempts
    - khugepaged trying to collapse a hugepage
    - manually from /proc

    The purpose of compaction is two-fold. The obvious purpose is to
    satisfy a (pending or future) high-order allocation, and is easy to
    evaluate. The other purpose is to keep overall memory fragmentation low
    and help the anti-fragmentation mechanism. The success wrt the latter
    purpose is harder to evaluate.

    The current situation wrt the purposes has a few drawbacks:

    - compaction is invoked only when a high-order page or hugepage is not
    available (or manually). This might be too late for the purposes of
    keeping memory fragmentation low.
    - direct compaction increases latency of allocations. Again, it would
    be better if compaction was performed asynchronously to keep
    fragmentation low, before the allocation itself comes.
    - (a special case of the previous) the cost of compaction during THP
    page faults can easily offset the benefits of THP.
    - kswapd compaction appears to be complex, fragile and not working in
    some scenarios. It could also end up compacting for a high-order
    allocation request when it should be reclaiming memory for a later
    order-0 request.

    To improve the situation, we should be able to benefit from an
    equivalent of kswapd, but for compaction - i.e. a background thread
    which responds to fragmentation and the need for high-order allocations
    (including hugepages) somewhat proactively.

    One possibility is to extend the responsibilities of kswapd, which could
    however complicate its design too much. It should be better to let
    kswapd handle reclaim, as order-0 allocations are often more critical
    than high-order ones.

    Another possibility is to extend khugepaged, but this kthread is a
    single instance and tied to THP configs.

    This patch goes with the option of a new set of per-node kthreads called
    kcompactd, and lays the foundations, without introducing any new
    tunables. The lifecycle mimics kswapd kthreads, including the memory
    hotplug hooks.

    For compaction, kcompactd uses the standard compaction_suitable() and
    compact_finished() criteria and the deferred compaction functionality.
    Unlike direct compaction, it uses only sync compaction, as there's no
    allocation latency to minimize.

    This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
    compact/reclaim loop for high-order pages will be replaced by waking up
    kcompactd in the next patch with the description of what's wrong with
    the old approach.

    Waking up of the kcompactd threads is also tied to kswapd activity and
    follows these rules:
    - we don't want to affect any fastpaths, so wake up kcompactd only from
    the slowpath, as it's done for kswapd
    - if kswapd is doing reclaim, it's more important than compaction, so
    don't invoke kcompactd until kswapd goes to sleep
    - the target order used for kswapd is passed to kcompactd

    Possible future uses for kcompactd include the ability to wake up
    kcompactd on demand in special situations, such as when hugepages are
    not available (currently not done due to __GFP_NO_KSWAPD) or when a
    fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
    possible to perform periodic compaction with kcompactd.

    [arnd@arndb.de: fix build errors with kcompactd]
    [paul.gortmaker@windriver.com: don't use modular references for non modular code]
    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul Gortmaker
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

16 Mar, 2016

2 commits

  • There is a report of a performance drop due to hugepage allocation, in
    which half of the CPU time is spent in pageblock_pfn_to_page() during
    compaction [1].

    In that workload, compaction is triggered to make a hugepage, but most
    pageblocks are unavailable for compaction due to their pageblock type
    and skip bit, so compaction usually fails. The most costly operation in
    this case is finding a valid pageblock while scanning the whole zone
    range. To check whether a pageblock is valid to compact, a valid pfn
    within the pageblock is required, and we can obtain it by calling
    pageblock_pfn_to_page(). This function checks whether the pageblock is
    in a single zone and returns a valid pfn if possible. The problem is
    that we need to perform this check every time before scanning a
    pageblock, even if we re-visit it, and this turns out to be very
    expensive in this workload.

    Although we have no way to skip this pageblock check on systems where
    holes exist at arbitrary positions, we can use a cached value for zone
    contiguity and just do pfn_to_page() on systems without holes. This
    optimization considerably speeds up the above workload.

    Before vs After
    Max: 1096 MB/s vs 1325 MB/s
    Min: 635 MB/s vs 1015 MB/s
    Avg: 899 MB/s vs 1194 MB/s

    Avg is improved by roughly 30% [2].
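
    A sketch of the caching idea in plain C (upstream caches a per-zone
    'contiguous' flag, set once by a full walk and cleared on memory
    hotplug; the names and the validity check below are simplified
    stand-ins):

        #include <stdbool.h>
        #include <stdio.h>

        struct zone_model {
            bool contiguous;           /* verified once: no holes, single zone */
            unsigned long start_pfn, end_pfn;
        };

        /* Stand-in for pfn_valid()/zone membership checks. */
        static bool pfn_valid_model(const struct zone_model *z, unsigned long pfn)
        {
            return pfn >= z->start_pfn && pfn < z->end_pfn;
        }

        /* Return a usable pfn for the pageblock, or 0 if it must be skipped. */
        static unsigned long pageblock_first_pfn(const struct zone_model *z,
                                                 unsigned long start_pfn,
                                                 unsigned long end_pfn)
        {
            if (z->contiguous)
                return start_pfn;      /* fast path: skip per-call validation */

            /* slow path: validate both ends of the pageblock against the zone */
            if (!pfn_valid_model(z, start_pfn) || !pfn_valid_model(z, end_pfn - 1))
                return 0;
            return start_pfn;
        }

        int main(void)
        {
            struct zone_model z = { .contiguous = true,
                                    .start_pfn = 0x1000, .end_pfn = 0x9000 };

            printf("%#lx\n", pageblock_first_pfn(&z, 0x2000, 0x2200));
            return 0;
        }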

    [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
    [2]: https://lkml.org/lkml/2015/12/9/23

    [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
    Signed-off-by: Joonsoo Kim
    Reported-by: Aaron Lu
    Acked-by: Vlastimil Babka
    Tested-by: Aaron Lu
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, all newly added memory blocks remain in the 'offline' state
    unless someone onlines them; some Linux distributions carry special udev
    rules like:

    SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

    to make this happen automatically. This is not a great solution for
    virtual machines where memory hotplug is being used to address high
    memory pressure situations, as such onlining is slow and the userspace
    process doing it (udev) has a chance of being killed by the OOM killer,
    since it will probably need to allocate some memory.

    Introduce default policy for the newly added memory blocks in
    /sys/devices/system/memory/auto_online_blocks file with two possible
    values: "offline" which preserves the current behavior and "online"
    which causes all newly added memory blocks to go online as soon as
    they're added. The default is "offline".

    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Daniel Kiper
    Cc: Jonathan Corbet
    Cc: Greg Kroah-Hartman
    Cc: Daniel Kiper
    Cc: Dan Williams
    Cc: Tang Chen
    Cc: David Vrabel
    Acked-by: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: Mel Gorman
    Cc: "K. Y. Srinivasan"
    Cc: Igor Mammedov
    Cc: Kay Sievers
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     

30 Jan, 2016

1 commit

  • Set IORESOURCE_SYSTEM_RAM in struct resource.flags of "System
    RAM" entries.

    Signed-off-by: Toshi Kani
    Signed-off-by: Borislav Petkov
    Acked-by: David Vrabel # xen
    Cc: Andrew Banman
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: David Rientjes
    Cc: Denys Vlasenko
    Cc: Gu Zheng
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Luis R. Rodriguez
    Cc: Mel Gorman
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Tang Chen
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1453841853-11383-9-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Toshi Kani
     

16 Jan, 2016

1 commit

  • In support of providing struct page for large persistent memory
    capacities, use struct vmem_altmap to change the default policy for
    allocating memory for the memmap array. The default vmemmap_populate()
    allocates page table storage area from the page allocator. Given
    persistent memory capacities relative to DRAM it may not be feasible to
    store the memmap in 'System Memory'. Instead vmem_altmap represents
    pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
    requests.

    Signed-off-by: Dan Williams
    Reported-by: kbuild test robot
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

15 Jan, 2016

1 commit

  • An out-of-memory condition is not a bug, and while we can't add new
    memory in such a case, crashing the system seems wrong. Propagating the
    return value from register_memory_resource() requires an interface
    change.

    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Igor Mammedov
    Acked-by: David Rientjes
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: Sheng Yong
    Cc: Zhu Guihua
    Cc: Dan Williams
    Cc: David Vrabel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     

30 Dec, 2015

1 commit

  • test_pages_in_a_zone() does not account for the possibility of missing
    sections in the given pfn range. pfn_valid_within always returns 1 when
    CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
    sections to pass the test, leading to a kernel oops.

    Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
    for missing sections before proceeding into the zone-check code.
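
    A self-contained model of the loop structure (plain C; the section size
    and presence check are stand-ins for PAGES_PER_SECTION and
    present_section_nr(), and the real zone check is only hinted at): the
    outer loop walks the range one section at a time and skips sections that
    are not present.

        #include <stdbool.h>
        #include <stdio.h>

        #define PAGES_PER_SECTION (1UL << 15)    /* 128MiB sections, 4KiB pages */

        /* Stand-in for present_section_nr(): pretend section 1 is missing. */
        static bool section_present(unsigned long sec)
        {
            return sec != 1;
        }

        static void walk_range(unsigned long start_pfn, unsigned long end_pfn)
        {
            unsigned long first_sec = start_pfn / PAGES_PER_SECTION;
            unsigned long last_sec  = (end_pfn - 1) / PAGES_PER_SECTION;

            for (unsigned long sec = first_sec; sec <= last_sec; sec++) {
                if (!section_present(sec)) {
                    printf("skip missing section %lu\n", sec);
                    continue;
                }
                /* inner loop: verify the pages of this section share one zone */
                printf("check section %lu\n", sec);
            }
        }

        int main(void)
        {
            walk_range(0, 3 * PAGES_PER_SECTION);   /* sections 0, 1 (hole), 2 */
            return 0;
        }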

    This also prevents a crash from offlining memory devices with missing
    sections. Despite this, it may be a good idea to keep the related patch
    '[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
    missing sections' because missing sections in a memory block may lead to
    other problems not covered by the scope of this fix.

    Signed-off-by: Andrew Banman
    Acked-by: Alex Thorlton
    Cc: Russ Anderson
    Cc: Alex Thorlton
    Cc: Yinghai Lu
    Cc: Greg KH
    Cc: Seth Jennings
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Banman
     

06 Nov, 2015

1 commit

  • Commit a2f3aa025766 ("[PATCH] Fix sparsemem on Cell") fixed an oops
    experienced on the Cell architecture when init-time functions,
    early_*(), are called at runtime by introducing an 'enum memmap_context'
    parameter to memmap_init_zone() and init_currently_empty_zone(). This
    parameter is intended to be used to tell whether the call of these two
    functions is being made on behalf of a hotplug event, or happening at
    boot-time. However, init_currently_empty_zone() does not use this
    parameter at all, so remove it.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

23 Oct, 2015

1 commit

  • Add add_memory_resource() to add memory using an existing "System RAM"
    resource. This is useful if the memory region is being located by
    finding a free resource slot with allocate_resource().

    Xen guests will make use of this in their balloon driver to hotplug
    arbitrary amounts of memory in response to toolstack requests.

    Signed-off-by: David Vrabel
    Reviewed-by: Daniel Kiper
    Reviewed-by: Tang Chen

    David Vrabel
     

09 Sep, 2015

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "This update has successfully completed a 0day-kbuild run and has
    appeared in a linux-next release. The changes outside of the typical
    drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
    removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
    the introduction of ZONE_DEVICE + devm_memremap_pages().

    Summary:

    - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.

    This facility is used by the pmem driver to enable pfn_to_page()
    operations on the page frames returned by DAX ('direct_access' in
    'struct block_device_operations').

    For now, the 'memmap' allocation for these "device" pages comes
    from "System RAM". Support for allocating the memmap from device
    memory will arrive in a later kernel.

    - Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt(). memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects. The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.

    Completion of the conversion is targeted for v4.4.

    - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.

    - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.

    - Miscellaneous updates and fixes to libnvdimm including support for
    issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes"

    * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
    libnvdimm, pmem: direct map legacy pmem by default
    libnvdimm, pmem: 'struct page' for pmem
    libnvdimm, pfn: 'struct page' provider infrastructure
    x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
    add devm_memremap_pages
    mm: ZONE_DEVICE for "device memory"
    mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
    dax: drop size parameter to ->direct_access()
    nd_blk: change aperture mapping from WC to WB
    nvdimm: change to use generic kvfree()
    pmem, dax: have direct_access use __pmem annotation
    dax: update I/O path to do proper PMEM flushing
    pmem: add copy_from_iter_pmem() and clear_pmem()
    pmem, x86: clean up conditional pmem includes
    pmem: remove layer when calling arch_has_wmb_pmem()
    pmem, x86: move x86 PMEM API to new pmem.h header
    libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
    pmem: switch to devm_ allocations
    devres: add devm_memremap
    libnvdimm, btt: write and validate parent_uuid
    ...

    Linus Torvalds
     

05 Sep, 2015

1 commit

  • Commit f9126ab9241f ("memory-hotplug: fix wrong edge when hot add a new
    node") hot-added memory range to memblock, after creating pgdat for new
    node.

    But there is a problem:

    add_memory()
    |--> hotadd_new_pgdat()
         |--> free_area_init_node()
              |--> get_pfn_range_for_nid()
                   |--> find start_pfn and end_pfn in memblock
    |--> ......
    |--> memblock_add_node(start, size, nid) -------- Here, just too late.

    get_pfn_range_for_nid() will find that start_pfn and end_pfn are both 0.
    As a result, when adding memory, dmesg will give the following wrong
    message.

    Initmem setup node 5 [mem 0x0000000000000000-0xffffffffffffffff]
    On node 5 totalpages: 0
    Built 5 zonelists in Node order, mobility grouping on. Total pages: 32588823
    Policy zone: Normal
    init_memory_mapping: [mem 0x60000000000-0x607ffffffff]

    The solution is simple, just add the memory range to memblock a little
    earlier, before hotadd_new_pgdat().

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Tang Chen
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Cc: Kamezawa Hiroyuki
    Cc: Taku Izumi
    Cc: Gu Zheng
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: [4.2.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     

28 Aug, 2015

1 commit

  • While pmem is usable as a block device or via DAX mappings to userspace
    there are several usage scenarios that can not target pmem due to its
    lack of struct page coverage. In preparation for "hot plugging" pmem
    into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
    separately from the ones that are subject to standard page allocations.
    Importantly "device memory" can be removed at will by userspace
    unbinding the driver of the device.

    Having a separate zone prevents allocation and otherwise marks these
    pages as distinct from typical uniform memory. Device memory has
    different lifetime and performance characteristics than RAM. However,
    since we have run out of ZONES_SHIFT bits this functionality currently
    depends on sacrificing ZONE_DMA.

    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jerome Glisse
    [hch: various simplifications in the arch interface]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

15 Aug, 2015

1 commit

  • When we add a new node, the edge of memory may be wrong.

    e.g. system has 4 nodes, and node3 is movable, node3 mem:[24G-32G],

    1. hotremove the node3,
    2. then hotadd node3 with a part of memory, mem:[26G-30G],
    3. call hotadd_new_pgdat()
          free_area_init_node()
               get_pfn_range_for_nid()
    4. it will return wrong start_pfn and end_pfn, because we have not
       updated the memblock.

    This patch also fixes a BUG_ON during hot-addition, please see
    http://marc.info/?l=linux-kernel&m=142961156129456&w=2

    Signed-off-by: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Cc: Kamezawa Hiroyuki
    Cc: Taku Izumi
    Cc: Tang Chen
    Cc: Gu Zheng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

07 Aug, 2015

1 commit

  • Commit 92923ca3aace ("mm: meminit: only set page reserved in the
    memblock region") broke memory hotplug which expects the memmap for
    newly added sections to be reserved until onlined by
    online_pages_range(). This patch marks hotplugged pages as reserved
    when adding new zones.

    Signed-off-by: Mel Gorman
    Reported-by: David Vrabel
    Tested-by: David Vrabel
    Cc: Nathan Zimmer
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

25 Jun, 2015

1 commit

  • When hot-adding two nodes continuously, we found the vmemmap region info
    was a bit messed up. The last region of node 2 is printed when node 3 is
    hot-added, like the following:

    Initmem setup node 2 [mem 0x0000000000000000-0xffffffffffffffff]
    On node 2 totalpages: 0
    Built 2 zonelists in Node order, mobility grouping on. Total pages: 16090539
    Policy zone: Normal
    init_memory_mapping: [mem 0x40000000000-0x407ffffffff]
    [mem 0x40000000000-0x407ffffffff] page 1G
    [ffffea1000000000-ffffea10001fffff] PMD -> [ffff8a077d800000-ffff8a077d9fffff] on node 2
    [ffffea1000200000-ffffea10003fffff] PMD -> [ffff8a077de00000-ffff8a077dffffff] on node 2
    ...
    [ffffea101f600000-ffffea101f9fffff] PMD -> [ffff8a074ac00000-ffff8a074affffff] on node 2
    [ffffea101fa00000-ffffea101fdfffff] PMD -> [ffff8a074a800000-ffff8a074abfffff] on node 2
    Initmem setup node 3 [mem 0x0000000000000000-0xffffffffffffffff]
    On node 3 totalpages: 0
    Built 3 zonelists in Node order, mobility grouping on. Total pages: 16090539
    Policy zone: Normal
    init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
    [mem 0x60000000000-0x607ffffffff] page 1G
    [ffffea101fe00000-ffffea101fffffff] PMD -> [ffff8a074a400000-ffff8a074a5fffff] on node 2 [ffff8a074a600000-ffff8a074a7fffff] on node 3
    [ffffea1800200000-ffffea18005fffff] PMD -> [ffff8a074a000000-ffff8a074a3fffff] on node 3
    [ffffea1800600000-ffffea18009fffff] PMD -> [ffff8a0749c00000-ffff8a0749ffffff] on node 3
    ...

    The cause is that the last region was missed at the end of hot-adding
    memory, and p_start, p_end, node_start were not reset, so when memory is
    hot-added to a new node, it considers the blocks non-contiguous and
    prints the previous one. So we print the last vmemmap region at the end
    of hot-adding memory to avoid the confusion.

    Signed-off-by: Zhu Guihua
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhu Guihua
     

11 Jun, 2015

1 commit

  • Izumi found the following oops when hot re-adding a node:

    BUG: unable to handle kernel paging request at ffffc90008963690
    IP: __wake_up_bit+0x20/0x70
    Oops: 0000 [#1] SMP
    CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
    Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
    task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
    RIP: 0010:[] [] __wake_up_bit+0x20/0x70
    RSP: 0018:ffff880017b97be8 EFLAGS: 00010246
    RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
    RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
    RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
    R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
    FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
    Call Trace:
    unlock_page+0x6d/0x70
    generic_write_end+0x53/0xb0
    xfs_vm_write_end+0x29/0x80 [xfs]
    generic_perform_write+0x10a/0x1e0
    xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
    xfs_file_write_iter+0x79/0x120 [xfs]
    __vfs_write+0xd4/0x110
    vfs_write+0xac/0x1c0
    SyS_write+0x58/0xd0
    system_call_fastpath+0x12/0x76
    Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
    RIP [] __wake_up_bit+0x20/0x70
    RSP
    CR2: ffffc90008963690

    Reproduce method (re-add a node):
    Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)

    This seems to be a use-after-free problem, and the root cause is that
    zone->wait_table was not set to *NULL* after being freed in
    try_offline_node().

    When hot re-adding a node, we reuse its pgdat, and so the zone struct as
    well; when adding pages to the target zone, the zone is initialized first
    (including the wait_table) if it is not already initialized. The
    judgement of whether a zone is initialized is based on zone->wait_table:

    static inline bool zone_is_initialized(struct zone *zone)
    {
    return !!zone->wait_table;
    }

    so if we do not set zone->wait_table to *NULL* after freeing it, the
    memory hotplug routine will skip the init of the new zone when hot
    re-adding the node, and the wait_table will still point to the freed
    memory. We then access an invalid address when trying to wake up the
    waiters after the I/O operation on the page is done, as shown above.
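
    The fix is tiny; a self-contained model of its shape (plain C, with
    malloc/free standing in for the kernel allocation of the wait table):
    clear the pointer after freeing so zone_is_initialized() reports the
    zone as uninitialized when the node is re-added.

        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct zone_model {
            void *wait_table;
        };

        static bool zone_is_initialized(const struct zone_model *z)
        {
            return z->wait_table != NULL;
        }

        /* Model of the teardown in try_offline_node(): free AND clear. */
        static void offline_zone(struct zone_model *z)
        {
            free(z->wait_table);
            z->wait_table = NULL;   /* without this, re-add skips re-init and
                                       later wakeups touch freed memory */
        }

        int main(void)
        {
            struct zone_model z = { .wait_table = malloc(64) };

            offline_zone(&z);
            printf("needs init on re-add: %s\n",
                   zone_is_initialized(&z) ? "no" : "yes");
            return 0;
        }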

    Signed-off-by: Gu Zheng
    Reported-by: Taku Izumi
    Reviewed-by: Yasuaki Ishimatsu
    Cc: KAMEZAWA Hiroyuki
    Cc: Tang Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gu Zheng
     

16 Apr, 2015

1 commit

  • Now that we have easy access to hugepages' activeness, the existing
    helpers for getting that information can be cleaned up.

    [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi