13 Nov, 2013

40 commits

  • The Hot Pluggable field in SRAT specifies which memory is hotpluggable.
    As mentioned before, if hotpluggable memory is used by the kernel, it
    cannot be hot-removed. So memory hotplug users may want to place all
    hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

    Memory hotplug users may also set a node as a movable node, which contains
    only ZONE_MOVABLE, so that the whole node can be hot-removed.

    But the kernel cannot use memory in ZONE_MOVABLE, so with this setup the
    kernel cannot use any memory in movable nodes. This degrades NUMA
    performance, and users who do not need memory hotplug may be unhappy.

    So we need a way to allow users to enable and disable this functionality.
    This patch introduces the movable_node boot option, which lets users
    choose not to consume hotpluggable memory at early boot time, so that it
    can later be set up as ZONE_MOVABLE.

    To achieve this, the movable_node boot option controls the memblock
    allocation direction: after memblock is ready but before SRAT is parsed,
    we should allocate memory near the kernel image, as explained in the
    previous patches. So if the movable_node boot option is set, the kernel
    does the following:

    1. After memblock is ready, make memblock allocate memory bottom-up.
    2. After SRAT is parsed, make memblock revert to its default behavior and
    allocate memory top-down.

    Users can specify "movable_node" on the kernel command line to enable this
    functionality. Those who don't use memory hotplug, or who don't want to
    lose NUMA performance, can simply not specify anything; the kernel will
    work as before.
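
    A minimal sketch of how the option can be wired up (the real change is
    spread across arch/x86 setup code and NUMA init; memblock_set_bottom_up()
    and memblock_bottom_up() are the helpers this series adds):

        static int __init cmdline_parse_movable_node(char *p)
        {
                /*
                 * Until SRAT tells us what is hotpluggable, allocate
                 * bottom-up so early allocations stay near the kernel image.
                 */
                memblock_set_bottom_up(true);
                return 0;
        }
        early_param("movable_node", cmdline_parse_movable_node);

        /* ...and once SRAT has been parsed during NUMA init: */
        memblock_set_bottom_up(false);    /* back to the default top-down */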

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Suggested-by: Kamezawa Hiroyuki
    Suggested-by: Ingo Molnar
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Memory reserved for crashkernel could be large, so we should not allocate
    this memory bottom-up from the end of the kernel image.

    Once SRAT is parsed, we know which memory is hotpluggable and can avoid
    allocating that memory for the kernel. So reorder reserve_crashkernel()
    to run after SRAT is parsed.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • The Linux kernel cannot migrate pages used by the kernel. As a result,
    kernel pages cannot be hot-removed. So we cannot allocate hotpluggable
    memory for the kernel.

    In a memory hotplug system, any NUMA node the kernel resides in should be
    treated as non-hotpluggable. And on a modern server, each node could have
    at least 16GB of memory, so memory around the kernel image is very likely
    not hotpluggable.

    ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
    info. But before SRAT is parsed, memblock has already started to allocate
    memory for the kernel. So we need to prevent memblock from doing this.

    One such case is the direct memory mapping page table setup:
    init_mem_mapping() is called before SRAT is parsed. To prevent page
    tables from being allocated in hotpluggable memory, we use the bottom-up
    direction and allocate page tables from the end of the kernel image
    towards higher memory.
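
    Roughly, init_mem_mapping() then chooses the mapping direction like this
    (a sketch; see the actual patch for the exact bounds and comments):

        if (memblock_bottom_up()) {
                unsigned long kernel_end = __pa_symbol(_end);

                /*
                 * Map [kernel_end, end) first so that page tables can be
                 * allocated just above the kernel image, then map
                 * [ISA_END_ADDRESS, kernel_end) using those page tables.
                 */
                memory_map_bottom_up(kernel_end, end);
                memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
        } else {
                memory_map_top_down(ISA_END_ADDRESS, end);
        }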

    Note:
    As for allocating page tables in lower memory, TJ said:

    : This is an optional behavior which is triggered by a very specific kernel
    : boot param, which I suspect is gonna need to stick around to support
    : memory hotplug in the current setup unless we add another layer of address
    : translation to support memory hotplug.

    As for the concern that page tables may occupy too much low memory when 4K
    mappings are used (CONFIG_DEBUG_PAGEALLOC and CONFIG_KMEMCHECK both
    disable the use of >4k pages), TJ said:

    : But as I said in the same paragraph, parsing SRAT earlier doesn't solve
    : the problem in itself either. Ignoring the option if 4k mapping is
    : required and memory consumption would be prohibitive should work, no?
    : Something like that would be necessary if we're gonna worry about cases
    : like this no matter how we implement it, but, frankly, I'm not sure this
    : is something worth worrying about.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Create a new function memory_map_top_down() to factor out the top-down
    direct memory mapping pagetable setup. This is also preparation for the
    following patch, which introduces bottom-up memory mapping: the two ways
    of setting up pagetables are put into separate functions, and
    init_mem_mapping() chooses which one to use, which makes the code clearer.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • The Linux kernel cannot migrate pages used by the kernel. As a result,
    kernel pages cannot be hot-removed. So we cannot allocate hotpluggable
    memory for the kernel.

    ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
    info. But before SRAT is parsed, memblock has already started to allocate
    memory for the kernel. So we need to prevent memblock from doing this.

    In a memory hotplug system, any NUMA node the kernel resides in should be
    treated as non-hotpluggable. And on a modern server, each node could have
    at least 16GB of memory, so memory around the kernel image is very likely
    not hotpluggable.

    So the basic idea is: allocate memory upwards from the end of the kernel
    image towards higher addresses. Since not much memory is allocated before
    SRAT is parsed, it will very likely end up in the same node as the kernel
    image.

    The current memblock can only allocate memory top-down, so this patch
    introduces a new bottom-up allocation mode. When this direction is used to
    allocate memory in later patches, the start address is limited to lie
    above the end of the kernel image.
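
    When the bottom-up direction is enabled, the allocator core roughly does
    the following (a sketch of the memblock_find_in_range_node() path; names
    follow the patch description and details may differ from the final diff):

        if (memblock_bottom_up()) {
                phys_addr_t bottom_up_start;
                phys_addr_t kernel_end = __pa_symbol(_end);

                /* never hand out memory below the end of the kernel image */
                bottom_up_start = max(start, kernel_end);

                ret = __memblock_find_range_bottom_up(bottom_up_start, end,
                                                      size, align, nid);
                if (ret)
                        return ret;

                /* nothing found above the kernel: warn and fall back */
        }

        return __memblock_find_range_top_down(start, end, size, align, nid);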

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Tejun Heo
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • [Problem]

    Linux currently cannot migrate pages used by the kernel because of the
    kernel direct mapping: in kernel space, va = pa + PAGE_OFFSET, so when the
    pa changes we cannot simply update the pagetable and keep the va
    unmodified. Kernel pages are therefore not migratable.

    There are also other issues that make kernel pages non-migratable. For
    example, a physical address may be cached somewhere and used later; it is
    not feasible to update all such caches.

    When doing memory hotplug in Linux, we first migrate all the pages in one
    memory device somewhere else, and then remove the device. But if pages
    are used by the kernel, they are not migratable. As a result, memory used
    by the kernel cannot be hot-removed.

    Modifying the kernel direct mapping mechanism is too difficult, and it
    could hurt kernel performance and stability. So we take the following
    approach to memory hotplug.

    [What we are doing]

    In Linux, memory in one numa node is divided into several zones. One of
    the zones is ZONE_MOVABLE, which the kernel won't use.

    In order to implement memory hotplug in Linux, we are going to arrange all
    hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this
    memory. To do this, we need ACPI's help.

    In ACPI, the SRAT (System Resource Affinity Table) contains NUMA info. The
    memory affinities in SRAT record every memory range in the system, along
    with flags specifying whether the range is hotpluggable. (Please refer
    to ACPI spec 5.0, section 5.2.16.)

    With the help of SRAT, we have to do the following two things to achieve our
    goal:

    1. When doing memory hot-add, allow users to arrange hotpluggable memory
    as ZONE_MOVABLE.
    (This has already been done by the MOVABLE_NODE functionality in Linux.)

    2. When the system is booting, prevent the bootmem allocator from
    allocating hotpluggable memory for the kernel before memory
    initialization finishes.

    Problem 2 is the key problem we are going to solve, but before solving it
    we need some preparation. Please see below.

    [Preparation]

    The bootloader has to load the kernel image into memory, and that memory
    must not be hotpluggable; we cannot prevent this. So in a memory hotplug
    system, we can assume any node the kernel resides in is not hotpluggable.

    Before SRAT is parsed, we don't know which memory ranges are hotpluggable.
    But memblock has already started to work. In the current kernel,
    memblock allocates the following memory before SRAT is parsed:

    setup_arch()
    |->memblock_x86_fill() /* memblock is ready */
    |......
    |->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
    |->reserve_real_mode() /* allocate memory under 1MB */
    |->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */
    |->dma_contiguous_reserve() /* specified by user, should be low */
    |->setup_log_buf() /* specified by user, several mega bytes */
    |->relocate_initrd() /* could be large, but will be freed after boot, should reorder */
    |->acpi_initrd_override() /* several mega bytes */
    |->reserve_crashkernel() /* could be large, should reorder */
    |......
    |->initmem_init() /* Parse SRAT */

    According to Tejun's advice, before SRAT is parsed we should try our best
    to allocate memory near the kernel image. Since the whole node the kernel
    resides in won't be hotpluggable, and a node on a modern server may have
    at least 16GB of memory, allocating several megabytes of memory around
    the kernel image won't cross into hotpluggable memory.

    [About this patchset]

    So this patchset is the preparation for solving problem 2. It does the
    following:

    1. Make memblock able to allocate memory bottom-up.
    1) Keep all the memblock APIs' prototypes unmodified.
    2) When the direction is bottom-up, keep the start address above the end
    of the kernel image.

    2. Improve init_mem_mapping() to support allocating page tables in the
    bottom-up direction.

    3. Introduce the "movable_node" boot option to enable and disable this
    functionality.

    This patch (of 6):

    Create a new function __memblock_find_range_top_down() to factor the
    top-down allocation out of memblock_find_in_range_node(). This is
    preparation for the new bottom-up allocation mode introduced in the
    following patch.
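
    The factored-out helper looks roughly like this (a sketch; it simply walks
    the free ranges from high to low and returns the first fit):

        static phys_addr_t __init_memblock
        __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
                                       phys_addr_t size, phys_addr_t align,
                                       int nid)
        {
                phys_addr_t this_start, this_end, cand;
                u64 i;

                for_each_free_mem_range_reverse(i, nid, &this_start,
                                                &this_end, NULL) {
                        this_start = clamp(this_start, start, end);
                        this_end = clamp(this_end, start, end);

                        if (this_end < size)
                                continue;

                        cand = round_down(this_end - size, align);
                        if (cand >= this_start)
                                return cand;
                }

                return 0;
        }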

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Implement mmap base randomization for the bottom up direction, so ASLR
    works for both mmap layouts on s390. See also commit df54d6fa5427 ("x86
    get_unmapped_area(): use proper mmap base for bottom-up direction").

    Signed-off-by: Heiko Carstens
    Cc: Radu Caragea
    Cc: Michel Lespinasse
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Chris Metcalf
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • This is more or less the generic variant of commit 41aacc1eea64 ("x86
    get_unmapped_area: Access mmap_legacy_base through mm_struct member").

    So effectively, architectures which use their own arch_pick_mmap_layout()
    implementation but call the generic arch_get_unmapped_area() can now also
    randomize their mmap_base.

    All architectures which have their own arch_pick_mmap_layout() and call
    the generic arch_get_unmapped_area() (arm64, s390, tile) currently set
    mmap_base to TASK_UNMAPPED_BASE. This is also true for the generic
    arch_pick_mmap_layout() function, so this change is currently a no-op.

    Signed-off-by: Heiko Carstens
    Cc: Radu Caragea
    Cc: Michel Lespinasse
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Chris Metcalf
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • Add SetPageReclaim() before __swap_writepage() so that the page can be
    moved to the tail of the inactive list, which avoids unnecessary page
    scanning of a page the swap subsystem has already reclaimed.
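
    A sketch of the resulting ordering in zswap's writeback path (function
    names as used in mm/zswap.c at the time; details may differ):

        /* have writeback completion rotate the page to the tail of the
         * inactive list instead of leaving it for the scanner */
        SetPageReclaim(page);
        __swap_writepage(page, &wbc, end_swap_bio_write);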

    Signed-off-by: Weijie Yang
    Reviewed-by: Bob Liu
    Reviewed-by: Minchan Kim
    Acked-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • When there are processes heavily creating small files while sync(2) is
    running, it can easily happen that quite some new files are created
    between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
    especially if there are several busy filesystems (remember that sync
    traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
    fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
    (e.g. causes a transaction commit and cache flush for each inode in
    ext3), resulting sync(2) times are rather large.

    The following script reproduces the problem:

    function run_writers
    {
            for (( i = 0; i < 10; i++ )); do
                    mkdir $1/dir$i
                    for (( j = 0; j < 40000; j++ )); do
                            dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
                    done &
            done
    }

    for dir in "$@"; do
            run_writers $dir
    done

    sleep 40
    time sync

    Fix the problem by disregarding inodes dirtied after sync(2) was called
    in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes
    a time stamp when sync has started which is used for setting up work for
    flusher threads.

    To give some numbers: when the above script is run on two ext4 filesystems
    on a simple SATA drive, the average sync time from 10 runs is 267.549
    seconds with a standard deviation of 104.799426. With the patched kernel,
    the average sync time from 10 runs is 2.995 seconds with a standard
    deviation of 0.096.

    Signed-off-by: Jan Kara
    Reviewed-by: Fengguang Wu
    Reviewed-by: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • The soft dirty bit allows us to track which pages have been written since
    the last clear_refs (done with "echo 4 > /proc/pid/clear_refs"). This is
    useful for userspace applications that want to know their memory
    footprints.

    Note that the kernel exposes this flag via bit 55 of /proc/pid/pagemap,
    and the semantics are not the default ones (they are scheduled to become
    the default in the near future). However, the flag shifts to the new
    semantics at the first clear_refs, and users of the soft dirty bit always
    do that before using the bit, so it's not a big deal. Users must avoid
    relying on the bit in page-types before the first clear_refs.
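
    As an illustration (not part of the patch), a minimal userspace reader of
    the soft dirty bit for one page of the current process could look like
    this; the pagemap layout (64-bit entries, soft dirty in bit 55) is
    documented in Documentation/vm/pagemap.txt:

        #include <stdio.h>
        #include <stdlib.h>
        #include <stdint.h>
        #include <unistd.h>
        #include <fcntl.h>

        int main(void)
        {
                long psize = sysconf(_SC_PAGESIZE);
                char *buf = malloc(psize);
                uint64_t entry;
                off_t off;
                int fd;

                buf[0] = 1;     /* touch the page so it is mapped */

                fd = open("/proc/self/pagemap", O_RDONLY);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                /* one 64-bit entry per virtual page */
                off = ((uintptr_t)buf / psize) * sizeof(entry);
                if (pread(fd, &entry, sizeof(entry), off) !=
                    (ssize_t)sizeof(entry)) {
                        perror("pread");
                        return 1;
                }

                printf("soft-dirty: %d\n", (int)((entry >> 55) & 1));
                close(fd);
                return 0;
        }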

    Signed-off-by: Naoya Horiguchi
    Cc: Wu Fengguang
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • This flag shows that the VMA is "newly created" and thus represents
    "dirty" in the task's VM.

    You can clear it by "echo 4 > /proc/pid/clear_refs."

    Signed-off-by: Naoya Horiguchi
    Cc: Wu Fengguang
    Cc: Pavel Emelyanov
    Acked-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • During swapoff, the frontswap_map was NULL-ified before calling
    frontswap_invalidate_area(). However, frontswap_invalidate_area() exits
    early if frontswap_map is NULL, so invalidate was never actually called
    during swapoff.

    This patch moves frontswap_map_set() in swapoff just after calling
    frontswap_invalidate_area() so outside of locks (swap_lock and
    swap_info_struct->lock). This shouldn't be a problem as during swapon
    the frontswap_map_set() is called also outside of any locks.

    Signed-off-by: Krzysztof Kozlowski
    Reviewed-by: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
  • Signed-off-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     
  • Commit 248ac0e1943a ("mm/vmalloc: remove guard page from between vmap
    blocks") had the side effect of making vmap_area.va_end member point to
    the next vmap_area.va_start. This was creating an artificial reference
    to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks.

    This patch marks the vmap_area containing pointers explicitly and
    reduces the min ref_count to 2 as vm_struct still contains a reference
    to the vmalloc'ed object. The kmemleak add_scan_area() function has
    been improved to allow a SIZE_MAX argument covering the rest of the
    object (for simpler calling sites).

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • We pass the number of pages which hold the page structs of a memory
    section to free_map_bootmem(). This is correct when
    !CONFIG_SPARSEMEM_VMEMMAP, but wrong when CONFIG_SPARSEMEM_VMEMMAP: in
    that case we should pass the number of pages of a memory section to
    free_map_bootmem().

    So the fix is to remove the nr_pages parameter. When
    CONFIG_SPARSEMEM_VMEMMAP, we directly use the predefined macro
    PAGES_PER_SECTION in free_map_bootmem(). When !CONFIG_SPARSEMEM_VMEMMAP,
    we calculate the number of pages needed to hold the page structs for a
    memory section and use that value in free_map_bootmem().

    This was found by reading the code; I currently have no machine that
    supports memory hot-remove to test the fix on.

    Signed-off-by: Zhang Yanfei
    Reviewed-by: Wanpeng Li
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • For the functions below,

    - sparse_add_one_section()
    - kmalloc_section_memmap()
    - __kmalloc_section_memmap()
    - __kfree_section_memmap()

    they are always invoked to operate on one memory section, so it is
    redundant to pass a nr_pages parameter that is always the number of pages
    in one section. We can use the predefined macro PAGES_PER_SECTION directly
    instead of passing the parameter.

    Signed-off-by: Zhang Yanfei
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • The memory.numa_stat file was not hierarchical: memory charged to the
    children was not shown in the parent's numa_stat.

    This change adds the "hierarchical_" stats to the existing stats. The
    new hierarchical stats include the sum of all children's values in
    addition to the value of the memcg.

    Tested: Create cgroup a, a/b and run workload under b. The values of
    b are included in the "hierarchical_*" under a.

    $ cd /sys/fs/cgroup
    $ echo 1 > memory.use_hierarchy
    $ mkdir a a/b

    Run workload in a/b:
    $ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &

    The hierarchical_ fields in parent (a) show use of workload in a/b:
    $ cat a/memory.numa_stat
    total=0 N0=0 N1=0 N2=0 N3=0
    file=0 N0=0 N1=0 N2=0 N3=0
    anon=0 N0=0 N1=0 N2=0 N3=0
    unevictable=0 N0=0 N1=0 N2=0 N3=0
    hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
    hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
    hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
    hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

    $ cat a/b/memory.numa_stat
    total=908 N0=552 N1=317 N2=39 N3=0
    file=850 N0=549 N1=301 N2=0 N3=0
    anon=58 N0=3 N1=16 N2=39 N3=0
    unevictable=0 N0=0 N1=0 N2=0 N3=0
    hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
    hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
    hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
    hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

    Signed-off-by: Ying Han
    Signed-off-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • Refactor mem_control_numa_stat_show() to use a new stats structure for
    smaller and simpler code. This consolidates nearly identical code.

    text       data       bss        dec         hex     filename
    8,137,679  1,703,496  1,896,448  11,737,623  b31a17  vmlinux.before
    8,136,911  1,703,496  1,896,448  11,736,855  b31717  vmlinux.after
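
    A sketch of the kind of stats table the refactor introduces (field names
    approximate the patch):

        static const struct numa_stat {
                const char *name;
                unsigned int lru_mask;
        } stats[] = {
                { "total",       LRU_ALL },
                { "file",        LRU_ALL_FILE },
                { "anon",        LRU_ALL_ANON },
                { "unevictable", BIT(LRU_UNEVICTABLE) },
        };

    mem_control_numa_stat_show() then loops over this array twice: once for
    the memcg itself and once, prefixed with "hierarchical_", summing the
    values over the whole subtree.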

    Signed-off-by: Greg Thelen
    Signed-off-by: Ying Han
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Use more appropriate NUMA_NO_NODE instead of -1

    Signed-off-by: Jianguo Wu
    Acked-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • Khugepaged scans/frees HPAGE_PMD_NR normal pages and replaces them with a
    hugepage allocated from the node of the first scanned normal page. This
    policy is too rough and may produce unexpected results for its users.

    The problem is that the original page balancing among all nodes is broken
    once khugepaged starts. Consider the case where the first scanned normal
    page is allocated from node A while most of the other scanned normal
    pages are allocated from node B or C: khugepaged will always allocate the
    hugepage from node A, which causes extra memory pressure on node A that
    did not exist before khugepaged started.

    This patch tries to fix the problem by making khugepaged allocate the
    hugepage from the node that recorded the most hits among the scanned
    normal pages, so that the effect on the original page balancing is
    minimized.

    The other problem is that if the scanned normal pages are allocated
    equally from nodes A, B and C, node A will still suffer extra memory
    pressure after khugepaged starts.

    Andrew Davidoff reported a related issue several days ago. He wanted his
    application to interleave among all nodes, and "numactl --interleave=all
    ./test" was used to run the testcase, but the result was not as expected.

    cat /proc/2814/numa_maps:
    7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098

    The end result showed that most pages came from node 3 instead of being
    interleaved among nodes 0-3, which was unreasonable.

    This patch also fixes that issue by allocating hugepages round robin from
    all nodes that share the same hit record; after this patch the result is
    as expected:

    7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722

    The simple testcase is like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main()
    {
            char *p;
            int i;
            int j;

            for (i = 0; i < 200; i++) {
                    p = (char *)malloc(1048576);
                    printf("malloc done\n");

                    if (p == 0) {
                            printf("Out of memory\n");
                            return 1;
                    }
                    for (j = 0; j < 1048576; j++) {
                            p[j] = 'A';
                    }
                    printf("touched memory\n");

                    sleep(1);
            }
            printf("enter sleep\n");
            while (1) {
                    sleep(100);
            }
    }
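
    A sketch of the target-node selection described above (simplified; the
    real function lives in mm/huge_memory.c and uses khugepaged_node_load[],
    the per-node hit counter this patch introduces):

        static int khugepaged_find_target_node(void)
        {
                static int last_khugepaged_target_node = NUMA_NO_NODE;
                int nid, target_node = 0, max_value = 0;

                /* pick the first node with the largest number of hits */
                for (nid = 0; nid < MAX_NUMNODES; nid++) {
                        if (khugepaged_node_load[nid] > max_value) {
                                max_value = khugepaged_node_load[nid];
                                target_node = nid;
                        }
                }

                /* round robin among nodes that share the same maximum */
                if (target_node <= last_khugepaged_target_node) {
                        for (nid = last_khugepaged_target_node + 1;
                             nid < MAX_NUMNODES; nid++) {
                                if (khugepaged_node_load[nid] == max_value) {
                                        target_node = nid;
                                        break;
                                }
                        }
                }

                last_khugepaged_target_node = target_node;
                return target_node;
        }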

    [akpm@linux-foundation.org: make last_khugepaged_target_node local to khugepaged_find_target_node()]
    Reported-by: Andrew Davidoff
    Tested-by: Andrew Davidoff
    Signed-off-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Yasuaki Ishimatsu
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • Move alloc_hugepage() to a better place; there is no need for a separate
    #ifndef CONFIG_NUMA.

    Signed-off-by: Bob Liu
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Andrew Davidoff
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • Signed-off-by: Christian Hesse
    Acked-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Hesse
     
  • Don't warn twice in __vmalloc_area_node() and __vmalloc_node_range() when
    the __vmalloc_area_node() allocation fails. This reverts commit
    46c001a2753f ("mm/vmalloc.c: emit the failure message before return").

    Signed-off-by: Wanpeng Li
    Reviewed-by: Zhang Yanfei
    Cc: Joonsoo Kim
    Cc: KOSAKI Motohiro
    Cc: Mitsuo Hayasaka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The VM_UNINITIALIZED/VM_UNLIST flag introduced by f5252e009d5b ("mm:
    avoid null pointer access in vm_struct via /proc/vmallocinfo") is used to
    avoid accessing the pages field of a vm_struct whose pages have not been
    allocated yet when show_numa_info() is called.

    This patch moves the check to just before show_numa_info(), so that some
    messages can still be dumped via /proc/vmallocinfo. This reverts commit
    d157a55815ff ("mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show
    instead of show_numa_info").

    Reviewed-by: Zhang Yanfei
    Signed-off-by: Wanpeng Li
    Cc: Mitsuo Hayasaka
    Cc: Joonsoo Kim
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • There is a race window between vmap_area tear down and showing vmap_area
    information:

    CPU A: remove_vm_area()
            spin_lock(&vmap_area_lock);
            va->vm = NULL;
            va->flags &= ~VM_VM_AREA;
            spin_unlock(&vmap_area_lock);

    CPU B: s_show()
            spin_lock(&vmap_area_lock);
            if (va->flags & (VM_LAZY_FREE | VM_LAZY_FREEING))
                    return 0;
            if (!(va->flags & VM_VM_AREA)) {
                    seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
                            (void *)va->va_start, (void *)va->va_end,
                            va->va_end - va->va_start);
                    return 0;
            }

    CPU A (continued):
            free_unmap_vmap_area(va);
                flush_cache_vunmap
                free_unmap_vmap_area_noflush
                    unmap_vmap_area
                    free_vmap_area_noflush
                        va->flags |= VM_LAZY_FREE

    The assumption that !VM_VM_AREA indicates a vm_map_ram allocation was
    introduced by d4033afdf828 ("mm, vmalloc: iterate vmap_area_list,
    instead of vmlist, in vmallocinfo()").

    However, !VM_VM_AREA can also mean that the vmap_area is being torn down
    in the race window shown above. This patch fixes it by not dumping any
    information for the !VM_VM_AREA case, and also removes the (VM_LAZY_FREE
    | VM_LAZY_FREEING) check since those flags are not possible in the
    !VM_VM_AREA case.

    Suggested-by: Joonsoo Kim
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Wanpeng Li
    Cc: Mitsuo Hayasaka
    Cc: Zhang Yanfei
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The caller address has already been set in set_vmalloc_vm(), there's no
    need to set it again in __vmalloc_area_node.

    Reviewed-by: Zhang Yanfei
    Signed-off-by: Wanpeng Li
    Cc: Joonsoo Kim
    Cc: KOSAKI Motohiro
    Cc: Mitsuo Hayasaka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • mpol_to_str() should not fail. Currently, it either fails because the
    string buffer is too small or because a string hasn't been defined for a
    mempolicy mode.

    If a new mempolicy mode is introduced and no string is defined for it,
    just warn and return "unknown".

    If the buffer is too small, just truncate the string and return, the
    same behavior as snprintf().

    This also fixes a bug where there was no NULL-byte termination when doing
    *p++ = '=' and *p++ = ':' and maxlen had been reached.

    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Chen Gang
    Cc: Rik van Riel
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Use more appropriate NUMA_NO_NODE instead of -1 in all archs' module_alloc()

    Signed-off-by: Jianguo Wu
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • Chen Gong pointed out that set/unset_migratetype_isolate() was done in
    different functions in mm/memory-failure.c, which makes the code less
    readable/maintainable. So this patch does it in soft_offline_page().

    With this patch, we get to hold lock_memory_hotplug() longer but it's
    not a problem because races between memory hotplug and soft offline are
    very rare.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Chen, Gong
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • cpu_up() has #ifdef CONFIG_MEMORY_HOTPLUG code blocks, which call
    mem_online_node() to bring the CPU's node online if it is offline, and
    then call build_all_zonelists() to initialize the zone list.

    These steps are specific to memory hotplug and should be managed in
    mm/memory_hotplug.c. lock_memory_hotplug() should also be held for the
    whole sequence.

    For this reason, this patch replaces mem_online_node() with
    try_online_node(), which performs the whole steps with
    lock_memory_hotplug() held. try_online_node() is named after
    try_offline_node() as they have similar purpose.

    There is no functional change in this patch.
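
    The call site in cpu_up() then reduces to something like (a sketch):

        err = try_online_node(cpu_to_node(cpu));
        if (err)
                return err;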

    Signed-off-by: Toshi Kani
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • On large memory machines it can take a few minutes to get through
    free_all_bootmem().

    Currently, when free_all_bootmem() calls __free_pages_memory(), the number
    of contiguous pages that __free_pages_memory() passes to the buddy
    allocator is limited to BITS_PER_LONG. BITS_PER_LONG was originally
    chosen to keep things similar to mm/nobootmem.c. But it is more efficient
    to limit it to MAX_ORDER.

              base    new    change
    8TB       202s    172s      30s
    16TB      401s    351s      50s

    That is around 1%-3% improvement on total boot time.
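
    The resulting loop is roughly the following (a sketch of
    __free_pages_memory() after the change):

        static void __init __free_pages_memory(unsigned long start,
                                               unsigned long end)
        {
                int order;

                while (start < end) {
                        /* largest aligned chunk that fits, now capped at
                         * MAX_ORDER - 1 instead of a BITS_PER_LONG limit */
                        order = min(MAX_ORDER - 1UL, __ffs(start));

                        while (start + (1UL << order) > end)
                                order--;

                        __free_pages_bootmem(pfn_to_page(start), order);

                        start += 1UL << order;
                }
        }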

    This patch was spun off from the boot time rfc Robin and I had been
    working on.

    Signed-off-by: Robin Holt
    Signed-off-by: Nathan Zimmer
    Cc: Robin Holt
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Mike Travis
    Cc: Yinghai Lu
    Cc: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     
  • Use a helper function to check whether we need to deal with the oom
    condition.

    Signed-off-by: Qiang Huang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • Use "pfn_to_nid(pfn)" instead of "page_to_nid(pfn_to_page(pfn))".

    Signed-off-by: Xishi Qiu
    Acked-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • is_memblock_offlined() returning 1 means the memory block is offlined,
    but is_memblock_offlined_cb() returning 1 means the memory block is not
    offlined. This can confuse people, so rename the function.

    Signed-off-by: Xishi Qiu
    Acked-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Use "populated_zone()" instead of "if (zone->present_pages)". Simplify
    the code, no functional change.

    Signed-off-by: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
    pgdat->node_spanned_pages". Simplify the code, no functional change.
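
    For reference, the helper is simply:

        static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
        {
                return pgdat->node_start_pfn + pgdat->node_spanned_pages;
        }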

    Signed-off-by: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
    pgdat->node_spanned_pages". Simplify the code, no functional change.

    Signed-off-by: Xishi Qiu
    Cc: James Hogan
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Since commit 13ece886d99c ("thp: transparent hugepage config choice"),
    transparent hugepage support is disabled by default, and
    TRANSPARENT_HUGEPAGE_ALWAYS is configured when TRANSPARENT_HUGEPAGE=y.

    And since commit d39d33c332c6 ("thp: enable direct defrag"), defrag is
    enabled for all transparent hugepage page faults by default, not only in
    MADV_HUGEPAGE regions.

    Signed-off-by: Jianguo Wu
    Reviewed-by: Wanpeng Li
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu