24 Feb, 2013

40 commits

  • When I use several fast SSDs for swap, swapper_space.tree_lock is
    heavily contended. This makes each swap partition have its own
    address_space to reduce the lock contention. There is an array of
    address_spaces for swap; the swap entry type is the index into the
    array.

    In my test with 3 SSDs, this increases swapout throughput by 20%.
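
    A minimal sketch of the idea described above, assuming the existing
    swp_type() helper (names follow the description, not necessarily the
    final code):

        /* One address_space per swap file or partition, indexed by the
         * type field encoded in the swap entry. */
        struct address_space swapper_spaces[MAX_SWAPFILES];

        #define swap_address_space(entry) \
                (&swapper_spaces[swp_type(entry)])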

    [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • According to akpm, this saves 1/2k text and makes things simple for the
    next patch.

    Numbers from Minchan:

    add/remove: 1/0 grow/shrink: 6/22 up/down: 92/-516 (-424)
    function old new delta
    page_mapping - 48 +48
    do_task_stat 2292 2308 +16
    page_remove_rmap 240 248 +8
    load_elf_binary 4500 4508 +8
    update_queue 532 536 +4
    scsi_probe_and_add_lun 2892 2896 +4
    lookup_fast 644 648 +4
    vcs_read 1040 1036 -4
    __ip_route_output_key 1904 1900 -4
    ip_route_input_noref 2508 2500 -8
    shmem_file_aio_read 784 772 -12
    __isolate_lru_page 272 256 -16
    shmem_replace_page 708 688 -20
    mark_buffer_dirty 228 208 -20
    __set_page_dirty_buffers 240 220 -20
    __remove_mapping 276 256 -20
    update_mmu_cache 500 476 -24
    set_page_dirty_balance 92 68 -24
    set_page_dirty 172 148 -24
    page_evictable 88 64 -24
    page_cache_pipe_buf_steal 248 224 -24
    clear_page_dirty_for_io 340 316 -24
    test_set_page_writeback 400 372 -28
    test_clear_page_writeback 516 488 -28
    invalidate_inode_page 156 128 -28
    page_mkclean 432 400 -32
    flush_dcache_page 360 328 -32
    __set_page_dirty_nobuffers 324 280 -44
    shrink_page_list 2412 2356 -56

    Signed-off-by: Shaohua Li
    Suggested-by: Andrew Morton
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When correcting commit 04fa5d6a6547 ("mm: migrate: check page_count of
    THP before migrating") Hugh Dickins noted that the control flow for
    transhuge migration was difficult to follow. Unconditionally calling
    put_page() in numamigrate_isolate_page() made the failure paths of both
    migrate_misplaced_transhuge_page() and migrate_misplaced_page() more
    complex than they should be. Further, he was extremely wary of an
    unlock_page() ever happening after a put_page(), even if that
    put_page() should never be the final put_page().

    Hugh implemented the following cleanup to simplify the path by calling
    putback_lru_page() inside numamigrate_isolate_page() if it failed to
    isolate and always calling unlock_page() within
    migrate_misplaced_transhuge_page().

    There is no functional change after this patch is applied but the code
    is easier to follow and unlock_page() always happens before put_page().

    [mgorman@suse.de: changelog only]
    Signed-off-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page->_last_nid fits into page->flags on 64-bit. The unlikely 32-bit
    NUMA configuration with NUMA Balancing will still need an extra page
    field. As Peter notes "Completely dropping 32bit support for
    CONFIG_NUMA_BALANCING would simplify things, but it would also remove
    the warning if we grow enough 64bit only page-flags to push the last-cpu
    out."

    [mgorman@suse.de: minor modifications]
    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • This is a preparation patch for moving page->_last_nid into page->flags
    that moves page flag layout information to a separate header. This
    patch is necessary because otherwise there would be a circular
    dependency between mm_types.h and mm.h.

    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • The current definition of count_vm_numa_events() is wrong for
    !CONFIG_NUMA_BALANCING, as the following would miss the side effect:

    count_vm_numa_events(NUMA_FOO, bar++);

    There are no such users of count_vm_numa_events() but this patch fixes
    it as it is a potential pitfall. Ideally both would be converted to
    static inline but NUMA_PTE_UPDATES is not defined if
    !CONFIG_NUMA_BALANCING and creating dummy constants just to have a
    static inline would be similarly clumsy.
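
    A minimal sketch of the pitfall and the style of fix for the
    !CONFIG_NUMA_BALANCING case (the CONFIG_NUMA_BALANCING side is
    unchanged):

        #ifdef CONFIG_NUMA_BALANCING
        #define count_vm_numa_event(x)     count_vm_event(x)
        #define count_vm_numa_events(x, y) count_vm_events(x, y)
        #else
        #define count_vm_numa_event(x) do {} while (0)
        /* Broken: "y" is dropped, so count_vm_numa_events(NUMA_FOO, bar++)
         * would never increment bar. */
        /* #define count_vm_numa_events(x, y) do {} while (0) */
        /* Fixed: evaluate "y" for its side effect, discard the value. */
        #define count_vm_numa_events(x, y) do { (void)(y); } while (0)
        #endif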

    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only
    one base page is being migrated when in fact it can also be handed a
    THP.

    The consequence is that a migration will be attempted when a target
    node is nearly full, only to fail later. It's unlikely to be
    user-visible but it should be fixed. While we are there,
    migrate_balanced_pgdat() should treat nr_migrate_pages as an unsigned
    long as it is used as a watermark.

    Signed-off-by: Mel Gorman
    Suggested-by: Wanpeng Li
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • s/me/be/ and clarify the comment a bit when we're changing it anyway.

    Signed-off-by: Mel Gorman
    Suggested-by: Simon Jeons
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • If a storage interface or usb network interface (the iSCSI case) exists
    in the current configuration, memory allocation with GFP_KERNEL during
    usb_device_reset() might trigger I/O on the storage interface itself
    and cause a deadlock: 'us->dev_mutex' is held in .pre_reset(), so the
    storage interface can't do I/O while the reset is triggered by another
    interface, and error handling can't be completed if the reset is
    triggered by the storage interface itself (error handling path).

    Signed-off-by: Ming Lei
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Reviewed-by: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • Apply the introduced memalloc_noio_save() and memalloc_noio_restore() to
    force memory allocation without I/O during the runtime_resume and
    runtime_suspend callbacks of devices that have the 'memalloc_noio' flag
    set.
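
    A simplified sketch of how the PM core can wrap a runtime callback for
    such a device (callback() stands in for the actual runtime PM callback
    invocation; dev and retval come from the surrounding, abbreviated
    function):

        if (dev->power.memalloc_noio) {
                unsigned int noio_flag;

                /* Any allocation done by the callback is implicitly
                 * degraded to a no-I/O allocation. */
                noio_flag = memalloc_noio_save();
                retval = callback(dev);
                memalloc_noio_restore(noio_flag);
        } else {
                retval = callback(dev);
        }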

    Signed-off-by: Ming Lei
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • A deadlock might be caused by allocating memory with GFP_KERNEL in the
    runtime_resume and runtime_suspend callbacks of network devices in the
    iSCSI case, so mark network devices and their ancestors as
    'memalloc_noio' with the introduced pm_runtime_set_memalloc_noio().
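
    A hedged sketch of the registration-time usage (netdev here is a
    hypothetical struct net_device pointer; its embedded struct device is
    netdev->dev):

        /* On register: mark the network device and all of its ancestors. */
        pm_runtime_set_memalloc_noio(&netdev->dev, true);

        /* On unregister: clear the mark again. */
        pm_runtime_set_memalloc_noio(&netdev->dev, false);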

    Signed-off-by: Ming Lei
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • Apply the introduced pm_runtime_set_memalloc_noio() to block devices so
    that the PM core will teach mm not to allocate memory with GFP_IOFS
    when calling the runtime_resume and runtime_suspend callbacks for block
    devices and their ancestors.

    Signed-off-by: Ming Lei
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • Introduce the flag memalloc_noio in 'struct dev_pm_info' to help the PM
    core teach mm not to allocate memory with the GFP_KERNEL flag, avoiding
    a probable deadlock.

    As explained in the comment, any GFP_KERNEL allocation inside
    runtime_resume() or runtime_suspend() of any device on the path from a
    block or network device to the root device in the device tree may cause
    a deadlock. The introduced pm_runtime_set_memalloc_noio() sets or
    clears the flag recursively on the devices in that path.
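
    A simplified sketch of the recursive set/clear; the real helper also
    serializes against device hotplug, takes dev->power.lock around the
    bitfield update, and avoids clearing a parent's flag while another
    child still has it set:

        void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
        {
                /* Simplified: walk from the device up to the root,
                 * setting or clearing the flag on each ancestor. */
                while (dev) {
                        dev->power.memalloc_noio = enable;
                        dev = dev->parent;
                }
        }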

    Signed-off-by: Ming Lei
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: Jens Axboe
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • This patch introduces PF_MEMALLOC_NOIO as a process flag (in the 'flags'
    field of 'struct task_struct'), so that a task can set it to avoid
    doing I/O inside memory allocations made in its context.

    The patch tries to solve a deadlock problem caused by block devices;
    the problem may happen at least in the following situations:

    - during block device runtime resume, if memory allocation with
    GFP_KERNEL is called inside the runtime resume callback of any one of
    its ancestors (or of the block device itself), the deadlock may be
    triggered inside the memory allocation since it might not complete
    until the block device becomes active and the involved page I/O
    finishes. The situation was first pointed out by Alan Stern. It is
    not a good approach to convert all GFP_KERNEL[1] allocations in the
    path into GFP_NOIO because several subsystems may be involved (for
    example, PCI, USB and SCSI may be involved for a usb mass storage
    device, and network devices are involved too in the iSCSI case)

    - during block device runtime suspend, because runtime resume needs to
    wait for the completion of a concurrent runtime suspend.

    - during error handling of a usb mass storage device, a USB bus reset
    will be put on the device, so there shouldn't be any memory allocation
    with GFP_KERNEL during the USB bus reset, otherwise a deadlock similar
    to the above may be triggered. Unfortunately, any usb device may
    include one mass storage interface in theory, so it would require all
    usb interface drivers to handle the situation. In fact, most usb
    drivers don't know how to handle a bus reset on the device and don't
    provide .pre_reset() and .post_reset() callbacks at all, so the USB
    core has to unbind and rebind the driver for these devices. So it is
    still not practical to resort to GFP_NOIO to solve the problem.

    Also the introduced solution can be used by block subsystem or block
    drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
    actual I/O transfer.

    It is not a good idea to convert all the GFP_KERNEL allocations in the
    affected path into GFP_NOIO because the functions doing them may be
    implemented as library code and called in many other contexts.

    In fact, memalloc_noio_flags() allows some of the current static
    GFP_NOIO allocations to be converted back into GFP_KERNEL in the
    non-affected contexts; at least almost all GFP_NOIO allocations in the
    USB subsystem can be converted into GFP_KERNEL after applying this
    approach, so that GFP_NOIO allocation generally only happens in
    runtime resume, bus reset and block I/O transfer contexts.

    [1] Several GFP_KERNEL allocation examples in the runtime resume path:

    - pci subsystem: acpi_os_allocate
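
    A sketch of the introduced helpers, matching the behaviour described
    above (a task-local flag plus a gfp-mask filter applied on the
    allocation path):

        /* Strip __GFP_IO/__GFP_FS when the current task has marked its
         * context as "no I/O allowed during memory allocation". */
        static inline gfp_t memalloc_noio_flags(gfp_t flags)
        {
                if (unlikely(current->flags & PF_MEMALLOC_NOIO))
                        flags &= ~(__GFP_IO | __GFP_FS);
                return flags;
        }

        static inline unsigned int memalloc_noio_save(void)
        {
                unsigned int flags = current->flags & PF_MEMALLOC_NOIO;

                current->flags |= PF_MEMALLOC_NOIO;
                return flags;
        }

        static inline void memalloc_noio_restore(unsigned int flags)
        {
                current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
        }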

    Signed-off-by: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: Jens Axboe
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • From: Zlatko Calusic

    Commit 92df3a723f84 ("mm: vmscan: throttle reclaim if encountering too
    many dirty pages under writeback") introduced waiting on congested zones
    based on a sane algorithm in shrink_inactive_list().

    What this means is that there's no more need for throttling and
    additional heuristics in balance_pgdat(). So, let's remove it and tidy
    up the code.

    Signed-off-by: Zlatko Calusic
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     
  • num_poisoned_pages counts up the number of pages isolated by memory
    errors. But for thp, only one subpage is isolated because the memory
    error handler splits it, so it's wrong to add (1 << compound_trans_order).

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently soft_offline_page() is hard to maintain because it has many
    return points and goto statements. All of this mess comes from
    get_any_page().

    This function should only get the page refcount as the name implies,
    but it does some page-isolating actions like SetPageHWPoison() and
    dequeuing hugepages. This patch corrects that and introduces some
    internal subroutines to make the soft-offlining code more readable and
    maintainable.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Since MCE is an x86 concept, and this code is in mm/, it would be better
    to use the name num_poisoned_pages instead of mce_bad_pages.

    [akpm@linux-foundation.org: fix mm/sparse.c]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Borislav Petkov
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • There are too many return points randomly intermingled with some "goto
    done" return points. So adjust the function structure: one path for
    success, the other for failure. Use atomic_long_inc() instead of
    atomic_long_add().

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Andrew Morton
    Cc: Borislav Petkov
    Cc: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • When doing

    $ echo paddr > /sys/devices/system/memory/soft_offline_page

    to offline a *free* page, the value of mce_bad_pages will be
    incremented, and the page gets its HWPoison flag set, but it is still
    managed by the buddy page allocator.

    $ cat /proc/meminfo | grep HardwareCorrupted

    shows the value.

    If we offline the same page again, the value of mce_bad_pages will be
    incremented *again*, which means the value is now incorrect. Assume
    the page stays free during this short time.

    soft_offline_page()
        get_any_page()
            "else if (is_free_buddy_page(p))" branch returns 0
        "goto done";
        "atomic_long_add(1, &mce_bad_pages);"

    This patch:

    Move the poisoned-page check to the beginning of the function in order
    to fix the error.

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Tested-by: Naoya Horiguchi
    Cc: Borislav Petkov
    Cc: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Several functions test MIGRATE_ISOLATE, and some of them are in hot
    paths, but MIGRATE_ISOLATE is only used if CONFIG_MEMORY_ISOLATION is
    enabled (i.e. by CMA, memory hotplug and memory failure), which is not
    a common config option. So let's not add unnecessary overhead and code
    when CONFIG_MEMORY_ISOLATION is not enabled.
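
    A minimal sketch of the config-dependent helpers this suggests, so the
    hot-path tests compile away when CONFIG_MEMORY_ISOLATION is disabled:

        #ifdef CONFIG_MEMORY_ISOLATION
        static inline bool is_migrate_isolate_page(struct page *page)
        {
                return get_pageblock_migratetype(page) == MIGRATE_ISOLATE;
        }
        static inline bool is_migrate_isolate(int migratetype)
        {
                return migratetype == MIGRATE_ISOLATE;
        }
        #else
        static inline bool is_migrate_isolate_page(struct page *page)
        {
                return false;
        }
        static inline bool is_migrate_isolate(int migratetype)
        {
                return false;
        }
        #endif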

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Function put_page_bootmem() is used to free pages allocated by the
    bootmem allocator, so it should increase totalram_pages when freeing
    pages into the buddy system.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Now all users of "number of pages managed by the buddy system" have been
    converted to use zone->managed_pages, so set zone->present_pages to what
    it should be:

    present_pages = spanned_pages - absent_pages;

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Now we have zone->managed_pages for "pages managed by the buddy system
    in the zone", so replace zone->present_pages with zone->managed_pages
    where what the user really wants is the number of allocatable pages.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • …emblock_overlaps_region().

    The definition of struct movablecore_map is protected by
    CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
    is not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
    movablecore_map in memblock_overlaps_region().

    Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Tang Chen
     
  • We now provide an option for users who don't want to specify a physical
    memory address on the kernel command line.

    /*
    * For movablemem_map=acpi:
    *
    * SRAT: |_____| |_____| |_________| |_________| ......
    * node id: 0 1 1 2
    * hotpluggable: n y y n
    * movablemem_map: |_____| |_________|
    *
    * Using movablemem_map, we can prevent memblock from allocating memory
    * on ZONE_MOVABLE at boot time.
    */

    So the user just specifies movablemem_map=acpi, and the kernel will use
    the hotpluggable info in SRAT to determine which memory ranges should
    be set as ZONE_MOVABLE.

    If all the memory ranges in SRAT are hotpluggable, then no memory can
    be used by the kernel. But before parsing SRAT, memblock has already
    reserved some memory ranges for other purposes, such as for the kernel
    image. We cannot prevent the kernel from using this memory, so we need
    to exclude these ranges even if this memory is hotpluggable.

    Furthermore, there could be several memory ranges in the single node
    in which the kernel resides. We may skip the range that has memory
    reserved by memblock, but if the rest of the memory is too small, then
    the kernel will fail to boot. So, make the whole node in which the
    kernel resides un-hotpluggable. Then the kernel has enough memory to
    use.

    NOTE: Using this option will degrade NUMA performance because the
    whole node will be set as ZONE_MOVABLE, and the kernel cannot use
    memory on it. If users don't want to lose NUMA performance, just
    don't use it.

    [akpm@linux-foundation.org: fix warning]
    [akpm@linux-foundation.org: use strcmp()]
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Len Brown
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • When implementing movablemem_map boot option, we introduced an array
    movablemem_map.map[] to store the memory ranges to be set as
    ZONE_MOVABLE.

    Since ZONE_MOVABLE is the last zone of a node, if the user didn't
    specify the whole node's memory range, we need to extend it to the
    node end so that we can use it to prevent memblock from allocating
    memory in the ranges the user didn't specify.

    We now implement movablemem_map boot option like this:

    /*
    * For movablemem_map=nn[KMG]@ss[KMG]:
    *
    * SRAT: |_____| |_____| |_________| |_________| ......
    * node id: 0 1 1 2
    * user specified: |__| |___|
    * movablemem_map: |___| |_________| |______| ......
    *
    * Using movablemem_map, we can prevent memblock from allocating memory
    * on ZONE_MOVABLE at boot time.
    *
    * NOTE: In this case, SRAT info will be ignored.
    */

    [akpm@linux-foundation.org: clean up code, fix build warning]
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Len Brown
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • On Linux, pages used by the kernel cannot be migrated. As a result, if
    a memory range is used by the kernel, it cannot be hot-removed. So if
    we want to hot-remove memory, we should prevent the kernel from using
    it.

    The way now used to prevent this is to specify a memory range via the
    movablemem_map boot option and set it as ZONE_MOVABLE.

    But when the system is booting, memblock will allocate memory and
    reserve it for the kernel. Before we parse SRAT and know the node
    memory ranges, memblock is already working, and it may allocate memory
    in ranges that are to be set as ZONE_MOVABLE. This memory can then be
    used by the kernel and never be freed.

    So, let's parse SRAT before memblock is called for the first time.
    That is early enough.

    The first call of memblock_find_in_range_node() is in:

    setup_arch()
    |-->setup_real_mode()

    So this patch adds a function, early_parse_srat(), to parse SRAT, and
    calls it before setup_real_mode() is called.

    NOTE:

    1) early_parse_srat() is called before numa_init() and has already
    initialized numa_meminfo. So DO NOT clear numa_nodes_parsed in
    numa_init() and DO NOT zero numa_meminfo in numa_init(), otherwise we
    will lose the NUMA memory info.

    2) I don't know why the count of memory affinities parsed from SRAT
    was used as the return value of the original acpi_numa_init(). So I
    add a static variable srat_mem_cnt to remember this count and use it
    as the return value of the new acpi_numa_init().

    [mhocko@suse.cz: parse SRAT before memblock is ready fix]
    Signed-off-by: Tang Chen
    Reviewed-by: Wen Congyang
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Len Brown
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Ensure that bootmem does not allocate memory from areas that may be
    ZONE_MOVABLE. The map info comes from the movablecore_map boot option.

    Signed-off-by: Tang Chen
    Reviewed-by: Wen Congyang
    Reviewed-by: Lai Jiangshan
    Tested-by: Lin Feng
    Cc: Wu Jianguo
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • If kernelcore or movablecore is specified at the same time with
    movablemem_map, movablemem_map will have higher priority to be
    satisfied. This patch will make find_zone_movable_pfns_for_nodes()
    calculate zone_movable_pfn[] with the limit from zone_movable_limit[].

    Signed-off-by: Tang Chen
    Reviewed-by: Wen Congyang
    Cc: Wu Jianguo
    Reviewed-by: Lai Jiangshan
    Tested-by: Lin Feng
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE
    limit from the movablemem_map boot option for all nodes. The function
    sanitize_zone_movable_limit() will find out to which node the ranges
    in movable_map.map[] belong, and calculate the low boundary of
    ZONE_MOVABLE for each node.

    Signed-off-by: Tang Chen
    Signed-off-by: Liu Jiang
    Reviewed-by: Wen Congyang
    Cc: Wu Jianguo
    Reviewed-by: Lai Jiangshan
    Tested-by: Lin Feng
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Add functions to parse the movablemem_map boot option. Since the
    option can be specified more than once, all the maps will be stored in
    the global movablemem_map.map array.

    We also keep the array in monotonically increasing order by start_pfn,
    and merge all overlapping ranges.
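
    An illustrative sketch of the insert-and-merge step (field and helper
    names are simplified; the parser stores [start_pfn, end_pfn) ranges in
    movablemem_map.map[] with a count in nr_map):

        /* Illustrative only: insert [start_pfn, end_pfn) keeping the array
         * sorted by start_pfn and merging ranges that overlap or touch. */
        static void __init movablemem_map_add(unsigned long start_pfn,
                                              unsigned long end_pfn)
        {
                int i, pos;

                /* Find the insertion position. */
                for (pos = 0; pos < movablemem_map.nr_map; pos++)
                        if (start_pfn < movablemem_map.map[pos].start_pfn)
                                break;

                /* Shift later entries up and insert the new range. */
                memmove(&movablemem_map.map[pos + 1], &movablemem_map.map[pos],
                        (movablemem_map.nr_map - pos) *
                        sizeof(movablemem_map.map[0]));
                movablemem_map.map[pos].start_pfn = start_pfn;
                movablemem_map.map[pos].end_pfn = end_pfn;
                movablemem_map.nr_map++;

                /* Merge neighbours that overlap or are adjacent. */
                for (i = 0; i < movablemem_map.nr_map - 1; ) {
                        if (movablemem_map.map[i].end_pfn >=
                            movablemem_map.map[i + 1].start_pfn) {
                                movablemem_map.map[i].end_pfn =
                                        max(movablemem_map.map[i].end_pfn,
                                            movablemem_map.map[i + 1].end_pfn);
                                memmove(&movablemem_map.map[i + 1],
                                        &movablemem_map.map[i + 2],
                                        (movablemem_map.nr_map - i - 2) *
                                        sizeof(movablemem_map.map[0]));
                                movablemem_map.nr_map--;
                        } else {
                                i++;
                        }
                }
        }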

    [akpm@linux-foundation.org: improve comment]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: remove unneeded parens]
    Signed-off-by: Tang Chen
    Signed-off-by: Lai Jiangshan
    Reviewed-by: Wen Congyang
    Tested-by: Lin Feng
    Cc: Wu Jianguo
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • During the implementation of SRAT support, we met a problem. In
    setup_arch(), we have the following call series:

    1) memblock is ready;
    2) some functions use memblock to allocate memory;
    3) parse ACPI tables, such as SRAT.

    Before 3), we don't know which memory is hotpluggable, and as a result,
    we cannot prevent memblock from allocating hotpluggable memory. So, in
    2), there could be some hotpluggable memory allocated by memblock.

    Now, we are trying to parse SRAT earlier, before memblock is ready. But
    I think we need more investigation on this topic. So in this v5, I
    dropped all the SRAT support, and v5 is just the same as v3, and it is
    based on 3.8-rc3.

    As we planned, we will eventually support getting this info from SRAT
    without user participation, and we will post another patch-set to do
    so.

    Also, I think that for now we can add this boot option as the first
    step of supporting movable nodes. Since Linux cannot migrate
    direct-mapped pages, the only way for now is to limit the whole node
    to containing only movable memory.

    Using SRAT is one way. But even if we can use SRAT, users still need
    an interface to enable/disable this functionality if they don't want
    to lose their NUMA performance. So I think a user interface is always
    needed.

    For now, users can disable this functionality by not specifying the
    boot option. Later, we will post SRAT support and add another option
    value "movablecore_map=acpi" to use SRAT.

    This patch:

    If the system can create a movable node, in which all of the node's
    memory is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate
    memory for the node's pg_data_t. So, use memblock_alloc_try_nid()
    instead of memblock_alloc_nid() to retry when the first allocation
    fails.
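
    A hedged sketch of the allocation change described (surrounding
    setup_node_data() details are abbreviated; nd_pa, nd_size and nid are
    the existing locals):

        /* Try a node-local allocation first; memblock_alloc_try_nid()
         * falls back to other nodes when nid has no usable memory left,
         * e.g. because all of it is ZONE_MOVABLE. */
        nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
        if (!nd_pa) {
                pr_err("Cannot find %zu bytes in node %d\n", nd_size, nid);
                return;
        }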

    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tang Chen
    Signed-off-by: Jiang Liu
    Cc: Wu Jianguo
    Cc: Wen Congyang
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu)
    will return -1. As a result, cpumask_of_node(nid) will return NULL. In
    this case, find_next_bit() in for_each_cpu will get a NULL pointer and
    cause a panic.

    Here is a call trace:
    Call Trace:

    select_fallback_rq+0x71/0x190
    try_to_wake_up+0x2cb/0x2f0
    wake_up_process+0x15/0x20
    hrtimer_wakeup+0x22/0x30
    __run_hrtimer+0x83/0x320
    hrtimer_interrupt+0x106/0x280
    smp_apic_timer_interrupt+0x69/0x99
    apic_timer_interrupt+0x6f/0x80

    There is an hrtimer process sleeping whose cpu has already been
    offlined. When it is woken up, it tries to find another cpu to run on
    and gets a -1 nid. As a result, cpumask_of_node(-1) returns NULL and
    causes a kernel panic.

    This patch fixes the problem by checking whether the nid is -1. If
    nid is not -1, a cpu on the same node will be picked. Otherwise, an
    online cpu on another node will be picked.
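
    A sketch of the check as described, in the scheduler's fallback-cpu
    selection (dest_cpu and p come from the surrounding function; the
    later "pick any online cpu" fallback is elided):

        int nid = cpu_to_node(cpu);
        const struct cpumask *nodemask = NULL;

        /* Only consult the node's cpumask if the cpu still maps to a node. */
        if (nid != -1) {
                nodemask = cpumask_of_node(nid);

                /* Look for an allowed, online cpu in the same node. */
                for_each_cpu(dest_cpu, nodemask) {
                        if (!cpu_online(dest_cpu))
                                continue;
                        if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
                                return dest_cpu;
                }
        }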

    Signed-off-by: Tang Chen
    Signed-off-by: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • When the node is offlined, there is no memory/cpu on the node. If a
    sleeping task was running on a cpu of this node, it will be migrated to
    a cpu on another node. So we can clear the cpu-to-node mapping.

    [akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit]
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • The node will be offlined when all memory/cpu on the node has been
    hotremoved. So we should try to offline the node when hotremoving a
    cpu on the node.

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Len Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • try_offline_node() will be needed in the tristate
    drivers/acpi/processor_driver.c.

    The node will be offlined when all memory/cpu on the node have been
    hotremoved. So we need the function try_offline_node() in cpu-hotplug
    path.

    If memory-hotplug is disabled and cpu-hotplug is enabled:

    1. there is no memory on the node
    we don't online the node, and the cpu's node is the nearest node.

    2. the node contains some memory
    the node has been onlined, and the cpu's node is still needed
    to migrate the sleeping task on the cpu to the same node.

    So we do nothing in try_offline_node() in this case.

    [rientjes@google.com: export the function try_offline_node() fix]
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Len Brown
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • When a cpu is hotplugged, we call acpi_map_cpu2node() in
    _acpi_map_lsapic() to store the cpu's node and the apicid's node. But
    we don't clear the cpu's node in acpi_unmap_lsapic() when the cpu is
    hotremoved. If the node is also hotremoved, we will get the following
    messages:

    kernel BUG at include/linux/gfp.h:329!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
    Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
    RIP: 0010:[] [] allocate_slab+0x28d/0x300
    RSP: 0018:ffff88078a049cf8 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
    RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
    R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
    FS: 00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
    Call Trace:
    new_slab+0x30/0x1b0
    __slab_alloc+0x358/0x4c0
    kmem_cache_alloc_node_trace+0xb4/0x1e0
    alloc_fair_sched_group+0xd0/0x1b0
    sched_create_group+0x3e/0x110
    sched_autogroup_create_attach+0x4d/0x180
    sys_setsid+0xd4/0xf0
    system_call_fastpath+0x16/0x1b
    Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
    RIP [] allocate_slab+0x28d/0x300
    RSP
    ---[ end trace adf84c90f3fea3e5 ]---

    The reason is that the cpu's node is not NUMA_NO_NODE, so we will call
    alloc_pages_exact_node() to allocate memory on that node, but the node
    has been offlined.

    If the node is onlined, we still need the cpu's node. For example: a
    task on the cpu is asleep when the cpu is hotremoved. We will choose
    another cpu to run this task when it is woken up. If we know the
    cpu's node, we will choose a cpu on the same node first. So we should
    clear the cpu-to-node mapping when the node is offlined.

    This patch only clears the apicid-to-node mapping when the cpu is
    hotremoved.

    [akpm@linux-foundation.org: fix section error]
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • is_valid_nodemask() was introduced by commit 19770b32609b ("mm: filter
    based on a nodemask as well as a gfp_mask"), but it does not match its
    comments, because it does not check zones whose zoneid is greater than
    policy_zone.

    Also, commit b377fd3982ad ("Apply memory policies to top two highest
    zones when highest zone is ZONE_MOVABLE") tells us that if the highest
    zone is ZONE_MOVABLE, we should also apply memory policies to it, so
    ZONE_MOVABLE should be a valid zone for policies.
    is_valid_nodemask() needs to be changed to match this.

    Fix: check all zones, even those with zoneid > policy_zone. Use
    nodes_intersects() instead of open-coding the check.
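
    A sketch of the fix as described, checking the policy nodemask against
    all nodes that have memory, including nodes whose memory sits only in
    ZONE_MOVABLE:

        static int is_valid_nodemask(const nodemask_t *nodemask)
        {
                /* Valid if it intersects any node that contains memory,
                 * regardless of which zones that memory is in. */
                return nodes_intersects(*nodemask, node_states[N_MEMORY]);
        }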

    Reported-by: Wen Congyang
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tang Chen
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan