30 Apr, 2013

2 commits

  • __remove_pages() is only necessary for CONFIG_MEMORY_HOTREMOVE. PowerPC
    pseries will return -EOPNOTSUPP if unsupported.

    Adding an #ifdef around it makes several functions it depends on
    unnecessary as well, which saves .text when the option is disabled (it
    is disabled in most defconfigs, including x86; powerpc is an exception).
    remove_memory_block() becomes static since it is not referenced outside
    of drivers/base/memory.c.

    Build tested on x86 and powerpc with CONFIG_MEMORY_HOTREMOVE both enabled
    and disabled.
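
    A minimal sketch of the resulting pattern (illustrative, not the literal
    diff; the function body is elided):

    #ifdef CONFIG_MEMORY_HOTREMOVE
    int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
                       unsigned long nr_pages)
    {
            /* ... tear down the sections covering the pfn range ... */
            return 0;
    }
    #endif /* CONFIG_MEMORY_HOTREMOVE */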

    Signed-off-by: David Rientjes
    Acked-by: Toshi Kani
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Greg Kroah-Hartman
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The sparse code, when asking the architecture to populate the vmemmap,
    specifies the section range as a starting page and a number of pages.

    This is an awkward interface, because none of the arch-specific code
    actually thinks of the range in terms of 'struct page' units and always
    translates it to bytes first.

    In addition, later patches mix huge page and regular page backing for
    the vmemmap. For this, they need to call vmemmap_populate_basepages()
    on sub-section ranges with PAGE_SIZE and PMD_SIZE in mind. But these
    are not necessarily multiples of the 'struct page' size and so this unit
    is too coarse.

    Just translate the section range into bytes once in the generic sparse
    code, then pass byte ranges down the stack.
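
    A hedged sketch of the interface change (signatures reconstructed from
    the description above):

    /* before: section range in 'struct page' units */
    int vmemmap_populate(struct page *start_page,
                         unsigned long nr_pages, int node);

    /* after: a plain byte range, computed once in the sparse code */
    int vmemmap_populate(unsigned long start, unsigned long end, int node);
    int vmemmap_populate_basepages(unsigned long start,
                                   unsigned long end, int node);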

    Signed-off-by: Johannes Weiner
    Cc: Ben Hutchings
    Cc: Bernhard Schmidt
    Cc: Johannes Weiner
    Cc: Russell King
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Benjamin Herrenschmidt
    Cc: "Luck, Tony"
    Cc: Heiko Carstens
    Acked-by: David S. Miller
    Tested-by: David S. Miller
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

24 Feb, 2013

4 commits

  • Since MCE is an x86 concept, and this code is in mm/, it would be better
    to use the name num_poisoned_pages instead of mce_bad_pages.

    [akpm@linux-foundation.org: fix mm/sparse.c]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Borislav Petkov
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • The usemap could also be allocated as compound pages, so compound pages
    should be considered when freeing the memmap.

    Without this fix, there are problems when we free vmemmap pagetables
    that are stored in compound pages: the old pagetables are not freed
    properly, so when the memory is added again no new pagetable is created,
    the stale pagetable entry is reused, and the kernel panics.

    The call trace is like the following:

    BUG: unable to handle kernel paging request at ffffea0040000000
    IP: [] sparse_add_one_section+0xef/0x166
    PGD 7ff7d4067 PUD 78e035067 PMD 78e11d067 PTE 0
    Oops: 0002 [#1] SMP
    Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg lpc_ich mfd_core i2c_i801 i2c_core i7core_edac edac_core ioatdma e1000e igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
    CPU 0
    Pid: 4, comm: kworker/0:0 Tainted: G W 3.8.0-rc3-phy-hot-remove+ #3 FUJITSU-SV PRIMEQUEST 1800E/SB
    RIP: 0010:[] [] sparse_add_one_section+0xef/0x166
    RSP: 0018:ffff8807bdcb35d8 EFLAGS: 00010006
    RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000200000
    RDX: ffff88078df01148 RSI: 0000000000000282 RDI: ffffea0040000000
    RBP: ffff8807bdcb3618 R08: 4cf05005b019467a R09: 0cd98fa09631467a
    R10: 0000000000000000 R11: 0000000000030e20 R12: 0000000000008000
    R13: ffffea0040000000 R14: ffff88078df66248 R15: ffff88078ea13b10
    FS: 0000000000000000(0000) GS:ffff8807c1a00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffffea0040000000 CR3: 0000000001c0c000 CR4: 00000000000007f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kworker/0:0 (pid: 4, threadinfo ffff8807bdcb2000, task ffff8807bde18000)
    Call Trace:
    __add_pages+0x85/0x120
    arch_add_memory+0x71/0xf0
    add_memory+0xd6/0x1f0
    acpi_memory_device_add+0x170/0x20c
    acpi_device_probe+0x50/0x18a
    really_probe+0x6c/0x320
    driver_probe_device+0x47/0xa0
    __device_attach+0x53/0x60
    bus_for_each_drv+0x6c/0xa0
    device_attach+0xa8/0xc0
    bus_probe_device+0xb0/0xe0
    device_add+0x301/0x570
    device_register+0x1e/0x30
    acpi_device_register+0x1d8/0x27c
    acpi_add_single_object+0x1df/0x2b9
    acpi_bus_check_add+0x112/0x18f
    acpi_ns_walk_namespace+0x105/0x255
    acpi_walk_namespace+0xcf/0x118
    acpi_bus_scan+0x5b/0x7c
    acpi_bus_add+0x2a/0x2c
    container_notify_cb+0x112/0x1a9
    acpi_ev_notify_dispatch+0x46/0x61
    acpi_os_execute_deferred+0x27/0x34
    process_one_work+0x20e/0x5c0
    worker_thread+0x12e/0x370
    kthread+0xee/0x100
    ret_from_fork+0x7c/0xb0
    Code: 00 00 48 89 df 48 89 45 c8 e8 3e 71 b1 ff 48 89 c2 48 8b 75 c8 b8 ef ff ff ff f6 02 01 75 4b 49 63 cc 31 c0 4c 89 ef 48 c1 e1 06 aa 48 8b 02 48 83 c8 01 48 85 d2 48 89 02 74 29 a8 01 74 25
    RIP [] sparse_add_one_section+0xef/0x166
    RSP
    CR2: ffffea0040000000
    ---[ end trace e7f94e3a34c442d4 ]---
    Kernel panic - not syncing: Fatal exception
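
    The shape of the fix, as a hedged sketch (the helper name is
    hypothetical; compound_head()/compound_order() are the stock helpers):
    when freeing a page that may be compound, free the whole compound unit
    at its order.

    static void free_memmap_page(struct page *page)    /* hypothetical */
    {
            if (PageCompound(page)) {
                    struct page *head = compound_head(page);
                    /* free the compound page as one unit, at its order */
                    __free_pages(head, compound_order(head));
            } else {
                    __free_page(page);
            }
    }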

    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • Introduce a new API, vmemmap_free(), to free and remove vmemmap
    pagetables. Since pagetable implementations differ across architectures,
    each architecture has to provide its own version of vmemmap_free(), just
    like vmemmap_populate().

    Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.
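
    A hedged sketch of the x86-64 flavor (simplified; remove_pagetable() is
    the arch helper mentioned here, and the exact signature changed in later
    releases):

    void __meminit vmemmap_free(struct page *memmap, unsigned long nr_pages)
    {
            unsigned long start = (unsigned long)memmap;
            unsigned long end = (unsigned long)(memmap + nr_pages);

            /* false: tearing down vmemmap, not the direct mapping */
            remove_pagetable(start, end, false);
    }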

    [mhocko@suse.cz: fix implicit declaration of remove_pagetable]
    Signed-off-by: Yasuaki Ishimatsu
    Signed-off-by: Jianguo Wu
    Signed-off-by: Wen Congyang
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • In __remove_section(), we took pgdat_resize_lock around the call to
    sparse_remove_one_section(). This lock disables irqs, but we don't need
    to hold it across the whole function: freeing pagetables in
    free_section_usemap() requires flush_tlb_all(), which needs irqs
    enabled, otherwise the WARN_ON_ONCE() in smp_call_function_many() is
    triggered.

    If we lock the whole of sparse_remove_one_section(), we get this call
    trace:

    ------------[ cut here ]------------
    WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
    Hardware name: PRIMEQUEST 1800E
    ......
    Call Trace:
    smp_call_function_many+0xbd/0x260
    smp_call_function+0x3b/0x50
    on_each_cpu+0x3b/0xc0
    flush_tlb_all+0x1c/0x20
    remove_pagetable+0x14e/0x1d0
    vmemmap_free+0x18/0x20
    sparse_remove_one_section+0xf7/0x100
    __remove_section+0xa2/0xb0
    __remove_pages+0xa0/0xd0
    arch_remove_memory+0x6b/0xc0
    remove_memory+0xb8/0xf0
    acpi_memory_device_remove+0x53/0x96
    acpi_device_remove+0x90/0xb2
    __device_release_driver+0x7c/0xf0
    device_release_driver+0x2f/0x50
    acpi_bus_remove+0x32/0x6d
    acpi_bus_trim+0x91/0x102
    acpi_bus_hot_remove_device+0x88/0x16b
    acpi_os_execute_deferred+0x27/0x34
    process_one_work+0x20e/0x5c0
    worker_thread+0x12e/0x370
    kthread+0xee/0x100
    ret_from_fork+0x7c/0xb0
    ---[ end trace 25e85300f542aa01 ]---
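
    A hedged sketch of the reworked locking (simplified from the actual
    patch): take pgdat_resize_lock only around the section bookkeeping, and
    run the freeing path with irqs enabled.

    void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
    {
            struct pglist_data *pgdat = zone->zone_pgdat;
            struct page *memmap = NULL;
            unsigned long *usemap = NULL;
            unsigned long flags;

            pgdat_resize_lock(pgdat, &flags);
            if (ms->section_mem_map) {
                    usemap = ms->pageblock_flags;
                    memmap = sparse_decode_mem_map(ms->section_mem_map,
                                                   __section_nr(ms));
                    ms->section_mem_map = 0;
                    ms->pageblock_flags = NULL;
            }
            pgdat_resize_unlock(pgdat, &flags);

            /* irqs are on again here, so the flush_tlb_all() inside the
               vmemmap freeing path is safe */
            free_section_usemap(memmap, usemap);
    }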

    Signed-off-by: Tang Chen
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     

12 Dec, 2012

2 commits

  • If sparse memory vmemmap is enabled, we can't free the memory that
    stores struct page when a memory device is hot-removed, because that
    memory may hold struct pages which manage memory that doesn't belong to
    the device. When the device is hot-added again, this memory is reused to
    store struct page, which may still contain obsolete information, and we
    get bad-page state:

    init_memory_mapping: [mem 0x80000000-0x9fffffff]
    Built 2 zonelists in Node order, mobility grouping on. Total pages: 547617
    Policy zone: Normal
    BUG: Bad page state in process bash pfn:9b6dc
    page:ffffea0002200020 count:0 mapcount:0 mapping: (null) index:0xfdfdfdfdfdfdfdfd
    page flags: 0x2fdfdfdfd5df9fd(locked|referenced|uptodate|dirty|lru|active|slab|owner_priv_1|private|private_2|writeback|head|tail|swapcache|reclaim|swapbacked|unevictable|uncached|compound_lock)
    Modules linked in: netconsole acpiphp pci_hotplug acpi_memhotplug loop kvm_amd kvm microcode tpm_tis tpm tpm_bios evdev psmouse serio_raw i2c_piix4 i2c_core parport_pc parport processor button thermal_sys ext3 jbd mbcache sg sr_mod cdrom ata_generic virtio_net ata_piix virtio_blk libata virtio_pci virtio_ring virtio scsi_mod
    Pid: 988, comm: bash Not tainted 3.6.0-rc7-guest #12
    Call Trace:
    [] ? bad_page+0xb0/0x100
    [] ? free_pages_prepare+0xb3/0x100
    [] ? free_hot_cold_page+0x48/0x1a0
    [] ? online_pages_range+0x68/0xa0
    [] ? __online_page_increment_counters+0x10/0x10
    [] ? walk_system_ram_range+0x101/0x110
    [] ? online_pages+0x1a5/0x2b0
    [] ? __memory_block_change_state+0x20d/0x270
    [] ? store_mem_state+0xb6/0xf0
    [] ? sysfs_write_file+0xd2/0x160
    [] ? vfs_write+0xaa/0x160
    [] ? sys_write+0x47/0x90
    [] ? async_page_fault+0x25/0x30
    [] ? system_call_fastpath+0x16/0x1b
    Disabling lock debugging due to kernel taint

    This patch clears the memory used to store struct page, avoiding the
    unexpected errors.
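
    The fix is essentially one line in sparse_add_one_section() (sketch):

    /* the memmap may be recycled vmemmap-backed memory; start clean */
    memset(memmap, 0, sizeof(struct page) * nr_pages);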

    Signed-off-by: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Yasuaki Ishimatsu
    Reported-by: Vasilis Liaskovitis
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • When we hot-remove a memory device, we free the memory that stores
    struct page. If a page there is hwpoisoned, we should decrease
    mce_bad_pages accordingly.
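
    A sketch of the helper this implies, hedged (in the real tree it is
    guarded by CONFIG_MEMORY_FAILURE):

    static void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
    {
            int i;

            if (!memmap)
                    return;

            for (i = 0; i < nr_pages; i++) {
                    if (PageHWPoison(&memmap[i])) {
                            /* this struct page is going away; drop its
                               contribution to the poison count */
                            atomic_long_sub(1, &mce_bad_pages);
                            ClearPageHWPoison(&memmap[i]);
                    }
            }
    }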

    [akpm@linux-foundation.org: cleanup ifdefs]
    Signed-off-by: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Len Brown
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Yasuaki Ishimatsu
    Cc: Dave Hansen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     

01 Dec, 2012

1 commit

  • With CONFIG_DEBUG_VIRTUAL and CONFIG_SPARSEMEM_VMEMMAP enabled, memory
    hot-remove hits a kernel BUG at arch/x86/mm/physaddr.c:20.

    It is caused by the virt_to_page() call in free_section_usemap():
    virt_to_page() is only valid for kernel direct-mapping addresses, but
    sparse-vmemmap uses vmemmap addresses, so it goes wrong here.

    ------------[ cut here ]------------
    kernel BUG at arch/x86/mm/physaddr.c:20!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: acpihp_drv acpihp_slot edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_intel ipv6 ixgbe igb iTCO_wdt i7core_edac edac_core pcspkr iTCO_vendor_support ioatdma microcode joydev sr_mod i2c_i801 dca lpc_ich mfd_core mdio tpm_tis i2c_core hid_generic tpm cdrom sg tpm_bios rtc_cmos button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata megaraid_sas scsi_mod
    CPU 39
    Pid: 6454, comm: sh Not tainted 3.7.0-rc1-acpihp-final+ #45 QCI QSSC-S4R/QSSC-S4R
    RIP: 0010:[] [] __phys_addr+0x88/0x90
    RSP: 0018:ffff8804440d7c08 EFLAGS: 00010006
    RAX: 0000000000000006 RBX: ffffea0012000000 RCX: 000000000000002c
    ...
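
    A hedged illustration of the mismatch (buf stands for a direct-mapped
    allocation, memmap for a vmemmap-backed section memmap; not the literal
    fix):

    /* fine: kmalloc() returns a direct-mapped (lowmem) address */
    struct page *ok = virt_to_page(buf);

    /* wrong with CONFIG_SPARSEMEM_VMEMMAP: the memmap lives in the
       vmemmap region (ffffea00... on x86-64), outside the direct
       mapping, so __phys_addr() trips the DEBUG_VIRTUAL BUG */
    struct page *bad = virt_to_page(memmap);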

    Signed-off-by: Jianguo Wu
    Signed-off-by: Jiang Liu
    Reviewed-by: Wen Congyang
    Acked-by: Johannes Weiner
    Reviewed-by: Yasuaki Ishimatsu
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     

01 Aug, 2012

4 commits

  • sparse_index_init() uses the index_init_lock spinlock to protect root
    mem_section assignment. The lock is not necessary anymore because the
    function is called only during boot (during paging init which is executed
    only from a single CPU) and from the hotplug code (by add_memory() via
    arch_add_memory()) which uses mem_hotplug_mutex.

    The lock was introduced by 28ae55c9 ("sparsemem extreme: hotplug
    preparation") and sparse_index_init() was used only during boot at that
    time.

    Later, when the hotplug code (and add_memory()) was introduced, there
    was no synchronization, so it was probably possible to online more
    sections from the same root concurrently (though I am not 100% sure
    about that). The first
    synchronization has been added by 6ad696d2 ("mm: allow memory hotplug and
    hibernation in the same kernel") which was later replaced by the
    mem_hotplug_mutex - 20d6c96b ("mem-hotplug: introduce
    {un}lock_memory_hotplug()").

    Let's remove the lock as it is not needed and it makes the code more
    confusing.
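
    With the lock gone, the function reduces to roughly this (sketch):

    static int __meminit sparse_index_init(unsigned long section_nr, int nid)
    {
            unsigned long root = SECTION_NR_TO_ROOT(section_nr);
            struct mem_section *section;

            if (mem_section[root])
                    return -EEXIST;

            section = sparse_index_alloc(nid);
            if (!section)
                    return -ENOMEM;

            /* boot and hotplug callers are already serialized */
            mem_section[root] = section;
            return 0;
    }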

    [mhocko@suse.cz: changelog]
    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • __section_nr() retrieves a memory section's number from its descriptor.
    It's possible that the specified descriptor doesn't exist in the global
    array, so add a check for that case and report an error when it is hit.
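
    The check amounts to a single assertion after the root scan (sketch):

    /* in __section_nr(), after walking mem_section[] for a matching root */
    VM_BUG_ON(!root);       /* descriptor not in the global array */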

    Signed-off-by: Gavin Shan
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section
    descriptors are allocated from slab or bootmem. Both allocators can
    return zeroed memory, so let them clear the chunk; we needn't clear it
    explicitly.
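
    A sketch of the allocation in sparse_index_alloc() after the change
    (both branches return zeroed memory, so the explicit memset() goes
    away):

    if (slab_is_available())
            section = kzalloc_node(array_size, GFP_KERNEL, nid);
    else
            section = alloc_bootmem_node(NODE_DATA(nid), array_size);
    /* no memset(section, 0, array_size) needed anymore */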

    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as
    Itanium, pageblock_order is a variable with default value of 0. It's set
    to the right value by set_pageblock_order() in function
    free_area_init_core().

    But pageblock_order may be used by sparse_init() before
    free_area_init_core() is called, along this path:
    sparse_init()
    ->sparse_early_usemaps_alloc_node()
    ->usemap_size()
    ->SECTION_BLOCKFLAGS_BITS
    ->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) *
    NR_PAGEBLOCK_BITS)

    The uninitialized pageblock_order causes memory waste because
    usemap_size() returns a much bigger value than is really needed.

    For example, on an Itanium platform,
    sparse_init() pageblock_order=0 usemap_size=24576
    free_area_init_core() before pageblock_order=0, usemap_size=24576
    free_area_init_core() after pageblock_order=12, usemap_size=8

    That means 24K memory has been wasted for each section, so fix it by calling
    set_pageblock_order() from sparse_init().
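
    The fix is a single early call (sketch):

    void __init sparse_init(void)
    {
            /* set pageblock_order before usemap_size() is consulted */
            set_pageblock_order();
            /* ... usemap and memmap allocation follows ... */
    }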

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Cc: Tony Luck
    Cc: Yinghai Lu
    Cc: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Keping Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

12 Jul, 2012

2 commits

  • After commit f5bf18fa22f8 ("bootmem/sparsemem: remove limit constraint
    in alloc_bootmem_section"), usemap allocations may easily be placed
    outside the optimal section that holds the node descriptor, even if
    there is space available in that section. This results in unnecessary
    hotplug dependencies: the node must be unplugged before the section
    holding the usemap can be removed.

    The reason is that the bootmem allocator doesn't guarantee a linear
    search starting from the passed allocation goal but may start out at a
    much higher address absent an upper limit.

    Fix this by trying the allocation with the limit at the section end,
    then retry without if that fails. This keeps the fix from f5bf18fa22f8
    of not panicking if the allocation does not fit in the section, but
    still makes sure to try to stay within the section at first.
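
    A sketch of the retry logic this describes (hedged; names follow the
    usemap allocation path discussed above):

    limit = goal + (1UL << PA_SECTION_SHIFT);   /* end of the section */
    again:
            p = ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size,
                                              SMP_CACHE_BYTES, goal, limit);
            if (!p && limit) {
                    limit = 0;      /* retry without the upper bound */
                    goto again;
            }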

    Signed-off-by: Yinghai Lu
    Signed-off-by: Johannes Weiner
    Cc: [3.3.x, 3.4.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Commit 238305bb4d41 ("mm: remove sparsemem allocation details from the
    bootmem allocator") introduced a bug in the allocation goal calculation
    that put section usemaps not in the same section as the node
    descriptors, creating unnecessary hotplug dependencies between them:

    node 0 must be removed before remove section 16399
    node 1 must be removed before remove section 16399
    node 2 must be removed before remove section 16399
    node 3 must be removed before remove section 16399
    node 4 must be removed before remove section 16399
    node 5 must be removed before remove section 16399
    node 6 must be removed before remove section 16399

    The reason is that it applies PAGE_SECTION_MASK to the physical address
    of the node descriptor when finding a suitable place to put the usemap,
    when this mask is actually intended to be used with PFNs. Because the
    PFN mask is wider, the target address will point beyond the wanted
    section holding the node descriptor and the node must be offlined before
    the section holding the usemap can go.

    Fix this by extending the mask to address width before use.
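
    The one-line nature of the fix (sketch):

    /* before: a PFN mask applied to a physical address */
    goal = __pa(pgdat) & PAGE_SECTION_MASK;

    /* after: extend the mask to address width first */
    goal = __pa(pgdat) & (PAGE_SECTION_MASK << PAGE_SHIFT);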

    Signed-off-by: Yinghai Lu
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

30 May, 2012

1 commit

  • alloc_bootmem_section() derives allocation area constraints from the
    specified sparsemem section. This is a bit specific for a generic memory
    allocator like bootmem, though, so move it over to sparsemem.

    As __alloc_bootmem_node_nopanic() already retries failed allocations with
    relaxed area constraints, the fallback code in sparsemem.c can be removed
    and the code becomes a bit more compact overall.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Acked-by: David S. Miller
    Cc: Yinghai Lu
    Cc: Gavin Shan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

22 Mar, 2012

1 commit

  • While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
    Overcommit) on powerpc, we tripped the following:

    kernel BUG at mm/bootmem.c:483!
    cpu 0x0: Vector: 700 (Program Check) at [c000000000c03940]
    pc: c000000000a62bd8: .alloc_bootmem_core+0x90/0x39c
    lr: c000000000a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
    sp: c000000000c03bc0
    msr: 8000000000021032
    current = 0xc000000000b0cce0
    paca = 0xc000000001d80000
    pid = 0, comm = swapper
    kernel BUG at mm/bootmem.c:483!
    enter ? for help
    [c000000000c03c80] c000000000a64bcc
    .sparse_early_usemaps_alloc_node+0x84/0x29c
    [c000000000c03d50] c000000000a64f10 .sparse_init+0x12c/0x28c
    [c000000000c03e20] c000000000a474f4 .setup_arch+0x20c/0x294
    [c000000000c03ee0] c000000000a4079c .start_kernel+0xb4/0x460
    [c000000000c03f90] c000000000009670 .start_here_common+0x1c/0x2c

    This is

    BUG_ON(limit && goal + size > limit);

    and after some debugging, it seems that

    goal = 0x7ffff000000
    limit = 0x80000000000

    and sparse_early_usemaps_alloc_node ->
    sparse_early_usemaps_alloc_pgdat_section calls

    return alloc_bootmem_section(usemap_size() * count, section_nr);

    This is on a system with 8TB available via the AMS pool, and as a quirk
    of AMS in firmware, all of that memory shows up in node 0. So, we end
    up with an allocation that will fail the goal/limit constraints.

    In theory, we could "fall-back" to alloc_bootmem_node() in
    sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
    defined, we'll BUG_ON() instead. A simple solution appears to be to
    unconditionally remove the limit condition in alloc_bootmem_section,
    meaning allocations are allowed to cross section boundaries (necessary
    for systems of this size).

    Johannes Weiner pointed out that if alloc_bootmem_section() no longer
    guarantees section-locality, we need check_usemap_section_nr() to print
    possible cross-dependencies between node descriptors and the usemaps
    allocated through it. That makes the two loops in
    sparse_early_usemaps_alloc_node() identical, so re-factor the code a
    bit.

    [akpm@linux-foundation.org: code simplification]
    Signed-off-by: Nishanth Aravamudan
    Cc: Dave Hansen
    Cc: Anton Blanchard
    Cc: Paul Mackerras
    Cc: Ben Herrenschmidt
    Cc: Robert Jennings
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: [3.3.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

26 Jul, 2011

1 commit

  • These uses are read-only and in a subsequent patch I have a const struct
    page in my hand...

    [akpm@linux-foundation.org: fix warnings in lowmem_page_address()]
    Signed-off-by: Ian Campbell
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Campbell
     

14 Jan, 2011

1 commit

  • PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can
    be added to page->flags without overflowing (because of the sparse section
    bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also
    has to move the memory hotplug code from _mapcount to lru.next to avoid
    any risk of clashes. We can't use lru.next for the PG_buddy removal
    itself, but memory hotplug can use lru.next even more easily than it
    used _mapcount.
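
    Roughly, the new encoding looks like this (sketch):

    #define PAGE_BUDDY_MAPCOUNT_VALUE (-2)

    static inline int PageBuddy(struct page *page)
    {
            return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
    }

    static inline void __SetPageBuddy(struct page *page)
    {
            /* only a free page (mapcount -1) may become a buddy page */
            VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
            atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
    }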

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 May, 2010

1 commit

  • We need to put mem_map high when virtual memmap is not used.

    Before this patch, the free mem pfn ranges on the first node were:
    [ 0.000000] 19 - 1f
    [ 0.000000] 28 40 - 80 95
    [ 0.000000] 702 740 - 1000 1000
    [ 0.000000] 347c - 347e
    [ 0.000000] 34e7 3500 - 3b80 3b8b
    [ 0.000000] 73b8b 73bc0 - 73c00 73c00
    [ 0.000000] 73ddd - 73e00
    [ 0.000000] 73fdd - 74000
    [ 0.000000] 741dd - 74200
    [ 0.000000] 743dd - 74400
    [ 0.000000] 745dd - 74600
    [ 0.000000] 747dd - 74800
    [ 0.000000] 749dd - 74a00
    [ 0.000000] 74bdd - 74c00
    [ 0.000000] 74ddd - 74e00
    [ 0.000000] 74fdd - 75000
    [ 0.000000] 751dd - 75200
    [ 0.000000] 753dd - 75400
    [ 0.000000] 755dd - 75600
    [ 0.000000] 757dd - 75800
    [ 0.000000] 759dd - 75a00
    [ 0.000000] 79bdd 79c00 - 7d540 7d550
    [ 0.000000] 7f745 - 7f750
    [ 0.000000] 10000b 100040 - 2080000 2080000
    so only 79c00 - 7d540 is a major free block under 4g...

    after this patch, we will get
    [ 0.000000] 19 - 1f
    [ 0.000000] 28 40 - 80 95
    [ 0.000000] 702 740 - 1000 1000
    [ 0.000000] 347c - 347e
    [ 0.000000] 34e7 3500 - 3600 3600
    [ 0.000000] 37dd - 3800
    [ 0.000000] 39dd - 3a00
    [ 0.000000] 3bdd - 3c00
    [ 0.000000] 3ddd - 3e00
    [ 0.000000] 3fdd - 4000
    [ 0.000000] 41dd - 4200
    [ 0.000000] 43dd - 4400
    [ 0.000000] 45dd - 4600
    [ 0.000000] 47dd - 4800
    [ 0.000000] 49dd - 4a00
    [ 0.000000] 4bdd - 4c00
    [ 0.000000] 4ddd - 4e00
    [ 0.000000] 4fdd - 5000
    [ 0.000000] 51dd - 5200
    [ 0.000000] 53dd - 5400
    [ 0.000000] 95dd 9600 - 7d540 7d550
    [ 0.000000] 7f745 - 7f750
    [ 0.000000] 17000b 170040 - 2080000 2080000
    we now have 9600 - 7d540 as the major free block...

    The sparse-vmemmap path already uses __alloc_bootmem_node_high().
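
    A hedged sketch of what the non-vmemmap path then does:

    /* allocate the section's mem_map high instead of low (sketch) */
    map = __alloc_bootmem_node_high(NODE_DATA(nid),
                    sizeof(struct page) * PAGES_PER_SECTION,
                    PAGE_SIZE, __pa(MAX_DMA_ADDRESS));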

    Signed-off-by: Yinghai Lu
    Cc: Jiri Slaby
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Christoph Lameter
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

30 Mar, 2010

1 commit

  • include cleanup: update gfp.h and slab.h includes to prepare for
    breaking implicit slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to place the new include so that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.
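
    The typical per-file edit the script produces is just an explicit
    include (illustrative):

    #include <linux/slab.h>    /* for kmalloc()/kfree(); previously pulled
                                  in implicitly via percpu.h -> slab.h */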

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This step
    added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

02 Mar, 2010

1 commit

  • Stephen reported that a build (powerpc ppc64_defconfig) produced these
    warnings:

    mm/sparse.c: In function 'sparse_init':
    mm/sparse.c:488: warning: unused variable 'map_count'
    mm/sparse.c:484: warning: unused variable 'size2'
    mm/sparse.c:481: warning: unused variable 'map_map'
    mm/sparse.c: At top level:
    mm/sparse.c:442: warning: 'sparse_early_mem_maps_alloc_node' defined but not used

    Introduced by commit 9bdac914240759457175ac0d6529a37d2820bc4d
    ("sparsemem: Put mem map for one node together").

    Conditionalize the bits appropriately based on the setting of
    CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.

    Reported-by: Stephen Rothwell
    Tested-by: Stephen Rothwell
    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

13 Feb, 2010

2 commits

  • Add vmemmap_alloc_block_buf() for the mem map only.

    It falls back to the old allocation path if it cannot get a block that
    big.

    Before this patch, on a node with 128GB of RAM installed, the memmap
    was split into two or more parts:
    [ 0.000000] [ffffea0000000000-ffffea003fffffff] PMD -> [ffff880100600000-ffff88013e9fffff] on node 1
    [ 0.000000] [ffffea0040000000-ffffea006fffffff] PMD -> [ffff88013ec00000-ffff88016ebfffff] on node 1
    [ 0.000000] [ffffea0070000000-ffffea007fffffff] PMD -> [ffff882000600000-ffff8820105fffff] on node 0
    [ 0.000000] [ffffea0080000000-ffffea00bfffffff] PMD -> [ffff882010800000-ffff8820507fffff] on node 0
    [ 0.000000] [ffffea00c0000000-ffffea00dfffffff] PMD -> [ffff882050a00000-ffff8820709fffff] on node 0
    [ 0.000000] [ffffea00e0000000-ffffea00ffffffff] PMD -> [ffff884000600000-ffff8840205fffff] on node 2
    [ 0.000000] [ffffea0100000000-ffffea013fffffff] PMD -> [ffff884020800000-ffff8840607fffff] on node 2
    [ 0.000000] [ffffea0140000000-ffffea014fffffff] PMD -> [ffff884060a00000-ffff8840709fffff] on node 2
    [ 0.000000] [ffffea0150000000-ffffea017fffffff] PMD -> [ffff886000600000-ffff8860305fffff] on node 3
    [ 0.000000] [ffffea0180000000-ffffea01bfffffff] PMD -> [ffff886030800000-ffff8860707fffff] on node 3
    [ 0.000000] [ffffea01c0000000-ffffea01ffffffff] PMD -> [ffff888000600000-ffff8880405fffff] on node 4
    [ 0.000000] [ffffea0200000000-ffffea022fffffff] PMD -> [ffff888040800000-ffff8880707fffff] on node 4
    [ 0.000000] [ffffea0230000000-ffffea023fffffff] PMD -> [ffff88a000600000-ffff88a0105fffff] on node 5
    [ 0.000000] [ffffea0240000000-ffffea027fffffff] PMD -> [ffff88a010800000-ffff88a0507fffff] on node 5
    [ 0.000000] [ffffea0280000000-ffffea029fffffff] PMD -> [ffff88a050a00000-ffff88a0709fffff] on node 5
    [ 0.000000] [ffffea02a0000000-ffffea02bfffffff] PMD -> [ffff88c000600000-ffff88c0205fffff] on node 6
    [ 0.000000] [ffffea02c0000000-ffffea02ffffffff] PMD -> [ffff88c020800000-ffff88c0607fffff] on node 6
    [ 0.000000] [ffffea0300000000-ffffea030fffffff] PMD -> [ffff88c060a00000-ffff88c0709fffff] on node 6
    [ 0.000000] [ffffea0310000000-ffffea033fffffff] PMD -> [ffff88e000600000-ffff88e0305fffff] on node 7
    [ 0.000000] [ffffea0340000000-ffffea037fffffff] PMD -> [ffff88e030800000-ffff88e0707fffff] on node 7

    After the patch, we get:
    [ 0.000000] [ffffea0000000000-ffffea006fffffff] PMD -> [ffff880100200000-ffff88016e5fffff] on node 0
    [ 0.000000] [ffffea0070000000-ffffea00dfffffff] PMD -> [ffff882000200000-ffff8820701fffff] on node 1
    [ 0.000000] [ffffea00e0000000-ffffea014fffffff] PMD -> [ffff884000200000-ffff8840701fffff] on node 2
    [ 0.000000] [ffffea0150000000-ffffea01bfffffff] PMD -> [ffff886000200000-ffff8860701fffff] on node 3
    [ 0.000000] [ffffea01c0000000-ffffea022fffffff] PMD -> [ffff888000200000-ffff8880701fffff] on node 4
    [ 0.000000] [ffffea0230000000-ffffea029fffffff] PMD -> [ffff88a000200000-ffff88a0701fffff] on node 5
    [ 0.000000] [ffffea02a0000000-ffffea030fffffff] PMD -> [ffff88c000200000-ffff88c0701fffff] on node 6
    [ 0.000000] [ffffea0310000000-ffffea037fffffff] PMD -> [ffff88e000200000-ffff88e0701fffff] on node 7

    -v2: change buf to vmemmap_buf instead according to Ingo
    also add CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER according to Ingo
    -v3: according to Andrew, use sizeof(name) instead of hard coded 15
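
    Roughly, the buffered allocator carves the mem map out of one big
    per-node block and falls back when it runs out (sketch):

    void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node)
    {
            void *ptr;

            if (!vmemmap_buf)
                    return vmemmap_alloc_block(size, node);

            /* carve the request out of the per-node buffer */
            ptr = (void *)ALIGN((unsigned long)vmemmap_buf, size);
            if (ptr + size > vmemmap_buf_end)
                    return vmemmap_alloc_block(size, node);

            vmemmap_buf = ptr + size;
            return ptr;
    }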

    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Cc: Christoph Lameter
    Acked-by: Christoph Lameter
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     
  • This can save some buffer space compared with applying the reservations
    one by one. It also helps systems that will use early_res instead of
    bootmem: fewer entries in early_res make searching faster on systems
    with more memory.

    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

22 Sep, 2009

1 commit

  • Some pages are allocated to initialize a hot-added node, but at that
    point the node has no memory yet, so the allocation always fails. In
    such cases, allocate the pages from other nodes.
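
    A hypothetical sketch of the fallback idea (the helper name is
    invented):

    static void *alloc_for_new_node(int nid, size_t size)
    {
            /* prefer the hot-added node ... */
            void *p = kmalloc_node(size, GFP_KERNEL | __GFP_NOWARN, nid);

            /* ... but it has no memory yet, so fall back to any node */
            if (!p)
                    p = kmalloc(size, GFP_KERNEL);
            return p;
    }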

    Signed-off-by: Shaohua Li
    Signed-off-by: Yakui Zhao
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

25 Jul, 2008

2 commits

  • With this patch, usemaps are allocated on the section which holds the
    pgdat.

    Because a usemap is very small, the usemaps of many sections are
    allocated on a single page. If a section holds a usemap page, it can't
    be removed until the sections using that usemap are removed. This
    dependency is not desirable for memory removal.

    The pgdat has a similar property: the section holding the pgdat area
    must be the last section removed on its node. So if section A holds the
    pgdat and section B holds the usemap for section A, neither section can
    be removed because of the mutual dependency.

    To solve this issue, this patch collects the usemaps on the same section
    as the pgdat, as much as possible. If no other section depends on it,
    that section can finally be removed.
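
    A hedged sketch of the placement (simplified):

    /* aim the usemap at the section that already holds the pgdat */
    section_nr = pfn_to_section_nr(__pa(pgdat) >> PAGE_SHIFT);
    usemap = alloc_bootmem_section(usemap_size(), section_nr);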

    Signed-off-by: Yasunori Goto
    Cc: Mel Gorman
    Cc: Andy Whitcroft
    Cc: David Miller
    Cc: Badari Pulavarty
    Cc: Heiko Carstens
    Cc: Hiroyuki KAMEZAWA
    Cc: Tony Breeds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • There are a number of different views of how much memory is currently
    active: the arch-independent zone-sizing view, the bootmem allocator's
    view, and the memory model's view.

    Architectures register this information at different times, and it is
    not necessarily in sync, particularly with respect to some SPARSEMEM
    limitations.

    This patch introduces mminit_validate_memmodel_limits(), which can
    validate and correct PFN ranges with respect to the memory model. Only
    SPARSEMEM currently validates itself.
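
    Roughly, the validation clamps PFN ranges to what the memory model can
    represent (simplified sketch):

    void __meminit mminit_validate_memmodel_limits(unsigned long *start_pfn,
                                                   unsigned long *end_pfn)
    {
            unsigned long max_pfn = 1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT);

            /* don't let an arch pass in pfns beyond SPARSEMEM's scope */
            if (*start_pfn > max_pfn)
                    *start_pfn = max_pfn;
            if (*end_pfn > max_pfn)
                    *end_pfn = max_pfn;
    }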

    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

30 Apr, 2008

2 commits

  • This:

    commit 86f6dae1377523689bd8468fed2f2dd180fc0560
    Author: Yasunori Goto
    Date: Mon Apr 28 02:13:33 2008 -0700

    memory hotplug: allocate usemap on the section with pgdat

    With this patch, usemaps are allocated on the section which holds the
    pgdat.

    Because a usemap is very small, the usemaps of many sections are
    allocated on a single page. If a section holds a usemap page, it can't
    be removed until the sections using that usemap are removed. This
    dependency is not desirable for memory removal.

    The pgdat has a similar property: the section holding the pgdat area
    must be the last section removed on its node. So if section A holds the
    pgdat and section B holds the usemap for section A, neither section can
    be removed because of the mutual dependency.

    To solve this issue, this patch collects the usemaps on the same section
    as the pgdat. If no other section depends on it, that section can
    finally be removed.

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    broke davem's sparc64 bootup. Revert it while we work out what went wrong.

    Cc: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: "David S. Miller"
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • __FUNCTION__ is gcc-specific, use __func__

    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     

28 Apr, 2008

5 commits

  • This patch frees memmaps which were allocated by bootmem.

    Freeing the usemap is not necessary: its page may still be needed by
    other sections.

    If the section being removed is the last section on the node, it is the
    final user of the usemap page (usemaps are allocated on its section by
    the previous patch). But that page shouldn't be freed either, because
    the section must already be in the logical offline state, with all its
    pages isolated from the page allocator. If the page were freed, the
    page allocator might hand it out even though it will be removed
    physically soon, which would be a disaster. So this patch keeps it as
    it is.
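
    A hedged sketch of the rule applied while walking the memmap's backing
    pages (variable names illustrative):

    /* free a bootmem-allocated memmap page, unless it lives in the very
       section being removed -- that one must stay (see above) */
    maps_section_nr = pfn_to_section_nr(page_to_pfn(page));
    if (maps_section_nr != removing_section_nr)
            put_page_bootmem(page);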

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • With this patch, usemaps are allocated on the section which holds the
    pgdat.

    Because a usemap is very small, the usemaps of many sections are
    allocated on a single page. If a section holds a usemap page, it can't
    be removed until the sections using that usemap are removed. This
    dependency is not desirable for memory removal.

    The pgdat has a similar property: the section holding the pgdat area
    must be the last section removed on its node. So if section A holds the
    pgdat and section B holds the usemap for section A, neither section can
    be removed because of the mutual dependency.

    To solve this issue, this patch collects the usemaps on the same section
    as the pgdat. If no other section depends on it, that section can
    finally be removed.

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • To make the memmap easier to free, this patch aligns it to page size.
    The bootmem allocator may mix several objects into one page, which is a
    problem when freeing the memmap for memory hot-remove.
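
    The change amounts to rounding the allocation up to whole pages
    (sketch):

    map = alloc_bootmem_pages_node(NODE_DATA(nid),
                    PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION));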

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • This patch set frees pages which were allocated by bootmem for memory
    hot-remove. Some memory management structures (e.g. the memmap) are
    allocated by bootmem.

    To remove memory physically, some of them must be freed according to
    circumstance. This patch set lays the groundwork for freeing those
    pages, and frees memmaps.

    My basic idea is to use the remaining members of struct page to remember
    which bootmem user owns the page (section number or node id). When a
    section is being removed, the kernel can check this information. It
    solves several issues:

    1) When the memmap of a section being removed was allocated on another
    section by bootmem, it should/can be freed.
    2) When the memmap of a section being removed was allocated on the
    same section, it shouldn't be freed, because that section must already
    be logically offline and all its pages isolated from the page
    allocator. If the memmap were freed, the page allocator might use
    memory that will soon be removed physically.
    3) When a section being removed holds another section's memmap, the
    kernel will be able to show the user which section should be removed
    first. (Not implemented yet.)
    4) In case 2) above, page isolation will be able to check for and skip
    the memmap's pages during logical memory offline (offline_pages()).
    The current page isolation code fails in this case because the page is
    just a reserved page and it can't tell whether the page can be removed
    or not. This patch makes that possible. (Not implemented yet.)
    5) Node information such as the pgdat has similar issues, which this
    approach will also be able to solve. (Not implemented yet, beyond
    remembering the node id in the pages.)

    Fortunately, the current bootmem allocator just keeps the PageReserved
    flag and doesn't use any other members of struct page; neither do the
    users of bootmem.

    This patch:

    This patch registers the information (node or section id) in the pages,
    so the kernel can distinguish which node/section owns pages allocated
    by bootmem. This is the basis for hot-removing sections or nodes; a
    sketch of the helper follows.
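
    A sketch of the registration helper, roughly as first merged (the
    bookkeeping later moved from _mapcount to page->lru, see the 14 Jan
    2011 entry above):

    static void get_page_bootmem(unsigned long info, struct page *page,
                                 int magic)
    {
            /* owner (section nr or node id) goes in page_private, the
               kind of bootmem user in _mapcount */
            atomic_set(&page->_mapcount, magic);
            SetPagePrivate(page);
            set_page_private(page, info);
            atomic_inc(&page->_count);
    }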

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • Add a generic helper function to remove section mappings and sysfs
    entries for the section of memory being removed. offline_pages() has
    already adjusted the zone and marked the pages reserved.

    TODO: Yasunori Goto is working on patches to free up allocations from
    bootmem.
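
    Roughly, the helper walks the range section by section (sketch):

    int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
                       unsigned long nr_pages)
    {
            unsigned long i;
            int ret = 0;

            /* only whole, section-aligned ranges can be removed */
            BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
            BUG_ON(nr_pages % PAGES_PER_SECTION);

            for (i = 0; i < nr_pages / PAGES_PER_SECTION; i++) {
                    unsigned long pfn = phys_start_pfn +
                                        i * PAGES_PER_SECTION;
                    ret = __remove_section(zone, __pfn_to_section(pfn));
                    if (ret)
                            break;
            }
            return ret;
    }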

    Signed-off-by: Badari Pulavarty
    Acked-by: Yasunori Goto
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty