04 Jul, 2013

40 commits

  • Concentrate the code that modifies totalram_pages into the mm core, so
    arch memory initialization code doesn't need to take care of it. With
    these changes applied, only the following functions from the mm core
    modify the global variable totalram_pages: free_bootmem_late(),
    free_all_bootmem(), free_all_bootmem_node(), adjust_managed_page_count().

    With this patch applied, it will be much easier for us to keep
    totalram_pages and zone->managed_pages consistent.
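
    As a hedged before-and-after sketch (not a verbatim diff from any one
    architecture), an arch mem_init() no longer has to accumulate the
    return value of free_all_bootmem() into totalram_pages itself:

    #include <linux/bootmem.h>
    #include <linux/init.h>

    void __init mem_init(void)
    {
            /* before this series:
             *      totalram_pages += free_all_bootmem();
             * after: the mm core updates totalram_pages internally */
            free_all_bootmem();
    }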

    Signed-off-by: Jiang Liu
    Acked-by: David Howells
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Enhance adjust_managed_page_count() to adjust totalhigh_pages for
    highmem pages, and change code that directly adjusts totalram_pages to
    use adjust_managed_page_count() instead, because it adjusts
    totalram_pages, totalhigh_pages and zone->managed_pages together in a
    safe way.

    Remove inc_totalhigh_pages() and dec_totalhigh_pages() from the
    xen/balloon driver because adjust_managed_page_count() already adjusts
    totalhigh_pages.

    This patch also fixes two bugs:

    1) it enhances the virtio_balloon driver to adjust totalhigh_pages when
    reserving and unreserving pages.
    2) it enhances memory_hotplug.c to adjust totalhigh_pages when
    hot-removing memory.

    We still need to deal with modifications of totalram_pages in
    arch/powerpc/platforms/pseries/cmm.c, but that needs help from PPC
    experts.
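
    As a rough illustration of the conversion (the helper names below are
    hypothetical, not taken from any driver), a balloon-style driver that
    used to open-code the counters can now rely on the single helper:

    #include <linux/mm.h>

    static void balloon_page_removed(struct page *page)
    {
            /* was: totalram_pages--; plus dec_totalhigh_pages() for highmem */
            adjust_managed_page_count(page, -1);
    }

    static void balloon_page_returned(struct page *page)
    {
            /* was: totalram_pages++; plus inc_totalhigh_pages() for highmem */
            adjust_managed_page_count(page, 1);
    }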

    [akpm@linux-foundation.org: remove ifdef, per Wanpeng Li, virtio_balloon.c cleanup, per Sergei]
    [akpm@linux-foundation.org: export adjust_managed_page_count() to modules, for drivers/virtio/virtio_balloon.c]
    Signed-off-by: Jiang Liu
    Cc: Chris Metcalf
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "H. Peter Anvin"
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Marek Szyprowski
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yinghai Lu
    Cc: Russell King
    Cc: Sergei Shtylyov
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • In order to simplify management of totalram_pages and
    zone->managed_pages, make __free_pages_bootmem() only available at boot
    time. With this change applied, __free_pages_bootmem() will only be
    used by bootmem.c and nobootmem.c at boot time, so mark it as __init.
    Other callers of __free_pages_bootmem() have been converted to use
    free_reserved_page(), which handles totalram_pages and
    zone->managed_pages in a safer way.
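
    In prototype form, the split this enforces looks roughly like the
    following (a sketch, not the exact header declarations):

    /* boot time only, used by bootmem.c/nobootmem.c, hence __init */
    void __init __free_pages_bootmem(struct page *page, unsigned int order);

    /* runtime callers use this instead; it also keeps totalram_pages and
     * zone->managed_pages up to date */
    void free_reserved_page(struct page *page);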

    This patch also fixes a bug in free_pagetable() for x86_64, which should
    increase zone->managed_pages instead of zone->present_pages when freeing
    reserved pages.

    And now we have managed_page_count_lock to protect totalram_pages and
    zone->managed_pages, so remove the redundant ppb_lock in
    put_page_bootmem(). This greatly simplifies the locking rules.

    Signed-off-by: Jiang Liu
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "Michael S. Tsirkin"
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tejun Heo
    Cc: Will Deacon
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to
    protect totalram_pages and zone->managed_pages. Other than the memory
    hotplug driver, totalram_pages and zone->managed_pages may also be
    modified at runtime by other drivers, such as the Xen balloon and
    virtio_balloon drivers. For those cases, the memory hotplug lock is a
    little too heavy, so introduce a dedicated lock to protect
    totalram_pages and zone->managed_pages.

    Now we have simplified locking rules for totalram_pages and
    zone->managed_pages:

    1) no locking for read accesses because they are unsigned long.
    2) no locking for write accesses at boot time in single-threaded context.
    3) serialize write accesses at runtime by acquiring the dedicated
    managed_page_count_lock.

    Also adjust zone->managed_pages when freeing reserved pages into the
    buddy system, to keep totalram_pages and zone->managed_pages
    consistent.
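
    A sketch of the resulting write path, close to (but not necessarily
    identical with) the merged code:

    static DEFINE_SPINLOCK(managed_page_count_lock);

    void adjust_managed_page_count(struct page *page, long count)
    {
            spin_lock(&managed_page_count_lock);
            page_zone(page)->managed_pages += count;
            totalram_pages += count;
    #ifdef CONFIG_HIGHMEM
            if (PageHighMem(page))
                    totalhigh_pages += count;
    #endif
            spin_unlock(&managed_page_count_lock);
    }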

    [akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)]
    Signed-off-by: Jiang Liu
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Commit "mm: introduce new field 'managed_pages' to struct zone" assumes
    that all highmem pages will be freed into the buddy system by function
    mem_init(). But that's not always true, some architectures may reserve
    some highmem pages during boot. For example PPC may allocate highmem
    pages for giagant HugeTLB pages, and several architectures have code to
    check PageReserved flag to exclude highmem pages allocated during boot
    when freeing highmem pages into the buddy system.

    So treat highmem pages in the same way as normal pages, that is:
    1) reset zone->managed_pages to zero in mem_init().
    2) recalculate managed_pages when freeing pages into the buddy system.
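
    A hedged sketch of the resulting pattern on the arch side (the pfn
    range parameters are placeholders for whatever the architecture uses):

    #include <linux/mm.h>
    #include <linux/init.h>

    static void __init free_boot_highpages(unsigned long start_pfn,
                                           unsigned long end_pfn)
    {
            unsigned long pfn;

            /* zone->managed_pages was reset to 0 earlier in mem_init() */
            for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                    struct page *page = pfn_to_page(pfn);

                    /* skip highmem pages reserved during boot */
                    if (!PageReserved(page))
                            free_highmem_page(page);  /* recounts managed_pages */
            }
    }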

    Signed-off-by: Jiang Liu
    Cc: "H. Peter Anvin"
    Cc: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Yinghai Lu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Marek Szyprowski
    Cc: "Michael S. Tsirkin"
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Konrad Rzeszutek Wilk
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use zone->managed_pages instead of zone->present_pages to calculate the
    default zonelist order, because managed_pages counts allocatable pages.

    Signed-off-by: Jiang Liu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Fix some trivial typos in comments.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.
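
    A minimal sketch of the kind of conversion this performs in the affected
    architectures (the wrapper name is hypothetical):

    #include <linux/mm.h>

    static void arch_release_reserved_page(struct page *page)
    {
            /* before:
             *      ClearPageReserved(page);
             *      init_page_count(page);
             *      __free_page(page);
             *      totalram_pages++;
             * after: one common helper that also adjusts the counters */
            free_reserved_page(page);
    }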

    Signed-off-by: Jiang Liu
    Cc: Chris Metcalf
    Cc: Wen Congyang
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use the common helper function free_reserved_area() to simplify code.

    Signed-off-by: Jiang Liu
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Cc: Tang Chen
    Cc: Wen Congyang
    Cc: Jianguo Wu
    Cc: "Michael S. Tsirkin"
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Jeremy Fitzhardinge
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tejun Heo
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use free_reserved_area() to poison initmem memory pages and kill
    poison_init_mem() on ARM64.

    Signed-off-by: Jiang Liu
    Acked-by: Catalin Marinas
    Cc: Will Deacon
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc: Arnd Bergmann
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Address more review comments from the last round of code review.
    1) Enhance free_reserved_area() to support poisoning freed memory with
    pattern '0'. This could be used to get rid of poison_init_mem() on
    ARM64.
    2) A previous patch disabled memory poisoning for initmem on s390 by
    mistake, so restore the original behavior.
    3) Remove the redundant PAGE_ALIGN() when calling free_reserved_area().

    Signed-off-by: Jiang Liu
    Cc: Geert Uytterhoeven
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Change the signature of free_reserved_area() according to Russell King's
    suggestion, to fix the following build warnings:

    arch/arm/mm/init.c: In function 'mem_init':
    arch/arm/mm/init.c:603:2: warning: passing argument 1 of 'free_reserved_area' makes integer from pointer without a cast [enabled by default]
    free_reserved_area(__va(PHYS_PFN_OFFSET), swapper_pg_dir, 0, NULL);
    ^
    In file included from include/linux/mman.h:4:0,
    from arch/arm/mm/init.c:15:
    include/linux/mm.h:1301:22: note: expected 'long unsigned int' but argument is of type 'void *'
    extern unsigned long free_reserved_area(unsigned long start, unsigned long end,

    mm/page_alloc.c: In function 'free_reserved_area':
    >> mm/page_alloc.c:5134:3: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast [enabled by default]
    In file included from arch/mips/include/asm/page.h:49:0,
    from include/linux/mmzone.h:20,
    from include/linux/gfp.h:4,
    from include/linux/mm.h:8,
    from mm/page_alloc.c:18:
    arch/mips/include/asm/io.h:119:29: note: expected 'const volatile void *' but argument is of type 'long unsigned int'
    mm/page_alloc.c: In function 'free_area_init_nodes':
    mm/page_alloc.c:5030:34: warning: array subscript is below array bounds [-Warray-bounds]

    Also address some minor code review comments.
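
    Reconstructed from the description above, the reworked helper looks
    roughly like this (a sketch; the in-tree code may differ in detail):

    unsigned long free_reserved_area(void *start, void *end, int poison,
                                     char *s)
    {
            void *pos;
            unsigned long pages = 0;

            start = (void *)PAGE_ALIGN((unsigned long)start);
            end = (void *)((unsigned long)end & PAGE_MASK);
            for (pos = start; pos < end; pos += PAGE_SIZE, pages++) {
                    if ((unsigned int)poison <= 0xFF)
                            memset(pos, poison, PAGE_SIZE);
                    free_reserved_page(virt_to_page(pos));
            }

            if (pages && s)
                    pr_info("Freeing %s memory: %ldK\n",
                            s, pages << (PAGE_SHIFT - 10));

            return pages;
    }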

    Signed-off-by: Jiang Liu
    Reported-by: Arnd Bergmann
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Considering the use cases where the swap device supports discard:
    a) and can do it quickly;
    b) but it's slow to do in small granularities (or concurrent with other
    I/O);
    c) but the implementation is so horrendous that you don't even want to
    send one down;

    And assuming that the sysadmin considers it useful to send the discards down
    at all, we would (probably) want the following solutions:

    i. do the fine-grained discards for freed swap pages, if device is
    capable of doing so optimally;
    ii. do single-time (batched) swap area discards, either at swapon
    or via something like fstrim (not implemented yet);
    iii. allow doing both single-time and fine-grained discards; or
    iv. turn it off completely (default behavior)

    As implemented today, one can only enable or disable discards for swap,
    but one cannot select, for instance, solution (ii) on a swap device like
    (b), even though the single-time discard is regarded as interesting or
    necessary to the workload, because enabling discard would also imply
    (i), and the device is not capable of performing that optimally.

    This patch addresses the scenario depicted above by introducing a way to
    ensure the (probably) wanted solutions (i, ii, iii and iv) can be
    flexibly flagged through swapon(8), allowing a sysadmin to select the
    most suitable swap discard policy according to system constraints.

    This patch introduces the new flags SWAP_FLAG_DISCARD_PAGES and
    SWAP_FLAG_DISCARD_ONCE to allow more flexible swap discard policies to
    be flagged through swapon(8). The default behavior is to keep both
    single-time (batched) area discards (SWAP_FLAG_DISCARD_ONCE) and
    fine-grained discards for page-clusters (SWAP_FLAG_DISCARD_PAGES)
    enabled, in order to keep consistency with older kernel behavior, as
    well as maintain compatibility with older swapon(8). However, through
    the newly introduced flags, the most suitable discard policy can be
    selected according to any given swap device constraint.
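
    A hedged user-space sketch of selecting one of the policies above (the
    device path is made up, and the flag values are mirrored locally because
    libc headers of the time did not expose the new ones):

    #include <sys/swap.h>

    #ifndef SWAP_FLAG_DISCARD
    #define SWAP_FLAG_DISCARD       0x10000 /* enable discard for swap */
    #endif
    #ifndef SWAP_FLAG_DISCARD_ONCE
    #define SWAP_FLAG_DISCARD_ONCE  0x20000 /* discard swap area at swapon time */
    #endif
    #ifndef SWAP_FLAG_DISCARD_PAGES
    #define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */
    #endif

    int main(void)
    {
            /* solution (ii): batched swapon-time discard only */
            return swapon("/dev/sdb2",
                          SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE);
    }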

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Rafael Aquini
    Acked-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Karel Zak
    Cc: Jeff Moyer
    Cc: Rik van Riel
    Cc: Larry Woodman
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Currently the per-cpu counter's batch size for memory accounting is
    configured as twice the number of cpus in the system. However, for
    systems with very large memory, it is more appropriate to make it
    proportional to the memory size per cpu in the system.

    For example, for an x86_64 system with 64 cpus and 128 GB of memory, the
    batch size is only 2*64 pages (0.5 MB). So any memory accounting change
    of more than 0.5 MB will overflow the per-cpu counter into the global
    counter. Instead, with the new scheme, the batch size is configured to
    be 0.4% of the memory per cpu = 8 MB (128 GB / 64 / 256), which is more
    in line with the memory size.
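
    A sketch of the sizing rule described above (close to, but not
    necessarily identical with, the merged code; the real code may also
    apply an upper cap):

    #include <linux/mm.h>
    #include <linux/cpumask.h>
    #include <linux/kernel.h>

    s32 vm_committed_as_batch = 32;

    static void mm_compute_batch(void)
    {
            s32 nr = num_present_cpus();
            s32 batch = max_t(s32, nr * 2, 32);    /* old 2*cpus value as a floor */
            u64 memsized_batch;

            /* 0.4% of the memory per cpu, in pages */
            memsized_batch = (u64)totalram_pages / nr / 256;
            vm_committed_as_batch = max_t(s32, memsized_batch, batch);
    }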

    I've done a repeated brk test of 800KB (from will-it-scale test suite)
    with 80 concurrent processes on a 4 socket Westmere machine with a total
    of 40 cores. Without the patch, about 80% of cpu is spent on spin-lock
    contention within the vm_committed_as counter. With the patch, there's
    a 73x speedup on the benchmark and the lock contention drops off almost
    entirely.

    [akpm@linux-foundation.org: fix section mismatch]
    Signed-off-by: Tim Chen
    Cc: Tejun Heo
    Cc: Eric Dumazet
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • Use the existing interface huge_page_shift() instead of h->order +
    PAGE_SHIFT.
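
    Illustration only (the wrapper is hypothetical): huge_page_shift()
    already encapsulates the computation, so prefer the accessor.

    #include <linux/hugetlb.h>

    static unsigned long hstate_size_bytes(struct hstate *h)
    {
            /* instead of: 1UL << (h->order + PAGE_SHIFT) */
            return 1UL << huge_page_shift(h);
    }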

    Signed-off-by: Wanpeng Li
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Benjamin Herrenschmidt
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • hugetlb_prefault() is not used any more; this patch removes it.

    Signed-off-by: Wanpeng Li
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • get_pageblock_flags and set_pageblock_flags are not used any more; this
    patch removes them.

    Signed-off-by: Wanpeng Li
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The logical memory remove code fails to correctly account Total High
    Memory when a memory block which contains High Memory is offlined, as
    shown in the example below. The following patch fixes it.

    Before logical memory remove:

    MemTotal: 7603740 kB
    MemFree: 6329612 kB
    Buffers: 94352 kB
    Cached: 872008 kB
    SwapCached: 0 kB
    Active: 626932 kB
    Inactive: 519216 kB
    Active(anon): 180776 kB
    Inactive(anon): 222944 kB
    Active(file): 446156 kB
    Inactive(file): 296272 kB
    Unevictable: 0 kB
    Mlocked: 0 kB
    HighTotal: 7294672 kB
    HighFree: 5704696 kB
    LowTotal: 309068 kB
    LowFree: 624916 kB

    After logical memory remove:

    MemTotal: 7079452 kB
    MemFree: 5805976 kB
    Buffers: 94372 kB
    Cached: 872000 kB
    SwapCached: 0 kB
    Active: 626936 kB
    Inactive: 519236 kB
    Active(anon): 180780 kB
    Inactive(anon): 222944 kB
    Active(file): 446156 kB
    Inactive(file): 296292 kB
    Unevictable: 0 kB
    Mlocked: 0 kB
    HighTotal: 7294672 kB
    HighFree: 5181024 kB
    LowTotal: 4294752076 kB
    LowFree: 624952 kB

    [mhocko@suse.cz: fix CONFIG_HIGHMEM=n build]
    Signed-off-by: Wanpeng Li
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: [2.6.24+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • During early boot-up, iomem_resource is set up from the boot descriptor
    table, such as EFI Memory Table and e820. Later,
    acpi_memory_device_add() calls add_memory() for each ACPI memory device
    object as it enumerates ACPI namespace. This add_memory() call is
    expected to fail in register_memory_resource() at boot since
    iomem_resource has been set up from EFI/e820. As a result, add_memory()
    returns -EEXIST, which acpi_memory_device_add() handles as the normal
    case.

    This scheme works fine, but the following error message is logged for
    every ACPI memory device object during boot-up.

    "System RAM resource %pR cannot be added\n"

    This patch changes register_memory_resource() to use pr_debug() for the
    message, since it shows up in the normal case.
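
    A sketch of the changed path, reconstructed from the description (not a
    verbatim diff):

    static struct resource *register_memory_resource(u64 start, u64 size)
    {
            struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);

            if (!res)
                    return NULL;
            res->name = "System RAM";
            res->start = start;
            res->end = start + size - 1;
            res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
            if (request_resource(&iomem_resource, res) < 0) {
                    /* expected at boot: EFI/e820 already populated iomem_resource */
                    pr_debug("System RAM resource %pR cannot be added\n", res);
                    kfree(res);
                    res = NULL;
            }
            return res;
    }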

    Signed-off-by: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • After a successful page migration by soft offlining, the source page is
    not properly freed and is never reusable even if we unpoison it
    afterward.

    This is caused by a race between freeing the page and setting
    PG_hwpoison. In successful soft offlining, the source page is put (and
    its refcount becomes 0) by putback_lru_page() in unmap_and_move(), where
    it is linked to a pagevec and the actual freeing back to the buddy
    system is delayed. So if PG_hwpoison is set for the page before
    freeing, the freeing does not function as expected (in such a case
    freeing aborts at the free_pages_prepare() check).

    This patch tries to make sure to free the source page before setting
    PG_hwpoison on it. To avoid reallocation, the page keeps
    MIGRATE_ISOLATE until after PG_hwpoison is set.

    This patch also removes obsolete comments about "keeping an elevated
    refcount" because what they say is not true. Unlike memory_failure(),
    soft_offline_page() uses no special page isolation code, and the
    soft-offlined pages have no elevated refcount.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • vwrite() checks for overflow. vread() should do the same thing.

    Since vwrite() checks the source buffer address, vread() should check
    the destination buffer address.
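
    A hedged sketch of the added guard (expressed here as a stand-alone
    helper; in vread() it is an inline check on the destination buffer):

    /* clamp the requested length so that buf + count cannot wrap around
     * the address space, mirroring the existing check in vwrite() */
    static unsigned long clamp_read_count(char *buf, unsigned long count)
    {
            if ((unsigned long)buf + count < count)
                    count = -(unsigned long)buf;
            return count;
    }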

    Signed-off-by: Chen Gang
    Cc: Al Viro
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • - check the length of the procfs data before copying it into a fixed
    size array.

    - when __parse_numa_zonelist_order() fails, save the error code for
    return.

    - 'char*' --> 'char *' coding style fix

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Similar to __pagevec_lru_add, this patch removes the LRU parameter from
    __lru_cache_add and lru_cache_add_lru, as the caller does not control
    the exact LRU the page gets added to. lru_cache_add_lru gets renamed to
    lru_cache_add, as the name is silly without the lru parameter. With the
    parameter removed, the caller must indicate whether they want the page
    added to the active or inactive list by setting or clearing PageActive
    respectively.
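
    In sketch form, the new calling convention looks like this (the wrapper
    names are hypothetical):

    #include <linux/swap.h>
    #include <linux/page-flags.h>

    static void add_page_active(struct page *page)
    {
            SetPageActive(page);    /* target the active list */
            lru_cache_add(page);
    }

    static void add_page_inactive(struct page *page)
    {
            ClearPageActive(page);  /* target the inactive list */
            lru_cache_add(page);
    }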

    [akpm@linux-foundation.org: Suggested the patch]
    [gang.chen@asianux.com: fix used-uninitialized warning]
    Signed-off-by: Mel Gorman
    Signed-off-by: Chen Gang
    Cc: Jan Kara
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Now that the LRU to add a page to is decided at LRU-add time, remove the
    misleading lru parameter from __pagevec_lru_add. A consequence of this
    is that the pagevec_lru_add_file, pagevec_lru_add_anon and similar
    helpers are misleading as the caller no longer has direct control over
    what LRU the page is added to. Unused helpers are removed by this patch
    and existing users of pagevec_lru_add_file() are converted to use
    lru_cache_add_file() directly and use the per-cpu pagevecs instead of
    creating their own pagevec.

    Signed-off-by: Mel Gorman
    Reviewed-by: Jan Kara
    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a page is on a pagevec then it is !PageLRU and mark_page_accessed()
    may fail to move the page to the active list as expected. Now that the
    LRU is selected at LRU drain time, mark pages PageActive if they are on
    the local pagevec, so they get moved to the correct list at LRU drain
    time. Using a debugging patch it was found that, for a simple git
    checkout based workload, pages were never added to the active file list
    in practice, but with this patch applied they are.

                                before      after
    LRU Add Active File              0     750583
    LRU Add Active Anon        2640587    2702818
    LRU Add Inactive File      8833662    8068353
    LRU Add Inactive Anon          207        200

    Note that only pages on the local pagevec are considered on purpose. A
    !PageLRU page could be in the process of being released, reclaimed,
    migrated or on a remote pagevec that is currently being drained.
    Marking it PageActive is vulnerable to races where the PageLRU and
    Active bits are checked at the wrong time. Page reclaim will trigger
    VM_BUG_ONs but, depending on when the race hits, it could also free a
    PageActive page to the page allocator and trigger a bad_page warning.
    Similarly, a potential race exists between a per-cpu drain on a pagevec
    list and an activation on a remote CPU:

    lru_add_drain_cpu
      __pagevec_lru_add
        lru = page_lru(page);
                                        mark_page_accessed
                                          if (PageLRU(page))
                                              activate_page
                                          else
                                              SetPageActive
        SetPageLRU(page);
        add_page_to_lru_list(page, lruvec, lru);

    In this case a PageActive page is added to the inactive list and later
    the inactive/active stats will get skewed. While the PageActive checks
    in vmscan could be removed and potentially dealt with, a skew in the
    statistics would be very difficult to detect. Hence this patch deals
    only with the common case, where a page being marked accessed has just
    been added to the local pagevec.

    Signed-off-by: Mel Gorman
    Cc: Jan Kara
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • mark_page_accessed() cannot activate an inactive page that is located on
    an inactive LRU pagevec. Hints from filesystems may be ignored as a
    result. In preparation for fixing that problem, this patch removes the
    per-LRU pagevecs and leaves just one pagevec. The final LRU the page is
    added to is deferred until the pagevec is drained.

    This means that fewer pagevecs are available and potentially there is
    greater contention on the LRU lock. However, this only applies in the
    case where there is an almost perfect mix of file, anon, active and
    inactive pages being added to the LRU. In practice I expect that we are
    adding a stream of pages of a particular type and that the changes in
    contention will barely be measurable.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Jan Kara
    Acked-by: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Andrew Perepechko reported a problem whereby pages are being prematurely
    evicted as the mark_page_accessed() hint is ignored for pages that are
    currently on a pagevec --
    http://www.spinics.net/lists/linux-ext4/msg37340.html .

    Alexey Lyahkov and Robin Dong have also reported problems recently that
    could be due to hot pages reaching the end of the inactive list too
    quickly and being reclaimed.

    Rather than addressing this on a per-filesystem basis, this series aims
    to fix the mark_page_accessed() interface by deferring the decision on
    which LRU a page is added to until pagevec drain time, and by allowing
    mark_page_accessed() to call SetPageActive on a pagevec page.

    Patch 1 adds two tracepoints for LRU page activation and insertion.
    Using these tracepoints it's possible to build a model of pages in the
    LRU that can be processed offline.

    Patch 2 defers making the decision on what LRU to add a page to until when
    the pagevec is drained.

    Patch 3 searches the local pagevec for pages to mark PageActive on
    mark_page_accessed. The changelog explains why only the local
    pagevec is examined.

    Patches 4 and 5 tidy up the API.

    postmark, a dd-based test and fs-mark, in both single and threaded mode,
    were run, but none of them showed any performance degradation or gain as
    a result of the patch.

    Using patch 1, I built a *very* basic model of the LRU to examine
    offline what the average ages of different page types on the LRU were
    in milliseconds. Of course, capturing the trace distorts the test as
    it's written to local disk, but that does not matter for the purposes
    of this test. The average ages of pages in milliseconds were:

                                    vanilla    deferdrain
    Average age mapped anon:           1454          1250
    Average age mapped file:         127841        155552
    Average age unmapped anon:           85           235
    Average age unmapped file:        73633         38884
    Average age unmapped buffers:     74054        116155

    The LRU activity was mostly files which you'd expect for a dd-based
    workload. Note that the average age of buffer pages is increased by the
    series and it is expected this is due to the fact that the buffer pages
    are now getting added to the active list when drained from the pagevecs.
    Note that the average age of the unmapped file data is decreased as they
    are still added to the inactive list and are reclaimed before the
    buffers.

    There is no guarantee this is a universal win for all workloads and it
    would be nice if the filesystem people gave some thought as to whether
    this decision is generally a win or a loss.

    This patch:

    Using these tracepoints it is possible to model LRU activity and the
    average residency of pages of different types. This can be used to
    debug problems related to premature reclaim of pages of particular
    types.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • hugetlb cgroup has already been implemented.

    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rob Landley
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • This patch introduces mmap_vmcore().

    Don't permit writable or executable mappings even with mprotect(),
    because this mmap() is aimed at reading crash dump memory. A
    non-writable mapping is also a requirement of remap_pfn_range() when
    mapping linear pages on non-consecutive physical pages; see
    is_cow_mapping().

    Set the VM_MIXEDMAP flag to remap memory by remap_pfn_range() and by
    remap_vmalloc_range_partial() at the same time for a single vma.
    do_munmap() can correctly clean up a partially remapped vma with the two
    functions in the abnormal case. See zap_pte_range(), vm_normal_page()
    and their comments for details.

    On x86-32 PAE kernels, mmap() supports at most 16TB of memory. This
    limitation comes from the fact that the third argument of
    remap_pfn_range(), pfn, is of 32-bit length on x86-32: unsigned long.

    [akpm@linux-foundation.org: use min(), switch to conventional error-unwinding approach]
    Signed-off-by: HATAYAMA Daisuke
    Acked-by: Vivek Goyal
    Cc: KOSAKI Motohiro
    Cc: Atsushi Kumagai
    Cc: Lisa Mitchell
    Cc: Zhang Yanfei
    Tested-by: Maxim Uvarov
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HATAYAMA Daisuke
     
  • The previous patches newly added holes before each chunk of memory and
    the holes need to be counted in the vmcore file size. There are two
    ways to count the file size in such a scheme:

    1) suppose m is a pointer to the last vmcore object in vmcore_list;
    then the file size is (m->offset + m->size), or

    2) calculate the sum of the sizes of the buffers for the ELF header,
    program headers, ELF note segments and objects in vmcore_list.

    Although 1) is more direct and simpler than 2), 2) seems better in that
    it reflects the internal object structure of /proc/vmcore. Thus, this
    patch changes get_vmcore_size_elf{64, 32} so that they calculate the
    size in the way of 2).

    As a result, both get_vmcore_size_elf{64, 32} have the same definition.
    Merge them as get_vmcore_size.

    Signed-off-by: HATAYAMA Daisuke
    Acked-by: Vivek Goyal
    Cc: KOSAKI Motohiro
    Cc: Atsushi Kumagai
    Cc: Lisa Mitchell
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HATAYAMA Daisuke
     
  • Now the ELF note segment is copied into a buffer on vmalloc memory. To
    allow a user process to remap the ELF note segment buffer with
    remap_vmalloc_range(), the corresponding VM area object has to have the
    VM_USERMAP flag set.

    [akpm@linux-foundation.org: use the conventional comment layout]
    Signed-off-by: HATAYAMA Daisuke
    Acked-by: Vivek Goyal
    Cc: KOSAKI Motohiro
    Cc: Atsushi Kumagai
    Cc: Lisa Mitchell
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HATAYAMA Daisuke
     
  • The reason why we don't allocate the ELF note segment in the 1st kernel
    (old memory) on a page boundary is to keep backward compatibility for
    old kernels, and because doing so would waste not a little memory, due
    to the round-up operation needed to fit the memory to a page boundary,
    since most of the buffers are in the per-cpu area.

    ELF notes are per-cpu, so the total size of the ELF note segments
    depends on the number of CPUs. The current maximum number of CPUs on
    x86_64 is 5192, and there's already a system with 4192 CPUs in SGI,
    where the total size amounts to 1MB. This can be larger in the near
    future, or possibly even now on another architecture that has a larger
    note size per CPU. Thus, to avoid the case where memory allocation for
    a large block fails, we allocate vmcore objects on vmalloc memory.

    This patch adds the elfnotes_buf and elfnotes_sz variables to keep a
    pointer to the ELF note segment buffer and its size. There's no longer
    a vmcore object that corresponds to the ELF note segment in vmcore_list.
    Accordingly, read_vmcore() has a new case for the ELF note segment, and
    set_vmcore_list_offsets_elf{64,32}() and other helper functions start
    calculating the offset from the sum of the size of the ELF headers and
    the size of the ELF note segment.

    [akpm@linux-foundation.org: use min(), fix error-path vzalloc() leaks]
    Signed-off-by: HATAYAMA Daisuke
    Acked-by: Vivek Goyal
    Cc: KOSAKI Motohiro
    Cc: Atsushi Kumagai
    Cc: Lisa Mitchell
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HATAYAMA Daisuke
     
  • We want to allocate the ELF note segment buffer in the 2nd kernel in
    vmalloc space and remap it to user-space, in order to reduce the risk
    that memory allocation fails on a system with a huge number of CPUs and
    hence a huge ELF note segment that exceeds the order-11 block size.

    Although there's already remap_vmalloc_range for the purpose of
    remapping vmalloc memory to user-space, we need to specify the
    user-space range via a vma. Mmap on /proc/vmcore needs to remap a range
    across multiple objects, so an interface that requires the vma to cover
    the full range is problematic.

    This patch introduces remap_vmalloc_range_partial, which receives the
    user-space range as a pair of base address and size and can be used for
    the mmap on /proc/vmcore case.

    remap_vmalloc_range is rewritten using remap_vmalloc_range_partial.
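
    The interface as described above, plus a hedged usage sketch (the
    wrapper and its variable names are illustrative, not the exact
    mmap_vmcore() code):

    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    /* int remap_vmalloc_range_partial(struct vm_area_struct *vma,
     *                                 unsigned long uaddr,
     *                                 void *kaddr, unsigned long size); */

    static int map_one_object(struct vm_area_struct *vma, unsigned long pos,
                              void *kaddr, unsigned long size)
    {
            /* map one object's worth of vmalloc memory at an offset inside
             * the vma; the vma need not cover only this range */
            return remap_vmalloc_range_partial(vma, vma->vm_start + pos,
                                               kaddr, size);
    }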

    [akpm@linux-foundation.org: use PAGE_ALIGNED()]
    Signed-off-by: HATAYAMA Daisuke
    Cc: KOSAKI Motohiro
    Cc: Vivek Goyal
    Cc: Atsushi Kumagai
    Cc: Lisa Mitchell
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HATAYAMA Daisuke
     
  • Currently, __find_vmap_area searches for the kernel VM area starting at
    a given address. This patch changes this behavior so that it searches
    for the kernel VM area to which the address belongs. This change is
    needed by remap_vmalloc_range_partial, to be introduced in a later
    patch, which receives any position within a kernel VM area as its
    target address.

    This patch changes the condition (addr > va->va_start) to
    (addr >= va->va_end), taking advantage of the fact that each kernel VM
    area is non-overlapping.
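
    Roughly what the lookup looks like after this change (reconstructed
    from the description, so details may differ):

    static struct vmap_area *__find_vmap_area(unsigned long addr)
    {
            struct rb_node *n = vmap_area_root.rb_node;

            while (n) {
                    struct vmap_area *va;

                    va = rb_entry(n, struct vmap_area, rb_node);
                    if (addr < va->va_start)
                            n = n->rb_left;
                    else if (addr >= va->va_end)   /* was: addr > va->va_start */
                            n = n->rb_right;
                    else
                            return va;    /* va_start <= addr < va_end */
            }

            return NULL;
    }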

    Signed-off-by: HATAYAMA Daisuke
    Acked-by: KOSAKI Motohiro
    Cc: Vivek Goyal
    Cc: Atsushi Kumagai
    Cc: Lisa Mitchell
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HATAYAMA Daisuke
     
  • Treat memory chunks referenced by PT_LOAD program header entries on
    page-size boundaries in vmcore_list. Formally, for each range [start,
    end], we set up the corresponding vmcore object in vmcore_list to
    [rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].

    This change affects layout of /proc/vmcore. The gaps generated by the
    rearrangement are newly made visible to applications as holes.
    Concretely, they are two ranges [rounddown(start, PAGE_SIZE), start] and
    [end, roundup(end, PAGE_SIZE)].

    Suppose variable m points at a vmcore object in vmcore_list, and
    variable phdr points at the program header of PT_LOAD type the variable
    m corresponds to. Then, pictorially:

    m->offset                      +---------------+
                                   |     hole      |
    phdr->p_offset =               +---------------+
    m->offset + (paddr - start)    |               |\
                                   | kernel memory | phdr->p_memsz
                                   |               |/
                                   +---------------+
                                   |     hole      |
    m->offset + m->size            +---------------+

    where m->offset and m->offset + m->size are always page-size aligned.

    Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
    Acked-by: Vivek Goyal <vgoyal@redhat.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
    Cc: Lisa Mitchell <lisa.mitchell@hp.com>
    Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    HATAYAMA Daisuke
     
  • Allocate ELF headers on a page-size boundary using __get_free_pages()
    instead of kmalloc().

    A later patch will merge the PT_NOTE entries into a single unique one
    and decrease the buffer size actually used. Keep the original buffer
    size in the variable elfcorebuf_sz_orig so the buffer can be freed
    later, and keep the actually used buffer size, rounded up to a
    page-size boundary, in the variable elfcorebuf_sz separately.
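
    A hedged sketch of the allocation side of this change (the wrapper is
    hypothetical; elfcorebuf_sz_orig is the variable described above):

    #include <linux/gfp.h>

    static void *alloc_elfcorebuf(unsigned long elfcorebuf_sz_orig)
    {
            /* page-aligned, zeroed buffer: unused tail space up to the page
             * boundary stays zero-filled and can be exported as-is */
            return (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
                                            get_order(elfcorebuf_sz_orig));
    }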

    The size of the part of the ELF buffer exported from /proc/vmcore is
    elfcorebuf_sz.

    The range corresponding to the merged-and-removed PT_NOTE entries, i.e.
    [elfcorebuf_sz, elfcorebuf_sz_orig], is filled with 0.

    Use size of the ELF headers as an initial offset value in
    set_vmcore_list_offsets_elf{64,32} and
    process_ptload_program_headers_elf{64,32} in order to indicate that the
    offset includes the holes towards the page boundary.

    As a result, both set_vmcore_list_offsets_elf{64,32} have the same
    definition. Merge them as set_vmcore_list_offsets.

    [akpm@linux-foundation.org: add free_elfcorebuf(), cleanups]
    Signed-off-by: HATAYAMA Daisuke
    Acked-by: Vivek Goyal
    Cc: KOSAKI Motohiro
    Cc: Atsushi Kumagai
    Cc: Lisa Mitchell
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HATAYAMA Daisuke
     
  • Rewrite the part of read_vmcore() that reads objects in vmcore_list in
    the same way as the part that reads the ELF headers, removing some
    duplicated and redundant code.

    Signed-off-by: HATAYAMA Daisuke
    Acked-by: Vivek Goyal
    Cc: KOSAKI Motohiro
    Cc: Atsushi Kumagai
    Cc: Lisa Mitchell
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HATAYAMA Daisuke
     
  • Add a PAGE_ALIGNED() helper to test whether an address is aligned to
    PAGE_SIZE.
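
    The helper is a one-liner; it likely looks like this (sketch of the
    include/linux/mm.h addition):

    /* test whether an address (or pointer) is aligned to PAGE_SIZE */
    #define PAGE_ALIGNED(addr)  IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)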

    Cc: HATAYAMA Daisuke
    Cc: "Eric W. Biederman" ,
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • mmzone.h documents node_size_lock (which pgdat_resize_lock() locks) as
    follows:

    * Must be held any time you expect node_start_pfn, node_present_pages
    * or node_spanned_pages stay constant. [...]

    So actually hold it when we update node_present_pages in __offline_pages().
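
    The pattern is small; a sketch of the kind of hunk this adds (the
    surrounding zone and offlined_pages come from __offline_pages() and are
    assumed here):

    unsigned long flags;
    pg_data_t *pgdat = zone->zone_pgdat;

    pgdat_resize_lock(pgdat, &flags);
    pgdat->node_present_pages -= offlined_pages;
    pgdat_resize_unlock(pgdat, &flags);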

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Cody P Schafer
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • mmzone.h documents node_size_lock (which pgdat_resize_lock() locks) as
    follows:

    * Must be held any time you expect node_start_pfn, node_present_pages
    * or node_spanned_pages stay constant. [...]

    So actually hold it when we update node_present_pages in online_pages().

    Signed-off-by: Cody P Schafer
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer