18 Aug, 2018

5 commits

  • Rename new_sparse_init() to sparse_init(), which enables it. Delete the old
    sparse_init() and all the code that became obsolete with it.

    [pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()]
    Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
    Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Tested-by: Michael Ellerman [powerpc]
    Tested-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Cc: Pasha Tatashin
    Cc: Abdul Haleem
    Cc: Baoquan He
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Souptick Joarder
    Cc: Steven Sistare
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Now that both variants of sparse memory use the same buffers to populate
    the memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to a
    common place.
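
    A minimal sketch of that common, per-node call pattern follows (helper
    names such as section_map_size() follow this series but are shown here
    only for illustration; this is not the exact upstream hunk):

    /*
     * Illustrative sketch: the per-node path sets up one shared buffer,
     * populates every present section on the node from it, then hands back
     * the unused tail.
     */
    sparse_buffer_init(map_count * section_map_size(), nid);
    for_each_present_section_nr(pnum_begin, pnum) {
        if (pnum >= pnum_end)
            break;
        /* allocate this section's usemap/memmap, preferably from the buffer */
    }
    sparse_buffer_fini();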

    Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Tested-by: Michael Ellerman [powerpc]
    Tested-by: Oscar Salvador
    Reviewed-by: Andrew Morton
    Cc: Pasha Tatashin
    Cc: Abdul Haleem
    Cc: Baoquan He
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Souptick Joarder
    Cc: Steven Sistare
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Patch series "sparse_init rewrite", v6.

    In sparse_init() we allocate two large buffers to temporarily hold the
    usemap and memmap for the whole machine. However, we can avoid doing that
    if we change sparse_init() to operate on a per-node basis instead of
    doing it for the whole machine beforehand.

    As shown by Baoquan in
    http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com

    the buffers are large enough to prevent the machine from booting on
    small-memory systems.

    Another benefit of these changes is that they also obsolete
    CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.

    This patch (of 5):

    When struct pages are allocated for the sparse-vmemmap VA layout, we first
    try to allocate one large buffer, and then, if that fails, allocate struct
    pages for each section as we go.

    The code that allocates the buffer uses global variables and is spread
    across several call sites.

    Clean up the code by introducing three functions to handle the global
    buffer:

    sparse_buffer_init()  initializes the buffer
    sparse_buffer_fini()  frees the remaining part of the buffer
    sparse_buffer_alloc() allocates from the buffer, returning NULL when the
                          buffer is empty

    Define these functions in sparse.c instead of sparse-vmemmap.c because
    later we will use them for non-vmemmap sparse allocations as well.
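
    A minimal sketch of the intended calling pattern (the function name below
    is hypothetical, and the fallback allocator call mirrors the memblock API
    of that era rather than the literal upstream code):

    struct page * __meminit sparse_populate_section_sketch(int nid)
    {
        unsigned long size = sizeof(struct page) * PAGES_PER_SECTION;
        struct page *map;

        /* Try the shared, pre-sized buffer first. */
        map = sparse_buffer_alloc(size);
        if (map)
            return map;

        /* Buffer exhausted: fall back to a fresh node-local allocation. */
        return memblock_virt_alloc_try_nid(size, PAGE_SIZE,
                                           __pa(MAX_DMA_ADDRESS),
                                           BOOTMEM_ALLOC_ACCESSIBLE, nid);
    }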

    [akpm@linux-foundation.org: use PTR_ALIGN()]
    [akpm@linux-foundation.org: s/BUG_ON/WARN_ON/]
    Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Tested-by: Michael Ellerman [powerpc]
    Reviewed-by: Oscar Salvador
    Tested-by: Oscar Salvador
    Cc: Pasha Tatashin
    Cc: Steven Sistare
    Cc: Daniel Jordan
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: Souptick Joarder
    Cc: Baoquan He
    Cc: Greg Kroah-Hartman
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Ingo Molnar
    Cc: Abdul Haleem
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • In sparse_init(), two temporary pointer arrays, usemap_map and map_map,
    are allocated with the size of NR_MEM_SECTIONS. They are used to store
    each memory section's usemap and memmap if the section is marked as
    present. With the help of these two arrays, a contiguous memory chunk is
    allocated for the usemaps and memmaps of the memory sections on one node.
    This avoids excessive memory fragmentation. In the diagram below, '1'
    indicates a present memory section and '0' an absent one. The number 'n'
    can be much smaller than NR_MEM_SECTIONS on most systems.

    |1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0|
    -------------------------------------------------------
     0 1 2 3         4 5          i  i+1          n-1    n

    If we fail to populate the page tables to map one section's memmap, its
    ->section_mem_map will finally be cleared to indicate that the section is
    not present. After use, these two arrays are released at the end of
    sparse_init().

    In 4-level paging mode each array costs 4MB, which is negligible. In
    5-level paging mode they cost 256MB each, 512MB altogether. A kdump
    kernel usually reserves very little memory, e.g. 256MB. So even though
    they are only temporarily allocated, this is still not acceptable.

    In fact, there is no need to allocate them with the size of
    NR_MEM_SECTIONS. Since the ->section_mem_map clearing has been deferred
    to the end, the number of present memory sections stays the same during
    sparse_init() until we finally clear out a memory section's
    ->section_mem_map if its usemap or memmap was not correctly handled.
    Thus, whenever the for_each_present_section_nr() loop is taken in the
    meantime, the i-th present memory section is always the same one.

    Here, allocate usemap_map and map_map only with the size of
    'nr_present_sections'. For the i-th present memory section, install its
    usemap and memmap in usemap_map[i] and map_map[i] during allocation.
    Then, in the last for_each_present_section_nr() loop, which clears a
    failed memory section's ->section_mem_map, fetch the usemap and memmap
    from the usemap_map[] and map_map[] arrays and set them in mem_section[]
    accordingly.
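
    A rough sketch of the resulting allocation and indexing discipline (the
    allocation pass referenced in the comments is a placeholder, not the real
    function; shown only to illustrate the sizing by nr_present_sections):

    unsigned long **usemap_map;
    struct page **map_map;
    unsigned long pnum;
    int idx = 0;

    /* Sized by present sections only, not by NR_MEM_SECTIONS. */
    usemap_map = memblock_virt_alloc(nr_present_sections * sizeof(usemap_map[0]), 0);
    map_map    = memblock_virt_alloc(nr_present_sections * sizeof(map_map[0]), 0);

    for_each_present_section_nr(0, pnum) {
        /* Slot 'idx' is stable because ->section_mem_map clearing is deferred. */
        usemap_map[idx] = NULL;   /* filled in by the usemap allocation pass */
        map_map[idx]    = NULL;   /* filled in by the memmap allocation pass */
        idx++;
    }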

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20180628062857.29658-5-bhe@redhat.com
    Signed-off-by: Baoquan He
    Reviewed-by: Pavel Tatashin
    Cc: Pasha Tatashin
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, the system
    will allocate one contiguous memory chunk for the mem maps on one node and
    populate the relevant page tables to map the memory sections one by one.
    If populating a certain mem section fails, a warning is printed and its
    ->section_mem_map is cleared to cancel its marking as present. Like this,
    the number of mem sections marked as present can decrease during
    sparse_init() execution.

    Here, just defer the ms->section_mem_map clearing, when populating its
    page tables failed, until the last for_each_present_section_nr() loop.
    This is in preparation for later optimizing the mem map allocation.
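
    A before/after sketch of the control-flow change described here
    (illustrative only, not the literal hunk):

    /* Before: a failed section was un-marked on the spot. */
    if (!map) {
        pr_err("can not allocate memmap for section %lu\n", pnum);
        ms->section_mem_map = 0;
        continue;
    }

    /* After: only report the failure here; the actual clearing of
     * ms->section_mem_map is deferred to the final
     * for_each_present_section_nr() loop, so the set of "present"
     * sections stays stable while sparse_init() runs. */
    if (!map) {
        pr_err("can not allocate memmap for section %lu\n", pnum);
        continue;
    }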

    [akpm@linux-foundation.org: remove now-unused local `ms', per Oscar]
    Link: http://lkml.kernel.org/r/20180228032657.32385-3-bhe@redhat.com
    Signed-off-by: Baoquan He
    Acked-by: Dave Hansen
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Oscar Salvador
    Cc: Pasha Tatashin
    Cc: Kirill A. Shutemov
    Cc: Pankaj Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baoquan He
     


16 Nov, 2017

2 commits

  • While doing memory hotplug tests under heavy memory pressure, we have
    noticed too many page allocation failures when allocating the vmemmap
    memmap backed by huge pages:

    kworker/u3072:1: page allocation failure: order:9, mode:0x24084c0(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO)
    [...]
    Call Trace:
    dump_trace+0x59/0x310
    show_stack_log_lvl+0xea/0x170
    show_stack+0x21/0x40
    dump_stack+0x5c/0x7c
    warn_alloc_failed+0xe2/0x150
    __alloc_pages_nodemask+0x3ed/0xb20
    alloc_pages_current+0x7f/0x100
    vmemmap_alloc_block+0x79/0xb6
    __vmemmap_alloc_block_buf+0x136/0x145
    vmemmap_populate+0xd2/0x2b9
    sparse_mem_map_populate+0x23/0x30
    sparse_add_one_section+0x68/0x18e
    __add_pages+0x10a/0x1d0
    arch_add_memory+0x4a/0xc0
    add_memory_resource+0x89/0x160
    add_memory+0x6d/0xd0
    acpi_memory_device_add+0x181/0x251
    acpi_bus_attach+0xfd/0x19b
    acpi_bus_scan+0x59/0x69
    acpi_device_hotplug+0xd2/0x41f
    acpi_hotplug_work_fn+0x1a/0x23
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xbd/0xe0
    ret_from_fork+0x3f/0x70

    and we do see many of those because essentially every allocation fails
    for each memory section. This is an excessive way to tell the user that
    there is nothing to really worry about because we do have a fallback
    mechanism to use base pages. The only downside might be a performance
    degradation due to TLB pressure.

    This patch changes vmemmap_alloc_block() to use __GFP_NOWARN and warn
    explicitly once on the first allocation failure. This will reduce the
    noise in the kernel log considerably, while we still have an indication
    that performance might be impacted.
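
    The change is along these lines (a sketch, close to but not necessarily
    identical with the final hunk):

    static bool warned;
    struct page *page;

    /* Suppress the generic per-failure splat... */
    page = alloc_pages_node(node, gfp_mask | __GFP_NOWARN, order);
    if (page)
        return page_address(page);

    /* ...and warn explicitly, but only once, on the first failure. */
    if (!warned) {
        warn_alloc(gfp_mask & ~__GFP_NOWARN, NULL,
                   "vmemmap alloc failure: order:%u", order);
        warned = true;
    }
    return NULL;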

    [mhocko@kernel.org: forgot to git add the follow up fix]
    Link: http://lkml.kernel.org/r/20171107090635.c27thtse2lchjgvb@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20171106092228.31098-1-mhocko@kernel.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Michal Hocko
    Cc: Joe Perches
    Cc: Vlastimil Babka
    Cc: Khalid Aziz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • vmemmap_alloc_block() will no longer zero the block, so zero memory at
    its call sites for everything except struct pages. Struct page memory
    is zeroed by struct page initialization.

    Replace the allocators in sparse-vmemmap with the non-zeroing version.
    This way we get the performance improvement of zeroing the memory in
    parallel when struct pages are zeroed.

    Add struct page zeroing as a part of initialization of other fields in
    __init_single_page().

    Single-thread performance, collected on an Intel(R) Xeon(R) CPU E7-8895
    v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                          BASE             FIX
    sparse_init      11.244671836s    0.007199623s
    zone_sizes_init   4.879775891s    8.355182299s
                     ------------------------------
    Total            16.124447727s    8.362381922s

    sparse_init is where memory for struct pages is zeroed, and the zeroing
    part is moved later in this patch into __init_single_page(), which is
    called from zone_sizes_init().
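
    A sketch of where the zeroing ends up (signature and surrounding calls
    abbreviated; the real patch may use an arch-optimized zeroing helper
    instead of a plain memset):

    static void __meminit __init_single_page(struct page *page, unsigned long pfn,
                                             unsigned long zone, int nid)
    {
        /* struct page memory is no longer zeroed at allocation time,
         * so clear it here, as the first step of initialization. */
        memset(page, 0, sizeof(*page));

        set_page_links(page, zone, nid, pfn);
        init_page_count(page);
        page_mapcount_reset(page);
        /* ... remaining per-page field initialization ... */
    }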

    [akpm@linux-foundation.org: make vmemmap_alloc_block_zero() private to sparse-vmemmap.c]
    Link: http://lkml.kernel.org/r/20171013173214.27300-10-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Steven Sistare
    Reviewed-by: Daniel Jordan
    Reviewed-by: Bob Picco
    Tested-by: Bob Picco
    Acked-by: Michal Hocko
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Ard Biesheuvel
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: David S. Miller
    Cc: Dmitry Vyukov
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Sam Ravnborg
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to apply to a
    file was done in a spreadsheet of side-by-side results from the output
    of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    The criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

07 Sep, 2017

1 commit

  • Commit f52407ce2dea ("memory hotplug: alloc page from other node in
    memory online") has introduced N_HIGH_MEMORY checks to only use NUMA
    aware allocations when there is some memory present because the
    respective node might not have any memory yet at the time and so it
    could fail or even OOM.

    Things have changed since then though. Zonelists are now always
    initialized before we do any allocations even for hotplug (see
    959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug
    zonelist")).

    Therefore these checks are not really needed. In fact, the caller of the
    allocator should never care whether the node is populated, because that
    might change at any time.

    Link: http://lkml.kernel.org/r/20170721143915.14161-10-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Shaohua Li
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

13 Jul, 2017

1 commit

  • __GFP_REPEAT was designed to allow retry-but-eventually-fail semantics in
    the page allocator. This has been true, but only for allocation requests
    larger than PAGE_ALLOC_COSTLY_ORDER. It has always been ignored for
    smaller sizes. This is a bit unfortunate because there is no way to
    express the same semantics for those requests; they are considered too
    important to fail, so they might end up looping in the page allocator
    forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misguided
    usage of the __GFP_REPEAT flag has been removed for !costly requests, we
    can give the original flag a better name and, more importantly, a more
    useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL, which tells the
    user that the allocator will try really hard but there is no promise of
    success. This works independently of the order and overrides the default
    allocator behavior. Page allocator users have several levels of
    guarantee vs. cost options (take GFP_KERNEL as an example):

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most lightweight mode, which doesn't
    even kick background reclaim. Should be used carefully because it might
    deplete the memory and the next user might hit the more aggressive
    reclaim.

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non-sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because
    this is the semantic they already relied on. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user-defined fallback
    behavior is more sensible than continuing to retry in the page allocator.
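
    A usage sketch of the new flag under these semantics (the caller and its
    vmalloc() fallback below are hypothetical; the fallback path is whatever
    the caller chooses):

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    static void *alloc_big_table(size_t size)
    {
        void *p;

        /* Try hard for physically contiguous memory, but allow failure
         * instead of looping in the allocator or invoking the OOM killer. */
        p = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
        if (!p)
            p = vmalloc(size);      /* caller-defined fallback */
        return p;
    }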

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     


03 Aug, 2016

1 commit

  • There was only one use of __initdata_refok and __exit_refok

    __init_refok was used 46 times against 82 for __ref.

    Those definitions are obsolete since commit 312b1485fb50 ("Introduce new
    section reference annotations tags: __ref, __refdata, __refconst")

    This patch removes the following compatibility definitions and replaces
    them treewide.

    /* compatibility defines */
    #define __init_refok __ref
    #define __initdata_refok __refdata
    #define __exit_refok __ref

    I can also provide separate patches if necessary.
    (One patch per tree and check in 1 month or 2 to remove old definitions)

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Ingo Molnar
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

18 Mar, 2016

2 commits

  • Most of the mm subsystem uses pr_<level>, so make it consistent.

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

16 Jan, 2016

1 commit

  • In support of providing struct page for large persistent memory
    capacities, use struct vmem_altmap to change the default policy for
    allocating memory for the memmap array. The default vmemmap_populate()
    allocates page table storage area from the page allocator. Given
    persistent memory capacities relative to DRAM it may not be feasible to
    store the memmap in 'System Memory'. Instead vmem_altmap represents
    pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
    requests.
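
    The descriptor introduced around this change is roughly of this shape
    (field layout shown for illustration; see the kernel's memremap header
    for the authoritative definition), and the buffered allocator grows an
    altmap argument:

    struct vmem_altmap {
        const unsigned long base_pfn;   /* first pfn of the device range      */
        const unsigned long reserve;    /* pages to skip at the start         */
        unsigned long free;             /* pages available for memmap storage */
        unsigned long align;
        unsigned long alloc;            /* pages already handed out           */
    };

    /* memmap storage can now come from device pages instead of RAM: */
    void * __meminit __vmemmap_alloc_block_buf(unsigned long size, int node,
                                               struct vmem_altmap *altmap);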

    Signed-off-by: Dan Williams
    Reported-by: kbuild test robot
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

22 Jan, 2014

1 commit

  • Switch to memblock interfaces for the early memory allocator instead of
    the bootmem allocator. There is no functional change in behavior from
    the bootmem users' point of view.

    Archs already converted to NO_BOOTMEM now directly use memblock
    interfaces instead of bootmem wrappers built on top of memblock. On the
    archs which still use bootmem, these new APIs simply fall back to the
    existing bootmem APIs.
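
    In this file the conversion amounts to something like the following
    (a sketch; argument order per the memblock wrappers of that time, not
    necessarily the literal upstream hunk):

    static void * __init_refok __earlyonly_bootmem_alloc(int node,
                                                         unsigned long size,
                                                         unsigned long align,
                                                         unsigned long goal)
    {
        /* was: __alloc_bootmem_node_high(NODE_DATA(node), size, align, goal) */
        return memblock_virt_alloc_try_nid(size, align, goal,
                                           BOOTMEM_ALLOC_ACCESSIBLE, node);
    }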

    Signed-off-by: Santosh Shilimkar
    Cc: "Rafael J. Wysocki"
    Cc: Arnd Bergmann
    Cc: Christoph Lameter
    Cc: Greg Kroah-Hartman
    Cc: Grygorii Strashko
    Cc: H. Peter Anvin
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michal Hocko
    Cc: Paul Walmsley
    Cc: Pavel Machek
    Cc: Russell King
    Cc: Tejun Heo
    Cc: Tony Lindgren
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Santosh Shilimkar
     

30 Apr, 2013

2 commits

  • The sparse code, when asking the architecture to populate the vmemmap,
    specifies the section range as a starting page and a number of pages.

    This is an awkward interface, because none of the arch-specific code
    actually thinks of the range in terms of 'struct page' units and always
    translates it to bytes first.

    In addition, later patches mix huge page and regular page backing for
    the vmemmap. For this, they need to call vmemmap_populate_basepages()
    on sub-section ranges with PAGE_SIZE and PMD_SIZE in mind. But these
    are not necessarily multiples of the 'struct page' size and so this unit
    is too coarse.

    Just translate the section range into bytes once in the generic sparse
    code, then pass byte ranges down the stack.
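
    The interface change, roughly (parameter names illustrative):

    /* Before: the range is a starting page plus a number of struct pages. */
    int vmemmap_populate(struct page *start_page, unsigned long nr_pages, int node);

    /* After: generic sparse code converts once; arch code sees byte ranges. */
    int vmemmap_populate(unsigned long start, unsigned long end, int node);
    int vmemmap_populate_basepages(unsigned long start, unsigned long end, int node);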

    Signed-off-by: Johannes Weiner
    Cc: Ben Hutchings
    Cc: Bernhard Schmidt
    Cc: Johannes Weiner
    Cc: Russell King
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Benjamin Herrenschmidt
    Cc: "Luck, Tony"
    Cc: Heiko Carstens
    Acked-by: David S. Miller
    Tested-by: David S. Miller
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Hot-adding memory on x86_64 normally requires huge page allocation.
    When this is done to a VM guest, it's usually because the system is
    already tight on memory, so the request tends to fail. Try to avoid
    this by adding __GFP_REPEAT to the allocation flags.

    Addresses http://bugs.debian.org/699913

    Signed-off-by: Ben Hutchings
    Signed-off-by: Johannes Weiner
    Reported-by: Bernhard Schmidt
    Tested-by: Bernhard Schmidt
    Cc: Russell King
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Benjamin Herrenschmidt
    Cc: "Luck, Tony"
    Cc: Heiko Carstens
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Hutchings
     


02 Nov, 2010

1 commit

  • "gadget", "through", "command", "maintain", "maintain", "controller", "address",
    "between", "initiali[zs]e", "instead", "function", "select", "already",
    "equal", "access", "management", "hierarchy", "registration", "interest",
    "relative", "memory", "offset", "already",

    Signed-off-by: Uwe Kleine-König
    Signed-off-by: Jiri Kosina

    Uwe Kleine-König
     

28 Aug, 2010

1 commit

  • 1. replace find_e820_area with memblock_find_in_range
    2. replace reserve_early with memblock_x86_reserve_range
    3. replace free_early with memblock_x86_free_range.
    4. NO_BOOTMEM will switch to use memblock too.
    5. use _e820, _early wrappers in this patch; a following patch will
    replace them all
    6. because memblock_x86_free_range supports partial free, we can remove some special care
    7. Need to make sure that memblock_find_in_range() is called after memblock_x86_fill(),
    so adjust some calls later in setup.c::setup_arch()
    -- corruption_check and mptable_update

    -v2: Move reserve_brk() early
    Before fill_memblock_area, to avoid an overlap between brk and
    memblock_find_in_range(); that could happen when we have more than 128 RAM
    entries in the E820 tables, and memblock_x86_fill() could use
    memblock_find_in_range() to find a new place for the memblock.memory.region
    array. We also don't need to use extend_brk() after fill_memblock_area(),
    so move reserve_brk() early, before fill_memblock_area().
    -v3: Move find_smp_config early
    To make sure memblock_find_in_range() does not find the wrong place, if the
    BIOS doesn't put the mptable in the right place.
    -v4: Treat RESERVED_KERN as RAM in memblock.memory; it is already in
    memblock.reserved anyway.
    Use __NOT_KEEP_MEMBLOCK to make sure memblock-related code can be freed later.
    -v5: The generic __memblock_find_in_range() goes from high to low, and the
    active_region for 32bit does include high pages, so replace the limit with
    memblock.default_alloc_limit, aka get_max_mapped()
    -v6: Use current_limit instead
    -v7: check with MEMBLOCK_ERROR instead of -1ULL or -1L
    -v8: Set memblock_can_resize early to handle EFI with more RAM entries
    -v9: update after kmemleak changes in mainline

    Suggested-by: David S. Miller
    Suggested-by: Benjamin Herrenschmidt
    Suggested-by: Thomas Gleixner
    Signed-off-by: Yinghai Lu
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script was
    used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h, and if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

13 Feb, 2010

2 commits

  • Add vmemmap_alloc_block_buf() for the mem map only.

    It will fall back to the old way if it cannot get a block that big.

    Before this patch, when a node has 128G of RAM installed, the memmap is
    split into two parts or more:
    [ 0.000000] [ffffea0000000000-ffffea003fffffff] PMD -> [ffff880100600000-ffff88013e9fffff] on node 1
    [ 0.000000] [ffffea0040000000-ffffea006fffffff] PMD -> [ffff88013ec00000-ffff88016ebfffff] on node 1
    [ 0.000000] [ffffea0070000000-ffffea007fffffff] PMD -> [ffff882000600000-ffff8820105fffff] on node 0
    [ 0.000000] [ffffea0080000000-ffffea00bfffffff] PMD -> [ffff882010800000-ffff8820507fffff] on node 0
    [ 0.000000] [ffffea00c0000000-ffffea00dfffffff] PMD -> [ffff882050a00000-ffff8820709fffff] on node 0
    [ 0.000000] [ffffea00e0000000-ffffea00ffffffff] PMD -> [ffff884000600000-ffff8840205fffff] on node 2
    [ 0.000000] [ffffea0100000000-ffffea013fffffff] PMD -> [ffff884020800000-ffff8840607fffff] on node 2
    [ 0.000000] [ffffea0140000000-ffffea014fffffff] PMD -> [ffff884060a00000-ffff8840709fffff] on node 2
    [ 0.000000] [ffffea0150000000-ffffea017fffffff] PMD -> [ffff886000600000-ffff8860305fffff] on node 3
    [ 0.000000] [ffffea0180000000-ffffea01bfffffff] PMD -> [ffff886030800000-ffff8860707fffff] on node 3
    [ 0.000000] [ffffea01c0000000-ffffea01ffffffff] PMD -> [ffff888000600000-ffff8880405fffff] on node 4
    [ 0.000000] [ffffea0200000000-ffffea022fffffff] PMD -> [ffff888040800000-ffff8880707fffff] on node 4
    [ 0.000000] [ffffea0230000000-ffffea023fffffff] PMD -> [ffff88a000600000-ffff88a0105fffff] on node 5
    [ 0.000000] [ffffea0240000000-ffffea027fffffff] PMD -> [ffff88a010800000-ffff88a0507fffff] on node 5
    [ 0.000000] [ffffea0280000000-ffffea029fffffff] PMD -> [ffff88a050a00000-ffff88a0709fffff] on node 5
    [ 0.000000] [ffffea02a0000000-ffffea02bfffffff] PMD -> [ffff88c000600000-ffff88c0205fffff] on node 6
    [ 0.000000] [ffffea02c0000000-ffffea02ffffffff] PMD -> [ffff88c020800000-ffff88c0607fffff] on node 6
    [ 0.000000] [ffffea0300000000-ffffea030fffffff] PMD -> [ffff88c060a00000-ffff88c0709fffff] on node 6
    [ 0.000000] [ffffea0310000000-ffffea033fffffff] PMD -> [ffff88e000600000-ffff88e0305fffff] on node 7
    [ 0.000000] [ffffea0340000000-ffffea037fffffff] PMD -> [ffff88e030800000-ffff88e0707fffff] on node 7

    After the patch we get:
    [ 0.000000] [ffffea0000000000-ffffea006fffffff] PMD -> [ffff880100200000-ffff88016e5fffff] on node 0
    [ 0.000000] [ffffea0070000000-ffffea00dfffffff] PMD -> [ffff882000200000-ffff8820701fffff] on node 1
    [ 0.000000] [ffffea00e0000000-ffffea014fffffff] PMD -> [ffff884000200000-ffff8840701fffff] on node 2
    [ 0.000000] [ffffea0150000000-ffffea01bfffffff] PMD -> [ffff886000200000-ffff8860701fffff] on node 3
    [ 0.000000] [ffffea01c0000000-ffffea022fffffff] PMD -> [ffff888000200000-ffff8880701fffff] on node 4
    [ 0.000000] [ffffea0230000000-ffffea029fffffff] PMD -> [ffff88a000200000-ffff88a0701fffff] on node 5
    [ 0.000000] [ffffea02a0000000-ffffea030fffffff] PMD -> [ffff88c000200000-ffff88c0701fffff] on node 6
    [ 0.000000] [ffffea0310000000-ffffea037fffffff] PMD -> [ffff88e000200000-ffff88e0701fffff] on node 7

    -v2: change buf to vmemmap_buf instead according to Ingo
    also add CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER according to Ingo
    -v3: according to Andrew, use sizeof(name) instead of hard coded 15
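
    A condensed sketch of the buffer-with-fallback behaviour described above
    (alignment and end-of-buffer handling are omitted here, so this is not
    the exact upstream function):

    void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node)
    {
        void *ptr;

        /* No per-node buffer set up: fall back to the old block allocation. */
        if (!vmemmap_buf)
            return vmemmap_alloc_block(size, node);

        /* Carve 'size' bytes out of the node's pre-allocated buffer. */
        ptr = vmemmap_buf;
        vmemmap_buf += size;
        return ptr;
    }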

    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Cc: Christoph Lameter
    Acked-by: Christoph Lameter
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     
  • Finally we can use early_res to replace bootmem for x86_64 now.

    Still can use CONFIG_NO_BOOTMEM to enable it or not.

    -v2: fix 32bit compiling about MAX_DMA32_PFN
    -v3: folded bug fix from LKML message below

    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

22 Sep, 2009

1 commit

  • To initialize a hot-added node, some pages are allocated. At that time,
    the node has no memory yet, which makes the allocation always fail. In
    such a case, let's allocate pages from other nodes.
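
    The fallback amounts to something like this at each affected allocation
    site (a sketch only; the variable names are illustrative):

    /* Only ask for node-local memory when the node already has memory. */
    if (node_state(node, N_HIGH_MEMORY))
        section = kmalloc_node(array_size, GFP_KERNEL, node);
    else
        section = kmalloc(array_size, GFP_KERNEL);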

    Signed-off-by: Shaohua Li
    Signed-off-by: Yakui Zhao
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     


05 Jul, 2008

1 commit

  • Remove all clameter@sgi.com addresses from the kernel tree since they will
    become invalid on June 27th. Change my maintainer email address for the
    slab allocators to cl@linux-foundation.org (which will be the new email
    address for the future).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Stephen Rothwell
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     


30 Oct, 2007

1 commit

  • mm/sparse-vmemmap.c uses init_mm in some places. However, it is not
    declared in any of the headers currently included in the file.

    init_mm is declared as extern in sched.h, so we add sched.h to the header
    list.

    Up to now, this problem was masked by the fact that functions like
    set_pte_at() and pmd_populate_kernel() are usually macros that expand to
    simpler variants that do not use the first parameter at all.

    Signed-off-by: Glauber de Oliveira Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber de Oliveira Costa
     

17 Oct, 2007

3 commits

  • This patch is to avoid a panic when memory hot-add is executed with
    sparsemem-vmemmap. The current vmemmap-sparsemem code doesn't support
    memory hot-add; the vmemmap must be populated on hot-add. This is for
    2.6.23-rc2-mm2.

    Todo: # Even if this patch is applied, the message "[xxxx-xxxx] potential
    offnode page_structs" is displayed. To allocate the memmap on its own
    node, the memmap (and pgdat) must be initialized itself, which is a
    chicken-and-egg relationship.

    # vmemmap_unpopulate will be necessary for the following:
    - cancelling a hot-add due to an error,
    - unplugging.

    Signed-off-by: Yasunori Goto
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • Convert the common vmemmap population into initialisation helpers for use
    by architecture vmemmap populators. All architectures implementing the
    SPARSEMEM_VMEMMAP variant supply an architecture-specific
    vmemmap_populate() initialiser, which may make use of the helpers.

    This allows us to clean up and remove the initialisation Kconfig entries.
    With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option
    to indicate use of that variant.
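
    An arch-side populator built from those helpers might look roughly like
    this (essentially the shape of the generic base-page population of that
    era; the function body is illustrative, not any particular arch's code):

    int __meminit vmemmap_populate(struct page *start_page,
                                   unsigned long nr_pages, int node)
    {
        unsigned long addr = (unsigned long)start_page;
        unsigned long end  = (unsigned long)(start_page + nr_pages);

        /* Map the vmemmap range with base pages, level by level. */
        for (; addr < end; addr += PAGE_SIZE) {
            pgd_t *pgd = vmemmap_pgd_populate(addr, node);
            pud_t *pud;
            pmd_t *pmd;
            pte_t *pte;

            if (!pgd)
                return -ENOMEM;
            pud = vmemmap_pud_populate(pgd, addr, node);
            if (!pud)
                return -ENOMEM;
            pmd = vmemmap_pmd_populate(pud, addr, node);
            if (!pmd)
                return -ENOMEM;
            pte = vmemmap_pte_populate(pmd, addr, node);
            if (!pte)
                return -ENOMEM;
        }
        return 0;
    }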

    Signed-off-by: Andy Whitcroft
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
    the arches. It would be great if it could be the default so that we can get
    rid of various forms of DISCONTIG and other variations on memory maps. So far
    what has hindered this are the additional lookups that SPARSEMEM introduces
    for virt_to_page and page_address. This goes so far that the code to do this
    has to be kept in a separate function and cannot be used inline.

    This patch introduces a virtual memmap mode for SPARSEMEM, in which the
    memmap is mapped into a virtually contiguous area and only the active
    sections are physically backed. This allows virt_to_page, page_address
    and cohorts to become simple shift/add operations. No page flag fields,
    no table lookups, nothing involving memory is required.

    The two key operations pfn_to_page and page_to_pfn become:

    #define __pfn_to_page(pfn) (vmemmap + (pfn))
    #define __page_to_pfn(page) ((page) - vmemmap)

    By having a virtual mapping for the memmap we allow simple access without
    wasting physical memory. As kernel memory is typically already mapped 1:1,
    this introduces no additional overhead. The virtual mapping must be big
    enough to allow a struct page to be allocated and mapped for all valid
    physical pages. This will make a virtual memmap difficult to use on 32-bit
    platforms that support 36 address bits.

    However, if there is enough virtual space available and the arch already maps
    its 1-1 kernel space using TLBs (f.e. true of IA64 and x86_64) then this
    technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
    FLATMEM needs to read the contents of the mem_map variable to get the start of
    the memmap and then add the offset to the required entry. vmemmap is a
    constant to which we can simply add the offset.

    This patch has the potential to allow us to make SPARSEMEM the default
    (and even the only) option for most systems. It should be optimal on UP,
    SMP and NUMA on most platforms. Then we may even be able to remove the
    other memory models: FLATMEM, DISCONTIG etc.

    [apw@shadowen.org: config cleanups, resplit code etc]
    [kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init]
    [apw@shadowen.org: vmemmap: remove excess debugging]
    [apw@shadowen.org: simplify initialisation code and reduce duplication]
    [apw@shadowen.org: pull out the vmemmap code into its own file]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Cc: "Luck, Tony"
    Cc: Andi Kleen
    Cc: "David S. Miller"
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter