09 Sep, 2015

13 commits

  • Merge second patch-bomb from Andrew Morton:
    "Almost all of the rest of MM. There was an unusually large amount of
    MM material this time"

    * emailed patches from Andrew Morton : (141 commits)
    zpool: remove no-op module init/exit
    mm: zbud: constify the zbud_ops
    mm: zpool: constify the zpool_ops
    mm: swap: zswap: maybe_preload & refactoring
    zram: unify error reporting
    zsmalloc: remove null check from destroy_handle_cache()
    zsmalloc: do not take class lock in zs_shrinker_count()
    zsmalloc: use class->pages_per_zspage
    zsmalloc: consider ZS_ALMOST_FULL as migrate source
    zsmalloc: partial page ordering within a fullness_list
    zsmalloc: use shrinker to trigger auto-compaction
    zsmalloc: account the number of compacted pages
    zsmalloc/zram: introduce zs_pool_stats api
    zsmalloc: cosmetic compaction code adjustments
    zsmalloc: introduce zs_can_compact() function
    zsmalloc: always keep per-class stats
    zsmalloc: drop unused variable `nr_to_migrate'
    mm/memblock.c: fix comment in __next_mem_range()
    mm/page_alloc.c: fix type information of memoryless node
    memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
    ...

    Linus Torvalds
     
  • For a memoryless node, the output of get_pfn_range_for_nid() is all
    zeroes, so the node is reported as spanning mem from 0 to -1.

    Signed-off-by: Zhen Lei
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhen Lei
     
  • When hot-adding a node from add_memory(), we add the memblock first,
    so the node is not empty. But when called from cpu_up(), the node
    should be empty.

    Signed-off-by: Xishi Qiu
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Taku Izumi
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • We use sysctl_lowmem_reserve_ratio rather than
    sysctl_lower_zone_reserve_ratio to determine how aggressive the kernel
    is in defending lowmem from the possibility of being captured into
    pinned user memory. To avoid misleading readers, correct the comments
    that still use the old name.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • The comment says that the per-cpu batchsize and zone watermarks are
    determined by present_pages, which is wrong: they are both calculated
    from managed_pages. Fix it.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.

    It has been also considered to really provide a convenience function for
    allocations restricted to a node, but the major opinion seems to be that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values that would previously have
    been exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.
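
    A minimal usage sketch of the distinction described above (nid, order
    and page are assumed to be in scope; this is illustrative only, not
    code from the patch):

        /* nid is only a preference: the allocation may fall back to
         * zones on other nodes. */
        page = __alloc_pages_node(nid, GFP_KERNEL, order);

        /* A truly node-restricted allocation still goes through
         * __GFP_THISNODE rather than a dedicated helper. */
        page = __alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, order);

        /* General-purpose callers keep using alloc_pages_node(), which
         * also accepts NUMA_NO_NODE and then prefers the current node. */
        page = alloc_pages_node(NUMA_NO_NODE, GFP_KERNEL, order);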

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The pair of get/set_freepage_migratetype() functions are used to cache
    pageblock migratetype for a page put on a pcplist, so that it does not
    have to be retrieved again when the page is put on a free list (e.g.
    when pcplists become full). Historically it was also assumed that the
    value is accurate for pages on freelists (as the functions' names
    unfortunately suggest), but that cannot be guaranteed without affecting
    various allocator fast paths. It is in fact not needed and all such
    uses have been removed.

    The last remaining (but pointless) usage related to pages on
    freelists is in move_freepages(), which this patch removes.

    To prevent further confusion, rename the functions to
    get/set_pcppage_migratetype() and expand their description. Since all
    the users are now in mm/page_alloc.c, move the functions there from the
    shared header.
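
    A simplified sketch of the caching pattern after the rename (the
    cached value is kept in the page's index field in the merged code,
    but treat the exact storage as an implementation detail):

        /* Valid only while the page sits on a pcplist; caches the
         * pageblock migratetype looked up when the page was freed. */
        static inline int get_pcppage_migratetype(struct page *page)
        {
                return page->index;
        }

        static inline void set_pcppage_migratetype(struct page *page,
                                                   int migratetype)
        {
                page->index = migratetype;
        }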

    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Acked-by: Joonsoo Kim
    Cc: Minchan Kim
    Acked-by: Michal Nazarewicz
    Cc: Laura Abbott
    Reviewed-by: Naoya Horiguchi
    Cc: Seungho Park
    Cc: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • __test_page_isolated_in_pageblock() is used to verify whether all
    pages in a pageblock were either successfully isolated or are
    hwpoisoned. Two of the possible page states that are tested are,
    however, bogus and misleading.

    Both tests rely on get_freepage_migratetype(page), which however has no
    guarantees about pages on freelists. Specifically, it doesn't guarantee
    that the migratetype returned by the function actually matches the
    migratetype of the freelist that the page is on. Such guarantee is not
    its purpose and would have negative impact on allocator performance.

    The first test checks whether the freepage_migratetype equals
    MIGRATE_ISOLATE, supposedly to catch races between page isolation and
    allocator activity. These races should be fixed nowadays with
    51bb1a4093 ("mm/page_alloc: add freepage on isolate pageblock to correct
    buddy list") and related patches. As explained above, the check
    wouldn't be able to catch them reliably anyway. For the same reason
    false positives can happen, although they are harmless, as the
    move_freepages() call would just move the page to the same freelist it's
    already on. So removing the test is not a bug fix, just cleanup. After
    this patch, we assume that all PageBuddy pages are on the correct
    freelist and that the races were really fixed. A truly reliable
    verification in the form of e.g. VM_BUG_ON() would be complicated and
    is arguably not needed.

    The second test (page_count(page) == 0 && get_freepage_migratetype(page)
    == MIGRATE_ISOLATE) is probably supposed (the code comes from a big
    memory isolation patch from 2007) to catch pages on MIGRATE_ISOLATE
    pcplists. However, pcplists don't contain MIGRATE_ISOLATE freepages
    nowadays, those are freed directly to free lists, so the check is
    obsolete. Remove it as well.
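
    A simplified sketch of the check that remains after removing both
    tests (shape inferred from the description above; pfn, end_pfn, page
    and skip_hwpoisoned_pages are assumed to be in scope):

        while (pfn < end_pfn) {
                page = pfn_to_page(pfn);
                if (PageBuddy(page))
                        /* A free page is assumed to sit on the correct
                         * (MIGRATE_ISOLATE) freelist; skip the buddy. */
                        pfn += 1 << page_order(page);
                else if (skip_hwpoisoned_pages && PageHWPoison(page))
                        /* hwpoisoned pages are acceptable */
                        pfn++;
                else
                        break;  /* not isolated */
        }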

    Signed-off-by: Vlastimil Babka
    Acked-by: Joonsoo Kim
    Cc: Minchan Kim
    Acked-by: Michal Nazarewicz
    Cc: Laura Abbott
    Reviewed-by: Naoya Horiguchi
    Cc: Seungho Park
    Cc: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Acked-by: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The force_kill member of struct oom_control isn't needed if an order of -1
    is used instead. This is the same as order == -1 in struct
    compact_control which requires full memory compaction.

    This patch introduces no functional change.

    Signed-off-by: David Rientjes
    Cc: Sergey Senozhatsky
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There are essential elements to an oom context that are passed around to
    multiple functions.

    Organize these elements into a new struct, struct oom_control, that
    specifies the context for an oom condition.

    This patch introduces no functional change.
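
    A hedged sketch of the context struct being introduced (field set
    inferred from the description and the surrounding patches; the merged
    layout may differ):

        struct oom_control {
                struct zonelist *zonelist;  /* allocation context */
                nodemask_t      *nodemask;  /* memory policy constraint */
                gfp_t           gfp_mask;   /* gfp mask of the failed allocation */
                int             order;      /* order of the failed allocation;
                                             * -1 requests a "force kill" per
                                             * the previous entry */
        };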

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit febd5949e134 ("mm/memory hotplug: init the zone's size when
    calculating node totalpages") refines the function
    free_area_init_core().

    After doing so, these two parameters are not used anymore.

    This patch removes these two parameters.

    Signed-off-by: Wei Yang
    Cc: Gu Zheng
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • nr_node_ids records the highest possible node id, which is calculated
    by scanning the bitmap node_states[N_POSSIBLE]. The current
    implementation scans the bitmap from the beginning, which walks the
    whole bitmap.

    This patch reverses the order by scanning from the end with
    find_last_bit().
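
    A sketch of the reversed scan (close to the actual change, shown here
    only to illustrate the idea):

        static void __init setup_nr_node_ids(void)
        {
                unsigned int highest;

                /* find_last_bit() returns the index of the highest set
                 * bit, or the bitmap size if no bit is set. */
                highest = find_last_bit(node_possible_map.bits, MAX_NUMNODES);
                nr_node_ids = highest + 1;
        }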

    Signed-off-by: Wei Yang
    Cc: Tejun Heo
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Pull libnvdimm updates from Dan Williams:
    "This update has successfully completed a 0day-kbuild run and has
    appeared in a linux-next release. The changes outside of the typical
    drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
    removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
    the introduction of ZONE_DEVICE + devm_memremap_pages().

    Summary:

    - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.

    This facility is used by the pmem driver to enable pfn_to_page()
    operations on the page frames returned by DAX ('direct_access' in
    'struct block_device_operations').

    For now, the 'memmap' allocation for these "device" pages comes
    from "System RAM". Support for allocating the memmap from device
    memory will arrive in a later kernel.

    - Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt(). memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects. The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.

    Completion of the conversion is targeted for v4.4.

    - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.

    - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.

    - Miscellaneous updates and fixes to libnvdimm including support for
    issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes"

    * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
    libnvdimm, pmem: direct map legacy pmem by default
    libnvdimm, pmem: 'struct page' for pmem
    libnvdimm, pfn: 'struct page' provider infrastructure
    x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
    add devm_memremap_pages
    mm: ZONE_DEVICE for "device memory"
    mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
    dax: drop size parameter to ->direct_access()
    nd_blk: change aperture mapping from WC to WB
    nvdimm: change to use generic kvfree()
    pmem, dax: have direct_access use __pmem annotation
    dax: update I/O path to do proper PMEM flushing
    pmem: add copy_from_iter_pmem() and clear_pmem()
    pmem, x86: clean up conditional pmem includes
    pmem: remove layer when calling arch_has_wmb_pmem()
    pmem, x86: move x86 PMEM API to new pmem.h header
    libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
    pmem: switch to devm_ allocations
    devres: add devm_memremap
    libnvdimm, btt: write and validate parent_uuid
    ...

    Linus Torvalds
     

28 Aug, 2015

1 commit

  • While pmem is usable as a block device or via DAX mappings to userspace
    there are several usage scenarios that can not target pmem due to its
    lack of struct page coverage. In preparation for "hot plugging" pmem
    into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
    separately from the ones that are subject to standard page allocations.
    Importantly "device memory" can be removed at will by userspace
    unbinding the driver of the device.

    Having a separate zone prevents allocation and otherwise marks these
    pages that are distinct from typical uniform memory. Device memory has
    different lifetime and performance characteristics than RAM. However,
    since we have run out of ZONES_SHIFT bits this functionality currently
    depends on sacrificing ZONE_DMA.
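
    A hedged usage sketch of the facility this zone enables (signature as
    introduced by this series; dev, res and addr are placeholders for a
    driver's device and its discovered pmem range):

        /* Map a device memory range into the kernel's direct map and
         * create struct pages (placed in ZONE_DEVICE) for it. */
        addr = devm_memremap_pages(dev, res);
        if (IS_ERR(addr))
                return PTR_ERR(addr);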

    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jerome Glisse
    [hch: various simplifications in the arch interface]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

22 Aug, 2015

1 commit

  • Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added
    checks for page->pfmemalloc to __skb_fill_page_desc():

    if (page->pfmemalloc && !page->mapping)
            skb->pfmemalloc = true;

    It assumes that page->mapping == NULL implies that page->pfmemalloc
    can be trusted. However, __delete_from_page_cache() can set
    page->mapping to NULL and leave page->index alone. Because the two
    fields share a union, a non-zero page->index will be interpreted as
    page->pfmemalloc being true.

    So the assumption is invalid if the networking code can see such a
    page, and it seems it can. We have encountered this with an NFS over
    loopback setup when such a page is attached to a new skbuf. There is
    no copying going on in this case, so the page confuses
    __skb_fill_page_desc, which interprets the index as the pfmemalloc
    flag. The network stack then drops packets that have been allocated
    using the reserves unless they are to be queued on sockets handling
    the swapping, which is the case here, and that leads to hangs: the
    NFS client waits for a response from the server which has been
    dropped and thus never arrives.

    The struct page is already heavily packed, so rather than finding
    another hole to put the flag in, let's do a trick instead. We can
    reuse the index field but define it to an impossible value (-1UL);
    this is a page index, so it should never see a value that large.
    Replace all direct users of page->pfmemalloc with
    page_is_pfmemalloc(), which will hide this nastiness from unspoiled
    eyes.

    The information will obviously be lost if somebody wants to use
    page->index, but that was already the case before, and the original
    code expected the information to be persisted somewhere else if it is
    really needed (e.g. what SLAB and SLUB do).
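
    The replacement helper is roughly the following (a sketch of the
    trick described above; the comment wording is ours):

        static inline bool page_is_pfmemalloc(struct page *page)
        {
                /*
                 * A page index can never be this large, so if it is set
                 * the page was allocated from the pfmemalloc reserves.
                 */
                return page->index == -1UL;
        }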

    [akpm@linux-foundation.org: fix blooper in slub]
    Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Debugged-by: Jiri Bohac
    Cc: Eric Dumazet
    Cc: David Miller
    Acked-by: Mel Gorman
    Cc: [3.6+]
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Aug, 2015

1 commit

  • When we add a new node, the edge of memory may be wrong.

    e.g. the system has 4 nodes, and node3 is movable with mem: [24G-32G],

    1. hot-remove node3,
    2. then hot-add node3 with a part of its memory, mem: [26G-30G],
    3. call hotadd_new_pgdat()
           free_area_init_node()
               get_pfn_range_for_nid()
    4. it will return the wrong start_pfn and end_pfn, because we have
       not updated the memblock.

    This patch also fixes a BUG_ON during hot-addition, please see
    http://marc.info/?l=linux-kernel&m=142961156129456&w=2

    Signed-off-by: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Cc: Kamezawa Hiroyuki
    Cc: Taku Izumi
    Cc: Tang Chen
    Cc: Gu Zheng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

07 Aug, 2015

4 commits

  • The race condition addressed in commit add05cecef80 ("mm: soft-offline:
    don't free target page in successful page migration") was not closed
    completely, because that can happen not only for soft-offline, but also
    for hard-offline. Consider that a slab page is about to be freed into
    buddy pool, and then an uncorrected memory error hits the page just
    after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags &
    PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not
    necessary because the data on the affected page is not consumed.

    To solve it, this patch drops __PG_HWPOISON from page flag checks at
    allocation/free time. I think it's justified because the
    __PG_HWPOISON flag is defined to prevent the page from being reused,
    and setting it outside the page's alloc-free cycle is designed
    behavior (not a bug).

    In recent months I was annoyed by a BUG_ON when a soft-offlined page
    remains on the lru cache list for a while; this is avoided by calling
    put_page() instead of putback_lru_page() in page migration's success
    path. This means that this patch reverts a major change from commit
    add05cecef80 about the new refcounting rule of soft-offlined pages,
    so the "reuse window" revives. This will be closed by a subsequent
    patch.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Dave Hansen reported the following;

    My laptop has been behaving strangely with 4.2-rc2. Once I log
    in to my X session, I start getting all kinds of strange errors
    from applications and see this in my dmesg:

    VFS: file-max limit 8192 reached

    The problem is that the file-max is calculated before memory is fully
    initialised and miscalculates how much memory the kernel is using. This
    patch recalculates file-max after deferred memory initialisation. Note
    that using memory hotplug infrastructure would not have avoided this
    problem as the value is not recalculated after memory hot-add.

    4.1: files_stat.max_files = 6582781
    4.2-rc2: files_stat.max_files = 8192
    4.2-rc2 patched: files_stat.max_files = 6562467

    Small differences between the patched kernel and 4.1, but not enough to matter.

    Signed-off-by: Mel Gorman
    Reported-by: Dave Hansen
    Cc: Nicolai Stange
    Cc: Dave Hansen
    Cc: Alex Ng
    Cc: Fengguang Wu
    Cc: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 0e1cc95b4cc7 ("mm: meminit: finish initialisation of struct pages
    before basic setup") introduced a rwsem to signal completion of the
    initialization workers.

    Lockdep complains about possible recursive locking:
    =============================================
    [ INFO: possible recursive locking detected ]
    4.1.0-12802-g1dc51b8 #3 Not tainted
    ---------------------------------------------
    swapper/0/1 is trying to acquire lock:
    (pgdat_init_rwsem){++++.+},
    at: [] page_alloc_init_late+0xc7/0xe6

    but task is already holding lock:
    (pgdat_init_rwsem){++++.+},
    at: [] page_alloc_init_late+0x3e/0xe6

    Replace the rwsem by a completion together with an atomic
    "outstanding work counter".

    [peterz@infradead.org: Barrier removal on the grounds of being pointless]
    [mgorman@suse.de: Applied review feedback]
    Signed-off-by: Nicolai Stange
    Signed-off-by: Mel Gorman
    Acked-by: Peter Zijlstra (Intel)
    Cc: Dave Hansen
    Cc: Alex Ng
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolai Stange
     
  • early_pfn_to_nid() was historically not SMP-safe, but it was only used
    during boot, which is inherently single-threaded, or during hotplug,
    which is protected by a giant mutex.

    With deferred memory initialisation a thread-safe version was
    introduced, and early_pfn_to_nid() would trigger a BUG_ON if used
    unsafely. Memory hotplug hit that check. This patch introduces a lock
    in early_pfn_to_nid() to make it safe to use during hotplug.
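
    A simplified sketch of the locking (a spinlock around the shared
    static cache; names assumed to match the merged code, shown only to
    illustrate the idea):

        static DEFINE_SPINLOCK(early_pfn_lock);
        static struct mminit_pfnnid_cache early_pfnnid_cache;

        int __meminit early_pfn_to_nid(unsigned long pfn)
        {
                int nid;

                spin_lock(&early_pfn_lock);
                nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
                if (nid < 0)
                        nid = 0;
                spin_unlock(&early_pfn_lock);

                return nid;
        }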

    Signed-off-by: Mel Gorman
    Reported-by: Alex Ng
    Tested-by: Alex Ng
    Acked-by: Peter Zijlstra (Intel)
    Cc: Nicolai Stange
    Cc: Dave Hansen
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 Jul, 2015

3 commits

  • Currently, we store the wrong gfp_mask in the page_owner info for
    freepages isolated by compaction and for split pages. It causes an
    incorrect mixed-pageblock report in '/proc/pagetypeinfo'. This metric
    is really useful for measuring the effect of fragmentation, so it
    should be accurate. This patch fixes it by setting the correct
    information.

    Without this patch, after kernel build workload is finished, number of
    mixed pageblock is 112 among roughly 210 movable pageblocks.

    But, with this fix, output shows that mixed pageblock is just 57.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When I tested my new patches, I found that the page pointer used to
    set the page_owner information had changed. This is because the page
    pointer is also used to set the new migratetype in a loop, and after
    that loop it can point out of bounds. If this wrong pointer is used
    for page_owner, an access violation happens. Below is the error
    message that I got.

    BUG: unable to handle kernel paging request at 0000000000b00018
    IP: [] save_stack_address+0x30/0x40
    PGD 1af2d067 PUD 166e0067 PMD 0
    Oops: 0002 [#1] SMP
    ...snip...
    Call Trace:
    print_context_stack+0xcf/0x100
    dump_trace+0x15f/0x320
    save_stack_trace+0x2f/0x50
    __set_page_owner+0x46/0x70
    __isolate_free_page+0x1f7/0x210
    split_free_page+0x21/0xb0
    isolate_freepages_block+0x1e2/0x410
    compaction_alloc+0x22d/0x2d0
    migrate_pages+0x289/0x8b0
    compact_zone+0x409/0x880
    compact_zone_order+0x6d/0x90
    try_to_compact_pages+0x110/0x210
    __alloc_pages_direct_compact+0x3d/0xe6
    __alloc_pages_nodemask+0x6cd/0x9a0
    alloc_pages_current+0x91/0x100
    runtest_store+0x296/0xa50
    simple_attr_write+0xbd/0xe0
    __vfs_write+0x28/0xf0
    vfs_write+0xa9/0x1b0
    SyS_write+0x46/0xb0
    system_call_fastpath+0x16/0x75

    This patch fixes the error by moving set_page_owner() up.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The kbuild test robot reported the following

    tree: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
    head: 14a6f1989dae9445d4532941bdd6bbad84f4c8da
    commit: 3b242c66ccbd60cf47ab0e8992119d9617548c23 x86: mm: enable deferred struct page initialisation on x86-64
    date: 3 days ago
    config: x86_64-randconfig-x006-201527 (attached as .config)
    reproduce:
    git checkout 3b242c66ccbd60cf47ab0e8992119d9617548c23
    # save the attached .config to linux build tree
    make ARCH=x86_64

    All warnings (new ones prefixed by >>):

    mm/page_alloc.c: In function 'early_page_uninitialised':
    >> mm/page_alloc.c:247:6: warning: unused variable 'nid' [-Wunused-variable]
    int nid = early_pfn_to_nid(pfn);

    It's due to the NODE_DATA macro ignoring the nid parameter on !NUMA
    configurations. This patch avoids the warning by not declaring nid.

    Signed-off-by: Mel Gorman
    Reported-by: Wu Fengguang
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Jul, 2015

12 commits

  • Waiman Long reported that 24TB machines hit OOM during basic setup when
    struct page initialisation was deferred. One approach is to initialise
    memory on demand but it interferes with page allocator paths. This patch
    creates dedicated threads to initialise memory before basic setup. It
    then blocks on a rw_semaphore until completion as a wait_queue and counter
    is overkill. This may be slower to boot but it's simpler overall and
    also gets rid of a section mangling which existed so kswapd could do the
    initialisation.

    [akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast]
    Signed-off-by: Mel Gorman
    Cc: Waiman Long
    Cc: Dave Hansen
    Cc: Scott Norton
    Tested-by: Daniel J Blueman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • mminit_verify_page_links() is an extremely paranoid check that was
    introduced when memory initialisation was being heavily reworked.
    Profiles indicated that up to 10% of parallel memory initialisation was
    spent on checking this for every page. The cost could be reduced but in
    practice this check only found problems very early during the
    initialisation rewrite and has found nothing since. This patch removes an
    expensive unnecessary check.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During parallel struct page initialisation, ranges are checked for
    every PFN unnecessarily, which increases boot times. This patch alters
    when the ranges are checked.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Parallel struct page initialisation frees pages one at a time. Try to
    free pages as single large pages where possible.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Deferred struct page initialisation is using pfn_to_page() on every PFN
    unnecessarily. This patch minimises the number of lookups and scheduler
    checks.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Only a subset of struct pages are initialised at the moment. When this
    patch is applied, kswapd initialises the remaining struct pages in
    parallel.

    This should boot faster by spreading the work to multiple CPUs and
    initialising data that is local to the CPU. The user-visible effect on
    large machines is that free memory will appear to rapidly increase early
    in the lifetime of the system until kswapd reports that all memory is
    initialised in the kernel log. Once initialised there should be no other
    user-visible effects.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch initialises all low memory struct pages and 2G of the
    highest zone on each node during memory initialisation if
    CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be
    set yet; it will become available in a later patch. Parallel
    initialisation of struct pages depends on some features from memory
    hotplug, and it is necessary to alter section annotations.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that
    are unnecessarily visible outside memory initialisation. Besides the
    unnecessary visibility, they add function call overhead when
    initialising pages. This patch moves the helpers inline.

    [akpm@linux-foundation.org: fix build]
    [mhocko@suse.cz: fix build]
    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __early_pfn_to_nid() uses static variables to cache recent lookups, as
    memblock lookups are very expensive, but it assumes that memory
    initialisation is single-threaded. Parallel initialisation of struct
    pages will break that assumption so this patch makes __early_pfn_to_nid()
    SMP-safe by requiring the caller to cache recent search information.
    early_pfn_to_nid() keeps the same interface but is only safe to use early
    in boot due to the use of a global static variable. meminit_pfn_in_nid()
    is an SMP-safe version that callers must maintain their own state for.
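
    A sketch of the caller-maintained state and the resulting interface
    (field names assumed from the merged code; the idea is simply a
    per-caller last-hit cache):

        struct mminit_pfnnid_cache {
                unsigned long last_start;
                unsigned long last_end;
                int last_nid;
        };

        /* Callers pass their own cache so concurrent lookups never
         * share, and race on, a global static. */
        int __meminit __early_pfn_to_nid(unsigned long pfn,
                                         struct mminit_pfnnid_cache *state);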

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __free_pages_bootmem prepares a page for release to the buddy allocator
    and assumes that the struct page is initialised. Parallel initialisation
    of struct pages defers initialisation and __free_pages_bootmem can be
    called for struct pages that cannot yet map struct page to PFN. This
    patch passes PFN to __free_pages_bootmem with no other functional change.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently each page struct is set as reserved upon initialization. This
    patch leaves the reserved bit clear and only sets the reserved bit when it
    is known the memory was allocated by the bootmem allocator. This makes it
    easier to distinguish between uninitialised struct pages and reserved
    struct pages in later patches.

    Signed-off-by: Robin Holt
    Signed-off-by: Nathan Zimmer
    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer
     
  • Currently, memmap_init_zone() has all the smarts for initializing a single
    page. A subset of this is required for parallel page initialisation and
    so this patch breaks up the monolithic function in preparation.

    Signed-off-by: Robin Holt
    Signed-off-by: Nathan Zimmer
    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

25 Jun, 2015

5 commits

  • Merge first patchbomb from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - kernel/watchdog.c feature work (took ages to get right)

    - most of MM. A few tricky bits are held up and probably won't make 4.2.

    * emailed patches from Andrew Morton : (91 commits)
    mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()
    mm, thp: respect MPOL_PREFERRED policy with non-local node
    tmpfs: truncate prealloc blocks past i_size
    mm/memory hotplug: print the last vmemmap region at the end of hot add memory
    mm/mmap.c: optimization of do_mmap_pgoff function
    mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan
    mm: kmemleak: avoid deadlock on the kmemleak object insertion error path
    mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup()
    mm: kmemleak: fix delete_object_*() race when called on the same memory block
    mm: kmemleak: allow safe memory scanning during kmemleak disabling
    memcg: convert mem_cgroup->under_oom from atomic_t to int
    memcg: remove unused mem_cgroup->oom_wakeups
    frontswap: allow multiple backends
    x86, mirror: x86 enabling - find mirrored memory ranges
    mm/memblock: allocate boot time data structures from mirrored memory
    mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
    mm: do not ignore mapping_gfp_mask in page cache allocation paths
    mm/cma.c: fix typos in comments
    mm/oom_kill.c: print points as unsigned int
    mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
    ...

    Linus Torvalds
     
  • The should_alloc_retry() function was meant to encapsulate retry
    conditions of the allocator slowpath, but there are still checks
    remaining in the main function, and much of how the retrying is
    performed also depends on the OOM killer progress. The physical
    separation of those conditions makes the code hard to follow.

    Inline the should_alloc_retry() checks. Notes:

    - The __GFP_NOFAIL check is already done in __alloc_pages_may_oom(),
    replace it with looping on OOM killer progress

    - The pm_suspended_storage() check is meant to skip the OOM killer
    when reclaim has no IO available, move to __alloc_pages_may_oom()

    - The order
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The zonelist locking and the oom_sem are two overlapping locks that are
    used to serialize global OOM killing against different things.

    The historical zonelist locking serializes OOM kills from allocations with
    overlapping zonelists against each other to prevent killing more tasks
    than necessary in the same memory domain. Only when neither tasklists nor
    zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
    bound to separate nodes) are OOM kills allowed to execute in parallel.

    The younger oom_sem is a read-write lock to serialize OOM killing against
    the PM code trying to disable the OOM killer altogether.

    However, the OOM killer is a fairly cold error path; there is really no
    reason to optimize for highly performant and concurrent OOM kills. And
    the oom_sem is just flat-out redundant.

    Replace both locking schemes with a single global mutex serializing OOM
    kills regardless of context.
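
    The result is a single lock; the name below matches the merged code,
    but the caller shape is a hypothetical sketch of the trylock pattern:

        DEFINE_MUTEX(oom_lock);

        /* Whoever cannot take the lock assumes another OOM kill is
         * already making progress and backs off. */
        static bool try_global_oom_kill(void)
        {
                if (!mutex_trylock(&oom_lock))
                        return false;

                /* ... pick and kill a victim ... */

                mutex_unlock(&oom_lock);
                return true;
        }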

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Init the zone's size when calculating node totalpages to avoid duplicated
    operations in free_area_init_core().

    Signed-off-by: Gu Zheng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gu Zheng
     
  • It's been five years now since the KM_* kmap flags were removed and
    clear_highpage() became callable from any context. So we remove
    prep_zero_page() accordingly.

    Signed-off-by: Anisse Astier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anisse Astier