23 May, 2018

1 commit

  • commit ab1e8d8960b68f54af42b6484b5950bd13a4054b upstream.

    It is unsafe to do virtual to physical translations before mm_init() is
    called if struct page is needed in order to determine the memory section
    number (see SECTION_IN_PAGE_FLAGS). This is because only in mm_init()
    do we initialize struct pages for all the allocated memory when deferred
    struct pages are used.

    My recent fix in commit c9e97a1997 ("mm: initialize pages on demand
    during boot") exposed this problem, because it greatly reduced the
    number of pages that are initialized before mm_init(), but the problem
    existed even before my fix, as Fengguang Wu found.

    Below is a more detailed explanation of the problem.

    We initialize struct pages in four places:

    1. Early in boot, a small set of struct pages is initialized to fill the
    first section and the lower zones.

    2. During mm_init(), we initialize "struct pages" for all the memory
    that is allocated, i.e. reserved in memblock.

    3. Using on-demand logic when pages are allocated after the mm_init()
    call (when memblock is finished).

    4. After smp_init(), when the rest of the free deferred pages are
    initialized.

    The problem occurs if we try to do a va to phys translation of memory
    between steps 1 and 2. Because we have not yet initialized struct pages
    for all the reserved pages, it is inherently unsafe to do va to phys if
    the translation itself requires access to "struct page", as is the case
    with this combination: CONFIG_SPARSEMEM && !CONFIG_SPARSEMEM_VMEMMAP
    (the sketch after this entry shows why).

    The following path exposes the problem:

    start_kernel()
      trap_init()
        setup_cpu_entry_areas()
          setup_cpu_entry_area(cpu)
            get_cpu_gdt_paddr(cpu)
              per_cpu_ptr_to_phys(addr)
                pcpu_addr_to_page(addr)
                  virt_to_page(addr)
                    pfn_to_page(__pa(addr) >> PAGE_SHIFT)

    We disable this path by not allowing NEED_PER_CPU_KM together with the
    deferred struct pages feature.

    The problems are discussed in these threads:
    http://lkml.kernel.org/r/20180418135300.inazvpxjxowogyge@wfg-t540p.sh.intel.com
    http://lkml.kernel.org/r/20180419013128.iurzouiqxvcnpbvz@wfg-t540p.sh.intel.com
    http://lkml.kernel.org/r/20180426202619.2768-1-pasha.tatashin@oracle.com

    Link: http://lkml.kernel.org/r/20180515175124.1770-1-pasha.tatashin@oracle.com
    Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
    Signed-off-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Steven Sistare
    Cc: Daniel Jordan
    Cc: Mel Gorman
    Cc: Fengguang Wu
    Cc: Dennis Zhou
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
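
    A minimal sketch of why the translation above needs an initialized
    struct page under classic sparsemem (SPARSEMEM without VMEMMAP); this
    paraphrases the generic page_to_pfn() path rather than quoting it:

    #include <linux/mm.h>

    /*
     * With CONFIG_SPARSEMEM && !CONFIG_SPARSEMEM_VMEMMAP, going from a
     * struct page back to a pfn needs the section number, which lives in
     * page->flags, so the struct page itself must already be initialized.
     */
    static unsigned long sketch_page_to_pfn(const struct page *page)
    {
            unsigned long section_nr = page_to_section(page); /* reads page->flags */
            struct mem_section *section = __nr_to_section(section_nr);

            return page - __section_mem_map_addr(section);
    }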
     

09 Sep, 2017

6 commits

  • This moves all the new code, including the new page migration helper,
    behind a kernel Kconfig option so that there is no code bloat for
    arches or users that do not want to use HMM or any of its associated
    features; a sketch of the resulting stub pattern follows this entry.

    arm allyesconfig (first without, then with this patch):
        text     data      bss       dec      hex  filename
    83721896 46511131 27582964 157815991  96814b7  ../without/vmlinux
    83722364 46511131 27582964 157816459  968168b  vmlinux

    [jglisse@redhat.com: struct hmm is only used by HMM mirror functionality]
    Link: http://lkml.kernel.org/r/20170825213133.27286-1-jglisse@redhat.com
    [sfr@canb.auug.org.au: fix build (arm multi_v7_defconfig)]
    Link: http://lkml.kernel.org/r/20170828181849.323ab81b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170818032858.7447-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Stephen Rothwell
    Cc: Dan Williams
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
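
    A minimal sketch of the Kconfig-gated stub pattern this entry
    describes; the option and function names are illustrative placeholders,
    not the actual HMM interface:

    #include <linux/errno.h>
    #include <linux/kconfig.h>

    struct mm_struct;

    #if IS_ENABLED(CONFIG_EXAMPLE_HMM)
    int example_hmm_register(struct mm_struct *mm);
    #else
    /* compiled out: callers still build, but the feature reports itself absent */
    static inline int example_hmm_register(struct mm_struct *mm)
    {
            return -ENODEV;
    }
    #endif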
     
  • Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessible from the CPU in a cache-coherent fashion. Add a
    new type of ZONE_DEVICE to represent such memory. The use cases are the
    same as for the un-addressable device memory, but without all the
    corner cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • HMM (heterogeneous memory management) needs struct pages to support
    migration from system main memory to device memory. The reasons for
    HMM and for migration to device memory are explained in the HMM core
    patch.

    This patch deals with device memory that is un-addressable (i.e. the
    CPU cannot access it). Hence we do not want those struct pages to be
    managed like regular memory. That is why we extend ZONE_DEVICE to
    support different types of memory.

    A persistent memory type is defined for existing users of ZONE_DEVICE,
    and a new device un-addressable type is added for the un-addressable
    memory. There is a clear separation between what is expected from each
    memory type, and existing users of ZONE_DEVICE are unaffected by the
    new requirements and the new use of the un-addressable type. All
    type-specific code paths are protected with tests against the memory
    type.

    Because the memory is un-addressable, we use a new special swap type
    for when a page is migrated to device memory (this reduces the maximum
    number of swap files).

    Besides the memory type, the main additions to ZONE_DEVICE are two
    callbacks, sketched after this entry. The first one, page_free(), is
    called whenever the page refcount reaches 1 (which means the page is
    free, as a ZONE_DEVICE page never reaches a refcount of 0). This
    allows the device driver to manage its memory and the associated
    struct pages.

    The second callback, page_fault(), happens when there is a CPU access
    to an address that is backed by a device page (which is un-addressable
    by the CPU). This callback is responsible for migrating the page back
    to system main memory. The device driver cannot block migration back
    to system memory; HMM makes sure that such pages cannot be pinned in
    device memory.

    If the device is in some error condition and cannot migrate the memory
    back, then a CPU page fault to device memory should end with SIGBUS.

    [arnd@arndb.de: fix warning]
    Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Arnd Bergmann
    Acked-by: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
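
    A hedged sketch of the two callbacks described above; the struct and
    field names are illustrative placeholders, not the interface added by
    this patch:

    struct page;
    struct vm_area_struct;

    /*
     * Illustrative driver-supplied operations for un-addressable device
     * memory: a free notification once the refcount drops to 1, and a
     * fault hook that must migrate the page back to system memory.
     */
    struct example_devmem_ops {
            void (*page_free)(struct page *page, void *owner_data);
            int  (*page_fault)(struct vm_area_struct *vma, unsigned long addr,
                               struct page *page, unsigned int flags);
    };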
     
  • This adds heterogeneous memory management (HMM) process address space
    mirroring. In a nutshell, it provides an API to mirror a process
    address space on a device. This boils down to keeping the CPU and
    device page tables synchronized (we assume that both the device and
    the CPU are cache coherent, as PCIe devices can be). A rough sketch of
    the callback shape follows this entry.

    This patch provides a simple API for device drivers to achieve address
    space mirroring, thus avoiding each device driver growing its own CPU
    page table walker and its own CPU page table synchronization mechanism.

    This is useful for NVidia GPUs >= Pascal, Mellanox IB >= mlx5, and
    more hardware in the future.

    [jglisse@redhat.com: fix hmm for "mmu_notifier kill invalidate_page callback"]
    Link: http://lkml.kernel.org/r/20170830231955.GD9445@redhat.com
    Link: http://lkml.kernel.org/r/20170817000548.32038-4-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Evgeny Baskakov
    Signed-off-by: John Hubbard
    Signed-off-by: Mark Hairgrove
    Signed-off-by: Sherry Cheung
    Signed-off-by: Subhash Gutti
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
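
    A rough sketch of what address-space mirroring asks of a driver; the
    names here are hypothetical, not the HMM API. The core invokes the
    driver callback from the CPU page table invalidation path so the
    device copy can be kept in sync:

    struct example_mirror;

    /*
     * Hypothetical mirror operations: whenever the CPU page table changes
     * for [start, end), the device page table must drop or refresh its
     * copy of that range before the CPU-side change completes.
     */
    struct example_mirror_ops {
            void (*invalidate_range)(struct example_mirror *mirror,
                                     unsigned long start, unsigned long end);
    };

    struct example_mirror {
            const struct example_mirror_ops *ops;
            /* driver-private device page table state hangs off here */
    };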
     
  • HMM provides 3 separate types of functionality:
    - Mirroring: synchronize CPU page table and device page table
    - Device memory: allocating struct page for device memory
    - Migration: migrating regular memory to device memory

    This patch introduces some common helpers and definitions shared by
    all three of those functionalities.

    Link: http://lkml.kernel.org/r/20170817000548.32038-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Evgeny Baskakov
    Signed-off-by: John Hubbard
    Signed-off-by: Mark Hairgrove
    Signed-off-by: Sherry Cheung
    Signed-off-by: Subhash Gutti
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
    functionality to x86_64, which should be safer as a first step; a
    sketch of how generic code can key off the option follows this entry.

    Link: http://lkml.kernel.org/r/20170717193955.20207-5-zi.yan@sent.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Reviewed-by: Anshuman Khandual
    Cc: "H. Peter Anvin"
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
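
    A minimal sketch of how generic code can key off the new option; the
    helper name is illustrative and assumes the option is simply queried
    via IS_ENABLED():

    #include <linux/kconfig.h>
    #include <linux/types.h>

    static inline bool example_thp_migration_supported(void)
    {
            /* true only on architectures that select the option (x86_64) */
            return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
    }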
     

07 Sep, 2017

1 commit

  • devm_memremap_pages() records mapped ranges in pgmap_radix with an entry
    per section's worth of memory (128MB). The key for each of those
    entries is a section number.

    This leads to false positives when devm_memremap_pages() is passed a
    section-unaligned range, as lookups in the misaligned portion fail to
    return NULL. We can close this hole by using the pfn as the key for
    entries in the tree. The number of entries required to describe a
    remapped range is reduced by leveraging multi-order entries.

    In practice this approach usually yields just one entry in the tree if
    the size and starting address share the same power-of-2 alignment.
    Previously we always needed nr_entries = mapping_size / 128MB; a worked
    example follows this entry.

    Link: https://lists.01.org/pipermail/linux-nvdimm/2016-August/006666.html
    Link: http://lkml.kernel.org/r/150215410565.39310.13767886055248249438.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Toshi Kani
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
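
    A back-of-the-envelope example of the entry-count reduction; the
    mapping size below is made up for illustration:

    /* One radix entry per 128MB section under the old keying scheme.      */
    static unsigned long long example_old_nr_entries(unsigned long long size)
    {
            return size / (128ULL << 20);
    }
    /* example_old_nr_entries(512ULL << 30) == 4096 entries for a 512GB
     * range; the pfn-keyed scheme needs a single multi-order entry for the
     * same range when its start and size share that power-of-2 alignment. */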
     

11 Jul, 2017

1 commit

  • KASAN doesn't work with memory hotplug, because hotplugged memory
    doesn't have any shadow memory. So any access to hotplugged memory
    would cause a crash on the shadow check.

    Use a memory hotplug notifier to allocate and map shadow memory when
    the hotplugged memory is going online, and free the shadow after the
    memory is offlined; a minimal sketch of that notifier shape follows
    this entry.

    Link: http://lkml.kernel.org/r/20170601162338.23540-4-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Cc: "H. Peter Anvin"
    Cc: Alexander Potapenko
    Cc: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
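
    A minimal sketch of the notifier approach described above, not the
    actual kasan implementation; the function name is illustrative:

    #include <linux/memory.h>
    #include <linux/memory_hotplug.h>
    #include <linux/notifier.h>

    static int example_kasan_mem_notifier(struct notifier_block *nb,
                                          unsigned long action, void *data)
    {
            /* data is a struct memory_notify with start_pfn and nr_pages */
            switch (action) {
            case MEM_GOING_ONLINE:
                    /* allocate and map shadow covering the incoming range */
                    break;
            case MEM_OFFLINE:
            case MEM_CANCEL_ONLINE:
                    /* unmap and free the shadow for that range again */
                    break;
            }
            return NOTIFY_OK;
    }
    /* registered early, e.g. via hotplug_memory_notifier(), so it runs
     * before the new memory is handed to the page allocator. */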
     

08 Jul, 2017

1 commit

  • Pull powerpc updates from Michael Ellerman:
    "Highlights include:

    - Support for STRICT_KERNEL_RWX on 64-bit server CPUs.

    - Platform support for FSP2 (476fpe) board

    - Enable ZONE_DEVICE on 64-bit server CPUs.

    - Generic & powerpc spin loop primitives to optimise busy waiting

    - Convert VDSO update function to use new update_vsyscall() interface

    - Optimisations to hypercall/syscall/context-switch paths

    - Improvements to the CPU idle code on Power8 and Power9.

    As well as many other fixes and improvements.

    Thanks to: Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman
    Khandual, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt,
    Christophe Leroy, Christophe Lombard, Colin Ian King, Dan Carpenter,
    Gautham R. Shenoy, Hari Bathini, Ian Munsie, Ivan Mikhaylov, Javier
    Martinez Canillas, Madhavan Srinivasan, Masahiro Yamada, Matt Brown,
    Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Naveen N.
    Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pavel Machek,
    Russell Currey, Santosh Sivaraj, Stephen Rothwell, Thiago Jung
    Bauermann, Yang Li"

    * tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits)
    powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs
    powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix
    powerpc/mm/hash: Implement mark_rodata_ro() for hash
    powerpc/vmlinux.lds: Align __init_begin to 16M
    powerpc/lib/code-patching: Use alternate map for patch_instruction()
    powerpc/xmon: Add patch_instruction() support for xmon
    powerpc/kprobes/optprobes: Use patch_instruction()
    powerpc/kprobes: Move kprobes over to patch_instruction()
    powerpc/mm/radix: Fix execute permissions for interrupt_vectors
    powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp()
    powerpc/64s: Blacklist rtas entry/exit from kprobes
    powerpc/64s: Blacklist functions invoked on a trap
    powerpc/64s: Un-blacklist system_call() from kprobes
    powerpc/64s: Move system_call() symbol to just after setting MSR_EE
    powerpc/64s: Blacklist system_call() and system_call_common() from kprobes
    powerpc/64s: Convert .L__replay_interrupt_return to a local label
    powerpc64/elfv1: Only dereference function descriptor for non-text symbols
    cxl: Export library to support IBM XSL
    powerpc/dts: Use #include "..." to include local DT
    powerpc/perf/hv-24x7: Aggregate result elements on POWER9 SMT8
    ...

    Linus Torvalds
     

07 Jul, 2017

2 commits

  • Commit 20b2f52b73fe ("numa: add CONFIG_MOVABLE_NODE for
    movable-dedicated node") introduced CONFIG_MOVABLE_NODE without a
    good explanation of why it is actually useful.

    It makes a lot of sense to make the movable node semantic opt-in, but
    we already have that because the feature has to be explicitly enabled
    on the kernel command line. A config option on top only makes the
    configuration space larger without a good reason. It also adds
    additional ifdefery that pollutes the code.

    Just drop the config option and make it de-facto always enabled. This
    shouldn't introduce any change to the semantic.

    Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Reza Arbab
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Jerome Glisse
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Chen Yucong
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "THP swap: Delay splitting THP during swapping out", v11.

    This patchset is to optimize the performance of Transparent Huge Page
    (THP) swap.

    Recently, the performance of storage devices has improved so fast that
    we cannot saturate the disk bandwidth with a single logical CPU when
    doing page swap-out, even on a high-end server machine, because the
    performance of the storage device has improved faster than that of a
    single logical CPU. And it seems that the trend will not change in the
    near future. On the other hand, THP is becoming more and more popular
    because of increased memory sizes. So it becomes necessary to optimize
    THP swap performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. It is
    particularly helpful for swap reads, which are usually 4k random IO.
    This will improve the performance of the THP swap too.

    - It will help with memory fragmentation, especially when the THP is
    heavily used by applications. The 2M of contiguous pages will be
    freed up after the THP is swapped out.

    - It will improve THP utilization on systems with swap turned on,
    because the speed at which khugepaged collapses normal pages into a
    THP is quite slow. After the THP is split during swap-out, it will
    take quite a long time for the normal pages to collapse back into a
    THP after being swapped in. High THP utilization helps the efficiency
    of page-based memory management too.

    There are some concerns regarding THP swap-in, mainly because the
    possibly enlarged read/write IO size (for swap in/out) may put more
    overhead on the storage device. To deal with that, THP swap-in should
    be turned on only when necessary. For example, it can be selected via
    "always/never/madvise" logic, to be turned on globally, turned off
    globally, or turned on only for VMAs with MADV_HUGEPAGE, etc.

    This patchset is the first step of the THP swap support. The plan is
    to delay splitting the THP step by step, finally avoiding splitting
    the THP during swap-out and swapping the THP out/in as a whole.

    As the first step, in this patchset, splitting the huge page is
    delayed from almost the first step of swapping out to after allocating
    the swap space for the THP and adding the THP into the swap cache.
    This will reduce lock acquiring/releasing for the locks used for swap
    cache management.

    With the patchset, the swap out throughput improves 15.5% (from about
    3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    This patch (of 5):

    In this patch, splitting the huge page is delayed from almost the
    first step of swapping out to after allocating the swap space for the
    THP (Transparent Huge Page) and adding the THP into the swap cache.
    This will batch the corresponding operations, thus improving THP
    swap-out throughput.

    This is the first step of the THP swap optimization. The plan is to
    delay splitting the THP step by step and finally avoid splitting the
    THP at all.

    In this patch, one swap cluster is used to hold the contents of each
    THP swapped out. So the size of the swap cluster is changed to that of
    the THP (Transparent Huge Page) on the x86_64 architecture (512 pages;
    see the sizing example after this entry). For other architectures that
    want this THP swap optimization, ARCH_USES_THP_SWAP_CLUSTER needs to
    be selected in the Kconfig file for the architecture. In effect, this
    enlarges the swap cluster size by 2 times on x86_64, which may make it
    harder to find a free cluster when the swap space becomes fragmented,
    so this may reduce continuous swap space allocation and sequential
    writes in theory. The performance tests in 0day show no regressions
    caused by this.

    In the future of the THP swap optimization, some information about the
    swapped-out THP (such as the compound map count) will be recorded in
    the swap_cluster_info data structure.

    The mem cgroup swap accounting functions are enhanced to support
    charging or uncharging a swap cluster backing a THP as a whole.

    Swap cluster allocate/free functions are added to allocate/free a swap
    cluster for a THP. A fairly simple algorithm is used for swap cluster
    allocation: only the first swap device in the priority list is tried
    for allocating the swap cluster. The function fails if the attempt is
    not successful, and the caller falls back to allocating a single swap
    slot instead. This works well enough for normal cases. If the
    difference in the number of free swap clusters among multiple swap
    devices is significant, it is possible that some THPs are split
    earlier than necessary. For example, this could be caused by a big
    size difference among multiple swap devices.

    The swap cache functions are enhanced to support adding/deleting a THP
    to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may
    be enhanced in the future with a multi-order radix tree. But because
    we will split the THP soon during swapping out, that optimization
    doesn't make much sense for this first step.

    The THP splitting functions are enhanced to support splitting a THP in
    the swap cache during swapping out. The page lock is held while
    allocating the swap cluster, adding the THP into the swap cache and
    splitting the THP. So in code paths other than swapping out, if the
    THP needs to be split, PageSwapCache(THP) will always be false.

    The swap cluster is only available for SSDs, so the THP swap
    optimization in this patchset has no effect for HDDs.

    [ying.huang@intel.com: fix two issues in THP optimize patch]
    Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
    [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
    Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Johannes Weiner
    Suggested-by: Andrew Morton [for config option]
    Acked-by: Kirill A. Shutemov [for changes in huge_memory.c and huge_mm.h]
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
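
    A quick sizing sketch for the cluster change mentioned above (x86_64
    with 4kB base pages and 2MB THPs; the macro names are illustrative):

    #define EXAMPLE_PAGE_SHIFT   12     /* 4kB base pages                  */
    #define EXAMPLE_PMD_SHIFT    21     /* 2MB THP                         */
    /* Pages per THP == swap slots a cluster must hold to cover one THP:   */
    #define EXAMPLE_HPAGE_PMD_NR (1UL << (EXAMPLE_PMD_SHIFT - EXAMPLE_PAGE_SHIFT))
    /* EXAMPLE_HPAGE_PMD_NR == 512, i.e. each swap cluster grows to 2MB.   */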
     

06 Jul, 2017

1 commit

  • Pull percpu updates from Tejun Heo:
    "These are the percpu changes for the v4.13-rc1 merge window. There are
    a couple visibility related changes - tracepoints and allocator stats
    through debugfs, along with __ro_after_init markings and a cosmetic
    rename in percpu_counter.

    Please note that the simple O(#elements_in_the_chunk) area allocator
    used by percpu allocator is again showing scalability issues,
    primarily with bpf allocating and freeing large number of counters.
    Dennis is working on the replacement allocator and the percpu
    allocator will be seeing increased churns in the coming cycles"

    * 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: fix static checker warnings in pcpu_destroy_chunk
    percpu: fix early calls for spinlock in pcpu_stats
    percpu: resolve err may not be initialized in pcpu_alloc
    percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
    percpu: add tracepoint support for percpu memory
    percpu: expose statistics about percpu memory via debugfs
    percpu: migrate percpu data structures to internal header
    percpu: add missing lockdep_assert_held to func pcpu_free_area
    mark most percpu globals as __ro_after_init

    Linus Torvalds
     

02 Jul, 2017

1 commit

  • Currently ZONE_DEVICE depends on X86_64, and this will get unwieldy as
    new architectures (and platforms) get ZONE_DEVICE support. Move to an
    arch-selected Kconfig option to save us the trouble.

    Cc: linux-mm@kvack.org
    Acked-by: Ingo Molnar
    Acked-by: Balbir Singh
    Signed-off-by: Oliver O'Halloran
    Signed-off-by: Michael Ellerman

    Oliver O'Halloran
     

21 Jun, 2017

1 commit

  • There is limited visibility into the use of percpu memory, leaving us
    unable to reason about the correctness of parameters and the overall
    use of percpu memory. These counters and statistics aim to help
    understand basic facts about percpu memory, such as the number of
    allocations over the lifetime, allocation sizes, and fragmentation.

    New Config: PERCPU_STATS

    Signed-off-by: Dennis Zhou
    Signed-off-by: Tejun Heo

    Dennis Zhou
     

13 Jun, 2017

1 commit

  • This patch provides all the callbacks required by the generic
    get_user_pages_fast() code and switches x86 over to it, and removes
    the platform-specific implementation. A caller-side sketch of the
    interface follows this entry.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20170606113133.22974-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Ingo Molnar

    Kirill A. Shutemov
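
    A caller-side sketch of the interface being unified here; the function
    below is illustrative, and note that at the time of this commit the
    third argument of get_user_pages_fast() was a plain write flag:

    #include <linux/errno.h>
    #include <linux/mm.h>

    static int example_pin_one_page(unsigned long uaddr)
    {
            struct page *page;
            int pinned;

            pinned = get_user_pages_fast(uaddr, 1, 1 /* write */, &page);
            if (pinned != 1)
                    return pinned < 0 ? pinned : -EFAULT;

            /* ... access the pinned page ... */
            put_page(page);
            return 0;
    }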
     

13 Dec, 2016

3 commits

  • Add an arch-specific callback in the generic THP page cache code that
    will deposit and withdraw the preallocated page table. Archs like
    ppc64 use this preallocated table to store the hash PTE slot
    information.

    Testing:
    kernel build of the patch series on tmpfs mounted with option huge=always

    The related thp stat:
    thp_fault_alloc 72939
    thp_fault_fallback 60547
    thp_collapse_alloc 603
    thp_collapse_alloc_failed 0
    thp_file_alloc 253763
    thp_file_mapped 4251
    thp_split_page 51518
    thp_split_page_failed 1
    thp_deferred_split_page 73566
    thp_split_pmd 665
    thp_zero_page_alloc 3
    thp_zero_page_alloc_failed 0

    [akpm@linux-foundation.org: remove unneeded parentheses, per Kirill]
    Link: http://lkml.kernel.org/r/20161113150025.17942-2-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Michael Neuling
    Cc: Paul Mackerras
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • When movable nodes are enabled, any node containing only hotpluggable
    memory is made movable at boot time.

    On x86, hotpluggable memory is discovered by parsing the ACPI SRAT,
    making corresponding calls to memblock_mark_hotplug().

    If we introduce a dt property to describe memory as hotpluggable,
    configs supporting early fdt may then also do this marking and use
    movable nodes; a minimal sketch of the marking follows this entry.

    Link: http://lkml.kernel.org/r/1479160961-25840-5-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Tested-by: Balbir Singh
    Acked-by: Balbir Singh
    Cc: "Aneesh Kumar K.V"
    Cc: "H. Peter Anvin"
    Cc: Alistair Popple
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Bharata B Rao
    Cc: Frank Rowand
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Nathan Fontenot
    Cc: Paul Mackerras
    Cc: Rob Herring
    Cc: Stewart Smith
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
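
    A minimal sketch of the early-boot marking described above, under the
    assumption that the fdt scan has already parsed a base/size pair and a
    hotpluggable flag (the function name is illustrative):

    #include <linux/init.h>
    #include <linux/memblock.h>
    #include <linux/types.h>

    static void __init example_add_dt_memory(phys_addr_t base, phys_addr_t size,
                                             bool hotpluggable)
    {
            memblock_add(base, size);
            if (hotpluggable)
                    memblock_mark_hotplug(base, size);  /* movable-node logic keys off this */
    }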
     
  • To support movable memory nodes (CONFIG_MOVABLE_NODE), at least one of
    the following must be true:

    1. This config has the capability to identify movable nodes at boot.
    Right now, only x86 can do this.

    2. Our config supports memory hotplug, which means that a movable node
    can be created by hotplugging all of its memory into ZONE_MOVABLE.

    Fix the Kconfig definition of CONFIG_MOVABLE_NODE, which currently
    recognizes (1), but not (2).

    Link: http://lkml.kernel.org/r/1479160961-25840-4-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Balbir Singh
    Cc: "Aneesh Kumar K.V"
    Cc: "H. Peter Anvin"
    Cc: Alistair Popple
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Bharata B Rao
    Cc: Frank Rowand
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Nathan Fontenot
    Cc: Paul Mackerras
    Cc: Rob Herring
    Cc: Stewart Smith
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     

28 Oct, 2016

1 commit

  • No, KASAN may not be able to co-exist with HOTPLUG_MEMORY at runtime,
    but for build testing there is no reason not to allow them together.

    This hopefully means better build coverage and fewer embarrassing silly
    problems like the one fixed by commit 9db4f36e82c2 ("mm: remove unused
    variable in memory hotplug") in the future.

    Cc: Stephen Rothwell
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Aug, 2016

1 commit

  • The current wording of the COMPACTION Kconfig help text doesn't
    emphasise that disabling COMPACTION might cripple the page allocator,
    which relies on compaction quite heavily for high-order requests, and
    that an unexpected OOM can happen with the lack of compaction. Make
    sure we are vocal about that.

    Link: http://lkml.kernel.org/r/20160823091726.GK23577@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Cc: Markus Trippelsdorf
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Aug, 2016

1 commit

  • At present it is obvious that memory online and offline will fail when
    KASAN is enabled. So add a condition to limit memory hotplug when
    KASAN is enabled.

    Link: http://lkml.kernel.org/r/1470063651-29519-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

29 Jul, 2016

1 commit

  • When it was first introduced CONFIG_ZONE_DEVICE depended on disabling
    CONFIG_ZONE_DMA, a configuration choice reserved for "experts".
    However, now that the ZONE_DMA conflict has been eliminated it no longer
    makes sense to require CONFIG_EXPERT.

    Link: http://lkml.kernel.org/r/146687646274.39261.14267596518720371009.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Eric Sandeen
    Reported-by: Jeff Moyer
    Acked-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

27 Jul, 2016

1 commit

  • For file mappings, we don't deposit page tables on THP allocation
    because it's not strictly required to implement split_huge_pmd(): we
    can just clear the pmd and let the following page faults reconstruct
    the page table.

    But Power makes use of the deposited page table to address an MMU
    quirk.

    Let's hide THP page cache, including huge tmpfs, behind a separate
    config option, so it can be forbidden on Power.

    We can revert the patch later once a solution for Power is found.

    Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

28 May, 2016

1 commit

  • When we have !NO_BOOTMEM, the deferred page struct initialization
    doesn't work well because the pages reserved in bootmem are released
    to the page allocator unconditionally. It causes memory corruption and
    eventually a system crash.

    As Mel suggested, bootmem is retiring slowly. We fix the issue by
    simply hiding DEFERRED_STRUCT_PAGE_INIT when bootmem is enabled.

    Link: http://lkml.kernel.org/r/1460602170-5821-1-git-send-email-gwshan@linux.vnet.ibm.com
    Signed-off-by: Gavin Shan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     

27 May, 2016

1 commit

  • Per the suggestion from Michal Hocko [1], DEFERRED_STRUCT_PAGE_INIT
    requires some ordering wrt other initialization operations, e.g.
    page_ext_init has to happen after the whole memmap is initialized
    properly.

    For SPARSEMEM this requires waiting for page_alloc_init_late. Other
    memory models (e.g. flatmem) might have different initialization
    layouts (page_ext_init_flatmem). Currently DEFERRED_STRUCT_PAGE_INIT
    depends on MEMORY_HOTPLUG, which in turn

    depends on SPARSEMEM || X86_64_ACPI_NUMA
    depends on ARCH_ENABLE_MEMORY_HOTPLUG

    and X86_64_ACPI_NUMA depends on NUMA, which in turn disables the
    FLATMEM memory model:

    config ARCH_FLATMEM_ENABLE
    def_bool y
    depends on X86_32 && !NUMA

    so FLATMEM is ruled out via the dependency maze. Be explicit and
    disable FLATMEM for DEFERRED_STRUCT_PAGE_INIT so that we do not
    reintroduce subtle initialization bugs.

    [1] http://lkml.kernel.org/r/20160523073157.GD2278@dhcp22.suse.cz

    Link: http://lkml.kernel.org/r/1464027356-32282-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

21 May, 2016

2 commits

  • I've been receiving increasingly concerned notes from 0day about how
    much my recent changes have been bloating the radix tree. Make it
    happier by only including multiorder support if
    CONFIG_TRANSPARENT_HUGEPAGE is set.

    This is an independent Kconfig option, so other radix tree users can
    also set it if they have a need.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This patch introduces z3fold, a special purpose allocator for storing
    compressed pages. It is designed to store up to three compressed pages
    per physical page. It is a ZBUD derivative which allows for a higher
    compression ratio while keeping the simplicity and determinism of its
    predecessor.

    This patch comes as a follow-up to the discussions at the Embedded
    Linux Conference in San Diego related to the talk [1]. The outcome of
    these discussions was that it would be good to have a compressed page
    allocator as stable and deterministic as zbud but with a higher
    compression ratio.

    To keep the determinism and simplicity, z3fold, just like zbud, always
    stores an integral number of compressed pages per page, but it can
    store up to 3 pages, unlike zbud which can store at most 2. Therefore
    the compression ratio goes up to around 2.6x while zbud's is around
    1.7x.

    The patch is based on the latest linux.git tree.

    This version has been updated after testing on various simulators (e.g.
    ARM Versatile Express, MIPS Malta, x86_64/Haswell) and basing on
    comments from Dan Streetman [3].

    [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
    [2] https://lkml.org/lkml/2016/4/21/799
    [3] https://lkml.org/lkml/2016/5/4/852

    Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Seth Jennings
    Cc: Dan Streetman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     

20 May, 2016

2 commits

  • This patchset continues the work I started with commit 31bc3858ea3e
    ("memory-hotplug: add automatic onlining policy for the newly added
    memory").

    Initially I was going to stop there and bring the policy setting logic
    to userspace. I ran into two issues along the way:

    1) It is possible to have memory hotplugged at boot (e.g. with QEMU).
    These blocks stay offline if we turn the onlining policy on from
    userspace.

    2) My attempt to bring this policy setting to systemd failed; systemd
    maintainers suggest changing the default in the kernel or ... using
    tmpfiles.d to alter the policy (which looks like a hack to me):
    https://github.com/systemd/systemd/pull/2938

    Here I suggest adding a config option to set the default value for the
    policy and a kernel command line parameter to override it.

    This patch (of 2):

    Introduce a config option to set the default value for the memory
    hotplug onlining policy (/sys/devices/system/memory/auto_online_blocks).
    The reasons one would want to turn this option on are to have early
    onlining for hotpluggable memory available at boot and to not require
    any userspace actions to make memory hotplug work.

    [akpm@linux-foundation.org: tweak Kconfig text]
    Signed-off-by: Vitaly Kuznetsov
    Cc: Jonathan Corbet
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: David Vrabel
    Cc: David Rientjes
    Cc: Igor Mammedov
    Cc: Lennart Poettering
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • Now we have the IS_ENABLED helper to check whether a Kconfig option is
    enabled or not, so ZONE_DMA_FLAG no longer sounds useful; a minimal
    example of the helper follows this entry.

    Also, the use of ZONE_DMA_FLAG in slab looks pointless according to
    the comment [1] from Johannes Weiner, so remove it; ORing the
    passed-in flags with the cache gfp flags is already done in
    kmem_getpages().

    [1] https://lkml.org/lkml/2014/9/25/553

    Link: http://lkml.kernel.org/r/1462381297-11009-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
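
    A minimal example of the IS_ENABLED() pattern the entry refers to; the
    function name is illustrative. The test compiles to a constant, which
    is why a separate runtime ZONE_DMA_FLAG is redundant:

    #include <linux/gfp.h>
    #include <linux/kconfig.h>
    #include <linux/types.h>

    static bool example_wants_dma_zone(gfp_t flags)
    {
            return IS_ENABLED(CONFIG_ZONE_DMA) && (flags & __GFP_DMA);
    }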
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable; a userspace sketch of this case follows this entry.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
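
    A userspace sketch of the PROT_EXEC-only case described above; the
    function and sizes are made up for illustration. On a pkeys-capable
    CPU the kernel backs the final mapping with an execute-only key, so
    reads of it fault:

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>

    static void *example_make_exec_only(const void *code, size_t len)
    {
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED)
                    return NULL;

            memcpy(buf, code, len);                  /* stage code while writable */
            if (mprotect(buf, len, PROT_EXEC)) {     /* note: no PROT_READ */
                    munmap(buf, len);
                    return NULL;
            }
            return buf;
    }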
     

18 Mar, 2016

3 commits

  • The primary use case for devm_memremap_pages() is to allocate a memmap
    array from persistent memory. That capability requires vmem_altmap,
    which requires SPARSEMEM_VMEMMAP.

    Also, without SPARSEMEM_VMEMMAP the addition of ZONE_DEVICE expands
    ZONES_WIDTH and triggers the:

    "Unfortunate NUMA and NUMA Balancing config, growing page-frame for
    last_cpupid."

    ...warning in mm/memory.c. SPARSEMEM_VMEMMAP=n && ZONE_DEVICE=y is not
    a configuration we should worry about supporting.

    Signed-off-by: Dan Williams
    Reported-by: Vlastimil Babka
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
    mm zones that are bumping up against the current maximum limit of 4
    zones, i.e. 2 bits in page->flags for the GFP_ZONE_TABLE.

    The GFP_ZONE_TABLE poses an interesting constraint since
    include/linux/gfp.h gets included by the 32-bit portion of a 64-bit
    build. We need to be careful to only build the table for zones that
    have a corresponding gfp_t flag. GFP_ZONES_SHIFT is introduced for this
    purpose. This patch does not attempt to solve the problem of adding a
    new zone that also has a corresponding GFP_ flag.

    Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and
    SPARSEMEM_VMEMMAP implies that SECTIONS_WIDTH is zero. In other words
    even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE it is free to
    consume another bit in page->flags (expand ZONES_WIDTH) with room to
    spare.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
    Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
    Signed-off-by: Dan Williams
    Reported-by: Mark
    Reported-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Sudip Mukherjee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • MEMORY_HOTPLUG already depends on ARCH_ENABLE_MEMORY_HOTPLUG which is
    selected by the supported architectures, so the following arch
    dependency is unnecessary.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

19 Feb, 2016

1 commit

  • The syscall-level code is passed a protection key and needs to
    return an appropriate error code if the protection key is bogus.
    We will be using this in subsequent patches.

    Note that this also begins a series of arch-specific calls that
    we need to expose in otherwise arch-independent code. We create
    a linux/pkeys.h header where we will put *all* the stubs for
    these functions.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210232.774EEAAB@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

18 Feb, 2016

1 commit

  • vma->vm_flags is an 'unsigned long', so has space for 32 flags
    on 32-bit architectures. The high 32 bits are unused on 64-bit
    platforms. We've steered away from using the unused high VMA
    bits for things because we would have difficulty supporting it
    on 32-bit.

    Protection Keys are not available in 32-bit mode, so there is
    no concern about supporting this feature in 32-bit mode or on
    32-bit CPUs.

    This patch carves out 4 bits from the high half of
    vma->vm_flags and allows architectures to set a config option
    to make them available; a sketch of the pattern follows this entry.

    Sparse complains about these constants unless we explicitly
    call them "UL".

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Valentin Rothberg
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210208.81AF00D5@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
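
    A sketch of the pattern described above; the macro names carry an
    EXAMPLE_ prefix to make clear they are illustrative, and the guarding
    option is assumed to be the arch-selected one this patch adds:

    #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
    /* bits 32..35 of the 64-bit vm_flags, spelled with UL for sparse */
    #define EXAMPLE_VM_HIGH_ARCH_0  (1UL << 32)
    #define EXAMPLE_VM_HIGH_ARCH_1  (1UL << 33)
    #define EXAMPLE_VM_HIGH_ARCH_2  (1UL << 34)
    #define EXAMPLE_VM_HIGH_ARCH_3  (1UL << 35)
    #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */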
     

06 Feb, 2016

1 commit

  • The description mentions kswapd threads, while the deferred struct page
    initialization is actually done by one-off "pgdatinitX" threads.

    Fix the description so that potential users are not confused about
    pgdatinit threads using CPU after boot instead of kswapd.

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

07 Nov, 2015

1 commit

  • Hugh has pointed out that the compound_head() call can be unsafe in
    some contexts. Here's one example:

    CPU0                                    CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                            put_page()
                                              tail->first_page = NULL
          head = tail->first_page
                                            alloc_pages(__GFP_COMP)
                                              prep_compound_page()
                                                tail->first_page = head
                                                __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger
    it in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that both can be updated in one
    shot.

    The patch introduces page->compound_head into the third double word
    block in front of compound_dtor and compound_order. Bit 0 encodes
    PageTail() and the rest of the bits are a pointer to the head page if
    bit zero is set (see the decoding sketch after this entry).

    The patch moves page->pmd_huge_pte out of the word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens the possibility to remove the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the
    first tail page to store a struct hugetlb_cgroup pointer. But that's
    out of scope of this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of the word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we
    use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
    future call_rcu_lazy() is not allowed, as it makes use of the bit and
    we could get a false positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
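
    A decoding sketch for the encoding described above; this paraphrases
    the helper rather than quoting it:

    #include <linux/compiler.h>
    #include <linux/mm_types.h>

    static inline struct page *sketch_compound_head(struct page *page)
    {
            unsigned long head = READ_ONCE(page->compound_head);

            if (head & 1)                           /* PageTail() */
                    return (struct page *)(head - 1);
            return page;
    }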
     

12 Sep, 2015

1 commit

  • Pull media updates from Mauro Carvalho Chehab:
    "A series of patches that move part of the code used to allocate memory
    from the media subsystem to the mm subsystem"

    [ The mm parts have been acked by VM people, and the series was
    apparently in -mm for a while - Linus ]

    * tag 'media/v4.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
    [media] drm/exynos: Convert g2d_userptr_get_dma_addr() to use get_vaddr_frames()
    [media] media: vb2: Remove unused functions
    [media] media: vb2: Convert vb2_dc_get_userptr() to use frame vector
    [media] media: vb2: Convert vb2_vmalloc_get_userptr() to use frame vector
    [media] media: vb2: Convert vb2_dma_sg_get_userptr() to use frame vector
    [media] vb2: Provide helpers for mapping virtual addresses
    [media] media: omap_vout: Convert omap_vout_uservirt_to_phys() to use get_vaddr_pfns()
    [media] mm: Provide new get_vaddr_frames() helper
    [media] vb2: Push mmap_sem down to memops

    Linus Torvalds