28 Oct, 2016

1 commit

  • No, KASAN may not be able to co-exist with HOTPLUG_MEMORY at runtime,
    but for build testing there is no reason not to allow them together.

    This hopefully means better build coverage and fewer embarrassing silly
    problems like the one fixed by commit 9db4f36e82c2 ("mm: remove unused
    variable in memory hotplug") in the future.
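
    A minimal Kconfig sketch of the idea (assumed form; the exact upstream
    dependency line may differ):

        config MEMORY_HOTPLUG
            bool "Allow for memory hot-add"
            depends on ARCH_ENABLE_MEMORY_HOTPLUG
            # keep the runtime restriction, but let build testing combine the two
            depends on COMPILE_TEST || !KASAN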

    Cc: Stephen Rothwell
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Aug, 2016

1 commit

  • The current wording of the COMPACTION Kconfig help text doesn't
    emphasise that disabling COMPACTION can cripple the page allocator,
    which relies on compaction quite heavily for high-order requests, and
    that an unexpected OOM can happen without it. Make sure we are vocal
    about that.
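
    Illustrative help-text wording along these lines (not necessarily the
    exact upstream text):

        config COMPACTION
            bool "Allow for memory compaction"
            help
              Compaction is the only memory management component able to
              form higher order (physically contiguous) memory blocks
              reliably. The page allocator relies on compaction heavily
              and the lack of the feature can lead to unexpected OOM
              killer invocations for high order memory requests. You
              shouldn't disable this option unless there is a really
              strong reason for it.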

    Link: http://lkml.kernel.org/r/20160823091726.GK23577@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Cc: Markus Trippelsdorf
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Aug, 2016

1 commit

  • At present, memory online and offline will fail when KASAN is enabled,
    so add a condition to limit memory hotplug in that case.
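
    A minimal sketch of the added dependency (assumed placement in
    mm/Kconfig):

        config MEMORY_HOTPLUG
            bool "Allow for memory hot-add"
            # KASAN does not yet set up shadow memory for hot-added ranges
            depends on !KASAN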

    Link: http://lkml.kernel.org/r/1470063651-29519-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

29 Jul, 2016

1 commit

  • When it was first introduced CONFIG_ZONE_DEVICE depended on disabling
    CONFIG_ZONE_DMA, a configuration choice reserved for "experts".
    However, now that the ZONE_DMA conflict has been eliminated it no longer
    makes sense to require CONFIG_EXPERT.

    Link: http://lkml.kernel.org/r/146687646274.39261.14267596518720371009.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Eric Sandeen
    Reported-by: Jeff Moyer
    Acked-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

27 Jul, 2016

1 commit

  • For file mappings, we don't deposit page tables on THP allocation
    because it's not strictly required to implement split_huge_pmd(): we can
    just clear the pmd and let subsequent page faults reconstruct the page
    table.

    But Power makes use of the deposited page table to address an MMU quirk.

    Let's hide THP page cache, including huge tmpfs, under a separate config
    option, so it can be forbidden on Power.

    We can revert the patch later once a solution for Power is found.
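
    Roughly, the new option could look like this (sketch; the symbol name
    TRANSPARENT_HUGE_PAGECACHE is assumed here):

        config TRANSPARENT_HUGE_PAGECACHE
            def_bool y
            # keep huge pages in the page cache disabled on Power for now
            depends on TRANSPARENT_HUGEPAGE && !PPC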

    Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

28 May, 2016

1 commit

  • When we have !NO_BOOTMEM, the deferred struct page initialization
    doesn't work well because the pages reserved in bootmem are released to
    the page allocator unconditionally. This causes memory corruption and,
    eventually, a system crash.

    As Mel suggested, bootmem is slowly being retired. We fix the issue by
    simply hiding DEFERRED_STRUCT_PAGE_INIT when bootmem is enabled.

    Link: http://lkml.kernel.org/r/1460602170-5821-1-git-send-email-gwshan@linux.vnet.ibm.com
    Signed-off-by: Gavin Shan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     

27 May, 2016

1 commit

  • Per the suggestion from Michal Hocko [1], DEFERRED_STRUCT_PAGE_INIT
    requires some ordering wrt other initialization operations, e.g.
    page_ext_init has to happen after the whole memmap is initialized
    properly.

    For SPARSEMEM this requires waiting for page_alloc_init_late. Other
    memory models (e.g. flatmem) might have different initialization
    layouts (page_ext_init_flatmem). Currently DEFERRED_STRUCT_PAGE_INIT
    depends on MEMORY_HOTPLUG, which in turn

        depends on SPARSEMEM || X86_64_ACPI_NUMA
        depends on ARCH_ENABLE_MEMORY_HOTPLUG

    and X86_64_ACPI_NUMA depends on NUMA, which in turn disables the FLATMEM
    memory model:

        config ARCH_FLATMEM_ENABLE
            def_bool y
            depends on X86_32 && !NUMA

    so FLATMEM is only ruled out via this dependency maze. Be explicit and
    disable FLATMEM for DEFERRED_STRUCT_PAGE_INIT so that we do not
    reintroduce subtle initialization bugs.

    [1] http://lkml.kernel.org/r/20160523073157.GD2278@dhcp22.suse.cz
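
    Putting this together with the NO_BOOTMEM dependency added in the entry
    above, the option ends up looking roughly like this (sketch, not the
    exact upstream block):

        config DEFERRED_STRUCT_PAGE_INIT
            bool "Defer initialisation of struct pages to kthreads"
            depends on ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
            depends on NO_BOOTMEM && MEMORY_HOTPLUG
            # be explicit rather than relying on the dependency maze above
            depends on !FLATMEM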

    Link: http://lkml.kernel.org/r/1464027356-32282-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

21 May, 2016

2 commits

  • I've been receiving increasingly concerned notes from 0day about how
    much my recent changes have been bloating the radix tree. Make it
    happier by only including multiorder support if
    CONFIG_TRANSPARENT_HUGEPAGE is set.

    This is an independent Kconfig option, so other radix tree users can
    also set it if they have a need.
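
    A sketch of the shape of the change (the option name
    RADIX_TREE_MULTIORDER is assumed; it is not named above):

        config RADIX_TREE_MULTIORDER
            bool
            # selected ("select RADIX_TREE_MULTIORDER") by users that need
            # multiorder entries, e.g. TRANSPARENT_HUGEPAGE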

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This patch introduces z3fold, a special purpose allocator for storing
    compressed pages. It is designed to store up to three compressed pages
    per physical page. It is a ZBUD derivative which allows for a higher
    compression ratio while keeping the simplicity and determinism of its
    predecessor.

    This patch comes as a follow-up to the discussions at the Embedded Linux
    Conference in San Diego related to the talk [1]. The outcome of these
    discussions was that it would be good to have a compressed page
    allocator as stable and deterministic as zbud but with a higher
    compression ratio.

    To keep the determinism and simplicity, z3fold, just like zbud, always
    stores an integral number of compressed pages per page, but it can store
    up to 3 pages unlike zbud which can store at most 2. Therefore the
    compression ratio goes to around 2.6x while zbud's is around 1.7x.

    The patch is based on the latest linux.git tree.

    This version has been updated after testing on various simulators (e.g.
    ARM Versatile Express, MIPS Malta, x86_64/Haswell) and based on
    comments from Dan Streetman [3].

    [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
    [2] https://lkml.org/lkml/2016/4/21/799
    [3] https://lkml.org/lkml/2016/5/4/852
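
    The corresponding Kconfig entry looks roughly like this (sketch):

        config Z3FOLD
            tristate "Up to 3x density storage for compressed pages"
            depends on ZPOOL
            help
              A special purpose allocator for storing compressed pages.
              It stores up to three compressed pages per physical page
              while keeping zbud-like simplicity and determinism.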

    Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Seth Jennings
    Cc: Dan Streetman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     

20 May, 2016

2 commits

  • This patchset continues the work I started with commit 31bc3858ea3e
    ("memory-hotplug: add automatic onlining policy for the newly added
    memory").

    Initially I was going to stop there and bring the policy setting logic
    to userspace. I ran into two issues on the way:

    1) It is possible to have memory hotplugged at boot (e.g. with QEMU).
    These blocks stay offline if we only turn the onlining policy on later
    from userspace.

    2) My attempt to bring this policy setting to systemd failed; systemd
    maintainers suggest changing the default in the kernel or ... using
    tmpfiles.d to alter the policy (which looks like a hack to me):
    https://github.com/systemd/systemd/pull/2938

    Here I suggest adding a config option to set the default value for the
    policy and a kernel command line parameter to override it.

    This patch (of 2):

    Introduce a config option to set the default value for the memory
    hotplug onlining policy (/sys/devices/system/memory/auto_online_blocks).
    The reasons one would want to turn this option on are to have early
    onlining of hotpluggable memory available at boot and to avoid requiring
    any userspace action to make memory hotplug work.
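
    A sketch of the option and its boot-time override (option and parameter
    names assumed; they are not spelled out above):

        config MEMORY_HOTPLUG_DEFAULT_ONLINE
            bool "Online the newly added memory blocks by default"
            depends on MEMORY_HOTPLUG
            help
              When enabled, newly added memory blocks are onlined
              automatically; the policy can still be overridden on the
              kernel command line (e.g. memhp_default_state=offline).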

    [akpm@linux-foundation.org: tweak Kconfig text]
    Signed-off-by: Vitaly Kuznetsov
    Cc: Jonathan Corbet
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: David Vrabel
    Cc: David Rientjes
    Cc: Igor Mammedov
    Cc: Lennart Poettering
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • Now that we have the IS_ENABLED() helper to check whether a Kconfig
    option is enabled, ZONE_DMA_FLAG no longer sounds useful.

    The use of ZONE_DMA_FLAG in slab also looks pointless according to the
    comment [1] from Johannes Weiner, so remove it; ORing the passed-in
    flags with the cache's gfp flags is already done in kmem_getpages().

    [1] https://lkml.org/lkml/2014/9/25/553

    Link: http://lkml.kernel.org/r/1462381297-11009-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

18 Mar, 2016

3 commits

  • The primary use case for devm_memremap_pages() is to allocate a memmap
    array from persistent memory. That capability requires vmem_altmap which
    requires SPARSEMEM_VMEMMAP.

    Also, without SPARSEMEM_VMEMMAP the addition of ZONE_DEVICE expands
    ZONES_WIDTH and triggers the:

    "Unfortunate NUMA and NUMA Balancing config, growing page-frame for
    last_cpupid."

    ...warning in mm/memory.c. SPARSEMEM_VMEMMAP=n && ZONE_DEVICE=y is not
    a configuration we should worry about supporting.

    Signed-off-by: Dan Williams
    Reported-by: Vlastimil Babka
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
    mm zones that are bumping up against the current maximum limit of 4
    zones, i.e. 2 bits in page->flags for the GFP_ZONE_TABLE.

    The GFP_ZONE_TABLE poses an interesting constraint since
    include/linux/gfp.h gets included by the 32-bit portion of a 64-bit
    build. We need to be careful to only build the table for zones that
    have a corresponding gfp_t flag. GFP_ZONES_SHIFT is introduced for this
    purpose. This patch does not attempt to solve the problem of adding a
    new zone that also has a corresponding GFP_ flag.

    Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and
    SPARSEMEM_VMEMMAP, implies that SECTIONS_WIDTH is zero. In other words
    even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE it is free to
    consume another bit in page->flags (expand ZONES_WIDTH) with room to
    spare.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
    Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
    Signed-off-by: Dan Williams
    Reported-by: Mark
    Reported-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Sudip Mukherjee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • MEMORY_HOTPLUG already depends on ARCH_ENABLE_MEMORY_HOTPLUG, which is
    selected by the supported architectures, so the extra per-architecture
    dependency is unnecessary.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

19 Feb, 2016

1 commit

  • The syscall-level code is passed a protection key and needs to
    return an appropriate error code if the protection key is bogus.
    We will be using this in subsequent patches.

    Note that this also begins a series of arch-specific calls that
    we need to expose in otherwise arch-independent code. We create
    a linux/pkeys.h header where we will put *all* the stubs for
    these functions.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210232.774EEAAB@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

18 Feb, 2016

1 commit

  • vma->vm_flags is an 'unsigned long', so has space for 32 flags
    on 32-bit architectures. The high 32 bits are unused on 64-bit
    platforms. We've steered away from using the unused high VMA
    bits for things because we would have difficulty supporting it
    on 32-bit.

    Protection Keys are not available in 32-bit mode, so there is
    no concern about supporting this feature in 32-bit mode or on
    32-bit CPUs.

    This patch carves out 4 bits from the high half of
    vma->vm_flags and allows architectures to set a config option
    to make them available.

    Sparse complains about these constants unless we explicitly
    call them "UL".

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Valentin Rothberg
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210208.81AF00D5@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

06 Feb, 2016

1 commit

  • The description mentions kswapd threads, while the deferred struct page
    initialization is actually done by one-off "pgdatinitX" threads.

    Fix the description so that users are not potentially confused about
    pgdatinit threads using CPU after boot instead of kswapd.

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

07 Nov, 2015

1 commit

  • Hugh has pointed that compound_head() call can be unsafe in some
    context. There's one example:

    CPU0                                          CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                                  put_page()
                                                    tail->first_page = NULL
          head = tail->first_page
                                                  alloc_pages(__GFP_COMP)
                                                    prep_compound_page()
                                                      tail->first_page = head
                                                      __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger
    it in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that they can be updated in one
    shot.

    The patch introduces page->compound_head into the third double word
    block in front of compound_dtor and compound_order. Bit 0 encodes
    PageTail() and the remaining bits are a pointer to the head page if bit
    zero is set.

    The patch moves page->pmd_huge_pte out of the word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private of the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the
    first tail page to store the struct hugetlb_cgroup pointer. But that's
    out of scope for this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of that word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we
    use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
    future call_rcu_lazy() is not allowed, as it makes use of the bit and we
    could get a false positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Sep, 2015

1 commit

  • Pull media updates from Mauro Carvalho Chehab:
    "A series of patches that move part of the code used to allocate memory
    from the media subsystem to the mm subsystem"

    [ The mm parts have been acked by VM people, and the series was
    apparently in -mm for a while - Linus ]

    * tag 'media/v4.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
    [media] drm/exynos: Convert g2d_userptr_get_dma_addr() to use get_vaddr_frames()
    [media] media: vb2: Remove unused functions
    [media] media: vb2: Convert vb2_dc_get_userptr() to use frame vector
    [media] media: vb2: Convert vb2_vmalloc_get_userptr() to use frame vector
    [media] media: vb2: Convert vb2_dma_sg_get_userptr() to use frame vector
    [media] vb2: Provide helpers for mapping virtual addresses
    [media] media: omap_vout: Convert omap_vout_uservirt_to_phys() to use get_vaddr_pfns()
    [media] mm: Provide new get_vaddr_frames() helper
    [media] vb2: Push mmap_sem down to memops

    Linus Torvalds
     

11 Sep, 2015

1 commit

  • Knowing the portion of memory that is not used by a certain application or
    memory cgroup (idle memory) can be useful for partitioning the system
    efficiently, e.g. by setting memory cgroup limits appropriately.
    Currently, the only means to estimate the amount of idle memory provided
    by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
    access bit for all pages mapped to a particular process by writing 1 to
    clear_refs, wait for some time, and then count smaps:Referenced. However,
    this method has two serious shortcomings:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    To overcome these drawbacks, this patch introduces two new page flags,
    Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
    A page's Idle flag can only be set from userspace by setting the bit in
    /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
    and it is cleared whenever the page is accessed either through page tables
    (it is cleared in page_referenced() in this case) or using the read(2)
    system call (mark_page_accessed()). Thus by setting the Idle flag for
    pages of a particular workload, which can be found e.g. by reading
    /proc/PID/pagemap, waiting for some time to let the workload access its
    working set, and then reading the bitmap file, one can estimate the amount
    of pages that are not used by the workload.

    The Young page flag is used to avoid interference with the memory
    reclaimer. A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.

    Note, since there is no room for extra page flags on 32 bit, this feature
    uses extended page flags when compiled on 32 bit.
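
    The feature is presumably gated by a Kconfig option along these lines
    (sketch; the exact names are assumed):

        config IDLE_PAGE_TRACKING
            bool "Enable idle page tracking"
            depends on SYSFS && MMU
            # 32 bit has no spare page flags, so fall back to page_ext
            select PAGE_EXTENSION if !64BIT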

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: kpageidle requires an MMU]
    [akpm@linux-foundation.org: decouple from page-flags rework]
    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

09 Sep, 2015

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "This update has successfully completed a 0day-kbuild run and has
    appeared in a linux-next release. The changes outside of the typical
    drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
    removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
    the introduction of ZONE_DEVICE + devm_memremap_pages().

    Summary:

    - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.

    This facility is used by the pmem driver to enable pfn_to_page()
    operations on the page frames returned by DAX ('direct_access' in
    'struct block_device_operations').

    For now, the 'memmap' allocation for these "device" pages comes
    from "System RAM". Support for allocating the memmap from device
    memory will arrive in a later kernel.

    - Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt(). memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects. The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.

    Completion of the conversion is targeted for v4.4.

    - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.

    - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.

    - Miscellaneous updates and fixes to libnvdimm including support for
    issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes"

    * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
    libnvdimm, pmem: direct map legacy pmem by default
    libnvdimm, pmem: 'struct page' for pmem
    libnvdimm, pfn: 'struct page' provider infrastructure
    x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
    add devm_memremap_pages
    mm: ZONE_DEVICE for "device memory"
    mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
    dax: drop size parameter to ->direct_access()
    nd_blk: change aperture mapping from WC to WB
    nvdimm: change to use generic kvfree()
    pmem, dax: have direct_access use __pmem annotation
    dax: update I/O path to do proper PMEM flushing
    pmem: add copy_from_iter_pmem() and clear_pmem()
    pmem, x86: clean up conditional pmem includes
    pmem: remove layer when calling arch_has_wmb_pmem()
    pmem, x86: move x86 PMEM API to new pmem.h header
    libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
    pmem: switch to devm_ allocations
    devres: add devm_memremap
    libnvdimm, btt: write and validate parent_uuid
    ...

    Linus Torvalds
     

28 Aug, 2015

1 commit

  • While pmem is usable as a block device or via DAX mappings to userspace,
    there are several usage scenarios that cannot target pmem due to its
    lack of struct page coverage. In preparation for "hot plugging" pmem
    into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
    separately from the ones that are subject to standard page allocations.
    Importantly "device memory" can be removed at will by userspace
    unbinding the driver of the device.

    Having a separate zone prevents allocation and otherwise marks these
    pages as distinct from typical uniform memory. Device memory has
    different lifetime and performance characteristics than RAM. However,
    since we have run out of ZONES_SHIFT bits this functionality currently
    depends on sacrificing ZONE_DMA.
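
    A sketch of the new zone's Kconfig entry as described (the exact
    dependency list is assumed):

        config ZONE_DEVICE
            bool "Device memory (pmem, etc...) hotplug support"
            default !ZONE_DMA
            # ZONES_SHIFT is exhausted, so ZONE_DEVICE trades places with ZONE_DMA
            depends on !ZONE_DMA
            depends on MEMORY_HOTPLUG
            depends on MEMORY_HOTREMOVE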

    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jerome Glisse
    [hch: various simplifications in the arch interface]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

17 Aug, 2015

1 commit

  • Provide a new function, get_vaddr_frames(). This function maps virtual
    addresses from a given start and fills the given array with page frame
    numbers of the corresponding pages. If the given start belongs to a
    normal vma, the function grabs a reference to each of the pages to pin
    them in memory. If the start belongs to a VM_IO | VM_PFNMAP vma, we
    don't touch page structures. The caller must make sure pfns aren't
    reused for anything else while using them.

    This function is created for various drivers to simplify handling of
    their buffers.

    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Andrew Morton
    Signed-off-by: Hans Verkuil
    Signed-off-by: Mauro Carvalho Chehab

    Jan Kara
     

24 Jul, 2015

1 commit


01 Jul, 2015

1 commit

  • This patch initialises all low memory struct pages and 2G of the highest
    zone on each node during memory initialisation if
    CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be set
    yet but will be available in a later patch. Parallel initialisation of
    struct pages depends on some features from memory hotplug and it is
    necessary to alter section annotations.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

25 Jun, 2015

1 commit

  • RAS user space tools like rasdaemon, which are based on trace events,
    can receive an MCE error event but no memory recovery result event. So I
    want to add this event to make the scenario complete.

    This patch adds an event to the ras group for memory-failure.

    The output like below:
    # tracer: nop
    #
    # entries-in-buffer/entries-written: 2/2 #P:24
    #
    # _-----=> irqs-off
    # / _----=> need-resched
    # | / _---=> hardirq/softirq
    # || / _--=> preempt-depth
    # ||| / delay
    # TASK-PID CPU# |||| TIMESTAMP FUNCTION
    # | | | |||| | |
    mce-inject-13150 [001] .... 277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed

    [xiexiuqi@huawei.com: fix build error]
    Signed-off-by: Xie XiuQi
    Reviewed-by: Naoya Horiguchi
    Acked-by: Steven Rostedt
    Cc: Tony Luck
    Cc: Chen Gong
    Cc: Jim Davis
    Signed-off-by: Xie XiuQi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     

15 Apr, 2015

1 commit

  • I've noticed that there are no interfaces exposed by CMA which would let
    me fuzz what's going on in there.

    This small patchset exposes some information out to userspace, plus adds
    the ability to trigger allocation and freeing from userspace.

    This patch (of 3):

    Implement a simple debugfs interface to expose information about CMA areas
    in the system.

    Useful for testing/sanity checks for CMA since it was previously
    impossible to retrieve this information in userspace.
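
    The debugfs interface is presumably guarded by an option like this
    (sketch):

        config CMA_DEBUGFS
            bool "CMA debugfs interface"
            depends on CMA && DEBUG_FS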

    Signed-off-by: Sasha Levin
    Acked-by: Joonsoo Kim
    Cc: Marek Szyprowski
    Cc: Laura Abbott
    Cc: Konrad Rzeszutek Wilk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

20 Feb, 2015

1 commit

  • Pull kconfig updates from Michal Marek:
    "Yann E Morin was supposed to take over kconfig maintainership, but
    this hasn't happened. So I'm sending a few kconfig patches that I
    collected:

    - Fix for missing va_end in kconfig
    - merge_config.sh displays usage if given too few arguments
    - s/boolean/bool/ in Kconfig files for consistency, with the plan to
    only support bool in the future"

    * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
    kconfig: use va_end to match corresponding va_start
    merge_config.sh: Display usage if given too few arguments
    kconfig: use bool instead of boolean for type definition attributes

    Linus Torvalds
     

13 Feb, 2015

1 commit

  • Keeping fragmentation of zsmalloc at a low level is our target, but we
    still need to add debug code to zsmalloc to get quantitative data.

    This patch adds a new configuration option, CONFIG_ZSMALLOC_STAT, to
    enable statistics collection for developers. Currently only the object
    statistics in each class are collected. Users can get the information
    via debugfs.

    cat /sys/kernel/debug/zsmalloc/zram0/...

    For example:

    After I copied "jdk-8u25-linux-x64.tar.gz" to zram with ext4 filesystem:
    class size obj_allocated obj_used pages_used
    0 32 0 0 0
    1 48 256 12 3
    2 64 64 14 1
    3 80 51 7 1
    4 96 128 5 3
    5 112 73 5 2
    6 128 32 4 1
    7 144 0 0 0
    8 160 0 0 0
    9 176 0 0 0
    10 192 0 0 0
    11 208 0 0 0
    12 224 0 0 0
    13 240 0 0 0
    14 256 16 1 1
    15 272 15 9 1
    16 288 0 0 0
    17 304 0 0 0
    18 320 0 0 0
    19 336 0 0 0
    20 352 0 0 0
    21 368 0 0 0
    22 384 0 0 0
    23 400 0 0 0
    24 416 0 0 0
    25 432 0 0 0
    26 448 0 0 0
    27 464 0 0 0
    28 480 0 0 0
    29 496 33 1 4
    30 512 0 0 0
    31 528 0 0 0
    32 544 0 0 0
    33 560 0 0 0
    34 576 0 0 0
    35 592 0 0 0
    36 608 0 0 0
    37 624 0 0 0
    38 640 0 0 0
    40 672 0 0 0
    42 704 0 0 0
    43 720 17 1 3
    44 736 0 0 0
    46 768 0 0 0
    49 816 0 0 0
    51 848 0 0 0
    52 864 14 1 3
    54 896 0 0 0
    57 944 13 1 3
    58 960 0 0 0
    62 1024 4 1 1
    66 1088 15 2 4
    67 1104 0 0 0
    71 1168 0 0 0
    74 1216 0 0 0
    76 1248 0 0 0
    83 1360 3 1 1
    91 1488 11 1 4
    94 1536 0 0 0
    100 1632 5 1 2
    107 1744 0 0 0
    111 1808 9 1 4
    126 2048 4 4 2
    144 2336 7 3 4
    151 2448 0 0 0
    168 2720 15 15 10
    190 3072 28 27 21
    202 3264 0 0 0
    254 4096 36209 36209 36209

    Total 37022 36326 36288

    We can calculate the overall fragmentation from the last line:
    Total 37022 36326 36288
    (37022 - 36326) / 37022 = 1.87%

    Also, by analysing the objects allocated in every class we know why we
    got such low fragmentation: most of the allocated objects are in class
    254, and there is only 1 page in a class 254 zspage, so no fragmentation
    will be introduced by allocating objects in class 254.

    In the future, we can collect other zsmalloc statistics as needed and
    analyse them.
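
    For reference, the CONFIG_ZSMALLOC_STAT entry is presumably of this
    shape (sketch):

        config ZSMALLOC_STAT
            bool "Export zsmalloc statistics"
            depends on ZSMALLOC
            select DEBUG_FS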

    Signed-off-by: Ganesh Mahendran
    Suggested-by: Minchan Kim
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Seth Jennings
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     

07 Jan, 2015

2 commits

  • Support for keyword 'boolean' will be dropped later on.

    No functional change.
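
    As a concrete (hypothetical) example of the change, with EXAMPLE_OPTION
    standing in for any real option:

        # before
        config EXAMPLE_OPTION
            boolean "Example option"

        # after
        config EXAMPLE_OPTION
            bool "Example option"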

    Reference: http://lkml.kernel.org/r/cover.1418003065.git.cj@linux.com
    Signed-off-by: Christoph Jaeger
    Signed-off-by: Michal Marek

    Christoph Jaeger
     
  • SRCU does not need to be compiled in by default in all cases. For
    tinification efforts, not compiling SRCU unless necessary is desirable.

    The current patch tries to make compiling SRCU optional by introducing a new
    Kconfig option CONFIG_SRCU which is selected when any of the components making
    use of SRCU are selected.

    If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.

    text data bss dec hex filename
    2007 0 0 2007 7d7 kernel/rcu/srcu.o

    Size of arch/powerpc/boot/zImage changes from

    text data bss dec hex filename
    831552 64180 23944 919676 e087c arch/powerpc/boot/zImage : before
    829504 64180 23952 917636 e0084 arch/powerpc/boot/zImage : after

    so the savings are about ~2000 bytes.
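
    A minimal sketch of the new option; subsystems that need SRCU then
    select it:

        config SRCU
            bool
            # users do "select SRCU" rather than depending on it, so the
            # code is built only when something actually needs it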

    Signed-off-by: Pranith Kumar
    CC: Paul E. McKenney
    CC: Josh Triplett
    CC: Lai Jiangshan
    Signed-off-by: Paul E. McKenney
    [ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]

    Pranith Kumar
     

10 Oct, 2014

2 commits

  • Always mark pages with PageBalloon even if balloon compaction is disabled
    and expose this mark in /proc/kpageflags as KPF_BALLOON.

    Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
    "balloon_deflate" and "balloon_migrate". They accumulate balloon
    activity. Current size of balloon is (balloon_inflate - balloon_deflate)
    pages.

    All generic balloon code is now gathered under the option
    CONFIG_MEMORY_BALLOON. It should be selected by a ballooning driver
    which wants to use this feature. Currently virtio-balloon is the only
    user.
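
    A sketch of how this is wired up (the driver-side line is assumed):

        # mm/Kconfig (sketch)
        config MEMORY_BALLOON
            bool

        # the ballooning driver then opts in, e.g.:
        #     select MEMORY_BALLOON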

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This series implements general forms of get_user_pages_fast and
    __get_user_pages_fast in core code and activates them for arm and arm64.

    These are required for Transparent HugePages to function correctly, as a
    futex on a THP tail will otherwise result in an infinite loop (due to the
    core implementation of __get_user_pages_fast always returning 0).

    Unfortunately, a futex on THP tail can be quite common for certain
    workloads; thus THP is unreliable without a __get_user_pages_fast
    implementation.

    This series may also be beneficial for direct-IO heavy workloads and
    certain KVM workloads.

    This patch (of 6):

    get_user_pages_fast() attempts to pin user pages by walking the page
    tables directly and avoids taking locks. Thus the walker needs to be
    protected from page table pages being freed from under it, and needs to
    block any THP splits.

    One way to achieve this is to have the walker disable interrupts, and rely
    on IPIs from the TLB flushing code blocking before the page table pages
    are freed.

    On some platforms we have hardware broadcast of TLB invalidations, thus
    the TLB flushing code doesn't necessarily need to broadcast IPIs; and
    spuriously broadcasting IPIs can hurt system performance if done too
    often.

    This problem has been solved on PowerPC and Sparc by batching up page
    table pages belonging to more than one mm_user, then scheduling an
    rcu_sched callback to free the pages. This RCU page table free logic has
    been promoted to core code and is activated when one enables
    HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement their
    own get_user_pages_fast routines.

    The RCU page table free logic coupled with an IPI broadcast on THP split
    (which is a rare event), allows one to protect a page table walker by
    merely disabling the interrupts during the walk.

    This patch provides a general RCU implementation of get_user_pages_fast
    that can be used by architectures that perform hardware broadcast of TLB
    invalidations.

    It is based heavily on the PowerPC implementation by Nick Piggin.
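
    A sketch of the architecture opt-in (the option name
    HAVE_GENERIC_RCU_GUP is assumed; it is not named above):

        # mm/Kconfig (sketch)
        config HAVE_GENERIC_RCU_GUP
            bool

        # an architecture with hardware broadcast of TLB invalidations
        # then enables the generic fast-GUP code with:
        #     select HAVE_GENERIC_RCU_GUP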

    [akpm@linux-foundation.org: various comment fixes]
    Signed-off-by: Steve Capper
    Tested-by: Dann Frazier
    Reviewed-by: Catalin Marinas
    Acked-by: Hugh Dickins
    Cc: Russell King
    Cc: Mark Rutland
    Cc: Mel Gorman
    Cc: Will Deacon
    Cc: Christoffer Dall
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steve Capper
     

07 Aug, 2014

3 commits

  • Change zswap to use the zpool api instead of directly using zbud. Add a
    boot-time param to allow selecting which zpool implementation to use,
    with zbud as the default.

    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Weijie Yang
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add zpool api.

    zpool provides an interface for memory storage, typically of compressed
    memory. Users can select what backend to use; currently the only
    implementations are zbud, a low density implementation with up to two
    compressed pages per storage page, and zsmalloc, a higher density
    implementation with multiple compressed pages per storage page.

    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Currently, there are two users of CMA functionality: one is the DMA
    subsystem and the other is KVM on powerpc. They have their own code to
    manage the CMA reserved area even though it looks really similar. My
    guess is that this is caused by some needs of bitmap management: the KVM
    side wants to maintain a bitmap not for 1 page, but for a larger
    granularity. Eventually it uses a bitmap where one bit represents 64
    pages.

    When I implement CMA-related patches, I have to change both of those
    places to apply my change, and that seems painful to me. I want to
    change this situation and reduce future code management overhead through
    this patch.

    This change could also help developers who want to use CMA in their new
    feature development, since they can use CMA easily without copying and
    pasting this reserved area management code.

    In previous patches, we have prepared some features to generalize CMA
    reserved area management and now it's time to do it. This patch moves
    core functions to mm/cma.c and change DMA APIs to use these functions.

    There is no functional change in DMA APIs.

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Reviewed-by: Aneesh Kumar K.V
    Cc: Alexander Graf
    Cc: Aneesh Kumar K.V
    Cc: Gleb Natapov
    Acked-by: Marek Szyprowski
    Tested-by: Marek Szyprowski
    Cc: Paolo Bonzini
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Jun, 2014

3 commits

  • Now we can build zsmalloc as a module because unmap_kernel_range was
    exported.
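
    In Kconfig terms the change is simply making the option tristate
    (sketch):

        config ZSMALLOC
            tristate "Memory allocator for compressed pages"
            depends on MMU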

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • CONFIG_CROSS_MEMORY_ATTACH adds a couple of syscalls, process_vm_readv
    and process_vm_writev; it's a kind of IPC for copying data between
    processes. Currently this option is placed inside "Processor type and
    features".

    This patch moves it into "General setup" (where all other
    arch-independent syscalls and IPC features are placed) and changes the
    prompt string to something less cryptic.
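
    The option itself stays roughly as below; only its menu placement and
    prompt change (sketch):

        config CROSS_MEMORY_ATTACH
            bool "Cross Memory Support"
            depends on MMU
            default y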

    Signed-off-by: Konstantin Khlebnikov
    Cc: Christopher Yeoh
    Cc: Davidlohr Bueso
    Cc: Hugh Dickins
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Currently hugepage migration is available for all archs which support
    pmd-level hugepages, but testing is done only for x86_64 and there are
    bugs for other archs. So to avoid breaking such archs, this patch
    limits the availability strictly to x86_64 until developers of other
    archs get interested in enabling this feature.

    Simply disabling hugepage migration on non-x86_64 archs is not enough to
    fix the reported problem where sys_move_pages() hits the BUG_ON() in
    follow_page(FOLL_GET), so let's fix this by checking if hugepage
    migration is supported in vma_migratable().
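
    A plausible Kconfig shape for the restriction (the option name
    ARCH_ENABLE_HUGEPAGE_MIGRATION is assumed; it is not named above):

        # selected only by architectures where hugepage migration is tested
        config ARCH_ENABLE_HUGEPAGE_MIGRATION
            bool

        # e.g. on x86:
        #     select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64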

    Signed-off-by: Naoya Horiguchi
    Reported-by: Michael Ellerman
    Tested-by: Michael Ellerman
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: Tony Luck
    Cc: Russell King
    Cc: Martin Schwidefsky
    Cc: James Hogan
    Cc: Ralf Baechle
    Cc: David Miller
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi