14 Oct, 2020

1 commit

  • kmemleak-test.c is just a kmemleak test module and cannot be used as a
    built-in kernel module, so it does not belong in the mm/ directory. Move
    kmemleak-test.c to samples/kmemleak/kmemleak-test.c, and fix the spelling
    of "built-in" while at it.

    Signed-off-by: Hui Su
    Signed-off-by: Andrew Morton
    Cc: Catalin Marinas
    Cc: Jonathan Corbet
    Cc: Mauro Carvalho Chehab
    Cc: David S. Miller
    Cc: Rob Herring
    Cc: Masahiro Yamada
    Cc: Sam Ravnborg
    Cc: Josh Poimboeuf
    Cc: Steven Rostedt (VMware)
    Cc: Miguel Ojeda
    Cc: Divya Indi
    Cc: Tomas Winkler
    Cc: David Howells
    Link: https://lkml.kernel.org/r/20200925183729.GA172837@rlk
    Signed-off-by: Linus Torvalds

    Hui Su
     

08 Aug, 2020

1 commit

  • The functionality in lib/ioremap.c deals with page tables, vmalloc and
    caches, so it naturally belongs in mm/. Moving it there will also allow
    declaring the p?d_alloc_track functions in a header file inside mm/ rather
    than having those declarations in include/linux/mm.h.

    Suggested-by: Andrew Morton
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra (Intel)
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Geert Uytterhoeven
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-8-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

12 Jun, 2020

2 commits

  • Pull the Kernel Concurrency Sanitizer from Thomas Gleixner:
    "The Kernel Concurrency Sanitizer (KCSAN) is a dynamic race detector,
    which relies on compile-time instrumentation, and uses a
    watchpoint-based sampling approach to detect races.

    The feature was under development for quite some time and has already
    found legitimate bugs.

    Unfortunately it comes with a limitation, which was only understood
    late in the development cycle:

    It requires an up-to-date CLANG-11 compiler

    CLANG-11 is not yet released (scheduled for June), but it's the only
    compiler today which handles the kernel requirements and especially
    the annotations of functions to exclude them from KCSAN
    instrumentation correctly.

    These annotations really need to work so that low level entry code and
    especially int3 text poke handling can be completely isolated.

    A detailed discussion of the requirements and compiler issues can be
    found here:

    https://lore.kernel.org/lkml/CANpmjNMTsY_8241bS7=XAfqvZHFLrVEkv_uM4aDUWE_kh3Rvbw@mail.gmail.com/

    We came to the conclusion that trying to work around compiler
    limitations and bugs again would end up in a major trainwreck, so
    requiring a working compiler seemed to be the best choice.

    For Continuous Integration purposes the compiler restriction is
    manageable and that's where most xxSAN reports come from.

    For a change this limitation might make GCC people actually look at
    their bugs. Some issues with CSAN in GCC are 7 years old and one was
    'fixed' 3 years ago with a half-baked solution which 'solved' the
    reported issue but not the underlying problem.

    The KCSAN developers are also pondering a GCC plugin to become
    independent, but that's not something which will show up in a few
    days.

    Blocking KCSAN until widespread compiler support is available is not
    a really good alternative because the continuous growth of lockless
    optimizations in the kernel demands proper tooling support"

    * tag 'locking-kcsan-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
    compiler_types.h, kasan: Use __SANITIZE_ADDRESS__ instead of CONFIG_KASAN to decide inlining
    compiler.h: Move function attributes to compiler_types.h
    compiler.h: Avoid nested statement expression in data_race()
    compiler.h: Remove data_race() and unnecessary checks from {READ,WRITE}_ONCE()
    kcsan: Update Documentation to change supported compilers
    kcsan: Remove 'noinline' from __no_kcsan_or_inline
    kcsan: Pass option tsan-instrument-read-before-write to Clang
    kcsan: Support distinguishing volatile accesses
    kcsan: Restrict supported compilers
    kcsan: Avoid inserting __tsan_func_entry/exit if possible
    ubsan, kcsan: Don't combine sanitizer with kcov on clang
    objtool, kcsan: Add kcsan_disable_current() and kcsan_enable_current_nowarn()
    kcsan: Add __kcsan_{enable,disable}_current() variants
    checkpatch: Warn about data_race() without comment
    kcsan: Use GFP_ATOMIC under spin lock
    Improve KCSAN documentation a bit
    kcsan: Make reporting aware of KCSAN tests
    kcsan: Fix function matching in report
    kcsan: Change data_race() to no longer require marking racing accesses
    kcsan: Move kcsan_{disable,enable}_current() to kcsan-checks.h
    ...

    Linus Torvalds
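
    As an illustration of the data_race() annotation mentioned in the commit
    list above, here is a minimal, hypothetical sketch (the struct and function
    names are made up) of marking an intentionally racy diagnostic read so that
    KCSAN does not report it:

    /* Hypothetical sketch: an intentionally racy statistics read annotated
     * with data_race() so KCSAN does not flag it. */
    #include <linux/compiler.h>

    struct hit_stats {
            unsigned long hits;     /* updated without locking elsewhere */
    };

    static unsigned long read_hits_unlocked(struct hit_stats *s)
    {
            /* The value is only used for reporting; a stale read is fine. */
            return data_race(s->hits);
    }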
     
  • Merge the state of the locking kcsan branch before the read/write_once()
    and the atomics modifications got merged.

    Squash the fallout of the rebase on top of the read/write once and atomic
    fallback work into the merge. The history of the original branch is
    preserved in tag locking-kcsan-2020-06-02.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

11 Jun, 2020

1 commit

  • Patch series "improve use_mm / unuse_mm", v2.

    This series improves the use_mm / unuse_mm interface by better documenting
    the assumptions, and by moving the set_fs manipulations spread over the
    callers into the core API.

    This patch (of 3):

    Use the proper API instead.

    Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de

    These helpers are only for use with kernel threads, and I will tie them
    more into the kthread infrastructure going forward. Also move the
    prototypes to kthread.h - mmu_context.h was a little weird to start with
    as it otherwise contains very low-level MM bits.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Tested-by: Jens Axboe
    Reviewed-by: Jens Axboe
    Acked-by: Felix Kuehling
    Cc: Alex Deucher
    Cc: Al Viro
    Cc: Felipe Balbi
    Cc: Jason Wang
    Cc: "Michael S. Tsirkin"
    Cc: Zhenyu Wang
    Cc: Zhi Wang
    Cc: Greg Kroah-Hartman
    Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
    Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de
    Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
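
    A minimal sketch of the use_mm()/unuse_mm() pattern this series documents,
    assuming a kernel thread that must temporarily operate on a user address
    space; the helper name and error handling here are illustrative, not part
    of the series:

    /* Sketch only: a kernel thread temporarily adopting a user mm. */
    #include <linux/kthread.h>
    #include <linux/mm_types.h>
    #include <linux/uaccess.h>

    static int kthread_copy_from_user_mm(struct mm_struct *mm, void *dst,
                                         const void __user *src, size_t len)
    {
            int ret;

            use_mm(mm);                     /* adopt the user address space */
            ret = copy_from_user(dst, src, len) ? -EFAULT : 0;
            unuse_mm(mm);                   /* and drop it again */
            return ret;
    }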
     

05 Jun, 2020

1 commit

  • This adds tests which will validate architecture page table helpers and
    other accessors in their compliance with expected generic MM semantics.
    This will help various architectures in validating changes to existing
    page table helpers or addition of new ones.

    This test covers basic page table entry transformations including but not
    limited to old, young, dirty, clean, write, write protect etc at various
    level along with populating intermediate entries with next page table page
    and validating them.

    Test page table pages are allocated from system memory with required size
    and alignments. The mapped pfns at page table levels are derived from a
    real pfn representing a valid kernel text symbol. This test gets called
    via late_initcall().

    This test gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected.
    Any architecture which is willing to subscribe to this test will need to
    select ARCH_HAS_DEBUG_VM_PGTABLE. For now this is limited to arc, arm64,
    x86, s390 and powerpc platforms, where the test is known to build and run
    successfully. Going forward, other architectures too can subscribe to the
    test after fixing any build or runtime problems with their page table helpers.

    Folks interested in making sure that a given platform's page table helpers
    conform to expected generic MM semantics should enable the above config
    which will just trigger this test during boot. Any non-conformity here
    will be reported as a warning which would need to be fixed. This test
    will help catch any changes to the agreed-upon semantics expected from
    generic MM and enable platforms to accommodate them thereafter.

    [anshuman.khandual@arm.com: v17]
    Link: http://lkml.kernel.org/r/1587436495-22033-3-git-send-email-anshuman.khandual@arm.com
    [anshuman.khandual@arm.com: v18]
    Link: http://lkml.kernel.org/r/1588564865-31160-3-git-send-email-anshuman.khandual@arm.com
    Suggested-by: Catalin Marinas
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Christophe Leroy
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Tested-by: Gerald Schaefer [s390]
    Tested-by: Christophe Leroy [ppc32]
    Reviewed-by: Ingo Molnar
    Cc: Mike Rapoport
    Cc: Vineet Gupta
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Christian Borntraeger
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Kirill A. Shutemov
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Link: http://lkml.kernel.org/r/1583919272-24178-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
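
    The kind of check the test performs can be sketched as follows; this is a
    simplified, hypothetical example of validating basic pte transformation
    round trips, not the actual mm/debug_vm_pgtable.c code:

    /* Hedged sketch: assert expected generic MM semantics for basic pte
     * helpers (old/young, clean/dirty, write/write-protect round trips). */
    #include <linux/mm.h>

    static void __init pte_basic_sanity(pte_t pte)
    {
            WARN_ON(!pte_same(pte, pte));
            WARN_ON(!pte_young(pte_mkyoung(pte_mkold(pte))));
            WARN_ON(!pte_dirty(pte_mkdirty(pte_mkclean(pte))));
            WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte))));
            WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte))));
    }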
     

08 Apr, 2020

1 commit

  • In order to pave the way for free page reporting in virtualized
    environments we will need a way to get pages out of the free lists and
    identify those pages after they have been returned. To accomplish this,
    this patch adds the concept of a Reported Buddy, which is essentially
    meant to just be the Uptodate flag used in conjunction with the Buddy page
    type.

    To prevent the reported pages from leaking outside of the buddy lists I
    added a check to clear the PageReported bit in the del_page_from_free_list
    function. As a result any reported page that is split, merged, or
    allocated will have the flag cleared prior to the PageBuddy value being
    cleared.

    The process for reporting pages is fairly simple. Once we free a page
    that meets the minimum order for page reporting we will schedule a worker
    thread to start 2s or more in the future. That worker thread will begin
    working from the lowest supported page reporting order up to MAX_ORDER - 1
    pulling unreported pages from the free list and storing them in the
    scatterlist.

    When processing each individual free list it is necessary for the worker
    thread to release the zone lock when it needs to stop and report the full
    scatterlist of pages. To reduce the work of the next iteration the worker
    thread will rotate the free list so that the first unreported page in the
    free list becomes the first entry in the list.

    It will then call a reporting function providing information on how many
    entries are in the scatterlist. Once the function completes it will
    return the pages to the free area from which they were allocated and start
    over pulling more pages from the free areas until there are no longer
    enough pages to report on to keep the worker busy, or we have processed as
    many pages as were contained in the free area when we started processing
    the list.

    The worker thread will work in a round-robin fashion making its way through
    each zone requesting reporting, and through each reportable free list
    within that zone. Once all free areas within the zone have been processed
    it will check to see if there have been any requests for reporting while
    it was processing. If so it will reschedule the worker thread to start up
    again in roughly 2s and exit.

    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Konrad Rzeszutek Wilk
    Cc: Luiz Capitulino
    Cc: Matthew Wilcox
    Cc: Michael S. Tsirkin
    Cc: Michal Hocko
    Cc: Nitesh Narayan Lal
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Wei Wang
    Cc: Yang Zhang
    Cc: wei qi
    Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain
    Signed-off-by: Linus Torvalds

    Alexander Duyck
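
    A heavily hedged sketch of the delayed-work scheduling pattern described
    above; all names here are illustrative and are not the real
    mm/page_reporting.c interfaces:

    /* Illustrative only: schedule a reporting pass "2s or more in the future". */
    #include <linux/workqueue.h>
    #include <linux/scatterlist.h>

    #define REPORTING_DELAY         (2 * HZ)        /* "2s or more in the future" */
    #define REPORTING_CAPACITY      32              /* illustrative batch size */

    static void page_reporting_worker(struct work_struct *work);
    static DECLARE_DELAYED_WORK(reporting_work, page_reporting_worker);
    static struct scatterlist reporting_sgl[REPORTING_CAPACITY];

    /* Called when a page of at least the minimum reporting order is freed. */
    static void page_reporting_request(void)
    {
            schedule_delayed_work(&reporting_work, REPORTING_DELAY);
    }

    static void page_reporting_worker(struct work_struct *work)
    {
            sg_init_table(reporting_sgl, REPORTING_CAPACITY);
            /*
             * Pull unreported pages from the free lists into reporting_sgl,
             * drop the zone lock, hand the scatterlist to the reporting
             * callback, then return the pages to the free area they came
             * from, until there is no longer enough work to stay busy.
             */
    }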
     

03 Apr, 2020

1 commit

  • Kmemleak could scan task stacks while plain writes happen to those stack
    variables, which could result in data races. For example, in
    sys_rt_sigaction and do_sigaction(), it could have plain writes of a
    32-byte size. Since kmemleak does not care about the actual values of
    a non-pointer and all do_sigaction() call sites only copy to stack
    variables, just disable KCSAN for kmemleak to avoid annotating anything
    outside Kmemleak just because Kmemleak scans everything.

    Suggested-by: Marco Elver
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Acked-by: Marco Elver
    Acked-by: Catalin Marinas
    Link: http://lkml.kernel.org/r/1583263716-25150-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     

04 Feb, 2020

1 commit

  • Add a generic version of page table dumping that architectures can opt-in
    to.

    Link: http://lkml.kernel.org/r/20191218162402.45610-20-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     

01 Feb, 2020

1 commit

  • Don't instrument 3 more files that contain debugging facilities and
    produce large amounts of uninteresting coverage for every syscall.

    The following snippets are sprinkled all over the place in kcov traces
    in a debugging kernel. We already try to disable instrumentation of
    stack unwinding code and of most debug facilities. I guess we did not
    use fault-inject.c at the time, and stacktrace.c was somehow missed (or
    something has changed in kernel/configs). This change both speeds up
    kcov (kernel doesn't need to store these PCs, user-space doesn't need to
    process them) and frees trace buffer capacity for more useful coverage.

    should_fail
    lib/fault-inject.c:149
    fail_dump
    lib/fault-inject.c:45

    stack_trace_save
    kernel/stacktrace.c:124
    stack_trace_consume_entry
    kernel/stacktrace.c:86
    stack_trace_consume_entry
    kernel/stacktrace.c:89
    ... a hundred frames skipped ...
    stack_trace_consume_entry
    kernel/stacktrace.c:93
    stack_trace_consume_entry
    kernel/stacktrace.c:86

    Link: http://lkml.kernel.org/r/20200116111449.217744-1-dvyukov@gmail.com
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

06 Nov, 2019

1 commit

  • Add two utilities to 1) write-protect and 2) clean all ptes pointing into
    a range of an address space.
    The utilities are intended to aid in tracking dirty pages (either
    driver-allocated system memory or pci device memory).
    The write-protect utility should be used in conjunction with
    page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
    accesses. Typically one would want to use this on sparse accesses into
    large memory regions. The clean utility should be used to utilize
    hardware dirtying functionality and avoid the overhead of page-faults,
    typically on large accesses into small memory regions.

    Cc: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Will Deacon
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Huang Ying
    Cc: Jérôme Glisse
    Cc: Kirill A. Shutemov
    Signed-off-by: Thomas Hellstrom
    Acked-by: Andrew Morton

    Thomas Hellstrom
     

25 Sep, 2019

2 commits

  • When compiling a kernel with W=1, there are several of those warnings due
    to arm64 overriding a field on purpose. Just disable those warnings for
    both GCC and Clang for this file; this will help dig out the "gems" hidden
    in the W=1 warnings by reducing some of the noise.

    mm/init-mm.c:39:2: warning: initializer overrides prior initialization
    of this subobject [-Winitializer-overrides]
    INIT_MM_CONTEXT(init_mm)
    ^~~~~~~~~~~~~~~~~~~~~~~~
    ./arch/arm64/include/asm/mmu.h:133:9: note: expanded from macro
    'INIT_MM_CONTEXT'
    .pgd = init_pg_dir,
    ^~~~~~~~~~~
    mm/init-mm.c:30:10: note: previous initialization is here
    .pgd = swapper_pg_dir,
    ^~~~~~~~~~~~~~

    Note: there is a side project trying to support explicitly allowing
    specific initializer overrides in Clang, but there is no guarantee it
    will happen or not.

    https://github.com/ClangBuiltLinux/linux/issues/639

    Link: http://lkml.kernel.org/r/1566920867-27453-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai
    Cc: Nick Desaulniers
    Cc: Masahiro Yamada
    Cc: Mark Rutland
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
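
    For reference, a standalone snippet (unrelated to init-mm itself) that
    reproduces the -Winitializer-overrides warning being silenced here:

    /* A later designated initializer intentionally overrides an earlier one;
     * clang with W=1 emits -Winitializer-overrides for the second ".x". */
    struct example_cfg {
            int x;
            int y;
    };

    static struct example_cfg cfg = {
            .x = 1,
            .y = 2,
            .x = 3,         /* overrides the earlier .x = 1 on purpose */
    };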
     
  • Patch series "mm: remove quicklist page table caches".

    A while ago Nicholas proposed to remove quicklist page table caches [1].

    I've rebased his patch on the current upstream and switched ia64 and sh to
    use generic versions of PTE allocation.

    [1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com

    This patch (of 3):

    Remove page table allocator "quicklists". These have been around for a
    long time, but have not got much traction in the last decade and are only
    used on ia64 and sh architectures.

    The numbers in the initial commit look interesting but probably don't
    apply anymore. If anybody wants to resurrect this it's in the git
    history, but it's unhelpful to have this code and divergent allocator
    behaviour for minor archs.

    Also it might be better to instead make more general improvements to the
    page allocator if this is still so slow.

    Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Mike Rapoport
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

03 Aug, 2019

1 commit

  • memremap.c implements MM functionality for ZONE_DEVICE, so it really
    should be in the mm/ directory, not the kernel/ one.

    Link: http://lkml.kernel.org/r/20190722094143.18387-1-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Anshuman Khandual
    Acked-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

15 Jul, 2019

1 commit

  • Pull HMM updates from Jason Gunthorpe:
    "Improvements and bug fixes for the hmm interface in the kernel:

    - Improve clarity, locking and APIs related to the 'hmm mirror'
    feature merged last cycle. In linux-next we now see AMDGPU and
    nouveau to be using this API.

    - Remove old or transitional hmm APIs. These are hold overs from the
    past with no users, or APIs that existed only to manage cross tree
    conflicts. There are still a few more of these cleanups that didn't
    make the merge window cut off.

    - Improve some core mm APIs:
    - export alloc_pages_vma() for driver use
    - refactor into devm_request_free_mem_region() to manage
    DEVICE_PRIVATE resource reservations
    - refactor duplicative driver code into the core dev_pagemap
    struct

    - Remove hmm wrappers of improved core mm APIs, instead have drivers
    use the simplified API directly

    - Remove DEVICE_PUBLIC

    - Simplify the kconfig flow for the hmm users and core code"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
    mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
    mm: remove the HMM config option
    mm: sort out the DEVICE_PRIVATE Kconfig mess
    mm: simplify ZONE_DEVICE page private data
    mm: remove hmm_devmem_add
    mm: remove hmm_vma_alloc_locked_page
    nouveau: use devm_memremap_pages directly
    nouveau: use alloc_page_vma directly
    PCI/P2PDMA: use the dev_pagemap internal refcount
    device-dax: use the dev_pagemap internal refcount
    memremap: provide an optional internal refcount in struct dev_pagemap
    memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
    memremap: remove the data field in struct dev_pagemap
    memremap: add a migrate_to_ram method to struct dev_pagemap_ops
    memremap: lift the devmap_enable manipulation into devm_memremap_pages
    memremap: pass a struct dev_pagemap to ->kill and ->cleanup
    memremap: move dev_pagemap callbacks into a separate structure
    memremap: validate the pagemap type passed to devm_memremap_pages
    mm: factor out a devm_request_free_mem_region helper
    mm: export alloc_pages_vma
    ...

    Linus Torvalds
     

13 Jul, 2019

1 commit

  • Always build mm/gup.c so that we don't have to provide separate nommu
    stubs. Also merge the get_user_pages_fast and __get_user_pages_fast stubs
    when HAVE_FAST_GUP into the main implementations, which will never call
    the fast path if HAVE_FAST_GUP is not set.

    This also ensures the new put_user_pages* helpers are available for nommu,
    as those are currently missing, which would create a problem as soon as we
    actually grew users for it.

    Link: http://lkml.kernel.org/r/20190625143715.1689-13-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Cc: Andrey Konovalov
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: Khalid Aziz
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

03 Jul, 2019

1 commit

  • All the mm/hmm.c code is better keyed off HMM_MIRROR. Also let nouveau
    depend on it instead of the mix of a dummy dependency symbol plus the
    actually selected one. Drop various odd dependencies, as the code is
    pretty portable.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ira Weiny
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

15 May, 2019

1 commit

  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has been previously exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an un-contended cache only to see that performance
    degrade over time, even below the average cache performance due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool saw a
    3X speedup in a contrived case that tries to force cache conflicts.
    The contrived case used the numa_emulation capability to force an
    instance of the benchmark to be run in two of the near-memory sized
    numa nodes. If both instances were placed on the same emulated node they
    would fit and cause zero conflicts. While on separate emulated nodes
    without randomization they underutilized the cache and conflicted
    unnecessarily due to the in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].

    While there is potentially significant improvement for applications that
    depend on low latency access across a wide working-set, the performance
    may be negligible to negative for other workloads. For this reason the
    shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those systems.

    Outside of memory-side-cache utilization concerns there is potentially
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be predictably allocated in order.
    However, it should be noted, the concrete security benefits are hard to
    quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e. 10
    (4MB); this trades off randomization granularity for time spent shuffling.
    MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
    while still showing memory-side cache behavior improvements, and the
    expectation that the security implications of finer granularity
    randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
    performance impact of the shuffling appears to be in the noise compared to
    other memory initialization work.

    This initial randomization can be undone over time so a follow-on patch is
    introduced to inject entropy on page free decisions. It is reasonable to
    ask if the page free entropy is sufficient, but it is not enough due to
    the in-order initial freeing of pages. At the start of that process
    putting page1 in front of or behind page0 still keeps them close together,
    page2 is still near page1 and has a high chance of being adjacent. As
    more pages are added ordering diversity improves, but there is still high
    page locality for the low address pages and this leads to no significant
    impact to the cache conflict rate.

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
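
    The shuffle itself is a standard Fisher-Yates pass; a generic, userspace-style
    sketch of the algorithm follows (the kernel operates on free_area lists and
    uses its own random helpers, so this is only an illustration):

    /* Generic Fisher-Yates shuffle sketch; not the kernel's shuffle_zone(). */
    #include <stdlib.h>

    static void fisher_yates(int *a, int n)
    {
            for (int i = n - 1; i > 0; i--) {
                    int j = rand() % (i + 1);   /* kernel code uses get_random_* */
                    int tmp = a[i];

                    a[i] = a[j];
                    a[j] = tmp;
            }
    }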
     

31 Oct, 2018

3 commits

  • Move a few remaining functions from nobootmem.c to memblock.c and remove
    nobootmem.c.

    Link: http://lkml.kernel.org/r/1536927045-23536-28-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • All architectures use memblock for early memory management. There is no need
    for the CONFIG_HAVE_MEMBLOCK configuration option.

    [rppt@linux.vnet.ibm.com: of/fdt: fixup #ifdefs]
    Link: http://lkml.kernel.org/r/20180919103457.GA20545@rapoport-lnx
    [rppt@linux.vnet.ibm.com: csky: fixups after bootmem removal]
    Link: http://lkml.kernel.org/r/20180926112744.GC4628@rapoport-lnx
    [rppt@linux.vnet.ibm.com: remove stale #else and the code it protects]
    Link: http://lkml.kernel.org/r/1538067825-24835-1-git-send-email-rppt@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1536927045-23536-4-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Tested-by: Jonathan Cameron
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • All architectures select NO_BOOTMEM, which essentially becomes 'Y' for any
    kernel configuration and therefore it can be removed.

    [alexander.h.duyck@linux.intel.com: remove now defunct NO_BOOTMEM from depends list for deferred init]
    Link: http://lkml.kernel.org/r/20180925201814.3576.15105.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/1536927045-23536-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Alexander Duyck
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

23 Oct, 2018

1 commit

  • Pull arm64 updates from Catalin Marinas:
    "Apart from some new arm64 features and clean-ups, this also contains
    the core mmu_gather changes for tracking the levels of the page table
    being cleared and a minor update to the generic
    compat_sys_sigaltstack() introducing COMPAT_SIGMINSTKSZ.

    Summary:

    - Core mmu_gather changes which allow tracking the levels of
    page-table being cleared together with the arm64 low-level flushing
    routines

    - Support for the new ARMv8.5 PSTATE.SSBS bit which can be used to
    mitigate Spectre-v4 dynamically without trapping to EL3 firmware

    - Introduce COMPAT_SIGMINSTKSZ for use in compat_sys_sigaltstack

    - Optimise emulation of MRS instructions to ID_* registers on ARMv8.4

    - Support for Common Not Private (CnP) translations allowing threads
    of the same CPU to share the TLB entries

    - Accelerated crc32 routines

    - Move swapper_pg_dir to the rodata section

    - Trap WFI instruction executed in user space

    - ARM erratum 1188874 workaround (arch_timer)

    - Miscellaneous fixes and clean-ups"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (78 commits)
    arm64: KVM: Guests can skip __install_bp_hardening_cb()s HYP work
    arm64: cpufeature: Trap CTR_EL0 access only where it is necessary
    arm64: cpufeature: Fix handling of CTR_EL0.IDC field
    arm64: cpufeature: ctr: Fix cpu capability check for late CPUs
    Documentation/arm64: HugeTLB page implementation
    arm64: mm: Use __pa_symbol() for set_swapper_pgd()
    arm64: Add silicon-errata.txt entry for ARM erratum 1188873
    Revert "arm64: uaccess: implement unsafe accessors"
    arm64: mm: Drop the unused cpu parameter
    MAINTAINERS: fix bad sdei paths
    arm64: mm: Use #ifdef for the __PAGETABLE_P?D_FOLDED defines
    arm64: Fix typo in a comment in arch/arm64/mm/kasan_init.c
    arm64: xen: Use existing helper to check interrupt status
    arm64: Use daifflag_restore after bp_hardening
    arm64: daifflags: Use irqflags functions for daifflags
    arm64: arch_timer: avoid unused function warning
    arm64: Trap WFI executed in userspace
    arm64: docs: Document SSBS HWCAP
    arm64: docs: Fix typos in ELF hwcaps
    arm64/kprobes: remove an extra semicolon in arch_prepare_kprobe
    ...

    Linus Torvalds
     

31 Aug, 2018

1 commit

  • The implementation of readahead(2) syscall is identical to that of
    fadvise64(POSIX_FADV_WILLNEED) with a few exceptions:
    1. readahead(2) returns -EINVAL for !mapping->a_ops and fadvise64()
    ignores the request and returns 0.
    2. fadvise64() checks for integer overflow corner case
    3. fadvise64() calls the optional filesystem fadvise() file operation

    Unite the two implementations by calling vfs_fadvise() from readahead(2)
    syscall. Check the !mapping->a_ops in readahead(2) syscall to preserve
    documented syscall ABI behaviour.

    Suggested-by: Miklos Szeredi
    Fixes: d1d04ef8572b ("ovl: stack file ops")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
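
    From userspace, the equivalence described above looks like this (minimal
    sketch with abbreviated error handling; the file path is just an example):

    /* readahead(2) and posix_fadvise(POSIX_FADV_WILLNEED) both start
     * asynchronous readahead on a file range. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/etc/hostname", O_RDONLY);

            if (fd < 0)
                    return 1;
            if (readahead(fd, 0, 4096) < 0)         /* fails with EINVAL if !mapping->a_ops */
                    perror("readahead");
            if (posix_fadvise(fd, 0, 4096, POSIX_FADV_WILLNEED))
                    fprintf(stderr, "fadvise failed\n");    /* same request, ignored if unsupported */
            close(fd);
            return 0;
    }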
     

08 Jun, 2018

1 commit

  • With the addition of memfd hugetlbfs support, we now have the situation
    where memfd depends on TMPFS -or- HUGETLBFS. Previously, memfd was only
    supported on tmpfs, so it made sense that the code resided in shmem.c.
    In the current code, memfd is only functional if TMPFS is defined. If
    HUGETLBFS is defined and TMPFS is not defined, then memfd functionality
    will not be available for hugetlbfs. This does not cause BUGs, just a
    lack of potentially desired functionality.

    Code is restructured in the following way:
    - include/linux/memfd.h is a new file containing memfd specific
    definitions previously contained in shmem_fs.h.
    - mm/memfd.c is a new file containing memfd specific code previously
    contained in shmem.c.
    - memfd specific code is removed from shmem_fs.h and shmem.c.
    - A new config option MEMFD_CREATE is added that is defined if TMPFS
    or HUGETLBFS is defined.

    No functional changes are made to the code: restructuring only.

    Link: http://lkml.kernel.org/r/20180415182119.4517-4-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Khalid Aziz
    Cc: Andrea Arcangeli
    Cc: David Herrmann
    Cc: Hugh Dickins
    Cc: Marc-André Lureau
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
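
    From userspace the dependency is visible through memfd_create(2), which can
    be backed by either filesystem; a minimal sketch (requires a reasonably
    recent glibc for the memfd_create() wrapper):

    /* memfd_create() backed by tmpfs (default) or hugetlbfs (MFD_HUGETLB). */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd  = memfd_create("example", MFD_CLOEXEC);         /* tmpfs */
            int hfd = memfd_create("example-huge", MFD_HUGETLB);    /* hugetlbfs */

            if (fd < 0 || hfd < 0)
                    perror("memfd_create");
            if (fd >= 0)
                    close(fd);
            if (hfd >= 0)
                    close(hfd);
            return 0;
    }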
     

06 Apr, 2018

1 commit

  • For mm/swap_slots.c, use the traditional Linux method of conditional
    compilation and linking instead of always compiling it by using #ifdef
    CONFIG_SWAP and #endif for the entire source file (excluding header
    files).

    Link: http://lkml.kernel.org/r/c2a47015-0b5a-d0d9-8bc7-9984c049df20@infradead.org
    Signed-off-by: Randy Dunlap
    Acked-by: Tim Chen
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

18 Nov, 2017

1 commit

  • Performance of get_user_pages_fast() is critical for some workloads, but
    it's tricky to test it directly.

    This patch provides /sys/kernel/debug/gup_benchmark that helps with
    testing performance of it.

    See tools/testing/selftests/vm/gup_benchmark.c for userspace
    counterpart.

    Link: http://lkml.kernel.org/r/20170908215603.9189-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Shuah Khan
    Cc: Ingo Molnar
    Cc: Thorsten Leemhuis
    Cc: Jonathan Corbet
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Nov, 2017

1 commit

  • Fix up makefiles, remove references, and git rm kmemcheck.

    Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Steven Rostedt
    Cc: Vegard Nossum
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Eric W. Biederman
    Cc: Alexander Potapenko
    Cc: Tim Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Levin, Alexander (Sasha Levin)
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
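
    For files with no other licensing information, the added header is a single
    line, e.g. at the top of a C source file:

    // SPDX-License-Identifier: GPL-2.0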
     

09 Sep, 2017

2 commits

  • This moves all new code, including the new page migration helper, behind a
    kernel Kconfig option so that there is no code bloat for arches or users
    that do not want to use HMM or any of its associated features.

    arm allyesconfig (first without this patchset, then with it plus this patch):
    text data bss dec hex filename
    83721896 46511131 27582964 157815991 96814b7 ../without/vmlinux
    83722364 46511131 27582964 157816459 968168b vmlinux

    [jglisse@redhat.com: struct hmm is only use by HMM mirror functionality]
    Link: http://lkml.kernel.org/r/20170825213133.27286-1-jglisse@redhat.com
    [sfr@canb.auug.org.au: fix build (arm multi_v7_defconfig)]
    Link: http://lkml.kernel.org/r/20170828181849.323ab81b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170818032858.7447-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Stephen Rothwell
    Cc: Dan Williams
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • HMM provides 3 separate types of functionality:
    - Mirroring: synchronize CPU page table and device page table
    - Device memory: allocating struct page for device memory
    - Migration: migrating regular memory to device memory

    This patch introduces some common helpers and definitions used by all
    three of those functionalities.

    Link: http://lkml.kernel.org/r/20170817000548.32038-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Evgeny Baskakov
    Signed-off-by: John Hubbard
    Signed-off-by: Mark Hairgrove
    Signed-off-by: Sherry Cheung
    Signed-off-by: Subhash Gutti
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

21 Jun, 2017

1 commit

  • There is limited visibility into the use of percpu memory leaving us
    unable to reason about correctness of parameters and overall use of
    percpu memory. These counters and statistics aim to help understand
    basic statistics about percpu memory such as number of allocations over
    the lifetime, allocation sizes, and fragmentation.

    New Config: PERCPU_STATS

    Signed-off-by: Dennis Zhou
    Signed-off-by: Tejun Heo

    Dennis Zhou
     

28 Feb, 2017

1 commit

  • This patch makes arch-independent testcases for RODATA. Both x86 and
    x86_64 already have testcases for RODATA, but they are arch-specific
    because they use inline assembly directly.

    Also, cacheflush.h is not a suitable location for rodata-test related
    things: since they were in cacheflush.h, changing the state of
    CONFIG_DEBUG_RODATA_TEST caused extra kernel build overhead.

    To solve the above issues, write arch-independent testcases and move it
    to shared location.

    [jinb.park7@gmail.com: fix config dependency]
    Link: http://lkml.kernel.org/r/20170209131625.GA16954@pjb1027-Latitude-E5410
    Link: http://lkml.kernel.org/r/20170129105436.GA9303@pjb1027-Latitude-E5410
    Signed-off-by: Jinbum Park
    Acked-by: Kees Cook
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Arjan van de Ven
    Cc: Laura Abbott
    Cc: Russell King
    Cc: Valentin Rothberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jinbum Park
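
    The arch-independent check can be sketched roughly as below: attempt a
    kernel-space write into .rodata and verify that it fails. This is a
    simplified sketch; the helper names mirror the kernel APIs of that era, and
    the variable/function names are illustrative:

    /* Simplified sketch of an arch-independent rodata test. */
    #include <linux/kernel.h>
    #include <linux/uaccess.h>

    static const int rodata_test_data = 0xC3;       /* placed in .rodata */

    static int __init rodata_test_sketch(void)
    {
            int zero = 0;

            /* A write into .rodata must fault and be rejected. */
            if (!probe_kernel_write((void *)&rodata_test_data, &zero, sizeof(zero))) {
                    pr_err("rodata_test: .rodata section is writable!\n");
                    return -ENODEV;
            }
            pr_info("rodata_test: write to .rodata correctly failed\n");
            return 0;
    }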
     

25 Feb, 2017

1 commit

  • Introduce a new interface to check if a page is mapped into a vma. It
    aims to address shortcomings of page_check_address{,_transhuge}.

    The existing interface is not able to handle PTE-mapped THPs: it only finds
    the first PTE; the rest are left unnoticed.

    page_vma_mapped_walk() iterates over all possible mapping of the page in
    the vma.

    Link: http://lkml.kernel.org/r/20170129173858.45174-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
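
    Usage follows the pattern of the page_mapped_in_vma() style helper built on
    top of the walk; a hedged sketch (the wrapper name here is made up):

    /* Sketch: check whether 'page' is mapped anywhere in 'vma'. */
    #include <linux/mm.h>
    #include <linux/rmap.h>

    static bool page_is_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
    {
            struct page_vma_mapped_walk pvmw = {
                    .page = page,
                    .vma = vma,
                    .flags = PVMW_SYNC,
            };

            pvmw.address = page_address_in_vma(page, vma);
            if (pvmw.address == -EFAULT)
                    return false;
            if (!page_vma_mapped_walk(&pvmw))
                    return false;
            page_vma_mapped_walk_done(&pvmw);       /* drop the ptl taken by the walk */
            return true;
    }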
     

23 Feb, 2017

1 commit

  • We add per cpu caches for swap slots that can be allocated and freed
    quickly without the need to touch the swap info lock.

    Two separate caches are maintained for swap slots allocated and swap
    slots returned. This is to allow the swap slots to be returned to the
    global pool in a batch so they will have a chance to be coalesced with
    other slots in a cluster. We do not reuse the slots that are returned
    right away, as it may increase fragmentation of the slots.

    The swap allocation cache is protected by a mutex as we may sleep when
    searching for empty slots in cache. The swap free cache is protected by
    a spin lock as we cannot sleep in the free path.

    We refill the swap slots cache when we run out of slots, and we disable
    the swap slots cache and drain the slots if the global number of slots
    falls below a low watermark threshold. We re-enable the cache again when
    the slots available are above a high watermark.

    [ying.huang@intel.com: use raw_cpu_ptr over this_cpu_ptr for swap slots access]
    [tim.c.chen@linux.intel.com: add comments on locks in swap_slots.h]
    Link: http://lkml.kernel.org/r/20170118180327.GA24225@linux.intel.com
    Link: http://lkml.kernel.org/r/35de301a4eaa8daa2977de6e987f2c154385eb66.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: Tim Chen
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Michal Hocko
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
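
    A heavily simplified, hypothetical sketch of the allocation-side per-cpu
    cache described above; the structure and names are illustrative and do not
    match the real mm/swap_slots.c:

    /* Illustrative only: per-cpu cache of pre-allocated swap slots. */
    #include <linux/mutex.h>
    #include <linux/percpu.h>
    #include <linux/swap.h>

    #define SLOTS_CACHE_SIZE 64             /* illustrative batch size */

    struct swap_slots_cache_sketch {
            struct mutex alloc_lock;        /* may sleep while refilling */
            swp_entry_t slots[SLOTS_CACHE_SIZE];
            int nr;
    };
    /* alloc_lock must be initialized with mutex_init() at cpu bring-up (omitted). */
    static DEFINE_PER_CPU(struct swap_slots_cache_sketch, slots_cache);

    /* Fast path: hand out a cached slot without touching swap_info locks. */
    static bool get_cached_swap_slot(swp_entry_t *entry)
    {
            struct swap_slots_cache_sketch *cache = raw_cpu_ptr(&slots_cache);
            bool hit = false;

            mutex_lock(&cache->alloc_lock);
            if (cache->nr > 0) {
                    *entry = cache->slots[--cache->nr];
                    hit = true;
            }
            mutex_unlock(&cache->alloc_lock);
            return hit;
    }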
     

13 Oct, 2016

1 commit

  • This effectively reverts commit 377ccbb48373 ("Makefile: Mute warning
    for __builtin_return_address(>0) for tracing only") because it turns out
    that it really isn't tracing only - it's all over the tree.

    We already also had the warning disabled separately for mm/usercopy.c
    (which this commit also removes), and it turns out that we will also
    want to disable it for get_lock_parent_ip(), that is used for at least
    TRACE_IRQFLAGS. Which (when enabled) ends up being all over the tree.

    Steven Rostedt had a patch that tried to limit it to just the config
    options that actually triggered this, but quite frankly, the extra
    complexity and abstraction just isn't worth it. We have never actually
    had a case where the warning is actually useful, so let's just disable
    it globally and not worry about it.

    Acked-by: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Peter Anvin
    Signed-off-by: Linus Torvalds

    Linus Torvalds