15 May, 2019

1 commit

  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has previously been exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    it does reduce the chance that they will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an uncontended cache, only to see performance
    degrade over time, even below the average cache performance, due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool saw a
    3X speedup in a contrived case that tries to force cache conflicts.
    The contrived case used the numa_emulation capability to force an
    instance of the benchmark to be run in two of the near-memory sized
    numa nodes. If both instances were placed on the same emulated node
    they would fit and cause zero conflicts, while on separate emulated
    nodes without randomization they underutilized the cache and
    conflicted unnecessarily due to the in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].

    While there is potentially significant improvement for applications that
    depend on low-latency access across a wide working set, the performance
    impact may be negligible or even negative for other workloads. For this
    reason the shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those
    systems.

    Outside of memory-side-cache utilization concerns there is potentially a
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first-out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be allocated predictably in
    order.
    However, it should be noted, the concrete security benefits are hard to
    quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
    order 10 (4MB); this trades off randomization granularity against time
    spent shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the
    page allocator while still showing memory-side-cache behavior
    improvements, with the expectation that the security implications of
    finer granularity randomization are mitigated by
    CONFIG_SLAB_FREELIST_RANDOM. The performance impact of the shuffling
    appears to be in the noise compared to other memory initialization work.
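
    As an aside, the shuffle itself is a plain Fisher-Yates pass over the
    lists of order-CONFIG_SHUFFLE_PAGE_ORDER pages. Below is a minimal
    userspace sketch of the idea only; array entries stand in for free
    pages, rand() is merely a placeholder for the kernel's entropy source,
    and none of this is the kernel implementation itself.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Fisher-Yates: walk the array backwards, swapping each slot with a
     * randomly chosen earlier slot, so every permutation is equally likely. */
    static void fisher_yates(unsigned long *pfn, size_t n)
    {
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            unsigned long tmp = pfn[i];

            pfn[i] = pfn[j];
            pfn[j] = tmp;
        }
    }

    int main(void)
    {
        unsigned long pfn[16];

        srand((unsigned int)time(NULL));
        for (size_t i = 0; i < 16; i++)
            pfn[i] = i;             /* in-order, as pages are first onlined */
        fisher_yates(pfn, 16);
        for (size_t i = 0; i < 16; i++)
            printf("%lu ", pfn[i]);
        printf("\n");
        return 0;
    }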

    This initial randomization can be undone over time, so a follow-on patch
    is introduced to inject entropy on page free decisions. It is reasonable
    to ask whether the page free entropy alone would be sufficient, but it
    is not, due to the in-order initial freeing of pages. At the start of
    that process, putting page1 in front of or behind page0 still keeps them
    close together; page2 is still near page1 and has a high chance of being
    adjacent. As more pages are added, ordering diversity improves, but
    there is still high page locality for the low-address pages, and this
    leads to no significant impact on the cache conflict rate.

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

31 Oct, 2018

3 commits

  • Move a few remaining functions from nobootmem.c to memblock.c and remove
    nobootmem

    Link: http://lkml.kernel.org/r/1536927045-23536-28-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
    All architectures use memblock for early memory management. There is no
    need for the CONFIG_HAVE_MEMBLOCK configuration option.

    [rppt@linux.vnet.ibm.com: of/fdt: fixup #ifdefs]
    Link: http://lkml.kernel.org/r/20180919103457.GA20545@rapoport-lnx
    [rppt@linux.vnet.ibm.com: csky: fixups after bootmem removal]
    Link: http://lkml.kernel.org/r/20180926112744.GC4628@rapoport-lnx
    [rppt@linux.vnet.ibm.com: remove stale #else and the code it protects]
    Link: http://lkml.kernel.org/r/1538067825-24835-1-git-send-email-rppt@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1536927045-23536-4-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Tested-by: Jonathan Cameron
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
    All architectures select NO_BOOTMEM, which essentially becomes 'Y' for
    any kernel configuration, and therefore it can be removed.

    [alexander.h.duyck@linux.intel.com: remove now defunct NO_BOOTMEM from depends list for deferred init]
    Link: http://lkml.kernel.org/r/20180925201814.3576.15105.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/1536927045-23536-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Alexander Duyck
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

23 Oct, 2018

1 commit

  • Pull arm64 updates from Catalin Marinas:
    "Apart from some new arm64 features and clean-ups, this also contains
    the core mmu_gather changes for tracking the levels of the page table
    being cleared and a minor update to the generic
    compat_sys_sigaltstack() introducing COMPAT_SIGMINSTKSZ.

    Summary:

    - Core mmu_gather changes which allow tracking the levels of
    page-table being cleared together with the arm64 low-level flushing
    routines

    - Support for the new ARMv8.5 PSTATE.SSBS bit which can be used to
    mitigate Spectre-v4 dynamically without trapping to EL3 firmware

    - Introduce COMPAT_SIGMINSTKSZ for use in compat_sys_sigaltstack

    - Optimise emulation of MRS instructions to ID_* registers on ARMv8.4

    - Support for Common Not Private (CnP) translations allowing threads
    of the same CPU to share the TLB entries

    - Accelerated crc32 routines

    - Move swapper_pg_dir to the rodata section

    - Trap WFI instruction executed in user space

    - ARM erratum 1188874 workaround (arch_timer)

    - Miscellaneous fixes and clean-ups"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (78 commits)
    arm64: KVM: Guests can skip __install_bp_hardening_cb()s HYP work
    arm64: cpufeature: Trap CTR_EL0 access only where it is necessary
    arm64: cpufeature: Fix handling of CTR_EL0.IDC field
    arm64: cpufeature: ctr: Fix cpu capability check for late CPUs
    Documentation/arm64: HugeTLB page implementation
    arm64: mm: Use __pa_symbol() for set_swapper_pgd()
    arm64: Add silicon-errata.txt entry for ARM erratum 1188873
    Revert "arm64: uaccess: implement unsafe accessors"
    arm64: mm: Drop the unused cpu parameter
    MAINTAINERS: fix bad sdei paths
    arm64: mm: Use #ifdef for the __PAGETABLE_P?D_FOLDED defines
    arm64: Fix typo in a comment in arch/arm64/mm/kasan_init.c
    arm64: xen: Use existing helper to check interrupt status
    arm64: Use daifflag_restore after bp_hardening
    arm64: daifflags: Use irqflags functions for daifflags
    arm64: arch_timer: avoid unused function warning
    arm64: Trap WFI executed in userspace
    arm64: docs: Document SSBS HWCAP
    arm64: docs: Fix typos in ELF hwcaps
    arm64/kprobes: remove an extra semicolon in arch_prepare_kprobe
    ...

    Linus Torvalds
     


31 Aug, 2018

1 commit

    The implementation of the readahead(2) syscall is identical to that of
    fadvise64(POSIX_FADV_WILLNEED) with a few exceptions:
    1. readahead(2) returns -EINVAL for !mapping->a_ops and fadvise64()
    ignores the request and returns 0.
    2. fadvise64() checks for integer overflow corner case
    3. fadvise64() calls the optional filesystem fadvise() file operation

    Unite the two implementations by calling vfs_fadvise() from the
    readahead(2) syscall. Check !mapping->a_ops in the readahead(2) syscall
    to preserve the documented syscall ABI behaviour.
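
    For illustration only (not part of the patch), the two user-visible
    entry points being unified look like this from userspace. Both calls
    below ask the kernel to start readahead on the same range and, after
    this change, both funnel into vfs_fadvise() internally.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd = open(argc > 1 ? argv[1] : "/etc/hostname", O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* readahead(2): fails with EINVAL when the mapping has no a_ops */
        if (readahead(fd, 0, 1 << 20) < 0)
            perror("readahead");

        /* fadvise64(POSIX_FADV_WILLNEED): silently ignores such mappings */
        int err = posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_WILLNEED);
        if (err)
            fprintf(stderr, "posix_fadvise: error %d\n", err);

        close(fd);
        return 0;
    }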

    Suggested-by: Miklos Szeredi
    Fixes: d1d04ef8572b ("ovl: stack file ops")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     

08 Jun, 2018

1 commit

  • With the addition of memfd hugetlbfs support, we now have the situation
    where memfd depends on TMPFS -or- HUGETLBFS. Previously, memfd was only
    supported on tmpfs, so it made sense that the code resided in shmem.c.
    In the current code, memfd is only functional if TMPFS is defined. If
    HUGETLBFS is defined and TMPFS is not defined, then memfd functionality
    will not be available for hugetlbfs. This does not cause BUGs, just a
    lack of potentially desired functionality.

    Code is restructured in the following way:
    - include/linux/memfd.h is a new file containing memfd specific
    definitions previously contained in shmem_fs.h.
    - mm/memfd.c is a new file containing memfd specific code previously
    contained in shmem.c.
    - memfd specific code is removed from shmem_fs.h and shmem.c.
    - A new config option MEMFD_CREATE is added that is defined if TMPFS
    or HUGETLBFS is defined.

    No functional changes are made to the code: restructuring only.
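
    For context, a minimal userspace sketch of the interface whose backing
    stores this restructuring decouples (assuming a glibc new enough to
    provide the memfd_create() wrapper); the same call works whether the
    file is tmpfs-backed or, with MFD_HUGETLB, hugetlbfs-backed.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Anonymous tmpfs-backed file; MFD_HUGETLB would ask for hugetlbfs. */
        int fd = memfd_create("memfd-demo", MFD_CLOEXEC);

        if (fd < 0) {
            perror("memfd_create");
            return 1;
        }

        const char msg[] = "hello from memfd\n";
        if (write(fd, msg, sizeof(msg) - 1) < 0)
            perror("write");

        close(fd);
        return 0;
    }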

    Link: http://lkml.kernel.org/r/20180415182119.4517-4-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Khalid Aziz
    Cc: Andrea Arcangeli
    Cc: David Herrmann
    Cc: Hugh Dickins
    Cc: Marc-André Lureau
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

06 Apr, 2018

1 commit

  • For mm/swap_slots.c, use the traditional Linux method of conditional
    compilation and linking instead of always compiling it by using #ifdef
    CONFIG_SWAP and #endif for the entire source file (excluding header
    files).

    Link: http://lkml.kernel.org/r/c2a47015-0b5a-d0d9-8bc7-9984c049df20@infradead.org
    Signed-off-by: Randy Dunlap
    Acked-by: Tim Chen
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

18 Nov, 2017

1 commit

  • Performance of get_user_pages_fast() is critical for some workloads, but
    it's tricky to test it directly.

    This patch provides /sys/kernel/debug/gup_benchmark, which helps with
    testing its performance.

    See tools/testing/selftests/vm/gup_benchmark.c for userspace
    counterpart.

    Link: http://lkml.kernel.org/r/20170908215603.9189-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Shuah Khan
    Cc: Ingo Molnar
    Cc: Thorsten Leemhuis
    Cc: Jonathan Corbet
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Nov, 2017

1 commit

  • Fix up makefiles, remove references, and git rm kmemcheck.

    Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Steven Rostedt
    Cc: Vegard Nossum
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Eric W. Biederman
    Cc: Alexander Potapenko
    Cc: Tim Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Levin, Alexander (Sasha Levin)
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <15 lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Sep, 2017

2 commits

    This moves all new code, including the new page migration helper, behind
    a kernel Kconfig option so that there is no code bloat for arches or
    users that do not want to use HMM or any of its associated features.

    arm allyesconfig (without the patchset, then with this patch applied):
    text data bss dec hex filename
    83721896 46511131 27582964 157815991 96814b7 ../without/vmlinux
    83722364 46511131 27582964 157816459 968168b vmlinux

    [jglisse@redhat.com: struct hmm is only use by HMM mirror functionality]
    Link: http://lkml.kernel.org/r/20170825213133.27286-1-jglisse@redhat.com
    [sfr@canb.auug.org.au: fix build (arm multi_v7_defconfig)]
    Link: http://lkml.kernel.org/r/20170828181849.323ab81b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170818032858.7447-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Stephen Rothwell
    Cc: Dan Williams
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • HMM provides 3 separate types of functionality:
    - Mirroring: synchronize CPU page table and device page table
    - Device memory: allocating struct page for device memory
    - Migration: migrating regular memory to device memory

    This patch introduces some common helpers and definitions shared by all
    three of these functionalities.

    Link: http://lkml.kernel.org/r/20170817000548.32038-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Evgeny Baskakov
    Signed-off-by: John Hubbard
    Signed-off-by: Mark Hairgrove
    Signed-off-by: Sherry Cheung
    Signed-off-by: Subhash Gutti
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

21 Jun, 2017

1 commit

    There is limited visibility into the use of percpu memory, leaving us
    unable to reason about the correctness of parameters and overall use of
    percpu memory. These counters and statistics aim to help understand
    basic facts about percpu memory, such as the number of allocations over
    the lifetime, allocation sizes, and fragmentation.

    New Config: PERCPU_STATS

    Signed-off-by: Dennis Zhou
    Signed-off-by: Tejun Heo

    Dennis Zhou
     

28 Feb, 2017

1 commit

    This patch makes arch-independent testcases for RODATA. Both x86 and
    x86_64 already have testcases for RODATA, but they are arch-specific
    because they use inline assembly directly.

    And cacheflush.h is not a suitable location for rodata-test related
    things. Since they were in cacheflush.h, changing the state of
    CONFIG_DEBUG_RODATA_TEST caused needless kernel rebuild overhead.

    To solve the above issues, write arch-independent testcases and move
    them to a shared location.

    [jinb.park7@gmail.com: fix config dependency]
    Link: http://lkml.kernel.org/r/20170209131625.GA16954@pjb1027-Latitude-E5410
    Link: http://lkml.kernel.org/r/20170129105436.GA9303@pjb1027-Latitude-E5410
    Signed-off-by: Jinbum Park
    Acked-by: Kees Cook
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Arjan van de Ven
    Cc: Laura Abbott
    Cc: Russell King
    Cc: Valentin Rothberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jinbum Park
     

25 Feb, 2017

1 commit

  • Introduce a new interface to check if a page is mapped into a vma. It
    aims to address shortcomings of page_check_address{,_transhuge}.

    The existing interface is not able to handle PTE-mapped THPs: it only
    finds the first PTE; the rest are left unnoticed.

    page_vma_mapped_walk() iterates over all possible mappings of the page
    in the vma.

    Link: http://lkml.kernel.org/r/20170129173858.45174-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

23 Feb, 2017

1 commit

  • We add per cpu caches for swap slots that can be allocated and freed
    quickly without the need to touch the swap info lock.

    Two separate caches are maintained for swap slots allocated and swap
    slots returned. This is to allow the swap slots to be returned to the
    global pool in a batch so they will have a chance to be coalesced with
    other slots in a cluster. We do not reuse the slots that are returned
    right away, as it may increase fragmentation of the slots.

    The swap allocation cache is protected by a mutex as we may sleep when
    searching for empty slots in cache. The swap free cache is protected by
    a spin lock as we cannot sleep in the free path.

    We refill the swap slots cache when we run out of slots, and we disable
    the swap slots cache and drain the slots if the global number of slots
    falls below a low watermark threshold. We re-enable the cache again when
    the slots available are above a high watermark.

    [ying.huang@intel.com: use raw_cpu_ptr over this_cpu_ptr for swap slots access]
    [tim.c.chen@linux.intel.com: add comments on locks in swap_slots.h]
    Link: http://lkml.kernel.org/r/20170118180327.GA24225@linux.intel.com
    Link: http://lkml.kernel.org/r/35de301a4eaa8daa2977de6e987f2c154385eb66.1484082593.git.tim.c.chen@linux.intel.com
    Signed-off-by: Tim Chen
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Michal Hocko
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     

13 Oct, 2016

1 commit

    This effectively reverts commit 377ccbb48373 ("Makefile: Mute warning
    for __builtin_return_address(>0) for tracing only") because it turns out
    that it really isn't tracing only - it's all over the tree.

    We already also had the warning disabled separately for mm/usercopy.c
    (which this commit also removes), and it turns out that we will also
    want to disable it for get_lock_parent_ip(), that is used for at least
    TRACE_IRQFLAGS. Which (when enabled) ends up being all over the tree.

    Steven Rostedt had a patch that tried to limit it to just the config
    options that actually triggered this, but quite frankly, the extra
    complexity and abstraction just isn't worth it. We have never actually
    had a case where the warning is actually useful, so let's just disable
    it globally and not worry about it.

    Acked-by: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Peter Anvin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Aug, 2016

1 commit

  • Pull usercopy protection from Kees Cook:
    "Tbhis implements HARDENED_USERCOPY verification of copy_to_user and
    copy_from_user bounds checking for most architectures on SLAB and
    SLUB"

    * tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    mm: SLUB hardened usercopy support
    mm: SLAB hardened usercopy support
    s390/uaccess: Enable hardened usercopy
    sparc/uaccess: Enable hardened usercopy
    powerpc/uaccess: Enable hardened usercopy
    ia64/uaccess: Enable hardened usercopy
    arm64/uaccess: Enable hardened usercopy
    ARM: uaccess: Enable hardened usercopy
    x86/uaccess: Enable hardened usercopy
    mm: Hardened usercopy
    mm: Implement stack frame object validation
    mm: Add is_migrate_cma_page

    Linus Torvalds
     

27 Jul, 2016

2 commits

    The khugepaged implementation grew to the point where it deserves a
    separate file in the source tree.

    Let's move it to mm/khugepaged.c.

    Link: http://lkml.kernel.org/r/1466021202-61880-32-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This is the start of porting PAX_USERCOPY into the mainline kernel. This
    is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
    work is based on code by PaX Team and Brad Spengler, and an earlier port
    from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.

    This patch contains the logic for validating several conditions when
    performing copy_to_user() and copy_from_user() on the kernel object
    being copied to/from:
    - address range doesn't wrap around
    - address range isn't NULL or zero-allocated (with a non-zero copy size)
    - if on the slab allocator:
    - object size must be less than or equal to copy size (when the check is
    implemented in the allocator, which appears in subsequent patches)
    - otherwise, object must not span page allocations (excepting Reserved
    and CMA ranges)
    - if on the stack
    - object must not extend before/after the current process stack
    - object must be contained by a valid stack frame (when there is
    arch/build support for identifying stack frames)
    - object must not overlap with kernel text

    Signed-off-by: Kees Cook
    Tested-by: Valdis Kletnieks
    Tested-by: Michael Ellerman

    Kees Cook
     

21 May, 2016

1 commit

  • This patch introduces z3fold, a special purpose allocator for storing
    compressed pages. It is designed to store up to three compressed pages
    per physical page. It is a ZBUD derivative which allows for higher
    compression ratio keeping the simplicity and determinism of its
    predecessor.

    This patch comes as a follow-up to the discussions at the Embedded Linux
    Conference in San-Diego related to the talk [1]. The outcome of these
    discussions was that it would be good to have a compressed page
    allocator as stable and deterministic as zbud but with a higher
    compression ratio.

    To keep the determinism and simplicity, z3fold, just like zbud, always
    stores an integral number of compressed pages per page, but it can store
    up to 3 pages, unlike zbud which can store at most 2. Therefore the
    compression ratio goes to around 2.6x while zbud's is around 1.7x.

    The patch is based on the latest linux.git tree.

    This version has been updated after testing on various simulators (e.g.
    ARM Versatile Express, MIPS Malta, x86_64/Haswell) and is based on
    comments from Dan Streetman [3].

    [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
    [2] https://lkml.org/lkml/2016/4/21/799
    [3] https://lkml.org/lkml/2016/5/4/852

    Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Seth Jennings
    Cc: Dan Streetman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     

26 Mar, 2016

1 commit

  • Add KASAN hooks to SLAB allocator.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

23 Mar, 2016

1 commit

  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall
    inputs. To achieve this goal it does not collect coverage in soft/hard
    interrupts, and instrumentation of some inherently non-deterministic or
    non-interesting parts of the kernel is disabled (e.g. scheduler,
    locking).

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.
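
    A condensed sketch of driving the tracing mode from userspace, following
    the usage pattern in the kcov documentation; treat the ioctl numbers,
    debugfs path and buffer handling below as assumptions if they differ
    from your kernel's headers.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
    #define KCOV_ENABLE     _IO('c', 100)
    #define KCOV_DISABLE    _IO('c', 101)
    #define COVER_SIZE      (64 << 10)

    int main(void)
    {
        int fd = open("/sys/kernel/debug/kcov", O_RDWR);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* Size the coverage buffer (in unsigned longs) and map it. */
        if (ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE))
            perror("KCOV_INIT_TRACE");
        unsigned long *cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                                    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (cover == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        ioctl(fd, KCOV_ENABLE, 0);
        cover[0] = 0;                  /* reset: slot 0 holds the PC count */
        read(-1, NULL, 0);             /* the "bit of code" under test */
        unsigned long n = cover[0];    /* collect */
        for (unsigned long i = 0; i < n; i++)
            printf("0x%lx\n", cover[i + 1]);
        ioctl(fd, KCOV_DISABLE, 0);
        close(fd);
        return 0;
    }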

    This patch adds the necessary support on the kernel side. The complementary
    compiler support was added in gcc revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage can be just a dozen basic blocks (e.g. an invalid
    input). In such a context gcov becomes prohibitively expensive, as the
    reset/collect coverage steps depend on the total number of basic
    blocks/edges in the program (in the case of the kernel it is about 2M).
    The cost of kcov depends only on the number of executed basic
    blocks/edges. On top of that, the kernel requires per-thread coverage
    because there are always background threads and unrelated processes
    that also produce coverage. With inlined gcov instrumentation,
    per-thread coverage is not possible.

    kcov exposes kernel PCs and control flow to user-space which is
    insecure. But debugfs should not be mapped as user accessible.

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

18 Mar, 2016

1 commit

    CMA allocation should be guaranteed to succeed by definition, but,
    unfortunately, it sometimes fails. It is hard to track down the
    problem, because it is related to page reference manipulation and we
    don't have any facility to analyze it.

    This patch adds tracepoints to track down page reference manipulation.
    With it, we can find the exact reason for a failure and fix the problem.
    Following is an example of tracepoint output. (Note: this example is a
    stale version that prints flags as a number. The recent version prints
    them as a human-readable string.)

    -9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
    -9018 [004] 92.678378: kernel_stack:
    => get_page_from_freelist (ffffffff81176659)
    => __alloc_pages_nodemask (ffffffff81176d22)
    => alloc_pages_vma (ffffffff811bf675)
    => handle_mm_fault (ffffffff8119e693)
    => __do_page_fault (ffffffff810631ea)
    => trace_do_page_fault (ffffffff81063543)
    => do_async_page_fault (ffffffff8105c40a)
    => async_page_fault (ffffffff817581d8)
    [snip]
    -9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
    [snip]
    ...
    ...
    -9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
    [snip]
    -9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
    => release_pages (ffffffff8117c9e4)
    => free_pages_and_swap_cache (ffffffff811b0697)
    => tlb_flush_mmu_free (ffffffff81199616)
    => tlb_finish_mmu (ffffffff8119a62c)
    => exit_mmap (ffffffff811a53f7)
    => mmput (ffffffff81073f47)
    => do_exit (ffffffff810794e9)
    => do_group_exit (ffffffff81079def)
    => SyS_exit_group (ffffffff81079e74)
    => entry_SYSCALL_64_fastpath (ffffffff817560b6)

    This output shows that the problem comes from the exit path. In the exit
    path, to improve performance, pages are not freed immediately. They are
    gathered and processed in a batch. During this process, migration is not
    possible and the CMA allocation fails. This problem is hard to find
    without this page reference tracepoint facility.

    Enabling this feature bloats the kernel text by about 30 KB in my
    configuration.

    text data bss dec hex filename
    12127327 2243616 1507328 15878271 f2487f vmlinux_disabled
    12157208 2258880 1507328 15923416 f2f8d8 vmlinux_enabled

    Note that, due to a header file dependency problem between mm.h and
    tracepoint.h, this feature has to open-code the static key functions for
    tracepoints, as proposed by Steven Rostedt in the following link.

    https://lkml.org/lkml/2015/12/9/699

    [arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()]
    [iamjoonsoo.kim@lge.com: fix build failure for xtensa]
    [akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil]
    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Acked-by: Steven Rostedt
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

16 Mar, 2016

1 commit

  • Page poisoning is currently set up as a feature if architectures don't
    have architecture debug page_alloc to allow unmapping of pages. It has
    uses apart from that though. Clearing of the pages on free provides an
    increase in security as it helps to limit the risk of information leaks.
    Allow page poisoning to be enabled as a separate option independent of
    kernel_map pages since the two features do separate work. Because of
    how hibernation is implemented, the checks on alloc cannot occur if
    hibernation is enabled. The runtime alloc checks can also be enabled
    with an option when !HIBERNATION.

    Credit to Grsecurity/PaX team for inspiring this work

    Signed-off-by: Laura Abbott
    Cc: Rafael J. Wysocki
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Kees Cook
    Cc: Mathias Krause
    Cc: Dave Hansen
    Cc: Jianyu Zhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

12 Sep, 2015

1 commit

  • Pull media updates from Mauro Carvalho Chehab:
    "A series of patches that move part of the code used to allocate memory
    from the media subsystem to the mm subsystem"

    [ The mm parts have been acked by VM people, and the series was
    apparently in -mm for a while - Linus ]

    * tag 'media/v4.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
    [media] drm/exynos: Convert g2d_userptr_get_dma_addr() to use get_vaddr_frames()
    [media] media: vb2: Remove unused functions
    [media] media: vb2: Convert vb2_dc_get_userptr() to use frame vector
    [media] media: vb2: Convert vb2_vmalloc_get_userptr() to use frame vector
    [media] media: vb2: Convert vb2_dma_sg_get_userptr() to use frame vector
    [media] vb2: Provide helpers for mapping virtual addresses
    [media] media: omap_vout: Convert omap_vout_uservirt_to_phys() to use get_vaddr_pfns()
    [media] mm: Provide new get_vaddr_frames() helper
    [media] vb2: Push mmap_sem down to memops

    Linus Torvalds
     

11 Sep, 2015

1 commit

  • Knowing the portion of memory that is not used by a certain application or
    memory cgroup (idle memory) can be useful for partitioning the system
    efficiently, e.g. by setting memory cgroup limits appropriately.
    Currently, the only means to estimate the amount of idle memory provided
    by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
    access bit for all pages mapped to a particular process by writing 1 to
    clear_refs, waiting for some time, and then counting smaps:Referenced.
    However, this method has two serious shortcomings:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    To overcome these drawbacks, this patch introduces two new page flags,
    Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
    A page's Idle flag can only be set from userspace, by setting the bit in
    /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
    and it is cleared whenever the page is accessed either through page tables
    (it is cleared in page_referenced() in this case) or using the read(2)
    system call (mark_page_accessed()). Thus by setting the Idle flag for
    pages of a particular workload, which can be found e.g. by reading
    /proc/PID/pagemap, waiting for some time to let the workload access its
    working set, and then reading the bitmap file, one can estimate the amount
    of pages that are not used by the workload.

    The Young page flag is used to avoid interference with the memory
    reclaimer. A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.

    Note, since there is no room for extra page flags on 32 bit, this feature
    uses extended page flags when compiled on 32 bit.
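
    A hedged userspace sketch of the new bitmap interface, assuming the
    documented layout of one bit per PFN packed into native-endian 8-byte
    words; the PFN range below is purely illustrative, and real tooling
    would derive PFNs from /proc/PID/pagemap instead.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Each 8-byte word of the bitmap covers 64 consecutive PFNs. */
    static int set_idle(int fd, uint64_t pfn_start, uint64_t pfn_count)
    {
        for (uint64_t pfn = pfn_start; pfn < pfn_start + pfn_count; pfn += 64) {
            uint64_t word = ~0ULL;  /* mark all 64 pages in this word idle */

            if (pwrite(fd, &word, sizeof(word), (pfn / 64) * 8) != 8)
                return -1;
        }
        return 0;
    }

    int main(void)
    {
        int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);

        if (fd < 0) {
            perror("open");               /* needs root and the feature built in */
            return 1;
        }
        if (set_idle(fd, 0x10000, 4096))  /* hypothetical PFN range */
            perror("pwrite");
        /* ...let the workload run, then pread() the same offsets: bits that
         * are still set correspond to pages that were not accessed. */
        close(fd);
        return 0;
    }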

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: kpageidle requires an MMU]
    [akpm@linux-foundation.org: decouple from page-flags rework]
    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

05 Sep, 2015

1 commit

    This implements mcopy_atomic and mfill_zeropage, the low-level VM
    methods that are invoked respectively by the UFFDIO_COPY and
    UFFDIO_ZEROPAGE userfaultfd commands.
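
    For context, a hedged sketch of the ioctl that ultimately reaches
    mcopy_atomic(); it assumes uffd is a userfaultfd descriptor already
    registered over the destination range (setup omitted), and
    UFFDIO_ZEROPAGE follows the same shape with struct uffdio_zeropage.

    #include <linux/userfaultfd.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Resolve a fault at 'dst' by atomically copying one page from 'src'.
     * 'uffd' must already be registered over the destination range. */
    static long uffd_copy_page(int uffd, void *dst, const void *src,
                               unsigned long page_size)
    {
        struct uffdio_copy copy;

        memset(&copy, 0, sizeof(copy));
        copy.dst = (unsigned long)dst;
        copy.src = (unsigned long)src;
        copy.len = page_size;
        copy.mode = 0;               /* or UFFDIO_COPY_MODE_DONTWAKE */

        if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
            return -1;
        return copy.copy;            /* bytes copied, filled in by the kernel */
    }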

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

17 Aug, 2015

1 commit

    Provide a new function, get_vaddr_frames(). This function maps virtual
    addresses from a given start and fills a given array with the page frame
    numbers of the corresponding pages. If the given start belongs to a
    normal vma, the function grabs a reference to each of the pages to pin
    them in memory. If start belongs to a VM_IO | VM_PFNMAP vma, we don't
    touch page structures. The caller must make sure pfns aren't reused for
    anything else while in use.

    This function is created for various drivers to simplify handling of
    their buffers.

    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Andrew Morton
    Signed-off-by: Hans Verkuil
    Signed-off-by: Mauro Carvalho Chehab

    Jan Kara
     

15 Apr, 2015

2 commits

    Memtest is a simple feature which fills the memory with a given set of
    patterns and validates memory contents; if bad memory regions are
    detected, it reserves them via the memblock API. Since the memblock API
    is widely used by other architectures, this feature can be enabled
    outside of the x86 world.

    This patch set promotes memtest to live under generic mm umbrella and
    enables memtest feature for arm/arm64.

    It was reported that this patch set was useful for tracking down an issue
    with some errant DMA on an arm64 platform.

    This patch (of 6):

    There is nothing platform dependent in the core memtest code, so other
    platforms might benefit from this feature too.

    [linux@roeck-us.net: MEMTEST depends on MEMBLOCK]
    Signed-off-by: Vladimir Murzin
    Acked-by: Will Deacon
    Tested-by: Mark Rutland
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Catalin Marinas
    Cc: Russell King
    Cc: Paul Bolle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Murzin
     
    I've noticed that there are no interfaces exposed by CMA which would let
    me fuzz what's going on in there.

    This small patchset exposes some information out to userspace, plus adds
    the ability to trigger allocation and freeing from userspace.

    This patch (of 3):

    Implement a simple debugfs interface to expose information about CMA areas
    in the system.

    Useful for testing/sanity checks for CMA since it was previously
    impossible to retrieve this information in userspace.

    Signed-off-by: Sasha Levin
    Acked-by: Joonsoo Kim
    Cc: Marek Szyprowski
    Cc: Laura Abbott
    Cc: Konrad Rzeszutek Wilk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     


17 Feb, 2015

1 commit

  • All callers of get_xip_mem() are now gone. Remove checks for it,
    initialisers of it, documentation of it and the only implementation of it.
    Also remove mm/filemap_xip.c as it is now empty. Also remove
    documentation of the long-gone get_xip_page().

    Signed-off-by: Matthew Wilcox
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Kirill A. Shutemov
    Cc: Mathieu Desnoyers
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

14 Feb, 2015

2 commits

    With this patch kasan will be able to catch bugs in memory allocated by
    slub. Initially all objects in a newly allocated slab page are marked as
    redzone. Later, when allocation of a slub object happens, the number of
    bytes requested by the caller is marked as accessible, and the rest of
    the object (including slub's metadata) is marked as redzone
    (inaccessible).

    We also mark an object as accessible if ksize was called for it. There
    are some places in the kernel where the ksize function is called to
    inquire the size of the really allocated area. Such callers could
    validly access the whole allocated memory, so it should be marked as
    accessible.

    Code in the slub.c and slab_common.c files could validly access the
    object's metadata, so instrumentation for these files is disabled.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Dmitry Chernenkov
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    Kernel Address Sanitizer (KASan) is a dynamic memory error detector. It
    provides a fast and comprehensive solution for finding use-after-free
    and out-of-bounds bugs.

    KASAN uses compile-time instrumentation for checking every memory
    access, therefore GCC > v4.9.2 is required. v4.9.2 almost works, but has
    issues with putting symbol aliases into the wrong section, which breaks
    kasan instrumentation of globals.

    This patch only adds infrastructure for kernel address sanitizer. It's
    not available for use yet. The idea and some code was borrowed from [1].

    Basic idea:

    The main idea of KASAN is to use shadow memory to record whether each byte
    of memory is safe to access or not, and use compiler's instrumentation to
    check the shadow memory on each memory access.

    Address sanitizer uses 1/8 of the memory addressable in kernel for shadow
    memory and uses direct mapping with a scale and offset to translate a
    memory address to its corresponding shadow address.

    Here is the function to translate an address to its corresponding shadow address:

    unsigned long kasan_mem_to_shadow(unsigned long addr)
    {
            return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
    }

    where KASAN_SHADOW_SCALE_SHIFT = 3.

    So for every 8 bytes there is one corresponding byte of shadow memory.
    The following encoding is used for each shadow byte: 0 means that all 8
    bytes of the corresponding memory region are valid for access; k (1 <= k
    <= 7) means that the first k bytes are valid for access and the other
    (8 - k) bytes are not; any negative value indicates that the entire
    8-byte word is inaccessible.

    Acked-by: Michal Marek
    Signed-off-by: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

11 Feb, 2015

1 commit

    remap_file_pages(2) was invented to be able to efficiently map parts of
    a huge file into a limited 32-bit virtual address space, such as in
    database workloads.

    Nonlinear mappings are a pain to support and it seems there are no
    legitimate use-cases nowadays since 64-bit systems are widely available.

    Let's drop the syscall and get rid of all this special-cased code.

    The patch replaces the syscall with an emulation which creates a new VMA
    on each remap_file_pages(), unless it can be merged with an adjacent
    one.

    I didn't find *any* real code that uses remap_file_pages(2) to test
    emulation impact on. I've checked Debian code search and source of all
    packages in ALT Linux. No real users: libc wrappers, mentions in
    strace, gdb, valgrind and this kind of stuff.

    There are a few basic tests in LTP for the syscall. They work just fine
    with emulation.

    To test the performance impact, I've written a small test case which
    demonstrates pretty much the worst case scenario: map a 4G shmfs file,
    write the pgoff of each page to the beginning of that page, remap pages
    in reverse order, read every page.

    The test creates 1 million VMAs if emulation is in use, so I had to
    set vm.max_map_count to 1100000 to avoid -ENOMEM.

    Before: 23.3 ( +- 4.31% ) seconds
    After: 43.9 ( +- 0.85% ) seconds
    Slowdown: 1.88x

    I believe we can live with that.

    Test case:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>
    #include <assert.h>

    #define MB (1024UL * 1024)
    #define SIZE (4096 * MB)

    int main(int argc, char **argv)
    {
        unsigned long *p;
        long i, pass;

        for (pass = 0; pass < 10; pass++) {
            p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                perror("mmap");
                return -1;
            }

            for (i = 0; i < SIZE / 4096; i++)
                p[i * 4096 / sizeof(*p)] = i;

            for (i = 0; i < SIZE / 4096; i++) {
                if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                     0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                    perror("remap_file_pages");
                    return -1;
                }
            }

            for (i = SIZE / 4096 - 1; i >= 0; i--)
                assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

            munmap(p, SIZE);
        }

        return 0;
    }

    [akpm@linux-foundation.org: fix spello]
    [sasha.levin@oracle.com: initialize populate before usage]
    [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
    Signed-off-by: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Armin Rigo
    Signed-off-by: Sasha Levin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

14 Dec, 2014

2 commits

    This is the page owner tracking code which was introduced some time ago.
    It has been resident in Andrew's tree, but nobody tried to upstream it,
    so it remained as is. Our company uses this feature actively to debug
    memory leaks or to find memory hoggers, so I decided to upstream it.

    This functionality helps us to know who allocated a page. When
    allocating a page, we store some information about the allocation in
    extra memory. Later, if we need to know the status of all pages, we can
    get and analyze it from this stored information.

    In the previous version of this feature, the extra memory was statically
    defined in struct page, but in this version it is allocated outside of
    struct page. This enables us to turn the feature on/off at boot time
    without considerable memory waste.

    Although we already have tracepoints for tracing page allocation/free,
    using them to analyze page ownership is rather complex. We need to
    enlarge the trace buffer to prevent it overlapping before the userspace
    program is launched, and the launched program must continually dump out
    the trace buffer for later analysis; this changes system behaviour more
    than just keeping the information in memory, which is bad for debugging.

    Moreover, we can use the page_owner feature further for various
    purposes. For example, we can use it for the fragmentation statistics
    implemented in this patch. I also plan to implement a CMA failure
    debugging feature using this interface.

    I'd like to give credit to all the developers who contributed to this
    feature, but it's not easy because I don't know the exact history. Sorry
    about that. Below are the people who have "Signed-off-by" in the patches
    in Andrew's tree.

    Contributor:
    Alexander Nyberg
    Mel Gorman
    Dave Hansen
    Minchan Kim
    Michal Nazarewicz
    Andrew Morton
    Jungsoo Son

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    When we debug something, we'd like to insert some information into every
    page. For this purpose, we sometimes modify struct page itself. But
    this has drawbacks. First, it requires a re-compile. This makes us
    hesitate to use the powerful debug feature, so the development process
    is slowed down. Second, sometimes it is impossible to rebuild the
    kernel due to third-party module dependencies. Third, system behaviour
    would be largely different after the re-compile, because it changes the
    size of struct page greatly and this structure is accessed by every part
    of the kernel. Keeping it as it is would be better for reproducing the
    erroneous situation.

    This feature is intended to overcome the above-mentioned problems. It
    allocates memory for extended data per page in a certain place rather
    than in struct page itself. This memory can be accessed by the accessor
    functions provided by this code. During the boot process, it checks
    whether allocation of a huge chunk of memory is needed or not. If not,
    it avoids allocating memory at all. With this advantage, we can include
    this feature in the kernel by default and can avoid rebuilds and the
    related problems.

    Until now, memcg used this technique. But now memcg has decided to embed
    its variable in struct page itself, and its code to extend struct page
    has been removed. I'd like to use this code to develop a debug feature,
    so this patch resurrects it.

    To help these things work well, this patch introduces two callbacks for
    clients. One is the need callback, which is mandatory if the user wants
    to avoid useless memory allocation at boot time. The other, the optional
    init callback, is used to do proper initialization after memory is
    allocated. A detailed explanation of the purpose of these functions is
    in the code comments; please refer to them.

    Everything else is the same as the previous extension code in memcg.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim