08 Aug, 2020

1 commit

  • Commit e900a918b098 ("mm: shuffle initial free memory to improve
    memory-side-cache utilization") promised "autodetection of a
    memory-side-cache (to be added in a follow-on patch)" over a year ago.

    The original series included patches [1]; however, they were dropped
    during review [2], to be followed up later.

    Due to lack of platforms that publish an HMAT, autodetection is currently
    not implemented. However, manual activation is actively used [3]. Let's
    simplify for now and re-add when really (ever?) needed.

    [1] https://lkml.kernel.org/r/154510700291.1941238.817190985966612531.stgit@dwillia2-desk3.amr.corp.intel.com
    [2] https://lkml.kernel.org/r/154690326478.676627.103843791978176914.stgit@dwillia2-desk3.amr.corp.intel.com
    [3] https://lkml.kernel.org/r/CAPcyv4irwGUU2x+c6b4L=KbB1dnasNKaaZd6oSpYjL9kfsnROQ@mail.gmail.com

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Acked-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Huang Ying
    Cc: Wei Yang
    Cc: Mel Gorman
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200624094741.9918-4-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

08 Apr, 2020

1 commit

  • Patch series "mm / virtio: Provide support for free page reporting", v17.

    This series provides an asynchronous means of reporting free guest pages
    to a hypervisor so that the memory associated with those pages can be
    dropped and reused by other processes and/or guests on the host. Using
    this it is possible to avoid unnecessary I/O to disk and greatly improve
    performance in the case of memory overcommit on the host.

    When enabled we will be performing a scan of free memory every 2 seconds
    while pages of sufficiently high order are being freed. In each pass at
    least one sixteenth of each free list will be reported. By doing this we
    avoid racing against other threads that may be causing a high amount of
    memory churn.

    The lowest page order currently scanned when reporting pages is
    pageblock_order so that this feature will not interfere with the use of
    Transparent Huge Pages in the case of virtualization.

    Currently this is only in use by virtio-balloon; however, there is the hope
    that at some point in the future other hypervisors might be able to make
    use of it. In the virtio-balloon/QEMU implementation the hypervisor is
    currently using MADV_DONTNEED to indicate to the host kernel that the page
    is currently free. It will be zeroed and faulted back into the guest the
    next time the page is accessed.

    To track if a page is reported or not the Uptodate flag was repurposed and
    used as a Reported flag for Buddy pages. We walk through the free list
    isolating pages and adding them to the scatterlist until we either
    encounter the end of the list or have processed at least one sixteenth of
    the pages that were listed in nr_free prior to us starting. If we fill
    the scatterlist before we reach the end of the list we rotate the list so
    that the first unreported page we encounter is moved to the head of the
    list as that is where we will resume after we have freed the reported
    pages back into the tail of the list.
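    The per-pass budget described above can be modeled in a few lines. This is
    a minimal user-space sketch, not the kernel code: the names pass_budget()
    and report_pass() are hypothetical, the free list is a plain array, and
    the scatterlist batching and list rotation are left out. It also assumes
    the budget counts only newly reported pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One sixteenth of the free list, rounded up (hypothetical helper). */
static size_t pass_budget(size_t nr_free)
{
    return (nr_free + 15) / 16;
}

/*
 * One reporting pass over a simplified free list.
 * reported[i] plays the role of the Reported flag on page i.
 * The pass stops at the end of the list or once the budget is used up.
 * Returns the number of pages newly marked as reported.
 */
static size_t report_pass(bool *reported, size_t nr_free)
{
    size_t budget = pass_budget(nr_free);
    size_t done = 0;

    assert(reported != NULL || nr_free == 0);

    for (size_t i = 0; i < nr_free && done < budget; i++) {
        if (reported[i])
            continue;           /* already reported, skip it */
        reported[i] = true;     /* "isolate and report" in this model */
        done++;
    }
    return done;
}
```

    In this model a list of 100 free pages has a budget of 7 per pass, so a
    full sweep takes 15 passes when no pages are freed or allocated in between.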

    Below are the results from various benchmarks. I primarily focused on two
    tests. The first is the will-it-scale/page_fault2 test, and the other is
    a modified version of will-it-scale/page_fault1 that was enabled to use
    THP. I did this as it allows for better visibility into different parts
    of the memory subsystem. The guest is running with 32G of RAM on one
    node of an E5-2630 v3. The host has had some features such as CPU turbo
    disabled in the BIOS.

    Test                    page_fault1 (THP)    page_fault2
    Name             tasks  Process Iter  STDEV  Process Iter  STDEV
    Baseline             1    1012402.50  0.14%     361855.25  0.81%
                        16    8827457.25  0.09%    3282347.00  0.34%

    Patches Applied      1    1007897.00  0.23%     361887.00  0.26%
                        16    8784741.75  0.39%    3240669.25  0.48%

    Patches Enabled      1    1010227.50  0.39%     359749.25  0.56%
                        16    8756219.00  0.24%    3226608.75  0.97%

    Patches Enabled      1    1050982.00  4.26%     357966.25  0.14%
     page shuffle       16    8672601.25  0.49%    3223177.75  0.40%

    Patches enabled      1    1003238.00  0.22%     360211.00  0.22%
     shuffle w/ RFC     16    8767010.50  0.32%    3199874.00  0.71%

    The results above are for a baseline with a linux-next-20191219 kernel,
    that kernel with this patch set applied but page reporting disabled in
    virtio-balloon, the patches applied and page reporting fully enabled, the
    patches enabled with page shuffling enabled, and the patches applied with
    page shuffling enabled and an RFC patch that makes use of MADV_FREE in
    QEMU. These results include the deviation seen between the average value
    reported here versus the high and/or low value. I observed that during
    the test memory usage for the first three tests never dropped whereas with
    the patches fully enabled the VM would drop to using only a few GB of the
    host's memory when switching from memhog to page fault tests.

    Any of the overhead visible with this patch set enabled seems due to page
    faults caused by accessing the reported pages and the host zeroing the
    page before giving it back to the guest. This overhead is much more
    visible when using THP than with standard 4K pages. In addition page
    shuffling seemed to increase the amount of faults generated due to an
    increase in memory churn. The overhead is reduced when using MADV_FREE as
    we can avoid the extra zeroing of the pages when they are reintroduced to
    the host, as can be seen when the RFC is applied with shuffling enabled.

    The overall guest size is kept fairly small to only a few GB while the
    test is running. If the host memory were oversubscribed this patch set
    should result in a performance improvement as swapping memory in the host
    can be avoided.

    A brief history on the background of free page reporting can be found at:
    https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/

    This patch (of 9):

    Move the head/tail adding logic out of the shuffle code and into the
    __free_one_page function since ultimately that is where it is really
    needed anyway. By doing this we should be able to reduce the overhead and
    can consolidate all of the list addition bits in one spot.

    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Reviewed-by: Dan Williams
    Acked-by: Mel Gorman
    Acked-by: David Hildenbrand
    Cc: Yang Zhang
    Cc: Pankaj Gupta
    Cc: Konrad Rzeszutek Wilk
    Cc: Nitesh Narayan Lal
    Cc: Rik van Riel
    Cc: Matthew Wilcox
    Cc: Luiz Capitulino
    Cc: Dave Hansen
    Cc: Wei Wang
    Cc: Andrea Arcangeli
    Cc: Paolo Bonzini
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Oscar Salvador
    Cc: Michael S. Tsirkin
    Cc: wei qi
    Link: http://lkml.kernel.org/r/20200211224602.29318.84523.stgit@localhost.localdomain
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     

15 May, 2019

2 commits

  • When freeing a page with an order >= shuffle_page_order randomly select
    the front or back of the list for insertion.

    While the mm tries to defragment physical pages into huge pages this can
    tend to make the page allocator more predictable over time. Inject the
    front-back randomness to preserve the initial randomness established by
    shuffle_free_memory() when the kernel was booted.

    The overhead of this manipulation is constrained by only being applied
    for MAX_ORDER sized pages by default.
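    The front-or-back decision can be illustrated with a small user-space
    model. This is a sketch under stated assumptions, not the kernel
    implementation: struct free_list, free_page_sketch() and the xorshift
    generator are hypothetical stand-ins, and pages below the threshold are
    assumed to go to the head of the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LIST_CAP 64

/* Hypothetical stand-in for a free list: a fixed-capacity ring buffer. */
struct free_list {
    unsigned long pfn[LIST_CAP];
    size_t head;    /* index of the first element */
    size_t len;     /* number of elements in the list */
};

/* xorshift32 generator, used only to keep the sketch self-contained.
 * The state must be non-zero. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

static void add_to_head(struct free_list *l, unsigned long pfn)
{
    assert(l->len < LIST_CAP);
    l->head = (l->head + LIST_CAP - 1) % LIST_CAP;
    l->pfn[l->head] = pfn;
    l->len++;
}

static void add_to_tail(struct free_list *l, unsigned long pfn)
{
    assert(l->len < LIST_CAP);
    l->pfn[(l->head + l->len) % LIST_CAP] = pfn;
    l->len++;
}

/*
 * Free one page. At or above shuffle_order a random bit picks head or
 * tail; below it the page always goes to the head (an assumption of
 * this model). Returns true if the page was placed at the tail.
 */
static bool free_page_sketch(struct free_list *l, unsigned long pfn,
                             unsigned int order, unsigned int shuffle_order,
                             uint32_t *rng)
{
    if (order >= shuffle_order && (next_rand(rng) & 1)) {
        add_to_tail(l, pfn);
        return true;
    }
    add_to_head(l, pfn);
    return false;
}
```

    Because only frees at or above the threshold consult the random bit, the
    cost for smaller orders in this model is a single comparison.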

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Kees Cook
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has been previously exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an un-contended cache only to see that performance
    degrade over time, even below the average cache performance due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool saw a
    3X speedup in a contrived case that tries to force cache conflicts.
    The contrived case used the numa_emulation capability to force an
    instance of the benchmark to be run in two of the near-memory sized
    numa nodes. If both instances were placed on the same emulated node they
    would fit and cause zero conflicts. While on separate emulated nodes
    without randomization they underutilized the cache and conflicted
    unnecessarily due to the in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].
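    The contrived case in item 1/ can be reproduced on paper with a toy model
    of a direct-mapped cache. This sketch assumes the simplest possible
    mapping, slot = page number modulo cache size in pages; real hardware may
    index differently, and count_conflicts() and MAX_SLOTS are hypothetical
    names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SLOTS 1024

/*
 * Count how many of the given pages land on a cache slot that an
 * earlier page in the array already occupies.
 * Assumed mapping: slot = page number % cache_pages.
 */
static size_t count_conflicts(const unsigned long *pages, size_t n,
                              size_t cache_pages)
{
    bool used[MAX_SLOTS] = { false };
    size_t conflicts = 0;

    assert(cache_pages > 0 && cache_pages <= MAX_SLOTS);

    for (size_t i = 0; i < n; i++) {
        size_t slot = pages[i] % cache_pages;

        if (used[slot])
            conflicts++;        /* slot already claimed by another page */
        else
            used[slot] = true;
    }
    return conflicts;
}
```

    With an 8-page cache, six pages allocated in order from one cache-sized
    node (0..5) produce no conflicts, while three pages from each of two such
    nodes (0..2 and 8..10) conflict on every page of the second instance and
    leave five slots unused.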

    While there is potentially significant improvement for applications that
    depend on low latency access across a wide working-set, the performance
    may be negligible to negative for other workloads. For this reason the
    shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those systems.

    Outside of memory-side-cache utilization concerns there is potentially
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first-out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be predictably allocated in order.
    However, it should be noted, the concrete security benefits are hard to
    quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).
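    For readers unfamiliar with the technique, a textbook Fisher-Yates shuffle
    over an array of block start addresses looks roughly as follows. This is a
    generic user-space sketch, not the code of shuffle_zone(): the function
    name shuffle_blocks() and the xorshift generator are assumptions made for
    illustration, and the small modulo bias is accepted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift32 generator, used only to keep the sketch self-contained.
 * The state must be non-zero. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/*
 * Fisher-Yates shuffle: walk from the last slot down and swap each
 * slot with a randomly chosen slot at or below it. Every element is
 * kept exactly once, only the order changes.
 */
static void shuffle_blocks(unsigned long *pfn, size_t n, uint32_t *rng)
{
    assert(*rng != 0);

    for (size_t i = n; i > 1; i--) {
        size_t j = next_rand(rng) % i;      /* 0 <= j < i */
        unsigned long tmp = pfn[i - 1];

        pfn[i - 1] = pfn[j];
        pfn[j] = tmp;
    }
}
```

    The loop performs n - 1 swaps, so the cost grows linearly with the number
    of blocks being shuffled.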

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e. 10
    (4MB); this trades off randomization granularity for time spent shuffling.
    MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
    while still showing memory-side cache behavior improvements, and with the
    expectation that the security implications of finer granularity
    randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
    performance impact of the shuffling appears to be in the noise compared to
    other memory initialization work.
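    The 4MB figure follows from the order arithmetic: an order-N block spans
    2^N base pages. The sketch below assumes 4KiB base pages, which the text
    does not state explicitly; order_to_bytes() and blocks_in() are
    hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of one block of the given order. */
static unsigned long long order_to_bytes(unsigned int order,
                                         unsigned long long page_size)
{
    assert(order < 32);
    return page_size << order;
}

/* Number of whole blocks of the given order that fit in mem_bytes. */
static unsigned long long blocks_in(unsigned long long mem_bytes,
                                    unsigned int order,
                                    unsigned long long page_size)
{
    return mem_bytes / order_to_bytes(order, page_size);
}
```

    Under that assumption an order-10 block is 4MiB, and 32GiB of memory
    contains 8192 such blocks to shuffle.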

    This initial randomization can be undone over time so a follow-on patch is
    introduced to inject entropy on page free decisions. It is reasonable to
    ask if the page free entropy is sufficient, but it is not enough due to
    the in-order initial freeing of pages. At the start of that process
    putting page1 in front or behind page0 still keeps them close together,
    page2 is still near page1 and has a high chance of being adjacent. As
    more pages are added ordering diversity improves, but there is still high
    page locality for the low address pages and this leads to no significant
    impact to the cache conflict rate.
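    The locality argument can be checked with a small model. If pages are
    freed in address order and each goes to the front or the back of the list
    at random, the resulting list always descends to page0 and then ascends,
    and page1 always ends up next to page0. The sketch below is a user-space
    illustration; free_in_order(), is_valley() and NPAGES are hypothetical
    names, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES 32

/* xorshift32 generator, used only to keep the sketch self-contained.
 * The state must be non-zero. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/*
 * Free pages 0..n-1 in address order; a random bit sends each page to
 * the front or the back of the list. The final list order is written
 * to out, which must hold n entries.
 */
static void free_in_order(unsigned long *out, size_t n, uint32_t *rng)
{
    unsigned long buf[2 * NPAGES];
    size_t head = NPAGES, tail = NPAGES;    /* list is buf[head..tail) */

    assert(n <= NPAGES);

    for (size_t p = 0; p < n; p++) {
        if (next_rand(rng) & 1)
            buf[tail++] = p;        /* add to the back */
        else
            buf[--head] = p;        /* add to the front */
    }
    for (size_t i = 0; i < n; i++)
        out[i] = buf[head + i];
}

/* True if the list strictly descends and then strictly ascends. */
static bool is_valley(const unsigned long *list, size_t n)
{
    size_t i = 1;

    while (i < n && list[i] < list[i - 1])
        i++;
    while (i < n && list[i] > list[i - 1])
        i++;
    return i == n;
}
```

    Whatever the random bits are, the low-address pages stay clustered around
    page0 in this model, which is why front-or-back insertion alone does not
    replace the initial shuffle.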

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams