20 Sep, 2018

1 commit

  • commit 7a9cdebdcc17e426fb5287e4a82db1dfe86339b2 upstream.

    Jann Horn points out that the vmacache_flush_all() function is not only
    potentially expensive, it's buggy too. It also happens to be entirely
    unnecessary, because the sequence number overflow case can be avoided by
    simply making the sequence number be 64-bit. That doesn't even grow the
    data structures in question, because the other adjacent fields are
    already 64-bit.

    So simplify the whole thing by just making the sequence number overflow
    case go away entirely, which gets rid of all the complications and makes
    the code faster too. Win-win.

    [ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistic
    also just goes away entirely with this ]
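
    A minimal userspace sketch of the idea (hypothetical names, not the
    upstream diff): once the sequence number is 64-bit it will not wrap in
    practice, so the overflow flush path can simply be dropped and a stale
    per-task cache is detected by a plain comparison.

    #include <stdbool.h>
    #include <stdint.h>

    struct mm_seq     { uint64_t seqnum; };               /* bumped on every unmap */
    struct task_cache { uint64_t seqnum; void *vmas[4]; };

    /* Return true if the per-task cache may be used; otherwise record the
     * new sequence number and tell the caller to repopulate the cache. */
    static bool cache_valid(struct task_cache *tc, const struct mm_seq *mm)
    {
            if (tc->seqnum != mm->seqnum) {
                    tc->seqnum = mm->seqnum;
                    return false;
            }
            return true;
    }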

    Reported-by: Jann Horn
    Suggested-by: Will Deacon
    Acked-by: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.

    This patch is based on work done by Thomas Gleixner, Kate Stewart, and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis, with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to each file. She confirmed any determination that was
    not immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

07 Sep, 2017

3 commits

  • Patch series "mm, swap: VMA based swap readahead", v4.

    Swap readahead is an important mechanism for reducing swap-in
    latency. Although a purely sequential memory access pattern isn't very
    common for anonymous memory, spatial locality is still considered
    valid.

    In the original swap readahead implementation, consecutive blocks in
    the swap device are read ahead based on a global spatial locality
    estimation. But consecutive blocks in the swap device merely reflect the
    order of page reclaim; they don't necessarily reflect the access pattern
    in the virtual memory space. And different tasks in the system may have
    different access patterns, which makes the global spatial locality
    estimation incorrect.

    In this patchset, when a page fault occurs, the virtual pages near the
    fault address are read ahead instead of the swap slots near the faulting
    swap slot in the swap device. This avoids reading ahead unrelated swap
    slots. At the same time, swap readahead is changed to work per-VMA
    instead of globally, so that the different access patterns of
    different VMAs can be distinguished and a different readahead
    policy applied accordingly. The original core readahead
    detection and scaling algorithm is reused, because it is an effective
    algorithm for detecting spatial locality.

    In addition to the swap readahead changes, some new sysfs interfaces are
    added to show the efficiency of the readahead algorithm and some other
    swap statistics.

    This new implementation will incur more small random reads. On SSD, the
    improved accuracy of the estimation and readahead target should outweigh
    the potential increase in overhead; this is also illustrated in the test
    results below. But on HDD the overhead may outweigh the benefit, so the
    original implementation is used by default.

    The tests and results are as follows.

    Common test condition
    =====================

    Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
    Swap device: NVMe disk

    Micro-benchmark with combined access pattern
    ============================================

    vm-scalability, sequential swap test case: 4 processes eat 50G of
    virtual memory space, repeating the sequential memory writes for 300
    seconds. The first round of writing triggers swap out, the following
    rounds trigger sequential swap in and out.

    At the same time, run the vm-scalability random swap test case in the
    background: 8 processes eat 30G of virtual memory space, repeating the
    random memory writes for 300 seconds. This triggers random swap-in
    in the background.

    This is a combined workload with sequential and random memory accesses
    at the same time. The result (for the sequential workload) is as follows:

                      Base           Optimized
                      ----           ---------
    throughput        345413 KB/s    414029 KB/s (+19.9%)
    latency.average   97.14 us       61.06 us    (-37.1%)
    latency.50th      2 us           1 us
    latency.60th      2 us           1 us
    latency.70th      98 us          2 us
    latency.80th      160 us         2 us
    latency.90th      260 us         217 us
    latency.95th      346 us         369 us
    latency.99th      1.34 ms        1.09 ms
    ra_hit%           52.69%         99.98%

    The original swap readahead algorithm is confused by the background
    random access workload, so its readahead hit rate is lower. The VMA-based
    readahead algorithm works much better.

    Linpack
    =======

    The test memory size is bigger than RAM to trigger swapping.

                     Base        Optimized
                     ----        ---------
    elapsed_time     393.49 s    329.88 s (-16.2%)
    ra_hit%          86.21%      98.82%

    The scores of the base and optimized kernels show no visible change. But
    the elapsed time is reduced and the readahead hit rate improved, so the
    optimized kernel runs better during the startup and teardown stages. And
    the high absolute value of the readahead hit rate shows that spatial
    locality is still valid in some practical workloads.

    This patch (of 5):

    The statistics for total readahead pages and total readahead hits are
    recorded and exported via the following sysfs interface.

    /sys/kernel/mm/swap/ra_hits
    /sys/kernel/mm/swap/ra_total

    With them, the efficiency of the swap readahead can be measured, so
    that the swap readahead algorithm and parameters can be tuned
    accordingly.
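
    As a usage illustration, the hit rate can be derived from the two files
    like this (a userspace sketch; it assumes the sysfs paths above exist
    and that each file holds a single decimal counter):

    #include <stdio.h>

    static long long read_counter(const char *path)
    {
            long long val = -1;
            FILE *f = fopen(path, "r");

            if (!f)
                    return -1;
            if (fscanf(f, "%lld", &val) != 1)
                    val = -1;
            fclose(f);
            return val;
    }

    int main(void)
    {
            long long hits  = read_counter("/sys/kernel/mm/swap/ra_hits");
            long long total = read_counter("/sys/kernel/mm/swap/ra_total");

            if (hits < 0 || total <= 0) {
                    fprintf(stderr, "swap readahead statistics not available\n");
                    return 1;
            }
            printf("ra_hit%% = %.2f%%\n", 100.0 * hits / total);
            return 0;
    }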

    [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
    Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    When swapping out a THP (Transparent Huge Page), instead of swapping out
    the THP as a whole, sometimes we have to fall back to splitting the THP
    into normal pages before swapping, because no free swap clusters are
    available, the cgroup limit is exceeded, etc. To count the number of
    fallbacks, a new VM event THP_SWPOUT_FALLBACK is added and counted when
    we fall back to splitting the THP.

    Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    To support delaying the split of a THP (Transparent Huge Page) until
    after it has been swapped out, we need to enhance the swap writing code
    to write a THP as a whole. This will improve swap write IO performance.

    As Ming Lei pointed out, this should be based on
    multipage bvec support, which hasn't been merged yet. So this patch is
    only for testing the functionality of the other patches in the series,
    and it will be reimplemented after multipage bvec support is merged.

    Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Shaohua Li
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

07 Jul, 2017

1 commit

    Show the count of OOM killer invocations in /proc/vmstat and the count of
    processes killed per memory cgroup in the "memory.events" knob (in
    memory.oom_control for cgroup v1).

    Also describe the difference between "oom" and "oom_kill" in the memory
    cgroup documentation. Currently, OOM in a memory cgroup kills tasks only
    if the shortage happened inside a page fault.

    These counters help in monitoring OOM kills - until now the only way was
    grepping for magic words in the kernel log.

    [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename]
    [akpm@linux-foundation.org: fix comment, per Konstantin]
    Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Roman Guschin
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

08 Jun, 2017

1 commit

  • This fixes CONFIG_SMP=n, CONFIG_DEBUG_TLBFLUSH=y without introducing
    further #ifdef soup. Caught by a Kbuild bot randconfig build.

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: ce4a4e565f52 ("x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code")
    Link: http://lkml.kernel.org/r/76da9a3cc4415996f2ad2c905b93414add322021.1496673616.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

04 May, 2017

1 commit

    madvise()'s MADV_FREE indicates pages are 'lazyfree'. They are still
    anonymous pages, but they can be freed without pageout. To distinguish
    them from normal anonymous pages, we clear their SwapBacked flag.

    MADV_FREE pages can be freed without pageout, so they are pretty much
    like used-once file pages. For such pages, we'd like to reclaim them once
    there is memory pressure. It might also be unfair to always reclaim
    MADV_FREE pages before used-once file pages, and we definitely want to
    reclaim these pages before other anonymous and file pages.

    To speed up MADV_FREE page reclaim, we put the pages onto the
    LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE list
    is tiny nowadays and should be full of used-once file pages. Reclaiming
    MADV_FREE pages will not interfere much with anonymous and active
    file pages. And since the inactive file pages and MADV_FREE pages are
    reclaimed according to their age, we don't reclaim too many MADV_FREE
    pages either. Putting the MADV_FREE pages onto the LRU_INACTIVE_FILE list
    also means we can reclaim the pages without swap support. This idea was
    suggested by Johannes.

    This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
    yet, to avoid bisect failures; the next patch will do it.

    The patch is based on Minchan's original patch.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

10 Mar, 2017

1 commit

    We added support for PUD-sized transparent hugepages, but the
    "thp split pud" event is counted as a thp_split_pmd event.

    To separate the thp split pud event count from pmd, add a new event
    named thp_split_pud.

    Link: http://lkml.kernel.org/r/1488282380-5076-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Sebastian Siewior
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: David Rientjes
    Cc: Hanjun Guo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     

23 Feb, 2017

1 commit

  • A "compact_daemon_wake" vmstat exists that represents the number of
    times kcompactd has woken up. This doesn't represent how much work it
    actually did, though.

    It's useful to understand how much compaction work is being done by
    kcompactd versus other methods such as direct compaction and explicitly
    triggered per-node (or system) compaction.

    This adds two new vmstats: "compact_daemon_migrate_scanned" and
    "compact_daemon_free_scanned" to represent the number of pages kcompactd
    has scanned as part of its migration scanner and freeing scanner,
    respectively.

    These values are still accounted for in the general
    "compact_migrate_scanned" and "compact_free_scanned" for compatibility.

    It could be argued that explicitly triggered compaction could also be
    tracked separately, and that could be added if others find it useful.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

29 Jul, 2016

2 commits

  • The vmstat allocstall was fairly useful in the general sense but
    node-based LRUs change that. It's important to know if a stall was for
    an address-limited allocation request as this will require skipping
    pages from other zones. This patch adds pgstall_* counters to replace
    allocstall. The sum of the counters will equal the old allocstall so it
    can be trivially recalculated. A high number of address-limited
    allocation requests may result in a lot of useless LRU scanning for
    suitable pages.

    As address-limited allocations require pages to be skipped, it's
    important to know how much useless LRU scanning took place, so this patch
    adds pgskip* counters. This yields the following model:

    1. The number of address-space limited stalls can be accounted for (pgstall)
    2. The amount of useless work required to reclaim the data is accounted (pgskip)
    3. The total number of scans is available from pgscan_kswapd and pgscan_direct,
    so from that the ratio of useful to useless scans can be calculated (see
    the sketch below).
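
    A minimal sketch of that ratio, read from /proc/vmstat (counter names as
    introduced by this series; they may differ on other kernel versions):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char name[64];
            unsigned long long val, skipped = 0, scanned = 0;
            FILE *f = fopen("/proc/vmstat", "r");

            if (!f)
                    return 1;
            while (fscanf(f, "%63s %llu", name, &val) == 2) {
                    if (!strncmp(name, "pgskip_", 7))
                            skipped += val;          /* useless scanning */
                    else if (!strcmp(name, "pgscan_kswapd") ||
                             !strcmp(name, "pgscan_direct"))
                            scanned += val;          /* total scanning   */
            }
            fclose(f);

            if (scanned)
                    printf("useless/total scan ratio: %.2f%%\n",
                           100.0 * skipped / scanned);
            return 0;
    }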

    [mgorman@techsingularity.net: s/pgstall/allocstall/]
    Link: http://lkml.kernel.org/r/1468404004-5085-3-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-33-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to the reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    node level. Most reclaim logic is based on the node counters but the
    retry logic uses the zone counters, which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being per-node, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case being
    that highmem pages have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

1 commit

    THP_FILE_ALLOC: how many times a huge page was allocated and put into the
    page cache.

    THP_FILE_MAPPED: how many times a file huge page was mapped.

    Link: http://lkml.kernel.org/r/1466021202-61880-13-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Mar, 2016

2 commits

    Count how many times we put a THP on the split queue. Currently, this
    happens on partial unmap of a THP.

    A rapidly growing value can indicate that an application behaves
    unfriendly wrt THP: it often faults in a huge page and then unmaps part
    of it. This leads to unnecessary memory fragmentation and the application
    may require tuning.

    The event can also help with debugging kernel [mis-]behaviour.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Memory compaction can currently be performed in several contexts:

    - kswapd balancing a zone after a high-order allocation failure
    - direct compaction to satisfy a high-order allocation, including THP
    page fault attempts
    - khugepaged trying to collapse a hugepage
    - manually from /proc

    The purpose of compaction is two-fold. The obvious purpose is to
    satisfy a (pending or future) high-order allocation, and is easy to
    evaluate. The other purpose is to keep overall memory fragmentation low
    and help the anti-fragmentation mechanism. The success wrt the latter
    purpose is more difficult to evaluate.

    The current situation wrt the purposes has a few drawbacks:

    - compaction is invoked only when a high-order page or hugepage is not
    available (or manually). This might be too late for the purposes of
    keeping memory fragmentation low.
    - direct compaction increases latency of allocations. Again, it would
    be better if compaction was performed asynchronously to keep
    fragmentation low, before the allocation itself comes.
    - (a special case of the previous) the cost of compaction during THP
    page faults can easily offset the benefits of THP.
    - kswapd compaction appears to be complex, fragile and not working in
    some scenarios. It could also end up compacting for a high-order
    allocation request when it should be reclaiming memory for a later
    order-0 request.

    To improve the situation, we should be able to benefit from an
    equivalent of kswapd, but for compaction - i.e. a background thread
    which responds to fragmentation and the need for high-order allocations
    (including hugepages) somewhat proactively.

    One possibility is to extend the responsibilities of kswapd, which could
    however complicate its design too much. It should be better to let
    kswapd handle reclaim, as order-0 allocations are often more critical
    than high-order ones.

    Another possibility is to extend khugepaged, but this kthread is a
    single instance and tied to THP configs.

    This patch goes with the option of a new set of per-node kthreads called
    kcompactd, and lays the foundations, without introducing any new
    tunables. The lifecycle mimics kswapd kthreads, including the memory
    hotplug hooks.

    For compaction, kcompactd uses the standard compaction_suitable() and
    compact_finished() criteria and the deferred compaction functionality.
    Unlike direct compaction, it uses only sync compaction, as there's no
    allocation latency to minimize.

    This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
    compact/reclaim loop for high-order pages will be replaced by waking up
    kcompactd in the next patch with the description of what's wrong with
    the old approach.

    Waking up of the kcompactd threads is also tied to kswapd activity and
    follows these rules:
    - we don't want to affect any fastpaths, so wake up kcompactd only from
    the slowpath, as it's done for kswapd
    - if kswapd is doing reclaim, it's more important than compaction, so
    don't invoke kcompactd until kswapd goes to sleep
    - the target order used for kswapd is passed to kcompactd

    Possible future uses for kcompactd include the ability to wake up
    kcompactd on demand in special situations, such as when hugepages are
    not available (currently not done due to __GFP_NO_KSWAPD) or when a
    fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
    possible to perform periodic compaction with kcompactd.

    [arnd@arndb.de: fix build errors with kcompactd]
    [paul.gortmaker@windriver.com: don't use modular references for non modular code]
    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul Gortmaker
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

16 Jan, 2016

2 commits

    Linux doesn't have the ability to free pages lazily, while other OSes
    have long supported this through madvise(MADV_FREE).

    The gain is clear: the kernel can discard freed pages rather than
    swapping them out or OOMing if memory pressure happens.

    Without memory pressure, freed pages are reused by userspace without
    any additional overhead (e.g., page fault + allocation + zeroing).

    Jason Evans said:

    : Facebook has been using MAP_UNINITIALIZED
    : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
    : several years, but there are operational costs to maintaining this
    : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
    : in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
    : increased throughput for much of our workload by ~5%, and although the
    : benefit has decreased using newer hardware and kernels, there is still
    : enough benefit that we cannot reasonably retire it without a replacement.
    :
    : Aside from Facebook operations, there are numerous broadly used
    : applications that would benefit from MADV_FREE. The ones that immediately
    : come to mind are redis, varnish, and MariaDB. I don't have much insight
    : into Android internals and development process, but I would hope to see
    : MADV_FREE support eventually end up there as well to benefit applications
    : linked with the integrated jemalloc.
    :
    : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
    : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
    : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
    : (and AIX, but I'm not sure it even compiles on AIX). The lack of
    : MADV_FREE on Linux forced me down a long series of increasingly
    : sophisticated heuristics for madvise() volume reduction, and even so this
    : remains a common performance issue for people using jemalloc on Linux.
    : Please integrate MADV_FREE; many people will benefit substantially.

    How it works:

    When madvise syscall is called, VM clears dirty bit of ptes of the
    range. If memory pressure happens, VM checks dirty bit of page table
    and if it found still "clean", it means it's a "lazyfree pages" so VM
    could discard the page instead of swapping out. Once there was store
    operation for the page before VM peek a page to reclaim, dirty bit is
    set so VM can swap out the page instead of discarding.
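
    A minimal userspace usage sketch (assumes a kernel and libc headers that
    define the MADV_FREE advice added by this series):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 64UL << 20;                 /* 64MB scratch buffer */
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                    return 1;
            memset(buf, 0xaa, len);                  /* use the memory */

    #ifdef MADV_FREE
            /* Pages may now be discarded under memory pressure instead of
             * being swapped out; a later write cancels the lazy free for
             * that page (it sets the dirty bit again, as described above). */
            if (madvise(buf, len, MADV_FREE))
                    perror("madvise(MADV_FREE)");
    #endif
            munmap(buf, len);
            return 0;
    }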

    One thing we should notice is that, basically, MADV_FREE relies on the
    dirty bit in the page table entry to decide whether the VM is allowed to
    discard the page or not. IOW, if the page table entry has the dirty bit
    set, the VM shouldn't discard the page.

    However, as an example, if a swap-in by read fault happens, the page
    table entry doesn't have the dirty bit set, so MADV_FREE could discard
    the page wrongly.

    To avoid the problem, MADV_FREE did more checks with PageDirty and
    PageSwapCache. This worked out because a swapped-in page lives in the
    swap cache, and once it is evicted from the swap cache, the page has the
    PG_dirty flag. So both page flag checks effectively prevent wrong
    discarding by MADV_FREE.

    However, a problem with the above logic is that swapped-in pages still
    have PG_dirty after they are removed from the swap cache, so the VM
    cannot consider such a page as freeable any more, even if madvise_free
    is called on it in the future.

    Look at the example below for detail.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of pages are swapped out
    ..
    ..
    var = *ptr; -> a page swapped-in and could be removed from
    swapcache. Then, page table doesn't mark
    dirty bit and page descriptor includes PG_dirty
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. In this time, VM cannot discard the page because the page
    .. has *PG_dirty*

    To solve the problem, this patch clears PG_dirty only if the page is
    owned exclusively by the current process when madvise is called, because
    PG_dirty represents the PTEs' dirtiness in several processes, so we can
    clear it only if we own it exclusively.

    The heavy users would be general allocators (e.g., jemalloc, tcmalloc,
    and hopefully glibc) and jemalloc/tcmalloc already support the feature
    on other OSes (e.g., FreeBSD).

    barrios@blaptop:~/benchmark/ebizzy$ lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 12
    On-line CPU(s) list: 0-11
    Thread(s) per core: 1
    Core(s) per socket: 1
    Socket(s): 12
    NUMA node(s): 1
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 2
    Stepping: 3
    CPU MHz: 3200.185
    BogoMIPS: 6400.53
    Virtualization: VT-x
    Hypervisor vendor: KVM
    Virtualization type: full
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 4096K
    NUMA node0 CPU(s): 0-11
    ebizzy benchmark (./ebizzy -S 10 -n 512)

    Higher avg is better.

    vanilla-jemalloc MADV_free-jemalloc

    1 thread
    records: 10 records: 10
    avg: 2961.90 avg: 12069.70
    std: 71.96(2.43%) std: 186.68(1.55%)
    max: 3070.00 max: 12385.00
    min: 2796.00 min: 11746.00

    2 thread
    records: 10 records: 10
    avg: 5020.00 avg: 17827.00
    std: 264.87(5.28%) std: 358.52(2.01%)
    max: 5244.00 max: 18760.00
    min: 4251.00 min: 17382.00

    4 thread
    records: 10 records: 10
    avg: 8988.80 avg: 27930.80
    std: 1175.33(13.08%) std: 3317.33(11.88%)
    max: 9508.00 max: 30879.00
    min: 5477.00 min: 21024.00

    8 thread
    records: 10 records: 10
    avg: 13036.50 avg: 33739.40
    std: 170.67(1.31%) std: 5146.22(15.25%)
    max: 13371.00 max: 40572.00
    min: 12785.00 min: 24088.00

    16 thread
    records: 10 records: 10
    avg: 11092.40 avg: 31424.20
    std: 710.60(6.41%) std: 3763.89(11.98%)
    max: 12446.00 max: 36635.00
    min: 9949.00 min: 25669.00

    32 thread
    records: 10 records: 10
    avg: 11067.00 avg: 34495.80
    std: 971.06(8.77%) std: 2721.36(7.89%)
    max: 12010.00 max: 38598.00
    min: 9002.00 min: 30636.00

    In summary, MADV_FREE is much faster than MADV_DONTNEED.

    This patch (of 12):

    Add core MADV_FREE implementation.

    [akpm@linux-foundation.org: small cleanups]
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: Mika Penttilä
    Cc: Michael Kerrisk
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Jason Evans
    Cc: Daniel Micay
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc: Andy Lutomirski
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: "Shaohua Li"
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    The patch replaces THP_SPLIT with three events: THP_SPLIT_PAGE,
    THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD. It reflects the fact that we
    are going to be able to split a PMD without the compound page and that
    split_huge_page() can fail.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Christoph Lameter
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Nov, 2015

1 commit


14 Dec, 2014

1 commit

    These flushes deal with sequence number overflows, such as for long-lived
    threads. These are rare, but interesting from a debugging PoV. As such,
    display the number of flushes when vmacache debugging is enabled.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

10 Oct, 2014

1 commit

    Always mark pages with PageBalloon even if balloon compaction is
    disabled, and expose this mark in /proc/kpageflags as KPF_BALLOON.

    This patch also adds three counters to /proc/vmstat: "balloon_inflate",
    "balloon_deflate" and "balloon_migrate". They accumulate balloon
    activity. The current size of the balloon is (balloon_inflate -
    balloon_deflate) pages.

    All generic balloon code is now gathered under the CONFIG_MEMORY_BALLOON
    option. It should be selected by any ballooning driver that wants to use
    this feature. Currently virtio-balloon is the only user.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

05 Jun, 2014

1 commit

    Introduce a CONFIG_DEBUG_VM_VMACACHE option to enable counting the cache
    hit rate -- exported in /proc/vmstat.

    Any update to the caching scheme needs this kind of data, so this can
    save the work of re-implementing the counting each time.

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

04 Apr, 2014

1 commit

  • There is plenty of anecdotal evidence and a load of blog posts
    suggesting that using "drop_caches" periodically keeps your system
    running in "tip top shape". Perhaps adding some kernel documentation
    will increase the amount of accurate data on its use.

    If we are not shrinking caches effectively, then we have real bugs.
    Using drop_caches will simply mask the bugs and make them harder to
    find, but certainly does not fix them, nor is it an appropriate
    "workaround" to limit the size of the caches. On the contrary, there
    have been bug reports on issues that turned out to be misguided use of
    cache dropping.

    Dropping caches is a very drastic and disruptive operation that is good
    for debugging and running tests, but if it creates bug reports from
    production use, kernel developers should be aware of its use.

    Add a bit more documentation about it, a syslog message to track down
    abusers, and vmstat drop counters to help analyze problem reports.
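
    For reference, the interface itself (root only; debugging/testing use, as
    argued above) -- a sketch; the drop counter names in the final comment
    are an assumption, not quoted from this patch:

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/vm/drop_caches", "w");

            if (!f) {
                    perror("drop_caches");
                    return 1;
            }
            fputs("3\n", f);    /* 1 = page cache, 2 = slab objects, 3 = both */
            fclose(f);

            /* Each invocation is now logged to syslog and counted, e.g.:
             *   grep drop_ /proc/vmstat  ->  drop_pagecache / drop_slab */
            return 0;
    }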

    [akpm@linux-foundation.org: checkpatch fixes]
    [hannes@cmpxchg.org: add runtime suppression control]
    Signed-off-by: Dave Hansen
    Signed-off-by: Michal Hocko
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

25 Jan, 2014

1 commit

    Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
    vmstats: tlb flush counters") as the cause of overhead problems.

    The counters are undeniably useful but how often do we really
    need to debug TLB flush related issues? It does not justify
    taking the penalty everywhere, so make it a debugging option.

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alex Shi
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

13 Nov, 2013

1 commit

  • Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
    PTE update") was added to account for the number of PTE updates when
    marking pages prot_numa. task_numa_work was using the old return value
    to track how much address space had been updated. Altering the return
    value causes the scanner to do more work than it is configured or
    documented to in a single unit of work.

    This patch reverts that commit and accounts for the number of THP
    updates separately in vmstat. It is up to the administrator to
    interpret the pair of values correctly. This is a straight-forward
    operation and likely to only be of interest when actively debugging NUMA
    balancing problems.

    The impact of this patch is that the NUMA PTE scanner will scan slower
    when THP is enabled and workloads may converge slower as a result. On
    the flip side, system CPU usage should be lower than recent tests
    reported. This is an illustrative example of a short single JVM specjbb
    test:

    specjbb
    3.12.0 3.12.0
    vanilla acctupdates
    TPut 1 26143.00 ( 0.00%) 25747.00 ( -1.51%)
    TPut 7 185257.00 ( 0.00%) 183202.00 ( -1.11%)
    TPut 13 329760.00 ( 0.00%) 346577.00 ( 5.10%)
    TPut 19 442502.00 ( 0.00%) 460146.00 ( 3.99%)
    TPut 25 540634.00 ( 0.00%) 549053.00 ( 1.56%)
    TPut 31 512098.00 ( 0.00%) 519611.00 ( 1.47%)
    TPut 37 461276.00 ( 0.00%) 474973.00 ( 2.97%)
    TPut 43 403089.00 ( 0.00%) 414172.00 ( 2.75%)

                  3.12.0      3.12.0
                 vanilla      acctupdates
    User         5169.64      5184.14
    System        100.45        80.02
    Elapsed       252.75       251.85

    Performance is similar but note the reduction in system CPU time. While
    this showed a performance gain, it will not be universal but at least
    it'll be behaving as documented. The vmstats are obviously different but
    here is an obvious interpretation of them from mmtests.

                                 3.12.0      3.12.0
                                vanilla      acctupdates
    NUMA page range updates     1408326     11043064
    NUMA huge PMD updates             0        21040
    NUMA PTE updates            1408326       291624

    "NUMA page range updates" == nr_pte_updates and is the value returned to
    the NUMA pte scanner. NUMA huge PMD updates were the number of THP
    updates which in combination can be used to calculate how many ptes were
    updated from userspace.

    Signed-off-by: Mel Gorman
    Reported-by: Alex Thorlton
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Sep, 2013

2 commits

  • The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
    counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
    for SMP.

    UP systems do not do remote TLB flushes, so compile those counters out on
    UP.

    arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is
    probably an optimization since both the mtrr code and __flush_tlb() write
    cr4. It would probably be safe to make that a flush_tlb_all() (and then
    get these statistics), but the mtrr code is ancient and I'm hesitant to
    touch it other than to just stick in the counters.

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Dave Hansen
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I was investigating some TLB flush scaling issues and realized that we do
    not have any good methods for figuring out how many TLB flushes we are
    doing.

    It would be nice to be able to do these in generic code, but the
    arch-independent calls don't explicitly specify whether we actually need
    to do remote flushes or not. In the end, we really need to know if we
    actually _did_ global vs. local invalidations, so that leaves us with few
    options other than to muck with the counters from arch-specific code.

    Signed-off-by: Dave Hansen
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

24 Feb, 2013

1 commit

  • From: Zlatko Calusic

    Commit 92df3a723f84 ("mm: vmscan: throttle reclaim if encountering too
    many dirty pages under writeback") introduced waiting on congested zones
    based on a sane algorithm in shrink_inactive_list().

    What this means is that there's no more need for throttling and
    additional heuristics in balance_pgdat(). So, let's remove it and tidy
    up the code.

    Signed-off-by: Zlatko Calusic
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

    hzp_alloc is incremented every time a huge zero page is successfully
    allocated. It includes allocations which were dropped due to a race
    with another allocation. Note, it doesn't count every map of the
    huge zero page, only its allocation.

    hzp_alloc_failed is incremented if kernel fails to allocate huge zero
    page and falls back to using small pages.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Dec, 2012

3 commits

  • It is tricky to quantify the basic cost of automatic NUMA placement in a
    meaningful manner. This patch adds some vmstats that can be used as part
    of a basic costing model.

    u = basic unit = sizeof(void *)
    Ca = cost of struct page access = sizeof(struct page) / u
    Cpte = Cost PTE access = Ca
    Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
        where Cpte is incurred twice for a read and a write and Wlock
        is a constant representing the cost of taking or releasing a
        lock
    Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
    Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
    Ci = Cost of page isolation = Ca + Wi
        where Wi is a constant that should reflect the approximate cost
        of the locking operation
    Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
        where Wnuma is the approximate NUMA factor. 1 is local. 1.2
        would imply that remote accesses are 20% more expensive

    Balancing cost = Cpte * numa_pte_updates +
                     Cnumahint * numa_hint_faults +
                     Ci * numa_pages_migrated +
                     Cpagecopy * numa_pages_migrated

    Note that numa_pages_migrated is used as a measure of how many pages
    were isolated even though it would miss pages that failed to migrate. A
    vmstat counter could have been added for it but the isolation cost is
    pretty marginal in comparison to the overall cost so it seemed overkill.

    The ideal way to measure automatic placement benefit would be to count
    the number of remote accesses versus local accesses and do something like

    benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

    but the information is not readily available. As a workload converges, the
    expectation would be that the number of remote numa hints would reduce to 0.

    convergence = numa_hint_faults_local / numa_hint_faults
    where this is measured for the last N number of
    numa hints recorded. When the workload is fully
    converged the value is 1.

    This can measure if the placement policy is converging and how fast it is
    doing it.
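
    A sketch of the model as plain arithmetic (the weights Wi, Wnuma,
    Cnumahint and the sizeof(struct page) value are illustrative assumptions;
    the counter values would come from the new vmstat fields):

    #include <stdio.h>

    #define U          ((double)sizeof(void *))
    #define CA         (64.0 / U)        /* sizeof(struct page)/u, assumed 64 */
    #define CPTE       CA
    #define CNUMAHINT  1000.0            /* "some high constant"              */
    #define WI         1.0
    #define CI         (CA + WI)
    #define CPAGERW    (CA + 4096.0 / U)
    #define WNUMA      1.2
    #define CPAGECOPY  (CPAGERW + CPAGERW * WNUMA + CI + CI * WNUMA)

    int main(void)
    {
            /* example values for the counters added by this patch */
            double numa_pte_updates       = 1000000;
            double numa_hint_faults       = 50000;
            double numa_hint_faults_local = 40000;
            double numa_pages_migrated    = 10000;

            double cost = CPTE * numa_pte_updates +
                          CNUMAHINT * numa_hint_faults +
                          CI * numa_pages_migrated +
                          CPAGECOPY * numa_pages_migrated;
            double convergence = numa_hint_faults_local / numa_hint_faults;

            printf("balancing cost ~ %.0f units, convergence %.2f\n",
                   cost, convergence);
            return 0;
    }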

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Mel Gorman
     
  • Compaction already has tracepoints to count scanned and isolated pages
    but it requires that ftrace be enabled and if that information has to be
    written to disk then it can be disruptive. This patch adds vmstat counters
    for compaction called compact_migrate_scanned, compact_free_scanned and
    compact_isolated.

    With these counters, it is possible to define a basic cost model for
    compaction. This approximates how much work compaction is doing and can
    be compared with an oprofile showing TLB misses to see whether the cost
    of compaction is being offset by THP, for example. Minimally, a
    compaction patch can be evaluated in terms of whether it increases or
    decreases cost. The basic cost model looks like this:

    Fundamental unit u: a word sizeof(void *)

    Ca  = cost of struct page access = sizeof(struct page) / u

    Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
    Cmf = Cost migrate failure = Ca * 2
    Ci  = Cost page isolation = (Ca + Wi)
          where Wi is a constant that should reflect the approximate
          cost of the locking operation.

    Csm = Cost migrate scanning = Ca
    Csf = Cost free scanning = Ca

    Overall cost = (Csm * compact_migrate_scanned) +
                   (Csf * compact_free_scanned) +
                   (Ci * compact_isolated) +
                   (Cmc * pgmigrate_success) +
                   (Cmf * pgmigrate_failed)

    Where the values are read from /proc/vmstat.

    This is very basic and ignores certain costs such as the allocation cost
    to do a migrate page copy but any improvement to the model would still
    use the same vmstat counters.
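
    The same model, sketched as arithmetic over the new counters (Wi and
    sizeof(struct page) are illustrative assumptions):

    #include <stdio.h>

    #define U    ((double)sizeof(void *))
    #define CA   (64.0 / U)                  /* struct page access  */
    #define WI   1.0
    #define CMC  ((CA + 4096.0 / U) * 2)     /* migrate page copy   */
    #define CMF  (CA * 2)                    /* migrate failure     */
    #define CI   (CA + WI)                   /* page isolation      */
    #define CSM  CA                          /* migrate scanning    */
    #define CSF  CA                          /* free scanning       */

    int main(void)
    {
            /* example values read from /proc/vmstat */
            double compact_migrate_scanned = 2000000;
            double compact_free_scanned    = 8000000;
            double compact_isolated        = 500000;
            double pgmigrate_success       = 400000;
            double pgmigrate_fail          = 10000;

            double cost = CSM * compact_migrate_scanned +
                          CSF * compact_free_scanned +
                          CI * compact_isolated +
                          CMC * pgmigrate_success +
                          CMF * pgmigrate_fail;

            printf("compaction cost ~ %.0f units\n", cost);
            return 0;
    }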

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     
  • The compact_pages_moved and compact_pagemigrate_failed events are
    convenient for determining if compaction is active and to what
    degree migration is succeeding but it's at the wrong level. Other
    users of migration may also want to know if migration is working
    properly and this will be particularly true for any automated
    NUMA migration. This patch moves the counters down to migration
    with the new events called pgmigrate_success and pgmigrate_fail.
    The compact_blocks_moved counter is removed because while it was
    useful for debugging initially, it's worthless now as no meaningful
    conclusions can be drawn from its value.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     

09 Oct, 2012

2 commits

  • Simply remove UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed line
    from /proc/vmstat: Johannes and Mel point out that it was very unlikely to
    have been used by any tool, and of course we can restore it easily enough
    if that turns out to be wrong.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So
    remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
    already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
    checking it, reporting "BUG: Bad page state" if it's ever found set.
    Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

01 Aug, 2012

1 commit

  • Under significant pressure when writing back to network-backed storage,
    direct reclaimers may get throttled. This is expected to be a short-lived
    event and the processes get woken up again but processes do get stalled.
    This patch counts how many times such stalling occurs. It's up to the
    administrator whether to reduce these stalls by increasing
    min_free_kbytes.

    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

26 Apr, 2012

1 commit

  • The "pgsteal" stat is confusing because it counts both direct reclaim as
    well as background reclaim. However, we have "kswapd_steal" which also
    counts background reclaim value.

    This patch fixes it and also makes it match the existng "pgscan_" stats.

    Test:
    pgsteal_kswapd_dma32 447623
    pgsteal_kswapd_normal 42272677
    pgsteal_kswapd_movable 0
    pgsteal_direct_dma32 2801
    pgsteal_direct_normal 44353270
    pgsteal_direct_movable 0

    Signed-off-by: Ying Han
    Reviewed-by: Rik van Riel
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Dan Magenheimer
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

27 May, 2011

1 commit

  • enums are problematic because they cannot be forward-declared:

    akpm2:/home/akpm> cat t.c

    enum foo;

    static inline void bar(enum foo f)
    {
    }
    akpm2:/home/akpm> gcc -c t.c
    t.c:4: error: parameter 1 ('f') has incomplete type

    So move the enum's definition into a standalone header file which can be used
    wherever its definition is needed.
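
    A minimal sketch of the pattern (hypothetical file and enum names), with
    both "files" shown in one listing:

    /* foo_defs.h -- standalone header holding only the enum definition */
    #ifndef FOO_DEFS_H
    #define FOO_DEFS_H

    enum foo {
            FOO_A,
            FOO_B,
    };

    #endif /* FOO_DEFS_H */

    /* t.c -- anything needing the complete type just includes the header */
    #include "foo_defs.h"

    static inline void bar(enum foo f)
    {
            (void)f;        /* complete type, so gcc no longer complains */
    }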

    Cc: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton