29 Feb, 2020

1 commit

  • commit 76073c646f5f4999d763f471df9e38a5a912d70d upstream.

    Commit 68600f623d69 ("mm: don't miss the last page because of round-off
    error") makes the scan size round up to @denominator regardless of the
    memory cgroup's state, online or offline. This affects the overall
    reclaiming behavior: the corresponding LRU list is eligible for
    reclaiming only when its size logically right shifted by @sc->priority
    is bigger than zero in the former formula.

    For example, the inactive anonymous LRU list should have at least 0x4000
    pages to be eligible for reclaiming when we have 60/12 for
    swappiness/priority and without taking scan/rotation ratio into account.

    After the roundup is applied, the inactive anonymous LRU list becomes
    eligible for reclaiming when its size is bigger than or equal to 0x1000
    in the same condition.

    (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
    (((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
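
    To double-check the arithmetic, the two expressions can be evaluated
    with a few lines of C (illustrative only; the numbers are the example's
    priority of 12, swappiness of 60 and denominator of 60 + 140 + 1, and
    the round-up form simply adds denominator - 1 before dividing):

        #include <stdio.h>

        int main(void)
        {
            unsigned long swappiness = 60, denominator = 60 + 140 + 1;
            unsigned long priority = 12;
            unsigned long sizes[] = { 0x4000, 0x1000 };

            for (int i = 0; i < 2; i++) {
                unsigned long base = sizes[i] >> priority;
                unsigned long down = base * swappiness / denominator;
                unsigned long up = (base * swappiness + denominator - 1) /
                                   denominator;

                printf("size %#lx: round-down %lu, round-up %lu\n",
                       sizes[i], down, up);
            }
            return 0;
        }

    With plain division a list needs 0x4000 pages before the scan target
    reaches 1, while with the round-up a 0x1000-page list already gets a
    scan target of 1, which is why the smaller list becomes eligible once
    the roundup is applied unconditionally.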

    aarch64 has 512MB huge page size when the base page size is 64KB. The
    memory cgroup that has a huge page is always eligible for reclaiming in
    that case.

    The reclaiming is likely to stop after the huge page is reclaimed,
    meaning the further iteration on @sc->priority and the sibling and child
    memory cgroups will be skipped. The overall behaviour has been changed.
    This fixes the issue by applying the roundup to offline memory cgroups
    only, to keep giving preference to reclaiming memory from offline memory
    cgroups. That sounds reasonable as their memory is unlikely to be used
    by anyone.

    The issue was found by starting up 8 VMs on an Ampere Mustang machine,
    which has 8 CPUs and 16 GB of memory. Each VM is given 2 vCPUs and
    2GB of memory. It took 264 seconds for all VMs to be completely up, and
    784MB of swap is consumed after that. With this patch applied, it took
    236 seconds and 60MB of swap to do the same thing. So there is a 10%
    performance improvement in my case. Note that KSM is disabled while THP
    is enabled in the testing.

              total   used   free  shared  buff/cache  available
    Mem:      16196  10065   2049      16        4081       3749
    Swap:      8175    784   7391
              total   used   free  shared  buff/cache  available
    Mem:      16196  11324   3656      24        1215       2936
    Swap:      8175     60   8115

    Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
    Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
    Signed-off-by: Gavin Shan
    Acked-by: Roman Gushchin
    Cc: [4.20+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gavin Shan
     

31 Dec, 2019

1 commit

  • commit 42a9a53bb394a1de2247ef78f0b802ae86798122 upstream.

    Since commit 0a432dcbeb32 ("mm: shrinker: make shrinker not depend on
    memcg kmem"), shrinkers' idr is protected by CONFIG_MEMCG instead of
    CONFIG_MEMCG_KMEM, so it makes no sense to protect shrinker idr replace
    with CONFIG_MEMCG_KMEM.

    And in the CONFIG_MEMCG && CONFIG_SLOB case, shrinker_idr contains only
    one shrinker, deferred_split_shrinker. But it is never actually called,
    since idr_replace() is never compiled in due to the wrong #ifdef. The
    deferred_split_shrinker stays in a half-registered state all the time,
    and it's never called for subordinate mem cgroups.
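
    The failure mode can be illustrated with a self-contained userspace
    mock (this is not the kernel code, just the bug class: a call guarded
    by the wrong config macro silently compiles away when only the broader
    option is enabled):

        #include <stdio.h>

        #define CONFIG_MEMCG 1          /* broader option enabled */
        /* CONFIG_MEMCG_KMEM deliberately not defined */

        static void finish_registration(void)
        {
            puts("shrinker fully registered");
        }

        int main(void)
        {
        #ifdef CONFIG_MEMCG_KMEM        /* wrong guard: never compiled in */
            finish_registration();
        #endif
        #ifdef CONFIG_MEMCG             /* the guard the code should use */
            finish_registration();
        #endif
            return 0;
        }

    With the wrong guard the registration step is simply absent from the
    build, which is exactly the half-registered state described above.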

    Link: http://lkml.kernel.org/r/1575486978-45249-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: 0a432dcbeb32 ("mm: shrinker: make shrinker not depend on memcg kmem")
    Signed-off-by: Yang Shi
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Roman Gushchin
    Cc: [5.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     

19 Oct, 2019

2 commits

  • __remove_mapping() assumes that pages can only be either base pages or
    HPAGE_PMD_SIZE. Ask the page what size it is.

    Link: http://lkml.kernel.org/r/20191017164223.2762148-4-songliubraving@fb.com
    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: William Kucharski
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Song Liu
    Acked-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Srikar Dronamraju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    William Kucharski
     
  • Commit 1a61ab8038e72 ("mm: memcontrol: replace zone summing with
    lruvec_page_state()") has made lruvec_page_state() use per-cpu counters
    instead of calculating it directly from lru_zone_size, with the idea
    that this would be more effective.

    Tim has reported that this is not really the case for their database
    benchmark, which shows the opposite result: lruvec_page_state is taking
    up a huge chunk of CPU cycles (about 25% of the system time, which is
    roughly 7% of total CPU cycles) on 5.3 kernels. The workload is running
    on a large machine (96 CPUs), it has many cgroups (500) and it is
    heavily direct-reclaim bound.

    Tim Chen said:

    : The problem can also be reproduced by running simple multi-threaded
    : pmbench benchmark with a fast Optane SSD swap (see profile below).
    :
    :
    : 6.15% 3.08% pmbench [kernel.vmlinux] [k] lruvec_lru_size
    : |
    : |--3.07%--lruvec_lru_size
    : | |
    : | |--2.11%--cpumask_next
    : | | |
    : | | --1.66%--find_next_bit
    : | |
    : | --0.57%--call_function_interrupt
    : | |
    : | --0.55%--smp_call_function_interrupt
    : |
    : |--1.59%--0x441f0fc3d009
    : | _ops_rdtsc_init_base_freq
    : | access_histogram
    : | page_fault
    : | __do_page_fault
    : | handle_mm_fault
    : | __handle_mm_fault
    : | |
    : | --1.54%--do_swap_page
    : | swapin_readahead
    : | swap_cluster_readahead
    : | |
    : | --1.53%--read_swap_cache_async
    : | __read_swap_cache_async
    : | alloc_pages_vma
    : | __alloc_pages_nodemask
    : | __alloc_pages_slowpath
    : | try_to_free_pages
    : | do_try_to_free_pages
    : | shrink_node
    : | shrink_node_memcg
    : | |
    : | |--0.77%--lruvec_lru_size
    : | |
    : | --0.76%--inactive_list_is_low
    : | |
    : | --0.76%--lruvec_lru_size
    : |
    : --1.50%--measure_read
    : page_fault
    : __do_page_fault
    : handle_mm_fault
    : __handle_mm_fault
    : do_swap_page
    : swapin_readahead
    : swap_cluster_readahead
    : |
    : --1.48%--read_swap_cache_async
    : __read_swap_cache_async
    : alloc_pages_vma
    : __alloc_pages_nodemask
    : __alloc_pages_slowpath
    : try_to_free_pages
    : do_try_to_free_pages
    : shrink_node
    : shrink_node_memcg
    : |
    : |--0.75%--inactive_list_is_low
    : | |
    : | --0.75%--lruvec_lru_size
    : |
    : --0.73%--lruvec_lru_size

    The likely culprit is the cache traffic the lruvec_page_state_local
    generates. Dave Hansen says:

    : I was thinking purely of the cache footprint. If it's reading
    : pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
    : bytes of cache *96 CPUs = 18k of data, mostly read-only. 1 cgroup would
    : be 18k of data for the whole system and the caching would be pretty
    : efficient and all 18k would probably survive a tight page fault loop in
    : the L1. 500 cgroups would be ~90k of data per CPU thread which doesn't
    : fit in the L1 and probably wouldn't survive a tight page fault loop if
    : both logical threads were banging on different cgroups.
    :
    : It's just a theory, but it's why I noted the number of cgroups when I
    : initially saw this show up in profiles

    Fix the regression by partially reverting the said commit and
    calculating the lru size explicitly.
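
    The shape of the restored calculation is roughly the following (a
    hedged sketch of the partial revert, assuming the usual mm helpers of
    that era rather than quoting the patch verbatim): sum the per-zone LRU
    sizes up to the requested zone index instead of walking the per-memcg
    per-cpu counters.

        /* sketch: lruvec_lru_size() going back to per-zone summing */
        unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru,
                                      int zone_idx)
        {
            unsigned long size = 0;
            int zid;

            for (zid = 0; zid <= zone_idx; zid++) {
                struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid];

                if (!managed_zone(zone))
                    continue;

                if (!mem_cgroup_disabled())
                    size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
                else
                    size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
            }
            return size;
        }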

    Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
    Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
    Signed-off-by: Honglei Wang
    Reported-by: Tim Chen
    Acked-by: Tim Chen
    Tested-by: Tim Chen
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Tejun Heo
    Cc: Dave Hansen
    Cc: [5.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Honglei Wang
     

08 Oct, 2019

3 commits

  • This patch is an incremental improvement on the existing
    memory.{low,min} relative reclaim work to base its scan pressure
    calculations on how much protection is available compared to the current
    usage, rather than how much the current usage is over some protection
    threshold.

    This change doesn't change the experience for the user in the normal
    case too much. One benefit is that it replaces the (somewhat arbitrary)
    100% cutoff with an indefinite slope, which makes it easier to ballpark
    a memory.low value.

    As well as this, the old methodology doesn't quite apply generically to
    machines with varying amounts of physical memory. Let's say we have a
    top level cgroup, workload.slice, and another top level cgroup,
    system-management.slice. We want to roughly give 12G to
    system-management.slice, so on a 32GB machine we set memory.low to 20GB
    in workload.slice, and on a 64GB machine we set memory.low to 52GB.
    However, because these are relative amounts to the total machine size,
    while the amount of memory we want to generally be willing to yield to
    system.slice is absolute (12G), we end up putting more pressure on
    system.slice just because we have a larger machine and a larger workload
    to fill it, which seems fairly unintuitive. With this new behaviour, we
    don't end up with this unintended side effect.

    Previously the way that memory.low protection works is that if you are
    50% over a certain baseline, you get 50% of your normal scan pressure.
    This is certainly better than the previous cliff-edge behaviour, but it
    can be improved even further by always considering memory under the
    currently enforced protection threshold to be out of bounds. This means
    that we can set relatively low memory.low thresholds for variable or
    bursty workloads while still getting a reasonable level of protection,
    whereas with the previous version we may still trivially hit the 100%
    clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
    one is more concretely based on the currently enforced protection
    threshold, which is likely easier to reason about.
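
    A toy comparison of the two approaches (illustrative arithmetic only,
    with a made-up protection value; this is not the kernel's exact scan
    formula, which also clamps the result and handles memory.min):

        #include <stdio.h>

        int main(void)
        {
            double protection = 10.0;               /* GiB of memory.low */
            double usages[] = { 11.0, 15.0, 20.0, 40.0 };

            for (int i = 0; i < 4; i++) {
                double usage = usages[i];
                double overage = usage - protection;
                /* old: overage relative to the threshold, clamped at 100% */
                double old_scale = overage / protection;
                /* new: the unprotected share of current usage, no cutoff */
                double new_scale = overage / usage;

                if (old_scale > 1.0)
                    old_scale = 1.0;
                printf("usage %5.1fG: old %3.0f%%  new %3.0f%%\n",
                       usage, old_scale * 100, new_scale * 100);
            }
            return 0;
        }

    The old method saturates at full pressure once usage is double the
    protection, while the new one keeps scaling with how much of the
    current usage is actually unprotected.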

    There is also a subtle issue with the way that proportional reclaim
    worked previously -- it promotes having no memory.low, since it makes
    pressure higher during low reclaim. This happens because we base our
    scan pressure modulation on how far memory.current is between memory.min
    and memory.low, but if memory.low is unset, we only use the overage
    method. In most cromulent configurations, this then means that we end
    up with *more* pressure than with no memory.low at all when we're in low
    reclaim, which is not really very usable or expected.

    With this patch, memory.low and memory.min affect reclaim pressure in a
    more understandable and composable way. For example, from a user
    standpoint, "protected" memory now remains untouchable from a reclaim
    aggression standpoint, and users can also have more confidence that
    bursty workloads will still receive some amount of guaranteed
    protection.

    Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name
    Signed-off-by: Chris Down
    Reviewed-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Dennis Zhou
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Roman points out that when we do the low reclaim pass, we scale the
    reclaim pressure relative to the position between 0 and the maximum
    protection threshold.

    However, if the maximum protection is based on memory.elow, and
    memory.emin is above zero, this means we still may get binary behaviour
    on second-pass low reclaim. This is because we scale starting at 0, not
    starting at memory.emin, and since we don't scan at all below emin, we
    end up with cliff behaviour.

    This should be a fairly uncommon case since usually we don't go into the
    second pass, but it makes sense to scale our low reclaim pressure
    starting at emin.

    You can test this by catting two large sparse files, one in a cgroup
    with emin set to some moderate size compared to physical RAM, and
    another cgroup without any emin. In both cgroups, set an elow larger
    than 50% of physical RAM. The one with emin will have less page
    scanning, as reclaim pressure is lower.
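
    A toy illustration of the difference (made-up numbers, not the kernel's
    exact computation): with emin = 4G and elow = 10G, scaling from 0 jumps
    straight to 40% pressure the moment usage crosses emin, whereas scaling
    from emin ramps up smoothly from zero.

        #include <stdio.h>

        int main(void)
        {
            double emin = 4.0, elow = 10.0;         /* GiB, made up */
            double usages[] = { 4.0, 4.5, 6.0, 8.0, 10.0 };

            for (int i = 0; i < 5; i++) {
                double u = usages[i];
                /* old: position between 0 and elow */
                double old_scale = u / elow;
                /* new: position between emin and elow */
                double new_scale = (u - emin) / (elow - emin);

                printf("usage %4.1fG: old %3.0f%%  new %3.0f%%\n",
                       u, old_scale * 100, new_scale * 100);
            }
            return 0;
        }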

    Rebase on top of and apply the same idea as what was applied to handle
    cgroup_memory=disable properly for the original proportional patch
    http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
    memcg: Handle cgroup_disable=memory when getting memcg protection").

    Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name
    Signed-off-by: Chris Down
    Suggested-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Dennis Zhou
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • cgroup v2 introduces two memory protection thresholds: memory.low
    (best-effort) and memory.min (hard protection). While they generally do
    what they say on the tin, there is a limitation in their implementation
    that makes them difficult to use effectively: that cliff behaviour often
    manifests when they become eligible for reclaim. This patch implements
    more intuitive and usable behaviour, where we gradually mount more
    reclaim pressure as cgroups further and further exceed their protection
    thresholds.

    This cliff edge behaviour happens because we only choose whether or not
    to reclaim based on whether the memcg is within its protection limits
    (see the use of mem_cgroup_protected in shrink_node), but we don't vary
    our reclaim behaviour based on this information. Imagine the following
    timeline, with the numbers the lruvec size in this zone:

    1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
    2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
    3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
    scanned. (?!)

    * Of course, we won't usually scan all available pages in the zone even
    without this patch because of scan control priority, over-reclaim
    protection, etc. However, as shown by the tests at the end, these
    techniques don't sufficiently throttle such an extreme change in input,
    so cliff-like behaviour isn't really averted by their existence alone.
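
    A small numeric illustration of the timeline above (a sketch of the
    described behaviour, not the exact in-kernel computation, which also
    applies clamps and handles memory.min):

        #include <stdio.h>

        int main(void)
        {
            unsigned long low = 1000000, lruvec_size = 1000001;
            unsigned long current_usage[] = { 999999, 1000000, 1000001,
                                              1100000, 1500000 };

            for (int i = 0; i < 5; i++) {
                unsigned long cur = current_usage[i];
                /* before: binary -- either fully protected or fully exposed */
                unsigned long old_scan = cur <= low ? 0 : lruvec_size;
                /* after: expose only the share of usage above the protection */
                unsigned long new_scan = cur <= low ? 0 :
                    lruvec_size - (unsigned long)((double)lruvec_size * low / cur);

                printf("memory.current=%lu: old scan %lu, new scan %lu\n",
                       cur, old_scan, new_scan);
            }
            return 0;
        }

    Instead of jumping from 0 to every page the moment memory.current
    crosses memory.low, the scan target grows with the proportion of usage
    that exceeds the protection.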

    Here's an example of how this plays out in practice. At Facebook, we are
    trying to protect various workloads from "system" software, like
    configuration management tools, metric collectors, etc (see this[0] case
    study). In order to find a suitable memory.low value, we start by
    determining the expected memory range within which the workload will be
    comfortable operating. This isn't an exact science -- memory usage deemed
    "comfortable" will vary over time due to user behaviour, differences in
    composition of work, etc, etc. As such we need to ballpark memory.low,
    but doing this is currently problematic:

    1. If we end up setting it too low for the workload, it won't have
    *any* effect (see discussion above). The group will receive the full
    weight of reclaim and won't have any priority while competing with the
    less important system software, as if we had no memory.low configured
    at all.

    2. Because of this behaviour, we end up erring on the side of setting
    it too high, such that the comfort range is reliably covered. However,
    protected memory is completely unavailable to the rest of the system,
    so we might cause undue memory and IO pressure there when we *know* we
    have some elasticity in the workload.

    3. Even if we get the value totally right, smack in the middle of the
    comfort zone, we get extreme jumps between no pressure and full
    pressure that cause unpredictable pressure spikes in the workload due
    to the current binary reclaim behaviour.

    With this patch, we can set it to our ballpark estimation without too much
    worry. Any undesirable behaviour, such as too much or too little reclaim
    pressure on the workload or system will be proportional to how far our
    estimation is off. This means we can set memory.low much more
    conservatively and thus waste less resources *without* the risk of the
    workload falling off a cliff if we overshoot.

    As a more abstract technical description, this unintuitive behaviour
    results in having to give high-priority workloads a large protection
    buffer on top of their expected usage to function reliably, as otherwise
    we have abrupt periods of dramatically increased memory pressure which
    hamper performance. Having to set these thresholds so high wastes
    resources and generally works against the principle of work conservation.
    In addition, having proportional memory reclaim behaviour has other
    benefits. Most notably, before this patch it's basically mandatory to set
    memory.low to a higher than desirable value because otherwise as soon as
    you exceed memory.low, all protection is lost, and all pages are eligible
    to scan again. By contrast, having a gradual ramp in reclaim pressure
    means that you now still get some protection when thresholds are exceeded,
    which means that one can now be more comfortable setting memory.low to
    lower values without worrying that all protection will be lost. This is
    important because workingset size is really hard to know exactly,
    especially with variable workloads, so at least getting *some* protection
    if your workingset size grows larger than you expect increases user
    confidence in setting memory.low without a huge buffer on top being
    needed.

    Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
    assistance in thinking about how to make this work better.

    In testing these changes, I intended to verify that:

    1. Changes in page scanning become gradual and proportional instead of
    binary.

    To test this, I experimented stepping further and further down
    memory.low protection on a workload that floats around 19G workingset
    when under memory.low protection, watching page scan rates for the
    workload cgroup:

    +------------+-----------------+--------------------+--------------+
    | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
    +------------+-----------------+--------------------+--------------+
    | 21G        |               0 |                  0 | N/A          |
    | 17G        |             867 |               3799 | 23%          |
    | 12G        |            1203 |               3543 | 34%          |
    | 8G         |            2534 |               3979 | 64%          |
    | 4G         |            3980 |               4147 | 96%          |
    | 0          |            3799 |               3980 | 95%          |
    +------------+-----------------+--------------------+--------------+

    As you can see, the test kernel (with a kernel containing this
    patch) ramps up page scanning significantly more gradually than the
    control kernel (without this patch).

    2. More gradual ramp up in reclaim aggression doesn't result in
    premature OOMs.

    To test this, I wrote a script that slowly increments the number of
    pages held by stress(1)'s --vm-keep mode until a production system
    entered severe overall memory contention. This script runs in a highly
    protected slice taking up the majority of available system memory.
    Watching vmstat revealed that page scanning continued essentially
    nominally between test and control, without causing forward reclaim
    progress to become arrested.

    [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project

    [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
    [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
    Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
    Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Dennis Zhou
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     

26 Sep, 2019

2 commits

  • When a process expects no accesses to a certain memory range for a long
    time, it can hint the kernel that the pages can be reclaimed instantly
    but the data should be preserved for future use. This could reduce
    workingset eviction and so end up increasing performance.

    This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
    MADV_PAGEOUT can be used by a process to mark a memory range as not
    expected to be used for a long time so that kernel reclaims *any LRU*
    pages instantly. The hint can help kernel in deciding which pages to
    evict proactively.

    A note: it intentionally doesn't apply the SWAP_CLUSTER_MAX LRU page
    isolation limit because it's automatically bounded by the PMD size. If
    the PMD size (e.g., 256) causes some trouble, we could fix it later by
    limiting it to SWAP_CLUSTER_MAX[1].

    - man-page material

    MADV_PAGEOUT (since Linux x.x)

    Do not expect access in the near future so pages in the specified
    regions could be reclaimed instantly regardless of memory pressure.
    Thus, access in the range after successful operation could cause
    major page fault but never lose the up-to-date contents unlike
    MADV_DONTNEED. Pages belonging to a shared mapping are only processed
    if a write access is allowed for the calling process.

    MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
    VM_PFNMAP pages.

    [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
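
    A minimal usage sketch (assuming a kernel with MADV_PAGEOUT support;
    the fallback define matches the asm-generic UAPI value for the case
    where the installed libc headers don't know the constant yet):

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/mman.h>

        #ifndef MADV_PAGEOUT
        #define MADV_PAGEOUT 21
        #endif

        int main(void)
        {
            size_t len = 64UL << 20;    /* 64MB region, arbitrary size */
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
            }
            memset(buf, 1, len);        /* fault the pages in */

            /* Hint that this range won't be needed soon; the kernel may
             * reclaim it immediately while preserving the contents,
             * unlike MADV_DONTNEED. */
            if (madvise(buf, len, MADV_PAGEOUT))
                perror("madvise(MADV_PAGEOUT)");

            /* A later access may take a major fault but still sees the data. */
            printf("first byte after the pageout hint: %d\n", buf[0]);
            return 0;
        }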

    [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
    Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
    [akpm@linux-foundation.org: resolve conflicts with hmm.git]
    Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: kbuild test robot
    Acked-by: Michal Hocko
    Cc: James E.J. Bottomley
    Cc: Richard Henderson
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Joel Fernandes (Google)
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The local variable "references" in shrink_page_list defaults to
    PAGEREF_RECLAIM_CLEAN. It is there to prevent reclaiming dirty pages
    when CMA tries to migrate pages. Strictly speaking, we don't need it
    because CMA already disallows writeout via .may_writepage = 0 in
    reclaim_clean_pages_from_list.

    Moreover, it would prevent anonymous pages from being swapped out even
    though force_reclaim = true in shrink_page_list in an upcoming patch.
    So this patch changes the default value of "references" to
    PAGEREF_RECLAIM and renames force_reclaim to ignore_references to make
    it clearer.

    This is a preparatory work for next patch.

    Link: http://lkml.kernel.org/r/20190726023435.214162-3-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Chris Zankel
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: James E.J. Bottomley
    Cc: Joel Fernandes (Google)
    Cc: kbuild test robot
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

25 Sep, 2019

6 commits

  • Currently the memcg-aware shrinker infrastructure is only allocated and
    able to work when memcg kmem is enabled. But the THP deferred split
    shrinker is not a slab shrinker, so it doesn't make much sense to have
    such a shrinker depend on memcg kmem. It should be able to reclaim THP
    even though memcg kmem is disabled.

    Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinkers.
    When memcg kmem is disabled, only such shrinkers are called when
    shrinking memcg slab.
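
    For illustration, a non-slab memcg-aware shrinker would opt in roughly
    like this (a sketch modelled on the THP deferred split shrinker, not a
    verbatim copy of the patch):

        static struct shrinker deferred_split_shrinker = {
            .count_objects = deferred_split_count,
            .scan_objects  = deferred_split_scan,
            .seeks         = DEFAULT_SEEKS,
            .flags         = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
                             SHRINKER_NONSLAB,
        };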

    [yang.shi@linux.alibaba.com: add comment]
    Link: http://lkml.kernel.org/r/1566496227-84952-4-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1565144277-36240-4-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Kirill Tkhai
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: David Rientjes
    Cc: Qian Cai
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • A later patch makes THP deferred split shrinker memcg aware, but it needs
    page->mem_cgroup information in THP destructor, which is called after
    mem_cgroup_uncharge() now.

    So move mem_cgroup_uncharge() from __page_cache_release() to the
    compound page destructor, which is called by both THP and other
    compound pages except HugeTLB. And call it in __put_single_page() for
    order-0 pages.

    Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Suggested-by: "Kirill A . Shutemov"
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Kirill Tkhai
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: David Rientjes
    Cc: Qian Cai
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • After commit "mm, reclaim: make should_continue_reclaim perform dryrun
    detection", a closer look at the function shows that nr_reclaimed == 0
    means the function will always return false. And since non-zero
    nr_reclaimed implies non-zero nr_scanned, testing nr_scanned serves no
    purpose, and neither does the test for __GFP_RETRY_MAYFAIL.

    This patch thus cleans up the function to test only !nr_reclaimed
    upfront, and removes the __GFP_RETRY_MAYFAIL test and the nr_scanned
    parameter completely. The comment is also updated, explaining that
    approximating "full LRU list has been scanned" with nr_scanned == 0
    didn't really work.
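
    The resulting entry check is essentially just (a sketch of the
    simplification described above, not the full function):

        /* with dryrun detection in place, zero progress alone ends reclaim;
         * nr_scanned and __GFP_RETRY_MAYFAIL no longer need special cases */
        if (!nr_reclaimed)
            return false;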

    Link: http://lkml.kernel.org/r/20190806014744.15446-3-mike.kravetz@oracle.com
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Mike Kravetz
    Acked-by: Mike Kravetz
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "address hugetlb page allocation stalls", v2.

    Allocation of hugetlb pages via sysctl or procfs can stall for minutes or
    hours. A simple example on a two node system with 8GB of memory is as
    follows:

    echo 4096 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
    echo 4096 > /proc/sys/vm/nr_hugepages

    Obviously, both allocation attempts will fall short of their 8GB goal.
    However, one or both of these commands may stall and not be interruptible.
    The issues were initially discussed in mail thread [1] and RFC code at
    [2].

    This series addresses the issues causing the stalls. There are two
    distinct fixes, a cleanup, and an optimization. The reclaim patch by
    Hillf and the compaction patch by Vlastimil address corner cases in
    their respective areas. hugetlb page allocation could stall due to
    either of these issues. Vlastimil added a cleanup patch after Hillf's
    modifications. The hugetlb patch by Mike is an optimization suggested
    during the debug and development process.

    [1] http://lkml.kernel.org/r/d38a095e-dc39-7e82-bb76-2c9247929f07@oracle.com
    [2] http://lkml.kernel.org/r/20190724175014.9935-1-mike.kravetz@oracle.com

    This patch (of 4):

    Address the issue of should_continue_reclaim returning true too often for
    __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned. This was
    observed during hugetlb page allocation causing stalls for minutes or
    hours.

    We can stop reclaiming pages if compaction reports it can make progress.
    There might be side-effects for other high-order allocations that would
    potentially benefit from reclaiming more before compaction so that they
    would be faster and less likely to stall. However, the consequences of
    premature/over-reclaim are considered worse.

    We can also bail out of reclaiming pages if we know that there are not
    enough inactive lru pages left to satisfy the costly allocation.

    We can give up reclaiming pages too if we see dryrun occur, with the
    certainty of plenty of inactive pages. IOW with dryrun detected, we are
    sure we have reclaimed as many pages as we could.

    Link: http://lkml.kernel.org/r/20190806014744.15446-2-mike.kravetz@oracle.com
    Signed-off-by: Hillf Danton
    Signed-off-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • One of our services observed a high rate of cgroup OOM kills in the
    presence of large amounts of clean cache. Debugging showed that the
    culprit is the shared cgroup iteration in page reclaim.

    Under high allocation concurrency, multiple threads enter reclaim at the
    same time. Fearing overreclaim when we first switched from the single
    global LRU to cgrouped LRU lists, we introduced a shared iteration state
    for reclaim invocations - whether 1 or 20 reclaimers are active
    concurrently, we only walk the cgroup tree once: the 1st reclaimer
    reclaims the first cgroup, the second the second one etc. With more
    reclaimers than cgroups, we start another walk from the top.

    This sounded reasonable at the time, but the problem is that reclaim
    concurrency doesn't scale with allocation concurrency. As reclaim
    concurrency increases, the amount of memory individual reclaimers get to
    scan gets smaller and smaller. Individual reclaimers may only see one
    cgroup per cycle, and that may not have much reclaimable memory. We see
    individual reclaimers declare OOM when there is plenty of reclaimable
    memory available in cgroups they didn't visit.

    This patch does away with the shared iterator, and every reclaimer is
    allowed to scan the full cgroup tree and see all of reclaimable memory,
    just like it would on a non-cgrouped system. This way, when OOM is
    declared, we know that the reclaimer actually had a chance.

    To still maintain fairness in reclaim pressure, disallow cgroup reclaim
    from bailing out of the tree walk early. Kswapd and regular direct
    reclaim already don't bail, so it's not clear why limit reclaim would have
    to, especially since it only walks subtrees to begin with.

    This change completely eliminates the OOM kills on our service, while
    showing no signs of overreclaim - no increased scan rates, %sys time, or
    abrupt free memory spikes. I tested across 100 machines that have 64G of
    RAM and host about 300 cgroups each.

    [ It's possible overreclaim never was a *practical* issue to begin
    with - it was simply a concern we had on the mailing lists at the
    time, with no real data to back it up. But we have also added more
    bail-out conditions deeper inside reclaim (e.g. the proportional
    exit in shrink_node_memcg) since. Regardless, now we have data that
    suggests full walks are more reliable and scale just fine. ]

    Link: http://lkml.kernel.org/r/20190812192316.13615-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

31 Aug, 2019

1 commit

  • Adric Blake has noticed[1] the following warning:

    WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40
    [...]
    Call Trace:
    mem_cgroup_shrink_node+0x9b/0x1d0
    mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
    balance_pgdat+0x276/0x540
    kswapd+0x200/0x3f0
    ? wait_woken+0x80/0x80
    kthread+0xfd/0x130
    ? balance_pgdat+0x540/0x540
    ? kthread_park+0x80/0x80
    ret_from_fork+0x35/0x40
    ---[ end trace 727343df67b2398a ]---

    which tells us that soft limit reclaim is about to overwrite the
    reclaim_state configured up in the call chain (kswapd in this case but
    the direct reclaim is equally possible). This means that reclaim stats
    would become misleading once the soft reclaim returns and another reclaim
    is done.

    Fix the warning by dropping set_task_reclaim_state from the soft reclaim
    which is always called with reclaim_state set up.

    [1] http://lkml.kernel.org/r/CAE1jjeePxYPvw1mw2B3v803xHVR_BNnz0hQUY_JDMN8ny29M6w@mail.gmail.com

    Link: http://lkml.kernel.org/r/20190828071808.20410-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Adric Blake
    Acked-by: Yafang Shao
    Acked-by: Yang Shi
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

14 Aug, 2019

1 commit

  • Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
    ("mm: reclaim small amounts of memory when an external fragmentation
    event occurs").

    The report is extensive:

    https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/

    and it's worth recording the most relevant parts (colorful language and
    typos included).

    When running a simple, steady state 4kB file creation test to
    simulate extracting tarballs larger than memory full of small
    files into the filesystem, I noticed that once memory fills up
    the cache balance goes to hell.

    The workload is creating one dirty cached inode for every dirty
    page, both of which should require a single IO each to clean and
    reclaim, and creation of inodes is throttled by the rate at which
    dirty writeback runs at (via balance dirty pages). Hence the ingest
    rate of new cached inodes and page cache pages is identical and
    steady. As a result, memory reclaim should quickly find a steady
    balance between page cache and inode caches.

    The moment memory fills, the page cache is reclaimed at a much
    faster rate than the inode cache, and evidence suggests that
    the inode cache shrinker is not being called when large batches
    of pages are being reclaimed. In roughly the same time period
    that it takes to fill memory with 50% pages and 50% slab caches,
    memory reclaim reduces the page cache down to just dirty pages
    and slab caches fill the entirety of memory.

    The LRU is largely full of dirty pages, and we're getting spikes
    of random writeback from memory reclaim so it's all going to shit.
    Behaviour never recovers, the page cache remains pinned at just
    dirty pages, and nothing I could tune would make any difference.
    vfs_cache_pressure makes no difference - I would set it so high
    it should trim the entire inode caches in a single pass, yet it
    didn't do anything. It was clear from tracing and live telemetry
    that the shrinkers were pretty much not running except when
    there was absolutely no memory free at all, and then they did
    the minimum necessary to free memory to make progress.

    So I went looking at the code, trying to find places where pages
    got reclaimed and the shrinkers weren't called. There's only one
    - kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
    reclaim small amounts of memory when an external fragmentation
    event occurs").

    The watermark boosting introduced by the commit is triggered in response
    to an allocation "fragmentation event". The boosting was not intended
    to target THP specifically and triggers even if THP is disabled.
    However, with Dave's perfectly reasonable workload, fragmentation events
    can be very common given the ratio of slab to page cache allocations so
    boosting remains active for long periods of time.

    As high-order allocations might use compaction and compaction cannot
    move slab pages the decision was made in the commit to special-case
    kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
    reclaiming slab does not directly help compaction.

    As Dave notes, this decision means that slab can be artificially
    protected for long periods of time and messes up the balance with slab
    and page caches.

    Removing the special casing can still indirectly help avoid
    fragmentation by avoiding fragmentation-causing events due to slab
    allocation as pages from a slab pageblock will have some slab objects
    freed. Furthermore, with the special casing, reclaim behaviour is
    unpredictable as kswapd sometimes examines slab and sometimes does not
    in a manner that is tricky to tune or analyse.

    This patch removes the special casing. The downside is that this is not
    a universal performance win. Some benchmarks that depend on the
    residency of data when rereading metadata may see a regression when slab
    reclaim is restored to its original behaviour. Similarly, some
    benchmarks that only read-once or write-once may perform better when
    page reclaim is too aggressive. The primary upside is that slab
    shrinker is less surprising (arguably more sane but that's a matter of
    opinion), behaves consistently regardless of the fragmentation state of
    the system and properly obeys VM sysctls.

    A fsmark benchmark configuration was constructed similar to what Dave
    reported and is codified by the mmtest configuration
    config-io-fsmark-small-file-stream. It was evaluated on a 1-socket
    machine to avoid dealing with NUMA-related issues and the timing of
    reclaim. The storage was an SSD Samsung Evo and a fresh trimmed XFS
    filesystem was used for the test data.

    This is not an exact replication of Dave's setup. The configuration
    scales its parameters depending on the memory size of the SUT to behave
    similarly across machines. The parameters mean the first sample
    reported by fs_mark is using 50% of RAM which will barely be throttled
    and look like a big outlier. Dave used fake NUMA to have multiple
    kswapd instances which I didn't replicate. Finally, the number of
    iterations differ from Dave's test as the target disk was not large
    enough. While not identical, it should be representative.

    fsmark
    5.3.0-rc3 5.3.0-rc3
    vanilla shrinker-v1r1
    Min 1-files/sec 4444.80 ( 0.00%) 4765.60 ( 7.22%)
    1st-qrtle 1-files/sec 5005.10 ( 0.00%) 5091.70 ( 1.73%)
    2nd-qrtle 1-files/sec 4917.80 ( 0.00%) 4855.60 ( -1.26%)
    3rd-qrtle 1-files/sec 4667.40 ( 0.00%) 4831.20 ( 3.51%)
    Max-1 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
    Max-5 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
    Max-10 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
    Max-90 1-files/sec 4649.60 ( 0.00%) 4780.70 ( 2.82%)
    Max-95 1-files/sec 4491.00 ( 0.00%) 4768.20 ( 6.17%)
    Max-99 1-files/sec 4491.00 ( 0.00%) 4768.20 ( 6.17%)
    Max 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
    Hmean 1-files/sec 5004.75 ( 0.00%) 5075.96 ( 1.42%)
    Stddev 1-files/sec 1778.70 ( 0.00%) 1369.66 ( 23.00%)
    CoeffVar 1-files/sec 33.70 ( 0.00%) 26.05 ( 22.71%)
    BHmean-99 1-files/sec 5053.72 ( 0.00%) 5101.52 ( 0.95%)
    BHmean-95 1-files/sec 5053.72 ( 0.00%) 5101.52 ( 0.95%)
    BHmean-90 1-files/sec 5107.05 ( 0.00%) 5131.41 ( 0.48%)
    BHmean-75 1-files/sec 5208.45 ( 0.00%) 5206.68 ( -0.03%)
    BHmean-50 1-files/sec 5405.53 ( 0.00%) 5381.62 ( -0.44%)
    BHmean-25 1-files/sec 6179.75 ( 0.00%) 6095.14 ( -1.37%)

    5.3.0-rc3 5.3.0-rc3
    vanilla shrinker-v1r1
    Duration User 501.82 497.29
    Duration System 4401.44 4424.08
    Duration Elapsed 8124.76 8358.05

    This is showing a slight skew for the max result, representing a large
    outlier, while the 1st, 2nd and 3rd quartiles are similar, indicating
    that the bulk of the results show little difference. Note that an earlier
    version of the fsmark configuration showed a regression but that
    included more samples taken while memory was still filling.

    Note that the elapsed time is higher. Part of this is that the
    configuration included time to delete all the test files when the test
    completes -- the test automation handles the possibility of testing
    fsmark with multiple thread counts. Without the patch, many of these
    objects would be memory resident which is part of what the patch is
    addressing.

    There are other important observations that justify the patch.

    1. With the vanilla kernel, the number of dirty pages in the system is
    very low for much of the test. With this patch, dirty pages is
    generally kept at 10% which matches vm.dirty_background_ratio which
    is normal expected historical behaviour.

    2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
    0.95 for much of the test i.e. Slab is being left alone and
    dominating memory consumption. With the patch applied, the ratio
    varies between 0.35 and 0.45 with the bulk of the measured ratios
    roughly half way between those values. This is a different balance to
    what Dave reported but it was at least consistent.

    3. Slabs are scanned throughout the entire test with the patch applied.
    The vanilla kernel has periods with no scan activity and then
    relatively massive spikes.

    4. Without the patch, kswapd scan rates are very variable. With the
    patch, the scan rates remain quite steady.

    5. Overall vmstats are closer to normal expectations

    5.3.0-rc3 5.3.0-rc3
    vanilla shrinker-v1r1
    Ops Direct pages scanned 99388.00 328410.00
    Ops Kswapd pages scanned 45382917.00 33451026.00
    Ops Kswapd pages reclaimed 30869570.00 25239655.00
    Ops Direct pages reclaimed 74131.00 5830.00
    Ops Kswapd efficiency % 68.02 75.45
    Ops Kswapd velocity 5585.75 4002.25
    Ops Page reclaim immediate 1179721.00 430927.00
    Ops Slabs scanned 62367361.00 73581394.00
    Ops Direct inode steals 2103.00 1002.00
    Ops Kswapd inode steals 570180.00 5183206.00

    o Vanilla kernel is hitting direct reclaim more frequently,
    not very much in absolute terms but the fact the patch
    reduces it is interesting
    o "Page reclaim immediate" in the vanilla kernel indicates
    dirty pages are being encountered at the tail of the LRU.
    This is generally bad and means in this case that the LRU
    is not long enough for dirty pages to be cleaned by the
    background flush in time. This is much reduced by the
    patch.
    o With the patch, kswapd is reclaiming 10 times more slab
    pages than with the vanilla kernel. This is indicative
    of the watermark boosting over-protecting slab

    A more complete set of tests were run that were part of the basis for
    introducing boosting and while there are some differences, they are well
    within tolerances.

    Bottom line, special-casing kswapd to avoid reclaiming slab makes
    behaviour unpredictable and can lead to abnormal results for normal
    workloads.

    This patch restores the expected behaviour that slab and page cache is
    balanced consistently for a workload with a steady allocation ratio of
    slab/pagecache pages. It also means that if there are workloads that
    favour the preservation of slab over pagecache, this can be tuned via
    vm.vfs_cache_pressure, whereas the vanilla kernel effectively ignores
    the parameter when boosting is active.

    Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Signed-off-by: Mel Gorman
    Reviewed-by: Dave Chinner
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

03 Aug, 2019

1 commit

  • Shakeel Butt reported premature oom on kernel with
    "cgroup_disable=memory" since mem_cgroup_is_root() returns false even
    though memcg is actually NULL. The drop_caches is also broken.

    It is because commit aeed1d325d42 ("mm/vmscan.c: generalize
    shrink_slab() calls in shrink_node()") removed the !memcg check before
    !mem_cgroup_is_root(). And, surprisingly, the root memcg is allocated
    even though the memory cgroup is disabled by a kernel boot parameter.

    Add mem_cgroup_disabled() check to make reclaimer work as expected.
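
    The shape of the added guard (a sketch of the described check in
    shrink_slab(), not a verbatim quote of the patch):

        /* only take the memcg-aware path when memcg is actually enabled;
         * a root memcg object exists even under cgroup_disable=memory */
        if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
            return shrink_slab_memcg(gfp_mask, nid, memcg, priority);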

    Link: http://lkml.kernel.org/r/1563385526-20805-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: aeed1d325d42 ("mm/vmscan.c: generalize shrink_slab() calls in shrink_node()")
    Signed-off-by: Yang Shi
    Reported-by: Shakeel Butt
    Reviewed-by: Shakeel Butt
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Jan Hadrava
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Hugh Dickins
    Cc: Qian Cai
    Cc: Kirill A. Shutemov
    Cc: [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

17 Jul, 2019

3 commits

  • Six sites are presently altering current->reclaim_state. There is a
    risk that one function stomps on a caller's value. Use a helper
    function to catch such errors.
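
    The helper is roughly of this shape (a sketch of the idea described
    above; the exact warnings that landed may differ):

        static void set_task_reclaim_state(struct task_struct *task,
                                           struct reclaim_state *rs)
        {
            /* Check for an overwrite of a caller's value */
            WARN_ON_ONCE(rs && task->reclaim_state);

            /* Check for clearing an already-cleared state */
            WARN_ON_ONCE(!rs && !task->reclaim_state);

            task->reclaim_state = rs;
        }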

    Cc: Yafang Shao
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • There are six different reclaim paths by now:

    - kswapd reclaim path
    - node reclaim path
    - hibernate preallocate memory reclaim path
    - direct reclaim path
    - memcg reclaim path
    - memcg softlimit reclaim path

    The slab caches reclaimed in these paths are only calculated in the
    above three paths.

    There're some drawbacks if we don't calculate the reclaimed slab caches.

    - The sc->nr_reclaimed isn't correct if there're some slab caches
    reclaimed in this path.

    - The slab caches may be reclaimed thoroughly if there're lots of
    reclaimable slab caches and few page caches.

    Let's take an easy example for this case. If one memcg is full of
    slab caches and its limit is 512M, in other words there're
    approximately 512M slab caches in this memcg. Then the limit of the
    memcg is reached and the memcg reclaim begins, and then in this memcg
    reclaim path it will continuously reclaim the slab caches until
    sc->priority drops to 0. After this reclaim stops, you will find
    there're few slab caches left, less than 20M in my test case. With
    this patch applied the number is greater than 300M and sc->priority
    only drops to 3.

    Link: http://lkml.kernel.org/r/1561112086-6169-3-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Reviewed-by: Kirill Tkhai
    Reviewed-by: Andrew Morton
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Patch series "mm/vmscan: calculate reclaimed slab in all reclaim paths".

    This patchset fixes the issues with shrinking slab.

    There're six different reclaim paths by now,
    - kswapd reclaim path
    - node reclaim path
    - hibernate preallocate memory reclaim path
    - direct reclaim path
    - memcg reclaim path
    - memcg softlimit reclaim path

    The slab caches reclaimed in these paths are only calculated in the
    above three paths. The issues are explained in detail in patch #2. We
    should calculate the reclaimed slab caches in every reclaim path. In
    order to do it, the struct reclaim_state is placed into the struct
    shrink_control.

    In the node reclaim path, there's another issue about shrinking slab,
    which is addressed in "mm/vmscan: shrink slab in node reclaim"
    (https://lore.kernel.org/linux-mm/1559874946-22960-1-git-send-email-laoar.shao@gmail.com/).

    This patch (of 2):

    The struct reclaim_state is used to record how many slab caches are
    reclaimed in one reclaim path. The struct shrink_control is used to
    control one reclaim path. So we'd better put reclaim_state into
    shrink_control.
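
    The design choice, shown as a self-contained userspace mock
    (hypothetical names, not the kernel structures): carry the slab
    accounting inside the control structure that every reclaim path
    already passes around, so no path can forget to pick it up.

        #include <stdio.h>

        /* hypothetical mock, not the kernel definitions */
        struct mock_reclaim_state {
            unsigned long reclaimed_slab;
        };

        struct mock_control {
            unsigned long nr_reclaimed;
            struct mock_reclaim_state reclaim_state;  /* embedded in the control */
        };

        static void one_reclaim_path(struct mock_control *c)
        {
            c->reclaim_state.reclaimed_slab += 128;   /* freed by shrinkers */
            c->nr_reclaimed += 32;                    /* freed from the LRUs */

            /* every path folds the slab count into the total the same way */
            c->nr_reclaimed += c->reclaim_state.reclaimed_slab;
            c->reclaim_state.reclaimed_slab = 0;
        }

        int main(void)
        {
            struct mock_control c = { 0 };

            one_reclaim_path(&c);
            printf("nr_reclaimed including slab: %lu\n", c.nr_reclaimed);
            return 0;
        }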

    [laoar.shao@gmail.com: remove reclaim_state assignment from __perform_reclaim()]
    Link: http://lkml.kernel.org/r/1561381582-13697-1-git-send-email-laoar.shao@gmail.com
    Link: http://lkml.kernel.org/r/1561112086-6169-2-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Reviewed-by: Andrew Morton
    Reviewed-by: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     

13 Jul, 2019

3 commits

  • Since commit bd4c82c22c36 ("mm, THP, swap: delay splitting THP after
    swapped out"), THP can be swapped out as a whole. But nr_reclaimed and
    some other vm counters still get inc'ed by one even though a whole THP
    (512 pages) gets swapped out.

    This doesn't make too much sense for memory reclaim.

    For example, direct reclaim may just need to reclaim SWAP_CLUSTER_MAX
    pages, and reclaiming one THP could fulfill it. But, if nr_reclaimed is
    not increased correctly, direct reclaim may just waste time reclaiming
    more pages, SWAP_CLUSTER_MAX * 512 pages in the worst case.
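
    The worst case quoted above follows from simple arithmetic
    (SWAP_CLUSTER_MAX is 32 and a 2MB THP spans 512 base pages):

        #include <stdio.h>

        int main(void)
        {
            const unsigned long swap_cluster_max = 32;  /* reclaim batch target */
            const unsigned long thp_pages = 512;        /* 2MB THP on 4kB pages */

            /* counted in base pages: a single THP already meets the target */
            unsigned long thps_fixed =
                (swap_cluster_max + thp_pages - 1) / thp_pages;
            /* counted as one page per THP: reclaim keeps going for 32 THPs */
            unsigned long thps_buggy = swap_cluster_max;

            printf("THPs reclaimed to meet the target: %lu vs %lu (%lu base pages)\n",
                   thps_fixed, thps_buggy, thps_buggy * thp_pages);
            return 0;
        }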

    And, it may cause pgsteal_{kswapd|direct} to be greater than
    pgscan_{kswapd|direct}, like the below:

    pgsteal_kswapd 122933
    pgsteal_direct 26600225
    pgscan_kswapd 174153
    pgscan_direct 14678312

    nr_reclaimed and nr_scanned must be fixed in parallel otherwise it would
    break some page reclaim logic, e.g.

    vmpressure: this looks at the scanned/reclaimed ratio so it won't change
    semantics as long as scanned & reclaimed are fixed in parallel.

    compaction/reclaim: compaction wants a certain number of physical pages
    freed up before going back to compacting.

    kswapd priority raising: kswapd raises priority if we scan fewer pages
    than the reclaim target (which itself is obviously expressed in order-0
    pages). As a result, kswapd can falsely raise its aggressiveness even
    when it's making great progress.

    Other than nr_scanned and nr_reclaimed, some other counters, e.g.
    pgactivate, nr_skipped, nr_ref_keep and nr_unmap_fail need to be fixed too
    since they are user visible via cgroup, /proc/vmstat or trace points,
    otherwise they would be underreported.

    When isolating pages from LRUs, nr_taken has been accounted in base
    pages, but nr_scanned and nr_skipped are still accounted in THP units.
    That doesn't make much sense either, since it may cause trace points to
    underreport the numbers as well.

    So account those counters in base pages instead of accounting a THP as
    one page.

    nr_dirty, nr_unqueued_dirty, nr_congested and nr_writeback are used by
    file cache, so they are not impacted by THP swap.

    This change may result in lower steal/scan ratio in some cases since THP
    may get split during page reclaim, then a part of tail pages get reclaimed
    instead of the whole 512 pages, but nr_scanned is accounted by 512,
    particularly for direct reclaim. But this should not be a significant
    issue.

    Link: http://lkml.kernel.org/r/1559025859-72759-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: "Kirill A . Shutemov"
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: Hillf Danton
    Cc: Josef Bacik
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Commit 9092c71bb724 ("mm: use sc->priority for slab shrink targets") has
    broken the relationship between sc->nr_scanned and slab pressure.
    sc->nr_scanned can't double slab pressure anymore, so it makes no sense
    to still keep inc'ing it. Actually, it would prevent pressure from being
    added on slab shrinking, since an excessive sc->nr_scanned would prevent
    sc->priority from being raised.

    The bonnie test doesn't show this would change the behavior of slab
    shrinkers.

                             w/                w/o
                         /sec    %CP      /sec    %CP
    Sequential delete:  3960.6   94.6    3997.6   96.2
    Random delete:      2518     63.8    2561.6   64.6

    The slight increase of "/sec" without the patch would be caused by the
    slight increase of CPU usage.

    Link: http://lkml.kernel.org/r/1559025859-72759-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Johannes Weiner
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: "Kirill A . Shutemov"
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: Hillf Danton
    Cc: "Huang, Ying"
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • When file refaults are detected and there are many inactive file pages,
    the system never reclaims anonymous pages; the file pages are dropped
    aggressively when there are still a lot of cold anonymous pages, and
    the system thrashes. This issue impacts the performance of applications
    with large executables, e.g. chrome.

    With this patch, when file refault is detected, inactive_list_is_low()
    always returns true for file pages in get_scan_count() to enable
    scanning anonymous pages.

    The problem can be reproduced by the following test program.

    ---8<---
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    void fallocate_file(const char *filename, off_t size)
    {
        struct stat st;
        int fd;

        if (!stat(filename, &st) && st.st_size >= size)
            return;

        fd = open(filename, O_WRONLY | O_CREAT, 0600);
        if (fd < 0) {
            perror("create file");
            exit(1);
        }
        if (posix_fallocate(fd, 0, size)) {
            perror("fallocate");
            exit(1);
        }
        close(fd);
    }

    long *alloc_anon(long size)
    {
        long *start = malloc(size);
        memset(start, 1, size);
        return start;
    }

    long access_file(const char *filename, long size, long rounds)
    {
        int fd, i;
        volatile char *start1, *end1, *start2;
        const int page_size = getpagesize();
        long sum = 0;

        fd = open(filename, O_RDONLY);
        if (fd == -1) {
            perror("open");
            exit(1);
        }

        /*
         * Some applications, e.g. chrome, use a lot of executable file
         * pages, map some of the pages with PROT_EXEC flag to simulate
         * the behavior.
         */
        start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
                      fd, 0);
        if (start1 == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        end1 = start1 + size / 2;

        start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
        if (start2 == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }

        for (i = 0; i < rounds; ++i) {
            struct timeval before, after;
            volatile char *ptr1 = start1, *ptr2 = start2;
            gettimeofday(&before, NULL);
            for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
                sum += *ptr1 + *ptr2;
            gettimeofday(&after, NULL);
            printf("File access time, round %d: %f (sec)\n", i,
                   (after.tv_sec - before.tv_sec) +
                   (after.tv_usec - before.tv_usec) / 1000000.0);
        }
        return sum;
    }

    int main(int argc, char *argv[])
    {
        const long MB = 1024 * 1024;
        long anon_mb, file_mb, file_rounds;
        const char filename[] = "large";
        long *ret1;
        long ret2;

        if (argc != 4) {
            printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS\n");
            exit(0);
        }
        anon_mb = atoi(argv[1]);
        file_mb = atoi(argv[2]);
        file_rounds = atoi(argv[3]);

        fallocate_file(filename, file_mb * MB);
        printf("Allocate %ld MB anonymous pages\n", anon_mb);
        ret1 = alloc_anon(anon_mb * MB);
        printf("Access %ld MB file pages\n", file_mb);
        ret2 = access_file(filename, file_mb * MB, file_rounds);
        printf("Print result to prevent optimization: %ld\n",
               *ret1 + ret2);
        return 0;
    }
    ---8<---
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Sonny Rao
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vladimir Davydov
    Cc: Minchan Kim
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kuo-Hsin Yang
     

05 Jul, 2019

1 commit

  • In production we have noticed hard lockups on large machines running
    large jobs, due to kswapd hoarding the LRU lock within
    isolate_lru_pages() when sc->reclaim_idx is 0, which is a small zone.
    The LRU was a couple hundred GiBs and the condition
    (page_zonenum(page) > sc->reclaim_idx) in isolate_lru_pages() was
    basically skipping GiBs of pages while holding the LRU spinlock with
    interrupts disabled.

    On further inspection, it seems like there are two issues:

    (1) If kswapd on the return from balance_pgdat() could not sleep (i.e.
    node is still unbalanced), the classzone_idx is unintentionally set
    to 0 and the whole reclaim cycle of kswapd will try to reclaim only
    the lowest and smallest zone while traversing the whole memory.

    (2) Fundamentally isolate_lru_pages() is really bad when the
    allocation has woken kswapd for a smaller zone on a very large machine
    running very large jobs. It can hoard the LRU spinlock while skipping
    over 100s of GiBs of pages.

    This patch only fixes (1); (2) needs a more fundamental solution. To
    fix (1), in the kswapd context, if pgdat->kswapd_classzone_idx is
    invalid, use the classzone_idx of the previous kswapd loop; otherwise
    use the one the waker has requested.
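
    A conceptual sketch of that selection is below; the sentinel value and
    the helper name are assumptions for illustration, not the kernel's
    actual definitions:

    #define MAX_NR_ZONES 4
    /* Hypothetical sentinel meaning "no waker recorded a request". */
    #define CLASSZONE_IDX_INVALID MAX_NR_ZONES

    static int pick_classzone_idx(int requested_idx, int prev_classzone_idx)
    {
        /*
         * Nobody asked for a specific zone: keep reclaiming for the index
         * used in the previous kswapd loop instead of silently falling
         * back to zone 0, which is tiny on large machines and makes
         * isolate_lru_pages() skip GiBs of pages under the LRU lock.
         */
        if (requested_idx == CLASSZONE_IDX_INVALID)
            return prev_classzone_idx;

        return requested_idx;
    }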

    Link: http://lkml.kernel.org/r/20190701201847.251028-1-shakeelb@google.com
    Fixes: e716f2eb24de ("mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx")
    Signed-off-by: Shakeel Butt
    Reviewed-by: Yang Shi
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

14 Jun, 2019

2 commits

  • There was the below bug report from Wu Fangsuo.

    On the CMA allocation path, isolate_migratepages_range() could isolate
    unevictable LRU pages and reclaim_clean_page_from_list() can try to
    reclaim them if they are clean file-backed pages.

    page:ffffffbf02f33b40 count:86 mapcount:84 mapping:ffffffc08fa7a810 index:0x24
    flags: 0x19040c(referenced|uptodate|arch_1|mappedtodisk|unevictable|mlocked)
    raw: 000000000019040c ffffffc08fa7a810 0000000000000024 0000005600000053
    raw: ffffffc009b05b20 ffffffc009b05b20 0000000000000000 ffffffc09bf3ee80
    page dumped because: VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page))
    page->mem_cgroup:ffffffc09bf3ee80
    ------------[ cut here ]------------
    kernel BUG at /home/build/farmland/adroid9.0/kernel/linux/mm/vmscan.c:1350!
    Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 0 PID: 7125 Comm: syz-executor Tainted: G S 4.14.81 #3
    Hardware name: ASR AQUILAC EVB (DT)
    task: ffffffc00a54cd00 task.stack: ffffffc009b00000
    PC is at shrink_page_list+0x1998/0x3240
    LR is at shrink_page_list+0x1998/0x3240
    pc : [] lr : [] pstate: 60400045
    sp : ffffffc009b05940
    ..
    shrink_page_list+0x1998/0x3240
    reclaim_clean_pages_from_list+0x3c0/0x4f0
    alloc_contig_range+0x3bc/0x650
    cma_alloc+0x214/0x668
    ion_cma_allocate+0x98/0x1d8
    ion_alloc+0x200/0x7e0
    ion_ioctl+0x18c/0x378
    do_vfs_ioctl+0x17c/0x1780
    SyS_ioctl+0xac/0xc0

    Wu found it is due to commit ad6b67041a45 ("mm: remove SWAP_MLOCK in
    ttu"). Before that, unevictable pages went to cull_mlocked, so we could
    not reach the VM_BUG_ON_PAGE line.

    To fix the issue, this patch filters unevictable LRU pages out of
    reclaim_clean_pages_from_list() on the CMA path.
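
    The filtering idea, as a self-contained sketch with toy types (not the
    kernel's page flags or list_head machinery):

    #include <stdbool.h>
    #include <stddef.h>

    struct toy_page {
        bool dirty;
        bool file_backed;
        bool unevictable;        /* e.g. an mlocked page isolated by CMA */
        struct toy_page *next;
    };

    /* Move only clean, file-backed, evictable pages onto the clean list. */
    static void split_clean_pages(struct toy_page **page_list,
                                  struct toy_page **clean_list)
    {
        struct toy_page *p = *page_list, *keep = NULL;

        while (p) {
            struct toy_page *next = p->next;

            if (!p->dirty && p->file_backed && !p->unevictable) {
                p->next = *clean_list;   /* eligible for clean reclaim */
                *clean_list = p;
            } else {
                p->next = keep;          /* unevictable/dirty pages stay put */
                keep = p;
            }
            p = next;
        }
        *page_list = keep;
    }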

    Link: http://lkml.kernel.org/r/20190524071114.74202-1-minchan@kernel.org
    Fixes: ad6b67041a45 ("mm: remove SWAP_MLOCK in ttu")
    Signed-off-by: Minchan Kim
    Reported-by: Wu Fangsuo
    Debugged-by: Wu Fangsuo
    Tested-by: Wu Fangsuo
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Pankaj Suryawanshi
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Johannes pointed out that after commit 886cf1901db9 ("mm: move
    recent_rotated pages calculation to shrink_inactive_list()") we lost all
    zone_reclaim_stat::recent_rotated history.

    This fixes it.

    Link: http://lkml.kernel.org/r/155905972210.26456.11178359431724024112.stgit@localhost.localdomain
    Fixes: 886cf1901db9 ("mm: move recent_rotated pages calculation to shrink_inactive_list()")
    Signed-off-by: Kirill Tkhai
    Reported-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

15 May, 2019

10 commits

  • Patch series "mm: memcontrol: memory.stat cost & correctness".

    The cgroup memory.stat file holds recursive statistics for the entire
    subtree. The current implementation does this tree walk on-demand
    whenever the file is read. This is giving us problems in production.

    1. The cost of aggregating the statistics on-demand is high. A lot of
    system service cgroups are mostly idle and their stats don't change
    between reads, yet we always have to check them. There are also always
    some lazily-dying cgroups sitting around that are pinned by a handful
    of remaining page cache; the same applies to them.

    In an application that periodically monitors memory.stat in our
    fleet, we have seen the aggregation consume up to 5% CPU time.

    2. When cgroups die and disappear from the cgroup tree, so do their
    accumulated vm events. The result is that the event counters at
    higher-level cgroups can go backwards and confuse some of our
    automation, let alone people looking at the graphs over time.

    To address both issues, this patch series changes the stat
    implementation to spill counts upwards when the counters change.

    The upward spilling is batched using the existing per-cpu cache. In a
    sparse file stress test with 5 level cgroup nesting, the additional cost
    of the flushing was negligible (a little under 1% of CPU at 100% CPU
    utilization, compared to the 5% of reading memory.stat during regular
    operation).

    This patch (of 4):

    memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
    currently returning the state of the local memcg or lruvec, not the
    recursive state.

    In practice there is a demand for both versions, although the callers
    that want the recursive counts currently sum them up by hand.

    Per default, cgroups are considered recursive entities and generally we
    expect more users of the recursive counters, with the local counts being
    special cases. To reflect that in the name, add a _local suffix to the
    current implementations.

    The following patch will re-incarnate these functions with recursive
    semantics, but with an O(1) implementation.
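
    A toy model of the batched upward spilling described above is sketched
    below; the threshold, the field names and the missing per-CPU machinery
    are simplifications, not the memcg implementation:

    #include <stdio.h>

    #define SPILL_BATCH 32

    struct cgroup_stat {
        struct cgroup_stat *parent;
        long local;       /* events charged directly to this group */
        long recursive;   /* this group plus all descendants */
        long pending;     /* not yet propagated to ancestors */
    };

    static void stat_add(struct cgroup_stat *cg, long delta)
    {
        cg->local += delta;
        cg->pending += delta;

        /* Spill upward only when the batch is big enough. */
        if (cg->pending >= SPILL_BATCH || cg->pending <= -SPILL_BATCH) {
            for (struct cgroup_stat *c = cg; c; c = c->parent)
                c->recursive += cg->pending;
            cg->pending = 0;
        }
    }

    int main(void)
    {
        struct cgroup_stat root = { 0 };
        struct cgroup_stat child = { .parent = &root };

        for (int i = 0; i < 100; i++)
            stat_add(&child, 1);
        /* Reading the recursive count is O(1): no subtree walk needed. */
        printf("root recursive=%ld (child pending=%ld)\n",
               root.recursive, child.pending);
        return 0;
    }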

    [hannes@cmpxchg.org: fix bisection hole]
    Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We can use __count_memcg_events() directly because this callsite is
    already protected by spin_lock_irq().
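
    The general locked/unlocked-variant pattern being relied on, sketched
    with stub irq helpers (these are illustrations, not the kernel's
    local_irq_save() or the memcg event API):

    typedef unsigned long irqflags_t;

    static irqflags_t stub_irq_save(void) { return 0; }       /* placeholder */
    static void stub_irq_restore(irqflags_t flags) { (void)flags; }

    static unsigned long event_count;

    /* "__" variant: the caller guarantees interrupts are already disabled,
     * e.g. because it holds a lock taken with spin_lock_irq(). */
    static void __count_events(unsigned long nr)
    {
        event_count += nr;
    }

    /* Plain variant: callable from any context, pays for save/restore. */
    static void count_events(unsigned long nr)
    {
        irqflags_t flags = stub_irq_save();
        __count_events(nr);
        stub_irq_restore(flags);
    }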

    Link: http://lkml.kernel.org/r/1556093494-30798-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • This merges together duplicated patterns of code. Also, it replaces
    count_memcg_events() with its irq-careless namesake, because the call
    sites already run with interrupts disabled.

    Link: http://lkml.kernel.org/r/2ece1df4-2989-bc9b-6172-61e9fdde5bfd@virtuozzo.com
    Signed-off-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Reviewed-by: Daniel Jordan
    Acked-by: Johannes Weiner
    Cc: Baoquan He
    Cc: Davidlohr Bueso

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • There are three tracepoints using this template, which are
    mm_vmscan_direct_reclaim_begin,
    mm_vmscan_memcg_reclaim_begin,
    mm_vmscan_memcg_softlimit_reclaim_begin.

    Regarding mm_vmscan_direct_reclaim_begin,
    sc.may_writepage is !laptop_mode, which is a static setting, and
    reclaim_idx is derived from gfp_mask, which is already shown in this
    tracepoint.

    Regarding mm_vmscan_memcg_reclaim_begin,
    may_writepage is !laptop_mode too, and reclaim_idx is (MAX_NR_ZONES-1);
    both are static values.

    mm_vmscan_memcg_softlimit_reclaim_begin is the same with
    mm_vmscan_memcg_reclaim_begin.

    So we can drop them all.

    Link: http://lkml.kernel.org/r/1553736322-32235-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Instead of adding up the zone counters, use lruvec_page_state() to get
    the node state directly. This is a bit cheaper and more streamlined.

    Link: http://lkml.kernel.org/r/20190228163020.24100-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The page allocation fast path may perform node reclaim, which can cause
    a latency spike. We should add tracepoints for this event and also
    measure the latency it causes.

    So the two tracepoints below are introduced,
    mm_vmscan_node_reclaim_begin
    mm_vmscan_node_reclaim_end
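
    A minimal sketch of where such begin/end hooks sit around the reclaim
    work; printf() stands in for the real tracepoints, and do_node_reclaim()
    is a placeholder rather than mm/vmscan.c code:

    #include <stdio.h>

    static void trace_node_reclaim_begin(int nid, int order, unsigned gfp_mask)
    {
        printf("node_reclaim_begin: nid=%d order=%d gfp=%#x\n",
               nid, order, gfp_mask);
    }

    static void trace_node_reclaim_end(unsigned long nr_reclaimed)
    {
        printf("node_reclaim_end: nr_reclaimed=%lu\n", nr_reclaimed);
    }

    static unsigned long do_node_reclaim(int nid, unsigned gfp_mask, int order)
    {
        unsigned long nr_reclaimed = 0;

        trace_node_reclaim_begin(nid, order, gfp_mask);
        /* ... the actual reclaim work would run here ... */
        trace_node_reclaim_end(nr_reclaimed);

        return nr_reclaimed;
    }

    Pairing the timestamps of the two events gives the latency of each
    node-reclaim episode.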

    Link: http://lkml.kernel.org/r/1551421452-5385-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Souptick Joarder
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • This combines the two similar functions move_active_pages_to_lru() and
    putback_inactive_pages() into a single move_pages_to_lru(). This removes
    duplicate code and makes the object file smaller.

    Before:
       text    data     bss      dec     hex  filename
      57082    4732     128    61942    f1f6  mm/vmscan.o
    After:
       text    data     bss      dec     hex  filename
      55112    4600     128    59840    e9c0  mm/vmscan.o

    Note, that now we are checking for !page_evictable() coming from
    shrink_active_list(), which shouldn't change any behavior since that path
    works with evictable pages only.

    Link: http://lkml.kernel.org/r/155290129627.31489.8321971028677203248.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • We may use the input argument list as the output argument too. This
    makes the function more similar to putback_inactive_pages().

    Link: http://lkml.kernel.org/r/155290129079.31489.16180612694090502942.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • We know which LRU is not active.

    [chris@chrisdown.name: fix build on !CONFIG_MEMCG]
    Link: http://lkml.kernel.org/r/20190322150513.GA22021@chrisdown.name
    Link: http://lkml.kernel.org/r/155290128498.31489.18250485448913338607.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Chris Down
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Patch series "mm: Generalize putback functions".

    putback_inactive_pages() and move_active_pages_to_lru() are very
    similar, so this patchset merges them into a single function.

    This patch (of 4):

    The patch moves the calculation from putback_inactive_pages() to
    shrink_inactive_list(). This makes putback_inactive_pages() look more
    similar to move_active_pages_to_lru().

    To do that, we account activated pages in reclaim_stat::nr_activate.
    Since a page may change its LRU type from anon to file cache inside
    shrink_page_list() (see ClearPageSwapBacked()), we have to account pages
    for the both types. So, nr_activate becomes an array.

    Previously we used nr_activate to account PGACTIVATE events, but now we
    account them in a separate pgactivate variable (since they count pages
    in general, not the sum of hpage_nr_pages()).
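
    A compact sketch of the two-slot accounting; the enum and the
    is_file_cache flag are simplified stand-ins rather than the kernel's
    reclaim_stat definition:

    enum { LRU_ANON = 0, LRU_FILE = 1 };

    struct toy_reclaim_stat {
        /*
         * One slot per page type: a page may switch from anon to file
         * cache (ClearPageSwapBacked) while shrink_page_list() runs, so
         * activations are accounted per type.
         */
        unsigned long nr_activate[2];
    };

    static void account_activated(struct toy_reclaim_stat *stat,
                                  int is_file_cache,
                                  unsigned long nr_pages /* hpage_nr_pages() */)
    {
        stat->nr_activate[is_file_cache ? LRU_FILE : LRU_ANON] += nr_pages;
    }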

    Link: http://lkml.kernel.org/r/155290127956.31489.3393586616054413298.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

08 May, 2019

1 commit

  • Pull printk updates from Petr Mladek:

    - Allow state reset of printk_once() calls.

    - Prevent crashes when dereferencing invalid pointers in vsprintf().
    Only the first byte is checked for simplicity.

    - Make vsprintf warnings consistent and inlined.

    - Treewide conversion of obsolete %pf, %pF to %ps, %pF printf
    modifiers.

    - Some clean up of vsprintf and test_printf code.

    * tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
    lib/vsprintf: Make function pointer_string static
    vsprintf: Limit the length of inlined error messages
    vsprintf: Avoid confusion between invalid address and value
    vsprintf: Prevent crash when dereferencing invalid pointers
    vsprintf: Consolidate handling of unknown pointer specifiers
    vsprintf: Factor out %pO handler as kobject_string()
    vsprintf: Factor out %pV handler as va_format()
    vsprintf: Factor out %p[iI] handler as ip_addr_string()
    vsprintf: Do not check address of well-known strings
    vsprintf: Consistent %pK handling for kptr_restrict == 0
    vsprintf: Shuffle restricted_pointer()
    printk: Tie printk_once / printk_deferred_once into .data.once for reset
    treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
    lib/test_printf: Switch to bitmap_zalloc()

    Linus Torvalds
     

20 Apr, 2019

1 commit

  • During !CONFIG_CGROUP reclaim, we expand the inactive list size if it's
    thrashing on the node that is about to be reclaimed. But when cgroups
    are enabled, we suddenly ignore the node scope and use the cgroup scope
    only. The result is that pressure bleeds between NUMA nodes depending
    on whether cgroups are merely compiled into Linux. This behavioral
    difference is unexpected and undesirable.

    When the refault adaptivity of the inactive list was first introduced,
    there were no statistics at the lruvec level - the intersection of node
    and memcg - so it was better than nothing.

    But now that we have that infrastructure, use lruvec_page_state() to
    make the list balancing decision always NUMA aware.
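
    A toy illustration of the granularity change; the arrays and field names
    are assumptions for demonstration, not the real lruvec or
    mem_cgroup_lruvec() API:

    #define MAX_NODES  2
    #define MAX_MEMCGS 4

    struct toy_lruvec {
        unsigned long refaults;        /* workingset refault activity */
        unsigned long inactive_file;
        unsigned long active_file;
    };

    /* One lruvec per (memcg, node) pair -- the NUMA-aware granularity. */
    static struct toy_lruvec lruvecs[MAX_MEMCGS][MAX_NODES];

    static int inactive_file_low(int memcg, int nid)
    {
        const struct toy_lruvec *lv = &lruvecs[memcg][nid];

        /*
         * Thrashing is judged against this node's slice of the cgroup, so
         * reclaim pressure no longer bleeds between NUMA nodes just
         * because cgroups are compiled in.
         */
        if (lv->refaults)
            return 1;
        return lv->inactive_file * 2 < lv->active_file;
    }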

    [hannes@cmpxchg.org: fix bisection hole]
    Link: http://lkml.kernel.org/r/20190417155241.GB23013@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190412144438.2645-1-hannes@cmpxchg.org
    Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Cc: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Apr, 2019

1 commit

  • %pF and %pf are functionally equivalent to %pS and %ps conversion
    specifiers. The former are deprecated, therefore switch the current users
    to use the preferred variant.

    The changes have been produced by the following command:

    git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
    while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done

    And verifying the result.

    Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
    Cc: Andy Shevchenko
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: sparclinux@vger.kernel.org
    Cc: linux-um@lists.infradead.org
    Cc: xen-devel@lists.xenproject.org
    Cc: linux-acpi@vger.kernel.org
    Cc: linux-pm@vger.kernel.org
    Cc: drbd-dev@lists.linbit.com
    Cc: linux-block@vger.kernel.org
    Cc: linux-mmc@vger.kernel.org
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Cc: linux-btrfs@vger.kernel.org
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: linux-mm@kvack.org
    Cc: ceph-devel@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Signed-off-by: Sakari Ailus
    Acked-by: David Sterba (for btrfs)
    Acked-by: Mike Rapoport (for mm/memblock.c)
    Acked-by: Bjorn Helgaas (for drivers/pci)
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Petr Mladek

    Sakari Ailus