07 Nov, 2015

6 commits

  • Someone has an 86 column display.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • High-order watermark checking exists for two reasons -- kswapd high-order
    awareness and protection for high-order atomic requests. Historically the
    kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
    high-order free pages for as long as possible. This patch introduces
    MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
    allocations on demand and avoids using those blocks for order-0
    allocations. This is more flexible and reliable than MIGRATE_RESERVE was.

    A MIGRATE_HIGHATOMIC pageblock is created when an atomic high-order
    allocation request steals a pageblock, with the total number limited to 1% of
    the zone. Callers that speculatively abuse atomic allocations for
    long-lived high-order allocations to access the reserve will quickly fail.
    Note that SLUB is currently not such an abuser as it reclaims at least
    once. It is possible that the pageblock stolen has few suitable
    high-order pages and will need to steal again in the near future but there
    would need to be strong justification to search all pageblocks for an
    ideal candidate.

    The pageblocks are unreserved if an allocation fails after a direct
    reclaim attempt.

    The watermark checks account for the reserved pageblocks when the
    allocation request is not a high-order atomic allocation.

    The reserved pageblocks can not be used for order-0 allocations. This may
    allow temporary wastage until a failed reclaim reassigns the pageblock.
    This is deliberate as the intent of the reservation is to satisfy a
    limited number of atomic high-order short-lived requests if the system
    requires them.
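
    To make the accounting concrete, here is a minimal user-space model of the
    idea (not the kernel's code; names such as highatomic_reserved are made up):
    the reserve is grown on demand up to roughly 1% of the zone, and requests
    that are not high-order atomic must pass the watermark with the reserved
    pages subtracted.

        #include <stdbool.h>

        struct zone_model {
            long managed_pages;        /* pages the allocator manages in this zone */
            long free_pages;           /* current free pages                       */
            long highatomic_reserved;  /* pages held back for high-order atomics   */
            long watermark_min;
        };

        /* Grow the reserve one pageblock at a time, capped near 1% of the zone. */
        static void maybe_reserve_highatomic(struct zone_model *z, long pageblock_pages)
        {
            long limit = z->managed_pages / 100;

            if (z->highatomic_reserved + pageblock_pages <= limit)
                z->highatomic_reserved += pageblock_pages;
        }

        /* Ordinary requests see a smaller effective free count; high-order
         * atomic requests are allowed to dip into the reserved pages. */
        static bool watermark_ok(const struct zone_model *z, bool highorder_atomic)
        {
            long usable = z->free_pages;

            if (!highorder_atomic)
                usable -= z->highatomic_reserved;
            return usable > z->watermark_min;
        }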

    The stutter benchmark was used to evaluate this but while it was running
    there was a systemtap script that randomly allocated between 1 high-order
    page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
    is much larger than the potential reserve and it does not attempt to be
    realistic. It is intended to stress random high-order allocations from an
    unknown source and to show that there is a reduction in failures without
    introducing an anomaly where atomic allocations become more reliable than
    regular allocations. The amount of memory reserved varied throughout the
    workload as reserves were created and reclaimed under memory pressure.
    The allocation failures once the workload warmed up were as follows;

    4.2-rc5-vanilla 70%
    4.2-rc5-atomic-reserve 56%

    The failure rate was also measured while building multiple kernels. The
    failure rate was 14% but is 6% with this patch applied.

    Overall, this is a small reduction but the reserves are small relative to
    the number of allocation requests. In early versions of the patch, the
    failure rate reduced by a much larger amount but that required much larger
    reserves and perversely made atomic allocations seem more reliable than
    regular allocations.

    [yalin.wang2010@gmail.com: fix redundant check and a memory leak]
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: yalin wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • MIGRATE_RESERVE preserves an old property of the buddy allocator that
    existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
    tended to remain contiguous until the only alternative was to fail the
    allocation. At the time it was discovered that high-order atomic
    allocations relied on this property so MIGRATE_RESERVE was introduced. A
    later patch will introduce an alternative, MIGRATE_HIGHATOMIC; this patch
    deletes MIGRATE_RESERVE and its supporting code so that the later change is
    easier to review.
    Note that this patch in isolation may look like a false regression if
    someone was bisecting high-order atomic allocation failures.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The zonelist cache (zlc) was introduced to skip over zones that were
    recently known to be full. This avoided expensive operations such as the
    cpuset checks, watermark calculations and zone_reclaim. The situation
    today is different and the complexity of zlc is harder to justify.

    1) The cpuset checks are no-ops unless a cpuset is active and in general
    are a lot cheaper.

    2) zone_reclaim is now disabled by default and I suspect that was a large
    source of the cost that zlc wanted to avoid. When it is enabled, it's
    known to be a major source of stalling when nodes fill up and it's
    unwise to hit every other user with the overhead.

    3) Watermark checks are expensive to calculate for high-order
    allocation requests. Later patches in this series will reduce the cost
    of the watermark checking.

    4) The most important issue is that in the current implementation it
    is possible for a failed THP allocation to mark a zone full for order-0
    allocations and cause a fallback to remote nodes.

    The last issue could be addressed with additional complexity but as the
    benefit of zlc is questionable, it is better to remove it. If stalls due
    to zone_reclaim are ever reported then an alternative would be to
    introduce deferring logic based on a timeout inside zone_reclaim itself
    and leave the page allocator fast paths alone.

    The impact on page-allocator microbenchmarks is negligible as they don't
    hit the paths where the zlc comes into play. Most page-reclaim related
    workloads showed no noticeable difference as a result of the removal.

    The impact was noticeable in a workload called "stutter". One part uses a
    lot of anonymous memory, a second measures mmap latency and a third copies
    a large file. In an ideal world the latency-measuring application would not notice
    the mmap latency. On a 2-node machine the results of this patch are

    stutter
    4.3.0-rc1 4.3.0-rc1
    baseline nozlc-v4
    Min mmap 20.9243 ( 0.00%) 20.7716 ( 0.73%)
    1st-qrtle mmap 22.0612 ( 0.00%) 22.0680 ( -0.03%)
    2nd-qrtle mmap 22.3291 ( 0.00%) 22.3809 ( -0.23%)
    3rd-qrtle mmap 25.2244 ( 0.00%) 25.2396 ( -0.06%)
    Max-90% mmap 48.0995 ( 0.00%) 28.3713 ( 41.02%)
    Max-93% mmap 52.5557 ( 0.00%) 36.0170 ( 31.47%)
    Max-95% mmap 55.8173 ( 0.00%) 47.3163 ( 15.23%)
    Max-99% mmap 67.3781 ( 0.00%) 70.1140 ( -4.06%)
    Max mmap 24447.6375 ( 0.00%) 12915.1356 ( 47.17%)
    Mean mmap 33.7883 ( 0.00%) 27.7944 ( 17.74%)
    Best99%Mean mmap 27.7825 ( 0.00%) 25.2767 ( 9.02%)
    Best95%Mean mmap 26.3912 ( 0.00%) 23.7994 ( 9.82%)
    Best90%Mean mmap 24.9886 ( 0.00%) 23.2251 ( 7.06%)
    Best50%Mean mmap 22.0157 ( 0.00%) 22.0261 ( -0.05%)
    Best10%Mean mmap 21.6705 ( 0.00%) 21.6083 ( 0.29%)
    Best5%Mean mmap 21.5581 ( 0.00%) 21.4611 ( 0.45%)
    Best1%Mean mmap 21.3079 ( 0.00%) 21.1631 ( 0.68%)

    Note that the maximum stall latency went from 24 seconds to 12, which is
    still bad but an improvement. The mileage varies considerably: a 2-node
    machine in an earlier test went from 494 seconds to 47 seconds and a
    4-node machine that tested an earlier version of this patch went from a
    worst-case stall time of 6 seconds to 67ms. The nature of the benchmark
    is inherently unpredictable as it is hammering the system and the mileage
    will vary between machines.

    There is a secondary impact with potentially more direct reclaim because
    zones are now being considered instead of being skipped by zlc. In this
    particular test run it did not occur so it will not be described. However,
    in at least one test the following was observed:

    1. Direct reclaim rates were higher. This was likely due to direct reclaim
    being entered instead of the zlc disabling a zone and busy looping.
    Busy looping may have the effect of allowing kswapd to make more
    progress and in some cases may be better overall. If this is found then
    the correct action is to put direct reclaimers to sleep on a waitqueue
    and allow kswapd to make forward progress. Busy looping on the zlc is even
    worse than when the allocator used to blindly call congestion_wait().

    2. There was higher swap activity as direct reclaim was active.

    3. Direct reclaim efficiency was lower. This is related to 1 as more
    scanning activity also encountered more pages that could not be
    immediately reclaimed

    In that case, the direct page scan and reclaim rates are noticeable but
    it is not considered a problem for a few reasons

    1. The test is primarily concerned with latency. The mmap attempts are also
    faulted which means there are THP allocation requests. The ZLC could
    cause zones to be disabled causing the process to busy loop instead
    of reclaiming. This looks like elevated direct reclaim activity but
    it's the correct action to take based on what processes requested.

    2. The test hammers reclaim and compaction heavily. The number of successful
    THP faults is highly variable but affects the reclaim stats. It's not a
    realistic or reasonable measure of page reclaim activity.

    3. No other page-reclaim intensive workload that was tested showed a problem.

    4. If a workload is identified that benefitted from the busy looping then it
    should be fixed by having direct reclaimers sleep on a wait queue until
    woken by kswapd instead of busy looping. We had this class of problem before
    when congestion_waits() with a fixed timeout was a brain damaged decision
    but happened to benefit some workloads.

    If a workload is identified that relied on the zlc to busy loop then it
    should be fixed correctly and have a direct reclaimer sleep on a waitqueue
    until woken by kswapd.

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch redefines which GFP bits are used for specifying mobility and
    the order of the migrate types. Once redefined it's possible to convert
    GFP flags to a migrate type with a simple mask and shift. The only
    downside is that readers of OOM kill messages and allocation failures may
    be accustomed to the existing values, but scripts/gfp-translate will help.
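
    In stand-alone C the conversion amounts to something like the following
    (the bit positions, masks and names here are illustrative, not the
    kernel's actual values):

        /* Illustrative layout: two adjacent "mobility" bits inside the GFP word. */
        #define GFP_MOVABLE_SHIFT   3
        #define GFP_MOVABLE_MASK    (0x3u << GFP_MOVABLE_SHIFT)

        enum migratetype_model {
            MT_UNMOVABLE   = 0,
            MT_MOVABLE     = 1,
            MT_RECLAIMABLE = 2,
        };

        static inline unsigned int gfp_to_migratetype(unsigned int gfp_flags)
        {
            /* No table lookup or branches: the encoding is chosen so that the
             * mobility bits, shifted down, already form a valid migrate type. */
            return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
        }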

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Overall, the intent of this series is to remove the zonelist cache which
    was introduced to avoid high overhead in the page allocator. Once this is
    done, it is necessary to reduce the cost of watermark checks.

    The series starts with minor micro-optimisations.

    Next it notes that the GFP flags that affect watermark checks are abused.
    The absence of __GFP_WAIT historically identified callers that could not
    sleep and could access reserves. This was later abused by callers that
    simply prefer to avoid sleeping and have other options. A patch
    distinguishes between atomic callers, high-priority callers and those that
    simply wish to avoid sleep.

    The zonelist cache has been around for a long time but it is of dubious
    merit with a lot of complexity and some issues that are explained. The
    most important issue is that a failed THP allocation can cause a zone to
    be treated as "full". This potentially causes unnecessary stalls, reclaim
    activity or remote fallbacks. The issues could be fixed but it's not
    worth it. The series places a small number of other micro-optimisations
    on top before examining how the GFP flags affect the watermarks.

    High-order watermark enforcement can cause high-order allocations to fail
    even though pages are free. The watermark checks both protect high-order
    atomic allocations and make kswapd aware of high-order pages, but both can
    be handled in a much better way using migrate types. This series
    uses page grouping by mobility to reserve pageblocks for high-order
    allocations with the size of the reservation depending on demand. kswapd
    awareness is maintained by examining the free lists. By patch 12 in this
    series, there are no high-order watermark checks while preserving the
    properties that motivated the introduction of the watermark checks.

    This patch (of 10):

    No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
    removes the unnecessary parameter.

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 Nov, 2015

1 commit

  • Commit a2f3aa025766 ("[PATCH] Fix sparsemem on Cell") fixed an oops
    experienced on the Cell architecture when init-time functions,
    early_*(), are called at runtime by introducing an 'enum memmap_context'
    parameter to memmap_init_zone() and init_currently_empty_zone(). This
    parameter is intended to be used to tell whether the call of these two
    functions is being made on behalf of a hotplug event, or happening at
    boot-time. However, init_currently_empty_zone() does not use this
    parameter at all, so remove it.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

09 Sep, 2015

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "This update has successfully completed a 0day-kbuild run and has
    appeared in a linux-next release. The changes outside of the typical
    drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
    removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
    the introduction of ZONE_DEVICE + devm_memremap_pages().

    Summary:

    - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.

    This facility is used by the pmem driver to enable pfn_to_page()
    operations on the page frames returned by DAX ('direct_access' in
    'struct block_device_operations').

    For now, the 'memmap' allocation for these "device" pages comes
    from "System RAM". Support for allocating the memmap from device
    memory will arrive in a later kernel.

    - Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt(). memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects. The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.

    Completion of the conversion is targeted for v4.4.

    - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.

    - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.

    - Miscellaneous updates and fixes to libnvdimm including support for
    issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes"

    * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
    libnvdimm, pmem: direct map legacy pmem by default
    libnvdimm, pmem: 'struct page' for pmem
    libnvdimm, pfn: 'struct page' provider infrastructure
    x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
    add devm_memremap_pages
    mm: ZONE_DEVICE for "device memory"
    mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
    dax: drop size parameter to ->direct_access()
    nd_blk: change aperture mapping from WC to WB
    nvdimm: change to use generic kvfree()
    pmem, dax: have direct_access use __pmem annotation
    dax: update I/O path to do proper PMEM flushing
    pmem: add copy_from_iter_pmem() and clear_pmem()
    pmem, x86: clean up conditional pmem includes
    pmem: remove layer when calling arch_has_wmb_pmem()
    pmem, x86: move x86 PMEM API to new pmem.h header
    libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
    pmem: switch to devm_ allocations
    devres: add devm_memremap
    libnvdimm, btt: write and validate parent_uuid
    ...

    Linus Torvalds
     

05 Sep, 2015

1 commit


28 Aug, 2015

1 commit

  • While pmem is usable as a block device or via DAX mappings to userspace
    there are several usage scenarios that can not target pmem due to its
    lack of struct page coverage. In preparation for "hot plugging" pmem
    into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
    separately from the ones that are subject to standard page allocations.
    Importantly "device memory" can be removed at will by userspace
    unbinding the driver of the device.

    Having a separate zone prevents allocation and otherwise marks these
    pages as distinct from typical uniform memory. Device memory has
    different lifetime and performance characteristics than RAM. However,
    since we have run out of ZONES_SHIFT bits this functionality currently
    depends on sacrificing ZONE_DMA.

    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jerome Glisse
    [hch: various simplifications in the arch interface]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

01 Jul, 2015

3 commits

  • This patch initialises all low memory struct pages and 2G of the highest
    zone on each node during memory initialisation if
    CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be set
    yet but will become available in a later patch. Parallel initialisation of
    struct page depends on some features from memory hotplug and it is
    necessary to alter section annotations.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
    unnecessarily visible outside memory initialisation. Besides the
    unnecessary visibility, there is unnecessary function call overhead when
    initialising pages. This patch moves the helpers inline.

    [akpm@linux-foundation.org: fix build]
    [mhocko@suse.cz: fix build]
    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __early_pfn_to_nid() uses static variables to cache recent lookups, as
    memblock lookups are very expensive, but it assumes that memory
    initialisation is single-threaded. Parallel initialisation of struct
    pages will break that assumption so this patch makes __early_pfn_to_nid()
    SMP-safe by requiring the caller to cache recent search information.
    early_pfn_to_nid() keeps the same interface but is only safe to use early
    in boot due to the use of a global static variable. meminit_pfn_in_nid()
    is an SMP-safe version that callers must maintain their own state for.
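
    A user-space sketch of the pattern (hypothetical names; the real code
    consults memblock): the expensive range lookup caches its result in state
    owned by the caller, so two threads initialising different ranges never
    share or clobber a global cache.

        struct nid_range { unsigned long start_pfn, end_pfn; int nid; };
        struct nid_cache { unsigned long start_pfn, end_pfn; int nid; };

        /* Stand-in for the expensive walk of a table such as memblock. */
        static int slow_pfn_to_nid(const struct nid_range *map, int n,
                                   unsigned long pfn, struct nid_cache *cache)
        {
            for (int i = 0; i < n; i++) {
                if (pfn >= map[i].start_pfn && pfn < map[i].end_pfn) {
                    cache->start_pfn = map[i].start_pfn;
                    cache->end_pfn   = map[i].end_pfn;
                    cache->nid       = map[i].nid;
                    return map[i].nid;
                }
            }
            return -1;
        }

        /* SMP-safe variant: each caller owns its cache, so there is no
         * global state to race on. */
        static int meminit_pfn_to_nid(const struct nid_range *map, int n,
                                      unsigned long pfn, struct nid_cache *cache)
        {
            if (pfn >= cache->start_pfn && pfn < cache->end_pfn)
                return cache->nid;              /* hit: skip the expensive walk */
            return slow_pfn_to_nid(map, n, pfn, cache);
        }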

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

16 Apr, 2015

1 commit


08 Apr, 2015

1 commit

  • Huang Ying reported the following problem due to commit 3484b2de9499 ("mm:
    rearrange zone fields into read-only, page alloc, statistics and page
    reclaim lines") from the Intel performance tests

    24b7e5819ad5cbef 3484b2de9499df23c4604a513b
    ---------------- --------------------------
    %stddev %change %stddev
    152288 ± 0% -46.2% 81911 ± 0% aim7.jobs-per-min
    237 ± 0% +85.6% 440 ± 0% aim7.time.elapsed_time
    237 ± 0% +85.6% 440 ± 0% aim7.time.elapsed_time.max
    25026 ± 0% +70.7% 42712 ± 0% aim7.time.system_time
    2186645 ± 5% +32.0% 2885949 ± 4% aim7.time.voluntary_context_switches
    4576561 ± 1% +24.9% 5715773 ± 0% aim7.time.involuntary_context_switches

    The problem is specific to very large machines under stress. It was not
    reproducible with the machines I had used to justify the original patch
    because large numbers of CPUs are required. When pressure is high enough,
    the cache line is bouncing between CPUs trying to acquire the lock and the
    holder of the lock adjusting free lists. The intention was that the
    acquirer of the lock would automatically have the cache line holding the
    free lists but according to Huang, this is not a universal win.

    One possibility is to move the zone lock to its own cache line but it
    increases the size of the zone. This patch moves the lock to the other
    end of the free lists where they do not contend under high pressure. It
    does mean the page allocator paths now require more cache lines but Huang
    reports that it restores performance to previous levels on large machines

    %stddev %change %stddev
    84568 ± 1% +94.3% 164280 ± 1% aim7.jobs-per-min
    2881944 ± 2% -35.1% 1870386 ± 8% aim7.time.voluntary_context_switches
    681 ± 1% -3.4% 658 ± 0% aim7.time.user_time
    5538139 ± 0% -12.1% 4867884 ± 0% aim7.time.involuntary_context_switches
    44174 ± 1% -46.0% 23848 ± 1% aim7.time.system_time
    426 ± 1% -48.4% 219 ± 1% aim7.time.elapsed_time
    426 ± 1% -48.4% 219 ± 1% aim7.time.elapsed_time.max
    468 ± 1% -43.1% 266 ± 2% uptime.boot
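
    As a generic illustration of the layout concern (a toy structure, not the
    kernel's struct zone): which fields share a cache line is controlled by
    field ordering, and the patch achieves the separation by reordering rather
    than by padding.

        #include <pthread.h>

        /* Toy layout.  Putting the lock right next to the hottest free-list
         * heads makes lock waiters bounce the very cache line the lock holder
         * is modifying; placing the lock at the far end of the free lists
         * avoids that under heavy pressure, at the cost of the allocator fast
         * path touching one more cache line. */
        struct zone_toy {
            /* read-mostly fields first */
            unsigned long watermark_min;
            unsigned long watermark_low;

            /* write-heavy data the lock protects */
            struct { unsigned long nr_free; } free_area[11];

            /* the contended lock, placed after (not before) the free lists */
            pthread_mutex_t lock;
        };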

    Signed-off-by: Mel Gorman
    Reported-by: Huang Ying
    Tested-by: Huang Ying
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Feb, 2015

2 commits

  • next_zones_zonelist() returns a zoneref pointer, as well as a zone pointer
    via extra parameter. Since the latter can be trivially obtained by
    dereferencing the former, the overhead of the extra parameter is
    unjustified.

    This patch thus removes the zone parameter from next_zones_zonelist().
    Both callers happen to be in the same header file, so it's simple to add
    the zoneref dereference inline. We save some bytes of code size.

    add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-105 (-105)
    function old new delta
    nr_free_zone_pages 129 115 -14
    __alloc_pages_nodemask 2300 2285 -15
    get_page_from_freelist 2652 2576 -76

    add/remove: 0/0 grow/shrink: 1/0 up/down: 10/0 (10)
    function old new delta
    try_to_compact_pages 569 579 +10
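
    A reduced model of the interface change (made-up types): the iterator
    returns only a zoneref, and callers that want the zone dereference it,
    instead of receiving it through an extra output parameter.

        struct zone_model { int node; };
        struct zoneref_model { struct zone_model *zone; int zone_idx; };

        /* Before: next_zones_zonelist(..., struct zone_model **zone_out);
         * After:  return the zoneref and let the caller do zref->zone. */
        static struct zoneref_model *
        next_zones_zonelist_model(struct zoneref_model *z, int highest_zoneidx)
        {
            while (z->zone && z->zone_idx > highest_zoneidx)
                z++;                        /* skip zones above the allowed index */
            return z;
        }

        /* Caller side:
         *     struct zoneref_model *zref = next_zones_zonelist_model(z, idx);
         *     struct zone_model *zone = zref->zone;
         */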

    Signed-off-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Zhang Yanfei
    Cc: Minchan Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Found it when I wanted to jump to the definition of MIGRATE_RESERVE with ctags.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baoquan He
     

14 Dec, 2014

1 commit

  • When we debug something, we'd like to insert some information into every
    page. For this purpose, we sometimes modify struct page itself, but
    this has drawbacks. First, it requires a re-compile, which makes us
    hesitate to use this powerful debug feature, so the development process is
    slowed down. Second, it is sometimes impossible to rebuild the
    kernel due to third party module dependencies. Third, system behaviour
    would be largely different after a re-compile, because it changes the size
    of struct page greatly and this structure is accessed by every part of the
    kernel. Keeping struct page as it is makes it easier to reproduce an
    erroneous situation.

    This feature is intended to overcome the above-mentioned problems. It
    allocates memory for extended data per page in a separate place
    rather than in struct page itself. This memory can be accessed via the
    accessor functions provided by this code. During the boot process, it
    checks whether allocating the huge chunk of memory is needed at all. If
    not, it avoids allocating memory entirely. With this advantage, we can
    include this feature in the kernel by default and can avoid rebuilds and
    the related problems.

    Until now, memcg used this technique. But now memcg has decided to embed
    its variable into struct page itself, and its code to extend struct page
    has been removed. I'd like to use this code to develop a debug feature, so
    this patch resurrects it.

    To help these things work well, this patch introduces two callbacks for
    clients. One is the need callback, which is mandatory if the user wants to
    avoid useless memory allocation at boot time. The other, the init callback,
    is optional and is used to do proper initialisation after memory is
    allocated. A detailed explanation of the purpose of these functions is in a
    code comment; please refer to it.

    Everything else is the same as the previous extension code in memcg.
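
    A compressed user-space model of the scheme (hypothetical names): the
    per-page debug data lives in a separate array indexed by page frame
    number, a client's need() callback decides at boot whether the array is
    allocated at all, and an optional init() callback runs after allocation.

        #include <stdbool.h>
        #include <stdlib.h>

        struct page_ext_model { unsigned long debug_flags; };

        struct page_ext_ops_model {
            bool (*need)(void);                    /* allocate the array at all?  */
            void (*init)(struct page_ext_model *); /* optional post-alloc setup   */
        };

        static struct page_ext_model *page_ext_base; /* NULL => feature unused    */

        static void page_ext_init_model(const struct page_ext_ops_model *ops,
                                        unsigned long nr_pages)
        {
            if (!ops->need || !ops->need())
                return;                            /* nobody wants it: zero cost  */

            page_ext_base = calloc(nr_pages, sizeof(*page_ext_base));
            if (page_ext_base && ops->init)
                for (unsigned long pfn = 0; pfn < nr_pages; pfn++)
                    ops->init(&page_ext_base[pfn]);
        }

        /* Accessor: struct page itself is never grown. */
        static struct page_ext_model *lookup_page_ext_model(unsigned long pfn)
        {
            return page_ext_base ? &page_ext_base[pfn] : NULL;
        }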

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

11 Dec, 2014

1 commit

  • Memory cgroups used to have 5 per-page pointers. To allow users to
    disable that amount of overhead during runtime, those pointers were
    allocated in a separate array, with a translation layer between them and
    struct page.

    There is now only one page pointer remaining: the memcg pointer, that
    indicates which cgroup the page is associated with when charged. The
    complexity of runtime allocation and the runtime translation overhead is
    no longer justified to save that *potential* 0.19% of memory. With
    CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
    after the page->private member and doesn't even increase struct page,
    so this patch actually saves space. Remaining users that care can
    still compile their kernels without CONFIG_MEMCG.

    text data bss dec hex filename
    8828345 1725264 983040 11536649 b00909 vmlinux.old
    8827425 1725264 966656 11519345 afc571 vmlinux.new

    [mhocko@suse.cz: update Documentation/cgroups/memory.txt]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: David S. Miller
    Acked-by: KAMEZAWA Hiroyuki
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Joonsoo Kim
    Acked-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

14 Nov, 2014

1 commit

  • Before describing the bugs themselves, I first explain the definition of a
    freepage.

    1. pages on a buddy list are counted as freepages.
    2. pages on the isolate migratetype buddy list are *not* counted as freepages.
    3. pages on the cma buddy list are counted as CMA freepages, too.

    Now I describe the problems and the related patches.

    Patch 1: There are race conditions when getting the pageblock migratetype
    that result in misplacement of freepages on the buddy list, an incorrect
    freepage count and unavailability of freepages.

    Patch 2: Freepages on the pcp list could carry stale cached information used
    to determine which migratetype buddy list to go to. This causes misplacement
    of freepages on the buddy list and an incorrect freepage count.

    Patch 4: Merging between freepages on pageblocks of different migratetypes
    will cause a freepage accounting problem. This patch fixes it.

    Without patchset [3], the above problems don't happen in my CMA allocation
    test, because CMA reserved pages aren't used at all, so there is no
    chance for the above races.

    With patchset [3], I did a simple CMA allocation test and got the result
    below:

    - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
    - run kernel build (make -j16) on background
    - 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
    - Result: more than 5000 freepages are missing from the count

    With patchset [3] and this patchset, I found that no freepage counts are
    missed, so I conclude that the problems are solved.

    These problems also occur in my simple memory offlining test
    environment.

    This patch (of 4):

    There are two paths to reach the core free function of the buddy allocator,
    __free_one_page(): one is free_one_page()->__free_one_page() and the
    other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page().
    Each path has a race condition causing serious problems. This patch
    focuses on the first type of freepath; a following patch will solve the
    problem in the second type.

    In the first type of freepath, we get the migratetype of the freeing page
    without holding the zone lock, so it can be racy. There are two cases
    of this race.

    1. pages are added to the isolate buddy list after restoring the original
    migratetype

    CPU1                                          CPU2

    get migratetype => return MIGRATE_ISOLATE
    call free_one_page() with MIGRATE_ISOLATE

                                                  grab the zone lock
                                                  unisolate pageblock
                                                  release the zone lock

    grab the zone lock
    call __free_one_page() with MIGRATE_ISOLATE
    freepage goes into the isolate buddy list,
    although the pageblock is already unisolated

    This may cause two problems. One is that we can't use this page anymore
    until the next isolation attempt on this pageblock, because the freepage is
    on the isolate buddy list. The other is that freepage accounting could be
    wrong due to merging between different buddy lists. Freepages on the isolate
    buddy list aren't counted as freepages, but ones on the normal buddy lists
    are. If a merge happens, a buddy freepage on the normal buddy list is
    inevitably moved to the isolate buddy list without any consideration of
    freepage accounting, so the count could be incorrect.

    2. pages are added to the normal buddy list while the pageblock is isolated.
    This is similar to the above case.

    It also may cause two problems. One is that we can't keep these
    freepages from being allocated. Although this pageblock is isolated, the
    freepage would be added to the normal buddy list so that it could be
    allocated without any restriction. The other problem is the same as in
    case 1, that is, incorrect freepage accounting.

    This race condition can be prevented by checking the migratetype again
    while holding the zone lock. Because that is a somewhat heavy operation and
    isn't needed in the common case, we want to avoid rechecking as much as
    possible. So this patch introduces a new variable, nr_isolate_pageblock, in
    struct zone to track whether there is an isolated pageblock. With this, we
    can avoid re-checking the migratetype in the common case and do it only if
    there is an isolated pageblock or the migratetype is MIGRATE_ISOLATE. This
    solves the above-mentioned problems.

    Changes from v3:
    Add one more check in free_one_page() that checks whether the migratetype is
    MIGRATE_ISOLATE or not. Without this, the above-mentioned case 1 could happen.
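
    The shape of the fix, reduced to a user-space sketch (illustrative names,
    not the kernel code): the lockless migratetype read is trusted in the
    common case and re-read under the zone lock only when an isolated
    pageblock exists or the cached value itself says "isolate".

        struct zone_model {
            int nr_isolate_pageblock;   /* how many pageblocks are isolated now */
            /* lock, free lists, ... */
        };

        #define MT_ISOLATE 4            /* illustrative value */

        /* Called with the zone lock held; cached_mt was read locklessly. */
        static int checked_migratetype(struct zone_model *z, unsigned long pfn,
                                       int cached_mt,
                                       int (*get_pageblock_mt)(unsigned long pfn))
        {
            if (z->nr_isolate_pageblock == 0 && cached_mt != MT_ISOLATE)
                return cached_mt;            /* common case: trust the cache      */
            return get_pageblock_mt(pfn);    /* rare case: re-read under the lock */
        }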

    Signed-off-by: Joonsoo Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

10 Oct, 2014

1 commit

  • Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
    sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the
    reader through layers of indirection just to track down a simple bit.

    Remove all zone flag wrappers and just use bitops against zone->flags
    directly. It's just as readable and the lines are barely any longer.

    Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
    remove the zone_flags_t typedef.
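
    In plain C terms the before/after looks roughly like this (illustrative
    bit names; the kernel uses its atomic set_bit()/test_bit() helpers on
    zone->flags):

        struct zone_model { unsigned long flags; };

        enum { ZONE_DIRTY_BIT, ZONE_WRITEBACK_BIT };   /* illustrative bits */

        /* Before: wrappers such as zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY)
         * and zone_is_reclaim_dirty(zone).  After: direct bit operations. */
        static void mark_zone_dirty(struct zone_model *z)
        {
            z->flags |= 1UL << ZONE_DIRTY_BIT;             /* kernel: set_bit()  */
        }

        static int zone_is_dirty(const struct zone_model *z)
        {
            return !!(z->flags & (1UL << ZONE_DIRTY_BIT)); /* kernel: test_bit() */
        }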

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Aug, 2014

4 commits

  • The fair zone allocation policy round-robins allocations between zones
    within a node to avoid age inversion problems during reclaim. If the
    first allocation fails, the batch counts are reset and a second attempt
    made before entering the slow path.

    One assumption made with this scheme is that batches expire at roughly
    the same time and the resets each time are justified. This assumption
    does not hold when zones reach their low watermark as the batches will
    be consumed at uneven rates. Allocation failure due to watermark
    depletion results in additional zonelist scans for the reset and another
    watermark check before hitting the slowpath.

    On UMA, the benefit is negligible -- around 0.25%. On a 4-socket NUMA
    machine it's variable due to the variability of measuring overhead with
    the vmstat changes. The system CPU overhead comparison looks like

    3.16.0-rc3 3.16.0-rc3 3.16.0-rc3
    vanilla vmstat-v5 lowercost-v5
    User 746.94 774.56 802.00
    System 65336.22 32847.27 40852.33
    Elapsed 27553.52 27415.04 27368.46

    However it is worth noting that the overall benchmark still completed
    faster and intuitively it makes sense to take as few passes as possible
    through the zonelists.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • zone->pages_scanned is a write-intensive cache line during page reclaim
    and it's also updated during page free. Move the counter into vmstat to
    take advantage of the per-cpu updates and do not update it in the free
    paths unless necessary.

    On a small UMA machine running tiobench the difference is marginal. On
    a 4-node machine the overhead is more noticeable. Note that automatic
    NUMA balancing was disabled for this test as otherwise the system CPU
    overhead is unpredictable.

    3.16.0-rc3 3.16.0-rc3 3.16.0-rc3
    vanilla rearrange-v5 vmstat-v5
    User 746.94 759.78 774.56
    System 65336.22 58350.98 32847.27
    Elapsed 27553.52 27282.02 27415.04

    Note that the overhead reduction will vary depending on where exactly
    pages are allocated and freed.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The arrangement of struct zone has changed over time and now it has
    reached the point where there is some inappropriate sharing going on.
    On x86-64 for example

    o The zone->node field is shared with the zone lock and zone->node is
    accessed frequently from the page allocator due to the fair zone
    allocation policy.

    o span_seqlock is almost never used but shares a cache line with free_area

    o Some zone statistics share a cache line with the LRU lock so
    reclaim-intensive and allocator-intensive workloads can bounce the cache
    line on a stat update

    This patch rearranges struct zone to put read-only and read-mostly
    fields together and then splits the page allocator intensive fields, the
    zone statistics and the page reclaim intensive fields into their own
    cache lines. Note that the type of lowmem_reserve changes due to the
    watermark calculations being signed and avoiding a signed/unsigned
    conversion there.

    On the test configuration I used the overall size of struct zone shrunk
    by one cache line. On smaller machines, this is not likely to be
    noticeable. However, on a 4-node NUMA machine running tiobench the
    system CPU overhead is reduced by this patch.

    3.16.0-rc3 3.16.0-rc3
    vanilla rearrange-v5r9
    User 746.94 759.78
    System 65336.22 58350.98
    Elapsed 27553.52 27282.02

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In the original code, zone_movable_is_highmem() assumes ZONE_MOVABLE is not
    highmem if CONFIG_HAVE_MEMBLOCK_NODE_MAP is not set. In online_pages,
    however, pages are extracted from the previous zone before ZONE_MOVABLE,
    which is logically inconsistent:

    If HAVE_MEMBLOCK_NODE_MAP is turned off but HIGHMEM is on,
    zone_movable_is_highmem() makes movable zone not highmem, but
    online_pages() extracts pages from ZONE_HIGHMEM.

    This inconsistency doesn't cause a real problem currently, because all
    architectures that support online_pages also have HAVE_MEMBLOCK_NODE_MAP.
    However, fixing it makes the code clearer and also helps further work.

    Signed-off-by: Wang Nan
    Cc: Zhang Zhen
    Cc: Mel Gorman
    Cc: Jiang Liu
    Cc: Li Zefan
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Nan
     

05 Jun, 2014

6 commits

  • X86 prefers the use of unsigned types for iterators and there is a
    tendency to mix whether a signed or unsigned type is used for page order.
    This converts a number of sites in mm/page_alloc.c to use unsigned int for
    order where possible.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In the free path we calculate page_to_pfn multiple times. Reduce that.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The test_bit operations in get/set pageblock flags are expensive. This
    patch reads the bitmap on a word basis and uses shifts and masks to isolate
    the bits of interest. Similarly, masks are used to set a local copy of the
    bitmap word and then cmpxchg is used to update the bitmap if there have been
    no other changes made in parallel.
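
    A stand-alone C model of the word-based update (C11 atomics stand in for
    the kernel's cmpxchg; the 4-bits-per-pageblock layout is illustrative):
    the bits of interest are isolated with a shift and a mask, and an update
    retries if another CPU changed the same word in the meantime.

        #include <stdatomic.h>

        #define BITS_PER_BLOCK  4UL
        #define BLOCK_MASK      ((1UL << BITS_PER_BLOCK) - 1)
        #define BLOCKS_PER_WORD (sizeof(unsigned long) * 8 / BITS_PER_BLOCK)

        static unsigned long get_block_flags(_Atomic unsigned long *bitmap,
                                             unsigned long block)
        {
            unsigned long word = atomic_load(&bitmap[block / BLOCKS_PER_WORD]);
            unsigned long shift = (block % BLOCKS_PER_WORD) * BITS_PER_BLOCK;

            return (word >> shift) & BLOCK_MASK;  /* all bits come from one word */
        }

        static void set_block_flags(_Atomic unsigned long *bitmap,
                                    unsigned long block, unsigned long flags)
        {
            _Atomic unsigned long *word = &bitmap[block / BLOCKS_PER_WORD];
            unsigned long shift = (block % BLOCKS_PER_WORD) * BITS_PER_BLOCK;
            unsigned long old = atomic_load(word), new;

            do {    /* retry if another CPU updated the word concurrently */
                new = (old & ~(BLOCK_MASK << shift)) | (flags << shift);
            } while (!atomic_compare_exchange_weak(word, &old, new));
        }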

    In a test running dd onto tmpfs the overhead of the pageblock-related
    functions went from 1.27% in profiles to 0.5%.

    In addition to the performance benefits, this patch closes races that are
    possible between:

    a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
    reads part of the bits before and other part of the bits after
    set_pageblock_migratetype() has updated them.

    b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
    read-modify-update set bit operation in set_pageblock_skip() will cause
    lost updates to some bits changed in the set_pageblock_migratetype().

    Joonsoo Kim first reported the case a) via code inspection. Vlastimil
    Babka's testing with a debug patch showed that either a) or b) occurs
    roughly once per mmtests' stress-highalloc benchmark (although not
    necessarily in the same pageblock). Furthermore during development of
    unrelated compaction patches, it was observed that with frequent calls to
    {start,undo}_isolate_page_range() the race occurs several thousand
    times and has resulted in NULL pointer dereferences in move_freepages()
    and free_one_page() in places where free_list[migratetype] is
    manipulated by e.g. list_move(). Further debugging confirmed that
    migratetype had invalid value of 6, causing out of bounds access to the
    free_list array.

    That confirmed that the race exists, although it may be extremely rare,
    and currently only fatal where page isolation is performed due to
    memory hot remove. Races on pageblocks being updated by
    set_pageblock_migratetype(), where both the old and new migratetype are
    lower than MIGRATE_RESERVE, currently cannot result in an invalid value
    being observed, although theoretically they may still lead to
    unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
    Furthermore, things could get suddenly worse when memory isolation is
    used more, or when new migratetypes are added.

    After this patch, the race has no longer been observed in testing.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reported-by: Joonsoo Kim
    Reported-and-tested-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Each zone has a cached migration scanner pfn for memory compaction so that
    subsequent calls to memory compaction can start where the previous call
    left off.

    Currently, the compaction migration scanner only updates the per-zone
    cached pfn when pageblocks were not skipped for async compaction. This
    creates a dependency on calling sync compaction to avoid having subsequent
    calls to async compaction from scanning an enormous amount of non-MOVABLE
    pageblocks each time it is called. On large machines, this could be
    potentially very expensive.

    This patch adds a per-zone cached migration scanner pfn only for async
    compaction. It is updated every time a pageblock has been scanned in its
    entirety and when no pages from it were successfully isolated. The cached
    migration scanner pfn for sync compaction is updated only when called for
    sync compaction.

    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Greg Thelen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • kmem_cache_{create,destroy,shrink} need to get a stable value of
    cpu/node online mask, because they init/destroy/access per-cpu/node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of race as described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves the purpose, but
    it's a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some limitations on locking order, which are not
    desirable, and it can't be used just like get_online_cpus. That's why in
    patch 1 I substitute it with get/put_online_mems, which work exactly
    like get/put_online_cpus except they block not cpu, but memory hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
    myself, because it used an rw semaphore for get/put_online_mems,
    making them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of online nodes
    mask won't be able to proceed concurrently. Also, it imposes some
    strong locking ordering rules on it, which narrows down the set of its
    usage scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus, but for memory hotplug, i.e. executing code
    inside a get/put_online_mems section will guarantee a stable value of
    online nodes, present pages, etc.

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
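
    The usage pattern mirrors get_online_cpus()/put_online_cpus(). The toy
    model below only illustrates the bracketed-section idea with a pthread
    rwlock; as the note above explains, the kernel implementation deliberately
    avoids a plain rw semaphore, so this is not a model of the actual locking
    scheme.

        #include <pthread.h>

        static pthread_rwlock_t mem_hotplug_model = PTHREAD_RWLOCK_INITIALIZER;

        /* Readers want a stable view of "online memory". */
        static void get_online_mems_model(void) { pthread_rwlock_rdlock(&mem_hotplug_model); }
        static void put_online_mems_model(void) { pthread_rwlock_unlock(&mem_hotplug_model); }

        /* Memory hotplug is the (rare) writer. */
        static void mem_hotplug_begin_model(void) { pthread_rwlock_wrlock(&mem_hotplug_model); }
        static void mem_hotplug_done_model(void)  { pthread_rwlock_unlock(&mem_hotplug_model); }

        /* Caller:
         *     get_online_mems_model();
         *     ... walk online nodes, init per-node caches, etc. ...
         *     put_online_mems_model();
         */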

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed
    by zone_reclaim due to its distance. As it is expected that
    zone_reclaim_mode will be rarely enabled it is unreasonable for all
    machines to take a penalty. Fortunately, the zone_reclaim_mode() path
    is already slow and it is the path that takes the hit.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Zhang Yanfei
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

04 Apr, 2014

2 commits

  • Previously, page cache radix tree nodes were freed after reclaim emptied
    out their page pointers. But now reclaim stores shadow entries in their
    place, which are only reclaimed when the inodes themselves are
    reclaimed. This is problematic for bigger files that are still in use
    after they have a significant amount of their cache reclaimed, without
    any of those pages actually refaulting. The shadow entries will just
    sit there and waste memory. In the worst case, the shadow entries will
    accumulate until the machine runs out of memory.

    To get this under control, the VM will track radix tree nodes
    exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
    rather than global because we expect the radix tree nodes themselves to
    be allocated node-locally and we want to reduce cross-node references of
    otherwise independent cache workloads. A simple shrinker will then
    reclaim these nodes on memory pressure.

    A few things need to be stored in the radix tree node to implement the
    shadow node LRU and allow tree deletions coming from the list:

    1. There is no index available that would describe the reverse path
    from the node up to the tree root, which is needed to perform a
    deletion. To solve this, encode in each node its offset inside the
    parent. This can be stored in the unused upper bits of the same
    member that stores the node's height at no extra space cost.

    2. The number of shadow entries needs to be counted in addition to the
    regular entries, to quickly detect when the node is ready to go to
    the shadow node LRU list. The current entry count is an unsigned
    int but the maximum number of entries is 64, so a shadow counter
    can easily be stored in the unused upper bits.

    3. Tree modification needs tree lock and tree root, which are located
    in the address space, so store an address_space backpointer in the
    node. The parent pointer of the node is in a union with the 2-word
    rcu_head, so the backpointer comes at no extra cost as well.

    4. The node needs to be linked to an LRU list, which requires a list
    head inside the node. This does increase the size of the node, but
    it does not change the number of objects that fit into a slab page.

    [akpm@linux-foundation.org: export the right function]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The VM maintains cached filesystem pages on two types of lists. One
    list holds the pages recently faulted into the cache, the other list
    holds pages that have been referenced repeatedly on that first list.
    The idea is to prefer reclaiming young pages over those that have shown
    to benefit from caching in the past. We call the recently used list the
    "inactive list" and the repeatedly referenced list the "active list". The
    scheme used so far, however, was ultimately not significantly better than a
    FIFO policy and still thrashed cache based on eviction speed, rather than
    actual demand for cache.

    This patch solves one half of the problem by decoupling the ability to
    detect working set changes from the inactive list size. By maintaining
    a history of recently evicted file pages it can detect frequently used
    pages with an arbitrarily small inactive list size, and subsequently
    apply pressure on the active list based on actual demand for cache, not
    just overall eviction speed.

    Every zone maintains a counter that tracks inactive list aging speed.
    When a page is evicted, a snapshot of this counter is stored in the
    now-empty page cache radix tree slot. On refault, the minimum access
    distance of the page can be assessed, to evaluate whether the page
    should be part of the active list or not.
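
    A tiny user-space model of the mechanism (illustrative only, not the
    kernel's workingset code): evictions stamp the empty slot with the current
    value of the zone's aging counter, and on refault the distance between
    that stamp and the counter approximates how much additional cache the page
    would have needed to stay resident.

        #include <stdbool.h>

        struct zone_aging_model {
            unsigned long inactive_age;   /* bumped on activations and evictions */
            unsigned long nr_active;      /* pages on the active list            */
        };

        /* On eviction: remember where the aging clock stood. */
        static unsigned long make_shadow_entry(struct zone_aging_model *z)
        {
            return z->inactive_age++;
        }

        /* On refault: a small distance means the page was evicted "recently"
         * relative to list churn and deserves to challenge the active list. */
        static bool refault_should_activate(struct zone_aging_model *z,
                                            unsigned long shadow)
        {
            unsigned long refault_distance = z->inactive_age - shadow;

            return refault_distance <= z->nr_active;
        }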

    This fixes the VM's blindness towards working set changes in excess of
    the inactive list. And it's the foundation to further improve the
    protection ability and reduce the minimum inactive list size of 50%.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 Mar, 2014

1 commit

  • GFP_THISNODE is for callers that implement their own clever fallback to
    remote nodes. It restricts the allocation to the specified node and
    does not invoke reclaim, assuming that the caller will take care of it
    when the fallback fails, e.g. through a subsequent allocation request
    without GFP_THISNODE set.

    However, many current GFP_THISNODE users only want the node exclusive
    aspect of the flag, without actually implementing their own fallback or
    triggering reclaim if necessary. This results in things like page
    migration failing prematurely even when there is easily reclaimable
    memory available, unless kswapd happens to be running already or a
    concurrent allocation attempt triggers the necessary reclaim.

    Convert all callsites that don't implement their own fallback strategy
    to __GFP_THISNODE. This restricts the allocation to a single node too, but
    at the same time allows the allocator to enter the slowpath, wake
    kswapd, and invoke direct reclaim if necessary, to make the allocation
    happen when memory is full.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Jan Stancek
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

22 Jan, 2014

2 commits

  • NUMA migrate rate limiting protects a migration counter and window using
    a lock but in some cases this can be a contended lock. It is not
    critical that the number of pages be perfect, lost updates are
    acceptable. Reduce the importance of this lock.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Yasuaki Ishimatsu reported that memory hot-add spent more than 5 _hours_ on
    a 9TB memory machine since onlining memory sections is too slow. And we
    found out that setup_zone_migrate_reserve() spent >90% of the time.

    The problem is that setup_zone_migrate_reserve() scans all pageblocks
    unconditionally, but it is only necessary if the number of reserved
    blocks was reduced (i.e. memory hot remove).

    Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means
    that the number of reserved pageblocks is almost always unchanged.

    This patch adds zone->nr_migrate_reserve_block to maintain the number of
    MIGRATE_RESERVE pageblocks and it reduces the overhead of
    setup_zone_migrate_reserve dramatically. The following table shows time
    of onlining a memory section.

    Amount of memory | 128GB | 192GB | 256GB|
    ---------------------------------------------
    linux-3.12 | 23.9 | 31.4 | 44.5 |
    This patch | 8.3 | 8.3 | 8.6 |
    Mel's proposal patch | 10.9 | 19.2 | 31.3 |
    ---------------------------------------------
    (millisecond)

    128GB : 4 nodes and each node has 32GB of memory
    192GB : 6 nodes and each node has 32GB of memory
    256GB : 8 nodes and each node has 32GB of memory

    (*1) Mel proposed his idea in the following thread:
    https://lkml.org/lkml/2013/10/30/272

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Yasuaki Ishimatsu
    Reported-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     

12 Sep, 2013

2 commits

  • This patch is based on KOSAKI's work and I add a little more description;
    please refer to https://lkml.org/lkml/2012/6/14/74.

    Currently, I found the system can enter a state where there are lots of free
    pages in a zone but only order-0 and order-1 pages, which means the zone is
    heavily fragmented; then a high-order allocation could cause a long stall in
    the direct reclaim path (e.g. 60 seconds), especially in a no-swap and
    no-compaction environment. This problem happened on v3.4, but it seems the
    issue still lives in the current tree; the reason is that
    do_try_to_free_pages() enters a livelock:

    kswapd will go to sleep if the zones have been fully scanned and are still
    not balanced, as kswapd thinks there's little point trying all over again
    and wants to avoid an infinite loop. Instead it changes the order from
    high-order to order-0 because kswapd thinks order-0 is the most important.
    Look at 73ce02e9 for details. If the watermarks are ok, kswapd will go back
    to sleep and may leave zone->all_unreclaimable = 0. It assumes high-order
    users can still perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER without the oom-killer until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill.
    So direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue to reclaim pages forever
    while kswapd sleeps forever, until someone like a watchdog detects this and
    finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because the direct reclaim path doesn't take any lock, so that way is racy.
    Thus this patch removes the zone->all_unreclaimable field completely and
    recalculates the zone's reclaimable state every time.

    Note: we can't take the approach of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because that is racy. Commit 929bea7c71 ("vmscan:
    all_unreclaimable() use zone->all_unreclaimable as a name") describes the
    details.
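
    In spirit, the replacement check is recomputed on demand along these lines
    (a sketch; the threshold and field names are illustrative, not the exact
    kernel code):

        #include <stdbool.h>

        struct zone_model {
            unsigned long pages_scanned;   /* reclaim scan effort so far      */
            unsigned long nr_reclaimable;  /* reclaimable file + anon pages   */
        };

        /* Instead of a sticky all_unreclaimable flag that only kswapd could
         * set, every caller derives the state from current counters: once we
         * have scanned several times the reclaimable pages without progress,
         * treat the zone as unreclaimable for now. */
        static bool zone_reclaimable_model(const struct zone_model *z)
        {
            return z->pages_scanned < z->nr_reclaimable * 6;
        }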

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
     
  • Each zone that holds userspace pages of one workload must be aged at a
    speed proportional to the zone size. Otherwise, the time an individual
    page gets to stay in memory depends on the zone it happened to be
    allocated in. Asymmetry in the zone aging creates rather unpredictable
    aging behavior and results in the wrong pages being reclaimed, activated
    etc.

    But exactly this happens right now because of the way the page allocator
    and kswapd interact. The page allocator uses per-node lists of all zones
    in the system, ordered by preference, when allocating a new page. When
    the first iteration does not yield any results, kswapd is woken up and the
    allocator retries. Due to the way kswapd reclaims zones below the high
    watermark while a zone can be allocated from when it is above the low
    watermark, the allocator may keep kswapd running while kswapd reclaim
    ensures that the page allocator can keep allocating from the first zone in
    the zonelist for extended periods of time. Meanwhile the other zones
    rarely see new allocations and thus get aged much slower in comparison.

    The result is that the occasional page placed in lower zones gets
    relatively more time in memory, even gets promoted to the active list
    after its peers have long been evicted. Meanwhile, the bulk of the
    working set may be thrashing on the preferred zone even though there may
    be significant amounts of memory available in the lower zones.

    Even the most basic test -- repeatedly reading a file slightly bigger than
    memory -- shows how broken the zone aging is. In this scenario, no single
    page should be able to stay in memory long enough to get referenced twice and
    activated, but activation happens in spades:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 0
    nr_active_file 8
    nr_inactive_file 1582
    nr_active_file 11994
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 70
    nr_inactive_file 258753
    nr_active_file 443214
    nr_inactive_file 149793
    nr_active_file 12021

    Fix this with a very simple round robin allocator. Each zone is allowed a
    batch of allocations that is proportional to the zone's size, after which
    it is treated as full. The batch counters are reset when all zones have
    been tried and the allocator enters the slowpath and kicks off kswapd
    reclaim. Allocation and reclaim is now fairly spread out to all
    available/allowable zones:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 174
    nr_active_file 4865
    nr_inactive_file 53
    nr_active_file 860
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 666622
    nr_active_file 4988
    nr_inactive_file 190969
    nr_active_file 937

    When zone_reclaim_mode is enabled, allocations will now spread out to all
    zones on the local node, not just the first preferred zone (which on a 4G
    node might be a tiny Normal zone).
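
    A self-contained model of the round-robin idea (greatly simplified, with
    made-up names): every zone gets a batch proportional to its size,
    allocations consume the batch, a zone with an exhausted batch is skipped,
    and the batches are reset once every zone has been tried.

        struct fair_zone_model {
            long managed_pages;
            long batch;                 /* allocations left in this round */
        };

        static void reset_fair_batches(struct fair_zone_model *zones, int nr)
        {
            for (int i = 0; i < nr; i++)
                zones[i].batch = zones[i].managed_pages >> 10;  /* ~0.1% of zone */
        }

        /* Returns the index of the zone to allocate from, or -1 if every batch
         * is exhausted, at which point the caller resets the batches and, in
         * the kernel, enters the slowpath and wakes kswapd. */
        static int fair_pick_zone(struct fair_zone_model *zones, int nr)
        {
            for (int i = 0; i < nr; i++) {
                if (zones[i].batch > 0) {
                    zones[i].batch--;
                    return i;
                }
            }
            return -1;
        }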

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Paul Bolle
    Cc: Zlatko Calusic
    Tested-by: Kevin Hilman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

10 Jul, 2013

1 commit


04 Jul, 2013

1 commit

  • Instead of leaving a hidden trap for the next person who comes along and
    wants to add something to mem_section, add a big fat warning about it
    needing to be a power-of-2, and insert a BUILD_BUG_ON() in sparse_init()
    to catch mistakes.
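
    The check itself is a one-liner; in stand-alone C it can be expressed with
    a static assertion (the kernel uses its BUILD_BUG_ON() macro inside
    sparse_init(); the struct below is only a placeholder):

        struct mem_section_model { unsigned long section_mem_map; };

        /* A power of two has exactly one bit set, so x & (x - 1) == 0. */
        _Static_assert((sizeof(struct mem_section_model) &
                        (sizeof(struct mem_section_model) - 1)) == 0,
                       "mem_section size must be a power of 2");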

    Right now non-power-of-2 mem_sections cause a number of WARNs at boot
    (which don't clearly point to the size of mem_section as an issue), but
    the system limps on (temporarily, at least).

    This is based upon Dave Hansen's earlier RFC where he ran into the same
    issue:
    "sparsemem: fix boot when SECTIONS_PER_ROOT is not power-of-2"
    http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/03077.html

    Signed-off-by: Cody P Schafer
    Acked-by: Dave Hansen
    Cc: Jiang Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer