26 Sep, 2014

3 commits

  • commit 0d5d823ab4e608ec7b52ac4410de4cb74bbe0edd upstream.

    zone->pages_scanned is a write-intensive cache line during page reclaim
    and it's also updated during page free. Move the counter into vmstat to
    take advantage of the per-cpu updates and do not update it in the free
    paths unless necessary.

    On a small UMA machine running tiobench the difference is marginal. On
    a 4-node machine the overhead is more noticeable. Note that automatic
    NUMA balancing was disabled for this test as otherwise the system CPU
    overhead is unpredictable.

                  3.16.0-rc3    3.16.0-rc3    3.16.0-rc3
                     vanilla  rearrange-v5     vmstat-v5
    User              746.94        759.78        774.56
    System          65336.22      58350.98      32847.27
    Elapsed         27553.52      27282.02      27415.04

    Note that the overhead reduction will vary depending on where exactly
    pages are allocated and freed.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
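
To illustrate why moving a hot counter into vmstat can reduce cache line traffic, here is a minimal Python model of per-CPU batching. Each CPU accumulates a private delta and touches the shared counter only once the delta exceeds a threshold. The class name, the method names and the threshold of 32 are assumptions made for this sketch; the kernel's real code in mm/vmstat.c differs in detail.

```python
class PerCpuCounter:
    """Toy model of a vmstat-style counter with per-CPU deltas."""

    def __init__(self, nr_cpus, threshold=32):
        self.global_count = 0          # shared value, expensive to write
        self.diff = [0] * nr_cpus      # private per-CPU deltas
        self.threshold = threshold     # hypothetical batch size
        self.shared_writes = 0         # how often the shared value was written

    def mod(self, cpu, delta):
        """Add delta for one CPU, folding only once past the threshold."""
        self.diff[cpu] += delta
        if abs(self.diff[cpu]) > self.threshold:
            self.global_count += self.diff[cpu]
            self.diff[cpu] = 0
            self.shared_writes += 1

    def snapshot(self):
        """Exact value: shared count plus all unfolded per-CPU deltas."""
        return self.global_count + sum(self.diff)


counter = PerCpuCounter(nr_cpus=4)
for i in range(1000):
    counter.mod(cpu=i % 4, delta=1)
```

In this run 1000 updates cause only a few dozen writes to the shared value. The trade-off is that the shared value lags behind until the deltas are folded.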
     
  • commit 3484b2de9499df23c4604a513b36f96326ae81ad upstream.

    The arrangement of struct zone has changed over time and now it has
    reached the point where there is some inappropriate sharing going on.
    On x86-64 for example

    o The zone->node field is shared with the zone lock and zone->node is
    accessed frequently from the page allocator due to the fair zone
    allocation policy.

    o span_seqlock is almost never used but shares a line with free_area

    o Some zone statistics share a cache line with the LRU lock so
    reclaim-intensive and allocator-intensive workloads can bounce the cache
    line on a stat update

    This patch rearranges struct zone to put read-only and read-mostly
    fields together and then splits the page allocator intensive fields, the
    zone statistics and the page reclaim intensive fields into their own
    cache lines. Note that the type of lowmem_reserve changes due to the
    watermark calculations being signed and avoiding a signed/unsigned
    conversion there.

    On the test configuration I used the overall size of struct zone shrunk
    by one cache line. On smaller machines, this is not likely to be
    noticeable. However, on a 4-node NUMA machine running tiobench the
    system CPU overhead is reduced by this patch.

                  3.16.0-rc3      3.16.0-rc3
                     vanilla  rearrange-v5r9
    User              746.94          759.78
    System          65336.22        58350.98
    Elapsed         27553.52        27282.02

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
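
The sharing problem described above can be made concrete by checking which cache line each field of a structure starts in. The sketch below uses Python's ctypes to lay out two hypothetical structures. The field selection, the padding sizes and the 64-byte line size are assumptions for illustration and do not reproduce the real struct zone.

```python
import ctypes

CACHE_LINE = 64  # assumed cache line size in bytes


class ZoneShared(ctypes.Structure):
    """Hypothetical layout: the lock and a read-mostly field share a line."""
    _fields_ = [
        ("lock", ctypes.c_uint32),           # written on every alloc/free
        ("node", ctypes.c_int32),            # read-mostly
        ("pages_scanned", ctypes.c_uint64),  # written during reclaim
    ]


class ZoneSplit(ctypes.Structure):
    """Same fields, padded so each access pattern gets its own line."""
    _fields_ = [
        ("node", ctypes.c_int32),            # read-mostly line
        ("_pad1", ctypes.c_uint8 * (CACHE_LINE - 4)),
        ("lock", ctypes.c_uint32),           # allocator-hot line
        ("_pad2", ctypes.c_uint8 * (CACHE_LINE - 4)),
        ("pages_scanned", ctypes.c_uint64),  # reclaim-hot line
    ]


def cache_line_of(struct, field):
    """Index of the cache line that holds the start of a field."""
    return getattr(struct, field).offset // CACHE_LINE


for name in ("node", "lock", "pages_scanned"):
    print(name, cache_line_of(ZoneShared, name), cache_line_of(ZoneSplit, name))
```

In the first layout all three fields start in line 0, so a write to the lock invalidates the line that readers of node depend on. In the second layout each field starts in a different line.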
     
  • commit ec65993443736a5091b68e80ff1734548944a4b8 upstream.

    Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
    vmstats: tlb flush counters") to cause overhead problems.

    The counters are undeniably useful but how often do we really
    need to debug TLB flush related issues? It does not justify
    taking the penalty everywhere so make it a debugging option.

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alex Shi
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     

08 Dec, 2013

1 commit

  • commit 72403b4a0fbdf433c1fe0127e49864658f6f6468 upstream.

    Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
    PTE update") was added to account for the number of PTE updates when
    marking pages prot_numa. task_numa_work was using the old return value
    to track how much address space had been updated. Altering the return
    value causes the scanner to do more work than it is configured or
    documented to in a single unit of work.

    This patch reverts that commit and accounts for the number of THP
    updates separately in vmstat. It is up to the administrator to
    interpret the pair of values correctly. This is a straight-forward
    operation and likely to only be of interest when actively debugging NUMA
    balancing problems.

    The impact of this patch is that the NUMA PTE scanner will scan slower
    when THP is enabled and workloads may converge slower as a result. On
    the flip side system CPU usage should be lower than recent tests
    reported. This is an illustrative example of a short single JVM specjbb
    test

    specjbb
                     3.12.0                 3.12.0
                    vanilla            acctupdates
    TPut 1      26143.00 (  0.00%)     25747.00 ( -1.51%)
    TPut 7     185257.00 (  0.00%)    183202.00 ( -1.11%)
    TPut 13    329760.00 (  0.00%)    346577.00 (  5.10%)
    TPut 19    442502.00 (  0.00%)    460146.00 (  3.99%)
    TPut 25    540634.00 (  0.00%)    549053.00 (  1.56%)
    TPut 31    512098.00 (  0.00%)    519611.00 (  1.47%)
    TPut 37    461276.00 (  0.00%)    474973.00 (  2.97%)
    TPut 43    403089.00 (  0.00%)    414172.00 (  2.75%)

                  3.12.0         3.12.0
                 vanilla    acctupdates
    User         5169.64        5184.14
    System        100.45          80.02
    Elapsed       252.75         251.85

    Performance is similar but note the reduction in system CPU time. While
    this showed a performance gain, it will not be universal but at least
    it'll be behaving as documented. The vmstats are obviously different but
    here is an obvious interpretation of them from mmtests.

                                  3.12.0         3.12.0
                                 vanilla    acctupdates
    NUMA page range updates      1408326       11043064
    NUMA huge PMD updates              0          21040
    NUMA PTE updates             1408326         291624

    "NUMA page range updates" == nr_pte_updates and is the value returned to
    the NUMA pte scanner. NUMA huge PMD updates were the number of THP
    updates which in combination can be used to calculate how many ptes were
    updated from userspace.

    Signed-off-by: Mel Gorman
    Reported-by: Alex Thorlton
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
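
One way to read the three vmstat rows above is that each huge PMD update covers a full huge page's worth of base pages in the range count but is only a single page table entry update. The sketch below applies that reading and reproduces the "NUMA PTE updates" row. The constant of 512 base pages per huge page (2M THP with 4K pages, as on x86-64) and the function name are assumptions.

```python
HPAGE_PMD_NR = 512  # assumption: 2M huge page / 4K base page


def numa_pte_updates(page_range_updates, huge_pmd_updates,
                     pages_per_huge=HPAGE_PMD_NR):
    """Entries actually updated: a huge PMD counts once, not 512 times."""
    return page_range_updates - huge_pmd_updates * (pages_per_huge - 1)


vanilla = numa_pte_updates(1408326, 0)
acctupdates = numa_pte_updates(11043064, 21040)
print(vanilla, acctupdates)
```

With the figures from the table this gives 1408326 for vanilla and 291624 for acctupdates, matching the reported values.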
     

12 Sep, 2013

7 commits

  • This patch is based on KOSAKI's work, with a little more description added;
    please refer to https://lkml.org/lkml/2012/6/14/74.

    I found that the system can enter a state in which a zone has lots of free
    pages, but only order-0 and order-1 pages, which means the zone is
    heavily fragmented. A high-order allocation can then stall in the direct
    reclaim path for a long time (e.g. 60 seconds), especially in an
    environment with no swap and no compaction. This problem happened on v3.4,
    but the issue still seems to exist in the current tree. The reason is that
    do_try_to_free_pages enters a livelock:

    kswapd will go to sleep if the zones have been fully scanned and are still
    not balanced, because kswapd thinks there's little point in trying all over
    again and wants to avoid an infinite loop. Instead it changes the order from
    high-order to order-0, because kswapd thinks order-0 is the most important.
    See 73ce02e9 for details. If watermarks are ok, kswapd will go back to sleep
    and may leave zone->all_unreclaimable = 0. It assumes high-order users
    can still perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill.
    So direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue to reclaim pages forever
    while kswapd sleeps forever, until something like a watchdog detects it and
    finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because the direct reclaim path doesn't take any lock, so doing so would be
    racy. Thus this patch removes the zone->all_unreclaimable field completely
    and recalculates the zone reclaimable state every time.

    Note: we can't have direct reclaim look at zone->pages_scanned directly
    while kswapd continues to use zone->all_unreclaimable, because that is
    racy. Commit 929bea7c71 (vmscan: all_unreclaimable() use
    zone->all_unreclaimable as a name) describes the details.

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
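
Recalculating the reclaimable state on demand, instead of caching it in a flag, can be sketched as a pure function of two counters. The commit message does not state the exact test; the comparison against six times the reclaimable pages is recalled from the kernel source of that period and should be treated as an assumption, as should the Python function name and signature.

```python
def zone_reclaimable(pages_scanned, reclaimable_pages, factor=6):
    """True while reclaim has not yet scanned the zone 'factor' times over.

    factor=6 is an assumption, not taken from the commit message.
    """
    return pages_scanned < reclaimable_pages * factor


# A zone with 100 reclaimable pages stops being considered reclaimable
# once 600 pages have been scanned without progress.
still_trying = zone_reclaimable(pages_scanned=599, reclaimable_pages=100)
given_up = not zone_reclaimable(pages_scanned=600, reclaimable_pages=100)
```

Because the result is derived from counters each time, there is no flag that one path must set for the other, which is the dependency on kswapd that the message describes.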
     
  • Disabling interrupts repeatedly can be avoided in the inner loop if we use
    a this_cpu operation.

    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    CC: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Both functions that update global counters use the same mechanism.

    Create a function that contains the common code.

    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    CC: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The main idea behind this patchset is to reduce the vmstat update overhead
    by avoiding interrupt enable/disable and the use of per cpu atomics.

    This patch (of 3):

    It is better to have a separate folding function because
    refresh_cpu_vm_stats() also does other things like expire pages in the
    page allocator caches.

    If we have a separate function then refresh_cpu_vm_stats() is only called
    from the local cpu which allows additional optimizations.

    The folding function is only called when a cpu is being downed and
    therefore no other processor will be accessing the counters. This also
    simplifies synchronization.

    [akpm@linux-foundation.org: fix UP build]
    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    CC: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
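
The separate folding step for a CPU that is going offline can be sketched as follows. The dictionaries stand in for the global counters and the per-CPU differentials; the function name and data layout are hypothetical and only model the idea that pending deltas are added to the global values and then cleared.

```python
def fold_cpu_diffs(global_counters, per_cpu_diffs, cpu):
    """Fold one offline CPU's pending deltas into the global counters."""
    diffs = per_cpu_diffs[cpu]
    for item, delta in diffs.items():
        if delta:
            global_counters[item] = global_counters.get(item, 0) + delta
            diffs[item] = 0  # nothing left pending on the downed CPU
    return global_counters


global_counters = {"nr_free_pages": 1000, "nr_dirty": 10}
per_cpu_diffs = {
    0: {"nr_free_pages": -3, "nr_dirty": 2},
    1: {"nr_free_pages": 5, "nr_dirty": 0},
}
fold_cpu_diffs(global_counters, per_cpu_diffs, cpu=1)
```

Since the CPU being folded is no longer running, nothing else modifies its deltas during the loop, which is why no locking appears in the sketch.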
     
  • Each zone that holds userspace pages of one workload must be aged at a
    speed proportional to the zone size. Otherwise, the time an individual
    page gets to stay in memory depends on the zone it happened to be
    allocated in. Asymmetry in the zone aging creates rather unpredictable
    aging behavior and results in the wrong pages being reclaimed, activated
    etc.

    But exactly this happens right now because of the way the page allocator
    and kswapd interact. The page allocator uses per-node lists of all zones
    in the system, ordered by preference, when allocating a new page. When
    the first iteration does not yield any results, kswapd is woken up and the
    allocator retries. Due to the way kswapd reclaims zones below the high
    watermark while a zone can be allocated from when it is above the low
    watermark, the allocator may keep kswapd running while kswapd reclaim
    ensures that the page allocator can keep allocating from the first zone in
    the zonelist for extended periods of time. Meanwhile the other zones
    rarely see new allocations and thus get aged much slower in comparison.

    The result is that the occasional page placed in lower zones gets
    relatively more time in memory, even gets promoted to the active list
    after its peers have long been evicted. Meanwhile, the bulk of the
    working set may be thrashing on the preferred zone even though there may
    be significant amounts of memory available in the lower zones.

    Even the most basic test -- repeatedly reading a file slightly bigger than
    memory -- shows how broken the zone aging is. In this scenario, no single
    page should be able to stay in memory long enough to get referenced twice and
    activated, but activation happens in spades:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 0
    nr_active_file 8
    nr_inactive_file 1582
    nr_active_file 11994
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 70
    nr_inactive_file 258753
    nr_active_file 443214
    nr_inactive_file 149793
    nr_active_file 12021

    Fix this with a very simple round robin allocator. Each zone is allowed a
    batch of allocations that is proportional to the zone's size, after which
    it is treated as full. The batch counters are reset when all zones have
    been tried and the allocator enters the slowpath and kicks off kswapd
    reclaim. Allocation and reclaim are now fairly spread out to all
    available/allowable zones:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 174
    nr_active_file 4865
    nr_inactive_file 53
    nr_active_file 860
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 666622
    nr_active_file 4988
    nr_inactive_file 190969
    nr_active_file 937

    When zone_reclaim_mode is enabled, allocations will now spread out to all
    zones on the local node, not just the first preferred zone (which on a 4G
    node might be a tiny Normal zone).

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Paul Bolle
    Cc: Zlatko Calusic
    Tested-by: Kevin Hilman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
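
The round robin scheme can be modelled in a few lines. In this sketch each zone gets a batch proportional to its size, a zone with an exhausted batch is skipped as if it were full, and all batches are reset once every zone has been tried. The function name, the divisor of 100 and the zone sizes are assumptions, and the model ignores watermarks and actual free memory.

```python
def fair_allocate(zone_sizes, nr_pages, batch_divisor=100):
    """Spread page allocations over zones in proportion to zone size.

    zone_sizes: dict of zone name -> size in pages, in preference order.
    Returns a dict of zone name -> pages allocated.
    """
    batch = {name: max(1, size // batch_divisor)
             for name, size in zone_sizes.items()}
    remaining = dict(batch)
    allocated = dict.fromkeys(zone_sizes, 0)

    for _ in range(nr_pages):
        # First zone in preference order that still has batch left.
        zone = next((z for z in zone_sizes if remaining[z] > 0), None)
        if zone is None:
            # All zones were tried: reset the batches (the slowpath case).
            remaining = dict(batch)
            zone = next(iter(zone_sizes))
        remaining[zone] -= 1
        allocated[zone] += 1
    return allocated


spread = fair_allocate({"Normal": 300000, "DMA32": 100000}, nr_pages=40000)
print(spread)
```

With a 3:1 size ratio the allocations end up split 3:1 as well, so both zones age at a rate proportional to their size.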
     
  • The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
    counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
    for SMP.

    UP systems do not do remote TLB flushes, so compile those counters out on
    UP.

    arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is
    probably an optimization since both the mtrr code and __flush_tlb() write
    cr4. It would probably be safe to make that a flush_tlb_all() (and then
    get these statistics), but the mtrr code is ancient and I'm hesitant to
    touch it other than to just stick in the counters.

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Dave Hansen
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I was investigating some TLB flush scaling issues and realized that we do
    not have any good methods for figuring out how many TLB flushes we are
    doing.

    It would be nice to be able to do these in generic code, but the
    arch-independent calls don't explicitly specify whether we actually need
    to do remote flushes or not. In the end, we really need to know if we
    actually _did_ global vs. local invalidations, so that leaves us with few
    options other than to muck with the counters from arch-specific code.

    Signed-off-by: Dave Hansen
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

30 Apr, 2013

2 commits


24 Feb, 2013

4 commits

  • Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
    duplication.

    This also switches to using them in compaction (where an additional
    variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
    kmemleak.

    Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
    because I expect at some point the synchronization issues with start_pfn &
    spanned_pages will need fixing, either by actually using the seqlock or
    clever memory barrier usage.

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
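
The two helpers are small enough to model directly. The sketch below mirrors their meaning in Python, assuming a zone is described by zone_start_pfn and spanned_pages as in the message above; the dataclass is a stand-in for struct zone and is not kernel code.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    zone_start_pfn: int   # first page frame number in the zone
    spanned_pages: int    # number of page frames spanned, holes included


def zone_end_pfn(zone):
    """One past the last page frame number spanned by the zone."""
    return zone.zone_start_pfn + zone.spanned_pages


def zone_spans_pfn(zone, pfn):
    """True if pfn falls inside the half-open range the zone spans."""
    return zone.zone_start_pfn <= pfn < zone_end_pfn(zone)


zone = Zone(zone_start_pfn=4096, spanned_pages=1024)
```

Keeping the end computation in one place means callers no longer repeat the addition, and the half-open range check is written once.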
     
  • From: Zlatko Calusic

    Commit 92df3a723f84 ("mm: vmscan: throttle reclaim if encountering too
    many dirty pages under writeback") introduced waiting on congested zones
    based on a sane algorithm in shrink_inactive_list().

    What this means is that there's no more need for throttling and
    additional heuristics in balance_pgdat(). So, let's remove it and tidy
    up the code.

    Signed-off-by: Zlatko Calusic
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     
  • Several functions test MIGRATE_ISOLATE and some of those are in hot paths,
    but MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION (i.e.
    CMA, memory-hotplug and memory-failure), which is not a common config
    option. So let's not add unnecessary overhead and code when we don't
    enable CONFIG_MEMORY_ISOLATION.

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now we have zone->managed_pages for "pages managed by the buddy system
    in the zone", so replace zone->present_pages with zone->managed_pages if
    what the user really wants is number of allocatable pages.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

3 commits

  • Currently a zone's present_pages is calculated as below, which is
    inaccurate and may cause trouble for memory hotplug.

    spanned_pages - absent_pages - memmap_pages - dma_reserve.

    While fixing bugs caused by inaccurate zone->present_pages, we found
    zone->present_pages has been abused. The field zone->present_pages may
    have different meanings in different contexts:

    1) pages existing in a zone.
    2) pages managed by the buddy system.

    For more discussions about the issue, please refer to:
    http://lkml.org/lkml/2012/11/5/866
    https://patchwork.kernel.org/patch/1346751/

    This patchset introduces a new field named "managed_pages" to
    struct zone, which counts "pages managed by the buddy system". It reverts
    zone->present_pages to count "physical pages existing in a zone", which
    also keeps it consistent with pgdat->node_present_pages.

    We will set an initial value for zone->managed_pages in function
    free_area_init_core() and will adjust it later if the initial value is
    inaccurate.

    For DMA/normal zones, the initial value is set to:

    (spanned_pages - absent_pages - memmap_pages - dma_reserve)

    Later zone->managed_pages will be adjusted to the accurate value when the
    bootmem allocator frees all free pages to the buddy system in function
    free_all_bootmem_node() and free_all_bootmem().

    The bootmem allocator doesn't touch highmem pages, so highmem zones'
    managed_pages is set to the accurate value "spanned_pages - absent_pages"
    in function free_area_init_core() and won't be updated anymore.

    This patch also adds a new field "managed_pages" to /proc/zoneinfo
    and sysrq showmem.

    [akpm@linux-foundation.org: small comment tweaks]
    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Maciej Rutecki
    Tested-by: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
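
The initial values described above are plain arithmetic and can be checked with a short sketch. The function name and the page counts in the example are hypothetical; only the two formulas come from the message.

```python
def initial_managed_pages(spanned_pages, absent_pages,
                          memmap_pages=0, dma_reserve=0, highmem=False):
    """Initial zone->managed_pages as described in the commit message."""
    if highmem:
        # Bootmem never touches highmem, so this value is already accurate.
        return spanned_pages - absent_pages
    # DMA/normal zones: an estimate that is adjusted later, when bootmem
    # releases its free pages to the buddy system.
    return spanned_pages - absent_pages - memmap_pages - dma_reserve


normal = initial_managed_pages(262144, 1024, memmap_pages=3584)
highmem = initial_managed_pages(262144, 1024, highmem=True)
print(normal, highmem)
```

For the made-up zone above, present_pages would be 261120 (spanned minus absent), while the normal-zone estimate for managed_pages is lower because the memmap is subtracted.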
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Christoph Lameter
    Signed-off-by: Wen Congyang
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • hzp_alloc is incremented every time a huge zero page is successfully
    allocated. It includes allocations which were dropped due to a
    race with another allocation. Note, it doesn't count every map
    of the huge zero page, only its allocation.

    hzp_alloc_failed is incremented if the kernel fails to allocate a huge
    zero page and falls back to using small pages.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Dec, 2012

3 commits

  • It is tricky to quantify the basic cost of automatic NUMA placement in a
    meaningful manner. This patch adds some vmstats that can be used as part
    of a basic costing model.

    u         = basic unit = sizeof(void *)
    Ca        = cost of struct page access = sizeof(struct page) / u
    Cpte      = Cost PTE access = Ca
    Cupdate   = Cost PTE update = (2 * Cpte) + (2 * Wlock)
                where Cpte is incurred twice for a read and a write and Wlock
                is a constant representing the cost of taking or releasing a
                lock
    Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
    Cpagerw   = Cost to read or write a full page = Ca + PAGE_SIZE/u
    Ci        = Cost of page isolation = Ca + Wi
                where Wi is a constant that should reflect the approximate cost
                of the locking operation
    Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
                where Wnuma is the approximate NUMA factor. 1 is local. 1.2
                would imply that remote accesses are 20% more expensive

    Balancing cost = Cpte * numa_pte_updates +
                     Cnumahint * numa_hint_faults +
                     Ci * numa_pages_migrated +
                     Cpagecopy * numa_pages_migrated

    Note that numa_pages_migrated is used as a measure of how many pages
    were isolated even though it would miss pages that failed to migrate. A
    vmstat counter could have been added for it but the isolation cost is
    pretty marginal in comparison to the overall cost so it seemed overkill.

    The ideal way to measure automatic placement benefit would be to count
    the number of remote accesses versus local accesses and do something like

    benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

    but the information is not readily available. As a workload converges, the
    expectation would be that the number of remote numa hints would reduce to 0.

    convergence = numa_hint_faults_local / numa_hint_faults
                  where this is measured for the last N number of
                  numa hints recorded. When the workload is fully
                  converged the value is 1.

    This can measure if the placement policy is converging and how fast it is
    doing it.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Mel Gorman
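
The costing model above can be evaluated mechanically once the constants are chosen. The sketch below follows the formulas as written in the message. The concrete values (an 8-byte word, a 64-byte struct page, 4096-byte pages, Wi = 10, Wnuma = 1.2) and the counter values in the example are assumptions picked only to make the arithmetic visible.

```python
def numa_costs(word=8, struct_page=64, page_size=4096, w_i=10, w_numa=1.2):
    """Per-event costs from the model; all defaults are assumed values."""
    ca = struct_page / word                  # struct page access
    cpagerw = ca + page_size / word          # read or write a full page
    ci = ca + w_i                            # page isolation
    return {
        "Cpte": ca,                          # PTE access = Ca
        "Cnumahint": 1000,                   # "some high constant"
        "Ci": ci,
        "Cpagecopy": cpagerw + cpagerw * w_numa + ci + ci * w_numa,
    }


def balancing_cost(costs, numa_pte_updates, numa_hint_faults,
                   numa_pages_migrated):
    """Balancing cost as defined in the message."""
    return (costs["Cpte"] * numa_pte_updates
            + costs["Cnumahint"] * numa_hint_faults
            + costs["Ci"] * numa_pages_migrated
            + costs["Cpagecopy"] * numa_pages_migrated)


def convergence(numa_hint_faults_local, numa_hint_faults):
    """Fraction of hinting faults that were local; None if none recorded."""
    if numa_hint_faults == 0:
        return None
    return numa_hint_faults_local / numa_hint_faults


costs = numa_costs()
total = balancing_cost(costs, numa_pte_updates=1000,
                       numa_hint_faults=100, numa_pages_migrated=10)
```

In practice the counter values would be sampled from /proc/vmstat before and after a workload, and the difference fed into balancing_cost().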
     
  • Compaction already has tracepoints to count scanned and isolated pages
    but it requires that ftrace be enabled and if that information has to be
    written to disk then it can be disruptive. This patch adds vmstat counters
    for compaction called compact_migrate_scanned, compact_free_scanned and
    compact_isolated.

    With these counters, it is possible to define a basic cost model for
    compaction. This approximates how much work compaction is doing and can
    be compared with an oprofile showing TLB misses to see if the cost of
    compaction is being offset by THP, for example. Minimally a compaction patch
    can be evaluated in terms of whether it increases or decreases cost. The
    basic cost model looks like this

    Fundamental unit u: a word sizeof(void *)

    Ca  = cost of struct page access = sizeof(struct page) / u

    Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
    Cmf = Cost migrate failure   = Ca * 2
    Ci  = Cost page isolation    = (Ca + Wi)
          where Wi is a constant that should reflect the approximate
          cost of the locking operation.

    Csm = Cost migrate scanning = Ca
    Csf = Cost free scanning    = Ca

    Overall cost = (Csm * compact_migrate_scanned) +
                   (Csf * compact_free_scanned) +
                   (Ci * compact_isolated) +
                   (Cmc * pgmigrate_success) +
                   (Cmf * pgmigrate_fail)

    Where the values are read from /proc/vmstat.

    This is very basic and ignores certain costs such as the allocation cost
    to do a migrate page copy but any improvement to the model would still
    use the same vmstat counters.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
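
As a rough sketch of how the compaction cost model could be applied, the code below parses text in the "name value" format of /proc/vmstat and evaluates the overall cost. The constants (8-byte word, 64-byte struct page, 4096-byte pages, Wi = 10) and the sample counter values are assumptions. The failure counter is read as pgmigrate_fail, the name used by the migration-counter commit in this log.

```python
def parse_vmstat(text):
    """Parse 'name value' lines, as found in /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def compaction_cost(c, word=8, struct_page=64, page_size=4096, w_i=10):
    """Overall cost from the model; all constants are assumed values."""
    ca = struct_page // word               # struct page access
    cmc = (ca + page_size // word) * 2     # migrate page copy
    cmf = ca * 2                           # migrate failure
    ci = ca + w_i                          # page isolation
    csm = csf = ca                         # migrate / free scanning
    return (csm * c.get("compact_migrate_scanned", 0)
            + csf * c.get("compact_free_scanned", 0)
            + ci * c.get("compact_isolated", 0)
            + cmc * c.get("pgmigrate_success", 0)
            + cmf * c.get("pgmigrate_fail", 0))


sample = """compact_migrate_scanned 1000
compact_free_scanned 2000
compact_isolated 100
pgmigrate_success 40
pgmigrate_fail 5
"""
cost = compaction_cost(parse_vmstat(sample))
```

On a live system the sample string would be replaced by the contents of /proc/vmstat, read before and after the workload of interest so that only the difference is costed.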
     
  • The compact_pages_moved and compact_pagemigrate_failed events are
    convenient for determining if compaction is active and to what
    degree migration is succeeding but it's at the wrong level. Other
    users of migration may also want to know if migration is working
    properly and this will be particularly true for any automated
    NUMA migration. This patch moves the counters down to migration
    with the new events called pgmigrate_success and pgmigrate_fail.
    The compact_blocks_moved counter is removed because while it was
    useful for debugging initially, it's worthless now as no meaningful
    conclusions can be drawn from its value.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     

09 Oct, 2012

4 commits

  • Simply remove UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed line
    from /proc/vmstat: Johannes and Mel point out that it was very unlikely to
    have been used by any tool, and of course we can restore it easily enough
    if that turns out to be wrong.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] are increasing,
    causing the kernel to hang. When the system doesn't have enough free
    pages, it enters reclaim but never reclaims any pages due to
    too_many_isolated()==true and loops forever.

    The cause is that when we do memory-hotadd after memory-remove,
    __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
    although the vm_stat_diff of all CPUs still have values.

    In addition, when we offline all pages of the zone, we reset them in
    zone_pcp_reset without draining so we lose some zone stat items.

    Reviewed-by: Wen Congyang
    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Yasuaki Ishimatsu
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
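
The failure mode described above can be illustrated with a toy model: a global counter plus per-CPU pending deltas, where the global value is cleared while a delta is still pending. This is a simplified sketch, not the kernel's data structures; the class and method names are hypothetical, and "drain first" is only what the description implies as the remedy.

```python
class ZoneCounter:
    """Toy model: one global count plus a pending delta per CPU."""

    def __init__(self, nr_cpus):
        self.global_count = 0
        self.cpu_diff = [0] * nr_cpus

    def mod(self, cpu, delta):
        # Updates land in the per-CPU delta first, not in the global count.
        self.cpu_diff[cpu] += delta

    def fold(self, cpu):
        # Fold one CPU's pending delta into the global count.
        self.global_count += self.cpu_diff[cpu]
        self.cpu_diff[cpu] = 0

    def fold_all(self):
        for cpu in range(len(self.cpu_diff)):
            self.fold(cpu)

    def reset_global_only(self):
        # The problematic pattern: clear the global count, keep stale deltas.
        self.global_count = 0


def hotplug_cycle(drain_first):
    """Return the final counter value; the true value is 0 in both cases."""
    zone = ZoneCounter(nr_cpus=2)
    zone.mod(0, +5)   # CPU 0 isolates five pages (delta still pending)
    zone.mod(1, -5)   # CPU 1 puts five pages back
    zone.fold(1)      # only CPU 1's delta has reached the global count
    if drain_first:
        zone.fold_all()
    zone.reset_global_only()
    zone.fold_all()   # the stale +5 from CPU 0 arrives after the reset
    return zone.global_count


print(hotplug_cycle(drain_first=False), hotplug_cycle(drain_first=True))
```

Without the drain the counter ends up stuck above its true value, which is the kind of drift that could keep too_many_isolated() returning true.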
     
  • We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So
    remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
    already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
    checking it, reporting "BUG: Bad page state" if it's ever found set.
    Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in
    __zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this
    counter always available (not only when CONFIG_CMA=y).

    [akpm@linux-foundation.org: use conventional migratetype naming]
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     

22 Aug, 2012

1 commit


01 Aug, 2012

1 commit

  • Under significant pressure when writing back to network-backed storage,
    direct reclaimers may get throttled. This is expected to be a short-lived
    event and the processes get woken up again but processes do get stalled.
    This patch counts how many times such stalling occurs. It's up to the
    administrator whether to reduce these stalls by increasing
    min_free_kbytes.

    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

30 May, 2012

1 commit

  • …g_debug_root dentry local

    Remove debug fs files and directory on failure. Since no one is using
    "extfrag_debug_root" dentry outside of extfrag_debug_init(), make it
    local to the function.

    Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Sasikantha babu
     

21 May, 2012

1 commit

  • The MIGRATE_CMA migration type has two main characteristics:
    (i) only movable pages can be allocated from MIGRATE_CMA
    pageblocks and (ii) page allocator will never change migration
    type of MIGRATE_CMA pageblocks.

    This guarantees (to some degree) that page in a MIGRATE_CMA page
    block can always be migrated somewhere else (unless there's no
    memory left in the system).

    It is designed to be used for allocating big chunks (eg. 10MiB)
    of physically contiguous memory. Once driver requests
    contiguous memory, pages from MIGRATE_CMA pageblocks may be
    migrated away to create a contiguous block.

    To minimise number of migrations, MIGRATE_CMA migration type
    is the last type tried when page allocator falls back to other
    migration types when requested.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     

26 Apr, 2012

1 commit

  • The "pgsteal" stat is confusing because it counts both direct reclaim as
    well as background reclaim. However, we have "kswapd_steal" which also
    counts background reclaim value.

    This patch fixes it and also makes it match the existing "pgscan_" stats.

    Test:
    pgsteal_kswapd_dma32 447623
    pgsteal_kswapd_normal 42272677
    pgsteal_kswapd_movable 0
    pgsteal_direct_dma32 2801
    pgsteal_direct_normal 44353270
    pgsteal_direct_movable 0

    Signed-off-by: Ying Han
    Reviewed-by: Rik van Riel
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Dan Magenheimer
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
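
As a small illustration of what the split counters make possible, the sketch below sums the per-zone values from the "Test:" output above into one kswapd total and one direct-reclaim total. The function name is hypothetical; the numbers are copied from the commit message.

```python
# Values copied from the "Test:" output in the commit message above.
sample = """\
pgsteal_kswapd_dma32 447623
pgsteal_kswapd_normal 42272677
pgsteal_kswapd_movable 0
pgsteal_direct_dma32 2801
pgsteal_direct_normal 44353270
pgsteal_direct_movable 0
"""


def pgsteal_totals(text):
    """Sum pgsteal_kswapd_* and pgsteal_direct_* lines across all zones."""
    totals = {"kswapd": 0, "direct": 0}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        for source in totals:
            if name.startswith("pgsteal_" + source + "_"):
                totals[source] += int(value)
    return totals


totals = pgsteal_totals(sample)
direct_share = totals["direct"] / (totals["kswapd"] + totals["direct"])
print(totals, round(direct_share, 3))
```

For this sample, direct reclaim accounts for slightly more than half of the pages stolen, which the old combined "pgsteal" value could not show on its own.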
     

13 Jan, 2012

1 commit

  • Move CMPXCHG_LOCAL and rename it to HAVE_CMPXCHG_LOCAL so architectures
    can simply select the option if it is supported.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

01 Nov, 2011

3 commits

  • Avoid false sharing of the vm_stat array.

    This was found to adversely affect tmpfs I/O performance.

    Tests run on a 640 cpu UV system.

    With 120 threads doing parallel writes, each to different tmpfs mounts:
    No patch: ~300 MB/sec
    With vm_stat alignment: ~430 MB/sec

    Signed-off-by: Dimitri Sivanich
    Acked-by: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dimitri Sivanich
     
  • When direct reclaim encounters a dirty page, it gets recycled around the
    LRU for another cycle. This patch marks the page PageReclaim similar to
    deactivate_page() so that the page gets reclaimed almost immediately after
    the page gets cleaned. This is to avoid reclaiming clean pages that are
    younger than a dirty page encountered at the end of the LRU that might
    have been something like a use-once page.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Testing from the XFS folk revealed that there is still too much I/O from
    the end of the LRU in kswapd. Previously it was considered acceptable by
    VM people for a small number of pages to be written back from reclaim with
    testing generally showing about 0.3% of pages reclaimed were written back
    (higher if memory was low). That writing back a small number of pages is
    ok has been heavily disputed for quite some time and Dave Chinner
    explained it well;

    It doesn't have to be a very high number to be a problem. IO
    is orders of magnitude slower than the CPU time it takes to
    flush a page, so the cost of making a bad flush decision is
    very high. And single page writeback from the LRU is almost
    always a bad flush decision.

    To complicate matters, filesystems respond very differently to requests
    from reclaim according to Christoph Hellwig;

    xfs tries to write it back if the requester is kswapd
    ext4 ignores the request if it's a delayed allocation
    btrfs ignores the request

    As a result, each filesystem has different performance characteristics
    when under memory pressure and there are many pages being dirtied. In
    some cases, the request is ignored entirely so the VM cannot depend on the
    IO being dispatched.

    The objective of this series is to reduce writing of filesystem-backed
    pages from reclaim, play nicely with writeback that is already in progress
    and throttle reclaim appropriately when writeback pages are encountered.
    The assumption is that the flushers will always write pages faster than if
    reclaim issues the IO.

    A secondary goal is to avoid the problem whereby direct reclaim splices
    two potentially deep call stacks together.

    There is a potential new problem as reclaim has less control over how long
    before a page in a particular zone or container is cleaned and direct
    reclaimers depend on kswapd or flusher threads to do the necessary work.
    However, as filesystems sometimes ignore direct reclaim requests already,
    it is not expected to be a serious issue.

    Patch 1 disables writeback of filesystem pages from direct reclaim
    entirely. Anonymous pages are still written.

    Patch 2 removes dead code in lumpy reclaim as it is no longer able
    to synchronously write pages. This hurts lumpy reclaim but
    there is an expectation that compaction is used for hugepage
    allocations these days and lumpy reclaim's days are numbered.

    Patches 3-4 add warnings to XFS and ext4 if called from
    direct reclaim. With patch 1, this "never happens" and is
    intended to catch regressions in this logic in the future.

    Patch 5 disables writeback of filesystem pages from kswapd unless
    the priority is raised to the point where kswapd is considered
    to be in trouble.

    Patch 6 throttles reclaimers if too many dirty pages are being
    encountered and the zones or backing devices are congested.

    Patch 7 invalidates dirty pages found at the end of the LRU so they
    are reclaimed quickly after being written back rather than
    waiting for a reclaimer to find them.

    I consider this series to be orthogonal to the writeback work but it is
    worth noting that the writeback work affects the viability of patch 8 in
    particular.

    I tested this on ext4 and xfs using fs_mark, a simple writeback test based
    on dd and a micro benchmark that does a streaming write to a large mapping
    (exercises use-once LRU logic) followed by streaming writes to a mix of
    anonymous and file-backed mappings. The command line for fs_mark when
    booted with 512M looked something like

    ./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760

    The number of files was adjusted depending on the amount of available
    memory so that the files created was about 3xRAM. For multiple threads,
    the -d switch is specified multiple times.

    The test machine is x86-64 with an older generation of AMD processor with
    4 cores. The underlying storage was 4 disks configured as RAID-0 as this
    was the best configuration of storage I had available. Swap is on a
    separate disk. Dirty ratio was tuned to 40% instead of the default of
    20%.

    Testing was run with and without monitors to both verify that the patches
    were operating as expected and that any performance gain was real and not
    due to interference from monitors.

    Here is a summary of results based on testing XFS.

    512M1P-xfs     Files/s mean                 32.69 ( 0.00%)   34.44 ( 5.08%)
    512M1P-xfs     Elapsed Time fsmark          51.41            48.29
    512M1P-xfs     Elapsed Time simple-wb       114.09           108.61
    512M1P-xfs     Elapsed Time mmap-strm       113.46           109.34
    512M1P-xfs     Kswapd efficiency fsmark     62%              63%
    512M1P-xfs     Kswapd efficiency simple-wb  56%              61%
    512M1P-xfs     Kswapd efficiency mmap-strm  44%              42%
    512M-xfs       Files/s mean                 30.78 ( 0.00%)   35.94 (14.36%)
    512M-xfs       Elapsed Time fsmark          56.08            48.90
    512M-xfs       Elapsed Time simple-wb       112.22           98.13
    512M-xfs       Elapsed Time mmap-strm       219.15           196.67
    512M-xfs       Kswapd efficiency fsmark     54%              56%
    512M-xfs       Kswapd efficiency simple-wb  54%              55%
    512M-xfs       Kswapd efficiency mmap-strm  45%              44%
    512M-4X-xfs    Files/s mean                 30.31 ( 0.00%)   33.33 ( 9.06%)
    512M-4X-xfs    Elapsed Time fsmark          63.26            55.88
    512M-4X-xfs    Elapsed Time simple-wb       100.90           90.25
    512M-4X-xfs    Elapsed Time mmap-strm       261.73           255.38
    512M-4X-xfs    Kswapd efficiency fsmark     49%              50%
    512M-4X-xfs    Kswapd efficiency simple-wb  54%              56%
    512M-4X-xfs    Kswapd efficiency mmap-strm  37%              36%
    512M-16X-xfs   Files/s mean                 60.89 ( 0.00%)   65.22 ( 6.64%)
    512M-16X-xfs   Elapsed Time fsmark          67.47            58.25
    512M-16X-xfs   Elapsed Time simple-wb       103.22           90.89
    512M-16X-xfs   Elapsed Time mmap-strm       237.09           198.82
    512M-16X-xfs   Kswapd efficiency fsmark     45%              46%
    512M-16X-xfs   Kswapd efficiency simple-wb  53%              55%
    512M-16X-xfs   Kswapd efficiency mmap-strm  33%              33%

    Up until 512-4X, the FSmark improvements were statistically significant.
    For the 4X and 16X tests the results were within standard deviations but
    just barely. The time to completion for all tests is improved which is an
    important result. In general, kswapd efficiency is not affected by
    skipping dirty pages.

    1024M1P-xfs    Files/s mean                 39.09 ( 0.00%)   41.15 ( 5.01%)
    1024M1P-xfs    Elapsed Time fsmark          84.14            80.41
    1024M1P-xfs    Elapsed Time simple-wb       210.77           184.78
    1024M1P-xfs    Elapsed Time mmap-strm       162.00           160.34
    1024M1P-xfs    Kswapd efficiency fsmark     69%              75%
    1024M1P-xfs    Kswapd efficiency simple-wb  71%              77%
    1024M1P-xfs    Kswapd efficiency mmap-strm  43%              44%
    1024M-xfs      Files/s mean                 35.45 ( 0.00%)   37.00 ( 4.19%)
    1024M-xfs      Elapsed Time fsmark          94.59            91.00
    1024M-xfs      Elapsed Time simple-wb       229.84           195.08
    1024M-xfs      Elapsed Time mmap-strm       405.38           440.29
    1024M-xfs      Kswapd efficiency fsmark     79%              71%
    1024M-xfs      Kswapd efficiency simple-wb  74%              74%
    1024M-xfs      Kswapd efficiency mmap-strm  39%              42%
    1024M-4X-xfs   Files/s mean                 32.63 ( 0.00%)   35.05 ( 6.90%)
    1024M-4X-xfs   Elapsed Time fsmark          103.33           97.74
    1024M-4X-xfs   Elapsed Time simple-wb       204.48           178.57
    1024M-4X-xfs   Elapsed Time mmap-strm       528.38           511.88
    1024M-4X-xfs   Kswapd efficiency fsmark     81%              70%
    1024M-4X-xfs   Kswapd efficiency simple-wb  73%              72%
    1024M-4X-xfs   Kswapd efficiency mmap-strm  39%              38%
    1024M-16X-xfs  Files/s mean                 42.65 ( 0.00%)   42.97 ( 0.74%)
    1024M-16X-xfs  Elapsed Time fsmark          103.11           99.11
    1024M-16X-xfs  Elapsed Time simple-wb       200.83           178.24
    1024M-16X-xfs  Elapsed Time mmap-strm       397.35           459.82
    1024M-16X-xfs  Kswapd efficiency fsmark     84%              69%
    1024M-16X-xfs  Kswapd efficiency simple-wb  74%              73%
    1024M-16X-xfs  Kswapd efficiency mmap-strm  39%              40%

    All FSMark tests up to 16X had statistically significant improvements.
    For the most part, tests are completing faster with the exception of the
    streaming writes to a mixture of anonymous and file-backed mappings which
    were slower in two cases.

    In the cases where the mmap-strm tests were slower, there was more
    swapping due to dirty pages being skipped. The number of additional pages
    swapped is almost identical to the fewer number of pages written from
    reclaim. In other words, roughly the same number of pages were reclaimed
    but swapping was slower. As the test is a bit unrealistic and stresses
    memory heavily, the small shift is acceptable.

    4608M1P-xfs    Files/s mean                 29.75 ( 0.00%)   30.96 ( 3.91%)
    4608M1P-xfs    Elapsed Time fsmark          512.01           492.15
    4608M1P-xfs    Elapsed Time simple-wb       618.18           566.24
    4608M1P-xfs    Elapsed Time mmap-strm       488.05           465.07
    4608M1P-xfs    Kswapd efficiency fsmark     93%              86%
    4608M1P-xfs    Kswapd efficiency simple-wb  88%              84%
    4608M1P-xfs    Kswapd efficiency mmap-strm  46%              45%
    4608M-xfs      Files/s mean                 27.60 ( 0.00%)   28.85 ( 4.33%)
    4608M-xfs      Elapsed Time fsmark          555.96           532.34
    4608M-xfs      Elapsed Time simple-wb       659.72           571.85
    4608M-xfs      Elapsed Time mmap-strm       1082.57          1146.38
    4608M-xfs      Kswapd efficiency fsmark     89%              91%
    4608M-xfs      Kswapd efficiency simple-wb  88%              82%
    4608M-xfs      Kswapd efficiency mmap-strm  48%              46%
    4608M-4X-xfs   Files/s mean                 26.00 ( 0.00%)   27.47 ( 5.35%)
    4608M-4X-xfs   Elapsed Time fsmark          592.91           564.00
    4608M-4X-xfs   Elapsed Time simple-wb       616.65           575.07
    4608M-4X-xfs   Elapsed Time mmap-strm       1773.02          1631.53
    4608M-4X-xfs   Kswapd efficiency fsmark     90%              94%
    4608M-4X-xfs   Kswapd efficiency simple-wb  87%              82%
    4608M-4X-xfs   Kswapd efficiency mmap-strm  43%              43%
    4608M-16X-xfs  Files/s mean                 26.07 ( 0.00%)   26.42 ( 1.32%)
    4608M-16X-xfs  Elapsed Time fsmark          602.69           585.78
    4608M-16X-xfs  Elapsed Time simple-wb       606.60           573.81
    4608M-16X-xfs  Elapsed Time mmap-strm       1549.75          1441.86
    4608M-16X-xfs  Kswapd efficiency fsmark     98%              98%
    4608M-16X-xfs  Kswapd efficiency simple-wb  88%              82%
    4608M-16X-xfs  Kswapd efficiency mmap-strm  44%              42%

    Unlike the other tests, the fsmark results are not statistically
    significant but the min and max times are both improved and for the most
    part, tests completed faster.

    There are other indications that this is an improvement as well. For
    example, in the vast majority of cases, there were fewer pages scanned by
    direct reclaim implying in many cases that stalls due to direct reclaim
    are reduced. KSwapd is scanning more due to skipping dirty pages which is
    unfortunate but the CPU usage is still acceptable.

    In an earlier set of tests, I used blktrace and in almost all cases
    throughput throughout the entire test was higher. However, I ended up
    discarding those results as recording blktrace data was too heavy for my
    liking.

    On a laptop, I plugged in a USB stick and ran a similar set of tests
    using it as backing storage. A desktop environment was running and for
    the entire duration of the tests, firefox and gnome terminal were
    launching and exiting to vaguely simulate a user.

    1024M-xfs      Files/s mean                 0.41 ( 0.00%)    0.44 ( 6.82%)
    1024M-xfs      Elapsed Time fsmark          2053.52          1641.03
    1024M-xfs      Elapsed Time simple-wb       1229.53          768.05
    1024M-xfs      Elapsed Time mmap-strm       4126.44          4597.03
    1024M-xfs      Kswapd efficiency fsmark     84%              85%
    1024M-xfs      Kswapd efficiency simple-wb  92%              81%
    1024M-xfs      Kswapd efficiency mmap-strm  60%              51%
    1024M-xfs      Avg wait ms fsmark           5404.53          4473.87
    1024M-xfs      Avg wait ms simple-wb        2541.35          1453.54
    1024M-xfs      Avg wait ms mmap-strm        3400.25          3852.53

    The mmap-strm results were hurt because firefox launching had a tendency
    to push the test out of memory. On the positive side, firefox launched
    marginally faster with the patches applied. Time to completion for many
    tests was faster but more importantly - the "Avg wait" time as measured by
    iostat was far lower implying the system would be more responsive. It was
    also the case that "Avg wait ms" on the root filesystem was lower. I
    tested it manually and while the system felt slightly more responsive
    while copying data to a USB stick, it was marginal enough that it could be
    my imagination.

    This patch: do not writeback filesystem pages in direct reclaim.

    When kswapd is failing to keep zones above the min watermark, a process
    will enter direct reclaim in the same manner kswapd does. If a dirty page
    is encountered during the scan, this page is written to backing storage
    using mapping->writepage.

    This causes two problems. First, it can result in very deep call stacks,
    particularly if the target storage or filesystem are complex. Some
    filesystems ignore write requests from direct reclaim as a result. The
    second is that a single-page flush is inefficient in terms of IO. While
    there is an expectation that the elevator will merge requests, this does
    not always happen. Quoting Christoph Hellwig;

    The elevator has a relatively small window it can operate on,
    and can never fix up a bad large scale writeback pattern.

    This patch prevents direct reclaim writing back filesystem pages by
    checking if current is kswapd. Anonymous pages are still written to swap
    as there is not the equivalent of a flusher thread for anonymous pages.
    If the dirty pages cannot be written back, they are placed back on the LRU
    lists. There is now a direct dependency on dirty page balancing to
    prevent too many pages in the system being dirtied which would prevent
    reclaim making forward progress.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
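
The rule introduced by "This patch" above can be summarised as a small decision function. This is a hedged sketch of the described behaviour only, not the kernel code: the function name and its boolean parameters are hypothetical, and it models just this first patch (a later patch in the series further restricts when kswapd itself writes filesystem pages).

```python
def may_write_page(is_file_backed, caller_is_kswapd):
    """Return True if reclaim may issue writeback for this dirty page."""
    if is_file_backed and not caller_is_kswapd:
        # Direct reclaim leaves dirty filesystem pages to the flusher
        # threads; the page is placed back on the LRU instead.
        return False
    # kswapd may still write filesystem pages, and anonymous pages are
    # always written to swap because no flusher thread exists for them.
    return True


for file_backed in (True, False):
    for kswapd in (True, False):
        print(file_backed, kswapd, may_write_page(file_backed, kswapd))
```

Only the combination "filesystem page, direct reclaim" is refused, which is why the description notes a new dependency on dirty page balancing to keep reclaim making forward progress.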
     

15 Sep, 2011

1 commit

  • The vmstat_text array is only defined for CONFIG_SYSFS or CONFIG_PROC_FS,
    yet it is referenced for per-node vmstat with CONFIG_NUMA:

    drivers/built-in.o: In function `node_read_vmstat':
    node.c:(.text+0x1106df): undefined reference to `vmstat_text'

    Introduced in commit fa25c503dfa2 ("mm: per-node vmstat: show proper
    vmstats").

    Define the array for CONFIG_NUMA as well.

    [akpm@linux-foundation.org: remove unneeded ifdefs]
    Signed-off-by: David Rientjes
    Reported-by: Cong Wang
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 May, 2011

1 commit