30 Dec, 2015

1 commit

  • mod_zone_page_state() takes a "delta" integer argument. delta contains
    the number of pages that should be added to or subtracted from a struct
    zone's vm_stat field.

    If a zone is larger than 8TB this will cause overflows. E.g. for a
    zone with a size slightly larger than 8TB, the line

    mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);

    in mm/page_alloc.c:free_area_init_core() will produce a negative
    value for the NR_ALLOC_BATCH entry within the zone's vm_stat, since
    8TB contains 0x8xxxxxxx pages, which gets sign extended to a negative
    value.

    Fix this by changing the delta argument to long type.

    This could fix an early boot problem seen on s390, where we have a 9TB
    system with only one node. ZONE_DMA contains 2GB and ZONE_NORMAL the
    rest. The system is trying to allocate a GFP_DMA page but ZONE_DMA is
    completely empty, so it tries to reclaim pages in an endless loop.

    This was seen on a heavily patched 3.10 kernel. One possible
    explanation seems to be the overflows caused by mod_zone_page_state().
    Unfortunately I did not have the chance to verify that this patch
    actually fixes the problem, since I don't have access to the system
    right now. However, the overflow problem exists regardless.

    Given that a system with slightly less than 8TB does work, this seems
    to be a likely candidate for the observed problem (the sign extension
    is sketched after this entry).

    Signed-off-by: Heiko Carstens
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
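
    A minimal standalone illustration of the sign extension described above,
    using made-up sizes rather than the kernel code itself; on a typical
    64-bit system the int conversion yields a negative value:

    #include <stdio.h>

    int main(void)
    {
            /* a zone slightly larger than 8TB has more than 0x7fffffff
             * 4KB pages, which no longer fits in a signed int */
            unsigned long managed_pages = (9UL << 40) / 4096;

            int  delta_int  = managed_pages;  /* old "int delta" prototype */
            long delta_long = managed_pages;  /* fixed "long delta" prototype */

            printf("managed_pages = %lu\n", managed_pages);
            printf("as int:  %d\n", delta_int);    /* negative */
            printf("as long: %ld\n", delta_long);  /* correct */
            return 0;
    }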
     

06 Nov, 2015

2 commits

  • refresh_cpu_vm_stats(int cpu) has not been referenced by the !SMP kernel
    since Linux 3.12.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • With x86_64 (config http://ozlabs.org/~akpm/config-akpm2.txt) and an old
    gcc (4.4.4), drivers/base/node.c:node_read_meminfo() uses 2344 bytes of
    stack. Uninlining node_page_state() reduces this to 440 bytes.

    The stack consumption issue is fixed by newer gcc (4.8.4); however, with
    that compiler this patch reduces the node.o text size from 7314 bytes to
    4578 bytes.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

05 Jun, 2014

1 commit

  • Introduce a CONFIG_DEBUG_VM_VMACACHE option to enable counting the cache
    hit rate -- exported in /proc/vmstat (a small reader sketch follows this
    entry).

    Any update to the caching scheme needs this kind of data, so keeping the
    counting in-tree saves re-implementing it every time.

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
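
    A small userspace sketch of how the exported counters could be turned
    into a hit rate; the counter names below are assumptions about what the
    debug option exposes in /proc/vmstat and may need adjusting:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char name[64];
            unsigned long long val, calls = 0, hits = 0;
            FILE *f = fopen("/proc/vmstat", "r");

            if (!f)
                    return 1;
            while (fscanf(f, "%63s %llu", name, &val) == 2) {
                    if (!strcmp(name, "vmacache_find_calls"))
                            calls = val;            /* assumed counter name */
                    else if (!strcmp(name, "vmacache_find_hits"))
                            hits = val;             /* assumed counter name */
            }
            fclose(f);

            if (calls)
                    printf("vmacache hit rate: %.1f%%\n", 100.0 * hits / calls);
            else
                    printf("counters not found (CONFIG_DEBUG_VM_VMACACHE off?)\n");
            return 0;
    }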
     

08 Apr, 2014

1 commit


04 Apr, 2014

1 commit

  • Summary:

    The VM maintains cached filesystem pages on two types of lists. One
    list holds the pages recently faulted into the cache, the other list
    holds pages that have been referenced repeatedly on that first list.
    The idea is to prefer reclaiming young pages over those that have been
    shown to benefit from caching in the past. We call the recently used
    list the "inactive list" and the frequently used list the "active list".

    Currently, the VM aims for a 1:1 ratio between the lists, which is the
    "perfect" trade-off between the ability to *protect* frequently used
    pages and the ability to *detect* frequently used pages. This means
    that working set changes bigger than half of cache memory go undetected
    and thrash indefinitely, whereas working sets bigger than half of cache
    memory are unprotected against used-once streams that don't even need
    caching.

    This happens on file servers and media streaming servers, where the
    popular files and file sections change over time. Even though the
    individual files might be smaller than half of memory, concurrent access
    to many of them may still result in their inter-reference distance being
    greater than half of memory. It's also been reported as a problem on
    database workloads that switch back and forth between tables that are
    bigger than half of memory. In these cases the VM never recognizes the
    new working set and will for the remainder of the workload thrash disk
    data which could easily live in memory.

    Historically, every reclaim scan of the inactive list also took a
    smaller number of pages from the tail of the active list and moved them
    to the head of the inactive list. This model gave established working
    sets more grace time in the face of temporary use-once streams, but
    ultimately was not significantly better than a FIFO policy and still
    thrashed cache based on eviction speed, rather than actual demand for
    cache.

    This series solves the problem by maintaining a history of pages evicted
    from the inactive list, enabling the VM to detect frequently used pages
    regardless of inactive list size and facilitate working set transitions.

    Tests:

    The reported database workload is easily demonstrated on an 8G machine
    with two 6G filesets. This fio workload operates on one set first,
    then switches to the other. The VM should obviously always cache the
    set that the workload is currently using.

    This test is based on a problem encountered by Citus Data customers:
    http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data

    unpatched:
    db1: READ: io=98304MB, aggrb=885559KB/s, minb=885559KB/s, maxb=885559KB/s, mint= 113672msec, maxt= 113672msec
    db2: READ: io=98304MB, aggrb= 66169KB/s, minb= 66169KB/s, maxb= 66169KB/s, mint=1521302msec, maxt=1521302msec
    sdb: ios=835750/4, merge=2/1, ticks=4659739/60016, in_queue=4719203, util=98.92%

    real 27m15.541s
    user 0m19.059s
    sys 0m51.459s

    patched:
    db1: READ: io=98304MB, aggrb=877783KB/s, minb=877783KB/s, maxb=877783KB/s, mint=114679msec, maxt=114679msec
    db2: READ: io=98304MB, aggrb=397449KB/s, minb=397449KB/s, maxb=397449KB/s, mint=253273msec, maxt=253273msec
    sdb: ios=170587/4, merge=2/1, ticks=954910/61123, in_queue=1015923, util=90.40%

    real 6m8.630s
    user 0m14.714s
    sys 0m31.233s

    As can be seen, the unpatched kernel simply never adapts to the
    workingset change and db2 is stuck indefinitely with secondary storage
    speed. The patched kernel needs 2-3 iterations over db2 before it
    replaces db1 and reaches full memory speed. Given the unbounded
    negative effect of the existing VM behavior, these patches should be
    considered correctness fixes rather than performance optimizations.

    Another test resembles a fileserver or streaming server workload, where
    data in excess of memory size is accessed at different frequencies.
    There is very hot data accessed at a high frequency. Machines should be
    fitted so that the hot set of such a workload can be fully cached or all
    bets are off. Then there is a very big (compared to available memory)
    set of data that is used-once or at a very low frequency; this is what
    drives the inactive list and does not really benefit from caching.
    Lastly, there is a big set of warm data in between that is accessed at
    medium frequencies and benefits from caching the pages between the first
    and last streamer of each burst.

    unpatched:
    hot: READ: io=128000MB, aggrb=160693KB/s, minb=160693KB/s, maxb=160693KB/s, mint=815665msec, maxt=815665msec
    warm: READ: io= 81920MB, aggrb=109853KB/s, minb= 27463KB/s, maxb= 29244KB/s, mint=717110msec, maxt=763617msec
    cold: READ: io= 30720MB, aggrb= 35245KB/s, minb= 35245KB/s, maxb= 35245KB/s, mint=892530msec, maxt=892530msec
    sdb: ios=797960/4, merge=11763/1, ticks=4307910/796, in_queue=4308380, util=100.00%

    patched:
    hot: READ: io=128000MB, aggrb=160678KB/s, minb=160678KB/s, maxb=160678KB/s, mint=815740msec, maxt=815740msec
    warm: READ: io= 81920MB, aggrb=147747KB/s, minb= 36936KB/s, maxb= 40960KB/s, mint=512000msec, maxt=567767msec
    cold: READ: io= 30720MB, aggrb= 40960KB/s, minb= 40960KB/s, maxb= 40960KB/s, mint=768000msec, maxt=768000msec
    sdb: ios=596514/4, merge=9341/1, ticks=2395362/997, in_queue=2396484, util=79.18%

    In both kernels, the hot set is propagated to the active list and then
    served from cache.

    In both kernels, the beginning of the warm set is propagated to the
    active list as well, but in the unpatched case the active list
    eventually takes up half of memory and no new pages from the warm set
    get activated, despite repeated access, and despite most of the active
    list soon being stale. The patched kernel on the other hand detects the
    thrashing and manages to keep this cache window rolling through the data
    set. This frees up enough IO bandwidth that the cold set is served at
    full speed as well and disk utilization even drops by 20%.

    For reference, this same test was performed with the traditional
    demotion mechanism, where deactivation is coupled to inactive list
    reclaim. However, this had the same outcome as the unpatched kernel:
    while the warm set does indeed get activated continuously, it is forced
    out of the active list by inactive list pressure, which is dictated
    primarily by the unrelated cold set. The warm set is evicted before
    subsequent streamers can benefit from it, even though there would be
    enough space available to cache the pages of interest.

    Costs:

    Page reclaim used to shrink the radix trees but now the tree nodes are
    reused for shadow entries, where the cost depends heavily on the page
    cache access patterns. However, with workloads that maintain spatial or
    temporal locality, the shadow entries are either refaulted quickly or
    reclaimed along with the inode object itself. Workloads that will
    experience a memory cost increase are those that don't really benefit
    from caching in the first place.

    A more predictable alternative would be a fixed-cost separate pool of
    shadow entries, but this would incur relatively higher memory cost for
    well-behaved workloads at the benefit of cornercases. It would also
    make the shadow entry lookup more costly compared to storing them
    directly in the cache structure.

    Future:

    To simplify the merging process, this patch set implements thrash
    detection on a global per-zone level only for now, but the design is
    such that it can be extended to memory cgroups as well. All we need to
    do is store the unique cgroup ID along with the node and zone
    identifiers inside the eviction cookie to identify the lruvec.

    Right now we have a fixed ratio (50:50) between inactive and active list
    but we already have complaints about working sets exceeding half of
    memory being pushed out of the cache by simple streaming in the
    background. Ultimately, we want to adjust this ratio and allow for a
    much smaller inactive list. These patches are an essential step in this
    direction because they decouple the VM's ability to detect working set
    changes from the inactive list size. This would allow us to base the
    inactive list size on the combined readahead window size for example and
    potentially protect a much bigger working set.

    It's also a big step towards activating pages with a reuse distance
    larger than memory, as long as they are the most frequently used pages
    in the workload. This will require knowing more about the access
    frequency of active pages than what we measure right now, so it's also
    deferred in this series.

    Another possibility opened up by having thrashing information would be
    to revisit the idea of local reclaim in the form of zero-config memory
    control groups. Instead of having allocating tasks go straight to global
    reclaim, they could try to reclaim the pages in the memcg they are part
    of first as long as the group is not thrashing. This would allow a user
    to drop e.g. a back-up job in an otherwise unconfigured memcg and it
    would only inflate (and possibly do global reclaim) until it has enough
    memory to do proper readahead. But once it reaches that point and stops
    thrashing it would just recycle its own used-once pages without kicking
    out the cache of any other tasks in the system more than necessary.

    This patch (of 10):

    Fengguang Wu's build testing spotted problems with inc_zone_state() and
    dec_zone_state() on UP configurations in out-of-tree patches.

    inc_zone_state() is declared but not defined, dec_zone_state() is
    missing entirely.

    Just like with *_zone_page_state(), they can be defined like their
    preemption-unsafe counterparts on UP (the pattern is sketched after
    this entry).

    [akpm@linux-foundation.org: make it build]
    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
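
    A toy model of the UP pattern referred to above: the "safe" increment and
    decrement helpers can simply alias the preemption-unsafe double-underscore
    versions, because there is no cross-CPU concurrency to protect against.
    Names only loosely mirror the kernel API:

    #include <stdio.h>

    static long vm_stat[4];                    /* stand-in for zone->vm_stat */

    static void __inc_zone_state(int item) { vm_stat[item]++; }
    static void __dec_zone_state(int item) { vm_stat[item]--; }

    /* on !SMP the interrupt/preemption-safe variants collapse to these */
    #define inc_zone_state __inc_zone_state
    #define dec_zone_state __dec_zone_state

    int main(void)
    {
            inc_zone_state(0);
            inc_zone_state(0);
            dec_zone_state(0);
            printf("item 0 = %ld\n", vm_stat[0]);   /* prints 1 */
            return 0;
    }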
     

08 Feb, 2014

1 commit


30 Jan, 2014

1 commit

  • The VM is currently heavily tuned to avoid swapping. Whether that is
    good or bad is a separate discussion, but as long as the VM won't swap
    to make room for dirty cache, we can not consider anonymous pages when
    calculating the amount of dirtyable memory, the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    A simple workload that occupies a significant size (40+%, depending on
    memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
    uses the remainder for a streaming writer demonstrates this problem. In
    that case, the actual cache pages are a small fraction of what is
    considered dirtyable overall, which results in a relatively large
    portion of the cache pages being dirtied. As kswapd starts rotating
    these, random tasks enter direct reclaim and stall on IO.

    Only consider free pages and file pages dirtyable (illustrated below).

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
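
    A back-of-the-envelope illustration of the change, with made-up page
    counts and a made-up dirty ratio:

    #include <stdio.h>

    int main(void)
    {
            unsigned long nr_free     = 100000;
            unsigned long nr_file     =  80000;  /* inactive_file + active_file */
            unsigned long nr_anon     = 600000;  /* no longer counted */
            unsigned long dirty_ratio = 20;      /* percent */

            unsigned long old_dirtyable = nr_free + nr_file + nr_anon;
            unsigned long new_dirtyable = nr_free + nr_file;

            printf("dirty limit before: %lu pages\n",
                   old_dirtyable * dirty_ratio / 100);  /* 156000 */
            printf("dirty limit after:  %lu pages\n",
                   new_dirtyable * dirty_ratio / 100);  /* 36000 */
            return 0;
    }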
     

25 Jan, 2014

1 commit

  • Bisection between 3.11 and 3.12 identified commit 9824cf97 ("mm:
    vmstats: tlb flush counters") as the cause of overhead problems.

    The counters are undeniably useful, but how often do we really
    need to debug TLB-flush-related issues? It does not justify
    taking the penalty everywhere, so make it a debugging option.

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alex Shi
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

12 Sep, 2013

2 commits

  • This patch is based on KOSAKI's work, with a little more description
    added; please refer to https://lkml.org/lkml/2012/6/14/74.

    I found that a system can enter a state where a zone has lots of free
    pages but only order-0 and order-1 pages, which means the zone is
    heavily fragmented. A high-order allocation can then cause long stalls
    in the direct reclaim path (e.g. 60 seconds), especially in a no-swap,
    no-compaction environment. This problem happened on v3.4, but the issue
    still exists in the current tree; the reason is that
    do_try_to_free_pages() enters a livelock:

    kswapd will go to sleep if the zones have been fully scanned and are
    still not balanced, as kswapd thinks there is little point trying all
    over again and wants to avoid an infinite loop. Instead it changes the
    order from high-order to order 0, because kswapd thinks order-0 is the
    most important (see commit 73ce02e9 for details). If the watermarks are
    ok, kswapd will go back to sleep and may leave
    zone->all_unreclaimable = 0. It assumes high-order users can still
    perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order below COSTLY_ORDER
    without invoking the oom-killer until kswapd turns on
    zone->all_unreclaimable, in order to avoid a too-early oom-kill. So
    direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue to reclaim forever while
    kswapd sleeps forever, until something like a watchdog detects the
    situation and finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path,
    because direct reclaim doesn't take any lock and doing so would be
    racy. Thus this patch removes the zone->all_unreclaimable field
    completely and recalculates the zone's reclaimable state every time (a
    sketch of such a check follows this entry).

    Note: we can't take the approach of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because that is racy too. Commit 929bea7c71
    ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
    describes the details.

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
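
    A sketch of the kind of recomputed check that replaces the
    zone->all_unreclaimable flag; the factor of 6 mirrors the historical
    scanning heuristic, and the structure here is a simplified model:

    #include <stdbool.h>
    #include <stdio.h>

    struct zone_model {
            unsigned long pages_scanned;        /* scanned since last freeing */
            unsigned long reclaimable_pages;    /* file + (anon if swap left) */
    };

    /* a zone counts as reclaimable as long as it has not been scanned
     * "too many" times relative to what could still be reclaimed */
    static bool zone_reclaimable(const struct zone_model *z)
    {
            return z->pages_scanned < z->reclaimable_pages * 6;
    }

    int main(void)
    {
            struct zone_model z = { .pages_scanned = 50000,
                                    .reclaimable_pages = 1000 };
            printf("reclaimable: %s\n", zone_reclaimable(&z) ? "yes" : "no");
            return 0;
    }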
     
  • The main idea behind this patchset is to reduce the vmstat update overhead
    by avoiding interrupt enable/disable and the use of per cpu atomics.

    This patch (of 3):

    It is better to have a separate folding function because
    refresh_cpu_vm_stats() also does other things like expire pages in the
    page allocator caches.

    If we have a separate function then refresh_cpu_vm_stats() is only called
    from the local cpu which allows additional optimizations.

    The folding function is only called when a cpu is being downed, so no
    other processor will be accessing the counters. This also simplifies
    synchronization.

    [akpm@linux-foundation.org: fix UP build]
    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    CC: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

30 Apr, 2013

1 commit

  • CONFIG_HOTPLUG is going away as an option, so clean up the
    CONFIG_HOTPLUG ifdefs in mm files.

    Signed-off-by: Yijing Wang
    Acked-by: Greg Kroah-Hartman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yijing Wang
     

24 Feb, 2013

1 commit

  • The current definition of count_vm_numa_events() is wrong for
    !CONFIG_NUMA_BALANCING, as the following would miss the side effect
    (demonstrated after this entry):

    count_vm_numa_events(NUMA_FOO, bar++);

    There are no such users of count_vm_numa_events(), but this patch fixes
    it as it is a potential pitfall. Ideally both would be converted to
    static inline functions, but NUMA_PTE_UPDATES is not defined if
    !CONFIG_NUMA_BALANCING and creating dummy constants just to have a
    static inline would be similarly clumsy.

    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
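
    A runnable demonstration of the pitfall, together with one way to fix
    the stub (evaluating the argument for its side effects and discarding
    the value), which matches the spirit of the patch:

    #include <stdio.h>

    /* broken !CONFIG_NUMA_BALANCING stub: the second argument is dropped
     * entirely, so side effects such as bar++ silently vanish */
    #define count_vm_numa_events_broken(x, y) do { } while (0)

    /* fixed stub: evaluate the argument, throw the value away */
    #define count_vm_numa_events_fixed(x, y)  do { (void)(y); } while (0)

    int main(void)
    {
            int bar = 0;

            count_vm_numa_events_broken(0, bar++);
            printf("broken stub: bar = %d\n", bar);  /* still 0 */

            count_vm_numa_events_fixed(0, bar++);
            printf("fixed stub:  bar = %d\n", bar);  /* 1 */
            return 0;
    }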
     

11 Dec, 2012

1 commit

  • It is tricky to quantify the basic cost of automatic NUMA placement in a
    meaningful manner. This patch adds some vmstats that can be used as part
    of a basic costing model.

    u = basic unit = sizeof(void *)
    Ca = cost of struct page access = sizeof(struct page) / u
    Cpte = cost of PTE access = Ca
    Cupdate = cost of PTE update = (2 * Cpte) + (2 * Wlock)
        where Cpte is incurred twice for a read and a write and Wlock
        is a constant representing the cost of taking or releasing a
        lock
    Cnumahint = cost of a minor page fault = some high constant, e.g. 1000
    Cpagerw = cost to read or write a full page = Ca + PAGE_SIZE/u
    Ci = cost of page isolation = Ca + Wi
        where Wi is a constant that should reflect the approximate cost
        of the locking operation
    Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
        where Wnuma is the approximate NUMA factor. 1 is local. 1.2
        would imply that remote accesses are 20% more expensive

    Balancing cost = Cpte * numa_pte_updates +
                     Cnumahint * numa_hint_faults +
                     Ci * numa_pages_migrated +
                     Cpagecopy * numa_pages_migrated

    Note that numa_pages_migrated is used as a measure of how many pages
    were isolated even though it would miss pages that failed to migrate. A
    vmstat counter could have been added for it but the isolation cost is
    pretty marginal in comparison to the overall cost so it seemed overkill.

    The ideal way to measure automatic placement benefit would be to count
    the number of remote accesses versus local accesses and do something like

    benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

    but the information is not readily available. As a workload converges,
    the expectation is that the number of remote numa hints reduces to 0.

    convergence = numa_hint_faults_local / numa_hint_faults
        where this is measured over the last N numa hints recorded.
        When the workload is fully converged the value is 1.

    This can measure whether the placement policy is converging and how fast
    it is doing so (a toy calculation follows this entry).

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Mel Gorman
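
    A toy calculation that plugs hypothetical counter readings into the cost
    model above; the unit costs and counter values are made up purely for
    illustration:

    #include <stdio.h>

    int main(void)
    {
            const double u     = sizeof(void *);
            const double Ca    = 64.0 / u;     /* assume sizeof(struct page) == 64 */
            const double Cpte  = Ca;
            const double Wi    = 1.0, Wnuma = 1.2;
            const double Cnumahint = 1000.0;
            const double Cpagerw   = Ca + 4096.0 / u;
            const double Ci        = Ca + Wi;
            const double Cpagecopy = Cpagerw + Cpagerw * Wnuma + Ci + Ci * Wnuma;

            /* pretend /proc/vmstat readings */
            double numa_pte_updates       = 1e6;
            double numa_hint_faults       = 2e5;
            double numa_hint_faults_local = 1.8e5;
            double numa_pages_migrated    = 5e4;

            double cost = Cpte * numa_pte_updates
                        + Cnumahint * numa_hint_faults
                        + Ci * numa_pages_migrated
                        + Cpagecopy * numa_pages_migrated;

            printf("balancing cost: %.0f units\n", cost);
            printf("convergence:    %.2f\n",
                   numa_hint_faults_local / numa_hint_faults);
            return 0;
    }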
     

09 Oct, 2012

2 commits

  • During memory hotplug, I found NR_ISOLATED_[ANON|FILE] increasing,
    causing the kernel to hang. When the system doesn't have enough free
    pages, it enters reclaim but never reclaims any pages, due to
    too_many_isolated()==true, and loops forever.

    The cause is that when we do memory-hotadd after memory-remove,
    __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
    although the vm_stat_diff of all CPUs still holds values.

    In addition, when we offline all pages of the zone, we reset them in
    zone_pcp_reset() without draining, so we lose some zone stat items.

    Reviewed-by: Wen Congyang
    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Yasuaki Ishimatsu
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Add an NR_FREE_CMA_PAGES counter, to be used later for checking the
    watermark in __zone_watermark_ok(). For simplicity and to avoid #ifdef
    hell, make this counter always available (not only when CONFIG_CMA=y).

    [akpm@linux-foundation.org: use conventional migratetype naming]
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     

01 Aug, 2012

1 commit

  • pg_data_t is zeroed before reaching free_area_init_core(), so remove the
    now unnecessary initializations.

    Signed-off-by: Minchan Kim
    Cc: Tejun Heo
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>.

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

27 May, 2011

1 commit

  • enums are problematic because they cannot be forward-declared:

    akpm2:/home/akpm> cat t.c

    enum foo;

    static inline void bar(enum foo f)
    {
    }
    akpm2:/home/akpm> gcc -c t.c
    t.c:4: error: parameter 1 ('f') has incomplete type

    So move the enum's definition into a standalone header file which can be used
    wherever its definition is needed.

    Cc: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

25 May, 2011

2 commits

  • Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug
    doesn't. There is no reason for this.

    [akpm@linux-foundation.org: fix CONFIG_SMP=n build]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • commit 2ac390370a ("writeback: add
    /sys/devices/system/node/<node>/vmstat") added a vmstat entry. But
    strangely it only shows nr_written and nr_dirtied.

    # cat /sys/devices/system/node/node20/vmstat
    nr_written 0
    nr_dirtied 0

    Of course, that's not adequate. With this patch, the per-node vmstat
    file shows all VM statistics, just like /proc/vmstat.

    # cat /sys/devices/system/node/node0/vmstat
    nr_free_pages 899224
    nr_inactive_anon 201
    nr_active_anon 17380
    nr_inactive_file 31572
    nr_active_file 28277
    nr_unevictable 0
    nr_mlock 0
    nr_anon_pages 17321
    nr_mapped 8640
    nr_file_pages 60107
    nr_dirty 33
    nr_writeback 0
    nr_slab_reclaimable 6850
    nr_slab_unreclaimable 7604
    nr_page_table_pages 3105
    nr_kernel_stack 175
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem 260
    nr_dirtied 1050
    nr_written 938
    numa_hit 962872
    numa_miss 0
    numa_foreign 0
    numa_interleave 8617
    numa_local 962872
    numa_other 0
    nr_anon_transparent_hugepages 0

    [akpm@linux-foundation.org: no externs in .c files]
    Signed-off-by: KOSAKI Motohiro
    Cc: Michael Rubin
    Cc: Wu Fengguang
    Acked-by: David Rientjes
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

15 Apr, 2011

1 commit

  • I found it difficult to make sense of transparent huge pages without
    having any counters for their actions. Add some counters to vmstat for
    allocation of transparent hugepages and fallback to smaller pages.

    Optional patch, but useful for development and understanding the system.

    Contains improvements from Andrea Arcangeli and Johannes Weiner.

    [akpm@linux-foundation.org: coding-style fixes]
    [hannes@cmpxchg.org: fix vmstat_text[] entries]
    Signed-off-by: Andi Kleen
    Acked-by: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

23 Mar, 2011

1 commit

  • Add a new __GFP_OTHER_NODE flag to tell the low-level numa statistics in
    zone_statistics() that an allocation is on behalf of another thread. This
    way the local and remote counters can still be correct, even when
    background daemons like khugepaged are changing memory mappings.

    This only affects the accounting, but I think it's worth doing that right
    to avoid confusing users.

    I first tried to just pass down the right node, but this required a lot of
    changes to pass down this parameter, including at least one addition of a
    10th argument to a 9-argument function. Using the flag is a lot less
    intrusive (the accounting decision is sketched after this entry).

    Open: should be also used for migration?

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andi Kleen
    Cc: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
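
    A toy model of the accounting decision, with a hypothetical flag value;
    the real zone_statistics() works on zones and per-zone counters:

    #include <stdio.h>

    #define __GFP_OTHER_NODE 0x1u   /* illustrative bit value only */

    /* pick the node that "local" is judged against: the node the
     * allocation is intended for when __GFP_OTHER_NODE is set, otherwise
     * the node the allocating thread happens to be running on */
    static const char *numa_account(int page_node, int running_node,
                                    int intended_node, unsigned int gfp_flags)
    {
            int home = (gfp_flags & __GFP_OTHER_NODE) ?
                        intended_node : running_node;

            return page_node == home ? "numa_local" : "numa_other";
    }

    int main(void)
    {
            /* khugepaged runs on node 0, collapsing pages for a task on node 1 */
            printf("without flag: %s\n", numa_account(1, 0, 1, 0));
            printf("with flag:    %s\n",
                   numa_account(1, 0, 1, __GFP_OTHER_NODE));
            return 0;
    }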
     

14 Jan, 2011

2 commits

  • reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
    to adjust the per-cpu vmstat thresholds while kswapd is awake, to avoid
    errors due to counter drift. The functions duplicate some code, so this
    patch replaces them with a single set_pgdat_percpu_threshold() that takes
    a callback function for calculating the desired threshold as a parameter
    (sketched after this entry).

    [akpm@linux-foundation.org: readability tweak]
    [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
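
    A sketch of the callback shape described above; the threshold formulas
    are placeholders, not the kernel's actual calculations:

    #include <stdio.h>

    typedef int (*calc_thresh_fn)(unsigned long managed_pages);

    static int calculate_normal_threshold(unsigned long pages)
    {
            return 2 + (int)(pages / 100000);       /* placeholder formula */
    }

    static int calculate_pressure_threshold(unsigned long pages)
    {
            return 1 + (int)(pages / 1000000);      /* placeholder formula */
    }

    /* one helper replaces reduce_/restore_pgdat_percpu_threshold(): the
     * policy is passed in instead of being duplicated in two functions */
    static void set_pgdat_percpu_threshold(unsigned long pages,
                                           calc_thresh_fn calc)
    {
            printf("per-cpu stat threshold -> %d\n", calc(pages));
    }

    int main(void)
    {
            unsigned long pages = 4UL << 20;  /* 16GB worth of 4KB pages */

            set_pgdat_percpu_threshold(pages, calculate_pressure_threshold); /* kswapd awake */
            set_pgdat_percpu_threshold(pages, calculate_normal_threshold);   /* kswapd asleep */
            return 0;
    }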
     
  • Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
    is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To
    avoid synchronization overhead, these counters are maintained on a per-cpu
    basis and drained both periodically and when the delta rises above a
    threshold. On large CPU systems, the difference between the estimated and
    real value of NR_FREE_PAGES can be very high. The system can get into a
    state where pages are allocated far below the min watermark, potentially
    causing livelock issues. The commit solved the problem by taking a better
    reading of NR_FREE_PAGES when memory was low.

    Unfortunately, as reported by Shaohua Li, this accurate reading can
    consume a large amount of CPU time on systems with many sockets due to
    cache line bouncing. This patch takes a different approach. For large
    machines where counter drift might be unsafe, and while kswapd is awake,
    the per-cpu thresholds for the target pgdat are reduced to limit the
    drift to what should be a safe level. This incurs a performance penalty
    under heavy memory pressure by a factor that depends on the workload and
    the machine, but the machine should function correctly without
    accidentally exhausting all memory on a node. There is an additional cost
    when kswapd wakes and sleeps, but the event is not expected to be
    frequent - in Shaohua's test case at least, there was one recorded sleep
    and wake event.

    To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
    introduced that takes a more accurate reading of NR_FREE_PAGES when called
    from wakeup_kswapd, when deciding whether it is really safe to go back to
    sleep in sleeping_prematurely() and when deciding if a zone is really
    balanced or not in balance_pgdat(). We are still using an expensive
    function but limiting how often it is called.

    When the test case is reproduced, the time spent in the watermark
    functions is reduced. The following report shows the percentage of time
    cumulatively spent in the functions zone_nr_free_pages(),
    zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
    zone_page_state_snapshot() and zone_page_state().

    vanilla 11.6615%
    disable-threshold 0.2584%

    David said:

    : We had to pull aa454840 "mm: page allocator: calculate a better estimate
    : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
    : internally because tests showed that it would cause the machine to stall
    : as the result of heavy kswapd activity. I merged it back with this fix as
    : it is pending in the -mm tree and it solves the issue we were seeing, so I
    : definitely think this should be pushed to -stable (and I would seriously
    : consider it for 2.6.37 inclusion even at this late date).

    Signed-off-by: Mel Gorman
    Reported-by: Shaohua Li
    Reviewed-by: Christoph Lameter
    Tested-by: Nicolas Bareil
    Cc: David Rientjes
    Cc: Kyle McMartin
    Cc: [2.6.37.1, 2.6.36.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

10 Sep, 2010

1 commit

  • …low and kswapd is awake

    Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
    cheaper than scanning a number of lists. To avoid synchronization
    overhead, counter deltas are maintained on a per-cpu basis and drained
    both periodically and when the delta is above a threshold. On large CPU
    systems, the difference between the estimated and real value of
    NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the
    number of real free pages in the buddy allocator, the VM can allocate
    pages below the min watermark, at worst reducing the real number of free
    pages to zero. Even if the OOM killer kills some victim to free memory,
    that may not free anything if the victim's exit path requires a new page,
    resulting in livelock.

    This patch introduces a zone_page_state_snapshot() function (courtesy of
    Christoph) that takes a slightly more accurate view of an arbitrary vmstat
    counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
    the watermark being accidentally broken. The estimate is not perfect and
    may result in cache line bounces but is expected to be lighter than the
    IPI calls necessary to continually drain the per-cpu counters while kswapd
    is awake (a model of the snapshot read follows this entry).

    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
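
    A userspace model of the cheap read versus the snapshot read; real
    per-cpu deltas live in the zone's pagesets, this only shows the shape of
    the calculation:

    #include <stdio.h>

    #define NR_CPUS 4

    static long global_nr_free = 1000;                    /* folded-in value */
    static int  vm_stat_diff[NR_CPUS] = { 7, -3, 5, -2 }; /* unflushed deltas */

    /* cheap read: may be off by up to NR_CPUS * threshold */
    static long zone_page_state(void)
    {
            return global_nr_free;
    }

    /* snapshot read: also folds in the per-cpu deltas, clamped at zero */
    static long zone_page_state_snapshot(void)
    {
            long x = global_nr_free;

            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                    x += vm_stat_diff[cpu];
            return x < 0 ? 0 : x;
    }

    int main(void)
    {
            printf("estimate: %ld, snapshot: %ld\n",
                   zone_page_state(), zone_page_state_snapshot());
            return 0;
    }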
     

25 May, 2010

2 commits

  • Ordinarily when a high-order allocation fails, direct reclaim is entered
    to free pages to satisfy the allocation. With this patch, it is
    determined if an allocation failed due to external fragmentation instead
    of low memory and if so, the calling process will compact until a suitable
    page is freed. Compaction by moving pages in memory is considerably
    cheaper than paging out to disk and works where there are locked pages or
    no swap. If compaction fails to free a page of a suitable size, then
    reclaim will still occur.

    Direct compaction returns as soon as possible. As each block is
    compacted, it is checked if a suitable page has been freed and if so, it
    returns.

    [akpm@linux-foundation.org: Fix build errors]
    [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone, searches
    for suitable areas and consumes the free pages within them, making them
    available to the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages (a toy model of the two
    scanners follows this entry).

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
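
    A toy model of the two scanners; real compaction works on pageblocks and
    struct pages and actually migrates data, this only shows how the scanners
    converge from opposite ends of the zone:

    #include <stdio.h>

    #define NPAGES 16

    enum page_state { FREE, MOVABLE, PINNED };

    int main(void)
    {
            enum page_state zone[NPAGES] = {
                    MOVABLE, FREE, PINNED, MOVABLE, FREE, MOVABLE, FREE, PINNED,
                    FREE, MOVABLE, FREE, FREE, MOVABLE, FREE, FREE, FREE,
            };
            int migrate_scan = 0, free_scan = NPAGES - 1;

            while (migrate_scan < free_scan) {
                    while (migrate_scan < free_scan && zone[migrate_scan] != MOVABLE)
                            migrate_scan++;    /* next movable page from the bottom */
                    while (migrate_scan < free_scan && zone[free_scan] != FREE)
                            free_scan--;       /* next free page from the top */
                    if (migrate_scan < free_scan) {
                            zone[free_scan--] = MOVABLE;  /* "migrate" the page upwards */
                            zone[migrate_scan++] = FREE;
                    }
            }

            for (int i = 0; i < NPAGES; i++)
                    putchar(zone[i] == FREE ? '.' : zone[i] == MOVABLE ? 'M' : 'P');
            putchar('\n');
            return 0;
    }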
     

05 Jan, 2010

1 commit


16 Dec, 2009

2 commits

  • If reclaim fails to make sufficient progress, the priority is raised.
    Once the priority is higher, kswapd starts waiting on congestion.
    However, if the zone is below the min watermark then kswapd needs to
    continue working without delay as there is a danger of an increased rate
    of GFP_ATOMIC allocation failure.

    This patch changes the conditions under which kswapd waits on congestion
    by only going to sleep if the min watermarks are being met.

    [mel@csn.ul.ie: add stats to track how relevant the logic is]
    [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • After kswapd balances all zones in a pgdat, it goes to sleep. In the
    event of no IO congestion, kswapd can go to sleep very shortly after the
    high watermark was reached. If there is a constant stream of allocations
    from parallel processes, it can mean that kswapd went to sleep too quickly
    and the high watermark is not maintained for a sufficient length of time.

    This patch makes kswapd go to sleep as a two-stage process. It first
    tries to sleep for HZ/10. If it is woken up by another process or the
    high watermark is no longer met, it's considered a premature sleep and
    kswapd continues work. Otherwise it goes fully to sleep.

    This adds more counters to distinguish between fast and slow breaches of
    watermarks. A "fast" premature sleep is one where the low watermark was
    hit in a very short time after kswapd went to sleep. A "slow" premature
    sleep indicates that the high watermark was breached after a very short
    interval.

    Signed-off-by: Mel Gorman
    Cc: Frans Pop
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 Oct, 2009

1 commit

  • Now that the return from alloc_percpu is compatible with the address
    of per-cpu vars, it makes sense to hand around the address of per-cpu
    variables. To make this sane, we remove the per_cpu__ prefix we
    created to stop people from accidentally using these vars directly.

    Now that we have sparse, we can use that (next patch).

    tj: * Updated to convert stuff which were missed by or added after the
    original patch.

    * Kill per_cpu_var() macro.

    Signed-off-by: Rusty Russell
    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter

    Rusty Russell
     

03 Oct, 2009

1 commit


22 Sep, 2009

2 commits

  • global_lru_pages() / zone_lru_pages() can be used in two ways:
    - to estimate max reclaimable pages in determine_dirtyable_memory()
    - to calculate the slab scan ratio

    When swap is full or not present, the anon lru lists are not reclaimable
    and also won't be scanned. So the anon pages should not be counted in
    either usage scenario. Also rename to _reclaimable_pages: the functions
    now count the possibly reclaimable lru pages (sketched after this entry).

    It can greatly (and correctly) increase the slab scan rate under high
    memory pressure (when most file pages have been reclaimed and swap is
    full/absent), thus reduce false OOM kills.

    Acked-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Wu Fengguang
    Acked-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Reviewed-by: Jesse Barnes
    Cc: David Howells
    Cc: "Li, Ming Chun"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
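
    A sketch of the renamed calculation: anon pages only count as
    reclaimable while there is swap space left to push them to. The field
    names here are illustrative:

    #include <stdio.h>

    struct vmstat_model {
            unsigned long active_file, inactive_file;
            unsigned long active_anon, inactive_anon;
            unsigned long nr_free_swap_slots;
    };

    static unsigned long global_reclaimable_pages(const struct vmstat_model *v)
    {
            unsigned long nr = v->active_file + v->inactive_file;

            if (v->nr_free_swap_slots > 0)
                    nr += v->active_anon + v->inactive_anon;
            return nr;
    }

    int main(void)
    {
            struct vmstat_model no_swap = { 10000, 20000, 50000, 5000, 0 };

            /* only the 30000 file pages count when swap is exhausted */
            printf("reclaimable: %lu\n", global_reclaimable_pages(&no_swap));
            return 0;
    }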
     
  • __add_zone_page_state() and __sub_zone_page_state() are unused.

    Signed-off-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

17 Jun, 2009

2 commits

  • On NUMA machines, the administrator can configure zone_reclaim_mode, a
    more targeted form of direct reclaim. On machines with large NUMA
    distances, for example, zone_reclaim_mode defaults to 1, meaning that
    clean unmapped pages will be reclaimed if the zone watermarks are not
    being met.

    There is a heuristic that determines if the scan is worthwhile but it is
    possible that the heuristic will fail and the CPU gets tied up scanning
    uselessly. Detecting the situation requires some guesswork and
    experimentation so this patch adds a counter "zreclaim_failed" to
    /proc/vmstat. If during high CPU utilisation this counter is increasing
    rapidly, then the resolution to the problem may be to set
    /proc/sys/vm/zone_reclaim_mode to 0.

    [akpm@linux-foundation.org: name things consistently]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
    configurability is unnecessary.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Acked-by: Minchan Kim
    Cc: David Woodhouse
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

23 Oct, 2008

3 commits