06 Oct, 2018

2 commits

  • 5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even
    on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside
    the kernel unconditional to reduce #ifdef soup, but (either to avoid
    showing dummy zero counters to userspace, or because that code was missed)
    didn't update the vmstat_array, meaning that all following counters would
    be shown with incorrect values.

    This only affects kernel builds with
    CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n.

    Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com
    Fixes: 5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP")
    Signed-off-by: Jann Horn
    Reviewed-by: Kees Cook
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Kemi Wang
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • 7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely") removed the
    VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding
    entry in vmstat_text. This causes an out-of-bounds access in
    vmstat_show().

    Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which
    is probably very rare. (A minimal illustration of this counter-enum /
    vmstat_text mismatch follows the entries for this date.)

    Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com
    Fixes: 7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely")
    Signed-off-by: Jann Horn
    Reviewed-by: Kees Cook
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Kemi Wang
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
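
    Both entries above are instances of the same failure mode: the counter
    enum and the vmstat_text[] name table must stay in sync, index for
    index. The following minimal userspace sketch (made-up names, not the
    kernel code) shows what happens when an item is dropped from the enum
    but its string is left in the table -- every later name is paired with
    the wrong value, and the last lookup would run past the end of the
    value array, which is exactly the out-of-bounds read in vmstat_show():

    #include <stdio.h>

    /* Hypothetical counters; imagine COUNTER_C was just removed. */
    enum counter { COUNTER_A, COUNTER_B, COUNTER_D, NR_COUNTERS };

    /* The name table still lists the removed item, so it has one entry
     * more than NR_COUNTERS. */
    static const char *const counter_names[] = {
            "counter_a", "counter_b", "counter_c_removed", "counter_d",
    };

    int main(void)
    {
            unsigned long values[NR_COUNTERS] = { 10, 20, 40 };
            int nr_names = sizeof(counter_names) / sizeof(counter_names[0]);

            /* Mimics vmstat_show(): walk the name table and index the value
             * array with the same index.  "counter_c_removed" prints the
             * value that belongs to counter_d, and "counter_d" would index
             * values[3], one past the end (guarded here, unguarded in the
             * real bug). */
            for (int i = 0; i < nr_names; i++)
                    printf("%s %lu\n", counter_names[i],
                           i < NR_COUNTERS ? values[i] : 0UL);
            return 0;
    }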
     

29 Jun, 2018

1 commit

  • Revert commit c7f26ccfb2c3 ("mm/vmstat.c: fix vmstat_update() preemption
    BUG"). Steven saw a "using smp_processor_id() in preemptible" message
    and added a preempt_disable() section around it to keep it quiet. This
    is not the right thing to do and it does not fix the real problem.

    vmstat_update() is invoked by a kworker on a specific CPU. This worker
    is bound to that CPU. The name of the worker was "kworker/1:1", so it
    should have been a worker bound to CPU1. A worker which can run on any
    CPU would have a `u' before the first digit.

    smp_processor_id() can be used in a preempt-enabled region as long as
    the task is bound to a single CPU, which is the case here (a sketch of
    that pattern follows this entry). If it could run on an arbitrary CPU,
    then *that* is the problem we have and should seek to resolve.

    It is not only this smp_processor_id() call that must not be migrated
    to another CPU: refresh_cpu_vm_stats() might also access the wrong
    per-CPU variables. Not to mention that other code relies on the fact
    that such a worker runs on one specific CPU only.

    Therefore revert that commit; instead, we should look into what broke
    the affinity mask of the kworker.

    Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Steven J. Hill
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
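
    A minimal kernel-context sketch of the pattern this entry relies on
    (names here are made up, not the real vmstat functions): work queued
    with queue_delayed_work_on() runs on a worker bound to one specific
    CPU, so the callback may call smp_processor_id() without disabling
    preemption and without triggering the DEBUG_PREEMPT check:

    #include <linux/cpumask.h>
    #include <linux/jiffies.h>
    #include <linux/percpu.h>
    #include <linux/printk.h>
    #include <linux/smp.h>
    #include <linux/workqueue.h>

    static DEFINE_PER_CPU(struct delayed_work, my_refresh_work);

    static void my_refresh(struct work_struct *w)
    {
            /* No preempt_disable() needed: this item was queued with
             * queue_delayed_work_on(), so it executes on a worker bound to
             * a single CPU ("kworker/N:M", no `u' prefix) and cannot be
             * migrated while it touches that CPU's data. */
            pr_info("refreshing stats on CPU %d\n", smp_processor_id());
    }

    static void my_start_refresh(void)
    {
            int cpu;

            for_each_online_cpu(cpu) {
                    struct delayed_work *dw = &per_cpu(my_refresh_work, cpu);

                    INIT_DEFERRABLE_WORK(dw, my_refresh);
                    queue_delayed_work_on(cpu, system_wq, dw, HZ);
            }
    }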
     

12 May, 2018

1 commit

  • Don't show nr_indirectly_reclaimable in /proc/vmstat, because there is
    no need to export this vm counter to userspace, and some changes are
    expected in reclaimable object accounting, which can alter this counter.

    Link: http://lkml.kernel.org/r/20180425191422.9159-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

12 Apr, 2018

1 commit

  • Patch series "indirectly reclaimable memory", v2.

    This patchset introduces the concept of indirectly reclaimable memory
    and applies it to fix an issue: a large number of dentries with
    external names can significantly distort the MemAvailable value.

    This patch (of 3):

    Introduce the concept of indirectly reclaimable memory and add the
    corresponding memory counter and /proc/vmstat item.

    Indirectly reclaimable memory is any sort of memory used by the kernel
    (except for reclaimable slabs) which is actually reclaimable, i.e. will
    be released under memory pressure.

    The counter is in bytes, as it's not always possible to count such
    objects in pages. The name contains BYTES by analogy to
    NR_KERNEL_STACK_KB.

    Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

29 Mar, 2018

1 commit

  • Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can
    cause vmstat_update() to report a BUG due to preemption not being
    disabled around smp_processor_id(). The fix wraps that region in a
    preempt_disable()/preempt_enable() pair to quiet the check; a sketch of
    the pattern follows this entry.

    Discovered on Ubiquiti EdgeRouter Pro with Cavium Octeon II processor.

    BUG: using smp_processor_id() in preemptible [00000000] code:
    kworker/1:1/269
    caller is vmstat_update+0x50/0xa0
    CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted
    4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1
    Workqueue: mm_percpu_wq vmstat_update
    Call Trace:
    show_stack+0x94/0x128
    dump_stack+0xa4/0xe0
    check_preemption_disabled+0x118/0x120
    vmstat_update+0x50/0xa0
    process_one_work+0x144/0x348
    worker_thread+0x150/0x4b8
    kthread+0x110/0x140
    ret_from_kernel_thread+0x14/0x1c

    Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com
    Signed-off-by: Steven J. Hill
    Reviewed-by: Andrew Morton
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven J. Hill
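
    The (later reverted, see the 29 Jun 2018 entry above) fix follows the
    usual pattern for silencing this check; a condensed, hypothetical
    version of that pattern looks like this:

    #include <linux/preempt.h>
    #include <linux/smp.h>

    static void my_vmstat_update(void)
    {
            int cpu;

            preempt_disable();              /* pins the task to this CPU */
            cpu = smp_processor_id();       /* no DEBUG_PREEMPT splat now */
            /* ... refresh this CPU's counters, reschedule the work ... */
            (void)cpu;
            preempt_enable();
    }

    As the revert above explains, this only hides the symptom: the worker
    was supposed to be CPU-bound already, and the real question is why its
    affinity mask was broken.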
     

16 Nov, 2017

2 commits

  • This is the second step; it introduces a tunable interface that allows
    the NUMA stats to be configured, for optimizing zone_statistics(), as
    suggested by Dave Hansen and Ying Huang.

    =========================================================================

    When page allocation performance becomes a bottleneck and you can
    tolerate some possible tool breakage and decreased numa counter
    precision, you can do:

    echo 0 > /proc/sys/vm/numa_stat

    In this case, NUMA counter updates are ignored. We can see about a
    *4.8%* (185->176) drop of CPU cycles per single page allocation and
    reclaim on Jesper's page_bench01 (single thread) and an *8.1%*
    (343->315) drop of CPU cycles per single page allocation and reclaim on
    Jesper's page_bench03 (88 threads), running on a 2-socket
    Broadwell-based server (88 threads, 126G memory).

    Benchmark link provided by Jesper D Brouer (increase loop times to
    10000000):

    https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

    =========================================================================

    When page allocation performance is not a bottleneck and you want all
    tooling to work, you can do:

    echo 1 > /proc/sys/vm/numa_stat

    This is the system default setting. (A small userspace example of
    toggling this knob follows the entries for this date.)

    Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
    for comments to help improve the original patch.

    [keescook@chromium.org: make sure mutex is a global static]
    Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
    Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Signed-off-by: Kees Cook
    Reported-by: Jesper Dangaard Brouer
    Suggested-by: Dave Hansen
    Suggested-by: Ying Huang
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Luis R . Rodriguez"
    Cc: Kees Cook
    Cc: Jonathan Corbet
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Christopher Lameter
    Cc: Sebastian Andrzej Siewior
    Cc: Andrey Ryabinin
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Aaron Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") 'pgdat->inactive_ratio' is not used, except for printing
    "node_inactive_ratio: 0" in /proc/zoneinfo output.

    Remove it.

    Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
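
    A small userspace sketch of driving the tunable described in the first
    entry above (the file name is from that entry; set_numa_stat() is a
    made-up helper and error handling is minimal):

    #include <stdio.h>

    static int set_numa_stat(int enable)
    {
            /* Needs root and a kernel that has this sysctl. */
            FILE *f = fopen("/proc/sys/vm/numa_stat", "w");

            if (!f)
                    return -1;
            /* 0: skip NUMA counter updates on the allocation fast path
             * (cheaper allocations, less precise numa_* counters);
             * 1: default, full accounting. */
            fprintf(f, "%d\n", enable ? 1 : 0);
            return fclose(f);
    }

    int main(void)
    {
            return set_numa_stat(1) == 0 ? 0 : 1;
    }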
     

09 Sep, 2017

3 commits

  • To avoid deviation, the per-CPU NUMA stat diffs in vm_numa_stat_diff[]
    are included when a user *reads* the NUMA stats.

    Since NUMA stats are not read frequently by users, and the kernel does
    not need them to make decisions, it is not a problem to make the
    readers more expensive. (A toy model of this per-CPU diff/threshold
    scheme follows the entries for this date.)

    Link: http://lkml.kernel.org/r/1503568801-21305-4-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Reported-by: Jesper Dangaard Brouer
    Acked-by: Mel Gorman
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Christopher Lameter
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tim Chen
    Cc: Ying Huang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
     
  • There is significant overhead from cache bouncing caused by zone counter
    (NUMA-associated counter) updates running in parallel in multi-threaded
    page allocation (suggested by Dave Hansen).

    This patch updates the NUMA counter threshold to a fixed size of
    MAX_U16 - 2, as a small threshold greatly increases the update frequency
    of the global counter from the local per-CPU counters (suggested by
    Ying Huang).

    The rationale is that these statistics counters don't affect the
    kernel's decisions, unlike other VM counters, so it's not a problem to
    use a large threshold.

    With this patchset, we see a 31.3% drop of CPU cycles (537->369) per
    single page allocation and reclaim on Jesper's page_bench03 benchmark.

    Benchmark provided by Jesper D Brouer (increase loop times to 10000000):
    https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

    Threshold  CPU cycles  Throughput (88 threads)
        32        799      241760478
        64        640      301628829
       125        537      358906028            system default (base)
       256        468      412397590
       512        428      450550704
      4096        399      482520943
     20000        394      489009617
     30000        395      488017817
     65533        369 (-31.3%)   521661345 (+45.3%)   with this patchset
       N/A        342 (-36.3%)   562900157 (+56.8%)   disable zone_statistics

    Link: http://lkml.kernel.org/r/1503568801-21305-3-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Reported-by: Jesper Dangaard Brouer
    Suggested-by: Dave Hansen
    Suggested-by: Ying Huang
    Acked-by: Mel Gorman
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Christopher Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
     
  • Patch series "Separate NUMA statistics from zone statistics", v2.

    Each page allocation updates a set of per-zone statistics with a call to
    zone_statistics(). As discussed at the 2017 MM summit, these are a
    substantial source of overhead in the page allocator and are very rarely
    consumed. The overhead comes from cache bouncing caused by zone counter
    (NUMA-associated counter) updates running in parallel in multi-threaded
    page allocation (pointed out by Dave Hansen).

    A link to the MM summit slides:
    http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

    To mitigate this overhead, this patchset separates the NUMA statistics
    from the zone statistics framework and updates the NUMA counter
    threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly
    increases the update frequency of the global counter from the local
    per-CPU counters (suggested by Ying Huang). The rationale is that these
    statistics counters don't need to be read often, unlike other VM
    counters, so it's not a problem to use a large threshold and make
    readers more expensive.

    With this patchset, we see a 31.3% drop of CPU cycles (537->369, see
    below) per single page allocation and reclaim on Jesper's page_bench03
    benchmark. Meanwhile, this patchset keeps the same style of virtual
    memory statistics with little end-user-visible effect (the NUMA stats
    are only moved to appear after the zone page stats; see the first patch
    for details).

    I did an experiment of single page allocation and reclaim running
    concurrently, using Jesper's page_bench03 benchmark on a 2-socket
    Broadwell-based server (88 processors with 126G memory), with different
    sizes of the pcp counter threshold.

    Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
    https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

    Threshold  CPU cycles  Throughput (88 threads)
        32        799      241760478
        64        640      301628829
       125        537      358906028            system default
       256        468      412397590
       512        428      450550704
      4096        399      482520943
     20000        394      489009617
     30000        395      488017817
     65533        369 (-31.3%)   521661345 (+45.3%)   with this patchset
       N/A        342 (-36.3%)   562900157 (+56.8%)   disable zone_statistics

    This patch (of 3):

    In this patch, the NUMA statistics are separated from the zone
    statistics framework, and all the call sites of NUMA stats are changed
    to use NUMA-stats-specific functions. There is no functional change
    except that the NUMA stats are shown after the zone page stats when
    users *read* the zone info.

    E.g. cat /proc/zoneinfo

        ***Base***                   ***With this patch***
        nr_free_pages 3976           nr_free_pages 3976
        nr_zone_inactive_anon 0      nr_zone_inactive_anon 0
        nr_zone_active_anon 0        nr_zone_active_anon 0
        nr_zone_inactive_file 0      nr_zone_inactive_file 0
        nr_zone_active_file 0        nr_zone_active_file 0
        nr_zone_unevictable 0        nr_zone_unevictable 0
        nr_zone_write_pending 0      nr_zone_write_pending 0
        nr_mlock 0                   nr_mlock 0
        nr_page_table_pages 0        nr_page_table_pages 0
        nr_kernel_stack 0            nr_kernel_stack 0
        nr_bounce 0                  nr_bounce 0
        nr_zspages 0                 nr_zspages 0
        numa_hit 0                   *nr_free_cma 0*
        numa_miss 0                  numa_hit 0
        numa_foreign 0               numa_miss 0
        numa_interleave 0            numa_foreign 0
        numa_local 0                 numa_interleave 0
        numa_other 0                 numa_local 0
        *nr_free_cma 0*              numa_other 0
        ...                          ...
        vm stats threshold: 10       vm stats threshold: 10
        ...                          ...

    The next patch updates the numa stats counter size and threshold.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Reported-by: Jesper Dangaard Brouer
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Christopher Lameter
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ying Huang
    Cc: Aaron Lu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
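
    The three entries above all revolve around the same per-CPU counter
    scheme; the toy userspace model below (made-up names, not kernel code)
    shows the idea: each "CPU" accumulates a small local diff and only
    folds it into the global counter once the diff crosses a threshold,
    while a reader that wants an exact value adds the unfolded diffs
    itself -- which is why a huge threshold is fine for counters that are
    rarely read:

    #include <stdio.h>

    #define NR_CPUS   4
    #define THRESHOLD 125           /* the old zone-stat style threshold */

    static long global_count;
    static int cpu_diff[NR_CPUS];

    static void inc_counter(int cpu)
    {
            /* Fast path: touch only this CPU's diff.  The global (shared,
             * cache-bouncing) counter is updated rarely. */
            if (++cpu_diff[cpu] > THRESHOLD) {
                    global_count += cpu_diff[cpu];
                    cpu_diff[cpu] = 0;
            }
    }

    static long read_counter(void)
    {
            long v = global_count;

            /* Slow path: the (rare) reader pays for precision by folding
             * in the per-CPU diffs, as the first entry above describes. */
            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                    v += cpu_diff[cpu];
            return v;
    }

    int main(void)
    {
            for (int i = 0; i < 1000; i++)
                    inc_counter(i % NR_CPUS);
            printf("global only: %ld, folded: %ld\n",
                   global_count, read_counter());
            return 0;
    }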
     

07 Sep, 2017

6 commits

  • Patch series "mm, swap: VMA based swap readahead", v4.

    The swap readahead is an important mechanism to reduce the swap in
    latency. Although pure sequential memory access pattern isn't very
    popular for anonymous memory, the space locality is still considered
    valid.

    In the original swap readahead implementation, the consecutive blocks in
    the swap device are read ahead based on the global space locality
    estimation. But the consecutive blocks in the swap device merely reflect
    the order of page reclaiming and don't necessarily reflect the access
    pattern in virtual memory space. And the different tasks in the system
    may have different access patterns, which makes the global space
    locality estimation incorrect.

    In this patchset, when a page fault occurs, the virtual pages near the
    fault address are read ahead instead of the swap slots near the faulting
    swap slot in the swap device. This avoids reading ahead unrelated swap
    slots. At the same time, the swap readahead is changed to work per-VMA
    instead of globally, so that the different access patterns of the
    different VMAs can be distinguished and a different readahead policy
    applied accordingly. The original core readahead detection and scaling
    algorithm is reused, because it is an effective algorithm for detecting
    space locality.

    In addition to the swap readahead changes, some new sysfs interface is
    added to show the efficiency of the readahead algorithm and some other
    swap statistics.

    This new implementation will incur more small random reads. On SSD, the
    improved correctness of estimation and readahead target should outweigh
    the potential increased overhead; this is also illustrated in the test
    results below. But on HDD, the overhead may outweigh the benefit, so the
    original implementation will be used there by default.

    The test and result is as follow,

    Common test condition
    =====================

    Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
    Swap device: NVMe disk

    Micro-benchmark with combined access pattern
    ============================================

    vm-scalability, sequential swap test case: 4 processes eat 50G of
    virtual memory space and repeat the sequential memory writing until 300
    seconds have passed. The first round of writing triggers swap-out; the
    following rounds trigger sequential swap-in and swap-out.

    At the same time, the vm-scalability random swap test case runs in the
    background: 8 processes eat 30G of virtual memory space and repeat the
    random memory writes until 300 seconds have passed. This triggers
    random swap-in in the background.

    This is a combined workload with sequential and random memory accessing
    at the same time. The result (for sequential workload) is as follow,

                      Base          Optimized
                      ----          ---------
    throughput        345413 KB/s   414029 KB/s (+19.9%)
    latency.average   97.14 us      61.06 us (-37.1%)
    latency.50th      2 us          1 us
    latency.60th      2 us          1 us
    latency.70th      98 us         2 us
    latency.80th      160 us        2 us
    latency.90th      260 us        217 us
    latency.95th      346 us        369 us
    latency.99th      1.34 ms       1.09 ms
    ra_hit%           52.69%        99.98%

    The original swap readahead algorithm is confused by the background
    random access workload, so its readahead hit rate is lower. The
    VMA-based readahead algorithm works much better.

    Linpack
    =======

    The test memory size is bigger than RAM to trigger swapping.

                      Base          Optimized
                      ----          ---------
    elapsed_time      393.49 s      329.88 s (-16.2%)
    ra_hit%           86.21%        98.82%

    The scores of the base and optimized kernels show no visible change.
    But the elapsed time is reduced and the readahead hit rate improved, so
    the optimized kernel runs better in the startup and tear-down stages.
    And the absolute value of the readahead hit rate is high, which shows
    that the space locality is still valid in some practical workloads.

    This patch (of 5):

    The statistics for total readahead pages and total readahead hits are
    recorded and exported via the following sysfs interface.

    /sys/kernel/mm/swap/ra_hits
    /sys/kernel/mm/swap/ra_total

    With them, the efficiency of the swap readahead could be measured, so
    that the swap readahead algorithm and parameters could be tuned
    accordingly.

    [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
    Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Comment for pagetypeinfo_showblockcount() is mistakenly duplicated from
    pagetypeinfo_show_free()'s comment. This commit fixes it.

    Link: http://lkml.kernel.org/r/20170809185816.11244-1-sj38.park@gmail.com
    Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
    Signed-off-by: SeongJae Park
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     
  • When order is -1 or too big, *1UL << order* will be 0, which will cause
    a divide error. Although it seems that all callers of
    __fragmentation_index() only pass a valid order, this patch makes the
    function more robust. (A minimal illustration of such a guard follows
    the entries for this date.)

    Should prevent reoccurrences of
    https://bugzilla.kernel.org/show_bug.cgi?id=196555

    Link: http://lkml.kernel.org/r/1501751520-2598-1-git-send-email-wen.yang99@zte.com.cn
    Signed-off-by: Wen Yang
    Reviewed-by: Jiang Biao
    Suggested-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
     
  • global_page_state is error prone, as a recent bug report pointed out [1].
    It only returns proper values for zone-based counters, as the enum it
    gets suggests. We already have global_node_page_state, so let's rename
    global_page_state to global_zone_page_state to be more explicit here.
    All existing users seem to be correct:

    $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
    2 NR_BOUNCE
    2 NR_FREE_CMA_PAGES
    11 NR_FREE_PAGES
    1 NR_KERNEL_STACK_KB
    1 NR_MLOCK
    2 NR_PAGETABLE

    This patch shouldn't introduce any functional change.

    [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

    Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When swapping out a THP (Transparent Huge Page), instead of swapping out
    the THP as a whole, sometimes we have to fall back to splitting the THP
    into normal pages before swapping, because no free swap clusters are
    available, the cgroup limit is exceeded, etc. To count these fallbacks,
    a new VM event THP_SWPOUT_FALLBACK is added, and it is counted when we
    fall back to splitting the THP.

    Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • To support delaying the splitting of a THP (Transparent Huge Page) until
    after it has been swapped out, we need to enhance the swap writing code
    to support writing a THP as a whole. This will improve swap write IO
    performance.

    As Ming Lei pointed out, this should be based on multipage bvec support,
    which hasn't been merged yet. So this patch is only for testing the
    functionality of the other patches in the series, and it will be
    reimplemented after multipage bvec support is merged.

    Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Shaohua Li
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
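
    For the __fragmentation_index entry above, a minimal userspace
    illustration of the kind of guard being described (blocks_needed() and
    the MAX_ORDER value here are assumptions for the example, not the
    upstream diff): validate 'order' before using 1UL << order as a
    divisor, because a stray order such as -1 or 64 makes the divisor zero
    or undefined and the division faults:

    #include <stdio.h>

    #define MAX_ORDER 11            /* a typical kernel value */

    static long blocks_needed(unsigned long nr_pages, int order)
    {
            if (order < 0 || order >= MAX_ORDER)
                    return -1;      /* reject instead of dividing by 0 */
            return (long)(nr_pages / (1UL << order));
    }

    int main(void)
    {
            printf("%ld\n", blocks_needed(1024, 3));        /* 128 */
            printf("%ld\n", blocks_needed(1024, -1));       /* -1, guarded */
            return 0;
    }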
     

11 Jul, 2017

1 commit

  • pagetypeinfo_showmixedcount_print is found to take a lot of time to
    complete, and it does this while holding the zone lock and with
    interrupts disabled. In some cases it is found to take more than a
    second (on a 2.4 GHz, 8 GB RAM, arm64 CPU).

    Avoid taking the zone lock, similar to what is done by read_page_owner,
    which means the results may be inaccurate.

    Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org
    Signed-off-by: Vinayak Menon
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: zhongjiang
    Cc: Sergey Senozhatsky
    Cc: Sudip Mukherjee
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Sebastian Andrzej Siewior
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon
     

07 Jul, 2017

4 commits

  • Patch series "mm: per-lruvec slab stats"

    Josef is working on a new approach to balancing slab caches and the page
    cache. For this to work, he needs slab cache statistics on the lruvec
    level. These patches implement that by adding infrastructure that
    allows updating and reading generic VM stat items per lruvec, then
    switches some existing VM accounting sites, including the slab
    accounting ones, to this new cgroup-aware API.

    I'll follow up with more patches on this, because there is actually
    substantial simplification that can be done to the memory controller
    when we replace private memcg accounting with making the existing VM
    accounting sites cgroup-aware. But this is enough for Josef to base his
    slab reclaim work on, so here goes.

    This patch (of 5):

    To re-implement slab cache vs. page cache balancing, we'll need the
    slab counters at the lruvec level, which, ever since lru reclaim was
    moved from the zone to the node, is the intersection of the node, not
    the zone, and the memcg.

    We could retain the per-zone counters for when the page allocator dumps
    its memory information on failures, and have counters on both levels -
    which on all but NUMA node 0 is usually redundant. But let's keep it
    simple for now and just move them. If anybody complains we can restore
    the per-zone counters.

    [hannes@cmpxchg.org: fix oops]
    Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
    Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Show the count of oom killer invocations in /proc/vmstat and the count
    of processes killed in a memory cgroup in the knob "memory.events" (in
    memory.oom_control for v1 cgroup).

    Also describe the difference between "oom" and "oom_kill" in the memory
    cgroup documentation. Currently oom in a memory cgroup kills tasks iff
    the shortage has happened inside a page fault.

    These counters help in monitoring oom kills - until now the only way
    was grepping for magic words in the kernel log.

    [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename]
    [akpm@linux-foundation.org: fix comment, per Konstantin]
    Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Roman Guschin
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • pagetypeinfo_showblockcount_print skips over invalid pfns, but it would
    report pages which are offline because those have a valid pfn. Their
    migrate type is misleading at best.

    Now that we have pfn_to_online_page() we can use it instead of
    pfn_valid() and fix this. (A kernel-context sketch of that substitution
    follows the entries for this date.)

    [mhocko@suse.com: fix build]
    Link: http://lkml.kernel.org/r/20170519072225.GA13041@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170515085827.16474-11-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Vlastimil Babka
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Standardize the file operation variable names related to all four memory
    management /proc interface files. Also change all the symbol
    permissions (S_IRUGO) into octal permissions (0444) as it got complaints
    from checkpatch.pl. This does not create any functional change to the
    interface.

    Link: http://lkml.kernel.org/r/20170427030632.8588-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
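
    A kernel-context sketch of the substitution described in the
    pfn_to_online_page() entry above (the surrounding walk and its names
    are hypothetical; only the pfn_valid() -> pfn_to_online_page() swap is
    the point):

    #include <linux/memory_hotplug.h>
    #include <linux/mm.h>

    static void my_walk_blocks(unsigned long start_pfn, unsigned long end_pfn)
    {
            unsigned long pfn;

            for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
                    /* Previously: if (!pfn_valid(pfn)) continue; followed by
                     * pfn_to_page(pfn).  pfn_valid() is also true for memory
                     * that is offline, so such blocks were reported even
                     * though their migratetype is stale at best. */
                    struct page *page = pfn_to_online_page(pfn);

                    if (!page)
                            continue;       /* hole or offline section */

                    /* ... account page's pageblock migratetype here ... */
            }
    }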
     

13 May, 2017

1 commit

  • After commit e2ecc8a79ed4 ("mm, vmstat: print non-populated zones in
    zoneinfo"), /proc/zoneinfo will show unpopulated zones.

    A memoryless node, having no populated zones at all, was previously
    ignored, but will now trigger the WARN() in is_zone_first_populated().

    Remove this warning, as its only purpose was to warn of a situation that
    has since been enabled.

    Aside: The "per-node stats" are still printed under the first populated
    zone, but that's not necessarily the first stanza any more. I'm not
    sure which criterion is more important with regard to not breaking
    parsers, but it looks a little weird to the eye.

    Fixes: e2ecc8a79ed4 ("mm, vmstat: print node-based stats in zoneinfo file")
    Link: http://lkml.kernel.org/r/1493854905-10918-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Cc: David Rientjes
    Cc: Anshuman Khandual
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     

04 May, 2017

5 commits

  • After "mm, vmstat: print non-populated zones in zoneinfo",
    /proc/zoneinfo will show unpopulated zones.

    The per-cpu pageset statistics are not relevant for unpopulated zones
    and can be potentially lengthy, so suppress them when they are not
    interesting.

    Also move the lowmem reserve protection information above the pcp stats,
    since it is relevant for all zones per vm.lowmem_reserve_ratio.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703061400500.46428@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Anshuman Khandual
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Initscripts can use the information (protection levels) from
    /proc/zoneinfo to configure vm.lowmem_reserve_ratio at boot.

    vm.lowmem_reserve_ratio is an array of ratios for each configured zone
    on the system. If a zone is not populated on an arch, /proc/zoneinfo
    suppresses its output.

    This results in there not being a 1:1 mapping between the set of zones
    emitted by /proc/zoneinfo and the zones configured by
    vm.lowmem_reserve_ratio.

    This patch shows statistics for non-populated zones in /proc/zoneinfo.
    The zones exist and hold a spot in the vm.lowmem_reserve_ratio array.
    Without this patch, it is not possible to determine which index in the
    array controls which zone if one or more zones on the system are not
    populated.

    Remaining users of walk_zones_in_node() are unchanged. Files such as
    /proc/pagetypeinfo require certain zone data to be initialized properly
    for display, which is not done for unpopulated zones.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703031451310.98023@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reviewed-by: Anshuman Khandual
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • madvise()'s MADV_FREE indicates that pages are 'lazyfree'. They are
    still anonymous pages, but they can be freed without pageout. To
    distinguish them from normal anonymous pages, we clear their SwapBacked
    flag.

    MADV_FREE pages can be freed without pageout, so they are pretty much
    like used-once file pages. For such pages, we'd like to reclaim them
    once there is memory pressure. It might also be unfair to always
    reclaim MADV_FREE pages before used-once file pages, yet we definitely
    want to reclaim these pages before other anonymous and file pages.

    To speed up MADV_FREE page reclaim, we put the pages into the
    LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE
    list is tiny nowadays and should be full of used-once file pages.
    Reclaiming MADV_FREE pages will not interfere much with anonymous and
    active file pages, and the inactive file pages and MADV_FREE pages will
    be reclaimed according to their age, so we don't reclaim too many
    MADV_FREE pages either. Putting the MADV_FREE pages into the
    LRU_INACTIVE_FILE list also means we can reclaim the pages without swap
    support. This idea was suggested by Johannes.

    This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
    yet, to avoid bisect failure; the next patch will do that.

    The patch is based on Minchan's original patch. (A userspace example of
    MADV_FREE usage follows the entries for this date.)

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • NR_PAGES_SCANNED counts number of pages scanned since the last page free
    event in the allocator. This was used primarily to measure the
    reclaimability of zones and nodes, and determine when reclaim should
    give up on them. In that role, it has been replaced in the preceding
    patches by a different mechanism.

    Being implemented as an efficient vmstat counter, it was automatically
    exported to userspace as well. It's however unlikely that anyone
    outside the kernel is using this counter in any meaningful way.

    Remove the counter and the unused pgdat_reclaimable().

    Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Jia He
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
    cleanups".

    Jia reported a scenario in which the kswapd of a node indefinitely spins
    at 100% CPU usage. We have seen similar cases at Facebook.

    The kernel's current method of judging its ability to reclaim a node (or
    whether to back off and sleep) is based on the amount of scanned pages
    in proportion to the amount of reclaimable pages. In Jia's and our
    scenarios, there are no reclaimable pages in the node, however, and the
    condition for backing off is never met. Kswapd busyloops in an attempt
    to restore the watermarks while having nothing to work with.

    This series reworks the definition of an unreclaimable node based not on
    scanning but on whether kswapd is able to actually reclaim pages in
    MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
    the page allocator uses for giving up on direct reclaim and invoking the
    OOM killer. If it cannot free any pages, kswapd will go to sleep and
    leave further attempts to direct reclaim invocations, which will either
    make progress and re-enable kswapd, or invoke the OOM killer.

    Patch #1 fixes the immediate problem Jia reported, the remainder are
    smaller fixlets, cleanups, and overall phasing out of the old method.

    Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
    and directly related to #5, but in itself not relevant to the series.

    If the whole series is too ambitious for 4.11, I would consider the
    first three patches fixes, the rest cleanups.

    This patch (of 9):

    Jia He reports a problem with kswapd spinning at 100% CPU when
    requesting more hugepages than memory available in the system:

    $ echo 4000 >/proc/sys/vm/nr_hugepages

    top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
    Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
    KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3

    At that time, there are no reclaimable pages left in the node, but as
    kswapd fails to restore the high watermarks it refuses to go to sleep.

    Kswapd needs to back away from nodes that fail to balance. Up until
    commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
    nodes") kswapd had such a mechanism. It considered zones whose
    theoretically reclaimable pages it had reclaimed six times over as
    unreclaimable and backed away from them. This guard was erroneously
    removed as the patch changed the definition of a balanced node.

    However, simply restoring this code wouldn't help in the case reported
    here: there *are* no reclaimable pages that could be scanned until the
    threshold is met. Kswapd would stay awake anyway.

    Introduce a new and much simpler way of backing off. If kswapd runs
    through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
    page, make it back off from the node. This is the same number of shots
    direct reclaim takes before declaring OOM. Kswapd will go to sleep on
    that node until a direct reclaimer manages to reclaim some pages, thus
    proving the node reclaimable again.

    [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
    Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
    [shakeelb@google.com: fix condition for throttle_direct_reclaim]
    Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Shakeel Butt
    Reported-by: Jia He
    Tested-by: Jia He
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
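
    A userspace view of the MADV_FREE ("lazyfree") behaviour described a
    few entries above, runnable on any kernel that has MADV_FREE (the
    fallback #define below assumes the usual Linux value):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_FREE
    #define MADV_FREE 8             /* Linux value, for older headers */
    #endif

    int main(void)
    {
            size_t len = 4UL << 20;         /* 4 MiB of anonymous memory */
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                    return 1;
            memset(buf, 0xaa, len);         /* dirty the anonymous pages */

            /* Mark the range as disposable: under memory pressure the
             * kernel may reclaim these pages without any swap I/O; if that
             * happens they read back as zeroes.  Writing to a page again
             * cancels the lazyfree state for that page. */
            if (madvise(buf, len, MADV_FREE) != 0)
                    perror("madvise(MADV_FREE)");

            munmap(buf, len);
            return 0;
    }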
     

20 Apr, 2017

1 commit

  • Geert has reported a freeze during PM resume and some additional
    debugging has shown that the device_resume worker cannot make a forward
    progress because it waits for an event which is stuck waiting in
    drain_all_pages:

    INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
    Not tainted 4.11.0-rc7-koelsch-00029-g005882e53d62f25d-dirty #3476
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kworker/u4:0 D 0 5 2 0x00000000
    Workqueue: events_unbound async_run_entry_fn
    __schedule
    schedule
    schedule_timeout
    wait_for_common
    dpm_wait_for_superior
    device_resume
    async_resume
    async_run_entry_fn
    process_one_work
    worker_thread
    kthread
    [...]
    bash D 0 1703 1694 0x00000000
    __schedule
    schedule
    schedule_timeout
    wait_for_common
    flush_work
    drain_all_pages
    start_isolate_page_range
    alloc_contig_range
    cma_alloc
    __alloc_from_contiguous
    cma_allocator_alloc
    __dma_alloc
    arm_dma_alloc
    sh_eth_ring_init
    sh_eth_open
    sh_eth_resume
    dpm_run_callback
    device_resume
    dpm_resume
    dpm_resume_end
    suspend_devices_and_enter
    pm_suspend
    state_store
    kernfs_fop_write
    __vfs_write
    vfs_write
    SyS_write
    [...]
    Showing busy workqueues and worker pools:
    [...]
    workqueue mm_percpu_wq: flags=0xc
    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=0/0
    delayed: drain_local_pages_wq, vmstat_update
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
    delayed: drain_local_pages_wq BAR(1703), vmstat_update

    Tetsuo has properly noted that mm_percpu_wq is created as WQ_FREEZABLE,
    so it is still frozen this early during resume and we are effectively
    deadlocked. Fix this by dropping WQ_FREEZABLE when creating
    mm_percpu_wq; we really want to have it operational all the time. (A
    sketch of the workqueue creation follows this entry.)

    Fixes: ce612879ddc7 ("mm: move pcp and lru-pcp draining into single wq")
    Reported-and-tested-by: Geert Uytterhoeven
    Debugged-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Michal Hocko
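
    A kernel-context sketch of the fix described above (the flag choice
    mirrors the description; treat it as an illustration rather than the
    exact upstream line):

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *my_mm_wq;

    static int __init my_mm_wq_init(void)
    {
            /* WQ_MEM_RECLAIM gives the queue a rescuer thread; WQ_FREEZABLE
             * is deliberately absent so vmstat updates and per-cpu page
             * draining keep working during suspend/resume. */
            my_mm_wq = alloc_workqueue("mm_percpu_wq", WQ_MEM_RECLAIM, 0);
            return my_mm_wq ? 0 : -ENOMEM;
    }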
     

08 Apr, 2017

1 commit

  • We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
    vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
    per cpu lru caches. This seems more than necessary because both can run
    on a single WQ. Both do not block on locks requiring a memory
    allocation nor perform any allocations themselves. We will save one
    rescuer thread this way.

    On the other hand, drain_all_pages() queues work on the system wq, which
    doesn't have a rescuer and so depends on memory allocation (it can get
    stuck when all workers are busy allocating and new ones cannot be
    created).

    Initially we thought this would be more of a theoretical problem but
    Hugh Dickins has reported:

    : 4.11-rc has been giving me hangs after hours of swapping load. At
    : first they looked like memory leaks ("fork: Cannot allocate memory");
    : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
    : before looking at /proc/meminfo one time, and the stat_refresh stuck
    : in D state, waiting for completion of flush_work like many kworkers.
    : kthreadd waiting for completion of flush_work in drain_all_pages().

    This worker should be using WQ_RECLAIM as well in order to guarantee a
    forward progress. We can reuse the same one as for lru draining and
    vmstat.

    Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Tetsuo Handa
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Tested-by: Yang Li
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Apr, 2017

1 commit

  • Yang Li has reported that drain_all_pages triggers a WARN_ON which means
    that this function is called earlier than the mm_percpu_wq is
    initialized on arm64 with CMA configured:

    WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
    Modules linked in:
    CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
    Hardware name: Freescale Layerscape 2088A RDB Board (DT)
    task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
    PC is at drain_all_pages+0x244/0x25c
    LR is at start_isolate_page_range+0x14c/0x1f0
    [...]
    drain_all_pages+0x244/0x25c
    start_isolate_page_range+0x14c/0x1f0
    alloc_contig_range+0xec/0x354
    cma_alloc+0x100/0x1fc
    dma_alloc_from_contiguous+0x3c/0x44
    atomic_pool_init+0x7c/0x208
    arm64_dma_init+0x44/0x4c
    do_one_initcall+0x38/0x128
    kernel_init_freeable+0x1a0/0x240
    kernel_init+0x10/0xfc
    ret_from_fork+0x10/0x20

    Fix this by moving the whole of setup_vmstat, which is an initcall right
    now, to init_mm_internals, which will be called right after the WQ
    subsystem is initialized.

    Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Yang Li
    Tested-by: Yang Li
    Tested-by: Xiaolong Ye
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

10 Mar, 2017

1 commit

  • We added support for PUD-sized transparent hugepages, but we count the
    "thp split pud" event as part of the thp_split_pmd event.

    To separate the event count of thp split pud from pmd, add a new event
    named thp_split_pud. (A condensed sketch of the usual steps for adding
    such a VM event follows this entry.)

    Link: http://lkml.kernel.org/r/1488282380-5076-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Sebastian Siewior
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: David Rientjes
    Cc: Hanjun Guo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
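
    A condensed sketch of the usual steps for wiring up a VM event such as
    thp_split_pud (file locations recalled from memory, shown for
    orientation rather than as an exact diff):

    /* 1. include/linux/vm_event_item.h: add THP_SPLIT_PUD to the
     *    vm_event_item enum (under the PUD-capable config option).
     * 2. mm/vmstat.c: add the matching "thp_split_pud" string at the same
     *    position in vmstat_text[] -- see the 06 Oct 2018 entries above for
     *    what happens when the two get out of sync.
     * 3. At the site that actually splits the PUD, bump the counter:
     */
    #include <linux/vm_event_item.h>
    #include <linux/vmstat.h>

    static void my_note_pud_split(void)
    {
            count_vm_event(THP_SPLIT_PUD);  /* shows up in /proc/vmstat */
    }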
     

23 Feb, 2017

1 commit

  • A "compact_daemon_wake" vmstat exists that represents the number of
    times kcompactd has woken up. This doesn't represent how much work it
    actually did, though.

    It's useful to understand how much compaction work is being done by
    kcompactd versus other methods such as direct compaction and explicitly
    triggered per-node (or system) compaction.

    This adds two new vmstats: "compact_daemon_migrate_scanned" and
    "compact_daemon_free_scanned" to represent the number of pages kcompactd
    has scanned as part of its migration scanner and freeing scanner,
    respectively.

    These values are still accounted for in the general
    "compact_migrate_scanned" and "compact_free_scanned" for compatibility.

    It could be argued that explicitly triggered compaction could also be
    tracked separately, and that could be added if others find it useful.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

02 Dec, 2016

3 commits

  • Install the callbacks via the state machine, but do not invoke them as we
    can initialize the node state without calling the callbacks on all online
    CPUs.

    start_shepherd_timer() is now called outside the get_online_cpus() block,
    which is safe as it only operates on the cpu possible mask.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: Johannes Weiner
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20161129145221.ffc3kg3hd7lxiwj6@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     
  • Both iterations over online cpus can be replaced by the proper node
    specific functions.

    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Michal Hocko
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: Johannes Weiner
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20161129145113.fn3lw5aazjjvdrr3@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     
  • Both functions are called with protection against cpu hotplug already so
    *_online_cpus() could be dropped.

    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Michal Hocko
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: Johannes Weiner
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20161126231350.10321-8-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

08 Oct, 2016

3 commits

  • Allow some seq_puts removals by taking a string instead of a single
    char.

    [akpm@linux-foundation.org: update vmstat_show(), per Joe]
    Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Every current KDE system has process named ksysguardd polling files
    below once in several seconds:

    $ strace -e trace=open -p $(pidof ksysguardd)
    Process 1812 attached
    open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8
    open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8
    open("/proc/net/dev", O_RDONLY) = 8
    open("/proc/net/wireless", O_RDONLY) = -1 ENOENT (No such file or directory)
    open("/proc/stat", O_RDONLY) = 8
    open("/proc/vmstat", O_RDONLY) = 8

    Hell knows what it is doing, but speed up reading /proc/vmstat by 33%!

    The benchmark is open+read+close 1,000,000 times (a runnable
    reconstruction of it follows the last entry in this section).

    BEFORE
    $ perf stat -r 10 taskset -c 3 ./proc-vmstat

    Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):

    13146.768464 task-clock (msec) # 0.960 CPUs utilized ( +- 0.60% )
    15 context-switches # 0.001 K/sec ( +- 1.41% )
    1 cpu-migrations # 0.000 K/sec ( +- 11.11% )
    104 page-faults # 0.008 K/sec ( +- 0.57% )
    45,489,799,349 cycles # 3.460 GHz ( +- 0.03% )
    9,970,175,743 stalled-cycles-frontend # 21.92% frontend cycles idle ( +- 0.10% )
    2,800,298,015 stalled-cycles-backend # 6.16% backend cycles idle ( +- 0.32% )
    79,241,190,850 instructions # 1.74 insn per cycle
    # 0.13 stalled cycles per insn ( +- 0.00% )
    17,616,096,146 branches # 1339.956 M/sec ( +- 0.00% )
    176,106,232 branch-misses # 1.00% of all branches ( +- 0.18% )

    13.691078109 seconds time elapsed ( +- 0.03% )
    ^^^^^^^^^^^^

    AFTER
    $ perf stat -r 10 taskset -c 3 ./proc-vmstat

    Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):

    8688.353749 task-clock (msec) # 0.950 CPUs utilized ( +- 1.25% )
    10 context-switches # 0.001 K/sec ( +- 2.13% )
    1 cpu-migrations # 0.000 K/sec
    104 page-faults # 0.012 K/sec ( +- 0.56% )
    30,384,010,730 cycles # 3.497 GHz ( +- 0.07% )
    12,296,259,407 stalled-cycles-frontend # 40.47% frontend cycles idle ( +- 0.13% )
    3,370,668,651 stalled-cycles-backend # 11.09% backend cycles idle ( +- 0.69% )
    28,969,052,879 instructions # 0.95 insn per cycle
    # 0.42 stalled cycles per insn ( +- 0.01% )
    6,308,245,891 branches # 726.058 M/sec ( +- 0.00% )
    214,685,502 branch-misses # 3.40% of all branches ( +- 0.26% )

    9.146081052 seconds time elapsed ( +- 0.07% )
    ^^^^^^^^^^^

    vsnprintf() is slow because:

    1. format_decode() is busy looking for the format specifier: 2 branches
    per character (not in this case, but in others);

    2. there are approximately a million branches while parsing the format
    mini-language and everywhere else;

    3. just look at what string() does. /proc/vmstat is a good case because
    most of its content is strings.

    Link: http://lkml.kernel.org/r/20160806125455.GA1187@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • In current kernel code, we only call node_set_state(cpu_to_node(cpu),
    N_CPU) when a cpu is hot plugged. But we do not set the node state for
    N_CPU when the cpus are brought online during boot.

    So this could lead to a failure when we check whether a node contains a
    cpu with node_state(node_id, N_CPU).

    One use case is in the node_reclaim function:

    /*
     * Only run node reclaim on the local node or on nodes that do not
     * have associated processors. This will favor the local processor
     * over remote processors and spread off node memory allocations
     * as wide as possible.
     */
    if (node_state(pgdat->node_id, N_CPU) &&
        pgdat->node_id != numa_node_id())
            return NODE_RECLAIM_NOSCAN;

    I instrumented the kernel to call this function after boot and it always
    returned 0 on an x86 desktop machine until I applied the attached patch.

    int num_cpu_node(void)
    {
            int i, nr_cpu_nodes = 0;

            for_each_node(i) {
                    if (node_state(i, N_CPU))
                            ++nr_cpu_nodes;
            }

            return nr_cpu_nodes;
    }

    Fix this by checking each node for online CPUs when we initialize
    vmstat, which is responsible for maintaining the node state.

    Link: http://lkml.kernel.org/r/20160829175922.GA21775@linux.intel.com
    Signed-off-by: Tim Chen
    Acked-by: David Rientjes
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc:
    Cc: Ying
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
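
    The /proc/vmstat entry above measures "open+read+close 1,000,000
    times"; below is a runnable reconstruction of such a micro-benchmark
    (loop count lowered so it finishes quickly -- raise it and wrap the
    program in perf stat to reproduce numbers of that kind):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[8192];

            for (int i = 0; i < 100000; i++) {
                    int fd = open("/proc/vmstat", O_RDONLY);

                    if (fd < 0)
                            return 1;
                    while (read(fd, buf, sizeof(buf)) > 0)
                            ;               /* drain the file */
                    close(fd);
            }
            puts("done");
            return 0;
    }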