17 Oct, 2013

1 commit

  • This leak was added by commit 1d3d4437eae1 ("vmscan: per-node deferred
    work").

    unreferenced object 0xffff88006ada3bd0 (size 8):
    comm "criu", pid 14781, jiffies 4295238251 (age 105.641s)
    hex dump (first 8 bytes):
    00 00 00 00 00 00 00 00 ........
    backtrace:
    [] kmemleak_alloc+0x5e/0xc0
    [] __kmalloc+0x247/0x310
    [] register_shrinker+0x3c/0xa0
    [] sget+0x5ab/0x670
    [] proc_mount+0x54/0x170
    [] mount_fs+0x43/0x1b0
    [] vfs_kern_mount+0x72/0x110
    [] kern_mount_data+0x19/0x30
    [] pid_ns_prepare_proc+0x20/0x40
    [] alloc_pid+0x466/0x4a0
    [] copy_process+0xc6a/0x1860
    [] do_fork+0x8b/0x370
    [] SyS_clone+0x16/0x20
    [] stub_clone+0x69/0x90
    [] 0xffffffffffffffff

    Signed-off-by: Andrew Vagin
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Glauber Costa
    Cc: Chuck Lever
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Vagin
     

01 Oct, 2013

1 commit

  • Isolated balloon pages can wrongly end up in LRU lists when
    migrate_pages() finishes its round without draining all the isolated
    page list.

    The same issue can happen when reclaim_clean_pages_from_list() tries to
    reclaim pages from an isolated page list, before migration, in the CMA
    path. Such a balloon page leak opens a race window against LRU list
    shrinkers that leads us to the following kernel panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    IP: [] shrink_page_list+0x24e/0x897
    PGD 3cda2067 PUD 3d713067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    RIP: shrink_page_list+0x24e/0x897
    RSP: 0000:ffff88003da499b8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
    RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
    RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
    R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
    R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
    FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
    Call Trace:
    shrink_inactive_list+0x240/0x3de
    shrink_lruvec+0x3e0/0x566
    __shrink_zone+0x94/0x178
    shrink_zone+0x3a/0x82
    balance_pgdat+0x32a/0x4c2
    kswapd+0x2f0/0x372
    kthread+0xa2/0xaa
    ret_from_fork+0x7c/0xb0
    Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
    RIP [] shrink_page_list+0x24e/0x897
    RSP
    CR2: 0000000000000028
    ---[ end trace 703d2451af6ffbfd ]---
    Kernel panic - not syncing: Fatal exception

    This patch fixes the issue by ensuring the proper tests are made in
    putback_movable_pages() and reclaim_clean_pages_from_list() so that
    isolated balloon pages are not wrongly reinserted into LRU lists.
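
    A minimal C sketch of the check described above, with simplified stand-in
    types; putback_balloon_page() and the isolated_balloon flag here are
    illustrative, not the kernel's actual API:

    /* Drain an isolated page list that migration/CMA gave up on: balloon
     * pages must go back to the balloon driver, never onto an LRU list. */
    #include <stdbool.h>

    struct page {
        bool isolated_balloon;            /* stand-in for the balloon-page test */
        struct page *next;
    };

    static void putback_balloon_page(struct page *page)
    {
        (void)page;   /* illustrative: return the page to the balloon device */
    }

    static void putback_lru_page(struct page *page)
    {
        (void)page;   /* illustrative: return the page to the LRU it came from */
    }

    static void putback_movable_pages(struct page *list)
    {
        for (struct page *page = list; page; page = page->next) {
            if (page->isolated_balloon)
                putback_balloon_page(page);   /* the missing test */
            else
                putback_lru_page(page);
        }
    }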

    [akpm@linux-foundation.org: clarify awkward comment text]
    Signed-off-by: Rafael Aquini
    Reported-by: Luiz Capitulino
    Tested-by: Luiz Capitulino
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

25 Sep, 2013

5 commits


13 Sep, 2013

8 commits

  • Merge more patches from Andrew Morton:
    "The rest of MM. Plus one misc cleanup"

    * emailed patches from Andrew Morton : (35 commits)
    mm/Kconfig: add MMU dependency for MIGRATION.
    kernel: replace strict_strto*() with kstrto*()
    mm, thp: count thp_fault_fallback anytime thp fault fails
    thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
    thp: do_huge_pmd_anonymous_page() cleanup
    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
    mm: cleanup add_to_page_cache_locked()
    thp: account anon transparent huge pages into NR_ANON_PAGES
    truncate: drop 'oldsize' truncate_pagecache() parameter
    mm: make lru_add_drain_all() selective
    memcg: document cgroup dirty/writeback memory statistics
    memcg: add per cgroup writeback pages accounting
    memcg: check for proper lock held in mem_cgroup_update_page_stat
    memcg: remove MEMCG_NR_FILE_MAPPED
    memcg: reduce function dereference
    memcg: avoid overflow caused by PAGE_ALIGN
    memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
    memcg: correct RESOURCE_MAX to ULLONG_MAX
    mm: memcg: do not trap chargers with full callstack on OOM
    mm: memcg: rework and document OOM waiting and wakeup
    ...

    Linus Torvalds
     
  • Clean up some mess made by the "Soft limit rework" series, and a few other
    things.

    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • shrink_zone starts with a soft reclaim pass and then falls back to
    regular reclaim if nothing has been scanned. This behavior is natural,
    but there is a catch. Memcg iterators, when used with the reclaim
    cookie, are designed to help prevent over-reclaim by interleaving
    reclaimers (per node-zone-priority), so the tree walk might miss many
    (even all) nodes in the hierarchy, e.g. when direct reclaimers race with
    each other or with kswapd in the global case, or when multiple
    allocators reach the limit in the target reclaim case. To make it even
    more complicated, targeted reclaim doesn't do the whole tree walk
    because it stops reclaiming once it has reclaimed sufficient pages.
    As a result, groups over the limit might be missed, nothing is scanned,
    and reclaim falls back to the reclaim-all mode.

    This patch checks for the incomplete tree walk in shrink_zone. If no
    group has been visited and the hierarchy is soft reclaimable, then we
    must have missed some groups, in which case __shrink_zone is called
    again. This doesn't guarantee there will be some progress, of course,
    because the current reclaimer might still be racing with others, but it
    at least gives the walk a chance to start without a big risk of
    reclaim latencies.
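
    A rough C sketch of the retry described above (function names come from
    the commit text, everything else is a simplified stand-in):

    #include <stdbool.h>

    struct zone;
    struct scan_control;

    /* stands in for the existing helpers: __shrink_zone() returns how many
     * memcg groups the interleaved walk actually visited */
    extern unsigned long __shrink_zone(struct zone *zone,
                                       struct scan_control *sc,
                                       bool soft_reclaim);
    extern bool mem_cgroup_should_soft_reclaim(struct scan_control *sc);

    static void shrink_zone(struct zone *zone, struct scan_control *sc)
    {
        bool do_soft = mem_cgroup_should_soft_reclaim(sc);

        if (__shrink_zone(zone, sc, do_soft) == 0 && do_soft)
            /* the walk raced with other reclaimers and missed every
             * group: retry the pass once instead of giving up */
            __shrink_zone(zone, sc, do_soft);
    }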

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • mem_cgroup_should_soft_reclaim controls whether the soft reclaim pass is
    done, and currently it always says yes. Memcg iterators are clever
    enough to skip nodes that are not soft reclaimable quite efficiently,
    but mem_cgroup_should_soft_reclaim can be cleverer still and not start
    the soft reclaim pass at all if it knows that nothing would be scanned
    anyway.

    In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for
    the target group of the reclaim and allow the pass only if the whole
    subtree wouldn't be skipped.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The caller of the iterator might know that some nodes or even subtrees
    should be skipped, but there is no way to tell the iterator about that,
    so the only choice left is to let the iterator visit each node and do
    the selection outside of the iterating code. This, however, doesn't
    scale well with hierarchies that have many groups, of which only a few
    are interesting.

    This patch adds a mem_cgroup_iter_cond variant of the iterator with a
    callback which gets called for every visited node. There are three
    possible ways the callback can influence the walk: either the node is
    visited, it is skipped but the tree walk continues down the tree, or the
    whole subtree of the current group is skipped.
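
    A small self-contained C sketch of the three-way filter described above
    (the tree type, the walk and the enum names are simplified stand-ins for
    the memcg iterator machinery):

    #include <stddef.h>

    enum filter_result { VISIT, SKIP, SKIP_TREE };

    struct group {
        struct group *child;              /* first child */
        struct group *sibling;            /* next sibling */
    };

    typedef enum filter_result (*filter_fn)(struct group *g, void *data);

    static void walk(struct group *g, filter_fn cond, void *data,
                     void (*visit)(struct group *g))
    {
        if (!g)
            return;

        switch (cond(g, data)) {
        case VISIT:
            visit(g);
            /* fall through: descend into the children as usual */
        case SKIP:
            walk(g->child, cond, data, visit);
            break;
        case SKIP_TREE:
            break;                        /* prune this whole subtree */
        }
        walk(g->sibling, cond, data, visit);
    }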

    [hughd@google.com: fix memcg-less page reclaim]
    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Soft reclaim has been done only for the global reclaim (both background
    and direct). Since "memcg: integrate soft reclaim tighter with zone
    shrinking code" there is no reason for this limitation anymore, as the
    soft limit reclaim doesn't use any special code paths and is a part of
    the zone shrinking code which is used by both global and targeted
    reclaims.

    From the semantic point of view it is natural to consider the soft limit
    before touching all groups in the hierarchy subtree which is hitting the
    hard limit, because the soft limit tells us where to push back when
    there is memory pressure. It is not important whether the pressure comes
    from the limit or from imbalanced zones.

    This patch simply enables soft reclaim unconditionally in
    mem_cgroup_should_soft_reclaim so it is enabled for both global and
    targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn
    about the root of the reclaim to know where to stop checking soft limit
    state of parents up the hierarchy. Say we have

    A (over soft limit)
     \
      B (below s.l., hit the hard limit)
     / \
    C   D (below s.l.)

    B is now the source of the outside memory pressure for D, but we
    shouldn't soft reclaim D because it is behaving well under the B subtree
    and we can still reclaim from C (presumably it is over the limit).
    mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
    hierarchy at B (the root of the memory pressure).
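
    A short C sketch of the eligibility test in the A/B/C/D example above
    (the types and the over_soft_limit() helper are simplified stand-ins):

    #include <stdbool.h>
    #include <stddef.h>

    struct group {
        struct group *parent;
        unsigned long usage, soft_limit;
    };

    static bool over_soft_limit(const struct group *g)
    {
        return g->usage > g->soft_limit;
    }

    /* eligible if the group or any ancestor up to the root of the current
     * memory pressure is above its soft limit; the climb stops at that
     * root, so D under a well-behaved B is left alone while C is not */
    static bool soft_reclaim_eligible(const struct group *g,
                                      const struct group *root)
    {
        for (; g; g = g->parent) {
            if (over_soft_limit(g))
                return true;
            if (g == root)                /* root of the pressure: stop here */
                break;
        }
        return false;
    }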

    Signed-off-by: Michal Hocko
    Reviewed-by: Glauber Costa
    Reviewed-by: Tejun Heo
    Cc: Balbir Singh
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This patchset has been sitting out of tree for quite some time without
    any objections. I would be really happy if it made it into 3.12. I do
    not want to push it too hard, but I think this work is basically ready
    and waiting longer doesn't help.

    The basic idea is quite simple. Pull soft reclaim into shrink_zone in
    the first step and get rid of the previous soft reclaim infrastructure.
    shrink_zone is now done in two passes. First it tries to do the soft
    limit reclaim and falls back to reclaim-all mode if no group is over
    the limit or no pages have been scanned. The second pass happens at the
    same priority, so the only time we waste is the memcg tree walk, which
    has been updated in the third step to have only negligible overhead.

    As a bonus we get rid of a _lot_ of code, and soft reclaim no longer
    stands out like before, when it wasn't integrated into the zone
    shrinking code and reclaimed at priority 0 (the testing results show
    that some workloads suffer from such aggressive reclaim). The clean-up
    is in a separate patch because I felt it would be easier to review that
    way.

    The second step is soft limit reclaim integration into targeted reclaim.
    It should be rather straightforward. The soft limit has been used only
    for the global reclaim so far, but it makes sense for any kind of
    pressure coming from up the hierarchy, including targeted reclaim.

    The third step (patches 4-8) addresses the tree walk overhead by
    enhancing memcg iterators to enable skipping whole subtrees and tracking
    the number of over-soft-limit children at each level of the hierarchy.
    This information is updated the same way the old soft limit tree was
    updated (from memcg_check_events), so we shouldn't see any additional
    overhead. In fact mem_cgroup_update_soft_limit is much simpler than the
    tree manipulation done previously.

    __shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
    mem_cgroup_iter so the decision whether a particular group should be
    visited is done at the iterator level which allows us to decide to skip
    the whole subtree as well (if there is no child in excess). This reduces
    the tree walk overhead considerably.

    * TEST 1
    ========

    My primary test case was a parallel kernel build with 2 groups (make
    running with -j8 and a distribution .config in a separate cgroup without
    any hard limit) on a 32 CPU machine booted with 1GB memory, with both
    builds run under taskset bound to Node 0 CPUs.

    I was mostly interested in 2 setups: the default (no soft limit set) and
    a 0 soft limit set on both groups. The first one should tell us whether
    the rework regresses the default behavior, while the second should show
    us improvements in an extreme case where both workloads are always over
    the soft limit.

    /usr/bin/time -v has been used to collect the statistics and each
    configuration had 3 runs after fresh boot without any other load on the
    system.

    base is mmotm-2013-07-18-16-40
    rework all 8 patches applied on top of base

    * No-limit
    User
    no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6
    no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6
    System
    no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6
    no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6
    Elapsed
    no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6
    no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6

    The results are within noise. Elapsed time has a bigger variance but the
    average looks good.

    * 0-limit
    User
    0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6
    0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6
    System
    0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6
    0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6
    Elapsed
    0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6
    0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6

    The improvement is really huge here (even bigger than with my previous
    testing and I suspect that this highly depends on the storage). Page
    fault statistics tell us at least part of the story:

    Minor
    0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6
    0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6
    Major
    0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6
    0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6

    Same as with my previous testing, Minor faults are more or less within
    noise but the Major fault count is way below the base kernel.

    While this looks like a nice win, it is fair to say that the 0-limit
    configuration is quite artificial. So I was playing with 0-no-limit
    loads as well.

    * TEST 2
    ========

    The following results are from 2 groups configuration on a 16GB machine
    (single NUMA node).

    - A running stream IO (dd if=/dev/zero of=local.file bs=1024) with
    2*TotalMem with 0 soft limit.
    - B running a mem_eater which consumes TotalMem-1G without any limit.
    The mem_eater consumes the memory in 100 chunks with a 1s nap after each
    mmap+populate so that both loads have a chance to fight for the memory.

    The expected result is that B shouldn't be reclaimed and A shouldn't see
    a big drop in elapsed time.

    User
    base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3
    rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3
    System
    base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3
    rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3
    Elapsed
    base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3
    rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3

    System time improved slightly, as did Elapsed. My previous testing has
    shown worse numbers, but this again seems to depend on the storage
    speed.

    My theory is that the writeback doesn't catch up and prio-0 soft reclaim
    falls into waiting on a writeback page too often in the base kernel. The
    patched kernel doesn't do that because the soft reclaim is done from the
    kswapd/direct reclaim context. This can be seen nicely on the following
    graphs: the A group's usage_in_bytes regularly drops very low.

    All 3 runs
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png
    and a detail of a single run:
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png

    mem_eater seems to be doing better as well. It gets to the full
    allocation size faster as can be seen on the following graph:
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png

    /proc/meminfo collected during the test also shows that the rework
    kernel hasn't swapped that much (well, almost not at all):
    base: max: 123900 K avg: 56388.29 K
    rework: max: 300 K avg: 128.68 K

    kswapd and direct reclaim statistics are of no use, unfortunately,
    because soft reclaim is not accounted properly: the counters are hidden
    by global_reclaim() checks in the base kernel.

    * TEST 3
    ========

    Another test used the same configuration as TEST2, except that the
    stream IO was replaced by a single kbuild (16 parallel jobs bound to
    Node0 CPUs, same as in TEST1) and mem_eater allocated TotalMem-200M, so
    kbuild had only 200MB left.

    Kbuild did better with the rework kernel here as well:
    User
    base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3
    rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3
    System
    base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3
    rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3
    Elapsed
    base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3
    rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3
    Minor
    base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3
    rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3
    Major
    base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3
    rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3

    Again we can see a significant improvement in Elapsed (it also seems to
    be more stable), there is a huge drop in Major page faults, and much
    less swapping:
    base: max: 583736 K avg: 112547.43 K
    rework: max: 4012 K avg: 124.36 K

    Graphs from all three runs show the variability of the kbuild quite
    nicely. It even seems that it took longer after every run with the base
    kernel, which would be quite surprising as the source tree for the build
    is removed and caches are dropped after each run, so the build operates
    on freshly extracted sources every time.
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png

    My other testing shows that this is just a matter of timing and other
    runs behave differently; the std for Elapsed time is similar (~50).
    Example of three other runs:
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png

    So to wrap this up: the series is still doing well and improves the soft
    limit behavior.

    The testing results for a bunch of cgroups with both stream IO and
    kbuild loads can be found in "memcg: track children in soft limit excess
    to improve soft limit".

    This patch:

    Memcg soft reclaim has traditionally been triggered from the global
    reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
    then picked up the group which exceeds the soft limit the most and
    reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.

    The infrastructure requires per-node-zone trees which hold over-limit
    groups and keep them up to date (via memcg_check_events), which is not
    cost free. Although this overhead hasn't turned out to be a bottleneck,
    the implementation is suboptimal because mem_cgroup_update_tree has no
    idea which zones consumed memory over the limit, so we could easily end
    up with a group on a node-zone tree that has only a few pages from that
    node-zone.

    This patch doesn't try to fix node-zone tree management because
    integrating soft reclaim into zone shrinking sounds much easier and more
    appropriate, for several reasons. First of all, 0-priority reclaim was a
    crude hack which might lead to big stalls if the group's LRUs are big
    and hard to reclaim (e.g. a lot of dirty/writeback pages). Soft reclaim
    should also be applicable to targeted reclaim, which is awkward right
    now without additional hacks. Last but not least, the whole
    infrastructure eats quite some code.

    After this patch shrink_zone is done in 2 passes. First it tries to do
    the soft reclaim if appropriate (only for global reclaim for now, to
    stay compatible with the original state) and falls back to ignoring the
    soft limit if no group is eligible for soft reclaim or nothing has been
    scanned during the first pass. Only groups which are over their soft
    limit, or which have any parent up the hierarchy over its limit, are
    considered eligible during the first pass.

    The soft limit tree, which is not necessary anymore, will be removed in
    a follow-up patch to make this patch smaller and easier to review.
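
    A compact C sketch of the two-pass structure described above (function
    names come from the commit text, the rest is a simplified stand-in):

    #include <stdbool.h>

    struct zone;
    struct scan_control { unsigned long nr_scanned; };

    /* stands in for the existing walk: when soft_reclaim is set, only
     * groups eligible for soft reclaim are shrunk */
    extern void __shrink_zone(struct zone *zone, struct scan_control *sc,
                              bool soft_reclaim);
    extern bool mem_cgroup_should_soft_reclaim(struct scan_control *sc);

    static void shrink_zone(struct zone *zone, struct scan_control *sc)
    {
        unsigned long nr_scanned = sc->nr_scanned;
        bool do_soft = mem_cgroup_should_soft_reclaim(sc);

        if (do_soft)
            __shrink_zone(zone, sc, true);    /* pass 1: soft limit only */

        /* pass 2, same priority: ignore the soft limit if the first pass
         * was skipped or scanned nothing */
        if (!do_soft || sc->nr_scanned == nr_scanned)
            __shrink_zone(zone, sc, false);
    }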

    Signed-off-by: Michal Hocko
    Reviewed-by: Glauber Costa
    Reviewed-by: Tejun Heo
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Ying Han
    Cc: Hugh Dickins
    Cc: Michel Lespinasse
    Cc: Greg Thelen
    Cc: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile, Al ended up doing the merge work so that
    Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     

12 Sep, 2013

3 commits

  • This patch is based on KOSAKI's work and I add a little more description,
    please refer https://lkml.org/lkml/2012/6/14/74.

    Currently, I found the system can enter a state where there are lots of
    free pages in a zone but only order-0 and order-1 pages, which means the
    zone is heavily fragmented; then a high-order allocation can cause a
    long stall (e.g. 60 seconds) in the direct reclaim path, especially in a
    no-swap and no-compaction environment. This problem happened on v3.4,
    but it seems the issue still lives in the current tree; the reason is
    that do_try_to_free_pages enters a livelock:

    kswapd will go to sleep if the zones have been fully scanned and are
    still not balanced, as kswapd thinks there's little point in trying all
    over again and wants to avoid an infinite loop. Instead it changes the
    order from high-order to 0-order because kswapd thinks order-0 is the
    most important. Look at 73ce02e9 in detail. If watermarks are ok, kswapd
    will go back to sleep and may leave zone->all_unreclaimable = 0. It
    assumes high-order users can still perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill. So it
    means direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue to reclaim pages forever
    while kswapd sleeps forever, until someone like a watchdog detects it
    and finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because the direct reclaim path doesn't take any lock, and that way is
    racy. Thus this patch removes the zone->all_unreclaimable field
    completely and recalculates the zone's reclaimable state every time.

    Note: we can't take the approach of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because that is racy. Commit 929bea7c71
    ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
    describes the details.
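
    A minimal C sketch of the recomputed test described above; the field
    names are simplified stand-ins and the factor-of-six threshold mirrors
    the heuristic the vmscan code of that era used, so treat it as
    illustrative rather than authoritative:

    #include <stdbool.h>

    struct zone {
        unsigned long pages_scanned;       /* scans since pages were freed */
        unsigned long nr_inactive_file, nr_active_file;
        unsigned long nr_inactive_anon, nr_active_anon;
    };

    static unsigned long zone_reclaimable_pages(const struct zone *z)
    {
        /* anon LRUs would only count when swap is available */
        return z->nr_inactive_file + z->nr_active_file +
               z->nr_inactive_anon + z->nr_active_anon;
    }

    static bool zone_reclaimable(const struct zone *z)
    {
        /* recomputed on demand: no shared all_unreclaimable flag to race on */
        return z->pages_scanned < zone_reclaimable_pages(z) * 6;
    }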

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
     
  • The goal of this patch series is to improve performance of munlock() of
    large mlocked memory areas on systems without THP. This is motivated by
    reported very long times of crash recovery of processes with such areas,
    where munlock() can take several seconds. See
    http://lwn.net/Articles/548108/

    The work was driven by a simple benchmark (to be included in mmtests) that
    mmaps() e.g. 56GB with MAP_LOCKED | MAP_POPULATE and measures the time of
    munlock(). Profiling was performed by attaching operf --pid to the
    process, sending a signal to trigger the munlock() part, and then
    notifying the monitoring wrapper back to stop operf, so that only
    munlock() appears in the profile.

    The profiles have shown that CPU time is spent mostly on atomic
    operations and repeated locking of single pages. This series aims to
    reduce both, starting from simpler and moving to more complex changes
    (a rough sketch of the batching idea follows the patch list below).

    Patch 1 performs a simple cleanup in putback_lru_page() so that page lru base
    type is not determined without being actually needed.

    Patch 2 removes an unnecessary call to lru_add_drain() which drains the per-cpu
    pagevec after each munlocked page is put there.

    Patch 3 changes munlock_vma_range() to use an on-stack pagevec for isolating
    multiple non-THP pages under a single lru_lock instead of locking and
    processing each page separately.

    Patch 4 changes the NR_MLOCK accounting to be called only once per the pvec
    introduced by previous patch.

    Patch 5 uses the introduced pagevec to batch also the work of putback_lru_page
    when possible, bypassing the per-cpu pvec and associated overhead.

    Patch 6 removes a redundant get_page/put_page pair which saves costly atomic
    operations.

    Patch 7 avoids calling follow_page_mask() on each individual page, and obtains
    multiple page references under a single page table lock where possible.
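
    As referenced above, here is a rough, self-contained C sketch of the
    batching idea behind patches 3-5: gather non-THP pages into an on-stack
    array and take the lru_lock once per batch instead of once per page.
    The pthread mutex and helper names are userspace stand-ins, not the
    kernel's pagevec implementation:

    #include <pthread.h>
    #include <stddef.h>

    #define PAGEVEC_SIZE 14               /* matches the kernel pagevec size */

    struct page { int dummy; };

    static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

    static void __munlock_one(struct page *page)
    {
        (void)page;   /* illustrative: clear mlock state, fix LRU accounting */
    }

    static void munlock_pagevec(struct page **pvec, int nr)
    {
        pthread_mutex_lock(&lru_lock);    /* one lock round trip ...   */
        for (int i = 0; i < nr; i++)
            __munlock_one(pvec[i]);       /* ... covers up to 14 pages */
        pthread_mutex_unlock(&lru_lock);
    }

    /* munlock_vma_pages_range()-style loop: fill the pagevec, then flush */
    static void munlock_range(struct page **pages, size_t npages)
    {
        struct page *pvec[PAGEVEC_SIZE];
        int nr = 0;

        for (size_t i = 0; i < npages; i++) {
            pvec[nr++] = pages[i];
            if (nr == PAGEVEC_SIZE) {
                munlock_pagevec(pvec, nr);
                nr = 0;
            }
        }
        if (nr)
            munlock_pagevec(pvec, nr);
    }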

    Measurements were made using 3.11-rc3 as a baseline. The first set of
    measurements shows the possibly ideal conditions where batching should
    help the most. All memory is allocated from a single NUMA node and THP is
    disabled.

    timedmunlock
    3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3
    0 1 2 3 4 5 6 7
    Elapsed min 3.38 ( 0.00%) 3.39 ( -0.13%) 3.00 ( 11.33%) 2.70 ( 20.20%) 2.67 ( 21.11%) 2.37 ( 29.88%) 2.20 ( 34.91%) 1.91 ( 43.59%)
    Elapsed mean 3.39 ( 0.00%) 3.40 ( -0.23%) 3.01 ( 11.33%) 2.70 ( 20.26%) 2.67 ( 21.21%) 2.38 ( 29.88%) 2.21 ( 34.93%) 1.92 ( 43.46%)
    Elapsed stddev 0.01 ( 0.00%) 0.01 (-43.09%) 0.01 ( 15.42%) 0.01 ( 23.42%) 0.00 ( 89.78%) 0.01 ( -7.15%) 0.00 ( 76.69%) 0.02 (-91.77%)
    Elapsed max 3.41 ( 0.00%) 3.43 ( -0.52%) 3.03 ( 11.29%) 2.72 ( 20.16%) 2.67 ( 21.63%) 2.40 ( 29.50%) 2.21 ( 35.21%) 1.96 ( 42.39%)
    Elapsed range 0.03 ( 0.00%) 0.04 (-51.16%) 0.02 ( 6.27%) 0.02 ( 14.67%) 0.00 ( 88.90%) 0.03 (-19.18%) 0.01 ( 73.70%) 0.06 (-113.35%

    The second set of measurements simulates the worst possible conditions for
    batching by using numactl --interleave, so that there is in fact only one
    page per pagevec. Even in this case the series seems to improve
    performance thanks to reduced atomic operations and removal of
    lru_add_drain().

    timedmunlock
    3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3
    0 1 2 3 4 5 6 7
    Elapsed min 4.00 ( 0.00%) 4.04 ( -0.93%) 3.87 ( 3.37%) 3.72 ( 6.94%) 3.81 ( 4.72%) 3.69 ( 7.82%) 3.64 ( 8.92%) 3.41 ( 14.81%)
    Elapsed mean 4.17 ( 0.00%) 4.15 ( 0.51%) 4.03 ( 3.49%) 3.89 ( 6.84%) 3.86 ( 7.48%) 3.89 ( 6.69%) 3.70 ( 11.27%) 3.48 ( 16.59%)
    Elapsed stddev 0.16 ( 0.00%) 0.08 ( 50.76%) 0.10 ( 41.58%) 0.16 ( 4.59%) 0.05 ( 72.38%) 0.19 (-12.91%) 0.05 ( 68.09%) 0.06 ( 66.03%)
    Elapsed max 4.34 ( 0.00%) 4.32 ( 0.56%) 4.19 ( 3.62%) 4.12 ( 5.15%) 3.91 ( 9.88%) 4.12 ( 5.25%) 3.80 ( 12.58%) 3.56 ( 18.08%)
    Elapsed range 0.34 ( 0.00%) 0.28 ( 17.91%) 0.32 ( 6.45%) 0.40 (-15.73%) 0.10 ( 70.06%) 0.43 (-24.84%) 0.15 ( 55.32%) 0.15 ( 56.16%)

    For completeness, a third set of measurements shows the situation where
    THP is enabled and allocations are again done on a single NUMA node. Here
    munlock() is already very fast thanks to huge pages, and this series does
    not compromise that performance. It seems that the removal of call to
    lru_add_drain() still helps a bit.

    timedmunlock
    3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3
    0 1 2 3 4 5 6 7
    Elapsed min 0.01 ( 0.00%) 0.01 ( -0.11%) 0.01 ( 6.59%) 0.01 ( 5.41%) 0.01 ( 5.45%) 0.01 ( 5.03%) 0.01 ( 6.08%) 0.01 ( 5.20%)
    Elapsed mean 0.01 ( 0.00%) 0.01 ( -0.27%) 0.01 ( 6.39%) 0.01 ( 5.30%) 0.01 ( 5.32%) 0.01 ( 5.03%) 0.01 ( 5.97%) 0.01 ( 5.22%)
    Elapsed stddev 0.00 ( 0.00%) 0.00 ( -9.59%) 0.00 ( 10.77%) 0.00 ( 3.24%) 0.00 ( 24.42%) 0.00 ( 31.86%) 0.00 ( -7.46%) 0.00 ( 6.11%)
    Elapsed max 0.01 ( 0.00%) 0.01 ( -0.01%) 0.01 ( 6.83%) 0.01 ( 5.42%) 0.01 ( 5.79%) 0.01 ( 5.53%) 0.01 ( 6.08%) 0.01 ( 5.26%)
    Elapsed range 0.00 ( 0.00%) 0.00 ( 7.30%) 0.00 ( 24.38%) 0.00 ( 6.10%) 0.00 ( 30.79%) 0.00 ( 42.52%) 0.00 ( 6.11%) 0.00 ( 10.07%)

    This patch (of 7):

    In putback_lru_page(), since commit c53954a092 ("mm: remove lru
    parameter from __lru_cache_add and lru_cache_add_lru") there is no
    longer a need to determine the lru list via page_lru_base_type().

    This patch replaces it with a simple is_unevictable flag which says that
    the page was put on the unevictable list. This is the only information
    that matters in subsequent tests.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The way the page allocator interacts with kswapd creates aging
    imbalances, where the amount of time a userspace page stays in memory
    under reclaim pressure depends on which zone and which node the
    allocator took the page frame from.

    #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
    nodes falling behind for a full reclaim cycle relative to the other
    nodes in the system

    #3 fixes an interaction where kswapd and a continuous stream of page
    allocations keep the preferred zone of a task between the high and
    low watermark (allocations succeed + kswapd does not go to sleep)
    indefinitely, completely underutilizing the lower zones and
    thrashing on the preferred zone

    These patches are the aging fairness part of the thrash-detection based
    file LRU balancing. Andrea recommended to submit them separately as they
    are bugfixes in their own right.

    The following test ran a foreground workload (memcachetest) with
    background IO of various sizes on a 4 node 8G system (similar results were
    observed with single-node 4G systems):

    parallelio
    BASE   FAIRALLOC
    Ops memcachetest-0M 5170.00 ( 0.00%) 5283.00 ( 2.19%)
    Ops memcachetest-791M 4740.00 ( 0.00%) 5293.00 ( 11.67%)
    Ops memcachetest-2639M 2551.00 ( 0.00%) 4950.00 ( 94.04%)
    Ops memcachetest-4487M 2606.00 ( 0.00%) 3922.00 ( 50.50%)
    Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops io-duration-791M 55.00 ( 0.00%) 18.00 ( 67.27%)
    Ops io-duration-2639M 235.00 ( 0.00%) 103.00 ( 56.17%)
    Ops io-duration-4487M 278.00 ( 0.00%) 173.00 ( 37.77%)
    Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swaptotal-791M 245184.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swaptotal-2639M 468069.00 ( 0.00%) 108778.00 ( 76.76%)
    Ops swaptotal-4487M 452529.00 ( 0.00%) 76623.00 ( 83.07%)
    Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swapin-791M 108297.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swapin-2639M 169537.00 ( 0.00%) 50031.00 ( 70.49%)
    Ops swapin-4487M 167435.00 ( 0.00%) 34178.00 ( 79.59%)
    Ops minorfaults-0M 1518666.00 ( 0.00%) 1503993.00 ( 0.97%)
    Ops minorfaults-791M 1676963.00 ( 0.00%) 1520115.00 ( 9.35%)
    Ops minorfaults-2639M 1606035.00 ( 0.00%) 1799717.00 (-12.06%)
    Ops minorfaults-4487M 1612118.00 ( 0.00%) 1583825.00 ( 1.76%)
    Ops majorfaults-0M 6.00 ( 0.00%) 0.00 ( 0.00%)
    Ops majorfaults-791M 13836.00 ( 0.00%) 10.00 ( 99.93%)
    Ops majorfaults-2639M 22307.00 ( 0.00%) 6490.00 ( 70.91%)
    Ops majorfaults-4487M 21631.00 ( 0.00%) 4380.00 ( 79.75%)

    BASE   FAIRALLOC
    User 287.78 460.97
    System 2151.67 3142.51
    Elapsed 9737.00 8879.34

    BASE   FAIRALLOC
    Minor Faults 53721925 57188551
    Major Faults 392195 15157
    Swap Ins 2994854 112770
    Swap Outs 4907092 134982
    Direct pages scanned 0 41824
    Kswapd pages scanned 32975063 8128269
    Kswapd pages reclaimed 6323069 7093495
    Direct pages reclaimed 0 41824
    Kswapd efficiency 19% 87%
    Kswapd velocity 3386.573 915.414
    Direct efficiency 100% 100%
    Direct velocity 0.000 4.710
    Percentage direct scans 0% 0%
    Zone normal velocity 2011.338 550.661
    Zone dma32 velocity 1365.623 369.221
    Zone dma velocity 9.612 0.242
    Page writes by reclaim 18732404.000 614807.000
    Page writes file 13825312 479825
    Page writes anon 4907092 134982
    Page reclaim immediate 85490 5647
    Sector Reads 12080532 483244
    Sector Writes 88740508 65438876
    Page rescued immediate 0 0
    Slabs scanned 82560 12160
    Direct inode steals 0 0
    Kswapd inode steals 24401 40013
    Kswapd skipped wait 0 0
    THP fault alloc 6 8
    THP collapse alloc 5481 5812
    THP splits 75 22
    THP fault fallback 0 0
    THP collapse fail 0 0
    Compaction stalls 0 54
    Compaction success 0 45
    Compaction failures 0 9
    Page migrate success 881492 82278
    Page migrate failure 0 0
    Compaction pages isolated 0 60334
    Compaction migrate scanned 0 53505
    Compaction free scanned 0 1537605
    Compaction cost 914 86
    NUMA PTE updates 46738231 41988419
    NUMA hint faults 31175564 24213387
    NUMA hint local faults 10427393 6411593
    NUMA pages migrated 881492 55344
    AutoNUMA cost 156221 121361

    The overall runtime was reduced, throughput for both the foreground
    workload as well as the background IO improved, major faults, swapping and
    reclaim activity shrunk significantly, reclaim efficiency more than
    quadrupled.

    This patch:

    When the page allocator fails to get a page from all zones in its given
    zonelist, it wakes up the per-node kswapds for all zones that are at their
    low watermark.

    However, with a system under load the free pages in a zone can fluctuate
    enough that the allocation fails but the kswapd wakeup is also skipped
    while the zone is still really close to the low watermark.

    When one node misses a wakeup like this, it won't be aged before all the
    other nodes' zones are down to their low watermarks again. And skipping
    a full aging cycle is an obvious fairness problem.

    Kswapd runs until the high watermarks are restored, so it should also be
    woken when the high watermarks are not met. This ages nodes more equally
    and creates a safety margin for the page counter fluctuation.

    By using zone_balanced(), it will now check, in addition to the watermark,
    if compaction requires more order-0 pages to create a higher order page.
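
    A hedged C sketch of the wakeup change described above; zone_balanced()
    and the field names are simplified stand-ins, and the compaction part of
    the check is elided:

    #include <stdbool.h>

    struct zone { unsigned long free_pages, high_wmark; };

    /* stand-in for zone_balanced(): high watermark met (the real check also
     * asks whether compaction has enough order-0 pages to work with) */
    static bool zone_balanced(const struct zone *z, int order)
    {
        (void)order;   /* order matters only for the elided compaction check */
        return z->free_pages >= z->high_wmark;
    }

    extern void wake_kswapd_thread(struct zone *z);   /* illustrative */

    static void wakeup_kswapd(struct zone *z, int order)
    {
        if (zone_balanced(z, order))
            return;                       /* already aged far enough */
        wake_kswapd_thread(z);            /* keep aging until the high wmark */
    }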

    Signed-off-by: Johannes Weiner
    Cc: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Paul Bolle
    Tested-by: Zlatko Calusic
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 Sep, 2013

4 commits

  • There are no more users of this API, so kill it dead, dead, dead and
    quietly bury the corpse in a shallow, unmarked grave in a dark forest deep
    in the hills...

    [glommer@openvz.org: added flowers to the grave]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Reviewed-by: Greg Thelen
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • The list_lru infrastructure already keeps per-node LRU lists in its
    node-specific list_lru_node arrays and provides us with a per-node API,
    and the shrinkers are properly equipped with node information. This
    means that we can now focus our shrinking effort on a single node, but
    the work that is deferred from one run to another is kept global in
    nr_in_batch. Work can be deferred, for instance, during direct reclaim
    under a GFP_NOFS allocation, in which situation all the filesystem
    shrinkers will be prevented from running and will accumulate in
    nr_in_batch the amount of work they should have done, but could not.

    This creates an impedance problem, where upon node pressure, work deferred
    will accumulate and end up being flushed in other nodes. The problem we
    describe is particularly harmful in big machines, where many nodes can
    accumulate at the same time, all adding to the global counter nr_in_batch.
    As we accumulate more and more, we start to ask for the caches to flush
    even bigger numbers. The result is that the caches are depleted and do
    not stabilize. To achieve stable steady state behavior, we need to tackle
    it differently.

    In this patch we keep the deferred count per-node, in the new array
    nr_deferred[] (the name is also a bit more descriptive) and will never
    accumulate that to other nodes.
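
    A simplified C sketch of the per-node deferral described above;
    nr_deferred is the name introduced by the patch, while the structure
    layout and helpers here are illustrative stand-ins (the kernel uses
    atomic counters):

    #include <stdlib.h>

    #define MAX_NUMNODES 8                /* illustrative constant */

    struct shrinker {
        long *nr_deferred;                /* one deferred-work counter per node */
    };

    static int register_shrinker(struct shrinker *s)
    {
        s->nr_deferred = calloc(MAX_NUMNODES, sizeof(*s->nr_deferred));
        return s->nr_deferred ? 0 : -1;
    }

    /* shrink_slab()-style accounting for one node: pick up that node's
     * backlog, and park whatever we could not do now on the same node */
    static void account_node_work(struct shrinker *s, int nid,
                                  long todo, long done)
    {
        todo += s->nr_deferred[nid];
        s->nr_deferred[nid] = todo > done ? todo - done : 0;
    }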

    Signed-off-by: Glauber Costa
    Cc: Dave Chinner
    Cc: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     
  • Pass the node of the current zone being reclaimed to shrink_slab(),
    allowing the shrinker control nodemask to be set appropriately for node
    aware shrinkers.
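
    A tiny C sketch of that plumbing; shrink_control and nodes_to_scan are
    named after the structures this series works with, but the bitmask and
    helpers here are simplified stand-ins:

    struct shrink_control {
        unsigned long nodes_to_scan;      /* stand-in for a nodemask_t */
    };

    static void node_set(int nid, unsigned long *mask)
    {
        *mask |= 1UL << nid;
    }

    extern void shrink_slab(struct shrink_control *sc);  /* walks shrinkers */

    static void reclaim_zone_slab(int zone_node)
    {
        struct shrink_control sc = { 0 };

        node_set(zone_node, &sc.nodes_to_scan);   /* focus on this node */
        shrink_slab(&sc);
    }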

    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • The current shrinker callout API uses a single shrinker call for
    multiple functions. To determine the function, a special magical value
    is passed in a parameter to change the behaviour. This complicates the
    implementation and the return value specification for the different
    behaviours.

    Separate the two different behaviours into separate operations, one to
    return a count of freeable objects in the cache, and another to scan a
    certain number of objects in the cache for freeing. In defining these new
    operations, ensure the return values and resultant behaviours are clearly
    defined and documented.

    Modify shrink_slab() to use the new API and implement the callouts for all
    the existing shrinkers.
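
    A condensed C sketch of the split described above; count_objects and
    scan_objects follow the shape the series describes, while the caller and
    the batch sizing are simplified stand-ins:

    struct shrink_control {
        unsigned long nr_to_scan;         /* objects the VM asks to be freed */
    };

    struct shrinker {
        /* how many objects could be freed right now (no side effects) */
        unsigned long (*count_objects)(struct shrinker *s,
                                       struct shrink_control *sc);
        /* scan and free up to nr_to_scan objects, return how many were freed */
        unsigned long (*scan_objects)(struct shrinker *s,
                                      struct shrink_control *sc);
    };

    static unsigned long do_shrink(struct shrinker *s, struct shrink_control *sc)
    {
        unsigned long freeable = s->count_objects(s, sc);

        if (!freeable)
            return 0;

        sc->nr_to_scan = freeable;        /* real code batches this by pressure */
        return s->scan_objects(s, sc);
    }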

    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     

10 Jul, 2013

2 commits

  • After the patch "mm: vmscan: Flatten kswapd priority loop" was merged
    the scanning priority of kswapd changed.

    The priority now rises until it is scanning enough pages to meet the
    high watermark. shrink_inactive_list sets ZONE_WRITEBACK if a number of
    pages were encountered under writeback, but this value is scaled based
    on the priority. As kswapd now frequently scans with a higher priority,
    it is relatively easy to set ZONE_WRITEBACK. This patch removes the
    scaling and treats writeback pages similarly to how it treats unqueued
    dirty pages and congested pages. The user-visible effect should be that
    kswapd will write back fewer pages from reclaim context.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Dave Chinner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Direct reclaim is not aborting to allow compaction to go ahead properly.
    do_try_to_free_pages is told to abort reclaim, which it happily ignores,
    and instead keeps raising the priority until it reaches 0 and starts
    shrinking file/anon equally. This patch corrects the situation by
    aborting reclaim when requested instead of raising the priority.
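
    A rough C sketch of the corrected loop, assuming a shrink_zones() that
    reports "aborted for compaction"; everything here is a simplified
    stand-in for the real do_try_to_free_pages():

    #include <stdbool.h>

    #define DEF_PRIORITY 12

    struct scan_control {
        int priority;
        unsigned long nr_reclaimed, nr_to_reclaim;
    };

    /* stands in for the existing walk; true means reclaim was aborted so
     * that compaction can take over */
    extern bool shrink_zones(struct scan_control *sc);

    static unsigned long do_try_to_free_pages(struct scan_control *sc)
    {
        bool aborted_for_compaction;

        sc->priority = DEF_PRIORITY;
        do {
            aborted_for_compaction = shrink_zones(sc);
            if (sc->nr_reclaimed >= sc->nr_to_reclaim)
                break;
            /* honour the abort instead of dropping to priority 0 */
        } while (--sc->priority >= 0 && !aborted_for_compaction);

        return sc->nr_reclaimed;
    }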

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Dave Chinner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

04 Jul, 2013

16 commits

  • Similar to __pagevec_lru_add, this patch removes the LRU parameter from
    __lru_cache_add and lru_cache_add_lru as the caller does not control the
    exact LRU the page gets added to. lru_cache_add_lru gets renamed to
    lru_cache_add because the name is silly without the lru parameter. With
    the parameter removed, the caller must indicate whether they want the
    page added to the active or inactive list by setting or clearing
    PageActive respectively.
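
    A small C sketch of the resulting calling convention; the flag helpers
    and add_to_lru() are simplified stand-ins for the page-flag and LRU
    machinery:

    #include <stdbool.h>

    struct page { bool active; };

    static void SetPageActive(struct page *p)   { p->active = true; }
    static void ClearPageActive(struct page *p) { p->active = false; }

    extern void add_to_lru(struct page *p, bool active);  /* per-list insert */

    static void lru_cache_add(struct page *p)
    {
        add_to_lru(p, p->active);         /* list picked from PageActive */
    }

    /* caller side: mark the page first, then add it */
    static void add_page_active(struct page *p)
    {
        SetPageActive(p);
        lru_cache_add(p);
    }

    static void add_page_inactive(struct page *p)
    {
        ClearPageActive(p);
        lru_cache_add(p);
    }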

    [akpm@linux-foundation.org: Suggested the patch]
    [gang.chen@asianux.com: fix used-unintialized warning]
    Signed-off-by: Mel Gorman
    Signed-off-by: Chen Gang
    Cc: Jan Kara
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Page reclaim keeps track of dirty and under-writeback pages and uses
    this to determine if wait_iff_congested() should stall or if kswapd
    should begin writing back pages. This fails to account for buffer pages
    that can be under writeback but not PageWriteback, which is the case for
    filesystems like ext3 in ordered mode. Furthermore, PageDirty buffer
    pages can have all their buffers clean, in which case writepage does no
    IO, so they should not be accounted as congested.

    This patch adds an address_space operation that filesystems may
    optionally use to check if a page is really dirty or really under
    writeback. An implementation is provided for buffer_heads and is used
    for block operations and ext3 in ordered mode. By default the page flags
    are obeyed.
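
    A hedged C sketch of the hook's shape; the operation and helper names
    are intended to mirror the is_dirty_writeback/page_check_dirty_writeback
    pair from the kernel of this era, while page_aops() and the structures
    here are simplified stand-ins:

    #include <stdbool.h>

    struct page;

    struct address_space_operations {
        /* optional: let the filesystem refine this page's dirty/writeback state */
        void (*is_dirty_writeback)(struct page *page, bool *dirty,
                                   bool *writeback);
    };

    /* stand-ins for the page-flag tests and the page's mapping ops */
    extern bool PageDirty(struct page *page);
    extern bool PageWriteback(struct page *page);
    extern const struct address_space_operations *page_aops(struct page *page);

    /* reclaim-side helper: ask the filesystem, otherwise obey the page flags */
    static void page_check_dirty_writeback(struct page *page,
                                           bool *dirty, bool *writeback)
    {
        const struct address_space_operations *aops = page_aops(page);

        *dirty = PageDirty(page);
        *writeback = PageWriteback(page);

        if (aops && aops->is_dirty_writeback)
            aops->is_dirty_writeback(page, dirty, writeback);
    }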

    Credit goes to Jan Kara for identifying that the page flags alone are
    not sufficient for ext3 and sanity checking a number of ideas on how the
    problem could be addressed.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Cc: Zlatko Calusic
    Cc: dormando
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently a zone will only be marked congested if the underlying BDI is
    congested, but if dirty pages are spread across zones it is possible
    that an individual zone is full of dirty pages without being congested.
    The impact is that the zone gets scanned very quickly, potentially
    reclaiming really clean pages. This patch treats pages marked for
    immediate reclaim as congested for the purposes of marking a zone
    ZONE_CONGESTED and stalling in wait_iff_congested.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Cc: Zlatko Calusic
    Cc: dormando
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • shrink_inactive_list makes decisions on whether to stall based on the
    number of dirty pages encountered. The wait_iff_congested() call in
    shrink_page_list does no such thing and it's arbitrary.

    This patch moves the decision on whether to set ZONE_CONGESTED and the
    wait_iff_congested call into shrink_page_list. This keeps all the
    decisions on whether to stall or not in the one place.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Cc: Zlatko Calusic
    Cc: dormando
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In shrink_page_list a decision may be made to stall and flag a zone as
    ZONE_WRITEBACK so that if a large number of unqueued dirty pages are
    encountered later then the reclaimer will stall. Set ZONE_WRITEBACK
    before potentially going to sleep so it is noticed sooner.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Cc: Zlatko Calusic
    Cc: dormando
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit "mm: vmscan: Block kswapd if it is encountering pages under
    writeback" blocks page reclaim if it encounters pages under writeback
    marked for immediate reclaim. It blocks while pages are still isolated
    from the LRU which is unnecessary. This patch defers the blocking until
    after the isolated pages have been processed and tidies up some of the
    comments.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Cc: Zlatko Calusic
    Cc: dormando
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Further testing of the "Reduce system disruption due to kswapd" series
    discovered a few problems. First and foremost, it's possible for pages
    under writeback to be freed, which will lead to badness. Second, as
    pages were not being swapped, the file LRU was being scanned faster and
    clean file pages were being reclaimed. In some cases this resulted in
    increased read IO to re-read data from disk. Third, more pages were
    being written from kswapd context, which can adversely affect IO
    performance. Lastly, it was observed that PageDirty pages are not
    necessarily dirty on all filesystems (buffers can be clean while
    PageDirty is set and ->writepage generates no IO) and not all
    filesystems set PageWriteback when the page is being written (e.g.
    ext3). This disconnect confuses the reclaim stalling logic. This
    follow-up series is aimed at these problems.

    The tests were based on three kernels

    vanilla: kernel 3.9, as that is what the current mmotm uses as a baseline
    mmotm-20130522: mmotm as of 22nd May with "Reduce system disruption due
    to kswapd" applied on top, as per what should be in Andrew's tree right now
    lessdisrupt-v7r10: this follow-up series on top of the mmotm kernel

    The first test used memcached+memcachetest while some background IO was
    in progress as implemented by the parallel IO tests implement in MM
    Tests. memcachetest benchmarks how many operations/second memcached can
    service. It starts with no background IO on a freshly created ext4
    filesystem and then re-runs the test with larger amounts of IO in the
    background to roughly simulate a large copy in progress. The
    expectation is that the IO should have little or no impact on
    memcachetest which is running entirely in memory.

    parallelio
    3.9.0 3.9.0 3.9.0
    vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
    Ops memcachetest-0M 23117.00 ( 0.00%) 22780.00 ( -1.46%) 22763.00 ( -1.53%)
    Ops memcachetest-715M 23774.00 ( 0.00%) 23299.00 ( -2.00%) 22934.00 ( -3.53%)
    Ops memcachetest-2385M 4208.00 ( 0.00%) 24154.00 (474.00%) 23765.00 (464.76%)
    Ops memcachetest-4055M 4104.00 ( 0.00%) 25130.00 (512.33%) 24614.00 (499.76%)
    Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%) 6.00 ( 50.00%)
    Ops io-duration-2385M 116.00 ( 0.00%) 21.00 ( 81.90%) 21.00 ( 81.90%)
    Ops io-duration-4055M 160.00 ( 0.00%) 36.00 ( 77.50%) 35.00 ( 78.12%)
    Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swaptotal-715M 140138.00 ( 0.00%) 18.00 ( 99.99%) 18.00 ( 99.99%)
    Ops swaptotal-2385M 385682.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swaptotal-4055M 418029.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swapin-715M 144.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swapin-2385M 134227.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swapin-4055M 125618.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops minorfaults-0M 1536429.00 ( 0.00%) 1531632.00 ( 0.31%) 1533541.00 ( 0.19%)
    Ops minorfaults-715M 1786996.00 ( 0.00%) 1612148.00 ( 9.78%) 1608832.00 ( 9.97%)
    Ops minorfaults-2385M 1757952.00 ( 0.00%) 1614874.00 ( 8.14%) 1613541.00 ( 8.21%)
    Ops minorfaults-4055M 1774460.00 ( 0.00%) 1633400.00 ( 7.95%) 1630881.00 ( 8.09%)
    Ops majorfaults-0M 1.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops majorfaults-715M 184.00 ( 0.00%) 167.00 ( 9.24%) 166.00 ( 9.78%)
    Ops majorfaults-2385M 24444.00 ( 0.00%) 155.00 ( 99.37%) 93.00 ( 99.62%)
    Ops majorfaults-4055M 21357.00 ( 0.00%) 147.00 ( 99.31%) 134.00 ( 99.37%)

    memcachetest is the transactions/second reported by memcachetest. In
    the vanilla kernel note that performance drops from around
    23K/sec to just over 4K/second when there is 2385M of IO going
    on in the background. With current mmotm, there is no collapse
    in performance and with this follow-up series there is little
    change.

    swaptotal is the total amount of swap traffic. With mmotm and the follow-up
    series, the total amount of swapping is much reduced.

    3.9.0 3.9.0 3.9.0
    vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
    Minor Faults 11160152 10706748 10622316
    Major Faults 46305 755 678
    Swap Ins 260249 0 0
    Swap Outs 683860 18 18
    Direct pages scanned 0 678 2520
    Kswapd pages scanned 6046108 8814900 1639279
    Kswapd pages reclaimed 1081954 1172267 1094635
    Direct pages reclaimed 0 566 2304
    Kswapd efficiency 17% 13% 66%
    Kswapd velocity 5217.560 7618.953 1414.879
    Direct efficiency 100% 83% 91%
    Direct velocity 0.000 0.586 2.175
    Percentage direct scans 0% 0% 0%
    Zone normal velocity 5105.086 6824.681 671.158
    Zone dma32 velocity 112.473 794.858 745.896
    Zone dma velocity 0.000 0.000 0.000
    Page writes by reclaim 1929612.000 6861768.000 32821.000
    Page writes file 1245752 6861750 32803
    Page writes anon 683860 18 18
    Page reclaim immediate 7484 40 239
    Sector Reads 1130320 93996 86900
    Sector Writes 13508052 10823500 11804436
    Page rescued immediate 0 0 0
    Slabs scanned 33536 27136 18560
    Direct inode steals 0 0 0
    Kswapd inode steals 8641 1035 0
    Kswapd skipped wait 0 0 0
    THP fault alloc 8 37 33
    THP collapse alloc 508 552 515
    THP splits 24 1 1
    THP fault fallback 0 0 0
    THP collapse fail 0 0 0

    There are a number of observations to make here

    1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
    pages swapped were really unused anonymous pages. Related to that,
    major faults are much reduced.

    2. kswapd efficiency was impacted by the initial series but with these
    follow-up patches, the efficiency is now at 66% indicating that far
    fewer pages were skipped during scanning due to dirty or writeback
    pages.

    3. kswapd velocity is reduced indicating that fewer pages are being scanned
    with the follow-up series as kswapd now stalls when the tail of the
    LRU queue is full of unqueued dirty pages. The stall gives flushers a
    chance to catch up so kswapd can reclaim clean pages when it wakes.

    4. In light of Zlatko's recent reports about zone scanning imbalances,
    mmtests now reports scanning velocity on a per-zone basis. With mainline,
    you can see that the scanning activity is dominated by the Normal
    zone with over 45 times more scanning in Normal than the DMA32 zone.
    With the series currently in mmotm, the ratio is slightly better but it
    is still the case that the bulk of scanning is in the highest zone. With
    this follow-up series, the ratio of scanning between the Normal and
    DMA32 zone is roughly equal.

    5. As Dave Chinner observed, the current patches in mmotm increased the
    number of pages written from kswapd context which is expected to adversely
    impact IO performance. With the follow-up patches, far fewer pages are
    written from kswapd context than the mainline kernel.

    6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
    the follow-up series, there is less slab shrinking activity and no inodes
    were reclaimed.

    7. Note that "Sectors Read" is drastically reduced implying that the source
    data being used for the IO is not being aggressively discarded due to
    page reclaim skipping over dirty pages and reclaiming clean pages. Note
    that the reducion in reads could also be due to inode data not being
    re-read from disk after a slab shrink.

    3.9.0 3.9.0 3.9.0
    vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
    Mean sda-avgqz 166.99 32.09 33.44
    Mean sda-await 853.64 192.76 185.43
    Mean sda-r_await 6.31 9.24 5.97
    Mean sda-w_await 2992.81 202.65 192.43
    Max sda-avgqz 1409.91 718.75 698.98
    Max sda-await 6665.74 3538.00 3124.23
    Max sda-r_await 58.96 111.95 58.00
    Max sda-w_await 28458.94 3977.29 3148.61

    In light of the changes in writes from reclaim context, the number of
    reads and Dave Chinner's concerns about IO performance, I took a closer
    look at the IO stats for the test disk. A few observations:

    1. The average queue size is reduced by the initial series and roughly
    the same with this follow-up.

    2. Average wait times for writes are reduced and, as the IO
    is completing faster, it at least implies that the gain is because
    flushers are writing the files efficiently instead of page reclaim
    getting in the way.

    3. The reduction in maximum write latency is staggering. 28 seconds down
    to 3 seconds.

    Jan Kara asked how NFS is affected by all of this. Unstable pages can
    be taken into account as one of the patches in the series shows but it
    is still the case that filesystems with unusual handling of dirty or
    writeback pages could still be treated better.

    Tests like postmark, fsmark and largedd showed up nothing useful. On my test
    setup, pages are simply not being written back from reclaim context with or
    without the patches and there are no changes in performance. My test setup
    probably is just not strong enough network-wise to be really interesting.

    I ran a longer-lived memcached test with IO going to NFS instead of a local disk.

    parallelio
    3.9.0 3.9.0 3.9.0
    vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
    Ops memcachetest-0M 23323.00 ( 0.00%) 23241.00 ( -0.35%) 23321.00 ( -0.01%)
    Ops memcachetest-715M 25526.00 ( 0.00%) 24763.00 ( -2.99%) 23242.00 ( -8.95%)
    Ops memcachetest-2385M 8814.00 ( 0.00%) 26924.00 (205.47%) 23521.00 (166.86%)
    Ops memcachetest-4055M 5835.00 ( 0.00%) 26827.00 (359.76%) 25560.00 (338.05%)
    Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops io-duration-715M 65.00 ( 0.00%) 71.00 ( -9.23%) 11.00 ( 83.08%)
    Ops io-duration-2385M 129.00 ( 0.00%) 94.00 ( 27.13%) 53.00 ( 58.91%)
    Ops io-duration-4055M 301.00 ( 0.00%) 100.00 ( 66.78%) 108.00 ( 64.12%)
    Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swaptotal-715M 14394.00 ( 0.00%) 949.00 ( 93.41%) 63.00 ( 99.56%)
    Ops swaptotal-2385M 401483.00 ( 0.00%) 24437.00 ( 93.91%) 30118.00 ( 92.50%)
    Ops swaptotal-4055M 554123.00 ( 0.00%) 35688.00 ( 93.56%) 63082.00 ( 88.62%)
    Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swapin-715M 4522.00 ( 0.00%) 560.00 ( 87.62%) 63.00 ( 98.61%)
    Ops swapin-2385M 169861.00 ( 0.00%) 5026.00 ( 97.04%) 13917.00 ( 91.81%)
    Ops swapin-4055M 192374.00 ( 0.00%) 10056.00 ( 94.77%) 25729.00 ( 86.63%)
    Ops minorfaults-0M 1445969.00 ( 0.00%) 1520878.00 ( -5.18%) 1454024.00 ( -0.56%)
    Ops minorfaults-715M 1557288.00 ( 0.00%) 1528482.00 ( 1.85%) 1535776.00 ( 1.38%)
    Ops minorfaults-2385M 1692896.00 ( 0.00%) 1570523.00 ( 7.23%) 1559622.00 ( 7.87%)
    Ops minorfaults-4055M 1654985.00 ( 0.00%) 1581456.00 ( 4.44%) 1596713.00 ( 3.52%)
    Ops majorfaults-0M 0.00 ( 0.00%) 1.00 (-99.00%) 0.00 ( 0.00%)
    Ops majorfaults-715M 763.00 ( 0.00%) 265.00 ( 65.27%) 75.00 ( 90.17%)
    Ops majorfaults-2385M 23861.00 ( 0.00%) 894.00 ( 96.25%) 2189.00 ( 90.83%)
    Ops majorfaults-4055M 27210.00 ( 0.00%) 1569.00 ( 94.23%) 4088.00 ( 84.98%)

    1. Performance does not collapse due to IO which is good. IO is also completing
    faster. Note with mmotm, IO completes in a third of the time and faster again
    with this series applied.

    2. Swapping is reduced, although not eliminated. The figures for the follow-up
    look bad but they vary a bit as the stalling is not perfect for NFS
    or filesystems like ext3 with unusual handling of dirty and writeback
    pages.

    3. There are swapins, particularly with larger amounts of IO, indicating
    that active pages are being reclaimed. However, the number is much
    reduced.

    3.9.0 3.9.0 3.9.0
    vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
    Minor Faults 36339175 35025445 35219699
    Major Faults 310964 27108 51887
    Swap Ins 2176399 173069 333316
    Swap Outs 3344050 357228 504824
    Direct pages scanned 8972 77283 43242
    Kswapd pages scanned 20899983 8939566 14772851
    Kswapd pages reclaimed 6193156 5172605 5231026
    Direct pages reclaimed 8450 73802 39514
    Kswapd efficiency 29% 57% 35%
    Kswapd velocity 3929.743 1847.499 3058.840
    Direct efficiency 94% 95% 91%
    Direct velocity 1.687 15.972 8.954
    Percentage direct scans 0% 0% 0%
    Zone normal velocity 3721.907 939.103 2185.142
    Zone dma32 velocity 209.522 924.368 882.651
    Zone dma velocity 0.000 0.000 0.000
    Page writes by reclaim 4082185.000 526319.000 537114.000
    Page writes file 738135 169091 32290
    Page writes anon 3344050 357228 504824
    Page reclaim immediate 9524 170 5595843
    Sector Reads 8909900 861192 1483680
    Sector Writes 13428980 1488744 2076800
    Page rescued immediate 0 0 0
    Slabs scanned 38016 31744 28672
    Direct inode steals 0 0 0
    Kswapd inode steals 424 0 0
    Kswapd skipped wait 0 0 0
    THP fault alloc 14 15 119
    THP collapse alloc 1767 1569 1618
    THP splits 30 29 25
    THP fault fallback 0 0 0
    THP collapse fail 8 5 0
    Compaction stalls 17 41 100
    Compaction success 7 31 95
    Compaction failures 10 10 5
    Page migrate success 7083 22157 62217
    Page migrate failure 0 0 0
    Compaction pages isolated 14847 48758 135830
    Compaction migrate scanned 18328 48398 138929
    Compaction free scanned 2000255 355827 1720269
    Compaction cost 7 24 68

    I guess the main takeaway again is the much reduced page writes
    from reclaim context and reduced reads.

    3.9.0 3.9.0 3.9.0
    vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
    Mean sda-avgqz 23.58 0.35 0.44
    Mean sda-await 133.47 15.72 15.46
    Mean sda-r_await 4.72 4.69 3.95
    Mean sda-w_await 507.69 28.40 33.68
    Max sda-avgqz 680.60 12.25 23.14
    Max sda-await 3958.89 221.83 286.22
    Max sda-r_await 63.86 61.23 67.29
    Max sda-w_await 11710.38 883.57 1767.28

    And as before, write wait times are much reduced.

    This patch:

    The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
    encountered, not priority" decides whether to writeback pages from reclaim
    context based on the number of dirty pages encountered. This situation is
    flagged too easily and flushers are not given the chance to catch up
    resulting in more pages being written from reclaim context and potentially
    impacting IO performance. The check for PageWriteback is also misplaced
    as it happens within a PageDirty check which is nonsense as the dirty bit
    may have been cleared for IO. The accounting is updated very late and pages
    that are already under writeback, were reactivated, could not be unmapped or
    could not be released are all missed. Similarly, a page is considered
    congested for reasons other than being congested and pages that cannot be
    written out in the correct context are skipped. Finally, it considers
    stalling and writing back filesystem pages due to encountering dirty
    anonymous pages at the tail of the LRU which is dumb.

    This patch causes kswapd to begin writing filesystem pages from reclaim
    context only if page reclaim found that all filesystem pages at the tail
    of the LRU were unqueued dirty pages. Before it starts writing filesystem
    pages, it will stall to give flushers a chance to catch up. The decision
    on whether to call wait_iff_congested() is also now determined by dirty filesystem
    pages only. Congested pages are based on whether the underlying BDI is
    congested regardless of the context of the reclaiming process.
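
    As a rough illustration of that decision, here is a standalone C model
    (not the kernel code; the structure, field names and the all-or-nothing
    condition are assumptions made for the sketch):

    #include <stdbool.h>
    #include <stdio.h>

    /* what page reclaim saw at the tail of the LRU during one batch */
    struct scan_stats {
        unsigned long nr_file;           /* file pages examined */
        unsigned long nr_unqueued_dirty; /* dirty but not yet queued for IO */
    };

    /* kswapd writes file pages itself only if *every* file page it saw
     * was dirty and unqueued; otherwise writeback is left to flushers */
    static bool should_write_from_reclaim(const struct scan_stats *s)
    {
        return s->nr_file && s->nr_unqueued_dirty == s->nr_file;
    }

    int main(void)
    {
        struct scan_stats mixed = { .nr_file = 32, .nr_unqueued_dirty = 5 };
        struct scan_stats all_dirty = { .nr_file = 32, .nr_unqueued_dirty = 32 };

        printf("mixed tail: write from kswapd? %d\n",
               should_write_from_reclaim(&mixed));     /* 0: let flushers work */
        printf("all unqueued dirty: write from kswapd? %d\n",
               should_write_from_reclaim(&all_dirty)); /* 1: stall, then write */
        return 0;
    }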

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Cc: Zlatko Calusic
    Cc: dormando
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • balance_pgdat() is very long and some of the logic can and should be
    internal to kswapd_shrink_zone(). Move it so the flow of
    balance_pgdat() is marginally easier to follow.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Tested-by: Zlatko Calusic
    Cc: dormando
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently kswapd checks if it should start writepage as it shrinks each
    zone without taking into consideration if the zone is balanced or not.
    This is not wrong as such but it does not make much sense either. This
    patch checks once per pgdat scan if kswapd should be writing pages.
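
    A userspace sketch of the restructuring, with invented names, showing the
    check hoisted out of the per-zone loop so it is evaluated once per pgdat
    scan and reused for every zone:

    #include <stdbool.h>
    #include <stdio.h>

    #define DEF_PRIORITY 12

    /* stand-in condition; in the series the real test is based on
     * unqueued dirty pages rather than raw scanning priority */
    static bool kswapd_should_writepage(int priority)
    {
        return priority < DEF_PRIORITY - 2;
    }

    static void shrink_zone_model(int zone_id, bool may_writepage)
    {
        printf("zone %d: may_writepage=%d\n", zone_id, may_writepage);
    }

    int main(void)
    {
        int priority = 9;
        /* decided once per pgdat scan ... */
        bool may_writepage = kswapd_should_writepage(priority);

        /* ... and reused for every zone in the node */
        for (int zone = 0; zone < 3; zone++)
            shrink_zone_model(zone, may_writepage);
        return 0;
    }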

    Signed-off-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Tested-by: Zlatko Calusic
    Cc: dormando
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Historically, kswapd used to congestion_wait() at higher priorities if
    it was not making forward progress. This made no sense as the failure
    to make progress could be completely independent of IO. It was later
    replaced by wait_iff_congested() and removed entirely by commit 258401a6
    (mm: don't wait on congested zones in balance_pgdat()) as it was
    duplicating logic in shrink_inactive_list().

    This is problematic. If kswapd encounters many pages under writeback
    and it continues to scan until it reaches the high watermark then it
    will quickly skip over the pages under writeback and reclaim clean young
    pages or push applications out to swap.

    The use of wait_iff_congested() is not suited to kswapd as it will only
    stall if the underlying BDI is really congested or a direct reclaimer
    was unable to write to the underlying BDI. kswapd bypasses the BDI
    congestion as it sets PF_SWAPWRITE but even if this was taken into
    account then it would cause direct reclaimers to stall on writeback
    which is not desirable.

    This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
    encountering too many pages under writeback. If this flag is set and
    kswapd encounters a PageReclaim page under writeback then it'll assume
    that the LRU lists are being recycled too quickly before IO can complete
    and block waiting for some IO to complete.
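
    The following standalone C model sketches that behaviour; the flag bit,
    structures and the "too many" threshold are stand-ins invented for the
    sketch, not the kernel implementation:

    #include <stdbool.h>
    #include <stdio.h>

    #define ZONE_FLAG_WRITEBACK (1u << 0)   /* models ZONE_WRITEBACK */

    struct zone_model {
        unsigned int flags;
        unsigned long nr_taken;     /* pages isolated in this batch */
        unsigned long nr_writeback; /* of those, already under writeback */
    };

    struct page_model {
        bool writeback; /* IO still in flight */
        bool reclaim;   /* models PageReclaim: cycled back before IO finished */
    };

    /* reclaim marks the zone when the whole isolated batch was under IO */
    static void note_writeback_pages(struct zone_model *z)
    {
        if (z->nr_taken && z->nr_writeback == z->nr_taken)
            z->flags |= ZONE_FLAG_WRITEBACK;
    }

    /* kswapd blocks only when the zone is flagged and the page is being
     * recycled through the LRU before its IO has completed */
    static bool kswapd_should_block(const struct zone_model *z,
                                    const struct page_model *p)
    {
        return (z->flags & ZONE_FLAG_WRITEBACK) && p->writeback && p->reclaim;
    }

    int main(void)
    {
        struct zone_model z = { .nr_taken = 32, .nr_writeback = 32 };
        struct page_model p = { .writeback = true, .reclaim = true };

        note_writeback_pages(&z);
        printf("block waiting for IO? %d\n", kswapd_should_block(&z, &p));
        return 0;
    }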

    Signed-off-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Tested-by: Zlatko Calusic
    Cc: dormando
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently kswapd queues dirty pages for writeback if scanning at an
    elevated priority but the priority kswapd scans at is not related to the
    number of unqueued dirty pages encountered. Since commit "mm: vmscan: Flatten
    kswapd priority loop", the priority is related to the size of the LRU
    and the zone watermark which is no indication as to whether kswapd
    should write pages or not.

    This patch tracks if an excessive number of unqueued dirty pages are
    being encountered at the end of the LRU. If so, it indicates that dirty
    pages are being recycled before flusher threads can clean them and flags
    the zone so that kswapd will start writing pages until the zone is
    balanced.
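
    A minimal userspace model of that flagging, assuming that an entirely
    unqueued-dirty batch counts as "excessive" and using invented structures
    and helpers (the series names the flag ZONE_TAIL_LRU_DIRTY):

    #include <stdbool.h>
    #include <stdio.h>

    struct zone_model {
        bool tail_lru_dirty;      /* models the ZONE_TAIL_LRU_DIRTY flag */
        unsigned long free_pages;
        unsigned long high_wmark;
    };

    /* flag the zone when a whole batch at the LRU tail was dirty but not
     * yet queued for writeback by the flushers */
    static void account_tail_scan(struct zone_model *z, unsigned long nr_taken,
                                  unsigned long nr_unqueued_dirty)
    {
        if (nr_taken && nr_unqueued_dirty == nr_taken)
            z->tail_lru_dirty = true;
    }

    static bool zone_balanced(const struct zone_model *z)
    {
        return z->free_pages >= z->high_wmark;
    }

    /* kswapd keeps writing pages while the flag is set; the flag is
     * dropped once the zone is balanced again */
    static bool kswapd_may_writepage(struct zone_model *z)
    {
        if (zone_balanced(z))
            z->tail_lru_dirty = false;
        return z->tail_lru_dirty;
    }

    int main(void)
    {
        struct zone_model z = { .free_pages = 100, .high_wmark = 512 };

        account_tail_scan(&z, 32, 32);
        printf("write pages? %d\n", kswapd_may_writepage(&z)); /* 1 */
        z.free_pages = 1024;
        printf("write pages? %d\n", kswapd_may_writepage(&z)); /* 0 */
        return 0;
    }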

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Tested-by: Zlatko Calusic
    Cc: dormando
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Page reclaim at priority 0 will scan the entire LRU as priority 0 is
    considered to be a near OOM condition. Kswapd can reach priority 0
    quite easily if it is encountering a large number of pages it cannot
    reclaim such as pages under writeback. When this happens, kswapd
    reclaims very aggressively even though there may be no real risk of
    allocation failure or OOM.

    This patch prevents kswapd reaching priority 0 and trying to reclaim the
    world. Direct reclaimers will still reach priority 0 in the event of an
    OOM situation.
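
    A small C sketch of the clamp; DEF_PRIORITY mirrors the kernel's value of
    12 but the loop bodies and the "balanced" condition are placeholders
    invented for the sketch:

    #include <stdbool.h>
    #include <stdio.h>

    #define DEF_PRIORITY 12

    static bool balanced;   /* stand-in for "the pgdat is balanced" */

    static void shrink_at(const char *who, int priority)
    {
        printf("%s scanning at priority %d\n", who, priority);
    }

    /* kswapd stops before priority 0, so it never scans the entire LRU */
    static void kswapd_loop(void)
    {
        int priority = DEF_PRIORITY;

        do {
            shrink_at("kswapd", priority);
        } while (!balanced && --priority >= 1);
    }

    /* direct reclaim may still reach priority 0 in a near-OOM situation */
    static void direct_reclaim_loop(void)
    {
        int priority = DEF_PRIORITY;

        do {
            shrink_at("direct", priority);
        } while (!balanced && --priority >= 0);
    }

    int main(void)
    {
        kswapd_loop();
        direct_reclaim_loop();
        return 0;
    }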

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Tested-by: Zlatko Calusic
    Cc: dormando
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In the past, kswapd made a decision on whether to compact memory after
    the pgdat was considered balanced. This more or less worked but it is
    late to make such a decision and does not fit well now that kswapd makes
    a decision whether to exit the zone scanning loop depending on reclaim
    progress.

    This patch will compact a pgdat if at least the requested number of
    pages were reclaimed from unbalanced zones for a given priority. If any
    zone is currently balanced, kswapd will not call compaction as it is
    expected the necessary pages are already available.
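
    Sketched below in standalone C with invented names: compaction is
    requested only if the unbalanced zones together reclaimed at least the
    requested number of pages, and not at all if some zone is already
    balanced for the allocation:

    #include <stdbool.h>
    #include <stdio.h>

    struct zone_model {
        bool balanced;           /* watermark already met for this order? */
        unsigned long reclaimed; /* pages reclaimed at this priority */
    };

    static bool should_compact_pgdat(const struct zone_model *zones,
                                     int nr_zones, unsigned long nr_requested)
    {
        unsigned long reclaimed_from_unbalanced = 0;

        for (int i = 0; i < nr_zones; i++) {
            /* a balanced zone should already have the pages available */
            if (zones[i].balanced)
                return false;
            reclaimed_from_unbalanced += zones[i].reclaimed;
        }
        return reclaimed_from_unbalanced >= nr_requested;
    }

    int main(void)
    {
        struct zone_model zones[] = {
            { .balanced = false, .reclaimed = 96 },
            { .balanced = false, .reclaimed = 64 },
        };

        printf("compact? %d\n", should_compact_pgdat(zones, 2, 128)); /* 1 */
        return 0;
    }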

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Tested-by: Zlatko Calusic
    Cc: dormando
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kswapd stops raising the scanning priority when at least
    SWAP_CLUSTER_MAX pages have been reclaimed or the pgdat is considered
    balanced. It then rechecks if it needs to restart at DEF_PRIORITY and
    whether high-order reclaim needs to be reset. This is not wrong per se
    but it is confusing to follow and forcing kswapd to stay at DEF_PRIORITY
    may require several restarts before it has scanned enough pages to meet
    the high watermark even at 100% efficiency. This patch irons out the
    logic a bit by controlling when priority is raised and removing the
    "goto loop_again".

    This patch has kswapd raise the scanning priority until it is scanning
    enough pages that it could meet the high watermark in one shrink of the
    LRU lists if it is able to reclaim at 100% efficiency. It will not
    raise the scanning priority higher unless it is failing to reclaim any
    pages.

    To avoid infinite looping for high-order allocation requests kswapd will
    not reclaim for high-order allocations when it has reclaimed at least
    twice the number of pages as the allocation request.
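
    The following userspace sketch models that logic with invented figures:
    the scan target at a given priority is the LRU size shifted right by the
    priority, and the priority is only raised while that target is too small
    to meet the high watermark, or when nothing at all was reclaimed:

    #include <stdbool.h>
    #include <stdio.h>

    #define DEF_PRIORITY 12

    /* "raising" the priority means lowering the number, which doubles the
     * slice of the LRU scanned on the next pass */
    static bool raise_priority(unsigned long lru_pages, unsigned long high_wmark,
                               unsigned long nr_reclaimed, int priority)
    {
        unsigned long scan_target = lru_pages >> priority;

        if (nr_reclaimed == 0)
            return true;                 /* no progress at all */
        return scan_target < high_wmark; /* one pass cannot meet the watermark */
    }

    int main(void)
    {
        unsigned long lru_pages = 1UL << 20; /* ~4GB of 4K pages */
        unsigned long high_wmark = 8192;
        unsigned long nr_reclaimed = 32;     /* some forward progress */
        int priority = DEF_PRIORITY;

        while (priority > 1 &&
               raise_priority(lru_pages, high_wmark, nr_reclaimed, priority))
            priority--;

        printf("settled at priority %d (scan target %lu pages)\n",
               priority, lru_pages >> priority);
        return 0;
    }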

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Tested-by: Zlatko Calusic
    Cc: dormando
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Simplistically, the anon and file LRU lists are scanned proportionally
    depending on the value of vm.swappiness although there are other factors
    taken into account by get_scan_count(). The patch "mm: vmscan: Limit
    the number of pages kswapd reclaims" limits the number of pages kswapd
    reclaims but it breaks this proportional scanning and may evenly shrink
    anon/file LRUs regardless of vm.swappiness.

    This patch preserves the proportional scanning and reclaim. It does
    mean that kswapd will reclaim more than requested but the number of
    pages will be related to the high watermark.
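
    A simplified, standalone C model of preserving the ratio, with invented
    numbers: once the reclaim target is met part-way through, the list that
    is behind is scanned only up to the same fraction of its original target,
    so the swappiness-derived split is kept:

    #include <stdio.h>

    int main(void)
    {
        /* scan targets from a swappiness-style anon/file split (1:3) */
        unsigned long target_anon = 4000, target_file = 12000;

        /* progress at the point enough pages had been reclaimed */
        unsigned long scanned_anon = 1000, scanned_file = 6000;

        /* the file list is furthest along: 6000/12000 = 50% of its target */
        unsigned long done_num = scanned_file, done_den = target_file;

        /* scan anon only up to the same 50% of its own target */
        unsigned long anon_stop = target_anon * done_num / done_den;
        unsigned long extra_anon = anon_stop > scanned_anon ?
                                   anon_stop - scanned_anon : 0;

        /* 1000 more anon pages: the final 2000:6000 matches the 1:3 split */
        printf("scan %lu more anon pages, then stop both lists\n", extra_anon);
        return 0;
    }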

    [mhocko@suse.cz: Correct proportional reclaim for memcg and simplify]
    [kamezawa.hiroyu@jp.fujitsu.com: Recalculate scan based on target]
    [hannes@cmpxchg.org: Account for already scanned pages properly]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Tested-by: Zlatko Calusic
    Cc: dormando
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This series does not fix all the current known problems with reclaim but
    it addresses one important swapping bug when there is background IO.

    Changelog since V3
    - Drop the slab shrink changes in light of Glauber's series and
    discussions that highlighted a number of potential
    problems with the patch. (mel)
    - Rebased to 3.10-rc1

    Changelog since V2
    - Preserve ratio properly for proportional scanning (kamezawa)

    Changelog since V1
    - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY (andi)
    - Reformat comment in shrink_page_list (andi)
    - Clarify some comments (dhillf)
    - Rework how the proportional scanning is preserved
    - Add PageReclaim check before kswapd starts writeback
    - Reset sc.nr_reclaimed on every full zone scan

    Kswapd and page reclaim behaviour has been screwy in one way or another
    for a long time. Very broadly speaking it worked in the far past
    because machines were limited in memory so it did not have that many
    pages to scan and it stalled congestion_wait() frequently to prevent it
    going completely nuts. In recent times it has behaved very
    unsatisfactorily with some of the problems compounded by the removal of
    stall logic and the introduction of transparent hugepage support with
    high-order reclaims.

    There are many variations of bugs that are rooted in this area. One
    example is reports of a large copy operation or backup causing the
    machine to grind to a halt or applications pushed to swap. Sometimes in
    low memory situations a large percentage of memory suddenly gets
    reclaimed. In other cases an application starts and kswapd hits 100%
    CPU usage for prolonged periods of time and so on. There is now talk of
    introducing features like an extra free kbytes tunable to work around
    aspects of the problem instead of trying to deal with it. It's
    compounded by the problem that it can be very workload and machine
    specific.

    This series aims at addressing some of the worst of these problems
    without attempting to fundamentally alter how page reclaim works.

    Patches 1-2 limit the number of pages kswapd reclaims while still obeying
    the anon/file proportion of the LRUs it should be scanning.

    Patches 3-4 control how and when kswapd raises its scanning priority and
    delete the scanning restart logic which is tricky to follow.

    Patch 5 notes that it is too easy for kswapd to reach priority 0 when
    scanning and then reclaim the world. Down with that sort of thing.

    Patch 6 notes that kswapd starts writeback based on scanning priority which
    is not necessarily related to dirty pages. It will have kswapd
    writeback pages if a number of unqueued dirty pages have been
    recently encountered at the tail of the LRU.

    Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
    to reduce LRU churn and the likelihood that it'll reclaim young
    clean pages or push applications to swap. It will cause kswapd
    to block on IO if it detects that pages being reclaimed under
    writeback are recycling through the LRU before the IO completes.

    Patches 8-9 are cosmetic but balance_pgdat() is easier to follow after they
    are applied.

    This was tested using memcached+memcachetest while some background IO
    was in progress, as implemented by the parallel IO tests in MM
    Tests.

    memcachetest benchmarks how many operations/second memcached can service
    and it is run multiple times. It starts with no background IO and then
    re-runs the test with larger amounts of IO in the background to roughly
    simulate a large copy in progress. The expectation is that the IO
    should have little or no impact on memcachetest which is running
    entirely in memory.

    3.10.0-rc1 3.10.0-rc1
    vanilla lessdisrupt-v4
    Ops memcachetest-0M 22155.00 ( 0.00%) 22180.00 ( 0.11%)
    Ops memcachetest-715M 22720.00 ( 0.00%) 22355.00 ( -1.61%)
    Ops memcachetest-2385M 3939.00 ( 0.00%) 23450.00 (495.33%)
    Ops memcachetest-4055M 3628.00 ( 0.00%) 24341.00 (570.92%)
    Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%)
    Ops io-duration-2385M 118.00 ( 0.00%) 21.00 ( 82.20%)
    Ops io-duration-4055M 162.00 ( 0.00%) 36.00 ( 77.78%)
    Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swaptotal-715M 140134.00 ( 0.00%) 18.00 ( 99.99%)
    Ops swaptotal-2385M 392438.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swaptotal-4055M 449037.00 ( 0.00%) 27864.00 ( 93.79%)
    Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swapin-715M 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swapin-2385M 148031.00 ( 0.00%) 0.00 ( 0.00%)
    Ops swapin-4055M 135109.00 ( 0.00%) 0.00 ( 0.00%)
    Ops minorfaults-0M 1529984.00 ( 0.00%) 1530235.00 ( -0.02%)
    Ops minorfaults-715M 1794168.00 ( 0.00%) 1613750.00 ( 10.06%)
    Ops minorfaults-2385M 1739813.00 ( 0.00%) 1609396.00 ( 7.50%)
    Ops minorfaults-4055M 1754460.00 ( 0.00%) 1614810.00 ( 7.96%)
    Ops majorfaults-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
    Ops majorfaults-715M 185.00 ( 0.00%) 180.00 ( 2.70%)
    Ops majorfaults-2385M 24472.00 ( 0.00%) 101.00 ( 99.59%)
    Ops majorfaults-4055M 22302.00 ( 0.00%) 229.00 ( 98.97%)

    Note how the vanilla kernel's performance collapses when there is enough
    IO taking place in the background. This drop in performance is part of
    what users complain of when they start backups. Note how the swapin and
    major fault figures indicate that processes were being pushed to swap
    prematurely. With the series applied, there is no noticeable performance
    drop and while there is still some swap activity, it's tiny.

    20 iterations of this test were run in total and averaged. Every 5
    iterations, additional IO was generated in the background using dd to
    measure how the workload was impacted. The 0M, 715M, 2385M and 4055M
    subblock refer to the amount of IO going on in the background at each
    iteration. So memcachetest-2385M is reporting how many
    transactions/second memcachetest recorded on average over 5 iterations
    while there was 2385M of IO going on in the background. There are six
    blocks of information reported here:

    memcachetest is the transactions/second reported by memcachetest. In
    the vanilla kernel, note that performance drops from around
    22K/sec to just under 4K/sec when there is 2385M of IO going
    on in the background. This is one type of performance collapse
    users complain about if a large cp or backup starts in the
    background.

    io-duration refers to how long it takes for the background IO to
    complete. It's showing that with the patched kernel the IO
    completes faster while not interfering with the memcache
    workload.

    swaptotal is the total amount of swap traffic. With the patched kernel,
    the total amount of swapping is much reduced although it is
    still not zero.

    swapin in this case is an indication as to whether we are swap thrashing.
    The closer the swapin/swapout ratio is to 1, the worse the
    thrashing is. Note with the patched kernel that there is no swapin
    activity indicating that all the pages swapped were really inactive
    unused pages.

    minorfaults are just minor faults. An increased number of minor faults
    can indicate that page reclaim is unmapping the pages but not
    swapping them out before they are faulted back in. With the
    patched kernel, there is only a small change in minor faults.

    majorfaults are just major faults in the target workload and a high
    number can indicate that a workload is being prematurely
    swapped. With the patched kernel, major faults are much reduced. As
    there are no swapins recorded, it's not being swapped. The likely
    explanation is that libraries or configuration files used by
    the workload during startup get paged out by the background IO.

    Overall with the series applied, there is no noticeable performance drop
    due to background IO and while there is still some swap activity, it's
    tiny and the lack of swapins implies that the swapped pages were inactive
    and unused.

    3.10.0-rc1 3.10.0-rc1
    vanilla lessdisrupt-v4
    Page Ins 1234608 101892
    Page Outs 12446272 11810468
    Swap Ins 283406 0
    Swap Outs 698469 27882
    Direct pages scanned 0 136480
    Kswapd pages scanned 6266537 5369364
    Kswapd pages reclaimed 1088989 930832
    Direct pages reclaimed 0 120901
    Kswapd efficiency 17% 17%
    Kswapd velocity 5398.371 4635.115
    Direct efficiency 100% 88%
    Direct velocity 0.000 117.817
    Percentage direct scans 0% 2%
    Page writes by reclaim 1655843 4009929
    Page writes file 957374 3982047
    Page writes anon 698469 27882
    Page reclaim immediate 5245 1745
    Page rescued immediate 0 0
    Slabs scanned 33664 25216
    Direct inode steals 0 0
    Kswapd inode steals 19409 778
    Kswapd skipped wait 0 0
    THP fault alloc 35 30
    THP collapse alloc 472 401
    THP splits 27 22
    THP fault fallback 0 0
    THP collapse fail 0 1
    Compaction stalls 0 4
    Compaction success 0 0
    Compaction failures 0 4
    Page migrate success 0 0
    Page migrate failure 0 0
    Compaction pages isolated 0 0
    Compaction migrate scanned 0 0
    Compaction free scanned 0 0
    Compaction cost 0 0
    NUMA PTE updates 0 0
    NUMA hint faults 0 0
    NUMA hint local faults 0 0
    NUMA pages migrated 0 0
    AutoNUMA cost 0 0

    Unfortunately, note that there is a small amount of direct reclaim due to
    kswapd no longer reclaiming the world. ftrace indicates that the direct
    reclaim stalls are mostly harmless with the vast bulk of the stalls
    incurred by dd:

    23 tclsh-3367
    38 memcachetest-13733
    49 memcachetest-12443
    57 tee-3368
    1541 dd-13826
    1981 dd-12539

    A consequence of the direct reclaim for dd is that the processes for the
    IO workload may show a higher system CPU usage. There is also a risk that
    kswapd not reclaiming the world may mean that it stays awake balancing
    zones, does not stall on the appropriate events and continually scans
    pages it cannot reclaim consuming CPU. This will be visible as continued
    high CPU usage but in my own tests I only saw a single spike lasting less
    than a second and I did not observe any problems related to reclaim while
    running the series on my desktop.

    This patch:

    The number of pages kswapd can reclaim is bound by the number of pages it
    scans which is related to the size of the zone and the scanning priority.
    In many cases the priority remains low because it's reset every
    SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
    number of pages it cannot reclaim, it will raise the priority and
    potentially discard a large percentage of the zone as sc->nr_to_reclaim is
    ULONG_MAX. The user-visible effect is a reclaim "spike" where a large
    percentage of memory is suddenly freed. It would be bad enough if this
    was just unused memory but because of how anon/file pages are balanced it
    is possible that applications get pushed to swap unnecessarily.

    This patch limits the number of pages kswapd will reclaim to the high
    watermark. Reclaim will still overshoot due to it not being a hard limit
    as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
    prevents kswapd reclaiming the world at higher priorities. The number of
    pages it reclaims is not adjusted for high-order allocations as kswapd
    will reclaim excessively if it is to balance zones for high-order
    allocations.
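
    As a standalone C sketch with invented structures, the cap amounts to
    replacing an effectively unbounded target with the zone's high watermark,
    never going below one SWAP_CLUSTER_MAX batch:

    #include <limits.h>
    #include <stdio.h>

    #define SWAP_CLUSTER_MAX 32UL

    struct zone_model {
        unsigned long high_wmark; /* pages needed for the zone to be balanced */
    };

    /* the reclaim target is bounded by the high watermark instead of
     * being left at ULONG_MAX */
    static unsigned long kswapd_nr_to_reclaim(const struct zone_model *z)
    {
        return z->high_wmark > SWAP_CLUSTER_MAX ? z->high_wmark
                                                : SWAP_CLUSTER_MAX;
    }

    int main(void)
    {
        struct zone_model z = { .high_wmark = 4096 };

        printf("old target: %lu\n", ULONG_MAX);               /* reclaim the world */
        printf("new target: %lu\n", kswapd_nr_to_reclaim(&z)); /* bounded */
        return 0;
    }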

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Tested-by: Zlatko Calusic
    Cc: dormando
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman