08 Oct, 2016

2 commits

  • File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
    etc.) to accelerate finding the pages with a specific tag in the radix
    tree during inode writeback. But for anonymous pages in the swap cache,
    there is no inode writeback. So there is no need to find pages with
    particular writeback tags in the radix tree, and it is not necessary to
    touch the radix tree writeback tags for pages in the swap cache at all.

    Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
    introduced for address spaces which don't need to update the writeback
    tags. The flag is set for swap caches. It may be used for DAX file
    systems, etc.
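
    A minimal sketch of how such a flag could look and be queried
    (illustrative only; the helper names and their placement are assumptions,
    not necessarily the exact patch):

    /* include/linux/pagemap.h (sketch) */
    static inline int mapping_use_writeback_tags(struct address_space *mapping)
    {
            return !test_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
    }

    static inline void mapping_set_no_writeback_tags(struct address_space *mapping)
    {
            set_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
    }

    The tag updates in __test_set_page_writeback() and
    test_clear_page_writeback() can then be skipped whenever the check
    returns false, with swap cache mappings setting the flag at creation.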

    With this patch, the swap-out bandwidth improved by 22.3% (from ~1.2GB/s to
    ~1.48GB/s) in the vm-scalability swap-w-seq test case with 8 processes.
    The test is done on a Xeon E5 v3 system. The swap device used is a RAM
    simulated PMEM (persistent memory) device. The improvement comes from
    the reduced contention on the swap cache radix tree lock. To test
    sequential swapping out, the test case uses 8 processes, which
    sequentially allocate and write to the anonymous pages until RAM and
    part of the swap device is used up.

    Details of the comparison are as follows:

                    base                   base+patch
    ----------------------      --------------------------
          %stddev     %change         %stddev
    2506952 ±  2%      +28.1%     3212076 ±  7%  vm-scalability.throughput
    1207402 ±  7%      +22.3%     1476578 ±  6%  vmstat.swap.so
      10.86 ± 12%      -23.4%        8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
      10.82 ± 13%      -33.1%        7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
      10.36 ± 11%     -100.0%        0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
      10.52 ± 12%     -100.0%        0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page

    Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Wu Fengguang
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
    excessive pageout activity during reclaim. Too many pages could be put
    under writeback, so the LRUs would be full of unreclaimable pages until
    the IO completed and, in turn, the OOM killer could be invoked.

    There have been some important changes introduced since then in the
    reclaim path though. Writers are throttled by balance_dirty_pages when
    initiating buffered IO and, later, under memory pressure, direct reclaim
    is throttled by wait_iff_congested if the node is considered congested
    by dirty pages on the LRUs and the underlying bdi is congested by the
    queued IO. kswapd is throttled as well if it encounters pages marked
    for immediate reclaim or under writeback, which signals that
    there are too many pages under writeback already.
    Finally should_reclaim_retry does congestion_wait if the reclaim cannot
    make any progress and there are too many dirty/writeback pages.

    Another important aspect is that we do not issue any IO from the direct
    reclaim context anymore. In a heavily parallel load this could queue a
    lot of IO which would be very scattered and thus inefficient, which
    would just make the problem worse.

    These three mechanisms should throttle and keep the amount of IO in a
    steady state even under heavy IO and memory pressure, so yet another
    throttling point doesn't really seem helpful. Quite the contrary: Mikulas
    Patocka has reported that swap backed by dm-crypt doesn't work properly
    because the swapout IO cannot make sufficient progress as the writeout
    path depends on dm_crypt worker which has to allocate memory to perform
    the encryption. In order to guarantee a forward progress it relies on
    the mempool allocator. mempool_alloc(), however, prefers to use the
    underlying (usually page) allocator before it grabs objects from the
    pool. Such an allocation can dive into the memory reclaim and
    consequently into throttle_vm_writeout. If there are too many dirty
    pages or pages under writeback, it will get throttled even though it is
    in fact the flusher meant to clear those pending pages.
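
    For reference, the preference described above is in mempool_alloc()
    itself; a heavily simplified sketch (not the exact mm/mempool.c code):

    void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
    {
            void *element;

            /* first try the underlying (page/slab) allocator; the real
             * code relaxes the gfp flags for this attempt */
            element = pool->alloc(gfp_mask, pool->pool_data);
            if (element)
                    return element;

            /* only then fall back to the reserved objects in the pool,
             * possibly waiting for one to be returned */
            ...
    }

    It is that first attempt which can recurse into direct reclaim and, with
    this throttling point still present, into throttle_vm_writeout().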

    kworker/u4:0 D ffff88003df7f438 10488 6 2 0x00000000
    Workqueue: kcryptd kcryptd_crypt [dm_crypt]
    Call Trace:
    schedule+0x3c/0x90
    schedule_timeout+0x1d8/0x360
    io_schedule_timeout+0xa4/0x110
    congestion_wait+0x86/0x1f0
    throttle_vm_writeout+0x44/0xd0
    shrink_zone_memcg+0x613/0x720
    shrink_zone+0xe0/0x300
    do_try_to_free_pages+0x1ad/0x450
    try_to_free_pages+0xef/0x300
    __alloc_pages_nodemask+0x879/0x1210
    alloc_pages_current+0xa1/0x1f0
    new_slab+0x2d7/0x6a0
    ___slab_alloc+0x3fb/0x5c0
    __slab_alloc+0x51/0x90
    kmem_cache_alloc+0x27b/0x310
    mempool_alloc_slab+0x1d/0x30
    mempool_alloc+0x91/0x230
    bio_alloc_bioset+0xbd/0x260
    kcryptd_crypt+0x114/0x3b0 [dm_crypt]

    Let's just drop throttle_vm_writeout altogether. It is not very
    helpful anymore.

    I have tried to test a potential writeback IO runaway similar to the one
    described in the original patch which introduced this throttling [1]: a
    small virtual machine (512MB RAM, 4 CPUs, 2G of swap space and a disk
    image on a rather slow NFS mounted in sync mode on the host) with 8
    parallel writers, each writing 1G worth of data. As soon as the page
    cache fills up and direct reclaim kicks in, I start an anonymous memory
    consumer in a loop (allocating 300M and exiting after populating it) in
    the background, to make the memory pressure even stronger and to disrupt
    the steady state of the IO. Direct reclaim is throttled because of the
    congestion, and kswapd hits congestion_wait due to nr_immediate, but
    throttle_vm_writeout never triggers the sleep throughout the test.
    Dirty+writeback stay close to nr_dirty_threshold, with some fluctuations
    caused by the anon consumer.

    [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
    Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Mikulas Patocka
    Cc: Marcelo Tosatti
    Cc: NeilBrown
    Cc: Ondrej Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Sep, 2016

1 commit

  • Install the callbacks via the state machine and let the core invoke
    the callbacks on the already online CPUs.

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Jens Axboe
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20160818125731.27256-6-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

29 Jul, 2016

8 commits

  • If per-zone LRU accounting is available then there is no point
    approximating whether reclaim and compaction should retry based on pgdat
    statistics. This is effectively a revert of "mm, vmstat: remove zone
    and node double accounting by approximating retries" with the difference
    that inactive/active stats are still available. This preserves the
    history of why the approximation was tried and why it had to be
    reverted to handle OOM kills on 32-bit systems.

    Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With the reintroduction of per-zone LRU stats, highmem_file_pages is
    redundant so remove it.

    [mgorman@techsingularity.net: wrong stat is being accumulated in highmem_dirtyable_memory]
    Link: http://lkml.kernel.org/r/20160725092324.GM10438@techsingularity.net
    Link: http://lkml.kernel.org/r/1469110261-7365-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When I tested vmscale in mmtest on 32-bit, I found the benchmark slowed
    down significantly:

                 base        node
                 1           global-1
    User         12.98       16.04
    System      147.61      166.42
    Elapsed      26.48       38.08

    With vmstat, I found that the average IO wait was much higher than in
    the base.

    The reason was that highmem_dirtyable_memory() accumulated free pages and
    highmem_file_pages from the HIGHMEM zone up to the MOVABLE zone, which
    was wrong. With that, dirty_thresh in throttle_vm_writeout was always 0,
    so it called congestion_wait frequently once writeback started.

    With this patch, most of the regression is recovered.

                 base        node        fix
                 1           global-1    fix
    User         12.98       16.04       13.78
    System      147.61      166.42      143.92
    Elapsed      26.48       38.08       29.64

    Link: http://lkml.kernel.org/r/1468404004-5085-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Minchan Kim
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The number of LRU pages, dirty pages and writeback pages must be
    accounted for on both zones and nodes because of the reclaim retry
    logic, compaction retry logic and highmem calculations all depending on
    per-zone stats.

    Many lowmem allocations are immune from OOM kill due to a check in
    __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
    03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
    exception is costly high-order allocations or allocations that cannot
    fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
    allocations then it would fall through to __alloc_pages_direct_compact.

    This patch will blindly retry reclaim for zone-constrained allocations
    in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
    but without per-zone stats there are not many alternatives. The impact
    is that zone-constrained allocations may be delayed before the OOM
    killer is considered.

    As there is no guarantee enough memory can ever be freed to satisfy
    compaction, this patch avoids retrying compaction for zone-constrained
    allocations.

    In combination, that means that the per-node stats can be used when
    deciding whether to continue reclaim using a rough approximation. While
    it is possible this will make the wrong decision on occasion, it will
    not infinite loop as the number of reclaim attempts is capped by
    MAX_RECLAIM_RETRIES.

    The final step is calculating the number of dirtyable highmem pages. As
    those calculations only care about the global count of file pages in
    highmem, this patch uses a global counter instead of per-zone stats,
    which is sufficient.

    In combination, this allows the per-zone LRU and dirty state counters to
    be removed.

    [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
    Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As reclaim is now node-based, it follows that page write activity due to
    page reclaim should also be accounted for on the node. For consistency,
    also account page writes and page dirtying on a per-node basis.

    After this patch, there are a few remaining zone counters that may appear
    strange but are fine. NUMA stats are still per-zone as this is a
    user-space interface that tools consume. NR_MLOCK, NR_SLAB_*,
    NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
    potentially pin low memory and cannot trivially be reclaimed on demand.
    This information is still useful for debugging a page allocation failure
    warning.

    Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are now a number of accounting oddities such as mapped file pages
    being accounted for on the node while the total number of file pages are
    accounted on the zone. This can be coped with to some extent but it's
    confusing, so this patch moves the relevant file-based accounting to the
    node. Due to
    throttling logic in the page allocator for reliable OOM detection, it is
    still necessary to track dirty and writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Historically dirty pages were spread among zones but now that LRUs are
    per-node it is more appropriate to consider dirty pages in a node.

    Link: http://lkml.kernel.org/r/1467970510-21195-17-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both the zone and
    the node. Most reclaim logic is based on the node counters, but the
    retry logic uses the zone counters, which do not distinguish inactive
    and active sizes. It would be possible to leave the LRU counters on a
    per-zone basis, but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being per-node, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate; the potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

2 commits

  • Here's a basic implementation of huge page support for shmem/tmpfs.

    It's all pretty straightforward:

    - shmem_getpage() allocates a huge page if it can, and tries to insert
    it into the radix tree with shmem_add_to_page_cache();

    - shmem_add_to_page_cache() puts the page into the radix tree if there's
    space for it;

    - shmem_undo_range() removes huge pages if they fall fully within the
    range. A partial truncate of a huge page zeroes out that part of the
    THP.

    This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create a hole in this case,
    lseek(SEEK_HOLE) may have inconsistent results depending on which
    pages happened to be allocated.

    - no need to change shmem_fault(): the core mm will map a compound page
    as huge if the VMA is suitable;

    Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • wait_sb_inodes() currently does a walk of all inodes in the filesystem
    to find dirty ones to wait on during sync. This is highly inefficient
    and wastes a lot of CPU when there are lots of clean cached inodes that
    we don't need to wait on.

    To avoid this "all inode" walk, we need to track inodes that are
    currently under writeback that we need to wait for. We do this by
    adding inodes to a writeback list on the sb when the mapping is first
    tagged as having pages under writeback. wait_sb_inodes() can then walk
    this list of "inodes under IO" and wait specifically just for the inodes
    that the current sync(2) needs to wait for.

    Define a couple of helpers to add/remove an inode from the writeback list
    and call them when the overall mapping is tagged for or cleared from
    writeback. Update wait_sb_inodes() to walk only the inodes under
    writeback due to the sync.
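
    A rough sketch of the "mark" helper described above (the list and lock
    names follow this description and are assumptions about the final patch):

    /* fs/fs-writeback.c (sketch) */
    void sb_mark_inode_writeback(struct inode *inode)
    {
            struct super_block *sb = inode->i_sb;
            unsigned long flags;

            if (list_empty(&inode->i_wb_list)) {
                    spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
                    if (list_empty(&inode->i_wb_list))
                            list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
                    spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
            }
    }

    The "clear" helper does the symmetric list_del_init() when the mapping's
    last writeback tag is cleared, and wait_sb_inodes() then iterates only
    sb->s_inodes_wb.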

    With this change, filesystem sync times are significantly reduced for
    filesystems with heavily populated inode caches and otherwise no other
    work to do. For example, on a 16-CPU 2GHz x86-64 server, a 10TB XFS filesystem
    with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
    than 0.1s when the filesystem is fully clean.

    Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Signed-off-by: Brian Foster
    Reviewed-by: Jan Kara
    Tested-by: Holger Hoffstätte
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

30 May, 2016

1 commit

  • As vm.dirty_[background_]bytes can't be applied verbatim to multiple
    cgroup writeback domains, they get converted to percentages in
    domain_dirty_limits() and applied the same way as
    vm.dirty_[background_]ratio. However, if the specified byte value is
    lower than 1% of available memory, the calculated ratios become zero and
    the writeback domain gets throttled constantly.

    Fix it by using per-PAGE_SIZE instead of percentage for ratio
    calculations. Also, the updated DIV_ROUND_UP() usage should now yield
    1/4096 (0.0244%) as the minimum ratio as long as the specified byte
    values are above zero.
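
    Roughly, the conversion then becomes (sketch of domain_dirty_limits();
    ratios are kept in units of 1/PAGE_SIZE instead of percent):

    unsigned long bytes = vm_dirty_bytes;
    unsigned long bg_bytes = dirty_background_bytes;
    /* convert ratios to per-PAGE_SIZE for higher precision */
    unsigned long ratio = (vm_dirty_ratio * PAGE_SIZE) / 100;
    unsigned long bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100;

    if (bytes)
            thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
    else
            thresh = (ratio * available_memory) / PAGE_SIZE;

    if (bg_bytes)
            bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
    else
            bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;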

    Signed-off-by: Tejun Heo
    Reported-by: Miao Xie
    Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
    Cc: stable@vger.kernel.org # v4.2+
    Fixes: 9fc3a43e1757 ("writeback: separate out domain_dirty_limits()")
    Reviewed-by: Jan Kara

    Adjusted comment based on Jan's suggestion.
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 May, 2016

1 commit

  • When nfsd is exporting a filesystem over NFS which is then NFS-mounted
    on the local machine there is a risk of deadlock. This happens when
    there are lots of dirty pages in the NFS filesystem and they cause NFSD
    to be throttled, either in throttle_vm_writeout() or in
    balance_dirty_pages().

    To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads
    and it provides a 25% increase to the limits that affect NFSD. Any
    process writing to an NFS filesystem will be throttled well before the
    number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD
    will not deadlock on pages that it needs to write out. At least it
    shouldn't.

    All processes are allowed a small excess margin to avoid performing too
    many calculations: ratelimit_pages.

    ratelimit_pages is set so that if a thread on every CPU uses the entire
    margin, the total will only go 3% over the limit, and this is much less
    than the 25% bonus that PF_LESS_THROTTLE provides, so this margin
    shouldn't be a problem. But it is.

    The "total memory" that these 3% and 25% are calculated against are not
    really total memory but are "global_dirtyable_memory()" which doesn't
    include anonymous memory, just free memory and page-cache memory.

    The "ratelimit_pages" number is based on whatever the
    global_dirtyable_memory was on the last CPU hot-plug, which might not be
    what you expect, but is probably close to the total freeable memory.

    The throttle threshold uses the global_dirtyable_memory at the moment
    when the throttling happens, which could be much less than at the last
    CPU hotplug. So if lots of anonymous memory has been allocated, thus
    pushing out lots of page-cache pages, then NFSD might end up being
    throttled due to dirty NFS pages because the "25%" bonus it gets is
    calculated against a rather small amount of dirtyable memory, while the
    "3%" margin that other processes are allowed to dirty without penalty is
    calculated against a much larger number.

    To remove this possibility of deadlock we need to make sure that the
    margin granted to PF_LESS_THROTTLE exceeds the rate-limit margin.
    Simply adding ratelimit_pages isn't enough, as it would need to be
    multiplied by the number of CPUs.

    So add "global_wb_domain.dirty_limit / 32" as that more accurately
    reflects the current total over-shoot margin. This ensures that the
    number of dirty NFS pages never gets so high that nfsd will be throttled
    waiting for them to be written.
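
    In domain_dirty_limits() the bonus then looks roughly like this (sketch):

    if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
            /* 25% bonus plus the global over-shoot margin */
            bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
            thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
    }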

    Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name
    Signed-off-by: NeilBrown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

20 May, 2016

1 commit

  • ZONE_MOVABLE could be treated as highmem, so we need to consider it for
    an accurate calculation of dirty pages. And, in following patches,
    ZONE_CMA will be introduced and it can be treated as highmem, too. So,
    instead of manually adding the stats of ZONE_MOVABLE, loop over all
    zones, check whether each zone can be treated as highmem, and add the
    stats of the zones which can.
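
    Something along these lines (rough sketch of highmem_dirtyable_memory();
    zone_dirtyable_memory() stands in for whatever per-zone stats are summed):

    unsigned long x = 0;
    int node, i;

    for_each_node_state(node, N_HIGH_MEMORY) {
            for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
                    struct zone *z = &NODE_DATA(node)->node_zones[i];

                    if (!is_highmem_idx(i) || !populated_zone(z))
                            continue;

                    x += zone_dirtyable_memory(z);
            }
    }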

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Laura Abbott
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

07 May, 2016

1 commit


06 May, 2016

1 commit

  • Commit 947e9762a8dd ("writeback: update wb_over_bg_thresh() to use
    wb_domain aware operations") unintentionally changed this function's
    meaning from "are there more dirty pages than the background writeback
    threshold" to "are there more dirty pages than the writeback threshold".
    The background writeback threshold is typically half of the writeback
    threshold, so this had the effect of raising the number of dirty pages
    required to cause a writeback worker to perform background writeout.

    This can cause a very severe performance regression when a BDI uses
    BDI_CAP_STRICTLIMIT because balance_dirty_pages() and the writeback worker
    can now disagree on whether writeback should be initiated.

    For example, in a system having 1GB of RAM, a single spinning disk, and a
    "pass-through" FUSE filesystem mounted over the disk, application code
    mmapped a 128MB file on the disk and was randomly dirtying pages in that
    mapping.

    Because FUSE uses strictlimit and has a default max_ratio of only 1%, in
    balance_dirty_pages, thresh is ~200, bg_thresh is ~100, and the
    dirty_freerun_ceiling is the average of those, ~150. So, it pauses the
    dirtying processes when we have 151 dirty pages and wakes up a background
    writeback worker. But the worker tests the wrong threshold (200 instead of
    100), so it does not initiate writeback and just returns.

    Thus, balance_dirty_pages keeps looping, sleeping and then waking up the
    worker who will do nothing. It remains stuck in this state until the few
    dirty pages that we have finally expire and we write them back for that
    reason. Then the whole process repeats, resulting in near-zero throughput
    through the FUSE BDI.

    The fix is to call the parameterized variant of wb_calc_thresh, so that the
    worker will do writeback if bg_thresh is exceeded, which was the
    behavior before the referenced commit.
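
    In wb_over_bg_thresh() that amounts to roughly (sketch of the fix):

    /* compare the per-wb reclaimable pages against bg_thresh, not thresh */
    if (wb_stat(wb, WB_RECLAIMABLE) >
        wb_calc_thresh(gdtc->wb, gdtc->bg_thresh))
            return true;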

    Fixes: 947e9762a8dd ("writeback: update wb_over_bg_thresh() to use wb_domain aware operations")
    Signed-off-by: Howard Cochran
    Acked-by: Tejun Heo
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org # v4.2+
    Tested-by: Sedat Dilek
    Signed-off-by: Jens Axboe

    Howard Cochran
     

05 Apr, 2016

1 commit

  • The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to implement
    the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion on whether the
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Mar, 2016

4 commits

  • There are several users that nest lock_page_memcg() inside lock_page()
    to prevent page->mem_cgroup from changing. But the page lock prevents
    pages from moving between cgroups, so that is unnecessary overhead.

    Remove lock_page_memcg() in contexts with locked pages and fix the
    debug code in the page stat functions to be okay with the page lock.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches tag the page cache radix tree eviction entries with the
    memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
    work properly and be as adaptive to new cache workingsets as global
    reclaim already is.

    This should have been part of the original thrash detection patch
    series, but was deferred due to the complexity of those patches.

    This patch (of 5):

    So far the only sites that needed to exclude charge migration to
    stabilize page->mem_cgroup have been per-cgroup page statistics, hence
    the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection
    will add another site that needs to ensure page->mem_cgroup lifetime.

    Rename these locking functions to the more generic lock_page_memcg() and
    unlock_page_memcg(). Since charge migration is a cgroup1 feature only,
    we might be able to delete it at some point, and these now
    easy-to-identify locking sites along with it.

    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Calculation of dirty_ratelimit is sometimes not correct. E.g. the
    initial values of dirty_ratelimit == INIT_BW and step == 0 lead to the
    following result:

    UBSAN: Undefined behaviour in ../mm/page-writeback.c:1286:7
    shift exponent 25600 is too large for 64-bit type 'long unsigned int'

    The fix is straightforward - make step 0 if the shift exponent is too
    big.
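
    In wb_update_dirty_ratelimit() the fix looks roughly like (sketch):

    shift = dirty_ratelimit / (2 * step + 1);
    if (shift < BITS_PER_LONG)
            step = DIV_ROUND_UP(step >> shift, 8);
    else
            step = 0;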

    Signed-off-by: Andrey Ryabinin
    Cc: Wu Fengguang
    Cc: Tejun Heo
    Cc: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

15 Jan, 2016

1 commit

  • The dirty balance reserve that dirty throttling has to consider is
    merely memory not available to userspace allocations. There is nothing
    writeback-specific about it. Generalize the name so that it's reusable
    outside of that context.

    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Nov, 2015

1 commit

  • There were still a number of references to my old Red Hat email
    address in the kernel source. Remove these while keeping the
    Red Hat copyright notices intact.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

21 Nov, 2015

1 commit

  • When building kernel with gcc 5.2, the below warning is raised:

    mm/page-writeback.c: In function 'balance_dirty_pages.isra.10':
    mm/page-writeback.c:1545:17: warning: 'm_dirty' may be used uninitialized in this function [-Wmaybe-uninitialized]
    unsigned long m_dirty, m_thresh, m_bg_thresh;

    The m_dirty, m_thresh and m_bg_thresh variables are initialized in the
    "if (mdtc)" block, so if mdtc is NULL they won't be initialized before
    being used. Initialize m_dirty to zero, and also initialize m_thresh and
    m_bg_thresh to keep things consistent.

    They are only read later under conditions like "!mdtc || m_dirty ...",
    which short-circuit when mdtc is NULL, so the initialization merely
    silences the warning and does not change behaviour.
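
    The change itself amounts to something like (sketch):

    unsigned long m_dirty = 0;      /* stop bogus uninit warnings */
    unsigned long m_thresh = 0;
    unsigned long m_bg_thresh = 0;
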
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

13 Oct, 2015

4 commits

  • For memcg domains, the amount of available memory was calculated as

    min(the amount currently in use + headroom according to memcg,
    total clean memory)

    This isn't quite correct as what should be capped by the amount of
    clean memory is the headroom, not the sum of memory in use and
    headroom. For example, if a memcg domain has a significant amount of
    dirty memory, the above can lead to a value which is lower than the
    current amount in use which doesn't make much sense. In most
    circumstances, the above leads to a number which is somewhat but not
    drastically lower.

    As the amount of memory which can be readily allocated to the memcg
    domain is capped by the amount of system-wide clean memory which is
    not already assigned to the memcg itself, the number we want is

    the amount currently in use +
    min(headroom according to memcg, clean memory elsewhere in the system)

    This patch updates mem_cgroup_wb_stats() to return the number of
    filepages and headroom instead of the calculated available pages.
    mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
    above calculation from file, headroom, dirty and globally clean pages.
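
    A sketch of the resulting calculation (names as described above; details
    may differ slightly from the final patch):

    static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
                                unsigned long filepages, unsigned long headroom)
    {
            struct dirty_throttle_control *gdtc = mdtc_gdtc(mdtc);
            unsigned long clean = filepages - min(filepages, mdtc->dirty);
            unsigned long global_clean = gdtc->avail - min(gdtc->avail, gdtc->dirty);
            unsigned long other_clean = global_clean - min(global_clean, clean);

            mdtc->avail = filepages + min(headroom, other_clean);
    }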

    v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading
    to build failure when !CGROUP_WRITEBACK. Fixed.

    Signed-off-by: Tejun Heo
    Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling")
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • MDTC_INIT() is used to initialize dirty_throttle_control for memcg
    domains. It used DTC_INIT_COMMON() to initialize mdtc->wb and
    ->wb_completions, which is incorrect as DTC_INIT_COMMON() sets the
    latter to wb->completions instead of wb->memcg_completions. This can
    lead to wildly incorrect results when calculating the proportion of
    dirty memory the memcg domain should get.

    Remove DTC_INIT_COMMON() and update MDTC_INIT() to initialize
    mdtc->wb_completions to wb->memcg_completions.
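
    Roughly, the initializers then end up looking like this (sketch):

    #define GDTC_INIT(__wb)         .wb = (__wb),                           \
                                    .dom = &global_wb_domain,               \
                                    .wb_completions = &(__wb)->completions

    #define MDTC_INIT(__wb, __gdtc) .wb = (__wb),                           \
                                    .dom = mem_cgroup_wb_domain(__wb),      \
                                    .gdtc = (__gdtc),                       \
                                    .wb_completions = &(__wb)->memcg_completions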

    Signed-off-by: Tejun Heo
    Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling")
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bdi_for_each_wb() is used in several places to wake up or issue
    writeback work items to all wb's (bdi_writeback's) on a given bdi.
    The iteration is performed by walking bdi->cgwb_tree; however, the
    tree only indexes wb's which are currently active.

    For example, when a memcg gets associated with a different blkcg, the
    old wb is removed from the tree so that the new one can be indexed.
    The old wb starts dying from then on but will linger till all its
    inodes are drained. As these dying wb's may still host dirty inodes,
    writeback operations which affect all wb's must include them.
    bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
    failing to sync the inodes belonging to those wb's.

    This patch adds an RCU-protected @bdi->wb_list which lists all wb's
    belonging to that bdi. wb's are added on creation and removed on
    release rather than on the start of destruction. bdi_for_each_wb()
    usages are replaced with list_for_each[_continue]_rcu() iterations
    over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.

    v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Artem Bityutskiy
    Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
    Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • laptop_mode_timer_fn() was using bdi_for_each_wb() without the
    required RCU locking leading to the following warning.

    WARNING: CPU: 0 PID: 0 at include/linux/backing-dev.h:415 laptop_mode_timer_fn+0x106/0x170()
    ...
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] laptop_mode_timer_fn+0x106/0x170
    [] call_timer_fn+0xb3/0x2f0
    [] run_timer_softirq+0x205/0x370
    [] __do_softirq+0xd4/0x460
    [] irq_exit+0x89/0xa0
    [] smp_apic_timer_interrupt+0x42/0x50
    [] apic_timer_interrupt+0x84/0x90
    ...

    Fix it by adding rcu_read_lock() around the iteration.
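
    i.e. roughly (sketch; iterator and helper signatures as of that kernel):

    rcu_read_lock();
    bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
            if (wb_has_dirty_io(wb))
                    wb_start_writeback(wb, nr_pages, true,
                                       WB_REASON_LAPTOP_TIMER);
    rcu_read_unlock();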

    Signed-off-by: Tejun Heo
    Fixes: a06fd6b10228 ("writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's")
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Sep, 2015

1 commit

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a
    while. This is basically four patchsets from Tejun that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is set of 5 patches that has cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - Patches were added to consolidate the blkcg entry point and blkg
    creation. This in itself is an improvement and helps collect
    common stats on bio issue.

    - per-blkg stats now accounted on bio issue rather than request
    completion so that bio based and request based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset make blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     

19 Aug, 2015

1 commit

  • The following tracepoints are updated to report the cgroup used during
    cgroup writeback.

    * writeback_write_inode[_start]
    * writeback_queue
    * writeback_exec
    * writeback_start
    * writeback_written
    * writeback_wait
    * writeback_nowork
    * writeback_wake_background
    * wbc_writepage
    * writeback_queue_io
    * bdi_dirty_ratelimit
    * balance_dirty_pages
    * writeback_sb_inodes_requeue
    * writeback_single_inode[_start]

    Note that writeback_bdi_register is separated out from writeback_class
    as reporting the cgroup doesn't make sense for it. Tracepoints which
    take a bdi are updated to take a bdi_writeback instead.

    Signed-off-by: Tejun Heo
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

07 Aug, 2015

1 commit

  • The initial value of global_wb_domain.dirty_limit set by
    writeback_set_ratelimit() is zeroed out by the memset in
    wb_domain_init().

    Signed-off-by: Rabin Vincent
    Acked-by: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rabin Vincent
     

26 Jun, 2015

1 commit

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     

02 Jun, 2015

6 commits

  • The mechanism for detecting whether an inode should switch its wb
    (bdi_writeback) association is now in place. This patch builds the
    framework for the actual switching.

    This patch adds a new inode flag, I_WB_SWITCHING, which has two
    functions. First, the easy one: it ensures that there's only one
    switching in progress for a given inode. Second, it's used as a
    mechanism to synchronize wb stat updates.

    The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
    but track the current number of dirty pages and pages under writeback
    respectively. As such, when an inode is moved from one wb to another,
    the inode's portion of those stats has to be transferred together;
    unfortunately, this is a bit tricky as those stat updates are percpu
    operations which are performed without holding any lock in some
    places.

    This patch solves the problem in a similar way as memcg. Each such
    lockless stat update is wrapped in a transaction surrounded by
    unlocked_inode_to_wb_begin/end(). During normal operation, they map
    to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
    mapping->tree_lock is grabbed across the transaction.

    In turn, the switching path sets I_WB_SWITCHING and waits for a RCU
    grace period to pass before actually starting to switch, which
    guarantees that all stat update paths are synchronizing against
    mapping->tree_lock.

    This patch still doesn't implement the actual switching.

    v3: Updated on top of the recent cancel_dirty_page() updates.
    unlocked_inode_to_wb_begin() now nests inside
    mem_cgroup_begin_page_stat() to match the locking order.

    v2: The i_wb access transaction will be used for !stat accesses too.
    Function names and comments updated accordingly.

    s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
    s/switch_wb/switch_wbs/

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • While cgroup writeback support now connects memcg and blkcg so that
    writeback IOs are properly attributed and controlled, the IO back
    pressure propagation mechanism implemented in balance_dirty_pages()
    and its subroutines wasn't aware of cgroup writeback.

    Processes belonging to a memcg may have access to only a subset of the
    total memory available in the system, and not factoring this into dirty
    throttling rendered it completely ineffective for processes under memcg
    limits; memcg ended up building a separate ad-hoc degenerate mechanism
    directly into the vmscan code to limit page dirtying.

    The previous patches updated balance_dirty_pages() and its subroutines
    so that they can deal with multiple wb_domain's (writeback domains)
    and defined per-memcg wb_domain. Processes belonging to a non-root
    memcg are bound to two wb_domains, global wb_domain and memcg
    wb_domain, and should be throttled according to IO pressures from both
    domains. This patch updates dirty throttling code so that it repeats
    similar calculations for the two domains - the differences between the
    two are few and minor - and applies the lower of the two sets of
    resulting constraints.

    wb_over_bg_thresh(), which controls when background writeback
    terminates, is also updated to consider both global and memcg
    wb_domains. It returns true if dirty is over bg_thresh for either
    domain.

    This makes the dirty throttling mechanism operational for memcg
    domains including writeback-bandwidth-proportional dirty page
    distribution inside them but the ad-hoc memcg throttling mechanism in
    vmscan is still in place. The next patch will rip it out.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Dirtyable memory is distributed to a wb (bdi_writeback) according to
    the relative bandwidth the wb is writing out in the whole system.
    This distribution is global - each wb is measured against all other
    wb's and gets the proportionately sized portion of the memory in the
    whole system.

    For cgroup writeback, the amount of dirtyable memory is scoped by
    memcg and thus each wb would need to be measured and controlled in its
    memcg. IOW, a wb will belong to two writeback domains - the global
    and memcg domains.

    The previous patches laid the groundwork to support the two wb_domains
    and this patch implements memcg wb_domain. memcg->cgwb_domain is
    initialized on css online and destroyed on css release,
    wb->memcg_completions is added, and __wb_writeout_inc() is updated to
    increment completions against both global and memcg wb_domains.

    The following patches will update balance_dirty_pages() and its
    subroutines to actually consider memcg wb_domain for throttling.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • wb_over_bg_thresh() currently uses global_dirty_limits() and
    wb_dirty_limit() both of which are wrappers around operations which
    take dirty_throttle_control. For cgroup writeback support, the
    function will be updated to also consider memcg wb_domains which
    requires the context information carried in dirty_throttle_control.

    This patch updates wb_over_bg_thresh() so that it uses the underlying
    wb_domain aware operations directly and builds the global
    dirty_throttle_control in the process.

    This patch doesn't introduce any behavioral changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The function is relocated (into mm/page-writeback.c) and renamed to
    wb_over_bg_thresh(). It is closely tied to the dirty throttling mechanism
    implemented in page-writeback.c, and this relocation will allow future
    updates necessary for cgroup writeback support.

    While at it, add function comment.

    This is pure reorganization and doesn't introduce any behavioral
    changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • global_dirty_limits() calculates thresh and bg_thresh (confusingly
    called *pdirty and *pbackground in the function) assuming
    global_wb_domain; however, cgroup writeback support requires
    considering per-memcg wb_domain too.

    This patch separates out domain_dirty_limits() which takes
    dirty_throttle_control out of global_dirty_limits(). As thresh and
    bg_thresh calculation needs the amount of dirtyable memory in the
    domain, dirty_throttle_control->avail is added. The new function
    calculates the two thresholds and stores them directly in the
    dirty_throttle_control.

    Also, memcg domains can't follow the vm_dirty_bytes and
    dirty_background_bytes settings directly; if those are set and
    domain_dirty_limits() is invoked for a !global domain, the settings
    are translated to ratios by scaling them against the globally available
    memory. dirty_throttle_control->gdtc is added to enable this when
    CONFIG_CGROUP_WRITEBACK.

    global_dirty_limits() is now a thin wrapper around
    domain_dirty_limits() and balance_dirty_pages() is updated to use the
    new function too.

    This patch doesn't introduce any behavioral changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo