21 Nov, 2015

1 commit

  • When building the kernel with gcc 5.2, the following warning is raised:

    mm/page-writeback.c: In function 'balance_dirty_pages.isra.10':
    mm/page-writeback.c:1545:17: warning: 'm_dirty' may be used uninitialized in this function [-Wmaybe-uninitialized]
    unsigned long m_dirty, m_thresh, m_bg_thresh;

    m_dirty, m_thresh and m_bg_thresh are only initialized inside the
    "if (mdtc)" block, so when mdtc is NULL gcc cannot prove they are
    initialized before use. Initialize m_dirty to zero to silence the
    warning, and also initialize m_thresh and m_bg_thresh to keep things
    consistent.

    They are only consulted later, behind the "!mdtc || m_dirty ..."
    condition (see the sketch after this entry).
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
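
    A rough sketch of the pattern described above (abbreviated, from memory
    of mm/page-writeback.c of that era, not the literal diff):

        unsigned long m_dirty = 0, m_thresh = 0, m_bg_thresh = 0;

        if (mdtc) {
                /* only this branch computes the memcg-domain numbers */
                m_dirty = mdtc->dirty;
                m_thresh = mdtc->thresh;
                m_bg_thresh = mdtc->bg_thresh;
        }

        /* short-circuits on !mdtc, so the m_* values are never read
         * uninitialized; the explicit zeroing only quiets gcc */
        if (!mdtc || m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh)) {
                /* both domains are within their freerun range */
        }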
     

13 Oct, 2015

4 commits

  • For memcg domains, the amount of available memory was calculated as

    min(the amount currently in use + headroom according to memcg,
    total clean memory)

    This isn't quite correct as what should be capped by the amount of
    clean memory is the headroom, not the sum of memory in use and
    headroom. For example, if a memcg domain has a significant amount of
    dirty memory, the above can lead to a value which is lower than the
    current amount in use which doesn't make much sense. In most
    circumstances, the above leads to a number which is somewhat but not
    drastically lower.

    As the amount of memory which can be readily allocated to the memcg
    domain is capped by the amount of system-wide clean memory which is
    not already assigned to the memcg itself, the number we want is

    the amount currently in use +
    min(headroom according to memcg, clean memory elsewhere in the system)

    This patch updates mem_cgroup_wb_stats() to return the number of
    filepages and headroom instead of the calculated available pages.
    mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
    above calculation from file, headroom, dirty and globally clean pages.

    v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading
    to build failure when !CGROUP_WRITEBACK. Fixed.

    Signed-off-by: Tejun Heo
    Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling")
    Signed-off-by: Jens Axboe

    Tejun Heo
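
    A sketch of the corrected calculation described in the entry above
    (close in spirit to the mdtc_calc_avail() the patch introduces; exact
    helper and field names may differ slightly):

        static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
                                    unsigned long filepages,
                                    unsigned long headroom)
        {
                struct dirty_throttle_control *gdtc = mdtc_gdtc(mdtc);
                unsigned long clean = filepages - min(filepages, mdtc->dirty);
                unsigned long global_clean = gdtc->avail -
                                             min(gdtc->avail, gdtc->dirty);
                unsigned long other_clean = global_clean -
                                            min(global_clean, clean);

                /* in use + min(headroom, clean memory elsewhere) */
                mdtc->avail = filepages + min(headroom, other_clean);
        }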
     
  • MDTC_INIT() is used to initialize dirty_throttle_control for memcg
    domains. It used DTC_INIT_COMMON() to initialize mdtc->wb and
    ->wb_completions which is incorrect as DTC_INIT_COMMON() sets the
    latter to wb->completions instead of wb->memcg_completions. This can
    lead to wildly incorrect results when calculating the proportion of
    dirty memory the memcg domain should get.

    Remove DTC_INIT_COMMON() and update MDTC_INIT() to initialize
    mdtc->wb_completions to wb->memcg_completions.

    Signed-off-by: Tejun Heo
    Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling")
    Signed-off-by: Jens Axboe

    Tejun Heo
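
    The gist of the fix as a sketch (macro layout approximate; the point is
    that the memcg dtc must point at wb->memcg_completions, not
    wb->completions):

        #define GDTC_INIT(__wb)         .wb = (__wb),                       \
                                        .dom = &global_wb_domain,           \
                                        .wb_completions = &(__wb)->completions

        #define MDTC_INIT(__wb, __gdtc) .wb = (__wb),                       \
                                        .dom = mem_cgroup_wb_domain(__wb),  \
                                        .gdtc = __gdtc,                     \
                                        .wb_completions = &(__wb)->memcg_completions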
     
  • bdi_for_each_wb() is used in several places to wake up or issue
    writeback work items to all wb's (bdi_writeback's) on a given bdi.
    The iteration is performed by walking bdi->cgwb_tree; however, the
    tree only indexes wb's which are currently active.

    For example, when a memcg gets associated with a different blkcg, the
    old wb is removed from the tree so that the new one can be indexed.
    The old wb starts dying from then on but will linger till all its
    inodes are drained. As these dying wb's may still host dirty inodes,
    writeback operations which affect all wb's must include them.
    bdi_for_each_wb() skipping dying wb's meant that sync(2) could miss,
    and fail to sync, the inodes belonging to those wb's.

    This patch adds an RCU-protected @bdi->wb_list which lists all wb's
    belonging to that bdi. wb's are added on creation and removed on
    release rather than on the start of destruction. bdi_for_each_wb()
    usages are replaced with list_for_each[_continue]_rcu() iterations
    over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.

    v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Artem Bityutskiy
    Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
    Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
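
    A sketch of the new scheme (list-head, lock and helper names are from
    memory and may differ slightly):

        /* creation/release side: a wb is visible on bdi->wb_list from its
         * creation until its final release, so dying wb's with lingering
         * dirty inodes are still iterated */
        spin_lock_irq(&cgwb_lock);
        list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
        spin_unlock_irq(&cgwb_lock);

        /* iteration side, replacing bdi_for_each_wb() */
        rcu_read_lock();
        list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
                wb_wakeup(wb);
        rcu_read_unlock();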
     
  • laptop_mode_timer_fn() was using bdi_for_each_wb() without the
    required RCU locking, leading to the following warning.

    WARNING: CPU: 0 PID: 0 at include/linux/backing-dev.h:415 laptop_mode_timer_fn+0x106/0x170()
    ...
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] laptop_mode_timer_fn+0x106/0x170
    [] call_timer_fn+0xb3/0x2f0
    [] run_timer_softirq+0x205/0x370
    [] __do_softirq+0xd4/0x460
    [] irq_exit+0x89/0xa0
    [] smp_apic_timer_interrupt+0x42/0x50
    [] apic_timer_interrupt+0x84/0x90
    ...

    Fix it by adding rcu_read_lock() around the iteration.

    Signed-off-by: Tejun Heo
    Fixes: a06fd6b10228 ("writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's")
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
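
    The essence of the fix, as described above (the surrounding
    laptop_mode_timer_fn() body is abbreviated; @q, @wb, @iter and
    @nr_pages come from that function):

        rcu_read_lock();
        bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
                if (wb_has_dirty_io(wb))
                        wb_start_writeback(wb, nr_pages, true,
                                           WB_REASON_LAPTOP_TIMER);
        rcu_read_unlock();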
     

11 Sep, 2015

1 commit

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a a
    while. This is basically four patchsets from Tejun, that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is a set of 5 patches with cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - The following patches were added to consolidate the blkcg entry
    point and blkg creation. This in itself is an improvement and helps
    collecting common stats on bio issue.

    - per-blkg stats now accounted on bio issue rather than request
    completion so that bio based and request based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset makes blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     

19 Aug, 2015

1 commit

  • The following tracepoints are updated to report the cgroup used during
    cgroup writeback.

    * writeback_write_inode[_start]
    * writeback_queue
    * writeback_exec
    * writeback_start
    * writeback_written
    * writeback_wait
    * writeback_nowork
    * writeback_wake_background
    * wbc_writepage
    * writeback_queue_io
    * bdi_dirty_ratelimit
    * balance_dirty_pages
    * writeback_sb_inodes_requeue
    * writeback_single_inode[_start]

    Note that writeback_bdi_register is separated out from writeback_class
    as reporting the cgroup doesn't make sense for it. Tracepoints which take
    bdi are updated to take bdi_writeback instead.

    Signed-off-by: Tejun Heo
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

07 Aug, 2015

1 commit

  • The initial value of global_wb_domain.dirty_limit set by
    writeback_set_ratelimit() is zeroed out by the memset in
    wb_domain_init().

    Signed-off-by: Rabin Vincent
    Acked-by: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rabin Vincent
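
    The fix is presumably just an ordering change in page_writeback_init();
    a sketch of the idea (function body abbreviated, exact ordering of the
    other calls may differ):

        void __init page_writeback_init(void)
        {
                /* init (and memset) the domain first ... */
                BUG_ON(wb_domain_init(&global_wb_domain, GFP_KERNEL));

                /* ... then compute the initial dirty_limit so the memset
                 * above no longer wipes it out */
                writeback_set_ratelimit();
                register_cpu_notifier(&ratelimit_nb);
        }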
     

26 Jun, 2015

1 commit

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     

02 Jun, 2015

31 commits

  • The mechanism for detecting whether an inode should switch its wb
    (bdi_writeback) association is now in place. This patch builds the
    framework for the actual switching.

    This patch adds a new inode flag I_WB_SWITCHING, which has two
    functions. First, the easy one, it ensures that there's only one
    switching in progress for a given inode. Second, it's used as a
    mechanism to synchronize wb stat updates.

    The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
    but track the current number of dirty pages and pages under writeback
    respectively. As such, when an inode is moved from one wb to another,
    the inode's portion of those stats have to be transferred together;
    unfortunately, this is a bit tricky as those stat updates are percpu
    operations which are performed without holding any lock in some
    places.

    This patch solves the problem in a similar way to memcg. Each such
    lockless stat update is wrapped in a transaction surrounded by
    unlocked_inode_to_wb_begin/end(). During normal operation, they map
    to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
    mapping->tree_lock is grabbed across the transaction.

    In turn, the switching path sets I_WB_SWITCHING and waits for an RCU
    grace period to pass before actually starting to switch, which
    guarantees that all stat update paths are synchronizing against
    mapping->tree_lock.

    This patch still doesn't implement the actual switching.

    v3: Updated on top of the recent cancel_dirty_page() updates.
    unlocked_inode_to_wb_begin() now nests inside
    mem_cgroup_begin_page_stat() to match the locking order.

    v2: The i_wb access transaction will be used for !stat accesses too.
    Function names and comments updated accordingly.

    s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
    s/switch_wb/switch_wbs/

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
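
    A simplified sketch of the transaction described above (names follow
    the entry's description; the real helpers carry more annotations):

        /* roughly what the two helpers do */
        static inline struct bdi_writeback *
        unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
        {
                rcu_read_lock();
                *lockedp = inode->i_state & I_WB_SWITCHING;
                if (unlikely(*lockedp))         /* a switch is in flight */
                        spin_lock_irq(&inode->i_mapping->tree_lock);
                return inode_to_wb(inode);
        }

        static inline void unlocked_inode_to_wb_end(struct inode *inode,
                                                    bool locked)
        {
                if (unlikely(locked))
                        spin_unlock_irq(&inode->i_mapping->tree_lock);
                rcu_read_unlock();
        }

        /* a lockless stat update is then bracketed like this */
        wb = unlocked_inode_to_wb_begin(inode, &locked);
        __inc_wb_stat(wb, WB_RECLAIMABLE);
        unlocked_inode_to_wb_end(inode, locked);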
     
  • While cgroup writeback support now connects memcg and blkcg so that
    writeback IOs are properly attributed and controlled, the IO back
    pressure propagation mechanism implemented in balance_dirty_pages()
    and its subroutines wasn't aware of cgroup writeback.

    Processes belonging to a memcg may have access to only a subset of the
    total memory available in the system. Not factoring this into dirty
    throttling rendered it completely ineffective for processes under
    memcg limits, and memcg ended up building a separate, ad-hoc degenerate
    mechanism directly into the vmscan code to limit page dirtying.

    The previous patches updated balance_dirty_pages() and its subroutines
    so that they can deal with multiple wb_domain's (writeback domains)
    and defined per-memcg wb_domain. Processes belonging to a non-root
    memcg are bound to two wb_domains, global wb_domain and memcg
    wb_domain, and should be throttled according to IO pressures from both
    domains. This patch updates dirty throttling code so that it repeats
    similar calculations for the two domains - the differences between the
    two are few and minor - and applies the lower of the two sets of
    resulting constraints.

    wb_over_bg_thresh(), which controls when background writeback
    terminates, is also updated to consider both global and memcg
    wb_domains. It returns true if dirty is over bg_thresh for either
    domain.

    This makes the dirty throttling mechanism operational for memcg
    domains including writeback-bandwidth-proportional dirty page
    distribution inside them but the ad-hoc memcg throttling mechanism in
    vmscan is still in place. The next patch will rip it out.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
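
    A sketch of the domain selection near the end of the limit calculations
    in balance_dirty_pages() (simplified; the freerun checks and the rest of
    the pause computation are omitted):

        struct dirty_throttle_control *sdtc;
        unsigned long task_ratelimit;

        /* after running the same limit/pos_ratio calculations for the
         * global domain (gdtc) and, if present, the memcg domain (mdtc),
         * throttle against whichever is more constrained */
        sdtc = gdtc;
        if (mdtc && mdtc->pos_ratio < gdtc->pos_ratio)
                sdtc = mdtc;

        task_ratelimit = ((u64)wb->dirty_ratelimit * sdtc->pos_ratio) >>
                                                RATELIMIT_CALC_SHIFT;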
     
  • Dirtyable memory is distributed to a wb (bdi_writeback) according to
    the relative bandwidth the wb is writing out in the whole system.
    This distribution is global - each wb is measured against all other
    wb's and gets the proportionately sized portion of the memory in the
    whole system.

    For cgroup writeback, the amount of dirtyable memory is scoped by
    memcg and thus each wb would need to be measured and controlled in its
    memcg. IOW, a wb will belong to two writeback domains - the global
    and memcg domains.

    The previous patches laid the groundwork to support the two wb_domains
    and this patch implements memcg wb_domain. memcg->cgwb_domain is
    initialized on css online and destroyed on css release,
    wb->memcg_completions is added, and __wb_writeout_inc() is updated to
    increment completions against both global and memcg wb_domains.

    The following patches will update balance_dirty_pages() and its
    subroutines to actually consider memcg wb_domain for throttling.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
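
    A sketch of the completion accounting after this change (close to the
    __wb_writeout_inc() described above; helper names as introduced by this
    series):

        static void __wb_writeout_inc(struct bdi_writeback *wb)
        {
                struct wb_domain *cgdom;

                __inc_wb_stat(wb, WB_WRITTEN);
                wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
                                       wb->bdi->max_prop_frac);

                /* also credit the memcg domain, if there is one */
                cgdom = mem_cgroup_wb_domain(wb);
                if (cgdom)
                        wb_domain_writeout_inc(cgdom, &wb->memcg_completions,
                                               wb->bdi->max_prop_frac);
        }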
     
  • wb_over_bg_thresh() currently uses global_dirty_limits() and
    wb_dirty_limit() both of which are wrappers around operations which
    take dirty_throttle_control. For cgroup writeback support, the
    function will be updated to also consider memcg wb_domains which
    requires the context information carried in dirty_throttle_control.

    This patch updates wb_over_bg_thresh() so that it uses the underlying
    wb_domain aware operations directly and builds the global
    dirty_throttle_control in the process.

    This patch doesn't introduce any behavioral changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • This patch moves the background writeback threshold check into
    mm/page-writeback.c and renames it to wb_over_bg_thresh(). The
    function is closely tied to the dirty throttling mechanism implemented
    there, and this relocation will allow future updates necessary for
    cgroup writeback support.

    While at it, add function comment.

    This is pure reorganization and doesn't introduce any behavioral
    changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • global_dirty_limits() calculates thresh and bg_thresh (confusingly
    called *pdirty and *pbackground in the function) assuming
    global_wb_domain; however, cgroup writeback support requires
    considering per-memcg wb_domain too.

    This patch separates out domain_dirty_limits() which takes
    dirty_throttle_control out of global_dirty_limits(). As thresh and
    bg_thresh calculation needs the amount of dirtyable memory in the
    domain, dirty_throttle_control->avail is added. The new function
    calculates the two thresholds and stores them directly in the
    dirty_throttle_control.

    Also, memcg domains can't follow the vm_dirty_bytes and
    dirty_background_bytes settings directly. If those are set and
    domain_dirty_limits() is invoked for a !global domain, the settings
    are translated into ratios by scaling them against globally available
    memory. dirty_throttle_control->gdtc is added to enable this when
    CONFIG_CGROUP_WRITEBACK is enabled.

    global_dirty_limits() is now a thin wrapper around
    domain_dirty_limits() and balance_dirty_pages() is updated to use the
    new function too.

    This patch doesn't introduce any behavioral changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
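
    A conceptual sketch of the byte-to-ratio translation (illustrative
    only; is_memcg_domain is a placeholder and the real
    domain_dirty_limits() works in somewhat different units):

        /* a memcg domain can't honour vm_dirty_bytes directly: derive the
         * fraction it represents of globally dirtyable memory and apply
         * that fraction to the domain's own available memory */
        if (is_memcg_domain && vm_dirty_bytes) {
                u64 global_avail_bytes = (u64)gdtc->avail * PAGE_SIZE;

                dtc->thresh = div64_u64((u64)dtc->avail * vm_dirty_bytes,
                                        global_avail_bytes);
        }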
     
  • Currently __wb_writeout_inc() and hard_dirty_limit() assume
    global_wb_domain; however, cgroup writeback support requires
    considering per-memcg wb_domain too.

    This patch separates out domain-specific part of __wb_writeout_inc()
    into wb_domain_writeout_inc() which takes wb_domain as a parameter and
    adds the parameter to hard_dirty_limit(). This will allow these two
    functions to handle per-memcg wb_domains.

    This patch doesn't introduce any behavioral changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently all dirty throttle operations use global_wb_domain; however,
    cgroup writeback support requires considering per-memcg wb_domain too.
    This patch adds dirty_throttle_control->dom and updates functions
    which are directly using global_wb_domain to use it instead.

    As this makes global_update_bandwidth() a misnomer, the function is
    renamed to domain_update_bandwidth().

    This patch doesn't introduce any behavioral changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • wb->completions measures the wb's proportional write bandwidth in
    global_wb_domain and is thus naturally tied to the wb_domain. This patch
    adds dirty_throttle_control->wb_completions which is initialized to
    wb->completions by GDTC_INIT() and updates __wb_dirty_limits() to use
    it instead of dereferencing wb->completions directly.

    This will allow dirty_throttle_control to represent different
    wb_domains and the matching wb completions.

    This patch doesn't introduce any behavioral changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • wb_position_ratio() is used to calculate pos_ratio, which is used for
    two purposes. wb_update_dirty_ratelimit() uses it to adjust
    wb->[balanced_]dirty_ratelimit gradually and balance_dirty_pages() to
    immediately adjust dirty_ratelimit right before applying it to
    determine pause duration.

    While wb_update_dirty_ratelimit() is separately rate limited from
    balance_dirty_pages(), on the run where the ratelimit is updated, we
    end up calculating pos_ratio twice with the same parameters.

    This patch adds dirty_throttle_control->pos_ratio.
    balance_dirty_pages() calculates it once per run and
    wb_update_dirty_ratelimit() uses the value stored in
    dirty_throttle_control.

    This removes the duplicate calculation and also will help implementing
    memcg wb_domain.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • wb_calc_thresh() calculates wb_thresh by scaling thresh according to
    the wb's portion in the system-wide write bandwidth. cgroup writeback
    support would need to calculate wb_thresh against memcg domain too.
    This patch renames wb_calc_thresh() to __wb_calc_thresh() and makes it
    take dirty_throttle_control so that the function can later be updated
    to calculate against different domains according to
    dirty_throttle_control.

    wb_calc_thresh() is now a thin wrapper around __wb_calc_thresh().

    v2: The original version was incorrectly scaling dtc->dirty instead of
    dtc->thresh. This was due to the extremely confusing function and
    variable names. Added a rename patch and fixed this one.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
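
    A sketch of the core of __wb_calc_thresh() after this series
    (simplified; the min/max_ratio adjustments are omitted):

        static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
        {
                struct wb_domain *dom = dtc_dom(dtc);
                unsigned long thresh = dtc->thresh;
                unsigned long numerator, denominator;
                u64 wb_thresh;

                /* scale the domain threshold by this wb's share of the
                 * domain's writeout completions (~ its write bandwidth) */
                fprop_fraction_percpu(&dom->completions, dtc->wb_completions,
                                      &numerator, &denominator);

                wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100;
                wb_thresh *= numerator;
                do_div(wb_thresh, denominator);

                return wb_thresh;
        }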
     
  • wb_bg_thresh is currently treated as a second-class citizen. It's
    only used when BDI_CAP_STRICTLIMIT is set and balance_dirty_pages()
    doesn't calculate it unless the cap is set. When the cap is set, the
    calculated value is not passed around but instead recalculated
    whenever it's used.

    wb_position_ratio() calculates it by scaling wb_thresh proportional to
    bg_thresh / thresh. wb_update_dirty_ratelimit() uses wb_dirty_limit()
    on bg_thresh, which should generally lead to a similar result as the
    proportional scaling but can also be way off in the presence of
    max/min_ratio settings.

    Avoiding the wb_bg_thresh calculation saves us one u64 multiplication
    and division when BDI_CAP_STRICTLIMIT is not set. Given that
    balance_dirty_pages() is already ratelimited, this doesn't justify the
    incurred extra complexity.

    This patch adds wb_bg_thresh to dirty_throttle_control and makes
    wb_dirty_limits() always calculate it and updates the users to use the
    pre-calculated value.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Dirty throttling implemented in balance_dirty_pages() and its
    subroutines makes use of a number of parameters which are passed
    around individually. This renders these functions somewhat unwieldy
    and makes it difficult to add or change the involved parameters. Also
    some functions use different or conflicting naming schemes for the
    same parameters making the code confusing to follow.

    This patch consolidates the main parameters into struct
    dirty_throttle_control so that they can be passed around easily and
    adding new parameters isn't painful. This also unifies how a given
    parameter is named and accessed. The drawback of using this type of
    control structure rather than explicit parameters is that it isn't
    immediately obvious which function accesses and modifies what;
    however, it's fairly clear that the benefits outweigh the drawbacks in
    this case.

    GDTC_INIT() macro is provided to ease initializing
    dirty_throttle_control for the global_wb_domain and
    balance_dirty_pages() uses a separate pointer to point to its global
    dirty_throttle_control. This is to make it uniform with memcg domain
    handling which will be added later.

    This patch doesn't introduce any behavioral changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
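
    For reference, a sketch of the control structure roughly as it ends up
    after the whole series (the patches above add the cgroup-writeback
    fields incrementally; layout approximate):

        struct dirty_throttle_control {
        #ifdef CONFIG_CGROUP_WRITEBACK
                struct wb_domain        *dom;
                struct dirty_throttle_control *gdtc; /* set in memcg dtc's */
        #endif
                struct bdi_writeback    *wb;
                struct fprop_local_percpu *wb_completions;

                unsigned long           avail;     /* dirtyable memory */
                unsigned long           dirty;     /* file_dirty + write + nfs */
                unsigned long           thresh;    /* dirty threshold */
                unsigned long           bg_thresh; /* dirty background threshold */

                unsigned long           wb_dirty;  /* per-wb counterparts */
                unsigned long           wb_thresh;
                unsigned long           wb_bg_thresh;

                unsigned long           pos_ratio;
        };

        #define GDTC_INIT(__wb)         .wb = (__wb) /* later patches add the
                                                        domain initializers */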
     
  • This patch is a part of the series to define wb_domain which
    represents a domain that wb's (bdi_writeback's) belong to and are
    measured against each other in. This will enable IO backpressure
    propagation for cgroup writeback.

    global_dirty_limit exists to regulate the global dirty threshold which
    is a property of the wb_domain. This patch moves hard_dirty_limit,
    dirty_lock, and update_time into wb_domain.

    This is pure reorganization and doesn't introduce any behavioral
    changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Dirtyable memory is distributed to a wb (bdi_writeback) according to
    the relative bandwidth the wb is writing out in the whole system.
    This distribution is global - each wb is measured against all other
    wb's and gets the proportionately sized portion of the memory in the
    whole system.

    For cgroup writeback, the amount of dirtyable memory is scoped by
    memcg and thus each wb would need to be measured and controlled in its
    memcg. IOW, a wb will belong to two writeback domains - the global
    and memcg domains.

    Currently, what constitutes the global writeback domain are scattered
    across a number of global states. This patch starts collecting them
    into struct wb_domain.

    * fprop_global which serves as the basis for proportional bandwidth
    measurement and its period timer are moved into struct wb_domain.

    * global_wb_domain hosts the states for the global domain.

    * While at it, flatten wb_writeout_fraction() into its callers. This
    thin wrapper doesn't provide any actual benefits while getting in
    the way.

    This is pure reorganization and doesn't introduce any behavioral
    changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
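
    A sketch of the structure after this patch plus the follow-up above
    that moves the dirty-limit state in (field names approximate):

        struct wb_domain {
                spinlock_t lock;

                /* proportional writeout bandwidth tracking */
                struct fprop_global completions;
                struct timer_list period_timer; /* ages completion counts */
                unsigned long period_time;

                /* moved in by the follow-up patch described above */
                unsigned long dirty_limit_tstamp;
                unsigned long dirty_limit;
        };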
     
  • __wb_update_bandwidth() is called from two places -
    mm/page-writeback.c::balance_dirty_pages() and
    fs/fs-writeback.c::wb_writeback(). The latter updates only the
    write bandwidth while the former also deals with the dirty ratelimit.
    The two callsites are distinguished by whether @thresh parameter is
    zero or not, which is cryptic. In addition, the two files define
    their own different versions of wb_update_bandwidth() on top of
    __wb_update_bandwidth(), which is confusing to say the least. This
    patch cleans up [__]wb_update_bandwidth() in the following ways.

    * __wb_update_bandwidth() now takes explicit @update_ratelimit
    parameter to gate dirty ratelimit handling.

    * mm/page-writeback.c::wb_update_bandwidth() is flattened into its
    caller - balance_dirty_pages().

    * fs/fs-writeback.c::wb_update_bandwidth() is moved to
    mm/page-writeback.c and __wb_update_bandwidth() is made static.

    * While at it, add a lockdep assertion to __wb_update_bandwidth().

    Except for the lockdep addition, this is pure reorganization and
    doesn't introduce any behavioral changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The function name wb_dirty_limit(), its argument @dirty and the local
    variable @wb_dirty are mortally confusing given that the function
    calculates the per-wb threshold value, not dirty pages, especially given
    that @dirty and @wb_dirty are used elsewhere for dirty pages.

    Let's rename the function to wb_calc_thresh() and wb_dirty to
    wb_thresh.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bdi_start_background_writeback() currently takes @bdi and kicks the
    root wb (bdi_writeback). In preparation for cgroup writeback support,
    make it take wb instead.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • writeback_in_progress() currently takes @bdi and returns whether
    writeback is in progress on its root wb (bdi_writeback). In
    preparation for cgroup writeback support, make it take wb instead.
    While at it, make it an inline function.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • For cgroup writeback support, all bdi-wide operations should be
    distributed to all its wb's (bdi_writeback's).

    This patch updates laptop_mode_timer_fn() so that it invokes
    wb_start_writeback() on all wb's rather than just the root one. As
    the intent is writing out all dirty data, there's no reason to split
    the number of pages to write.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bdi_start_writeback() is a thin wrapper on top of
    __wb_start_writeback() which is used only by laptop_mode_timer_fn().
    This patch removes bdi_start_writeback(), renames
    __wb_start_writeback() to wb_start_writeback() and makes
    laptop_mode_timer_fn() use it instead.

    This doesn't cause any functional difference and will ease making
    laptop_mode_timer_fn() cgroup writeback aware.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bdi->min/max_ratio are user-configurable per-bdi knobs which regulate
    the dirty limit of each bdi. For cgroup writeback, they need to be
    further distributed across wb's (bdi_writeback's) belonging to the
    configured bdi.

    This patch introduces wb_min_max_ratio() which distributes
    bdi->min/max_ratio according to a wb's proportion in the total active
    bandwidth of its bdi.

    v2: Update wb_min_max_ratio() to fix a bug where both min and max were
    assigned the min value and avoid calculations when possible.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
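
    A sketch of the distribution (close to the wb_min_max_ratio()
    introduced here; details may differ):

        static void wb_min_max_ratio(struct bdi_writeback *wb,
                                     unsigned long *minp, unsigned long *maxp)
        {
                unsigned long this_bw = wb->avg_write_bandwidth;
                unsigned long tot_bw =
                        atomic_long_read(&wb->bdi->tot_write_bandwidth);
                unsigned long long min = wb->bdi->min_ratio;
                unsigned long long max = wb->bdi->max_ratio;

                /* scale both knobs by this wb's share of the bdi's total
                 * write bandwidth; @wb may already be clean, in which case
                 * the total may not include its bandwidth */
                if (this_bw < tot_bw) {
                        if (min) {
                                min *= this_bw;
                                do_div(min, tot_bw);
                        }
                        if (max < 100) {
                                max *= this_bw;
                                do_div(max, tot_bw);
                        }
                }

                *minp = min;
                *maxp = max;
        }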
     
  • bdi_has_dirty_io() used to only reflect whether the root wb
    (bdi_writeback) has dirty inodes. For cgroup writeback support, it
    needs to take all active wb's into account. If any wb on the bdi has
    dirty inodes, bdi_has_dirty_io() should return true.

    To achieve that, as inode_wb_list_{move|del}_locked() now keep track
    of the dirty state transition of each wb, the number of dirty wbs can
    be counted in the bdi; however, bdi is already aggregating
    wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when
    there are any dirty inodes by ensuring wb->avg_write_bandwidth can't
    dip below 1. bdi_has_dirty_io() can simply test whether
    bdi->tot_write_bandwidth is zero or not.

    While this bumps the value of wb->avg_write_bandwidth to one when it
    used to be zero, this shouldn't cause any meaningful behavior
    difference.

    bdi_has_dirty_io() is made an inline function which tests whether
    ->tot_write_bandwidth is non-zero. Also, WARN_ON_ONCE()'s on its
    value are added to inode_wb_list_{move|del}_locked().

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
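
    The resulting test is tiny (sketch):

        static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi)
        {
                /* guaranteed to be > 0 whenever any wb on the bdi has
                 * dirty inodes, since a dirty wb's avg_write_bandwidth
                 * never drops below 1 */
                return atomic_long_read(&bdi->tot_write_bandwidth) != 0;
        }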
     
  • cgroup writeback support needs to keep track of the sum of
    avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
    distribute write workload. This patch adds bdi->tot_write_bandwidth
    and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
    and wb_update_write_bandwidth() to adjust it as wb's gain and lose
    dirty inodes and its avg_write_bandwidth gets updated.

    As the update events are not synchronized with each other,
    bdi->tot_write_bandwidth is an atomic_long_t.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, balance_dirty_pages() always works on bdi->wb. This patch
    updates it to work on the wb (bdi_writeback) matching memcg and blkcg
    of the current task as that's what the inode is being dirtied against.

    balance_dirty_pages_ratelimited() now pins the current wb and passes
    it to balance_dirty_pages().

    As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
    visible behavior differences.

    v2: Updated for per-inode wb association.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Until now, all WB_* stats were accounted against the root wb
    (bdi_writeback). Now that multiple wb (bdi_writeback) support is in
    place, let's attribute the stats to the respective per-cgroup wb's.

    As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
    visible behavior differences.

    v2: Updated for per-inode wb association.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • For the planned cgroup writeback support, on each bdi
    (backing_dev_info), each memcg will be served by a separate wb
    (bdi_writeback). This patch updates bdi so that a bdi can host
    multiple wbs (bdi_writebacks).

    On the default hierarchy, blkcg implicitly enables memcg. This allows
    using memcg's page ownership for attributing writeback IOs, and every
    memcg - blkcg combination can be served by its own wb by assigning a
    dedicated wb to each memcg. This means that there may be multiple
    wb's of a bdi mapped to the same blkcg. As congested state is per
    blkcg - bdi combination, those wb's should share the same congested
    state. This is achieved by tracking congested state via
    bdi_writeback_congested structs which are keyed by blkcg.

    bdi->wb remains unchanged and will keep serving the root cgroup.
    cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
    looked up while dirtying an inode according to the memcg of the page
    being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
    by its memcg id. Once an inode is associated with its wb, it can be
    retrieved using inode_to_wb().

    Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
    pages will keep being associated with bdi->wb.

    v3: inode_attach_wb() in account_page_dirtied() moved inside
    mapping_cap_account_dirty() block where it's known to be !NULL.
    Also, an unnecessary NULL check before kfree() removed. Both
    detected by the kbuild bot.

    v2: Updated so that wb association is per inode and wb is per memcg
    rather than blkcg.

    Signed-off-by: Tejun Heo
    Cc: kbuild test robot
    Cc: Dan Carpenter
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
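
    A simplified sketch of the lookup path (reference counting and the
    creation fallback are omitted; names follow the description above):

        struct cgroup_subsys_state *memcg_css;
        struct bdi_writeback *wb;

        rcu_read_lock();
        /* each cgwb is indexed in bdi->cgwb_tree by its memcg's css id */
        memcg_css = task_css(current, memory_cgrp_id);
        wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
        rcu_read_unlock();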
     
  • Writeback operations will now be per wb (bdi_writeback) instead of
    bdi. Replace the relevant bdi references in symbol names and comments
    with wb. This patch is purely cosmetic and doesn't make any
    functional changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Wu Fengguang
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bandwidth related fields from backing_dev_info into
    bdi_writeback.

    * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
    write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
    balanced_dirty_ratelimit, completions and dirty_exceeded.

    * writeback_chunk_size() and over_bground_thresh() now take @wb
    instead of @bdi.

    * bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
    bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
    bdi_position_ratio(bdi, ...) -> wb_position_ratio(wb, ...)
    bdi_update_write_bandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
    [__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
    bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
    bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)

    * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
    respectively. Note that explicit zeroing is dropped in the process
    as wb's are cleared in entirety anyway.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
    introducing no behavior changes.

    v2: Typo in description fixed as suggested by Jan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bdi->bdi_stat[] into wb.

    * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
    enums is changed from BDI_ to WB_.

    * BDI_STAT_BATCH() -> WB_STAT_BATCH()

    * [__]{add|inc|dec|sum}_bdi_stat(bdi, ...) -> [__]{add|inc|dec|sum}_wb_stat(wb, ...)

    * bdi_stat[_error]() -> wb_stat[_error]()

    * bdi_writeout_inc() -> wb_writeout_inc()

    * stat init is moved to bdi_wb_init(), and bdi_wb_exit() is added to
    free the stats.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
    introducing no behavior changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: Miklos Szeredi
    Cc: Trond Myklebust
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • When modifying PG_Dirty on cached file pages, update the new
    MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
    global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
    per memcg memory.stat cgroupfs file. The most recent past attempt at
    this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

    The new accounting supports future efforts to add per cgroup dirty
    page throttling and writeback. It also helps an administrator break
    down a container's memory usage and provides evidence to understand
    memcg oom kills (the new dirty count is included in memcg oom kill
    messages).

    The ability to move page accounting between memcg
    (memory.move_charge_at_immigrate) makes this accounting more
    complicated than the global counter. The existing
    mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
    accounting with stat updates.
    Typical update operation:

        memcg = mem_cgroup_begin_page_stat(page)
        if (TestSetPageDirty()) {
                [...]
                mem_cgroup_update_page_stat(memcg)
        }
        mem_cgroup_end_page_stat(memcg)

    Summary of mem_cgroup_end_page_stat() overhead:
    - Without CONFIG_MEMCG it's a no-op
    - With CONFIG_MEMCG and no inter memcg task movement, it's just
    rcu_read_lock()
    - With CONFIG_MEMCG and inter memcg task movement, it's
    rcu_read_lock() + spin_lock_irqsave()

    A memcg parameter is added to several routines because their callers
    now grab mem_cgroup_begin_page_stat(), which returns the memcg later
    needed by mem_cgroup_update_page_stat().

    Because mem_cgroup_begin_page_stat() may disable interrupts, some
    adjustments are needed:
    - move __mark_inode_dirty() from __set_page_dirty() to its caller.
    __mark_inode_dirty() locking does not want interrupts disabled.
    - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
    __delete_from_page_cache(), replace_page_cache_page(),
    invalidate_complete_page2(), and __remove_mapping().

    text data bss dec hex filename
    8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
    8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
    +192 text bytes
    8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
    8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
    +773 text bytes

    Performance tests were run on v4.0-rc1-36-g4f671fe2f952. Lower is better
    for all metrics; they're all wall clock or cycle counts. The read and
    write fault benchmarks just measure fault time; they do not include I/O
    time.

    * CONFIG_MEMCG not set:
    baseline patched
    kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples)
    dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03%
    dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99%
    dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77%
    read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples)
    write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples)

    * CONFIG_MEMCG=y root_memcg:
    baseline patched
    kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples)
    dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90%
    dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33%
    dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00%
    read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples)
    write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples)

    * CONFIG_MEMCG=y non-root_memcg:
    baseline patched
    kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples)
    dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82%
    dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27%
    dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52%
    read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples)
    write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples)

    As expected anon page faults are not affected by this patch.

    tj: Updated to apply on top of the recent cancel_dirty_page() changes.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Greg Thelen
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Greg Thelen