05 Aug, 2016

1 commit

  • The name for a bdi of a gendisk is derived from the gendisk's devt.
    However, since the gendisk is destroyed before the bdi it leaves a
    window where a new gendisk could dynamically reuse the same devt while a
    bdi with the same name is still live. Arrange for the bdi to hold a
    reference against its "owner" disk device while it is registered.
    Otherwise we can hit sysfs duplicate name collisions like the following:

    WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
    sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

    Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
    0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
    ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
    0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
    Call Trace:
    [] dump_stack+0x63/0x87
    [] __warn+0xd1/0xf0
    [] warn_slowpath_fmt+0x5f/0x80
    [] sysfs_warn_dup+0x64/0x80
    [] sysfs_create_dir_ns+0x7e/0x90
    [] kobject_add_internal+0xaa/0x320
    [] ? vsnprintf+0x34e/0x4d0
    [] kobject_add+0x75/0xd0
    [] ? mutex_lock+0x12/0x2f
    [] device_add+0x125/0x610
    [] device_create_groups_vargs+0xd8/0x100
    [] device_create_vargs+0x1c/0x20
    [] bdi_register+0x8c/0x180
    [] bdi_register_dev+0x27/0x30
    [] add_disk+0x175/0x4a0
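
    In code terms, the fix takes roughly this shape (a sketch modeled on
    the mainline bdi_register_owner(); error paths trimmed):

    int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner)
    {
            int rc;

            rc = bdi_register(bdi, NULL, "%u:%u", MAJOR(owner->devt),
                              MINOR(owner->devt));
            if (rc)
                    return rc;

            /* pin the disk device so its devt can't be recycled while
             * a bdi named after it is still registered */
            bdi->owner = owner;
            get_device(owner);
            return 0;
    }

    bdi_unregister() then drops the reference with a matching
    put_device(bdi->owner).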

    Cc:
    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Signed-off-by: Dan Williams

    Fixed up missing 0 return in bdi_register_owner().

    Signed-off-by: Jens Axboe

    Dan Williams
     

29 Jul, 2016

1 commit

  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.
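
    As a rough illustration (node_page_state() is the per-node accessor
    this series introduces; the helper below is hypothetical):

    static unsigned long node_file_lru_pages(struct pglist_data *pgdat)
    {
            /* reclaim decisions read node-wide LRU counters */
            return node_page_state(pgdat, NR_ACTIVE_FILE) +
                   node_page_state(pgdat, NR_INACTIVE_FILE);
    }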

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate; the potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 Nov, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:
    "The cgroup core saw several significant updates this cycle:

    - percpu_rwsem for threadgroup locking is reinstated. This was
    temporarily dropped due to down_write latency issues. Oleg's
    rework of percpu_rwsem which is scheduled to be merged in this
    merge window resolves the issue.

    - On the v2 hierarchy, when controllers are enabled and disabled, all
    operations are atomic and can fail and revert cleanly. This allows
    ->can_attach() failure which is necessary for cpu RT slices.

    - Tasks now stay associated with the original cgroups after exit
    until released. This allows tracking resources held by zombies
    (e.g. pids) and makes it easy to find out where zombies came from
    on the v2 hierarchy. The pids controller was broken before these
    changes as zombies escaped the limits; unfortunately, updating this
    behavior required too many invasive changes and I don't think it's
    a good idea to backport them, so the pids controller on 4.3, the
    first version which included the pids controller, will stay broken
    at least until I'm sure about the cgroup core changes.

    - Optimization of a couple common tests using static_key"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
    cgroup: fix race condition around termination check in css_task_iter_next()
    blkcg: don't create "io.stat" on the root cgroup
    cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
    cgroup: replace error handling in cgroup_init() with WARN_ON()s
    cgroup: add cgroup_subsys->free() method and use it to fix pids controller
    cgroup: keep zombies associated with their original cgroups
    cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
    cgroup: don't hold css_set_rwsem across css task iteration
    cgroup: reorganize css_task_iter functions
    cgroup: factor out css_set_move_task()
    cgroup: keep css_set and task lists in chronological order
    cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
    cgroup: make css_sets pin the associated cgroups
    cgroup: relocate cgroup_[try]get/put()
    cgroup: move check_for_release() invocation
    cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
    cgroup: make cgroup->nr_populated count the number of populated css_sets
    cgroup: remove an unused parameter from cgroup_task_migrate()
    cgroup: fix too early usage of static_branch_disable()
    cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
    ...

    Linus Torvalds
     

15 Oct, 2015

1 commit

  • bdi's are initialized in two steps, bdi_init() and bdi_register(), but
    destroyed in a single step by bdi_destroy() which, for a bdi embedded
    in a request_queue, is called during blk_cleanup_queue() which makes
    the queue invisible and starts the draining of remaining usages.

    A request_queue's user can access the congestion state of the embedded
    bdi as long as it holds a reference to the queue. As such, it may
    access the congested state of a queue which finished
    blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
    Because the congested state was embedded in backing_dev_info which in
    turn is embedded in request_queue, accessing the congested state after
    bdi_destroy() was called was fine. The bdi was destroyed but the
    memory region for the congested state remained accessible till the
    queue got released.

    a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
    bdi_writeback") changed the situation. Now, the root congested state
    which is expected to be pinned while request_queue remains accessible
    is separately reference counted and the base ref is put during
    bdi_destroy(). This means that the root congested state may go away
    prematurely while the queue is between bdi_destroy() and
    blk_cleanup_queue(), which was detected by Andrey's KASAN tests.

    The root cause of this problem is that bdi doesn't distinguish the two
    steps of destruction, unregistration and release, and now the root
    congested state actually requires a separate release step. To fix the
    issue, this patch separates out bdi_unregister() and bdi_exit() from
    bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue()
    and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a
    simple wrapper calling the two steps back-to-back.

    While at it, the prototype of bdi_destroy() is moved right below
    bdi_setup_and_register() so that the counterpart operations are
    located together.
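
    The result is roughly (a sketch of the wrapper described above):

    void bdi_destroy(struct backing_dev_info *bdi)
    {
            bdi_unregister(bdi);    /* unregistration step, from blk_cleanup_queue() */
            bdi_exit(bdi);          /* release step, from blk_release_queue() */
    }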

    Signed-off-by: Tejun Heo
    Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
    Cc: stable@vger.kernel.org # v4.2+
    Reported-and-tested-by: Andrey Konovalov
    Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
    Reviewed-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Tejun Heo
     

13 Oct, 2015

1 commit

  • bdi_for_each_wb() is used in several places to wake up or issue
    writeback work items to all wb's (bdi_writeback's) on a given bdi.
    The iteration is performed by walking bdi->cgwb_tree; however, the
    tree only indexes wb's which are currently active.

    For example, when a memcg gets associated with a different blkcg, the
    old wb is removed from the tree so that the new one can be indexed.
    The old wb starts dying from then on but will linger till all its
    inodes are drained. As these dying wb's may still host dirty inodes,
    writeback operations which affect all wb's must include them.
    bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
    failing to sync the inodes belonging to those wb's.

    This patch adds a RCU protected @bdi->wb_list which lists all wb's
    belonging to that bdi. wb's are added on creation and removed on
    release rather than on the start of destruction. bdi_for_each_wb()
    usages are replaced with list_for_each[_continue]_rcu() iterations
    over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
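
    Iterations then become plain RCU list walks, roughly (assuming the
    list node field is named bdi_node as in the mainline patch):

    struct bdi_writeback *wb;

    rcu_read_lock();
    list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
            wb_wakeup(wb);          /* now also visits dying wb's */
    rcu_read_unlock();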

    v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Artem Bityutskiy
    Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
    Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Sep, 2015

2 commits

  • Pull to receive 9badce000e2c ("cgroup, writeback: don't enable cgroup
    writeback on traditional hierarchies"). The commit adds
    cgroup_on_dfl() usages in include/linux/backing-dev.h thus causing a
    silent conflict with 9e10a130d9b6 ("cgroup: replace cgroup_on_dfl()
    tests in controllers with cgroup_subsys_on_dfl()"). The conflict is
    fixed during this merge.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • inode_cgwb_enabled() gates cgroup writeback support. If it returns
    true, each inode is attached to the corresponding memory domain which
    gets mapped to an io domain. It currently only tests whether the
    filesystem and bdi support cgroup writeback; however, cgroup writeback
    support doesn't work on traditional hierarchies and thus it should
    also test whether memcg and iocg are on the default hierarchy.

    This caused traditional hierarchy setups to hit the cgroup writeback
    path inadvertently and ended up creating separate writeback domains
    for each memcg and mapping them all to the root iocg uncovering a
    couple issues in the cgroup writeback path.

    cgroup writeback was never meant to be enabled on traditional
    hierarchies. Make inode_cgwb_enabled() test whether both memcg and
    iocg are on the default hierarchy.
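
    The resulting check looks roughly like this (a sketch close to the
    mainline helper; treat details as illustrative):

    static inline bool inode_cgwb_enabled(struct inode *inode)
    {
            struct backing_dev_info *bdi = inode_to_bdi(inode);

            return cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
                   cgroup_subsys_on_dfl(io_cgrp_subsys) &&
                   bdi_cap_account_dirty(bdi) &&
                   (bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
                   (inode->i_sb->s_iflags & SB_I_CGROUPWB);
    }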

    Signed-off-by: Tejun Heo
    Reported-by: Artem Bityutskiy
    Reported-by: Dexuan Cui
    Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
    Link: http://lkml.kernel.org/g/f30d4a6aa8a546ff88f73021d026a453@SIXPR30MB031.064d.mgd.msft.net

    Tejun Heo
     

19 Aug, 2015

2 commits

  • The blkio interface has become messy over time and is currently the
    largest of the cgroup controller interfaces. In addition to the
    inconsistent naming scheme, it has
    multiple stat files which report more or less the same thing, a number
    of debug stat files which expose internal details which shouldn't have
    been part of the public interface in the first place, recursive and
    non-recursive stats and leaf and non-leaf knobs.

    Neither the recursive vs. non-recursive nor the leaf vs. non-leaf
    distinction makes any sense on the unified hierarchy as only leaf cgroups can
    contain processes. cgroups is going through a major interface
    revision with the unified hierarchy involving significant fundamental
    usage changes and given that a significant portion of the interface
    doesn't make sense anymore, it's a good time to reorganize the
    interface.

    As the first step, this patch renames the external visible subsystem
    name from "blkio" to "io". This is more concise, matches the other
    two major subsystem names, "cpu" and "memory", and better suited as
    blkcg will be involved in anything writeback related too whether an
    actual block device is involved or not.

    As the subsystem legacy_name is set to "blkio", the only userland
    visible change outside the unified hierarchy is that blkcg is reported
    as "io" instead of "blkio" in the subsystem initialized message during
    boot. On the unified hierarchy, blkcg now appears as "io".
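
    The user-visible part of the rename boils down to the subsystem's
    legacy_name (a sketch; the subsystem's callbacks are omitted):

    /* block/blk-cgroup.c */
    struct cgroup_subsys io_cgrp_subsys = {
            /* css/attach callbacks elided in this sketch */
            .legacy_name    = "blkio",      /* shown on traditional hierarchies */
    };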

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • wb's (bdi_writeback's) are currently keyed by memcg ID; however, in an
    earlier implementation, wb's were keyed by blkcg ID.
    bdi_for_each_wb() walks bdi->cgwb_tree in the ascending ID order and
    allows iterations to start from an arbitrary ID which is used to
    interrupt and resume iterations.

    Unfortunately, while changing wb to be keyed by memcg ID instead of
    blkcg, bdi_for_each_wb() was missed and is still assuming that wb's
    are keyed by blkcg ID. This doesn't affect iterations which don't get
    interrupted but bdi_split_work_to_wbs() makes use of iteration
    resuming on allocation failures and thus may incorrectly skip or
    repeat wb's.

    Fix it by changing bdi_for_each_wb() to take memcg IDs instead of
    blkcg IDs and updating bdi_split_work_to_wbs() accordingly.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Jul, 2015

1 commit

  • 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific
    bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
    (bdi_writeback's). As the congested state needs to be per-wb and
    referenced from blkcg side and multiple wbs, the patch made all
    non-root cong's (bdi_writeback_congested's) reference counted and
    indexed on bdi.

    When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
    non-root cong's; however, this can hang indefinitely because wb's can
    also be referenced from blkcg_gq's which are destroyed after bdi
    destruction is complete.

    To fix the bug, bdi destruction will be updated to not wait for cong's
    to drain, which naturally means that cong's may outlive the associated
    bdi. This is fine for non-root cong's but is problematic for the root
    cong's which are embedded in their bdi's as they may end up getting
    dereferenced after the containing bdi's are freed.

    This patch makes root cong's behave the same as non-root cong's. They
    are no longer embedded in their bdi's but allocated separately during
    bdi initialization, indexed and reference counted the same way.

    * As cong handling is the same for all wb's, wb->congested
    initialization is moved into wb_init().

    * When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
    bdi->wb_congested is now a pointer pointing to the root cong
    allocated during bdi init and minimal refcnting operations are
    implemented.

    * The above makes root wb init paths diverge depending on
    CONFIG_CGROUP_WRITEBACK. root wb init is moved to cgwb_bdi_init().

    This patch in itself shouldn't cause any consequential behavior
    differences but prepares for the actual fix.
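
    On the !CONFIG_CGROUP_WRITEBACK side, the minimal refcnting amounts to
    something like this (a sketch; note the kfree() dependency Jens calls
    out below):

    static inline void
    wb_congested_put(struct bdi_writeback_congested *congested)
    {
            if (atomic_dec_and_test(&congested->refcnt))
                    kfree(congested);
    }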

    Signed-off-by: Tejun Heo
    Reported-by: Jon Christopherson
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
    Tested-by: Jon Christopherson

    Added include to backing-dev.h for kfree() definition.

    Signed-off-by: Jens Axboe

    Tejun Heo
     

26 Jun, 2015

1 commit

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last weeks writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     

18 Jun, 2015

1 commit

  • FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
    cgroup writeback; however, different super_blocks of the same
    file_system_type may or may not support cgroup writeback depending on
    filesystem options. This patch replaces FS_CGROUP_WRITEBACK with a
    per-super_block flag.

    super_block->s_flags carries some internal flags in the high bits but
    it's exposed to userland through the uapi headers and is running out of space
    anyway. This patch adds a new field super_block->s_iflags to carry
    kernel-internal flags. It is currently only used by the new
    SB_I_CGROUPWB flag whose concatenated and abbreviated name is for
    consistency with other super_block flags.

    ext2_fill_super() is updated to set SB_I_CGROUPWB.
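
    In code terms (a sketch; flag value as in mainline):

    /* include/linux/fs.h */
    #define SB_I_CGROUPWB   0x00000001      /* cgroup-aware writeback enabled */

    /* fs/ext2/super.c, in ext2_fill_super() */
    sb->s_iflags |= SB_I_CGROUPWB;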

    v2: Added super_block->s_iflags instead of stealing another high bit
    from sb->s_flags as suggested by Christoph and Jan.

    Signed-off-by: Tejun Heo
    Cc: Alexander Viro
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: linux-ext4@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Jun, 2015

22 commits

  • With the previous three patches, all operations which acquire wb from
    inode are either under one of inode->i_lock, mapping->tree_lock or
    wb->list_lock or protected by unlocked_inode_to_wb transaction. This
    will be depended upon by foreign inode wb switching.

    This patch adds lockdep assertion to inode_to_wb() so that usages
    outside the above list locks can be caught easily. There are three
    exceptions.

    * locked_inode_to_wb_and_lock_list() is holding wb->list_lock but the
    wb may not be the inode's. Ensuring that is the function's role
    after all. Updated to deref inode->i_wb directly.

    * inode_wb_stat_unlocked_begin() is usually protected by combination
    of !I_WB_SWITCH and rcu_read_lock(). Updated to deref inode->i_wb
    directly.

    * inode_congested() wants to test whether inode->i_wb is set before
    starting the transaction. Added inode_to_wb_is_valid() which tests
    inode->i_wb directly.
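
    The assertion itself is roughly (a sketch along the lines of the
    mainline annotation):

    static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
    {
    #ifdef CONFIG_LOCKDEP
            WARN_ON_ONCE(debug_locks &&
                         (!lockdep_is_held(&inode->i_lock) &&
                          !lockdep_is_held(&inode->i_mapping->tree_lock) &&
                          !lockdep_is_held(&inode->i_wb->list_lock)));
    #endif
            return inode->i_wb;
    }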

    v5: might_lock() removed. It annotates that the lock is grabbed w/
    irq enabled, which isn't the case, and triggers lockdep warnings
    spuriously.

    v4: might_lock() added to unlocked_inode_to_wb_begin().

    v3: inode_congested() conversion added.

    v2: locked_inode_to_wb_and_lock_list() was missing in the first
    version.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The mechanism for detecting whether an inode should switch its wb
    (bdi_writeback) association is now in place. This patch builds the
    framework for the actual switching.

    This patch adds a new inode flag I_WB_SWITCHING, which has two
    functions. First, the easy one, it ensures that there's only one
    switching in progress for a given inode. Second, it's used as a
    mechanism to synchronize wb stat updates.

    The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
    but track the current number of dirty pages and pages under writeback
    respectively. As such, when an inode is moved from one wb to another,
    the inode's portion of those stats has to be transferred together;
    unfortunately, this is a bit tricky as those stat updates are percpu
    operations which are performed without holding any lock in some
    places.

    This patch solves the problem in a similar way as memcg. Each such
    lockless stat update is wrapped in a transaction surrounded by
    unlocked_inode_to_wb_begin/end(). During normal operation, they map
    to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
    mapping->tree_lock is grabbed across the transaction.

    In turn, the switching path sets I_WB_SWITCHING and waits for an RCU
    grace period to pass before actually starting to switch, which
    guarantees that all stat update paths are synchronizing against
    mapping->tree_lock.

    This patch still doesn't implement the actual switching.
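
    A sketch of the transaction entry point (mainline spells the flag
    I_WB_SWITCH; details per the description above):

    static inline struct bdi_writeback *
    unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
    {
            rcu_read_lock();
            /* paired with the store_release issued when switching starts */
            *lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
            if (unlikely(*lockedp))
                    spin_lock_irq(&inode->i_mapping->tree_lock);
            return inode->i_wb;
    }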

    v3: Updated on top of the recent cancel_dirty_page() updates.
    unlocked_inode_to_wb_begin() now nests inside
    mem_cgroup_begin_page_stat() to match the locking order.

    v2: The i_wb access transaction will be used for !stat accesses too.
    Function names and comments updated accordingly.

    s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
    s/switch_wb/switch_wbs/

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, for cgroup writeback, the IO submission paths directly
    associate the bio's with the blkcg from inode_to_wb_blkcg_css();
    however, it'd be necessary to keep more writeback context to implement
    foreign inode writeback detection. wbc (writeback_control) is the
    natural fit for the extra context - it persists throughout the
    writeback of each inode and is passed all the way down to IO
    submission paths.

    This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and
    wbc_attach_fdatawrite_inode() which are used to associate wbc with the
    inode being written back. IO submission paths now use wbc_init_bio()
    instead of directly associating bio's with blkcg themselves. This
    leaves inode_to_wb_blkcg_css() w/o any user. The function is removed.

    wbc currently only tracks the associated wb (bdi_writeback). Future
    patches will add more for foreign inode detection. The association is
    established under i_lock which will be depended upon when migrating
    foreign inodes to other wb's.

    As the inode-to-wb association currently never changes once
    established, going through wbc when initializing bio's doesn't cause
    any behavior changes.
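
    Typical usage, sketched (the surrounding writeback path is
    illustrative):

    struct writeback_control wbc = {
            .sync_mode      = WB_SYNC_ALL,
            .nr_to_write    = LONG_MAX,
    };

    spin_lock(&inode->i_lock);
    wbc_attach_and_unlock_inode(&wbc, inode);   /* records the wb, drops i_lock */
    do_writepages(inode->i_mapping, &wbc);      /* submission calls wbc_init_bio() */
    wbc_detach_inode(&wbc);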

    v2: submit_blk_blkcg() now checks whether the wbc is associated with a
    wb before dereferencing it. This can happen when pageout() is
    writing pages directly without going through the usual writeback
    path. As the pageout() path is single-threaded, we don't want it to
    be blocked behind a slow cgroup and ultimately want it to delegate
    actual writing to the usual writeback path.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, majority of cgroup writeback support including all the
    above functions are implemented in include/linux/backing-dev.h and
    mm/backing-dev.c; however, the portion closely related to writeback
    logic implemented in include/linux/writeback.h and mm/page-writeback.c
    will expand to support foreign writeback detection and correction.

    This patch moves wb[_try]_get() and wb_put() to
    include/linux/backing-dev-defs.h so that they can be used from
    writeback.h and inode_{attach|detach}_wb() to writeback.h and
    page-writeback.c.

    This is pure reorganization and doesn't introduce any functional
    changes.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • [__]block_write_full_page() is used to implement ->writepage in
    various filesystems. All writeback logic is now updated to handle
    cgroup writeback and the block cgroup to issue IOs for is encoded in
    writeback_control and can be retrieved from the inode; however,
    [__]block_write_full_page() currently ignores the blkcg indicated by
    inode and issues all bio's without explicit blkcg association.

    This patch adds submit_bh_blkcg() which associates the bio with the
    specified blkio cgroup before issuing and uses it in
    __block_write_full_page() so that the issued bio's are associated with
    inode_to_wb_blkcg_css(inode).

    v2: Updated for per-inode wb association.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bdi_start_background_writeback() currently takes @bdi and kicks the
    root wb (bdi_writeback). In preparation for cgroup writeback support,
    make it take wb instead.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • writeback_in_progress() currently takes @bdi and returns whether
    writeback is in progress on its root wb (bdi_writeback). In
    preparation for cgroup writeback support, make it take wb instead.
    While at it, make it an inline function.

    This patch doesn't make any functional difference.
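
    The inline version reduces to a bit test (a sketch):

    static inline bool writeback_in_progress(struct bdi_writeback *wb)
    {
            return test_bit(WB_writeback_running, &wb->state);
    }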

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bdi_start_writeback() is a thin wrapper on top of
    __wb_start_writeback() which is used only by laptop_mode_timer_fn().
    This patches removes bdi_start_writeback(), renames
    __wb_start_writeback() to wb_start_writeback() and makes
    laptop_mode_timer_fn() use it instead.

    This doesn't cause any functional difference and will ease making
    laptop_mode_timer_fn() cgroup writeback aware.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • This will be used to implement bdi-wide operations which should be
    distributed across all its cgroup bdi_writebacks.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bdi_has_dirty_io() used to only reflect whether the root wb
    (bdi_writeback) has dirty inodes. For cgroup writeback support, it
    needs to take all active wb's into account. If any wb on the bdi has
    dirty inodes, bdi_has_dirty_io() should return true.

    To achieve that, as inode_wb_list_{move|del}_locked() now keep track
    of the dirty state transition of each wb, the number of dirty wbs can
    be counted in the bdi; however, bdi is already aggregating
    wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when
    there are any dirty inodes by ensuring wb->avg_write_bandwidth can't
    dip below 1. bdi_has_dirty_io() can simply test whether
    bdi->tot_write_bandwidth is zero or not.

    While this bumps the value of wb->avg_write_bandwidth to one when it
    used to be zero, this shouldn't cause any meaningful behavior
    difference.

    bdi_has_dirty_io() is made an inline function which tests whether
    ->tot_write_bandwidth is non-zero. Also, WARN_ON_ONCE()'s on its
    value are added to inode_wb_list_{move|del}_locked().
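
    So the test collapses to (a sketch, assuming tot_write_bandwidth is an
    atomic_long_t as in mainline):

    static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi)
    {
            /* non-zero iff some member wb has dirty inodes; see above */
            return atomic_long_read(&bdi->tot_write_bandwidth);
    }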

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback)
    has any dirty inode by testing all three IO lists on each invocation
    without actively keeping track. For cgroup writeback support, a
    single bdi will host multiple wb's each of which will host dirty
    inodes separately and we'll need to make bdi_has_dirty_io(), which
    currently only represents the root wb, aggregate has_dirty_io from all
    member wb's, which requires tracking transitions in has_dirty_io state
    on each wb.

    This patch introduces inode_wb_list_{move|del}_locked() to consolidate
    IO list operations leaving queue_io() the only other function which
    directly manipulates IO lists (via move_expired_inodes()). All three
    functions are updated to call wb_io_lists_[de]populated() which keep
    track of whether the wb has dirty inodes or not and record it using
    the new WB_has_dirty_io flag. inode_wb_list_move_locked()'s return
    value indicates whether the wb had no dirty inodes before.

    mark_inode_dirty() is restructured so that the return value of
    inode_wb_list_move_locked() can be used for deciding whether to wake
    up the wb.

    While at it, change {bdi|wb}_has_dirty_io()'s return values to bool.
    These functions were returning 0 and 1 before. Also, add a comment
    explaining the synchronization of wb_state flags.
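
    A sketch of the consolidated move helper (simplified; the b_dirty_time
    special-casing mentioned in v2 is elided):

    static bool inode_wb_list_move_locked(struct inode *inode,
                                          struct bdi_writeback *wb,
                                          struct list_head *head)
    {
            assert_spin_locked(&wb->list_lock);

            list_move(&inode->i_wb_list, head);
            /* true if @wb had no dirty inodes before this move */
            return wb_io_lists_populated(wb);
    }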

    v2: Updated to accommodate b_dirty_time.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • In several places, bdi_congested() and its wrappers are used to
    determine whether more IOs should be issued. With cgroup writeback
    support, this question can't be answered solely based on the bdi
    (backing_dev_info). It's dependent on whether the filesystem and bdi
    support cgroup writeback and the blkcg the inode is associated with.

    This patch implements inode_congested() and its wrappers which take
    @inode and determines the congestion state considering cgroup
    writeback. The new functions replace bdi_*congested() calls in places
    where the query is about specific inode and task.

    There are several filesystem users which also fit this criteria but
    they should be updated when each filesystem implements cgroup
    writeback support.
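
    The wrappers mirror the existing bdi ones (a sketch):

    static inline int inode_read_congested(struct inode *inode)
    {
            return inode_congested(inode, 1 << WB_sync_congested);
    }

    static inline int inode_write_congested(struct inode *inode)
    {
            return inode_congested(inode, 1 << WB_async_congested);
    }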

    v2: Now that a given inode is associated with only one wb, congestion
    state can be determined independent from the asking task. Drop
    @task. Spotted by Vivek. Also, converted to take @inode instead
    of @mapping and renamed to inode_congested().

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, all congestion functions take bdi (backing_dev_info) and
    always operate on the root wb (bdi->wb) and the congestion state from
    the block layer is propagated only for the root blkcg. This patch
    introduces {set|clear}_wb_congested() and wb_congested() which take a
    bdi_writeback_congested and bdi_writeback respectively. The bdi
    counterparts are now wrappers invoking the wb based functions on
    @bdi->wb.

    While converting clear_bdi_congested() to clear_wb_congested(), the
    local variable declaration order between @wqh and @bit is swapped for
    cosmetic reason.

    This patch just adds the new wb based functions. The following
    patches will apply them.
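
    The bdi counterparts become thin wrappers, roughly (assuming the wb's
    pointer to its congested struct is named congested, as in mainline):

    static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
    {
            set_wb_congested(bdi->wb.congested, sync);
    }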

    v2: Updated for bdi_writeback_congested.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • For the planned cgroup writeback support, on each bdi
    (backing_dev_info), each memcg will be served by a separate wb
    (bdi_writeback). This patch updates bdi so that a bdi can host
    multiple wbs (bdi_writebacks).

    On the default hierarchy, blkcg implicitly enables memcg. This allows
    using memcg's page ownership for attributing writeback IOs, and every
    memcg - blkcg combination can be served by its own wb by assigning a
    dedicated wb to each memcg. This means that there may be multiple
    wb's of a bdi mapped to the same blkcg. As congested state is per
    blkcg - bdi combination, those wb's should share the same congested
    state. This is achieved by tracking congested state via
    bdi_writeback_congested structs which are keyed by blkcg.

    bdi->wb remains unchanged and will keep serving the root cgroup.
    cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
    looked up while dirtying an inode according to the memcg of the page
    being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
    by its memcg id. Once an inode is associated with its wb, it can be
    retrieved using inode_to_wb().

    Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
    pages will keep being associated with bdi->wb.
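
    On the dirtying path the lookup is roughly (a sketch; memcg_css stands
    for the dirtying page's memcg, and creation on a miss is elided):

    struct bdi_writeback *wb;

    rcu_read_lock();
    wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
    rcu_read_unlock();
    /* on a miss, a new cgwb is allocated and inserted under a lock */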

    v3: inode_attach_wb() in account_page_dirtied() moved inside
    mapping_cap_account_dirty() block where it's known to be !NULL.
    Also, an unnecessary NULL check before kfree() removed. Both
    detected by the kbuild bot.

    v2: Updated so that wb association is per inode and wb is per memcg
    rather than blkcg.

    Signed-off-by: Tejun Heo
    Cc: kbuild test robot
    Cc: Dan Carpenter
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cgroup writeback requires support from both bdi and filesystem sides.
    Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
    support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
    default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
    both MEMCG and BLK_CGROUP are enabled.

    inode_cgwb_enabled() which determines whether a given inode's both bdi
    and fs support cgroup writeback is added.
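
    The Kconfig side, roughly (a sketch):

    config CGROUP_WRITEBACK
            bool
            depends on MEMCG && BLK_CGROUP
            default y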

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a wb's (bdi_writeback) congestion state is carried in its
    ->state field; however, cgroup writeback support will require multiple
    wb's sharing the same congestion state. This patch separates out
    congestion state into its own struct - struct bdi_writeback_congested.
    A new wb field, wb_congested, points to its associated congested
    struct. The default wb, bdi->wb, always points to bdi->wb_congested.

    While this patch adds a layer of indirection, it doesn't introduce any
    behavior changes.
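
    At this point the separated struct only carries the congestion bits
    (a sketch):

    struct bdi_writeback_congested {
            unsigned long state;    /* WB_[a]sync_congested flags */
    };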

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Now that bdi definitions are moved to backing-dev-defs.h,
    backing-dev.h can include blkdev.h and inline inode_to_bdi() without
    worrying about introducing circular include dependency. The function
    gets called from hot paths and is fairly trivial.

    This patch makes inode_to_bdi(), and the sb_is_blkdev_sb() call it
    makes, inline. blockdev_superblock and noop_backing_dev_info
    are EXPORT_GPL'd to allow the inline functions to be used from
    modules.

    While at it, make sb_is_blkdev_sb() return bool instead of int.
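
    The resulting inline is roughly (a sketch; CONFIG_BLOCK guard as in
    mainline):

    static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
    {
            struct super_block *sb;

            if (!inode)
                    return &noop_backing_dev_info;

            sb = inode->i_sb;
    #ifdef CONFIG_BLOCK
            if (sb_is_blkdev_sb(sb))
                    return blk_get_backing_dev_info(I_BDEV(inode));
    #endif
            return sb->s_bdi;
    }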

    v2: Fixed typo in description as suggested by Jan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the planned cgroup writeback support, backing-dev related
    declarations will be more widely used across block and cgroup;
    unfortunately, including backing-dev.h from include/linux/blkdev.h
    makes cyclic include dependency quite likely.

    This patch separates out backing-dev-defs.h which only has the
    essential definitions and updates blkdev.h to include it. c files
    which need access to more backing-dev details now include
    backing-dev.h directly. This takes backing-dev.h off the common
    include dependency chain making it a lot easier to use it across block
    and cgroup.

    v2: fs/fat build failure fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bdi->wb_lock and ->worklist into wb.

    * The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
    moving, rename it to wb->work_lock as wb->wb_lock is confusing.
    Also, move wb->dwork downwards so that it's colocated with the new
    ->work_lock and ->work_list fields.

    * bdi_writeback_workfn() -> wb_workfn()
    bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb)
    bdi_wakeup_thread(bdi) -> wb_wakeup(wb)
    bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...)
    __bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
    get_next_work_item(bdi) -> get_next_work_item(wb)

    * bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
    The function contained parts which belong to the containing bdi
    rather than the wb itself - testing cap_writeback_dirty and
    bdi_remove_from_list() invocation. Those are moved to
    bdi_unregister().

    * bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
    Initializations of the moved bdi->wb_lock and ->work_list are
    relocated from bdi_init() to wb_init().

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->state are mechanically replaced with bdi->wb.state
    introducing no behavior changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bandwidth related fields from backing_dev_info into
    bdi_writeback.

    * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
    write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
    balanced_dirty_ratelimit, completions and dirty_exceeded.

    * writeback_chunk_size() and over_bground_thresh() now take @wb
    instead of @bdi.

    * bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
    bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
    bdi_position_ratio(bdi, ...) -> wb_position_ratio(wb, ...)
    bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
    [__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
    bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
    bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)

    * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
    respectively. Note that explicit zeroing is dropped in the process
    as wb's are cleared in entirety anyway.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
    introducing no behavior changes.

    v2: Typo in description fixed as suggested by Jan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bdi->bdi_stat[] into wb.

    * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
    enums is changed from BDI_ to WB_.

    * BDI_STAT_BATCH() -> WB_STAT_BATCH()

    * [__]{add|inc|dec|sum}_bdi_stat(bdi, ...) -> [__]{add|inc|dec|sum}_wb_stat(wb, ...)

    * bdi_stat[_error]() -> wb_stat[_error]()

    * bdi_writeout_inc() -> wb_writeout_inc()

    * stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and
    frees stat.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
    introducing no behavior changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: Miklos Szeredi
    Cc: Trond Myklebust
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bdi->state into wb.

    * enum bdi_state is renamed to wb_state and the prefix of all enums is
    changed from BDI_ to WB_.

    * Explicit zeroing of bdi->state is removed without adding zeroing of
    wb->state as the whole data structure is zeroed on init anyway.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->state are mechanically replaced with bdi->wb.state
    introducing no behavior changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: drbd-dev@lists.linbit.com
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Signed-off-by: Jens Axboe

    Tejun Heo
     

29 May, 2015

1 commit

  • bdi_unregister() now contains very little functionality.

    It contains a "WARN_ON" if bdi->dev is NULL. This warning is of no
    real consequence as bdi->dev isn't needed by anything else in the function,
    and it triggers if
    blk_cleanup_queue() -> bdi_destroy()
    is called before bdi_unregister, which happens since
    Commit: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")

    So this isn't wanted.

    It also calls bdi_set_min_ratio(). This needs to be called after
    writes through the bdi have all been flushed, and before the bdi is destroyed.
    Calling it early is better than calling it late as it frees up a global
    resource.

    Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
    perfectly fits these requirements.

    So bdi_unregister() can be discarded, with the important content moved to
    bdi_destroy(), as can the writeback_bdi_unregister event which is
    already not used.

    Reported-by: Mike Snitzer
    Cc: stable@vger.kernel.org (v4.0)
    Fixes: c4db59d31e39 ("fs: don't reassign dirty inodes to default_backing_dev_info")
    Fixes: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Dan Williams
    Tested-by: Nicholas Moulin
    Signed-off-by: NeilBrown
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    NeilBrown
     

05 Feb, 2015

1 commit

  • Add a new mount option which enables a new "lazytime" mode. This mode
    causes atime, mtime, and ctime updates to only be made to the
    in-memory version of the inode. The on-disk times will only get
    updated when (a) the inode needs to be updated for some non-time
    related change, (b) userspace calls fsync(), syncfs() or sync(), or
    (c) just before an undeleted inode is evicted from memory.
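
    For example, the mode is enabled with "mount -o lazytime /dev/sda1 /mnt"
    (device and mountpoint illustrative).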

    This is OK according to POSIX because there are no guarantees after a
    crash unless userspace explicitly requests them via an fsync(2) call.

    For workloads which feature a large number of random writes to a
    preallocated file, the lazytime mount option significantly reduces
    writes to the inode table. The repeated 4k writes to a single block
    will result in undesirable stress on flash devices and SMR disk
    drives. Even on conventional HDD's, the repeated writes to the inode
    table block will trigger Adjacent Track Interference (ATI) remediation
    latencies, which very negatively impact long tail latencies --- which
    is a very big deal for web serving tiers (for example).

    Google-Bug-Id: 18297052

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Al Viro

    Theodore Ts'o
     

21 Jan, 2015

3 commits

  • Now that default_backing_dev_info is not used for writeback purposes we can
    get rid of it easily:

    - instead of using its name for tracing an unregistered bdi we just use
    "unknown"
    - btrfs and ceph can just assign the default read ahead window themselves
    like several other filesystems already do.
    - we can assign noop_backing_dev_info as the default one in alloc_super.
    All filesystems already either assigned their own or
    noop_backing_dev_info.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Now that we got rid of the bdi abuse on character devices we can always use
    sb->s_bdi to get at the backing_dev_info for a file, except for the block
    device special case. Export inode_to_bdi and replace uses of
    mapping->backing_dev_info with it to prepare for the removal of
    mapping->backing_dev_info.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Since "BDI: Provide backing device capability information [try #3]" the
    backing_dev_info structure also provides flags for the kind of mmap
    operation available in a nommu environment, which is entirely unrelated
    to its original purpose.

    Introduce a new nommu-only file operation to provide this information to
    the nommu mmap code instead. Splitting this from the backing_dev_info
    structure allows removing lots of backing_dev_info instances that aren't
    otherwise needed, and entirely gets rid of the concept of providing a
    backing_dev_info for a character device. It also removes the need for
    the mtd_inodefs filesystem.
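
    The hook and a typical implementation, roughly (a sketch; the
    ramfs-style values are illustrative):

    /* in struct file_operations, nommu configurations only */
    unsigned (*mmap_capabilities)(struct file *);

    static unsigned ramfs_mmap_capabilities(struct file *file)
    {
            return NOMMU_MAP_DIRECT | NOMMU_MAP_COPY | NOMMU_MAP_READ |
                   NOMMU_MAP_WRITE | NOMMU_MAP_EXEC;
    }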

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Acked-by: Brian Norris
    Signed-off-by: Jens Axboe

    Christoph Hellwig