15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and the ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enables us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
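
    For a C source file, the added identifier is a single comment at the top
    of the file. A representative example (the exact comment style varies by
    file type):

        // SPDX-License-Identifier: GPL-2.0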

    This patch is based on work done by Thomas Gleixner, Kate Stewart, and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and where references to a
    license had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to apply to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to each file. She confirmed any determination that was
    not immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

05 Oct, 2017

1 commit

  • Handle start-all writeback like we do periodic or kupdate-style
    writeback - by marking the bdi_writeback as needing a full
    flush, and simply waking the thread. This eliminates the need to
    allocate and queue a specific work item just for this purpose.

    After this change, we truly only ever have one of them running at
    any point in time. We mark the need to start all flushes, and the
    writeback thread will clear it once it has processed the request.
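
    A minimal C sketch of that mark-and-wake scheme (the WB_start_all bit and
    wb_wakeup() follow the kernel's writeback code as recalled here; this is
    an illustration, not the patch itself):

        static void wb_start_writeback(struct bdi_writeback *wb,
                                       enum wb_reason reason)
        {
                if (!wb_has_dirty_io(wb))
                        return;

                /* allow only one start-all request in flight at a time;
                 * no work item needs to be allocated for it */
                if (test_bit(WB_start_all, &wb->state) ||
                    test_and_set_bit(WB_start_all, &wb->state))
                        return;

                wb->start_all_reason = reason;
                wb_wakeup(wb);          /* flusher clears WB_start_all */
        }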

    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Jul, 2017

1 commit

  • An inode number and generation can identify a kernfs node. We are going to
    export this identification via exportfs operations, so put ino and
    generation into a separate structure. This is convenient when later
    patches use the identification.
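
    A C sketch of such a structure (the union-with-u64 layout and field widths
    are assumptions recalled from the kernfs headers of that era):

        #include <linux/types.h>

        union kernfs_node_id {
                struct {
                        u32 ino;
                        u32 generation;
                };
                u64 id;         /* both halves as one exportable handle */
        };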

    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

25 Feb, 2017

1 commit

  • Memory pressure can put dirty pages at the end of the LRU without
    anybody running into dirty limits. Don't start writing individual pages
    from kswapd while the flushers might be asleep.

    Unlike the old direct reclaim flusher wakeup (removed in the next patch)
    that flushes the number of pages just scanned, this patch wakes the
    flushers for all outstanding dirty pages. That seemed to perform better
    in a synthetic test that pushes dirty pages to the end of the LRU and
    into reclaim, because we know LRU aging outstrips writeback already, and
    this way we give younger dirty pages a head start rather than wait until
    reclaim runs into them as well. It also means less plugging and risk of
    exhausting the struct request pool from reclaim.

    There is a concern that this will cause temporary files that used to get
    dirtied and truncated before writeback to now get written to disk under
    memory pressure. If this turns out to be a real problem, we'll have to
    revisit this and tame the reclaim flusher wakeups.

    [hannes@cmpxchg.org: mention dirty expiration as a condition]
    Link: http://lkml.kernel.org/r/20170126174739.GA30636@cmpxchg.org
    Link: http://lkml.kernel.org/r/20170123181641.23938-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

29 Jul, 2016

2 commits

  • As reclaim is now node-based, it follows that page write activity due to
    page reclaim should also be accounted for on the node. For consistency,
    also account page writes and page dirtying on a per-node basis.

    After this patch, there are a few remaining zone counters that may appear
    strange but are fine. NUMA stats are still per-zone as this is a
    user-space interface that tools consume. NR_MLOCK, NR_SLAB_*,
    NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
    potentially pin low memory and cannot trivially be reclaimed on demand.
    This information is still useful for debugging a page allocation failure
    warning.

    Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are now a number of accounting oddities such as mapped file pages
    being accounted for on the node while the total number of file pages are
    accounted on the zone. This can be coped with to some extent but it's
    confusing, so this patch moves the relevant file-based accounting. Due to
    throttling logic in the page allocator for reliable OOM detection, it is
    still necessary to track dirty and writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

1 commit

  • The per-sb inode writeback list tracks inodes currently under writeback
    to facilitate efficient sync processing. In particular, it ensures that
    sync only needs to walk through a list of inodes that were cleaned by
    the sync.

    Add a couple of tracepoints to help identify when inodes are added to
    and removed from the writeback lists. Piggyback off the writeback
    lazytime tracepoint template as it already tracks the relevant inode
    information.

    Link: http://lkml.kernel.org/r/1466594593-6757-3-git-send-email-bfoster@redhat.com
    Signed-off-by: Brian Foster
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    cc: Josef Bacik
    Cc: Holger Hoffstätte
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Foster
     

09 Mar, 2016

1 commit

  • commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
    tracepoints to report cgroup") made writeback tracepoints print out cgroup
    path when CGROUP_WRITEBACK is enabled, but it may trigger the bug below on
    -rt kernels, since kernfs_path and kernfs_path_len are called by
    tracepoints and acquire a spin lock that is sleepable on -rt kernels.

    BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930
    in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
    INFO: lockdep is turned off.
    Preemption disabled at:[] wb_writeback+0xec/0x830

    CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
    Hardware name: Freescale Layerscape 2085a RDB Board (DT)
    Workqueue: writeback wb_workfn (flush-7:0)
    Call trace:
    [] dump_backtrace+0x0/0x200
    [] show_stack+0x24/0x30
    [] dump_stack+0x88/0xa8
    [] ___might_sleep+0x2ec/0x300
    [] rt_spin_lock+0x38/0xb8
    [] kernfs_path_len+0x30/0x90
    [] trace_event_raw_event_writeback_work_class+0xe8/0x2e8
    [] wb_writeback+0x620/0x830
    [] wb_workfn+0x61c/0x950
    [] process_one_work+0x3ac/0xb30
    [] worker_thread+0x9c/0x7a8
    [] kthread+0x190/0x1b0
    [] ret_from_fork+0x10/0x30

    With unlocked kernfs_* functions, synchronize_sched() would have to be
    called in kernfs_rename(), which can be called in the syscall path, and
    that is problematic. So, print out the cgroup ino instead of the path
    name, which can be converted to a path name by userland.

    Without CGROUP_WRITEBACK enabled, it just prints out the root dir. But the
    root dir ino varies between filesystems, so print out -1U to indicate an
    invalid cgroup ino.

    Link: http://lkml.kernel.org/r/1456996137-8354-1-git-send-email-yang.shi@linaro.org

    Acked-by: Tejun Heo
    Signed-off-by: Yang Shi
    Signed-off-by: Steven Rostedt

    Yang Shi
     

19 Aug, 2015

1 commit

  • The following tracepoints are updated to report the cgroup used during
    cgroup writeback.

    * writeback_write_inode[_start]
    * writeback_queue
    * writeback_exec
    * writeback_start
    * writeback_written
    * writeback_wait
    * writeback_nowork
    * writeback_wake_background
    * wbc_writepage
    * writeback_queue_io
    * bdi_dirty_ratelimit
    * balance_dirty_pages
    * writeback_sb_inodes_requeue
    * writeback_single_inode[_start]

    Note that writeback_bdi_register is separated out from writeback_class,
    as reporting a cgroup doesn't make sense for it. Tracepoints which take a
    bdi are updated to take a bdi_writeback instead.

    Signed-off-by: Tejun Heo
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

26 Jun, 2015

1 commit

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade; finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     

02 Jun, 2015

2 commits

  • This patch is a part of the series to define wb_domain which
    represents a domain that wb's (bdi_writeback's) belong to and are
    measured against each other in. This will enable IO backpressure
    propagation for cgroup writeback.

    global_dirty_limit exists to regulate the global dirty threshold which
    is a property of the wb_domain. This patch moves hard_dirty_limit,
    dirty_lock, and update_time into wb_domain.

    This is pure reorganization and doesn't introduce any behavioral
    changes.
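
    A rough C sketch of the consolidated state (names follow the description
    above; the real struct wb_domain also carries bandwidth-estimation
    members):

        struct wb_domain {
                spinlock_t lock;                  /* was dirty_lock */
                unsigned long dirty_limit;        /* was hard_dirty_limit */
                unsigned long dirty_limit_tstamp; /* was update_time */
        };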

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bandwidth related fields from backing_dev_info into
    bdi_writeback.

    * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
    write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
    balanced_dirty_ratelimit, completions and dirty_exceeded.

    * writeback_chunk_size() and over_bground_thresh() now take @wb
    instead of @bdi.

    * bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
    bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
    bdi_position_ratio(bdi, ...) -> wb_position_ratio(wb, ...)
    bdi_update_write_bandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
    [__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
    bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
    bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)

    * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
    respectively. Note that explicit zeroing is dropped in the process
    as wb's are cleared in entirety anyway.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
    introducing no behavior changes.

    v2: Typo in description fixed as suggested by Jan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Signed-off-by: Jens Axboe

    Tejun Heo
     

29 May, 2015

1 commit

  • bdi_unregister() now contains very little functionality.

    It contains a "WARN_ON" if bdi->dev is NULL. This warning is of no
    real consequence as bdi->dev isn't needed by anything else in the function,
    and it triggers if
    blk_cleanup_queue() -> bdi_destroy()
    is called before bdi_unregister, which happens since
    Commit: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")

    So this isn't wanted.

    It also calls bdi_set_min_ratio(). This needs to be called after
    writes through the bdi have all been flushed, and before the bdi is destroyed.
    Calling it early is better than calling it late as it frees up a global
    resource.

    Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
    perfectly fits these requirements.

    So bdi_unregister() can be discarded, with the important content moved to
    bdi_destroy(), as can the writeback_bdi_unregister event, which is
    already unused.

    Reported-by: Mike Snitzer
    Cc: stable@vger.kernel.org (v4.0)
    Fixes: c4db59d31e39 ("fs: don't reassign dirty inodes to default_backing_dev_info")
    Fixes: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Dan Williams
    Tested-by: Nicholas Moulin
    Signed-off-by: NeilBrown
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    NeilBrown
     

08 Apr, 2015

1 commit

  • The enums used in tracepoints for __print_symbolic() do not have their
    values shown in the tracepoint format files and this makes it difficult
    for user space tools to convert the binary values to the strings they
    are to represent.

    Add TRACE_DEFINE_ENUM(x) macros to export the enum names to their values
    to make the tracing output from user space tools more robust.
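
    A short sketch of the pattern (TRACE_DEFINE_ENUM and __print_symbolic are
    the real tracing macros; the enum values shown are illustrative):

        /* export the enum values so the format file shows them resolved */
        TRACE_DEFINE_ENUM(WB_REASON_BACKGROUND);
        TRACE_DEFINE_ENUM(WB_REASON_SYNC);

        #define show_wb_reason(r)                               \
                __print_symbolic(r,                             \
                        { WB_REASON_BACKGROUND, "background" }, \
                        { WB_REASON_SYNC,       "sync" })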

    Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org

    Cc: Dave Chinner
    Cc: Jens Axboe
    Reviewed-by: Masami Hiramatsu
    Tested-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

05 Feb, 2015

1 commit

  • Add a new mount option which enables a new "lazytime" mode. This mode
    causes atime, mtime, and ctime updates to only be made to the
    in-memory version of the inode. The on-disk times will only get
    updated when (a) the inode needs to be updated for some non-time-related
    change, (b) userspace calls fsync(), syncfs() or sync(), or (c) just
    before an undeleted inode is evicted from memory.

    This is OK according to POSIX because there are no guarantees after a
    crash unless userspace explicitly requests them via an fsync(2) call.

    For workloads which feature a large number of random writes to a
    preallocated file, the lazytime mount option significantly reduces
    writes to the inode table. The repeated 4k writes to a single block
    will result in undesirable stress on flash devices and SMR disk
    drives. Even on conventional HDD's, the repeated writes to the inode
    table block will trigger Adjacent Track Interference (ATI) remediation
    latencies, which very negatively impact long tail latencies --- which
    is a very big deal for web serving tiers (for example).
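
    A small userspace C sketch of the behavior (the mount point and file path
    are hypothetical):

        /* On a filesystem mounted with -o lazytime, the writes below bump
         * mtime in memory only; fsync() - case (b) above - forces the
         * on-disk inode to be updated. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/mnt/lazy/data", O_WRONLY | O_CREAT, 0644);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                for (int i = 0; i < 1000; i++)
                        if (write(fd, "x", 1) != 1)
                                break;  /* in-memory time updates only */
                fsync(fd);              /* persists data and inode times */
                close(fd);
                return 0;
        }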

    Google-Bug-Id: 18297052

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Al Viro

    Theodore Ts'o
     

21 Jan, 2015

2 commits

  • Now that default_backing_dev_info is not used for writeback purposes, we
    can get rid of it easily:

    - instead of using its name for tracing an unregistered bdi, we just use
    "unknown"
    - btrfs and ceph can just assign the default read ahead window themselves
    like several other filesystems already do.
    - we can assign noop_backing_dev_info as the default one in alloc_super.
    All filesystems already either assigned their own or
    noop_backing_dev_info.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Now that we have gotten rid of the bdi abuse on character devices, we can
    always use sb->s_bdi to get at the backing_dev_info for a file, except for
    the block device special case. Export inode_to_bdi and replace uses of
    mapping->backing_dev_info with it to prepare for the removal of
    mapping->backing_dev_info.
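
    A sketch of the resulting helper (close to the kernel function as
    recalled; treat the blockdev branch details as approximate):

        struct backing_dev_info *inode_to_bdi(struct inode *inode)
        {
                struct super_block *sb = inode->i_sb;

                /* block devices are the special case: their bdi comes from
                 * the request queue, not from the superblock */
                if (sb_is_blkdev_sb(sb))
                        return blk_get_backing_dev_info(I_BDEV(inode));

                return sb->s_bdi;
        }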

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Apr, 2014

1 commit

  • Pull tracing updates from Steven Rostedt:
    "Most of the changes were largely clean ups, and some documentation.
    But there were a few features that were added:

    Uprobes now work with event triggers and multi buffers and have
    support under ftrace and perf.

    The big feature is that the function tracer can now be used within the
    multi buffer instances. That is, you can now trace some functions in
    one buffer, others in another buffer, all functions in a third buffer
    and so on. They are basically agnostic from each other. This only
    works for the function tracer and not for the function graph tracer,
    although you can have the function graph tracer running in the top
    level buffer (or any tracer for that matter) and have different
    function tracing going on in the sub buffers"

    * tag 'trace-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (45 commits)
    tracing: Add BUG_ON when stack end location is over written
    tracepoint: Remove unused API functions
    Revert "tracing: Move event storage for array from macro to standalone function"
    ftrace: Constify ftrace_text_reserved
    tracepoints: API doc update to tracepoint_probe_register() return value
    tracepoints: API doc update to data argument
    ftrace: Fix compilation warning about control_ops_free
    ftrace/x86: BUG when ftrace recovery fails
    ftrace: Warn on error when modifying ftrace function
    ftrace: Remove freelist from struct dyn_ftrace
    ftrace: Do not pass data to ftrace_dyn_arch_init
    ftrace: Pass retval through return in ftrace_dyn_arch_init()
    ftrace: Inline the code from ftrace_dyn_table_alloc()
    ftrace: Cleanup of global variables ftrace_new_pgs and ftrace_update_cnt
    tracing: Evaluate len expression only once in __dynamic_array macro
    tracing: Correctly expand len expressions from __dynamic_array macro
    tracing/module: Replace include of tracepoint.h with jump_label.h in module.h
    tracing: Fix event header migrate.h to include tracepoint.h
    tracing: Fix event header writeback.h to include tracepoint.h
    tracing: Warn if a tracepoint is not set via debugfs
    ...

    Linus Torvalds
     

22 Feb, 2014

1 commit

  • This reverts commit c4a391b53a72d2df4ee97f96f78c1d5971b47489. Dave
    Chinner has reported that the commit may cause some
    inodes to be left out of sync(2). This is because we can call
    redirty_tail() for some inode (which sets i_dirtied_when to the current
    time) after sync(2) has started, or similarly requeue_inode() can set
    i_dirtied_when to the current time if writeback had to skip some pages.
    The real problem is in the functions clobbering i_dirtied_when, but
    fixing that isn't trivial, so a revert is the safer choice for now.

    CC: stable@vger.kernel.org # >= 3.13
    Signed-off-by: Jan Kara

    Jan Kara
     

13 Nov, 2013

1 commit

  • When there are processes heavily creating small files while sync(2) is
    running, it can easily happen that quite some new files are created
    between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
    especially if there are several busy filesystems (remember that sync
    traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
    fs before starting it on another fs). Because the WB_SYNC_ALL pass is slow
    (e.g. it causes a transaction commit and cache flush for each inode on
    ext3), the resulting sync(2) times are rather large.

    The following script reproduces the problem:

    function run_writers
    {
            for (( i = 0; i < 10; i++ )); do
                    mkdir $1/dir$i
                    for (( j = 0; j < 40000; j++ )); do
                            dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
                    done &
            done
    }

    for dir in "$@"; do
            run_writers $dir
    done

    sleep 40
    time sync

    Fix the problem by disregarding inodes dirtied after sync(2) was called
    in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes
    a time stamp when sync has started which is used for setting up work for
    flusher threads.
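
    The interface change, roughly (a sketch paraphrased from the description
    above; the exact prototype is from memory):

        /* before: void sync_inodes_sb(struct super_block *sb);
         * after: the start-of-sync timestamp rides along, so the
         * WB_SYNC_ALL pass can skip inodes dirtied after sync(2) began */
        void sync_inodes_sb(struct super_block *sb,
                            unsigned long older_than_this);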

    To give some numbers, when the above script is run on two ext4 filesystems
    on a simple SATA drive, the average sync time from 10 runs is 267.549
    seconds with standard deviation 104.799426. With the patched kernel,
    the average sync time from 10 runs is 2.995 seconds with standard
    deviation 0.096.

    Signed-off-by: Jan Kara
    Reviewed-by: Fengguang Wu
    Reviewed-by: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

02 Apr, 2013

1 commit

  • Writeback implements its own worker pool - each bdi can be associated
    with a worker thread which is created and destroyed dynamically. The
    worker thread for the default bdi is always present and serves as the
    "forker" thread which forks off worker threads for other bdis.

    There's no reason for writeback to implement its own worker pool when
    using an unbound workqueue instead is much simpler and more efficient.
    This patch replaces custom worker pool implementation in writeback
    with an unbound workqueue.

    The conversion isn't too complicated but the followings are worth
    mentioning.

    * bdi_writeback->last_active, task and wakeup_timer are removed.
    delayed_work ->dwork is added instead. Explicit timer handling is
    no longer necessary. Everything works by either queueing / modding
    / flushing / canceling the delayed_work item.

    * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
    bdi_writeback->dwork. On each execution, it processes
    bdi->work_list and reschedules itself if there are more things to
    do.

    The function also handles low-mem condition, which used to be
    handled by the forker thread. If the function is running off a
    rescuer thread, it only writes out a limited number of pages so that
    the rescuer can serve other bdis too. This preserves the flusher
    creation failure behavior of the forker thread.

    * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
    bdi_writeback_workfn() about on-going bdi unregistration so that it
    always drains work_list even if it's running off the rescuer. Note
    that the original code was broken in this regard. Under memory
    pressure, a bdi could finish unregistration with non-empty
    work_list.

    * The default bdi is no longer special. It now is treated the same as
    any other bdi and bdi_cap_flush_forker() is removed.

    * BDI_pending is no longer used. Removed.

    * Some tracepoints become non-applicable. The following TPs are
    removed - writeback_nothread, writeback_wake_thread,
    writeback_wake_forker_thread, writeback_thread_start,
    writeback_thread_stop.

    Everything, including devices coming and going away and rescuer
    operation under simulated memory pressure, seems to work fine in my
    test setup.
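
    A C sketch of the delayed_work pattern the conversion relies on (bdi_wq
    and the field names mirror the patch as recalled; the workqueue calls are
    the standard API):

        /* one unbound workqueue replaces the per-bdi threads + forker */
        static struct workqueue_struct *bdi_wq;

        static void wb_wakeup_delayed(struct bdi_writeback *wb,
                                      unsigned long delay)
        {
                /* queues, re-arms or leaves the item alone as needed -
                 * this replaces the explicit wakeup_timer handling */
                mod_delayed_work(bdi_wq, &wb->dwork, delay);
        }

        /* setup, e.g. at init time:
         *   bdi_wq = alloc_workqueue("writeback",
         *                            WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
         *   INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
         */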

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Fengguang Wu
    Cc: Jeff Moyer

    Tejun Heo
     

14 Jan, 2013

1 commit

  • Add tracepoints for page dirtying, writeback_single_inode start, inode
    dirtying and writeback. For the latter two inode events, a pair of
    events are defined to denote start and end of the operations (the
    starting one has _start suffix and the one w/o suffix happens after
    the operation is complete). These inode ops are FS specific and can
    be non-trivial and having enclosing tracepoints is useful for external
    tracers.

    This is part of tracepoint additions to improve visibility into
    dirtying / writeback operations for the io tracer and userland.

    v2: writeback_dirty_inode[_start] TPs may be called for files on
    pseudo FSes w/ unregistered bdi. Check whether bdi->dev is %NULL
    before dereferencing.

    v3: buffer dirtying moved to a block TP.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

06 May, 2012

1 commit

  • When writeback_single_inode() is called on an inode which already has
    I_SYNC set while doing WB_SYNC_NONE, the inode is moved to the b_more_io
    list. However, this makes sense only if the caller is the flusher thread.
    For other callers of writeback_single_inode() it doesn't really make sense
    and may even be wrong
    - the flusher thread may be doing WB_SYNC_ALL writeback in parallel.

    So we move requeueing from writeback_single_inode() to writeback_sb_inodes().

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

25 Mar, 2012

1 commit

  • Pull <linux/device.h> avoidance patches from Paul Gortmaker:
    "Nearly every subsystem has some kind of header with a proto like:

    void foo(struct device *dev);

    and yet there is no reason for most of these guys to care about the
    sub fields within the device struct. This allows us to significantly
    reduce the scope of headers including headers. For this instance, a
    reduction of about 40% is achieved by replacing the include with the
    simple fact that the device is some kind of a struct.

    Unlike the much larger module.h cleanup, this one is simply two
    commits. One to fix the implicit users, and then one
    to delete the device.h includes from the linux/include/ dir wherever
    possible."

    * tag 'device-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
    device.h: audit and cleanup users in main include dir
    device.h: cleanup users outside of linux/include (C files)

    Linus Torvalds
     

16 Mar, 2012

1 commit

  • The <linux/device.h> header includes a lot of stuff, and
    it in turn gets a lot of use just for the basic "struct device"
    which appears so often.

    Clean up the users as follows:

    1) For those headers only needing "struct device" as a pointer
    in fcn args, replace the include with exactly that.

    2) For headers not really using anything from device.h, simply
    delete the include altogether.

    3) For headers relying on getting device.h implicitly before
    being included themselves, now explicitly include device.h

    4) For files in which doing #1 or #2 uncovers an implicit
    dependency on some other header, fix by explicitly adding
    the required header(s).

    Any C files that were implicitly relying on device.h to be
    present have already been dealt with in advance.

    Total removals from #1 and #2: 51. Total additions coming
    from #3: 9. Total other implicit dependencies from #4: 7.

    As of 3.3-rc1, there were 110, so a net removal of 42 gives
    about a 38% reduction in device.h presence in include/*

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

06 Feb, 2012

1 commit

  • When an SD card is hot removed without umount, del_gendisk() will call
    bdi_unregister() without destroying/freeing it. This leaves the bdi with
    bdi->dev = NULL, bdi->wb.task = NULL, and removed from the bdi_list.

    When sync(2) gets the bdi before bdi_unregister() and calls
    bdi_queue_work() after the unregister, trace_writeback_queue will be
    dereferencing the NULL bdi->dev. Fix it with a simple test for NULL.

    LKML-reference: http://lkml.org/lkml/2012/1/18/346
    Cc: stable@kernel.org
    Reported-by: Rabin Vincent
    Tested-by: Namjae Jeon
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

18 Dec, 2011

2 commits

  • Compensate the task's think time when computing the final pause time,
    so that ->dirty_ratelimit can be executed accurately.

    think time := time spent outside of balance_dirty_pages()

    In the rare case that the task slept longer than the 200ms period time
    (resulting in a negative pause time), the sleep time will be compensated
    in the following periods, too, if it's less than 1 second.

    Accumulated errors are carefully avoided as long as the max pause area
    is not hit.

    Pseudo code:

    period = pages_dirtied / task_ratelimit;
    think = jiffies - dirty_paused_when;
    pause = period - think;

    1) normal case: period > think

    pause = period - think
    dirty_paused_when = jiffies + pause
    nr_dirtied = 0

                        period time
          |===============================>|
              think time      pause time
          |===============>|==============>|
    ------|----------------|---------------|------------------
    dirty_paused_when      jiffies

    2) no pause case: period <= think

    don't pause; reduce future pause time by:
    dirty_paused_when += period
    nr_dirtied = 0

                   period time
          |===============================>|
                      think time
          |===================================================>|
    ------|--------------------------------+-------------------|----
    dirty_paused_when                                    jiffies
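
    The same rule as a small C helper (a sketch in the pseudo code's own
    units; the kernel version folds in HZ conversions and pause clamping):

        /* Returns the pause in jiffies; <= 0 means case 2: don't sleep,
         * just charge the period against dirty_paused_when.
         * task_ratelimit is assumed nonzero. */
        static long wb_compute_pause(unsigned long pages_dirtied,
                                     unsigned long task_ratelimit,
                                     unsigned long now,
                                     unsigned long dirty_paused_when)
        {
                long period = pages_dirtied / task_ratelimit;
                long think  = now - dirty_paused_when;

                return period - think;
        }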

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • This makes the binary trace understandable by trace-cmd.

    CC: Dave Chinner
    CC: Curt Wohlgemuth
    CC: Steven Rostedt
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

31 Oct, 2011

4 commits

  • This creates a new 'reason' field in a wb_writeback_work
    structure, which unambiguously identifies who initiates
    writeback activity. A 'wb_reason' enumeration has been
    added to writeback.h, to enumerate the possible reasons.

    The 'writeback_work_class' tracepoint event class and the
    'writeback_queue_io' tracepoint are updated to include the
    symbolic 'reason' in all trace events.

    And the 'writeback_inodes_sbXXX' family of routines has had
    a wb_stats parameter added to them, so callers can specify
    why writeback is being started.
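
    A sketch of the enumeration (member names follow the kernel's enum
    wb_reason as recalled from memory, not quoted from the patch):

        enum wb_reason {
                WB_REASON_BACKGROUND,
                WB_REASON_TRY_TO_FREE_PAGES,
                WB_REASON_SYNC,
                WB_REASON_PERIODIC,
                WB_REASON_LAPTOP_TIMER,
                WB_REASON_FS_FREE_SPACE,
                WB_REASON_MAX,
        };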

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
     
  • Instead of sending ->older_than_this to queue_io() and
    move_expired_inodes(), send the entire wb_writeback_work
    structure. There are other fields of a work item that are
    useful in these routines and in tracepoints.

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
     
  • Useful for analyzing the dynamics of the throttling algorithms and
    debugging user reported problems.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • It helps understand how various throttle bandwidths are updated.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

03 Oct, 2011

1 commit

  • As proposed by Chris, Dave and Jan, don't start foreground writeback IO
    inside balance_dirty_pages(). Instead, simply let it idle sleep for some
    time to throttle the dirtying task. Meanwhile, kick off the
    per-bdi flusher thread to do background writeback IO.

    RATIONALE
    =========

    - disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

    If every thread doing writes and being throttled starts foreground
    writeback, it leads to N IO submitters from at least N different
    inodes at the same time, ending up with N different sets of IO being
    issued with potentially zero locality to each other, resulting in
    much lower elevator sort/merge efficiency, and hence we seek the disk
    all over the place to service the different sets of IO.
    OTOH, if there is only one submission thread, it doesn't jump between
    inodes in the same way when congestion clears - it keeps writing to
    the same inode, resulting in large related chunks of sequential IOs
    being issued to the disk. This is more efficient than the above
    foreground writeback because the elevator works better and the disk
    seeks less.

    - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

    With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
    from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

    * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

    * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

    * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

    * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

    - IO size too small for fast arrays and too large for slow USB sticks

    The write_chunk used by current balance_dirty_pages() cannot be
    directly set to some large value (eg. 128MB) for better IO efficiency.
    Because it could lead to more than 1 second user perceivable stalls.
    Even the current 4MB write size may be too large for slow USB sticks.
    The fact that balance_dirty_pages() starts IO itself couples the
    IO size to the wait time, which makes it hard to pick a suitable IO size
    while keeping the wait time under control.

    Now it's possible to increase writeback chunk size proportional to the
    disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
    the larger writeback size dramatically reduces the seek count to 1/10
    (far beyond my expectation) and improves the write throughput by 24%.

    - long block time in balance_dirty_pages() hurts desktop responsiveness

    Many of us may have had the experience: it often takes a couple of
    seconds or even longer to stop a heavy writing dd/cp/tar command with
    Ctrl-C or "kill -9".

    - IO pipeline broken by bumpy write() progress

    There is a broad class of "loop {read(buf); write(buf);}" applications
    whose read() pipeline will be under-utilized or even come to a stop if
    the write()s have long latencies _or_ don't progress at a constant rate.
    The current threshold-based throttling inherently transfers the large
    low level IO completion fluctuations to bumpy application write()s,
    and further deteriorates with an increasing number of dirtiers and/or
    bdi's.

    For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
    the rsync progresses very bumpily in the legacy kernel, and throughput is
    improved by 67% by this patchset. (Plus the larger write chunk size, it
    becomes a 93% speedup.)

    The new rate based throttling can support 1000+ dd's with excellent
    smoothness, low latency and low overheads.

    For the above reasons, it's much better to do IO-less and low latency
    pauses in balance_dirty_pages().

    Jan Kara, Dave Chinner and I explored a scheme to let
    balance_dirty_pages() wait for enough writeback IO completions to
    safeguard the dirty limit. However, it was found to have two problems:

    - in large NUMA systems, the per-cpu counters may have big accounting
    errors, leading to big throttle wait time and jitters.

    - NFS may kill a large amount of unstable pages with one single COMMIT.
    Because the NFS server serves COMMIT with expensive fsync() IOs, it is
    desirable to delay and reduce the number of COMMITs. So such bursty IO
    completions are not likely to be optimized away, and neither are the
    resulting large (and tiny) stall times in IO completion based throttling.

    So here is a pause time oriented approach, which tries to control the
    pause time in each balance_dirty_pages() invocation, by controlling
    the number of pages dirtied before calling balance_dirty_pages(), for
    smooth and efficient dirty throttling:

    - avoid useless (eg. zero pause time) balance_dirty_pages() calls
    - avoid too small pause time (less than 4ms, which burns CPU power)
    - avoid too large pause time (more than 200ms, which hurts responsiveness)
    - avoid big fluctuations of pause times

    It can control pause times at will. The default policy (in a followup
    patch) will be to do ~10ms pauses in the 1-dd case, and increase to ~100ms
    in the 1000-dd case.
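
    The pause constraints above, as a minimal C sketch (the 4ms/200ms bounds
    come from the list above; the kernel enforces them through its min/max
    pause logic rather than a helper this simple):

        static long filter_pause_ms(long pause_ms)
        {
                if (pause_ms < 4)
                        return 0;   /* skip: sleeping <4ms burns CPU */
                if (pause_ms > 200)
                        return 200; /* cap: >200ms hurts responsiveness */
                return pause_ms;
        }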

    BEHAVIOR CHANGE
    ===============

    (1) dirty threshold

    Users will notice that applications get throttled once crossing
    the global (background + dirty)/2 = 15% threshold, and then balanced
    around 17.5%. Before the patch, the behavior was to just throttle at 20%
    dirtyable memory in the 1-dd case.

    Since the task will be soft throttled earlier than before, it may be
    perceived by end users as a performance "slow down" if their application
    happens to dirty more than 15% of dirtyable memory.

    (2) smoothness/responsiveness

    Users will notice a more responsive system during heavy writeback.
    "killall dd" will take effect instantly.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

10 Jul, 2011

2 commits

  • Add trace event balance_dirty_state for showing the global dirty page
    counts and thresholds at each global_dirty_limits() invocation. This
    will cover the callers throttle_vm_writeout(), over_bground_thresh()
    and each balance_dirty_pages() loop.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
    and initialize the struct writeback_control there.

    struct writeback_control is basically designed to control writeback of a
    single file, but we keep abusing it for writing multiple files in
    writeback_sb_inodes() and its callers.

    This immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
    work->nr_pages starts to make sense, and instead of saving and restoring
    pages_skipped in writeback_sb_inodes() it can always start with a clean
    zero value.

    It also makes a neat IO pattern change: large dirty files are now
    written in the full 4MB writeback chunk size, rather than whatever
    quota remained in wbc->nr_to_write.

    Acked-by: Jan Kara
    Proposed-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang

    Wu Fengguang