14 Oct, 2020

1 commit

  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     


20 Sep, 2020

1 commit

  • Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    changed ctl_table.proc_handler to take a kernel pointer. Adjust the
    definition of dirtytime_interval_handler to match its prototype in
    linux/writeback.h which fixes the following sparse error/warning:

    fs/fs-writeback.c:2189:50: warning: incorrect type in argument 3 (different address spaces)
    fs/fs-writeback.c:2189:50: expected void *
    fs/fs-writeback.c:2189:50: got void [noderef] __user *buffer
    fs/fs-writeback.c:2184:5: error: symbol 'dirtytime_interval_handler' redeclared with different type (incompatible argument 3 (different address spaces)):
    fs/fs-writeback.c:2184:5: int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... )
    fs/fs-writeback.c: note: in included file:
    ./include/linux/writeback.h:374:5: note: previously declared as:
    ./include/linux/writeback.h:374:5: int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... )
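
    The fix itself is a one-word change: the buffer argument of the
    definition loses its __user annotation so that it matches the
    kernel-pointer prototype (sketch; the handler body is unchanged):

    int dirtytime_interval_handler(struct ctl_table *table, int write,
                                   void *buffer, size_t *lenp, loff_t *ppos)
    {
            int ret;

            ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
            if (ret == 0 && write)
                    mod_delayed_work(bdi_wq, &dirtytime_work, 0);
            return ret;
    }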

    Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    Signed-off-by: Tobias Klauser
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Al Viro
    Link: https://lkml.kernel.org/r/20200907093140.13434-1-tklauser@distanz.ch
    Signed-off-by: Linus Torvalds

    Tobias Klauser
     

15 Jun, 2020

4 commits

  • The only use of I_DIRTY_TIME_EXPIRE is to detect in
    __writeback_single_inode() that the inode got there because the flush
    worker decided it's time to write back the inode's dirty timestamps
    (either because we are syncing or because of age). However, we can
    detect this directly in __writeback_single_inode() and there's no need
    for the strange propagation through the I_DIRTY_TIME_EXPIRE flag.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     
  • When we are processing writeback for sync(2), move_expired_inodes()
    didn't set any inode expiry value (older_than_this). This can result in
    writeback never completing if there's a steady stream of inodes added
    to the b_dirty_time list, as writeback rechecks the dirty lists after
    each writeback round to see whether there's more work to be done. Fix
    the problem by using the sync(2) start time as the inode expiry value
    when processing the b_dirty_time list, just as for ordinarily dirtied
    inodes. This requires some refactoring of older_than_this handling,
    which simplifies the code noticeably as a bonus.
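
    A sketch of the reworked queue_io() with the expiry threshold passed
    down explicitly (simplified from the merged patch):

    static void queue_io(struct bdi_writeback *wb,
                         struct wb_writeback_work *work,
                         unsigned long dirtied_before)
    {
            unsigned long time_expire_jif = dirtied_before;
            int moved;

            assert_spin_locked(&wb->list_lock);
            list_splice_init(&wb->b_more_io, &wb->b_io);
            moved = move_expired_inodes(&wb->b_dirty, &wb->b_io,
                                        dirtied_before);
            /*
             * For sync(2) (work->for_sync), expire b_dirty_time entries
             * against the sync start time too; otherwise keep the usual
             * lazytime expiry interval.
             */
            if (!work->for_sync)
                    time_expire_jif = jiffies - dirtytime_expire_interval * HZ;
            moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
                                         time_expire_jif);
    }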

    Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
    CC: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     
  • An inode's i_io_list list head is used to attach the inode to several
    different lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When
    the flush worker prepares a list of inodes to write back, e.g. for
    sync(2), it moves inodes to the b_io list. Thus it is critical for
    sync(2) data integrity guarantees that the inode is not requeued to any
    other writeback list while it is queued for processing by the flush
    worker. That's the reason why writeback_single_inode() does not touch
    i_io_list (unless the inode is completely clean) and why
    __mark_inode_dirty() does not touch i_io_list if the I_SYNC flag is set.

    However there are two flaws in the current logic:

    1) When the inode has only I_DIRTY_TIME set but is already queued in
    the b_io list due to sync(2), a concurrent __mark_inode_dirty(inode,
    I_DIRTY_SYNC) can still move the inode back to the b_dirty list,
    resulting in skipped writeback of the inode's timestamps during sync(2).

    2) When inode is on b_dirty_time list and writeback_single_inode() races
    with __mark_inode_dirty() like:

    writeback_single_inode()            __mark_inode_dirty(inode, I_DIRTY_PAGES)
      inode->i_state |= I_SYNC
      __writeback_single_inode()
                                          inode->i_state |= I_DIRTY_PAGES;
                                          if (inode->i_state & I_SYNC)
                                            bail
      if (!(inode->i_state & I_DIRTY_ALL))
        - not true so nothing done

    We end up with an I_DIRTY_PAGES inode on the b_dirty_time list, and
    thus standard background writeback will not write back this inode,
    leading to possible dirty throttling stalls etc. (thanks to Martijn
    Coenen for this analysis).

    Fix these problems by tracking whether an inode is queued in the b_io
    or b_more_io lists with a new I_SYNC_QUEUED flag. When this flag is
    set, we know the flush worker has queued the inode and we should not
    touch i_io_list. On the other hand, we also know that once the flush
    worker is done with the inode it will requeue it to the appropriate
    dirty list. When I_SYNC_QUEUED is not set, __mark_inode_dirty() can
    (and must) move the inode to the appropriate dirty list.
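
    A condensed sketch of the resulting rule in __mark_inode_dirty()
    (simplified; locking and wakeup details omitted):

    /*
     * If the flush worker owns the inode (it is on b_io/b_more_io), only
     * update i_state; the worker will requeue the inode when done with it.
     */
    if (inode->i_state & I_SYNC_QUEUED)
            goto out_unlock;

    /* Otherwise we may (and must) requeue the inode ourselves. */
    dirty_list = (inode->i_state & I_DIRTY) ? &wb->b_dirty
                                            : &wb->b_dirty_time;
    inode_io_list_move_locked(inode, wb, dirty_list);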

    Reported-by: Martijn Coenen
    Reviewed-by: Martijn Coenen
    Tested-by: Martijn Coenen
    Reviewed-by: Christoph Hellwig
    Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Currently, operations on inode->i_io_list are protected by
    wb->list_lock. In the following patches we'll need to maintain
    consistency between inode->i_state and inode->i_io_list, so change the
    code so that inode->i_lock also protects all of the inode's i_io_list
    handling.
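
    After the change, the list helper can assert both locks (sketch):

    static bool inode_io_list_move_locked(struct inode *inode,
                                          struct bdi_writeback *wb,
                                          struct list_head *head)
    {
            assert_spin_locked(&wb->list_lock);
            assert_spin_locked(&inode->i_lock);     /* newly required */

            list_move(&inode->i_io_list, head);
            /* ... dirty-io accounting as before ... */
    }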

    Reviewed-by: Martijn Coenen
    Reviewed-by: Christoph Hellwig
    CC: stable@vger.kernel.org # Prerequisite for "writeback: Avoid skipping inode writeback"
    Signed-off-by: Jan Kara

    Jan Kara
     

06 Jun, 2020

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "A lot of bug fixes and cleanups for ext4, including:

    - Fix performance problems found in dioread_nolock now that it is the
    default, caused by transaction leaks.

    - Clean up fiemap handling in ext4

    - Clean up and refactor multiple block allocator (mballoc) code

    - Fix a problem with mballoc on smaller file systems running out of
    blocks because they couldn't properly use blocks that had been
    reserved by inode preallocation.

    - Fixed a race in ext4_sync_parent() versus rename()

    - Simplify the error handling in the extent manipulation code

    - Make sure all metadata I/O errors are reflected to
    ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers.

    - Avoid passing an error pointer to brelse in ext4_xattr_set()

    - Fix a race which could result in freeing an inode on the dirty list
    in data=journal mode.

    - Fix refcount handling if ext4_iget() fails

    - Fix a crash in generic/019 caused by a corrupted extent node"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
    ext4: avoid unnecessary transaction starts during writeback
    ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
    ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
    fs: remove the access_ok() check in ioctl_fiemap
    fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
    fs: move fiemap range validation into the file systems instances
    iomap: fix the iomap_fiemap prototype
    fs: move the fiemap definitions out of fs.h
    fs: mark __generic_block_fiemap static
    ext4: remove the call to fiemap_check_flags in ext4_fiemap
    ext4: split _ext4_fiemap
    ext4: fix fiemap size checks for bitmap files
    ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
    add comment for ext4_dir_entry_2 file_type member
    jbd2: avoid leaking transaction credits when unreserving handle
    ext4: drop ext4_journal_free_reserved()
    ext4: mballoc: use lock for checking free blocks while retrying
    ext4: mballoc: refactor ext4_mb_good_group()
    ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
    ext4: mballoc: refactor ext4_mb_discard_preallocations()
    ...

    Linus Torvalds
     

04 Jun, 2020

1 commit

  • Ext4 needs to remove the inode from the writeback lists after it is out
    of visibility of its journalling machinery (which can still dirty the
    inode). Export inode_io_list_del() for it.
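
    The change is mechanical - declare the function in a shared header and
    export the symbol (sketch):

    /* include/linux/writeback.h */
    void inode_io_list_del(struct inode *inode);

    /* fs/fs-writeback.c */
    EXPORT_SYMBOL(inode_io_list_del);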

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20200421085445.5731-3-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

03 Jun, 2020

2 commits

  • Pull block updates from Jens Axboe:
    "Core block changes that have been queued up for this release:

    - Remove dead blk-throttle and blk-wbt code (Guoqing)

    - Include pid in blktrace note traces (Jan)

    - Don't spew I/O errors on wouldblock termination (me)

    - Zone append addition (Johannes, Keith, Damien)

    - IO accounting improvements (Konstantin, Christoph)

    - blk-mq hardware map update improvements (Ming)

    - Scheduler dispatch improvement (Salman)

    - Inline block encryption support (Satya)

    - Request map fixes and improvements (Weiping)

    - blk-iocost tweaks (Tejun)

    - Fix for timeout failing with error injection (Keith)

    - Queue re-run fixes (Douglas)

    - CPU hotplug improvements (Christoph)

    - Queue entry/exit improvements (Christoph)

    - Move DMA drain handling to the few drivers that use it (Christoph)

    - Partition handling cleanups (Christoph)"

    * tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
    block: mark bio_wouldblock_error() bio with BIO_QUIET
    blk-wbt: rename __wbt_update_limits to wbt_update_limits
    blk-wbt: remove wbt_update_limits
    blk-throttle: remove tg_drain_bios
    blk-throttle: remove blk_throtl_drain
    null_blk: force complete for timeout request
    blk-mq: drain I/O when all CPUs in a hctx are offline
    blk-mq: add blk_mq_all_tag_iter
    blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
    blk-mq: use BLK_MQ_NO_TAG in more places
    blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
    blk-mq: move more request initialization to blk_mq_rq_ctx_init
    blk-mq: simplify the blk_mq_get_request calling convention
    blk-mq: remove the bio argument to ->prepare_request
    nvme: force complete cancelled requests
    blk-mq: blk-mq: provide forced completion method
    block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
    block: blk-crypto-fallback: remove redundant initialization of variable err
    block: reduce part_stat_lock() scope
    block: use __this_cpu_add() instead of access by smp_processor_id()
    ...

    Linus Torvalds
     
  • After an NFS page has been written it is considered "unstable" until a
    COMMIT request succeeds. If the COMMIT fails, the page will be
    re-written.

    These "unstable" pages are currently accounted as "reclaimable", either
    in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
    'reclaimable' count. This might have made sense when sending the COMMIT
    required a separate action by the VFS/MM (e.g. releasepage() used to
    send a COMMIT). However now that all writes generated by ->writepages()
    will automatically be followed by a COMMIT (since commit 919e3bd9a875
    ("NFS: Ensure we commit after writeback is complete")) it makes more
    sense to treat them as writeback pages.

    So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
    NR_WRITEBACK and WB_WRITEBACK.

    A particular effect of this change is that when
    wb_check_background_flush() calls wb_over_bg_threshold(), the latter
    will report 'true' a lot less often as the 'unstable' pages are no
    longer considered 'dirty' (as there is nothing that writeback can do
    about them anyway).

    Currently wb_check_background_flush() will trigger writeback to NFS
    even when there are relatively few dirty pages (if there are lots of
    unstable pages); this can result in small writes going to the server
    (10s of kilobytes rather than a megabyte), which hurts throughput.
    With this patch, there are fewer writes which are each larger on
    average.

    Where the NR_UNSTABLE_NFS count was included in statistics virtual
    files, the entry is retained, but the value is hard-coded as zero.
    Static trace points and warning printks which mentioned this counter
    no longer report it.
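
    In accounting terms the change looks roughly like this on the NFS side
    (illustrative; the actual patch touches several call sites):

    /* marking a request for COMMIT - before: */
    inc_node_page_state(page, NR_UNSTABLE_NFS);
    inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);

    /* after: unstable pages are just writeback pages */
    inc_node_page_state(page, NR_WRITEBACK);
    inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);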

    [akpm@linux-foundation.org: re-layout comment]
    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: NeilBrown
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Acked-by: Trond Myklebust
    Acked-by: Michal Hocko [mm]
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
    Signed-off-by: Linus Torvalds

    NeilBrown
     

01 Feb, 2020

1 commit

  • Without memcg, there is a one-to-one mapping between the bdi and
    bdi_writeback structures. In this world, things are fairly
    straightforward; the first thing bdi_unregister() does is to shut down
    the bdi_writeback structure (or wb), and part of that shutdown ensures
    that no other work is queued against the wb, and that the wb is fully
    drained.

    With memcg, however, there is a one-to-many relationship between the bdi
    and bdi_writeback structures; that is, there are multiple wb objects
    which can all point to a single bdi. There is a refcount which prevents
    the bdi object from being released (and hence, unregistered). So in
    theory, the bdi_unregister() *should* only get called once its refcount
    goes to zero (bdi_put will drop the refcount, and when it is zero,
    release_bdi gets called, which calls bdi_unregister).

    Unfortunately, del_gendisk() in block/genhd.c never got the memo about
    the Brave New memcg World, and calls bdi_unregister() directly. It does
    this without informing the file system, or the memcg code, or anything
    else. This causes the root wb associated with the bdi to be
    unregistered, but none of the memcg-specific wb's are shut down. So
    when one of these wb's is woken up to do delayed work, it tries to
    dereference its wb->bdi->dev to fetch the device name, but
    unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
    called by del_gendisk(). As a result, *boom*.

    Fortunately, it looks like the rest of the writeback path is perfectly
    happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
    create a bdi_dev_name() function which can handle bdi->dev being NULL.
    This also allows us to bulletproof the writeback tracepoints to prevent
    them from dereferencing a NULL pointer and crashing the kernel if one is
    tracing with memcg's enabled, and an iSCSI device dies or a USB storage
    stick is pulled.
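
    The resulting helper is small (sketch; the fallback string in the
    merged patch is "(unknown)"):

    static inline const char *bdi_dev_name(struct backing_dev_info *bdi)
    {
            if (!bdi || !bdi->dev)
                    return bdi_unknown_name;        /* "(unknown)" */
            return dev_name(bdi->dev);
    }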

    The most common way of triggering this will be hotremoval of a device
    while writeback with memcg enabled is going on. It was triggering
    several times a day in a heavily loaded production environment.

    Google Bug Id: 145475544

    Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
    Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
    Signed-off-by: Theodore Ts'o
    Cc: Chris Mason
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
     

09 Nov, 2019

1 commit

  • cgroup writeback tries to refresh the associated wb immediately if the
    current wb is dead. This is to avoid keeping issuing IOs on the stale
    wb after memcg - blkcg association has changed (ie. when blkcg got
    disabled / enabled higher up in the hierarchy).

    Unfortunately, the logic gets triggered spuriously on inodes which are
    associated with dead cgroups. When the logic is triggered on dead
    cgroups, the attempt fails only after doing quite a bit of work
    allocating and initializing a new wb.

    Commit c3aab9a0bd91 ("mm/filemap.c: don't initiate writeback if mapping
    has no dirty pages") alleviated the issue significantly, as the logic
    now only triggers when the inode has dirty pages. However, the
    condition can still be triggered before the inode is switched to a
    different cgroup, where the logic simply doesn't make sense.

    Skip the immediate switching if the associated memcg is dying.
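
    A sketch of the resulting guard on the immediate-switch path (helper
    names from the cgroup writeback code; exact placement is illustrative):

    /* in wbc_attach_and_unlock_inode(), sketch */
    if (unlikely(wb_dying(wbc->wb) &&
                 !css_is_dying(wbc->wb->memcg_css)))
            inode_switch_wbs(inode, wbc->wb_id);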

    This is a simplified version of the following two patches:

    * https://lore.kernel.org/linux-mm/20190513183053.GA73423@dennisz-mbp/
    * http://lkml.kernel.org/r/156355839560.2063.5265687291430814589.stgit@buzz

    Cc: Konstantin Khlebnikov
    Fixes: e8a7abf5a5bd ("writeback: disassociate inodes from dying bdi_writebacks")
    Acked-by: Dennis Zhou
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Oct, 2019

1 commit

  • Fix kernel-doc warning in fs/fs-writeback.c:

    fs/fs-writeback.c:913: warning: Excess function parameter 'nr_pages' description in 'cgroup_writeback_by_id'

    Link: http://lkml.kernel.org/r/756645ac-0ce8-d47e-d30a-04d9e4923a4f@infradead.org
    Fixes: d62241c7a406 ("writeback, memcg: Implement cgroup_writeback_by_id()")
    Signed-off-by: Randy Dunlap
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

08 Oct, 2019

1 commit

  • finish_writeback_work() reads @done->waitq after decrementing
    @done->cnt. However, once @done->cnt reaches zero, @done may be freed
    (from stack) at any moment and @done->waitq can contain something
    unrelated by the time finish_writeback_work() tries to read it. This
    led to the following crash.

    "BUG: kernel NULL pointer dereference, address: 0000000000000002"
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
    CPU: 40 PID: 555153 Comm: kworker/u98:50 Kdump: loaded Not tainted
    ...
    Workqueue: writeback wb_workfn (flush-btrfs-1)
    RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30
    Code: 48 89 d8 5b c3 e8 50 db 6b ff eb f4 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 9c 5b fa 31 c0 ba 01 00 00 00 0f b1 17 75 05 48 89 d8 5b c3 89 c6 e8 fe ca 6b ff eb f2 66 90
    RSP: 0018:ffffc90049b27d98 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000002
    RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff889fff407600 R11: ffff88ba9395d740 R12: 000000000000e300
    R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88bfdfa00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000002 CR3: 0000000002409005 CR4: 00000000001606e0
    Call Trace:
    __wake_up_common_lock+0x63/0xc0
    wb_workfn+0xd2/0x3e0
    process_one_work+0x1f5/0x3f0
    worker_thread+0x2d/0x3d0
    kthread+0x111/0x130
    ret_from_fork+0x1f/0x30

    Fix it by reading and caching @done->waitq before decrementing
    @done->cnt.
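
    The fixed sequence (close to the merged patch):

    static void finish_writeback_work(struct bdi_writeback *wb,
                                      struct wb_writeback_work *work)
    {
            struct wb_completion *done = work->done;

            if (work->auto_free)
                    kfree(work);
            if (done) {
                    wait_queue_head_t *waitq = done->waitq;

                    /* @done can't be accessed after the following dec */
                    if (atomic_dec_and_test(&done->cnt))
                            wake_up_all(waitq);
            }
    }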

    Link: http://lkml.kernel.org/r/20190924010631.GH2233839@devbig004.ftw2.facebook.com
    Fixes: 5b9cce4c7eb069 ("writeback: Generalize and expose wb_completion")
    Signed-off-by: Tejun Heo
    Debugged-by: Chris Mason
    Reviewed-by: Jens Axboe
    Cc: Jan Kara
    Cc: [5.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

27 Aug, 2019

2 commits

  • Implement cgroup_writeback_by_id() which initiates cgroup writeback
    from bdi and memcg IDs. This will be used by memcg foreign inode
    flushing.

    v2: Use wb_get_lookup() instead of wb_get_create() to avoid creating
    spurious wbs.

    v3: Interpret 0 @nr as 1.25 * nr_dirty to implement best-effort
    flushing while avoiding possible livelocks.
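
    The resulting entry point (signature per the description above; sketch):

    int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr,
                               enum wb_reason reason,
                               struct wb_completion *done);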

    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • wb_completion is used to track writeback completions. We want to use
    it from memcg side for foreign inode flushes. This patch updates it
    to remember the target waitq instead of assuming bdi->wb_waitq and
    expose it outside of fs-writeback.c.
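
    After the change, the completion carries its own waitq instead of
    assuming bdi->wb_waitq (sketch of the exposed structure):

    struct wb_completion {
            atomic_t                cnt;
            wait_queue_head_t       *waitq;
    };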

    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

16 Aug, 2019

2 commits

  • As inode wb switching may make sync(2) miss some inodes, they're
    synchronized using wb_switch_rwsem so that no wb switching happens
    while sync(2) is in progress. In addition to synchronizing the actual
    switching, the rwsem is also used to prevent queueing new switch
    attempts while sync(2) is in progress. This is to avoid queueing too
    many instances while the rwsem is held by sync(2). Unfortunately,
    this is too aggressive and can block wb switching for a long time if
    sync(2) is frequent.

    The goal is to avoid exploding the number of scheduled switches, not
    to avoid scheduling anything. Let's use wb_switch_rwsem only for
    synchronizing the actual switching against sync(2), and use
    isw_nr_in_flight instead to limit the maximum number of scheduled
    switches. The limit is set to 1024, which should be more than enough
    while still avoiding extreme situations.
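
    A sketch of the new limit (the constant name follows the code's
    foreign-rename prefix; treat it as illustrative):

    #define WB_FRN_MAX_IN_FLIGHT    1024

    static void inode_switch_wbs(struct inode *inode, int new_wb_id)
    {
            /* avoid queueing a flood of switch attempts */
            if (atomic_read(&isw_nr_in_flight) > WB_FRN_MAX_IN_FLIGHT)
                    return;
            /* ... allocate and queue the switch work ... */
    }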

    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • WB_FRN_TIME_CUT_DIV is used to tell the foreign inode detection logic
    to ignore short writeback rounds to prevent getting confused by a
    burst of short writebacks. The parameter is currently 2, meaning that
    anything smaller than half of the running average writeback duration
    will be ignored.

    This is unnecessarily aggressive. The detection logic uses 16 history
    slots and is already reasonably protected against short bursts
    confusing it, and the current parameter can lead to tens of seconds of
    missed detection depending on the writeback pattern.

    Let's change the parameter to 8, so that it only ignores writeback
    rounds which are smaller than 12.5% of the current running average.

    v2: Add comment explaining what's going on with the foreign detection
    parameters.
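
    The change itself is a one-liner (the comment being the v2 addition;
    sketch):

    #define WB_FRN_TIME_CUT_DIV     8       /* ignore rounds < avg / 8 */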

    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

10 Jul, 2019

3 commits

  • When writeback IOs are bounced through async layers, the IOs should
    only be accounted against the wbc from the original bdi writeback to
    avoid confusing cgroup inode ownership arbitration. Add
    wbc->no_cgroup_owner to allow disabling wbc cgroup owner accounting.
    This will be used to make btrfs compression work well with cgroup IO
    control.

    v2: Renamed from no_wbc_acct to no_cgroup_owner and added comment as
    per Jan.
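
    The new flag is a bitfield in struct writeback_control (sketch):

    struct writeback_control {
            /* ... */
            unsigned no_cgroup_owner:1;     /* don't attribute this IO to
                                               cgroup owner arbitration */
            /* ... */
    };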

    Reviewed-by: Josef Bacik
    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • wbc_account_io() does a very specific job - try to see which cgroup is
    actually dirtying an inode and transfer its ownership to the majority
    dirtier if needed. The name is too generic and confusing. Let's
    rename it to something more specific.
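
    For reference, the helper was renamed to wbc_account_cgroup_owner();
    the signature is otherwise unchanged:

    void wbc_account_cgroup_owner(struct writeback_control *wbc,
                                  struct page *page, size_t bytes);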

    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • btrfs is going to use css_put() and wbc helpers to improve cgroup
    writeback support. Add a dummy css_get() definition and export the wbc
    helpers to prepare for module and !CONFIG_CGROUP builds.

    Reported-by: kbuild test robot
    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

16 Jun, 2019

1 commit

  • wbc_account_io() collects information on cgroup ownership of writeback
    pages to determine which cgroup should own the inode. Pages can stay
    associated with dead memcgs but we want to avoid attributing IOs to
    dead blkcgs as much as possible as the association is likely to be
    stale. However, currently, pages associated with dead memcgs
    contribute to the accounting, delaying and/or confusing the
    arbitration.

    Fix it by ignoring pages associated with dead memcgs.
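
    A sketch of the guard added to the accounting path:

    /* in the wbc IO-accounting helper, sketch */
    css = mem_cgroup_css_from_page(page);
    /* dead cgroups shouldn't contribute to inode ownership arbitration */
    if (!(css->flags & CSS_ONLINE))
            return;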

    Signed-off-by: Tejun Heo
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

19 May, 2019

1 commit

  • synchronize_rcu() doesn't wait for call_rcu() callbacks, so an inode
    wb switch may not reach the workqueue until after synchronize_rcu()
    returns. Thus previously scheduled switches may not be finished even
    after flushing the workqueue, which causes the NULL pointer
    dereference shown below.

    VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds. Have a nice day...
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000278
    evict+0xb3/0x180
    iput+0x1b0/0x230
    inode_switch_wbs_work_fn+0x3c0/0x6a0
    worker_thread+0x4e/0x490
    ? process_one_work+0x410/0x410
    kthread+0xe6/0x100
    ret_from_fork+0x39/0x50

    Replace the synchronize_rcu() call with rcu_barrier() to wait for all
    pending callbacks to finish. Also increment isw_nr_in_flight after
    call_rcu() in inode_switch_wbs(), which makes more sense.
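
    The resulting unmount-side wait (close to the merged fix):

    void cgroup_writeback_umount(void)
    {
            if (atomic_read(&isw_nr_in_flight)) {
                    /*
                     * Use rcu_barrier() to wait for all pending callbacks
                     * to finish; synchronize_rcu() only waits for readers.
                     */
                    rcu_barrier();
                    flush_workqueue(isw_wq);
            }
    }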

    Link: http://lkml.kernel.org/r/20190429024108.54150-1-jiufei.xue@linux.alibaba.com
    Signed-off-by: Jiufei Xue
    Acked-by: Tejun Heo
    Suggested-by: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiufei Xue
     

23 Jan, 2019

1 commit

  • sync_inodes_sb() can race against cgwb (cgroup writeback) membership
    switches and fail to writeback some inodes. For example, if an inode
    switches to another wb while sync_inodes_sb() is in progress, the new
    wb might not be visible to bdi_split_work_to_wbs() at all or the inode
    might jump from a wb which hasn't issued writebacks yet to one which
    already has.

    This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb
    switch path against sync_inodes_sb() so that sync_inodes_sb() is
    guaranteed to see all the target wbs and inodes can't jump wbs to
    escape syncing.

    v2: Fixed misplaced rwsem init. Spotted by Jiufei.
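
    A sketch of how the rwsem brackets the two paths (simplified; the
    merged code wraps these in small helpers):

    /* cgwb switch path */
    down_read(&bdi->wb_switch_rwsem);
    /* ... queue and run the inode wb switch ... */
    up_read(&bdi->wb_switch_rwsem);

    /* sync_inodes_sb() */
    down_write(&bdi->wb_switch_rwsem);
    bdi_split_work_to_wbs(bdi, &work, false);
    wb_wait_for_completion(bdi, &done);
    up_write(&bdi->wb_switch_rwsem);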

    Signed-off-by: Tejun Heo
    Reported-by: Jiufei Xue
    Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com
    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

04 May, 2018

1 commit

  • Syzbot has reported that it can hit a NULL pointer dereference in
    wb_workfn() due to wb->bdi->dev being NULL. This indicates that
    wb_workfn() was called for an already unregistered bdi which should not
    happen as wb_shutdown() called from bdi_unregister() should make sure
    all pending writeback works are completed before bdi is unregistered.
    Except that wb_workfn() itself can requeue the work with:

    mod_delayed_work(bdi_wq, &wb->dwork, 0);

    and if this happens while wb_shutdown() is waiting in:

    flush_delayed_work(&wb->dwork);

    the dwork can get executed after wb_shutdown() has finished and
    bdi_unregister() has cleared wb->bdi->dev.

    Make wb_workfn() use wb_wakeup() for requeueing the work, which takes
    all the necessary precautions against racing with bdi unregistration.
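
    wb_wakeup() only requeues while the wb is still registered, under
    wb->work_lock (sketch):

    static void wb_wakeup(struct bdi_writeback *wb)
    {
            spin_lock_bh(&wb->work_lock);
            if (test_bit(WB_registered, &wb->state))
                    mod_delayed_work(bdi_wq, &wb->dwork, 0);
            spin_unlock_bh(&wb->work_lock);
    }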

    CC: Tetsuo Handa
    CC: Tejun Heo
    Fixes: 839a8e8660b6 ("writeback: replace custom worker pool implementation with unbound workqueue")
    Reported-by: syzbot
    Reviewed-by: Dave Chinner
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

21 Apr, 2018

1 commit

  • lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
    the page's memcg is undergoing move accounting, which occurs when a
    process leaves its memcg for a new one that has
    memory.move_charge_at_immigrate set.

    unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
    the given inode is switching writeback domains. Switches occur when
    enough writes are issued from a new domain.

    This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

    If an inode switch and a process memcg migration are both in-flight,
    then unlocked_inode_to_wb_end() will unconditionally enable interrupts
    while still holding the lock_page_memcg() irq spinlock. This suggests
    the possibility of deadlock if an interrupt occurs before
    unlock_page_memcg():

    truncate
      __cancel_dirty_page
        lock_page_memcg
        unlocked_inode_to_wb_begin
        unlocked_inode_to_wb_end
          <interrupts mistakenly enabled>
                                  <interrupt>
                                  end_page_writeback
                                    test_clear_page_writeback
                                      lock_page_memcg
                                        <deadlock>
        unlock_page_memcg

    Due to configuration limitations this deadlock is not currently possible
    because we don't mix cgroup writeback (a cgroupv2 feature) and
    memory.move_charge_at_immigrate (a cgroupv1 feature).

    If the kernel is hacked to always claim inode switching and memcg
    moving_account, then this script triggers lockup in less than a minute:

    cd /mnt/cgroup/memory
    mkdir a b
    echo 1 > a/memory.move_charge_at_immigrate
    echo 1 > b/memory.move_charge_at_immigrate
    (
            echo $BASHPID > a/cgroup.procs
            while true; do
                    dd if=/dev/zero of=/mnt/big bs=1M count=256
            done
    ) &
    while true; do
            sync
    done &
    sleep 1h &
    SLEEP=$!
    while true; do
            echo $SLEEP > a/cgroup.procs
            echo $SLEEP > b/cgroup.procs
    done

    The deadlock does not seem possible, so it's debatable if there's any
    reason to modify the kernel. I suggest we should, to prevent future
    surprises. And Wang Long said "this deadlock occurs three times in our
    environment", so there's more reason to apply this, even to stable.
    Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch
    see "[PATCH for-4.4] writeback: safer lock nesting"
    https://lkml.org/lkml/2018/4/11/146
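
    The merged fix replaces the bool with a cookie, so that
    unlocked_inode_to_wb_end() restores the saved interrupt state instead
    of unconditionally enabling interrupts (sketch):

    struct wb_lock_cookie {
            bool locked;
            unsigned long flags;
    };

    /* begin: save IRQ state if we must take the mapping lock */
    cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
    if (cookie->locked)
            xa_lock_irqsave(&inode->i_mapping->i_pages, cookie->flags);

    /* end: restore exactly what was saved */
    if (cookie->locked)
            xa_unlock_irqrestore(&inode->i_mapping->i_pages, cookie->flags);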

    [gthelen@google.com: v4]
    Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
    [akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
    Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
    Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
    Signed-off-by: Greg Thelen
    Reported-by: Wang Long
    Acked-by: Wang Long
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Nicholas Piggin
    Cc: [v4.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.
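
    The conversion pattern throughout the tree (sketch):

    /* before */
    spin_lock_irq(&mapping->tree_lock);
    radix_tree_delete(&mapping->page_tree, page->index);
    spin_unlock_irq(&mapping->tree_lock);

    /* after: same radix tree, new name, lock embedded in the root */
    xa_lock_irq(&mapping->i_pages);
    radix_tree_delete(&mapping->i_pages, page->index);
    xa_unlock_irq(&mapping->i_pages);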

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

28 Nov, 2017

1 commit

  • This is a pure automated search-and-replace of the internal kernel
    superblock flags.

    The s_flags are now called SB_*, with the names and the values for the
    moment mirroring the MS_* flags that they're equivalent to.

    Note how the MS_xyz flags are the ones passed to the mount system call,
    while the SB_xyz flags are what we then use in sb->s_flags.

    The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
    include/linux/fs.h include/uapi/linux/bfs_fs.h \
    security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
    DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
    POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
    I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
    ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done

    Requested-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

10 Oct, 2017

1 commit

  • Since commit 925a6efb8ff0c ("Btrfs: stop using
    try_to_writeback_inodes_sb_nr to flush delalloc") this function hasn't
    been used outside fs/fs-writeback.c, so stop exporting it.

    In addition, we merge it into try_to_writeback_inodes_sb(), which is
    the only caller. Also change the return type of
    try_to_writeback_inodes_sb() to void, as the only user (ext4) doesn't
    care.
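
    The merged result is roughly (the final bool argument is skip_if_busy):

    void try_to_writeback_inodes_sb(struct super_block *sb,
                                    enum wb_reason reason)
    {
            if (!down_read_trylock(&sb->s_umount))
                    return;

            __writeback_inodes_sb_nr(sb, get_nr_dirty_pages(), reason, true);
            up_read(&sb->s_umount);
    }
    EXPORT_SYMBOL(try_to_writeback_inodes_sb);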

    Reviewed-by: Jan Kara
    Signed-off-by: Rakesh Pandit
    Signed-off-by: Jens Axboe

    Rakesh Pandit
     

05 Oct, 2017

1 commit

  • Handle start-all writeback like we do periodic or kupdate
    style writeback - by marking the bdi_writeback as needing a full
    flush, and simply waking the thread. This eliminates the need to
    allocate and queue a specific work item just for this purpose.

    After this change, we truly only ever have one of them running at
    any point in time. We mark the need to start all flushes, and the
    writeback thread will clear it once it has processed the request.

    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Oct, 2017

3 commits

  • When someone calls wakeup_flusher_threads() or
    wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
    pages in the system (or on that bdi). If we are tight on memory, we
    can get tons of these queued from kswapd/vmscan. This causes (at
    least) two problems:

    1) We consume a ton of memory just allocating writeback work items.
    We've seen as much as 600 million of these writeback work items
    pending. That's a lot of memory to pointlessly hold hostage,
    while the box is under memory pressure.

    2) We spend so much time processing these work items, that we
    introduce a softlockup in writeback processing. This is because
    each of the writeback work items don't end up doing any work (it's
    hard when you have millions of identical ones coming in to the
    flush machinery), so we just sit in a tight loop pulling work
    items and deleting/freeing them.

    Fix this by adding a 'start_all' bit to the writeback structure, and
    set that when someone attempts to flush all dirty pages. The bit is
    cleared when we start writeback on that work item. If the bit is
    already set when we attempt to queue !nr_pages writeback, then we
    simply ignore it.

    This provides us one full flush in flight, with one pending as well,
    and makes for more efficient handling of this type of writeback.
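
    A sketch of the queueing side after this change (close to the merged
    code):

    void wb_start_writeback(struct bdi_writeback *wb, enum wb_reason reason)
    {
            if (!wb_has_dirty_io(wb))
                    return;

            /*
             * All start-all requests are equivalent: if one is already
             * pending, a new one is pointless.  test_and_set_bit() keeps
             * at most one full flush in flight plus one pending.
             */
            if (test_bit(WB_start_all, &wb->state) ||
                test_and_set_bit(WB_start_all, &wb->state))
                    return;

            wb->start_all_reason = reason;
            wb_wakeup(wb);
    }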

    Acked-by: Johannes Weiner
    Tested-by: Chris Mason
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Now that we have no external callers of wb_start_writeback(), we
    can shuffle the passing in of 'nr_pages'. Everybody passes in 0
    at this point, so just kill the argument and move the dirty
    count retrieval to that function.

    Acked-by: Johannes Weiner
    Tested-by: Chris Mason
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't have any callers outside of fs-writeback.c anymore,
    make it private.

    Acked-by: Johannes Weiner
    Tested-by: Chris Mason
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe