31 Aug, 2022

1 commit

  • commit f87904c075515f3e1d8f4a7115869d3b914674fd upstream.

    When a disk is removed, bdi_unregister gets called to stop further
    writeback and wait for associated delayed work to complete. However,
    wb_inode_writeback_end() may schedule the bandwidth-estimation dwork after
    this has completed, which can result in the timer attempting to access the
    just-freed bdi_writeback.

    Fix this by checking if the bdi_writeback is alive, similar to when
    scheduling writeback work.

    Since this requires wb->work_lock, and wb_inode_writeback_end() may get
    called from interrupt context, switch wb->work_lock to an irq-safe lock.
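
    A minimal sketch of the resulting shape, assuming the helper and field
    names used elsewhere in this log (wb->writeback_inodes, wb->bw_dwork,
    WB_registered, bdi_wq, BANDWIDTH_INTERVAL); not necessarily the
    verbatim patch:

        /* only queue the bandwidth-estimation dwork while the wb is
         * still registered, under an irq-safe work_lock since this can
         * be reached from interrupt context */
        static void wb_inode_writeback_end(struct bdi_writeback *wb)
        {
                unsigned long flags;

                atomic_dec(&wb->writeback_inodes);
                spin_lock_irqsave(&wb->work_lock, flags);
                if (test_bit(WB_registered, &wb->state))
                        queue_delayed_work(bdi_wq, &wb->bw_dwork,
                                           BANDWIDTH_INTERVAL);
                spin_unlock_irqrestore(&wb->work_lock, flags);
        }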

    Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com
    Fixes: 45a2966fd641 ("writeback: fix bandwidth estimate for spiky workload")
    Signed-off-by: Khazhismel Kumykov
    Reviewed-by: Jan Kara
    Cc: Michael Stapelberg
    Cc: Wu Fengguang
    Cc: Alexander Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Khazhismel Kumykov

04 Sep, 2021

6 commits

  • Merge misc updates from Andrew Morton:
    "173 patches.

    Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
    pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
    bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
    hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
    oom-kill, migration, ksm, percpu, vmstat, and madvise)"

    * emailed patches from Andrew Morton: (173 commits)
    mm/madvise: add MADV_WILLNEED to process_madvise()
    mm/vmstat: remove unneeded return value
    mm/vmstat: simplify the array size calculation
    mm/vmstat: correct some wrong comments
    mm/percpu.c: remove obsolete comments of pcpu_chunk_populated()
    selftests: vm: add COW time test for KSM pages
    selftests: vm: add KSM merging time test
    mm: KSM: fix data type
    selftests: vm: add KSM merging across nodes test
    selftests: vm: add KSM zero page merging test
    selftests: vm: add KSM unmerge test
    selftests: vm: add KSM merge test
    mm/migrate: correct kernel-doc notation
    mm: wire up syscall process_mrelease
    mm: introduce process_mrelease system call
    memblock: make memblock_find_in_range method private
    mm/mempolicy.c: use in_task() in mempolicy_slab_node()
    mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
    mm/mempolicy: advertise new MPOL_PREFERRED_MANY
    mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
    ...

    Linus Torvalds
  • We do some unlocked reads of writeback statistics like
    avg_write_bandwidth, dirty_ratelimit, or bw_time_stamp. Generally we
    are fine with getting somewhat out-of-date values, but getting
    different values in different parts of a function, because the
    compiler decided to reload the value from its original memory
    location, could confuse calculations. Use READ_ONCE for these
    unlocked accesses and WRITE_ONCE for the updates, to be on the safe
    side.
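
    In simplified form, the two annotations pin a single load or store per
    access site; the real kernel macros handle more cases, this is only an
    illustration of the pattern:

        #define READ_ONCE(x)     (*(const volatile typeof(x) *)&(x))
        #define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

        /* one load, used consistently through the whole calculation */
        unsigned long bw = READ_ONCE(wb->avg_write_bandwidth);

        /* paired annotation on the writer side */
        WRITE_ONCE(wb->dirty_ratelimit, dirty_ratelimit);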

    Link: https://lkml.kernel.org/r/20210713104716.22868-5-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Michael Stapelberg
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
  • Rename domain_update_bandwidth() to domain_update_dirty_limit(). The
    original name is a misnomer: the function has nothing to do with
    bandwidth, it updates dirty limits.

    Link: https://lkml.kernel.org/r/20210713104716.22868-4-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Michael Stapelberg
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
  • Michael Stapelberg has reported that for workloads with short big
    spikes of writes (the GCC linker seems to trigger this frequently) the
    write throughput is heavily underestimated and tends to steadily sink
    until it reaches zero. This has a rather bad impact on writeback
    throttling (causing
    stalls). The problem is that writeback throughput estimate gets updated
    at most once per 200 ms. One update happens early after we submit pages
    for writeback (at that point writeout of only small fraction of pages is
    completed and thus observed throughput is tiny). Next update happens only
    during the next write spike (updates happen only from inode writeback and
    dirty throttling code) and if that is more than 1s after previous spike,
    we decide system was idle and just ignore whatever was written until this
    moment.

    Fix the problem by making sure writeback throughput estimate is also
    updated shortly after writeback completes to get reasonable estimate of
    throughput for spiky workloads.

    [jack@suse.cz: avoid division by 0 in wb_update_dirty_ratelimit()]

    Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
    Link: https://lkml.kernel.org/r/20210713104716.22868-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Reported-by: Michael Stapelberg
    Tested-by: Michael Stapelberg
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
  • Currently we trigger writeback bandwidth estimation from
    balance_dirty_pages() and from wb_writeback(). However neither of these
    need to trigger when the system is relatively idle and writeback is
    triggered e.g. from fsync(2). Make sure writeback estimates happen
    reliably by triggering them from do_writepages().
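
    A hedged sketch of the trigger (helper and field names as used
    elsewhere in this series; the exact placement inside do_writepages()
    may differ):

        /* at the end of do_writepages(), roughly: */
        struct bdi_writeback *wb = inode_to_wb(mapping->host);

        /* refresh the estimate if it is stale, even when nothing else
         * (balance_dirty_pages, wb_writeback) is running */
        if (time_is_before_jiffies(READ_ONCE(wb->bw_time_stamp) +
                                   BANDWIDTH_INTERVAL))
                wb_update_bandwidth(wb);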

    Link: https://lkml.kernel.org/r/20210713104716.22868-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Michael Stapelberg
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
  • Patch series "writeback: Fix bandwidth estimates", v4.

    Fix the estimate of writeback throughput when the device is not fully
    busy doing writeback. Michael Stapelberg has reported that such a
    workload (e.g. generated by linking) tends to push the estimated
    throughput down to 0 and as a result writeback on the device is
    practically stalled.

    The first three patches fix the reported issue, the remaining two patches
    are unrelated cleanups of problems I've noticed when reading the code.

    This patch (of 4):

    Track the number of inodes under writeback for each bdi_writeback
    structure. We will use this to decide whether a wb does any IO and so
    we can estimate its writeback throughput. In principle we could use
    the number of pages under writeback (the WB_WRITEBACK counter) for
    this; however, normal percpu counter reads are too inaccurate for our
    purposes and summing the counter is too expensive.
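
    A sketch of that tracking; an exact atomic counter is cheap to read,
    unlike summing a percpu counter (field name assumed from the
    description):

        static void wb_inode_writeback_start(struct bdi_writeback *wb)
        {
                atomic_inc(&wb->writeback_inodes);
        }

        static void wb_inode_writeback_end(struct bdi_writeback *wb)
        {
                atomic_dec(&wb->writeback_inodes);
        }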

    Link: https://lkml.kernel.org/r/20210713104519.16394-1-jack@suse.cz
    Link: https://lkml.kernel.org/r/20210713104716.22868-1-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Wu Fengguang
    Cc: Michael Stapelberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara

10 Aug, 2021

1 commit

  • Don't leak the details of the timer into the block layer, instead
    initialize the timer in bdi_alloc and delete it in bdi_unregister.
    Note that this means the timer is initialized (but not armed) for
    non-block queues as well now.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Link: https://lore.kernel.org/r/20210809141744.1203023-2-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig

01 Jul, 2021

1 commit

  • Pull core block updates from Jens Axboe:

    - disk events cleanup (Christoph)

    - gendisk and request queue allocation simplifications (Christoph)

    - bdev_disk_changed cleanups (Christoph)

    - IO priority improvements (Bart)

    - Chained bio completion trace fix (Edward)

    - blk-wbt fixes (Jan)

    - blk-wbt enable/disable fix (Zhang)

    - Scheduler dispatch improvements (Jan, Ming)

    - Shared tagset scheduler improvements (John)

    - BFQ updates (Paolo, Luca, Pietro)

    - BFQ lock inversion fix (Jan)

    - Documentation improvements (Kir)

    - CLONE_IO block cgroup fix (Tejun)

    - Removal of the ancient and deprecated block dump feature (zhangyi)

    - Discard merge fix (Ming)

    - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
    Yang)

    * tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
    block: fix discard request merge
    block/mq-deadline: Remove a WARN_ON_ONCE() call
    blk-mq: update hctx->dispatch_busy in case of real scheduler
    blk: Fix lock inversion between ioc lock and bfqd lock
    bfq: Remove merged request already in bfq_requests_merged()
    block: pass a gendisk to bdev_disk_changed
    block: move bdev_disk_changed
    block: add the events* attributes to disk_attrs
    block: move the disk events code to a separate file
    block: fix trace completion for chained bio
    block/partitions/msdos: Fix typo inidicator -> indicator
    block, bfq: reset waker pointer with shared queues
    block, bfq: check waker only for queues with no in-flight I/O
    block, bfq: avoid delayed merge of async queues
    block, bfq: boost throughput by extending queue-merging times
    block, bfq: consider also creation time in delayed stable merge
    block, bfq: fix delayed stable merge check
    block, bfq: let also stably merged queues enjoy weight raising
    blk-wbt: make sure throttle is enabled properly
    blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
    ...

    Linus Torvalds

30 Jun, 2021

8 commits

  • Use __set_page_dirty_no_writeback() instead. This will set the dirty bit
    on the page, which will be used to avoid calling set_page_dirty() in the
    future. It will have no effect on actually writing the page back, as the
    pages are not on any LRU lists.

    [akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules]

    Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
  • This is fundamentally the same code, so just call it instead of
    duplicating it.

    Link: https://lkml.kernel.org/r/20210615162342.1669332-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Cc: Al Viro
    Cc: Dan Williams
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
  • Patch series "Further set_page_dirty cleanups".

    Prompted by Christoph's recent patches, here are some more patches to
    improve the state of set_page_dirty(). They're all from the folio tree,
    so they've been tested to a certain extent.

    This patch (of 6):

    Nothing in __set_page_dirty() is specific to buffer_head, so move it to
    mm/page-writeback.c. That removes the only caller of
    account_page_dirtied() outside of page-writeback.c, so make it static.

    Link: https://lkml.kernel.org/r/20210615162342.1669332-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20210615162342.1669332-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Cc: Jan Kara
    Cc: Al Viro
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
  • Remove the CONFIG_BLOCK default to __set_page_dirty_buffers and just wire
    that method up for the missing instances.
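
    That is, each affected filesystem now sets the method explicitly in
    its address_space_operations; example_aops below is a hypothetical
    instance, only the ->set_page_dirty line is the point:

        static const struct address_space_operations example_aops = {
                /* was the implicit CONFIG_BLOCK default */
                .set_page_dirty = __set_page_dirty_buffers,
        };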

    [hch@lst.de: ecryptfs: add a ->set_page_dirty kludge]
    Link: https://lkml.kernel.org/r/20210624125250.536369-1-hch@lst.de

    Link: https://lkml.kernel.org/r/20210614061512.3966143-4-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Jan Kara
    Cc: Al Viro
    Cc: Matthew Wilcox (Oracle)
    Cc: Tyler Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
  • As account_page_dirtied() is always protected by xa_lock_irqsave(),
    using __this_cpu_inc() is better.
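
    The general rule being applied, in sketch form (a generic percpu
    counter, not the patched lines themselves):

        DEFINE_PER_CPU(unsigned long, example_counter);  /* hypothetical */

        /* caller holds xa_lock_irqsave(): IRQs are off, so the cheaper
         * non-irq-safe variant cannot be torn by an interrupt */
        __this_cpu_inc(example_counter);   /* instead of this_cpu_inc() */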

    Link: https://lkml.kernel.org/r/20210512144742.4764-1-wuchi.zero@gmail.com
    Signed-off-by: Chi Wu
    Reviewed-by: Jan Kara
    Cc: Howard Cochran
    Cc: Miklos Szeredi
    Cc: Sedat Dilek
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chi Wu
  • As the value of pos_ratio_polynom() clamps between 0 and 2LL <<
    RATELIMIT_CALC_SHIFT, the global control line should be consistent with
    it.

    Link: https://lkml.kernel.org/r/20210511103606.3732-1-wuchi.zero@gmail.com
    Signed-off-by: Chi Wu
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Howard Cochran
    Cc: Miklos Szeredi
    Cc: Sedat Dilek
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chi Wu
  • Fix performance when BDI's share of ratio is 0.

    The issue is similar to commit 74d369443325 ("writeback: Fix
    performance regression in wb_over_bg_thresh()").

    balance_dirty_pages() and the writeback worker will also disagree on
    whether to write back when a BDI uses BDI_CAP_STRICTLIMIT and the
    BDI's share of the thresh ratio is zero.

    For example, a thread on cpu0 writes 32 pages and then calls
    balance_dirty_pages(); it wakes up background writeback and pauses
    because wb_dirty > wb->wb_thresh = 0 (the share of the thresh ratio is
    zero). The thread may run on cpu0 again because the scheduler prefers
    prev_cpu, while the writeback worker may run on other cpus (1, 2, ...),
    in which case the value of wb_stat(wb, WB_RECLAIMABLE) read in
    wb_over_bg_thresh() is 0 and the worker returns without writing back.

    Thus, balance_dirty_pages() keeps looping, sleeping and then waking up
    a worker that does nothing. It remains stuck in this state until the
    writeback worker hits the right dirty cpu or the dirty pages expire.

    The fix is to read the exact wb_stat_sum() when the threshold is low.
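
    A sketch of that shape, following the precedent of commit 74d369443325
    cited above (helper names from memory, hedged):

        unsigned long reclaimable;

        /* percpu counters are only accurate to within a per-cpu batch
         * error; once the threshold is comparably small, pay for the
         * exact sum */
        if (wb_thresh < 2 * wb_stat_error())
                reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
        else
                reclaimable = wb_stat(wb, WB_RECLAIMABLE);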

    Link: https://lkml.kernel.org/r/20210428225046.16301-1-wuchi.zero@gmail.com
    Signed-off-by: Chi Wu
    Reviewed-by: Jan Kara
    Cc: Tejun Heo
    Cc: Miklos Szeredi
    Cc: Sedat Dilek
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chi Wu
  • get_writeback_state() has been gone since 2006; kill the related comments.

    Link: https://lkml.kernel.org/r/20210508125026.56600-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang

24 May, 2021

1 commit

  • We have already deleted the block_dump feature in mark_inode_dirty()
    because it can be replaced by tracepoints; now we also remove the part
    in submit_bio() for the same reason. That part of the block_dump
    feature dumps the writing process, write region and sectors on the
    target disk into the kernel log. It can be replaced by the
    block_bio_queue tracepoint in submit_bio_checks(), so we do not need
    block_dump anymore; remove the whole feature.

    Signed-off-by: zhangyi (F)
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210313030146.2882027-3-yi.zhang@huawei.com
    Signed-off-by: Jens Axboe

    zhangyi (F)

07 May, 2021

1 commit

  • Fix ~94 single-word typos in locking code comments, plus a few
    very obvious grammar mistakes.

    Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
    Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
    Signed-off-by: Ingo Molnar
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Randy Dunlap
    Cc: Bhaskar Chowdhury
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar

01 May, 2021

1 commit

  • Page writeback doesn't hold a page reference, which allows truncate to
    free a page the second PageWriteback is cleared. This used to require
    special attention in test_clear_page_writeback(), where we had to be
    careful not to rely on the unstable page->memcg binding and look up all
    the necessary information before clearing the writeback flag.

    Since commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and
    BUG_ON(PageWriteback)") test_clear_page_writeback() is called with an
    explicit reference on the page, and this dance is no longer needed.

    Use unlock_page_memcg() and dec_lruvec_page_state() directly.

    This removes the last user of the lock_page_memcg() return value, so
    change it to void. Touch up the comments in there as well. This also
    removes the last extern user of __unlock_page_memcg(), so make it
    static. Further, it removes the last user of dec_lruvec_state();
    delete it, along with a few other unused helpers.

    Link: https://lkml.kernel.org/r/YCQbYAWg4nvBFL6h@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Hugh Dickins
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner

24 Mar, 2021

1 commit

  • This is the killable version of wait_on_page_writeback.
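
    A sketch of the killable variant: the same loop as
    wait_on_page_writeback(), but the wait can be cut short by a fatal
    signal:

        int wait_on_page_writeback_killable(struct page *page)
        {
                while (PageWriteback(page))
                        if (wait_on_page_bit_killable(page, PG_writeback))
                                return -EINTR;
                return 0;
        }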

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Signed-off-by: David Howells
    Tested-by: kafs-testing@auristor.com
    cc: linux-afs@lists.infradead.org
    cc: linux-mm@kvack.org
    Link: https://lore.kernel.org/r/20210320054104.1300774-3-willy@infradead.org

    Matthew Wilcox (Oracle)

06 Jan, 2021

1 commit

  • Ever since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common()
    logic") we've had some very occasional reports of BUG_ON(PageWriteback)
    in write_cache_pages(), which we thought we already fixed in commit
    073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").

    But syzbot just reported another one, even with that commit in place.

    And it turns out that there's a simpler way to trigger the BUG_ON() than
    the one Hugh found with page re-use. It all boils down to the fact that
    the page writeback is ostensibly serialized by the page lock, but that
    isn't actually really true.

    Yes, the people _setting_ writeback all do so under the page lock, but
    the actual clearing of the bit - and waking up any waiters - happens
    without any page lock.

    This gives us this fairly simple race condition:

    CPU1 = end previous writeback
    CPU2 = start new writeback under page lock
    CPU3 = write_cache_pages()

    CPU1                        CPU2                        CPU3
    ----                        ----                        ----

    end_page_writeback()
      test_clear_page_writeback(page)
      ... delayed...

                                lock_page();
                                set_page_writeback()
                                unlock_page()

                                                            lock_page()
                                                            wait_on_page_writeback();

      wake_up_page(page, PG_writeback);
      .. wakes up CPU3 ..

                                                            BUG_ON(PageWriteback(page));

    where the BUG_ON() happens because we woke up the PG_writeback bit
    because of the _previous_ writeback, but a new one had already been
    started because the clearing of the bit wasn't actually atomic wrt the
    actual wakeup or serialized by the page lock.

    The reason this didn't use to happen was that the old logic in waiting
    on a page bit would just loop if it ever saw the bit set again.

    The nice proper fix would probably be to get rid of the whole "wait for
    writeback to clear, and then set it" logic in the writeback path, and
    replace it with an atomic "wait-to-set" (ie the same as we have for page
    locking: we set the page lock bit with a single "lock_page()", not with
    "wait for lock bit to clear and then set it").

    However, our current model for writeback is that the waiting for the
    writeback bit is done by the generic VFS code (ie write_cache_pages()),
    but the actual setting of the writeback bit is done much later by the
    filesystem ".writepages()" function.

    IOW, to make the writeback bit have that same kind of "wait-to-set"
    behavior as we have for page locking, we'd have to change our roughly
    ~50 different writeback functions. Painful.

    Instead, just make "wait_on_page_writeback()" loop on the very unlikely
    situation that the PG_writeback bit is still set, basically re-instating
    the old behavior. This is very non-optimal in case of contention, but
    since we only ever set the bit under the page lock, that situation is
    controlled.
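
    In essence (the helper was an inline in pagemap.h at the time):

        static inline void wait_on_page_writeback(struct page *page)
        {
                /* re-check after each wakeup: the wakeup may belong to a
                 * previous writeback, and a new one may already be
                 * running */
                while (PageWriteback(page))     /* was: if (...) */
                        wait_on_page_bit(page, PG_writeback);
        }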

    Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@syzkaller.appspotmail.com
    Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    Acked-by: Hugh Dickins
    Cc: Andrew Morton
    Cc: Matthew Wilcox
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds

25 Nov, 2020

1 commit

  • Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
    on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
    end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
    no longer an ext4 page at all.

    The problem is that PageWriteback is not accompanied by a page reference
    (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
    soon as TestClearPageWriteback has been done, that page could be removed
    from page cache, freed, and reused for something else by the time that
    wake_up_page() is reached.

    https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
    Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
    check; but I'm paranoid about even looking at an unreferenced struct page,
    lest its memory might itself have already been reused or hotremoved (and
    wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

    Then on crashing a second time, realized there's a stronger reason against
    that approach. If my testing just occasionally crashes on that check,
    when the page is reused for part of a compound page, wouldn't it be much
    more common for the page to get reused as an order-0 page before reaching
    wake_up_page()? And on rare occasions, might that reused page already be
    marked PageWriteback by its new user, and already be waited upon? What
    would that look like?

    It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
    in write_cache_pages() (though I have never seen that crash myself).

    Matthew Wilcox explaining this to himself:
    "page is allocated, added to page cache, dirtied, writeback starts,

    --- thread A ---
    filesystem calls end_page_writeback()
    test_clear_page_writeback()
    --- context switch to thread B ---
    truncate_inode_pages_range() finds the page, it doesn't have writeback set,
    we delete it from the page cache. Page gets reallocated, dirtied, writeback
    starts again. Then we call write_cache_pages(), see
    PageWriteback() set, call wait_on_page_writeback()
    --- context switch back to thread A ---
    wake_up_page(page, PG_writeback);
    ... thread B is woken, but because the wakeup was for the old use of
    the page, PageWriteback is still set.

    Devious"

    And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    this would have been much less likely: before that, wake_page_function()'s
    non-exclusive case would stop walking and not wake if it found Writeback
    already set again; whereas now the non-exclusive case proceeds to wake.

    I have not thought of a fix that does not add a little overhead: the
    simplest fix is for end_page_writeback() to get_page() before calling
    test_clear_page_writeback(), then put_page() after wake_up_page().
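
    That fix, sketched (eliding end_page_writeback()'s PageReclaim
    handling):

        void end_page_writeback(struct page *page)
        {
                /* pin the page so it cannot be freed and reused between
                 * clearing the bit and issuing the wakeup */
                get_page(page);
                test_clear_page_writeback(page);
                smp_mb__after_atomic();
                wake_up_page(page, PG_writeback);
                put_page(page);
        }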

    Was there a chance of missed wakeups before, since a page freed before
    reaching wake_up_page() would have PageWaiters cleared? I think not,
    because each waiter does hold a reference on the page. This bug comes
    when the old use of the page, the one we do TestClearPageWriteback on,
    had *no* waiters, so no additional page reference beyond the page cache
    (and whoever racily freed it). The reuse of the page has a waiter
    holding a reference, and its own PageWriteback set; but the belated
    wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).

    Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
    Reported-by: Qian Cai
    Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # v5.8+
    Signed-off-by: Linus Torvalds

    Hugh Dickins

25 Sep, 2020

3 commits

  • Replace the two negative flags that are always used together with a
    single positive flag that indicates the writeback capability instead
    of two related non-capabilities. Also remove the pointless wrappers
    to just check the flag.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
  • Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
    make the checks more obvious. Also remove the pointless
    bdi_cap_account_writeback wrapper that just obfuscates the check.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
  • The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
    backing_dev_info shared between the block drivers and the writeback code.
    To help untangling the dependency replace it with a queue flag and a
    superblock flag derived from it. This also helps with the case of e.g.
    a file system requiring stable writes due to its own checksumming, but
    not forcing it on other users of the block device like the swap code.

    One downside is that we can't support the stable_pages_required bdi
    attribute in sysfs anymore. It is replaced with a queue attribute
    which is also writable, for easier testing.
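
    A sketch of the two derived flags (names as merged around this time;
    q and sb are the usual queue and superblock in context):

        /* block driver side: this queue needs stable pages */
        blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);

        /* filesystem side: e.g. a checksumming fs can require stable
         * writes for itself without forcing them on the whole device */
        sb->s_iflags |= SB_I_STABLE_WRITES;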

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig

08 Aug, 2020

1 commit

  • The global variable "vm_total_pages" is a relic from older days. There is
    only a single user that reads the variable - build_all_zonelists() - and
    the first thing it does is update it.

    Use a local variable in build_all_zonelists() instead and remove the
    global variable.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Pankaj Gupta
    Reviewed-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Huang Ying
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand

04 Jun, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds

03 Jun, 2020

3 commits

  • After an NFS page has been written it is considered "unstable" until a
    COMMIT request succeeds. If the COMMIT fails, the page will be
    re-written.

    These "unstable" pages are currently accounted as "reclaimable", either
    in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
    'reclaimable' count. This might have made sense when sending the COMMIT
    required a separate action by the VFS/MM (e.g. releasepage() used to
    send a COMMIT). However now that all writes generated by ->writepages()
    will automatically be followed by a COMMIT (since commit 919e3bd9a875
    ("NFS: Ensure we commit after writeback is complete")) it makes more
    sense to treat them as writeback pages.

    So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
    NR_WRITEBACK and WB_WRITEBACK.

    A particular effect of this change is that when
    wb_check_background_flush() calls wb_over_bg_thresh(), the latter
    will report 'true' a lot less often as the 'unstable' pages are no
    longer considered 'dirty' (as there is nothing that writeback can do
    about them anyway).

    Currently wb_check_background_flush() will trigger writeback to NFS even
    when there are relatively few dirty pages (if there are lots of unstable
    pages), this can result in small writes going to the server (10s of
    Kilobytes rather than a Megabyte) which hurts throughput. With this
    patch, there are fewer writes which are each larger on average.

    Where the NR_UNSTABLE_NFS count was included in statistics
    virtual-files, the entry is retained, but the value is hard-coded as
    zero. Static trace points and warning printks which mentioned this
    counter no longer report it.

    [akpm@linux-foundation.org: re-layout comment]
    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: NeilBrown
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Acked-by: Trond Myklebust
    Acked-by: Michal Hocko [mm]
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
    Signed-off-by: Linus Torvalds

    NeilBrown
  • PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
    loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
    daemon needs to write to one bdi (the final bdi) in order to free up
    writes queued to another bdi (the client bdi).

    The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
    pages, so that it can still dirty pages after other processes have
    been throttled. The purpose of this is to avoid deadlocks that happen
    when the PF_LESS_THROTTLE process must write for any dirty pages to be
    freed, but it is being throttled and cannot write.

    This approach was designed when all threads were blocked equally,
    independently of which device they were writing to, or how fast it was.
    Since that time the writeback algorithm has changed substantially with
    different threads getting different allowances based on non-trivial
    heuristics. This means the simple "add 25%" heuristic is no longer
    reliable.

    The important issue is not that the daemon needs a *larger* dirty page
    allowance, but that it needs a *private* dirty page allowance, so that
    dirty pages for the "client" bdi that it is helping to clear (the bdi
    for an NFS filesystem or loop block device etc) do not affect the
    throttling of the daemon writing to the "final" bdi.

    This patch changes the heuristic so that the task is not throttled when
    the bdi it is writing to has a dirty page count below (or equal
    to) the free-run threshold for that bdi. This ensures it will always be
    able to have some pages in flight, and so will not deadlock.

    In a steady state, it is expected that PF_LOCAL_THROTTLE tasks might
    still be throttled by the global threshold, but that is acceptable as
    it is only the deadlock state that is interesting for this flag.

    This approach of "only throttle when target bdi is busy" is consistent
    with the other use of PF_LESS_THROTTLE in current_may_throttle(),
    where it causes attention to be focused only on the target bdi.

    So this patch
    - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
    - removes the 25% bonus that the flag gives, and
    - if PF_LOCAL_THROTTLE is set, does not delay at all unless both the
    global and the local free-run thresholds are exceeded (see the sketch
    below).
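
    Roughly (a simplified sketch of the new check in
    balance_dirty_pages(); the real condition tests the global and the
    wb-local limits together):

        if ((current->flags & PF_LOCAL_THROTTLE) &&
            dirty <= dirty_freerun_ceiling(thresh, bg_thresh))
                break;  /* flusher below its free-run ceiling: no delay */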

    Note that previously realtime threads were treated the same as
    PF_LESS_THROTTLE threads. This patch does *not* change the behaviour
    for real-time threads, so it is now different from the behaviour of nfsd
    and loop tasks. I don't know what is wanted for realtime.

    [akpm@linux-foundation.org: coding style fixes]
    Signed-off-by: NeilBrown
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Acked-by: Chuck Lever [nfsd]
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Cc: Trond Myklebust
    Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
    Signed-off-by: Linus Torvalds

    NeilBrown
  • Commit 64081362e8ff ("mm/page-writeback.c: fix range_cyclic writeback
    vs writepages deadlock") left an unused variable; remove it.

    Signed-off-by: Chao Yu
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200528033740.17269-1-yuchao0@huawei.com
    Signed-off-by: Linus Torvalds

    Chao Yu

27 Apr, 2020

1 commit

  • Instead of having all the sysctl handlers deal with user pointers, which
    is rather hairy in terms of the BPF interaction, copy the input to and
    from userspace in common code. This also means that the strings are
    always NUL-terminated by the common code, making the API a little bit
    safer.

    As most handlers just pass the data through to one of the common
    handlers, a lot of the changes are mechanical.

    Signed-off-by: Christoph Hellwig
    Acked-by: Andrey Ignatov
    Signed-off-by: Al Viro

    Christoph Hellwig

03 Apr, 2020

3 commits

  • With the introduction of protected KVM guests on s390 there is now a
    concept of inaccessible pages. These pages need to be made accessible
    before the host can access them.

    While cpu accesses will trigger a fault that can be resolved, I/O accesses
    will just fail. We need to add a callback into architecture code for
    places that will do I/O, namely when writeback is started or when a page
    reference is taken.

    This is not only to enable paging, file backing etc.; it is also necessary
    to protect the host against a malicious user space. For example a bad
    QEMU could simply start direct I/O on such protected memory. We do not
    want userspace to be able to trigger I/O errors and thus the logic is
    "whenever somebody accesses that page (gup) or does I/O, make sure that
    this page can be accessed". When the guest tries to access that page we
    will wait in the page fault handler for writeback to have finished and for
    the page_ref to be the expected value.

    On s390x the function is not supposed to fail, so it is ok to use a
    WARN_ON on failure. If we ever need some more fine-grained handling we can
    tackle this when we know the details.
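
    A sketch of the hook at a writeback call site (hook name as merged;
    exact placement hedged):

        /* before starting I/O on the page: a no-op on most
         * architectures, a real conversion for protected guest pages
         * on s390 */
        int ret = arch_make_page_accessible(page);

        WARN_ON_ONCE(ret);      /* not expected to fail on s390 */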

    Signed-off-by: Claudio Imbrenda
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Christian Borntraeger
    Reviewed-by: John Hubbard
    Acked-by: Will Deacon
    Cc: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ira Weiny
    Cc: Jérôme Glisse
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jason Gunthorpe
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200306132537.783769-3-imbrenda@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Claudio Imbrenda
  • Dumping the page information in this circumstance helps for debugging.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-7-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
  • There used to be a 'retry' label in between the two (identical) checks
    when first introduced in commit f446daaea9d4 ("mm: implement writeback
    livelock avoidance using page tagging"), and later modified/updated in
    commit 6e6938b6d313 ("writeback: introduce .tagged_writepages for the
    WB_SYNC_NONE sync stage").

    The label has been removed in commit 64081362e8ff ("mm/page-writeback.c:
    fix range_cyclic writeback vs writepages deadlock"), and the (identical)
    checks are now present / performed immediately one after another.

    So, remove/deduplicate the latter check, moving tag_pages_for_writeback()
    into the former check before the 'tag' variable assignment, so it's clear
    that it's not used in this (similarly-named) function call but only later
    in pagevec_lookup_range_tag().
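
    The resulting shape described above:

        if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) {
                tag_pages_for_writeback(mapping, index, end);
                tag = PAGECACHE_TAG_TOWRITE;
        } else {
                tag = PAGECACHE_TAG_DIRTY;
        }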

    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Reviewed-by: Andrew Morton
    Cc: Jan Kara
    Link: http://lkml.kernel.org/r/20200218221716.1648-1-mfo@canonical.com
    Signed-off-by: Linus Torvalds

    Mauricio Faria de Oliveira

14 Jan, 2020

3 commits

  • Use div64_ul() instead of do_div() if the divisor is unsigned long, to
    avoid truncation to 32-bit on 64-bit platforms.

    Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.com
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
     
  • The two variables 'numerator' and 'denominator', though declared as
    long, should actually be unsigned long (according to the
    implementation of the fprop_fraction_percpu() function).

    And do_div() does a 64-by-32 division, while the divisor 'denominator'
    is unsigned long, thus 64-bit on 64-bit platforms. Hence the proper
    function to call is div64_ul().

    Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.com
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
  • Patch series "use div64_ul() instead of div_u64() if the divisor is
    unsigned long".

    We were first inspired by commit b0ab99e7736a ("sched: Fix possible
    divide by zero in avg_atom() calculation"); then, looking at the
    recently analyzed mm code, we found this suspicious place.

    201         if (min) {
    202                 min *= this_bw;
    203                 do_div(min, tot_bw);
    204         }

    And we also disassembled and confirmed it:

    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
    0xffffffff811c37da : xor %r10d,%r10d
    0xffffffff811c37dd : test %rax,%rax
    0xffffffff811c37e0 : je 0xffffffff811c3800
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
    0xffffffff811c37e2 : imul %r8,%rax
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
    0xffffffff811c37e6 : mov %r9d,%r10d ---> truncates it to 32 bits here
    0xffffffff811c37e9 : xor %edx,%edx
    0xffffffff811c37eb : div %r10
    0xffffffff811c37ee : imul %rbx,%rax
    0xffffffff811c37f2 : shr $0x2,%rax
    0xffffffff811c37f6 : mul %rcx
    0xffffffff811c37f9 : shr $0x2,%rdx
    0xffffffff811c37fd : mov %rdx,%r10

    This series uses div64_ul() instead of div_u64() if the divisor is
    unsigned long, to avoid truncation to 32-bit on 64-bit platforms.

    This patch (of 3):

    The variables 'min' and 'max' are unsigned long and do_div() truncates
    them to 32 bits, which means a value can test as non-zero yet be
    truncated to zero for the division. Fix this issue by using
    div64_ul() instead.
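
    The corrected pattern for the snippet quoted above:

        /* do_div() is a 64-by-32 division and truncates a 64-bit
         * divisor; div64_ul() does a full 64-by-64 division */
        if (min) {
                min *= this_bw;
                min = div64_ul(min, tot_bw);    /* was: do_div(min, tot_bw) */
        }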

    Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
    Fixes: 693108a8a667 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang