28 Nov, 2020

1 commit

  • trace_sched_blocked_trace in CFS is really useful for debugging via
    trace because it tells where the process was stuck in the callstack.

    For example,
    -6143 ( 6136) [005] d..2 50.278987: sched_blocked_reason: pid=6136 iowait=0 caller=SyS_mprotect+0x88/0x208
    -6136 ( 6136) [005] d..2 50.278990: sched_blocked_reason: pid=6142 iowait=0 caller=do_page_fault+0x1f4/0x3b0
    -6142 ( 6136) [006] d..2 50.278996: sched_blocked_reason: pid=6144 iowait=0 caller=SyS_prctl+0x52c/0xb58
    -6144 ( 6136) [006] d..2 50.279007: sched_blocked_reason: pid=6136 iowait=0 caller=vm_mmap_pgoff+0x74/0x104

    However, it sometimes gives pointless information like this:
    RenderThread-2322 ( 1805) [006] d.s3 50.319046: sched_blocked_reason: pid=6136 iowait=1 caller=__lock_page_killable+0x17c/0x220
    logd.writer-594 ( 587) [002] d.s3 50.334011: sched_blocked_reason: pid=6126 iowait=1 caller=wait_on_page_bit+0x194/0x208
    kworker/u16:13-333 ( 333) [007] d.s4 50.343161: sched_blocked_reason: pid=6136 iowait=1 caller=__lock_page_killable+0x17c/0x220

    Entries such as wait_on_page_bit and __lock_page_killable are pointless
    because they don't carry higher-level information to identify the callstack.

    The reason is that page_lock and waitqueue are special synchronization
    methods, unlike other normal locks (mutex, spinlock).
    Let's mark them as "__sched" so that get_wchan, which is used in
    trace_sched_blocked_trace, can detect and skip them. That produces more
    meaningful callstack functions like this:

    -2867 ( 1068) [002] d.h4 124.209701: sched_blocked_reason: pid=329 iowait=0 caller=worker_thread+0x378/0x470
    -2867 ( 1068) [002] d.s3 124.209763: sched_blocked_reason: pid=8454 iowait=1 caller=__filemap_fdatawait_range+0xa0/0x104
    -2867 ( 1068) [002] d.s4 124.209803: sched_blocked_reason: pid=869 iowait=0 caller=worker_thread+0x378/0x470
    ScreenDecoratio-2364 ( 1867) [002] d.s3 124.209973: sched_blocked_reason: pid=8454 iowait=1 caller=f2fs_wait_on_page_writeback+0x84/0xcc
    ScreenDecoratio-2364 ( 1867) [002] d.s4 124.209986: sched_blocked_reason: pid=869 iowait=0 caller=worker_thread+0x378/0x470
    -329 ( 329) [000] d..3 124.210435: sched_blocked_reason: pid=538 iowait=0 caller=worker_thread+0x378/0x470
    kworker/u16:13-538 ( 538) [007] d..3 124.210450: sched_blocked_reason: pid=6 iowait=0 caller=worker_thread+0x378/0x470
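
    As an illustration, the annotation would look something like this (a
    sketch; the exact wait primitives the patch touches are assumed):

        /* Marking the wait primitive __sched places it in the
         * .sched.text section, so get_wchan() skips over it and
         * reports the real blocking caller instead. */
        void __sched wait_on_page_bit(struct page *page, int bit_nr)
        {
                wait_queue_head_t *q = page_waitqueue(page);
                wait_on_page_bit_common(q, page, bit_nr,
                                        TASK_UNINTERRUPTIBLE, SHARED);
        }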

    Test: build pass and boot to home.
    Bug: 144961676
    Bug: 144713689
    Bug: 172212772
    Signed-off-by: Minchan Kim
    Signed-off-by: Jimmy Shiu
    Change-Id: I9c738802a16941ca767dcc37ae4463070b3fabf4
    (cherry picked from commit 1e4de875d9e0cfaccf5131bcc709ae8646cdc168)
    Signed-off-by: Will McVicker

    Jimmy Shiu
     

25 Nov, 2020

1 commit

  • Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
    on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
    end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
    no longer an ext4 page at all.

    The problem is that PageWriteback is not accompanied by a page reference
    (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
    soon as TestClearPageWriteback has been done, that page could be removed
    from page cache, freed, and reused for something else by the time that
    wake_up_page() is reached.

    https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
    Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
    check; but I'm paranoid about even looking at an unreferenced struct page,
    lest its memory might itself have already been reused or hotremoved (and
    wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

    Then on crashing a second time, realized there's a stronger reason against
    that approach. If my testing just occasionally crashes on that check,
    when the page is reused for part of a compound page, wouldn't it be much
    more common for the page to get reused as an order-0 page before reaching
    wake_up_page()? And on rare occasions, might that reused page already be
    marked PageWriteback by its new user, and already be waited upon? What
    would that look like?

    It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
    in write_cache_pages() (though I have never seen that crash myself).

    Matthew Wilcox explaining this to himself:
    "page is allocated, added to page cache, dirtied, writeback starts,

    --- thread A ---
    filesystem calls end_page_writeback()
    test_clear_page_writeback()
    --- context switch to thread B ---
    truncate_inode_pages_range() finds the page, it doesn't have writeback set,
    we delete it from the page cache. Page gets reallocated, dirtied, writeback
    starts again. Then we call write_cache_pages(), see
    PageWriteback() set, call wait_on_page_writeback()
    --- context switch back to thread A ---
    wake_up_page(page, PG_writeback);
    ... thread B is woken, but because the wakeup was for the old use of
    the page, PageWriteback is still set.

    Devious"

    And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    this would have been much less likely: before that, wake_page_function()'s
    non-exclusive case would stop walking and not wake if it found Writeback
    already set again; whereas now the non-exclusive case proceeds to wake.

    I have not thought of a fix that does not add a little overhead: the
    simplest fix is for end_page_writeback() to get_page() before calling
    test_clear_page_writeback(), then put_page() after wake_up_page().
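
    A minimal sketch of that fix, assuming it lands in end_page_writeback()
    as described:

        void end_page_writeback(struct page *page)
        {
                /* ... PageReclaim/rotate_reclaimable_page() handling ... */

                /* Pin the page so it cannot be freed and reused between
                 * clearing Writeback and waking any waiters. */
                get_page(page);
                if (!test_clear_page_writeback(page))
                        BUG();

                smp_mb__after_atomic();
                wake_up_page(page, PG_writeback);
                put_page(page);
        }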

    Was there a chance of missed wakeups before, since a page freed before
    reaching wake_up_page() would have PageWaiters cleared? I think not,
    because each waiter does hold a reference on the page. This bug comes
    when the old use of the page, the one we do TestClearPageWriteback on,
    had *no* waiters, so no additional page reference beyond the page cache
    (and whoever racily freed it). The reuse of the page has a waiter
    holding a reference, and its own PageWriteback set; but the belated
    wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).

    Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
    Reported-by: Qian Cai
    Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # v5.8+
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Nov, 2020

1 commit

  • We catch the case where we enter generic_file_buffered_read() with data
    already transferred, but we also need to be careful not to allow an async
    page lock if we're looping transferring data. If not, we could be
    returning -EIOCBQUEUED instead of the transferred amount, and it could
    result in double waitqueue additions as well.
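
    A sketch of the implied guard, at the point where the read loop would
    otherwise async-lock the next page (assumed shape, not the literal
    patch):

        /* Never async-lock another page once data has been copied;
         * return the transferred amount and let the caller retry. */
        if (iocb->ki_flags & IOCB_WAITQ) {
                if (written) {
                        put_page(page);
                        goto out;
                }
                error = lock_page_async(page, iocb->ki_waitq);
        } else {
                error = lock_page_killable(page);
        }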

    Cc: stable@vger.kernel.org # v5.9
    Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Oct, 2020

1 commit

  • Pull clone/dedupe/remap code refactoring from Darrick Wong:
    "Move the generic file range remap (aka reflink and dedupe) functions
    out of mm/filemap.c and fs/read_write.c and into fs/remap_range.c to
    reduce clutter in the first two files"

    * tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    vfs: move the generic write and copy checks out of mm
    vfs: move the remap range helpers to remap_range.c
    vfs: move generic_remap_checks out of mm

    Linus Torvalds
     

18 Oct, 2020

1 commit

  • Once we've copied some data for an iocb that is marked with IOCB_WAITQ,
    we should no longer attempt to async lock a new page. Instead make sure
    we return the copied amount, and let the caller retry, instead of
    returning -EIOCBQUEUED for a new page.

    This should only be possible with read-ahead disabled on the underlying
    device, and multiple threads racing on the same file. It hasn't been
    reproducible on anything else.
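
    The fix suggested by this description presumably has this shape (an
    assumption, not a quote of the patch):

        /* Once some data has been copied for an IOCB_WAITQ iocb, we can
         * no longer safely return -EIOCBQUEUED; force NOWAIT so further
         * blocking returns the copied amount instead. */
        if (written && (iocb->ki_flags & IOCB_WAITQ))
                iocb->ki_flags |= IOCB_NOWAIT;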

    Cc: stable@vger.kernel.org # v5.9
    Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
    Reported-by: Kent Overstreet
    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 Oct, 2020

5 commits

  • Merge more updates from Andrew Morton:
    "155 patches.

    Subsystems affected by this patch series: mm (dax, debug, thp,
    readahead, page-poison, util, memory-hotplug, zram, cleanups), misc,
    core-kernel, get_maintainer, MAINTAINERS, lib, bitops, checkpatch,
    binfmt, ramfs, autofs, nilfs, rapidio, panic, relay, kgdb, ubsan,
    romfs, and fault-injection"

    * emailed patches from Andrew Morton : (155 commits)
    lib, uaccess: add failure injection to usercopy functions
    lib, include/linux: add usercopy failure capability
    ROMFS: support inode blocks calculation
    ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for Clang
    sched.h: drop in_ubsan field when UBSAN is in trap mode
    scripts/gdb/tasks: add headers and improve spacing format
    scripts/gdb/proc: add struct mount & struct super_block addr in lx-mounts command
    kernel/relay.c: drop unneeded initialization
    panic: dump registers on panic_on_warn
    rapidio: fix the missed put_device() for rio_mport_add_riodev
    rapidio: fix error handling path
    nilfs2: fix some kernel-doc warnings for nilfs2
    autofs: harden ioctl table
    ramfs: fix nommu mmap with gaps in the page cache
    mm: remove the now-unnecessary mmget_still_valid() hack
    mm/gup: take mmap_lock in get_dump_page()
    binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
    coredump: rework elf/elf_fdpic vma_dump_size() into common helper
    coredump: refactor page range dumping into common helper
    coredump: let dump_emit() bail out on short writes
    ...

    Linus Torvalds
     
  • Fix some broken comments, including typos, grammar errors and wrong
    function names.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200913095456.54873-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Fold ra_submit() into its last remaining user and pass the
    readahead_control struct to both do_page_cache_ra() and
    page_cache_sync_ra().

    Signed-off-by: David Howells
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Eric Biggers
    Link: https://lkml.kernel.org/r/20200903140844.14194-9-willy@infradead.org
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Patch series "Remove assumptions of THP size".

    There are a number of places in the VM which assume that a THP is a PMD in
    size. That's true today, and remains true after this patch series, but
    this is a prerequisite for switching to arbitrary-sized THPs.
    thp_nr_pages() still returns either HPAGE_PMD_NR or 1, but will be changed
    later.

    This patch (of 11):

    page_cache_free_page() assumes THPs are PMD_SIZE; fix that assumption.
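
    A plausible shape of the change, assuming the PMD-size constant is
    replaced with the per-page count:

        static void page_cache_free_page(struct address_space *mapping,
                                        struct page *page)
        {
                void (*freepage)(struct page *);

                freepage = mapping->a_ops->freepage;
                if (freepage)
                        freepage(page);

                if (PageTransHuge(page) && !PageHuge(page)) {
                        /* was: page_ref_sub(page, HPAGE_PMD_NR); */
                        page_ref_sub(page, thp_nr_pages(page));
                        VM_BUG_ON_PAGE(page_count(page) <= 0, page);
                } else {
                        put_page(page);
                }
        }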

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20200908195539.25896-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • When a THP is removed from the page cache by reclaim, we replace it with a
    shadow entry that occupies all slots of the XArray previously occupied by
    the THP. If the user then accesses that page again, we only allocate a
    single page, but storing it into the shadow entry replaces all entries
    with that one page. That leads to bugs like

    page dumped because: VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
    ------------[ cut here ]------------
    kernel BUG at mm/filemap.c:2529!

    https://bugzilla.kernel.org/show_bug.cgi?id=206569

    This is hard to reproduce with mainline, but happens regularly with the
    THP patchset (as so many more THPs are created). This solution is taken
    from the THP patchset. It splits the shadow entry into order-0 pieces at
    the time that we bring a new page into cache.
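
    Conceptually, at insertion time (a sketch using the XArray split
    helpers introduced alongside this fix; the details here are assumed):

        /* If the slot still holds a multi-order shadow entry, split it
         * into order-0 entries first, so storing the new page replaces
         * only this one index. */
        if (xa_is_value(entry))
                xas_split_alloc(&xas, entry, order, gfp);
        xas_lock_irq(&xas);
        if (entry)
                xas_split(&xas, entry, order);
        xas_store(&xas, page);
        xas_unlock_irq(&xas);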

    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Song Liu
    Cc: "Kirill A . Shutemov"
    Cc: Qian Cai
    Link: https://lkml.kernel.org/r/20200903183029.14930-4-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

16 Oct, 2020

2 commits

  • Pull networking updates from Jakub Kicinski:

    - Add redirect_neigh() BPF packet redirect helper, allowing to limit
    stack traversal in common container configs and improving TCP
    back-pressure.

    Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

    - Expand netlink policy support and improve policy export to user
    space. (Ge)netlink core performs request validation according to
    declared policies. Expand the expressiveness of those policies
    (min/max length and bitmasks). Allow dumping policies for particular
    commands. This is used for feature discovery by user space (instead
    of kernel version parsing or trial and error).

    - Support IGMPv3/MLDv2 multicast listener discovery protocols in
    bridge.

    - Allow more than 255 IPv4 multicast interfaces.

    - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
    packets of TCPv6.

    - In Multi-path TCP (MPTCP) support concurrent transmission of data on
    multiple subflows in a load balancing scenario. Enhance advertising
    addresses via the RM_ADDR/ADD_ADDR options.

    - Support SMC-Dv2 version of SMC, which enables multi-subnet
    deployments.

    - Allow more calls to same peer in RxRPC.

    - Support two new Controller Area Network (CAN) protocols - CAN-FD and
    ISO 15765-2:2016.

    - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
    kernel problem.

    - Add TC actions for implementing MPLS L2 VPNs.

    - Improve nexthop code - e.g. handle various corner cases when nexthop
    objects are removed from groups better, skip unnecessary
    notifications and make it easier to offload nexthops into HW by
    converting to a blocking notifier.

    - Support adding and consuming TCP header options by BPF programs,
    opening the doors for easy experimental and deployment-specific TCP
    option use.

    - Reorganize TCP congestion control (CC) initialization to simplify
    life of TCP CC implemented in BPF.

    - Add support for shipping BPF programs with the kernel and loading
    them early on boot via the User Mode Driver mechanism, hence reusing
    all the user space infra we have.

    - Support sleepable BPF programs, initially targeting LSM and tracing.

    - Add bpf_d_path() helper for returning full path for given 'struct
    path'.

    - Make bpf_tail_call compatible with bpf-to-bpf calls.

    - Allow BPF programs to call map_update_elem on sockmaps.

    - Add BPF Type Format (BTF) support for type and enum discovery, as
    well as support for using BTF within the kernel itself (current use
    is for pretty printing structures).

    - Support listing and getting information about bpf_links via the bpf
    syscall.

    - Enhance kernel interfaces around NIC firmware update. Allow
    specifying overwrite mask to control if settings etc. are reset
    during update; report expected max time operation may take to users;
    support firmware activation without machine reboot incl. limits of
    how much impact reset may have (e.g. dropping link or not).

    - Extend ethtool configuration interface to report IEEE-standard
    counters, to limit the need for per-vendor logic in user space.

    - Adopt or extend devlink use for debug, monitoring, fw update in many
    drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
    dpaa2-eth).

    - In mlxsw expose critical and emergency SFP module temperature alarms.
    Refactor port buffer handling to make the defaults more suitable and
    support setting these values explicitly via the DCBNL interface.

    - Add XDP support for Intel's igb driver.

    - Support offloading TC flower classification and filtering rules to
    mscc_ocelot switches.

    - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
    fixed interval period pulse generator and one-step timestamping in
    dpaa-eth.

    - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
    offload.

    - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
    this HW to use it. Convert mvpp2 to split PCS.

    - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
    7-port Mediatek MT7531 IP.

    - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
    and wcn3680 support in wcn36xx.

    - Improve performance for packets which don't require much offloads on
    recent Mellanox NICs by 20% by making multiple packets share a
    descriptor entry.

    - Move chelsio inline crypto drivers (for TLS and IPsec) from the
    crypto subtree to drivers/net. Move MDIO drivers out of the phy
    directory.

    - Clean up a lot of W=1 warnings, reportedly the actively developed
    subsections of networking drivers should now build W=1 warning free.

    - Make sure drivers don't use in_interrupt() to dynamically adapt their
    code. Convert tasklets to use new tasklet_setup API (sadly this
    conversion is not yet complete).

    * tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
    Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
    net, sockmap: Don't call bpf_prog_put() on NULL pointer
    bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
    bpf, sockmap: Add locking annotations to iterator
    netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
    net: fix pos incrementment in ipv6_route_seq_next
    net/smc: fix invalid return code in smcd_new_buf_create()
    net/smc: fix valid DMBE buffer sizes
    net/smc: fix use-after-free of delayed events
    bpfilter: Fix build error with CONFIG_BPFILTER_UMH
    cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
    net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
    bpf: Fix register equivalence tracking.
    rxrpc: Fix loss of final ack on shutdown
    rxrpc: Fix bundle counting for exclusive connections
    netfilter: restore NF_INET_NUMHOOKS
    ibmveth: Identify ingress large send packets.
    ibmveth: Switch order of ibmveth_helper calls.
    cxgb4: handle 4-tuple PEDIT to NAT mode translation
    selftests: Add VRF route leaking tests
    ...

    Linus Torvalds
     
  • The generic write check helpers also don't have much to do with the page
    cache, so move them to the vfs.

    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     

15 Oct, 2020

1 commit

  • I would like to move all the generic helpers for the vfs remap range
    functionality (aka clonerange and dedupe) into a separate file so that
    they won't be scattered across the vfs and the mm subsystems. The
    eventual goal is to be able to deselect remap_range.c if none of the
    filesystems need that code, but the tricky part here is picking a
    stable(ish) part of the merge window to rearrange code.

    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     

14 Oct, 2020

6 commits

  • We dereference page->mapping and page->index directly after calling
    find_subpage() and these fields are not valid for tail pages. While
    commit 4101196b19d7 ("mm: page cache: store only head pages in i_pages")
    introduced the call to find_subpage(), the problem existed prior to this;
    I'm going to suggest all the way back to when THPs first existed.

    The user-visible effects of this are almost negligible. To hit it, you
    have to mmap a tmpfs file at an unaligned address and then it's only a
    disabled optimisation causing page faults to happen more frequently than
    they otherwise would.

    Fix this by keeping both head and page pointers and checking the
    appropriate one. We could use page_mapping() and page_to_index(), but
    that's higher overhead.
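
    A sketch of the head/page distinction in the fault path (assumed
    shape):

        struct page *head, *page;
        ...
        /* head's mapping/index/flags are the valid ones; page is what
         * actually gets mapped at this index. */
        page = find_subpage(head, xas.xa_index);
        if (!PageUptodate(head) || PageReadahead(page))
                goto skip;
        if (!trylock_page(head))
                goto skip;
        if (head->mapping != mapping || !PageUptodate(head))
                goto unlock;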

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200911012532.24761-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Add a new FGP_HEAD flag which avoids calling find_subpage() and add a
    convenience wrapper for it.
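
    The wrapper plausibly looks like this (a sketch; find_lock_head() is
    the assumed name):

        /* Return the locked head page at this index rather than the
         * subpage, by combining FGP_LOCK with the new FGP_HEAD flag. */
        static inline struct page *find_lock_head(struct address_space *mapping,
                                                pgoff_t index)
        {
                return pagecache_get_page(mapping, index,
                                FGP_LOCK | FGP_HEAD, 0);
        }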

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Johannes Weiner
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-9-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Convert shmem_getpage_gfp() (the only remaining caller of
    find_lock_entry()) to cope with a head page being returned instead of
    the subpage for the index.

    [willy@infradead.org: fix BUG()s]
    Link: https://lore.kernel.org/linux-mm/20200912032042.GA6583@casper.infradead.org/

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Johannes Weiner
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-8-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • There are only four callers remaining of find_get_entry().
    get_shadow_from_swap_cache() only wants to see shadow entries and doesn't
    care about which page is returned. Push the find_subpage() call into
    find_lock_entry(), find_get_incore_page() and pagecache_get_page().

    [willy@infradead.org: fix oops]
    Link: https://lkml.kernel.org/r/20200914112738.GM6583@casper.infradead.org

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Johannes Weiner
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-7-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • i915 does not want to see value entries. Switch it to use
    find_lock_page() instead, and remove the export of find_lock_entry().
    Move find_lock_entry() and find_get_entry() to mm/internal.h to discourage
    any future use.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

03 Oct, 2020

1 commit

  • Pull io_uring fixes from Jens Axboe:

    - fix for async buffered reads if read-ahead is fully disabled (Hao)

    - double poll match fix

    - ->show_fdinfo() potential ABBA deadlock complaint fix

    * tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-block:
    io_uring: fix async buffered reads when readahead is disabled
    io_uring: fix potential ABBA deadlock in ->show_fdinfo()
    io_uring: always delete double poll wait entry on match

    Linus Torvalds
     

29 Sep, 2020

1 commit

  • The async buffered reads feature is not working when readahead is
    turned off. There are two things of concern:

    - when doing a retry in io_read, not only the IOCB_WAITQ flag but also
    the IOCB_NOWAIT flag is still set, which makes it go to the would_block
    phase in generic_file_buffered_read() and then return -EAGAIN. After
    that, the io-wq thread work is queued, and the async reads are later
    done the old way.

    - even if we remove IOCB_NOWAIT when doing the retry, the feature still
    doesn't run properly, since generic_file_buffered_read() goes to
    lock_page_killable() after calling mapping->a_ops->readpage() to do the
    IO, thus causing the process to sleep.
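
    A sketch of the two corresponding changes (assumed shape):

        /* io_uring side: when arming the retry, keep IOCB_WAITQ but drop
         * IOCB_NOWAIT, so the buffered read can wait via the async
         * waitqueue instead of bailing out with -EAGAIN. */
        kiocb->ki_flags |= IOCB_WAITQ;
        kiocb->ki_flags &= ~IOCB_NOWAIT;

        /* mm side: after ->readpage(), wait for the page using the async
         * waitqueue rather than sleeping in lock_page_killable(). */
        if (iocb->ki_flags & IOCB_WAITQ)
                error = lock_page_async(page, iocb->ki_waitq);
        else
                error = lock_page_killable(page);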

    Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
    Fixes: 3b2a4439e0ae ("io_uring: get rid of kiocb_wait_page_queue_init()")
    Signed-off-by: Hao Xu
    Signed-off-by: Jens Axboe

    Hao Xu
     

23 Sep, 2020

1 commit

  • Two minor conflicts:

    1) net/ipv4/route.c, adding a new local variable while
    moving another local variable and removing its
    initial assignment.

    2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
    One pretty prints the port mode differently, whilst another
    changes the driver to try and obtain the port mode from
    the port node rather than the switch node.

    Signed-off-by: David S. Miller

    David S. Miller
     

18 Sep, 2020

1 commit

  • Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
    the page locking entirely fair, in that if a waiter came in while the
    lock was held, the lock would be transferred to the lockers strictly in
    order.

    That was intended to finally get rid of the long-reported watchdog
    failures that involved the page lock under extreme load, where a process
    could end up waiting essentially forever, as other page lockers stole
    the lock from under it.

    It also improved some benchmarks, but it ended up causing huge
    performance regressions on others, simply because fair lock behavior
    doesn't end up giving out the lock as aggressively, causing better
    worst-case latency, but potentially much worse average latencies and
    throughput.

    Instead of reverting that change entirely, this introduces a controlled
    amount of unfairness, with a sysctl knob to tune it if somebody needs
    to. But the default value should hopefully be good for any normal load,
    allowing a few rounds of lock stealing, but enforcing the strict
    ordering before the lock has been stolen too many times.

    There is also a hint from Matthieu Baerts that the fair page coloring
    may end up exposing an ABBA deadlock that is hidden by the usual
    optimistic lock stealing, and while the unfairness doesn't fix the
    fundamental issue (and I'm still looking at that), it avoids it in
    practice.

    The amount of unfairness can be modified by writing a new value to the
    'sysctl_page_lock_unfairness' variable (default value of 5, exposed
    through /proc/sys/vm/page_lock_unfairness), but that is hopefully
    something we'd use mainly for debugging rather than being necessary for
    any deep system tuning.

    This whole issue has exposed just how critical the page lock can be, and
    how contended it gets under certain loads. And the main contention
    doesn't really seem to be anything related to IO (which was the origin
    of this lock), but for things like just verifying that the page file
    mapping is stable while faulting in the page into a page table.
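
    Conceptually, each waiter now behaves like this (an illustration, not
    the kernel code; wait_for_page_lock_handoff() is hypothetical):

        int unfairness = READ_ONCE(sysctl_page_lock_unfairness);

        for (;;) {
                if (trylock_page(page))
                        break;                  /* stole the lock */
                if (--unfairness < 0) {
                        /* Too many steals: queue as an exclusive waiter
                         * and insist on a direct, fair handoff. */
                        wait_for_page_lock_handoff(page);
                        break;
                }
                wait_on_page_locked(page);      /* then race to steal */
        }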

    Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
    Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
    Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
    Reported-and-tested-by: Michael Larabel
    Tested-by: Matthieu Baerts
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Jan Kara
    Cc: Amir Goldstein
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Aug, 2020

1 commit

  • 'static' and 'static noinline' function attributes make no guarantees that
    gcc/clang won't optimize them. The compiler may decide to inline a 'static'
    function, and in such a case ALLOW_ERROR_INJECTION becomes meaningless. The
    compiler could have inlined __add_to_page_cache_locked() in one callsite and
    not inlined it in another. In that case injecting errors into it would cause
    unpredictable behavior. It's worse with 'static noinline', which won't be
    inlined but can still be optimized: the compiler may decide to remove an
    argument or constant-propagate a value depending on the callsite.

    To avoid such issues make sure that these functions are global noinline.
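
    Presumably the change has this shape (an assumption based on the
    description above):

        /* Global and noinline: the compiler can no longer inline, clone,
         * or constant-propagate into the error-injection target. */
        -static int __add_to_page_cache_locked(struct page *page,
        +noinline int __add_to_page_cache_locked(struct page *page,
                                                 struct address_space *mapping,
                                                 pgoff_t offset, gfp_t gfp_mask,
                                                 void **shadowp)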

    Fixes: af3b854492f3 ("mm/page_alloc.c: allow error injection")
    Fixes: cfcbfb1382db ("mm/filemap.c: enable error injection at add_to_page_cache()")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Josef Bacik
    Link: https://lore.kernel.org/bpf/20200827220114.69225-2-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

15 Aug, 2020

2 commits

  • struct file_ra_state ra.mmap_miss could be accessed concurrently during
    page faults as noticed by KCSAN,

    BUG: KCSAN: data-race in filemap_fault / filemap_map_pages

    write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
    filemap_fault+0x920/0xfc0
    do_sync_mmap_readahead at mm/filemap.c:2384
    (inlined by) filemap_fault at mm/filemap.c:2486
    __xfs_filemap_fault+0x112/0x3e0 [xfs]
    xfs_filemap_fault+0x74/0x90 [xfs]
    __do_fault+0x9e/0x220
    do_fault+0x4a0/0x920
    __handle_mm_fault+0xc69/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
    filemap_map_pages+0xc2e/0xd80
    filemap_map_pages at mm/filemap.c:2625
    do_fault+0x3da/0x920
    __handle_mm_fault+0xc69/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G W L 5.5.0-next-20200210+ #1
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    ra.mmap_miss is used to contribute to the readahead decisions, so a data
    race could be undesirable. Both the read and the write are only under a
    non-exclusive mmap_sem; two concurrent writers could even underflow the
    counter. Fix the underflow by writing to a local variable before
    committing the final store to ra.mmap_miss, given that a small inaccuracy
    of the counter should be acceptable.
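
    Sketched, the fix described above looks like this (assumed shape):

        unsigned int mmap_miss;

        /* Snapshot, adjust locally, publish once: two racing faults can
         * no longer underflow the counter. */
        mmap_miss = READ_ONCE(ra->mmap_miss);
        if (mmap_miss)
                WRITE_ONCE(ra->mmap_miss, --mmap_miss);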

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Tested-by: Qian Cai
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/20200211030134.1847-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

1 commit

  • Drop the repeated word "the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-3-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

08 Aug, 2020

2 commits

  • FGP_{WRITE|NOFS|NOWAIT} were missed in pagecache_get_page's kerneldoc
    comment.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Cc: Gang Deng
    Cc: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1593031747-4249-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Since commit bbddabe2e436aa ("mm: filemap: only do access activations on
    reads"), mark_page_accessed() is called for reads only. But the idle flag
    is cleared by mark_page_accessed(), so the idle flag won't get cleared if
    the page is only write-accessed.

    Basically idle page tracking is used to estimate workingset size of
    workload, noticeable size of workingset might be missed if the idle flag
    is not maintained correctly.

    It seems good enough to just clear idle flag for write operations.
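
    A plausible shape of the fix, assuming it lands in the FGP_WRITE path
    of pagecache_get_page():

        if (fgp_flags & FGP_WRITE) {
                /* Writes indicate activity too: clear the idle flag so
                 * idle page tracking also sees write-only access. */
                if (page_is_idle(page))
                        clear_page_idle(page);
        }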

    Fixes: bbddabe2e436 ("mm: filemap: only do access activations on reads")
    Reported-by: Gang Deng
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1593020612-13051-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     

04 Aug, 2020

1 commit

  • Pull io_uring updates from Jens Axboe:
    "Lots of cleanups in here, hardening the code and/or making it easier
    to read and fixing bugs, but a core feature/change too adding support
    for real async buffered reads. With the latter in place, we just need
    buffered write async support and we're done relying on kthreads for
    the fast path. In detail:

    - Cleanup how memory accounting is done on ring setup/free (Bijan)

    - sq array offset calculation fixup (Dmitry)

    - Consistently handle blocking off O_DIRECT submission path (me)

    - Support proper async buffered reads, instead of relying on kthread
    offload for that. This uses the page waitqueue to drive retries
    from task_work, like we handle poll based retry. (me)

    - IO completion optimizations (me)

    - Fix race with accounting and ring fd install (me)

    - Support EPOLLEXCLUSIVE (Jiufei)

    - Get rid of the io_kiocb unionizing, made possible by shrinking
    other bits (Pavel)

    - Completion side cleanups (Pavel)

    - Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)

    - Request environment grabbing cleanups (Pavel)

    - File and socket read/write cleanups (Pavel)

    - Improve kiocb_set_rw_flags() (Pavel)

    - Tons of fixes and cleanups (Pavel)

    - IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"

    * tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
    io_uring: flip if handling after io_setup_async_rw
    fs: optimise kiocb_set_rw_flags()
    io_uring: don't touch 'ctx' after installing file descriptor
    io_uring: get rid of atomic FAA for cq_timeouts
    io_uring: consolidate *_check_overflow accounting
    io_uring: fix stalled deferred requests
    io_uring: fix racy overflow count reporting
    io_uring: deduplicate __io_complete_rw()
    io_uring: de-unionise io_kiocb
    io-wq: update hash bits
    io_uring: fix missing io_queue_linked_timeout()
    io_uring: mark ->work uninitialised after cleanup
    io_uring: deduplicate io_grab_files() calls
    io_uring: don't do opcode prep twice
    io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
    io_uring: batch put_task_struct()
    tasks: add put_task_struct_many()
    io_uring: return locked and pinned page accounting
    io_uring: don't miscount pinned memory
    io_uring: don't open-code recv kbuf managment
    ...

    Linus Torvalds
     

03 Aug, 2020

2 commits

  • That gives us ordering guarantees around the pair.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • It turns out that wait_on_page_bit_common() had several problems,
    ranging from unfair behavior due to re-queueing at the end of the
    wait queue when re-trying, to an outright bug that could result in
    missed wakeups (but probably never happened in practice).

    This rewrites the whole logic to avoid both issues, by simply moving the
    logic to check (and possibly take) the bit lock into the wakeup path
    instead.

    That makes everything much more straightforward, and means that we never
    need to re-queue the wait entry: if we get woken up, we'll be notified
    through WQ_FLAG_WOKEN, and the wait queue entry will have been removed,
    and everything will have been done for us.
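
    The waiter side then reduces to essentially this (a simplified sketch
    of the described logic):

        for (;;) {
                set_current_state(state);
                /* The waker has already done the work: checked (and, for
                 * exclusive waiters, taken) the bit, removed the wait
                 * entry, and set WQ_FLAG_WOKEN. Just notice it. */
                if (READ_ONCE(wait->flags) & WQ_FLAG_WOKEN)
                        break;
                io_schedule();
        }
        __set_current_state(TASK_RUNNING);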

    Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@mail.gmail.com/
    Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/
    Reported-by: Oleg Nesterov
    Reported-by: Hugh Dickins
    Cc: Michal Hocko
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Jul, 2020

1 commit

  • Add an IOCB_NOIO flag that indicates to generic_file_read_iter that it
    shouldn't trigger any filesystem I/O for the actual request or for
    readahead. This makes it possible to do tentative reads out of the page
    cache, as some filesystems allow, and to take the appropriate locks and
    retry the reads only if the requested pages are not cached.
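
    A sketch of the intended usage pattern (assumed; the filesystem
    locking is schematic):

        struct kiocb iocb;

        init_sync_kiocb(&iocb, file);
        iocb.ki_flags |= IOCB_NOIO;     /* page-cache only, no I/O */
        ret = generic_file_read_iter(&iocb, to);
        if (ret == -EAGAIN) {
                /* Pages were not cached: take the filesystem locks,
                 * clear IOCB_NOIO, and retry with I/O allowed. */
        }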

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher