04 Sep, 2021

3 commits

  • Inline the mem_cgroup_try_charge_swap(), mem_cgroup_uncharge_swap() and
    cgroup_throttle_swaprate() functions so that the mem_cgroup_disabled()
    static key check is performed inline before calling the main body of each
    function. This minimizes the memcg overhead in the pagefault and
    exit_mmap paths when memcgs are disabled using the cgroup_disable=memory
    command-line option. This change results in a ~1% overhead reduction when
    running the PFT test [1], comparing {CONFIG_MEMCG=n} against
    {CONFIG_MEMCG=y, cgroup_disable=memory} configurations on an 8-core ARM64
    Android device.

    [1] https://lkml.org/lkml/2006/8/29/294 (also used in the mmtests suite)
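
    A minimal sketch of the pattern (the double-underscore out-of-line body
    is my illustration; exact names follow the patch only loosely):

        /* header: static key checked inline, so the common disabled case
         * costs no function call at all */
        static inline void cgroup_throttle_swaprate(struct page *page,
                                                    gfp_t gfp_mask)
        {
                if (mem_cgroup_disabled())
                        return;
                __cgroup_throttle_swaprate(page, gfp_mask);
        }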

    Link: https://lkml.kernel.org/r/20210713010934.299876-3-surenb@google.com
    Signed-off-by: Suren Baghdasaryan
    Reviewed-by: Shakeel Butt
    Reviewed-by: Muchun Song
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Cc: Yang Shi
    Cc: Alex Shi
    Cc: Wei Yang
    Cc: Jens Axboe
    Cc: Joonsoo Kim
    Cc: David Hildenbrand
    Cc: Matthew Wilcox (Oracle)
    Cc: Alistair Popple
    Cc: Minchan Kim
    Cc: Miaohe Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suren Baghdasaryan
     
  • Add mem_cgroup_disabled check in vmpressure, mem_cgroup_uncharge_swap and
    cgroup_throttle_swaprate functions. This minimizes the memcg overhead in
    the pagefault and exit_mmap paths when memcgs are disabled using
    cgroup_disable=memory command-line option.

    This change results in a ~2.1% overhead reduction when running the PFT
    test [1], comparing {CONFIG_MEMCG=n, CONFIG_MEMCG_SWAP=n} against
    {CONFIG_MEMCG=y, CONFIG_MEMCG_SWAP=y, cgroup_disable=memory}
    configurations on an 8-core ARM64 Android device.

    [1] https://lkml.org/lkml/2006/8/29/294 (also used in the mmtests suite)

    Link: https://lkml.kernel.org/r/20210713010934.299876-1-surenb@google.com
    Signed-off-by: Suren Baghdasaryan
    Reviewed-by: Shakeel Butt
    Reviewed-by: Muchun Song
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Alex Shi
    Cc: Alistair Popple
    Cc: David Hildenbrand
    Cc: Jens Axboe
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox (Oracle)
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Roman Gushchin
    Cc: Tejun Heo
    Cc: Wei Yang
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suren Baghdasaryan
     
  • We had a recurring situation in which admin procedures setting up
    swapfiles would race with test preparation clearing away swapfiles; and
    just occasionally that got stuck on a swapfile "(deleted)" which could
    never be swapped off. That is not supposed to be possible.

    2.6.28 commit f9454548e17c ("don't unlink an active swapfile") admitted
    that it was leaving a race window open: now close it.

    may_delete() makes the IS_SWAPFILE check (amongst many others) before
    inode_lock has been taken on target: now repeat just that simple check in
    vfs_unlink() and vfs_rename(), after taking inode_lock.

    Which goes most of the way to fixing the race, but swapon() must also
    check, after it acquires inode_lock, that the file just opened has not
    already been unlinked.
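
    A sketch of the shape of the fix (simplified fs/namei.c and swapon()
    excerpts; the exact error paths in the patch may differ):

        /* vfs_unlink() / vfs_rename(): repeat the check under inode_lock */
        inode_lock(target);
        if (IS_SWAPFILE(target))
                error = -EPERM;         /* may_delete()'s check, re-done under lock */

        /* swapon(): reject a file already unlinked by the time we lock it */
        inode_lock(inode);
        if (d_unlinked(file->f_path.dentry))
                error = -ENOENT;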

    Link: https://lkml.kernel.org/r/e17b91ad-a578-9a15-5e3-4989e0f999b5@google.com
    Fixes: f9454548e17c ("don't unlink an active swapfile")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

02 Jul, 2021

1 commit

  • Fix some spelling mistakes in comments:
    each having differents usage ==> each has a different usage
    statments ==> statements
    adresses ==> addresses
    aggresive ==> aggressive
    datas ==> data
    posion ==> poison
    higer ==> higher
    precisly ==> precisely
    wont ==> won't
    We moves tha ==> We move the
    endianess ==> endianness

    Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com
    Signed-off-by: Zhen Lei
    Reviewed-by: Souptick Joarder
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhen Lei
     

30 Jun, 2021

3 commits

  • Before commit c10d38cc8d3e ("mm, swap: bounds check swap_info array
    accesses to avoid NULL derefs"), the typical code to reference the
    swap_info[] is as follows,

    type = swp_type(swp_entry);
    if (type >= nr_swapfiles)
            /* handle invalid swp_entry */;
    p = swap_info[type];
    /* access fields of *p.  OOPS!  p may be NULL! */

    Because the ordering isn't guaranteed, it's possible that swap_info[type]
    is read before "nr_swapfiles", and that may result in a NULL pointer
    dereference.

    So after commit c10d38cc8d3e, the code becomes,

    struct swap_info_struct *swap_type_to_swap_info(int type)
    {
            if (type >= READ_ONCE(nr_swapfiles))
                    return NULL;
            smp_rmb();
            return READ_ONCE(swap_info[type]);
    }

    /* users */
    type = swp_type(swp_entry);
    p = swap_type_to_swap_info(type);
    if (!p)
            /* handle invalid swp_entry */;
    /* dereference p */

    Here the value of swap_info[type] (that is, "p") is checked to be
    non-NULL before being dereferenced, so the NULL dereference becomes
    impossible even if "nr_swapfiles" is read after swap_info[type].
    Therefore, the "smp_rmb()" becomes unnecessary.

    And we don't even need to read "nr_swapfiles" here, because the non-NULL
    check on "p" is sufficient; we just need to make sure we never access
    beyond the bounds of the array. With this change, nr_swapfiles is only
    accessed with swap_lock held, except in swapcache_free_entries(), where
    the absolute correctness of the value isn't needed, as described in the
    comments.

    We still need to guarantee that swap_info[type] is read before being
    dereferenced. That is satisfied by the data dependency ordering enforced
    by READ_ONCE(swap_info[type]), which needs to be paired with proper write
    barriers. So smp_store_release() is used in alloc_swap_info() to
    guarantee that the fields of *swap_info[type] are initialized before
    swap_info[type] itself is written. Note that the fields of
    *swap_info[type] are first zero-initialized via kvzalloc(). The
    assignment and dereferencing of swap_info[type] thus work like
    rcu_assign_pointer() and rcu_dereference().
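
    A minimal sketch of the resulting publish/consume pairing (simplified;
    locking and error handling omitted):

        /* writer side, in alloc_swap_info(): initialize first, publish last */
        p = kvzalloc(sizeof(*p), GFP_KERNEL);           /* fields zeroed */
        /* ... further initialization of *p ... */
        smp_store_release(&swap_info[type], p);         /* pairs with READ_ONCE() */

        /* reader side, in swap_type_to_swap_info(): the address dependency
         * orders the pointer load before the later field loads */
        if (type >= MAX_SWAPFILES)
                return NULL;
        return READ_ONCE(swap_info[type]);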

    Link: https://lkml.kernel.org/r/20210520073301.1676294-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Daniel Jordan
    Cc: Dan Carpenter
    Cc: Andrea Parri
    Cc: Peter Zijlstra (Intel)
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Omar Sandoval
    Cc: Paul McKenney
    Cc: Tejun Heo
    Cc: Will Deacon
    Cc: Miaohe Lin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Patch series "Cleanups for swap", v2.

    This series contains just cleanups to remove some unused variables, delete
    meaningless forward declarations and so on. More details can be found in
    the respective changelogs.

    This patch (of 4):

    We should move get_swap_page_of_type() under CONFIG_HIBERNATION since the
    only caller of this function is now the suspend routine.

    [linmiaohe@huawei.com: move scan_swap_map() under CONFIG_HIBERNATION]
    Link: https://lkml.kernel.org/r/20210521070855.2015094-1-linmiaohe@huawei.com
    [linmiaohe@huawei.com: fold scan_swap_map() into the only caller get_swap_page_of_type()]
    Link: https://lkml.kernel.org/r/20210527120328.3935132-1-linmiaohe@huawei.com

    Link: https://lkml.kernel.org/r/20210520134022.1370406-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20210520134022.1370406-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Hugh Dickins
    Cc: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Patch series "close various race windows for swap", v6.

    When I was investigating the swap code, I found some possible race
    windows. This series aims to fix all these races. But using the current
    get/put_swap_device() to guard against concurrent swapoff for
    swap_readpage() looks terrible, because swap_readpage() may take a really
    long time. To reduce the performance overhead on the hot path as much as
    possible, it appears we can use percpu_ref to close this race window (as
    suggested by Huang, Ying). Patch 1 adds percpu_ref support for swap, and
    most of the remaining patches use it to close various race windows. More
    details can be found in the respective changelogs.

    This patch (of 4):

    Using the current get/put_swap_device() to guard against concurrent
    swapoff for some swap ops, e.g. swap_readpage(), looks terrible because
    they might take a really long time. This patch adds percpu_ref support to
    serialize against concurrent swapoff (as suggested by Huang, Ying). Also
    remove the SWP_VALID flag, because it was only used together with the RCU
    solution.
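
    A minimal sketch of the percpu_ref-based guard (the si->users field name
    follows this series; exact signatures in the final patch may differ):

        struct swap_info_struct *get_swap_device(swp_entry_t entry)
        {
                struct swap_info_struct *si = swp_swap_info(entry);

                if (!si || !percpu_ref_tryget_live(&si->users))
                        return NULL;    /* swapoff already in progress */
                return si;
        }

        static inline void put_swap_device(struct swap_info_struct *si)
        {
                percpu_ref_put(&si->users);
        }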

    Link: https://lkml.kernel.org/r/20210426123316.806267-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20210426123316.806267-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: "Huang, Ying"
    Cc: Alex Shi
    Cc: David Hildenbrand
    Cc: Dennis Zhou
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Tim Chen
    Cc: Wei Yang
    Cc: Yang Shi
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

17 Jun, 2021

1 commit

  • I found by pure code review that pte_same_as_swp() in unuse_vma() didn't
    take the uffd-wp bit into account when comparing ptes.
    pte_same_as_swp() returning a false negative could cause failure to
    swapoff swap ptes that were wr-protected by userfaultfd.
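
    A sketch of the shape of the fix: mask off the swap pte software bits
    before comparing (the helper name and exact bits here are my reading of
    the upstream fix, not necessarily verbatim):

        static inline pte_t pte_swp_clear_flags(pte_t pte)
        {
                if (pte_swp_soft_dirty(pte))
                        pte = pte_swp_clear_soft_dirty(pte);
                if (pte_swp_uffd_wp(pte))
                        pte = pte_swp_clear_uffd_wp(pte);
                return pte;
        }

        /* compare, ignoring software bits such as soft-dirty and uffd-wp */
        #define pte_same_as_swp(a, b)  pte_same(pte_swp_clear_flags(a), (b))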

    Link: https://lkml.kernel.org/r/20210603180546.9083-1-peterx@redhat.com
    Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration")
    Signed-off-by: Peter Xu
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: [5.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     

06 May, 2021

1 commit

  • Various coding style tweaks to various files under mm/

    [daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn
    [daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks]
    Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn

    Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn
    Signed-off-by: Zhiyuan Dai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhiyuan Dai
     

03 Mar, 2021

1 commit

  • We're not factoring in the start of the file for where to write and
    read the swapfile, which leads to very unfortunate side effects of
    writing where we should not be...

    Fixes: 48d15436fde6 ("mm: remove get_swap_bio")
    Signed-off-by: Jens Axboe

    Jens Axboe
     


22 Feb, 2021

1 commit

  • Pull arm64 updates from Will Deacon:

    - vDSO build improvements including support for building with BSD.

    - Cleanup to the AMU support code and initialisation rework to support
    cpufreq drivers built as modules.

    - Removal of synthetic frame record from exception stack when entering
    the kernel from EL0.

    - Add support for the TRNG firmware call introduced by Arm spec
    DEN0098.

    - Cleanup and refactoring across the board.

    - Avoid calling arch_get_random_seed_long() from
    add_interrupt_randomness()

    - Perf and PMU updates including support for Cortex-A78 and the v8.3
    SPE extensions.

    - Significant steps along the road to leaving the MMU enabled during
    kexec relocation.

    - Faultaround changes to initialise prefaulted PTEs as 'old' when
    hardware access-flag updates are supported, which drastically
    improves vmscan performance.

    - CPU errata updates for Cortex-A76 (#1463225) and Cortex-A55
    (#1024718)

    - Preparatory work for yielding the vector unit at a finer granularity
    in the crypto code, which in turn will one day allow us to defer
    softirq processing when it is in use.

    - Support for overriding CPU ID register fields on the command-line.

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (85 commits)
    drivers/perf: Replace spin_lock_irqsave to spin_lock
    mm: filemap: Fix microblaze build failure with 'mmu_defconfig'
    arm64: Make CPU_BIG_ENDIAN depend on ld.bfd or ld.lld 13.0.0+
    arm64: cpufeatures: Allow disabling of Pointer Auth from the command-line
    arm64: Defer enabling pointer authentication on boot core
    arm64: cpufeatures: Allow disabling of BTI from the command-line
    arm64: Move "nokaslr" over to the early cpufeature infrastructure
    KVM: arm64: Document HVC_VHE_RESTART stub hypercall
    arm64: Make kvm-arm.mode={nvhe, protected} an alias of id_aa64mmfr1.vh=0
    arm64: Add an aliasing facility for the idreg override
    arm64: Honor VHE being disabled from the command-line
    arm64: Allow ID_AA64MMFR1_EL1.VH to be overridden from the command line
    arm64: cpufeature: Add an early command-line cpufeature override facility
    arm64: Extract early FDT mapping from kaslr_early_init()
    arm64: cpufeature: Use IDreg override in __read_sysreg_by_encoding()
    arm64: cpufeature: Add global feature override facility
    arm64: Move SCTLR_EL1 initialisation to EL-agnostic code
    arm64: Simplify init_el2_state to be non-VHE only
    arm64: Move VHE-specific SPE setup to mutate_to_vhe()
    arm64: Drop early setting of MDSCR_EL2.TPMS
    ...

    Linus Torvalds
     

10 Feb, 2021

1 commit

  • Open code the parts of map_swap_entry() that were actually used by
    swapdev_block(), and remove the now-unused map_swap_entry() function.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Rafael J. Wysocki
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

28 Jan, 2021

2 commits

  • Current tree spews this on compile:

    mm/swapfile.c:2290:17: warning: ‘map_swap_entry’ defined but not used [-Wunused-function]
    2290 | static sector_t map_swap_entry(swp_entry_t entry, struct block_device **bdev)
    | ^~~~~~~~~~~~~~

    if !CONFIG_HIBERNATION, as we don't use the function unless we have that
    config option set.

    Fixes: 48d15436fde6 ("mm: remove get_swap_bio")
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Just reuse the block_device and sector from the swap_info structure, as
    already done by the SWP_SYNCHRONOUS_IO path. Also remove the checks for
    NULL returns from bio_alloc, as that can't happen for sleeping
    allocations.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Chaitanya Kulkarni
    Acked-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     


16 Dec, 2020

4 commits

  • The scenario in which "Free swap = -4kB" happens on my system: several
    get_swap_pages() calls race with each other while show_swap_cache_info()
    runs simultaneously. There is no need to add a lock in
    get_swap_page_of_type() as we remove the "Presub/PosAdd" pattern here.

    ProcessA                      ProcessB                      ProcessC
    ngoals = 1                    ngoals = 1
    avail = nr_swap_pages(1)      avail = nr_swap_pages(1)
    nr_swap_pages(1) -= ngoals
                                  nr_swap_pages(0) -= ngoals
                                                                nr_swap_pages = -1

    Link: https://lkml.kernel.org/r/1607050340-4535-1-git-send-email-zhaoyang.huang@unisoc.com
    Signed-off-by: Zhaoyang Huang
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhaoyang Huang
     
  • We could use the helper memset() to fill the swap_map with SWAP_HAS_CACHE
    instead of a direct loop here, to simplify the code. This also lets us
    remove the local variables i and map.
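
    Roughly, a sketch of the change (offset and length shown are
    illustrative):

        /* before: open-coded loop over the cluster */
        for (i = 0; i < SWAPFILE_CLUSTER; i++)
                map[i] = SWAP_HAS_CACHE;

        /* after: one memset over the same range */
        memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);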

    Link: https://lkml.kernel.org/r/20200921122224.7139-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • When the code reaches the out label, p must be NULL, so all the label
    really does is perform a redundant if check and return err. Remove this
    unnecessary out label, since it does not handle any resource cleanup.

    Link: https://lkml.kernel.org/r/20201009130337.29698-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Commit 570a335b8e22 ("swap_info: swap count continuations") introduced the
    func add_swap_count_continuation() but forgot to use the helper function
    swap_count() introduced by commit 355cfa73ddff ("mm: modify swap_map and
    add SWAP_HAS_CACHE flag").
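
    For reference, the swap_count() helper in mm/swapfile.c simply masks off
    the cache flag (sketched here):

        static inline unsigned char swap_count(unsigned char ent)
        {
                return ent & ~SWAP_HAS_CACHE;   /* may include COUNT_CONTINUED */
        }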

    Link: https://lkml.kernel.org/r/20201009134306.18033-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

07 Dec, 2020

1 commit

  • We can't call kvfree() with a spin lock held, so defer it. Fixes a
    might_sleep() runtime warning.

    Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc:
    Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
    Signed-off-by: Linus Torvalds

    Qian Cai
     

14 Oct, 2020

5 commits

  • If we failed to drain the inode, we would forget to free the swap address
    space allocated by init_swap_address_space() above.

    Fixes: dc617f29dbe5 ("vfs: don't allow writes to swap files")
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Darrick J. Wong
    Link: https://lkml.kernel.org/r/20200930101803.53884-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • It's unnecessary to goto the out label when the out label is just below.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200930102549.1885-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • We no longer initially add anon pages to the active lruvec, after commit
    b518154e59aa ("mm/vmscan: protect the workingset on anonymous LRU").
    Remove activate_page() from unuse_pte(), which seems to have been missed
    by that commit. And make the function static while we are at it.

    Before that commit, we called lru_cache_add_active_or_unevictable() to
    add new ksm pages to the active lruvec. Therefore, activate_page() wasn't
    necessary for them in the first place.

    Signed-off-by: Yu Zhao
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Cc: Alexander Duyck
    Cc: Huang Ying
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Qian Cai
    Cc: Mel Gorman
    Cc: Nicholas Piggin
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200818184704.3625199-1-yuzhao@google.com
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
  • SWP_FS is used to make swap_{read,write}page() go through the filesystem,
    and it's only used for swap files over NFS for now. Otherwise IO is
    submitted directly to the blockdev according to the swapfile extents
    reported by filesystems in advance.

    As Matthew pointed out [1], SWP_FS naming is somewhat confusing, so let's
    rename to SWP_FS_OPS.

    [1] https://lore.kernel.org/r/20200820113448.GM17456@casper.infradead.org

    Suggested-by: Matthew Wilcox
    Signed-off-by: Gao Xiang
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200822113019.11319-1-hsiangkao@redhat.com
    Signed-off-by: Linus Torvalds

    Gao Xiang
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Pull arm64 updates from Will Deacon:
    "There's quite a lot of code here, but much of it is due to the
    addition of a new PMU driver as well as some arm64-specific selftests
    which is an area where we've traditionally been lagging a bit.

    In terms of exciting features, this includes support for the Memory
    Tagging Extension which narrowly missed 5.9, hopefully allowing
    userspace to run with use-after-free detection in production on CPUs
    that support it. Work is ongoing to integrate the feature with KASAN
    for 5.11.

    Another change that I'm excited about (assuming they get the hardware
    right) is preparing the ASID allocator for sharing the CPU page-table
    with the SMMU. Those changes will also come in via Joerg with the
    IOMMU pull.

    We do stray outside of our usual directories in a few places, mostly
    due to core changes required by MTE. Although much of this has been
    Acked, there were a couple of places where we unfortunately didn't get
    any review feedback.

    Other than that, we ran into a handful of minor conflicts in -next,
    but nothing that should pose any issues.

    Summary:

    - Userspace support for the Memory Tagging Extension introduced by
    Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11.

    - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
    switching.

    - Fix and subsequent rewrite of our Spectre mitigations, including
    the addition of support for PR_SPEC_DISABLE_NOEXEC.

    - Support for the Armv8.3 Pointer Authentication enhancements.

    - Support for ASID pinning, which is required when sharing
    page-tables with the SMMU.

    - MM updates, including treating flush_tlb_fix_spurious_fault() as a
    no-op.

    - Perf/PMU driver updates, including addition of the ARM CMN PMU
    driver and also support to handle CPU PMU IRQs as NMIs.

    - Allow prefetchable PCI BARs to be exposed to userspace using normal
    non-cacheable mappings.

    - Implementation of ARCH_STACKWALK for unwinding.

    - Improve reporting of unexpected kernel traps due to BPF JIT
    failure.

    - Improve robustness of user-visible HWCAP strings and their
    corresponding numerical constants.

    - Removal of TEXT_OFFSET.

    - Removal of some unused functions, parameters and prototypes.

    - Removal of MPIDR-based topology detection in favour of firmware
    description.

    - Cleanups to handling of SVE and FPSIMD register state in
    preparation for potential future optimisation of handling across
    syscalls.

    - Cleanups to the SDEI driver in preparation for support in KVM.

    - Miscellaneous cleanups and refactoring work"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
    Revert "arm64: initialize per-cpu offsets earlier"
    arm64: random: Remove no longer needed prototypes
    arm64: initialize per-cpu offsets earlier
    kselftest/arm64: Check mte tagged user address in kernel
    kselftest/arm64: Verify KSM page merge for MTE pages
    kselftest/arm64: Verify all different mmap MTE options
    kselftest/arm64: Check forked child mte memory accessibility
    kselftest/arm64: Verify mte tag inclusion via prctl
    kselftest/arm64: Add utilities and a test to validate mte memory
    perf: arm-cmn: Fix conversion specifiers for node type
    perf: arm-cmn: Fix unsigned comparison to less than zero
    arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD
    arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op
    arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option
    arm64: Pull in task_stack_page() to Spectre-v4 mitigation code
    KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled
    arm64: Get rid of arm64_ssbd_state
    KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state()
    KVM: arm64: Get rid of kvm_arm_have_ssbd()
    KVM: arm64: Simplify handling of ARCH_WORKAROUND_2
    ...

    Linus Torvalds
     

27 Sep, 2020

1 commit

  • SWP_FS is used to make swap_{read,write}page() go through the
    filesystem, and it's only used for swap files over NFS. So !SWP_FS
    means non-NFS for now; it could be either file-backed or device-backed.
    Something similar goes for the legacy SWP_FILE.

    So in order to achieve the goal of the original patch, SWP_BLKDEV should
    be used instead.

    FS corruption can be observed with SSD device + XFS + fragmented
    swapfile due to CONFIG_THP_SWAP=y.

    I reproduced the issue with the following details:

    Environment:

    QEMU + upstream kernel + buildroot + NVMe (2 GB)

    Kernel config:

    CONFIG_BLK_DEV_NVME=y
    CONFIG_THP_SWAP=y

    Some reproducible steps:

    mkfs.xfs -f /dev/nvme0n1
    mkdir /tmp/mnt
    mount /dev/nvme0n1 /tmp/mnt
    bs="32k"
    sz="1024m" # doesn't matter too much, I also tried 16m
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
    xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw

    mkswap /tmp/mnt/sw
    swapon /tmp/mnt/sw

    stress --vm 2 --vm-bytes 600M # doesn't matter too much as well

    Symptoms:
    - FS corruption (e.g. checksum failure)
    - memory corruption at: 0xd2808010
    - segfault

    Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
    Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
    Signed-off-by: Gao Xiang
    Signed-off-by: Andrew Morton
    Reviewed-by: "Huang, Ying"
    Reviewed-by: Yang Shi
    Acked-by: Rafael Aquini
    Cc: Matthew Wilcox
    Cc: Carlos Maiolino
    Cc: Eric Sandeen
    Cc: Dave Chinner
    Cc:
    Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com
    Signed-off-by: Linus Torvalds

    Gao Xiang
     

25 Sep, 2020

2 commits

  • BDI_CAP_STABLE_WRITES is one of the few bits of information in the
    backing_dev_info shared between the block drivers and the writeback code.
    To help untangle the dependency, replace it with a queue flag and a
    superblock flag derived from it. This also helps with the case of e.g.
    a file system requiring stable writes due to its own checksumming, but
    not forcing it on other users of the block device like the swap code.

    One downside is that we can't support the stable_pages_required bdi
    attribute in sysfs anymore. It is replaced with a queue attribute which
    is also writable for easier testing.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
    decide if ->rw_page can be used on a block device. Just check for the
    method instead. The only complication is that zram needs a second set
    of block_device_operations, as it can switch between modes that actually
    support ->rw_page and those that don't.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Sep, 2020

2 commits

  • swap_type_of is used for two entirely different purposes:

    (1) check what swap type a given device/offset corresponds to
    (2) find the first available swap device that can be written to

    Mixing both in a single function creates an unreadable mess. Create two
    separate functions instead, and switch both to pass a dev_t instead of
    a struct block_device to further simplify the code.
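
    A hedged sketch of the resulting interfaces (prototypes as I recall them
    from this series; the exact declarations may differ):

        /* (1) which swap type does this device/offset correspond to? */
        int swap_type_of(dev_t device, sector_t offset);

        /* (2) first available swap device that can be written to */
        int find_first_swap(dev_t *device);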

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Use blkdev_get_by_dev instead of bdgrab + blkdev_get.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Sep, 2020

1 commit

  • Arm's Memory Tagging Extension (MTE) adds some metadata (tags) to every
    physical page; when swapping pages out to disk it is necessary to save
    these tags, and later restore them when reading the pages back.

    Add some hooks, along with dummy implementations, to enable the arch
    code to handle this.

    Three new hooks are added to the swap code:
    * arch_prepare_to_swap()
    * arch_swap_invalidate_page() / arch_swap_invalidate_area()
    One new hook is added to shmem:
    * arch_swap_restore()
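
    The dummy implementations are no-ops, so non-MTE architectures are
    unaffected; a minimal sketch of the pattern (the guard macro names are
    my assumption, not necessarily those in the patch):

        #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
        static inline int arch_prepare_to_swap(struct page *page)
        {
                return 0;       /* nothing to save */
        }
        #endif

        #ifndef __HAVE_ARCH_SWAP_INVALIDATE
        static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
        {
        }
        #endif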

    Signed-off-by: Steven Price
    [catalin.marinas@arm.com: add unlock_page() on the error path]
    [catalin.marinas@arm.com: dropped the _tags suffix]
    Signed-off-by: Catalin Marinas
    Acked-by: Andrew Morton

    Steven Price
     

15 Aug, 2020

2 commits

  • swap_info_struct si.highest_bit, si.swap_map[offset] and si.flags could
    each be accessed concurrently, as noticed by KCSAN,

    === si.highest_bit ===

    write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
    swap_range_alloc+0x81/0x130
    swap_range_alloc at mm/swapfile.c:681
    scan_swap_map_slots+0x371/0xb90
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
    scan_swap_map_slots+0x4a6/0xb90
    scan_swap_map_slots at mm/swapfile.c:892
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 70 PID: 6672 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #3
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    === si.swap_map[offset] ===

    write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
    __swap_entry_free_locked+0x8c/0x100
    __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
    __swap_entry_free.constprop.20+0x69/0xb0
    free_swap_and_cache+0x53/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
    _swap_info_get+0x81/0xa0
    _swap_info_get at mm/swapfile.c:1140
    free_swap_and_cache+0x40/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    === si.flags ===

    write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
    scan_swap_map_slots+0x6fe/0xb50
    scan_swap_map_slots at mm/swapfile.c:887
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0x377/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
    _swap_info_get+0x41/0xa0
    __swap_info_get at mm/swapfile.c:1114
    put_swap_page+0x84/0x490
    __remove_mapping+0x384/0x5f0
    shrink_page_list+0xff1/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    The writes are under si->lock but the reads are not. For si.highest_bit
    and si.swap_map[offset], the data races could trigger logic bugs, so fix
    them by using WRITE_ONCE() for the writes and READ_ONCE() for the reads,
    except for those isolated reads that only compare against zero, where a
    data race would cause no harm. Annotate those as intentional data races
    using the data_race() macro.

    For si.flags, the readers are only interested in a single bit, so a data
    race there causes no issue.
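
    A minimal sketch of the three annotation styles used (illustrative
    fragments, not the exact upstream hunks):

        /* writer, under si->lock */
        WRITE_ONCE(si->highest_bit, offset - 1);

        /* lockless reader that pairs with the write above */
        if (offset > READ_ONCE(si->highest_bit))
                goto no_page;           /* hypothetical label */

        /* intentional, harmless race: value only compared against zero */
        if (data_race(si->swap_map[offset]))
                continue;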

    [cai@lca.pw: add a missing annotation for si->flags in memory.c]
    Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

2 commits

  • Workingset detection for anonymous pages will be implemented in the
    following patch, and it requires storing the shadow entries in the
    swapcache. This patch implements an infrastructure to store the shadow
    entry in the swapcache.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • In the current implementation, a newly created or swapped-in anonymous
    page starts on the active list. Growing the active list results in
    rebalancing the active/inactive lists, so old pages on the active list
    are demoted to the inactive list. Hence, pages on the active list aren't
    protected at all.

    Following is an example of this situation.

    Assume there are 50 hot pages on the active list. Numbers denote the
    number of pages on the active/inactive list (active | inactive).

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(uo) | 50(h)

    3. workload: another 50 newly created (used-once) pages
    50(uo) | 50(uo), swap-out 50(h)

    This patch tries to fix this issue. As with the file LRU, newly created
    or swapped-in anonymous pages will be inserted into the inactive list.
    They are promoted to the active list if enough references happen. This
    simple modification changes the above example as follows.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(h) | 50(uo)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(uo)

    As you can see, hot pages on active list would be protected.

    Note that this implementation has a drawback: a page cannot be promoted
    and will be swapped out if its re-access interval is greater than the
    size of the inactive list but less than the size of the total (active +
    inactive) list. To solve this potential issue, the following patch will
    apply workingset detection similar to the one already applied to the
    file LRU.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     


10 Jun, 2020

2 commits

  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures, and sometimes more than once. For
    instance, we have 31 definitions of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point in explicitly including
    <asm/pgtable.h> in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport