17 Oct, 2020

1 commit

  • The page cache needs to know whether the filesystem supports THPs so that
    it doesn't send THPs to filesystems which can't handle them. Dave Chinner
    points out that getting from the page mapping to the filesystem type is
    too many steps (mapping->host->i_sb->s_type->fs_flags) so cache that
    information in the address space flags.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Hugh Dickins
    Cc: Song Liu
    Cc: Rik van Riel
    Cc: "Kirill A . Shutemov"
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Link: https://lkml.kernel.org/r/20200916032717.22917-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
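
    The change above boils down to caching one bit per address_space. A
    minimal kernel-style sketch of that idea, with the flag name, bit number
    and helper chosen here purely for illustration:

        /* Illustrative sketch; names and bit number are assumptions. */
        #define AS_THP_SUPPORT 6        /* hypothetical bit in mapping->flags */

        static inline bool mapping_thp_support(struct address_space *mapping)
        {
                return test_bit(AS_THP_SUPPORT, &mapping->flags);
        }

    The page cache can then test the mapping directly instead of walking
    mapping->host->i_sb->s_type->fs_flags on every allocation decision.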
     

14 Oct, 2020

1 commit

  • Convert shmem_getpage_gfp() (the only remaining caller of
    find_lock_entry()) to cope with a head page being returned instead of
    the subpage for the index.

    [willy@infradead.org: fix BUG()s]
    Link: https://lore.kernel.org/linux-mm/20200912032042.GA6583@casper.infradead.org/

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Johannes Weiner
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-8-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
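
    Roughly, "cope with a head page" means that the lookup may now hand back
    the head of a compound page, and the caller has to step to the subpage
    that corresponds to the requested index. A hedged kernel-style sketch,
    not the exact shmem hunk:

        page = find_lock_entry(mapping, index);    /* may be a THP head */
        if (page)
                page += index - page->index;       /* head -> subpage for 'index' */
                                                   /* (value-entry handling omitted) */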
     

13 Oct, 2020

1 commit

  • Pull arm64 updates from Will Deacon:
    "There's quite a lot of code here, but much of it is due to the
    addition of a new PMU driver as well as some arm64-specific selftests
    which is an area where we've traditionally been lagging a bit.

    In terms of exciting features, this includes support for the Memory
    Tagging Extension which narrowly missed 5.9, hopefully allowing
    userspace to run with use-after-free detection in production on CPUs
    that support it. Work is ongoing to integrate the feature with KASAN
    for 5.11.

    Another change that I'm excited about (assuming they get the hardware
    right) is preparing the ASID allocator for sharing the CPU page-table
    with the SMMU. Those changes will also come in via Joerg with the
    IOMMU pull.

    We do stray outside of our usual directories in a few places, mostly
    due to core changes required by MTE. Although much of this has been
    Acked, there were a couple of places where we unfortunately didn't get
    any review feedback.

    Other than that, we ran into a handful of minor conflicts in -next,
    but nothing that should pose any issues.

    Summary:

    - Userspace support for the Memory Tagging Extension introduced by
    Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11.

    - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
    switching.

    - Fix and subsequent rewrite of our Spectre mitigations, including
    the addition of support for PR_SPEC_DISABLE_NOEXEC.

    - Support for the Armv8.3 Pointer Authentication enhancements.

    - Support for ASID pinning, which is required when sharing
    page-tables with the SMMU.

    - MM updates, including treating flush_tlb_fix_spurious_fault() as a
    no-op.

    - Perf/PMU driver updates, including addition of the ARM CMN PMU
    driver and also support to handle CPU PMU IRQs as NMIs.

    - Allow prefetchable PCI BARs to be exposed to userspace using normal
    non-cacheable mappings.

    - Implementation of ARCH_STACKWALK for unwinding.

    - Improve reporting of unexpected kernel traps due to BPF JIT
    failure.

    - Improve robustness of user-visible HWCAP strings and their
    corresponding numerical constants.

    - Removal of TEXT_OFFSET.

    - Removal of some unused functions, parameters and prototypes.

    - Removal of MPIDR-based topology detection in favour of firmware
    description.

    - Cleanups to handling of SVE and FPSIMD register state in
    preparation for potential future optimisation of handling across
    syscalls.

    - Cleanups to the SDEI driver in preparation for support in KVM.

    - Miscellaneous cleanups and refactoring work"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
    Revert "arm64: initialize per-cpu offsets earlier"
    arm64: random: Remove no longer needed prototypes
    arm64: initialize per-cpu offsets earlier
    kselftest/arm64: Check mte tagged user address in kernel
    kselftest/arm64: Verify KSM page merge for MTE pages
    kselftest/arm64: Verify all different mmap MTE options
    kselftest/arm64: Check forked child mte memory accessibility
    kselftest/arm64: Verify mte tag inclusion via prctl
    kselftest/arm64: Add utilities and a test to validate mte memory
    perf: arm-cmn: Fix conversion specifiers for node type
    perf: arm-cmn: Fix unsigned comparison to less than zero
    arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD
    arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op
    arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option
    arm64: Pull in task_stack_page() to Spectre-v4 mitigation code
    KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled
    arm64: Get rid of arm64_ssbd_state
    KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state()
    KVM: arm64: Get rid of kvm_arm_have_ssbd()
    KVM: arm64: Simplify handling of ARCH_WORKAROUND_2
    ...

    Linus Torvalds
     

20 Sep, 2020

1 commit

  • Commit e809d5f0b5c9 ("tmpfs: per-superblock i_ino support") made changes
    to shmem_reserve_inode() in mm/shmem.c, however the original test for
    (sbinfo->max_inodes) got dropped. This causes mounting tmpfs with option
    nr_inodes=0 to fail:

    # mount -ttmpfs -onr_inodes=0 none /ext0
    mount: /ext0: mount(2) system call failed: Cannot allocate memory.

    This patch restores the nr_inodes=0 functionality.

    Fixes: e809d5f0b5c9 ("tmpfs: per-superblock i_ino support")
    Signed-off-by: Byron Stanoszek
    Signed-off-by: Andrew Morton
    Acked-by: Hugh Dickins
    Acked-by: Chris Down
    Link: https://lkml.kernel.org/r/20200902035715.16414-1-gandalf@winds.org
    Signed-off-by: Linus Torvalds

    Byron Stanoszek
     

04 Sep, 2020

2 commits

  • Arm's Memory Tagging Extension (MTE) adds some metadata (tags) to
    every physical page. When swapping pages out to disk it is necessary to
    save these tags, and later restore them when reading the pages back.

    Add some hooks along with dummy implementations to enable the
    arch code to handle this.

    Three new hooks are added to the swap code:
    * arch_prepare_to_swap() and
    * arch_swap_invalidate_page() / arch_swap_invalidate_area().
    One new hook is added to shmem:
    * arch_swap_restore()

    Signed-off-by: Steven Price
    [catalin.marinas@arm.com: add unlock_page() on the error path]
    [catalin.marinas@arm.com: dropped the _tags suffix]
    Signed-off-by: Catalin Marinas
    Acked-by: Andrew Morton

    Steven Price
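
    In practice the "dummy implementations" are no-op defaults that untagged
    architectures keep, while arm64 overrides them; a sketch, with the
    signatures shown here being an assumption based on the description above:

        /* Illustrative no-op defaults for architectures without memory tags. */
        static inline int arch_prepare_to_swap(struct page *page)
        {
                return 0;                          /* nothing to save */
        }

        static inline void arch_swap_invalidate_page(int type, pgoff_t offset) { }
        static inline void arch_swap_invalidate_area(int type) { }

        static inline void arch_swap_restore(swp_entry_t entry, struct page *page) { }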
     
  • Since arm64 memory (allocation) tags can only be stored in RAM, mapping
    files with PROT_MTE is not allowed by default. RAM-based files like
    those in a tmpfs mount or memfd_create() can support memory tagging, so
    update the vm_flags accordingly in shmem_mmap().

    Signed-off-by: Catalin Marinas
    Acked-by: Andrew Morton

    Catalin Marinas
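
    A small userspace sketch of what this enables: mapping a RAM-backed memfd
    (served by shmem/tmpfs) with PROT_MTE. It assumes an arm64 kernel and CPU
    with MTE; the fallback PROT_MTE value below is the arm64 UAPI one and the
    mmap() is expected to fail elsewhere.

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #ifndef PROT_MTE
        #define PROT_MTE 0x20                  /* arm64-specific protection flag */
        #endif

        int main(void)
        {
                int fd = memfd_create("mte-demo", 0);
                if (fd < 0 || ftruncate(fd, 4096) < 0) {
                        perror("memfd_create/ftruncate");
                        return 1;
                }

                /* PROT_MTE is only honoured for RAM-backed mappings like this one. */
                void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_MTE,
                               MAP_SHARED, fd, 0);
                if (p == MAP_FAILED) {
                        perror("mmap(PROT_MTE)");  /* expected on non-MTE systems */
                        return 1;
                }

                memset(p, 0, 4096);
                puts("PROT_MTE mapping of a memfd succeeded");
                return 0;
        }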
     

13 Aug, 2020

3 commits

  • …ernel/git/abelloni/linux") into android-mainline

    Steps on the way to 5.9-rc1.

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: Iceded779988ff472863b7e1c54e22a9fa6383a30

    Greg Kroah-Hartman
     
  • Drop the repeated word "the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-11-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Workingset detection for anonymous page will be implemented in the
    following patch and it requires to store the shadow entries into the
    swapcache. This patch implements an infrastructure to store the shadow
    entry in the swapcache.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

4 commits

  • …kernel/git/sre/linux-power-supply") into android-mainline

    Merges along the way to 5.9-rc1

    resolves conflicts in:
    Documentation/ABI/testing/sysfs-class-power
    drivers/power/supply/power_supply_sysfs.c
    fs/crypto/inline_crypt.c

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: Ia087834f54fb4e5269d68c3c404747ceed240701

    Greg Kroah-Hartman
     
  • The current split between do_mmap() and do_mmap_pgoff() was introduced in
    commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to
    do_mmap_pgoff()") to support MPX.

    The wrapper function do_mmap_pgoff() always passed 0 as the value of the
    vm_flags argument to do_mmap(). However, MPX support has subsequently
    been removed from the kernel and there were no more direct callers of
    do_mmap(); all calls were going via do_mmap_pgoff().

    Simplify the code by removing do_mmap_pgoff() and changing all callers to
    directly call do_mmap(), which now no longer takes a vm_flags argument.

    Signed-off-by: Peter Collingbourne
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
    Signed-off-by: Linus Torvalds

    Peter Collingbourne
     
  • The default is still set to inode32 for backwards compatibility, but
    system administrators can opt in to the new 64-bit inode numbers by
    either:

    1. Passing inode64 on the command line when mounting, or
    2. Configuring the kernel with CONFIG_TMPFS_INODE64=y

    The inode64 and inode32 names are used based on existing precedent from
    XFS.

    [hughd@google.com: Kconfig fixes]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008011928010.13320@eggly.anvils

    Signed-off-by: Chris Down
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Amir Goldstein
    Acked-by: Hugh Dickins
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Jeff Layton
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/8b23758d0c66b5e2263e08baf9c4b6a7565cbd8f.1594661218.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
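
    For reference, opting in at mount time is just a matter of passing the
    option string to mount(2); a hedged sketch (needs root and a kernel with
    the feature, and assumes /mnt/tmp already exists):

        #include <stdio.h>
        #include <sys/mount.h>
        #include <sys/stat.h>

        int main(void)
        {
                if (mount("tmpfs", "/mnt/tmp", "tmpfs", 0, "inode64") != 0) {
                        perror("mount -t tmpfs -o inode64");   /* fails on old kernels */
                        return 1;
                }

                struct stat st;
                if (stat("/mnt/tmp", &st) == 0)
                        printf("root inode number: %llu\n",
                               (unsigned long long)st.st_ino);
                return 0;
        }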
     
  • Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.

    In Facebook production we are seeing heavy i_ino wraparounds on tmpfs. On
    affected tiers, in excess of 10% of hosts show multiple files with
    different content and the same inode number, with some servers even having
    as many as 150 duplicated inode numbers with differing file content.

    This causes actual, tangible problems in production. For example, we have
    complaints from those working on remote caches that their application is
    reporting cache corruptions because it uses (device, inodenum) to
    establish the identity of a particular cache object, but because it's not
    unique any more, the application refuses to continue and reports cache
    corruption. Even worse, sometimes applications may not even detect the
    corruption but may continue anyway, causing phantom and hard to debug
    behaviour.

    In general, userspace applications expect that (device, inodenum) should
    be enough to uniquely point to one inode, which seems fair enough. One
    might also need to check the generation, but in this case:

    1. That's not currently exposed to userspace
    (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
    2. Even with generation, there shouldn't be two live inodes with the
    same inode number on one device.

    In order to mitigate this, we take a two-pronged approach:

    1. Moving inum generation from being global to per-sb for tmpfs. This
    itself allows some reduction in i_ino churn. This works on both 64-
    and 32-bit machines.
    2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
    64-bit ino_t only: we allow users to mount tmpfs with a new inode64
    option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.

    You can see how this compares to previous related patches which didn't
    implement this per-superblock:

    - https://patchwork.kernel.org/patch/11254001/
    - https://patchwork.kernel.org/patch/11023915/

    This patch (of 2):

    get_next_ino has a number of problems:

    - It uses and returns a uint, which is susceptible to overflow if a lot
    of volatile inodes that use get_next_ino are created.
    - It's global, with no specificity per-sb or even per-filesystem. This
    means it's not that difficult to cause inode number wraparounds on a
    single device, which can result in having multiple distinct inodes
    with the same inode number.

    This patch adds a per-superblock counter that mitigates the second case.
    This design also allows us to later have a specific i_ino size per-device,
    for example, allowing users to choose whether to use 32- or 64-bit inodes
    for each tmpfs mount. This is implemented in the next commit.

    For internal shmem mounts which may be less tolerant to spinlock delays,
    we implement a percpu batching scheme which only takes the stat_lock at
    each batch boundary.

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Hugh Dickins
    Cc: Amir Goldstein
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Jeff Layton
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
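
    The uniqueness assumption this series defends is visible from plain
    userspace: applications key objects on the (st_dev, st_ino) pair from
    stat(2), much like the remote-cache example above. A small illustration:

        #include <stdint.h>
        #include <stdio.h>
        #include <sys/stat.h>

        /* Identify a file by (device, inode), as the cache in the example does. */
        static int file_identity(const char *path, dev_t *dev, ino_t *ino)
        {
                struct stat st;
                if (stat(path, &st) != 0)
                        return -1;
                *dev = st.st_dev;
                *ino = st.st_ino;   /* only reliable if i_ino never wraps per device */
                return 0;
        }

        int main(int argc, char **argv)
        {
                dev_t dev;
                ino_t ino;
                if (argc > 1 && file_identity(argv[1], &dev, &ino) == 0)
                        printf("(%ju, %ju)\n", (uintmax_t)dev, (uintmax_t)ino);
                return 0;
        }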
     

25 Jul, 2020

2 commits

  • Partial 5.8-rc7 merge to make the final merge easier.

    Signed-off-by: Greg Kroah-Hartman
    Change-Id: I95f1b0a379e3810333300a70c5a93f449d945c54

    Greg Kroah-Hartman
     
  • After commit fdc85222d58e ("kernfs: kvmalloc xattr value instead of
    kmalloc"), simple xattr entry is allocated with kvmalloc() instead of
    kmalloc(), so we should release it with kvfree() instead of kfree().

    Fixes: fdc85222d58e ("kernfs: kvmalloc xattr value instead of kmalloc")
    Signed-off-by: Chengguang Xu
    Signed-off-by: Andrew Morton
    Acked-by: Hugh Dickins
    Acked-by: Tejun Heo
    Cc: Daniel Xu
    Cc: Chris Down
    Cc: Andreas Dilger
    Cc: Greg Kroah-Hartman
    Cc: Al Viro
    Cc: [5.7]
    Link: http://lkml.kernel.org/r/20200704051608.15043-1-cgxu519@mykernel.net
    Signed-off-by: Linus Torvalds

    Chengguang Xu
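
    The rule being restored is simply that kvmalloc()-obtained memory must be
    released with kvfree(), because the buffer may be vmalloc-backed rather
    than slab-backed. A kernel-style sketch of the pairing:

        value = kvmalloc(len, GFP_KERNEL);  /* may fall back to vmalloc for large len */
        if (!value)
                return -ENOMEM;
        ...
        kvfree(value);                      /* kfree() would be wrong for vmalloc memory */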
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definition of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <linux/mm.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Jun, 2020

7 commits

  • They're the same function, and for the purpose of all callers they are
    equivalent to lru_cache_add().

    [akpm@linux-foundation.org: fix it for local_lock changes]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Swapin faults were the last event to charge pages after they had already
    been put on the LRU list. Now that we charge directly on swapin, the
    lrucare portion of the charge code is unused.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Cc: Shakeel Butt
    Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Right now, users that are otherwise memory controlled can easily escape
    their containment and allocate significant amounts of memory that they're
    not being charged for. That's because swap readahead pages are not being
    charged until somebody actually faults them into their page table. This
    can be exploited with MADV_WILLNEED, which triggers arbitrary readahead
    allocations without charging the pages.

    There are additional problems with the delayed charging of swap pages:

    1. To implement refault/workingset detection for anonymous pages, we
    need to have a target LRU available at swapin time, but the LRU is not
    determinable until the page has been charged.

    2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
    stable when the page is isolated from the LRU; otherwise, the locks
    change under us. But swapcache gets charged after it's already on the
    LRU, and even if we cannot isolate it ourselves (since charging is not
    exactly optional).

    The previous patch ensured we always maintain cgroup ownership records for
    swap pages. This patch moves the swapcache charging point from the fault
    handler to swapin time to fix all of the above problems.

    v2: simplify swapin error checking (Joonsoo)

    [hughd@google.com: fix livelock in __read_swap_cache_async()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005212246080.8458@eggly.anvils
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Rafael Aquini
    Cc: Alex Shi
    Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
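
    The "arbitrary readahead allocations" mentioned above can be requested
    from userspace with nothing more than madvise(2) on a swapped-out range;
    a minimal sketch:

        #include <sys/mman.h>

        /*
         * Ask the kernel to read a (possibly swapped-out) range back in.
         * Before this series, the swap-readahead pages were not charged to
         * the caller's memcg until they were actually faulted in.
         */
        static int prefetch(void *addr, size_t len)
        {
                return madvise(addr, len, MADV_WILLNEED);
        }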
     
  • Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This
    divergence from the generic VM accounting means unnecessary code overhead,
    and creates a dependency for memcg that page->mapping is set up at the
    time of charging, so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM.
    The page is already locked in these places, so page->mem_cgroup is stable;
    we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure
    it's set up in time.

    Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
    NR_SHMEM accounting sites.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The try/commit/cancel protocol that memcg uses dates back to when pages
    used to be uncharged upon removal from the page cache, and thus couldn't
    be committed before the insertion had succeeded. Nowadays, pages are
    uncharged when they are physically freed; it doesn't matter whether the
    insertion was successful or not. For the page cache, the transaction
    dance has become unnecessary.

    Introduce a mem_cgroup_charge() function that simply charges a newly
    allocated page to a cgroup and sets up page->mem_cgroup in one single
    step. If the insertion fails, the caller doesn't have to do anything but
    free/put the page.

    Then switch the page cache over to this new API.

    Subsequent patches will also convert anon pages, but it needs a bit more
    prep work. Right now, memcg depends on page->mapping being already set up
    at the time of charging, so that it can maintain its own MEMCG_CACHE and
    MEMCG_RSS counters. For anon, page->mapping is set under the same pte
    lock under which the page is published, so a single charge point that can
    block doesn't work there just yet.

    The following prep patches will replace the private memcg counters with
    the generic vmstat counters, thus removing the page->mapping dependency,
    then complete the transition to the new single-point charge API and delete
    the old transactional scheme.

    v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
    v3: rebase on preceding shmem simplification patch

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
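
    From a caller's point of view the new model looks roughly like the sketch
    below; the mem_cgroup_charge() argument list here is an assumption based
    on the description above, not a quoted signature:

        /* Kernel-style sketch: charge once, right after allocation. */
        page = ...;                                /* newly allocated, not yet published */
        if (mem_cgroup_charge(page, mm, gfp_mask)) {
                put_page(page);                    /* no cancel/uncharge step needed */
                return -ENOMEM;
        }
        /* insert into the page cache; freeing the page later also uncharges it */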
     
  • Commit 215c02bc33bb ("tmpfs: fix shmem_getpage_gfp() VM_BUG_ON")
    recognized that hole punching can race with swapin and removed the
    BUG_ON() for a truncated entry from the swapin path.

    The patch also added a swapcache deletion to optimize this rare case:
    Since swapin has the page locked, and free_swap_and_cache() merely
    trylocks, this situation can leave the page stranded in swapcache.
    Usually, page reclaim picks up stale swapcache pages, and the race can
    happen at any other time when the page is locked. (The same happens for
    non-shmem swapin racing with page table zapping.) The thinking here was:
    we already observed the race and we have the page locked, we may as well
    do the cleanup instead of waiting for reclaim.

    However, this optimization complicates the next patch which moves the
    cgroup charging code around. As this is just a minor speedup for a race
    condition that is so rare that it required a fuzzer to trigger the
    original BUG_ON(), it's no longer worth the complications.

    Suggested-by: Hugh Dickins
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Acked-by: Hugh Dickins
    Cc: Alex Shi
    Cc: Joonsoo Kim
    Cc: Shakeel Butt
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200511181056.GA339505@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg charging API carries a boolean @compound parameter that tells
    whether the page we're dealing with is a hugepage.
    mem_cgroup_commit_charge() has another boolean @lrucare that indicates
    whether the page needs LRU locking or not while charging. The majority of
    callsites know those parameters at compile time, which results in a lot of
    naked "false, false" argument lists. This makes for cryptic code and is a
    breeding ground for subtle mistakes.

    Thankfully, the huge page state can be inferred from the page itself and
    doesn't need to be passed along. This is safe because charging completes
    before the page is published and somebody may split it.

    Simplify the callsites by removing @compound, and let memcg infer the
    state by using hpage_nr_pages() unconditionally. That function does
    PageTransHuge() to identify huge pages, which also helpfully asserts that
    nobody passes in tail pages by accident.

    The following patches will introduce a new charging API, best not to carry
    over unnecessary weight.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

22 Apr, 2020

3 commits

  • Syzbot reported the below lockdep splat:

    WARNING: possible irq lock inversion dependency detected
    5.6.0-rc7-syzkaller #0 Not tainted
    --------------------------------------------------------
    syz-executor.0/10317 just changed the state of lock:
    ffff888021d16568 (&(&info->lock)->rlock){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline]
    ffff888021d16568 (&(&info->lock)->rlock){+.+.}, at: shmem_mfill_atomic_pte+0x1012/0x21c0 mm/shmem.c:2407
    but this lock was taken by another, SOFTIRQ-safe lock in the past:
    (&(&xa->xa_lock)->rlock#5){..-.}

    and interrupts could create inverse lock ordering between them.

    other info that might help us debug this:
    Possible interrupt unsafe locking scenario:

           CPU0                          CPU1
           ----                          ----
      lock(&(&info->lock)->rlock);
                                         local_irq_disable();
                                         lock(&(&xa->xa_lock)->rlock#5);
                                         lock(&(&info->lock)->rlock);
      <Interrupt>
        lock(&(&xa->xa_lock)->rlock#5);

    *** DEADLOCK ***

    The full report is quite lengthy, please see:

    https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2004152007370.13597@eggly.anvils/T/#m813b412c5f78e25ca8c6c7734886ed4de43f241d

    This is because CPU 0 holds info->lock with IRQs enabled in the
    userfaultfd_copy path, while CPU 1 is splitting a THP and holds xa_lock
    and info->lock in IRQ-disabled context at the same time. If a softirq
    comes in to acquire xa_lock, the deadlock would be triggered.

    The fix is to acquire/release info->lock with *_irq version instead of
    plain spin_{lock,unlock} to make it softirq safe.

    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Reported-by: syzbot+e27980339d305f2dbfd9@syzkaller.appspotmail.com
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Tested-by: syzbot+e27980339d305f2dbfd9@syzkaller.appspotmail.com
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/1587061357-122619-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
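
    The fix itself is mechanical: take info->lock with the IRQ-disabling lock
    variants in shmem_mfill_atomic_pte(). Sketch of the before/after:

        /* before: a softirq can interrupt while info->lock is held */
        spin_lock(&info->lock);
        ...
        spin_unlock(&info->lock);

        /* after: softirq-safe */
        spin_lock_irq(&info->lock);
        ...
        spin_unlock_irq(&info->lock);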
     
  • Recent commit 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page()
    when punching hole") has allowed syzkaller to probe deeper, uncovering a
    long-standing lockdep issue between the irq-unsafe shmlock_user_lock,
    the irq-safe xa_lock on mapping->i_pages, and shmem inode's info->lock
    which nests inside xa_lock (or tree_lock) since 4.8's shmem_uncharge().

    user_shm_lock(), servicing SysV shmctl(SHM_LOCK), wants
    shmlock_user_lock while its caller shmem_lock() holds info->lock with
    interrupts disabled; but hugetlbfs_file_setup() calls user_shm_lock()
    with interrupts enabled, and might be interrupted by a writeback endio
    wanting xa_lock on i_pages.

    This may not risk an actual deadlock, since shmem inodes do not take
    part in writeback accounting, but there are several easy ways to avoid
    it.

    Requiring interrupts disabled for shmlock_user_lock would be easy, but
    it's a high-level global lock for which that seems inappropriate.
    Instead, recall that the use of info->lock to guard info->flags in
    shmem_lock() dates from pre-3.1 days, when races with SHMEM_PAGEIN and
    SHMEM_TRUNCATE could occur: nowadays it serves no purpose, the only flag
    added or removed is VM_LOCKED itself, and calls to shmem_lock() on an inode
    are already serialized by the caller.

    Take info->lock out of the chain and the possibility of deadlock or
    lockdep warning goes away.

    Fixes: 4595ef88d136 ("shmem: make shmem_inode_info::lock irq-safe")
    Reported-by: syzbot+c8a8197c8852f566b9d9@syzkaller.appspotmail.com
    Reported-by: syzbot+40b71e145e73f78f81ad@syzkaller.appspotmail.com
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Yang Shi
    Cc: Yang Shi
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2004161707410.16322@eggly.anvils
    Link: https://lore.kernel.org/lkml/000000000000e5838c05a3152f53@google.com/
    Link: https://lore.kernel.org/lkml/0000000000003712b305a331d3b1@google.com/
    Signed-off-by: Linus Torvalds

    Hugh Dickins
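
    For reference, the user-visible path into user_shm_lock() is SysV
    shmctl(SHM_LOCK); a minimal sketch (needs a sufficient RLIMIT_MEMLOCK or
    CAP_IPC_LOCK):

        #include <stdio.h>
        #include <sys/ipc.h>
        #include <sys/shm.h>

        int main(void)
        {
                int id = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);
                if (id < 0) {
                        perror("shmget");
                        return 1;
                }
                if (shmctl(id, SHM_LOCK, NULL) != 0)   /* ends up in user_shm_lock() */
                        perror("shmctl(SHM_LOCK)");
                shmctl(id, IPC_RMID, NULL);
                return 0;
        }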
     
  • Some optimizers don't notice that shmem_punch_compound() is always true
    (PageTransCompound() being false) without CONFIG_TRANSPARENT_HUGEPAGE==y.

    Use IS_ENABLED to help them to avoid the BUILD_BUG inside HPAGE_PMD_NR.

    Fixes: 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page() when punching hole")
    Reported-by: Randy Dunlap
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Tested-by: Randy Dunlap
    Acked-by: Randy Dunlap
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2004142339170.10035@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
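
    A generic illustration of the pattern (not the exact shmem hunk): testing
    IS_ENABLED() first gives the optimizer a compile-time constant, so the
    branch that uses HPAGE_PMD_NR can be discarded entirely when THP is not
    configured.

        if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && PageTransCompound(page))
                nr = HPAGE_PMD_NR;      /* only reachable when THP is configured */
        else
                nr = 1;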
     

08 Apr, 2020

3 commits

  • …ux/kernel/git/vgupta/arc") into android-mainline

    Steps along the 5.7-rc1 merge.

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: Ib9f87147ac3d81985496818b0c61bdd086140eed

    Greg Kroah-Hartman
     
  • Convert the various /* fallthrough */ comments to the pseudo-keyword
    fallthrough;

    Done via script:
    https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Reviewed-by: Gustavo A. R. Silva
    Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Yang Shi writes:

    Currently, when truncating a shmem file, if the range is partly in a THP
    (start or end is in the middle of THP), the pages actually will just get
    cleared rather than being freed, unless the range covers the whole THP.
    Even though all the subpages are truncated (randomly or sequentially), the
    THP may still be kept in page cache.

    This might be fine for some usecases which prefer preserving THP, but
    balloon inflation is handled in base page size. So when using shmem THP
    as memory backend, QEMU inflation actually doesn't work as expected since
    it doesn't free memory. But the inflation usecase really needs to get the
    memory freed. (Anonymous THP will also not get freed right away, but will
    be freed eventually when all subpages are unmapped: whereas shmem THP
    still stays in page cache.)

    Split THP right away when doing partial hole punch, and if split fails
    just clear the page so that read of the punched area will return zeroes.

    Hugh Dickins adds:

    Our earlier "team of pages" huge tmpfs implementation worked in the way
    that Yang Shi proposes; and we have been using this patch to continue to
    split the huge page when hole-punched or truncated, since converting over
    to the compound page implementation. Although huge tmpfs gives out huge
    pages when available, if the user specifically asks to truncate or punch a
    hole (perhaps to free memory, perhaps to reduce the memcg charge), then
    the filesystem should do so as best it can, splitting the huge page.

    That is not always possible: any additional reference to the huge page
    prevents split_huge_page() from succeeding, so the result can be flaky.
    But in practice it works successfully enough that we've not seen any
    problem from that.

    Add shmem_punch_compound() to encapsulate the decision of when a split is
    needed, and doing the split if so. Using this simplifies the flow in
    shmem_undo_range(); and the first (trylock) pass does not need to do any
    page clearing on failure, because the second pass will either succeed or
    do that clearing. Following the example of zero_user_segment() when
    clearing a partial page, add flush_dcache_page() and set_page_dirty() when
    clearing a hole - though I'm not certain that either is needed.

    But: split_huge_page() would be sure to fail if shmem_undo_range()'s
    pagevec holds further references to the huge page. The easiest way to fix
    that is for find_get_entries() to return early, as soon as it has put one
    compound head or tail into the pagevec. At first this felt like a hack;
    but on examination, this convention better suits all its callers - or will
    do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
    and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
    speedup by checking for compound pages there.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Cc: Yang Shi
    Cc: Alexander Duyck
    Cc: "Michael S. Tsirkin"
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
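
    The user-visible operation under discussion is an ordinary partial hole
    punch; on a tmpfs file it looks like the sketch below, where the offsets
    deliberately land inside a 2MiB huge page when huge pages are in use
    (assumes /dev/shm is a tmpfs mount):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/dev/shm/punch-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
                if (fd < 0 || ftruncate(fd, 4 << 20) < 0) {
                        perror("open/ftruncate");
                        return 1;
                }

                /* Punch 1MiB starting 1MiB in: partial with respect to a 2MiB THP. */
                if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              1 << 20, 1 << 20) != 0)
                        perror("fallocate(PUNCH_HOLE)");

                close(fd);
                unlink("/dev/shm/punch-demo");
                return 0;
        }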