17 Oct, 2020

1 commit

  • The page cache needs to know whether the filesystem supports THPs so that
    it doesn't send THPs to filesystems which can't handle them. Dave Chinner
    points out that getting from the page mapping to the filesystem type is
    too many steps (mapping->host->i_sb->s_type->fs_flags) so cache that
    information in the address space flags.
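
    As a rough sketch of the idea (the flag names FS_THP_SUPPORT and
    AS_THP_SUPPORT are assumptions here, not verified against the final
    patch), the filesystem's capability bit is copied into the address_space
    flags once, so later checks are a single bit test on the mapping:

    /* hedged sketch: copy the fs capability into the mapping at inode setup */
    static void mapping_init_thp_support(struct address_space *mapping,
                                         struct super_block *sb)
    {
            if (sb->s_type->fs_flags & FS_THP_SUPPORT)        /* assumed name */
                    set_bit(AS_THP_SUPPORT, &mapping->flags); /* assumed name */
    }

    /* the page cache then asks the mapping directly, no pointer chasing */
    static bool mapping_thp_support(struct address_space *mapping)
    {
            return test_bit(AS_THP_SUPPORT, &mapping->flags);
    }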

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Hugh Dickins
    Cc: Song Liu
    Cc: Rik van Riel
    Cc: "Kirill A . Shutemov"
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Link: https://lkml.kernel.org/r/20200916032717.22917-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

14 Oct, 2020

1 commit

  • Convert shmem_getpage_gfp() (the only remaining caller of
    find_lock_entry()) to cope with a head page being returned instead of
    the subpage for the index.
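
    A hedged sketch of the adjustment (the real code in shmem_getpage_gfp()
    may differ in detail): the lookup now hands back the THP head, and the
    subpage for the requested index is recovered from it by simple pointer
    arithmetic.

    /* find_lock_entry() returns the head page; recover the wanted subpage */
    struct page *head = find_lock_entry(mapping, index);
    struct page *page = head ? head + (index - head->index) : NULL;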

    [willy@infradead.org: fix BUG()s]
    Link: https://lore.kernel.org/linux-mm/20200912032042.GA6583@casper.infradead.org/

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Johannes Weiner
    Cc: Matthew Auld
    Cc: William Kucharski
    Link: https://lkml.kernel.org/r/20200910183318.20139-8-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Oct, 2020

1 commit

  • Pull arm64 updates from Will Deacon:
    "There's quite a lot of code here, but much of it is due to the
    addition of a new PMU driver as well as some arm64-specific selftests
    which is an area where we've traditionally been lagging a bit.

    In terms of exciting features, this includes support for the Memory
    Tagging Extension which narrowly missed 5.9, hopefully allowing
    userspace to run with use-after-free detection in production on CPUs
    that support it. Work is ongoing to integrate the feature with KASAN
    for 5.11.

    Another change that I'm excited about (assuming they get the hardware
    right) is preparing the ASID allocator for sharing the CPU page-table
    with the SMMU. Those changes will also come in via Joerg with the
    IOMMU pull.

    We do stray outside of our usual directories in a few places, mostly
    due to core changes required by MTE. Although much of this has been
    Acked, there were a couple of places where we unfortunately didn't get
    any review feedback.

    Other than that, we ran into a handful of minor conflicts in -next,
    but nothing that should pose any issues.

    Summary:

    - Userspace support for the Memory Tagging Extension introduced by
    Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11.

    - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
    switching.

    - Fix and subsequent rewrite of our Spectre mitigations, including
    the addition of support for PR_SPEC_DISABLE_NOEXEC.

    - Support for the Armv8.3 Pointer Authentication enhancements.

    - Support for ASID pinning, which is required when sharing
    page-tables with the SMMU.

    - MM updates, including treating flush_tlb_fix_spurious_fault() as a
    no-op.

    - Perf/PMU driver updates, including addition of the ARM CMN PMU
    driver and also support to handle CPU PMU IRQs as NMIs.

    - Allow prefetchable PCI BARs to be exposed to userspace using normal
    non-cacheable mappings.

    - Implementation of ARCH_STACKWALK for unwinding.

    - Improve reporting of unexpected kernel traps due to BPF JIT
    failure.

    - Improve robustness of user-visible HWCAP strings and their
    corresponding numerical constants.

    - Removal of TEXT_OFFSET.

    - Removal of some unused functions, parameters and prototypes.

    - Removal of MPIDR-based topology detection in favour of firmware
    description.

    - Cleanups to handling of SVE and FPSIMD register state in
    preparation for potential future optimisation of handling across
    syscalls.

    - Cleanups to the SDEI driver in preparation for support in KVM.

    - Miscellaneous cleanups and refactoring work"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
    Revert "arm64: initialize per-cpu offsets earlier"
    arm64: random: Remove no longer needed prototypes
    arm64: initialize per-cpu offsets earlier
    kselftest/arm64: Check mte tagged user address in kernel
    kselftest/arm64: Verify KSM page merge for MTE pages
    kselftest/arm64: Verify all different mmap MTE options
    kselftest/arm64: Check forked child mte memory accessibility
    kselftest/arm64: Verify mte tag inclusion via prctl
    kselftest/arm64: Add utilities and a test to validate mte memory
    perf: arm-cmn: Fix conversion specifiers for node type
    perf: arm-cmn: Fix unsigned comparison to less than zero
    arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD
    arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op
    arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option
    arm64: Pull in task_stack_page() to Spectre-v4 mitigation code
    KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled
    arm64: Get rid of arm64_ssbd_state
    KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state()
    KVM: arm64: Get rid of kvm_arm_have_ssbd()
    KVM: arm64: Simplify handling of ARCH_WORKAROUND_2
    ...

    Linus Torvalds
     

20 Sep, 2020

1 commit

  • Commit e809d5f0b5c9 ("tmpfs: per-superblock i_ino support") made changes
    to shmem_reserve_inode() in mm/shmem.c; however, the original test for
    (sbinfo->max_inodes) was dropped. This causes mounting tmpfs with the option
    nr_inodes=0 to fail:

    # mount -ttmpfs -onr_inodes=0 none /ext0
    mount: /ext0: mount(2) system call failed: Cannot allocate memory.

    This patch restores the nr_inodes=0 functionality.
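
    A hedged sketch of the shape of the fix (the exact code in
    shmem_reserve_inode() differs in detail): the inode-limit accounting is
    only applied when a limit was actually configured, so nr_inodes=0 again
    means "no limit" rather than "no free inodes".

    static int shmem_reserve_inode(struct super_block *sb)
    {
            struct shmem_sb_info *sbinfo = SHMEM_SB(sb);

            spin_lock(&sbinfo->stat_lock);
            if (sbinfo->max_inodes) {       /* restored test */
                    if (!sbinfo->free_inodes) {
                            spin_unlock(&sbinfo->stat_lock);
                            return -ENOSPC;
                    }
                    sbinfo->free_inodes--;
            }
            spin_unlock(&sbinfo->stat_lock);
            return 0;
    }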

    Fixes: e809d5f0b5c9 ("tmpfs: per-superblock i_ino support")
    Signed-off-by: Byron Stanoszek
    Signed-off-by: Andrew Morton
    Acked-by: Hugh Dickins
    Acked-by: Chris Down
    Link: https://lkml.kernel.org/r/20200902035715.16414-1-gandalf@winds.org
    Signed-off-by: Linus Torvalds

    Byron Stanoszek
     

04 Sep, 2020

2 commits

  • Arm's Memory Tagging Extension (MTE) adds some metadata (tags) to
    every physical page. When swapping pages out to disk, it is necessary to
    save these tags, and later restore them when reading the pages back.

    Add some hooks along with dummy implementations to enable the
    arch code to handle this.

    Three new hooks are added to the swap code:
    * arch_prepare_to_swap() and
    * arch_swap_invalidate_page() / arch_swap_invalidate_area().
    One new hook is added to shmem:
    * arch_swap_restore()
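
    A hedged sketch of what the dummy (non-arm64) implementations amount to;
    the exact prototypes and guards are assumptions:

    /* default no-op hooks; arm64 overrides them to save/restore MTE tags */
    static inline int arch_prepare_to_swap(struct page *page)
    {
            return 0;               /* nothing to save */
    }

    static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
    {
    }

    static inline void arch_swap_invalidate_area(int type)
    {
    }

    static inline void arch_swap_restore(swp_entry_t entry, struct page *page)
    {
    }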

    Signed-off-by: Steven Price
    [catalin.marinas@arm.com: add unlock_page() on the error path]
    [catalin.marinas@arm.com: dropped the _tags suffix]
    Signed-off-by: Catalin Marinas
    Acked-by: Andrew Morton

    Steven Price
     
  • Since arm64 memory (allocation) tags can only be stored in RAM, mapping
    files with PROT_MTE is not allowed by default. RAM-based files like
    those in a tmpfs mount or memfd_create() can support memory tagging, so
    update the vm_flags accordingly in shmem_mmap().
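
    A hedged one-line sketch of the change (VM_MTE_ALLOWED is assumed, per
    the arm64 MTE series, to be defined as a no-op on architectures without
    MTE):

    /* in shmem_mmap(): tmpfs pages always live in RAM, so allow tagging */
    vma->vm_flags |= VM_MTE_ALLOWED;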

    Signed-off-by: Catalin Marinas
    Acked-by: Andrew Morton

    Catalin Marinas
     

13 Aug, 2020

2 commits

  • Drop the repeated word "the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-11-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Workingset detection for anonymous pages will be implemented in the
    following patch, and it requires storing shadow entries in the
    swapcache. This patch implements the infrastructure to store the shadow
    entry in the swapcache.
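
    Very roughly, and hedged (swapcache, idx and eviction_cookie below are
    placeholders): on eviction the XArray slot keeps a shadow value instead
    of being cleared, and swapin later hands that shadow to the workingset
    code.

    /* eviction: leave a shadow (encoded as an XArray value) in the slot */
    void *shadow = xa_mk_value(eviction_cookie);
    xa_store(&swapcache->i_pages, idx, shadow, GFP_ATOMIC);

    /* swapin: if the slot held a shadow, feed it to the refault logic */
    void *entry = xa_load(&swapcache->i_pages, idx);
    if (xa_is_value(entry))
            workingset_refault(page, entry);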

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

3 commits

  • The current split between do_mmap() and do_mmap_pgoff() was introduced in
    commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to
    do_mmap_pgoff()") to support MPX.

    The wrapper function do_mmap_pgoff() always passed 0 as the value of the
    vm_flags argument to do_mmap(). However, MPX support has subsequently
    been removed from the kernel and there were no more direct callers of
    do_mmap(); all calls were going via do_mmap_pgoff().

    Simplify the code by removing do_mmap_pgoff() and changing all callers to
    directly call do_mmap(), which now no longer takes a vm_flags argument.
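
    A hedged before/after sketch of the call shape (the parameter lists here
    are illustrative and may not match the tree exactly):

    /* before: every caller went through the wrapper, which pinned vm_flags to 0 */
    addr = do_mmap_pgoff(file, addr, len, prot, flags, pgoff, &populate, &uf);

    /* after: the wrapper is gone and do_mmap() no longer takes vm_flags */
    addr = do_mmap(file, addr, len, prot, flags, pgoff, &populate, &uf);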

    Signed-off-by: Peter Collingbourne
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
    Signed-off-by: Linus Torvalds

    Peter Collingbourne
     
  • The default is still set to inode32 for backwards compatibility, but
    system administrators can opt in to the new 64-bit inode numbers by
    either:

    1. Passing inode64 on the command line when mounting, or
    2. Configuring the kernel with CONFIG_TMPFS_INODE64=y

    The inode64 and inode32 names are used based on existing precedent from
    XFS.

    [hughd@google.com: Kconfig fixes]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008011928010.13320@eggly.anvils

    Signed-off-by: Chris Down
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Amir Goldstein
    Acked-by: Hugh Dickins
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Jeff Layton
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/8b23758d0c66b5e2263e08baf9c4b6a7565cbd8f.1594661218.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.

    In Facebook production we are seeing heavy i_ino wraparounds on tmpfs. On
    affected tiers, in excess of 10% of hosts show multiple files with
    different content and the same inode number, with some servers even having
    as many as 150 duplicated inode numbers with differing file content.

    This causes actual, tangible problems in production. For example, we have
    complaints from those working on remote caches that their application is
    reporting cache corruptions because it uses (device, inodenum) to
    establish the identity of a particular cache object, but because it's not
    unique any more, the application refuses to continue and reports cache
    corruption. Even worse, sometimes applications may not even detect the
    corruption but may continue anyway, causing phantom and hard to debug
    behaviour.

    In general, userspace applications expect that (device, inodenum) should
    be enough to uniquely identify one inode, which seems fair enough. One
    might also need to check the generation, but in this case:

    1. That's not currently exposed to userspace
    (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
    2. Even with generation, there shouldn't be two live inodes with the
    same inode number on one device.

    In order to mitigate this, we take a two-pronged approach:

    1. Moving inum generation from being global to per-sb for tmpfs. This
    itself allows some reduction in i_ino churn. This works on both 64-
    and 32-bit machines.
    2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
    64-bit ino_t only: we allow users to mount tmpfs with a new inode64
    option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.

    You can see how this compares to previous related patches which didn't
    implement this per-superblock:

    - https://patchwork.kernel.org/patch/11254001/
    - https://patchwork.kernel.org/patch/11023915/

    This patch (of 2):

    get_next_ino has a number of problems:

    - It uses and returns a uint, which is susceptible to overflow if a lot
    of volatile inodes that use get_next_ino are created.
    - It's global, with no specificity per-sb or even per-filesystem. This
    means it's not that difficult to cause inode number wraparounds on a
    single device, which can result in having multiple distinct inodes
    with the same inode number.

    This patch adds a per-superblock counter that mitigates the second case.
    This design also allows us to later have a specific i_ino size per-device,
    for example, allowing users to choose whether to use 32- or 64-bit inodes
    for each tmpfs mount. This is implemented in the next commit.

    For internal shmem mounts which may be less tolerant to spinlock delays,
    we implement a percpu batching scheme which only takes the stat_lock at
    each batch boundary.
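
    A hedged sketch of the batching idea (the batch size and field names are
    assumptions): each CPU hands out inode numbers from a locally reserved
    range and only takes stat_lock when that range is exhausted.

    #define SHMEM_INO_BATCH 1024            /* assumed batch size */

    ino_t shmem_next_ino(struct shmem_sb_info *sbinfo)
    {
            ino_t *next = per_cpu_ptr(sbinfo->ino_batch, get_cpu());
            ino_t ino = *next;

            if (unlikely(ino % SHMEM_INO_BATCH == 0)) {
                    /* local range exhausted (or never filled): refill */
                    spin_lock(&sbinfo->stat_lock);
                    ino = sbinfo->next_ino;
                    sbinfo->next_ino += SHMEM_INO_BATCH;
                    spin_unlock(&sbinfo->stat_lock);
            }
            *next = ino + 1;
            put_cpu();
            return ino;
    }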

    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Hugh Dickins
    Cc: Amir Goldstein
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Jeff Layton
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     

25 Jul, 2020

1 commit

  • After commit fdc85222d58e ("kernfs: kvmalloc xattr value instead of
    kmalloc"), simple xattr entry is allocated with kvmalloc() instead of
    kmalloc(), so we should release it with kvfree() instead of kfree().
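
    The rule being applied, as a tiny hedged illustration:

    char *value = kvmalloc(len, GFP_KERNEL); /* kmalloc- or vmalloc-backed */

    if (value)
            kvfree(value);  /* correct either way; kfree() is not, if vmalloc-backed */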

    Fixes: fdc85222d58e ("kernfs: kvmalloc xattr value instead of kmalloc")
    Signed-off-by: Chengguang Xu
    Signed-off-by: Andrew Morton
    Acked-by: Hugh Dickins
    Acked-by: Tejun Heo
    Cc: Daniel Xu
    Cc: Chris Down
    Cc: Andreas Dilger
    Cc: Greg Kroah-Hartman
    Cc: Al Viro
    Cc: [5.7]
    Link: http://lkml.kernel.org/r/20200704051608.15043-1-cgxu519@mykernel.net
    Signed-off-by: Linus Torvalds

    Chengguang Xu
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definitions of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always the
    possibility to override the generic version with the usual ifdef magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Jun, 2020

7 commits

  • They're the same function, and for the purpose of all callers they are
    equivalent to lru_cache_add().

    [akpm@linux-foundation.org: fix it for local_lock changes]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Swapin faults were the last event to charge pages after they had already
    been put on the LRU list. Now that we charge directly on swapin, the
    lrucare portion of the charge code is unused.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Cc: Shakeel Butt
    Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Right now, users that are otherwise memory controlled can easily escape
    their containment and allocate significant amounts of memory that they're
    not being charged for. That's because swap readahead pages are not being
    charged until somebody actually faults them into their page table. This
    can be exploited with MADV_WILLNEED, which triggers arbitrary readahead
    allocations without charging the pages.

    There are additional problems with the delayed charging of swap pages:

    1. To implement refault/workingset detection for anonymous pages, we
    need to have a target LRU available at swapin time, but the LRU is not
    determinable until the page has been charged.

    2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
    stable when the page is isolated from the LRU; otherwise, the locks
    change under us. But swapcache gets charged after it's already on the
    LRU, and we cannot simply isolate it ourselves beforehand (since charging
    is not exactly optional).

    The previous patch ensured we always maintain cgroup ownership records for
    swap pages. This patch moves the swapcache charging point from the fault
    handler to swapin time to fix all of the above problems.

    v2: simplify swapin error checking (Joonsoo)

    [hughd@google.com: fix livelock in __read_swap_cache_async()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005212246080.8458@eggly.anvils
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Rafael Aquini
    Cc: Alex Shi
    Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This
    divergence from the generic VM accounting means unnecessary code overhead,
    and creates a dependency for memcg that page->mapping is set up at the
    time of charging, so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM.
    The page is already locked in these places, so page->mem_cgroup is stable;
    we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure
    it's set up in time.

    Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
    NR_SHMEM accounting sites.
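
    A hedged sketch of what a converted accounting site looks like (context
    omitted; the counters are the generic ones named above):

    /* charge-time accounting via the generic per-lruvec counters */
    __mod_lruvec_page_state(page, NR_FILE_PAGES, nr);
    if (PageSwapBacked(page))
            __mod_lruvec_page_state(page, NR_SHMEM, nr);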

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The try/commit/cancel protocol that memcg uses dates back to when pages
    used to be uncharged upon removal from the page cache, and thus couldn't
    be committed before the insertion had succeeded. Nowadays, pages are
    uncharged when they are physically freed; it doesn't matter whether the
    insertion was successful or not. For the page cache, the transaction
    dance has become unnecessary.

    Introduce a mem_cgroup_charge() function that simply charges a newly
    allocated page to a cgroup and sets up page->mem_cgroup in one single
    step. If the insertion fails, the caller doesn't have to do anything but
    free/put the page.

    Then switch the page cache over to this new API.
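
    A hedged sketch of the new calling convention from a page-cache insertion
    path (the error handling shown is representative, not exact):

    /* one call charges the page and sets up page->mem_cgroup */
    error = mem_cgroup_charge(page, current->mm, gfp);
    if (error) {
            put_page(page);         /* nothing to unwind in memcg */
            return error;
    }
    /* proceed to insert the page into the page cache */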

    Subsequent patches will also convert anon pages, but it needs a bit more
    prep work. Right now, memcg depends on page->mapping being already set up
    at the time of charging, so that it can maintain its own MEMCG_CACHE and
    MEMCG_RSS counters. For anon, page->mapping is set under the same pte
    lock under which the page is published, so a single charge point that can
    block doesn't work there just yet.

    The following prep patches will replace the private memcg counters with
    the generic vmstat counters, thus removing the page->mapping dependency,
    then complete the transition to the new single-point charge API and delete
    the old transactional scheme.

    v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
    v3: rebase on preceding shmem simplification patch

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 215c02bc33bb ("tmpfs: fix shmem_getpage_gfp() VM_BUG_ON")
    recognized that hole punching can race with swapin and removed the
    BUG_ON() for a truncated entry from the swapin path.

    The patch also added a swapcache deletion to optimize this rare case:
    Since swapin has the page locked, and free_swap_and_cache() merely
    trylocks, this situation can leave the page stranded in swapcache.
    Usually, page reclaim picks up stale swapcache pages, and the race can
    happen at any other time when the page is locked. (The same happens for
    non-shmem swapin racing with page table zapping.) The thinking here was:
    we already observed the race and we have the page locked, we may as well
    do the cleanup instead of waiting for reclaim.

    However, this optimization complicates the next patch which moves the
    cgroup charging code around. As this is just a minor speedup for a race
    condition that is so rare that it required a fuzzer to trigger the
    original BUG_ON(), it's no longer worth the complications.

    Suggested-by: Hugh Dickins
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Acked-by: Hugh Dickins
    Cc: Alex Shi
    Cc: Joonsoo Kim
    Cc: Shakeel Butt
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200511181056.GA339505@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg charging API carries a boolean @compound parameter that tells
    whether the page we're dealing with is a hugepage.
    mem_cgroup_commit_charge() has another boolean @lrucare that indicates
    whether the page needs LRU locking or not while charging. The majority of
    callsites know those parameters at compile time, which results in a lot of
    naked "false, false" argument lists. This makes for cryptic code and is a
    breeding ground for subtle mistakes.

    Thankfully, the huge page state can be inferred from the page itself and
    doesn't need to be passed along. This is safe because charging completes
    before the page is published, which is the earliest point at which
    somebody could split it.

    Simplify the callsites by removing @compound, and let memcg infer the
    state by using hpage_nr_pages() unconditionally. That function does
    PageTransHuge() to identify huge pages, which also helpfully asserts that
    nobody passes in tail pages by accident.
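
    A hedged sketch of the inference that replaces the @compound argument:

    /* derive the charge size from the page itself; hpage_nr_pages() returns
     * 1, or HPAGE_PMD_NR for a THP, and its PageTransHuge() check also traps
     * tail pages passed by mistake */
    unsigned int nr_pages = hpage_nr_pages(page);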

    The following patches will introduce a new charging API, best not to carry
    over unnecessary weight.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

22 Apr, 2020

3 commits

  • Syzbot reported the below lockdep splat:

    WARNING: possible irq lock inversion dependency detected
    5.6.0-rc7-syzkaller #0 Not tainted
    --------------------------------------------------------
    syz-executor.0/10317 just changed the state of lock:
    ffff888021d16568 (&(&info->lock)->rlock){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline]
    ffff888021d16568 (&(&info->lock)->rlock){+.+.}, at: shmem_mfill_atomic_pte+0x1012/0x21c0 mm/shmem.c:2407
    but this lock was taken by another, SOFTIRQ-safe lock in the past:
    (&(&xa->xa_lock)->rlock#5){..-.}

    and interrupts could create inverse lock ordering between them.

    other info that might help us debug this:
    Possible interrupt unsafe locking scenario:

    CPU0                                    CPU1
    ----                                    ----
    lock(&(&info->lock)->rlock);
                                            local_irq_disable();
                                            lock(&(&xa->xa_lock)->rlock#5);
                                            lock(&(&info->lock)->rlock);
    <Interrupt>
      lock(&(&xa->xa_lock)->rlock#5);

    *** DEADLOCK ***

    The full report is quite lengthy, please see:

    https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2004152007370.13597@eggly.anvils/T/#m813b412c5f78e25ca8c6c7734886ed4de43f241d

    This happens because CPU 0 holds info->lock with IRQs enabled in the
    userfaultfd_copy path, while CPU 1 is splitting a THP and holds xa_lock
    and info->lock in IRQ-disabled context at the same time. If a softirq
    then comes in on CPU 0 to acquire xa_lock, the deadlock is triggered.

    The fix is to acquire/release info->lock with *_irq version instead of
    plain spin_{lock,unlock} to make it softirq safe.
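
    The fix pattern, sketched for the shmem_mfill_atomic_pte() path named in
    the splat (the work shown under the lock is illustrative):

    spin_lock_irq(&info->lock);             /* was: spin_lock() */
    info->alloced++;
    inode->i_blocks += BLOCKS_PER_PAGE;
    shmem_recalc_inode(inode);
    spin_unlock_irq(&info->lock);           /* was: spin_unlock() */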

    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Reported-by: syzbot+e27980339d305f2dbfd9@syzkaller.appspotmail.com
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Tested-by: syzbot+e27980339d305f2dbfd9@syzkaller.appspotmail.com
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/1587061357-122619-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Recent commit 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page()
    when punching hole") has allowed syzkaller to probe deeper, uncovering a
    long-standing lockdep issue between the irq-unsafe shmlock_user_lock,
    the irq-safe xa_lock on mapping->i_pages, and shmem inode's info->lock
    which nests inside xa_lock (or tree_lock) since 4.8's shmem_uncharge().

    user_shm_lock(), servicing SysV shmctl(SHM_LOCK), wants
    shmlock_user_lock while its caller shmem_lock() holds info->lock with
    interrupts disabled; but hugetlbfs_file_setup() calls user_shm_lock()
    with interrupts enabled, and might be interrupted by a writeback endio
    wanting xa_lock on i_pages.

    This may not risk an actual deadlock, since shmem inodes do not take
    part in writeback accounting, but there are several easy ways to avoid
    it.

    Requiring interrupts disabled for shmlock_user_lock would be easy, but
    it's a high-level global lock for which that seems inappropriate.
    Instead, recall that the use of info->lock to guard info->flags in
    shmem_lock() dates from pre-3.1 days, when races with SHMEM_PAGEIN and
    SHMEM_TRUNCATE could occur: nowadays it serves no purpose, the only flag
    added or removed is VM_LOCKED itself, and calls to shmem_lock() on an inode
    are already serialized by the caller.

    Take info->lock out of the chain and the possibility of deadlock or
    lockdep warning goes away.

    Fixes: 4595ef88d136 ("shmem: make shmem_inode_info::lock irq-safe")
    Reported-by: syzbot+c8a8197c8852f566b9d9@syzkaller.appspotmail.com
    Reported-by: syzbot+40b71e145e73f78f81ad@syzkaller.appspotmail.com
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Yang Shi
    Cc: Yang Shi
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2004161707410.16322@eggly.anvils
    Link: https://lore.kernel.org/lkml/000000000000e5838c05a3152f53@google.com/
    Link: https://lore.kernel.org/lkml/0000000000003712b305a331d3b1@google.com/
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Some optimizers don't notice that shmem_punch_compound() is always true
    (PageTransCompound() being false) without CONFIG_TRANSPARENT_HUGEPAGE==y.

    Use IS_ENABLED to help them to avoid the BUILD_BUG inside HPAGE_PMD_NR.
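
    The pattern being used, as a hedged sketch (shmem_clear_partial() is a
    hypothetical stand-in for the clearing code, not a real helper):

    if (shmem_punch_compound(page, start, end))
            truncate_inode_page(mapping, page);
    else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
            /* split failed: fall back to clearing the punched subpages;
             * this branch uses HPAGE_PMD_NR, so letting the compiler see
             * the CONFIG check keeps the BUILD_BUG() unreachable when THP
             * is disabled */
            shmem_clear_partial(page, start, end);
    }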

    Fixes: 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page() when punching hole")
    Reported-by: Randy Dunlap
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Tested-by: Randy Dunlap
    Acked-by: Randy Dunlap
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2004142339170.10035@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

08 Apr, 2020

7 commits

  • Convert the various /* fallthrough */ comments to the pseudo-keyword
    fallthrough;

    Done via script:
    https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/
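
    The conversion, illustrated with a generic hedged example (the labels and
    helpers are placeholders, not a hunk from the patch):

    switch (ret) {
    case FIRST:
            do_first();
            fallthrough;    /* replaces the old "fall through" comment with a
                             * pseudo-keyword the compiler can verify */
    case SECOND:
            do_second();
            break;
    }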

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Reviewed-by: Gustavo A. R. Silva
    Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Yang Shi writes:

    Currently, when truncating a shmem file, if the range is partly in a THP
    (start or end is in the middle of THP), the pages actually will just get
    cleared rather than being freed, unless the range covers the whole THP.
    Even though all the subpages are truncated (randomly or sequentially), the
    THP may still be kept in page cache.

    This might be fine for some usecases which prefer preserving THP, but
    balloon inflation is handled in base page size. So when using shmem THP
    as memory backend, QEMU inflation actually doesn't work as expected since
    it doesn't free memory. But the inflation usecase really needs to get the
    memory freed. (Anonymous THP will also not get freed right away, but will
    be freed eventually when all subpages are unmapped: whereas shmem THP
    still stays in page cache.)

    Split THP right away when doing partial hole punch, and if split fails
    just clear the page so that read of the punched area will return zeroes.

    Hugh Dickins adds:

    Our earlier "team of pages" huge tmpfs implementation worked in the way
    that Yang Shi proposes; and we have been using this patch to continue to
    split the huge page when hole-punched or truncated, since converting over
    to the compound page implementation. Although huge tmpfs gives out huge
    pages when available, if the user specifically asks to truncate or punch a
    hole (perhaps to free memory, perhaps to reduce the memcg charge), then
    the filesystem should do so as best it can, splitting the huge page.

    That is not always possible: any additional reference to the huge page
    prevents split_huge_page() from succeeding, so the result can be flaky.
    But in practice it works successfully enough that we've not seen any
    problem from that.

    Add shmem_punch_compound() to encapsulate the decision of when a split is
    needed, and doing the split if so. Using this simplifies the flow in
    shmem_undo_range(); and the first (trylock) pass does not need to do any
    page clearing on failure, because the second pass will either succeed or
    do that clearing. Following the example of zero_user_segment() when
    clearing a partial page, add flush_dcache_page() and set_page_dirty() when
    clearing a hole - though I'm not certain that either is needed.

    But: split_huge_page() would be sure to fail if shmem_undo_range()'s
    pagevec holds further references to the huge page. The easiest way to fix
    that is for find_get_entries() to return early, as soon as it has put one
    compound head or tail into the pagevec. At first this felt like a hack;
    but on examination, this convention better suits all its callers - or will
    do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
    and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
    speedup by checking for compound pages there.
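
    A hedged sketch of the decision helper described above (the real
    shmem_punch_compound() in mm/shmem.c may differ in detail):

    /* true if the page can simply be truncated as a whole; otherwise try to
     * split the THP so the hole can be punched exactly */
    static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
    {
            if (!PageTransCompound(page))
                    return true;                    /* ordinary small page */

            /* a huge page wholly inside the range can be dropped as-is */
            if (PageHead(page) &&
                page->index >= start && page->index + HPAGE_PMD_NR <= end)
                    return true;

            /* partial overlap: split may fail if extra references exist */
            return split_huge_page(page) >= 0;
    }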

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Cc: Yang Shi
    Cc: Alexander Duyck
    Cc: "Michael S. Tsirkin"
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Previously 0 was assigned to the variable 'error', but the variable was
    never read before being reassigned later, so the assignment can be removed.

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Pankaj Gupta
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200301152832.24595-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • Variables declared in a switch statement before any case statements cannot
    be automatically initialized with compiler instrumentation (as they are
    not part of any execution flow). With GCC's proposed automatic stack
    variable initialization feature, this triggers a warning (and they don't
    get initialized). Clang's automatic stack variable initialization (via
    CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also doesn't
    initialize such variables[1]. Note that these warnings (or silent
    skipping) happen before the dead-store elimination optimization phase, so
    even when the automatic initializations are later elided in favor of
    direct initializations, the warnings remain.

    To avoid these problems, move such variables into the "case" where they're
    used or lift them up into the main function body.

    mm/shmem.c: In function `shmem_getpage_gfp':
    mm/shmem.c:1816:10: warning: statement will never be executed [-Wswitch-unreachable]
     1816 |         loff_t i_size;
          |         ^~~~~~

    [1] https://bugs.llvm.org/show_bug.cgi?id=44916
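
    A hedged illustration of the problem and the fix, simplified from the
    shmem_getpage_gfp() case the warning points at:

    /* before: the declaration sits between "switch (" and the first case
     * label, so no control flow reaches it and instrumentation cannot
     * initialize it */
    switch (sgp) {
            loff_t i_size;                  /* -Wswitch-unreachable */
    case SGP_READ:
            i_size = i_size_read(inode);
            break;
    default:
            break;
    }

    /* after: hoist the declaration (or move it into the case that uses it) */
    loff_t i_size;

    switch (sgp) {
    case SGP_READ:
            i_size = i_size_read(inode);
            break;
    default:
            break;
    }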

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alexander Potapenko
    Link: http://lkml.kernel.org/r/20200220062312.69165-1-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Commit e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
    notes that it should be reverted when the PowerPC problem was fixed. The
    commit fixing the PowerPC problem (953c66c2b22a) did not revert the
    commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
    CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an
    oversight, so remove the Kconfig symbol and undo the work of commit
    e496cf3d7821.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Christoph Hellwig
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • The thp_fault_fallback and thp_file_fallback vmstats are incremented if
    either the hugepage allocation fails through the page allocator or the
    hugepage charge fails through mem cgroup.

    This patch leaves those fields untouched but adds two new fields,
    thp_{fault,file}_fallback_charge, which are incremented only when the mem
    cgroup charge fails.

    This distinguishes between attempted hugepage allocations that fail due to
    fragmentation (or low memory conditions) and those that fail due to mem
    cgroup limits. That can be used to determine the impact of fragmentation
    on the system by excluding faults that failed due to memcg usage.
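
    A hedged sketch of where the new counter gets bumped (shmem_alloc_hugepage()
    exists in mm/shmem.c, but charge_to_memcg() is a placeholder for the
    actual charge call):

    page = shmem_alloc_hugepage(gfp, info, index);
    if (!page) {
            count_vm_event(THP_FILE_FALLBACK);        /* allocation failed */
    } else if (charge_to_memcg(page)) {
            count_vm_event(THP_FILE_FALLBACK);        /* still a fallback... */
            count_vm_event(THP_FILE_FALLBACK_CHARGE); /* ...due to memcg */
            put_page(page);
            page = NULL;
    }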

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Cc: Mike Rapoport
    Cc: Jeremy Cline
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061422070.7412@chino.kir.corp.google.com
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The existing thp_fault_fallback indicates when thp attempts to allocate a
    hugepage but fails, or if the hugepage cannot be charged to the mem cgroup
    hierarchy.

    Extend this to shmem as well. Add a new thp_file_fallback counter to
    complement thp_file_alloc; it gets incremented when a hugepage allocation
    is attempted but fails, or when the page cannot be charged to the mem
    cgroup hierarchy.

    Additionally, remove the check for CONFIG_TRANSPARENT_HUGE_PAGECACHE from
    shmem_alloc_hugepage() since it is only called with this configuration
    option.

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Cc: Mike Rapoport
    Cc: Jeremy Cline
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061421240.7412@chino.kir.corp.google.com
    Signed-off-by: Linus Torvalds

    David Rientjes
     

04 Apr, 2020

1 commit

  • Pull cgroup updates from Tejun Heo:

    - Christian extended clone3 so that processes can be spawned into
    cgroups directly.

    This is not only neat in terms of semantics but also avoids grabbing
    the global cgroup_threadgroup_rwsem for migration.

    - Daniel added !root xattr support to cgroupfs.

    Userland already uses xattrs on cgroupfs for bookkeeping. This will
    allow delegated cgroups to support such usages.

    - Prateek tried to make cpuset hotplug handling synchronous but that
    led to possible deadlock scenarios. Reverted.

    - Other minor changes including release_agent_path handling cleanup.

    * 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    docs: cgroup-v1: Document the cpuset_v2_mode mount option
    Revert "cpuset: Make cpuset hotplug synchronous"
    cgroupfs: Support user xattrs
    kernfs: Add option to enable user xattrs
    kernfs: Add removed_size out param for simple_xattr_set
    kernfs: kvmalloc xattr value instead of kmalloc
    cgroup: Restructure release_agent_path handling
    selftests/cgroup: add tests for cloning into cgroups
    clone3: allow spawning processes into cgroups
    cgroup: add cgroup_may_write() helper
    cgroup: refactor fork helpers
    cgroup: add cgroup_get_from_file() helper
    cgroup: unify attach permission checking
    cpuset: Make cpuset hotplug synchronous
    cgroup.c: Use built-in RCU list checking
    kselftest/cgroup: add cgroup destruction test
    cgroup: Clean up css_set task traversal

    Linus Torvalds
     

17 Mar, 2020

1 commit

  • This helps set up size accounting in the next commit. Without this out
    param, it's difficult to find out the removed xattr size without taking
    a lock for longer and walking the xattr linked list twice.
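
    A hedged sketch of the signature change (the exact prototype in the tree
    may differ):

    /* before */
    int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
                         const void *value, size_t size, int flags);

    /* after: callers that track usage learn how much a removed or replaced
     * xattr value freed, without re-walking the list */
    int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
                         const void *value, size_t size, int flags,
                         ssize_t *removed_size);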

    Signed-off-by: Daniel Xu
    Acked-by: Chris Down
    Reviewed-by: Greg Kroah-Hartman
    Signed-off-by: Tejun Heo

    Daniel Xu
     

19 Feb, 2020

1 commit


08 Feb, 2020

3 commits


07 Feb, 2020

2 commits


14 Jan, 2020

1 commit

  • Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
    enabled. But it doesn't work well with an above-47-bit hint address.

    Normally, the kernel doesn't create userspace mappings above 47-bit,
    even if the machine allows this (such as with 5-level paging on x86-64).
    Not all user space is ready to handle wide addresses. It's known that
    at least some JIT compilers use higher bits in pointers to encode their
    information.

    Userspace can ask for allocation from full address space by specifying
    hint address (with or without MAP_FIXED) above 47-bits. If the
    application doesn't need a particular address, but wants to allocate
    from whole address space it can specify -1 as a hint address.

    Unfortunately, this trick breaks THP alignment in shmem/tmpfs:
    shmem_get_unmapped_area() would not try to allocate a PMD-aligned area if
    *any* hint address is specified.

    This can be fixed by requesting the aligned area if we failed to allocate
    at the user-specified hint address. The request with inflated length will
    also take the user-specified hint address. This way we will not lose an
    allocation request from the full address space.
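
    A hedged sketch of the retry logic described above (get_area stands in
    for the mm's get_unmapped_area hook; details are illustrative):

    /* first try exactly what userspace asked for */
    addr = get_area(file, uaddr, len, pgoff, flags);
    if (IS_ERR_VALUE(addr) || !(addr & (HPAGE_PMD_SIZE - 1)))
            return addr;                    /* failed hard, or already aligned */

    /* retry with an inflated length, keeping the user's (possibly >47-bit)
     * hint so the high address space is not silently lost */
    inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
    inflated_addr = get_area(NULL, uaddr, inflated_len, 0, flags);
    if (IS_ERR_VALUE(inflated_addr))
            return addr;

    /* pick the PMD-aligned start inside the inflated area */
    inflated_addr = round_up(inflated_addr, HPAGE_PMD_SIZE);
    return inflated_addr;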

    [kirill@shutemov.name: fold in a fixup]
    Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
    Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
    Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
    Signed-off-by: Kirill A. Shutemov
    Cc: "Willhalm, Thomas"
    Cc: Dan Williams
    Cc: "Bruggeman, Otto G"
    Cc: "Aneesh Kumar K . V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov