07 Sep, 2020

7 commits

  • With the recent rework of the inode cluster flushing, we no longer
    ever wait on the inode flush "lock". It was never a lock in the
    first place, just a completion to allow callers to wait for inode IO
    to complete. We now never wait for flush completion as all inode
    flushing is non-blocking. Hence we can get rid of all the iflock
    infrastructure and instead just set and check a state flag.

    Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
    xfs_iflock_nowait() call sites to test-and-set operations on that
    flag, and replace all the xfs_ifunlock() calls with clear operations
    (a sketch of the resulting pattern follows this entry).

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
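
    A minimal user-space sketch of the resulting pattern. Illustrative
    only: the real flag is XFS_IFLUSHING in fs/xfs/xfs_inode.h and is
    manipulated via the xfs_iflags_* helpers under the inode spinlock,
    not with C11 atomics, and the function names below are hypothetical.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define XFS_IFLUSHING (1u << 3)     /* illustrative bit value */

    struct xfs_inode {
        _Atomic unsigned int i_flags;
    };

    /* Stands in for the old xfs_iflock_nowait(): try to mark the inode
     * as being flushed, fail if a flush is already in progress. */
    static bool iflags_test_and_set_flushing(struct xfs_inode *ip)
    {
        unsigned int old = atomic_fetch_or(&ip->i_flags, XFS_IFLUSHING);

        return !(old & XFS_IFLUSHING);
    }

    /* Stands in for the old xfs_ifunlock(): flush IO done, clear the flag. */
    static void iflags_clear_flushing(struct xfs_inode *ip)
    {
        atomic_fetch_and(&ip->i_flags, ~XFS_IFLUSHING);
    }

    int main(void)
    {
        struct xfs_inode ip = { .i_flags = 0 };

        printf("first attempt:  %s\n",
               iflags_test_and_set_flushing(&ip) ? "flushing" : "busy");
        printf("second attempt: %s\n",
               iflags_test_and_set_flushing(&ip) ? "flushing" : "busy");
        iflags_clear_flushing(&ip);
        printf("after clear:    %s\n",
               iflags_test_and_set_flushing(&ip) ? "flushing" : "busy");
        return 0;
    }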
     
  • Remove kmem_realloc() function and convert its users to use MM API
    directly (krealloc())

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Carlos Maiolino
     
  • Linus Torvalds
     
  • Pull more io_uring fixes from Jens Axboe:
    "Two followup fixes. One is fixing a regression from this merge window,
    the other is two commits fixing cancelation of deferred requests.

    Both have gone through full testing, and both spawned a few new
    regression test additions to liburing.

    - Don't play games with const, properly store the output iovec and
    assign it as needed.

    - Deferred request cancelation fix (Pavel)"

    * tag 'io_uring-5.9-2020-09-06' of git://git.kernel.dk/linux-block:
    io_uring: fix linked deferred ->files cancellation
    io_uring: fix cancel of deferred reqs with ->files
    io_uring: fix explicit async read/write mapping for large segments

    Linus Torvalds
     
  • Pull iommu fixes from Joerg Roedel:

    - three Intel VT-d fixes to fix address handling on 32bit, fix a NULL
    pointer dereference bug and serialize a hardware register access as
    required by the VT-d spec.

    - two patches for AMD IOMMU to force AMD GPUs into translation mode
    when memory encryption is active and disallow using IOMMUv2
    functionality. This makes the AMDGPU driver work when memory
    encryption is active.

    - two more fixes for AMD IOMMU to fix updating the Interrupt Remapping
    Table Entries.

    - MAINTAINERS file update for the Qualcomm IOMMU driver.

    * tag 'iommu-fixes-v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
    iommu/vt-d: Handle 36bit addressing for x86-32
    iommu/amd: Do not use IOMMUv2 functionality when SME is active
    iommu/amd: Do not force direct mapping when SME is active
    iommu/amd: Use cmpxchg_double() when updating 128-bit IRTE
    iommu/amd: Restore IRTE.RemapEn bit after programming IRTE
    iommu/vt-d: Fix NULL pointer dereference in dev_iommu_priv_set()
    iommu/vt-d: Serialize IOMMU GCMD register modifications
    MAINTAINERS: Update QUALCOMM IOMMU after Arm SMMU drivers move

    Linus Torvalds
     
  • Pull x86 fixes from Ingo Molnar:

    - more generic entry code ABI fallout

    - debug register handling bugfixes

    - fix vmalloc mappings on 32-bit kernels

    - kprobes instrumentation output fix on 32-bit kernels

    - fix over-eager WARN_ON_ONCE() on !SMAP hardware

    - NUMA debugging fix

    - fix Clang related crash on !RETPOLINE kernels

    * tag 'x86-urgent-2020-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/entry: Unbreak 32bit fast syscall
    x86/debug: Allow a single level of #DB recursion
    x86/entry: Fix AC assertion
    tracing/kprobes, x86/ptrace: Fix regs argument order for i386
    x86, fakenuma: Fix invalid starting node ID
    x86/mm/32: Bring back vmalloc faulting on x86_32
    x86/cmdline: Disable jump tables for cmdline.c

    Linus Torvalds
     
  • Pull xen updates from Juergen Gross:
    "A small series for fixing a problem with Xen PVH guests when running
    as backends (e.g. as dom0).

    Mapping other guests' memory is now working via ZONE_DEVICE, thus no
    longer requiring the memory hotplug functionality to be abused for
    that purpose"

    * tag 'for-linus-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
    xen: add helpers to allocate unpopulated memory
    memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC
    xen/balloon: add header guard

    Linus Torvalds
     

06 Sep, 2020

13 commits

  • While looking for ->files in ->defer_list, consider that requests there
    may actually be links.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     
  • While trying to cancel requests with ->files, io_uring should also
    look for requests in ->defer_list, otherwise it might end up hanging
    a thread.

    Cancel all requests in ->defer_list up to the last request there with
    matching ->files; that's needed to follow drain ordering semantics (a
    toy model of this rule follows this entry).

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
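
    A toy model of the cancellation rule described above. Illustrative
    only: the real code walks ctx->defer_list of struct io_kiocb in
    fs/io_uring.c, and the structures here are simplified stand-ins.

    #include <stdio.h>

    /* Toy model of a deferred request. */
    struct req {
        int id;
        const void *files;   /* owning files, NULL if none */
    };

    /* Cancel every deferred request up to (and including) the last one
     * matching @files: drained requests are ordered, so earlier entries
     * must be cancelled too or drain ordering is violated. */
    static void cancel_deferred(const struct req *list, int n, const void *files)
    {
        int last = -1, i;

        for (i = 0; i < n; i++)
            if (list[i].files == files)
                last = i;

        for (i = 0; i <= last; i++)
            printf("cancel req %d\n", list[i].id);
    }

    int main(void)
    {
        int f1, f2;
        const struct req defer_list[] = {
            { 1, &f1 }, { 2, &f2 }, { 3, &f1 }, { 4, &f2 },
        };

        /* Requests 1..3 are cancelled: 3 is the last match for f1, and 2
         * must go as well to preserve ordering; 4 stays queued. */
        cancel_deferred(defer_list, 4, &f1);
        return 0;
    }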
     
  • Linus Torvalds
     
  • Pull ARC fixes from Vineet Gupta:

    - HSDK-4xd Dev system: perf driver updates for sampling interrupt

    - HSDK* Dev System: Ethernet broken [Evgeniy Didin]

    - HIGHMEM broken (2 memory banks) [Mike Rapoport]

    - show_regs() rewrite once and for all

    - Other minor fixes

    * tag 'arc-5.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
    ARC: [plat-hsdk]: Switch ethernet phy-mode to rgmii-id
    arc: fix memory initialization for systems with two memory banks
    irqchip/eznps: Fix build error for !ARC700 builds
    ARC: show_regs: fix r12 printing and simplify
    ARC: HSDK: wireup perf irq
    ARC: perf: don't bail setup if pct irq missing in device-tree
    ARC: pgalloc.h: delete a duplicated word + other fixes

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton:
    "19 patches.

    Subsystems affected by this patch series: MAINTAINERS, ipc, fork,
    checkpatch, lib, and mm (memcg, slub, pagemap, madvise, migration,
    hugetlb)"

    * emailed patches from Andrew Morton:
    include/linux/log2.h: add missing () around n in roundup_pow_of_two()
    mm/khugepaged.c: fix khugepaged's request size in collapse_file
    mm/hugetlb: fix a race between hugetlb sysctl handlers
    mm/hugetlb: try preferred node first when alloc gigantic page from cma
    mm/migrate: preserve soft dirty in remove_migration_pte()
    mm/migrate: remove unnecessary is_zone_device_page() check
    mm/rmap: fixup copying of soft dirty and uffd ptes
    mm/migrate: fixup setting UFFD_WP flag
    mm: madvise: fix vma user-after-free
    checkpatch: fix the usage of capture group ( ... )
    fork: adjust sysctl_max_threads definition to match prototype
    ipc: adjust proc_ipc_sem_dointvec definition to match prototype
    mm: track page table modifications in __apply_to_page_range()
    MAINTAINERS: IA64: mark Status as Odd Fixes only
    MAINTAINERS: add LLVM maintainers
    MAINTAINERS: update Cavium/Marvell entries
    mm: slub: fix conversion of freelist_corrupted()
    mm: memcg: fix memcg reclaim soft lockup
    memcg: fix use-after-free in uncharge_batch

    Linus Torvalds
     
  • Add the missing parentheses around the macro parameter n in
    roundup_pow_of_two(); otherwise gcc generates warnings if the
    expression passed in is complicated (a minimal illustration follows
    this entry).

    Fixes: 312a0c170945 ("[PATCH] LOG2: Alter roundup_pow_of_two() so that it can use a ilog2() on a constant")
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/0-v1-8a2697e3c003+41165-log_brackets_jgg@nvidia.com
    Signed-off-by: Linus Torvalds

    Jason Gunthorpe
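
    The hazard is the usual one with unparenthesized macro parameters; a
    self-contained illustration (the macro names are made up, the actual
    definition being fixed is roundup_pow_of_two() in include/linux/log2.h):

    #include <stdio.h>

    /* Without parentheses the argument binds with the wrong precedence
     * inside the expansion; with a complicated expression the compiler
     * can warn about the result. */
    #define BAD_DOUBLE(n)   n * 2
    #define GOOD_DOUBLE(n)  ((n) * 2)

    int main(void)
    {
        int a = 3, b = 4;

        /* Expands to a + b * 2 = 11, not (a + b) * 2 = 14. */
        printf("bad:  %d\n", BAD_DOUBLE(a + b));
        /* Expands to ((a + b) * 2) = 14 as intended. */
        printf("good: %d\n", GOOD_DOUBLE(a + b));
        return 0;
    }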
     
  • collapse_file() in khugepaged passes PAGE_SIZE as the number of pages to
    be read to page_cache_sync_readahead(). The intent was probably to read
    a single page. Fix it to use the number of pages to the end of the
    window instead.

    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: David Howells
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Song Liu
    Acked-by: Yang Shi
    Acked-by: Pankaj Gupta
    Cc: Eric Biggers
    Link: https://lkml.kernel.org/r/20200903140844.14194-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    David Howells
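
    A small stand-alone illustration of the unit mix-up in the entry
    above; the helper below is a hypothetical stand-in for
    page_cache_sync_readahead(), whose last argument is a count of pages.

    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    /* Stand-in readahead helper: nr_pages is a number of pages, not bytes. */
    static void sync_readahead(unsigned long index, unsigned long nr_pages)
    {
        printf("read ahead pages [%lu, %lu)\n", index, index + nr_pages);
    }

    int main(void)
    {
        unsigned long index = 10, end = 522;  /* a 512-page collapse window */

        /* Buggy call: PAGE_SIZE is a byte count, so this asks for 4096
         * pages rather than the intended window. */
        sync_readahead(index, PAGE_SIZE);

        /* Fixed call: number of pages to the end of the window. */
        sync_readahead(index, end - index);
        return 0;
    }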
     
  • There is a race between the assignment of `table->data` and the write
    through that pointer in __do_proc_doulongvec_minmax() running on
    another thread:

    CPU0:                                   CPU1:
                                            proc_sys_write
    hugetlb_sysctl_handler                  proc_sys_call_handler
    hugetlb_sysctl_handler_common           hugetlb_sysctl_handler
      table->data = &tmp;                   hugetlb_sysctl_handler_common
                                              table->data = &tmp;
    proc_doulongvec_minmax
      do_proc_doulongvec_minmax             sysctl_head_finish
        __do_proc_doulongvec_minmax           unuse_table
          i = table->data;
          *i = val; // corrupt CPU1's stack

    Fix this by duplicating `table` and updating only the duplicate. Also
    introduce a helper, proc_hugetlb_doulongvec_minmax(), to simplify the
    code (a minimal sketch of the pattern follows this entry).

    The following oops was seen:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    Code: Bad RIP value.
    ...
    Call Trace:
    ? set_max_huge_pages+0x3da/0x4f0
    ? alloc_pool_huge_page+0x150/0x150
    ? proc_doulongvec_minmax+0x46/0x60
    ? hugetlb_sysctl_handler_common+0x1c7/0x200
    ? nr_hugepages_store+0x20/0x20
    ? copy_fd_bitmaps+0x170/0x170
    ? hugetlb_sysctl_handler+0x1e/0x20
    ? proc_sys_call_handler+0x2f1/0x300
    ? unregister_sysctl_table+0xb0/0xb0
    ? __fd_install+0x78/0x100
    ? proc_sys_write+0x14/0x20
    ? __vfs_write+0x4d/0x90
    ? vfs_write+0xef/0x240
    ? ksys_write+0xc0/0x160
    ? __ia32_sys_read+0x50/0x50
    ? __close_fd+0x129/0x150
    ? __x64_sys_write+0x43/0x50
    ? do_syscall_64+0x6c/0x200
    ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Andi Kleen
    Link: http://lkml.kernel.org/r/20200828031146.43035-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
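
    A minimal user-space sketch of the duplicate-the-table pattern. The
    types and handler are simplified stand-ins; the real change is in
    mm/hugetlb.c, and the helper name proc_hugetlb_doulongvec_minmax()
    comes from the description above.

    #include <stddef.h>
    #include <stdio.h>

    /* Simplified stand-in for struct ctl_table. */
    struct ctl_table {
        const char *procname;
        void *data;
        size_t maxlen;
    };

    /* Shared, registered table: concurrent writers must not point its
     * ->data at their own stacks. */
    static struct ctl_table nr_hugepages_table = {
        .procname = "nr_hugepages",
        .data = NULL,
        .maxlen = sizeof(unsigned long),
    };

    static void handler_common(struct ctl_table *table, unsigned long *out)
    {
        unsigned long tmp = 0;
        /* Buggy pattern: table->data = &tmp on the *shared* table lets a
         * concurrent handler pick up this thread's stack address.
         * Fixed pattern: work on a private copy of the table instead. */
        struct ctl_table dup_table = *table;

        dup_table.data = &tmp;
        /* Stands in for proc_doulongvec_minmax(&dup_table, ...) parsing
         * the user's write ("128") into the buffer ->data points at. */
        *(unsigned long *)dup_table.data = 128;
        *out = tmp;
    }

    int main(void)
    {
        unsigned long nr_hugepages;

        handler_common(&nr_hugepages_table, &nr_hugepages);
        printf("nr_hugepages = %lu\n", nr_hugepages);
        return 0;
    }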
     
  • Since commit cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic
    hugepages using cma"), a gigantic page could be allocated from a node
    other than the preferred node even though pages were available on the
    preferred node. The reason is that the nid parameter was ignored in
    alloc_gigantic_page().

    Besides, __GFP_THISNODE also needs to be checked when the user
    requires allocation only from the preferred node.

    After this patch, the preferred node is tried first before other
    allowed nodes, and no other node is tried if __GFP_THISNODE is
    specified. If the user doesn't specify a preferred node, the current
    node is used, which keeps the behavior of allocating gigantic and
    non-gigantic hugetlb pages consistent (a simplified sketch of this
    ordering follows this entry).

    Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
    Signed-off-by: Li Xinhai
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Link: https://lkml.kernel.org/r/20200902025016.697260-1-lixinhai.lxh@gmail.com
    Signed-off-by: Linus Torvalds

    Li Xinhai
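
    A simplified sketch of the resulting allocation order. The helpers
    are hypothetical stand-ins for cma_alloc() and the CMA-backed path of
    alloc_gigantic_page(); the real logic lives in mm/hugetlb.c.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_NUMNODES  4
    #define GFP_THISNODE  (1u << 0)   /* toy stand-in for __GFP_THISNODE */

    /* Stand-in for a per-node cma_alloc() attempt: reports whether the
     * node's CMA area can satisfy a gigantic page. */
    static bool cma_alloc_on_node(int node)
    {
        static const bool has_free[MAX_NUMNODES] = { false, true, true, false };

        return has_free[node];
    }

    /* Fixed ordering: the preferred node is tried first, other allowed
     * nodes only afterwards, and not at all under __GFP_THISNODE. */
    static int alloc_gigantic_page(int nid, unsigned int gfp_mask,
                                   const bool *nodemask)
    {
        int node;

        if (cma_alloc_on_node(nid))
            return nid;
        if (gfp_mask & GFP_THISNODE)
            return -1;          /* caller demanded exactly this node */

        for (node = 0; node < MAX_NUMNODES; node++) {
            if (node == nid || !nodemask[node])
                continue;
            if (cma_alloc_on_node(node))
                return node;
        }
        return -1;
    }

    int main(void)
    {
        bool allowed[MAX_NUMNODES] = { true, true, true, true };

        printf("preferred node 2 -> node %d\n", alloc_gigantic_page(2, 0, allowed));
        printf("preferred node 0 -> node %d\n", alloc_gigantic_page(0, 0, allowed));
        printf("preferred node 0, __GFP_THISNODE -> node %d\n",
               alloc_gigantic_page(0, GFP_THISNODE, allowed));
        return 0;
    }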
     
  • The code to remove a migration PTE and replace it with a device private
    PTE was not copying the soft dirty bit from the migration entry. This
    could lead to page contents not being marked dirty when faulting the page
    back from device private memory.

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Cc: Jerome Glisse
    Cc: Alistair Popple
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: Bharata B Rao
    Link: https://lkml.kernel.org/r/20200831212222.22409-3-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • Patch series "mm/migrate: preserve soft dirty in remove_migration_pte()".

    I happened to notice this from code inspection after seeing Alistair
    Popple's patch ("mm/rmap: Fixup copying of soft dirty and uffd ptes").

    This patch (of 2):

    The check for is_zone_device_page() and is_device_private_page() is
    unnecessary since the latter is sufficient to determine if the page is a
    device private page. Simplify the code for easier reading.

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Cc: Jerome Glisse
    Cc: Alistair Popple
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: Bharata B Rao
    Link: https://lkml.kernel.org/r/20200831212222.22409-1-rcampbell@nvidia.com
    Link: https://lkml.kernel.org/r/20200831212222.22409-2-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
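
    The shape of the simplification, with mock helpers; the real ones
    are in include/linux/memremap.h and check page->pgmap->type, so this
    is only an illustration of the redundancy.

    #include <stdbool.h>
    #include <stdio.h>

    /* Mock page type flags, for illustration only. */
    struct page { bool zone_device; bool device_private; };

    static bool is_zone_device_page(const struct page *p)
    {
        return p->zone_device;
    }

    static bool is_device_private_page(const struct page *p)
    {
        /* A device private page is by definition a ZONE_DEVICE page. */
        return p->zone_device && p->device_private;
    }

    int main(void)
    {
        struct page p = { .zone_device = true, .device_private = true };

        /* Before: the first test adds nothing. */
        if (is_zone_device_page(&p) && is_device_private_page(&p))
            printf("old check: device private path\n");

        /* After: the stronger test alone is sufficient. */
        if (is_device_private_page(&p))
            printf("new check: device private path\n");
        return 0;
    }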
     
  • During memory migration a pte is temporarily replaced with a migration
    swap pte. Some pte bits from the existing mapping such as the soft-dirty
    and uffd write-protect bits are preserved by copying these to the
    temporary migration swap pte.

    However these bits are not stored at the same location for swap and
    non-swap ptes. Therefore testing these bits requires using the
    appropriate helper function for the given pte type.

    Unfortunately several code locations were found where the wrong helper
    function is being used to test soft_dirty and uffd_wp bits which leads to
    them getting incorrectly set or cleared during page-migration.

    Fix these by using the correct tests based on pte type.

    Fixes: a5430dda8a3a ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
    Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
    Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration")
    Signed-off-by: Alistair Popple
    Signed-off-by: Andrew Morton
    Reviewed-by: Peter Xu
    Cc: Jérôme Glisse
    Cc: John Hubbard
    Cc: Ralph Campbell
    Cc: Alistair Popple
    Cc:
    Link: https://lkml.kernel.org/r/20200825064232.10023-2-alistair@popple.id.au
    Signed-off-by: Linus Torvalds

    Alistair Popple
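
    An illustration of why the pte type matters when testing these bits.
    The bit positions below are invented for the example; real positions
    are per-architecture (e.g. arch/x86/include/asm/pgtable_types.h).

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy pte layout: present ptes and swap ptes keep the soft-dirty bit
     * in different positions, so each kind needs its own accessor. */
    typedef unsigned int pte_t;

    #define PTE_SOFT_DIRTY      (1u << 2)   /* valid when the pte is present */
    #define PTE_SWP_SOFT_DIRTY  (1u << 5)   /* valid when the pte is a swap entry */

    static bool pte_soft_dirty(pte_t pte)     { return pte & PTE_SOFT_DIRTY; }
    static bool pte_swp_soft_dirty(pte_t pte) { return pte & PTE_SWP_SOFT_DIRTY; }

    int main(void)
    {
        /* A migration (swap) pte carrying the soft-dirty bit. */
        pte_t swap_pte = PTE_SWP_SOFT_DIRTY;

        /* Buggy: testing a swap pte with the non-swap helper misses the bit. */
        printf("wrong helper sees soft-dirty: %d\n", pte_soft_dirty(swap_pte));
        /* Fixed: use the _swp_ variant for swap ptes. */
        printf("right helper sees soft-dirty: %d\n", pte_swp_soft_dirty(swap_pte));
        return 0;
    }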
     
  • Commit f45ec5ff16a75 ("userfaultfd: wp: support swap and page migration")
    introduced support for tracking the uffd wp bit during page migration.
    However the non-swap PTE variant was used to set the flag for zone device
    private pages which are a type of swap page.

    This leads to corruption of the swap offset if the original PTE has the
    uffd_wp flag set.

    Fixes: f45ec5ff16a75 ("userfaultfd: wp: support swap and page migration")
    Signed-off-by: Alistair Popple
    Signed-off-by: Andrew Morton
    Reviewed-by: Peter Xu
    Cc: Jérôme Glisse
    Cc: John Hubbard
    Cc: Ralph Campbell
    Link: https://lkml.kernel.org/r/20200825064232.10023-1-alistair@popple.id.au
    Signed-off-by: Linus Torvalds

    Alistair Popple
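
    A toy encoding showing how setting the non-swap bit can corrupt the
    swap offset. All bit positions and the encoding are invented for the
    example; real swap pte layouts are architecture specific.

    #include <stdio.h>

    /* Toy swap pte encoding: low bits hold flags, high bits hold the
     * swap offset. */
    typedef unsigned long pte_t;

    #define SWP_OFFSET_SHIFT  8
    #define PTE_UFFD_WP       (1ul << 10)  /* meaningful only in present ptes */
    #define PTE_SWP_UFFD_WP   (1ul << 1)   /* reserved for swap ptes */

    static pte_t make_swap_pte(unsigned long offset)
    {
        return offset << SWP_OFFSET_SHIFT;
    }

    static unsigned long swp_offset(pte_t pte)
    {
        return pte >> SWP_OFFSET_SHIFT;
    }

    int main(void)
    {
        pte_t pte = make_swap_pte(42);

        /* Buggy: the present-pte flag lands inside the offset field and
         * silently changes the encoded offset. */
        printf("offset after wrong helper: %lu\n", swp_offset(pte | PTE_UFFD_WP));
        /* Fixed: the swap variant uses a bit outside the offset field. */
        printf("offset after right helper: %lu\n", swp_offset(pte | PTE_SWP_UFFD_WP));
        return 0;
    }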