30 Dec, 2020

1 commit

  • [ Upstream commit 57efa1fe5957694fa541c9062de0a127f0b9acb0 ]

    Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
    fork() for ptes") pages under a FOLL_PIN will not be write protected
    during COW for fork. This means that pages returned from
    pin_user_pages(FOLL_WRITE) should not become write protected while the pin
    is active.

    However, there is a small race where get_user_pages_fast(FOLL_PIN) can
    establish a FOLL_PIN at the same time copy_present_page() is write
    protecting it:

    CPU 0                                   CPU 1
    get_user_pages_fast()
     internal_get_user_pages_fast()
                                            copy_page_range()
                                              pte_alloc_map_lock()
                                                copy_present_page()
                                                  atomic_read(has_pinned) == 0
                                                  page_maybe_dma_pinned() == false
                                            atomic_set(has_pinned, 1);
     gup_pgd_range()
      gup_pte_range()
       pte_t pte = gup_get_pte(ptep)
       pte_access_permitted(pte)
       try_grab_compound_head()
                                            pte = pte_wrprotect(pte)
                                            set_pte_at();
                                            pte_unmap_unlock()
       // GUP now returns with a write protected page

    The first attempt to resolve this by using the write protect caused
    problems (and was missing a barrier), see commit f3c64eda3e50 ("mm: avoid
    early COW write protect games during fork()").

    Instead wrap copy_p4d_range() with the write side of a seqcount and check
    the read side around gup_pgd_range(). If there is a collision then
    get_user_pages_fast() fails and falls back to slow GUP.

    Slow GUP is safe against this race because copy_page_range() is only
    called while holding the exclusive side of the mmap_lock on the src
    mm_struct.
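
    In outline, the scheme looks roughly like the sketch below (not the
    exact patch; mm->write_protect_seq is the seqcount added by this
    change, argument lists are abbreviated, and error handling is elided):

        /* fork side: copy_page_range(), mmap_lock held for write on src */
        if (is_cow_mapping(src_vma->vm_flags)) {
                raw_write_seqcount_begin(&src_mm->write_protect_seq);
                ret = copy_p4d_range(/* ... */);
                raw_write_seqcount_end(&src_mm->write_protect_seq);
        }

        /* gup_fast side: internal_get_user_pages_fast() */
        seq = raw_read_seqcount(&current->mm->write_protect_seq);
        if (seq & 1)
                return 0;               /* fork is mid-copy: use slow GUP */
        gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
        if ((gup_flags & FOLL_PIN) &&
            read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
                /* collided with fork: drop the pins, fall back to slow GUP */
                unpin_user_pages(pages, nr_pinned);
                nr_pinned = 0;
        }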

    [akpm@linux-foundation.org: coding style fixes]
    Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com

    Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
    Signed-off-by: Jason Gunthorpe
    Suggested-by: Linus Torvalds
    Reviewed-by: John Hubbard
    Reviewed-by: Jan Kara
    Reviewed-by: Peter Xu
    Acked-by: "Ahmed S. Darwish" [seqcount_t parts]
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jann Horn
    Cc: Kirill Shutemov
    Cc: Kirill Tkhai
    Cc: Leon Romanovsky
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
     

19 Oct, 2020

1 commit

  • Besides calling the callback on each page, apply_to_page_range also has
    the effect of pre-faulting all PTEs for the range. To support callers
    that only need the pre-faulting, make the callback optional.

    Based on a patch from Minchan Kim.
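
    A minimal sketch of what "optional" means here, assuming the walk
    simply skips the call when no callback is supplied (illustrative, not
    the exact diff):

        /* inside the per-PTE loop of __apply_to_page_range(), sketch */
        if (fn) {
                err = fn(pte, addr, data);
                if (err)
                        break;
        }

        /* caller that only wants the PTEs pre-faulted/allocated */
        err = apply_to_page_range(mm, addr, size, NULL, NULL);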

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-5-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

17 Oct, 2020

1 commit

  • A compound page in the page cache will not necessarily be of PMD size,
    so check explicitly.

    [willy@infradead.org: fix remove page fault assumption of compound page size]
    Link: https://lkml.kernel.org/r/20201001152259.14932-1-willy@infradead.org
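
    The guard this implies is something like the following sketch
    (do_set_pmd() in the fault path is the likely spot; treat it as
    illustrative):

        /* only install a PMD mapping when the compound page is PMD-sized */
        if (compound_order(compound_head(page)) != HPAGE_PMD_ORDER)
                return VM_FAULT_FALLBACK;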

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Huang Ying
    Cc: Kirill A. Shutemov
    Link: https://lkml.kernel.org/r/20200908195539.25896-3-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

16 Oct, 2020

1 commit

  • Pull dma-mapping updates from Christoph Hellwig:

    - rework the non-coherent DMA allocator

    - move private definitions out of <linux/dma-mapping.h>

    - lower CMA_ALIGNMENT (Paul Cercueil)

    - remove the omap1 dma address translation in favor of the common code

    - make dma-direct aware of multiple dma offset ranges (Jim Quinlan)

    - support per-node DMA CMA areas (Barry Song)

    - increase the default seg boundary limit (Nicolin Chen)

    - misc fixes (Robin Murphy, Thomas Tai, Xu Wang)

    - various cleanups

    * tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping: (63 commits)
    ARM/ixp4xx: add a missing include of dma-map-ops.h
    dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handling
    dma-direct: factor out a dma_direct_alloc_from_pool helper
    dma-direct: check for highmem pages in dma_direct_alloc_pages
    dma-mapping: merge <linux/dma-noncoherent.h> into <linux/dma-map-ops.h>
    dma-mapping: move large parts of <linux/dma-direct.h> to kernel/dma
    dma-mapping: move dma-debug.h to kernel/dma/
    dma-mapping: remove <asm/dma-contiguous.h>
    dma-mapping: merge <linux/dma-contiguous.h> into <linux/dma-map-ops.h>
    dma-contiguous: remove dma_contiguous_set_default
    dma-contiguous: remove dev_set_cma_area
    dma-contiguous: remove dma_declare_contiguous
    dma-mapping: split <linux/dma-mapping.h>
    cma: decrease CMA_ALIGNMENT lower limit to 2
    firewire-ohci: use dma_alloc_pages
    dma-iommu: implement ->alloc_noncoherent
    dma-mapping: add new {alloc,free}_noncoherent dma_map_ops methods
    dma-mapping: add a new dma_alloc_pages API
    dma-mapping: remove dma_cache_sync
    53c700: convert to dma_alloc_noncoherent
    ...

    Linus Torvalds
     

14 Oct, 2020

4 commits

    Both of the mm pointers are no longer needed after commit 7a4830c380f3
    ("mm/fork: Pass new vma pointer into copy_page_range()").

    Jason Gunthorpe also reported that the parameter ordering of
    copy_page_range() is odd. While at it, reorder the parameters to be
    logical: (1) always put the dst_* fields before the src_* fields, and
    (2) keep parameters of the same type together.

    [peterx@redhat.com: further reorder some parameters and line format, per Jason]
    Link: https://lkml.kernel.org/r/20201002192647.7161-1-peterx@redhat.com
    [peterx@redhat.com: fix warnings]
    Link: https://lkml.kernel.org/r/20201006200138.GA6026@xz-x1
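
    The resulting prototype is roughly the sketch below (dst before src,
    same-typed parameters grouped; treat the exact signature as
    illustrative):

        int copy_page_range(struct vm_area_struct *dst_vma,
                            struct vm_area_struct *src_vma);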

    Reported-by: Kirill A. Shutemov
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Acked-by: Kirill A. Shutemov
    Link: https://lkml.kernel.org/r/20200930204950.6668-1-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Fix typo/spello of "function".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/e7bf180e-c558-b1d5-9a15-6d9708823c9c@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
    The code declares a vm_area_struct pointer named vma and assigns it
    vmf->vma, so use the vma variable directly here.

    Signed-off-by: Yanfei Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Link: http://lkml.kernel.org/r/20200818084607.37616-1-yanfei.xu@windriver.com
    Signed-off-by: Linus Torvalds

    Yanfei Xu
     
  • It's "pte_alloc_one", not "pte_alloc_pne". Let's fix that.

    Signed-off-by: Yanfei Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200818104339.5310-1-yanfei.xu@windriver.com
    Signed-off-by: Linus Torvalds

    Yanfei Xu
     

09 Oct, 2020

1 commit

  • In commit 70e806e4e645 ("mm: Do early cow for pinned pages during fork()
    for ptes") we write-protected the PTE before doing the page pinning
    check, in order to avoid a race with concurrent fast-GUP pinning (which
    doesn't take the mm semaphore or the page table lock).

    That trick doesn't actually work - it doesn't handle memory ordering
    properly, and doing so would be prohibitively expensive.

    It also isn't really needed. While we're moving in the direction of
    allowing and supporting page pinning without marking the pinned area
    with MADV_DONTFORK, the fact is that we've never really supported this
    kind of odd "concurrent fork() and page pinning", and doing the
    serialization on a pte level is just wrong.

    We can add serialization with a per-mm sequence counter, so we know how
    to solve that race properly, but we'll do that at a more appropriate
    time. Right now this just removes the write protect games.

    It also turns out that the write protect games actually break on Power,
    as reported by Aneesh Kumar:

    "Architecture like ppc64 expects set_pte_at to be not used for updating
    a valid pte. This is further explained in commit 56eecdb912b5 ("mm:
    Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit")"

    and the code triggered a warning there:

    WARNING: CPU: 0 PID: 30613 at arch/powerpc/mm/pgtable.c:185 set_pte_at+0x2a8/0x3a0 arch/powerpc/mm/pgtable.c:185
    Call Trace:
    copy_present_page mm/memory.c:857 [inline]
    copy_present_pte mm/memory.c:899 [inline]
    copy_pte_range mm/memory.c:1014 [inline]
    copy_pmd_range mm/memory.c:1092 [inline]
    copy_pud_range mm/memory.c:1127 [inline]
    copy_p4d_range mm/memory.c:1150 [inline]
    copy_page_range+0x1f6c/0x2cc0 mm/memory.c:1212
    dup_mmap kernel/fork.c:592 [inline]
    dup_mm+0x77c/0xab0 kernel/fork.c:1355
    copy_mm kernel/fork.c:1411 [inline]
    copy_process+0x1f00/0x2740 kernel/fork.c:2070
    _do_fork+0xc4/0x10b0 kernel/fork.c:2429

    Link: https://lore.kernel.org/lkml/CAHk-=wiWr+gO0Ro4LvnJBMs90OiePNyrE3E+pJvc9PzdBShdmw@mail.gmail.com/
    Link: https://lore.kernel.org/linuxppc-dev/20201008092541.398079-1-aneesh.kumar@linux.ibm.com/
    Reported-by: Aneesh Kumar K.V
    Tested-by: Leon Romanovsky
    Cc: Peter Xu
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: Andrew Morton
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Kirill Shutemov
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

06 Oct, 2020

1 commit


28 Sep, 2020

2 commits

  • This allows copy_pte_range() to do early cow if the pages were pinned on
    the source mm.

    Currently we don't have an accurate way to know whether a page is pinned
    or not; the only thing we have is page_maybe_dma_pinned(). However,
    that's good enough for now, especially with the newly added
    mm->has_pinned flag that makes sure we won't affect processes that never
    pinned any pages.

    It would be easier if we can do GFP_KERNEL allocation within
    copy_one_pte(). Unluckily, we can't because we're with the page table
    locks held for both the parent and child processes. So the page
    allocation needs to be done outside copy_one_pte().

    There is some trickery in copy_present_pte(), mainly the wrprotect trick
    to block concurrent fast-gup; the comments in the function explain it
    in place.

    Oleg Nesterov reported a (probably harmless) bug during review: we
    didn't reset entry.val properly in copy_pte_range(), so there is
    potentially a chance to call add_swap_count_continuation() multiple
    times on the same swp entry. However, that should be harmless since,
    even if it happens, add_swap_count_continuation() will return directly
    after noticing that there is enough space for the swp counter. So
    instead of a standalone stable patch, it is touched up in this patch
    directly.
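
    A condensed sketch of the decision (the real check lives in
    copy_present_page(), the new page is pre-allocated outside the PTE
    locks, and argument lists are abbreviated):

        /* fork: copying one present PTE of a COW mapping (condensed) */
        if (unlikely(atomic_read(&src_mm->has_pinned) &&
                     page_maybe_dma_pinned(page))) {
                /* page may be DMA-pinned: give the child its own copy now */
                copy_user_highpage(new_page, page, addr, src_vma);
                /* ... map new_page into the child instead of sharing ... */
        } else {
                /* normal fork path: share the page, write-protect the parent */
                ptep_set_wrprotect(src_mm, addr, src_pte);
        }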

    Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • This prepares for the future work to trigger early cow on pinned pages
    during fork().

    No functional change intended.

    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

24 Sep, 2020

3 commits

  • Commit 09854ba94c6a ("mm: do_wp_page() simplification") reorganized all
    the code around the page re-use vs copy, but in the process also moved
    the final unlock_page() around to after the wp_page_reuse() call.

    That normally doesn't matter - but it means that the unlock_page() is
    now done after releasing the page table lock. Again, not a big deal,
    you'd think.

    But it turns out that it's very wrong indeed, because once we've
    released the page table lock, we've basically lost our only reference to
    the page - the page tables - and it could now be free'd at any time. We
    do hold the mmap_sem, so no actual unmap() can happen, but madvise can
    come in and a MADV_DONTNEED will zap the page range - and free the page.

    So now the page may be free'd just as we're unlocking it, which in turn
    will usually trigger a "Bad page state" error in the freeing path. To
    make matters more confusing, by the time the debug code prints out the
    page state, the unlock has typically completed and everything looks fine
    again.

    This all doesn't happen in any normal situations, but it does trigger
    with the dirtyc0w_child LTP test. And it seems to trigger much more
    easily (but not exclusively) on s390 than elsewhere, probably because
    s390 doesn't do the "batch pages up for freeing after the TLB flush"
    that gives the unlock_page() more time to complete and makes the race
    harder to hit.
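
    The fix is essentially an ordering change: drop the page lock while the
    PTE (protected by the page table lock) still maps the page, and only
    then call wp_page_reuse(), which releases that lock. A sketch:

        /* do_wp_page() reuse path, corrected ordering (sketch) */
        unlock_page(page);      /* safe: PTL still held, PTE still maps the page */
        wp_page_reuse(vmf);     /* this drops the page table lock */
        return VM_FAULT_WRITE;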

    Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
    Link: https://lore.kernel.org/lkml/a46e9bbef2ed4e17778f5615e818526ef848d791.camel@redhat.com/
    Link: https://lore.kernel.org/linux-mm/c41149a8-211e-390b-af1d-d5eee690fecb@linux.alibaba.com/
    Reported-by: Qian Cai
    Reported-by: Alex Shi
    Bisected-and-analyzed-by: Gerald Schaefer
    Tested-by: Gerald Schaefer
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This completes the split of the non-present and present pte cases by
    moving the check for the source pte being present into the single
    caller, which also means that we clearly separate out the very different
    return value case for a non-present pte.

    The present pte case currently always succeeds.

    This is a pure code re-organization with no semantic change: the intent
    is to make it much easier to add a new return case to the present pte
    case for when we do early COW at page table copy time.

    This was split out from the previous commit simply to make it easy to
    visually see that there were no semantic changes from this code
    re-organization.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This is a purely mechanical split of the copy_one_pte() function. It's
    not immediately obvious when looking at the diff because of the
    indentation change, but the way to see what is going on in this commit
    is to use the "-w" flag to not show pure whitespace changes, and you see
    how the first part of copy_one_pte() is simply lifted out into a
    separate function.

    And since the non-present case is marked unlikely, don't make the new
    function be inlined. Not that gcc really seems to care, since it looks
    like it will inline it anyway due to the whole "single callsite for
    static function" logic. In fact, code generation with the function
    split is almost identical to before. But not marking it inline is the
    right thing to do.

    This is pure prep-work and cleanup for subsequent changes.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

06 Sep, 2020

2 commits

  • Merge misc fixes from Andrew Morton:
    "19 patches.

    Subsystems affected by this patch series: MAINTAINERS, ipc, fork,
    checkpatch, lib, and mm (memcg, slub, pagemap, madvise, migration,
    hugetlb)"

    * emailed patches from Andrew Morton :
    include/linux/log2.h: add missing () around n in roundup_pow_of_two()
    mm/khugepaged.c: fix khugepaged's request size in collapse_file
    mm/hugetlb: fix a race between hugetlb sysctl handlers
    mm/hugetlb: try preferred node first when alloc gigantic page from cma
    mm/migrate: preserve soft dirty in remove_migration_pte()
    mm/migrate: remove unnecessary is_zone_device_page() check
    mm/rmap: fixup copying of soft dirty and uffd ptes
    mm/migrate: fixup setting UFFD_WP flag
    mm: madvise: fix vma user-after-free
    checkpatch: fix the usage of capture group ( ... )
    fork: adjust sysctl_max_threads definition to match prototype
    ipc: adjust proc_ipc_sem_dointvec definition to match prototype
    mm: track page table modifications in __apply_to_page_range()
    MAINTAINERS: IA64: mark Status as Odd Fixes only
    MAINTAINERS: add LLVM maintainers
    MAINTAINERS: update Cavium/Marvell entries
    mm: slub: fix conversion of freelist_corrupted()
    mm: memcg: fix memcg reclaim soft lockup
    memcg: fix use-after-free in uncharge_batch

    Linus Torvalds
     
  • __apply_to_page_range() is also used to change and/or allocate
    page-table pages in the vmalloc area of the address space. Make sure
    these changes get synchronized to other page-tables in the system by
    calling arch_sync_kernel_mappings() when necessary.

    The impact appears limited to x86-32, where apply_to_page_range may miss
    updating the PMD. That leads to explosions in drivers like

    BUG: unable to handle page fault for address: fe036000
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    *pde = 00000000
    Oops: 0002 [#1] SMP
    CPU: 3 PID: 1300 Comm: gem_concurrent_ Not tainted 5.9.0-rc1+ #16
    Hardware name: /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015
    EIP: __execlists_context_alloc+0x132/0x2d0 [i915]
    Code: 31 d2 89 f0 e8 2f 55 02 00 89 45 e8 3d 00 f0 ff ff 0f 87 11 01 00 00 8b 4d e8 03 4b 30 b8 5a 5a 5a 5a ba 01 00 00 00 8d 79 04 01 5a 5a 5a 5a c7 81 fc 0f 00 00 5a 5a 5a 5a 83 e7 fc 29 f9 81
    EAX: 5a5a5a5a EBX: f60ca000 ECX: fe036000 EDX: 00000001
    ESI: f43b7340 EDI: fe036004 EBP: f6389cb8 ESP: f6389c9c
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286
    CR0: 80050033 CR2: fe036000 CR3: 2d361000 CR4: 001506d0
    DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    DR6: fffe0ff0 DR7: 00000400
    Call Trace:
    execlists_context_alloc+0x10/0x20 [i915]
    intel_context_alloc_state+0x3f/0x70 [i915]
    __intel_context_do_pin+0x117/0x170 [i915]
    i915_gem_do_execbuffer+0xcc7/0x2500 [i915]
    i915_gem_execbuffer2_ioctl+0xcd/0x1f0 [i915]
    drm_ioctl_kernel+0x8f/0xd0
    drm_ioctl+0x223/0x3d0
    __ia32_sys_ioctl+0x1ab/0x760
    __do_fast_syscall_32+0x3f/0x70
    do_fast_syscall_32+0x29/0x60
    do_SYSENTER_32+0x15/0x20
    entry_SYSENTER_32+0x9f/0xf2
    EIP: 0xb7f28559
    Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
    EAX: ffffffda EBX: 00000005 ECX: c0406469 EDX: bf95556c
    ESI: b7e68000 EDI: c0406469 EBP: 00000005 ESP: bf9554d8
    DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000296
    Modules linked in: i915 x86_pkg_temp_thermal intel_powerclamp crc32_pclmul crc32c_intel intel_cstate intel_uncore intel_gtt drm_kms_helper intel_pch_thermal video button autofs4 i2c_i801 i2c_smbus fan
    CR2: 00000000fe036000

    It looks like kasan, xen and i915 are vulnerable.

    Actual impact is "on thinkpad X60 in 5.9-rc1, screen starts blinking
    after 30-or-so minutes, and machine is unusable"
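
    The shape of the fix, as a sketch (the mask type, sync mask and
    arch_sync_kernel_mappings() come from the existing vmalloc page-table
    tracking; the exact plumbing may differ):

        /* __apply_to_page_range(), sketch: track modified levels, sync them */
        pgtbl_mod_mask mask = 0;

        /* ... walk the range; lower-level helpers OR bits into &mask ... */
        err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask);

        if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
                arch_sync_kernel_mappings(start, start + size);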

    [sfr@canb.auug.org.au: ARCH_PAGE_TABLE_SYNC_MASK needs vmalloc.h]
    Link: https://lkml.kernel.org/r/20200825172508.16800a4f@canb.auug.org.au
    [chris@chris-wilson.co.uk: changelog addition]
    [pavel@ucw.cz: changelog addition]

    Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified")
    Fixes: 86cf69f1d893 ("x86/mm/32: implement arch_sync_kernel_mappings()")
    Signed-off-by: Joerg Roedel
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Tested-by: Chris Wilson [x86-32]
    Tested-by: Pavel Machek
    Acked-by: Linus Torvalds
    Cc: [5.8+]
    Link: https://lkml.kernel.org/r/20200821123746.16904-1-joro@8bytes.org
    Signed-off-by: Linus Torvalds

    Joerg Roedel
     

05 Sep, 2020

3 commits

  • Merge emailed patches from Peter Xu:
    "This is a small series that I picked up from Linus's suggestion to
    simplify cow handling (and also make it more strict) by checking
    against page refcounts rather than mapcounts.

    This makes uffd-wp work again (verified by running upmapsort)"

    Note: this is horrendously bad timing, and making this kind of
    fundamental vm change after -rc3 is not at all how things should work.
    The saving grace is that it really is a nice simplification:
    8 files changed, 29 insertions(+), 120 deletions(-)

    The reason for the bad timing is that it turns out that commit
    17839856fd58 ("gup: document and work around 'COW can break either way'
    issue") broke not just UFFD functionality (as Peter noticed), but Mikulas
    Patocka also reports that it caused issues for strace when running in a
    DAX environment with ext4 on a persistent memory setup.

    And we can't just revert that commit without re-introducing the original
    issue that is a potential security hole, so making COW stricter (and in
    the process much simpler) is a step to then undoing the forced COW that
    broke other uses.

    Link: https://lore.kernel.org/lkml/alpine.LRH.2.02.2009031328040.6929@file01.intranet.prod.int.rdu2.redhat.com/

    * emailed patches from Peter Xu :
    mm: Add PGREUSE counter
    mm/gup: Remove enfornced COW mechanism
    mm/ksm: Remove reuse_ksm_page()
    mm: do_wp_page() simplification

    Linus Torvalds
     
    This adds accounting for the wp_page_reuse() case, where we reused a page for COW.

    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     
    How about we just make sure we're the only possible valid user of the
    page before we bother to reuse it?

    Simplify, simplify, simplify.

    And get rid of the nasty serialization on the page lock at the same time.
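
    A heavily simplified sketch of the new rule (the real code also locks
    the page and re-checks the count; this is just the idea):

        /* do_wp_page(), simplified: reuse only when we're the sole user */
        if (PageAnon(page) && !PageKsm(page) && page_count(page) == 1) {
                wp_page_reuse(vmf);     /* sole reference: write in place */
                return VM_FAULT_WRITE;
        }
        return wp_page_copy(vmf);       /* anything else: just COW */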

    [peterx: add subject prefix]

    Signed-off-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

19 Aug, 2020

1 commit

    Recently we found a regression when running the
    will_it_scale/page_fault3 test on ARM64: over 70% down for the
    multi-process cases and over 20% down for the multi-thread cases. It
    turns out the regression is caused by commit 89b15332af7c ("mm: drop
    mmap_sem before calling balance_dirty_pages() in write fault").

    The test mmaps a memory-sized file and then writes to the mapping; this
    makes all memory dirty and triggers dirty-page throttling, at which
    point that upstream commit releases mmap_sem and retries the page
    fault. The retried page fault sees the correct PTEs already installed
    and just falls through to a spurious TLB flush. The regression is
    caused by the excessive spurious TLB flushes. This is fine on x86 since
    x86's spurious TLB flush is a no-op.

    We could just skip the spurious TLB flush to mitigate the regression.
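
    Presumably the mitigation looks roughly like this in handle_pte_fault()
    (a sketch, not the verbatim diff):

        /* PTE did not need changing: decide whether a spurious flush is needed */
        if (vmf->flags & FAULT_FLAG_TRIED)
                goto unlock;    /* retried fault: PTE already correct, skip it */
        if (vmf->flags & FAULT_FLAG_WRITE)
                flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);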

    Suggested-by: Linus Torvalds
    Reported-by: Xu Yu
    Debugged-by: Xu Yu
    Tested-by: Xu Yu
    Cc: Johannes Weiner
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc:
    Signed-off-by: Yang Shi
    Signed-off-by: Linus Torvalds

    Yang Shi
     

15 Aug, 2020

3 commits

  • Merge more updates from Andrew Morton:
    "Subsystems affected by this patch series: mm/hotfixes, lz4, exec,
    mailmap, mm/thp, autofs, sysctl, mm/kmemleak, mm/misc and lib"

    * emailed patches from Andrew Morton : (35 commits)
    virtio: pci: constify ioreadX() iomem argument (as in generic implementation)
    ntb: intel: constify ioreadX() iomem argument (as in generic implementation)
    rtl818x: constify ioreadX() iomem argument (as in generic implementation)
    iomap: constify ioreadX() iomem argument (as in generic implementation)
    sh: use generic strncpy()
    sh: clkfwk: remove r8/r16/r32
    include/asm-generic/vmlinux.lds.h: align ro_after_init
    mm: annotate a data race in page_zonenum()
    mm/swap.c: annotate data races for lru_rotate_pvecs
    mm/rmap: annotate a data race at tlb_flush_batched
    mm/mempool: fix a data race in mempool_free()
    mm/list_lru: fix a data race in list_lru_count_one
    mm/memcontrol: fix a data race in scan count
    mm/page_counter: fix various data races at memsw
    mm/swapfile: fix and annotate various data races
    mm/filemap.c: fix a data race in filemap_fault()
    mm/swap_state: mark various intentional data races
    mm/page_io: mark various intentional data races
    mm/frontswap: mark various intentional data races
    mm/kmemleak: silence KCSAN splats in checksum
    ...

    Linus Torvalds
     
  • swap_info_struct si.highest_bit, si.swap_map[offset] and si.flags could
    be accessed concurrently separately as noticed by KCSAN,

    === si.highest_bit ===

    write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
    swap_range_alloc+0x81/0x130
    swap_range_alloc at mm/swapfile.c:681
    scan_swap_map_slots+0x371/0xb90
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
    scan_swap_map_slots+0x4a6/0xb90
    scan_swap_map_slots at mm/swapfile.c:892
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 70 PID: 6672 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #3
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    === si.swap_map[offset] ===

    write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
    __swap_entry_free_locked+0x8c/0x100
    __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
    __swap_entry_free.constprop.20+0x69/0xb0
    free_swap_and_cache+0x53/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
    _swap_info_get+0x81/0xa0
    _swap_info_get at mm/swapfile.c:1140
    free_swap_and_cache+0x40/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    === si.flags ===

    write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
    scan_swap_map_slots+0x6fe/0xb50
    scan_swap_map_slots at mm/swapfile.c:887
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0x377/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
    _swap_info_get+0x41/0xa0
    __swap_info_get at mm/swapfile.c:1114
    put_swap_page+0x84/0x490
    __remove_mapping+0x384/0x5f0
    shrink_page_list+0xff1/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    The writes are under si->lock but the reads are not. For si.highest_bit
    and si.swap_map[offset], a data race could trigger logic bugs, so fix
    them by using WRITE_ONCE() for the writes and READ_ONCE() for the
    reads, except for those isolated reads that only compare against zero,
    where a data race causes no harm. Annotate the latter as intentional
    data races using the data_race() macro.

    For si.flags, the readers are only interested in a single bit, so a
    data race there causes no issue either.
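
    The annotation pattern, in sketch form (the labels are illustrative):

        /* writer, under si->lock */
        WRITE_ONCE(si->highest_bit, offset);

        /* lockless reader whose value feeds logic: pair it with READ_ONCE() */
        if (offset > READ_ONCE(si->highest_bit))
                goto check_next;

        /* benign lockless read (only compared against zero): mark intent */
        if (data_race(si->swap_map[offset]))
                goto scan_next;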

    [cai@lca.pw: add a missing annotation for si->flags in memory.c]
    Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
    This removes the code from the COW path to call debug_dma_assert_idle(),
    which was added many years ago.

    Google shows that it hasn't caught anything in the 6+ years we've had it
    apart from a false positive, and Hugh just noticed how it had a very
    unfortunate spinlock serialization in the COW path.

    He fixed that issue the previous commit (a85ffd59bd36: "dma-debug: fix
    debug_dma_assert_idle(), use rcu_read_lock()"), but let's see if anybody
    even notices when we remove this function entirely.

    NOTE! We keep the dma tracking infrastructure that was added by the
    commit that introduced it. Partly to make it easier to resurrect this
    debug code if we ever decide to, and partly because that tracking by pfn
    and offset looks quite reasonable.

    The problem with this debug code was simply that it was expensive and
    didn't seem worth it, not that it was wrong per se.

    Acked-by: Dan Williams
    Acked-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Aug, 2020

6 commits

  • After the cleanup of page fault accounting, gup does not need to pass
    task_struct around any more. Remove that parameter in the whole gup
    stack.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Here're the last pieces of page fault accounting that were still done
    outside handle_mm_fault() where we still have regs==NULL when calling
    handle_mm_fault():

    arch/powerpc/mm/copro_fault.c: copro_handle_mm_fault
    arch/sparc/mm/fault_32.c: force_user_fault
    arch/um/kernel/trap.c: handle_page_fault
    mm/gup.c: faultin_page
    fixup_user_fault
    mm/hmm.c: hmm_vma_fault
    mm/ksm.c: break_ksm

    Some of them have the issue of duplicated accounting for page fault
    retries. Some of them didn't do the accounting at all.

    This patch cleans all of these up by letting handle_mm_fault() do the
    per-task page fault accounting even if regs==NULL (though we'll still
    skip the perf event accounting). With that, we can safely remove all the
    outliers now.

    There's another functional change in that now we account the page faults
    to the caller of gup, rather than the task_struct that was passed into
    the gup code. More information on this can be found at [1].

    After this patch, below things should never be touched again outside
    handle_mm_fault():

    - task_struct.[maj|min]_flt
    - PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]

    [1] https://lore.kernel.org/lkml/CAHk-=wj_V2Tps2QrMn20_W0OJF9xqNh52XSGA42s-ZJ8Y+GyKw@mail.gmail.com/
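
    The accounting helper ends up looking roughly like the sketch below
    (the idea, not the verbatim code):

        static inline void mm_account_fault(struct pt_regs *regs,
                                            unsigned long address,
                                            unsigned int flags, vm_fault_t ret)
        {
                bool major;

                /* only account faults that actually completed */
                if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
                        return;

                major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);
                if (major)
                        current->maj_flt++;
                else
                        current->min_flt++;

                if (!regs)      /* e.g. gup: skip the perf event accounting */
                        return;

                perf_sw_event(major ? PERF_COUNT_SW_PAGE_FAULTS_MAJ :
                                      PERF_COUNT_SW_PAGE_FAULTS_MIN,
                              1, regs, address);
        }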

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Albert Ou
    Cc: Alexander Gordeev
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Chris Zankel
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Gerald Schaefer
    Cc: Greentime Hu
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James E.J. Bottomley
    Cc: John Hubbard
    Cc: Jonas Bonn
    Cc: Ley Foon Tan
    Cc: "Luck, Tony"
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200707225021.200906-25-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Patch series "mm: Page fault accounting cleanups", v5.

    This is v5 of the pf accounting cleanup series. It originates from
    Gerald Schaefer's report a week ago regarding incorrect page fault
    accounting for retried page faults after commit 4064b9827063 ("mm: allow
    VM_FAULT_RETRY for multiple times"):

    https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/

    What this series did:

    - Correct page fault accounting: we account a page fault (no matter
    whether it's from #PF handling, or gup, or anything else) only once,
    with the attempt that completed the fault. For example, page fault
    retries should not be counted in the page fault counters. The same
    applies to the perf events.

    - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
    event is used in an adhoc way across different archs.

    Case (1): for many archs it's done at the entry of a page fault
    handler, so that it will also cover e.g. erroneous faults.

    Case (2): for some other archs, it is only accounted when the page
    fault is resolved successfully.

    Case (3): there're still quite some archs that have not enabled
    this perf event.

    Since this series touches nearly all the archs, we unify this perf
    event to always follow case (1), which is the one that makes the most
    sense. And since we moved the accounting into handle_mm_fault, the
    other two MAJ/MIN perf events are naturally taken care of as well.

    - Unify definition of "major faults": the definition of "major
    fault" is slightly changed when used in accounting (not
    VM_FAULT_MAJOR). More information in patch 1.

    - Always account the page fault to the task that triggered the page
    fault. This does not matter much for #PF handling, but it does for
    gup. More information on this in patch 25.

    Patchset layout:

    Patch 1: Introduced the accounting in handle_mm_fault(), not enabled.
    Patch 2-23: Enable the new accounting for arch #PF handlers one by one.
    Patch 24: Enable the new accounting for the rest outliers (gup, iommu, etc.)
    Patch 25: Cleanup GUP task_struct pointer since it's not needed any more

    This patch (of 25):

    This is a preparation patch to move page fault accounting into the
    general code in handle_mm_fault(). This includes both the per-task
    maj_flt/min_flt counters and the major/minor page fault perf events. To
    do this, the pt_regs pointer is passed into handle_mm_fault().

    PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
    handlers.

    So far, all the pt_regs pointers passed into handle_mm_fault() are
    NULL, so this patch is not intended to introduce any functional change.

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Albert Ou
    Cc: Alexander Gordeev
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Chris Zankel
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Gerald Schaefer
    Cc: Greentime Hu
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James E.J. Bottomley
    Cc: John Hubbard
    Cc: Jonas Bonn
    Cc: Ley Foon Tan
    Cc: "Luck, Tony"
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
    Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Drop the repeated word "to" in two places.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-7-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • This patch implements workingset detection for anonymous LRU. All the
    infrastructure is implemented by the previous patches so this patch just
    activates the workingset detection by installing/retrieving the shadow
    entry and adding refault calculation.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    In the current implementation, a newly created or swapped-in anonymous
    page starts on the active list. Growing the active list results in
    rebalancing the active/inactive lists, so old pages on the active list
    are demoted to the inactive list. Hence, pages on the active list
    aren't protected at all.

    Following is an example of this situation.

    Assume that 50 hot pages on active list. Numbers denote the number of
    pages on active/inactive list (active | inactive).

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(uo) | 50(h)

    3. workload: another 50 newly created (used-once) pages
    50(uo) | 50(uo), swap-out 50(h)

    This patch tries to fix this issue. As with the file LRU, newly created
    or swapped-in anonymous pages will be inserted into the inactive list.
    They are promoted to the active list if enough references happen. This
    simple modification changes the above example as follows.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(h) | 50(uo)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(uo)

    As you can see, hot pages on active list would be protected.

    Note that this implementation has a drawback: a page cannot be promoted
    and will be swapped out if its re-access interval is greater than the
    size of the inactive list but less than the total size (active +
    inactive). To solve this potential issue, a following patch will apply
    workingset detection similar to the one that's already applied to the
    file LRU.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

2 commits

    This function implicitly assumes that the addr passed in is page
    aligned. A non-page-aligned addr could ultimately cause a kernel bug in
    remap_pte_range, as the exit condition of the loop may never be
    satisfied. This patch documents the requirement and explicitly adds a
    check for it.
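
    The added check is presumably along these lines (likely in
    remap_pfn_range(), given the mention of remap_pte_range above):

        if (WARN_ON_ONCE(!PAGE_ALIGNED(addr)))
                return -EINVAL;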

    Signed-off-by: Alex Zhang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200617233512.177519-1-zhangalex@google.com
    Signed-off-by: Linus Torvalds

    Alex Zhang
     
  • In zap_pte_range(), the check for non_swap_entry() and
    is_device_private_entry() is unnecessary since the latter is sufficient to
    determine if the page is a device private page. Remove the test for
    non_swap_entry() to simplify the code and for clarity.
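
    Before and after, roughly (sketch):

        entry = pte_to_swp_entry(ptent);

        /* before: redundant double test */
        if (non_swap_entry(entry) && is_device_private_entry(entry)) { /* ... */ }

        /* after: is_device_private_entry() alone identifies the page */
        if (is_device_private_entry(entry)) { /* ... */ }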

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Acked-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200615175405.4613-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

05 Aug, 2020

1 commit

  • Pull uninitialized_var() macro removal from Kees Cook:
    "This is long overdue, and has hidden too many bugs over the years. The
    series has several "by hand" fixes, and then a trivial treewide
    replacement.

    - Clean up non-trivial uses of uninitialized_var()

    - Update documentation and checkpatch for uninitialized_var() removal

    - Treewide removal of uninitialized_var()"

    * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    compiler: Remove uninitialized_var() macro
    treewide: Remove uninitialized_var() usage
    checkpatch: Remove awareness of uninitialized_var() macro
    mm/debug_vm_pgtable: Remove uninitialized_var() usage
    f2fs: Eliminate usage of uninitialized_var() macro
    media: sur40: Remove uninitialized_var() usage
    KVM: PPC: Book3S PR: Remove uninitialized_var() usage
    clk: spear: Remove uninitialized_var() usage
    clk: st: Remove uninitialized_var() usage
    spi: davinci: Remove uninitialized_var() usage
    ide: Remove uninitialized_var() usage
    rtlwifi: rtl8192cu: Remove uninitialized_var() usage
    b43: Remove uninitialized_var() usage
    drbd: Remove uninitialized_var() usage
    x86/mm/numa: Remove uninitialized_var() usage
    docs: deprecated.rst: Add uninitialized_var()

    Linus Torvalds
     

04 Aug, 2020

1 commit

  • Pull arm64 and cross-arch updates from Catalin Marinas:
    "Here's a slightly wider-spread set of updates for 5.9.

    Going outside the usual arch/arm64/ area is the removal of
    read_barrier_depends() series from Will and the MSI/IOMMU ID
    translation series from Lorenzo.

    The notable arm64 updates include ARMv8.4 TLBI range operations and
    translation level hint, time namespace support, and perf.

    Summary:

    - Removal of the tremendously unpopular read_barrier_depends()
    barrier, which is a NOP on all architectures apart from Alpha, in
    favour of allowing architectures to override READ_ONCE() and do
    whatever dance they need to do to ensure address dependencies
    provide LOAD -> LOAD/STORE ordering.

    This work also offers a potential solution if compilers are shown
    to convert LOAD -> LOAD address dependencies into control
    dependencies (e.g. under LTO), as weakly ordered architectures will
    effectively be able to upgrade READ_ONCE() to smp_load_acquire().
    The latter case is not used yet, but will be discussed further at
    LPC.

    - Make the MSI/IOMMU input/output ID translation PCI agnostic,
    augment the MSI/IOMMU ACPI/OF ID mapping APIs to accept an input ID
    bus-specific parameter and apply the resulting changes to the
    device ID space provided by the Freescale FSL bus.

    - arm64 support for TLBI range operations and translation table level
    hints (part of the ARMv8.4 architecture version).

    - Time namespace support for arm64.

    - Export the virtual and physical address sizes in vmcoreinfo for
    makedumpfile and crash utilities.

    - CPU feature handling cleanups and checks for programmer errors
    (overlapping bit-fields).

    - ACPI updates for arm64: disallow AML accesses to EFI code regions
    and kernel memory.

    - perf updates for arm64.

    - Miscellaneous fixes and cleanups, most notably PLT counting
    optimisation for module loading, recordmcount fix to ignore
    relocations other than R_AARCH64_CALL26, CMA areas reserved for
    gigantic pages on 16K and 64K configurations.

    - Trivial typos, duplicate words"

    Link: http://lkml.kernel.org/r/20200710165203.31284-1-will@kernel.org
    Link: http://lkml.kernel.org/r/20200619082013.13661-1-lorenzo.pieralisi@arm.com

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (82 commits)
    arm64: use IRQ_STACK_SIZE instead of THREAD_SIZE for irq stack
    arm64/mm: save memory access in check_and_switch_context() fast switch path
    arm64: sigcontext.h: delete duplicated word
    arm64: ptrace.h: delete duplicated word
    arm64: pgtable-hwdef.h: delete duplicated words
    bus: fsl-mc: Add ACPI support for fsl-mc
    bus/fsl-mc: Refactor the MSI domain creation in the DPRC driver
    of/irq: Make of_msi_map_rid() PCI bus agnostic
    of/irq: make of_msi_map_get_device_domain() bus agnostic
    dt-bindings: arm: fsl: Add msi-map device-tree binding for fsl-mc bus
    of/device: Add input id to of_dma_configure()
    of/iommu: Make of_map_rid() PCI agnostic
    ACPI/IORT: Add an input ID to acpi_dma_configure()
    ACPI/IORT: Remove useless PCI bus walk
    ACPI/IORT: Make iort_msi_map_rid() PCI agnostic
    ACPI/IORT: Make iort_get_device_domain IRQ domain agnostic
    ACPI/IORT: Make iort_match_node_callback walk the ACPI namespace for NC
    arm64: enable time namespace support
    arm64/vdso: Restrict splitting VVAR VMA
    arm64/vdso: Handle faults on timens page
    ...

    Linus Torvalds
     

25 Jul, 2020

1 commit

  • clang static analysis reports a garbage return

    In file included from mm/memory.c:84:
    mm/memory.c:1612:2: warning: Undefined or garbage value returned to caller [core.uninitialized.UndefReturn]
    return err;
    ^~~~~~~~~~

    The setting of err depends on a loop executing. So initialize err.
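
    The shape of the fix (the loop condition and helper here are
    hypothetical stand-ins for the real ones):

        int err = 0;    /* well defined even if the loop never executes */

        while (have_more_entries()) {           /* hypothetical condition */
                err = process_one_entry();      /* hypothetical helper */
                if (err)
                        break;
        }
        return err;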

    Signed-off-by: Tom Rix
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200703155354.29132-1-trix@redhat.com
    Signed-off-by: Linus Torvalds

    Tom Rix
     

21 Jul, 2020

1 commit


17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

26 Jun, 2020

3 commits

    With a synchronous IO swap device, swap-in is handled directly in the
    fault code. Since the IO cost isn't noted there, LRU balancing could be
    wrongly biased with such a device. Fix it by counting the cost in the
    fault code.
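
    Presumably the fix amounts to noting the IO cost on the synchronous
    swap-in path of do_swap_page(), something like:

        /* synchronous-IO swap device: swap-in handled in the fault path */
        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
        if (page) {
                /* ... read the page in ... */
                lru_note_cost_page(page);  /* let LRU balancing see the cost */
        }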

    Link: http://lkml.kernel.org/r/1592288204-27734-4-git-send-email-iamjoonsoo.kim@lge.com
    Fixes: 314b57fb0460001 ("mm: balance LRU lists based on relative thrashing cache sizing")
    Signed-off-by: Joonsoo Kim
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Calls to pte_offset_map() in vm_insert_pages() are erroneously not
    matched with a call to pte_unmap(). This would cause problems on
    architectures where that is not a no-op.

    This patch does away with the non-traditional locking in the existing
    code, and instead uses pte_offset_map_lock/unlock() as usual,
    incrementing PTE as necessary. The PTE pointer is kept within bounds
    since we clamp it with PTRS_PER_PTE.
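
    The conventional pattern the fix switches to, in sketch form (the loop
    bookkeeping is illustrative):

        spinlock_t *ptl;
        pte_t *start_pte, *pte;
        unsigned int pages_remaining = nr_to_insert;    /* illustrative */

        start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        for (pte = start_pte; pages_remaining; pte++, addr += PAGE_SIZE) {
                /* insert one page at *pte; the batch is clamped by
                 * PTRS_PER_PTE so pte never walks off this page table */
                pages_remaining--;
        }
        pte_unmap_unlock(start_pte, ptl);       /* matching unmap + unlock */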

    Link: http://lkml.kernel.org/r/20200618220446.20284-1-arjunroy.kdev@gmail.com
    Fixes: 8cd3984d81d5 ("mm/memory.c: add vm_insert_pages()")
    Signed-off-by: Arjun Roy
    Acked-by: David Rientjes
    Cc: Eric Dumazet
    Cc: Hugh Dickins
    Cc: Soheil Hassas Yeganeh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjun Roy
     
    do_swap_page() returns error codes from the VM_FAULT* space.
    try_charge() might return -ENOMEM, though, and then do_swap_page()
    simply returns 0, which means success.

    We almost never return ENOMEM for a GFP_KERNEL single page charge,
    except for async OOM handling (oom_disabled v1). So this needs to be
    translated to VM_FAULT_OOM, otherwise the page fault path will not
    notify userspace and wait for an action.
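
    The translation presumably looks like this in do_swap_page() (sketch):

        if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL)) {
                ret = VM_FAULT_OOM;     /* report charge failure as OOM fault */
                goto out_page;
        }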

    Link: http://lkml.kernel.org/r/20200617090238.GL9499@dhcp22.suse.cz
    Fixes: 4c6355b25e8b ("mm: memcontrol: charge swapin pages on instantiation")
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Alex Shi
    Cc: Joonsoo Kim
    Cc: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko