09 Feb, 2020

2 commits

  • Pull vfs file system parameter updates from Al Viro:
    "Saner fs_parser.c guts and data structures. The system-wide registry
    of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
    the horror switch() in fs_parse() that would have to grow another case
    every time something got added to that system-wide registry.

    New syntax types can be added by filesystems easily now, and their
    namespace is that of functions - not of system-wide enum members. IOW,
    they can be shared or kept private and if some turn out to be widely
    useful, we can make them common library helpers, etc., without having
    to do anything whatsoever to fs_parse() itself.

    And we already get that kind of requests - the thing that finally
    pushed me into doing that was "oh, and let's add one for timeouts -
    things like 15s or 2h". If some filesystem really wants that, let them
    do it. Without somebody having to play gatekeeper for the variants
    blessed by direct support in fs_parse(), TYVM.

    Quite a bit of boilerplate is gone. And IMO the data structures make a
    lot more sense now. -200LoC, while we are at it"
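
    As a rough illustration of the reworked interface, a filesystem's
    parameter table and parse hook now look roughly like this (a sketch only;
    the example_* names and the option set are made up, not taken from any
    real filesystem):

    #include <linux/fs_context.h>
    #include <linux/fs_parser.h>

    enum { Opt_mode, Opt_size, Opt_ro };

    /* Each entry pairs a name with a type-checking helper; the helpers behind
     * the fsparam_*() macros are now plain functions, not entries in a
     * system-wide enum, so filesystems can add their own. */
    static const struct fs_parameter_spec example_fs_parameters[] = {
        fsparam_u32oct("mode", Opt_mode),
        fsparam_string("size", Opt_size),
        fsparam_flag("ro", Opt_ro),
        {}
    };

    static int example_parse_param(struct fs_context *fc, struct fs_parameter *param)
    {
        struct fs_parse_result result;
        int opt;

        opt = fs_parse(fc, example_fs_parameters, param, &result);
        if (opt < 0)
            return opt;             /* negative errno on parse failure */

        switch (opt) {
        case Opt_mode:
            /* result.uint_32 holds the parsed octal value */
            break;
        case Opt_size:
            /* the raw string stays available in param->string */
            break;
        case Opt_ro:
            break;
        }
        return 0;
    }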

    * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
    tmpfs: switch to use of invalfc()
    cgroup1: switch to use of errorfc() et.al.
    procfs: switch to use of invalfc()
    hugetlbfs: switch to use of invalfc()
    cramfs: switch to use of errofc() et.al.
    gfs2: switch to use of errorfc() et.al.
    fuse: switch to use errorfc() et.al.
    ceph: use errorfc() and friends instead of spelling the prefix out
    prefix-handling analogues of errorf() and friends
    turn fs_param_is_... into functions
    fs_parse: handle optional arguments sanely
    fs_parse: fold fs_parameter_desc/fs_parameter_spec
    fs_parser: remove fs_parameter_description name field
    add prefix to fs_context->log
    ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
    new primitive: __fs_parse()
    switch rbd and libceph to p_log-based primitives
    struct p_log, variants of warnf() et.al. taking that one instead
    teach logfc() to handle prefices, give it saner calling conventions
    get rid of cg_invalf()
    ...

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:

    - bmap series from cmaiolino

    - getting rid of convolutions in copy_mount_options() (use a couple of
    copy_from_user() instead of the __get_user() crap)

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    saner copy_mount_options()
    fibmap: Reject negative block numbers
    fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
    ecryptfs: drop direct calls to ->bmap
    cachefiles: drop direct usage of ->bmap method.
    fs: Enable bmap() function to properly return errors

    Linus Torvalds
     

08 Feb, 2020

3 commits


07 Feb, 2020

2 commits


04 Feb, 2020

28 commits

  • Merge more updates from Andrew Morton:
    "The rest of MM and the rest of everything else: hotfixes, ipc, misc,
    procfs, lib, cleanups, arm"

    * emailed patches from Andrew Morton : (67 commits)
    ARM: dma-api: fix max_pfn off-by-one error in __dma_supported()
    treewide: remove redundant IS_ERR() before error code check
    include/linux/cpumask.h: don't calculate length of the input string
    lib: new testcases for bitmap_parse{_user}
    lib: rework bitmap_parse()
    lib: make bitmap_parse_user a wrapper on bitmap_parse
    lib: add test for bitmap_parse()
    bitops: more BITS_TO_* macros
    lib/string: add strnchrnul()
    proc: convert everything to "struct proc_ops"
    proc: decouple proc from VFS with "struct proc_ops"
    asm-generic/tlb: provide MMU_GATHER_TABLE_FREE
    asm-generic/tlb: rename HAVE_MMU_GATHER_NO_GATHER
    asm-generic/tlb: rename HAVE_MMU_GATHER_PAGE_SIZE
    asm-generic/tlb: rename HAVE_RCU_TABLE_FREE
    asm-generic/tlb: add missing CONFIG symbol
    asm-gemeric/tlb: remove stray function declarations
    asm-generic/tlb: avoid potential double flush
    mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush
    powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case
    ...

    Linus Torvalds
     
  • Pull drm ttm/mm updates from Dave Airlie:
    "Thomas Hellstrom has some more changes to the TTM layer that needed a
    patch to the mm subsystem.

    This adds a new mm API vmf_insert_mixed_prot to avoid an ugly hack
    that has limitations in the TTM layer"
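
    For illustration, a driver fault handler could use the new helper along
    these lines (a sketch under assumptions: example_fault() and
    example_lookup_pfn() are hypothetical; only vmf_insert_mixed_prot()
    itself comes from this pull):

    #include <linux/mm.h>
    #include <linux/pfn_t.h>

    /* Hypothetical driver helper resolving a file offset to a PFN. */
    extern unsigned long example_lookup_pfn(pgoff_t pgoff);

    static vm_fault_t example_fault(struct vm_fault *vmf)
    {
        struct vm_area_struct *vma = vmf->vma;
        unsigned long pfn = example_lookup_pfn(vmf->pgoff);
        pgprot_t prot = vma->vm_page_prot;     /* may be adjusted per fault */

        /* Insert the PFN with an explicit protection instead of temporarily
         * rewriting vma->vm_page_prot (the old hack). */
        return vmf_insert_mixed_prot(vma, vmf->address,
                                     __pfn_to_pfn_t(pfn, PFN_DEV), prot);
    }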

    * tag 'drm-next-2020-02-04' of git://anongit.freedesktop.org/drm/drm:
    mm, drm/ttm: Fix vm page protection handling
    mm: Add a vmf_insert_mixed_prot() function

    Linus Torvalds
     
  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • As described in the comment, the correct order for freeing pages is:

    1) unhook page
    2) TLB invalidate page
    3) free page

    This order equally applies to page directories.

    Currently there are two correct options:

    - use tlb_remove_page(), when all page directories are full pages and
    there are no further constraints placed by things like software
    walkers (HAVE_FAST_GUP).

    - use MMU_GATHER_RCU_TABLE_FREE and tlb_remove_table() when the
    architecture does not do IPI based TLB invalidate and has
    HAVE_FAST_GUP (or software TLB fill).

    This however leaves architectures that don't have page based directories
    but don't need RCU in a bind. For those, provide MMU_GATHER_TABLE_FREE,
    which provides the independent batching for directories without the
    additional RCU freeing.
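
    For illustration, the same three steps as they appear when tearing down a
    single PTE through an mmu_gather (a sketch of the generic API use, not
    code from this patch):

    #include <asm/tlb.h>

    static void example_zap_one_pte(struct mmu_gather *tlb, struct mm_struct *mm,
                                    pte_t *ptep, unsigned long addr)
    {
        pte_t pte = ptep_get_and_clear(mm, addr, ptep); /* 1) unhook page         */
        tlb_remove_tlb_entry(tlb, ptep, addr);          /* 2) TLB invalidate page */
        tlb_remove_page(tlb, pte_page(pte));            /* 3) free page, only once
                                                              the gather flushes  */
    }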

    Link: http://lkml.kernel.org/r/20200116064531.483522-10-aneesh.kumar@linux.ibm.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Towards a more consistent naming scheme.

    Link: http://lkml.kernel.org/r/20200116064531.483522-9-aneesh.kumar@linux.ibm.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Towards a more consistent naming scheme.

    Link: http://lkml.kernel.org/r/20200116064531.483522-8-aneesh.kumar@linux.ibm.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Towards a more consistent naming scheme.

    [akpm@linux-foundation.org: fix sparc64 Kconfig]
    Link: http://lkml.kernel.org/r/20200116064531.483522-7-aneesh.kumar@linux.ibm.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
    Architectures for which we have hardware walkers of the Linux page table
    should flush the TLB on mmu gather batch allocation failures and batch
    flush. Some architectures like POWER support multiple translation modes
    (hash and radix), and in the case of POWER only the radix translation mode
    needs the above TLBI. This is because for hash translation mode the kernel
    wants to avoid this extra flush, since there are no hardware walkers of
    the Linux page table. With radix translation, the hardware also walks the
    Linux page table and, with that, the kernel needs to make sure to TLB
    invalidate the page walk cache before page table pages are freed.

    More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
    TLB caches for RCU_TABLE_FREE")

    The changes to sparc are to make sure we keep the old behavior since we
    are now removing HAVE_RCU_TABLE_NO_INVALIDATE. The default value for
    tlb_needs_table_invalidate is to always force an invalidate and sparc can
    avoid the table invalidate. Hence we define tlb_needs_table_invalidate to
    false for sparc architecture.
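
    The resulting override presumably looks like this (a sketch of the assumed
    shape: the generic header defaults to forcing the invalidate, and sparc
    opts out):

    /* asm-generic/tlb.h (assumed default) */
    #ifndef tlb_needs_table_invalidate
    #define tlb_needs_table_invalidate() (true)
    #endif

    /* arch/sparc/include/asm/tlb_64.h (assumed): no hardware walker of the
     * Linux page tables, so the extra invalidate can be skipped. */
    #define tlb_needs_table_invalidate() (false)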

    Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
    Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Michael Ellerman [powerpc]
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • struct mm_struct is quite large (~1664 bytes) and so allocating on the
    stack may cause problems as the kernel stack size is small.

    Since ptdump_walk_pgd_level_core() was only allocating the structure so
    that it could modify the pgd argument we can instead introduce a pgd
    override in struct mm_walk and pass this down the call stack to where it
    is needed.

    Since the correct mm_struct is now being passed down, it is now also
    unnecessary to take the mmap_sem semaphore because ptdump_walk_pgd() will
    now take the semaphore on the real mm.

    [steven.price@arm.com: restore missed arm64 changes]
    Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
    Signed-off-by: Steven Price
    Reported-by: Stephen Rothwell
    Cc: Catalin Marinas
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Rather than having to increment the 'depth' number by 1 in ptdump_hole(),
    let's change the meaning of 'level' in note_page() since that makes the
    code simpler.

    Note that for x86, the level numbers were previously increased by 1 in
    commit 45dcd2091363 ("x86/mm/dump_pagetables: Fix printout of p4d level")
    and the comment "Bit 7 has a different meaning" was not updated, so this
    change also makes the code match the comment again.

    Link: http://lkml.kernel.org/r/20191218162402.45610-24-steven.price@arm.com
    Signed-off-by: Steven Price
    Reviewed-by: Catalin Marinas
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Add a generic version of page table dumping that architectures can opt-in
    to.
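
    Roughly, an architecture opting in provides a note_page() callback and the
    address ranges to walk (a sketch of the assumed interface; field and
    function names may differ slightly from the merged code):

    #include <linux/ptdump.h>

    static void example_note_page(struct ptdump_state *st, unsigned long addr,
                                  int level, unsigned long val)
    {
        /* print or accumulate one entry/hole at 'level' covering 'addr' */
    }

    static const struct ptdump_range example_ranges[] = {
        { PAGE_OFFSET, ~0UL },
        { 0, 0 }                        /* terminator */
    };

    static void example_dump(void)
    {
        struct ptdump_state st = {
            .note_page = example_note_page,
            .range     = example_ranges,
        };

        ptdump_walk_pgd(&st, &init_mm, NULL);
    }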

    Link: http://lkml.kernel.org/r/20191218162402.45610-20-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • The pte_hole() callback is called at multiple levels of the page tables.
    Code dumping the kernel page tables needs to know at what depth the
    missing entry is. Add this as an extra parameter to pte_hole(). When the
    depth isn't known (e.g. when processing a vma), -1 is passed.

    The depth that is reported is the actual level where the entry is missing
    (ignoring any folding that is in place), i.e. any levels where
    PTRS_PER_P?D is set to 1 are ignored.

    Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
    natural numbers as levels 2/3/4.
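
    A hypothetical walker callback using the new parameter (sketch):

    #include <linux/pagewalk.h>

    /* Called for missing entries; depth is 0 for PGD up to 4 for PTE, or -1
     * when the depth isn't known (e.g. when walking a VMA). */
    static int example_pte_hole(unsigned long addr, unsigned long next,
                                int depth, struct mm_walk *walk)
    {
        pr_info("hole %#lx-%#lx at depth %d\n", addr, next, depth);
        return 0;
    }

    static const struct mm_walk_ops example_walk_ops = {
        .pte_hole = example_pte_hole,
    };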

    Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
    Signed-off-by: Steven Price
    Tested-by: Zong Li
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
    If walk_pte_range() is called with an 'end' argument that is beyond the
    last page of memory (e.g. ~0UL) then the comparison between 'addr' and
    'end' will always fail and the loop will be infinite. Instead change the
    comparison to >= while accounting for overflow.
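
    The fixed loop has roughly this shape (a sketch; the real code is the
    inner helper of walk_pte_range()):

    static int example_walk_pte_loop(pte_t *pte, unsigned long addr,
                                     unsigned long end, struct mm_walk *walk)
    {
        const struct mm_walk_ops *ops = walk->ops;
        int err;

        for (;;) {
            err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
            if (err)
                return err;
            /* '>=' (not '!=') so that end == ~0UL cannot loop forever */
            if (addr >= end - PAGE_SIZE)
                return 0;
            addr += PAGE_SIZE;
            pte++;
        }
    }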

    Link: http://lkml.kernel.org/r/20191218162402.45610-15-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
    walk_page_range_novma() can be used to walk the page tables of the kernel
    or of firmware. These page tables may contain entries that are not backed
    by a struct page and so it isn't (in general) possible to take the PTE
    lock for the pte_entry() callback. So update walk_pte_range() to only
    take the lock when no_vma==false by splitting out the inner loop to a
    separate function and add a comment explaining the difference to
    walk_page_range_novma().

    Link: http://lkml.kernel.org/r/20191218162402.45610-14-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Since 48684a65b4e3: "mm: pagewalk: fix misbehavior of walk_page_range for
    vma(VM_PFNMAP)", page_table_walk() will report any kernel area as a hole,
    because it lacks a vma.

    This means each arch has re-implemented page table walking when needed,
    for example in the per-arch ptdump walker.

    Remove the requirement to have a vma in the generic code and add a new
    function walk_page_range_novma() which ignores the VMAs and simply walks
    the page tables.
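
    Its assumed signature and a minimal caller (a sketch; the ops table is a
    placeholder such as the example_walk_ops above):

    #include <linux/pagewalk.h>

    /*
     * int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
     *                           unsigned long end,
     *                           const struct mm_walk_ops *ops,
     *                           pgd_t *pgd, void *private);
     */
    static int example_walk_kernel_range(unsigned long start, unsigned long end,
                                         const struct mm_walk_ops *ops,
                                         void *private)
    {
        /* pgd == NULL: walk init_mm's own page tables */
        return walk_page_range_novma(&init_mm, start, end, ops, NULL, private);
    }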

    Link: http://lkml.kernel.org/r/20191218162402.45610-13-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
    ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were no
    users. We're about to add users so reintroduce them, along with
    p4d_entry() as we now have 5 levels of tables.

    Note that commit a00cc7d9dd93d66a ("mm, x86: add support for PUD-sized
    transparent hugepages") already re-added pud_entry() but with different
    semantics to the other callbacks. This commit reverts the semantics back
    to match the other callbacks.

    To support hmm.c which now uses the new semantics of pud_entry() a new
    member ('action') of struct mm_walk is added which allows the callbacks to
    either descend (ACTION_SUBTREE, the default), skip (ACTION_CONTINUE) or
    repeat the callback (ACTION_AGAIN). hmm.c is then updated to call
    pud_trans_huge_lock() itself and make use of the splitting/retry logic of
    the core code.

    After this change pud_entry() is called for all entries, not just
    transparent huge pages.
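
    A sketch of a pud_entry() callback using the new 'action' field (the
    callback itself is hypothetical; the ACTION_* values are as described
    above):

    #include <linux/pagewalk.h>

    static int example_pud_entry(pud_t *pud, unsigned long addr,
                                 unsigned long next, struct mm_walk *walk)
    {
        if (pud_trans_huge(*pud)) {
            /* handled the whole PUD here: don't descend into PMDs/PTEs */
            walk->action = ACTION_CONTINUE;
            return 0;
        }
        /* default is ACTION_SUBTREE: keep walking the lower levels;
         * ACTION_AGAIN would re-run this callback (e.g. after a split). */
        return 0;
    }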

    [arnd@arndb.de: fix unused variable warning]
    Link: http://lkml.kernel.org/r/20200107204607.1533842-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20191218162402.45610-12-steven.price@arm.com
    Signed-off-by: Steven Price
    Signed-off-by: Arnd Bergmann
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Since 5.5-rc1 the last user of this function is gone, so remove the
    functionality.

    See commit
    2ad9d7747c10 ("netfilter: conntrack: free extension area immediately")
    for details.

    Link: http://lkml.kernel.org/r/20191212223442.22141-1-fw@strlen.de
    Signed-off-by: Florian Westphal
    Acked-by: Andrew Morton
    Acked-by: David Rientjes
    Reviewed-by: David Hildenbrand
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Florian Westphal
     
    The callers are only interested in the actual zone; they don't care about
    boundaries. Return the zone instead to simplify.

    Link: http://lkml.kernel.org/r/20200110183308.11849-1-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's drop the basically unused section stuff and simplify.

    Also, let's use a shorter variant to calculate the number of pages to
    the next section boundary.

    Link: http://lkml.kernel.org/r/20191006085646.5768-11-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: Pankaj Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Get rid of the unnecessary local variables.

    Link: http://lkml.kernel.org/r/20191006085646.5768-10-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: "Aneesh Kumar K.V"
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • If we have holes, the holes will automatically get detected and removed
    once we remove the next bigger/smaller section. The extra checks can go.

    Link: http://lkml.kernel.org/r/20191006085646.5768-9-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • With shrink_pgdat_span() out of the way, we now always have a valid zone.

    Link: http://lkml.kernel.org/r/20191006085646.5768-8-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's poison the pages similar to when adding new memory in
    sparse_add_section(). Also call remove_pfn_range_from_zone() from
    memunmap_pages(), so we can poison the memmap from there as well.

    Link: http://lkml.kernel.org/r/20191006085646.5768-7-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6.

    This series fixes the access of uninitialized memmaps when shrinking
    zones/nodes and when removing memory. Also, it contains all fixes for
    crashes that can be triggered when removing certain namespaces using
    memunmap_pages() - ZONE_DEVICE, reported by Aneesh.

    We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
    more involved (we don't have SECTION_IS_ONLINE as an indicator), and
    shrinking is only of limited use (set_zone_contiguous() cannot detect the
    ZONE_DEVICE as contiguous).

    We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount of
    code to a minimum. Shrinking is especially necessary to keep
    zone->contiguous set where possible, especially, on memory unplug of DIMMs
    at zone boundaries.

    --------------------------------------------------------------------------

    Zones are now properly shrunk when offlining memory blocks or when
    onlining failed. This allows to properly shrink zones on memory unplug
    even if the separate memory blocks of a DIMM were onlined to different
    zones or re-onlined to a different zone after offlining.

    Example:

    :/# cat /proc/zoneinfo
    Node 1, zone Movable
    spanned 0
    present 0
    managed 0
    :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
    :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
    :/# cat /proc/zoneinfo
    Node 1, zone Movable
    spanned 98304
    present 65536
    managed 65536
    :/# echo 0 > /sys/devices/system/memory/memory43/online
    :/# cat /proc/zoneinfo
    Node 1, zone Movable
    spanned 32768
    present 32768
    managed 32768
    :/# echo 0 > /sys/devices/system/memory/memory41/online
    :/# cat /proc/zoneinfo
    Node 1, zone Movable
    spanned 0
    present 0
    managed 0

    This patch (of 6):

    The third argument is actually the number of pages. Change the variable name
    from size to nr_pages to indicate this better.

    No functional change in this patch.

    Link: http://lkml.kernel.org/r/20191006085646.5768-3-david@redhat.com
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: David Hildenbrand
    Reviewed-by: Pankaj Gupta
    Reviewed-by: David Hildenbrand
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Let's move it to the header and use the shorter variant from
    mm/page_alloc.c (the original one will also check
    "__highest_present_section_nr + 1", which is not necessary). While at
    it, make the section_nr in next_pfn() const.

    In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
    exceed __highest_present_section_nr, which doesn't make a difference in
    the caller as it is big enough (>= all sane end_pfn).

    Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Kirill A. Shutemov
    Cc: Baoquan He
    Cc: Dan Williams
    Cc: "Jin, Zhi"
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's update the pfn manually whenever we continue the loop. This makes
    the code easier to read but also less error prone (and we can directly fix
    one issue).

    When overlap_memmap_init() returns true, pfn is updated to
    "memblock_region_memory_end_pfn(r)". So it already points at the *next*
    pfn to process. Incrementing the pfn another time is wrong, we might
    leave one uninitialized. I spotted this by inspecting the code, so I have
    no idea if this is relevant in practice (with kernelcore=mirror).

    Link: http://lkml.kernel.org/r/20200113144035.10848-2-david@redhat.com
    Fixes: a9a9e77fbf27 ("mm: move mirrored memory specific code outside of memmap_init_zone")
    Signed-off-by: David Hildenbrand
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Alexander Duyck
    Cc: Pavel Tatashin
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Kirill A. Shutemov
    Cc: Baoquan He
    Cc: Dan Williams
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Jin, Zhi"
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's make sure that all memory holes are actually marked PageReserved(),
    that page_to_pfn() produces reliable results, and that these pages are not
    detected as "mmap" pages due to the mapcount.

    E.g., booting an x86-64 QEMU guest with 4160 MB:

    [ 0.010585] Early memory node ranges
    [ 0.010586] node 0: [mem 0x0000000000001000-0x000000000009efff]
    [ 0.010588] node 0: [mem 0x0000000000100000-0x00000000bffdefff]
    [ 0.010589] node 0: [mem 0x0000000100000000-0x0000000143ffffff]

    max_pfn is 0x144000.

    Before this change:

    [root@localhost ~]# ./page-types -r -a 0x144000,
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000800 16384 64 ___________M_______________________________ mmap
    total 16384 64

    After this change:

    [root@localhost ~]# ./page-types -r -a 0x144000,
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000100000000 16384 64 ___________________________r_______________ reserved
    total 16384 64

    IOW, especially the unavailable physical memory ("memory hole") in the
    last section would not get properly marked PageReserved() and is indicated
    to be "mmap" memory.

    Drop the trace of that function from include/linux/mm.h - nobody else
    needs it, and rename it accordingly.

    Note: The fake zone/node might not be covered by the zone/node span. This
    is not an urgent issue (for now, we had the same node/zone due to the
    zeroing). We'll need a clean way to mark memory holes (e.g., using a page
    type PageHole() if possible or a fake ZONE_INVALID) and eventually stop
    marking these memory holes PageReserved().

    Link: http://lkml.kernel.org/r/20191211163201.17179-4-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Alexey Dobriyan
    Cc: Bob Picco
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Pavel Tatashin
    Cc: Stephen Rothwell
    Cc: Steven Sistare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm: fix max_pfn not falling on section boundary", v2.

    Playing with different memory sizes for an x86-64 guest, I discovered that
    some memmaps (highest section if max_mem does not fall on the section
    boundary) are marked as being valid and online, but contain garbage. We
    have to properly initialize these memmaps.

    Looking at /proc/kpageflags and friends, I found some more issues,
    partially related to this.

    This patch (of 3):

    If max_pfn is not aligned to a section boundary, we can easily run into
    BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a
    memory size that is not a multiple of 128MB (e.g., 4097MB, but also
    4160MB). I was told that on real HW, we can easily have this scenario
    (esp., one of the main reasons sub-section hotadd of devmem was added).

    The issue is, that we have a valid memmap (pfn_valid()) for the whole
    section, and the whole section will be marked "online".
    pfn_to_online_page() will succeed, but the memmap contains garbage.

    E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
    4160M" - (see tools/vm/page-types.c):

    [ 200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
    [ 200.477500] #PF: supervisor read access in kernel mode
    [ 200.478334] #PF: error_code(0x0000) - not-present page
    [ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
    [ 200.479557] Oops: 0000 [#4] SMP NOPTI
    [ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93
    [ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
    [ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
    [ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
    [ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
    [ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
    [ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
    [ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
    [ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
    [ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
    [ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
    [ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
    [ 200.488897] Call Trace:
    [ 200.489115] kpageflags_read+0xe9/0x140
    [ 200.489447] proc_reg_read+0x3c/0x60
    [ 200.489755] vfs_read+0xc2/0x170
    [ 200.490037] ksys_pread64+0x65/0xa0
    [ 200.490352] do_syscall_64+0x5c/0xa0
    [ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
    after cold/hot plugging a DIMM to such a system:

    [root@localhost ~]# cat /proc/kpageflags > /dev/null
    [ 111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
    [ 111.517907] #PF: supervisor read access in kernel mode
    [ 111.518333] #PF: error_code(0x0000) - not-present page
    [ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0

    This patch fixes that by at least zeroing out that memmap (so e.g.,
    page_to_pfn() will not crash). Commit 907ec5fca3dc ("mm: zero remaining
    unavailable struct pages") tried to fix a similar issue, but forgot to
    consider this special case.

    After this patch, there are still problems to solve. E.g., not all of
    these pages falling into a memory hole will actually get initialized later
    and set PageReserved - they are only zeroed out - but at least the
    immediate crashes are gone. A follow-up patch will take care of this.

    Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
    Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
    Signed-off-by: David Hildenbrand
    Tested-by: Daniel Jordan
    Cc: Naoya Horiguchi
    Cc: Pavel Tatashin
    Cc: Andrew Morton
    Cc: Steven Sistare
    Cc: Michal Hocko
    Cc: Daniel Jordan
    Cc: Bob Picco
    Cc: Oscar Salvador
    Cc: Alexey Dobriyan
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Stephen Rothwell
    Cc: [4.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

03 Feb, 2020

1 commit

    Currently, bmap() will either return the physical block number related to
    the requested file offset, or 0 when an error occurs or the requested
    offset maps into a hole.
    This patch makes the changes needed to enable bmap() to properly return
    errors: the return value is now used for the error code, and a pointer
    must be passed to bmap() to be filled with the mapped physical block.

    The behavior of bmap() on return changes to:

    - negative value in case of error
    - zero on success, or when the offset maps into a hole

    In case of a hole, *block will be zero too.

    Since this is a prep patch, for now the only error returned is -EINVAL if
    ->bmap doesn't exist.
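
    Illustrative use of the new calling convention (a sketch; the caller and
    its arguments are hypothetical):

    #include <linux/fs.h>

    static int example_map_block(struct inode *inode, sector_t lblock,
                                 sector_t *pblock)
    {
        sector_t block = lblock;    /* in: logical block, out: physical block */
        int err = bmap(inode, &block);

        if (err)
            return err;             /* e.g. -EINVAL when ->bmap is missing */
        if (block == 0)
            return 0;               /* the offset falls into a hole */

        *pblock = block;
        return 0;
    }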

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Carlos Maiolino
    Signed-off-by: Al Viro

    Carlos Maiolino
     

01 Feb, 2020

4 commits

  • Pull updates from Andrew Morton:
    "Most of -mm and quite a number of other subsystems: hotfixes, scripts,
    ocfs2, misc, lib, binfmt, init, reiserfs, exec, dma-mapping, kcov.

    MM is fairly quiet this time. Holidays, I assume"

    * emailed patches from Andrew Morton : (118 commits)
    kcov: ignore fault-inject and stacktrace
    include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
    execve: warn if process starts with executable stack
    reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
    init/main.c: fix misleading "This architecture does not have kernel memory protection" message
    init/main.c: fix quoted value handling in unknown_bootoption
    init/main.c: remove unnecessary repair_env_string in do_initcall_level
    init/main.c: log arguments and environment passed to init
    fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
    fs/binfmt_elf.c: coredump: delete duplicated overflow check
    fs/binfmt_elf.c: coredump: allocate core ELF header on stack
    fs/binfmt_elf.c: make BAD_ADDR() unlikely
    fs/binfmt_elf.c: better codegen around current->mm
    fs/binfmt_elf.c: don't copy ELF header around
    fs/binfmt_elf.c: fix ->start_code calculation
    fs/binfmt_elf.c: smaller code generation around auxv vector fill
    lib/find_bit.c: uninline helper _find_next_bit()
    lib/find_bit.c: join _find_next_bit{_le}
    uapi: rename ext2_swab() to swab() and share globally in swab.h
    lib/scatterlist.c: adjust indentation in __sg_alloc_table
    ...

    Linus Torvalds
     
  • Pull RISC-V updates from Palmer Dabbelt:
    "This contains a handful of patches for this merge window:

    - Support for kasan

    - 32-bit physical addresses on rv32i-based systems

    - Support for CONFIG_DEBUG_VIRTUAL

    - DT entry for the FU540 GPIO controller, which has recently had a
    device driver merged

    These boot a buildroot-based system on QEMU's virt board for me"

    * tag 'riscv-for-linus-5.6-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
    riscv: dts: Add DT support for SiFive FU540 GPIO driver
    riscv: mm: add support for CONFIG_DEBUG_VIRTUAL
    riscv: keep 32-bit kernel to 32-bit phys_addr_t
    kasan: Add riscv to KASAN documentation.
    riscv: Add KASAN support
    kasan: No KASAN's memmove check if archs don't have it.

    Linus Torvalds
     
  • Don't instrument 3 more files that contain debugging facilities and
    produce large amounts of uninteresting coverage for every syscall.

    The following snippets are sprinkled all over the place in kcov traces
    in a debugging kernel. We already try to disable instrumentation of
    stack unwinding code and of most debug facilities. I guess we did not
    use fault-inject.c at the time, and stacktrace.c was somehow missed (or
    something has changed in kernel/configs). This change both speeds up
    kcov (kernel doesn't need to store these PCs, user-space doesn't need to
    process them) and frees trace buffer capacity for more useful coverage.

    should_fail
    lib/fault-inject.c:149
    fail_dump
    lib/fault-inject.c:45

    stack_trace_save
    kernel/stacktrace.c:124
    stack_trace_consume_entry
    kernel/stacktrace.c:86
    stack_trace_consume_entry
    kernel/stacktrace.c:89
    ... a hundred frames skipped ...
    stack_trace_consume_entry
    kernel/stacktrace.c:93
    stack_trace_consume_entry
    kernel/stacktrace.c:86
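
    The mechanism is the usual kbuild knob, e.g. (a sketch for the two files
    named above; object names follow their Makefiles):

    # lib/Makefile: don't instrument fault injection for KCOV
    KCOV_INSTRUMENT_fault-inject.o := n

    # kernel/Makefile: same for the stack trace helpers
    KCOV_INSTRUMENT_stacktrace.o := n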

    Link: http://lkml.kernel.org/r/20200116111449.217744-1-dvyukov@gmail.com
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • The "pool" pointer can be NULL at the end of the init_zswap(). (We
    would allocate a new pool later in that situation)

    So in the error handling then we need to make sure pool is a valid
    pointer before calling "zswap_pool_destroy(pool);" because that function
    dereferences the argument.
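
    I.e. the error path needs a guard along these lines (sketch):

    if (pool)
        zswap_pool_destroy(pool);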

    Link: http://lkml.kernel.org/r/20200114050902.og32fkllkod5ycf5@kili.mountain
    Fixes: 93d4dfa9fbd0 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
    Signed-off-by: Dan Carpenter
    Cc: Vitaly Wool
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter