09 Jun, 2020

40 commits

  • read_code operates on user addresses.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/20200515143646.3857579-27-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Only build read_code when binary formats that use it are built into the
    kernel.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/20200515143646.3857579-26-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Rename the current flush_icache_range to flush_icache_user_range, as
    commit ae92ef8a4424 ("[PATCH] flush icache in correct context")
    suggests an assumption that it operates on user addresses. Add a
    flush_icache_range wrapper around it that is a no-op for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Geert Uytterhoeven
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20200515143646.3857579-25-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • flush_icache_user_range will be the name for a generic primitive. Move
    the arm name so that arm already has an implementation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Russell King
    Link: http://lkml.kernel.org/r/20200515143646.3857579-24-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • The Xtensa implementation of flush_icache_range seems to be able to cope
    with user addresses. Just define flush_icache_user_range to
    flush_icache_range.

    [jcmvbkbc@gmail.com: fix flush_icache_user_range in noMMU configs]
    Link: http://lkml.kernel.org/r/20200525221556.4270-1-jcmvbkbc@gmail.com

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Max Filippov
    Signed-off-by: Andrew Morton
    Cc: Chris Zankel
    Cc: Max Filippov
    Link: http://lkml.kernel.org/r/20200515143646.3857579-23-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • The SuperH implementation of flush_icache_range seems to be able to cope
    with user addresses. Just define flush_icache_user_range to
    flush_icache_range.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Link: http://lkml.kernel.org/r/20200515143646.3857579-22-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Define flush_icache_user_range to flush_icache_range unless the
    architecture provides its own implementation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Link: http://lkml.kernel.org/r/20200515143646.3857579-21-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
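
The fallback described in this commit (define flush_icache_user_range to flush_icache_range unless the architecture provides its own) can be sketched in plain C as below. This is a userspace illustration only: the stub flush_icache_range and the icache_flush_calls counter are assumptions added so the effect is observable, not kernel code.

```c
/* Counts calls so the effect of the fallback is observable in this sketch. */
static unsigned int icache_flush_calls;

/*
 * Stand-in for an architecture's flush_icache_range(); a real one would
 * perform cache maintenance for the range [start, end).
 */
static void flush_icache_range(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	icache_flush_calls++;
}

/*
 * An architecture with its own user-range flush would define
 * flush_icache_user_range before this point. Otherwise fall back to
 * the kernel-range variant.
 */
#ifndef flush_icache_user_range
#define flush_icache_user_range flush_icache_range
#endif
```

With no override in place, a call to flush_icache_user_range(start, end) ends up in the flush_icache_range stub.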
     
  • The function currently known as flush_icache_user_range only operates on
    a single page. Rename it to flush_icache_user_page as we'll need the
    name flush_icache_user_range for something else soon.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Geert Uytterhoeven
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Vincent Chen
    Cc: Jonas Bonn
    Cc: Stefan Kristiansson
    Cc: Stafford Horne
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Cc: Albert Ou
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Mark Rutland
    Cc: Alexander Shishkin
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Link: http://lkml.kernel.org/r/20200515143646.3857579-20-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • flush_icache_user_range is only used by <asm-generic/cacheflush.h>, so
    remove it from the architectures that implement it, but don't use
    <asm-generic/cacheflush.h>.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Russell King
    Cc: "David S. Miller"
    Cc: Guan Xuetao
    Link: http://lkml.kernel.org/r/20200515143646.3857579-19-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • RISC-V needs almost no cache flushing routines of its own. Rely on
    asm-generic/cacheflush.h for the defaults.

    Also remove the pointless __KERNEL__ ifdef while we're at it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Reviewed-by: Palmer Dabbelt
    Acked-by: Palmer Dabbelt
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Cc: Albert Ou
    Link: http://lkml.kernel.org/r/20200515143646.3857579-18-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Power needs almost no cache flushing routines of its own. Rely on
    asm-generic/cacheflush.h for the defaults.

    Also remove the pointless __KERNEL__ ifdef while we're at it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Link: http://lkml.kernel.org/r/20200515143646.3857579-17-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • OpenRISC needs almost no cache flushing routines of its own. Rely on
    asm-generic/cacheflush.h for the defaults.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Jonas Bonn
    Cc: Stefan Kristiansson
    Cc: Stafford Horne
    Link: http://lkml.kernel.org/r/20200515143646.3857579-16-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • m68knommu needs almost no cache flushing routines of its own. Rely on
    asm-generic/cacheflush.h for the defaults.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Greg Ungerer
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20200515143646.3857579-15-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Microblaze needs almost no cache flushing routines of its own. Rely on
    asm-generic/cacheflush.h for the defaults.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Michal Simek
    Link: http://lkml.kernel.org/r/20200515143646.3857579-14-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • IA64 needs almost no cache flushing routines of its own. Rely on
    asm-generic/cacheflush.h for the defaults.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Tony Luck
    Cc: Fenghua Yu
    Link: http://lkml.kernel.org/r/20200515143646.3857579-13-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Hexagon needs almost no cache flushing routines of its own. Rely on
    asm-generic/cacheflush.h for the defaults.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Brian Cain
    Link: http://lkml.kernel.org/r/20200515143646.3857579-12-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • C6x needs almost no cache flushing routines of its own. Rely on
    asm-generic/cacheflush.h for the defaults.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Mark Salter
    Cc: Aurelien Jacquiot
    Link: http://lkml.kernel.org/r/20200515143646.3857579-11-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • ARM64 needs almost no cache flushing routines of its own. Rely on
    asm-generic/cacheflush.h for the defaults.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Catalin Marinas
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200515143646.3857579-10-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Alpha needs almost no cache flushing routines of its own. Rely on
    asm-generic/cacheflush.h for the defaults.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Link: http://lkml.kernel.org/r/20200515143646.3857579-9-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • There is a magic ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE cpp symbol that
    guards non-stub availability of flush_dcache_page. Use that to check
    if flush_dcache_page is implemented.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Link: http://lkml.kernel.org/r/20200515143646.3857579-8-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
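
A minimal sketch of the guard pattern this commit describes, written as standalone C. The struct page layout here is a hypothetical stand-in, and the stub body is an assumption; only the ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE symbol and the flush_dcache_page name come from the commit message.

```c
/* Hypothetical stand-in for the kernel's struct page. */
struct page {
	int dirty_lines;
};

/*
 * An architecture with a real flush_dcache_page() would define
 * ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE (to a non-zero value) before this
 * point. If it did not, provide an empty stub and record that fact.
 */
#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
static inline void flush_dcache_page(struct page *page)
{
	(void)page;	/* nothing to do when no implementation exists */
}
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 0
#endif
```

Checking the cpp symbol instead of testing whether flush_dcache_page is a macro lets architectures implement it as either a macro or a function.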
     
  • This seems to lead to some crazy include loops when using
    asm-generic/cacheflush.h on more architectures, so leave it to the arch
    header for now.

    [hch@lst.de: fix warning]
    Link: http://lkml.kernel.org/r/20200520173520.GA11199@lst.de

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Will Deacon
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Anton Ivanov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Dan Williams
    Cc: Vishal Verma
    Cc: Dave Jiang
    Cc: Keith Busch
    Cc: Ira Weiny
    Cc: Arnd Bergmann
    Link: http://lkml.kernel.org/r/20200515143646.3857579-7-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • cacheflush.h uses a somewhat too generic include guard name that clashes
    with various arch files. Use a more specific one.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Link: http://lkml.kernel.org/r/20200515143646.3857579-6-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • flush_cache_user_range is an ARMism not used by any generic or unicore32
    specific code.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Guan Xuetao
    Link: http://lkml.kernel.org/r/20200515143646.3857579-5-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • flush_icache_user_range is only used by copy_to_user_page, which is only
    used by core VM code.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Link: http://lkml.kernel.org/r/20200515143646.3857579-4-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • flush_icache_page is only used by mm/memory.c.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Greentime Hu
    Cc: Vincent Chen
    Link: http://lkml.kernel.org/r/20200515143646.3857579-3-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch series "sort out the flush_icache_range mess", v2.

    flush_icache_range is mostly used for kernel addresses, except for the
    following cases:

    - the nommu brk and mmap implementations

    - the read_code helper that is only used for binfmt_flat,
    binfmt_elf_fdpic, and binfmt_aout including the broken
    ia32 compat version

    - binfmt_flat itself

    none of which are really used by a typical MMU-enabled kernel, as a.out
    can only be built for alpha and m68k to start with.

    But strangely enough commit ae92ef8a4424 ("[PATCH] flush icache in
    correct context") added a "set_fs(KERNEL_DS)" around the
    flush_icache_range call in the module loader, because apparently m68k
    assumed user pointers.

    This series first cleans up the cacheflush implementations, largely by
    switching as much as possible to the asm-generic version after a few
    preparations, then moves the misnamed current flush_icache_user_range to
    a new name, to finally introduce a real flush_icache_user_range to be
    used for the above use cases to flush the instruction cache for a
    userspace address range. The last patch then drops the set_fs in the
    module code and moves it into the m68k implementation.

    This patch (of 29):

    The arguments passed look bogus; try to fix them to something that seems
    to make sense.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Roman Zippel
    Cc: Jessica Yu
    Cc: Michal Simek
    Cc: Albert Ou
    Cc: Alexander Shishkin
    Cc: Alexander Viro
    Cc: Alexei Starovoitov
    Cc: Anton Ivanov
    Cc: Arnaldo Carvalho de Melo
    Cc: Aurelien Jacquiot
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Chris Zankel
    Cc: Daniel Borkmann
    Cc: Dan Williams
    Cc: Dave Jiang
    Cc: "David S. Miller"
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ira Weiny
    Cc: Ivan Kokshaysky
    Cc: Jeff Dike
    Cc: Jiri Olsa
    Cc: Jonas Bonn
    Cc: Keith Busch
    Cc: Mark Rutland
    Cc: Mark Salter
    Cc: Martin KaFai Lau
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Namhyung Kim
    Cc: Nick Piggin
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Richard Henderson
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Song Liu
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vishal Verma
    Cc: Will Deacon
    Cc: Yonghong Song
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200515143646.3857579-1-hch@lst.de
    Link: http://lkml.kernel.org/r/20200515143646.3857579-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • This code was using get_user_pages*(), in approximately a "Case 5"
    scenario (accessing the data within a page), using the categorization
    from [1]. That means that it's time to convert the get_user_pages*() +
    put_page() calls to pin_user_pages*() + unpin_user_pages() calls.

    There is some helpful background in [2]: basically, this is a small part
    of fixing a long-standing disconnect between pinning pages, and file
    systems' use of those pages.

    [1] Documentation/core-api/pin_user_pages.rst

    [2] "Explicit pinning of user-space pages":
    https://lwn.net/Articles/807108/

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Acked-by: Michael S. Tsirkin
    Acked-by: Pankaj Gupta
    Cc: Jason Wang
    Cc: Dave Chinner
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Souptick Joarder
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200529234309.484480-3-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Patch series "vhost, docs: convert to pin_user_pages(), new "case 5""

    It recently became clear to me that there are some get_user_pages*()
    callers that don't fit neatly into any of the four cases that are so far
    listed in pin_user_pages.rst. vhost.c is one of those.

    Add a Case 5 to the documentation, and refer to that when converting
    vhost.c.

    Thanks to Jan Kara for helping me (again) in understanding the
    interaction between get_user_pages() and page writeback [1].

    This is based on today's mmotm, which has a nearby patch to
    pin_user_pages.rst that rewords cases 3 and 4.

    Note that I have only compile-tested the vhost.c patch, although that
    does also include cross-compiling for a few other arches. Any run-time
    testing would be greatly appreciated.

    [1] https://lore.kernel.org/r/20200529070343.GL14550@quack2.suse.cz

    This patch (of 2):

    There are four cases listed in pin_user_pages.rst. These are intended
    to help developers figure out whether to use get_user_pages*(), or
    pin_user_pages*(). However, the four cases do not cover all the
    situations. For example, drivers/vhost/vhost.c has a "pin, write to
    page, set page dirty, unpin" case.

    Add a fifth case, to help explain that there is a general pattern that
    requires pin_user_pages*() API calls.

    [jhubbard@nvidia.com: v2]
    Link: http://lkml.kernel.org/r/20200601052633.853874-2-jhubbard@nvidia.com

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: Dave Chinner
    Cc: Jonathan Corbet
    Cc: Souptick Joarder
    Cc: "Michael S . Tsirkin"
    Cc: Jason Wang
    Link: http://lkml.kernel.org/r/20200529234309.484480-1-jhubbard@nvidia.com
    Link: http://lkml.kernel.org/r/20200529234309.484480-2-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
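
The "pin, write to page, set page dirty, unpin" sequence that this commit calls Case 5 can be illustrated with a toy userspace model. Everything below is hypothetical (struct toy_page, the toy_* helpers, and the TOY_PIN_BIAS value); it is not the kernel's pin_user_pages*() API and only shows the ordering of the four steps.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 64
#define TOY_PIN_BIAS  1024u	/* assumed bias so pins stand out from plain refs */

/* Hypothetical page model: a refcount, a dirty bit and some data. */
struct toy_page {
	unsigned int refcount;
	bool dirty;
	unsigned char data[TOY_PAGE_SIZE];
};

/* Step 1: pin the page by raising its refcount by the bias. */
static void toy_pin(struct toy_page *p)
{
	p->refcount += TOY_PIN_BIAS;
}

/* A page "may be pinned" in this model if its refcount reached the bias. */
static bool toy_maybe_pinned(const struct toy_page *p)
{
	return p->refcount >= TOY_PIN_BIAS;
}

/* Steps 3 and 4: optionally mark the page dirty, then drop the pin. */
static void toy_unpin(struct toy_page *p, bool make_dirty)
{
	if (make_dirty)
		p->dirty = true;
	p->refcount -= TOY_PIN_BIAS;
}

/* The whole Case 5 pattern: pin, write to the page, set dirty, unpin. */
static void toy_case5_write(struct toy_page *p, const void *src, size_t len)
{
	toy_pin(p);
	memcpy(p->data, src, len < TOY_PAGE_SIZE ? len : TOY_PAGE_SIZE);
	toy_unpin(p, true);
}
```

The point of the model is that the write happens strictly between the pin and the unpin, and the dirty marking is done before the pin is released.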
     
  • All of the pin_user_pages*() API calls will cause pages to be
    dma-pinned. As such, they are all suitable for DMA, RDMA, and/or
    Direct IO.

    The documentation should say so, but it was instead saying that three of
    the API calls were only suitable for Direct IO. This was discovered
    when a reviewer wondered why an API call that specifically recommended
    against Case 2 (DMA/RDMA) was being used in a DMA situation [1].

    Fix this by simply deleting those claims. The gup.c comments already
    refer to the more extensive Documentation/core-api/pin_user_pages.rst,
    which does have the correct guidance. So let's just write it once,
    there.

    [1] https://lore.kernel.org/r/20200529074658.GM30374@kadam

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Acked-by: Souptick Joarder
    Cc: Dan Carpenter
    Cc: Jan Kara
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200529084515.46259-1-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • This code was using get_user_pages*(), and all of the callers so far
    were in a "Case 2" scenario (DMA/RDMA), using the categorization from [1].

    That means that it's time to convert the get_user_pages*() + put_page()
    calls to pin_user_pages*() + unpin_user_pages() calls.

    There is some helpful background in [2]: basically, this is a small part
    of fixing a long-standing disconnect between pinning pages, and file
    systems' use of those pages.

    [1] Documentation/core-api/pin_user_pages.rst

    [2] "Explicit pinning of user-space pages":
    https://lwn.net/Articles/807108/

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Cc: Daniel Vetter
    Cc: Jérôme Glisse
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Pankaj Gupta
    Cc: Souptick Joarder
    Link: http://lkml.kernel.org/r/20200527223243.884385-3-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Patch series "mm/gup: introduce pin_user_pages_locked(), use it in frame_vector.c", v2.

    This adds yet one more pin_user_pages*() variant, and uses that to
    convert mm/frame_vector.c.

    With this, along with maybe 20 or 30 other recent patches in various
    trees, we are close to having the relevant gup call sites
    converted--with the notable exception of the bio/block layer.

    This patch (of 2):

    Introduce pin_user_pages_locked(), which is nearly identical to
    get_user_pages_locked() except that it sets FOLL_PIN and rejects
    FOLL_GET.

    As with other pairs of get_user_pages*() and pin_user_pages*() API
    calls, it's prudent to assert that FOLL_PIN is *not* set in the
    get_user_pages*() call, so add that as part of this.

    [jhubbard@nvidia.com: v2]
    Link: http://lkml.kernel.org/r/20200531234131.770697-2-jhubbard@nvidia.com

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Daniel Vetter
    Cc: Jérôme Glisse
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Souptick Joarder
    Link: http://lkml.kernel.org/r/20200531234131.770697-1-jhubbard@nvidia.com
    Link: http://lkml.kernel.org/r/20200527223243.884385-1-jhubbard@nvidia.com
    Link: http://lkml.kernel.org/r/20200527223243.884385-2-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
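
The flag rules stated in this commit (the pin variant sets FOLL_PIN and rejects FOLL_GET, while the get variant asserts that FOLL_PIN is not set) can be sketched as two small checks. The toy_* names and the numeric flag values are assumptions for illustration; the functions only validate flags and do not touch any pages.

```c
#include <errno.h>

/* Illustrative flag values; treat them as assumptions, not kernel ABI. */
#define TOY_FOLL_GET 0x04u
#define TOY_FOLL_PIN 0x40000u

/*
 * Model of the pin variant: FOLL_GET from the caller is an error,
 * otherwise FOLL_PIN is added to the effective flags.
 */
static int toy_pin_locked_flags(unsigned int gup_flags, unsigned int *effective)
{
	if (gup_flags & TOY_FOLL_GET)
		return -EINVAL;
	*effective = gup_flags | TOY_FOLL_PIN;
	return 0;
}

/*
 * Model of the get variant: FOLL_PIN from the caller is an error,
 * otherwise the flags pass through unchanged in this sketch.
 */
static int toy_get_locked_flags(unsigned int gup_flags, unsigned int *effective)
{
	if (gup_flags & TOY_FOLL_PIN)
		return -EINVAL;
	*effective = gup_flags;
	return 0;
}
```

Rejecting the "wrong" flag in each variant keeps callers from mixing the two reference-counting schemes on the same pages.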
     
  • Update case 3 so that it covers the use of mmu notifiers, for hardware
    that does or does not have replayable page faults.

    Also, elaborate case 4 slightly, as it was quite cryptic.

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Cc: Daniel Vetter
    Cc: Jérôme Glisse
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Jonathan Corbet
    Link: http://lkml.kernel.org/r/20200527194953.11130-1-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Rename the __get_user_pages_fast() API to get_user_pages_fast_only() to
    align with pin_user_pages_fast_only().

    As part of this, get rid of the write parameter. Instead, the caller
    passes FOLL_WRITE to get_user_pages_fast_only(). This does not change
    any existing functionality of the API.

    All the callers are changed to pass FOLL_WRITE.

    Also introduce get_user_page_fast_only(), and use it in a few places
    that hard-code nr_pages to 1.

    Update the documentation of the API.

    Signed-off-by: Souptick Joarder
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Reviewed-by: Paul Mackerras [arch/powerpc/kvm]
    Cc: Matthew Wilcox
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Alexander Shishkin
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Paolo Bonzini
    Cc: Stephen Rothwell
    Cc: Mike Rapoport
    Cc: Aneesh Kumar K.V
    Cc: Michal Suchanek
    Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
    Signed-off-by: Linus Torvalds

    Souptick Joarder
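
The relationship between the multi-page call and the single-page helper introduced here can be sketched as a thin wrapper that hard-codes nr_pages to 1 and takes a flags argument instead of a write boolean. This is a userspace toy: struct toy_page, the two-entry toy_pages table, the lookup stub, and the TOY_FOLL_WRITE value are all assumptions, not the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_FOLL_WRITE 0x01u	/* illustrative value */

/* Hypothetical page model: only records whether writes are allowed. */
struct toy_page {
	bool writable;
};

/* Two fake pages, addressed by index: page 0 is writable, page 1 is not. */
static struct toy_page toy_pages[2] = { { true }, { false } };

/*
 * Stub for the multi-page call: returns how many pages were obtained.
 * It stops early on an unknown index, or when write access is
 * requested through the flags for a read-only page.
 */
static int toy_get_user_pages_fast_only(unsigned long start, int nr_pages,
					unsigned int gup_flags,
					struct toy_page **pages)
{
	int got = 0;

	for (int i = 0; i < nr_pages; i++) {
		unsigned long idx = start + (unsigned long)i;

		if (idx >= 2)
			break;
		if ((gup_flags & TOY_FOLL_WRITE) && !toy_pages[idx].writable)
			break;
		pages[got++] = &toy_pages[idx];
	}
	return got;
}

/* Single-page helper: succeeds only if exactly one page was obtained. */
static bool toy_get_user_page_fast_only(unsigned long addr,
					unsigned int gup_flags,
					struct toy_page **pagep)
{
	return toy_get_user_pages_fast_only(addr, 1, gup_flags, pagep) == 1;
}
```

A caller that previously passed write = 1 would pass TOY_FOLL_WRITE in this model, and a read-only caller would pass 0.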
     
  • Users with SYS_ADMIN capability can add arbitrary taint flags to the
    running kernel by writing to /proc/sys/kernel/tainted or issuing the
    command 'sysctl -w kernel.tainted=...'. This interface, however,
    accepts any integer value, which might cause an invalid set of flags
    to be committed to the tainted_mask bitset.

    This patch introduces a simple way for proc_taint() to ignore any
    invalid bits coming from the user input before committing those
    bits to the kernel tainted_mask.

    Signed-off-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Reviewed-by: Luis Chamberlain
    Cc: Kees Cook
    Cc: Iurii Zaikin
    Cc: "Theodore Ts'o"
    Link: http://lkml.kernel.org/r/20200512223946.888020-1-aquini@redhat.com
    Signed-off-by: Linus Torvalds

    Rafael Aquini
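
One way to ignore invalid bits, as the commit above describes, is to mask the user-supplied value against the set of defined flags before using it. The sketch below is a userspace illustration under assumptions: TOY_TAINT_FLAGS_COUNT is an illustrative count, and toy_sanitize_taint is a hypothetical helper, not the actual proc_taint() code.

```c
/* Assumed number of defined taint flags; the real count may differ. */
#define TOY_TAINT_FLAGS_COUNT 18

/*
 * Keep only the bits that correspond to defined flags and silently
 * drop everything above them.
 */
static unsigned long toy_sanitize_taint(unsigned long requested)
{
	return requested & ((1UL << TOY_TAINT_FLAGS_COUNT) - 1);
}
```

With this mask, a write that sets both a valid low bit and an undefined high bit only commits the valid bit.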
     
  • Usually when the kernel reaches an oops condition, it's a point of no
    return; if not enough debug information is available in the kernel
    splat, one of the last resorts is to collect a kernel crash dump
    and analyze it. The problem with this approach is that in order to
    collect the dump, a panic is required (to kexec-load the crash kernel).
    In an environment with multiple virtual machines, users may prefer to
    try living with the oops, at least until they can properly shut down
    their VMs or finish their important tasks.

    This patch implements a way to collect a bit more debug detail when an
    oops event is reached, by printing all CPUs' backtraces through the
    use of NMIs (on architectures that support that). The sysctl added
    (and documented) here is called "oops_all_cpu_backtrace", and when set
    will (as the name suggests) dump all CPUs' backtraces.

    Far from ideal, this may be the last option for users who for some
    reason cannot panic on oops. Most of the time oopses are clear enough
    to indicate the kernel portion that must be investigated, but in virtual
    environments it's possible to observe hypervisor/KVM issues that could
    lead to oopses shown in other guests' CPUs (like virtual APIC crashes).
    This patch hence aims to help debug such complex issues without
    resorting to kdump.

    Signed-off-by: Guilherme G. Piccoli
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Cc: Luis Chamberlain
    Cc: Iurii Zaikin
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Randy Dunlap
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200327224116.21030-1-gpiccoli@canonical.com
    Signed-off-by: Linus Torvalds

    Guilherme G. Piccoli
     
  • Commit 401c636a0eeb ("kernel/hung_task.c: show all hung tasks before
    panic") introduced a change so that all CPUs' backtraces are shown
    when a hung task is detected _and_ the sysctl/kernel parameter
    "hung_task_panic" is set. The idea is good, because usually when
    observing deadlocks (that may lead to hung tasks), the culprit is
    another task holding a lock and not necessarily the task detected as
    hung.

    The problem with this approach is that dumping backtraces is a slightly
    expensive task, especially when printing to the console (and especially
    on machines with many CPUs, such as the servers commonly found
    nowadays). So, users that plan to collect a kdump to investigate the
    hung tasks and narrow down the deadlock definitely don't need the CPUs'
    backtraces on dmesg/console, which will delay the panic and pollute the
    log (the crash tool can easily grab all CPUs' traces with the 'bt -a'
    command).

    Also, there's the reciprocal scenario: some users may be interested in
    seeing the CPUs' backtraces but not have the system panic when a hung
    task is detected. The current approach hence is almost like embedding a
    policy in the kernel, by forcing the CPUs' backtrace dump (only) on
    hung_task_panic.

    This patch decouples the panic event on hung task from the CPUs'
    backtrace dump, by creating (and documenting) a new sysctl called
    "hung_task_all_cpu_backtrace", analogous to the approach taken on
    soft/hard lockups, which have both a panic and an "all_cpu_backtrace"
    sysctl to allow individual control. The new mechanism for dumping the
    CPUs' backtraces on hung task detection respects "hung_task_warnings" by
    not dumping the traces in case there are no warnings left.

    Signed-off-by: Guilherme G. Piccoli
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200327223646.20779-1-gpiccoli@canonical.com
    Signed-off-by: Linus Torvalds

    Guilherme G. Piccoli
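
The decoupling described in this commit can be modelled as a small decision function: the panic and the all-CPU backtrace are controlled independently, and the backtrace is suppressed once no warnings are left. The struct and function names below are hypothetical, and treating any non-zero warnings value as "warnings left" is an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the three settings named in the commit. */
struct toy_hung_task_cfg {
	int panic;		/* models hung_task_panic */
	int all_cpu_backtrace;	/* models hung_task_all_cpu_backtrace */
	int warnings;		/* models hung_task_warnings (0: none left) */
};

/* What the model decides to do when a hung task is detected. */
struct toy_hung_task_action {
	bool dump_all_cpus;
	bool panic;
};

static struct toy_hung_task_action
toy_on_hung_task(const struct toy_hung_task_cfg *cfg)
{
	struct toy_hung_task_action act;

	/* Backtraces depend only on their own knob and remaining warnings. */
	act.dump_all_cpus = cfg->all_cpu_backtrace != 0 && cfg->warnings != 0;
	/* The panic decision no longer implies a backtrace dump. */
	act.panic = cfg->panic != 0;
	return act;
}
```

In this model a user collecting a kdump sets only the panic knob, while a user who wants traces without a panic sets only the backtrace knob.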
     
  • After a recent change introduced by Vlastimil's series [0], the kernel
    can now handle sysctl parameters on the kernel command line; the
    series also introduced a simple infrastructure to convert legacy boot
    parameters (that duplicate sysctls) into sysctl aliases.

    This patch converts the watchdog parameters softlockup_panic and
    {hard,soft}lockup_all_cpu_backtrace to use the new alias infrastructure.
    It fixes the documentation too, since the alias only accepts values 0 or
    1, not the full range of integers.

    We also took the opportunity here to improve the documentation of the
    previously converted hung_task_panic (see the patch series [0]) and put
    the alias table in alphabetical order.

    [0] http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz

    Signed-off-by: Guilherme G. Piccoli
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Kees Cook
    Cc: Iurii Zaikin
    Cc: Luis Chamberlain
    Link: http://lkml.kernel.org/r/20200507214624.21911-1-gpiccoli@canonical.com
    Signed-off-by: Linus Torvalds

    Guilherme G. Piccoli
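
The alias idea in this commit can be sketched as a sorted table that maps a legacy boot parameter to a sysctl name, plus a value parser that accepts only 0 or 1. The toy_* names are hypothetical, and the "kernel." prefixed targets are assumptions of this sketch; only the legacy parameter names come from the commit messages in this log.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical alias entry: legacy boot parameter -> sysctl name. */
struct toy_sysctl_alias {
	const char *kernel_param;
	const char *sysctl_param;
};

/* Kept in alphabetical order, as the commit does for its table. */
static const struct toy_sysctl_alias toy_aliases[] = {
	{ "hardlockup_all_cpu_backtrace", "kernel.hardlockup_all_cpu_backtrace" },
	{ "hung_task_panic",              "kernel.hung_task_panic" },
	{ "softlockup_all_cpu_backtrace", "kernel.softlockup_all_cpu_backtrace" },
	{ "softlockup_panic",             "kernel.softlockup_panic" },
};

/* Return the sysctl name for a legacy parameter, or NULL if not aliased. */
static const char *toy_alias_lookup(const char *param)
{
	size_t n = sizeof(toy_aliases) / sizeof(toy_aliases[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(param, toy_aliases[i].kernel_param) == 0)
			return toy_aliases[i].sysctl_param;
	}
	return NULL;
}

/* Accept only "0" or "1"; return 0 on success and -1 otherwise. */
static int toy_parse_bool(const char *val, int *out)
{
	if (strcmp(val, "0") == 0 || strcmp(val, "1") == 0) {
		*out = val[0] - '0';
		return 0;
	}
	return -1;
}
```

The strict parser reflects the documentation fix mentioned above: an aliased parameter takes 0 or 1, not an arbitrary integer.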
     
  • Testing is done by a new parameter debug.test_sysctl.boot_int, which
    defaults to 0. The tester is expected to pass a boot parameter that
    sets it to 1, and the test checks whether it is set to 1.

    To distinguish true failure from parameter not being set, the test
    checks /proc/cmdline for the expected parameter, and whether test_sysctl
    is built-in and not a module.

    [vbabka@suse.cz: skip the new test if boot_int sysctl is not present]
    Link: http://lkml.kernel.org/r/305af605-1e60-cf84-fada-6ce1ca37c102@suse.cz

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Christian Brauner
    Cc: David Rientjes
    Cc: "Eric W . Biederman"
    Cc: Greg Kroah-Hartman
    Cc: "Guilherme G . Piccoli"
    Cc: Iurii Zaikin
    Cc: Ivan Teterevkov
    Cc: Kees Cook
    Cc: Luis Chamberlain
    Cc: Masami Hiramatsu
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20200427180433.7029-6-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
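
The check that distinguishes "parameter was never passed" from a true failure can be sketched as a whole-token search in the command line text. The helper below is a hypothetical C illustration of that idea only; the exact parameter spelling used in the usage note (with a "sysctl." prefix) is an assumption, and the function works on a string rather than reading /proc/cmdline itself.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `param` appears in `cmdline` as a whole
 * space-separated token (a trailing newline also ends a token).
 */
static bool toy_cmdline_has_param(const char *cmdline, const char *param)
{
	size_t len = strlen(param);
	const char *p = cmdline;

	if (len == 0)
		return false;

	while ((p = strstr(p, param)) != NULL) {
		bool starts = (p == cmdline) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ') ||
			    (p[len] == '\n');

		if (starts && ends)
			return true;
		p++;	/* keep searching after a partial match */
	}
	return false;
}
```

A test harness could call this with the contents of /proc/cmdline and a string such as "sysctl.debug.test_sysctl.boot_int=1", and skip the test when it returns false.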
     
  • The testing script recommends CONFIG_TEST_SYSCTL=y, but actually only
    works with CONFIG_TEST_SYSCTL=m. Testing of sysctl setting via boot
    param however requires the test to be built-in, so make sure the test
    script supports it.

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Acked-by: Luis Chamberlain
    Cc: Alexey Dobriyan
    Cc: Christian Brauner
    Cc: David Rientjes
    Cc: "Eric W . Biederman"
    Cc: Greg Kroah-Hartman
    Cc: "Guilherme G . Piccoli"
    Cc: Iurii Zaikin
    Cc: Ivan Teterevkov
    Cc: Kees Cook
    Cc: Masami Hiramatsu
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20200427180433.7029-5-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • We can now handle sysctl parameters on the kernel command line and have
    infrastructure to convert legacy command line options that duplicate
    a sysctl into sysctl aliases.

    This patch converts the hung_task_panic parameter. Note that the sysctl
    handler is more strict and allows only 0 and 1, while the legacy
    parameter allowed any non-zero value. But there is little reason anyone
    would not be using 1.

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Christian Brauner
    Cc: David Rientjes
    Cc: "Eric W . Biederman"
    Cc: Greg Kroah-Hartman
    Cc: "Guilherme G . Piccoli"
    Cc: Iurii Zaikin
    Cc: Ivan Teterevkov
    Cc: Luis Chamberlain
    Cc: Masami Hiramatsu
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20200427180433.7029-4-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka