11 Apr, 2020
26 commits
-
After request_module(), nothing is stopping the module from being
unloaded until someone takes a reference to it via try_module_get().

The WARN_ONCE() in get_fs_type() is thus user-reachable, via userspace
running 'rmmod' concurrently.

Since WARN_ONCE() is for kernel bugs only, not for user-reachable
situations, downgrade this warning to pr_warn_once().

Keep it printed once only, since the intent of this warning is to detect
a bug in modprobe at boot time. Printing the warning more than once
wouldn't really provide any useful extra information.

Fixes: 41124db869b7 ("fs: warn in case userspace lied about modprobe return")
Signed-off-by: Eric Biggers
Signed-off-by: Andrew Morton
Reviewed-by: Jessica Yu
Cc: Alexei Starovoitov
Cc: Greg Kroah-Hartman
Cc: Jeff Vander Stoep
Cc: Jessica Yu
Cc: Kees Cook
Cc: Luis Chamberlain
Cc: NeilBrown
Cc: [4.13+]
Link: http://lkml.kernel.org/r/20200312202552.241885-3-ebiggers@kernel.org
Signed-off-by: Linus Torvalds
-
Patch series "module autoloading fixes and cleanups", v5.

This series fixes a bug where request_module() was reporting success to
kernel code when module autoloading had been completely disabled via
'echo > /proc/sys/kernel/modprobe'.

It also addresses the issues raised on the original thread
(https://lkml.kernel.org/lkml/20200310223731.126894-1-ebiggers@kernel.org/T/#u)
by documenting the modprobe sysctl, adding a self-test for the empty path
case, and downgrading a user-reachable WARN_ONCE().

This patch (of 4):

It's long been possible to disable kernel module autoloading completely
(while still allowing manual module insertion) by setting
/proc/sys/kernel/modprobe to the empty string.

This can be preferable to setting it to a nonexistent file since it
avoids the overhead of an attempted execve(), avoids potential
deadlocks, and avoids the call to security_kernel_module_request() and
thus on SELinux-based systems eliminates the need to write SELinux rules
to dontaudit module_request.

However, when module autoloading is disabled in this way,
request_module() returns 0. This is broken because callers expect 0 to
mean that the module was successfully loaded.

Apparently this was never noticed because this method of disabling
module autoloading isn't used much, and also most callers don't use the
return value of request_module() since it's always necessary to check
whether the module registered its functionality or not anyway.

But improperly returning 0 can indeed confuse a few callers, for example
get_fs_type() in fs/filesystems.c where it causes a WARNING to be hit:

    if (!fs && (request_module("fs-%.*s", len, name) == 0)) {
        fs = __get_fs_type(name, len);
        WARN_ONCE(!fs, "request_module fs-%.*s succeeded, but still no fs?\n", len, name);
    }

This is easily reproduced with:

    echo > /proc/sys/kernel/modprobe
    mount -t NONEXISTENT none /

It causes:

    request_module fs-NONEXISTENT succeeded, but still no fs?
    WARNING: CPU: 1 PID: 1106 at fs/filesystems.c:275 get_fs_type+0xd6/0xf0
    [...]

This should actually use pr_warn_once() rather than WARN_ONCE(), since
it's also user-reachable if userspace immediately unloads the module.
Regardless, request_module() should correctly return an error when it
fails. So let's make it return -ENOENT, which matches the error when
the modprobe binary doesn't exist.

I've also sent patches to document and test this case.
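
The behaviour change described above can be pictured with a small userspace
sketch. This is not the kernel's code: modprobe_path_sketch,
set_modprobe_path_sketch() and request_module_sketch() are hypothetical
stand-ins, and the sketch only models the early-return check, assuming an
empty helper path is reported as -ENOENT rather than as success.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for the buffer behind /proc/sys/kernel/modprobe. */
static char modprobe_path_sketch[256] = "/sbin/modprobe";

/* Assumed helper: models writing a new value to the sysctl. */
static void set_modprobe_path_sketch(const char *path)
{
    strncpy(modprobe_path_sketch, path, sizeof(modprobe_path_sketch) - 1);
    modprobe_path_sketch[sizeof(modprobe_path_sketch) - 1] = '\0';
}

/*
 * Returns -ENOENT when autoloading is disabled via an empty path, so a
 * caller can no longer mistake "nothing was run" for "module loaded".
 * Returning 0 otherwise only stands in for actually running the helper.
 */
static int request_module_sketch(const char *module_name)
{
    (void)module_name;
    if (modprobe_path_sketch[0] == '\0')
        return -ENOENT;
    return 0;
}
```

With this shape, a caller such as the get_fs_type() snippet quoted above would
skip its "succeeded, but still no fs?" branch when autoloading is disabled,
because the return value is no longer 0.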
Signed-off-by: Eric Biggers
Signed-off-by: Andrew Morton
Reviewed-by: Kees Cook
Reviewed-by: Jessica Yu
Acked-by: Luis Chamberlain
Cc: Alexei Starovoitov
Cc: Greg Kroah-Hartman
Cc: Jeff Vander Stoep
Cc: Ben Hutchings
Cc: Josh Triplett
Cc:
Link: http://lkml.kernel.org/r/20200310223731.126894-1-ebiggers@kernel.org
Link: http://lkml.kernel.org/r/20200312202552.241885-1-ebiggers@kernel.org
Signed-off-by: Linus Torvalds
-
PCI BAR IO memory should never be mapped as WB, however prior to this
the PAT bits were set WB and it was typically overridden by MTRR
registers set by the firmware.

Set PCI P2PDMA memory to be UC as this is what it currently, typically,
ends up being mapped as on x86 after the MTRR registers override the
cache setting.

Future use-cases may need to generalize this by adding flags to select
the caching type, as some P2PDMA cases may not want UC. However, those
use-cases are not upstream yet and this can be changed when they arrive.

Signed-off-by: Logan Gunthorpe
Signed-off-by: Andrew Morton
Reviewed-by: Dan Williams
Cc: Christoph Hellwig
Cc: Jason Gunthorpe
Cc: Andy Lutomirski
Cc: Benjamin Herrenschmidt
Cc: Borislav Petkov
Cc: Catalin Marinas
Cc: Dave Hansen
Cc: David Hildenbrand
Cc: Eric Badger
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Michael Ellerman
Cc: Michal Hocko
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Will Deacon
Link: http://lkml.kernel.org/r/20200306170846.9333-8-logang@deltatee.com
Signed-off-by: Linus Torvalds
-
devm_memremap_pages() is currently used by the PCI P2PDMA code to create
struct page mappings for IO memory. At present, these mappings are
created with PAGE_KERNEL which implies setting the PAT bits to be WB.
However, on x86, an MTRR register will typically override this and force
the cache type to be UC-. In the case firmware doesn't set this
register it is effectively WB and will typically result in a machine
check exception when it's accessed.

Other arches are not currently likely to function correctly seeing they
don't have any MTRR registers to fall back on.

To solve this, provide a way to specify the pgprot value explicitly to
arch_add_memory().

Of the arches that support MEMORY_HOTPLUG: x86_64 and arm64 need a
simple change to pass the pgprot_t down to their respective functions
which set up the page tables. For x86_32, set the page tables
explicitly using _set_memory_prot() (seeing they are already mapped).

For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this
should be fine, for now, seeing these architectures don't support
ZONE_DEVICE.

A check in __add_pages() is also added to ensure the pgprot parameter
was set for all arches.

Signed-off-by: Logan Gunthorpe
Signed-off-by: Andrew Morton
Acked-by: David Hildenbrand
Acked-by: Michal Hocko
Acked-by: Dan Williams
Cc: Andy Lutomirski
Cc: Benjamin Herrenschmidt
Cc: Borislav Petkov
Cc: Catalin Marinas
Cc: Christoph Hellwig
Cc: Dave Hansen
Cc: Eric Badger
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Jason Gunthorpe
Cc: Michael Ellerman
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Will Deacon
Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com
Signed-off-by: Linus Torvalds
-
In preparation to support a pgprot_t argument for arch_add_memory().
Signed-off-by: Logan Gunthorpe
Signed-off-by: Andrew Morton
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Michael Ellerman
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Catalin Marinas
Cc: Christoph Hellwig
Cc: Dan Williams
Cc: Dave Hansen
Cc: David Hildenbrand
Cc: Eric Badger
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Jason Gunthorpe
Cc: Michal Hocko
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Will Deacon
Link: http://lkml.kernel.org/r/20200306170846.9333-6-logang@deltatee.com
Signed-off-by: Linus Torvalds
-
For use in the 32bit arch_add_memory() to set the pgprot type of the
memory to add.

Signed-off-by: Logan Gunthorpe
Signed-off-by: Andrew Morton
Reviewed-by: Dan Williams
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Dave Hansen
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Christoph Hellwig
Cc: David Hildenbrand
Cc: Eric Badger
Cc: Jason Gunthorpe
Cc: Michael Ellerman
Cc: Michal Hocko
Cc: Paul Mackerras
Cc: Will Deacon
Link: http://lkml.kernel.org/r/20200306170846.9333-5-logang@deltatee.com
Signed-off-by: Linus Torvalds
-
In preparation to support a pgprot_t argument for arch_add_memory().

It's required to move the prototype of init_memory_mapping() seeing the
original location came before the definition of pgprot_t.

Signed-off-by: Logan Gunthorpe
Signed-off-by: Andrew Morton
Reviewed-by: Dan Williams
Acked-by: Michal Hocko
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Dave Hansen
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Christoph Hellwig
Cc: David Hildenbrand
Cc: Eric Badger
Cc: Jason Gunthorpe
Cc: Michael Ellerman
Cc: Paul Mackerras
Cc: Will Deacon
Link: http://lkml.kernel.org/r/20200306170846.9333-4-logang@deltatee.com
Signed-off-by: Linus Torvalds
-
The mhp_restrictions struct really doesn't specify anything resembling a
restriction anymore so rename it to be mhp_params as it is a list of
extended parameters.

Signed-off-by: Logan Gunthorpe
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Reviewed-by: Dan Williams
Acked-by: Michal Hocko
Cc: Andy Lutomirski
Cc: Benjamin Herrenschmidt
Cc: Borislav Petkov
Cc: Catalin Marinas
Cc: Christoph Hellwig
Cc: Dave Hansen
Cc: Eric Badger
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Jason Gunthorpe
Cc: Michael Ellerman
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Will Deacon
Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.com
Signed-off-by: Linus Torvalds
-
Patch series "Allow setting caching mode in arch_add_memory() for
P2PDMA", v4.

Currently, the page tables created using memremap_pages() are always
created with the PAGE_KERNEL caching mode. However, the P2PDMA code is
creating pages for PCI BAR memory which should never be accessed through
the cache and instead use either WC or UC. This still works in most
cases, on x86, because the MTRR registers typically override the caching
settings in the page tables for all of the IO memory to be UC-.
However, this tends not to work so well on other arches or some rare x86
machines that have firmware which does not set up the MTRR registers in
this way.

Instead of this, this series proposes a change to arch_add_memory() to
take the pgprot required by the mapping which allows us to explicitly
set pagetable entries for P2PDMA memory to UC.

This change is pretty routine for most of the arches: x86_64, arm64 and
powerpc simply need to thread the pgprot through to where the page
tables are set up. x86_32 unfortunately sets up the page tables at boot
so must use _set_memory_prot() to change their caching mode. ia64, s390
and sh don't appear to have an easy way to change the page tables so,
for now at least, we just return -EINVAL on such mappings and thus they
will not support P2PDMA memory until the work for this is done. This
should be fine as they don't yet support ZONE_DEVICE.

This patch (of 7):

This variable is not used anywhere and should therefore be removed from
the structure.

Signed-off-by: Logan Gunthorpe
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Reviewed-by: Dan Williams
Acked-by: Michal Hocko
Cc: Christoph Hellwig
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Benjamin Herrenschmidt
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Eric Badger
Cc: "H. Peter Anvin"
Cc: Jason Gunthorpe
Cc: Michael Ellerman
Cc: Paul Mackerras
Link: http://lkml.kernel.org/r/20200306170846.9333-2-logang@deltatee.com
Signed-off-by: Linus Torvalds
-
Currently there are many platforms that don't enable ARCH_HAS_PTE_SPECIAL
but are required to define quite similar fallback stubs for special page
table entry helpers such as pte_special() and pte_mkspecial(), as they
get built in generic MM without a config check. This creates two
generic fallback stub definitions for these helpers, eliminating much
code duplication.

The mips platform has a special case where pte_special() and pte_mkspecial()
visibility is wider than what ARCH_HAS_PTE_SPECIAL enablement requires.
This restricts the visibility of those symbols in order to avoid the
redefinitions that the new generic stubs would now expose, and the
subsequent build failure. The arm platform set_pte_at() definition needs
to be moved into a C file just to prevent a build failure.

[anshuman.khandual@arm.com: use defined(CONFIG_ARCH_HAS_PTE_SPECIAL) in mips per Thomas]
Link: http://lkml.kernel.org/r/1583851924-21603-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual
Signed-off-by: Andrew Morton
Acked-by: Guo Ren [csky]
Acked-by: Geert Uytterhoeven [m68k]
Acked-by: Stafford Horne [openrisc]
Acked-by: Helge Deller [parisc]
Cc: Richard Henderson
Cc: Ivan Kokshaysky
Cc: Matt Turner
Cc: Russell King
Cc: Brian Cain
Cc: Tony Luck
Cc: Fenghua Yu
Cc: Sam Creasey
Cc: Michal Simek
Cc: Ralf Baechle
Cc: Paul Burton
Cc: Nick Hu
Cc: Greentime Hu
Cc: Vincent Chen
Cc: Ley Foon Tan
Cc: Jonas Bonn
Cc: Stefan Kristiansson
Cc: "James E.J. Bottomley"
Cc: "David S. Miller"
Cc: Jeff Dike
Cc: Richard Weinberger
Cc: Anton Ivanov
Cc: Guan Xuetao
Cc: Chris Zankel
Cc: Max Filippov
Cc: Thomas Bogendoerfer
Link: http://lkml.kernel.org/r/1583802551-15406-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds
-
There are many places where all basic VMA access flags (read, write,
exec) are initialized or checked against as a group. One such example
is during page fault. The existing vma_is_accessible() wrapper already
creates the notion of VMA accessibility as a group of access permissions.

Hence let's just create VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC) which
will not only reduce code duplication but also extend the VMA
accessibility concept in general.

Signed-off-by: Anshuman Khandual
Signed-off-by: Andrew Morton
Reviewed-by: Vlastimil Babka
Cc: Russell King
Cc: Catalin Marinas
Cc: Mark Salter
Cc: Nick Hu
Cc: Ley Foon Tan
Cc: Michael Ellerman
Cc: Heiko Carstens
Cc: Yoshinori Sato
Cc: Guan Xuetao
Cc: Dave Hansen
Cc: Thomas Gleixner
Cc: Rob Springer
Cc: Greg Kroah-Hartman
Cc: Geert Uytterhoeven
Link: http://lkml.kernel.org/r/1583391014-8170-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds
-
There are many platforms with the exact same value for VM_DATA_DEFAULT_FLAGS.
This creates a default value for VM_DATA_DEFAULT_FLAGS in line with the
existing VM_STACK_DEFAULT_FLAGS. While here, also define some more
macros with standard VMA access flag combinations that are used
frequently across many platforms. Apart from simplification, this
reduces code duplication as well.

Signed-off-by: Anshuman Khandual
Signed-off-by: Andrew Morton
Reviewed-by: Vlastimil Babka
Acked-by: Geert Uytterhoeven
Cc: Richard Henderson
Cc: Vineet Gupta
Cc: Russell King
Cc: Catalin Marinas
Cc: Mark Salter
Cc: Guo Ren
Cc: Yoshinori Sato
Cc: Brian Cain
Cc: Tony Luck
Cc: Michal Simek
Cc: Ralf Baechle
Cc: Paul Burton
Cc: Nick Hu
Cc: Ley Foon Tan
Cc: Jonas Bonn
Cc: "James E.J. Bottomley"
Cc: Michael Ellerman
Cc: Paul Walmsley
Cc: Heiko Carstens
Cc: Rich Felker
Cc: "David S. Miller"
Cc: Guan Xuetao
Cc: Thomas Gleixner
Cc: Jeff Dike
Cc: Chris Zankel
Link: http://lkml.kernel.org/r/1583391014-8170-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds
-
Add the ability to insert multiple pages at once to a user VM with fewer
PTE spinlock operations.

The intention of this patch-set is to reduce atomic ops for tcp zerocopy
receives, which normally hit the same spinlock multiple times
consecutively.

[akpm@linux-foundation.org: pte_alloc() no longer takes the `addr' argument]
[arjunroy@google.com: add missing page_count() check to vm_insert_pages()]
Link: http://lkml.kernel.org/r/20200214005929.104481-1-arjunroy.kdev@gmail.com
[arjunroy@google.com: vm_insert_pages() checks if pte_index defined]
Link: http://lkml.kernel.org/r/20200228054714.204424-2-arjunroy.kdev@gmail.com
Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Andrew Morton
Cc: David Miller
Cc: Matthew Wilcox
Cc: Jason Gunthorpe
Cc: Stephen Rothwell
Link: http://lkml.kernel.org/r/20200128025958.43490-2-arjunroy.kdev@gmail.com
Signed-off-by: Linus Torvalds
-
pte_index() is either defined as a macro (e.g. sparc64) or as an
inlined function (e.g. x86). vm_insert_pages() depends on pte_index
but it is not defined on all platforms (e.g. m68k).

To fix compilation of vm_insert_pages() on architectures not providing
pte_index(), we perform the following fix:

0. For platforms where it is meaningful, and defined as a macro, no
   change is needed.
1. For platforms where it is meaningful and defined as an inlined
   function, and we want to use it with vm_insert_pages(), we define
   a degenerate macro of the form: #define pte_index pte_index
2. vm_insert_pages() checks for the existence of a pte_index macro
   definition. If found, it implements a batched insert. If not found,
   it devolves to calling vm_insert_page() in a loop.

This patch implements step 1 for x86.
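
The degenerate-macro technique from the steps above can be sketched in plain
C. Everything here is illustrative: the pte_index() body assumes 4 KiB pages
and 512 entries per table, and the enum and pick_insert_strategy_sketch() are
hypothetical names, not part of the kernel.

```c
/* Hypothetical inline helper standing in for an architecture's pte_index(). */
static inline unsigned long pte_index(unsigned long address)
{
    /* Assumed layout for illustration: 4 KiB pages, 512 entries per table. */
    return (address >> 12) & 511UL;
}
/* Degenerate macro: expands to itself, but makes the name visible to #ifdef. */
#define pte_index pte_index

enum insert_strategy_sketch { INSERT_LOOP_SKETCH, INSERT_BATCHED_SKETCH };

/* Generic code can now choose a strategy at compile time. */
static enum insert_strategy_sketch pick_insert_strategy_sketch(void)
{
#ifdef pte_index
    return INSERT_BATCHED_SKETCH;   /* pte_index() is usable: batch the inserts */
#else
    return INSERT_LOOP_SKETCH;      /* no pte_index(): insert one page at a time */
#endif
}
```

Because a macro is not re-expanded inside its own expansion, calls to
pte_index() still reach the inline function unchanged; the define only exists
so that the preprocessor check can see it.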
v3 of this patch fixes a compilation warning for an unused method.
v2 of this patch moved a macro definition to a more readable location.

Signed-off-by: Arjun Roy
Signed-off-by: Andrew Morton
Cc: David Miller
Cc: Eric Dumazet
Cc: Jason Gunthorpe
Cc: Matthew Wilcox
Cc: Soheil Hassas Yeganeh
Cc: Stephen Rothwell
Link: http://lkml.kernel.org/r/20200228054714.204424-1-arjunroy.kdev@gmail.com
Signed-off-by: Linus Torvalds
-
pte_index() on platforms other than sparc returns a numerical index. On
sparc, it returns a pte_t*. This presents an issue for
vm_insert_pages(), which relies on pte_index() to find the offset for a
pte within a pmd, for batched inserts.

This patch:

1. Modifies pte_index() for sparc to return a numerical index, like
   other platforms,
2. Defines pte_entry() for sparc which returns a pte_t*
   (as pte_index() used to),
3. Converts existing sparc callers for pte_index() to use pte_entry().

[sfr@canb.auug.org.au: remove pte_entry and just directly modified pte_offset_kernel instead]
Signed-off-by: Arjun Roy
Signed-off-by: Stephen Rothwell
Signed-off-by: Andrew Morton
Reviewed-by: Mike Rapoport
Cc: Eric Dumazet
Cc: Soheil Hassas Yeganeh
Cc: David Miller
Cc: Matthew Wilcox
Cc: Arjun Roy
Cc: Jason Gunthorpe
Link: http://lkml.kernel.org/r/20200227105045.6b421d9f@canb.auug.org.au
Signed-off-by: Linus Torvalds
-
Add helper methods for vm_insert_page()/insert_page() to prepare for
vm_insert_pages(), which batch-inserts pages to reduce spinlock
operations when inserting multiple consecutive pages into the user page
table.

The intention of this patch-set is to reduce atomic ops for tcp zerocopy
receives, which normally hit the same spinlock multiple times
consecutively.

Signed-off-by: Arjun Roy
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Andrew Morton
Cc: David Miller
Cc: Matthew Wilcox
Cc: Jason Gunthorpe
Cc: Stephen Rothwell
Link: http://lkml.kernel.org/r/20200128025958.43490-1-arjunroy.kdev@gmail.com
Signed-off-by: Linus Torvalds
-
When passing requirements to vm_unmapped_area(), arch_get_unmapped_area()
and arch_get_unmapped_area_topdown() did not set align_offset. Internally,
in both unmapped_area() and unmapped_area_topdown(), if info->align_mask
is 0, then info->align_offset is meaningless.

But commit df529cabb7a2 ("mm: mmap: add trace point of
vm_unmapped_area") always prints info->align_offset even though it is
uninitialized.

Fix this uninitialized value issue by setting it to 0 explicitly.

Before:
    vm_unmapped_area: addr=0x755b155000 err=0 total_vm=0x15aaf0 flags=0x1 len=0x109000 lo=0x8000 hi=0x75eed48000 mask=0x0 ofs=0x4022

After:
    vm_unmapped_area: addr=0x74a4ca1000 err=0 total_vm=0x168ab1 flags=0x1 len=0x9000 lo=0x8000 hi=0x753d94b000 mask=0x0 ofs=0x0

Signed-off-by: Jaewon Kim
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Cc: Matthew Wilcox (Oracle)
Cc: Michel Lespinasse
Cc: Borislav Petkov
Link: http://lkml.kernel.org/r/20200409094035.19457-1-jaewon31.kim@samsung.com
Signed-off-by: Linus Torvalds
-
Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
at runtime") has added the run-time allocation of gigantic pages.

However it actually works only at early stages of the system loading,
when the majority of memory is free. After some time the memory gets
fragmented by non-movable pages, so the chances to find a contiguous 1GB
block are getting close to zero. Even dropping caches manually doesn't
help a lot.

At large scale rebooting servers in order to allocate gigantic hugepages
is quite expensive and complex. At the same time keeping some constant
percentage of memory in reserved hugepages even if the workload isn't
using it is a big waste: not all workloads can benefit from using 1 GB
pages.

The following solution can solve the problem:

1) On boot time a dedicated cma area* is reserved. The size is passed
   as a kernel argument.
2) Run-time allocations of gigantic hugepages are performed using the
   cma allocator and the dedicated cma area

In this case gigantic hugepages can be allocated successfully with a
high probability, however the memory isn't completely wasted if nobody
is using 1GB hugepages: it can be used for pagecache, anon memory, THPs,
etc.

* On a multi-node machine a per-node cma area is allocated on each node.
  Subsequent gigantic hugetlb allocations use the first available
  numa node if the mask isn't specified by a user.

Usage:

1) configure the kernel to allocate a cma area for hugetlb allocations:
   pass hugetlb_cma=10G as a kernel argument

2) allocate hugetlb pages as usual, e.g.
   echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

If the option isn't enabled or the allocation of the cma area failed,
the current behavior of the system is preserved.

x86 and arm-64 are covered by this patch, other architectures can be
trivially added later.

The patch contains clean-ups and fixes proposed and implemented by Aslan
Bakirov and Randy Dunlap. It also contains ideas and suggestions
proposed by Rik van Riel, Michal Hocko and Mike Kravetz. Thanks!

Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Tested-by: Andreas Schaufler
Acked-by: Mike Kravetz
Acked-by: Michal Hocko
Cc: Aslan Bakirov
Cc: Randy Dunlap
Cc: Rik van Riel
Cc: Joonsoo Kim
Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com
Signed-off-by: Linus Torvalds
-
I've noticed that there is no interface exposed by CMA which would let
me declare contiguous memory on a particular NUMA node.

This patchset adds the ability to try to allocate contiguous memory on a
specific node. It will fall back to other nodes if the specified one
doesn't work.

Implement a new method for declaring contiguous memory on a particular node
and keep cma_declare_contiguous() as a wrapper.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Aslan Bakirov
Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Acked-by: Michal Hocko
Cc: Andreas Schaufler
Cc: Mike Kravetz
Cc: Rik van Riel
Cc: Joonsoo Kim
Link: http://lkml.kernel.org/r/20200407163840.92263-2-guro@fb.com
Signed-off-by: Linus Torvalds
-
With Linux fallocate(2) in FALLOC_FL_PUNCH_HOLE mode, the offset can
exceed the inode size. Ocfs2 currently doesn't allow an offset beyond the
inode size. This restriction is not necessary and violates fallocate(2)
semantics.

If fallocate(2) offset is beyond inode size, just return success and do
nothing further.

Otherwise, ocfs2 will crash the kernel.

    kernel BUG at fs/ocfs2//alloc.c:7264!
    ocfs2_truncate_inline+0x20f/0x360 [ocfs2]
    ocfs2_remove_inode_range+0x23c/0xcb0 [ocfs2]
    __ocfs2_change_file_space+0x4a5/0x650 [ocfs2]
    ocfs2_fallocate+0x83/0xa0 [ocfs2]
    vfs_fallocate+0x148/0x230
    SyS_fallocate+0x48/0x80
    do_syscall_64+0x79/0x170

Signed-off-by: Changwei Ge
Signed-off-by: Andrew Morton
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Cc: Changwei Ge
Cc: Gang He
Cc: Jun Piao
Cc:
Link: http://lkml.kernel.org/r/20200407082754.17565-1-chge@linux.alibaba.com
Signed-off-by: Linus Torvalds
-
Fix the following sparse warning:

    mm/page_alloc.c:106:1: warning: symbol 'pcpu_drain_mutex' was not declared. Should it be static?
    mm/page_alloc.c:107:1: warning: symbol '__pcpu_scope_pcpu_drain' was not declared. Should it be static?

Reported-by: Hulk Robot
Signed-off-by: Jason Yan
Signed-off-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200407023925.46438-1-yanaijie@huawei.com
Signed-off-by: Linus Torvalds
-
Add description of function parameter 'mt' to fix kernel-doc warning:
mm/page_alloc.c:3246: warning: Function parameter or member 'mt' not described in '__putback_isolated_page'
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Acked-by: Pankaj Gupta
Link: http://lkml.kernel.org/r/02998bd4-0b82-2f15-2570-f86130304d1e@infradead.org
Signed-off-by: Linus Torvalds
-
There is a typo at the cross-reference link, causing this warning:
include/linux/slab.h:11: WARNING: undefined label: memory-allocation (if the link has no caption the label must precede a section header)
Signed-off-by: Mauro Carvalho Chehab
Signed-off-by: Andrew Morton
Cc: Jonathan Corbet
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Link: http://lkml.kernel.org/r/0aeac24235d356ebd935d11e147dcc6edbb6465c.1586359676.git.mchehab+huawei@kernel.org
Signed-off-by: Linus Torvalds
-
There is a typo in comment, fix it.
s/eariler/earlier/

Signed-off-by: Qiujun Huang
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Acked-by: Christoph Lameter
Link: http://lkml.kernel.org/r/20200405160544.1246-1-hqjagain@gmail.com
Signed-off-by: Linus Torvalds
-
If a cgroup violates its memory.high constraints, we may end up unduly
penalising it. For example, for the following hierarchy:

    A:   max high, 20 usage
    A/B: 9 high, 10 usage
    A/C: max high, 10 usage

We would end up doing the following calculation below when calculating
high delay for A/B:

    A/B: 10 - 9 = 1...
    A:   20 - PAGE_COUNTER_MAX = 21, so set max_overage to 21.

This gets worse with higher disparities in usage in the parent.

I have no idea how this disappeared from the final version of the patch,
but it is certainly Not Good(tm). This wasn't obvious in testing because,
for a simple cgroup hierarchy with only one child, the result is usually
roughly the same. It's only in more complex hierarchies that things go
really awry (although still, the effects are limited to a maximum of 2
seconds in schedule_timeout_killable at a maximum).

[chris@chrisdown.name: changelog]
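
The arithmetic in the example hierarchy can be reproduced with a small
userspace sketch. It is only an illustration under stated assumptions: "max"
is modelled as ULONG_MAX (which happens to reproduce the "= 21" result shown
in the changelog), and both function names are hypothetical. The kernel's
actual limit constant and delay calculation are not reproduced here.

```c
#include <limits.h>

/* Assumption: "max" is modelled as ULONG_MAX to mirror the changelog's numbers. */
#define HIGH_MAX_SKETCH ULONG_MAX

/* Unguarded subtraction: wraps around when usage is below the limit. */
static unsigned long overage_unguarded_sketch(unsigned long usage,
                                              unsigned long high)
{
    return usage - high;
}

/* Guarded version: a group at or below its limit contributes no overage. */
static unsigned long overage_guarded_sketch(unsigned long usage,
                                            unsigned long high)
{
    if (usage <= high)
        return 0;
    return usage - high;
}
```

For group A above (usage 20, high "max"), the unguarded form wraps to 21 and
would dominate the real overage of 1 from A/B, while the guarded form yields 0.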
Fixes: e26733e0d0ec ("mm, memcg: throttle allocators based on ancestral memory.high")
Signed-off-by: Jakub Kicinski
Signed-off-by: Chris Down
Signed-off-by: Andrew Morton
Acked-by: Michal Hocko
Cc: Johannes Weiner
Cc: [5.4.x]
Link: http://lkml.kernel.org/r/20200331152424.GA1019937@chrisdown.name
Signed-off-by: Linus Torvalds
-
When removing files containing extended attributes, the hfsplus driver may
remove the wrong entries from the attributes b-tree, causing major
filesystem damage and in some cases even kernel crashes.

To remove a file, all its extended attributes have to be removed as well.
The driver does this by looking up all keys in the attributes b-tree with
the cnid of the file. Each of these entries then gets deleted using the
key used for searching, which doesn't contain the attribute's name when it
should. Since the key doesn't contain the name, the deletion routine will
not find the correct entry and instead remove the one in front of it. If
parent nodes have to be modified, these become corrupt as well. This
causes invalid links and unsorted entries that not even macOS's fsck_hfs
is able to fix.

To fix this, modify the search key before an entry is deleted from the
attributes b-tree by copying the found entry's key into the search key,
therefore ensuring that the correct entry gets removed from the tree.

Signed-off-by: Simon Gander
Signed-off-by: Andrew Morton
Reviewed-by: Anton Altaparmakov
Cc:
Link: http://lkml.kernel.org/r/20200327155541.1521-1-simon@tuxera.com
Signed-off-by: Linus Torvalds
10 Apr, 2020
5 commits
-
Pull module updates from Jessica Yu:
"Only a small cleanup this time around: a trivial conversion of
zero-length arrays to flexible arrays"

* tag 'modules-for-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
kernel: module: Replace zero-length array with flexible-array member
-
Pull arm64 fixes from Catalin Marinas:

- Ensure that the compiler and linker versions are aligned so that ld
  doesn't complain about not understanding a .note.gnu.property section
  (emitted when pointer authentication is enabled).

- Force -mbranch-protection=none when the feature is not enabled, in
  case a compiler may choose a different default value.

- Remove CONFIG_DEBUG_ALIGN_RODATA. It was never in defconfig and
  rarely enabled.

- Fix the mask used for checking 16-bit Thumb-2 instructions in the
  emulation of the SETEND instruction (it could match the bottom half
  of a 32-bit Thumb-2 instruction).

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: armv8_deprecated: Fix undef_hook mask for thumb setend
arm64: remove CONFIG_DEBUG_ALIGN_RODATA feature
arm64: Always force a branch protection mode when the compiler has one
arm64: Kconfig: ptrauth: Add binutils version check to fix mismatch
init/kconfig: Add LD_VERSION Kconfig
-
Pull more powerpc updates from Michael Ellerman:
"The bulk of this is the series to make CONFIG_COMPAT user-selectable,
it's been around for a long time but was blocked behind the
syscall-in-C series.

Plus there's also a few fixes and other minor things.

Summary:

- A fix for a crash in machine check handling on pseries (ie. guests)

- A small series to make it possible to disable CONFIG_COMPAT, and
  turn it off by default for ppc64le where it's not used.

- A few other miscellaneous fixes and small improvements.

Thanks to: Alexey Kardashevskiy, Anju T Sudhakar, Arnd Bergmann,
Christophe Leroy, Dan Carpenter, Ganesh Goudar, Geert Uytterhoeven,
Geoff Levand, Mahesh Salgaonkar, Markus Elfring, Michal Suchanek,
Nicholas Piggin, Stephen Boyd, Wen Xiong"

* tag 'powerpc-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Always build the tm-poison test 64-bit
powerpc: Improve ppc_save_regs()
Revert "powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled"
powerpc/time: Replace by
powerpc/pseries/ddw: Extend upper limit for huge DMA window for persistent memory
powerpc/perf: split callchain.c by bitness
powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.
powerpc/64: make buildable without CONFIG_COMPAT
powerpc/perf: consolidate valid_user_sp -> invalid_user_sp
powerpc/perf: consolidate read_user_stack_32
powerpc: move common register copy functions from signal_32.c to signal.c
powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro
powerpc/ps3: Set CONFIG_UEVENT_HELPER=y in ps3_defconfig
powerpc/ps3: Remove an unneeded NULL check
powerpc/ps3: Remove duplicate error message
powerpc/powernv: Re-enable imc trace-mode in kernel
powerpc/perf: Implement a global lock to avoid races between trace, core and thread imc events.
powerpc/pseries: Fix MCE handling on pseries
selftests/eeh: Skip ahci adapters
powerpc/64s: Fix doorbell wakeup msgclr optimisation
-
Pull m68knommu update from Greg Ungerer:
"Only a single commit, to remove all use of the obsolete setup_irq()
calls within the m68knommu architecture code"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68k: Replace setup_irq() by request_irq()
-
Pull RISC-V updates from Palmer Dabbelt:
"This contains a handful of new features:

- Partial support for the Kendryte K210.

  There are still a few outstanding issues that I have patches for,
  but I don't actually have a board to test them so they're not
  included yet.

- SBI v0.2 support.

- Fixes to support for building with LLVM-based toolchains. The
  resulting images are known not to boot yet.

I don't anticipate a part two, but I'll probably have something early
in the RCs to finish up the K210 support"

* tag 'riscv-for-linus-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (38 commits)
riscv: create a loader.bin boot image for Kendryte SoC
riscv: Kendryte K210 default config
riscv: Add Kendryte K210 device tree
riscv: Select required drivers for Kendryte SOC
riscv: Add Kendryte K210 SoC support
riscv: Add SOC early init support
riscv: Unaligned load/store handling for M_MODE
RISC-V: Support cpu hotplug
RISC-V: Add supported for ordered booting method using HSM
RISC-V: Add SBI HSM extension definitions
RISC-V: Export SBI error to linux error mapping function
RISC-V: Add cpu_ops and modify default booting method
RISC-V: Move relocate and few other functions out of __init
RISC-V: Implement new SBI v0.2 extensions
RISC-V: Introduce a new config for SBI v0.1
RISC-V: Add SBI v0.2 extension definitions
RISC-V: Add basic support for SBI v0.2
RISC-V: Mark existing SBI as 0.1 SBI.
riscv: Use macro definition instead of magic number
riscv: Add support to dump the kernel page tables
...
09 Apr, 2020
9 commits
-
Pull 9p documentation update from Dominique Martinet:
"Document the new O_NONBLOCK short read behavior"

* tag '9p-for-5.7-2' of git://github.com/martinetd/linux:
9p: document short read behaviour with O_NONBLOCK
-
Pull ceph updates from Ilya Dryomov:
"The main items are:

- support for asynchronous create and unlink (Jeff Layton).
Creates and unlinks are satisfied locally, without waiting for a
reply from the MDS, provided the client has been granted
appropriate caps (new in v15.y.z ("Octopus") release). This can be
a big help for metadata heavy workloads such as tar and rsync.
Opt-in with the new nowsync mount option.

- multiple blk-mq queues for rbd (Hannes Reinecke and myself).
When the driver was converted to blk-mq, we settled on a single
blk-mq queue because of a global lock in libceph and some other
technical debt. These have since been addressed, so allocate a
queue per CPU to enhance parallelism.

- don't hold onto caps that aren't actually needed (Zheng Yan).
This has been our long-standing behavior, but it causes issues with
some active/standby applications (synchronous I/O, stalls if the
standby goes down, etc).

- .snap directory timestamps consistent with ceph-fuse (Luis
Henriques)"

* tag 'ceph-for-5.7-rc1' of git://github.com/ceph/ceph-client: (49 commits)
ceph: fix snapshot directory timestamps
ceph: wait for async creating inode before requesting new max size
ceph: don't skip updating wanted caps when cap is stale
ceph: request new max size only when there is auth cap
ceph: cleanup return error of try_get_cap_refs()
ceph: return ceph_mdsc_do_request() errors from __get_parent()
ceph: check all mds' caps after page writeback
ceph: update i_requested_max_size only when sending cap msg to auth mds
ceph: simplify calling of ceph_get_fmode()
ceph: remove delay check logic from ceph_check_caps()
ceph: consider inode's last read/write when calculating wanted caps
ceph: always renew caps if mds_wanted is insufficient
ceph: update dentry lease for async create
ceph: attempt to do async create when possible
ceph: cache layout in parent dir on first sync create
ceph: add new MDS req field to hold delegated inode number
ceph: decode interval_sets for delegated inos
ceph: make ceph_fill_inode non-static
ceph: perform asynchronous unlink if we have sufficient caps
ceph: don't take refs to want mask unless we have all bits
...
-
Pull overlayfs update from Miklos Szeredi:
- Fix failure to copy-up files from certain NFSv4 mounts
- Sort out inconsistencies between st_ino and i_ino (used in /proc/locks)
- Allow consistent (POSIX-y) inode numbering in more cases
- Allow virtiofs to be used as upper layer
- Miscellaneous cleanups and fixes
* tag 'ovl-update-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: document xino expected behavior
ovl: enable xino automatically in more cases
ovl: avoid possible inode number collisions with xino=on
ovl: use a private non-persistent ino pool
ovl: fix WARN_ON nlink drop to zero
ovl: fix a typo in comment
ovl: replace zero-length array with flexible-array member
ovl: ovl_obtain_alias(): don't call d_instantiate_anon() for old
ovl: strict upper fs requirements for remote upper fs
ovl: check if upper fs supports RENAME_WHITEOUT
ovl: allow remote upper
ovl: decide if revalidate needed on a per-dentry basis
ovl: separate detection of remote upper layer from stacked overlay
ovl: restructure dentry revalidation
ovl: ignore failure to copy up unknown xattrs
ovl: document permission model
ovl: simplify i_ino initialization
ovl: factor out helper ovl_get_root()
ovl: fix out of date comment and unreachable code
ovl: fix value of i_ino for lower hardlink corner case
-
Pull iomap fix from Darrick Wong:
"Fix a problem in readahead where we can crash if we can't allocate a
full bio due to GFP_NORETRY"

* tag 'iomap-5.7-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: Handle memory allocation failure in readahead
-
Pull crypto fixes from Herbert Xu:
"This fixes a Kconfig dependency for hisilicon as well as a double free
in marvell/octeontx"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: marvell/octeontx - fix double free of ptr
crypto: hisilicon - Fix build error
-
Pull watchdog updates from Wim Van Sebroeck:
- add TI K3 RTI watchdog
- add stop_on_reboot parameter to control reboot policy
- wm831x_wdt: Remove GPIO handling
- several small fixes, improvements and clean-ups
* tag 'linux-watchdog-5.7-rc1' of git://www.linux-watchdog.org/linux-watchdog:
watchdog: Add K3 RTI watchdog support
dt-bindings: watchdog: Add support for TI K3 RTI watchdog
watchdog: ziirave_wdt: change name to be more specific
watchdog: orion: use 0 for unset heartbeat
watchdog: npcm: remove whitespaces
watchdog: reset last_hw_keepalive time at start
watchdog: imx2_wdt: Drop .remove callback
watchdog: Add stop_on_reboot parameter to control reboot policy
watchdog: wm831x_wdt: Remove GPIO handling
watchdog: imx7ulp: Remove unused include of init.h
watchdog: imx_sc_wdt: Remove unused includes
watchdog: qcom: Use irq flags from firmware
watchdog: pm8916_wdt: Add system sleep callbacks
watchdog: qcom-wdt: disable pretimeout on timer platform
-
…ernel/git/chrome-platform/linux
Pull chrome platform updates from Benson Leung:
cros-usbpd-notify and cros_ec_typec:
- Add a new notification driver that handles and dispatches USB PD
related events to other drivers.
- Add a Type C connector class driver for cros_ec

CrOS EC:
- Introduce a new cros_ec_cmd_xfer_status helper

Sensors/iio:
- A series from Gwendal that adds Cros EC sensor hub FIFO support

Wilco EC:
- Fix a build warning.
- Platform data shouldn't include kernel.h

Misc:
- i2c api conversion complete, with i2c_new_client_device instead of
i2c_new_device in chromeos_laptop.
- Replace zero-length array with flexible-array member in
cros_ec_chardev and wilco_ec
- Update new structure for SPI transfer delays in cros_ec_spi

* tag 'tag-chrome-platform-for-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux: (34 commits)
platform/chrome: cros_ec_spi: Wait for USECS, not NSECS
iio: cros_ec: Use Hertz as unit for sampling frequency
iio: cros_ec: Report hwfifo_watermark_max
iio: cros_ec: Expose hwfifo_timeout
iio: cros_ec: Remove pm function
iio: cros_ec: Register to cros_ec_sensorhub when EC supports FIFO
iio: expose iio_device_set_clock
iio: cros_ec: Move function description to .c file
platform/chrome: cros_ec_sensorhub: Add median filter
platform/chrome: cros_ec_sensorhub: Add code to spread timestmap
platform/chrome: cros_ec_sensorhub: Add FIFO support
platform/chrome: cros_ec_sensorhub: Add the number of sensors in sensorhub
platform/chrome: chromeos_laptop: make I2C API conversion complete
platform/chrome: wilco_ec: event: Replace zero-length array with flexible-array member
platform/chrome: cros_ec_chardev: Replace zero-length array with flexible-array member
platform/chrome: cros_ec_typec: Update port info from EC
platform/chrome: Add Type C connector class driver
platform/chrome: cros_usbpd_notify: Pull PD_HOST_EVENT status
platform/chrome: cros_usbpd_notify: Amend ACPI driver to plat
platform/chrome: cros_usbpd_notify: Add driver data struct
...
-
Pull libnvdimm and dax updates from Dan Williams:
"There were multiple touches outside of drivers/nvdimm/ this round to
add cross arch compatibility to the devm_memremap_pages() interface,
enhance numa information for persistent memory ranges, and add a
zero_page_range() dax operation.

This cycle I switched from the patchwork api to Konstantin's b4 script
for collecting tags (from x86, PowerPC, filesystem, and device-mapper
folks), and everything looks to have gone ok there. This has all
appeared in -next with no reported issues.

Summary:
- Add support for region alignment configuration and enforcement to
fix compatibility across architectures and PowerPC page size
configurations.

- Introduce 'zero_page_range' as a dax operation. This facilitates
filesystem-dax operation without a block-device.

- Introduce phys_to_target_node() to facilitate drivers that want to
know resulting numa node if a given reserved address range was
onlined.

- Advertise a persistence-domain for of_pmem and papr_scm. The
persistence domain indicates where cpu-store cycles need to reach
in the platform-memory subsystem before the platform will consider
them power-fail protected.

- Promote numa_map_to_online_node() to a cross-kernel generic
facility.

- Save x86 numa information to allow for node-id lookups for reserved
memory ranges, deploy that capability for the e820-pmem driver.

- Pick up some miscellaneous minor fixes, that missed v5.6-final,
including a some smatch reports in the ioctl path and some unit
test compilation fixups.

- Fixup some flexible-array declarations"
* tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
dax: Move mandatory ->zero_page_range() check in alloc_dax()
dax,iomap: Add helper dax_iomap_zero() to zero a range
dax: Use new dax zero page method for zeroing a page
dm,dax: Add dax zero_page_range operation
s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
dax, pmem: Add a dax operation zero_page_range
pmem: Add functions for reading/writing page to/from pmem
libnvdimm: Update persistence domain value for of_pmem and papr_scm device
tools/test/nvdimm: Fix out of tree build
libnvdimm/region: Fix build error
libnvdimm/region: Replace zero-length array with flexible-array member
libnvdimm/label: Replace zero-length array with flexible-array member
ACPI: NFIT: Replace zero-length array with flexible-array member
libnvdimm/region: Introduce an 'align' attribute
libnvdimm/region: Introduce NDD_LABELING
libnvdimm/namespace: Enforce memremap_compat_align()
libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
libnvdimm: Out of bounds read in __nd_ioctl()
acpi/nfit: improve bounds checking for 'func'
mm/memremap_pages: Introduce memremap_compat_align()
...
-
Pull iommu updates from Joerg Roedel:
- ARM-SMMU support for the TLB range invalidation command in SMMUv3.2
- ARM-SMMU introduction of command batching helpers to batch up CD and
ATC invalidation

- ARM-SMMU support for PCI PASID, along with necessary PCI symbol
exports

- Introduce a generic (actually rename an existing) IOMMU related
pointer in struct device and reduce the IOMMU related pointers

- Some fixes for the OMAP IOMMU driver to make it build on 64bit
architectures

- Various smaller fixes and improvements
* tag 'iommu-updates-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (39 commits)
iommu: Move fwspec->iommu_priv to struct dev_iommu
iommu/virtio: Use accessor functions for iommu private data
iommu/qcom: Use accessor functions for iommu private data
iommu/mediatek: Use accessor functions for iommu private data
iommu/renesas: Use accessor functions for iommu private data
iommu/arm-smmu: Use accessor functions for iommu private data
iommu/arm-smmu: Refactor master_cfg/fwspec usage
iommu/arm-smmu-v3: Use accessor functions for iommu private data
iommu: Introduce accessors for iommu private data
iommu/arm-smmu: Fix uninitilized variable warning
iommu: Move iommu_fwspec to struct dev_iommu
iommu: Rename struct iommu_param to dev_iommu
iommu/tegra-gart: Remove direct access of dev->iommu_fwspec
drm/msm/mdp5: Remove direct access of dev->iommu_fwspec
ACPI/IORT: Remove direct access of dev->iommu_fwspec
iommu: Define dev_iommu_fwspec_get() for !CONFIG_IOMMU_API
iommu/virtio: Reject IOMMU page granule larger than PAGE_SIZE
iommu/virtio: Fix freeing of incomplete domains
iommu/virtio: Fix sparse warning
iommu/vt-d: Add build dependency on IOASID
...