08 Apr, 2020
13 commits
-
It's clearer to just put this inline.
Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200317193201.9924-5-adobriyan@gmail.com
Signed-off-by: Linus Torvalds -
The process maps file was the only user of version (introduced back in
2005). Now that it uses ppos instead, we can remove it.Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200317193201.9924-4-adobriyan@gmail.com
Signed-off-by: Linus Torvalds -
The ppos is a private cursor, just like m->version. Use the canonical
cursor, not a special one.Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200317193201.9924-3-adobriyan@gmail.com
Signed-off-by: Linus Torvalds -
Instead of setting m->version in the show method, set it in m_next(),
where it should be. Also remove the fallback code for failing to find a
vma, or version being zero.Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200317193201.9924-2-adobriyan@gmail.com
Signed-off-by: Linus Torvalds -
Instead of calling vma_stop() from m_start() and m_next(), do its work
in m_stop().Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200317193201.9924-1-adobriyan@gmail.com
Signed-off-by: Linus Torvalds -
top(1) reads all /proc/*/statm files but kernel threads will always have
zeros. Print those zeroes directly without going through
seq_put_decimal_ull().Speed up reading /proc/2/statm (which is kthreadd) is like 3%.
My system has more kernel threads than normal processes after booting KDE.
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200307154435.GA2788@avx2
Signed-off-by: Linus Torvalds -
Now that "struct proc_ops" exist we can start putting there stuff which
could not fly with VFS "struct file_operations"...Most of fs/proc/inode.c file is dedicated to make open/read/.../close
reliable in the event of disappearing /proc entries which usually happens
if module is getting removed. Files like /proc/cpuinfo which never
disappear simply do not need such protection.Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.Enable "permanent" flag for
/proc/cpuinfo
/proc/kmsg
/proc/modules
/proc/slabinfo
/proc/stat
/proc/sysvipc/*
/proc/swapsMore will come once I figure out foolproof way to prevent out module
authors from marking their stuff "permanent" for performance reasons
when it is not.This should help with scalability: benchmark is "read /proc/cpuinfo R times
by N threads scattered over the system".N R t, s (before) t, s (after)
-----------------------------------------------------
64 4096 1.582458 1.530502 -3.2%
256 4096 6.371926 6.125168 -3.9%
1024 4096 25.64888 24.47528 -4.6%Benchmark source:
#include
#include
#include
#include#include
#include
#include
#includeconst int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;int xxx = 0;
int glue(int n)
{
cpu_set_t m;
CPU_ZERO(&m);
CPU_SET(n, &m);
return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}void f(int n)
{
glue(n % NR_CPUS);while (*(volatile int *)&xxx == 0) {
}for (int i = 0; i < R; i++) {
int fd = open(filename, O_RDONLY);
char buf[4096];
ssize_t rv = read(fd, buf, sizeof(buf));
asm volatile ("" :: "g" (rv));
close(fd);
}
}int main(int argc, char *argv[])
{
if (argc < 4) {
std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
";
return 1;
}N = atoi(argv[1]);
filename = argv[2];
R = atoi(argv[3]);for (int i = 0; i < NR_CPUS; i++) {
if (glue(i) == 0)
break;
}std::vector T;
T.reserve(N);
for (int i = 0; i < N; i++) {
T.emplace_back(f, i);
}auto t0 = std::chrono::system_clock::now();
{
*(volatile int *)&xxx = 1;
for (auto& t: T) {
t.join();
}
}
auto t1 = std::chrono::system_clock::now();
std::chrono::duration dt = t1 - t0;
std::cout << dt.count() << '
';return 0;
}P.S.:
Explicit randomization marker is added because adding non-function pointer
will silently disable structure layout randomization.[akpm@linux-foundation.org: coding style fixes]
Reported-by: kbuild test robot
Reported-by: Dan Carpenter
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Cc: Al Viro
Cc: Joe Perches
Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Linus Torvalds -
Fix sparse locking imbalance warning:
warning: context imbalance in close_pdeo() - unexpected unlock
Signed-off-by: Jules Irenge
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200227201538.GA30462@avx2
Signed-off-by: Linus Torvalds -
Only declare _UFFDIO_WRITEPROTECT if the user specified
UFFDIO_REGISTER_MODE_WP and if all the checks passed. Then when the user
registers regions with shmem/hugetlbfs we won't expose the new ioctl to
them. Even with complete anonymous memory range, we'll only expose the
new WP ioctl bit if the register mode has MODE_WP.Signed-off-by: Peter Xu
Signed-off-by: Andrew Morton
Reviewed-by: Mike Rapoport
Cc: Andrea Arcangeli
Cc: Bobby Powers
Cc: Brian Geffon
Cc: David Hildenbrand
Cc: Denis Plotnikov
Cc: "Dr . David Alan Gilbert"
Cc: Hugh Dickins
Cc: Jerome Glisse
Cc: Johannes Weiner
Cc: "Kirill A . Shutemov"
Cc: Martin Cracauer
Cc: Marty McFadden
Cc: Maya Gokhale
Cc: Mel Gorman
Cc: Mike Kravetz
Cc: Pavel Emelyanov
Cc: Rik van Riel
Cc: Shaohua Li
Link: http://lkml.kernel.org/r/20200220163112.11409-18-peterx@redhat.com
Signed-off-by: Linus Torvalds -
It does not make sense to try to wake up any waiting thread when we're
write-protecting a memory region. Only wake up when resolving a write
protected page fault.Signed-off-by: Peter Xu
Signed-off-by: Andrew Morton
Reviewed-by: Mike Rapoport
Cc: Andrea Arcangeli
Cc: Bobby Powers
Cc: Brian Geffon
Cc: David Hildenbrand
Cc: Denis Plotnikov
Cc: "Dr . David Alan Gilbert"
Cc: Hugh Dickins
Cc: Jerome Glisse
Cc: Johannes Weiner
Cc: "Kirill A . Shutemov"
Cc: Martin Cracauer
Cc: Marty McFadden
Cc: Maya Gokhale
Cc: Mel Gorman
Cc: Mike Kravetz
Cc: Pavel Emelyanov
Cc: Rik van Riel
Cc: Shaohua Li
Link: http://lkml.kernel.org/r/20200220163112.11409-16-peterx@redhat.com
Signed-off-by: Linus Torvalds -
Introduce the new uffd-wp APIs for userspace.
Firstly, we'll allow to do UFFDIO_REGISTER with write protection tracking
using the new UFFDIO_REGISTER_MODE_WP flag. Note that this flag can
co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case the
userspace program can not only resolve missing page faults, and at the
same time tracking page data changes along the way.Secondly, we introduced the new UFFDIO_WRITEPROTECT API to do page level
write protection tracking. Note that we will need to register the memory
region with UFFDIO_REGISTER_MODE_WP before that.[peterx@redhat.com: write up the commit message]
[peterx@redhat.com: remove useless block, write commit message, check against
VM_MAYWRITE rather than VM_WRITE when register]
Signed-off-by: Andrea Arcangeli
Signed-off-by: Peter Xu
Signed-off-by: Andrew Morton
Reviewed-by: Jerome Glisse
Cc: Bobby Powers
Cc: Brian Geffon
Cc: David Hildenbrand
Cc: Denis Plotnikov
Cc: "Dr . David Alan Gilbert"
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: "Kirill A . Shutemov"
Cc: Martin Cracauer
Cc: Marty McFadden
Cc: Maya Gokhale
Cc: Mel Gorman
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Pavel Emelyanov
Cc: Rik van Riel
Cc: Shaohua Li
Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.com
Signed-off-by: Linus Torvalds -
This allows UFFDIO_COPY to map pages write-protected.
[peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
commit messages]
Signed-off-by: Andrea Arcangeli
Signed-off-by: Peter Xu
Signed-off-by: Andrew Morton
Reviewed-by: Jerome Glisse
Reviewed-by: Mike Rapoport
Cc: Bobby Powers
Cc: Brian Geffon
Cc: David Hildenbrand
Cc: Denis Plotnikov
Cc: "Dr . David Alan Gilbert"
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: "Kirill A . Shutemov"
Cc: Martin Cracauer
Cc: Marty McFadden
Cc: Maya Gokhale
Cc: Mel Gorman
Cc: Mike Kravetz
Cc: Pavel Emelyanov
Cc: Rik van Riel
Cc: Shaohua Li
Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com
Signed-off-by: Linus Torvalds -
This replaces all remaining open encodings with is_vm_hugetlb_page().
Signed-off-by: Anshuman Khandual
Signed-off-by: Andrew Morton
Acked-by: Vlastimil Babka
Cc: Paul Mackerras
Cc: Benjamin Herrenschmidt
Cc: Michael Ellerman
Cc: Alexander Viro
Cc: Will Deacon
Cc: "Aneesh Kumar K.V"
Cc: Nick Piggin
Cc: Peter Zijlstra
Cc: Arnd Bergmann
Cc: Ingo Molnar
Cc: Arnaldo Carvalho de Melo
Cc: Andy Lutomirski
Cc: Dave Hansen
Cc: Geert Uytterhoeven
Cc: Guo Ren
Cc: Mel Gorman
Cc: Paul Burton
Cc: Paul Mackerras
Cc: Ralf Baechle
Cc: Rich Felker
Cc: Steven Rostedt
Cc: Thomas Gleixner
Cc: Yoshinori Sato
Link: http://lkml.kernel.org/r/1582520593-30704-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds
06 Apr, 2020
8 commits
-
Pull fsnotify updates from Jan Kara:
"This implements the fanotify FAN_DIR_MODIFY event.This event reports the name in a directory under which a change
happened and together with the directory filehandle and fstatat()
allows reliable and efficient implementation of directory
synchronization"* tag 'fsnotify_for_v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: Fix the checks in fanotify_fsid_equal
fanotify: report name info for FAN_DIR_MODIFY event
fanotify: record name info for FAN_DIR_MODIFY event
fanotify: Drop fanotify_event_has_fid()
fanotify: prepare to report both parent and child fid's
fanotify: send FAN_DIR_MODIFY event flavor with dir inode and name
fanotify: divorce fanotify_path_event and fanotify_fid_event
fanotify: Store fanotify handles differently
fanotify: Simplify create_fd()
fanotify: fix merging marks masks with FAN_ONDIR
fanotify: merge duplicate events on parent and child
fsnotify: replace inode pointer with an object id
fsnotify: simplify arguments passing to fsnotify_parent()
fsnotify: use helpers to access data by data_type
fsnotify: funnel all dirent events through fsnotify_name()
fsnotify: factor helpers fsnotify_dentry() and fsnotify_file()
fsnotify: tidy up FS_ and FAN_ constants -
Pull ext2/udf updates from Jan Kara:
"Cleanups and fixes for ext2 and one cleanup for udf"* tag 'for_v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: fix empty body warnings when -Wextra is used
ext2: fix debug reference to ext2_xattr_cache
udf: udf_sb.h: Replace zero-length array with flexible-array member
ext2: xattr.h: Replace zero-length array with flexible-array member
ext2: Silence lockdep warning about reclaim under xattr_sem -
Pull 9p updates from Dominique Martinet:
"Not much new, but a few patches for this cycle:- Fix read with O_NONBLOCK to allow incomplete read and return
immediately- Rest is just cleanup (indent, unused field in struct, extra
semicolon)"* tag '9p-for-5.7' of git://github.com/martinetd/linux:
net/9p: remove unused p9_req_t aux field
9p: read only once on O_NONBLOCK
9pnet: allow making incomplete read requests
9p: Remove unneeded semicolon
9p: Fix Kconfig indentation -
Pull vfs pathwalk fix from Al Viro:
"Dumb braino in legitimize_path()..."* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix a braino in legitimize_path() -
brown paperbag time... wrong order of arguments ended up confusing
the values to check dentry and mount_lock seqcounts against.Reported-by: kernel test robot
Fixes: 2aa38470853a ("non-RCU analogue of the previous commit")
Tested-by: kernel test robot
Signed-off-by: Al Viro -
Commit 9255782f7061 ("sysfs: Wrap __compat_only_sysfs_link_entry_to_kobj
function to change the symlink name") made this function a wrapper
around a new non-underscored function, which is a bit odd. The normal
naming convention is the other way around: the underscored function is
the wrappee, and the non-underscored function is the wrapper.There's only one single user (well, two call-sites in that user) of the
more limited double underscore version of this function, so just remove
the oddly named wrapper entirely and just add the extra NULL argument to
the user.I considered just doing that in the merge, but that tends to make
history really hard to read.Link: https://lore.kernel.org/lkml/CAHk-=wgkkmNV5tMzQDmPAQuNJBuMcry--Jb+h8H1o4RA3kF7QQ@mail.gmail.com/
Cc: Sourabh Jain
Cc: Michael Ellerman
Signed-off-by: Linus Torvalds -
Pull powerpc updates from Michael Ellerman:
"Slightly late as I had to rebase mid-week to insert a bug fix:- A large series from Nick for 64-bit to further rework our exception
vectors, and rewrite portions of the syscall entry/exit and
interrupt return in C. The result is much easier to follow code
that is also faster in general.- Cleanup of our ptrace code to split various parts out that had
become badly intertwined with #ifdefs over the years.- Changes to our NUMA setup under the PowerVM hypervisor which should
hopefully avoid non-sensical topologies which can lead to warnings
from the workqueue code and other problems.- MAINTAINERS updates to remove some of our old orphan entries and
update the status of others.- Quite a few other small changes and fixes all over the map.
Thanks to: Abdul Haleem, afzal mohammed, Alexey Kardashevskiy, Andrew
Donnellan, Aneesh Kumar K.V, Balamuruhan S, Cédric Le Goater, Chen
Zhou, Christophe JAILLET, Christophe Leroy, Christoph Hellwig, Clement
Courbet, Daniel Axtens, David Gibson, Douglas Miller, Fabiano Rosas,
Fangrui Song, Ganesh Goudar, Gautham R. Shenoy, Greg Kroah-Hartman,
Greg Kurz, Gustavo Luiz Duarte, Hari Bathini, Ilie Halip, Jan Kara,
Joe Lawrence, Joe Perches, Kajol Jain, Larry Finger, Laurentiu Tudor,
Leonardo Bras, Libor Pechacek, Madhavan Srinivasan, Mahesh Salgaonkar,
Masahiro Yamada, Masami Hiramatsu, Mauricio Faria de Oliveira, Michael
Neuling, Michal Suchanek, Mike Rapoport, Nageswara R Sastry, Nathan
Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Nick
Desaulniers, Oliver O'Halloran, Po-Hsu Lin, Pratik Rajesh Sampat,
Rasmus Villemoes, Ravi Bangoria, Roman Bolshakov, Sam Bobroff,
Sandipan Das, Santosh S, Sedat Dilek, Segher Boessenkool, Shilpasri G
Bhat, Sourabh Jain, Srikar Dronamraju, Stephen Rothwell, Tyrel
Datwyler, Vaibhav Jain, YueHaibing"* tag 'powerpc-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits)
powerpc: Make setjmp/longjmp signature standard
powerpc/cputable: Remove unnecessary copy of cpu_spec->oprofile_type
powerpc: Suppress .eh_frame generation
powerpc: Drop -fno-dwarf2-cfi-asm
powerpc/32: drop unused ISA_DMA_THRESHOLD
powerpc/powernv: Add documentation for the opal sensor_groups sysfs interfaces
selftests/powerpc: Fix try-run when source tree is not writable
powerpc/vmlinux.lds: Explicitly retain .gnu.hash
powerpc/ptrace: move ptrace_triggered() into hw_breakpoint.c
powerpc/ptrace: create ppc_gethwdinfo()
powerpc/ptrace: create ptrace_get_debugreg()
powerpc/ptrace: split out ADV_DEBUG_REGS related functions.
powerpc/ptrace: move register viewing functions out of ptrace.c
powerpc/ptrace: split out TRANSACTIONAL_MEM related functions.
powerpc/ptrace: split out SPE related functions.
powerpc/ptrace: split out ALTIVEC related functions.
powerpc/ptrace: split out VSX related functions.
powerpc/ptrace: drop PARAMETER_SAVE_AREA_OFFSET
powerpc/ptrace: drop unnecessary #ifdefs CONFIG_PPC64
powerpc/ptrace: remove unused header includes
... -
Pull ext4 updates from Ted Ts'o:
- Replace ext4's bmap and iopoll implementations to use iomap.
- Clean up extent tree handling.
- Other cleanups and miscellaneous bug fixes
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
ext4: save all error info in save_error_info() and drop ext4_set_errno()
ext4: fix incorrect group count in ext4_fill_super error message
ext4: fix incorrect inodes per group in error message
ext4: don't set dioread_nolock by default for blocksize < pagesize
ext4: disable dioread_nolock whenever delayed allocation is disabled
ext4: do not commit super on read-only bdev
ext4: avoid ENOSPC when avoiding to reuse recently deleted inodes
ext4: unregister sysfs path before destroying jbd2 journal
ext4: check for non-zero journal inum in ext4_calculate_overhead
ext4: remove map_from_cluster from ext4_ext_map_blocks
ext4: clean up ext4_ext_insert_extent() call in ext4_ext_map_blocks()
ext4: mark block bitmap corrupted when found instead of BUGON
ext4: use flexible-array member for xattr structs
ext4: use flexible-array member in struct fname
Documentation: correct the description of FIEMAP_EXTENT_LAST
ext4: move ext4_fiemap to use iomap framework
ext4: make ext4_ind_map_blocks work with fiemap
ext4: move ext4 bmap to use iomap infrastructure
ext4: optimize ext4_ext_precache for 0 depth
ext4: add IOMAP_F_MERGED for non-extent based mapping
...
05 Apr, 2020
2 commits
-
Pull exfat filesystem from Al Viro:
"Shiny new fs/exfat replacement for drivers/staging/exfat"* 'work.exfat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
exfat: update file system parameter handling
staging: exfat: make staging/exfat and fs/exfat mutually exclusive
MAINTAINERS: add exfat filesystem
exfat: add Kconfig and Makefile
exfat: add nls operations
exfat: add misc operations
exfat: add exfat cache
exfat: add bitmap operations
exfat: add fat entry operations
exfat: add file operations
exfat: add directory operations
exfat: add inode operations
exfat: add super block operations
exfat: add in-memory and on-disk structures and headers -
Pull nfsd updates from Chuck Lever:
- Fix EXCHANGE_ID response when NFSD runs in a container
- A battery of new static trace points
- Socket transports now use bio_vec to send Replies
- NFS/RDMA now supports filesystems with no .splice_read method
- Favor memcpy() over DMA mapping for small RPC/RDMA Replies
- Add pre-requisites for supporting multiple Write chunks
- Numerous minor fixes and clean-ups
[ Chuck is filling in for Bruce this time while he and his family settle
into a new house ]* tag 'nfsd-5.7' of git://git.linux-nfs.org/projects/cel/cel-2.6: (39 commits)
svcrdma: Fix leak of transport addresses
SUNRPC: Fix a potential buffer overflow in 'svc_print_xprts()'
SUNRPC/cache: don't allow invalid entries to be flushed
nfsd: fsnotify on rmdir under nfsd/clients/
nfsd4: kill warnings on testing stateids with mismatched clientids
nfsd: remove read permission bit for ctl sysctl
NFSD: Fix NFS server build errors
sunrpc: Add tracing for cache events
SUNRPC/cache: Allow garbage collection of invalid cache entries
nfsd: export upcalls must not return ESTALE when mountd is down
nfsd: Add tracepoints for update of the expkey and export cache entries
nfsd: Add tracepoints for exp_find_key() and exp_get_by_name()
nfsd: Add tracing to nfsd_set_fh_dentry()
nfsd: Don't add locks to closed or closing open stateids
SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends
SUNRPC: Refactor xs_sendpages()
svcrdma: Avoid DMA mapping small RPC Replies
svcrdma: Fix double sync of transport header buffer
svcrdma: Refactor chunk list encoders
SUNRPC: Add encoders for list item discriminators
...
04 Apr, 2020
2 commits
-
Pull SPDX updates from Greg KH:
"Here are three SPDX patches for 5.7-rc1.One fixes up the SPDX tag for a single driver, while the other two go
through the tree and add SPDX tags for all of the .gitignore files as
needed.Nothing too complex, but you will get a merge conflict with your
current tree, that should be trivial to handle (one file modified by
two things, one file deleted.)All three of these have been in linux-next for a while, with no
reported issues other than the merge conflict"* tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
ASoC: MT6660: make spdxcheck.py happy
.gitignore: add SPDX License Identifier
.gitignore: remove too obvious comments -
Pull cgroup updates from Tejun Heo:
- Christian extended clone3 so that processes can be spawned into
cgroups directly.This is not only neat in terms of semantics but also avoids grabbing
the global cgroup_threadgroup_rwsem for migration.- Daniel added !root xattr support to cgroupfs.
Userland already uses xattrs on cgroupfs for bookkeeping. This will
allow delegated cgroups to support such usages.- Prateek tried to make cpuset hotplug handling synchronous but that
led to possible deadlock scenarios. Reverted.- Other minor changes including release_agent_path handling cleanup.
* 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup-v1: Document the cpuset_v2_mode mount option
Revert "cpuset: Make cpuset hotplug synchronous"
cgroupfs: Support user xattrs
kernfs: Add option to enable user xattrs
kernfs: Add removed_size out param for simple_xattr_set
kernfs: kvmalloc xattr value instead of kmalloc
cgroup: Restructure release_agent_path handling
selftests/cgroup: add tests for cloning into cgroups
clone3: allow spawning processes into cgroups
cgroup: add cgroup_may_write() helper
cgroup: refactor fork helpers
cgroup: add cgroup_get_from_file() helper
cgroup: unify attach permission checking
cpuset: Make cpuset hotplug synchronous
cgroup.c: Use built-in RCU list checking
kselftest/cgroup: add cgroup destruction test
cgroup: Clean up css_set task traversal
03 Apr, 2020
15 commits
-
Merge updates from Andrew Morton:
"A large amount of MM, plenty more to come.Subsystems affected by this patch series:
- tools
- kthread
- kbuild
- scripts
- ocfs2
- vfs
- mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap,
sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy,
hugetlbfs, hugetlb"* emailed patches from Andrew Morton : (155 commits)
include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP
mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS
selftests/vm: fix map_hugetlb length used for testing read and write
mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge()
mm/hugetlb.c: clean code by removing unnecessary initialization
hugetlb_cgroup: add hugetlb_cgroup reservation docs
hugetlb_cgroup: add hugetlb_cgroup reservation tests
hugetlb: support file_region coalescing again
hugetlb_cgroup: support noreserve mappings
hugetlb_cgroup: add accounting for shared mappings
hugetlb: disable region_add file_region coalescing
hugetlb_cgroup: add reservation accounting for private mappings
mm/hugetlb_cgroup: fix hugetlb_cgroup migration
hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
hugetlb_cgroup: add hugetlb_cgroup reservation counter
hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
mm/memblock.c: remove redundant assignment to variable max_addr
mm: mempolicy: require at least one nodeid for MPOL_PREFERRED
mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
... -
Pull xfs updates from Darrick Wong:
"There's a lot going on this cycle with cleanups in the log code, the
btree code, and the xattr code.We're tightening of metadata validation and online fsck checking, and
introducing a common btree rebuilding library so that we can refactor
xfs_repair and introduce online repair in a future cycle.We also fixed a few visible bugs -- most notably there's one in
getdents that we introduced in 5.6; and a fix for hangs when disabling
quotas.This series has been running fstests & other QA in the background for
over a week and looks good so far.I anticipate sending a second pull request next week. That batch will
change how xfs interacts with memory reclaim; how the log batches and
throttles log items; how hard writes near ENOSPC will try to squeeze
more space out of the filesystem; and hopefully fix the last of the
umount hangs after a catastrophic failure. That should ease a lot of
problems when running at the limits, but for now I'm leaving that in
for-next for another week to make sure we got all the subtleties
right.Summary:
- Fix a hard to trigger race between iclog error checking and log
shutdown.- Strengthen the AGF verifier.
- Ratelimit some of the more spammy error messages.
- Remove the icdinode uid/gid members and just use the ones in the
vfs inode.- Hold ILOCK across insert/collapse range.
- Clean up the extended attribute interfaces.
- Clean up the attr flags mess.
- Restore PF_MEMALLOC after exiting xfsaild thread to avoid
triggering warnings in the process accounting code.- Remove the flexibly-sized array from struct xfs_agfl to eliminate
compiler warnings about unaligned pointers and packed structures.- Various macro and typedef removals.
- Stale metadata buffers if we decide they're corrupt outside of a
verifier.- Check directory data/block/free block owners.
- Fix a UAF when aborting inactivation of a corrupt xattr fork.
- Teach online scrub to report failed directory and attr name lookups
as a metadata corruption instead of a runtime error.- Avoid potential buffer overflows in sysfs files by using scnprintf.
- Fix a regression in getdents lookups due to a mistake in pointer
arithmetic.- Refactor btree cursor private data structures to use anonymous
unions.- Cleanups in the log unmounting code.
- Fix a potential mishandling of ENOMEM errors on multi-block
directory buffer lookups.- Fix an incorrect test in the block allocation code.
- Cleanups and name prefix shortening in the scrub code.
- Introduce btree bulk loading code for online repair and scrub.
- Fix a quotaoff log item leak (and hang) when the fs goes down
midway through a quotaoff operation.- Remove di_version from the incore inode.
- Refactor some of the log shutdown checking code.
- Record the forcing of the log unmount records in the log force
counters.- Fix a longstanding bug where quotacheck would purge the
administrator's default quota grace interval and warning limits.- Reduce memory usage when scrubbing directory and xattr trees.
- Don't let fsfreeze race with GETFSMAP or online scrub.
- Handle bio_add_page failures more gracefully in xlog_write_iclog"
* tag 'xfs-5.7-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (108 commits)
xfs: prohibit fs freezing when using empty transactions
xfs: shutdown on failure to add page to log bio
xfs: directory bestfree check should release buffers
xfs: drop all altpath buffers at the end of the sibling check
xfs: preserve default grace interval during quotacheck
xfs: remove xlog_state_want_sync
xfs: move the ioerror check out of xlog_state_clean_iclog
xfs: refactor xlog_state_clean_iclog
xfs: remove the aborted parameter to xlog_state_done_syncing
xfs: simplify log shutdown checking in xfs_log_release_iclog
xfs: simplify the xfs_log_release_iclog calling convention
xfs: factor out a xlog_wait_on_iclog helper
xfs: merge xlog_cil_push into xlog_cil_push_work
xfs: remove the di_version field from struct icdinode
xfs: simplify a check in xfs_ioctl_setattr_check_cowextsize
xfs: simplify di_flags2 inheritance in xfs_ialloc
xfs: only check the superblock version for dinode size calculation
xfs: add a new xfs_sb_version_has_v3inode helper
xfs: fix unmount hang and memory leak on shutdown during quotaoff
xfs: factor out quotaoff intent AIL removal and memory free
... -
Pull hibernation fix from Darrick Wong:
"Fix a regression where we broke the userspace hibernation driver by
disallowing writes to the swap device"* tag 'vfs-5.7-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
hibernate: Allow uswsusp to write to swap -
Pull iomap updates from Darrick Wong:
"We're fixing tracepoints and comments in this cycle, so there
shouldn't be any surprises here.I anticipate sending a second pull request next week with a single bug
fix for readahead, but it's still undergoing QA.Summary:
- Fix a broken tracepoint
- Fix a broken comment"
* tag 'iomap-5.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: fix comments in iomap_dio_rw
iomap: Remove pgoff from tracepoints -
Pull vfs pathwalk sanitizing from Al Viro:
"Massive pathwalk rewrite and cleanups.Several iterations have been posted; hopefully this thing is getting
readable and understandable now. Pretty much all parts of pathname
resolutions are affected...The branch is identical to what has sat in -next, except for commit
message in "lift all calls of step_into() out of follow_dotdot/
follow_dotdot_rcu", crediting Qian Cai for reporting the bug; only
commit message changed there."* 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (69 commits)
lookup_open(): don't bother with fallbacks to lookup+create
atomic_open(): no need to pass struct open_flags anymore
open_last_lookups(): move complete_walk() into do_open()
open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
open_last_lookups(): don't abuse complete_walk() when all we want is unlazy
open_last_lookups(): consolidate fsnotify_create() calls
take post-lookup part of do_last() out of loop
link_path_walk(): sample parent's i_uid and i_mode for the last component
__nd_alloc_stack(): make it return bool
reserve_stack(): switch to __nd_alloc_stack()
pick_link(): take reserving space on stack into a new helper
pick_link(): more straightforward handling of allocation failures
fold path_to_nameidata() into its only remaining caller
pick_link(): pass it struct path already with normal refcounting rules
fs/namei.c: kill follow_mount()
non-RCU analogue of the previous commit
helper for mount rootwards traversal
follow_dotdot(): be lazy about changing nd->path
follow_dotdot_rcu(): be lazy about changing nd->path
follow_dotdot{,_rcu}(): massage loops
... -
Pull exec/proc updates from Eric Biederman:
"This contains two significant pieces of work: the work to sort out
proc_flush_task, and the work to solve a deadlock between strace and
exec.Fixing proc_flush_task so that it no longer requires a persistent
mount makes improvements to proc possible. The removal of the
persistent mount solves an old regression that that caused the hidepid
mount option to only work on remount not on mount. The regression was
found and reported by the Android folks. This further allows Alexey
Gladkov's work making proc mount options specific to an individual
mount of proc to move forward.The work on exec starts solving a long standing issue with exec that
it takes mutexes of blocking userspace applications, which makes exec
extremely deadlock prone. For the moment this adds a second mutex with
a narrower scope that handles all of the easy cases. Which makes the
tricky cases easy to spot. With a little luck the code to solve those
deadlocks will be ready by next merge window"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
signal: Extend exec_id to 64bits
pidfd: Use new infrastructure to fix deadlocks in execve
perf: Use new infrastructure to fix deadlocks in execve
proc: io_accounting: Use new infrastructure to fix deadlocks in execve
proc: Use new infrastructure to fix deadlocks in execve
kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
kernel: doc: remove outdated comment cred.c
mm: docs: Fix a comment in process_vm_rw_core
selftests/ptrace: add test cases for dead-locks
exec: Fix a deadlock in strace
exec: Add exec_update_mutex to replace cred_guard_mutex
exec: Move exec_mmap right after de_thread in flush_old_exec
exec: Move cleanup of posix timers on exec out of de_thread
exec: Factor unshare_sighand out of de_thread and call it separately
exec: Only compute current once in flush_old_exec
pid: Improve the comment about waiting in zap_pid_ns_processes
proc: Remove the now unnecessary internal mount of proc
uml: Create a private mount of proc for mconsole
uml: Don't consult current to find the proc_mnt in mconsole_proc
proc: Use a list of inodes to flush from proc
... -
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken
into account as well.Instead of writing the required complicated code for this rare
occurrence, just eliminate the race. i_mmap_rwsem is now held in read
mode for the duration of page fault processing. Hold i_mmap_rwsem in
write mode when modifying i_size. In this way, truncation can not
proceed when page faults are being processed. In addition, i_size
will not change during fault processing so a single check can be made
to ensure faults are not beyond (proposed) end of file. Faults can
still race with hole punch, but that race is handled by existing code
and the use of hugetlb_fault_mutex.With this modification, checks for races with truncation in the page
fault path can be simplified and removed. remove_inode_hugepages no
longer needs to take hugetlb_fault_mutex in the case of truncation.
Comments are expanded to explain reasoning behind locking.Signed-off-by: Mike Kravetz
Signed-off-by: Andrew Morton
Cc: Andrea Arcangeli
Cc: "Aneesh Kumar K . V"
Cc: Davidlohr Bueso
Cc: Hugh Dickins
Cc: "Kirill A . Shutemov"
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Prakash Sangappa
Link: http://lkml.kernel.org/r/20200316205756.146666-3-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds -
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races. These issues are:1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
reserve counts and state.A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2]. However, those patches were reverted starting with [3]
due to locking issues.To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing. However, during fault
processing we need to lock the page we will be adding. Lock ordering
requires we take page lock before i_mmap_rwsem. Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages. This is not too invasive as hugetlbfs
processing is done separate from core mm in many places. However, I don't
really like this idea. Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.The only other way I can think of to address these issues is by catching
all the races. After catching a race, cleanup, backout, retry ... etc,
as needed. This can get really ugly, especially for huge page
reservations. At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races. Any other
suggestions would be welcome.[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/This patch (of 2):
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults. This is not the order
specified in the rest of mm code. Handling of hugetlbfs pages is mostly
isolated today. Therefore, we use this alternative locking order for
PageHuge() pages.mapping->i_mmap_rwsem
hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
page->flags PG_locked (lock_page)To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma. A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.Signed-off-by: Mike Kravetz
Signed-off-by: Andrew Morton
Cc: Michal Hocko
Cc: Hugh Dickins
Cc: Naoya Horiguchi
Cc: "Aneesh Kumar K . V"
Cc: Andrea Arcangeli
Cc: "Kirill A . Shutemov"
Cc: Davidlohr Bueso
Cc: Prakash Sangappa
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds -
Userfaultfd fault path was by default killable even if the caller does not
have FAULT_FLAG_KILLABLE. That makes sense before in that when with gup
we don't have FAULT_FLAG_KILLABLE properly set before. Now after previous
patch we've got FAULT_FLAG_KILLABLE applied even for gup code so it should
also make sense to let userfaultfd to honor the FAULT_FLAG_KILLABLE.Because we're unconditionally setting FAULT_FLAG_KILLABLE in gup code
right now, this patch should have no functional change. It also cleaned
the code a little bit by introducing some helpers.Signed-off-by: Peter Xu
Signed-off-by: Andrew Morton
Tested-by: Brian Geffon
Cc: Andrea Arcangeli
Cc: Bobby Powers
Cc: David Hildenbrand
Cc: Denis Plotnikov
Cc: "Dr . David Alan Gilbert"
Cc: Hugh Dickins
Cc: Jerome Glisse
Cc: Johannes Weiner
Cc: "Kirill A . Shutemov"
Cc: Martin Cracauer
Cc: Marty McFadden
Cc: Matthew Wilcox
Cc: Maya Gokhale
Cc: Mel Gorman
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Pavel Emelyanov
Link: http://lkml.kernel.org/r/20200220160300.9941-1-peterx@redhat.com
Signed-off-by: Linus Torvalds -
handle_userfaultfd() is currently the only one place in the kernel page
fault procedures that can respond to non-fatal userspace signals. It was
trying to detect such an allowance by checking against USER & KILLABLE
flags, which was "un-official".In this patch, we introduced a new flag (FAULT_FLAG_INTERRUPTIBLE) to show
that the fault handler allows the fault procedure to respond even to
non-fatal signals. Meanwhile, add this new flag to the default fault
flags so that all the page fault handlers can benefit from the new flag.
With that, replacing the userfault check to this one.Since the line is getting even longer, clean up the fault flags a bit too
to ease TTY users.Although we've got a new flag and applied it, we shouldn't have any
functional change with this patch so far.Suggested-by: Linus Torvalds
Signed-off-by: Peter Xu
Signed-off-by: Andrew Morton
Tested-by: Brian Geffon
Reviewed-by: David Hildenbrand
Cc: Andrea Arcangeli
Cc: Bobby Powers
Cc: Denis Plotnikov
Cc: "Dr . David Alan Gilbert"
Cc: Hugh Dickins
Cc: Jerome Glisse
Cc: Johannes Weiner
Cc: "Kirill A . Shutemov"
Cc: Martin Cracauer
Cc: Marty McFadden
Cc: Matthew Wilcox
Cc: Maya Gokhale
Cc: Mel Gorman
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Pavel Emelyanov
Link: http://lkml.kernel.org/r/20200220195348.16302-1-peterx@redhat.com
Signed-off-by: Linus Torvalds -
This patch removes the risk path in handle_userfault() then we will be
sure that the callers of handle_mm_fault() will know that the VMAs might
have changed. Meanwhile with previous patch we don't lose responsiveness
as well since the core mm code now can handle the nonfatal userspace
signals even if we return VM_FAULT_RETRY.Suggested-by: Andrea Arcangeli
Suggested-by: Linus Torvalds
Signed-off-by: Peter Xu
Signed-off-by: Andrew Morton
Tested-by: Brian Geffon
Reviewed-by: Jerome Glisse
Cc: Bobby Powers
Cc: David Hildenbrand
Cc: Denis Plotnikov
Cc: "Dr . David Alan Gilbert"
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: "Kirill A . Shutemov"
Cc: Martin Cracauer
Cc: Marty McFadden
Cc: Matthew Wilcox
Cc: Maya Gokhale
Cc: Mel Gorman
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Pavel Emelyanov
Link: http://lkml.kernel.org/r/20200220160234.9646-1-peterx@redhat.com
Signed-off-by: Linus Torvalds -
Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
to better reflect what they are actually doing:1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge
the current memcg2) set or clear the PageKmemcg flag
Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Acked-by: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com
Signed-off-by: Linus Torvalds -
This notice fills my boot logs with scary-looking asterisks but doesn't
really tell me anything. Let's just remove it; validation errors are
already reported separately, so this is just a redundant list of
filesystems.$ dmesg | grep VALIDATE
[ 0.306256] *** VALIDATE tmpfs ***
[ 0.307422] *** VALIDATE proc ***
[ 0.308355] *** VALIDATE cgroup ***
[ 0.308741] *** VALIDATE cgroup2 ***
[ 0.813256] *** VALIDATE bpf ***
[ 0.815272] *** VALIDATE ramfs ***
[ 0.815665] *** VALIDATE hugetlbfs ***
[ 0.876970] *** VALIDATE nfs ***
[ 0.877383] *** VALIDATE nfs4 ***Signed-off-by: Kees Cook
Signed-off-by: Andrew Morton
Reviewed-by: Seth Arnold
Cc: Alexander Viro
Link: http://lkml.kernel.org/r/202003061617.A8835CAAF@keescook
Signed-off-by: Linus Torvalds -
OCFS2 doesn't mind if memory reclaim makes I/Os happen; it just cares that
it won't be reentered, so it can use memalloc_nofs_save() instead of
memalloc_noio_save().Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Andrew Morton
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Cc: Changwei Ge
Cc: Gang He
Cc: Jun Piao
Link: http://lkml.kernel.org/r/20200326200214.1102-1-willy@infradead.org
Signed-off-by: Linus Torvalds -
Since snprintf() returns the would-be-output size instead of the actual
output size, the succeeding calls may go beyond the given buffer limit.
Fix it by replacing with scnprintf().Signed-off-by: Takashi Iwai
Signed-off-by: Andrew Morton
Acked-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Joseph Qi
Cc: Changwei Ge
Cc: Gang He
Cc: Jun Piao
Link: http://lkml.kernel.org/r/20200311093516.25300-1-tiwai@suse.de
Signed-off-by: Linus Torvalds