31 Aug, 2022
1 commit
-
[ Upstream commit efd4149342db2df41b1bbe68972ead853b30e444 ]
These bits should only be valid when the ptes are present. Introducing
two booleans for it and set it to false when !pte_present() for both pte
and pmd accountings.The bug is found during code reading and no real world issue reported, but
logically such an error can cause incorrect readings for either smaps or
smaps_rollup output on quite a few fields.For example, it could cause over-estimate on values like Shared_Dirty,
Private_Dirty, Referenced. Or it could also cause under-estimate on
values like LazyFree, Shared_Clean, Private_Clean.Link: https://lkml.kernel.org/r/20220805160003.58929-1-peterx@redhat.com
Fixes: b1d4d9e0cbd0 ("proc/smaps: carefully handle migration entries")
Fixes: c94b6923fa0a ("/proc/PID/smaps: Add PMD migration entry parsing")
Signed-off-by: Peter Xu
Reviewed-by: Vlastimil Babka
Reviewed-by: David Hildenbrand
Reviewed-by: Yang Shi
Cc: Konstantin Khlebnikov
Cc: Huang Ying
Signed-off-by: Andrew Morton
Signed-off-by: Sasha Levin
17 Aug, 2022
1 commit
-
[ Upstream commit d919a1e79bac890421537cf02ae773007bf55e6b ]
Commit 7bc3e6e55acf06 ("proc: Use a list of inodes to flush from proc")
moved proc_flush_task() behind __exit_signal(). Then, process systemd can
take long period high cpu usage during releasing task in following
concurrent processes:systemd ps
kernel_waitid stat(/proc/tgid)
do_wait filename_lookup
wait_consider_task lookup_fast
release_task
__exit_signal
__unhash_process
detach_pid
__change_pid // remove task->pid_links
d_revalidate -> pid_revalidate // 0
d_invalidate(/proc/tgid)
shrink_dcache_parent(/proc/tgid)
d_walk(/proc/tgid)
spin_lock_nested(/proc/tgid/fd)
// iterating opened fd
proc_flush_pid |
d_invalidate (/proc/tgid/fd) |
shrink_dcache_parent(/proc/tgid/fd) |
shrink_dentry_list(subdirs) ↓
shrink_lock_dentry(/proc/tgid/fd) --> race on dentry lockFunction d_invalidate() will remove dentry from hash firstly, but why does
proc_flush_pid() process dentry '/proc/tgid/fd' before dentry
'/proc/tgid'? That's because proc_pid_make_inode() adds proc inode in
reverse order by invoking hlist_add_head_rcu(). But proc should not add
any inodes under '/proc/tgid' except '/proc/tgid/task/pid', fix it by
adding inode into 'pid->inodes' only if the inode is /proc/tgid or
/proc/tgid/task/pid.Performance regression:
Create 200 tasks, each task open one file for 50,000 times. Kill all
tasks when opened files exceed 10,000,000 (cat /proc/sys/fs/file-nr).Before fix:
$ time killall -wq aa
real 4m40.946s # During this period, we can see 'ps' and 'systemd'
taking high cpu usage.After fix:
$ time killall -wq aa
real 1m20.732s # During this period, we can see 'systemd' taking
high cpu usage.Link: https://lkml.kernel.org/r/20220713130029.4133533-1-chengzhihao1@huawei.com
Fixes: 7bc3e6e55acf06 ("proc: Use a list of inodes to flush from proc")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216054
Signed-off-by: Zhihao Cheng
Signed-off-by: Zhang Yi
Suggested-by: Brian Foster
Reviewed-by: Brian Foster
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: Eric Biederman
Cc: Matthew Wilcox
Cc: Baoquan He
Cc: Kalesh Singh
Cc: Yu Kuai
Signed-off-by: Andrew Morton
Signed-off-by: Sasha Levin
29 Jul, 2022
1 commit
-
[ Upstream commit 78e36f3b0dae586f623c4a37ec5eb5496f5abbe1 ]
sysctl has helpers which let us specify boundary values for a min or max
int value. Since these are used for a boundary check only they don't
change, so move these variables to sysctl_vals to avoid adding duplicate
variables. This will help with our cleanup of kernel/sysctl.c.[akpm@linux-foundation.org: update it for "mm/pagealloc: sysctl: change watermark_scale_factor max limit to 30%"]
[mcgrof@kernel.org: major rebase]Link: https://lkml.kernel.org/r/20211123202347.818157-3-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni
Signed-off-by: Luis Chamberlain
Reviewed-by: Kees Cook
Cc: Al Viro
Cc: Amir Goldstein
Cc: Andy Shevchenko
Cc: Benjamin LaHaise
Cc: "Eric W. Biederman"
Cc: Greg Kroah-Hartman
Cc: Iurii Zaikin
Cc: Jan Kara
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Petr Mladek
Cc: Qing Wang
Cc: Sebastian Reichel
Cc: Sergey Senozhatsky
Cc: Stephen Kitt
Cc: Tetsuo Handa
Cc: Antti Palosaari
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Clemens Ladisch
Cc: David Airlie
Cc: Jani Nikula
Cc: Joel Becker
Cc: Joonas Lahtinen
Cc: Joseph Qi
Cc: Julia Lawall
Cc: Lukas Middendorf
Cc: Mark Fasheh
Cc: Phillip Potter
Cc: Rodrigo Vivi
Cc: Douglas Gilbert
Cc: James E.J. Bottomley
Cc: Jani Nikula
Cc: John Ogness
Cc: Martin K. Petersen
Cc: "Rafael J. Wysocki"
Cc: Steven Rostedt (VMware)
Cc: Suren Baghdasaryan
Cc: "Theodore Ts'o"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
09 Jun, 2022
1 commit
-
[ Upstream commit 7055197705709c59b8ab77e6a5c7d46d61edd96e ]
When a process exits, /proc/${pid}, and /proc/${pid}/net dentries are
flushed. However some leaf dentries like /proc/${pid}/net/arp_cache
aren't. That's because respective PDEs have proc_misc_d_revalidate() hook
which returns 1 and leaves dentries/inodes in the LRU.Force revalidation/lookup on everything under /proc/${pid}/net by
inheriting proc_net_dentry_ops.[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/YjdVHgildbWO7diJ@localhost.localdomain
Fixes: c6c75deda813 ("proc: fix lookup in /proc/net subdirectories after setns(2)")
Signed-off-by: Alexey Dobriyan
Reported-by: hui li
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Sasha Levin
18 May, 2022
1 commit
-
[ Upstream commit 1927e498aee1757b3df755a194cbfc5cc0f2b663 ]
The file permissions on the fdinfo dir from were changed from
S_IRUSR|S_IXUSR to S_IRUGO|S_IXUGO, and a PTRACE_MODE_READ check was added
for opening the fdinfo files [1]. However, the ptrace permission check
was not added to the directory, allowing anyone to get the open FD numbers
by reading the fdinfo directory.Add the missing ptrace permission check for opening the fdinfo directory.
[1] https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com
Link: https://lkml.kernel.org/r/20210713162008.1056986-1-kaleshsingh@google.com
Fixes: 7bc3fa0172a4 ("procfs: allow reading fdinfo with PTRACE_MODE_READ")
Signed-off-by: Kalesh Singh
Cc: Kees Cook
Cc: Eric W. Biederman
Cc: Christian Brauner
Cc: Suren Baghdasaryan
Cc: Hridya Valsaraju
Cc: Jann Horn
Signed-off-by: Andrew Morton
Signed-off-by: Sasha Levin
08 Apr, 2022
1 commit
-
commit bed5b60bf67ccd8957b8c0558fead30c4a3f5d3f upstream.
kzalloc is a memory allocation function which can return NULL when some
internal memory errors happen. It is safer to add null pointer check.Link: https://lkml.kernel.org/r/20220329104004.2376879-1-lv.ruyi@zte.com.cn
Cc: stable@vger.kernel.org
Fixes: c1a3c36017d4 ("proc: bootconfig: Add /proc/bootconfig to show boot config list")
Acked-by: Masami Hiramatsu
Reported-by: Zeal Robot
Signed-off-by: Lv Ruyi
Signed-off-by: Steven Rostedt (Google)
Signed-off-by: Greg Kroah-Hartman
09 Mar, 2022
1 commit
-
commit dd21bfa425c098b95ca86845f8e7d1ec1ddf6e4a upstream.
Since bit 57 was exported for uffd-wp write-protected (commit
fb8e37f35a2f: "mm/pagemap: export uffd-wp protection information"),
fixing it can reduce some unnecessary confusion.Link: https://lkml.kernel.org/r/20220301044538.3042713-1-yun.zhou@windriver.com
Fixes: fb8e37f35a2fe1 ("mm/pagemap: export uffd-wp protection information")
Signed-off-by: Yun Zhou
Reviewed-by: Peter Xu
Cc: Jonathan Corbet
Cc: Tiberiu A Georgescu
Cc: Florian Schmidt
Cc: Ivan Teterevkov
Cc: SeongJae Park
Cc: Yang Shi
Cc: David Hildenbrand
Cc: Axel Rasmussen
Cc: Miaohe Lin
Cc: Andrea Arcangeli
Cc: Colin Cross
Cc: Alistair Popple
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
23 Feb, 2022
1 commit
-
commit 24d7275ce2791829953ed4e72f68277ceb2571c6 upstream.
The syzbot reported the below BUG:
kernel BUG at include/linux/page-flags.h:785!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 4392 Comm: syz-executor560 Not tainted 5.16.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:PageDoubleMap include/linux/page-flags.h:785 [inline]
RIP: 0010:__page_mapcount+0x2d2/0x350 mm/util.c:744
Call Trace:
page_mapcount include/linux/mm.h:837 [inline]
smaps_account+0x470/0xb10 fs/proc/task_mmu.c:466
smaps_pte_entry fs/proc/task_mmu.c:538 [inline]
smaps_pte_range+0x611/0x1250 fs/proc/task_mmu.c:601
walk_pmd_range mm/pagewalk.c:128 [inline]
walk_pud_range mm/pagewalk.c:205 [inline]
walk_p4d_range mm/pagewalk.c:240 [inline]
walk_pgd_range mm/pagewalk.c:277 [inline]
__walk_page_range+0xe23/0x1ea0 mm/pagewalk.c:379
walk_page_vma+0x277/0x350 mm/pagewalk.c:530
smap_gather_stats.part.0+0x148/0x260 fs/proc/task_mmu.c:768
smap_gather_stats fs/proc/task_mmu.c:741 [inline]
show_smap+0xc6/0x440 fs/proc/task_mmu.c:822
seq_read_iter+0xbb0/0x1240 fs/seq_file.c:272
seq_read+0x3e0/0x5b0 fs/seq_file.c:162
vfs_read+0x1b5/0x600 fs/read_write.c:479
ksys_read+0x12d/0x250 fs/read_write.c:619
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xaeThe reproducer was trying to read /proc/$PID/smaps when calling
MADV_FREE at the mean time. MADV_FREE may split THPs if it is called
for partial THP. It may trigger the below race:CPU A CPU B
----- -----
smaps walk: MADV_FREE:
page_mapcount()
PageCompound()
split_huge_page()
page = compound_head(page)
PageDoubleMap(page)When calling PageDoubleMap() this page is not a tail page of THP anymore
so the BUG is triggered.This could be fixed by elevated refcount of the page before calling
mapcount, but that would prevent it from counting migration entries, and
it seems overkilling because the race just could happen when PMD is
split so all PTE entries of tail pages are actually migration entries,
and smaps_account() does treat migration entries as mapcount == 1 as
Kirill pointed out.Add a new parameter for smaps_account() to tell this entry is migration
entry then skip calling page_mapcount(). Don't skip getting mapcount
for device private entries since they do track references with mapcount.Pagemap also has the similar issue although it was not reported. Fixed
it as well.[shy828301@gmail.com: v4]
Link: https://lkml.kernel.org/r/20220203182641.824731-1-shy828301@gmail.com
[nathan@kernel.org: avoid unused variable warning in pagemap_pmd_range()]
Link: https://lkml.kernel.org/r/20220207171049.1102239-1-nathan@kernel.org
Link: https://lkml.kernel.org/r/20220120202805.3369-1-shy828301@gmail.com
Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()")
Signed-off-by: Yang Shi
Signed-off-by: Nathan Chancellor
Reported-by: syzbot+1f52b3a18d5633fa7f82@syzkaller.appspotmail.com
Acked-by: David Hildenbrand
Cc: "Kirill A. Shutemov"
Cc: Jann Horn
Cc: Matthew Wilcox
Cc: Alexey Dobriyan
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
01 Dec, 2021
1 commit
-
commit c1e63117711977cc4295b2ce73de29dd17066c82 upstream.
To clear a user buffer we cannot simply use memset, we have to use
clear_user(). With a virtio-mem device that registers a vmcore_cb and
has some logically unplugged memory inside an added Linux memory block,
I can easily trigger a BUG by copying the vmcore via "cp":systemd[1]: Starting Kdump Vmcore Save Service...
kdump[420]: Kdump is using the default log level(3).
kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
kdump[465]: saving vmcore-dmesg.txt complete
kdump[467]: saving vmcore
BUG: unable to handle page fault for address: 00007f2374e01000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
Oops: 0003 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
FS: 00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
Call Trace:
read_vmcore+0x236/0x2c0
proc_reg_read+0x55/0xa0
vfs_read+0x95/0x190
ksys_read+0x4f/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xaeSome x86-64 CPUs have a CPU feature called "Supervisor Mode Access
Prevention (SMAP)", which is used to detect wrong access from the kernel
to user buffers like this: SMAP triggers a permissions violation on
wrong access. In the x86-64 variant of clear_user(), SMAP is properly
handled via clac()+stac().To fix, properly use clear_user() when we're dealing with a user buffer.
Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com
Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
Signed-off-by: David Hildenbrand
Acked-by: Baoquan He
Cc: Dave Young
Cc: Baoquan He
Cc: Vivek Goyal
Cc: Philipp Rudo
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
19 Nov, 2021
1 commit
-
[ Upstream commit a130e8fbc7de796eb6e680724d87f4737a26d0ac ]
/proc/uptime reports idle time by reading the CPUTIME_IDLE field from
the per-cpu kcpustats. However, on NO_HZ systems, idle time is not
continually updated on idle cpus, leading this value to appear
incorrectly small./proc/stat performs an accounting update when reading idle time; we
can use the same approach for uptime.With this patch, /proc/stat and /proc/uptime now agree on idle time.
Additionally, the following shows idle time tick up consistently on an
idle machine:(while true; do cat /proc/uptime; sleep 1; done) | awk '{print $2-prev; prev=$2}'
Reported-by: Luigi Rizzo
Signed-off-by: Josh Don
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Eric Dumazet
Link: https://lkml.kernel.org/r/20210827165438.3280779-1-joshdon@google.com
Signed-off-by: Sasha Levin
12 Nov, 2021
1 commit
-
commit 54354c6a9f7fd5572d2b9ec108117c4f376d4d23 upstream.
This reverts commit 152c432b128cb043fc107e8f211195fe94b2159c.
When a kernel address couldn't be symbolized for /proc/$pid/wchan, it
would leak the raw value, a potential information exposure. This is a
regression compared to the safer pre-v5.12 behavior.Reported-by: kernel test robot
Reported-by: Vito Caputo
Reported-by: Jann Horn
Signed-off-by: Kees Cook
Signed-off-by: Peter Zijlstra (Intel)
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20211008111626.090829198@infradead.org
Signed-off-by: Greg Kroah-Hartman
09 Sep, 2021
3 commits
-
Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"* emailed patches from Andrew Morton : (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
... -
While comm change event via prctl has been reported to proc connector by
'commit f786ecba4158 ("connector: add comm change event report to proc
connector")', connector listeners were missing comm changes by explicit
writes on /proc/[pid]/comm.Let explicit writes on /proc/[pid]/comm report to proc connector.
Link: https://lkml.kernel.org/r/20210701133458epcms1p68e9eb9bd0eee8903ba26679a37d9d960@epcms1p6
Signed-off-by: Ohhoon Kwon
Cc: Ingo Molnar
Cc: David S. Miller
Cc: Christian Brauner
Cc: Eric W. Biederman
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use seq_escape_str and seq_printf instead of poking holes into the
seq_file abstraction.Link: https://lkml.kernel.org/r/20210810151945.1795567-1-hch@lst.de
Signed-off-by: Christoph Hellwig
Acked-by: Christian Brauner
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 Sep, 2021
1 commit
-
All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be
set from user space, so all users are gone; let's remove it.Acked-by: "Eric W. Biederman"
Acked-by: Christian König
Signed-off-by: David Hildenbrand
04 Jul, 2021
2 commits
-
Pull vfs name lookup updates from Al Viro:
"Small namei.c patch series, mostly to simplify the rules for nameidata
state. It's actually from the previous cycle - but I didn't post it
for review in time...Changes visible outside of fs/namei.c: file_open_root() calling
conventions change, some freed bits in LOOKUP_... space"* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
namei: make sure nd->depth is always valid
teach set_nameidata() to handle setting the root as well
take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space
switch file_open_root() to struct path -
Pull tracing updates from Steven Rostedt:
- Added option for per CPU threads to the hwlat tracer
- Have hwlat tracer handle hotplug CPUs
- New tracer: osnoise, that detects latency caused by interrupts,
softirqs and scheduling of other tasks.- Added timerlat tracer that creates a thread and measures in detail
what sources of latency it has for wake ups.- Removed the "success" field of the sched_wakeup trace event. This has
been hardcoded as "1" since 2015, no tooling should be looking at it
now. If one exists, we can revert this commit, fix that tool and try
to remove it again in the future.- tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.
- New boot command line option "tp_printk_stop", as tp_printk causes
trace events to write to console. When user space starts, this can
easily live lock the system. Having a boot option to stop just after
boot up is useful to prevent that from happening.- Have ftrace_dump_on_oops boot command line option take numbers that
match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.- Bootconfig clean ups, fixes and enhancements.
- New ktest script that tests bootconfig options.
- Add tracepoint_probe_register_may_exist() to register a tracepoint
without triggering a WARN*() if it already exists. BPF has a path
from user space that can do this. All other paths are considered a
bug.- Small clean ups and fixes
* tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (49 commits)
tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
tracing: Simplify & fix saved_tgids logic
treewide: Add missing semicolons to __assign_str uses
tracing: Change variable type as bool for clean-up
trace/timerlat: Fix indentation on timerlat_main()
trace/osnoise: Make 'noise' variable s64 in run_osnoise()
tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
tracing: Fix spelling in osnoise tracer "interferences" -> "interference"
Documentation: Fix a typo on trace/osnoise-tracer
trace/osnoise: Fix return value on osnoise_init_hotplug_support
trace/osnoise: Make interval u64 on osnoise_main
trace/osnoise: Fix 'no previous prototype' warnings
tracing: Have osnoise_main() add a quiescent state for task rcu
seq_buf: Make trace_seq_putmem_hex() support data longer than 8
seq_buf: Fix overflow in seq_buf_putmem_hex()
trace/osnoise: Support hotplug operations
trace/hwlat: Support hotplug operations
trace/hwlat: Protect kdata->kthread with get/put_online_cpus
trace: Add timerlat tracer
trace: Add osnoise tracer
...
03 Jul, 2021
1 commit
-
Merge more updates from Andrew Morton:
"190 patches.Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"* emailed patches from Andrew Morton : (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
02 Jul, 2021
4 commits
-
And 'ino' field to /proc//fdinfo/ and
/proc//task//fdinfo/.The inode numbers can be used to uniquely identify DMA buffers in user
space and avoids a dependency on /proc//fd/* when accounting
per-process DMA buffer sizes.Link: https://lkml.kernel.org/r/20210308170651.919148-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh
Acked-by: Randy Dunlap
Acked-by: Christian König
Cc: Jann Horn
Cc: Jeff Vander Stoep
Cc: Kees Cook
Cc: Suren Baghdasaryan
Cc: Minchan Kim
Cc: Hridya Valsaraju
Cc: Matthew Wilcox
Cc: Alexander Viro
Cc: Kalesh Singh
Cc: Alexey Dobriyan
Cc: Jonathan Corbet
Cc: Mauro Carvalho Chehab
Cc: Michal Hocko
Cc: Alexey Gladkov
Cc: Szabolcs Nagy
Cc: Eric W. Biederman
Cc: Christian Brauner
Cc: Michel Lespinasse
Cc: Bernd Edlinger
Cc: Andrei Vagin
Cc: Helge Deller
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Android captures per-process system memory state when certain low memory
events (e.g a foreground app kill) occur, to identify potential memory
hoggers. In order to measure how much memory a process actually consumes,
it is necessary to include the DMA buffer sizes for that process in the
memory accounting. Since the handle to DMA buffers are raw FDs, it is
important to be able to identify which processes have FD references to a
DMA buffer.Currently, DMA buffer FDs can be accounted using /proc//fd/* and
/proc//fdinfo -- both are only readable by the process owner, as
follows:1. Do a readlink on each FD.
2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
3. stat the file to get the dmabuf inode number.
4. Read/ proc//fdinfo/, to get the DMA buffer size.Accessing other processes' fdinfo requires root privileges. This limits
the use of the interface to debugging environments and is not suitable for
production builds. Granting root privileges even to a system process
increases the attack surface and is highly undesirable.Since fdinfo doesn't permit reading process memory and manipulating
process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCRED.Link: https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com
Signed-off-by: Kalesh Singh
Suggested-by: Jann Horn
Acked-by: Christian König
Cc: Alexander Viro
Cc: Alexey Dobriyan
Cc: Alexey Gladkov
Cc: Andrei Vagin
Cc: Bernd Edlinger
Cc: Christian Brauner
Cc: Eric W. Biederman
Cc: Helge Deller
Cc: Hridya Valsaraju
Cc: James Morris
Cc: Jeff Vander Stoep
Cc: Jonathan Corbet
Cc: Kees Cook
Cc: Matthew Wilcox
Cc: Mauro Carvalho Chehab
Cc: Michal Hocko
Cc: Michel Lespinasse
Cc: Minchan Kim
Cc: Randy Dunlap
Cc: Suren Baghdasaryan
Cc: Szabolcs Nagy
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use size_t when capping the count argument received by mem_rw(). Since
count is size_t, using min_t(int, ...) can lead to a negative value
that will later be passed to access_remote_vm(), which can cause
unexpected behavior.Since we are capping the value to at maximum PAGE_SIZE, the conversion
from size_t to int when passing it to access_remote_vm() as "len"
shouldn't be a problem.Link: https://lkml.kernel.org/r/20210512125215.3348316-1-marcelo.cerri@canonical.com
Reviewed-by: David Disseldorp
Signed-off-by: Thadeu Lima de Souza Cascardo
Signed-off-by: Marcelo Henrique Cerri
Cc: Alexey Dobriyan
Cc: Souza Cascardo
Cc: Christian Brauner
Cc: Michel Lespinasse
Cc: Helge Deller
Cc: Oleg Nesterov
Cc: Lorenzo Stoakes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Patch series "Add support for SVM atomics in Nouveau", v11.
Introduction
============Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory. To support atomic operations to
a shared virtual memory page such a device needs access to that page which
is exclusive of the CPU. This series introduces a mechanism to
temporarily unmap pages granting exclusive access to a device.These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .Implementation
==============Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.Restoring the entry triggers calls to MMU notifers which allows a device
driver to revoke the atomic access permission from the GPU prior to the
CPU finalising the entry.Patches
=======Patches 1 & 2 refactor existing migration and device private entry
functions.Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().Patch 5 renames some existing code but does not introduce functionality.
Patch 6 is a small clean-up to swap entry handling in copy_pte_range().
Patch 7 contains the bulk of the implementation for device exclusive
memory.Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.Patch 9 is a cleanup for the Nouveau SVM implementation.
Patch 10 contains the implementation of atomic access for the Nouveau
driver.Testing
=======This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic.
Without this series the test fails as there is no way of write-protecting
the page mapping which results in the device clobbering CPU writes. For
reference the test is available at
https://ozlabs.org/~apopple/opencl_svm_atomics/Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.This patch (of 10):
Remove multiple similar inline functions for dealing with different types
of special swap entries.Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct page
for each swap entry type use a common function pfn_swap_entry_to_page().
Also open-code the various entry_to_pfn() functions as this results is
shorter code that is easier to understand.Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
Signed-off-by: Alistair Popple
Reviewed-by: Ralph Campbell
Reviewed-by: Christoph Hellwig
Cc: "Matthew Wilcox (Oracle)"
Cc: Hugh Dickins
Cc: Peter Xu
Cc: Shakeel Butt
Cc: Ben Skeggs
Cc: Jason Gunthorpe
Cc: John Hubbard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Jul, 2021
6 commits
-
Let's properly synchronize with drivers that set PageOffline().
Unfreeze/thaw every now and then, so drivers that want to set
PageOffline() can make progress.Link: https://lkml.kernel.org/r/20210526093041.8800-7-david@redhat.com
Signed-off-by: David Hildenbrand
Acked-by: Mike Rapoport
Reviewed-by: Oscar Salvador
Cc: Aili Yao
Cc: Alexey Dobriyan
Cc: Alex Shi
Cc: Haiyang Zhang
Cc: Jason Wang
Cc: Jiri Bohac
Cc: "K. Y. Srinivasan"
Cc: "Matthew Wilcox (Oracle)"
Cc: "Michael S. Tsirkin"
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Naoya Horiguchi
Cc: Roman Gushchin
Cc: Stephen Hemminger
Cc: Steven Price
Cc: Wei Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Let's avoid reading:
1) Offline memory sections: the content of offline memory sections is
stale as the memory is effectively unused by the kernel. On s390x with
standby memory, offline memory sections (belonging to offline storage
increments) are not accessible. With virtio-mem and the hyper-v
balloon, we can have unavailable memory chunks that should not be
accessed inside offline memory sections. Last but not least, offline
memory sections might contain hwpoisoned pages which we can no longer
identify because the memmap is stale.2) PG_offline pages: logically offline pages that are documented as
"The content of these pages is effectively stale. Such pages should
not be touched (read/write/dump/save) except by their owner.".
Examples include pages inflated in a balloon or unavailble memory
ranges inside hotplugged memory sections with virtio-mem or the hyper-v
balloon.3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
As documented: "Accessing is not safe since it may cause another
machine check. Don't touch!"Introduce is_page_hwpoison(), adding a comment that it is inherently racy
but best we can really do.Reading /proc/kcore now performs similar checks as when reading
/proc/vmcore for kdump via makedumpfile: problematic pages are exclude.
It's also similar to hibernation code, however, we don't skip hwpoisoned
pages when processing pages in kernel/power/snapshot.c:saveable_page()
yet.Note 1: we can race against memory offlining code, especially memory going
offline and getting unplugged: however, we will properly tear down the
identity mapping and handle faults gracefully when accessing this memory
from kcore code.Note 2: we can race against drivers setting PageOffline() and turning
memory inaccessible in the hypervisor. We'll handle this in a follow-up
patch.Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.com
Signed-off-by: David Hildenbrand
Reviewed-by: Mike Rapoport
Reviewed-by: Oscar Salvador
Cc: Aili Yao
Cc: Alexey Dobriyan
Cc: Alex Shi
Cc: Haiyang Zhang
Cc: Jason Wang
Cc: Jiri Bohac
Cc: "K. Y. Srinivasan"
Cc: "Matthew Wilcox (Oracle)"
Cc: "Michael S. Tsirkin"
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Naoya Horiguchi
Cc: Roman Gushchin
Cc: Stephen Hemminger
Cc: Steven Price
Cc: Wei Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Let's resturcture the code, using switch-case, and checking pfn_is_ram()
only when we are dealing with KCORE_RAM.Link: https://lkml.kernel.org/r/20210526093041.8800-3-david@redhat.com
Signed-off-by: David Hildenbrand
Reviewed-by: Mike Rapoport
Cc: Aili Yao
Cc: Alexey Dobriyan
Cc: Alex Shi
Cc: Haiyang Zhang
Cc: Jason Wang
Cc: Jiri Bohac
Cc: "K. Y. Srinivasan"
Cc: "Matthew Wilcox (Oracle)"
Cc: "Michael S. Tsirkin"
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Naoya Horiguchi
Cc: Oscar Salvador
Cc: Roman Gushchin
Cc: Stephen Hemminger
Cc: Steven Price
Cc: Wei Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Patch series "fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages", v3.
Looking for places where the kernel might unconditionally read
PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore
needs some more love to not touch some other pages we really don't want to
read -- i.e., hwpoisoned ones.Examples for PageOffline() pages are pages inflated in a balloon, memory
unplugged via virtio-mem, and partially-present sections in memory added
by the Hyper-V balloon.When reading pages inflated in a balloon, we essentially produce
unnecessary load in the hypervisor; holes in partially present sections in
case of Hyper-V are not accessible and already were a problem for
/proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages. In
the future, virtio-mem might disallow reading unplugged memory -- marked
as PageOffline() -- in some environments, resulting in undefined behavior
when accessed; therefore, I'm trying to identify and rework all these
(corner) cases.With this series, there is really only access via /dev/mem, /proc/vmcore
and kdb left after I ripped out /dev/kmem. kdb is an advanced corner-case
use case -- we won't care for now if someone explicitly tries to do nasty
things by reading from/writing to physical addresses we better not touch.
/dev/mem is a use case we won't support for virtio-mem, at least for now,
so we'll simply disallow mapping any virtio-mem memory via /dev/mem next.
/proc/vmcore is really only a problem when dumping the old kernel via
something that's not makedumpfile (read: basically never), however, we'll
try sanitizing that as well in the second kernel in the future.Tested via kcore_dump:
https://github.com/schlafwandler/kcore_dumpThis patch (of 6):
Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()")
removed the last user of KCORE_REMAP.Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
dumping vsyscall user page") removed the last user of KCORE_OTHER.Let's drop both types. While at it, also drop vaddr in "struct
kcore_list", used by KCORE_REMAP only.Link: https://lkml.kernel.org/r/20210526093041.8800-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210526093041.8800-2-david@redhat.com
Signed-off-by: David Hildenbrand
Reviewed-by: Mike Rapoport
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Alexey Dobriyan
Cc: "Matthew Wilcox (Oracle)"
Cc: Oscar Salvador
Cc: Michal Hocko
Cc: Roman Gushchin
Cc: Alex Shi
Cc: Steven Price
Cc: Mike Kravetz
Cc: Aili Yao
Cc: Jiri Bohac
Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Cc: Wei Liu
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Export the PTE/PMD status of uffd-wp to pagemap too.
Link: https://lkml.kernel.org/r/20210428225030.9708-6-peterx@redhat.com
Signed-off-by: Peter Xu
Cc: Alexander Viro
Cc: Andrea Arcangeli
Cc: Axel Rasmussen
Cc: Brian Geffon
Cc: "Dr . David Alan Gilbert"
Cc: Hugh Dickins
Cc: Jerome Glisse
Cc: Joe Perches
Cc: Kirill A. Shutemov
Cc: Lokesh Gidra
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Mina Almasry
Cc: Oliver Upton
Cc: Shaohua Li
Cc: Shuah Khan
Cc: Stephen Rothwell
Cc: Wang Qing
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for
(non-shmem) FS"), read-only THP file mapping is supported. But it forgot
to add checking for it in transparent_hugepage_enabled(). To fix it, we
add checking for read-only THP file mapping and also introduce helper
transhuge_vma_enabled() to check whether thp is enabled for specified vma
to reduce duplicated code. We rename transparent_hugepage_enabled to
transparent_hugepage_active to make the code easier to follow as suggested
by David Hildenbrand.[linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.comLink: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Miaohe Lin
Reviewed-by: Yang Shi
Cc: Alexey Dobriyan
Cc: "Aneesh Kumar K . V"
Cc: Anshuman Khandual
Cc: David Hildenbrand
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Kirill A. Shutemov
Cc: Matthew Wilcox
Cc: Minchan Kim
Cc: Ralph Campbell
Cc: Rik van Riel
Cc: Song Liu
Cc: William Kucharski
Cc: Zi Yan
Cc: Mike Kravetz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Jun, 2021
2 commits
-
Merge misc updates from Andrew Morton:
"191 patches.Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
pagealloc, and memory-failure)"* emailed patches from Andrew Morton : (191 commits)
mm,hwpoison: make get_hwpoison_page() call get_any_page()
mm,hwpoison: send SIGBUS with error virutal address
mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
docs: remove description of DISCONTIGMEM
arch, mm: remove stale mentions of DISCONIGMEM
mm: remove CONFIG_DISCONTIGMEM
m68k: remove support for DISCONTIGMEM
arc: remove support for DISCONTIGMEM
arc: update comment about HIGHMEM implementation
alpha: remove DISCONTIGMEM and NUMA
mm/page_alloc: move free_the_page
mm/page_alloc: fix counting of managed_pages
mm/page_alloc: improve memmap_pages dbg msg
mm: drop SECTION_SHIFT in code comments
mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
mm/page_alloc: scale the number of pages that are batch freed
... -
has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop
cleanup.Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
would reintroduce a loss of SMP scalability to pin-fast, so there's no
future potential usefulness to keep an atomic in the mm for this.set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE
(atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
atomic_set after this commit) has to be still issued only once per "mm",
so the difference between the two will be lost in the noise.will-it-scale "mmap2" shows no change in performance with enterprise
config as expected.will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
improvement against upstream as expected.This is a noop as far as overall performance and SMP scalability are
concerned.[peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED]
Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
[akpm@linux-foundation.org: coding style fixes]
[peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com
Signed-off-by: Andrea Arcangeli
Signed-off-by: Peter Xu
Reviewed-by: John Hubbard
Cc: Hugh Dickins
Cc: Jan Kara
Cc: Jann Horn
Cc: Jason Gunthorpe
Cc: Kirill Shutemov
Cc: Kirill Tkhai
Cc: Matthew Wilcox
Cc: Michal Hocko
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Jun, 2021
2 commits
-
Pull user namespace rlimit handling update from Eric Biederman:
"This is the work mainly by Alexey Gladkov to limit rlimits to the
rlimits of the user that created a user namespace, and to allow users
to have stricter limits on the resources created within a user
namespace."* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
cred: add missing return error code when set_cred_ucounts() failed
ucounts: Silence warning in dec_rlimit_ucounts
ucounts: Set ucount_max to the largest positive value the type can hold
kselftests: Add test to check for rlimit changes in different user namespaces
Reimplement RLIMIT_MEMLOCK on top of ucounts
Reimplement RLIMIT_SIGPENDING on top of ucounts
Reimplement RLIMIT_MSGQUEUE on top of ucounts
Reimplement RLIMIT_NPROC on top of ucounts
Use atomic_t for ucounts reference counting
Add a reference to ucounts for each cred
Increase size of ucounts to atomic_long_t -
Pull scheduler udpates from Ingo Molnar:
- Changes to core scheduling facilities:
- Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
coordinated scheduling across SMT siblings. This is a much
requested feature for cloud computing platforms, to allow the
flexible utilization of SMT siblings, without exposing untrusted
domains to information leaks & side channels, plus to ensure more
deterministic computing performance on SMT systems used by
heterogenous workloads.There are new prctls to set core scheduling groups, which allows
more flexible management of workloads that can share siblings.- Fix task->state access anti-patterns that may result in missed
wakeups and rename it to ->__state in the process to catch new
abuses.- Load-balancing changes:
- Tweak newidle_balance for fair-sched, to improve 'memcache'-like
workloads.- "Age" (decay) average idle time, to better track & improve
workloads such as 'tbench'.- Fix & improve energy-aware (EAS) balancing logic & metrics.
- Fix & improve the uclamp metrics.
- Fix task migration (taskset) corner case on !CONFIG_CPUSET.
- Fix RT and deadline utilization tracking across policy changes
- Introduce a "burstable" CFS controller via cgroups, which allows
bursty CPU-bound workloads to borrow a bit against their future
quota to improve overall latencies & batching. Can be tweaked via
/sys/fs/cgroup/cpu//cpu.cfs_burst_us.- Rework assymetric topology/capacity detection & handling.
- Scheduler statistics & tooling:
- Disable delayacct by default, but add a sysctl to enable it at
runtime if tooling needs it. Use static keys and other
optimizations to make it more palatable.- Use sched_clock() in delayacct, instead of ktime_get_ns().
- Misc cleanups and fixes.
* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/doc: Update the CPU capacity asymmetry bits
sched/topology: Rework CPU capacity asymmetry detection
sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
psi: Fix race between psi_trigger_create/destroy
sched/fair: Introduce the burstable CFS controller
sched/uclamp: Fix uclamp_tg_restrict()
sched/rt: Fix Deadline utilization tracking during policy change
sched/rt: Fix RT utilization tracking during policy change
sched: Change task_struct::state
sched,arch: Remove unused TASK_STATE offsets
sched,timer: Use __set_current_state()
sched: Add get_current_state()
sched,perf,kvm: Fix preemption condition
sched: Introduce task_is_running()
sched: Unbreak wakeups
sched/fair: Age the average idle time
sched/cpufreq: Consider reduced CPU capacity in energy calculation
sched/fair: Take thermal pressure into account while estimating energy
thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
sched/fair: Return early from update_tg_cfs_load() if delta == 0
...
18 Jun, 2021
1 commit
-
This commit in sched/urgent moved the cfs_rq_is_decayed() function:
a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
and this fresh commit in sched/core modified it in the old location:
9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are")
Merge the two variants.
Conflicts:
kernel/sched/fair.cSigned-off-by: Ingo Molnar
16 Jun, 2021
1 commit
-
Commit 591a22c14d3f ("proc: Track /proc/$pid/attr/ opener mm_struct") we
started using __mem_open() to track the mm_struct at open-time, so that
we could then check it for writes.But that also ended up making the permission checks at open time much
stricter - and not just for writes, but for reads too. And that in turn
caused a regression for at least Fedora 29, where NIC interfaces fail to
start when using NetworkManager.Since only the write side wanted the mm_struct test, ignore any failures
by __mem_open() at open time, leaving reads unaffected. The write()
time verification of the mm_struct pointer will then catch the failure
case because a NULL pointer will not match a valid 'current->mm'.Link: https://lore.kernel.org/netdev/YMjTlp2FSJYvoyFa@unreal/
Fixes: 591a22c14d3f ("proc: Track /proc/$pid/attr/ opener mm_struct")
Reported-and-tested-by: Leon Romanovsky
Cc: Kees Cook
Cc: Christian Brauner
Cc: Andrea Righi
Signed-off-by: Linus Torvalds
11 Jun, 2021
1 commit
-
It is not possible to put an array value with subkeys under
a key node, because both of subkeys and the array elements
are using "next" field of the xbc_node.Thus this changes the array values to use "child" field in
the array case. The reason why split this change is to
test it easily.Link: https://lkml.kernel.org/r/162262193838.264090.16044473274501498656.stgit@devnote2
Signed-off-by: Masami Hiramatsu
Signed-off-by: Steven Rostedt (VMware)
09 Jun, 2021
1 commit
-
Commit bfb819ea20ce ("proc: Check /proc/$pid/attr/ writes against file opener")
tried to make sure that there could not be a confusion between the opener of
a /proc/$pid/attr/ file and the writer. It used struct cred to make sure
the privileges didn't change. However, there were existing cases where a more
privileged thread was passing the opened fd to a differently privileged thread
(during container setup). Instead, use mm_struct to track whether the opener
and writer are still the same process. (This is what several other proc files
already do, though for different reasons.)Reported-by: Christian Brauner
Reported-by: Andrea Righi
Tested-by: Andrea Righi
Fixes: bfb819ea20ce ("proc: Check /proc/$pid/attr/ writes against file opener")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook
Signed-off-by: Linus Torvalds
04 Jun, 2021
1 commit
-
Signed-off-by: Ingo Molnar
26 May, 2021
1 commit
-
Fix another "confused deputy" weakness[1]. Writes to /proc/$pid/attr/
files need to check the opener credentials, since these fds do not
transition state across execve(). Without this, it is possible to
trick another process (which may have different credentials) to write
to its own /proc/$pid/attr/ files, leading to unexpected and possibly
exploitable behaviors.[1] https://www.kernel.org/doc/html/latest/security/credentials.html?highlight=confused#open-file-credentials
Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook
Signed-off-by: Linus Torvalds
13 May, 2021
2 commits
-
Creating 2**32 tasks to wait in D-state is impossible and wasteful.
Return "unsigned int" and save on REX prefixes.
Signed-off-by: Alexey Dobriyan
Signed-off-by: Ingo Molnar
Link: https://lore.kernel.org/r/20210422200228.1423391-2-adobriyan@gmail.com -
Creating 2**32 tasks is impossible due to futex pid limits and wasteful
anyway. Nobody has done it.Bring nr_running() into 32-bit world to save on REX prefixes.
Signed-off-by: Alexey Dobriyan
Signed-off-by: Ingo Molnar
Link: https://lore.kernel.org/r/20210422200228.1423391-1-adobriyan@gmail.com