11 Aug, 2024

2 commits

  • [ Upstream commit 98ca62ba9e2be5863c7d069f84f7166b45a5b2f4 ]

    Always initialize i_uid/i_gid inside the sysctl core so set_ownership()
    can safely skip setting them.

    Commit 5ec27ec735ba ("fs/proc/proc_sysctl.c: fix the default values of
    i_uid/i_gid on /proc/sys inodes.") added defaults for i_uid/i_gid when
    set_ownership() was not implemented. It also missed adjusting
    net_ctl_set_ownership() to use the same default values in case the
    computation of a better value failed.
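
    The gist of the fix, as a minimal sketch (helper name and exact placement
    are illustrative, not the upstream diff): apply the root defaults
    unconditionally when the sysctl inode is set up, and treat set_ownership()
    purely as an optional refinement.

    static void proc_sys_default_owner(struct ctl_table_root *root,
                                       struct ctl_table_header *head,
                                       struct inode *inode)
    {
            /* Safe defaults for every /proc/sys inode. */
            inode->i_uid = GLOBAL_ROOT_UID;
            inode->i_gid = GLOBAL_ROOT_GID;

            /* Optional hook may override, e.g. per network namespace. */
            if (root->set_ownership)
                    root->set_ownership(head, &inode->i_uid, &inode->i_gid);
    }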

    Fixes: 5ec27ec735ba ("fs/proc/proc_sysctl.c: fix the default values of i_uid/i_gid on /proc/sys inodes.")
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Weißschuh
    Signed-off-by: Joel Granados
    Signed-off-by: Sasha Levin

    Thomas Weißschuh
     
  • [ Upstream commit 520713a93d550406dae14d49cdb8778d70cecdfd ]

    Remove the 'table' argument from set_ownership as it is never used. This
    change is a step towards putting "struct ctl_table" into .rodata and
    eventually having sysctl core only use "const struct ctl_table".

    The patch was created with the following coccinelle script:

    @@
    identifier func, head, table, uid, gid;
    @@

    void func(
    struct ctl_table_header *head,
    - struct ctl_table *table,
    kuid_t *uid, kgid_t *gid)
    { ... }

    No additional occurrences of 'set_ownership' were found after doing a
    tree-wide search.

    Reviewed-by: Joel Granados
    Signed-off-by: Thomas Weißschuh
    Signed-off-by: Joel Granados
    Stable-dep-of: 98ca62ba9e2b ("sysctl: always initialize i_uid/i_gid")
    Signed-off-by: Sasha Levin

    Thomas Weißschuh
     

03 Aug, 2024

4 commits

  • [ Upstream commit 2c1f057e5be63e890f2dd89e4c25ab5eef084a91 ]

    We added PM_MMAP_EXCLUSIVE in 2015 via commit 77bb499bb60f ("pagemap: add
    mmap-exclusive bit for marking pages mapped only here"), when THPs could
    not be partially mapped and page_mapcount() returned something that was
    true for all pages of the THP.

    In 2016, we added support for partially mapping THPs via commit
    53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping of
    THPs") but did not make the PM_MMAP_EXCLUSIVE detection per-page as well.

    Checking page_mapcount() on the head page does not tell the whole story.

    We should check each individual page. In a future without per-page
    mapcounts it will be different, but we'll change that to be consistent
    with PTE-mapped THPs once we deal with that.
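
    Conceptually, the per-subpage loop now decides the bit per page rather
    than inheriting one head-page decision (a simplified sketch, not the
    upstream hunk; error handling and the other pagemap bits are omitted):

    for (; addr != end; addr += PAGE_SIZE, page++) {
            u64 cur_flags = flags;

            /* "Exclusive" means mapped exactly once, decided per subpage. */
            if ((cur_flags & PM_PRESENT) && page_mapcount(page) == 1)
                    cur_flags |= PM_MMAP_EXCLUSIVE;

            /* ... emit the pagemap entry for this subpage ... */
    }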

    Link: https://lkml.kernel.org/r/20240607122357.115423-4-david@redhat.com
    Fixes: 53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping of THPs")
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Kirill A. Shutemov
    Cc: Alexey Dobriyan
    Cc: Jonathan Corbet
    Cc: Lance Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    David Hildenbrand
     
  • [ Upstream commit da7f31ed0f4df8f61e8195e527aa83dd54896ba3 ]

    Relying on the mapcount for non-present PTEs that reference pages doesn't
    make any sense: they are not accounted in the mapcount, so page_mapcount()
    == 1 won't return the result we actually want to know.

    While we don't check the mapcount for migration entries already, we could
    end up checking it for swap, hwpoison, device exclusive, ... entries,
    which we really shouldn't.

    There is one exception: device private entries, which we consider
    fake-present (e.g., incremented the mapcount). But we won't care about
    that for now for PM_MMAP_EXCLUSIVE, because indicating PM_SWAP for them
    although they are fake-present already sounds suspiciously wrong.

    Let's never indicate PM_MMAP_EXCLUSIVE without PM_PRESENT.

    Link: https://lkml.kernel.org/r/20240607122357.115423-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Alexey Dobriyan
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Lance Yang
    Signed-off-by: Andrew Morton
    Stable-dep-of: 2c1f057e5be6 ("fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of PMD-mapped THPs")
    Signed-off-by: Sasha Levin

    David Hildenbrand
     
  • [ Upstream commit cabbb6d51e2af4fc2f3c763f58a12c628f228987 ]

    Function parameter addr of add_to_pagemap() is useless. Remove it.
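
    In effect, the helper's signature loses its first parameter (a sketch of
    the change; callers drop the address argument accordingly):

    -static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme,
    -                          struct pagemapread *pm)
    +static int add_to_pagemap(pagemap_entry_t *pme, struct pagemapread *pm)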

    Link: https://lkml.kernel.org/r/20240111084533.40038-1-teawaterz@linux.alibaba.com
    Signed-off-by: Hui Zhu
    Reviewed-by: Muhammad Usama Anjum
    Tested-by: Muhammad Usama Anjum
    Cc: Alexey Dobriyan
    Cc: Andrei Vagin
    Cc: David Hildenbrand
    Cc: Hugh Dickins
    Cc: Kefeng Wang
    Cc: Liam R. Howlett
    Cc: Peter Xu
    Cc: Ryan Roberts
    Signed-off-by: Andrew Morton
    Stable-dep-of: 2c1f057e5be6 ("fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of PMD-mapped THPs")
    Signed-off-by: Sasha Levin

    Hui Zhu
     
  • [ Upstream commit 3f9f022e975d930709848a86a1c79775b0585202 ]

    Patch series "fs/proc: move page_mapcount() to fs/proc/internal.h".

    With all other page_mapcount() users in the tree gone, move
    page_mapcount() to fs/proc/internal.h, rename it and extend the
    documentation to prevent future (ab)use.

    ... of course, I find some issues while working on that code that I sort
    first ;)

    We'll now only end up calling page_mapcount() [now
    folio_precise_page_mapcount()] on pages mapped via present page table
    entries. Except for /proc/kpagecount, that still does questionable
    things, but we'll leave that legacy interface as is for now.

    Did a quick sanity check. Likely we would want some better selftests for
    /proc/$pid/pagemap + smaps. I'll see if I can find some time to write some
    more.

    This patch (of 6):

    Looks like we never taught pagemap_pmd_range() about the existence of
    PMD-mapped file THPs. Seems to date back to the times when we first added
    support for non-anon THPs in the form of shmem THP.

    Link: https://lkml.kernel.org/r/20240607122357.115423-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20240607122357.115423-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Lance Yang
    Reviewed-by: Oscar Salvador
    Cc: David Hildenbrand
    Cc: Jonathan Corbet
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    David Hildenbrand
     

21 Jun, 2024

1 commit

  • commit 5cbcb62dddf5346077feb82b7b0c9254222d3445 upstream.

    While taking a kernel core dump with makedumpfile on a larger system,
    softlockup messages often appear.

    While softlockup warnings can be harmless, they can also interfere with
    things like RCU freeing memory, which can be problematic when the kdump
    kexec image is configured with as little memory as possible.

    Avoid the softlockup, and give things like work items and RCU a chance to
    do their thing during __read_vmcore by adding a cond_resched.
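
    The pattern is the usual one for long copy loops (an illustration of the
    idea, not the exact __read_vmcore hunk):

    while (count) {
            /* ... copy the next chunk of the dump to the user buffer and
             * advance the position / remaining count accordingly ... */

            /* Give RCU callbacks, workqueue items etc. a chance to run
             * between chunks instead of hogging the CPU. */
            cond_resched();
    }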

    Link: https://lkml.kernel.org/r/20240507091858.36ff767f@imladris.surriel.com
    Signed-off-by: Rik van Riel
    Acked-by: Baoquan He
    Cc: Dave Young
    Cc: Vivek Goyal
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Rik van Riel
     

16 Jun, 2024

3 commits

  • commit 6d065f507d82307d6161ac75c025111fb8b08a46 upstream.

    After switching smaps_rollup to use VMA iterator, searching for next entry
    is part of the condition expression of the do-while loop. So the current
    VMA needs to be addressed before the continue statement.

    Otherwise, with some VMAs skipped, userspace observed memory
    consumption from /proc/pid/smaps_rollup will be smaller than the sum of
    the corresponding fields from /proc/pid/smaps.
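
    Schematically, the fix looks like this (a simplified sketch, not the
    verbatim hunk): the loop condition is what advances the iterator, so the
    skip path has to account the current VMA first.

    do {
            if (vma_was_already_restarted_past) {   /* illustrative condition */
                    /* "continue" jumps straight to the loop condition, which
                     * fetches the next VMA, so gather the current VMA's
                     * stats before skipping ahead. */
                    smap_gather_stats(vma, &mss, 0);
                    last_vma_end = vma->vm_end;
                    continue;
            }
            /* ... normal processing ... */
    } while ((vma = vma_next(&vmi)) != NULL);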

    Link: https://lkml.kernel.org/r/20240523183531.2535436-1-yzhong@purestorage.com
    Fixes: c4c84f06285e ("fs/proc/task_mmu: stop using linked list and highest_vm_end")
    Signed-off-by: Yuanyuan Zhong
    Reviewed-by: Mohamed Khalfella
    Cc: David Hildenbrand
    Cc: Matthew Wilcox (Oracle)
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Yuanyuan Zhong
     
  • commit c2dc78b86e0821ecf9a9d0c35dba2618279a5bb6 upstream.

    We normally ksm_zero_pages++ in ksmd when a page is merged with the zero
    page, but ksm_zero_pages-- is done from the page table side, where there
    is no access protection for ksm_zero_pages.

    So in rare cases we can read an exceptional value of ksm_zero_pages, such
    as -1, which is very confusing to users.

    Fix it by changing ksm_zero_pages to an atomic_long_t, and do the same for
    mm->ksm_zero_pages.
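
    The shape of the conversion (a sketch; the real patch also converts the
    per-mm counter and the reporting sites):

    /* counter updated from two contexts that share no lock */
    atomic_long_t ksm_zero_pages = ATOMIC_LONG_INIT(0);

    atomic_long_inc(&ksm_zero_pages);   /* ksmd: page merged with the zero page */
    atomic_long_dec(&ksm_zero_pages);   /* page-table side: mapping torn down */
    atomic_long_read(&ksm_zero_pages);  /* readers, e.g. /proc reporting */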

    Link: https://lkml.kernel.org/r/20240528-b4-ksm-counters-v3-2-34bb358fdc13@linux.dev
    Fixes: e2942062e01d ("ksm: count all zero pages placed by KSM")
    Fixes: 6080d19f0704 ("ksm: add ksm zero pages for each process")
    Signed-off-by: Chengming Zhou
    Acked-by: David Hildenbrand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Ran Xiaokai
    Cc: Stefan Roesch
    Cc: xu xin
    Cc: Yang Yang
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Chengming Zhou
     
  • commit 0a960ba49869ebe8ff859d000351504dd6b93b68 upstream.

    The following commits loosened the permissions of the /proc/<pid>/fdinfo/
    directory, as well as the files within it, from 0500 to 0555 while also
    introducing a PTRACE_MODE_READ check between the current task and
    <pid>'s task:

    - commit 7bc3fa0172a4 ("procfs: allow reading fdinfo with PTRACE_MODE_READ")
    - commit 1927e498aee1 ("procfs: prevent unprivileged processes accessing fdinfo dir")

    Before those changes, inode based system calls like inotify_add_watch(2)
    would fail when the current task didn't have sufficient read permissions:

    [...]
    lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0500, st_size=0, ...}) = 0
    inotify_add_watch(64, "/proc/1/task/1/fdinfo",
    IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|
    IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = -1 EACCES (Permission denied)
    [...]

    This matches the documented behavior in the inotify_add_watch(2) man
    page:

    ERRORS
    EACCES Read access to the given file is not permitted.

    After those changes, inotify_add_watch(2) started succeeding despite the
    current task not having PTRACE_MODE_READ privileges on the target task:

    [...]
    lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
    inotify_add_watch(64, "/proc/1/task/1/fdinfo",
    IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|
    IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = 1757
    openat(AT_FDCWD, "/proc/1/task/1/fdinfo",
    O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 EACCES (Permission denied)
    [...]

    This change in behavior broke .NET prior to v7. See the github link
    below for the v7 commit that inadvertently/quietly (?) fixed .NET after
    the kernel changes mentioned above.

    Return to the old behavior by moving the PTRACE_MODE_READ check out of
    the file .open operation and into the inode .permission operation:

    [...]
    lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
    inotify_add_watch(64, "/proc/1/task/1/fdinfo",
    IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|
    IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = -1 EACCES (Permission denied)
    [...]
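
    A minimal sketch of what such an inode ->permission() hook can look like
    (function and variable names are illustrative, not necessarily the
    upstream ones):

    static int proc_fdinfo_permission(struct mnt_idmap *idmap,
                                      struct inode *inode, int mask)
    {
            bool allowed;
            struct task_struct *task = get_proc_task(inode);

            if (!task)
                    return -ESRCH;

            allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
            put_task_struct(task);

            if (!allowed)
                    return -EACCES;

            /* Fall back to the normal mode-bit checks. */
            return generic_permission(idmap, inode, mask);
    }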

    Reported-by: Kevin Parsons (Microsoft)
    Link: https://github.com/dotnet/runtime/commit/89e5469ac591b82d38510fe7de98346cce74ad4f
    Link: https://stackoverflow.com/questions/75379065/start-self-contained-net6-build-exe-as-service-on-raspbian-system-unauthorizeda
    Fixes: 7bc3fa0172a4 ("procfs: allow reading fdinfo with PTRACE_MODE_READ")
    Cc: stable@vger.kernel.org
    Cc: Christian Brauner
    Cc: Christian König
    Cc: Jann Horn
    Cc: Kalesh Singh
    Cc: Hardik Garg
    Cc: Allen Pais
    Signed-off-by: Tyler Hicks (Microsoft)
    Link: https://lore.kernel.org/r/20240501005646.745089-1-code@tyhicks.com
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks (Microsoft)
     

02 May, 2024

1 commit

  • commit fd1a745ce03e37945674c14833870a9af0882e2d upstream.

    Return 0 for pages which can't be mapped. This matches how page_mapped()
    works. It is more convenient for users to not have to filter out these
    pages.

    Link: https://lkml.kernel.org/r/20240321142448.1645400-5-willy@infradead.org
    Fixes: 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR")
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: David Hildenbrand
    Acked-by: Vlastimil Babka
    Cc: Miaohe Lin
    Cc: Muchun Song
    Cc: Oscar Salvador
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Matthew Wilcox (Oracle)
     

23 Feb, 2024

1 commit

  • commit 60f92acb60a989b14e4b744501a0df0f82ef30a3 upstream.

    Patch series "fs/proc: do_task_stat: use sig->stats_".

    do_task_stat() has the same problem as getrusage() had before "getrusage:
    use sig->stats_lock rather than lock_task_sighand()": a hard lockup. If
    NR_CPUS threads call lock_task_sighand() at the same time and the process
    has NR_THREADS threads, spin_lock_irq will spin with irqs disabled for
    O(NR_CPUS * NR_THREADS) time.

    This patch (of 3):

    thread_group_cputime() does its own locking, so we can safely shift
    thread_group_cputime_adjusted(), which does another for_each_thread loop,
    outside of the ->siglock protected section.

    Not only this removes for_each_thread() from the critical section with
    irqs disabled, this removes another case when stats_lock is taken with
    siglock held. We want to remove this dependency, then we can change the
    users of stats_lock to not disable irqs.

    Link: https://lkml.kernel.org/r/20240123153313.GA21832@redhat.com
    Link: https://lkml.kernel.org/r/20240123153355.GA21854@redhat.com
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Dylan Hatch
    Cc: Eric W. Biederman
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     

06 Feb, 2024

1 commit

  • [ Upstream commit 315552310c7de92baea4e570967066569937a843 ]

    When registering tables to the sysctl subsystem there is a check to see
    if header is a permanently empty directory (used for mounts). This check
    evaluates the first element of the ctl_table. This results in an out of
    bounds evaluation when registering empty directories.

    The function register_sysctl_mount_point now passes a ctl_table of size
    1 instead of size 0. It now relies solely on the type to identify
    a permanently empty register.

    Make sure that the ctl_table has at least one element before testing for
    permanent emptiness.
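
    A hedged sketch of the guard (not the verbatim upstream check; it leans on
    the ctl_table_size bookkeeping described elsewhere in this log):

    /* Only a header with at least one entry can be inspected for the
     * permanently-empty marker; an empty registration never is one. */
    if (header->ctl_table_size == 0)
            return false;
    return header->ctl_table[0].type == SYSCTL_TABLE_TYPE_PERMANENTLY_EMPTY;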

    Signed-off-by: Joel Granados
    Reported-by: kernel test robot
    Closes: https://lore.kernel.org/oe-lkp/202311201431.57aae8f3-oliver.sang@intel.com
    Signed-off-by: Luis Chamberlain
    Signed-off-by: Sasha Levin

    Joel Granados
     

29 Nov, 2023

2 commits

  • commit 8b793bcda61f6c3ed4f5b2ded7530ef6749580cb upstream.

    Setting softlockup_panic from do_sysctl_args() causes it to take effect
    later in boot. The lockup detector is enabled before SMP is brought
    online, but do_sysctl_args runs afterwards. If a user wants to set
    softlockup_panic on boot and have it trigger should a softlockup occur
    during onlining of the non-boot processors, they could do this prior to
    commit f117955a2255 ("kernel/watchdog.c: convert {soft/hard}lockup boot
    parameters to sysctl aliases"). However, after this commit the value
    of softlockup_panic is set too late to be of help for this type of
    problem. Restore the prior behavior.

    Signed-off-by: Krister Johansen
    Cc: stable@vger.kernel.org
    Fixes: f117955a2255 ("kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases")
    Signed-off-by: Luis Chamberlain
    Signed-off-by: Greg Kroah-Hartman

    Krister Johansen
     
  • commit 8001f49394e353f035306a45bcf504f06fca6355 upstream.

    The code that checks for unknown boot options is unaware of the sysctl
    alias facility, which maps bootparams to sysctl values. If a user sets
    an old value that has a valid alias, a message about an invalid
    parameter will be printed during boot, and the parameter will get passed
    to init. Fix by checking for the existence of aliased parameters in the
    unknown boot parameter code. If an alias exists, don't return an error
    or pass the value to init.

    Signed-off-by: Krister Johansen
    Cc: stable@vger.kernel.org
    Fixes: 0a477e1ae21b ("kernel/sysctl: support handling command line aliases")
    Signed-off-by: Luis Chamberlain
    Signed-off-by: Greg Kroah-Hartman

    Krister Johansen
     

20 Sep, 2023

2 commits

  • On no-MMU, /proc/<pid>/maps reads as an empty file. This happens because
    find_vma(mm, 0) always returns NULL (assuming no vma actually contains the
    zero address, which is normally the case).

    To fix this bug and improve the maintainability in the future, this patch
    makes the no-MMU implementation as similar as possible to the MMU
    implementation.

    The only remaining differences are the lack of hold/release_task_mempolicy
    and the extra code to shoehorn the gate vma into the iterator.

    This has been tested on top of 6.5.3 on an STM32F746.

    Link: https://lkml.kernel.org/r/20230915160055.971059-2-ben.wolsieffer@hefring.com
    Fixes: 0c563f148043 ("proc: remove VMA rbtree use from nommu")
    Signed-off-by: Ben Wolsieffer
    Cc: Davidlohr Bueso
    Cc: Giulio Benetti
    Cc: Liam R. Howlett
    Cc: Matthew Wilcox (Oracle)
    Cc: Oleg Nesterov
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton

    Ben Wolsieffer
     
  • The no-MMU implementation of /proc/<pid>/maps doesn't normally release
    the mmap read lock, because it uses !IS_ERR_OR_NULL(_vml) to determine
    whether to release the lock. Since _vml is NULL when the end of the
    mappings is reached, the lock is not released.

    Reading /proc/1/maps twice doesn't cause a hang because it only
    takes the read lock, which can be taken multiple times and therefore
    doesn't show any problem if the lock isn't released. Instead, you need
    to perform some operation that attempts to take the write lock after
    reading /proc//maps. To actually reproduce the bug, compile the
    following code as 'proc_maps_bug':

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(int argc, char *argv[]) {
            void *buf;

            sleep(1);

            buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            puts("mmap returned");

            return 0;
    }

    Then, run:

    ./proc_maps_bug &; cat /proc/$!/maps; fg

    Without this patch, mmap() will hang and the command will never
    complete.

    This code was incorrectly adapted from the MMU implementation, which at
    the time released the lock in m_next() before returning the last entry.

    The MMU implementation has diverged further from the no-MMU version since
    then, so this patch brings their locking and error handling into sync,
    fixing the bug and hopefully avoiding similar issues in the future.

    Link: https://lkml.kernel.org/r/20230914163019.4050530-2-ben.wolsieffer@hefring.com
    Fixes: 47fecca15c09 ("fs/proc/task_nommu.c: don't use priv->task->mm")
    Signed-off-by: Ben Wolsieffer
    Acked-by: Oleg Nesterov
    Cc: Giulio Benetti
    Cc: Greg Ungerer
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton

    Ben Wolsieffer
     

06 Sep, 2023

1 commit

  • Pull more MM updates from Andrew Morton:

    - Stefan Roesch has added ksm statistics to /proc/pid/smaps

    - Also a number of singleton patches, mainly cleanups and leftovers

    * tag 'mm-stable-2023-09-04-14-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
    mm/kmemleak: move up cond_resched() call in page scanning loop
    mm: page_alloc: remove stale CMA guard code
    MAINTAINERS: add rmap.h to mm entry
    rmap: remove anon_vma_link() nommu stub
    proc/ksm: add ksm stats to /proc/pid/smaps
    mm/hwpoison: rename hwp_walk* to hwpoison_walk*
    mm: memory-failure: add PageOffline() check

    Linus Torvalds
     

03 Sep, 2023

1 commit

  • With madvise and prctl KSM can be enabled for different VMA's. Once it is
    enabled we can query how effective KSM is overall. However we cannot
    easily query if an individual VMA benefits from KSM.

    This commit adds a KSM section to the /proc/<pid>/smaps file. It reports
    how many of the pages are KSM pages. Note that KSM-placed zeropages are
    not included, only actual KSM pages.

    Here is a typical output:

    7f420a000000-7f421a000000 rw-p 00000000 00:00 0
    Size: 262144 kB
    KernelPageSize: 4 kB
    MMUPageSize: 4 kB
    Rss: 51212 kB
    Pss: 8276 kB
    Shared_Clean: 172 kB
    Shared_Dirty: 42996 kB
    Private_Clean: 196 kB
    Private_Dirty: 7848 kB
    Referenced: 15388 kB
    Anonymous: 51212 kB
    KSM: 41376 kB
    LazyFree: 0 kB
    AnonHugePages: 0 kB
    ShmemPmdMapped: 0 kB
    FilePmdMapped: 0 kB
    Shared_Hugetlb: 0 kB
    Private_Hugetlb: 0 kB
    Swap: 202016 kB
    SwapPss: 3882 kB
    Locked: 0 kB
    THPeligible: 0
    ProtectionKey: 0
    ksm_state: 0
    ksm_skip_base: 0
    ksm_skip_count: 0
    VmFlags: rd wr mr mw me nr mg anon

    This information also helps with the following workflow:
    - First enable KSM for all the VMA's of a process with prctl.
    - Then analyze with the above smaps report which VMA's benefit the most
    - Change the application (if possible) to add the corresponding madvise
    calls for the VMA's that benefit the most

    [shr@devkernel.io: v5]
    Link: https://lkml.kernel.org/r/20230823170107.1457915-1-shr@devkernel.io
    Link: https://lkml.kernel.org/r/20230822180539.1424843-1-shr@devkernel.io
    Signed-off-by: Stefan Roesch
    Reviewed-by: David Hildenbrand
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton

    Stefan Roesch
     

01 Sep, 2023

1 commit

  • Pull x86 shadow stack support from Dave Hansen:
    "This is the long awaited x86 shadow stack support, part of Intel's
    Control-flow Enforcement Technology (CET).

    CET consists of two related security features: shadow stacks and
    indirect branch tracking. This series implements just the shadow stack
    part of this feature, and just for userspace.

    The main use case for shadow stack is providing protection against
    return oriented programming attacks. It works by maintaining a
    secondary (shadow) stack using a special memory type that has
    protections against modification. When executing a CALL instruction,
    the processor pushes the return address to both the normal stack and
    to the special permission shadow stack. Upon RET, the processor pops
    the shadow stack copy and compares it to the normal stack copy.

    For more information, refer to the links below for the earlier
    versions of this patch set"

    Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
    Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/

    * tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
    x86/shstk: Change order of __user in type
    x86/ibt: Convert IBT selftest to asm
    x86/shstk: Don't retry vm_munmap() on -EINTR
    x86/kbuild: Fix Documentation/ reference
    x86/shstk: Move arch detail comment out of core mm
    x86/shstk: Add ARCH_SHSTK_STATUS
    x86/shstk: Add ARCH_SHSTK_UNLOCK
    x86: Add PTRACE interface for shadow stack
    selftests/x86: Add shadow stack test
    x86/cpufeatures: Enable CET CR4 bit for shadow stack
    x86/shstk: Wire in shadow stack interface
    x86: Expose thread features in /proc/$PID/status
    x86/shstk: Support WRSS for userspace
    x86/shstk: Introduce map_shadow_stack syscall
    x86/shstk: Check that signal frame is shadow stack mem
    x86/shstk: Check that SSP is aligned on sigreturn
    x86/shstk: Handle signals for shadow stack
    x86/shstk: Introduce routines modifying shstk
    x86/shstk: Handle thread shadow stack
    x86/shstk: Add user-mode shadow stack support
    ...

    Linus Torvalds
     

30 Aug, 2023

3 commits

  • Pull sysctl updates from Luis Chamberlain:
    "Long ago we set out to remove the kitchen sink on kernel/sysctl.c
    arrays and placing sysctls in their own subsystem or file to help
    avoid merge conflicts. Matthew Wilcox pointed out though that if we're
    going to do that we might as well also *save* space while at it and
    try to remove the extra last sysctl entry added at the end of each
    array, a sentinel, instead of bloating the kernel by adding a new
    sentinel with each array moved.

    Doing that was not so trivial, and has required slowing down the moves
    of kernel/sysctl.c arrays and measuring the impact on size by each new
    move.

    The complex part of the effort to help reduce the size of each sysctl
    is being done by the patient work of el señor Don Joel Granados. A lot
    of this is truly painful code refactoring and testing and then trying
    to measure the savings of each move and removing the sentinels.
    Although Joel already has code which does most of this work,
    experience with sysctl moves in the past shows we need to be
    careful due to the slew of odd build failures that are possible due to
    the amount of random Kconfig options sysctls use.

    To that end Joel's work is split by first addressing the major
    housekeeping needed to remove the sentinels, which is part of this
    merge request. The rest of the work to actually remove the sentinels
    will be done later in future kernel releases.

    The preliminary math is showing this will all help reduce the overall
    build time size of the kernel and run time memory consumed by the
    kernel by about ~64 bytes per array where we are able to remove each
    sentinel in the future. That also means there is no more bloating the
    kernel with the extra ~64 bytes per array moved as no new sentinels
    are created"

    * tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
    sysctl: Use ctl_table_size as stopping criteria for list macro
    sysctl: SIZE_MAX->ARRAY_SIZE in register_net_sysctl
    vrf: Update to register_net_sysctl_sz
    networking: Update to register_net_sysctl_sz
    netfilter: Update to register_net_sysctl_sz
    ax.25: Update to register_net_sysctl_sz
    sysctl: Add size to register_net_sysctl function
    sysctl: Add size arg to __register_sysctl_init
    sysctl: Add size to register_sysctl
    sysctl: Add a size arg to __register_sysctl_table
    sysctl: Add size argument to init_header
    sysctl: Add ctl_table_size to ctl_table_header
    sysctl: Use ctl_table_header in list_for_each_table_entry
    sysctl: Prefer ctl_table_header in proc_sysctl

    Linus Torvalds
     
  • …ux/kernel/git/akpm/mm

    Pull non-MM updates from Andrew Morton:

    - An extensive rework of kexec and crash Kconfig from Eric DeVolder
    ("refactor Kconfig to consolidate KEXEC and CRASH options")

    - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
    couple of macros to args.h")

    - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
    commands")

    - vsprintf inclusion rationalization from Andy Shevchenko
    ("lib/vsprintf: Rework header inclusions")

    - Switch the handling of kdump from a udev scheme to in-kernel
    handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory
    hot un/plug")

    - Many singleton patches to various parts of the tree

    * tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (81 commits)
    document while_each_thread(), change first_tid() to use for_each_thread()
    drivers/char/mem.c: shrink character device's devlist[] array
    x86/crash: optimize CPU changes
    crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
    crash: hotplug support for kexec_load()
    x86/crash: add x86 crash hotplug support
    crash: memory and CPU hotplug sysfs attributes
    kexec: exclude elfcorehdr from the segment digest
    crash: add generic infrastructure for crash hotplug support
    crash: move a few code bits to setup support of crash hotplug
    kstrtox: consistently use _tolower()
    kill do_each_thread()
    nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
    scripts/bloat-o-meter: count weak symbol sizes
    treewide: drop CONFIG_EMBEDDED
    lockdep: fix static memory detection even more
    lib/vsprintf: declare no_hash_pointers in sprintf.h
    lib/vsprintf: split out sprintf() and friends
    kernel/fork: stop playing lockless games for exe_file replacement
    adfs: delete unused "union adfs_dirtail" definition
    ...

    Linus Torvalds
     
  • Pull MM updates from Andrew Morton:

    - Some swap cleanups from Ma Wupeng ("fix WARN_ON in
    add_to_avail_list")

    - Peter Xu has a series ("mm/gup: Unify hugetlb, speed up thp") which
    reduces the special-case code for handling hugetlb pages in GUP. It
    also speeds up GUP handling of transparent hugepages.

    - Peng Zhang provides some maple tree speedups ("Optimize the fast path
    of mas_store()").

    - Sergey Senozhatsky has improved the performance of zsmalloc during
    compaction ("zsmalloc: small compaction improvements").

    - Domenico Cerasuolo has developed additional selftest code for zswap
    ("selftests: cgroup: add zswap test program").

    - xu xin has done some work on KSM's handling of zero pages. These
    changes are mainly to enable the user to better understand the
    effectiveness of KSM's treatment of zero pages ("ksm: support
    tracking KSM-placed zero-pages").

    - Jeff Xu has fixed the behaviour of memfd's
    MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
    MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

    - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
    Stop read optimisation when folio removed from pagecache").

    - Axel Rasmussen has given userfaultfd the ability to simulate memory
    poisoning ("add UFFDIO_POISON to simulate memory poisoning with
    UFFD").

    - Miaohe Lin has contributed some routine maintenance work on the
    memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
    check").

    - Peng Zhang has contributed some maintenance work on the maple tree
    code ("Improve the validation for maple tree and some cleanup").

    - Hugh Dickins has optimized the collapsing of shmem or file pages into
    THPs ("mm: free retracted page table by RCU").

    - Jiaqi Yan has a patch series which permits us to use the healthy
    subpages within a hardware poisoned huge page for general purposes
    ("Improve hugetlbfs read on HWPOISON hugepages").

    - Kemeng Shi has done some maintenance work on the pagetable-check code
    ("Remove unused parameters in page_table_check").

    - More folioification work from Matthew Wilcox ("More filesystem folio
    conversions for 6.6"), ("Followup folio conversions for zswap"). And
    from ZhangPeng ("Convert several functions in page_io.c to use a
    folio").

    - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

    - Baoquan He has converted some architectures to use the
    GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
    architectures to take GENERIC_IOREMAP way").

    - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
    batched/deferred tlb shootdown during page reclamation/migration").

    - Better maple tree lockdep checking from Liam Howlett ("More strict
    maple tree lockdep"). Liam also developed some efficiency
    improvements ("Reduce preallocations for maple tree").

    - Cleanup and optimization to the secondary IOMMU TLB invalidation,
    from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
    upgrade").

    - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
    for arm64").

    - Kemeng Shi provides some maintenance work on the compaction code
    ("Two minor cleanups for compaction").

    - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
    most file-backed faults under the VMA lock").

    - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
    on ppc64, under some circumstances ("Add support for DAX vmemmap
    optimization for ppc64").

    - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
    data in page_ext"), ("minor cleanups to page_ext header").

    - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
    cleanups").

    - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

    - VMA handling cleanups from Kefeng Wang ("mm: convert to
    vma_is_initial_heap/stack()").

    - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
    implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
    address ranges and DAMON monitoring targets").

    - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

    - Liam Howlett has improved the maple tree node replacement code
    ("maple_tree: Change replacement strategy").

    - ZhangPeng has a general code cleanup - use the K() macro more widely
    ("cleanup with helper macro K()").

    - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
    memmap on memory feature on ppc64").

    - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
    in page_alloc"), ("Two minor cleanups for get pageblock
    migratetype").

    - Vishal Moola introduces a memory descriptor for page table tracking,
    "struct ptdesc" ("Split ptdesc from struct page").

    - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
    for vm.memfd_noexec").

    - MM include file rationalization from Hugh Dickins ("arch: include
    asm/cacheflush.h in asm/hugetlb.h").

    - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
    output").

    - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
    object_cache instead of kmemleak_initialized").

    - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
    and _folio_order").

    - A VMA locking scalability improvement from Suren Baghdasaryan
    ("Per-VMA lock support for swap and userfaults").

    - pagetable handling cleanups from Matthew Wilcox ("New page table
    range API").

    - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
    using page->private on tail pages for THP_SWAP + cleanups").

    - Cleanups and speedups to the hugetlb fault handling from Matthew
    Wilcox ("Change calling convention for ->huge_fault").

    - Matthew Wilcox has also done some maintenance work on the MM
    subsystem documentation ("Improve mm documentation").

    * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
    maple_tree: shrink struct maple_tree
    maple_tree: clean up mas_wr_append()
    secretmem: convert page_is_secretmem() to folio_is_secretmem()
    nios2: fix flush_dcache_page() for usage from irq context
    hugetlb: add documentation for vma_kernel_pagesize()
    mm: add orphaned kernel-doc to the rst files.
    mm: fix clean_record_shared_mapping_range kernel-doc
    mm: fix get_mctgt_type() kernel-doc
    mm: fix kernel-doc warning from tlb_flush_rmaps()
    mm: remove enum page_entry_size
    mm: allow ->huge_fault() to be called without the mmap_lock held
    mm: move PMD_ORDER to pgtable.h
    mm: remove checks for pte_index
    memcg: remove duplication detection for mem_cgroup_uncharge_swap
    mm/huge_memory: work on folio->swap instead of page->private when splitting folio
    mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
    mm/swap: use dedicated entry for swap in folio
    mm/swap: stop using page->private on tail pages for THP_SWAP
    selftests/mm: fix WARNING comparing pointer to 0
    selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
    ...

    Linus Torvalds
     

29 Aug, 2023

2 commits

  • Pull procfs fixes from Christian Brauner:
    "Mode changes to files under /proc// aren't supported ever since
    commit 6d76fa58b050 ("Don't allow chmod() on the /proc// files").

    Due to an oversight in commit 1b3044e39a89 ("procfs: fix pthread
    cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD,
    mode changes on /proc/thread-self/comm were accidentally allowed.

    Similarly, mode changes for all files beneath /proc/<pid>/net/ are
    blocked, but mode changes on /proc/<pid>/net itself were accidentally
    allowed.

    Both issues come down to not using the generic proc_setattr() helper
    which blocks all mode changes. This is rectified with this pull
    request.

    This also removes a strange nolibc test that abused /proc/<pid>/net
    for testing mode changes. Using procfs for this test never made a lot
    of sense given procfs has special semantics for almost everything
    anyway.

    Both changes are minor user-visible changes. It is however very
    unlikely that mode changes on /proc/<pid>/net and
    /proc/thread-self/comm are something that userspace relies on"

    * tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
    procfs: block chmod on /proc/thread-self/comm
    proc: use generic setattr() for /proc/$PID/net
    selftests/nolibc: drop test chmod_net

    Linus Torvalds
     
  • Pull vfs timestamp updates from Christian Brauner:
    "This adds VFS support for multi-grain timestamps and converts tmpfs,
    xfs, ext4, and btrfs to use them. This carries acks from all relevant
    filesystems.

    The VFS always uses coarse-grained timestamps when updating the ctime
    and mtime after a change. This has the benefit of allowing filesystems
    to optimize away a lot of metadata updates, down to around 1 per
    jiffy, even when a file is under heavy writes.

    Unfortunately, this has always been an issue when we're exporting via
    NFSv3, which relies on timestamps to validate caches. A lot of changes
    can happen in a jiffy, so timestamps aren't sufficient to help the
    client decide to invalidate the cache.

    Even with NFSv4, a lot of exported filesystems don't properly support
    a change attribute and are subject to the same problems with timestamp
    granularity. Other applications have similar issues with timestamps
    (e.g., backup applications).

    If we were to always use fine-grained timestamps, that would improve
    the situation, but that becomes rather expensive, as the underlying
    filesystem would have to log a lot more metadata updates.

    This introduces fine-grained timestamps that are used when they are
    actively queried.

    This uses the 31st bit of the ctime tv_nsec field to indicate that
    something has queried the inode for the mtime or ctime. When this flag
    is set, on the next mtime or ctime update, the kernel will fetch a
    fine-grained timestamp instead of the usual coarse-grained one.

    As POSIX generally mandates that when the mtime changes, the ctime
    must also change, the kernel always stores normalized ctime values, so
    only the first 30 bits of the tv_nsec field are ever used.

    Filesystems can opt into this behavior by setting the FS_MGTIME flag in
    the fstype. Filesystems that don't set this flag will continue to use
    coarse-grained timestamps.

    Various preparatory changes, fixes and cleanups are included:

    - Fixup all relevant places where POSIX requires updating ctime
    together with mtime. This is a wide-range of places and all
    maintainers provided necessary Acks.

    - Add new accessors for inode->i_ctime directly and change all
    callers to rely on them. Plain accesses to inode->i_ctime are now
    gone and it is accordingly renamed to inode->__i_ctime and commented
    as requiring accessors.

    - Extend generic_fillattr() to pass in a request mask mirroring in a
    sense the statx() uapi. This allows callers to pass in a request
    mask to only get a subset of attributes filled in.

    - Rework timestamp updates so it's possible to drop the @now
    parameter from the update_time() inode operation and associated helpers.

    - Add inode_update_timestamps() and convert all filesystems to it
    removing a bunch of open-coding"

    * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
    btrfs: convert to multigrain timestamps
    ext4: switch to multigrain timestamps
    xfs: switch to multigrain timestamps
    tmpfs: add support for multigrain timestamps
    fs: add infrastructure for multigrain timestamps
    fs: drop the timespec64 argument from update_time
    xfs: have xfs_vn_update_time gets its own timestamp
    fat: make fat_update_time get its own timestamp
    fat: remove i_version handling from fat_update_time
    ubifs: have ubifs_update_time use inode_update_timestamps
    btrfs: have it use inode_update_timestamps
    fs: drop the timespec64 arg from generic_update_time
    fs: pass the request_mask to generic_fillattr
    fs: remove silly warning from current_time
    gfs2: fix timestamp handling on quota inodes
    fs: rename i_ctime field to __i_ctime
    selinux: convert to ctime accessor functions
    security: convert to ctime accessor functions
    apparmor: convert to ctime accessor functions
    sunrpc: convert to ctime accessor functions
    ...

    Linus Torvalds
     

26 Aug, 2023

1 commit

  • …linux/kernel/git/akpm/mm

    Pull misc fixes from Andrew Morton:
    "18 hotfixes. 13 are cc:stable and the remainder pertain to post-6.4
    issues or aren't considered suitable for a -stable backport"

    * tag 'mm-hotfixes-stable-2023-08-25-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
    shmem: fix smaps BUG sleeping while atomic
    selftests: cachestat: catch failing fsync test on tmpfs
    selftests: cachestat: test for cachestat availability
    maple_tree: disable mas_wr_append() when other readers are possible
    madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check
    madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check
    madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check
    mm: multi-gen LRU: don't spin during memcg release
    mm: memory-failure: fix unexpected return value in soft_offline_page()
    radix tree: remove unused variable
    mm: add a call to flush_cache_vmap() in vmap_pfn()
    selftests/mm: FOLL_LONGTERM need to be updated to 0x100
    nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers()
    mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast
    selftests: cgroup: fix test_kmem_basic less than error
    mm: enable page walking API to lock vmas during the walk
    smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
    mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT

    Linus Torvalds
     

25 Aug, 2023

1 commit

  • Add the comment to explain that while_each_thread(g,t) is not rcu-safe
    unless g is stable (e.g. current). Even if g is a group leader and thus
    can't exit before t, t or another sub-thread can exec and remove g from
    the thread_group list.

    The only lockless user of while_each_thread() is first_tid() and it is
    fine in that it can't loop forever, yet for_each_thread() looks better and
    I am going to change while_each_thread/next_thread.

    Link: https://lkml.kernel.org/r/20230823170806.GA11724@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Eric W. Biederman
    Signed-off-by: Andrew Morton

    Oleg Nesterov
     

22 Aug, 2023

7 commits

  • Andrew Morton
     
  • Extract from current /proc/self/smaps output:

    Swap: 0 kB
    SwapPss: 0 kB
    Locked: 0 kB
    THPeligible: 0
    ProtectionKey: 0

    That's not the alignment shown in Documentation/filesystems/proc.rst: it's
    an ugly artifact from missing out the %8 other fields are using; but
    there's even one selftest which expects it to look that way. Hoping no
    other smaps parsers depend on THPeligible to look so ugly, fix these.
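
    Roughly, the fix is the kind of one-character format change in show_smap()
    shown here (a sketch, not the verbatim hunk):

    -	seq_printf(m, "THPeligible:    %d\n", ...);
    +	seq_printf(m, "THPeligible:    %8d\n", ...);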

    Link: https://lkml.kernel.org/r/cfb81f7a-f448-5bc2-b0e1-8136fcd1dd8c@google.com
    Signed-off-by: Hugh Dickins
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton

    Hugh Dickins
     
  • It is better to not expose too many internal variables of memtest,
    add a helper memtest_report_meminfo() to show memtest results.

    Link: https://lkml.kernel.org/r/20230808033359.174986-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Acked-by: Mike Rapoport (IBM)
    Cc: Matthew Wilcox
    Cc: Tomas Mudrunka
    Signed-off-by: Andrew Morton

    Kefeng Wang
     
  • Patch series "mm: convert to vma_is_initial_heap/stack()", v3.

    Add vma_is_initial_stack() and vma_is_initial_heap() helpers and use them
    to simplify code.

    This patch (of 4):

    Factor out VMA stack and heap checks and name them vma_is_initial_stack()
    and vma_is_initial_heap() for general use.

    Link: https://lkml.kernel.org/r/20230728050043.59880-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20230728050043.59880-2-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Reviewed-by: David Hildenbrand
    Acked-by: Peter Zijlstra (Intel)
    Cc: Christian Göttsche
    Cc: Alex Deucher
    Cc: Arnaldo Carvalho de Melo
    Cc: Christian Göttsche
    Cc: Christian König
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Eric Paris
    Cc: Felix Kuehling
    Cc: "Pan, Xinhui"
    Cc: Paul Moore
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton

    Kefeng Wang
     
  • The only user of frontswap is zswap, and has been for a long time. Have
    swap call into zswap directly and remove the indirection.

    [hannes@cmpxchg.org: remove obsolete comment, per Yosry]
    Link: https://lkml.kernel.org/r/20230719142832.GA932528@cmpxchg.org
    [fengwei.yin@intel.com: don't warn if none swapcache folio is passed to zswap_load]
    Link: https://lkml.kernel.org/r/20230810095652.3905184-1-fengwei.yin@intel.com
    Link: https://lkml.kernel.org/r/20230717160227.GA867137@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Yin Fengwei
    Acked-by: Konrad Rzeszutek Wilk
    Acked-by: Nhat Pham
    Acked-by: Yosry Ahmed
    Acked-by: Christoph Hellwig
    Cc: Domenico Cerasuolo
    Cc: Matthew Wilcox (Oracle)
    Cc: Vitaly Wool
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton

    Johannes Weiner
     
  • walk_page_range() and friends often operate under write-locked mmap_lock.
    With introduction of vma locks, the vmas have to be locked as well during
    such walks to prevent concurrent page faults in these areas. Add an
    additional member to mm_walk_ops to indicate locking requirements for the
    walk.

    The change ensures that page walks which prevent concurrent page faults
    by write-locking mmap_lock, operate correctly after introduction of
    per-vma locks. With per-vma locks page faults can be handled under vma
    lock without taking mmap_lock at all, so write locking mmap_lock would
    not stop them. The change ensures vmas are properly locked during such
    walks.

    A sample issue this solves is do_mbind() performing queue_pages_range()
    to queue pages for migration. Without this change a concurrent page
    can be faulted into the area and be left out of migration.
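
    A sketch of the new knob (an excerpt; enum values as introduced by this
    series, other members elided):

    struct mm_walk_ops {
            /* ... pte/pmd/hugetlb entry callbacks ... */

            /* How each VMA must be locked while this walker visits it:
             * PGWALK_RDLOCK, PGWALK_WRLOCK or PGWALK_WRLOCK_VERIFY. */
            enum page_walk_lock walk_lock;
    };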

    Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
    Signed-off-by: Suren Baghdasaryan
    Suggested-by: Linus Torvalds
    Suggested-by: Jann Horn
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox (Oracle)
    Cc: Michal Hocko
    Cc: Michel Lespinasse
    Cc: Peter Xu
    Cc: Vlastimil Babka
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton

    Suren Baghdasaryan
     
  • We shouldn't be using a GUP-internal helper if it can be avoided.

    Similar to smaps_pte_entry() that uses vm_normal_page(), let's use
    vm_normal_page_pmd() that similarly refuses to return the huge zeropage.

    In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd():

    (1) Will always return the head page, not a tail page of a THP.

    If we'd ever call smaps_account with a tail page while setting "compound
    = true", we could be in trouble, because smaps_account() would look at
    the memmap of unrelated pages.

    If we're unlucky, that memmap does not exist at all. Before we removed
    PG_doublemap, we could have triggered something similar as in
    commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for
    migration entry").

    This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc:
    smaps_rollup: do not stall write attempts on mmap_lock"):

    (a) We're in show_smaps_rollup() and processed a VMA
    (b) We release the mmap lock in show_smaps_rollup() because it is
    contended
    (c) We merged that VMA with another VMA
    (d) We collapsed a THP in that merged VMA at that position

    If the end address of the original VMA falls into the middle of a THP
    area, we would call smap_gather_stats() with a start address that falls
    into a PMD-mapped THP. It's probably very rare to trigger when not
    really forced.

    (2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page()

    Treat such PMDs here just like smaps_pte_entry() would treat such PTEs.
    If such pages would be anonymous, we most certainly would want to
    account them.

    (3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap()

    As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE
    pages. So just like smaps_pte_entry(), we'll now also ignore such PMD
    entries.

    Especially, follow_pmd_mask() never ends up calling
    follow_trans_huge_pmd() on pmd_devmap(). Instead it calls
    follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN
    is set.

    So skipping pmd_devmap() pages seems to be the right thing to do.

    (4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page()

    We won't be returning a memmap that should be ignored by core-mm, or
    worse, a memmap that does not even exist. Note that while
    walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't.

    Most probably this case doesn't currently really happen on the PMD level,
    otherwise we'd already be able to trigger kernel crashes when reading
    smaps / smaps_rollup.

    So most probably only (1) is relevant in practice as of now, but could only
    cause trouble in extreme corner cases.

    Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future
    reuse in wrong context.

    Link: https://lkml.kernel.org/r/20230803143208.383663-3-david@redhat.com
    Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")
    Signed-off-by: David Hildenbrand
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: Linus Torvalds
    Cc: liubo
    Cc: Matthew Wilcox (Oracle)
    Cc: Mel Gorman
    Cc: Paolo Bonzini
    Cc: Peter Xu
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton

    David Hildenbrand
     

19 Aug, 2023

1 commit

    As the number of KSM zero pages is not included in ksm_merging_pages per
    process when use_zero_pages is enabled, it's unclear how many actual pages
    are merged by KSM. To let users accurately estimate their memory demands
    when unsharing KSM zero pages, it's necessary to show KSM zero pages per
    process. In addition, it helps users know the actual KSM profit, because
    KSM-placed zero pages also benefit from KSM.

    Since accurately unsharing zero pages placed by KSM is now achieved,
    tracking their merging and unmerging is no longer a difficult thing.

    Since we already have /proc/<pid>/ksm_stat, just add the 'ksm_zero_pages'
    information to it.

    Link: https://lkml.kernel.org/r/20230613030938.185993-1-yang.yang29@zte.com.cn
    Signed-off-by: xu xin
    Acked-by: David Hildenbrand
    Reviewed-by: Xiaokai Ran
    Reviewed-by: Yang Yang
    Cc: Claudio Imbrenda
    Cc: Xuexin Jiang
    Signed-off-by: Andrew Morton

    xu xin
     

16 Aug, 2023

5 commits

  • This is a preparation commit to make it easy to remove the sentinel
    elements (empty end markers) from the ctl_table arrays. It both allows
    the systematic removal of the sentinels and adds the ctl_table_size
    variable to the stopping criteria of the list_for_each_table_entry macro
    that traverses all ctl_table arrays. Once all the sentinels are removed
    by subsequent commits, ctl_table_size will become the only stopping
    criteria in the macro. We don't actually remove any elements in this
    commit, but it sets things up for the removal process to take place.

    By adding header->ctl_table_size as an additional stopping criteria for
    the list_for_each_table_entry macro, it will execute until it finds an
    "empty" ->procname or until the size runs out. Therefore if a ctl_table
    array with a sentinel is passed its size will be too big (by one
    element) but it will stop on the sentinel. On the other hand, if the
    ctl_table array without a sentinel is passed its size will be just write
    and there will be no need for a sentinel.
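
    Roughly, the macro gains the size as a second stopping condition (a
    sketch, not the verbatim upstream definition):

    #define list_for_each_table_entry(entry, header)                          \
            for ((entry) = (header)->ctl_table;                               \
                 ((entry) < (header)->ctl_table + (header)->ctl_table_size) && \
                 (entry)->procname;                                           \
                 (entry)++)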

    Signed-off-by: Joel Granados
    Suggested-by: Jani Nikula
    Signed-off-by: Luis Chamberlain

    Joel Granados
     
  • This commit adds table_size to __register_sysctl_init in preparation for
    the removal of the sentinel elements in the ctl_table arrays (last empty
    markers). And though we do *not* remove any sentinels in this commit, we
    set things up by calculating the ctl_table array size with ARRAY_SIZE.

    We add a table_size argument to __register_sysctl_init and modify the
    register_sysctl_init macro to calculate the array size with ARRAY_SIZE.
    The original callers do not need to be updated as they will go through
    the new macro.
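
    In effect (a sketch of the macro, which keeps existing callers unchanged):

    #define register_sysctl_init(path, table) \
            __register_sysctl_init(path, table, #table, ARRAY_SIZE(table))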

    Signed-off-by: Joel Granados
    Suggested-by: Greg Kroah-Hartman
    Signed-off-by: Luis Chamberlain

    Joel Granados
     
  • This commit adds table_size to register_sysctl in preparation for the
    removal of the sentinel elements in the ctl_table arrays (last empty
    markers). And though we do *not* remove any sentinels in this commit, we
    set things up by either passing the table_size explicitly or using
    ARRAY_SIZE on the ctl_table arrays.

    We replace the register_syctl function with a macro that will add the
    ARRAY_SIZE to the new register_sysctl_sz function. In this way the
    callers that are already using an array of ctl_table structs do not
    change. For the callers that pass a ctl_table array pointer, we pass the
    table_size to register_sysctl_sz instead of the macro.
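
    Schematically (a sketch): callers that pass a real array keep using the
    macro, while pointer-based callers switch to the _sz variant with an
    explicit size.

    #define register_sysctl(path, table) \
            register_sysctl_sz(path, table, ARRAY_SIZE(table))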

    Signed-off-by: Joel Granados
    Suggested-by: Greg Kroah-Hartman
    Signed-off-by: Luis Chamberlain

    Joel Granados
     
  • We make these changes in order to prepare __register_sysctl_table and
    its callers for when we remove the sentinel element (empty element at
    the end of ctl_table arrays). We don't actually remove any sentinels in
    this commit, but we *do* make sure to use ARRAY_SIZE so the table_size
    is available when the removal occurs.

    We add a table_size argument to __register_sysctl_table and adjust
    callers, all of which pass ctl_table pointers and need an explicit call
    to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in
    order to forward the size of the array pointer received from the network
    register calls.

    The new table_size argument does not yet have any effect in the
    init_header call which is still dependent on the sentinel's presence.
    table_size *does* however drive the `kzalloc` allocation in
    __register_sysctl_table with no adverse effects as the allocated memory
    is either one element greater than the calculated ctl_table array (for
    the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size
    of the calculated ctl_table array (for the call from sysctl_net.c and
    register_sysctl). This approach will allow us to "just" remove the
    sentinel without further changes to __register_sysctl_table as
    table_size will represent the exact size for all the callers at that
    point.

    Signed-off-by: Joel Granados
    Signed-off-by: Luis Chamberlain

    Joel Granados
     
  • In this commit, we add a table_size argument to the init_header function
    in order to initialize the ctl_table_size variable in ctl_table_header.
    Even though the size is not yet used, it is now initialized within the
    sysctl subsys. We need this commit for when we start adding the
    table_size arguments to the sysctl functions (e.g. register_sysctl,
    __register_sysctl_table and __register_sysctl_init).

    Note that in __register_sysctl_table we temporarily use a calculated
    size until we add the size argument to that function in subsequent
    commits.

    Signed-off-by: Joel Granados
    Signed-off-by: Luis Chamberlain

    Joel Granados