11 Aug, 2024

2 commits

  • [ Upstream commit 98ca62ba9e2be5863c7d069f84f7166b45a5b2f4 ]

    Always initialize i_uid/i_gid inside the sysctl core so set_ownership()
    can safely skip setting them.

    Commit 5ec27ec735ba ("fs/proc/proc_sysctl.c: fix the default values of
    i_uid/i_gid on /proc/sys inodes.") added defaults for i_uid/i_gid when
    set_ownership() was not implemented. It also missed adjusting
    net_ctl_set_ownership() to use the same default values in case the
    computation of a better value failed.
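
    The gist of the fix, as a minimal sketch (helper name and exact placement
    are illustrative, not the upstream diff): apply the root defaults
    unconditionally when the sysctl inode is set up, and treat set_ownership()
    purely as an optional refinement.

    static void proc_sys_default_owner(struct ctl_table_root *root,
                                       struct ctl_table_header *head,
                                       struct inode *inode)
    {
            /* Safe defaults for every /proc/sys inode. */
            inode->i_uid = GLOBAL_ROOT_UID;
            inode->i_gid = GLOBAL_ROOT_GID;

            /* Optional hook may override, e.g. per network namespace. */
            if (root->set_ownership)
                    root->set_ownership(head, &inode->i_uid, &inode->i_gid);
    }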

    Fixes: 5ec27ec735ba ("fs/proc/proc_sysctl.c: fix the default values of i_uid/i_gid on /proc/sys inodes.")
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Weißschuh
    Signed-off-by: Joel Granados
    Signed-off-by: Sasha Levin

    Thomas Weißschuh
     
  • [ Upstream commit 520713a93d550406dae14d49cdb8778d70cecdfd ]

    Remove the 'table' argument from set_ownership as it is never used. This
    change is a step towards putting "struct ctl_table" into .rodata and
    eventually having sysctl core only use "const struct ctl_table".

    The patch was created with the following coccinelle script:

    @@
    identifier func, head, table, uid, gid;
    @@

    void func(
    struct ctl_table_header *head,
    - struct ctl_table *table,
    kuid_t *uid, kgid_t *gid)
    { ... }

    No additional occurrences of 'set_ownership' were found after doing a
    tree-wide search.

    Reviewed-by: Joel Granados
    Signed-off-by: Thomas Weißschuh
    Signed-off-by: Joel Granados
    Stable-dep-of: 98ca62ba9e2b ("sysctl: always initialize i_uid/i_gid")
    Signed-off-by: Sasha Levin

    Thomas Weißschuh
     

03 Aug, 2024

4 commits

  • [ Upstream commit 2c1f057e5be63e890f2dd89e4c25ab5eef084a91 ]

    We added PM_MMAP_EXCLUSIVE in 2015 via commit 77bb499bb60f ("pagemap: add
    mmap-exclusive bit for marking pages mapped only here"), when THPs could
    not be partially mapped and page_mapcount() returned something that was
    true for all pages of the THP.

    In 2016, we added support for partially mapping THPs via commit
    53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping of
    THPs") but did not make the PM_MMAP_EXCLUSIVE detection per-page as well.

    Checking page_mapcount() on the head page does not tell the whole story.

    We should check each individual page. In a future without per-page
    mapcounts it will be different, but we'll change that to be consistent
    with PTE-mapped THPs once we deal with that.
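
    Conceptually, the per-subpage loop now decides the bit per page rather
    than inheriting one head-page decision (a simplified sketch, not the
    upstream hunk; error handling and the other pagemap bits are omitted):

    for (; addr != end; addr += PAGE_SIZE, page++) {
            u64 cur_flags = flags;

            /* "Exclusive" means mapped exactly once, decided per subpage. */
            if ((cur_flags & PM_PRESENT) && page_mapcount(page) == 1)
                    cur_flags |= PM_MMAP_EXCLUSIVE;

            /* ... emit the pagemap entry for this subpage ... */
    }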

    Link: https://lkml.kernel.org/r/20240607122357.115423-4-david@redhat.com
    Fixes: 53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping of THPs")
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Kirill A. Shutemov
    Cc: Alexey Dobriyan
    Cc: Jonathan Corbet
    Cc: Lance Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    David Hildenbrand
     
  • [ Upstream commit da7f31ed0f4df8f61e8195e527aa83dd54896ba3 ]

    Relying on the mapcount for non-present PTEs that reference pages doesn't
    make any sense: they are not accounted in the mapcount, so page_mapcount()
    == 1 won't return the result we actually want to know.

    While we don't check the mapcount for migration entries already, we could
    end up checking it for swap, hwpoison, device exclusive, ... entries,
    which we really shouldn't.

    There is one exception: device private entries, which we consider
    fake-present (e.g., incremented the mapcount). But we won't care about
    that for now for PM_MMAP_EXCLUSIVE, because indicating PM_SWAP for them
    although they are fake-present already sounds suspiciously wrong.

    Let's never indicate PM_MMAP_EXCLUSIVE without PM_PRESENT.

    Link: https://lkml.kernel.org/r/20240607122357.115423-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Alexey Dobriyan
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Lance Yang
    Signed-off-by: Andrew Morton
    Stable-dep-of: 2c1f057e5be6 ("fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of PMD-mapped THPs")
    Signed-off-by: Sasha Levin

    David Hildenbrand
     
  • [ Upstream commit cabbb6d51e2af4fc2f3c763f58a12c628f228987 ]

    Function parameter addr of add_to_pagemap() is useless. Remove it.
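
    In effect, the helper's signature loses its first parameter (a sketch of
    the change; callers drop the address argument accordingly):

    -static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme,
    -                          struct pagemapread *pm)
    +static int add_to_pagemap(pagemap_entry_t *pme, struct pagemapread *pm)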

    Link: https://lkml.kernel.org/r/20240111084533.40038-1-teawaterz@linux.alibaba.com
    Signed-off-by: Hui Zhu
    Reviewed-by: Muhammad Usama Anjum
    Tested-by: Muhammad Usama Anjum
    Cc: Alexey Dobriyan
    Cc: Andrei Vagin
    Cc: David Hildenbrand
    Cc: Hugh Dickins
    Cc: Kefeng Wang
    Cc: Liam R. Howlett
    Cc: Peter Xu
    Cc: Ryan Roberts
    Signed-off-by: Andrew Morton
    Stable-dep-of: 2c1f057e5be6 ("fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of PMD-mapped THPs")
    Signed-off-by: Sasha Levin

    Hui Zhu
     
  • [ Upstream commit 3f9f022e975d930709848a86a1c79775b0585202 ]

    Patch series "fs/proc: move page_mapcount() to fs/proc/internal.h".

    With all other page_mapcount() users in the tree gone, move
    page_mapcount() to fs/proc/internal.h, rename it and extend the
    documentation to prevent future (ab)use.

    ... of course, I find some issues while working on that code that I sort
    first ;)

    We'll now only end up calling page_mapcount() [now
    folio_precise_page_mapcount()] on pages mapped via present page table
    entries. Except for /proc/kpagecount, that still does questionable
    things, but we'll leave that legacy interface as is for now.

    Did a quick sanity check. Likely we would want some better selftests for
    /proc/$pid/pagemap + smaps. I'll see if I can find some time to write some
    more.

    This patch (of 6):

    Looks like we never taught pagemap_pmd_range() about the existence of
    PMD-mapped file THPs. Seems to date back to the times when we first added
    support for non-anon THPs in the form of shmem THP.

    Link: https://lkml.kernel.org/r/20240607122357.115423-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20240607122357.115423-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Lance Yang
    Reviewed-by: Oscar Salvador
    Cc: David Hildenbrand
    Cc: Jonathan Corbet
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    David Hildenbrand
     

21 Jun, 2024

1 commit

  • commit 5cbcb62dddf5346077feb82b7b0c9254222d3445 upstream.

    While taking a kernel core dump with makedumpfile on a larger system,
    softlockup messages often appear.

    While softlockup warnings can be harmless, they can also interfere with
    things like RCU freeing memory, which can be problematic when the kdump
    kexec image is configured with as little memory as possible.

    Avoid the softlockup, and give things like work items and RCU a chance to
    do their thing during __read_vmcore by adding a cond_resched.
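
    The pattern is the usual one for long copy loops (an illustration of the
    idea, not the exact __read_vmcore hunk):

    while (count) {
            /* ... copy the next chunk of the dump to the user buffer and
             * advance the position / remaining count accordingly ... */

            /* Give RCU callbacks, workqueue items etc. a chance to run
             * between chunks instead of hogging the CPU. */
            cond_resched();
    }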

    Link: https://lkml.kernel.org/r/20240507091858.36ff767f@imladris.surriel.com
    Signed-off-by: Rik van Riel
    Acked-by: Baoquan He
    Cc: Dave Young
    Cc: Vivek Goyal
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Rik van Riel
     

16 Jun, 2024

3 commits

  • commit 6d065f507d82307d6161ac75c025111fb8b08a46 upstream.

    After switching smaps_rollup to use VMA iterator, searching for next entry
    is part of the condition expression of the do-while loop. So the current
    VMA needs to be addressed before the continue statement.

    Otherwise, with some VMAs skipped, userspace observed memory
    consumption from /proc/pid/smaps_rollup will be smaller than the sum of
    the corresponding fields from /proc/pid/smaps.
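
    Schematically, the fix looks like this (a simplified sketch, not the
    verbatim hunk): the loop condition is what advances the iterator, so the
    skip path has to account the current VMA first.

    do {
            if (vma_was_already_restarted_past) {   /* illustrative condition */
                    /* "continue" jumps straight to the loop condition, which
                     * fetches the next VMA, so gather the current VMA's
                     * stats before skipping ahead. */
                    smap_gather_stats(vma, &mss, 0);
                    last_vma_end = vma->vm_end;
                    continue;
            }
            /* ... normal processing ... */
    } while ((vma = vma_next(&vmi)) != NULL);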

    Link: https://lkml.kernel.org/r/20240523183531.2535436-1-yzhong@purestorage.com
    Fixes: c4c84f06285e ("fs/proc/task_mmu: stop using linked list and highest_vm_end")
    Signed-off-by: Yuanyuan Zhong
    Reviewed-by: Mohamed Khalfella
    Cc: David Hildenbrand
    Cc: Matthew Wilcox (Oracle)
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Yuanyuan Zhong
     
  • commit c2dc78b86e0821ecf9a9d0c35dba2618279a5bb6 upstream.

    We normally ksm_zero_pages++ in ksmd when a page is merged with the zero
    page, but ksm_zero_pages-- is done from the page table side, where there
    is no access protection for ksm_zero_pages.

    So in rare cases we can read an exceptional value of ksm_zero_pages, such
    as -1, which is very confusing to users.

    Fix it by changing ksm_zero_pages to an atomic_long_t, and do the same for
    mm->ksm_zero_pages.
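
    The shape of the conversion (a sketch; the real patch also converts the
    per-mm counter and the reporting sites):

    /* counter updated from two contexts that share no lock */
    atomic_long_t ksm_zero_pages = ATOMIC_LONG_INIT(0);

    atomic_long_inc(&ksm_zero_pages);   /* ksmd: page merged with the zero page */
    atomic_long_dec(&ksm_zero_pages);   /* page-table side: mapping torn down */
    atomic_long_read(&ksm_zero_pages);  /* readers, e.g. /proc reporting */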

    Link: https://lkml.kernel.org/r/20240528-b4-ksm-counters-v3-2-34bb358fdc13@linux.dev
    Fixes: e2942062e01d ("ksm: count all zero pages placed by KSM")
    Fixes: 6080d19f0704 ("ksm: add ksm zero pages for each process")
    Signed-off-by: Chengming Zhou
    Acked-by: David Hildenbrand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Ran Xiaokai
    Cc: Stefan Roesch
    Cc: xu xin
    Cc: Yang Yang
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Chengming Zhou
     
  • commit 0a960ba49869ebe8ff859d000351504dd6b93b68 upstream.

    The following commits loosened the permissions of the /proc/<pid>/fdinfo/
    directory, as well as the files within it, from 0500 to 0555 while also
    introducing a PTRACE_MODE_READ check between the current task and
    <pid>'s task:

    - commit 7bc3fa0172a4 ("procfs: allow reading fdinfo with PTRACE_MODE_READ")
    - commit 1927e498aee1 ("procfs: prevent unprivileged processes accessing fdinfo dir")

    Before those changes, inode based system calls like inotify_add_watch(2)
    would fail when the current task didn't have sufficient read permissions:

    [...]
    lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0500, st_size=0, ...}) = 0
    inotify_add_watch(64, "/proc/1/task/1/fdinfo",
    IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|
    IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = -1 EACCES (Permission denied)
    [...]

    This matches the documented behavior in the inotify_add_watch(2) man
    page:

    ERRORS
    EACCES Read access to the given file is not permitted.

    After those changes, inotify_add_watch(2) started succeeding despite the
    current task not having PTRACE_MODE_READ privileges on the target task:

    [...]
    lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
    inotify_add_watch(64, "/proc/1/task/1/fdinfo",
    IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|
    IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = 1757
    openat(AT_FDCWD, "/proc/1/task/1/fdinfo",
    O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 EACCES (Permission denied)
    [...]

    This change in behavior broke .NET prior to v7. See the github link
    below for the v7 commit that inadvertently/quietly (?) fixed .NET after
    the kernel changes mentioned above.

    Return to the old behavior by moving the PTRACE_MODE_READ check out of
    the file .open operation and into the inode .permission operation:

    [...]
    lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
    inotify_add_watch(64, "/proc/1/task/1/fdinfo",
    IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|
    IN_ONLYDIR|IN_DONT_FOLLOW|IN_EXCL_UNLINK) = -1 EACCES (Permission denied)
    [...]
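
    A minimal sketch of what such an inode ->permission() hook can look like
    (function and variable names are illustrative, not necessarily the
    upstream ones):

    static int proc_fdinfo_permission(struct mnt_idmap *idmap,
                                      struct inode *inode, int mask)
    {
            bool allowed;
            struct task_struct *task = get_proc_task(inode);

            if (!task)
                    return -ESRCH;

            allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
            put_task_struct(task);

            if (!allowed)
                    return -EACCES;

            /* Fall back to the normal mode-bit checks. */
            return generic_permission(idmap, inode, mask);
    }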

    Reported-by: Kevin Parsons (Microsoft)
    Link: https://github.com/dotnet/runtime/commit/89e5469ac591b82d38510fe7de98346cce74ad4f
    Link: https://stackoverflow.com/questions/75379065/start-self-contained-net6-build-exe-as-service-on-raspbian-system-unauthorizeda
    Fixes: 7bc3fa0172a4 ("procfs: allow reading fdinfo with PTRACE_MODE_READ")
    Cc: stable@vger.kernel.org
    Cc: Christian Brauner
    Cc: Christian König
    Cc: Jann Horn
    Cc: Kalesh Singh
    Cc: Hardik Garg
    Cc: Allen Pais
    Signed-off-by: Tyler Hicks (Microsoft)
    Link: https://lore.kernel.org/r/20240501005646.745089-1-code@tyhicks.com
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks (Microsoft)
     

02 May, 2024

1 commit

  • commit fd1a745ce03e37945674c14833870a9af0882e2d upstream.

    Return 0 for pages which can't be mapped. This matches how page_mapped()
    works. It is more convenient for users to not have to filter out these
    pages.

    Link: https://lkml.kernel.org/r/20240321142448.1645400-5-willy@infradead.org
    Fixes: 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR")
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: David Hildenbrand
    Acked-by: Vlastimil Babka
    Cc: Miaohe Lin
    Cc: Muchun Song
    Cc: Oscar Salvador
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Matthew Wilcox (Oracle)
     

23 Feb, 2024

1 commit

  • commit 60f92acb60a989b14e4b744501a0df0f82ef30a3 upstream.

    Patch series "fs/proc: do_task_stat: use sig->stats_".

    do_task_stat() has the same problem as getrusage() had before "getrusage:
    use sig->stats_lock rather than lock_task_sighand()": a hard lockup. If
    NR_CPUS threads call lock_task_sighand() at the same time and the process
    has NR_THREADS threads, spin_lock_irq will spin with irqs disabled for
    O(NR_CPUS * NR_THREADS) time.

    This patch (of 3):

    thread_group_cputime() does its own locking, so we can safely shift
    thread_group_cputime_adjusted(), which does another for_each_thread loop,
    outside of the ->siglock protected section.

    Not only this removes for_each_thread() from the critical section with
    irqs disabled, this removes another case when stats_lock is taken with
    siglock held. We want to remove this dependency, then we can change the
    users of stats_lock to not disable irqs.

    Link: https://lkml.kernel.org/r/20240123153313.GA21832@redhat.com
    Link: https://lkml.kernel.org/r/20240123153355.GA21854@redhat.com
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Dylan Hatch
    Cc: Eric W. Biederman
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     

06 Feb, 2024

1 commit

  • [ Upstream commit 315552310c7de92baea4e570967066569937a843 ]

    When registering tables to the sysctl subsystem there is a check to see
    if header is a permanently empty directory (used for mounts). This check
    evaluates the first element of the ctl_table. This results in an out of
    bounds evaluation when registering empty directories.

    The function register_sysctl_mount_point now passes a ctl_table of size
    1 instead of size 0. It now relies solely on the type to identify
    a permanently empty register.

    Make sure that the ctl_table has at least one element before testing for
    permanent emptiness.
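
    A hedged sketch of the guard (not the verbatim upstream check; it leans on
    the ctl_table_size bookkeeping described elsewhere in this log):

    /* Only a header with at least one entry can be inspected for the
     * permanently-empty marker; an empty registration never is one. */
    if (header->ctl_table_size == 0)
            return false;
    return header->ctl_table[0].type == SYSCTL_TABLE_TYPE_PERMANENTLY_EMPTY;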

    Signed-off-by: Joel Granados
    Reported-by: kernel test robot
    Closes: https://lore.kernel.org/oe-lkp/202311201431.57aae8f3-oliver.sang@intel.com
    Signed-off-by: Luis Chamberlain
    Signed-off-by: Sasha Levin

    Joel Granados
     

29 Nov, 2023

2 commits

  • commit 8b793bcda61f6c3ed4f5b2ded7530ef6749580cb upstream.

    Setting softlockup_panic from do_sysctl_args() causes it to take effect
    later in boot. The lockup detector is enabled before SMP is brought
    online, but do_sysctl_args runs afterwards. If a user wants to set
    softlockup_panic on boot and have it trigger should a softlockup occur
    during onlining of the non-boot processors, they could do this prior to
    commit f117955a2255 ("kernel/watchdog.c: convert {soft/hard}lockup boot
    parameters to sysctl aliases"). However, after this commit the value
    of softlockup_panic is set too late to be of help for this type of
    problem. Restore the prior behavior.

    Signed-off-by: Krister Johansen
    Cc: stable@vger.kernel.org
    Fixes: f117955a2255 ("kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases")
    Signed-off-by: Luis Chamberlain
    Signed-off-by: Greg Kroah-Hartman

    Krister Johansen
     
  • commit 8001f49394e353f035306a45bcf504f06fca6355 upstream.

    The code that checks for unknown boot options is unaware of the sysctl
    alias facility, which maps bootparams to sysctl values. If a user sets
    an old value that has a valid alias, a message about an invalid
    parameter will be printed during boot, and the parameter will get passed
    to init. Fix by checking for the existence of aliased parameters in the
    unknown boot parameter code. If an alias exists, don't return an error
    or pass the value to init.

    Signed-off-by: Krister Johansen
    Cc: stable@vger.kernel.org
    Fixes: 0a477e1ae21b ("kernel/sysctl: support handling command line aliases")
    Signed-off-by: Luis Chamberlain
    Signed-off-by: Greg Kroah-Hartman

    Krister Johansen
     

20 Sep, 2023

2 commits

  • On no-MMU, /proc/<pid>/maps reads as an empty file. This happens because
    find_vma(mm, 0) always returns NULL (assuming no vma actually contains the
    zero address, which is normally the case).

    To fix this bug and improve the maintainability in the future, this patch
    makes the no-MMU implementation as similar as possible to the MMU
    implementation.

    The only remaining differences are the lack of hold/release_task_mempolicy
    and the extra code to shoehorn the gate vma into the iterator.

    This has been tested on top of 6.5.3 on an STM32F746.

    Link: https://lkml.kernel.org/r/20230915160055.971059-2-ben.wolsieffer@hefring.com
    Fixes: 0c563f148043 ("proc: remove VMA rbtree use from nommu")
    Signed-off-by: Ben Wolsieffer
    Cc: Davidlohr Bueso
    Cc: Giulio Benetti
    Cc: Liam R. Howlett
    Cc: Matthew Wilcox (Oracle)
    Cc: Oleg Nesterov
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton

    Ben Wolsieffer
     
  • The no-MMU implementation of /proc/<pid>/maps doesn't normally release
    the mmap read lock, because it uses !IS_ERR_OR_NULL(_vml) to determine
    whether to release the lock. Since _vml is NULL when the end of the
    mappings is reached, the lock is not released.

    Reading /proc/1/maps twice doesn't cause a hang because it only
    takes the read lock, which can be taken multiple times and therefore
    doesn't show any problem if the lock isn't released. Instead, you need
    to perform some operation that attempts to take the write lock after
    reading /proc//maps. To actually reproduce the bug, compile the
    following code as 'proc_maps_bug':

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(int argc, char *argv[]) {
            void *buf;

            sleep(1);

            buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            puts("mmap returned");

            return 0;
    }

    Then, run:

    ./proc_maps_bug &; cat /proc/$!/maps; fg

    Without this patch, mmap() will hang and the command will never
    complete.

    This code was incorrectly adapted from the MMU implementation, which at
    the time released the lock in m_next() before returning the last entry.

    The MMU implementation has diverged further from the no-MMU version since
    then, so this patch brings their locking and error handling into sync,
    fixing the bug and hopefully avoiding similar issues in the future.

    Link: https://lkml.kernel.org/r/20230914163019.4050530-2-ben.wolsieffer@hefring.com
    Fixes: 47fecca15c09 ("fs/proc/task_nommu.c: don't use priv->task->mm")
    Signed-off-by: Ben Wolsieffer
    Acked-by: Oleg Nesterov
    Cc: Giulio Benetti
    Cc: Greg Ungerer
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton

    Ben Wolsieffer
     

06 Sep, 2023

1 commit

  • Pull more MM updates from Andrew Morton:

    - Stefan Roesch has added ksm statistics to /proc/pid/smaps

    - Also a number of singleton patches, mainly cleanups and leftovers

    * tag 'mm-stable-2023-09-04-14-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
    mm/kmemleak: move up cond_resched() call in page scanning loop
    mm: page_alloc: remove stale CMA guard code
    MAINTAINERS: add rmap.h to mm entry
    rmap: remove anon_vma_link() nommu stub
    proc/ksm: add ksm stats to /proc/pid/smaps
    mm/hwpoison: rename hwp_walk* to hwpoison_walk*
    mm: memory-failure: add PageOffline() check

    Linus Torvalds
     

03 Sep, 2023

1 commit

  • With madvise and prctl KSM can be enabled for different VMA's. Once it is
    enabled we can query how effective KSM is overall. However we cannot
    easily query if an individual VMA benefits from KSM.

    This commit adds a KSM section to the /proc/<pid>/smaps file. It reports
    how many of the pages are KSM pages. Note that KSM-placed zeropages are
    not included, only actual KSM pages.

    Here is a typical output:

    7f420a000000-7f421a000000 rw-p 00000000 00:00 0
    Size: 262144 kB
    KernelPageSize: 4 kB
    MMUPageSize: 4 kB
    Rss: 51212 kB
    Pss: 8276 kB
    Shared_Clean: 172 kB
    Shared_Dirty: 42996 kB
    Private_Clean: 196 kB
    Private_Dirty: 7848 kB
    Referenced: 15388 kB
    Anonymous: 51212 kB
    KSM: 41376 kB
    LazyFree: 0 kB
    AnonHugePages: 0 kB
    ShmemPmdMapped: 0 kB
    FilePmdMapped: 0 kB
    Shared_Hugetlb: 0 kB
    Private_Hugetlb: 0 kB
    Swap: 202016 kB
    SwapPss: 3882 kB
    Locked: 0 kB
    THPeligible: 0
    ProtectionKey: 0
    ksm_state: 0
    ksm_skip_base: 0
    ksm_skip_count: 0
    VmFlags: rd wr mr mw me nr mg anon

    This information also helps with the following workflow:
    - First enable KSM for all the VMA's of a process with prctl.
    - Then analyze with the above smaps report which VMA's benefit the most
    - Change the application (if possible) to add the corresponding madvise
    calls for the VMA's that benefit the most

    [shr@devkernel.io: v5]
    Link: https://lkml.kernel.org/r/20230823170107.1457915-1-shr@devkernel.io
    Link: https://lkml.kernel.org/r/20230822180539.1424843-1-shr@devkernel.io
    Signed-off-by: Stefan Roesch
    Reviewed-by: David Hildenbrand
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton

    Stefan Roesch
     

01 Sep, 2023

1 commit

  • Pull x86 shadow stack support from Dave Hansen:
    "This is the long awaited x86 shadow stack support, part of Intel's
    Control-flow Enforcement Technology (CET).

    CET consists of two related security features: shadow stacks and
    indirect branch tracking. This series implements just the shadow stack
    part of this feature, and just for userspace.

    The main use case for shadow stack is providing protection against
    return oriented programming attacks. It works by maintaining a
    secondary (shadow) stack using a special memory type that has
    protections against modification. When executing a CALL instruction,
    the processor pushes the return address to both the normal stack and
    to the special permission shadow stack. Upon RET, the processor pops
    the shadow stack copy and compares it to the normal stack copy.

    For more information, refer to the links below for the earlier
    versions of this patch set"

    Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
    Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/

    * tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
    x86/shstk: Change order of __user in type
    x86/ibt: Convert IBT selftest to asm
    x86/shstk: Don't retry vm_munmap() on -EINTR
    x86/kbuild: Fix Documentation/ reference
    x86/shstk: Move arch detail comment out of core mm
    x86/shstk: Add ARCH_SHSTK_STATUS
    x86/shstk: Add ARCH_SHSTK_UNLOCK
    x86: Add PTRACE interface for shadow stack
    selftests/x86: Add shadow stack test
    x86/cpufeatures: Enable CET CR4 bit for shadow stack
    x86/shstk: Wire in shadow stack interface
    x86: Expose thread features in /proc/$PID/status
    x86/shstk: Support WRSS for userspace
    x86/shstk: Introduce map_shadow_stack syscall
    x86/shstk: Check that signal frame is shadow stack mem
    x86/shstk: Check that SSP is aligned on sigreturn
    x86/shstk: Handle signals for shadow stack
    x86/shstk: Introduce routines modifying shstk
    x86/shstk: Handle thread shadow stack
    x86/shstk: Add user-mode shadow stack support
    ...

    Linus Torvalds
     

30 Aug, 2023

3 commits

  • Pull sysctl updates from Luis Chamberlain:
    "Long ago we set out to remove the kitchen sink on kernel/sysctl.c
    arrays and placing sysctls in their own subsystem or file to help
    avoid merge conflicts. Matthew Wilcox pointed out though that if we're
    going to do that we might as well also *save* space while at it and
    try to remove the extra last sysctl entry added at the end of each
    array, a sentinel, instead of bloating the kernel by adding a new
    sentinel with each array moved.

    Doing that was not so trivial, and has required slowing down the moves
    of kernel/sysctl.c arrays and measuring the impact on size by each new
    move.

    The complex part of the effort to help reduce the size of each sysctl
    is being done by the patient work of el señor Don Joel Granados. A lot
    of this is truly painful code refactoring and testing and then trying
    to measure the savings of each move and removing the sentinels.
    Although Joel already has code which does most of this work,
    experience with sysctl moves in the past shows we need to be
    careful due to the slew of odd build failures that are possible due to
    the amount of random Kconfig options sysctls use.

    To that end Joel's work is split by first addressing the major
    housekeeping needed to remove the sentinels, which is part of this
    merge request. The rest of the work to actually remove the sentinels
    will be done later in future kernel releases.

    The preliminary math is showing this will all help reduce the overall
    build time size of the kernel and run time memory consumed by the
    kernel by about ~64 bytes per array where we are able to remove each
    sentinel in the future. That also means there is no more bloating the
    kernel with the extra ~64 bytes per array moved as no new sentinels
    are created"

    * tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
    sysctl: Use ctl_table_size as stopping criteria for list macro
    sysctl: SIZE_MAX->ARRAY_SIZE in register_net_sysctl
    vrf: Update to register_net_sysctl_sz
    networking: Update to register_net_sysctl_sz
    netfilter: Update to register_net_sysctl_sz
    ax.25: Update to register_net_sysctl_sz
    sysctl: Add size to register_net_sysctl function
    sysctl: Add size arg to __register_sysctl_init
    sysctl: Add size to register_sysctl
    sysctl: Add a size arg to __register_sysctl_table
    sysctl: Add size argument to init_header
    sysctl: Add ctl_table_size to ctl_table_header
    sysctl: Use ctl_table_header in list_for_each_table_entry
    sysctl: Prefer ctl_table_header in proc_sysctl

    Linus Torvalds
     
  • …ux/kernel/git/akpm/mm

    Pull non-MM updates from Andrew Morton:

    - An extensive rework of kexec and crash Kconfig from Eric DeVolder
    ("refactor Kconfig to consolidate KEXEC and CRASH options")

    - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
    couple of macros to args.h")

    - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
    commands")

    - vsprintf inclusion rationalization from Andy Shevchenko
    ("lib/vsprintf: Rework header inclusions")

    - Switch the handling of kdump from a udev scheme to in-kernel
    handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory
    hot un/plug")

    - Many singleton patches to various parts of the tree

    * tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (81 commits)
    document while_each_thread(), change first_tid() to use for_each_thread()
    drivers/char/mem.c: shrink character device's devlist[] array
    x86/crash: optimize CPU changes
    crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
    crash: hotplug support for kexec_load()
    x86/crash: add x86 crash hotplug support
    crash: memory and CPU hotplug sysfs attributes
    kexec: exclude elfcorehdr from the segment digest
    crash: add generic infrastructure for crash hotplug support
    crash: move a few code bits to setup support of crash hotplug
    kstrtox: consistently use _tolower()
    kill do_each_thread()
    nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
    scripts/bloat-o-meter: count weak symbol sizes
    treewide: drop CONFIG_EMBEDDED
    lockdep: fix static memory detection even more
    lib/vsprintf: declare no_hash_pointers in sprintf.h
    lib/vsprintf: split out sprintf() and friends
    kernel/fork: stop playing lockless games for exe_file replacement
    adfs: delete unused "union adfs_dirtail" definition
    ...

    Linus Torvalds
     
  • Pull MM updates from Andrew Morton:

    - Some swap cleanups from Ma Wupeng ("fix WARN_ON in
    add_to_avail_list")

    - Peter Xu has a series ("mm/gup: Unify hugetlb, speed up thp") which
    reduces the special-case code for handling hugetlb pages in GUP. It
    also speeds up GUP handling of transparent hugepages.

    - Peng Zhang provides some maple tree speedups ("Optimize the fast path
    of mas_store()").

    - Sergey Senozhatsky has improved the performance of zsmalloc during
    compaction ("zsmalloc: small compaction improvements").

    - Domenico Cerasuolo has developed additional selftest code for zswap
    ("selftests: cgroup: add zswap test program").

    - xu xin has done some work on KSM's handling of zero pages. These
    changes are mainly to enable the user to better understand the
    effectiveness of KSM's treatment of zero pages ("ksm: support
    tracking KSM-placed zero-pages").

    - Jeff Xu has fixed the behaviour of memfd's
    MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
    MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

    - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
    Stop read optimisation when folio removed from pagecache").

    - Axel Rasmussen has given userfaultfd the ability to simulate memory
    poisoning ("add UFFDIO_POISON to simulate memory poisoning with
    UFFD").

    - Miaohe Lin has contributed some routine maintenance work on the
    memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
    check").

    - Peng Zhang has contributed some maintenance work on the maple tree
    code ("Improve the validation for maple tree and some cleanup").

    - Hugh Dickins has optimized the collapsing of shmem or file pages into
    THPs ("mm: free retracted page table by RCU").

    - Jiaqi Yan has a patch series which permits us to use the healthy
    subpages within a hardware poisoned huge page for general purposes
    ("Improve hugetlbfs read on HWPOISON hugepages").

    - Kemeng Shi has done some maintenance work on the pagetable-check code
    ("Remove unused parameters in page_table_check").

    - More folioification work from Matthew Wilcox ("More filesystem folio
    conversions for 6.6"), ("Followup folio conversions for zswap"). And
    from ZhangPeng ("Convert several functions in page_io.c to use a
    folio").

    - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

    - Baoquan He has converted some architectures to use the
    GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
    architectures to take GENERIC_IOREMAP way").

    - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
    batched/deferred tlb shootdown during page reclamation/migration").

    - Better maple tree lockdep checking from Liam Howlett ("More strict
    maple tree lockdep"). Liam also developed some efficiency
    improvements ("Reduce preallocations for maple tree").

    - Cleanup and optimization to the secondary IOMMU TLB invalidation,
    from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
    upgrade").

    - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
    for arm64").

    - Kemeng Shi provides some maintenance work on the compaction code
    ("Two minor cleanups for compaction").

    - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
    most file-backed faults under the VMA lock").

    - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
    on ppc64, under some circumstances ("Add support for DAX vmemmap
    optimization for ppc64").

    - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
    data in page_ext"), ("minor cleanups to page_ext header").

    - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
    cleanups").

    - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

    - VMA handling cleanups from Kefeng Wang ("mm: convert to
    vma_is_initial_heap/stack()").

    - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
    implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
    address ranges and DAMON monitoring targets").

    - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

    - Liam Howlett has improved the maple tree node replacement code
    ("maple_tree: Change replacement strategy").

    - ZhangPeng has a general code cleanup - use the K() macro more widely
    ("cleanup with helper macro K()").

    - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
    memmap on memory feature on ppc64").

    - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
    in page_alloc"), ("Two minor cleanups for get pageblock
    migratetype").

    - Vishal Moola introduces a memory descriptor for page table tracking,
    "struct ptdesc" ("Split ptdesc from struct page").

    - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
    for vm.memfd_noexec").

    - MM include file rationalization from Hugh Dickins ("arch: include
    asm/cacheflush.h in asm/hugetlb.h").

    - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
    output").

    - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
    object_cache instead of kmemleak_initialized").

    - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
    and _folio_order").

    - A VMA locking scalability improvement from Suren Baghdasaryan
    ("Per-VMA lock support for swap and userfaults").

    - pagetable handling cleanups from Matthew Wilcox ("New page table
    range API").

    - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
    using page->private on tail pages for THP_SWAP + cleanups").

    - Cleanups and speedups to the hugetlb fault handling from Matthew
    Wilcox ("Change calling convention for ->huge_fault").

    - Matthew Wilcox has also done some maintenance work on the MM
    subsystem documentation ("Improve mm documentation").

    * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
    maple_tree: shrink struct maple_tree
    maple_tree: clean up mas_wr_append()
    secretmem: convert page_is_secretmem() to folio_is_secretmem()
    nios2: fix flush_dcache_page() for usage from irq context
    hugetlb: add documentation for vma_kernel_pagesize()
    mm: add orphaned kernel-doc to the rst files.
    mm: fix clean_record_shared_mapping_range kernel-doc
    mm: fix get_mctgt_type() kernel-doc
    mm: fix kernel-doc warning from tlb_flush_rmaps()
    mm: remove enum page_entry_size
    mm: allow ->huge_fault() to be called without the mmap_lock held
    mm: move PMD_ORDER to pgtable.h
    mm: remove checks for pte_index
    memcg: remove duplication detection for mem_cgroup_uncharge_swap
    mm/huge_memory: work on folio->swap instead of page->private when splitting folio
    mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
    mm/swap: use dedicated entry for swap in folio
    mm/swap: stop using page->private on tail pages for THP_SWAP
    selftests/mm: fix WARNING comparing pointer to 0
    selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
    ...

    Linus Torvalds
     

29 Aug, 2023

2 commits

  • Pull procfs fixes from Christian Brauner:
    "Mode changes to files under /proc// aren't supported ever since
    commit 6d76fa58b050 ("Don't allow chmod() on the /proc// files").

    Due to an oversight in commit 1b3044e39a89 ("procfs: fix pthread
    cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD,
    mode changes on /proc/thread-self/comm were accidentally allowed.

    Similarly, mode changes for all files beneath /proc/<pid>/net/ are
    blocked, but mode changes on /proc/<pid>/net itself were accidentally
    allowed.

    Both issues come down to not using the generic proc_setattr() helper
    which blocks all mode changes. This is rectified with this pull
    request.

    This also removes a strange nolibc test that abused /proc/<pid>/net
    for testing mode changes. Using procfs for this test never made a lot
    of sense given procfs has special semantics for almost everything
    anyway.

    Both changes are minor user-visible changes. It is however very
    unlikely that mode changes on /proc/<pid>/net and
    /proc/thread-self/comm are something that userspace relies on"

    * tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
    procfs: block chmod on /proc/thread-self/comm
    proc: use generic setattr() for /proc/$PID/net
    selftests/nolibc: drop test chmod_net

    Linus Torvalds
     
  • Pull vfs timestamp updates from Christian Brauner:
    "This adds VFS support for multi-grain timestamps and converts tmpfs,
    xfs, ext4, and btrfs to use them. This carries acks from all relevant
    filesystems.

    The VFS always uses coarse-grained timestamps when updating the ctime
    and mtime after a change. This has the benefit of allowing filesystems
    to optimize away a lot of metadata updates, down to around 1 per
    jiffy, even when a file is under heavy writes.

    Unfortunately, this has always been an issue when we're exporting via
    NFSv3, which relies on timestamps to validate caches. A lot of changes
    can happen in a jiffy, so timestamps aren't sufficient to help the
    client decide to invalidate the cache.

    Even with NFSv4, a lot of exported filesystems don't properly support
    a change attribute and are subject to the same problems with timestamp
    granularity. Other applications have similar issues with timestamps
    (e.g., backup applications).

    If we were to always use fine-grained timestamps, that would improve
    the situation, but that becomes rather expensive, as the underlying
    filesystem would have to log a lot more metadata updates.

    This introduces fine-grained timestamps that are used when they are
    actively queried.

    This uses the 31st bit of the ctime tv_nsec field to indicate that
    something has queried the inode for the mtime or ctime. When this flag
    is set, on the next mtime or ctime update, the kernel will fetch a
    fine-grained timestamp instead of the usual coarse-grained one.

    As POSIX generally mandates that when the mtime changes, the ctime
    must also change, the kernel always stores normalized ctime values, so
    only the first 30 bits of the tv_nsec field are ever used.

    Filesystems can opt into this behavior by setting the FS_MGTIME flag in
    the fstype. Filesystems that don't set this flag will continue to use
    coarse-grained timestamps.

    Various preparatory changes, fixes and cleanups are included:

    - Fixup all relevant places where POSIX requires updating ctime
    together with mtime. This is a wide-range of places and all
    maintainers provided necessary Acks.

    - Add new accessors for inode->i_ctime directly and change all
    callers to rely on them. Plain accesses to inode->i_ctime are now
    gone and it is accordingly renamed to inode->__i_ctime and commented
    as requiring accessors.

    - Extend generic_fillattr() to pass in a request mask mirroring in a
    sense the statx() uapi. This allows callers to pass in a request
    mask to only get a subset of attributes filled in.

    - Rework timestamp updates so it's possible to drop the @now
    parameter from the update_time() inode operation and associated helpers.

    - Add inode_update_timestamps() and convert all filesystems to it
    removing a bunch of open-coding"

    * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
    btrfs: convert to multigrain timestamps
    ext4: switch to multigrain timestamps
    xfs: switch to multigrain timestamps
    tmpfs: add support for multigrain timestamps
    fs: add infrastructure for multigrain timestamps
    fs: drop the timespec64 argument from update_time
    xfs: have xfs_vn_update_time gets its own timestamp
    fat: make fat_update_time get its own timestamp
    fat: remove i_version handling from fat_update_time
    ubifs: have ubifs_update_time use inode_update_timestamps
    btrfs: have it use inode_update_timestamps
    fs: drop the timespec64 arg from generic_update_time
    fs: pass the request_mask to generic_fillattr
    fs: remove silly warning from current_time
    gfs2: fix timestamp handling on quota inodes
    fs: rename i_ctime field to __i_ctime
    selinux: convert to ctime accessor functions
    security: convert to ctime accessor functions
    apparmor: convert to ctime accessor functions
    sunrpc: convert to ctime accessor functions
    ...

    Linus Torvalds
     

26 Aug, 2023

1 commit

  • …linux/kernel/git/akpm/mm

    Pull misc fixes from Andrew Morton:
    "18 hotfixes. 13 are cc:stable and the remainder pertain to post-6.4
    issues or aren't considered suitable for a -stable backport"

    * tag 'mm-hotfixes-stable-2023-08-25-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
    shmem: fix smaps BUG sleeping while atomic
    selftests: cachestat: catch failing fsync test on tmpfs
    selftests: cachestat: test for cachestat availability
    maple_tree: disable mas_wr_append() when other readers are possible
    madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check
    madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check
    madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check
    mm: multi-gen LRU: don't spin during memcg release
    mm: memory-failure: fix unexpected return value in soft_offline_page()
    radix tree: remove unused variable
    mm: add a call to flush_cache_vmap() in vmap_pfn()
    selftests/mm: FOLL_LONGTERM need to be updated to 0x100
    nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers()
    mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast
    selftests: cgroup: fix test_kmem_basic less than error
    mm: enable page walking API to lock vmas during the walk
    smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
    mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT

    Linus Torvalds
     

25 Aug, 2023

1 commit

  • Add the comment to explain that while_each_thread(g,t) is not rcu-safe
    unless g is stable (e.g. current). Even if g is a group leader and thus
    can't exit before t, t or another sub-thread can exec and remove g from
    the thread_group list.

    The only lockless user of while_each_thread() is first_tid() and it is
    fine in that it can't loop forever, yet for_each_thread() looks better and
    I am going to change while_each_thread/next_thread.

    Link: https://lkml.kernel.org/r/20230823170806.GA11724@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Eric W. Biederman
    Signed-off-by: Andrew Morton

    Oleg Nesterov
     

22 Aug, 2023

7 commits

  • Andrew Morton
     
  • Extract from current /proc/self/smaps output:

    Swap: 0 kB
    SwapPss: 0 kB
    Locked: 0 kB
    THPeligible: 0
    ProtectionKey: 0

    That's not the alignment shown in Documentation/filesystems/proc.rst: it's
    an ugly artifact from missing out the %8 other fields are using; but
    there's even one selftest which expects it to look that way. Hoping no
    other smaps parsers depend on THPeligible to look so ugly, fix these.
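
    Roughly, the fix is the kind of one-character format change in show_smap()
    shown here (a sketch, not the verbatim hunk):

    -	seq_printf(m, "THPeligible:    %d\n", ...);
    +	seq_printf(m, "THPeligible:    %8d\n", ...);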

    Link: https://lkml.kernel.org/r/cfb81f7a-f448-5bc2-b0e1-8136fcd1dd8c@google.com
    Signed-off-by: Hugh Dickins
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton

    Hugh Dickins
     
  • It is better to not expose too many internal variables of memtest,
    add a helper memtest_report_meminfo() to show memtest results.

    Link: https://lkml.kernel.org/r/20230808033359.174986-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Acked-by: Mike Rapoport (IBM)
    Cc: Matthew Wilcox
    Cc: Tomas Mudrunka
    Signed-off-by: Andrew Morton

    Kefeng Wang
     
  • Patch series "mm: convert to vma_is_initial_heap/stack()", v3.

    Add vma_is_initial_stack() and vma_is_initial_heap() helpers and use them
    to simplify code.

    This patch (of 4):

    Factor out VMA stack and heap checks and name them vma_is_initial_stack()
    and vma_is_initial_heap() for general use.

    Link: https://lkml.kernel.org/r/20230728050043.59880-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20230728050043.59880-2-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Reviewed-by: David Hildenbrand
    Acked-by: Peter Zijlstra (Intel)
    Cc: Christian Göttsche
    Cc: Alex Deucher
    Cc: Arnaldo Carvalho de Melo
    Cc: Christian Göttsche
    Cc: Christian König
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Eric Paris
    Cc: Felix Kuehling
    Cc: "Pan, Xinhui"
    Cc: Paul Moore
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton

    Kefeng Wang
     
  • The only user of frontswap is zswap, and has been for a long time. Have
    swap call into zswap directly and remove the indirection.

    [hannes@cmpxchg.org: remove obsolete comment, per Yosry]
    Link: https://lkml.kernel.org/r/20230719142832.GA932528@cmpxchg.org
    [fengwei.yin@intel.com: don't warn if none swapcache folio is passed to zswap_load]
    Link: https://lkml.kernel.org/r/20230810095652.3905184-1-fengwei.yin@intel.com
    Link: https://lkml.kernel.org/r/20230717160227.GA867137@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Yin Fengwei
    Acked-by: Konrad Rzeszutek Wilk
    Acked-by: Nhat Pham
    Acked-by: Yosry Ahmed
    Acked-by: Christoph Hellwig
    Cc: Domenico Cerasuolo
    Cc: Matthew Wilcox (Oracle)
    Cc: Vitaly Wool
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton

    Johannes Weiner
     
  • walk_page_range() and friends often operate under write-locked mmap_lock.
    With introduction of vma locks, the vmas have to be locked as well during
    such walks to prevent concurrent page faults in these areas. Add an
    additional member to mm_walk_ops to indicate locking requirements for the
    walk.

    The change ensures that page walks which prevent concurrent page faults
    by write-locking mmap_lock, operate correctly after introduction of
    per-vma locks. With per-vma locks page faults can be handled under vma
    lock without taking mmap_lock at all, so write locking mmap_lock would
    not stop them. The change ensures vmas are properly locked during such
    walks.

    A sample issue this solves is do_mbind() performing queue_pages_range()
    to queue pages for migration. Without this change a concurrent page
    can be faulted into the area and be left out of migration.
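
    A sketch of the new knob (an excerpt; enum values as introduced by this
    series, other members elided):

    struct mm_walk_ops {
            /* ... pte/pmd/hugetlb entry callbacks ... */

            /* How each VMA must be locked while this walker visits it:
             * PGWALK_RDLOCK, PGWALK_WRLOCK or PGWALK_WRLOCK_VERIFY. */
            enum page_walk_lock walk_lock;
    };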

    Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
    Signed-off-by: Suren Baghdasaryan
    Suggested-by: Linus Torvalds
    Suggested-by: Jann Horn
    Cc: David Hildenbrand
    Cc: Davidlohr Bueso
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox (Oracle)
    Cc: Michal Hocko
    Cc: Michel Lespinasse
    Cc: Peter Xu
    Cc: Vlastimil Babka
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton

    Suren Baghdasaryan
     
  • We shouldn't be using a GUP-internal helper if it can be avoided.

    Similar to smaps_pte_entry() that uses vm_normal_page(), let's use
    vm_normal_page_pmd() that similarly refuses to return the huge zeropage.

    In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd():

    (1) Will always return the head page, not a tail page of a THP.

    If we'd ever call smaps_account with a tail page while setting "compound
    = true", we could be in trouble, because smaps_account() would look at
    the memmap of unrelated pages.

    If we're unlucky, that memmap does not exist at all. Before we removed
    PG_doublemap, we could have triggered something similar as in
    commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for
    migration entry").

    This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc:
    smaps_rollup: do not stall write attempts on mmap_lock"):

    (a) We're in show_smaps_rollup() and processed a VMA
    (b) We release the mmap lock in show_smaps_rollup() because it is
    contended
    (c) We merged that VMA with another VMA
    (d) We collapsed a THP in that merged VMA at that position

    If the end address of the original VMA falls into the middle of a THP
    area, we would call smap_gather_stats() with a start address that falls
    into a PMD-mapped THP. It's probably very rare to trigger when not
    really forced.

    (2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page()

    Treat such PMDs here just like smaps_pte_entry() would treat such PTEs.
    If such pages would be anonymous, we most certainly would want to
    account them.

    (3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap()

    As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE
    pages. So just like smaps_pte_entry(), we'll now also ignore such PMD
    entries.

    Especially, follow_pmd_mask() never ends up calling
    follow_trans_huge_pmd() on pmd_devmap(). Instead it calls
    follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN
    is set.

    So skipping pmd_devmap() pages seems to be the right thing to do.

    (4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page()

    We won't be returning a memmap that should be ignored by core-mm, or
    worse, a memmap that does not even exist. Note that while
    walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't.

    Most probably this case doesn't currently really happen on the PMD level,
    otherwise we'd already be able to trigger kernel crashes when reading
    smaps / smaps_rollup.

    So most probably only (1) is relevant in practice as of now, but could only
    cause trouble in extreme corner cases.

    Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future
    reuse in wrong context.

    Link: https://lkml.kernel.org/r/20230803143208.383663-3-david@redhat.com
    Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")
    Signed-off-by: David Hildenbrand
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: Linus Torvalds
    Cc: liubo
    Cc: Matthew Wilcox (Oracle)
    Cc: Mel Gorman
    Cc: Paolo Bonzini
    Cc: Peter Xu
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton

    David Hildenbrand
     

19 Aug, 2023

1 commit

    As the number of KSM zero pages is not included in ksm_merging_pages per
    process when use_zero_pages is enabled, it's unclear how many actual pages
    are merged by KSM. To let users accurately estimate their memory demands
    when unsharing KSM zero pages, it's necessary to show KSM zero pages per
    process. In addition, it helps users know the actual KSM profit, because
    KSM-placed zero pages also benefit from KSM.

    Since accurately unsharing zero pages placed by KSM is now achieved,
    tracking their merging and unmerging is no longer a difficult thing.

    Since we already have /proc/<pid>/ksm_stat, just add the 'ksm_zero_pages'
    information to it.

    Link: https://lkml.kernel.org/r/20230613030938.185993-1-yang.yang29@zte.com.cn
    Signed-off-by: xu xin
    Acked-by: David Hildenbrand
    Reviewed-by: Xiaokai Ran
    Reviewed-by: Yang Yang
    Cc: Claudio Imbrenda
    Cc: Xuexin Jiang
    Signed-off-by: Andrew Morton

    xu xin
     

16 Aug, 2023

5 commits

  • This is a preparation commit to make it easy to remove the sentinel
    elements (empty end markers) from the ctl_table arrays. It both allows
    the systematic removal of the sentinels and adds the ctl_table_size
    variable to the stopping criteria of the list_for_each_table_entry macro
    that traverses all ctl_table arrays. Once all the sentinels are removed
    by subsequent commits, ctl_table_size will become the only stopping
    criteria in the macro. We don't actually remove any elements in this
    commit, but it sets things up for the removal process to take place.

    By adding header->ctl_table_size as an additional stopping criteria for
    the list_for_each_table_entry macro, it will execute until it finds an
    "empty" ->procname or until the size runs out. Therefore if a ctl_table
    array with a sentinel is passed its size will be too big (by one
    element) but it will stop on the sentinel. On the other hand, if the
    ctl_table array without a sentinel is passed its size will be just write
    and there will be no need for a sentinel.
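
    Roughly, the macro gains the size as a second stopping condition (a
    sketch, not the verbatim upstream definition):

    #define list_for_each_table_entry(entry, header)                          \
            for ((entry) = (header)->ctl_table;                               \
                 ((entry) < (header)->ctl_table + (header)->ctl_table_size) && \
                 (entry)->procname;                                           \
                 (entry)++)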

    Signed-off-by: Joel Granados
    Suggested-by: Jani Nikula
    Signed-off-by: Luis Chamberlain

    Joel Granados
     
  • This commit adds table_size to __register_sysctl_init in preparation for
    the removal of the sentinel elements in the ctl_table arrays (last empty
    markers). And though we do *not* remove any sentinels in this commit, we
    set things up by calculating the ctl_table array size with ARRAY_SIZE.

    We add a table_size argument to __register_sysctl_init and modify the
    register_sysctl_init macro to calculate the array size with ARRAY_SIZE.
    The original callers do not need to be updated as they will go through
    the new macro.
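
    In effect (a sketch of the macro, which keeps existing callers unchanged):

    #define register_sysctl_init(path, table) \
            __register_sysctl_init(path, table, #table, ARRAY_SIZE(table))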

    Signed-off-by: Joel Granados
    Suggested-by: Greg Kroah-Hartman
    Signed-off-by: Luis Chamberlain

    Joel Granados
     
  • This commit adds table_size to register_sysctl in preparation for the
    removal of the sentinel elements in the ctl_table arrays (last empty
    markers). And though we do *not* remove any sentinels in this commit, we
    set things up by either passing the table_size explicitly or using
    ARRAY_SIZE on the ctl_table arrays.

    We replace the register_syctl function with a macro that will add the
    ARRAY_SIZE to the new register_sysctl_sz function. In this way the
    callers that are already using an array of ctl_table structs do not
    change. For the callers that pass a ctl_table array pointer, we pass the
    table_size to register_sysctl_sz instead of the macro.
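
    Schematically (a sketch): callers that pass a real array keep using the
    macro, while pointer-based callers switch to the _sz variant with an
    explicit size.

    #define register_sysctl(path, table) \
            register_sysctl_sz(path, table, ARRAY_SIZE(table))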

    Signed-off-by: Joel Granados
    Suggested-by: Greg Kroah-Hartman
    Signed-off-by: Luis Chamberlain

    Joel Granados
     
  • We make these changes in order to prepare __register_sysctl_table and
    its callers for when we remove the sentinel element (empty element at
    the end of ctl_table arrays). We don't actually remove any sentinels in
    this commit, but we *do* make sure to use ARRAY_SIZE so the table_size
    is available when the removal occurs.

    We add a table_size argument to __register_sysctl_table and adjust
    callers, all of which pass ctl_table pointers and need an explicit call
    to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in
    order to forward the size of the array pointer received from the network
    register calls.

    The new table_size argument does not yet have any effect in the
    init_header call which is still dependent on the sentinel's presence.
    table_size *does* however drive the `kzalloc` allocation in
    __register_sysctl_table with no adverse effects as the allocated memory
    is either one element greater than the calculated ctl_table array (for
    the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size
    of the calculated ctl_table array (for the call from sysctl_net.c and
    register_sysctl). This approach will allow us to "just" remove the
    sentinel without further changes to __register_sysctl_table as
    table_size will represent the exact size for all the callers at that
    point.

    Signed-off-by: Joel Granados
    Signed-off-by: Luis Chamberlain

    Joel Granados
     
  • In this commit, we add a table_size argument to the init_header function
    in order to initialize the ctl_table_size variable in ctl_table_header.
    Even though the size is not yet used, it is now initialized within the
    sysctl subsys. We need this commit for when we start adding the
    table_size arguments to the sysctl functions (e.g. register_sysctl,
    __register_sysctl_table and __register_sysctl_init).

    Note that in __register_sysctl_table we temporarily use a calculated
    size until we add the size argument to that function in subsequent
    commits.

    Signed-off-by: Joel Granados
    Signed-off-by: Luis Chamberlain

    Joel Granados