31 Aug, 2022

1 commit

  • [ Upstream commit efd4149342db2df41b1bbe68972ead853b30e444 ]

    These bits should only be valid when the ptes are present. Introduce
    two booleans for them and set them to false when !pte_present(), for
    both pte and pmd accounting.

    The bug was found during code reading and no real-world issue has been
    reported, but logically such an error can cause incorrect readings for
    either smaps or smaps_rollup output on quite a few fields.

    For example, it could over-estimate values like Shared_Dirty,
    Private_Dirty and Referenced, or under-estimate values like LazyFree,
    Shared_Clean and Private_Clean.
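
    A minimal sketch of the idea in Python, as a toy model rather than the
    kernel code itself. The Entry class, its field names and account() are
    hypothetical stand-ins for a pte/pmd and the smaps accounting; the only
    point shown is that the young/dirty bits are honoured solely when the
    entry is present.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Toy stand-in for a pte/pmd; all field names are hypothetical."""
    present: bool
    young_bit: bool  # raw "young" bit as stored in the entry
    dirty_bit: bool  # raw "dirty" bit as stored in the entry


def account(entry, stats):
    # Trust the young/dirty bits only when the entry is present. For a
    # non-present entry (e.g. a swap or migration entry) they default to
    # False instead of being read from the raw bits.
    young = entry.present and entry.young_bit
    dirty = entry.present and entry.dirty_bit
    stats["referenced"] += int(young)
    stats["dirty"] += int(dirty)
    stats["clean"] += int(not dirty)
    return stats
```

    With this model, a non-present entry whose raw bits happen to be set is
    counted as clean and unreferenced, which is the behaviour the commit
    describes.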

    Link: https://lkml.kernel.org/r/20220805160003.58929-1-peterx@redhat.com
    Fixes: b1d4d9e0cbd0 ("proc/smaps: carefully handle migration entries")
    Fixes: c94b6923fa0a ("/proc/PID/smaps: Add PMD migration entry parsing")
    Signed-off-by: Peter Xu
    Reviewed-by: Vlastimil Babka
    Reviewed-by: David Hildenbrand
    Reviewed-by: Yang Shi
    Cc: Konstantin Khlebnikov
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    Peter Xu
     

17 Aug, 2022

1 commit

  • [ Upstream commit d919a1e79bac890421537cf02ae773007bf55e6b ]

    Commit 7bc3e6e55acf06 ("proc: Use a list of inodes to flush from proc")
    moved proc_flush_task() behind __exit_signal(). As a result, systemd can
    show high CPU usage for a long period while releasing a task when the
    following two processes run concurrently:

      systemd                                 ps
    kernel_waitid                 stat(/proc/tgid)
      do_wait                       filename_lookup
        wait_consider_task            lookup_fast
          release_task
            __exit_signal
              __unhash_process
                detach_pid
                  __change_pid // remove task->pid_links
                                         d_revalidate -> pid_revalidate  // 0
                                         d_invalidate(/proc/tgid)
                                           shrink_dcache_parent(/proc/tgid)
                                             d_walk(/proc/tgid)
                                               spin_lock_nested(/proc/tgid/fd)
                                               // iterating opened fd
            proc_flush_pid                                    |
               d_invalidate (/proc/tgid/fd)                   |
                  shrink_dcache_parent(/proc/tgid/fd)         |
                    shrink_dentry_list(subdirs)               ↓
                      shrink_lock_dentry(/proc/tgid/fd) --> race on dentry lock

    d_invalidate() first removes the dentry from the hash, so why does
    proc_flush_pid() process the dentry '/proc/tgid/fd' before the dentry
    '/proc/tgid'? That's because proc_pid_make_inode() adds proc inodes in
    reverse order by invoking hlist_add_head_rcu(). But proc should not add
    any inodes under '/proc/tgid' except '/proc/tgid/task/pid'; fix it by
    adding an inode to 'pid->inodes' only if the inode is /proc/tgid or
    /proc/tgid/task/pid.
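
    The ordering effect and the filter can be sketched with a plain Python
    list standing in for the kernel hlist. This is an illustration under
    assumptions: hlist_add_head(), is_tracked() and build_list() are
    hypothetical helpers, and numeric path components stand in for tgid and
    pid.

```python
import re


def hlist_add_head(hlist, item):
    """Insert at the head, mimicking head insertion into a kernel hlist."""
    hlist.insert(0, item)


def is_tracked(path):
    # Toy version of the fix: only /proc/<tgid> and /proc/<tgid>/task/<pid>
    # are added to the per-pid inode list.
    return re.fullmatch(r"/proc/\d+(/task/\d+)?", path) is not None


def build_list(paths, filtered):
    """Add paths in instantiation order; return the resulting walk order."""
    hlist = []
    for path in paths:
        if not filtered or is_tracked(path):
            hlist_add_head(hlist, path)
    return hlist
```

    Without the filter, a list built from /proc/42 followed by /proc/42/fd
    is walked with /proc/42/fd first, which is the reversed order the
    message describes; with the filter, /proc/42/fd is never on the list.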

    Performance regression:
    Create 200 tasks, each of which opens one file 50,000 times. Kill all
    tasks when opened files exceed 10,000,000 (cat /proc/sys/fs/file-nr).

    Before fix:
    $ time killall -wq aa
    real 4m40.946s # During this period, we can see 'ps' and 'systemd'
    taking high cpu usage.

    After fix:
    $ time killall -wq aa
    real 1m20.732s # During this period, we can see 'systemd' taking
    high cpu usage.

    Link: https://lkml.kernel.org/r/20220713130029.4133533-1-chengzhihao1@huawei.com
    Fixes: 7bc3e6e55acf06 ("proc: Use a list of inodes to flush from proc")
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=216054
    Signed-off-by: Zhihao Cheng
    Signed-off-by: Zhang Yi
    Suggested-by: Brian Foster
    Reviewed-by: Brian Foster
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: Eric Biederman
    Cc: Matthew Wilcox
    Cc: Baoquan He
    Cc: Kalesh Singh
    Cc: Yu Kuai
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    Zhihao Cheng
     

29 Jul, 2022

1 commit

  • [ Upstream commit 78e36f3b0dae586f623c4a37ec5eb5496f5abbe1 ]

    sysctl has helpers which let us specify boundary values for a min or max
    int value. Since these are used for a boundary check only they don't
    change, so move these variables to sysctl_vals to avoid adding duplicate
    variables. This will help with our cleanup of kernel/sysctl.c.

    [akpm@linux-foundation.org: update it for "mm/pagealloc: sysctl: change watermark_scale_factor max limit to 30%"]
    [mcgrof@kernel.org: major rebase]

    Link: https://lkml.kernel.org/r/20211123202347.818157-3-mcgrof@kernel.org
    Signed-off-by: Xiaoming Ni
    Signed-off-by: Luis Chamberlain
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Amir Goldstein
    Cc: Andy Shevchenko
    Cc: Benjamin LaHaise
    Cc: "Eric W. Biederman"
    Cc: Greg Kroah-Hartman
    Cc: Iurii Zaikin
    Cc: Jan Kara
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Petr Mladek
    Cc: Qing Wang
    Cc: Sebastian Reichel
    Cc: Sergey Senozhatsky
    Cc: Stephen Kitt
    Cc: Tetsuo Handa
    Cc: Antti Palosaari
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Clemens Ladisch
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Joel Becker
    Cc: Joonas Lahtinen
    Cc: Joseph Qi
    Cc: Julia Lawall
    Cc: Lukas Middendorf
    Cc: Mark Fasheh
    Cc: Phillip Potter
    Cc: Rodrigo Vivi
    Cc: Douglas Gilbert
    Cc: James E.J. Bottomley
    Cc: Jani Nikula
    Cc: John Ogness
    Cc: Martin K. Petersen
    Cc: "Rafael J. Wysocki"
    Cc: Steven Rostedt (VMware)
    Cc: Suren Baghdasaryan
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Xiaoming Ni
     

09 Jun, 2022

1 commit

  • [ Upstream commit 7055197705709c59b8ab77e6a5c7d46d61edd96e ]

    When a process exits, /proc/${pid}, and /proc/${pid}/net dentries are
    flushed. However some leaf dentries like /proc/${pid}/net/arp_cache
    aren't. That's because respective PDEs have proc_misc_d_revalidate() hook
    which returns 1 and leaves dentries/inodes in the LRU.

    Force revalidation/lookup on everything under /proc/${pid}/net by
    inheriting proc_net_dentry_ops.

    [akpm@linux-foundation.org: coding-style cleanups]
    Link: https://lkml.kernel.org/r/YjdVHgildbWO7diJ@localhost.localdomain
    Fixes: c6c75deda813 ("proc: fix lookup in /proc/net subdirectories after setns(2)")
    Signed-off-by: Alexey Dobriyan
    Reported-by: hui li
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    Alexey Dobriyan
     

18 May, 2022

1 commit

  • [ Upstream commit 1927e498aee1757b3df755a194cbfc5cc0f2b663 ]

    The file permissions on the fdinfo dir were changed from
    S_IRUSR|S_IXUSR to S_IRUGO|S_IXUGO, and a PTRACE_MODE_READ check was added
    for opening the fdinfo files [1]. However, the ptrace permission check
    was not added to the directory, allowing anyone to get the open FD numbers
    by reading the fdinfo directory.

    Add the missing ptrace permission check for opening the fdinfo directory.
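
    For reference, the two permission sets mentioned above can be spelled
    out with Python's stat module. S_IRUGO and S_IXUGO are kernel macros
    that Python does not provide, so they are rebuilt here from the
    per-class bits; describe() is a hypothetical helper.

```python
import stat

# Kernel-style helper macros, rebuilt from the standard per-class bits.
S_IRUGO = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
S_IXUGO = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH

old_mode = stat.S_IRUSR | stat.S_IXUSR  # owner-only read/search
new_mode = S_IRUGO | S_IXUGO            # read/search for everyone


def describe(mode):
    """Render a directory mode the way `ls -l` would show it."""
    return stat.filemode(stat.S_IFDIR | mode)
```

    Here old_mode is 0o500 (dr-x------) and new_mode is 0o555 (dr-xr-xr-x),
    which is why mode bits alone no longer restrict who can list the
    directory and an explicit ptrace check is needed.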

    [1] https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com

    Link: https://lkml.kernel.org/r/20210713162008.1056986-1-kaleshsingh@google.com
    Fixes: 7bc3fa0172a4 ("procfs: allow reading fdinfo with PTRACE_MODE_READ")
    Signed-off-by: Kalesh Singh
    Cc: Kees Cook
    Cc: Eric W. Biederman
    Cc: Christian Brauner
    Cc: Suren Baghdasaryan
    Cc: Hridya Valsaraju
    Cc: Jann Horn
    Signed-off-by: Andrew Morton
    Signed-off-by: Sasha Levin

    Kalesh Singh
     

08 Apr, 2022

1 commit

  • commit bed5b60bf67ccd8957b8c0558fead30c4a3f5d3f upstream.

    kzalloc is a memory allocation function which can return NULL when some
    internal memory errors happen. It is safer to add a NULL pointer check.

    Link: https://lkml.kernel.org/r/20220329104004.2376879-1-lv.ruyi@zte.com.cn

    Cc: stable@vger.kernel.org
    Fixes: c1a3c36017d4 ("proc: bootconfig: Add /proc/bootconfig to show boot config list")
    Acked-by: Masami Hiramatsu
    Reported-by: Zeal Robot
    Signed-off-by: Lv Ruyi
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Greg Kroah-Hartman

    Lv Ruyi
     

09 Mar, 2022

1 commit

  • commit dd21bfa425c098b95ca86845f8e7d1ec1ddf6e4a upstream.

    Since bit 57 was exported for uffd-wp write-protected (commit
    fb8e37f35a2f: "mm/pagemap: export uffd-wp protection information"),
    fixing it can reduce some unnecessary confusion.
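
    A minimal sketch of decoding one 64-bit /proc/<pid>/pagemap entry,
    assuming the bit layout given in the kernel's pagemap documentation
    (bits 0-54 PFN when present, 55 soft-dirty, 56 exclusively mapped,
    57 uffd-wp write-protected, 61 file-page or shared-anon, 62 swapped,
    63 present). decode_pagemap_entry() is a hypothetical helper name.

```python
def decode_pagemap_entry(entry):
    """Split a 64-bit pagemap entry into its documented flag bits."""
    present = bool((entry >> 63) & 1)
    swapped = bool((entry >> 62) & 1)
    return {
        "present": present,
        "swapped": swapped,
        "file_or_shared_anon": bool((entry >> 61) & 1),
        "uffd_wp": bool((entry >> 57) & 1),
        "exclusive": bool((entry >> 56) & 1),
        "soft_dirty": bool((entry >> 55) & 1),
        # Bits 0-54 hold the PFN only for present, non-swapped pages.
        "pfn": entry & ((1 << 55) - 1) if present and not swapped else None,
    }
```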

    Link: https://lkml.kernel.org/r/20220301044538.3042713-1-yun.zhou@windriver.com
    Fixes: fb8e37f35a2fe1 ("mm/pagemap: export uffd-wp protection information")
    Signed-off-by: Yun Zhou
    Reviewed-by: Peter Xu
    Cc: Jonathan Corbet
    Cc: Tiberiu A Georgescu
    Cc: Florian Schmidt
    Cc: Ivan Teterevkov
    Cc: SeongJae Park
    Cc: Yang Shi
    Cc: David Hildenbrand
    Cc: Axel Rasmussen
    Cc: Miaohe Lin
    Cc: Andrea Arcangeli
    Cc: Colin Cross
    Cc: Alistair Popple
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yun Zhou
     

23 Feb, 2022

1 commit

  • commit 24d7275ce2791829953ed4e72f68277ceb2571c6 upstream.

    The syzbot reported the below BUG:

    kernel BUG at include/linux/page-flags.h:785!
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 4392 Comm: syz-executor560 Not tainted 5.16.0-rc6-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:PageDoubleMap include/linux/page-flags.h:785 [inline]
    RIP: 0010:__page_mapcount+0x2d2/0x350 mm/util.c:744
    Call Trace:
    page_mapcount include/linux/mm.h:837 [inline]
    smaps_account+0x470/0xb10 fs/proc/task_mmu.c:466
    smaps_pte_entry fs/proc/task_mmu.c:538 [inline]
    smaps_pte_range+0x611/0x1250 fs/proc/task_mmu.c:601
    walk_pmd_range mm/pagewalk.c:128 [inline]
    walk_pud_range mm/pagewalk.c:205 [inline]
    walk_p4d_range mm/pagewalk.c:240 [inline]
    walk_pgd_range mm/pagewalk.c:277 [inline]
    __walk_page_range+0xe23/0x1ea0 mm/pagewalk.c:379
    walk_page_vma+0x277/0x350 mm/pagewalk.c:530
    smap_gather_stats.part.0+0x148/0x260 fs/proc/task_mmu.c:768
    smap_gather_stats fs/proc/task_mmu.c:741 [inline]
    show_smap+0xc6/0x440 fs/proc/task_mmu.c:822
    seq_read_iter+0xbb0/0x1240 fs/seq_file.c:272
    seq_read+0x3e0/0x5b0 fs/seq_file.c:162
    vfs_read+0x1b5/0x600 fs/read_write.c:479
    ksys_read+0x12d/0x250 fs/read_write.c:619
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    The reproducer was reading /proc/$PID/smaps while calling MADV_FREE at
    the same time. MADV_FREE may split a THP if it is called on only part
    of the THP, which can trigger the race below:

               CPU A                         CPU B
               -----                         -----
      smaps walk:                      MADV_FREE:
      page_mapcount()
        PageCompound()
                                       split_huge_page()
        page = compound_head(page)
        PageDoubleMap(page)

    When calling PageDoubleMap() this page is not a tail page of THP anymore
    so the BUG is triggered.

    This could be fixed by elevating the refcount of the page before calling
    mapcount, but that would prevent it from counting migration entries, and
    it seems like overkill because the race can only happen when the PMD is
    split, so all PTE entries of tail pages are actually migration entries,
    and smaps_account() does treat migration entries as mapcount == 1, as
    Kirill pointed out.

    Add a new parameter to smaps_account() that indicates the entry is a
    migration entry, and skip calling page_mapcount() in that case. Don't
    skip getting mapcount for device private entries since they do track
    references with mapcount.

    Pagemap has a similar issue, although it was not reported. Fix it as
    well.

    [shy828301@gmail.com: v4]
    Link: https://lkml.kernel.org/r/20220203182641.824731-1-shy828301@gmail.com
    [nathan@kernel.org: avoid unused variable warning in pagemap_pmd_range()]
    Link: https://lkml.kernel.org/r/20220207171049.1102239-1-nathan@kernel.org
    Link: https://lkml.kernel.org/r/20220120202805.3369-1-shy828301@gmail.com
    Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()")
    Signed-off-by: Yang Shi
    Signed-off-by: Nathan Chancellor
    Reported-by: syzbot+1f52b3a18d5633fa7f82@syzkaller.appspotmail.com
    Acked-by: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Cc: Jann Horn
    Cc: Matthew Wilcox
    Cc: Alexey Dobriyan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     

01 Dec, 2021

1 commit

  • commit c1e63117711977cc4295b2ce73de29dd17066c82 upstream.

    To clear a user buffer we cannot simply use memset, we have to use
    clear_user(). With a virtio-mem device that registers a vmcore_cb and
    has some logically unplugged memory inside an added Linux memory block,
    I can easily trigger a BUG by copying the vmcore via "cp":

    systemd[1]: Starting Kdump Vmcore Save Service...
    kdump[420]: Kdump is using the default log level(3).
    kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
    kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
    kdump[465]: saving vmcore-dmesg.txt complete
    kdump[467]: saving vmcore
    BUG: unable to handle page fault for address: 00007f2374e01000
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0003) - permissions violation
    PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
    Oops: 0003 [#1] PREEMPT SMP NOPTI
    CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
    RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
    Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
    RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
    RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
    RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
    RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
    R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
    R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
    FS: 00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
    Call Trace:
    read_vmcore+0x236/0x2c0
    proc_reg_read+0x55/0xa0
    vfs_read+0x95/0x190
    ksys_read+0x4f/0xc0
    do_syscall_64+0x3b/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
    Prevention (SMAP)", which is used to detect wrong access from the kernel
    to user buffers like this: SMAP triggers a permissions violation on
    wrong access. In the x86-64 variant of clear_user(), SMAP is properly
    handled via clac()+stac().

    To fix, properly use clear_user() when we're dealing with a user buffer.

    Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com
    Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
    Signed-off-by: David Hildenbrand
    Acked-by: Baoquan He
    Cc: Dave Young
    Cc: Baoquan He
    Cc: Vivek Goyal
    Cc: Philipp Rudo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     

19 Nov, 2021

1 commit

  • [ Upstream commit a130e8fbc7de796eb6e680724d87f4737a26d0ac ]

    /proc/uptime reports idle time by reading the CPUTIME_IDLE field from
    the per-cpu kcpustats. However, on NO_HZ systems, idle time is not
    continually updated on idle cpus, leading this value to appear
    incorrectly small.

    /proc/stat performs an accounting update when reading idle time; we
    can use the same approach for uptime.

    With this patch, /proc/stat and /proc/uptime now agree on idle time.
    Additionally, the following shows idle time tick up consistently on an
    idle machine:

    (while true; do cat /proc/uptime; sleep 1; done) | awk '{print $2-prev; prev=$2}'
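
    The same check can be sketched in Python on captured samples. This
    assumes the usual two-field /proc/uptime format (uptime seconds, then
    idle seconds); parse_uptime() and idle_deltas() are hypothetical
    helpers. Unlike the awk one-liner, no delta is emitted for the first
    sample.

```python
def parse_uptime(text):
    """Return (uptime_seconds, idle_seconds) from one /proc/uptime read."""
    up, idle = text.split()[:2]
    return float(up), float(idle)


def idle_deltas(samples):
    """Idle-time increase between consecutive /proc/uptime samples."""
    idles = [parse_uptime(sample)[1] for sample in samples]
    return [round(later - earlier, 2) for earlier, later in zip(idles, idles[1:])]
```

    On an idle machine sampled once per second, each delta should stay
    close to the number of idle CPUs.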

    Reported-by: Luigi Rizzo
    Signed-off-by: Josh Don
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Eric Dumazet
    Link: https://lkml.kernel.org/r/20210827165438.3280779-1-joshdon@google.com
    Signed-off-by: Sasha Levin

    Josh Don
     

12 Nov, 2021

1 commit

  • commit 54354c6a9f7fd5572d2b9ec108117c4f376d4d23 upstream.

    This reverts commit 152c432b128cb043fc107e8f211195fe94b2159c.

    When a kernel address couldn't be symbolized for /proc/$pid/wchan, it
    would leak the raw value, a potential information exposure. This is a
    regression compared to the safer pre-v5.12 behavior.

    Reported-by: kernel test robot
    Reported-by: Vito Caputo
    Reported-by: Jann Horn
    Signed-off-by: Kees Cook
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20211008111626.090829198@infradead.org
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

09 Sep, 2021

3 commits

  • Merge more updates from Andrew Morton:
    "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.

    Subsystems affected by this patch series: mm (memory-hotplug, rmap,
    ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
    alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
    checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
    selftests, ipc, and scripts"

    * emailed patches from Andrew Morton: (94 commits)
    scripts: check_extable: fix typo in user error message
    mm/workingset: correct kernel-doc notations
    ipc: replace costly bailout check in sysvipc_find_ipc()
    selftests/memfd: remove unused variable
    Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
    configs: remove the obsolete CONFIG_INPUT_POLLDEV
    prctl: allow to setup brk for et_dyn executables
    pid: cleanup the stale comment mentioning pidmap_init().
    kernel/fork.c: unexport get_{mm,task}_exe_file
    coredump: fix memleak in dump_vma_snapshot()
    fs/coredump.c: log if a core dump is aborted due to changed file permissions
    nilfs2: use refcount_dec_and_lock() to fix potential UAF
    nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
    nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
    nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
    nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
    nilfs2: fix NULL pointer in nilfs_##name##_attr_release
    nilfs2: fix memory leak in nilfs_sysfs_create_device_group
    trap: cleanup trap_init()
    init: move usermodehelper_enable() to populate_rootfs()
    ...

    Linus Torvalds
     
  • While comm change event via prctl has been reported to proc connector by
    'commit f786ecba4158 ("connector: add comm change event report to proc
    connector")', connector listeners were missing comm changes by explicit
    writes on /proc/[pid]/comm.

    Let explicit writes on /proc/[pid]/comm report to proc connector.

    Link: https://lkml.kernel.org/r/20210701133458epcms1p68e9eb9bd0eee8903ba26679a37d9d960@epcms1p6
    Signed-off-by: Ohhoon Kwon
    Cc: Ingo Molnar
    Cc: David S. Miller
    Cc: Christian Brauner
    Cc: Eric W. Biederman
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ohhoon Kwon
     
  • Use seq_escape_str and seq_printf instead of poking holes into the
    seq_file abstraction.

    Link: https://lkml.kernel.org/r/20210810151945.1795567-1-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Acked-by: Christian Brauner
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

04 Sep, 2021

1 commit

  • All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be
    set from user space, so all users are gone; let's remove it.

    Acked-by: "Eric W. Biederman"
    Acked-by: Christian König
    Signed-off-by: David Hildenbrand

    David Hildenbrand
     

04 Jul, 2021

2 commits

  • Pull vfs name lookup updates from Al Viro:
    "Small namei.c patch series, mostly to simplify the rules for nameidata
    state. It's actually from the previous cycle - but I didn't post it
    for review in time...

    Changes visible outside of fs/namei.c: file_open_root() calling
    conventions change, some freed bits in LOOKUP_... space"

    * 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    namei: make sure nd->depth is always valid
    teach set_nameidata() to handle setting the root as well
    take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space
    switch file_open_root() to struct path

    Linus Torvalds
     
  • Pull tracing updates from Steven Rostedt:

    - Added option for per CPU threads to the hwlat tracer

    - Have hwlat tracer handle hotplug CPUs

    - New tracer: osnoise, that detects latency caused by interrupts,
    softirqs and scheduling of other tasks.

    - Added timerlat tracer that creates a thread and measures in detail
    what sources of latency it has for wake ups.

    - Removed the "success" field of the sched_wakeup trace event. This has
    been hardcoded as "1" since 2015, no tooling should be looking at it
    now. If one exists, we can revert this commit, fix that tool and try
    to remove it again in the future.

    - tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.

    - New boot command line option "tp_printk_stop", as tp_printk causes
    trace events to write to console. When user space starts, this can
    easily live lock the system. Having a boot option to stop just after
    boot up is useful to prevent that from happening.

    - Have ftrace_dump_on_oops boot command line option take numbers that
    match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.

    - Bootconfig clean ups, fixes and enhancements.

    - New ktest script that tests bootconfig options.

    - Add tracepoint_probe_register_may_exist() to register a tracepoint
    without triggering a WARN*() if it already exists. BPF has a path
    from user space that can do this. All other paths are considered a
    bug.

    - Small clean ups and fixes

    * tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (49 commits)
    tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
    tracing: Simplify & fix saved_tgids logic
    treewide: Add missing semicolons to __assign_str uses
    tracing: Change variable type as bool for clean-up
    trace/timerlat: Fix indentation on timerlat_main()
    trace/osnoise: Make 'noise' variable s64 in run_osnoise()
    tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
    tracing: Fix spelling in osnoise tracer "interferences" -> "interference"
    Documentation: Fix a typo on trace/osnoise-tracer
    trace/osnoise: Fix return value on osnoise_init_hotplug_support
    trace/osnoise: Make interval u64 on osnoise_main
    trace/osnoise: Fix 'no previous prototype' warnings
    tracing: Have osnoise_main() add a quiescent state for task rcu
    seq_buf: Make trace_seq_putmem_hex() support data longer than 8
    seq_buf: Fix overflow in seq_buf_putmem_hex()
    trace/osnoise: Support hotplug operations
    trace/hwlat: Support hotplug operations
    trace/hwlat: Protect kdata->kthread with get/put_online_cpus
    trace: Add timerlat tracer
    trace: Add osnoise tracer
    ...

    Linus Torvalds
     

03 Jul, 2021

1 commit

  • Merge more updates from Andrew Morton:
    "190 patches.

    Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
    vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
    migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
    zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
    core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
    signals, exec, kcov, selftests, compress/decompress, and ipc"

    * emailed patches from Andrew Morton: (190 commits)
    ipc/util.c: use binary search for max_idx
    ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
    ipc: use kmalloc for msg_queue and shmid_kernel
    ipc sem: use kvmalloc for sem_undo allocation
    lib/decompressors: remove set but not used variabled 'level'
    selftests/vm/pkeys: exercise x86 XSAVE init state
    selftests/vm/pkeys: refill shadow register after implicit kernel write
    selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
    kcov: add __no_sanitize_coverage to fix noinstr for all architectures
    exec: remove checks in __register_bimfmt()
    x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
    hfsplus: report create_date to kstat.btime
    hfsplus: remove unnecessary oom message
    nilfs2: remove redundant continue statement in a while-loop
    kprobes: remove duplicated strong free_insn_page in x86 and s390
    init: print out unknown kernel parameters
    checkpatch: do not complain about positive return values starting with EPOLL
    checkpatch: improve the indented label test
    checkpatch: scripts/spdxcheck.py now requires python3
    ...

    Linus Torvalds
     

02 Jul, 2021

4 commits

  • Add 'ino' field to /proc/<pid>/fdinfo/<FD> and
    /proc/<pid>/task/<tid>/fdinfo/<FD>.

    The inode numbers can be used to uniquely identify DMA buffers in user
    space and avoid a dependency on /proc/<pid>/fd/* when accounting
    per-process DMA buffer sizes.

    Link: https://lkml.kernel.org/r/20210308170651.919148-2-kaleshsingh@google.com
    Signed-off-by: Kalesh Singh
    Acked-by: Randy Dunlap
    Acked-by: Christian König
    Cc: Jann Horn
    Cc: Jeff Vander Stoep
    Cc: Kees Cook
    Cc: Suren Baghdasaryan
    Cc: Minchan Kim
    Cc: Hridya Valsaraju
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: Kalesh Singh
    Cc: Alexey Dobriyan
    Cc: Jonathan Corbet
    Cc: Mauro Carvalho Chehab
    Cc: Michal Hocko
    Cc: Alexey Gladkov
    Cc: Szabolcs Nagy
    Cc: Eric W. Biederman
    Cc: Christian Brauner
    Cc: Michel Lespinasse
    Cc: Bernd Edlinger
    Cc: Andrei Vagin
    Cc: Helge Deller
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kalesh Singh
     
  • Android captures per-process system memory state when certain low memory
    events (e.g. a foreground app kill) occur, to identify potential memory
    hoggers. In order to measure how much memory a process actually consumes,
    it is necessary to include the DMA buffer sizes for that process in the
    memory accounting. Since the handles to DMA buffers are raw FDs, it is
    important to be able to identify which processes have FD references to a
    DMA buffer.

    Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
    /proc/<pid>/fdinfo -- both are only readable by the process owner, as
    follows:

    1. Do a readlink on each FD.
    2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
    3. stat the file to get the dmabuf inode number.
    4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
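
    The four steps above can be sketched in Python as follows. This is an
    illustration under assumptions, not the Android implementation:
    parse_fdinfo() and dmabuf_fds() are hypothetical helpers, and the
    "size" key is assumed to be the field a dmabuf reports in its fdinfo.

```python
import os


def parse_fdinfo(text):
    """Turn 'key:<tab>value' lines of an fdinfo file into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def dmabuf_fds(pid):
    """Yield (fd, inode, size) for each dmabuf FD of pid, if permitted."""
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        fd_path = os.path.join(fd_dir, name)
        try:
            target = os.readlink(fd_path)             # step 1
            if not target.startswith("/dmabuf"):      # step 2
                continue
            inode = os.stat(fd_path).st_ino           # step 3
            with open(f"/proc/{pid}/fdinfo/{name}") as f:
                info = parse_fdinfo(f.read())         # step 4
        except OSError:
            continue  # FD closed in the meantime, or access denied
        yield int(name), inode, int(info.get("size", 0))
```

    Summing the sizes per distinct inode, rather than per FD, avoids
    counting a buffer twice when a process holds several FDs to it.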

    Accessing other processes' fdinfo requires root privileges. This limits
    the use of the interface to debugging environments and is not suitable for
    production builds. Granting root privileges even to a system process
    increases the attack surface and is highly undesirable.

    Since fdinfo doesn't permit reading process memory and manipulating
    process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCRED.

    Link: https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com
    Signed-off-by: Kalesh Singh
    Suggested-by: Jann Horn
    Acked-by: Christian König
    Cc: Alexander Viro
    Cc: Alexey Dobriyan
    Cc: Alexey Gladkov
    Cc: Andrei Vagin
    Cc: Bernd Edlinger
    Cc: Christian Brauner
    Cc: Eric W. Biederman
    Cc: Helge Deller
    Cc: Hridya Valsaraju
    Cc: James Morris
    Cc: Jeff Vander Stoep
    Cc: Jonathan Corbet
    Cc: Kees Cook
    Cc: Matthew Wilcox
    Cc: Mauro Carvalho Chehab
    Cc: Michal Hocko
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Randy Dunlap
    Cc: Suren Baghdasaryan
    Cc: Szabolcs Nagy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kalesh Singh
     
  • Use size_t when capping the count argument received by mem_rw(). Since
    count is size_t, using min_t(int, ...) can lead to a negative value
    that will later be passed to access_remote_vm(), which can cause
    unexpected behavior.

    Since we are capping the value to at maximum PAGE_SIZE, the conversion
    from size_t to int when passing it to access_remote_vm() as "len"
    shouldn't be a problem.
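
    The truncation can be reproduced outside the kernel with ctypes, which
    wraps values to the width of the C type. This is a sketch assuming a
    32-bit int and a PAGE_SIZE of 4096; min_t_int() and min_t_size_t() are
    hypothetical helpers mimicking min_t(int, ...) and min_t(size_t, ...).

```python
import ctypes

PAGE_SIZE = 4096  # assumed page size for the illustration


def min_t_int(a, b):
    """Mimic min_t(int, a, b): cast both operands to int, then compare."""
    return min(ctypes.c_int(a).value, ctypes.c_int(b).value)


def min_t_size_t(a, b):
    """Mimic min_t(size_t, a, b): cast both operands to size_t, then compare."""
    return min(ctypes.c_size_t(a).value, ctypes.c_size_t(b).value)
```

    For a count of 0x80000000, min_t_int(PAGE_SIZE, count) yields a
    negative length, while min_t_size_t(PAGE_SIZE, count) caps it at
    PAGE_SIZE as intended.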

    Link: https://lkml.kernel.org/r/20210512125215.3348316-1-marcelo.cerri@canonical.com
    Reviewed-by: David Disseldorp
    Signed-off-by: Thadeu Lima de Souza Cascardo
    Signed-off-by: Marcelo Henrique Cerri
    Cc: Alexey Dobriyan
    Cc: Souza Cascardo
    Cc: Christian Brauner
    Cc: Michel Lespinasse
    Cc: Helge Deller
    Cc: Oleg Nesterov
    Cc: Lorenzo Stoakes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcelo Henrique Cerri
     
  • Patch series "Add support for SVM atomics in Nouveau", v11.

    Introduction
    ============

    Some devices have features such as atomic PTE bits that can be used to
    implement atomic access to system memory. To support atomic operations to
    a shared virtual memory page such a device needs access to that page which
    is exclusive of the CPU. This series introduces a mechanism to
    temporarily unmap pages granting exclusive access to a device.

    These changes are required to support OpenCL atomic operations in Nouveau
    to shared virtual memory (SVM) regions allocated with the
    CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
    OpenCL SVM feature is available at
    https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_shared_virtual_memory .

    Implementation
    ==============

    Exclusive device access is implemented by adding a new swap entry type
    (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
    difference is that on fault the original entry is immediately restored by
    the fault handler instead of waiting.

    Restoring the entry triggers calls to MMU notifiers which allows a device
    driver to revoke the atomic access permission from the GPU prior to the
    CPU finalising the entry.

    Patches
    =======

    Patches 1 & 2 refactor existing migration and device private entry
    functions.

    Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
    functionality into separate functions - try_to_migrate_one() and
    try_to_munlock_one().

    Patch 5 renames some existing code but does not introduce functionality.

    Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

    Patch 7 contains the bulk of the implementation for device exclusive
    memory.

    Patch 8 contains some additions to the HMM selftests to ensure everything
    works as expected.

    Patch 9 is a cleanup for the Nouveau SVM implementation.

    Patch 10 contains the implementation of atomic access for the Nouveau
    driver.

    Testing
    =======

    This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
    which checks that GPU atomic accesses to system memory are atomic.
    Without this series the test fails as there is no way of write-protecting
    the page mapping which results in the device clobbering CPU writes. For
    reference the test is available at
    https://ozlabs.org/~apopple/opencl_svm_atomics/

    Further testing has been performed by adding support for testing exclusive
    access to the hmm-tests kselftests.

    This patch (of 10):

    Remove multiple similar inline functions for dealing with different types
    of special swap entries.

    Both migration and device private swap entries use the swap offset to
    store a pfn. Instead of multiple inline functions to obtain a struct page
    for each swap entry type use a common function pfn_swap_entry_to_page().
    Also open-code the various entry_to_pfn() functions as this results in
    shorter code that is easier to understand.
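
    As a rough userspace illustration of the idea (not the kernel code), the
    sketch below packs a type and an offset into one value and uses a single
    helper for every entry type whose offset holds a pfn. All model_* names,
    the bit layout and the tiny memmap array are assumptions made for this
    example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: top 6 bits hold the type, the rest the offset. */
#define MODEL_TYPE_SHIFT   58u
#define MODEL_OFFSET_MASK  ((UINT64_C(1) << MODEL_TYPE_SHIFT) - 1)
#define MODEL_MEMMAP_PAGES 16u

enum model_swap_type {
	MODEL_PLAIN_SWAP = 0,     /* offset is a slot in a swap device */
	MODEL_MIGRATION,          /* offset is a pfn */
	MODEL_DEVICE_PRIVATE      /* offset is a pfn */
};

struct model_page { int unused; };

typedef struct { uint64_t val; } model_swp_entry_t;

/* Stand-in for the kernel's memmap: one struct per page frame. */
static struct model_page model_memmap[MODEL_MEMMAP_PAGES];

static model_swp_entry_t model_make_entry(enum model_swap_type type,
					  uint64_t offset)
{
	model_swp_entry_t e = {
		((uint64_t)type << MODEL_TYPE_SHIFT) |
		(offset & MODEL_OFFSET_MASK)
	};
	return e;
}

static enum model_swap_type model_entry_type(model_swp_entry_t e)
{
	return (enum model_swap_type)(e.val >> MODEL_TYPE_SHIFT);
}

static uint64_t model_entry_offset(model_swp_entry_t e)
{
	return e.val & MODEL_OFFSET_MASK;
}

/*
 * One common helper instead of one per entry type: any entry whose
 * offset stores a pfn is translated the same way.
 */
static struct model_page *model_pfn_entry_to_page(model_swp_entry_t e)
{
	uint64_t pfn = model_entry_offset(e);

	if (model_entry_type(e) == MODEL_PLAIN_SWAP)
		return NULL;	/* offset is not a pfn for this type */
	if (pfn >= MODEL_MEMMAP_PAGES)
		return NULL;	/* outside the toy memmap */
	return &model_memmap[pfn];
}
```

    Callers that only need the pfn can read the offset directly, which is
    what open-coding the entry_to_pfn() helpers amounts to in this model.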

    Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
    Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Hugh Dickins
    Cc: Peter Xu
    Cc: Shakeel Butt
    Cc: Ben Skeggs
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     

01 Jul, 2021

6 commits

  • Let's properly synchronize with drivers that set PageOffline().
    Unfreeze/thaw every now and then, so drivers that want to set
    PageOffline() can make progress.

    Link: https://lkml.kernel.org/r/20210526093041.8800-7-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Mike Rapoport
    Reviewed-by: Oscar Salvador
    Cc: Aili Yao
    Cc: Alexey Dobriyan
    Cc: Alex Shi
    Cc: Haiyang Zhang
    Cc: Jason Wang
    Cc: Jiri Bohac
    Cc: "K. Y. Srinivasan"
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Michael S. Tsirkin"
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Cc: Stephen Hemminger
    Cc: Steven Price
    Cc: Wei Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's avoid reading:

    1) Offline memory sections: the content of offline memory sections is
    stale as the memory is effectively unused by the kernel. On s390x with
    standby memory, offline memory sections (belonging to offline storage
    increments) are not accessible. With virtio-mem and the hyper-v
    balloon, we can have unavailable memory chunks that should not be
    accessed inside offline memory sections. Last but not least, offline
    memory sections might contain hwpoisoned pages which we can no longer
    identify because the memmap is stale.

    2) PG_offline pages: logically offline pages that are documented as
    "The content of these pages is effectively stale. Such pages should
    not be touched (read/write/dump/save) except by their owner.".
    Examples include pages inflated in a balloon or unavailable memory
    ranges inside hotplugged memory sections with virtio-mem or the hyper-v
    balloon.

    3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
    As documented: "Accessing is not safe since it may cause another
    machine check. Don't touch!"

    Introduce is_page_hwpoison(), adding a comment that it is inherently racy
    but the best we can really do.

    Reading /proc/kcore now performs similar checks as when reading
    /proc/vmcore for kdump via makedumpfile: problematic pages are excluded.
    It's also similar to hibernation code, however, we don't skip hwpoisoned
    pages when processing pages in kernel/power/snapshot.c:saveable_page()
    yet.
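
    The three cases above boil down to one "may this page be read?" check.
    The following is a minimal userspace sketch of such a predicate, not the
    kernel implementation; struct model_page, the flag values and the
    section_online field are hypothetical stand-ins for the real memmap
    state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits standing in for PG_offline and PG_hwpoison. */
#define MODEL_PG_OFFLINE  0x1u
#define MODEL_PG_HWPOISON 0x2u

struct model_page {
	unsigned int flags;
	bool section_online;	/* does the page sit in an online section? */
};

/*
 * Returns true when the page content may be copied to the reader.
 * A NULL page models a pfn without a usable memmap entry.
 */
static bool model_page_readable(const struct model_page *page)
{
	if (!page || !page->section_online)
		return false;	/* case 1: offline memory section */
	if (page->flags & MODEL_PG_OFFLINE)
		return false;	/* case 2: logically offline page */
	if (page->flags & MODEL_PG_HWPOISON)
		return false;	/* case 3: hwpoisoned page */
	return true;
}
```

    In this model a reader would hand back zeroes for any page where the
    predicate is false. The check is racy by nature, since the flags can
    change right after they were sampled.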

    Note 1: we can race against memory offlining code, especially memory going
    offline and getting unplugged: however, we will properly tear down the
    identity mapping and handle faults gracefully when accessing this memory
    from kcore code.

    Note 2: we can race against drivers setting PageOffline() and turning
    memory inaccessible in the hypervisor. We'll handle this in a follow-up
    patch.

    Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Mike Rapoport
    Reviewed-by: Oscar Salvador
    Cc: Aili Yao
    Cc: Alexey Dobriyan
    Cc: Alex Shi
    Cc: Haiyang Zhang
    Cc: Jason Wang
    Cc: Jiri Bohac
    Cc: "K. Y. Srinivasan"
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Michael S. Tsirkin"
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Cc: Stephen Hemminger
    Cc: Steven Price
    Cc: Wei Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's restructure the code, using switch-case, and checking pfn_is_ram()
    only when we are dealing with KCORE_RAM.
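
    A simplified sketch of that control flow, under the assumption that
    non-RAM pfns are zero-filled: the enum values, model_pfn_is_ram() and
    its 1024-page limit are made up for illustration and do not mirror the
    real kcore types one to one.

```c
#include <stdbool.h>

enum model_kcore_type {
	MODEL_KCORE_TEXT,
	MODEL_KCORE_VMALLOC,
	MODEL_KCORE_RAM,
	MODEL_KCORE_USER
};

enum model_action {
	MODEL_COPY,		/* copy the memory to the reader */
	MODEL_ZERO_FILL		/* hand back zeroes instead */
};

/* Counts how often the RAM check ran, to show it is RAM-only. */
static int model_ram_checks;

/* Hypothetical stand-in for pfn_is_ram(): first 1024 pfns are RAM. */
static bool model_pfn_is_ram(unsigned long pfn)
{
	model_ram_checks++;
	return pfn < 1024;
}

static enum model_action model_read_action(enum model_kcore_type type,
					   unsigned long pfn)
{
	switch (type) {
	case MODEL_KCORE_RAM:
		/* Only RAM ranges need the per-pfn check. */
		if (!model_pfn_is_ram(pfn))
			return MODEL_ZERO_FILL;
		return MODEL_COPY;
	case MODEL_KCORE_TEXT:
	case MODEL_KCORE_VMALLOC:
	case MODEL_KCORE_USER:
		return MODEL_COPY;
	default:
		/* Unknown type: do not touch the memory. */
		return MODEL_ZERO_FILL;
	}
}
```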

    Link: https://lkml.kernel.org/r/20210526093041.8800-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Mike Rapoport
    Cc: Aili Yao
    Cc: Alexey Dobriyan
    Cc: Alex Shi
    Cc: Haiyang Zhang
    Cc: Jason Wang
    Cc: Jiri Bohac
    Cc: "K. Y. Srinivasan"
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Michael S. Tsirkin"
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Roman Gushchin
    Cc: Stephen Hemminger
    Cc: Steven Price
    Cc: Wei Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages", v3.

    Looking for places where the kernel might unconditionally read
    PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore
    needs some more love to not touch some other pages we really don't want to
    read -- i.e., hwpoisoned ones.

    Examples for PageOffline() pages are pages inflated in a balloon, memory
    unplugged via virtio-mem, and partially-present sections in memory added
    by the Hyper-V balloon.

    When reading pages inflated in a balloon, we essentially produce
    unnecessary load in the hypervisor; holes in partially present sections in
    case of Hyper-V are not accessible and already were a problem for
    /proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages. In
    the future, virtio-mem might disallow reading unplugged memory -- marked
    as PageOffline() -- in some environments, resulting in undefined behavior
    when accessed; therefore, I'm trying to identify and rework all these
    (corner) cases.

    With this series, there is really only access via /dev/mem, /proc/vmcore
    and kdb left after I ripped out /dev/kmem. kdb is an advanced corner-case
    use case -- we won't care for now if someone explicitly tries to do nasty
    things by reading from/writing to physical addresses we better not touch.
    /dev/mem is a use case we won't support for virtio-mem, at least for now,
    so we'll simply disallow mapping any virtio-mem memory via /dev/mem next.
    /proc/vmcore is really only a problem when dumping the old kernel via
    something that's not makedumpfile (read: basically never), however, we'll
    try sanitizing that as well in the second kernel in the future.

    Tested via kcore_dump:
    https://github.com/schlafwandler/kcore_dump

    This patch (of 6):

    Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()")
    removed the last user of KCORE_REMAP.

    Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
    dumping vsyscall user page") removed the last user of KCORE_OTHER.

    Let's drop both types. While at it, also drop vaddr in "struct
    kcore_list", used by KCORE_REMAP only.

    Link: https://lkml.kernel.org/r/20210526093041.8800-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20210526093041.8800-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Mike Rapoport
    Cc: "Michael S. Tsirkin"
    Cc: Jason Wang
    Cc: Alexey Dobriyan
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Alex Shi
    Cc: Steven Price
    Cc: Mike Kravetz
    Cc: Aili Yao
    Cc: Jiri Bohac
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Export the PTE/PMD status of uffd-wp to pagemap too.

    Link: https://lkml.kernel.org/r/20210428225030.9708-6-peterx@redhat.com
    Signed-off-by: Peter Xu
    Cc: Alexander Viro
    Cc: Andrea Arcangeli
    Cc: Axel Rasmussen
    Cc: Brian Geffon
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Joe Perches
    Cc: Kirill A. Shutemov
    Cc: Lokesh Gidra
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Mina Almasry
    Cc: Oliver Upton
    Cc: Shaohua Li
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Cc: Wang Qing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Since commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for
    (non-shmem) FS"), read-only THP file mapping is supported. But it forgot
    to add checking for it in transparent_hugepage_enabled(). To fix it, we
    add checking for read-only THP file mapping and also introduce helper
    transhuge_vma_enabled() to check whether thp is enabled for specified vma
    to reduce duplicated code. We rename transparent_hugepage_enabled to
    transparent_hugepage_active to make the code easier to follow as suggested
    by David Hildenbrand.

    [linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
    Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com

    Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: Miaohe Lin
    Reviewed-by: Yang Shi
    Cc: Alexey Dobriyan
    Cc: "Aneesh Kumar K . V"
    Cc: Anshuman Khandual
    Cc: David Hildenbrand
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Ralph Campbell
    Cc: Rik van Riel
    Cc: Song Liu
    Cc: William Kucharski
    Cc: Zi Yan
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

30 Jun, 2021

2 commits

  • Merge misc updates from Andrew Morton:
    "191 patches.

    Subsystems affected by this patch series: kthread, ia64, scripts,
    ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
    slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
    mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
    pagealloc, and memory-failure)"

    * emailed patches from Andrew Morton: (191 commits)
    mm,hwpoison: make get_hwpoison_page() call get_any_page()
    mm,hwpoison: send SIGBUS with error virutal address
    mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
    mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
    mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
    mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
    docs: remove description of DISCONTIGMEM
    arch, mm: remove stale mentions of DISCONIGMEM
    mm: remove CONFIG_DISCONTIGMEM
    m68k: remove support for DISCONTIGMEM
    arc: remove support for DISCONTIGMEM
    arc: update comment about HIGHMEM implementation
    alpha: remove DISCONTIGMEM and NUMA
    mm/page_alloc: move free_the_page
    mm/page_alloc: fix counting of managed_pages
    mm/page_alloc: improve memmap_pages dbg msg
    mm: drop SECTION_SHIFT in code comments
    mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
    mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
    mm/page_alloc: scale the number of pages that are batch freed
    ...

    Linus Torvalds
     
  • has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop
    cleanup.

    Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
    would reintroduce a loss of SMP scalability to pin-fast, so there's no
    future potential usefulness to keep an atomic in the mm for this.

    set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE
    (atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
    atomic_set after this commit) has to be still issued only once per "mm",
    so the difference between the two will be lost in the noise.
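
    A userspace sketch of the "set the bit only once per mm" pattern, using
    C11 atomics in place of the kernel's set_bit()/test_bit(). struct
    model_mm, the bit value and the helper names are assumptions for this
    example only.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit position standing in for MMF_HAS_PINNED. */
#define MODEL_MMF_HAS_PINNED 0x1ul

struct model_mm {
	atomic_ulong flags;	/* shared by all threads of the process */
};

static bool model_has_pinned(struct model_mm *mm)
{
	return atomic_load_explicit(&mm->flags, memory_order_relaxed) &
	       MODEL_MMF_HAS_PINNED;
}

/*
 * Test before setting: after the first pin, later callers only read
 * the flags word, so the shared cacheline is not written again.
 */
static void model_set_has_pinned(struct model_mm *mm)
{
	if (!model_has_pinned(mm))
		atomic_fetch_or(&mm->flags, MODEL_MMF_HAS_PINNED);
}
```

    The atomic read-modify-write is paid at most a handful of times per mm
    (once, barring a race between first pinners), which matches the
    argument that its extra cost is lost in the noise.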

    will-it-scale "mmap2" shows no change in performance with enterprise
    config as expected.

    will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
    improvement against upstream as expected.

    This is a noop as far as overall performance and SMP scalability are
    concerned.

    [peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED]
    Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
    [akpm@linux-foundation.org: coding style fixes]
    [peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]

    Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Reviewed-by: John Hubbard
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Jann Horn
    Cc: Jason Gunthorpe
    Cc: Kirill Shutemov
    Cc: Kirill Tkhai
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

29 Jun, 2021

2 commits

  • Pull user namespace rlimit handling update from Eric Biederman:
    "This is the work mainly by Alexey Gladkov to limit rlimits to the
    rlimits of the user that created a user namespace, and to allow users
    to have stricter limits on the resources created within a user
    namespace."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    cred: add missing return error code when set_cred_ucounts() failed
    ucounts: Silence warning in dec_rlimit_ucounts
    ucounts: Set ucount_max to the largest positive value the type can hold
    kselftests: Add test to check for rlimit changes in different user namespaces
    Reimplement RLIMIT_MEMLOCK on top of ucounts
    Reimplement RLIMIT_SIGPENDING on top of ucounts
    Reimplement RLIMIT_MSGQUEUE on top of ucounts
    Reimplement RLIMIT_NPROC on top of ucounts
    Use atomic_t for ucounts reference counting
    Add a reference to ucounts for each cred
    Increase size of ucounts to atomic_long_t

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - Changes to core scheduling facilities:

    - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
    coordinated scheduling across SMT siblings. This is a much
    requested feature for cloud computing platforms, to allow the
    flexible utilization of SMT siblings, without exposing untrusted
    domains to information leaks & side channels, plus to ensure more
    deterministic computing performance on SMT systems used by
    heterogeneous workloads.

    There are new prctls to set core scheduling groups, which allows
    more flexible management of workloads that can share siblings.

    - Fix task->state access anti-patterns that may result in missed
    wakeups and rename it to ->__state in the process to catch new
    abuses.

    - Load-balancing changes:

    - Tweak newidle_balance for fair-sched, to improve 'memcache'-like
    workloads.

    - "Age" (decay) average idle time, to better track & improve
    workloads such as 'tbench'.

    - Fix & improve energy-aware (EAS) balancing logic & metrics.

    - Fix & improve the uclamp metrics.

    - Fix task migration (taskset) corner case on !CONFIG_CPUSET.

    - Fix RT and deadline utilization tracking across policy changes

    - Introduce a "burstable" CFS controller via cgroups, which allows
    bursty CPU-bound workloads to borrow a bit against their future
    quota to improve overall latencies & batching. Can be tweaked via
    /sys/fs/cgroup/cpu//cpu.cfs_burst_us.

    - Rework asymmetric topology/capacity detection & handling.

    - Scheduler statistics & tooling:

    - Disable delayacct by default, but add a sysctl to enable it at
    runtime if tooling needs it. Use static keys and other
    optimizations to make it more palatable.

    - Use sched_clock() in delayacct, instead of ktime_get_ns().

    - Misc cleanups and fixes.

    * tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/doc: Update the CPU capacity asymmetry bits
    sched/topology: Rework CPU capacity asymmetry detection
    sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
    psi: Fix race between psi_trigger_create/destroy
    sched/fair: Introduce the burstable CFS controller
    sched/uclamp: Fix uclamp_tg_restrict()
    sched/rt: Fix Deadline utilization tracking during policy change
    sched/rt: Fix RT utilization tracking during policy change
    sched: Change task_struct::state
    sched,arch: Remove unused TASK_STATE offsets
    sched,timer: Use __set_current_state()
    sched: Add get_current_state()
    sched,perf,kvm: Fix preemption condition
    sched: Introduce task_is_running()
    sched: Unbreak wakeups
    sched/fair: Age the average idle time
    sched/cpufreq: Consider reduced CPU capacity in energy calculation
    sched/fair: Take thermal pressure into account while estimating energy
    thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
    sched/fair: Return early from update_tg_cfs_load() if delta == 0
    ...

    Linus Torvalds
     

18 Jun, 2021

1 commit

  • This commit in sched/urgent moved the cfs_rq_is_decayed() function:

    a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")

    and this fresh commit in sched/core modified it in the old location:

    9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are")

    Merge the two variants.

    Conflicts:
    kernel/sched/fair.c

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

16 Jun, 2021

1 commit

  • In commit 591a22c14d3f ("proc: Track /proc/$pid/attr/ opener mm_struct") we
    started using __mem_open() to track the mm_struct at open-time, so that
    we could then check it for writes.

    But that also ended up making the permission checks at open time much
    stricter - and not just for writes, but for reads too. And that in turn
    caused a regression for at least Fedora 29, where NIC interfaces fail to
    start when using NetworkManager.

    Since only the write side wanted the mm_struct test, ignore any failures
    by __mem_open() at open time, leaving reads unaffected. The write()
    time verification of the mm_struct pointer will then catch the failure
    case because a NULL pointer will not match a valid 'current->mm'.
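
    The resulting behaviour can be modelled in userspace as below. This is
    only a sketch: struct model_file, model_mem_open() and the "allowed"
    parameter are hypothetical, and the -EPERM return value on a mismatch
    is an assumption of the model.

```c
#include <errno.h>
#include <stddef.h>

struct model_mm { int unused; };

struct model_file {
	struct model_mm *opener_mm;	/* mm recorded at open time */
};

/* Stand-in for __mem_open(): NULL when the access check fails. */
static struct model_mm *model_mem_open(struct model_mm *target, int allowed)
{
	return allowed ? target : NULL;
}

/* Open never fails because of the mm lookup; a failure is just recorded. */
static int model_attr_open(struct model_file *f, struct model_mm *target,
			   int allowed)
{
	f->opener_mm = model_mem_open(target, allowed);
	return 0;
}

/* Reads do not look at the recorded mm at all. */
static int model_attr_read(const struct model_file *f)
{
	(void)f;
	return 0;
}

/* Writes require the recorded mm to match the writer's mm. */
static int model_attr_write(const struct model_file *f,
			    const struct model_mm *current_mm)
{
	if (f->opener_mm != current_mm)
		return -EPERM;	/* also hit when opener_mm is NULL */
	return 0;
}
```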

    Link: https://lore.kernel.org/netdev/YMjTlp2FSJYvoyFa@unreal/
    Fixes: 591a22c14d3f ("proc: Track /proc/$pid/attr/ opener mm_struct")
    Reported-and-tested-by: Leon Romanovsky
    Cc: Kees Cook
    Cc: Christian Brauner
    Cc: Andrea Righi
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Jun, 2021

1 commit

  • It is not possible to put an array value with subkeys under
    a key node, because both of subkeys and the array elements
    are using "next" field of the xbc_node.

    Thus this changes the array values to use "child" field in
    the array case. This change is split out so that it can be
    tested easily.
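
    To picture why the two links must differ, here is a toy pointer-based
    model (the real xbc_node layout is different, and every name below is
    hypothetical). It shows one possible arrangement in which array
    elements are chained through "child", leaving "next" free for
    siblings such as subkeys.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_node {
	struct model_node *next;	/* next sibling under the same key */
	struct model_node *child;	/* first child, or next array element */
	const char *data;
	bool is_value;
};

/* Array elements hang off each other through ->child. */
static int model_array_len(const struct model_node *first_value)
{
	int n = 0;

	for (; first_value; first_value = first_value->child)
		n++;
	return n;
}

/* Subkeys are found by walking ->next among the key's children. */
static int model_count_subkeys(const struct model_node *key)
{
	const struct model_node *c;
	int n = 0;

	for (c = key->child; c; c = c->next)
		if (!c->is_value)
			n++;
	return n;
}
```

    With both chains on "next", the second array element and the first
    subkey would compete for the same link of the first value node.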

    Link: https://lkml.kernel.org/r/162262193838.264090.16044473274501498656.stgit@devnote2

    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     

09 Jun, 2021

1 commit

  • Commit bfb819ea20ce ("proc: Check /proc/$pid/attr/ writes against file opener")
    tried to make sure that there could not be a confusion between the opener of
    a /proc/$pid/attr/ file and the writer. It used struct cred to make sure
    the privileges didn't change. However, there were existing cases where a more
    privileged thread was passing the opened fd to a differently privileged thread
    (during container setup). Instead, use mm_struct to track whether the opener
    and writer are still the same process. (This is what several other proc files
    already do, though for different reasons.)

    Reported-by: Christian Brauner
    Reported-by: Andrea Righi
    Tested-by: Andrea Righi
    Fixes: bfb819ea20ce ("proc: Check /proc/$pid/attr/ writes against file opener")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

04 Jun, 2021

1 commit


26 May, 2021

1 commit

  • Fix another "confused deputy" weakness[1]. Writes to /proc/$pid/attr/
    files need to check the opener credentials, since these fds do not
    transition state across execve(). Without this, it is possible to
    trick another process (which may have different credentials) to write
    to its own /proc/$pid/attr/ files, leading to unexpected and possibly
    exploitable behaviors.

    [1] https://www.kernel.org/doc/html/latest/security/credentials.html?highlight=confused#open-file-credentials

    Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

13 May, 2021

2 commits

  • Creating 2**32 tasks to wait in D-state is impossible and wasteful.

    Return "unsigned int" and save on REX prefixes.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20210422200228.1423391-2-adobriyan@gmail.com

    Alexey Dobriyan
     
  • Creating 2**32 tasks is impossible due to futex pid limits and wasteful
    anyway. Nobody has done it.

    Bring nr_running() into 32-bit world to save on REX prefixes.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20210422200228.1423391-1-adobriyan@gmail.com

    Alexey Dobriyan