01 Oct, 2020

2 commits

  • [ Upstream commit 76518d3798855242817e8a8ed76b2d72f4415624 ]

    This changes do_io_accounting to use the new exec_update_mutex
    instead of cred_guard_mutex.

    This fixes possible deadlocks when the tracer is accessing
    /proc/$pid/io for instance.

    This should be safe, as the credentials are only used for reading.

    Signed-off-by: Bernd Edlinger
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Bernd Edlinger
     
  • [ Upstream commit 2db9dbf71bf98d02a0bf33e798e5bfd2a9944696 ]

    This changes lock_trace to use the new exec_update_mutex
    instead of cred_guard_mutex.

    This fixes possible deadlocks when the tracer is accessing
    /proc/$pid/stack for instance.

    This should be safe, as the credentials are only used for reading,
    and task->mm is updated on execve under the new exec_update_mutex.

    Signed-off-by: Bernd Edlinger
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Bernd Edlinger
     

17 Jun, 2020

1 commit

  • commit ef1548adada51a2f32ed7faef50aa465e1b4c5da upstream.

    Recently syzbot reported that unmounting proc when there is an ongoing
    inotify watch on the root directory of proc could result in a use
    after free when the watch is removed after the unmount of proc
    when the watcher exits.

    Commit 69879c01a0c3 ("proc: Remove the now unnecessary internal mount
    of proc") made it easier to unmount proc and allowed syzbot to see the
    problem, but looking at the code it has been around for a long time.

    Looking at the code the fsnotify watch should have been removed by
    fsnotify_sb_delete in generic_shutdown_super. Unfortunately the inode
    was allocated with new_inode_pseudo instead of new_inode so the inode
    was not on the sb->s_inodes list. Which prevented
    fsnotify_unmount_inodes from finding the inode and removing the watch
    as well as made it so the "VFS: Busy inodes after unmount" warning
    could not find the inodes to warn about them.

    Make all of the inodes in proc visible to generic_shutdown_super,
    and fsnotify_sb_delete by using new_inode instead of new_inode_pseudo.
    The only functional difference is that new_inode places the inodes
    on the sb->s_inodes list.
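
    As an illustration only, the difference between the two allocators can
    be modelled in a few lines of user-space Python. The classes and the
    watch_survives_unmount() helper are hypothetical stand-ins, not kernel
    interfaces; the only behaviour taken from the description above is that
    new_inode links the inode into sb->s_inodes and new_inode_pseudo does not.

```python
class Inode:
    def __init__(self, name):
        self.name = name
        self.watches = []          # stand-in for fsnotify marks on the inode


class SuperBlock:
    def __init__(self):
        self.s_inodes = []         # inodes visible to shutdown-time walkers

    def new_inode_pseudo(self, name):
        # Allocate an inode without linking it into s_inodes.
        return Inode(name)

    def new_inode(self, name):
        # Same allocation, plus linking into s_inodes.
        inode = self.new_inode_pseudo(name)
        self.s_inodes.append(inode)
        return inode

    def shutdown(self):
        # Model of the unmount-time walk: only listed inodes are cleaned up.
        for inode in self.s_inodes:
            inode.watches.clear()


def watch_survives_unmount(use_pseudo):
    """Return True if a watch is still attached after the superblock shuts down."""
    sb = SuperBlock()
    alloc = sb.new_inode_pseudo if use_pseudo else sb.new_inode
    root = alloc("/proc")
    root.watches.append("inotify watch")
    sb.shutdown()
    return bool(root.watches)
```

    In this model the watch outlives the unmount only when the inode was
    never placed on the list, which is the situation the fix removes.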

    I wrote a small test program and I can verify that without changes it
    can trigger this issue, and by replacing new_inode_pseudo with
    new_inode the issue goes away.

    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/000000000000d788c905a7dfa3f4@google.com
    Reported-by: syzbot+7d2debdcdb3cb93c1e5e@syzkaller.appspotmail.com
    Fixes: 0097875bd415 ("proc: Implement /proc/thread-self to point at the directory of the current thread")
    Fixes: 021ada7dff22 ("procfs: switch /proc/self away from proc_dir_entry")
    Fixes: 51f0885e5415 ("vfs,proc: guarantee unique inodes in /proc")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

29 Apr, 2020

1 commit

  • commit bdebd6a2831b6fab69eb85cee74a8ba77f1a1cc2 upstream.

    remap_vmalloc_range() has had various issues with the bounds checks it
    promises to perform ("This function checks that addr is a valid
    vmalloc'ed area, and that it is big enough to cover the vma") over time,
    e.g.:

    - not detecting pgoff<<<<
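
    The first listed issue, an undetected overflow when shifting pgoff, can
    be illustrated with a small model of 64-bit unsigned arithmetic. This is
    a sketch of an overflow-aware bounds check in general, not the kernel's
    implementation; PAGE_SHIFT = 12, the 64-bit word size and both function
    names are assumptions made for the example.

```python
PAGE_SHIFT = 12                    # assumed 4 KiB pages
ULONG_MAX = (1 << 64) - 1          # assumed 64-bit unsigned long


def naive_range_fits(area_size, pgoff, usize):
    """Bounds check that lets the shift and the addition wrap silently."""
    off = (pgoff << PAGE_SHIFT) & ULONG_MAX
    end = (off + usize) & ULONG_MAX
    return end <= area_size


def checked_range_fits(area_size, pgoff, usize):
    """Bounds check that rejects any request whose arithmetic would wrap."""
    if pgoff > (ULONG_MAX >> PAGE_SHIFT):
        return False               # pgoff << PAGE_SHIFT would lose bits
    off = pgoff << PAGE_SHIFT
    if off > ULONG_MAX - usize:
        return False               # off + usize would wrap
    return off + usize <= area_size
```

    With pgoff = 1 << 52 the shifted offset wraps to zero, so the naive
    version accepts a request that the checked version rejects.
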
    Signed-off-by: Andrew Morton
    Cc: stable@vger.kernel.org
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: Martin KaFai Lau
    Cc: Song Liu
    Cc: Yonghong Song
    Cc: Andrii Nakryiko
    Cc: John Fastabend
    Cc: KP Singh
    Link: http://lkml.kernel.org/r/20200415222312.236431-1-jannh@google.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

19 Oct, 2019

2 commits

  • Patch series "Fixes for THP in page cache", v2.

    This patch (of 5):

    Add extra space for FileHugePages and FilePmdMapped, so the output is
    aligned with other rows.
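
    A minimal sketch of the alignment idea: if every row pads its label to
    the same width before printing the value, long labels such as
    FileHugePages line up with short ones. The 16-column label and 8-column
    value widths are assumptions chosen for the example, and meminfo_row()
    is a hypothetical helper.

```python
LABEL_WIDTH = 16   # assumed width of the "Name:" column
VALUE_WIDTH = 8    # assumed width of the right-aligned number


def meminfo_row(name, kib):
    """Format one row so the value column starts at the same offset."""
    label = name + ":"
    return f"{label:<{LABEL_WIDTH}}{kib:>{VALUE_WIDTH}} kB"


rows = [
    meminfo_row("MemTotal", 16318424),
    meminfo_row("FileHugePages", 0),
    meminfo_row("FilePmdMapped", 0),
]
```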

    Link: http://lkml.kernel.org/r/20191017164223.2762148-2-songliubraving@fb.com
    Fixes: 60fbf0ab5da1 ("mm,thp: stats for file backed THP")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Song Liu
    Tested-by: Song Liu
    Acked-by: Yang Shi
    Cc: Matthew Wilcox
    Cc: Oleg Nesterov
    Cc: Srikar Dronamraju
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • There are three places where we access uninitialized memmaps, namely:
    - /proc/kpagecount
    - /proc/kpageflags
    - /proc/kpagecgroup

    We have initialized memmaps either when the section is online or when the
    page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
    garbage and in the worst case trigger kernel BUGs, especially with
    CONFIG_PAGE_POISONING.

    For example, not onlining a DIMM during boot and calling /proc/kpagecount
    with CONFIG_PAGE_POISONING:

    :/# cat /proc/kpagecount > tmp.test
    BUG: unable to handle page fault for address: fffffffffffffffe
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
    Oops: 0000 [#1] SMP NOPTI
    CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
    RIP: 0010:kpagecount_read+0xce/0x1e0
    Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
    RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
    RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
    RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
    R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
    FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
    Call Trace:
    proc_reg_read+0x3c/0x60
    vfs_read+0xc5/0x180
    ksys_read+0x68/0xe0
    do_syscall_64+0x5c/0xa0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    For now, let's drop support for ZONE_DEVICE from the three pseudo files
    in order to fix this. To distinguish offline memory (with garbage
    memmap) from ZONE_DEVICE memory with properly initialized memmaps, we
    would have to check get_dev_pagemap() and pfn_zone_device_reserved()
    right now. The usage of both (especially, special casing devmem) is
    frowned upon and needs to be reworked.

    The fundamental issue we have is:

    if (pfn_to_online_page(pfn)) {
            /* memmap initialized */
    } else if (pfn_valid(pfn)) {
            /*
             * ???
             * a) offline memory. memmap garbage.
             * b) devmem: memmap initialized to ZONE_DEVICE.
             * c) devmem: reserved for driver. memmap garbage.
             * (d) devmem: memmap currently initializing - garbage)
             */
    }

    We'll leave the pfn_zone_device_reserved() check in stable_page_flags()
    in place as that function is also used from memory failure. We now no
    longer dump information about pages that are not in use anymore -
    offline.
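
    The resulting policy can be modelled as follows: only PFNs whose memmap
    is known to be initialized (online pages) are inspected, and everything
    else is skipped without touching its memmap. The PfnState enum and both
    functions are hypothetical, and reporting 0 for skipped PFNs is an
    assumption of this sketch rather than something stated above.

```python
from enum import Enum


class PfnState(Enum):
    ONLINE = "online"              # memmap initialized
    OFFLINE = "offline"            # memmap may contain garbage
    ZONE_DEVICE = "zone_device"    # devmem, no longer reported
    INVALID = "invalid"            # no memmap at all


def kpagecount_entry(state, mapcount):
    """Return the value to report for one PFN without reading a bad memmap."""
    if state is PfnState.ONLINE:
        return mapcount
    return 0                       # assumption: skipped PFNs are reported as 0


def read_kpagecount(pages):
    """pages is a list of (PfnState, mapcount-or-None) tuples."""
    return [kpagecount_entry(state, count) for state, count in pages]
```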

    Link: http://lkml.kernel.org/r/20191009142435.3975-2-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reported-by: Qian Cai
    Acked-by: Michal Hocko
    Cc: Dan Williams
    Cc: Alexey Dobriyan
    Cc: Stephen Rothwell
    Cc: Toshiki Fukasawa
    Cc: Pankaj gupta
    Cc: Mike Rapoport
    Cc: Anthony Yznaga
    Cc: "Aneesh Kumar K.V"
    Cc: [4.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

28 Sep, 2019

1 commit

  • Pull kernel lockdown mode from James Morris:
    "This is the latest iteration of the kernel lockdown patchset, from
    Matthew Garrett, David Howells and others.

    From the original description:

    This patchset introduces an optional kernel lockdown feature,
    intended to strengthen the boundary between UID 0 and the kernel.
    When enabled, various pieces of kernel functionality are restricted.
    Applications that rely on low-level access to either hardware or the
    kernel may cease working as a result - therefore this should not be
    enabled without appropriate evaluation beforehand.

    The majority of mainstream distributions have been carrying variants
    of this patchset for many years now, so there's value in providing an
    upstream implementation. It doesn't meet every distribution
    requirement, but gets us much closer to not requiring external patches.

    There are two major changes since this was last proposed for mainline:

    - Separating lockdown from EFI secure boot. Background discussion is
    covered here: https://lwn.net/Articles/751061/

    - Implementation as an LSM, with a default stackable lockdown LSM
    module. This allows the lockdown feature to be policy-driven,
    rather than encoding an implicit policy within the mechanism.

    The new locked_down LSM hook is provided to allow LSMs to make a
    policy decision around whether kernel functionality that would allow
    tampering with or examining the runtime state of the kernel should be
    permitted.

    The included lockdown LSM provides an implementation with a simple
    policy intended for general purpose use. This policy provides a coarse
    level of granularity, controllable via the kernel command line:

    lockdown={integrity|confidentiality}

    Enable the kernel lockdown feature. If set to integrity, kernel features
    that allow userland to modify the running kernel are disabled. If set to
    confidentiality, kernel features that allow userland to extract
    confidential information from the kernel are also disabled.

    This may also be controlled via /sys/kernel/security/lockdown and
    overridden by kernel configuration.

    New or existing LSMs may implement finer-grained controls of the
    lockdown features. Refer to the lockdown_reason documentation in
    include/linux/security.h for details.

    The lockdown feature has had significant design feedback and review
    across many subsystems. This code has been in linux-next for some
    weeks, with a few fixes applied along the way.

    Stephen Rothwell noted that commit 9d1f8be5cf42 ("bpf: Restrict bpf
    when kernel lockdown is in confidentiality mode") is missing a
    Signed-off-by from its author. Matthew responded that he is providing
    this under category (c) of the DCO"
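
    The two lockdown= levels described above behave like an ordered scale:
    confidentiality includes everything integrity restricts. A minimal
    user-space sketch of that ordering follows. The bracketed file format
    passed to active_level() is an assumption about what
    /sys/kernel/security/lockdown contains, and both function names are
    hypothetical.

```python
LEVELS = ("none", "integrity", "confidentiality")   # weakest to strictest


def active_level(text):
    """Parse text such as 'none [integrity] confidentiality'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError("no active level marked")


def is_locked_down(active, required):
    """True if a feature restricted at level `required` is blocked at `active`."""
    return LEVELS.index(active) >= LEVELS.index(required)
```

    For example, a feature restricted only in confidentiality mode stays
    usable when the active level is integrity.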

    * 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
    kexec: Fix file verification on S390
    security: constify some arrays in lockdown LSM
    lockdown: Print current->comm in restriction messages
    efi: Restrict efivar_ssdt_load when the kernel is locked down
    tracefs: Restrict tracefs when the kernel is locked down
    debugfs: Restrict debugfs when the kernel is locked down
    kexec: Allow kexec_file() with appropriate IMA policy when locked down
    lockdown: Lock down perf when in confidentiality mode
    bpf: Restrict bpf when kernel lockdown is in confidentiality mode
    lockdown: Lock down tracing and perf kprobes when in confidentiality mode
    lockdown: Lock down /proc/kcore
    x86/mmiotrace: Lock down the testmmiotrace module
    lockdown: Lock down module params that specify hardware parameters (eg. ioport)
    lockdown: Lock down TIOCSSERIAL
    lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
    acpi: Disable ACPI table override if the kernel is locked down
    acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
    ACPI: Limit access to custom_method when the kernel is locked down
    x86/msr: Restrict MSR access when the kernel is locked down
    x86: Lock down IO port access when the kernel is locked down
    ...

    Linus Torvalds
     

25 Sep, 2019

3 commits

  • In preparation for non-shmem THP, this patch adds a few stats and exposes
    them in /proc/meminfo, /sys/bus/node/devices//meminfo, and
    /proc//task//smaps.

    This patch is mostly a rewrite of Kirill A. Shutemov's earlier version:
    https://lkml.kernel.org/r/20170126115819.58875-5-kirill.shutemov@linux.intel.com/

    Link: http://lkml.kernel.org/r/20190801184244.3169074-5-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Patch series "mm: remove quicklist page table caches".

    A while ago Nicholas proposed to remove quicklist page table caches [1].

    I've rebased his patch on the current upstream and switched ia64 and sh to
    use generic versions of PTE allocation.

    [1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com

    This patch (of 3):

    Remove page table allocator "quicklists". These have been around for a
    long time, but have not got much traction in the last decade and are only
    used on ia64 and sh architectures.

    The numbers in the initial commit look interesting but probably don't
    apply anymore. If anybody wants to resurrect this it's in the git
    history, but it's unhelpful to have this code and divergent allocator
    behaviour for minor archs.

    Also it might be better to instead make more general improvements to page
    allocator if this is still so slow.

    Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Mike Rapoport
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

22 Sep, 2019

1 commit

  • Pull hmm updates from Jason Gunthorpe:
    "This is more cleanup and consolidation of the hmm APIs and the very
    strongly related mmu_notifier interfaces. Many places across the tree
    using these interfaces are touched in the process. Beyond that a
    cleanup to the page walker API and a few memremap related changes
    round out the series:

    - General improvement of hmm_range_fault() and related APIs, more
    documentation, bug fixes from testing, API simplification &
    consolidation, and unused API removal

    - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
    and make them internal kconfig selects

    - Hoist a lot of code related to mmu notifier attachment out of
    drivers by using a refcount get/put attachment idiom and remove the
    convoluted mmu_notifier_unregister_no_release() and related APIs.

    - General API improvement for the migrate_vma API and revision of its
    only user in nouveau

    - Annotate mmu_notifiers with lockdep and sleeping region debugging

    Two series unrelated to HMM or mmu_notifiers came along due to
    dependencies:

    - Allow pagemap's memremap_pages family of APIs to work without
    providing a struct device

    - Make walk_page_range() and related use a constant structure for
    function pointers"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
    libnvdimm: Enable unit test infrastructure compile checks
    mm, notifier: Catch sleeping/blocking for !blockable
    kernel.h: Add non_block_start/end()
    drm/radeon: guard against calling an unpaired radeon_mn_unregister()
    csky: add missing brackets in a macro for tlb.h
    pagewalk: use lockdep_assert_held for locking validation
    pagewalk: separate function pointers from iterator data
    mm: split out a new pagewalk.h header from mm.h
    mm/mmu_notifiers: annotate with might_sleep()
    mm/mmu_notifiers: prime lockdep
    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
    mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
    mm/hmm: hmm_range_fault() infinite loop
    mm/hmm: hmm_range_fault() NULL pointer bug
    mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
    mm/mmu_notifiers: remove unregister_no_release
    RDMA/odp: remove ib_ucontext from ib_umem
    RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    ...

    Linus Torvalds
     

21 Sep, 2019

1 commit

  • Pull powerpc updates from Michael Ellerman:
    "This is a bit late, partly due to me travelling, and partly due to a
    power outage knocking out some of my test systems *while* I was
    travelling.

    - Initial support for running on a system with an Ultravisor, which
    is software that runs below the hypervisor and protects guests
    against some attacks by the hypervisor.

    - Support for building the kernel to run as a "Secure Virtual
    Machine", ie. as a guest capable of running on a system with an
    Ultravisor.

    - Some changes to our DMA code on bare metal, to allow devices with
    medium sized DMA masks (> 32 && < 59 bits) to use more than 2GB of
    DMA space.

    - Support for firmware assisted crash dumps on bare metal (powernv).

    - Two series fixing bugs in and refactoring our PCI EEH code.

    - A large series refactoring our exception entry code to use gas
    macros, both to make it more readable and also enable some future
    optimisations.

    As well as many cleanups and other minor features & fixups.

    Thanks to: Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew
    Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual,
    Balbir Singh, Benjamin Herrenschmidt, Cédric Le Goater, Christophe
    JAILLET, Christophe Leroy, Christopher M. Riedl, Christoph Hellwig,
    Claudio Carvalho, Daniel Axtens, David Gibson, David Hildenbrand,
    Desnes A. Nunes do Rosario, Ganesh Goudar, Gautham R. Shenoy, Greg
    Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari Bathini, Joakim
    Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras, Lianbo
    Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
    Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan
    Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver
    O'Halloran, Qian Cai, Ram Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm,
    Sam Bobroff, Santosh Sivaraj, Segher Boessenkool, Sukadev Bhattiprolu,
    Thiago Bauermann, Thiago Jung Bauermann, Thomas Gleixner, Tom
    Lendacky, Vasant Hegde"

    * tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (264 commits)
    powerpc/mm/mce: Keep irqs disabled during lockless page table walk
    powerpc: Use ftrace_graph_ret_addr() when unwinding
    powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
    ftrace: Look up the address of return_to_handler() using helpers
    powerpc: dump kernel log before carrying out fadump or kdump
    docs: powerpc: Add missing documentation reference
    powerpc/xmon: Fix output of XIVE IPI
    powerpc/xmon: Improve output of XIVE interrupts
    powerpc/mm/radix: remove useless kernel messages
    powerpc/fadump: support holes in kernel boot memory area
    powerpc/fadump: remove RMA_START and RMA_END macros
    powerpc/fadump: update documentation about option to release opalcore
    powerpc/fadump: consider f/w load area
    powerpc/opalcore: provide an option to invalidate /sys/firmware/opal/core file
    powerpc/opalcore: export /sys/firmware/opal/core for analysing opal crashes
    powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP
    powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel
    powerpc/fadump: improve how crashed kernel's memory is reserved
    powerpc/fadump: consider reserved ranges while releasing memory
    powerpc/fadump: make crash memory ranges array allocation generic
    ...

    Linus Torvalds
     

07 Sep, 2019

2 commits

  • The mm_walk structure currently mixes data and code. Split out the
    operations vectors into a new mm_walk_ops structure, and while we are
    changing the API also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.
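
    The shape of that split can be sketched in Python: an immutable object
    holds only the callbacks, while the per-walk state is created inside the
    walking function. All names below (WalkOps, Walk, count_page, COUNT_OPS)
    are hypothetical and the walk itself is reduced to a loop over
    addresses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class WalkOps:
    """Callbacks only; frozen so one instance can be shared by all walks."""
    pte_entry: Optional[Callable] = None


@dataclass
class Walk:
    """Per-walk state, built by walk_page_range() rather than by the caller."""
    ops: WalkOps
    private: object = None


def walk_page_range(addresses, ops, private=None):
    walk = Walk(ops=ops, private=private)
    for addr in addresses:
        if walk.ops.pte_entry is not None:
            walk.ops.pte_entry(addr, walk)
    return walk.private


def count_page(addr, walk):
    # Example callback: count visited pages in the caller's private data.
    walk.private["pages"] += 1


COUNT_OPS = WalkOps(pte_entry=count_page)
```

    Callers pass the shared ops object plus their own private data, so no
    caller has to build or mutate a structure that also carries function
    pointers.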

    Based on patch from Linus Torvalds.

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Add a new header for the two handfuls of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

06 Sep, 2019

1 commit


20 Aug, 2019

2 commits

  • Print the content of current->comm in messages generated by lockdown to
    indicate a restriction that was hit. This makes it a bit easier to find
    out what caused the message.

    The message is now patterned something like:

    Lockdown: : is restricted; see man kernel_lockdown.7

    Signed-off-by: David Howells
    Signed-off-by: Matthew Garrett
    Reviewed-by: Kees Cook
    Signed-off-by: James Morris

    Matthew Garrett
     
  • Disallow access to /proc/kcore when the kernel is locked down to prevent
    access to cryptographic data. This is limited to lockdown
    confidentiality mode and is still permitted in integrity mode.

    Signed-off-by: David Howells
    Signed-off-by: Matthew Garrett
    Reviewed-by: Kees Cook
    Signed-off-by: James Morris

    David Howells
     

09 Aug, 2019

1 commit

  • Secure Encrypted Virtualization is an x86-specific feature, so it shouldn't
    appear in generic kernel code because it forces non-x86 architectures to
    define the sev_active() function, which doesn't make a lot of sense.

    To solve this problem, add an x86 elfcorehdr_read() function to override
    the generic weak implementation. To do that, it's necessary to make
    read_from_oldmem() public so that it can be used outside of vmcore.c.

    Also, remove the export for sev_active() since it's only used in files that
    won't be built as modules.

    Signed-off-by: Thiago Jung Bauermann
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Lianbo Jiang
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20190806044919.10622-6-bauerman@linux.ibm.com

    Thiago Jung Bauermann
     

20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

19 Jul, 2019

2 commits

  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate that the user supplied value lies within an allowed range. This
    function uses the extra1 and extra2 members from struct ctl_table as
    the minimum and maximum allowed values.

    On sysctl handler declaration, in every source file there are some
    read-only variables containing just an integer whose address is assigned
    to the extra1 and extra2 members, so the sysctl range is enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.
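
    A rough model of the idea in Python: one shared tuple holds the common
    boundary values and table entries refer to positions in it instead of
    defining their own zero/one/int_max. The table contents and the
    write_sysctl() helper are hypothetical, and the names SYSCTL_ZERO,
    SYSCTL_ONE and SYSCTL_INT_MAX are assumed here for the macros the
    description mentions.

```python
INT_MAX = 2**31 - 1

# Single shared array of commonly used range boundaries.
SYSCTL_VALS = (0, 1, INT_MAX)

# Assumed macro names; here they are simply indices into SYSCTL_VALS.
SYSCTL_ZERO, SYSCTL_ONE, SYSCTL_INT_MAX = range(3)

# Hypothetical table: name -> (extra1, extra2) referring to the shared array.
TABLE = {
    "example_flag": (SYSCTL_ZERO, SYSCTL_ONE),
    "example_count": (SYSCTL_ZERO, SYSCTL_INT_MAX),
}


def write_sysctl(name, value):
    """Accept value only if it lies within the entry's [extra1, extra2] range."""
    extra1, extra2 = TABLE[name]
    low, high = SYSCTL_VALS[extra1], SYSCTL_VALS[extra2]
    if not low <= value <= high:
        raise ValueError(f"{name}: {value} outside [{low}, {high}]")
    return value
```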

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data old new delta
    sysctl_vals - 12 +12
    __kstrtab_sysctl_vals - 12 +12
    max 14 10 -4
    int_max 16 - -16
    one 68 - -68
    zero 128 28 -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     
  • Commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
    vma") introduced the THPeligible bit for processes' smaps. But, when
    checking the eligibility for a shmem vma, __transparent_hugepage_enabled()
    is called to override the result from shmem_huge_enabled(). This may
    result in the anonymous vma's THP flag overriding shmem's. For example,
    running a simple test which creates THP for shmem, but with anonymous THP
    disabled, when reading the process's smaps, it may show:

    7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
    Size: 4096 kB
    ...
    [snip]
    ...
    ShmemPmdMapped: 4096 kB
    ...
    [snip]
    ...
    THPeligible: 0

    And, /proc/meminfo does show THP allocated and PMD mapped too:

    ShmemHugePages: 4096 kB
    ShmemPmdMapped: 4096 kB

    This doesn't make too much sense. The shmem objects should be treated
    separately from anonymous THP. Calling shmem_huge_enabled() together
    with a check of MMF_DISABLE_THP sounds good enough. We can also skip the
    stack and dax vma checks, since we have already checked that the vma is
    shmem.

    Also check if vma is suitable for THP by calling
    transhuge_vma_suitable().

    And minor fix to smaps output format and documentation.
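
    The inconsistency shown in the smaps excerpt above can be detected from
    user space. The following sketch parses smaps-style text and reports
    mappings that have PMD-mapped shmem yet claim THPeligible: 0. Both
    function names are hypothetical and the parser handles only the simple
    "Key: number" lines seen in the excerpt.

```python
def parse_smaps(text):
    """Return a list of (header_line, {field: int}) for each mapping."""
    mappings = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0].endswith(":"):
            # "Key: 4096 kB" style field belonging to the current mapping.
            if mappings and len(parts) > 1 and parts[1].isdigit():
                mappings[-1][1][parts[0][:-1]] = int(parts[1])
        elif "-" in parts[0]:
            # "start-end perms offset dev inode path" starts a new mapping.
            mappings.append((line.strip(), {}))
    return mappings


def inconsistent_mappings(text):
    """Mappings with ShmemPmdMapped > 0 but THPeligible reported as 0."""
    return [
        header
        for header, fields in parse_smaps(text)
        if fields.get("ShmemPmdMapped", 0) > 0 and fields.get("THPeligible") == 0
    ]


SAMPLE = """\
7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
Size: 4096 kB
ShmemPmdMapped: 4096 kB
THPeligible: 0
"""
```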

    Link: http://lkml.kernel.org/r/1560401041-32207-3-git-send-email-yang.shi@linux.alibaba.com
    Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma")
    Signed-off-by: Yang Shi
    Acked-by: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

17 Jul, 2019

8 commits

  • Merge more updates from Andrew Morton:
    "VM:
    - z3fold fixes and enhancements by Henry Burns and Vitaly Wool

    - more accurate reclaimed slab caches calculations by Yafang Shao

    - fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by
    Christoph Hellwig

    - !CONFIG_MMU fixes by Christoph Hellwig

    - new novmcoredd parameter to omit device dumps from vmcore, by
    Kairui Song

    - new test_meminit module for testing heap and pagealloc
    initialization, by Alexander Potapenko

    - ioremap improvements for huge mappings, by Anshuman Khandual

    - generalize kprobe page fault handling, by Anshuman Khandual

    - device-dax hotplug fixes and improvements, by Pavel Tatashin

    - enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V

    - add pte_devmap() support for arm64, by Robin Murphy

    - unify locked_vm accounting with a helper, by Daniel Jordan

    - several misc fixes

    core/lib:
    - new typeof_member() macro including some users, by Alexey Dobriyan

    - make BIT() and GENMASK() available in asm, by Masahiro Yamada

    - changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better
    code generation, by Alexey Dobriyan

    - rbtree code size optimizations, by Michel Lespinasse

    - convert struct pid count to refcount_t, by Joel Fernandes

    get_maintainer.pl:
    - add --no-moderated switch to skip moderated ML's, by Joe Perches

    misc:
    - ptrace PTRACE_GET_SYSCALL_INFO interface

    - coda updates

    - gdb scripts, various"

    [ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ]

    * emailed patches from Andrew Morton : (100 commits)
    fs/select.c: use struct_size() in kmalloc()
    mm: add account_locked_vm utility function
    arm64: mm: implement pte_devmap support
    mm: introduce ARCH_HAS_PTE_DEVMAP
    mm: clean up is_device_*_page() definitions
    mm/mmap: move common defines to mman-common.h
    mm: move MAP_SYNC to asm-generic/mman-common.h
    device-dax: "Hotremove" persistent memory that is used like normal RAM
    mm/hotplug: make remove_memory() interface usable
    device-dax: fix memory and resource leak if hotplug fails
    include/linux/lz4.h: fix spelling and copy-paste errors in documentation
    ipc/mqueue.c: only perform resource calculation if user valid
    include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures
    scripts/gdb: add helpers to find and list devices
    scripts/gdb: add lx-genpd-summary command
    drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
    kernel/pid.c: convert struct pid count to refcount_t
    drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
    select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining()
    select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR
    ...

    Linus Torvalds
     
  • Normally, the inode's i_uid/i_gid are translated relative to s_user_ns,
    but this is not correct behavior for proc. Since the sysctl permission
    check in test_perm is done against GLOBAL_ROOT_[UG]ID, it makes more
    sense to use these values in i_[ug]id of proc inodes. In other words:
    although uid/gid in the inode is not read during test_perm, the inode
    logically belongs to the root of the namespace. I have confirmed this
    with Eric Biederman at LPC and in this thread:
    https://lore.kernel.org/lkml/87k1kzjdff.fsf@xmission.com

    Consequences
    ============

    Since the i_[ug]id values of proc nodes are not used for permissions
    checks, this change usually makes no functional difference. However, it
    causes an issue in a setup where:

    * a namespace container is created without a root user in the
    container, hence the i_[ug]id of proc nodes are set to INVALID_[UG]ID

    * the container creator tries to configure it by writing /proc/sys files,
    e.g. writing /proc/sys/kernel/shmmax to configure the shared memory limit

    The kernel does not allow opening an inode for writing if its i_[ug]id
    are invalid, making it impossible to write shmmax and thus to configure
    the container.

    Using a container with no root mapping is apparently rare, but we do use
    this configuration at Google. Also, we use a generic tool to configure
    the container limits, and the inability to write any of them causes a
    failure.

    History
    =======

    The invalid uids/gids in inodes first appeared due to 81754357770e (fs:
    Update i_[ug]id_(read|write) to translate relative to s_user_ns).
    However, AFAIK, this did not immediately cause any issues. The
    inability to write to these "invalid" inodes was only caused by a later
    commit 0bd23d09b874 (vfs: Don't modify inodes with a uid or gid unknown
    to the vfs).

    Tested: Used a repro program that creates a user namespace without any
    mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
    Before the change, it shows the overflow uid, with the change it's 0.
    The overflow uid indicates that the uid in the inode is not correct and
    thus it is not possible to open the file for writing.
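
    A user-space check along the lines of the test above could be sketched
    as follows. This is a hypothetical helper, not the repro program from
    the commit. It assumes the overflow uid can be read from
    /proc/sys/kernel/overflowuid and uses 65534 only as a fallback default;
    all function names are made up for illustration.

```python
import os

OVERFLOW_UID_PATH = "/proc/sys/kernel/overflowuid"


def read_overflow_uid(path=OVERFLOW_UID_PATH, default=65534):
    """Return the kernel's overflow uid, or `default` if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default


def owner_uid(path):
    """Return the uid that stat() reports for `path`."""
    return os.stat(path).st_uid


def owner_is_unmapped(uid, overflow_uid):
    """True if stat() reported the overflow uid for the file's owner."""
    return uid == overflow_uid
```

    For example, owner_is_unmapped(owner_uid(p), read_overflow_uid()) with
    p set to /proc/$PID/root/proc/sys/kernel/shmmax. Note that a file
    genuinely owned by the overflow uid would also match, so this is only a
    heuristic.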

    Link: http://lkml.kernel.org/r/20190708115130.250149-1-rburny@google.com
    Fixes: 0bd23d09b874 ("vfs: Don't modify inodes with a uid or gid unknown to the vfs")
    Signed-off-by: Radoslaw Burny
    Acked-by: Luis Chamberlain
    Cc: Kees Cook
    Cc: "Eric W . Biederman"
    Cc: Seth Forshee
    Cc: John Sperbeck
    Cc: Alexey Dobriyan
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Radoslaw Burny
     
  • Don't repeat function signatures twice.

    This is a kind-of-precursor for "struct proc_ops".

    Note:

    typeof(pde->proc_fops->...) ...;

    can't be used because ->proc_fops is "const struct file_operations *".
    "const" prevents assignment down the code and it can't be deleted in the
    type system.

    Link: http://lkml.kernel.org/r/20190529191110.GB5703@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Since commit 2724273e8fd0 ("vmcore: add API to collect hardware dump in
    second kernel"), drivers are allowed to add device related dump data to
    vmcore as they want by using the device dump API. This has a potential
    issue: the data is stored in memory, so drivers may append too much data
    and use too much memory. The vmcore is typically used in a kdump kernel,
    which runs in a small pre-reserved chunk of memory, so the result can be
    an OOM that makes kdump unusable.

    So introduce a new 'novmcoredd' command line option that lets the user
    disable device dump to reduce memory usage. This is helpful if device
    dump is using too much memory: disabling it ensures that a regular
    vmcore without device dump data is still available.
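
    To confirm from user space whether a kernel was booted with the option,
    one could scan the boot command line (normally readable from
    /proc/cmdline). The helper below is a minimal sketch with a hypothetical
    name. It splits on whitespace only, so it ignores quoted parameter
    values that contain spaces.

```python
def has_kernel_param(cmdline, name):
    """Return True if `name` appears as a bare flag or as `name=value`."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False
```

    Typical use would be has_kernel_param(open("/proc/cmdline").read(),
    "novmcoredd") inside the kdump kernel.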

    [akpm@linux-foundation.org: tweak documentation]
    [akpm@linux-foundation.org: vmcore.c needs moduleparam.h]
    Link: http://lkml.kernel.org/r/20190528111856.7276-1-kasong@redhat.com
    Signed-off-by: Kairui Song
    Acked-by: Dave Young
    Reviewed-by: Bhupesh Sharma
    Cc: Rahul Lakkireddy
    Cc: "David S . Miller"
    Cc: Eric Biederman
    Cc: Alexey Dobriyan
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kairui Song
     
  • Pull rst conversion of docs from Mauro Carvalho Chehab:
    "As agreed with Jon, I'm sending this big series directly to you, c/c
    him, as this series required a special care, in order to avoid
    conflicts with other trees"

    * tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits)
    docs: kbuild: fix build with pdf and fix some minor issues
    docs: block: fix pdf output
    docs: arm: fix a breakage with pdf output
    docs: don't use nested tables
    docs: gpio: add sysfs interface to the admin-guide
    docs: locking: add it to the main index
    docs: add some directories to the main documentation index
    docs: add SPDX tags to new index files
    docs: add a memory-devices subdir to driver-api
    docs: phy: place documentation under driver-api
    docs: serial: move it to the driver-api
    docs: driver-api: add remaining converted dirs to it
    docs: driver-api: add xilinx driver API documentation
    docs: driver-api: add a series of orphaned documents
    docs: admin-guide: add a series of orphaned documents
    docs: cgroup-v1: add it to the admin-guide book
    docs: aoe: add it to the driver-api book
    docs: add some documentation dirs to the driver-api book
    docs: driver-model: move it to the driver-api book
    docs: lp855x-driver.rst: add it to the driver-api book
    ...

    Linus Torvalds
     
  • This fixes two problems reported with the cmdline simplification and
    cleanup last year:

    - the setproctitle() special cases didn't quite match the original
    semantics, and it can be noticeable:

    https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/

    - it could leak an uninitialized byte from the temporary buffer under
    the right (wrong) circumstances:

    https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/

    It rewrites the logic entirely, splitting it into two separate commits
    (and two separate functions) for the two different cases ("unedited
    cmdline" vs "setproctitle() has been used to change the command line").

    * proc-cmdline:
    /proc/<pid>/cmdline: add back the setproctitle() special case
    /proc/<pid>/cmdline: remove all the special cases

    Linus Torvalds
     
  • This makes the setproctitle() special case very explicit indeed, and
    handles it with a separate helper function entirely. In the process, it
    re-instates the original semantics of simply stopping at the first NUL
    character when the original last NUL character is no longer there.

    [ The original semantics can still be seen in mm/util.c: get_cmdline()
    that is limited to a fixed-size buffer ]

    This makes the logic about when we use the string lengths etc much more
    obvious, and makes it easier to see what we do and what the two very
    different cases are.

    Note that even when we allow walking past the end of the argument array
    (because the setproctitle() might have overwritten and overflowed the
    original argv[] strings), we only allow it to overflow into the
    environment region, and only if that region is immediately adjacent.
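
    The two cases described above can be modelled in user space roughly as
    follows. This is a simplified sketch over a bytes buffer, not the kernel
    implementation: the function name is hypothetical, the buffer stands in
    for the process address space, and returning the title without its
    terminating NUL is an assumption made for readability.

```python
def read_cmdline(mem, arg_start, arg_end, env_start, env_end):
    """Model the unedited-argv case and the setproctitle()-style case.

    `mem` is a bytes object standing in for the process address space.
    """
    if arg_start >= arg_end:
        return b""

    # Overflow past arg_end is only allowed into an environment region
    # that starts exactly where the arguments end.
    if env_start != arg_end or env_end < env_start:
        env_end = arg_end

    if mem[arg_end - 1] == 0:
        # Unedited command line: return exactly arg_start..arg_end.
        return mem[arg_start:arg_end]

    # The original final NUL is gone: stop at the first NUL instead,
    # searching no further than the permitted limit.
    nul = mem.find(b"\0", arg_start, env_end)
    end = env_end if nul == -1 else nul
    return mem[arg_start:end]
```

    With an adjacent environment the overwritten title may run on into it.
    With a non-adjacent one the read is cut off at arg_end.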

    [ Fixed for missing 'count' checks noted by Alexey Izbyshev ]

    Link: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/
    Fixes: 5ab827189965 ("fs/proc: simplify and clarify get_mm_cmdline() function")
    Cc: Jakub Jankowski
    Cc: Alexey Dobriyan
    Cc: Alexey Izbyshev
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Start off with a clean slate that only reads exactly from arg_start to
    arg_end, without any oddities. This simplifies the code and in the
    process removes the case that caused us to potentially leak an
    uninitialized byte from the temporary kernel buffer.
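
    From user space, the arg_start..arg_end region shows up in
    /proc/<pid>/cmdline as NUL-separated strings. A small, hypothetical
    helper for splitting such a buffer might look like this (decoding as
    UTF-8 with replacement is an assumption, since arguments are arbitrary
    bytes):

```python
def split_cmdline(raw):
    """Split a NUL-separated command line buffer into a list of strings."""
    if not raw:
        return []
    # Drop only the single terminating NUL so empty arguments are kept.
    if raw.endswith(b"\0"):
        raw = raw[:-1]
    return [arg.decode("utf-8", "replace") for arg in raw.split(b"\0")]
```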

    Note that in order to start from scratch with an understandable base,
    this simplifies things _too_ much, and removes all the legacy logic to
    handle setproctitle() having changed the argument strings.

    We'll add back those special cases very differently in the next commit.

    Link: https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/
    Fixes: f5b65348fd77 ("proc: fix missing final NUL in get_mm_cmdline() rewrite")
    Cc: Alexey Izbyshev
    Cc: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Jul, 2019

2 commits

  • The stuff under sysctl describes /sys interface from userspace
    point of view. So, add it to the admin-guide and remove the
    :orphan: from its index file.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Pull HMM updates from Jason Gunthorpe:
    "Improvements and bug fixes for the hmm interface in the kernel:

    - Improve clarity, locking and APIs related to the 'hmm mirror'
    feature merged last cycle. In linux-next we now see AMDGPU and
    nouveau using this API.

    - Remove old or transitional hmm APIs. These are holdovers from the
    past with no users, or APIs that existed only to manage cross tree
    conflicts. There are still a few more of these cleanups that didn't
    make the merge window cutoff.

    - Improve some core mm APIs:
    - export alloc_pages_vma() for driver use
    - refactor into devm_request_free_mem_region() to manage
    DEVICE_PRIVATE resource reservations
    - refactor duplicative driver code into the core dev_pagemap
    struct

    - Remove hmm wrappers of improved core mm APIs, instead have drivers
    use the simplified API directly

    - Remove DEVICE_PUBLIC

    - Simplify the kconfig flow for the hmm users and core code"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
    mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
    mm: remove the HMM config option
    mm: sort out the DEVICE_PRIVATE Kconfig mess
    mm: simplify ZONE_DEVICE page private data
    mm: remove hmm_devmem_add
    mm: remove hmm_vma_alloc_locked_page
    nouveau: use devm_memremap_pages directly
    nouveau: use alloc_page_vma directly
    PCI/P2PDMA: use the dev_pagemap internal refcount
    device-dax: use the dev_pagemap internal refcount
    memremap: provide an optional internal refcount in struct dev_pagemap
    memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
    memremap: remove the data field in struct dev_pagemap
    memremap: add a migrate_to_ram method to struct dev_pagemap_ops
    memremap: lift the devmap_enable manipulation into devm_memremap_pages
    memremap: pass a struct dev_pagemap to ->kill and ->cleanup
    memremap: move dev_pagemap callbacks into a separate structure
    memremap: validate the pagemap type passed to devm_memremap_pages
    mm: factor out a devm_request_free_mem_region helper
    mm: export alloc_pages_vma
    ...

    Linus Torvalds
     

13 Jul, 2019

9 commits

  • Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to
    mem_exclusive cpuset") introduced a heuristic where a potential
    oom-killer victim is skipped if the intersection of the cpusets of the
    potential victim and of current (the process that triggered the oom) is
    empty, on the reasoning that killing such a victim most probably will
    not help the current allocating process.

    However, commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the
    heuristic to just decrease the oom_badness scores of such potential
    victims, on the reasoning that the cpuset of such processes might have
    changed and previously they may have allocated memory on mems where the
    current allocating process can allocate from.

    Unintentionally, 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a
    side effect: since oom_badness is also exposed to user space through
    /proc/[pid]/oom_score, readers with different cpusets can read
    different oom_score values for the same process.

    Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same
    cpuset") fixed the side effect introduced by 7887a3da753e by moving the
    cpuset intersection back to only oom-killer context and out of
    oom_badness. However, the combination of ab290adbaf8f ("oom: make
    oom_unkillable_task() helper function") and 26ebc984913b ("oom:
    /proc/<pid>/oom_score treat kernel thread honestly") unintentionally
    brought back the cpuset intersection check into the oom_badness
    calculation function.

    Other than doing cpuset/mempolicy intersection from oom_badness, the memcg
    oom context is also doing cpuset/mempolicy intersection which is quite
    wrong and is caught by syzkaller with the following report:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
    RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
    RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
    RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
    Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
    00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 3c 02 00 0f
    85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
    RSP: 0018:ffff888000127490 EFLAGS: 00010a03
    RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
    RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
    RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
    R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
    R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
    FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
    Call Trace:
    oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321
    mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169
    select_bad_process mm/oom_kill.c:374 [inline]
    out_of_memory mm/oom_kill.c:1088 [inline]
    out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035
    mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573
    mem_cgroup_oom mm/memcontrol.c:1905 [inline]
    try_charge+0xfbe/0x1480 mm/memcontrol.c:2468
    mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073
    mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088
    do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201
    do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359
    wp_huge_pmd mm/memory.c:3793 [inline]
    __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006
    handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053
    do_user_addr_fault arch/x86/mm/fault.c:1455 [inline]
    __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521
    do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552
    page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156
    RIP: 0033:0x400590
    Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48
    8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 06 e9 1e 01
    00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b
    RSP: 002b:00007fff7bc49780 EFLAGS: 00010206
    RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001
    RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008
    R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0
    Modules linked in:
    ---[ end trace a65689219582ffff ]---
    RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
    RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
    RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
    RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
    Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
    00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 3c 02 00 0f
    85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
    RSP: 0018:ffff888000127490 EFLAGS: 00010a03
    RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
    RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
    RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
    R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
    R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
    FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600

    The fix is to decouple the cpuset/mempolicy intersection check from
    oom_unkillable_task() and make sure cpuset/mempolicy intersection check is
    only done in the global oom context.
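
    The intended behaviour can be pictured with a deliberately simplified
    model. This is not the kernel code: the function name and the use of
    plain sets of memory node ids are assumptions for illustration, and the
    other criteria that make a task unkillable are left out.

```python
def oom_candidate(task_nodes, current_nodes, memcg_oom):
    """Decide whether a task stays eligible as an OOM victim.

    task_nodes / current_nodes: iterables of memory node ids the task and
    the allocating process may use. memcg_oom: True for a memcg OOM.
    """
    if memcg_oom:
        # memcg OOM: no cpuset/mempolicy intersection check at all.
        return True
    # Global OOM: skip tasks that share no memory node with the allocator.
    return bool(set(task_nodes) & set(current_nodes))
```

    The oom_score procfs path would not go through such a check either, so
    the reported score no longer depends on the reader's cpuset.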

    [shakeelb@google.com: change function name and update comment]
    Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Paul Jackson
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • oom_unkillable_task() can be called from three different contexts, i.e.
    global OOM, memcg OOM and the oom_score procfs interface. At the moment
    oom_unkillable_task() does a task_in_mem_cgroup() check on the given
    process. Since there is no reason to perform the task_in_mem_cgroup()
    check for global OOM and the oom_score procfs interface, those contexts
    provide a NULL memcg and skip the task_in_mem_cgroup() check. However,
    for the memcg OOM context, oom_unkillable_task() is always called from
    mem_cgroup_scan_tasks() and thus the task_in_mem_cgroup() check becomes
    redundant and effectively dead code. So, just remove the
    task_in_mem_cgroup() check altogether.

    Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Signed-off-by: Tetsuo Handa
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Paul Jackson
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • vmalloc() is used more and more these days (kernel stacks, bpf and the
    percpu allocator are new top users), and the total % of memory consumed by
    vmalloc() can be pretty significant and changes dynamically.

    /proc/meminfo is the best place to display this information: its main
    goal is to show the top consumers of memory.

    Since the VmallocUsed field in /proc/meminfo has not been in use for
    quite a long time (it has been defined to 0 by a5ad88ce8c7f ("mm: get rid
    of 'vmalloc_info' from /proc/meminfo")), let's reuse it for showing the
    actual physical memory consumption of vmalloc().
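
    Reading the field back from user space is a matter of parsing the
    "Name:   value kB" lines of /proc/meminfo. The parser below is a minimal
    sketch with a hypothetical name. It takes the file contents as a string
    so it can be exercised without a live /proc.

```python
def meminfo_kb(text, field):
    """Return the integer value of `field` from /proc/meminfo text, or None."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == field:
            parts = rest.split()
            if parts:
                return int(parts[0])
    return None
```

    For example, meminfo_kb(open("/proc/meminfo").read(), "VmallocUsed").
    On kernels without this change the value would be expected to read 0.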

    Link: http://lkml.kernel.org/r/20190417194002.12369-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Report separate components (anon, file, and shmem) for PSS in
    smaps_rollup.

    This helps understand and tune the memory manager behavior in consumer
    devices, particularly mobile devices. Many of them (e.g. chromebooks and
    Android-based devices) use zram for anon memory, and perform disk reads
    for discarded file pages. The difference in latency is large (e.g.
    reading a single page from SSD is 30 times slower than decompressing a
    zram page on one popular device), thus it is useful to know how much of
    the PSS is anon vs. file.
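
    A consumer of the new breakdown might parse it along these lines. The
    sketch assumes the components appear as Pss_Anon, Pss_File and Pss_Shmem
    lines next to the existing Pss line in /proc/<pid>/smaps_rollup, with
    values in kB. The function name is hypothetical.

```python
PSS_FIELDS = ("Pss", "Pss_Anon", "Pss_File", "Pss_Shmem")


def pss_breakdown(text):
    """Extract the Pss total and its components (in kB) from smaps_rollup text."""
    result = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in PSS_FIELDS:
            parts = rest.split()
            if parts:
                result[name] = int(parts[0])
    return result
```

    The header line of smaps_rollup (the address range) is ignored because
    its text before the first colon does not match any of the field names.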

    All the information is already present in /proc/pid/smaps, but it is much
    more expensive to obtain there because of the large size of that procfs
    entry.

    This patch also removes a small code duplication in smaps_account, which
    would have gotten worse otherwise.

    Also updated Documentation/filesystems/proc.txt (the smaps section was a
    bit stale, and I added a smaps_rollup section) and
    Documentation/ABI/testing/procfs-smaps_rollup.

    [semenzato@chromium.org: v5]
    Link: http://lkml.kernel.org/r/20190626234333.44608-1-semenzato@chromium.org
    Link: http://lkml.kernel.org/r/20190626180429.174569-1-semenzato@chromium.org
    Signed-off-by: Luigi Semenzato
    Acked-by: Yu Zhao
    Cc: Sonny Rao
    Cc: Yu Zhao
    Cc: Brian Geffon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luigi Semenzato
     
  • Do not remain stuck forever if something goes wrong. Using a killable
    lock permits cleanup of stuck tasks and simplifies investigation.

    It seems ->d_revalidate() can return any error (except ECHILD) to abort
    validation and pass the error on as the result of the lookup sequence.

    [akpm@linux-foundation.org: fix proc_map_files_lookup() return value, per Andrei]
    Link: http://lkml.kernel.org/r/156007493995.3335.9595044802115356911.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Roman Gushchin
    Reviewed-by: Cyrill Gorcunov
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Michal Koutný
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Do not remain stuck forever if something goes wrong. Using a killable
    lock permits cleanup of stuck tasks and simplifies investigation.

    Replace the only unkillable mmap_sem lock in clear_refs_write().

    Link: http://lkml.kernel.org/r/156007493826.3335.5424884725467456239.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Roman Gushchin
    Reviewed-by: Cyrill Gorcunov
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Michal Koutný
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Do not remain stuck forever if something goes wrong. Using a killable
    lock permits cleanup of stuck tasks and simplifies investigation.

    Link: http://lkml.kernel.org/r/156007493638.3335.4872164955523928492.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Roman Gushchin
    Reviewed-by: Cyrill Gorcunov
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Michal Koutný
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Do not remain stuck forever if something goes wrong. Using a killable
    lock permits cleanup of stuck tasks and simplifies investigation.

    Link: http://lkml.kernel.org/r/156007493429.3335.14666825072272692455.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Roman Gushchin
    Reviewed-by: Cyrill Gorcunov
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Michal Koutný
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Do not remain stuck forever if something goes wrong. Using a killable
    lock permits cleanup of stuck tasks and simplifies investigation.

    This function is also used for /proc/pid/smaps.

    Link: http://lkml.kernel.org/r/156007493160.3335.14447544314127417266.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Roman Gushchin
    Reviewed-by: Cyrill Gorcunov
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Michal Koutný
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov