11 Apr, 2020

15 commits

  • Merge yet more updates from Andrew Morton:

    - Almost all of the rest of MM (memcg, slab-generic, slab, pagealloc,
    gup, hugetlb, pagemap, memremap)

    - Various other things (hfs, ocfs2, kmod, misc, seqfile)

    * akpm: (34 commits)
    ipc/util.c: sysvipc_find_ipc() should increase position index
    kernel/gcov/fs.c: gcov_seq_next() should increase position index
    fs/seq_file.c: seq_read(): add info message about buggy .next functions
    drivers/dma/tegra20-apb-dma.c: fix platform_get_irq.cocci warnings
    change email address for Pali Rohár
    selftests: kmod: test disabling module autoloading
    selftests: kmod: fix handling test numbers above 9
    docs: admin-guide: document the kernel.modprobe sysctl
    fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once()
    kmod: make request_module() return an error when autoloading is disabled
    mm/memremap: set caching mode for PCI P2PDMA memory to WC
    mm/memory_hotplug: add pgprot_t to mhp_params
    powerpc/mm: thread pgprot_t through create_section_mapping()
    x86/mm: introduce __set_memory_prot()
    x86/mm: thread pgprot_t through init_memory_mapping()
    mm/memory_hotplug: rename mhp_restrictions to mhp_params
    mm/memory_hotplug: drop the flags field from struct mhp_restrictions
    mm/special: create generic fallbacks for pte_special() and pte_mkspecial()
    mm/vma: introduce VM_ACCESS_FLAGS
    mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS
    ...

    Linus Torvalds
     
  • PCI BAR IO memory should never be mapped as WB; however, prior to this
    the PAT bits were set to WB and were typically overridden by MTRR
    registers set by the firmware.

    Set PCI P2PDMA memory to be UC as this is what it currently, typically,
    ends up being mapped as on x86 after the MTRR registers override the
    cache setting.

    Future use-cases may need to generalize this by adding flags to select
    the caching type, as some P2PDMA cases may not want UC. However, those
    use-cases are not upstream yet and this can be changed when they arrive.

    Signed-off-by: Logan Gunthorpe
    Signed-off-by: Andrew Morton
    Reviewed-by: Dan Williams
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Eric Badger
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200306170846.9333-8-logang@deltatee.com
    Signed-off-by: Linus Torvalds

    Logan Gunthorpe
     
  • devm_memremap_pages() is currently used by the PCI P2PDMA code to create
    struct page mappings for IO memory. At present, these mappings are
    created with PAGE_KERNEL which implies setting the PAT bits to be WB.
    However, on x86, an MTRR register will typically override this and force
    the cache type to be UC-. In the case the firmware doesn't set this
    register, it is effectively WB and will typically result in a machine
    check exception when it's accessed.

    Other arches are not currently likely to function correctly seeing they
    don't have any MTRR registers to fall back on.

    To solve this, provide a way to specify the pgprot value explicitly to
    arch_add_memory().

    Of the arches that support MEMORY_HOTPLUG, x86_64 and arm64 need a
    simple change to pass the pgprot_t down to their respective functions
    which set up the page tables. For x86_32, set the page tables
    explicitly using _set_memory_prot() (seeing they are already mapped).

    For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this
    should be fine, for now, seeing these architectures don't support
    ZONE_DEVICE.

    A check in __add_pages() is also added to ensure the pgprot parameter
    was set for all arches.

    Signed-off-by: Logan Gunthorpe
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Dan Williams
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Dave Hansen
    Cc: Eric Badger
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com
    Signed-off-by: Linus Torvalds

    Logan Gunthorpe
     
  • The mhp_restrictions struct really doesn't specify anything resembling a
    restriction anymore so rename it to be mhp_params as it is a list of
    extended parameters.

    Signed-off-by: Logan Gunthorpe
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Dave Hansen
    Cc: Eric Badger
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.com
    Signed-off-by: Linus Torvalds

    Logan Gunthorpe
     
  • There are many places where all basic VMA access flags (read, write,
    exec) are initialized or checked against as a group. One such example
    is during page fault. The existing vma_is_accessible() wrapper already
    creates the notion of VMA accessibility as a group of access permissions.

    Hence let's just create VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC), which
    will not only reduce code duplication but also extend the VMA
    accessibility concept in general.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Mark Salter
    Cc: Nick Hu
    Cc: Ley Foon Tan
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Guan Xuetao
    Cc: Dave Hansen
    Cc: Thomas Gleixner
    Cc: Rob Springer
    Cc: Greg Kroah-Hartman
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/1583391014-8170-3-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Add the ability to insert multiple pages at once to a user VM with lower
    PTE spinlock operations.

    The intention of this patch-set is to reduce atomic ops for tcp zerocopy
    receives, which normally hits the same spinlock multiple times
    consecutively.

    [akpm@linux-foundation.org: pte_alloc() no longer takes the `addr' argument]
    [arjunroy@google.com: add missing page_count() check to vm_insert_pages()]
    Link: http://lkml.kernel.org/r/20200214005929.104481-1-arjunroy.kdev@gmail.com
    [arjunroy@google.com: vm_insert_pages() checks if pte_index defined]
    Link: http://lkml.kernel.org/r/20200228054714.204424-2-arjunroy.kdev@gmail.com
    Signed-off-by: Arjun Roy
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Andrew Morton
    Cc: David Miller
    Cc: Matthew Wilcox
    Cc: Jason Gunthorpe
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200128025958.43490-2-arjunroy.kdev@gmail.com
    Signed-off-by: Linus Torvalds

    Arjun Roy
     
  • Add helper methods for vm_insert_page()/insert_page() to prepare for
    vm_insert_pages(), which batch-inserts pages to reduce spinlock
    operations when inserting multiple consecutive pages into the user page
    table.

    The intention of this patch-set is to reduce atomic ops for tcp zerocopy
    receives, which normally hits the same spinlock multiple times
    consecutively.

    Signed-off-by: Arjun Roy
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Andrew Morton
    Cc: David Miller
    Cc: Matthew Wilcox
    Cc: Jason Gunthorpe
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200128025958.43490-1-arjunroy.kdev@gmail.com
    Signed-off-by: Linus Torvalds

    Arjun Roy
     
  • When passing their requirements to vm_unmapped_area(),
    arch_get_unmapped_area() and arch_get_unmapped_area_topdown() did not
    set align_offset. Internally, in both unmapped_area() and
    unmapped_area_topdown(), if info->align_mask is 0, then
    info->align_offset is meaningless.

    But commit df529cabb7a2 ("mm: mmap: add trace point of
    vm_unmapped_area") always prints info->align_offset even though it is
    uninitialized.

    Fix this uninitialized value issue by setting it to 0 explicitly.

    Before:
    vm_unmapped_area: addr=0x755b155000 err=0 total_vm=0x15aaf0 flags=0x1 len=0x109000 lo=0x8000 hi=0x75eed48000 mask=0x0 ofs=0x4022

    After:
    vm_unmapped_area: addr=0x74a4ca1000 err=0 total_vm=0x168ab1 flags=0x1 len=0x9000 lo=0x8000 hi=0x753d94b000 mask=0x0 ofs=0x0

    Signed-off-by: Jaewon Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox (Oracle)
    Cc: Michel Lespinasse
    Cc: Borislav Petkov
    Link: http://lkml.kernel.org/r/20200409094035.19457-1-jaewon31.kim@samsung.com
    Signed-off-by: Linus Torvalds

    Jaewon Kim
     
  • Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
    at runtime") has added the run-time allocation of gigantic pages.

    However it actually works only at early stages of the system loading,
    when the majority of memory is free. After some time the memory gets
    fragmented by non-movable pages, so the chances to find a contiguous 1GB
    block are getting close to zero. Even dropping caches manually doesn't
    help a lot.

    At large scale rebooting servers in order to allocate gigantic hugepages
    is quite expensive and complex. At the same time keeping some constant
    percentage of memory in reserved hugepages even if the workload isn't
    using it is a big waste: not all workloads can benefit from using 1 GB
    pages.

    The following solution can solve the problem:
    1) On boot time a dedicated cma area* is reserved. The size is passed
    as a kernel argument.
    2) Run-time allocations of gigantic hugepages are performed using the
    cma allocator and the dedicated cma area

    In this case gigantic hugepages can be allocated successfully with a
    high probability, however the memory isn't completely wasted if nobody
    is using 1GB hugepages: it can be used for pagecache, anon memory, THPs,
    etc.

    * On a multi-node machine a per-node cma area is allocated on each node.
    Subsequent gigantic hugetlb allocations use the first available NUMA
    node if the mask isn't specified by the user.

    Usage:
    1) configure the kernel to allocate a cma area for hugetlb allocations:
    pass hugetlb_cma=10G as a kernel argument

    2) allocate hugetlb pages as usual, e.g.
    echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

    If the option isn't enabled or the allocation of the cma area failed,
    the current behavior of the system is preserved.

    x86 and arm-64 are covered by this patch, other architectures can be
    trivially added later.

    The patch contains clean-ups and fixes proposed and implemented by Aslan
    Bakirov and Randy Dunlap. It also contains ideas and suggestions
    proposed by Rik van Riel, Michal Hocko and Mike Kravetz. Thanks!

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Tested-by: Andreas Schaufler
    Acked-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Aslan Bakirov
    Cc: Randy Dunlap
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • I've noticed that there is no interface exposed by CMA which would let
    me declare contiguous memory on a particular NUMA node.

    This patchset adds the ability to try to allocate contiguous memory on a
    specific node. It will fallback to other nodes if the specified one
    doesn't work.

    Implement a new method for declaring contiguous memory on a particular
    node and keep cma_declare_contiguous() as a wrapper.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Aslan Bakirov
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Andreas Schaufler
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200407163840.92263-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Aslan Bakirov
     
  • Fix the following sparse warning:

    mm/page_alloc.c:106:1: warning: symbol 'pcpu_drain_mutex' was not declared. Should it be static?
    mm/page_alloc.c:107:1: warning: symbol '__pcpu_scope_pcpu_drain' was not declared. Should it be static?

    Reported-by: Hulk Robot
    Signed-off-by: Jason Yan
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200407023925.46438-1-yanaijie@huawei.com
    Signed-off-by: Linus Torvalds

    Jason Yan
     
  • Add description of function parameter 'mt' to fix kernel-doc warning:

    mm/page_alloc.c:3246: warning: Function parameter or member 'mt' not described in '__putback_isolated_page'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Acked-by: Pankaj Gupta
    Link: http://lkml.kernel.org/r/02998bd4-0b82-2f15-2570-f86130304d1e@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • There is a typo in comment, fix it.
    s/eariler/earlier/

    Signed-off-by: Qiujun Huang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: Christoph Lameter
    Link: http://lkml.kernel.org/r/20200405160544.1246-1-hqjagain@gmail.com
    Signed-off-by: Linus Torvalds

    Qiujun Huang
     
  • If a cgroup violates its memory.high constraints, we may end up unduly
    penalising it. For example, for the following hierarchy:

    A: max high, 20 usage
    A/B: 9 high, 10 usage
    A/C: max high, 10 usage

    We would end up doing the following calculation below when calculating
    high delay for A/B:

    A/B: 10 - 9 = 1...
    A: 20 - PAGE_COUNTER_MAX = 21, so set max_overage to 21.

    This gets worse with higher disparities in usage in the parent.

    I have no idea how this disappeared from the final version of the patch,
    but it is certainly Not Good(tm). This wasn't obvious in testing because,
    for a simple cgroup hierarchy with only one child, the result is usually
    roughly the same. It's only in more complex hierarchies that things go
    really awry (although still, the effects are limited to a maximum of 2
    seconds in schedule_timeout_killable()).

    [chris@chrisdown.name: changelog]
    Fixes: e26733e0d0ec ("mm, memcg: throttle allocators based on ancestral memory.high")
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: [5.4.x]
    Link: http://lkml.kernel.org/r/20200331152424.GA1019937@chrisdown.name
    Signed-off-by: Linus Torvalds

    Jakub Kicinski
     
  • Pull block fixes from Jens Axboe:
    "Here's a set of fixes that should go into this merge window. This
    contains:

    - NVMe pull request from Christoph with various fixes

    - Better discard support for loop (Evan)

    - Only call ->commit_rqs() if we have queued IO (Keith)

    - blkcg offlining fixes (Tejun)

    - fix (and fix the fix) for busy partitions"

    * tag 'block-5.7-2020-04-10' of git://git.kernel.dk/linux-block:
    block: fix busy device checking in blk_drop_partitions again
    block: fix busy device checking in blk_drop_partitions
    nvmet-rdma: fix double free of rdma queue
    blk-mq: don't commit_rqs() if none were queued
    nvme-fc: Revert "add module to ops template to allow module references"
    nvme: fix deadlock caused by ANA update wrong locking
    nvmet-rdma: fix bonding failover possible NULL deref
    loop: Better discard support for block devices
    loop: Report EOPNOTSUPP properly
    nvmet: fix NULL dereference when removing a referral
    nvme: inherit stable pages constraint in the mpath stack device
    blkcg: don't offline parent blkcg first
    blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it
    nvme-tcp: fix possible crash in recv error flow
    nvme-tcp: don't poll a non-live queue
    nvme-tcp: fix possible crash in write_zeroes processing
    nvmet-fc: fix typo in comment
    nvme-rdma: Replace comma with a semicolon
    nvme-fcloop: fix deallocation of working context
    nvme: fix compat address handling in several ioctls

    Linus Torvalds
     

09 Apr, 2020

2 commits

  • Pull libnvdimm and dax updates from Dan Williams:
    "There were multiple touches outside of drivers/nvdimm/ this round to
    add cross arch compatibility to the devm_memremap_pages() interface,
    enhance numa information for persistent memory ranges, and add a
    zero_page_range() dax operation.

    This cycle I switched from the patchwork api to Konstantin's b4 script
    for collecting tags (from x86, PowerPC, filesystem, and device-mapper
    folks), and everything looks to have gone ok there. This has all
    appeared in -next with no reported issues.

    Summary:

    - Add support for region alignment configuration and enforcement to
    fix compatibility across architectures and PowerPC page size
    configurations.

    - Introduce 'zero_page_range' as a dax operation. This facilitates
    filesystem-dax operation without a block-device.

    - Introduce phys_to_target_node() to facilitate drivers that want to
    know resulting numa node if a given reserved address range was
    onlined.

    - Advertise a persistence-domain for of_pmem and papr_scm. The
    persistence domain indicates where cpu-store cycles need to reach
    in the platform-memory subsystem before the platform will consider
    them power-fail protected.

    - Promote numa_map_to_online_node() to a cross-kernel generic
    facility.

    - Save x86 numa information to allow for node-id lookups for reserved
    memory ranges, deploy that capability for the e820-pmem driver.

    - Pick up some miscellaneous minor fixes that missed v5.6-final,
    including some smatch reports in the ioctl path and some unit
    test compilation fixups.

    - Fixup some flexible-array declarations"

    * tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
    dax: Move mandatory ->zero_page_range() check in alloc_dax()
    dax,iomap: Add helper dax_iomap_zero() to zero a range
    dax: Use new dax zero page method for zeroing a page
    dm,dax: Add dax zero_page_range operation
    s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
    dax, pmem: Add a dax operation zero_page_range
    pmem: Add functions for reading/writing page to/from pmem
    libnvdimm: Update persistence domain value for of_pmem and papr_scm device
    tools/test/nvdimm: Fix out of tree build
    libnvdimm/region: Fix build error
    libnvdimm/region: Replace zero-length array with flexible-array member
    libnvdimm/label: Replace zero-length array with flexible-array member
    ACPI: NFIT: Replace zero-length array with flexible-array member
    libnvdimm/region: Introduce an 'align' attribute
    libnvdimm/region: Introduce NDD_LABELING
    libnvdimm/namespace: Enforce memremap_compat_align()
    libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
    libnvdimm: Out of bounds read in __nd_ioctl()
    acpi/nfit: improve bounds checking for 'func'
    mm/memremap_pages: Introduce memremap_compat_align()
    ...

    Linus Torvalds
     
  • __get_user_pages_locked() will return 0 instead of -EINTR after commit
    4426e945df588 ("mm/gup: allow VM_FAULT_RETRY for multiple times") which
    added extra code to allow gup detect fatal signal faster.

    Restore the original -EINTR behavior.

    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Fixes: 4426e945df58 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
    Reported-by: syzbot+3be1a33f04dc782e9fd5@syzkaller.appspotmail.com
    Signed-off-by: Hillf Danton
    Acked-by: Michal Hocko
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

08 Apr, 2020

23 commits

  • It's definitely incorrect to mark the lock as taken even if
    down_read_killable() failed.

    This was overlooked when we switched from down_read() to
    down_read_killable(), because down_read() won't fail while
    down_read_killable() can.

    Fixes: 71335f37c5e8 ("mm/gup: allow to react to fatal signals")
    Reported-by: syzbot+a8c70b7f3579fc0587dc@syzkaller.appspotmail.com
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • lookup_node() uses gup to pin the page and get node information. It
    checks against ret>=0 assuming the page will be filled in. However it's
    also possible that gup will return zero, for example, when the thread is
    quickly killed with a fatal signal. Teach lookup_node() to gracefully
    return an error -EFAULT if it happens.

    Meanwhile, initialize "page" to NULL to avoid potential risk of
    exploiting the pointer.

    Fixes: 4426e945df58 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
    Reported-by: syzbot+693dc11fcb53120b5559@syzkaller.appspotmail.com
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • As done in the full WARN() handler, panic_on_warn needs to be cleared
    before calling panic() to avoid recursive panics.

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Acked-by: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Dan Carpenter
    Cc: Elena Petrova
    Cc: "Gustavo A. R. Silva"
    Link: http://lkml.kernel.org/r/20200227193516.32566-6-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • filter_irq_stacks() can be used by other tools (e.g. KMSAN), so it needs
    to be moved to a common location. lib/stackdepot.c seems a good place, as
    filter_irq_stacks() is usually applied to the output of
    stack_trace_save().

    This patch has been previously mailed as part of KMSAN RFC patch series.

    [glider@google.com: nds32: linker script: add SOFTIRQENTRY_TEXT]
    Link: http://lkml.kernel.org/r/20200311121002.241430-1-glider@google.com
    [glider@google.com: add IRQENTRY_TEXT and SOFTIRQENTRY_TEXT to linker script]
    Link: http://lkml.kernel.org/r/20200311121124.243352-1-glider@google.com
    Signed-off-by: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Cc: Vegard Nossum
    Cc: Dmitry Vyukov
    Cc: Marco Elver
    Cc: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Arnd Bergmann
    Cc: Sergey Senozhatsky
    Link: http://lkml.kernel.org/r/20200220141916.55455-3-glider@google.com
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Now that "struct proc_ops" exist we can start putting there stuff which
    could not fly with VFS "struct file_operations"...

    Most of fs/proc/inode.c file is dedicated to make open/read/.../close
    reliable in the event of disappearing /proc entries which usually happens
    if module is getting removed. Files like /proc/cpuinfo which never
    disappear simply do not need such protection.

    Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
    "permanent" files.

    Enable "permanent" flag for

    /proc/cpuinfo
    /proc/kmsg
    /proc/modules
    /proc/slabinfo
    /proc/stat
    /proc/sysvipc/*
    /proc/swaps

    More will come once I figure out a foolproof way to prevent module
    authors from marking their stuff "permanent" for performance reasons
    when it is not.

    This should help with scalability: benchmark is "read /proc/cpuinfo R times
    by N threads scattered over the system".

    N     R      t, s (before)   t, s (after)
    -----------------------------------------------------
    64    4096    1.582458        1.530502       -3.2%
    256   4096    6.371926        6.125168       -3.9%
    1024  4096   25.64888        24.47528        -4.6%

    Benchmark source:

    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    #include <fcntl.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <unistd.h>

    const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
    int N;
    const char *filename;
    int R;

    int xxx = 0;

    int glue(int n)
    {
            cpu_set_t m;
            CPU_ZERO(&m);
            CPU_SET(n, &m);
            return sched_setaffinity(0, sizeof(cpu_set_t), &m);
    }

    void f(int n)
    {
            glue(n % NR_CPUS);

            while (*(volatile int *)&xxx == 0) {
            }

            for (int i = 0; i < R; i++) {
                    int fd = open(filename, O_RDONLY);
                    char buf[4096];
                    ssize_t rv = read(fd, buf, sizeof(buf));
                    asm volatile ("" :: "g" (rv));
                    close(fd);
            }
    }

    int main(int argc, char *argv[])
    {
            if (argc < 4) {
                    std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R\n";
                    return 1;
            }

            N = atoi(argv[1]);
            filename = argv[2];
            R = atoi(argv[3]);

            for (int i = 0; i < NR_CPUS; i++) {
                    if (glue(i) == 0)
                            break;
            }

            std::vector<std::thread> T;
            T.reserve(N);
            for (int i = 0; i < N; i++) {
                    T.emplace_back(f, i);
            }

            auto t0 = std::chrono::system_clock::now();
            {
                    *(volatile int *)&xxx = 1;
                    for (auto& t: T) {
                            t.join();
                    }
            }
            auto t1 = std::chrono::system_clock::now();
            std::chrono::duration<double> dt = t1 - t0;
            std::cout << dt.count() << '\n';

            return 0;
    }

    P.S.:
    An explicit randomization marker is added because adding a non-function
    pointer would silently disable structure layout randomization.

    [akpm@linux-foundation.org: coding style fixes]
    Reported-by: kbuild test robot
    Reported-by: Dan Carpenter
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Cc: Al Viro
    Cc: Joe Perches
    Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Previously there was a check whether 'size' is aligned to 'align', and
    if not, it was aligned. This check was expensive, as both branching and
    division are expensive instructions on most architectures. The 'ALIGN'
    function applied to an already-aligned value will not change it, and as
    it is cheaper than a branch plus a division, it can be executed
    unconditionally and the branch can be removed.

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200320173317.26408-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • Convert the various /* fallthrough */ comments to the pseudo-keyword
    fallthrough;

    Done via script:
    https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Reviewed-by: Gustavo A. R. Silva
    Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • MAX_ZONELISTS is a compile time constant, so it should be compared using
    BUILD_BUG_ON not BUG_ON.

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Wei Yang
    Link: http://lkml.kernel.org/r/20200228224617.11343-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • The @pfn parameter of remap_pfn_range() passed by the caller is actually
    a page-frame number converted from the corresponding physical address of
    kernel memory; the original comment is ambiguous and may mislead users.

    Meanwhile, there is an ambiguous typo "VMM" in the comment of
    vm_area_struct. So fixing them will make the code more readable.

    Signed-off-by: chenqiwu
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/1583026921-15279-1-git-send-email-qiwuchen55@gmail.com
    Signed-off-by: Linus Torvalds

    chenqiwu
     
  • Sparse reports a warning at unpin_tag()

    warning: context imbalance in unpin_tag() - unexpected unlock

    The root cause is the missing annotation at unpin_tag()
    Add the missing __releases(bitlock) annotation

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Acked-by: Minchan Kim
    Link: http://lkml.kernel.org/r/20200214204741.94112-14-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     
  • Sparse reports a warning at pin_tag()

    warning: context imbalance in pin_tag() - wrong count at exit

    The root cause is the missing annotation at pin_tag()
    Add the missing __acquires(bitlock) annotation

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Acked-by: Minchan Kim
    Link: http://lkml.kernel.org/r/20200214204741.94112-13-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     
  • Sparse reports a warning at migrate_read_unlock()

    warning: context imbalance in migrate_read_unlock() - unexpected unlock

    The root cause is the missing annotation at migrate_read_unlock()
    Add the missing __releases(&zspage->lock) annotation

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Acked-by: Minchan Kim
    Link: http://lkml.kernel.org/r/20200214204741.94112-12-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     
  • Sparse reports a warning at migrate_read_lock()

    warning: context imbalance in migrate_read_lock() - wrong count at exit

    The root cause is the missing annotation at migrate_read_lock()
    Add the missing __acquires(&zspage->lock) annotation

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Acked-by: Minchan Kim
    Link: http://lkml.kernel.org/r/20200214204741.94112-11-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     
  • Sparse reports a warning at put_map()

    warning: context imbalance in put_map() - unexpected unlock

    The root cause is the missing annotation at put_map()
    Add the missing __releases(&object_map_lock) annotation

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200214204741.94112-10-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     
  • Sparse reports a warning at get_map()

    warning: context imbalance in get_map() - wrong count at exit

    The root cause is the missing annotation at get_map()
    Add the missing __acquires(&object_map_lock) annotation

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200214204741.94112-9-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     
  • Sparse reports a warning at queue_pages_pmd()

    context imbalance in queue_pages_pmd() - unexpected unlock

    The root cause is the missing annotation at queue_pages_pmd()
    Add the missing __releases(ptl)

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200214204741.94112-8-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     
  • Sparse reports a warning at gather_surplus_pages()

    warning: context imbalance in hugetlb_cow() - unexpected unlock

    The root cause is the missing annotation at gather_surplus_pages()
    Add the missing __must_hold(&hugetlb_lock)

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Link: http://lkml.kernel.org/r/20200214204741.94112-7-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     
  • Sparse reports a warning at compact_lock_irqsave()

    warning: context imbalance in compact_lock_irqsave() - wrong count at exit

    The root cause is the missing annotation at compact_lock_irqsave()
    Add the missing __acquires(lock) annotation.

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200214204741.94112-6-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     
  • The compressed cache for swap pages (zswap) currently needs from 1 to 3
    extra kernel command line parameters in order to make it work: it has to
    be enabled by adding a "zswap.enabled=1" command line parameter and if one
    wants a different compressor or pool allocator than the default lzo / zbud
    combination then these choices also need to be specified on the kernel
    command line in additional parameters.

    Using a different compressor and allocator for zswap is actually pretty
    common, as guides often recommend using the lz4 / z3fold pair instead
    of the default one. In such a case it is also necessary to remember to
    enable the appropriate compression algorithm and pool allocator in the
    kernel config manually.

    Let's avoid the need for adding these kernel command line parameters and
    automatically pull in the dependencies for the selected compressor
    algorithm and pool allocator by adding appropriate default switches to
    Kconfig.
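    As a concrete illustration (parameter names as exposed by the zswap
    module), the lz4 / z3fold setup mentioned above previously required a
    boot command line along the lines of:

```
zswap.enabled=1 zswap.compressor=lz4 zswap.zpool=z3fold
```

    With the new Kconfig defaults the same choice can be baked in at build
    time, and the dependencies on the chosen compressor and allocator are
    pulled in automatically.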

    The default values for these options match what the code was using
    previously as its defaults.

    Signed-off-by: Maciej S. Szmigiero
    Signed-off-by: Andrew Morton
    Reviewed-by: Vitaly Wool
    Link: http://lkml.kernel.org/r/20200202000112.456103-1-mail@maciej.szmigiero.name
    Signed-off-by: Linus Torvalds

    Maciej S. Szmigiero
     
  • I recently built the RISC-V port with LLVM trunk, which has introduced a
    new warning when casting from a pointer to an enum of a smaller size.
    This patch simply casts to a long in the middle to stop the warning. I'd
    be surprised if this is the only one in the kernel, but it's the only
    one I saw.
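    The shape of the fix can be sketched as follows; the enum and function
    names here are made up for illustration, not the RISC-V code the patch
    touches.

```c
#include <assert.h>

/* A pointer-sized cookie that actually encodes a small enum value. */
enum page_kind { KIND_NORMAL = 1, KIND_HUGE = 2 };

/* Casting a pointer directly to an enum of smaller size is what trips
   the new clang warning; casting through 'long' first turns the
   narrowing into an ordinary integer conversion. */
static enum page_kind decode_kind(const void *cookie)
{
	return (enum page_kind)(long)cookie;
}
```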

    Signed-off-by: Palmer Dabbelt
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200227211741.83165-1-palmer@dabbelt.com
    Signed-off-by: Linus Torvalds

    Palmer Dabbelt
     
  • Yang Shi writes:

    Currently, when truncating a shmem file, if the range is partly in a THP
    (start or end is in the middle of THP), the pages actually will just get
    cleared rather than being freed, unless the range covers the whole THP.
    Even though all the subpages are truncated (randomly or sequentially), the
    THP may still be kept in page cache.

    This might be fine for some usecases which prefer preserving THP, but
    balloon inflation is handled in base page size. So when using shmem THP
    as memory backend, QEMU inflation actually doesn't work as expected since
    it doesn't free memory. But the inflation usecase really needs to get the
    memory freed. (Anonymous THP will also not get freed right away, but will
    be freed eventually when all subpages are unmapped: whereas shmem THP
    still stays in page cache.)

    Split THP right away when doing partial hole punch, and if split fails
    just clear the page so that read of the punched area will return zeroes.

    Hugh Dickins adds:

    Our earlier "team of pages" huge tmpfs implementation worked in the way
    that Yang Shi proposes; and we have been using this patch to continue to
    split the huge page when hole-punched or truncated, since converting over
    to the compound page implementation. Although huge tmpfs gives out huge
    pages when available, if the user specifically asks to truncate or punch a
    hole (perhaps to free memory, perhaps to reduce the memcg charge), then
    the filesystem should do so as best it can, splitting the huge page.

    That is not always possible: any additional reference to the huge page
    prevents split_huge_page() from succeeding, so the result can be flaky.
    But in practice it works successfully enough that we've not seen any
    problem from that.

    Add shmem_punch_compound() to encapsulate the decision of when a split is
    needed, and doing the split if so. Using this simplifies the flow in
    shmem_undo_range(); and the first (trylock) pass does not need to do any
    page clearing on failure, because the second pass will either succeed or
    do that clearing. Following the example of zero_user_segment() when
    clearing a partial page, add flush_dcache_page() and set_page_dirty() when
    clearing a hole - though I'm not certain that either is needed.

    But: split_huge_page() would be sure to fail if shmem_undo_range()'s
    pagevec holds further references to the huge page. The easiest way to fix
    that is for find_get_entries() to return early, as soon as it has put one
    compound head or tail into the pagevec. At first this felt like a hack;
    but on examination, this convention better suits all its callers - or will
    do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
    and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
    speedup by checking for compound pages there.
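    The policy described above can be summarized in a schematic model;
    the helper names and return strings below are hypothetical, not the
    actual shmem_punch_compound() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* A split can only succeed when no one else holds an extra reference
   to the huge page, so the outcome can be flaky. */
static bool try_split(int extra_refs)
{
	return extra_refs == 0;
}

/* If the punched range covers the whole huge page it is simply freed.
   A partial punch first tries to split the page; only when the split
   fails does it fall back to clearing the subpages, so that reads of
   the punched area return zeroes. */
static const char *punch(bool covers_whole_thp, int extra_refs)
{
	if (covers_whole_thp)
		return "free";
	if (try_split(extra_refs))
		return "split-and-free";
	return "clear";
}
```

    The find_get_entries() change exists to keep the pagevec itself from
    being one of those extra references during the second pass.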

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Cc: Yang Shi
    Cc: Alexander Duyck
    Cc: "Michael S. Tsirkin"
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Previously 0 was assigned to the variable 'error', but the variable was
    never read before being reassigned later, so the assignment can be
    removed.
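    A minimal illustration of the dead-store pattern being removed; the
    function is made up, only the pattern matches the patch.

```c
#include <assert.h>

static int do_lookup(int key)
{
	int error;                   /* was: int error = 0;  (dead store) */

	/* The first read of 'error' comes only after this unconditional
	   assignment, so the removed initializer had no effect. */
	error = (key < 0) ? -22 : 0; /* -EINVAL-style value, illustrative */
	return error;
}
```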

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Pankaj Gupta
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200301152832.24595-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • Variables declared in a switch statement before any case statements cannot
    be automatically initialized with compiler instrumentation (as they are
    not part of any execution flow). With GCC's proposed automatic stack
    variable initialization feature, this triggers a warning (and they don't
    get initialized). Clang's automatic stack variable initialization (via
    CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also doesn't
    initialize such variables[1]. Note that these warnings (or silent
    skipping) happen before the dead-store elimination optimization phase, so
    even when the automatic initializations are later elided in favor of
    direct initializations, the warnings remain.

    To avoid these problems, move such variables into the "case" where they're
    used or lift them up into the main function body.

    mm/shmem.c: In function `shmem_getpage_gfp':
    mm/shmem.c:1816:10: warning: statement will never be executed [-Wswitch-unreachable]
    1816 | loff_t i_size;
    | ^~~~~~

    [1] https://bugs.llvm.org/show_bug.cgi?id=44916
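    The problematic pattern and both fixes can be sketched as below; the
    function and variable names are illustrative, not the shmem code.

```c
#include <assert.h>

/* A declaration placed between 'switch (...) {' and the first 'case'
   label is never executed, so compiler-driven automatic stack-variable
   initialization cannot reach it. */
static long size_for(int mode, long i_size)
{
	long ret = 0;	/* fix 1: lift the variable into function scope */

	switch (mode) {
	/* before the fix, a declaration such as 'loff_t i_size;' sat
	   right here, ahead of any case label */
	case 0:
		ret = i_size;
		break;
	case 1: {
		long half = i_size / 2;	/* fix 2: scope it inside a case */
		ret = half;
		break;
	}
	default:
		break;
	}
	return ret;
}
```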

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alexander Potapenko
    Link: http://lkml.kernel.org/r/20200220062312.69165-1-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook