07 Jan, 2019

1 commit

  • The semantics of what "in core" means for the mincore() system call are
    somewhat unclear, but Linux has always (since 2.3.52, which is when
    mincore() was initially done) treated it as "page is available in page
    cache" rather than "page is mapped in the mapping".

    The problem with that traditional semantic is that it exposes a lot of
    system cache state that it really probably shouldn't, and that users
    shouldn't really even care about.

    So let's try to avoid that information leak by simply changing the
    semantics to be that mincore() counts actual mapped pages, not pages
    that might be cheaply mapped if they were faulted (note the "might be"
    part of the old semantics: being in the cache doesn't actually guarantee
    that you can access them without IO anyway, since things like network
    filesystems may have to revalidate the cache before use).

    In many ways the old semantics were somewhat insane even aside from the
    information leak issue. From the very beginning (and that beginning is
    a long time ago: 2.3.52 was released in March 2000, I think), the code
    had a comment saying

    Later we can get more picky about what "in core" means precisely.

    and this is that "later". Admittedly it is much later than is really
    comfortable.

    NOTE! This is a real semantic change, and it is for example known to
    change the output of "fincore", since that program literally does an
    mmap without populating it, and then does "mincore()" on that mapping
    that doesn't actually have any pages in it.

    I'm hoping that nobody actually has any workflow that cares, and the
    info leak is real.

    We may have to do something different if it turns out that people have
    valid reasons to want the old semantics, and if we can limit the
    information leak sanely.

    Cc: Kevin Easton
    Cc: Jiri Kosina
    Cc: Masatake YAMATO
    Cc: Andrew Morton
    Cc: Greg KH
    Cc: Peter Zijlstra
    Cc: Michal Hocko
    Signed-off-by: Linus Torvalds

    Linus Torvalds
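
    As an editorial aside, a minimal userspace sketch (not part of the
    commit) of the change described above: map a page of a file without
    touching it and ask mincore() about it. Under the old semantics the
    page reports as resident whenever it sits in the page cache; under the
    new semantics it reports as resident only once it is actually mapped
    here. The file path is an arbitrary placeholder.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        long psize = sysconf(_SC_PAGESIZE);
        int fd = open("/etc/hostname", O_RDONLY);   /* any readable file */
        if (fd < 0) { perror("open"); return 1; }

        /* map one page but do not fault it in, as fincore-style tools do */
        unsigned char *map = mmap(NULL, psize, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        unsigned char vec;
        if (mincore(map, psize, &vec) == 0)
            printf("page resident: %d\n", vec & 1);

        munmap(map, psize);
        close(fd);
        return 0;
    }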
     

06 Jan, 2019

1 commit

  • Merge more updates from Andrew Morton:

    - procfs updates

    - various misc bits

    - lib/ updates

    - epoll updates

    - autofs

    - fatfs

    - a few more MM bits

    * emailed patches from Andrew Morton : (58 commits)
    mm/page_io.c: fix polled swap page in
    checkpatch: add Co-developed-by to signature tags
    docs: fix Co-Developed-by docs
    drivers/base/platform.c: kmemleak ignore a known leak
    fs: don't open code lru_to_page()
    fs/: remove caller signal_pending branch predictions
    mm/: remove caller signal_pending branch predictions
    arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
    kernel/sched/: remove caller signal_pending branch predictions
    kernel/locking/mutex.c: remove caller signal_pending branch predictions
    mm: select HAVE_MOVE_PMD on x86 for faster mremap
    mm: speed up mremap by 20x on large regions
    mm: treewide: remove unused address argument from pte_alloc functions
    initramfs: cleanup incomplete rootfs
    scripts/gdb: fix lx-version string output
    kernel/kcov.c: mark write_comp_data() as notrace
    kernel/sysctl: add panic_print into sysctl
    panic: add options to print system info when panic happens
    bfs: extra sanity checking and static inode bitmap
    exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
    ...

    Linus Torvalds
     

05 Jan, 2019

5 commits

  • swap_readpage() wants to do polling to bring in pages if asked to, but
    it doesn't mark the bio as being polled. Additionally, the looping
    around the blk_poll() check isn't correct - if we get a zero return, we
    should call io_schedule(), we can't just assume that the bio has
    completed. The regular bio->bi_private check should be used for that.

    Link: http://lkml.kernel.org/r/e15243a8-2cdf-c32c-ecee-f289377c8ef9@kernel.dk
    Signed-off-by: Jens Axboe
    Reviewed-by: Andrew Morton
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
  • Multiple filesystems open code lru_to_page(). Rectify this by moving
    the macro from mm_inline (which is specific to lru stuff) to the more
    generic mm.h header and start using the macro where appropriate.

    No functional changes.

    Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com
    Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com
    Signed-off-by: Nikolay Borisov
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Reviewed-by: Mike Rapoport
    Acked-by: Pankaj gupta
    Acked-by: "Yan, Zheng" [ceph]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
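
    For reference (editorial note, not part of the patch), the macro being
    moved is a one-liner over an LRU-style list head, and the open-coded
    pattern it replaces in the filesystems looks roughly like this:

    /* the macro: the page whose ->lru entry sits at the tail of the list */
    #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))

    /* typical caller pattern (page_list is a struct list_head of pages
     * strung together on page->lru, as in the readahead code) */
    while (!list_empty(page_list)) {
        struct page *page = lru_to_page(page_list);

        list_del(&page->lru);
        /* ... process the page ... */
    }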
     
  • This is already done for us internally by the signal machinery.

    Link: http://lkml.kernel.org/r/20181116002713.8474-5-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
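
    A representative before/after of the pattern this series removes
    (editorial illustration; the exact call sites vary). The caller-side
    hint is redundant because signal_pending() already wraps its test in
    unlikely() internally:

    /* before: the caller adds its own branch hint */
    if (unlikely(signal_pending(current)))
        return -EINTR;

    /* after: rely on the hint inside signal_pending() itself */
    if (signal_pending(current))
        return -EINTR;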
     
  • Android needs to mremap large regions of memory during memory management
    related operations. The mremap system call can be really slow if THP is
    not enabled. The bottleneck is move_page_tables, which is copying each
    pte at a time, and can be really slow across a large map. Turning on
    THP may not be a viable option, and is not for us. This patch speeds
    up the performance for non-THP systems by copying at the PMD level
    when possible.

    The speedup is an order of magnitude on x86 (~20x). On a 1GB mremap,
    the mremap completion time drops from 3.4-3.6 milliseconds to 144-160
    microseconds.

    Before:
    Total mremap time for 1GB data: 3521942 nanoseconds.
    Total mremap time for 1GB data: 3449229 nanoseconds.
    Total mremap time for 1GB data: 3488230 nanoseconds.

    After:
    Total mremap time for 1GB data: 150279 nanoseconds.
    Total mremap time for 1GB data: 144665 nanoseconds.
    Total mremap time for 1GB data: 158708 nanoseconds.

    If THP is enabled the optimization is mostly skipped except in certain
    situations.

    [joel@joelfernandes.org: fix 'move_normal_pmd' unused function warning]
    Link: http://lkml.kernel.org/r/20181108224457.GB209347@google.com
    Link: http://lkml.kernel.org/r/20181108181201.88826-3-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Acked-by: Kirill A. Shutemov
    Reviewed-by: William Kucharski
    Cc: Julia Lawall
    Cc: Michal Hocko
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
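
    A rough userspace sketch (editorial, not the benchmark that produced
    the numbers above) of how such a measurement can be taken: populate a
    1GB anonymous mapping, then force mremap() to move it and time the
    call.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1UL << 30;                 /* 1GB */
        char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        void *dst = mmap(NULL, len, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (src == MAP_FAILED || dst == MAP_FAILED) { perror("mmap"); return 1; }
        memset(src, 1, len);                    /* populate the page tables */

        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        /* MREMAP_FIXED forces the page tables to be moved to dst */
        void *p = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
        clock_gettime(CLOCK_MONOTONIC, &b);
        if (p == MAP_FAILED) { perror("mremap"); return 1; }

        printf("Total mremap time for 1GB data: %ld nanoseconds.\n",
               (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec));
        return 0;
    }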
     
  • Patch series "Add support for fast mremap".

    This series speeds up the mremap(2) syscall by copying page tables at
    the PMD level even for non-THP systems. There is concern that the
    extra 'address' argument that mremap passes to pte_alloc may do
    something subtly architecture-related in the future that may make the
    scheme not work. Also, we find that there is no point in passing the
    'address' to pte_alloc since it is unused. This patch therefore
    removes this argument tree-wide, resulting in a nice negative diff as
    well. Along the way we also ensured that the enabled architectures do
    not do anything funky with the 'address' argument that goes unnoticed
    by the optimization.

    Build and boot tested on x86-64. Build tested on arm64. The config
    enablement patch for arm64 will be posted in the future after more
    testing.

    The changes were obtained by applying the following Coccinelle script
    (thanks to Julia for answering all Coccinelle questions!). The
    following fix-ups were done manually:
    * Removal of the address argument from pte_fragment_alloc
    * Removal of the pte_alloc_one_fast definitions from m68k and microblaze.

    // Options: --include-headers --no-includes
    // Note: I split the 'identifier fn' line, so if you are manually
    // running it, please unsplit it so it runs for you.

    virtual patch

    @pte_alloc_func_def depends on patch exists@
    identifier E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    type T2;
    @@

    fn(...
    - , T2 E2
    )
    { ... }

    @pte_alloc_func_proto_noarg depends on patch exists@
    type T1, T2, T3, T4;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1, T2);
    + T3 fn(T1);
    |
    - T3 fn(T1, T2, T4);
    + T3 fn(T1, T2);
    )

    @pte_alloc_func_proto depends on patch exists@
    identifier E1, E2, E4;
    type T1, T2, T3, T4;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1 E1, T2 E2);
    + T3 fn(T1 E1);
    |
    - T3 fn(T1 E1, T2 E2, T4 E4);
    + T3 fn(T1 E1, T2 E2);
    )

    @pte_alloc_func_call depends on patch exists@
    expression E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    fn(...
    -, E2
    )

    @pte_alloc_macro depends on patch exists@
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    identifier a, b, c;
    expression e;
    position p;
    @@

    (
    - #define fn(a, b, c) e
    + #define fn(a, b) e
    |
    - #define fn(a, b) e
    + #define fn(a) e
    )

    Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Suggested-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Julia Lawall
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
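
    Concretely, the transformation the script performs looks like this
    (editorial illustration with representative prototypes and a typical
    fault-path call site; the exact signatures vary per architecture):

    /* before: an address is threaded through that the allocators never use */
    int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);

        if (pte_alloc(mm, pmd, address))
            return VM_FAULT_OOM;

    /* after: the unused 'address' argument is dropped tree-wide */
    int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);

        if (pte_alloc(mm, pmd))
            return VM_FAULT_OOM;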
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
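
    The conversion itself is mechanical; a typical call site changes like
    this (representative example, not a specific file):

    /* before */
    if (!access_ok(VERIFY_WRITE, buf, count))
        return -EFAULT;

    /* after: the type argument is gone, only the range is checked */
    if (!access_ok(buf, count))
        return -EFAULT;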
     

03 Jan, 2019

1 commit

  • This mostly reverts commit 849a370016a5 ("block: avoid ordered task
    state change for polled IO"). It was wrongly claiming that the ordering
    wasn't necessary. The memory barrier _is_ necessary.

    If something is truly polling and not going to sleep, it's the whole
    state setting that is unnecessary, not the memory barrier. Whenever you
    set your state to a sleeping state, you absolutely need the memory
    barrier.

    Note that sometimes the memory barrier can be elsewhere. For example,
    the ordering might be provided by an external lock, or by setting the
    process state to sleeping before adding yourself to the wait queue list
    that is used for waking up (where the wait queue lock itself will
    guarantee that any wakeup will correctly see the sleeping state).

    But none of those cases were true here.

    NOTE! Some of the polling paths may indeed be able to drop the state
    setting entirely, at which point the memory barrier also goes away.

    (Also note that this doesn't revert the TASK_RUNNING cases: there is no
    race between a wakeup and setting the process state to TASK_RUNNING,
    since the end result doesn't depend on ordering).

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
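
    The pattern at issue, as an editorial sketch (not the exact
    swap_readpage() code): whenever a task publishes a sleeping state and
    then tests a condition that a waker may already have changed, the
    state store and the condition load must be ordered, which
    set_current_state() guarantees.

    static bool done;                   /* stands in for the completion flag */
    static struct task_struct *waiter;  /* the task doing the polling/sleeping */

    /* sleeper side */
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE); /* state store + full barrier */
        if (READ_ONCE(done))                     /* condition load ordered after it */
            break;
        io_schedule();
    }
    __set_current_state(TASK_RUNNING);

    /* waker side */
    WRITE_ONCE(done, true);
    wake_up_process(waiter);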
     

30 Dec, 2018

2 commits

  • Pull documentation update from Jonathan Corbet:
    "A fairly normal cycle for documentation stuff. We have a new document
    on perf security, more Italian translations, more improvements to the
    memory-management docs, improvements to the pathname lookup
    documentation, and the usual array of smaller fixes.

    As is often the case, there are a few reaches outside of
    Documentation/ to adjust kerneldoc comments"

    * tag 'docs-5.0' of git://git.lwn.net/linux: (38 commits)
    docs: improve pathname-lookup document structure
    configfs: fix wrong name of struct in documentation
    docs/mm-api: link slab_common.c to "The Slab Cache" section
    slab: make kmem_cache_create{_usercopy} description proper kernel-doc
    doc:process: add links where missing
    docs/core-api: make mm-api.rst more structured
    x86, boot: documentation whitespace fixup
    Documentation: devres: note checking needs when converting
    doc:it: add some process/* translations
    doc:it: fixes in process/1.Intro
    Documentation: convert path-lookup from markdown to resturctured text
    Documentation/admin-guide: update admin-guide index.rst
    Documentation/admin-guide: introduce perf-security.rst file
    scripts/kernel-doc: Fix struct and struct field attribute processing
    Documentation: dev-tools: Fix typos in index.rst
    Correct gen_init_cpio tool's documentation
    Document /proc/pid PID reuse behavior
    Documentation: update path-lookup.md for parallel lookups
    Documentation: Use "while" instead of "whilst"
    dmaengine: Add mailing list address to the documentation
    ...

    Linus Torvalds
     
  • Pull percpu update from Dennis Zhou:
    "Michael Cree noted generic UP Alpha has been broken since v3.18. This
    is a small fix for locking in UP percpu code that fixes the issue"

    * 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
    percpu: convert spin_lock_irq to spin_lock_irqsave.

    Linus Torvalds
     

29 Dec, 2018

29 commits

  • Merge misc updates from Andrew Morton:

    - large KASAN update to use arm's "software tag-based mode"

    - a few misc things

    - sh updates

    - ocfs2 updates

    - just about all of MM

    * emailed patches from Andrew Morton : (167 commits)
    kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
    memcg, oom: notify on oom killer invocation from the charge path
    mm, swap: fix swapoff with KSM pages
    include/linux/gfp.h: fix typo
    mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
    hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
    hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
    memory_hotplug: add missing newlines to debugging output
    mm: remove __hugepage_set_anon_rmap()
    include/linux/vmstat.h: remove unused page state adjustment macro
    mm/page_alloc.c: allow error injection
    mm: migrate: drop unused argument of migrate_page_move_mapping()
    blkdev: avoid migration stalls for blkdev pages
    mm: migrate: provide buffer_migrate_page_norefs()
    mm: migrate: move migrate_page_lock_buffers()
    mm: migrate: lock buffers before migrate_page_move_mapping()
    mm: migration: factor out code to compute expected number of page references
    mm, page_alloc: enable pcpu_drain with zone capability
    kmemleak: add config to select auto scan
    mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
    ...

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:
    "This is the main pull request for block/storage for 4.21.

    Larger than usual, it was a busy round with lots of goodies queued up.
    Most notable is the removal of the old IO stack, which has been a long
    time coming. No new features for a while, everything coming in this
    week has all been fixes for things that were previously merged.

    This contains:

    - Use atomic counters instead of semaphores for mtip32xx (Arnd)

    - Cleanup of the mtip32xx request setup (Christoph)

    - Fix for circular locking dependency in loop (Jan, Tetsuo)

    - bcache (Coly, Guoju, Shenghui)
    * Optimizations for writeback caching
    * Various fixes and improvements

    - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
    * host and target support for NVMe over TCP
    * Error log page support
    * Support for separate read/write/poll queues
    * Much improved polling
    * discard OOM fallback
    * Tracepoint improvements

    - lightnvm (Hans, Hua, Igor, Matias, Javier)
    * Igor added packed metadata to pblk. Now drives without metadata
    per LBA can be used as well.
    * Fix from Geert on uninitialized value on chunk metadata reads.
    * Fixes from Hans and Javier to pblk recovery and write path.
    * Fix from Hua Su to fix a race condition in the pblk recovery
    code.
    * Scan optimization added to pblk recovery from Zhoujie.
    * Small geometry cleanup from me.

    - Conversion of the last few drivers that used the legacy path to
    blk-mq (me)

    - Removal of legacy IO path in SCSI (me, Christoph)

    - Removal of legacy IO stack and schedulers (me)

    - Support for much better polling, now without interrupts at all.
    blk-mq adds support for multiple queue maps, which enables us to
    have a map per type. This in turn enables nvme to have separate
    completion queues for polling, which can then be interrupt-less.
    Also means we're ready for async polled IO, which is hopefully
    coming in the next release.

    - Killing of (now) unused block exports (Christoph)

    - Unification of the blk-rq-qos and blk-wbt wait handling (Josef)

    - Support for zoned testing with null_blk (Masato)

    - sx8 conversion to per-host tag sets (Christoph)

    - IO priority improvements (Damien)

    - mq-deadline zoned fix (Damien)

    - Ref count blkcg series (Dennis)

    - Lots of blk-mq improvements and speedups (me)

    - sbitmap scalability improvements (me)

    - Make core inflight IO accounting per-cpu (Mikulas)

    - Export timeout setting in sysfs (Weiping)

    - Cleanup the direct issue path (Jianchao)

    - Export blk-wbt internals in block debugfs for easier debugging
    (Ming)

    - Lots of other fixes and improvements"

    * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
    kyber: use sbitmap add_wait_queue/list_del wait helpers
    sbitmap: add helpers for add/del wait queue handling
    block: save irq state in blkg_lookup_create()
    dm: don't reuse bio for flushes
    nvme-pci: trace SQ status on completions
    nvme-rdma: implement polling queue map
    nvme-fabrics: allow user to pass in nr_poll_queues
    nvme-fabrics: allow nvmf_connect_io_queue to poll
    nvme-core: optionally poll sync commands
    block: make request_to_qc_t public
    nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
    nvme-tcp: fix endianess annotations
    nvmet-tcp: fix endianess annotations
    nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
    nvme-pci: only set nr_maps to 2 if poll queues are supported
    nvmet: use a macro for default error location
    nvmet: fix comparison of a u16 with -1
    blk-mq: enable IO poll if .nr_queues of type poll > 0
    blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
    blk-mq: skip zero-queue maps in blk_mq_map_swqueue
    ...

    Linus Torvalds
     
  • Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
    eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move
    out_of_memory back to the charge path") has moved the oom handling back to
    the charge path. While doing so the notification was left behind in
    mem_cgroup_oom_synchronize.

    Fix the issue by replicating the oom hierarchy locking and the
    notification.

    Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org
    Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
    Signed-off-by: Michal Hocko
    Reported-by: Burt Holzman
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    KSM pages may be mapped to multiple VMAs that cannot be reached from
    one anon_vma. So during swapin, a new copy of the page needs to be
    generated if a different anon_vma is needed; please refer to the
    comments of ksm_might_need_to_copy() for details.

    During swapoff, unuse_vma() uses the anon_vma (if available) to locate
    the VMA and virtual address mapped to the page, so not all mappings to
    a swapped-out KSM page can be found. So in try_to_unuse(), even if the
    swap count of a swap entry isn't zero, the page needs to be deleted
    from the swap cache, so that in the next round a new page can be
    allocated and swapped in for the other mappings of the swapped-out KSM
    page.

    But this conflicts with THP swap support, where the THP can be deleted
    from the swap cache only after the swap count of every swap entry in
    the huge swap cluster backing the THP has reached 0. So try_to_unuse()
    was changed in commit e07098294adf ("mm, THP, swap: support to reclaim
    swap space for THP swapped out") to check that before deleting a page
    from the swap cache, but this has broken KSM swapoff too.

    Fortunately, KSM is for normal pages only, so the original behavior
    for KSM pages can be restored easily by checking PageTransCompound().
    That is how this patch works.

    The bug was introduced by e07098294adf ("mm, THP, swap: support to
    reclaim swap space for THP swapped out"), which was merged in
    v4.14-rc1, so I think we should backport the fix to 4.14 and later.
    But Hugh thinks it may be rare for KSM pages to be in the swap device
    at swapoff time, which is why nobody has reported the bug so far.

    Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com
    Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out")
    Signed-off-by: "Huang, Ying"
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Daniel Jordan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • The kbuild robot reported the following on a development branch that used
    memremap.h in a new path:

    In file included from arch/m68k/include/asm/pgtable_mm.h:148:0,
    from arch/m68k/include/asm/pgtable.h:5,
    from include/linux/memremap.h:7,
    from drivers//dax/bus.c:3:
    arch/m68k/include/asm/motorola_pgtable.h: In function 'pgd_offset':
    >> arch/m68k/include/asm/motorola_pgtable.h:199:11: error: dereferencing pointer to incomplete type 'const struct mm_struct'
    return mm->pgd + pgd_index(address);
    ^~

    The ->page_fault() callback is specific to HMM. Move it to 'struct
    hmm_devmem' where the unusual asm/pgtable.h dependency can be contained in
    include/linux/hmm.h. Longer term refactoring this dependency out of HMM
    is recommended, but in the meantime memremap.h remains generic.

    Link: http://lkml.kernel.org/r/154534090899.3120190.6652620807617715272.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 5042db43cc26 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE memory...")
    Signed-off-by: Dan Williams
    Reviewed-by: "Jérôme Glisse"
    Cc: Logan Gunthorpe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • hugetlbfs page faults can race with truncate and hole punch operations.
    Current code in the page fault path attempts to handle this by 'backing
    out' operations if we encounter the race. One obvious omission in the
    current code is removing a page newly added to the page cache. This is
    pretty straightforward to address, but there is a more subtle and
    difficult issue of backing out hugetlb reservations. To handle this
    correctly, the 'reservation state' before page allocation needs to be
    noted so that it can be properly backed out. There are four distinct
    possibilities for reservation state: shared/reserved, shared/no-resv,
    private/reserved and private/no-resv. Backing out a reservation may
    require memory allocation which could fail so that needs to be taken into
    account as well.

    Instead of writing the required complicated code for this rare occurrence,
    just eliminate the race. i_mmap_rwsem is now held in read mode for the
    duration of page fault processing. Hold i_mmap_rwsem longer in the
    truncation and hole punch code to cover the call to remove_inode_hugepages.

    With this modification, code in remove_inode_hugepages checking for races
    becomes 'dead' as it can no longer happen. Remove the dead code and
    expand comments to explain reasoning. Similarly, checks for races with
    truncation in the page fault path can be simplified and removed.

    [mike.kravetz@oracle.com: incorporate suggestions from Kirill]
    Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
    Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • While looking at BUGs associated with invalid huge page map counts, it was
    discovered and observed that a huge pte pointer could become 'invalid' and
    point to another task's page table. Consider the following:

    A task takes a page fault on a shared hugetlbfs file and calls
    huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
    shared pmd.

    Now, another task truncates the hugetlbfs file. As part of truncation, it
    unmaps everyone who has the file mapped. If the range being truncated is
    covered by a shared pmd, huge_pmd_unshare will be called. For all but the
    last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
    to the pmd. If the task in the middle of the page fault is not the last
    user, the ptep returned by huge_pte_alloc now points to another task's
    page table or worse. This leads to bad things such as incorrect page
    map/reference counts or invalid memory references.

    To fix, expand the use of i_mmap_rwsem as follows:

    - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
    huge_pmd_share is only called via huge_pte_alloc, so callers of
    huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
    of huge_pte_alloc continue to hold the semaphore until finished with the
    ptep.

    - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
    called.

    [mike.kravetz@oracle.com: add explicit check for mapping != null]
    Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc: Colin Ian King
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
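
    A condensed sketch of the resulting locking scheme (editorial; the
    helpers are the existing hugetlb/i_mmap ones, but the call sites are
    heavily simplified):

    /* fault path: hold i_mmap_rwsem for read across allocation and use of ptep */
    i_mmap_lock_read(mapping);
    ptep = huge_pte_alloc(mm, address, huge_page_size(hstate_vma(vma)));
    /* ... handle the fault using ptep, which may sit in a shared pmd ... */
    i_mmap_unlock_read(mapping);

    /* truncation / hole punch path: exclusive while pmds may be unshared */
    i_mmap_lock_write(mapping);
    /* ... unmap the range; huge_pmd_unshare() may clear the pud here ... */
    i_mmap_unlock_write(mapping);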
     
    This function has been identical to __page_set_anon_rmap() since the
    time it was introduced (8 years ago). The patch removes the function
    and makes its users call __page_set_anon_rmap() instead.

    Link: http://lkml.kernel.org/r/154504875359.30235.6237926369392564851.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Jerome Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Model call chain after should_failslab(). Likewise, we can now use a
    kprobe to override the return value of should_fail_alloc_page() and inject
    allocation failures into alloc_page*().

    This will allow injecting allocation failures using the BCC tools even
    without building kernel with CONFIG_FAIL_PAGE_ALLOC and booting it with a
    fail_page_alloc= parameter, which incurs some overhead even when failures
    are not being injected. On the other hand, this patch adds an
    unconditional call to should_fail_alloc_page() from page allocation
    hotpath. That overhead should be rather negligible with
    CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though.

    [vbabka@suse.cz: changelog addition]
    Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com
    Signed-off-by: Benjamin Poirier
    Acked-by: Vlastimil Babka
    Cc: Arnd Bergmann
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Poirier
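
    The "model after should_failslab()" part amounts to exposing a small
    out-of-line hook that the error-injection framework is allowed to
    override; roughly (editorial sketch, not the literal diff):

    /* out-of-line so a kprobe/BPF error-injection point can override the
     * return value; TRUE marks the injectable return type as boolean */
    noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
    {
        return __should_fail_alloc_page(gfp_mask, order);
    }
    ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);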
     
  • All callers of migrate_page_move_mapping() now pass NULL for 'head'
    argument. Drop it.

    Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Provide a variant of buffer_migrate_page() that also checks whether there
    are no unexpected references to buffer heads. This function will then be
    safe to use for block device pages.

    [akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)]
    Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
    buffer_migrate_page() is the only caller of migrate_page_lock_buffers(),
    so move it close to its caller and also drop the now unused stub for
    !CONFIG_BLOCK.

    Link: http://lkml.kernel.org/r/20181211172143.7358-4-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Lock buffers before calling into migrate_page_move_mapping() so that that
    function doesn't have to know about buffers (which is somewhat unexpected
    anyway) and all the buffer head logic is in buffer_migrate_page().

    Link: http://lkml.kernel.org/r/20181211172143.7358-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "mm: migrate: Fix page migration stalls for blkdev pages".

    This patchset deals with page migration stalls that were reported by our
    customer due to a block device page that had a bufferhead that was in the
    bh LRU cache.

    The patchset modifies the page migration code so that bufferheads are
    completely handled inside buffer_migrate_page() and then provides a new
    migration helper for pages with buffer heads that is safe to use even for
    block device pages and that also deals with bh lrus.

    This patch (of 6):

    Factor out a function to compute the number of expected page references
    in migrate_page_move_mapping(). Note that we move the hpage_nr_pages()
    and page_has_private() checks from under xas_lock_irq(); however, this
    is safe since we hold the page lock.

    [jack@suse.cz: fix expected_page_refs()]
    Link: http://lkml.kernel.org/r/20181217131710.GB8611@quack2.suse.cz
    Link: http://lkml.kernel.org/r/20181211172143.7358-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • drain_all_pages is documented to drain per-cpu pages for a given zone (if
    non-NULL). The current implementation doesn't match the description
    though. It will drain all pcp pages for all zones that happen to have
    cached pages on the same cpu as the given zone. This will lead to
    premature pcp cache draining for zones that are not of any interest to the
    caller - e.g. compaction, hwpoison or memory offline.

    This forces the page allocator to take locks and can cause lock
    contention as a result.

    There is no real reason for this sub-optimal implementation. Replace
    per-cpu work item with a dedicated structure which contains a pointer to
    the zone and pass it over to the worker. This will get the zone
    information all the way down to the worker function and do the right job.

    [akpm@linux-foundation.org: avoid 80-col tricks]
    [mhocko@suse.com: refactor the whole changelog]
    Link: http://lkml.kernel.org/r/20181212142550.61686-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Reviewed-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
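
    The shape of the change, as an editorial sketch of the idea rather
    than the literal diff: give each CPU's drain work an explicit zone
    pointer instead of relying on a bare work_struct.

    /* per-cpu drain request carrying the target zone */
    struct pcpu_drain {
        struct zone *zone;
        struct work_struct work;
    };
    static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);

    static void drain_local_pages_wq(struct work_struct *work)
    {
        struct pcpu_drain *drain = container_of(work, struct pcpu_drain, work);

        /* drain only the pcp lists of the zone the caller asked for */
        drain_local_pages(drain->zone);
    }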
     
  • Kmemleak scan can be cpu intensive and can stall user tasks at times. To
    prevent this, add config DEBUG_KMEMLEAK_AUTO_SCAN to enable/disable auto
    scan on boot up. Also protect first_run with DEBUG_KMEMLEAK_AUTO_SCAN
    as this is meant only for the first automatic scan.

    Link: http://lkml.kernel.org/r/1540231723-7087-1-git-send-email-prpatel@nvidia.com
    Signed-off-by: Sri Krishna chowdary
    Signed-off-by: Sachin Nikam
    Signed-off-by: Prateek
    Reviewed-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sri Krishna chowdary
     
    When CONFIG_KASAN is enabled on large memory SMP systems, the deferred
    page initialization can take a long time. Below were the reported init
    times on an 8-socket 96-core 4TB IvyBridge system.

    1) Non-debug kernel without CONFIG_KASAN
    [ 8.764222] node 1 initialised, 132086516 pages in 7027ms

    2) Debug kernel with CONFIG_KASAN
    [ 146.288115] node 1 initialised, 132075466 pages in 143052ms

    So the page init time in a debug kernel was 20x that of the non-debug
    kernel. The long init time can be problematic as the page
    initialization is done with interrupts disabled. In this particular
    case, it caused the appearance of the following warning messages as
    well as NMI backtraces of all the cores that were doing the
    initialization.

    [ 68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [ 68.241000] rcu: 25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
    [ 68.241000] rcu: 44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
    [ 68.241000] rcu: 54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
    [ 68.241000] rcu: 60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
    [ 68.241000] rcu: 72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
    [ 68.241000] rcu: 84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
    [ 68.241000] rcu: 111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
    [ 68.241000] rcu: (detected by 13, t=65018 jiffies, g=249, q=2)

    The long init time was mainly caused by the call to kasan_free_pages() to
    poison the newly initialized pages. On a 4TB system, we are talking about
    almost 500GB of memory probably on the same node.

    In reality, we may not need to poison the newly initialized pages before
    they are ever allocated. So KASAN poisoning of freed pages before the
    completion of deferred memory initialization is now disabled. Those pages
    will be properly poisoned when they are allocated or freed after deferred
    pages initialization is done.

    With this change, the new page initialization time became:

    [ 21.948010] node 1 initialised, 132075466 pages in 18702ms

    This was still about double the non-debug kernel time, but was much
    better than before.

    Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Andrew Morton
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Michal Hocko
    Cc: Pasha Tatashin
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
    Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated in
    code. If someone adds an extra migrate type, they may forget to
    enlarge NR_PAGEBLOCK_BITS, so some safeguard is needed.

    NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, but these macros live in
    two different header files with a reverse dependency, which makes it a
    little hard to refer to MIGRATE_TYPES in pageblock-flags.h. This patch
    encodes that relation so that it is checked at compile time.

    Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com
    Signed-off-by: Pingfan Liu
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pingfan Liu
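
    One way to express such a compile-time reminder (editorial sketch; the
    patch may encode it differently, and PB_MIGRATETYPE_BITS is a
    hypothetical name used only for illustration):

    /* the pageblock flags reserve 3 bits for the migratetype */
    #define PB_MIGRATETYPE_BITS 3

    /* placed in any always-compiled function, this fails the build if
     * MIGRATE_TYPES outgrows the bits reserved for it */
    BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_MIGRATETYPE_BITS));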
     
  • ksm thread unconditionally sleeps in ksm_scan_thread() after each
    iteration:

    schedule_timeout_interruptible(
    msecs_to_jiffies(ksm_thread_sleep_millisecs))

    The timeout is configured in /sys/kernel/mm/ksm/sleep_millisecs.

    If the user writes a big value by mistake, and the thread enters
    schedule_timeout_interruptible(), it's not possible to cancel the
    sleep by writing a new smaller value; the thread just sleeps until the
    timeout expires.

    The patch fixes the problem by waking the thread each time after the value
    is updated.

    This may also be useful for debugging purposes, and for userspace
    daemons that change the sleep_millisecs value depending on system load.

    Link: http://lkml.kernel.org/r/154454107680.3258.3558002210423531566.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Cyrill Gorcunov
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
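
    The fix follows a common kernel pattern (editorial sketch, heavily
    simplified): sleep on a waitqueue with a timeout keyed to the value
    that was current when the iteration started, and have the sysfs store
    handler wake the scanner so that a smaller value takes effect
    immediately.

    static DECLARE_WAIT_QUEUE_HEAD(ksm_iter_wait);

    /* in ksm_scan_thread(), instead of a bare schedule_timeout_interruptible() */
    unsigned int sleep_ms = READ_ONCE(ksm_thread_sleep_millisecs);

    wait_event_interruptible_timeout(ksm_iter_wait,
            sleep_ms != READ_ONCE(ksm_thread_sleep_millisecs),
            msecs_to_jiffies(sleep_ms));

    /* in the sleep_millisecs store handler (new_value parsed from userspace) */
    WRITE_ONCE(ksm_thread_sleep_millisecs, new_value);
    wake_up_interruptible(&ksm_iter_wait);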
     
  • filemap_map_pages takes a speculative reference to each page in the range
    before it tries to lock that page. While this is correct it also can
    influence page migration which will bail out when seeing an elevated
    reference count. The faultaround code would bail on seeing a locked
    page, so we can proactively check the PageLocked bit before
    page_cache_get_speculative() and avoid pointless reference count churn.

    Link: http://lkml.kernel.org/r/20181211142741.2607-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: David Hildenbrand
    Acked-by: Hugh Dickins
    Reviewed-by: William Kucharski
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Memory migration might fail during offlining and we keep retrying in that
    case. This is currently obfuscated by a goto retry loop. The code is
    hard to follow and as a result it is even suboptimal because each
    retry round scans the full range from start_pfn even though we have
    successfully scanned/migrated the [start_pfn, pfn] range already. This
    is all only because a check_pages_isolated failure has to rescan the
    full range again.

    De-obfuscate the migration retry loop by promoting it to a real for loop.
    In fact remove the goto altogether by making it a proper double loop
    (yeah, gotos are nasty in this specific case). In the end we will get
    slightly more optimal code which is more readable.

    [akpm@linux-foundation.org: reflow comments to 80 cols]
    Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Pavel Tatashin
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "few memory offlining enhancements".

    I have been chasing memory offlining not making progress recently. On
    the way I have noticed a few weird decisions in the code. The
    migration itself is restricted without a reasonable justification and
    the retry loop around the migration is quite messy. This is addressed
    by patch 1 and patch 2.

    Patch 3 targets the faultaround code which has been a hot candidate
    for the initial issue reported upstream [2] and which I am debugging
    internally. It turned out not to be the main contributor in the end
    but I believe we should address it regardless. See the patch
    description for more details.

    [1] http://lkml.kernel.org/r/20181120134323.13007-1-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv

    This patch (of 3):

    do_migrate_range has been limiting the number of pages to migrate to 256
    for some reason which is not documented. Even if the limit made some
    sense back then when it was introduced it doesn't really serve a good
    purpose these days. If the range contains huge pages then we break out of
    the loop too early and go through LRU and pcp caches draining and
    scan_movable_pages is quite suboptimal.

    The only reason to limit the number of pages I can think of is to reduce
    the potential time to react on the fatal signal. But even then the number
    of pages is a questionable metric because even a single page migration
    might block in a non-killable state (e.g. __unmap_and_move).

    Remove the limit and offline the full requested range (this is one
    memblock worth of pages with the current code). Should we ever get a
    report that offlining takes too long to react to a fatal signal then
    we should rather fix the core migration to use killable waits and bail
    out on a signal.

    Link: http://lkml.kernel.org/r/20181211142741.2607-1-mhocko@kernel.org
    Link: http://lkml.kernel.org/r/20181211142741.2607-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Oscar Salvador
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Userspace falls short when trying to find out whether a specific
    memory range is eligible for THP. There are usecases that would like
    to know that; from
    http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com :

    : This is used to identify heap mappings that should be able to fault thp
    : but do not, and they normally point to a low-on-memory or fragmentation
    : issue.

    The only way to deduce this now is to query for the hg resp. nh flags
    and compare the state with the global setting. Except that there is
    also PR_SET_THP_DISABLE that might change the picture. So the final
    logic is not trivial. Moreover the eligibility of the vma depends on
    the type of VMA as well. In the past we have supported only anonymous
    memory VMAs but things have changed and shmem based vmas are supported
    as well these days and the query logic gets even more complicated
    because the eligibility depends on the mount option and another global
    configuration knob.

    Simplify the current state and report the THP eligibility in
    /proc/<pid>/smaps for each existing vma. Reuse
    transparent_hugepage_enabled for this purpose. The original
    implementation of this function assumes that the caller knows that the
    vma itself is supported for THP so make the core checks into
    __transparent_hugepage_enabled and use it for existing callers.
    __show_smap just uses the new transparent_hugepage_enabled which also
    checks the vma support status (please note that this one has to be out
    of line due to include dependency issues).

    [mhocko@kernel.org: fix oops with NULL ->f_mapping]
    Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: David Rientjes
    Cc: Jan Kara
    Cc: Mike Rapoport
    Cc: Paul Oppenheimer
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
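
    After this change userspace can read the per-VMA eligibility directly;
    a minimal check (editorial example, assuming the "THPeligible:" field
    name added by this patch):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[512];
        FILE *f = fopen("/proc/self/smaps", "r");
        if (!f) { perror("fopen"); return 1; }

        while (fgets(line, sizeof(line), f)) {
            /* "THPeligible:   0|1" is emitted once per VMA */
            if (strncmp(line, "THPeligible:", 12) == 0)
                fputs(line, stdout);
        }

        fclose(f);
        return 0;
    }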
     
    To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the
    mmu_notifier invalidate_range_start/end calls. No functional changes
    with this patch.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Patch series "mmu notifier contextual informations", v2.

    This patchset adds contextual information, why an invalidation is
    happening, to the mmu notifier callbacks. This is necessary for users
    of mmu notifiers that wish to maintain their own data structures
    without having to add new fields to struct vm_area_struct (vma).

    For instance, a device can have its own page table that mirrors the
    process address space. When a vma is unmapped (munmap() syscall) the
    device driver can free the device page table for the range.

    Today we do not have any information on why an mmu notifier callback
    is happening and thus the device driver has to assume that it is
    always an munmap(). This is inefficient as it means that it needs to
    re-allocate the device page table on the next page fault and rebuild
    the whole device driver data structure for the range.

    Other use cases besides munmap() also exist; for instance it is
    pointless for the device driver to invalidate the device page table
    when the invalidation is for soft dirtiness tracking. Or the device
    driver can optimize away an mprotect() that only changes the page
    table permission access for the range.

    This patchset enables all these optimizations for device drivers. I do
    not include any of those in this series but another patchset I am
    posting will leverage this.

    The patchset is pretty simple from a code point of view. The first two
    patches consolidate all mmu notifier arguments into a struct so that it is
    easier to add/change arguments. The last patch adds the contextual
    information (munmap, protection, soft dirty, clear, ...).

    This patch (of 3):

    To avoid having to change many callback definitions every time we want
    to add a parameter, use a structure to group all parameters for the
    mmu_notifier invalidate_range_start/end callback. No functional changes
    with this patch.

    [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
    Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Jan Kara
    Acked-by: Jason Gunthorpe [infiniband]
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
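
    The gist of the consolidation (editorial sketch; the field list is
    abridged and not the exact kernel definition): callbacks take one
    range descriptor instead of a growing list of scalar arguments.

    /* grouped arguments for the invalidation callbacks */
    struct mmu_notifier_range {
        struct mm_struct *mm;
        unsigned long start;
        unsigned long end;
        /* the last patch in the series adds a field describing why the
         * invalidation happens (munmap, protection change, soft dirty, ...) */
    };

    /* callbacks and callers then pass a single descriptor, e.g. */
    int (*invalidate_range_start)(struct mmu_notifier *mn,
                                  const struct mmu_notifier_range *range);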
     
    We have received a bug report that an injected MCE about faulty memory
    prevents memory offline from succeeding on a 4.4-based kernel. The
    underlying reason was that the HWPoison page has an elevated reference
    count and the migration keeps failing. There are two problems with
    that. First of all it is dubious to migrate the poisoned page because
    we know that accessing that memory may fail. Secondly it doesn't make
    any sense to migrate potentially broken content and preserve the
    memory corruption over to a new location.

    Oscar has found out that 4.4 and the current upstream kernels behave
    slightly differently with his simple testcase

    ===

    /* includes and a userspace PAGE_ALIGN added so the testcase builds as-is */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/mman.h>

    #define PAGE_ALIGN(addr) (((addr) + 4095UL) & ~4095UL)

    int main(void)
    {
        int ret;
        int i;
        int fd;
        char *array = malloc(4096);
        char *array_locked = malloc(4096);

        fd = open("/tmp/data", O_RDONLY);
        read(fd, array, 4095);

        for (i = 0; i < 4096; i++)
            array_locked[i] = 'd';

        /* note: sizeof(array_locked) is the size of a pointer, as in the
         * original report */
        ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
        if (ret)
            perror("mlock");

        sleep(20);

        ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
        if (ret)
            perror("madvise");

        for (i = 0; i < 4096; i++)
            array_locked[i] = 'd';

        return 0;
    }
    ===

    + offline this memory.

    In 4.4 kernels he saw the hwpoisoned page being returned back to the
    LRU list:
    kernel: [] dump_trace+0x59/0x340
    kernel: [] show_stack_log_lvl+0xea/0x170
    kernel: [] show_stack+0x21/0x40
    kernel: [] dump_stack+0x5c/0x7c
    kernel: [] warn_slowpath_common+0x81/0xb0
    kernel: [] __pagevec_lru_add_fn+0x14c/0x160
    kernel: [] pagevec_lru_move_fn+0xad/0x100
    kernel: [] __lru_cache_add+0x6c/0xb0
    kernel: [] add_to_page_cache_lru+0x46/0x70
    kernel: [] extent_readpages+0xc3/0x1a0 [btrfs]
    kernel: [] __do_page_cache_readahead+0x177/0x200
    kernel: [] ondemand_readahead+0x168/0x2a0
    kernel: [] generic_file_read_iter+0x41f/0x660
    kernel: [] __vfs_read+0xcd/0x140
    kernel: [] vfs_read+0x7a/0x120
    kernel: [] kernel_read+0x3b/0x50
    kernel: [] do_execveat_common.isra.29+0x490/0x6f0
    kernel: [] do_execve+0x28/0x30
    kernel: [] call_usermodehelper_exec_async+0xfb/0x130
    kernel: [] ret_from_fork+0x55/0x80

    And the latter confuses the hotremove path because an LRU page is
    attempted to be migrated and that fails due to an elevated reference
    count. It is quite possible that the reuse of the HWPoisoned page is some
    kind of fixed race condition but I am not really sure about that.

    With the upstream kernel the failure is slightly different. The page
    doesn't seem to have LRU bit set but isolate_movable_page simply fails and
    do_migrate_range simply puts all the isolated pages back to LRU and
    therefore no progress is made and scan_movable_pages finds same set of
    pages over and over again.

    Fix both cases by explicitly checking HWPoisoned pages before we even
    try to get a reference on the page, and try to unmap it if it is still
    mapped. As
    explained by Naoya:

    : Hwpoison code never unmapped those for no big reason because
    : Ksm pages never dominate memory, so we simply didn't have strong
    : motivation to save the pages.

    Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
    HWPoison pages which shouldn't happen but I couldn't convince myself about
    that. Naoya has noted the following:

    : Theoretically no such gurantee, because try_to_unmap() doesn't have a
    : guarantee of success and then memory_failure() returns immediately
    : when hwpoison_user_mappings fails.
    : Or the following code (comes after hwpoison_user_mappings block) also implies
    : that the target page can still have PageLRU flag.
    :
    : /*
    : * Torn down by someone else?
    : */
    : if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
    : action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
    : res = -EBUSY;
    : goto out;
    : }
    :
    : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
    : current version of your patch.

    Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Debugged-by: Oscar Salvador
    Tested-by: Oscar Salvador
    Acked-by: David Hildenbrand
    Acked-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • kmemleak_scan() goes through all online nodes and tries to scan all used
    pages.

    We can do better and use pfn_to_online_page(), so in case we have
    CONFIG_MEMORY_HOTPLUG, offlined pages will be skipped automatically. For
    boxes where CONFIG_MEMORY_HOTPLUG is not present, pfn_to_online_page()
    will fallback to pfn_valid().

    Another little optimization is to check if the page belongs to the node we
    are currently checking, so in case we have nodes interleaved we will not
    check the same pfn multiple times.

    I ran some tests:

    Add some memory to node1 and node2 making it interleaved:

    (qemu) object_add memory-backend-ram,id=ram0,size=1G
    (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
    (qemu) object_add memory-backend-ram,id=ram1,size=1G
    (qemu) device_add pc-dimm,id=dimm1,memdev=ram1,node=2
    (qemu) object_add memory-backend-ram,id=ram2,size=1G
    (qemu) device_add pc-dimm,id=dimm2,memdev=ram2,node=1

    Then, we offline that memory:
    # for i in {32..39} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;done
    # for i in {48..55} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;done
    # for i in {40..47} ; do echo "offline" > /sys/devices/system/node/node2/memory$i/state;done

    And we run kmemleak_scan:

    # echo "scan" > /sys/kernel/debug/kmemleak

    before the patch:

    kmemleak: time spend: 41596 us

    after the patch:

    kmemleak: time spend: 34899 us

    [akpm@linux-foundation.org: remove stray newline, per Oscar]
    Link: http://lkml.kernel.org/r/20181206131918.25099-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Suggested-by: Michal Hocko
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
    page is never NULL here, so we may remove this useless check.

    Link: http://lkml.kernel.org/r/154419752044.18559.2452963074922917720.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Cyrill Gorcunov
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Since commit 03e85f9d5f1 ("mm/page_alloc: Introduce
    free_area_init_core_hotplug"), some functions changed to only be called
    during system initialization. Concretely, free_area_init_node() and the
    functions that hang from it.

    Also, some variables are no longer used after the system has gone
    through initialization. So this could be considered as a late clean-up
    for that patch.

    This patch changes the functions from __meminit to __init, and the
    variables from __meminitdata to __initdata.

    In return, we get some KBs back:

    Before:
    Freeing unused kernel image memory: 2472K

    After:
    Freeing unused kernel image memory: 2480K

    Link: http://lkml.kernel.org/r/20181204111507.4808-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Cc: Alexander Duyck
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador