09 Sep, 2017

1 commit

  • Patch series "HMM (Heterogeneous Memory Management)", v25.

    Heterogeneous Memory Management (HMM) (description and justification)

    Today device drivers expose a dedicated memory allocation API through
    their device file, often relying on a combination of IOCTL and mmap
    calls. The device can only access and use memory allocated through this
    API. This effectively splits a program's address space into objects
    allocated for, and usable by, the device and other regular memory
    (malloc, mmap of a file, shared memory, ...) accessible only by the CPU
    (or in a very limited way by a device, through pinned memory).

    Allowing different isolated components of a program to use a device
    thus requires duplicating the input data structures using the device
    memory allocator. This is reasonable for simple data structures
    (arrays, grids, images, ...) but it gets extremely complex with
    advanced data structures (lists, trees, graphs, ...) that rely on a web
    of memory pointers. This is becoming a serious limitation on the kind
    of workloads that can be offloaded to devices like GPUs.

    New industry standards like C++, OpenCL and CUDA are pushing to remove
    this barrier. This requires a shared address space between the GPU
    device and the CPU so that the GPU can access any memory of a process
    (while still obeying memory protection such as read-only). This kind of
    feature is also appearing in various other operating systems.

    HMM is a set of helpers to facilitate several aspects of address space
    sharing and device memory management. Unlike existing sharing
    mechanisms that rely on pinning pages used by a device, HMM relies on
    mmu_notifier to propagate CPU page table updates to the device page
    table.

    Duplicating the CPU page table is only one aspect necessary for
    efficiently using a device like a GPU. GPU local memory has bandwidth
    in the terabytes/second range, but it is connected to main memory
    through a system bus like PCIe, which is limited to 32 GB/s (PCIe 4.0
    x16). Thus it is necessary to allow migration of process memory from
    main system memory to device memory. The issue is that on platforms
    that only have PCIe, the device memory is not accessible by the CPU
    with the same properties as main memory (cache coherency, atomic
    operations, ...).

    To allow migration from main memory to device memory, HMM provides a
    set of helpers to hotplug device memory as a new type of ZONE_DEVICE
    memory which is un-addressable by the CPU but still has struct page
    representing it. This allows most of the core kernel logic that deals
    with a process's memory to stay oblivious to the peculiarities of
    device memory.

    When a page backing an address of a process is migrated to device
    memory, the CPU page table entry is set to a new, specific swap entry.
    CPU access to such an address triggers a migration back to system
    memory, just as if the page had been swapped out to disk. HMM also
    blocks anyone from pinning a ZONE_DEVICE page, so that it can always be
    migrated back to system memory if the CPU accesses it. Conversely, HMM
    does not migrate to device memory any page that is pinned in system
    memory.

    To allow efficient migration between device memory and main memory, a
    new migrate_vma() helper is added with this patchset. It makes it
    possible to leverage the device DMA engine to perform the copy
    operation.
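
    A hedged sketch of how a driver might use the new helper follows; the
    migrate_vma() prototype and the migrate_vma_ops callbacks shown here
    are my reading of this series and may differ in detail, and the mydrv_*
    names are purely hypothetical.

    static void mydrv_alloc_and_copy(struct vm_area_struct *vma,
                                     const unsigned long *src,
                                     unsigned long *dst,
                                     unsigned long start, unsigned long end,
                                     void *private)
    {
            /* Allocate device pages for the migrating range and kick the
             * device DMA engine to copy src -> dst. */
    }

    static void mydrv_finalize_and_map(struct vm_area_struct *vma,
                                       const unsigned long *src,
                                       const unsigned long *dst,
                                       unsigned long start, unsigned long end,
                                       void *private)
    {
            /* Wait for the DMA copies to finish and update the device
             * page table for the pages that did migrate. */
    }

    static const struct migrate_vma_ops mydrv_migrate_ops = {
            .alloc_and_copy   = mydrv_alloc_and_copy,
            .finalize_and_map = mydrv_finalize_and_map,
    };

    static int mydrv_migrate_range(struct vm_area_struct *vma,
                                   unsigned long start, unsigned long end,
                                   unsigned long *src, unsigned long *dst,
                                   void *drv)
    {
            /* src[] and dst[] carry one encoded entry per page in range. */
            return migrate_vma(&mydrv_migrate_ops, vma, start, end,
                               src, dst, drv);
    }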

    This feature will be used by upstream drivers like nouveau and mlx5,
    and probably others in the future (amdgpu is the next suspect in line).
    We are actively working on nouveau and mlx5 support. To test this
    patchset we also worked with NVidia's closed source driver team; they
    have more resources than us to test this kind of infrastructure and
    also a bigger and better userspace eco-system with various real
    industry workloads they can use to test and profile HMM.

    The expected workload is a program that builds a data set on the CPU
    (from disk, from the network, from sensors, ...). The program uses a
    GPU API (OpenCL, CUDA, ...) to give hints on memory placement for the
    input data and also for the output buffer. The program calls the GPU
    API to schedule a GPU job; this happens through device driver specific
    ioctls. All this is hidden from the programmer's point of view in the
    case of a C++ compiler that transparently offloads some parts of a
    program to the GPU. The program can keep doing other work on the CPU
    while the GPU is crunching numbers.

    It is expected that the CPU will not access the same data set as the
    GPU while the GPU is working on it, but this is not mandatory. In fact
    we expect some small memory objects to be actively accessed by both GPU
    and CPU concurrently, as synchronization channels and/or for monitoring
    purposes. Such objects will stay in system memory and should not be
    bottlenecked by system bus bandwidth (rare write and read accesses from
    both CPU and GPU).

    As we are relying on the device driver API, HMM does not introduce any
    new syscall nor does it modify any existing ones. It does not change
    any POSIX semantics or behaviors. For instance, the child of a process
    that is using HMM will not be impacted in any way after a fork, nor is
    there any data hazard between child COW or parent COW of memory that
    was migrated to the device prior to the fork.

    HMM assumes a number of hardware features. The device must allow its
    page table to be updated at any time (i.e. device jobs must be
    preemptible). The device page table must provide memory protection
    such as read-only. The device must track write accesses (dirty bit).
    The device must have a minimum granularity that matches PAGE_SIZE
    (i.e. 4k).

    Reviewers (just a hint):
    Patch 1  HMM documentation
    Patch 2  introduces the core infrastructure and definitions of HMM; a
             pretty small patch and easy to review
    Patch 3  introduces the mirror functionality of HMM; it relies on
             mmu_notifier and thus someone familiar with that part would be
             in a better position to review
    Patch 4  is a helper to snapshot the CPU page table while synchronizing
             with concurrent page table updates. Understanding mmu_notifier
             makes review easier.
    Patch 5  is mostly a wrapper around handle_mm_fault()
    Patch 6  adds a new add_pages() helper to avoid modifying each arch's
             memory hotplug function
    Patch 7  adds a new memory type for ZONE_DEVICE and also adds all the
             logic in various core mm paths to support this new type. Dan
             Williams and any core mm contributor are the best people to
             review each half of this patch
    Patch 8  special-cases HMM ZONE_DEVICE pages inside put_page(); Kirill
             and Dan Williams are the best people to review this
    Patch 9  allows uncharging a page from a memory cgroup without using
             the lru list field of struct page (best reviewers: Johannes
             Weiner or Vladimir Davydov or Michal Hocko)
    Patch 10 adds support to uncharge a ZONE_DEVICE page from a memory
             cgroup (best reviewers: Johannes Weiner or Vladimir Davydov or
             Michal Hocko)
    Patch 11 adds a helper to hotplug un-addressable device memory as a new
             type of ZONE_DEVICE memory (the new type introduced in patch 3
             of this series). This is boilerplate code around memory
             hotplug and it also picks a free range of physical addresses
             for the device memory. Note that the physical addresses do not
             point to anything (at least as far as the kernel knows).
    Patch 12 introduces a new hmm_device class as a helper for device
             drivers that want to expose multiple device memories under a
             common fake device driver. This is useful for multi-GPU
             configurations. Anyone familiar with device driver
             infrastructure can review this. Boilerplate code really.
    Patch 13 adds a new migrate mode. Anyone familiar with page migration
             is welcome to review.
    Patch 14 introduces a new migration helper (migrate_vma()) that allows
             migrating a range of virtual addresses of a process using a
             device DMA engine to perform the copy. It is not limited to
             copies from and to the device but can also copy between any
             kind of source and destination memory. Again, anyone familiar
             with the migration code should be able to verify the logic.
    Patch 15 optimizes the new migrate_vma() by unmapping pages while we
             are collecting them. This can be reviewed by any mm folks.
    Patch 16 adds un-addressable memory migration to the helper introduced
             in patch 7; this can be reviewed by anyone familiar with the
             migration code
    Patch 17 adds a feature that allows a device to allocate non-present
             pages on the GPU when migrating a range of addresses to device
             memory. This is a helper for device drivers to avoid having to
             first allocate system memory before migrating to device memory
    Patch 18 adds a new kind of ZONE_DEVICE memory for cache coherent
             device memory (CDM)
    Patch 19 adds a helper to hotplug CDM memory

    Previous patchset posting :
    v1 http://lwn.net/Articles/597289/
    v2 https://lkml.org/lkml/2014/6/12/559
    v3 https://lkml.org/lkml/2014/6/13/633
    v4 https://lkml.org/lkml/2014/8/29/423
    v5 https://lkml.org/lkml/2014/11/3/759
    v6 http://lwn.net/Articles/619737/
    v7 http://lwn.net/Articles/627316/
    v8 https://lwn.net/Articles/645515/
    v9 https://lwn.net/Articles/651553/
    v10 https://lwn.net/Articles/654430/
    v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
    v12 http://www.kernelhub.org/?msg=972982&p=2
    v13 https://lwn.net/Articles/706856/
    v14 https://lkml.org/lkml/2016/12/8/344
    v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1304107.html
    v16 http://www.spinics.net/lists/linux-mm/msg119814.html
    v17 https://lkml.org/lkml/2017/1/27/847
    v18 https://lkml.org/lkml/2017/3/16/596
    v19 https://lkml.org/lkml/2017/4/5/831
    v20 https://lwn.net/Articles/720715/
    v21 https://lkml.org/lkml/2017/4/24/747
    v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html
    v23 https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1404788.html
    v24 https://lwn.net/Articles/726691/

    This patch (of 19):

    This adds documentation for HMM (Heterogeneous Memory Management). It
    presents the motivation behind it, the features necessary for it to be
    useful, and gives an overview of how this is implemented.

    Link: http://lkml.kernel.org/r/20170817000548.32038-2-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: John Hubbard
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Balbir Singh
    Cc: Aneesh Kumar
    Cc: Benjamin Herrenschmidt
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

07 Sep, 2017

2 commits

  • If the system has more than one swap device and the swap devices carry
    node information, we can make use of this information in
    get_swap_pages() to decide which swap device to use and get better
    performance.

    The current code uses a priority based list, swap_avail_list, to decide
    which swap device to use, and if multiple swap devices share the same
    priority, they are used round robin. This patch changes the previous
    single global swap_avail_list into a per-numa-node list, i.e. each numa
    node sees its own priority based list of available swap devices. A
    swap device's priority can be promoted on its matching node's
    swap_avail_list.

    A swap device's priority is currently set as follows: the user can set
    a value >= 0, or the system will pick one starting from -1 and going
    downwards. The priority value in the swap_avail_list is the negated
    value of the swap device's priority, because the plist is sorted from
    low to high. The new policy doesn't change the semantics for the
    priority >= 0 cases; the previous "starting from -1 then downwards" now
    becomes "starting from -2 then downwards", and -1 is reserved as the
    promoted value.

    Take a 4-node EX machine as an example and suppose 4 swap devices are
    available, each sitting on a different node:
    swapA on node 0
    swapB on node 1
    swapC on node 2
    swapD on node 3

    After they are all swapped on in the sequence A, B, C, D:

    Current behaviour:
    their priorities will be:
    swapA: -1
    swapB: -2
    swapC: -3
    swapD: -4
    And their positions in the global swap_avail_list will be:
    swapA -> swapB -> swapC -> swapD
    prio:1   prio:2   prio:3   prio:4

    New behaviour:
    their priorities will be (note that -1 is skipped):
    swapA: -2
    swapB: -3
    swapC: -4
    swapD: -5
    And their positions in the 4 swap_avail_lists[nid] will be:
    swap_avail_lists[0]: /* node 0's available swap device list */
    swapA -> swapB -> swapC -> swapD
    prio:1   prio:3   prio:4   prio:5
    swap_avail_lists[1]: /* node 1's available swap device list */
    swapB -> swapA -> swapC -> swapD
    prio:1   prio:2   prio:4   prio:5
    swap_avail_lists[2]: /* node 2's available swap device list */
    swapC -> swapA -> swapB -> swapD
    prio:1   prio:2   prio:3   prio:5
    swap_avail_lists[3]: /* node 3's available swap device list */
    swapD -> swapA -> swapB -> swapC
    prio:1   prio:2   prio:3   prio:4
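
    The promotion rule above can be summarized with a small hedged sketch;
    this is only an illustration of the behaviour described here, not the
    kernel's actual code, and effective_plist_prio() is a made-up name.

    /* Value stored in node nid's swap_avail_lists[nid] plist for a swap
     * device with priority swap_prio that sits on node swap_node. */
    static int effective_plist_prio(int swap_prio, int swap_node, int nid)
    {
            /* System-assigned priorities now start at -2; -1 is reserved
             * for promotion on the device's own node. User-set priorities
             * (>= 0) keep the old semantics and are never promoted. */
            if (swap_prio < 0 && nid == swap_node)
                    return 1;       /* negation of the promoted -1 */
            return -swap_prio;      /* plist is sorted from low to high */
    }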

    To see the effect of the patch, a test is used that starts N processes,
    each of which mmaps a region of anonymous memory and then continually
    writes to it at random positions to trigger both swap-in and swap-out.

    On a 2-node Skylake EP machine with 64GiB memory, two 170GB SSD drives
    are used as swap devices, each attached to a different node; the result
    is:

    runtime=30m/processes=32/total test size=128G/each process mmap region=4G
    kernel        throughput
    vanilla       13306
    auto-binding  15169  (+14%)

    runtime=30m/processes=64/total test size=128G/each process mmap region=2G
    kernel        throughput
    vanilla       11885
    auto-binding  14879  (+25%)

    [aaron.lu@intel.com: v2]
    Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
    Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
    [akpm@linux-foundation.org: use kmalloc_array()]
    Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
    Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
    Signed-off-by: Aaron Lu
    Cc: "Chen, Tim C"
    Cc: Huang Ying
    Cc: Andi Kleen
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • Patch series "cleanup zonelists initialization", v1.

    This is aimed at cleaning up the zonelists initialization code we have;
    the primary motivation was bug report [2], which got resolved, but the
    usage of stop_machine is just too ugly to live. Most patches are
    straightforward but 3 of them need special consideration.

    Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
    because this is a user visible change. As I argue in the patch
    description I do not think we have a strong usecase for it these days.
    I have kept sysctl in place and warn into the log if somebody tries to
    configure zone lists ordering. If somebody has a real usecase for it we
    can revert this patch but I do not expect anybody will actually notice
    runtime differences. This patch is not strictly needed for the rest but
    it made patch 6 easier to implement.

    Patch 7 removes stop_machine from build_all_zonelists without adding any
    special synchronization between iterators and updater which I _believe_
    is acceptable as explained in the changelog. I hope I am not missing
    anything.

    Patch 8 then removes zonelists_mutex, which is kind of ugly as well and
    not really needed AFAICS, but care should be taken when double checking
    my thinking.

    This patch (of 9):

    Supporting zone ordered zonelists costs us just a lot of code while the
    usefulness is arguable, if existent at all. Mel has already made node
    ordering the default on 64b systems. 32b systems are still using
    ZONELIST_ORDER_ZONE because it is considered better to fall back to a
    different NUMA node rather than consume precious lowmem zones.

    This argument is, however, weakened by the fact that memory reclaim has
    been reworked to be node rather than zone oriented. This means that
    lowmem requests have to skip over all highmem pages on the LRUs already
    and so zone ordering doesn't save much reclaim time. So the only
    advantage of zone ordering is under light memory pressure, when highmem
    requests do not ever hit lowmem zones and the lowmem pressure doesn't
    need to reclaim.

    Considering that 32b NUMA systems are rather suboptimal already and it
    is generally advisable to use a 64b kernel on such HW, I believe we
    should rather care about code maintainability and just get rid of
    ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
    somebody tries to set zone ordering either from the kernel command line
    or the sysctl.

    [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
    Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Abdul Haleem
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Jul, 2017

1 commit

  • Without a max deduplication limit for each KSM page, the list of the
    rmap_items associated with each stable_node can grow infinitely large.

    During the rmap walk each entry can take up to ~10usec to process
    because of IPIs for the TLB flushing (both for the primary MMU and the
    secondary MMUs with the MMU notifier). With only 16GB of address space
    shared in the same KSM page, that would amount to dozens of seconds of
    kernel runtime.

    A ~256 max deduplication factor will reduce the latencies of the rmap
    walks on KSM pages to the order of a few msec. Just doing the
    cond_resched() during the rmap walks is not enough; the list size must
    have a limit too, otherwise the caller could get blocked in (schedule
    friendly) kernel computations for seconds, unexpectedly.

    There's room for optimization to significantly reduce the IPI delivery
    cost during page_referenced(), but at least for page migration in the
    KSM case (used by hard NUMA bindings, compaction and NUMA balancing) it
    may be inevitable to send lots of IPIs if each rmap_item->mm is active
    on a different CPU and there are lots of CPUs. Even if we ignore the
    IPI delivery cost, we still have to walk the whole KSM rmap list, so we
    can't allow millions or billions (unlimited) of entries in the KSM
    stable_node rmap_item lists.

    The limit is enforced efficiently by adding a second dimension to the
    stable rbtree. So there are three types of stable_nodes: the regular
    ones (identical as before, living in the first flat dimension of the
    stable rbtree), the "chains" and the "dups".

    Every "chain" and all "dups" linked into a "chain" enforce the invariant
    that they represent the same write protected memory content, even if
    each "dup" will be pointed by a different KSM page copy of that content.
    This way the stable rbtree lookup computational complexity is unaffected
    if compared to an unlimited max_sharing_limit. It is still enforced
    that there cannot be KSM page content duplicates in the stable rbtree
    itself.

    Adding the second dimension to the stable rbtree only after the
    max_page_sharing limit hits provides for a zero memory footprint
    increase on 64bit archs. The memory overhead of the per-KSM-page
    stable_tree and per-virtual-mapping rmap_item is unchanged. Only after
    the max_page_sharing limit hits do we need to allocate a stable_tree
    "chain" and rb_replace() the "regular" stable_node with the newly
    allocated stable_node "chain". After that we simply add the "regular"
    stable_node to the chain as a stable_node "dup" by linking hlist_dup
    into the stable_node_chain->hlist. This way the "regular" (flat)
    stable_node is converted to a stable_node "dup" living in the second
    dimension of the stable rbtree.

    During stable rbtree lookups the stable_node "chain" is identified as
    stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
    is_stable_node_chain()).

    When dropping stable_nodes, the stable_node "dup" is identified as
    stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).

    STABLE_NODE_DUP_HEAD must be a unique valid pointer never used
    elsewhere in any stable_node->head/node, to avoid clashes with the
    stable_node->node.rb_parent_color pointer, and it must be different
    from &migrate_nodes. So the second field of &migrate_nodes is picked
    and verified as always safe with a BUILD_BUG_ON in case the list_head
    implementation changes in the future.

    STABLE_NODE_CHAIN is picked as a random negative value for
    stable_node->rmap_hlist_len, which cannot become negative when the node
    is a "regular" stable_node or a stable_node "dup".

    The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn
    is aliased in a union with a time field used to rate limit the
    stable_node_chain->hlist prunes.

    The garbage collection of the stable_node_chain happens lazily during
    stable rbtree lookups (as for all other kind of stable_nodes), or while
    disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the
    entire stable rbtree.

    While the "regular" stable_nodes and the stable_node "dups" must wait
    for their underlying tree_page to be freed before they can be freed
    themselves, the stable_node "chains" can be freed immediately if the
    stable_node->hlist turns empty. This is because the "chains" are never
    pointed by any page->mapping and they're effectively stable rbtree KSM
    self contained metadata.

    [akpm@linux-foundation.org: fix non-NUMA build]
    Signed-off-by: Andrea Arcangeli
    Tested-by: Petr Holasek
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Arjan van de Ven
    Cc: Evgheni Dereveanchin
    Cc: Andrey Ryabinin
    Cc: Gavin Guo
    Cc: Jay Vosburgh
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

09 May, 2017

1 commit


04 May, 2017

1 commit

  • Adding a brief overview of hugetlbfs reservation design and
    implementation as an aid to those making code modifications in this
    area.

    Link: http://lkml.kernel.org/r/1491586995-13085-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Hillf Danton
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

10 Mar, 2017

1 commit

  • Patch series "userfaultfd non-cooperative further update for 4.11 merge
    window".

    Unfortunately I noticed one relevant bug in userfaultfd_exit while
    doing more testing. I had been testing before, and this was also
    tested by the kbuild bot and exercised by the selftest, but this bug
    never reproduced before.

    I dropped userfaultfd_exit as a result. I dropped it because of the
    implementation difficulty of receiving signals in __mmput and because I
    think -ENOSPC as a result from the background UFFDIO_COPY should be
    enough already.

    Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
    wasn't exercised by the selftest, and when I tried to exercise it,
    after moving it to a more correct place in __mmput where it would make
    more sense and where the vma list is stable, it resulted in the
    event_wait_completion being stuck in D state. So then I added the
    second patch to be sure that even if we call
    userfaultfd_event_wait_completion too late during task exit(), we won't
    risk generating tasks in D state. The same check exists in
    handle_userfault() for the same reason, except it makes a difference
    there, while here it is just a robustness check and it's run under
    WARN_ON_ONCE.

    While looking at the userfaultfd_event_wait_completion() function I
    also looked back at its callers, and I think it's not ok to stop
    executing dup_fctx on the fcs list, because we rely on
    userfaultfd_event_wait_completion to execute
    userfaultfd_ctx_put(fctx->orig), which is paired against
    userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
    list_add(fcs). This change only takes care of fctx->orig, but this
    area also needs further review looking for similar problems in
    fctx->new.

    The only patch that is urgent is the first, because it's a use after
    free during an SMP race condition that affects all processes if
    CONFIG_USERFAULTFD=y. It is very hard to reproduce though, and
    probably impossible without SLUB poisoning enabled.

    This patch (of 3):

    I once reproduced this oops with the userfaultfd selftest; it's not
    easily reproducible and it requires SLUB poisoning to reproduce.

    general protection fault: 0000 [#1] SMP
    Modules linked in:
    CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G ------------ T 3.10.0+ #15
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
    task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
    RIP: 0010:[] [] userfaultfd_exit+0x29/0xa0
    RSP: 0018:ffff8801f833fe80 EFLAGS: 00010202
    RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
    RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
    R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
    FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
    do_exit+0x297/0xd10
    SyS_exit+0x17/0x20
    tracesys+0xdd/0xe2
    Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
    RIP [] userfaultfd_exit+0x29/0xa0
    RSP
    ---[ end trace 9fecd6dcb442846a ]---

    In the debugger I located the "mm" pointer on the stack, and walking
    mm->mmap->vm_next through to the end shows that the vma->vm_next list
    is fully consistent and is a null terminated list as expected. So this
    has to be an SMP race condition where userfaultfd_exit was running
    while the vma list was being modified by another CPU.

    When userfaultfd_exit() ran, one of the ->vm_next pointers pointed to
    SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).

    The reason is that it's not running in __mmput but while there are
    still other threads running, and it's not holding the mmap_sem (it
    can't, as it has to wait for the event to be received by the manager).
    So this is a use after free that was happening for all processes.

    One more implementation problem aside from the race condition:
    userfaultfd_exit really has to check a flag in mm->flags before walking
    the vmas, or it's going to slow down the exit() path for regular tasks.

    One more implementation problem: at that point signals can't be
    delivered, so it would also create a task in D state if the manager
    doesn't read the event.

    The major design issue: it overall looks superfluous as the manager can
    check for -ENOSPC in the background transfer:

    if (mmget_not_zero(ctx->mm)) {
            [..]
    } else {
            return -ENOSPC;
    }

    It's safer to roll it back and re-introduce it later if at all.

    [rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
    Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mike Rapoport
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

28 Feb, 2017

1 commit

  • Fix typos and add the following to the scripts/spelling.txt:

    an user||a user
    an userspace||a userspace

    I also added "userspace" to the list since it is a common word in
    Linux. I found some instances of "an userfaultfd", but I did not add
    it to the list. I felt it would be endless to find words that start
    with "user", such as "userland" etc., so I had to draw a line
    somewhere.

    Link: http://lkml.kernel.org/r/1481573103-11329-4-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     

25 Feb, 2017

3 commits

  • If madvise(2) advice results in the underlying vma being split and the
    number of areas mapped by the process would exceed
    /proc/sys/vm/max_map_count as a result, return ENOMEM instead of
    EAGAIN.

    EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
    is temporarily unavailable. It indicates that userspace should retry
    the advice in the near future. This is important for advice such as
    MADV_DONTNEED, which is often used by malloc implementations to free
    memory back to the system: we really do want to free memory back when
    madvise(2) returns EAGAIN because the necessary slab allocations (for
    vmas, anon_vmas, or mempolicies) could not be satisfied.

    Encountering /proc/sys/vm/max_map_count is not a temporary failure,
    however, so return ENOMEM to indicate this is a more serious issue. A
    followup patch to the man page will specify this behavior.
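
    A minimal userspace sketch of the retry semantics described above;
    free_range() is a hypothetical helper name and the error handling is
    deliberately simplistic.

    #include <errno.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Retry MADV_DONTNEED on transient EAGAIN; treat ENOMEM (e.g. the
     * max_map_count limit) and other errors as permanent. */
    static int free_range(void *addr, size_t len)
    {
            while (madvise(addr, len, MADV_DONTNEED) != 0) {
                    if (errno != EAGAIN)
                            return -1;      /* ENOMEM etc.: do not retry */
            }
            return 0;
    }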

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Jonathan Corbet
    Cc: Johannes Weiner
    Cc: Jerome Marchand
    Cc: "Kirill A. Shutemov"
    Cc: Michael Kerrisk
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Add documentation about new userfaultfd features and events

    Link: http://lkml.kernel.org/r/1487716431-5551-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Some architectures have a set of zero pages (coloured zero pages)
    instead of only one zero page, in order to improve the cache
    performance. In those cases, the kernel samepage merger (KSM) would
    merge all the allocated pages that happen to be filled with zeroes to
    the same deduplicated page, thus losing all the advantages of coloured
    zero pages.

    This behaviour is noticeable when a process accesses large arrays of
    allocated pages containing zeroes. A test I conducted on s390 shows
    that there is a speed penalty when KSM merges such pages, compared to
    not merging them or using actual zero pages from the start without
    breaking the COW.

    This patch fixes this behaviour. When coloured zero pages are present,
    the checksum of a zero page is calculated during initialisation, and
    compared with the checksum of the current candidate during merging. In
    case of a match, the normal merging routine is used to merge the page
    with the correct coloured zero page, which ensures the candidate page is
    checked to be equal to the target zero page.
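
    A hedged sketch of the check just described; zero_checksum and
    ksm_use_zero_pages are the variables mentioned further down in this
    entry, calc_checksum() stands in for KSM's existing page checksum
    helper, and try_merge_with_zero_page() is a made-up name.

    static bool ksm_use_zero_pages;  /* sysfs toggle, default off */
    static u32 zero_checksum;        /* checksum of a zero page, set at init */

    static bool try_merge_with_zero_page(struct page *page)
    {
            if (!ksm_use_zero_pages)
                    return false;
            if (calc_checksum(page) != zero_checksum)
                    return false;
            /* Hand the page to the normal merge routine against the zero
             * page of the matching colour, which still does the full
             * memcmp before actually merging. */
            return true;
    }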

    A sysfs entry is also added to toggle this behaviour, since it can
    potentially introduce performance regressions, especially on
    architectures without coloured zero pages. The default value is
    disabled, for backwards compatibility.

    With this patch, the performance with KSM is the same as with non
    COW-broken actual zero pages, which is also the same as without KSM.

    [akpm@linux-foundation.org: make zero_checksum and ksm_use_zero_pages __read_mostly, per Andrea]
    [imbrenda@linux.vnet.ibm.com: documentation for coloured zero pages deduplication]
    Link: http://lkml.kernel.org/r/1484927522-1964-1-git-send-email-imbrenda@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1484850953-23941-1-git-send-email-imbrenda@linux.vnet.ibm.com
    Signed-off-by: Claudio Imbrenda
    Cc: Christian Borntraeger
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Imbrenda
     

23 Feb, 2017

3 commits

  • Merge updates from Andrew Morton:
    "142 patches:

    - DAX updates

    - various misc bits

    - OCFS2 updates

    - most of MM"

    * emailed patches from Andrew Morton : (142 commits)
    mm/z3fold.c: limit first_num to the actual range of possible buddy indexes
    mm: fix stray kernel-doc notation
    zram: remove obsolete sysfs attrs
    mm/memblock.c: remove unnecessary log and clean up
    oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA
    mm: drop unused argument of zap_page_range()
    mm: drop zap_details::check_swap_entries
    mm: drop zap_details::ignore_dirty
    mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled
    mm: help __GFP_NOFAIL allocations which do not trigger OOM killer
    mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
    mm: consolidate GFP_NOFAIL checks in the allocator slowpath
    lib/show_mem.c: teach show_mem to work with the given nodemask
    arch, mm: remove arch specific show_mem
    mm, page_alloc: warn_alloc print nodemask
    mm, page_alloc: do not report all nodes in show_mem
    Revert "mm: bail out in shrink_inactive_list()"
    mm, vmscan: consider eligible zones in get_scan_count
    mm, vmscan: cleanup lru size claculations
    mm, vmscan: do not count freed pages as PGDEACTIVATE
    ...

    Linus Torvalds
     
  • Pull documentation updates from Jonathan Corbet:
    "A slightly quieter cycle for documentation this time around.

    Three more DocBook template files have been converted to RST; only 21
    to go. There are various build improvements and the usual array of
    documentation improvements and fixes"

    * tag 'docs-4.11' of git://git.lwn.net/linux: (44 commits)
    docs / driver-api: Fix structure references in device_link.rst
    PM / docs: Fix structure references in device.rst
    Add a target to check broken external links in the Documentation
    Documentation: Fix linux-api list typo
    Documentation: DocBook/Makefile comment typo
    Improve sparse documentation
    Documentation: make Makefile.sphinx no-ops quieter
    Documentation: DMA-ISA-LPC.txt
    Documentation: input: fix path to input code definitions
    docs: Remove the copyright year from conf.py
    docs: Fix a warning in the Korean HOWTO.rst translation
    PM / sleep / docs: Convert PM notifiers document to reST
    PM / core / docs: Convert sleep states API document to reST
    PM / core: Update kerneldoc comments in pm.h
    doc-rst: Fix recursive make invocation from macros
    doc-rst: Delete output of failed dot-SVG conversion
    doc-rst: Break shell command sequences on failure
    Documentation/sphinx: make targets independent of Sphinx work for HAVE_SPHINX=0
    doc-rst: fixed cleandoc target when used with O=dir
    Documentation/sphinx: prevent generation of .pyc files in the source tree
    ...

    Linus Torvalds
     
  • There is no thp defrag option that currently allows MADV_HUGEPAGE
    regions to do direct compaction and reclaim while all other thp
    allocations simply trigger kswapd and kcompactd in the background and
    fail immediately.

    The "defer" setting simply triggers background reclaim and compaction
    for all regions, regardless of MADV_HUGEPAGE, which makes it unusable
    for our userspace, where MADV_HUGEPAGE is being used to indicate the
    application is willing to wait for work to make thp memory available.

    The "madvise" setting will do direct compaction and reclaim for these
    MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the
    background for anybody else.

    For reasonable usage, there needs to be a mesh between the two options.
    This patch introduces a fifth mode, "defer+madvise", that will do direct
    reclaim and compaction for MADV_HUGEPAGE regions and trigger background
    reclaim and compaction for everybody else so that hugepages may be
    available in the near future.
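
    As a usage note, the new mode is selected through the existing defrag
    sysfs file; the small sketch below simply writes the string from a test
    program (it needs the appropriate privileges) and is only an
    illustration.

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "w");

            if (!f)
                    return 1;
            if (fputs("defer+madvise", f) == EOF) {
                    fclose(f);
                    return 1;
            }
            return fclose(f) != 0;
    }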

    A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE
    regions as part of the "defer" mode, making it a very powerful setting
    and avoids breaking userspace, was offered:
    http://marc.info/?t=148236612700003
    This additional mode is a compromise.

    A second proposal to allow both "defer" and "madvise" to be selected at
    the same time was also offered:
    http://marc.info/?t=148357345300001.
    This is possible, but there was a concern that it might break existing
    userspace that parses the output of the defrag mode, so the fifth
    option was introduced instead.

    This patch also cleans up the helper function for storing to "enabled"
    and "defrag" since the former supports three modes while the latter
    supports five and triple_flag_store() was getting unnecessarily messy.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701101614330.41805@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

27 Jan, 2017

1 commit


11 Jan, 2017

1 commit

  • This is a first pass at trying to add documentation for the page_frag
    APIs. They may still change over time but for now I thought I would try
    to get these documented so that as more network drivers and stack calls
    make use of them we have one central spot to document how they are meant
    to be used.

    Link: http://lkml.kernel.org/r/20170104024157.13451.6758.stgit@localhost.localdomain
    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     

13 Dec, 2016

2 commits

  • Pull documentation update from Jonathan Corbet:
    "These are the documentation changes for 4.10.

    It's another busy cycle for the docs tree, as the sphinx conversion
    continues. Highlights include:

    - Further work on PDF output, which remains a bit of a pain but
    should be more solid now.

    - Five more DocBook template files converted to Sphinx. Only 27 to
    go... Lots of plain-text files have also been converted and
    integrated.

    - Images in binary formats have been replaced with more
    source-friendly versions.

    - Various bits of organizational work, including the renaming of
    various files discussed at the kernel summit.

    - New documentation for the device_link mechanism.

    ... and, of course, lots of typo fixes and small updates"

    * tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits)
    dma-buf: Extract dma-buf.rst
    Update Documentation/00-INDEX
    docs: 00-INDEX: document directories/files with no docs
    docs: 00-INDEX: remove non-existing entries
    docs: 00-INDEX: add missing entries for documentation files/dirs
    docs: 00-INDEX: consolidate process/ and admin-guide/ description
    scripts: add a script to check if Documentation/00-INDEX is sane
    Docs: change sh -> awk in REPORTING-BUGS
    Documentation/core-api/device_link: Add initial documentation
    core-api: remove an unexpected unident
    ppc/idle: Add documentation for powersave=off
    Doc: Correct typo, "Introdution" => "Introduction"
    Documentation/atomic_ops.txt: convert to ReST markup
    Documentation/local_ops.txt: convert to ReST markup
    Documentation/assoc_array.txt: convert to ReST markup
    docs-rst: parse-headers.pl: cleanup the documentation
    docs-rst: fix media cleandocs target
    docs-rst: media/Makefile: reorganize the rules
    docs-rst: media: build SVG from graphviz files
    docs-rst: replace bayer.png by a SVG image
    ...

    Linus Torvalds
     
  • Test programs want to know the size of a transparent hugepage. While it
    is commonly the same as the size of a hugetlbfs page (shown as
    Hugepagesize in /proc/meminfo), that is not always so: powerpc
    implements transparent hugepages in a different way from hugetlbfs
    pages, so it's coincidence when their sizes are the same; and x86 and
    others can support more than one hugetlbfs page size.

    Add /sys/kernel/mm/transparent_hugepage/hpage_pmd_size to show the THP
    size in bytes - it's the same for Anonymous and Shmem hugepages. Call
    it hpage_pmd_size (after HPAGE_PMD_SIZE) rather than hpage_size, in case
    some transparent support for pud and pgd pages is added later.
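
    For illustration, a test program can read the new file like this (a
    minimal sketch; error handling kept deliberately short):

    #include <stdio.h>

    int main(void)
    {
            unsigned long hpage_pmd_size = 0;
            FILE *f = fopen(
                    "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

            if (!f)
                    return 1;
            if (fscanf(f, "%lu", &hpage_pmd_size) != 1) {
                    fclose(f);
                    return 1;
            }
            fclose(f);
            printf("THP size: %lu bytes\n", hpage_pmd_size);
            return 0;
    }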

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052200290.13021@eggly.anvils
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Cc: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 Oct, 2016

1 commit


07 Aug, 2016

1 commit

  • Pull documentation fixes from Jonathan Corbet:
    "Three fixes for the docs build, including removing an annoying warning
    on 'make help' if sphinx isn't present"

    * tag 'doc-4.8-fixes' of git://git.lwn.net/linux:
    DocBook: use DOCBOOKS="" to ignore DocBooks instead of IGNORE_DOCBOOKS=1
    Documenation: update cgroup's document path
    Documentation/sphinx: do not warn about missing tools in 'make help'

    Linus Torvalds
     

04 Aug, 2016

1 commit


27 Jul, 2016

4 commits

  • Randy reported below build error.

    > In file included from ../include/linux/balloon_compaction.h:48:0,
    > from ../mm/balloon_compaction.c:11:
    > ../include/linux/compaction.h:237:51: warning: 'struct node' declared inside parameter list [enabled by default]
    > static inline int compaction_register_node(struct node *node)
    > ../include/linux/compaction.h:237:51: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
    > ../include/linux/compaction.h:242:54: warning: 'struct node' declared inside parameter list [enabled by default]
    > static inline void compaction_unregister_node(struct node *node)
    >

    It was caused by non-lru page migration, which needs compaction.h, but
    compaction.h doesn't include any header so it is not standalone.

    I think the proper header for non-lru page migration is migrate.h
    rather than compaction.h, because migrate.h already pulls in, directly
    or indirectly, the definitions needed by non-lru page migration, like
    isolate_mode_t, migrate_mode and MIGRATEPAGE_SUCCESS.

    [akpm@linux-foundation.org: revert mm-balloon-use-general-non-lru-movable-page-feature-fix.patch temp fix]
    Link: http://lkml.kernel.org/r/20160610003304.GE29779@bbox
    Signed-off-by: Minchan Kim
    Reported-by: Randy Dunlap
    Cc: Konstantin Khlebnikov
    Cc: Vlastimil Babka
    Cc: Gioh Kim
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Add info about tmpfs/shmem with huge pages.

    Link: http://lkml.kernel.org/r/1466021202-61880-38-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add description of THP handling into unevictable-lru.txt.

    Link: http://lkml.kernel.org/r/1466021202-61880-7-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We have allowed migration for only LRU pages until now, and it was
    enough to make high-order pages. But recently, embedded systems (e.g.,
    webOS, android) use lots of non-movable pages (e.g., zram, GPU memory),
    so we have seen several reports about trouble with small high-order
    allocations. To fix the problem, there were several efforts (e.g.,
    enhancing the compaction algorithm, SLUB fallback to 0-order pages,
    reserved memory, vmalloc and so on), but if there are lots of
    non-movable pages in the system, those solutions are void in the long
    run.

    So, this patch adds a facility to turn non-movable pages into movable
    ones. For the feature, this patch introduces migration-related
    function pointers in address_space_operations as well as some page
    flags.

    If a driver wants to make its own pages movable, it should define three
    functions, which are function pointers of struct
    address_space_operations, described below (a sketch of all three hooks
    follows the flag descriptions):

    1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

    What the VM expects from a driver's isolate_page function is to return
    *true* if the driver isolates the page successfully. On returning
    true, the VM marks the page as PG_isolated so that concurrent isolation
    on several CPUs skips the page. If a driver cannot isolate the page,
    it should return *false*.

    Once a page is successfully isolated, the VM uses the page.lru fields,
    so the driver shouldn't expect the values in those fields to be
    preserved.

    2. int (*migratepage) (struct address_space *mapping,
    struct page *newpage, struct page *oldpage, enum migrate_mode);

    After isolation, the VM calls the driver's migratepage with the
    isolated page. The job of migratepage is to move the content of the
    old page to the new page and set up the fields of struct page newpage.
    Keep in mind that you should indicate to the VM that the oldpage is no
    longer movable via __ClearPageMovable() under page_lock if you migrated
    the oldpage successfully and return 0. If the driver cannot migrate
    the page at the moment, it can return -EAGAIN. On -EAGAIN, the VM will
    retry page migration in a short time because the VM interprets -EAGAIN
    as "temporary migration failure". On returning any error other than
    -EAGAIN, the VM will give up on the page migration without retrying
    this time.

    The driver shouldn't touch the page.lru field the VM is using in these
    functions.

    3. void (*putback_page)(struct page *);

    If migration fails on an isolated page, the VM should return the
    isolated page to the driver, so the VM calls the driver's putback_page
    with the migration-failed page. In this function, the driver should
    put the isolated page back into its own data structure.

    4. non-lru movable page flags

    There are two page flags for supporting non-lru movable page.

    * PG_movable

    A driver should use the function below to make a page movable, under
    page_lock.

    void __SetPageMovable(struct page *page, struct address_space *mapping)

    It takes an address_space argument for registering the migration family
    of functions which will be called by the VM. Strictly speaking,
    PG_movable is not a real flag of struct page. Rather, the VM reuses
    the lower bits of page->mapping to represent it:

    #define PAGE_MAPPING_MOVABLE 0x2
    page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

    so the driver shouldn't access page->mapping directly. Instead, the
    driver should use page_mapping(), which masks off the low two bits of
    page->mapping so it can get the right struct address_space.

    For testing whether a page is non-lru movable, the VM supports the
    __PageMovable function. However, it doesn't guarantee identifying a
    non-lru movable page, because the page->mapping field is unified with
    other variables in struct page. Also, if the driver releases the page
    after isolation by the VM, page->mapping doesn't have a stable value
    even though it has PAGE_MAPPING_MOVABLE set (look at
    __ClearPageMovable). But __PageMovable is a cheap way to tell whether
    a page is LRU or non-lru movable once the page has been isolated,
    because LRU pages can never have PAGE_MAPPING_MOVABLE in page->mapping.
    It is also good for just peeking to test for non-lru movable pages
    before the more expensive check with lock_page during pfn scanning to
    select a victim.

    For guaranteeing a non-lru movable page, the VM provides the
    PageMovable function. Unlike __PageMovable, PageMovable validates
    page->mapping and mapping->a_ops->isolate_page under lock_page. The
    lock_page prevents sudden destruction of page->mapping.

    A driver using __SetPageMovable should clear the flag via
    __ClearPageMovable under page_lock before releasing the page.

    * PG_isolated

    To prevent concurrent isolation among several CPUs, the VM marks an
    isolated page as PG_isolated under lock_page. So if a CPU encounters a
    PG_isolated non-lru movable page, it can skip it. The driver doesn't
    need to manipulate the flag because the VM will set/clear it
    automatically. Keep in mind that if the driver sees a PG_isolated
    page, it means the page has been isolated by the VM, so it shouldn't
    touch the page.lru field. PG_isolated is an alias of the PG_reclaim
    flag, so the driver shouldn't use the flag for its own purposes.
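
    Putting the pieces above together, a driver's wiring could look roughly
    like the hedged sketch below. The mydrv_* names are hypothetical, the
    driver's own bookkeeping is only hinted at in comments, and error
    handling is omitted; only the hooks, flags and helpers named in this
    description are assumed.

    static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
    {
            /* Detach the page from the driver's internal lists here. */
            return true;            /* true == isolation succeeded */
    }

    static int mydrv_migratepage(struct address_space *mapping,
                                 struct page *newpage, struct page *oldpage,
                                 enum migrate_mode mode)
    {
            /* Copy oldpage's payload into newpage here. */
            __ClearPageMovable(oldpage);    /* done under page_lock */
            return 0;                       /* or -EAGAIN if busy */
    }

    static void mydrv_putback_page(struct page *page)
    {
            /* Migration failed: put the page back on the driver's lists. */
    }

    static const struct address_space_operations mydrv_aops = {
            .isolate_page   = mydrv_isolate_page,
            .migratepage    = mydrv_migratepage,
            .putback_page   = mydrv_putback_page,
    };

    /* Pages handed out by the driver are marked movable under page_lock;
     * mapping->a_ops is expected to point at mydrv_aops. */
    static void mydrv_mark_movable(struct page *page,
                                   struct address_space *mapping)
    {
            lock_page(page);
            __SetPageMovable(page, mapping);
            unlock_page(page);
    }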

    [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
    Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
    Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Signed-off-by: Ganesh Mahendran
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Cc: Jonathan Corbet
    Cc: John Einar Reitan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

21 May, 2016

2 commits

  • This patch introduces z3fold, a special purpose allocator for storing
    compressed pages. It is designed to store up to three compressed pages
    per physical page. It is a ZBUD derivative which allows for higher
    compression ratio keeping the simplicity and determinism of its
    predecessor.

    This patch comes as a follow-up to the discussions at the Embedded
    Linux Conference in San Diego related to the talk [1]. The outcome of
    these discussions was that it would be good to have a compressed page
    allocator as stable and deterministic as zbud but with a higher
    compression ratio.

    To keep the determinism and simplicity, z3fold, just like zbud, always
    stores an integral number of compressed pages per page, but it can
    store up to 3 pages, unlike zbud, which can store at most 2. Therefore
    the compression ratio goes to around 2.6x while zbud's is around 1.7x.

    The patch is based on the latest linux.git tree.

    This version has been updated after testing on various simulators (e.g.
    ARM Versatile Express, MIPS Malta, x86_64/Haswell) and based on
    comments from Dan Streetman [3].

    [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
    [2] https://lkml.org/lkml/2016/4/21/799
    [3] https://lkml.org/lkml/2016/5/4/852

    Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Seth Jennings
    Cc: Dan Streetman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • Signed-off-by: Eric Engestrom
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Engestrom
     

20 May, 2016

2 commits

  • Merge updates from Andrew Morton:

    - fsnotify fix

    - poll() timeout fix

    - a few scripts/ tweaks

    - debugobjects updates

    - the (small) ocfs2 queue

    - Minor fixes to kernel/padata.c

    - Maybe half of the MM queue

    * emailed patches from Andrew Morton : (117 commits)
    mm, page_alloc: restore the original nodemask if the fast path allocation failed
    mm, page_alloc: uninline the bad page part of check_new_page()
    mm, page_alloc: don't duplicate code in free_pcp_prepare
    mm, page_alloc: defer debugging checks of pages allocated from the PCP
    mm, page_alloc: defer debugging checks of freed pages until a PCP drain
    cpuset: use static key better and convert to new API
    mm, page_alloc: inline pageblock lookup in page free fast paths
    mm, page_alloc: remove unnecessary variable from free_pcppages_bulk
    mm, page_alloc: pull out side effects from free_pages_check
    mm, page_alloc: un-inline the bad part of free_pages_check
    mm, page_alloc: check multiple page fields with a single branch
    mm, page_alloc: remove field from alloc_context
    mm, page_alloc: avoid looking up the first zone in a zonelist twice
    mm, page_alloc: shortcut watermark checks for order-0 pages
    mm, page_alloc: reduce cost of fair zone allocation policy retry
    mm, page_alloc: shorten the page allocator fast path
    mm, page_alloc: check once if a zone has isolated pageblocks
    mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath
    mm, page_alloc: simplify last cpupid reset
    mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask()
    ...

    Linus Torvalds
     
  • Many developers already know that the field for the reference count of
    struct page is _count and of atomic type. They may try to handle it
    directly, and this could break the purpose of the page reference count
    tracepoints. To prevent direct _count modification, this patch renames
    it to _refcount and adds a warning message in the code. After that,
    developers who need to handle the reference count will find that the
    field should not be accessed directly.
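
    For illustration, a hedged sketch of the intended usage: go through the
    page_ref_* accessors (whose modifying variants are where the page_ref
    tracepoints hook in) instead of touching the renamed field;
    page_in_use() is a made-up helper name.

    #include <linux/page_ref.h>

    static inline bool page_in_use(struct page *page)
    {
            /* not page->_refcount (or the old page->_count) directly */
            return page_ref_count(page) > 0;
    }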

    [akpm@linux-foundation.org: fix comments, per Vlastimil]
    [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
    [sfr@canb.auug.org.au: sync ethernet driver changes]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Stephen Rothwell
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Johannes Berg
    Cc: "David S. Miller"
    Cc: Sunil Goutham
    Cc: Chris Metcalf
    Cc: Manish Chopra
    Cc: Yuval Mintz
    Cc: Tariq Toukan
    Cc: Saeed Mahameed
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

15 May, 2016

1 commit


28 Apr, 2016

1 commit


16 Apr, 2016

1 commit


18 Mar, 2016

2 commits

  • THP defrag is enabled by default to direct reclaim/compact but not wake
    kswapd in the event of a THP allocation failure. The problem is that
    THP allocation requests potentially enter reclaim/compaction. This
    potentially incurs a severe stall that is not guaranteed to be offset by
    reduced TLB misses. While there has been considerable effort to reduce
    the impact of reclaim/compaction, it is still a high cost and workloads
    that should fit in memory fail to do so. Specifically, a simple
    anon/file streaming workload will enter direct reclaim on NUMA at least
    even though the working set size is 80% of RAM. It's been years and
    it's time to throw in the towel.

    First, this patch defines THP defrag as follows;

    madvise: A failed allocation will direct reclaim/compact if the application requests it
    never: Neither reclaim/compact nor wake kswapd
    defer: A failed allocation will wake kswapd/kcompactd
    always: A failed allocation will direct reclaim/compact (historical behaviour)
    khugepaged defrag will enter direct reclaim/compaction but not wake kswapd.

    Next it sets the default defrag option to be "madvise" to only enter
    direct reclaim/compaction for applications that specifically requested
    it.

    Lastly, it removes a check from the page allocator slowpath that is
    related to __GFP_THISNODE to allow "defer" to work. The callers that
    really care are slub/slab and they are updated accordingly. The slab
    one may be surprising because it also corrects a comment, as kswapd was
    never woken up by that path.

    This means that a THP fault will no longer stall for most applications
    by default, which is the ideal for most users: they get THP if it is
    immediately available. There are still options for users that prefer a
    stall at startup of a new application, by either restoring historical
    behaviour with "always" or picking a half-way point with "defer" where
    kswapd does some of the work in the background and wakes kcompactd if
    necessary. THP defrag for khugepaged remains enabled and will enter
    direct reclaim but not wake kswapd or kcompactd.

    After this patch a THP allocation failure will quickly fall back and
    rely on khugepaged to recover the situation at some time in the future.
    In some cases this will reduce THP usage, but the benefit of THP is hard
    to measure and not a universal win, whereas a stall in reclaim/compaction
    is definitely measurable and can be painful.

    The first test for this is using "usemem" to read a large file and write
    a large anonymous mapping (to avoid the zero page) multiple times. The
    total size of the mappings is 80% of RAM and the benchmark simply
    measures how long it takes to complete. It uses multiple threads to see
    if that is a factor. On UMA the performance is almost identical, so it
    is not reported, but on NUMA we see this

    usemem
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
    Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
    Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
    Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
    Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
    Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
    Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
    Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
    Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
    Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
    Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
    Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
    Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
    Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)

    For a single thread, the benchmark completes 43.23% faster with this
    patch applied, with smaller benefits as the thread count increases.
    Similarly, notice the large reduction in system CPU usage in most cases.
    The overall CPU time is

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 10357.65 10438.33
    System 3988.88 3543.94
    Elapsed 2203.01 1634.41

    Which is substantial. Now, the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 128458477 278352931
    Major Faults 2174976 225
    Swap Ins 16904701 0
    Swap Outs 17359627 0
    Allocation stalls 43611 0
    DMA allocs 0 0
    DMA32 allocs 19832646 19448017
    Normal allocs 614488453 580941839
    Movable allocs 0 0
    Direct pages scanned 24163800 0
    Kswapd pages scanned 0 0
    Kswapd pages reclaimed 0 0
    Direct pages reclaimed 20691346 0
    Compaction stalls 42263 0
    Compaction success 938 0
    Compaction failures 41325 0

    This patch eliminates almost all swapping and direct reclaim activity.
    There is still overhead but it's from NUMA balancing which does not
    identify that it's pointless trying to do anything with this workload.

    I also tried the thpscale benchmark, which forces a corner case where
    compaction can be used heavily, and measures the fault latency depending
    on whether base or huge pages were used

    thpscale Fault Latencies
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
    Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
    Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
    Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
    Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
    Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
    Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
    Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
    Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
    Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
    Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
    Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
    Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
    Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
    Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
    Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
    Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
    Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)

    The average time to fault pages is substantially reduced in the majority
    of cases, but with the obvious caveat that fewer THPs are actually used
    in this adverse workload

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
    Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
    Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
    Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
    Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
    Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
    Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
    Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
    Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 37429143 47564000
    Major Faults 1916 1558
    Swap Ins 1466 1079
    Swap Outs 2936863 149626
    Allocation stalls 62510 3
    DMA allocs 0 0
    DMA32 allocs 6566458 6401314
    Normal allocs 216361697 216538171
    Movable allocs 0 0
    Direct pages scanned 25977580 17998
    Kswapd pages scanned 0 3638931
    Kswapd pages reclaimed 0 207236
    Direct pages reclaimed 8833714 88
    Compaction stalls 103349 5
    Compaction success 270 4
    Compaction failures 103079 1

    Note again that while this does swap, as it is an aggressive workload,
    the direct reclaim activity and allocation stalls are substantially
    reduced. There is some kswapd activity, but ftrace showed that it was
    due to normal wakeups from 4K pages being allocated.
    Compaction-related stalls and activity are almost eliminated.

    I also tried the stutter benchmark. For this, I do not have figures for
    NUMA but it's something that does impact UMA so I'll report what is
    available

    stutter
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
    1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
    2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
    3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
    Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
    Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
    Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
    Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
    Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
    Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)

    This benchmark is trying to fault an anonymous mapping while there is a
    heavy IO load -- a scenario that desktop users used to complain about
    frequently. This shows a mix because the ideal case of mapping with THP
    is not hit as often. However, note that 99% of the mappings complete
    13.79% faster. The CPU usage here is particularly interesting

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 67.50 0.99
    System 1327.88 91.30
    Elapsed 2079.00 2128.98

    And once again we look at the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 335241922 1314582827
    Major Faults 715 819
    Swap Ins 0 0
    Swap Outs 0 0
    Allocation stalls 532723 0
    DMA allocs 0 0
    DMA32 allocs 1822364341 1177950222
    Normal allocs 1815640808 1517844854
    Movable allocs 0 0
    Direct pages scanned 21892772 0
    Kswapd pages scanned 20015890 41879484
    Kswapd pages reclaimed 19961986 41822072
    Direct pages reclaimed 21892741 0
    Compaction stalls 1065755 0
    Compaction success 514 0
    Compaction failures 1065241 0

    Allocation stalls and all direct reclaim activity are eliminated, as
    are compaction-related stalls.

    THP gives impressive gains in some cases, but only if huge pages are
    quickly available. We're not going to reach the point where they are
    completely free, so let's finally take the costs out of the fast paths
    and defer the cost to kswapd, kcompactd and khugepaged where it belongs.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Count how many times we put a THP in the split queue. Currently, this
    happens on partial unmap of a THP.

    A rapidly growing value can indicate that an application behaves
    unfriendly with respect to THP: it often faults in a huge page and then
    unmaps part of it. This leads to unnecessary memory fragmentation and
    the application may require tuning.

    The event can also help with debugging kernel [mis-]behaviour.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Mar, 2016

2 commits

  • CONFIG_PAGE_OWNER attempts to impose negligible runtime overhead when it
    is compiled in but not actually enabled at runtime via the boot parameter
    page_owner=on. This overhead can be further reduced using the static key
    mechanism, which is what this patch does.
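
    As a sketch of the pattern only (the key, parameter and helper names
    below are hypothetical, not the actual page_owner code), a static key
    keeps the hot-path check a patched no-op until the boot parameter flips
    it:

        #include <linux/init.h>
        #include <linux/jump_label.h>

        static DEFINE_STATIC_KEY_FALSE(my_tracking_key);

        static void do_expensive_tracking(void)
        {
                /* hypothetical bookkeeping, only reached when enabled */
        }

        /* Hypothetical boot parameter, analogous to page_owner=on. */
        static int __init enable_my_tracking(char *buf)
        {
                static_branch_enable(&my_tracking_key);
                return 0;
        }
        early_param("my_tracking", enable_my_tracking);

        void hot_path(void)
        {
                /* Stays an out-of-line, never-taken branch unless the
                 * key was enabled at boot. */
                if (static_branch_unlikely(&my_tracking_key))
                        do_expensive_tracking();
        }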

    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned
    on or off. Expand its use to be able to turn off all consistency
    checks. This gives a nice speed up if you only want features such as
    poisoning or tracing.
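
    As a hedged illustration (the cache name and object size are invented;
    only kmem_cache_create() and the SLAB_POISON flag are taken from
    <linux/slab.h>), a cache that wants poisoning without the expensive
    free-time consistency checks simply does not request them:

        #include <linux/errno.h>
        #include <linux/init.h>
        #include <linux/slab.h>

        static struct kmem_cache *example_cache;

        static int __init example_cache_init(void)
        {
                /* Poison freed objects, but leave the costly free-time
                 * consistency checks disabled. */
                example_cache = kmem_cache_create("example_cache", 128, 0,
                                                  SLAB_POISON, NULL);
                return example_cache ? 0 : -ENOMEM;
        }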

    Credit to Mathias Krause for the original work which inspired this
    series.

    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

18 Jan, 2016

1 commit

  • Merge second patch-bomb from Andrew Morton:

    - more MM stuff:

    - Kirill's page-flags rework

    - Kirill's now-allegedly-fixed THP rework

    - MADV_FREE implementation

    - DAX feature work (msync/fsync). This isn't quite complete but DAX
    is new and it's good enough and the guys have a handle on what
    needs to be done - I expect this to be wrapped in the next week or
    two.

    - some vsprintf maintenance work

    - various other misc bits

    * emailed patches from Andrew Morton : (145 commits)
    printk: change recursion_bug type to bool
    lib/vsprintf: factor out %pN[F] handler as netdev_bits()
    lib/vsprintf: refactor duplicate code to special_hex_number()
    printk-formats.txt: remove unimplemented %pT
    printk: help pr_debug and pr_devel to optimize out arguments
    lib/test_printf.c: test dentry printing
    lib/test_printf.c: add test for large bitmaps
    lib/test_printf.c: account for kvasprintf tests
    lib/test_printf.c: add a few number() tests
    lib/test_printf.c: test precision quirks
    lib/test_printf.c: check for out-of-bound writes
    lib/test_printf.c: don't BUG
    lib/kasprintf.c: add sanity check to kvasprintf
    lib/vsprintf.c: warn about too large precisions and field widths
    lib/vsprintf.c: help gcc make number() smaller
    lib/vsprintf.c: expand field_width to 24 bits
    lib/vsprintf.c: eliminate potential race in string()
    lib/vsprintf.c: move string() below widen_string()
    lib/vsprintf.c: pull out padding code from dentry_name()
    printk: do cond_resched() between lines while outputting to consoles
    ...

    Linus Torvalds
     

16 Jan, 2016

1 commit

  • The patch updates Documentation/vm/transhuge.txt to reflect changes in
    THP design.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Dec, 2015

1 commit


07 Nov, 2015

1 commit

  • Hugh has pointed out that a compound_head() call can be unsafe in some
    contexts. Here is one example:

    CPU0                                        CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                                put_page()
                                                  tail->first_page = NULL
          head = tail->first_page
                                                alloc_pages(__GFP_COMP)
                                                  prep_compound_page()
                                                    tail->first_page = head
                                                    __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger
    it in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that both can be updated in one
    shot.

    The patch introduces page->compound_head in the third double word
    block, in front of compound_dtor and compound_order. Bit 0 encodes
    PageTail(); when it is set, the remaining bits are a pointer to the
    head page.
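
    A self-contained userspace sketch of that encoding (toy types, not the
    real struct page) shows how a single read resolves both PageTail() and
    the head pointer:

        #include <assert.h>
        #include <stdio.h>

        struct toy_page {
                unsigned long compound_head;    /* bit 0 set => tail page */
        };

        static int toy_page_tail(const struct toy_page *p)
        {
                return p->compound_head & 1UL;
        }

        static struct toy_page *toy_compound_head(struct toy_page *p)
        {
                unsigned long head = p->compound_head;

                return (head & 1UL) ? (struct toy_page *)(head - 1UL) : p;
        }

        int main(void)
        {
                struct toy_page pages[2] = { { 0 }, { 0 } };

                /* prep_compound_page() analogue: point the tail at the
                 * head and set bit 0 in a single store. */
                pages[1].compound_head = (unsigned long)&pages[0] | 1UL;

                assert(!toy_page_tail(&pages[0]));
                assert(toy_page_tail(&pages[1]));
                assert(toy_compound_head(&pages[1]) == &pages[0]);
                printf("tail resolves to its head in one read\n");
                return 0;
        }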

    The patch moves page->pmd_huge_pte out of this word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there is now space in the
    first tail page to store the struct hugetlb_cgroup pointer. But that's
    out of scope of this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of that word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we
    use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
    future call_rcu_lazy() is not allowed, as it would make use of the bit
    and we could get a false-positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov