06 Mar, 2019

1 commit

  • commit cb6acd01e2e43fd8bad11155752b7699c3d0fb76 upstream.

    hugetlb pages should only be migrated if they are 'active'. The
    routines set/clear_page_huge_active() modify the active state of hugetlb
    pages.

    When a new hugetlb page is allocated at fault time, set_page_huge_active
    is called before the page is locked. Therefore, another thread could
    race and migrate the page while it is being added to the page table by the
    fault code. This race is somewhat hard to trigger, but can be seen by
    strategically adding udelay to simulate worst case scheduling behavior.
    Depending on 'how' the code races, various BUG()s could be triggered.

    To address this issue, simply delay the set_page_huge_active call until
    after the page is successfully added to the page table.

    Hugetlb pages can also be leaked at migration time if the pages are
    associated with a file in an explicitly mounted hugetlbfs filesystem.
    For example, consider a two node system with 4GB worth of huge pages
    available. A program mmaps a 2G file in a hugetlbfs filesystem. It
    then migrates the pages associated with the file from one node to
    another. When the program exits, huge page counts are as follows:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    0 free_hugepages
    1024 nr_hugepages

    Filesystem Size Used Avail Use% Mounted on
    nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

    That is as expected. 2G of huge pages are taken from the free_hugepages
    counts, and 2G is the size of the file in the explicitly mounted
    filesystem. If the file is then removed, the counts become:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    1024 free_hugepages
    1024 nr_hugepages

    Filesystem Size Used Avail Use% Mounted on
    nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

    Note that the filesystem still shows 2G of pages used, while there are
    actually no huge pages in use. The only way to 'fix' the filesystem
    accounting is to unmount the filesystem.
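
    A rough userspace sketch of this scenario is shown below. It is a
    hypothetical reproducer, not the one used in the report: the mount point,
    file name and node numbers are illustrative, and the migrate_pages(2)
    wrapper comes from libnuma (link with -lnuma):

    /* Map a 2G hugetlbfs file, fault it in, migrate this process's pages
     * to the other node, then remove the file. Before the fix, the
     * filesystem still shows 2G used afterwards, as in the df output above. */
    #include <fcntl.h>
    #include <numaif.h>            /* migrate_pages(), link with -lnuma */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SZ (2UL << 30)         /* 2G backed by huge pages */

    int main(void)
    {
        int fd = open("/var/opt/hugepool/foo", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, SZ) < 0)
            return 1;

        char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;
        memset(p, 0, SZ);                  /* fault in the huge pages */

        unsigned long from = 1UL << 0;     /* node0 */
        unsigned long to = 1UL << 1;       /* node1 */
        if (migrate_pages(getpid(), 8 * sizeof(from), &from, &to) < 0)
            perror("migrate_pages");

        munmap(p, SZ);
        close(fd);
        unlink("/var/opt/hugepool/foo");   /* counts are checked after exit */
        return 0;
    }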

    If a hugetlb page is associated with an explicitly mounted filesystem,
    this information is contained in the page_private field. At migration
    time, this information is not preserved. To fix, simply transfer
    page_private from old to new page at migration time if necessary.

    There is a related race with removing a huge page from a file and
    migration. When a huge page is removed from the pagecache, the
    page_mapping() field is cleared, yet page_private remains set until the
    page is actually freed by free_huge_page(). A page could be migrated
    while in this state. However, since page_mapping() is not set, the
    hugetlbfs-specific routine to transfer page_private is not called and we
    leak the page count in the filesystem.

    To fix that, check for this condition before migrating a huge page. If
    the condition is detected, return -EBUSY for the page.

    Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
    Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
    Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    Cc:
    [mike.kravetz@oracle.com: v2]
    Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
    [mike.kravetz@oracle.com: update comment and changelog]
    Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

07 Feb, 2019

1 commit

  • commit 1ac25013fb9e4ed595cd608a406191e93520881e upstream.

    hugetlb needs the same fix as faultin_nopage (which was applied in
    commit 96312e61282a ("mm/gup.c: teach get_user_pages_unlocked to handle
    FOLL_NOWAIT")); otherwise KVM hangs because it thinks the mmap_sem was
    already released by hugetlb_fault() when it returned VM_FAULT_RETRY, but
    it wasn't in the FOLL_NOWAIT case.

    Link: http://lkml.kernel.org/r/20190109020203.26669-2-aarcange@redhat.com
    Fixes: ce53053ce378 ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
    Signed-off-by: Andrea Arcangeli
    Tested-by: "Dr. David Alan Gilbert"
    Reported-by: "Dr. David Alan Gilbert"
    Reviewed-by: Mike Kravetz
    Reviewed-by: Peter Xu
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

08 Dec, 2018

1 commit

  • commit 9e368259ad988356c4c95150fafd1a06af095d98 upstream.

    Patch series "userfaultfd shmem updates".

    Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
    lack of the VM_MAYWRITE check and the lack of i_size checks.

    Then looking into the above we also fixed the MAP_PRIVATE case.

    Hugh by source review also found a data loss source if UFFDIO_COPY is
    used on shmem MAP_SHARED PROT_READ mappings (the production usages
    incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
    happen in those production usages like with QEMU).

    The whole patchset is marked for stable.

    We verified that QEMU postcopy live migration with the guest running on
    shmem MAP_PRIVATE runs as well after the shmem MAP_PRIVATE fix as it did
    before. Regardless of whether it's shmem or hugetlbfs, MAP_PRIVATE or
    MAP_SHARED, QEMU unconditionally invokes a hole punch if the guest
    mapping is file-backed, and a MADV_DONTNEED too (needed to get rid of
    the MAP_PRIVATE COWs and for the anon backend).

    This patch (of 5):

    We internally used EFAULT to communicate with the caller; switch to
    ENOENT so that EFAULT can be used as a non-internal retval.

    Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Cc: Mike Kravetz
    Cc: Jann Horn
    Cc: Peter Xu
    Cc: "Dr. David Alan Gilbert"
    Cc:
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

21 Nov, 2018

1 commit

  • commit 5e41540c8a0f0e98c337dda8b391e5dda0cde7cf upstream.

    This bug has been experienced several times by the Oracle DB team. The
    BUG is in remove_inode_hugepages() as follows:

    /*
    * If page is mapped, it was faulted in after being
    * unmapped in caller. Unmap (again) now after taking
    * the fault mutex. The mutex will prevent faults
    * until we finish removing the page.
    *
    * This race can only happen in the hole punch case.
    * Getting here in a truncate operation is a bug.
    */
    if (unlikely(page_mapped(page))) {
    BUG_ON(truncate_op);

    In this case, the elevated map count is not the result of a race.
    Rather it was incorrectly incremented as the result of a bug in the huge
    pmd sharing code. Consider the following:

    - Process A maps a hugetlbfs file of sufficient size and alignment
    (PUD_SIZE) that a pmd page could be shared.

    - Process B maps the same hugetlbfs file with the same size and
    alignment such that a pmd page is shared.

    - Process B then calls mprotect() to change protections for the mapping
    with the shared pmd. As a result, the pmd is 'unshared'.

    - Process B then calls mprotect() again to change protections for the
    mapping back to their original value. pmd remains unshared.

    - Process B then forks and process C is created. During the fork
    process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
    tables. Copying page tables for hugetlb mappings is done in the
    routine copy_hugetlb_page_range.

    In copy_hugetlb_page_range(), the destination pte is obtained by:

    dst_pte = huge_pte_alloc(dst, addr, sz);

    If pmd sharing is possible, the returned pointer will be to a pte in an
    existing page table. In the situation above, process C could share with
    either process A or process B. Since process A is first in the list,
    the returned pte is a pointer to a pte in process A's page table.

    However, the check for pmd sharing in copy_hugetlb_page_range is:

    /* If the pagetables are shared don't copy or take references */
    if (dst_pte == src_pte)
    continue;

    Since process C is sharing with process A instead of process B, the
    above test fails. The code in copy_hugetlb_page_range which follows
    assumes dst_pte points to a huge_pte_none pte. It copies the pte entry
    from src_pte to dst_pte and increments the map count of the associated
    page. This is how we end up with an elevated map count.

    To solve, check the dst_pte entry for huge_pte_none. If !none, this
    implies PMD sharing so do not copy.

    Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
    Fixes: c5c99429fa57 ("fix hugepages leak due to pagetable page sharing")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

14 Nov, 2018

1 commit

  • commit 22146c3ce98962436e401f7b7016a6f664c9ffb5 upstream.

    Some test systems were experiencing negative huge page reserve counts and
    incorrect file block counts. This was traced to /proc/sys/vm/drop_caches
    removing clean pages from hugetlbfs file pagecaches. When code outside
    hugetlbfs explicitly removes the pages, the appropriate accounting is not
    performed.

    This can be recreated as follows:
    fallocate -l 2M /dev/hugepages/foo
    echo 1 > /proc/sys/vm/drop_caches
    fallocate -l 2M /dev/hugepages/foo
    grep -i huge /proc/meminfo
    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    HugePages_Total: 2048
    HugePages_Free: 2047
    HugePages_Rsvd: 18446744073709551615
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    Hugetlb: 4194304 kB
    ls -lsh /dev/hugepages/foo
    4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo

    To address this issue, dirty pages as they are added to pagecache. This
    can easily be reproduced with fallocate as shown above. Read faulted
    pages will eventually end up being marked dirty. But there is a window
    where they are clean and could be impacted by code such as drop_caches.
    So, just dirty them all as they are added to the pagecache.

    Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
    Fixes: 6bda666a03f0 ("hugepages: fold find_or_alloc_pages into huge_no_page()")
    Signed-off-by: Mike Kravetz
    Acked-by: Michal Hocko
    Reviewed-by: Khalid Aziz
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Alexander Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

06 Oct, 2018

2 commits

  • When fixing an issue with PMD sharing and migration, it was discovered via
    code inspection that other callers of huge_pmd_unshare potentially have an
    issue with cache and tlb flushing.

    Use the routine adjust_range_if_pmd_sharing_possible() to calculate worst
    case ranges for mmu notifiers. Ensure that this range is flushed if
    huge_pmd_unshare succeeds and unmaps a PUD_SIZE area.

    Link: http://lkml.kernel.org/r/20180823205917.16297-3-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Michal Hocko
    Cc: Jerome Glisse
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     
  • The page migration code employs try_to_unmap() to try and unmap the source
    page. This is accomplished by using rmap_walk to find all vmas where the
    page is mapped. This search stops when page mapcount is zero. For shared
    PMD huge pages, the page map count is always 1 no matter the number of
    mappings. Shared mappings are tracked via the reference count of the PMD
    page. Therefore, try_to_unmap stops prematurely and does not completely
    unmap all mappings of the source page.

    This problem can result in data corruption as writes to the original
    source page can happen after contents of the page are copied to the target
    page. Hence, data is lost.

    This problem was originally seen as DB corruption of shared global areas
    after a huge page was soft offlined due to ECC memory errors. DB
    developers noticed they could reproduce the issue by (hotplug) offlining
    memory used to back huge pages. A simple testcase can reproduce the
    problem by creating a shared PMD mapping (note that this must be at least
    PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
    migrate_pages() to migrate process pages between nodes while continually
    writing to the huge pages being migrated.
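
    For reference, a bare-bones sketch of how such a shared-PMD mapping can be
    set up from userspace is shown below. It is a hypothetical illustration,
    not the testcase from the report: the hugetlbfs path and the fixed
    1GB-aligned address are made up, and it assumes enough 2MB huge pages are
    reserved to back a full PUD_SIZE (1GB) mapping.

    /* Parent and child map the same 1GB, 1GB-aligned window of a hugetlbfs
     * file MAP_SHARED, which allows the kernel to share the PMD page.
     * migrate_pages(2) would then be run against one of them, as described
     * above, while the other keeps writing. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define PUD_SZ  (1UL << 30)
    #define HINT    ((void *)(16UL << 30))   /* arbitrary 1GB-aligned address */

    int main(void)
    {
        int fd = open("/dev/hugepages/shared", O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return 1;
        char *p = mmap(HINT, PUD_SZ, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        if (fork() == 0) {
            for (int i = 0; i < 100; i++)    /* keep dirtying the mapping */
                memset(p, i, PUD_SZ);
            _exit(0);
        }
        memset(p, 0x55, PUD_SZ);             /* parent faults everything in */
        wait(NULL);
        return 0;
    }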

    To fix, have the try_to_unmap_one routine check for huge PMD sharing by
    calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared
    mapping it will be 'unshared' which removes the page table entry and drops
    the reference on the PMD page. After this, flush caches and TLB.

    mmu notifiers are called before locking page tables, but we can not be
    sure of PMD sharing until page tables are locked. Therefore, check for
    the possibility of PMD sharing before locking so that notifiers can
    prepare for the worst possible case.

    Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
    [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
    Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Jerome Glisse
    Cc: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

24 Aug, 2018

2 commits

  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    See commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t type. As part of that cleanup, the
    return type of all other recursively called functions has been changed
    to vm_fault_t type.

    The places from which handle_mm_fault() is invoked will be changed to
    vm_fault_t type, but in a separate patch.

    vmf_error() is the newly introduced inline function in 4.17-rc6.

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • Patch series "mm: soft-offline: fix race against page allocation".

    Xishi recently reported an issue about a race on reusing the target
    pages of soft offlining. Discussion and analysis showed that we need to
    make sure that setting PG_hwpoison is done in the right place under
    zone->lock for soft offline. 1/2 handles the free hugepage case, and 2/2
    handles the free buddy page case.

    This patch (of 2):

    There's a race condition between soft offline and hugetlb_fault which
    causes unexpected process killing and/or hugetlb allocation failure.

    The process killing is caused by the following flow:

    CPU 0                  CPU 1                  CPU 2

    soft offline
      get_any_page
        // find the hugetlb is free
                           mmap a hugetlb file
                           page fault
                             ...
                             hugetlb_fault
                               hugetlb_no_page
                                 alloc_huge_page
                                   // succeed
      soft_offline_free_page
      // set hwpoison flag
                                                  mmap the hugetlb file
                                                  page fault
                                                    ...
                                                    hugetlb_fault
                                                      hugetlb_no_page
                                                        find_lock_page
                                                          return VM_FAULT_HWPOISON
                                                    mm_fault_error
                                                      do_sigbus
                                                        // kill the process

    The hugetlb allocation failure comes from the following flow:

    CPU 0                              CPU 1

                                       mmap a hugetlb file
                                       // reserve all free page but don't fault-in
    soft offline
      get_any_page
        // find the hugetlb is free
        soft_offline_free_page
        // set hwpoison flag
          dissolve_free_huge_page
          // fail because all free hugepages are reserved
                                       page fault
                                         ...
                                         hugetlb_fault
                                           hugetlb_no_page
                                             alloc_huge_page
                                               ...
                                               dequeue_huge_page_node_exact
                                               // ignore hwpoisoned hugepage
                                               // and finally fail due to no-mem

    The root cause of this is that current soft-offline code is written based
    on an assumption that PageHWPoison flag should be set at first to avoid
    accessing the corrupted data. This makes sense for memory_failure() or
    hard offline, but not for soft offline, because soft offline is about
    corrected (not uncorrected) errors and is safe from data loss. This patch
    changes soft offline semantics where it sets PageHWPoison flag only after
    containment of the error page completes successfully.
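
    For context, soft offline of an in-use page can be driven from userspace
    with madvise(MADV_SOFT_OFFLINE); the free-hugepage case in the races above
    is typically reached via the soft_offline_page sysfs interface instead.
    A minimal hypothetical sketch (needs CAP_SYS_ADMIN, a kernel with
    CONFIG_MEMORY_FAILURE, and assumes 2MB huge pages):

    /* Soft-offline the huge page backing a MAP_HUGETLB mapping. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_SOFT_OFFLINE
    #define MADV_SOFT_OFFLINE 101      /* from <asm-generic/mman-common.h> */
    #endif

    #define HPAGE (2UL << 20)

    int main(void)
    {
        char *p = mmap(NULL, HPAGE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        memset(p, 0, HPAGE);           /* fault the huge page in */

        /* Ask the kernel to migrate the contents away and take the
         * backing page out of service (the soft offline path above). */
        if (madvise(p, HPAGE, MADV_SOFT_OFFLINE))
            perror("madvise(MADV_SOFT_OFFLINE)");
        return 0;
    }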

    Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: Xishi Qiu
    Suggested-by: Xishi Qiu
    Tested-by: Mike Kravetz
    Cc: Michal Hocko
    Cc:
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

18 Aug, 2018

4 commits

  • When using 1GiB pages during early boot, use the new
    memblock_virt_alloc_try_nid_raw() to allocate memory without zeroing it.
    Zeroing out hundreds or thousands of GiB in a single core memset() call
    is very slow, and can make early boot last upwards of 20-30 minutes on
    multi-TiB machines.

    The memory does not need to be zero'd as the hugetlb pages are always
    zero'd on page fault.

    Tested: Booted with ~3800 1G pages, and it booted successfully in
    roughly the same amount of time as with 0 pages, as opposed to the 25+ minutes
    it would take before.

    Link: http://lkml.kernel.org/r/20180711213313.92481-1-cannonmatthews@google.com
    Signed-off-by: Cannon Matthews
    Acked-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Cc: David Matlack
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cannon Matthews
     
  • This reverts ee8f248d266e ("hugetlb: add phys addr to struct
    huge_bootmem_page").

    At one time powerpc used this field and supporting code. However that
    was removed with commit 79cc38ded1e1 ("powerpc/mm/hugetlb: Add support
    for reserving gigantic huge pages via kernel command line").

    There are no users of this field and supporting code, so remove it.

    Link: http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Cannon Matthews
    Cc: Becky Bruce
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • This is to take better advantage of the general huge page copying
    optimization, where the target subpage is copied last to avoid evicting
    its cache lines while copying the other subpages. This works better if
    the address of the target subpage is available when copying the huge
    page, so the hugetlbfs page fault handlers are changed to pass that
    information to hugetlb_cow(). This will benefit workloads which don't
    access the beginning of the hugetlbfs huge page after the page fault
    under heavy cache contention.
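
    As a reminder of when hugetlb_cow() runs, here is a minimal hypothetical
    sketch (2MB huge page size assumed) in which a single-byte write in the
    child after fork() forces the whole huge page to be copied; the faulting
    address is the "target subpage" this patch passes down:

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define HPAGE (2UL << 20)

    int main(void)
    {
        char *p = mmap(NULL, HPAGE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        memset(p, 1, HPAGE);           /* parent populates the huge page */

        if (fork() == 0) {
            /* Copy-on-write: the kernel copies all 2MB, and with this
             * patch it copies the subpage around &p[HPAGE - 64] last. */
            p[HPAGE - 64] = 2;
            _exit(0);
        }
        wait(NULL);
        return 0;
    }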

    Link: http://lkml.kernel.org/r/20180524005851.4079-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Mike Kravetz
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Andi Kleen
    Cc: Jan Kara
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Christopher Lameter
    Cc: "Aneesh Kumar K.V"
    Cc: Punit Agrawal
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • To take better advantage of the general huge page copying optimization,
    the target subpage address will be passed to hugetlb_cow(), then to
    copy_user_huge_page(). So we will use both the target subpage address
    and the huge page size aligned address in hugetlb_cow(). To distinguish
    between them, "haddr" is used for the huge page size aligned address, to
    be consistent with the Transparent Huge Page naming convention.

    Now, only the huge page size aligned address is used in hugetlb_cow(),
    so "address" is renamed to "haddr" in hugetlb_cow() in this patch. The
    next patch will use the target subpage address in hugetlb_cow() too.

    The patch is just code cleanup without any functionality changes.

    Link: http://lkml.kernel.org/r/20180524005851.4079-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-by: Mike Kravetz
    Suggested-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Andi Kleen
    Cc: Jan Kara
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Christopher Lameter
    Cc: "Aneesh Kumar K.V"
    Cc: Punit Agrawal
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

03 Aug, 2018

1 commit

  • Commit 05ea88608d4e ("mm, hugetlbfs: introduce ->pagesize() to
    vm_operations_struct") adds a new ->pagesize() function to
    hugetlb_vm_ops, intended to cover all hugetlbfs backed files.

    With the System V shared memory model, if "huge page" is specified, the
    "shared memory" is backed by hugetlbfs files, but the mappings initiated
    via shmget/shmat have their original vm_ops overwritten with shm_vm_ops,
    so we need to add a ->pagesize function to shm_vm_ops. Otherwise,
    vma_kernel_pagesize() returns PAGE_SIZE given a hugetlbfs-backed vma,
    which trips the check at:

    fs/hugetlbfs/inode.c
    443 if (unlikely(page_mapped(page))) {
    444 BUG_ON(truncate_op);

    resulting in

    hugetlbfs: oracle (4592): Using mlock ulimits for SHM_HUGETLB is deprecated
    ------------[ cut here ]------------
    kernel BUG at fs/hugetlbfs/inode.c:444!
    Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 ...
    CPU: 35 PID: 5583 Comm: oracle_5583_sbt Not tainted 4.14.35-1829.el7uek.x86_64 #2
    RIP: 0010:remove_inode_hugepages+0x3db/0x3e2
    ....
    Call Trace:
    hugetlbfs_evict_inode+0x1e/0x3e
    evict+0xdb/0x1af
    iput+0x1a2/0x1f7
    dentry_unlink_inode+0xc6/0xf0
    __dentry_kill+0xd8/0x18d
    dput+0x1b5/0x1ed
    __fput+0x18b/0x216
    ____fput+0xe/0x10
    task_work_run+0x90/0xa7
    exit_to_usermode_loop+0xdd/0x116
    do_syscall_64+0x187/0x1ae
    entry_SYSCALL_64_after_hwframe+0x150/0x0
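
    For reference, a minimal hypothetical sketch of the shmget/shmat path that
    ends up with shm_vm_ops (one 2MB huge page; may require CAP_IPC_LOCK or
    membership in the vm.hugetlb_shm_group):

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #ifndef SHM_HUGETLB
    #define SHM_HUGETLB 04000          /* from <linux/shm.h> */
    #endif

    #define SZ (2UL << 20)

    int main(void)
    {
        int id = shmget(IPC_PRIVATE, SZ, IPC_CREAT | SHM_HUGETLB | 0600);
        if (id < 0) {
            perror("shmget");
            return 1;
        }
        char *p = shmat(id, NULL, 0);  /* this vma gets shm_vm_ops */
        if (p == (void *)-1) {
            perror("shmat");
        } else {
            memset(p, 0, SZ);          /* fault in the hugetlb-backed page */
            shmdt(p);
        }
        shmctl(id, IPC_RMID, NULL);
        return 0;
    }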

    [jane.chu@oracle.com: relocate comment]
    Link: http://lkml.kernel.org/r/20180731044831.26036-1-jane.chu@oracle.com
    Link: http://lkml.kernel.org/r/20180727211727.5020-1-jane.chu@oracle.com
    Fixes: 05ea88608d4e13 ("mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct")
    Signed-off-by: Jane Chu
    Suggested-by: Mike Kravetz
    Reviewed-by: Mike Kravetz
    Acked-by: Davidlohr Bueso
    Acked-by: Michal Hocko
    Cc: Dan Williams
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: Manfred Spraul
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jane Chu
     

04 Jul, 2018

1 commit

  • When booting with very large numbers of gigantic (i.e. 1G) pages, the
    operations in the loop of gather_bootmem_prealloc, and specifically
    prep_compound_gigantic_page, take a very long time and can cause a
    softlockup if enough pages are requested at boot.

    For example booting with 3844 1G pages requires prepping
    (set_compound_head, init the count) over 1 billion 4K tail pages, which
    takes considerable time.

    Add a cond_resched() to the outer loop in gather_bootmem_prealloc() to
    prevent this lockup.

    Tested: Booted with softlockup_panic=1 hugepagesz=1G hugepages=3844 and
    no softlockup is reported, and the hugepages are reported as
    successfully setup.

    Link: http://lkml.kernel.org/r/20180627214447.260804-1-cannonmatthews@google.com
    Signed-off-by: Cannon Matthews
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Cc: Greg Thelen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cannon Matthews
     

13 Jun, 2018

1 commit

  • The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
    patch replaces cases of:

    kmalloc(a * b, gfp)

    with:
    kmalloc_array(a, b, gfp)

    as well as handling cases of:

    kmalloc(a * b * c, gfp)

    with:

    kmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The tools/ directory was manually excluded, since it has its own
    implementation of kmalloc().

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kmalloc
    + kmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kmalloc(sizeof(THING) * C2, ...)
    |
    kmalloc(sizeof(TYPE) * C2, ...)
    |
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(C1 * C2, ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

08 Jun, 2018

2 commits

  • This is to take better advantage of the general huge page clearing
    optimization (commit c79b57e462b5: "mm: hugetlb: clear target sub-page
    last when clearing huge page") for hugetlbfs.

    In the general optimization patch, the sub-page to be accessed is
    cleared last, to avoid evicting its cache lines while clearing the other
    sub-pages. This works better if we have the address of the sub-page to
    be accessed, that is, the fault address inside the huge page, so the
    hugetlbfs no-page fault handler is changed to pass that information
    down. This will benefit workloads which don't access the beginning of
    the hugetlbfs huge page after the page fault under heavy contention on
    the shared last level cache.

    The patch is a generic optimization which should benefit quite some
    workloads, not for a specific use case. To demonstrate the performance
    benefit of the patch, we tested it with vm-scalability run on hugetlbfs.

    With this patch, the throughput increases ~28.1% in vm-scalability
    anon-w-seq test case with 88 processes on a 2 socket Xeon E5 2699 v4
    system (44 cores, 88 threads). The test case creates 88 processes, each
    process mmaps a big anonymous memory area with MAP_HUGETLB and writes
    to it from the end to the beginning. For each process, the other
    processes can be seen as another workload generating heavy cache
    pressure. At the same time, the cache miss rate dropped from ~36.3% to
    ~25.6%, the IPC (instructions per cycle) increased from 0.3 to 0.37, and
    the time spent in user space was reduced by ~19.3%.
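
    A stripped-down, hypothetical sketch of that access pattern for a single
    process (sizes are illustrative and assume enough 2MB huge pages are
    reserved) is:

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    #define HPAGE (2UL << 20)
    #define AREA  (64 * HPAGE)         /* 128MB of anonymous huge pages */

    int main(void)
    {
        char *p = mmap(NULL, AREA, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        /* Write from the end toward the beginning; each first touch of a
         * huge page goes through hugetlb_no_page(), which with this patch
         * clears the faulted (target) subpage last. */
        for (size_t off = AREA; off >= 4096; off -= 4096)
            p[off - 4096] = 1;

        munmap(p, AREA);
        return 0;
    }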

    Link: http://lkml.kernel.org/r/20180517083539.9242-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Mike Kravetz
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Andi Kleen
    Cc: Jan Kara
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Christopher Lameter
    Cc: "Aneesh Kumar K.V"
    Cc: Punit Agrawal
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Use new return type vm_fault_t for fault handler in struct
    vm_operations_struct. For now, this is just documenting that the
    function returns a VM_FAULT value rather than an errno. Once all
    instances are converted, vm_fault_t will become a distinct type.

    See commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    Link: http://lkml.kernel.org/r/20180512063745.GA26866@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Joe Perches
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dan Williams
    Cc: David Rientjes
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

17 Apr, 2018

2 commits

  • Mike Rapoport says:

    These patches convert files in Documentation/vm to ReST format, add an
    initial index and link it to the top level documentation.

    There are no content changes in the documentation, except for a few
    spelling fixes. The relatively large diffstat stems from the indentation and
    paragraph wrapping changes.

    I've tried to keep the formatting as consistent as possible, but I could
    miss some places that needed markup and add some markup where it was not
    necessary.

    [jc: significant conflicts in vm/hmm.rst]

    Jonathan Corbet
     
  • Signed-off-by: Mike Rapoport
    Signed-off-by: Jonathan Corbet

    Mike Rapoport
     

06 Apr, 2018

2 commits

  • When device-dax is operating in huge-page mode we want it to behave like
    hugetlbfs and report the MMU page mapping size that is being enforced by
    the vma.

    Similar to commit 31383c6865a5 "mm, hugetlbfs: introduce ->split() to
    vm_operations_struct" it would be messy to teach vma_mmu_pagesize()
    about device-dax page mapping sizes in the same (hstate) way that
    hugetlbfs communicates this attribute. Instead, these patches introduce
    a new ->pagesize() vm operation.

    Link: http://lkml.kernel.org/r/151996254734.27922.15813097401404359642.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Jane Chu
    Reviewed-by: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm, smaps: MMUPageSize for device-dax", v3.

    Similar to commit 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to
    vm_operations_struct") here is another occasion where we want
    special-case hugetlbfs/hstate enabling to also apply to device-dax.

    This prompts the question of what other hstate conversions we might do
    beyond ->split() and ->pagesize(), but this appears to be the last of
    the usages of hstate_vma() in generic/non-hugetlbfs specific code paths.

    This patch (of 3):

    The current powerpc definition of vma_mmu_pagesize() open codes looking
    up the page size via hstate. It is identical to the generic
    vma_kernel_pagesize() implementation.

    Now, vma_kernel_pagesize() is growing support for determining the page
    size of Device-DAX vmas in addition to the existing Hugetlbfs page size
    determination.

    Ideally, if the powerpc vma_mmu_pagesize() used vma_kernel_pagesize() it
    would automatically benefit from any new vma-type support that is added
    to vma_kernel_pagesize(). However, the powerpc vma_mmu_pagesize() is
    prevented from calling vma_kernel_pagesize() due to a circular header
    dependency that requires vma_mmu_pagesize() to be defined before
    including <linux/hugetlb.h>.

    Break this circular dependency by defining the default vma_mmu_pagesize()
    as a __weak symbol to be overridden by the powerpc version.
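
    The __weak mechanism itself is ordinary GCC/Clang weak linkage; a tiny
    userspace sketch of the pattern (the function name is made up, not the
    kernel's) is:

    #include <stdio.h>

    /* Generic fallback, analogous to the __weak vma_mmu_pagesize() default. */
    __attribute__((weak)) unsigned long pagesize_demo(void)
    {
        return 4096;
    }

    /* An "arch" file would simply provide a normal (strong) definition of
     * pagesize_demo(); when linked in, it silently replaces the weak default
     * with no #ifdefs needed, which is how powerpc can override the generic
     * vma_mmu_pagesize(). */

    int main(void)
    {
        printf("pagesize: %lu\n", pagesize_demo());
        return 0;
    }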

    Link: http://lkml.kernel.org/r/151996254179.27922.2213728278535578744.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Jane Chu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

23 Mar, 2018

1 commit

  • A vma with vm_pgoff large enough to overflow a loff_t type when
    converted to a byte offset can be passed via the remap_file_pages system
    call. The hugetlbfs mmap routine uses the byte offset to calculate
    reservations and file size.

    A sequence such as:

    mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
    remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);

    will result in the following when task exits/file closed,

    kernel BUG at mm/hugetlb.c:749!
    Call Trace:
    hugetlbfs_evict_inode+0x2f/0x40
    evict+0xcb/0x190
    __dentry_kill+0xcb/0x150
    __fput+0x164/0x1e0
    task_work_run+0x84/0xa0
    exit_to_usermode_loop+0x7d/0x80
    do_syscall_64+0x18b/0x190
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The overflowed pgoff value causes hugetlbfs to try to set up a mapping
    with a negative range (end < start) that leaves invalid state which
    causes the BUG.

    The previous overflow fix to this code was incomplete and did not take
    the remap_file_pages system call into account.

    [mike.kravetz@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
    [akpm@linux-foundation.org: include mmdebug.h]
    [akpm@linux-foundation.org: fix -ve left shift count on sh]
    Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
    Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
    Signed-off-by: Mike Kravetz
    Reported-by: Nic Losby
    Acked-by: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Yisheng Xie
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

10 Mar, 2018

1 commit

  • Dan Rue has noticed that the libhugetlbfs test suite fails the counters test:

    # mount_point="/mnt/hugetlb/"
    # echo 200 > /proc/sys/vm/nr_hugepages
    # mkdir -p "${mount_point}"
    # mount -t hugetlbfs hugetlbfs "${mount_point}"
    # export LD_LIBRARY_PATH=/root/libhugetlbfs/libhugetlbfs-2.20/obj64
    # /root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters
    Starting testcase "/root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters", pid 3319
    Base pool size: 0
    Clean...
    FAIL Line 326: Bad HugePages_Total: expected 0, actual 1

    The bug was bisected to 0c397daea1d4 ("mm, hugetlb: further simplify
    hugetlb allocation API").

    The reason is that alloc_surplus_huge_page() misaccounts per-node
    surplus pages. We should increase surplus_huge_pages_node rather than
    nr_huge_pages_node, which is already handled by alloc_fresh_huge_page.

    Link: http://lkml.kernel.org/r/20180221191439.GM2231@dhcp22.suse.cz
    Fixes: 0c397daea1d4 ("mm, hugetlb: further simplify hugetlb allocation API")
    Signed-off-by: Michal Hocko
    Reported-by: Dan Rue
    Tested-by: Dan Rue
    Reviewed-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Feb, 2018

9 commits

  • Dan Carpenter has noticed that mbind migration callback (new_page) can
    get a NULL vma pointer and choke on it inside alloc_huge_page_vma which
    relies on the VMA to get the hstate. We used to BUG_ON this case, but
    the BUG_ON has been removed recently by "hugetlb, mempolicy: fix the
    mbind hugetlb migration".

    The proper way to handle this is to get the hstate from the migrated
    page and rely on huge_node (resp. get_vma_policy) to do the right thing
    with a NULL VMA. We are currently falling back to the default mempolicy
    in that case, which is in line with what the THP path does here.

    Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Dan Carpenter
    Cc: Naoya Horiguchi
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • do_mbind migration code relies on alloc_huge_page_noerr for hugetlb
    pages. alloc_huge_page_noerr uses alloc_huge_page, which is a high-level
    allocation function that has to take care of reserves, overcommit and
    hugetlb cgroup accounting. None of that is really required for page
    migration because the new page is only temporary and will either replace
    the original page or be dropped. This is essentially the same as for the
    other migration call paths, and there shouldn't be any reason to handle
    mbind in a special way.

    The current implementation is even suboptimal because the migration
    might fail just because the hugetlb cgroup limit is reached, or the
    overcommit is saturated.

    Fix this by making mbind like other hugetlb migration paths. Add a new
    migration helper alloc_huge_page_vma as a wrapper around
    alloc_huge_page_nodemask with additional mempolicy handling.

    alloc_huge_page_noerr has no more users and it can go.

    Link: http://lkml.kernel.org/r/20180103093213.26329-7-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Hugetlb allocator has several layers of allocation functions, depending
    on the purpose of the allocation. There are two allocators, depending
    on whether the page can be allocated from the page allocator or we need
    a contiguous allocator. This is currently opencoded in
    alloc_fresh_huge_page, which is the only path that might allocate giga
    pages, which require the latter allocator. Create alloc_fresh_huge_page
    which hides this implementation detail and use it in all callers which
    hardcoded the buddy allocator path (__hugetlb_alloc_buddy_huge_page).
    This shouldn't introduce any functional change because both migration and
    surplus allocators exclude giga pages explicitly.

    While we are at it let's do some renaming. The current scheme is not
    consistent and overly painful to read and understand. Get rid of the
    prefix underscores from most functions. There is no real reason to make
    names longer.

    * alloc_fresh_huge_page is the new layer to abstract underlying
    allocator
    * __hugetlb_alloc_buddy_huge_page becomes shorter and neater
    alloc_buddy_huge_page.
    * Former alloc_fresh_huge_page becomes alloc_pool_huge_page because we put
    the new page directly to the pool
    * alloc_surplus_huge_page can drop the opencoded prep_new_huge_page code
    as it uses alloc_fresh_huge_page now
    * others lose their excessive prefix underscores to make names shorter

    [dan.carpenter@oracle.com: fix double unlock bug in alloc_surplus_huge_page()]
    Link: http://lkml.kernel.org/r/20180109200559.g3iz5kvbdrz7yydp@mwanda
    Link: http://lkml.kernel.org/r/20180103093213.26329-6-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • alloc_surplus_huge_page increases the pool size and the number of
    surplus pages opportunistically to prevent races with the pool size
    change. See commit d1c3fb1f8f29 ("hugetlb: introduce
    nr_overcommit_hugepages sysctl") for more details.

    The resulting code is unnecessarily hairy, causes code duplication and
    doesn't allow sharing the allocation paths. Moreover, pool size changes
    tend to be very seldom, so optimizing for them is not really reasonable.
    Simplify the code and allow allocating a fresh surplus page as long as
    we are under the overcommit limit, and then recheck the condition after
    the allocation and drop the new page if the situation has changed. This
    should provide a reasonable guarantee that abrupt allocation requests
    will not go way over the limit.

    If we consider races with the pool shrinking and enlarging then we
    should be reasonably safe as well. In the first case we are off by one
    in the worst case and the second case should work OK because the page is
    not yet visible. We can waste CPU cycles for the allocation but that
    should be acceptable for a relatively rare condition.

    Link: http://lkml.kernel.org/r/20180103093213.26329-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • hugepage migration relies on __alloc_buddy_huge_page to get a new page.
    This has 2 main disadvantages.

    1) it doesn't allow migrating any huge page if the pool is used
    completely, which is not an exceptional case as the pool is static and
    unused memory is just wasted.

    2) it leads to a weird semantic where migration between two NUMA nodes
    might increase the pool size of the destination NUMA node while the
    page is in use. The issue is caused by per-NUMA-node surplus page
    tracking (see free_huge_page).

    Address both issues by changing the way we allocate and account pages
    allocated for migration. Those should be temporary by definition. So
    we mark them that way (we will abuse page flags in the 3rd page) and
    update free_huge_page to free such pages to the page allocator. The page
    migration path then just transfers the temporary status from the new
    page to the old one, which will be freed on the last reference. The
    global surplus count will never change during this path, but we still
    have to be careful when migrating a per-node surplus page. This is now
    handled in move_hugetlb_state, which is called from the migration path;
    it copies the hugetlb-specific page state and fixes up the accounting
    when needed.
    Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
    reflect its purpose. The new allocation routine for the migration path
    is __alloc_migrate_huge_page.

    The user-visible effect of this patch is that migrated pages are really
    temporary and they travel between NUMA nodes as per the migration
    request:

    Before migration
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

    After
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

    with the previous implementation, both nodes would have nr_hugepages:1
    until the page is freed.

    Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Gigantic hugetlb pages grew into the hugetlb code as an alien species
    with a lot of special casing. The allocation path is not an
    exception. Unnecessarily so to be honest. It is true that the
    underlying allocator is different but that is an implementation detail.

    This patch unifies the hugetlb allocation path that prepares fresh
    pool pages. alloc_fresh_gigantic_page basically copies
    alloc_fresh_huge_page logic so we can move everything there. This will
    simplify set_max_huge_pages which doesn't have to care about what kind
    of huge page we allocate.

    Link: http://lkml.kernel.org/r/20180103093213.26329-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "mm, hugetlb: allocation API and migration improvements"

    Motivation:

    this is a follow up for [3] for the allocation API and [4] for the
    hugetlb migration. It wasn't really easy to split those into two
    separate patch series as they share some code.

    My primary motivation to touch this code is to make gigantic page
    migration work. The giga pages allocation code is just too fragile and
    hacked into the hugetlb code now. This series tries to move giga pages
    closer to being a first class citizen. We are not there yet but having
    5 patches is quite a lot already and it will already make the
    code much easier to follow. I will come with other changes on top after
    this sees some review.

    The first two patches should be trivial to review. The third patch
    changes the way we migrate huge pages. Newly allocated pages are subject
    to the overcommit check and they participate in surplus accounting,
    which is quite unfortunate, as the changelog explains. This patch
    doesn't change anything wrt. giga pages.

    Patch #4 removes the surplus accounting hack from
    __alloc_surplus_huge_page. I hope I didn't miss anything there and a
    deeper review is really due there.

    Patch #5 finally unifies allocation paths and giga pages shouldn't be
    any special anymore. There is also some renaming going on as well.

    This patch (of 6):

    hugetlb allocator has two entry points to the page allocator
    - alloc_fresh_huge_page_node
    - __hugetlb_alloc_buddy_huge_page

    The two differ very subtly in two aspects. The first one doesn't care
    about HTLB_BUDDY_* stats and it doesn't initialize the huge page.
    prep_new_huge_page is not used because it not only initializes
    hugetlb-specific state but also calls put_page, which releases the page
    to the hugetlb pool and is not what is required in some contexts. This
    makes things more complicated than necessary.

    Simplify things by a) removing the page allocator entry point
    duplication and keeping only __hugetlb_alloc_buddy_huge_page and b) making
    prep_new_huge_page more reusable by removing the put_page which moves
    the page to the allocator pool. All current callers are updated to call
    put_page explicitly. Later patches will add new callers which won't
    need it.

    This patch shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20180103093213.26329-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • hugepages_treat_as_movable has been introduced by 396faf0303d2 ("Allow
    huge page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb
    allocations from ZONE_MOVABLE even when hugetlb pages were not
    migrateable. The purpose of the movable zone was different at the time.
    It aimed at reducing memory fragmentation, and hugetlb pages, being
    long lived and large, were not contributing to the fragmentation, so it
    was acceptable to use the zone back then.

    Things have changed though, and the primary purpose of the zone became
    the migratability guarantee. If we allow non-migrateable hugetlb pages
    to be in ZONE_MOVABLE, memory hotplug might fail to offline the memory.

    Remove the knob and only rely on hugepage_migration_supported to allow
    movable zones.

    Mel said:

    : Primarily it was aimed at allowing the hugetlb pool to safely shrink with
    : the ability to grow it again. The use case was for batched jobs, some of
    : which needed huge pages and others that did not but didn't want the memory
    : uselessly pinned in the huge pages pool.
    :
    : I suspect that more users rely on THP than hugetlbfs for flexible use of
    : huge pages with fallback options so I think that removing the option
    : should be ok.

    Link: http://lkml.kernel.org/r/20171003072619.8654-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Alexandru Moise
    Acked-by: Mel Gorman
    Cc: Alexandru Moise
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently we display some hugepage statistics (total, free, etc) in
    /proc/meminfo, but only for the default hugepage size (e.g. 2Mb).

    If hugepages of different sizes are used (like 2Mb and 1Gb on x86-64),
    /proc/meminfo output can be confusing, as non-default sized hugepages
    are not reflected at all, and there is no sign that they exist and are
    consuming system memory.

    To solve this problem, let's display the total amount of memory
    consumed by hugetlb pages of all sizes (both free and used). Let's
    call it "Hugetlb", and display the size in kB to match the generic
    /proc/meminfo style.

    For example (1024 2Mb pages and two 1Gb pages are pre-allocated):
    $ cat /proc/meminfo
    MemTotal: 8168984 kB
    MemFree: 3789276 kB

    CmaFree: 0 kB
    HugePages_Total: 1024
    HugePages_Free: 1024
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    Hugetlb: 4194304 kB
    DirectMap4k: 32632 kB
    DirectMap2M: 4161536 kB
    DirectMap1G: 6291456 kB

    Also, this patch updates corresponding docs to reflect Hugetlb entry
    meaning and difference between Hugetlb and HugePages_Total * Hugepagesize.
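
    A small userspace check of the new counter (hedged example, assuming a
    kernel with this patch applied; not part of the patch itself): it reads
    /proc/meminfo and compares "Hugetlb" against HugePages_Total *
    Hugepagesize, which only covers the default size. With the
    pre-allocation above, the two differ by the 2GB sitting in 1Gb pages.

        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/proc/meminfo", "r");
                char line[256];
                unsigned long hugetlb_kb = 0, total = 0, pagesize_kb = 0;

                if (!f) {
                        perror("/proc/meminfo");
                        return 1;
                }
                while (fgets(line, sizeof(line), f)) {
                        /* each sscanf only succeeds on its own line */
                        sscanf(line, "Hugetlb: %lu kB", &hugetlb_kb);
                        sscanf(line, "HugePages_Total: %lu", &total);
                        sscanf(line, "Hugepagesize: %lu kB", &pagesize_kb);
                }
                fclose(f);

                printf("Hugetlb (all sizes):            %lu kB\n", hugetlb_kb);
                printf("HugePages_Total * Hugepagesize: %lu kB\n",
                       total * pagesize_kb);
                return 0;
        }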

    Link: http://lkml.kernel.org/r/20171115231409.12131-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Cc: Mike Kravetz
    Cc: "Aneesh Kumar K.V"
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

30 Nov, 2017

2 commits

  • I made a mistake while converting the hugetlb code to 5-level paging: in
    huge_pte_alloc() we have to use p4d_alloc(), not p4d_offset().

    Otherwise it leads to a crash -- a NULL-pointer dereference in pud_alloc()
    if the p4d table is not yet allocated.

    It can only happen in 5-level paging mode. In 4-level paging mode
    p4d_offset() always returns the pgd, so we are fine.
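
    The fixed part of huge_pte_alloc() looks roughly like this (sketch;
    the pud/pmd handling below this point is unchanged and omitted):

        pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
                              unsigned long sz)
        {
                pgd_t *pgd;
                p4d_t *p4d;
                pud_t *pud;

                pgd = pgd_offset(mm, addr);
                /*
                 * p4d_alloc() allocates the p4d table if it does not exist
                 * yet; p4d_offset() would only compute a pointer into a
                 * possibly missing table, which is what made pud_alloc()
                 * dereference NULL on 5-level machines.
                 */
                p4d = p4d_alloc(mm, pgd, addr);
                if (!p4d)
                        return NULL;

                pud = pud_alloc(mm, p4d, addr);
                /* ... rest of the function unchanged ... */
        }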

    Link: http://lkml.kernel.org/r/20171122121921.64822-1-kirill.shutemov@linux.intel.com
    Fixes: c2febafc6773 ("mm: convert generic code to 5-level paging")
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: [4.11+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Patch series "device-dax: fix unaligned munmap handling"

    When device-dax is operating in huge-page mode we want it to behave like
    hugetlbfs and fail attempts to split vmas into unaligned ranges. It
    would be messy to teach the munmap path about device-dax alignment
    constraints in the same (hstate) way that hugetlbfs communicates this
    constraint. Instead, these patches introduce a new ->split() vm
    operation.

    This patch (of 2):

    The device-dax interface has constraints similar to hugetlbfs in that
    it requires the munmap path to unmap in huge page aligned units.
    Rather than add more custom vma handling code in __split_vma(),
    introduce a new vm operation to perform this vma specific check.
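
    Sketched, the new hook amounts to something like the following (names
    and the way the device alignment is looked up are abbreviated, not
    copied from the patch):

        /* in __split_vma(), before an unaligned split is performed */
        if (vma->vm_ops && vma->vm_ops->split) {
                err = vma->vm_ops->split(vma, addr);
                if (err)
                        return err;
        }

        /* device-dax then supplies the check in its vm_operations_struct */
        static int dev_dax_split(struct vm_area_struct *vma, unsigned long addr)
        {
                unsigned long align = dev_dax_alignment(vma);   /* hypothetical helper */

                /* refuse splits that are not huge page aligned */
                if (!IS_ALIGNED(addr, align))
                        return -EINVAL;
                return 0;
        }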

    Link: http://lkml.kernel.org/r/151130418135.4029.6783191281930729710.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
    Signed-off-by: Dan Williams
    Cc: Jeff Moyer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

16 Nov, 2017

1 commit

  • This patch only affects users of the mmu_notifier->invalidate_range
    callback, which are device drivers related to ATS/PASID, CAPI, IOMMUv2,
    SVM, etc., and it is an optimization for those users. Everyone else is
    unaffected by it.

    When clearing a pte/pmd we are given a choice to notify the event under
    the page table lock (the notify versions of the *_clear_flush helpers
    do call mmu_notifier_invalidate_range). But that notification is not
    necessary in all cases.

    This patch removes almost all cases where it is useless to have a call
    to mmu_notifier_invalidate_range before
    mmu_notifier_invalidate_range_end. It also adds documentation in all
    those cases explaining why.

    Below is a more in-depth analysis of why it is fine to do this:

    For a secondary TLB (non-CPU TLB) like an IOMMU TLB or a device TLB
    (when a device uses something like ATS/PASID to get the IOMMU to walk
    the CPU page table to access a process virtual address space), there
    are only 2 cases when you need to notify those secondary TLBs while
    holding the page table lock when clearing a pte/pmd:

    A) the page backing the address is freed before mmu_notifier_invalidate_range_end
    B) a page table entry is updated to point to a new page (COW, write fault
    on zero page, __replace_page(), ...)

    Case A is obvious: you do not want to take the risk of the device writing
    to a page that might now be used by something completely different.

    Case B is more subtle. For correctness it requires the following sequence
    to happen:
    - take page table lock
    - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
    - set page table entry to point to new page

    If clearing the page table entry is not followed by a notify before
    setting the new pte/pmd value, then you can break the memory model
    (e.g. C11 or C++11) for the device.
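
    In kernel terms the required ordering for case B looks roughly like the
    following fragment (illustrative sketch, not a literal hunk from this
    patch):

        ptl = pte_lockptr(mm, pmd);
        spin_lock(ptl);

        /* clear the old entry AND notify the secondary TLB under the lock */
        old_pte = ptep_clear_flush_notify(vma, addr, ptep);

        /* only now install the entry pointing to the new page */
        set_pte_at(mm, addr, ptep, mk_pte(new_page, vma->vm_page_prot));

        spin_unlock(ptl);

        /* the range-wide notification may happen after the lock is dropped */
        mmu_notifier_invalidate_range_end(mm, start, end);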

    Consider the following scenario (the device uses a feature similar to
    ATS/PASID):

    Two addresses, addrA and addrB, such that |addrA - addrB| >= PAGE_SIZE;
    we assume they are write protected for COW (the other cases of B apply
    too).

    [Time N] -----------------------------------------------------------------
    CPU-thread-0 {try to write to addrA}
    CPU-thread-1 {try to write to addrB}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {read addrA and populate device TLB}
    DEV-thread-2 {read addrB and populate device TLB}
    [Time N+1] ---------------------------------------------------------------
    CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
    CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+2] ---------------------------------------------------------------
    CPU-thread-0 {COW_step1: {update page table point to new page for addrA}}
    CPU-thread-1 {COW_step1: {update page table point to new page for addrB}}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+3] ---------------------------------------------------------------
    CPU-thread-0 {preempted}
    CPU-thread-1 {preempted}
    CPU-thread-2 {write to addrA which is a write to new page}
    CPU-thread-3 {}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+4] ---------------------------------------------------------------
    CPU-thread-0 {preempted}
    CPU-thread-1 {preempted}
    CPU-thread-2 {}
    CPU-thread-3 {write to addrB which is a write to new page}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+5] ---------------------------------------------------------------
    CPU-thread-0 {preempted}
    CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+6] ---------------------------------------------------------------
    CPU-thread-0 {preempted}
    CPU-thread-1 {}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {read addrA from old page}
    DEV-thread-2 {read addrB from new page}

    So here, because at time N+2 the clearing of the page table entry was not
    paired with a notification to invalidate the secondary TLB, the device
    sees the new value for addrB before seeing the new value for addrA.
    This breaks total memory ordering for the device.

    When changing a pte to write protect, or to point to a new write
    protected page with the same content (KSM), it is OK to delay the
    invalidate_range callback to mmu_notifier_invalidate_range_end()
    outside the page table lock. This is true even if the thread doing the
    page table update is preempted right after releasing the page table
    lock, before calling mmu_notifier_invalidate_range_end().

    Thanks to Andrea for thinking of a problematic scenario for COW.

    [jglisse@redhat.com: v2]
    Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Andrea Arcangeli
    Cc: Nadav Amit
    Cc: Joerg Roedel
    Cc: Suravee Suthikulpanit
    Cc: David Woodhouse
    Cc: Alistair Popple
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Stephen Rothwell
    Cc: Andrew Donnellan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

03 Nov, 2017

1 commit

  • This oops:

    kernel BUG at fs/hugetlbfs/inode.c:484!
    RIP: remove_inode_hugepages+0x3d0/0x410
    Call Trace:
    hugetlbfs_setattr+0xd9/0x130
    notify_change+0x292/0x410
    do_truncate+0x65/0xa0
    do_sys_ftruncate.constprop.3+0x11a/0x180
    SyS_ftruncate+0xe/0x10
    tracesys+0xd9/0xde

    was caused by the lack of an i_size check in hugetlb_mcopy_atomic_pte.

    mmap() can still succeed beyond the end of i_size after vmtruncate has
    zapped the vmas in those ranges, but the faults must not succeed, and
    that includes UFFDIO_COPY.

    We could differentiate the retval to userland to represent a SIGBUS like
    a page fault would do (vs SIGSEGV), but it doesn't seem very useful, and
    we'd need to pick a random retval as there's no meaningful syscall
    retval that would differentiate between SIGSEGV and SIGBUS; there's just
    -EFAULT.
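
    The fix boils down to a bounds check in hugetlb_mcopy_atomic_pte()
    along these lines (sketch; the label name and the surrounding locking
    are abbreviated):

        /* the fault must not succeed past the hugetlbfs inode size */
        size = i_size_read(mapping->host) >> huge_page_shift(h);
        ret = -EFAULT;
        if (idx >= size)
                goto out_release;       /* error path name abbreviated */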

    Link: http://lkml.kernel.org/r/20171016223914.2421-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Kravetz
    Cc: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

08 Sep, 2017

1 commit

  • Pull powerpc updates from Michael Ellerman:
    "Nothing really major this release, despite quite a lot of activity.
    Just lots of things all over the place.

    Some things of note include:

    - Access via perf to a new type of PMU (IMC) on Power9, which can
    count both core events as well as nest unit events (Memory
    controller etc).

    - Optimisations to the radix MMU TLB flushing, mostly to avoid
    unnecessary Page Walk Cache (PWC) flushes when the structure of the
    tree is not changing.

    - Reworks/cleanups of do_page_fault() to modernise it and bring it
    closer to other architectures where possible.

    - Rework of our page table walking so that THP updates only need to
    send IPIs to CPUs where the affected mm has run, rather than all
    CPUs.

    - The size of our vmalloc area is increased to 56T on 64-bit hash MMU
    systems. This avoids problems with the percpu allocator on systems
    with very sparse NUMA layouts.

    - STRICT_KERNEL_RWX support on PPC32.

    - A new sched domain topology for Power9, to capture the fact that
    pairs of cores may share an L2 cache.

    - Power9 support for VAS, which is a new mechanism for accessing
    coprocessors, and initial support for using it with the NX
    compression accelerator.

    - Major work on the instruction emulation support, adding support for
    many new instructions, and reworking it so it can be used to
    implement the emulation needed to fixup alignment faults.

    - Support for guests under PowerVM to use the Power9 XIVE interrupt
    controller.

    And probably that many things again that are almost as interesting,
    but I had to keep the list short. Plus the usual fixes and cleanups as
    always.

    Thanks to: Alexey Kardashevskiy, Alistair Popple, Andreas Schwab,
    Aneesh Kumar K.V, Anju T Sudhakar, Arvind Yadav, Balbir Singh,
    Benjamin Herrenschmidt, Bhumika Goyal, Breno Leitao, Bryant G. Ly,
    Christophe Leroy, Cédric Le Goater, Dan Carpenter, Dou Liyang,
    Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Hannes
    Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall,
    LABBE Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring,
    Masahiro Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo,
    Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
    Paul Mackerras, Rashmica Gupta, Rob Herring, Rui Teng, Sam Bobroff,
    Santosh Sivaraj, Scott Wood, Shilpasri G Bhat, Sukadev Bhattiprolu,
    Suraj Jitindar Singh, Tobin C. Harding, Victor Aoqui"

    * tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (321 commits)
    powerpc/xive: Fix section __init warning
    powerpc: Fix kernel crash in emulation of vector loads and stores
    powerpc/xive: improve debugging macros
    powerpc/xive: add XIVE Exploitation Mode to CAS
    powerpc/xive: introduce H_INT_ESB hcall
    powerpc/xive: add the HW IRQ number under xive_irq_data
    powerpc/xive: introduce xive_esb_write()
    powerpc/xive: rename xive_poke_esb() in xive_esb_read()
    powerpc/xive: guest exploitation of the XIVE interrupt controller
    powerpc/xive: introduce a common routine xive_queue_page_alloc()
    powerpc/sstep: Avoid used uninitialized error
    axonram: Return directly after a failed kzalloc() in axon_ram_probe()
    axonram: Improve a size determination in axon_ram_probe()
    axonram: Delete an error message for a failed memory allocation in axon_ram_probe()
    powerpc/powernv/npu: Move tlb flush before launching ATSD
    powerpc/macintosh: constify wf_sensor_ops structures
    powerpc/iommu: Use permission-specific DEVICE_ATTR variants
    powerpc/eeh: Delete an error out of memory message at init time
    powerpc/mm: Use seq_putc() in two functions
    macintosh: Convert to using %pOF instead of full_name
    ...

    Linus Torvalds
     

07 Sep, 2017

2 commits

  • alloc_gigantic_page doesn't consider the movability of the gigantic
    hugetlb page when scanning eligible ranges for the allocation. As 1GB
    hugetlb pages are currently not movable, this can break the movable
    zone assumption that all allocations are migratable, and as such break
    memory hotplug.

    Reorganize the code and use the standard zonelist allocation scheme
    that we use for standard hugetlb pages. htlb_alloc_mask will ensure
    that only migratable hugetlb pages will ever see a movable zone.
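
    Sketch of the reorganized allocation loop (the pfn range scanning and
    the alloc_contig_range() call inside each zone are elided):

        gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
        struct zonelist *zonelist = node_zonelist(nid, gfp_mask);
        struct zoneref *z;
        struct zone *zone;

        /*
         * Only zones compatible with gfp_mask are walked, so a
         * non-migratable gigantic page is never carved out of
         * ZONE_MOVABLE.
         */
        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        gfp_zone(gfp_mask), NULL) {
                /* ... scan the zone for a free, properly aligned pfn
                 * range and try alloc_contig_range() on it ... */
        }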

    Link: http://lkml.kernel.org/r/20170803083549.21407-1-mhocko@kernel.org
    Fixes: 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime")
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Cc: Luiz Capitulino
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • attribute_group structures are not supposed to change at runtime. All
    functions working with attribute_group provided by <linux/sysfs.h> work
    with const attribute_group. So mark the non-const structs as const.
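
    For the hugetlb sysfs code the change has this shape (sketch):

        /* before */
        static struct attribute_group hstate_attr_group = {
                .attrs = hstate_attrs,
        };

        /* after: the sysfs API only needs a const pointer */
        static const struct attribute_group hstate_attr_group = {
                .attrs = hstate_attrs,
        };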

    Link: http://lkml.kernel.org/r/1501157260-3922-1-git-send-email-arvind.yadav.cs@gmail.com
    Signed-off-by: Arvind Yadav
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arvind Yadav